diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md index cf604822d6eea..95729f845ff5c 100644 --- a/.github/CONTRIBUTING.md +++ b/.github/CONTRIBUTING.md @@ -1,515 +1,24 @@ Contributing to pandas ====================== -Where to start? ---------------- - -All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome. - -If you are simply looking to start working with the *pandas* codebase, navigate to the [GitHub "issues" tab](https://github.com/pydata/pandas/issues) and start looking through interesting issues. There are a number of issues listed under [Docs](https://github.com/pydata/pandas/issues?labels=Docs&sort=updated&state=open) and [Difficulty Novice](https://github.com/pydata/pandas/issues?q=is%3Aopen+is%3Aissue+label%3A%22Difficulty+Novice%22) where you could start out. - -Or maybe through using *pandas* you have an idea of you own or are looking for something in the documentation and thinking 'this can be improved'...you can do something about it! - -Feel free to ask questions on the [mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata) or on [Gitter](https://gitter.im/pydata/pandas). - -Bug reports and enhancement requests ------------------------------------- - -Bug reports are an important part of making *pandas* more stable. Having a complete bug report will allow others to reproduce the bug and provide insight into fixing. Because many versions of *pandas* are supported, knowing version information will also identify improvements made since previous versions. Trying the bug-producing code out on the *master* branch is often a worthwhile exercise to confirm the bug still exists. It is also worth searching existing bug reports and pull requests to see if the issue has already been reported and/or fixed. - -Bug reports must: - -1. Include a short, self-contained Python snippet reproducing the problem. You can format the code nicely by using [GitHub Flavored Markdown](http://github.github.com/github-flavored-markdown/): - - ```python - >>> from pandas import DataFrame - >>> df = DataFrame(...) - ... - ``` - -2. Include the full version string of *pandas* and its dependencies. In versions of *pandas* after 0.12 you can use a built in function: - - >>> from pandas.util.print_versions import show_versions - >>> show_versions() - - and in *pandas* 0.13.1 onwards: - - >>> pd.show_versions() - -3. Explain why the current behavior is wrong/not desired and what you expect instead. - -The issue will then show up to the *pandas* community and be open to comments/ideas from others. - -Working with the code ---------------------- - -Now that you have an issue you want to fix, enhancement to add, or documentation to improve, you need to learn how to work with GitHub and the *pandas* code base. - -### Version control, Git, and GitHub - -To the new user, working with Git is one of the more daunting aspects of contributing to *pandas*. It can very quickly become overwhelming, but sticking to the guidelines below will help keep the process straightforward and mostly trouble free. As always, if you are having difficulties please feel free to ask for help. - -The code is hosted on [GitHub](https://www.github.com/pydata/pandas). To contribute you will need to sign up for a [free GitHub account](https://github.com/signup/free). We use [Git](http://git-scm.com/) for version control to allow many people to work together on the project. 
- -Some great resources for learning Git: - -- the [GitHub help pages](http://help.github.com/). -- the [NumPy's documentation](http://docs.scipy.org/doc/numpy/dev/index.html). -- Matthew Brett's [Pydagogue](http://matthew-brett.github.com/pydagogue/). - -### Getting started with Git - -[GitHub has instructions](http://help.github.com/set-up-git-redirect) for installing git, setting up your SSH key, and configuring git. All these steps need to be completed before you can work seamlessly between your local repository and GitHub. - -### Forking - -You will need your own fork to work on the code. Go to the [pandas project page](https://github.com/pydata/pandas) and hit the `Fork` button. You will want to clone your fork to your machine: - - git clone git@github.com:your-user-name/pandas.git pandas-yourname - cd pandas-yourname - git remote add upstream git://github.com/pydata/pandas.git - -This creates the directory pandas-yourname and connects your repository to the upstream (main project) *pandas* repository. - -The testing suite will run automatically on Travis-CI once your pull request is submitted. However, if you wish to run the test suite on a branch prior to submitting the pull request, then Travis-CI needs to be hooked up to your GitHub repository. Instructions for doing so are [here](http://about.travis-ci.org/docs/user/getting-started/). - -### Creating a branch - -You want your master branch to reflect only production-ready code, so create a feature branch for making your changes. For example: - - git branch shiny-new-feature - git checkout shiny-new-feature - -The above can be simplified to: - - git checkout -b shiny-new-feature - -This changes your working directory to the shiny-new-feature branch. Keep any changes in this branch specific to one bug or feature so it is clear what the branch brings to *pandas*. You can have many shiny-new-features and switch in between them using the git checkout command. - -To update this branch, you need to retrieve the changes from the master branch: - - git fetch upstream - git rebase upstream/master - -This will replay your commits on top of the lastest pandas git master. If this leads to merge conflicts, you must resolve these before submitting your pull request. If you have uncommitted changes, you will need to `stash` them prior to updating. This will effectively store your changes and they can be reapplied after updating. - -### Creating a development environment - -An easy way to create a *pandas* development environment is as follows. - -- Install either Anaconda <install.anaconda> or miniconda <install.miniconda> -- Make sure that you have cloned the repository <contributing.forking> -- `cd` to the *pandas* source directory - -Tell conda to create a new environment, named `pandas_dev`, or any other name you would like for this environment, by running: - - conda create -n pandas_dev --file ci/requirements_dev.txt - -For a python 3 environment: - - conda create -n pandas_dev python=3 --file ci/requirements_dev.txt - -If you are on Windows, then you will also need to install the compiler linkages: - - conda install -n pandas_dev libpython - -This will create the new environment, and not touch any of your existing environments, nor any existing python installation. It will install all of the basic dependencies of *pandas*, as well as the development and testing tools. 
If you would like to install other dependencies, you can install them as follows: - - conda install -n pandas_dev -c pandas pytables scipy - -To install *all* pandas dependencies you can do the following: - - conda install -n pandas_dev -c pandas --file ci/requirements_all.txt - -To work in this environment, Windows users should `activate` it as follows: - - activate pandas_dev - -Mac OSX and Linux users should use: - - source activate pandas_dev - -You will then see a confirmation message to indicate you are in the new development environment. - -To view your environments: - - conda info -e - -To return to you home root environment: - - deactivate - -See the full conda docs [here](http://conda.pydata.org/docs). - -At this point you can easily do an *in-place* install, as detailed in the next section. - -### Making changes - -Before making your code changes, it is often necessary to build the code that was just checked out. There are two primary methods of doing this. - -1. The best way to develop *pandas* is to build the C extensions in-place by running: - - python setup.py build_ext --inplace - - If you startup the Python interpreter in the *pandas* source directory you will call the built C extensions - -2. Another very common option is to do a `develop` install of *pandas*: - - python setup.py develop - - This makes a symbolic link that tells the Python interpreter to import *pandas* from your development directory. Thus, you can always be using the development version on your system without being inside the clone directory. - -Contributing to the documentation ---------------------------------- - -If you're not the developer type, contributing to the documentation is still of huge value. You don't even have to be an expert on *pandas* to do so! Something as simple as rewriting small passages for clarity as you reference the docs is a simple but effective way to contribute. The next person to read that passage will be in your debt! - -In fact, there are sections of the docs that are worse off after being written by experts. If something in the docs doesn't make sense to you, updating the relevant section after you figure it out is a simple way to ensure it will help the next person. - -### About the *pandas* documentation - -The documentation is written in **reStructuredText**, which is almost like writing in plain English, and built using [Sphinx](http://sphinx.pocoo.org/). The Sphinx Documentation has an excellent [introduction to reST](http://sphinx.pocoo.org/rest.html). Review the Sphinx docs to perform more complex changes to the documentation as well. - -Some other important things to know about the docs: - -- The *pandas* documentation consists of two parts: the docstrings in the code itself and the docs in this folder `pandas/doc/`. - - The docstrings provide a clear explanation of the usage of the individual functions, while the documentation in this folder consists of tutorial-like overviews per topic together with some other information (what's new, installation, etc). - -- The docstrings follow the **Numpy Docstring Standard**, which is used widely in the Scientific Python community. This standard specifies the format of the different sections of the docstring. See [this document](https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt) for a detailed explanation, or look at some of the existing functions to extend it in a similar manner. 
-- The tutorials make heavy use of the [ipython directive](http://matplotlib.org/sampledoc/ipython_directive.html) sphinx extension. This directive lets you put code in the documentation which will be run during the doc build. For example: - - .. ipython:: python - - x = 2 - x**3 - - will be rendered as: - - In [1]: x = 2 - - In [2]: x**3 - Out[2]: 8 - - Almost all code examples in the docs are run (and the output saved) during the doc build. This approach means that code examples will always be up to date, but it does make the doc building a bit more complex. - -> **note** -> -> The `.rst` files are used to automatically generate Markdown and HTML versions of the docs. For this reason, please do not edit `CONTRIBUTING.md` directly, but instead make any changes to `doc/source/contributing.rst`. Then, to generate `CONTRIBUTING.md`, use [pandoc](http://johnmacfarlane.net/pandoc/) with the following command: -> -> pandoc doc/source/contributing.rst -t markdown_github > CONTRIBUTING.md - -The utility script `scripts/api_rst_coverage.py` can be used to compare the list of methods documented in `doc/source/api.rst` (which is used to generate the [API Reference](http://pandas.pydata.org/pandas-docs/stable/api.html) page) and the actual public methods. This will identify methods documented in in `doc/source/api.rst` that are not actually class methods, and existing methods that are not documented in `doc/source/api.rst`. - -### How to build the *pandas* documentation - -#### Requirements - -To build the *pandas* docs there are some extra requirements: you will need to have `sphinx` and `ipython` installed. [numpydoc](https://github.com/numpy/numpydoc) is used to parse the docstrings that follow the Numpy Docstring Standard (see above), but you don't need to install this because a local copy of numpydoc is included in the *pandas* source code. - -It is easiest to create a development environment <contributing.dev\_env>, then install: - - conda install -n pandas_dev sphinx ipython - -Furthermore, it is recommended to have all [optional dependencies](http://pandas.pydata.org/pandas-docs/dev/install.html#optional-dependencies) installed. This is not strictly necessary, but be aware that you will see some error messages when building the docs. This happens because all the code in the documentation is executed during the doc build, and so code examples using optional dependencies will generate errors. Run `pd.show_versions()` to get an overview of the installed version of all dependencies. - -> **warning** -> -> You need to have `sphinx` version 1.2.2 or newer, but older than version 1.3. Versions before 1.1.3 should also work. - -#### Building the documentation - -So how do you build the docs? Navigate to your local `pandas/doc/` directory in the console and run: - - python make.py html - -Then you can find the HTML output in the folder `pandas/doc/build/html/`. - -The first time you build the docs, it will take quite a while because it has to run all the code examples and build all the generated docstring pages. In subsequent evocations, sphinx will try to only build the pages that have been modified. - -If you want to do a full clean build, do: - - python make.py clean - python make.py build - -Starting with *pandas* 0.13.1 you can tell `make.py` to compile only a single section of the docs, greatly reducing the turn-around time for checking your changes. You will be prompted to delete `.rst` files that aren't required. 
This is okay because the prior versions of these files can be checked out from git. However, you must make sure not to commit the file deletions to your Git repository! - - #omit autosummary and API section - python make.py clean - python make.py --no-api - - # compile the docs with only a single - # section, that which is in indexing.rst - python make.py clean - python make.py --single indexing - -For comparison, a full documentation build may take 10 minutes, a `-no-api` build may take 3 minutes and a single section may take 15 seconds. Subsequent builds, which only process portions you have changed, will be faster. Open the following file in a web browser to see the full documentation you just built: - - pandas/docs/build/html/index.html +Whether you are a novice or experienced software developer, all contributions and suggestions are welcome! -And you'll have the satisfaction of seeing your new and improved documentation! +Our main contribution docs can be found [here](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst), but if you do not want to read it in its entirety, we will summarize the main ways in which you can contribute and point to relevant places in the docs for further information. -#### Building master branch documentation - -When pull requests are merged into the *pandas* `master` branch, the main parts of the documentation are also built by Travis-CI. These docs are then hosted [here](http://pandas-docs.github.io/pandas-docs-travis). - -Contributing to the code base ------------------------------ - -### Code standards - -*pandas* uses the [PEP8](http://www.python.org/dev/peps/pep-0008/) standard. There are several tools to ensure you abide by this standard. - -We've written a tool to check that your commits are PEP8 great, [pip install pep8radius](https://github.com/hayd/pep8radius). Look at PEP8 fixes in your branch vs master with: - - pep8radius master --diff - -and make these changes with: - - pep8radius master --diff --in-place - -Alternatively, use the [flake8](http://pypi.python.org/pypi/flake8) tool for checking the style of your code. Additional standards are outlined on the [code style wiki page](https://github.com/pydata/pandas/wiki/Code-Style-and-Conventions). - -Please try to maintain backward compatibility. *pandas* has lots of users with lots of existing code, so don't break it if at all possible. If you think breakage is required, clearly state why as part of the pull request. Also, be careful when changing method signatures and add deprecation warnings where needed. - -### Test-driven development/code writing - -*pandas* is serious about testing and strongly encourages contributors to embrace [test-driven development (TDD)](http://en.wikipedia.org/wiki/Test-driven_development). This development process "relies on the repetition of a very short development cycle: first the developer writes an (initially failing) automated test case that defines a desired improvement or new function, then produces the minimum amount of code to pass that test." So, before actually writing any code, you should write your tests. Often the test can be taken from the original GitHub issue. However, it is always worth considering additional use cases and writing corresponding tests. - -Adding tests is one of the most common requests after code is pushed to *pandas*. Therefore, it is worth getting in the habit of writing tests ahead of time so this is never an issue. 
- -Like many packages, *pandas* uses the [Nose testing system](https://nose.readthedocs.io/en/latest/index.html) and the convenient extensions in [numpy.testing](http://docs.scipy.org/doc/numpy/reference/routines.testing.html). - -#### Writing tests - -All tests should go into the `tests` subdirectory of the specific package. This folder contains many current examples of tests, and we suggest looking to these for inspiration. If your test requires working with files or network connectivity, there is more information on the [testing page](https://github.com/pydata/pandas/wiki/Testing) of the wiki. - -The `pandas.util.testing` module has many special `assert` functions that make it easier to make statements about whether Series or DataFrame objects are equivalent. The easiest way to verify that your code is correct is to explicitly construct the result you expect, then compare the actual result to the expected correct result: - - def test_pivot(self): - data = { - 'index' : ['A', 'B', 'C', 'C', 'B', 'A'], - 'columns' : ['One', 'One', 'One', 'Two', 'Two', 'Two'], - 'values' : [1., 2., 3., 3., 2., 1.] - } - - frame = DataFrame(data) - pivoted = frame.pivot(index='index', columns='columns', values='values') - - expected = DataFrame({ - 'One' : {'A' : 1., 'B' : 2., 'C' : 3.}, - 'Two' : {'A' : 1., 'B' : 2., 'C' : 3.} - }) - - assert_frame_equal(pivoted, expected) - -#### Running the test suite - -The tests can then be run directly inside your Git clone (without having to install *pandas*) by typing: - - nosetests pandas - -The tests suite is exhaustive and takes around 20 minutes to run. Often it is worth running only a subset of tests first around your changes before running the entire suite. This is done using one of the following constructs: - - nosetests pandas/tests/[test-module].py - nosetests pandas/tests/[test-module].py:[TestClass] - nosetests pandas/tests/[test-module].py:[TestClass].[test_method] - -#### Running the performance test suite - -Performance matters and it is worth considering whether your code has introduced performance regressions. *pandas* is in the process of migrating to the [asv library](https://github.com/spacetelescope/asv) to enable easy monitoring of the performance of critical *pandas* operations. These benchmarks are all found in the `pandas/asv_bench` directory. asv supports both python2 and python3. - -> **note** -> -> The asv benchmark suite was translated from the previous framework, vbench, so many stylistic issues are likely a result of automated transformation of the code. - -To use asv you will need either `conda` or `virtualenv`. For more details please check the [asv installation webpage](https://asv.readthedocs.io/en/latest/installing.html). - -To install asv: - - pip install git+https://github.com/spacetelescope/asv - -If you need to run a benchmark, change your directory to `/asv_bench/` and run the following if you have been developing on `master`: - - asv continuous master - -If you are working on another branch, either of the following can be used: - - asv continuous master HEAD - asv continuous master your_branch - -This will check out the master revision and run the suite on both master and your commit. Running the full test suite can take up to one hour and use up to 3GB of RAM. Usually it is sufficient to paste only a subset of the results into the pull request to show that the committed changes do not cause unexpected performance regressions. 
- -You can run specific benchmarks using the `-b` flag, which takes a regular expression. For example, this will only run tests from a `pandas/asv_bench/benchmarks/groupby.py` file: - - asv continuous master -b groupby - -If you want to only run a specific group of tests from a file, you can do it using `.` as a separator. For example: - - asv continuous master -b groupby.groupby_agg_builtins1 - -will only run a `groupby_agg_builtins1` test defined in a `groupby` file. - -It can also be useful to run tests in your current environment. You can simply do it by: - - asv dev - -This command is equivalent to: - - asv run --quick --show-stderr --python=same - -This will launch every test only once, display stderr from the benchmarks, and use your local `python` that comes from your `$PATH`. - -Information on how to write a benchmark can be found in the [asv documentation](https://asv.readthedocs.io/en/latest/writing_benchmarks.html). - -#### Running the vbench performance test suite (phasing out) - -Historically, *pandas* used [vbench library](https://github.com/pydata/vbench) to enable easy monitoring of the performance of critical *pandas* operations. These benchmarks are all found in the `pandas/vb_suite` directory. vbench currently only works on python2. - -To install vbench: - - pip install git+https://github.com/pydata/vbench - -Vbench also requires `sqlalchemy`, `gitpython`, and `psutil`, which can all be installed using pip. If you need to run a benchmark, change your directory to the *pandas* root and run: - - ./test_perf.sh -b master -t HEAD - -This will check out the master revision and run the suite on both master and your commit. Running the full test suite can take up to one hour and use up to 3GB of RAM. Usually it is sufficient to paste a subset of the results into the Pull Request to show that the committed changes do not cause unexpected performance regressions. - -You can run specific benchmarks using the `-r` flag, which takes a regular expression. - -See the [performance testing wiki](https://github.com/pydata/pandas/wiki/Performance-Testing) for information on how to write a benchmark. - -### Documenting your code - -Changes should be reflected in the release notes located in `doc/source/whatsnew/vx.y.z.txt`. This file contains an ongoing change log for each release. Add an entry to this file to document your fix, enhancement or (unavoidable) breaking change. Make sure to include the GitHub issue number when adding your entry (using `` :issue:`1234` `` where 1234 is the issue/pull request number). - -If your code is an enhancement, it is most likely necessary to add usage examples to the existing documentation. This can be done following the section regarding documentation above <contributing.documentation>. Further, to let users know when this feature was added, the `versionadded` directive is used. The sphinx syntax for that is: - -``` sourceCode -.. versionadded:: 0.17.0 -``` - -This will put the text *New in version 0.17.0* wherever you put the sphinx directive. This should also be put in the docstring when adding a new function or method ([example](https://github.com/pydata/pandas/blob/v0.16.2/pandas/core/generic.py#L1959)) or a new keyword argument ([example](https://github.com/pydata/pandas/blob/v0.16.2/pandas/core/frame.py#L1171)). - -Contributing your changes to *pandas* -------------------------------------- - -### Committing your code - -Keep style fixes to a separate commit to make your pull request more readable. 
- -Once you've made changes, you can see them by typing: - - git status - -If you have created a new file, it is not being tracked by git. Add it by typing: - - git add path/to/file-to-be-added.py - -Doing 'git status' again should give something like: - - # On branch shiny-new-feature - # - # modified: /relative/path/to/file-you-added.py - # - -Finally, commit your changes to your local repository with an explanatory message. *Pandas* uses a convention for commit message prefixes and layout. Here are some common prefixes along with general guidelines for when to use them: - -> - ENH: Enhancement, new functionality -> - BUG: Bug fix -> - DOC: Additions/updates to documentation -> - TST: Additions/updates to tests -> - BLD: Updates to the build process/scripts -> - PERF: Performance improvement -> - CLN: Code cleanup - -The following defines how a commit message should be structured. Please reference the relevant GitHub issues in your commit message using GH1234 or \#1234. Either style is fine, but the former is generally preferred: - -> - a subject line with < 80 chars. -> - One blank line. -> - Optionally, a commit message body. - -Now you can commit your changes in your local repository: - - git commit -m - -### Combining commits - -If you have multiple commits, you may want to combine them into one commit, often referred to as "squashing" or "rebasing". This is a common request by package maintainers when submitting a pull request as it maintains a more compact commit history. To rebase your commits: - - git rebase -i HEAD~# - -Where \# is the number of commits you want to combine. Then you can pick the relevant commit message and discard others. - -To squash to the master branch do: - - git rebase -i master - -Use the `s` option on a commit to `squash`, meaning to keep the commit messages, or `f` to `fixup`, meaning to merge the commit messages. - -Then you will need to push the branch (see below) forcefully to replace the current commits with the new ones: - - git push origin shiny-new-feature -f - -### Pushing your changes - -When you want your changes to appear publicly on your GitHub page, push your forked feature branch's commits: - - git push origin shiny-new-feature - -Here `origin` is the default name given to your remote repository on GitHub. You can see the remote repositories: - - git remote -v - -If you added the upstream repository as described above you will see something like: - - origin git@github.com:yourname/pandas.git (fetch) - origin git@github.com:yourname/pandas.git (push) - upstream git://github.com/pydata/pandas.git (fetch) - upstream git://github.com/pydata/pandas.git (push) - -Now your code is on GitHub, but it is not yet a part of the *pandas* project. For that to happen, a pull request needs to be submitted on GitHub. - -### Review your code - -When you're ready to ask for a code review, file a pull request. Before you do, once again make sure that you have followed all the guidelines outlined in this document regarding code style, tests, performance tests, and documentation. You should also double check your branch changes against the branch it was based on: - -1. Navigate to your repository on GitHub -- -2. Click on `Branches` -3. Click on the `Compare` button for your feature branch -4. Select the `base` and `compare` branches, if necessary. This will be `master` and `shiny-new-feature`, respectively. - -### Finally, make the pull request - -If everything looks good, you are ready to make a pull request. 
A pull request is how code from a local repository becomes available to the GitHub community and can be looked at and eventually merged into the master version. This pull request and its associated changes will eventually be committed to the master branch and available in the next release. To submit a pull request: - -1. Navigate to your repository on GitHub -2. Click on the `Pull Request` button -3. You can then click on `Commits` and `Files Changed` to make sure everything looks okay one last time -4. Write a description of your changes in the `Preview Discussion` tab -5. Click `Send Pull Request`. - -This request then goes to the repository maintainers, and they will review the code. If you need to make more changes, you can make them in your branch, push them to GitHub, and the pull request will be automatically updated. Pushing them to GitHub again is done by: - - git push -f origin shiny-new-feature - -This will automatically update your pull request with the latest code and restart the Travis-CI tests. - -### Delete your merged branch (optional) - -Once your feature branch is accepted into upstream, you'll probably want to get rid of the branch. First, merge upstream master into your branch so git knows it is safe to delete your branch: - - git fetch upstream - git checkout master - git merge upstream/master +Getting Started +--------------- +If you are looking to contribute to the *pandas* codebase, the best place to start is the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues). This is also a great place for filing bug reports and making suggestions for ways in which we can improve the code and documentation. -Then you can just do: +If you have additional questions, feel free to ask them on the [mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata) or on [Gitter](https://gitter.im/pydata/pandas). Further information can also be found in our [Getting Started](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#where-to-start) section of our main contribution doc. - git branch -d shiny-new-feature +Filing Issues +------------- +If you notice a bug in the code or in docs or have suggestions for how we can improve either, feel free to create an issue on the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues) using [GitHub's "issue" form](https://github.com/pandas-dev/pandas/issues/new). The form contains some questions that will help us best address your issue. For more information regarding how to file issues against *pandas*, please refer to the [Bug reports and enhancement requests](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#bug-reports-and-enhancement-requests) section of our main contribution doc. -Make sure you use a lower-case `-d`, or else git won't warn you if your feature branch has not actually been merged. +Contributing to the Codebase +---------------------------- +The code is hosted on [GitHub](https://www.github.com/pandas-dev/pandas), so you will need to use [Git](http://git-scm.com/) to clone the project and make changes to the codebase. Once you have obtained a copy of the code, you should create a development environment that is separate from your existing Python environment so that you can make and test changes without compromising your own work environment. 
For more information, please refer to the [Working with the code](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#working-with-the-code) section of our main contribution docs.

-The branch will still exist on GitHub, so to delete it there do:

+Before submitting your changes for review, make sure to check that your changes do not break any tests. More information about our test suites can be found [here](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#test-driven-development-code-writing). We also have guidelines regarding coding style that will be enforced during testing. Details about coding style can be found [here](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#code-standards).

- git push origin --delete shiny-new-feature

+Once your changes are ready to be submitted, make sure to push them to GitHub before creating a pull request. Details about how to do that can be found in the [Contributing your changes to pandas](https://github.com/pandas-dev/pandas/blob/master/doc/source/contributing.rst#contributing-your-changes-to-pandas) section of our main contribution docs. We will review your changes, and you will most likely be asked to make additional changes before your pull request is finally ready to merge. However, once it is ready, we will merge it, and you will have successfully contributed to the codebase!
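For quick reference, the workflow summarized above condenses to a handful of commands, all drawn from the full contributing guide; the user name, the `pandas_dev` environment name, and the `shiny-new-feature` branch name are illustrative placeholders, and the upstream URL reflects the repository's new pandas-dev home:

```sh
# Clone your fork and track the main repository as "upstream"
git clone git@github.com:your-user-name/pandas.git pandas-yourname
cd pandas-yourname
git remote add upstream git://github.com/pandas-dev/pandas.git

# Create an isolated development environment and build the C extensions in-place
conda create -n pandas_dev --file ci/requirements_dev.txt
source activate pandas_dev   # on Windows: activate pandas_dev
python setup.py build_ext --inplace

# Do the work on a dedicated feature branch and push it to your fork
git checkout -b shiny-new-feature
git push origin shiny-new-feature
```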
diff --git a/.github/ISSUE_TEMPLATE.md b/.github/ISSUE_TEMPLATE.md
index 6f91eba1ad239..1f614b54b1f71 100644
--- a/.github/ISSUE_TEMPLATE.md
+++ b/.github/ISSUE_TEMPLATE.md
@@ -1,15 +1,18 @@
-#### A small, complete example of the issue
+#### Code Sample, a copy-pastable example if possible

```python
# Your code here
```

+#### Problem description
+
+[This should explain **why** the current behaviour is a problem and why the expected output is a better solution.]

#### Expected Output

#### Output of ``pd.show_versions()``

-# Paste the output here
+# Paste the output of ``pd.show_versions()`` here
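To make the new template concrete, a minimal report pairs a short, runnable snippet with the full version output; the `DataFrame` below is a made-up example rather than a real bug:

```python
# Code Sample: short, self-contained, and copy-pastable
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
print(df.sum())  # show the actual output here, then describe the expected output

# Output of pd.show_versions(): paste the full result into the issue
pd.show_versions()
```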
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 918d427ee4f4c..9281c51059087 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -1,4 +1,4 @@
- [ ] closes #xxxx
- [ ] tests added / passed
- - [ ] passes ``git diff upstream/master | flake8 --diff``
+ - [ ] passes ``git diff upstream/master --name-only -- '*.py' | flake8 --diff``
- [ ] whatsnew entry

diff --git a/.gitignore b/.gitignore
index a77e780f3332d..495429fcde429 100644
--- a/.gitignore
+++ b/.gitignore
@@ -19,6 +19,7 @@
.noseids
.ipynb_checkpoints
.tags
+.cache/

# Compiled source #
###################
@@ -56,6 +57,8 @@ dist
**/wheelhouse/*
# coverage
.coverage
+coverage.xml
+coverage_html_report

# OS generated files #
######################
@@ -100,3 +103,5 @@ doc/source/index.rst
doc/build/html/index.html
# Windows specific leftover:
doc/tmp.sv
+doc/source/styled.xlsx
+doc/source/templates/

diff --git a/.travis.yml b/.travis.yml
index 49765c9df96ea..e5e05ed26da56 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -1,26 +1,24 @@
sudo: false
language: python
+# Default Python version is usually 2.7
+python: 3.5

-# To turn off cached miniconda, cython files and compiler cache comment out the
-# USE_CACHE=true line for the build in the matrix below. To delete caches go to
-# https://travis-ci.org/OWNER/REPOSITORY/caches or run
+# To turn off cached cython files and compiler cache
+# set NOCACHE=true
+# To delete caches go to https://travis-ci.org/OWNER/REPOSITORY/caches or run
# travis cache --delete inside the project directory from the travis command line client
-# The cash directories will be deleted if anything in ci/ changes in a commit
+# The cache directories will be deleted if anything in ci/ changes in a commit
cache:
+ ccache: true
directories:
- - $HOME/miniconda # miniconda cache
- $HOME/.cache # cython cache
- $HOME/.ccache # compiler cache

env:
global:
- # scatterci API key
- #- secure: "Bx5umgo6WjuGY+5XFa004xjCiX/vq0CyMZ/ETzcs7EIBI1BE/0fIDXOoWhoxbY9HPfdPGlDnDgB9nGqr5wArO2s+BavyKBWg6osZ3dmkfuJPMOWeyCa92EeP+sfKw8e5HSU5MizW9e319wHWOF/xkzdHR7T67Qd5erhv91x4DnQ="
- # ironcache API key
- #- secure: "e4eEFn9nDQc3Xa5BWYkzfX37jaWVq89XidVX+rcCNEr5OlOImvveeXnF1IzbRXznH4Sv0YsLwUd8RGUWOmyCvkONq/VJeqCHWtTMyfaCIdqSyhIP9Odz8r9ahch+Y0XFepBey92AJHmlnTh+2GjCDgIiqq4fzglojnp56Vg1ojA="
- #- secure: "CjmYmY5qEu3KrvMtel6zWFEtMq8ORBeS1S1odJHnjQpbwT1KY2YFZRVlLphfyDQXSz6svKUdeRrCNp65baBzs3DQNA8lIuXGIBYFeJxqVGtYAZZs6+TzBPfJJK798sGOj5RshrOJkFG2rdlWNuTq/XphI0JOrN3nPUkRrdQRpAw="
- # pandas-docs-bot GH
- - secure: "PCzUFR8CHmw9lH84p4ygnojdF7Z8U5h7YfY0RyT+5K/aiQ1ZTU3ZkDTPI0/rR5FVMxsEEKEQKMcc5fvqW0PeD7Q2wRmluloKgT9w4EVEJ1ppKf7lITPcvZR2QgVOvjv4AfDtibLHFNiaSjzoqyJVjM4igjOu8WTlF3JfZcmOQjQ=
+
+ # pandas-docs-travis GH
+ - secure: "YvvTc+FrSYHgdxqoxn9s8VOaCWjvZzlkaf6k55kkmQqCYR9dPiLMsot1F96/N7o3YlD1s0znPQCak93Du8HHi/8809zAXloTaMSZrWz4R4qn96xlZFRE88O/w/Z1t3VVYpKX3MHlCggBc8MtXrqmvWKJMAqXyysZ4TTzoiJDPvE="

git:
# for cloning

matrix:
fast_finish: true
+ exclude:
+ # Exclude the default Python 3.5 build
+ - python: 3.5
include:
- - language: objective-c
- os: osx
- compiler: clang
- osx_image: xcode6.4
+ - os: osx
+ language: generic
env:
- - PYTHON_VERSION=3.5
- - JOB_NAME: "35_osx"
- - NOSE_ARGS="not slow and not network and not disabled"
- - BUILD_TYPE=conda
- - JOB_TAG=_OSX
- - TRAVIS_PYTHON_VERSION=3.5
- - CACHE_NAME="35_osx"
- - USE_CACHE=true
+ - JOB="3.5_OSX" TEST_ARGS="--skip-slow --skip-network"
- - python: 2.7
+ - os: linux
env:
- - PYTHON_VERSION=2.7
- - 
JOB_NAME: "27_slow_nnet_LOCALE" - - NOSE_ARGS="slow and not network and not disabled" - - LOCALE_OVERRIDE="zh_CN.UTF-8" - - FULL_DEPS=true - - JOB_TAG=_LOCALE - - CACHE_NAME="27_slow_nnet_LOCALE" - - USE_CACHE=true + - JOB="2.7_LOCALE" TEST_ARGS="--only-slow --skip-network" LOCALE_OVERRIDE="zh_CN.UTF-8" addons: apt: packages: - language-pack-zh-hans - - python: 2.7 + - os: linux env: - - PYTHON_VERSION=2.7 - - JOB_NAME: "27_nslow" - - NOSE_ARGS="not slow and not disabled" - - FULL_DEPS=true - - CLIPBOARD_GUI=gtk2 - - LINT=true - - CACHE_NAME="27_nslow" - - USE_CACHE=true + - JOB="2.7" TEST_ARGS="--skip-slow" LINT=true addons: apt: packages: - python-gtk2 - - python: 3.4 + - os: linux env: - - PYTHON_VERSION=3.4 - - JOB_NAME: "34_nslow" - - NOSE_ARGS="not slow and not disabled" - - FULL_DEPS=true - - CLIPBOARD=xsel - - CACHE_NAME="34_nslow" - - USE_CACHE=true + - JOB="3.5" TEST_ARGS="--skip-slow --skip-network" COVERAGE=true addons: apt: packages: - xsel - - python: 3.5 + - os: linux env: - - PYTHON_VERSION=3.5 - - JOB_NAME: "35_nslow" - - NOSE_ARGS="not slow and not network and not disabled" - - FULL_DEPS=true - - CLIPBOARD=xsel - - COVERAGE=true - - CACHE_NAME="35_nslow" -# - USE_CACHE=true # Don't use cache for 35_nslow - addons: - apt: - packages: - - xsel -# In allow_failures - - python: 2.7 + - JOB="3.6" TEST_ARGS="--skip-slow --skip-network" PANDAS_TESTING_MODE="deprecate" CONDA_FORGE=true + # In allow_failures + - os: linux env: - - PYTHON_VERSION=2.7 - - JOB_NAME: "27_slow" - - JOB_TAG=_SLOW - - NOSE_ARGS="slow and not network and not disabled" - - FULL_DEPS=true - - CACHE_NAME="27_slow" - - USE_CACHE=true -# In allow_failures - - python: 3.4 + - JOB="2.7_SLOW" TEST_ARGS="--only-slow --skip-network" + # In allow_failures + - os: linux env: - - PYTHON_VERSION=3.4 - - JOB_NAME: "34_slow" - - JOB_TAG=_SLOW - - NOSE_ARGS="slow and not network and not disabled" - - FULL_DEPS=true - - CLIPBOARD=xsel - - CACHE_NAME="34_slow" - - USE_CACHE=true - addons: - apt: - packages: - - xsel -# In allow_failures - - python: 2.7 + - JOB="2.7_BUILD_TEST" TEST_ARGS="--skip-slow" BUILD_TEST=true + # In allow_failures + - os: linux env: - - PYTHON_VERSION=2.7 - - JOB_NAME: "27_build_test_conda" - - JOB_TAG=_BUILD_TEST - - NOSE_ARGS="not slow and not disabled" - - FULL_DEPS=true - - BUILD_TEST=true - - CACHE_NAME="27_build_test_conda" - - USE_CACHE=true -# In allow_failures - - python: 3.6-dev + - JOB="3.6_NUMPY_DEV" TEST_ARGS="--skip-slow --skip-network" PANDAS_TESTING_MODE="deprecate" + # In allow_failures + - os: linux env: - - PYTHON_VERSION=3.6 - - JOB_NAME: "36_dev" - - JOB_TAG=_DEV - - NOSE_ARGS="not slow and not network and not disabled" - - PANDAS_TESTING_MODE="deprecate" - addons: - apt: - packages: - - libatlas-base-dev - - gfortran -# In allow_failures - - python: 3.5 - env: - - PYTHON_VERSION=3.5 - - JOB_NAME: "35_numpy_dev" - - JOB_TAG=_NUMPY_DEV - - NOSE_ARGS="not slow and not network and not disabled" - - PANDAS_TESTING_MODE="deprecate" - - CACHE_NAME="35_numpy_dev" - - USE_CACHE=true - addons: - apt: - packages: - - libatlas-base-dev - - gfortran -# In allow_failures - - python: 2.7 - env: - - PYTHON_VERSION=2.7 - - JOB_NAME: "27_nslow_nnet_COMPAT" - - NOSE_ARGS="not slow and not network and not disabled" - - LOCALE_OVERRIDE="it_IT.UTF-8" - - INSTALL_TEST=true - - JOB_TAG=_COMPAT - - CACHE_NAME="27_nslow_nnet_COMPAT" - - USE_CACHE=true - addons: - apt: - packages: - - language-pack-it -# In allow_failures - - python: 3.5 - env: - - PYTHON_VERSION=3.5 - - JOB_NAME: "35_ascii" - - 
JOB_TAG=_ASCII - - NOSE_ARGS="not slow and not network and not disabled" - - LOCALE_OVERRIDE="C" - - CACHE_NAME="35_ascii" - - USE_CACHE=true -# In allow_failures - - python: 2.7 - env: - - PYTHON_VERSION=2.7 - - JOB_NAME: "doc_build" - - FULL_DEPS=true - - DOC_BUILD=true - - JOB_TAG=_DOC_BUILD - - CACHE_NAME="doc_build" - - USE_CACHE=true + - JOB="3.5_DOC" DOC=true allow_failures: - - python: 2.7 - env: - - PYTHON_VERSION=2.7 - - JOB_NAME: "27_slow" - - JOB_TAG=_SLOW - - NOSE_ARGS="slow and not network and not disabled" - - FULL_DEPS=true - - CACHE_NAME="27_slow" - - USE_CACHE=true - - python: 3.4 + - os: linux env: - - PYTHON_VERSION=3.4 - - JOB_NAME: "34_slow" - - JOB_TAG=_SLOW - - NOSE_ARGS="slow and not network and not disabled" - - FULL_DEPS=true - - CLIPBOARD=xsel - - CACHE_NAME="34_slow" - - USE_CACHE=true - addons: - apt: - packages: - - xsel - - python: 2.7 + - JOB="2.7_SLOW" TEST_ARGS="--only-slow --skip-network" + - os: linux env: - - PYTHON_VERSION=2.7 - - JOB_NAME: "27_build_test_conda" - - JOB_TAG=_BUILD_TEST - - NOSE_ARGS="not slow and not disabled" - - FULL_DEPS=true - - BUILD_TEST=true - - CACHE_NAME="27_build_test_conda" - - USE_CACHE=true - - python: 3.6-dev - env: - - PYTHON_VERSION=3.6 - - JOB_NAME: "36_dev" - - JOB_TAG=_DEV - - NOSE_ARGS="not slow and not network and not disabled" - - PANDAS_TESTING_MODE="deprecate" - addons: - apt: - packages: - - libatlas-base-dev - - gfortran - - python: 3.5 - env: - - PYTHON_VERSION=3.5 - - JOB_NAME: "35_numpy_dev" - - JOB_TAG=_NUMPY_DEV - - NOSE_ARGS="not slow and not network and not disabled" - - PANDAS_TESTING_MODE="deprecate" - - CACHE_NAME="35_numpy_dev" - - USE_CACHE=true - addons: - apt: - packages: - - libatlas-base-dev - - gfortran - - python: 2.7 - env: - - PYTHON_VERSION=2.7 - - JOB_NAME: "27_nslow_nnet_COMPAT" - - NOSE_ARGS="not slow and not network and not disabled" - - LOCALE_OVERRIDE="it_IT.UTF-8" - - INSTALL_TEST=true - - JOB_TAG=_COMPAT - - CACHE_NAME="27_nslow_nnet_COMPAT" - - USE_CACHE=true - addons: - apt: - packages: - - language-pack-it - - python: 3.5 + - JOB="2.7_BUILD_TEST" TEST_ARGS="--skip-slow" BUILD_TEST=true + - os: linux env: - - PYTHON_VERSION=3.5 - - JOB_NAME: "35_ascii" - - JOB_TAG=_ASCII - - NOSE_ARGS="not slow and not network and not disabled" - - LOCALE_OVERRIDE="C" - - CACHE_NAME="35_ascii" - - USE_CACHE=true - - python: 2.7 + - JOB="3.6_NUMPY_DEV" TEST_ARGS="--skip-slow --skip-network" PANDAS_TESTING_MODE="deprecate" + - os: linux env: - - PYTHON_VERSION=2.7 - - JOB_NAME: "doc_build" - - FULL_DEPS=true - - DOC_BUILD=true - - JOB_TAG=_DOC_BUILD - - CACHE_NAME="doc_build" - - USE_CACHE=true + - JOB="3.5_DOC" DOC=true before_install: - echo "before_install" - source ci/travis_process_gbq_encryption.sh - - echo $VIRTUAL_ENV - export PATH="$HOME/miniconda3/bin:$PATH" - df -h - - date - pwd - uname -a - - python -V -# git info & get tags - git --version - git tag - ci/before_install_travis.sh - - export DISPLAY=:99.0 + - export DISPLAY=":99.0" install: - echo "install start" - - ci/check_cache.sh - ci/prep_cython_cache.sh - ci/install_travis.sh - ci/submit_cython_cache.sh - echo "install done" before_script: - - source activate pandas && pip install codecov - - ci/install_db.sh + - ci/install_db_travis.sh script: - echo "script start" - ci/run_build_docs.sh - - ci/script.sh + - ci/script_single.sh + - ci/script_multi.sh - ci/lint.sh - echo "script done" after_success: - - source activate pandas && codecov + - ci/upload_coverage.sh after_script: - echo "after_script start" - - ci/install_test.sh - 
source activate pandas && python -c "import pandas; pandas.show_versions();" - - ci/print_skipped.py /tmp/nosetests.xml + - if [ -e /tmp/single.xml ]; then + ci/print_skipped.py /tmp/single.xml; + fi + - if [ -e /tmp/multiple.xml ]; then + ci/print_skipped.py /tmp/multiple.xml; + fi - echo "after_script done" diff --git a/MANIFEST.in b/MANIFEST.in index 2d26fbfd6adaf..8bd83a7d56948 100644 --- a/MANIFEST.in +++ b/MANIFEST.in @@ -7,7 +7,6 @@ include setup.py graft doc prune doc/build -graft examples graft pandas global-exclude *.so @@ -26,3 +25,4 @@ global-exclude *.png # recursive-include LICENSES * include versioneer.py include pandas/_version.py +include pandas/io/formats/templates/*.tpl diff --git a/Makefile b/Makefile index 9a768932b8bea..194a8861715b7 100644 --- a/Makefile +++ b/Makefile @@ -1,4 +1,4 @@ -tseries: pandas/lib.pyx pandas/tslib.pyx pandas/hashtable.pyx +tseries: pandas/_libs/lib.pyx pandas/_libs/tslib.pyx pandas/_libs/hashtable.pyx python setup.py build_ext --inplace .PHONY : develop build clean clean_pyc tseries doc @@ -9,9 +9,6 @@ clean: clean_pyc: -find . -name '*.py[co]' -exec rm {} \; -sparse: pandas/src/sparse.pyx - python setup.py build_ext --inplace - build: clean_pyc python setup.py build_ext --inplace diff --git a/README.md b/README.md index 6ebc287fa2cf6..e05f1405419fc 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,5 @@
[README logo image hunk: the HTML markup was stripped in extraction; it repoints the pandas logo image from the pydata organization to pandas-dev]
-----------------

@@ -12,7 +12,7 @@
[latest-release badge row: HTML stripped in extraction; the hunk repoints the release badge link at pandas-dev]
@@ -25,27 +25,44 @@
[build, coverage, and conda badge rows: HTML stripped in extraction; the hunks update the Travis, AppVeyor, coverage, and conda badges to point at pandas-dev and add new CircleCI and conda-forge badge rows]

@@ -127,7 +144,7 @@ Here are just a few of the things that pandas does well:

## Where to get it
The source code is currently hosted on GitHub at:
-http://github.com/pydata/pandas
+http://github.com/pandas-dev/pandas

Binary installers for the latest released version are available at the [Python
package index](http://pypi.python.org/pypi/pandas/) and on conda.

diff --git a/RELEASE.md b/RELEASE.md
index 23c1817b7647c..efd075dabcba9 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -1,6 +1,6 @@
Release Notes
=============
-The list of changes to pandas between each release can be found
+The list of changes to Pandas between each release can be found
[here](http://pandas.pydata.org/pandas-docs/stable/whatsnew.html). For full
-details, see the commit logs at http://github.com/pydata/pandas.
+details, see the commit logs at http://github.com/pandas-dev/pandas.

diff --git a/appveyor.yml b/appveyor.yml
index 84c34b34626b9..684b859c206b2 100644
--- a/appveyor.yml
+++ b/appveyor.yml
@@ -14,29 +14,22 @@ environment:
# /E:ON and /V:ON options are not enabled in the batch script interpreter
# See: http://stackoverflow.com/a/13751649/163740
CMD_IN_ENV: "cmd /E:ON /V:ON /C .\\ci\\run_with_env.cmd"
+
clone_folder: C:\projects\pandas

matrix:
- # disable python 3.4 ATM
- #- PYTHON: "C:\\Python34_64"
- # PYTHON_VERSION: "3.4"
- # PYTHON_ARCH: "64"
- # CONDA_PY: "34"
- # CONDA_NPY: "19"
+ - CONDA_ROOT: "C:\\Miniconda3_64"
+ PYTHON_VERSION: "3.6"
+ PYTHON_ARCH: "64"
+ CONDA_PY: "36"
+ CONDA_NPY: "112"

- - PYTHON: "C:\\Python27_64"
+ - CONDA_ROOT: "C:\\Miniconda3_64"
PYTHON_VERSION: "2.7"
PYTHON_ARCH: "64"
CONDA_PY: "27"
CONDA_NPY: "110"

- - PYTHON: "C:\\Python35_64"
- PYTHON_VERSION: "3.5"
- PYTHON_ARCH: "64"
- CONDA_PY: "35"
- CONDA_NPY: "111"
-
-
# We always use a 64-bit machine, but can build x86 distributions
# with the PYTHON_ARCH variable (which is used by CMD_IN_ENV).
platform:
@@ -45,9 +38,6 @@

# all our python builds have to happen in tests_script... 
build: false -init: - - "ECHO %PYTHON_VERSION% %PYTHON%" - install: # cancel older builds for the same PR - ps: if ($env:APPVEYOR_PULL_REQUEST_NUMBER -and $env:APPVEYOR_BUILD_NUMBER -ne ((Invoke-RestMethod ` @@ -58,7 +48,7 @@ install: # this installs the appropriate Miniconda (Py2/Py3, 32/64 bit) # updates conda & installs: conda-build jinja2 anaconda-client - powershell .\ci\install.ps1 - - SET PATH=%PYTHON%;%PYTHON%\Scripts;%PATH% + - SET PATH=%CONDA_ROOT%;%CONDA_ROOT%\Scripts;%PATH% - echo "install" - cd - ls -ltr @@ -70,38 +60,30 @@ install: # install our build environment - cmd: conda config --set show_channel_urls true --set always_yes true --set changeps1 false - cmd: conda update -q conda - - # fix conda-build version - # https://github.com/conda/conda-build/issues/1001 - # disabling 3.4 as windows complains upon compiling byte - # code - - - cmd: conda install conda-build=1.21.7 - cmd: conda config --set ssl_verify false # add the pandas channel *before* defaults to have defaults take priority + - cmd: conda config --add channels conda-forge - cmd: conda config --add channels pandas - cmd: conda config --remove channels defaults - cmd: conda config --add channels defaults - - cmd: conda install anaconda-client # this is now the downloaded conda... - cmd: conda info -a - # build em using the local source checkout in the correct windows env - - cmd: '%CMD_IN_ENV% conda build ci\appveyor.recipe -q' - # create our env - - cmd: conda create -q -n pandas python=%PYTHON_VERSION% nose + - cmd: conda create -n pandas python=%PYTHON_VERSION% cython pytest pytest-xdist - cmd: activate pandas - - SET REQ=ci\requirements-%PYTHON_VERSION%-%PYTHON_ARCH%.run + - SET REQ=ci\requirements-%PYTHON_VERSION%_WIN.run - cmd: echo "installing requirements from %REQ%" - - cmd: conda install -n pandas -q --file=%REQ% - - ps: conda install -n pandas (conda build ci\appveyor.recipe -q --output) + - cmd: conda install -n pandas --file=%REQ% + - cmd: conda list -n pandas + - cmd: echo "installing requirements from %REQ% - done" + + # build em using the local source checkout in the correct windows env + - cmd: '%CMD_IN_ENV% python setup.py build_ext --inplace' test_script: # tests - - cd \ - cmd: activate pandas - - cmd: conda list - - cmd: nosetests --exe -A "not slow and not network and not disabled" pandas + - cmd: test.bat diff --git a/asv_bench/asv.conf.json b/asv_bench/asv.conf.json index f5fa849464881..59c05400d06b0 100644 --- a/asv_bench/asv.conf.json +++ b/asv_bench/asv.conf.json @@ -21,12 +21,12 @@ "environment_type": "conda", // the base URL to show a commit for the project. - "show_commit_url": "https://github.com/pydata/pandas/commit/", + "show_commit_url": "https://github.com/pandas-dev/pandas/commit/", // The Pythons you'd like to test against. If not provided, defaults // to the current version of Python used to run `asv`. // "pythons": ["2.7", "3.4"], - "pythons": ["2.7"], + "pythons": ["3.6"], // The matrix of dependencies to test. Each key is the name of a // package (in PyPI) and the values are version numbers. An empty @@ -46,11 +46,14 @@ "numexpr": [], "pytables": [null, ""], // platform dependent, see excludes below "tables": [null, ""], - "libpython": [null, ""], "openpyxl": [], "xlsxwriter": [], "xlrd": [], - "xlwt": [] + "xlwt": [], + "pytest": [], + // If using Windows with python 2.7 and want to build using the + // mingw toolchain (rather than MSVC), uncomment the following line. 
+ // "libpython": [], }, // Combinations of libraries/python versions can be excluded/included @@ -79,10 +82,6 @@ {"environment_type": "conda", "pytables": null}, {"environment_type": "(?!conda).*", "tables": null}, {"environment_type": "(?!conda).*", "pytables": ""}, - // On conda&win32, install libpython - {"sys_platform": "(?!win32).*", "libpython": ""}, - {"environment_type": "conda", "sys_platform": "win32", "libpython": null}, - {"environment_type": "(?!conda).*", "libpython": ""} ], "include": [], diff --git a/asv_bench/benchmarks/algorithms.py b/asv_bench/benchmarks/algorithms.py index 53b7d55368f6a..40cfec1bcd4c7 100644 --- a/asv_bench/benchmarks/algorithms.py +++ b/asv_bench/benchmarks/algorithms.py @@ -1,13 +1,23 @@ +from importlib import import_module + import numpy as np + import pandas as pd from pandas.util import testing as tm +for imp in ['pandas.util', 'pandas.tools.hashing']: + try: + hashing = import_module(imp) + break + except: + pass -class algorithm(object): +class Algorithms(object): goal_time = 0.2 def setup(self): N = 100000 + np.random.seed(1234) self.int_unique = pd.Int64Index(np.arange(N * 5)) # cache is_unique @@ -17,28 +27,42 @@ def setup(self): self.float = pd.Float64Index(np.random.randn(N).repeat(5)) # Convenience naming. - self.checked_add = pd.core.nanops._checked_add_with_arr + self.checked_add = pd.core.algorithms.checked_add_with_arr self.arr = np.arange(1000000) self.arrpos = np.arange(1000000) self.arrneg = np.arange(-1000000, 0) self.arrmixed = np.array([1, -1]).repeat(500000) + self.strings = tm.makeStringIndex(100000) + + self.arr_nan = np.random.choice([True, False], size=1000000) + self.arrmixed_nan = np.random.choice([True, False], size=1000000) + + # match + self.uniques = tm.makeStringIndex(1000).values + self.all = self.uniques.repeat(10) + + def time_factorize_string(self): + self.strings.factorize() - def time_int_factorize(self): + def time_factorize_int(self): self.int.factorize() - def time_float_factorize(self): + def time_factorize_float(self): self.int.factorize() - def time_int_unique_duplicated(self): + def time_duplicated_int_unique(self): self.int_unique.duplicated() - def time_int_duplicated(self): + def time_duplicated_int(self): self.int.duplicated() - def time_float_duplicated(self): + def time_duplicated_float(self): self.float.duplicated() + def time_match_strings(self): + pd.match(self.all, self.uniques) + def time_add_overflow_pos_scalar(self): self.checked_add(self.arr, 1) @@ -57,8 +81,18 @@ def time_add_overflow_neg_arr(self): def time_add_overflow_mixed_arr(self): self.checked_add(self.arr, self.arrmixed) + def time_add_overflow_first_arg_nan(self): + self.checked_add(self.arr, self.arrmixed, arr_mask=self.arr_nan) + + def time_add_overflow_second_arg_nan(self): + self.checked_add(self.arr, self.arrmixed, b_mask=self.arrmixed_nan) + + def time_add_overflow_both_arg_nan(self): + self.checked_add(self.arr, self.arrmixed, arr_mask=self.arr_nan, + b_mask=self.arrmixed_nan) + -class hashing(object): +class Hashing(object): goal_time = 0.2 def setup(self): @@ -78,13 +112,13 @@ def setup(self): self.df.iloc[10:20] = np.nan def time_frame(self): - self.df.hash() + hashing.hash_pandas_object(self.df) def time_series_int(self): - self.df.E.hash() + hashing.hash_pandas_object(self.df.E) def time_series_string(self): - self.df.B.hash() + hashing.hash_pandas_object(self.df.B) def time_series_categorical(self): - self.df.C.hash() + hashing.hash_pandas_object(self.df.C) diff --git a/asv_bench/benchmarks/attrs_caching.py 
b/asv_bench/benchmarks/attrs_caching.py index de9aa18937985..b7610037bed4d 100644 --- a/asv_bench/benchmarks/attrs_caching.py +++ b/asv_bench/benchmarks/attrs_caching.py @@ -1,23 +1,36 @@ from .pandas_vb_common import * +try: + from pandas.util import cache_readonly +except ImportError: + from pandas.util.decorators import cache_readonly -class getattr_dataframe_index(object): + +class DataFrameAttributes(object): goal_time = 0.2 def setup(self): self.df = DataFrame(np.random.randn(10, 6)) self.cur_index = self.df.index - def time_getattr_dataframe_index(self): + def time_get_index(self): self.foo = self.df.index + def time_set_index(self): + self.df.index = self.cur_index + -class setattr_dataframe_index(object): +class CacheReadonly(object): goal_time = 0.2 def setup(self): - self.df = DataFrame(np.random.randn(10, 6)) - self.cur_index = self.df.index - def time_setattr_dataframe_index(self): - self.df.index = self.cur_index + class Foo: + + @cache_readonly + def prop(self): + return 5 + self.obj = Foo() + + def time_cache_readonly(self): + self.obj.prop diff --git a/asv_bench/benchmarks/binary_ops.py b/asv_bench/benchmarks/binary_ops.py index d22d01f261b27..0ca21b929ea17 100644 --- a/asv_bench/benchmarks/binary_ops.py +++ b/asv_bench/benchmarks/binary_ops.py @@ -1,194 +1,83 @@ from .pandas_vb_common import * -import pandas.computation.expressions as expr +try: + import pandas.core.computation.expressions as expr +except ImportError: + import pandas.computation.expressions as expr -class frame_add(object): +class Ops(object): goal_time = 0.2 - def setup(self): + params = [[True, False], ['default', 1]] + param_names = ['use_numexpr', 'threads'] + + def setup(self, use_numexpr, threads): self.df = DataFrame(np.random.randn(20000, 100)) self.df2 = DataFrame(np.random.randn(20000, 100)) - def time_frame_add(self): - (self.df + self.df2) + if threads != 'default': + expr.set_numexpr_threads(threads) + if not use_numexpr: + expr.set_use_numexpr(False) -class frame_add_no_ne(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(np.random.randn(20000, 100)) - self.df2 = DataFrame(np.random.randn(20000, 100)) - expr.set_use_numexpr(False) - - def time_frame_add_no_ne(self): + def time_frame_add(self, use_numexpr, threads): (self.df + self.df2) - def teardown(self): - expr.set_use_numexpr(True) + def time_frame_mult(self, use_numexpr, threads): + (self.df * self.df2) + def time_frame_multi_and(self, use_numexpr, threads): + self.df[((self.df > 0) & (self.df2 > 0))] -class frame_add_st(object): - goal_time = 0.2 + def time_frame_comparison(self, use_numexpr, threads): + (self.df > self.df2) - def setup(self): - self.df = DataFrame(np.random.randn(20000, 100)) - self.df2 = DataFrame(np.random.randn(20000, 100)) - expr.set_numexpr_threads(1) - - def time_frame_add_st(self): - (self.df + self.df2) - - def teardown(self): + def teardown(self, use_numexpr, threads): + expr.set_use_numexpr(True) expr.set_numexpr_threads() -class frame_float_div(object): +class Ops2(object): goal_time = 0.2 def setup(self): self.df = DataFrame(np.random.randn(1000, 1000)) self.df2 = DataFrame(np.random.randn(1000, 1000)) - def time_frame_float_div(self): - (self.df // self.df2) + self.df_int = DataFrame( + np.random.random_integers(np.iinfo(np.int16).min, + np.iinfo(np.int16).max, + size=(1000, 1000))) + self.df2_int = DataFrame( + np.random.random_integers(np.iinfo(np.int16).min, + np.iinfo(np.int16).max, + size=(1000, 1000))) + ## Division -class frame_float_div_by_zero(object): - goal_time = 0.2 - - 
diff --git a/asv_bench/benchmarks/binary_ops.py b/asv_bench/benchmarks/binary_ops.py
index d22d01f261b27..0ca21b929ea17 100644
--- a/asv_bench/benchmarks/binary_ops.py
+++ b/asv_bench/benchmarks/binary_ops.py
@@ -1,194 +1,83 @@
 from .pandas_vb_common import *
-import pandas.computation.expressions as expr
+try:
+    import pandas.core.computation.expressions as expr
+except ImportError:
+    import pandas.computation.expressions as expr

-class frame_add(object):
+class Ops(object):
     goal_time = 0.2

-    def setup(self):
+    params = [[True, False], ['default', 1]]
+    param_names = ['use_numexpr', 'threads']
+
+    def setup(self, use_numexpr, threads):
         self.df = DataFrame(np.random.randn(20000, 100))
         self.df2 = DataFrame(np.random.randn(20000, 100))

-    def time_frame_add(self):
-        (self.df + self.df2)
+        if threads != 'default':
+            expr.set_numexpr_threads(threads)
+        if not use_numexpr:
+            expr.set_use_numexpr(False)

-class frame_add_no_ne(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(20000, 100))
-        self.df2 = DataFrame(np.random.randn(20000, 100))
-        expr.set_use_numexpr(False)
-
-    def time_frame_add_no_ne(self):
+    def time_frame_add(self, use_numexpr, threads):
         (self.df + self.df2)

-    def teardown(self):
-        expr.set_use_numexpr(True)
+    def time_frame_mult(self, use_numexpr, threads):
+        (self.df * self.df2)

+    def time_frame_multi_and(self, use_numexpr, threads):
+        self.df[((self.df > 0) & (self.df2 > 0))]

-class frame_add_st(object):
-    goal_time = 0.2
+    def time_frame_comparison(self, use_numexpr, threads):
+        (self.df > self.df2)

-    def setup(self):
-        self.df = DataFrame(np.random.randn(20000, 100))
-        self.df2 = DataFrame(np.random.randn(20000, 100))
-        expr.set_numexpr_threads(1)
-
-    def time_frame_add_st(self):
-        (self.df + self.df2)
-
-    def teardown(self):
+    def teardown(self, use_numexpr, threads):
+        expr.set_use_numexpr(True)
         expr.set_numexpr_threads()


-class frame_float_div(object):
+class Ops2(object):
     goal_time = 0.2

     def setup(self):
         self.df = DataFrame(np.random.randn(1000, 1000))
         self.df2 = DataFrame(np.random.randn(1000, 1000))

-    def time_frame_float_div(self):
-        (self.df // self.df2)
+        self.df_int = DataFrame(
+            np.random.random_integers(np.iinfo(np.int16).min,
+                                      np.iinfo(np.int16).max,
+                                      size=(1000, 1000)))
+        self.df2_int = DataFrame(
+            np.random.random_integers(np.iinfo(np.int16).min,
+                                      np.iinfo(np.int16).max,
+                                      size=(1000, 1000)))

+    ## Division

-class frame_float_div_by_zero(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(1000, 1000))
+    def time_frame_float_div(self):
+        (self.df / self.df2)

     def time_frame_float_div_by_zero(self):
         (self.df / 0)

-
-class frame_float_floor_by_zero(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(1000, 1000))
-
     def time_frame_float_floor_by_zero(self):
         (self.df // 0)

-
-class frame_float_mod(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(1000, 1000))
-        self.df2 = DataFrame(np.random.randn(1000, 1000))
-
-    def time_frame_float_mod(self):
-        (self.df / self.df2)
-
-
-class frame_int_div_by_zero(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.random_integers(np.iinfo(np.int16).min, np.iinfo(np.int16).max, size=(1000, 1000)))
-
     def time_frame_int_div_by_zero(self):
-        (self.df / 0)
-
+        (self.df_int / 0)

-class frame_int_mod(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.random_integers(np.iinfo(np.int16).min, np.iinfo(np.int16).max, size=(1000, 1000)))
-        self.df2 = DataFrame(np.random.random_integers(np.iinfo(np.int16).min, np.iinfo(np.int16).max, size=(1000, 1000)))
+    ## Modulo

     def time_frame_int_mod(self):
-        (self.df / self.df2)
+        (self.df_int % self.df2_int)

-
-class frame_mult(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(20000, 100))
-        self.df2 = DataFrame(np.random.randn(20000, 100))
-
-    def time_frame_mult(self):
-        (self.df * self.df2)
-
-
-class frame_mult_no_ne(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(20000, 100))
-        self.df2 = DataFrame(np.random.randn(20000, 100))
-        expr.set_use_numexpr(False)
-
-    def time_frame_mult_no_ne(self):
-        (self.df * self.df2)
-
-    def teardown(self):
-        expr.set_use_numexpr(True)
-
-
-class frame_mult_st(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(20000, 100))
-        self.df2 = DataFrame(np.random.randn(20000, 100))
-        expr.set_numexpr_threads(1)
-
-    def time_frame_mult_st(self):
-        (self.df * self.df2)
-
-    def teardown(self):
-        expr.set_numexpr_threads()
-
-
-class frame_multi_and(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(20000, 100))
-        self.df2 = DataFrame(np.random.randn(20000, 100))
-
-    def time_frame_multi_and(self):
-        self.df[((self.df > 0) & (self.df2 > 0))]
-
-
-class frame_multi_and_no_ne(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(20000, 100))
-        self.df2 = DataFrame(np.random.randn(20000, 100))
-        expr.set_use_numexpr(False)
-
-    def time_frame_multi_and_no_ne(self):
-        self.df[((self.df > 0) & (self.df2 > 0))]
-
-    def teardown(self):
-        expr.set_use_numexpr(True)
-
-
-class frame_multi_and_st(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(20000, 100))
-        self.df2 = DataFrame(np.random.randn(20000, 100))
-        expr.set_numexpr_threads(1)
-
-    def time_frame_multi_and_st(self):
-        self.df[((self.df > 0) & (self.df2 > 0))]
-
-    def teardown(self):
-        expr.set_numexpr_threads()
+    def time_frame_float_mod(self):
+        (self.df % self.df2)


-class series_timestamp_compare(object):
+class Timeseries(object):
     goal_time = 0.2

     def setup(self):
@@ -197,65 +86,28 @@ def setup(self):
         self.s = Series(date_range('20010101', periods=self.N, freq='T'))
         self.ts = self.s[self.halfway]

+        self.s2 = Series(date_range('20010101', periods=self.N, freq='s'))
+
     def time_series_timestamp_compare(self):
         (self.s <= self.ts)

-
-class timestamp_ops_diff1(object):
-    goal_time = 0.2
-    N = 1000000
-
-    def setup(self):
-        self.s = self.create()
-
-    def create(self):
-        return Series(date_range('20010101', periods=self.N, freq='s'))
+    def time_timestamp_series_compare(self):
+        (self.ts >= self.s)

     def time_timestamp_ops_diff1(self):
-        self.s.diff()
-
-class timestamp_tz_ops_diff1(timestamp_ops_diff1):
-    N = 10000
-
-    def create(self):
-        return Series(date_range('20010101', periods=self.N, freq='s', tz='US/Eastern'))
-
-class timestamp_ops_diff2(object):
-    goal_time = 0.2
-    N = 1000000
-
-    def setup(self):
-        self.s = self.create()
-
-    def create(self):
-        return Series(date_range('20010101', periods=self.N, freq='s'))
+        self.s2.diff()

     def time_timestamp_ops_diff2(self):
         (self.s - self.s.shift())

-class timestamp_tz_ops_diff2(timestamp_ops_diff2):
-    N = 10000
-    def create(self):
-        return Series(date_range('20010101', periods=self.N, freq='s', tz='US/Eastern'))
-class timestamp_series_compare(object):
-    goal_time = 0.2
-    N = 1000000
+
+
+class TimeseriesTZ(Timeseries):

     def setup(self):
+        self.N = 1000000
         self.halfway = ((self.N // 2) - 1)
-        self.s = self.create()
+        self.s = Series(date_range('20010101', periods=self.N, freq='T', tz='US/Eastern'))
         self.ts = self.s[self.halfway]

-    def create(self):
-        return Series(date_range('20010101', periods=self.N, freq='T'))
-
-    def time_timestamp_series_compare(self):
-        (self.ts >= self.s)
-
-class timestamp_tz_series_compare(timestamp_series_compare):
-    N = 10000
-
-    def create(self):
-        return Series(date_range('20010101', periods=self.N, freq='T', tz='US/Eastern'))
+        self.s2 = Series(date_range('20010101', periods=self.N, freq='s', tz='US/Eastern'))
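The `Ops` rewrite above collapses six near-identical classes into one asv-parameterized benchmark: asv runs the full cross-product of the `params` lists (here 2 x 2 = 4 variants) and passes the current combination to `setup`, every `time_*` method, and `teardown`, labelling results with `param_names`. A minimal sketch of that convention, using a hypothetical `Suite` class:

```python
import numpy as np


class Suite(object):
    # asv benchmarks every element of the cross-product of these lists;
    # param_names supplies the labels shown in the results table.
    params = [[10, 1000], ['float64', 'int64']]
    param_names = ['n', 'dtype']

    def setup(self, n, dtype):
        # Called once per parameter combination, before its timings.
        self.arr = np.ones(n, dtype=dtype)

    def time_sum(self, n, dtype):
        self.arr.sum()
```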
diff --git a/asv_bench/benchmarks/categoricals.py b/asv_bench/benchmarks/categoricals.py
index bf1e1b3f40ab0..6432ccfb19efe 100644
--- a/asv_bench/benchmarks/categoricals.py
+++ b/asv_bench/benchmarks/categoricals.py
@@ -1,34 +1,54 @@
 from .pandas_vb_common import *
 try:
-    from pandas.types.concat import union_categoricals
+    from pandas.api.types import union_categoricals
 except ImportError:
-    pass
-import string
+    try:
+        from pandas.types.concat import union_categoricals
+    except ImportError:
+        pass

-class concat_categorical(object):
+class Categoricals(object):
     goal_time = 0.2

     def setup(self):
-        self.s = pd.Series((list('aabbcd') * 1000000)).astype('category')
+        N = 100000
+        self.s = pd.Series((list('aabbcd') * N)).astype('category')

-    def time_concat_categorical(self):
-        concat([self.s, self.s])
+        self.a = pd.Categorical((list('aabbcd') * N))
+        self.b = pd.Categorical((list('bbcdjk') * N))

+        self.categories = list('abcde')
+        self.cat_idx = Index(self.categories)
+        self.values = np.tile(self.categories, N)
+        self.codes = np.tile(range(len(self.categories)), N)

-class union_categorical(object):
-    goal_time = 0.2
+        self.datetimes = pd.Series(pd.date_range(
+            '1995-01-01 00:00:00', periods=10000, freq='s'))

-    def setup(self):
-        self.a = pd.Categorical((list('aabbcd') * 1000000))
-        self.b = pd.Categorical((list('bbcdjk') * 1000000))
+    def time_concat(self):
+        concat([self.s, self.s])

-    def time_union_categorical(self):
+    def time_union(self):
         union_categoricals([self.a, self.b])

+    def time_constructor_regular(self):
+        Categorical(self.values, self.categories)
+
+    def time_constructor_fastpath(self):
+        Categorical(self.codes, self.cat_idx, fastpath=True)
+
+    def time_constructor_datetimes(self):
+        Categorical(self.datetimes)
+
+    def time_constructor_datetimes_with_nat(self):
+        t = self.datetimes
+        t.iloc[-1] = pd.NaT
+        Categorical(t)

-class categorical_value_counts(object):
-    goal_time = 1
+
+class Categoricals2(object):
+    goal_time = 0.2

     def setup(self):
         n = 500000
@@ -36,56 +56,47 @@ def setup(self):
         arr = ['s%04d' % i for i in np.random.randint(0, n // 10, size=n)]
         self.ts = Series(arr).astype('category')

+        self.sel = self.ts.loc[[0]]
+
     def time_value_counts(self):
         self.ts.value_counts(dropna=False)

     def time_value_counts_dropna(self):
         self.ts.value_counts(dropna=True)

+    def time_rendering(self):
+        str(self.sel)
+

-class categorical_constructor(object):
+class Categoricals3(object):
     goal_time = 0.2

     def setup(self):
-        n = 5
-        N = 1e6
-        self.categories = list(string.ascii_letters[:n])
-        self.cat_idx = Index(self.categories)
-        self.values = np.tile(self.categories, N)
-        self.codes = np.tile(range(n), N)
+        N = 100000
+        ncats = 100

-    def time_regular_constructor(self):
-        Categorical(self.values, self.categories)
+        self.s1 = Series(np.array(tm.makeCategoricalIndex(N, ncats)))
+        self.s1_cat = self.s1.astype('category')
+        self.s1_cat_ordered = self.s1.astype('category', ordered=True)

-    def time_fastpath(self):
-        Categorical(self.codes, self.cat_idx, fastpath=True)
+        self.s2 = Series(np.random.randint(0, ncats, size=N))
+        self.s2_cat = self.s2.astype('category')
+        self.s2_cat_ordered = self.s2.astype('category', ordered=True)

+    def time_rank_string(self):
+        self.s1.rank()

-class categorical_constructor_with_datetimes(object):
-    goal_time = 0.2
+    def time_rank_string_cat(self):
+        self.s1_cat.rank()

-    def setup(self):
-        self.datetimes = pd.Series(pd.date_range(
-            '1995-01-01 00:00:00', periods=10000, freq='s'))
+    def time_rank_string_cat_ordered(self):
+        self.s1_cat_ordered.rank()

-    def time_datetimes(self):
-        Categorical(self.datetimes)
-
-    def time_datetimes_with_nat(self):
-        t = self.datetimes
-        t.iloc[-1] = pd.NaT
-        Categorical(t)
+    def time_rank_int(self):
+        self.s2.rank()

+    def time_rank_int_cat(self):
+        self.s2_cat.rank()

-class categorical_rendering(object):
-    goal_time = 3e-3
-
-    def setup(self):
-        n = 1000
-        items = [str(i) for i in range(n)]
-        s = pd.Series(items, dtype='category')
-        df = pd.DataFrame({'C': s, 'data': np.random.randn(n)})
-        self.data = df[df.C == '20']
-
-    def time_rendering(self):
-        str(self.data.C)
+    def time_rank_int_cat_ordered(self):
+        self.s2_cat_ordered.rank()
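`union_categoricals` is public under `pandas.api.types` from pandas 0.20 onwards and lived in `pandas.types.concat` before that, hence the nested import fallback above. It concatenates categoricals while merging their category sets instead of densifying to object dtype, which is what `time_union` measures. A quick illustration:

```python
import pandas as pd
from pandas.api.types import union_categoricals  # pandas >= 0.20

a = pd.Categorical(['a', 'b'])
b = pd.Categorical(['b', 'c'])

# Result stays categorical, with the category sets merged: ['a', 'b', 'c']
combined = union_categoricals([a, b])
print(combined)
```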
diff --git a/asv_bench/benchmarks/ctors.py b/asv_bench/benchmarks/ctors.py
index f68cf9399c546..b5694a3a21502 100644
--- a/asv_bench/benchmarks/ctors.py
+++ b/asv_bench/benchmarks/ctors.py
@@ -1,52 +1,30 @@
 from .pandas_vb_common import *

-class frame_constructor_ndarray(object):
+class Constructors(object):
     goal_time = 0.2

     def setup(self):
         self.arr = np.random.randn(100, 100)
+        self.arr_str = np.array(['foo', 'bar', 'baz'], dtype=object)

-    def time_frame_constructor_ndarray(self):
-        DataFrame(self.arr)
-
-
-class ctor_index_array_string(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.data = np.array(['foo', 'bar', 'baz'], dtype=object)
-
-    def time_ctor_index_array_string(self):
-        Index(self.data)
-
-
-class series_constructor_ndarray(object):
-    goal_time = 0.2
-
-    def setup(self):
         self.data = np.random.randn(100)
         self.index = Index(np.arange(100))

-    def time_series_constructor_ndarray(self):
-        Series(self.data, index=self.index)
+        self.s = Series(([Timestamp('20110101'), Timestamp('20120101'),
+                          Timestamp('20130101')] * 1000))

+    def time_frame_from_ndarray(self):
+        DataFrame(self.arr)

-class dtindex_from_series_ctor(object):
-    goal_time = 0.2
+    def time_series_from_ndarray(self):
+        pd.Series(self.data, index=self.index)

-    def setup(self):
-        self.s = Series(([Timestamp('20110101'), Timestamp('20120101'), Timestamp('20130101')] * 1000))
+    def time_index_from_array_string(self):
+        Index(self.arr_str)

-    def time_dtindex_from_series_ctor(self):
+    def time_dtindex_from_series(self):
         DatetimeIndex(self.s)

-
-class index_from_series_ctor(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(([Timestamp('20110101'), Timestamp('20120101'), Timestamp('20130101')] * 1000))
-
-    def time_index_from_series_ctor(self):
+    def time_dtindex_from_series2(self):
         Index(self.s)
diff --git a/asv_bench/benchmarks/eval.py b/asv_bench/benchmarks/eval.py
index d9978e0cc4595..6f33590ee9e33 100644
--- a/asv_bench/benchmarks/eval.py
+++ b/asv_bench/benchmarks/eval.py
@@ -1,9 +1,12 @@
 from .pandas_vb_common import *
 import pandas as pd
-import pandas.computation.expressions as expr
+try:
+    import pandas.core.computation.expressions as expr
+except ImportError:
+    import pandas.computation.expressions as expr

-class eval_frame(object):
+class Eval(object):
     goal_time = 0.2

     params = [['numexpr', 'python'], [1, 'all']]
@@ -34,8 +37,11 @@ def time_mult(self, engine, threads):
         df, df2, df3, df4 = self.df, self.df2, self.df3, self.df4
         pd.eval('df * df2 * df3 * df4', engine=engine)

+    def teardown(self, engine, threads):
+        expr.set_numexpr_threads()

-class query_datetime_index(object):
+
+class Query(object):
     goal_time = 0.2

     def setup(self):
@@ -45,41 +51,19 @@ def setup(self):
         self.s = Series(self.index)
         self.ts = self.s.iloc[self.halfway]
         self.df = DataFrame({'a': np.random.randn(self.N), }, index=self.index)
+        self.df2 = DataFrame({'dates': self.s.values,})
+
+        self.df3 = DataFrame({'a': np.random.randn(self.N),})
+        self.min_val = self.df3['a'].min()
+        self.max_val = self.df3['a'].max()

     def time_query_datetime_index(self):
         ts = self.ts
         self.df.query('index < @ts')

-
-class query_datetime_series(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 1000000
-        self.halfway = ((self.N // 2) - 1)
-        self.index = date_range('20010101', periods=self.N, freq='T')
-        self.s = Series(self.index)
-        self.ts = self.s.iloc[self.halfway]
-        self.df = DataFrame({'dates': self.s.values, })
-
     def time_query_datetime_series(self):
         ts = self.ts
-        self.df.query('dates < @ts')
-
-
-class query_with_boolean_selection(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 1000000
-        self.halfway = ((self.N // 2) - 1)
-        self.index = date_range('20010101', periods=self.N, freq='T')
-        self.s = Series(self.index)
-        self.ts = self.s.iloc[self.halfway]
-        self.N = 1000000
-        self.df = DataFrame({'a': np.random.randn(self.N), })
-        self.min_val = self.df['a'].min()
-        self.max_val = self.df['a'].max()
+        self.df2.query('dates < @ts')

     def time_query_with_boolean_selection(self):
         min_val, max_val = self.min_val, self.max_val
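The `@name` syntax in `query`/`eval` resolves `name` from the enclosing Python scope rather than from a column, which is why each benchmark rebinds `ts = self.ts` to a local before querying. A short illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': range(10)})
cutoff = 5

# '@cutoff' refers to the local Python variable, not a column name.
subset = df.query('a < @cutoff')
assert len(subset) == 5
```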
diff --git a/asv_bench/benchmarks/frame_ctor.py b/asv_bench/benchmarks/frame_ctor.py
index 6f40611e68531..dec4fcba0eb5e 100644
--- a/asv_bench/benchmarks/frame_ctor.py
+++ b/asv_bench/benchmarks/frame_ctor.py
@@ -5,1617 +5,10 @@
 from pandas.core.datetools import *

-class frame_ctor_dtindex_BDayx1(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.idx = self.get_index_for_offset(BDay(1, **{}))
-        self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx)
-        self.d = dict([(col, self.df[col]) for col in self.df.columns])
-
-    def time_frame_ctor_dtindex_BDayx1(self):
-        DataFrame(self.d)
-
-    def get_period_count(self, start_date, off):
-        self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days
-        if (self.ten_offsets_in_days == 0):
-            return 1000
-        else:
-            return min((9 * 
((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BDayx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BDay(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BDayx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BMonthBeginx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BMonthBegin(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BMonthBeginx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BMonthBeginx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BMonthBegin(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BMonthBeginx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BMonthEndx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BMonthEnd(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BMonthEndx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, 
self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BMonthEndx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BMonthEnd(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BMonthEndx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BQuarterBeginx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BQuarterBegin(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BQuarterBeginx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BQuarterBeginx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BQuarterBegin(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BQuarterBeginx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BQuarterEndx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BQuarterEnd(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BQuarterEndx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BQuarterEndx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BQuarterEnd(2, **{})) - 
self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BQuarterEndx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BYearBeginx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BYearBegin(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BYearBeginx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BYearBeginx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BYearBegin(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BYearBeginx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BYearEndx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BYearEnd(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BYearEndx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BYearEndx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BYearEnd(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BYearEndx2(self): - DataFrame(self.d) - - def get_period_count(self, 
start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BusinessDayx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BusinessDay(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BusinessDayx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BusinessDayx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BusinessDay(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BusinessDayx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BusinessHourx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BusinessHour(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BusinessHourx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_BusinessHourx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(BusinessHour(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_BusinessHourx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // 
self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_CBMonthBeginx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(CBMonthBegin(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_CBMonthBeginx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_CBMonthBeginx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(CBMonthBegin(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_CBMonthBeginx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_CBMonthEndx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(CBMonthEnd(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_CBMonthEndx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_CBMonthEndx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(CBMonthEnd(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_CBMonthEndx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), 
freq=off) - - -class frame_ctor_dtindex_CDayx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(CDay(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_CDayx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_CDayx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(CDay(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_CDayx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_CustomBusinessDayx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(CustomBusinessDay(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_CustomBusinessDayx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_CustomBusinessDayx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(CustomBusinessDay(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_CustomBusinessDayx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_DateOffsetx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(DateOffset(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), 
index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_DateOffsetx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_DateOffsetx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(DateOffset(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_DateOffsetx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Dayx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Day(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Dayx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Dayx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Day(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Dayx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Easterx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Easter(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Easterx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if 
(self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Easterx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Easter(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Easterx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_FY5253Quarterx1__variation_last(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(FY5253Quarter(1, **{'startingMonth': 1, 'qtr_with_extra_week': 1, 'weekday': 1, 'variation': 'last', })) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_FY5253Quarterx1__variation_last(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_FY5253Quarterx1__variation_nearest(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(FY5253Quarter(1, **{'startingMonth': 1, 'qtr_with_extra_week': 1, 'weekday': 1, 'variation': 'nearest', })) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_FY5253Quarterx1__variation_nearest(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_FY5253Quarterx2__variation_last(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(FY5253Quarter(2, **{'startingMonth': 1, 'qtr_with_extra_week': 1, 'weekday': 1, 'variation': 'last', })) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_FY5253Quarterx2__variation_last(self): - 
DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_FY5253Quarterx2__variation_nearest(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(FY5253Quarter(2, **{'startingMonth': 1, 'qtr_with_extra_week': 1, 'weekday': 1, 'variation': 'nearest', })) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_FY5253Quarterx2__variation_nearest(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_FY5253x1__variation_last(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(FY5253(1, **{'startingMonth': 1, 'weekday': 1, 'variation': 'last', })) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_FY5253x1__variation_last(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_FY5253x1__variation_nearest(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(FY5253(1, **{'startingMonth': 1, 'weekday': 1, 'variation': 'nearest', })) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_FY5253x1__variation_nearest(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_FY5253x2__variation_last(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(FY5253(2, **{'startingMonth': 1, 'weekday': 1, 'variation': 'last', })) - self.df = DataFrame(np.random.randn(len(self.idx), 10), 
index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_FY5253x2__variation_last(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_FY5253x2__variation_nearest(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(FY5253(2, **{'startingMonth': 1, 'weekday': 1, 'variation': 'nearest', })) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_FY5253x2__variation_nearest(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Hourx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Hour(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Hourx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Hourx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Hour(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Hourx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_LastWeekOfMonthx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(LastWeekOfMonth(1, **{'week': 1, 'weekday': 1, })) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def 
time_frame_ctor_dtindex_LastWeekOfMonthx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_LastWeekOfMonthx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(LastWeekOfMonth(2, **{'week': 1, 'weekday': 1, })) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_LastWeekOfMonthx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Microx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Micro(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Microx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Microx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Micro(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Microx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Millix1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Milli(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Millix1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - 
return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Millix2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Milli(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Millix2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Minutex1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Minute(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Minutex1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Minutex2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Minute(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Minutex2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_MonthBeginx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(MonthBegin(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_MonthBeginx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, 
self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_MonthBeginx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(MonthBegin(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_MonthBeginx2(self): - DataFrame(self.d) +#---------------------------------------------------------------------- +# Creation from nested dict - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_MonthEndx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(MonthEnd(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_MonthEndx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_MonthEndx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(MonthEnd(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_MonthEndx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Nanox1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Nano(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Nanox1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Nanox2(object): - goal_time = 0.2 - - def setup(self): - self.idx = 
self.get_index_for_offset(Nano(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Nanox2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_QuarterBeginx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(QuarterBegin(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_QuarterBeginx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_QuarterBeginx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(QuarterBegin(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_QuarterBeginx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_QuarterEndx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(QuarterEnd(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_QuarterEndx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_QuarterEndx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(QuarterEnd(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def 
time_frame_ctor_dtindex_QuarterEndx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Secondx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Second(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Secondx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Secondx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Second(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Secondx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_WeekOfMonthx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(WeekOfMonth(1, **{'week': 1, 'weekday': 1, })) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_WeekOfMonthx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_WeekOfMonthx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(WeekOfMonth(2, **{'week': 1, 'weekday': 1, })) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_WeekOfMonthx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if 
(self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Weekx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Week(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Weekx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_Weekx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(Week(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_Weekx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_YearBeginx1(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(YearBegin(1, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_YearBeginx1(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off) - - -class frame_ctor_dtindex_YearBeginx2(object): - goal_time = 0.2 - - def setup(self): - self.idx = self.get_index_for_offset(YearBegin(2, **{})) - self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx) - self.d = dict([(col, self.df[col]) for col in self.df.columns]) - - def time_frame_ctor_dtindex_YearBeginx2(self): - DataFrame(self.d) - - def get_period_count(self, start_date, off): - self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days - if (self.ten_offsets_in_days == 0): - return 1000 - else: - return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000) - - def get_index_for_offset(self, off): - self.start_date = Timestamp('1/1/1900') - return date_range(self.start_date, 
periods=min(1000, self.get_period_count(self.start_date, off)), freq=off)
-
-
-class frame_ctor_dtindex_YearEndx1(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.idx = self.get_index_for_offset(YearEnd(1, **{}))
-        self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx)
-        self.d = dict([(col, self.df[col]) for col in self.df.columns])
-
-    def time_frame_ctor_dtindex_YearEndx1(self):
-        DataFrame(self.d)
-
-    def get_period_count(self, start_date, off):
-        self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days
-        if (self.ten_offsets_in_days == 0):
-            return 1000
-        else:
-            return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000)
-
-    def get_index_for_offset(self, off):
-        self.start_date = Timestamp('1/1/1900')
-        return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off)
-
-
-class frame_ctor_dtindex_YearEndx2(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.idx = self.get_index_for_offset(YearEnd(2, **{}))
-        self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx)
-        self.d = dict([(col, self.df[col]) for col in self.df.columns])
-
-    def time_frame_ctor_dtindex_YearEndx2(self):
-        DataFrame(self.d)
-
-    def get_period_count(self, start_date, off):
-        self.ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days
-        if (self.ten_offsets_in_days == 0):
-            return 1000
-        else:
-            return min((9 * ((Timestamp.max - start_date).days // self.ten_offsets_in_days)), 1000)
-
-    def get_index_for_offset(self, off):
-        self.start_date = Timestamp('1/1/1900')
-        return date_range(self.start_date, periods=min(1000, self.get_period_count(self.start_date, off)), freq=off)
-
-
-class frame_ctor_list_of_dict(object):
+class FromDicts(object):
     goal_time = 0.2
 
     def setup(self):
@@ -1627,42 +20,29 @@ def setup(self):
         self.data = self.frame.to_dict()
         except:
             self.data = self.frame.toDict()
-        self.some_dict = self.data.values()[0]
+        self.some_dict = list(self.data.values())[0]
         self.dict_list = [dict(zip(self.columns, row)) for row in self.frame.values]
 
+        self.data2 = dict(
+            ((i, dict(((j, float(j)) for j in range(100)))) for i in
+             range(2000)))
+
     def time_frame_ctor_list_of_dict(self):
         DataFrame(self.dict_list)
 
-
-class frame_ctor_nested_dict(object):
-    goal_time = 0.2
-
-    def setup(self):
-        (N, K) = (5000, 50)
-        self.index = tm.makeStringIndex(N)
-        self.columns = tm.makeStringIndex(K)
-        self.frame = DataFrame(np.random.randn(N, K), index=self.index, columns=self.columns)
-        try:
-            self.data = self.frame.to_dict()
-        except:
-            self.data = self.frame.toDict()
-        self.some_dict = self.data.values()[0]
-        self.dict_list = [dict(zip(self.columns, row)) for row in self.frame.values]
-
     def time_frame_ctor_nested_dict(self):
         DataFrame(self.data)
 
-
-class frame_ctor_nested_dict_int64(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.data = dict(((i, dict(((j, float(j)) for j in range(100)))) for i in xrange(2000)))
+    def time_series_ctor_from_dict(self):
+        Series(self.some_dict)
 
     def time_frame_ctor_nested_dict_int64(self):
-        DataFrame(self.data)
+        # nested dict, integer indexes, regression described in #621;
+        # this must time self.data2 (the int-keyed nested dict built in
+        # setup), not self.data, which time_frame_ctor_nested_dict covers
+        DataFrame(self.data2)
 
+# from a mi-series
 
 class frame_from_series(object):
     goal_time = 0.2
 
@@ -1670,10 +50,13 @@ def setup(self):
         self.mi = MultiIndex.from_tuples([(x, y) for x in range(100) for y in range(100)])
         self.s = Series(randn(10000), index=self.mi)
 
-    def time_frame_from_series(self):
+    def time_frame_from_mi_series(self):
         DataFrame(self.s)
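The consolidated `FromDicts` class above drives four constructor paths off a single `setup`. As a rough standalone sketch of what the nested-dict case (the #621 regression) actually exercises — the names below are illustrative, not part of the patch:

    import numpy as np
    from pandas import DataFrame

    # outer dict keys become columns, inner dict keys become the row index
    nested = {i: {j: float(j) for j in range(100)} for i in range(2000)}
    df = DataFrame(nested)          # the operation being timed
    assert df.shape == (100, 2000)  # 100 rows (inner keys), 2000 columns
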
+#----------------------------------------------------------------------
+# get_numeric_data
+
 class frame_get_numeric_data(object):
     goal_time = 0.2
 
@@ -1687,20 +70,69 @@ def time_frame_get_numeric_data(self):
         self.df._get_numeric_data()
 
 
-class series_ctor_from_dict(object):
-    goal_time = 0.2
+# ----------------------------------------------------------------------
+# From dict with DatetimeIndex with all offsets
 
-    def setup(self):
-        (N, K) = (5000, 50)
-        self.index = tm.makeStringIndex(N)
-        self.columns = tm.makeStringIndex(K)
-        self.frame = DataFrame(np.random.randn(N, K), index=self.index, columns=self.columns)
-        try:
-            self.data = self.frame.to_dict()
-        except:
-            self.data = self.frame.toDict()
-        self.some_dict = self.data.values()[0]
-        self.dict_list = [dict(zip(self.columns, row)) for row in self.frame.values]
+# dynamically generate benchmarks for every offset
+#
+# get_period_count & get_index_for_offset are there because blindly taking
+# each offset times 1000 can easily go out of Timestamp bounds and raise
+# errors.
 
-    def time_series_ctor_from_dict(self):
-        Series(self.some_dict)
+
+def get_period_count(start_date, off):
+    ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days
+    if (ten_offsets_in_days == 0):
+        return 1000
+    else:
+        return min((9 * ((Timestamp.max - start_date).days // ten_offsets_in_days)), 1000)
+
+
+def get_index_for_offset(off):
+    start_date = Timestamp('1/1/1900')
+    return date_range(start_date, periods=min(1000, get_period_count(
+        start_date, off)), freq=off)
+
+
+# copy __all__ so the extra cases below do not mutate the offsets module
+all_offsets = list(offsets.__all__)
+# extra cases
+for off in ['FY5253', 'FY5253Quarter']:
+    all_offsets.pop(all_offsets.index(off))
+    all_offsets.extend([off + '_1', off + '_2'])
+
+
+class FrameConstructorDTIndexFromOffsets(object):
+
+    params = [all_offsets, [1, 2]]
+    param_names = ['offset', 'n_steps']
+
+    offset_kwargs = {'WeekOfMonth': {'weekday': 1, 'week': 1},
+                     'LastWeekOfMonth': {'weekday': 1, 'week': 1},
+                     'FY5253': {'startingMonth': 1, 'weekday': 1},
+                     'FY5253Quarter': {'qtr_with_extra_week': 1,
+                                       'startingMonth': 1,
+                                       'weekday': 1}}
+
+    offset_extra_cases = {'FY5253': {'variation': ['nearest', 'last']},
+                          'FY5253Quarter': {'variation': ['nearest', 'last']}}
+
+    def setup(self, offset, n_steps):
+        # names like 'FY5253_1' encode an extra-case index in the last char
+        extra = False
+        if offset.endswith("_", None, -1):
+            extra = int(offset[-1])
+            offset = offset[:-2]
+
+        kwargs = {}
+        if offset in self.offset_kwargs:
+            # copy so the per-case updates below do not leak into the
+            # shared class-level dict
+            kwargs = self.offset_kwargs[offset].copy()
+
+        if extra:
+            extras = self.offset_extra_cases[offset]
+            for extra_arg in extras:
+                kwargs[extra_arg] = extras[extra_arg][extra - 1]
+
+        offset = getattr(offsets, offset)
+        self.idx = get_index_for_offset(offset(n_steps, **kwargs))
+        self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx)
+        self.d = dict([(col, self.df[col]) for col in self.df.columns])
+
+    def time_frame_ctor(self, offset, n_steps):
+        DataFrame(self.d)
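For readers unfamiliar with asv's `params` mechanism: asv re-runs `setup` and each `time_*` method once per combination drawn from `params` (here, every offset name crossed with `n_steps` of 1 and 2), which is what lets this single class replace the dozens of hand-written per-offset classes deleted above. A worked trace of the `'_1'`/`'_2'` suffix handling in `setup`, assuming `offset == 'FY5253_2'` (illustrative only, restating the logic in the patch):

    offset = 'FY5253_2'
    extra = False
    if offset.endswith('_', None, -1):  # is the second-to-last char '_'?
        extra = int(offset[-1])         # -> 2
        offset = offset[:-2]            # -> 'FY5253'
    # later, extra - 1 == 1 selects 'last' from ['nearest', 'last']
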
diff --git a/asv_bench/benchmarks/frame_methods.py b/asv_bench/benchmarks/frame_methods.py
index df73a474b2683..af72ca1e9a6ab 100644
--- a/asv_bench/benchmarks/frame_methods.py
+++ b/asv_bench/benchmarks/frame_methods.py
@@ -2,444 +2,74 @@
 import string
 
 
-class frame_apply_axis_1(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(1000, 100))
-
-    def time_frame_apply_axis_1(self):
-        self.df.apply((lambda x: (x + 1)), axis=1)
-
-
-class frame_apply_lambda_mean(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(1000, 100))
-
-    def time_frame_apply_lambda_mean(self):
-        self.df.apply((lambda x: x.sum()))
-
-
-class frame_apply_np_mean(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(1000, 100))
-
-    def time_frame_apply_np_mean(self):
-        self.df.apply(np.mean)
-
-
-class frame_apply_pass_thru(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(1000, 100))
-
-    def time_frame_apply_pass_thru(self):
-        self.df.apply((lambda x: x))
-
-
-class frame_apply_ref_by_name(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(np.random.randn(1000, 3), columns=list('ABC'))
-
-    def time_frame_apply_ref_by_name(self):
-        self.df.apply((lambda x: (x['A'] + x['B'])), axis=1)
-
-
-class frame_apply_user_func(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.arange(1028.0))
-        self.df = DataFrame({i: self.s for i in range(1028)})
-
-    def time_frame_apply_user_func(self):
-        self.df.apply((lambda x: np.corrcoef(x, self.s)[(0, 1)]))
-
-
-class frame_assign_timeseries_index(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.idx = date_range('1/1/2000', periods=100000, freq='D')
-        self.df = DataFrame(randn(100000, 1), columns=['A'], index=self.idx)
-
-    def time_frame_assign_timeseries_index(self):
-        self.f(self.df)
-
-    def f(self, df):
-        self.x = self.df.copy()
-        self.x['date'] = self.x.index
-
-
-class frame_boolean_row_select(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(randn(10000, 100))
-        self.bool_arr = np.zeros(10000, dtype=bool)
-        self.bool_arr[:1000] = True
-
-    def time_frame_boolean_row_select(self):
-        self.df[self.bool_arr]
-
-
-class frame_count_level_axis0_mixed_dtypes_multi(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.data = np.random.randn(10000, 1000)
-        self.df = DataFrame(self.data)
-        self.df.ix[50:1000, 20:50] = np.nan
-        self.df.ix[2000:3000] = np.nan
-        self.df.ix[:, 60:70] = np.nan
-        self.df['foo'] = 'bar'
-        self.df.index = MultiIndex.from_tuples(self.df.index.map((lambda x: (x, x))))
-        self.df.columns = MultiIndex.from_tuples(self.df.columns.map((lambda x: (x, x))))
-
-    def time_frame_count_level_axis0_mixed_dtypes_multi(self):
-        self.df.count(axis=0, level=1)
-
-
-class frame_count_level_axis0_multi(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.data = np.random.randn(10000, 1000)
-        self.df = DataFrame(self.data)
-        self.df.ix[50:1000, 20:50] = np.nan
-        self.df.ix[2000:3000] = np.nan
-        self.df.ix[:, 60:70] = np.nan
-        self.df.index = MultiIndex.from_tuples(self.df.index.map((lambda x: (x, x))))
-        self.df.columns = MultiIndex.from_tuples(self.df.columns.map((lambda x: (x, x))))
-
-    def time_frame_count_level_axis0_multi(self):
-        self.df.count(axis=0, level=1)
-
-
-class frame_count_level_axis1_mixed_dtypes_multi(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.data = np.random.randn(10000, 1000)
-        self.df = DataFrame(self.data)
-        self.df.ix[50:1000, 20:50] = np.nan
-        self.df.ix[2000:3000] = np.nan
-        self.df.ix[:, 60:70] = np.nan
-        self.df['foo'] = 'bar'
-        self.df.index = MultiIndex.from_tuples(self.df.index.map((lambda x: (x, x))))
-        self.df.columns = MultiIndex.from_tuples(self.df.columns.map((lambda x: (x, x))))
-
-    def time_frame_count_level_axis1_mixed_dtypes_multi(self):
-        self.df.count(axis=1, level=1)
-
-
-class frame_count_level_axis1_multi(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.data = np.random.randn(10000, 1000)
-        self.df = DataFrame(self.data)
-        self.df.ix[50:1000, 20:50] = np.nan
-        self.df.ix[2000:3000] = np.nan
-        self.df.ix[:, 60:70] = np.nan
-        self.df.index = MultiIndex.from_tuples(self.df.index.map((lambda x: (x,
x)))) - self.df.columns = MultiIndex.from_tuples(self.df.columns.map((lambda x: (x, x)))) - - def time_frame_count_level_axis1_multi(self): - self.df.count(axis=1, level=1) - - -class frame_dropna_axis0_all(object): - goal_time = 0.2 - - def setup(self): - self.data = np.random.randn(10000, 1000) - self.df = DataFrame(self.data) - self.df.ix[50:1000, 20:50] = np.nan - self.df.ix[2000:3000] = np.nan - self.df.ix[:, 60:70] = np.nan - - def time_frame_dropna_axis0_all(self): - self.df.dropna(how='all', axis=0) - - -class frame_dropna_axis0_all_mixed_dtypes(object): - goal_time = 0.2 - - def setup(self): - self.data = np.random.randn(10000, 1000) - self.df = DataFrame(self.data) - self.df.ix[50:1000, 20:50] = np.nan - self.df.ix[2000:3000] = np.nan - self.df.ix[:, 60:70] = np.nan - self.df['foo'] = 'bar' - - def time_frame_dropna_axis0_all_mixed_dtypes(self): - self.df.dropna(how='all', axis=0) - - -class frame_dropna_axis0_any(object): - goal_time = 0.2 - - def setup(self): - self.data = np.random.randn(10000, 1000) - self.df = DataFrame(self.data) - self.df.ix[50:1000, 20:50] = np.nan - self.df.ix[2000:3000] = np.nan - self.df.ix[:, 60:70] = np.nan - - def time_frame_dropna_axis0_any(self): - self.df.dropna(how='any', axis=0) - - -class frame_dropna_axis0_any_mixed_dtypes(object): - goal_time = 0.2 - - def setup(self): - self.data = np.random.randn(10000, 1000) - self.df = DataFrame(self.data) - self.df.ix[50:1000, 20:50] = np.nan - self.df.ix[2000:3000] = np.nan - self.df.ix[:, 60:70] = np.nan - self.df['foo'] = 'bar' - - def time_frame_dropna_axis0_any_mixed_dtypes(self): - self.df.dropna(how='any', axis=0) - - -class frame_dropna_axis1_all(object): - goal_time = 0.2 - - def setup(self): - self.data = np.random.randn(10000, 1000) - self.df = DataFrame(self.data) - self.df.ix[50:1000, 20:50] = np.nan - self.df.ix[2000:3000] = np.nan - self.df.ix[:, 60:70] = np.nan - - def time_frame_dropna_axis1_all(self): - self.df.dropna(how='all', axis=1) - - -class frame_dropna_axis1_all_mixed_dtypes(object): - goal_time = 0.2 - - def setup(self): - self.data = np.random.randn(10000, 1000) - self.df = DataFrame(self.data) - self.df.ix[50:1000, 20:50] = np.nan - self.df.ix[2000:3000] = np.nan - self.df.ix[:, 60:70] = np.nan - self.df['foo'] = 'bar' - - def time_frame_dropna_axis1_all_mixed_dtypes(self): - self.df.dropna(how='all', axis=1) - - -class frame_dropna_axis1_any(object): - goal_time = 0.2 - - def setup(self): - self.data = np.random.randn(10000, 1000) - self.df = DataFrame(self.data) - self.df.ix[50:1000, 20:50] = np.nan - self.df.ix[2000:3000] = np.nan - self.df.ix[:, 60:70] = np.nan - - def time_frame_dropna_axis1_any(self): - self.df.dropna(how='any', axis=1) - - -class frame_dropna_axis1_any_mixed_dtypes(object): - goal_time = 0.2 - - def setup(self): - self.data = np.random.randn(10000, 1000) - self.df = DataFrame(self.data) - self.df.ix[50:1000, 20:50] = np.nan - self.df.ix[2000:3000] = np.nan - self.df.ix[:, 60:70] = np.nan - self.df['foo'] = 'bar' - - def time_frame_dropna_axis1_any_mixed_dtypes(self): - self.df.dropna(how='any', axis=1) - - -class frame_dtypes(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(np.random.randn(1000, 1000)) - - def time_frame_dtypes(self): - self.df.dtypes - - -class frame_duplicated(object): - goal_time = 0.2 - - def setup(self): - self.n = (1 << 20) - self.t = date_range('2015-01-01', freq='S', periods=(self.n // 64)) - self.xs = np.random.randn((self.n // 64)).round(2) - self.df = DataFrame({'a': np.random.randint(((-1) << 8), (1 << 
8), self.n), 'b': np.random.choice(self.t, self.n), 'c': np.random.choice(self.xs, self.n), }) - - def time_frame_duplicated(self): - self.df.duplicated() - -class frame_duplicated_wide(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(np.random.randn(1000, 100).astype(str)) - - def time_frame_duplicated_wide(self): - self.df.T.duplicated() - -class frame_fancy_lookup(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(np.random.randn(10000, 8), columns=list('abcdefgh')) - self.df['foo'] = 'bar' - self.row_labels = list(self.df.index[::10])[:900] - self.col_labels = (list(self.df.columns) * 100) - self.row_labels_all = np.array((list(self.df.index) * len(self.df.columns)), dtype='object') - self.col_labels_all = np.array((list(self.df.columns) * len(self.df.index)), dtype='object') - - def time_frame_fancy_lookup(self): - self.df.lookup(self.row_labels, self.col_labels) - - -class frame_fancy_lookup_all(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(np.random.randn(10000, 8), columns=list('abcdefgh')) - self.df['foo'] = 'bar' - self.row_labels = list(self.df.index[::10])[:900] - self.col_labels = (list(self.df.columns) * 100) - self.row_labels_all = np.array((list(self.df.index) * len(self.df.columns)), dtype='object') - self.col_labels_all = np.array((list(self.df.columns) * len(self.df.index)), dtype='object') - - def time_frame_fancy_lookup_all(self): - self.df.lookup(self.row_labels_all, self.col_labels_all) - - -class frame_fillna_inplace(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(randn(10000, 100)) - self.df.values[::2] = np.nan - - def time_frame_fillna_inplace(self): - self.df.fillna(0, inplace=True) - +#---------------------------------------------------------------------- +# lookup -class frame_float_equal(object): +class frame_fancy_lookup(object): goal_time = 0.2 def setup(self): - self.float_df = DataFrame(np.random.randn(1000, 1000)) - self.object_df = DataFrame(([(['foo'] * 1000)] * 1000)) - self.nonunique_cols = self.object_df.copy() - self.nonunique_cols.columns = (['A'] * len(self.nonunique_cols.columns)) - self.pairs = dict([(name, self.make_pair(frame)) for (name, frame) in (('float_df', self.float_df), ('object_df', self.object_df), ('nonunique_cols', self.nonunique_cols))]) - - def time_frame_float_equal(self): - self.test_equal('float_df') + self.df = DataFrame(np.random.randn(10000, 8), columns=list('abcdefgh')) + self.df['foo'] = 'bar' + self.row_labels = list(self.df.index[::10])[:900] + self.col_labels = (list(self.df.columns) * 100) + self.row_labels_all = np.array((list(self.df.index) * len(self.df.columns)), dtype='object') + self.col_labels_all = np.array((list(self.df.columns) * len(self.df.index)), dtype='object') - def make_pair(self, frame): - self.df = frame - self.df2 = self.df.copy() - self.df2.ix[((-1), (-1))] = np.nan - return (self.df, self.df2) + def time_frame_fancy_lookup(self): + self.df.lookup(self.row_labels, self.col_labels) - def test_equal(self, name): - (self.df, self.df2) = self.pairs[name] - return self.df.equals(self.df) + def time_frame_fancy_lookup_all(self): + self.df.lookup(self.row_labels_all, self.col_labels_all) - def test_unequal(self, name): - (self.df, self.df2) = self.pairs[name] - return self.df.equals(self.df2) +#---------------------------------------------------------------------- +# reindex -class frame_float_unequal(object): +class Reindex(object): goal_time = 0.2 def setup(self): - self.float_df = DataFrame(np.random.randn(1000, 
1000)) - self.object_df = DataFrame(([(['foo'] * 1000)] * 1000)) - self.nonunique_cols = self.object_df.copy() - self.nonunique_cols.columns = (['A'] * len(self.nonunique_cols.columns)) - self.pairs = dict([(name, self.make_pair(frame)) for (name, frame) in (('float_df', self.float_df), ('object_df', self.object_df), ('nonunique_cols', self.nonunique_cols))]) - - def time_frame_float_unequal(self): - self.test_unequal('float_df') - - def make_pair(self, frame): - self.df = frame - self.df2 = self.df.copy() - self.df2.ix[((-1), (-1))] = np.nan - return (self.df, self.df2) - - def test_equal(self, name): - (self.df, self.df2) = self.pairs[name] - return self.df.equals(self.df) - - def test_unequal(self, name): - (self.df, self.df2) = self.pairs[name] - return self.df.equals(self.df2) - - -class frame_from_records_generator(object): - goal_time = 0.2 - - def time_frame_from_records_generator(self): - self.df = DataFrame.from_records(self.get_data()) - - def get_data(self, n=100000): - return ((x, (x * 20), (x * 100)) for x in range(n)) - - -class frame_from_records_generator_nrows(object): - goal_time = 0.2 + self.df = DataFrame(randn(10000, 1000)) + self.idx = np.arange(4000, 7000) - def time_frame_from_records_generator_nrows(self): - self.df = DataFrame.from_records(self.get_data(), nrows=1000) + self.df2 = DataFrame( + dict([(c, {0: randint(0, 2, 1000).astype(np.bool_), + 1: randint(0, 1000, 1000).astype( + np.int16), + 2: randint(0, 1000, 1000).astype( + np.int32), + 3: randint(0, 1000, 1000).astype( + np.int64),}[randint(0, 4)]) for c in + range(1000)])) + + def time_reindex_axis0(self): + self.df.reindex(self.idx) - def get_data(self, n=100000): - return ((x, (x * 20), (x * 100)) for x in range(n)) + def time_reindex_axis1(self): + self.df.reindex(columns=self.idx) + def time_reindex_both_axes(self): + self.df.reindex(index=self.idx, columns=self.idx) -class frame_get_dtype_counts(object): - goal_time = 0.2 + def time_reindex_both_axes_ix(self): + self.df.ix[(self.idx, self.idx)] - def setup(self): - self.df = DataFrame(np.random.randn(10, 10000)) + def time_reindex_upcast(self): + self.df2.reindex(np.random.permutation(range(1200))) - def time_frame_get_dtype_counts(self): - self.df.get_dtype_counts() +#---------------------------------------------------------------------- +# iteritems (monitor no-copying behaviour) -class frame_getitem_single_column(object): +class Iteration(object): goal_time = 0.2 def setup(self): self.df = DataFrame(randn(10000, 1000)) - self.df2 = DataFrame(randn(3000, 1), columns=['A']) - self.df3 = DataFrame(randn(3000, 1)) - - def time_frame_getitem_single_column(self): - self.h() + self.df2 = DataFrame(np.random.randn(50000, 10)) + self.df3 = pd.DataFrame(np.random.randn(1000,5000), + columns=['C'+str(c) for c in range(5000)]) def f(self): if hasattr(self.df, '_item_cache'): @@ -451,290 +81,259 @@ def g(self): for (name, col) in self.df.iteritems(): pass - def h(self): - for i in range(10000): - self.df2['A'] - - def j(self): - for i in range(10000): - self.df3[0] - - -class frame_getitem_single_column2(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(randn(10000, 1000)) - self.df2 = DataFrame(randn(3000, 1), columns=['A']) - self.df3 = DataFrame(randn(3000, 1)) + def time_iteritems(self): + self.f() - def time_frame_getitem_single_column2(self): - self.j() + def time_iteritems_cached(self): + self.g() - def f(self): - if hasattr(self.df, '_item_cache'): - self.df._item_cache.clear() - for (name, col) in self.df.iteritems(): - pass + def 
time_iteritems_indexing(self): + df = self.df3 + for col in df: + df[col] - def g(self): - for (name, col) in self.df.iteritems(): + def time_itertuples(self): + for row in self.df2.itertuples(): pass - def h(self): - for i in range(10000): - self.df2['A'] - - def j(self): - for i in range(10000): - self.df3[0] +#---------------------------------------------------------------------- +# to_string, to_html, repr -class frame_html_repr_trunc_mi(object): +class Formatting(object): goal_time = 0.2 def setup(self): - self.nrows = 10000 - self.data = randn(self.nrows, 10) - self.idx = MultiIndex.from_arrays(np.tile(randn(3, (self.nrows / 100)), 100)) - self.df = DataFrame(self.data, index=self.idx) - - def time_frame_html_repr_trunc_mi(self): - self.df._repr_html_() - + self.df = DataFrame(randn(100, 10)) -class frame_html_repr_trunc_si(object): - goal_time = 0.2 + self.nrows = 500 + self.df2 = DataFrame(randn(self.nrows, 10)) + self.df2[0] = period_range('2000', '2010', self.nrows) + self.df2[1] = range(self.nrows) - def setup(self): self.nrows = 10000 self.data = randn(self.nrows, 10) + self.idx = MultiIndex.from_arrays(np.tile(randn(3, int(self.nrows / 100)), 100)) + self.df3 = DataFrame(self.data, index=self.idx) self.idx = randn(self.nrows) - self.df = DataFrame(self.data, index=self.idx) + self.df4 = DataFrame(self.data, index=self.idx) - def time_frame_html_repr_trunc_si(self): - self.df._repr_html_() + self.df_tall = pandas.DataFrame(np.random.randn(10000, 10)) + self.df_wide = pandas.DataFrame(np.random.randn(10, 10000)) -class frame_insert_100_columns_begin(object): - goal_time = 0.2 - - def setup(self): - self.N = 1000 - - def time_frame_insert_100_columns_begin(self): - self.f() - - def f(self, K=100): - self.df = DataFrame(index=range(self.N)) - self.new_col = np.random.randn(self.N) - for i in range(K): - self.df.insert(0, i, self.new_col) - + def time_to_string_floats(self): + self.df.to_string() -class frame_insert_500_columns_end(object): - goal_time = 0.2 + def time_to_html_mixed(self): + self.df2.to_html() - def setup(self): - self.N = 1000 + def time_html_repr_trunc_mi(self): + self.df3._repr_html_() - def time_frame_insert_500_columns_end(self): - self.f() + def time_html_repr_trunc_si(self): + self.df4._repr_html_() - def f(self, K=500): - self.df = DataFrame(index=range(self.N)) - self.new_col = np.random.randn(self.N) - for i in range(K): - self.df[i] = self.new_col + def time_repr_tall(self): + repr(self.df_tall) + def time_frame_repr_wide(self): + repr(self.df_wide) -class frame_interpolate(object): - goal_time = 0.2 - def setup(self): - self.df = DataFrame(randn(10000, 100)) - self.df.values[::2] = np.nan +#---------------------------------------------------------------------- +# nulls/masking - def time_frame_interpolate(self): - self.df.interpolate() +## masking -class frame_interpolate_some_good(object): +class frame_mask_bools(object): goal_time = 0.2 def setup(self): - self.df = DataFrame({'A': np.arange(0, 10000), 'B': np.random.randint(0, 100, 10000), 'C': randn(10000), 'D': randn(10000), }) - self.df.loc[1::5, 'A'] = np.nan - self.df.loc[1::5, 'C'] = np.nan - - def time_frame_interpolate_some_good(self): - self.df.interpolate() - + self.data = np.random.randn(1000, 500) + self.df = DataFrame(self.data) + self.df = self.df.where((self.df > 0)) + self.bools = (self.df > 0) + self.mask = isnull(self.df) -class frame_interpolate_some_good_infer(object): - goal_time = 0.2 + def time_frame_mask_bools(self): + self.bools.mask(self.mask) - def setup(self): - self.df = 
DataFrame({'A': np.arange(0, 10000), 'B': np.random.randint(0, 100, 10000), 'C': randn(10000), 'D': randn(10000), })
-        self.df.loc[1::5, 'A'] = np.nan
-        self.df.loc[1::5, 'C'] = np.nan
+    def time_frame_mask_floats(self):
+        self.bools.astype(float).mask(self.mask)
 
-    def time_frame_interpolate_some_good_infer(self):
-        self.df.interpolate(downcast='infer')
 
+## isnull
 
-class frame_isnull_floats_no_null(object):
+class FrameIsnull(object):
     goal_time = 0.2
 
     def setup(self):
-        self.data = np.random.randn(1000, 1000)
-        self.df = DataFrame(self.data)
-
-    def time_frame_isnull(self):
-        isnull(self.df)
-
-
-class frame_isnull_floats(object):
-    goal_time = 0.2
+        self.df_no_null = DataFrame(np.random.randn(1000, 1000))
 
-    def setup(self):
         np.random.seed(1234)
         self.sample = np.array([np.nan, 1.0])
         self.data = np.random.choice(self.sample, (1000, 1000))
         self.df = DataFrame(self.data)
 
-    def time_frame_isnull(self):
-        isnull(self.df)
-
-
-class frame_isnull_strings(object):
-    goal_time = 0.2
-
-    def setup(self):
         np.random.seed(1234)
         self.sample = np.array(list(string.ascii_lowercase) + list(string.ascii_uppercase) + list(string.whitespace))
         self.data = np.random.choice(self.sample, (1000, 1000))
-        self.df = DataFrame(self.data)
-
-    def time_frame_isnull(self):
-        isnull(self.df)
-
-
-class frame_isnull_obj(object):
-    goal_time = 0.2
+        self.df_strings = DataFrame(self.data)
 
-    def setup(self):
         np.random.seed(1234)
         self.sample = np.array([NaT, np.nan, None, np.datetime64('NaT'), np.timedelta64('NaT'), 0, 1, 2.0, '', 'abcd'])
         self.data = np.random.choice(self.sample, (1000, 1000))
-        self.df = DataFrame(self.data)
+        self.df_obj = DataFrame(self.data)
+
+    def time_isnull_floats_no_null(self):
+        isnull(self.df_no_null)
 
-    def time_frame_isnull(self):
+    def time_isnull(self):
         isnull(self.df)
 
+    def time_isnull_strings(self):
+        isnull(self.df_strings)
 
-class frame_iteritems(object):
+    def time_isnull_obj(self):
+        isnull(self.df_obj)
+
+
+# ----------------------------------------------------------------------
+# fillna in place
+
+class frame_fillna_inplace(object):
     goal_time = 0.2
 
     def setup(self):
-        self.df = DataFrame(randn(10000, 1000))
-        self.df2 = DataFrame(randn(3000, 1), columns=['A'])
-        self.df3 = DataFrame(randn(3000, 1))
+        self.df = DataFrame(randn(10000, 100))
+        self.df.values[::2] = np.nan
 
-    def time_frame_iteritems(self):
-        self.f()
+    def time_frame_fillna_inplace(self):
+        self.df.fillna(0, inplace=True)
 
-    def f(self):
-        if hasattr(self.df, '_item_cache'):
-            self.df._item_cache.clear()
-        for (name, col) in self.df.iteritems():
-            pass
 
-    def g(self):
-        for (name, col) in self.df.iteritems():
-            pass
 
-    def h(self):
-        for i in range(10000):
-            self.df2['A']
+class frame_fillna_many_columns_pad(object):
+    goal_time = 0.2
 
-    def j(self):
-        for i in range(10000):
-            self.df3[0]
+    def setup(self):
+        self.values = np.random.randn(1000, 1000)
+        self.values[::2] = np.nan
+        self.df = DataFrame(self.values)
 
+    def time_frame_fillna_many_columns_pad(self):
+        self.df.fillna(method='pad')
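The `Dropna` class that follows is typical of the consolidation applied throughout this patch: several single-method benchmark classes collapse into one class whose `time_*` methods share a single `setup`. A trimmed-down sketch of the pattern (sizes and fixtures illustrative, not the patch's exact ones):

    import numpy as np
    from pandas import DataFrame

    class Dropna(object):
        goal_time = 0.2

        def setup(self):
            # one shared fixture instead of a copy per benchmark class
            self.df = DataFrame(np.random.randn(1000, 100))
            self.df.values[::2] = np.nan

        def time_dropna_axis0_any(self):
            self.df.dropna(how='any', axis=0)

        def time_dropna_axis1_any(self):
            self.df.dropna(how='any', axis=1)
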
-class frame_iteritems_cached(object):
+
+
+class Dropna(object):
     goal_time = 0.2
 
     def setup(self):
-        self.df = DataFrame(randn(10000, 1000))
-        self.df2 = DataFrame(randn(3000, 1), columns=['A'])
-        self.df3 = DataFrame(randn(3000, 1))
+        self.data = np.random.randn(10000, 1000)
+        self.df = DataFrame(self.data)
+        self.df.ix[50:1000, 20:50] = np.nan
+        self.df.ix[2000:3000] = np.nan
+        self.df.ix[:, 60:70] = np.nan
+        self.df_mixed = self.df.copy()
+        self.df_mixed['foo'] = 'bar'
+
+        self.df_mi = self.df.copy()
+        self.df_mi.index = MultiIndex.from_tuples(self.df_mi.index.map((lambda x: (x, x))))
+        self.df_mi.columns = MultiIndex.from_tuples(self.df_mi.columns.map((lambda x: (x, x))))
+
+        self.df_mixed_mi = self.df_mixed.copy()
+        self.df_mixed_mi.index = MultiIndex.from_tuples(self.df_mixed_mi.index.map((lambda x: (x, x))))
+        self.df_mixed_mi.columns = MultiIndex.from_tuples(self.df_mixed_mi.columns.map((lambda x: (x, x))))
 
-    def time_frame_iteritems_cached(self):
-        self.g()
+    def time_dropna_axis0_all(self):
+        self.df.dropna(how='all', axis=0)
 
-    def f(self):
-        if hasattr(self.df, '_item_cache'):
-            self.df._item_cache.clear()
-        for (name, col) in self.df.iteritems():
-            pass
+    def time_dropna_axis0_any(self):
+        self.df.dropna(how='any', axis=0)
 
-    def g(self):
-        for (name, col) in self.df.iteritems():
-            pass
+    def time_dropna_axis1_all(self):
+        self.df.dropna(how='all', axis=1)
+
+    def time_dropna_axis1_any(self):
+        self.df.dropna(how='any', axis=1)
 
-    def h(self):
-        for i in range(10000):
-            self.df2['A']
+    def time_dropna_axis0_all_mixed_dtypes(self):
+        self.df_mixed.dropna(how='all', axis=0)
+
+    def time_dropna_axis0_any_mixed_dtypes(self):
+        self.df_mixed.dropna(how='any', axis=0)
 
-    def j(self):
-        for i in range(10000):
-            self.df3[0]
+    def time_dropna_axis1_all_mixed_dtypes(self):
+        self.df_mixed.dropna(how='all', axis=1)
 
+    def time_dropna_axis1_any_mixed_dtypes(self):
+        self.df_mixed.dropna(how='any', axis=1)
 
-class frame_itertuples(object):
+    def time_count_level_axis0_multi(self):
+        self.df_mi.count(axis=0, level=1)
 
-    def setup(self):
-        self.df = DataFrame(np.random.randn(50000, 10))
+    def time_count_level_axis1_multi(self):
+        self.df_mi.count(axis=1, level=1)
 
-    def time_frame_itertuples(self):
-        for row in self.df.itertuples():
-            pass
+    def time_count_level_axis0_mixed_dtypes_multi(self):
+        self.df_mixed_mi.count(axis=0, level=1)
 
+    def time_count_level_axis1_mixed_dtypes_multi(self):
+        self.df_mixed_mi.count(axis=1, level=1)
 
-class frame_mask_bools(object):
+
+class Apply(object):
     goal_time = 0.2
 
     def setup(self):
-        self.data = np.random.randn(1000, 500)
-        self.df = DataFrame(self.data)
-        self.df = self.df.where((self.df > 0))
-        self.bools = (self.df > 0)
-        self.mask = isnull(self.df)
+        self.df = DataFrame(np.random.randn(1000, 100))
 
-    def time_frame_mask_bools(self):
-        self.bools.mask(self.mask)
+        self.s = Series(np.arange(1028.0))
+        self.df2 = DataFrame({i: self.s for i in range(1028)})
+
+        self.df3 = DataFrame(np.random.randn(1000, 3), columns=list('ABC'))
+
+    def time_apply_user_func(self):
+        self.df2.apply((lambda x: np.corrcoef(x, self.s)[(0, 1)]))
+
+    def time_apply_axis_1(self):
+        self.df.apply((lambda x: (x + 1)), axis=1)
+
+    def time_apply_lambda_mean(self):
+        self.df.apply((lambda x: x.mean()))
+
+    def time_apply_np_mean(self):
+        self.df.apply(np.mean)
+
+    def time_apply_pass_thru(self):
+        self.df.apply((lambda x: x))
+
+    def time_apply_ref_by_name(self):
+        self.df3.apply((lambda x: (x['A'] + x['B'])), axis=1)
 
-class frame_mask_floats(object):
+#----------------------------------------------------------------------
+# dtypes
+
+class frame_dtypes(object):
     goal_time = 0.2
 
     def setup(self):
-        self.data = np.random.randn(1000, 500)
-        self.df = DataFrame(self.data)
-        self.df = self.df.where((self.df > 0))
-        self.bools = (self.df > 0)
-        self.mask = isnull(self.df)
+        self.df = DataFrame(np.random.randn(1000, 1000))
 
-    def time_frame_mask_floats(self):
-        self.bools.astype(float).mask(self.mask)
+    def time_frame_dtypes(self):
+        self.df.dtypes
 
+#----------------------------------------------------------------------
+# equals
-class frame_nonunique_equal(object): +class Equals(object): goal_time = 0.2 def setup(self): @@ -742,10 +341,9 @@ def setup(self): self.object_df = DataFrame(([(['foo'] * 1000)] * 1000)) self.nonunique_cols = self.object_df.copy() self.nonunique_cols.columns = (['A'] * len(self.nonunique_cols.columns)) - self.pairs = dict([(name, self.make_pair(frame)) for (name, frame) in (('float_df', self.float_df), ('object_df', self.object_df), ('nonunique_cols', self.nonunique_cols))]) - - def time_frame_nonunique_equal(self): - self.test_equal('nonunique_cols') + self.pairs = dict([(name, self.make_pair(frame)) for (name, frame) in ( + ('float_df', self.float_df), ('object_df', self.object_df), + ('nonunique_cols', self.nonunique_cols))]) def make_pair(self, frame): self.df = frame @@ -761,238 +359,273 @@ def test_unequal(self, name): (self.df, self.df2) = self.pairs[name] return self.df.equals(self.df2) + def time_frame_float_equal(self): + self.test_equal('float_df') -class frame_nonunique_unequal(object): - goal_time = 0.2 + def time_frame_float_unequal(self): + self.test_unequal('float_df') - def setup(self): - self.float_df = DataFrame(np.random.randn(1000, 1000)) - self.object_df = DataFrame(([(['foo'] * 1000)] * 1000)) - self.nonunique_cols = self.object_df.copy() - self.nonunique_cols.columns = (['A'] * len(self.nonunique_cols.columns)) - self.pairs = dict([(name, self.make_pair(frame)) for (name, frame) in (('float_df', self.float_df), ('object_df', self.object_df), ('nonunique_cols', self.nonunique_cols))]) + def time_frame_nonunique_equal(self): + self.test_equal('nonunique_cols') def time_frame_nonunique_unequal(self): self.test_unequal('nonunique_cols') - def make_pair(self, frame): - self.df = frame - self.df2 = self.df.copy() - self.df2.ix[((-1), (-1))] = np.nan - return (self.df, self.df2) - - def test_equal(self, name): - (self.df, self.df2) = self.pairs[name] - return self.df.equals(self.df) + def time_frame_object_equal(self): + self.test_equal('object_df') - def test_unequal(self, name): - (self.df, self.df2) = self.pairs[name] - return self.df.equals(self.df2) + def time_frame_object_unequal(self): + self.test_unequal('object_df') -class frame_object_equal(object): +class Interpolate(object): goal_time = 0.2 def setup(self): - self.float_df = DataFrame(np.random.randn(1000, 1000)) - self.object_df = DataFrame(([(['foo'] * 1000)] * 1000)) - self.nonunique_cols = self.object_df.copy() - self.nonunique_cols.columns = (['A'] * len(self.nonunique_cols.columns)) - self.pairs = dict([(name, self.make_pair(frame)) for (name, frame) in (('float_df', self.float_df), ('object_df', self.object_df), ('nonunique_cols', self.nonunique_cols))]) + # this is the worst case, where every column has NaNs. 
+ self.df = DataFrame(randn(10000, 100)) + self.df.values[::2] = np.nan - def time_frame_object_equal(self): - self.test_equal('object_df') + self.df2 = DataFrame( + {'A': np.arange(0, 10000), 'B': np.random.randint(0, 100, 10000), + 'C': randn(10000), 'D': randn(10000),}) + self.df2.loc[1::5, 'A'] = np.nan + self.df2.loc[1::5, 'C'] = np.nan - def make_pair(self, frame): - self.df = frame - self.df2 = self.df.copy() - self.df2.ix[((-1), (-1))] = np.nan - return (self.df, self.df2) + def time_interpolate(self): + self.df.interpolate() - def test_equal(self, name): - (self.df, self.df2) = self.pairs[name] - return self.df.equals(self.df) + def time_interpolate_some_good(self): + self.df2.interpolate() - def test_unequal(self, name): - (self.df, self.df2) = self.pairs[name] - return self.df.equals(self.df2) + def time_interpolate_some_good_infer(self): + self.df2.interpolate(downcast='infer') -class frame_object_unequal(object): +class Shift(object): + # frame shift speedup issue-5609 goal_time = 0.2 def setup(self): - self.float_df = DataFrame(np.random.randn(1000, 1000)) - self.object_df = DataFrame(([(['foo'] * 1000)] * 1000)) - self.nonunique_cols = self.object_df.copy() - self.nonunique_cols.columns = (['A'] * len(self.nonunique_cols.columns)) - self.pairs = dict([(name, self.make_pair(frame)) for (name, frame) in (('float_df', self.float_df), ('object_df', self.object_df), ('nonunique_cols', self.nonunique_cols))]) - - def time_frame_object_unequal(self): - self.test_unequal('object_df') + self.df = DataFrame(np.random.rand(10000, 500)) - def make_pair(self, frame): - self.df = frame - self.df2 = self.df.copy() - self.df2.ix[((-1), (-1))] = np.nan - return (self.df, self.df2) + def time_shift_axis0(self): + self.df.shift(1, axis=0) - def test_equal(self, name): - (self.df, self.df2) = self.pairs[name] - return self.df.equals(self.df) + def time_shift_axis_1(self): + self.df.shift(1, axis=1) - def test_unequal(self, name): - (self.df, self.df2) = self.pairs[name] - return self.df.equals(self.df2) +#----------------------------------------------------------------------------- +# from_records issue-6700 -class frame_reindex_axis0(object): +class frame_from_records_generator(object): goal_time = 0.2 - def setup(self): - self.df = DataFrame(randn(10000, 10000)) - self.idx = np.arange(4000, 7000) + def get_data(self, n=100000): + return ((x, (x * 20), (x * 100)) for x in range(n)) - def time_frame_reindex_axis0(self): - self.df.reindex(self.idx) + def time_frame_from_records_generator(self): + self.df = DataFrame.from_records(self.get_data()) + + def time_frame_from_records_generator_nrows(self): + self.df = DataFrame.from_records(self.get_data(), nrows=1000) -class frame_reindex_axis1(object): - goal_time = 0.2 + +#----------------------------------------------------------------------------- +# nunique + +class frame_nunique(object): def setup(self): - self.df = DataFrame(randn(10000, 10000)) - self.idx = np.arange(4000, 7000) + self.data = np.random.randn(10000, 1000) + self.df = DataFrame(self.data) + + def time_frame_nunique(self): + self.df.nunique() + - def time_frame_reindex_axis1(self): - self.df.reindex(columns=self.idx) +#----------------------------------------------------------------------------- +# duplicated -class frame_reindex_both_axes(object): +class frame_duplicated(object): goal_time = 0.2 def setup(self): - self.df = DataFrame(randn(10000, 10000)) - self.idx = np.arange(4000, 7000) + self.n = (1 << 20) + self.t = date_range('2015-01-01', freq='S', periods=(self.n // 64)) 
+ self.xs = np.random.randn((self.n // 64)).round(2) + self.df = DataFrame({'a': np.random.randint(((-1) << 8), (1 << 8), self.n), 'b': np.random.choice(self.t, self.n), 'c': np.random.choice(self.xs, self.n), }) - def time_frame_reindex_both_axes(self): - self.df.reindex(index=self.idx, columns=self.idx) + self.df2 = DataFrame(np.random.randn(1000, 100).astype(str)) + def time_frame_duplicated(self): + self.df.duplicated() -class frame_reindex_both_axes_ix(object): - goal_time = 0.2 + def time_frame_duplicated_wide(self): + self.df2.T.duplicated() - def setup(self): - self.df = DataFrame(randn(10000, 10000)) - self.idx = np.arange(4000, 7000) - def time_frame_reindex_both_axes_ix(self): - self.df.ix[(self.idx, self.idx)] -class frame_reindex_upcast(object): - goal_time = 0.2 - def setup(self): - self.df = DataFrame(dict([(c, {0: randint(0, 2, 1000).astype(np.bool_), 1: randint(0, 1000, 1000).astype(np.int16), 2: randint(0, 1000, 1000).astype(np.int32), 3: randint(0, 1000, 1000).astype(np.int64), }[randint(0, 4)]) for c in range(1000)])) - def time_frame_reindex_upcast(self): - self.df.reindex(permutation(range(1200))) -class frame_repr_tall(object): + + + + + + + + + +class frame_xs_col(object): goal_time = 0.2 def setup(self): - self.df = pandas.DataFrame(np.random.randn(10000, 10)) + self.df = DataFrame(randn(1, 100000)) - def time_frame_repr_tall(self): - repr(self.df) + def time_frame_xs_col(self): + self.df.xs(50000, axis=1) -class frame_repr_wide(object): +class frame_xs_row(object): goal_time = 0.2 def setup(self): - self.df = pandas.DataFrame(np.random.randn(10, 10000)) + self.df = DataFrame(randn(100000, 1)) - def time_frame_repr_wide(self): - repr(self.df) + def time_frame_xs_row(self): + self.df.xs(50000) -class frame_shift_axis0(object): +class frame_sort_index(object): goal_time = 0.2 def setup(self): - self.df = DataFrame(np.random.rand(10000, 500)) + self.df = DataFrame(randn(1000000, 2), columns=list('AB')) - def time_frame_shift_axis0(self): - self.df.shift(1, axis=0) + def time_frame_sort_index(self): + self.df.sort_index() -class frame_shift_axis_1(object): +class frame_sort_index_by_columns(object): goal_time = 0.2 def setup(self): - self.df = DataFrame(np.random.rand(10000, 500)) + self.N = 10000 + self.K = 10 + self.key1 = tm.makeStringIndex(self.N).values.repeat(self.K) + self.key2 = tm.makeStringIndex(self.N).values.repeat(self.K) + self.df = DataFrame({'key1': self.key1, 'key2': self.key2, 'value': np.random.randn((self.N * self.K)), }) + self.col_array_list = list(self.df.values.T) - def time_frame_shift_axis_1(self): - self.df.shift(1, axis=1) + def time_frame_sort_index_by_columns(self): + self.df.sort_index(by=['key1', 'key2']) -class frame_to_html_mixed(object): +class frame_quantile_axis1(object): goal_time = 0.2 def setup(self): - self.nrows = 500 - self.df = DataFrame(randn(self.nrows, 10)) - self.df[0] = period_range('2000', '2010', self.nrows) - self.df[1] = range(self.nrows) + self.df = DataFrame(np.random.randn(1000, 3), + columns=list('ABC')) + + def time_frame_quantile_axis1(self): + self.df.quantile([0.1, 0.5], axis=1) - def time_frame_to_html_mixed(self): - self.df.to_html() +#---------------------------------------------------------------------- +# boolean indexing -class frame_to_string_floats(object): +class frame_boolean_row_select(object): goal_time = 0.2 def setup(self): - self.df = DataFrame(randn(100, 10)) - - def time_frame_to_string_floats(self): - self.df.to_string() + self.df = DataFrame(randn(10000, 100)) + self.bool_arr = 
np.zeros(10000, dtype=bool)
+        self.bool_arr[:1000] = True
+
+    def time_frame_boolean_row_select(self):
+        self.df[self.bool_arr]


-class frame_xs_col(object):
+class frame_getitem_single_column(object):
     goal_time = 0.2

     def setup(self):
-        self.df = DataFrame(randn(1, 100000))
+        self.df = DataFrame(randn(10000, 1000))
+        self.df2 = DataFrame(randn(3000, 1), columns=['A'])
+        self.df3 = DataFrame(randn(3000, 1))

-    def time_frame_xs_col(self):
-        self.df.xs(50000, axis=1)
+    def h(self):
+        for i in range(10000):
+            self.df2['A']
+
+    def j(self):
+        for i in range(10000):
+            self.df3[0]
+
+    def time_frame_getitem_single_column(self):
+        self.h()

-class frame_xs_row(object):
+    def time_frame_getitem_single_column2(self):
+        self.j()
+
+
+#----------------------------------------------------------------------
+# assignment
+
+class frame_assign_timeseries_index(object):
     goal_time = 0.2

     def setup(self):
-        self.df = DataFrame(randn(100000, 1))
+        self.idx = date_range('1/1/2000', periods=100000, freq='H')
+        self.df = DataFrame(randn(100000, 1), columns=['A'], index=self.idx)

-    def time_frame_xs_row(self):
-        self.df.xs(50000)
+    def time_frame_assign_timeseries_index(self):
+        self.f(self.df)

+    def f(self, df):
+        # copy the frame that was passed in, rather than reaching back to self.df
+        self.x = df.copy()
+        self.x['date'] = self.x.index

-class frame_sort_index(object):
+
+
+# insert many columns
+
+class frame_insert_100_columns_begin(object):
     goal_time = 0.2

     def setup(self):
-        self.df = DataFrame(randn(1000000, 2), columns=list('AB'))
+        self.N = 1000

-    def time_frame_sort_index(self):
-        self.df.sort_index()
+    def f(self, K=100):
+        self.df = DataFrame(index=range(self.N))
+        self.new_col = np.random.randn(self.N)
+        for i in range(K):
+            self.df.insert(0, i, self.new_col)
+
+    def g(self, K=500):
+        self.df = DataFrame(index=range(self.N))
+        self.new_col = np.random.randn(self.N)
+        for i in range(K):
+            self.df[i] = self.new_col
+
+    def time_frame_insert_100_columns_begin(self):
+        self.f()
+
+    def time_frame_insert_500_columns_end(self):
+        self.g()
+
+
+#----------------------------------------------------------------------
+# strings methods, #2602
+
 class series_string_vector_slice(object):
     goal_time = 0.2

@@ -1003,15 +636,17 @@ def time_series_string_vector_slice(self):
         self.s.str[:5]


-class frame_quantile_axis1(object):
+#----------------------------------------------------------------------
+# df.info() and get_dtype_counts() # 2807
+
+class frame_get_dtype_counts(object):
     goal_time = 0.2

     def setup(self):
-        self.df = DataFrame(np.random.randn(1000, 3),
-                            columns=list('ABC'))
+        self.df = DataFrame(np.random.randn(10, 10000))

-    def time_frame_quantile_axis1(self):
-        self.df.quantile([0.1, 0.5], axis=1)
+    def time_frame_get_dtype_counts(self):
+        self.df.get_dtype_counts()


 class frame_nlargest(object):
diff --git a/asv_bench/benchmarks/gil.py b/asv_bench/benchmarks/gil.py
index 1c82560c7e630..78a94976e732d 100644
--- a/asv_bench/benchmarks/gil.py
+++ b/asv_bench/benchmarks/gil.py
@@ -1,11 +1,17 @@
 from .pandas_vb_common import *
-from pandas.core import common as com
+
+from pandas.core.algorithms import take_1d

 try:
     from cStringIO import StringIO
 except ImportError:
     from io import StringIO

+try:
+    from pandas._libs import algos
+except ImportError:
+    from pandas import algos
+
 try:
     from pandas.util.testing import test_parallel
@@ -22,7 +28,7 @@ def wrapper(fname):
         return wrapper


-class nogil_groupby_base(object):
+class NoGilGroupby(object):
     goal_time = 0.2

     def setup(self):
@@ -30,167 +36,122 @@ def setup(self):
         self.N = 1000000
         self.ngroups = 1000
         np.random.seed(1234)
         self.df = DataFrame({'key': np.random.randint(0, self.ngroups, size=self.N),
                              'data': np.random.randn(self.N), })
-        if (not have_real_test_parallel):
-            raise NotImplementedError

-
-class nogil_groupby_count_2(nogil_groupby_base):
+        np.random.seed(1234)
+        self.size = 2 ** 22
+        self.ngroups = 100
+        self.data = Series(np.random.randint(0, self.ngroups, size=self.size))

-    def time_nogil_groupby_count_2(self):
-        self.pg2()
+        if (not have_real_test_parallel):
+            raise NotImplementedError

     @test_parallel(num_threads=2)
-    def pg2(self):
+    def _pg2_count(self):
         self.df.groupby('key')['data'].count()

-
-class nogil_groupby_last_2(nogil_groupby_base):
-
-    def time_nogil_groupby_last_2(self):
-        self.pg2()
+    def time_count_2(self):
+        self._pg2_count()

     @test_parallel(num_threads=2)
-    def pg2(self):
+    def _pg2_last(self):
         self.df.groupby('key')['data'].last()

-
-class nogil_groupby_max_2(nogil_groupby_base):
-
-    def time_nogil_groupby_max_2(self):
-        self.pg2()
+    def time_last_2(self):
+        self._pg2_last()

     @test_parallel(num_threads=2)
-    def pg2(self):
+    def _pg2_max(self):
         self.df.groupby('key')['data'].max()

-
-class nogil_groupby_mean_2(nogil_groupby_base):
-
-    def time_nogil_groupby_mean_2(self):
-        self.pg2()
+    def time_max_2(self):
+        self._pg2_max()

     @test_parallel(num_threads=2)
-    def pg2(self):
+    def _pg2_mean(self):
         self.df.groupby('key')['data'].mean()

-
-class nogil_groupby_min_2(nogil_groupby_base):
-
-    def time_nogil_groupby_min_2(self):
-        self.pg2()
+    def time_mean_2(self):
+        self._pg2_mean()

     @test_parallel(num_threads=2)
-    def pg2(self):
+    def _pg2_min(self):
         self.df.groupby('key')['data'].min()

-
-class nogil_groupby_prod_2(nogil_groupby_base):
-
-    def time_nogil_groupby_prod_2(self):
-        self.pg2()
+    def time_min_2(self):
+        self._pg2_min()

     @test_parallel(num_threads=2)
-    def pg2(self):
+    def _pg2_prod(self):
         self.df.groupby('key')['data'].prod()

-
-class nogil_groupby_sum_2(nogil_groupby_base):
-
-    def time_nogil_groupby_sum_2(self):
-        self.pg2()
+    def time_prod_2(self):
+        self._pg2_prod()

     @test_parallel(num_threads=2)
-    def pg2(self):
-        self.df.groupby('key')['data'].sum()
-
-
-class nogil_groupby_sum_4(nogil_groupby_base):
-
-    def time_nogil_groupby_sum_4(self):
-        self.pg4()
-
-    def f(self):
+    def _pg2_sum(self):
         self.df.groupby('key')['data'].sum()

-    def g4(self):
-        for i in range(4):
-            self.f()
+    def time_sum_2(self):
+        self._pg2_sum()

     @test_parallel(num_threads=4)
-    def pg4(self):
-        self.f()
-
+    def _pg4_sum(self):
+        self.df.groupby('key')['data'].sum()

-class nogil_groupby_sum_8(nogil_groupby_base):
+    def time_sum_4(self):
+        self._pg4_sum()

-    def time_nogil_groupby_sum_8(self):
-        self.pg8()
+    def time_sum_4_notp(self):
+        for i in range(4):
+            self.df.groupby('key')['data'].sum()

-    def f(self):
+    def _f_sum(self):
         self.df.groupby('key')['data'].sum()

-    def g8(self):
-        for i in range(8):
-            self.f()
-
     @test_parallel(num_threads=8)
-    def pg8(self):
-        self.f()
-
+    def _pg8_sum(self):
+        self._f_sum()

-class nogil_groupby_var_2(nogil_groupby_base):
+    def time_sum_8(self):
+        self._pg8_sum()

-    def time_nogil_groupby_var_2(self):
-        self.pg2()
+    def time_sum_8_notp(self):
+        for i in range(8):
+            self._f_sum()

     @test_parallel(num_threads=2)
-    def pg2(self):
+    def _pg2_var(self):
         self.df.groupby('key')['data'].var()

+    def time_var_2(self):
+        self._pg2_var()

-class nogil_groupby_groups(object):
-    goal_time = 0.2
-
-    def setup(self):
-        np.random.seed(1234)
-        self.size = 2**22
-        self.ngroups = 100
-        self.data = Series(np.random.randint(0, self.ngroups, size=self.size))
-        if (not have_real_test_parallel):
-            raise NotImplementedError
+    # get groups

-    def f(self):
+    def _groups(self):
         self.data.groupby(self.data).groups

-
-class nogil_groupby_groups_2(nogil_groupby_groups):
-
-    def time_nogil_groupby_groups(self):
-        self.pg2()
-
     @test_parallel(num_threads=2)
-    def pg2(self):
-        self.f()
-
-
-class nogil_groupby_groups_4(nogil_groupby_groups):
+    def _pg2_groups(self):
+        self._groups()

-    def time_nogil_groupby_groups(self):
-        self.pg4()
+    def time_groups_2(self):
+        self._pg2_groups()

     @test_parallel(num_threads=4)
-    def pg4(self):
-        self.f()
+    def _pg4_groups(self):
+        self._groups()

+    def time_groups_4(self):
+        self._pg4_groups()

-class nogil_groupby_groups_8(nogil_groupby_groups):
+    @test_parallel(num_threads=8)
+    def _pg8_groups(self):
+        self._groups()

-    def time_nogil_groupby_groups(self):
-        self.pg8()
+    def time_groups_8(self):
+        self._pg8_groups()

-    @test_parallel(num_threads=8)
-    def pg8(self):
-        self.f()


 class nogil_take1d_float64(object):
@@ -212,11 +173,11 @@ def time_nogil_take1d_float64(self):

     @test_parallel(num_threads=2)
     def take_1d_pg2_int64(self):
-        com.take_1d(self.df.int64.values, self.indexer)
+        take_1d(self.df.int64.values, self.indexer)

     @test_parallel(num_threads=2)
     def take_1d_pg2_float64(self):
-        com.take_1d(self.df.float64.values, self.indexer)
+        take_1d(self.df.float64.values, self.indexer)


 class nogil_take1d_int64(object):
@@ -238,11 +199,11 @@ def time_nogil_take1d_int64(self):

     @test_parallel(num_threads=2)
     def take_1d_pg2_int64(self):
-        com.take_1d(self.df.int64.values, self.indexer)
+        take_1d(self.df.int64.values, self.indexer)

     @test_parallel(num_threads=2)
     def take_1d_pg2_float64(self):
-        com.take_1d(self.df.float64.values, self.indexer)
+        take_1d(self.df.float64.values, self.indexer)


 class nogil_kth_smallest(object):
@@ -271,7 +232,7 @@ class nogil_datetime_fields(object):

     def setup(self):
         self.N = 100000000
-        self.dti = pd.date_range('1900-01-01', periods=self.N, freq='D')
+        self.dti = pd.date_range('1900-01-01', periods=self.N, freq='T')
         self.period = self.dti.to_period('D')
         if (not have_real_test_parallel):
             raise NotImplementedError
@@ -408,19 +369,54 @@ def create_cols(self, name):
     def pg_read_csv(self):
         read_csv('__test__.csv', sep=',', header=None, float_precision=None)

-    def time_nogil_read_csv(self):
+    def time_read_csv(self):
         self.pg_read_csv()

     @test_parallel(num_threads=2)
     def pg_read_csv_object(self):
         read_csv('__test_object__.csv', sep=',')

-    def time_nogil_read_csv_object(self):
+    def time_read_csv_object(self):
         self.pg_read_csv_object()

     @test_parallel(num_threads=2)
     def pg_read_csv_datetime(self):
         read_csv('__test_datetime__.csv', sep=',', header=None)

-    def time_nogil_read_csv_datetime(self):
+    def time_read_csv_datetime(self):
         self.pg_read_csv_datetime()
+
+
+class nogil_factorize(object):
+    number = 1
+    repeat = 5
+
+    def setup(self):
+        if (not have_real_test_parallel):
+            raise NotImplementedError
+
+        np.random.seed(1234)
+        self.strings = tm.makeStringIndex(100000)
+
+    def factorize_strings(self):
+        pd.factorize(self.strings)
+
+    @test_parallel(num_threads=4)
+    def _pg_factorize_strings_4(self):
+        self.factorize_strings()
+
+    def time_factorize_strings_4(self):
+        for i in range(2):
+            self._pg_factorize_strings_4()
+
+    @test_parallel(num_threads=2)
+    def _pg_factorize_strings_2(self):
+        self.factorize_strings()
+
+    def time_factorize_strings_2(self):
+        for i in range(4):
+            self._pg_factorize_strings_2()
+
+    def time_factorize_strings(self):
+        for i in range(8):
+            self.factorize_strings()
diff --git a/asv_bench/benchmarks/groupby.py b/asv_bench/benchmarks/groupby.py
index 5f3671012e6d5..c0c3a42cc4464 100644
--- a/asv_bench/benchmarks/groupby.py
+++ b/asv_bench/benchmarks/groupby.py
@@ -35,107 +35,67 @@ def time_groupby_apply_dict_return(self):

 #----------------------------------------------------------------------
 # groups

-class groupby_groups(object):
+class Groups(object):
     goal_time = 0.1

-    def setup(self):
-        size = 2**22
-        self.data = Series(np.random.randint(0, 100, size=size))
-        self.data2 = Series(np.random.randint(0, 10000, size=size))
-        self.data3 = Series(tm.makeStringIndex(100).take(np.random.randint(0, 100, size=size)))
-        self.data4 = Series(tm.makeStringIndex(10000).take(np.random.randint(0, 10000, size=size)))
-
-    def time_groupby_groups_int64_small(self):
-        self.data.groupby(self.data).groups
+    size = 2 ** 22
+    data = {
+        'int64_small': Series(np.random.randint(0, 100, size=size)),
+        'int64_large': Series(np.random.randint(0, 10000, size=size)),
+        'object_small': Series(tm.makeStringIndex(100).take(np.random.randint(0, 100, size=size))),
+        'object_large': Series(tm.makeStringIndex(10000).take(np.random.randint(0, 10000, size=size)))
+    }

-    def time_groupby_groups_int64_large(self):
-        self.data2.groupby(self.data2).groups
+    param_names = ['df']
+    params = ['int64_small', 'int64_large', 'object_small', 'object_large']

-    def time_groupby_groups_object_small(self):
-        self.data3.groupby(self.data3).groups
+    def setup(self, df):
+        self.df = self.data[df]

-    def time_groupby_groups_object_large(self):
-        self.data4.groupby(self.data4).groups
+    def time_groupby_groups(self, df):
+        self.df.groupby(self.df).groups


 #----------------------------------------------------------------------
 # First / last functions

-class groupby_first_last(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.labels = np.arange(10000).repeat(10)
-        self.data = Series(randn(len(self.labels)))
-        self.data[::3] = np.nan
-        self.data[1::3] = np.nan
-        self.data2 = Series(randn(len(self.labels)), dtype='float32')
-        self.data2[::3] = np.nan
-        self.data2[1::3] = np.nan
-        self.labels = self.labels.take(np.random.permutation(len(self.labels)))
-
-    def time_groupby_first_float32(self):
-        self.data2.groupby(self.labels).first()
-
-    def time_groupby_first_float64(self):
-        self.data.groupby(self.labels).first()
-
-    def time_groupby_last_float32(self):
-        self.data2.groupby(self.labels).last()
-
-    def time_groupby_last_float64(self):
-        self.data.groupby(self.labels).last()
-
-    def time_groupby_nth_float32_any(self):
-        self.data2.groupby(self.labels).nth(0, dropna='all')
-
-    def time_groupby_nth_float32_none(self):
-        self.data2.groupby(self.labels).nth(0)
-
-    def time_groupby_nth_float64_any(self):
-        self.data.groupby(self.labels).nth(0, dropna='all')
-
-    def time_groupby_nth_float64_none(self):
-        self.data.groupby(self.labels).nth(0)
-
-# with datetimes (GH7555)
-
-class groupby_first_last_datetimes(object):
+class FirstLast(object):
     goal_time = 0.2

-    def setup(self):
-        self.df = DataFrame({'a': date_range('1/1/2011', periods=100000, freq='s'), 'b': range(100000), })
-
-    def time_groupby_first_datetimes(self):
-        self.df.groupby('b').first()
-
-    def time_groupby_last_datetimes(self):
-        self.df.groupby('b').last()
-
-    def time_groupby_nth_datetimes_any(self):
-        self.df.groupby('b').nth(0, dropna='all')
+    param_names = ['dtype']
+    params = ['float32', 'float64', 'datetime', 'object']

-    def time_groupby_nth_datetimes_none(self):
-        self.df.groupby('b').nth(0)
+    # with datetimes (GH7555)
+    def setup(self, dtype):

-class groupby_first_last_object(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame({'a': (['foo'] * 100000), 'b': range(100000)})
+        if dtype == 'datetime':
+            self.df = DataFrame(
+                {'values': date_range('1/1/2011', periods=100000, freq='s'),
+                 'key': range(100000),})
+        elif dtype == 'object':
+            self.df = DataFrame(
+                {'values': (['foo'] * 100000),
+                 'key': range(100000)})
+        else:
+            labels = np.arange(10000).repeat(10)
+            data = Series(randn(len(labels)), dtype=dtype)
+            data[::3] = np.nan
+            data[1::3] = np.nan
+            labels = labels.take(np.random.permutation(len(labels)))
+            self.df = DataFrame({'values': data, 'key': labels})

-    def time_groupby_first_object(self):
-        self.df.groupby('b').first()
+    def time_groupby_first(self, dtype):
+        self.df.groupby('key').first()

-    def time_groupby_last_object(self):
-        self.df.groupby('b').last()
+    def time_groupby_last(self, dtype):
+        self.df.groupby('key').last()

-    def time_groupby_nth_object_any(self):
-        self.df.groupby('b').nth(0, dropna='any')
+    def time_groupby_nth_any(self, dtype):
+        self.df.groupby('key').nth(0, dropna='all')

-    def time_groupby_nth_object_none(self):
-        self.df.groupby('b').nth(0)
+    def time_groupby_nth_none(self, dtype):
+        self.df.groupby('key').nth(0)


 #----------------------------------------------------------------------
@@ -148,16 +108,34 @@ def setup(self):
         self.N = 10000
         self.labels = np.random.randint(0, 2000, size=self.N)
         self.labels2 = np.random.randint(0, 3, size=self.N)
-        self.df = DataFrame({'key': self.labels, 'key2': self.labels2, 'value1': randn(self.N), 'value2': (['foo', 'bar', 'baz', 'qux'] * (self.N / 4)), })
-
-    def f(self, g):
+        self.df = DataFrame({
+            'key': self.labels,
+            'key2': self.labels2,
+            'value1': np.random.randn(self.N),
+            'value2': (['foo', 'bar', 'baz', 'qux'] * (self.N // 4)),
+        })
+
+    @staticmethod
+    def scalar_function(g):
         return 1

-    def time_groupby_frame_apply(self):
-        self.df.groupby(['key', 'key2']).apply(self.f)
+    def time_groupby_frame_apply_scalar_function(self):
+        self.df.groupby(['key', 'key2']).apply(self.scalar_function)
+
+    def time_groupby_frame_apply_scalar_function_overhead(self):
+        self.df.groupby('key').apply(self.scalar_function)

-    def time_groupby_frame_apply_overhead(self):
-        self.df.groupby('key').apply(self.f)
+    @staticmethod
+    def df_copy_function(g):
+        # ensure that the group name is available (see GH #15062)
+        g.name
+        return g.copy()
+
+    def time_groupby_frame_df_copy_function(self):
+        self.df.groupby(['key', 'key2']).apply(self.df_copy_function)
+
+    def time_groupby_frame_apply_df_copy_overhead(self):
+        self.df.groupby('key').apply(self.df_copy_function)


 #----------------------------------------------------------------------
@@ -189,24 +167,6 @@ def time_sum(self):
         self.df.groupby(self.labels).sum()


-#----------------------------------------------------------------------
-# median
-
-class groupby_frame(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.data = np.random.randn(100000, 2)
-        self.labels = np.random.randint(0, 1000, size=100000)
-        self.df = DataFrame(self.data)
-
-    def time_groupby_frame_median(self):
-        self.df.groupby(self.labels).median()
-
-    def time_groupby_simple_compress_timing(self):
-        self.df.groupby(self.labels).mean()
-
-
 #----------------------------------------------------------------------
 # DataFrame nth

@@ -309,6 +269,22 @@ def time_groupby_int_count(self):
         self.df.groupby(['key1', 'key2']).count()


+#----------------------------------------------------------------------
+# nunique() speed
+
+class groupby_nunique(object):
+
+    def setup(self):
+        self.n = 10000
+        self.df = DataFrame({'key1': randint(0, 500, size=self.n),
+                             'key2': randint(0, 100, size=self.n),
+                             'ints': randint(0, 1000, size=self.n),
+                             'ints2': randint(0, 1000, size=self.n), })
+
+    def time_groupby_nunique(self):
+        self.df.groupby(['key1', 'key2']).nunique()
+
+
 #----------------------------------------------------------------------
 # group with different functions per column

@@ -355,7 +331,7 @@ def setup(self):

     def get_test_data(self, ngroups=100, n=100000):
         self.unique_groups = range(self.ngroups)
-        self.arr = np.asarray(np.tile(self.unique_groups, (n / self.ngroups)), dtype=object)
+        self.arr = np.asarray(np.tile(self.unique_groups, int(n / self.ngroups)), dtype=object)
         if (len(self.arr) < n):
             self.arr = np.asarray((list(self.arr) + self.unique_groups[:(n - len(self.arr))]), dtype=object)
         random.shuffle(self.arr)
@@ -405,132 +381,118 @@ def time_groupby_dt_timegrouper_size(self):

 #----------------------------------------------------------------------
 # groupby with a variable value for ngroups

-class groupby_ngroups_int_10000(object):
+class GroupBySuite(object):
     goal_time = 0.2
-    dtype = 'int'
-    ngroups = 10000

-    def setup(self):
+    param_names = ['dtype', 'ngroups']
+    params = [['int', 'float'], [100, 10000]]
+
+    def setup(self, dtype, ngroups):
         np.random.seed(1234)
-        size = self.ngroups * 2
-        rng = np.arange(self.ngroups)
-        ts = rng.take(np.random.randint(0, self.ngroups, size=size))
-        if self.dtype == 'int':
-            value = np.random.randint(0, size, size=size)
+        size = ngroups * 2
+        rng = np.arange(ngroups)
+        values = rng.take(np.random.randint(0, ngroups, size=size))
+        if dtype == 'int':
+            key = np.random.randint(0, size, size=size)
         else:
-            value = np.concatenate([np.random.random(self.ngroups) * 0.1,
-                                    np.random.random(self.ngroups) * 10.0])
+            key = np.concatenate([np.random.random(ngroups) * 0.1,
+                                  np.random.random(ngroups) * 10.0])

-        self.df = DataFrame({'timestamp': ts,
-                             'value': value})
+        self.df = DataFrame({'values': values,
+                             'key': key})

-    def time_all(self):
-        self.df.groupby('value')['timestamp'].all()
+    def time_all(self, dtype, ngroups):
+        self.df.groupby('key')['values'].all()

-    def time_any(self):
-        self.df.groupby('value')['timestamp'].any()
+    def time_any(self, dtype, ngroups):
+        self.df.groupby('key')['values'].any()

-    def time_count(self):
-        self.df.groupby('value')['timestamp'].count()
+    def time_count(self, dtype, ngroups):
+        self.df.groupby('key')['values'].count()

-    def time_cumcount(self):
-        self.df.groupby('value')['timestamp'].cumcount()
+    def time_cumcount(self, dtype, ngroups):
+        self.df.groupby('key')['values'].cumcount()

-    def time_cummax(self):
-        self.df.groupby('value')['timestamp'].cummax()
+    def time_cummax(self, dtype, ngroups):
+        self.df.groupby('key')['values'].cummax()

-    def time_cummin(self):
-        self.df.groupby('value')['timestamp'].cummin()
+    def time_cummin(self, dtype, ngroups):
+        self.df.groupby('key')['values'].cummin()

-    def time_cumprod(self):
-        self.df.groupby('value')['timestamp'].cumprod()
+    def time_cumprod(self, dtype, ngroups):
+        self.df.groupby('key')['values'].cumprod()

-    def time_cumsum(self):
-        self.df.groupby('value')['timestamp'].cumsum()
+    def time_cumsum(self, dtype, ngroups):
+        self.df.groupby('key')['values'].cumsum()

-    def time_describe(self):
-        self.df.groupby('value')['timestamp'].describe()
+    def time_describe(self, dtype, ngroups):
+        self.df.groupby('key')['values'].describe()

-    def time_diff(self):
-        self.df.groupby('value')['timestamp'].diff()
+    def time_diff(self, dtype, ngroups):
+        self.df.groupby('key')['values'].diff()

-    def time_first(self):
-        self.df.groupby('value')['timestamp'].first()
+    def time_first(self, dtype, ngroups):
+        self.df.groupby('key')['values'].first()

-    def time_head(self):
-        self.df.groupby('value')['timestamp'].head()
+    def time_head(self, dtype, ngroups):
+        self.df.groupby('key')['values'].head()

-    def time_last(self):
-        self.df.groupby('value')['timestamp'].last()
+    def time_last(self, dtype, ngroups):
+        self.df.groupby('key')['values'].last()

-    def time_mad(self):
-        self.df.groupby('value')['timestamp'].mad()
+    def time_mad(self, dtype, ngroups):
+        self.df.groupby('key')['values'].mad()

-    def time_max(self):
-        self.df.groupby('value')['timestamp'].max()
+    def time_max(self, dtype, ngroups):
+        self.df.groupby('key')['values'].max()

-    def time_mean(self):
-        self.df.groupby('value')['timestamp'].mean()
+    def time_mean(self, dtype, ngroups):
+        self.df.groupby('key')['values'].mean()

-    def time_median(self):
-        self.df.groupby('value')['timestamp'].median()
+    def time_median(self, dtype, ngroups):
+        self.df.groupby('key')['values'].median()

-    def time_min(self):
-        self.df.groupby('value')['timestamp'].min()
+    def time_min(self, dtype, ngroups):
+        self.df.groupby('key')['values'].min()

-    def time_nunique(self):
-        self.df.groupby('value')['timestamp'].nunique()
+    def time_nunique(self, dtype, ngroups):
+        self.df.groupby('key')['values'].nunique()

-    def time_pct_change(self):
-        self.df.groupby('value')['timestamp'].pct_change()
+    def time_pct_change(self, dtype, ngroups):
+        self.df.groupby('key')['values'].pct_change()

-    def time_prod(self):
-        self.df.groupby('value')['timestamp'].prod()
+    def time_prod(self, dtype, ngroups):
+        self.df.groupby('key')['values'].prod()

-    def time_rank(self):
-        self.df.groupby('value')['timestamp'].rank()
+    def time_rank(self, dtype, ngroups):
+        self.df.groupby('key')['values'].rank()

-    def time_sem(self):
-        self.df.groupby('value')['timestamp'].sem()
+    def time_sem(self, dtype, ngroups):
+        self.df.groupby('key')['values'].sem()

-    def time_size(self):
-        self.df.groupby('value')['timestamp'].size()
+    def time_size(self, dtype, ngroups):
+        self.df.groupby('key')['values'].size()

-    def time_skew(self):
-        self.df.groupby('value')['timestamp'].skew()
+    def time_skew(self, dtype, ngroups):
+        self.df.groupby('key')['values'].skew()

-    def time_std(self):
-        self.df.groupby('value')['timestamp'].std()
+    def time_std(self, dtype, ngroups):
+        self.df.groupby('key')['values'].std()

-    def time_sum(self):
-        self.df.groupby('value')['timestamp'].sum()
+    def time_sum(self, dtype, ngroups):
+        self.df.groupby('key')['values'].sum()

-    def time_tail(self):
-        self.df.groupby('value')['timestamp'].tail()
+    def time_tail(self, dtype, ngroups):
+        self.df.groupby('key')['values'].tail()

-    def time_unique(self):
-        self.df.groupby('value')['timestamp'].unique()
+    def time_unique(self, dtype, ngroups):
+        self.df.groupby('key')['values'].unique()

-    def time_value_counts(self):
-        self.df.groupby('value')['timestamp'].value_counts()
+    def time_value_counts(self, dtype, ngroups):
+        self.df.groupby('key')['values'].value_counts()

-    def time_var(self):
-        self.df.groupby('value')['timestamp'].var()
-
-class groupby_ngroups_int_100(groupby_ngroups_int_10000):
-    goal_time = 0.2
-    dtype = 'int'
-    ngroups = 100
-
-class groupby_ngroups_float_100(groupby_ngroups_int_10000):
-    goal_time = 0.2
-    dtype = 'float'
-    ngroups = 100
-
-class groupby_ngroups_float_10000(groupby_ngroups_int_10000):
-    goal_time = 0.2
-    dtype = 'float'
-    ngroups = 10000
+    def time_var(self, dtype, ngroups):
+        self.df.groupby('key')['values'].var()

 class groupby_float32(object):
@@ -548,6 +510,43 @@ def time_groupby_sum(self):
         self.df.groupby(['a'])['b'].sum()


+class groupby_categorical(object):
+    goal_time = 0.2
+
+    def setup(self):
+        N = 100000
+        arr = np.random.random(N)
+
+        self.df = DataFrame(dict(
+            a=Categorical(np.random.randint(10000, size=N)),
+            b=arr))
+        self.df_ordered = DataFrame(dict(
+            a=Categorical(np.random.randint(10000, size=N), ordered=True),
+            b=arr))
+        self.df_extra_cat = DataFrame(dict(
+            a=Categorical(np.random.randint(100, size=N),
+                          categories=np.arange(10000)),
+            b=arr))
+
+    def time_groupby_sort(self):
+        self.df.groupby('a')['b'].count()
+
+    def time_groupby_nosort(self):
+        self.df.groupby('a', sort=False)['b'].count()
+
+    def time_groupby_ordered_sort(self):
+        self.df_ordered.groupby('a')['b'].count()
+
+    def time_groupby_ordered_nosort(self):
+        self.df_ordered.groupby('a', sort=False)['b'].count()
+
+    def time_groupby_extra_cat_sort(self):
+        self.df_extra_cat.groupby('a')['b'].count()
+
+    def time_groupby_extra_cat_nosort(self):
+        self.df_extra_cat.groupby('a', sort=False)['b'].count()
+
+
 class groupby_period(object):
     # GH 14338
     goal_time = 0.2
@@ -647,89 +646,75 @@ def time_groupby_sum_multiindex(self):

 #-------------------------------------------------------------------------------
 # Transform testing

-class groupby_transform(object):
+class Transform(object):
     goal_time = 0.2

     def setup(self):
-        self.n_dates = 400
-        self.n_securities = 250
-        self.n_columns = 3
-        self.share_na = 0.1
-        self.dates = date_range('1997-12-31', periods=self.n_dates, freq='B')
-        self.dates = Index(map((lambda x: (((x.year * 10000) + (x.month * 100)) + x.day)), self.dates))
-        self.secid_min = int('10000000', 16)
-        self.secid_max = int('F0000000', 16)
-        self.step = ((self.secid_max - self.secid_min) // (self.n_securities - 1))
-        self.security_ids = map((lambda x: hex(x)[2:10].upper()), list(range(self.secid_min, (self.secid_max + 1), self.step)))
-        self.data_index = MultiIndex(levels=[self.dates.values, self.security_ids],
-                                     labels=[[i for i in range(self.n_dates) for _ in range(self.n_securities)], (list(range(self.n_securities)) * self.n_dates)],
-                                     names=['date', 'security_id'])
-        self.n_data = len(self.data_index)
-        self.columns = Index(['factor{}'.format(i) for i in range(1, (self.n_columns + 1))])
-        self.data = DataFrame(np.random.randn(self.n_data, self.n_columns), index=self.data_index, columns=self.columns)
-        self.step = int((self.n_data * self.share_na))
-        for column_index in range(self.n_columns):
-            self.index = column_index
-            while (self.index < self.n_data):
-                self.data.set_value(self.data_index[self.index], self.columns[column_index], np.nan)
-                self.index += self.step
-        self.f_fillna = (lambda x: x.fillna(method='pad'))
-
-    def time_groupby_transform(self):
-        self.data.groupby(level='security_id').transform(self.f_fillna)
-
-    def time_groupby_transform_ufunc(self):
-        self.data.groupby(level='date').transform(np.max)
+        n1 = 400
+        n2 = 250
+        index = MultiIndex(
+            levels=[np.arange(n1), pd.util.testing.makeStringIndex(n2)],
+            labels=[[i for i in range(n1) for _ in range(n2)],
+                    (list(range(n2)) * n1)],
+            names=['lev1', 'lev2'])

-class groupby_transform_multi_key(object):
-    goal_time = 0.2
+        data = DataFrame(np.random.randn(n1 * n2, 3),
+                         index=index, columns=['col1', 'col20', 'col3'])
+        step = int((n1 * n2 * 0.1))
+        for col in range(len(data.columns)):
+            idx = col
+            while (idx < len(data)):
+                data.set_value(data.index[idx], data.columns[col], np.nan)
+                idx += step
+        self.df = data
+        self.f_fillna = (lambda x: x.fillna(method='pad'))

-    def setup(self):
         np.random.seed(2718281)
-        self.n = 20000
-        self.df = DataFrame(np.random.randint(1, self.n, (self.n, 3)), columns=['jim', 'joe', 'jolie'])
+        n = 20000
+        self.df1 = DataFrame(np.random.randint(1, n, (n, 3)),
+                             columns=['jim', 'joe', 'jolie'])
+        self.df2 = self.df1.copy()
+        self.df2['jim'] = self.df2['joe']

-    def time_groupby_transform_multi_key1(self):
-        self.df.groupby(['jim', 'joe'])['jolie'].transform('max')
+        self.df3 = DataFrame(np.random.randint(1, (n // 10), (n, 3)),
+                             columns=['jim', 'joe', 'jolie'])
+        self.df4 = self.df3.copy()
+        self.df4['jim'] = self.df4['joe']

-class groupby_transform_multi_key2(object):
-    goal_time = 0.2
+    def time_transform_func(self):
+        self.df.groupby(level='lev2').transform(self.f_fillna)

-    def setup(self):
-        np.random.seed(2718281)
-        self.n = 20000
-        self.df = DataFrame(np.random.randint(1, self.n, (self.n, 3)), columns=['jim', 'joe', 'jolie'])
-        self.df['jim'] = self.df['joe']
+    def time_transform_ufunc(self):
+        self.df.groupby(level='lev1').transform(np.max)

-    def time_groupby_transform_multi_key2(self):
-        self.df.groupby(['jim', 'joe'])['jolie'].transform('max')
+    def time_transform_multi_key1(self):
+        self.df1.groupby(['jim', 'joe'])['jolie'].transform('max')

+    def time_transform_multi_key2(self):
+        self.df2.groupby(['jim', 'joe'])['jolie'].transform('max')

+    def time_transform_multi_key3(self):
+        self.df3.groupby(['jim', 'joe'])['jolie'].transform('max')

-class groupby_transform_multi_key3(object):
-    goal_time = 0.2
+    def time_transform_multi_key4(self):
+        self.df4.groupby(['jim', 'joe'])['jolie'].transform('max')

-    def setup(self):
-        np.random.seed(2718281)
-        self.n = 200000
-        self.df = DataFrame(np.random.randint(1, (self.n / 10), (self.n, 3)), columns=['jim', 'joe', 'jolie'])

-    def time_groupby_transform_multi_key3(self):
-        self.df.groupby(['jim', 'joe'])['jolie'].transform('max')

-class groupby_transform_multi_key4(object):
-    goal_time = 0.2
+np.random.seed(0)
+N = 120000
+N_TRANSITIONS = 1400
+transition_points = np.random.permutation(np.arange(N))[:N_TRANSITIONS]
+transition_points.sort()
+transitions = np.zeros((N,), dtype=np.bool)
+transitions[transition_points] = True
+g = transitions.cumsum()
+df = DataFrame({'signal': np.random.rand(N), })
+

-    def setup(self):
-        np.random.seed(2718281)
-        self.n = 200000
-        self.df = DataFrame(np.random.randint(1, (self.n / 10), (self.n, 3)), columns=['jim', 'joe', 'jolie'])
-        self.df['jim'] = self.df['joe']

-    def time_groupby_transform_multi_key4(self):
-        self.df.groupby(['jim', 'joe'])['jolie'].transform('max')


 class groupby_transform_series(object):
@@ -737,14 +722,12 @@ class groupby_transform_series(object):

     def setup(self):
         np.random.seed(0)
-        self.N = 120000
-        self.N_TRANSITIONS = 1400
-        self.transition_points = np.random.permutation(np.arange(self.N))[:self.N_TRANSITIONS]
-        self.transition_points.sort()
-        self.transitions = np.zeros((self.N,), dtype=np.bool)
-        self.transitions[self.transition_points] = True
-        self.g = self.transitions.cumsum()
-        self.df = DataFrame({'signal': np.random.rand(self.N), })
+        N = 120000
+        transition_points = np.sort(np.random.choice(np.arange(N), 1400))
+        transitions = np.zeros((N,), dtype=np.bool)
+        transitions[transition_points] = True
+        self.g = transitions.cumsum()
+        self.df = DataFrame({'signal': np.random.rand(N)})

     def time_groupby_transform_series(self):
         self.df['signal'].groupby(self.g).transform(np.mean)
@@ -755,38 +738,26 @@ class groupby_transform_series2(object):

     def setup(self):
         np.random.seed(0)
-        self.df = DataFrame({'id': (np.arange(100000) / 3), 'val': np.random.randn(100000), })
+        self.df = DataFrame({'key': (np.arange(100000) // 3),
+                             'val': np.random.randn(100000)})

-    def time_groupby_transform_series2(self):
-        self.df.groupby('id')['val'].transform(np.mean)
+        self.df_nans = pd.DataFrame({'key': np.repeat(np.arange(1000), 10),
+                                     'B': np.nan,
+                                     'C': np.nan})
+        self.df_nans.ix[4::10, 'B':'C'] = 5

+    def time_transform_series2(self):
+        self.df.groupby('key')['val'].transform(np.mean)

-class groupby_transform_dataframe(object):
-    # GH 12737
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = pd.DataFrame({'group': np.repeat(np.arange(1000), 10),
-                                'B': np.nan,
-                                'C': np.nan})
-        self.df.ix[4::10, 'B':'C'] = 5
-
-    def time_groupby_transform_dataframe(self):
-        self.df.groupby('group').transform('first')
-
-
-class groupby_transform_cythonized(object):
-    goal_time = 0.2
-
-    def setup(self):
-        np.random.seed(0)
-        self.df = DataFrame({'id': (np.arange(100000) / 3), 'val': np.random.randn(100000), })
+    def time_cumprod(self):
+        self.df.groupby('key').cumprod()

-    def time_groupby_transform_cumprod(self):
-        self.df.groupby('id').cumprod()
+    def time_cumsum(self):
+        self.df.groupby('key').cumsum()

-    def time_groupby_transform_cumsum(self):
-        self.df.groupby('id').cumsum()
+    def time_shift(self):
+        self.df.groupby('key').shift()

-    def time_groupby_transform_shift(self):
-        self.df.groupby('id').shift()
+    def time_transform_dataframe(self):
+        # GH 12737
+        self.df_nans.groupby('key').transform('first')
diff --git a/asv_bench/benchmarks/hdfstore_bench.py b/asv_bench/benchmarks/hdfstore_bench.py
index 659fc4941da54..dc72f3d548aaf 100644
--- a/asv_bench/benchmarks/hdfstore_bench.py
+++ b/asv_bench/benchmarks/hdfstore_bench.py
@@ -2,140 +2,41 @@
 import os


-class query_store_table(object):
+class HDF5(object):
     goal_time = 0.2

     def setup(self):
-        self.f = '__test__.h5'
-        self.index = date_range('1/1/2000', periods=25000)
-        self.df = DataFrame({'float1': randn(25000), 'float2': randn(25000), }, index=self.index)
-        self.remove(self.f)
-        self.store = HDFStore(self.f)
-        self.store.append('df12', self.df)
-
-    def time_query_store_table(self):
-        self.store.select('df12', [('index', '>', self.df.index[10000]), ('index', '<', self.df.index[15000])])
-
-    def teardown(self):
-        self.store.close()
-
-    def remove(self, f):
-        try:
-            os.remove(self.f)
-        except:
-            pass
-
-
-class query_store_table_wide(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.f = '__test__.h5'
-        self.index = date_range('1/1/2000', periods=25000)
-        self.df = DataFrame(np.random.randn(25000, 100), index=self.index)
-        self.remove(self.f)
-        self.store = HDFStore(self.f)
-        self.store.append('df11', self.df)
-
-    def time_query_store_table_wide(self):
-        self.store.select('df11', [('index', '>', self.df.index[10000]), ('index', '<', self.df.index[15000])])
-
-    def teardown(self):
-        self.store.close()
-
-    def remove(self, f):
-        try:
-            os.remove(self.f)
-        except:
-            pass
-
-
-class read_store(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.f = '__test__.h5'
         self.index = tm.makeStringIndex(25000)
-        self.df = DataFrame({'float1': randn(25000), 'float2': randn(25000), }, index=self.index)
-        self.remove(self.f)
-        self.store = HDFStore(self.f)
-        self.store.put('df1', self.df)
+        self.df = DataFrame({'float1': randn(25000), 'float2': randn(25000),},
+                            index=self.index)

-    def time_read_store(self):
-        self.store.get('df1')
+        self.df_mixed = DataFrame(
+            {'float1': randn(25000), 'float2': randn(25000),
+             'string1': (['foo'] * 25000),
+             'bool1': ([True] * 25000),
+             'int1': np.random.randint(0, 250000, size=25000),},
+            index=self.index)

-    def teardown(self):
-        self.store.close()
+        self.df_wide = DataFrame(np.random.randn(25000, 100))

-    def remove(self, f):
-        try:
-            os.remove(self.f)
-        except:
-            pass
+        self.df2 = DataFrame({'float1': randn(25000), 'float2': randn(25000)},
+                             index=date_range('1/1/2000', periods=25000))
+        self.df_wide2 = DataFrame(np.random.randn(25000, 100),
+                                  index=date_range('1/1/2000', periods=25000))

+        self.df_dc = DataFrame(np.random.randn(10000, 10),
+                               columns=[('C%03d' % i) for i in range(10)])

-class read_store_mixed(object):
-    goal_time = 0.2
-
-    def setup(self):
         self.f = '__test__.h5'
-        self.index = tm.makeStringIndex(25000)
-        self.df = DataFrame({'float1': randn(25000), 'float2': randn(25000), 'string1': (['foo'] * 25000), 'bool1': ([True] * 25000), 'int1': np.random.randint(0, 250000, size=25000), }, index=self.index)
         self.remove(self.f)
-        self.store = HDFStore(self.f)
-        self.store.put('df3', self.df)

-    def time_read_store_mixed(self):
-        self.store.get('df3')
-
-    def teardown(self):
-        self.store.close()
-
-    def remove(self, f):
-        try:
-            os.remove(self.f)
-        except:
-            pass
-
-
-class read_store_table(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.f = '__test__.h5'
-        self.index = tm.makeStringIndex(25000)
-        self.df = DataFrame({'float1': randn(25000), 'float2': randn(25000), }, index=self.index)
-        self.remove(self.f)
         self.store = HDFStore(self.f)
-        self.store.append('df7', self.df)
-
-    def time_read_store_table(self):
-        self.store.select('df7')
-
-    def teardown(self):
-        self.store.close()
-
-    def remove(self, f):
-        try:
-            os.remove(self.f)
-        except:
-            pass
-
-
-class read_store_table_mixed(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.f = '__test__.h5'
-        self.N = 10000
-        self.index = tm.makeStringIndex(self.N)
-        self.df = DataFrame({'float1': randn(self.N), 'float2': randn(self.N), 'string1': (['foo'] * self.N), 'bool1': ([True] * self.N), 'int1': np.random.randint(0, self.N, size=self.N), }, index=self.index)
-        self.remove(self.f)
-        self.store = HDFStore(self.f)
-        self.store.append('df5', self.df)
-
-    def time_read_store_table_mixed(self):
-        self.store.select('df5')
+        self.store.put('fixed', self.df)
+        self.store.put('fixed_mixed', self.df_mixed)
+        self.store.append('table', self.df2)
+        self.store.append('table_mixed', self.df_mixed)
+        self.store.append('table_wide', self.df_wide)
+        self.store.append('table_wide2', self.df_wide2)

     def teardown(self):
         self.store.close()
@@ -146,156 +47,62 @@ def remove(self, f):
         except:
             pass

+    def time_read_store(self):
+        self.store.get('fixed')

-class read_store_table_panel(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.f = '__test__.h5'
-        self.p = Panel(randn(20, 1000, 25), items=[('Item%03d' % i) for i in range(20)], major_axis=date_range('1/1/2000', periods=1000), minor_axis=[('E%03d' % i) for i in range(25)])
-        self.remove(self.f)
-        self.store = HDFStore(self.f)
-        self.store.append('p1', self.p)
-
-    def time_read_store_table_panel(self):
-        self.store.select('p1')
-
-    def teardown(self):
-        self.store.close()
-
-    def remove(self, f):
-        try:
-            os.remove(self.f)
-        except:
-            pass
-
-
-class read_store_table_wide(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.f = '__test__.h5'
-        self.df = DataFrame(np.random.randn(25000, 100))
-        self.remove(self.f)
-        self.store = HDFStore(self.f)
-        self.store.append('df9', self.df)
-
-    def time_read_store_table_wide(self):
-        self.store.select('df9')
-
-    def teardown(self):
-        self.store.close()
-
-    def remove(self, f):
-        try:
-            os.remove(self.f)
-        except:
-            pass
-
-
-class write_store(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.f = '__test__.h5'
-        self.index = tm.makeStringIndex(25000)
-        self.df = DataFrame({'float1': randn(25000), 'float2': randn(25000), }, index=self.index)
-        self.remove(self.f)
-        self.store = HDFStore(self.f)
+    def time_read_store_mixed(self):
+        self.store.get('fixed_mixed')

     def time_write_store(self):
-        self.store.put('df2', self.df)
-
-    def teardown(self):
-        self.store.close()
-
-    def remove(self, f):
-        try:
-            os.remove(self.f)
-        except:
-            pass
-
-
-class write_store_mixed(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.f = '__test__.h5'
-        self.index = tm.makeStringIndex(25000)
-        self.df = DataFrame({'float1': randn(25000), 'float2': randn(25000), 'string1': (['foo'] * 25000), 'bool1': ([True] * 25000), 'int1': np.random.randint(0, 250000, size=25000), }, index=self.index)
-        self.remove(self.f)
-        self.store = HDFStore(self.f)
+        self.store.put('fixed_write', self.df)

     def time_write_store_mixed(self):
-        self.store.put('df4', self.df)
-
-    def teardown(self):
-        self.store.close()
-
-    def remove(self, f):
-        try:
-            os.remove(self.f)
-        except:
-            pass
+        self.store.put('fixed_mixed_write', self.df_mixed)

+    def time_read_store_table_mixed(self):
+        self.store.select('table_mixed')

-class write_store_table(object):
-    goal_time = 0.2
+    def time_write_store_table_mixed(self):
+        self.store.append('table_mixed_write', self.df_mixed)

-    def setup(self):
-        self.f = '__test__.h5'
-        self.index = tm.makeStringIndex(25000)
-        self.df = DataFrame({'float1': randn(25000), 'float2': randn(25000), }, index=self.index)
-        self.remove(self.f)
-        self.store = HDFStore(self.f)
+    def time_read_store_table(self):
+        self.store.select('table')

     def time_write_store_table(self):
-        self.store.append('df8', self.df)
-
-    def teardown(self):
-        self.store.close()
-
-    def remove(self, f):
-        try:
-            os.remove(self.f)
-        except:
-            pass
-
+        self.store.append('table_write', self.df)

-class write_store_table_dc(object):
-    goal_time = 0.2
+    def time_read_store_table_wide(self):
+        self.store.select('table_wide')

-    def setup(self):
-        self.f = '__test__.h5'
-        self.df = DataFrame(np.random.randn(10000, 10), columns=[('C%03d' % i) for i in range(10)])
-        self.remove(self.f)
-        self.store = HDFStore(self.f)
+    def time_write_store_table_wide(self):
+        self.store.append('table_wide_write', self.df_wide)

     def time_write_store_table_dc(self):
-        self.store.append('df15', self.df, data_columns=True)
+        self.store.append('table_dc_write', self.df_dc, data_columns=True)

-    def teardown(self):
-        self.store.close()
+    def time_query_store_table_wide(self):
+        start = self.df_wide2.index[10000]
+        stop = self.df_wide2.index[15000]
+        self.store.select('table_wide', where="index > start and index < stop")

-    def remove(self, f):
-        try:
-            os.remove(self.f)
-        except:
-            pass
+    def time_query_store_table(self):
+        start = self.df2.index[10000]
+        stop = self.df2.index[15000]
+        self.store.select('table', where="index > start and index < stop")


-class write_store_table_mixed(object):
+class HDF5Panel(object):
     goal_time = 0.2

     def setup(self):
         self.f = '__test__.h5'
-        self.index = tm.makeStringIndex(25000)
-        self.df = DataFrame({'float1': randn(25000), 'float2': randn(25000), 'string1': (['foo'] * 25000), 'bool1': ([True] * 25000), 'int1': np.random.randint(0, 25000, size=25000), }, index=self.index)
+        self.p = Panel(randn(20, 1000, 25),
+                       items=[('Item%03d' % i) for i in range(20)],
+                       major_axis=date_range('1/1/2000', periods=1000),
+                       minor_axis=[('E%03d' % i) for i in range(25)])
         self.remove(self.f)
         self.store = HDFStore(self.f)
-
-    def time_write_store_table_mixed(self):
-        self.store.append('df6', self.df)
+        self.store.append('p1', self.p)

     def teardown(self):
         self.store.close()
@@ -306,46 +113,8 @@ def remove(self, f):
         except:
             pass

-
-class write_store_table_panel(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.f = '__test__.h5'
-        self.p = Panel(randn(20, 1000, 25), items=[('Item%03d' % i) for i in range(20)], major_axis=date_range('1/1/2000', periods=1000), minor_axis=[('E%03d' % i) for i in range(25)])
-        self.remove(self.f)
-        self.store = HDFStore(self.f)
+    def time_read_store_table_panel(self):
+        self.store.select('p1')

     def time_write_store_table_panel(self):
         self.store.append('p2', self.p)
-
-    def teardown(self):
-        self.store.close()
-
-    def remove(self, f):
-        try:
-            os.remove(self.f)
-        except:
-            pass
-
-
-class write_store_table_wide(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.f = '__test__.h5'
-        self.df = DataFrame(np.random.randn(25000, 100))
-        self.remove(self.f)
-        self.store = HDFStore(self.f)
-
-    def time_write_store_table_wide(self):
-        self.store.append('df10', self.df)
-
-    def teardown(self):
-        self.store.close()
-
-    def remove(self, f):
-        try:
-            os.remove(self.f)
-        except:
-            pass
diff --git a/asv_bench/benchmarks/index_object.py b/asv_bench/benchmarks/index_object.py
index 2c94f9b2b1e8c..3fb53ce9b3c98 100644
--- a/asv_bench/benchmarks/index_object.py
+++ b/asv_bench/benchmarks/index_object.py
@@ -1,102 +1,93 @@
 from .pandas_vb_common import *


-class datetime_index_intersection(object):
+class SetOperations(object):
     goal_time = 0.2

     def setup(self):
         self.rng = date_range('1/1/2000', periods=10000, freq='T')
         self.rng2 = self.rng[:(-1)]

-    def time_datetime_index_intersection(self):
+        # object index with datetime values
+        if (self.rng.dtype == object):
+            self.idx_rng = self.rng.view(Index)
+        else:
+            self.idx_rng = self.rng.asobject
+        self.idx_rng2 = self.idx_rng[:(-1)]
+
+        # other datetime
+        N = 100000
+        A = N - 20000
+        B = N + 20000
+        self.dtidx1 = DatetimeIndex(range(N))
+        self.dtidx2 = DatetimeIndex(range(A, B))
+        self.dtidx3 = DatetimeIndex(range(N, B))
+
+        # integer
+        self.N = 1000000
+        self.options = np.arange(self.N)
+        self.left = Index(
+            self.options.take(np.random.permutation(self.N)[:(self.N // 2)]))
+        self.right = Index(
+            self.options.take(np.random.permutation(self.N)[:(self.N // 2)]))
+
+        # strings
+        N = 10000
+        strs = tm.rands_array(10, N)
+        self.leftstr = Index(strs[:N * 2 // 3])
+        self.rightstr = Index(strs[N // 3:])
+
+    def time_datetime_intersection(self):
         self.rng.intersection(self.rng2)

-
-class datetime_index_repr(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.dr = pd.date_range('20000101', freq='D', periods=100000)
-
-    def time_datetime_index_repr(self):
-        self.dr._is_dates_only
-
-
-class datetime_index_union(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.rng = date_range('1/1/2000', periods=10000, freq='T')
-        self.rng2 = self.rng[:(-1)]
-
-    def time_datetime_index_union(self):
+    def time_datetime_union(self):
         self.rng.union(self.rng2)

+    def time_datetime_difference(self):
+        self.dtidx1.difference(self.dtidx2)

-class index_datetime_intersection(object):
-    goal_time = 0.2
+    def time_datetime_difference_disjoint(self):
+        self.dtidx1.difference(self.dtidx3)

-    def setup(self):
-        self.rng = DatetimeIndex(start='1/1/2000', periods=10000, freq=datetools.Minute())
-        if (self.rng.dtype == object):
-            self.rng = self.rng.view(Index)
-        else:
-            self.rng = self.rng.asobject
-        self.rng2 = self.rng[:(-1)]
+    def time_datetime_symmetric_difference(self):
+        self.dtidx1.symmetric_difference(self.dtidx2)

     def time_index_datetime_intersection(self):
-        self.rng.intersection(self.rng2)
-
-
-class index_datetime_union(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.rng = DatetimeIndex(start='1/1/2000', periods=10000, freq=datetools.Minute())
-        if (self.rng.dtype == object):
-            self.rng = self.rng.view(Index)
-        else:
-            self.rng = self.rng.asobject
-        self.rng2 = self.rng[:(-1)]
+        self.idx_rng.intersection(self.idx_rng2)

     def time_index_datetime_union(self):
-        self.rng.union(self.rng2)
+        self.idx_rng.union(self.idx_rng2)

+    def time_int64_intersection(self):
+        self.left.intersection(self.right)

-class index_datetime_set_difference(object):
-    goal_time = 0.2
+    def time_int64_union(self):
+        self.left.union(self.right)

-    def setup(self):
-        self.N = 100000
-        self.A = self.N - 20000
-        self.B = self.N + 20000
-        self.idx1 = DatetimeIndex(range(self.N))
-        self.idx2 = DatetimeIndex(range(self.A, self.B))
-        self.idx3 = DatetimeIndex(range(self.N, self.B))
+    def time_int64_difference(self):
+        self.left.difference(self.right)

-    def time_index_datetime_difference(self):
-        self.idx1.difference(self.idx2)
+    def time_int64_symmetric_difference(self):
+        self.left.symmetric_difference(self.right)

-    def time_index_datetime_difference_disjoint(self):
-        self.idx1.difference(self.idx3)
+    def time_str_difference(self):
+        self.leftstr.difference(self.rightstr)

-    def time_index_datetime_symmetric_difference(self):
-        self.idx1.symmetric_difference(self.idx2)
+    def time_str_symmetric_difference(self):
+        self.leftstr.symmetric_difference(self.rightstr)


-class index_float64_boolean_indexer(object):
+class Datetime(object):
     goal_time = 0.2

     def setup(self):
-        self.idx = tm.makeFloatIndex(1000000)
-        self.mask = ((np.arange(self.idx.size) % 3) == 0)
-        self.series_mask = Series(self.mask)
+        self.dr = pd.date_range('20000101', freq='D', periods=10000)

-    def time_index_float64_boolean_indexer(self):
-        self.idx[self.mask]
+    def time_is_dates_only(self):
+        self.dr._is_dates_only


-class index_float64_boolean_series_indexer(object):
+class Float64(object):
     goal_time = 0.2

     def setup(self):
@@ -104,141 +95,34 @@ def setup(self):
         self.mask = ((np.arange(self.idx.size) % 3) == 0)
         self.series_mask = Series(self.mask)

-    def time_index_float64_boolean_series_indexer(self):
-        self.idx[self.series_mask]
-
-
-class index_float64_construct(object):
-    goal_time = 0.2
-
-    def setup(self):
         self.baseidx = np.arange(1000000.0)

-    def time_index_float64_construct(self):
-        Index(self.baseidx)
-
+    def time_boolean_indexer(self):
+        self.idx[self.mask]

-class index_float64_div(object):
-    goal_time = 0.2
+    def time_boolean_series_indexer(self):
+        self.idx[self.series_mask]

-    def setup(self):
-        self.idx = tm.makeFloatIndex(1000000)
-        self.mask = ((np.arange(self.idx.size) % 3) == 0)
-        self.series_mask = Series(self.mask)
+    def time_construct(self):
+        Index(self.baseidx)

-    def time_index_float64_div(self):
+    def time_div(self):
         (self.idx / 2)

-
-class index_float64_get(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.idx = tm.makeFloatIndex(1000000)
-        self.mask = ((np.arange(self.idx.size) % 3) == 0)
-        self.series_mask = Series(self.mask)
-
-    def time_index_float64_get(self):
+    def time_get(self):
         self.idx[1]

-
-class index_float64_mul(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.idx = tm.makeFloatIndex(1000000)
-        self.mask = ((np.arange(self.idx.size) % 3) == 0)
-        self.series_mask = Series(self.mask)
-
-    def time_index_float64_mul(self):
+    def time_mul(self):
         (self.idx * 2)

-
-class index_float64_slice_indexer_basic(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.idx = tm.makeFloatIndex(1000000)
-        self.mask = ((np.arange(self.idx.size) % 3) == 0)
-        self.series_mask = Series(self.mask)
-
-    def time_index_float64_slice_indexer_basic(self):
+    def time_slice_indexer_basic(self):
         self.idx[:(-1)]

-
-class index_float64_slice_indexer_even(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.idx = tm.makeFloatIndex(1000000)
-        self.mask = ((np.arange(self.idx.size) % 3) == 0)
-        self.series_mask = Series(self.mask)
-
-    def time_index_float64_slice_indexer_even(self):
+    def time_slice_indexer_even(self):
         self.idx[::2]


-class index_int64_intersection(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 1000000
-        self.options = np.arange(self.N)
-        self.left = Index(self.options.take(np.random.permutation(self.N)[:(self.N // 2)]))
-        self.right = Index(self.options.take(np.random.permutation(self.N)[:(self.N // 2)]))
-
-    def time_index_int64_intersection(self):
-        self.left.intersection(self.right)
-
-
-class index_int64_union(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 1000000
-        self.options = np.arange(self.N)
-        self.left = Index(self.options.take(np.random.permutation(self.N)[:(self.N // 2)]))
-        self.right = Index(self.options.take(np.random.permutation(self.N)[:(self.N // 2)]))
-
-    def time_index_int64_union(self):
-        self.left.union(self.right)
-
-
-class index_int64_set_difference(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 500000
-        self.options = np.arange(self.N)
-        self.left = Index(self.options.take(
-            np.random.permutation(self.N)[:(self.N // 2)]))
-        self.right = Index(self.options.take(
-            np.random.permutation(self.N)[:(self.N // 2)]))
-
-    def time_index_int64_difference(self):
-        self.left.difference(self.right)
-
-    def time_index_int64_symmetric_difference(self):
-        self.left.symmetric_difference(self.right)
-
-
-class index_str_set_difference(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 10000
-        self.strs = tm.rands_array(10, self.N)
-        self.left = Index(self.strs[:self.N * 2 // 3])
-        self.right = Index(self.strs[self.N // 3:])
-
-    def time_str_difference(self):
-        self.left.difference(self.right)
-
-    def time_str_symmetric_difference(self):
-        self.left.symmetric_difference(self.right)
-
-
-class index_str_boolean_indexer(object):
+class StringIndex(object):
     goal_time = 0.2

     def setup(self):
@@ -246,47 +130,20 @@ def setup(self):
         self.mask = ((np.arange(1000000) % 3) == 0)
         self.series_mask = Series(self.mask)

-    def time_index_str_boolean_indexer(self):
+    def time_boolean_indexer(self):
         self.idx[self.mask]

-
-class index_str_boolean_series_indexer(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.idx = tm.makeStringIndex(1000000)
-        self.mask = ((np.arange(1000000) % 3) == 0)
-        self.series_mask = Series(self.mask)
-
-    def time_index_str_boolean_series_indexer(self):
+    def time_boolean_series_indexer(self):
         self.idx[self.series_mask]

-
-class index_str_slice_indexer_basic(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.idx = tm.makeStringIndex(1000000)
-        self.mask = ((np.arange(1000000) % 3) == 0)
-        self.series_mask = Series(self.mask)
-
-    def time_index_str_slice_indexer_basic(self):
+    def time_slice_indexer_basic(self):
         self.idx[:(-1)]

-
-class index_str_slice_indexer_even(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.idx = tm.makeStringIndex(1000000)
-        self.mask = ((np.arange(1000000) % 3) == 0)
-        self.series_mask = Series(self.mask)
-
-    def time_index_str_slice_indexer_even(self):
+    def time_slice_indexer_even(self):
         self.idx[::2]


-class multiindex_duplicated(object):
+class Multi1(object):
     goal_time = 0.2

     def setup(self):
@@ -295,21 +152,16 @@ def setup(self):
         self.labels = [np.random.choice(n, (k * n)) for lev in self.levels]
         self.mi = MultiIndex(levels=self.levels, labels=self.labels)

-    def time_multiindex_duplicated(self):
-        self.mi.duplicated()
-
-
-class multiindex_from_product(object):
-    goal_time = 0.2
-
-    def setup(self):
         self.iterables = [tm.makeStringIndex(10000), range(20)]

-    def time_multiindex_from_product(self):
+    def time_duplicated(self):
+        self.mi.duplicated()
+
+    def time_from_product(self):
         MultiIndex.from_product(self.iterables)


-class multiindex_sortlevel_int64(object):
+class Multi2(object):
     goal_time = 0.2

     def setup(self):
@@ -319,23 +171,22 @@ def setup(self):
         self.i = np.random.permutation(self.n)
         self.mi = MultiIndex.from_arrays([self.f(11), self.f(7), self.f(5), self.f(3), self.f(1)])[self.i]

-    def time_multiindex_sortlevel_int64(self):
+        self.a = np.repeat(np.arange(100), 1000)
+        self.b = np.tile(np.arange(1000), 100)
+        self.midx2 = MultiIndex.from_arrays([self.a, self.b])
+        self.midx2 = self.midx2.take(np.random.permutation(np.arange(100000)))
+
+    def time_sortlevel_int64(self):
         self.mi.sortlevel()

+    def time_sortlevel_zero(self):
+        self.midx2.sortlevel(0)

-class multiindex_with_datetime_level_full(object):
-    goal_time = 0.2
+    def time_sortlevel_one(self):
+        self.midx2.sortlevel(1)

-    def setup(self):
-        self.level1 = range(1000)
-        self.level2 = date_range(start='1/1/2012', periods=100)
-        self.mi = MultiIndex.from_product([self.level1, self.level2])

-    def time_multiindex_with_datetime_level_full(self):
-        self.mi.copy().values
-
-
-class multiindex_with_datetime_level_sliced(object):
+class Multi3(object):
     goal_time = 0.2

     def setup(self):
@@ -343,5 +194,8 @@ def setup(self):
         self.level2 = date_range(start='1/1/2012', periods=100)
         self.mi = MultiIndex.from_product([self.level1, self.level2])

-    def time_multiindex_with_datetime_level_sliced(self):
+    def time_datetime_level_values_full(self):
+        self.mi.copy().values
+
+    def time_datetime_level_values_sliced(self):
         self.mi[:10].values
diff --git a/asv_bench/benchmarks/indexing.py b/asv_bench/benchmarks/indexing.py
index 094ae23a92fad..8947a0fdd796c 100644
--- a/asv_bench/benchmarks/indexing.py
+++ b/asv_bench/benchmarks/indexing.py
@@ -1,208 +1,171 @@
 from .pandas_vb_common import *

-try:
-    import pandas.computation.expressions as expr
-except:
-    expr = None

-class dataframe_getitem_scalar(object):
+class Int64Indexing(object):
     goal_time = 0.2

     def setup(self):
-        self.index = tm.makeStringIndex(1000)
-        self.columns = tm.makeStringIndex(30)
-        self.df = DataFrame(np.random.rand(1000, 30), index=self.index, columns=self.columns)
-        self.idx = self.index[100]
-        self.col = self.columns[10]
-
-    def time_dataframe_getitem_scalar(self):
-        self.df[self.col][self.idx]
+        self.s = Series(np.random.rand(1000000))

+    def time_getitem_scalar(self):
+        self.s[800000]

-class series_get_value(object):
-    goal_time = 0.2
+    def time_getitem_slice(self):
+        self.s[:800000]

-    def setup(self):
-        self.index = tm.makeStringIndex(1000)
-        self.s = Series(np.random.rand(1000), index=self.index)
-        self.idx = self.index[100]
+    def time_getitem_list_like(self):
+        self.s[[800000]]

-    def time_series_get_value(self):
-        self.s.get_value(self.idx)
+    def time_getitem_array(self):
+        self.s[np.arange(10000)]

+    def time_iloc_array(self):
+        self.s.iloc[np.arange(10000)]

-class time_series_getitem_scalar(object):
-    goal_time = 0.2
+    def time_iloc_list_like(self):
+        self.s.iloc[[800000]]

-    def setup(self):
-        tm.N = 1000
-        self.ts = tm.makeTimeSeries()
-        self.dt = self.ts.index[500]
+    def time_iloc_scalar(self):
+        self.s.iloc[800000]

-    def time_time_series_getitem_scalar(self):
-        self.ts[self.dt]
+    def time_iloc_slice(self):
+        self.s.iloc[:800000]

+    def time_ix_array(self):
+        self.s.ix[np.arange(10000)]

-class frame_iloc_big(object):
-    goal_time = 0.2
+    def time_ix_list_like(self):
+        self.s.ix[[800000]]

-    def setup(self):
-        self.df = DataFrame(dict(A=(['foo'] * 1000000)))
+    def time_ix_scalar(self):
+        self.s.ix[800000]

-    def time_frame_iloc_big(self):
-        self.df.iloc[:100, 0]
+    def time_ix_slice(self):
+        self.s.ix[:800000]

+    def time_loc_array(self):
+        self.s.loc[np.arange(10000)]

-class frame_iloc_dups(object):
-    goal_time = 0.2
+    def time_loc_list_like(self):
+        self.s.loc[[800000]]

-    def setup(self):
-        self.df = DataFrame({'A': ([0.1] * 3000), 'B': ([1] * 3000), })
-        self.idx = (np.array(range(30)) * 99)
-        self.df2 = DataFrame({'A': ([0.1] * 1000), 'B': ([1] * 1000), })
-        self.df2 = concat([self.df2, (2 * self.df2), (3 * self.df2)])
+    def time_loc_scalar(self):
+        self.s.loc[800000]

-    def time_frame_iloc_dups(self):
-        self.df2.iloc[self.idx]
+    def time_loc_slice(self):
+        self.s.loc[:800000]


-class frame_loc_dups(object):
+class StringIndexing(object):
     goal_time = 0.2

     def setup(self):
-        self.df = DataFrame({'A': ([0.1] * 3000), 'B': ([1] * 3000), })
-        self.idx = (np.array(range(30)) * 99)
-        self.df2 = DataFrame({'A': ([0.1] * 1000), 'B': ([1] * 1000), })
-        self.df2 = concat([self.df2, (2 * self.df2), (3 * self.df2)])
-
-    def time_frame_loc_dups(self):
-        self.df2.loc[self.idx]
-
+        self.index = tm.makeStringIndex(1000000)
+        self.s = Series(np.random.rand(1000000), index=self.index)
+        self.lbl = self.s.index[800000]

-class frame_xs_mi_ix(object):
-    goal_time = 0.2
+    def time_getitem_label_slice(self):
+        self.s[:self.lbl]

-    def setup(self):
-        self.mi = MultiIndex.from_tuples([(x, y) for x in range(1000) for y in range(1000)])
-        self.s = Series(np.random.randn(1000000), index=self.mi)
-        self.df = DataFrame(self.s)
+    def time_getitem_pos_slice(self):
+        self.s[:800000]

-    def time_frame_xs_mi_ix(self):
-        self.df.ix[999]
+    def time_get_value(self):
+        self.s.get_value(self.lbl)


-class indexing_dataframe_boolean(object):
+class DatetimeIndexing(object):
     goal_time = 0.2

     def setup(self):
-        self.df = DataFrame(np.random.randn(50000, 100))
-        self.df2 = DataFrame(np.random.randn(50000, 100))
+        tm.N = 1000
+        self.ts = tm.makeTimeSeries()
+        self.dt = self.ts.index[500]

-    def time_indexing_dataframe_boolean(self):
-        (self.df > self.df2)
+    def time_getitem_scalar(self):
+        self.ts[self.dt]


-class indexing_dataframe_boolean_no_ne(object):
+class DataFrameIndexing(object):
     goal_time = 0.2

     def setup(self):
-        if (expr is None):
-            raise NotImplementedError
-        self.df = DataFrame(np.random.randn(50000, 100))
-        self.df2 = DataFrame(np.random.randn(50000, 100))
-        expr.set_use_numexpr(False)
-
-    def time_indexing_dataframe_boolean_no_ne(self):
-        (self.df > self.df2)
-
-    def teardown(self):
-        expr.set_use_numexpr(True)
-
-
-class indexing_dataframe_boolean_rows(object):
-    goal_time = 0.2
+        self.index = tm.makeStringIndex(1000)
+        self.columns = tm.makeStringIndex(30)
+        self.df = DataFrame(np.random.randn(1000, 30), index=self.index,
+                            columns=self.columns)
+        self.idx = self.index[100]
+        self.col = self.columns[10]

-    def setup(self):
-        self.df = DataFrame(np.random.randn(10000, 4), columns=['A', 'B', 'C', 'D'])
-        self.indexer = (self.df['B'] > 0)
+        self.df2 = DataFrame(np.random.randn(10000, 4),
+                             columns=['A', 'B', 'C', 'D'])
+        self.indexer = (self.df2['B'] > 0)
         self.obj_indexer = self.indexer.astype('O')

-    def time_indexing_dataframe_boolean_rows(self):
-        self.df[self.indexer]
+        # dupes
+        self.idx_dupe = (np.array(range(30)) * 99)
+        self.df3 = DataFrame({'A': ([0.1] * 1000), 'B': ([1] * 1000),})
+        self.df3 = concat([self.df3, (2 * self.df3), (3 * self.df3)])

+        self.df_big = DataFrame(dict(A=(['foo'] * 1000000)))

-class indexing_dataframe_boolean_rows_object(object):
-    goal_time = 0.2
+    def time_get_value(self):
+        self.df.get_value(self.idx, self.col)

-    def setup(self):
-        self.df = DataFrame(np.random.randn(10000, 4), columns=['A', 'B', 'C', 'D'])
-        self.indexer = (self.df['B'] > 0)
-        self.obj_indexer = self.indexer.astype('O')
+    def time_get_value_ix(self):
+        self.df.ix[(self.idx, self.col)]

-    def time_indexing_dataframe_boolean_rows_object(self):
-        self.df[self.obj_indexer]
+    def time_getitem_scalar(self):
+        self.df[self.col][self.idx]

+    def time_boolean_rows(self):
+        self.df2[self.indexer]

-class indexing_dataframe_boolean_st(object):
-    goal_time = 0.2
+    def time_boolean_rows_object(self):
+        self.df2[self.obj_indexer]

-    def setup(self):
-        if (expr is None):
-            raise NotImplementedError
-        self.df = DataFrame(np.random.randn(50000, 100))
-        self.df2 = DataFrame(np.random.randn(50000, 100))
-        expr.set_numexpr_threads(1)
+    def time_iloc_dups(self):
+        self.df3.iloc[self.idx_dupe]

-    def time_indexing_dataframe_boolean_st(self):
-        (self.df > self.df2)
+    def time_loc_dups(self):
+        self.df3.loc[self.idx_dupe]

-    def teardown(self):
-        expr.set_numexpr_threads()
+    def time_iloc_big(self):
+        self.df_big.iloc[:100, 0]


-class indexing_frame_get_value(object):
+class IndexingMethods(object):
+    # GH 13166
     goal_time = 0.2

     def setup(self):
-        self.index = tm.makeStringIndex(1000)
-        self.columns = tm.makeStringIndex(30)
-        self.df = DataFrame(np.random.randn(1000, 30), index=self.index, columns=self.columns)
-        self.idx = self.index[100]
-        self.col = self.columns[10]
-
-    def time_indexing_frame_get_value(self):
-        self.df.get_value(self.idx, self.col)
+        a = np.arange(100000)
+        self.ind = pd.Float64Index(a * 4.8000000418824129e-08)

+        self.s = Series(np.random.rand(100000))
+        self.ts = Series(np.random.rand(100000),
+                         index=date_range('2011-01-01', freq='S', periods=100000))
+        self.indexer = ([True, False, True, True, False] * 20000)

-class indexing_frame_get_value_ix(object):
-    goal_time = 0.2
+    def time_get_loc_float(self):
+        self.ind.get_loc(0)

-    def setup(self):
-        self.index = tm.makeStringIndex(1000)
-        self.columns = tm.makeStringIndex(30)
-        self.df = DataFrame(np.random.randn(1000, 30), index=self.index, columns=self.columns)
-        self.idx = self.index[100]
-        self.col = self.columns[10]
+    def time_take_dtindex(self):
+        self.ts.take(self.indexer)

-    def time_indexing_frame_get_value_ix(self):
-        self.df.ix[(self.idx, self.col)]
+    def time_take_intindex(self):
+        self.s.take(self.indexer)


-class indexing_panel_subset(object):
+class MultiIndexing(object):
     goal_time = 0.2

     def setup(self):
-        self.p = Panel(np.random.randn(100, 100, 100))
-        self.inds = range(0, 100, 10)
-
-    def time_indexing_panel_subset(self):
-        self.p.ix[(self.inds, self.inds, self.inds)]
-
-
-class multiindex_slicers(object):
-    goal_time = 0.2
+        self.mi = MultiIndex.from_tuples([(x, y) for x in range(1000) for y in range(1000)])
+        self.s = Series(np.random.randn(1000000), index=self.mi)
+        self.df = DataFrame(self.s)

-    def setup(self):
+        # slicers
         np.random.seed(1234)
         self.idx = pd.IndexSlice
         self.n = 100000
@@ -222,261 +185,69 @@ def setup(self):
         self.eps_C = 5
         self.eps_D = 5000
         self.mdt2 = self.mdt.set_index(['A', 'B', 'C', 'D']).sortlevel()
+        self.miint = MultiIndex.from_product(
+            [np.arange(1000),
+             np.arange(1000)], names=['one', 'two'])

-    def time_multiindex_slicers(self):
-        self.mdt2.loc[self.idx[(self.test_A - self.eps_A):(self.test_A + self.eps_A), (self.test_B - self.eps_B):(self.test_B + self.eps_B), (self.test_C - self.eps_C):(self.test_C + self.eps_C), (self.test_D - self.eps_D):(self.test_D + self.eps_D)], :]
-
-
-class series_getitem_array(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_getitem_array(self):
-        self.s[np.arange(10000)]
-
-
-class series_getitem_label_slice(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.index = tm.makeStringIndex(1000000)
-        self.s = Series(np.random.rand(1000000), index=self.index)
-        self.lbl = self.s.index[800000]
-
-    def time_series_getitem_label_slice(self):
-        self.s[:self.lbl]
-
-
-class series_getitem_list_like(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_getitem_list_like(self):
-        self.s[[800000]]
-
-
-class series_getitem_pos_slice(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.index = tm.makeStringIndex(1000000)
-        self.s = Series(np.random.rand(1000000), index=self.index)
-
-    def time_series_getitem_pos_slice(self):
-        self.s[:800000]
+        import string
+        self.mistring = MultiIndex.from_product(
+            [np.arange(1000),
+             np.arange(20), list(string.ascii_letters)],
+            names=['one', 'two', 'three'])

+    def time_series_xs_mi_ix(self):
+        self.s.ix[999]

-class series_getitem_scalar(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_getitem_scalar(self):
-        self.s[800000]
-
-
-class series_getitem_slice(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_getitem_slice(self):
-        self.s[:800000]
-
-
-class series_iloc_array(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_iloc_array(self):
-        self.s.iloc[np.arange(10000)]
-
-
-class series_iloc_list_like(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_iloc_list_like(self):
-        self.s.iloc[[800000]]
-
-
-class series_iloc_scalar(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_iloc_scalar(self):
-        self.s.iloc[800000]
-
-
-class series_iloc_slice(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_iloc_slice(self):
-        self.s.iloc[:800000]
-
-
-class series_ix_array(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_ix_array(self):
-        self.s.ix[np.arange(10000)]
-
-
-class series_ix_list_like(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_ix_list_like(self):
-        self.s.ix[[800000]]
-
-
-class series_ix_scalar(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_ix_scalar(self):
-        self.s.ix[800000]
-
-
-class series_ix_slice(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_ix_slice(self):
-        self.s.ix[:800000]
-
-
-class series_loc_array(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_loc_array(self):
-        self.s.loc[np.arange(10000)]
-
-
-class series_loc_list_like(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_loc_list_like(self):
-        self.s.loc[[800000]]
-
-
-class series_loc_scalar(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_loc_scalar(self):
-        self.s.loc[800000]
-
-
-class series_loc_slice(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(1000000))
-
-    def time_series_loc_slice(self):
-        self.s.loc[:800000]
-
-
-class series_take_dtindex(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(100000))
-        self.ts = Series(np.random.rand(100000), index=date_range('2011-01-01', freq='S', periods=100000))
-        self.indexer = ([True, False, True, True, False] * 20000)
-
-    def time_series_take_dtindex(self):
-        self.ts.take(self.indexer)
-
-
-class series_take_intindex(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.s = Series(np.random.rand(100000))
-        self.ts = Series(np.random.rand(100000), index=date_range('2011-01-01', freq='S', periods=100000))
-        self.indexer = ([True, False, True, True, False] * 20000)
-
-    def time_series_take_intindex(self):
-        self.s.take(self.indexer)
+    def time_frame_xs_mi_ix(self):
+        self.df.ix[999]

+    def time_multiindex_slicers(self):
+        self.mdt2.loc[self.idx[
+            (self.test_A - self.eps_A):(self.test_A + self.eps_A),
+            (self.test_B - self.eps_B):(self.test_B + self.eps_B),
+            (self.test_C - self.eps_C):(self.test_C + self.eps_C),
+            (self.test_D - self.eps_D):(self.test_D + self.eps_D)], :]

-class series_xs_mi_ix(object):
-    goal_time = 0.2
+    def time_multiindex_get_indexer(self):
+        self.miint.get_indexer(
+            np.array([(0, 10), (0, 11), (0, 12),
+                      (0, 13), (0, 14), (0, 15),
+                      (0, 16), (0, 17), (0, 18),
+                      (0, 19)], dtype=object))

-    def setup(self):
-        self.mi = MultiIndex.from_tuples([(x, y) for x in range(1000) for y in range(1000)])
-        self.s = Series(np.random.randn(1000000), index=self.mi)
+    def time_multiindex_string_get_loc(self):
+        self.mistring.get_loc((999, 19, 'Z'))

-    def time_series_xs_mi_ix(self):
-        self.s.ix[999]
+    def time_is_monotonic(self):
+        self.miint.is_monotonic


-class sort_level_one(object):
+class IntervalIndexing(object):
     goal_time = 0.2

     def setup(self):
-        self.a = np.repeat(np.arange(100), 1000)
-        self.b = np.tile(np.arange(1000), 100)
-        self.midx = MultiIndex.from_arrays([self.a, self.b])
-        self.midx = self.midx.take(np.random.permutation(np.arange(100000)))
+        self.monotonic = Series(np.arange(1000000),
+                                index=IntervalIndex.from_breaks(np.arange(1000001)))

-    def time_sort_level_one(self):
-        self.midx.sortlevel(1)
+    def time_getitem_scalar(self):
+        self.monotonic[80000]

+    def time_loc_scalar(self):
+        self.monotonic.loc[80000]

-class sort_level_zero(object):
-    goal_time = 0.2
+    def time_getitem_list(self):
+        self.monotonic[80000:]

-    def setup(self):
-        self.a = np.repeat(np.arange(100), 1000)
-        self.b = np.tile(np.arange(1000), 100)
-        self.midx = MultiIndex.from_arrays([self.a, self.b])
-        self.midx = self.midx.take(np.random.permutation(np.arange(100000)))
+    def time_loc_list(self):
+        self.monotonic.loc[80000:]

-    def time_sort_level_zero(self):
-        self.midx.sortlevel(0)

-class float_loc(object):
-    # GH 13166
+class PanelIndexing(object):
     goal_time = 0.2

     def setup(self):
-        a = np.arange(100000)
-
self.ind = pd.Float64Index(a * 4.8000000418824129e-08) + self.p = Panel(np.random.randn(100, 100, 100)) + self.inds = range(0, 100, 10) - def time_float_loc(self): - self.ind.get_loc(0) + def time_subset(self): + self.p.ix[(self.inds, self.inds, self.inds)] diff --git a/asv_bench/benchmarks/inference.py b/asv_bench/benchmarks/inference.py index 0f9689dadcbb0..dc1d6de73f8ae 100644 --- a/asv_bench/benchmarks/inference.py +++ b/asv_bench/benchmarks/inference.py @@ -2,143 +2,76 @@ import pandas as pd -class dtype_infer_datetime64(object): +class DtypeInfer(object): goal_time = 0.2 - def setup(self): - self.N = 500000 - self.df_int64 = DataFrame(dict(A=np.arange(self.N, dtype='int64'), B=np.arange(self.N, dtype='int64'))) - self.df_int32 = DataFrame(dict(A=np.arange(self.N, dtype='int32'), B=np.arange(self.N, dtype='int32'))) - self.df_uint32 = DataFrame(dict(A=np.arange(self.N, dtype='uint32'), B=np.arange(self.N, dtype='uint32'))) - self.df_float64 = DataFrame(dict(A=np.arange(self.N, dtype='float64'), B=np.arange(self.N, dtype='float64'))) - self.df_float32 = DataFrame(dict(A=np.arange(self.N, dtype='float32'), B=np.arange(self.N, dtype='float32'))) - self.df_datetime64 = DataFrame(dict(A=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'), B=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'))) - self.df_timedelta64 = DataFrame(dict(A=(self.df_datetime64['A'] - self.df_datetime64['B']), B=self.df_datetime64['B'])) - - def time_dtype_infer_datetime64(self): - (self.df_datetime64['A'] - self.df_datetime64['B']) - - -class dtype_infer_float32(object): - goal_time = 0.2 + # from GH 7332 def setup(self): self.N = 500000 - self.df_int64 = DataFrame(dict(A=np.arange(self.N, dtype='int64'), B=np.arange(self.N, dtype='int64'))) - self.df_int32 = DataFrame(dict(A=np.arange(self.N, dtype='int32'), B=np.arange(self.N, dtype='int32'))) - self.df_uint32 = DataFrame(dict(A=np.arange(self.N, dtype='uint32'), B=np.arange(self.N, dtype='uint32'))) - self.df_float64 = DataFrame(dict(A=np.arange(self.N, dtype='float64'), B=np.arange(self.N, dtype='float64'))) - self.df_float32 = DataFrame(dict(A=np.arange(self.N, dtype='float32'), B=np.arange(self.N, dtype='float32'))) - self.df_datetime64 = DataFrame(dict(A=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'), B=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'))) - self.df_timedelta64 = DataFrame(dict(A=(self.df_datetime64['A'] - self.df_datetime64['B']), B=self.df_datetime64['B'])) - - def time_dtype_infer_float32(self): - (self.df_float32['A'] + self.df_float32['B']) + self.df_int64 = DataFrame(dict(A=np.arange(self.N, dtype='int64'), + B=np.arange(self.N, dtype='int64'))) + self.df_int32 = DataFrame(dict(A=np.arange(self.N, dtype='int32'), + B=np.arange(self.N, dtype='int32'))) + self.df_uint32 = DataFrame(dict(A=np.arange(self.N, dtype='uint32'), + B=np.arange(self.N, dtype='uint32'))) + self.df_float64 = DataFrame(dict(A=np.arange(self.N, dtype='float64'), + B=np.arange(self.N, dtype='float64'))) + self.df_float32 = DataFrame(dict(A=np.arange(self.N, dtype='float32'), + B=np.arange(self.N, dtype='float32'))) + self.df_datetime64 = DataFrame(dict(A=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'), + B=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'))) + self.df_timedelta64 = DataFrame(dict(A=(self.df_datetime64['A'] - self.df_datetime64['B']), + B=self.df_datetime64['B'])) + + def time_int64(self): + (self.df_int64['A'] + self.df_int64['B']) + def time_int32(self): + (self.df_int32['A'] + 
self.df_int32['B']) -class dtype_infer_float64(object): - goal_time = 0.2 + def time_uint32(self): + (self.df_uint32['A'] + self.df_uint32['B']) - def setup(self): - self.N = 500000 - self.df_int64 = DataFrame(dict(A=np.arange(self.N, dtype='int64'), B=np.arange(self.N, dtype='int64'))) - self.df_int32 = DataFrame(dict(A=np.arange(self.N, dtype='int32'), B=np.arange(self.N, dtype='int32'))) - self.df_uint32 = DataFrame(dict(A=np.arange(self.N, dtype='uint32'), B=np.arange(self.N, dtype='uint32'))) - self.df_float64 = DataFrame(dict(A=np.arange(self.N, dtype='float64'), B=np.arange(self.N, dtype='float64'))) - self.df_float32 = DataFrame(dict(A=np.arange(self.N, dtype='float32'), B=np.arange(self.N, dtype='float32'))) - self.df_datetime64 = DataFrame(dict(A=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'), B=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'))) - self.df_timedelta64 = DataFrame(dict(A=(self.df_datetime64['A'] - self.df_datetime64['B']), B=self.df_datetime64['B'])) - - def time_dtype_infer_float64(self): + def time_float64(self): (self.df_float64['A'] + self.df_float64['B']) + def time_float32(self): + (self.df_float32['A'] + self.df_float32['B']) -class dtype_infer_int32(object): - goal_time = 0.2 - - def setup(self): - self.N = 500000 - self.df_int64 = DataFrame(dict(A=np.arange(self.N, dtype='int64'), B=np.arange(self.N, dtype='int64'))) - self.df_int32 = DataFrame(dict(A=np.arange(self.N, dtype='int32'), B=np.arange(self.N, dtype='int32'))) - self.df_uint32 = DataFrame(dict(A=np.arange(self.N, dtype='uint32'), B=np.arange(self.N, dtype='uint32'))) - self.df_float64 = DataFrame(dict(A=np.arange(self.N, dtype='float64'), B=np.arange(self.N, dtype='float64'))) - self.df_float32 = DataFrame(dict(A=np.arange(self.N, dtype='float32'), B=np.arange(self.N, dtype='float32'))) - self.df_datetime64 = DataFrame(dict(A=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'), B=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'))) - self.df_timedelta64 = DataFrame(dict(A=(self.df_datetime64['A'] - self.df_datetime64['B']), B=self.df_datetime64['B'])) - - def time_dtype_infer_int32(self): - (self.df_int32['A'] + self.df_int32['B']) - + def time_datetime64(self): + (self.df_datetime64['A'] - self.df_datetime64['B']) -class dtype_infer_int64(object): - goal_time = 0.2 + def time_timedelta64_1(self): + (self.df_timedelta64['A'] + self.df_timedelta64['B']) - def setup(self): - self.N = 500000 - self.df_int64 = DataFrame(dict(A=np.arange(self.N, dtype='int64'), B=np.arange(self.N, dtype='int64'))) - self.df_int32 = DataFrame(dict(A=np.arange(self.N, dtype='int32'), B=np.arange(self.N, dtype='int32'))) - self.df_uint32 = DataFrame(dict(A=np.arange(self.N, dtype='uint32'), B=np.arange(self.N, dtype='uint32'))) - self.df_float64 = DataFrame(dict(A=np.arange(self.N, dtype='float64'), B=np.arange(self.N, dtype='float64'))) - self.df_float32 = DataFrame(dict(A=np.arange(self.N, dtype='float32'), B=np.arange(self.N, dtype='float32'))) - self.df_datetime64 = DataFrame(dict(A=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'), B=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'))) - self.df_timedelta64 = DataFrame(dict(A=(self.df_datetime64['A'] - self.df_datetime64['B']), B=self.df_datetime64['B'])) - - def time_dtype_infer_int64(self): - (self.df_int64['A'] + self.df_int64['B']) + def time_timedelta64_2(self): + (self.df_timedelta64['A'] + self.df_timedelta64['A']) -class dtype_infer_timedelta64_1(object): +class to_numeric(object): goal_time = 0.2 
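+    # moved unchanged from benchmarks/miscellaneous.py, which is deleted later in this diff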
def setup(self): - self.N = 500000 - self.df_int64 = DataFrame(dict(A=np.arange(self.N, dtype='int64'), B=np.arange(self.N, dtype='int64'))) - self.df_int32 = DataFrame(dict(A=np.arange(self.N, dtype='int32'), B=np.arange(self.N, dtype='int32'))) - self.df_uint32 = DataFrame(dict(A=np.arange(self.N, dtype='uint32'), B=np.arange(self.N, dtype='uint32'))) - self.df_float64 = DataFrame(dict(A=np.arange(self.N, dtype='float64'), B=np.arange(self.N, dtype='float64'))) - self.df_float32 = DataFrame(dict(A=np.arange(self.N, dtype='float32'), B=np.arange(self.N, dtype='float32'))) - self.df_datetime64 = DataFrame(dict(A=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'), B=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'))) - self.df_timedelta64 = DataFrame(dict(A=(self.df_datetime64['A'] - self.df_datetime64['B']), B=self.df_datetime64['B'])) - - def time_dtype_infer_timedelta64_1(self): - (self.df_timedelta64['A'] + self.df_timedelta64['B']) - + self.n = 10000 + self.float = Series(np.random.randn(self.n * 100)) + self.numstr = self.float.astype('str') + self.str = Series(tm.makeStringIndex(self.n)) -class dtype_infer_timedelta64_2(object): - goal_time = 0.2 - - def setup(self): - self.N = 500000 - self.df_int64 = DataFrame(dict(A=np.arange(self.N, dtype='int64'), B=np.arange(self.N, dtype='int64'))) - self.df_int32 = DataFrame(dict(A=np.arange(self.N, dtype='int32'), B=np.arange(self.N, dtype='int32'))) - self.df_uint32 = DataFrame(dict(A=np.arange(self.N, dtype='uint32'), B=np.arange(self.N, dtype='uint32'))) - self.df_float64 = DataFrame(dict(A=np.arange(self.N, dtype='float64'), B=np.arange(self.N, dtype='float64'))) - self.df_float32 = DataFrame(dict(A=np.arange(self.N, dtype='float32'), B=np.arange(self.N, dtype='float32'))) - self.df_datetime64 = DataFrame(dict(A=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'), B=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'))) - self.df_timedelta64 = DataFrame(dict(A=(self.df_datetime64['A'] - self.df_datetime64['B']), B=self.df_datetime64['B'])) - - def time_dtype_infer_timedelta64_2(self): - (self.df_timedelta64['A'] + self.df_timedelta64['A']) + def time_from_float(self): + pd.to_numeric(self.float) + def time_from_numeric_str(self): + pd.to_numeric(self.numstr) -class dtype_infer_uint32(object): - goal_time = 0.2 + def time_from_str_ignore(self): + pd.to_numeric(self.str, errors='ignore') - def setup(self): - self.N = 500000 - self.df_int64 = DataFrame(dict(A=np.arange(self.N, dtype='int64'), B=np.arange(self.N, dtype='int64'))) - self.df_int32 = DataFrame(dict(A=np.arange(self.N, dtype='int32'), B=np.arange(self.N, dtype='int32'))) - self.df_uint32 = DataFrame(dict(A=np.arange(self.N, dtype='uint32'), B=np.arange(self.N, dtype='uint32'))) - self.df_float64 = DataFrame(dict(A=np.arange(self.N, dtype='float64'), B=np.arange(self.N, dtype='float64'))) - self.df_float32 = DataFrame(dict(A=np.arange(self.N, dtype='float32'), B=np.arange(self.N, dtype='float32'))) - self.df_datetime64 = DataFrame(dict(A=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'), B=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'))) - self.df_timedelta64 = DataFrame(dict(A=(self.df_datetime64['A'] - self.df_datetime64['B']), B=self.df_datetime64['B'])) - - def time_dtype_infer_uint32(self): - (self.df_uint32['A'] + self.df_uint32['B']) + def time_from_str_coerce(self): + pd.to_numeric(self.str, errors='coerce') -class to_numeric(object): +class to_numeric_downcast(object): param_names = ['dtype', 'downcast'] params = 
[['string-float', 'string-int', 'string-nint', 'datetime64', @@ -146,14 +79,15 @@ class to_numeric(object): [None, 'integer', 'signed', 'unsigned', 'float']] N = 500000 + N2 = int(N / 2) data_dict = { - 'string-int': (['1'] * (N / 2)) + ([2] * (N / 2)), - 'string-nint': (['-1'] * (N / 2)) + ([2] * (N / 2)), + 'string-int': (['1'] * N2) + ([2] * N2), + 'string-nint': (['-1'] * N2) + ([2] * N2), 'datetime64': np.repeat(np.array(['1970-01-01', '1970-01-02'], dtype='datetime64[D]'), N), - 'string-float': (['1.1'] * (N / 2)) + ([2] * (N / 2)), - 'int-list': ([1] * (N / 2)) + ([2] * (N / 2)), + 'string-float': (['1.1'] * N2) + ([2] * N2), + 'int-list': ([1] * N2) + ([2] * N2), 'int32': np.repeat(np.int32(1), N) } @@ -162,3 +96,22 @@ def setup(self, dtype, downcast): def time_downcast(self, dtype, downcast): pd.to_numeric(self.data, downcast=downcast) + + +class MaybeConvertNumeric(object): + + def setup(self): + n = 1000000 + arr = np.repeat([2**63], n) + arr = arr + np.arange(n).astype('uint64') + arr = np.array([arr[i] if i%2 == 0 else + str(arr[i]) for i in range(n)], + dtype=object) + + arr[-1] = -1 + self.data = arr + self.na_values = set() + + def time_convert(self): + lib.maybe_convert_numeric(self.data, self.na_values, + coerce_numeric=False) diff --git a/asv_bench/benchmarks/io_bench.py b/asv_bench/benchmarks/io_bench.py index 0f15ab6e5e142..52064d2cdb8a2 100644 --- a/asv_bench/benchmarks/io_bench.py +++ b/asv_bench/benchmarks/io_bench.py @@ -128,6 +128,29 @@ def time_read_parse_dates_iso8601(self): read_csv(StringIO(self.data), header=None, names=['foo'], parse_dates=['foo']) +class read_uint64_integers(object): + goal_time = 0.2 + + def setup(self): + self.na_values = [2**63 + 500] + + self.arr1 = np.arange(10000).astype('uint64') + 2**63 + self.data1 = '\n'.join(map(lambda x: str(x), self.arr1)) + + self.arr2 = self.arr1.copy().astype(object) + self.arr2[500] = -1 + self.data2 = '\n'.join(map(lambda x: str(x), self.arr2)) + + def time_read_uint64(self): + read_csv(StringIO(self.data1), header=None) + + def time_read_uint64_neg_values(self): + read_csv(StringIO(self.data2), header=None) + + def time_read_uint64_na_values(self): + read_csv(StringIO(self.data1), header=None, na_values=self.na_values) + + class write_csv_standard(object): goal_time = 0.2 @@ -153,7 +176,7 @@ def setup(self, compression, engine): # The Python 2 C parser can't read bz2 from open files. raise NotImplementedError try: - import boto + import s3fs except ImportError: - # Skip these benchmarks if `boto` is not installed. + # Skip these benchmarks if `s3fs` is not installed. 
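+                # (s3fs replaced boto as the S3 dependency checked here)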
raise NotImplementedError diff --git a/asv_bench/benchmarks/io_sql.py b/asv_bench/benchmarks/io_sql.py index c583ac1768c90..ec855e5d33525 100644 --- a/asv_bench/benchmarks/io_sql.py +++ b/asv_bench/benchmarks/io_sql.py @@ -4,121 +4,29 @@ from sqlalchemy import create_engine -class sql_datetime_read_and_parse_sqlalchemy(object): - goal_time = 0.2 - - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.df = DataFrame({'float': randn(10000), 'datetime': date_range('2000-01-01', periods=10000, freq='s'), }) - self.df['datetime_string'] = self.df['datetime'].map(str) - self.df.to_sql('test_type', self.engine, if_exists='replace') - self.df[['float', 'datetime_string']].to_sql('test_type', self.con, if_exists='replace') +#------------------------------------------------------------------------------- +# to_sql - def time_sql_datetime_read_and_parse_sqlalchemy(self): - read_sql_table('test_type', self.engine, columns=['datetime_string'], parse_dates=['datetime_string']) - - -class sql_datetime_read_as_native_sqlalchemy(object): +class WriteSQL(object): goal_time = 0.2 def setup(self): self.engine = create_engine('sqlite:///:memory:') self.con = sqlite3.connect(':memory:') - self.df = DataFrame({'float': randn(10000), 'datetime': date_range('2000-01-01', periods=10000, freq='s'), }) - self.df['datetime_string'] = self.df['datetime'].map(str) - self.df.to_sql('test_type', self.engine, if_exists='replace') - self.df[['float', 'datetime_string']].to_sql('test_type', self.con, if_exists='replace') - - def time_sql_datetime_read_as_native_sqlalchemy(self): - read_sql_table('test_type', self.engine, columns=['datetime']) - - -class sql_datetime_write_sqlalchemy(object): - goal_time = 0.2 - - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.df = DataFrame({'float': randn(10000), 'string': (['foo'] * 10000), 'bool': ([True] * 10000), 'datetime': date_range('2000-01-01', periods=10000, freq='s'), }) - self.df.loc[1000:3000, 'float'] = np.nan - - def time_sql_datetime_write_sqlalchemy(self): - self.df[['datetime']].to_sql('test_datetime', self.engine, if_exists='replace') - - -class sql_float_read_query_fallback(object): - goal_time = 0.2 - - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.df = DataFrame({'float': randn(10000), 'datetime': date_range('2000-01-01', periods=10000, freq='s'), }) - self.df['datetime_string'] = self.df['datetime'].map(str) - self.df.to_sql('test_type', self.engine, if_exists='replace') - self.df[['float', 'datetime_string']].to_sql('test_type', self.con, if_exists='replace') - - def time_sql_float_read_query_fallback(self): - read_sql_query('SELECT float FROM test_type', self.con) - - -class sql_float_read_query_sqlalchemy(object): - goal_time = 0.2 - - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.df = DataFrame({'float': randn(10000), 'datetime': date_range('2000-01-01', periods=10000, freq='s'), }) - self.df['datetime_string'] = self.df['datetime'].map(str) - self.df.to_sql('test_type', self.engine, if_exists='replace') - self.df[['float', 'datetime_string']].to_sql('test_type', self.con, if_exists='replace') - - def time_sql_float_read_query_sqlalchemy(self): - read_sql_query('SELECT float FROM test_type', self.engine) - - -class sql_float_read_table_sqlalchemy(object): - goal_time = 0.2 - - def 
setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.df = DataFrame({'float': randn(10000), 'datetime': date_range('2000-01-01', periods=10000, freq='s'), }) - self.df['datetime_string'] = self.df['datetime'].map(str) - self.df.to_sql('test_type', self.engine, if_exists='replace') - self.df[['float', 'datetime_string']].to_sql('test_type', self.con, if_exists='replace') - - def time_sql_float_read_table_sqlalchemy(self): - read_sql_table('test_type', self.engine, columns=['float']) - - -class sql_float_write_fallback(object): - goal_time = 0.2 - - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.df = DataFrame({'float': randn(10000), 'string': (['foo'] * 10000), 'bool': ([True] * 10000), 'datetime': date_range('2000-01-01', periods=10000, freq='s'), }) - self.df.loc[1000:3000, 'float'] = np.nan - - def time_sql_float_write_fallback(self): - self.df[['float']].to_sql('test_float', self.con, if_exists='replace') - + self.index = tm.makeStringIndex(10000) + self.df = DataFrame({'float1': randn(10000), 'float2': randn(10000), 'string1': (['foo'] * 10000), 'bool1': ([True] * 10000), 'int1': np.random.randint(0, 100000, size=10000), }, index=self.index) -class sql_float_write_sqlalchemy(object): - goal_time = 0.2 + def time_fallback(self): + self.df.to_sql('test1', self.con, if_exists='replace') - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.df = DataFrame({'float': randn(10000), 'string': (['foo'] * 10000), 'bool': ([True] * 10000), 'datetime': date_range('2000-01-01', periods=10000, freq='s'), }) - self.df.loc[1000:3000, 'float'] = np.nan + def time_sqlalchemy(self): + self.df.to_sql('test1', self.engine, if_exists='replace') - def time_sql_float_write_sqlalchemy(self): - self.df[['float']].to_sql('test_float', self.engine, if_exists='replace') +#------------------------------------------------------------------------------- +# read_sql -class sql_read_query_fallback(object): +class ReadSQL(object): goal_time = 0.2 def setup(self): @@ -129,41 +37,20 @@ def setup(self): self.df.to_sql('test2', self.engine, if_exists='replace') self.df.to_sql('test2', self.con, if_exists='replace') - def time_sql_read_query_fallback(self): + def time_read_query_fallback(self): read_sql_query('SELECT * FROM test2', self.con) - -class sql_read_query_sqlalchemy(object): - goal_time = 0.2 - - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.index = tm.makeStringIndex(10000) - self.df = DataFrame({'float1': randn(10000), 'float2': randn(10000), 'string1': (['foo'] * 10000), 'bool1': ([True] * 10000), 'int1': np.random.randint(0, 100000, size=10000), }, index=self.index) - self.df.to_sql('test2', self.engine, if_exists='replace') - self.df.to_sql('test2', self.con, if_exists='replace') - - def time_sql_read_query_sqlalchemy(self): + def time_read_query_sqlalchemy(self): read_sql_query('SELECT * FROM test2', self.engine) - -class sql_read_table_sqlalchemy(object): - goal_time = 0.2 - - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.index = tm.makeStringIndex(10000) - self.df = DataFrame({'float1': randn(10000), 'float2': randn(10000), 'string1': (['foo'] * 10000), 'bool1': ([True] * 10000), 'int1': np.random.randint(0, 100000, size=10000), }, index=self.index) - self.df.to_sql('test2', 
self.engine, if_exists='replace') - self.df.to_sql('test2', self.con, if_exists='replace') - - def time_sql_read_table_sqlalchemy(self): + def time_read_table_sqlalchemy(self): read_sql_table('test2', self.engine) -class sql_string_write_fallback(object): +#------------------------------------------------------------------------------- +# type specific write + +class WriteSQLTypes(object): goal_time = 0.2 def setup(self): @@ -172,44 +59,47 @@ def setup(self): self.df = DataFrame({'float': randn(10000), 'string': (['foo'] * 10000), 'bool': ([True] * 10000), 'datetime': date_range('2000-01-01', periods=10000, freq='s'), }) self.df.loc[1000:3000, 'float'] = np.nan - def time_sql_string_write_fallback(self): + def time_string_fallback(self): self.df[['string']].to_sql('test_string', self.con, if_exists='replace') + def time_string_sqlalchemy(self): + self.df[['string']].to_sql('test_string', self.engine, if_exists='replace') -class sql_string_write_sqlalchemy(object): - goal_time = 0.2 + def time_float_fallback(self): + self.df[['float']].to_sql('test_float', self.con, if_exists='replace') - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.df = DataFrame({'float': randn(10000), 'string': (['foo'] * 10000), 'bool': ([True] * 10000), 'datetime': date_range('2000-01-01', periods=10000, freq='s'), }) - self.df.loc[1000:3000, 'float'] = np.nan + def time_float_sqlalchemy(self): + self.df[['float']].to_sql('test_float', self.engine, if_exists='replace') - def time_sql_string_write_sqlalchemy(self): - self.df[['string']].to_sql('test_string', self.engine, if_exists='replace') + def time_datetime_sqlalchemy(self): + self.df[['datetime']].to_sql('test_datetime', self.engine, if_exists='replace') -class sql_write_fallback(object): +#------------------------------------------------------------------------------- +# type specific read + +class ReadSQLTypes(object): goal_time = 0.2 def setup(self): self.engine = create_engine('sqlite:///:memory:') self.con = sqlite3.connect(':memory:') - self.index = tm.makeStringIndex(10000) - self.df = DataFrame({'float1': randn(10000), 'float2': randn(10000), 'string1': (['foo'] * 10000), 'bool1': ([True] * 10000), 'int1': np.random.randint(0, 100000, size=10000), }, index=self.index) + self.df = DataFrame({'float': randn(10000), 'datetime': date_range('2000-01-01', periods=10000, freq='s'), }) + self.df['datetime_string'] = self.df['datetime'].map(str) + self.df.to_sql('test_type', self.engine, if_exists='replace') + self.df[['float', 'datetime_string']].to_sql('test_type', self.con, if_exists='replace') - def time_sql_write_fallback(self): - self.df.to_sql('test1', self.con, if_exists='replace') + def time_datetime_read_and_parse_sqlalchemy(self): + read_sql_table('test_type', self.engine, columns=['datetime_string'], parse_dates=['datetime_string']) + def time_datetime_read_as_native_sqlalchemy(self): + read_sql_table('test_type', self.engine, columns=['datetime']) -class sql_write_sqlalchemy(object): - goal_time = 0.2 + def time_float_read_query_fallback(self): + read_sql_query('SELECT float FROM test_type', self.con) - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.index = tm.makeStringIndex(10000) - self.df = DataFrame({'float1': randn(10000), 'float2': randn(10000), 'string1': (['foo'] * 10000), 'bool1': ([True] * 10000), 'int1': np.random.randint(0, 100000, size=10000), }, index=self.index) + def 
time_float_read_query_sqlalchemy(self): + read_sql_query('SELECT float FROM test_type', self.engine) - def time_sql_write_sqlalchemy(self): - self.df.to_sql('test1', self.engine, if_exists='replace') + def time_float_read_table_sqlalchemy(self): + read_sql_table('test_type', self.engine, columns=['float']) diff --git a/asv_bench/benchmarks/join_merge.py b/asv_bench/benchmarks/join_merge.py index c98179c8950c5..3b0e33b72ddc1 100644 --- a/asv_bench/benchmarks/join_merge.py +++ b/asv_bench/benchmarks/join_merge.py @@ -1,33 +1,20 @@ from .pandas_vb_common import * +try: + from pandas import merge_ordered +except ImportError: + from pandas import ordered_merge as merge_ordered -class append_frame_single_homogenous(object): - goal_time = 0.2 - - def setup(self): - self.df1 = pd.DataFrame(np.random.randn(10000, 4), columns=['A', 'B', 'C', 'D']) - self.df2 = self.df1.copy() - self.df2.index = np.arange(10000, 20000) - self.mdf1 = self.df1.copy() - self.mdf1['obj1'] = 'bar' - self.mdf1['obj2'] = 'bar' - self.mdf1['int1'] = 5 - try: - self.mdf1.consolidate(inplace=True) - except: - pass - self.mdf2 = self.mdf1.copy() - self.mdf2.index = self.df2.index - - def time_append_frame_single_homogenous(self): - self.df1.append(self.df2) +# ---------------------------------------------------------------------- +# Append -class append_frame_single_mixed(object): +class Append(object): goal_time = 0.2 def setup(self): - self.df1 = pd.DataFrame(np.random.randn(10000, 4), columns=['A', 'B', 'C', 'D']) + self.df1 = pd.DataFrame(np.random.randn(10000, 4), + columns=['A', 'B', 'C', 'D']) self.df2 = self.df1.copy() self.df2.index = np.arange(10000, 20000) self.mdf1 = self.df1.copy() @@ -41,33 +28,17 @@ def setup(self): self.mdf2 = self.mdf1.copy() self.mdf2.index = self.df2.index - def time_append_frame_single_mixed(self): - self.mdf1.append(self.mdf2) - - -class concat_empty_frames1(object): - goal_time = 0.2 - - def setup(self): - self.df = pd.DataFrame(dict(A=range(10000)), index=date_range('20130101', periods=10000, freq='s')) - self.empty = pd.DataFrame() - - def time_concat_empty_frames1(self): - concat([self.df, self.empty]) - - -class concat_empty_frames2(object): - goal_time = 0.2 + def time_append_homogenous(self): + self.df1.append(self.df2) - def setup(self): - self.df = pd.DataFrame(dict(A=range(10000)), index=date_range('20130101', periods=10000, freq='s')) - self.empty = pd.DataFrame() + def time_append_mixed(self): + self.mdf1.append(self.mdf2) - def time_concat_empty_frames2(self): - concat([self.empty, self.df]) +# ---------------------------------------------------------------------- +# Concat -class concat_series_axis1(object): +class Concat(object): goal_time = 0.2 def setup(self): @@ -77,21 +48,26 @@ def setup(self): self.pieces = [self.s[i:(- i)] for i in range(1, 10)] self.pieces = (self.pieces * 50) + self.df_small = pd.DataFrame(randn(5, 4)) + + # empty + self.df = pd.DataFrame(dict(A=range(10000)), index=date_range('20130101', periods=10000, freq='s')) + self.empty = pd.DataFrame() + def time_concat_series_axis1(self): concat(self.pieces, axis=1) + def time_concat_small_frames(self): + concat(([self.df_small] * 1000)) -class concat_small_frames(object): - goal_time = 0.2 - - def setup(self): - self.df = pd.DataFrame(randn(5, 4)) + def time_concat_empty_frames1(self): + concat([self.df, self.empty]) - def time_concat_small_frames(self): - concat(([self.df] * 1000)) + def time_concat_empty_frames2(self): + concat([self.empty, self.df]) -class concat_panels(object): +class 
ConcatPanels(object): goal_time = 0.2 def setup(self): @@ -101,26 +77,26 @@ def setup(self): self.panels_c = [pd.Panel(np.copy(dataset, order='C')) for i in range(20)] - def time_concat_c_ordered_axis0(self): + def time_c_ordered_axis0(self): concat(self.panels_c, axis=0, ignore_index=True) - def time_concat_f_ordered_axis0(self): + def time_f_ordered_axis0(self): concat(self.panels_f, axis=0, ignore_index=True) - def time_concat_c_ordered_axis1(self): + def time_c_ordered_axis1(self): concat(self.panels_c, axis=1, ignore_index=True) - def time_concat_f_ordered_axis1(self): + def time_f_ordered_axis1(self): concat(self.panels_f, axis=1, ignore_index=True) - def time_concat_c_ordered_axis2(self): + def time_c_ordered_axis2(self): concat(self.panels_c, axis=2, ignore_index=True) - def time_concat_f_ordered_axis2(self): + def time_f_ordered_axis2(self): concat(self.panels_f, axis=2, ignore_index=True) -class concat_dataframes(object): +class ConcatFrames(object): goal_time = 0.2 def setup(self): @@ -131,37 +107,23 @@ def setup(self): self.frames_c = [pd.DataFrame(np.copy(dataset, order='C')) for i in range(20)] - def time_concat_c_ordered_axis0(self): + def time_c_ordered_axis0(self): concat(self.frames_c, axis=0, ignore_index=True) - def time_concat_f_ordered_axis0(self): + def time_f_ordered_axis0(self): concat(self.frames_f, axis=0, ignore_index=True) - def time_concat_c_ordered_axis1(self): + def time_c_ordered_axis1(self): concat(self.frames_c, axis=1, ignore_index=True) - def time_concat_f_ordered_axis1(self): + def time_f_ordered_axis1(self): concat(self.frames_f, axis=1, ignore_index=True) -class i8merge(object): - goal_time = 0.2 - - def setup(self): - (low, high, n) = (((-1) << 10), (1 << 10), (1 << 20)) - self.left = pd.DataFrame(np.random.randint(low, high, (n, 7)), columns=list('ABCDEFG')) - self.left['left'] = self.left.sum(axis=1) - self.i = np.random.permutation(len(self.left)) - self.right = self.left.iloc[self.i].copy() - self.right.columns = (self.right.columns[:(-1)].tolist() + ['right']) - self.right.index = np.arange(len(self.right)) - self.right['right'] *= (-1) - - def time_i8merge(self): - merge(self.left, self.right, how='outer') - +# ---------------------------------------------------------------------- +# Joins -class join_dataframe_index_multi(object): +class Join(object): goal_time = 0.2 def setup(self): @@ -174,243 +136,230 @@ def setup(self): self.shuf = np.arange(100000) random.shuffle(self.shuf) try: - self.index2 = MultiIndex(levels=[self.level1, self.level2], labels=[self.label1, self.label2]) - self.index3 = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], labels=[np.arange(10).repeat(10000), np.tile(np.arange(100).repeat(100), 10), np.tile(np.tile(np.arange(100), 100), 10)]) - self.df_multi = DataFrame(np.random.randn(len(self.index2), 4), index=self.index2, columns=['A', 'B', 'C', 'D']) + self.index2 = MultiIndex(levels=[self.level1, self.level2], + labels=[self.label1, self.label2]) + self.index3 = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], + labels=[np.arange(10).repeat(10000), np.tile(np.arange(100).repeat(100), 10), np.tile(np.tile(np.arange(100), 100), 10)]) + self.df_multi = DataFrame(np.random.randn(len(self.index2), 4), + index=self.index2, + columns=['A', 'B', 'C', 'D']) except: pass - self.df = pd.DataFrame({'data1': np.random.randn(100000), 'data2': np.random.randn(100000), 'key1': self.key1, 'key2': self.key2, }) - self.df_key1 = pd.DataFrame(np.random.randn(len(self.level1), 4), index=self.level1, 
columns=['A', 'B', 'C', 'D']) - self.df_key2 = pd.DataFrame(np.random.randn(len(self.level2), 4), index=self.level2, columns=['A', 'B', 'C', 'D']) + self.df = pd.DataFrame({'data1': np.random.randn(100000), + 'data2': np.random.randn(100000), + 'key1': self.key1, + 'key2': self.key2}) + self.df_key1 = pd.DataFrame(np.random.randn(len(self.level1), 4), + index=self.level1, + columns=['A', 'B', 'C', 'D']) + self.df_key2 = pd.DataFrame(np.random.randn(len(self.level2), 4), + index=self.level2, + columns=['A', 'B', 'C', 'D']) self.df_shuf = self.df.reindex(self.df.index[self.shuf]) def time_join_dataframe_index_multi(self): self.df.join(self.df_multi, on=['key1', 'key2']) - -class join_dataframe_index_single_key_bigger(object): - goal_time = 0.2 - - def setup(self): - self.level1 = tm.makeStringIndex(10).values - self.level2 = tm.makeStringIndex(1000).values - self.label1 = np.arange(10).repeat(1000) - self.label2 = np.tile(np.arange(1000), 10) - self.key1 = np.tile(self.level1.take(self.label1), 10) - self.key2 = np.tile(self.level2.take(self.label2), 10) - self.shuf = np.arange(100000) - random.shuffle(self.shuf) - try: - self.index2 = MultiIndex(levels=[self.level1, self.level2], labels=[self.label1, self.label2]) - self.index3 = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], labels=[np.arange(10).repeat(10000), np.tile(np.arange(100).repeat(100), 10), np.tile(np.tile(np.arange(100), 100), 10)]) - self.df_multi = DataFrame(np.random.randn(len(self.index2), 4), index=self.index2, columns=['A', 'B', 'C', 'D']) - except: - pass - self.df = pd.DataFrame({'data1': np.random.randn(100000), 'data2': np.random.randn(100000), 'key1': self.key1, 'key2': self.key2, }) - self.df_key1 = pd.DataFrame(np.random.randn(len(self.level1), 4), index=self.level1, columns=['A', 'B', 'C', 'D']) - self.df_key2 = pd.DataFrame(np.random.randn(len(self.level2), 4), index=self.level2, columns=['A', 'B', 'C', 'D']) - self.df_shuf = self.df.reindex(self.df.index[self.shuf]) - def time_join_dataframe_index_single_key_bigger(self): self.df.join(self.df_key2, on='key2') + def time_join_dataframe_index_single_key_bigger_sort(self): + self.df_shuf.join(self.df_key2, on='key2', sort=True) -class join_dataframe_index_single_key_bigger_sort(object): + def time_join_dataframe_index_single_key_small(self): + self.df.join(self.df_key1, on='key1') + + +class JoinIndex(object): goal_time = 0.2 def setup(self): - self.level1 = tm.makeStringIndex(10).values - self.level2 = tm.makeStringIndex(1000).values - self.label1 = np.arange(10).repeat(1000) - self.label2 = np.tile(np.arange(1000), 10) - self.key1 = np.tile(self.level1.take(self.label1), 10) - self.key2 = np.tile(self.level2.take(self.label2), 10) - self.shuf = np.arange(100000) - random.shuffle(self.shuf) - try: - self.index2 = MultiIndex(levels=[self.level1, self.level2], labels=[self.label1, self.label2]) - self.index3 = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], labels=[np.arange(10).repeat(10000), np.tile(np.arange(100).repeat(100), 10), np.tile(np.tile(np.arange(100), 100), 10)]) - self.df_multi = DataFrame(np.random.randn(len(self.index2), 4), index=self.index2, columns=['A', 'B', 'C', 'D']) - except: - pass - self.df = pd.DataFrame({'data1': np.random.randn(100000), 'data2': np.random.randn(100000), 'key1': self.key1, 'key2': self.key2, }) - self.df_key1 = pd.DataFrame(np.random.randn(len(self.level1), 4), index=self.level1, columns=['A', 'B', 'C', 'D']) - self.df_key2 = pd.DataFrame(np.random.randn(len(self.level2), 4), 
index=self.level2, columns=['A', 'B', 'C', 'D']) - self.df_shuf = self.df.reindex(self.df.index[self.shuf]) + np.random.seed(2718281) + self.n = 50000 + self.left = pd.DataFrame(np.random.randint(1, (self.n / 500), (self.n, 2)), columns=['jim', 'joe']) + self.right = pd.DataFrame(np.random.randint(1, (self.n / 500), (self.n, 2)), columns=['jolie', 'jolia']).set_index('jolie') - def time_join_dataframe_index_single_key_bigger_sort(self): - self.df_shuf.join(self.df_key2, on='key2', sort=True) + def time_left_outer_join_index(self): + self.left.join(self.right, on='jim') -class join_dataframe_index_single_key_small(object): +class join_non_unique_equal(object): + # outer join of non-unique + # GH 6329 + goal_time = 0.2 def setup(self): - self.level1 = tm.makeStringIndex(10).values - self.level2 = tm.makeStringIndex(1000).values - self.label1 = np.arange(10).repeat(1000) - self.label2 = np.tile(np.arange(1000), 10) - self.key1 = np.tile(self.level1.take(self.label1), 10) - self.key2 = np.tile(self.level2.take(self.label2), 10) - self.shuf = np.arange(100000) - random.shuffle(self.shuf) - try: - self.index2 = MultiIndex(levels=[self.level1, self.level2], labels=[self.label1, self.label2]) - self.index3 = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], labels=[np.arange(10).repeat(10000), np.tile(np.arange(100).repeat(100), 10), np.tile(np.tile(np.arange(100), 100), 10)]) - self.df_multi = DataFrame(np.random.randn(len(self.index2), 4), index=self.index2, columns=['A', 'B', 'C', 'D']) - except: - pass - self.df = pd.DataFrame({'data1': np.random.randn(100000), 'data2': np.random.randn(100000), 'key1': self.key1, 'key2': self.key2, }) - self.df_key1 = pd.DataFrame(np.random.randn(len(self.level1), 4), index=self.level1, columns=['A', 'B', 'C', 'D']) - self.df_key2 = pd.DataFrame(np.random.randn(len(self.level2), 4), index=self.level2, columns=['A', 'B', 'C', 'D']) - self.df_shuf = self.df.reindex(self.df.index[self.shuf]) + self.date_index = date_range('01-Jan-2013', '23-Jan-2013', freq='T') + self.daily_dates = self.date_index.to_period('D').to_timestamp('S', 'S') + self.fracofday = (self.date_index.view(np.ndarray) - self.daily_dates.view(np.ndarray)) + self.fracofday = (self.fracofday.astype('timedelta64[ns]').astype(np.float64) / 86400000000000.0) + self.fracofday = Series(self.fracofday, self.daily_dates) + self.index = date_range(self.date_index.min().to_period('A').to_timestamp('D', 'S'), self.date_index.max().to_period('A').to_timestamp('D', 'E'), freq='D') + self.temp = Series(1.0, self.index) - def time_join_dataframe_index_single_key_small(self): - self.df.join(self.df_key1, on='key1') + def time_join_non_unique_equal(self): + (self.fracofday * self.temp[self.fracofday.index]) -class join_dataframe_integer_2key(object): +# ---------------------------------------------------------------------- +# Merges + +class Merge(object): goal_time = 0.2 def setup(self): - self.df = pd.DataFrame({'key1': np.tile(np.arange(500).repeat(10), 2), 'key2': np.tile(np.arange(250).repeat(10), 4), 'value': np.random.randn(10000), }) - self.df2 = pd.DataFrame({'key1': np.arange(500), 'value2': randn(500), }) + self.N = 10000 + self.indices = tm.makeStringIndex(self.N).values + self.indices2 = tm.makeStringIndex(self.N).values + self.key = np.tile(self.indices[:8000], 10) + self.key2 = np.tile(self.indices2[:8000], 10) + self.left = pd.DataFrame({'key': self.key, 'key2': self.key2, + 'value': np.random.randn(80000)}) + self.right = pd.DataFrame({'key': self.indices[2000:], + 'key2': 
self.indices2[2000:], + 'value2': np.random.randn(8000)}) + + self.df = pd.DataFrame({'key1': np.tile(np.arange(500).repeat(10), 2), + 'key2': np.tile(np.arange(250).repeat(10), 4), + 'value': np.random.randn(10000)}) + self.df2 = pd.DataFrame({'key1': np.arange(500), 'value2': randn(500)}) self.df3 = self.df[:5000] - def time_join_dataframe_integer_2key(self): + def time_merge_2intkey_nosort(self): + merge(self.left, self.right, sort=False) + + def time_merge_2intkey_sort(self): + merge(self.left, self.right, sort=True) + + def time_merge_dataframe_integer_2key(self): merge(self.df, self.df3) + def time_merge_dataframe_integer_key(self): + merge(self.df, self.df2, on='key1') + -class join_dataframe_integer_key(object): +class i8merge(object): goal_time = 0.2 def setup(self): - self.df = pd.DataFrame({'key1': np.tile(np.arange(500).repeat(10), 2), 'key2': np.tile(np.arange(250).repeat(10), 4), 'value': np.random.randn(10000), }) - self.df2 = pd.DataFrame({'key1': np.arange(500), 'value2': randn(500), }) - self.df3 = self.df[:5000] + (low, high, n) = (((-1) << 10), (1 << 10), (1 << 20)) + self.left = pd.DataFrame(np.random.randint(low, high, (n, 7)), + columns=list('ABCDEFG')) + self.left['left'] = self.left.sum(axis=1) + self.i = np.random.permutation(len(self.left)) + self.right = self.left.iloc[self.i].copy() + self.right.columns = (self.right.columns[:(-1)].tolist() + ['right']) + self.right.index = np.arange(len(self.right)) + self.right['right'] *= (-1) - def time_join_dataframe_integer_key(self): - merge(self.df, self.df2, on='key1') + def time_i8merge(self): + merge(self.left, self.right, how='outer') -class merge_asof_noby(object): +class MergeCategoricals(object): + goal_time = 0.2 def setup(self): - np.random.seed(0) - one_count = 200000 - two_count = 1000000 - self.df1 = pd.DataFrame({'time': np.random.randint(0, one_count/20, one_count), - 'value1': np.random.randn(one_count)}) - self.df2 = pd.DataFrame({'time': np.random.randint(0, two_count/20, two_count), - 'value2': np.random.randn(two_count)}) - self.df1 = self.df1.sort_values('time') - self.df2 = self.df2.sort_values('time') + self.left_object = pd.DataFrame( + {'X': np.random.choice(range(0, 10), size=(10000,)), + 'Y': np.random.choice(['one', 'two', 'three'], size=(10000,))}) - def time_merge_asof_noby(self): - merge_asof(self.df1, self.df2, on='time') + self.right_object = pd.DataFrame( + {'X': np.random.choice(range(0, 10), size=(10000,)), + 'Z': np.random.choice(['jjj', 'kkk', 'sss'], size=(10000,))}) + self.left_cat = self.left_object.assign( + Y=self.left_object['Y'].astype('category')) + self.right_cat = self.right_object.assign( + Z=self.right_object['Z'].astype('category')) -class merge_asof_by_object(object): + def time_merge_object(self): + merge(self.left_object, self.right_object, on='X') - def setup(self): - import string - np.random.seed(0) - one_count = 200000 - two_count = 1000000 - self.df1 = pd.DataFrame({'time': np.random.randint(0, one_count/20, one_count), - 'key': np.random.choice(list(string.uppercase), one_count), - 'value1': np.random.randn(one_count)}) - self.df2 = pd.DataFrame({'time': np.random.randint(0, two_count/20, two_count), - 'key': np.random.choice(list(string.uppercase), two_count), - 'value2': np.random.randn(two_count)}) - self.df1 = self.df1.sort_values('time') - self.df2 = self.df2.sort_values('time') + def time_merge_cat(self): + merge(self.left_cat, self.right_cat, on='X') - def time_merge_asof_by_object(self): - merge_asof(self.df1, self.df2, on='time', by='key') +# 
---------------------------------------------------------------------- +# Ordered merge -class merge_asof_by_int(object): +class MergeOrdered(object): def setup(self): - np.random.seed(0) - one_count = 200000 - two_count = 1000000 - self.df1 = pd.DataFrame({'time': np.random.randint(0, one_count/20, one_count), - 'key': np.random.randint(0, 25, one_count), - 'value1': np.random.randn(one_count)}) - self.df2 = pd.DataFrame({'time': np.random.randint(0, two_count/20, two_count), - 'key': np.random.randint(0, 25, two_count), - 'value2': np.random.randn(two_count)}) - self.df1 = self.df1.sort_values('time') - self.df2 = self.df2.sort_values('time') - def time_merge_asof_by_int(self): - merge_asof(self.df1, self.df2, on='time', by='key') + groups = tm.makeStringIndex(10).values + self.left = pd.DataFrame({'group': groups.repeat(5000), + 'key' : np.tile(np.arange(0, 10000, 2), 10), + 'lvalue': np.random.randn(50000)}) -class join_non_unique_equal(object): - goal_time = 0.2 + self.right = pd.DataFrame({'key' : np.arange(10000), + 'rvalue' : np.random.randn(10000)}) - def setup(self): - self.date_index = date_range('01-Jan-2013', '23-Jan-2013', freq='T') - self.daily_dates = self.date_index.to_period('D').to_timestamp('S', 'S') - self.fracofday = (self.date_index.view(np.ndarray) - self.daily_dates.view(np.ndarray)) - self.fracofday = (self.fracofday.astype('timedelta64[ns]').astype(np.float64) / 86400000000000.0) - self.fracofday = TimeSeries(self.fracofday, self.daily_dates) - self.index = date_range(self.date_index.min().to_period('A').to_timestamp('D', 'S'), self.date_index.max().to_period('A').to_timestamp('D', 'E'), freq='D') - self.temp = TimeSeries(1.0, self.index) + def time_merge_ordered(self): + merge_ordered(self.left, self.right, on='key', left_by='group') - def time_join_non_unique_equal(self): - (self.fracofday * self.temp[self.fracofday.index]) +# ---------------------------------------------------------------------- +# asof merge -class left_outer_join_index(object): - goal_time = 0.2 +class MergeAsof(object): def setup(self): - np.random.seed(2718281) - self.n = 50000 - self.left = pd.DataFrame(np.random.randint(1, (self.n / 500), (self.n, 2)), columns=['jim', 'joe']) - self.right = pd.DataFrame(np.random.randint(1, (self.n / 500), (self.n, 2)), columns=['jolie', 'jolia']).set_index('jolie') + import string + np.random.seed(0) + one_count = 200000 + two_count = 1000000 - def time_left_outer_join_index(self): - self.left.join(self.right, on='jim') + self.df1 = pd.DataFrame( + {'time': np.random.randint(0, one_count / 20, one_count), + 'key': np.random.choice(list(string.ascii_uppercase), one_count), + 'key2': np.random.randint(0, 25, one_count), + 'value1': np.random.randn(one_count)}) + self.df2 = pd.DataFrame( + {'time': np.random.randint(0, two_count / 20, two_count), + 'key': np.random.choice(list(string.ascii_uppercase), two_count), + 'key2': np.random.randint(0, 25, two_count), + 'value2': np.random.randn(two_count)}) + self.df1 = self.df1.sort_values('time') + self.df2 = self.df2.sort_values('time') -class merge_2intkey_nosort(object): - goal_time = 0.2 + self.df1['time32'] = np.int32(self.df1.time) + self.df2['time32'] = np.int32(self.df2.time) - def setup(self): - self.N = 10000 - self.indices = tm.makeStringIndex(self.N).values - self.indices2 = tm.makeStringIndex(self.N).values - self.key = np.tile(self.indices[:8000], 10) - self.key2 = np.tile(self.indices2[:8000], 10) - self.left = pd.DataFrame({'key': self.key, 'key2': self.key2, 'value': np.random.randn(80000), }) - 
self.right = pd.DataFrame({'key': self.indices[2000:], 'key2': self.indices2[2000:], 'value2': np.random.randn(8000), }) + self.df1a = self.df1[['time', 'value1']] + self.df2a = self.df2[['time', 'value2']] + self.df1b = self.df1[['time', 'key', 'value1']] + self.df2b = self.df2[['time', 'key', 'value2']] + self.df1c = self.df1[['time', 'key2', 'value1']] + self.df2c = self.df2[['time', 'key2', 'value2']] + self.df1d = self.df1[['time32', 'value1']] + self.df2d = self.df2[['time32', 'value2']] + self.df1e = self.df1[['time', 'key', 'key2', 'value1']] + self.df2e = self.df2[['time', 'key', 'key2', 'value2']] - def time_merge_2intkey_nosort(self): - merge(self.left, self.right, sort=False) + def time_noby(self): + merge_asof(self.df1a, self.df2a, on='time') + def time_by_object(self): + merge_asof(self.df1b, self.df2b, on='time', by='key') -class merge_2intkey_sort(object): - goal_time = 0.2 + def time_by_int(self): + merge_asof(self.df1c, self.df2c, on='time', by='key2') - def setup(self): - self.N = 10000 - self.indices = tm.makeStringIndex(self.N).values - self.indices2 = tm.makeStringIndex(self.N).values - self.key = np.tile(self.indices[:8000], 10) - self.key2 = np.tile(self.indices2[:8000], 10) - self.left = pd.DataFrame({'key': self.key, 'key2': self.key2, 'value': np.random.randn(80000), }) - self.right = pd.DataFrame({'key': self.indices[2000:], 'key2': self.indices2[2000:], 'value2': np.random.randn(8000), }) + def time_on_int32(self): + merge_asof(self.df1d, self.df2d, on='time32') + + def time_multiby(self): + merge_asof(self.df1e, self.df2e, on='time', by=['key', 'key2']) - def time_merge_2intkey_sort(self): - merge(self.left, self.right, sort=True) +# ---------------------------------------------------------------------- +# data alignment -class series_align_int64_index(object): +class Align(object): goal_time = 0.2 def setup(self): @@ -423,30 +372,12 @@ def setup(self): self.ts1 = Series(np.random.randn(self.sz), self.idx1) self.ts2 = Series(np.random.randn(self.sz), self.idx2) - def time_series_align_int64_index(self): - (self.ts1 + self.ts2) - def sample(self, values, k): self.sampler = np.random.permutation(len(values)) return values.take(self.sampler[:k]) - -class series_align_left_monotonic(object): - goal_time = 0.2 - - def setup(self): - self.n = 1000000 - self.sz = 500000 - self.rng = np.arange(0, 10000000000000, 10000000) - self.stamps = (np.datetime64(datetime.now()).view('i8') + self.rng) - self.idx1 = np.sort(self.sample(self.stamps, self.sz)) - self.idx2 = np.sort(self.sample(self.stamps, self.sz)) - self.ts1 = Series(np.random.randn(self.sz), self.idx1) - self.ts2 = Series(np.random.randn(self.sz), self.idx2) + def time_series_align_int64_index(self): + (self.ts1 + self.ts2) def time_series_align_left_monotonic(self): self.ts1.align(self.ts2, join='left') - - def sample(self, values, k): - self.sampler = np.random.permutation(len(values)) - return values.take(self.sampler[:k]) diff --git a/asv_bench/benchmarks/miscellaneous.py b/asv_bench/benchmarks/miscellaneous.py deleted file mode 100644 index f9d577a2b56d7..0000000000000 --- a/asv_bench/benchmarks/miscellaneous.py +++ /dev/null @@ -1,52 +0,0 @@ -from .pandas_vb_common import * -from pandas.util.decorators import cache_readonly - - -class match_strings(object): - goal_time = 0.2 - - def setup(self): - self.uniques = tm.makeStringIndex(1000).values - self.all = self.uniques.repeat(10) - - def time_match_strings(self): - match(self.all, self.uniques) - - -class misc_cache_readonly(object): - goal_time = 0.2 - - 
def setup(self): - - - class Foo: - - @cache_readonly - def prop(self): - return 5 - self.obj = Foo() - - def time_misc_cache_readonly(self): - self.obj.prop - - -class to_numeric(object): - goal_time = 0.2 - - def setup(self): - self.n = 10000 - self.float = Series(np.random.randn(self.n * 100)) - self.numstr = self.float.astype('str') - self.str = Series(tm.makeStringIndex(self.n)) - - def time_from_float(self): - pd.to_numeric(self.float) - - def time_from_numeric_str(self): - pd.to_numeric(self.numstr) - - def time_from_str_ignore(self): - pd.to_numeric(self.str, errors='ignore') - - def time_from_str_coerce(self): - pd.to_numeric(self.str, errors='coerce') diff --git a/asv_bench/benchmarks/packers.py b/asv_bench/benchmarks/packers.py index 5419571c75b43..24f80cc836dd4 100644 --- a/asv_bench/benchmarks/packers.py +++ b/asv_bench/benchmarks/packers.py @@ -8,28 +8,19 @@ from sqlalchemy import create_engine import numpy as np from random import randrange -from pandas.core import common as com - -class packers_read_csv(object): +class _Packers(object): goal_time = 0.2 - def setup(self): + def _setup(self): self.f = '__test__.msg' self.N = 100000 self.C = 5 self.index = date_range('20000101', periods=self.N, freq='H') self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) + self.df2 = self.df.copy() self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] self.remove(self.f) - self.df.to_csv(self.f) - - def time_packers_read_csv(self): - pd.read_csv(self.f) def remove(self, f): try: @@ -37,22 +28,21 @@ def remove(self, f): except: pass +class Packers(_Packers): + goal_time = 0.2 + + def setup(self): + self._setup() + self.df.to_csv(self.f) + + def time_packers_read_csv(self): + pd.read_csv(self.f) -class packers_read_excel(object): +class packers_read_excel(_Packers): goal_time = 0.2 def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() self.bio = BytesIO() self.writer = pd.io.excel.ExcelWriter(self.bio, engine='xlsxwriter') self.df[:2000].to_excel(self.writer) @@ -62,246 +52,94 @@ def time_packers_read_excel(self): self.bio.seek(0) pd.read_excel(self.bio) - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class packers_read_hdf_store(object): +class packers_read_hdf_store(_Packers): goal_time = 0.2 def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % 
randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() self.df2.to_hdf(self.f, 'df') def time_packers_read_hdf_store(self): pd.read_hdf(self.f, 'df') - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class packers_read_hdf_table(object): - goal_time = 0.2 +class packers_read_hdf_table(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() self.df2.to_hdf(self.f, 'df', format='table') def time_packers_read_hdf_table(self): pd.read_hdf(self.f, 'df') - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class packers_read_json(object): - goal_time = 0.2 +class packers_read_json(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() self.df.to_json(self.f, orient='split') self.df.index = np.arange(self.N) def time_packers_read_json(self): pd.read_json(self.f, orient='split') - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class packers_read_json_date_index(object): - goal_time = 0.2 +class packers_read_json_date_index(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] + self._setup() self.remove(self.f) self.df.to_json(self.f, orient='split') def time_packers_read_json_date_index(self): pd.read_json(self.f, orient='split') - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class packers_read_pack(object): - goal_time = 0.2 +class packers_read_pack(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() self.df2.to_msgpack(self.f) def 
time_packers_read_pack(self): pd.read_msgpack(self.f) - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class packers_read_pickle(object): - goal_time = 0.2 +class packers_read_pickle(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() self.df2.to_pickle(self.f) def time_packers_read_pickle(self): pd.read_pickle(self.f) - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - -class packers_read_sql(object): - goal_time = 0.2 +class packers_read_sql(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() self.engine = create_engine('sqlite:///:memory:') self.df2.to_sql('table', self.engine, if_exists='replace') def time_packers_read_sql(self): pd.read_sql_table('table', self.engine) - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class packers_read_stata(object): - goal_time = 0.2 +class packers_read_stata(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() self.df.to_stata(self.f, {'index': 'tc', }) def time_packers_read_stata(self): pd.read_stata(self.f) - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class packers_read_stata_with_validation(object): - goal_time = 0.2 +class packers_read_stata_with_validation(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() self.df['int8_'] = [randint(np.iinfo(np.int8).min, (np.iinfo(np.int8).max - 27)) for _ in range(self.N)] self.df['int16_'] = [randint(np.iinfo(np.int16).min, 
(np.iinfo(np.int16).max - 27)) for _ in range(self.N)] self.df['int32_'] = [randint(np.iinfo(np.int32).min, (np.iinfo(np.int32).max - 27)) for _ in range(self.N)] @@ -311,594 +149,170 @@ def setup(self): def time_packers_read_stata_with_validation(self): pd.read_stata(self.f) - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class packers_read_sas7bdat(object): +class packers_read_sas(_Packers): def setup(self): - self.f = os.path.join(os.path.dirname(__file__), '..', '..', - 'pandas', 'io', 'tests', 'sas', 'data', - 'test1.sas7bdat') - - def time_packers_read_sas7bdat(self): - pd.read_sas(self.f, format='sas7bdat') + testdir = os.path.join(os.path.dirname(__file__), '..', '..', + 'pandas', 'tests', 'io', 'sas') + if not os.path.exists(testdir): + testdir = os.path.join(os.path.dirname(__file__), '..', '..', + 'pandas', 'io', 'tests', 'sas') + self.f = os.path.join(testdir, 'data', 'test1.sas7bdat') + self.f2 = os.path.join(testdir, 'data', 'paxraw_d_short.xpt') -class packers_read_xport(object): - - def setup(self): - self.f = os.path.join(os.path.dirname(__file__), '..', '..', - 'pandas', 'io', 'tests', 'sas', 'data', - 'paxraw_d_short.xpt') + def time_read_sas7bdat(self): + pd.read_sas(self.f, format='sas7bdat') - def time_packers_read_xport(self): - pd.read_sas(self.f, format='xport') + def time_read_xport(self): + pd.read_sas(self.f2, format='xport') -class packers_write_csv(object): - goal_time = 0.2 +class CSV(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() - def time_packers_write_csv(self): + def time_write_csv(self): self.df.to_csv(self.f) def teardown(self): self.remove(self.f) - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class packers_write_excel_openpyxl(object): - goal_time = 0.2 +class Excel(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() self.bio = BytesIO() - def time_packers_write_excel_openpyxl(self): + def time_write_excel_openpyxl(self): self.bio.seek(0) self.writer = pd.io.excel.ExcelWriter(self.bio, engine='openpyxl') self.df[:2000].to_excel(self.writer) self.writer.save() - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - -class packers_write_excel_xlsxwriter(object): - goal_time = 0.2 - - def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - 
self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) - self.bio = BytesIO() - - def time_packers_write_excel_xlsxwriter(self): + def time_write_excel_xlsxwriter(self): self.bio.seek(0) self.writer = pd.io.excel.ExcelWriter(self.bio, engine='xlsxwriter') self.df[:2000].to_excel(self.writer) self.writer.save() - def remove(self, f): - try: - os.remove(self.f) - except: - pass + def time_write_excel_xlwt(self): + self.bio.seek(0) + self.writer = pd.io.excel.ExcelWriter(self.bio, engine='xlwt') + self.df[:2000].to_excel(self.writer) + self.writer.save() -class packers_write_excel_xlwt(object): - goal_time = 0.2 +class HDF(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) - self.bio = BytesIO() - - def time_packers_write_excel_xlwt(self): - self.bio.seek(0) - self.writer = pd.io.excel.ExcelWriter(self.bio, engine='xlwt') - self.df[:2000].to_excel(self.writer) - self.writer.save() + self._setup() - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - -class packers_write_hdf_store(object): - goal_time = 0.2 - - def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) - - def time_packers_write_hdf_store(self): + def time_write_hdf_store(self): self.df2.to_hdf(self.f, 'df') - def teardown(self): - self.remove(self.f) - - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - -class packers_write_hdf_table(object): - goal_time = 0.2 - - def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) - - def time_packers_write_hdf_table(self): + def time_write_hdf_table(self): self.df2.to_hdf(self.f, 'df', table=True) def teardown(self): self.remove(self.f) - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - -class packers_write_json(object): - goal_time = 0.2 - - def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - 
self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) - self.df.index = np.arange(self.N) - - def time_packers_write_json(self): - self.df.to_json(self.f, orient='split') - - def teardown(self): - self.remove(self.f) - - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - -class packers_write_json_lines(object): - goal_time = 0.2 - - def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.remove(self.f) - self.df.index = np.arange(self.N) - - def time_packers_write_json_lines(self): - self.df.to_json(self.f, orient="records", lines=True) - - def teardown(self): - self.remove(self.f) - - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - -class packers_write_json_T(object): - goal_time = 0.2 +class JSON(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() + self.df_date = self.df.copy() self.df.index = np.arange(self.N) - - def time_packers_write_json_T(self): - self.df.to_json(self.f, orient='columns') - - def teardown(self): - self.remove(self.f) - - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - -class packers_write_json_date_index(object): - goal_time = 0.2 - - def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) - - def time_packers_write_json_date_index(self): - self.df.to_json(self.f, orient='split') - - def teardown(self): - self.remove(self.f) - - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - -class packers_write_json_mixed_delta_int_tstamp(object): - goal_time = 0.2 - - def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for 
i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) self.cols = [(lambda i: ('{0}_timedelta'.format(i), [pd.Timedelta(('%d seconds' % randrange(1000000.0))) for _ in range(self.N)])), (lambda i: ('{0}_int'.format(i), randint(100000000.0, size=self.N))), (lambda i: ('{0}_timestamp'.format(i), [pd.Timestamp((1418842918083256000 + randrange(1000000000.0, 1e+18, 200))) for _ in range(self.N)]))] self.df_mixed = DataFrame(OrderedDict([self.cols[(i % len(self.cols))](i) for i in range(self.C)]), index=self.index) - def time_packers_write_json_mixed_delta_int_tstamp(self): - self.df_mixed.to_json(self.f, orient='split') - - def teardown(self): - self.remove(self.f) - - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - -class packers_write_json_mixed_float_int(object): - goal_time = 0.2 - - def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) self.cols = [(lambda i: ('{0}_float'.format(i), randn(self.N))), (lambda i: ('{0}_int'.format(i), randint(100000000.0, size=self.N)))] - self.df_mixed = DataFrame(OrderedDict([self.cols[(i % len(self.cols))](i) for i in range(self.C)]), index=self.index) - - def time_packers_write_json_mixed_float_int(self): - self.df_mixed.to_json(self.f, orient='index') + self.df_mixed2 = DataFrame(OrderedDict([self.cols[(i % len(self.cols))](i) for i in range(self.C)]), index=self.index) - def teardown(self): - self.remove(self.f) - - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - -class packers_write_json_mixed_float_int_T(object): - goal_time = 0.2 + self.cols = [(lambda i: ('{0}_float'.format(i), randn(self.N))), (lambda i: ('{0}_int'.format(i), randint(100000000.0, size=self.N))), (lambda i: ('{0}_str'.format(i), [('%08x' % randrange((16 ** 8))) for _ in range(self.N)]))] + self.df_mixed3 = DataFrame(OrderedDict([self.cols[(i % len(self.cols))](i) for i in range(self.C)]), index=self.index) - def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) - self.cols = [(lambda i: ('{0}_float'.format(i), randn(self.N))), (lambda i: ('{0}_int'.format(i), randint(100000000.0, size=self.N)))] - self.df_mixed = DataFrame(OrderedDict([self.cols[(i % len(self.cols))](i) for i in range(self.C)]), index=self.index) + def time_write_json(self): + self.df.to_json(self.f, orient='split') - def time_packers_write_json_mixed_float_int_T(self): - self.df_mixed.to_json(self.f, orient='columns') + def time_write_json_T(self): + self.df.to_json(self.f, orient='columns') - def 
teardown(self): - self.remove(self.f) + def time_write_json_date_index(self): + self.df_date.to_json(self.f, orient='split') - def remove(self, f): - try: - os.remove(self.f) - except: - pass + def time_write_json_mixed_delta_int_tstamp(self): + self.df_mixed.to_json(self.f, orient='split') + def time_write_json_mixed_float_int(self): + self.df_mixed2.to_json(self.f, orient='index') -class packers_write_json_mixed_float_int_str(object): - goal_time = 0.2 + def time_write_json_mixed_float_int_T(self): + self.df_mixed2.to_json(self.f, orient='columns') - def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) - self.cols = [(lambda i: ('{0}_float'.format(i), randn(self.N))), (lambda i: ('{0}_int'.format(i), randint(100000000.0, size=self.N))), (lambda i: ('{0}_str'.format(i), [('%08x' % randrange((16 ** 8))) for _ in range(self.N)]))] - self.df_mixed = DataFrame(OrderedDict([self.cols[(i % len(self.cols))](i) for i in range(self.C)]), index=self.index) + def time_write_json_mixed_float_int_str(self): + self.df_mixed3.to_json(self.f, orient='split') - def time_packers_write_json_mixed_float_int_str(self): - self.df_mixed.to_json(self.f, orient='split') + def time_write_json_lines(self): + self.df.to_json(self.f, orient="records", lines=True) def teardown(self): self.remove(self.f) - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class packers_write_pack(object): - goal_time = 0.2 +class MsgPack(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() - def time_packers_write_pack(self): + def time_write_msgpack(self): self.df2.to_msgpack(self.f) def teardown(self): self.remove(self.f) - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class packers_write_pickle(object): - goal_time = 0.2 +class Pickle(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() - def time_packers_write_pickle(self): + def time_write_pickle(self): self.df2.to_pickle(self.f) def teardown(self): self.remove(self.f) - def remove(self, f): - try: - 
os.remove(self.f) - except: - pass - -class packers_write_sql(object): - goal_time = 0.2 +class SQL(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) + self._setup() self.engine = create_engine('sqlite:///:memory:') - def time_packers_write_sql(self): + def time_write_sql(self): self.df2.to_sql('table', self.engine, if_exists='replace') - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class packers_write_stata(object): - goal_time = 0.2 +class STATA(_Packers): def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) - self.df.to_stata(self.f, {'index': 'tc', }) - - def time_packers_write_stata(self): - self.df.to_stata(self.f, {'index': 'tc', }) + self._setup() - def teardown(self): - self.remove(self.f) - - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - -class packers_write_stata_with_validation(object): - goal_time = 0.2 + self.df3 = self.df.copy() + self.df3['int8_'] = [randint(np.iinfo(np.int8).min, (np.iinfo(np.int8).max - 27)) for _ in range(self.N)] + self.df3['int16_'] = [randint(np.iinfo(np.int16).min, (np.iinfo(np.int16).max - 27)) for _ in range(self.N)] + self.df3['int32_'] = [randint(np.iinfo(np.int32).min, (np.iinfo(np.int32).max - 27)) for _ in range(self.N)] + self.df3['float32_'] = np.array(randn(self.N), dtype=np.float32) - def setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df2 = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) - self.df['int8_'] = [randint(np.iinfo(np.int8).min, (np.iinfo(np.int8).max - 27)) for _ in range(self.N)] - self.df['int16_'] = [randint(np.iinfo(np.int16).min, (np.iinfo(np.int16).max - 27)) for _ in range(self.N)] - self.df['int32_'] = [randint(np.iinfo(np.int32).min, (np.iinfo(np.int32).max - 27)) for _ in range(self.N)] - self.df['float32_'] = np.array(randn(self.N), dtype=np.float32) + def time_write_stata(self): self.df.to_stata(self.f, {'index': 'tc', }) - def time_packers_write_stata_with_validation(self): - self.df.to_stata(self.f, {'index': 'tc', }) + def time_write_stata_with_validation(self): + self.df3.to_stata(self.f, {'index': 'tc', }) def teardown(self):
self.remove(self.f) - - def remove(self, f): - try: - os.remove(self.f) - except: - pass diff --git a/asv_bench/benchmarks/pandas_vb_common.py b/asv_bench/benchmarks/pandas_vb_common.py index 3370131929c22..b1a58e49fe86c 100644 --- a/asv_bench/benchmarks/pandas_vb_common.py +++ b/asv_bench/benchmarks/pandas_vb_common.py @@ -1,28 +1,32 @@ from pandas import * import pandas as pd -from datetime import timedelta from numpy.random import randn from numpy.random import randint -from numpy.random import permutation import pandas.util.testing as tm import random import numpy as np import threading +from importlib import import_module + try: from pandas.compat import range except ImportError: pass np.random.seed(1234) -try: - import pandas._tseries as lib -except: - import pandas.lib as lib + +# try them in order until one imports +for imp in ['pandas._libs.lib', 'pandas.lib', 'pandas._tseries']: + try: + lib = import_module(imp) + break + except: + pass try: - Panel = WidePanel + Panel = Panel except Exception: - pass + Panel = WidePanel # wasn't added to the namespace until later try: diff --git a/asv_bench/benchmarks/panel_ctor.py b/asv_bench/benchmarks/panel_ctor.py index 4f6fd4a5a2df8..cc6071b054662 100644 --- a/asv_bench/benchmarks/panel_ctor.py +++ b/asv_bench/benchmarks/panel_ctor.py @@ -1,7 +1,8 @@ from .pandas_vb_common import * +from datetime import timedelta -class panel_from_dict_all_different_indexes(object): +class Constructors1(object): goal_time = 0.2 def setup(self): @@ -18,13 +19,13 @@ def time_panel_from_dict_all_different_indexes(self): Panel.from_dict(self.data_frames) -class panel_from_dict_equiv_indexes(object): +class Constructors2(object): goal_time = 0.2 def setup(self): self.data_frames = {} for x in range(100): - self.dr = np.asarray(DatetimeIndex(start=datetime(1990, 1, 1), end=datetime(2012, 1, 1), freq=datetools.Day(1))) + self.dr = np.asarray(DatetimeIndex(start=datetime(1990, 1, 1), end=datetime(2012, 1, 1), freq='D')) self.df = DataFrame({'a': ([0] * len(self.dr)), 'b': ([1] * len(self.dr)), 'c': ([2] * len(self.dr)), }, index=self.dr) self.data_frames[x] = self.df @@ -32,11 +33,11 @@ def time_panel_from_dict_equiv_indexes(self): Panel.from_dict(self.data_frames) -class panel_from_dict_same_index(object): +class Constructors3(object): goal_time = 0.2 def setup(self): - self.dr = np.asarray(DatetimeIndex(start=datetime(1990, 1, 1), end=datetime(2012, 1, 1), freq=datetools.Day(1))) + self.dr = np.asarray(DatetimeIndex(start=datetime(1990, 1, 1), end=datetime(2012, 1, 1), freq='D')) self.data_frames = {} for x in range(100): self.df = DataFrame({'a': ([0] * len(self.dr)), 'b': ([1] * len(self.dr)), 'c': ([2] * len(self.dr)), }, index=self.dr) @@ -46,7 +47,7 @@ def time_panel_from_dict_same_index(self): Panel.from_dict(self.data_frames) -class panel_from_dict_two_different_indexes(object): +class Constructors4(object): goal_time = 0.2 def setup(self): diff --git a/asv_bench/benchmarks/panel_methods.py b/asv_bench/benchmarks/panel_methods.py index 0bd572db2211a..6609305502011 100644 --- a/asv_bench/benchmarks/panel_methods.py +++ b/asv_bench/benchmarks/panel_methods.py @@ -1,56 +1,24 @@ from .pandas_vb_common import * -class panel_pct_change_items(object): +class PanelMethods(object): goal_time = 0.2 def setup(self): self.index = date_range(start='2000', freq='D', periods=1000) self.panel = Panel(np.random.randn(100, len(self.index), 1000)) - def time_panel_pct_change_items(self): + def time_pct_change_items(self): self.panel.pct_change(1, axis='items') - -class 
panel_pct_change_major(object): - goal_time = 0.2 - - def setup(self): - self.index = date_range(start='2000', freq='D', periods=1000) - self.panel = Panel(np.random.randn(100, len(self.index), 1000)) - - def time_panel_pct_change_major(self): + def time_pct_change_major(self): self.panel.pct_change(1, axis='major') - -class panel_pct_change_minor(object): - goal_time = 0.2 - - def setup(self): - self.index = date_range(start='2000', freq='D', periods=1000) - self.panel = Panel(np.random.randn(100, len(self.index), 1000)) - - def time_panel_pct_change_minor(self): + def time_pct_change_minor(self): self.panel.pct_change(1, axis='minor') - -class panel_shift(object): - goal_time = 0.2 - - def setup(self): - self.index = date_range(start='2000', freq='D', periods=1000) - self.panel = Panel(np.random.randn(100, len(self.index), 1000)) - - def time_panel_shift(self): + def time_shift(self): self.panel.shift(1) - -class panel_shift_minor(object): - goal_time = 0.2 - - def setup(self): - self.index = date_range(start='2000', freq='D', periods=1000) - self.panel = Panel(np.random.randn(100, len(self.index), 1000)) - - def time_panel_shift_minor(self): + def time_shift_minor(self): self.panel.shift(1, axis='minor') diff --git a/asv_bench/benchmarks/parser_vb.py b/asv_bench/benchmarks/parser_vb.py index 6dc8bffd6dac9..32bf7e50d1a89 100644 --- a/asv_bench/benchmarks/parser_vb.py +++ b/asv_bench/benchmarks/parser_vb.py @@ -1,71 +1,49 @@ from .pandas_vb_common import * import os -from pandas import read_csv, read_table +from pandas import read_csv try: from cStringIO import StringIO except ImportError: from io import StringIO -class read_csv_comment2(object): +class read_csv1(object): goal_time = 0.2 def setup(self): - self.data = ['A,B,C'] - self.data = (self.data + (['1,2,3 # comment'] * 100000)) - self.data = '\n'.join(self.data) - - def time_read_csv_comment2(self): - read_csv(StringIO(self.data), comment='#') - - -class read_csv_default_converter(object): - goal_time = 0.2 - - def setup(self): - self.data = """0.1213700904466425978256438611,0.0525708283766902484401839501,0.4174092731488769913994474336\n -0.4096341697147408700274695547,0.1587830198973579909349496119,0.1292545832485494372576795285\n -0.8323255650024565799327547210,0.9694902427379478160318626578,0.6295047811546814475747169126\n -0.4679375305798131323697930383,0.2963942381834381301075609371,0.5268936082160610157032465394\n -0.6685382761849776311890991564,0.6721207066140679753374342908,0.6519975277021627935170045020\n""" - self.data = (self.data * 200) - - def time_read_csv_default_converter(self): - read_csv(StringIO(self.data), sep=',', header=None, float_precision=None) + self.N = 10000 + self.K = 8 + self.df = DataFrame((np.random.randn(self.N, self.K) * np.random.randint(100, 10000, (self.N, self.K)))) + self.df.to_csv('test.csv', sep='|') + self.format = (lambda x: '{:,}'.format(x)) + self.df2 = self.df.applymap(self.format) + self.df2.to_csv('test2.csv', sep='|') -class read_csv_default_converter_with_decimal(object): - goal_time = 0.2 + def time_sep(self): + read_csv('test.csv', sep='|') - def setup(self): - self.data = """0,1213700904466425978256438611;0,0525708283766902484401839501;0,4174092731488769913994474336\n -0,4096341697147408700274695547;0,1587830198973579909349496119;0,1292545832485494372576795285\n -0,8323255650024565799327547210;0,9694902427379478160318626578;0,6295047811546814475747169126\n -0,4679375305798131323697930383;0,2963942381834381301075609371;0,5268936082160610157032465394\n 
-0,6685382761849776311890991564;0,6721207066140679753374342908;0,6519975277021627935170045020\n""" - self.data = (self.data * 200) + def time_thousands(self): + read_csv('test2.csv', sep='|', thousands=',') - def time_read_csv_default_converter_with_decimal(self): - read_csv(StringIO(self.data), sep=';', header=None, - float_precision=None, decimal=',') + def teardown(self): + os.remove('test.csv') + os.remove('test2.csv') -class read_csv_precise_converter(object): +class read_csv2(object): goal_time = 0.2 def setup(self): - self.data = """0.1213700904466425978256438611,0.0525708283766902484401839501,0.4174092731488769913994474336\n -0.4096341697147408700274695547,0.1587830198973579909349496119,0.1292545832485494372576795285\n -0.8323255650024565799327547210,0.9694902427379478160318626578,0.6295047811546814475747169126\n -0.4679375305798131323697930383,0.2963942381834381301075609371,0.5268936082160610157032465394\n -0.6685382761849776311890991564,0.6721207066140679753374342908,0.6519975277021627935170045020\n""" - self.data = (self.data * 200) + self.data = ['A,B,C'] + self.data = (self.data + (['1,2,3 # comment'] * 100000)) + self.data = '\n'.join(self.data) - def time_read_csv_precise_converter(self): - read_csv(StringIO(self.data), sep=',', header=None, float_precision='high') + def time_comment(self): + read_csv(StringIO(self.data), comment='#') -class read_csv_roundtrip_converter(object): +class read_csv3(object): goal_time = 0.2 def setup(self): @@ -74,44 +52,33 @@ def setup(self): 0.8323255650024565799327547210,0.9694902427379478160318626578,0.6295047811546814475747169126\n 0.4679375305798131323697930383,0.2963942381834381301075609371,0.5268936082160610157032465394\n 0.6685382761849776311890991564,0.6721207066140679753374342908,0.6519975277021627935170045020\n""" + self.data2 = self.data.replace(',', ';').replace('.', ',') self.data = (self.data * 200) + self.data2 = (self.data2 * 200) - def time_read_csv_roundtrip_converter(self): - read_csv(StringIO(self.data), sep=',', header=None, float_precision='round_trip') - - -class read_csv_thou_vb(object): - goal_time = 0.2 - - def setup(self): - self.N = 10000 - self.K = 8 - self.format = (lambda x: '{:,}'.format(x)) - self.df = DataFrame((np.random.randn(self.N, self.K) * np.random.randint(100, 10000, (self.N, self.K)))) - self.df = self.df.applymap(self.format) - self.df.to_csv('test.csv', sep='|') - - def time_read_csv_thou_vb(self): - read_csv('test.csv', sep='|', thousands=',') - - def teardown(self): - os.remove('test.csv') + def time_default_converter(self): + read_csv(StringIO(self.data), sep=',', header=None, + float_precision=None) + def time_default_converter_with_decimal(self): + read_csv(StringIO(self.data2), sep=';', header=None, + float_precision=None, decimal=',') -class read_csv_vb(object): - goal_time = 0.2 + def time_default_converter_python_engine(self): + read_csv(StringIO(self.data), sep=',', header=None, + float_precision=None, engine='python') - def setup(self): - self.N = 10000 - self.K = 8 - self.df = DataFrame((np.random.randn(self.N, self.K) * np.random.randint(100, 10000, (self.N, self.K)))) - self.df.to_csv('test.csv', sep='|') + def time_default_converter_with_decimal_python_engine(self): + read_csv(StringIO(self.data2), sep=';', header=None, + float_precision=None, decimal=',', engine='python') - def time_read_csv_vb(self): - read_csv('test.csv', sep='|') + def time_precise_converter(self): + read_csv(StringIO(self.data), sep=',', header=None, + float_precision='high') - def teardown(self): -
os.remove('test.csv') + def time_roundtrip_converter(self): + read_csv(StringIO(self.data), sep=',', header=None, + float_precision='round_trip') class read_csv_categorical(object): @@ -125,17 +92,17 @@ def setup(self): 'c': np.random.choice(group1, N).astype('object')}) df.to_csv('strings.csv', index=False) - def time_read_csv_categorical_post(self): + def time_convert_post(self): read_csv('strings.csv').apply(pd.Categorical) - def time_read_csv_categorical_direct(self): + def time_convert_direct(self): read_csv('strings.csv', dtype='category') def teardown(self): os.remove('strings.csv') -class read_table_multiple_date(object): +class read_csv_dateparsing(object): goal_time = 0.2 def setup(self): @@ -143,43 +110,12 @@ def setup(self): self.K = 8 self.data = 'KORD,19990127, 19:00:00, 18:56:00, 0.8100, 2.8100, 7.2000, 0.0000, 280.0000\n KORD,19990127, 20:00:00, 19:56:00, 0.0100, 2.2100, 7.2000, 0.0000, 260.0000\n KORD,19990127, 21:00:00, 20:56:00, -0.5900, 2.2100, 5.7000, 0.0000, 280.0000\n KORD,19990127, 21:00:00, 21:18:00, -0.9900, 2.0100, 3.6000, 0.0000, 270.0000\n KORD,19990127, 22:00:00, 21:56:00, -0.5900, 1.7100, 5.1000, 0.0000, 290.0000\n ' self.data = (self.data * 200) + self.data2 = 'KORD,19990127 19:00:00, 18:56:00, 0.8100, 2.8100, 7.2000, 0.0000, 280.0000\n KORD,19990127 20:00:00, 19:56:00, 0.0100, 2.2100, 7.2000, 0.0000, 260.0000\n KORD,19990127 21:00:00, 20:56:00, -0.5900, 2.2100, 5.7000, 0.0000, 280.0000\n KORD,19990127 21:00:00, 21:18:00, -0.9900, 2.0100, 3.6000, 0.0000, 270.0000\n KORD,19990127 22:00:00, 21:56:00, -0.5900, 1.7100, 5.1000, 0.0000, 290.0000\n ' + self.data2 = (self.data2 * 200) - def time_read_table_multiple_date(self): - read_table(StringIO(self.data), sep=',', header=None, parse_dates=[[1, 2], [1, 3]]) - - -class read_table_multiple_date_baseline(object): - goal_time = 0.2 - - def setup(self): - self.N = 10000 - self.K = 8 - self.data = 'KORD,19990127 19:00:00, 18:56:00, 0.8100, 2.8100, 7.2000, 0.0000, 280.0000\n KORD,19990127 20:00:00, 19:56:00, 0.0100, 2.2100, 7.2000, 0.0000, 260.0000\n KORD,19990127 21:00:00, 20:56:00, -0.5900, 2.2100, 5.7000, 0.0000, 280.0000\n KORD,19990127 21:00:00, 21:18:00, -0.9900, 2.0100, 3.6000, 0.0000, 270.0000\n KORD,19990127 22:00:00, 21:56:00, -0.5900, 1.7100, 5.1000, 0.0000, 290.0000\n ' - self.data = (self.data * 200) - - def time_read_table_multiple_date_baseline(self): - read_table(StringIO(self.data), sep=',', header=None, parse_dates=[1]) - - -class read_csv_default_converter_python_engine(object): - goal_time = 0.2 - - def setup(self): - self.data = '0.1213700904466425978256438611,0.0525708283766902484401839501,0.4174092731488769913994474336\n 0.4096341697147408700274695547,0.1587830198973579909349496119,0.1292545832485494372576795285\n 0.8323255650024565799327547210,0.9694902427379478160318626578,0.6295047811546814475747169126\n 0.4679375305798131323697930383,0.2963942381834381301075609371,0.5268936082160610157032465394\n 0.6685382761849776311890991564,0.6721207066140679753374342908,0.6519975277021627935170045020\n ' - self.data = (self.data * 200) - - def time_read_csv_default_converter(self): + def time_multiple_date(self): read_csv(StringIO(self.data), sep=',', header=None, - float_precision=None, engine='python') - + parse_dates=[[1, 2], [1, 3]]) -class read_csv_default_converter_with_decimal_python_engine(object): - goal_time = 0.2 - - def setup(self): - self.data = '0,1213700904466425978256438611;0,0525708283766902484401839501;0,4174092731488769913994474336\n 
0,4096341697147408700274695547;0,1587830198973579909349496119;0,1292545832485494372576795285\n 0,8323255650024565799327547210;0,9694902427379478160318626578;0,6295047811546814475747169126\n 0,4679375305798131323697930383;0,2963942381834381301075609371;0,5268936082160610157032465394\n 0,6685382761849776311890991564;0,6721207066140679753374342908;0,6519975277021627935170045020\n ' - self.data = (self.data * 200) - - def time_read_csv_default_converter_with_decimal(self): - read_csv(StringIO(self.data), sep=';', header=None, - float_precision=None, decimal=',', engine='python') + def time_baseline(self): + read_csv(StringIO(self.data2), sep=',', header=None, parse_dates=[1]) diff --git a/asv_bench/benchmarks/period.py b/asv_bench/benchmarks/period.py index bed2c4d5309cd..f9837191a7bae 100644 --- a/asv_bench/benchmarks/period.py +++ b/asv_bench/benchmarks/period.py @@ -1,31 +1,33 @@ +import pandas as pd from pandas import Series, Period, PeriodIndex, date_range -class create_period_index_from_date_range(object): +class Constructor(object): goal_time = 0.2 - def time_period_index(self): - # Simulate irregular PeriodIndex - PeriodIndex(date_range('1985', periods=1000).to_pydatetime(), freq='D') + def setup(self): + self.rng = date_range('1985', periods=1000) + self.rng2 = date_range('1985', periods=1000).to_pydatetime() + + def time_from_date_range(self): + PeriodIndex(self.rng, freq='D') + def time_from_pydatetime(self): + PeriodIndex(self.rng2, freq='D') -class period_setitem(object): + +class DataFrame(object): goal_time = 0.2 def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.rng = period_range(start='1/1/1990', freq='S', periods=20000) - self.df = DataFrame(index=range(len(self.rng))) - - def time_period_setitem(self): + self.rng = pd.period_range(start='1/1/1990', freq='S', periods=20000) + self.df = pd.DataFrame(index=range(len(self.rng))) + + def time_setitem_period_column(self): self.df['col'] = self.rng -class period_algorithm(object): +class Algorithms(object): goal_time = 0.2 def setup(self): @@ -34,16 +36,16 @@ def setup(self): self.s = Series(data * 1000) self.i = PeriodIndex(data, freq='M') - def time_period_series_drop_duplicates(self): + def time_drop_duplicates_pseries(self): self.s.drop_duplicates() - def time_period_index_drop_duplicates(self): + def time_drop_duplicates_pindex(self): self.i.drop_duplicates() - def time_period_series_value_counts(self): + def time_value_counts_pseries(self): self.s.value_counts() - def time_period_index_value_counts(self): + def time_value_counts_pindex(self): self.i.value_counts() diff --git a/asv_bench/benchmarks/plotting.py b/asv_bench/benchmarks/plotting.py index 7a4a98e2195c2..dda684b35e301 100644 --- a/asv_bench/benchmarks/plotting.py +++ b/asv_bench/benchmarks/plotting.py @@ -4,10 +4,13 @@ except ImportError: def date_range(start=None, end=None, periods=None, freq=None): return DatetimeIndex(start, end, periods=periods, offset=freq) -from pandas.tools.plotting import andrews_curves +try: + from pandas.plotting import andrews_curves +except ImportError: + from pandas.tools.plotting import andrews_curves -class plot_timeseries_period(object): +class TimeseriesPlotting(object): goal_time = 0.2 def setup(self): @@ -17,11 +20,14 @@ def setup(self): self.M = 5 self.df = DataFrame(np.random.randn(self.N, self.M), index=date_range('1/1/1975', periods=self.N)) 
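+    # x_compat=True (in time_plot_regular_compat below) skips pandas' period-aware date converters and times matplotlib's generic date handling instead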
- def time_plot_timeseries_period(self): + def time_plot_regular(self): self.df.plot() + def time_plot_regular_compat(self): + self.df.plot(x_compat=True) + -class plot_andrews_curves(object): +class Misc(object): goal_time = 0.6 def setup(self): diff --git a/asv_bench/benchmarks/reindex.py b/asv_bench/benchmarks/reindex.py index b1c039058ff8f..537d275e7c727 100644 --- a/asv_bench/benchmarks/reindex.py +++ b/asv_bench/benchmarks/reindex.py @@ -2,175 +2,52 @@ from random import shuffle -class dataframe_reindex(object): +class Reindexing(object): goal_time = 0.2 def setup(self): - self.rng = DatetimeIndex(start='1/1/1970', periods=10000, freq=datetools.Minute()) - self.df = DataFrame(np.random.rand(10000, 10), index=self.rng, columns=range(10)) + self.rng = DatetimeIndex(start='1/1/1970', periods=10000, freq='1min') + self.df = DataFrame(np.random.rand(10000, 10), index=self.rng, + columns=range(10)) self.df['foo'] = 'bar' self.rng2 = Index(self.rng[::2]) - def time_dataframe_reindex(self): - self.df.reindex(self.rng2) - - -class frame_drop_dup_inplace(object): - goal_time = 0.2 - - def setup(self): - self.N = 10000 - self.K = 10 - self.key1 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.key2 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.df = DataFrame({'key1': self.key1, 'key2': self.key2, 'value': np.random.randn((self.N * self.K)), }) - self.col_array_list = list(self.df.values.T) - - def time_frame_drop_dup_inplace(self): - self.df.drop_duplicates(['key1', 'key2'], inplace=True) - - -class frame_drop_dup_na_inplace(object): - goal_time = 0.2 - - def setup(self): - self.N = 10000 - self.K = 10 - self.key1 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.key2 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.df = DataFrame({'key1': self.key1, 'key2': self.key2, 'value': np.random.randn((self.N * self.K)), }) - self.col_array_list = list(self.df.values.T) - self.df.ix[:10000, :] = np.nan - - def time_frame_drop_dup_na_inplace(self): - self.df.drop_duplicates(['key1', 'key2'], inplace=True) - - -class frame_drop_duplicates(object): - goal_time = 0.2 - - def setup(self): - self.N = 10000 - self.K = 10 - self.key1 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.key2 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.df = DataFrame({'key1': self.key1, 'key2': self.key2, 'value': np.random.randn((self.N * self.K)), }) - self.col_array_list = list(self.df.values.T) - - def time_frame_drop_duplicates(self): - self.df.drop_duplicates(['key1', 'key2']) - - -class frame_drop_duplicates_int(object): - - def setup(self): - np.random.seed(1234) - self.N = 1000000 - self.K = 10000 - self.key1 = np.random.randint(0,self.K,size=self.N) - self.df = DataFrame({'key1': self.key1}) - - def time_frame_drop_duplicates_int(self): - self.df.drop_duplicates() - - -class frame_drop_duplicates_na(object): - goal_time = 0.2 - - def setup(self): - self.N = 10000 - self.K = 10 - self.key1 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.key2 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.df = DataFrame({'key1': self.key1, 'key2': self.key2, 'value': np.random.randn((self.N * self.K)), }) - self.col_array_list = list(self.df.values.T) - self.df.ix[:10000, :] = np.nan - - def time_frame_drop_duplicates_na(self): - self.df.drop_duplicates(['key1', 'key2']) - - -class frame_fillna_many_columns_pad(object): - goal_time = 0.2 - - def setup(self): - self.values = np.random.randn(1000, 1000) - self.values[::2] = np.nan - self.df = 
DataFrame(self.values) - - def time_frame_fillna_many_columns_pad(self): - self.df.fillna(method='pad') - - -class frame_reindex_columns(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(index=range(10000), data=np.random.rand(10000, 30), columns=range(30)) - - def time_frame_reindex_columns(self): - self.df.reindex(columns=self.df.columns[1:5]) - + self.df2 = DataFrame(index=range(10000), + data=np.random.rand(10000, 30), columns=range(30)) -class frame_sort_index_by_columns(object): - goal_time = 0.2 - - def setup(self): - self.N = 10000 - self.K = 10 - self.key1 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.key2 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.df = DataFrame({'key1': self.key1, 'key2': self.key2, 'value': np.random.randn((self.N * self.K)), }) - self.col_array_list = list(self.df.values.T) - - def time_frame_sort_index_by_columns(self): - self.df.sort_index(by=['key1', 'key2']) - - -class lib_fast_zip(object): - goal_time = 0.2 - - def setup(self): - self.N = 10000 - self.K = 10 - self.key1 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.key2 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.df = DataFrame({'key1': self.key1, 'key2': self.key2, 'value': np.random.randn((self.N * self.K)), }) - self.col_array_list = list(self.df.values.T) + # multi-index + N = 5000 + K = 200 + level1 = tm.makeStringIndex(N).values.repeat(K) + level2 = np.tile(tm.makeStringIndex(K).values, N) + index = MultiIndex.from_arrays([level1, level2]) + self.s1 = Series(np.random.randn((N * K)), index=index) + self.s2 = self.s1[::2] - def time_lib_fast_zip(self): - lib.fast_zip(self.col_array_list) + def time_reindex_dates(self): + self.df.reindex(self.rng2) + def time_reindex_columns(self): + self.df2.reindex(columns=self.df.columns[1:5]) -class lib_fast_zip_fillna(object): - goal_time = 0.2 + def time_reindex_multiindex(self): + self.s1.reindex(self.s2.index) - def setup(self): - self.N = 10000 - self.K = 10 - self.key1 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.key2 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.df = DataFrame({'key1': self.key1, 'key2': self.key2, 'value': np.random.randn((self.N * self.K)), }) - self.col_array_list = list(self.df.values.T) - self.df.ix[:10000, :] = np.nan - def time_lib_fast_zip_fillna(self): - lib.fast_zip_fillna(self.col_array_list) +#---------------------------------------------------------------------- +# Pad / backfill -class reindex_daterange_backfill(object): +class FillMethod(object): goal_time = 0.2 def setup(self): - self.rng = date_range('1/1/2000', periods=100000, freq=datetools.Minute()) + self.rng = date_range('1/1/2000', periods=100000, freq='1min') self.ts = Series(np.random.randn(len(self.rng)), index=self.rng) self.ts2 = self.ts[::2] self.ts3 = self.ts2.reindex(self.ts.index) self.ts4 = self.ts3.astype('float32') - def time_reindex_daterange_backfill(self): - self.backfill(self.ts2, self.ts.index) - def pad(self, source_series, target_index): try: source_series.reindex(target_index, method='pad') @@ -183,215 +60,148 @@ def backfill(self, source_series, target_index): except: source_series.reindex(target_index, fillMethod='backfill') + def time_backfill_dates(self): + self.backfill(self.ts2, self.ts.index) -class reindex_daterange_pad(object): - goal_time = 0.2 - - def setup(self): - self.rng = date_range('1/1/2000', periods=100000, freq=datetools.Minute()) - self.ts = Series(np.random.randn(len(self.rng)), index=self.rng) - self.ts2 = self.ts[::2] - self.ts3 
= self.ts2.reindex(self.ts.index) - self.ts4 = self.ts3.astype('float32') - - def time_reindex_daterange_pad(self): + def time_pad_daterange(self): self.pad(self.ts2, self.ts.index) - def pad(self, source_series, target_index): - try: - source_series.reindex(target_index, method='pad') - except: - source_series.reindex(target_index, fillMethod='pad') - - def backfill(self, source_series, target_index): - try: - source_series.reindex(target_index, method='backfill') - except: - source_series.reindex(target_index, fillMethod='backfill') - - -class reindex_fillna_backfill(object): - goal_time = 0.2 - - def setup(self): - self.rng = date_range('1/1/2000', periods=100000, freq=datetools.Minute()) - self.ts = Series(np.random.randn(len(self.rng)), index=self.rng) - self.ts2 = self.ts[::2] - self.ts3 = self.ts2.reindex(self.ts.index) - self.ts4 = self.ts3.astype('float32') - - def time_reindex_fillna_backfill(self): + def time_backfill(self): self.ts3.fillna(method='backfill') - def pad(self, source_series, target_index): - try: - source_series.reindex(target_index, method='pad') - except: - source_series.reindex(target_index, fillMethod='pad') - - def backfill(self, source_series, target_index): - try: - source_series.reindex(target_index, method='backfill') - except: - source_series.reindex(target_index, fillMethod='backfill') - - -class reindex_fillna_backfill_float32(object): - goal_time = 0.2 + def time_backfill_float32(self): + self.ts4.fillna(method='backfill') - def setup(self): - self.rng = date_range('1/1/2000', periods=100000, freq=datetools.Minute()) - self.ts = Series(np.random.randn(len(self.rng)), index=self.rng) - self.ts2 = self.ts[::2] - self.ts3 = self.ts2.reindex(self.ts.index) - self.ts4 = self.ts3.astype('float32') + def time_pad(self): + self.ts3.fillna(method='pad') - def time_reindex_fillna_backfill_float32(self): - self.ts4.fillna(method='backfill') + def time_pad_float32(self): + self.ts4.fillna(method='pad') - def pad(self, source_series, target_index): - try: - source_series.reindex(target_index, method='pad') - except: - source_series.reindex(target_index, fillMethod='pad') - def backfill(self, source_series, target_index): - try: - source_series.reindex(target_index, method='backfill') - except: - source_series.reindex(target_index, fillMethod='backfill') +#---------------------------------------------------------------------- +# align on level -class reindex_fillna_pad(object): +class LevelAlign(object): goal_time = 0.2 def setup(self): - self.rng = date_range('1/1/2000', periods=100000, freq=datetools.Minute()) - self.ts = Series(np.random.randn(len(self.rng)), index=self.rng) - self.ts2 = self.ts[::2] - self.ts3 = self.ts2.reindex(self.ts.index) - self.ts4 = self.ts3.astype('float32') + self.index = MultiIndex( + levels=[np.arange(10), np.arange(100), np.arange(100)], + labels=[np.arange(10).repeat(10000), + np.tile(np.arange(100).repeat(100), 10), + np.tile(np.tile(np.arange(100), 100), 10)]) + random.shuffle(self.index.values) + self.df = DataFrame(np.random.randn(len(self.index), 4), + index=self.index) + self.df_level = DataFrame(np.random.randn(100, 4), + index=self.index.levels[1]) - def time_reindex_fillna_pad(self): - self.ts3.fillna(method='pad') + def time_align_level(self): + self.df.align(self.df_level, level=1, copy=False) - def pad(self, source_series, target_index): - try: - source_series.reindex(target_index, method='pad') - except: - source_series.reindex(target_index, fillMethod='pad') + def time_reindex_level(self): + 
self.df_level.reindex(self.df.index, level=1) - def backfill(self, source_series, target_index): - try: - source_series.reindex(target_index, method='backfill') - except: - source_series.reindex(target_index, fillMethod='backfill') +#---------------------------------------------------------------------- +# drop_duplicates -class reindex_fillna_pad_float32(object): + +class Duplicates(object): goal_time = 0.2 def setup(self): - self.rng = date_range('1/1/2000', periods=100000, freq=datetools.Minute()) - self.ts = Series(np.random.randn(len(self.rng)), index=self.rng) - self.ts2 = self.ts[::2] - self.ts3 = self.ts2.reindex(self.ts.index) - self.ts4 = self.ts3.astype('float32') + self.N = 10000 + self.K = 10 + self.key1 = tm.makeStringIndex(self.N).values.repeat(self.K) + self.key2 = tm.makeStringIndex(self.N).values.repeat(self.K) + self.df = DataFrame({'key1': self.key1, 'key2': self.key2, + 'value': np.random.randn((self.N * self.K)),}) + self.col_array_list = list(self.df.values.T) - def time_reindex_fillna_pad_float32(self): - self.ts4.fillna(method='pad') + self.df2 = self.df.copy() + self.df2.ix[:10000, :] = np.nan - def pad(self, source_series, target_index): - try: - source_series.reindex(target_index, method='pad') - except: - source_series.reindex(target_index, fillMethod='pad') + self.s = Series(np.random.randint(0, 1000, size=10000)) + self.s2 = Series(np.tile(tm.makeStringIndex(1000).values, 10)) - def backfill(self, source_series, target_index): - try: - source_series.reindex(target_index, method='backfill') - except: - source_series.reindex(target_index, fillMethod='backfill') + np.random.seed(1234) + self.N = 1000000 + self.K = 10000 + self.key1 = np.random.randint(0, self.K, size=self.N) + self.df_int = DataFrame({'key1': self.key1}) + self.df_bool = DataFrame({i: np.random.randint(0, 2, size=self.K, + dtype=bool) + for i in range(10)}) + def time_frame_drop_dups(self): + self.df.drop_duplicates(['key1', 'key2']) -class reindex_frame_level_align(object): - goal_time = 0.2 + def time_frame_drop_dups_inplace(self): + self.df.drop_duplicates(['key1', 'key2'], inplace=True) - def setup(self): - self.index = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], labels=[np.arange(10).repeat(10000), np.tile(np.arange(100).repeat(100), 10), np.tile(np.tile(np.arange(100), 100), 10)]) - random.shuffle(self.index.values) - self.df = DataFrame(np.random.randn(len(self.index), 4), index=self.index) - self.df_level = DataFrame(np.random.randn(100, 4), index=self.index.levels[1]) + def time_frame_drop_dups_na(self): + self.df2.drop_duplicates(['key1', 'key2']) - def time_reindex_frame_level_align(self): - self.df.align(self.df_level, level=1, copy=False) + def time_frame_drop_dups_na_inplace(self): + self.df2.drop_duplicates(['key1', 'key2'], inplace=True) + def time_series_drop_dups_int(self): + self.s.drop_duplicates() -class reindex_frame_level_reindex(object): - goal_time = 0.2 + def time_series_drop_dups_string(self): + self.s2.drop_duplicates() - def setup(self): - self.index = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], labels=[np.arange(10).repeat(10000), np.tile(np.arange(100).repeat(100), 10), np.tile(np.tile(np.arange(100), 100), 10)]) - random.shuffle(self.index.values) - self.df = DataFrame(np.random.randn(len(self.index), 4), index=self.index) - self.df_level = DataFrame(np.random.randn(100, 4), index=self.index.levels[1]) + def time_frame_drop_dups_int(self): + self.df_int.drop_duplicates() - def time_reindex_frame_level_reindex(self): - 
self.df_level.reindex(self.df.index, level=1) + def time_frame_drop_dups_bool(self): + self.df_bool.drop_duplicates() +#---------------------------------------------------------------------- +# blog "pandas escaped the zoo" -class reindex_multiindex(object): + +class Align(object): goal_time = 0.2 def setup(self): - self.N = 1000 - self.K = 20 - self.level1 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.level2 = np.tile(tm.makeStringIndex(self.K).values, self.N) - self.index = MultiIndex.from_arrays([self.level1, self.level2]) - self.s1 = Series(np.random.randn((self.N * self.K)), index=self.index) - self.s2 = self.s1[::2] - - def time_reindex_multiindex(self): - self.s1.reindex(self.s2.index) - + n = 50000 + indices = tm.makeStringIndex(n) + subsample_size = 40000 -class series_align_irregular_string(object): - goal_time = 0.2 + def sample(values, k): + sampler = np.arange(len(values)) + shuffle(sampler) + return values.take(sampler[:k]) - def setup(self): - self.n = 50000 - self.indices = tm.makeStringIndex(self.n) - self.subsample_size = 40000 - self.x = Series(np.random.randn(50000), self.indices) - self.y = Series(np.random.randn(self.subsample_size), index=self.sample(self.indices, self.subsample_size)) + self.x = Series(np.random.randn(50000), indices) + self.y = Series(np.random.randn(subsample_size), + index=sample(indices, subsample_size)) - def time_series_align_irregular_string(self): + def time_align_series_irregular_string(self): (self.x + self.y) - def sample(self, values, k): - self.sampler = np.arange(len(values)) - shuffle(self.sampler) - return values.take(self.sampler[:k]) - -class series_drop_duplicates_int(object): +class LibFastZip(object): goal_time = 0.2 def setup(self): - self.s = Series(np.random.randint(0, 1000, size=10000)) - self.s2 = Series(np.tile(tm.makeStringIndex(1000).values, 10)) - - def time_series_drop_duplicates_int(self): - self.s.drop_duplicates() - + self.N = 10000 + self.K = 10 + self.key1 = tm.makeStringIndex(self.N).values.repeat(self.K) + self.key2 = tm.makeStringIndex(self.N).values.repeat(self.K) + self.df = DataFrame({'key1': self.key1, 'key2': self.key2, 'value': np.random.randn((self.N * self.K)), }) + self.col_array_list = list(self.df.values.T) -class series_drop_duplicates_string(object): - goal_time = 0.2 + self.df2 = self.df.copy() + self.df2.ix[:10000, :] = np.nan + self.col_array_list2 = list(self.df2.values.T) - def setup(self): - self.s = Series(np.random.randint(0, 1000, size=10000)) - self.s2 = Series(np.tile(tm.makeStringIndex(1000).values, 10)) + def time_lib_fast_zip(self): + lib.fast_zip(self.col_array_list) - def time_series_drop_duplicates_string(self): - self.s2.drop_duplicates() + def time_lib_fast_zip_fillna(self): + lib.fast_zip_fillna(self.col_array_list2) diff --git a/asv_bench/benchmarks/replace.py b/asv_bench/benchmarks/replace.py index 66b8af53801ac..63562f90eab2b 100644 --- a/asv_bench/benchmarks/replace.py +++ b/asv_bench/benchmarks/replace.py @@ -1,6 +1,4 @@ from .pandas_vb_common import * -from pandas.compat import range -from datetime import timedelta class replace_fillna(object): diff --git a/asv_bench/benchmarks/reshape.py b/asv_bench/benchmarks/reshape.py index ab235e085986c..177e3e7cb87fa 100644 --- a/asv_bench/benchmarks/reshape.py +++ b/asv_bench/benchmarks/reshape.py @@ -1,5 +1,5 @@ from .pandas_vb_common import * -from pandas.core.reshape import melt +from pandas import melt, wide_to_long class melt_dataframe(object): @@ -59,6 +59,27 @@ def time_reshape_unstack_simple(self): 
self.df.unstack(1) +class reshape_unstack_large_single_dtype(object): + goal_time = 0.2 + + def setup(self): + m = 100 + n = 1000 + + levels = np.arange(m) + index = pd.MultiIndex.from_product([levels]*2) + columns = np.arange(n) + values = np.arange(m*m*n).reshape(m*m, n) + self.df = pd.DataFrame(values, index, columns) + self.df2 = self.df.iloc[:-1] + + def time_unstack_full_product(self): + self.df.unstack() + + def time_unstack_with_mask(self): + self.df2.unstack() + + class unstack_sparse_keyspace(object): goal_time = 0.2 @@ -74,3 +95,25 @@ def setup(self): def time_unstack_sparse_keyspace(self): self.idf.unstack() + + +class wide_to_long_big(object): + goal_time = 0.2 + + def setup(self): + vars = 'ABCD' + nyrs = 20 + nidvars = 20 + N = 5000 + yrvars = [] + for var in vars: + for yr in range(1, nyrs + 1): + yrvars.append(var + str(yr)) + + self.df = pd.DataFrame(np.random.randn(N, nidvars + len(yrvars)), + columns=list(range(nidvars)) + yrvars) + self.vars = vars + + def time_wide_to_long_big(self): + self.df['id'] = self.df.index + wide_to_long(self.df, list(self.vars), i='id', j='year') diff --git a/asv_bench/benchmarks/series_methods.py b/asv_bench/benchmarks/series_methods.py index 413c4e044fd3a..c66654ee1e006 100644 --- a/asv_bench/benchmarks/series_methods.py +++ b/asv_bench/benchmarks/series_methods.py @@ -68,8 +68,8 @@ def setup(self): self.s4 = self.s3.astype('object') def time_series_nlargest1(self): - self.s1.nlargest(3, take_last=True) - self.s1.nlargest(3, take_last=False) + self.s1.nlargest(3, keep='last') + self.s1.nlargest(3, keep='first') class series_nlargest2(object): @@ -83,8 +83,8 @@ def setup(self): self.s4 = self.s3.astype('object') def time_series_nlargest2(self): - self.s2.nlargest(3, take_last=True) - self.s2.nlargest(3, take_last=False) + self.s2.nlargest(3, keep='last') + self.s2.nlargest(3, keep='first') class series_nsmallest2(object): @@ -98,8 +98,8 @@ def setup(self): self.s4 = self.s3.astype('object') def time_series_nsmallest2(self): - self.s2.nsmallest(3, take_last=True) - self.s2.nsmallest(3, take_last=False) + self.s2.nsmallest(3, keep='last') + self.s2.nsmallest(3, keep='first') class series_dropna_int64(object): diff --git a/asv_bench/benchmarks/sparse.py b/asv_bench/benchmarks/sparse.py index 717fe7218ceda..500149b89b08b 100644 --- a/asv_bench/benchmarks/sparse.py +++ b/asv_bench/benchmarks/sparse.py @@ -1,8 +1,6 @@ from .pandas_vb_common import * -import pandas.sparse.series import scipy.sparse -from pandas.core.sparse import SparseSeries, SparseDataFrame -from pandas.core.sparse import SparseDataFrame +from pandas import SparseSeries, SparseDataFrame class sparse_series_to_frame(object): @@ -37,7 +35,7 @@ def setup(self): self.A = scipy.sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(100, 100)) def time_sparse_series_from_coo(self): - self.ss = pandas.sparse.series.SparseSeries.from_coo(self.A) + self.ss = SparseSeries.from_coo(self.A) class sparse_series_to_coo(object): diff --git a/asv_bench/benchmarks/strings.py b/asv_bench/benchmarks/strings.py index d64606214ca6a..c1600d4e07f58 100644 --- a/asv_bench/benchmarks/strings.py +++ b/asv_bench/benchmarks/strings.py @@ -4,390 +4,104 @@ import pandas.util.testing as testing -class strings_cat(object): +class StringMethods(object): goal_time = 0.2 - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def 
time_strings_cat(self): - self.many.str.cat(sep=',') - def make_series(self, letters, strlen, size): return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - -class strings_center(object): - goal_time = 0.2 - def setup(self): self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) + self.s = self.make_series(string.ascii_uppercase, strlen=10, size=10000).str.join('|') - def time_strings_center(self): - self.many.str.center(100) - - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_contains_few(object): - goal_time = 0.2 + def time_cat(self): + self.many.str.cat(sep=',') - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) + def time_center(self): + self.many.str.center(100) - def time_strings_contains_few(self): + def time_contains_few(self): self.few.str.contains('matchthis') - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_contains_few_noregex(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_contains_few_noregex(self): + def time_contains_few_noregex(self): self.few.str.contains('matchthis', regex=False) - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_contains_many(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_contains_many(self): + def time_contains_many(self): self.many.str.contains('matchthis') - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_contains_many_noregex(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_contains_many_noregex(self): + def time_contains_many_noregex(self): self.many.str.contains('matchthis', regex=False) - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_count(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_count(self): + def 
time_count(self): self.many.str.count('matchthis') - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_encode_decode(object): - goal_time = 0.2 - - def setup(self): - self.ser = Series(testing.makeUnicodeIndex()) - - def time_strings_encode_decode(self): - self.ser.str.encode('utf-8').str.decode('utf-8') - - -class strings_endswith(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_endswith(self): + def time_endswith(self): self.many.str.endswith('matchthis') - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_extract(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_extract(self): + def time_extract(self): self.many.str.extract('(\\w*)matchthis(\\w*)') - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_findall(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_findall(self): + def time_findall(self): self.many.str.findall('[A-Z]+') - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_get(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_get(self): + def time_get(self): self.many.str.get(0) - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_get_dummies(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - self.s = self.make_series(string.ascii_uppercase, strlen=10, size=10000).str.join('|') - - def time_strings_get_dummies(self): - self.s.str.get_dummies('|') - - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_join_split(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_join_split(self): + def 
time_join_split(self): self.many.str.join('--').str.split('--') - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_join_split_expand(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_join_split_expand(self): + def time_join_split_expand(self): self.many.str.join('--').str.split('--', expand=True) - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_len(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_len(self): + def time_len(self): self.many.str.len() - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_lower(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_lower(self): - self.many.str.lower() - - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_lstrip(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_lstrip(self): - self.many.str.lstrip('matchthis') - - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_match(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_match(self): + def time_match(self): self.many.str.match('mat..this') - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_pad(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_pad(self): + def time_pad(self): self.many.str.pad(100, side='both') - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_repeat(object): - goal_time = 0.2 - - def setup(self): - self.many = 
self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_repeat(self): + def time_repeat(self): self.many.str.repeat(list(IT.islice(IT.cycle(range(1, 4)), len(self.many)))) - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_replace(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_replace(self): + def time_replace(self): self.many.str.replace('(matchthis)', '\x01\x01') - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_rstrip(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_rstrip(self): - self.many.str.rstrip('matchthis') - - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_slice(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_slice(self): + def time_slice(self): self.many.str.slice(5, 15, 2) - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_startswith(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_startswith(self): + def time_startswith(self): self.many.str.startswith('matchthis') - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) - - -class strings_strip(object): - goal_time = 0.2 - - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_strip(self): + def time_strip(self): self.many.str.strip('matchthis') - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) + def time_rstrip(self): + self.many.str.rstrip('matchthis') + def time_lstrip(self): + self.many.str.lstrip('matchthis') -class strings_title(object): - goal_time = 0.2 + def time_title(self): + self.many.str.title() - def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = 
self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) + def time_upper(self): + self.many.str.upper() - def time_strings_title(self): - self.many.str.title() + def time_lower(self): + self.many.str.lower() - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) + def time_get_dummies(self): + self.s.str.get_dummies('|') -class strings_upper(object): +class StringEncode(object): goal_time = 0.2 def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + (string.ascii_uppercase * 42)), strlen=19, size=10000) - - def time_strings_upper(self): - self.many.str.upper() + self.ser = Series(testing.makeUnicodeIndex()) - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) + def time_encode_decode(self): + self.ser.str.encode('utf-8').str.decode('utf-8') diff --git a/asv_bench/benchmarks/timedelta.py b/asv_bench/benchmarks/timedelta.py index 8470525dd01fa..c112d1ef72eb8 100644 --- a/asv_bench/benchmarks/timedelta.py +++ b/asv_bench/benchmarks/timedelta.py @@ -2,54 +2,36 @@ from pandas import to_timedelta, Timestamp -class timedelta_convert_int(object): +class ToTimedelta(object): goal_time = 0.2 def setup(self): self.arr = np.random.randint(0, 1000, size=10000) + self.arr2 = ['{0} days'.format(i) for i in self.arr] - def time_timedelta_convert_int(self): - to_timedelta(self.arr, unit='s') - + self.arr3 = np.random.randint(0, 60, size=10000) + self.arr3 = ['00:00:{0:02d}'.format(i) for i in self.arr3] -class timedelta_convert_string(object): - goal_time = 0.2 - - def setup(self): - self.arr = np.random.randint(0, 1000, size=10000) - self.arr = ['{0} days'.format(i) for i in self.arr] - - def time_timedelta_convert_string(self): - to_timedelta(self.arr) - - -class timedelta_convert_string_seconds(object): - goal_time = 0.2 - - def setup(self): - self.arr = np.random.randint(0, 60, size=10000) - self.arr = ['00:00:{0:02d}'.format(i) for i in self.arr] - - def time_timedelta_convert_string_seconds(self): - to_timedelta(self.arr) + self.arr4 = list(self.arr2) + self.arr4[-1] = 'apple' + def time_convert_int(self): + to_timedelta(self.arr, unit='s') -class timedelta_convert_bad_parse(object): - goal_time = 0.2 + def time_convert_string(self): + to_timedelta(self.arr2) - def setup(self): - self.arr = np.random.randint(0, 1000, size=10000) - self.arr = ['{0} days'.format(i) for i in self.arr] - self.arr[-1] = 'apple' + def time_convert_string_seconds(self): + to_timedelta(self.arr3) - def time_timedelta_convert_coerce(self): - to_timedelta(self.arr, errors='coerce') + def time_convert_coerce(self): + to_timedelta(self.arr4, errors='coerce') - def time_timedelta_convert_ignore(self): - to_timedelta(self.arr, errors='ignore') + def time_convert_ignore(self): + to_timedelta(self.arr4, errors='ignore') -class timedelta_add_overflow(object): +class Ops(object): goal_time = 0.2 def setup(self): diff --git a/asv_bench/benchmarks/timeseries.py b/asv_bench/benchmarks/timeseries.py index 8c00924cb07ef..f5ea4d7875931 100644 --- a/asv_bench/benchmarks/timeseries.py +++ b/asv_bench/benchmarks/timeseries.py @@ -1,7 +1,9 @@ -from pandas.tseries.converter import DatetimeConverter +try: + from pandas.plotting._converter import DatetimeConverter +except 
ImportError: + from pandas.tseries.converter import DatetimeConverter from .pandas_vb_common import * import pandas as pd -from datetime import timedelta import datetime as dt try: import pandas.tseries.holiday @@ -10,295 +12,211 @@ from pandas.tseries.frequencies import infer_freq import numpy as np - -class dataframe_resample_max_numpy(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.rng = date_range(start='20130101', periods=100000, freq='50L') - self.df = DataFrame(np.random.randn(100000, 2), index=self.rng) - - def time_dataframe_resample_max_numpy(self): - self.df.resample('1s', how=np.max) +if hasattr(Series, 'convert'): + Series.resample = Series.convert -class dataframe_resample_max_string(object): +class DatetimeIndex(object): goal_time = 0.2 def setup(self): self.N = 100000 self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.rng = date_range(start='20130101', periods=100000, freq='50L') - self.df = DataFrame(np.random.randn(100000, 2), index=self.rng) - - def time_dataframe_resample_max_string(self): - self.df.resample('1s', how='max') - + self.delta_offset = pd.offsets.Day() + self.fast_offset = pd.offsets.DateOffset(months=2, days=2) + self.slow_offset = pd.offsets.BusinessDay() -class dataframe_resample_mean_numpy(object): - goal_time = 0.2 + self.rng2 = date_range(start='1/1/2000 9:30', periods=10000, freq='S', tz='US/Eastern') - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.rng = date_range(start='20130101', periods=100000, freq='50L') - self.df = DataFrame(np.random.randn(100000, 2), index=self.rng) + self.index_repeated = date_range(start='1/1/2000', periods=1000, freq='T').repeat(10) - def time_dataframe_resample_mean_numpy(self): - self.df.resample('1s', how=np.mean) + self.rng3 = date_range(start='1/1/2000', periods=1000, freq='H') + self.df = DataFrame(np.random.randn(len(self.rng3), 2), self.rng3) + self.rng4 = date_range(start='1/1/2000', periods=1000, freq='H', tz='US/Eastern') + self.df2 = DataFrame(np.random.randn(len(self.rng4), 2), index=self.rng4) -class dataframe_resample_mean_string(object): - goal_time = 0.2 + N = 100000 + self.dti = pd.date_range('2011-01-01', freq='H', periods=N).repeat(5) + self.dti_tz = pd.date_range('2011-01-01', freq='H', periods=N, + tz='Asia/Tokyo').repeat(5) - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.rng = date_range(start='20130101', periods=100000, freq='50L') - self.df = DataFrame(np.random.randn(100000, 2), index=self.rng) + self.rng5 = date_range(start='1/1/2000', end='3/1/2000', tz='US/Eastern') - def time_dataframe_resample_mean_string(self): - self.df.resample('1s', how='mean') + self.dst_rng = date_range(start='10/29/2000 1:00:00', end='10/29/2000 1:59:59', freq='S') + self.index = date_range(start='10/29/2000', end='10/29/2000 00:59:59', freq='S') + self.index = 
self.index.append(self.dst_rng) + self.index = self.index.append(self.dst_rng) + self.index = self.index.append(date_range(start='10/29/2000 2:00:00', end='10/29/2000 3:00:00', freq='S')) + self.N = 10000 + self.rng6 = date_range(start='1/1/1', periods=self.N, freq='B') -class dataframe_resample_min_numpy(object): - goal_time = 0.2 + self.rng7 = date_range(start='1/1/1700', freq='D', periods=100000) + self.a = self.rng7[:50000].append(self.rng7[50002:]) - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.rng = date_range(start='20130101', periods=100000, freq='50L') - self.df = DataFrame(np.random.randn(100000, 2), index=self.rng) + def time_add_timedelta(self): + (self.rng + dt.timedelta(minutes=2)) - def time_dataframe_resample_min_numpy(self): - self.df.resample('1s', how=np.min) + def time_add_offset_delta(self): + (self.rng + self.delta_offset) + def time_add_offset_fast(self): + (self.rng + self.fast_offset) -class dataframe_resample_min_string(object): - goal_time = 0.2 + def time_add_offset_slow(self): + (self.rng + self.slow_offset) - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.rng = date_range(start='20130101', periods=100000, freq='50L') - self.df = DataFrame(np.random.randn(100000, 2), index=self.rng) + def time_normalize(self): + self.rng2.normalize() - def time_dataframe_resample_min_string(self): - self.df.resample('1s', how='min') + def time_unique(self): + self.index_repeated.unique() + def time_reset_index(self): + self.df.reset_index() -class datetimeindex_add_offset(object): - goal_time = 0.2 + def time_reset_index_tz(self): + self.df2.reset_index() - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.rng = date_range(start='1/1/2000', periods=10000, freq='T') + def time_dti_factorize(self): + self.dti.factorize() - def time_datetimeindex_add_offset(self): - (self.rng + timedelta(minutes=2)) + def time_dti_tz_factorize(self): + self.dti_tz.factorize() + def time_timestamp_tzinfo_cons(self): + self.rng5[0] -class datetimeindex_converter(object): - goal_time = 0.2 + def time_infer_dst(self): + self.index.tz_localize('US/Eastern', infer_dst=True) - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) + def time_timeseries_is_month_start(self): + self.rng6.is_month_start - def time_datetimeindex_converter(self): - DatetimeConverter.convert(self.rng, None, None) + def time_infer_freq(self): + infer_freq(self.a) -class datetimeindex_infer_dst(object): +class TimeDatetimeConverter(object): goal_time = 0.2 def setup(self): self.N = 100000 self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.dst_rng = date_range(start='10/29/2000 1:00:00', end='10/29/2000 1:59:59', freq='S') - self.index = 
date_range(start='10/29/2000', end='10/29/2000 00:59:59', freq='S') - self.index = self.index.append(self.dst_rng) - self.index = self.index.append(self.dst_rng) - self.index = self.index.append(date_range(start='10/29/2000 2:00:00', end='10/29/2000 3:00:00', freq='S')) - def time_datetimeindex_infer_dst(self): - self.index.tz_localize('US/Eastern', infer_dst=True) + def time_convert(self): + DatetimeConverter.convert(self.rng, None, None) -class datetimeindex_normalize(object): +class Iteration(object): goal_time = 0.2 def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.rng = date_range(start='1/1/2000 9:30', periods=10000, freq='S', tz='US/Eastern') - - def time_datetimeindex_normalize(self): - self.rng.normalize() - - -class datetimeindex_unique(object): - goal_time = 0.2 + self.N = 1000000 + self.M = 10000 + self.idx1 = date_range(start='20140101', freq='T', periods=self.N) + self.idx2 = period_range(start='20140101', freq='T', periods=self.N) - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.rng = date_range(start='1/1/2000', periods=1000, freq='T') - self.index = self.rng.repeat(10) + def iter_n(self, iterable, n=None): + self.i = 0 + for _ in iterable: + self.i += 1 + if ((n is not None) and (self.i > n)): + break - def time_datetimeindex_unique(self): - self.index.unique() + def time_iter_datetimeindex(self): + self.iter_n(self.idx1) + def time_iter_datetimeindex_preexit(self): + self.iter_n(self.idx1, self.M) -class dti_reset_index(object): - goal_time = 0.2 + def time_iter_periodindex(self): + self.iter_n(self.idx2) - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.rng = date_range(start='1/1/2000', periods=1000, freq='H') - self.df = DataFrame(np.random.randn(len(self.rng), 2), self.rng) + def time_iter_periodindex_preexit(self): + self.iter_n(self.idx2, self.M) - def time_dti_reset_index(self): - self.df.reset_index() +#---------------------------------------------------------------------- +# Resampling -class dti_reset_index_tz(object): +class ResampleDataFrame(object): goal_time = 0.2 def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.rng = date_range(start='1/1/2000', periods=1000, freq='H', tz='US/Eastern') - self.df = DataFrame(np.random.randn(len(self.rng), 2), index=self.rng) + self.rng = date_range(start='20130101', periods=100000, freq='50L') + self.df = DataFrame(np.random.randn(100000, 2), index=self.rng) - def time_dti_reset_index_tz(self): - self.df.reset_index() + def time_max_numpy(self): + self.df.resample('1s', how=np.max) + def time_max_string(self): + self.df.resample('1s', how='max') -class datetime_algorithm(object): - goal_time = 0.2 + def time_mean_numpy(self): + self.df.resample('1s', how=np.mean) - def setup(self): - N = 100000 - self.dti = pd.date_range('2011-01-01', freq='H', periods=N).repeat(5) - self.dti_tz = 
pd.date_range('2011-01-01', freq='H', periods=N, - tz='Asia/Tokyo').repeat(5) + def time_mean_string(self): + self.df.resample('1s', how='mean') - def time_dti_factorize(self): - self.dti.factorize() + def time_min_numpy(self): + self.df.resample('1s', how=np.min) - def time_dti_tz_factorize(self): - self.dti_tz.factorize() + def time_min_string(self): + self.df.resample('1s', how='min') -class timeseries_1min_5min_mean(object): +class ResampleSeries(object): goal_time = 0.2 def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - - def time_timeseries_1min_5min_mean(self): - self.ts[:10000].resample('5min', how='mean') + self.rng1 = period_range(start='1/1/2000', end='1/1/2001', freq='T') + self.ts1 = Series(np.random.randn(len(self.rng1)), index=self.rng1) + self.rng2 = date_range(start='1/1/2000', end='1/1/2001', freq='T') + self.ts2 = Series(np.random.randn(len(self.rng2)), index=self.rng2) -class timeseries_1min_5min_ohlc(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) + self.rng3 = date_range(start='2000-01-01 00:00:00', end='2000-01-01 10:00:00', freq='555000U') + self.int_ts = Series(5, self.rng3, dtype='int64') + self.dt_ts = self.int_ts.astype('datetime64[ns]') - def time_timeseries_1min_5min_ohlc(self): - self.ts[:10000].resample('5min', how='ohlc') + def time_period_downsample_mean(self): + self.ts1.resample('D', how='mean') + def time_timestamp_downsample_mean(self): + self.ts2.resample('D', how='mean') -class timeseries_add_irregular(object): - goal_time = 0.2 + def time_resample_datetime64(self): + # GH 7754 + self.dt_ts.resample('1S', how='last') - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.lindex = np.random.permutation(self.N)[:(self.N // 2)] - self.rindex = np.random.permutation(self.N)[:(self.N // 2)] - self.left = Series(self.ts.values.take(self.lindex), index=self.ts.index.take(self.lindex)) - self.right = Series(self.ts.values.take(self.rindex), index=self.ts.index.take(self.rindex)) + def time_1min_5min_mean(self): + self.ts2[:10000].resample('5min', how='mean') - def time_timeseries_add_irregular(self): - (self.left + self.right) + def time_1min_5min_ohlc(self): + self.ts2[:10000].resample('5min', how='ohlc') -class timeseries_asof(object): +class AsOf(object): goal_time = 0.2 def setup(self): self.N = 10000 self.rng = date_range(start='1/1/1990', periods=self.N, freq='53s') - self.dates = date_range(start='1/1/1990', periods=(self.N * 10), freq='5s') self.ts = Series(np.random.randn(self.N), index=self.rng) + self.dates = date_range(start='1/1/1990', periods=(self.N * 10), freq='5s') self.ts2 = self.ts.copy() self.ts2[250:5000] = np.nan self.ts3 = self.ts.copy() self.ts3[-5000:] = np.nan # test speed of pre-computing NAs. - def time_asof_list(self): + def time_asof(self): self.ts.asof(self.dates) # should be roughly the same as above. 
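    # Series.asof(where) returns the last non-NaN value at or before each
    # requested timestamp, which is what these benchmarks exercise. A
    # minimal sketch, assuming `import pandas as pd` and `import numpy as np`
    # (illustrative, not part of the benchmark itself):
    #
    #   s = pd.Series([1.0, np.nan, 3.0],
    #                 index=pd.date_range('2000-01-01', periods=3))
    #   s.asof(s.index[1])  # -> 1.0; the NaN at that timestamp is skipped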
- def time_asof_nan_list(self): + def time_asof_nan(self): self.ts2.asof(self.dates) # test speed of the code path for a scalar index @@ -318,7 +236,7 @@ def time_asof_nan_single(self): self.ts3.asof(self.dates[-1]) -class timeseries_dataframe_asof(object): +class AsOfDataFrame(object): goal_time = 0.2 def setup(self): @@ -333,11 +251,11 @@ def setup(self): self.ts3.iloc[-5000:] = np.nan # test speed of pre-computing NAs. - def time_asof_list(self): + def time_asof(self): self.ts.asof(self.dates) # should be roughly the same as above. - def time_asof_nan_list(self): + def time_asof_nan(self): self.ts2.asof(self.dates) # test speed of the code path for a scalar index @@ -356,84 +274,108 @@ def time_asof_single_early(self): self.ts.asof(self.dates[0] - dt.timedelta(10)) -class timeseries_custom_bday_apply(object): +class TimeSeries(object): goal_time = 0.2 def setup(self): self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert + self.rng = date_range(start='1/1/2000', periods=self.N, freq='s') + self.rng = self.rng.take(np.random.permutation(self.N)) self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) - def time_timeseries_custom_bday_apply(self): - self.cday.apply(self.date) + self.rng2 = date_range(start='1/1/2000', periods=self.N, freq='T') + self.ts2 = Series(np.random.randn(self.N), index=self.rng2) + + self.lindex = np.random.permutation(self.N)[:(self.N // 2)] + self.rindex = np.random.permutation(self.N)[:(self.N // 2)] + self.left = Series(self.ts2.values.take(self.lindex), index=self.ts2.index.take(self.lindex)) + self.right = Series(self.ts2.values.take(self.rindex), index=self.ts2.index.take(self.rindex)) + + self.rng3 = date_range(start='1/1/2000', periods=1500000, freq='S') + self.ts3 = Series(1, index=self.rng3) + + def time_sort_index_monotonic(self): + self.ts2.sort_index() + + def time_sort_index_non_monotonic(self): + self.ts.sort_index() + def time_timeseries_slice_minutely(self): + self.ts2[:10000] + + def time_add_irregular(self): + (self.left + self.right) + + def time_large_lookup_value(self): + self.ts3[self.ts3.index[(len(self.ts3) // 2)]] + self.ts3.index._cleanup() -class timeseries_custom_bday_apply_dt64(object): + +class SeriesArithmetic(object): goal_time = 0.2 def setup(self): self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) + self.s = Series(date_range(start='20140101', freq='T', periods=self.N)) + 
self.delta_offset = pd.offsets.Day() + self.fast_offset = pd.offsets.DateOffset(months=2, days=2) + self.slow_offset = pd.offsets.BusinessDay() - def time_timeseries_custom_bday_apply_dt64(self): - self.cday.apply(self.dt64) + def time_add_offset_delta(self): + (self.s + self.delta_offset) + + def time_add_offset_fast(self): + (self.s + self.fast_offset) + + def time_add_offset_slow(self): + (self.s + self.slow_offset) -class timeseries_custom_bday_cal_decr(object): +class ToDatetime(object): goal_time = 0.2 def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) + self.rng = date_range(start='1/1/2000', periods=10000, freq='D') + self.stringsD = Series((((self.rng.year * 10000) + (self.rng.month * 100)) + self.rng.day), dtype=np.int64).apply(str) + + self.rng = date_range(start='1/1/2000', periods=20000, freq='H') + self.strings = [x.strftime('%Y-%m-%d %H:%M:%S') for x in self.rng] + self.strings_nosep = [x.strftime('%Y%m%d %H:%M:%S') for x in self.rng] + self.strings_tz_space = [x.strftime('%Y-%m-%d %H:%M:%S') + ' -0800' + for x in self.rng] + + self.s = Series((['19MAY11', '19MAY11:00:00:00'] * 100000)) + self.s2 = self.s.str.replace(':\\S+$', '') + + def time_format_YYYYMMDD(self): + to_datetime(self.stringsD, format='%Y%m%d') + + def time_iso8601(self): + to_datetime(self.strings) + + def time_iso8601_nosep(self): + to_datetime(self.strings_nosep) + + def time_iso8601_format(self): + to_datetime(self.strings, format='%Y-%m-%d %H:%M:%S') + + def time_iso8601_format_no_sep(self): + to_datetime(self.strings_nosep, format='%Y%m%d %H:%M:%S') - def time_timeseries_custom_bday_cal_decr(self): - (self.date - (1 * self.cdayh)) + def time_iso8601_tz_spaceformat(self): + to_datetime(self.strings_tz_space) + + def time_format_exact(self): + to_datetime(self.s2, format='%d%b%y') + + def time_format_no_exact(self): + to_datetime(self.s, format='%d%b%y', exact=False) -class timeseries_custom_bday_cal_incr(object): +class Offsets(object): goal_time = 0.2 def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) self.date = dt.datetime(2011, 1, 1) self.dt64 = np.datetime64('2011-01-01 09:00Z') self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() @@ -444,741 +386,63 @@ def setup(self): self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) - def time_timeseries_custom_bday_cal_incr(self): - (self.date + (1 * self.cdayh)) + def time_timeseries_day_apply(self): + self.day.apply(self.date) + def time_timeseries_day_incr(self): + (self.date + self.day) -class timeseries_custom_bday_cal_incr_n(object): - goal_time = 0.2 + def time_timeseries_year_apply(self): + self.year.apply(self.date) - def setup(self): - self.N = 100000 - self.rng = 
date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) - - def time_timeseries_custom_bday_cal_incr_n(self): - (self.date + (10 * self.cdayh)) - - -class timeseries_custom_bday_cal_incr_neg_n(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) - - def time_timeseries_custom_bday_cal_incr_neg_n(self): - (self.date - (10 * self.cdayh)) - - -class timeseries_custom_bday_decr(object): - goal_time = 0.2 + def time_timeseries_year_incr(self): + (self.date + self.year) - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) + # custom business offsets - def time_timeseries_custom_bday_decr(self): + def time_custom_bday_decr(self): (self.date - self.cday) - -class timeseries_custom_bday_incr(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) - - def time_timeseries_custom_bday_incr(self): + def time_custom_bday_incr(self): (self.date + self.cday) + def time_custom_bday_apply(self): + self.cday.apply(self.date) -class timeseries_custom_bmonthbegin_decr_n(object): - goal_time = 0.2 - - 
def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) - - def time_timeseries_custom_bmonthbegin_decr_n(self): - (self.date - (10 * self.cmb)) - - -class timeseries_custom_bmonthbegin_incr_n(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) - - def time_timeseries_custom_bmonthbegin_incr_n(self): - (self.date + (10 * self.cmb)) - - -class timeseries_custom_bmonthend_decr_n(object): - goal_time = 0.2 + def time_custom_bday_apply_dt64(self): + self.cday.apply(self.dt64) - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) + def time_custom_bday_cal_incr(self): + self.date + 1 * self.cdayh - def time_timeseries_custom_bmonthend_decr_n(self): - (self.date - (10 * self.cme)) + def time_custom_bday_cal_decr(self): + self.date - 1 * self.cdayh + def time_custom_bday_cal_incr_n(self): + self.date + 10 * self.cdayh -class timeseries_custom_bmonthend_incr(object): - goal_time = 0.2 + def time_custom_bday_cal_incr_neg_n(self): + self.date - 10 * self.cdayh - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = 
pd.offsets.CustomBusinessDay(calendar=self.hcal) + # Increment custom business month - def time_timeseries_custom_bmonthend_incr(self): + def time_custom_bmonthend_incr(self): (self.date + self.cme) - -class timeseries_custom_bmonthend_incr_n(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) - - def time_timeseries_custom_bmonthend_incr_n(self): + def time_custom_bmonthend_incr_n(self): (self.date + (10 * self.cme)) + def time_custom_bmonthend_decr_n(self): + (self.date - (10 * self.cme)) -class timeseries_datetimeindex_offset_delta(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.N = 100000 - self.idx1 = date_range(start='20140101', freq='T', periods=self.N) - self.delta_offset = pd.offsets.Day() - self.fast_offset = pd.offsets.DateOffset(months=2, days=2) - self.slow_offset = pd.offsets.BusinessDay() - - def time_timeseries_datetimeindex_offset_delta(self): - (self.idx1 + self.delta_offset) - - -class timeseries_datetimeindex_offset_fast(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.N = 100000 - self.idx1 = date_range(start='20140101', freq='T', periods=self.N) - self.delta_offset = pd.offsets.Day() - self.fast_offset = pd.offsets.DateOffset(months=2, days=2) - self.slow_offset = pd.offsets.BusinessDay() - - def time_timeseries_datetimeindex_offset_fast(self): - (self.idx1 + self.fast_offset) - - -class timeseries_datetimeindex_offset_slow(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.N = 100000 - self.idx1 = date_range(start='20140101', freq='T', periods=self.N) - self.delta_offset = pd.offsets.Day() - self.fast_offset = pd.offsets.DateOffset(months=2, days=2) - self.slow_offset = pd.offsets.BusinessDay() - - def time_timeseries_datetimeindex_offset_slow(self): - (self.idx1 + self.slow_offset) - - -class timeseries_day_apply(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = 
pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) - - def time_timeseries_day_apply(self): - self.day.apply(self.date) - - -class timeseries_day_incr(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) - - def time_timeseries_day_incr(self): - (self.date + self.day) - - -class timeseries_infer_freq(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.rng = date_range(start='1/1/1700', freq='D', periods=100000) - self.a = self.rng[:50000].append(self.rng[50002:]) - - def time_timeseries_infer_freq(self): - infer_freq(self.a) - - -class timeseries_is_month_start(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.N = 10000 - self.rng = date_range(start='1/1/1', periods=self.N, freq='B') - - def time_timeseries_is_month_start(self): - self.rng.is_month_start - + def time_custom_bmonthbegin_decr_n(self): + (self.date - (10 * self.cmb)) -class timeseries_iter_datetimeindex(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.N = 1000000 - self.M = 10000 - self.idx1 = date_range(start='20140101', freq='T', periods=self.N) - self.idx2 = period_range(start='20140101', freq='T', periods=self.N) - - def time_timeseries_iter_datetimeindex(self): - self.iter_n(self.idx1) - - def iter_n(self, iterable, n=None): - self.i = 0 - for _ in iterable: - self.i += 1 - if ((n is not None) and (self.i > n)): - break - - -class timeseries_iter_datetimeindex_preexit(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - if hasattr(Series, 'convert'): - Series.resample = Series.convert - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.N = 1000000 - self.M = 10000 - self.idx1 = date_range(start='20140101', freq='T', periods=self.N) - self.idx2 = period_range(start='20140101', freq='T', periods=self.N) - - def time_timeseries_iter_datetimeindex_preexit(self): - self.iter_n(self.idx1, self.M) - - def iter_n(self, iterable, n=None): - self.i = 0 - for _ in iterable: - self.i += 1 - if ((n is not None) and (self.i > 
n)):
-                break
-
-
-class timeseries_iter_periodindex(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.N = 1000000
-        self.M = 10000
-        self.idx1 = date_range(start='20140101', freq='T', periods=self.N)
-        self.idx2 = period_range(start='20140101', freq='T', periods=self.N)
-
-    def time_timeseries_iter_periodindex(self):
-        self.iter_n(self.idx2)
-
-    def iter_n(self, iterable, n=None):
-        self.i = 0
-        for _ in iterable:
-            self.i += 1
-            if ((n is not None) and (self.i > n)):
-                break
-
-
-class timeseries_iter_periodindex_preexit(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.N = 1000000
-        self.M = 10000
-        self.idx1 = date_range(start='20140101', freq='T', periods=self.N)
-        self.idx2 = period_range(start='20140101', freq='T', periods=self.N)
-
-    def time_timeseries_iter_periodindex_preexit(self):
-        self.iter_n(self.idx2, self.M)
-
-    def iter_n(self, iterable, n=None):
-        self.i = 0
-        for _ in iterable:
-            self.i += 1
-            if ((n is not None) and (self.i > n)):
-                break
-
-
-class timeseries_large_lookup_value(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.rng = date_range(start='1/1/2000', periods=1500000, freq='S')
-        self.ts = Series(1, index=self.rng)
-
-    def time_timeseries_large_lookup_value(self):
-        self.ts[self.ts.index[(len(self.ts) // 2)]]
-        self.ts.index._cleanup()
-
-
-class timeseries_period_downsample_mean(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.rng = period_range(start='1/1/2000', end='1/1/2001', freq='T')
-        self.ts = Series(np.random.randn(len(self.rng)), index=self.rng)
-
-    def time_timeseries_period_downsample_mean(self):
-        self.ts.resample('D', how='mean')
-
-
-class timeseries_resample_datetime64(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.rng = date_range(start='2000-01-01 00:00:00', end='2000-01-01 10:00:00', freq='555000U')
-        self.int_ts = Series(5, self.rng, dtype='int64')
-        self.ts = self.int_ts.astype('datetime64[ns]')
-
-    def time_timeseries_resample_datetime64(self):
-        self.ts.resample('1S', how='last')
-
-
-class timeseries_series_offset_delta(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.N = 100000
-        self.s = Series(date_range(start='20140101', freq='T', periods=self.N))
-        self.delta_offset = pd.offsets.Day()
-        self.fast_offset = pd.offsets.DateOffset(months=2, days=2)
-        self.slow_offset = pd.offsets.BusinessDay()
-
-    def time_timeseries_series_offset_delta(self):
-        (self.s + self.delta_offset)
-
-
-class timeseries_series_offset_fast(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.N = 100000
-        self.s = Series(date_range(start='20140101', freq='T', periods=self.N))
-        self.delta_offset = pd.offsets.Day()
-        self.fast_offset = pd.offsets.DateOffset(months=2, days=2)
-        self.slow_offset = pd.offsets.BusinessDay()
-
-    def time_timeseries_series_offset_fast(self):
-        (self.s + self.fast_offset)
-
-
-class timeseries_series_offset_slow(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.N = 100000
-        self.s = Series(date_range(start='20140101', freq='T', periods=self.N))
-        self.delta_offset = pd.offsets.Day()
-        self.fast_offset = pd.offsets.DateOffset(months=2, days=2)
-        self.slow_offset = pd.offsets.BusinessDay()
-
-    def time_timeseries_series_offset_slow(self):
-        (self.s + self.slow_offset)
-
-
-class timeseries_slice_minutely(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-
-    def time_timeseries_slice_minutely(self):
-        self.ts[:10000]
-
-
-class timeseries_sort_index(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='s')
-        self.rng = self.rng.take(np.random.permutation(self.N))
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-
-    def time_timeseries_sort_index(self):
-        self.ts.sort_index()
-
-
-class timeseries_timestamp_downsample_mean(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.rng = date_range(start='1/1/2000', end='1/1/2001', freq='T')
-        self.ts = Series(np.random.randn(len(self.rng)), index=self.rng)
-
-    def time_timeseries_timestamp_downsample_mean(self):
-        self.ts.resample('D', how='mean')
-
-
-class timeseries_timestamp_tzinfo_cons(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.rng = date_range(start='1/1/2000', end='3/1/2000', tz='US/Eastern')
-
-    def time_timeseries_timestamp_tzinfo_cons(self):
-        self.rng[0]
-
-
-class timeseries_to_datetime_YYYYMMDD(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.rng = date_range(start='1/1/2000', periods=10000, freq='D')
-        self.strings = Series((((self.rng.year * 10000) + (self.rng.month * 100)) + self.rng.day), dtype=np.int64).apply(str)
-
-    def time_timeseries_to_datetime_YYYYMMDD(self):
-        to_datetime(self.strings, format='%Y%m%d')
-
-
-class timeseries_to_datetime_iso8601(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.rng = date_range(start='1/1/2000', periods=20000, freq='H')
-        self.strings = [x.strftime('%Y-%m-%d %H:%M:%S') for x in self.rng]
-        self.strings_nosep = [x.strftime('%Y%m%d %H:%M:%S') for x in self.rng]
-        self.strings_tz_space = [x.strftime('%Y-%m-%d %H:%M:%S') + ' -0800'
-                                 for x in self.rng]
-
-    def time_timeseries_to_datetime_iso8601(self):
-        to_datetime(self.strings)
-
-    def time_timeseries_to_datetime_iso8601_nosep(self):
-        to_datetime(self.strings_nosep)
-
-    def time_timeseries_to_datetime_iso8601_format(self):
-        to_datetime(self.strings, format='%Y-%m-%d %H:%M:%S')
-
-    def time_timeseries_to_datetime_iso8601_format_no_sep(self):
-        to_datetime(self.strings_nosep, format='%Y%m%d %H:%M:%S')
-
-    def time_timeseries_to_datetime_iso8601_tz_spaceformat(self):
-        to_datetime(self.strings_tz_space)
-
-
-class timeseries_with_format_no_exact(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.s = Series((['19MAY11', '19MAY11:00:00:00'] * 100000))
-
-    def time_timeseries_with_format_no_exact(self):
-        to_datetime(self.s, format='%d%b%y', exact=False)
-
-
-class timeseries_with_format_replace(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.s = Series((['19MAY11', '19MAY11:00:00:00'] * 100000))
-
-    def time_timeseries_with_format_replace(self):
-        to_datetime(self.s.str.replace(':\\S+$', ''), format='%d%b%y')
-
-
-class timeseries_year_apply(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.date = dt.datetime(2011, 1, 1)
-        self.dt64 = np.datetime64('2011-01-01 09:00Z')
-        self.hcal = pd.tseries.holiday.USFederalHolidayCalendar()
-        self.day = pd.offsets.Day()
-        self.year = pd.offsets.YearBegin()
-        self.cday = pd.offsets.CustomBusinessDay()
-        self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal)
-        self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal)
-        self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal)
-
-    def time_timeseries_year_apply(self):
-        self.year.apply(self.date)
-
-
-class timeseries_year_incr(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 100000
-        self.rng = date_range(start='1/1/2000', periods=self.N, freq='T')
-        if hasattr(Series, 'convert'):
-            Series.resample = Series.convert
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-        self.date = dt.datetime(2011, 1, 1)
-        self.dt64 = np.datetime64('2011-01-01 09:00Z')
-        self.hcal = pd.tseries.holiday.USFederalHolidayCalendar()
-        self.day = pd.offsets.Day()
-        self.year = pd.offsets.YearBegin()
-        self.cday = pd.offsets.CustomBusinessDay()
-        self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal)
-        self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal)
-        self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal)
-
-    def time_timeseries_year_incr(self):
-        (self.date + self.year)
+    def time_custom_bmonthbegin_incr_n(self):
+        (self.date + (10 * self.cmb))


-class timeseries_semi_month_offset(object):
+class SemiMonthOffset(object):
     goal_time = 0.2

     def setup(self):
@@ -1189,50 +453,50 @@ def setup(self):
         self.semi_month_end = pd.offsets.SemiMonthEnd()
         self.semi_month_begin = pd.offsets.SemiMonthBegin()

-    def time_semi_month_end_apply(self):
+    def time_end_apply(self):
         self.semi_month_end.apply(self.date)

-    def time_semi_month_end_incr(self):
+    def time_end_incr(self):
         self.date + self.semi_month_end

-    def time_semi_month_end_incr_n(self):
+    def time_end_incr_n(self):
         self.date + 10 * self.semi_month_end

-    def time_semi_month_end_decr(self):
+    def time_end_decr(self):
         self.date - self.semi_month_end

-    def time_semi_month_end_decr_n(self):
+    def time_end_decr_n(self):
         self.date - 10 * self.semi_month_end

-    def time_semi_month_end_apply_index(self):
+    def time_end_apply_index(self):
         self.semi_month_end.apply_index(self.rng)

-    def time_semi_month_end_incr_rng(self):
+    def time_end_incr_rng(self):
         self.rng + self.semi_month_end

-    def time_semi_month_end_decr_rng(self):
+    def time_end_decr_rng(self):
         self.rng - self.semi_month_end

-    def time_semi_month_begin_apply(self):
+    def time_begin_apply(self):
         self.semi_month_begin.apply(self.date)

-    def time_semi_month_begin_incr(self):
+    def time_begin_incr(self):
         self.date + self.semi_month_begin

-    def time_semi_month_begin_incr_n(self):
+    def time_begin_incr_n(self):
         self.date + 10 * self.semi_month_begin

-    def time_semi_month_begin_decr(self):
+    def time_begin_decr(self):
         self.date - self.semi_month_begin

-    def time_semi_month_begin_decr_n(self):
+    def time_begin_decr_n(self):
         self.date - 10 * self.semi_month_begin

-    def time_semi_month_begin_apply_index(self):
+    def time_begin_apply_index(self):
         self.semi_month_begin.apply_index(self.rng)

-    def time_semi_month_begin_incr_rng(self):
+    def time_begin_incr_rng(self):
         self.rng + self.semi_month_begin

-    def time_semi_month_begin_decr_rng(self):
+    def time_begin_decr_rng(self):
         self.rng - self.semi_month_begin
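The benchmark classes above (and their renamed successors such as `SemiMonthOffset`) follow airspeed velocity (asv) conventions: asv instantiates the class, runs `setup()` before measuring, and times every method whose name starts with `time_`, aiming for roughly `goal_time` seconds per measurement. A minimal sketch of the pattern, reusing the same pandas calls as the deleted code:

```python
import numpy as np
from pandas import Series, date_range


class ExampleTimeseriesBenchmark(object):
    # asv targets roughly this many seconds per measurement
    goal_time = 0.2

    def setup(self):
        # run by asv before timing, so construction cost is not measured
        self.ts = Series(np.random.randn(100000),
                         index=date_range(start='1/1/2000',
                                          periods=100000, freq='T'))

    def time_slice_minutely(self):
        # every `time_*` method becomes one tracked benchmark
        self.ts[:10000]
```

Consolidating the sixteen `time_semi_month_*` methods onto one `SemiMonthOffset` class, as this hunk does, keeps the shared `setup()` in a single place instead of repeating it per benchmark.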
diff --git a/ci/appveyor.recipe/bld.bat b/ci/appveyor.recipe/bld.bat
deleted file mode 100644
index 284926fae8c04..0000000000000
--- a/ci/appveyor.recipe/bld.bat
+++ /dev/null
@@ -1,2 +0,0 @@
-@echo off
-%PYTHON% setup.py install
diff --git a/ci/appveyor.recipe/build.sh b/ci/appveyor.recipe/build.sh
deleted file mode 100644
index f341bce6fcf96..0000000000000
--- a/ci/appveyor.recipe/build.sh
+++ /dev/null
@@ -1,2 +0,0 @@
-#!/bin/sh
-$PYTHON setup.py install
diff --git a/ci/appveyor.recipe/meta.yaml b/ci/appveyor.recipe/meta.yaml
deleted file mode 100644
index 6bf0a14bc7d0b..0000000000000
--- a/ci/appveyor.recipe/meta.yaml
+++ /dev/null
@@ -1,37 +0,0 @@
-package:
-  name: pandas
-  version: 0.18.1
-
-build:
-  number: {{environ.get('APPVEYOR_BUILD_NUMBER', 0)}}    # [win]
-  string: np{{ environ.get('CONDA_NPY') }}py{{ environ.get('CONDA_PY') }}_{{ environ.get('APPVEYOR_BUILD_NUMBER', 0) }}    # [win]
-
-source:
-
-  # conda-build needs a full clone
-  # rather than a shallow git_url type clone
-  # https://github.com/conda/conda-build/issues/780
-  path: ../../
-
-requirements:
-  build:
-    - python
-    - cython
-    - numpy x.x
-    - setuptools
-    - pytz
-    - python-dateutil
-
-  run:
-    - python
-    - numpy x.x
-    - python-dateutil
-    - pytz
-
-test:
-  imports:
-    - pandas
-
-about:
-  home: http://pandas.pydata.org
-  license: BSD
diff --git a/ci/build_docs.sh b/ci/build_docs.sh
index 8594cd4af34e5..1356d097025c9 100755
--- a/ci/build_docs.sh
+++ b/ci/build_docs.sh
@@ -17,14 +17,11 @@ if [ "$?" != "0" ]; then
 fi

-if [ x"$DOC_BUILD" != x"" ]; then
+if [ "$DOC" ]; then

     echo "Will build docs"

     source activate pandas
-    conda install -n pandas -c r r rpy2 --yes
-
-    time sudo apt-get $APT_ARGS install dvipng texlive-latex-base texlive-latex-extra

     mv "$TRAVIS_BUILD_DIR"/doc /tmp
     cd /tmp/doc
@@ -43,7 +40,9 @@ if [ x"$DOC_BUILD" != x"" ]; then
     cd /tmp/doc/build/html
     git config --global user.email "pandas-docs-bot@localhost.foo"
     git config --global user.name "pandas-docs-bot"
+    git config --global credential.helper cache

+    # create the repo
     git init
     touch README
     git add README
@@ -53,7 +52,8 @@ if [ x"$DOC_BUILD" != x"" ]; then
     touch .nojekyll
     git add --all .
     git commit -m "Version" --allow-empty
-    git remote add origin https://$GH_TOKEN@github.com/pandas-docs/pandas-docs-travis
+    git remote remove origin
+    git remote add origin "https://${PANDAS_GH_TOKEN}@github.com/pandas-docs/pandas-docs-travis.git"
     git push origin gh-pages -f
 fi
diff --git a/ci/check_cache.sh b/ci/check_cache.sh
index cd7a6e8f6b6f9..b83144fc45ef4 100755
--- a/ci/check_cache.sh
+++ b/ci/check_cache.sh
@@ -1,5 +1,9 @@
 #!/bin/bash

+# currently not used
+# script to make sure that cache is clean
+# Travis CI now handles this
+
 if [ "$TRAVIS_PULL_REQUEST" == "false" ]
 then
     echo "Not a PR: checking for changes in ci/ from last 2 commits"
@@ -12,14 +16,12 @@ else
     ci_changes=$(git diff PR_HEAD~2 --numstat | grep -E "ci/"| wc -l)
 fi

-MINICONDA_DIR="$HOME/miniconda/"
 CACHE_DIR="$HOME/.cache/"
 CCACHE_DIR="$HOME/.ccache/"

 if [ $ci_changes -ne 0 ]
 then
     echo "Files have changed in ci/ deleting all caches"
-    rm -rf "$MINICONDA_DIR"
     rm -rf "$CACHE_DIR"
     rm -rf "$CCACHE_DIR"
-fi
\ No newline at end of file
+fi
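`ci/check_cache.sh` keys its decision on whether any file under `ci/` changed in the last two commits. A rough Python equivalent of that check, for illustration only (the shell pipeline `git diff HEAD~2 --numstat | grep -E "ci/" | wc -l` remains the authoritative version):

```python
import subprocess


def ci_files_changed(base='HEAD~2'):
    # --numstat prints one "added  deleted  path" line per changed file;
    # counting the lines that mention ci/ mirrors the grep | wc -l pipeline.
    out = subprocess.check_output(['git', 'diff', base, '--numstat'],
                                  universal_newlines=True)
    return sum(1 for line in out.splitlines() if 'ci/' in line)


if ci_files_changed():
    print('Files have changed in ci/ - caches would be deleted')
```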
diff --git a/ci/install-3.6_DEV.sh b/ci/install-3.6_DEV.sh
deleted file mode 100644
index 0b95f1cd45cad..0000000000000
--- a/ci/install-3.6_DEV.sh
+++ /dev/null
@@ -1,16 +0,0 @@
-#!/bin/bash
-
-echo "install 3.6 dev"
-
-conda config --set add_pip_as_python_dependency false
-conda create -n pandas python=3.6 -c conda-forge/label/prerelease
-
-source activate pandas
-
-# ensure we have pip
-python -m ensurepip
-
-# install deps
-pip3.6 install nose cython numpy pytz python-dateutil
-
-true
diff --git a/ci/install.ps1 b/ci/install.ps1
index 16c92dc76d273..64ec7f81884cd 100644
--- a/ci/install.ps1
+++ b/ci/install.ps1
@@ -84,9 +84,9 @@ function UpdateConda ($python_home) {

 function main () {
-    InstallMiniconda $env:PYTHON_VERSION $env:PYTHON_ARCH $env:PYTHON
-    UpdateConda $env:PYTHON
-    InstallCondaPackages $env:PYTHON "conda-build jinja2 anaconda-client"
+    InstallMiniconda "3.5" $env:PYTHON_ARCH $env:CONDA_ROOT
+    UpdateConda $env:CONDA_ROOT
+    InstallCondaPackages $env:CONDA_ROOT "conda-build jinja2 anaconda-client"
 }

 main
diff --git a/ci/install_appveyor.ps1 b/ci/install_appveyor.ps1
deleted file mode 100644
index a022995dc7d58..0000000000000
--- a/ci/install_appveyor.ps1
+++ /dev/null
@@ -1,133 +0,0 @@
-# Sample script to install Miniconda under Windows
-# Authors: Olivier Grisel, Jonathan Helmus and Kyle Kastner, Robert McGibbon
-# License: CC0 1.0 Universal: http://creativecommons.org/publicdomain/zero/1.0/
-
-$MINICONDA_URL = "http://repo.continuum.io/miniconda/"
-
-
-function DownloadMiniconda ($python_version, $platform_suffix) {
-    $webclient = New-Object System.Net.WebClient
-    if ($python_version -match "3.4") {
-        $filename = "Miniconda3-3.5.5-Windows-" + $platform_suffix + ".exe"
-    } else {
-        $filename = "Miniconda-3.5.5-Windows-" + $platform_suffix + ".exe"
-    }
-    $url = $MINICONDA_URL + $filename
-
-    $basedir = $pwd.Path + "\"
-    $filepath = $basedir + $filename
-    if (Test-Path $filename) {
-        Write-Host "Reusing" $filepath
-        return $filepath
-    }
-
-    # Download and retry up to 3 times in case of network transient errors.
-    Write-Host "Downloading" $filename "from" $url
-    $retry_attempts = 2
-    for($i=0; $i -lt $retry_attempts; $i++){
-        try {
-            $webclient.DownloadFile($url, $filepath)
-            break
-        }
-        Catch [Exception]{
-            Start-Sleep 1
-        }
-    }
-    if (Test-Path $filepath) {
-        Write-Host "File saved at" $filepath
-    } else {
-        # Retry once to get the error message if any at the last try
-        $webclient.DownloadFile($url, $filepath)
-    }
-    return $filepath
-}
-
-function Start-Executable {
-    param(
-        [String] $FilePath,
-        [String[]] $ArgumentList
-    )
-    $OFS = " "
-    $process = New-Object System.Diagnostics.Process
-    $process.StartInfo.FileName = $FilePath
-    $process.StartInfo.Arguments = $ArgumentList
-    $process.StartInfo.UseShellExecute = $false
-    $process.StartInfo.RedirectStandardOutput = $true
-    if ( $process.Start() ) {
-        $output = $process.StandardOutput.ReadToEnd() `
-            -replace "\r\n$",""
-        if ( $output ) {
-            if ( $output.Contains("`r`n") ) {
-                $output -split "`r`n"
-            }
-            elseif ( $output.Contains("`n") ) {
-                $output -split "`n"
-            }
-            else {
-                $output
-            }
-        }
-        $process.WaitForExit()
-        & "$Env:SystemRoot\system32\cmd.exe" `
-            /c exit $process.ExitCode
-    }
-}
-
-function InstallMiniconda ($python_version, $architecture, $python_home) {
-    Write-Host "Installing Python" $python_version "for" $architecture "bit architecture to" $python_home
-    if (Test-Path $python_home) {
-        Write-Host $python_home "already exists, skipping."
-        return $false
-    }
-    if ($architecture -match "32") {
-        $platform_suffix = "x86"
-    } else {
-        $platform_suffix = "x86_64"
-    }
-
-    $filepath = DownloadMiniconda $python_version $platform_suffix
-    Write-Host "Installing" $filepath "to" $python_home
-    $install_log = $python_home + ".log"
-    $args = "/S /D=$python_home"
-    Write-Host $filepath $args
-    Start-Process -FilePath $filepath -ArgumentList $args -Wait
-    if (Test-Path $python_home) {
-        Write-Host "Python $python_version ($architecture) installation complete"
-    } else {
-        Write-Host "Failed to install Python in $python_home"
-        Get-Content -Path $install_log
-        Exit 1
-    }
-}
-
-
-function InstallCondaPackages ($python_home, $spec) {
-    $conda_path = $python_home + "\Scripts\conda.exe"
-    $args = "install --yes --quiet " + $spec
-    Write-Host ("conda " + $args)
-    Start-Executable -FilePath "$conda_path" -ArgumentList $args
-}
-function InstallCondaPackagesFromFile ($python_home, $ver, $arch) {
-    $conda_path = $python_home + "\Scripts\conda.exe"
-    $args = "install --yes --quiet --file " + $env:APPVEYOR_BUILD_FOLDER + "\ci\requirements-" + $ver + "_" + $arch + ".txt"
-    Write-Host ("conda " + $args)
-    Start-Executable -FilePath "$conda_path" -ArgumentList $args
-}
-
-function UpdateConda ($python_home) {
-    $conda_path = $python_home + "\Scripts\conda.exe"
-    Write-Host "Updating conda..."
-    $args = "update --yes conda"
-    Write-Host $conda_path $args
-    Start-Process -FilePath "$conda_path" -ArgumentList $args -Wait
-}
-
-
-function main () {
-    InstallMiniconda $env:PYTHON_VERSION $env:PYTHON_ARCH $env:PYTHON
-    UpdateConda $env:PYTHON
-    InstallCondaPackages $env:PYTHON "pip setuptools nose"
-    InstallCondaPackagesFromFile $env:PYTHON $env:PYTHON_VERSION $env:PYTHON_ARCH
-}
-
-main
\ No newline at end of file
diff --git a/ci/install_circle.sh b/ci/install_circle.sh
new file mode 100755
index 0000000000000..00e14b10ebbd6
--- /dev/null
+++ b/ci/install_circle.sh
@@ -0,0 +1,85 @@
+#!/usr/bin/env bash
+
+home_dir=$(pwd)
+echo "[home_dir: $home_dir]"
+
+echo "[ls -ltr]"
+ls -ltr
+
+echo "[Using clean Miniconda install]"
+rm -rf "$MINICONDA_DIR"
+
+# install miniconda
+wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -q -O miniconda.sh || exit 1
+bash miniconda.sh -b -p "$MINICONDA_DIR" || exit 1
+
+export PATH="$MINICONDA_DIR/bin:$PATH"
+
+echo "[update conda]"
+conda config --set ssl_verify false || exit 1
+conda config --set always_yes true --set changeps1 false || exit 1
+conda update -q conda
+
+# add the pandas channel to take priority
+# to add extra packages
+echo "[add channels]"
+conda config --add channels pandas || exit 1
+conda config --remove channels defaults || exit 1
+conda config --add channels defaults || exit 1
+
+# Useful for debugging any issues with conda
+conda info -a || exit 1
+
+# support env variables passed
+export ENVS_FILE=".envs"
+
+# make sure that the .envs file exists. it is ok if it is empty
+touch $ENVS_FILE
+
+# assume all command line arguments are environmental variables
+for var in "$@"
+do
+    echo "export $var" >> $ENVS_FILE
+done
+
+echo "[environmental variable file]"
+cat $ENVS_FILE
+source $ENVS_FILE
+
+export REQ_BUILD=ci/requirements-${JOB}.build
+export REQ_RUN=ci/requirements-${JOB}.run
+export REQ_PIP=ci/requirements-${JOB}.pip
+
+# edit the locale override if needed
+if [ -n "$LOCALE_OVERRIDE" ]; then
+    echo "[Adding locale to the first line of pandas/__init__.py]"
+    rm -f pandas/__init__.pyc
+    sedc="3iimport locale\nlocale.setlocale(locale.LC_ALL, '$LOCALE_OVERRIDE')\n"
+    sed -i "$sedc" pandas/__init__.py
+    echo "[head -4 pandas/__init__.py]"
+    head -4 pandas/__init__.py
+    echo
+fi
+
+# create the env with the build deps
+echo "[create env: ${REQ_BUILD}]"
+time conda create -n pandas -q --file=${REQ_BUILD} || exit 1
+time conda install -n pandas pytest || exit 1
+
+source activate pandas
+
+# build but don't install
+echo "[build em]"
+time python setup.py build_ext --inplace || exit 1
+
+# we may have run installations
+echo "[conda installs: ${REQ_RUN}]"
+if [ -e ${REQ_RUN} ]; then
+    time conda install -q --file=${REQ_RUN} || exit 1
+fi
+
+# we may have additional pip installs
+echo "[pip installs: ${REQ_PIP}]"
+if [ -e ${REQ_PIP} ]; then
+    pip install -r $REQ_PIP
+fi
diff --git a/ci/install_db_circle.sh b/ci/install_db_circle.sh
new file mode 100755
index 0000000000000..a00f74f009f54
--- /dev/null
+++ b/ci/install_db_circle.sh
@@ -0,0 +1,8 @@
+#!/bin/bash
+
+echo "installing dbs"
+mysql -e 'create database pandas_nosetest;'
+psql -c 'create database pandas_nosetest;' -U postgres
+
+echo "done"
+exit 0
diff --git a/ci/install_db.sh b/ci/install_db_travis.sh
similarity index 100%
rename from ci/install_db.sh
rename to ci/install_db_travis.sh
- -echo "inside $0" - -if [ "$INSTALL_TEST" ]; then - source activate pandas - echo "Starting installation test." - conda uninstall cython || exit 1 - python "$TRAVIS_BUILD_DIR"/setup.py sdist --formats=zip,gztar || exit 1 - pip install "$TRAVIS_BUILD_DIR"/dist/*tar.gz || exit 1 - nosetests --exe -A "$NOSE_ARGS" pandas/tests/test_series.py --with-xunit --xunit-file=/tmp/nosetests_install.xml -else - echo "Skipping installation test." -fi -RET="$?" - -exit "$RET" diff --git a/ci/install_travis.sh b/ci/install_travis.sh index bdd2c01f611b2..09668cbccc9d2 100755 --- a/ci/install_travis.sh +++ b/ci/install_travis.sh @@ -1,151 +1,183 @@ #!/bin/bash -# There are 2 distinct pieces that get zipped and cached -# - The venv site-packages dir including the installed dependencies -# - The pandas build artifacts, using the build cache support via -# scripts/use_build_cache.py -# -# if the user opted in to use the cache and we're on a whitelisted fork -# - if the server doesn't hold a cached version of venv/pandas build, -# do things the slow way, and put the results on the cache server -# for the next time. -# - if the cache files are available, instal some necessaries via apt -# (no compiling needed), then directly goto script and collect 200$. -# - +# edit the locale file if needed function edit_init() { if [ -n "$LOCALE_OVERRIDE" ]; then - echo "Adding locale to the first line of pandas/__init__.py" + echo "[Adding locale to the first line of pandas/__init__.py]" rm -f pandas/__init__.pyc sedc="3iimport locale\nlocale.setlocale(locale.LC_ALL, '$LOCALE_OVERRIDE')\n" sed -i "$sedc" pandas/__init__.py - echo "head -4 pandas/__init__.py" + echo "[head -4 pandas/__init__.py]" head -4 pandas/__init__.py echo fi } +echo +echo "[install_travis]" edit_init home_dir=$(pwd) -echo "home_dir: [$home_dir]" +echo +echo "[home_dir]: $home_dir" +# install miniconda MINICONDA_DIR="$HOME/miniconda3" -if [ -d "$MINICONDA_DIR" ] && [ -e "$MINICONDA_DIR/bin/conda" ] && [ "$USE_CACHE" ]; then - echo "Miniconda install already present from cache: $MINICONDA_DIR" - - conda config --set always_yes yes --set changeps1 no || exit 1 - echo "update conda" - conda update -q conda || exit 1 - - # Useful for debugging any issues with conda - conda info -a || exit 1 - - # set the compiler cache to work - if [ "${TRAVIS_OS_NAME}" == "linux" ]; then - echo "Using ccache" - export PATH=/usr/lib/ccache:/usr/lib64/ccache:$PATH - gcc=$(which gcc) - echo "gcc: $gcc" - ccache=$(which ccache) - echo "ccache: $ccache" - export CC='ccache gcc' - fi +echo +echo "[Using clean Miniconda install]" -else - echo "Using clean Miniconda install" - echo "Not using ccache" +if [ -d "$MINICONDA_DIR" ]; then rm -rf "$MINICONDA_DIR" - # install miniconda - if [ "${TRAVIS_OS_NAME}" == "osx" ]; then - wget http://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh || exit 1 - else - wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh || exit 1 - fi - bash miniconda.sh -b -p "$MINICONDA_DIR" || exit 1 +fi - echo "update conda" - conda config --set ssl_verify false || exit 1 - conda config --set always_yes true --set changeps1 false || exit 1 - conda update -q conda +# install miniconda +if [ "${TRAVIS_OS_NAME}" == "osx" ]; then + time wget http://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh || exit 1 +else + time wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh || exit 1 +fi +time bash miniconda.sh -b -p "$MINICONDA_DIR" 
|| exit 1 + +echo +echo "[show conda]" +which conda + +echo +echo "[update conda]" +conda config --set ssl_verify false || exit 1 +conda config --set always_yes true --set changeps1 false || exit 1 +conda update -q conda + +echo +echo "[add channels]" +# add the pandas channel to take priority +# to add extra packages +conda config --add channels pandas || exit 1 +conda config --remove channels defaults || exit 1 +conda config --add channels defaults || exit 1 + +if [ "$CONDA_FORGE" ]; then + # add conda-forge channel as priority + conda config --add channels conda-forge || exit 1 +fi - # add the pandas channel *before* defaults to have defaults take priority - echo "add channels" - conda config --add channels pandas || exit 1 - conda config --remove channels defaults || exit 1 - conda config --add channels defaults || exit 1 +# Useful for debugging any issues with conda +conda info -a || exit 1 + +# set the compiler cache to work +echo +if [ -z "$NOCACHE" ] && [ "${TRAVIS_OS_NAME}" == "linux" ]; then + echo "[Using ccache]" + export PATH=/usr/lib/ccache:/usr/lib64/ccache:$PATH + gcc=$(which gcc) + echo "[gcc]: $gcc" + ccache=$(which ccache) + echo "[ccache]: $ccache" + export CC='ccache gcc' +elif [ -z "$NOCACHE" ] && [ "${TRAVIS_OS_NAME}" == "osx" ]; then + echo "[Install ccache]" + brew install ccache > /dev/null 2>&1 + echo "[Using ccache]" + export PATH=/usr/local/opt/ccache/libexec:$PATH + gcc=$(which gcc) + echo "[gcc]: $gcc" + ccache=$(which ccache) + echo "[ccache]: $ccache" +else + echo "[Not using ccache]" +fi - conda install anaconda-client +echo +echo "[create env]" - # Useful for debugging any issues with conda - conda info -a || exit 1 -fi +# create our environment +REQ="ci/requirements-${JOB}.build" +time conda create -n pandas --file=${REQ} || exit 1 -# may have installation instructions for this build -INSTALL="ci/install-${PYTHON_VERSION}${JOB_TAG}.sh" -if [ -e ${INSTALL} ]; then - time bash $INSTALL || exit 1 -else +source activate pandas - # create new env - time conda create -n pandas python=$PYTHON_VERSION nose coverage flake8 || exit 1 +# may have addtl installation instructions for this build +echo +echo "[build addtl installs]" +REQ="ci/requirements-${JOB}.build.sh" +if [ -e ${REQ} ]; then + time bash $REQ || exit 1 fi -# build deps -REQ="ci/requirements-${PYTHON_VERSION}${JOB_TAG}.build" +time conda install -n pandas pytest +time pip install pytest-xdist -# install deps -if [ -e ${REQ} ]; then - time conda install -n pandas --file=${REQ} || exit 1 +if [ "$LINT" ]; then + conda install flake8 + pip install cpplint fi -source activate pandas +if [ "$COVERAGE" ]; then + pip install coverage pytest-cov +fi +echo if [ "$BUILD_TEST" ]; then - # build testing - pip uninstall --yes cython - pip install cython==0.19.1 - ( python setup.py build_ext --inplace && python setup.py develop ) || true + # build & install testing + echo ["Starting installation test."] + rm -rf dist + python setup.py clean + python setup.py build_ext --inplace + python setup.py sdist --formats=gztar + conda uninstall cython + pip install dist/*tar.gz || exit 1 else # build but don't install - echo "build em" + echo "[build em]" time python setup.py build_ext --inplace || exit 1 - # we may have run installations - echo "conda installs" - REQ="ci/requirements-${PYTHON_VERSION}${JOB_TAG}.run" - if [ -e ${REQ} ]; then - time conda install -n pandas --file=${REQ} || exit 1 - fi +fi - # we may have additional pip installs - echo "pip installs" - REQ="ci/requirements-${PYTHON_VERSION}${JOB_TAG}.pip" - if 
[ -e ${REQ} ]; then - pip install --upgrade -r $REQ - fi +# we may have run installations +echo +echo "[conda installs]" +REQ="ci/requirements-${JOB}.run" +if [ -e ${REQ} ]; then + time conda install -n pandas --file=${REQ} || exit 1 +fi - # may have addtl installation instructions for this build - REQ="ci/requirements-${PYTHON_VERSION}${JOB_TAG}.sh" - if [ -e ${REQ} ]; then - time bash $REQ || exit 1 - fi +# we may have additional pip installs +echo +echo "[pip installs]" +REQ="ci/requirements-${JOB}.pip" +if [ -e ${REQ} ]; then + pip install -r $REQ +fi + +# may have addtl installation instructions for this build +echo +echo "[addtl installs]" +REQ="ci/requirements-${JOB}.sh" +if [ -e ${REQ} ]; then + time bash $REQ || exit 1 +fi + +# finish install if we are not doing a build-testk +if [ -z "$BUILD_TEST" ]; then # remove any installed pandas package # w/o removing anything else - echo "removing installed pandas" + echo + echo "[removing installed pandas]" conda remove pandas --force # install our pandas - echo "running setup.py develop" + echo + echo "[running setup.py develop]" python setup.py develop || exit 1 fi -echo "done" +echo +echo "[done]" exit 0 diff --git a/ci/ironcache/get.py b/ci/ironcache/get.py deleted file mode 100644 index a4663472b955c..0000000000000 --- a/ci/ironcache/get.py +++ /dev/null @@ -1,41 +0,0 @@ -#!/usr/bin/env python -# -*- coding: utf-8 -*- - -import sys -import re -import os -import time -import json -import base64 -from hashlib import sha1 -from iron_cache import * -import traceback as tb - -key='KEY.%s.%s' %(os.environ.get('TRAVIS_REPO_SLUG','unk'), - os.environ.get('JOB_NAME','unk')) -print(key) - -if sys.version_info[0] > 2: - key = bytes(key,encoding='utf8') - -key = sha1(key).hexdigest()[:8]+'.' - -b = b'' -cache = IronCache() -for i in range(20): - print("getting %s" % key+str(i)) - try: - item = cache.get(cache="travis", key=key+str(i)) - v = item.value - if sys.version_info[0] > 2: - v = bytes(v,encoding='utf8') - b += bytes(base64.b64decode(v)) - except Exception as e: - try: - print(tb.format_exc(e)) - except: - print("exception during exception, oh my") - break - -with open(os.path.join(os.environ.get('HOME',''),"ccache.7z"),'wb') as f: - f.write(b) diff --git a/ci/ironcache/put.py b/ci/ironcache/put.py deleted file mode 100644 index f6aef3a327e87..0000000000000 --- a/ci/ironcache/put.py +++ /dev/null @@ -1,48 +0,0 @@ -#!/usr/bin/env python -# -*- coding: utf-8 -*- - -import sys -import re -import os -import time -import json -import base64 -from hashlib import sha1 -from iron_cache import * - -key='KEY.%s.%s' %(os.environ.get('TRAVIS_REPO_SLUG','unk'), - os.environ.get('JOB_NAME','unk')) - -key='KEY.%s.%s' %(os.environ.get('TRAVIS_REPO_SLUG','unk'), - os.environ.get('JOB_NAME','unk')) -print(key) - -if sys.version_info[0] > 2: - key = bytes(key,encoding='utf8') - -key = sha1(key).hexdigest()[:8]+'.' 
- -os.chdir(os.environ.get('HOME')) - -cache = IronCache() - -i=0 - -for i, fname in enumerate(sorted([x for x in os.listdir('.') if re.match("ccache.\d+$",x)])): - print("Putting %s" % key+str(i)) - with open(fname,"rb") as f: - s= f.read() - value=base64.b64encode(s) - if isinstance(value, bytes): - value = value.decode('ascii') - item = cache.put(cache="travis", key=key+str(i), value=value,options=dict(expires_in=24*60*60)) - -# print("foo") -for i in range(i+1,20): - - try: - item = cache.delete(key+str(i),cache='travis') - print("Deleted %s" % key+str(i)) - except: - break - pass diff --git a/ci/lint.sh b/ci/lint.sh index d6390a16b763e..ed3af2568811c 100755 --- a/ci/lint.sh +++ b/ci/lint.sh @@ -7,10 +7,10 @@ source activate pandas RET=0 if [ "$LINT" ]; then - # pandas/rpy is deprecated and will be removed. - # pandas/src is C code, so no need to search there. + + # pandas/_libs/src is C code, so no need to search there. echo "Linting *.py" - flake8 pandas --filename=*.py --exclude pandas/rpy,pandas/src + flake8 pandas --filename=*.py --exclude pandas/_libs/src if [ $? -ne "0" ]; then RET=1 fi @@ -35,8 +35,27 @@ if [ "$LINT" ]; then done echo "Linting *.pxi.in DONE" + # readability/casting: Warnings about C casting instead of C++ casting + # runtime/int: Warnings about using C number types instead of C++ ones + # build/include_subdir: Warnings about prefacing included header files with directory + + # We don't lint all C files because we don't want to lint any that are built + # from Cython files nor do we want to lint C files that we didn't modify for + # this particular codebase (e.g. src/headers, src/klib, src/msgpack). However, + # we can lint all header files since they aren't "generated" like C files are. + echo "Linting *.c and *.h" + for path in '*.h' 'period_helper.c' 'datetime' 'parser' 'ujson' + do + echo "linting -> pandas/_libs/src/$path" + cpplint --quiet --extensions=c,h --headers=h --filter=-readability/casting,-runtime/int,-build/include_subdir --recursive pandas/_libs/src/$path + if [ $? -ne "0" ]; then + RET=1 + fi + done + echo "Linting *.c and *.h DONE" + echo "Check for invalid testing" - grep -r -E --include '*.py' --exclude nosetester.py --exclude testing.py '(numpy|np)\.testing' pandas + grep -r -E --include '*.py' --exclude testing.py '(numpy|np)\.testing' pandas if [ $? 
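The new "invalid testing" grep fails the lint run whenever test code reaches for `numpy.testing` directly; pandas' own comparison helpers are the intended replacement. A small illustration of the pattern the check enforces (the helper names are from `pandas.util.testing` as it existed at the time):

```python
import pandas as pd
import pandas.util.testing as tm

s = pd.Series([1.0, 2.0, 3.0])

# instead of np.testing.assert_allclose(...), which the grep above flags:
tm.assert_series_equal(s, s.copy())  # index- and dtype-aware comparison
```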
= "0" ]; then RET=1 fi diff --git a/ci/prep_cython_cache.sh b/ci/prep_cython_cache.sh index cadc356b641f9..18d9388327ddc 100755 --- a/ci/prep_cython_cache.sh +++ b/ci/prep_cython_cache.sh @@ -3,8 +3,8 @@ ls "$HOME/.cache/" PYX_CACHE_DIR="$HOME/.cache/pyxfiles" -pyx_file_list=`find ${TRAVIS_BUILD_DIR} -name "*.pyx" -o -name "*.pxd"` -pyx_cache_file_list=`find ${PYX_CACHE_DIR} -name "*.pyx" -o -name "*.pxd"` +pyx_file_list=`find ${TRAVIS_BUILD_DIR} -name "*.pyx" -o -name "*.pxd" -o -name "*.pxi.in"` +pyx_cache_file_list=`find ${PYX_CACHE_DIR} -name "*.pyx" -o -name "*.pxd" -o -name "*.pxi.in"` CACHE_File="$HOME/.cache/cython_files.tar" @@ -22,7 +22,7 @@ fi home_dir=$(pwd) -if [ -f "$CACHE_File" ] && [ "$USE_CACHE" ] && [ -d "$PYX_CACHE_DIR" ]; then +if [ -f "$CACHE_File" ] && [ -z "$NOCACHE" ] && [ -d "$PYX_CACHE_DIR" ]; then echo "Cache available - checking pyx diff" @@ -57,16 +57,16 @@ if [ -f "$CACHE_File" ] && [ "$USE_CACHE" ] && [ -d "$PYX_CACHE_DIR" ]; then fi -if [ $clear_cache -eq 0 ] && [ "$USE_CACHE" ] +if [ $clear_cache -eq 0 ] && [ -z "$NOCACHE" ] then - # No and use_cache is set + # No and nocache is not set echo "Will reuse cached cython file" cd / tar xvmf $CACHE_File cd $home_dir else echo "Rebuilding cythonized files" - echo "Use cache (Blank if not set) = $USE_CACHE" + echo "No cache = $NOCACHE" echo "Clear cache (1=YES) = $clear_cache" fi diff --git a/ci/print_skipped.py b/ci/print_skipped.py index 9fb05df64bcea..dd2180f6eeb19 100755 --- a/ci/print_skipped.py +++ b/ci/print_skipped.py @@ -30,20 +30,21 @@ def parse_results(filename): i += 1 assert i - 1 == len(skipped) assert i - 1 == len(skipped) - assert len(skipped) == int(root.attrib['skip']) + # assert len(skipped) == int(root.attrib['skip']) return '\n'.join(skipped) def main(args): print('SKIPPED TESTS:') - print(parse_results(args.filename)) + for fn in args.filename: + print(parse_results(fn)) return 0 def parse_args(): import argparse parser = argparse.ArgumentParser() - parser.add_argument('filename', help='XUnit file to parse') + parser.add_argument('filename', nargs='+', help='XUnit file to parse') return parser.parse_args() diff --git a/ci/requirements-2.7.build b/ci/requirements-2.7.build index b2e2038faf7c3..415df13179fcf 100644 --- a/ci/requirements-2.7.build +++ b/ci/requirements-2.7.build @@ -1,4 +1,6 @@ +python=2.7* python-dateutil=2.4.1 pytz=2013b +nomkl numpy -cython=0.19.1 +cython=0.23 diff --git a/ci/requirements-2.7.pip b/ci/requirements-2.7.pip index d16b932c8be4f..eb796368e7820 100644 --- a/ci/requirements-2.7.pip +++ b/ci/requirements-2.7.pip @@ -1,9 +1,8 @@ blosc -httplib2 -google-api-python-client==1.2 -python-gflags==2.0 -oauth2client==1.5.0 +pandas-gbq pathlib backports.lzma py PyCrypto +mock +ipython diff --git a/ci/requirements-2.7.run b/ci/requirements-2.7.run index eec7886fed38d..62e31e4ae24e3 100644 --- a/ci/requirements-2.7.run +++ b/ci/requirements-2.7.run @@ -11,14 +11,12 @@ sqlalchemy=0.9.6 lxml=3.2.1 scipy xlsxwriter=0.4.6 -boto=2.36.0 +s3fs bottleneck psycopg2=2.5.2 patsy pymysql=0.6.3 html5lib=1.0b2 beautiful-soup=4.2.1 -statsmodels jinja2=2.8 -xarray -pyqt=4.11.4 +xarray=0.8.0 diff --git a/ci/requirements-2.7.sh b/ci/requirements-2.7.sh new file mode 100644 index 0000000000000..64d470e5c6e0e --- /dev/null +++ b/ci/requirements-2.7.sh @@ -0,0 +1,7 @@ +#!/bin/bash + +source activate pandas + +echo "install 27" + +conda install -n pandas -c conda-forge feather-format diff --git a/ci/requirements-2.7_BUILD_TEST.build b/ci/requirements-2.7_BUILD_TEST.build index 
diff --git a/ci/requirements-2.7.build b/ci/requirements-2.7.build
index b2e2038faf7c3..415df13179fcf 100644
--- a/ci/requirements-2.7.build
+++ b/ci/requirements-2.7.build
@@ -1,4 +1,6 @@
+python=2.7*
 python-dateutil=2.4.1
 pytz=2013b
+nomkl
 numpy
-cython=0.19.1
+cython=0.23
diff --git a/ci/requirements-2.7.pip b/ci/requirements-2.7.pip
index d16b932c8be4f..eb796368e7820 100644
--- a/ci/requirements-2.7.pip
+++ b/ci/requirements-2.7.pip
@@ -1,9 +1,8 @@
 blosc
-httplib2
-google-api-python-client==1.2
-python-gflags==2.0
-oauth2client==1.5.0
+pandas-gbq
 pathlib
 backports.lzma
 py
 PyCrypto
+mock
+ipython
diff --git a/ci/requirements-2.7.run b/ci/requirements-2.7.run
index eec7886fed38d..62e31e4ae24e3 100644
--- a/ci/requirements-2.7.run
+++ b/ci/requirements-2.7.run
@@ -11,14 +11,12 @@ sqlalchemy=0.9.6
 lxml=3.2.1
 scipy
 xlsxwriter=0.4.6
-boto=2.36.0
+s3fs
 bottleneck
 psycopg2=2.5.2
 patsy
 pymysql=0.6.3
 html5lib=1.0b2
 beautiful-soup=4.2.1
-statsmodels
 jinja2=2.8
-xarray
-pyqt=4.11.4
+xarray=0.8.0
diff --git a/ci/requirements-2.7.sh b/ci/requirements-2.7.sh
new file mode 100644
index 0000000000000..64d470e5c6e0e
--- /dev/null
+++ b/ci/requirements-2.7.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+source activate pandas
+
+echo "install 27"
+
+conda install -n pandas -c conda-forge feather-format
diff --git a/ci/requirements-2.7_BUILD_TEST.build b/ci/requirements-2.7_BUILD_TEST.build
index faf1e3559f7f1..aadec00cb7ebf 100644
--- a/ci/requirements-2.7_BUILD_TEST.build
+++ b/ci/requirements-2.7_BUILD_TEST.build
@@ -1,4 +1,6 @@
+python=2.7*
 dateutil
 pytz
+nomkl
 numpy
 cython
diff --git a/ci/requirements-2.7_COMPAT.build b/ci/requirements-2.7_COMPAT.build
index 85148069a9e6a..0e1ccf9eac9bf 100644
--- a/ci/requirements-2.7_COMPAT.build
+++ b/ci/requirements-2.7_COMPAT.build
@@ -1,4 +1,5 @@
+python=2.7*
 numpy=1.7.1
-cython=0.19.1
+cython=0.23
 dateutil=1.5
 pytz=2013b
diff --git a/ci/requirements-2.7_COMPAT.run b/ci/requirements-2.7_COMPAT.run
index 32d71beb24388..d27b6a72c2d15 100644
--- a/ci/requirements-2.7_COMPAT.run
+++ b/ci/requirements-2.7_COMPAT.run
@@ -4,7 +4,6 @@ pytz=2013b
 scipy=0.11.0
 xlwt=0.7.5
 xlrd=0.9.2
-statsmodels=0.4.3
 bottleneck=0.8.0
 numexpr=2.2.2
 pytables=3.0.0
diff --git a/ci/requirements-2.7_DOC_BUILD.build b/ci/requirements-2.7_DOC_BUILD.build
deleted file mode 100644
index faf1e3559f7f1..0000000000000
--- a/ci/requirements-2.7_DOC_BUILD.build
+++ /dev/null
@@ -1,4 +0,0 @@
-dateutil
-pytz
-numpy
-cython
diff --git a/ci/requirements-2.7_LOCALE.build b/ci/requirements-2.7_LOCALE.build
index ada6686f599ca..4a37ce8fbe161 100644
--- a/ci/requirements-2.7_LOCALE.build
+++ b/ci/requirements-2.7_LOCALE.build
@@ -1,4 +1,5 @@
+python=2.7*
 python-dateutil
 pytz=2013b
-numpy=1.7.1
-cython=0.19.1
+numpy=1.8.2
+cython=0.23
diff --git a/ci/requirements-2.7_LOCALE.run b/ci/requirements-2.7_LOCALE.run
index 9bb37ee10f8db..5d7cc31b7d55e 100644
--- a/ci/requirements-2.7_LOCALE.run
+++ b/ci/requirements-2.7_LOCALE.run
@@ -1,17 +1,14 @@
 python-dateutil
 pytz=2013b
-numpy=1.7.1
+numpy=1.8.2
 xlwt=0.7.5
 openpyxl=1.6.2
 xlsxwriter=0.4.6
 xlrd=0.9.2
 bottleneck=0.8.0
-matplotlib=1.2.1
-patsy=0.1.0
+matplotlib=1.3.1
 sqlalchemy=0.8.1
 html5lib=1.0b2
 lxml=3.2.1
-scipy=0.11.0
+scipy
 beautiful-soup=4.2.1
-statsmodels=0.4.3
-bigquery=2.0.17
diff --git a/ci/requirements-2.7_SLOW.build b/ci/requirements-2.7_SLOW.build
index 664e8b418def7..0f4a2c6792e6b 100644
--- a/ci/requirements-2.7_SLOW.build
+++ b/ci/requirements-2.7_SLOW.build
@@ -1,3 +1,4 @@
+python=2.7*
 python-dateutil
 pytz
 numpy=1.8.2
diff --git a/ci/requirements-2.7_SLOW.run b/ci/requirements-2.7_SLOW.run
index f02a7cb8a309a..c2d2a14285ad6 100644
--- a/ci/requirements-2.7_SLOW.run
+++ b/ci/requirements-2.7_SLOW.run
@@ -4,7 +4,6 @@ numpy=1.8.2
 matplotlib=1.3.1
 scipy
 patsy
-statsmodels
 xlwt
 openpyxl
 xlsxwriter
@@ -13,7 +12,7 @@ numexpr
 pytables
 sqlalchemy
 lxml
-boto
+s3fs
 bottleneck
 psycopg2
 pymysql
diff --git a/ci/requirements-2.7-64.run b/ci/requirements-2.7_WIN.run
similarity index 85%
rename from ci/requirements-2.7-64.run
rename to ci/requirements-2.7_WIN.run
index ce085a6ebf91c..f953682f52d45 100644
--- a/ci/requirements-2.7-64.run
+++ b/ci/requirements-2.7_WIN.run
@@ -3,7 +3,7 @@ pytz
 numpy=1.10*
 xlwt
 numexpr
-pytables
+pytables==3.2.2
 matplotlib
 openpyxl
 xlrd
@@ -11,9 +11,8 @@ sqlalchemy
 lxml=3.2.1
 scipy
 xlsxwriter
-boto
+s3fs
 bottleneck
 html5lib
 beautiful-soup
 jinja2=2.8
-pyqt=4.11.4
diff --git a/ci/requirements-3.4-64.run b/ci/requirements-3.4-64.run
deleted file mode 100644
index 106cc5b7168ba..0000000000000
--- a/ci/requirements-3.4-64.run
+++ /dev/null
@@ -1,12 +0,0 @@
-python-dateutil
-pytz
-numpy=1.9*
-openpyxl
-xlsxwriter
-xlrd
-xlwt
-scipy
-numexpr
-pytables
-bottleneck
-jinja2=2.8
diff --git a/ci/requirements-3.4.build b/ci/requirements-3.4.build
index e6e59dcba63fe..e8a957f70d40e 100644
--- a/ci/requirements-3.4.build
+++ b/ci/requirements-3.4.build
@@ -1,3 +1,4 @@
+python=3.4*
 numpy=1.8.1
 cython=0.24.1
 libgfortran=1.0
diff --git a/ci/requirements-3.4.pip b/ci/requirements-3.4.pip
index 55986a0220bf0..4e5fe52d56cf1 100644
--- a/ci/requirements-3.4.pip
+++ b/ci/requirements-3.4.pip
@@ -1,5 +1,2 @@
 python-dateutil==2.2
 blosc
-httplib2
-google-api-python-client
-oauth2client
diff --git a/ci/requirements-3.4_SLOW.build b/ci/requirements-3.4_SLOW.build
index de36b1afb9fa4..88212053af472 100644
--- a/ci/requirements-3.4_SLOW.build
+++ b/ci/requirements-3.4_SLOW.build
@@ -1,4 +1,6 @@
+python=3.4*
 python-dateutil
 pytz
-numpy=1.9.3
+nomkl
+numpy=1.10*
 cython
diff --git a/ci/requirements-3.4_SLOW.run b/ci/requirements-3.4_SLOW.run
index f9f226e3f1465..90156f62c6e71 100644
--- a/ci/requirements-3.4_SLOW.run
+++ b/ci/requirements-3.4_SLOW.run
@@ -1,6 +1,6 @@
 python-dateutil
 pytz
-numpy=1.9.3
+numpy=1.10*
 openpyxl
 xlsxwriter
 xlrd
@@ -9,7 +9,7 @@ html5lib
 patsy
 beautiful-soup
 scipy
-numexpr=2.4.4
+numexpr=2.4.6
 pytables
 matplotlib
 lxml
@@ -17,5 +17,4 @@ sqlalchemy
 bottleneck
 pymysql
 psycopg2
-statsmodels
 jinja2=2.8
diff --git a/ci/requirements-3.4_SLOW.sh b/ci/requirements-3.4_SLOW.sh
index bc8fb79147d2c..24f1e042ed69e 100644
--- a/ci/requirements-3.4_SLOW.sh
+++ b/ci/requirements-3.4_SLOW.sh
@@ -4,4 +4,4 @@ source activate pandas

 echo "install 34_slow"

-conda install -n pandas -c conda-forge/label/rc -c conda-forge matplotlib
+conda install -n pandas -c conda-forge matplotlib
diff --git a/ci/requirements-3.5.build b/ci/requirements-3.5.build
index 9558cf00ddf5c..76227e106e1fd 100644
--- a/ci/requirements-3.5.build
+++ b/ci/requirements-3.5.build
@@ -1,4 +1,6 @@
+python=3.5*
 python-dateutil
 pytz
-numpy
+nomkl
+numpy=1.11.3
 cython
diff --git a/ci/requirements-3.5.pip b/ci/requirements-3.5.pip
new file mode 100644
index 0000000000000..6e4f7b65f9728
--- /dev/null
+++ b/ci/requirements-3.5.pip
@@ -0,0 +1,2 @@
+xarray==0.9.1
+pandas-gbq
diff --git a/ci/requirements-3.5.run b/ci/requirements-3.5.run
index d9ce708585a33..43e6814ed6c8e 100644
--- a/ci/requirements-3.5.run
+++ b/ci/requirements-3.5.run
@@ -1,6 +1,6 @@
 python-dateutil
 pytz
-numpy
+numpy=1.11.3
 openpyxl
 xlsxwriter
 xlrd
@@ -16,9 +16,6 @@ bottleneck
 sqlalchemy
 pymysql
 psycopg2
-xarray
-boto
-pyqt=4.11.4
-
-# incompat with conda ATM
-# beautiful-soup
+s3fs
+beautifulsoup4
+ipython
diff --git a/ci/requirements-3.5.sh b/ci/requirements-3.5.sh
new file mode 100644
index 0000000000000..d0f0b81802dc6
--- /dev/null
+++ b/ci/requirements-3.5.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+source activate pandas
+
+echo "install 35"
+
+conda install -n pandas -c conda-forge feather-format
diff --git a/ci/requirements-3.5_ASCII.build b/ci/requirements-3.5_ASCII.build
index 9558cf00ddf5c..f7befe3b31865 100644
--- a/ci/requirements-3.5_ASCII.build
+++ b/ci/requirements-3.5_ASCII.build
@@ -1,4 +1,6 @@
+python=3.5*
 python-dateutil
 pytz
+nomkl
 numpy
 cython
diff --git a/ci/requirements-3.5_DOC.build b/ci/requirements-3.5_DOC.build
new file mode 100644
index 0000000000000..73aeb3192242f
--- /dev/null
+++ b/ci/requirements-3.5_DOC.build
@@ -0,0 +1,5 @@
+python=3.5*
+python-dateutil
+pytz
+numpy
+cython
diff --git a/ci/requirements-2.7_DOC_BUILD.run b/ci/requirements-3.5_DOC.run
similarity index 81%
rename from ci/requirements-2.7_DOC_BUILD.run
rename to ci/requirements-3.5_DOC.run
index 9da8e76bc7e22..9647ab53ab835 100644
--- a/ci/requirements-2.7_DOC_BUILD.run
+++ b/ci/requirements-3.5_DOC.run
@@ -1,13 +1,15 @@
 ipython
 ipykernel
+ipywidgets
 sphinx
 nbconvert
 nbformat
 notebook
 matplotlib
+seaborn
 scipy
 lxml
-beautiful-soup
+beautifulsoup4
 html5lib
 pytables
 openpyxl=1.8.5
@@ -18,4 +20,5 @@ sqlalchemy
 numexpr
 bottleneck
 statsmodels
+xarray
 pyqt=4.11.4
diff --git a/ci/requirements-3.5_DOC.sh b/ci/requirements-3.5_DOC.sh
new file mode 100644
index 0000000000000..e43e483d77a73
--- /dev/null
+++ b/ci/requirements-3.5_DOC.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+
+source activate pandas
+
+echo "[install DOC_BUILD deps]"
+
+pip install pandas-gbq
+
+conda install -n pandas -c conda-forge feather-format nbsphinx pandoc
+
+conda install -n pandas -c r r rpy2 --yes
diff --git a/ci/requirements-3.5_NUMPY_DEV.sh b/ci/requirements-3.5_NUMPY_DEV.sh
deleted file mode 100644
index 946ec43ad9f1a..0000000000000
--- a/ci/requirements-3.5_NUMPY_DEV.sh
+++ /dev/null
@@ -1,18 +0,0 @@
-#!/bin/bash
-
-source activate pandas
-
-echo "install numpy master wheel"
-
-# remove the system installed numpy
-pip uninstall numpy -y
-
-# we need these for numpy
-
-# these wheels don't play nice with the conda libgfortran / openblas
-# time conda install -n pandas libgfortran openblas || exit 1
-
-# install numpy wheel from master
-pip install --pre --upgrade --no-index --timeout=60 --trusted-host travis-dev-wheels.scipy.org -f http://travis-dev-wheels.scipy.org/ numpy scipy
-
-true
diff --git a/ci/requirements-3.5_OSX.build b/ci/requirements-3.5_OSX.build
index a201be352b8e4..f5bc01b67a20a 100644
--- a/ci/requirements-3.5_OSX.build
+++ b/ci/requirements-3.5_OSX.build
@@ -1,2 +1,4 @@
+python=3.5*
+nomkl
 numpy=1.10.4
 cython
diff --git a/ci/requirements-3.5_OSX.run b/ci/requirements-3.5_OSX.run
index ffa291ab7ff77..1d83474d10f2f 100644
--- a/ci/requirements-3.5_OSX.run
+++ b/ci/requirements-3.5_OSX.run
@@ -12,7 +12,5 @@ matplotlib
 jinja2
 bottleneck
 xarray
-boto
-
-# incompat with conda ATM
-# beautiful-soup
+s3fs
+beautifulsoup4
diff --git a/ci/requirements-3.5_OSX.sh b/ci/requirements-3.5_OSX.sh
new file mode 100644
index 0000000000000..cfbd2882a8a2d
--- /dev/null
+++ b/ci/requirements-3.5_OSX.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+source activate pandas
+
+echo "install 35_OSX"
+
+conda install -n pandas -c conda-forge feather-format
diff --git a/ci/requirements-3.6.build b/ci/requirements-3.6.build
new file mode 100644
index 0000000000000..1c4b46aea3865
--- /dev/null
+++ b/ci/requirements-3.6.build
@@ -0,0 +1,6 @@
+python=3.6*
+python-dateutil
+pytz
+nomkl
+numpy
+cython
diff --git a/pandas/api/tests/__init__.py b/ci/requirements-3.6.pip
similarity index 100%
rename from pandas/api/tests/__init__.py
rename to ci/requirements-3.6.pip
diff --git a/ci/requirements-3.6.run b/ci/requirements-3.6.run
new file mode 100644
index 0000000000000..41c9680ce1b7e
--- /dev/null
+++ b/ci/requirements-3.6.run
@@ -0,0 +1,22 @@
+python-dateutil
+pytz
+numpy
+scipy
+openpyxl
+xlsxwriter
+xlrd
+xlwt
+numexpr
+pytables
+matplotlib
+lxml
+html5lib
+jinja2
+sqlalchemy
+pymysql
+feather-format
+# psycopg2 (not avail on defaults ATM)
+beautifulsoup4
+s3fs
+xarray
+ipython
diff --git a/ci/requirements-3.5_NUMPY_DEV.build b/ci/requirements-3.6_NUMPY_DEV.build
similarity index 70%
rename from ci/requirements-3.5_NUMPY_DEV.build
rename to ci/requirements-3.6_NUMPY_DEV.build
index d15edbfa3d2c1..738366867a217 100644
--- a/ci/requirements-3.5_NUMPY_DEV.build
+++ b/ci/requirements-3.6_NUMPY_DEV.build
@@ -1,3 +1,4 @@
+python=3.6*
 python-dateutil
 pytz
 cython
diff --git a/ci/requirements-3.6_NUMPY_DEV.build.sh b/ci/requirements-3.6_NUMPY_DEV.build.sh
new file mode 100644
index 0000000000000..4af1307f26a18
--- /dev/null
+++ b/ci/requirements-3.6_NUMPY_DEV.build.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+
+source activate pandas
+
+echo "install numpy master wheel"
+
+# remove the system installed numpy
+pip uninstall numpy -y
+
+# install numpy wheel from master
+PRE_WHEELS="https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com"
+pip install --pre --upgrade --timeout=60 -f $PRE_WHEELS numpy scipy
+
+true
diff --git a/ci/requirements-3.5_NUMPY_DEV.run b/ci/requirements-3.6_NUMPY_DEV.run
similarity index 100%
rename from ci/requirements-3.5_NUMPY_DEV.run
rename to ci/requirements-3.6_NUMPY_DEV.run
diff --git a/ci/requirements-3.5-64.run b/ci/requirements-3.6_WIN.run
similarity index 70%
rename from ci/requirements-3.5-64.run
rename to ci/requirements-3.6_WIN.run
index 1dc88ed2c94af..899bfbc6b6b23 100644
--- a/ci/requirements-3.5-64.run
+++ b/ci/requirements-3.6_WIN.run
@@ -1,13 +1,14 @@
 python-dateutil
 pytz
-numpy=1.10*
+numpy=1.12*
+bottleneck
 openpyxl
 xlsxwriter
 xlrd
 xlwt
 scipy
+feather-format
 numexpr
 pytables
 matplotlib
 blosc
-pyqt=4.11.4
diff --git a/ci/requirements_all.txt b/ci/requirements_all.txt
index bc97957bff2b7..e9f49ed879c86 100644
--- a/ci/requirements_all.txt
+++ b/ci/requirements_all.txt
@@ -1,6 +1,9 @@
-nose
+pytest
+pytest-cov
+pytest-xdist
 flake8
 sphinx
+nbsphinx
 ipython
 python-dateutil
 pytz
@@ -17,6 +20,7 @@ scipy
 numexpr
 pytables
 matplotlib
+seaborn
 lxml
 sqlalchemy
 bottleneck
diff --git a/ci/requirements_dev.txt b/ci/requirements_dev.txt
index 7396fba6548d9..1e051802ec9f8 100644
--- a/ci/requirements_dev.txt
+++ b/ci/requirements_dev.txt
@@ -2,5 +2,6 @@ python-dateutil
 pytz
 numpy
 cython
-nose
+pytest
+pytest-cov
 flake8
diff --git a/ci/run_circle.sh b/ci/run_circle.sh
new file mode 100755
index 0000000000000..0e46d28ab6fc4
--- /dev/null
+++ b/ci/run_circle.sh
@@ -0,0 +1,9 @@
+#!/usr/bin/env bash
+
+echo "[running tests]"
+export PATH="$MINICONDA_DIR/bin:$PATH"
+
+source activate pandas
+
+echo "pytest --junitxml=$CIRCLE_TEST_REPORTS/reports/junit.xml $@ pandas"
+pytest --junitxml=$CIRCLE_TEST_REPORTS/reports/junit.xml $@ pandas
diff --git a/ci/script.sh b/ci/script.sh
deleted file mode 100755
index e2ba883b81883..0000000000000
--- a/ci/script.sh
+++ /dev/null
@@ -1,32 +0,0 @@
-#!/bin/bash
-
-echo "inside $0"
-
-source activate pandas
-
-# don't run the tests for the doc build
-if [ x"$DOC_BUILD" != x"" ]; then
-    exit 0
-fi
-
-if [ -n "$LOCALE_OVERRIDE" ]; then
-    export LC_ALL="$LOCALE_OVERRIDE";
-    echo "Setting LC_ALL to $LOCALE_OVERRIDE"
-
-    pycmd='import pandas; print("pandas detected console encoding: %s" % pandas.get_option("display.encoding"))'
-    python -c "$pycmd"
-fi
-
-if [ "$BUILD_TEST" ]; then
-    echo "We are not running nosetests as this is simply a build test."
-elif [ "$COVERAGE" ]; then
-    echo nosetests --exe -A "$NOSE_ARGS" pandas --with-coverage --with-xunit --xunit-file=/tmp/nosetests.xml
-    nosetests --exe -A "$NOSE_ARGS" pandas --with-coverage --cover-package=pandas --cover-tests --with-xunit --xunit-file=/tmp/nosetests.xml
-else
-    echo nosetests --exe -A "$NOSE_ARGS" pandas --doctest-tests --with-xunit --xunit-file=/tmp/nosetests.xml
-    nosetests --exe -A "$NOSE_ARGS" pandas --doctest-tests --with-xunit --xunit-file=/tmp/nosetests.xml
-fi
-
-RET="$?"
-
-exit "$RET"
diff --git a/ci/script_multi.sh b/ci/script_multi.sh
new file mode 100755
index 0000000000000..663d2feb5be23
--- /dev/null
+++ b/ci/script_multi.sh
@@ -0,0 +1,40 @@
+#!/bin/bash
+
+echo "[script multi]"
+
+source activate pandas
+
+if [ -n "$LOCALE_OVERRIDE" ]; then
+    export LC_ALL="$LOCALE_OVERRIDE";
+    echo "Setting LC_ALL to $LOCALE_OVERRIDE"
+
+    pycmd='import pandas; print("pandas detected console encoding: %s" % pandas.get_option("display.encoding"))'
+    python -c "$pycmd"
+fi
+
+# Workaround for pytest-xdist flaky collection order
+# https://github.com/pytest-dev/pytest/issues/920
+# https://github.com/pytest-dev/pytest/issues/1075
+export PYTHONHASHSEED=$(python -c 'import random; print(random.randint(1, 4294967295))')
+echo PYTHONHASHSEED=$PYTHONHASHSEED
+
+if [ "$BUILD_TEST" ]; then
+    echo "build-test"
+    cd /tmp
+    pwd
+    conda list pandas
+    echo "running"
+    python -c "import pandas; pandas.test(['-n 2'])"
+elif [ "$DOC" ]; then
+    echo "We are not running pytest as this is a doc-build"
+elif [ "$COVERAGE" ]; then
+    echo pytest -s -n 2 -m "not single" --cov=pandas --cov-report xml:/tmp/cov-multiple.xml --junitxml=/tmp/multiple.xml $TEST_ARGS pandas
+    pytest -s -n 2 -m "not single" --cov=pandas --cov-report xml:/tmp/cov-multiple.xml --junitxml=/tmp/multiple.xml $TEST_ARGS pandas
+else
+    echo pytest -n 2 -m "not single" --junitxml=/tmp/multiple.xml $TEST_ARGS pandas
+    pytest -n 2 -m "not single" --junitxml=/tmp/multiple.xml $TEST_ARGS pandas # TODO: doctest
+fi
+
+RET="$?"
+
+exit "$RET"
+ +exit "$RET" diff --git a/ci/show_circle.sh b/ci/show_circle.sh new file mode 100755 index 0000000000000..bfaa65c1d84f2 --- /dev/null +++ b/ci/show_circle.sh @@ -0,0 +1,8 @@ +#!/usr/bin/env bash + +echo "[installed versions]" + +export PATH="$MINICONDA_DIR/bin:$PATH" +source activate pandas + +python -c "import pandas; pandas.show_versions();" diff --git a/ci/speedpack/Vagrantfile b/ci/speedpack/Vagrantfile deleted file mode 100644 index ec939b7c0a937..0000000000000 --- a/ci/speedpack/Vagrantfile +++ /dev/null @@ -1,22 +0,0 @@ -# -*- mode: ruby -*- -# vi: set ft=ruby : -Vagrant.configure("2") do |config| - config.vm.box = "precise64" - config.vm.box_url = "http://files.vagrantup.com/precise64.box" - -# config.vbguest.auto_update = true -# config.vbguest.no_remote = true - - config.vm.synced_folder File.expand_path("..", Dir.pwd), "/reqf" - config.vm.synced_folder "wheelhouse", "/wheelhouse" - - config.vm.provider :virtualbox do |vb| - vb.customize ["modifyvm", :id, "--cpus", "4"] - vb.customize ["modifyvm", :id, "--memory", "2048"] - vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"] - vb.customize ["modifyvm", :id, "--natdnsproxy1", "on"] - end - - config.vm.provision :shell, :path => "build.sh" - -end diff --git a/ci/speedpack/build.sh b/ci/speedpack/build.sh deleted file mode 100755 index 330d8984ea7b7..0000000000000 --- a/ci/speedpack/build.sh +++ /dev/null @@ -1,117 +0,0 @@ -#!/bin/bash - -# This script is meant to run on a mint precise64 VM. -# The generated wheel files should be compatible -# with travis-ci as of 07/2013. -# -# Runtime can be up to an hour or more. - -echo "Building wheels..." - -# print a trace for everything; RTFM -set -x - -# install and update some basics -apt-get update -apt-get install python-software-properties git -y -apt-add-repository ppa:fkrull/deadsnakes -y -apt-get update - -# install some deps and virtualenv -apt-get install python-pip libfreetype6-dev libpng12-dev libhdf5-serial-dev \ - g++ libatlas-base-dev gfortran libreadline-dev zlib1g-dev flex bison \ - libxml2-dev libxslt-dev libssl-dev -y -pip install virtualenv -apt-get build-dep python-lxml -y - -# install sql servers -apt-get install postgresql-client libpq-dev -y - -export PYTHONIOENCODING='utf-8' -export VIRTUALENV_DISTRIBUTE=0 - -function create_fake_pandas() { - local site_pkg_dir="$1" - rm -rf $site_pkg_dir/pandas - mkdir $site_pkg_dir/pandas - touch $site_pkg_dir/pandas/__init__.py - echo "version = '0.10.0-phony'" > $site_pkg_dir/pandas/version.py -} - - -function get_site_pkgs_dir() { - python$1 -c 'import distutils; print(distutils.sysconfig.get_python_lib())' -} - - -function create_wheel() { - local pip_args="$1" - local wheelhouse="$2" - local n="$3" - local pyver="$4" - - local site_pkgs_dir="$(get_site_pkgs_dir $pyver)" - - - if [[ "$n" == *statsmodels* ]]; then - create_fake_pandas $site_pkgs_dir && \ - pip wheel $pip_args --wheel-dir=$wheelhouse $n && \ - pip install $pip_args --no-index $n && \ - rm -Rf $site_pkgs_dir - else - pip wheel $pip_args --wheel-dir=$wheelhouse $n - pip install $pip_args --no-index $n - fi -} - - -function generate_wheels() { - # get the requirements file - local reqfile="$1" - - # get the python version - local TAG=$(echo $reqfile | grep -Po "(\d\.?[\d\-](_\w+)?)") - - # base dir for wheel dirs - local WHEELSTREET=/wheelhouse - local WHEELHOUSE="$WHEELSTREET/$TAG" - - local PY_VER="${TAG:0:3}" - local PY_MAJOR="${PY_VER:0:1}" - local PIP_ARGS="--use-wheel --find-links=$WHEELHOUSE --download-cache /tmp" - - # install the python version 
if not installed - apt-get install python$PY_VER python$PY_VER-dev -y - - # create a new virtualenv - rm -Rf /tmp/venv - virtualenv -p python$PY_VER /tmp/venv - source /tmp/venv/bin/activate - - # install pip setuptools - pip install -I --download-cache /tmp 'git+https://github.com/pypa/pip@42102e9d#egg=pip' - pip install -I -U --download-cache /tmp setuptools - pip install -I --download-cache /tmp wheel - - # make the dir if it doesn't exist - mkdir -p $WHEELHOUSE - - # put the requirements file in the wheelhouse - cp $reqfile $WHEELHOUSE - - # install and build the wheels - cat $reqfile | while read N; do - create_wheel "$PIP_ARGS" "$WHEELHOUSE" "$N" "$PY_VER" - done -} - - -# generate a single wheel version -# generate_wheels "/reqf/requirements-2.7.txt" -# -# if vagrant is already up -# run as vagrant provision - -for reqfile in $(ls -1 /reqf/requirements-*.*); do - generate_wheels "$reqfile" -done diff --git a/ci/speedpack/nginx/nginx.conf.template b/ci/speedpack/nginx/nginx.conf.template deleted file mode 100644 index e2cfeaf053d08..0000000000000 --- a/ci/speedpack/nginx/nginx.conf.template +++ /dev/null @@ -1,48 +0,0 @@ -#user nobody; -worker_processes 1; - -#error_log logs/error.log; -#error_log logs/error.log notice; -#error_log logs/error.log info; - -#pid logs/nginx.pid; - - -events { - worker_connections 1024; -} - - -http { - include mime.types; - default_type application/octet-stream; - - #log_format main '$remote_addr - $remote_user [$time_local] "$request" ' - # '$status $body_bytes_sent "$http_referer" ' - # '"$http_user_agent" "$http_x_forwarded_for"'; - - #access_log logs/access.log on; - - sendfile on; - #tcp_nopush on; - - #keepalive_timeout 0; - keepalive_timeout 65; - - #gzip on; - - server { - listen $OPENSHIFT_IP:$OPENSHIFT_PORT; - - access_log access.log ; - sendfile on; - - location / { - root ../../app-root/data/store/; - autoindex on; - } - - - } - -} diff --git a/ci/submit_cython_cache.sh b/ci/submit_cython_cache.sh index 5c98c3df61736..b87acef0ba11c 100755 --- a/ci/submit_cython_cache.sh +++ b/ci/submit_cython_cache.sh @@ -2,14 +2,14 @@ CACHE_File="$HOME/.cache/cython_files.tar" PYX_CACHE_DIR="$HOME/.cache/pyxfiles" -pyx_file_list=`find ${TRAVIS_BUILD_DIR} -name "*.pyx" -o -name "*.pxd"` +pyx_file_list=`find ${TRAVIS_BUILD_DIR} -name "*.pyx" -o -name "*.pxd" -o -name "*.pxi.in"` rm -rf $CACHE_File rm -rf $PYX_CACHE_DIR home_dir=$(pwd) -mkdir $PYX_CACHE_DIR +mkdir -p $PYX_CACHE_DIR rsync -Rv $pyx_file_list $PYX_CACHE_DIR echo "pyx files:" diff --git a/ci/upload_coverage.sh b/ci/upload_coverage.sh new file mode 100755 index 0000000000000..a7ef2fa908079 --- /dev/null +++ b/ci/upload_coverage.sh @@ -0,0 +1,12 @@ +#!/bin/bash + +if [ -z "$COVERAGE" ]; then + echo "coverage is not selected for this build" + exit 0 +fi + +source activate pandas + +echo "uploading coverage" +bash <(curl -s https://codecov.io/bash) -Z -c -F single -f /tmp/cov-single.xml +bash <(curl -s https://codecov.io/bash) -Z -c -F multiple -f /tmp/cov-multiple.xml diff --git a/circle.yml b/circle.yml new file mode 100644 index 0000000000000..fa2da0680f388 --- /dev/null +++ b/circle.yml @@ -0,0 +1,38 @@ +machine: + environment: + # these are globally set + MINICONDA_DIR: /home/ubuntu/miniconda3 + + +database: + override: + - ./ci/install_db_circle.sh + + +checkout: + post: + # since circleci does a shallow fetch + # we need to populate our tags + - git fetch --depth=1000 + + +dependencies: + override: + - > + case $CIRCLE_NODE_INDEX in + 0) + sudo apt-get install language-pack-it && 
./ci/install_circle.sh JOB="2.7_COMPAT" LOCALE_OVERRIDE="it_IT.UTF-8" ;; + 1) + sudo apt-get install language-pack-zh-hans && ./ci/install_circle.sh JOB="3.4_SLOW" LOCALE_OVERRIDE="zh_CN.UTF-8" ;; + 2) + sudo apt-get install language-pack-zh-hans && ./ci/install_circle.sh JOB="3.4" LOCALE_OVERRIDE="zh_CN.UTF-8" ;; + 3) + ./ci/install_circle.sh JOB="3.5_ASCII" LOCALE_OVERRIDE="C" ;; + esac + - ./ci/show_circle.sh + + +test: + override: + - case $CIRCLE_NODE_INDEX in 0) ./ci/run_circle.sh --skip-slow --skip-network ;; 1) ./ci/run_circle.sh --only-slow --skip-network ;; 2) ./ci/run_circle.sh --skip-slow --skip-network ;; 3) ./ci/run_circle.sh --skip-slow --skip-network ;; esac: + parallel: true diff --git a/codecov.yml b/codecov.yml index 45a6040c6a50d..b4552563deeaa 100644 --- a/codecov.yml +++ b/codecov.yml @@ -1,3 +1,6 @@ +codecov: + branch: master + coverage: status: project: @@ -6,4 +9,3 @@ coverage: patch: default: target: '50' - branches: null diff --git a/doc/README.rst b/doc/README.rst index a3733846d9ed1..0ea3234dec348 100644 --- a/doc/README.rst +++ b/doc/README.rst @@ -81,7 +81,9 @@ have ``sphinx`` and ``ipython`` installed. `numpydoc `_ is used to parse the docstrings that follow the Numpy Docstring Standard (see above), but you don't need to install this because a local copy of ``numpydoc`` is included in the pandas source -code. +code. `nbsphinx `_ is used to convert +Jupyter notebooks. You will need to install it if you intend to modify any of +the notebooks included in the documentation. Furthermore, it is recommended to have all `optional dependencies `_ diff --git a/doc/_templates/api_redirect.html b/doc/_templates/api_redirect.html index 24bdd8363830f..c04a8b58ce544 100644 --- a/doc/_templates/api_redirect.html +++ b/doc/_templates/api_redirect.html @@ -1,15 +1,10 @@ -{% set pgn = pagename.split('.') -%} -{% if pgn[-2][0].isupper() -%} - {% set redirect = ["pandas", pgn[-2], pgn[-1], 'html']|join('.') -%} -{% else -%} - {% set redirect = ["pandas", pgn[-1], 'html']|join('.') -%} -{% endif -%} +{% set redirect = redirects[pagename.split("/")[-1]] %} - + This API page has moved -
 </head>
 <body>
-<p>This API page has moved <a href="{{ redirect }}">here</a>.</p>
-</body>
-</html>
\ No newline at end of file
+<p>This API page has moved <a href="{{ redirect }}.html">here</a>.</p>
+</body>
+</html>
diff --git a/doc/_templates/autosummary/accessor.rst b/doc/_templates/autosummary/accessor.rst
index 1401121fb51c6..4ba745cd6fdba 100644
--- a/doc/_templates/autosummary/accessor.rst
+++ b/doc/_templates/autosummary/accessor.rst
@@ -3,4 +3,4 @@

 .. currentmodule:: {{ module.split('.')[0] }}

-.. automethod:: {{ [module.split('.')[1], objname]|join('.') }}
+.. autoaccessor:: {{ (module.split('.')[1:] + [objname]) | join('.') }}
diff --git a/doc/_templates/autosummary/accessor_attribute.rst b/doc/_templates/autosummary/accessor_attribute.rst
index a2f0eb5e068c4..b5ad65d6a736f 100644
--- a/doc/_templates/autosummary/accessor_attribute.rst
+++ b/doc/_templates/autosummary/accessor_attribute.rst
@@ -3,4 +3,4 @@

 .. currentmodule:: {{ module.split('.')[0] }}

-.. autoaccessorattribute:: {{ [module.split('.')[1], objname]|join('.') }}
+.. autoaccessorattribute:: {{ (module.split('.')[1:] + [objname]) | join('.') }}
diff --git a/doc/_templates/autosummary/accessor_callable.rst b/doc/_templates/autosummary/accessor_callable.rst
index 6f45e0fd01e16..7a3301814f5f4 100644
--- a/doc/_templates/autosummary/accessor_callable.rst
+++ b/doc/_templates/autosummary/accessor_callable.rst
@@ -3,4 +3,4 @@

 .. currentmodule:: {{ module.split('.')[0] }}

-.. autoaccessorcallable:: {{ [module.split('.')[1], objname]|join('.') }}.__call__
+.. autoaccessorcallable:: {{ (module.split('.')[1:] + [objname]) | join('.') }}.__call__
diff --git a/doc/_templates/autosummary/accessor_method.rst b/doc/_templates/autosummary/accessor_method.rst
index 43dfc3b813120..aefbba6ef1bbc 100644
--- a/doc/_templates/autosummary/accessor_method.rst
+++ b/doc/_templates/autosummary/accessor_method.rst
@@ -3,4 +3,4 @@

 .. currentmodule:: {{ module.split('.')[0] }}

-.. autoaccessormethod:: {{ [module.split('.')[1], objname]|join('.') }}
+.. autoaccessormethod:: {{ (module.split('.')[1:] + [objname]) | join('.') }}
diff --git a/doc/cheatsheet/Pandas_Cheat_Sheet.pdf b/doc/cheatsheet/Pandas_Cheat_Sheet.pdf
index a2b222c683564..0492805a1408b 100644
Binary files a/doc/cheatsheet/Pandas_Cheat_Sheet.pdf and b/doc/cheatsheet/Pandas_Cheat_Sheet.pdf differ
diff --git a/doc/cheatsheet/Pandas_Cheat_Sheet.pptx b/doc/cheatsheet/Pandas_Cheat_Sheet.pptx
index 5202256006ddf..6cca9ac4647f7 100644
Binary files a/doc/cheatsheet/Pandas_Cheat_Sheet.pptx and b/doc/cheatsheet/Pandas_Cheat_Sheet.pptx differ
diff --git a/doc/cheatsheet/README.txt b/doc/cheatsheet/README.txt
index e2f6ec042e9cc..d32fe5bcd05a6 100644
--- a/doc/cheatsheet/README.txt
+++ b/doc/cheatsheet/README.txt
@@ -2,3 +2,7 @@
 The Pandas Cheat Sheet was created using Microsoft Powerpoint 2013.
 To create the PDF version, within Powerpoint, simply do a "Save As"
 and pick "PDF" as the format.
+This cheat sheet, written by Irv Lustig, Princeton Consultants[2], was inspired by the RStudio Data Wrangling Cheatsheet[1].
+ +[1]: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf +[2]: http://www.princetonoptimization.com/ diff --git a/doc/make.py b/doc/make.py index d46be2611ce3d..e70655c3e2f92 100755 --- a/doc/make.py +++ b/doc/make.py @@ -106,106 +106,42 @@ def clean(): @contextmanager -def cleanup_nb(nb): - try: - yield - finally: - try: - os.remove(nb + '.executed') - except OSError: - pass - - -def get_kernel(): - """Find the kernel name for your python version""" - return 'python%s' % sys.version_info.major - - -def execute_nb(src, dst, allow_errors=False, timeout=1000, kernel_name=''): - """ - Execute notebook in `src` and write the output to `dst` - - Parameters - ---------- - src, dst: str - path to notebook - allow_errors: bool - timeout: int - kernel_name: str - defualts to value set in notebook metadata - - Returns - ------- - dst: str - """ - import nbformat - from nbconvert.preprocessors import ExecutePreprocessor - - with io.open(src, encoding='utf-8') as f: - nb = nbformat.read(f, as_version=4) - - ep = ExecutePreprocessor(allow_errors=allow_errors, - timeout=timeout, - kernel_name=kernel_name) - ep.preprocess(nb, resources={}) - - with io.open(dst, 'wt', encoding='utf-8') as f: - nbformat.write(nb, f) - return dst - - -def convert_nb(src, dst, to='html', template_file='basic'): +def maybe_exclude_notebooks(): """ - Convert a notebook `src`. - - Parameters - ---------- - src, dst: str - filepaths - to: {'rst', 'html'} - format to export to - template_file: str - name of template file to use. Default 'basic' + Skip building the notebooks if pandoc is not installed. + This assumes that nbsphinx is installed. """ - from nbconvert import HTMLExporter, RSTExporter - - dispatch = {'rst': RSTExporter, 'html': HTMLExporter} - exporter = dispatch[to.lower()](template_file=template_file) - - (body, resources) = exporter.from_filename(src) - with io.open(dst, 'wt', encoding='utf-8') as f: - f.write(body) - return dst + base = os.path.dirname(__file__) + notebooks = [os.path.join(base, 'source', nb) + for nb in ['style.ipynb']] + contents = {} + try: + import nbconvert + nbconvert.utils.pandoc.get_pandoc_version() + except (ImportError, nbconvert.utils.pandoc.PandocMissing): + print("Warning: Pandoc is not installed. Skipping Notebooks.") + for nb in notebooks: + with open(nb, 'rt') as f: + contents[nb] = f.read() + os.remove(nb) + yield + for nb, content in contents.items(): + with open(nb, 'wt') as f: + f.write(content) def html(): check_build() - notebooks = [ - 'source/html-styling.ipynb', - ] - - for nb in notebooks: - with cleanup_nb(nb): - try: - print("Converting %s" % nb) - kernel_name = get_kernel() - executed = execute_nb(nb, nb + '.executed', allow_errors=True, - kernel_name=kernel_name) - convert_nb(executed, nb.rstrip('.ipynb') + '.html') - except (ImportError, IndexError) as e: - print(e) - print("Failed to convert %s" % nb) - - if os.system('sphinx-build -P -b html -d build/doctrees ' - 'source build/html'): - raise SystemExit("Building HTML failed.") - try: - # remove stale file - os.system('rm source/html-styling.html') - os.system('cd build; rm -f html/pandas.zip;') - except: - pass + with maybe_exclude_notebooks(): + if os.system('sphinx-build -P -b html -d build/doctrees ' + 'source build/html'): + raise SystemExit("Building HTML failed.") + try: + # remove stale file + os.remove('build/html/pandas.zip') + except: + pass def zip_html(): @@ -222,7 +158,7 @@ def latex(): check_build() if sys.platform != 'win32': # LaTeX format. 
- if os.system('sphinx-build -b latex -d build/doctrees ' + if os.system('sphinx-build -j 2 -b latex -d build/doctrees ' 'source build/latex'): raise SystemExit("Building LaTeX failed.") # Produce pdf. @@ -245,7 +181,7 @@ def latex_forced(): check_build() if sys.platform != 'win32': # LaTeX format. - if os.system('sphinx-build -b latex -d build/doctrees ' + if os.system('sphinx-build -j 2 -b latex -d build/doctrees ' 'source build/latex'): raise SystemExit("Building LaTeX failed.") # Produce pdf. diff --git a/doc/source/10min.rst b/doc/source/10min.rst index 54bcd76855f32..8482eef552c17 100644 --- a/doc/source/10min.rst +++ b/doc/source/10min.rst @@ -84,29 +84,28 @@ will be completed: @verbatim In [1]: df2. - df2.A df2.boxplot - df2.abs df2.C - df2.add df2.clip - df2.add_prefix df2.clip_lower - df2.add_suffix df2.clip_upper - df2.align df2.columns - df2.all df2.combine - df2.any df2.combineAdd + df2.A df2.bool + df2.abs df2.boxplot + df2.add df2.C + df2.add_prefix df2.clip + df2.add_suffix df2.clip_lower + df2.align df2.clip_upper + df2.all df2.columns + df2.any df2.combine df2.append df2.combine_first - df2.apply df2.combineMult - df2.applymap df2.compound - df2.as_blocks df2.consolidate - df2.asfreq df2.convert_objects - df2.as_matrix df2.copy - df2.astype df2.corr - df2.at df2.corrwith - df2.at_time df2.count - df2.axes df2.cov - df2.B df2.cummax - df2.between_time df2.cummin - df2.bfill df2.cumprod - df2.blocks df2.cumsum - df2.bool df2.D + df2.apply df2.compound + df2.applymap df2.consolidate + df2.as_blocks df2.convert_objects + df2.asfreq df2.copy + df2.as_matrix df2.corr + df2.astype df2.corrwith + df2.at df2.count + df2.at_time df2.cov + df2.axes df2.cummax + df2.B df2.cummin + df2.between_time df2.cumprod + df2.bfill df2.cumsum + df2.blocks df2.D As you can see, the columns ``A``, ``B``, ``C``, and ``D`` are automatically tab completed. ``E`` is there as well; the rest of the attributes have been @@ -282,7 +281,7 @@ Using a single column's values to select data. df[df.A > 0] -A ``where`` operation for getting. +Selecting values from a DataFrame where a boolean condition is met. .. ipython:: python diff --git a/doc/source/_static/ci.png b/doc/source/_static/ci.png new file mode 100644 index 0000000000000..4570ed2155586 Binary files /dev/null and b/doc/source/_static/ci.png differ diff --git a/doc/source/_static/style-excel.png b/doc/source/_static/style-excel.png new file mode 100644 index 0000000000000..f946949e8bcf9 Binary files /dev/null and b/doc/source/_static/style-excel.png differ diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst index 0c843dd39b56f..ea00588ba156f 100644 --- a/doc/source/advanced.rst +++ b/doc/source/advanced.rst @@ -46,7 +46,7 @@ data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d). In this section, we will show what exactly we mean by "hierarchical" indexing -and how it integrates with the all of the pandas indexing functionality +and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing :ref:`group by ` and :ref:`pivoting and reshaping data `, we'll show non-trivial applications to illustrate how it aids in structuring data for @@ -59,7 +59,7 @@ Creating a MultiIndex (hierarchical index) object The ``MultiIndex`` object is the hierarchical analogue of the standard ``Index`` object which typically stores the axis labels in pandas objects. 
You -can think of ``MultiIndex`` an array of tuples where each tuple is unique. A +can think of ``MultiIndex`` as an array of tuples where each tuple is unique. A ``MultiIndex`` can be created from a list of arrays (using ``MultiIndex.from_arrays``), an array of tuples (using ``MultiIndex.from_tuples``), or a crossed set of iterables (using @@ -136,7 +136,7 @@ can find yourself working with hierarchically-indexed data without creating a may wish to generate your own ``MultiIndex`` when preparing the data set. Note that how the index is displayed by be controlled using the -``multi_sparse`` option in ``pandas.set_printoptions``: +``multi_sparse`` option in ``pandas.set_options()``: .. ipython:: python @@ -175,35 +175,40 @@ completely analogous way to selecting a column in a regular DataFrame: See :ref:`Cross-section with hierarchical index ` for how to select on a deeper level. -.. note:: +.. _advanced.shown_levels: + +Defined Levels +~~~~~~~~~~~~~~ + +The repr of a ``MultiIndex`` shows ALL the defined levels of an index, even +if the they are not actually used. When slicing an index, you may notice this. +For example: - The repr of a ``MultiIndex`` shows ALL the defined levels of an index, even - if the they are not actually used. When slicing an index, you may notice this. - For example: +.. ipython:: python - .. ipython:: python + # original multi-index + df.columns - # original multi-index - df.columns + # sliced + df[['foo','qux']].columns - # sliced - df[['foo','qux']].columns +This is done to avoid a recomputation of the levels in order to make slicing +highly performant. If you want to see the actual used levels. - This is done to avoid a recomputation of the levels in order to make slicing - highly performant. If you want to see the actual used levels. +.. ipython:: python - .. ipython:: python + df[['foo','qux']].columns.values - df[['foo','qux']].columns.values + # for a specific level + df[['foo','qux']].columns.get_level_values(0) - # for a specific level - df[['foo','qux']].columns.get_level_values(0) +To reconstruct the multiindex with only the used levels - To reconstruct the multiindex with only the used levels +.. versionadded:: 0.20.0 - .. ipython:: python +.. ipython:: python - pd.MultiIndex.from_tuples(df[['foo','qux']].columns.values) + df[['foo','qux']].columns.remove_unused_levels() Data alignment and using ``reindex`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -230,7 +235,7 @@ of tuples: Advanced indexing with hierarchical index ----------------------------------------- -Syntactically integrating ``MultiIndex`` in advanced indexing with ``.loc/.ix`` is a +Syntactically integrating ``MultiIndex`` in advanced indexing with ``.loc`` is a bit challenging, but we've made every effort to do so. for example the following works as you would expect: @@ -258,7 +263,7 @@ Passing a list of labels or tuples works similar to reindexing: .. ipython:: python - df.ix[[('bar', 'two'), ('qux', 'one')]] + df.loc[[('bar', 'two'), ('qux', 'one')]] .. _advanced.mi_slicers: @@ -288,7 +293,7 @@ As usual, **both sides** of the slicers are included as this is label indexing. .. code-block:: python - df.loc[(slice('A1','A3'),.....),:] + df.loc[(slice('A1','A3'),.....), :] rather than this: @@ -317,43 +322,43 @@ Basic multi-index slicing using slices, lists, and labels. .. 
ipython:: python - dfmi.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:] + dfmi.loc[(slice('A1','A3'), slice(None), ['C1', 'C3']), :] You can use a ``pd.IndexSlice`` to have a more natural syntax using ``:`` rather than using ``slice(None)`` .. ipython:: python idx = pd.IndexSlice - dfmi.loc[idx[:,:,['C1','C3']],idx[:,'foo']] + dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']] It is possible to perform quite complicated selections using this method on multiple axes at the same time. .. ipython:: python - dfmi.loc['A1',(slice(None),'foo')] - dfmi.loc[idx[:,:,['C1','C3']],idx[:,'foo']] + dfmi.loc['A1', (slice(None), 'foo')] + dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']] Using a boolean indexer you can provide selection related to the *values*. .. ipython:: python - mask = dfmi[('a','foo')]>200 - dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']] + mask = dfmi[('a', 'foo')] > 200 + dfmi.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']] You can also specify the ``axis`` argument to ``.loc`` to interpret the passed slicers on a single axis. .. ipython:: python - dfmi.loc(axis=0)[:,:,['C1','C3']] + dfmi.loc(axis=0)[:, :, ['C1', 'C3']] Furthermore you can *set* the values using these methods .. ipython:: python df2 = dfmi.copy() - df2.loc(axis=0)[:,:,['C1','C3']] = -10 + df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10 df2 You can use a right-hand-side of an alignable object as well. @@ -361,7 +366,7 @@ You can use a right-hand-side of an alignable object as well. .. ipython:: python df2 = dfmi.copy() - df2.loc[idx[:,:,['C1','C3']],:] = df2*1000 + df2.loc[idx[:, :, ['C1', 'C3']], :] = df2 * 1000 df2 .. _advanced.xs: @@ -528,12 +533,14 @@ return a copy of the data rather than a view: jim joe 1 z 0.64094 +.. _advanced.unsorted: + Furthermore if you try to index something that is not fully lexsorted, this can raise: .. code-block:: ipython In [5]: dfm.loc[(0,'y'):(1, 'z')] - KeyError: 'Key length (2) was greater than MultiIndex lexsort depth (1)' + UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)' The ``is_lexsorted()`` method on an ``Index`` show if the index is sorted, and the ``lexsort_depth`` property returns the sort depth: @@ -602,7 +609,7 @@ intended to work on boolean indices and may return unexpected results. ser = pd.Series(np.random.randn(10)) ser.take([False, False, True, True]) - ser.ix[[0, 1]] + ser.iloc[[0, 1]] Finally, as a small note on performance, because the ``take`` method handles a narrower range of inputs, it can offer performance that is a good deal @@ -618,7 +625,7 @@ faster than fancy indexing. timeit arr.take(indexer, axis=0) ser = pd.Series(arr[:, 0]) - timeit ser.ix[indexer] + timeit ser.iloc[indexer] timeit ser.take(indexer) .. _indexing.index_types: @@ -659,7 +666,7 @@ Setting the index, will create create a ``CategoricalIndex`` df2 = df.set_index('B') df2.index -Indexing with ``__getitem__/.iloc/.loc/.ix`` works similarly to an ``Index`` with duplicates. +Indexing with ``__getitem__/.iloc/.loc`` works similarly to an ``Index`` with duplicates. The indexers MUST be in the category or the operation will raise. .. ipython:: python @@ -757,14 +764,12 @@ same. sf = pd.Series(range(5), index=indexf) sf -Scalar selection for ``[],.ix,.loc`` will always be label based. An integer will match an equal float index (e.g. ``3`` is equivalent to ``3.0``) +Scalar selection for ``[],.loc`` will always be label based. An integer will match an equal float index (e.g. ``3`` is equivalent to ``3.0``) .. 
ipython:: python sf[3] sf[3.0] - sf.ix[3] - sf.ix[3.0] sf.loc[3] sf.loc[3.0] @@ -781,7 +786,6 @@ Slicing is ALWAYS on the values of the index, for ``[],ix,loc`` and ALWAYS posit .. ipython:: python sf[2:4] - sf.ix[2:4] sf.loc[2:4] sf.iloc[2:4] @@ -811,14 +815,6 @@ In non-float indexes, slicing using floats will raise a ``TypeError`` In [3]: pd.Series(range(5)).iloc[3.0] TypeError: cannot do positional indexing on with these indexers [3.0] of - Further the treatment of ``.ix`` with a float indexer on a non-float index, will be label based, and thus coerce the index. - - .. ipython:: python - - s2 = pd.Series([1, 2, 3], index=list('abc')) - s2 - s2.ix[1.0] = 10 - s2 Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could for @@ -853,3 +849,170 @@ Of course if you need integer based selection, then use ``iloc`` .. ipython:: python dfir.iloc[0:5] + +.. _indexing.intervallindex: + +IntervalIndex +~~~~~~~~~~~~~ + +.. versionadded:: 0.20.0 + +.. warning:: + + These indexing behaviors are provisional and may change in a future version of pandas. + +.. ipython:: python + + df = pd.DataFrame({'A': [1, 2, 3, 4]}, + index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])) + df + +Label based indexing via ``.loc`` along the edges of an interval works as you would expect, +selecting that particular interval. + +.. ipython:: python + + df.loc[2] + df.loc[[2, 3]] + +If you select a lable *contained* within an interval, this will also select the interval. + +.. ipython:: python + + df.loc[2.5] + df.loc[[2.5, 3.5]] + + +Miscellaneous indexing FAQ +-------------------------- + +Integer indexing +~~~~~~~~~~~~~~~~ + +Label-based indexing with integer axis labels is a thorny topic. It has been +discussed heavily on mailing lists and among various members of the scientific +Python community. In pandas, our general viewpoint is that labels matter more +than integer locations. Therefore, with an integer axis index *only* +label-based indexing is possible with the standard tools like ``.loc``. The +following code will generate exceptions: + +.. code-block:: python + + s = pd.Series(range(5)) + s[-1] + df = pd.DataFrame(np.random.randn(5, 4)) + df + df.loc[-2:] + +This deliberate decision was made to prevent ambiguities and subtle bugs (many +users reported finding bugs when the API change was made to stop "falling back" +on position-based indexing). + +Non-monotonic indexes require exact matches +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If the index of a ``Series`` or ``DataFrame`` is monotonically increasing or decreasing, then the bounds +of a label-based slice can be outside the range of the index, much like slice indexing a +normal Python ``list``. Monotonicity of an index can be tested with the ``is_monotonic_increasing`` and +``is_monotonic_decreasing`` attributes. + +.. ipython:: python + + df = pd.DataFrame(index=[2,3,3,4,5], columns=['data'], data=list(range(5))) + df.index.is_monotonic_increasing + + # no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4: + df.loc[0:4, :] + + # slice is are outside the index, so empty DataFrame is returned + df.loc[13:15, :] + +On the other hand, if the index is not monotonic, then both slice bounds must be +*unique* members of the index. + +.. 
ipython:: python + + df = pd.DataFrame(index=[2,3,1,4,3,5], columns=['data'], data=list(range(6))) + df.index.is_monotonic_increasing + + # OK because 2 and 4 are in the index + df.loc[2:4, :] + +.. code-block:: python + + # 0 is not in the index + In [9]: df.loc[0:4, :] + KeyError: 0 + + # 3 is not a unique label + In [11]: df.loc[2:3, :] + KeyError: 'Cannot get right slice bound for non-unique label: 3' + + +Endpoints are inclusive +~~~~~~~~~~~~~~~~~~~~~~~ + +Compared with standard Python sequence slicing in which the slice endpoint is +not inclusive, label-based slicing in pandas **is inclusive**. The primary +reason for this is that it is often not possible to easily determine the +"successor" or next element after a particular label in an index. For example, +consider the following Series: + +.. ipython:: python + + s = pd.Series(np.random.randn(6), index=list('abcdef')) + s + +Suppose we wished to slice from ``c`` to ``e``, using integers this would be + +.. ipython:: python + + s[2:5] + +However, if you only had ``c`` and ``e``, determining the next element in the +index can be somewhat complicated. For example, the following does not work: + +:: + + s.loc['c':'e'+1] + +A very common use case is to limit a time series to start and end at two +specific dates. To enable this, we made the design design to make label-based +slicing include both endpoints: + +.. ipython:: python + + s.loc['c':'e'] + +This is most definitely a "practicality beats purity" sort of thing, but it is +something to watch out for if you expect label-based slicing to behave exactly +in the way that standard Python integer slicing works. + + +Indexing potentially changes underlying Series dtype +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The different indexing operation can potentially change the dtype of a ``Series``. + +.. ipython:: python + + series1 = pd.Series([1, 2, 3]) + series1.dtype + res = series1[[0,4]] + res.dtype + res + +.. ipython:: python + + series2 = pd.Series([True]) + series2.dtype + res = series2.reindex_like(series1) + res.dtype + res + +This is because the (re)indexing operations above silently inserts ``NaNs`` and the ``dtype`` +changes accordingly. This can cause some issues when using ``numpy`` ``ufuncs`` +such as ``numpy.logical_and``. + +See the `this old issue `__ for a more +detailed discussion. diff --git a/doc/source/api.rst b/doc/source/api.rst index 638abd5421862..c652573bc6677 100644 --- a/doc/source/api.rst +++ b/doc/source/api.rst @@ -5,6 +5,22 @@ API Reference ************* +This page gives an overview of all public pandas objects, functions and +methods. In general, all classes and functions exposed in the top-level +``pandas.*`` namespace are regarded as public. + +Further some of the subpackages are public, including ``pandas.errors``, +``pandas.plotting``, and ``pandas.testing``. Certain functions in the the +``pandas.io`` and ``pandas.tseries`` submodules are public as well (those +mentioned in the documentation). Further, the ``pandas.api.types`` subpackage +holds some public functions related to data types in pandas. + + +.. warning:: + + The ``pandas.core``, ``pandas.compat``, and ``pandas.util`` top-level modules are considered to be PRIVATE. Stability of functionality in those modules in not guaranteed. + + .. _api.functions: Input/Output @@ -60,6 +76,7 @@ JSON :toctree: generated/ json_normalize + build_table_schema .. currentmodule:: pandas @@ -83,6 +100,14 @@ HDFStore: PyTables (HDF5) HDFStore.get HDFStore.select +Feather +~~~~~~~ + +.. 
autosummary:: + :toctree: generated/ + + read_feather + SAS ~~~ @@ -109,7 +134,6 @@ Google BigQuery :toctree: generated/ read_gbq - to_gbq .. currentmodule:: pandas @@ -157,6 +181,8 @@ Data manipulations concat get_dummies factorize + unique + wide_to_long Top-level missing data ~~~~~~~~~~~~~~~~~~~~~~ @@ -259,13 +285,12 @@ Indexing, iteration Series.get Series.at Series.iat - Series.ix Series.loc Series.iloc Series.__iter__ Series.iteritems -For more information on ``.at``, ``.iat``, ``.ix``, ``.loc``, and +For more information on ``.at``, ``.iat``, ``.loc``, and ``.iloc``, see the :ref:`indexing documentation `. Binary operator functions @@ -305,6 +330,8 @@ Function application, GroupBy & Window :toctree: generated/ Series.apply + Series.aggregate + Series.transform Series.map Series.groupby Series.rolling @@ -409,7 +436,6 @@ Reshaping, sorting Series.reorder_levels Series.sort_values Series.sort_index - Series.sortlevel Series.swaplevel Series.unstack Series.searchsorted @@ -592,7 +618,6 @@ strings and apply several methods to it. These can be accessed like Series.cat Series.dt Index.str - CategoricalIndex.str MultiIndex.str DatetimeIndex.str TimedeltaIndex.str @@ -645,7 +670,7 @@ adding ordering information or special categories is need at creation time of th Categorical.from_codes ``np.asarray(categorical)`` works by implementing the array interface. Be aware, that this converts -the Categorical back to a numpy array, so levels and order information is not preserved! +the Categorical back to a numpy array, so categories and order information is not preserved! .. autosummary:: :toctree: generated/ @@ -692,6 +717,7 @@ Serialization / IO / Conversion Series.to_pickle Series.to_csv Series.to_dict + Series.to_excel Series.to_frame Series.to_xarray Series.to_hdf @@ -703,8 +729,8 @@ Serialization / IO / Conversion Series.to_string Series.to_clipboard -Sparse methods -~~~~~~~~~~~~~~ +Sparse +~~~~~~ .. autosummary:: :toctree: generated/ @@ -765,7 +791,6 @@ Indexing, iteration DataFrame.head DataFrame.at DataFrame.iat - DataFrame.ix DataFrame.loc DataFrame.iloc DataFrame.insert @@ -782,7 +807,7 @@ Indexing, iteration DataFrame.mask DataFrame.query -For more information on ``.at``, ``.iat``, ``.ix``, ``.loc``, and +For more information on ``.at``, ``.iat``, ``.loc``, and ``.iloc``, see the :ref:`indexing documentation `. @@ -823,6 +848,8 @@ Function application, GroupBy & Window DataFrame.apply DataFrame.applymap + DataFrame.aggregate + DataFrame.transform DataFrame.groupby DataFrame.rolling DataFrame.expanding @@ -921,12 +948,12 @@ Reshaping, sorting, transposing DataFrame.reorder_levels DataFrame.sort_values DataFrame.sort_index - DataFrame.sortlevel DataFrame.nlargest DataFrame.nsmallest DataFrame.swaplevel DataFrame.stack DataFrame.unstack + DataFrame.melt DataFrame.T DataFrame.to_panel DataFrame.to_xarray @@ -1013,6 +1040,7 @@ Serialization / IO / Conversion DataFrame.to_excel DataFrame.to_json DataFrame.to_html + DataFrame.to_feather DataFrame.to_latex DataFrame.to_stata DataFrame.to_msgpack @@ -1023,6 +1051,13 @@ Serialization / IO / Conversion DataFrame.to_string DataFrame.to_clipboard +Sparse +~~~~~~ +.. autosummary:: + :toctree: generated/ + + SparseDataFrame.to_coo + .. 
_api.panel: Panel @@ -1081,7 +1116,6 @@ Indexing, iteration, slicing Panel.at Panel.iat - Panel.ix Panel.loc Panel.iloc Panel.__iter__ @@ -1091,7 +1125,7 @@ Indexing, iteration, slicing Panel.major_xs Panel.minor_xs -For more information on ``.at``, ``.iat``, ``.ix``, ``.loc``, and +For more information on ``.at``, ``.iat``, ``.loc``, and ``.iloc``, see the :ref:`indexing documentation `. Binary operator functions @@ -1231,57 +1265,6 @@ Serialization / IO / Conversion Panel.to_xarray Panel.to_clipboard -.. _api.panel4d: - -Panel4D -------- - -Constructor -~~~~~~~~~~~ -.. autosummary:: - :toctree: generated/ - - Panel4D - -Serialization / IO / Conversion -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. autosummary:: - :toctree: generated/ - - Panel4D.to_xarray - -Attributes and underlying data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -**Axes** - - * **labels**: axis 1; each label corresponds to a Panel contained inside - * **items**: axis 2; each item corresponds to a DataFrame contained inside - * **major_axis**: axis 3; the index (rows) of each of the DataFrames - * **minor_axis**: axis 4; the columns of each of the DataFrames - -.. autosummary:: - :toctree: generated/ - - Panel4D.values - Panel4D.axes - Panel4D.ndim - Panel4D.size - Panel4D.shape - Panel4D.dtypes - Panel4D.ftypes - Panel4D.get_dtype_counts - Panel4D.get_ftype_counts - -Conversion -~~~~~~~~~~ -.. autosummary:: - :toctree: generated/ - - Panel4D.astype - Panel4D.copy - Panel4D.isnull - Panel4D.notnull - .. _api.index: Index @@ -1315,6 +1298,7 @@ Attributes Index.nbytes Index.ndim Index.size + Index.empty Index.strides Index.itemsize Index.base @@ -1350,8 +1334,16 @@ Modifying and Computations Index.unique Index.nunique Index.value_counts + +Missing Values +~~~~~~~~~~~~~~ +.. autosummary:: + :toctree: generated/ + Index.fillna Index.dropna + Index.isnull + Index.notnull Conversion ~~~~~~~~~~ @@ -1411,6 +1403,7 @@ CategoricalIndex .. autosummary:: :toctree: generated/ + :template: autosummary/class_without_autosummary.rst CategoricalIndex @@ -1432,6 +1425,28 @@ Categorical Components CategoricalIndex.as_ordered CategoricalIndex.as_unordered +.. _api.intervalindex: + +IntervalIndex +------------- + +.. autosummary:: + :toctree: generated/ + :template: autosummary/class_without_autosummary.rst + + IntervalIndex + +IntervalIndex Components +~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autosummary:: + :toctree: generated/ + + IntervalIndex.from_arrays + IntervalIndex.from_tuples + IntervalIndex.from_breaks + IntervalIndex.from_intervals + .. _api.multiindex: MultiIndex @@ -1441,6 +1456,7 @@ MultiIndex :toctree: generated/ MultiIndex + IndexSlice MultiIndex Components ~~~~~~~~~~~~~~~~~~~~~~ @@ -1454,10 +1470,12 @@ MultiIndex Components MultiIndex.set_levels MultiIndex.set_labels MultiIndex.to_hierarchical + MultiIndex.to_frame MultiIndex.is_lexsorted MultiIndex.droplevel MultiIndex.swaplevel MultiIndex.reorder_levels + MultiIndex.remove_unused_levels .. _api.datetimeindex: @@ -1760,7 +1778,7 @@ The following methods are available only for ``DataFrameGroupBy`` objects. Resampling ---------- -.. currentmodule:: pandas.tseries.resample +.. currentmodule:: pandas.core.resample Resampler objects are returned by resample calls: :func:`pandas.DataFrame.resample`, :func:`pandas.Series.resample`. @@ -1820,7 +1838,7 @@ Computations / Descriptive Stats Style ----- -.. currentmodule:: pandas.formats.style +.. currentmodule:: pandas.io.formats.style ``Styler`` objects are returned by :attr:`pandas.DataFrame.style`. 
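For reference, a minimal sketch of the ``Styler`` workflow this entry refers to; the frame and variable names below are a throwaway illustration, not an example taken from the pandas docs:

.. code-block:: python

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(4, 2), columns=['A', 'B'])
    styled = df.style.highlight_max(axis=0)  # highlight each column's max; returns a Styler, not a DataFrame
    html = styled.render()                   # render the styled table to an HTML string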
@@ -1885,3 +1903,97 @@ Working with options get_option set_option option_context + +Testing functions +~~~~~~~~~~~~~~~~~ + +.. autosummary:: + :toctree: generated/ + + testing.assert_frame_equal + testing.assert_series_equal + testing.assert_index_equal + + +Exceptions and warnings +~~~~~~~~~~~~~~~~~~~~~~~ + +.. autosummary:: + :toctree: generated/ + + errors.DtypeWarning + errors.EmptyDataError + errors.OutOfBoundsDatetime + errors.ParserError + errors.ParserWarning + errors.PerformanceWarning + errors.UnsortedIndexError + errors.UnsupportedFunctionCall + + +Data types related functionality +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autosummary:: + :toctree: generated/ + + api.types.union_categoricals + api.types.infer_dtype + api.types.pandas_dtype + +Dtype introspection + +.. autosummary:: + :toctree: generated/ + + api.types.is_bool_dtype + api.types.is_categorical_dtype + api.types.is_complex_dtype + api.types.is_datetime64_any_dtype + api.types.is_datetime64_dtype + api.types.is_datetime64_ns_dtype + api.types.is_datetime64tz_dtype + api.types.is_extension_type + api.types.is_float_dtype + api.types.is_int64_dtype + api.types.is_integer_dtype + api.types.is_interval_dtype + api.types.is_numeric_dtype + api.types.is_object_dtype + api.types.is_period_dtype + api.types.is_signed_integer_dtype + api.types.is_string_dtype + api.types.is_timedelta64_dtype + api.types.is_timedelta64_ns_dtype + api.types.is_unsigned_integer_dtype + api.types.is_sparse + +Iterable introspection + +.. autosummary:: + :toctree: generated/ + + api.types.is_dict_like + api.types.is_file_like + api.types.is_list_like + api.types.is_named_tuple + api.types.is_iterator + +Scalar introspection + +.. autosummary:: + :toctree: generated/ + + api.types.is_bool + api.types.is_categorical + api.types.is_complex + api.types.is_datetimetz + api.types.is_float + api.types.is_hashable + api.types.is_integer + api.types.is_interval + api.types.is_number + api.types.is_period + api.types.is_re + api.types.is_re_compilable + api.types.is_scalar diff --git a/doc/source/basics.rst b/doc/source/basics.rst index e5aa6b577270a..134cc5106015b 100644 --- a/doc/source/basics.rst +++ b/doc/source/basics.rst @@ -93,7 +93,7 @@ Accelerated operations ---------------------- pandas has support for accelerating certain types of binary numerical and boolean operations using -the ``numexpr`` library (starting in 0.11.0) and the ``bottleneck`` libraries. +the ``numexpr`` library and the ``bottleneck`` libraries. These libraries are especially useful when dealing with large data sets, and provide large speedups. ``numexpr`` uses smart chunking, caching, and multiple cores. ``bottleneck`` is @@ -114,6 +114,15 @@ Here is a sample (using 100 column x 100,000 row ``DataFrames``): You are highly encouraged to install both libraries. See the section :ref:`Recommended Dependencies ` for more installation info. +These are both enabled to be used by default, you can control this by setting the options: + +.. versionadded:: 0.20.0 + +.. code-block:: python + + pd.set_option('compute.use_bottleneck', False) + pd.set_option('compute.use_numexpr', False) + .. 
_basics.binop: Flexible binary operations @@ -145,7 +154,7 @@ either match on the *index* or *columns* via the **axis** keyword: 'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']), 'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])}) df - row = df.ix[1] + row = df.iloc[1] column = df['two'] df.sub(row, axis='columns') @@ -486,7 +495,9 @@ standard deviation 1), very concisely: xs_stand.std(1) Note that methods like :meth:`~DataFrame.cumsum` and :meth:`~DataFrame.cumprod` -preserve the location of NA values: +preserve the location of ``NaN`` values. This is somewhat different from +:meth:`~DataFrame.expanding` and :meth:`~DataFrame.rolling`. +For more details please see :ref:`this note `. .. ipython:: python @@ -554,7 +565,7 @@ course): series[::2] = np.nan series.describe() frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e']) - frame.ix[::2] = np.nan + frame.iloc[::2] = np.nan frame.describe() You can select specific percentiles to include in the output: @@ -700,7 +711,8 @@ on an entire ``DataFrame`` or ``Series``, row- or column-wise, or elementwise. 1. `Tablewise Function Application`_: :meth:`~DataFrame.pipe` 2. `Row or Column-wise Function Application`_: :meth:`~DataFrame.apply` -3. Elementwise_ function application: :meth:`~DataFrame.applymap` +3. `Aggregation API`_: :meth:`~DataFrame.agg` and :meth:`~DataFrame.transform` +4. `Applying Elementwise Functions`_: :meth:`~DataFrame.applymap` .. _basics.pipe: @@ -776,6 +788,13 @@ statistics methods, take an optional ``axis`` argument: df.apply(np.cumsum) df.apply(np.exp) +``.apply()`` will also dispatch on a string method name. + +.. ipython:: python + + df.apply('mean') + df.apply('mean', axis=1) + Depending on the return type of the function passed to :meth:`~DataFrame.apply`, the result will either be of lower dimension or the same dimension. @@ -825,16 +844,226 @@ set to True, the passed function will instead receive an ndarray object, which has positive performance implications if you do not need the indexing functionality. -.. seealso:: +.. _basics.aggregate: + +Aggregation API +~~~~~~~~~~~~~~~ + +.. versionadded:: 0.20.0 + +The aggregation API allows one to express possibly multiple aggregation operations in a single concise way. +This API is similar across pandas objects, see :ref:`groupby API `, the +:ref:`window functions API `, and the :ref:`resample API `. +The entry point for aggregation is the method :meth:`~DataFrame.aggregate`, or the alias :meth:`~DataFrame.agg`. + +We will use a similar starting frame from above: + +.. ipython:: python + + tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'], + index=pd.date_range('1/1/2000', periods=10)) + tsdf.iloc[3:7] = np.nan + tsdf + +Using a single function is equivalent to :meth:`~DataFrame.apply`; You can also pass named methods as strings. +These will return a ``Series`` of the aggregated output: + +.. ipython:: python + + tsdf.agg(np.sum) + + tsdf.agg('sum') + + # these are equivalent to a ``.sum()`` because we are aggregating on a single function + tsdf.sum() + +Single aggregations on a ``Series`` this will result in a scalar value: + +.. ipython:: python + + tsdf.A.agg('sum') + + +Aggregating with multiple functions ++++++++++++++++++++++++++++++++++++ + +You can pass multiple aggregation arguments as a list. +The results of each of the passed functions will be a row in the resultant ``DataFrame``. +These are naturally named from the aggregation function. 
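For reference, a minimal sketch of that naming behavior (``df`` here is a throwaway frame, not one of the examples above):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
    out = df.agg(['sum', 'mean'])  # one row per aggregation function
    print(list(out.index))         # ['sum', 'mean']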
- The section on :ref:`GroupBy ` demonstrates related, flexible - functionality for grouping by some criterion, applying, and combining the - results into a Series, DataFrame, etc. +.. ipython:: python + + tsdf.agg(['sum']) + +Multiple functions yield multiple rows: + +.. ipython:: python + + tsdf.agg(['sum', 'mean']) + +On a ``Series``, multiple functions return a ``Series``, indexed by the function names: + +.. ipython:: python + + tsdf.A.agg(['sum', 'mean']) + +Passing a ``lambda`` function will yield a ```` named row: + +.. ipython:: python + + tsdf.A.agg(['sum', lambda x: x.mean()]) + +Passing a named function will yield that name for the row: + +.. ipython:: python -.. _Elementwise: + def mymean(x): + return x.mean() -Applying elementwise Python functions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + tsdf.A.agg(['sum', mymean]) + +Aggregating with a dict ++++++++++++++++++++++++ + +Passing a dictionary of column names to a scalar or a list of scalars, to ``DataFame.agg`` +allows you to customize which functions are applied to which columns. Note that the results +are not in any particular order, you can use an ``OrderedDict`` instead to guarantee ordering. + +.. ipython:: python + + tsdf.agg({'A': 'mean', 'B': 'sum'}) + +Passing a list-like will generate a ``DataFrame`` output. You will get a matrix-like output +of all of the aggregators. The output will consist of all unique functions. Those that are +not noted for a particular column will be ``NaN``: + +.. ipython:: python + + tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'}) + +.. _basics.aggregation.mixed_dtypes: + +Mixed Dtypes +++++++++++++ + +When presented with mixed dtypes that cannot aggregate, ``.agg`` will only take the valid +aggregations. This is similiar to how groupby ``.agg`` works. + +.. ipython:: python + + mdf = pd.DataFrame({'A': [1, 2, 3], + 'B': [1., 2., 3.], + 'C': ['foo', 'bar', 'baz'], + 'D': pd.date_range('20130101', periods=3)}) + mdf.dtypes + +.. ipython:: python + + mdf.agg(['min', 'sum']) + +.. _basics.aggregation.custom_describe: + +Custom describe ++++++++++++++++ + +With ``.agg()`` is it possible to easily create a custom describe function, similar +to the built in :ref:`describe function `. + +.. ipython:: python + + from functools import partial + + q_25 = partial(pd.Series.quantile, q=0.25) + q_25.__name__ = '25%' + q_75 = partial(pd.Series.quantile, q=0.75) + q_75.__name__ = '75%' + + tsdf.agg(['count', 'mean', 'std', 'min', q_25, 'median', q_75, 'max']) + +.. _basics.transform: + +Transform API +~~~~~~~~~~~~~ + +.. versionadded:: 0.20.0 + +The :meth:`~DataFrame.transform` method returns an object that is indexed the same (same size) +as the original. This API allows you to provide *multiple* operations at the same +time rather than one-by-one. Its API is quite similar to the ``.agg`` API. + +Use a similar frame to the above sections. + +.. ipython:: python + + tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'], + index=pd.date_range('1/1/2000', periods=10)) + tsdf.iloc[3:7] = np.nan + tsdf + +Transform the entire frame. ``.transform()`` allows input functions as: a numpy function, a string +function name or a user defined function. + +.. ipython:: python + :okwarning: + + tsdf.transform(np.abs) + tsdf.transform('abs') + tsdf.transform(lambda x: x.abs()) + +Here ``.transform()`` received a single function; this is equivalent to a ufunc application + +.. 
ipython:: python + + np.abs(tsdf) + +Passing a single function to ``.transform()`` with a ``Series`` will yield a single ``Series`` in return. + +.. ipython:: python + + tsdf.A.transform(np.abs) + + +Transform with multiple functions ++++++++++++++++++++++++++++++++++ + +Passing multiple functions will yield a column multi-indexed DataFrame. +The first level will be the original frame column names; the second level +will be the names of the transforming functions. + +.. ipython:: python + + tsdf.transform([np.abs, lambda x: x+1]) + +Passing multiple functions to a Series will yield a DataFrame. The +resulting column names will be the transforming functions. + +.. ipython:: python + + tsdf.A.transform([np.abs, lambda x: x+1]) + + +Transforming with a dict +++++++++++++++++++++++++ + + +Passing a dict of functions will will allow selective transforming per column. + +.. ipython:: python + + tsdf.transform({'A': np.abs, 'B': lambda x: x+1}) + +Passing a dict of lists will generate a multi-indexed DataFrame with these +selective transforms. + +.. ipython:: python + :okwarning: + + tsdf.transform({'A': np.abs, 'B': [lambda x: x+1, 'sqrt']}) + +.. _basics.elementwise: + +Applying Elementwise Functions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Since not all functions can be vectorized (accept NumPy arrays and return another array or value), the methods :meth:`~DataFrame.applymap` on DataFrame @@ -1079,7 +1308,7 @@ objects either on the DataFrame's index or columns using the ``axis`` argument: .. ipython:: python - df.align(df2.ix[0], axis=1) + df.align(df2.iloc[0], axis=1) .. _basics.reindex_fill: @@ -1755,6 +1984,7 @@ then the more *general* one will be used as the result of the operation. # conversion of dtypes df3.astype('float32').dtypes + Convert a subset of columns to a specified type using :meth:`~DataFrame.astype` .. ipython:: python @@ -1764,6 +1994,17 @@ Convert a subset of columns to a specified type using :meth:`~DataFrame.astype` dft dft.dtypes +.. versionadded:: 0.19.0 + +Convert certain columns to a specific dtype by passing a dict to :meth:`~DataFrame.astype` + +.. ipython:: python + + dft1 = pd.DataFrame({'a': [1,0,1], 'b': [4,5,6], 'c': [7, 8, 9]}) + dft1 = dft1.astype({'a': np.bool, 'c': np.float64}) + dft1 + dft1.dtypes + .. note:: When trying to convert a subset of columns to a specified type using :meth:`~DataFrame.astype` and :meth:`~DataFrame.loc`, upcasting occurs. @@ -1875,7 +2116,7 @@ gotchas Performing selection operations on ``integer`` type data can easily upcast the data to ``floating``. The dtype of the input data will be preserved in cases where ``nans`` are not introduced (starting in 0.11.0) -See also :ref:`integer na gotchas ` +See also :ref:`Support for integer NA ` .. ipython:: python diff --git a/doc/source/categorical.rst b/doc/source/categorical.rst index 090998570a358..a508e84465107 100644 --- a/doc/source/categorical.rst +++ b/doc/source/categorical.rst @@ -129,8 +129,7 @@ To get back to the original Series or `numpy` array, use ``Series.astype(origina s s2 = s.astype('category') s2 - s3 = s2.astype('string') - s3 + s2.astype(str) np.asarray(s2) If you have already `codes` and `categories`, you can use the :func:`~pandas.Categorical.from_codes` @@ -231,6 +230,15 @@ Categories must be unique or a `ValueError` is raised: except ValueError as e: print("ValueError: " + str(e)) +Categories must also not be ``NaN`` or a `ValueError` is raised: + +.. 
ipython:: python + + try: + s.cat.categories = [1,2,np.nan] + except ValueError as e: + print("ValueError: " + str(e)) + Appending new categories ~~~~~~~~~~~~~~~~~~~~~~~~ @@ -483,7 +491,7 @@ Pivot tables: Data munging ------------ -The optimized pandas data access methods ``.loc``, ``.iloc``, ``.ix`` ``.at``, and ``.iat``, +The optimized pandas data access methods ``.loc``, ``.iloc``, ``.at``, and ``.iat``, work as normal. The only difference is the return type (for getting) and that only values already in `categories` can be assigned. @@ -502,7 +510,6 @@ the ``category`` dtype is preserved. df.iloc[2:4,:] df.iloc[2:4,:].dtypes df.loc["h":"j","cats"] - df.ix["h":"j",0:1] df[df["cats"] == "b"] An example where the category type is not preserved is if you take one single row: the @@ -619,6 +626,7 @@ Assigning a `Categorical` to parts of a column of other types will use the value df df.dtypes +.. _categorical.merge: Merging ~~~~~~~ @@ -648,6 +656,9 @@ In this case the categories are not the same and so an error is raised: The same applies to ``df.append(df_different)``. +See also the section on :ref:`merge dtypes` for notes about preserving merge dtypes and performance. + + .. _categorical.union: Unioning @@ -662,7 +673,7 @@ will be the union of the categories being combined. .. ipython:: python - from pandas.types.concat import union_categoricals + from pandas.api.types import union_categoricals a = pd.Categorical(["b", "c"]) b = pd.Categorical(["a", "b"]) union_categoricals([a, b]) @@ -695,6 +706,17 @@ The below raises ``TypeError`` because the categories are ordered and not identi Out[3]: TypeError: to union ordered Categoricals, all categories must be the same +.. versionadded:: 0.20.0 + +Ordered categoricals with different categories or orderings can be combined by +using the ``ignore_ordered=True`` argument. + +.. ipython:: python + + a = pd.Categorical(["a", "b", "c"], ordered=True) + b = pd.Categorical(["c", "b", "a"], ordered=True) + union_categoricals([a, b], ignore_order=True) + ``union_categoricals`` also works with a ``CategoricalIndex``, or ``Series`` containing categorical data, but note that the resulting array will always be a plain ``Categorical`` diff --git a/doc/source/comparison_with_r.rst b/doc/source/comparison_with_r.rst index aa0cbab4df10b..194e022e34c7c 100644 --- a/doc/source/comparison_with_r.rst +++ b/doc/source/comparison_with_r.rst @@ -206,14 +206,6 @@ of its first argument in its second: s <- 0:4 match(s, c(2,4)) -The :meth:`~pandas.core.groupby.GroupBy.apply` method can be used to replicate -this: - -.. ipython:: python - - s = pd.Series(np.arange(5),dtype=np.float32) - pd.Series(pd.match(s,[2,4],np.nan)) - For more details and examples see :ref:`the reshaping documentation `. diff --git a/doc/source/computation.rst b/doc/source/computation.rst index d727424750be5..76a030d355e33 100644 --- a/doc/source/computation.rst +++ b/doc/source/computation.rst @@ -84,8 +84,8 @@ in order to have a valid result. .. ipython:: python frame = pd.DataFrame(np.random.randn(20, 3), columns=['a', 'b', 'c']) - frame.ix[:5, 'a'] = np.nan - frame.ix[5:10, 'b'] = np.nan + frame.loc[frame.index[:5], 'a'] = np.nan + frame.loc[frame.index[5:10], 'b'] = np.nan frame.cov() @@ -120,7 +120,7 @@ All of these are currently computed using pairwise complete observations. .. 
ipython:: python frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e']) - frame.ix[::2] = np.nan + frame.iloc[::2] = np.nan # Series with Series frame['a'].corr(frame['b']) @@ -137,8 +137,8 @@ Like ``cov``, ``corr`` also supports the optional ``min_periods`` keyword: .. ipython:: python frame = pd.DataFrame(np.random.randn(20, 3), columns=['a', 'b', 'c']) - frame.ix[:5, 'a'] = np.nan - frame.ix[5:10, 'b'] = np.nan + frame.loc[frame.index[:5], 'a'] = np.nan + frame.loc[frame.index[5:10], 'b'] = np.nan frame.corr() @@ -459,6 +459,48 @@ default of the index) in a DataFrame. dft dft.rolling('2s', on='foo').sum() +.. _stats.rolling_window.endpoints: + +Rolling Window Endpoints +~~~~~~~~~~~~~~~~~~~~~~~~ + +.. versionadded:: 0.20.0 + +The inclusion of the interval endpoints in rolling window calculations can be specified with the ``closed`` +parameter: + +.. csv-table:: + :header: "``closed``", "Description", "Default for" + :widths: 20, 30, 30 + + ``right``, close right endpoint, time-based windows + ``left``, close left endpoint, + ``both``, close both endpoints, fixed windows + ``neither``, open endpoints, + +For example, having the right endpoint open is useful in many problems that require that there is no contamination +from present information back to past information. This allows the rolling window to compute statistics +"up to that point in time", but not including that point in time. + +.. ipython:: python + + df = pd.DataFrame({'x': 1}, + index = [pd.Timestamp('20130101 09:00:01'), + pd.Timestamp('20130101 09:00:02'), + pd.Timestamp('20130101 09:00:03'), + pd.Timestamp('20130101 09:00:04'), + pd.Timestamp('20130101 09:00:06')]) + + df["right"] = df.rolling('2s', closed='right').x.sum() # default + df["both"] = df.rolling('2s', closed='both').x.sum() + df["left"] = df.rolling('2s', closed='left').x.sum() + df["neither"] = df.rolling('2s', closed='neither').x.sum() + + df + +Currently, this feature is only implemented for time-based windows. +For fixed windows, the closed parameter cannot be set and the rolling window will always have both endpoints closed. + .. _stats.moments.ts-versus-resampling: Time-aware Rolling vs. Resampling @@ -505,13 +547,18 @@ two ``Series`` or any combination of ``DataFrame/Series`` or - ``DataFrame/DataFrame``: by default compute the statistic for matching column names, returning a DataFrame. If the keyword argument ``pairwise=True`` is passed then computes the statistic for each pair of columns, returning a - ``Panel`` whose ``items`` are the dates in question (see :ref:`the next section + ``MultiIndexed DataFrame`` whose ``index`` are the dates in question (see :ref:`the next section `). For example: .. ipython:: python + df = pd.DataFrame(np.random.randn(1000, 4), + index=pd.date_range('1/1/2000', periods=1000), + columns=['A', 'B', 'C', 'D']) + df = df.cumsum() + df2 = df[:20] df2.rolling(window=5).corr(df2['B']) @@ -520,11 +567,16 @@ For example: Computing rolling pairwise covariances and correlations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +.. warning:: + + Prior to version 0.20.0 if ``pairwise=True`` was passed, a ``Panel`` would be returned. + This will now return a 2-level MultiIndexed DataFrame, see the whatsnew :ref:`here ` + In financial data analysis and other fields it's common to compute covariance and correlation matrices for a collection of time series. Often one is also interested in moving-window covariance and correlation matrices. 
This can be done by passing the ``pairwise`` keyword argument, which in the case of -``DataFrame`` inputs will yield a ``Panel`` whose ``items`` are the dates in +``DataFrame`` inputs will yield a MultiIndexed ``DataFrame`` whose ``index`` are the dates in question. In the case of a single DataFrame argument the ``pairwise`` argument can even be omitted: @@ -539,15 +591,15 @@ can even be omitted: .. ipython:: python covs = df[['B','C','D']].rolling(window=50).cov(df[['A','B','C']], pairwise=True) - covs[df.index[-50]] + covs.loc['2002-09-22':] .. ipython:: python correls = df.rolling(window=50).corr() - correls[df.index[-50]] + correls.loc['2002-09-22':] You can efficiently retrieve the time series of correlations between two -columns using ``.loc`` indexing: +columns by reshaping and indexing: .. ipython:: python :suppress: @@ -557,7 +609,7 @@ columns using ``.loc`` indexing: .. ipython:: python @savefig rolling_corr_pairwise_ex.png - correls.loc[:, 'A', 'C'].plot() + correls.unstack(1)[('A', 'C')].plot() .. _stats.aggregate: @@ -565,7 +617,9 @@ Aggregation ----------- Once the ``Rolling``, ``Expanding`` or ``EWM`` objects have been created, several methods are available to -perform multiple computations on the data. This is very similar to a ``.groupby(...).agg`` seen :ref:`here `. +perform multiple computations on the data. These operations are similar to the :ref:`aggregating API `, +:ref:`groupby API `, and :ref:`resample API `. + .. ipython:: python @@ -590,24 +644,16 @@ columns if none are selected. .. _stats.aggregate.multifunc: -Applying multiple functions at once -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Applying multiple functions +~~~~~~~~~~~~~~~~~~~~~~~~~~~ -With windowed Series you can also pass a list or dict of functions to do +With windowed ``Series`` you can also pass a list of functions to do aggregation with, outputting a DataFrame: .. ipython:: python r['A'].agg([np.sum, np.mean, np.std]) -If a dict is passed, the keys will be used to name the columns. Otherwise the -function's name (stored in the function object) will be used. - -.. ipython:: python - - r['A'].agg({'result1' : np.sum, - 'result2' : np.mean}) - On a widowed DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index: @@ -622,7 +668,7 @@ Applying different functions to DataFrame columns ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ By passing a dict to ``aggregate`` you can apply a different aggregation to the -columns of a DataFrame: +columns of a ``DataFrame``: .. ipython:: python :okexcept: @@ -691,6 +737,8 @@ Method Summary :meth:`~Expanding.cov`, Unbiased covariance (binary) :meth:`~Expanding.corr`, Correlation (binary) +.. currentmodule:: pandas + Aside from not having a ``window`` parameter, these functions have the same interfaces as their ``.rolling`` counterparts. Like above, the parameters they all accept are: @@ -700,18 +748,37 @@ all accept are: ``min_periods`` non-null data points have been seen. - ``center``: boolean, whether to set the labels at the center (default is False) +.. _stats.moments.expanding.note: .. note:: The output of the ``.rolling`` and ``.expanding`` methods do not return a ``NaN`` if there are at least ``min_periods`` non-null values in the current - window. This differs from ``cumsum``, ``cumprod``, ``cummax``, and - ``cummin``, which return ``NaN`` in the output wherever a ``NaN`` is - encountered in the input. + window. For example, + + .. 
ipython:: python + + sn = pd.Series([1, 2, np.nan, 3, np.nan, 4]) + sn + sn.rolling(2).max() + sn.rolling(2, min_periods=1).max() + + In case of expanding functions, this differs from :meth:`~DataFrame.cumsum`, + :meth:`~DataFrame.cumprod`, :meth:`~DataFrame.cummax`, + and :meth:`~DataFrame.cummin`, which return ``NaN`` in the output wherever + a ``NaN`` is encountered in the input. In order to match the output of ``cumsum`` + with ``expanding``, use :meth:`~DataFrame.fillna`: + + .. ipython:: python + + sn.expanding().sum() + sn.cumsum() + sn.cumsum().fillna(method='ffill') + An expanding window statistic will be more stable (and less responsive) than its rolling window counterpart as the increasing window size decreases the relative impact of an individual data point. As an example, here is the -:meth:`~Expanding.mean` output for the previous time series dataset: +:meth:`~core.window.Expanding.mean` output for the previous time series dataset: .. ipython:: python :suppress: @@ -731,13 +798,14 @@ relative impact of an individual data point. As an example, here is the Exponentially Weighted Windows ------------------------------ +.. currentmodule:: pandas.core.window + A related set of functions are exponentially weighted versions of several of the above statistics. A similar interface to ``.rolling`` and ``.expanding`` is accessed -thru the ``.ewm`` method to receive an :class:`~pandas.core.window.EWM` object. +through the ``.ewm`` method to receive an :class:`~EWM` object. A number of expanding EW (exponentially weighted) methods are provided: -.. currentmodule:: pandas.core.window .. csv-table:: :header: "Function", "Description" diff --git a/doc/source/conf.py b/doc/source/conf.py index 4f679f3f728bf..556e5f0227471 100644 --- a/doc/source/conf.py +++ b/doc/source/conf.py @@ -14,8 +14,17 @@ import os import re import inspect +import importlib from pandas.compat import u, PY3 +# https://github.com/sphinx-doc/sphinx/pull/2325/files +# Workaround for sphinx-build recursion limit overflow: +# pickle.dump(doctree, f, pickle.HIGHEST_PROTOCOL) +# RuntimeError: maximum recursion depth exceeded while pickling an object +# +# Python's default allowed recursion depth is 1000. +sys.setrecursionlimit(5000) + # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. 
@@ -44,14 +53,16 @@ 'numpydoc', # used to parse numpy-style docstrings for autodoc 'ipython_sphinxext.ipython_directive', 'ipython_sphinxext.ipython_console_highlighting', + 'IPython.sphinxext.ipython_console_highlighting', # lowercase didn't work 'sphinx.ext.intersphinx', 'sphinx.ext.coverage', - 'sphinx.ext.pngmath', + 'sphinx.ext.mathjax', 'sphinx.ext.ifconfig', 'sphinx.ext.linkcode', + 'nbsphinx', ] - +exclude_patterns = ['**.ipynb_checkpoints'] with open("index.rst") as f: index_rst_lines = f.readlines() @@ -62,15 +73,16 @@ # JP: added from sphinxdocs autosummary_generate = False -if any([re.match("\s*api\s*",l) for l in index_rst_lines]): +if any([re.match("\s*api\s*", l) for l in index_rst_lines]): autosummary_generate = True files_to_delete = [] for f in os.listdir(os.path.dirname(__file__)): - if not f.endswith('.rst') or f.startswith('.') or os.path.basename(f) == 'index.rst': + if (not f.endswith(('.ipynb', '.rst')) or + f.startswith('.') or os.path.basename(f) == 'index.rst'): continue - _file_basename = f.split('.rst')[0] + _file_basename = os.path.splitext(f)[0] _regex_to_match = "\s*{}\s*$".format(_file_basename) if not any([re.match(_regex_to_match, line) for line in index_rst_lines]): files_to_delete.append(f) @@ -215,20 +227,69 @@ # Additional templates that should be rendered to pages, maps page names to # template names. -# Add redirect for previously existing API pages (which are now included in -# the API pages as top-level functions) based on a template (GH9911) +# Add redirect for previously existing API pages +# each item is like `(from_old, to_new)` +# To redirect a class and all its methods, see below +# https://github.com/pandas-dev/pandas/issues/16186 + moved_api_pages = [ - 'pandas.core.common.isnull', 'pandas.core.common.notnull', 'pandas.core.reshape.get_dummies', - 'pandas.tools.merge.concat', 'pandas.tools.merge.merge', 'pandas.tools.pivot.pivot_table', - 'pandas.tseries.tools.to_datetime', 'pandas.io.clipboard.read_clipboard', 'pandas.io.excel.ExcelFile.parse', - 'pandas.io.excel.read_excel', 'pandas.io.html.read_html', 'pandas.io.json.read_json', - 'pandas.io.parsers.read_csv', 'pandas.io.parsers.read_fwf', 'pandas.io.parsers.read_table', - 'pandas.io.pickle.read_pickle', 'pandas.io.pytables.HDFStore.append', 'pandas.io.pytables.HDFStore.get', - 'pandas.io.pytables.HDFStore.put', 'pandas.io.pytables.HDFStore.select', 'pandas.io.pytables.read_hdf', - 'pandas.io.sql.read_sql', 'pandas.io.sql.read_frame', 'pandas.io.sql.write_frame', - 'pandas.io.stata.read_stata'] - -html_additional_pages = {'generated/' + page: 'api_redirect.html' for page in moved_api_pages} + ('pandas.core.common.isnull', 'pandas.isnull'), + ('pandas.core.common.notnull', 'pandas.notnull'), + ('pandas.core.reshape.get_dummies', 'pandas.get_dummies'), + ('pandas.tools.merge.concat', 'pandas.concat'), + ('pandas.tools.merge.merge', 'pandas.merge'), + ('pandas.tools.pivot.pivot_table', 'pandas.pivot_table'), + ('pandas.tseries.tools.to_datetime', 'pandas.to_datetime'), + ('pandas.io.clipboard.read_clipboard', 'pandas.read_clipboard'), + ('pandas.io.excel.ExcelFile.parse', 'pandas.ExcelFile.parse'), + ('pandas.io.excel.read_excel', 'pandas.read_excel'), + ('pandas.io.html.read_html', 'pandas.read_html'), + ('pandas.io.json.read_json', 'pandas.read_json'), + ('pandas.io.parsers.read_csv', 'pandas.read_csv'), + ('pandas.io.parsers.read_fwf', 'pandas.read_fwf'), + ('pandas.io.parsers.read_table', 'pandas.read_table'), + ('pandas.io.pickle.read_pickle', 'pandas.read_pickle'), + 
('pandas.io.pytables.HDFStore.append', 'pandas.HDFStore.append'), + ('pandas.io.pytables.HDFStore.get', 'pandas.HDFStore.get'), + ('pandas.io.pytables.HDFStore.put', 'pandas.HDFStore.put'), + ('pandas.io.pytables.HDFStore.select', 'pandas.HDFStore.select'), + ('pandas.io.pytables.read_hdf', 'pandas.read_hdf'), + ('pandas.io.sql.read_sql', 'pandas.read_sql'), + ('pandas.io.sql.read_frame', 'pandas.read_frame'), + ('pandas.io.sql.write_frame', 'pandas.write_frame'), + ('pandas.io.stata.read_stata', 'pandas.read_stata'), +] + +# Again, tuples of (from_old, to_new) +moved_classes = [ + ('pandas.tseries.resample.Resampler', 'pandas.core.resample.Resampler'), + ('pandas.formats.style.Styler', 'pandas.io.formats.style.Styler'), +] + +for old, new in moved_classes: + # the class itself... + moved_api_pages.append((old, new)) + + mod, classname = new.rsplit('.', 1) + klass = getattr(importlib.import_module(mod), classname) + methods = [x for x in dir(klass) + if not x.startswith('_') or x in ('__iter__', '__array__')] + + for method in methods: + # ... and each of its public methods + moved_api_pages.append( + ("{old}.{method}".format(old=old, method=method), + "{new}.{method}".format(new=new, method=method)) + ) + +html_additional_pages = { + 'generated/' + page[0]: 'api_redirect.html' + for page in moved_api_pages +} + +html_context = { + 'redirects': {old: new for old, new in moved_api_pages} +} # If false, no module index is generated. html_use_modindex = True @@ -253,6 +314,9 @@ # Output file base name for HTML help builder. htmlhelp_basename = 'pandas' +# -- Options for nbsphinx ------------------------------------------------ + +nbsphinx_allow_errors = True # -- Options for LaTeX output -------------------------------------------- @@ -326,6 +390,23 @@ from sphinx.ext.autosummary import Autosummary +class AccessorDocumenter(MethodDocumenter): + """ + Specialized Documenter subclass for accessors. + """ + + objtype = 'accessor' + directivetype = 'method' + + # lower than MethodDocumenter so this is not chosen for normal methods + priority = 0.6 + + def format_signature(self): + # this method gives an error/warning for the accessors, therefore + # overriding it (accessor has no arguments) + return '' + + class AccessorLevelDocumenter(Documenter): """ Specialized Documenter subclass for objects on accessor level (methods, @@ -381,12 +462,17 @@ class AccessorAttributeDocumenter(AccessorLevelDocumenter, AttributeDocumenter): objtype = 'accessorattribute' directivetype = 'attribute' + # lower than AttributeDocumenter so this is not chosen for normal attributes + priority = 0.6 class AccessorMethodDocumenter(AccessorLevelDocumenter, MethodDocumenter): objtype = 'accessormethod' directivetype = 'method' + # lower than MethodDocumenter so this is not chosen for normal methods + priority = 0.6 + class AccessorCallableDocumenter(AccessorLevelDocumenter, MethodDocumenter): """ @@ -483,6 +569,7 @@ def remove_flags_docstring(app, what, name, obj, options, lines): def setup(app): app.connect("autodoc-process-docstring", remove_flags_docstring) + app.add_autodocumenter(AccessorDocumenter) app.add_autodocumenter(AccessorAttributeDocumenter) app.add_autodocumenter(AccessorMethodDocumenter) app.add_autodocumenter(AccessorCallableDocumenter) diff --git a/doc/source/contributing.rst b/doc/source/contributing.rst index 44ee6223d5ee1..aacfe25b91564 100644 --- a/doc/source/contributing.rst +++ b/doc/source/contributing.rst @@ -27,7 +27,7 @@ about it! 
Feel free to ask questions on the `mailing list
`_ or on `Gitter
-`_.
+`_.

Bug reports and enhancement requests
====================================
@@ -51,14 +51,9 @@ Bug reports must:
      ...
      ```

-#. Include the full version string of *pandas* and its dependencies. In versions
-   of *pandas* after 0.12 you can use a built in function::
-
-      >>> from pandas.util.print_versions import show_versions
-      >>> show_versions()
-
-   and in *pandas* 0.13.1 onwards::
+#. Include the full version string of *pandas* and its dependencies. You can use the built-in function::

+      >>> import pandas as pd
      >>> pd.show_versions()

#. Explain why the current behavior is wrong/not desired and what you expect instead.
@@ -113,12 +108,6 @@ want to clone your fork to your machine::
This creates the directory `pandas-yourname` and connects your repository to
the upstream (main project) *pandas* repository.

-The testing suite will run automatically on Travis-CI once your pull request is
-submitted. However, if you wish to run the test suite on a branch prior to
-submitting the pull request, then Travis-CI needs to be hooked up to your
-GitHub repository. Instructions for doing so are `here
-`__.
-
Creating a branch
-----------------
@@ -142,7 +131,7 @@ To update this branch, you need to retrieve the changes from the master
branch::

    git fetch upstream
    git rebase upstream/master

-This will replay your commits on top of the lastest pandas git master. If this
+This will replay your commits on top of the latest pandas git master. If this
leads to merge conflicts, you must resolve these before submitting your pull
request. If you have uncommitted changes, you will need to ``stash`` them prior
to updating. This will effectively store your changes and they can be reapplied
@@ -215,7 +204,7 @@ At this point you can easily do an *in-place* install, as detailed in the next s
Creating a Windows development environment
------------------------------------------

-To build on Windows, you need to have compilers installed to build the extensions. You will need to install the appropriate Visual Studio compilers, VS 2008 for Python 2.7, VS 2010 for 3.4, and VS 2015 for Python 3.5.
+To build on Windows, you need to have compilers installed to build the extensions. You will need to install the appropriate Visual Studio compilers, VS 2008 for Python 2.7, VS 2010 for 3.4, and VS 2015 for Python 3.5 and 3.6.

For Python 2.7, you can install the ``mingw`` compiler which will work equivalently to VS 2008::

@@ -225,7 +214,7 @@ or use the `Microsoft Visual Studio VC++ compiler for Python
`__. Read the references below as there may be various gotchas during the installation.

-For Python 3.5, you can download and install the `Visual Studio 2015 Community Edition `__.
+For Python 3.5 and 3.6, you can download and install the `Visual Studio 2015 Community Edition `__.

Here are some references and blogs:

@@ -358,15 +347,14 @@ have ``sphinx`` and ``ipython`` installed.
`numpydoc
`_ is used to parse the docstrings that
follow the Numpy Docstring Standard (see above), but you don't need to install
this because a local copy of numpydoc is included in the *pandas* source
-code.
-`nbconvert `_ and
-`nbformat `_ are required to build
+code. `nbsphinx `_ is required to build
the Jupyter notebooks included in the documentation.
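+For reference, here is a minimal sketch of a docstring that follows the Numpy
+Docstring Standard mentioned above (the function and its parameters are
+hypothetical and purely illustrative)::
+
+    def add_one(values):
+        """
+        Add one to each element of `values`.
+
+        Parameters
+        ----------
+        values : array-like
+            The values to increment.
+
+        Returns
+        -------
+        array-like
+            A new array with each element increased by one.
+        """
+        return values + 1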
If you have a conda environment named ``pandas_dev``,
you can install the extra requirements with::

      conda install -n pandas_dev sphinx ipython nbconvert nbformat
+     conda install -n pandas_dev -c conda-forge nbsphinx

Furthermore, it is recommended to have all :ref:`optional dependencies `
installed. This is not strictly necessary, but be aware that you will see some error
@@ -396,7 +384,7 @@ evocations, sphinx will try to only build the pages that have been modified.
If you want to do a full clean build, do::

    python make.py clean
-    python make.py build
+    python make.py html

Starting with *pandas* 0.13.1 you can tell ``make.py`` to compile only a single section
of the docs, greatly reducing the turn-around time for checking your changes.
@@ -431,7 +419,8 @@ Building master branch documentation
When pull requests are merged into the *pandas* ``master`` branch, the main parts of
the documentation are also built by Travis-CI. These docs are then hosted `here
-`__.
+`__; see also
+the :ref:`Continuous Integration ` section.

Contributing to the code base
=============================
@@ -442,38 +431,137 @@ Code standards
--------------

+Writing good code is not just about what you write. It is also about *how* you
+write it. During :ref:`Continuous Integration ` testing, several
+tools will be run to check your code for stylistic errors.
+Generating any warnings will cause the test to fail.
+Thus, good style is a requirement for submitting code to *pandas*.
+
+In addition, because a lot of people use our library, it is important that we
+avoid making sudden changes to the code that could break a lot of existing
+user code; that is, we need the library to be as *backwards compatible* as
+possible to avoid mass breakages.
+
+Additional standards are outlined on the `code style wiki
+page `_.
+
+C (cpplint)
+~~~~~~~~~~~
+
+*pandas* uses the `Google `_
+standard. Google provides an open source style checker called ``cpplint``, but we
+use a fork of it that can be found `here `__.
+Here are *some* of the more common ``cpplint`` issues:
+
+  - we restrict line-length to 80 characters to promote readability
+  - every header file must include a header guard to avoid name collisions if re-included
+
+:ref:`Continuous Integration ` will run the
+`cpplint `_ tool
+and report any stylistic errors in your code. Therefore, it is helpful before
+submitting code to run the check yourself::
+
+   cpplint --extensions=c,h --headers=h --filter=-readability/casting,-runtime/int,-build/include_subdir modified-c-file
+
+You can also run this command on an entire directory if necessary::
+
+   cpplint --extensions=c,h --headers=h --filter=-readability/casting,-runtime/int,-build/include_subdir --recursive modified-c-directory
+
+To make your commits compliant with this standard, you can install the
+`ClangFormat `_ tool, which can be
+downloaded `here `__. To configure, in your home directory,
+run the following command::
+
+   clang-format -style=google -dump-config > .clang-format
+
+Then modify the file to ensure that any indentation width parameters are at least four.
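+For example, the indentation-related entries of the generated ``.clang-format``
+file might then look like the following (a sketch only; the exact set of keys
+emitted depends on your ClangFormat version)::
+
+   BasedOnStyle: Google
+   IndentWidth: 4
+   TabWidth: 4
+   ContinuationIndentWidth: 4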
+Once configured, you can run the tool as follows::
+
+   clang-format modified-c-file
+
+This will output what your file will look like if the changes are made, and to apply
+them, just run the following command::
+
+   clang-format -i modified-c-file
+
+To run the tool on an entire directory, you can run the following analogous commands::
+
+   clang-format modified-c-directory/*.c modified-c-directory/*.h
+   clang-format -i modified-c-directory/*.c modified-c-directory/*.h
+
+Do note that this tool is best-effort, meaning that it will try to correct as
+many errors as possible, but it may not correct *all* of them. Thus, it is
+recommended that you run ``cpplint`` to double check and make any other style
+fixes manually.
+
+Python (PEP8)
+~~~~~~~~~~~~~
+
*pandas* uses the `PEP8 `_ standard.
There are several tools to ensure you abide by this standard. Here are *some* of
the more common ``PEP8`` issues:

-  - we restrict line-length to 80 characters to promote readability
+  - we restrict line-length to 79 characters to promote readability
  - passing arguments should have spaces after commas, e.g. ``foo(arg1, arg2, kw1='bar')``

-The Travis-CI will run `flake8 `_ tool and report
-any stylistic errors in your code. Generating any warnings will cause the build to fail;
-thus these are part of the requirements for submitting code to *pandas*.
+:ref:`Continuous Integration ` will run
+the `flake8 `_ tool
+and report any stylistic errors in your code. Therefore, it is helpful before
+submitting code to run the check yourself on the diff::

-It is helpful before submitting code to run this yourself on the diff::
+   git diff master --name-only -- '*.py' | flake8 --diff

-   git diff master | flake8 --diff
+This command will catch any stylistic errors in your changes specifically, but
+beware it may not catch all of them. For example, if you delete the only
+usage of an imported function, it is stylistically incorrect to import an
+unused function. However, style-checking the diff will not catch this because
+the actual import is not part of the diff. Thus, for completeness, you should
+run this command, though it will take longer::

-Furthermore, we've written a tool to check that your commits are PEP8 great, `pip install pep8radius
-`_. Look at PEP8 fixes in your branch vs master with::
+   git diff master --name-only -- '*.py' | grep 'pandas/' | xargs -r flake8

-   pep8radius master --diff
+Note that on OSX, the ``-r`` flag is not available, so you have to omit it and
+run this slightly modified command::

-and make these changes with::
+   git diff master --name-only -- '*.py' | grep 'pandas/' | xargs flake8

-   pep8radius master --diff --in-place
-
-Additional standards are outlined on the `code style wiki
-page `_.
+Backwards Compatibility
+~~~~~~~~~~~~~~~~~~~~~~~

Please try to maintain backward compatibility. *pandas* has lots of users with lots of
existing code, so don't break it if at all possible. If you think breakage is required,
clearly state why as part of the pull request. Also, be careful when changing method
signatures and add deprecation warnings where needed.
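+
+As an illustration of the pattern (the function and keyword below are
+hypothetical, not an actual pandas API), a deprecation typically keeps the old
+behavior working while raising a ``FutureWarning``::
+
+   import warnings
+
+   def my_func(new_kwarg=None, old_kwarg=None):
+       if old_kwarg is not None:
+           warnings.warn("'old_kwarg' is deprecated and will be removed in a "
+                         "future version; use 'new_kwarg' instead",
+                         FutureWarning, stacklevel=2)
+           new_kwarg = old_kwarg
+       ...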
+.. _contributing.ci:
+
+Testing With Continuous Integration
+-----------------------------------
+
+The *pandas* test suite will run automatically on `Travis-CI `__,
+`Appveyor `__, and `Circle CI `__ continuous integration
+services, once your pull request is submitted.
+However, if you wish to run the test suite on a branch prior to submitting the pull request,
+then the continuous integration services need to be hooked to your GitHub repository. Instructions are here
+for `Travis-CI `__,
+`Appveyor `__, and `CircleCI `__.
+
+A pull request will be considered for merging when you have an all 'green' build. If any tests are failing,
+then you will get a red 'X', where you can click through to see the individual failed tests.
+This is an example of a green build.
+
+.. image:: _static/ci.png
+
+.. note::
+
+   Each time you push to *your* fork, a *new* run of the tests will be triggered on the CI. Appveyor will auto-cancel
+   any non-currently-running tests for that same pull request. You can enable the auto-cancel feature for
+   `Travis-CI here `__ and
+   for `CircleCI here `__.
+
+.. _contributing.tdd:
+
+
Test-driven development/code writing
------------------------------------
@@ -489,8 +577,8 @@ use cases and writing corresponding tests. Adding tests is one of the
most common requests after code is pushed to *pandas*.  Therefore,
it is worth getting in the habit of writing tests ahead of time so this is never an issue.

-Like many packages, *pandas* uses the `Nose testing system
-`_ and the convenient
+Like many packages, *pandas* uses `pytest
+`_ and the convenient
extensions in `numpy.testing
`_.

@@ -526,23 +614,142 @@ the expected correct result::

        assert_frame_equal(pivoted, expected)

+Transitioning to ``pytest``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The existing *pandas* test structure is *mostly* class based, meaning that you will typically find tests wrapped in a class.
+
+.. code-block:: python
+
+    class TestReallyCoolFeature(object):
+        ....
+
+Going forward, we are moving to a more *functional* style using the `pytest `__ framework, which offers a richer
+framework that will facilitate testing and development. Thus, instead of writing test classes, we will write test functions like this:
+
+.. code-block:: python
+
+    def test_really_cool_feature():
+        ....
+
+Using ``pytest``
+~~~~~~~~~~~~~~~~
+
+Here is an example of a self-contained set of tests that illustrate multiple features that we like to use.
+
+- functional style: tests are like ``test_*`` and *only* take arguments that are either fixtures or parameters
+- using ``parametrize``: allow testing of multiple cases
+- ``fixture``, code for object construction, on a per-test basis
+- using bare ``assert`` for scalars and truth-testing
+- ``tm.assert_series_equal`` (and its counterpart ``tm.assert_frame_equal``), for pandas object comparisons.
+- the typical pattern of constructing an ``expected`` and comparing versus the ``result``
+
+We would name this file ``test_cool_feature.py`` and put it in an appropriate place in the ``pandas/tests/`` structure.
+
+.. code-block:: python
+
+    import pytest
+    import numpy as np
+    import pandas as pd
+    from pandas.util import testing as tm
+
+    @pytest.mark.parametrize('dtype', ['int8', 'int16', 'int32', 'int64'])
+    def test_dtypes(dtype):
+        assert str(np.dtype(dtype)) == dtype
+
+    @pytest.fixture
+    def series():
+        return pd.Series([1, 2, 3])
+
+    @pytest.fixture(params=['int8', 'int16', 'int32', 'int64'])
+    def dtype(request):
+        return request.param
+
+    def test_series(series, dtype):
+        result = series.astype(dtype)
+        assert result.dtype == dtype
+
+        expected = pd.Series([1, 2, 3], dtype=dtype)
+        tm.assert_series_equal(result, expected)
+
+
+A test run of this yields
+
+.. code-block:: shell
+
+   ((pandas) bash-3.2$ pytest test_cool_feature.py -v
+   =========================== test session starts ===========================
+   platform darwin -- Python 3.5.2, pytest-3.0.5, py-1.4.31, pluggy-0.4.0
+   collected 8 items
+
+   tester.py::test_dtypes[int8] PASSED
+   tester.py::test_dtypes[int16] PASSED
+   tester.py::test_dtypes[int32] PASSED
+   tester.py::test_dtypes[int64] PASSED
+   tester.py::test_series[int8] PASSED
+   tester.py::test_series[int16] PASSED
+   tester.py::test_series[int32] PASSED
+   tester.py::test_series[int64] PASSED
+
+Tests that we have ``parametrized`` are now accessible via the test name, for example we could run these with ``-k int8`` to sub-select *only* those tests which match ``int8``.
+
+
+.. code-block:: shell
+
+   ((pandas) bash-3.2$ pytest test_cool_feature.py -v -k int8
+   =========================== test session starts ===========================
+   platform darwin -- Python 3.5.2, pytest-3.0.5, py-1.4.31, pluggy-0.4.0
+   collected 8 items
+
+   test_cool_feature.py::test_dtypes[int8] PASSED
+   test_cool_feature.py::test_series[int8] PASSED
+
+
Running the test suite
-~~~~~~~~~~~~~~~~~~~~~~
+----------------------

The tests can then be run directly inside your Git clone (without having to
install *pandas*) by typing::

-    nosetests pandas
+    pytest pandas

The test suite is exhaustive and takes around 20 minutes to run.  Often it is
worth running only a subset of tests first around your changes before running the
-entire suite. This is done using one of the following constructs::
+entire suite.
+
+The easiest way to do this is with::

-    nosetests pandas/tests/[test-module].py
-    nosetests pandas/tests/[test-module].py:[TestClass]
-    nosetests pandas/tests/[test-module].py:[TestClass].[test_method]
+    pytest pandas/path/to/test.py -k regex_matching_test_name

-    .. versionadded:: 0.18.0
+Or with one of the following constructs::
+
+    pytest pandas/tests/[test-module].py
+    pytest pandas/tests/[test-module].py::[TestClass]
+    pytest pandas/tests/[test-module].py::[TestClass]::[test_method]
+
+Using `pytest-xdist `_, one can
+speed up local testing on multicore machines. To use this feature, you will
+need to install ``pytest-xdist`` via::
+
+    pip install pytest-xdist
+
+Two scripts are provided to assist with this. These scripts distribute
+testing across 4 threads.
+
+On Unix variants, one can type::
+
+    test_fast.sh
+
+On Windows, one can type::
+
+    test_fast.bat
+
+This can significantly reduce the time it takes to locally run tests before
+submitting a pull request.
+
+For more, see the `pytest `_ documentation.
+
+
   .. versionadded:: 0.20.0

Furthermore one can run

@@ -553,7 +760,8 @@ Furthermore one can run
with an imported pandas to run tests similarly.

Running the performance test suite
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+----------------------------------
+

Performance matters and it is worth considering whether your code
has introduced performance regressions.  *pandas* is in the process of migrating to
`asv benchmarks `__
to enable easy monitoring of the performance of critical *pandas* operations.
These benchmarks are all found in the ``pandas/asv_bench`` directory.  asv
supports both python2 and python3.

-.. note::
-
-    The asv benchmark suite was translated from the previous framework, vbench,
-    so many stylistic issues are likely a result of automated transformation of the
-    code.
-
To use all features of asv, you will need either ``conda`` or
``virtualenv``. For more details please check the `asv installation
webpage `_.
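+
+As a sketch of a typical workflow (assuming ``asv`` is installed and run from
+the benchmark directory; the exact flags may vary with your asv version), you
+can compare your feature branch against master and report only significant
+changes with::
+
+   cd asv_bench
+   asv continuous -f 1.1 upstream/master HEAD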
@@ -626,73 +828,6 @@ This will display stderr from the benchmarks, and use your local Information on how to write a benchmark and how to use asv can be found in the `asv documentation `_. -.. _contributing.gbq_integration_tests: - -Running Google BigQuery Integration Tests -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You will need to create a Google BigQuery private key in JSON format in -order to run Google BigQuery integration tests on your local machine and -on Travis-CI. The first step is to create a `service account -`__. - -Integration tests for ``pandas.io.gbq`` are skipped in pull requests because -the credentials that are required for running Google BigQuery integration -tests are `encrypted `__ -on Travis-CI and are only accessible from the pandas-dev/pandas repository. The -credentials won't be available on forks of pandas. Here are the steps to run -gbq integration tests on a forked repository: - -#. Go to `Travis CI `__ and sign in with your GitHub - account. -#. Click on the ``+`` icon next to the ``My Repositories`` list and enable - Travis builds for your fork. -#. Click on the gear icon to edit your travis build, and add two environment - variables: - - - ``GBQ_PROJECT_ID`` with the value being the ID of your BigQuery project. - - - ``SERVICE_ACCOUNT_KEY`` with the value being the contents of the JSON key - that you downloaded for your service account. Use single quotes around - your JSON key to ensure that it is treated as a string. - - For both environment variables, keep the "Display value in build log" option - DISABLED. These variables contain sensitive data and you do not want their - contents being exposed in build logs. -#. Your branch should be tested automatically once it is pushed. You can check - the status by visiting your Travis branches page which exists at the - following location: https://travis-ci.org/your-user-name/pandas/branches . - Click on a build job for your branch. Expand the following line in the - build log: ``ci/print_skipped.py /tmp/nosetests.xml`` . Search for the - term ``test_gbq`` and confirm that gbq integration tests are not skipped. - -Running the vbench performance test suite (phasing out) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Historically, *pandas* used `vbench library `_ -to enable easy monitoring of the performance of critical *pandas* operations. -These benchmarks are all found in the ``pandas/vb_suite`` directory. vbench -currently only works on python2. - -To install vbench:: - - pip install git+https://github.com/pydata/vbench - -Vbench also requires ``sqlalchemy``, ``gitpython``, and ``psutil``, which can all be installed -using pip. If you need to run a benchmark, change your directory to the *pandas* root and run:: - - ./test_perf.sh -b master -t HEAD - -This will check out the master revision and run the suite on both master and -your commit. Running the full test suite can take up to one hour and use up -to 3GB of RAM. Usually it is sufficient to paste a subset of the results into the Pull Request to show that the committed changes do not cause unexpected -performance regressions. - -You can run specific benchmarks using the ``-r`` flag, which takes a regular expression. - -See the `performance testing wiki `_ for information -on how to write a benchmark. - Documenting your code --------------------- @@ -852,12 +987,8 @@ updated. 
Pushing them to GitHub again is done by:: git push -f origin shiny-new-feature This will automatically update your pull request with the latest code and restart the -Travis-CI tests. +:ref:`Continuous Integration ` tests. -If your pull request is related to the ``pandas.io.gbq`` module, please see -the section on :ref:`Running Google BigQuery Integration Tests -` to configure a Google BigQuery service -account for your pull request on Travis-CI. Delete your merged branch (optional) ------------------------------------ diff --git a/doc/source/cookbook.rst b/doc/source/cookbook.rst index 3e84d15caf50b..62aa487069132 100644 --- a/doc/source/cookbook.rst +++ b/doc/source/cookbook.rst @@ -7,6 +7,7 @@ import pandas as pd import numpy as np + from pandas.compat import StringIO import random import os @@ -65,19 +66,19 @@ An if-then on one column .. ipython:: python - df.ix[df.AAA >= 5,'BBB'] = -1; df + df.loc[df.AAA >= 5,'BBB'] = -1; df An if-then with assignment to 2 columns: .. ipython:: python - df.ix[df.AAA >= 5,['BBB','CCC']] = 555; df + df.loc[df.AAA >= 5,['BBB','CCC']] = 555; df Add another line with different logic, to do the -else .. ipython:: python - df.ix[df.AAA < 5,['BBB','CCC']] = 2000; df + df.loc[df.AAA < 5,['BBB','CCC']] = 2000; df Or use pandas where after you've set up a mask @@ -107,10 +108,8 @@ Splitting df = pd.DataFrame( {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df - dflow = df[df.AAA <= 5] - dfhigh = df[df.AAA > 5] - - dflow; dfhigh + dflow = df[df.AAA <= 5]; dflow + dfhigh = df[df.AAA > 5]; dfhigh Building Criteria ***************** @@ -150,7 +149,7 @@ Building Criteria {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df aValue = 43.0 - df.ix[(df.CCC-aValue).abs().argsort()] + df.loc[(df.CCC-aValue).abs().argsort()] `Dynamically reduce a list of criteria using a binary operators `__ @@ -218,9 +217,9 @@ There are 2 explicit slicing methods, with a third general case df.loc['bar':'kar'] #Label - #Generic - df.ix[0:3] #Same as .iloc[0:3] - df.ix['bar':'kar'] #Same as .loc['bar':'kar'] + # Generic + df.iloc[0:3] + df.loc['bar':'kar'] Ambiguity arises when an index consists of integers with a non-zero start or non-unit increment. @@ -232,9 +231,6 @@ Ambiguity arises when an index consists of integers with a non-zero start or non df2.loc[1:3] #Label-oriented - df2.ix[1:3] #General, will mimic loc (label-oriented) - df2.ix[0:3] #General, will mimic iloc (position-oriented), as loc[0:3] would raise a KeyError - `Using inverse operator (~) to take the complement of a mask `__ @@ -441,7 +437,7 @@ Fill forward a reversed timeseries .. ipython:: python df = pd.DataFrame(np.random.randn(6,1), index=pd.date_range('2013-08-01', periods=6, freq='B'), columns=list('A')) - df.ix[3,'A'] = np.nan + df.loc[df.index[3], 'A'] = np.nan df df.reindex(df.index[::-1]).ffill() @@ -546,7 +542,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data') - sorted_df = df.ix[agg_n_sort_order.index] + sorted_df = df.loc[agg_n_sort_order.index] sorted_df @@ -909,14 +905,11 @@ CSV The :ref:`CSV ` docs -`read_csv in action `__ +`read_csv in action `__ `appending to a csv `__ -`how to read in multiple files, appending to create a single dataframe -`__ - `Reading a csv chunk-by-chunk `__ @@ -947,6 +940,42 @@ using that handle to read. `Write a multi-row index CSV without writing duplicates `__ +.. 
_cookbook.csv.multiple_files: + +Reading multiple files to create a single DataFrame +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The best way to combine multiple files into a single DataFrame is to read the individual frames one by one, put all +of the individual frames into a list, and then combine the frames in the list using :func:`pd.concat`: + +.. ipython:: python + + for i in range(3): + data = pd.DataFrame(np.random.randn(10, 4)) + data.to_csv('file_{}.csv'.format(i)) + + files = ['file_0.csv', 'file_1.csv', 'file_2.csv'] + result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True) + +You can use the same approach to read all files matching a pattern. Here is an example using ``glob``: + +.. ipython:: python + + import glob + files = glob.glob('file_*.csv') + result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True) + +Finally, this strategy will work with the other ``pd.read_*(...)`` functions described in the :ref:`io docs`. + +.. ipython:: python + :suppress: + + for i in range(3): + os.remove('file_{}.csv'.format(i)) + +Parsing date components in multi-columns +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + Parsing date components in multi-columns is faster with a format .. code-block:: python @@ -987,9 +1016,6 @@ Skip row between header and data .. ipython:: python - from io import StringIO - import pandas as pd - data = """;;;; ;;;; ;;;; @@ -1016,7 +1042,7 @@ Option 1: pass rows explicitly to skiprows .. ipython:: python - pd.read_csv(StringIO(data.decode('UTF-8')), sep=';', skiprows=[11,12], + pd.read_csv(StringIO(data), sep=';', skiprows=[11,12], index_col=0, parse_dates=True, header=10) Option 2: read column names and then data @@ -1024,15 +1050,12 @@ Option 2: read column names and then data .. ipython:: python - pd.read_csv(StringIO(data.decode('UTF-8')), sep=';', - header=10, parse_dates=True, nrows=10).columns - columns = pd.read_csv(StringIO(data.decode('UTF-8')), sep=';', - header=10, parse_dates=True, nrows=10).columns - pd.read_csv(StringIO(data.decode('UTF-8')), sep=';', + pd.read_csv(StringIO(data), sep=';', header=10, nrows=10).columns + columns = pd.read_csv(StringIO(data), sep=';', header=10, nrows=10).columns + pd.read_csv(StringIO(data), sep=';', index_col=0, header=12, parse_dates=True, names=columns) - .. _cookbook.sql: SQL diff --git a/doc/source/dsintro.rst b/doc/source/dsintro.rst index cc69367017aed..3c6572229802d 100644 --- a/doc/source/dsintro.rst +++ b/doc/source/dsintro.rst @@ -153,7 +153,7 @@ Vectorized operations and label alignment with Series ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When doing data analysis, as with raw NumPy arrays looping through Series -value-by-value is usually not necessary. Series can be also be passed into most +value-by-value is usually not necessary. Series can also be passed into most NumPy methods expecting an ndarray. @@ -763,6 +763,11 @@ completion mechanism so they can be tab-completed: Panel ----- +.. warning:: + + In 0.20.0, ``Panel`` is deprecated and will be removed in + a future version. See the section :ref:`Deprecate Panel `. + Panel is a somewhat less-used, but still important container for 3-dimensional data. The term `panel data `__ is derived from econometrics and is partially responsible for the name pandas: @@ -783,6 +788,7 @@ From 3D ndarray with optional axis labels ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. 
ipython:: python
+   :okwarning:

   wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
                 major_axis=pd.date_range('1/1/2000', periods=5),
@@ -794,6 +800,7 @@ From dict of DataFrame objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. ipython:: python
+   :okwarning:

   data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
           'Item2' : pd.DataFrame(np.random.randn(4, 2))}
@@ -816,6 +823,7 @@ dictionary of DataFrames as above, and the following named parameters:
For example, compare to the construction above:

.. ipython:: python
+   :okwarning:

   pd.Panel.from_dict(data, orient='minor')

@@ -824,6 +832,7 @@ DataFrame objects with mixed-type columns, all of the data will get upcasted to
``dtype=object`` unless you pass ``orient='minor'``:

.. ipython:: python
+   :okwarning:

   df = pd.DataFrame({'a': ['foo', 'bar', 'baz'],
                      'b': np.random.randn(3)})
@@ -851,6 +860,7 @@ This method was introduced in v0.7 to replace ``LongPanel.to_long``, and convert
a DataFrame with a two-level index to a Panel.

.. ipython:: python
+   :okwarning:

   midx = pd.MultiIndex(levels=[['one', 'two'], ['x','y']], labels=[[1,1,0,0],[1,0,1,0]])
   df = pd.DataFrame({'A' : [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=midx)
@@ -880,6 +890,7 @@ A Panel can be rearranged using its ``transpose`` method (which does not make a
copy by default unless the data are heterogeneous):

.. ipython:: python
+   :okwarning:

   wp.transpose(2, 0, 1)

@@ -909,6 +920,7 @@ Squeezing
Another way to change the dimensionality of an object is to ``squeeze`` a 1-len object, similar to ``wp['Item1']``

.. ipython:: python
+   :okwarning:

   wp.reindex(items=['Item1']).squeeze()
   wp.reindex(items=['Item1'], minor=['B']).squeeze()

@@ -923,12 +935,56 @@ for more on this.
To convert a Panel to a DataFrame, use the ``to_frame`` method:

.. ipython:: python
+   :okwarning:

   panel = pd.Panel(np.random.randn(3, 5, 4), items=['one', 'two', 'three'],
                    major_axis=pd.date_range('1/1/2000', periods=5),
                    minor_axis=['a', 'b', 'c', 'd'])
   panel.to_frame()

+
+.. _dsintro.deprecate_panel:
+
+Deprecate Panel
+---------------
+
+Over the last few years, pandas has increased in both breadth and depth, with new features,
+datatype support, and manipulation routines. As a result, supporting efficient indexing and functional
+routines for ``Series``, ``DataFrame`` and ``Panel`` has contributed to an increasingly fragmented and
+difficult-to-understand codebase.
+
+The 3-D structure of a ``Panel`` is much less common for many types of data analysis
+than the 1-D of the ``Series`` or the 2-D of the ``DataFrame``. Going forward it makes sense for
+pandas to focus on these areas exclusively.
+
+Oftentimes, one can simply use a MultiIndex ``DataFrame`` for easily working with higher dimensional data.
+
+In addition, the ``xarray`` package was built from the ground up, specifically in order to
+support the multi-dimensional analysis that is one of the main use cases of ``Panel``.
+`Here is a link to the xarray panel-transition documentation `__.
+
+.. ipython:: python
+   :okwarning:
+
+   p = tm.makePanel()
+   p
+
+Convert to a MultiIndex DataFrame:
+
+.. ipython:: python
+   :okwarning:
+
+   p.to_frame()
+
+Alternatively, one can convert to an xarray ``DataArray``.
+
+.. ipython:: python
+   :okwarning:
+
+   p.to_xarray()
+
+You can see the full documentation for the `xarray package `__.
+
.. _dsintro.panelnd:
..
_dsintro.panel4d: diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst index 087b265ee83f2..31849fc142aea 100644 --- a/doc/source/ecosystem.rst +++ b/doc/source/ecosystem.rst @@ -93,12 +93,13 @@ targets the IPython Notebook environment. `Plotly’s `__ `Python API `__ enables interactive figures and web shareability. Maps, 2D, 3D, and live-streaming graphs are rendered with WebGL and `D3.js `__. The library supports plotting directly from a pandas DataFrame and cloud-based collaboration. Users of `matplotlib, ggplot for Python, and Seaborn `__ can convert figures into interactive web-based plots. Plots can be drawn in `IPython Notebooks `__ , edited with R or MATLAB, modified in a GUI, or embedded in apps and dashboards. Plotly is free for unlimited sharing, and has `cloud `__, `offline `__, or `on-premise `__ accounts for private use. -`Pandas-Qt `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +`QtPandas `__ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Spun off from the main pandas library, the `Pandas-Qt `__ +Spun off from the main pandas library, the `qtpandas `__ library enables DataFrame visualization and manipulation in PyQt4 and PySide applications. + .. _ecosystem.ide: IDE @@ -172,13 +173,15 @@ This package requires valid credentials for this API (non free). `pandaSDMX `__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -pandaSDMX is an extensible library to retrieve and acquire statistical data +pandaSDMX is a library to retrieve and acquire statistical data and metadata disseminated in -`SDMX `_ 2.1. This standard is currently supported by -the European statistics office (Eurostat) -and the European Central Bank (ECB). Datasets may be returned as pandas Series -or multi-indexed DataFrames. - +`SDMX `_ 2.1, an ISO-standard +widely used by institutions such as statistics offices, central banks, +and international organisations. pandaSDMX can expose datasets and related +structural metadata including dataflows, code-lists, +and datastructure definitions as pandas Series +or multi-indexed DataFrames. + `fredapi `__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fredapi is a Python interface to the `Federal Reserve Economic Data (FRED) `__ diff --git a/doc/source/faq.rst b/doc/source/faq.rst deleted file mode 100644 index d23e0ca59254d..0000000000000 --- a/doc/source/faq.rst +++ /dev/null @@ -1,115 +0,0 @@ -.. currentmodule:: pandas -.. _faq: - -******************************** -Frequently Asked Questions (FAQ) -******************************** - -.. ipython:: python - :suppress: - - import numpy as np - np.random.seed(123456) - np.set_printoptions(precision=4, suppress=True) - import pandas as pd - pd.options.display.max_rows = 15 - import matplotlib - matplotlib.style.use('ggplot') - import matplotlib.pyplot as plt - plt.close('all') - -.. _df-memory-usage: - -DataFrame memory usage ----------------------- -As of pandas version 0.15.0, the memory usage of a dataframe (including -the index) is shown when accessing the ``info`` method of a dataframe. A -configuration option, ``display.memory_usage`` (see :ref:`options`), -specifies if the dataframe's memory usage will be displayed when -invoking the ``df.info()`` method. - -For example, the memory usage of the dataframe below is shown -when calling ``df.info()``: - -.. 
ipython:: python - - dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]', - 'complex128', 'object', 'bool'] - n = 5000 - data = dict([ (t, np.random.randint(100, size=n).astype(t)) - for t in dtypes]) - df = pd.DataFrame(data) - df['categorical'] = df['object'].astype('category') - - df.info() - -The ``+`` symbol indicates that the true memory usage could be higher, because -pandas does not count the memory used by values in columns with -``dtype=object``. - -.. versionadded:: 0.17.1 - -Passing ``memory_usage='deep'`` will enable a more accurate memory usage report, -that accounts for the full usage of the contained objects. This is optional -as it can be expensive to do this deeper introspection. - -.. ipython:: python - - df.info(memory_usage='deep') - -By default the display option is set to ``True`` but can be explicitly -overridden by passing the ``memory_usage`` argument when invoking ``df.info()``. - -The memory usage of each column can be found by calling the ``memory_usage`` -method. This returns a Series with an index represented by column names -and memory usage of each column shown in bytes. For the dataframe above, -the memory usage of each column and the total memory usage of the -dataframe can be found with the memory_usage method: - -.. ipython:: python - - df.memory_usage() - - # total memory usage of dataframe - df.memory_usage().sum() - -By default the memory usage of the dataframe's index is shown in the -returned Series, the memory usage of the index can be suppressed by passing -the ``index=False`` argument: - -.. ipython:: python - - df.memory_usage(index=False) - -The memory usage displayed by the ``info`` method utilizes the -``memory_usage`` method to determine the memory usage of a dataframe -while also formatting the output in human-readable units (base-2 -representation; i.e., 1KB = 1024 bytes). - -See also :ref:`Categorical Memory Usage `. - -Byte-Ordering Issues --------------------- -Occasionally you may have to deal with data that were created on a machine with -a different byte order than the one on which you are running Python. To deal -with this issue you should convert the underlying NumPy array to the native -system byte order *before* passing it to Series/DataFrame/Panel constructors -using something similar to the following: - -.. ipython:: python - - x = np.array(list(range(10)), '>i4') # big endian - newx = x.byteswap().newbyteorder() # force native byteorder - s = pd.Series(newx) - -See `the NumPy documentation on byte order -`__ for more -details. - - -Visualizing Data in Qt applications ------------------------------------ - -There is no support for such visualization in pandas. However, the external -package `pandas-qt `_ does -provide this functionality. diff --git a/doc/source/gotchas.rst b/doc/source/gotchas.rst index cfac5c257184d..11827fe2776cf 100644 --- a/doc/source/gotchas.rst +++ b/doc/source/gotchas.rst @@ -1,18 +1,92 @@ .. currentmodule:: pandas .. _gotchas: +******************************** +Frequently Asked Questions (FAQ) +******************************** + .. ipython:: python :suppress: import numpy as np + np.random.seed(123456) np.set_printoptions(precision=4, suppress=True) import pandas as pd - pd.options.display.max_rows=15 + pd.options.display.max_rows = 15 + import matplotlib + matplotlib.style.use('ggplot') + import matplotlib.pyplot as plt + plt.close('all') + +.. 
_df-memory-usage: + +DataFrame memory usage +---------------------- +As of pandas version 0.15.0, the memory usage of a dataframe (including +the index) is shown when accessing the ``info`` method of a dataframe. A +configuration option, ``display.memory_usage`` (see :ref:`options`), +specifies if the dataframe's memory usage will be displayed when +invoking the ``df.info()`` method. + +For example, the memory usage of the dataframe below is shown +when calling ``df.info()``: + +.. ipython:: python + + dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]', + 'complex128', 'object', 'bool'] + n = 5000 + data = dict([ (t, np.random.randint(100, size=n).astype(t)) + for t in dtypes]) + df = pd.DataFrame(data) + df['categorical'] = df['object'].astype('category') + + df.info() + +The ``+`` symbol indicates that the true memory usage could be higher, because +pandas does not count the memory used by values in columns with +``dtype=object``. + +.. versionadded:: 0.17.1 + +Passing ``memory_usage='deep'`` will enable a more accurate memory usage report, +that accounts for the full usage of the contained objects. This is optional +as it can be expensive to do this deeper introspection. + +.. ipython:: python + + df.info(memory_usage='deep') + +By default the display option is set to ``True`` but can be explicitly +overridden by passing the ``memory_usage`` argument when invoking ``df.info()``. + +The memory usage of each column can be found by calling the ``memory_usage`` +method. This returns a Series with an index represented by column names +and memory usage of each column shown in bytes. For the dataframe above, +the memory usage of each column and the total memory usage of the +dataframe can be found with the memory_usage method: + +.. ipython:: python + df.memory_usage() -******************* -Caveats and Gotchas -******************* + # total memory usage of dataframe + df.memory_usage().sum() + +By default the memory usage of the dataframe's index is shown in the +returned Series, the memory usage of the index can be suppressed by passing +the ``index=False`` argument: + +.. ipython:: python + + df.memory_usage(index=False) + +The memory usage displayed by the ``info`` method utilizes the +``memory_usage`` method to determine the memory usage of a dataframe +while also formatting the output in human-readable units (base-2 +representation; i.e., 1KB = 1024 bytes). + +See also :ref:`Categorical Memory Usage `. .. _gotchas.truth: @@ -214,227 +288,6 @@ and traded integer ``NA`` capability for a much simpler approach of using a special value in float and object arrays to denote ``NA``, and promoting integer arrays to floating when NAs must be introduced. -Integer indexing ----------------- - -Label-based indexing with integer axis labels is a thorny topic. It has been -discussed heavily on mailing lists and among various members of the scientific -Python community. In pandas, our general viewpoint is that labels matter more -than integer locations. Therefore, with an integer axis index *only* -label-based indexing is possible with the standard tools like ``.ix``. The -following code will generate exceptions: - -.. code-block:: python - - s = pd.Series(range(5)) - s[-1] - df = pd.DataFrame(np.random.randn(5, 4)) - df - df.ix[-2:] - -This deliberate decision was made to prevent ambiguities and subtle bugs (many -users reported finding bugs when the API change was made to stop "falling back" -on position-based indexing). 
- -Label-based slicing conventions -------------------------------- - -Non-monotonic indexes require exact matches -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -If the index of a ``Series`` or ``DataFrame`` is monotonically increasing or decreasing, then the bounds -of a label-based slice can be outside the range of the index, much like slice indexing a -normal Python ``list``. Monotonicity of an index can be tested with the ``is_monotonic_increasing`` and -``is_monotonic_decreasing`` attributes. - -.. ipython:: python - - df = pd.DataFrame(index=[2,3,3,4,5], columns=['data'], data=range(5)) - df.index.is_monotonic_increasing - - # no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4: - df.loc[0:4, :] - - # slice is are outside the index, so empty DataFrame is returned - df.loc[13:15, :] - -On the other hand, if the index is not monotonic, then both slice bounds must be -*unique* members of the index. - -.. ipython:: python - - df = pd.DataFrame(index=[2,3,1,4,3,5], columns=['data'], data=range(6)) - df.index.is_monotonic_increasing - - # OK because 2 and 4 are in the index - df.loc[2:4, :] - -.. code-block:: python - - # 0 is not in the index - In [9]: df.loc[0:4, :] - KeyError: 0 - - # 3 is not a unique label - In [11]: df.loc[2:3, :] - KeyError: 'Cannot get right slice bound for non-unique label: 3' - - -Endpoints are inclusive -~~~~~~~~~~~~~~~~~~~~~~~ - -Compared with standard Python sequence slicing in which the slice endpoint is -not inclusive, label-based slicing in pandas **is inclusive**. The primary -reason for this is that it is often not possible to easily determine the -"successor" or next element after a particular label in an index. For example, -consider the following Series: - -.. ipython:: python - - s = pd.Series(np.random.randn(6), index=list('abcdef')) - s - -Suppose we wished to slice from ``c`` to ``e``, using integers this would be - -.. ipython:: python - - s[2:5] - -However, if you only had ``c`` and ``e``, determining the next element in the -index can be somewhat complicated. For example, the following does not work: - -:: - - s.ix['c':'e'+1] - -A very common use case is to limit a time series to start and end at two -specific dates. To enable this, we made the design design to make label-based -slicing include both endpoints: - -.. ipython:: python - - s.ix['c':'e'] - -This is most definitely a "practicality beats purity" sort of thing, but it is -something to watch out for if you expect label-based slicing to behave exactly -in the way that standard Python integer slicing works. - -Miscellaneous indexing gotchas ------------------------------- - -Reindex versus ix gotchas -~~~~~~~~~~~~~~~~~~~~~~~~~ - -Many users will find themselves using the ``ix`` indexing capabilities as a -concise means of selecting data from a pandas object: - -.. ipython:: python - - df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'], - index=list('abcdef')) - df - df.ix[['b', 'c', 'e']] - -This is, of course, completely equivalent *in this case* to using the -``reindex`` method: - -.. ipython:: python - - df.reindex(['b', 'c', 'e']) - -Some might conclude that ``ix`` and ``reindex`` are 100% equivalent based on -this. This is indeed true **except in the case of integer indexing**. For -example, the above operation could alternately have been expressed as: - -.. ipython:: python - - df.ix[[1, 2, 4]] - -If you pass ``[1, 2, 4]`` to ``reindex`` you will get another thing entirely: - -.. 
ipython:: python - - df.reindex([1, 2, 4]) - -So it's important to remember that ``reindex`` is **strict label indexing -only**. This can lead to some potentially surprising results in pathological -cases where an index contains, say, both integers and strings: - -.. ipython:: python - - s = pd.Series([1, 2, 3], index=['a', 0, 1]) - s - s.ix[[0, 1]] - s.reindex([0, 1]) - -Because the index in this case does not contain solely integers, ``ix`` falls -back on integer indexing. By contrast, ``reindex`` only looks for the values -passed in the index, thus finding the integers ``0`` and ``1``. While it would -be possible to insert some logic to check whether a passed sequence is all -contained in the index, that logic would exact a very high cost in large data -sets. - -Reindex potentially changes underlying Series dtype -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The use of ``reindex_like`` can potentially change the dtype of a ``Series``. - -.. ipython:: python - - series = pd.Series([1, 2, 3]) - x = pd.Series([True]) - x.dtype - x = pd.Series([True]).reindex_like(series) - x.dtype - -This is because ``reindex_like`` silently inserts ``NaNs`` and the ``dtype`` -changes accordingly. This can cause some issues when using ``numpy`` ``ufuncs`` -such as ``numpy.logical_and``. - -See the `this old issue `__ for a more -detailed discussion. - -Parsing Dates from Text Files ------------------------------ - -When parsing multiple text file columns into a single date column, the new date -column is prepended to the data and then `index_col` specification is indexed off -of the new set of columns rather than the original ones: - -.. ipython:: python - :suppress: - - data = ("KORD,19990127, 19:00:00, 18:56:00, 0.8100\n" - "KORD,19990127, 20:00:00, 19:56:00, 0.0100\n" - "KORD,19990127, 21:00:00, 20:56:00, -0.5900\n" - "KORD,19990127, 21:00:00, 21:18:00, -0.9900\n" - "KORD,19990127, 22:00:00, 21:56:00, -0.5900\n" - "KORD,19990127, 23:00:00, 22:56:00, -0.5900") - - with open('tmp.csv', 'w') as fh: - fh.write(data) - -.. ipython:: python - - print(open('tmp.csv').read()) - - date_spec = {'nominal': [1, 2], 'actual': [1, 3]} - df = pd.read_csv('tmp.csv', header=None, - parse_dates=date_spec, - keep_date_col=True, - index_col=0) - - # index_col=0 refers to the combined column "nominal" and not the original - # first column of 'KORD' strings - - df - -.. ipython:: python - :suppress: - - import os - os.remove('tmp.csv') - Differences with NumPy ---------------------- @@ -455,115 +308,6 @@ where the data copying occurs. See `this link `__ for more information. -.. _html-gotchas: - -HTML Table Parsing ------------------- -There are some versioning issues surrounding the libraries that are used to -parse HTML tables in the top-level pandas io function ``read_html``. - -**Issues with** |lxml|_ - - * Benefits - - * |lxml|_ is very fast - - * |lxml|_ requires Cython to install correctly. - - * Drawbacks - - * |lxml|_ does *not* make any guarantees about the results of its parse - *unless* it is given |svm|_. - - * In light of the above, we have chosen to allow you, the user, to use the - |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_ - fails to parse - - * It is therefore *highly recommended* that you install both - |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid - result (provided everything else is valid) even if |lxml|_ fails. 
- -**Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend** - - * The above issues hold here as well since |BeautifulSoup4|_ is essentially - just a wrapper around a parser backend. - -**Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend** - - * Benefits - - * |html5lib|_ is far more lenient than |lxml|_ and consequently deals - with *real-life markup* in a much saner way rather than just, e.g., - dropping an element without notifying you. - - * |html5lib|_ *generates valid HTML5 markup from invalid markup - automatically*. This is extremely important for parsing HTML tables, - since it guarantees a valid document. However, that does NOT mean that - it is "correct", since the process of fixing markup does not have a - single definition. - - * |html5lib|_ is pure Python and requires no additional build steps beyond - its own installation. - - * Drawbacks - - * The biggest drawback to using |html5lib|_ is that it is slow as - molasses. However consider the fact that many tables on the web are not - big enough for the parsing algorithm runtime to matter. It is more - likely that the bottleneck will be in the process of reading the raw - text from the URL over the web, i.e., IO (input-output). For very large - tables, this might not be true. - -**Issues with using** |Anaconda|_ - - * `Anaconda`_ ships with `lxml`_ version 3.2.0; the following workaround for - `Anaconda`_ was successfully used to deal with the versioning issues - surrounding `lxml`_ and `BeautifulSoup4`_. - - .. note:: - - Unless you have *both*: - - * A strong restriction on the upper bound of the runtime of some code - that incorporates :func:`~pandas.io.html.read_html` - * Complete knowledge that the HTML you will be parsing will be 100% - valid at all times - - then you should install `html5lib`_ and things will work swimmingly - without you having to muck around with `conda`. If you want the best of - both worlds then install both `html5lib`_ and `lxml`_. If you do install - `lxml`_ then you need to perform the following commands to ensure that - lxml will work correctly: - - .. code-block:: sh - - # remove the included version - conda remove lxml - - # install the latest version of lxml - pip install 'git+git://github.com/lxml/lxml.git' - - # install the latest version of beautifulsoup4 - pip install 'bzr+lp:beautifulsoup' - - Note that you need `bzr `__ and `git - `__ installed to perform the last two operations. - -.. |svm| replace:: **strictly valid markup** -.. _svm: http://validator.w3.org/docs/help.html#validation_basics - -.. |html5lib| replace:: **html5lib** -.. _html5lib: https://github.com/html5lib/html5lib-python - -.. |BeautifulSoup4| replace:: **BeautifulSoup4** -.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup - -.. |lxml| replace:: **lxml** -.. _lxml: http://lxml.de - -.. |Anaconda| replace:: **Anaconda** -.. _Anaconda: https://store.continuum.io/cshop/anaconda - Byte-Ordering Issues -------------------- diff --git a/doc/source/groupby.rst b/doc/source/groupby.rst index f3fcd6901a440..cf4f1059ae17a 100644 --- a/doc/source/groupby.rst +++ b/doc/source/groupby.rst @@ -94,11 +94,21 @@ The mapping can be specified many different ways: - For DataFrame objects, a string indicating a column to be used to group. Of course ``df.groupby('A')`` is just syntactic sugar for ``df.groupby(df['A'])``, but it makes life simpler + - For DataFrame objects, a string indicating an index level to be used to group. 
- A list of any of the above things Collectively we refer to the grouping objects as the **keys**. For example, consider the following DataFrame: +.. note:: + + .. versionadded:: 0.20 + + A string passed to ``groupby`` may refer to either a column or an index level. + If a string matches both a column name and an index level name then a warning is + issued and the column takes precedence. This will result in an ambiguity error + in a future version. + .. ipython:: python df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', @@ -237,17 +247,6 @@ the length of the ``groups`` dict, so it is largely just a convenience: gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight - -.. ipython:: python - :suppress: - - df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', - 'foo', 'bar', 'foo', 'foo'], - 'B' : ['one', 'one', 'two', 'three', - 'two', 'two', 'one', 'three'], - 'C' : np.random.randn(8), - 'D' : np.random.randn(8)}) - .. _groupby.multiindex: GroupBy with MultiIndex @@ -289,7 +288,9 @@ chosen level: s.sum(level='second') -Also as of v0.6, grouping with multiple levels is supported. +.. versionadded:: 0.6 + +Grouping with multiple levels is supported. .. ipython:: python :suppress: @@ -306,8 +307,56 @@ Also as of v0.6, grouping with multiple levels is supported. s s.groupby(level=['first', 'second']).sum() +.. versionadded:: 0.20 + +Index level names may be supplied as keys. + +.. ipython:: python + + s.groupby(['first', 'second']).sum() + More on the ``sum`` function and aggregation later. +Grouping DataFrame with Index Levels and Columns +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +A DataFrame may be grouped by a combination of columns and index levels by +specifying the column names as strings and the index levels as ``pd.Grouper`` +objects. + +.. ipython:: python + + arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], + ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] + + index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second']) + + df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3], + 'B': np.arange(8)}, + index=index) + + df + +The following example groups ``df`` by the ``second`` index level and +the ``A`` column. + +.. ipython:: python + + df.groupby([pd.Grouper(level=1), 'A']).sum() + +Index levels may also be specified by name. + +.. ipython:: python + + df.groupby([pd.Grouper(level='second'), 'A']).sum() + +.. versionadded:: 0.20 + +Index level names may be specified as keys directly to ``groupby``. + +.. ipython:: python + + df.groupby(['second', 'A']).sum() + DataFrame column selection in GroupBy ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -315,6 +364,16 @@ Once you have created the GroupBy object from a DataFrame, for example, you might want to do something different for each of the columns. Thus, using ``[]`` similar to getting a column from a DataFrame, you can do: +.. ipython:: python + :suppress: + + df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', + 'foo', 'bar', 'foo', 'foo'], + 'B' : ['one', 'one', 'two', 'three', + 'two', 'two', 'one', 'three'], + 'C' : np.random.randn(8), + 'D' : np.random.randn(8)}) + .. ipython:: python grouped = df.groupby(['A']) @@ -380,7 +439,9 @@ Aggregation ----------- Once the GroupBy object has been created, several methods are available to -perform a computation on the grouped data. 
+perform a computation on the grouped data. These operations are similar to the +:ref:`aggregating API `, :ref:`window functions API `, +and :ref:`resample API `. An obvious one is aggregation via the ``aggregate`` or equivalently ``agg`` method: @@ -443,7 +504,7 @@ index are the group names and whose values are the sizes of each group. Applying multiple functions at once ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -With grouped Series you can also pass a list or dict of functions to do +With grouped ``Series`` you can also pass a list or dict of functions to do aggregation with, outputting a DataFrame: .. ipython:: python @@ -451,23 +512,35 @@ aggregation with, outputting a DataFrame: grouped = df.groupby('A') grouped['C'].agg([np.sum, np.mean, np.std]) -If a dict is passed, the keys will be used to name the columns. Otherwise the -function's name (stored in the function object) will be used. +On a grouped ``DataFrame``, you can pass a list of functions to apply to each +column, which produces an aggregated result with a hierarchical index: .. ipython:: python - grouped['D'].agg({'result1' : np.sum, - 'result2' : np.mean}) + grouped.agg([np.sum, np.mean, np.std]) -On a grouped DataFrame, you can pass a list of functions to apply to each -column, which produces an aggregated result with a hierarchical index: + +The resulting aggregations are named for the functions themselves. If you +need to rename, then you can add in a chained operation for a ``Series`` like this: .. ipython:: python - grouped.agg([np.sum, np.mean, np.std]) + (grouped['C'].agg([np.sum, np.mean, np.std]) + .rename(columns={'sum': 'foo', + 'mean': 'bar', + 'std': 'baz'}) + ) + +For a grouped ``DataFrame``, you can rename in a similar manner: + +.. ipython:: python + + (grouped.agg([np.sum, np.mean, np.std]) + .rename(columns={'sum': 'foo', + 'mean': 'bar', + 'std': 'baz'}) + ) -Passing a dict of functions has different behavior by default, see the next -section. Applying different functions to DataFrame columns ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -521,9 +594,21 @@ Transformation -------------- The ``transform`` method returns an object that is indexed the same (same size) -as the one being grouped. Thus, the passed transform function should return a -result that is the same size as the group chunk. For example, suppose we wished -to standardize the data within each group: +as the one being grouped. The transform function must: + +* Return a result that is either the same size as the group chunk or + broadcastable to the size of the group chunk (e.g., a scalar, + ``grouped.transform(lambda x: x.iloc[-1])``). +* Operate column-by-column on the group chunk. The transform is applied to + the first group chunk using chunk.apply. +* Not perform in-place operations on the group chunk. Group chunks should + be treated as immutable, and changes to a group chunk may produce unexpected + results. For example, when using ``fillna``, ``inplace`` must be ``False`` + (``grouped.transform(lambda x: x.fillna(inplace=False))``). +* (Optionally) operates on the entire group chunk. If this is supported, a + fast path is used starting from the *second* chunk. + +For example, suppose we wished to standardize the data within each group: .. ipython:: python @@ -561,6 +646,21 @@ We can also visually compare the original and transformed data sets. @savefig groupby_transform_plot.png compare.plot() +Transformation functions that have lower dimension outputs are broadcast to +match the shape of the input array. + +.. 
+.. ipython:: python
+
+   data_range = lambda x: x.max() - x.min()
+   ts.groupby(key).transform(data_range)
+
+Alternatively, the built-in methods could be used to produce the same
+outputs:
+
+.. ipython:: python
+
+   ts.groupby(key).transform('max') - ts.groupby(key).transform('min')
+
 Another common data transform is to replace missing data with the group mean.
 
 .. ipython:: python
@@ -605,8 +705,9 @@ and that the transformed data contains no NAs.
 
 .. note::
 
-    Some functions when applied to a groupby object will automatically transform the input, returning
-    an object of the same shape as the original. Passing ``as_index=False`` will not affect these transformation methods.
+    Some functions when applied to a groupby object will automatically transform
+    the input, returning an object of the same shape as the original. Passing
+    ``as_index=False`` will not affect these transformation methods.
 
     For example: ``fillna, ffill, bfill, shift``.
 
@@ -751,7 +852,7 @@ next). This enables some operations to be carried out rather succinctly:
 
     tsdf = pd.DataFrame(np.random.randn(1000, 3),
                         index=pd.date_range('1/1/2000', periods=1000),
                         columns=['A', 'B', 'C'])
-    tsdf.ix[::2] = np.nan
+    tsdf.iloc[::2] = np.nan
     grouped = tsdf.groupby(lambda x: x.year)
     grouped.fillna(method='pad')
diff --git a/doc/source/index.rst.template b/doc/source/index.rst.template
index 1996ad75ea92a..0bfb2b635f53a 100644
--- a/doc/source/index.rst.template
+++ b/doc/source/index.rst.template
@@ -14,9 +14,9 @@ pandas: powerful Python data analysis toolkit
 
 **Binary Installers:** http://pypi.python.org/pypi/pandas
 
-**Source Repository:** http://github.com/pydata/pandas
+**Source Repository:** http://github.com/pandas-dev/pandas
 
-**Issues & Ideas:** https://github.com/pydata/pandas/issues
+**Issues & Ideas:** https://github.com/pandas-dev/pandas/issues
 
 **Q&A Support:** http://stackoverflow.com/questions/tagged/pandas
 
@@ -116,7 +116,6 @@ See the package overview for more detail about what's in the library.
    whatsnew
    install
    contributing
-   faq
    overview
    10min
    tutorials
diff --git a/doc/source/indexing.rst b/doc/source/indexing.rst
index 0a6691936d97d..f988fb7cd6806 100644
--- a/doc/source/indexing.rst
+++ b/doc/source/indexing.rst
@@ -61,13 +61,15 @@ See the :ref:`MultiIndex / Advanced Indexing ` for ``MultiIndex`` and
 
 See the :ref:`cookbook` for some advanced strategies
 
+.. _indexing.choice:
+
 Different Choices for Indexing
 ------------------------------
 
 .. versionadded:: 0.11.0
 
 Object selection has had a number of user-requested additions in order to
-support more explicit location based indexing. pandas now supports three types
+support more explicit location based indexing. Pandas now supports three types
 of multi-axis indexing.
 
 - ``.loc`` is primarily label based, but may also be used with a boolean array. ``.loc`` will raise ``KeyError`` when the items are not found. Allowed inputs are:
@@ -104,24 +106,13 @@ of multi-axis indexing.
 
   See more at :ref:`Selection by Position `
 
-- ``.ix`` supports mixed integer and label based access. It is primarily label
-  based, but will fall back to integer positional access unless the corresponding
-  axis is of integer type. ``.ix`` is the most general and will
-  support any of the inputs in ``.loc`` and ``.iloc``. ``.ix`` also supports floating point
-  label schemes. ``.ix`` is exceptionally useful when dealing with mixed positional
-  and label based hierarchical indexes.
-
-  However, when an axis is integer based, ONLY
-  label based access and not positional access is supported.
-  Thus, in such cases, it's usually better to be explicit and use ``.iloc`` or ``.loc``.
-
   See more at :ref:`Advanced Indexing ` and :ref:`Advanced
   Hierarchical `.
 
-- ``.loc``, ``.iloc``, ``.ix`` and also ``[]`` indexing can accept a ``callable`` as indexer. See more at :ref:`Selection By Callable `.
+- ``.loc``, ``.iloc``, and also ``[]`` indexing can accept a ``callable`` as indexer. See more at :ref:`Selection By Callable `.
 
 Getting values from an object with multi-axes selection uses the following
-notation (using ``.loc`` as an example, but applies to ``.iloc`` and ``.ix`` as
+notation (using ``.loc`` as an example, but applies to ``.iloc`` as
 well). Any of the axes accessors may be the null slice ``:``. Axes left out of
 the specification are assumed to be ``:``. (e.g. ``p.loc['a']`` is equiv to
 ``p.loc['a', :, :]``)
 
@@ -193,7 +184,7 @@ columns.
 
 .. warning::
 
-   pandas aligns all AXES when setting ``Series`` and ``DataFrame`` from ``.loc``, ``.iloc`` and ``.ix``.
+   pandas aligns all AXES when setting ``Series`` and ``DataFrame`` from ``.loc``, and ``.iloc``.
 
 This will **not** modify ``df`` because the column alignment is before value assignment.
 
@@ -410,7 +401,7 @@ Selection By Position
 
     This is sometimes called ``chained assignment`` and should be avoided. See :ref:`Returning a View versus Copy `
 
-pandas provides a suite of methods in order to get **purely integer based indexing**. The semantics follow closely python and numpy slicing. These are ``0-based`` indexing. When slicing, the start bounds is *included*, while the upper bound is *excluded*. Trying to use a non-integer, even a **valid** label will raise a ``IndexError``.
+Pandas provides a suite of methods in order to get **purely integer based indexing**. The semantics closely follow Python and NumPy slicing. These are ``0-based`` indexing. When slicing, the start bound is *included*, while the upper bound is *excluded*. Trying to use a non-integer, even a **valid** label, will raise an ``IndexError``.
 
 The ``.iloc`` attribute is the primary access method. The following are valid inputs:
 
@@ -526,7 +517,7 @@ Selection By Callable
 
 .. versionadded:: 0.18.1
 
-``.loc``, ``.iloc``, ``.ix`` and also ``[]`` indexing can accept a ``callable`` as indexer.
+``.loc``, ``.iloc``, and also ``[]`` indexing can accept a ``callable`` as indexer.
 The ``callable`` must be a function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing.
 
 .. ipython:: python
@@ -559,6 +550,63 @@ without using temporary variable.
 
    (bb.groupby(['year', 'team']).sum()
       .loc[lambda df: df.r > 100])
 
+.. _indexing.deprecate_ix:
+
+IX Indexer is Deprecated
+------------------------
+
+.. warning::
+
+   Starting in 0.20.0, the ``.ix`` indexer is deprecated, in favor of the more strict ``.iloc``
+   and ``.loc`` indexers.
+
+``.ix`` offers a lot of magic on the inference of what the user wants to do. To wit, ``.ix`` can decide
+to index *positionally* OR via *labels* depending on the data type of the index. This has caused quite a
+bit of user confusion over the years.
+
+The recommended methods of indexing are:
+
+- ``.loc`` if you want to *label* index
+- ``.iloc`` if you want to *positionally* index.
+
+.. ipython:: python
+
+   dfd = pd.DataFrame({'A': [1, 2, 3],
+                       'B': [4, 5, 6]},
+                      index=list('abc'))
+
+   dfd
+
+Previous behavior, where you wish to get the 0th and the 2nd elements from the index in the 'A' column:
+
+.. code-block:: ipython
+
+   In [3]: dfd.ix[[0, 2], 'A']
+   Out[3]:
+   a    1
+   c    3
+   Name: A, dtype: int64
+
+Using ``.loc``. Here we will select the appropriate indexes from the index, then use *label* indexing.
+
+.. ipython:: python
+
+   dfd.loc[dfd.index[[0, 2]], 'A']
+
+This can also be expressed using ``.iloc``, by explicitly getting locations on the indexers, and using
+*positional* indexing to select things.
+
+.. ipython:: python
+
+   dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
+
+For getting *multiple* indexers, use ``.get_indexer``:
+
+.. ipython:: python
+
+   dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]
+
+
 .. _indexing.basics.partial_setting:
 
 Selecting Random Samples
@@ -641,7 +689,7 @@ Setting With Enlargement
 
 .. versionadded:: 0.13
 
-The ``.loc/.ix/[]`` operations can perform enlargement when setting a non-existant key for that axis.
+The ``.loc/[]`` operations can perform enlargement when setting a non-existent key for that axis.
 
 In the ``Series`` case this is effectively an appending operation
 
@@ -906,7 +954,7 @@ without creating a copy:
 
 Furthermore, ``where`` aligns the input boolean condition (ndarray or DataFrame),
 such that partial selection with setting is possible. This is analogous to
-partial setting via ``.ix`` (but on the contents rather than the axis labels)
+partial setting via ``.loc`` (but on the contents rather than the axis labels)
 
 .. ipython:: python
 
@@ -1467,6 +1515,10 @@ with duplicates dropped.
 
    idx1.symmetric_difference(idx2)
    idx1 ^ idx2
 
+.. note::
+
+   The resulting index from a set operation will be sorted in ascending order.
+
 Missing values
 ~~~~~~~~~~~~~~
 
@@ -1712,7 +1764,7 @@ A chained assignment can also crop up in setting in a mixed dtype frame.
 
 .. note::
 
-   These setting rules apply to all of ``.loc/.iloc/.ix``
+   These setting rules apply to all of ``.loc/.iloc``
 
 This is the correct access method
 
diff --git a/doc/source/install.rst b/doc/source/install.rst
index 55b6b5fa69efb..578caae605471 100644
--- a/doc/source/install.rst
+++ b/doc/source/install.rst
@@ -23,18 +23,6 @@ Officially Python 2.7, 3.4, 3.5, and 3.6
 Installing pandas
 -----------------
 
-Trying out pandas, no installation required!
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The easiest way to start experimenting with pandas doesn't involve installing
-pandas at all.
-
-`Wakari `__ is a free service that provides a hosted
-`IPython Notebook `__ service in the cloud.
-
-Simply create an account, and have access to pandas from within your browser via
-an `IPython Notebook `__ in a few minutes.
-
 .. _install.anaconda:
 
 Installing pandas with Anaconda
@@ -188,8 +176,8 @@ Running the test suite
 
 pandas is equipped with an exhaustive set of unit tests covering about 97% of
 the codebase as of this writing. To run it on your machine to verify that
 everything is working (and you have all of the dependencies, soft and hard,
-installed), make sure you have `nose
-`__ and run:
+installed), make sure you have `pytest
+`__ and run:
 
 ::
 
@@ -226,7 +214,7 @@ Recommended Dependencies
 
 * `numexpr `__: for accelerating certain numerical operations.
   ``numexpr`` uses multiple cores as well as smart chunking and caching to achieve large speedups.
-  If installed, must be Version 2.1 or higher (excluding a buggy 2.4.4). Version 2.4.6 or higher is highly recommended.
+  If installed, must be Version 2.4.6 or higher.
 
 * `bottleneck `__: for accelerating certain types of ``nan``
   evaluations. ``bottleneck`` uses specialized cython routines to achieve large speedups.
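+
+Whether these accelerators are actually used can also be controlled at
+runtime. This is a minimal sketch, assuming your pandas version exposes the
+``compute.use_bottleneck`` and ``compute.use_numexpr`` options (both default
+to ``True`` where available):
+
+.. code-block:: python
+
+   import pandas as pd
+
+   # opt out of the accelerated code paths, e.g. while chasing down
+   # small numerical differences in nan-aware reductions
+   pd.set_option('compute.use_bottleneck', False)
+   pd.set_option('compute.use_numexpr', False)
+
+   # restore the defaults
+   pd.set_option('compute.use_bottleneck', True)
+   pd.set_option('compute.use_numexpr', True)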
@@ -243,15 +231,16 @@ Optional Dependencies ~~~~~~~~~~~~~~~~~~~~~ * `Cython `__: Only necessary to build development - version. Version 0.19.1 or higher. + version. Version 0.23 or higher. * `SciPy `__: miscellaneous statistical functions * `xarray `__: pandas like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended. * `PyTables `__: necessary for HDF5-based storage. Version 3.0.0 or higher required, Version 3.2.1 or higher highly recommended. +* `Feather Format `__: necessary for feather-based storage, version 0.3.1 or higher. * `SQLAlchemy `__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs `__. Some common drivers are: - - `psycopg2 `__: for PostgreSQL - - `pymysql `__: for MySQL. - - `SQLite `__: for SQLite, this is included in Python's standard library by default. + * `psycopg2 `__: for PostgreSQL + * `pymysql `__: for MySQL. + * `SQLite `__: for SQLite, this is included in Python's standard library by default. * `matplotlib `__: for plotting * For Excel I/O: @@ -262,7 +251,7 @@ Optional Dependencies * `XlsxWriter `__: Alternative Excel writer * `Jinja2 `__: Template engine for conditional HTML formatting. -* `boto `__: necessary for Amazon S3 access. +* `s3fs `__: necessary for Amazon S3 access (s3fs >= 0.0.7). * `blosc `__: for msgpack compression using ``blosc`` * One of `PyQt4 `__, `PySide @@ -271,11 +260,8 @@ Optional Dependencies `__, or `xclip `__: necessary to use :func:`~pandas.read_clipboard`. Most package managers on Linux distributions will have ``xclip`` and/or ``xsel`` immediately available for installation. -* Google's `python-gflags <`__ , - `oauth2client `__ , - `httplib2 `__ - and `google-api-python-client `__ - : Needed for :mod:`~pandas.io.gbq` +* For Google BigQuery I/O - see `here `__ + * `Backports.lzma `__: Only for Python 2, for writing to and/or reading from an xz compressed DataFrame in CSV; Python 3 support is built into the standard library. * One of the following combinations of libraries is needed to use the top-level :func:`~pandas.read_html` function: @@ -284,7 +270,7 @@ Optional Dependencies okay.) * `BeautifulSoup4`_ and `lxml`_ * `BeautifulSoup4`_ and `html5lib`_ and `lxml`_ - * Only `lxml`_, although see :ref:`HTML reading gotchas ` + * Only `lxml`_, although see :ref:`HTML Table Parsing ` for reasons as to why you should probably **not** take this approach. .. warning:: @@ -293,14 +279,12 @@ Optional Dependencies `lxml`_ or `html5lib`_ or both. :func:`~pandas.read_html` will **not** work with *only* `BeautifulSoup4`_ installed. - * You are highly encouraged to read :ref:`HTML reading gotchas - `. It explains issues surrounding the installation and - usage of the above three libraries + * You are highly encouraged to read :ref:`HTML Table Parsing gotchas `. + It explains issues surrounding the installation and + usage of the above three libraries. * You may need to install an older version of `BeautifulSoup4`_: Versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64 and 32-bit Ubuntu/Debian - * Additionally, if you're using `Anaconda`_ you should definitely - read :ref:`the gotchas about HTML parsing libraries ` .. note:: @@ -317,7 +301,6 @@ Optional Dependencies .. _html5lib: https://github.com/html5lib/html5lib-python .. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup .. _lxml: http://lxml.de -.. 
_Anaconda: https://store.continuum.io/cshop/anaconda .. note:: diff --git a/doc/source/io.rst b/doc/source/io.rst index ba1bd328d2991..9692766505d7a 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -29,34 +29,26 @@ IO Tools (Text, CSV, HDF5, ...) =============================== The pandas I/O API is a set of top level ``reader`` functions accessed like ``pd.read_csv()`` that generally return a ``pandas`` -object. - - * :ref:`read_csv` - * :ref:`read_excel` - * :ref:`read_hdf` - * :ref:`read_sql` - * :ref:`read_json` - * :ref:`read_msgpack` (experimental) - * :ref:`read_html` - * :ref:`read_gbq` (experimental) - * :ref:`read_stata` - * :ref:`read_sas` - * :ref:`read_clipboard` - * :ref:`read_pickle` - -The corresponding ``writer`` functions are object methods that are accessed like ``df.to_csv()`` - - * :ref:`to_csv` - * :ref:`to_excel` - * :ref:`to_hdf` - * :ref:`to_sql` - * :ref:`to_json` - * :ref:`to_msgpack` (experimental) - * :ref:`to_html` - * :ref:`to_gbq` (experimental) - * :ref:`to_stata` - * :ref:`to_clipboard` - * :ref:`to_pickle` +object. The corresponding ``writer`` functions are object methods that are accessed like ``df.to_csv()`` + +.. csv-table:: + :header: "Format Type", "Data Description", "Reader", "Writer" + :widths: 30, 100, 60, 60 + :delim: ; + + text;`CSV `__;:ref:`read_csv`;:ref:`to_csv` + text;`JSON `__;:ref:`read_json`;:ref:`to_json` + text;`HTML `__;:ref:`read_html`;:ref:`to_html` + text; Local clipboard;:ref:`read_clipboard`;:ref:`to_clipboard` + binary;`MS Excel `__;:ref:`read_excel`;:ref:`to_excel` + binary;`HDF5 Format `__;:ref:`read_hdf`;:ref:`to_hdf` + binary;`Feather Format `__;:ref:`read_feather`;:ref:`to_feather` + binary;`Msgpack `__;:ref:`read_msgpack`;:ref:`to_msgpack` + binary;`Stata `__;:ref:`read_stata`;:ref:`to_stata` + binary;`SAS `__;:ref:`read_sas`; + binary;`Python Pickle Format `__;:ref:`read_pickle`;:ref:`to_pickle` + SQL;`SQL `__;:ref:`read_sql`;:ref:`to_sql` + SQL;`Google Big Query `__;:ref:`read_gbq`;:ref:`to_gbq` :ref:`Here ` is an informal performance comparison for some of these IO methods. @@ -89,11 +81,12 @@ filepath_or_buffer : various locations), or any object with a ``read()`` method (such as an open file or :class:`~python:io.StringIO`). sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_table` - Delimiter to use. If sep is ``None``, - will try to automatically determine this. Separators longer than 1 character - and different from ``'\s+'`` will be interpreted as regular expressions, will - force use of the python parsing engine and will ignore quotes in the data. - Regex example: ``'\\r\\t'``. + Delimiter to use. If sep is ``None``, the C engine cannot automatically detect + the separator, but the Python parsing engine can, meaning the latter will be + used automatically. In addition, separators longer than 1 character and + different from ``'\s+'`` will be interpreted as regular expressions and + will also force the use of the Python parsing engine. Note that regex + delimiters are prone to ignoring quoted data. Regex example: ``'\\r\\t'``. delimiter : str, default ``None`` Alternative argument name for sep. delim_whitespace : boolean, default False @@ -126,13 +119,23 @@ index_col : int or sequence or ``False``, default ``None`` MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider ``index_col=False`` to force pandas to *not* use the first column as the index (row names). 
-usecols : array-like, default ``None`` - Return a subset of the columns. All elements in this array must either +usecols : array-like or callable, default ``None`` + Return a subset of the columns. If array-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in `names` or - inferred from the document header row(s). For example, a valid `usecols` - parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter - results in much faster parsing time and lower memory usage. + inferred from the document header row(s). For example, a valid array-like + `usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. + + If callable, the callable function will be evaluated against the column names, + returning names where the callable function evaluates to True: + + .. ipython:: python + + data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3' + pd.read_csv(StringIO(data)) + pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['COL1', 'COL3']) + + Using this parameter results in much faster parsing time and lower memory usage. as_recarray : boolean, default ``False`` DEPRECATED: this argument will be removed in a future version. Please call ``pd.read_csv(...).to_records()`` instead. @@ -157,6 +160,9 @@ dtype : Type name or dict of column -> type, default ``None`` Data type for data or columns. E.g. ``{'a': np.float64, 'b': np.int32}`` (unsupported with ``engine='python'``). Use `str` or `object` to preserve and not interpret dtype. + + .. versionadded:: 0.20.0 support for the Python parser. + engine : {``'c'``, ``'python'``} Parser engine to use. The C engine is faster while the python engine is currently more feature-complete. @@ -172,6 +178,16 @@ skipinitialspace : boolean, default ``False`` skiprows : list-like or integer, default ``None`` Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. + + If callable, the callable function will be evaluated against the row + indices, returning True if the row should be skipped and False otherwise: + + .. ipython:: python + + data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3' + pd.read_csv(StringIO(data)) + pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0) + skipfooter : int, default ``0`` Number of lines at bottom of file to skip (unsupported with engine='c'). skip_footer : int, default ``0`` @@ -310,8 +326,11 @@ encoding : str, default ``None`` Python standard encodings `_. dialect : str or :class:`python:csv.Dialect` instance, default ``None`` - If ``None`` defaults to Excel dialect. Ignored if sep longer than 1 char. See - :class:`python:csv.Dialect` documentation for more details. + If provided, this parameter will override values (default or not) for the + following parameters: `delimiter`, `doublequote`, `escapechar`, + `skipinitialspace`, `quotechar`, and `quoting`. If it is necessary to + override values, a ParserWarning will be issued. See :class:`python:csv.Dialect` + documentation for more details. tupleize_cols : boolean, default ``False`` Leave a list of tuples on columns as is (default is to convert to a MultiIndex on the columns). @@ -323,99 +342,11 @@ error_bad_lines : boolean, default ``True`` Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If ``False``, then these "bad lines" will dropped from the DataFrame that is - returned (only valid with C parser). 
See :ref:`bad lines ` + returned. See :ref:`bad lines ` below. warn_bad_lines : boolean, default ``True`` If error_bad_lines is ``False``, and warn_bad_lines is ``True``, a warning for - each "bad line" will be output (only valid with C parser). - -.. ipython:: python - :suppress: - - f = open('foo.csv','w') - f.write('date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5') - f.close() - -Consider a typical CSV file containing, in this case, some time series data: - -.. ipython:: python - - print(open('foo.csv').read()) - -The default for `read_csv` is to create a DataFrame with simple numbered rows: - -.. ipython:: python - - pd.read_csv('foo.csv') - -In the case of indexed data, you can pass the column number or column name you -wish to use as the index: - -.. ipython:: python - - pd.read_csv('foo.csv', index_col=0) - -.. ipython:: python - - pd.read_csv('foo.csv', index_col='date') - -You can also use a list of columns to create a hierarchical index: - -.. ipython:: python - - pd.read_csv('foo.csv', index_col=[0, 'A']) - -.. _io.dialect: - -The ``dialect`` keyword gives greater flexibility in specifying the file format. -By default it uses the Excel dialect but you can specify either the dialect name -or a :class:`python:csv.Dialect` instance. - -.. ipython:: python - :suppress: - - data = ('label1,label2,label3\n' - 'index1,"a,c,e\n' - 'index2,b,d,f') - -Suppose you had data with unenclosed quotes: - -.. ipython:: python - - print(data) - -By default, ``read_csv`` uses the Excel dialect and treats the double quote as -the quote character, which causes it to fail when it finds a newline before it -finds the closing double quote. - -We can get around this using ``dialect`` - -.. ipython:: python - - dia = csv.excel() - dia.quoting = csv.QUOTE_NONE - pd.read_csv(StringIO(data), dialect=dia) - -All of the dialect options can be specified separately by keyword arguments: - -.. ipython:: python - - data = 'a,b,c~1,2,3~4,5,6' - pd.read_csv(StringIO(data), lineterminator='~') - -Another common dialect option is ``skipinitialspace``, to skip any whitespace -after a delimiter: - -.. ipython:: python - - data = 'a, b, c\n1, 2, 3\n4, 5, 6' - print(data) - pd.read_csv(StringIO(data), skipinitialspace=True) - -The parsers make every attempt to "do the right thing" and not be very -fragile. Type inference is a pretty big deal. So if a column can be coerced to -integer dtype without altering the contents, it will do so. Any non-numeric -columns will come through as object dtype as with the rest of pandas objects. + each "bad line" will be output. .. _io.dtypes: @@ -473,10 +404,9 @@ However, if you wanted for all the data to be coerced, no matter the type, then using the ``converters`` argument of :func:`~pandas.read_csv` would certainly be worth trying. -.. note:: - The ``dtype`` option is currently only supported by the C engine. - Specifying ``dtype`` with ``engine`` other than 'c' raises a - ``ValueError``. + .. versionadded:: 0.20.0 support for the Python parser. + + The ``dtype`` option is supported by the 'python' engine .. note:: In some cases, reading in abnormal data with columns containing mixed dtypes @@ -488,9 +418,9 @@ worth trying. .. 
ipython:: python :okwarning: - df = pd.DataFrame({'col_1':range(500000) + ['a', 'b'] + range(500000)}) - df.to_csv('foo') - mixed_df = pd.read_csv('foo') + df = pd.DataFrame({'col_1': list(range(500000)) + ['a', 'b'] + list(range(500000))}) + df.to_csv('foo.csv') + mixed_df = pd.read_csv('foo.csv') mixed_df['col_1'].apply(type).value_counts() mixed_df['col_1'].dtype @@ -499,6 +429,11 @@ worth trying. data that was read in. It is important to note that the overall column will be marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes. +.. ipython:: python + :suppress: + + os.remove('foo.csv') + .. _io.categorical: Specifying Categorical dtype @@ -615,7 +550,9 @@ Filtering columns (``usecols``) +++++++++++++++++++++++++++++++ The ``usecols`` argument allows you to select any subset of the columns in a -file, either using the column names or position numbers: +file, either using the column names, position numbers or a callable: + +.. versionadded:: 0.20.0 support for callable `usecols` arguments .. ipython:: python @@ -623,6 +560,17 @@ file, either using the column names or position numbers: pd.read_csv(StringIO(data)) pd.read_csv(StringIO(data), usecols=['b', 'd']) pd.read_csv(StringIO(data), usecols=[0, 2, 3]) + pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['A', 'C']) + +The ``usecols`` argument can also be used to specify which columns not to +use in the final result: + +.. ipython:: python + + pd.read_csv(StringIO(data), usecols=lambda x: x not in ['a', 'c']) + +In this case, the callable is specifying that we exclude the "a" and "c" +columns from the output. Comments and Empty Lines '''''''''''''''''''''''' @@ -779,6 +727,13 @@ input text data into ``datetime`` objects. The simplest case is to just pass in ``parse_dates=True``: +.. ipython:: python + :suppress: + + f = open('foo.csv','w') + f.write('date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5') + f.close() + .. ipython:: python # Use a column as an index, and parse it as dates. @@ -852,6 +807,12 @@ data columns: index_col=0) #index is the nominal column df +.. note:: + If a column or index contains an unparseable date, the entire column or + index will be returned unaltered as an object data type. For non-standard + datetime parsing, use :func:`to_datetime` after ``pd.read_csv``. + + .. note:: read_csv has a fast_path for parsing datetime strings in iso8601 format, e.g "2000-01-01T00:01:02+00:00" and similar variations. If you can arrange @@ -1165,8 +1126,8 @@ too many will cause an error by default: In [28]: pd.read_csv(StringIO(data)) --------------------------------------------------------------------------- - CParserError Traceback (most recent call last) - CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4 + ParserError Traceback (most recent call last) + ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4 You can elect to skip bad lines: @@ -1180,6 +1141,75 @@ You can elect to skip bad lines: 0 1 2 3 1 8 9 10 +You can also use the ``usecols`` parameter to eliminate extraneous column +data that appear in some lines but not others: + +.. code-block:: ipython + + In [30]: pd.read_csv(StringIO(data), usecols=[0, 1, 2]) + + Out[30]: + a b c + 0 1 2 3 + 1 4 5 6 + 2 8 9 10 + +.. _io.dialect: + +Dialect +''''''' + +The ``dialect`` keyword gives greater flexibility in specifying the file format. +By default it uses the Excel dialect but you can specify either the dialect name +or a :class:`python:csv.Dialect` instance. + +.. 
ipython:: python + :suppress: + + data = ('label1,label2,label3\n' + 'index1,"a,c,e\n' + 'index2,b,d,f') + +Suppose you had data with unenclosed quotes: + +.. ipython:: python + + print(data) + +By default, ``read_csv`` uses the Excel dialect and treats the double quote as +the quote character, which causes it to fail when it finds a newline before it +finds the closing double quote. + +We can get around this using ``dialect`` + +.. ipython:: python + :okwarning: + + dia = csv.excel() + dia.quoting = csv.QUOTE_NONE + pd.read_csv(StringIO(data), dialect=dia) + +All of the dialect options can be specified separately by keyword arguments: + +.. ipython:: python + + data = 'a,b,c~1,2,3~4,5,6' + pd.read_csv(StringIO(data), lineterminator='~') + +Another common dialect option is ``skipinitialspace``, to skip any whitespace +after a delimiter: + +.. ipython:: python + + data = 'a, b, c\n1, 2, 3\n4, 5, 6' + print(data) + pd.read_csv(StringIO(data), skipinitialspace=True) + +The parsers make every attempt to "do the right thing" and not be very +fragile. Type inference is a pretty big deal. So if a column can be coerced to +integer dtype without altering the contents, it will do so. Any non-numeric +columns will come through as object dtype as with the rest of pandas objects. + .. _io.quoting: Quoting and Escape Characters @@ -1266,11 +1296,22 @@ is whitespace). df = pd.read_fwf('bar.csv', header=None, index_col=0) df +.. versionadded:: 0.20.0 + +``read_fwf`` supports the ``dtype`` parameter for specifying the types of +parsed columns to be different from the inferred type. + +.. ipython:: python + + pd.read_fwf('bar.csv', header=None, index_col=0).dtypes + pd.read_fwf('bar.csv', header=None, dtype={2: 'object'}).dtypes + .. ipython:: python :suppress: os.remove('bar.csv') + Indexes ''''''' @@ -1331,7 +1372,7 @@ returned object: df = pd.read_csv("data/mindex_ex.csv", index_col=[0,1]) df - df.ix[1978] + df.loc[1978] .. _io.multi_index_columns: @@ -1398,6 +1439,14 @@ class of the csv module. For this, you have to specify ``sep=None``. print(open('tmp2.sv').read()) pd.read_csv('tmp2.sv', sep=None, engine='python') +.. _io.multiple_files: + +Reading multiple files to create a single DataFrame +''''''''''''''''''''''''''''''''''''''''''''''''''' + +It's best to use :func:`~pandas.concat` to combine multiple files. +See the :ref:`cookbook` for an example. + .. _io.chunking: Iterating through files chunk by chunk @@ -1455,6 +1504,23 @@ options include: Specifying any of the above options will produce a ``ParserWarning`` unless the python engine is selected explicitly using ``engine='python'``. +Reading remote files +'''''''''''''''''''' + +You can pass in a URL to a CSV file: + +.. code-block:: python + + df = pd.read_csv('https://download.bls.gov/pub/time.series/cu/cu.item', + sep='\t') + +S3 URLs are handled as well: + +.. code-block:: python + + df = pd.read_csv('s3://pandas-test/tips.csv') + + Writing out Data '''''''''''''''' @@ -1966,6 +2032,126 @@ using Hadoop or Spark. df df.to_json(orient='records', lines=True) + +.. _io.table_schema: + +Table Schema +'''''''''''' + +.. versionadded:: 0.20.0 + +`Table Schema`_ is a spec for describing tabular datasets as a JSON +object. The JSON includes information on the field names, types, and +other attributes. You can use the orient ``table`` to build +a JSON string with two fields, ``schema`` and ``data``. + +.. 
+.. ipython:: python
+
+   df = pd.DataFrame(
+       {'A': [1, 2, 3],
+        'B': ['a', 'b', 'c'],
+        'C': pd.date_range('2016-01-01', freq='d', periods=3),
+       }, index=pd.Index(range(3), name='idx'))
+   df
+   df.to_json(orient='table', date_format="iso")
+
+The ``schema`` field contains the ``fields`` key, which itself contains
+a list of column name to type pairs, including the ``Index`` or ``MultiIndex``
+(see below for a list of types).
+The ``schema`` field also contains a ``primaryKey`` field if the (Multi)index
+is unique.
+
+The second field, ``data``, contains the serialized data with the ``records``
+orient.
+The index is included, and any datetimes are ISO 8601 formatted, as required
+by the Table Schema spec.
+
+The full list of supported types is described in the Table Schema
+spec. This table shows the mapping from pandas types:
+
+=============== =================
+Pandas type     Table Schema type
+=============== =================
+int64           integer
+float64         number
+bool            boolean
+datetime64[ns]  datetime
+timedelta64[ns] duration
+categorical     any
+object          str
+=============== =================
+
+A few notes on the generated table schema:
+
+- The ``schema`` object contains a ``pandas_version`` field. This contains
+  the version of pandas' dialect of the schema, and will be incremented
+  with each revision.
+- All dates are converted to UTC when serializing; even timezone naïve values
+  are treated as UTC with an offset of 0.
+
+  .. ipython:: python
+
+     from pandas.io.json import build_table_schema
+     s = pd.Series(pd.date_range('2016', periods=4))
+     build_table_schema(s)
+
+- datetimes with a timezone (before serializing) include an additional field
+  ``tz`` with the time zone name (e.g. ``'US/Central'``).
+
+  .. ipython:: python
+
+     s_tz = pd.Series(pd.date_range('2016', periods=12,
+                                    tz='US/Central'))
+     build_table_schema(s_tz)
+
+- Periods are converted to timestamps before serialization, and so are
+  likewise converted to UTC. In addition, periods will contain
+  an additional field ``freq`` with the period's frequency, e.g. ``'A-DEC'``
+
+  .. ipython:: python
+
+     s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',
+                                                periods=4))
+     build_table_schema(s_per)
+
+- Categoricals use the ``any`` type and an ``enum`` constraint listing
+  the set of possible values. Additionally, an ``ordered`` field is included
+
+  .. ipython:: python
+
+     s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))
+     build_table_schema(s_cat)
+
+- A ``primaryKey`` field, containing an array of labels, is included
+  *if the index is unique*:
+
+  .. ipython:: python
+
+     s_dupe = pd.Series([1, 2], index=[1, 1])
+     build_table_schema(s_dupe)
+
+- The ``primaryKey`` behavior is the same with MultiIndexes, but in this
+  case the ``primaryKey`` is an array:
+
+  .. ipython:: python
+
+     s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),
+                                                              (0, 1)]))
+     build_table_schema(s_multi)
+
+- The default naming roughly follows these rules:
+
+  + For series, the ``object.name`` is used. If that's ``None``, then the
+    name is ``values``
+  + For DataFrames, the stringified version of the column name is used
+  + For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
+    fallback to ``index`` if that is None.
+  + For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
+    then ``level_<i>`` is used.
+
+
+.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
+
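+Since the emitted string is ordinary JSON, downstream consumers can pick the
+``schema`` and ``data`` fields apart with the standard library. A minimal
+sketch (the frame below is illustrative only):
+
+.. code-block:: python
+
+   import json
+
+   import pandas as pd
+
+   df = pd.DataFrame({'A': [1, 2, 3]}, index=pd.Index(range(3), name='idx'))
+
+   # round-trip through the json module to inspect the two top-level
+   # fields described above
+   payload = json.loads(df.to_json(orient='table'))
+   [field['name'] for field in payload['schema']['fields']]  # ['idx', 'A']
+   payload['data'][0]  # the first serialized record
+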
 HTML
 ----
@@ -1976,9 +2162,8 @@ Reading HTML Content
 
 .. warning::
 
-   We **highly encourage** you to read the :ref:`HTML parsing gotchas
-   ` regarding the issues surrounding the
-   BeautifulSoup4/html5lib/lxml parsers.
+   We **highly encourage** you to read the :ref:`HTML Table Parsing gotchas `
+   below regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers.
 
 .. versionadded:: 0.12.0
 
@@ -2045,9 +2230,10 @@ Read a URL and match a table that contains specific text
 
    match = 'Metcalf Bank'
    df_list = pd.read_html(url, match=match)
 
-Specify a header row (by default ``<th>`` elements are used to form the column
-index); if specified, the header row is taken from the data minus the parsed
-header elements (``<th>`` elements).
+Specify a header row (by default ``<th>`` or ``<td>`` elements located within a
+``<thead>`` are used to form the column index; if multiple rows are contained within
+``<thead>``, then a MultiIndex is created); if specified, the header row is taken
+from the data minus the parsed header elements (``<th>`` elements).
 
 .. code-block:: python
 
@@ -2280,6 +2466,83 @@ Not escaped:
 
 Some browsers may not show a difference in the rendering of the previous two
 HTML tables.
 
+
+.. _io.html.gotchas:
+
+HTML Table Parsing Gotchas
+''''''''''''''''''''''''''
+
+There are some versioning issues surrounding the libraries that are used to
+parse HTML tables in the top-level pandas io function ``read_html``.
+
+**Issues with** |lxml|_
+
+   * Benefits
+
+     * |lxml|_ is very fast
+
+     * |lxml|_ requires Cython to install correctly.
+
+   * Drawbacks
+
+     * |lxml|_ does *not* make any guarantees about the results of its parse
+       *unless* it is given |svm|_.
+
+     * In light of the above, we have chosen to allow you, the user, to use the
+       |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_
+       fails to parse.
+
+     * It is therefore *highly recommended* that you install both
+       |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid
+       result (provided everything else is valid) even if |lxml|_ fails.
+
+**Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend**
+
+   * The above issues hold here as well since |BeautifulSoup4|_ is essentially
+     just a wrapper around a parser backend.
+
+**Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend**
+
+   * Benefits
+
+     * |html5lib|_ is far more lenient than |lxml|_ and consequently deals
+       with *real-life markup* in a much saner way rather than just, e.g.,
+       dropping an element without notifying you.
+
+     * |html5lib|_ *generates valid HTML5 markup from invalid markup
+       automatically*. This is extremely important for parsing HTML tables,
+       since it guarantees a valid document. However, that does NOT mean that
+       it is "correct", since the process of fixing markup does not have a
+       single definition.
+
+     * |html5lib|_ is pure Python and requires no additional build steps beyond
+       its own installation.
+
+   * Drawbacks
+
+     * The biggest drawback to using |html5lib|_ is that it is slow as
+       molasses. However, consider the fact that many tables on the web are not
+       big enough for the parsing algorithm runtime to matter. It is more
+       likely that the bottleneck will be in the process of reading the raw
+       text from the URL over the web, i.e., IO (input-output). For very large
+       tables, this might not be true.
+
+
+.. |svm| replace:: **strictly valid markup**
+.. _svm: http://validator.w3.org/docs/help.html#validation_basics
+
+.. |html5lib| replace:: **html5lib**
+.. _html5lib: https://github.com/html5lib/html5lib-python
+
+.. |BeautifulSoup4| replace:: **BeautifulSoup4**
+.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
+
+.. |lxml| replace:: **lxml**
+.. _lxml: http://lxml.de
+
+
+
+
 .. _io.excel:
 
 Excel files
@@ -2318,7 +2581,7 @@ read into memory only once.
 
 .. code-block:: python
 
-    xlsx = pd.ExcelFile('path_to_file.xls)
+    xlsx = pd.ExcelFile('path_to_file.xls')
     df = pd.read_excel(xlsx, 'Sheet1')
 
 The ``ExcelFile`` class can also be used as a context manager.
 
@@ -2503,6 +2766,20 @@ indices to be parsed.
 
    read_excel('path_to_file.xls', 'Sheet1', parse_cols=[0, 2, 3])
 
+
+Parsing Dates
++++++++++++++
+
+Datetime-like values are normally automatically converted to the appropriate
+dtype when reading the Excel file. But if you have a column of strings that
+*look* like dates (but are not actually formatted as dates in Excel), you can
+use the `parse_dates` keyword to parse those strings to datetimes:
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', parse_dates=['date_strings'])
+
+
 Cell Converters
 +++++++++++++++
 
@@ -2525,6 +2802,20 @@ missing data to recover integer dtype:
 
    cfun = lambda x: int(x) if x else -1
    read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun})
 
+dtype Specifications
+++++++++++++++++++++
+
+.. versionadded:: 0.20
+
+As an alternative to converters, the type for an entire column can
+be specified using the `dtype` keyword, which takes a dictionary
+mapping column names to types. To interpret data with
+no type inference, use the type ``str`` or ``object``.
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', dtype={'MyInts': 'int64', 'MyText': str})
+
 .. _io.excel_writer:
 
 Writing Excel Files
@@ -2620,6 +2911,7 @@ Added support for Openpyxl >= 2.2
 ``'xlsxwriter'`` will produce an Excel 2007-format workbook (xlsx). If omitted,
 an Excel 2007-formatted workbook is produced.
 
+
 .. _io.excel.writers:
 
 Excel writer engines
@@ -2666,6 +2958,18 @@ argument to ``to_excel`` and to ``ExcelWriter``. The built-in engines are:
 
    df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')
 
+.. _io.excel.style:
+
+Style and Formatting
+''''''''''''''''''''
+
+The look and feel of Excel worksheets created from pandas can be modified using the following parameters on the ``DataFrame``'s ``to_excel`` method.
+
+- ``float_format`` : Format string for floating point numbers (default None)
+- ``freeze_panes`` : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default None)
+
+
+
 .. _io.clipboard:
 
 Clipboard
@@ -2752,14 +3056,71 @@ any pickled pandas object (or any other pickled object) from file:
 
 See `this question `__ for a detailed explanation.
 
-.. note::
+.. _io.pickle.compression:
+
+Compressed pickle files
+'''''''''''''''''''''''
+
+.. versionadded:: 0.20.0
+
+:func:`read_pickle`, :meth:`DataFrame.to_pickle` and :meth:`Series.to_pickle` can read
+and write compressed pickle files. The compression types of ``gzip``, ``bz2``, ``xz`` are supported for reading and writing.
+The ``zip`` format supports reading only, and the archive must contain only one data file
+to be read in.
+
+The compression type can be an explicit parameter or be inferred from the file extension.
+If ``'infer'``, then use ``gzip``, ``bz2``, ``zip``, or ``xz`` if filename ends in ``'.gz'``, ``'.bz2'``, ``'.zip'``, or
+``'.xz'``, respectively.
+
+.. ipython:: python
+
+   df = pd.DataFrame({
+       'A': np.random.randn(1000),
+       'B': 'foo',
+       'C': pd.date_range('20130101', periods=1000, freq='s')})
+   df
+
+Using an explicit compression type:
+
+.. ipython:: python
+
+   df.to_pickle("data.pkl.compress", compression="gzip")
+   rt = pd.read_pickle("data.pkl.compress", compression="gzip")
+   rt
+
+Inferring compression type from the extension:
+
+.. ipython:: python
 
-   These methods were previously ``pd.save`` and ``pd.load``, prior to 0.12.0, and are now deprecated.
+   df.to_pickle("data.pkl.xz", compression="infer")
+   rt = pd.read_pickle("data.pkl.xz", compression="infer")
+   rt
+
+The default is ``'infer'``:
+
+.. ipython:: python
+
+   df.to_pickle("data.pkl.gz")
+   rt = pd.read_pickle("data.pkl.gz")
+   rt
+
+   df["A"].to_pickle("s1.pkl.bz2")
+   rt = pd.read_pickle("s1.pkl.bz2")
+   rt
+
+.. ipython:: python
+   :suppress:
+
+   import os
+   os.remove("data.pkl.compress")
+   os.remove("data.pkl.xz")
+   os.remove("data.pkl.gz")
+   os.remove("s1.pkl.bz2")
 
 .. _io.msgpack:
 
-msgpack (experimental)
-----------------------
+msgpack
+-------
 
 .. versionadded:: 0.13.0
 
@@ -3180,7 +3541,7 @@ defaults to `nan`.
                             'bool' : True,
                             'datetime64' : pd.Timestamp('20010102')},
                            index=list(range(8)))
-   df_mixed.ix[3:5,['A', 'B', 'string', 'datetime64']] = np.nan
+   df_mixed.loc[df_mixed.index[3:5], ['A', 'B', 'string', 'datetime64']] = np.nan
 
    store.append('df_mixed', df_mixed, min_itemsize = {'values': 50})
    df_mixed1 = store.select('df_mixed')
 
@@ -3460,15 +3821,15 @@ be data_columns
 
    df_dc = df.copy()
    df_dc['string'] = 'foo'
-   df_dc.ix[4:6,'string'] = np.nan
-   df_dc.ix[7:9,'string'] = 'bar'
+   df_dc.loc[df_dc.index[4:6], 'string'] = np.nan
+   df_dc.loc[df_dc.index[7:9], 'string'] = 'bar'
    df_dc['string2'] = 'cool'
-   df_dc.ix[1:3,['B','C']] = 1.0
+   df_dc.loc[df_dc.index[1:3], ['B','C']] = 1.0
    df_dc
 
    # on-disk operations
    store.append('df_dc', df_dc, data_columns = ['B', 'C', 'string', 'string2'])
-   store.select('df_dc', [ pd.Term('B>0') ])
+   store.select('df_dc', where='B>0')
 
    # getting creative
    store.select('df_dc', 'B > 0 & C > 0 & string == foo')
 
@@ -3627,7 +3988,7 @@ results.
 
    df_mt = pd.DataFrame(randn(8, 6), index=pd.date_range('1/1/2000', periods=8),
                         columns=['A', 'B', 'C', 'D', 'E', 'F'])
    df_mt['foo'] = 'bar'
-   df_mt.ix[1, ('A', 'B')] = np.nan
+   df_mt.loc[df_mt.index[1], ('A', 'B')] = np.nan
 
    # you can also create the tables individually
    store.append_to_multiple({'df1_mt': ['A', 'B'], 'df2_mt': None },
 
@@ -3958,7 +4319,7 @@ and data values from the values and assembles them into a ``data.frame``:
   name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
 
   columns = list()
   for (idx in seq(data_paths)) {
-    # NOTE: matrices returned by h5read have to be transposed to to obtain
+    # NOTE: matrices returned by h5read have to be transposed to obtain
     # required Fortran order!
     data <- data.frame(t(h5read(h5File, data_paths[idx])))
     names <- t(h5read(h5File, name_paths[idx]))
 
@@ -4062,6 +4423,9 @@ HDFStore supports ``Panel4D`` storage.
 
 .. ipython:: python
    :okwarning:
 
+   wp = pd.Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
+                 major_axis=pd.date_range('1/1/2000', periods=5),
+                 minor_axis=['A', 'B', 'C', 'D'])
    p4d = pd.Panel4D({ 'l1' : wp })
    p4d
    store.append('p4d', p4d)
 
@@ -4078,8 +4442,7 @@ object). This cannot be changed after table creation.
    :okwarning:
 
    store.append('p4d2', p4d, axes=['labels', 'major_axis', 'minor_axis'])
-   store
-   store.select('p4d2', [ pd.Term('labels=l1'), pd.Term('items=Item1'), pd.Term('minor_axis=A_big_strings') ])
+   store.select('p4d2', where='labels=l1 and items=Item1 and minor_axis=A')
 
 .. ipython:: python
    :suppress:
 
@@ -4089,6 +4452,69 @@ object). This cannot be changed after table creation.
 
   os.remove('store.h5')
 
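+As a convenience, the same ``where`` filtering shown above can be performed in
+a single call with the top-level ``read_hdf`` function, which opens the store,
+selects, and closes it for you. A minimal sketch along the lines of the
+``df_dc`` example earlier (the ``'data.h5'`` path is illustrative only):
+
+.. code-block:: python
+
+   import numpy as np
+   import pandas as pd
+
+   df = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'])
+
+   # write a queryable table with data columns, then read back only
+   # the rows matching the where expression
+   df.to_hdf('data.h5', 'df', format='table', data_columns=['B', 'C'])
+   pd.read_hdf('data.h5', 'df', where='B > 0 & C > 0')
+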
+.. _io.feather:
+
+Feather
+-------
+
+.. versionadded:: 0.20.0
+
+Feather provides binary columnar serialization for data frames. It is designed to make reading and writing data
+frames efficient, and to make sharing data across data analysis languages easy.
+
+Feather is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas
+dtypes, including extension dtypes such as categorical and datetime with tz.
+
+Several caveats:
+
+- This is a newer library, and the format, though stable, is not guaranteed to be backward compatible
+  to the earlier versions.
+- The format will NOT write an ``Index`` or ``MultiIndex`` for the ``DataFrame`` and will raise an
+  error if a non-default one is provided. You can simply ``.reset_index()`` in order to store the index.
+- Duplicate column names and non-string column names are not supported.
+- Unsupported types include ``Period`` and actual python object types. These will raise a helpful error message
+  on an attempt at serialization.
+
+See the `Full Documentation `__
+
+.. ipython:: python
+
+   df = pd.DataFrame({'a': list('abc'),
+                      'b': list(range(1, 4)),
+                      'c': np.arange(3, 6).astype('u1'),
+                      'd': np.arange(4.0, 7.0, dtype='float64'),
+                      'e': [True, False, True],
+                      'f': pd.Categorical(list('abc')),
+                      'g': pd.date_range('20130101', periods=3),
+                      'h': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+                      'i': pd.date_range('20130101', periods=3, freq='ns')})
+
+   df
+   df.dtypes
+
+Write to a feather file.
+
+.. ipython:: python
+   :okwarning:
+
+   df.to_feather('example.feather')
+
+Read from a feather file.
+
+.. ipython:: python
+
+   result = pd.read_feather('example.feather')
+   result
+
+   # we preserve dtypes
+   result.dtypes
+
+.. ipython:: python
+   :suppress:
+
+   import os
+   os.remove('example.feather')
+
 .. _io.sql:
 
 SQL Queries
@@ -4417,247 +4843,21 @@ And then issue the following queries:
 
 .. _io.bigquery:
 
-Google BigQuery (Experimental)
-------------------------------
-
-.. versionadded:: 0.13.0
-
-The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
-analytics web service to simplify retrieving results from BigQuery tables
-using SQL-like queries. Result sets are parsed into a pandas
-DataFrame with a shape and data types derived from the source table.
-Additionally, DataFrames can be inserted into new BigQuery tables or appended
-to existing tables.
-
-You will need to install some additional dependencies:
-
-- Google's `python-gflags `__
-- `httplib2 `__
-- `google-api-python-client `__
-
-.. warning::
-
-   To use this module, you will need a valid BigQuery account. Refer to the
-   `BigQuery Documentation `__ for details on the service itself.
-
-The key functions are:
-
-.. currentmodule:: pandas.io.gbq
-
-.. autosummary::
-   :toctree: generated/
-
-   read_gbq
-   to_gbq
-
-.. currentmodule:: pandas
-
-.. _io.bigquery_reader:
-
-.. _io.bigquery_authentication:
-
-Authentication
-''''''''''''''
-
-.. versionadded:: 0.18.0
-
-Authentication to the Google ``BigQuery`` service is via ``OAuth 2.0``.
-It is possible to authenticate with either user account credentials or service account credentials.
-
-Authenticating with user account credentials is as simple as following the prompts in a browser window
-which will be automatically opened for you. You will be authenticated to the specified
-``BigQuery`` account using the product name ``pandas GBQ``. It is only possible on local host.
-The remote authentication using user account credentials is not currently supported in Pandas.
-Additional information on the authentication mechanism can be found -`here `__. - -Authentication with service account credentials is possible via the `'private_key'` parameter. This method -is particularly useful when working on remote servers (eg. jupyter iPython notebook on remote host). -Additional information on service accounts can be found -`here `__. - -You will need to install an additional dependency: `oauth2client `__. - -Authentication via ``application default credentials`` is also possible. This is only valid -if the parameter ``private_key`` is not provided. This method also requires that -the credentials can be fetched from the environment the code is running in. -Otherwise, the OAuth2 client-side authentication is used. -Additional information on -`application default credentials `__. - -.. versionadded:: 0.19.0 - -.. note:: - - The `'private_key'` parameter can be set to either the file path of the service account key - in JSON format, or key contents of the service account key in JSON format. - -.. note:: - - A private key can be obtained from the Google developers console by clicking - `here `__. Use JSON key type. - - -Querying -'''''''' - -Suppose you want to load all data from an existing BigQuery table : `test_dataset.test_table` -into a DataFrame using the :func:`~pandas.io.gbq.read_gbq` function. - -.. code-block:: python - - # Insert your BigQuery Project ID Here - # Can be found in the Google web console - projectid = "xxxxxxxx" - - data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', projectid) - - -You can define which column from BigQuery to use as an index in the -destination DataFrame as well as a preferred column order as follows: - -.. code-block:: python - - data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', - index_col='index_column_name', - col_order=['col1', 'col2', 'col3'], projectid) - -.. note:: - - You can find your project id in the `Google developers console `__. - - -.. note:: - - You can toggle the verbose output via the ``verbose`` flag which defaults to ``True``. - -.. note:: - - The ``dialect`` argument can be used to indicate whether to use BigQuery's ``'legacy'`` SQL - or BigQuery's ``'standard'`` SQL (beta). The default value is ``'legacy'``. For more information - on BigQuery's standard SQL, see `BigQuery SQL Reference - `__ - -.. _io.bigquery_writer: - - -Writing DataFrames -'''''''''''''''''' - -Assume we want to write a DataFrame ``df`` into a BigQuery table using :func:`~pandas.DataFrame.to_gbq`. - -.. ipython:: python - - df = pd.DataFrame({'my_string': list('abc'), - 'my_int64': list(range(1, 4)), - 'my_float64': np.arange(4.0, 7.0), - 'my_bool1': [True, False, True], - 'my_bool2': [False, True, False], - 'my_dates': pd.date_range('now', periods=3)}) - - df - df.dtypes - -.. code-block:: python - - df.to_gbq('my_dataset.my_table', projectid) - -.. note:: - - The destination table and destination dataset will automatically be created if they do not already exist. - -The ``if_exists`` argument can be used to dictate whether to ``'fail'``, ``'replace'`` -or ``'append'`` if the destination table already exists. The default value is ``'fail'``. - -For example, assume that ``if_exists`` is set to ``'fail'``. The following snippet will raise -a ``TableCreationError`` if the destination table already exists. - -.. code-block:: python - - df.to_gbq('my_dataset.my_table', projectid, if_exists='fail') - -.. 
note:: - - If the ``if_exists`` argument is set to ``'append'``, the destination dataframe will - be written to the table using the defined table schema and column types. The - dataframe must match the destination table in structure and data types. - If the ``if_exists`` argument is set to ``'replace'``, and the existing table has a - different schema, a delay of 2 minutes will be forced to ensure that the new schema - has propagated in the Google environment. See - `Google BigQuery issue 191 `__. - -Writing large DataFrames can result in errors due to size limitations being exceeded. -This can be avoided by setting the ``chunksize`` argument when calling :func:`~pandas.DataFrame.to_gbq`. -For example, the following writes ``df`` to a BigQuery table in batches of 10000 rows at a time: - -.. code-block:: python - - df.to_gbq('my_dataset.my_table', projectid, chunksize=10000) - -You can also see the progress of your post via the ``verbose`` flag which defaults to ``True``. -For example: - -.. code-block:: python - - In [8]: df.to_gbq('my_dataset.my_table', projectid, chunksize=10000, verbose=True) - - Streaming Insert is 10% Complete - Streaming Insert is 20% Complete - Streaming Insert is 30% Complete - Streaming Insert is 40% Complete - Streaming Insert is 50% Complete - Streaming Insert is 60% Complete - Streaming Insert is 70% Complete - Streaming Insert is 80% Complete - Streaming Insert is 90% Complete - Streaming Insert is 100% Complete - -.. note:: - - If an error occurs while streaming data to BigQuery, see - `Troubleshooting BigQuery Errors `__. - -.. note:: - - The BigQuery SQL query language has some oddities, see the - `BigQuery Query Reference Documentation `__. - -.. note:: - - While BigQuery uses SQL-like syntax, it has some important differences from traditional - databases both in functionality, API limitations (size and quantity of queries or uploads), - and how Google charges for use of the service. You should refer to `Google BigQuery documentation `__ - often as the service seems to be changing and evolving. BiqQuery is best for analyzing large - sets of data quickly, but it is not a direct replacement for a transactional database. - -Creating BigQuery Tables -'''''''''''''''''''''''' +Google BigQuery +--------------- .. warning:: - As of 0.17, the function :func:`~pandas.io.gbq.generate_bq_schema` has been deprecated and will be - removed in a future version. + Starting in 0.20.0, pandas has split off Google BigQuery support into the + separate package ``pandas-gbq``. You can ``pip install pandas-gbq`` to get it. -As of 0.15.2, the gbq module has a function :func:`~pandas.io.gbq.generate_bq_schema` which will -produce the dictionary representation schema of the specified pandas DataFrame. - -.. code-block:: ipython +The ``pandas-gbq`` package provides functionality to read/write from Google BigQuery. - In [10]: gbq.generate_bq_schema(df, default_type='STRING') - - Out[10]: {'fields': [{'name': 'my_bool1', 'type': 'BOOLEAN'}, - {'name': 'my_bool2', 'type': 'BOOLEAN'}, - {'name': 'my_dates', 'type': 'TIMESTAMP'}, - {'name': 'my_float64', 'type': 'FLOAT'}, - {'name': 'my_int64', 'type': 'INTEGER'}, - {'name': 'my_string', 'type': 'STRING'}]} - -.. note:: +pandas integrates with this external package. if ``pandas-gbq`` is installed, you can +use the pandas methods ``pd.read_gbq`` and ``DataFrame.to_gbq``, which will call the +respective functions from ``pandas-gbq``. 
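+
+As a minimal sketch of that round trip (``my_dataset.my_table`` and the
+project id ``'my-project-id'`` below are placeholders, and ``pandas-gbq``
+must be installed for either call to work):
+
+.. code-block:: python
+
+   df = pd.DataFrame({'my_string': list('abc'),
+                      'my_int64': list(range(1, 4))})
+
+   # dispatches to pandas_gbq.to_gbq
+   df.to_gbq('my_dataset.my_table', 'my-project-id')
+
+   # dispatches to pandas_gbq.read_gbq
+   pd.read_gbq('SELECT * FROM my_dataset.my_table', 'my-project-id')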
- If you delete and re-create a BigQuery table with the same name, but different table schema, - you must wait 2 minutes before streaming data into the table. As a workaround, consider creating - the new table with a different name. Refer to - `Google BigQuery issue 191 `__. +Full cocumentation can be found `here `__ .. _io.stata: diff --git a/doc/source/merging.rst b/doc/source/merging.rst index c6541a26c72b4..170dde87c8363 100644 --- a/doc/source/merging.rst +++ b/doc/source/merging.rst @@ -13,7 +13,7 @@ import matplotlib.pyplot as plt plt.close('all') - import pandas.util.doctools as doctools + import pandas.util._doctools as doctools p = doctools.TablePlotter() @@ -132,7 +132,7 @@ means that we can now do stuff like select out each chunk by key: .. ipython:: python - result.ix['y'] + result.loc['y'] It's not a stretch to see how this can be very useful. More detail on this functionality below. @@ -693,6 +693,29 @@ either the left or right tables, the values in the joined table will be labels=['left', 'right'], vertical=False); plt.close('all'); +Here is another example with duplicate join keys in DataFrames: + +.. ipython:: python + + left = pd.DataFrame({'A' : [1,2], 'B' : [2, 2]}) + + right = pd.DataFrame({'A' : [4,5,6], 'B': [2,2,2]}) + + result = pd.merge(left, right, on='B', how='outer') + +.. ipython:: python + :suppress: + + @savefig merging_merge_on_key_dup.png + p.plot([left, right], result, + labels=['left', 'right'], vertical=False); + plt.close('all'); + +.. warning:: + + Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions, + may result in memory overflow. It is the user' s responsibility to manage duplicate values in keys before joining large DataFrames. + .. _merging.indicator: The merge indicator @@ -723,6 +746,79 @@ The ``indicator`` argument will also accept string arguments, in which case the pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column') +.. _merging.dtypes: + +Merge Dtypes +~~~~~~~~~~~~ + +.. versionadded:: 0.19.0 + +Merging will preserve the dtype of the join keys. + +.. ipython:: python + + left = pd.DataFrame({'key': [1], 'v1': [10]}) + left + right = pd.DataFrame({'key': [1, 2], 'v1': [20, 30]}) + right + +We are able to preserve the join keys + +.. ipython:: python + + pd.merge(left, right, how='outer') + pd.merge(left, right, how='outer').dtypes + +Of course if you have missing values that are introduced, then the +resulting dtype will be upcast. + +.. ipython:: python + + pd.merge(left, right, how='outer', on='key') + pd.merge(left, right, how='outer', on='key').dtypes + +.. versionadded:: 0.20.0 + +Merging will preserve ``category`` dtypes of the mergands. See also the section on :ref:`categoricals ` + +The left frame. + +.. ipython:: python + + X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,))) + X = X.astype('category', categories=['foo', 'bar']) + + left = pd.DataFrame({'X': X, + 'Y': np.random.choice(['one', 'two', 'three'], size=(10,))}) + left + left.dtypes + +The right frame. + +.. ipython:: python + + right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']), + 'Z': [1, 2]}) + right + right.dtypes + +The merged result + +.. ipython:: python + + result = pd.merge(left, right, how='outer') + result + result.dtypes + +.. note:: + + The category dtypes must be *exactly* the same, meaning the same categories and the ordered attribute. + Otherwise the result will coerce to ``object`` dtype. + +.. 
note:: + + Merging on ``category`` dtypes that are the same can be quite performant compared to ``object`` dtype merging. + .. _merging.join.index: Joining on index diff --git a/doc/source/missing_data.rst b/doc/source/missing_data.rst index 3ede97a902696..37930775885e3 100644 --- a/doc/source/missing_data.rst +++ b/doc/source/missing_data.rst @@ -113,7 +113,7 @@ pandas objects provide intercompatibility between ``NaT`` and ``NaN``. df2 = df.copy() df2['timestamp'] = pd.Timestamp('20120101') df2 - df2.ix[['a','c','h'],['one','timestamp']] = np.nan + df2.loc[['a','c','h'],['one','timestamp']] = np.nan df2 df2.get_dtype_counts() @@ -155,9 +155,9 @@ objects. .. ipython:: python :suppress: - df = df2.ix[:, ['one', 'two', 'three']] - a = df2.ix[:5, ['one', 'two']].fillna(method='pad') - b = df2.ix[:5, ['one', 'two', 'three']] + df = df2.loc[:, ['one', 'two', 'three']] + a = df2.loc[df2.index[:5], ['one', 'two']].fillna(method='pad') + b = df2.loc[df2.index[:5], ['one', 'two', 'three']] .. ipython:: python @@ -237,7 +237,7 @@ we can use the `limit` keyword: .. ipython:: python :suppress: - df.ix[2:4, :] = np.nan + df.iloc[2:4, :] = np.nan .. ipython:: python diff --git a/doc/source/options.rst b/doc/source/options.rst index 77cac6d495d13..5f6bf2fbb9662 100644 --- a/doc/source/options.rst +++ b/doc/source/options.rst @@ -273,151 +273,164 @@ Options are 'right', and 'left'. Available Options ----------------- -========================== ============ ================================== -Option Default Function -========================== ============ ================================== -display.chop_threshold None If set to a float value, all float - values smaller then the given - threshold will be displayed as - exactly 0 by repr and friends. -display.colheader_justify right Controls the justification of - column headers. used by DataFrameFormatter. -display.column_space 12 No description available. -display.date_dayfirst False When True, prints and parses dates - with the day first, eg 20/01/2005 -display.date_yearfirst False When True, prints and parses dates - with the year first, eg 2005/01/20 -display.encoding UTF-8 Defaults to the detected encoding - of the console. Specifies the encoding - to be used for strings returned by - to_string, these are generally strings - meant to be displayed on the console. -display.expand_frame_repr True Whether to print out the full DataFrame - repr for wide DataFrames across - multiple lines, `max_columns` is - still respected, but the output will - wrap-around across multiple "pages" - if its width exceeds `display.width`. -display.float_format None The callable should accept a floating - point number and return a string with - the desired format of the number. - This is used in some places like - SeriesFormatter. - See core.format.EngFormatter for an example. -display.height 60 Deprecated. Use `display.max_rows` instead. -display.large_repr truncate For DataFrames exceeding max_rows/max_cols, - the repr (and HTML repr) can show - a truncated table (the default from 0.13), - or switch to the view from df.info() - (the behaviour in earlier versions of pandas). - allowable settings, ['truncate', 'info'] -display.latex.repr False Whether to produce a latex DataFrame - representation for jupyter frontends - that support it. -display.latex.escape True Escapes special caracters in Dataframes, when - using the to_latex method. -display.latex.longtable False Specifies if the to_latex method of a Dataframe - uses the longtable format. 
-display.line_width 80 Deprecated. Use `display.width` instead. -display.max_columns 20 max_rows and max_columns are used - in __repr__() methods to decide if - to_string() or info() is used to - render an object to a string. In - case python/IPython is running in - a terminal this can be set to 0 and - pandas will correctly auto-detect - the width the terminal and swap to - a smaller format in case all columns - would not fit vertically. The IPython - notebook, IPython qtconsole, or IDLE - do not run in a terminal and hence - it is not possible to do correct - auto-detection. 'None' value means - unlimited. -display.max_colwidth 50 The maximum width in characters of - a column in the repr of a pandas - data structure. When the column overflows, - a "..." placeholder is embedded in - the output. -display.max_info_columns 100 max_info_columns is used in DataFrame.info - method to decide if per column information - will be printed. -display.max_info_rows 1690785 df.info() will usually show null-counts - for each column. For large frames - this can be quite slow. max_info_rows - and max_info_cols limit this null - check only to frames with smaller - dimensions then specified. -display.max_rows 60 This sets the maximum number of rows - pandas should output when printing - out various output. For example, - this value determines whether the - repr() for a dataframe prints out - fully or just a summary repr. - 'None' value means unlimited. -display.max_seq_items 100 when pretty-printing a long sequence, - no more then `max_seq_items` will - be printed. If items are omitted, - they will be denoted by the addition - of "..." to the resulting string. - If set to None, the number of items - to be printed is unlimited. -display.memory_usage True This specifies if the memory usage of - a DataFrame should be displayed when the - df.info() method is invoked. -display.multi_sparse True "Sparsify" MultiIndex display (don't - display repeated elements in outer - levels within groups) -display.notebook_repr_html True When True, IPython notebook will - use html representation for - pandas objects (if it is available). -display.pprint_nest_depth 3 Controls the number of nested levels - to process when pretty-printing -display.precision 6 Floating point output precision in - terms of number of places after the - decimal, for regular formatting as well - as scientific notation. Similar to - numpy's ``precision`` print option -display.show_dimensions truncate Whether to print out dimensions - at the end of DataFrame repr. - If 'truncate' is specified, only - print out the dimensions if the - frame is truncated (e.g. not display - all rows and/or columns) -display.width 80 Width of the display in characters. - In case python/IPython is running in - a terminal this can be set to None - and pandas will correctly auto-detect - the width. Note that the IPython notebook, - IPython qtconsole, or IDLE do not run in a - terminal and hence it is not possible - to correctly detect the width. -html.border 1 A ``border=value`` attribute is - inserted in the ```` tag - for the DataFrame HTML repr. -io.excel.xls.writer xlwt The default Excel writer engine for - 'xls' files. -io.excel.xlsm.writer openpyxl The default Excel writer engine for - 'xlsm' files. Available options: - 'openpyxl' (the default). -io.excel.xlsx.writer openpyxl The default Excel writer engine for - 'xlsx' files. 
-io.hdf.default_format None default format writing format, if - None, then put will default to - 'fixed' and append will default to - 'table' -io.hdf.dropna_table True drop ALL nan rows when appending - to a table -mode.chained_assignment warn Raise an exception, warn, or no - action if trying to use chained - assignment, The default is warn -mode.sim_interactive False Whether to simulate interactive mode - for purposes of testing -mode.use_inf_as_null False True means treat None, NaN, -INF, - INF as null (old way), False means - None and NaN are null, but INF, -INF - are not null (new way). -========================== ============ ================================== +=================================== ============ ================================== +Option Default Function +=================================== ============ ================================== +display.chop_threshold None If set to a float value, all float + values smaller then the given + threshold will be displayed as + exactly 0 by repr and friends. +display.colheader_justify right Controls the justification of + column headers. used by DataFrameFormatter. +display.column_space 12 No description available. +display.date_dayfirst False When True, prints and parses dates + with the day first, eg 20/01/2005 +display.date_yearfirst False When True, prints and parses dates + with the year first, eg 2005/01/20 +display.encoding UTF-8 Defaults to the detected encoding + of the console. Specifies the encoding + to be used for strings returned by + to_string, these are generally strings + meant to be displayed on the console. +display.expand_frame_repr True Whether to print out the full DataFrame + repr for wide DataFrames across + multiple lines, `max_columns` is + still respected, but the output will + wrap-around across multiple "pages" + if its width exceeds `display.width`. +display.float_format None The callable should accept a floating + point number and return a string with + the desired format of the number. + This is used in some places like + SeriesFormatter. + See core.format.EngFormatter for an example. +display.height 60 Deprecated. Use `display.max_rows` instead. +display.large_repr truncate For DataFrames exceeding max_rows/max_cols, + the repr (and HTML repr) can show + a truncated table (the default from 0.13), + or switch to the view from df.info() + (the behaviour in earlier versions of pandas). + allowable settings, ['truncate', 'info'] +display.latex.repr False Whether to produce a latex DataFrame + representation for jupyter frontends + that support it. +display.latex.escape True Escapes special caracters in Dataframes, when + using the to_latex method. +display.latex.longtable False Specifies if the to_latex method of a Dataframe + uses the longtable format. +display.latex.multicolumn True Combines columns when using a MultiIndex +display.latex.multicolumn_format 'l' Alignment of multicolumn labels +display.latex.multirow False Combines rows when using a MultiIndex. + Centered instead of top-aligned, + separated by clines. +display.line_width 80 Deprecated. Use `display.width` instead. +display.max_columns 20 max_rows and max_columns are used + in __repr__() methods to decide if + to_string() or info() is used to + render an object to a string. In + case python/IPython is running in + a terminal this can be set to 0 and + pandas will correctly auto-detect + the width the terminal and swap to + a smaller format in case all columns + would not fit vertically. 
The IPython + notebook, IPython qtconsole, or IDLE + do not run in a terminal and hence + it is not possible to do correct + auto-detection. 'None' value means + unlimited. +display.max_colwidth 50 The maximum width in characters of + a column in the repr of a pandas + data structure. When the column overflows, + a "..." placeholder is embedded in + the output. +display.max_info_columns 100 max_info_columns is used in DataFrame.info + method to decide if per column information + will be printed. +display.max_info_rows 1690785 df.info() will usually show null-counts + for each column. For large frames + this can be quite slow. max_info_rows + and max_info_cols limit this null + check only to frames with smaller + dimensions then specified. +display.max_rows 60 This sets the maximum number of rows + pandas should output when printing + out various output. For example, + this value determines whether the + repr() for a dataframe prints out + fully or just a summary repr. + 'None' value means unlimited. +display.max_seq_items 100 when pretty-printing a long sequence, + no more then `max_seq_items` will + be printed. If items are omitted, + they will be denoted by the addition + of "..." to the resulting string. + If set to None, the number of items + to be printed is unlimited. +display.memory_usage True This specifies if the memory usage of + a DataFrame should be displayed when the + df.info() method is invoked. +display.multi_sparse True "Sparsify" MultiIndex display (don't + display repeated elements in outer + levels within groups) +display.notebook_repr_html True When True, IPython notebook will + use html representation for + pandas objects (if it is available). +display.pprint_nest_depth 3 Controls the number of nested levels + to process when pretty-printing +display.precision 6 Floating point output precision in + terms of number of places after the + decimal, for regular formatting as well + as scientific notation. Similar to + numpy's ``precision`` print option +display.show_dimensions truncate Whether to print out dimensions + at the end of DataFrame repr. + If 'truncate' is specified, only + print out the dimensions if the + frame is truncated (e.g. not display + all rows and/or columns) +display.width 80 Width of the display in characters. + In case python/IPython is running in + a terminal this can be set to None + and pandas will correctly auto-detect + the width. Note that the IPython notebook, + IPython qtconsole, or IDLE do not run in a + terminal and hence it is not possible + to correctly detect the width. +display.html.table_schema False Whether to publish a Table Schema + representation for frontends that + support it. +html.border 1 A ``border=value`` attribute is + inserted in the ``
`` tag + for the DataFrame HTML repr. +io.excel.xls.writer xlwt The default Excel writer engine for + 'xls' files. +io.excel.xlsm.writer openpyxl The default Excel writer engine for + 'xlsm' files. Available options: + 'openpyxl' (the default). +io.excel.xlsx.writer openpyxl The default Excel writer engine for + 'xlsx' files. +io.hdf.default_format None default format writing format, if + None, then put will default to + 'fixed' and append will default to + 'table' +io.hdf.dropna_table True drop ALL nan rows when appending + to a table +mode.chained_assignment warn Raise an exception, warn, or no + action if trying to use chained + assignment, The default is warn +mode.sim_interactive False Whether to simulate interactive mode + for purposes of testing +mode.use_inf_as_null False True means treat None, NaN, -INF, + INF as null (old way), False means + None and NaN are null, but INF, -INF + are not null (new way). +compute.use_bottleneck True Use the bottleneck library to accelerate + computation if it is installed +compute.use_numexpr True Use the numexpr library to accelerate + computation if it is installed +=================================== ============ ================================== + .. _basics.console_output: @@ -507,3 +520,26 @@ Enabling ``display.unicode.ambiguous_as_wide`` lets pandas to figure these chara pd.set_option('display.unicode.east_asian_width', False) pd.set_option('display.unicode.ambiguous_as_wide', False) + +.. _options.table_schema: + +Table Schema Display +-------------------- + +.. versionadded:: 0.20.0 + +``DataFrame`` and ``Series`` will publish a Table Schema representation +by default. False by default, this can be enabled globally with the +``display.html.table_schema`` option: + +.. ipython:: python + + pd.set_option('display.html.table_schema', True) + +Only ``'display.max_rows'`` are serialized and published. + + +.. ipython:: python + :suppress: + + pd.reset_option('display.html.table_schema') diff --git a/doc/source/r_interface.rst b/doc/source/r_interface.rst index f2a8668bbda91..88634d7f75c63 100644 --- a/doc/source/r_interface.rst +++ b/doc/source/r_interface.rst @@ -1,5 +1,3 @@ -.. currentmodule:: pandas.rpy - .. _rpy: .. ipython:: python @@ -15,157 +13,64 @@ rpy2 / R interface .. warning:: - In v0.16.0, the ``pandas.rpy`` interface has been **deprecated and will be - removed in a future version**. Similar functionality can be accessed - through the `rpy2 `__ project. - See the :ref:`updating ` section for a guide to port your - code from the ``pandas.rpy`` to ``rpy2`` functions. - - -.. _rpy.updating: - -Updating your code to use rpy2 functions ----------------------------------------- - -In v0.16.0, the ``pandas.rpy`` module has been **deprecated** and users are -pointed to the similar functionality in ``rpy2`` itself (rpy2 >= 2.4). - -Instead of importing ``import pandas.rpy.common as com``, the following imports -should be done to activate the pandas conversion support in rpy2:: + Up to pandas 0.19, a ``pandas.rpy`` module existed with functionality to + convert between pandas and ``rpy2`` objects. This functionality now lives in + the `rpy2 `__ project itself. + See the `updating section `__ + of the previous documentation for a guide to port your code from the + removed ``pandas.rpy`` to ``rpy2`` functions. - from rpy2.robjects import pandas2ri - pandas2ri.activate() +`rpy2 `__ is an interface to R running embedded in a Python process, and also includes functionality to deal with pandas DataFrames. 
Converting data frames back and forth between rpy2 and pandas should be largely automated (no need to convert explicitly, it will be done on the fly in most rpy2 functions). - To convert explicitly, the functions are ``pandas2ri.py2ri()`` and -``pandas2ri.ri2py()``. So these functions can be used to replace the existing -functions in pandas: - -- ``com.convert_to_r_dataframe(df)`` should be replaced with ``pandas2ri.py2ri(df)`` -- ``com.convert_robj(rdf)`` should be replaced with ``pandas2ri.ri2py(rdf)`` - -Note: these functions are for the latest version (rpy2 2.5.x) and were called -``pandas2ri.pandas2ri()`` and ``pandas2ri.ri2pandas()`` previously. - -Some of the other functionality in `pandas.rpy` can be replaced easily as well. -For example to load R data as done with the ``load_data`` function, the -current method:: - - df_iris = com.load_data('iris') - -can be replaced with:: - - from rpy2.robjects import r - r.data('iris') - df_iris = pandas2ri.ri2py(r[name]) - -The ``convert_to_r_matrix`` function can be replaced by the normal -``pandas2ri.py2ri`` to convert dataframes, with a subsequent call to R -``as.matrix`` function. - -.. warning:: - - Not all conversion functions in rpy2 are working exactly the same as the - current methods in pandas. If you experience problems or limitations in - comparison to the ones in pandas, please report this at the - `issue tracker `_. - -See also the documentation of the `rpy2 `__ project. - - -R interface with rpy2 ---------------------- +``pandas2ri.ri2py()``. -If your computer has R and rpy2 (> 2.2) installed (which will be left to the -reader), you will be able to leverage the below functionality. On Windows, -doing this is quite an ordeal at the moment, but users on Unix-like systems -should find it quite easy. rpy2 evolves in time, and is currently reaching -its release 2.3, while the current interface is -designed for the 2.2.x series. We recommend to use 2.2.x over other series -unless you are prepared to fix parts of the code, yet the rpy2-2.3.0 -introduces improvements such as a better R-Python bridge memory management -layer so it might be a good idea to bite the bullet and submit patches for -the few minor differences that need to be fixed. +See also the documentation of the `rpy2 `__ project: https://rpy2.readthedocs.io. -:: +In the remainder of this page, a few examples of explicit conversion is given. The pandas conversion of rpy2 needs first to be activated: - # if installing for the first time - hg clone http://bitbucket.org/lgautier/rpy2 - - cd rpy2 - hg pull - hg update version_2.2.x - sudo python setup.py install - -.. note:: - - To use R packages with this interface, you will need to install - them inside R yourself. At the moment it cannot install them for - you. +.. ipython:: python -Once you have done installed R and rpy2, you should be able to import -``pandas.rpy.common`` without a hitch. + from rpy2.robjects import r, pandas2ri + pandas2ri.activate() Transferring R data sets into Python ------------------------------------ -The **load_data** function retrieves an R data set and converts it to the -appropriate pandas object (most likely a DataFrame): - +Once the pandas conversion is activated (``pandas2ri.activate()``), many conversions +of R to pandas objects will be done automatically. For example, to obtain the 'iris' dataset as a pandas DataFrame: .. 
ipython:: python - :okwarning: - import pandas.rpy.common as com - infert = com.load_data('infert') - - infert.head() + r.data('iris') + r['iris'].head() +If the pandas conversion was not activated, the above could also be accomplished +by explicitly converting it with the ``pandas2ri.ri2py`` function +(``pandas2ri.ri2py(r['iris'])``). Converting DataFrames into R objects ------------------------------------ -.. versionadded:: 0.8 - -Starting from pandas 0.8, there is **experimental** support to convert +The ``pandas2ri.py2ri`` function support the reverse operation to convert DataFrames into the equivalent R object (that is, **data.frame**): .. ipython:: python - import pandas.rpy.common as com df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C':[7,8,9]}, index=["one", "two", "three"]) - r_dataframe = com.convert_to_r_dataframe(df) - + r_dataframe = pandas2ri.py2ri(df) print(type(r_dataframe)) print(r_dataframe) - The DataFrame's index is stored as the ``rownames`` attribute of the data.frame instance. -You can also use **convert_to_r_matrix** to obtain a ``Matrix`` instance, but -bear in mind that it will only work with homogeneously-typed DataFrames (as -R matrices bear no information on the data type): - -.. ipython:: python - - import pandas.rpy.common as com - r_matrix = com.convert_to_r_matrix(df) - - print(type(r_matrix)) - print(r_matrix) - - -Calling R functions with pandas objects ---------------------------------------- - - - -High-level interface to R estimators ------------------------------------- +.. + Calling R functions with pandas objects + High-level interface to R estimators diff --git a/doc/source/release.rst b/doc/source/release.rst index a0aa7e032fcf6..f89fec9fb86e6 100644 --- a/doc/source/release.rst +++ b/doc/source/release.rst @@ -51,7 +51,7 @@ Highlights include: - Compatibility with Python 3.6 - Added a `Pandas Cheat Sheet `__. (:issue:`13202`). -See the :ref:`v0.19.1 Whatsnew ` page for an overview of all +See the :ref:`v0.19.2 Whatsnew ` page for an overview of all bugs that have been fixed in 0.19.2. Thanks diff --git a/doc/source/remote_data.rst b/doc/source/remote_data.rst index e2c713ac8519a..7980133582125 100644 --- a/doc/source/remote_data.rst +++ b/doc/source/remote_data.rst @@ -13,7 +13,7 @@ DataReader The sub-package ``pandas.io.data`` is removed in favor of a separately installable `pandas-datareader package -`_. This will allow the data +`_. This will allow the data modules to be independently updated to your pandas installation. The API for ``pandas-datareader v0.1.1`` is the same as in ``pandas v0.16.1``. (:issue:`8961`) @@ -29,63 +29,3 @@ modules to be independently updated to your pandas installation. The API for .. code-block:: python from pandas_datareader import data, wb - - -.. _remote_data.ga: - -Google Analytics ----------------- - -The :mod:`~pandas.io.ga` module provides a wrapper for -`Google Analytics API `__ -to simplify retrieving traffic data. -Result sets are parsed into a pandas DataFrame with a shape and data types -derived from the source table. - -Configuring Access to Google Analytics -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The first thing you need to do is to setup accesses to Google Analytics API. Follow the steps below: - -#. In the `Google Developers Console `__ - #. enable the Analytics API - #. create a new project - #. create a new Client ID for an "Installed Application" (in the "APIs & auth / Credentials section" of the newly created project) - #. download it (JSON file) -#. On your machine - #. 
rename it to ``client_secrets.json`` - #. move it to the ``pandas/io`` module directory - -The first time you use the :func:`read_ga` function, a browser window will open to ask you to authentify to the Google API. Do proceed. - -Using the Google Analytics API -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The following will fetch users and pageviews (metrics) data per day of the week, for the first semester of 2014, from a particular property. - -.. code-block:: python - - import pandas.io.ga as ga - ga.read_ga( - account_id = "2360420", - profile_id = "19462946", - property_id = "UA-2360420-5", - metrics = ['users', 'pageviews'], - dimensions = ['dayOfWeek'], - start_date = "2014-01-01", - end_date = "2014-08-01", - index_col = 0, - filters = "pagePath=~aboutus;ga:country==France", - ) - -The only mandatory arguments are ``metrics,`` ``dimensions`` and ``start_date``. We strongly recommend that you always specify the ``account_id``, ``profile_id`` and ``property_id`` to avoid accessing the wrong data bucket in Google Analytics. - -The ``index_col`` argument indicates which dimension(s) has to be taken as index. - -The ``filters`` argument indicates the filtering to apply to the query. In the above example, the page URL has to contain ``aboutus`` AND the visitors country has to be France. - -Detailed information in the following: - -* `pandas & google analytics, by yhat `__ -* `Google Analytics integration in pandas, by Chang She `__ -* `Google Analytics Dimensions and Metrics Reference `_ diff --git a/doc/source/reshaping.rst b/doc/source/reshaping.rst index 9ed2c42610b69..b93749922c8ea 100644 --- a/doc/source/reshaping.rst +++ b/doc/source/reshaping.rst @@ -217,7 +217,7 @@ calling ``sort_index``, of course). Here is a more complex example: ('one', 'two')], names=['first', 'second']) df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns) - df2 = df.ix[[0, 1, 2, 4, 5, 7]] + df2 = df.iloc[[0, 1, 2, 4, 5, 7]] df2 As mentioned above, ``stack`` can be called with a ``level`` argument to select @@ -265,8 +265,8 @@ the right thing: Reshaping by Melt ----------------- -The :func:`~pandas.melt` function is useful to massage a -DataFrame into a format where one or more columns are identifier variables, +The top-level :func:``melt` and :func:`~DataFrame.melt` functions are useful to +massage a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are "unpivoted" to the row axis, leaving just two non-identifier columns, "variable" and "value". The names of those columns can be customized by supplying the ``var_name`` and @@ -281,10 +281,11 @@ For instance, 'height' : [5.5, 6.0], 'weight' : [130, 150]}) cheese - pd.melt(cheese, id_vars=['first', 'last']) - pd.melt(cheese, id_vars=['first', 'last'], var_name='quantity') + cheese.melt(id_vars=['first', 'last']) + cheese.melt(id_vars=['first', 'last'], var_name='quantity') -Another way to transform is to use the ``wide_to_long`` panel data convenience function. +Another way to transform is to use the ``wide_to_long`` panel data convenience +function. .. ipython:: python @@ -323,6 +324,10 @@ Pivot tables .. _reshaping.pivot: +While ``pivot`` provides general purpose pivoting of DataFrames with various +data types (strings, numerics, etc.), Pandas also provides the ``pivot_table`` +function for pivoting with aggregation of numeric data. + The function ``pandas.pivot_table`` can be used to create spreadsheet-style pivot tables. 
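+
+A small sketch of the distinction (the frame below is purely illustrative):
+
+.. code-block:: python
+
+   df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
+                      'B': ['x', 'y', 'x', 'y'],
+                      'C': [1.0, 2.0, 3.0, 4.0]})
+
+   # pivot only reshapes the data
+   df.pivot(index='A', columns='B', values='C')
+
+   # pivot_table reshapes and aggregates (mean by default)
+   df.pivot_table(index='A', columns='B', values='C', aggfunc='mean')
+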
See the :ref:`cookbook` for some advanced strategies @@ -512,7 +517,15 @@ Alternatively we can specify custom bin-edges: .. ipython:: python - pd.cut(ages, bins=[0, 18, 35, 70]) + c = pd.cut(ages, bins=[0, 18, 35, 70]) + c + +.. versionadded:: 0.20.0 + +If the ``bins`` keyword is an ``IntervalIndex``, then these will be +used to bin the passed data. + + pd.cut([25, 20, 50], bins=c.categories) .. _reshaping.dummies: @@ -646,10 +659,16 @@ handling of NaN: because of an ordering bug. See also `Here `__ -.. ipython:: python +.. code-block:: ipython + + In [2]: pd.factorize(x, sort=True) + Out[2]: + (array([ 2, 2, -1, 3, 0, 1]), + Index([3.14, inf, u'A', u'B'], dtype='object')) + + In [3]: np.unique(x, return_inverse=True)[::-1] + Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object)) - pd.factorize(x, sort=True) - np.unique(x, return_inverse=True)[::-1] .. note:: If you just want to handle one column as a categorical variable (like R's factor), diff --git a/doc/source/sparse.rst b/doc/source/sparse.rst index d3f921f8762cc..b4884cf1c4141 100644 --- a/doc/source/sparse.rst +++ b/doc/source/sparse.rst @@ -45,7 +45,7 @@ large, mostly NA DataFrame: .. ipython:: python df = pd.DataFrame(randn(10000, 4)) - df.ix[:9998] = np.nan + df.iloc[:9998] = np.nan sdf = df.to_sparse() sdf sdf.density @@ -186,7 +186,37 @@ the correct dense result. Interaction with scipy.sparse ----------------------------- -Experimental api to transform between sparse pandas and scipy.sparse structures. +SparseDataFrame +~~~~~~~~~~~~~~~ + +.. versionadded:: 0.20.0 + +Pandas supports creating sparse dataframes directly from ``scipy.sparse`` matrices. + +.. ipython:: python + + from scipy.sparse import csr_matrix + + arr = np.random.random(size=(1000, 5)) + arr[arr < .9] = 0 + + sp_arr = csr_matrix(arr) + sp_arr + + sdf = pd.SparseDataFrame(sp_arr) + sdf + +All sparse formats are supported, but matrices that are not in :mod:`COOrdinate ` format will be converted, copying data as needed. +To convert a ``SparseDataFrame`` back to sparse SciPy matrix in COO format, you can use the :meth:`SparseDataFrame.to_coo` method: + +.. ipython:: python + + sdf.to_coo() + +SparseSeries +~~~~~~~~~~~~ + +.. versionadded:: 0.16.0 A :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``. diff --git a/doc/source/html-styling.ipynb b/doc/source/style.ipynb similarity index 71% rename from doc/source/html-styling.ipynb rename to doc/source/style.ipynb index e55712b2bb4f6..427b18b988aef 100644 --- a/doc/source/html-styling.ipynb +++ b/doc/source/style.ipynb @@ -2,45 +2,38 @@ "cells": [ { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "collapsed": true + }, "source": [ + "# Styling\n", + "\n", "*New in version 0.17.1*\n", "\n", - "
*Provisional: This is a new feature and still under development. We'll be adding features and possibly making breaking changes in future releases. We'd love to hear your [feedback](https://github.com/pydata/pandas/issues).*
\n", + "*Provisional: This is a new feature and still under development. We'll be adding features and possibly making breaking changes in future releases. We'd love to hear your feedback.*\n", "\n", - "This document is written as a Jupyter Notebook, and can be viewed or downloaded [here](http://nbviewer.ipython.org/github/pydata/pandas/blob/master/doc/source/html-styling.ipynb).\n", + "This document is written as a Jupyter Notebook, and can be viewed or downloaded [here](http://nbviewer.ipython.org/github/pandas-dev/pandas/blob/master/doc/source/html-styling.ipynb).\n", "\n", "You can apply **conditional formatting**, the visual styling of a DataFrame\n", "depending on the data within, by using the ``DataFrame.style`` property.\n", - "This is a property that returns a ``pandas.Styler`` object, which has\n", + "This is a property that returns a ``Styler`` object, which has\n", "useful methods for formatting and displaying DataFrames.\n", "\n", "The styling is accomplished using CSS.\n", "You write \"style functions\" that take scalars, `DataFrame`s or `Series`, and return *like-indexed* DataFrames or Series with CSS `\"attribute: value\"` pairs for the values.\n", - "These functions can be incrementally passed to the `Styler` which collects the styles before rendering.\n", - "\n", - "### Contents\n", - "\n", - "- [Building Styles](#Building-Styles)\n", - "- [Finer Control: Slicing](#Finer-Control:-Slicing)\n", - "- [Builtin Styles](#Builtin-Styles)\n", - "- [Other options](#Other-options)\n", - "- [Sharing Styles](#Sharing-Styles)\n", - "- [Limitations](#Limitations)\n", - "- [Terms](#Terms)\n", - "- [Extensibility](#Extensibility)" + "These functions can be incrementally passed to the `Styler` which collects the styles before rendering." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Building Styles\n", + "## Building Styles\n", "\n", "Pass your style functions into one of the following methods:\n", "\n", - "- `Styler.applymap`: elementwise\n", - "- `Styler.apply`: column-/row-/table-wise\n", + "- ``Styler.applymap``: elementwise\n", + "- ``Styler.apply``: column-/row-/table-wise\n", "\n", "Both of those methods take a function (and some other keyword arguments) and applies your function to the DataFrame in a certain way.\n", "`Styler.applymap` works through the DataFrame elementwise.\n", @@ -58,7 +51,21 @@ "cell_type": "code", "execution_count": null, "metadata": { - "collapsed": false + "collapsed": true, + "nbsphinx": "hidden" + }, + "outputs": [], + "source": [ + "import matplotlib.pyplot\n", + "# We have this here to trigger matplotlib's font cache stuff.\n", + "# This cell is hidden from the output" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true }, "outputs": [], "source": [ @@ -82,9 +89,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df.style" @@ -94,7 +99,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "*Note*: The `DataFrame.style` attribute is a propetry that returns a `Styler` object. `Styler` has a `_repr_html_` method defined on it so they are rendered automatically. If you want the actual HTML back for further processing or for writing to file call the `.render()` method which returns a string.\n", + "*Note*: The `DataFrame.style` attribute is a property that returns a `Styler` object. `Styler` has a `_repr_html_` method defined on it so they are rendered automatically. 
If you want the actual HTML back for further processing or for writing to file call the `.render()` method which returns a string.\n", "\n", "The above output looks very similar to the standard DataFrame HTML representation. But we've done some work behind the scenes to attach CSS classes to each cell. We can view these by calling the `.render` method." ] @@ -102,9 +107,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df.style.highlight_null().render().split('\\n')[:10]" @@ -155,9 +158,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "s = df.style.applymap(color_negative_red)\n", @@ -203,9 +204,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df.style.apply(highlight_max)" @@ -229,9 +228,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df.style.\\\n", @@ -245,7 +242,7 @@ "source": [ "Above we used `Styler.apply` to pass in each column one at a time.\n", "\n", - "
*Debugging Tip*: If you're having trouble writing your style function, try just passing it into DataFrame.apply. Internally, Styler.apply uses DataFrame.apply so the result should be the same.
\n", + "*Debugging Tip*: If you're having trouble writing your style function, try just passing it into DataFrame.apply. Internally, Styler.apply uses DataFrame.apply so the result should be the same.\n", "\n", "What if you wanted to highlight just the maximum value in the entire table?\n", "Use `.apply(function, axis=None)` to indicate that your function wants the entire table, not one column or row at a time. Let's try that next.\n", @@ -285,9 +282,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df.style.apply(highlight_max, color='darkorange', axis=None)" @@ -335,9 +330,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df.style.apply(highlight_max, subset=['B', 'C', 'D'])" @@ -353,9 +346,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df.style.applymap(color_negative_red,\n", @@ -388,9 +379,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df.style.format(\"{:.2%}\")" @@ -406,9 +395,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df.style.format({'B': \"{:0<4.0f}\", 'D': '{:+.2f}'})" @@ -424,9 +411,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df.style.format({\"B\": lambda x: \"±{:.2f}\".format(abs(x))})" @@ -449,9 +434,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df.style.highlight_null(null_color='red')" @@ -467,9 +450,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", @@ -490,9 +471,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "# Uses the full color range\n", @@ -502,12 +481,10 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ - "# Compreess the color range\n", + "# Compress the color range\n", "(df.loc[:4]\n", " .style\n", " .background_gradient(cmap='viridis', low=.5, high=0)\n", @@ -518,67 +495,128 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You can include \"bar charts\" in your DataFrame." + "There's also `.highlight_min` and `.highlight_max`." ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ - "df.style.bar(subset=['A', 'B'], color='#d65f5f')" + "df.style.highlight_max(axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "There's also `.highlight_min` and `.highlight_max`." + "Use `Styler.set_properties` when the style doesn't actually depend on the values." 
] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ - "df.style.highlight_max(axis=0)" + "df.style.set_properties(**{'background-color': 'black',\n", + " 'color': 'lawngreen',\n", + " 'border-color': 'white'})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Bar charts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can include \"bar charts\" in your DataFrame." ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ - "df.style.highlight_min(axis=0)" + "df.style.bar(subset=['A', 'B'], color='#d65f5f')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Use `Styler.set_properties` when the style doesn't actually depend on the values." + "New in version 0.20.0 is the ability to customize further the bar chart: You can now have the `df.style.bar` be centered on zero or midpoint value (in addition to the already existing way of having the min value at the left side of the cell), and you can pass a list of `[color_negative, color_positive]`.\n", + "\n", + "Here's how you can change the above with the new `align='mid'` option:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ - "df.style.set_properties(**{'background-color': 'black',\n", - " 'color': 'lawngreen',\n", - " 'border-color': 'white'})" + "df.style.bar(subset=['A', 'B'], align='mid', color=['#d65f5f', '#5fba7d'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following example aims to give a highlight of the behavior of the new align options:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from IPython.display import HTML\n", + "\n", + "# Test series\n", + "test1 = pd.Series([-100,-60,-30,-20], name='All Negative')\n", + "test2 = pd.Series([10,20,50,100], name='All Positive')\n", + "test3 = pd.Series([-10,-5,0,90], name='Both Pos and Neg')\n", + "\n", + "head = \"\"\"\n", + "
<table>\n",
+    "    <thead>\n",
+    "        <th>Align</th>\n",
+    "        <th>All Negative</th>\n",
+    "        <th>All Positive</th>\n",
+    "        <th>Both Neg and Pos</th>\n",
+    "    </thead>\n",
+    "    <tbody>\n",
+    "\n",
+    "\"\"\"\n",
+    "\n",
+    "aligns = ['left','zero','mid']\n",
+    "for align in aligns:\n",
+    "    row = \"<tr><th>{}</th>\".format(align)\n",
+    "    for serie in [test1,test2,test3]:\n",
+    "        s = serie.copy()\n",
+    "        s.name=''\n",
+    "        row += \"<td>{}</td>\".format(s.to_frame().style.bar(align=align, \n",
+    "                                                           color=['#d65f5f', '#5fba7d'], \n",
+    "                                                           width=100).render())\n",
+    "    row += '</tr>'\n",
+    "    head += row\n",
+    "    \n",
+    "head+= \"\"\"\n",
+    "\n",
+    "</tbody>\n",
+    "</table>
\"\"\"\n", + " \n", + "\n", + "HTML(head)" ] }, { @@ -598,9 +636,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df2 = -df\n", @@ -611,9 +647,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "style2 = df2.style\n", @@ -632,7 +666,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Other options\n", + "## Other Options\n", "\n", "You've seen a few methods for data-driven styling.\n", "`Styler` also provides a few other options for styles that don't depend on the data.\n", @@ -643,7 +677,7 @@ "\n", "Each of these can be specified in two ways:\n", "\n", - "- A keyword argument to `pandas.core.Styler`\n", + "- A keyword argument to `Styler.__init__`\n", "- A call to one of the `.set_` methods, e.g. `.set_caption`\n", "\n", "The best method to use depends on the context. Use the `Styler` constructor when building many styled DataFrames that should all share the same properties. For interactive use, the`.set_` methods are more convenient." @@ -653,7 +687,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Precision" + "### Precision" ] }, { @@ -666,9 +700,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "with pd.option_context('display.precision', 2):\n", @@ -688,9 +720,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df.style\\\n", @@ -723,9 +753,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "df.style.set_caption('Colormaps, with a caption.')\\\n", @@ -751,9 +779,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "from IPython.display import HTML\n", @@ -792,7 +818,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# CSS Classes\n", + "### CSS Classes\n", "\n", "Certain CSS classes are attached to cells.\n", "\n", @@ -813,7 +839,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Limitations\n", + "### Limitations\n", "\n", "- DataFrame only `(use Series.to_frame().style)`\n", "- The index and columns must be unique\n", @@ -828,7 +854,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Terms\n", + "### Terms\n", "\n", "- Style function: a function that's passed into `Styler.apply` or `Styler.applymap` and returns values like `'css attribute: value'`\n", "- Builtin style functions: style functions that are methods on `Styler`\n", @@ -849,9 +875,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "from IPython.html import widgets\n", @@ -867,7 +891,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "collapsed": false + "collapsed": true }, "outputs": [], "source": [ @@ -887,18 +911,16 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": false - }, + "metadata": {}, "outputs": [], "source": [ "np.random.seed(25)\n", "cmap = cmap=sns.diverging_palette(5, 250, as_cmap=True)\n", - "df = pd.DataFrame(np.random.randn(20, 25)).cumsum()\n", + "bigdf = pd.DataFrame(np.random.randn(20, 25)).cumsum()\n", "\n", - "df.style.background_gradient(cmap, axis=1)\\\n", + 
"bigdf.style.background_gradient(cmap, axis=1)\\\n", " .set_properties(**{'max-width': '80px', 'font-size': '1pt'})\\\n", - " .set_caption(\"Hover to magify\")\\\n", + " .set_caption(\"Hover to magnify\")\\\n", " .set_precision(2)\\\n", " .set_table_styles(magnify())" ] @@ -907,41 +929,235 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Extensibility\n", + "## Export to Excel\n", "\n", - "The core of pandas is, and will remain, its \"high-performance, easy-to-use data structures\".\n", - "With that in mind, we hope that `DataFrame.style` accomplishes two goals\n", + "*New in version 0.20.0*\n", "\n", - "- Provide an API that is pleasing to use interactively and is \"good enough\" for many tasks\n", - "- Provide the foundations for dedicated libraries to build on\n", + "*Experimental: This is a new feature and still under development. We'll be adding features and possibly making breaking changes in future releases. We'd love to hear your feedback.*\n", "\n", - "If you build a great library on top of this, let us know and we'll [link](http://pandas.pydata.org/pandas-docs/stable/ecosystem.html) to it.\n", + "Some support is available for exporting styled `DataFrames` to Excel worksheets using the `OpenPyXL` engine. CSS2.2 properties handled include:\n", "\n", - "## Subclassing\n", + "- `background-color`\n", + "- `border-style`, `border-width`, `border-color` and their {`top`, `right`, `bottom`, `left` variants}\n", + "- `color`\n", + "- `font-family`\n", + "- `font-style`\n", + "- `font-weight`\n", + "- `text-align`\n", + "- `text-decoration`\n", + "- `vertical-align`\n", + "- `white-space: nowrap`\n", "\n", - "This section contains a bit of information about the implementation of `Styler`.\n", - "Since the feature is so new all of this is subject to change, even more so than the end-use API.\n", + "Only CSS2 named colors and hex colors of the form `#rgb` or `#rrggbb` are currently supported." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "df.style.\\\n", + " applymap(color_negative_red).\\\n", + " apply(highlight_max).\\\n", + " to_excel('styled.xlsx', engine='openpyxl')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A screenshot of the output:\n", "\n", - "As users apply styles (via `.apply`, `.applymap` or one of the builtins), we don't actually calculate anything.\n", - "Instead, we append functions and arguments to a list `self._todo`.\n", - "When asked (typically in `.render` we'll walk through the list and execute each function (this is in `self._compute()`.\n", - "These functions update an internal `defaultdict(list)`, `self.ctx` which maps DataFrame row / column positions to CSS attribute, value pairs.\n", + "![Excel spreadsheet with styled DataFrame](_static/style-excel.png)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extensibility\n", "\n", - "We take the extra step through `self._todo` so that we can export styles and set them on other `Styler`s.\n", + "The core of pandas is, and will remain, its \"high-performance, easy-to-use data structures\".\n", + "With that in mind, we hope that `DataFrame.style` accomplishes two goals\n", "\n", - "Rendering uses [Jinja](http://jinja.pocoo.org/) templates.\n", - "The `.translate` method takes `self.ctx` and builds another dictionary ready to be passed into `Styler.template.render`, the Jinja template.\n", + "- Provide an API that is pleasing to use interactively and is \"good enough\" for many tasks\n", + "- Provide the foundations for dedicated libraries to build on\n", "\n", + "If you build a great library on top of this, let us know and we'll [link](http://pandas.pydata.org/pandas-docs/stable/ecosystem.html) to it.\n", "\n", - "## Alternate templates\n", + "### Subclassing\n", "\n", - "We've used [Jinja](http://jinja.pocoo.org/) templates to build up the HTML.\n", - "The template is stored as a class variable ``Styler.template.``. Subclasses can override that.\n", + "If the default template doesn't quite suit your needs, you can subclass Styler and extend or override the template.\n", + "We'll show an example of extending the default template to insert a custom header before each table." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from jinja2 import Environment, ChoiceLoader, FileSystemLoader\n", + "from IPython.display import HTML\n", + "from pandas.io.formats.style import Styler" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%mkdir templates" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This next cell writes the custom template.\n", + "We extend the template `html.tpl`, which comes with pandas." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%file templates/myhtml.tpl\n", + "{% extends \"html.tpl\" %}\n", + "{% block table %}\n", + "
<h1>{{ table_title|default(\"My Table\") }}</h1>
\n", + "{{ super() }}\n", + "{% endblock table %}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we've created a template, we need to set up a subclass of ``Styler`` that\n", + "knows about it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "class MyStyler(Styler):\n", + " env = Environment(\n", + " loader=ChoiceLoader([\n", + " FileSystemLoader(\"templates\"), # contains ours\n", + " Styler.loader, # the default\n", + " ])\n", + " )\n", + " template = env.get_template(\"myhtml.tpl\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that we include the original loader in our environment's loader.\n", + "That's because we extend the original template, so the Jinja environment needs\n", + "to be able to find it.\n", "\n", - "```python\n", - "class CustomStyle(Styler):\n", - " template = Template(\"\"\"...\"\"\")\n", - "```" + "Now we can use that custom styler. It's `__init__` takes a DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "MyStyler(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our custom template accepts a `table_title` keyword. We can provide the value in the `.render` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "HTML(MyStyler(df).render(table_title=\"Extending Example\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For convenience, we provide the `Styler.from_custom_template` method that does the same as the custom subclass." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "EasyStyler = Styler.from_custom_template(\"templates\", \"myhtml.tpl\")\n", + "EasyStyler(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here's the template structure:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "with open(\"template_structure.html\") as f:\n", + " structure = f.read()\n", + " \n", + "HTML(structure)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "See the template in the [GitHub repo](https://github.com/pandas-dev/pandas) for more details." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "nbsphinx": "hidden" + }, + "outputs": [], + "source": [ + "# Hack to get the same style in the notebook as the\n", + "# main site. This is hidden in the docs.\n", + "from IPython.display import HTML\n", + "with open(\"themes/nature_with_gtoc/static/nature.css_t\") as f:\n", + " css = f.read()\n", + " \n", + "HTML(''.format(css))" ] } ], @@ -961,9 +1177,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.5.1" + "version": "3.6.1" } }, "nbformat": 4, - "nbformat_minor": 0 + "nbformat_minor": 1 } diff --git a/doc/source/style.rst b/doc/source/style.rst deleted file mode 100644 index 506b38bf06e65..0000000000000 --- a/doc/source/style.rst +++ /dev/null @@ -1,10 +0,0 @@ -.. _style: - -.. currentmodule:: pandas - -***** -Style -***** - -.. 
raw:: html -   :file: html-styling.html diff --git a/doc/source/template_structure.html b/doc/source/template_structure.html new file mode 100644 index 0000000000000..0778d8e2e6f18 --- /dev/null +++ b/doc/source/template_structure.html @@ -0,0 +1,57 @@
+before_style
+
+style
+  <style type="text/css">
+    table_styles
+    before_cellstyle
+    cellstyle
+  </style>
+
+before_table
+
+table
+  <table ...>
+    caption
+
+    thead
+      before_head_rows
+      head_tr (loop over headers)
+      after_head_rows
+
+    tbody
+      before_rows
+      tr (loop over data rows)
+      after_rows
+
+  </table>
+
+after_table
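Each name in the structure above corresponds to a Jinja `{% block %}` in `html.tpl`, so a custom template can override any one of them individually. A minimal sketch, assuming pandas 0.20+ and the `templates/` directory from the notebook cells earlier in this diff; `mytable.tpl`, its inserted paragraph, and the small `df` are hypothetical stand-ins, not part of this PR:

```python
import pandas as pd
from pandas.io.formats.style import Styler

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})  # stand-in for the notebook's df

# Sketch only: override just the before_table block from the structure above.
with open("templates/mytable.tpl", "w") as f:
    f.write(
        '{% extends "html.tpl" %}\n'
        '{% block before_table %}\n'
        '<p>Rendered with a custom template.</p>\n'
        '{% endblock before_table %}\n'
    )

# from_custom_template builds a Styler subclass bound to the new template
CustomStyler = Styler.from_custom_template("templates", "mytable.tpl")
html = CustomStyler(df).render()  # the <p> now precedes the <table>
```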
diff --git a/doc/source/text.rst b/doc/source/text.rst index 3a4a57ff4da95..e3e4b24d17f44 100644 --- a/doc/source/text.rst +++ b/doc/source/text.rst @@ -146,6 +146,46 @@ following code will cause trouble because of the regular expression meaning of # We need to escape the special character (for >1 len patterns) dollars.str.replace(r'-\$', '-') +The ``replace`` method can also take a callable as replacement. It is called +on every ``pat`` using :func:`re.sub`. The callable should expect one +positional argument (a regex match object) and return a string. + +.. versionadded:: 0.20.0 + +.. ipython:: python + + # Reverse every lowercase alphabetic word + pat = r'[a-z]+' + repl = lambda m: m.group(0)[::-1] + pd.Series(['foo 123', 'bar baz', np.nan]).str.replace(pat, repl) + + # Using regex groups + pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)" + repl = lambda m: m.group('two').swapcase() + pd.Series(['Foo Bar Baz', np.nan]).str.replace(pat, repl) + +The ``replace`` method also accepts a compiled regular expression object +from :func:`re.compile` as a pattern. All flags should be included in the +compiled regular expression object. + +.. versionadded:: 0.20.0 + +.. ipython:: python + + import re + regex_pat = re.compile(r'^.a|dog', flags=re.IGNORECASE) + s3.str.replace(regex_pat, 'XX-XX ') + +Including a ``flags`` argument when calling ``replace`` with a compiled +regular expression object will raise a ``ValueError``. + +.. ipython:: + + @verbatim + In [1]: s3.str.replace(regex_pat, 'XX-XX ', flags=re.IGNORECASE) + --------------------------------------------------------------------------- + ValueError: case and flags cannot be set when pat is a compiled regex + Indexing with ``.str`` ---------------------- @@ -332,33 +372,20 @@ You can check whether elements contain a pattern: .. ipython:: python - pattern = r'[a-z][0-9]' + pattern = r'[0-9][a-z]' pd.Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern) or match a pattern: - .. ipython:: python - pd.Series(['1', '2', '3a', '3b', '03c']).str.match(pattern, as_indexer=True) + pd.Series(['1', '2', '3a', '3b', '03c']).str.match(pattern) The distinction between ``match`` and ``contains`` is strictness: ``match`` relies on strict ``re.match``, while ``contains`` relies on ``re.search``. -.. warning:: - - In previous versions, ``match`` was for *extracting* groups, - returning a not-so-convenient Series of tuples. The new method ``extract`` - (described in the previous section) is now preferred. - - This old, deprecated behavior of ``match`` is still the default. As - demonstrated above, use the new behavior by setting ``as_indexer=True``. - In this mode, ``match`` is analogous to ``contains``, returning a boolean - Series. The new behavior will become the default behavior in a future - release. - Methods like ``match``, ``contains``, ``startswith``, and ``endswith`` take - an extra ``na`` argument so missing values can be considered True or False: +an extra ``na`` argument so missing values can be considered True or False: .. 
ipython:: python @@ -406,7 +433,7 @@ Method Summary :meth:`~Series.str.join`;Join strings in each element of the Series with passed separator :meth:`~Series.str.get_dummies`;Split strings on the delimiter returning DataFrame of dummy variables :meth:`~Series.str.contains`;Return boolean array if each string contains pattern/regex - :meth:`~Series.str.replace`;Replace occurrences of pattern/regex with some other string + :meth:`~Series.str.replace`;Replace occurrences of pattern/regex with some other string or the return value of a callable given the occurrence :meth:`~Series.str.repeat`;Duplicate values (``s.str.repeat(3)`` equivalent to ``x * 3``) :meth:`~Series.str.pad`;"Add whitespace to left, right, or both sides of strings" :meth:`~Series.str.center`;Equivalent to ``str.center`` diff --git a/doc/source/themes/nature_with_gtoc/static/nature.css_t b/doc/source/themes/nature_with_gtoc/static/nature.css_t index 2948f0d68b402..b61068ee28bef 100644 --- a/doc/source/themes/nature_with_gtoc/static/nature.css_t +++ b/doc/source/themes/nature_with_gtoc/static/nature.css_t @@ -299,20 +299,45 @@ td.field-body blockquote { padding-left: 30px; } -.rendered_html table { +// Adapted from the new Jupyter notebook style +// https://github.com/jupyter/notebook/blob/c8841b68c4c0739bbee1291e0214771f24194079/notebook/static/notebook/less/renderedhtml.less#L59 +table { margin-left: auto; margin-right: auto; - border-right: 1px solid #cbcbcb; - border-bottom: 1px solid #cbcbcb; + border: none; + border-collapse: collapse; + border-spacing: 0; + color: @rendered_html_border_color; + table-layout: fixed; +} +thead { + border-bottom: 1px solid @rendered_html_border_color; + vertical-align: bottom; +} +tr, th, td { + vertical-align: middle; + padding: 0.5em 0.5em; + line-height: normal; + white-space: normal; + max-width: none; + border: none; +} +th { + font-weight: bold; +} +th.col_heading { + text-align: right; +} +tbody tr:nth-child(odd) { + background: #f5f5f5; } -.rendered_html td, th { - border-left: 1px solid #cbcbcb; - border-top: 1px solid #cbcbcb; - margin: 0; - padding: 0.5em .75em; +table td.data, table th.row_heading table th.col_heading { + font-family: monospace; + text-align: right; } + /** * See also */ diff --git a/doc/source/timedeltas.rst b/doc/source/timedeltas.rst index 7d0fa61b1bdac..07effcfdff33b 100644 --- a/doc/source/timedeltas.rst +++ b/doc/source/timedeltas.rst @@ -4,18 +4,17 @@ .. ipython:: python :suppress: - from datetime import datetime, timedelta + import datetime import numpy as np + import pandas as pd np.random.seed(123456) - from pandas import * randn = np.random.randn randint = np.random.randint np.set_printoptions(precision=4, suppress=True) - options.display.max_rows=15 + pd.options.display.max_rows=15 import dateutil import pytz from dateutil.relativedelta import relativedelta - from pandas.tseries.api import * from pandas.tseries.offsets import * .. _timedeltas.timedeltas: @@ -40,41 +39,41 @@ You can construct a ``Timedelta`` scalar through various arguments: .. 
ipython:: python # strings - Timedelta('1 days') - Timedelta('1 days 00:00:00') - Timedelta('1 days 2 hours') - Timedelta('-1 days 2 min 3us') + pd.Timedelta('1 days') + pd.Timedelta('1 days 00:00:00') + pd.Timedelta('1 days 2 hours') + pd.Timedelta('-1 days 2 min 3us') # like datetime.timedelta # note: these MUST be specified as keyword arguments - Timedelta(days=1, seconds=1) + pd.Timedelta(days=1, seconds=1) # integers with a unit - Timedelta(1, unit='d') + pd.Timedelta(1, unit='d') - # from a timedelta/np.timedelta64 - Timedelta(timedelta(days=1, seconds=1)) - Timedelta(np.timedelta64(1, 'ms')) + # from a datetime.timedelta/np.timedelta64 + pd.Timedelta(datetime.timedelta(days=1, seconds=1)) + pd.Timedelta(np.timedelta64(1, 'ms')) # negative Timedeltas have this string repr # to be more consistent with datetime.timedelta conventions - Timedelta('-1us') + pd.Timedelta('-1us') # a NaT - Timedelta('nan') - Timedelta('nat') + pd.Timedelta('nan') + pd.Timedelta('nat') :ref:`DateOffsets` (``Day, Hour, Minute, Second, Milli, Micro, Nano``) can also be used in construction. .. ipython:: python - Timedelta(Second(2)) + pd.Timedelta(Second(2)) Further, operations among the scalars yield another scalar ``Timedelta``. .. ipython:: python - Timedelta(Day(2)) + Timedelta(Second(2)) + Timedelta('00:00:00.000123') + pd.Timedelta(Day(2)) + pd.Timedelta(Second(2)) + pd.Timedelta('00:00:00.000123') to_timedelta ~~~~~~~~~~~~ @@ -93,21 +92,21 @@ You can parse a single string to a Timedelta: .. ipython:: python - to_timedelta('1 days 06:05:01.00003') - to_timedelta('15.5us') + pd.to_timedelta('1 days 06:05:01.00003') + pd.to_timedelta('15.5us') or a list/array of strings: .. ipython:: python - to_timedelta(['1 days 06:05:01.00003', '15.5us', 'nan']) + pd.to_timedelta(['1 days 06:05:01.00003', '15.5us', 'nan']) The ``unit`` keyword argument specifies the unit of the Timedelta: .. ipython:: python - to_timedelta(np.arange(5), unit='s') - to_timedelta(np.arange(5), unit='d') + pd.to_timedelta(np.arange(5), unit='s') + pd.to_timedelta(np.arange(5), unit='d') .. _timedeltas.limitations: @@ -133,17 +132,17 @@ subtraction operations on ``datetime64[ns]`` Series, or ``Timestamps``. .. ipython:: python - s = Series(date_range('2012-1-1', periods=3, freq='D')) - td = Series([ Timedelta(days=i) for i in range(3) ]) - df = DataFrame(dict(A = s, B = td)) + s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D')) + td = pd.Series([ pd.Timedelta(days=i) for i in range(3) ]) + df = pd.DataFrame(dict(A = s, B = td)) df df['C'] = df['A'] + df['B'] df df.dtypes s - s.max() - s - datetime(2011, 1, 1, 3, 5) - s + timedelta(minutes=5) + s - datetime.datetime(2011, 1, 1, 3, 5) + s + datetime.timedelta(minutes=5) s + Minute(5) s + Minute(5) + Milli(5) @@ -173,17 +172,17 @@ Operands can also appear in a reversed order (a singular object operated with a .. ipython:: python s.max() - s - datetime(2011, 1, 1, 3, 5) - s - timedelta(minutes=5) + s + datetime.datetime(2011, 1, 1, 3, 5) - s + datetime.timedelta(minutes=5) + s ``min, max`` and the corresponding ``idxmin, idxmax`` operations are supported on frames: .. ipython:: python - A = s - Timestamp('20120101') - Timedelta('00:05:05') - B = s - Series(date_range('2012-1-2', periods=3, freq='D')) + A = s - pd.Timestamp('20120101') - pd.Timedelta('00:05:05') + B = s - pd.Series(pd.date_range('2012-1-2', periods=3, freq='D')) - df = DataFrame(dict(A=A, B=B)) + df = pd.DataFrame(dict(A=A, B=B)) df df.min() @@ -209,13 +208,13 @@ pass a timedelta to get a particular value. 
y.fillna(0) y.fillna(10) - y.fillna(Timedelta('-1 days, 00:00:05')) + y.fillna(pd.Timedelta('-1 days, 00:00:05')) You can also negate, multiply and use ``abs`` on ``Timedeltas``: .. ipython:: python - td1 = Timedelta('-1 days 2 hours 3 seconds') + td1 = pd.Timedelta('-1 days 2 hours 3 seconds') td1 -1 * td1 - td1 @@ -231,7 +230,7 @@ Numeric reduction operation for ``timedelta64[ns]`` will return ``Timedelta`` ob .. ipython:: python - y2 = Series(to_timedelta(['-1 days +00:00:05', 'nat', '-1 days +00:00:05', '1 days'])) + y2 = pd.Series(pd.to_timedelta(['-1 days +00:00:05', 'nat', '-1 days +00:00:05', '1 days'])) y2 y2.mean() y2.median() @@ -251,9 +250,9 @@ Note that division by the numpy scalar is true division, while astyping is equiv .. ipython:: python - td = Series(date_range('20130101', periods=4)) - \ - Series(date_range('20121201', periods=4)) - td[2] += timedelta(minutes=5, seconds=3) + td = pd.Series(pd.date_range('20130101', periods=4)) - \ + pd.Series(pd.date_range('20121201', periods=4)) + td[2] += datetime.timedelta(minutes=5, seconds=3) td[3] = np.nan td @@ -274,7 +273,7 @@ yields another ``timedelta64[ns]`` dtypes Series. .. ipython:: python td * -1 - td * Series([1, 2, 3, 4]) + td * pd.Series([1, 2, 3, 4]) Attributes ---------- @@ -298,7 +297,7 @@ You can access the value of the fields for a scalar ``Timedelta`` directly. .. ipython:: python - tds = Timedelta('31 days 5 min 3 sec') + tds = pd.Timedelta('31 days 5 min 3 sec') tds.days tds.seconds (-tds).seconds @@ -311,6 +310,21 @@ similarly to the ``Series``. These are the *displayed* values of the ``Timedelta td.dt.components td.dt.components.seconds +.. _timedeltas.isoformat: + +You can convert a ``Timedelta`` to an `ISO 8601 Duration`_ string with the +``.isoformat`` method + +.. versionadded:: 0.20.0 + +.. ipython:: python + + pd.Timedelta(days=6, minutes=50, seconds=3, + milliseconds=10, microseconds=10, + nanoseconds=12).isoformat() + +.. _ISO 8601 Duration: https://en.wikipedia.org/wiki/ISO_8601#Durations + .. _timedeltas.index: TimedeltaIndex @@ -326,15 +340,15 @@ or ``np.timedelta64`` objects. Passing ``np.nan/pd.NaT/nat`` will represent miss .. ipython:: python - TimedeltaIndex(['1 days', '1 days, 00:00:05', - np.timedelta64(2,'D'), timedelta(days=2,seconds=2)]) + pd.TimedeltaIndex(['1 days', '1 days, 00:00:05', + np.timedelta64(2,'D'), datetime.timedelta(days=2,seconds=2)]) Similarly to ``date_range``, you can construct regular ranges of a ``TimedeltaIndex``: .. ipython:: python - timedelta_range(start='1 days', periods=5, freq='D') - timedelta_range(start='1 days', end='2 days', freq='30T') + pd.timedelta_range(start='1 days', periods=5, freq='D') + pd.timedelta_range(start='1 days', end='2 days', freq='30T') Using the TimedeltaIndex ~~~~~~~~~~~~~~~~~~~~~~~~ @@ -344,8 +358,8 @@ Similarly to other of the datetime-like indices, ``DatetimeIndex`` and ``PeriodI .. ipython:: python - s = Series(np.arange(100), - index=timedelta_range('1 days', periods=100, freq='h')) + s = pd.Series(np.arange(100), + index=pd.timedelta_range('1 days', periods=100, freq='h')) s Selections work similarly, with coercion on string-likes and slices: @@ -354,7 +368,7 @@ Selections work similarly, with coercion on string-likes and slices: s['1 day':'2 day'] s['1 day 01:00:00'] - s[Timedelta('1 day 1h')] + s[pd.Timedelta('1 day 1h')] Furthermore you can use partial string selection and the range will be inferred: @@ -369,9 +383,9 @@ Finally, the combination of ``TimedeltaIndex`` with ``DatetimeIndex`` allow cert .. 
ipython:: python - tdi = TimedeltaIndex(['1 days', pd.NaT, '2 days']) + tdi = pd.TimedeltaIndex(['1 days', pd.NaT, '2 days']) tdi.tolist() - dti = date_range('20130101', periods=3) + dti = pd.date_range('20130101', periods=3) dti.tolist() (dti + tdi).tolist() (dti - tdi).tolist() @@ -391,14 +405,14 @@ Scalars type ops work as well. These can potentially return a *different* type o .. ipython:: python # adding or timedelta and date -> datelike - tdi + Timestamp('20130101') + tdi + pd.Timestamp('20130101') # subtraction of a date and a timedelta -> datelike # note that trying to subtract a date from a Timedelta will raise an exception - (Timestamp('20130101') - tdi).tolist() + (pd.Timestamp('20130101') - tdi).tolist() # timedelta + timedelta -> timedelta - tdi + Timedelta('10 days') + tdi + pd.Timedelta('10 days') # division can result in a Timedelta if the divisor is an integer tdi / 2 diff --git a/doc/source/timeseries.rst b/doc/source/timeseries.rst index 037dc53540fab..71d85f9b3995b 100644 --- a/doc/source/timeseries.rst +++ b/doc/source/timeseries.rst @@ -113,7 +113,7 @@ For example: pd.Period('2012-05', freq='D') ``Timestamp`` and ``Period`` can be the index. Lists of ``Timestamp`` and -``Period`` are automatically coerce to ``DatetimeIndex`` and ``PeriodIndex`` +``Period`` are automatically coerced to ``DatetimeIndex`` and ``PeriodIndex`` respectively. .. ipython:: python @@ -247,12 +247,15 @@ Return NaT for input when unparseable Out[6]: DatetimeIndex(['2009-07-31', 'NaT'], dtype='datetime64[ns]', freq=None) +.. _timeseries.converting.epoch: + Epoch Timestamps ~~~~~~~~~~~~~~~~ It's also possible to convert integer or float epoch times. The default unit for these is nanoseconds (since these are how ``Timestamp`` s are stored). However, -often epochs are stored in another ``unit`` which can be specified: +often epochs are stored in another ``unit`` which can be specified. These are computed +from the starting point specified by the :ref:`Origin Parameter `. Typical epoch stored units @@ -264,17 +267,63 @@ Typical epoch stored units pd.to_datetime([1349720105100, 1349720105200, 1349720105300, 1349720105400, 1349720105500 ], unit='ms') -These *work*, but the results may be unexpected. +.. note:: + + Epoch times will be rounded to the nearest nanosecond. + +.. warning:: + + Conversion of float epoch times can lead to inaccurate and unexpected results. + :ref:`Python floats ` have about 15 digits precision in + decimal. Rounding during conversion from float to high precision ``Timestamp`` is + unavoidable. The only way to achieve exact precision is to use a fixed-width + types (e.g. an int64). + + .. ipython:: python + + pd.to_datetime([1490195805.433, 1490195805.433502912], unit='s') + pd.to_datetime(1490195805433502912, unit='ns') + +.. _timeseries.converting.epoch_inverse: + +From Timestamps to Epoch +~~~~~~~~~~~~~~~~~~~~~~~~ + +To invert the operation from above, namely, to convert from a ``Timestamp`` to a 'unix' epoch: .. ipython:: python - pd.to_datetime([1]) + stamps = pd.date_range('2012-10-08 18:15:05', periods=4, freq='D') + stamps - pd.to_datetime([1, 3.14], unit='s') +We convert the ``DatetimeIndex`` to an ``int64`` array, then divide by the conversion unit. -.. note:: +.. ipython:: python - Epoch times will be rounded to the nearest nanosecond. + stamps.view('int64') // pd.Timedelta(1, unit='s') + +.. _timeseries.origin: + +Using the Origin Parameter +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
versionadded:: 0.20.0 + +Using the ``origin`` parameter, one can specify an alternative starting point for creation +of a ``DatetimeIndex``. + +Start with 1960-01-01 as the starting date + +.. ipython:: python + + pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01')) + +The default is set at ``origin='unix'``, which defaults to ``1970-01-01 00:00:00``. +Commonly called 'unix epoch' or POSIX time. + +.. ipython:: python + + pd.to_datetime([1, 2, 3], unit='D') .. _timeseries.daterange: @@ -358,8 +407,8 @@ See :ref:`here ` for ways to represent data outside these bound. .. _timeseries.datetimeindex: -DatetimeIndex -------------- +Indexing +-------- One of the main uses for ``DatetimeIndex`` is as an index for pandas objects. The ``DatetimeIndex`` class contains many timeseries related optimizations: @@ -399,8 +448,8 @@ intelligent functionality like selection, slicing, etc. .. _timeseries.partialindexing: -DatetimeIndex Partial String Indexing -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Partial String Indexing +~~~~~~~~~~~~~~~~~~~~~~~ You can pass in dates and strings that parse to dates as indexing parameters: @@ -457,22 +506,6 @@ We are stopping on the included end-point as it is part of the index dft['2013-1-15':'2013-1-15 12:30:00'] -.. warning:: - - The following selection will raise a ``KeyError``; otherwise this selection methodology - would be inconsistent with other selection methods in pandas (as this is not a *slice*, nor does it - resolve to one) - - .. code-block:: python - - dft['2013-1-15 12:30:00'] - - To select a single row, use ``.loc`` - - .. ipython:: python - - dft.loc['2013-1-15 12:30:00'] - .. versionadded:: 0.18.0 DatetimeIndex Partial String Indexing also works on DataFrames with a ``MultiIndex``. For example: @@ -491,12 +524,86 @@ DatetimeIndex Partial String Indexing also works on DataFrames with a ``MultiInd dft2 = dft2.swaplevel(0, 1).sort_index() dft2.loc[idx[:, '2013-01-05'], :] -Datetime Indexing -~~~~~~~~~~~~~~~~~ +.. _timeseries.slice_vs_exact_match: + +Slice vs. exact match +~~~~~~~~~~~~~~~~~~~~~ + +.. versionchanged:: 0.20.0 + +The same string used as an indexing parameter can be treated either as a slice or as an exact match depending on the resolution of an index. If the string is less accurate than the index, it will be treated as a slice, otherwise as an exact match. + +For example, let us consider ``Series`` object which index has minute resolution. + +.. ipython:: python + + series_minute = pd.Series([1, 2, 3], + pd.DatetimeIndex(['2011-12-31 23:59:00', + '2012-01-01 00:00:00', + '2012-01-01 00:02:00'])) + series_minute.index.resolution + +A timestamp string less accurate than a minute gives a ``Series`` object. + +.. ipython:: python + + series_minute['2011-12-31 23'] + +A timestamp string with minute resolution (or more accurate), gives a scalar instead, i.e. it is not casted to a slice. + +.. ipython:: python + + series_minute['2011-12-31 23:59'] + series_minute['2011-12-31 23:59:00'] + +If index resolution is second, then, the minute-accurate timestamp gives a ``Series``. + +.. ipython:: python + + series_second = pd.Series([1, 2, 3], + pd.DatetimeIndex(['2011-12-31 23:59:59', + '2012-01-01 00:00:00', + '2012-01-01 00:00:01'])) + series_second.index.resolution + series_second['2011-12-31 23:59'] + +If the timestamp string is treated as a slice, it can be used to index ``DataFrame`` with ``[]`` as well. + +.. 
ipython:: python + + dft_minute = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, + index=series_minute.index) + dft_minute['2011-12-31 23'] + + +.. warning:: + + However if the string is treated as an exact match, the selection in ``DataFrame``'s ``[]`` will be column-wise and not row-wise, see :ref:`Indexing Basics `. For example ``dft_minute['2011-12-31 23:59']`` will raise ``KeyError`` as ``'2012-12-31 23:59'`` has the same resolution as index and there is no column with such name: + + To *always* have unambiguous selection, whether the row is treated as a slice or a single selection, use ``.loc``. -Indexing a ``DateTimeIndex`` with a partial string depends on the "accuracy" of the period, in other words how specific the interval is in relation to the frequency of the index. In contrast, indexing with datetime objects is exact, because the objects have exact meaning. These also follow the semantics of *including both endpoints*. + .. ipython:: python + + dft_minute.loc['2011-12-31 23:59'] + +Note also that ``DatetimeIndex`` resolution cannot be less precise than day. + +.. ipython:: python + + series_monthly = pd.Series([1, 2, 3], + pd.DatetimeIndex(['2011-12', + '2012-01', + '2012-02'])) + series_monthly.index.resolution + series_monthly['2011-12'] # returns Series -These ``datetime`` objects are specific ``hours, minutes,`` and ``seconds`` even though they were not explicitly specified (they are ``0``). + +Exact Indexing +~~~~~~~~~~~~~~ + +As discussed in previous section, indexing a ``DateTimeIndex`` with a partial string depends on the "accuracy" of the period, in other words how specific the interval is in relation to the resolution of the index. In contrast, indexing with ``Timestamp`` or ``datetime`` objects is exact, because the objects have exact meaning. These also follow the semantics of *including both endpoints*. + +These ``Timestamp`` and ``datetime`` objects have exact ``hours, minutes,`` and ``seconds``, even though they were not explicitly specified (they are ``0``). .. ipython:: python @@ -525,10 +632,10 @@ regularity will result in a ``DatetimeIndex`` (but frequency is lost): ts[[0, 2, 6]].index -.. _timeseries.offsets: +.. _timeseries.components: Time/Date Components -~~~~~~~~~~~~~~~~~~~~~~~~~~~ +-------------------- There are several time/date properties that one can access from ``Timestamp`` or a collection of timestamps like a ``DateTimeIndex``. @@ -549,10 +656,10 @@ There are several time/date properties that one can access from ``Timestamp`` or dayofyear,"The ordinal day of year" weekofyear,"The week ordinal of the year" week,"The week ordinal of the year" - dayofweek,"The numer of the day of the week with Monday=0, Sunday=6" + dayofweek,"The number of the day of the week with Monday=0, Sunday=6" weekday,"The number of the day of the week with Monday=0, Sunday=6" weekday_name,"The name of the day in a week (ex: Friday)" - quarter,"Quarter of the date: Jan=Mar = 1, Apr-Jun = 2, etc." + quarter,"Quarter of the date: Jan-Mar = 1, Apr-Jun = 2, etc." days_in_month,"The number of days in the month of the datetime" is_month_start,"Logical indicating if first day of month (defined by frequency)" is_month_end,"Logical indicating if last day of month (defined by frequency)" @@ -564,6 +671,8 @@ There are several time/date properties that one can access from ``Timestamp`` or Furthermore, if you have a ``Series`` with datetimelike values, then you can access these properties via the ``.dt`` accessor, see the :ref:`docs ` +.. 
_timeseries.offsets: + DateOffset objects ------------------ @@ -628,12 +737,12 @@ We could have done the same thing with ``DateOffset``: The key features of a ``DateOffset`` object are: - - it can be added / subtracted to/from a datetime object to obtain a - shifted date - - it can be multiplied by an integer (positive or negative) so that the - increment will be applied multiple times - - it has ``rollforward`` and ``rollback`` methods for moving a date forward - or backward to the next or previous "offset date" +- it can be added / subtracted to/from a datetime object to obtain a + shifted date +- it can be multiplied by an integer (positive or negative) so that the + increment will be applied multiple times +- it has ``rollforward`` and ``rollback`` methods for moving a date forward + or backward to the next or previous "offset date" Subclasses of ``DateOffset`` define the ``apply`` function which dictates custom date increment logic, such as adding business days: @@ -745,7 +854,7 @@ used exactly like a ``Timedelta`` - see the Note that some offsets (such as ``BQuarterEnd``) do not have a vectorized implementation. They can still be used but may -calculate significantly slower and will raise a ``PerformanceWarning`` +calculate significantly slower and will show a ``PerformanceWarning`` .. ipython:: python :okwarning: @@ -755,8 +864,8 @@ calculate significantly slower and will raise a ``PerformanceWarning`` .. _timeseries.custombusinessdays: -Custom Business Days (Experimental) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Custom Business Days +~~~~~~~~~~~~~~~~~~~~ The ``CDay`` or ``CustomBusinessDay`` class provides a parametric ``BusinessDay`` class which can be used to create customized business day @@ -785,7 +894,7 @@ Let's map to the weekday names pd.Series(dts.weekday, dts).map(pd.Series('Mon Tue Wed Thu Fri Sat Sun'.split())) -As of v0.14 holiday calendars can be used to provide the list of holidays. See the +Holiday calendars can be used to provide the list of holidays. See the :ref:`holiday calendar` section for more information. .. ipython:: python @@ -1286,15 +1395,17 @@ secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. ``.resample()`` is a time-based groupby, followed by a reduction method on each of its groups. +See some :ref:`cookbook examples ` for some advanced strategies Starting in version 0.18.1, the ``resample()`` function can be used directly from -DataFrameGroupBy objects, see the :ref:`groupby docs `. +``DataFrameGroupBy`` objects, see the :ref:`groupby docs `. .. note:: - ``.resample()`` is similar to using a ``.rolling()`` operation with a time-based offset, see a discussion `here ` + ``.resample()`` is similar to using a ``.rolling()`` operation with a time-based offset, see a discussion :ref:`here ` -See some :ref:`cookbook examples ` for some advanced strategies +Basics +~~~~~~ .. ipython:: python @@ -1408,11 +1519,13 @@ We can instead only resample those groups where we have points as follows: ts.groupby(partial(round, freq='3T')).sum() +.. _timeseries.aggregate: + Aggregation ~~~~~~~~~~~ -Similar to :ref:`groupby aggregates ` and the :ref:`window functions `, a ``Resampler`` can be selectively -resampled. +Similar to the :ref:`aggregating API `, :ref:`groupby API `, and the :ref:`window functions API `, +a ``Resampler`` can be selectively resampled. Resampling a ``DataFrame``, the default will be to act on all columns with the same function. 
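To make the "time-based groupby, followed by a reduction" description above concrete, here is a minimal sketch (the names `rng` and `ts` are illustrative, not taken from this diff):

```python
import numpy as np
import pandas as pd

# One observation per second for ten minutes
rng = pd.date_range('2012-01-01', periods=600, freq='S')
ts = pd.Series(np.random.randn(len(rng)), index=rng)

# Downsample: group the secondly data into 5-minute bins,
# then reduce each group with a sum
ts.resample('5T').sum()
```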
@@ -1438,14 +1551,6 @@ You can pass a list or dict of functions to do aggregation with, outputting a Da r['A'].agg([np.sum, np.mean, np.std]) -If a dict is passed, the keys will be used to name the columns. Otherwise the -function's name (stored in the function object) will be used. - -.. ipython:: python - - r['A'].agg({'result1' : np.sum, - 'result2' : np.mean}) - On a resampled DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index: diff --git a/doc/source/tutorials.rst b/doc/source/tutorials.rst index c25e734a046b2..2489b787560d0 100644 --- a/doc/source/tutorials.rst +++ b/doc/source/tutorials.rst @@ -57,7 +57,7 @@ See `How to use this cookbook `_. +For more resources, please visit the main `repository `__. - `01 - Lesson: `_ - Importing libraries @@ -123,6 +123,33 @@ There are four sections covering selected topics as follows: - `Time Series `_ +.. _tutorial-exercises-new-users: + +Exercises for New Users +----------------------- +Practice your skills with real data sets and exercises. +For more resources, please visit the main `repository `__. + +- `01 - Getting & Knowing Your Data `_ + +- `02 - Filtering & Sorting `_ + +- `03 - Grouping `_ + +- `04 - Apply `_ + +- `05 - Merge `_ + +- `06 - Stats `_ + +- `07 - Visualization `_ + +- `08 - Creating Series and DataFrames `_ + +- `09 - Time Series `_ + +- `10 - Deleting `_ + .. _tutorial-modern: Modern Pandas @@ -150,3 +177,4 @@ Various Tutorials - `Intro to pandas data structures, by Greg Reda `_ - `Pandas and Python: Top 10, by Manish Amde `_ - `Pandas Tutorial, by Mikhail Semeniuk `_ +- `Pandas DataFrames Tutorial, by Karlijn Willems `_ diff --git a/doc/source/visualization.rst b/doc/source/visualization.rst index e3b186abe53fc..fb799c642131d 100644 --- a/doc/source/visualization.rst +++ b/doc/source/visualization.rst @@ -134,7 +134,7 @@ For example, a bar plot can be created the following way: plt.figure(); @savefig bar_plot_ex.png - df.ix[5].plot(kind='bar'); plt.axhline(0, color='k') + df.iloc[5].plot(kind='bar'); plt.axhline(0, color='k') .. versionadded:: 0.17.0 @@ -152,7 +152,7 @@ You can also create these other plots using the methods ``DataFrame.plot.` In addition to these ``kind`` s, there are the :ref:`DataFrame.hist() `, and :ref:`DataFrame.boxplot() ` methods, which use a separate interface. -Finally, there are several :ref:`plotting functions ` in ``pandas.tools.plotting`` +Finally, there are several :ref:`plotting functions ` in ``pandas.plotting`` that take a :class:`Series` or :class:`DataFrame` as an argument. These include @@ -179,7 +179,7 @@ For labeled, non-time series data, you may wish to produce a bar plot: plt.figure(); @savefig bar_plot_ex.png - df.ix[5].plot.bar(); plt.axhline(0, color='k') + df.iloc[5].plot.bar(); plt.axhline(0, color='k') Calling a DataFrame's :meth:`plot.bar() ` method produces a multiple bar plot: @@ -823,7 +823,7 @@ before plotting. Plotting Tools -------------- -These functions can be imported from ``pandas.tools.plotting`` +These functions can be imported from ``pandas.plotting`` and take a :class:`Series` or :class:`DataFrame` as an argument. .. _visualization.scatter_matrix: @@ -834,7 +834,7 @@ Scatter Matrix Plot .. versionadded:: 0.7.3 You can create a scatter plot matrix using the -``scatter_matrix`` method in ``pandas.tools.plotting``: +``scatter_matrix`` method in ``pandas.plotting``: .. ipython:: python :suppress: @@ -843,7 +843,7 @@ You can create a scatter plot matrix using the .. 
ipython:: python - from pandas.tools.plotting import scatter_matrix + from pandas.plotting import scatter_matrix df = pd.DataFrame(np.random.randn(1000, 4), columns=['a', 'b', 'c', 'd']) @savefig scatter_matrix_kde.png @@ -896,7 +896,7 @@ of the same class will usually be closer together and form larger structures. .. ipython:: python - from pandas.tools.plotting import andrews_curves + from pandas.plotting import andrews_curves data = pd.read_csv('data/iris.data') @@ -918,7 +918,7 @@ represents one data point. Points that tend to cluster will appear closer togeth .. ipython:: python - from pandas.tools.plotting import parallel_coordinates + from pandas.plotting import parallel_coordinates data = pd.read_csv('data/iris.data') @@ -948,7 +948,7 @@ implies that the underlying data are not random. .. ipython:: python - from pandas.tools.plotting import lag_plot + from pandas.plotting import lag_plot plt.figure() @@ -983,7 +983,7 @@ confidence band. .. ipython:: python - from pandas.tools.plotting import autocorrelation_plot + from pandas.plotting import autocorrelation_plot plt.figure() @@ -1016,7 +1016,7 @@ are what constitutes the bootstrap plot. .. ipython:: python - from pandas.tools.plotting import bootstrap_plot + from pandas.plotting import bootstrap_plot data = pd.Series(np.random.rand(1000)) @@ -1048,7 +1048,7 @@ be colored differently. .. ipython:: python - from pandas.tools.plotting import radviz + from pandas.plotting import radviz data = pd.read_csv('data/iris.data') @@ -1228,14 +1228,14 @@ Using the ``x_compat`` parameter, you can suppress this behavior: plt.close('all') If you have more than one plot that needs to be suppressed, the ``use`` method -in ``pandas.plot_params`` can be used in a `with statement`: +in ``pandas.plotting.plot_params`` can be used in a `with statement`: .. ipython:: python plt.figure() @savefig ser_plot_suppress_context.png - with pd.plot_params.use('x_compat', True): + with pd.plotting.plot_params.use('x_compat', True): df.A.plot(color='r') df.B.plot(color='g') df.C.plot(color='b') @@ -1245,6 +1245,18 @@ in ``pandas.plot_params`` can be used in a `with statement`: plt.close('all') +Automatic Date Tick Adjustment +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. versionadded:: 0.20.0 + +``TimedeltaIndex`` now uses the native matplotlib +tick locator methods, it is useful to call the automatic +date tick adjustment from matplotlib for figures whose ticklabels overlap. + +See the :meth:`autofmt_xdate ` method and the +`matplotlib documentation `__ for more. + Subplots ~~~~~~~~ @@ -1438,11 +1450,11 @@ Also, you can pass different :class:`DataFrame` or :class:`Series` for ``table`` plt.close('all') -Finally, there is a helper function ``pandas.tools.plotting.table`` to create a table from :class:`DataFrame` and :class:`Series`, and add it to an ``matplotlib.Axes``. This function can accept keywords which matplotlib table has. +Finally, there is a helper function ``pandas.plotting.table`` to create a table from :class:`DataFrame` and :class:`Series`, and add it to an ``matplotlib.Axes``. This function can accept keywords which matplotlib table has. .. ipython:: python - from pandas.tools.plotting import table + from pandas.plotting import table fig, ax = plt.subplots(1, 1) table(ax, np.round(df.describe(), 2), diff --git a/doc/source/whatsnew.rst b/doc/source/whatsnew.rst index 616e1f5c8efc7..d6fb1c6a8f9cc 100644 --- a/doc/source/whatsnew.rst +++ b/doc/source/whatsnew.rst @@ -18,6 +18,8 @@ What's New These are new features and improvements of note in each release. +.. 
include:: whatsnew/v0.20.0.txt + .. include:: whatsnew/v0.19.2.txt .. include:: whatsnew/v0.19.1.txt diff --git a/doc/source/whatsnew/v0.10.0.txt b/doc/source/whatsnew/v0.10.0.txt index fed3ba3ce3a84..cf5369466308c 100644 --- a/doc/source/whatsnew/v0.10.0.txt +++ b/doc/source/whatsnew/v0.10.0.txt @@ -303,11 +303,10 @@ Updated PyTables Support store.append('wp',wp) # selecting via A QUERY - store.select('wp', - [ Term('major_axis>20000102'), Term('minor_axis', '=', ['A','B']) ]) + store.select('wp', "major_axis>20000102 and minor_axis=['A','B']") # removing data from tables - store.remove('wp', Term('major_axis>20000103')) + store.remove('wp', "major_axis>20000103") store.select('wp') # deleting a store diff --git a/doc/source/whatsnew/v0.10.1.txt b/doc/source/whatsnew/v0.10.1.txt index b61d31932c60d..d5880e44e46c6 100644 --- a/doc/source/whatsnew/v0.10.1.txt +++ b/doc/source/whatsnew/v0.10.1.txt @@ -51,14 +51,14 @@ perform queries on a table, by passing a list to ``data_columns`` df = DataFrame(randn(8, 3), index=date_range('1/1/2000', periods=8), columns=['A', 'B', 'C']) df['string'] = 'foo' - df.ix[4:6,'string'] = np.nan - df.ix[7:9,'string'] = 'bar' + df.loc[df.index[4:6], 'string'] = np.nan + df.loc[df.index[7:9], 'string'] = 'bar' df['string2'] = 'cool' df # on-disk operations store.append('df', df, data_columns = ['B','C','string','string2']) - store.select('df',[ 'B > 0', 'string == foo' ]) + store.select('df', "B>0 and string=='foo'") # this is in-memory version of this type of selection df[(df.B > 0) & (df.string == 'foo')] @@ -78,7 +78,7 @@ You can now store ``datetime64`` in data columns df_mixed = df.copy() df_mixed['datetime64'] = Timestamp('20010102') - df_mixed.ix[3:4,['A','B']] = np.nan + df_mixed.loc[df_mixed.index[3:4], ['A','B']] = np.nan store.append('df_mixed', df_mixed) df_mixed1 = store.select('df_mixed') @@ -110,7 +110,7 @@ columns, this is equivalent to passing a store.select('mi') # the levels are automatically included as data columns - store.select('mi', Term('foo=bar')) + store.select('mi', "foo='bar'") Multi-table creation via ``append_to_multiple`` and selection via ``select_as_multiple`` can create/select from multiple tables and return a @@ -208,4 +208,3 @@ combined result, by using ``where`` on a selector table. See the :ref:`full release notes ` or issue tracker on GitHub for a complete list. - diff --git a/doc/source/whatsnew/v0.11.0.txt b/doc/source/whatsnew/v0.11.0.txt index 50b74fc5af090..ea149595e681f 100644 --- a/doc/source/whatsnew/v0.11.0.txt +++ b/doc/source/whatsnew/v0.11.0.txt @@ -190,7 +190,7 @@ Furthermore ``datetime64[ns]`` columns are created by default, when passed datet df.get_dtype_counts() # use the traditional nan, which is mapped to NaT internally - df.ix[2:4,['A','timestamp']] = np.nan + df.loc[df.index[2:4], ['A','timestamp']] = np.nan df Astype conversion on ``datetime64[ns]`` to ``object``, implicity converts ``NaT`` to ``np.nan`` diff --git a/doc/source/whatsnew/v0.13.0.txt b/doc/source/whatsnew/v0.13.0.txt index 6ecd4b487c798..3347b05a5df37 100644 --- a/doc/source/whatsnew/v0.13.0.txt +++ b/doc/source/whatsnew/v0.13.0.txt @@ -274,7 +274,6 @@ Float64Index API Change .. ipython:: python s[3] - s.ix[3] s.loc[3] The only positional indexing is via ``iloc`` @@ -290,7 +289,6 @@ Float64Index API Change .. ipython:: python s[2:4] - s.ix[2:4] s.loc[2:4] s.iloc[2:4] @@ -359,11 +357,11 @@ HDFStore API Changes .. 
ipython:: python path = 'test.h5' - df = DataFrame(randn(10,2)) + df = pd.DataFrame(np.random.randn(10,2)) df.to_hdf(path,'df_table',format='table') df.to_hdf(path,'df_table2',append=True) df.to_hdf(path,'df_fixed') - with get_store(path) as store: + with pd.HDFStore(path) as store: print(store) .. ipython:: python diff --git a/doc/source/whatsnew/v0.13.1.txt b/doc/source/whatsnew/v0.13.1.txt index 349acf508bbf3..5e5653945fefa 100644 --- a/doc/source/whatsnew/v0.13.1.txt +++ b/doc/source/whatsnew/v0.13.1.txt @@ -36,7 +36,7 @@ Highlights include: .. ipython:: python df = DataFrame(dict(A = np.array(['foo','bar','bah','foo','bar']))) - df.ix[0,'A'] = np.nan + df.loc[0,'A'] = np.nan df Output Formatting Enhancements @@ -121,11 +121,11 @@ API changes .. ipython:: python :okwarning: - + df = DataFrame({'col':['foo', 0, np.nan]}) df2 = DataFrame({'col':[np.nan, 0, 'foo']}, index=[2,1,0]) df.equals(df2) - df.equals(df2.sort()) + df.equals(df2.sort_index()) import pandas.core.common as com com.array_equivalent(np.array([0, np.nan]), np.array([0, np.nan])) diff --git a/doc/source/whatsnew/v0.14.0.txt b/doc/source/whatsnew/v0.14.0.txt index 181cd401c85d6..f1feab4b909dc 100644 --- a/doc/source/whatsnew/v0.14.0.txt +++ b/doc/source/whatsnew/v0.14.0.txt @@ -516,7 +516,7 @@ See also issues (:issue:`6134`, :issue:`4036`, :issue:`3057`, :issue:`2598`, :is names=['lvl0', 'lvl1']) df = DataFrame(np.arange(len(index)*len(columns)).reshape((len(index),len(columns))), index=index, - columns=columns).sortlevel().sortlevel(axis=1) + columns=columns).sort_index().sort_index(axis=1) df Basic multi-index slicing using slices, lists, and labels. @@ -630,9 +630,9 @@ There are prior version deprecations that are taking effect as of 0.14.0. - Remove ``unique`` keyword from :meth:`HDFStore.select_column` (:issue:`3256`) - Remove ``inferTimeRule`` keyword from :func:`Timestamp.offset` (:issue:`391`) - Remove ``name`` keyword from :func:`get_data_yahoo` and - :func:`get_data_google` ( `commit b921d1a `__ ) + :func:`get_data_google` ( `commit b921d1a `__ ) - Remove ``offset`` keyword from :class:`DatetimeIndex` constructor - ( `commit 3136390 `__ ) + ( `commit 3136390 `__ ) - Remove ``time_rule`` from several rolling-moment statistical functions, such as :func:`rolling_sum` (:issue:`1042`) - Removed neg ``-`` boolean operations on numpy arrays in favor of inv ``~``, as this is going to diff --git a/doc/source/whatsnew/v0.15.0.txt b/doc/source/whatsnew/v0.15.0.txt index df1171fb34486..6282f15b6faeb 100644 --- a/doc/source/whatsnew/v0.15.0.txt +++ b/doc/source/whatsnew/v0.15.0.txt @@ -80,7 +80,7 @@ For full docs, see the :ref:`categorical introduction ` and the # Reorder the categories and simultaneously add the missing categories df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"]) df["grade"] - df.sort("grade") + df.sort_values("grade") df.groupby("grade").size() - ``pandas.core.group_agg`` and ``pandas.core.factor_agg`` were removed. 
As an alternative, construct @@ -705,7 +705,7 @@ Other notable API changes: s = Series(np.arange(3,dtype='int64'), index=MultiIndex.from_product([['A'],['foo','bar','baz']], names=['one','two']) - ).sortlevel() + ).sort_index() s try: s.loc[['D']] diff --git a/doc/source/whatsnew/v0.15.2.txt b/doc/source/whatsnew/v0.15.2.txt index 3a62ac38f7260..feba3d6224e65 100644 --- a/doc/source/whatsnew/v0.15.2.txt +++ b/doc/source/whatsnew/v0.15.2.txt @@ -35,7 +35,7 @@ API changes df.loc[(1, 'z')] # lexically sorting - df2 = df.sortlevel() + df2 = df.sort_index() df2 df2.index.lexsort_depth df2.loc[(1,'z')] @@ -163,7 +163,7 @@ Other enhancements: p.all() - Added support for ``utcfromtimestamp()``, ``fromtimestamp()``, and ``combine()`` on `Timestamp` class (:issue:`5351`). -- Added Google Analytics (`pandas.io.ga`) basic documentation (:issue:`8835`). See :ref:`here`. +- Added Google Analytics (`pandas.io.ga`) basic documentation (:issue:`8835`). See `here`__. - ``Timedelta`` arithmetic returns ``NotImplemented`` in unknown cases, allowing extensions by custom classes (:issue:`8813`). - ``Timedelta`` now supports arithemtic with ``numpy.ndarray`` objects of the appropriate dtype (numpy 1.8 or newer only) (:issue:`8884`). - Added ``Timedelta.to_timedelta64()`` method to the public API (:issue:`8884`). diff --git a/doc/source/whatsnew/v0.16.0.txt b/doc/source/whatsnew/v0.16.0.txt index 4255f4839bca0..4d43660960597 100644 --- a/doc/source/whatsnew/v0.16.0.txt +++ b/doc/source/whatsnew/v0.16.0.txt @@ -300,9 +300,14 @@ The behavior of a small sub-set of edge cases for using ``.loc`` have changed (: New Behavior - .. ipython:: python + .. code-block:: python - s.ix[-1.0:2] + In [2]: s.ix[-1.0:2] + Out[2]: + -1 1 + 1 2 + 2 3 + dtype: int64 - Provide a useful exception for indexing with an invalid type for that index when using ``.loc``. For example trying to use ``.loc`` on an index of type ``DatetimeIndex`` or ``PeriodIndex`` or ``TimedeltaIndex``, with an integer (or a float). diff --git a/doc/source/whatsnew/v0.16.1.txt b/doc/source/whatsnew/v0.16.1.txt old mode 100755 new mode 100644 diff --git a/doc/source/whatsnew/v0.17.0.txt b/doc/source/whatsnew/v0.17.0.txt index 9cb299593076d..a3bbaf73c01ca 100644 --- a/doc/source/whatsnew/v0.17.0.txt +++ b/doc/source/whatsnew/v0.17.0.txt @@ -329,7 +329,7 @@ has been changed to make this keyword unnecessary - the change is shown below. Google BigQuery Enhancements ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Added ability to automatically create a table/dataset using the :func:`pandas.io.gbq.to_gbq` function if the destination table/dataset does not exist. (:issue:`8325`, :issue:`11121`). -- Added ability to replace an existing table and schema when calling the :func:`pandas.io.gbq.to_gbq` function via the ``if_exists`` argument. See the :ref:`docs ` for more details (:issue:`8325`). +- Added ability to replace an existing table and schema when calling the :func:`pandas.io.gbq.to_gbq` function via the ``if_exists`` argument. See the `docs `__ for more details (:issue:`8325`). - ``InvalidColumnOrder`` and ``InvalidPageToken`` in the gbq module will raise ``ValueError`` instead of ``IOError``. - The ``generate_bq_schema()`` function is now deprecated and will be removed in a future version (:issue:`11121`) - The gbq module will now support Python 3 (:issue:`11094`). 
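A hedged sketch of the ``if_exists`` argument noted in the gbq bullets above; the dataset, table, and project names are placeholders, and the small `df` is illustrative only:

```python
import pandas as pd
from pandas.io import gbq

df = pd.DataFrame({'a': [1, 2, 3]})

# 'fail' (the default) raises if the destination table already exists;
# 'replace' drops and recreates it; 'append' adds rows to it.
# 'my_dataset.my_table' and 'my-project' are placeholder names.
gbq.to_gbq(df, 'my_dataset.my_table', project_id='my-project',
           if_exists='replace')
```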
diff --git a/doc/source/whatsnew/v0.17.1.txt b/doc/source/whatsnew/v0.17.1.txt old mode 100755 new mode 100644 index c25e0300a1050..1ad7279ea79f7 --- a/doc/source/whatsnew/v0.17.1.txt +++ b/doc/source/whatsnew/v0.17.1.txt @@ -36,7 +36,7 @@ Conditional HTML Formatting We'll be adding features an possibly making breaking changes in future releases. Feedback is welcome_. -.. _welcome: https://github.com/pydata/pandas/issues/11610 +.. _welcome: https://github.com/pandas-dev/pandas/issues/11610 We've added *experimental* support for conditional HTML formatting: the visual styling of a DataFrame based on the data. @@ -58,7 +58,7 @@ We can render the HTML to get the following table. :file: whatsnew_0171_html_table.html :class:`~pandas.core.style.Styler` interacts nicely with the Jupyter Notebook. -See the :ref:`documentation - - - {% if caption %} - - {% endif %} - - - {% for r in head %} - - {% for c in r %} - {% if c.is_visible != False %} - <{{c.type}} class="{{c.class}}" {{ c.attributes|join(" ") }}> - {{c.value}} - {% endif %} - {% endfor %} - - {% endfor %} - - - {% for r in body %} - - {% for c in r %} - {% if c.is_visible != False %} - <{{c.type}} id="T_{{uuid}}{{c.id}}" - class="{{c.class}}" {{ c.attributes|join(" ") }}> - {{ c.display_value }} - {% endif %} - {% endfor %} - - {% endfor %} - -
-        <caption>{{caption}}</caption>
- """) - - def __init__(self, data, precision=None, table_styles=None, uuid=None, - caption=None, table_attributes=None): - self.ctx = defaultdict(list) - self._todo = [] - - if not isinstance(data, (pd.Series, pd.DataFrame)): - raise TypeError("``data`` must be a Series or DataFrame") - if data.ndim == 1: - data = data.to_frame() - if not data.index.is_unique or not data.columns.is_unique: - raise ValueError("style is not supported for non-unique indicies.") - - self.data = data - self.index = data.index - self.columns = data.columns - - self.uuid = uuid - self.table_styles = table_styles - self.caption = caption - if precision is None: - precision = get_option('display.precision') - self.precision = precision - self.table_attributes = table_attributes - # display_funcs maps (row, col) -> formatting function - - def default_display_func(x): - if is_float(x): - return '{:>.{precision}g}'.format(x, precision=self.precision) - else: - return x - - self._display_funcs = defaultdict(lambda: default_display_func) - - def _repr_html_(self): - """Hooks into Jupyter notebook rich display system.""" - return self.render() - - def _translate(self): - """ - Convert the DataFrame in `self.data` and the attrs from `_build_styles` - into a dictionary of {head, body, uuid, cellstyle} - """ - table_styles = self.table_styles or [] - caption = self.caption - ctx = self.ctx - precision = self.precision - uuid = self.uuid or str(uuid1()).replace("-", "_") - ROW_HEADING_CLASS = "row_heading" - COL_HEADING_CLASS = "col_heading" - INDEX_NAME_CLASS = "index_name" - - DATA_CLASS = "data" - BLANK_CLASS = "blank" - BLANK_VALUE = "" - - def format_attr(pair): - return "{key}={value}".format(**pair) - - # for sparsifying a MultiIndex - idx_lengths = _get_level_lengths(self.index) - col_lengths = _get_level_lengths(self.columns) - - cell_context = dict() - - n_rlvls = self.data.index.nlevels - n_clvls = self.data.columns.nlevels - rlabels = self.data.index.tolist() - clabels = self.data.columns.tolist() - - if n_rlvls == 1: - rlabels = [[x] for x in rlabels] - if n_clvls == 1: - clabels = [[x] for x in clabels] - clabels = list(zip(*clabels)) - - cellstyle = [] - head = [] - - for r in range(n_clvls): - # Blank for Index columns... - row_es = [{"type": "th", - "value": BLANK_VALUE, - "display_value": BLANK_VALUE, - "is_visible": True, - "class": " ".join([BLANK_CLASS])}] * (n_rlvls - 1) - - # ... 
except maybe the last for columns.names - name = self.data.columns.names[r] - cs = [BLANK_CLASS if name is None else INDEX_NAME_CLASS, - "level%s" % r] - name = BLANK_VALUE if name is None else name - row_es.append({"type": "th", - "value": name, - "display_value": name, - "class": " ".join(cs), - "is_visible": True}) - - for c in range(len(clabels[0])): - cs = [COL_HEADING_CLASS, "level%s" % r, "col%s" % c] - cs.extend(cell_context.get( - "col_headings", {}).get(r, {}).get(c, [])) - value = clabels[r][c] - row_es.append({"type": "th", - "value": value, - "display_value": value, - "class": " ".join(cs), - "is_visible": _is_visible(c, r, col_lengths), - "attributes": [ - format_attr({"key": "colspan", - "value": col_lengths.get( - (r, c), 1)}) - ]}) - head.append(row_es) - - if self.data.index.names and not all(x is None - for x in self.data.index.names): - index_header_row = [] - - for c, name in enumerate(self.data.index.names): - cs = [INDEX_NAME_CLASS, - "level%s" % c] - name = '' if name is None else name - index_header_row.append({"type": "th", "value": name, - "class": " ".join(cs)}) - - index_header_row.extend( - [{"type": "th", - "value": BLANK_VALUE, - "class": " ".join([BLANK_CLASS]) - }] * len(clabels[0])) - - head.append(index_header_row) - - body = [] - for r, idx in enumerate(self.data.index): - # cs.extend( - # cell_context.get("row_headings", {}).get(r, {}).get(c, [])) - row_es = [{"type": "th", - "is_visible": _is_visible(r, c, idx_lengths), - "attributes": [ - format_attr({"key": "rowspan", - "value": idx_lengths.get((c, r), 1)}) - ], - "value": rlabels[r][c], - "class": " ".join([ROW_HEADING_CLASS, "level%s" % c, - "row%s" % r]), - "display_value": rlabels[r][c]} - for c in range(len(rlabels[r]))] - - for c, col in enumerate(self.data.columns): - cs = [DATA_CLASS, "row%s" % r, "col%s" % c] - cs.extend(cell_context.get("data", {}).get(r, {}).get(c, [])) - formatter = self._display_funcs[(r, c)] - value = self.data.iloc[r, c] - row_es.append({ - "type": "td", - "value": value, - "class": " ".join(cs), - "id": "_".join(cs[1:]), - "display_value": formatter(value) - }) - props = [] - for x in ctx[r, c]: - # have to handle empty styles like [''] - if x.count(":"): - props.append(x.split(":")) - else: - props.append(['', '']) - cellstyle.append({'props': props, - 'selector': "row%s_col%s" % (r, c)}) - body.append(row_es) - - return dict(head=head, cellstyle=cellstyle, body=body, uuid=uuid, - precision=precision, table_styles=table_styles, - caption=caption, table_attributes=self.table_attributes) - - def format(self, formatter, subset=None): - """ - Format the text display value of cells. - - .. versionadded:: 0.18.0 - - Parameters - ---------- - formatter: str, callable, or dict - subset: IndexSlice - An argument to ``DataFrame.loc`` that restricts which elements - ``formatter`` is applied to. - - Returns - ------- - self : Styler - - Notes - ----- - - ``formatter`` is either an ``a`` or a dict ``{column name: a}`` where - ``a`` is one of - - - str: this will be wrapped in: ``a.format(x)`` - - callable: called with the value of an individual cell - - The default display value for numeric values is the "general" (``g``) - format with ``pd.options.display.precision`` precision. 
- - Examples - -------- - - >>> df = pd.DataFrame(np.random.randn(4, 2), columns=['a', 'b']) - >>> df.style.format("{:.2%}") - >>> df['c'] = ['a', 'b', 'c', 'd'] - >>> df.style.format({'C': str.upper}) - """ - if subset is None: - row_locs = range(len(self.data)) - col_locs = range(len(self.data.columns)) - else: - subset = _non_reducing_slice(subset) - if len(subset) == 1: - subset = subset, self.data.columns - - sub_df = self.data.loc[subset] - row_locs = self.data.index.get_indexer_for(sub_df.index) - col_locs = self.data.columns.get_indexer_for(sub_df.columns) - - if isinstance(formatter, MutableMapping): - for col, col_formatter in formatter.items(): - # formatter must be callable, so '{}' are converted to lambdas - col_formatter = _maybe_wrap_formatter(col_formatter) - col_num = self.data.columns.get_indexer_for([col])[0] - - for row_num in row_locs: - self._display_funcs[(row_num, col_num)] = col_formatter - else: - # single scalar to format all cells with - locs = product(*(row_locs, col_locs)) - for i, j in locs: - formatter = _maybe_wrap_formatter(formatter) - self._display_funcs[(i, j)] = formatter - return self - - def render(self): - """ - Render the built up styles to HTML - - .. versionadded:: 0.17.1 - - Returns - ------- - rendered: str - the rendered HTML - - Notes - ----- - ``Styler`` objects have defined the ``_repr_html_`` method - which automatically calls ``self.render()`` when it's the - last item in a Notebook cell. When calling ``Styler.render()`` - directly, wrap the result in ``IPython.display.HTML`` to view - the rendered HTML in the notebook. - """ - self._compute() - d = self._translate() - # filter out empty styles, every cell will have a class - # but the list of props may just be [['', '']]. - # so we have the neested anys below - trimmed = [x for x in d['cellstyle'] - if any(any(y) for y in x['props'])] - d['cellstyle'] = trimmed - return self.template.render(**d) - - def _update_ctx(self, attrs): - """ - update the state of the Styler. Collects a mapping - of {index_label: [': ']} - - attrs: Series or DataFrame - should contain strings of ': ;: ' - Whitespace shouldn't matter and the final trailing ';' shouldn't - matter. - """ - for row_label, v in attrs.iterrows(): - for col_label, col in v.iteritems(): - i = self.index.get_indexer([row_label])[0] - j = self.columns.get_indexer([col_label])[0] - for pair in col.rstrip(";").split(";"): - self.ctx[(i, j)].append(pair) - - def _copy(self, deepcopy=False): - styler = Styler(self.data, precision=self.precision, - caption=self.caption, uuid=self.uuid, - table_styles=self.table_styles) - if deepcopy: - styler.ctx = copy.deepcopy(self.ctx) - styler._todo = copy.deepcopy(self._todo) - else: - styler.ctx = self.ctx - styler._todo = self._todo - return styler - - def __copy__(self): - """ - Deep copy by default. - """ - return self._copy(deepcopy=False) - - def __deepcopy__(self, memo): - return self._copy(deepcopy=True) - - def clear(self): - """"Reset" the styler, removing any previously applied styles. - Returns None. - """ - self.ctx.clear() - self._todo = [] - - def _compute(self): - """ - Execute the style functions built up in `self._todo`. - - Relies on the conventions that all style functions go through - .apply or .applymap. 
The append styles to apply as tuples of - - (application method, *args, **kwargs) - """ - r = self - for func, args, kwargs in self._todo: - r = func(self)(*args, **kwargs) - return r - - def _apply(self, func, axis=0, subset=None, **kwargs): - subset = slice(None) if subset is None else subset - subset = _non_reducing_slice(subset) - data = self.data.loc[subset] - if axis is not None: - result = data.apply(func, axis=axis, **kwargs) - else: - result = func(data, **kwargs) - if not isinstance(result, pd.DataFrame): - raise TypeError( - "Function {!r} must return a DataFrame when " - "passed to `Styler.apply` with axis=None".format(func)) - if not (result.index.equals(data.index) and - result.columns.equals(data.columns)): - msg = ('Result of {!r} must have identical index and columns ' - 'as the input'.format(func)) - raise ValueError(msg) - - result_shape = result.shape - expected_shape = self.data.loc[subset].shape - if result_shape != expected_shape: - msg = ("Function {!r} returned the wrong shape.\n" - "Result has shape: {}\n" - "Expected shape: {}".format(func, - result.shape, - expected_shape)) - raise ValueError(msg) - self._update_ctx(result) - return self - - def apply(self, func, axis=0, subset=None, **kwargs): - """ - Apply a function column-wise, row-wise, or table-wase, - updating the HTML representation with the result. - - .. versionadded:: 0.17.1 - - Parameters - ---------- - func : function - ``func`` should take a Series or DataFrame (depending - on ``axis``), and return an object with the same shape. - Must return a DataFrame with identical index and - column labels when ``axis=None`` - axis : int, str or None - apply to each column (``axis=0`` or ``'index'``) - or to each row (``axis=1`` or ``'columns'``) or - to the entire DataFrame at once with ``axis=None`` - subset : IndexSlice - a valid indexer to limit ``data`` to *before* applying the - function. Consider using a pandas.IndexSlice - kwargs : dict - pass along to ``func`` - - Returns - ------- - self : Styler - - Notes - ----- - The output shape of ``func`` should match the input, i.e. if - ``x`` is the input row, column, or table (depending on ``axis``), - then ``func(x.shape) == x.shape`` should be true. - - This is similar to ``DataFrame.apply``, except that ``axis=None`` - applies the function to the entire DataFrame at once, - rather than column-wise or row-wise. - - Examples - -------- - >>> def highlight_max(x): - ... return ['background-color: yellow' if v == x.max() else '' - for v in x] - ... - >>> df = pd.DataFrame(np.random.randn(5, 2)) - >>> df.style.apply(highlight_max) - """ - self._todo.append((lambda instance: getattr(instance, '_apply'), - (func, axis, subset), kwargs)) - return self - - def _applymap(self, func, subset=None, **kwargs): - func = partial(func, **kwargs) # applymap doesn't take kwargs? - if subset is None: - subset = pd.IndexSlice[:] - subset = _non_reducing_slice(subset) - result = self.data.loc[subset].applymap(func) - self._update_ctx(result) - return self - - def applymap(self, func, subset=None, **kwargs): - """ - Apply a function elementwise, updating the HTML - representation with the result. - - .. versionadded:: 0.17.1 - - Parameters - ---------- - func : function - ``func`` should take a scalar and return a scalar - subset : IndexSlice - a valid indexer to limit ``data`` to *before* applying the - function. 
Consider using a pandas.IndexSlice - kwargs : dict - pass along to ``func`` - - Returns - ------- - self : Styler - - """ - self._todo.append((lambda instance: getattr(instance, '_applymap'), - (func, subset), kwargs)) - return self - - def set_precision(self, precision): - """ - Set the precision used to render. - - .. versionadded:: 0.17.1 - - Parameters - ---------- - precision: int - - Returns - ------- - self : Styler - """ - self.precision = precision - return self - - def set_table_attributes(self, attributes): - """ - Set the table attributes. These are the items - that show up in the opening ```` tag in addition - to to automatic (by default) id. - - .. versionadded:: 0.17.1 - - Parameters - ---------- - precision: int - - Returns - ------- - self : Styler - """ - self.table_attributes = attributes - return self - - def export(self): - """ - Export the styles to applied to the current Styler. - Can be applied to a second style with ``Styler.use``. - - .. versionadded:: 0.17.1 - - Returns - ------- - styles: list - - See Also - -------- - Styler.use - """ - return self._todo - - def use(self, styles): - """ - Set the styles on the current Styler, possibly using styles - from ``Styler.export``. - - .. versionadded:: 0.17.1 - - Parameters - ---------- - styles: list - list of style functions - - Returns - ------- - self : Styler - - See Also - -------- - Styler.export - """ - self._todo.extend(styles) - return self - - def set_uuid(self, uuid): - """ - Set the uuid for a Styler. - - .. versionadded:: 0.17.1 - - Parameters - ---------- - uuid: str - - Returns - ------- - self : Styler - """ - self.uuid = uuid - return self - - def set_caption(self, caption): - """ - Se the caption on a Styler - - .. versionadded:: 0.17.1 - - Parameters - ---------- - caption: str - - Returns - ------- - self : Styler - """ - self.caption = caption - return self - - def set_table_styles(self, table_styles): - """ - Set the table styles on a Styler. These are placed in a - ``""") + if self.notebook: + self.write(template) + def write_result(self, buf): indent = 0 frame = self.frame @@ -1015,6 +1149,7 @@ def write_result(self, buf): self.write(''.format(div_style)) + self.write_style() self.write('
<table border="%s" class="%s">' % (self.border,
                                                       ' '.join(_classes)),
                   indent)
@@ -1076,7 +1211,7 @@ def _column_header():
             sentinel = None
         levels = self.columns.format(sparsify=sentinel, adjoin=False,
                                      names=False)
-        level_lengths = _get_level_lengths(levels, sentinel)
+        level_lengths = get_level_lengths(levels, sentinel)
         inner_lvl = len(level_lengths) - 1
         for lnum, (records, values) in enumerate(zip(level_lengths,
                                                      levels)):
@@ -1182,7 +1317,7 @@ def _write_body(self, indent):
             else:
                 self._write_regular_rows(fmt_values, indent)
         else:
-            for i in range(len(self.frame)):
+            for i in range(min(len(self.frame), self.max_rows)):
                 row = [fmt_values[j][i] for j in range(len(self.columns))]
                 self.write_tr(row, indent, self.indent_delta, tags=None)
@@ -1242,12 +1377,13 @@ def _write_hierarchical_rows(self, fmt_values, indent):
         levels = frame.index.format(sparsify=sentinel, adjoin=False,
                                     names=False)
-        level_lengths = _get_level_lengths(levels, sentinel)
+        level_lengths = get_level_lengths(levels, sentinel)
         inner_lvl = len(level_lengths) - 1
         if truncate_v:
             # Insert ... row and adjust idx_values and
             # level_lengths to take this into account.
             ins_row = self.fmt.tr_row_num
+            inserted = False
             for lnum, records in enumerate(level_lengths):
                 rec_new = {}
                 for tag, span in list(records.items()):
@@ -1255,9 +1391,17 @@ def _write_hierarchical_rows(self, fmt_values, indent):
                         rec_new[tag + 1] = span
                     elif tag + span > ins_row:
                         rec_new[tag] = span + 1
-                        dot_row = list(idx_values[ins_row - 1])
-                        dot_row[-1] = u('...')
-                        idx_values.insert(ins_row, tuple(dot_row))
+
+                        # GH 14882 - Make sure insertion done once
+                        if not inserted:
+                            dot_row = list(idx_values[ins_row - 1])
+                            dot_row[-1] = u('...')
+                            idx_values.insert(ins_row, tuple(dot_row))
+                            inserted = True
+                        else:
+                            dot_row = list(idx_values[ins_row])
+                            dot_row[inner_lvl - lnum] = u('...')
+                            idx_values[ins_row] = tuple(dot_row)
                     else:
                         rec_new[tag] = span
                     # If ins_row lies between tags, all cols idx cols
@@ -1267,6 +1411,12 @@
                     if lnum == 0:
                         idx_values.insert(ins_row, tuple(
                             [u('...')] * len(level_lengths)))
+
+                    # GH 14882 - Place ... in correct level
+                    elif inserted:
+                        dot_row = list(idx_values[ins_row])
+                        dot_row[inner_lvl - lnum] = u('...')
+                        idx_values[ins_row] = tuple(dot_row)
                 level_lengths[lnum] = rec_new
                 level_lengths[inner_lvl][ins_row] = 1
@@ -1310,47 +1460,8 @@ def _write_hierarchical_rows(self, fmt_values, indent):
                           nindex_levels=frame.index.nlevels)

-def _get_level_lengths(levels, sentinel=''):
-    """For each index in each level the function returns lengths of indexes.
-
-    Parameters
-    ----------
-    levels : list of lists
-        List of values on for level.
-    sentinel : string, optional
-        Value which states that no new index starts on there.
-
-    Returns
-    ----------
-    Returns list of maps. For each level returns map of indexes (key is index
-    in row and value is length of index).
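The start-index to span-length maps described above are easiest to see on a concrete input. The following self-contained sketch mirrors the helper's bookkeeping (the standalone `level_lengths` name and the sample input are illustrative only, not part of the patch):

```python
# Standalone copy of the sparsify bookkeeping described above.
def level_lengths(levels, sentinel=''):
    if len(levels) == 0:
        return []
    # control[i] is True while row i may still be a continuation row
    control = [True for _ in levels[0]]
    result = []
    for level in levels:
        last_index = 0
        lengths = {}
        for i, key in enumerate(level):
            if control[i] and key == sentinel:
                pass  # same label as the row above: current span grows
            else:
                control[i] = False
                lengths[last_index] = i - last_index
                last_index = i
        lengths[last_index] = len(level) - last_index
        result.append(lengths)
    return result

# Two sparsified levels; '' marks "same label as the row above".
print(level_lengths([['a', '', '', 'b'],
                     ['x', 'y', '', 'z']]))
# [{0: 3, 3: 1}, {0: 1, 1: 2, 3: 1}]
```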
- """ - if len(levels) == 0: - return [] - - control = [True for x in levels[0]] - - result = [] - for level in levels: - last_index = 0 - - lengths = {} - for i, key in enumerate(level): - if control[i] and key == sentinel: - pass - else: - control[i] = False - lengths[last_index] = i - last_index - last_index = i - - lengths[last_index] = len(level) - last_index - - result.append(lengths) - - return result - - class CSVFormatter(object): + def __init__(self, obj, path_or_buf=None, sep=",", na_rep='', float_format=None, cols=None, header=True, index=True, index_label=None, mode='w', nanRep=None, encoding=None, @@ -1437,10 +1548,7 @@ def __init__(self, obj, path_or_buf=None, sep=",", na_rep='', self.chunksize = int(chunksize) self.data_index = obj.index - if isinstance(obj.index, PeriodIndex): - self.data_index = obj.index.to_timestamp() - - if (isinstance(self.data_index, DatetimeIndex) and + if (isinstance(self.data_index, (DatetimeIndex, PeriodIndex)) and date_format is not None): self.data_index = Index([x.strftime(date_format) if notnull(x) else '' for x in self.data_index]) @@ -1455,9 +1563,9 @@ def save(self): f = self.path_or_buf close = False else: - f = _get_handle(self.path_or_buf, self.mode, - encoding=self.encoding, - compression=self.compression) + f, handles = _get_handle(self.path_or_buf, self.mode, + encoding=self.encoding, + compression=self.compression) close = True try: @@ -1546,7 +1654,7 @@ def _save_header(self): if isinstance(index_label, list) and len(index_label) > 1: col_line.extend([''] * (len(index_label) - 1)) - col_line.extend(columns.get_level_values(i)) + col_line.extend(columns._get_level_values(i)) writer.writerow(col_line) @@ -1601,298 +1709,6 @@ def _save_chunk(self, start_i, end_i): lib.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer) -# from collections import namedtuple -# ExcelCell = namedtuple("ExcelCell", -# 'row, col, val, style, mergestart, mergeend') - - -class ExcelCell(object): - __fields__ = ('row', 'col', 'val', 'style', 'mergestart', 'mergeend') - __slots__ = __fields__ - - def __init__(self, row, col, val, style=None, mergestart=None, - mergeend=None): - self.row = row - self.col = col - self.val = val - self.style = style - self.mergestart = mergestart - self.mergeend = mergeend - - -header_style = {"font": {"bold": True}, - "borders": {"top": "thin", - "right": "thin", - "bottom": "thin", - "left": "thin"}, - "alignment": {"horizontal": "center", - "vertical": "top"}} - - -class ExcelFormatter(object): - """ - Class for formatting a DataFrame to a list of ExcelCells, - - Parameters - ---------- - df : dataframe - na_rep: na representation - float_format : string, default None - Format string for floating point numbers - cols : sequence, optional - Columns to write - header : boolean or list of string, default True - Write out column names. If a list of string is given it is - assumed to be aliases for the column names - index : boolean, default True - output row names (index) - index_label : string or sequence, default None - Column label for index column(s) if desired. If None is given, and - `header` and `index` are True, then the index names are used. A - sequence should be given if the DataFrame uses MultiIndex. - merge_cells : boolean, default False - Format MultiIndex and Hierarchical Rows as merged cells. - inf_rep : string, default `'inf'` - representation for np.inf values (which aren't representable in Excel) - A `'-'` sign will be added in front of -inf. 
- """ - - def __init__(self, df, na_rep='', float_format=None, cols=None, - header=True, index=True, index_label=None, merge_cells=False, - inf_rep='inf'): - self.rowcounter = 0 - self.na_rep = na_rep - self.df = df - if cols is not None: - self.df = df.loc[:, cols] - self.columns = self.df.columns - self.float_format = float_format - self.index = index - self.index_label = index_label - self.header = header - self.merge_cells = merge_cells - self.inf_rep = inf_rep - - def _format_value(self, val): - if lib.checknull(val): - val = self.na_rep - elif is_float(val): - if lib.isposinf_scalar(val): - val = self.inf_rep - elif lib.isneginf_scalar(val): - val = '-%s' % self.inf_rep - elif self.float_format is not None: - val = float(self.float_format % val) - return val - - def _format_header_mi(self): - if self.columns.nlevels > 1: - if not self.index: - raise NotImplementedError("Writing to Excel with MultiIndex" - " columns and no index " - "('index'=False) is not yet " - "implemented.") - - has_aliases = isinstance(self.header, (tuple, list, np.ndarray, Index)) - if not (has_aliases or self.header): - return - - columns = self.columns - level_strs = columns.format(sparsify=self.merge_cells, adjoin=False, - names=False) - level_lengths = _get_level_lengths(level_strs) - coloffset = 0 - lnum = 0 - - if self.index and isinstance(self.df.index, MultiIndex): - coloffset = len(self.df.index[0]) - 1 - - if self.merge_cells: - # Format multi-index as a merged cells. - for lnum in range(len(level_lengths)): - name = columns.names[lnum] - yield ExcelCell(lnum, coloffset, name, header_style) - - for lnum, (spans, levels, labels) in enumerate(zip( - level_lengths, columns.levels, columns.labels)): - values = levels.take(labels) - for i in spans: - if spans[i] > 1: - yield ExcelCell(lnum, coloffset + i + 1, values[i], - header_style, lnum, - coloffset + i + spans[i]) - else: - yield ExcelCell(lnum, coloffset + i + 1, values[i], - header_style) - else: - # Format in legacy format with dots to indicate levels. 
- for i, values in enumerate(zip(*level_strs)): - v = ".".join(map(pprint_thing, values)) - yield ExcelCell(lnum, coloffset + i + 1, v, header_style) - - self.rowcounter = lnum - - def _format_header_regular(self): - has_aliases = isinstance(self.header, (tuple, list, np.ndarray, Index)) - if has_aliases or self.header: - coloffset = 0 - - if self.index: - coloffset = 1 - if isinstance(self.df.index, MultiIndex): - coloffset = len(self.df.index[0]) - - colnames = self.columns - if has_aliases: - if len(self.header) != len(self.columns): - raise ValueError('Writing %d cols but got %d aliases' % - (len(self.columns), len(self.header))) - else: - colnames = self.header - - for colindex, colname in enumerate(colnames): - yield ExcelCell(self.rowcounter, colindex + coloffset, colname, - header_style) - - def _format_header(self): - if isinstance(self.columns, MultiIndex): - gen = self._format_header_mi() - else: - gen = self._format_header_regular() - - gen2 = () - if self.df.index.names: - row = [x if x is not None else '' - for x in self.df.index.names] + [''] * len(self.columns) - if reduce(lambda x, y: x and y, map(lambda x: x != '', row)): - gen2 = (ExcelCell(self.rowcounter, colindex, val, header_style) - for colindex, val in enumerate(row)) - self.rowcounter += 1 - return itertools.chain(gen, gen2) - - def _format_body(self): - - if isinstance(self.df.index, MultiIndex): - return self._format_hierarchical_rows() - else: - return self._format_regular_rows() - - def _format_regular_rows(self): - has_aliases = isinstance(self.header, (tuple, list, np.ndarray, Index)) - if has_aliases or self.header: - self.rowcounter += 1 - - coloffset = 0 - # output index and index_label? - if self.index: - # chek aliases - # if list only take first as this is not a MultiIndex - if (self.index_label and - isinstance(self.index_label, (list, tuple, np.ndarray, - Index))): - index_label = self.index_label[0] - # if string good to go - elif self.index_label and isinstance(self.index_label, str): - index_label = self.index_label - else: - index_label = self.df.index.names[0] - - if isinstance(self.columns, MultiIndex): - self.rowcounter += 1 - - if index_label and self.header is not False: - yield ExcelCell(self.rowcounter - 1, 0, index_label, - header_style) - - # write index_values - index_values = self.df.index - if isinstance(self.df.index, PeriodIndex): - index_values = self.df.index.to_timestamp() - - coloffset = 1 - for idx, idxval in enumerate(index_values): - yield ExcelCell(self.rowcounter + idx, 0, idxval, header_style) - - # Write the body of the frame data series by series. 
- for colidx in range(len(self.columns)): - series = self.df.iloc[:, colidx] - for i, val in enumerate(series): - yield ExcelCell(self.rowcounter + i, colidx + coloffset, val) - - def _format_hierarchical_rows(self): - has_aliases = isinstance(self.header, (tuple, list, np.ndarray, Index)) - if has_aliases or self.header: - self.rowcounter += 1 - - gcolidx = 0 - - if self.index: - index_labels = self.df.index.names - # check for aliases - if (self.index_label and - isinstance(self.index_label, (list, tuple, np.ndarray, - Index))): - index_labels = self.index_label - - # MultiIndex columns require an extra row - # with index names (blank if None) for - # unambigous round-trip, unless not merging, - # in which case the names all go on one row Issue #11328 - if isinstance(self.columns, MultiIndex) and self.merge_cells: - self.rowcounter += 1 - - # if index labels are not empty go ahead and dump - if (any(x is not None for x in index_labels) and - self.header is not False): - - for cidx, name in enumerate(index_labels): - yield ExcelCell(self.rowcounter - 1, cidx, name, - header_style) - - if self.merge_cells: - # Format hierarchical rows as merged cells. - level_strs = self.df.index.format(sparsify=True, adjoin=False, - names=False) - level_lengths = _get_level_lengths(level_strs) - - for spans, levels, labels in zip(level_lengths, - self.df.index.levels, - self.df.index.labels): - - values = levels.take(labels, - allow_fill=levels._can_hold_na, - fill_value=True) - - for i in spans: - if spans[i] > 1: - yield ExcelCell(self.rowcounter + i, gcolidx, - values[i], header_style, - self.rowcounter + i + spans[i] - 1, - gcolidx) - else: - yield ExcelCell(self.rowcounter + i, gcolidx, - values[i], header_style) - gcolidx += 1 - - else: - # Format hierarchical rows with non-merged values. - for indexcolvals in zip(*self.df.index): - for idx, indexcolval in enumerate(indexcolvals): - yield ExcelCell(self.rowcounter + idx, gcolidx, - indexcolval, header_style) - gcolidx += 1 - - # Write the body of the frame data series by series. 
- for colidx in range(len(self.columns)): - series = self.df.iloc[:, colidx] - for i, val in enumerate(series): - yield ExcelCell(self.rowcounter + i, gcolidx + colidx, val) - - def get_formatted_cells(self): - for cell in itertools.chain(self._format_header(), - self._format_body()): - cell.val = self._format_value(cell.val) - yield cell # ---------------------------------------------------------------------- # Array formatters @@ -1903,6 +1719,8 @@ def format_array(values, formatter, float_format=None, na_rep='NaN', if is_categorical_dtype(values): fmt_klass = CategoricalArrayFormatter + elif is_interval_dtype(values): + fmt_klass = IntervalArrayFormatter elif is_float_dtype(values.dtype): fmt_klass = FloatArrayFormatter elif is_period_arraylike(values): @@ -1935,6 +1753,7 @@ def format_array(values, formatter, float_format=None, na_rep='NaN', class GenericArrayFormatter(object): + def __init__(self, values, digits=7, formatter=None, na_rep='NaN', space=12, float_format=None, justify='right', decimal='.', quoting=None, fixed_width=True): @@ -2136,6 +1955,7 @@ def _format_strings(self): class IntArrayFormatter(GenericArrayFormatter): + def _format_strings(self): formatter = self.formatter or (lambda x: '% d' % x) fmt_values = [formatter(x) for x in self.values] @@ -2143,6 +1963,7 @@ def _format_strings(self): class Datetime64Formatter(GenericArrayFormatter): + def __init__(self, values, nat_rep='NaT', date_format=None, **kwargs): super(Datetime64Formatter, self).__init__(values, **kwargs) self.nat_rep = nat_rep @@ -2167,9 +1988,21 @@ def _format_strings(self): return fmt_values.tolist() +class IntervalArrayFormatter(GenericArrayFormatter): + + def __init__(self, values, *args, **kwargs): + GenericArrayFormatter.__init__(self, values, *args, **kwargs) + + def _format_strings(self): + formatter = self.formatter or str + fmt_values = np.array([formatter(x) for x in self.values]) + return fmt_values + + class PeriodArrayFormatter(IntArrayFormatter): + def _format_strings(self): - from pandas.tseries.period import IncompatibleFrequency + from pandas.core.indexes.period import IncompatibleFrequency try: values = PeriodIndex(self.values).to_native_types() except IncompatibleFrequency: @@ -2182,6 +2015,7 @@ def _format_strings(self): class CategoricalArrayFormatter(GenericArrayFormatter): + def __init__(self, values, *args, **kwargs): GenericArrayFormatter.__init__(self, values, *args, **kwargs) @@ -2313,6 +2147,7 @@ def _get_format_datetime64_from_values(values, date_format): class Datetime64TZFormatter(Datetime64Formatter): + def _format_strings(self): """ we by definition have a TZ """ @@ -2327,6 +2162,7 @@ def _format_strings(self): class Timedelta64Formatter(GenericArrayFormatter): + def __init__(self, values, nat_rep='NaT', box=False, **kwargs): super(Timedelta64Formatter, self).__init__(values, **kwargs) self.nat_rep = nat_rep @@ -2452,82 +2288,6 @@ def _has_names(index): else: return index.name is not None -# ----------------------------------------------------------------------------- -# Global formatting options - -_initial_defencoding = None - - -def detect_console_encoding(): - """ - Try to find the most capable encoding supported by the console. - slighly modified from the way IPython handles the same issue. 
- """ - import locale - global _initial_defencoding - - encoding = None - try: - encoding = sys.stdout.encoding or sys.stdin.encoding - except AttributeError: - pass - - # try again for something better - if not encoding or 'ascii' in encoding.lower(): - try: - encoding = locale.getpreferredencoding() - except Exception: - pass - - # when all else fails. this will usually be "ascii" - if not encoding or 'ascii' in encoding.lower(): - encoding = sys.getdefaultencoding() - - # GH3360, save the reported defencoding at import time - # MPL backends may change it. Make available for debugging. - if not _initial_defencoding: - _initial_defencoding = sys.getdefaultencoding() - - return encoding - - -def get_console_size(): - """Return console size as tuple = (width, height). - - Returns (None,None) in non-interactive session. - """ - display_width = get_option('display.width') - # deprecated. - display_height = get_option('display.height', silent=True) - - # Consider - # interactive shell terminal, can detect term size - # interactive non-shell terminal (ipnb/ipqtconsole), cannot detect term - # size non-interactive script, should disregard term size - - # in addition - # width,height have default values, but setting to 'None' signals - # should use Auto-Detection, But only in interactive shell-terminal. - # Simple. yeah. - - if com.in_interactive_session(): - if com.in_ipython_frontend(): - # sane defaults for interactive non-shell terminal - # match default for width,height in config_init - from pandas.core.config import get_default_val - terminal_width = get_default_val('display.width') - terminal_height = get_default_val('display.height') - else: - # pure terminal - terminal_width, terminal_height = get_terminal_size() - else: - terminal_width, terminal_height = None, None - - # Note if the User sets width/Height to None (auto-detection) - # and we're in a script (non-inter), this will return (None,None) - # caller needs to deal. 
- return (display_width or terminal_width, display_height or terminal_height) - class EngFormatter(object): """ @@ -2664,17 +2424,3 @@ def _binify(cols, line_width): bins.append(len(cols)) return bins - - -if __name__ == '__main__': - arr = np.array([746.03, 0.00, 5620.00, 1592.36]) - # arr = np.array([11111111.1, 1.55]) - # arr = [314200.0034, 1.4125678] - arr = np.array( - [327763.3119, 345040.9076, 364460.9915, 398226.8688, 383800.5172, - 433442.9262, 539415.0568, 568590.4108, 599502.4276, 620921.8593, - 620898.5294, 552427.1093, 555221.2193, 519639.7059, 388175.7, - 379199.5854, 614898.25, 504833.3333, 560600., 941214.2857, 1134250., - 1219550., 855736.85, 1042615.4286, 722621.3043, 698167.1818, 803750.]) - fmt = FloatArrayFormatter(arr, digits=7) - print(fmt.get_result()) diff --git a/pandas/formats/printing.py b/pandas/io/formats/printing.py similarity index 88% rename from pandas/formats/printing.py rename to pandas/io/formats/printing.py index 37bd4b63d6f7a..cbad603630bd3 100644 --- a/pandas/formats/printing.py +++ b/pandas/io/formats/printing.py @@ -2,7 +2,8 @@ printing tools """ -from pandas.types.inference import is_sequence +import sys +from pandas.core.dtypes.inference import is_sequence from pandas import compat from pandas.compat import u from pandas.core.config import get_option @@ -233,3 +234,34 @@ def as_escaped_unicode(thing, escape_chars=escape_chars): def pprint_thing_encoded(object, encoding='utf-8', errors='replace', **kwds): value = pprint_thing(object) # get unicode representation of object return value.encode(encoding, errors, **kwds) + + +def _enable_data_resource_formatter(enable): + if 'IPython' not in sys.modules: + # definitely not in IPython + return + from IPython import get_ipython + ip = get_ipython() + if ip is None: + # still not in IPython + return + + formatters = ip.display_formatter.formatters + mimetype = "application/vnd.dataresource+json" + + if enable: + if mimetype not in formatters: + # define tableschema formatter + from IPython.core.formatters import BaseFormatter + + class TableSchemaFormatter(BaseFormatter): + print_method = '_repr_data_resource_' + _return_type = (dict,) + # register it: + formatters[mimetype] = TableSchemaFormatter() + # enable it if it's been disabled: + formatters[mimetype].enabled = True + else: + # unregister tableschema mime-type + if mimetype in formatters: + formatters[mimetype].enabled = False diff --git a/pandas/io/formats/style.py b/pandas/io/formats/style.py new file mode 100644 index 0000000000000..eac82ddde2318 --- /dev/null +++ b/pandas/io/formats/style.py @@ -0,0 +1,1185 @@ +""" +Module for applying conditional formatting to +DataFrames and Series. +""" +from functools import partial +from itertools import product +from contextlib import contextmanager +from uuid import uuid1 +import copy +from collections import defaultdict, MutableMapping + +try: + from jinja2 import ( + PackageLoader, Environment, ChoiceLoader, FileSystemLoader + ) +except ImportError: + msg = "pandas.Styler requires jinja2. 
"\ + "Please install with `conda install Jinja2`\n"\ + "or `pip install Jinja2`" + raise ImportError(msg) + +from pandas.core.dtypes.common import is_float, is_string_like + +import numpy as np +import pandas as pd +from pandas.api.types import is_list_like +from pandas.compat import range +from pandas.core.config import get_option +from pandas.core.generic import _shared_docs +import pandas.core.common as com +from pandas.core.indexing import _maybe_numeric_slice, _non_reducing_slice +from pandas.util._decorators import Appender +try: + import matplotlib.pyplot as plt + from matplotlib import colors + has_mpl = True +except ImportError: + has_mpl = False + no_mpl_message = "{0} requires matplotlib." + + +@contextmanager +def _mpl(func): + if has_mpl: + yield plt, colors + else: + raise ImportError(no_mpl_message.format(func.__name__)) + + +class Styler(object): + """ + Helps style a DataFrame or Series according to the + data with HTML and CSS. + + .. versionadded:: 0.17.1 + + .. warning:: + This is a new feature and is under active development. + We'll be adding features and possibly making breaking changes in future + releases. + + Parameters + ---------- + data: Series or DataFrame + precision: int + precision to round floats to, defaults to pd.options.display.precision + table_styles: list-like, default None + list of {selector: (attr, value)} dicts; see Notes + uuid: str, default None + a unique identifier to avoid CSS collisons; generated automatically + caption: str, default None + caption to attach to the table + + Attributes + ---------- + env : Jinja2 Environment + template : Jinja2 Template + loader : Jinja2 Loader + + Notes + ----- + Most styling will be done by passing style functions into + ``Styler.apply`` or ``Styler.applymap``. Style functions should + return values with strings containing CSS ``'attr: value'`` that will + be applied to the indicated cells. + + If using in the Jupyter notebook, Styler has defined a ``_repr_html_`` + to automatically render itself. Otherwise call Styler.render to get + the genterated HTML. 
+
+    CSS classes are attached to the generated HTML
+
+    * Index and Column names include ``index_name`` and ``level<k>``
+      where `k` is its level in a MultiIndex
+    * Index label cells include
+
+      * ``row_heading``
+      * ``row<n>`` where `n` is the numeric position of the row
+      * ``level<k>`` where `k` is the level in a MultiIndex
+
+    * Column label cells include
+      * ``col_heading``
+      * ``col<n>`` where `n` is the numeric position of the column
+      * ``level<k>`` where `k` is the level in a MultiIndex
+
+    * Blank cells include ``blank``
+    * Data cells include ``data``
+
+    See Also
+    --------
+    pandas.DataFrame.style
+    """
+    loader = PackageLoader("pandas", "io/formats/templates")
+    env = Environment(
+        loader=loader,
+        trim_blocks=True,
+    )
+    template = env.get_template("html.tpl")
+
+    def __init__(self, data, precision=None, table_styles=None, uuid=None,
+                 caption=None, table_attributes=None):
+        self.ctx = defaultdict(list)
+        self._todo = []
+
+        if not isinstance(data, (pd.Series, pd.DataFrame)):
+            raise TypeError("``data`` must be a Series or DataFrame")
+        if data.ndim == 1:
+            data = data.to_frame()
+        if not data.index.is_unique or not data.columns.is_unique:
+            raise ValueError("style is not supported for non-unique indices.")
+
+        self.data = data
+        self.index = data.index
+        self.columns = data.columns
+
+        self.uuid = uuid
+        self.table_styles = table_styles
+        self.caption = caption
+        if precision is None:
+            precision = get_option('display.precision')
+        self.precision = precision
+        self.table_attributes = table_attributes
+        # display_funcs maps (row, col) -> formatting function
+
+        def default_display_func(x):
+            if is_float(x):
+                return '{:>.{precision}g}'.format(x, precision=self.precision)
+            else:
+                return x
+
+        self._display_funcs = defaultdict(lambda: default_display_func)
+
+    def _repr_html_(self):
+        """Hooks into Jupyter notebook rich display system."""
+        return self.render()
+
+    @Appender(_shared_docs['to_excel'] % dict(
+        axes='index, columns', klass='Styler',
+        axes_single_arg="{0 or 'index', 1 or 'columns'}",
+        optional_by="""
+            by : str or list of str
+                Name or list of names which refer to the axis items.""",
+        versionadded_to_excel='\n    ..
versionadded:: 0.20')) + def to_excel(self, excel_writer, sheet_name='Sheet1', na_rep='', + float_format=None, columns=None, header=True, index=True, + index_label=None, startrow=0, startcol=0, engine=None, + merge_cells=True, encoding=None, inf_rep='inf', verbose=True, + freeze_panes=None): + + from pandas.io.formats.excel import ExcelFormatter + formatter = ExcelFormatter(self, na_rep=na_rep, cols=columns, + header=header, + float_format=float_format, index=index, + index_label=index_label, + merge_cells=merge_cells, + inf_rep=inf_rep) + formatter.write(excel_writer, sheet_name=sheet_name, startrow=startrow, + startcol=startcol, freeze_panes=freeze_panes, + engine=engine) + + def _translate(self): + """ + Convert the DataFrame in `self.data` and the attrs from `_build_styles` + into a dictionary of {head, body, uuid, cellstyle} + """ + table_styles = self.table_styles or [] + caption = self.caption + ctx = self.ctx + precision = self.precision + uuid = self.uuid or str(uuid1()).replace("-", "_") + ROW_HEADING_CLASS = "row_heading" + COL_HEADING_CLASS = "col_heading" + INDEX_NAME_CLASS = "index_name" + + DATA_CLASS = "data" + BLANK_CLASS = "blank" + BLANK_VALUE = "" + + def format_attr(pair): + return "{key}={value}".format(**pair) + + # for sparsifying a MultiIndex + idx_lengths = _get_level_lengths(self.index) + col_lengths = _get_level_lengths(self.columns) + + cell_context = dict() + + n_rlvls = self.data.index.nlevels + n_clvls = self.data.columns.nlevels + rlabels = self.data.index.tolist() + clabels = self.data.columns.tolist() + + if n_rlvls == 1: + rlabels = [[x] for x in rlabels] + if n_clvls == 1: + clabels = [[x] for x in clabels] + clabels = list(zip(*clabels)) + + cellstyle = [] + head = [] + + for r in range(n_clvls): + # Blank for Index columns... + row_es = [{"type": "th", + "value": BLANK_VALUE, + "display_value": BLANK_VALUE, + "is_visible": True, + "class": " ".join([BLANK_CLASS])}] * (n_rlvls - 1) + + # ... 
except maybe the last for columns.names + name = self.data.columns.names[r] + cs = [BLANK_CLASS if name is None else INDEX_NAME_CLASS, + "level%s" % r] + name = BLANK_VALUE if name is None else name + row_es.append({"type": "th", + "value": name, + "display_value": name, + "class": " ".join(cs), + "is_visible": True}) + + for c, value in enumerate(clabels[r]): + cs = [COL_HEADING_CLASS, "level%s" % r, "col%s" % c] + cs.extend(cell_context.get( + "col_headings", {}).get(r, {}).get(c, [])) + es = { + "type": "th", + "value": value, + "display_value": value, + "class": " ".join(cs), + "is_visible": _is_visible(c, r, col_lengths), + } + colspan = col_lengths.get((r, c), 0) + if colspan > 1: + es["attributes"] = [ + format_attr({"key": "colspan", "value": colspan}) + ] + row_es.append(es) + head.append(row_es) + + if self.data.index.names and not all(x is None + for x in self.data.index.names): + index_header_row = [] + + for c, name in enumerate(self.data.index.names): + cs = [INDEX_NAME_CLASS, + "level%s" % c] + name = '' if name is None else name + index_header_row.append({"type": "th", "value": name, + "class": " ".join(cs)}) + + index_header_row.extend( + [{"type": "th", + "value": BLANK_VALUE, + "class": " ".join([BLANK_CLASS]) + }] * len(clabels[0])) + + head.append(index_header_row) + + body = [] + for r, idx in enumerate(self.data.index): + row_es = [] + for c, value in enumerate(rlabels[r]): + es = { + "type": "th", + "is_visible": _is_visible(r, c, idx_lengths), + "value": value, + "display_value": value, + "class": " ".join([ROW_HEADING_CLASS, "level%s" % c, + "row%s" % r]), + } + rowspan = idx_lengths.get((c, r), 0) + if rowspan > 1: + es["attributes"] = [ + format_attr({"key": "rowspan", "value": rowspan}) + ] + row_es.append(es) + + for c, col in enumerate(self.data.columns): + cs = [DATA_CLASS, "row%s" % r, "col%s" % c] + cs.extend(cell_context.get("data", {}).get(r, {}).get(c, [])) + formatter = self._display_funcs[(r, c)] + value = self.data.iloc[r, c] + row_es.append({ + "type": "td", + "value": value, + "class": " ".join(cs), + "id": "_".join(cs[1:]), + "display_value": formatter(value) + }) + props = [] + for x in ctx[r, c]: + # have to handle empty styles like [''] + if x.count(":"): + props.append(x.split(":")) + else: + props.append(['', '']) + cellstyle.append({'props': props, + 'selector': "row%s_col%s" % (r, c)}) + body.append(row_es) + + return dict(head=head, cellstyle=cellstyle, body=body, uuid=uuid, + precision=precision, table_styles=table_styles, + caption=caption, table_attributes=self.table_attributes) + + def format(self, formatter, subset=None): + """ + Format the text display value of cells. + + .. versionadded:: 0.18.0 + + Parameters + ---------- + formatter: str, callable, or dict + subset: IndexSlice + An argument to ``DataFrame.loc`` that restricts which elements + ``formatter`` is applied to. + + Returns + ------- + self : Styler + + Notes + ----- + + ``formatter`` is either an ``a`` or a dict ``{column name: a}`` where + ``a`` is one of + + - str: this will be wrapped in: ``a.format(x)`` + - callable: called with the value of an individual cell + + The default display value for numeric values is the "general" (``g``) + format with ``pd.options.display.precision`` precision. 
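A short runnable sketch of the formatter specification just described (the frame and column names are arbitrary; string and callable formatters can be mixed in one dict):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2), columns=['a', 'b'])
df['c'] = ['w', 'x', 'y', 'z']

# A string is wrapped as '{:.2%}'.format(x); a callable receives the raw
# cell value. Columns not listed keep the default 'g' formatting.
html = df.style.format({'a': '{:.2%}', 'c': str.upper}).render()
```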
+
+        Examples
+        --------
+
+        >>> df = pd.DataFrame(np.random.randn(4, 2), columns=['a', 'b'])
+        >>> df.style.format("{:.2%}")
+        >>> df['c'] = ['a', 'b', 'c', 'd']
+        >>> df.style.format({'c': str.upper})
+        """
+        if subset is None:
+            row_locs = range(len(self.data))
+            col_locs = range(len(self.data.columns))
+        else:
+            subset = _non_reducing_slice(subset)
+            if len(subset) == 1:
+                subset = subset, self.data.columns
+
+            sub_df = self.data.loc[subset]
+            row_locs = self.data.index.get_indexer_for(sub_df.index)
+            col_locs = self.data.columns.get_indexer_for(sub_df.columns)
+
+        if isinstance(formatter, MutableMapping):
+            for col, col_formatter in formatter.items():
+                # formatter must be callable, so '{}' are converted to lambdas
+                col_formatter = _maybe_wrap_formatter(col_formatter)
+                col_num = self.data.columns.get_indexer_for([col])[0]
+
+                for row_num in row_locs:
+                    self._display_funcs[(row_num, col_num)] = col_formatter
+        else:
+            # single scalar to format all cells with
+            locs = product(*(row_locs, col_locs))
+            for i, j in locs:
+                formatter = _maybe_wrap_formatter(formatter)
+                self._display_funcs[(i, j)] = formatter
+        return self
+
+    def render(self, **kwargs):
+        r"""
+        Render the built up styles to HTML
+
+        .. versionadded:: 0.17.1
+
+        Parameters
+        ----------
+        **kwargs:
+            Any additional keyword arguments are passed through
+            to ``self.template.render``. This is useful when you
+            need to provide additional variables for a custom
+            template.
+
+            .. versionadded:: 0.20
+
+        Returns
+        -------
+        rendered: str
+            the rendered HTML
+
+        Notes
+        -----
+        ``Styler`` objects have defined the ``_repr_html_`` method
+        which automatically calls ``self.render()`` when it's the
+        last item in a Notebook cell. When calling ``Styler.render()``
+        directly, wrap the result in ``IPython.display.HTML`` to view
+        the rendered HTML in the notebook.
+
+        Pandas uses the following keys in render. Arguments passed
+        in ``**kwargs`` take precedence, so think carefully if you want
+        to override them:
+
+        * head
+        * cellstyle
+        * body
+        * uuid
+        * precision
+        * table_styles
+        * caption
+        * table_attributes
+        """
+        self._compute()
+        # TODO: namespace all the pandas keys
+        d = self._translate()
+        # filter out empty styles, every cell will have a class
+        # but the list of props may just be [['', '']].
+        # so we have the nested anys below
+        trimmed = [x for x in d['cellstyle']
+                   if any(any(y) for y in x['props'])]
+        d['cellstyle'] = trimmed
+        d.update(kwargs)
+        return self.template.render(**d)
+
+    def _update_ctx(self, attrs):
+        """
+        update the state of the Styler. Collects a mapping
+        of {index_label: ['<property>: <value>']}
+
+        attrs: Series or DataFrame
+            should contain strings of '<property>: <value>;<prop2>: <val2>'
+            Whitespace shouldn't matter and the final trailing ';' shouldn't
+            matter.
+        """
+        for row_label, v in attrs.iterrows():
+            for col_label, col in v.iteritems():
+                i = self.index.get_indexer([row_label])[0]
+                j = self.columns.get_indexer([col_label])[0]
+                for pair in col.rstrip(";").split(";"):
+                    self.ctx[(i, j)].append(pair)
+
+    def _copy(self, deepcopy=False):
+        styler = Styler(self.data, precision=self.precision,
+                        caption=self.caption, uuid=self.uuid,
+                        table_styles=self.table_styles)
+        if deepcopy:
+            styler.ctx = copy.deepcopy(self.ctx)
+            styler._todo = copy.deepcopy(self._todo)
+        else:
+            styler.ctx = self.ctx
+            styler._todo = self._todo
+        return styler
+
+    def __copy__(self):
+        """
+        Shallow copy by default.
+        """
+        return self._copy(deepcopy=False)
+
+    def __deepcopy__(self, memo):
+        return self._copy(deepcopy=True)
+
+    def clear(self):
+        """"Reset" the styler, removing any previously applied styles.
+        Returns None.
+        """
+        self.ctx.clear()
+        self._todo = []
+
+    def _compute(self):
+        """
+        Execute the style functions built up in `self._todo`.
+
+        Relies on the conventions that all style functions go through
+        .apply or .applymap. These append the styles to apply as tuples of
+
+        (application method, *args, **kwargs)
+        """
+        r = self
+        for func, args, kwargs in self._todo:
+            r = func(self)(*args, **kwargs)
+        return r
+
+    def _apply(self, func, axis=0, subset=None, **kwargs):
+        subset = slice(None) if subset is None else subset
+        subset = _non_reducing_slice(subset)
+        data = self.data.loc[subset]
+        if axis is not None:
+            result = data.apply(func, axis=axis, **kwargs)
+        else:
+            result = func(data, **kwargs)
+            if not isinstance(result, pd.DataFrame):
+                raise TypeError(
+                    "Function {!r} must return a DataFrame when "
+                    "passed to `Styler.apply` with axis=None".format(func))
+            if not (result.index.equals(data.index) and
+                    result.columns.equals(data.columns)):
+                msg = ('Result of {!r} must have identical index and columns '
+                       'as the input'.format(func))
+                raise ValueError(msg)
+
+        result_shape = result.shape
+        expected_shape = self.data.loc[subset].shape
+        if result_shape != expected_shape:
+            msg = ("Function {!r} returned the wrong shape.\n"
+                   "Result has shape: {}\n"
+                   "Expected shape: {}".format(func,
+                                               result.shape,
+                                               expected_shape))
+            raise ValueError(msg)
+        self._update_ctx(result)
+        return self
+
+    def apply(self, func, axis=0, subset=None, **kwargs):
+        """
+        Apply a function column-wise, row-wise, or table-wise,
+        updating the HTML representation with the result.
+
+        .. versionadded:: 0.17.1
+
+        Parameters
+        ----------
+        func : function
+            ``func`` should take a Series or DataFrame (depending
+            on ``axis``), and return an object with the same shape.
+            Must return a DataFrame with identical index and
+            column labels when ``axis=None``
+        axis : int, str or None
+            apply to each column (``axis=0`` or ``'index'``)
+            or to each row (``axis=1`` or ``'columns'``) or
+            to the entire DataFrame at once with ``axis=None``
+        subset : IndexSlice
+            a valid indexer to limit ``data`` to *before* applying the
+            function. Consider using a pandas.IndexSlice
+        kwargs : dict
+            pass along to ``func``
+
+        Returns
+        -------
+        self : Styler
+
+        Notes
+        -----
+        The output shape of ``func`` should match the input, i.e. if
+        ``x`` is the input row, column, or table (depending on ``axis``),
+        then ``func(x).shape == x.shape`` should be true.
+
+        This is similar to ``DataFrame.apply``, except that ``axis=None``
+        applies the function to the entire DataFrame at once,
+        rather than column-wise or row-wise.
+
+        Examples
+        --------
+        >>> def highlight_max(x):
+        ...     return ['background-color: yellow' if v == x.max() else ''
+        ...             for v in x]
+        ...
+        >>> df = pd.DataFrame(np.random.randn(5, 2))
+        >>> df.style.apply(highlight_max)
+        """
+        self._todo.append((lambda instance: getattr(instance, '_apply'),
+                           (func, axis, subset), kwargs))
+        return self
+
+    def _applymap(self, func, subset=None, **kwargs):
+        func = partial(func, **kwargs)  # applymap doesn't take kwargs?
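+        # (``DataFrame.applymap`` has no ``**kwargs`` pass-through of its
+        # own, so keyword arguments are bound to ``func`` up front.)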
+ if subset is None: + subset = pd.IndexSlice[:] + subset = _non_reducing_slice(subset) + result = self.data.loc[subset].applymap(func) + self._update_ctx(result) + return self + + def applymap(self, func, subset=None, **kwargs): + """ + Apply a function elementwise, updating the HTML + representation with the result. + + .. versionadded:: 0.17.1 + + Parameters + ---------- + func : function + ``func`` should take a scalar and return a scalar + subset : IndexSlice + a valid indexer to limit ``data`` to *before* applying the + function. Consider using a pandas.IndexSlice + kwargs : dict + pass along to ``func`` + + Returns + ------- + self : Styler + + """ + self._todo.append((lambda instance: getattr(instance, '_applymap'), + (func, subset), kwargs)) + return self + + def set_precision(self, precision): + """ + Set the precision used to render. + + .. versionadded:: 0.17.1 + + Parameters + ---------- + precision: int + + Returns + ------- + self : Styler + """ + self.precision = precision + return self + + def set_table_attributes(self, attributes): + """ + Set the table attributes. These are the items + that show up in the opening ``
<table>`` tag in addition
+        to the automatic (by default) id.
+
+        .. versionadded:: 0.17.1
+
+        Parameters
+        ----------
+        attributes : string
+
+        Returns
+        -------
+        self : Styler
+
+        Examples
+        --------
+        >>> df = pd.DataFrame(np.random.randn(10, 4))
+        >>> df.style.set_table_attributes('class="pure-table"')
+        # ... <table class="pure-table"> ...
+        """
+        self.table_attributes = attributes
+        return self
+
+    def export(self):
+        """
+        Export the styles applied to the current Styler.
+        Can be applied to a second Styler with ``Styler.use``.
+
+        .. versionadded:: 0.17.1
+
+        Returns
+        -------
+        styles: list
+
+        See Also
+        --------
+        Styler.use
+        """
+        return self._todo
+
+    def use(self, styles):
+        """
+        Set the styles on the current Styler, possibly using styles
+        from ``Styler.export``.
+
+        .. versionadded:: 0.17.1
+
+        Parameters
+        ----------
+        styles: list
+            list of style functions
+
+        Returns
+        -------
+        self : Styler
+
+        See Also
+        --------
+        Styler.export
+        """
+        self._todo.extend(styles)
+        return self
+
+    def set_uuid(self, uuid):
+        """
+        Set the uuid for a Styler.
+
+        .. versionadded:: 0.17.1
+
+        Parameters
+        ----------
+        uuid: str
+
+        Returns
+        -------
+        self : Styler
+        """
+        self.uuid = uuid
+        return self
+
+    def set_caption(self, caption):
+        """
+        Set the caption on a Styler.
+
+        .. versionadded:: 0.17.1
+
+        Parameters
+        ----------
+        caption: str
+
+        Returns
+        -------
+        self : Styler
+        """
+        self.caption = caption
+        return self
+
+    def set_table_styles(self, table_styles):
+        """
+        Set the table styles on a Styler. These are placed in a
+        ``<style>`` tag before the generated HTML table.
+{%- endblock style %}
+{%- block before_table %}{% endblock before_table %}
+{%- block table %}
+<table id="T_{{uuid}}" {{ table_attributes }}>
+{%- block caption %}
+{%- if caption -%}
+    <caption>{{caption}}</caption>
+{%- endif -%}
+{%- endblock caption %}
+{%- block thead %}
+<thead>
+    {%- block before_head_rows %}{% endblock %}
+    {%- for r in head %}
+    {%- block head_tr scoped %}
+    <tr>
+        {%- for c in r %}
+        {%- if c.is_visible != False %}
+        <{{ c.type }} class="{{c.class}}" {{ c.attributes|join(" ") }}>{{c.value}}
+        {%- endif %}
+        {%- endfor %}
+    </tr>
+    {%- endblock head_tr %}
+    {%- endfor %}
+    {%- block after_head_rows %}{% endblock %}
+</thead>
+{%- endblock thead %}
+{%- block tbody %}
+<tbody>
+    {%- block before_rows %}{%- endblock before_rows %}
+    {%- for r in body %}
+    {%- block tr scoped %}
+    <tr>
+        {%- for c in r %}
+        {%- if c.is_visible != False %}
+        <{{ c.type }} id="T_{{ uuid }}{{ c.id }}" class="{{ c.class }}" {{ c.attributes|join(" ") }}>{{ c.display_value }}
+        {%- endif %}
+        {%- endfor %}
+    </tr>
+    {%- endblock tr %}
+    {%- endfor %}
+    {%- block after_rows %}{%- endblock after_rows %}
+</tbody>
+{%- endblock tbody %}
+</table>
+{%- endblock table %} +{%- block after_table %}{% endblock after_table %} diff --git a/pandas/util/terminal.py b/pandas/io/formats/terminal.py similarity index 99% rename from pandas/util/terminal.py rename to pandas/io/formats/terminal.py index 6b8428ff75806..dadd09ae74ea4 100644 --- a/pandas/util/terminal.py +++ b/pandas/io/formats/terminal.py @@ -115,6 +115,7 @@ def ioctl_GWINSZ(fd): return None return int(cr[1]), int(cr[0]) + if __name__ == "__main__": sizex, sizey = get_terminal_size() print('width = %s height = %s' % (sizex, sizey)) diff --git a/pandas/io/ga.py b/pandas/io/ga.py deleted file mode 100644 index 45424e78ddbe7..0000000000000 --- a/pandas/io/ga.py +++ /dev/null @@ -1,463 +0,0 @@ -""" -1. Goto https://console.developers.google.com/iam-admin/projects -2. Create new project -3. Goto APIs and register for OAuth2.0 for installed applications -4. Download JSON secret file and move into same directory as this file -""" -# flake8: noqa - -from datetime import datetime -import re -from pandas import compat -import numpy as np -from pandas import DataFrame -import pandas as pd -import pandas.io.parsers as psr -import pandas.lib as lib -from pandas.io.date_converters import generic_parser -import pandas.io.auth as auth -from pandas.util.decorators import Appender, Substitution - -from apiclient.errors import HttpError -from oauth2client.client import AccessTokenRefreshError -from pandas.compat import zip, u - -# GH11038 -import warnings -warnings.warn("The pandas.io.ga module is deprecated and will be " - "removed in a future version.", - FutureWarning, stacklevel=2) - -TYPE_MAP = {u('INTEGER'): int, u('FLOAT'): float, u('TIME'): int} - -NO_CALLBACK = auth.OOB_CALLBACK_URN -DOC_URL = auth.DOC_URL - -_QUERY_PARAMS = """metrics : list of str - Un-prefixed metric names (e.g., 'visitors' and not 'ga:visitors') -dimensions : list of str - Un-prefixed dimension variable names -start_date : str/date/datetime -end_date : str/date/datetime, optional, default is None but internally set as today -segment : list of str, optional, default: None -filters : list of str, optional, default: None -start_index : int, default 1 -max_results : int, default 10000 - If >10000, must specify chunksize or ValueError will be raised""" - -_QUERY_DOC = """ -Construct a google analytics query using given parameters -Metrics and dimensions do not need the 'ga:' prefix - -Parameters ----------- -profile_id : str -%s -""" % _QUERY_PARAMS - -_GA_READER_DOC = """Given query parameters, return a DataFrame with all the -data or an iterator that returns DataFrames containing chunks of the data - -Parameters ----------- -%s -sort : bool/list, default True - Sort output by index or list of columns -chunksize : int, optional - If max_results >10000, specifies the number of rows per iteration -index_col : str/list of str/dict, optional, default: None - If unspecified then dimension variables are set as index -parse_dates : bool/list/dict, default: True -keep_date_col : boolean, default: False -date_parser : optional, default: None -na_values : optional, default: None -converters : optional, default: None -dayfirst : bool, default False - Informs date parsing -account_name : str, optional, default: None -account_id : str, optional, default: None -property_name : str, optional, default: None -property_id : str, optional, default: None -profile_name : str, optional, default: None -profile_id : str, optional, default: None -%%(extras)s -Returns -------- -data : DataFrame or DataFrame yielding iterator -""" % _QUERY_PARAMS - 
-_AUTH_PARAMS = """secrets : str, optional - File path to the secrets file -scope : str, optional - Authentication scope -token_file_name : str, optional - Path to token storage -redirect : str, optional - Local host redirect if unspecified -""" - - -def reset_token_store(): - """ - Deletes the default token store - """ - auth.reset_default_token_store() - - -@Substitution(extras=_AUTH_PARAMS) -@Appender(_GA_READER_DOC) -def read_ga(metrics, dimensions, start_date, **kwargs): - lst = ['secrets', 'scope', 'token_file_name', 'redirect'] - reader_kwds = dict((p, kwargs.pop(p)) for p in lst if p in kwargs) - reader = GAnalytics(**reader_kwds) - return reader.get_data(metrics=metrics, start_date=start_date, - dimensions=dimensions, **kwargs) - - -class OAuthDataReader(object): - """ - Abstract class for handling OAuth2 authentication using the Google - oauth2client library - """ - def __init__(self, scope, token_file_name, redirect): - """ - Parameters - ---------- - scope : str - Designates the authentication scope - token_file_name : str - Location of cache for authenticated tokens - redirect : str - Redirect URL - """ - self.scope = scope - self.token_store = auth.make_token_store(token_file_name) - self.redirect_url = redirect - - def authenticate(self, secrets): - """ - Run the authentication process and return an authorized - http object - - Parameters - ---------- - secrets : str - File name for client secrets - - Notes - ----- - See google documention for format of secrets file - %s - """ % DOC_URL - flow = self._create_flow(secrets) - return auth.authenticate(flow, self.token_store) - - def _create_flow(self, secrets): - """ - Create an authentication flow based on the secrets file - - Parameters - ---------- - secrets : str - File name for client secrets - - Notes - ----- - See google documentation for format of secrets file - %s - """ % DOC_URL - return auth.get_flow(secrets, self.scope, self.redirect_url) - - -class GDataReader(OAuthDataReader): - """ - Abstract class for reading data from google APIs using OAuth2 - Subclasses must implement create_query method - """ - def __init__(self, scope=auth.DEFAULT_SCOPE, - token_file_name=auth.DEFAULT_TOKEN_FILE, - redirect=NO_CALLBACK, secrets=auth.DEFAULT_SECRETS): - super(GDataReader, self).__init__(scope, token_file_name, redirect) - self._service = self._init_service(secrets) - - @property - def service(self): - """The authenticated request service object""" - return self._service - - def _init_service(self, secrets): - """ - Build an authenticated google api request service using the given - secrets file - """ - http = self.authenticate(secrets) - return auth.init_service(http) - - def get_account(self, name=None, id=None, **kwargs): - """ Retrieve an account that matches the name, id, or some account - attribute specified in **kwargs - - Parameters - ---------- - name : str, optional, default: None - id : str, optional, default: None - """ - accounts = self.service.management().accounts().list().execute() - return _get_match(accounts, name, id, **kwargs) - - def get_web_property(self, account_id=None, name=None, id=None, **kwargs): - """ - Retrieve a web property given and account and property name, id, or - custom attribute - - Parameters - ---------- - account_id : str, optional, default: None - name : str, optional, default: None - id : str, optional, default: None - """ - prop_store = self.service.management().webproperties() - kwds = {} - if account_id is not None: - kwds['accountId'] = account_id - prop_for_acct = 
prop_store.list(**kwds).execute() - return _get_match(prop_for_acct, name, id, **kwargs) - - def get_profile(self, account_id=None, web_property_id=None, name=None, - id=None, **kwargs): - - """ - Retrieve the right profile for the given account, web property, and - profile attribute (name, id, or arbitrary parameter in kwargs) - - Parameters - ---------- - account_id : str, optional, default: None - web_property_id : str, optional, default: None - name : str, optional, default: None - id : str, optional, default: None - """ - profile_store = self.service.management().profiles() - kwds = {} - if account_id is not None: - kwds['accountId'] = account_id - if web_property_id is not None: - kwds['webPropertyId'] = web_property_id - profiles = profile_store.list(**kwds).execute() - return _get_match(profiles, name, id, **kwargs) - - def create_query(self, *args, **kwargs): - raise NotImplementedError() - - @Substitution(extras='') - @Appender(_GA_READER_DOC) - def get_data(self, metrics, start_date, end_date=None, - dimensions=None, segment=None, filters=None, start_index=1, - max_results=10000, index_col=None, parse_dates=True, - keep_date_col=False, date_parser=None, na_values=None, - converters=None, sort=True, dayfirst=False, - account_name=None, account_id=None, property_name=None, - property_id=None, profile_name=None, profile_id=None, - chunksize=None): - if chunksize is None and max_results > 10000: - raise ValueError('Google API returns maximum of 10,000 rows, ' - 'please set chunksize') - - account = self.get_account(account_name, account_id) - web_property = self.get_web_property(account.get('id'), property_name, - property_id) - profile = self.get_profile(account.get('id'), web_property.get('id'), - profile_name, profile_id) - - profile_id = profile.get('id') - - if index_col is None and dimensions is not None: - if isinstance(dimensions, compat.string_types): - dimensions = [dimensions] - index_col = _clean_index(list(dimensions), parse_dates) - - def _read(start, result_size): - query = self.create_query(profile_id, metrics, start_date, - end_date=end_date, dimensions=dimensions, - segment=segment, filters=filters, - start_index=start, - max_results=result_size) - - try: - rs = query.execute() - rows = rs.get('rows', []) - col_info = rs.get('columnHeaders', []) - return self._parse_data(rows, col_info, index_col, - parse_dates=parse_dates, - keep_date_col=keep_date_col, - date_parser=date_parser, - dayfirst=dayfirst, - na_values=na_values, - converters=converters, sort=sort) - except HttpError as inst: - raise ValueError('Google API error %s: %s' % (inst.resp.status, - inst._get_reason())) - - if chunksize is None: - return _read(start_index, max_results) - - def iterator(): - curr_start = start_index - - while curr_start < max_results: - yield _read(curr_start, chunksize) - curr_start += chunksize - return iterator() - - def _parse_data(self, rows, col_info, index_col, parse_dates=True, - keep_date_col=False, date_parser=None, dayfirst=False, - na_values=None, converters=None, sort=True): - # TODO use returned column types - col_names = _get_col_names(col_info) - df = psr._read(rows, dict(index_col=index_col, parse_dates=parse_dates, - date_parser=date_parser, dayfirst=dayfirst, - na_values=na_values, - keep_date_col=keep_date_col, - converters=converters, - header=None, names=col_names)) - - if isinstance(sort, bool) and sort: - return df.sort_index() - elif isinstance(sort, (compat.string_types, list, tuple, np.ndarray)): - return df.sort_index(by=sort) - - return df - - -class 
GAnalytics(GDataReader): - - @Appender(_QUERY_DOC) - def create_query(self, profile_id, metrics, start_date, end_date=None, - dimensions=None, segment=None, filters=None, - start_index=None, max_results=10000, **kwargs): - qry = format_query(profile_id, metrics, start_date, end_date=end_date, - dimensions=dimensions, segment=segment, - filters=filters, start_index=start_index, - max_results=max_results, **kwargs) - try: - return self.service.data().ga().get(**qry) - except TypeError as error: - raise ValueError('Error making query: %s' % error) - - -def format_query(ids, metrics, start_date, end_date=None, dimensions=None, - segment=None, filters=None, sort=None, start_index=None, - max_results=10000, **kwargs): - if isinstance(metrics, compat.string_types): - metrics = [metrics] - met = ','.join(['ga:%s' % x for x in metrics]) - - start_date = pd.to_datetime(start_date).strftime('%Y-%m-%d') - if end_date is None: - end_date = datetime.today() - end_date = pd.to_datetime(end_date).strftime('%Y-%m-%d') - - qry = dict(ids='ga:%s' % str(ids), - metrics=met, - start_date=start_date, - end_date=end_date) - qry.update(kwargs) - - names = ['dimensions', 'filters', 'sort'] - lst = [dimensions, filters, sort] - [_maybe_add_arg(qry, n, d) for n, d in zip(names, lst)] - - if isinstance(segment, compat.string_types): - if re.match("^[a-zA-Z0-9\-\_]+$", segment): - _maybe_add_arg(qry, 'segment', segment, 'gaid:') - else: - _maybe_add_arg(qry, 'segment', segment, 'dynamic::ga') - elif isinstance(segment, int): - _maybe_add_arg(qry, 'segment', segment, 'gaid:') - elif segment: - raise ValueError("segment must be string for dynamic and int ID") - - if start_index is not None: - qry['start_index'] = str(start_index) - - if max_results is not None: - qry['max_results'] = str(max_results) - - return qry - - -def _maybe_add_arg(query, field, data, prefix='ga'): - if data is not None: - if isinstance(data, (compat.string_types, int)): - data = [data] - data = ','.join(['%s:%s' % (prefix, x) for x in data]) - query[field] = data - - -def _get_match(obj_store, name, id, **kwargs): - key, val = None, None - if len(kwargs) > 0: - key = list(kwargs.keys())[0] - val = list(kwargs.values())[0] - - if name is None and id is None and key is None: - return obj_store.get('items')[0] - - name_ok = lambda item: name is not None and item.get('name') == name - id_ok = lambda item: id is not None and item.get('id') == id - key_ok = lambda item: key is not None and item.get(key) == val - - match = None - if obj_store.get('items'): - # TODO look up gapi for faster lookup - for item in obj_store.get('items'): - if name_ok(item) or id_ok(item) or key_ok(item): - return item - - -def _clean_index(index_dims, parse_dates): - _should_add = lambda lst: pd.Index(lst).isin(index_dims).all() - to_remove = [] - to_add = [] - - if isinstance(parse_dates, (list, tuple, np.ndarray)): - for lst in parse_dates: - if isinstance(lst, (list, tuple, np.ndarray)): - if _should_add(lst): - to_add.append('_'.join(lst)) - to_remove.extend(lst) - elif isinstance(parse_dates, dict): - for name, lst in compat.iteritems(parse_dates): - if isinstance(lst, (list, tuple, np.ndarray)): - if _should_add(lst): - to_add.append(name) - to_remove.extend(lst) - - index_dims = pd.Index(index_dims) - to_remove = pd.Index(set(to_remove)) - to_add = pd.Index(set(to_add)) - - return index_dims.difference(to_remove).union(to_add) - - -def _get_col_names(header_info): - return [x['name'][3:] for x in header_info] - - -def _get_column_types(header_info): - return 
[(x['name'][3:], x['columnType']) for x in header_info] - - -def _get_dim_names(header_info): - return [x['name'][3:] for x in header_info - if x['columnType'] == u('DIMENSION')] - - -def _get_met_names(header_info): - return [x['name'][3:] for x in header_info - if x['columnType'] == u('METRIC')] - - -def _get_data_types(header_info): - return [(x['name'][3:], TYPE_MAP.get(x['dataType'], object)) - for x in header_info] diff --git a/pandas/io/gbq.py b/pandas/io/gbq.py index 8038cc500f6cd..b4dc9173f11ba 100644 --- a/pandas/io/gbq.py +++ b/pandas/io/gbq.py @@ -1,638 +1,37 @@ -import warnings -from datetime import datetime -import json -import logging -from time import sleep -import uuid -import time -import sys +""" Google BigQuery support """ -import numpy as np - -from distutils.version import StrictVersion -from pandas import compat -from pandas.core.api import DataFrame -from pandas.tools.merge import concat -from pandas.core.common import PandasError -from pandas.compat import lzip, bytes_to_str - - -def _check_google_client_version(): +def _try_import(): + # since pandas is a dependency of pandas-gbq + # we need to import on first use try: - import pkg_resources - + import pandas_gbq except ImportError: - raise ImportError('Could not import pkg_resources (setuptools).') - - if compat.PY3: - google_api_minimum_version = '1.4.1' - else: - google_api_minimum_version = '1.2.0' - - _GOOGLE_API_CLIENT_VERSION = pkg_resources.get_distribution( - 'google-api-python-client').version - - if (StrictVersion(_GOOGLE_API_CLIENT_VERSION) < - StrictVersion(google_api_minimum_version)): - raise ImportError("pandas requires google-api-python-client >= {0} " - "for Google BigQuery support, " - "current version {1}" - .format(google_api_minimum_version, - _GOOGLE_API_CLIENT_VERSION)) - - -def _test_google_api_imports(): - - try: - import httplib2 # noqa - try: - from googleapiclient.discovery import build # noqa - from googleapiclient.errors import HttpError # noqa - except: - from apiclient.discovery import build # noqa - from apiclient.errors import HttpError # noqa - from oauth2client.client import AccessTokenRefreshError # noqa - from oauth2client.client import OAuth2WebServerFlow # noqa - from oauth2client.file import Storage # noqa - from oauth2client.tools import run_flow, argparser # noqa - except ImportError as e: - raise ImportError("Missing module required for Google BigQuery " - "support: {0}".format(str(e))) - -logger = logging.getLogger('pandas.io.gbq') -logger.setLevel(logging.ERROR) - - -class InvalidPrivateKeyFormat(PandasError, ValueError): - """ - Raised when provided private key has invalid format. - """ - pass - - -class AccessDenied(PandasError, ValueError): - """ - Raised when invalid credentials are provided, or tokens have expired. - """ - pass - - -class DatasetCreationError(PandasError, ValueError): - """ - Raised when the create dataset method fails - """ - pass - - -class GenericGBQException(PandasError, ValueError): - """ - Raised when an unrecognized Google API Error occurs. - """ - pass - - -class InvalidColumnOrder(PandasError, ValueError): - """ - Raised when the provided column order for output - results DataFrame does not match the schema - returned by BigQuery. - """ - pass - - -class InvalidPageToken(PandasError, ValueError): - """ - Raised when Google BigQuery fails to return, - or returns a duplicate page token. 
- """ - pass - - -class InvalidSchema(PandasError, ValueError): - """ - Raised when the provided DataFrame does - not match the schema of the destination - table in BigQuery. - """ - pass - - -class NotFoundException(PandasError, ValueError): - """ - Raised when the project_id, table or dataset provided in the query could - not be found. - """ - pass - - -class StreamingInsertError(PandasError, ValueError): - """ - Raised when BigQuery reports a streaming insert error. - For more information see `Streaming Data Into BigQuery - `__ - """ - - -class TableCreationError(PandasError, ValueError): - """ - Raised when the create table method fails - """ - pass - - -class GbqConnector(object): - scope = 'https://www.googleapis.com/auth/bigquery' - - def __init__(self, project_id, reauth=False, verbose=False, - private_key=None, dialect='legacy'): - _check_google_client_version() - _test_google_api_imports() - self.project_id = project_id - self.reauth = reauth - self.verbose = verbose - self.private_key = private_key - self.dialect = dialect - self.credentials = self.get_credentials() - self.service = self.get_service() - - def get_credentials(self): - if self.private_key: - return self.get_service_account_credentials() - else: - # Try to retrieve Application Default Credentials - credentials = self.get_application_default_credentials() - if not credentials: - credentials = self.get_user_account_credentials() - return credentials - - def get_application_default_credentials(self): - """ - This method tries to retrieve the "default application credentials". - This could be useful for running code on Google Cloud Platform. - - .. versionadded:: 0.19.0 - - Parameters - ---------- - None - - Returns - ------- - - GoogleCredentials, - If the default application credentials can be retrieved - from the environment. The retrieved credentials should also - have access to the project (self.project_id) on BigQuery. - - OR None, - If default application credentials can not be retrieved - from the environment. Or, the retrieved credentials do not - have access to the project (self.project_id) on BigQuery. 
- """ - import httplib2 - try: - from googleapiclient.discovery import build - except ImportError: - from apiclient.discovery import build - try: - from oauth2client.client import GoogleCredentials - except ImportError: - return None - - try: - credentials = GoogleCredentials.get_application_default() - except: - return None - - http = httplib2.Http() - try: - http = credentials.authorize(http) - bigquery_service = build('bigquery', 'v2', http=http) - # Check if the application has rights to the BigQuery project - jobs = bigquery_service.jobs() - job_data = {'configuration': {'query': {'query': 'SELECT 1'}}} - jobs.insert(projectId=self.project_id, body=job_data).execute() - return credentials - except: - return None - - def get_user_account_credentials(self): - from oauth2client.client import OAuth2WebServerFlow - from oauth2client.file import Storage - from oauth2client.tools import run_flow, argparser - - flow = OAuth2WebServerFlow( - client_id=('495642085510-k0tmvj2m941jhre2nbqka17vqpjfddtd' - '.apps.googleusercontent.com'), - client_secret='kOc9wMptUtxkcIFbtZCcrEAc', - scope=self.scope, - redirect_uri='urn:ietf:wg:oauth:2.0:oob') - - storage = Storage('bigquery_credentials.dat') - credentials = storage.get() - - if credentials is None or credentials.invalid or self.reauth: - credentials = run_flow(flow, storage, argparser.parse_args([])) - - return credentials - - def get_service_account_credentials(self): - # Bug fix for https://github.com/pandas-dev/pandas/issues/12572 - # We need to know that a supported version of oauth2client is installed - # Test that either of the following is installed: - # - SignedJwtAssertionCredentials from oauth2client.client - # - ServiceAccountCredentials from oauth2client.service_account - # SignedJwtAssertionCredentials is available in oauthclient < 2.0.0 - # ServiceAccountCredentials is available in oauthclient >= 2.0.0 - oauth2client_v1 = True - oauth2client_v2 = True - - try: - from oauth2client.client import SignedJwtAssertionCredentials - except ImportError: - oauth2client_v1 = False - - try: - from oauth2client.service_account import ServiceAccountCredentials - except ImportError: - oauth2client_v2 = False - - if not oauth2client_v1 and not oauth2client_v2: - raise ImportError("Missing oauth2client required for BigQuery " - "service account support") - - from os.path import isfile - - try: - if isfile(self.private_key): - with open(self.private_key) as f: - json_key = json.loads(f.read()) - else: - # ugly hack: 'private_key' field has new lines inside, - # they break json parser, but we need to preserve them - json_key = json.loads(self.private_key.replace('\n', ' ')) - json_key['private_key'] = json_key['private_key'].replace( - ' ', '\n') - - if compat.PY3: - json_key['private_key'] = bytes( - json_key['private_key'], 'UTF-8') - - if oauth2client_v1: - return SignedJwtAssertionCredentials( - json_key['client_email'], - json_key['private_key'], - self.scope, - ) - else: - return ServiceAccountCredentials.from_json_keyfile_dict( - json_key, - self.scope) - except (KeyError, ValueError, TypeError, AttributeError): - raise InvalidPrivateKeyFormat( - "Private key is missing or invalid. It should be service " - "account private key JSON (file path or string contents) " - "with at least two keys: 'client_email' and 'private_key'. " - "Can be obtained from: https://console.developers.google." 
- "com/permissions/serviceaccounts") - - def _print(self, msg, end='\n'): - if self.verbose: - sys.stdout.write(msg + end) - sys.stdout.flush() - - def _start_timer(self): - self.start = time.time() - - def get_elapsed_seconds(self): - return round(time.time() - self.start, 2) - - def print_elapsed_seconds(self, prefix='Elapsed', postfix='s.', - overlong=7): - sec = self.get_elapsed_seconds() - if sec > overlong: - self._print('{} {} {}'.format(prefix, sec, postfix)) - - # http://stackoverflow.com/questions/1094841/reusable-library-to-get-human-readable-version-of-file-size - @staticmethod - def sizeof_fmt(num, suffix='b'): - fmt = "%3.1f %s%s" - for unit in ['', 'k', 'M', 'G', 'T', 'P', 'E', 'Z']: - if abs(num) < 1024.0: - return fmt % (num, unit, suffix) - num /= 1024.0 - return fmt % (num, 'Y', suffix) - - def get_service(self): - import httplib2 - try: - from googleapiclient.discovery import build - except: - from apiclient.discovery import build - - http = httplib2.Http() - http = self.credentials.authorize(http) - bigquery_service = build('bigquery', 'v2', http=http) - - return bigquery_service - - @staticmethod - def process_http_error(ex): - # See `BigQuery Troubleshooting Errors - # `__ - - status = json.loads(bytes_to_str(ex.content))['error'] - errors = status.get('errors', None) - - if errors: - for error in errors: - reason = error['reason'] - message = error['message'] - - raise GenericGBQException( - "Reason: {0}, Message: {1}".format(reason, message)) - - raise GenericGBQException(errors) - - def process_insert_errors(self, insert_errors): - for insert_error in insert_errors: - row = insert_error['index'] - errors = insert_error.get('errors', None) - for error in errors: - reason = error['reason'] - message = error['message'] - location = error['location'] - error_message = ('Error at Row: {0}, Reason: {1}, ' - 'Location: {2}, Message: {3}' - .format(row, reason, location, message)) - - # Report all error messages if verbose is set - if self.verbose: - self._print(error_message) - else: - raise StreamingInsertError(error_message + - '\nEnable verbose logging to ' - 'see all errors') - - raise StreamingInsertError - - def run_query(self, query): - try: - from googleapiclient.errors import HttpError - except: - from apiclient.errors import HttpError - from oauth2client.client import AccessTokenRefreshError - - _check_google_client_version() - - job_collection = self.service.jobs() - job_data = { - 'configuration': { - 'query': { - 'query': query, - 'useLegacySql': self.dialect == 'legacy' - # 'allowLargeResults', 'createDisposition', - # 'preserveNulls', destinationTable, useQueryCache - } - } - } - - self._start_timer() - try: - self._print('Requesting query... ', end="") - query_reply = job_collection.insert( - projectId=self.project_id, body=job_data).execute() - self._print('ok.\nQuery running...') - except (AccessTokenRefreshError, ValueError): - if self.private_key: - raise AccessDenied( - "The service account credentials are not valid") - else: - raise AccessDenied( - "The credentials have been revoked or expired, " - "please re-run the application to re-authorize") - except HttpError as ex: - self.process_http_error(ex) - - job_reference = query_reply['jobReference'] - - while not query_reply.get('jobComplete', False): - self.print_elapsed_seconds(' Elapsed', 's. 
Waiting...') - try: - query_reply = job_collection.getQueryResults( - projectId=job_reference['projectId'], - jobId=job_reference['jobId']).execute() - except HttpError as ex: - self.process_http_error(ex) - - if self.verbose: - if query_reply['cacheHit']: - self._print('Query done.\nCache hit.\n') - else: - bytes_processed = int(query_reply.get( - 'totalBytesProcessed', '0')) - self._print('Query done.\nProcessed: {}\n'.format( - self.sizeof_fmt(bytes_processed))) - - self._print('Retrieving results...') - - total_rows = int(query_reply['totalRows']) - result_pages = list() - seen_page_tokens = list() - current_row = 0 - # Only read schema on first page - schema = query_reply['schema'] - - # Loop through each page of data - while 'rows' in query_reply and current_row < total_rows: - page = query_reply['rows'] - result_pages.append(page) - current_row += len(page) - - self.print_elapsed_seconds( - ' Got page: {}; {}% done. Elapsed'.format( - len(result_pages), - round(100.0 * current_row / total_rows))) - - if current_row == total_rows: - break - - page_token = query_reply.get('pageToken', None) - - if not page_token and current_row < total_rows: - raise InvalidPageToken("Required pageToken was missing. " - "Received {0} of {1} rows" - .format(current_row, total_rows)) - - elif page_token in seen_page_tokens: - raise InvalidPageToken("A duplicate pageToken was returned") - - seen_page_tokens.append(page_token) - - try: - query_reply = job_collection.getQueryResults( - projectId=job_reference['projectId'], - jobId=job_reference['jobId'], - pageToken=page_token).execute() - except HttpError as ex: - self.process_http_error(ex) - - if current_row < total_rows: - raise InvalidPageToken() - - # print basic query stats - self._print('Got {} rows.\n'.format(total_rows)) - - return schema, result_pages - - def load_data(self, dataframe, dataset_id, table_id, chunksize): - try: - from googleapiclient.errors import HttpError - except: - from apiclient.errors import HttpError - - job_id = uuid.uuid4().hex - rows = [] - remaining_rows = len(dataframe) - - total_rows = remaining_rows - self._print("\n\n") - - for index, row in dataframe.reset_index(drop=True).iterrows(): - row_dict = dict() - row_dict['json'] = json.loads(row.to_json(force_ascii=False, - date_unit='s', - date_format='iso')) - row_dict['insertId'] = job_id + str(index) - rows.append(row_dict) - remaining_rows -= 1 - - if (len(rows) % chunksize == 0) or (remaining_rows == 0): - self._print("\rStreaming Insert is {0}% Complete".format( - ((total_rows - remaining_rows) * 100) / total_rows)) - - body = {'rows': rows} - - try: - response = self.service.tabledata().insertAll( - projectId=self.project_id, - datasetId=dataset_id, - tableId=table_id, - body=body).execute() - except HttpError as ex: - self.process_http_error(ex) - - # For streaming inserts, even if you receive a success HTTP - # response code, you'll need to check the insertErrors property - # of the response to determine if the row insertions were - # successful, because it's possible that BigQuery was only - # partially successful at inserting the rows. 
See the `Success - # HTTP Response Codes - # `__ - # section - - insert_errors = response.get('insertErrors', None) - if insert_errors: - self.process_insert_errors(insert_errors) - - sleep(1) # Maintains the inserts "per second" rate per API - rows = [] - - self._print("\n") - - def verify_schema(self, dataset_id, table_id, schema): - try: - from googleapiclient.errors import HttpError - except: - from apiclient.errors import HttpError - - try: - remote_schema = self.service.tables().get( - projectId=self.project_id, - datasetId=dataset_id, - tableId=table_id).execute()['schema'] - - fields_remote = set([json.dumps(field_remote) - for field_remote in remote_schema['fields']]) - fields_local = set(json.dumps(field_local) - for field_local in schema['fields']) - - return fields_remote == fields_local - except HttpError as ex: - self.process_http_error(ex) - - def delete_and_recreate_table(self, dataset_id, table_id, table_schema): - delay = 0 - - # Changes to table schema may take up to 2 minutes as of May 2015 See - # `Issue 191 - # `__ - # Compare previous schema with new schema to determine if there should - # be a 120 second delay - - if not self.verify_schema(dataset_id, table_id, table_schema): - self._print('The existing table has a different schema. Please ' - 'wait 2 minutes. See Google BigQuery issue #191') - delay = 120 - - table = _Table(self.project_id, dataset_id, - private_key=self.private_key) - table.delete(table_id) - table.create(table_id, table_schema) - sleep(delay) + # give a nice error message + raise ImportError("Load data from Google BigQuery\n" + "\n" + "the pandas-gbq package is not installed\n" + "see the docs: https://pandas-gbq.readthedocs.io\n" + "\n" + "you can install via pip or conda:\n" + "pip install pandas-gbq\n" + "conda install pandas-gbq -c conda-forge\n") -def _parse_data(schema, rows): - # see: - # http://pandas.pydata.org/pandas-docs/dev/missing_data.html - # #missing-data-casting-rules-and-indexing - dtype_map = {'INTEGER': np.dtype(float), - 'FLOAT': np.dtype(float), - # This seems to be buggy without nanosecond indicator - 'TIMESTAMP': 'M8[ns]'} - - fields = schema['fields'] - col_types = [field['type'] for field in fields] - col_names = [str(field['name']) for field in fields] - col_dtypes = [dtype_map.get(field['type'], object) for field in fields] - page_array = np.zeros((len(rows),), - dtype=lzip(col_names, col_dtypes)) - - for row_num, raw_row in enumerate(rows): - entries = raw_row.get('f', []) - for col_num, field_type in enumerate(col_types): - field_value = _parse_entry(entries[col_num].get('v', ''), - field_type) - page_array[row_num][col_num] = field_value - - return DataFrame(page_array, columns=col_names) - - -def _parse_entry(field_value, field_type): - if field_value is None or field_value == 'null': - return None - if field_type == 'INTEGER' or field_type == 'FLOAT': - return float(field_value) - elif field_type == 'TIMESTAMP': - timestamp = datetime.utcfromtimestamp(float(field_value)) - return np.datetime64(timestamp) - elif field_type == 'BOOLEAN': - return field_value == 'true' - return field_value + return pandas_gbq def read_gbq(query, project_id=None, index_col=None, col_order=None, - reauth=False, verbose=True, private_key=None, dialect='legacy'): - """Load data from Google BigQuery. - - THIS IS AN EXPERIMENTAL LIBRARY + reauth=False, verbose=True, private_key=None, dialect='legacy', + **kwargs): + r"""Load data from Google BigQuery. 
The main method a user calls to execute a Query in Google BigQuery and read results into a pandas DataFrame. Google BigQuery API Client Library v2 for Python is used. - Documentation is available at - https://developers.google.com/api-client-library/python/apis/bigquery/v2 + Documentation is available `here + `__ Authentication to the Google BigQuery service is via OAuth 2.0. @@ -640,8 +39,6 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None, By default "application default credentials" are used. - .. versionadded:: 0.19.0 - If default application credentials are not found or are restrictive, user account credentials are used. In this case, you will be asked to grant permissions for product name 'pandas GBQ'. @@ -671,8 +68,6 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None, or string contents. This is useful for remote server authentication (eg. jupyter iPython notebook on remote host) - .. versionadded:: 0.18.1 - dialect : {'legacy', 'standard'}, default 'legacy' 'legacy' : Use BigQuery's legacy SQL dialect. 'standard' : Use BigQuery's standard SQL (beta), which is @@ -680,7 +75,14 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None, see `BigQuery SQL Reference `__ - .. versionadded:: 0.19.0 + **kwargs : Arbitrary keyword arguments + configuration (dict): query config parameters for job processing. + For example: + + configuration = {'query': {'useQueryCache': False}} + + For more information see `BigQuery SQL Reference + `__ Returns ------- @@ -688,442 +90,20 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None, DataFrame representing results of query """ - - if not project_id: - raise TypeError("Missing required parameter: project_id") - - if dialect not in ('legacy', 'standard'): - raise ValueError("'{0}' is not valid for dialect".format(dialect)) - - connector = GbqConnector(project_id, reauth=reauth, verbose=verbose, - private_key=private_key, - dialect=dialect) - schema, pages = connector.run_query(query) - dataframe_list = [] - while len(pages) > 0: - page = pages.pop() - dataframe_list.append(_parse_data(schema, page)) - - if len(dataframe_list) > 0: - final_df = concat(dataframe_list, ignore_index=True) - else: - final_df = _parse_data(schema, []) - - # Reindex the DataFrame on the provided column - if index_col is not None: - if index_col in final_df.columns: - final_df.set_index(index_col, inplace=True) - else: - raise InvalidColumnOrder( - 'Index column "{0}" does not exist in DataFrame.' - .format(index_col) - ) - - # Change the order of columns in the DataFrame based on provided list - if col_order is not None: - if sorted(col_order) == sorted(final_df.columns): - final_df = final_df[col_order] - else: - raise InvalidColumnOrder( - 'Column order does not match this DataFrame.' - ) - - # Downcast floats to integers and objects to booleans - # if there are no NaN's. This is presently due to a - # limitation of numpy in handling missing data. 
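The docstring above now documents arbitrary keyword arguments forwarded to pandas-gbq, with `configuration` as the motivating case. A minimal usage sketch, assuming pandas-gbq is installed and treating `'my-project'` and the query as placeholders for a real BigQuery project and table:

```python
# Hedged sketch of calling the new thin wrapper; names are placeholders.
import pandas as pd

df = pd.read_gbq('SELECT name, value FROM my_dataset.my_table',
                 project_id='my-project',
                 dialect='legacy',
                 configuration={'query': {'useQueryCache': False}})
```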
- final_df._data = final_df._data.downcast(dtypes='infer') - - connector.print_elapsed_seconds( - 'Total time taken', - datetime.now().strftime('s.\nFinished at %Y-%m-%d %H:%M:%S.'), - 0 - ) - - return final_df + pandas_gbq = _try_import() + return pandas_gbq.read_gbq( + query, project_id=project_id, + index_col=index_col, col_order=col_order, + reauth=reauth, verbose=verbose, + private_key=private_key, + dialect=dialect, + **kwargs) def to_gbq(dataframe, destination_table, project_id, chunksize=10000, verbose=True, reauth=False, if_exists='fail', private_key=None): - """Write a DataFrame to a Google BigQuery table. - - THIS IS AN EXPERIMENTAL LIBRARY - - The main method a user calls to export pandas DataFrame contents to - Google BigQuery table. - - Google BigQuery API Client Library v2 for Python is used. - Documentation is available at - https://developers.google.com/api-client-library/python/apis/bigquery/v2 - - Authentication to the Google BigQuery service is via OAuth 2.0. - - - If "private_key" is not provided: - - By default "application default credentials" are used. - - .. versionadded:: 0.19.0 - - If default application credentials are not found or are restrictive, - user account credentials are used. In this case, you will be asked to - grant permissions for product name 'pandas GBQ'. - - - If "private_key" is provided: - - Service account credentials will be used to authenticate. - - Parameters - ---------- - dataframe : DataFrame - DataFrame to be written - destination_table : string - Name of table to be written, in the form 'dataset.tablename' - project_id : str - Google BigQuery Account project ID. - chunksize : int (default 10000) - Number of rows to be inserted in each chunk from the dataframe. - verbose : boolean (default True) - Show percentage complete - reauth : boolean (default False) - Force Google BigQuery to reauthenticate the user. This is useful - if multiple accounts are used. - if_exists : {'fail', 'replace', 'append'}, default 'fail' - 'fail': If table exists, do nothing. - 'replace': If table exists, drop it, recreate it, and insert data. - 'append': If table exists, insert data. Create if does not exist. - private_key : str (optional) - Service account private key in JSON format. Can be file path - or string contents. This is useful for remote server - authentication (eg. jupyter iPython notebook on remote host) - """ - - if if_exists not in ('fail', 'replace', 'append'): - raise ValueError("'{0}' is not valid for if_exists".format(if_exists)) - - if '.' not in destination_table: - raise NotFoundException( - "Invalid Table Name. Should be of the form 'datasetId.tableId' ") - - connector = GbqConnector(project_id, reauth=reauth, verbose=verbose, - private_key=private_key) - dataset_id, table_id = destination_table.rsplit('.', 1) - - table = _Table(project_id, dataset_id, reauth=reauth, - private_key=private_key) - - table_schema = _generate_bq_schema(dataframe) - - # If table exists, check if_exists parameter - if table.exists(table_id): - if if_exists == 'fail': - raise TableCreationError("Could not create the table because it " - "already exists. 
" - "Change the if_exists parameter to " - "append or replace data.") - elif if_exists == 'replace': - connector.delete_and_recreate_table( - dataset_id, table_id, table_schema) - elif if_exists == 'append': - if not connector.verify_schema(dataset_id, table_id, table_schema): - raise InvalidSchema("Please verify that the structure and " - "data types in the DataFrame match the " - "schema of the destination table.") - else: - table.create(table_id, table_schema) - - connector.load_data(dataframe, dataset_id, table_id, chunksize) - - -def generate_bq_schema(df, default_type='STRING'): - # deprecation TimeSeries, #11121 - warnings.warn("generate_bq_schema is deprecated and will be removed in " - "a future version", FutureWarning, stacklevel=2) - - return _generate_bq_schema(df, default_type=default_type) - - -def _generate_bq_schema(df, default_type='STRING'): - """ Given a passed df, generate the associated Google BigQuery schema. - - Parameters - ---------- - df : DataFrame - default_type : string - The default big query type in case the type of the column - does not exist in the schema. - """ - - type_mapping = { - 'i': 'INTEGER', - 'b': 'BOOLEAN', - 'f': 'FLOAT', - 'O': 'STRING', - 'S': 'STRING', - 'U': 'STRING', - 'M': 'TIMESTAMP' - } - - fields = [] - for column_name, dtype in df.dtypes.iteritems(): - fields.append({'name': column_name, - 'type': type_mapping.get(dtype.kind, default_type)}) - - return {'fields': fields} - - -class _Table(GbqConnector): - - def __init__(self, project_id, dataset_id, reauth=False, verbose=False, - private_key=None): - try: - from googleapiclient.errors import HttpError - except: - from apiclient.errors import HttpError - self.http_error = HttpError - self.dataset_id = dataset_id - super(_Table, self).__init__(project_id, reauth, verbose, private_key) - - def exists(self, table_id): - """ Check if a table exists in Google BigQuery - - .. versionadded:: 0.17.0 - - Parameters - ---------- - table : str - Name of table to be verified - - Returns - ------- - boolean - true if table exists, otherwise false - """ - - try: - self.service.tables().get( - projectId=self.project_id, - datasetId=self.dataset_id, - tableId=table_id).execute() - return True - except self.http_error as ex: - if ex.resp.status == 404: - return False - else: - self.process_http_error(ex) - - def create(self, table_id, schema): - """ Create a table in Google BigQuery given a table and schema - - .. versionadded:: 0.17.0 - - Parameters - ---------- - table : str - Name of table to be written - schema : str - Use the generate_bq_schema to generate your table schema from a - dataframe. - """ - - if self.exists(table_id): - raise TableCreationError( - "The table could not be created because it already exists") - - if not _Dataset(self.project_id, - private_key=self.private_key).exists(self.dataset_id): - _Dataset(self.project_id, - private_key=self.private_key).create(self.dataset_id) - - body = { - 'schema': schema, - 'tableReference': { - 'tableId': table_id, - 'projectId': self.project_id, - 'datasetId': self.dataset_id - } - } - - try: - self.service.tables().insert( - projectId=self.project_id, - datasetId=self.dataset_id, - body=body).execute() - except self.http_error as ex: - self.process_http_error(ex) - - def delete(self, table_id): - """ Delete a table in Google BigQuery - - .. 
versionadded:: 0.17.0 - - Parameters - ---------- - table : str - Name of table to be deleted - """ - - if not self.exists(table_id): - raise NotFoundException("Table does not exist") - - try: - self.service.tables().delete( - datasetId=self.dataset_id, - projectId=self.project_id, - tableId=table_id).execute() - except self.http_error as ex: - self.process_http_error(ex) - - -class _Dataset(GbqConnector): - - def __init__(self, project_id, reauth=False, verbose=False, - private_key=None): - try: - from googleapiclient.errors import HttpError - except: - from apiclient.errors import HttpError - self.http_error = HttpError - super(_Dataset, self).__init__(project_id, reauth, verbose, - private_key) - - def exists(self, dataset_id): - """ Check if a dataset exists in Google BigQuery - - .. versionadded:: 0.17.0 - - Parameters - ---------- - dataset_id : str - Name of dataset to be verified - - Returns - ------- - boolean - true if dataset exists, otherwise false - """ - - try: - self.service.datasets().get( - projectId=self.project_id, - datasetId=dataset_id).execute() - return True - except self.http_error as ex: - if ex.resp.status == 404: - return False - else: - self.process_http_error(ex) - - def datasets(self): - """ Return a list of datasets in Google BigQuery - - .. versionadded:: 0.17.0 - - Parameters - ---------- - None - - Returns - ------- - list - List of datasets under the specific project - """ - - try: - list_dataset_response = self.service.datasets().list( - projectId=self.project_id).execute().get('datasets', None) - - if not list_dataset_response: - return [] - - dataset_list = list() - - for row_num, raw_row in enumerate(list_dataset_response): - dataset_list.append(raw_row['datasetReference']['datasetId']) - - return dataset_list - except self.http_error as ex: - self.process_http_error(ex) - - def create(self, dataset_id): - """ Create a dataset in Google BigQuery - - .. versionadded:: 0.17.0 - - Parameters - ---------- - dataset : str - Name of dataset to be written - """ - - if self.exists(dataset_id): - raise DatasetCreationError( - "The dataset could not be created because it already exists") - - body = { - 'datasetReference': { - 'projectId': self.project_id, - 'datasetId': dataset_id - } - } - - try: - self.service.datasets().insert( - projectId=self.project_id, - body=body).execute() - except self.http_error as ex: - self.process_http_error(ex) - - def delete(self, dataset_id): - """ Delete a dataset in Google BigQuery - - .. versionadded:: 0.17.0 - - Parameters - ---------- - dataset : str - Name of dataset to be deleted - """ - - if not self.exists(dataset_id): - raise NotFoundException( - "Dataset {0} does not exist".format(dataset_id)) - - try: - self.service.datasets().delete( - datasetId=dataset_id, - projectId=self.project_id).execute() - - except self.http_error as ex: - self.process_http_error(ex) - - def tables(self, dataset_id): - """ List tables in the specific dataset in Google BigQuery - - .. 
versionadded:: 0.17.0 - - Parameters - ---------- - dataset : str - Name of dataset to list tables for - - Returns - ------- - list - List of tables under the specific dataset - """ - - try: - list_table_response = self.service.tables().list( - projectId=self.project_id, - datasetId=dataset_id).execute().get('tables', None) - - if not list_table_response: - return [] - - table_list = list() - - for row_num, raw_row in enumerate(list_table_response): - table_list.append(raw_row['tableReference']['tableId']) - - return table_list - except self.http_error as ex: - self.process_http_error(ex) + pandas_gbq = _try_import() + pandas_gbq.to_gbq(dataframe, destination_table, project_id, + chunksize=chunksize, + verbose=verbose, reauth=reauth, + if_exists=if_exists, private_key=private_key) diff --git a/pandas/io/html.py b/pandas/io/html.py index 3c38dae91eb89..2613f26ae5f52 100644 --- a/pandas/io/html.py +++ b/pandas/io/html.py @@ -12,15 +12,16 @@ import numpy as np -from pandas.types.common import is_list_like -from pandas.io.common import (EmptyDataError, _is_url, urlopen, +from pandas.core.dtypes.common import is_list_like +from pandas.errors import EmptyDataError +from pandas.io.common import (_is_url, urlopen, parse_url, _validate_header_arg) from pandas.io.parsers import TextParser from pandas.compat import (lrange, lmap, u, string_types, iteritems, raise_with_traceback, binary_type) from pandas import Series from pandas.core.common import AbstractMethodError -from pandas.formats.printing import pprint_thing +from pandas.io.formats.printing import pprint_thing _IMPORTS = False _HAS_BS4 = False @@ -355,9 +356,12 @@ def _parse_raw_thead(self, table): thead = self._parse_thead(table) res = [] if thead: - res = lmap(self._text_getter, self._parse_th(thead[0])) - return np.atleast_1d( - np.array(res).squeeze()) if res and len(res) == 1 else res + trs = self._parse_tr(thead[0]) + for tr in trs: + cols = lmap(self._text_getter, self._parse_td(tr)) + if any([col != '' for col in cols]): + res.append(cols) + return res def _parse_raw_tfoot(self, table): tfoot = self._parse_tfoot(table) @@ -591,9 +595,17 @@ def _parse_tfoot(self, table): return table.xpath('.//tfoot') def _parse_raw_thead(self, table): - expr = './/thead//th' - return [_remove_whitespace(x.text_content()) for x in - table.xpath(expr)] + expr = './/thead' + thead = table.xpath(expr) + res = [] + if thead: + trs = self._parse_tr(thead[0]) + for tr in trs: + cols = [_remove_whitespace(x.text_content()) for x in + self._parse_td(tr)] + if any([col != '' for col in cols]): + res.append(cols) + return res def _parse_raw_tfoot(self, table): expr = './/tfoot//th|//tfoot//td' @@ -615,19 +627,17 @@ def _data_to_frame(**kwargs): head, body, foot = kwargs.pop('data') header = kwargs.pop('header') kwargs['skiprows'] = _get_skiprows(kwargs['skiprows']) - if head: - body = [head] + body - + rows = lrange(len(head)) + body = head + body if header is None: # special case when a table has elements - header = 0 + header = 0 if rows == [0] else rows if foot: body += [foot] # fill out elements of body that are "ragged" _expand_elements(body) - tp = TextParser(body, header=header, **kwargs) df = tp.read() return df @@ -853,7 +863,7 @@ def read_html(io, match='.+', flavor=None, header=None, index_col=None, Notes ----- Before using this function you should read the :ref:`gotchas about the - HTML parsing libraries `. + HTML parsing libraries `. Expect to do some cleanup after you call this function. 
For example, you might need to manually assign column names if the column names are diff --git a/pandas/io/json/__init__.py b/pandas/io/json/__init__.py new file mode 100644 index 0000000000000..32d110b3404a9 --- /dev/null +++ b/pandas/io/json/__init__.py @@ -0,0 +1,5 @@ +from .json import to_json, read_json, loads, dumps # noqa +from .normalize import json_normalize # noqa +from .table_schema import build_table_schema # noqa + +del json, normalize, table_schema # noqa diff --git a/pandas/io/json.py b/pandas/io/json/json.py similarity index 71% rename from pandas/io/json.py rename to pandas/io/json/json.py index 878506a6ddc05..b2fe074732cbb 100644 --- a/pandas/io/json.py +++ b/pandas/io/json/json.py @@ -1,21 +1,23 @@ # pylint: disable-msg=E1101,W0613,W0603 - import os -import copy -from collections import defaultdict import numpy as np -import pandas.json as _json -from pandas.tslib import iNaT +import pandas._libs.json as json +from pandas._libs.tslib import iNaT from pandas.compat import StringIO, long, u from pandas import compat, isnull -from pandas import Series, DataFrame, to_datetime +from pandas import Series, DataFrame, to_datetime, MultiIndex from pandas.io.common import get_filepath_or_buffer, _get_handle from pandas.core.common import AbstractMethodError -from pandas.formats.printing import pprint_thing +from pandas.io.formats.printing import pprint_thing +from .normalize import _convert_to_line_delimits +from .table_schema import build_table_schema +from pandas.core.dtypes.common import is_period_dtype + +loads = json.loads +dumps = json.dumps -loads = _json.loads -dumps = _json.dumps +TABLE_SCHEMA_VERSION = '0.20.0' # interface to/from @@ -24,22 +26,25 @@ def to_json(path_or_buf, obj, orient=None, date_format='epoch', default_handler=None, lines=False): if lines and orient != 'records': - raise ValueError( - "'lines' keyword only valid when 'orient' is records") - - if isinstance(obj, Series): - s = SeriesWriter( - obj, orient=orient, date_format=date_format, - double_precision=double_precision, ensure_ascii=force_ascii, - date_unit=date_unit, default_handler=default_handler).write() + raise ValueError( + "'lines' keyword only valid when 'orient' is records") + + if orient == 'table' and isinstance(obj, Series): + obj = obj.to_frame(name=obj.name or 'values') + if orient == 'table' and isinstance(obj, DataFrame): + writer = JSONTableWriter + elif isinstance(obj, Series): + writer = SeriesWriter elif isinstance(obj, DataFrame): - s = FrameWriter( - obj, orient=orient, date_format=date_format, - double_precision=double_precision, ensure_ascii=force_ascii, - date_unit=date_unit, default_handler=default_handler).write() + writer = FrameWriter else: raise NotImplementedError("'obj' should be a Series or a DataFrame") + s = writer( + obj, orient=orient, date_format=date_format, + double_precision=double_precision, ensure_ascii=force_ascii, + date_unit=date_unit, default_handler=default_handler).write() + if lines: s = _convert_to_line_delimits(s) @@ -82,7 +87,8 @@ def write(self): ensure_ascii=self.ensure_ascii, date_unit=self.date_unit, iso_dates=self.date_format == 'iso', - default_handler=self.default_handler) + default_handler=self.default_handler + ) class SeriesWriter(Writer): @@ -109,6 +115,60 @@ def _format_axes(self): "'%s'." 
% self.orient) + +class JSONTableWriter(FrameWriter): + _default_orient = 'records' + + def __init__(self, obj, orient, date_format, double_precision, + ensure_ascii, date_unit, default_handler=None): + """ + Adds a `schema` attribute with the Table Schema, resets + the index (can't do in caller, because the schema inference needs + to know what the index is), forces orient to records, and forces + date_format to 'iso'. + """ + super(JSONTableWriter, self).__init__( + obj, orient, date_format, double_precision, ensure_ascii, + date_unit, default_handler=default_handler) + + if date_format != 'iso': + msg = ("Trying to write with `orient='table'` and " + "`date_format='%s'`. Table Schema requires dates " + "to be formatted with `date_format='iso'`" % date_format) + raise ValueError(msg) + + self.schema = build_table_schema(obj) + + # Not implemented on a column MultiIndex + if obj.ndim == 2 and isinstance(obj.columns, MultiIndex): + raise NotImplementedError( + "orient='table' is not supported for MultiIndex") + + # TODO: Do this timedelta properly in objToJSON.c See GH #15137 + if ((obj.ndim == 1) and (obj.name in set(obj.index.names)) or + len(obj.columns & obj.index.names)): + msg = "Overlapping names between the index and columns" + raise ValueError(msg) + + obj = obj.copy() + timedeltas = obj.select_dtypes(include=['timedelta']).columns + if len(timedeltas): + obj[timedeltas] = obj[timedeltas].applymap( + lambda x: x.isoformat()) + # Convert PeriodIndex to datetimes before serializing + if is_period_dtype(obj.index): + obj.index = obj.index.to_timestamp() + + self.obj = obj.reset_index() + self.date_format = 'iso' + self.orient = 'records' + + def write(self): + data = super(JSONTableWriter, self).write() + serialized = '{{"schema": {}, "data": {}}}'.format( + dumps(self.schema), data) + return serialized + + def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, @@ -245,6 +305,17 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, col 1 col 2 0 a b 1 c d + + Encoding with Table Schema + + >>> df.to_json(orient='table') + '{"schema": {"fields": [{"name": "index", "type": "string"}, + {"name": "col 1", "type": "string"}, + {"name": "col 2", "type": "string"}], + "primaryKey": "index", + "pandas_version": "0.20.0"}, + "data": [{"index": "row 1", "col 1": "a", "col 2": "b"}, + {"index": "row 2", "col 1": "c", "col 2": "d"}]}' """ filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf, @@ -259,8 +330,10 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, exists = False if exists: - with _get_handle(filepath_or_buffer, 'r', encoding=encoding) as fh: - json = fh.read() + fh, handles = _get_handle(filepath_or_buffer, 'r', + encoding=encoding) + json = fh.read() + fh.close() else: json = filepath_or_buffer elif hasattr(filepath_or_buffer, 'read'): @@ -272,7 +345,7 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, # If given a json lines file, we break the string into lines, add # commas and put it in a json list to make a valid json object.
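Since `JSONTableWriter` above only engages when `orient='table'` is requested, a short round-trip sketch may help; it assumes this branch, where the schema produced by `build_table_schema` travels alongside the records, mirroring the `read_json` docstring example above.

```python
# Hedged sketch of the new 'table' orient handled by JSONTableWriter.
import pandas as pd

df = pd.DataFrame({'a': [1, 2]}, index=pd.Index(['x', 'y'], name='id'))
payload = df.to_json(orient='table')
# payload is a single JSON document of the form
# '{"schema": {...Table Schema fields...}, "data": [...records...]}'
```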
lines = list(StringIO(json.strip())) - json = u'[' + u','.join(lines) + u']' + json = '[' + ','.join(lines) + ']' obj = None if typ == 'frame': @@ -639,227 +712,3 @@ def is_ok(col): lambda col, c: self._try_convert_to_date(c), lambda col, c: ((self.keep_default_dates and is_ok(col)) or col in convert_dates)) - -# --------------------------------------------------------------------- -# JSON normalization routines - - -def _convert_to_line_delimits(s): - """Helper function that converts json lists to line delimited json.""" - - # Determine we have a JSON list to turn to lines otherwise just return the - # json object, only lists can - if not s[0] == '[' and s[-1] == ']': - return s - s = s[1:-1] - - from pandas.lib import convert_json_to_lines - return convert_json_to_lines(s) - - -def nested_to_record(ds, prefix="", level=0): - """a simplified json_normalize - - converts a nested dict into a flat dict ("record"), unlike json_normalize, - it does not attempt to extract a subset of the data. - - Parameters - ---------- - ds : dict or list of dicts - prefix: the prefix, optional, default: "" - level: the number of levels in the jason string, optional, default: 0 - - Returns - ------- - d - dict or list of dicts, matching `ds` - - Examples - -------- - - IN[52]: nested_to_record(dict(flat1=1,dict1=dict(c=1,d=2), - nested=dict(e=dict(c=1,d=2),d=2))) - Out[52]: - {'dict1.c': 1, - 'dict1.d': 2, - 'flat1': 1, - 'nested.d': 2, - 'nested.e.c': 1, - 'nested.e.d': 2} - """ - singleton = False - if isinstance(ds, dict): - ds = [ds] - singleton = True - - new_ds = [] - for d in ds: - - new_d = copy.deepcopy(d) - for k, v in d.items(): - # each key gets renamed with prefix - if not isinstance(k, compat.string_types): - k = str(k) - if level == 0: - newkey = k - else: - newkey = prefix + '.' + k - - # only dicts gets recurse-flattend - # only at level>1 do we rename the rest of the keys - if not isinstance(v, dict): - if level != 0: # so we skip copying for top level, common case - v = new_d.pop(k) - new_d[newkey] = v - continue - else: - v = new_d.pop(k) - new_d.update(nested_to_record(v, newkey, level + 1)) - new_ds.append(new_d) - - if singleton: - return new_ds[0] - return new_ds - - -def json_normalize(data, record_path=None, meta=None, - meta_prefix=None, - record_prefix=None): - """ - "Normalize" semi-structured JSON data into a flat table - - Parameters - ---------- - data : dict or list of dicts - Unserialized JSON objects - record_path : string or list of strings, default None - Path in each object to list of records. If not passed, data will be - assumed to be an array of records - meta : list of paths (string or list of strings), default None - Fields to use as metadata for each record in resulting table - record_prefix : string, default None - If True, prefix records with dotted (?) path, e.g. foo.bar.field if - path to records is ['foo', 'bar'] - meta_prefix : string, default None - - Returns - ------- - frame : DataFrame - - Examples - -------- - - >>> data = [{'state': 'Florida', - ... 'shortname': 'FL', - ... 'info': { - ... 'governor': 'Rick Scott' - ... }, - ... 'counties': [{'name': 'Dade', 'population': 12345}, - ... {'name': 'Broward', 'population': 40000}, - ... {'name': 'Palm Beach', 'population': 60000}]}, - ... {'state': 'Ohio', - ... 'shortname': 'OH', - ... 'info': { - ... 'governor': 'John Kasich' - ... }, - ... 'counties': [{'name': 'Summit', 'population': 1234}, - ... 
{'name': 'Cuyahoga', 'population': 1337}]}] - >>> from pandas.io.json import json_normalize - >>> result = json_normalize(data, 'counties', ['state', 'shortname', - ... ['info', 'governor']]) - >>> result - name population info.governor state shortname - 0 Dade 12345 Rick Scott Florida FL - 1 Broward 40000 Rick Scott Florida FL - 2 Palm Beach 60000 Rick Scott Florida FL - 3 Summit 1234 John Kasich Ohio OH - 4 Cuyahoga 1337 John Kasich Ohio OH - - """ - def _pull_field(js, spec): - result = js - if isinstance(spec, list): - for field in spec: - result = result[field] - else: - result = result[spec] - - return result - - # A bit of a hackjob - if isinstance(data, dict): - data = [data] - - if record_path is None: - if any([isinstance(x, dict) for x in compat.itervalues(data[0])]): - # naive normalization, this is idempotent for flat records - # and potentially will inflate the data considerably for - # deeply nested structures: - # {VeryLong: { b: 1,c:2}} -> {VeryLong.b:1 ,VeryLong.c:@} - # - # TODO: handle record value which are lists, at least error - # reasonably - data = nested_to_record(data) - return DataFrame(data) - elif not isinstance(record_path, list): - record_path = [record_path] - - if meta is None: - meta = [] - elif not isinstance(meta, list): - meta = [meta] - - for i, x in enumerate(meta): - if not isinstance(x, list): - meta[i] = [x] - - # Disastrously inefficient for now - records = [] - lengths = [] - - meta_vals = defaultdict(list) - meta_keys = ['.'.join(val) for val in meta] - - def _recursive_extract(data, path, seen_meta, level=0): - if len(path) > 1: - for obj in data: - for val, key in zip(meta, meta_keys): - if level + 1 == len(val): - seen_meta[key] = _pull_field(obj, val[-1]) - - _recursive_extract(obj[path[0]], path[1:], - seen_meta, level=level + 1) - else: - for obj in data: - recs = _pull_field(obj, path[0]) - - # For repeating the metadata later - lengths.append(len(recs)) - - for val, key in zip(meta, meta_keys): - if level + 1 > len(val): - meta_val = seen_meta[key] - else: - meta_val = _pull_field(obj, val[level:]) - meta_vals[key].append(meta_val) - - records.extend(recs) - - _recursive_extract(data, record_path, {}, level=0) - - result = DataFrame(records) - - if record_prefix is not None: - result.rename(columns=lambda x: record_prefix + x, inplace=True) - - # Data types, a problem - for k, v in compat.iteritems(meta_vals): - if meta_prefix is not None: - k = meta_prefix + k - - if k in result: - raise ValueError('Conflicting metadata name %s, ' - 'need distinguishing prefix ' % k) - - result[k] = np.array(v).repeat(lengths) - - return result diff --git a/pandas/io/json/normalize.py b/pandas/io/json/normalize.py new file mode 100644 index 0000000000000..401d8d9ead2b8 --- /dev/null +++ b/pandas/io/json/normalize.py @@ -0,0 +1,266 @@ +# --------------------------------------------------------------------- +# JSON normalization routines + +import copy +from collections import defaultdict +import numpy as np + +from pandas._libs.lib import convert_json_to_lines +from pandas import compat, DataFrame + + +def _convert_to_line_delimits(s): + """Helper function that converts json lists to line delimited json.""" + + # Determine we have a JSON list to turn to lines otherwise just return the + # json object, only lists can + if not s[0] == '[' and s[-1] == ']': + return s + s = s[1:-1] + + return convert_json_to_lines(s) + + +def nested_to_record(ds, prefix="", sep=".", level=0): + """a simplified json_normalize + + converts a nested dict into a flat dict 
("record"), unlike json_normalize, + it does not attempt to extract a subset of the data. + + Parameters + ---------- + ds : dict or list of dicts + prefix: the prefix, optional, default: "" + sep : string, default '.' + Nested records will generate names separated by sep, + e.g., for sep='.', { 'foo' : { 'bar' : 0 } } -> foo.bar + + .. versionadded:: 0.20.0 + + level: the number of levels in the jason string, optional, default: 0 + + Returns + ------- + d - dict or list of dicts, matching `ds` + + Examples + -------- + + IN[52]: nested_to_record(dict(flat1=1,dict1=dict(c=1,d=2), + nested=dict(e=dict(c=1,d=2),d=2))) + Out[52]: + {'dict1.c': 1, + 'dict1.d': 2, + 'flat1': 1, + 'nested.d': 2, + 'nested.e.c': 1, + 'nested.e.d': 2} + """ + singleton = False + if isinstance(ds, dict): + ds = [ds] + singleton = True + + new_ds = [] + for d in ds: + + new_d = copy.deepcopy(d) + for k, v in d.items(): + # each key gets renamed with prefix + if not isinstance(k, compat.string_types): + k = str(k) + if level == 0: + newkey = k + else: + newkey = prefix + sep + k + + # only dicts gets recurse-flattend + # only at level>1 do we rename the rest of the keys + if not isinstance(v, dict): + if level != 0: # so we skip copying for top level, common case + v = new_d.pop(k) + new_d[newkey] = v + continue + else: + v = new_d.pop(k) + new_d.update(nested_to_record(v, newkey, sep, level + 1)) + new_ds.append(new_d) + + if singleton: + return new_ds[0] + return new_ds + + +def json_normalize(data, record_path=None, meta=None, + meta_prefix=None, + record_prefix=None, + errors='raise', + sep='.'): + """ + "Normalize" semi-structured JSON data into a flat table + + Parameters + ---------- + data : dict or list of dicts + Unserialized JSON objects + record_path : string or list of strings, default None + Path in each object to list of records. If not passed, data will be + assumed to be an array of records + meta : list of paths (string or list of strings), default None + Fields to use as metadata for each record in resulting table + record_prefix : string, default None + If True, prefix records with dotted (?) path, e.g. foo.bar.field if + path to records is ['foo', 'bar'] + meta_prefix : string, default None + errors : {'raise', 'ignore'}, default 'raise' + + * 'ignore' : will ignore KeyError if keys listed in meta are not + always present + * 'raise' : will raise KeyError if keys listed in meta are not + always present + + .. versionadded:: 0.20.0 + + sep : string, default '.' + Nested records will generate names separated by sep, + e.g., for sep='.', { 'foo' : { 'bar' : 0 } } -> foo.bar + + .. versionadded:: 0.20.0 + + + Returns + ------- + frame : DataFrame + + Examples + -------- + + >>> data = [{'state': 'Florida', + ... 'shortname': 'FL', + ... 'info': { + ... 'governor': 'Rick Scott' + ... }, + ... 'counties': [{'name': 'Dade', 'population': 12345}, + ... {'name': 'Broward', 'population': 40000}, + ... {'name': 'Palm Beach', 'population': 60000}]}, + ... {'state': 'Ohio', + ... 'shortname': 'OH', + ... 'info': { + ... 'governor': 'John Kasich' + ... }, + ... 'counties': [{'name': 'Summit', 'population': 1234}, + ... {'name': 'Cuyahoga', 'population': 1337}]}] + >>> from pandas.io.json import json_normalize + >>> result = json_normalize(data, 'counties', ['state', 'shortname', + ... 
+    >>> result
+             name  population info.governor    state shortname
+    0        Dade       12345     Rick Scott  Florida        FL
+    1     Broward       40000     Rick Scott  Florida        FL
+    2  Palm Beach       60000     Rick Scott  Florida        FL
+    3      Summit        1234    John Kasich     Ohio        OH
+    4    Cuyahoga        1337    John Kasich     Ohio        OH
+
+    """
+    def _pull_field(js, spec):
+        result = js
+        if isinstance(spec, list):
+            for field in spec:
+                result = result[field]
+        else:
+            result = result[spec]
+
+        return result
+
+    if isinstance(data, list) and len(data) == 0:
+        return DataFrame()
+
+    # A bit of a hackjob
+    if isinstance(data, dict):
+        data = [data]
+
+    if record_path is None:
+        if any([isinstance(x, dict) for x in compat.itervalues(data[0])]):
+            # naive normalization, this is idempotent for flat records
+            # and potentially will inflate the data considerably for
+            # deeply nested structures:
+            #  {VeryLong: { b: 1,c:2}} -> {VeryLong.b:1 ,VeryLong.c:2}
+            #
+            # TODO: handle record values which are lists, at least error
+            #       reasonably
+            data = nested_to_record(data, sep=sep)
+        return DataFrame(data)
+    elif not isinstance(record_path, list):
+        record_path = [record_path]
+
+    if meta is None:
+        meta = []
+    elif not isinstance(meta, list):
+        meta = [meta]
+
+    for i, x in enumerate(meta):
+        if not isinstance(x, list):
+            meta[i] = [x]
+
+    # Disastrously inefficient for now
+    records = []
+    lengths = []
+
+    meta_vals = defaultdict(list)
+    if not isinstance(sep, compat.string_types):
+        sep = str(sep)
+    meta_keys = [sep.join(val) for val in meta]
+
+    def _recursive_extract(data, path, seen_meta, level=0):
+        if len(path) > 1:
+            for obj in data:
+                for val, key in zip(meta, meta_keys):
+                    if level + 1 == len(val):
+                        seen_meta[key] = _pull_field(obj, val[-1])
+
+                _recursive_extract(obj[path[0]], path[1:],
+                                   seen_meta, level=level + 1)
+        else:
+            for obj in data:
+                recs = _pull_field(obj, path[0])
+
+                # For repeating the metadata later
+                lengths.append(len(recs))
+
+                for val, key in zip(meta, meta_keys):
+                    if level + 1 > len(val):
+                        meta_val = seen_meta[key]
+                    else:
+                        try:
+                            meta_val = _pull_field(obj, val[level:])
+                        except KeyError as e:
+                            if errors == 'ignore':
+                                meta_val = np.nan
+                            else:
+                                raise KeyError("Try running with "
                                               "errors='ignore' as key "
                                               "%s is not always present"
                                               % str(e))
+                    meta_vals[key].append(meta_val)
+
+                records.extend(recs)
+
+    _recursive_extract(data, record_path, {}, level=0)
+
+    result = DataFrame(records)
+
+    if record_prefix is not None:
+        result.rename(columns=lambda x: record_prefix + x, inplace=True)
+
+    # Data types, a problem
+    for k, v in compat.iteritems(meta_vals):
+        if meta_prefix is not None:
+            k = meta_prefix + k
+
+        if k in result:
+            raise ValueError('Conflicting metadata name %s, '
+                             'need distinguishing prefix ' % k)
+
+        result[k] = np.array(v).repeat(lengths)
+
+    return result
diff --git a/pandas/io/json/table_schema.py b/pandas/io/json/table_schema.py
new file mode 100644
index 0000000000000..c3865afa9c0c0
--- /dev/null
+++ b/pandas/io/json/table_schema.py
@@ -0,0 +1,181 @@
+"""
+Table Schema builders
+
+http://specs.frictionlessdata.io/json-table-schema/
+"""
+from pandas.core.dtypes.common import (
+    is_integer_dtype, is_timedelta64_dtype, is_numeric_dtype,
+    is_bool_dtype, is_datetime64_dtype, is_datetime64tz_dtype,
+    is_categorical_dtype, is_period_dtype, is_string_dtype
+)
+
+
+def as_json_table_type(x):
+    """
+    Convert a NumPy / pandas type to its corresponding Table Schema type.
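For orientation, a minimal sketch of the two options this patch introduces for `json_normalize` (`errors` and `sep`); the inline records and expected column names here are illustrative assumptions, not part of the patch:

```python
from pandas.io.json import json_normalize

data = [{'state': 'Ohio',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234}]},
        {'state': 'Texas',  # note: no 'info' key in this record
         'counties': [{'name': 'Travis', 'population': 2345}]}]

# errors='ignore' fills missing meta fields with NaN instead of raising
result = json_normalize(data, 'counties', ['state', ['info', 'governor']],
                        errors='ignore')

# sep='_' joins nested meta paths as 'info_governor' rather than 'info.governor'
result = json_normalize(data, 'counties', ['state', ['info', 'governor']],
                        errors='ignore', sep='_')
```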
+
+    Parameters
+    ----------
+    x : array or dtype
+
+    Returns
+    -------
+    t : str
+        The Table Schema data type.
+
+    Notes
+    -----
+    This table shows the relationship between NumPy / pandas dtypes
+    and Table Schema dtypes.
+
+    ===============  =================
+    Pandas type      Table Schema type
+    ===============  =================
+    int64            integer
+    float64          number
+    bool             boolean
+    datetime64[ns]   datetime
+    timedelta64[ns]  duration
+    object           string
+    categorical      any
+    ===============  =================
+    """
+    if is_integer_dtype(x):
+        return 'integer'
+    elif is_bool_dtype(x):
+        return 'boolean'
+    elif is_numeric_dtype(x):
+        return 'number'
+    elif (is_datetime64_dtype(x) or is_datetime64tz_dtype(x) or
+          is_period_dtype(x)):
+        return 'datetime'
+    elif is_timedelta64_dtype(x):
+        return 'duration'
+    elif is_categorical_dtype(x):
+        return 'any'
+    elif is_string_dtype(x):
+        return 'string'
+    else:
+        return 'any'
+
+
+def set_default_names(data):
+    """Sets index names to 'index' for regular, or 'level_x' for Multi"""
+    if all(name is not None for name in data.index.names):
+        return data
+
+    data = data.copy()
+    if data.index.nlevels > 1:
+        names = [name if name is not None else 'level_{}'.format(i)
+                 for i, name in enumerate(data.index.names)]
+        data.index.names = names
+    else:
+        data.index.name = data.index.name or 'index'
+    return data
+
+
+def make_field(arr, dtype=None):
+    dtype = dtype or arr.dtype
+    if arr.name is None:
+        name = 'values'
+    else:
+        name = arr.name
+    field = {'name': name,
+             'type': as_json_table_type(dtype)}
+
+    if is_categorical_dtype(arr):
+        if hasattr(arr, 'categories'):
+            cats = arr.categories
+            ordered = arr.ordered
+        else:
+            cats = arr.cat.categories
+            ordered = arr.cat.ordered
+        field['constraints'] = {"enum": list(cats)}
+        field['ordered'] = ordered
+    elif is_period_dtype(arr):
+        field['freq'] = arr.freqstr
+    elif is_datetime64tz_dtype(arr):
+        if hasattr(arr, 'dt'):
+            field['tz'] = arr.dt.tz.zone
+        else:
+            field['tz'] = arr.tz.zone
+    return field
+
+
+def build_table_schema(data, index=True, primary_key=None, version=True):
+    """
+    Create a Table schema from ``data``.
+
+    Parameters
+    ----------
+    data : Series, DataFrame
+    index : bool, default True
+        Whether to include ``data.index`` in the schema.
+    primary_key : bool or None, default None
+        Column names to designate as the primary key.
+        The default `None` will set `'primaryKey'` to the index
+        level or levels if the index is unique.
+    version : bool, default True
+        Whether to include a field `pandas_version` with the version
+        of pandas that generated the schema.
+
+    Returns
+    -------
+    schema : dict
+
+    Examples
+    --------
+    >>> df = pd.DataFrame(
+    ...     {'A': [1, 2, 3],
+    ...      'B': ['a', 'b', 'c'],
+    ...      'C': pd.date_range('2016-01-01', freq='d', periods=3),
+    ...      }, index=pd.Index(range(3), name='idx'))
+    >>> build_table_schema(df)
+    {'fields': [{'name': 'idx', 'type': 'integer'},
+    {'name': 'A', 'type': 'integer'},
+    {'name': 'B', 'type': 'string'},
+    {'name': 'C', 'type': 'datetime'}],
+    'pandas_version': '0.20.0',
+    'primaryKey': ['idx']}
+
+    Notes
+    -----
+    See `as_json_table_type` for conversion types.
+    Timedeltas are converted to ISO 8601 duration format, with
+    9 decimal places after the seconds field for nanosecond precision.
+
+    Categoricals are converted to the `any` dtype, and use the `enum` field
+    constraint to list the allowed values. The `ordered` attribute is included
+    in an `ordered` field.
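As a quick sanity sketch of the dtype mapping above, assuming the `pandas.io.json.table_schema` module path this patch creates:

```python
import numpy as np
import pandas as pd
from pandas.io.json.table_schema import as_json_table_type

print(as_json_table_type(np.dtype('int64')))           # 'integer'
print(as_json_table_type(np.dtype('float64')))         # 'number'
print(as_json_table_type(np.dtype('datetime64[ns]')))  # 'datetime'
print(as_json_table_type(np.dtype('O')))               # 'string'
print(as_json_table_type(pd.Categorical(['a'])))       # 'any'
```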
+ """ + if index is True: + data = set_default_names(data) + + schema = {} + fields = [] + + if index: + if data.index.nlevels > 1: + for level in data.index.levels: + fields.append(make_field(level)) + else: + fields.append(make_field(data.index)) + + if data.ndim > 1: + for column, s in data.iteritems(): + fields.append(make_field(s)) + else: + fields.append(make_field(data)) + + schema['fields'] = fields + if index and data.index.is_unique and primary_key is None: + if data.index.nlevels == 1: + schema['primaryKey'] = [data.index.name] + else: + schema['primaryKey'] = data.index.names + elif primary_key is not None: + schema['primaryKey'] = primary_key + + if version: + schema['pandas_version'] = '0.20.0' + return schema diff --git a/pandas/msgpack/__init__.py b/pandas/io/msgpack/__init__.py similarity index 79% rename from pandas/msgpack/__init__.py rename to pandas/io/msgpack/__init__.py index 0c2370df936a4..984e90ee03e69 100644 --- a/pandas/msgpack/__init__.py +++ b/pandas/io/msgpack/__init__.py @@ -1,11 +1,10 @@ # coding: utf-8 -# flake8: noqa - -from pandas.msgpack._version import version -from pandas.msgpack.exceptions import * from collections import namedtuple +from pandas.io.msgpack.exceptions import * # noqa +from pandas.io.msgpack._version import version # noqa + class ExtType(namedtuple('ExtType', 'code data')): """ExtType represents ext type in msgpack.""" @@ -18,11 +17,10 @@ def __new__(cls, code, data): raise ValueError("code must be 0~127") return super(ExtType, cls).__new__(cls, code, data) +import os # noqa -import os -from pandas.msgpack._packer import Packer -from pandas.msgpack._unpacker import unpack, unpackb, Unpacker - +from pandas.io.msgpack._packer import Packer # noqa +from pandas.io.msgpack._unpacker import unpack, unpackb, Unpacker # noqa def pack(o, stream, **kwargs): @@ -43,6 +41,7 @@ def packb(o, **kwargs): """ return Packer(**kwargs).pack(o) + # alias for compatibility to simplejson/marshal/pickle. 
load = unpack loads = unpackb diff --git a/pandas/msgpack/_packer.pyx b/pandas/io/msgpack/_packer.pyx similarity index 98% rename from pandas/msgpack/_packer.pyx rename to pandas/io/msgpack/_packer.pyx index 008dbe5541d50..ad7ce1fb2531a 100644 --- a/pandas/msgpack/_packer.pyx +++ b/pandas/io/msgpack/_packer.pyx @@ -6,11 +6,11 @@ from libc.stdlib cimport * from libc.string cimport * from libc.limits cimport * -from pandas.msgpack.exceptions import PackValueError -from pandas.msgpack import ExtType +from pandas.io.msgpack.exceptions import PackValueError +from pandas.io.msgpack import ExtType -cdef extern from "../src/msgpack/pack.h": +cdef extern from "../../src/msgpack/pack.h": struct msgpack_packer: char* buf size_t length diff --git a/pandas/msgpack/_unpacker.pyx b/pandas/io/msgpack/_unpacker.pyx similarity index 98% rename from pandas/msgpack/_unpacker.pyx rename to pandas/io/msgpack/_unpacker.pyx index 6f23a24adde6c..504bfed48df3c 100644 --- a/pandas/msgpack/_unpacker.pyx +++ b/pandas/io/msgpack/_unpacker.pyx @@ -11,12 +11,12 @@ from libc.stdlib cimport * from libc.string cimport * from libc.limits cimport * -from pandas.msgpack.exceptions import (BufferFull, OutOfData, - UnpackValueError, ExtraData) -from pandas.msgpack import ExtType +from pandas.io.msgpack.exceptions import (BufferFull, OutOfData, + UnpackValueError, ExtraData) +from pandas.io.msgpack import ExtType -cdef extern from "../src/msgpack/unpack.h": +cdef extern from "../../src/msgpack/unpack.h": ctypedef struct msgpack_user: bint use_list PyObject* object_hook diff --git a/pandas/msgpack/_version.py b/pandas/io/msgpack/_version.py similarity index 100% rename from pandas/msgpack/_version.py rename to pandas/io/msgpack/_version.py diff --git a/pandas/msgpack/exceptions.py b/pandas/io/msgpack/exceptions.py similarity index 99% rename from pandas/msgpack/exceptions.py rename to pandas/io/msgpack/exceptions.py index 40f5a8af8f583..ae0f74a6700bd 100644 --- a/pandas/msgpack/exceptions.py +++ b/pandas/io/msgpack/exceptions.py @@ -15,6 +15,7 @@ class UnpackValueError(UnpackException, ValueError): class ExtraData(ValueError): + def __init__(self, unpacked, extra): self.unpacked = unpacked self.extra = extra diff --git a/pandas/io/packers.py b/pandas/io/packers.py index 1838d9175e597..a4b454eda7472 100644 --- a/pandas/io/packers.py +++ b/pandas/io/packers.py @@ -48,23 +48,24 @@ from pandas import compat from pandas.compat import u, u_safe -from pandas.types.common import (is_categorical_dtype, is_object_dtype, - needs_i8_conversion, pandas_dtype) +from pandas.core.dtypes.common import ( + is_categorical_dtype, is_object_dtype, + needs_i8_conversion, pandas_dtype) from pandas import (Timestamp, Period, Series, DataFrame, # noqa Index, MultiIndex, Float64Index, Int64Index, Panel, RangeIndex, PeriodIndex, DatetimeIndex, NaT, - Categorical) -from pandas.tslib import NaTType -from pandas.sparse.api import SparseSeries, SparseDataFrame -from pandas.sparse.array import BlockIndex, IntIndex + Categorical, CategoricalIndex) +from pandas._libs.tslib import NaTType +from pandas.core.sparse.api import SparseSeries, SparseDataFrame +from pandas.core.sparse.array import BlockIndex, IntIndex from pandas.core.generic import NDFrame -from pandas.core.common import PerformanceWarning +from pandas.errors import PerformanceWarning from pandas.io.common import get_filepath_or_buffer from pandas.core.internals import BlockManager, make_block, _safe_reshape import pandas.core.internals as internals -from pandas.msgpack import Unpacker as _Unpacker, 
Packer as _Packer, ExtType +from pandas.io.msgpack import Unpacker as _Unpacker, Packer as _Packer, ExtType from pandas.util._move import ( BadMove as _BadMove, move_into_mutable_buffer as _move_into_mutable_buffer, @@ -217,6 +218,7 @@ def read(fh): raise ValueError('path_or_buf needs to be a string file path or file-like') + dtype_dict = {21: np.dtype('M8[ns]'), u('datetime64[ns]'): np.dtype('M8[ns]'), u('datetime64[us]'): np.dtype('M8[us]'), @@ -237,6 +239,7 @@ def dtype_for(t): return dtype_dict[t] return np.typeDict.get(t, t) + c2f_dict = {'complex': np.float64, 'complex128': np.float64, 'complex64': np.float32} @@ -571,7 +574,7 @@ def decode(obj): elif typ == u'period_index': data = unconvert(obj[u'data'], np.int64, obj.get(u'compress')) d = dict(name=obj[u'name'], freq=obj[u'freq']) - return globals()[obj[u'klass']](data, **d) + return globals()[obj[u'klass']]._from_ordinals(data, **d) elif typ == u'datetime_index': data = unconvert(obj[u'data'], np.int64, obj.get(u'compress')) d = dict(name=obj[u'name'], freq=obj[u'freq'], verify_integrity=False) @@ -587,23 +590,18 @@ def decode(obj): from_codes = globals()[obj[u'klass']].from_codes return from_codes(codes=obj[u'codes'], categories=obj[u'categories'], - ordered=obj[u'ordered'], - name=obj[u'name']) + ordered=obj[u'ordered']) elif typ == u'series': dtype = dtype_for(obj[u'dtype']) pd_dtype = pandas_dtype(dtype) - np_dtype = pandas_dtype(dtype).base index = obj[u'index'] result = globals()[obj[u'klass']](unconvert(obj[u'data'], dtype, obj[u'compress']), index=index, - dtype=np_dtype, + dtype=pd_dtype, name=obj[u'name']) - tz = getattr(pd_dtype, 'tz', None) - if tz: - result = result.dt.tz_localize('UTC').dt.tz_convert(tz) return result elif typ == u'block_manager': diff --git a/pandas/io/parsers.py b/pandas/io/parsers.py index a5943ef518622..ce8643504932f 100755 --- a/pandas/io/parsers.py +++ b/pandas/io/parsers.py @@ -15,26 +15,32 @@ from pandas import compat from pandas.compat import (range, lrange, StringIO, lzip, zip, string_types, map, u) -from pandas.types.common import (is_integer, _ensure_object, - is_list_like, is_integer_dtype, - is_float, - is_scalar) +from pandas.core.dtypes.common import ( + is_integer, _ensure_object, + is_list_like, is_integer_dtype, + is_float, is_dtype_equal, + is_object_dtype, is_string_dtype, + is_scalar, is_categorical_dtype) +from pandas.core.dtypes.missing import isnull +from pandas.core.dtypes.cast import astype_nansafe from pandas.core.index import Index, MultiIndex, RangeIndex from pandas.core.series import Series from pandas.core.frame import DataFrame +from pandas.core.categorical import Categorical +from pandas.core import algorithms from pandas.core.common import AbstractMethodError -from pandas.core.config import get_option from pandas.io.date_converters import generic_parser +from pandas.errors import ParserWarning, ParserError, EmptyDataError from pandas.io.common import (get_filepath_or_buffer, _validate_header_arg, _get_handle, UnicodeReader, UTF8Recoder, - BaseIterator, CParserError, EmptyDataError, - ParserWarning, _NA_VALUES) -from pandas.tseries import tools + BaseIterator, + _NA_VALUES, _infer_compression) +from pandas.core.tools import datetimes as tools -from pandas.util.decorators import Appender +from pandas.util._decorators import Appender -import pandas.lib as lib -import pandas.parser as _parser +import pandas._libs.lib as lib +import pandas._libs.parsers as parsers # BOM character (byte order mark) @@ -86,13 +92,18 @@ MultiIndex is used. 
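The `index_col` behavior described here can be sketched as follows (the inline CSV is an assumed example):

```python
import pandas as pd
from pandas.compat import StringIO

data = 'a,b,c\n1,2,3\n4,5,6'

# a sequence of labels builds a MultiIndex from those columns
df = pd.read_csv(StringIO(data), index_col=['a', 'b'])
print(df.index.nlevels)  # 2
```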
If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to _not_ use the first column as the index (row names) -usecols : array-like, default None - Return a subset of the columns. All elements in this array must either +usecols : array-like or callable, default None + Return a subset of the columns. If array-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in `names` or - inferred from the document header row(s). For example, a valid `usecols` - parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter - results in much faster parsing time and lower memory usage. + inferred from the document header row(s). For example, a valid array-like + `usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. + + If callable, the callable function will be evaluated against the column + names, returning names where the callable function evaluates to True. An + example of a valid callable argument would be ``lambda x: x.upper() in + ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster + parsing time and lower memory usage. as_recarray : boolean, default False DEPRECATED: this argument will be removed in a future version. Please call `pd.read_csv(...).to_records()` instead. @@ -111,8 +122,9 @@ are duplicate names in the columns. dtype : Type name or dict of column -> type, default None Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} - (Unsupported with engine='python'). Use `str` or `object` to preserve and - not interpret dtype. + Use `str` or `object` to preserve and not interpret dtype. + If converters are specified, they will be applied INSTEAD + of dtype conversion. %s converters : dict, default None Dict of functions for converting values in certain columns. Keys can either @@ -123,9 +135,13 @@ Values to consider as False skipinitialspace : boolean, default False Skip spaces after delimiter. -skiprows : list-like or integer, default None +skiprows : list-like or integer or callable, default None Line numbers to skip (0-indexed) or number of lines to skip (int) - at the start of the file + at the start of the file. + + If callable, the callable function will be evaluated against the row + indices, returning True if the row should be skipped and False otherwise. + An example of a valid callable argument would be ``lambda x: x in [0, 2]``. skipfooter : int, default 0 Number of lines at bottom of file to skip (Unsupported with engine='c') skip_footer : int, default 0 @@ -135,7 +151,8 @@ na_values : scalar, str, list-like, or dict, default None Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as - NaN: '""" + fill("', '".join(sorted(_NA_VALUES)), 70) + """'`. + NaN: '""" + fill("', '".join(sorted(_NA_VALUES)), + 70, subsequent_indent=" ") + """'`. keep_default_na : bool, default True If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they're appended to. @@ -154,16 +171,20 @@ * list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column. * list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as - a single date column. + a single date column. * dict, e.g. 
{'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo' + If a column or index contains an unparseable date, the entire column or + index will be returned unaltered as an object data type. For non-standard + datetime parsing, use ``pd.to_datetime`` after ``pd.read_csv`` + Note: A fast-path exists for iso8601-formatted dates. infer_datetime_format : boolean, default False If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the - parsing speed by ~5-10x. + parsing speed by 5-10x. keep_date_col : boolean, default False If True and parse_dates specifies combining multiple columns then keep the original columns. @@ -182,10 +203,10 @@ Return TextFileReader object for iteration or getting chunks with ``get_chunk()``. chunksize : int, default None - Return TextFileReader object for iteration. `See IO Tools docs for more - information - `_ on - ``iterator`` and ``chunksize``. + Return TextFileReader object for iteration. + See the `IO Tools docs + `_ + for more information on ``iterator`` and ``chunksize``. compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer' For on-the-fly decompression of on-disk data. If 'infer', then use gzip, bz2, zip or xz if filepath_or_buffer is a string ending in '.gz', '.bz2', @@ -231,8 +252,11 @@ standard encodings `_ dialect : str or csv.Dialect instance, default None - If None defaults to Excel dialect. Ignored if sep longer than 1 char - See csv.Dialect documentation for more details + If provided, this parameter will override values (default or not) for the + following parameters: `delimiter`, `doublequote`, `escapechar`, + `skipinitialspace`, `quotechar`, and `quoting`. If it is necessary to + override values, a ParserWarning will be issued. See csv.Dialect + documentation for more details. tupleize_cols : boolean, default False Leave a list of tuples on columns as is (default is to convert to a Multi Index on the columns) @@ -240,10 +264,10 @@ Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these "bad lines" will dropped from the DataFrame that is - returned. (Only valid with C parser) + returned. warn_bad_lines : boolean, default True If error_bad_lines is False, and warn_bad_lines is True, a warning for each - "bad line" will be output. (Only valid with C parser). + "bad line" will be output. low_memory : boolean, default True Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed @@ -283,10 +307,12 @@ currently more feature-complete.""" _sep_doc = r"""sep : str, default {default} - Delimiter to use. If sep is None, will try to automatically determine - this. Separators longer than 1 character and different from ``'\s+'`` will - be interpreted as regular expressions, will force use of the python parsing - engine and will ignore quotes in the data. Regex example: ``'\r\t'``""" + Delimiter to use. If sep is None, the C engine cannot automatically detect + the separator, but the Python parsing engine can, meaning the latter will + be used automatically. In addition, separators longer than 1 character and + different from ``'\s+'`` will be interpreted as regular expressions and + will also force the use of the Python parsing engine. 
Note that regex + delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``""" _read_csv_doc = """ Read CSV (comma-separated) file into DataFrame @@ -307,7 +333,7 @@ fields of each line as half-open intervals (i.e., [from, to[ ). String value 'infer' can be used to instruct the parser to try detecting the column specifications from the first 100 rows of - the data (default='infer'). + the data which are not being skipped via skiprows (default='infer'). widths : list of ints. optional A list of field widths which can be used instead of 'colspecs' if the intervals are contiguous. @@ -323,58 +349,48 @@ """ % (_parser_params % (_fwf_widths, '')) -def _validate_nrows(nrows): +def _validate_integer(name, val, min_val=0): """ - Checks whether the 'nrows' parameter for parsing is either + Checks whether the 'name' parameter for parsing is either an integer OR float that can SAFELY be cast to an integer without losing accuracy. Raises a ValueError if that is not the case. + + Parameters + ---------- + name : string + Parameter name (used for error reporting) + val : int or float + The value to check + min_val : int + Minimum allowed value (val < min_val will result in a ValueError) """ - msg = "'nrows' must be an integer" + msg = "'{name:s}' must be an integer >={min_val:d}".format(name=name, + min_val=min_val) - if nrows is not None: - if is_float(nrows): - if int(nrows) != nrows: + if val is not None: + if is_float(val): + if int(val) != val: raise ValueError(msg) - nrows = int(nrows) - elif not is_integer(nrows): + val = int(val) + elif not (is_integer(val) and val >= min_val): raise ValueError(msg) - return nrows + return val def _read(filepath_or_buffer, kwds): - "Generic reader of line files." + """Generic reader of line files.""" encoding = kwds.get('encoding', None) if encoding is not None: encoding = re.sub('_', '-', encoding).lower() kwds['encoding'] = encoding - # If the input could be a filename, check for a recognizable compression - # extension. If we're reading from a URL, the `get_filepath_or_buffer` - # will use header info to determine compression, so use what it finds in - # that case. - inferred_compression = kwds.get('compression') - if inferred_compression == 'infer': - if isinstance(filepath_or_buffer, compat.string_types): - if filepath_or_buffer.endswith('.gz'): - inferred_compression = 'gzip' - elif filepath_or_buffer.endswith('.bz2'): - inferred_compression = 'bz2' - elif filepath_or_buffer.endswith('.zip'): - inferred_compression = 'zip' - elif filepath_or_buffer.endswith('.xz'): - inferred_compression = 'xz' - else: - inferred_compression = None - else: - inferred_compression = None - + compression = kwds.get('compression') + compression = _infer_compression(filepath_or_buffer, compression) filepath_or_buffer, _, compression = get_filepath_or_buffer( - filepath_or_buffer, encoding, - compression=kwds.get('compression', None)) - kwds['compression'] = (inferred_compression if compression == 'infer' - else compression) + filepath_or_buffer, encoding, compression) + kwds['compression'] = compression if kwds.get('date_parser', None) is not None: if isinstance(kwds['parse_dates'], bool): @@ -382,26 +398,22 @@ def _read(filepath_or_buffer, kwds): # Extract some of the arguments (pass chunksize on). 
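A sketch of what the shared `_validate_integer` helper above accepts: floats pass only when they cast to int without loss, and values below `min_val` are rejected.

```python
from pandas.io.parsers import _validate_integer

print(_validate_integer('nrows', 5))                 # 5
print(_validate_integer('nrows', 5.0))               # 5 (lossless float cast)
print(_validate_integer('chunksize', 1, min_val=1))  # 1

try:
    _validate_integer('nrows', 5.5)
except ValueError as err:
    print(err)  # 'nrows' must be an integer >=0
```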
iterator = kwds.get('iterator', False) - chunksize = kwds.get('chunksize', None) - nrows = _validate_nrows(kwds.pop('nrows', None)) + chunksize = _validate_integer('chunksize', kwds.get('chunksize', None), 1) + nrows = _validate_integer('nrows', kwds.get('nrows', None)) # Create the parser. parser = TextFileReader(filepath_or_buffer, **kwds) - if (nrows is not None) and (chunksize is not None): - raise NotImplementedError("'nrows' and 'chunksize' cannot be used" - " together yet.") - elif nrows is not None: - data = parser.read(nrows) - parser.close() - return data - elif chunksize or iterator: + if chunksize or iterator: return parser - data = parser.read() - parser.close() + try: + data = parser.read(nrows) + finally: + parser.close() return data + _parser_defaults = { 'delimiter': None, @@ -421,6 +433,7 @@ def _read(filepath_or_buffer, kwds): 'true_values': None, 'false_values': None, 'converters': None, + 'dtype': None, 'skipfooter': 0, 'keep_default_na': True, @@ -436,7 +449,7 @@ def _read(filepath_or_buffer, kwds): 'usecols': None, - # 'nrows': None, + 'nrows': None, # 'iterator': False, 'chunksize': None, 'verbose': False, @@ -461,7 +474,6 @@ def _read(filepath_or_buffer, kwds): 'buffer_lines': None, 'error_bad_lines': True, 'warn_bad_lines': True, - 'dtype': None, 'float_precision': None } @@ -474,9 +486,6 @@ def _read(filepath_or_buffer, kwds): _python_unsupported = set([ 'low_memory', 'buffer_lines', - 'error_bad_lines', - 'warn_bad_lines', - 'dtype', 'float_precision', ]) _deprecated_args = set([ @@ -649,6 +658,7 @@ def parser_f(filepath_or_buffer, return parser_f + read_csv = _make_parser_function('read_csv', sep=',') read_csv = Appender(_read_csv_doc)(read_csv) @@ -700,12 +710,33 @@ def __init__(self, f, engine=None, **kwds): dialect = kwds['dialect'] if dialect in csv.list_dialects(): dialect = csv.get_dialect(dialect) - kwds['delimiter'] = dialect.delimiter - kwds['doublequote'] = dialect.doublequote - kwds['escapechar'] = dialect.escapechar - kwds['skipinitialspace'] = dialect.skipinitialspace - kwds['quotechar'] = dialect.quotechar - kwds['quoting'] = dialect.quoting + + # Any valid dialect should have these attributes. + # If any are missing, we will raise automatically. + for param in ('delimiter', 'doublequote', 'escapechar', + 'skipinitialspace', 'quotechar', 'quoting'): + try: + dialect_val = getattr(dialect, param) + except AttributeError: + raise ValueError("Invalid dialect '{dialect}' provided" + .format(dialect=kwds['dialect'])) + provided = kwds.get(param, _parser_defaults[param]) + + # Messages for conflicting values between the dialect instance + # and the actual parameters provided. + conflict_msgs = [] + + if dialect_val != provided: + conflict_msgs.append(( + "Conflicting values for '{param}': '{val}' was " + "provided, but the dialect specifies '{diaval}'. 
" + "Using the dialect-specified value.".format( + param=param, val=provided, diaval=dialect_val))) + + if conflict_msgs: + warnings.warn('\n\n'.join(conflict_msgs), ParserWarning, + stacklevel=2) + kwds[param] = dialect_val if kwds.get('header', 'infer') == 'infer': kwds['header'] = 0 if kwds.get('names') is None else None @@ -720,6 +751,7 @@ def __init__(self, f, engine=None, **kwds): options = self._get_options_with_defaults(engine) self.chunksize = options.pop('chunksize', None) + self.nrows = options.pop('nrows', None) self.squeeze = options.pop('squeeze', False) # might mutate self.engine @@ -819,6 +851,17 @@ def _clean_options(self, options, engine): encoding=encoding) engine = 'python' + quotechar = options['quotechar'] + if (quotechar is not None and + isinstance(quotechar, (str, compat.text_type, bytes))): + if (len(quotechar) == 1 and ord(quotechar) > 127 and + engine not in ('python', 'python-fwf')): + fallback_reason = ("ord(quotechar) > 127, meaning the " + "quotechar is larger than one byte, " + "and the 'c' engine does not support " + "such quotechars") + engine = 'python' + if fallback_reason and engine_specified: raise ValueError(fallback_reason) @@ -834,9 +877,6 @@ def _clean_options(self, options, engine): " ignored as it is not supported by the 'python'" " engine.").format(reason=fallback_reason, option=arg) - if arg == 'dtype': - msg += " (Note the 'converters' option provides"\ - " similar functionality.)" raise ValueError(msg) del result[arg] @@ -900,7 +940,10 @@ def _clean_options(self, options, engine): if engine != 'c': if is_integer(skiprows): skiprows = lrange(skiprows) - skiprows = set() if skiprows is None else set(skiprows) + if skiprows is None: + skiprows = set() + elif not callable(skiprows): + skiprows = set(skiprows) # put stuff back result['names'] = names @@ -969,6 +1012,10 @@ def _create_index(self, ret): def get_chunk(self, size=None): if size is None: size = self.chunksize + if self.nrows is not None: + if self._currow >= self.nrows: + raise StopIteration + size = min(size, self.nrows - self._currow) return self.read(nrows=size) @@ -976,24 +1023,89 @@ def _is_index_col(col): return col is not None and col is not False -def _validate_usecols_arg(usecols): +def _evaluate_usecols(usecols, names): """ Check whether or not the 'usecols' parameter - contains all integers (column selection by index) - or strings (column by name). Raises a ValueError - if that is not the case. + is a callable. If so, enumerates the 'names' + parameter and returns a set of indices for + each entry in 'names' that evaluates to True. + If not a callable, returns 'usecols'. + """ + if callable(usecols): + return set([i for i, name in enumerate(names) + if usecols(name)]) + return usecols + + +def _validate_skipfooter_arg(skipfooter): """ - msg = ("The elements of 'usecols' must " - "either be all strings, all unicode, or all integers") + Validate the 'skipfooter' parameter. + + Checks whether 'skipfooter' is a non-negative integer. + Raises a ValueError if that is not the case. + + Parameters + ---------- + skipfooter : non-negative integer + The number of rows to skip at the end of the file. + + Returns + ------- + validated_skipfooter : non-negative integer + The original input if the validation succeeds. + + Raises + ------ + ValueError : 'skipfooter' was not a non-negative integer. 
+ """ + + if not is_integer(skipfooter): + raise ValueError("skipfooter must be an integer") + + if skipfooter < 0: + raise ValueError("skipfooter cannot be negative") + + return skipfooter + + +def _validate_usecols_arg(usecols): + """ + Validate the 'usecols' parameter. + + Checks whether or not the 'usecols' parameter contains all integers + (column selection by index), strings (column by name) or is a callable. + Raises a ValueError if that is not the case. + + Parameters + ---------- + usecols : array-like, callable, or None + List of columns to use when parsing or a callable that can be used + to filter a list of table columns. + + Returns + ------- + usecols_tuple : tuple + A tuple of (verified_usecols, usecols_dtype). + + 'verified_usecols' is either a set if an array-like is passed in or + 'usecols' if a callable or None is passed in. + + 'usecols_dtype` is the inferred dtype of 'usecols' if an array-like + is passed in or None if a callable or None is passed in. + """ + msg = ("'usecols' must either be all strings, all unicode, " + "all integers or a callable") if usecols is not None: + if callable(usecols): + return usecols, None usecols_dtype = lib.infer_dtype(usecols) if usecols_dtype not in ('empty', 'integer', 'string', 'unicode'): raise ValueError(msg) - return set(usecols) - return usecols + return set(usecols), usecols_dtype + return usecols, None def _validate_parse_dates_arg(parse_dates): @@ -1095,13 +1207,18 @@ def _should_parse_dates(self, i): if isinstance(self.parse_dates, bool): return self.parse_dates else: - name = self.index_names[i] + if self.index_names is not None: + name = self.index_names[i] + else: + name = None j = self.index_col[i] if is_scalar(self.parse_dates): - return (j == self.parse_dates) or (name == self.parse_dates) + return ((j == self.parse_dates) or + (name is not None and name == self.parse_dates)) else: - return (j in self.parse_dates) or (name in self.parse_dates) + return ((j in self.parse_dates) or + (name is not None and name in self.parse_dates)) def _extract_multi_indexer_columns(self, header, index_names, col_names, passed_names=False): @@ -1142,7 +1259,7 @@ def tostr(x): # long for n in range(len(columns[0])): if all(['Unnamed' in tostr(c[n]) for c in columns]): - raise CParserError( + raise ParserError( "Passed header=[%s] are too many rows for this " "multi_index of columns" % ','.join([str(x) for x in self.header]) @@ -1271,6 +1388,7 @@ def _get_name(icol): def _agg_index(self, index, try_parse_dates=True): arrays = [] + for i, arr in enumerate(index): if (try_parse_dates and self._should_parse_dates(i)): @@ -1285,7 +1403,7 @@ def _agg_index(self, index, try_parse_dates=True): col_na_values, col_na_fvalues = _get_na_values( col_name, self.na_values, self.na_fvalues) - arr, _ = self._convert_types(arr, col_na_values | col_na_fvalues) + arr, _ = self._infer_types(arr, col_na_values | col_na_fvalues) arrays.append(arr) index = MultiIndex.from_arrays(arrays, names=self.index_names) @@ -1293,10 +1411,15 @@ def _agg_index(self, index, try_parse_dates=True): return index def _convert_to_ndarrays(self, dct, na_values, na_fvalues, verbose=False, - converters=None): + converters=None, dtypes=None): result = {} for c, values in compat.iteritems(dct): conv_f = None if converters is None else converters.get(c, None) + if isinstance(dtypes, dict): + cast_type = dtypes.get(c, None) + else: + # single dtype or None + cast_type = dtypes if self.na_filter: col_na_values, col_na_fvalues = _get_na_values( @@ -1304,21 +1427,40 @@ def 
_convert_to_ndarrays(self, dct, na_values, na_fvalues, verbose=False, else: col_na_values, col_na_fvalues = set(), set() - coerce_type = True if conv_f is not None: + # conv_f applied to data before inference + if cast_type is not None: + warnings.warn(("Both a converter and dtype were specified " + "for column {0} - only the converter will " + "be used").format(c), ParserWarning, + stacklevel=7) + try: values = lib.map_infer(values, conv_f) except ValueError: - mask = lib.ismember(values, na_values).view(np.uint8) + mask = algorithms.isin( + values, list(na_values)).view(np.uint8) values = lib.map_infer_mask(values, conv_f, mask) - coerce_type = False - cvals, na_count = self._convert_types( - values, set(col_na_values) | col_na_fvalues, coerce_type) + cvals, na_count = self._infer_types( + values, set(col_na_values) | col_na_fvalues, + try_num_bool=False) + else: + # skip inference if specified dtype is object + try_num_bool = not (cast_type and is_string_dtype(cast_type)) + + # general type inference and conversion + cvals, na_count = self._infer_types( + values, set(col_na_values) | col_na_fvalues, + try_num_bool) + + # type specificed in dtype param + if cast_type and not is_dtype_equal(cvals, cast_type): + cvals = self._cast_types(cvals, cast_type, c) if issubclass(cvals.dtype.type, np.integer) and self.compact_ints: cvals = lib.downcast_int64( - cvals, _parser.na_values, + cvals, parsers.na_values, self.use_unsigned) result[c] = cvals @@ -1326,10 +1468,26 @@ def _convert_to_ndarrays(self, dct, na_values, na_fvalues, verbose=False, print('Filled %d NA values in column %s' % (na_count, str(c))) return result - def _convert_types(self, values, na_values, try_num_bool=True): + def _infer_types(self, values, na_values, try_num_bool=True): + """ + Infer types of values, possibly casting + + Parameters + ---------- + values : ndarray + na_values : set + try_num_bool : bool, default try + try to cast values to numeric (first preference) or boolean + + Returns: + -------- + converted : ndarray + na_count : int + """ + na_count = 0 if issubclass(values.dtype.type, (np.number, np.bool_)): - mask = lib.ismember(values, na_values) + mask = algorithms.isin(values, list(na_values)) na_count = mask.sum() if na_count > 0: if is_integer_dtype(values): @@ -1340,6 +1498,7 @@ def _convert_types(self, values, na_values, try_num_bool=True): if try_num_bool: try: result = lib.maybe_convert_numeric(values, na_values, False) + na_count = isnull(result).sum() except Exception: result = values if values.dtype == np.object_: @@ -1356,8 +1515,41 @@ def _convert_types(self, values, na_values, try_num_bool=True): return result, na_count + def _cast_types(self, values, cast_type, column): + """ + Cast values to specified type + + Parameters + ---------- + values : ndarray + cast_type : string or np.dtype + dtype to cast values to + column : string + column name - used only for error reporting + + Returns + ------- + converted : ndarray + """ + + if is_categorical_dtype(cast_type): + # XXX this is for consistency with + # c-parser which parses all categories + # as strings + if not is_object_dtype(values): + values = astype_nansafe(values, str) + values = Categorical(values) + else: + try: + values = astype_nansafe(values, cast_type, copy=True) + except ValueError: + raise ValueError("Unable to convert column %s to " + "type %s" % (column, cast_type)) + return values + def _do_date_conversions(self, names, data): # returns data, columns + if self.parse_dates is not None: data, names = _process_date_conversion( data, 
self._date_conv, self.parse_dates, self.index_col, @@ -1387,10 +1579,11 @@ def __init__(self, src, **kwds): # #2442 kwds['allow_leading_cols'] = self.index_col is not False - self._reader = _parser.TextReader(src, **kwds) + self._reader = parsers.TextReader(src, **kwds) # XXX - self.usecols = _validate_usecols_arg(self._reader.usecols) + self.usecols, self.usecols_dtype = _validate_usecols_arg( + self._reader.usecols) passed_names = self.names is None @@ -1426,11 +1619,12 @@ def __init__(self, src, **kwds): self.orig_names = self.names[:] if self.usecols: - if len(self.names) > len(self.usecols): + usecols = _evaluate_usecols(self.usecols, self.orig_names) + if len(self.names) > len(usecols): self.names = [n for i, n in enumerate(self.names) - if (i in self.usecols or n in self.usecols)] + if (i in usecols or n in usecols)] - if len(self.names) < len(self.usecols): + if len(self.names) < len(usecols): raise ValueError("Usecols do not match names.") self._set_noconvert_columns() @@ -1465,12 +1659,29 @@ def close(self): pass def _set_noconvert_columns(self): + """ + Set the columns that should not undergo dtype conversions. + + Currently, any column that is involved with date parsing will not + undergo such conversions. + """ names = self.orig_names - usecols = self.usecols + if self.usecols_dtype == 'integer': + # A set of integers will be converted to a list in + # the correct order every single time. + usecols = list(self.usecols) + elif (callable(self.usecols) or + self.usecols_dtype not in ('empty', None)): + # The names attribute should have the correct columns + # in the proper order for indexing with parse_dates. + usecols = self.names[:] + else: + # Usecols is empty. + usecols = None def _set(x): - if usecols and is_integer(x): - x = list(usecols)[x] + if usecols is not None and is_integer(x): + x = usecols[x] if not is_integer(x): x = names.index(x) @@ -1592,9 +1803,10 @@ def read(self, nrows=None): def _filter_usecols(self, names): # hackish - if self.usecols is not None and len(names) != len(self.usecols): + usecols = _evaluate_usecols(self.usecols, names) + if usecols is not None and len(names) != len(usecols): names = [name for i, name in enumerate(names) - if i in self.usecols or name in self.usecols] + if i in usecols or name in usecols] return names def _get_index_names(self): @@ -1675,70 +1887,6 @@ def count_empty_vals(vals): return sum([1 for v in vals if v == '' or v is None]) -def _wrap_compressed(f, compression, encoding=None): - """wraps compressed fileobject in a decompressing fileobject - NOTE: For all files in Python 3.2 and for bzip'd files under all Python - versions, this means reading in the entire file and then re-wrapping it in - StringIO. 
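A usage sketch of the casting path added in `_cast_types`, assuming an inline CSV; note the categorical branch parses all categories as strings for consistency with the C parser:

```python
import pandas as pd
from pandas.compat import StringIO

data = 'col1,col2\na,1\nb,2\na,3'

# dtype is now honored by the Python engine too, including 'category'
df = pd.read_csv(StringIO(data), dtype={'col1': 'category'},
                 engine='python')
print(df['col1'].dtype)  # category
```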
- """ - compression = compression.lower() - encoding = encoding or get_option('display.encoding') - - if compression == 'gzip': - import gzip - - f = gzip.GzipFile(fileobj=f) - if compat.PY3: - from io import TextIOWrapper - - f = TextIOWrapper(f) - return f - elif compression == 'bz2': - import bz2 - - if compat.PY3: - f = bz2.open(f, 'rt', encoding=encoding) - else: - # Python 2's bz2 module can't take file objects, so have to - # run through decompress manually - data = bz2.decompress(f.read()) - f = StringIO(data) - return f - elif compression == 'zip': - import zipfile - zip_file = zipfile.ZipFile(f) - zip_names = zip_file.namelist() - - if len(zip_names) == 1: - file_name = zip_names.pop() - f = zip_file.open(file_name) - return f - - elif len(zip_names) == 0: - raise ValueError('Corrupted or zero files found in compressed ' - 'zip file %s', zip_file.filename) - - else: - raise ValueError('Multiple files found in compressed ' - 'zip file %s', str(zip_names)) - - elif compression == 'xz': - - lzma = compat.import_lzma() - f = lzma.LZMAFile(f) - - if compat.PY3: - from io import TextIOWrapper - - f = TextIOWrapper(f) - - return f - - else: - raise ValueError('do not recognize compression method %s' - % compression) - - class PythonParser(ParserBase): def __init__(self, f, **kwds): @@ -1759,7 +1907,12 @@ def __init__(self, f, **kwds): self.memory_map = kwds['memory_map'] self.skiprows = kwds['skiprows'] - self.skipfooter = kwds['skipfooter'] + if callable(self.skiprows): + self.skipfunc = self.skiprows + else: + self.skipfunc = lambda x: x in self.skiprows + + self.skipfooter = _validate_skipfooter_arg(kwds['skipfooter']) self.delimiter = kwds['delimiter'] self.quotechar = kwds['quotechar'] @@ -1771,9 +1924,12 @@ def __init__(self, f, **kwds): self.skipinitialspace = kwds['skipinitialspace'] self.lineterminator = kwds['lineterminator'] self.quoting = kwds['quoting'] - self.usecols = _validate_usecols_arg(kwds['usecols']) + self.usecols, _ = _validate_usecols_arg(kwds['usecols']) self.skip_blank_lines = kwds['skip_blank_lines'] + self.warn_bad_lines = kwds['warn_bad_lines'] + self.error_bad_lines = kwds['error_bad_lines'] + self.names_passed = kwds['names'] or None self.na_filter = kwds['na_filter'] @@ -1784,6 +1940,7 @@ def __init__(self, f, **kwds): self.verbose = kwds['verbose'] self.converters = kwds['converters'] + self.dtype = kwds['dtype'] self.compact_ints = kwds['compact_ints'] self.use_unsigned = kwds['use_unsigned'] @@ -1793,20 +1950,10 @@ def __init__(self, f, **kwds): self.comment = kwds['comment'] self._comment_lines = [] - if isinstance(f, compat.string_types): - f = _get_handle(f, 'r', encoding=self.encoding, - compression=self.compression, - memory_map=self.memory_map) - self.handles.append(f) - elif self.compression: - f = _wrap_compressed(f, self.compression, self.encoding) - self.handles.append(f) - # in Python 3, convert BytesIO or fileobjects passed with an encoding - elif compat.PY3 and isinstance(f, compat.BytesIO): - from io import TextIOWrapper - - f = TextIOWrapper(f, encoding=self.encoding) - self.handles.append(f) + f, handles = _get_handle(f, 'r', encoding=self.encoding, + compression=self.compression, + memory_map=self.memory_map) + self.handles.extend(handles) # Set self.data to something that can read lines. 
if hasattr(f, 'readline'): @@ -1923,7 +2070,7 @@ class MyDialect(csv.Dialect): # attempt to sniff the delimiter if sniff_sep: line = f.readline() - while self.pos in self.skiprows: + while self.skipfunc(self.pos): self.pos += 1 line = f.readline() @@ -1982,7 +2129,7 @@ def read(self, rows=None): # DataFrame with the right metadata, even though it's length 0 names = self._maybe_dedup_names(self.orig_names) index, columns, col_dict = _get_empty_meta( - names, self.index_col, self.index_names) + names, self.index_col, self.index_names, self.dtype) columns = self._maybe_make_multi_index_columns( columns, self.col_names) return index, columns, col_dict @@ -2033,12 +2180,21 @@ def get_chunk(self, size=None): def _convert_data(self, data): # apply converters - clean_conv = {} + def _clean_mapping(mapping): + "converts col numbers to names" + clean = {} + for col, v in compat.iteritems(mapping): + if isinstance(col, int) and col not in self.orig_names: + col = self.orig_names[col] + clean[col] = v + return clean - for col, f in compat.iteritems(self.converters): - if isinstance(col, int) and col not in self.orig_names: - col = self.orig_names[col] - clean_conv[col] = f + clean_conv = _clean_mapping(self.converters) + if not isinstance(self.dtype, dict): + # handles single dtype applied to all columns + clean_dtypes = self.dtype + else: + clean_dtypes = _clean_mapping(self.dtype) # Apply NA values. clean_na_values = {} @@ -2060,7 +2216,7 @@ def _convert_data(self, data): return self._convert_to_ndarrays(data, clean_na_values, clean_na_fvalues, self.verbose, - clean_conv) + clean_conv, clean_dtypes) def _to_recarray(self, data, columns): dtypes = [] @@ -2203,11 +2359,12 @@ def _infer_columns(self): columns = [lrange(ncols)] columns = self._handle_usecols(columns, columns[0]) else: - if self.usecols is None or len(names) == num_original_columns: + if self.usecols is None or len(names) >= num_original_columns: columns = self._handle_usecols([names], names) num_original_columns = len(names) else: - if self.usecols and len(names) != len(self.usecols): + if (not callable(self.usecols) and + len(names) != len(self.usecols)): raise ValueError( 'Number of passed names did not match number of ' 'header fields in the file' @@ -2226,16 +2383,18 @@ def _handle_usecols(self, columns, usecols_key): usecols_key is used if there are string usecols. """ if self.usecols is not None: - if any([isinstance(u, string_types) for u in self.usecols]): + if callable(self.usecols): + col_indices = _evaluate_usecols(self.usecols, usecols_key) + elif any([isinstance(u, string_types) for u in self.usecols]): if len(columns) > 1: raise ValueError("If using multiple headers, usecols must " "be integers.") col_indices = [] - for u in self.usecols: - if isinstance(u, string_types): - col_indices.append(usecols_key.index(u)) + for col in self.usecols: + if isinstance(col, string_types): + col_indices.append(usecols_key.index(col)) else: - col_indices.append(u) + col_indices.append(col) else: col_indices = self.usecols @@ -2314,12 +2473,24 @@ def _check_for_bom(self, first_row): # return an empty string. return [""] - def _empty(self, line): + def _is_line_empty(self, line): + """ + Check if a line is empty or not. + + Parameters + ---------- + line : str, array-like + The line of data to check. + + Returns + ------- + boolean : Whether or not the line is empty. 
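A sketch of the column-key normalization done by `_clean_mapping` above: both `converters` and the newly supported `dtype` may be keyed by position or by name (the inline data is assumed):

```python
import numpy as np
import pandas as pd
from pandas.compat import StringIO

data = 'x,y\n1,2\n3,4'

# dtype keyed by position, converters keyed by name; both are resolved
# to column names before conversion
df = pd.read_csv(StringIO(data), engine='python',
                 dtype={0: np.float64},
                 converters={'y': lambda v: int(v) * 10})
print(df['x'].dtype)     # float64
print(df['y'].tolist())  # [20, 40]
```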
+ """ return not line or all(not x for x in line) def _next_line(self): if isinstance(self.data, list): - while self.pos in self.skiprows: + while self.skipfunc(self.pos): self.pos += 1 while True: @@ -2327,51 +2498,36 @@ def _next_line(self): line = self._check_comments([self.data[self.pos]])[0] self.pos += 1 # either uncommented or blank to begin with - if not self.skip_blank_lines and (self._empty(self.data[ - self.pos - 1]) or line): + if (not self.skip_blank_lines and + (self._is_line_empty( + self.data[self.pos - 1]) or line)): break elif self.skip_blank_lines: - ret = self._check_empty([line]) + ret = self._remove_empty_lines([line]) if ret: line = ret[0] break except IndexError: raise StopIteration else: - while self.pos in self.skiprows: + while self.skipfunc(self.pos): self.pos += 1 next(self.data) while True: - try: - orig_line = next(self.data) - except csv.Error as e: - msg = str(e) - - if 'NULL byte' in str(e): - msg = ('NULL byte detected. This byte ' - 'cannot be processed in Python\'s ' - 'native csv library at the moment, ' - 'so please pass in engine=\'c\' instead') - - if self.skipfooter > 0: - reason = ('Error could possibly be due to ' - 'parsing errors in the skipped footer rows ' - '(the skipfooter keyword is only applied ' - 'after Python\'s csv library has parsed ' - 'all rows).') - msg += '. ' + reason - - raise csv.Error(msg) - line = self._check_comments([orig_line])[0] + orig_line = self._next_iter_line(row_num=self.pos + 1) self.pos += 1 - if (not self.skip_blank_lines and - (self._empty(orig_line) or line)): - break - elif self.skip_blank_lines: - ret = self._check_empty([line]) - if ret: - line = ret[0] + + if orig_line is not None: + line = self._check_comments([orig_line])[0] + + if self.skip_blank_lines: + ret = self._remove_empty_lines([line]) + + if ret: + line = ret[0] + break + elif self._is_line_empty(orig_line) or line: break # This was the first line of the file, @@ -2384,6 +2540,66 @@ def _next_line(self): self.buf.append(line) return line + def _alert_malformed(self, msg, row_num): + """ + Alert a user about a malformed row. + + If `self.error_bad_lines` is True, the alert will be `ParserError`. + If `self.warn_bad_lines` is True, the alert will be printed out. + + Parameters + ---------- + msg : The error message to display. + row_num : The row number where the parsing error occurred. + Because this row number is displayed, we 1-index, + even though we 0-index internally. + """ + + if self.error_bad_lines: + raise ParserError(msg) + elif self.warn_bad_lines: + base = 'Skipping line {row_num}: '.format(row_num=row_num) + sys.stderr.write(base + msg + '\n') + + def _next_iter_line(self, row_num): + """ + Wrapper around iterating through `self.data` (CSV source). + + When a CSV error is raised, we check for specific + error messages that allow us to customize the + error message displayed to the user. + + Parameters + ---------- + row_num : The row number of the line being parsed. + """ + + try: + return next(self.data) + except csv.Error as e: + if self.warn_bad_lines or self.error_bad_lines: + msg = str(e) + + if 'NULL byte' in msg: + msg = ('NULL byte detected. 
This byte ' + 'cannot be processed in Python\'s ' + 'native csv library at the moment, ' + 'so please pass in engine=\'c\' instead') + elif 'newline inside string' in msg: + msg = ('EOF inside string starting with ' + 'line ' + str(row_num)) + + if self.skipfooter > 0: + reason = ('Error could possibly be due to ' + 'parsing errors in the skipped footer rows ' + '(the skipfooter keyword is only applied ' + 'after Python\'s csv library has parsed ' + 'all rows).') + msg += '. ' + reason + + self._alert_malformed(msg, row_num) + return None + def _check_comments(self, lines): if self.comment is None: return lines @@ -2402,7 +2618,22 @@ def _check_comments(self, lines): ret.append(rl) return ret - def _check_empty(self, lines): + def _remove_empty_lines(self, lines): + """ + Iterate through the lines and remove any that are + either empty or contain only one whitespace value + + Parameters + ---------- + lines : array-like + The array of lines that we are to filter. + + Returns + ------- + filtered_lines : array-like + The same array of lines with the "empty" ones removed. + """ + ret = [] for l in lines: # Remove empty lines and lines with only one whitespace value @@ -2518,37 +2749,49 @@ def _rows_to_cols(self, content): if self._implicit_index: col_len += len(self.index_col) - # see gh-13320 - zipped_content = list(lib.to_object_array( - content, min_width=col_len).T) - zip_len = len(zipped_content) + max_len = max([len(row) for row in content]) - if self.skipfooter < 0: - raise ValueError('skip footer cannot be negative') - - # Loop through rows to verify lengths are correct. - if (col_len != zip_len and + # Check that there are no rows with too many + # elements in their row (rows with too few + # elements are padded with NaN). + if (max_len > col_len and self.index_col is not False and self.usecols is None): - i = 0 - for (i, l) in enumerate(content): - if len(l) != col_len: - break - footers = 0 - if self.skipfooter: - footers = self.skipfooter + footers = self.skipfooter if self.skipfooter else 0 + bad_lines = [] - row_num = self.pos - (len(content) - i + footers) + iter_content = enumerate(content) + content_len = len(content) + content = [] - msg = ('Expected %d fields in line %d, saw %d' % - (col_len, row_num + 1, zip_len)) - if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE: - # see gh-13374 - reason = ('Error could possibly be due to quotes being ' - 'ignored when a multi-char delimiter is used.') - msg += '. ' + reason - raise ValueError(msg) + for (i, l) in iter_content: + actual_len = len(l) + + if actual_len > col_len: + if self.error_bad_lines or self.warn_bad_lines: + row_num = self.pos - (content_len - i + footers) + bad_lines.append((row_num, actual_len)) + + if self.error_bad_lines: + break + else: + content.append(l) + + for row_num, actual_len in bad_lines: + msg = ('Expected %d fields in line %d, saw %d' % + (col_len, row_num + 1, actual_len)) + if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE: + # see gh-13374 + reason = ('Error could possibly be due to quotes being ' + 'ignored when a multi-char delimiter is used.') + msg += '. 
' + reason + + self._alert_malformed(msg, row_num + 1) + + # see gh-13320 + zipped_content = list(lib.to_object_array( + content, min_width=col_len).T) if self.usecols: if self._implicit_index: @@ -2562,7 +2805,6 @@ def _rows_to_cols(self, content): return zipped_content def _get_lines(self, rows=None): - source = self.data lines = self.buf new_rows = None @@ -2577,20 +2819,20 @@ def _get_lines(self, rows=None): rows -= len(self.buf) if new_rows is None: - if isinstance(source, list): - if self.pos > len(source): + if isinstance(self.data, list): + if self.pos > len(self.data): raise StopIteration if rows is None: - new_rows = source[self.pos:] - new_pos = len(source) + new_rows = self.data[self.pos:] + new_pos = len(self.data) else: - new_rows = source[self.pos:self.pos + rows] + new_rows = self.data[self.pos:self.pos + rows] new_pos = self.pos + rows # Check for stop rows. n.b.: self.skiprows is a set. if self.skiprows: new_rows = [row for i, row in enumerate(new_rows) - if i + self.pos not in self.skiprows] + if not self.skipfunc(i + self.pos)] lines.extend(new_rows) self.pos = new_pos @@ -2600,25 +2842,23 @@ def _get_lines(self, rows=None): try: if rows is not None: for _ in range(rows): - new_rows.append(next(source)) + new_rows.append(next(self.data)) lines.extend(new_rows) else: rows = 0 + while True: - try: - new_rows.append(next(source)) - rows += 1 - except csv.Error as inst: - if 'newline inside string' in str(inst): - row_num = str(self.pos + rows) - msg = ('EOF inside string starting with ' - 'line ' + row_num) - raise Exception(msg) - raise + new_row = self._next_iter_line( + row_num=self.pos + rows + 1) + rows += 1 + + if new_row is not None: + new_rows.append(new_row) + except StopIteration: if self.skiprows: new_rows = [row for i, row in enumerate(new_rows) - if self.pos + i not in self.skiprows] + if not self.skipfunc(i + self.pos)] lines.extend(new_rows) if len(lines) == 0: raise @@ -2633,7 +2873,7 @@ def _get_lines(self, rows=None): lines = self._check_comments(lines) if self.skip_blank_lines: - lines = self._check_empty(lines) + lines = self._remove_empty_lines(lines) lines = self._check_thousands(lines) return self._check_decimal(lines) @@ -2748,7 +2988,7 @@ def _try_convert_dates(parser, colspec, data_dict, columns): if c in colset: colnames.append(c) elif isinstance(c, int) and c not in columns: - colnames.append(str(columns[c])) + colnames.append(columns[c]) else: colnames.append(c) @@ -2765,7 +3005,7 @@ def _clean_na_values(na_values, keep_default_na=True): if keep_default_na: na_values = _NA_VALUES else: - na_values = [] + na_values = set() na_fvalues = set() elif isinstance(na_values, dict): na_values = na_values.copy() # Prevent aliasing. @@ -2940,13 +3180,13 @@ class FixedWidthReader(BaseIterator): A reader of fixed-width lines. 
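The tolerant path these hooks enable can be sketched as follows; previously `error_bad_lines`/`warn_bad_lines` were C-engine only (inline data assumed):

```python
import pandas as pd
from pandas.compat import StringIO

data = 'a,b\n1,2\n1,2,3\n4,5'

# the over-long third line is dropped and reported on stderr
# instead of raising a ParserError
df = pd.read_csv(StringIO(data), engine='python',
                 error_bad_lines=False, warn_bad_lines=True)
print(len(df))  # 2
```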
""" - def __init__(self, f, colspecs, delimiter, comment): + def __init__(self, f, colspecs, delimiter, comment, skiprows=None): self.f = f self.buffer = None self.delimiter = '\r\n' + delimiter if delimiter else '\n\r\t ' self.comment = comment if colspecs == 'infer': - self.colspecs = self.detect_colspecs() + self.colspecs = self.detect_colspecs(skiprows=skiprows) else: self.colspecs = colspecs @@ -2955,7 +3195,6 @@ def __init__(self, f, colspecs, delimiter, comment): "input was a %r" % type(colspecs).__name__) for colspec in self.colspecs: - if not (isinstance(colspec, (tuple, list)) and len(colspec) == 2 and isinstance(colspec[0], (int, np.integer, type(None))) and @@ -2963,20 +3202,50 @@ def __init__(self, f, colspecs, delimiter, comment): raise TypeError('Each column specification must be ' '2 element tuple or list of integers') - def get_rows(self, n): - rows = [] - for i, row in enumerate(self.f, 1): - rows.append(row) - if i >= n: + def get_rows(self, n, skiprows=None): + """ + Read rows from self.f, skipping as specified. + + We distinguish buffer_rows (the first <= n lines) + from the rows returned to detect_colspecs because + it's simpler to leave the other locations with + skiprows logic alone than to modify them to deal + with the fact we skipped some rows here as well. + + Parameters + ---------- + n : int + Number of rows to read from self.f, not counting + rows that are skipped. + skiprows: set, optional + Indices of rows to skip. + + Returns + ------- + detect_rows : list of str + A list containing the rows to read. + + """ + if skiprows is None: + skiprows = set() + buffer_rows = [] + detect_rows = [] + for i, row in enumerate(self.f): + if i not in skiprows: + detect_rows.append(row) + buffer_rows.append(row) + if len(detect_rows) >= n: break - self.buffer = iter(rows) - return rows + self.buffer = iter(buffer_rows) + return detect_rows - def detect_colspecs(self, n=100): + def detect_colspecs(self, n=100, skiprows=None): # Regex escape the delimiters delimiters = ''.join([r'\%s' % x for x in self.delimiter]) pattern = re.compile('([^%s]+)' % delimiters) - rows = self.get_rows(n) + rows = self.get_rows(n, skiprows) + if not rows: + raise EmptyDataError("No rows from which to infer column width") max_len = max(map(len, rows)) mask = np.zeros(max_len + 1, dtype=int) if self.comment is not None: @@ -2987,7 +3256,8 @@ def detect_colspecs(self, n=100): shifted = np.roll(mask, 1) shifted[0] = 0 edges = np.where((mask ^ shifted) == 1)[0] - return list(zip(edges[::2], edges[1::2])) + edge_pairs = list(zip(edges[::2], edges[1::2])) + return edge_pairs def __next__(self): if self.buffer is not None: @@ -3012,9 +3282,8 @@ class FixedWidthFieldParser(PythonParser): def __init__(self, f, **kwds): # Support iterators, convert to a list. 
self.colspecs = kwds.pop('colspecs') - PythonParser.__init__(self, f, **kwds) def _make_reader(self, f): self.data = FixedWidthReader(f, self.colspecs, self.delimiter, - self.comment) + self.comment, self.skiprows) diff --git a/pandas/io/pickle.py b/pandas/io/pickle.py index 2358c296f782e..0f91c407766fb 100644 --- a/pandas/io/pickle.py +++ b/pandas/io/pickle.py @@ -3,10 +3,11 @@ import numpy as np from numpy.lib.format import read_array, write_array from pandas.compat import BytesIO, cPickle as pkl, pickle_compat as pc, PY3 -from pandas.types.common import is_datetime64_dtype, _NS_DTYPE +from pandas.core.dtypes.common import is_datetime64_dtype, _NS_DTYPE +from pandas.io.common import _get_handle, _infer_compression -def to_pickle(obj, path): +def to_pickle(obj, path, compression='infer'): """ Pickle (serialize) object to input file path @@ -15,12 +16,23 @@ def to_pickle(obj, path): obj : any object path : string File path + compression : {'infer', 'gzip', 'bz2', 'xz', None}, default 'infer' + a string representing the compression to use in the output file + + .. versionadded:: 0.20.0 """ - with open(path, 'wb') as f: + inferred_compression = _infer_compression(path, compression) + f, fh = _get_handle(path, 'wb', + compression=inferred_compression, + is_text=False) + try: pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL) + finally: + for _f in fh: + _f.close() -def read_pickle(path): +def read_pickle(path, compression='infer'): """ Load pickled pandas object (or any other pickled object) from the specified file path @@ -32,12 +44,32 @@ def read_pickle(path): ---------- path : string File path + compression : {'infer', 'gzip', 'bz2', 'xz', 'zip', None}, default 'infer' + For on-the-fly decompression of on-disk data. If 'infer', then use + gzip, bz2, xz or zip if path is a string ending in '.gz', '.bz2', '.xz', + or '.zip' respectively, and no decompression otherwise. + Set to None for no decompression. + + ..
versionadded:: 0.20.0 Returns ------- unpickled : type of object stored in file """ + inferred_compression = _infer_compression(path, compression) + + def read_wrapper(func): + # wrapper file handle open/close operation + f, fh = _get_handle(path, 'rb', + compression=inferred_compression, + is_text=False) + try: + return func(f) + finally: + for _f in fh: + _f.close() + def try_read(path, encoding=None): # try with cPickle # try with current pickle, if we have a Type Error then @@ -48,19 +80,16 @@ def try_read(path, encoding=None): # cpickle # GH 6899 try: - with open(path, 'rb') as fh: - return pkl.load(fh) + return read_wrapper(lambda f: pkl.load(f)) except Exception: # reg/patched pickle try: - with open(path, 'rb') as fh: - return pc.load(fh, encoding=encoding, compat=False) - + return read_wrapper( + lambda f: pc.load(f, encoding=encoding, compat=False)) # compat pickle except: - with open(path, 'rb') as fh: - return pc.load(fh, encoding=encoding, compat=True) - + return read_wrapper( + lambda f: pc.load(f, encoding=encoding, compat=True)) try: return try_read(path) except: @@ -68,6 +97,7 @@ def try_read(path, encoding=None): return try_read(path, encoding='latin1') raise + # compat with sparse pickle / unpickle diff --git a/pandas/io/pytables.py b/pandas/io/pytables.py index e474aeab1f6ca..17bedd016f617 100644 --- a/pandas/io/pytables.py +++ b/pandas/io/pytables.py @@ -12,45 +12,41 @@ import warnings import os -from pandas.types.common import (is_list_like, - is_categorical_dtype, - is_timedelta64_dtype, - is_datetime64tz_dtype, - is_datetime64_dtype, - _ensure_object, - _ensure_int64, - _ensure_platform_int) -from pandas.types.missing import array_equivalent +from pandas.core.dtypes.common import ( + is_list_like, + is_categorical_dtype, + is_timedelta64_dtype, + is_datetime64tz_dtype, + is_datetime64_dtype, + _ensure_object, + _ensure_int64, + _ensure_platform_int) +from pandas.core.dtypes.missing import array_equivalent import numpy as np - -import pandas as pd from pandas import (Series, DataFrame, Panel, Panel4D, Index, - MultiIndex, Int64Index, isnull) + MultiIndex, Int64Index, isnull, concat, + SparseSeries, SparseDataFrame, PeriodIndex, + DatetimeIndex, TimedeltaIndex) from pandas.core import config from pandas.io.common import _stringify_path -from pandas.sparse.api import SparseSeries, SparseDataFrame -from pandas.sparse.array import BlockIndex, IntIndex -from pandas.tseries.api import PeriodIndex, DatetimeIndex -from pandas.tseries.tdi import TimedeltaIndex +from pandas.core.sparse.array import BlockIndex, IntIndex from pandas.core.base import StringMixin -from pandas.formats.printing import adjoin, pprint_thing -from pandas.core.common import _asarray_tuplesafe, PerformanceWarning +from pandas.io.formats.printing import adjoin, pprint_thing +from pandas.errors import PerformanceWarning +from pandas.core.common import _asarray_tuplesafe from pandas.core.algorithms import match, unique from pandas.core.categorical import Categorical, _factorize_from_iterables from pandas.core.internals import (BlockManager, make_block, _block2d_to_blocknd, _factor_indexer, _block_shape) from pandas.core.index import _ensure_index -from pandas.tools.merge import concat from pandas import compat from pandas.compat import u_safe as u, PY3, range, lrange, string_types, filter from pandas.core.config import get_option -from pandas.computation.pytables import Expr, maybe_expression +from pandas.core.computation.pytables import Expr, maybe_expression -import pandas.lib as lib -import pandas.algos 
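A round trip through the new `compression` keyword might look like the following sketch (file name invented; gzip, bz2 and xz can all be inferred from the suffix):

```python
import pandas as pd

df = pd.DataFrame({'a': range(5)})

# compression='infer' is the default, so the '.gz' suffix alone
# selects gzip on write and gunzip on read.
df.to_pickle('frame.pkl.gz')
assert pd.read_pickle('frame.pkl.gz').equals(df)
```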
as algos -import pandas.tslib as tslib +from pandas._libs import tslib, algos, lib from distutils.version import LooseVersion @@ -76,6 +72,7 @@ def _ensure_encoding(encoding): encoding = _default_encoding return encoding + Term = Expr @@ -114,6 +111,7 @@ class ClosedFileError(Exception): class IncompatibilityWarning(Warning): pass + incompatibility_doc = """ where criteria is being ignored as this version [%s] is too old (or not-defined), read the file in and write it out to a new file to upgrade (with @@ -124,6 +122,7 @@ class IncompatibilityWarning(Warning): class AttributeConflictWarning(Warning): pass + attribute_conflict_doc = """ the [%s] attribute of the existing index is [%s] which conflicts with the new [%s], resetting the attribute to None @@ -133,6 +132,7 @@ class AttributeConflictWarning(Warning): class DuplicateWarning(Warning): pass + duplicate_doc = """ duplicate entries in table, taking most recently appended """ @@ -164,7 +164,6 @@ class DuplicateWarning(Warning): Series: u('series'), SparseSeries: u('sparse_series'), - pd.TimeSeries: u('series'), DataFrame: u('frame'), SparseDataFrame: u('sparse_frame'), Panel: u('wide'), @@ -173,7 +172,6 @@ class DuplicateWarning(Warning): # storer class map _STORER_MAP = { - u('TimeSeries'): 'LegacySeriesFixed', u('Series'): 'LegacySeriesFixed', u('DataFrame'): 'LegacyFrameFixed', u('DataMatrix'): 'LegacyFrameFixed', @@ -833,7 +831,7 @@ def func(_start, _stop, _where): # concat and return return concat(objs, axis=axis, - verify_integrity=False).consolidate() + verify_integrity=False)._consolidate() # create the iterator it = TableIterator(self, s, func, where=where, nrows=nrows, @@ -1040,7 +1038,7 @@ def append_to_multiple(self, d, value, selector, data_columns=None, valid_index = next(idxs) for index in idxs: valid_index = valid_index.intersection(index) - value = value.ix[valid_index] + value = value.loc[valid_index] # append for k, v in d.items(): @@ -1326,6 +1324,13 @@ def _read_group(self, group, **kwargs): def get_store(path, **kwargs): """ Backwards compatible alias for ``HDFStore`` """ + warnings.warn( + "get_store is deprecated and will be " + "removed in a future version\n" + "HDFStore(path, **kwargs) is the replacement", + FutureWarning, + stacklevel=6) + return HDFStore(path, **kwargs) @@ -2098,7 +2103,17 @@ def convert(self, values, nan_rep, encoding): # we have a categorical categories = self.metadata - self.data = Categorical.from_codes(self.data.ravel(), + codes = self.data.ravel() + + # if we have stored a NaN in the categories + # then strip it; in theory we could have BOTH + # -1s in the codes and nulls :< + mask = isnull(categories) + if mask.any(): + categories = categories[~mask] + codes[codes != -1] -= mask.astype(int).cumsum().values + + self.data = Categorical.from_codes(codes, categories=categories, ordered=self.ordered) @@ -3408,10 +3423,12 @@ def create_axes(self, axes, obj, validate=True, nan_rep=None, if existing_table is not None: indexer = len(self.non_index_axes) exist_axis = existing_table.non_index_axes[indexer][1] - if append_axis != exist_axis: + if not array_equivalent(np.array(append_axis), + np.array(exist_axis)): # ahah!
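Because `get_store` now warns, new code should construct `HDFStore` directly; a sketch of the replacement pattern (path invented, requires PyTables):

```python
import pandas as pd

# HDFStore(path, **kwargs) is the stated replacement for get_store(path).
with pd.HDFStore('store.h5', mode='w') as store:
    store.put('df', pd.DataFrame({'a': [1, 2, 3]}))
    df = store['df']
```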
-> reindex - if sorted(append_axis) == sorted(exist_axis): + if array_equivalent(np.array(sorted(append_axis)), + np.array(sorted(exist_axis))): append_axis = exist_axis # the non_index_axes info @@ -3440,7 +3457,7 @@ def get_blk_items(mgr, blocks): return [mgr.items.take(blk.mgr_locs) for blk in blocks] # figure out data_columns and get out blocks - block_obj = self.get_object(obj).consolidate() + block_obj = self.get_object(obj)._consolidate() blocks = block_obj._data.blocks blk_items = get_blk_items(block_obj._data, blocks) if len(self.non_index_axes): @@ -3576,8 +3593,8 @@ def process_filter(field, filt): filt = filt.union(Index(self.levels)) takers = op(axis_values, filt) - return obj.ix._getitem_axis(takers, - axis=axis_number) + return obj.loc._getitem_axis(takers, + axis=axis_number) # this might be the name of a file IN an axis elif field in axis_values: @@ -3590,8 +3607,8 @@ def process_filter(field, filt): if isinstance(obj, DataFrame): axis_number = 1 - axis_number takers = op(values, filt) - return obj.ix._getitem_axis(takers, - axis=axis_number) + return obj.loc._getitem_axis(takers, + axis=axis_number) raise ValueError( "cannot find the field [%s] for filtering!" % field) @@ -3789,9 +3806,9 @@ def read(self, where=None, columns=None, **kwargs): lp = DataFrame(c.data, index=long_index, columns=c.values) # need a better algorithm - tuple_index = long_index._tuple_index + tuple_index = long_index.values - unique_tuples = lib.fast_unique(tuple_index.values) + unique_tuples = lib.fast_unique(tuple_index) unique_tuples = _asarray_tuplesafe(unique_tuples) indexer = match(unique_tuples, tuple_index) @@ -3807,7 +3824,7 @@ def read(self, where=None, columns=None, **kwargs): if len(objs) == 1: wp = objs[0] else: - wp = concat(objs, axis=0, verify_integrity=False).consolidate() + wp = concat(objs, axis=0, verify_integrity=False)._consolidate() # apply the selection filters & axis orderings wp = self.process_axes(wp, columns=columns) @@ -4313,7 +4330,7 @@ def _reindex_axis(obj, axis, labels, other=None): labels = _ensure_index(labels.unique()) if other is not None: - labels = labels & _ensure_index(other.unique()) + labels = _ensure_index(other.unique()) & labels if not labels.equals(ax): slicer = [slice(None, None)] * obj.ndim slicer[axis] = labels diff --git a/pandas/io/s3.py b/pandas/io/s3.py index df8f1d9187031..5e48de757d00e 100644 --- a/pandas/io/s3.py +++ b/pandas/io/s3.py @@ -1,14 +1,10 @@ """ s3 support for remote file interactivity """ - -import os from pandas import compat -from pandas.compat import BytesIO - try: - import boto - from boto.s3 import key + import s3fs + from botocore.exceptions import NoCredentialsError except: - raise ImportError("boto is required to handle s3 files") + raise ImportError("The s3fs library is required to handle s3 files") if compat.PY3: from urllib.parse import urlparse as parse_url @@ -16,97 +12,24 @@ from urlparse import urlparse as parse_url -class BotoFileLikeReader(key.Key): - """boto Key modified to be more file-like - - This modification of the boto Key will read through a supplied - S3 key once, then stop. The unmodified boto Key object will repeatedly - cycle through a file in S3: after reaching the end of the file, - boto will close the file. Then the next call to `read` or `next` will - re-open the file and start reading from the beginning. - - Also adds a `readline` function which will split the returned - values by the `\n` character. 
- """ - - def __init__(self, *args, **kwargs): - encoding = kwargs.pop("encoding", None) # Python 2 compat - super(BotoFileLikeReader, self).__init__(*args, **kwargs) - # Add a flag to mark the end of the read. - self.finished_read = False - self.buffer = "" - self.lines = [] - if encoding is None and compat.PY3: - encoding = "utf-8" - self.encoding = encoding - self.lines = [] - - def next(self): - return self.readline() - - __next__ = next - - def read(self, *args, **kwargs): - if self.finished_read: - return b'' if compat.PY3 else '' - return super(BotoFileLikeReader, self).read(*args, **kwargs) - - def close(self, *args, **kwargs): - self.finished_read = True - return super(BotoFileLikeReader, self).close(*args, **kwargs) - - def seekable(self): - """Needed for reading by bz2""" - return False - - def readline(self): - """Split the contents of the Key by '\n' characters.""" - if self.lines: - retval = self.lines[0] - self.lines = self.lines[1:] - return retval - if self.finished_read: - if self.buffer: - retval, self.buffer = self.buffer, "" - return retval - else: - raise StopIteration - - if self.encoding: - self.buffer = "{}{}".format( - self.buffer, self.read(8192).decode(self.encoding)) - else: - self.buffer = "{}{}".format(self.buffer, self.read(8192)) - - split_buffer = self.buffer.split("\n") - self.lines.extend(split_buffer[:-1]) - self.buffer = split_buffer[-1] - - return self.readline() +def _strip_schema(url): + """Returns the url without the s3:// part""" + result = parse_url(url) + return result.netloc + result.path def get_filepath_or_buffer(filepath_or_buffer, encoding=None, compression=None): - - # Assuming AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_S3_HOST - # are environment variables - parsed_url = parse_url(filepath_or_buffer) - s3_host = os.environ.get('AWS_S3_HOST', 's3.amazonaws.com') - + fs = s3fs.S3FileSystem(anon=False) try: - conn = boto.connect_s3(host=s3_host) - except boto.exception.NoAuthHandlerFound: - conn = boto.connect_s3(host=s3_host, anon=True) - - b = conn.get_bucket(parsed_url.netloc, validate=False) - if compat.PY2 and (compression == 'gzip' or - (compression == 'infer' and - filepath_or_buffer.endswith(".gz"))): - k = boto.s3.key.Key(b, parsed_url.path) - filepath_or_buffer = BytesIO(k.get_contents_as_string( - encoding=encoding)) - else: - k = BotoFileLikeReader(b, parsed_url.path, encoding=encoding) - k.open('r') # Expose read errors immediately - filepath_or_buffer = k + filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer)) + except (OSError, NoCredentialsError): + # boto3 has troubles when trying to access a public file + # when credentialed... + # An OSError is raised if you have credentials, but they + # aren't valid for that bucket. + # A NoCredentialsError is raised if you don't have creds + # for that bucket. 
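From the user's side, the switch to s3fs means `s3://` URLs simply require the `s3fs` package; credentialed access is tried first and public buckets are retried anonymously, as the fallback below implements. A sketch using a bucket name that appears in the test suite:

```python
import pandas as pd

# Requires s3fs; credentials come from the usual AWS environment,
# and public objects fall back to anonymous access automatically.
df = pd.read_csv('s3://pandas-test/tips.csv')
```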
+ fs = s3fs.S3FileSystem(anon=True) + filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer)) return filepath_or_buffer, None, compression diff --git a/pandas/io/sas/__init__.py b/pandas/io/sas/__init__.py index e69de29bb2d1d..fa6b29a1a3fcc 100644 --- a/pandas/io/sas/__init__.py +++ b/pandas/io/sas/__init__.py @@ -0,0 +1 @@ +from .sasreader import read_sas # noqa diff --git a/pandas/io/sas/saslib.pyx b/pandas/io/sas/sas.pyx similarity index 99% rename from pandas/io/sas/saslib.pyx rename to pandas/io/sas/sas.pyx index a66d62ea41581..4396180da44cb 100644 --- a/pandas/io/sas/saslib.pyx +++ b/pandas/io/sas/sas.pyx @@ -311,7 +311,7 @@ cdef class Parser(object): self.current_page_subheaders_count =\ self.parser._current_page_subheaders_count - cdef bint readline(self): + cdef readline(self): cdef: int offset, bit_offset, align_correction diff --git a/pandas/io/sas/sas7bdat.py b/pandas/io/sas/sas7bdat.py index 91f417abc0502..20b0cf85e95b7 100644 --- a/pandas/io/sas/sas7bdat.py +++ b/pandas/io/sas/sas7bdat.py @@ -20,7 +20,7 @@ import numpy as np import struct import pandas.io.sas.sas_constants as const -from pandas.io.sas.saslib import Parser +from pandas.io.sas._sas import Parser class _subheader_pointer(object): diff --git a/pandas/io/sas/sas_xport.py b/pandas/io/sas/sas_xport.py index 76fc55154bc49..a43a5988a2ade 100644 --- a/pandas/io/sas/sas_xport.py +++ b/pandas/io/sas/sas_xport.py @@ -14,7 +14,7 @@ from pandas import compat import struct import numpy as np -from pandas.util.decorators import Appender +from pandas.util._decorators import Appender import warnings _correct_line1 = ("HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!" diff --git a/pandas/io/sas/sasreader.py b/pandas/io/sas/sasreader.py index 081d780f71cb3..3e4d9c9024dbd 100644 --- a/pandas/io/sas/sasreader.py +++ b/pandas/io/sas/sasreader.py @@ -1,6 +1,7 @@ """ Read SAS sas7bdat or xport files. 
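The stricter check added to `read_sas` means a buffer must be accompanied by an explicit `format`, since there is no file extension to sniff. A sketch with an invented file name:

```python
import pandas as pd

# A file-like object without format= now raises ValueError,
# so state the format when the input is not a path string.
with open('example.sas7bdat', 'rb') as f:
    df = pd.read_sas(f, format='sas7bdat')
```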
""" +from pandas import compat def read_sas(filepath_or_buffer, format=None, index=None, encoding=None, @@ -29,8 +30,12 @@ def read_sas(filepath_or_buffer, format=None, index=None, encoding=None, DataFrame if iterator=False and chunksize=None, else SAS7BDATReader or XportReader """ - if format is None: + buffer_error_msg = ("If this is a buffer object rather " + "than a string name, you must specify " + "a format string") + if not isinstance(filepath_or_buffer, compat.string_types): + raise ValueError(buffer_error_msg) try: fname = filepath_or_buffer.lower() if fname.endswith(".xpt"): diff --git a/pandas/io/sql.py b/pandas/io/sql.py index c9f8d32e1b504..ee992c6dd3439 100644 --- a/pandas/io/sql.py +++ b/pandas/io/sql.py @@ -11,17 +11,18 @@ import re import numpy as np -import pandas.lib as lib -from pandas.types.missing import isnull -from pandas.types.dtypes import DatetimeTZDtype -from pandas.types.common import (is_list_like, is_dict_like, - is_datetime64tz_dtype) +import pandas._libs.lib as lib +from pandas.core.dtypes.missing import isnull +from pandas.core.dtypes.dtypes import DatetimeTZDtype +from pandas.core.dtypes.common import ( + is_list_like, is_dict_like, + is_datetime64tz_dtype) from pandas.compat import (map, zip, raise_with_traceback, string_types, text_type) from pandas.core.api import DataFrame, Series from pandas.core.base import PandasObject -from pandas.tseries.tools import to_datetime +from pandas.core.tools.datetimes import to_datetime from contextlib import contextmanager @@ -214,7 +215,7 @@ def read_sql_table(table_name, con, schema=None, index_col=None, index_col : string or list of strings, optional, default: None Column(s) to set as index(MultiIndex) coerce_float : boolean, default True - Attempt to convert values to non-string, non-numeric objects (like + Attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point. Can result in loss of Precision. parse_dates : list or dict, default: None - List of column names to parse as dates @@ -289,7 +290,7 @@ def read_sql_query(sql, con, index_col=None, coerce_float=True, params=None, index_col : string or list of strings, optional, default: None Column(s) to set as index(MultiIndex) coerce_float : boolean, default True - Attempt to convert values to non-string, non-numeric objects (like + Attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets params : list, tuple or dict, optional, default: None List of parameters to pass to execute method. The syntax used @@ -348,7 +349,7 @@ def read_sql(sql, con, index_col=None, coerce_float=True, params=None, index_col : string or list of strings, optional, default: None Column(s) to set as index(MultiIndex) coerce_float : boolean, default True - Attempt to convert values to non-string, non-numeric objects (like + Attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets params : list, tuple or dict, optional, default: None List of parameters to pass to execute method. 
The syntax used @@ -495,6 +496,7 @@ def has_table(table_name, con, flavor=None, schema=None): pandas_sql = pandasSQL_builder(con, flavor=flavor, schema=schema) return pandas_sql.has_table(table_name) + table_exists = has_table @@ -748,8 +750,9 @@ def _get_column_names_and_types(self, dtype_mapper): if self.index is not None: for i, idx_label in enumerate(self.index): idx_type = dtype_mapper( - self.frame.index.get_level_values(i)) - column_names_and_types.append((idx_label, idx_type, True)) + self.frame.index._get_level_values(i)) + column_names_and_types.append((text_type(idx_label), + idx_type, True)) column_names_and_types += [ (text_type(self.frame.columns[i]), @@ -986,7 +989,7 @@ def read_table(self, table_name, index_col=None, coerce_float=True, index_col : string, optional, default: None Column to set as index coerce_float : boolean, default True - Attempt to convert values to non-string, non-numeric objects + Attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point. This can result in loss of precision. parse_dates : list or dict, default: None @@ -1048,7 +1051,7 @@ def read_query(self, sql, index_col=None, coerce_float=True, index_col : string, optional, default: None Column name to use as index for the returned DataFrame object. coerce_float : boolean, default True - Attempt to convert values to non-string, non-numeric objects (like + Attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets params : list, tuple or dict, optional, default: None List of parameters to pass to execute method. The syntax used @@ -1219,7 +1222,7 @@ def _create_sql_schema(self, frame, table_name, keys=None, dtype=None): def _get_unicode_name(name): try: - uname = name.encode("utf-8", "strict").decode("utf-8") + uname = text_type(name).encode("utf-8", "strict").decode("utf-8") except UnicodeError: raise ValueError("Cannot convert identifier to UTF-8: '%s'" % name) return uname diff --git a/pandas/io/stata.py b/pandas/io/stata.py index c35e07be2c31a..55cac83804cd9 100644 --- a/pandas/io/stata.py +++ b/pandas/io/stata.py @@ -15,8 +15,9 @@ import struct from dateutil.relativedelta import relativedelta -from pandas.types.common import (is_categorical_dtype, is_datetime64_dtype, - _ensure_object) +from pandas.core.dtypes.common import ( + is_categorical_dtype, is_datetime64_dtype, + _ensure_object) from pandas.core.base import StringMixin from pandas.core.categorical import Categorical @@ -26,12 +27,15 @@ from pandas import compat, to_timedelta, to_datetime, isnull, DatetimeIndex from pandas.compat import lrange, lmap, lzip, text_type, string_types, range, \ zip, BytesIO -from pandas.util.decorators import Appender +from pandas.util._decorators import Appender import pandas as pd from pandas.io.common import get_filepath_or_buffer, BaseIterator -from pandas.lib import max_len_string_array, infer_dtype -from pandas.tslib import NaT, Timestamp +from pandas._libs.lib import max_len_string_array, infer_dtype +from pandas._libs.tslib import NaT, Timestamp + +VALID_ENCODINGS = ('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1', + 'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1') _version_error = ("Version of given Stata file is not 104, 105, 108, " "111 (Stata 7SE), 113 (Stata 8/9), 114 (Stata 10/11), " @@ -45,7 +49,7 @@ _encoding_params = """\ encoding : string, None or encoding - Encoding used to parse the files. None defaults to iso-8859-1.""" + Encoding used to parse the files. 
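The `coerce_float` docstrings corrected above all describe the same behavior; for reference, a usage sketch against an in-memory SQLite database (table and values invented):

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE prices (item TEXT, price REAL)')
con.execute("INSERT INTO prices VALUES ('apple', 1.5)")

# coerce_float=True (the default) converts non-string, non-numeric
# objects returned by the driver (e.g. decimal.Decimal) to float.
df = pd.read_sql_query('SELECT * FROM prices', con)
```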
None defaults to latin-1.""" _statafile_processing_params2 = """\ index : identifier of index column @@ -169,10 +173,13 @@ def read_stata(filepath_or_buffer, convert_dates=True, if iterator or chunksize: data = reader else: - data = reader.read() - reader.close() + try: + data = reader.read() + finally: + reader.close() return data + _date_formats = ["%tc", "%tC", "%td", "%d", "%tw", "%tm", "%tq", "%th", "%ty"] @@ -456,6 +463,7 @@ class PossiblePrecisionLoss(Warning): class ValueLabelTypeMismatch(Warning): pass + value_label_mismatch_doc = """ Stata value labels (pandas categories) must be strings. Column {0} contains non-string labels which will be converted to strings. Please check that the @@ -812,9 +820,14 @@ def get_base_missing_value(cls, dtype): class StataParser(object): - _default_encoding = 'iso-8859-1' + _default_encoding = 'latin-1' def __init__(self, encoding): + if encoding is not None: + if encoding not in VALID_ENCODINGS: + raise ValueError('Unknown encoding. Only latin-1 and ascii ' + 'supported.') + self._encoding = encoding # type code. @@ -932,7 +945,7 @@ def __init__(self, path_or_buf, convert_dates=True, convert_categoricals=True, index=None, convert_missing=False, preserve_dtypes=True, columns=None, order_categoricals=True, - encoding='iso-8859-1', chunksize=None): + encoding='latin-1', chunksize=None): super(StataReader, self).__init__(encoding) self.col_sizes = () @@ -945,6 +958,10 @@ def __init__(self, path_or_buf, convert_dates=True, self._preserve_dtypes = preserve_dtypes self._columns = columns self._order_categoricals = order_categoricals + if encoding is not None: + if encoding not in VALID_ENCODINGS: + raise ValueError('Unknown encoding. Only latin-1 and ascii ' + 'supported.') self._encoding = encoding self._chunksize = chunksize @@ -1358,7 +1375,8 @@ def _read_value_labels(self): def _read_strls(self): self.path_or_buf.seek(self.seek_strls) - self.GSO = {0: ''} + # Wrap v_o in a string to allow uint64 values as keys on 32bit OS + self.GSO = {'0': ''} while True: if self.path_or_buf.read(3) != b'GSO': break @@ -1383,7 +1401,8 @@ def _read_strls(self): if self.format_version == 117: encoding = self._encoding or self._default_encoding va = va[0:-1].decode(encoding) - self.GSO[v_o] = va + # Wrap v_o in a string to allow uint64 values as keys on 32bit OS + self.GSO[str(v_o)] = va # legacy @Appender('DEPRECATED: ' + _data_method_doc) @@ -1619,7 +1638,8 @@ def _insert_strls(self, data): for i, typ in enumerate(self.typlist): if typ != 'Q': continue - data.iloc[:, i] = [self.GSO[k] for k in data.iloc[:, i]] + # Wrap v_o in a string to allow uint64 values as keys on 32bit OS + data.iloc[:, i] = [self.GSO[str(k)] for k in data.iloc[:, i]] return data def _do_select_columns(self, data, columns): @@ -1851,7 +1871,7 @@ class StataWriter(StataParser): write_index : bool Write the index to Stata dataset. encoding : str - Default is latin-1. Unicode is not supported + Default is latin-1. Only latin-1 and ascii are supported. byteorder : str Can be ">", "<", "little", or "big". 
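With `VALID_ENCODINGS` enforced in both the reader and the writer, any encoding outside the latin-1/ascii family is rejected up front. A sketch (file name invented):

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0]})

# latin-1 is the default; unsupported values such as 'utf-8' now raise
# ValueError('Unknown encoding. Only latin-1 and ascii supported.')
df.to_stata('example.dta', encoding='latin-1')
```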
default is `sys.byteorder` time_stamp : datetime @@ -2154,9 +2174,15 @@ def _write_header(self, data_label=None, time_stamp=None): time_stamp = datetime.datetime.now() elif not isinstance(time_stamp, datetime.datetime): raise ValueError("time_stamp should be datetime type") - self._file.write( - self._null_terminate(time_stamp.strftime("%d %b %Y %H:%M")) - ) + # GH #13856 + # Avoid locale-specific month conversion + months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', + 'Sep', 'Oct', 'Nov', 'Dec'] + month_lookup = {i + 1: month for i, month in enumerate(months)} + ts = (time_stamp.strftime("%d ") + + month_lookup[time_stamp.month] + + time_stamp.strftime(" %Y %H:%M")) + self._file.write(self._null_terminate(ts)) def _write_descriptors(self, typlist=None, varlist=None, srtlist=None, fmtlist=None, lbllist=None): diff --git a/pandas/io/tests/data/legacy_hdf/legacy.h5 b/pandas/io/tests/data/legacy_hdf/legacy.h5 deleted file mode 100644 index 38b822dd16994..0000000000000 Binary files a/pandas/io/tests/data/legacy_hdf/legacy.h5 and /dev/null differ diff --git a/pandas/io/tests/data/test_multisheet.xls b/pandas/io/tests/data/test_multisheet.xls deleted file mode 100644 index fa37723fcdefb..0000000000000 Binary files a/pandas/io/tests/data/test_multisheet.xls and /dev/null differ diff --git a/pandas/io/tests/data/test_multisheet.xlsm b/pandas/io/tests/data/test_multisheet.xlsm deleted file mode 100644 index 694f8e07d5e29..0000000000000 Binary files a/pandas/io/tests/data/test_multisheet.xlsm and /dev/null differ diff --git a/pandas/io/tests/data/test_multisheet.xlsx b/pandas/io/tests/data/test_multisheet.xlsx deleted file mode 100644 index 5de07772b276a..0000000000000 Binary files a/pandas/io/tests/data/test_multisheet.xlsx and /dev/null differ diff --git a/pandas/io/tests/json/test_json_norm.py b/pandas/io/tests/json/test_json_norm.py deleted file mode 100644 index 4848db97194d9..0000000000000 --- a/pandas/io/tests/json/test_json_norm.py +++ /dev/null @@ -1,230 +0,0 @@ -import nose - -from pandas import DataFrame -import numpy as np -import json - -import pandas.util.testing as tm -from pandas import compat - -from pandas.io.json import json_normalize, nested_to_record - - -def _assert_equal_data(left, right): - if not left.columns.equals(right.columns): - left = left.reindex(columns=right.columns) - - tm.assert_frame_equal(left, right) - - -class TestJSONNormalize(tm.TestCase): - - def setUp(self): - self.state_data = [ - {'counties': [{'name': 'Dade', 'population': 12345}, - {'name': 'Broward', 'population': 40000}, - {'name': 'Palm Beach', 'population': 60000}], - 'info': {'governor': 'Rick Scott'}, - 'shortname': 'FL', - 'state': 'Florida'}, - {'counties': [{'name': 'Summit', 'population': 1234}, - {'name': 'Cuyahoga', 'population': 1337}], - 'info': {'governor': 'John Kasich'}, - 'shortname': 'OH', - 'state': 'Ohio'}] - - def test_simple_records(self): - recs = [{'a': 1, 'b': 2, 'c': 3}, - {'a': 4, 'b': 5, 'c': 6}, - {'a': 7, 'b': 8, 'c': 9}, - {'a': 10, 'b': 11, 'c': 12}] - - result = json_normalize(recs) - expected = DataFrame(recs) - - tm.assert_frame_equal(result, expected) - - def test_simple_normalize(self): - result = json_normalize(self.state_data[0], 'counties') - expected = DataFrame(self.state_data[0]['counties']) - tm.assert_frame_equal(result, expected) - - result = json_normalize(self.state_data, 'counties') - - expected = [] - for rec in self.state_data: - expected.extend(rec['counties']) - expected = DataFrame(expected) - - tm.assert_frame_equal(result, expected) 
- - result = json_normalize(self.state_data, 'counties', meta='state') - expected['state'] = np.array(['Florida', 'Ohio']).repeat([3, 2]) - - tm.assert_frame_equal(result, expected) - - def test_more_deeply_nested(self): - data = [{'country': 'USA', - 'states': [{'name': 'California', - 'cities': [{'name': 'San Francisco', - 'pop': 12345}, - {'name': 'Los Angeles', - 'pop': 12346}] - }, - {'name': 'Ohio', - 'cities': [{'name': 'Columbus', - 'pop': 1234}, - {'name': 'Cleveland', - 'pop': 1236}]} - ] - }, - {'country': 'Germany', - 'states': [{'name': 'Bayern', - 'cities': [{'name': 'Munich', 'pop': 12347}] - }, - {'name': 'Nordrhein-Westfalen', - 'cities': [{'name': 'Duesseldorf', 'pop': 1238}, - {'name': 'Koeln', 'pop': 1239}]} - ] - } - ] - - result = json_normalize(data, ['states', 'cities'], - meta=['country', ['states', 'name']]) - # meta_prefix={'states': 'state_'}) - - ex_data = {'country': ['USA'] * 4 + ['Germany'] * 3, - 'states.name': ['California', 'California', 'Ohio', 'Ohio', - 'Bayern', 'Nordrhein-Westfalen', - 'Nordrhein-Westfalen'], - 'name': ['San Francisco', 'Los Angeles', 'Columbus', - 'Cleveland', 'Munich', 'Duesseldorf', 'Koeln'], - 'pop': [12345, 12346, 1234, 1236, 12347, 1238, 1239]} - - expected = DataFrame(ex_data, columns=result.columns) - tm.assert_frame_equal(result, expected) - - def test_shallow_nested(self): - data = [{'state': 'Florida', - 'shortname': 'FL', - 'info': { - 'governor': 'Rick Scott' - }, - 'counties': [{'name': 'Dade', 'population': 12345}, - {'name': 'Broward', 'population': 40000}, - {'name': 'Palm Beach', 'population': 60000}]}, - {'state': 'Ohio', - 'shortname': 'OH', - 'info': { - 'governor': 'John Kasich' - }, - 'counties': [{'name': 'Summit', 'population': 1234}, - {'name': 'Cuyahoga', 'population': 1337}]}] - - result = json_normalize(data, 'counties', - ['state', 'shortname', - ['info', 'governor']]) - ex_data = {'name': ['Dade', 'Broward', 'Palm Beach', 'Summit', - 'Cuyahoga'], - 'state': ['Florida'] * 3 + ['Ohio'] * 2, - 'shortname': ['FL', 'FL', 'FL', 'OH', 'OH'], - 'info.governor': ['Rick Scott'] * 3 + ['John Kasich'] * 2, - 'population': [12345, 40000, 60000, 1234, 1337]} - expected = DataFrame(ex_data, columns=result.columns) - tm.assert_frame_equal(result, expected) - - def test_meta_name_conflict(self): - data = [{'foo': 'hello', - 'bar': 'there', - 'data': [{'foo': 'something', 'bar': 'else'}, - {'foo': 'something2', 'bar': 'else2'}]}] - - self.assertRaises(ValueError, json_normalize, data, - 'data', meta=['foo', 'bar']) - - result = json_normalize(data, 'data', meta=['foo', 'bar'], - meta_prefix='meta') - - for val in ['metafoo', 'metabar', 'foo', 'bar']: - self.assertTrue(val in result) - - def test_record_prefix(self): - result = json_normalize(self.state_data[0], 'counties') - expected = DataFrame(self.state_data[0]['counties']) - tm.assert_frame_equal(result, expected) - - result = json_normalize(self.state_data, 'counties', - meta='state', - record_prefix='county_') - - expected = [] - for rec in self.state_data: - expected.extend(rec['counties']) - expected = DataFrame(expected) - expected = expected.rename(columns=lambda x: 'county_' + x) - expected['state'] = np.array(['Florida', 'Ohio']).repeat([3, 2]) - - tm.assert_frame_equal(result, expected) - - def test_non_ascii_key(self): - if compat.PY3: - testjson = ( - b'[{"\xc3\x9cnic\xc3\xb8de":0,"sub":{"A":1, "B":2}},' + - b'{"\xc3\x9cnic\xc3\xb8de":1,"sub":{"A":3, "B":4}}]' - ).decode('utf8') - else: - testjson = ('[{"\xc3\x9cnic\xc3\xb8de":0,"sub":{"A":1, "B":2}},' - 
'{"\xc3\x9cnic\xc3\xb8de":1,"sub":{"A":3, "B":4}}]') - - testdata = { - u'sub.A': [1, 3], - u'sub.B': [2, 4], - b"\xc3\x9cnic\xc3\xb8de".decode('utf8'): [0, 1] - } - expected = DataFrame(testdata) - - result = json_normalize(json.loads(testjson)) - tm.assert_frame_equal(result, expected) - - -class TestNestedToRecord(tm.TestCase): - - def test_flat_stays_flat(self): - recs = [dict(flat1=1, flat2=2), - dict(flat1=3, flat2=4), - ] - - result = nested_to_record(recs) - expected = recs - self.assertEqual(result, expected) - - def test_one_level_deep_flattens(self): - data = dict(flat1=1, - dict1=dict(c=1, d=2)) - - result = nested_to_record(data) - expected = {'dict1.c': 1, - 'dict1.d': 2, - 'flat1': 1} - - self.assertEqual(result, expected) - - def test_nested_flattens(self): - data = dict(flat1=1, - dict1=dict(c=1, d=2), - nested=dict(e=dict(c=1, d=2), - d=2)) - - result = nested_to_record(data) - expected = {'dict1.c': 1, - 'dict1.d': 2, - 'flat1': 1, - 'nested.d': 2, - 'nested.e.c': 1, - 'nested.e.d': 2} - - self.assertEqual(result, expected) - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', - '--pdb-failure', '-s'], exit=False) diff --git a/pandas/io/tests/parser/test_network.py b/pandas/io/tests/parser/test_network.py deleted file mode 100644 index 9b02096dd0f26..0000000000000 --- a/pandas/io/tests/parser/test_network.py +++ /dev/null @@ -1,191 +0,0 @@ -# -*- coding: utf-8 -*- - -""" -Tests parsers ability to read and parse non-local files -and hence require a network connection to be read. -""" - -import os -import nose - -import pandas.util.testing as tm -from pandas import DataFrame -from pandas import compat -from pandas.io.parsers import read_csv, read_table - - -class TestUrlGz(tm.TestCase): - - def setUp(self): - dirpath = tm.get_data_path() - localtable = os.path.join(dirpath, 'salaries.csv') - self.local_table = read_table(localtable) - - @tm.network - def test_url_gz(self): - url = ('https://raw.github.com/pandas-dev/pandas/' - 'master/pandas/io/tests/parser/data/salaries.csv.gz') - url_table = read_table(url, compression="gzip", engine="python") - tm.assert_frame_equal(url_table, self.local_table) - - @tm.network - def test_url_gz_infer(self): - url = 'https://s3.amazonaws.com/pandas-test/salary.table.gz' - url_table = read_table(url, compression="infer", engine="python") - tm.assert_frame_equal(url_table, self.local_table) - - -class TestS3(tm.TestCase): - - def setUp(self): - try: - import boto # noqa - except ImportError: - raise nose.SkipTest("boto not installed") - - @tm.network - def test_parse_public_s3_bucket(self): - for ext, comp in [('', None), ('.gz', 'gzip'), ('.bz2', 'bz2')]: - if comp == 'bz2' and compat.PY2: - # The Python 2 C parser can't read bz2 from S3. 
- self.assertRaises(ValueError, read_csv, - 's3://pandas-test/tips.csv' + ext, - compression=comp) - else: - df = read_csv('s3://pandas-test/tips.csv' + - ext, compression=comp) - self.assertTrue(isinstance(df, DataFrame)) - self.assertFalse(df.empty) - tm.assert_frame_equal(read_csv( - tm.get_data_path('tips.csv')), df) - - # Read public file from bucket with not-public contents - df = read_csv('s3://cant_get_it/tips.csv') - self.assertTrue(isinstance(df, DataFrame)) - self.assertFalse(df.empty) - tm.assert_frame_equal(read_csv(tm.get_data_path('tips.csv')), df) - - @tm.network - def test_parse_public_s3n_bucket(self): - # Read from AWS s3 as "s3n" URL - df = read_csv('s3n://pandas-test/tips.csv', nrows=10) - self.assertTrue(isinstance(df, DataFrame)) - self.assertFalse(df.empty) - tm.assert_frame_equal(read_csv( - tm.get_data_path('tips.csv')).iloc[:10], df) - - @tm.network - def test_parse_public_s3a_bucket(self): - # Read from AWS s3 as "s3a" URL - df = read_csv('s3a://pandas-test/tips.csv', nrows=10) - self.assertTrue(isinstance(df, DataFrame)) - self.assertFalse(df.empty) - tm.assert_frame_equal(read_csv( - tm.get_data_path('tips.csv')).iloc[:10], df) - - @tm.network - def test_parse_public_s3_bucket_nrows(self): - for ext, comp in [('', None), ('.gz', 'gzip'), ('.bz2', 'bz2')]: - if comp == 'bz2' and compat.PY2: - # The Python 2 C parser can't read bz2 from S3. - self.assertRaises(ValueError, read_csv, - 's3://pandas-test/tips.csv' + ext, - compression=comp) - else: - df = read_csv('s3://pandas-test/tips.csv' + - ext, nrows=10, compression=comp) - self.assertTrue(isinstance(df, DataFrame)) - self.assertFalse(df.empty) - tm.assert_frame_equal(read_csv( - tm.get_data_path('tips.csv')).iloc[:10], df) - - @tm.network - def test_parse_public_s3_bucket_chunked(self): - # Read with a chunksize - chunksize = 5 - local_tips = read_csv(tm.get_data_path('tips.csv')) - for ext, comp in [('', None), ('.gz', 'gzip'), ('.bz2', 'bz2')]: - if comp == 'bz2' and compat.PY2: - # The Python 2 C parser can't read bz2 from S3. - self.assertRaises(ValueError, read_csv, - 's3://pandas-test/tips.csv' + ext, - compression=comp) - else: - df_reader = read_csv('s3://pandas-test/tips.csv' + ext, - chunksize=chunksize, compression=comp) - self.assertEqual(df_reader.chunksize, chunksize) - for i_chunk in [0, 1, 2]: - # Read a couple of chunks and make sure we see them - # properly. - df = df_reader.get_chunk() - self.assertTrue(isinstance(df, DataFrame)) - self.assertFalse(df.empty) - true_df = local_tips.iloc[ - chunksize * i_chunk: chunksize * (i_chunk + 1)] - tm.assert_frame_equal(true_df, df) - - @tm.network - def test_parse_public_s3_bucket_chunked_python(self): - # Read with a chunksize using the Python parser - chunksize = 5 - local_tips = read_csv(tm.get_data_path('tips.csv')) - for ext, comp in [('', None), ('.gz', 'gzip'), ('.bz2', 'bz2')]: - df_reader = read_csv('s3://pandas-test/tips.csv' + ext, - chunksize=chunksize, compression=comp, - engine='python') - self.assertEqual(df_reader.chunksize, chunksize) - for i_chunk in [0, 1, 2]: - # Read a couple of chunks and make sure we see them properly. 
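The chunked-read tests removed here document a pattern worth noting: over S3, `chunksize` behaves exactly as it does locally. A sketch, reusing the test-suite bucket:

```python
import pandas as pd

# chunksize returns a reader rather than a DataFrame; get_chunk()
# yields successive blocks of up to 5 rows, over S3 as locally.
reader = pd.read_csv('s3://pandas-test/tips.csv', chunksize=5)
first_chunk = reader.get_chunk()
```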
- df = df_reader.get_chunk() - self.assertTrue(isinstance(df, DataFrame)) - self.assertFalse(df.empty) - true_df = local_tips.iloc[ - chunksize * i_chunk: chunksize * (i_chunk + 1)] - tm.assert_frame_equal(true_df, df) - - @tm.network - def test_parse_public_s3_bucket_python(self): - for ext, comp in [('', None), ('.gz', 'gzip'), ('.bz2', 'bz2')]: - df = read_csv('s3://pandas-test/tips.csv' + ext, engine='python', - compression=comp) - self.assertTrue(isinstance(df, DataFrame)) - self.assertFalse(df.empty) - tm.assert_frame_equal(read_csv( - tm.get_data_path('tips.csv')), df) - - @tm.network - def test_infer_s3_compression(self): - for ext in ['', '.gz', '.bz2']: - df = read_csv('s3://pandas-test/tips.csv' + ext, - engine='python', compression='infer') - self.assertTrue(isinstance(df, DataFrame)) - self.assertFalse(df.empty) - tm.assert_frame_equal(read_csv( - tm.get_data_path('tips.csv')), df) - - @tm.network - def test_parse_public_s3_bucket_nrows_python(self): - for ext, comp in [('', None), ('.gz', 'gzip'), ('.bz2', 'bz2')]: - df = read_csv('s3://pandas-test/tips.csv' + ext, engine='python', - nrows=10, compression=comp) - self.assertTrue(isinstance(df, DataFrame)) - self.assertFalse(df.empty) - tm.assert_frame_equal(read_csv( - tm.get_data_path('tips.csv')).iloc[:10], df) - - @tm.network - def test_s3_fails(self): - import boto - with tm.assertRaisesRegexp(boto.exception.S3ResponseError, - 'S3ResponseError: 404 Not Found'): - read_csv('s3://nyqpug/asdf.csv') - - # Receive a permission error when trying to read a private bucket. - # It's irrelevant here that this isn't actually a table. - with tm.assertRaisesRegexp(boto.exception.S3ResponseError, - 'S3ResponseError: 403 Forbidden'): - read_csv('s3://cant_get_it/') - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/io/tests/test_date_converters.py b/pandas/io/tests/test_date_converters.py deleted file mode 100644 index 95fd2d52db009..0000000000000 --- a/pandas/io/tests/test_date_converters.py +++ /dev/null @@ -1,143 +0,0 @@ -from pandas.compat import StringIO -from datetime import date, datetime - -import nose - -import numpy as np - -from pandas import DataFrame, MultiIndex -from pandas.io.parsers import (read_csv, read_table) -from pandas.util.testing import assert_frame_equal -import pandas.io.date_converters as conv -import pandas.util.testing as tm -from pandas.compat.numpy import np_array_datetime64_compat - - -class TestConverters(tm.TestCase): - - def setUp(self): - self.years = np.array([2007, 2008]) - self.months = np.array([1, 2]) - self.days = np.array([3, 4]) - self.hours = np.array([5, 6]) - self.minutes = np.array([7, 8]) - self.seconds = np.array([9, 0]) - self.dates = np.array(['2007/1/3', '2008/2/4'], dtype=object) - self.times = np.array(['05:07:09', '06:08:00'], dtype=object) - self.expected = np.array([datetime(2007, 1, 3, 5, 7, 9), - datetime(2008, 2, 4, 6, 8, 0)]) - - def test_parse_date_time(self): - result = conv.parse_date_time(self.dates, self.times) - self.assertTrue((result == self.expected).all()) - - data = """\ -date, time, a, b -2001-01-05, 10:00:00, 0.0, 10. -2001-01-05, 00:00:00, 1., 11. 
-""" - datecols = {'date_time': [0, 1]} - df = read_table(StringIO(data), sep=',', header=0, - parse_dates=datecols, date_parser=conv.parse_date_time) - self.assertIn('date_time', df) - self.assertEqual(df.date_time.ix[0], datetime(2001, 1, 5, 10, 0, 0)) - - data = ("KORD,19990127, 19:00:00, 18:56:00, 0.8100\n" - "KORD,19990127, 20:00:00, 19:56:00, 0.0100\n" - "KORD,19990127, 21:00:00, 20:56:00, -0.5900\n" - "KORD,19990127, 21:00:00, 21:18:00, -0.9900\n" - "KORD,19990127, 22:00:00, 21:56:00, -0.5900\n" - "KORD,19990127, 23:00:00, 22:56:00, -0.5900") - - date_spec = {'nominal': [1, 2], 'actual': [1, 3]} - df = read_csv(StringIO(data), header=None, parse_dates=date_spec, - date_parser=conv.parse_date_time) - - def test_parse_date_fields(self): - result = conv.parse_date_fields(self.years, self.months, self.days) - expected = np.array([datetime(2007, 1, 3), datetime(2008, 2, 4)]) - self.assertTrue((result == expected).all()) - - data = ("year, month, day, a\n 2001 , 01 , 10 , 10.\n" - "2001 , 02 , 1 , 11.") - datecols = {'ymd': [0, 1, 2]} - df = read_table(StringIO(data), sep=',', header=0, - parse_dates=datecols, - date_parser=conv.parse_date_fields) - self.assertIn('ymd', df) - self.assertEqual(df.ymd.ix[0], datetime(2001, 1, 10)) - - def test_datetime_six_col(self): - result = conv.parse_all_fields(self.years, self.months, self.days, - self.hours, self.minutes, self.seconds) - self.assertTrue((result == self.expected).all()) - - data = """\ -year, month, day, hour, minute, second, a, b -2001, 01, 05, 10, 00, 0, 0.0, 10. -2001, 01, 5, 10, 0, 00, 1., 11. -""" - datecols = {'ymdHMS': [0, 1, 2, 3, 4, 5]} - df = read_table(StringIO(data), sep=',', header=0, - parse_dates=datecols, - date_parser=conv.parse_all_fields) - self.assertIn('ymdHMS', df) - self.assertEqual(df.ymdHMS.ix[0], datetime(2001, 1, 5, 10, 0, 0)) - - def test_datetime_fractional_seconds(self): - data = """\ -year, month, day, hour, minute, second, a, b -2001, 01, 05, 10, 00, 0.123456, 0.0, 10. -2001, 01, 5, 10, 0, 0.500000, 1., 11. -""" - datecols = {'ymdHMS': [0, 1, 2, 3, 4, 5]} - df = read_table(StringIO(data), sep=',', header=0, - parse_dates=datecols, - date_parser=conv.parse_all_fields) - self.assertIn('ymdHMS', df) - self.assertEqual(df.ymdHMS.ix[0], datetime(2001, 1, 5, 10, 0, 0, - microsecond=123456)) - self.assertEqual(df.ymdHMS.ix[1], datetime(2001, 1, 5, 10, 0, 0, - microsecond=500000)) - - def test_generic(self): - data = "year, month, day, a\n 2001, 01, 10, 10.\n 2001, 02, 1, 11." 
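The converter tests deleted above revolve around collapsing several raw columns into one datetime column; the core pattern, reconstructed from the test data:

```python
import pandas as pd
import pandas.io.date_converters as conv
from pandas.compat import StringIO

data = 'date, time, a, b\n2001-01-05, 10:00:00, 0.0, 10.\n'

# parse_dates maps the new column name to the positions of the raw
# columns; date_parser receives those columns and returns datetimes.
df = pd.read_csv(StringIO(data), sep=',', header=0,
                 parse_dates={'date_time': [0, 1]},
                 date_parser=conv.parse_date_time)
```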
- datecols = {'ym': [0, 1]} - dateconverter = lambda y, m: date(year=int(y), month=int(m), day=1) - df = read_table(StringIO(data), sep=',', header=0, - parse_dates=datecols, - date_parser=dateconverter) - self.assertIn('ym', df) - self.assertEqual(df.ym.ix[0], date(2001, 1, 1)) - - def test_dateparser_resolution_if_not_ns(self): - # issue 10245 - data = """\ -date,time,prn,rxstatus -2013-11-03,19:00:00,126,00E80000 -2013-11-03,19:00:00,23,00E80000 -2013-11-03,19:00:00,13,00E80000 -""" - - def date_parser(date, time): - datetime = np_array_datetime64_compat( - date + 'T' + time + 'Z', dtype='datetime64[s]') - return datetime - - df = read_csv(StringIO(data), date_parser=date_parser, - parse_dates={'datetime': ['date', 'time']}, - index_col=['datetime', 'prn']) - - datetimes = np_array_datetime64_compat(['2013-11-03T19:00:00Z'] * 3, - dtype='datetime64[s]') - df_correct = DataFrame(data={'rxstatus': ['00E80000'] * 3}, - index=MultiIndex.from_tuples( - [(datetimes[0], 126), - (datetimes[1], 23), - (datetimes[2], 13)], - names=['datetime', 'prn'])) - assert_frame_equal(df, df_correct) - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/io/tests/test_ga.py b/pandas/io/tests/test_ga.py deleted file mode 100644 index 469e121f633d7..0000000000000 --- a/pandas/io/tests/test_ga.py +++ /dev/null @@ -1,198 +0,0 @@ -# flake8: noqa - -import os -from datetime import datetime - -import warnings -import nose -import pandas as pd -from pandas import compat -from pandas.util.testing import (network, assert_frame_equal, - with_connectivity_check, slow) -import pandas.util.testing as tm - -if compat.PY3: - raise nose.SkipTest("python-gflags does not support Python 3 yet") - -try: - import httplib2 - import apiclient - - # deprecated - with warnings.catch_warnings(record=True): - import pandas.io.ga as ga - - from pandas.io.ga import GAnalytics, read_ga - from pandas.io.auth import AuthenticationConfigError, reset_default_token_store - from pandas.io import auth -except ImportError: - raise nose.SkipTest("need httplib2 and auth libs") - - -class TestGoogle(tm.TestCase): - - _multiprocess_can_split_ = True - - def test_remove_token_store(self): - auth.DEFAULT_TOKEN_FILE = 'test.dat' - with open(auth.DEFAULT_TOKEN_FILE, 'w') as fh: - fh.write('test') - - reset_default_token_store() - self.assertFalse(os.path.exists(auth.DEFAULT_TOKEN_FILE)) - - @with_connectivity_check("http://www.google.com") - def test_getdata(self): - try: - end_date = datetime.now() - start_date = end_date - pd.offsets.Day() * 5 - end_date = end_date.strftime('%Y-%m-%d') - start_date = start_date.strftime('%Y-%m-%d') - - reader = GAnalytics() - df = reader.get_data( - metrics=['avgTimeOnSite', 'visitors', 'newVisits', - 'pageviewsPerVisit'], - start_date=start_date, - end_date=end_date, - dimensions=['date', 'hour'], - parse_dates={'ts': ['date', 'hour']}, - index_col=0) - - self.assertIsInstance(df, pd.DataFrame) - self.assertIsInstance(df.index, pd.DatetimeIndex) - self.assertGreater(len(df), 1) - self.assertTrue('date' not in df) - self.assertTrue('hour' not in df) - self.assertEqual(df.index.name, 'ts') - self.assertTrue('avgTimeOnSite' in df) - self.assertTrue('visitors' in df) - self.assertTrue('newVisits' in df) - self.assertTrue('pageviewsPerVisit' in df) - - df2 = read_ga( - metrics=['avgTimeOnSite', 'visitors', 'newVisits', - 'pageviewsPerVisit'], - start_date=start_date, - end_date=end_date, - dimensions=['date', 'hour'], - parse_dates={'ts': 
['date', 'hour']}, - index_col=0) - - assert_frame_equal(df, df2) - - except AuthenticationConfigError: - raise nose.SkipTest("authentication error") - - @with_connectivity_check("http://www.google.com") - def test_iterator(self): - try: - reader = GAnalytics() - - it = reader.get_data( - metrics='visitors', - start_date='2005-1-1', - dimensions='date', - max_results=10, chunksize=5, - index_col=0) - - df1 = next(it) - df2 = next(it) - - for df in [df1, df2]: - self.assertIsInstance(df, pd.DataFrame) - self.assertIsInstance(df.index, pd.DatetimeIndex) - self.assertEqual(len(df), 5) - self.assertTrue('date' not in df) - self.assertEqual(df.index.name, 'date') - self.assertTrue('visitors' in df) - - self.assertTrue((df2.index > df1.index).all()) - - except AuthenticationConfigError: - raise nose.SkipTest("authentication error") - - def test_v2_advanced_segment_format(self): - advanced_segment_id = 1234567 - query = ga.format_query('google_profile_id', ['visits'], '2013-09-01', segment=advanced_segment_id) - self.assertEqual(query['segment'], 'gaid::' + str(advanced_segment_id), "An integer value should be formatted as an advanced segment.") - - def test_v2_dynamic_segment_format(self): - dynamic_segment_id = 'medium==referral' - query = ga.format_query('google_profile_id', ['visits'], '2013-09-01', segment=dynamic_segment_id) - self.assertEqual(query['segment'], 'dynamic::ga:' + str(dynamic_segment_id), "A string value with more than just letters and numbers should be formatted as a dynamic segment.") - - def test_v3_advanced_segment_common_format(self): - advanced_segment_id = 'aZwqR234' - query = ga.format_query('google_profile_id', ['visits'], '2013-09-01', segment=advanced_segment_id) - self.assertEqual(query['segment'], 'gaid::' + str(advanced_segment_id), "A string value with just letters and numbers should be formatted as an advanced segment.") - - def test_v3_advanced_segment_weird_format(self): - advanced_segment_id = '_aZwqR234-s1' - query = ga.format_query('google_profile_id', ['visits'], '2013-09-01', segment=advanced_segment_id) - self.assertEqual(query['segment'], 'gaid::' + str(advanced_segment_id), "A string value with just letters, numbers, and hyphens should be formatted as an advanced segment.") - - def test_v3_advanced_segment_with_underscore_format(self): - advanced_segment_id = 'aZwqR234_s1' - query = ga.format_query('google_profile_id', ['visits'], '2013-09-01', segment=advanced_segment_id) - self.assertEqual(query['segment'], 'gaid::' + str(advanced_segment_id), "A string value with just letters, numbers, and underscores should be formatted as an advanced segment.") - - @with_connectivity_check("http://www.google.com") - def test_segment(self): - try: - end_date = datetime.now() - start_date = end_date - pd.offsets.Day() * 5 - end_date = end_date.strftime('%Y-%m-%d') - start_date = start_date.strftime('%Y-%m-%d') - - reader = GAnalytics() - df = reader.get_data( - metrics=['avgTimeOnSite', 'visitors', 'newVisits', - 'pageviewsPerVisit'], - start_date=start_date, - end_date=end_date, - segment=-2, - dimensions=['date', 'hour'], - parse_dates={'ts': ['date', 'hour']}, - index_col=0) - - self.assertIsInstance(df, pd.DataFrame) - self.assertIsInstance(df.index, pd.DatetimeIndex) - self.assertGreater(len(df), 1) - self.assertTrue('date' not in df) - self.assertTrue('hour' not in df) - self.assertEqual(df.index.name, 'ts') - self.assertTrue('avgTimeOnSite' in df) - self.assertTrue('visitors' in df) - self.assertTrue('newVisits' in df) - self.assertTrue('pageviewsPerVisit' 
in df) - - # dynamic - df = read_ga( - metrics=['avgTimeOnSite', 'visitors', 'newVisits', - 'pageviewsPerVisit'], - start_date=start_date, - end_date=end_date, - segment="source=~twitter", - dimensions=['date', 'hour'], - parse_dates={'ts': ['date', 'hour']}, - index_col=0) - - assert isinstance(df, pd.DataFrame) - assert isinstance(df.index, pd.DatetimeIndex) - self.assertGreater(len(df), 1) - self.assertTrue('date' not in df) - self.assertTrue('hour' not in df) - self.assertEqual(df.index.name, 'ts') - self.assertTrue('avgTimeOnSite' in df) - self.assertTrue('visitors' in df) - self.assertTrue('newVisits' in df) - self.assertTrue('pageviewsPerVisit' in df) - - except AuthenticationConfigError: - raise nose.SkipTest("authentication error") - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/io/tests/test_gbq.py b/pandas/io/tests/test_gbq.py deleted file mode 100644 index 28820fd71af27..0000000000000 --- a/pandas/io/tests/test_gbq.py +++ /dev/null @@ -1,1148 +0,0 @@ -import re -from datetime import datetime -import nose -import pytz -import platform -from time import sleep -import os -import logging - -import numpy as np - -from distutils.version import StrictVersion -from pandas import compat - -from pandas import NaT -from pandas.compat import u, range -from pandas.core.frame import DataFrame -import pandas.io.gbq as gbq -import pandas.util.testing as tm -from pandas.compat.numpy import np_datetime64_compat - -PROJECT_ID = None -PRIVATE_KEY_JSON_PATH = None -PRIVATE_KEY_JSON_CONTENTS = None - -if compat.PY3: - DATASET_ID = 'pydata_pandas_bq_testing_py3' -else: - DATASET_ID = 'pydata_pandas_bq_testing_py2' - -TABLE_ID = 'new_test' -DESTINATION_TABLE = "{0}.{1}".format(DATASET_ID + "1", TABLE_ID) - -VERSION = platform.python_version() - -_IMPORTS = False -_GOOGLE_API_CLIENT_INSTALLED = False -_GOOGLE_API_CLIENT_VALID_VERSION = False -_HTTPLIB2_INSTALLED = False -_SETUPTOOLS_INSTALLED = False - - -def _skip_if_no_project_id(): - if not _get_project_id(): - raise nose.SkipTest( - "Cannot run integration tests without a project id") - - -def _skip_if_no_private_key_path(): - if not _get_private_key_path(): - raise nose.SkipTest("Cannot run integration tests without a " - "private key json file path") - - -def _skip_if_no_private_key_contents(): - if not _get_private_key_contents(): - raise nose.SkipTest("Cannot run integration tests without a " - "private key json contents") - - -def _in_travis_environment(): - return 'TRAVIS_BUILD_DIR' in os.environ and \ - 'GBQ_PROJECT_ID' in os.environ - - -def _get_project_id(): - if _in_travis_environment(): - return os.environ.get('GBQ_PROJECT_ID') - else: - return PROJECT_ID - - -def _get_private_key_path(): - if _in_travis_environment(): - return os.path.join(*[os.environ.get('TRAVIS_BUILD_DIR'), 'ci', - 'travis_gbq.json']) - else: - return PRIVATE_KEY_JSON_PATH - - -def _get_private_key_contents(): - if _in_travis_environment(): - with open(os.path.join(*[os.environ.get('TRAVIS_BUILD_DIR'), 'ci', - 'travis_gbq.json'])) as f: - return f.read() - else: - return PRIVATE_KEY_JSON_CONTENTS - - -def _test_imports(): - global _GOOGLE_API_CLIENT_INSTALLED, _GOOGLE_API_CLIENT_VALID_VERSION, \ - _HTTPLIB2_INSTALLED, _SETUPTOOLS_INSTALLED - - try: - import pkg_resources - _SETUPTOOLS_INSTALLED = True - except ImportError: - _SETUPTOOLS_INSTALLED = False - - if compat.PY3: - google_api_minimum_version = '1.4.1' - else: - google_api_minimum_version = '1.2.0' - - if 
_SETUPTOOLS_INSTALLED: - try: - try: - from googleapiclient.discovery import build # noqa - from googleapiclient.errors import HttpError # noqa - except: - from apiclient.discovery import build # noqa - from apiclient.errors import HttpError # noqa - - from oauth2client.client import OAuth2WebServerFlow # noqa - from oauth2client.client import AccessTokenRefreshError # noqa - - from oauth2client.file import Storage # noqa - from oauth2client.tools import run_flow # noqa - _GOOGLE_API_CLIENT_INSTALLED = True - _GOOGLE_API_CLIENT_VERSION = pkg_resources.get_distribution( - 'google-api-python-client').version - - if (StrictVersion(_GOOGLE_API_CLIENT_VERSION) >= - StrictVersion(google_api_minimum_version)): - _GOOGLE_API_CLIENT_VALID_VERSION = True - - except ImportError: - _GOOGLE_API_CLIENT_INSTALLED = False - - try: - import httplib2 # noqa - _HTTPLIB2_INSTALLED = True - except ImportError: - _HTTPLIB2_INSTALLED = False - - if not _SETUPTOOLS_INSTALLED: - raise ImportError('Could not import pkg_resources (setuptools).') - - if not _GOOGLE_API_CLIENT_INSTALLED: - raise ImportError('Could not import Google API Client.') - - if not _GOOGLE_API_CLIENT_VALID_VERSION: - raise ImportError("pandas requires google-api-python-client >= {0} " - "for Google BigQuery support, " - "current version {1}" - .format(google_api_minimum_version, - _GOOGLE_API_CLIENT_VERSION)) - - if not _HTTPLIB2_INSTALLED: - raise ImportError( - "pandas requires httplib2 for Google BigQuery support") - - # Bug fix for https://github.com/pandas-dev/pandas/issues/12572 - # We need to know that a supported version of oauth2client is installed - # Test that either of the following is installed: - # - SignedJwtAssertionCredentials from oauth2client.client - # - ServiceAccountCredentials from oauth2client.service_account - # SignedJwtAssertionCredentials is available in oauthclient < 2.0.0 - # ServiceAccountCredentials is available in oauthclient >= 2.0.0 - oauth2client_v1 = True - oauth2client_v2 = True - - try: - from oauth2client.client import SignedJwtAssertionCredentials # noqa - except ImportError: - oauth2client_v1 = False - - try: - from oauth2client.service_account import ServiceAccountCredentials # noqa - except ImportError: - oauth2client_v2 = False - - if not oauth2client_v1 and not oauth2client_v2: - raise ImportError("Missing oauth2client required for BigQuery " - "service account support") - - -def _setup_common(): - try: - _test_imports() - except (ImportError, NotImplementedError) as import_exception: - raise nose.SkipTest(import_exception) - - if _in_travis_environment(): - logging.getLogger('oauth2client').setLevel(logging.ERROR) - logging.getLogger('apiclient').setLevel(logging.ERROR) - - -def _check_if_can_get_correct_default_credentials(): - # Checks if "Application Default Credentials" can be fetched - # from the environment the tests are running in. 
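The `_test_imports` helper above layers three checks: an import fallback from the new `googleapiclient` namespace to the legacy `apiclient` one, a `pkg_resources`/`StrictVersion` gate on the installed client version, and a probe for whichever `oauth2client` credential class is available. A minimal sketch of that guard pattern follows; the helper name is illustrative, and the `'1.4.1'` default is the PY3 minimum taken from the code above (this is a sketch, not the module's actual API):

```python
from distutils.version import StrictVersion


def _google_client_available(minimum='1.4.1'):
    # Sketch of the optional-dependency guard used by _test_imports:
    # try the current package name, fall back to the legacy alias,
    # then enforce a minimum version before reporting success.
    try:
        import pkg_resources
    except ImportError:
        return False  # setuptools itself is missing
    try:
        try:
            from googleapiclient.discovery import build  # noqa
        except ImportError:
            from apiclient.discovery import build  # noqa
    except ImportError:
        return False
    version = pkg_resources.get_distribution(
        'google-api-python-client').version
    return StrictVersion(version) >= StrictVersion(minimum)
```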
- # See Issue #13577 - - import httplib2 - try: - from googleapiclient.discovery import build - except ImportError: - from apiclient.discovery import build - try: - from oauth2client.client import GoogleCredentials - credentials = GoogleCredentials.get_application_default() - http = httplib2.Http() - http = credentials.authorize(http) - bigquery_service = build('bigquery', 'v2', http=http) - jobs = bigquery_service.jobs() - job_data = {'configuration': {'query': {'query': 'SELECT 1'}}} - jobs.insert(projectId=_get_project_id(), body=job_data).execute() - return True - except: - return False - - -def clean_gbq_environment(private_key=None): - dataset = gbq._Dataset(_get_project_id(), private_key=private_key) - - for i in range(1, 10): - if DATASET_ID + str(i) in dataset.datasets(): - dataset_id = DATASET_ID + str(i) - table = gbq._Table(_get_project_id(), dataset_id, - private_key=private_key) - for j in range(1, 20): - if TABLE_ID + str(j) in dataset.tables(dataset_id): - table.delete(TABLE_ID + str(j)) - - dataset.delete(dataset_id) - - -def make_mixed_dataframe_v2(test_size): - # create df to test for all BQ datatypes except RECORD - bools = np.random.randint(2, size=(1, test_size)).astype(bool) - flts = np.random.randn(1, test_size) - ints = np.random.randint(1, 10, size=(1, test_size)) - strs = np.random.randint(1, 10, size=(1, test_size)).astype(str) - times = [datetime.now(pytz.timezone('US/Arizona')) - for t in range(test_size)] - return DataFrame({'bools': bools[0], - 'flts': flts[0], - 'ints': ints[0], - 'strs': strs[0], - 'times': times[0]}, - index=range(test_size)) - - -def test_generate_bq_schema_deprecated(): - # 11121 Deprecation of generate_bq_schema - with tm.assert_produces_warning(FutureWarning): - df = make_mixed_dataframe_v2(10) - gbq.generate_bq_schema(df) - - -class TestGBQConnectorIntegration(tm.TestCase): - - def setUp(self): - _setup_common() - _skip_if_no_project_id() - - self.sut = gbq.GbqConnector(_get_project_id(), - private_key=_get_private_key_path()) - - def test_should_be_able_to_make_a_connector(self): - self.assertTrue(self.sut is not None, - 'Could not create a GbqConnector') - - def test_should_be_able_to_get_valid_credentials(self): - credentials = self.sut.get_credentials() - self.assertFalse(credentials.invalid, 'Returned credentials invalid') - - def test_should_be_able_to_get_a_bigquery_service(self): - bigquery_service = self.sut.get_service() - self.assertTrue(bigquery_service is not None, 'No service returned') - - def test_should_be_able_to_get_schema_from_query(self): - schema, pages = self.sut.run_query('SELECT 1') - self.assertTrue(schema is not None) - - def test_should_be_able_to_get_results_from_query(self): - schema, pages = self.sut.run_query('SELECT 1') - self.assertTrue(pages is not None) - - def test_get_application_default_credentials_does_not_throw_error(self): - if _check_if_can_get_correct_default_credentials(): - raise nose.SkipTest("Can get default_credentials " - "from the environment!") - credentials = self.sut.get_application_default_credentials() - self.assertIsNone(credentials) - - def test_get_application_default_credentials_returns_credentials(self): - if not _check_if_can_get_correct_default_credentials(): - raise nose.SkipTest("Cannot get default_credentials " - "from the environment!") - from oauth2client.client import GoogleCredentials - credentials = self.sut.get_application_default_credentials() - self.assertTrue(isinstance(credentials, GoogleCredentials)) - - -class 
TestGBQConnectorServiceAccountKeyPathIntegration(tm.TestCase): - def setUp(self): - _setup_common() - - _skip_if_no_project_id() - _skip_if_no_private_key_path() - - self.sut = gbq.GbqConnector(_get_project_id(), - private_key=_get_private_key_path()) - - def test_should_be_able_to_make_a_connector(self): - self.assertTrue(self.sut is not None, - 'Could not create a GbqConnector') - - def test_should_be_able_to_get_valid_credentials(self): - credentials = self.sut.get_credentials() - self.assertFalse(credentials.invalid, 'Returned credentials invalid') - - def test_should_be_able_to_get_a_bigquery_service(self): - bigquery_service = self.sut.get_service() - self.assertTrue(bigquery_service is not None, 'No service returned') - - def test_should_be_able_to_get_schema_from_query(self): - schema, pages = self.sut.run_query('SELECT 1') - self.assertTrue(schema is not None) - - def test_should_be_able_to_get_results_from_query(self): - schema, pages = self.sut.run_query('SELECT 1') - self.assertTrue(pages is not None) - - -class TestGBQConnectorServiceAccountKeyContentsIntegration(tm.TestCase): - def setUp(self): - _setup_common() - - _skip_if_no_project_id() - _skip_if_no_private_key_contents() - - self.sut = gbq.GbqConnector(_get_project_id(), - private_key=_get_private_key_contents()) - - def test_should_be_able_to_make_a_connector(self): - self.assertTrue(self.sut is not None, - 'Could not create a GbqConnector') - - def test_should_be_able_to_get_valid_credentials(self): - credentials = self.sut.get_credentials() - self.assertFalse(credentials.invalid, 'Returned credentials invalid') - - def test_should_be_able_to_get_a_bigquery_service(self): - bigquery_service = self.sut.get_service() - self.assertTrue(bigquery_service is not None, 'No service returned') - - def test_should_be_able_to_get_schema_from_query(self): - schema, pages = self.sut.run_query('SELECT 1') - self.assertTrue(schema is not None) - - def test_should_be_able_to_get_results_from_query(self): - schema, pages = self.sut.run_query('SELECT 1') - self.assertTrue(pages is not None) - - -class GBQUnitTests(tm.TestCase): - def setUp(self): - _setup_common() - - def test_import_google_api_python_client(self): - if compat.PY2: - with tm.assertRaises(ImportError): - from googleapiclient.discovery import build # noqa - from googleapiclient.errors import HttpError # noqa - from apiclient.discovery import build # noqa - from apiclient.errors import HttpError # noqa - else: - from googleapiclient.discovery import build # noqa - from googleapiclient.errors import HttpError # noqa - - def test_should_return_bigquery_integers_as_python_floats(self): - result = gbq._parse_entry(1, 'INTEGER') - tm.assert_equal(result, float(1)) - - def test_should_return_bigquery_floats_as_python_floats(self): - result = gbq._parse_entry(1, 'FLOAT') - tm.assert_equal(result, float(1)) - - def test_should_return_bigquery_timestamps_as_numpy_datetime(self): - result = gbq._parse_entry('0e9', 'TIMESTAMP') - tm.assert_equal(result, np_datetime64_compat('1970-01-01T00:00:00Z')) - - def test_should_return_bigquery_booleans_as_python_booleans(self): - result = gbq._parse_entry('false', 'BOOLEAN') - tm.assert_equal(result, False) - - def test_should_return_bigquery_strings_as_python_strings(self): - result = gbq._parse_entry('STRING', 'STRING') - tm.assert_equal(result, 'STRING') - - def test_to_gbq_should_fail_if_invalid_table_name_passed(self): - with tm.assertRaises(gbq.NotFoundException): - gbq.to_gbq(DataFrame(), 'invalid_table_name', project_id="1234") - - def
test_to_gbq_with_no_project_id_given_should_fail(self): - with tm.assertRaises(TypeError): - gbq.to_gbq(DataFrame(), 'dataset.tablename') - - def test_read_gbq_with_no_project_id_given_should_fail(self): - with tm.assertRaises(TypeError): - gbq.read_gbq('SELECT "1" as NUMBER_1') - - def test_that_parse_data_works_properly(self): - test_schema = {'fields': [ - {'mode': 'NULLABLE', 'name': 'VALID_STRING', 'type': 'STRING'}]} - test_page = [{'f': [{'v': 'PI'}]}] - - test_output = gbq._parse_data(test_schema, test_page) - correct_output = DataFrame({'VALID_STRING': ['PI']}) - tm.assert_frame_equal(test_output, correct_output) - - def test_read_gbq_with_invalid_private_key_json_should_fail(self): - with tm.assertRaises(gbq.InvalidPrivateKeyFormat): - gbq.read_gbq('SELECT 1', project_id='x', private_key='y') - - def test_read_gbq_with_empty_private_key_json_should_fail(self): - with tm.assertRaises(gbq.InvalidPrivateKeyFormat): - gbq.read_gbq('SELECT 1', project_id='x', private_key='{}') - - def test_read_gbq_with_private_key_json_wrong_types_should_fail(self): - with tm.assertRaises(gbq.InvalidPrivateKeyFormat): - gbq.read_gbq( - 'SELECT 1', project_id='x', - private_key='{ "client_email" : 1, "private_key" : True }') - - def test_read_gbq_with_empty_private_key_file_should_fail(self): - with tm.ensure_clean() as empty_file_path: - with tm.assertRaises(gbq.InvalidPrivateKeyFormat): - gbq.read_gbq('SELECT 1', project_id='x', - private_key=empty_file_path) - - def test_read_gbq_with_corrupted_private_key_json_should_fail(self): - _skip_if_no_private_key_path() - - with tm.assertRaises(gbq.InvalidPrivateKeyFormat): - gbq.read_gbq( - 'SELECT 1', project_id='x', - private_key=re.sub('[a-z]', '9', _get_private_key_path())) - - -class TestReadGBQIntegration(tm.TestCase): - - @classmethod - def setUpClass(cls): - # - GLOBAL CLASS FIXTURES - - # put here any instruction you want to execute only *ONCE* *BEFORE* - # executing *ALL* tests described below. - - _skip_if_no_project_id() - - _setup_common() - - def setUp(self): - # - PER-TEST FIXTURES - - # put here any instruction you want to be run *BEFORE* *EVERY* test is - # executed. - pass - - @classmethod - def tearDownClass(cls): - # - GLOBAL CLASS FIXTURES - - # put here any instruction you want to execute only *ONCE* *AFTER* - # executing all tests. - pass - - def tearDown(self): - # - PER-TEST FIXTURES - - # put here any instructions you want to be run *AFTER* *EVERY* test is - # executed. 
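The `GBQUnitTests` assertions above pin down `gbq._parse_entry`'s type coercions without touching the network. The same expectations, restated compactly with the same private helper and arguments used in those tests (runnable wherever `pandas.io.gbq` imports):

```python
import pandas.io.gbq as gbq

# BigQuery INTEGER and FLOAT fields both come back as Python floats ...
assert gbq._parse_entry(1, 'INTEGER') == 1.0
assert gbq._parse_entry(1, 'FLOAT') == 1.0
# ... the string 'false' becomes a real boolean ...
assert gbq._parse_entry('false', 'BOOLEAN') is False
# ... and STRING values pass through unchanged.
assert gbq._parse_entry('STRING', 'STRING') == 'STRING'
```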
- pass - - def test_should_read_as_user_account(self): - if _in_travis_environment(): - raise nose.SkipTest("Cannot run local auth in travis environment") - - query = 'SELECT "PI" as VALID_STRING' - df = gbq.read_gbq(query, project_id=_get_project_id()) - tm.assert_frame_equal(df, DataFrame({'VALID_STRING': ['PI']})) - - def test_should_read_as_service_account_with_key_path(self): - _skip_if_no_private_key_path() - query = 'SELECT "PI" as VALID_STRING' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame({'VALID_STRING': ['PI']})) - - def test_should_read_as_service_account_with_key_contents(self): - _skip_if_no_private_key_contents() - query = 'SELECT "PI" as VALID_STRING' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_contents()) - tm.assert_frame_equal(df, DataFrame({'VALID_STRING': ['PI']})) - - def test_should_properly_handle_valid_strings(self): - query = 'SELECT "PI" as VALID_STRING' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame({'VALID_STRING': ['PI']})) - - def test_should_properly_handle_empty_strings(self): - query = 'SELECT "" as EMPTY_STRING' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame({'EMPTY_STRING': [""]})) - - def test_should_properly_handle_null_strings(self): - query = 'SELECT STRING(NULL) as NULL_STRING' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame({'NULL_STRING': [None]})) - - def test_should_properly_handle_valid_integers(self): - query = 'SELECT INTEGER(3) as VALID_INTEGER' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame({'VALID_INTEGER': [3]})) - - def test_should_properly_handle_null_integers(self): - query = 'SELECT INTEGER(NULL) as NULL_INTEGER' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame({'NULL_INTEGER': [np.nan]})) - - def test_should_properly_handle_valid_floats(self): - query = 'SELECT PI() as VALID_FLOAT' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame( - {'VALID_FLOAT': [3.141592653589793]})) - - def test_should_properly_handle_null_floats(self): - query = 'SELECT FLOAT(NULL) as NULL_FLOAT' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame({'NULL_FLOAT': [np.nan]})) - - def test_should_properly_handle_timestamp_unix_epoch(self): - query = 'SELECT TIMESTAMP("1970-01-01 00:00:00") as UNIX_EPOCH' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame( - {'UNIX_EPOCH': [np.datetime64('1970-01-01T00:00:00.000000Z')]})) - - def test_should_properly_handle_arbitrary_timestamp(self): - query = 'SELECT TIMESTAMP("2004-09-15 05:00:00") as VALID_TIMESTAMP' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame({ - 'VALID_TIMESTAMP': [np.datetime64('2004-09-15T05:00:00.000000Z')] - })) - - def test_should_properly_handle_null_timestamp(self): - query = 'SELECT TIMESTAMP(NULL) as 
NULL_TIMESTAMP' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame({'NULL_TIMESTAMP': [NaT]})) - - def test_should_properly_handle_true_boolean(self): - query = 'SELECT BOOLEAN(TRUE) as TRUE_BOOLEAN' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame({'TRUE_BOOLEAN': [True]})) - - def test_should_properly_handle_false_boolean(self): - query = 'SELECT BOOLEAN(FALSE) as FALSE_BOOLEAN' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame({'FALSE_BOOLEAN': [False]})) - - def test_should_properly_handle_null_boolean(self): - query = 'SELECT BOOLEAN(NULL) as NULL_BOOLEAN' - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, DataFrame({'NULL_BOOLEAN': [None]})) - - def test_unicode_string_conversion_and_normalization(self): - correct_test_datatype = DataFrame( - {'UNICODE_STRING': [u("\xe9\xfc")]} - ) - - unicode_string = "\xc3\xa9\xc3\xbc" - - if compat.PY3: - unicode_string = unicode_string.encode('latin-1').decode('utf8') - - query = 'SELECT "{0}" as UNICODE_STRING'.format(unicode_string) - - df = gbq.read_gbq(query, project_id=_get_project_id(), - private_key=_get_private_key_path()) - tm.assert_frame_equal(df, correct_test_datatype) - - def test_index_column(self): - query = "SELECT 'a' as STRING_1, 'b' as STRING_2" - result_frame = gbq.read_gbq(query, project_id=_get_project_id(), - index_col="STRING_1", - private_key=_get_private_key_path()) - correct_frame = DataFrame( - {'STRING_1': ['a'], 'STRING_2': ['b']}).set_index("STRING_1") - tm.assert_equal(result_frame.index.name, correct_frame.index.name) - - def test_column_order(self): - query = "SELECT 'a' as STRING_1, 'b' as STRING_2, 'c' as STRING_3" - col_order = ['STRING_3', 'STRING_1', 'STRING_2'] - result_frame = gbq.read_gbq(query, project_id=_get_project_id(), - col_order=col_order, - private_key=_get_private_key_path()) - correct_frame = DataFrame({'STRING_1': ['a'], 'STRING_2': [ - 'b'], 'STRING_3': ['c']})[col_order] - tm.assert_frame_equal(result_frame, correct_frame) - - def test_column_order_plus_index(self): - query = "SELECT 'a' as STRING_1, 'b' as STRING_2, 'c' as STRING_3" - col_order = ['STRING_3', 'STRING_2'] - result_frame = gbq.read_gbq(query, project_id=_get_project_id(), - index_col='STRING_1', col_order=col_order, - private_key=_get_private_key_path()) - correct_frame = DataFrame( - {'STRING_1': ['a'], 'STRING_2': ['b'], 'STRING_3': ['c']}) - correct_frame.set_index('STRING_1', inplace=True) - correct_frame = correct_frame[col_order] - tm.assert_frame_equal(result_frame, correct_frame) - - def test_malformed_query(self): - with tm.assertRaises(gbq.GenericGBQException): - gbq.read_gbq("SELCET * FORM [publicdata:samples.shakespeare]", - project_id=_get_project_id(), - private_key=_get_private_key_path()) - - def test_bad_project_id(self): - with tm.assertRaises(gbq.GenericGBQException): - gbq.read_gbq("SELECT 1", project_id='001', - private_key=_get_private_key_path()) - - def test_bad_table_name(self): - with tm.assertRaises(gbq.GenericGBQException): - gbq.read_gbq("SELECT * FROM [publicdata:samples.nope]", - project_id=_get_project_id(), - private_key=_get_private_key_path()) - - def test_download_dataset_larger_than_200k_rows(self): - test_size = 200005 - # Test for known BigQuery bug in datasets larger 
than 100k rows - # http://stackoverflow.com/questions/19145587/bq-py-not-paging-results - df = gbq.read_gbq("SELECT id FROM [publicdata:samples.wikipedia] " - "GROUP EACH BY id ORDER BY id ASC LIMIT {0}" - .format(test_size), - project_id=_get_project_id(), - private_key=_get_private_key_path()) - self.assertEqual(len(df.drop_duplicates()), test_size) - - def test_zero_rows(self): - # Bug fix for https://github.com/pandas-dev/pandas/issues/10273 - df = gbq.read_gbq("SELECT title, id " - "FROM [publicdata:samples.wikipedia] " - "WHERE timestamp=-9999999", - project_id=_get_project_id(), - private_key=_get_private_key_path()) - page_array = np.zeros( - (0,), dtype=[('title', object), ('id', np.dtype(float))]) - expected_result = DataFrame(page_array, columns=['title', 'id']) - self.assert_frame_equal(df, expected_result) - - def test_legacy_sql(self): - legacy_sql = "SELECT id FROM [publicdata.samples.wikipedia] LIMIT 10" - - # Test that a legacy sql statement fails when - # setting dialect='standard' - with tm.assertRaises(gbq.GenericGBQException): - gbq.read_gbq(legacy_sql, project_id=_get_project_id(), - dialect='standard', - private_key=_get_private_key_path()) - - # Test that a legacy sql statement succeeds when - # setting dialect='legacy' - df = gbq.read_gbq(legacy_sql, project_id=_get_project_id(), - dialect='legacy', - private_key=_get_private_key_path()) - self.assertEqual(len(df.drop_duplicates()), 10) - - def test_standard_sql(self): - standard_sql = "SELECT DISTINCT id FROM " \ - "`publicdata.samples.wikipedia` LIMIT 10" - - # Test that a standard sql statement fails when using - # the legacy SQL dialect (default value) - with tm.assertRaises(gbq.GenericGBQException): - gbq.read_gbq(standard_sql, project_id=_get_project_id(), - private_key=_get_private_key_path()) - - # Test that a standard sql statement succeeds when - # setting dialect='standard' - df = gbq.read_gbq(standard_sql, project_id=_get_project_id(), - dialect='standard', - private_key=_get_private_key_path()) - self.assertEqual(len(df.drop_duplicates()), 10) - - def test_invalid_option_for_sql_dialect(self): - sql_statement = "SELECT DISTINCT id FROM " \ - "`publicdata.samples.wikipedia` LIMIT 10" - - # Test that an invalid option for `dialect` raises ValueError - with tm.assertRaises(ValueError): - gbq.read_gbq(sql_statement, project_id=_get_project_id(), - dialect='invalid', - private_key=_get_private_key_path()) - - # Test that a correct option for dialect succeeds - # to make sure ValueError was due to invalid dialect - gbq.read_gbq(sql_statement, project_id=_get_project_id(), - dialect='standard', private_key=_get_private_key_path()) - - -class TestToGBQIntegration(tm.TestCase): - # Changes to BigQuery table schema may take up to 2 minutes as of May 2015 - # As a workaround to this issue, each test should use a unique table name. - # Make sure to modify the for loop range in the tearDownClass when a new - # test is added See `Issue 191 - # `__ - - @classmethod - def setUpClass(cls): - # - GLOBAL CLASS FIXTURES - - # put here any instruction you want to execute only *ONCE* *BEFORE* - # executing *ALL* tests described below. - - _skip_if_no_project_id() - - _setup_common() - clean_gbq_environment(_get_private_key_path()) - - gbq._Dataset(_get_project_id(), - private_key=_get_private_key_path() - ).create(DATASET_ID + "1") - - def setUp(self): - # - PER-TEST FIXTURES - - # put here any instruction you want to be run *BEFORE* *EVERY* test is - # executed. 
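`test_legacy_sql`, `test_standard_sql` and `test_invalid_option_for_sql_dialect` above fix the contract of the `dialect` keyword. A condensed restatement follows; both queries are taken from the tests, the project id is a placeholder, and the calls only succeed with working BigQuery credentials:

```python
import pandas.io.gbq as gbq

legacy_sql = "SELECT id FROM [publicdata.samples.wikipedia] LIMIT 10"
standard_sql = ("SELECT DISTINCT id FROM "
                "`publicdata.samples.wikipedia` LIMIT 10")

# Legacy bracketed table references parse only under dialect='legacy',
# which is also the default ...
df = gbq.read_gbq(legacy_sql, project_id='my-project-id',
                  dialect='legacy')
# ... backtick references parse only under dialect='standard' ...
df = gbq.read_gbq(standard_sql, project_id='my-project-id',
                  dialect='standard')
# ... and any other dialect value raises ValueError.
```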
- - self.dataset = gbq._Dataset(_get_project_id(), - private_key=_get_private_key_path()) - self.table = gbq._Table(_get_project_id(), DATASET_ID + "1", - private_key=_get_private_key_path()) - self.sut = gbq.GbqConnector(_get_project_id(), - private_key=_get_private_key_path()) - - @classmethod - def tearDownClass(cls): - # - GLOBAL CLASS FIXTURES - - # put here any instruction you want to execute only *ONCE* *AFTER* - # executing all tests. - - clean_gbq_environment(_get_private_key_path()) - - def tearDown(self): - # - PER-TEST FIXTURES - - # put here any instructions you want to be run *AFTER* *EVERY* test is - # executed. - pass - - def test_upload_data(self): - destination_table = DESTINATION_TABLE + "1" - - test_size = 20001 - df = make_mixed_dataframe_v2(test_size) - - gbq.to_gbq(df, destination_table, _get_project_id(), chunksize=10000, - private_key=_get_private_key_path()) - - sleep(30) # <- Curses Google!!! - - result = gbq.read_gbq("SELECT COUNT(*) as NUM_ROWS FROM {0}" - .format(destination_table), - project_id=_get_project_id(), - private_key=_get_private_key_path()) - self.assertEqual(result['NUM_ROWS'][0], test_size) - - def test_upload_data_if_table_exists_fail(self): - destination_table = DESTINATION_TABLE + "2" - - test_size = 10 - df = make_mixed_dataframe_v2(test_size) - self.table.create(TABLE_ID + "2", gbq._generate_bq_schema(df)) - - # Test the default value of if_exists is 'fail' - with tm.assertRaises(gbq.TableCreationError): - gbq.to_gbq(df, destination_table, _get_project_id(), - private_key=_get_private_key_path()) - - # Test the if_exists parameter with value 'fail' - with tm.assertRaises(gbq.TableCreationError): - gbq.to_gbq(df, destination_table, _get_project_id(), - if_exists='fail', private_key=_get_private_key_path()) - - def test_upload_data_if_table_exists_append(self): - destination_table = DESTINATION_TABLE + "3" - - test_size = 10 - df = make_mixed_dataframe_v2(test_size) - df_different_schema = tm.makeMixedDataFrame() - - # Initialize table with sample data - gbq.to_gbq(df, destination_table, _get_project_id(), chunksize=10000, - private_key=_get_private_key_path()) - - # Test the if_exists parameter with value 'append' - gbq.to_gbq(df, destination_table, _get_project_id(), - if_exists='append', private_key=_get_private_key_path()) - - sleep(30) # <- Curses Google!!! - - result = gbq.read_gbq("SELECT COUNT(*) as NUM_ROWS FROM {0}" - .format(destination_table), - project_id=_get_project_id(), - private_key=_get_private_key_path()) - self.assertEqual(result['NUM_ROWS'][0], test_size * 2) - - # Try inserting with a different schema, confirm failure - with tm.assertRaises(gbq.InvalidSchema): - gbq.to_gbq(df_different_schema, destination_table, - _get_project_id(), if_exists='append', - private_key=_get_private_key_path()) - - def test_upload_data_if_table_exists_replace(self): - - raise nose.SkipTest("buggy test") - - destination_table = DESTINATION_TABLE + "4" - - test_size = 10 - df = make_mixed_dataframe_v2(test_size) - df_different_schema = tm.makeMixedDataFrame() - - # Initialize table with sample data - gbq.to_gbq(df, destination_table, _get_project_id(), chunksize=10000, - private_key=_get_private_key_path()) - - # Test the if_exists parameter with the value 'replace'. - gbq.to_gbq(df_different_schema, destination_table, - _get_project_id(), if_exists='replace', - private_key=_get_private_key_path()) - - sleep(30) # <- Curses Google!!! 
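The three `if_exists` tests above encode `to_gbq`'s collision behaviour. In outline (placeholder table and project ids; running this for real requires credentials and an existing destination table):

```python
import pandas.io.gbq as gbq
from pandas import DataFrame

df = DataFrame({'a': [1.0, 2.0]})

# 'fail' is the default: writing onto an existing table raises
# gbq.TableCreationError.
gbq.to_gbq(df, 'my_dataset.my_table', 'my-project-id')
# 'append' streams extra rows into the existing table, but raises
# gbq.InvalidSchema if the frame's schema differs from the table's.
gbq.to_gbq(df, 'my_dataset.my_table', 'my-project-id',
           if_exists='append')
# 'replace' drops and recreates the table with the new frame's schema.
gbq.to_gbq(df, 'my_dataset.my_table', 'my-project-id',
           if_exists='replace')
```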
- - result = gbq.read_gbq("SELECT COUNT(*) as NUM_ROWS FROM {0}" - .format(destination_table), - project_id=_get_project_id(), - private_key=_get_private_key_path()) - self.assertEqual(result['NUM_ROWS'][0], 5) - - def test_google_upload_errors_should_raise_exception(self): - destination_table = DESTINATION_TABLE + "5" - - test_timestamp = datetime.now(pytz.timezone('US/Arizona')) - bad_df = DataFrame({'bools': [False, False], 'flts': [0.0, 1.0], - 'ints': [0, '1'], 'strs': ['a', 1], - 'times': [test_timestamp, test_timestamp]}, - index=range(2)) - - with tm.assertRaises(gbq.StreamingInsertError): - gbq.to_gbq(bad_df, destination_table, _get_project_id(), - verbose=True, private_key=_get_private_key_path()) - - def test_generate_schema(self): - df = tm.makeMixedDataFrame() - schema = gbq._generate_bq_schema(df) - - test_schema = {'fields': [{'name': 'A', 'type': 'FLOAT'}, - {'name': 'B', 'type': 'FLOAT'}, - {'name': 'C', 'type': 'STRING'}, - {'name': 'D', 'type': 'TIMESTAMP'}]} - - self.assertEqual(schema, test_schema) - - def test_create_table(self): - destination_table = TABLE_ID + "6" - test_schema = {'fields': [{'name': 'A', 'type': 'FLOAT'}, - {'name': 'B', 'type': 'FLOAT'}, - {'name': 'C', 'type': 'STRING'}, - {'name': 'D', 'type': 'TIMESTAMP'}]} - self.table.create(destination_table, test_schema) - self.assertTrue(self.table.exists(destination_table), - 'Expected table to exist') - - def test_table_does_not_exist(self): - self.assertTrue(not self.table.exists(TABLE_ID + "7"), - 'Expected table not to exist') - - def test_delete_table(self): - destination_table = TABLE_ID + "8" - test_schema = {'fields': [{'name': 'A', 'type': 'FLOAT'}, - {'name': 'B', 'type': 'FLOAT'}, - {'name': 'C', 'type': 'STRING'}, - {'name': 'D', 'type': 'TIMESTAMP'}]} - self.table.create(destination_table, test_schema) - self.table.delete(destination_table) - self.assertTrue(not self.table.exists( - destination_table), 'Expected table not to exist') - - def test_list_table(self): - destination_table = TABLE_ID + "9" - test_schema = {'fields': [{'name': 'A', 'type': 'FLOAT'}, - {'name': 'B', 'type': 'FLOAT'}, - {'name': 'C', 'type': 'STRING'}, - {'name': 'D', 'type': 'TIMESTAMP'}]} - self.table.create(destination_table, test_schema) - self.assertTrue( - destination_table in self.dataset.tables(DATASET_ID + "1"), - 'Expected table list to contain table {0}' - .format(destination_table)) - - def test_verify_schema_allows_flexible_column_order(self): - destination_table = TABLE_ID + "10" - test_schema_1 = {'fields': [{'name': 'A', 'type': 'FLOAT'}, - {'name': 'B', 'type': 'FLOAT'}, - {'name': 'C', 'type': 'STRING'}, - {'name': 'D', 'type': 'TIMESTAMP'}]} - test_schema_2 = {'fields': [{'name': 'A', 'type': 'FLOAT'}, - {'name': 'C', 'type': 'STRING'}, - {'name': 'B', 'type': 'FLOAT'}, - {'name': 'D', 'type': 'TIMESTAMP'}]} - - self.table.create(destination_table, test_schema_1) - self.assertTrue(self.sut.verify_schema( - DATASET_ID + "1", destination_table, test_schema_2), - 'Expected schema to match') - - def test_verify_schema_fails_different_data_type(self): - destination_table = TABLE_ID + "11" - test_schema_1 = {'fields': [{'name': 'A', 'type': 'FLOAT'}, - {'name': 'B', 'type': 'FLOAT'}, - {'name': 'C', 'type': 'STRING'}, - {'name': 'D', 'type': 'TIMESTAMP'}]} - test_schema_2 = {'fields': [{'name': 'A', 'type': 'FLOAT'}, - {'name': 'B', 'type': 'STRING'}, - {'name': 'C', 'type': 'STRING'}, - {'name': 'D', 'type': 'TIMESTAMP'}]} - - self.table.create(destination_table, test_schema_1) - 
self.assertFalse(self.sut.verify_schema( - DATASET_ID + "1", destination_table, test_schema_2), - 'Expected different schema') - - def test_verify_schema_fails_different_structure(self): - destination_table = TABLE_ID + "12" - test_schema_1 = {'fields': [{'name': 'A', 'type': 'FLOAT'}, - {'name': 'B', 'type': 'FLOAT'}, - {'name': 'C', 'type': 'STRING'}, - {'name': 'D', 'type': 'TIMESTAMP'}]} - test_schema_2 = {'fields': [{'name': 'A', 'type': 'FLOAT'}, - {'name': 'B2', 'type': 'FLOAT'}, - {'name': 'C', 'type': 'STRING'}, - {'name': 'D', 'type': 'TIMESTAMP'}]} - - self.table.create(destination_table, test_schema_1) - self.assertFalse(self.sut.verify_schema( - DATASET_ID + "1", destination_table, test_schema_2), - 'Expected different schema') - - def test_upload_data_flexible_column_order(self): - destination_table = DESTINATION_TABLE + "13" - - test_size = 10 - df = make_mixed_dataframe_v2(test_size) - - # Initialize table with sample data - gbq.to_gbq(df, destination_table, _get_project_id(), chunksize=10000, - private_key=_get_private_key_path()) - - df_columns_reversed = df[df.columns[::-1]] - - gbq.to_gbq(df_columns_reversed, destination_table, _get_project_id(), - if_exists='append', private_key=_get_private_key_path()) - - def test_list_dataset(self): - dataset_id = DATASET_ID + "1" - self.assertTrue(dataset_id in self.dataset.datasets(), - 'Expected dataset list to contain dataset {0}' - .format(dataset_id)) - - def test_list_table_zero_results(self): - dataset_id = DATASET_ID + "2" - self.dataset.create(dataset_id) - table_list = gbq._Dataset(_get_project_id(), - private_key=_get_private_key_path() - ).tables(dataset_id) - self.assertEqual(len(table_list), 0, - 'Expected gbq.list_table() to return 0') - - def test_create_dataset(self): - dataset_id = DATASET_ID + "3" - self.dataset.create(dataset_id) - self.assertTrue(dataset_id in self.dataset.datasets(), - 'Expected dataset to exist') - - def test_delete_dataset(self): - dataset_id = DATASET_ID + "4" - self.dataset.create(dataset_id) - self.dataset.delete(dataset_id) - self.assertTrue(dataset_id not in self.dataset.datasets(), - 'Expected dataset not to exist') - - def test_dataset_exists(self): - dataset_id = DATASET_ID + "5" - self.dataset.create(dataset_id) - self.assertTrue(self.dataset.exists(dataset_id), - 'Expected dataset to exist') - - def create_table_data_dataset_does_not_exist(self): - dataset_id = DATASET_ID + "6" - table_id = TABLE_ID + "1" - table_with_new_dataset = gbq._Table(_get_project_id(), dataset_id) - df = make_mixed_dataframe_v2(10) - table_with_new_dataset.create(table_id, gbq._generate_bq_schema(df)) - self.assertTrue(self.dataset.exists(dataset_id), - 'Expected dataset to exist') - self.assertTrue(table_with_new_dataset.exists( - table_id), 'Expected dataset to exist') - - def test_dataset_does_not_exist(self): - self.assertTrue(not self.dataset.exists( - DATASET_ID + "_not_found"), 'Expected dataset not to exist') - - -class TestToGBQIntegrationServiceAccountKeyPath(tm.TestCase): - # Changes to BigQuery table schema may take up to 2 minutes as of May 2015 - # As a workaround to this issue, each test should use a unique table name. - # Make sure to modify the for loop range in the tearDownClass when a new - # test is added - # See `Issue 191 - # `__ - - @classmethod - def setUpClass(cls): - # - GLOBAL CLASS FIXTURES - - # put here any instruction you want to execute only *ONCE* *BEFORE* - # executing *ALL* tests described below. 
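`test_generate_schema` above asserts the dtype-to-BigQuery mapping directly; the same check can be run locally, since `tm.makeMixedDataFrame()` has float columns A/B, an object column C and a datetime column D:

```python
import pandas.io.gbq as gbq
import pandas.util.testing as tm

schema = gbq._generate_bq_schema(tm.makeMixedDataFrame())
# float64 -> FLOAT, object -> STRING, datetime64 -> TIMESTAMP
assert [field['type'] for field in schema['fields']] == [
    'FLOAT', 'FLOAT', 'STRING', 'TIMESTAMP']
```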
- - _skip_if_no_project_id() - _skip_if_no_private_key_path() - - _setup_common() - clean_gbq_environment(_get_private_key_path()) - - def setUp(self): - # - PER-TEST FIXTURES - - # put here any instruction you want to be run *BEFORE* *EVERY* test - # is executed. - pass - - @classmethod - def tearDownClass(cls): - # - GLOBAL CLASS FIXTURES - - # put here any instruction you want to execute only *ONCE* *AFTER* - # executing all tests. - - clean_gbq_environment(_get_private_key_path()) - - def tearDown(self): - # - PER-TEST FIXTURES - - # put here any instructions you want to be run *AFTER* *EVERY* test - # is executed. - pass - - def test_upload_data_as_service_account_with_key_path(self): - destination_table = DESTINATION_TABLE + "11" - - test_size = 10 - df = make_mixed_dataframe_v2(test_size) - - gbq.to_gbq(df, destination_table, _get_project_id(), chunksize=10000, - private_key=_get_private_key_path()) - - sleep(30) # <- Curses Google!!! - - result = gbq.read_gbq( - "SELECT COUNT(*) as NUM_ROWS FROM {0}".format(destination_table), - project_id=_get_project_id(), - private_key=_get_private_key_path()) - - self.assertEqual(result['NUM_ROWS'][0], test_size) - - -class TestToGBQIntegrationServiceAccountKeyContents(tm.TestCase): - # Changes to BigQuery table schema may take up to 2 minutes as of May 2015 - # As a workaround to this issue, each test should use a unique table name. - # Make sure to modify the for loop range in the tearDownClass when a new - # test is added - # See `Issue 191 - # `__ - - @classmethod - def setUpClass(cls): - # - GLOBAL CLASS FIXTURES - - # put here any instruction you want to execute only *ONCE* *BEFORE* - # executing *ALL* tests described below. - - _setup_common() - _skip_if_no_project_id() - _skip_if_no_private_key_contents() - - clean_gbq_environment(_get_private_key_contents()) - - def setUp(self): - # - PER-TEST FIXTURES - - # put here any instruction you want to be run *BEFORE* *EVERY* test - # is executed. - pass - - @classmethod - def tearDownClass(cls): - # - GLOBAL CLASS FIXTURES - - # put here any instruction you want to execute only *ONCE* *AFTER* - # executing all tests. - - clean_gbq_environment(_get_private_key_contents()) - - def tearDown(self): - # - PER-TEST FIXTURES - - # put here any instructions you want to be run *AFTER* *EVERY* test - # is executed. - pass - - def test_upload_data_as_service_account_with_key_contents(self): - raise nose.SkipTest( - "flaky test") - - destination_table = DESTINATION_TABLE + "12" - - test_size = 10 - df = make_mixed_dataframe_v2(test_size) - - gbq.to_gbq(df, destination_table, _get_project_id(), chunksize=10000, - private_key=_get_private_key_contents()) - - sleep(30) # <- Curses Google!!! 
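The two service-account test classes above differ only in how the key is supplied: `private_key` accepts either a path to the service account JSON file or the JSON contents themselves. A sketch with placeholder paths and project id (again, only runnable against a real project):

```python
import pandas.io.gbq as gbq

# A path to the service account JSON file ...
df = gbq.read_gbq('SELECT 1 AS x', project_id='my-project-id',
                  private_key='/path/to/service_account.json')

# ... or the JSON contents themselves work equally well.
with open('/path/to/service_account.json') as f:
    df = gbq.read_gbq('SELECT 1 AS x', project_id='my-project-id',
                      private_key=f.read())
```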
- - result = gbq.read_gbq( - "SELECT COUNT(*) as NUM_ROWS FROM {0}".format(destination_table), - project_id=_get_project_id(), - private_key=_get_private_key_contents()) - self.assertEqual(result['NUM_ROWS'][0], test_size) - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/io/tests/test_pickle.py b/pandas/io/tests/test_pickle.py deleted file mode 100644 index a49f50b1bcb9f..0000000000000 --- a/pandas/io/tests/test_pickle.py +++ /dev/null @@ -1,291 +0,0 @@ -# pylint: disable=E1101,E1103,W0232 - -""" manage legacy pickle tests """ - -import nose -import os - -from distutils.version import LooseVersion - -import pandas as pd -from pandas import Index -from pandas.compat import u, is_platform_little_endian -import pandas -import pandas.util.testing as tm -from pandas.tseries.offsets import Day, MonthEnd - - -class TestPickle(): - """ - How to add pickle tests: - - 1. Install pandas version intended to output the pickle. - - 2. Execute "generate_legacy_storage_files.py" to create the pickle. - $ python generate_legacy_storage_files.py pickle - - 3. Move the created pickle to "data/legacy_pickle/" directory. - - NOTE: TestPickle can't be a subclass of tm.Testcase to use test generator. - http://stackoverflow.com/questions/6689537/ - nose-test-generators-inside-class - """ - _multiprocess_can_split_ = True - - def setUp(self): - from pandas.io.tests.generate_legacy_storage_files import ( - create_pickle_data) - self.data = create_pickle_data() - self.path = u('__%s__.pickle' % tm.rands(10)) - - def compare_element(self, result, expected, typ, version=None): - if isinstance(expected, Index): - tm.assert_index_equal(expected, result) - return - - if typ.startswith('sp_'): - comparator = getattr(tm, "assert_%s_equal" % typ) - comparator(result, expected, exact_indices=False) - elif typ == 'timestamp': - if expected is pd.NaT: - assert result is pd.NaT - else: - tm.assert_equal(result, expected) - tm.assert_equal(result.freq, expected.freq) - else: - comparator = getattr(tm, "assert_%s_equal" % - typ, tm.assert_almost_equal) - comparator(result, expected) - - def compare(self, vf, version): - - # py3 compat when reading py2 pickle - try: - data = pandas.read_pickle(vf) - except (ValueError) as e: - if 'unsupported pickle protocol:' in str(e): - # trying to read a py3 pickle in py2 - return - else: - raise - - for typ, dv in data.items(): - for dt, result in dv.items(): - try: - expected = self.data[typ][dt] - except (KeyError): - if version in ('0.10.1', '0.11.0') and dt == 'reg': - break - else: - raise - - # use a specific comparator - # if available - comparator = "compare_{typ}_{dt}".format(typ=typ, dt=dt) - comparator = getattr(self, comparator, self.compare_element) - comparator(result, expected, typ, version) - return data - - def compare_sp_series_ts(self, res, exp, typ, version): - # SparseTimeSeries integrated into SparseSeries in 0.12.0 - # and deprecated in 0.17.0 - if version and LooseVersion(version) <= "0.12.0": - tm.assert_sp_series_equal(res, exp, check_series_type=False) - else: - tm.assert_sp_series_equal(res, exp) - - def compare_series_ts(self, result, expected, typ, version): - # GH 7748 - tm.assert_series_equal(result, expected) - tm.assert_equal(result.index.freq, expected.index.freq) - tm.assert_equal(result.index.freq.normalize, False) - tm.assert_series_equal(result > 0, expected > 0) - - # GH 9291 - freq = result.index.freq - tm.assert_equal(freq + Day(1), Day(2)) - - res = freq + 
pandas.Timedelta(hours=1) - tm.assert_equal(isinstance(res, pandas.Timedelta), True) - tm.assert_equal(res, pandas.Timedelta(days=1, hours=1)) - - res = freq + pandas.Timedelta(nanoseconds=1) - tm.assert_equal(isinstance(res, pandas.Timedelta), True) - tm.assert_equal(res, pandas.Timedelta(days=1, nanoseconds=1)) - - def compare_series_dt_tz(self, result, expected, typ, version): - # 8260 - # dtype is object < 0.17.0 - if LooseVersion(version) < '0.17.0': - expected = expected.astype(object) - tm.assert_series_equal(result, expected) - else: - tm.assert_series_equal(result, expected) - - def compare_series_cat(self, result, expected, typ, version): - # Categorical dtype is added in 0.15.0 - # ordered is changed in 0.16.0 - if LooseVersion(version) < '0.15.0': - tm.assert_series_equal(result, expected, check_dtype=False, - check_categorical=False) - elif LooseVersion(version) < '0.16.0': - tm.assert_series_equal(result, expected, check_categorical=False) - else: - tm.assert_series_equal(result, expected) - - def compare_frame_dt_mixed_tzs(self, result, expected, typ, version): - # 8260 - # dtype is object < 0.17.0 - if LooseVersion(version) < '0.17.0': - expected = expected.astype(object) - tm.assert_frame_equal(result, expected) - else: - tm.assert_frame_equal(result, expected) - - def compare_frame_cat_onecol(self, result, expected, typ, version): - # Categorical dtype is added in 0.15.0 - # ordered is changed in 0.16.0 - if LooseVersion(version) < '0.15.0': - tm.assert_frame_equal(result, expected, check_dtype=False, - check_categorical=False) - elif LooseVersion(version) < '0.16.0': - tm.assert_frame_equal(result, expected, check_categorical=False) - else: - tm.assert_frame_equal(result, expected) - - def compare_frame_cat_and_float(self, result, expected, typ, version): - self.compare_frame_cat_onecol(result, expected, typ, version) - - def compare_index_period(self, result, expected, typ, version): - tm.assert_index_equal(result, expected) - tm.assertIsInstance(result.freq, MonthEnd) - tm.assert_equal(result.freq, MonthEnd()) - tm.assert_equal(result.freqstr, 'M') - tm.assert_index_equal(result.shift(2), expected.shift(2)) - - def compare_sp_frame_float(self, result, expected, typ, version): - if LooseVersion(version) <= '0.18.1': - tm.assert_sp_frame_equal(result, expected, exact_indices=False, - check_dtype=False) - else: - tm.assert_sp_frame_equal(result, expected) - - def read_pickles(self, version): - if not is_platform_little_endian(): - raise nose.SkipTest("known failure on non-little endian") - - pth = tm.get_data_path('legacy_pickle/{0}'.format(str(version))) - n = 0 - for f in os.listdir(pth): - vf = os.path.join(pth, f) - data = self.compare(vf, version) - - if data is None: - continue - n += 1 - assert n > 0, 'Pickle files are not tested' - - def test_pickles(self): - pickle_path = tm.get_data_path('legacy_pickle') - n = 0 - for v in os.listdir(pickle_path): - pth = os.path.join(pickle_path, v) - if os.path.isdir(pth): - yield self.read_pickles, v - n += 1 - assert n > 0, 'Pickle files are not tested' - - def test_round_trip_current(self): - - try: - import cPickle as c_pickle - - def c_pickler(obj, path): - with open(path, 'wb') as fh: - c_pickle.dump(obj, fh, protocol=-1) - - def c_unpickler(path): - with open(path, 'rb') as fh: - fh.seek(0) - return c_pickle.load(fh) - except: - c_pickler = None - c_unpickler = None - - import pickle as python_pickle - - def python_pickler(obj, path): - with open(path, 'wb') as fh: - python_pickle.dump(obj, fh, protocol=-1) - - def 
python_unpickler(path): - with open(path, 'rb') as fh: - fh.seek(0) - return python_pickle.load(fh) - - for typ, dv in self.data.items(): - for dt, expected in dv.items(): - - for writer in [pd.to_pickle, c_pickler, python_pickler]: - if writer is None: - continue - - with tm.ensure_clean(self.path) as path: - - # test writing with each pickler - writer(expected, path) - - # test reading with each unpickler - result = pd.read_pickle(path) - self.compare_element(result, expected, typ) - - if c_unpickler is not None: - result = c_unpickler(path) - self.compare_element(result, expected, typ) - - result = python_unpickler(path) - self.compare_element(result, expected, typ) - - def test_pickle_v0_14_1(self): - - # we have the name warning - # 10482 - with tm.assert_produces_warning(UserWarning): - cat = pd.Categorical(values=['a', 'b', 'c'], - categories=['a', 'b', 'c', 'd'], - name='foobar', ordered=False) - pickle_path = os.path.join(tm.get_data_path(), - 'categorical_0_14_1.pickle') - # This code was executed once on v0.14.1 to generate the pickle: - # - # cat = Categorical(labels=np.arange(3), levels=['a', 'b', 'c', 'd'], - # name='foobar') - # with open(pickle_path, 'wb') as f: pickle.dump(cat, f) - # - tm.assert_categorical_equal(cat, pd.read_pickle(pickle_path)) - - def test_pickle_v0_15_2(self): - # ordered -> _ordered - # GH 9347 - - # we have the name warning - # 10482 - with tm.assert_produces_warning(UserWarning): - cat = pd.Categorical(values=['a', 'b', 'c'], - categories=['a', 'b', 'c', 'd'], - name='foobar', ordered=False) - pickle_path = os.path.join(tm.get_data_path(), - 'categorical_0_15_2.pickle') - # This code was executed once on v0.15.2 to generate the pickle: - # - # cat = Categorical(labels=np.arange(3), levels=['a', 'b', 'c', 'd'], - # name='foobar') - # with open(pickle_path, 'wb') as f: pickle.dump(cat, f) - # - tm.assert_categorical_equal(cat, pd.read_pickle(pickle_path)) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - # '--with-coverage', '--cover-package=pandas.core'], - exit=False) diff --git a/pandas/io/tests/test_s3.py b/pandas/io/tests/test_s3.py deleted file mode 100644 index 8058698a906ea..0000000000000 --- a/pandas/io/tests/test_s3.py +++ /dev/null @@ -1,14 +0,0 @@ -import nose -from pandas.util import testing as tm - -from pandas.io.common import _is_s3_url - - -class TestS3URL(tm.TestCase): - def test_is_s3_url(self): - self.assertTrue(_is_s3_url("s3://pandas/somethingelse.com")) - self.assertFalse(_is_s3_url("s4://pandas/somethingelse.com")) - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/io/wb.py b/pandas/io/wb.py index 2183290c7e074..5dc4d9ce1adc4 100644 --- a/pandas/io/wb.py +++ b/pandas/io/wb.py @@ -1,6 +1,6 @@ raise ImportError( "The pandas.io.wb module is moved to a separate package " "(pandas-datareader). After installing the pandas-datareader package " - "(https://github.com/pandas-dev/pandas-datareader), you can change " + "(https://github.com/pydata/pandas-datareader), you can change " "the import ``from pandas.io import data, wb`` to " "``from pandas_datareader import data, wb``.") diff --git a/pandas/json.py b/pandas/json.py new file mode 100644 index 0000000000000..0b87aa22394b9 --- /dev/null +++ b/pandas/json.py @@ -0,0 +1,7 @@ +# flake8: noqa + +import warnings +warnings.warn("The pandas.json module is deprecated and will be " + "removed in a future version. 
Please import from " + "the pandas.io.json instead", FutureWarning, stacklevel=2) +from pandas._libs.json import dumps, loads diff --git a/pandas/lib.py b/pandas/lib.py new file mode 100644 index 0000000000000..859a78060fcc1 --- /dev/null +++ b/pandas/lib.py @@ -0,0 +1,8 @@ +# flake8: noqa + +import warnings +warnings.warn("The pandas.lib module is deprecated and will be " + "removed in a future version. These are private functions " + "and can be accessed from pandas._libs.lib instead", + FutureWarning, stacklevel=2) +from pandas._libs.lib import * diff --git a/pandas/parser.py b/pandas/parser.py new file mode 100644 index 0000000000000..c0c3bf3179a2d --- /dev/null +++ b/pandas/parser.py @@ -0,0 +1,8 @@ +# flake8: noqa + +import warnings +warnings.warn("The pandas.parser module is deprecated and will be " + "removed in a future version. Please import from " + "the pandas.io.parser instead", FutureWarning, stacklevel=2) +from pandas._libs.parsers import na_values +from pandas.io.common import CParserError diff --git a/pandas/plotting/__init__.py b/pandas/plotting/__init__.py new file mode 100644 index 0000000000000..c3cbedb0fc28c --- /dev/null +++ b/pandas/plotting/__init__.py @@ -0,0 +1,19 @@ +""" +Plotting api +""" + +# flake8: noqa + +try: # mpl optional + from pandas.plotting import _converter + _converter.register() # needs to override so set_xlim works with str/number +except ImportError: + pass + +from pandas.plotting._misc import (scatter_matrix, radviz, + andrews_curves, bootstrap_plot, + parallel_coordinates, lag_plot, + autocorrelation_plot) +from pandas.plotting._core import boxplot +from pandas.plotting._style import plot_params +from pandas.plotting._tools import table diff --git a/pandas/plotting/_compat.py b/pandas/plotting/_compat.py new file mode 100644 index 0000000000000..7b04b9e1171ec --- /dev/null +++ b/pandas/plotting/_compat.py @@ -0,0 +1,67 @@ +# being a bit too dynamic +# pylint: disable=E1101 +from __future__ import division + +from distutils.version import LooseVersion + + +def _mpl_le_1_2_1(): + try: + import matplotlib as mpl + return (str(mpl.__version__) <= LooseVersion('1.2.1') and + str(mpl.__version__)[0] != '0') + except ImportError: + return False + + +def _mpl_ge_1_3_1(): + try: + import matplotlib + # The or v[0] == '0' is because their versioneer is + # messed up on dev + return (matplotlib.__version__ >= LooseVersion('1.3.1') or + matplotlib.__version__[0] == '0') + except ImportError: + return False + + +def _mpl_ge_1_4_0(): + try: + import matplotlib + return (matplotlib.__version__ >= LooseVersion('1.4') or + matplotlib.__version__[0] == '0') + except ImportError: + return False + + +def _mpl_ge_1_5_0(): + try: + import matplotlib + return (matplotlib.__version__ >= LooseVersion('1.5') or + matplotlib.__version__[0] == '0') + except ImportError: + return False + + +def _mpl_ge_2_0_0(): + try: + import matplotlib + return matplotlib.__version__ >= LooseVersion('2.0') + except ImportError: + return False + + +def _mpl_le_2_0_0(): + try: + import matplotlib + return matplotlib.compare_versions('2.0.0', matplotlib.__version__) + except ImportError: + return False + + +def _mpl_ge_2_0_1(): + try: + import matplotlib + return matplotlib.__version__ >= LooseVersion('2.0.1') + except ImportError: + return False diff --git a/pandas/plotting/_converter.py b/pandas/plotting/_converter.py new file mode 100644 index 0000000000000..97295dfa7baf1 --- /dev/null +++ b/pandas/plotting/_converter.py @@ -0,0 +1,1046 @@ +from datetime import datetime, timedelta 
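Each `_mpl_ge_*` helper in `_compat.py` above repeats one shape: import matplotlib lazily, compare version strings, and report False when matplotlib is absent rather than raising. A single parametrised version of that guard (the function name is hypothetical, not part of the module):

```python
from distutils.version import LooseVersion


def _mpl_version_at_least(minimum):
    # Lazy import keeps matplotlib an optional dependency; a missing
    # backend simply reports False instead of raising ImportError.
    try:
        import matplotlib
        return (LooseVersion(matplotlib.__version__) >=
                LooseVersion(minimum))
    except ImportError:
        return False


# e.g. gate 2.0-only styling the way _mpl_ge_2_0_0 does above:
HAS_MPL_2 = _mpl_version_at_least('2.0')
```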
+import datetime as pydt +import numpy as np + +from dateutil.relativedelta import relativedelta + +import matplotlib.units as units +import matplotlib.dates as dates + +from matplotlib.ticker import Formatter, AutoLocator, Locator +from matplotlib.transforms import nonsingular + +from pandas.core.dtypes.common import ( + is_float, is_integer, + is_integer_dtype, + is_float_dtype, + is_datetime64_ns_dtype, + is_period_arraylike, + is_nested_list_like +) + +from pandas.compat import lrange +import pandas.compat as compat +import pandas._libs.lib as lib +import pandas.core.common as com +from pandas.core.index import Index + +from pandas.core.series import Series +from pandas.core.indexes.datetimes import date_range +import pandas.core.tools.datetimes as tools +import pandas.tseries.frequencies as frequencies +from pandas.tseries.frequencies import FreqGroup +from pandas.core.indexes.period import Period, PeriodIndex + +from pandas.plotting._compat import _mpl_le_2_0_0 + +# constants +HOURS_PER_DAY = 24. +MIN_PER_HOUR = 60. +SEC_PER_MIN = 60. + +SEC_PER_HOUR = SEC_PER_MIN * MIN_PER_HOUR +SEC_PER_DAY = SEC_PER_HOUR * HOURS_PER_DAY + +MUSEC_PER_DAY = 1e6 * SEC_PER_DAY + + +def register(): + units.registry[lib.Timestamp] = DatetimeConverter() + units.registry[Period] = PeriodConverter() + units.registry[pydt.datetime] = DatetimeConverter() + units.registry[pydt.date] = DatetimeConverter() + units.registry[pydt.time] = TimeConverter() + units.registry[np.datetime64] = DatetimeConverter() + + +def _to_ordinalf(tm): + tot_sec = (tm.hour * 3600 + tm.minute * 60 + tm.second + + float(tm.microsecond / 1e6)) + return tot_sec + + +def time2num(d): + if isinstance(d, compat.string_types): + parsed = tools.to_datetime(d) + if not isinstance(parsed, datetime): + raise ValueError('Could not parse time %s' % d) + return _to_ordinalf(parsed.time()) + if isinstance(d, pydt.time): + return _to_ordinalf(d) + return d + + +class TimeConverter(units.ConversionInterface): + + @staticmethod + def convert(value, unit, axis): + valid_types = (str, pydt.time) + if (isinstance(value, valid_types) or is_integer(value) or + is_float(value)): + return time2num(value) + if isinstance(value, Index): + return value.map(time2num) + if isinstance(value, (list, tuple, np.ndarray, Index)): + return [time2num(x) for x in value] + return value + + @staticmethod + def axisinfo(unit, axis): + if unit != 'time': + return None + + majloc = AutoLocator() + majfmt = TimeFormatter(majloc) + return units.AxisInfo(majloc=majloc, majfmt=majfmt, label='time') + + @staticmethod + def default_units(x, axis): + return 'time' + + +# time formatter +class TimeFormatter(Formatter): + + def __init__(self, locs): + self.locs = locs + + def __call__(self, x, pos=0): + fmt = '%H:%M:%S' + s = int(x) + ms = int((x - s) * 1e3) + us = int((x - s) * 1e6 - ms) + m, s = divmod(s, 60) + h, m = divmod(m, 60) + _, h = divmod(h, 24) + if us != 0: + fmt += '.%6f' + elif ms != 0: + fmt += '.%3f' + + return pydt.time(h, m, s, us).strftime(fmt) + + +# Period Conversion + + +class PeriodConverter(dates.DateConverter): + + @staticmethod + def convert(values, units, axis): + if is_nested_list_like(values): + values = [PeriodConverter._convert_1d(v, units, axis) + for v in values] + else: + values = PeriodConverter._convert_1d(values, units, axis) + return values + + @staticmethod + def _convert_1d(values, units, axis): + if not hasattr(axis, 'freq'): + raise TypeError('Axis must have `freq` set to convert to Periods') + valid_types = (compat.string_types, datetime, + 
Period, pydt.date, pydt.time) + if (isinstance(values, valid_types) or is_integer(values) or + is_float(values)): + return get_datevalue(values, axis.freq) + if isinstance(values, PeriodIndex): + return values.asfreq(axis.freq)._values + if isinstance(values, Index): + return values.map(lambda x: get_datevalue(x, axis.freq)) + if is_period_arraylike(values): + return PeriodIndex(values, freq=axis.freq)._values + if isinstance(values, (list, tuple, np.ndarray, Index)): + return [get_datevalue(x, axis.freq) for x in values] + return values + + +def get_datevalue(date, freq): + if isinstance(date, Period): + return date.asfreq(freq).ordinal + elif isinstance(date, (compat.string_types, datetime, + pydt.date, pydt.time)): + return Period(date, freq).ordinal + elif (is_integer(date) or is_float(date) or + (isinstance(date, (np.ndarray, Index)) and (date.size == 1))): + return date + elif date is None: + return None + raise ValueError("Unrecognizable date '%s'" % date) + + +def _dt_to_float_ordinal(dt): + """ + Convert :mod:`datetime` to the Gregorian date as UTC float days, + preserving hours, minutes, seconds and microseconds. Return value + is a :func:`float`. + """ + if (isinstance(dt, (np.ndarray, Index, Series) + ) and is_datetime64_ns_dtype(dt)): + base = dates.epoch2num(dt.asi8 / 1.0E9) + else: + base = dates.date2num(dt) + return base + + +# Datetime Conversion +class DatetimeConverter(dates.DateConverter): + + @staticmethod + def convert(values, unit, axis): + # values might be a 1-d array, or a list-like of arrays. + if is_nested_list_like(values): + values = [DatetimeConverter._convert_1d(v, unit, axis) + for v in values] + else: + values = DatetimeConverter._convert_1d(values, unit, axis) + return values + + @staticmethod + def _convert_1d(values, unit, axis): + def try_parse(values): + try: + return _dt_to_float_ordinal(tools.to_datetime(values)) + except Exception: + return values + + if isinstance(values, (datetime, pydt.date)): + return _dt_to_float_ordinal(values) + elif isinstance(values, np.datetime64): + return _dt_to_float_ordinal(lib.Timestamp(values)) + elif isinstance(values, pydt.time): + return dates.date2num(values) + elif (is_integer(values) or is_float(values)): + return values + elif isinstance(values, compat.string_types): + return try_parse(values) + elif isinstance(values, (list, tuple, np.ndarray, Index)): + if isinstance(values, Index): + values = values.values + if not isinstance(values, np.ndarray): + values = com._asarray_tuplesafe(values) + + if is_integer_dtype(values) or is_float_dtype(values): + return values + + try: + values = tools.to_datetime(values) + if isinstance(values, Index): + values = _dt_to_float_ordinal(values) + else: + values = [_dt_to_float_ordinal(x) for x in values] + except Exception: + values = _dt_to_float_ordinal(values) + + return values + + @staticmethod + def axisinfo(unit, axis): + """ + Return the :class:`~matplotlib.units.AxisInfo` for *unit*. + + *unit* is a tzinfo instance or None. + The *axis* argument is required but not used. 
+        """
+        tz = unit
+
+        majloc = PandasAutoDateLocator(tz=tz)
+        majfmt = PandasAutoDateFormatter(majloc, tz=tz)
+        datemin = pydt.date(2000, 1, 1)
+        datemax = pydt.date(2010, 1, 1)
+
+        return units.AxisInfo(majloc=majloc, majfmt=majfmt, label='',
+                              default_limits=(datemin, datemax))
+
+
+class PandasAutoDateFormatter(dates.AutoDateFormatter):
+
+    def __init__(self, locator, tz=None, defaultfmt='%Y-%m-%d'):
+        dates.AutoDateFormatter.__init__(self, locator, tz, defaultfmt)
+        # matplotlib.dates._UTC has no _utcoffset, which pandas calls,
+        # so patch one on
+        if self._tz is dates.UTC:
+            self._tz._utcoffset = self._tz.utcoffset(None)
+
+        # For mpl > 2.0 the format strings are controlled via rcparams,
+        # so do not mess with them.  For mpl < 2.0 change the second
+        # break point and add a microsecond break point
+        if _mpl_le_2_0_0():
+            self.scaled[1. / SEC_PER_DAY] = '%H:%M:%S'
+            self.scaled[1. / MUSEC_PER_DAY] = '%H:%M:%S.%f'
+
+
+class PandasAutoDateLocator(dates.AutoDateLocator):
+
+    def get_locator(self, dmin, dmax):
+        'Pick the best locator based on a distance.'
+        delta = relativedelta(dmax, dmin)
+
+        num_days = (delta.years * 12.0 + delta.months) * 31.0 + delta.days
+        num_sec = (delta.hours * 60.0 + delta.minutes) * 60.0 + delta.seconds
+        tot_sec = num_days * 86400. + num_sec
+
+        if abs(tot_sec) < self.minticks:
+            self._freq = -1
+            locator = MilliSecondLocator(self.tz)
+            locator.set_axis(self.axis)
+
+            locator.set_view_interval(*self.axis.get_view_interval())
+            locator.set_data_interval(*self.axis.get_data_interval())
+            return locator
+
+        return dates.AutoDateLocator.get_locator(self, dmin, dmax)
+
+    def _get_unit(self):
+        return MilliSecondLocator.get_unit_generic(self._freq)
+
+
+class MilliSecondLocator(dates.DateLocator):
+
+    UNIT = 1. / (24 * 3600 * 1000)
+
+    def __init__(self, tz):
+        dates.DateLocator.__init__(self, tz)
+        self._interval = 1.
+
+    def _get_unit(self):
+        return self.get_unit_generic(-1)
+
+    @staticmethod
+    def get_unit_generic(freq):
+        unit = dates.RRuleLocator.get_unit_generic(freq)
+        if unit < 0:
+            return MilliSecondLocator.UNIT
+        return unit
+
+    def __call__(self):
+        # if no data have been set, this will tank with a ValueError
+        try:
+            dmin, dmax = self.viewlim_to_dt()
+        except ValueError:
+            return []
+
+        if dmin > dmax:
+            dmax, dmin = dmin, dmax
+        # We need to cap at the endpoints of valid datetime
+
+        # TODO(wesm) unused?
+        # delta = relativedelta(dmax, dmin)
+        # try:
+        #     start = dmin - delta
+        # except ValueError:
+        #     start = _from_ordinal(1.0)
+
+        # try:
+        #     stop = dmax + delta
+        # except ValueError:
+        #     # The magic number!
+        #     stop = _from_ordinal(3652059.9999999)
+
+        nmax, nmin = dates.date2num((dmax, dmin))
+
+        num = (nmax - nmin) * 86400 * 1000
+        max_millis_ticks = 6
+        for interval in [1, 10, 50, 100, 200, 500]:
+            if num <= interval * (max_millis_ticks - 1):
+                self._interval = interval
+                break
+            else:
+                # no interval matched so far; if the loop finishes without
+                # breaking we default to 1000 ms, i.e. one tick per second
+                self._interval = 1000.
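+
+        # Sanity check: with the largest fallback interval of 1000 ms and
+        # matplotlib's default Locator.MAXTICKS of 1000, the guard below
+        # trips once the visible span exceeds roughly 1000 * 2 * 1000 ms
+        # (about 33 minutes), rather than materializing millions of ticks.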
+        estimate = (nmax - nmin) / (self._get_unit() * self._get_interval())
+
+        if estimate > self.MAXTICKS * 2:
+            raise RuntimeError(('MilliSecondLocator estimated to generate '
+                                '%d ticks from %s to %s: exceeds '
+                                'Locator.MAXTICKS * 2 (%d)') %
+                               (estimate, dmin, dmax, self.MAXTICKS * 2))
+
+        freq = '%dL' % self._get_interval()
+        tz = self.tz.tzname(None)
+        st = _from_ordinal(dates.date2num(dmin))  # strip tz
+        ed = _from_ordinal(dates.date2num(dmax))
+        all_dates = date_range(start=st, end=ed, freq=freq, tz=tz).asobject
+
+        try:
+            if len(all_dates) > 0:
+                locs = self.raise_if_exceeds(dates.date2num(all_dates))
+                return locs
+        except Exception:  # pragma: no cover
+            pass
+
+        lims = dates.date2num([dmin, dmax])
+        return lims
+
+    def _get_interval(self):
+        return self._interval
+
+    def autoscale(self):
+        """
+        Set the view limits to include the data range.
+        """
+        dmin, dmax = self.datalim_to_dt()
+        if dmin > dmax:
+            dmax, dmin = dmin, dmax
+
+        vmin = dates.date2num(dmin)
+        vmax = dates.date2num(dmax)
+
+        return self.nonsingular(vmin, vmax)
+
+
+def _from_ordinal(x, tz=None):
+    ix = int(x)
+    dt = datetime.fromordinal(ix)
+    remainder = float(x) - ix
+    hour, remainder = divmod(24 * remainder, 1)
+    minute, remainder = divmod(60 * remainder, 1)
+    second, remainder = divmod(60 * remainder, 1)
+    microsecond = int(1e6 * remainder)
+    if microsecond < 10:
+        microsecond = 0  # compensate for rounding errors
+    dt = datetime(dt.year, dt.month, dt.day, int(hour), int(minute),
+                  int(second), microsecond)
+    if tz is not None:
+        dt = dt.astimezone(tz)
+
+    if microsecond > 999990:  # compensate for rounding errors
+        dt += timedelta(microseconds=1e6 - microsecond)
+
+    return dt
+
+# Fixed frequency dynamic tick locators and formatters
+
+# -------------------------------------------------------------------------
+# --- Locators ---
+# -------------------------------------------------------------------------
+
+
+def _get_default_annual_spacing(nyears):
+    """
+    Returns a default spacing between consecutive ticks for annual data.
+    """
+    if nyears < 11:
+        (min_spacing, maj_spacing) = (1, 1)
+    elif nyears < 20:
+        (min_spacing, maj_spacing) = (1, 2)
+    elif nyears < 50:
+        (min_spacing, maj_spacing) = (1, 5)
+    elif nyears < 100:
+        (min_spacing, maj_spacing) = (5, 10)
+    elif nyears < 200:
+        (min_spacing, maj_spacing) = (5, 25)
+    elif nyears < 600:
+        (min_spacing, maj_spacing) = (10, 50)
+    else:
+        factor = nyears // 1000 + 1
+        (min_spacing, maj_spacing) = (factor * 20, factor * 100)
+    return (min_spacing, maj_spacing)
+
+
+def period_break(dates, period):
+    """
+    Returns the indices where the given period changes.
+
+    Parameters
+    ----------
+    dates : PeriodIndex
+        Array of intervals to monitor.
+    period : string
+        Name of the period to monitor.
+    """
+    current = getattr(dates, period)
+    previous = getattr(dates - 1, period)
+    return np.nonzero(current - previous)[0]
+
+
+def has_level_label(label_flags, vmin):
+    """
+    Returns True if the ``label_flags`` indicate there is at least one label
+    for this level.
+
+    If the minimum view limit is not an exact integer, then the first tick
+    label won't be shown, so we must adjust for that.
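+
+    For example, ``label_flags = np.array([0])`` with ``vmin = 0.5``
+    flags only the tick left of the view limit, so no label is visible
+    and False is returned.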
+ """ + if label_flags.size == 0 or (label_flags.size == 1 and + label_flags[0] == 0 and + vmin % 1 > 0.0): + return False + else: + return True + + +def _daily_finder(vmin, vmax, freq): + periodsperday = -1 + + if freq >= FreqGroup.FR_HR: + if freq == FreqGroup.FR_NS: + periodsperday = 24 * 60 * 60 * 1000000000 + elif freq == FreqGroup.FR_US: + periodsperday = 24 * 60 * 60 * 1000000 + elif freq == FreqGroup.FR_MS: + periodsperday = 24 * 60 * 60 * 1000 + elif freq == FreqGroup.FR_SEC: + periodsperday = 24 * 60 * 60 + elif freq == FreqGroup.FR_MIN: + periodsperday = 24 * 60 + elif freq == FreqGroup.FR_HR: + periodsperday = 24 + else: # pragma: no cover + raise ValueError("unexpected frequency: %s" % freq) + periodsperyear = 365 * periodsperday + periodspermonth = 28 * periodsperday + + elif freq == FreqGroup.FR_BUS: + periodsperyear = 261 + periodspermonth = 19 + elif freq == FreqGroup.FR_DAY: + periodsperyear = 365 + periodspermonth = 28 + elif frequencies.get_freq_group(freq) == FreqGroup.FR_WK: + periodsperyear = 52 + periodspermonth = 3 + else: # pragma: no cover + raise ValueError("unexpected frequency") + + # save this for later usage + vmin_orig = vmin + + (vmin, vmax) = (Period(ordinal=int(vmin), freq=freq), + Period(ordinal=int(vmax), freq=freq)) + span = vmax.ordinal - vmin.ordinal + 1 + dates_ = PeriodIndex(start=vmin, end=vmax, freq=freq) + # Initialize the output + info = np.zeros(span, + dtype=[('val', np.int64), ('maj', bool), + ('min', bool), ('fmt', '|S20')]) + info['val'][:] = dates_._values + info['fmt'][:] = '' + info['maj'][[0, -1]] = True + # .. and set some shortcuts + info_maj = info['maj'] + info_min = info['min'] + info_fmt = info['fmt'] + + def first_label(label_flags): + if (label_flags[0] == 0) and (label_flags.size > 1) and \ + ((vmin_orig % 1) > 0.0): + return label_flags[1] + else: + return label_flags[0] + + # Case 1. 
Less than a month + if span <= periodspermonth: + day_start = period_break(dates_, 'day') + month_start = period_break(dates_, 'month') + + def _hour_finder(label_interval, force_year_start): + _hour = dates_.hour + _prev_hour = (dates_ - 1).hour + hour_start = (_hour - _prev_hour) != 0 + info_maj[day_start] = True + info_min[hour_start & (_hour % label_interval == 0)] = True + year_start = period_break(dates_, 'year') + info_fmt[hour_start & (_hour % label_interval == 0)] = '%H:%M' + info_fmt[day_start] = '%H:%M\n%d-%b' + info_fmt[year_start] = '%H:%M\n%d-%b\n%Y' + if force_year_start and not has_level_label(year_start, vmin_orig): + info_fmt[first_label(day_start)] = '%H:%M\n%d-%b\n%Y' + + def _minute_finder(label_interval): + hour_start = period_break(dates_, 'hour') + _minute = dates_.minute + _prev_minute = (dates_ - 1).minute + minute_start = (_minute - _prev_minute) != 0 + info_maj[hour_start] = True + info_min[minute_start & (_minute % label_interval == 0)] = True + year_start = period_break(dates_, 'year') + info_fmt = info['fmt'] + info_fmt[minute_start & (_minute % label_interval == 0)] = '%H:%M' + info_fmt[day_start] = '%H:%M\n%d-%b' + info_fmt[year_start] = '%H:%M\n%d-%b\n%Y' + + def _second_finder(label_interval): + minute_start = period_break(dates_, 'minute') + _second = dates_.second + _prev_second = (dates_ - 1).second + second_start = (_second - _prev_second) != 0 + info['maj'][minute_start] = True + info['min'][second_start & (_second % label_interval == 0)] = True + year_start = period_break(dates_, 'year') + info_fmt = info['fmt'] + info_fmt[second_start & (_second % + label_interval == 0)] = '%H:%M:%S' + info_fmt[day_start] = '%H:%M:%S\n%d-%b' + info_fmt[year_start] = '%H:%M:%S\n%d-%b\n%Y' + + if span < periodsperday / 12000.0: + _second_finder(1) + elif span < periodsperday / 6000.0: + _second_finder(2) + elif span < periodsperday / 2400.0: + _second_finder(5) + elif span < periodsperday / 1200.0: + _second_finder(10) + elif span < periodsperday / 800.0: + _second_finder(15) + elif span < periodsperday / 400.0: + _second_finder(30) + elif span < periodsperday / 150.0: + _minute_finder(1) + elif span < periodsperday / 70.0: + _minute_finder(2) + elif span < periodsperday / 24.0: + _minute_finder(5) + elif span < periodsperday / 12.0: + _minute_finder(15) + elif span < periodsperday / 6.0: + _minute_finder(30) + elif span < periodsperday / 2.5: + _hour_finder(1, False) + elif span < periodsperday / 1.5: + _hour_finder(2, False) + elif span < periodsperday * 1.25: + _hour_finder(3, False) + elif span < periodsperday * 2.5: + _hour_finder(6, True) + elif span < periodsperday * 4: + _hour_finder(12, True) + else: + info_maj[month_start] = True + info_min[day_start] = True + year_start = period_break(dates_, 'year') + info_fmt = info['fmt'] + info_fmt[day_start] = '%d' + info_fmt[month_start] = '%d\n%b' + info_fmt[year_start] = '%d\n%b\n%Y' + if not has_level_label(year_start, vmin_orig): + if not has_level_label(month_start, vmin_orig): + info_fmt[first_label(day_start)] = '%d\n%b\n%Y' + else: + info_fmt[first_label(month_start)] = '%d\n%b\n%Y' + + # Case 2. 
Less than three months
+    elif span <= periodsperyear // 4:
+        month_start = period_break(dates_, 'month')
+        info_maj[month_start] = True
+        if freq < FreqGroup.FR_HR:
+            info['min'] = True
+        else:
+            day_start = period_break(dates_, 'day')
+            info['min'][day_start] = True
+        week_start = period_break(dates_, 'week')
+        year_start = period_break(dates_, 'year')
+        info_fmt[week_start] = '%d'
+        info_fmt[month_start] = '\n\n%b'
+        info_fmt[year_start] = '\n\n%b\n%Y'
+        if not has_level_label(year_start, vmin_orig):
+            if not has_level_label(month_start, vmin_orig):
+                info_fmt[first_label(week_start)] = '\n\n%b\n%Y'
+            else:
+                info_fmt[first_label(month_start)] = '\n\n%b\n%Y'
+    # Case 3. Less than 14 months ...............
+    elif span <= 1.15 * periodsperyear:
+        year_start = period_break(dates_, 'year')
+        month_start = period_break(dates_, 'month')
+        week_start = period_break(dates_, 'week')
+        info_maj[month_start] = True
+        info_min[week_start] = True
+        info_min[year_start] = False
+        info_min[month_start] = False
+        info_fmt[month_start] = '%b'
+        info_fmt[year_start] = '%b\n%Y'
+        if not has_level_label(year_start, vmin_orig):
+            info_fmt[first_label(month_start)] = '%b\n%Y'
+    # Case 4. Less than 2.5 years ...............
+    elif span <= 2.5 * periodsperyear:
+        year_start = period_break(dates_, 'year')
+        quarter_start = period_break(dates_, 'quarter')
+        month_start = period_break(dates_, 'month')
+        info_maj[quarter_start] = True
+        info_min[month_start] = True
+        info_fmt[quarter_start] = '%b'
+        info_fmt[year_start] = '%b\n%Y'
+    # Case 5. Less than 4 years .................
+    elif span <= 4 * periodsperyear:
+        year_start = period_break(dates_, 'year')
+        month_start = period_break(dates_, 'month')
+        info_maj[year_start] = True
+        info_min[month_start] = True
+        info_min[year_start] = False
+
+        month_break = dates_[month_start].month
+        jan_or_jul = month_start[(month_break == 1) | (month_break == 7)]
+        info_fmt[jan_or_jul] = '%b'
+        info_fmt[year_start] = '%b\n%Y'
+    # Case 6. Less than 11 years ................
+    elif span <= 11 * periodsperyear:
+        year_start = period_break(dates_, 'year')
+        quarter_start = period_break(dates_, 'quarter')
+        info_maj[year_start] = True
+        info_min[quarter_start] = True
+        info_min[year_start] = False
+        info_fmt[year_start] = '%Y'
+    # Case 7. More than 11 years ................
+ else: + year_start = period_break(dates_, 'year') + year_break = dates_[year_start].year + nyears = span / periodsperyear + (min_anndef, maj_anndef) = _get_default_annual_spacing(nyears) + major_idx = year_start[(year_break % maj_anndef == 0)] + info_maj[major_idx] = True + minor_idx = year_start[(year_break % min_anndef == 0)] + info_min[minor_idx] = True + info_fmt[major_idx] = '%Y' + + return info + + +def _monthly_finder(vmin, vmax, freq): + periodsperyear = 12 + + vmin_orig = vmin + (vmin, vmax) = (int(vmin), int(vmax)) + span = vmax - vmin + 1 + + # Initialize the output + info = np.zeros(span, + dtype=[('val', int), ('maj', bool), ('min', bool), + ('fmt', '|S8')]) + info['val'] = np.arange(vmin, vmax + 1) + dates_ = info['val'] + info['fmt'] = '' + year_start = (dates_ % 12 == 0).nonzero()[0] + info_maj = info['maj'] + info_fmt = info['fmt'] + + if span <= 1.15 * periodsperyear: + info_maj[year_start] = True + info['min'] = True + + info_fmt[:] = '%b' + info_fmt[year_start] = '%b\n%Y' + + if not has_level_label(year_start, vmin_orig): + if dates_.size > 1: + idx = 1 + else: + idx = 0 + info_fmt[idx] = '%b\n%Y' + + elif span <= 2.5 * periodsperyear: + quarter_start = (dates_ % 3 == 0).nonzero() + info_maj[year_start] = True + # TODO: Check the following : is it really info['fmt'] ? + info['fmt'][quarter_start] = True + info['min'] = True + + info_fmt[quarter_start] = '%b' + info_fmt[year_start] = '%b\n%Y' + + elif span <= 4 * periodsperyear: + info_maj[year_start] = True + info['min'] = True + + jan_or_jul = (dates_ % 12 == 0) | (dates_ % 12 == 6) + info_fmt[jan_or_jul] = '%b' + info_fmt[year_start] = '%b\n%Y' + + elif span <= 11 * periodsperyear: + quarter_start = (dates_ % 3 == 0).nonzero() + info_maj[year_start] = True + info['min'][quarter_start] = True + + info_fmt[year_start] = '%Y' + + else: + nyears = span / periodsperyear + (min_anndef, maj_anndef) = _get_default_annual_spacing(nyears) + years = dates_[year_start] // 12 + 1 + major_idx = year_start[(years % maj_anndef == 0)] + info_maj[major_idx] = True + info['min'][year_start[(years % min_anndef == 0)]] = True + + info_fmt[major_idx] = '%Y' + + return info + + +def _quarterly_finder(vmin, vmax, freq): + periodsperyear = 4 + vmin_orig = vmin + (vmin, vmax) = (int(vmin), int(vmax)) + span = vmax - vmin + 1 + + info = np.zeros(span, + dtype=[('val', int), ('maj', bool), ('min', bool), + ('fmt', '|S8')]) + info['val'] = np.arange(vmin, vmax + 1) + info['fmt'] = '' + dates_ = info['val'] + info_maj = info['maj'] + info_fmt = info['fmt'] + year_start = (dates_ % 4 == 0).nonzero()[0] + + if span <= 3.5 * periodsperyear: + info_maj[year_start] = True + info['min'] = True + + info_fmt[:] = 'Q%q' + info_fmt[year_start] = 'Q%q\n%F' + if not has_level_label(year_start, vmin_orig): + if dates_.size > 1: + idx = 1 + else: + idx = 0 + info_fmt[idx] = 'Q%q\n%F' + + elif span <= 11 * periodsperyear: + info_maj[year_start] = True + info['min'] = True + info_fmt[year_start] = '%F' + + else: + years = dates_[year_start] // 4 + 1 + nyears = span / periodsperyear + (min_anndef, maj_anndef) = _get_default_annual_spacing(nyears) + major_idx = year_start[(years % maj_anndef == 0)] + info_maj[major_idx] = True + info['min'][year_start[(years % min_anndef == 0)]] = True + info_fmt[major_idx] = '%F' + + return info + + +def _annual_finder(vmin, vmax, freq): + (vmin, vmax) = (int(vmin), int(vmax + 1)) + span = vmax - vmin + 1 + + info = np.zeros(span, + dtype=[('val', int), ('maj', bool), ('min', bool), + ('fmt', '|S8')]) + info['val'] = 
np.arange(vmin, vmax + 1) + info['fmt'] = '' + dates_ = info['val'] + + (min_anndef, maj_anndef) = _get_default_annual_spacing(span) + major_idx = dates_ % maj_anndef == 0 + info['maj'][major_idx] = True + info['min'][(dates_ % min_anndef == 0)] = True + info['fmt'][major_idx] = '%Y' + + return info + + +def get_finder(freq): + if isinstance(freq, compat.string_types): + freq = frequencies.get_freq(freq) + fgroup = frequencies.get_freq_group(freq) + + if fgroup == FreqGroup.FR_ANN: + return _annual_finder + elif fgroup == FreqGroup.FR_QTR: + return _quarterly_finder + elif freq == FreqGroup.FR_MTH: + return _monthly_finder + elif ((freq >= FreqGroup.FR_BUS) or fgroup == FreqGroup.FR_WK): + return _daily_finder + else: # pragma: no cover + errmsg = "Unsupported frequency: %s" % (freq) + raise NotImplementedError(errmsg) + + +class TimeSeries_DateLocator(Locator): + """ + Locates the ticks along an axis controlled by a :class:`Series`. + + Parameters + ---------- + freq : {var} + Valid frequency specifier. + minor_locator : {False, True}, optional + Whether the locator is for minor ticks (True) or not. + dynamic_mode : {True, False}, optional + Whether the locator should work in dynamic mode. + base : {int}, optional + quarter : {int}, optional + month : {int}, optional + day : {int}, optional + """ + + def __init__(self, freq, minor_locator=False, dynamic_mode=True, + base=1, quarter=1, month=1, day=1, plot_obj=None): + if isinstance(freq, compat.string_types): + freq = frequencies.get_freq(freq) + self.freq = freq + self.base = base + (self.quarter, self.month, self.day) = (quarter, month, day) + self.isminor = minor_locator + self.isdynamic = dynamic_mode + self.offset = 0 + self.plot_obj = plot_obj + self.finder = get_finder(freq) + + def _get_default_locs(self, vmin, vmax): + "Returns the default locations of ticks." + + if self.plot_obj.date_axis_info is None: + self.plot_obj.date_axis_info = self.finder(vmin, vmax, self.freq) + + locator = self.plot_obj.date_axis_info + + if self.isminor: + return np.compress(locator['min'], locator['val']) + return np.compress(locator['maj'], locator['val']) + + def __call__(self): + 'Return the locations of the ticks.' + # axis calls Locator.set_axis inside set_m_formatter + vi = tuple(self.axis.get_view_interval()) + if vi != self.plot_obj.view_interval: + self.plot_obj.date_axis_info = None + self.plot_obj.view_interval = vi + vmin, vmax = vi + if vmax < vmin: + vmin, vmax = vmax, vmin + if self.isdynamic: + locs = self._get_default_locs(vmin, vmax) + else: # pragma: no cover + base = self.base + (d, m) = divmod(vmin, base) + vmin = (d + 1) * base + locs = lrange(vmin, vmax + 1, base) + return locs + + def autoscale(self): + """ + Sets the view limits to the nearest multiples of base that contain the + data. + """ + # requires matplotlib >= 0.98.0 + (vmin, vmax) = self.axis.get_data_interval() + + locs = self._get_default_locs(vmin, vmax) + (vmin, vmax) = locs[[0, -1]] + if vmin == vmax: + vmin -= 1 + vmax += 1 + return nonsingular(vmin, vmax) + +# ------------------------------------------------------------------------- +# --- Formatter --- +# ------------------------------------------------------------------------- + + +class TimeSeries_DateFormatter(Formatter): + """ + Formats the ticks along an axis controlled by a :class:`PeriodIndex`. + + Parameters + ---------- + freq : {int, string} + Valid frequency specifier. + minor_locator : {False, True} + Whether the current formatter should apply to minor ticks (True) or + major ticks (False). 
+ dynamic_mode : {True, False} + Whether the formatter works in dynamic mode or not. + """ + + def __init__(self, freq, minor_locator=False, dynamic_mode=True, + plot_obj=None): + if isinstance(freq, compat.string_types): + freq = frequencies.get_freq(freq) + self.format = None + self.freq = freq + self.locs = [] + self.formatdict = None + self.isminor = minor_locator + self.isdynamic = dynamic_mode + self.offset = 0 + self.plot_obj = plot_obj + self.finder = get_finder(freq) + + def _set_default_format(self, vmin, vmax): + "Returns the default ticks spacing." + + if self.plot_obj.date_axis_info is None: + self.plot_obj.date_axis_info = self.finder(vmin, vmax, self.freq) + info = self.plot_obj.date_axis_info + + if self.isminor: + format = np.compress(info['min'] & np.logical_not(info['maj']), + info) + else: + format = np.compress(info['maj'], info) + self.formatdict = dict([(x, f) for (x, _, _, f) in format]) + return self.formatdict + + def set_locs(self, locs): + 'Sets the locations of the ticks' + # don't actually use the locs. This is just needed to work with + # matplotlib. Force to use vmin, vmax + self.locs = locs + + (vmin, vmax) = vi = tuple(self.axis.get_view_interval()) + if vi != self.plot_obj.view_interval: + self.plot_obj.date_axis_info = None + self.plot_obj.view_interval = vi + if vmax < vmin: + (vmin, vmax) = (vmax, vmin) + self._set_default_format(vmin, vmax) + + def __call__(self, x, pos=0): + if self.formatdict is None: + return '' + else: + fmt = self.formatdict.pop(x, '') + return Period(ordinal=int(x), freq=self.freq).strftime(fmt) + + +class TimeSeries_TimedeltaFormatter(Formatter): + """ + Formats the ticks along an axis controlled by a :class:`TimedeltaIndex`. + """ + + @staticmethod + def format_timedelta_ticks(x, pos, n_decimals): + """ + Convert seconds to 'D days HH:MM:SS.F' + """ + s, ns = divmod(x, 1e9) + m, s = divmod(s, 60) + h, m = divmod(m, 60) + d, h = divmod(h, 24) + decimals = int(ns * 10**(n_decimals - 9)) + s = r'{:02d}:{:02d}:{:02d}'.format(int(h), int(m), int(s)) + if n_decimals > 0: + s += '.{{:0{:0d}d}}'.format(n_decimals).format(decimals) + if d != 0: + s = '{:d} days '.format(int(d)) + s + return s + + def __call__(self, x, pos=0): + (vmin, vmax) = tuple(self.axis.get_view_interval()) + n_decimals = int(np.ceil(np.log10(100 * 1e9 / (vmax - vmin)))) + if n_decimals > 9: + n_decimals = 9 + return self.format_timedelta_ticks(x, pos, n_decimals) diff --git a/pandas/plotting/_core.py b/pandas/plotting/_core.py new file mode 100644 index 0000000000000..e88979b14c8af --- /dev/null +++ b/pandas/plotting/_core.py @@ -0,0 +1,2829 @@ +# being a bit too dynamic +# pylint: disable=E1101 +from __future__ import division + +import warnings +import re +from collections import namedtuple +from distutils.version import LooseVersion + +import numpy as np + +from pandas.util._decorators import cache_readonly +from pandas.core.base import PandasObject +from pandas.core.dtypes.common import ( + is_list_like, + is_integer, + is_number, + is_hashable, + is_iterator) +from pandas.core.common import AbstractMethodError, isnull, _try_sort +from pandas.core.generic import _shared_docs, _shared_doc_kwargs +from pandas.core.index import Index, MultiIndex +from pandas.core.series import Series, remove_na +from pandas.core.indexes.period import PeriodIndex +from pandas.compat import range, lrange, map, zip, string_types +import pandas.compat as compat +from pandas.io.formats.printing import pprint_thing +from pandas.util._decorators import Appender + +from 
pandas.plotting._compat import (_mpl_ge_1_3_1, + _mpl_ge_1_5_0) +from pandas.plotting._style import (mpl_stylesheet, plot_params, + _get_standard_colors) +from pandas.plotting._tools import (_subplots, _flatten, table, + _handle_shared_axes, _get_all_lines, + _get_xlim, _set_ticks_props, + format_date_labels) + + +if _mpl_ge_1_5_0(): + # Compat with mp 1.5, which uses cycler. + import cycler + colors = mpl_stylesheet.pop('axes.color_cycle') + mpl_stylesheet['axes.prop_cycle'] = cycler.cycler('color', colors) + + +def _get_standard_kind(kind): + return {'density': 'kde'}.get(kind, kind) + + +def _gca(): + import matplotlib.pyplot as plt + return plt.gca() + + +def _gcf(): + import matplotlib.pyplot as plt + return plt.gcf() + + +class MPLPlot(object): + """ + Base class for assembling a pandas plot using matplotlib + + Parameters + ---------- + data : + + """ + + @property + def _kind(self): + """Specify kind str. Must be overridden in child class""" + raise NotImplementedError + + _layout_type = 'vertical' + _default_rot = 0 + orientation = None + _pop_attributes = ['label', 'style', 'logy', 'logx', 'loglog', + 'mark_right', 'stacked'] + _attr_defaults = {'logy': False, 'logx': False, 'loglog': False, + 'mark_right': True, 'stacked': False} + + def __init__(self, data, kind=None, by=None, subplots=False, sharex=None, + sharey=False, use_index=True, + figsize=None, grid=None, legend=True, rot=None, + ax=None, fig=None, title=None, xlim=None, ylim=None, + xticks=None, yticks=None, + sort_columns=False, fontsize=None, + secondary_y=False, colormap=None, + table=False, layout=None, **kwds): + + self.data = data + self.by = by + + self.kind = kind + + self.sort_columns = sort_columns + + self.subplots = subplots + + if sharex is None: + if ax is None: + self.sharex = True + else: + # if we get an axis, the users should do the visibility + # setting... + self.sharex = False + else: + self.sharex = sharex + + self.sharey = sharey + self.figsize = figsize + self.layout = layout + + self.xticks = xticks + self.yticks = yticks + self.xlim = xlim + self.ylim = ylim + self.title = title + self.use_index = use_index + + self.fontsize = fontsize + + if rot is not None: + self.rot = rot + # need to know for format_date_labels since it's rotated to 30 by + # default + self._rot_set = True + else: + self._rot_set = False + self.rot = self._default_rot + + if grid is None: + grid = False if secondary_y else self.plt.rcParams['axes.grid'] + + self.grid = grid + self.legend = legend + self.legend_handles = [] + self.legend_labels = [] + + for attr in self._pop_attributes: + value = kwds.pop(attr, self._attr_defaults.get(attr, None)) + setattr(self, attr, value) + + self.ax = ax + self.fig = fig + self.axes = None + + # parse errorbar input if given + xerr = kwds.pop('xerr', None) + yerr = kwds.pop('yerr', None) + self.errors = {} + for kw, err in zip(['xerr', 'yerr'], [xerr, yerr]): + self.errors[kw] = self._parse_errorbars(kw, err) + + if not isinstance(secondary_y, (bool, tuple, list, np.ndarray, Index)): + secondary_y = [secondary_y] + self.secondary_y = secondary_y + + # ugly TypeError if user passes matplotlib's `cmap` name. + # Probably better to accept either. 
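+        # e.g. both df.plot(colormap='viridis') and df.plot(cmap='viridis')
+        # are accepted; passing the two together raises the TypeError below.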
+        if 'cmap' in kwds and colormap:
+            raise TypeError("Only specify one of `cmap` and `colormap`.")
+        elif 'cmap' in kwds:
+            self.colormap = kwds.pop('cmap')
+        else:
+            self.colormap = colormap
+
+        self.table = table
+
+        self.kwds = kwds
+
+        self._validate_color_args()
+
+    def _validate_color_args(self):
+        if 'color' not in self.kwds and 'colors' in self.kwds:
+            warnings.warn(("'colors' is being deprecated. Please use 'color' "
+                           "instead of 'colors'"))
+            colors = self.kwds.pop('colors')
+            self.kwds['color'] = colors
+
+        if ('color' in self.kwds and self.nseries == 1):
+            # support series.plot(color='green')
+            self.kwds['color'] = [self.kwds['color']]
+
+        if ('color' in self.kwds or 'colors' in self.kwds) and \
+                self.colormap is not None:
+            warnings.warn("'color' and 'colormap' cannot be used "
+                          "simultaneously. Using 'color'")
+
+        if 'color' in self.kwds and self.style is not None:
+            if is_list_like(self.style):
+                styles = self.style
+            else:
+                styles = [self.style]
+            # need only a single match
+            for s in styles:
+                if re.match('^[a-z]+?', s) is not None:
+                    raise ValueError(
+                        "Cannot pass 'style' string with a color "
+                        "symbol and 'color' keyword argument. Please"
+                        " use one or the other or pass 'style' "
+                        "without a color symbol")
+
+    def _iter_data(self, data=None, keep_index=False, fillna=None):
+        if data is None:
+            data = self.data
+        if fillna is not None:
+            data = data.fillna(fillna)
+
+        # TODO: unused?
+        # if self.sort_columns:
+        #     columns = _try_sort(data.columns)
+        # else:
+        #     columns = data.columns
+
+        for col, values in data.iteritems():
+            if keep_index is True:
+                yield col, values
+            else:
+                yield col, values.values
+
+    @property
+    def nseries(self):
+        if self.data.ndim == 1:
+            return 1
+        else:
+            return self.data.shape[1]
+
+    def draw(self):
+        self.plt.draw_if_interactive()
+
+    def generate(self):
+        self._args_adjust()
+        self._compute_plot_data()
+        self._setup_subplots()
+        self._make_plot()
+        self._add_table()
+        self._make_legend()
+        self._adorn_subplots()
+
+        for ax in self.axes:
+            self._post_plot_logic_common(ax, self.data)
+            self._post_plot_logic(ax, self.data)
+
+    def _args_adjust(self):
+        pass
+
+    def _has_plotted_object(self, ax):
+        """check whether ax has data"""
+        return (len(ax.lines) != 0 or
+                len(ax.artists) != 0 or
+                len(ax.containers) != 0)
+
+    def _maybe_right_yaxis(self, ax, axes_num):
+        if not self.on_right(axes_num):
+            # secondary axes may be passed via ax kw
+            return self._get_ax_layer(ax)
+
+        if hasattr(ax, 'right_ax'):
+            # if it has a right_ax property, ``ax`` must be the left axes
+            return ax.right_ax
+        elif hasattr(ax, 'left_ax'):
+            # if it has a left_ax property, ``ax`` must be the right axes
+            return ax
+        else:
+            # otherwise, create twin axes
+            orig_ax, new_ax = ax, ax.twinx()
+            # TODO: use Matplotlib public API when available
+            new_ax._get_lines = orig_ax._get_lines
+            new_ax._get_patches_for_fill = orig_ax._get_patches_for_fill
+            orig_ax.right_ax, new_ax.left_ax = new_ax, orig_ax
+
+            if not self._has_plotted_object(orig_ax):  # no data on left y
+                orig_ax.get_yaxis().set_visible(False)
+            return new_ax
+
+    def _setup_subplots(self):
+        if self.subplots:
+            fig, axes = _subplots(naxes=self.nseries,
+                                  sharex=self.sharex, sharey=self.sharey,
+                                  figsize=self.figsize, ax=self.ax,
+                                  layout=self.layout,
+                                  layout_type=self._layout_type)
+        else:
+            if self.ax is None:
+                fig = self.plt.figure(figsize=self.figsize)
+                axes = fig.add_subplot(111)
+            else:
+                fig = self.ax.get_figure()
+                if self.figsize is not None:
+                    fig.set_size_inches(self.figsize)
+                axes = self.ax
+
+        axes = 
_flatten(axes) + + if self.logx or self.loglog: + [a.set_xscale('log') for a in axes] + if self.logy or self.loglog: + [a.set_yscale('log') for a in axes] + + self.fig = fig + self.axes = axes + + @property + def result(self): + """ + Return result axes + """ + if self.subplots: + if self.layout is not None and not is_list_like(self.ax): + return self.axes.reshape(*self.layout) + else: + return self.axes + else: + sec_true = isinstance(self.secondary_y, bool) and self.secondary_y + all_sec = (is_list_like(self.secondary_y) and + len(self.secondary_y) == self.nseries) + if (sec_true or all_sec): + # if all data is plotted on secondary, return right axes + return self._get_ax_layer(self.axes[0], primary=False) + else: + return self.axes[0] + + def _compute_plot_data(self): + data = self.data + + if isinstance(data, Series): + label = self.label + if label is None and data.name is None: + label = 'None' + data = data.to_frame(name=label) + + numeric_data = data._convert(datetime=True)._get_numeric_data() + + try: + is_empty = numeric_data.empty + except AttributeError: + is_empty = not len(numeric_data) + + # no empty frames or series allowed + if is_empty: + raise TypeError('Empty {0!r}: no numeric data to ' + 'plot'.format(numeric_data.__class__.__name__)) + + self.data = numeric_data + + def _make_plot(self): + raise AbstractMethodError(self) + + def _add_table(self): + if self.table is False: + return + elif self.table is True: + data = self.data.transpose() + else: + data = self.table + ax = self._get_ax(0) + table(ax, data) + + def _post_plot_logic_common(self, ax, data): + """Common post process for each axes""" + labels = [pprint_thing(key) for key in data.index] + labels = dict(zip(range(len(data.index)), labels)) + + if self.orientation == 'vertical' or self.orientation is None: + if self._need_to_set_index: + xticklabels = [labels.get(x, '') for x in ax.get_xticks()] + ax.set_xticklabels(xticklabels) + self._apply_axis_properties(ax.xaxis, rot=self.rot, + fontsize=self.fontsize) + self._apply_axis_properties(ax.yaxis, fontsize=self.fontsize) + elif self.orientation == 'horizontal': + if self._need_to_set_index: + yticklabels = [labels.get(y, '') for y in ax.get_yticks()] + ax.set_yticklabels(yticklabels) + self._apply_axis_properties(ax.yaxis, rot=self.rot, + fontsize=self.fontsize) + self._apply_axis_properties(ax.xaxis, fontsize=self.fontsize) + else: # pragma no cover + raise ValueError + + def _post_plot_logic(self, ax, data): + """Post process for each axes. 
Overridden in child classes""" + pass + + def _adorn_subplots(self): + """Common post process unrelated to data""" + if len(self.axes) > 0: + all_axes = self._get_subplots() + nrows, ncols = self._get_axes_layout() + _handle_shared_axes(axarr=all_axes, nplots=len(all_axes), + naxes=nrows * ncols, nrows=nrows, + ncols=ncols, sharex=self.sharex, + sharey=self.sharey) + + for ax in self.axes: + if self.yticks is not None: + ax.set_yticks(self.yticks) + + if self.xticks is not None: + ax.set_xticks(self.xticks) + + if self.ylim is not None: + ax.set_ylim(self.ylim) + + if self.xlim is not None: + ax.set_xlim(self.xlim) + + ax.grid(self.grid) + + if self.title: + if self.subplots: + if is_list_like(self.title): + if len(self.title) != self.nseries: + msg = ('The length of `title` must equal the number ' + 'of columns if using `title` of type `list` ' + 'and `subplots=True`.\n' + 'length of title = {}\n' + 'number of columns = {}').format( + len(self.title), self.nseries) + raise ValueError(msg) + + for (ax, title) in zip(self.axes, self.title): + ax.set_title(title) + else: + self.fig.suptitle(self.title) + else: + if is_list_like(self.title): + msg = ('Using `title` of type `list` is not supported ' + 'unless `subplots=True` is passed') + raise ValueError(msg) + self.axes[0].set_title(self.title) + + def _apply_axis_properties(self, axis, rot=None, fontsize=None): + labels = axis.get_majorticklabels() + axis.get_minorticklabels() + for label in labels: + if rot is not None: + label.set_rotation(rot) + if fontsize is not None: + label.set_fontsize(fontsize) + + @property + def legend_title(self): + if not isinstance(self.data.columns, MultiIndex): + name = self.data.columns.name + if name is not None: + name = pprint_thing(name) + return name + else: + stringified = map(pprint_thing, + self.data.columns.names) + return ','.join(stringified) + + def _add_legend_handle(self, handle, label, index=None): + if label is not None: + if self.mark_right and index is not None: + if self.on_right(index): + label = label + ' (right)' + self.legend_handles.append(handle) + self.legend_labels.append(label) + + def _make_legend(self): + ax, leg = self._get_ax_legend(self.axes[0]) + + handles = [] + labels = [] + title = '' + + if not self.subplots: + if leg is not None: + title = leg.get_title().get_text() + handles = leg.legendHandles + labels = [x.get_text() for x in leg.get_texts()] + + if self.legend: + if self.legend == 'reverse': + self.legend_handles = reversed(self.legend_handles) + self.legend_labels = reversed(self.legend_labels) + + handles += self.legend_handles + labels += self.legend_labels + if self.legend_title is not None: + title = self.legend_title + + if len(handles) > 0: + ax.legend(handles, labels, loc='best', title=title) + + elif self.subplots and self.legend: + for ax in self.axes: + if ax.get_visible(): + ax.legend(loc='best') + + def _get_ax_legend(self, ax): + leg = ax.get_legend() + other_ax = (getattr(ax, 'left_ax', None) or + getattr(ax, 'right_ax', None)) + other_leg = None + if other_ax is not None: + other_leg = other_ax.get_legend() + if leg is None and other_leg is not None: + leg = other_leg + ax = other_ax + return ax, leg + + @cache_readonly + def plt(self): + import matplotlib.pyplot as plt + return plt + + @staticmethod + def mpl_ge_1_3_1(): + return _mpl_ge_1_3_1() + + @staticmethod + def mpl_ge_1_5_0(): + return _mpl_ge_1_5_0() + + _need_to_set_index = False + + def _get_xticks(self, convert_period=False): + index = self.data.index + is_datetype = index.inferred_type 
in ('datetime', 'date', + 'datetime64', 'time') + + if self.use_index: + if convert_period and isinstance(index, PeriodIndex): + self.data = self.data.reindex(index=index.sort_values()) + x = self.data.index.to_timestamp()._mpl_repr() + elif index.is_numeric(): + """ + Matplotlib supports numeric values or datetime objects as + xaxis values. Taking LBYL approach here, by the time + matplotlib raises exception when using non numeric/datetime + values for xaxis, several actions are already taken by plt. + """ + x = index._mpl_repr() + elif is_datetype: + self.data = self.data.sort_index() + x = self.data.index._mpl_repr() + else: + self._need_to_set_index = True + x = lrange(len(index)) + else: + x = lrange(len(index)) + + return x + + @classmethod + def _plot(cls, ax, x, y, style=None, is_errorbar=False, **kwds): + mask = isnull(y) + if mask.any(): + y = np.ma.array(y) + y = np.ma.masked_where(mask, y) + + if isinstance(x, Index): + x = x._mpl_repr() + + if is_errorbar: + if 'xerr' in kwds: + kwds['xerr'] = np.array(kwds.get('xerr')) + if 'yerr' in kwds: + kwds['yerr'] = np.array(kwds.get('yerr')) + return ax.errorbar(x, y, **kwds) + else: + # prevent style kwarg from going to errorbar, where it is + # unsupported + if style is not None: + args = (x, y, style) + else: + args = (x, y) + return ax.plot(*args, **kwds) + + def _get_index_name(self): + if isinstance(self.data.index, MultiIndex): + name = self.data.index.names + if any(x is not None for x in name): + name = ','.join([pprint_thing(x) for x in name]) + else: + name = None + else: + name = self.data.index.name + if name is not None: + name = pprint_thing(name) + + return name + + @classmethod + def _get_ax_layer(cls, ax, primary=True): + """get left (primary) or right (secondary) axes""" + if primary: + return getattr(ax, 'left_ax', ax) + else: + return getattr(ax, 'right_ax', ax) + + def _get_ax(self, i): + # get the twinx ax if appropriate + if self.subplots: + ax = self.axes[i] + ax = self._maybe_right_yaxis(ax, i) + self.axes[i] = ax + else: + ax = self.axes[0] + ax = self._maybe_right_yaxis(ax, i) + + ax.get_yaxis().set_visible(True) + return ax + + def on_right(self, i): + if isinstance(self.secondary_y, bool): + return self.secondary_y + + if isinstance(self.secondary_y, (tuple, list, np.ndarray, Index)): + return self.data.columns[i] in self.secondary_y + + def _apply_style_colors(self, colors, kwds, col_num, label): + """ + Manage style and color based on column number and its label. + Returns tuple of appropriate style and kwds which "color" may be added. 
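+
+        For example, ``style='--'`` carries no color symbol, so when a
+        color source is in play (an explicit color, a colormap, or
+        ``subplots=True``) a color from ``colors`` is injected into
+        ``kwds``; ``style='g--'`` already encodes green and is left alone.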
+ """ + style = None + if self.style is not None: + if isinstance(self.style, list): + try: + style = self.style[col_num] + except IndexError: + pass + elif isinstance(self.style, dict): + style = self.style.get(label, style) + else: + style = self.style + + has_color = 'color' in kwds or self.colormap is not None + nocolor_style = style is None or re.match('[a-z]+', style) is None + if (has_color or self.subplots) and nocolor_style: + kwds['color'] = colors[col_num % len(colors)] + return style, kwds + + def _get_colors(self, num_colors=None, color_kwds='color'): + if num_colors is None: + num_colors = self.nseries + + return _get_standard_colors(num_colors=num_colors, + colormap=self.colormap, + color=self.kwds.get(color_kwds)) + + def _parse_errorbars(self, label, err): + """ + Look for error keyword arguments and return the actual errorbar data + or return the error DataFrame/dict + + Error bars can be specified in several ways: + Series: the user provides a pandas.Series object of the same + length as the data + ndarray: provides a np.ndarray of the same length as the data + DataFrame/dict: error values are paired with keys matching the + key in the plotted DataFrame + str: the name of the column within the plotted DataFrame + """ + + if err is None: + return None + + from pandas import DataFrame, Series + + def match_labels(data, e): + e = e.reindex_axis(data.index) + return e + + # key-matched DataFrame + if isinstance(err, DataFrame): + + err = match_labels(self.data, err) + # key-matched dict + elif isinstance(err, dict): + pass + + # Series of error values + elif isinstance(err, Series): + # broadcast error series across data + err = match_labels(self.data, err) + err = np.atleast_2d(err) + err = np.tile(err, (self.nseries, 1)) + + # errors are a column in the dataframe + elif isinstance(err, string_types): + evalues = self.data[err].values + self.data = self.data[self.data.columns.drop(err)] + err = np.atleast_2d(evalues) + err = np.tile(err, (self.nseries, 1)) + + elif is_list_like(err): + if is_iterator(err): + err = np.atleast_2d(list(err)) + else: + # raw error values + err = np.atleast_2d(err) + + err_shape = err.shape + + # asymmetrical error bars + if err.ndim == 3: + if (err_shape[0] != self.nseries) or \ + (err_shape[1] != 2) or \ + (err_shape[2] != len(self.data)): + msg = "Asymmetrical error bars should be provided " + \ + "with the shape (%u, 2, %u)" % \ + (self.nseries, len(self.data)) + raise ValueError(msg) + + # broadcast errors to each data series + if len(err) == 1: + err = np.tile(err, (self.nseries, 1)) + + elif is_number(err): + err = np.tile([err], (self.nseries, len(self.data))) + + else: + msg = "No valid %s detected" % label + raise ValueError(msg) + + return err + + def _get_errorbars(self, label=None, index=None, xerr=True, yerr=True): + from pandas import DataFrame + errors = {} + + for kw, flag in zip(['xerr', 'yerr'], [xerr, yerr]): + if flag: + err = self.errors[kw] + # user provided label-matched dataframe of errors + if isinstance(err, (DataFrame, dict)): + if label is not None and label in err.keys(): + err = err[label] + else: + err = None + elif index is not None and err is not None: + err = err[index] + + if err is not None: + errors[kw] = err + return errors + + def _get_subplots(self): + from matplotlib.axes import Subplot + return [ax for ax in self.axes[0].get_figure().get_axes() + if isinstance(ax, Subplot)] + + def _get_axes_layout(self): + axes = self._get_subplots() + x_set = set() + y_set = set() + for ax in axes: + # check axes 
coordinates to estimate layout + points = ax.get_position().get_points() + x_set.add(points[0][0]) + y_set.add(points[0][1]) + return (len(y_set), len(x_set)) + + +class PlanePlot(MPLPlot): + """ + Abstract class for plotting on plane, currently scatter and hexbin. + """ + + _layout_type = 'single' + + def __init__(self, data, x, y, **kwargs): + MPLPlot.__init__(self, data, **kwargs) + if x is None or y is None: + raise ValueError(self._kind + ' requires and x and y column') + if is_integer(x) and not self.data.columns.holds_integer(): + x = self.data.columns[x] + if is_integer(y) and not self.data.columns.holds_integer(): + y = self.data.columns[y] + self.x = x + self.y = y + + @property + def nseries(self): + return 1 + + def _post_plot_logic(self, ax, data): + x, y = self.x, self.y + ax.set_ylabel(pprint_thing(y)) + ax.set_xlabel(pprint_thing(x)) + + +class ScatterPlot(PlanePlot): + _kind = 'scatter' + + def __init__(self, data, x, y, s=None, c=None, **kwargs): + if s is None: + # hide the matplotlib default for size, in case we want to change + # the handling of this argument later + s = 20 + super(ScatterPlot, self).__init__(data, x, y, s=s, **kwargs) + if is_integer(c) and not self.data.columns.holds_integer(): + c = self.data.columns[c] + self.c = c + + def _make_plot(self): + x, y, c, data = self.x, self.y, self.c, self.data + ax = self.axes[0] + + c_is_column = is_hashable(c) and c in self.data.columns + + # plot a colorbar only if a colormap is provided or necessary + cb = self.kwds.pop('colorbar', self.colormap or c_is_column) + + # pandas uses colormap, matplotlib uses cmap. + cmap = self.colormap or 'Greys' + cmap = self.plt.cm.get_cmap(cmap) + color = self.kwds.pop("color", None) + if c is not None and color is not None: + raise TypeError('Specify exactly one of `c` and `color`') + elif c is None and color is None: + c_values = self.plt.rcParams['patch.facecolor'] + elif color is not None: + c_values = color + elif c_is_column: + c_values = self.data[c].values + else: + c_values = c + + if self.legend and hasattr(self, 'label'): + label = self.label + else: + label = None + scatter = ax.scatter(data[x].values, data[y].values, c=c_values, + label=label, cmap=cmap, **self.kwds) + if cb: + img = ax.collections[0] + kws = dict(ax=ax) + if self.mpl_ge_1_3_1(): + kws['label'] = c if c_is_column else '' + self.fig.colorbar(img, **kws) + + if label is not None: + self._add_legend_handle(scatter, label) + else: + self.legend = False + + errors_x = self._get_errorbars(label=x, index=0, yerr=False) + errors_y = self._get_errorbars(label=y, index=0, xerr=False) + if len(errors_x) > 0 or len(errors_y) > 0: + err_kwds = dict(errors_x, **errors_y) + err_kwds['ecolor'] = scatter.get_facecolor()[0] + ax.errorbar(data[x].values, data[y].values, + linestyle='none', **err_kwds) + + +class HexBinPlot(PlanePlot): + _kind = 'hexbin' + + def __init__(self, data, x, y, C=None, **kwargs): + super(HexBinPlot, self).__init__(data, x, y, **kwargs) + if is_integer(C) and not self.data.columns.holds_integer(): + C = self.data.columns[C] + self.C = C + + def _make_plot(self): + x, y, data, C = self.x, self.y, self.data, self.C + ax = self.axes[0] + # pandas uses colormap, matplotlib uses cmap. 
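+        # e.g. df.plot.hexbin(x='a', y='b', colormap='viridis') overrides
+        # the 'BuGn' default below; pass colorbar=False to suppress the bar.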
+ cmap = self.colormap or 'BuGn' + cmap = self.plt.cm.get_cmap(cmap) + cb = self.kwds.pop('colorbar', True) + + if C is None: + c_values = None + else: + c_values = data[C].values + + ax.hexbin(data[x].values, data[y].values, C=c_values, cmap=cmap, + **self.kwds) + if cb: + img = ax.collections[0] + self.fig.colorbar(img, ax=ax) + + def _make_legend(self): + pass + + +class LinePlot(MPLPlot): + _kind = 'line' + _default_rot = 0 + orientation = 'vertical' + + def __init__(self, data, **kwargs): + MPLPlot.__init__(self, data, **kwargs) + if self.stacked: + self.data = self.data.fillna(value=0) + self.x_compat = plot_params['x_compat'] + if 'x_compat' in self.kwds: + self.x_compat = bool(self.kwds.pop('x_compat')) + + def _is_ts_plot(self): + # this is slightly deceptive + return not self.x_compat and self.use_index and self._use_dynamic_x() + + def _use_dynamic_x(self): + from pandas.plotting._timeseries import _use_dynamic_x + return _use_dynamic_x(self._get_ax(0), self.data) + + def _make_plot(self): + if self._is_ts_plot(): + from pandas.plotting._timeseries import _maybe_convert_index + data = _maybe_convert_index(self._get_ax(0), self.data) + + x = data.index # dummy, not used + plotf = self._ts_plot + it = self._iter_data(data=data, keep_index=True) + else: + x = self._get_xticks(convert_period=True) + plotf = self._plot + it = self._iter_data() + + stacking_id = self._get_stacking_id() + is_errorbar = any(e is not None for e in self.errors.values()) + + colors = self._get_colors() + for i, (label, y) in enumerate(it): + ax = self._get_ax(i) + kwds = self.kwds.copy() + style, kwds = self._apply_style_colors(colors, kwds, i, label) + + errors = self._get_errorbars(label=label, index=i) + kwds = dict(kwds, **errors) + + label = pprint_thing(label) # .encode('utf-8') + kwds['label'] = label + + newlines = plotf(ax, x, y, style=style, column_num=i, + stacking_id=stacking_id, + is_errorbar=is_errorbar, + **kwds) + self._add_legend_handle(newlines[0], label, index=i) + + lines = _get_all_lines(ax) + left, right = _get_xlim(lines) + ax.set_xlim(left, right) + + @classmethod + def _plot(cls, ax, x, y, style=None, column_num=None, + stacking_id=None, **kwds): + # column_num is used to get the target column from protf in line and + # area plots + if column_num == 0: + cls._initialize_stacker(ax, stacking_id, len(y)) + y_values = cls._get_stacked_values(ax, stacking_id, y, kwds['label']) + lines = MPLPlot._plot(ax, x, y_values, style=style, **kwds) + cls._update_stacker(ax, stacking_id, y) + return lines + + @classmethod + def _ts_plot(cls, ax, x, data, style=None, **kwds): + from pandas.plotting._timeseries import (_maybe_resample, + _decorate_axes, + format_dateaxis) + # accept x to be consistent with normal plot func, + # x is not passed to tsplot as it uses data.index as x coordinate + # column_num must be in kwds for stacking purpose + freq, data = _maybe_resample(data, ax, kwds) + + # Set ax with freq info + _decorate_axes(ax, freq, kwds) + # digging deeper + if hasattr(ax, 'left_ax'): + _decorate_axes(ax.left_ax, freq, kwds) + if hasattr(ax, 'right_ax'): + _decorate_axes(ax.right_ax, freq, kwds) + ax._plot_data.append((data, cls._kind, kwds)) + + lines = cls._plot(ax, data.index, data.values, style=style, **kwds) + # set date formatter, locators and rescale limits + format_dateaxis(ax, ax.freq, data.index) + return lines + + def _get_stacking_id(self): + if self.stacked: + return id(self.data) + else: + return None + + @classmethod + def _initialize_stacker(cls, ax, stacking_id, n): + if 
stacking_id is None: + return + if not hasattr(ax, '_stacker_pos_prior'): + ax._stacker_pos_prior = {} + if not hasattr(ax, '_stacker_neg_prior'): + ax._stacker_neg_prior = {} + ax._stacker_pos_prior[stacking_id] = np.zeros(n) + ax._stacker_neg_prior[stacking_id] = np.zeros(n) + + @classmethod + def _get_stacked_values(cls, ax, stacking_id, values, label): + if stacking_id is None: + return values + if not hasattr(ax, '_stacker_pos_prior'): + # stacker may not be initialized for subplots + cls._initialize_stacker(ax, stacking_id, len(values)) + + if (values >= 0).all(): + return ax._stacker_pos_prior[stacking_id] + values + elif (values <= 0).all(): + return ax._stacker_neg_prior[stacking_id] + values + + raise ValueError('When stacked is True, each column must be either ' + 'all positive or negative.' + '{0} contains both positive and negative values' + .format(label)) + + @classmethod + def _update_stacker(cls, ax, stacking_id, values): + if stacking_id is None: + return + if (values >= 0).all(): + ax._stacker_pos_prior[stacking_id] += values + elif (values <= 0).all(): + ax._stacker_neg_prior[stacking_id] += values + + def _post_plot_logic(self, ax, data): + condition = (not self._use_dynamic_x() and + data.index.is_all_dates and + not self.subplots or + (self.subplots and self.sharex)) + + index_name = self._get_index_name() + + if condition: + # irregular TS rotated 30 deg. by default + # probably a better place to check / set this. + if not self._rot_set: + self.rot = 30 + format_date_labels(ax, rot=self.rot) + + if index_name is not None and self.use_index: + ax.set_xlabel(index_name) + + +class AreaPlot(LinePlot): + _kind = 'area' + + def __init__(self, data, **kwargs): + kwargs.setdefault('stacked', True) + data = data.fillna(value=0) + LinePlot.__init__(self, data, **kwargs) + + if not self.stacked: + # use smaller alpha to distinguish overlap + self.kwds.setdefault('alpha', 0.5) + + if self.logy or self.loglog: + raise ValueError("Log-y scales are not supported in area plot") + + @classmethod + def _plot(cls, ax, x, y, style=None, column_num=None, + stacking_id=None, is_errorbar=False, **kwds): + + if column_num == 0: + cls._initialize_stacker(ax, stacking_id, len(y)) + y_values = cls._get_stacked_values(ax, stacking_id, y, kwds['label']) + + # need to remove label, because subplots uses mpl legend as it is + line_kwds = kwds.copy() + if cls.mpl_ge_1_5_0(): + line_kwds.pop('label') + lines = MPLPlot._plot(ax, x, y_values, style=style, **line_kwds) + + # get data from the line to get coordinates for fill_between + xdata, y_values = lines[0].get_data(orig=False) + + # unable to use ``_get_stacked_values`` here to get starting point + if stacking_id is None: + start = np.zeros(len(y)) + elif (y >= 0).all(): + start = ax._stacker_pos_prior[stacking_id] + elif (y <= 0).all(): + start = ax._stacker_neg_prior[stacking_id] + else: + start = np.zeros(len(y)) + + if 'color' not in kwds: + kwds['color'] = lines[0].get_color() + + rect = ax.fill_between(xdata, start, y_values, **kwds) + cls._update_stacker(ax, stacking_id, y) + + # LinePlot expects list of artists + res = [rect] if cls.mpl_ge_1_5_0() else lines + return res + + def _add_legend_handle(self, handle, label, index=None): + if not self.mpl_ge_1_5_0(): + from matplotlib.patches import Rectangle + # Because fill_between isn't supported in legend, + # specifically add Rectangle handle here + alpha = self.kwds.get('alpha', None) + handle = Rectangle((0, 0), 1, 1, fc=handle.get_color(), + alpha=alpha) + 
LinePlot._add_legend_handle(self, handle, label, index=index) + + def _post_plot_logic(self, ax, data): + LinePlot._post_plot_logic(self, ax, data) + + if self.ylim is None: + if (data >= 0).all().all(): + ax.set_ylim(0, None) + elif (data <= 0).all().all(): + ax.set_ylim(None, 0) + + +class BarPlot(MPLPlot): + _kind = 'bar' + _default_rot = 90 + orientation = 'vertical' + + def __init__(self, data, **kwargs): + self.bar_width = kwargs.pop('width', 0.5) + pos = kwargs.pop('position', 0.5) + kwargs.setdefault('align', 'center') + self.tick_pos = np.arange(len(data)) + + self.bottom = kwargs.pop('bottom', 0) + self.left = kwargs.pop('left', 0) + + self.log = kwargs.pop('log', False) + MPLPlot.__init__(self, data, **kwargs) + + if self.stacked or self.subplots: + self.tickoffset = self.bar_width * pos + if kwargs['align'] == 'edge': + self.lim_offset = self.bar_width / 2 + else: + self.lim_offset = 0 + else: + if kwargs['align'] == 'edge': + w = self.bar_width / self.nseries + self.tickoffset = self.bar_width * (pos - 0.5) + w * 0.5 + self.lim_offset = w * 0.5 + else: + self.tickoffset = self.bar_width * pos + self.lim_offset = 0 + + self.ax_pos = self.tick_pos - self.tickoffset + + def _args_adjust(self): + if is_list_like(self.bottom): + self.bottom = np.array(self.bottom) + if is_list_like(self.left): + self.left = np.array(self.left) + + @classmethod + def _plot(cls, ax, x, y, w, start=0, log=False, **kwds): + return ax.bar(x, y, w, bottom=start, log=log, **kwds) + + @property + def _start_base(self): + return self.bottom + + def _make_plot(self): + import matplotlib as mpl + + colors = self._get_colors() + ncolors = len(colors) + + pos_prior = neg_prior = np.zeros(len(self.data)) + K = self.nseries + + for i, (label, y) in enumerate(self._iter_data(fillna=0)): + ax = self._get_ax(i) + kwds = self.kwds.copy() + kwds['color'] = colors[i % ncolors] + + errors = self._get_errorbars(label=label, index=i) + kwds = dict(kwds, **errors) + + label = pprint_thing(label) + + if (('yerr' in kwds) or ('xerr' in kwds)) \ + and (kwds.get('ecolor') is None): + kwds['ecolor'] = mpl.rcParams['xtick.color'] + + start = 0 + if self.log and (y >= 1).all(): + start = 1 + start = start + self._start_base + + if self.subplots: + w = self.bar_width / 2 + rect = self._plot(ax, self.ax_pos + w, y, self.bar_width, + start=start, label=label, + log=self.log, **kwds) + ax.set_title(label) + elif self.stacked: + mask = y > 0 + start = np.where(mask, pos_prior, neg_prior) + self._start_base + w = self.bar_width / 2 + rect = self._plot(ax, self.ax_pos + w, y, self.bar_width, + start=start, label=label, + log=self.log, **kwds) + pos_prior = pos_prior + np.where(mask, y, 0) + neg_prior = neg_prior + np.where(mask, 0, y) + else: + w = self.bar_width / K + rect = self._plot(ax, self.ax_pos + (i + 0.5) * w, y, w, + start=start, label=label, + log=self.log, **kwds) + self._add_legend_handle(rect, label, index=i) + + def _post_plot_logic(self, ax, data): + if self.use_index: + str_index = [pprint_thing(key) for key in data.index] + else: + str_index = [pprint_thing(key) for key in range(data.shape[0])] + name = self._get_index_name() + + s_edge = self.ax_pos[0] - 0.25 + self.lim_offset + e_edge = self.ax_pos[-1] + 0.25 + self.bar_width + self.lim_offset + + self._decorate_ticks(ax, name, str_index, s_edge, e_edge) + + def _decorate_ticks(self, ax, name, ticklabels, start_edge, end_edge): + ax.set_xlim((start_edge, end_edge)) + ax.set_xticks(self.tick_pos) + ax.set_xticklabels(ticklabels) + if name is not None and 
self.use_index: + ax.set_xlabel(name) + + +class BarhPlot(BarPlot): + _kind = 'barh' + _default_rot = 0 + orientation = 'horizontal' + + @property + def _start_base(self): + return self.left + + @classmethod + def _plot(cls, ax, x, y, w, start=0, log=False, **kwds): + return ax.barh(x, y, w, left=start, log=log, **kwds) + + def _decorate_ticks(self, ax, name, ticklabels, start_edge, end_edge): + # horizontal bars + ax.set_ylim((start_edge, end_edge)) + ax.set_yticks(self.tick_pos) + ax.set_yticklabels(ticklabels) + if name is not None and self.use_index: + ax.set_ylabel(name) + + +class HistPlot(LinePlot): + _kind = 'hist' + + def __init__(self, data, bins=10, bottom=0, **kwargs): + self.bins = bins # use mpl default + self.bottom = bottom + # Do not call LinePlot.__init__ which may fill nan + MPLPlot.__init__(self, data, **kwargs) + + def _args_adjust(self): + if is_integer(self.bins): + # create common bin edge + values = (self.data._convert(datetime=True)._get_numeric_data()) + values = np.ravel(values) + values = values[~isnull(values)] + + hist, self.bins = np.histogram( + values, bins=self.bins, + range=self.kwds.get('range', None), + weights=self.kwds.get('weights', None)) + + if is_list_like(self.bottom): + self.bottom = np.array(self.bottom) + + @classmethod + def _plot(cls, ax, y, style=None, bins=None, bottom=0, column_num=0, + stacking_id=None, **kwds): + if column_num == 0: + cls._initialize_stacker(ax, stacking_id, len(bins) - 1) + y = y[~isnull(y)] + + base = np.zeros(len(bins) - 1) + bottom = bottom + \ + cls._get_stacked_values(ax, stacking_id, base, kwds['label']) + # ignore style + n, bins, patches = ax.hist(y, bins=bins, bottom=bottom, **kwds) + cls._update_stacker(ax, stacking_id, n) + return patches + + def _make_plot(self): + colors = self._get_colors() + stacking_id = self._get_stacking_id() + + for i, (label, y) in enumerate(self._iter_data()): + ax = self._get_ax(i) + + kwds = self.kwds.copy() + + label = pprint_thing(label) + kwds['label'] = label + + style, kwds = self._apply_style_colors(colors, kwds, i, label) + if style is not None: + kwds['style'] = style + + kwds = self._make_plot_keywords(kwds, y) + artists = self._plot(ax, y, column_num=i, + stacking_id=stacking_id, **kwds) + self._add_legend_handle(artists[0], label, index=i) + + def _make_plot_keywords(self, kwds, y): + """merge BoxPlot/KdePlot properties to passed kwds""" + # y is required for KdePlot + kwds['bottom'] = self.bottom + kwds['bins'] = self.bins + return kwds + + def _post_plot_logic(self, ax, data): + if self.orientation == 'horizontal': + ax.set_xlabel('Frequency') + else: + ax.set_ylabel('Frequency') + + @property + def orientation(self): + if self.kwds.get('orientation', None) == 'horizontal': + return 'horizontal' + else: + return 'vertical' + + +class KdePlot(HistPlot): + _kind = 'kde' + orientation = 'vertical' + + def __init__(self, data, bw_method=None, ind=None, **kwargs): + MPLPlot.__init__(self, data, **kwargs) + self.bw_method = bw_method + self.ind = ind + + def _args_adjust(self): + pass + + def _get_ind(self, y): + if self.ind is None: + # np.nanmax() and np.nanmin() ignores the missing values + sample_range = np.nanmax(y) - np.nanmin(y) + ind = np.linspace(np.nanmin(y) - 0.5 * sample_range, + np.nanmax(y) + 0.5 * sample_range, 1000) + else: + ind = self.ind + return ind + + @classmethod + def _plot(cls, ax, y, style=None, bw_method=None, ind=None, + column_num=None, stacking_id=None, **kwds): + from scipy.stats import gaussian_kde + from scipy import __version__ as spv + + 
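+        # ``ind`` is built by _get_ind above: by default a 1000-point
+        # grid spanning the data range padded by half the sample range
+        # on each side, so the estimated density tails off smoothly.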
y = remove_na(y) + + if LooseVersion(spv) >= '0.11.0': + gkde = gaussian_kde(y, bw_method=bw_method) + else: + gkde = gaussian_kde(y) + if bw_method is not None: + msg = ('bw_method was added in Scipy 0.11.0.' + + ' Scipy version in use is %s.' % spv) + warnings.warn(msg) + + y = gkde.evaluate(ind) + lines = MPLPlot._plot(ax, ind, y, style=style, **kwds) + return lines + + def _make_plot_keywords(self, kwds, y): + kwds['bw_method'] = self.bw_method + kwds['ind'] = self._get_ind(y) + return kwds + + def _post_plot_logic(self, ax, data): + ax.set_ylabel('Density') + + +class PiePlot(MPLPlot): + _kind = 'pie' + _layout_type = 'horizontal' + + def __init__(self, data, kind=None, **kwargs): + data = data.fillna(value=0) + if (data < 0).any().any(): + raise ValueError("{0} doesn't allow negative values".format(kind)) + MPLPlot.__init__(self, data, kind=kind, **kwargs) + + def _args_adjust(self): + self.grid = False + self.logy = False + self.logx = False + self.loglog = False + + def _validate_color_args(self): + pass + + def _make_plot(self): + colors = self._get_colors( + num_colors=len(self.data), color_kwds='colors') + self.kwds.setdefault('colors', colors) + + for i, (label, y) in enumerate(self._iter_data()): + ax = self._get_ax(i) + if label is not None: + label = pprint_thing(label) + ax.set_ylabel(label) + + kwds = self.kwds.copy() + + def blank_labeler(label, value): + if value == 0: + return '' + else: + return label + + idx = [pprint_thing(v) for v in self.data.index] + labels = kwds.pop('labels', idx) + # labels is used for each wedge's labels + # Blank out labels for values of 0 so they don't overlap + # with nonzero wedges + if labels is not None: + blabels = [blank_labeler(l, value) for + l, value in zip(labels, y)] + else: + blabels = None + results = ax.pie(y, labels=blabels, **kwds) + + if kwds.get('autopct', None) is not None: + patches, texts, autotexts = results + else: + patches, texts = results + autotexts = [] + + if self.fontsize is not None: + for t in texts + autotexts: + t.set_fontsize(self.fontsize) + + # leglabels is used for legend labels + leglabels = labels if labels is not None else idx + for p, l in zip(patches, leglabels): + self._add_legend_handle(p, l) + + +class BoxPlot(LinePlot): + _kind = 'box' + _layout_type = 'horizontal' + + _valid_return_types = (None, 'axes', 'dict', 'both') + # namedtuple to hold results + BP = namedtuple("Boxplot", ['ax', 'lines']) + + def __init__(self, data, return_type='axes', **kwargs): + # Do not call LinePlot.__init__ which may fill nan + if return_type not in self._valid_return_types: + raise ValueError( + "return_type must be {None, 'axes', 'dict', 'both'}") + + self.return_type = return_type + MPLPlot.__init__(self, data, **kwargs) + + def _args_adjust(self): + if self.subplots: + # Disable label ax sharing. 
Otherwise, all subplots shows last + # column label + if self.orientation == 'vertical': + self.sharex = False + else: + self.sharey = False + + @classmethod + def _plot(cls, ax, y, column_num=None, return_type='axes', **kwds): + if y.ndim == 2: + y = [remove_na(v) for v in y] + # Boxplot fails with empty arrays, so need to add a NaN + # if any cols are empty + # GH 8181 + y = [v if v.size > 0 else np.array([np.nan]) for v in y] + else: + y = remove_na(y) + bp = ax.boxplot(y, **kwds) + + if return_type == 'dict': + return bp, bp + elif return_type == 'both': + return cls.BP(ax=ax, lines=bp), bp + else: + return ax, bp + + def _validate_color_args(self): + if 'color' in self.kwds: + if self.colormap is not None: + warnings.warn("'color' and 'colormap' cannot be used " + "simultaneously. Using 'color'") + self.color = self.kwds.pop('color') + + if isinstance(self.color, dict): + valid_keys = ['boxes', 'whiskers', 'medians', 'caps'] + for key, values in compat.iteritems(self.color): + if key not in valid_keys: + raise ValueError("color dict contains invalid " + "key '{0}' " + "The key must be either {1}" + .format(key, valid_keys)) + else: + self.color = None + + # get standard colors for default + colors = _get_standard_colors(num_colors=3, + colormap=self.colormap, + color=None) + # use 2 colors by default, for box/whisker and median + # flier colors isn't needed here + # because it can be specified by ``sym`` kw + self._boxes_c = colors[0] + self._whiskers_c = colors[0] + self._medians_c = colors[2] + self._caps_c = 'k' # mpl default + + def _get_colors(self, num_colors=None, color_kwds='color'): + pass + + def maybe_color_bp(self, bp): + if isinstance(self.color, dict): + boxes = self.color.get('boxes', self._boxes_c) + whiskers = self.color.get('whiskers', self._whiskers_c) + medians = self.color.get('medians', self._medians_c) + caps = self.color.get('caps', self._caps_c) + else: + # Other types are forwarded to matplotlib + # If None, use default colors + boxes = self.color or self._boxes_c + whiskers = self.color or self._whiskers_c + medians = self.color or self._medians_c + caps = self.color or self._caps_c + + from matplotlib.artist import setp + setp(bp['boxes'], color=boxes, alpha=1) + setp(bp['whiskers'], color=whiskers, alpha=1) + setp(bp['medians'], color=medians, alpha=1) + setp(bp['caps'], color=caps, alpha=1) + + def _make_plot(self): + if self.subplots: + self._return_obj = Series() + + for i, (label, y) in enumerate(self._iter_data()): + ax = self._get_ax(i) + kwds = self.kwds.copy() + + ret, bp = self._plot(ax, y, column_num=i, + return_type=self.return_type, **kwds) + self.maybe_color_bp(bp) + self._return_obj[label] = ret + + label = [pprint_thing(label)] + self._set_ticklabels(ax, label) + else: + y = self.data.values.T + ax = self._get_ax(0) + kwds = self.kwds.copy() + + ret, bp = self._plot(ax, y, column_num=0, + return_type=self.return_type, **kwds) + self.maybe_color_bp(bp) + self._return_obj = ret + + labels = [l for l, _ in self._iter_data()] + labels = [pprint_thing(l) for l in labels] + if not self.use_index: + labels = [pprint_thing(key) for key in range(len(labels))] + self._set_ticklabels(ax, labels) + + def _set_ticklabels(self, ax, labels): + if self.orientation == 'vertical': + ax.set_xticklabels(labels) + else: + ax.set_yticklabels(labels) + + def _make_legend(self): + pass + + def _post_plot_logic(self, ax, data): + pass + + @property + def orientation(self): + if self.kwds.get('vert', True): + return 'vertical' + else: + return 'horizontal' + + 
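+    # A usage sketch of the return_type plumbing above (``df`` is a
+    # hypothetical DataFrame; return_type travels through the plot
+    # keyword arguments):
+    # >>> ax = df.plot.box()                      # 'axes': the Axes
+    # >>> d = df.plot.box(return_type='dict')     # dict of artists
+    # >>> both = df.plot.box(return_type='both')  # BP(ax, lines)
+    # >>> both.lines['medians'][0].set_color('k')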
@property + def result(self): + if self.return_type is None: + return super(BoxPlot, self).result + else: + return self._return_obj + + +# kinds supported by both dataframe and series +_common_kinds = ['line', 'bar', 'barh', + 'kde', 'density', 'area', 'hist', 'box'] +# kinds supported by dataframe +_dataframe_kinds = ['scatter', 'hexbin'] +# kinds supported only by series or dataframe single column +_series_kinds = ['pie'] +_all_kinds = _common_kinds + _dataframe_kinds + _series_kinds + +_klasses = [LinePlot, BarPlot, BarhPlot, KdePlot, HistPlot, BoxPlot, + ScatterPlot, HexBinPlot, AreaPlot, PiePlot] + +_plot_klass = {} +for klass in _klasses: + _plot_klass[klass._kind] = klass + + +def _plot(data, x=None, y=None, subplots=False, + ax=None, kind='line', **kwds): + kind = _get_standard_kind(kind.lower().strip()) + if kind in _all_kinds: + klass = _plot_klass[kind] + else: + raise ValueError("%r is not a valid plot kind" % kind) + + from pandas import DataFrame + if kind in _dataframe_kinds: + if isinstance(data, DataFrame): + plot_obj = klass(data, x=x, y=y, subplots=subplots, ax=ax, + kind=kind, **kwds) + else: + raise ValueError("plot kind %r can only be used for data frames" + % kind) + + elif kind in _series_kinds: + if isinstance(data, DataFrame): + if y is None and subplots is False: + msg = "{0} requires either y column or 'subplots=True'" + raise ValueError(msg.format(kind)) + elif y is not None: + if is_integer(y) and not data.columns.holds_integer(): + y = data.columns[y] + # converted to series actually. copy to not modify + data = data[y].copy() + data.index.name = y + plot_obj = klass(data, subplots=subplots, ax=ax, kind=kind, **kwds) + else: + if isinstance(data, DataFrame): + if x is not None: + if is_integer(x) and not data.columns.holds_integer(): + x = data.columns[x] + data = data.set_index(x) + + if y is not None: + if is_integer(y) and not data.columns.holds_integer(): + y = data.columns[y] + label = kwds['label'] if 'label' in kwds else y + series = data[y].copy() # Don't modify + series.name = label + + for kw in ['xerr', 'yerr']: + if (kw in kwds) and \ + (isinstance(kwds[kw], string_types) or + is_integer(kwds[kw])): + try: + kwds[kw] = data[kwds[kw]] + except (IndexError, KeyError, TypeError): + pass + data = series + plot_obj = klass(data, subplots=subplots, ax=ax, kind=kind, **kwds) + + plot_obj.generate() + plot_obj.draw() + return plot_obj.result + + +df_kind = """- 'scatter' : scatter plot + - 'hexbin' : hexbin plot""" +series_kind = "" + +df_coord = """x : label or position, default None + y : label or position, default None + Allows plotting of one column versus another""" +series_coord = "" + +df_unique = """stacked : boolean, default False in line and + bar plots, and True in area plot. If True, create stacked plot. 
+ sort_columns : boolean, default False + Sort column names to determine plot ordering + secondary_y : boolean or sequence, default False + Whether to plot on the secondary y-axis + If a list/tuple, which columns to plot on secondary y-axis""" +series_unique = """label : label argument to provide to plot + secondary_y : boolean or sequence of ints, default False + If True then y-axis will be on the right""" + +df_ax = """ax : matplotlib axes object, default None + subplots : boolean, default False + Make separate subplots for each column + sharex : boolean, default True if ax is None else False + In case subplots=True, share x axis and set some x axis labels to + invisible; defaults to True if ax is None otherwise False if an ax + is passed in; Be aware, that passing in both an ax and sharex=True + will alter all x axis labels for all axis in a figure! + sharey : boolean, default False + In case subplots=True, share y axis and set some y axis labels to + invisible + layout : tuple (optional) + (rows, columns) for the layout of subplots""" +series_ax = """ax : matplotlib axes object + If not passed, uses gca()""" + +df_note = """- If `kind` = 'scatter' and the argument `c` is the name of a dataframe + column, the values of that column are used to color each point. + - If `kind` = 'hexbin', you can control the size of the bins with the + `gridsize` argument. By default, a histogram of the counts around each + `(x, y)` point is computed. You can specify alternative aggregations + by passing values to the `C` and `reduce_C_function` arguments. + `C` specifies the value at each `(x, y)` point and `reduce_C_function` + is a function of one argument that reduces all the values in a bin to + a single number (e.g. `mean`, `max`, `sum`, `std`).""" +series_note = "" + +_shared_doc_df_kwargs = dict(klass='DataFrame', klass_obj='df', + klass_kind=df_kind, klass_coord=df_coord, + klass_ax=df_ax, klass_unique=df_unique, + klass_note=df_note) +_shared_doc_series_kwargs = dict(klass='Series', klass_obj='s', + klass_kind=series_kind, + klass_coord=series_coord, klass_ax=series_ax, + klass_unique=series_unique, + klass_note=series_note) + +_shared_docs['plot'] = """ + Make plots of %(klass)s using matplotlib / pylab. + + *New in version 0.17.0:* Each plot kind has a corresponding method on the + ``%(klass)s.plot`` accessor: + ``%(klass_obj)s.plot(kind='line')`` is equivalent to + ``%(klass_obj)s.plot.line()``. + + Parameters + ---------- + data : %(klass)s + %(klass_coord)s + kind : str + - 'line' : line plot (default) + - 'bar' : vertical bar plot + - 'barh' : horizontal bar plot + - 'hist' : histogram + - 'box' : boxplot + - 'kde' : Kernel Density Estimation plot + - 'density' : same as 'kde' + - 'area' : area plot + - 'pie' : pie plot + %(klass_kind)s + %(klass_ax)s + figsize : a tuple (width, height) in inches + use_index : boolean, default True + Use index as ticks for x axis + title : string or list + Title to use for the plot. If a string is passed, print the string at + the top of the figure. If a list is passed and `subplots` is True, + print each item in the list above the corresponding subplot. 
+ grid : boolean, default None (matlab style default) + Axis grid lines + legend : False/True/'reverse' + Place legend on axis subplots + style : list or dict + matplotlib line style per column + logx : boolean, default False + Use log scaling on x axis + logy : boolean, default False + Use log scaling on y axis + loglog : boolean, default False + Use log scaling on both x and y axes + xticks : sequence + Values to use for the xticks + yticks : sequence + Values to use for the yticks + xlim : 2-tuple/list + ylim : 2-tuple/list + rot : int, default None + Rotation for ticks (xticks for vertical, yticks for horizontal plots) + fontsize : int, default None + Font size for xticks and yticks + colormap : str or matplotlib colormap object, default None + Colormap to select colors from. If string, load colormap with that name + from matplotlib. + colorbar : boolean, optional + If True, plot colorbar (only relevant for 'scatter' and 'hexbin' plots) + position : float + Specify relative alignments for bar plot layout. + From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center) + layout : tuple (optional) + (rows, columns) for the layout of the plot + table : boolean, Series or DataFrame, default False + If True, draw a table using the data in the DataFrame and the data will + be transposed to meet matplotlib's default layout. + If a Series or DataFrame is passed, use passed data to draw a table. + yerr : DataFrame, Series, array-like, dict and str + See :ref:`Plotting with Error Bars ` for + detail. + xerr : same types as yerr. + %(klass_unique)s + mark_right : boolean, default True + When using a secondary_y axis, automatically mark the column + labels with "(right)" in the legend + kwds : keywords + Options to pass to matplotlib plotting method + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + + Notes + ----- + + - See matplotlib documentation online for more on this subject + - If `kind` = 'bar' or 'barh', you can specify relative alignments + for bar plot layout by `position` keyword. + From 0 (left/bottom-end) to 1 (right/top-end). 
Default is 0.5 (center) + %(klass_note)s + + """ + + +@Appender(_shared_docs['plot'] % _shared_doc_df_kwargs) +def plot_frame(data, x=None, y=None, kind='line', ax=None, + subplots=False, sharex=None, sharey=False, layout=None, + figsize=None, use_index=True, title=None, grid=None, + legend=True, style=None, logx=False, logy=False, loglog=False, + xticks=None, yticks=None, xlim=None, ylim=None, + rot=None, fontsize=None, colormap=None, table=False, + yerr=None, xerr=None, + secondary_y=False, sort_columns=False, + **kwds): + return _plot(data, kind=kind, x=x, y=y, ax=ax, + subplots=subplots, sharex=sharex, sharey=sharey, + layout=layout, figsize=figsize, use_index=use_index, + title=title, grid=grid, legend=legend, + style=style, logx=logx, logy=logy, loglog=loglog, + xticks=xticks, yticks=yticks, xlim=xlim, ylim=ylim, + rot=rot, fontsize=fontsize, colormap=colormap, table=table, + yerr=yerr, xerr=xerr, + secondary_y=secondary_y, sort_columns=sort_columns, + **kwds) + + +@Appender(_shared_docs['plot'] % _shared_doc_series_kwargs) +def plot_series(data, kind='line', ax=None, # Series unique + figsize=None, use_index=True, title=None, grid=None, + legend=False, style=None, logx=False, logy=False, loglog=False, + xticks=None, yticks=None, xlim=None, ylim=None, + rot=None, fontsize=None, colormap=None, table=False, + yerr=None, xerr=None, + label=None, secondary_y=False, # Series unique + **kwds): + + import matplotlib.pyplot as plt + """ + If no axes is specified, check whether there are existing figures + If there is no existing figures, _gca() will + create a figure with the default figsize, causing the figsize=parameter to + be ignored. + """ + if ax is None and len(plt.get_fignums()) > 0: + ax = _gca() + ax = MPLPlot._get_ax_layer(ax) + return _plot(data, kind=kind, ax=ax, + figsize=figsize, use_index=use_index, title=title, + grid=grid, legend=legend, + style=style, logx=logx, logy=logy, loglog=loglog, + xticks=xticks, yticks=yticks, xlim=xlim, ylim=ylim, + rot=rot, fontsize=fontsize, colormap=colormap, table=table, + yerr=yerr, xerr=xerr, + label=label, secondary_y=secondary_y, + **kwds) + + +_shared_docs['boxplot'] = """ + Make a box plot from DataFrame column optionally grouped by some columns or + other inputs + + Parameters + ---------- + data : the pandas object holding the data + column : column name or list of names, or vector + Can be any valid input to groupby + by : string or sequence + Column in the DataFrame to group by + ax : Matplotlib axes object, optional + fontsize : int or string + rot : label rotation angle + figsize : A tuple (width, height) in inches + grid : Setting this to True will show the grid + layout : tuple (optional) + (rows, columns) for the layout of the plot + return_type : {None, 'axes', 'dict', 'both'}, default None + The kind of object to return. The default is ``axes`` + 'axes' returns the matplotlib axes the boxplot is drawn on; + 'dict' returns a dictionary whose values are the matplotlib + Lines of the boxplot; + 'both' returns a namedtuple with the axes and dict. + + When grouping with ``by``, a Series mapping columns to ``return_type`` + is returned, unless ``return_type`` is None, in which case a NumPy + array of axes is returned with the same shape as ``layout``. + See the prose documentation for more. 
+ + kwds : other plotting keyword arguments to be passed to matplotlib boxplot + function + + Returns + ------- + lines : dict + ax : matplotlib Axes + (ax, lines): namedtuple + + Notes + ----- + Use ``return_type='dict'`` when you want to tweak the appearance + of the lines after plotting. In this case a dict containing the Lines + making up the boxes, caps, fliers, medians, and whiskers is returned. + """ + + +@Appender(_shared_docs['boxplot'] % _shared_doc_kwargs) +def boxplot(data, column=None, by=None, ax=None, fontsize=None, + rot=0, grid=True, figsize=None, layout=None, return_type=None, + **kwds): + + # validate return_type: + if return_type not in BoxPlot._valid_return_types: + raise ValueError("return_type must be {'axes', 'dict', 'both'}") + + from pandas import Series, DataFrame + if isinstance(data, Series): + data = DataFrame({'x': data}) + column = 'x' + + def _get_colors(): + return _get_standard_colors(color=kwds.get('color'), num_colors=1) + + def maybe_color_bp(bp): + if 'color' not in kwds: + from matplotlib.artist import setp + setp(bp['boxes'], color=colors[0], alpha=1) + setp(bp['whiskers'], color=colors[0], alpha=1) + setp(bp['medians'], color=colors[2], alpha=1) + + def plot_group(keys, values, ax): + keys = [pprint_thing(x) for x in keys] + values = [remove_na(v) for v in values] + bp = ax.boxplot(values, **kwds) + if fontsize is not None: + ax.tick_params(axis='both', labelsize=fontsize) + if kwds.get('vert', 1): + ax.set_xticklabels(keys, rotation=rot) + else: + ax.set_yticklabels(keys, rotation=rot) + maybe_color_bp(bp) + + # Return axes in multiplot case, maybe revisit later # 985 + if return_type == 'dict': + return bp + elif return_type == 'both': + return BoxPlot.BP(ax=ax, lines=bp) + else: + return ax + + colors = _get_colors() + if column is None: + columns = None + else: + if isinstance(column, (list, tuple)): + columns = column + else: + columns = [column] + + if by is not None: + # Prefer array return type for 2-D plots to match the subplot layout + # https://github.com/pandas-dev/pandas/pull/12216#issuecomment-241175580 + result = _grouped_plot_by_column(plot_group, data, columns=columns, + by=by, grid=grid, figsize=figsize, + ax=ax, layout=layout, + return_type=return_type) + else: + if return_type is None: + return_type = 'axes' + if layout is not None: + raise ValueError("The 'layout' keyword is not supported when " + "'by' is None") + + if ax is None: + ax = _gca() + data = data._get_numeric_data() + if columns is None: + columns = data.columns + else: + data = data[columns] + + result = plot_group(columns, data.values.T, ax) + ax.grid(grid) + + return result + + +def scatter_plot(data, x, y, by=None, ax=None, figsize=None, grid=False, + **kwargs): + """ + Make a scatter plot from two DataFrame columns + + Parameters + ---------- + data : DataFrame + x : Column name for the x-axis values + y : Column name for the y-axis values + ax : Matplotlib axis object + figsize : A tuple (width, height) in inches + grid : Setting this to True will show the grid + kwargs : other plotting keyword arguments + To be passed to scatter function + + Returns + ------- + fig : matplotlib.Figure + """ + import matplotlib.pyplot as plt + + kwargs.setdefault('edgecolors', 'none') + + def plot_group(group, ax): + xvals = group[x].values + yvals = group[y].values + ax.scatter(xvals, yvals, **kwargs) + ax.grid(grid) + + if by is not None: + fig = _grouped_plot(plot_group, data, by=by, figsize=figsize, ax=ax) + else: + if ax is None: + fig = plt.figure() + ax = 
fig.add_subplot(111) + else: + fig = ax.get_figure() + plot_group(data, ax) + ax.set_ylabel(pprint_thing(y)) + ax.set_xlabel(pprint_thing(x)) + + ax.grid(grid) + + return fig + + +def hist_frame(data, column=None, by=None, grid=True, xlabelsize=None, + xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, + sharey=False, figsize=None, layout=None, bins=10, **kwds): + """ + Draw histogram of the DataFrame's series using matplotlib / pylab. + + Parameters + ---------- + data : DataFrame + column : string or sequence + If passed, will be used to limit data to a subset of columns + by : object, optional + If passed, then used to form histograms for separate groups + grid : boolean, default True + Whether to show axis grid lines + xlabelsize : int, default None + If specified changes the x-axis label size + xrot : float, default None + rotation of x axis labels + ylabelsize : int, default None + If specified changes the y-axis label size + yrot : float, default None + rotation of y axis labels + ax : matplotlib axes object, default None + sharex : boolean, default True if ax is None else False + In case subplots=True, share x axis and set some x axis labels to + invisible; defaults to True if ax is None otherwise False if an ax + is passed in; Be aware, that passing in both an ax and sharex=True + will alter all x axis labels for all subplots in a figure! + sharey : boolean, default False + In case subplots=True, share y axis and set some y axis labels to + invisible + figsize : tuple + The size of the figure to create in inches by default + layout : tuple, optional + Tuple of (rows, columns) for the layout of the histograms + bins : integer, default 10 + Number of histogram bins to be used + kwds : other plotting keyword arguments + To be passed to hist function + """ + + if by is not None: + axes = grouped_hist(data, column=column, by=by, ax=ax, grid=grid, + figsize=figsize, sharex=sharex, sharey=sharey, + layout=layout, bins=bins, xlabelsize=xlabelsize, + xrot=xrot, ylabelsize=ylabelsize, + yrot=yrot, **kwds) + return axes + + if column is not None: + if not isinstance(column, (list, np.ndarray, Index)): + column = [column] + data = data[column] + data = data._get_numeric_data() + naxes = len(data.columns) + + fig, axes = _subplots(naxes=naxes, ax=ax, squeeze=False, + sharex=sharex, sharey=sharey, figsize=figsize, + layout=layout) + _axes = _flatten(axes) + + for i, col in enumerate(_try_sort(data.columns)): + ax = _axes[i] + ax.hist(data[col].dropna().values, bins=bins, **kwds) + ax.set_title(col) + ax.grid(grid) + + _set_ticks_props(axes, xlabelsize=xlabelsize, xrot=xrot, + ylabelsize=ylabelsize, yrot=yrot) + fig.subplots_adjust(wspace=0.3, hspace=0.3) + + return axes + + +def hist_series(self, by=None, ax=None, grid=True, xlabelsize=None, + xrot=None, ylabelsize=None, yrot=None, figsize=None, + bins=10, **kwds): + """ + Draw histogram of the input series using matplotlib + + Parameters + ---------- + by : object, optional + If passed, then used to form histograms for separate groups + ax : matplotlib axis object + If not passed, uses gca() + grid : boolean, default True + Whether to show axis grid lines + xlabelsize : int, default None + If specified changes the x-axis label size + xrot : float, default None + rotation of x axis labels + ylabelsize : int, default None + If specified changes the y-axis label size + yrot : float, default None + rotation of y axis labels + figsize : tuple, default None + figure size in inches by default + bins: integer, default 10 + Number of 
histogram bins to be used + kwds : keywords + To be passed to the actual plotting function + + Notes + ----- + See matplotlib documentation online for more on this + + """ + import matplotlib.pyplot as plt + + if by is None: + if kwds.get('layout', None) is not None: + raise ValueError("The 'layout' keyword is not supported when " + "'by' is None") + # hack until the plotting interface is a bit more unified + fig = kwds.pop('figure', plt.gcf() if plt.get_fignums() else + plt.figure(figsize=figsize)) + if (figsize is not None and tuple(figsize) != + tuple(fig.get_size_inches())): + fig.set_size_inches(*figsize, forward=True) + if ax is None: + ax = fig.gca() + elif ax.get_figure() != fig: + raise AssertionError('passed axis not bound to passed figure') + values = self.dropna().values + + ax.hist(values, bins=bins, **kwds) + ax.grid(grid) + axes = np.array([ax]) + + _set_ticks_props(axes, xlabelsize=xlabelsize, xrot=xrot, + ylabelsize=ylabelsize, yrot=yrot) + + else: + if 'figure' in kwds: + raise ValueError("Cannot pass 'figure' when using the " + "'by' argument, since a new 'Figure' instance " + "will be created") + axes = grouped_hist(self, by=by, ax=ax, grid=grid, figsize=figsize, + bins=bins, xlabelsize=xlabelsize, xrot=xrot, + ylabelsize=ylabelsize, yrot=yrot, **kwds) + + if hasattr(axes, 'ndim'): + if axes.ndim == 1 and len(axes) == 1: + return axes[0] + return axes + + +def grouped_hist(data, column=None, by=None, ax=None, bins=50, figsize=None, + layout=None, sharex=False, sharey=False, rot=90, grid=True, + xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, + **kwargs): + """ + Grouped histogram + + Parameters + ---------- + data: Series/DataFrame + column: object, optional + by: object, optional + ax: axes, optional + bins: int, default 50 + figsize: tuple, optional + layout: optional + sharex: boolean, default False + sharey: boolean, default False + rot: int, default 90 + grid: bool, default True + kwargs: dict, keyword arguments passed to matplotlib.Axes.hist + + Returns + ------- + axes: collection of Matplotlib Axes + """ + def plot_group(group, ax): + ax.hist(group.dropna().values, bins=bins, **kwargs) + + xrot = xrot or rot + + fig, axes = _grouped_plot(plot_group, data, column=column, + by=by, sharex=sharex, sharey=sharey, ax=ax, + figsize=figsize, layout=layout, rot=rot) + + _set_ticks_props(axes, xlabelsize=xlabelsize, xrot=xrot, + ylabelsize=ylabelsize, yrot=yrot) + + fig.subplots_adjust(bottom=0.15, top=0.9, left=0.1, right=0.9, + hspace=0.5, wspace=0.3) + return axes + + +def boxplot_frame_groupby(grouped, subplots=True, column=None, fontsize=None, + rot=0, grid=True, ax=None, figsize=None, + layout=None, **kwds): + """ + Make box plots from DataFrameGroupBy data. 
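+
+    One box plot is drawn per group: with ``subplots=True`` each group
+    key gets its own axes (titled with the key); otherwise the groups
+    are combined into a single frame and drawn on one axes.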
+ + Parameters + ---------- + grouped : Grouped DataFrame + subplots : + * ``False`` - no subplots will be used + * ``True`` - create a subplot for each group + column : column name or list of names, or vector + Can be any valid input to groupby + fontsize : int or string + rot : label rotation angle + grid : Setting this to True will show the grid + ax : Matplotlib axis object, default None + figsize : A tuple (width, height) in inches + layout : tuple (optional) + (rows, columns) for the layout of the plot + kwds : other plotting keyword arguments to be passed to matplotlib boxplot + function + + Returns + ------- + dict of key/value = group key/DataFrame.boxplot return value + or DataFrame.boxplot return value in case subplots=figures=False + + Examples + -------- + >>> import pandas + >>> import numpy as np + >>> import itertools + >>> + >>> tuples = [t for t in itertools.product(range(1000), range(4))] + >>> index = pandas.MultiIndex.from_tuples(tuples, names=['lvl0', 'lvl1']) + >>> data = np.random.randn(len(index),4) + >>> df = pandas.DataFrame(data, columns=list('ABCD'), index=index) + >>> + >>> grouped = df.groupby(level='lvl1') + >>> boxplot_frame_groupby(grouped) + >>> + >>> grouped = df.unstack(level='lvl1').groupby(level=0, axis=1) + >>> boxplot_frame_groupby(grouped, subplots=False) + """ + if subplots is True: + naxes = len(grouped) + fig, axes = _subplots(naxes=naxes, squeeze=False, + ax=ax, sharex=False, sharey=True, + figsize=figsize, layout=layout) + axes = _flatten(axes) + + ret = Series() + for (key, group), ax in zip(grouped, axes): + d = group.boxplot(ax=ax, column=column, fontsize=fontsize, + rot=rot, grid=grid, **kwds) + ax.set_title(pprint_thing(key)) + ret.loc[key] = d + fig.subplots_adjust(bottom=0.15, top=0.9, left=0.1, + right=0.9, wspace=0.2) + else: + from pandas.core.reshape.concat import concat + keys, frames = zip(*grouped) + if grouped.axis == 0: + df = concat(frames, keys=keys, axis=1) + else: + if len(frames) > 1: + df = frames[0].join(frames[1::]) + else: + df = frames[0] + ret = df.boxplot(column=column, fontsize=fontsize, rot=rot, + grid=grid, ax=ax, figsize=figsize, + layout=layout, **kwds) + return ret + + +def _grouped_plot(plotf, data, column=None, by=None, numeric_only=True, + figsize=None, sharex=True, sharey=True, layout=None, + rot=0, ax=None, **kwargs): + from pandas import DataFrame + + if figsize == 'default': + # allowed to specify mpl default with 'default' + warnings.warn("figsize='default' is deprecated. 
Specify figure" + "size by tuple instead", FutureWarning, stacklevel=4) + figsize = None + + grouped = data.groupby(by) + if column is not None: + grouped = grouped[column] + + naxes = len(grouped) + fig, axes = _subplots(naxes=naxes, figsize=figsize, + sharex=sharex, sharey=sharey, ax=ax, + layout=layout) + + _axes = _flatten(axes) + + for i, (key, group) in enumerate(grouped): + ax = _axes[i] + if numeric_only and isinstance(group, DataFrame): + group = group._get_numeric_data() + plotf(group, ax, **kwargs) + ax.set_title(pprint_thing(key)) + + return fig, axes + + +def _grouped_plot_by_column(plotf, data, columns=None, by=None, + numeric_only=True, grid=False, + figsize=None, ax=None, layout=None, + return_type=None, **kwargs): + grouped = data.groupby(by) + if columns is None: + if not isinstance(by, (list, tuple)): + by = [by] + columns = data._get_numeric_data().columns.difference(by) + naxes = len(columns) + fig, axes = _subplots(naxes=naxes, sharex=True, sharey=True, + figsize=figsize, ax=ax, layout=layout) + + _axes = _flatten(axes) + + result = Series() + ax_values = [] + + for i, col in enumerate(columns): + ax = _axes[i] + gp_col = grouped[col] + keys, values = zip(*gp_col) + re_plotf = plotf(keys, values, ax, **kwargs) + ax.set_title(col) + ax.set_xlabel(pprint_thing(by)) + ax_values.append(re_plotf) + ax.grid(grid) + + result = Series(ax_values, index=columns) + + # Return axes in multiplot case, maybe revisit later # 985 + if return_type is None: + result = axes + + byline = by[0] if len(by) == 1 else by + fig.suptitle('Boxplot grouped by %s' % byline) + fig.subplots_adjust(bottom=0.15, top=0.9, left=0.1, right=0.9, wspace=0.2) + + return result + + +class BasePlotMethods(PandasObject): + + def __init__(self, data): + self._data = data + + def __call__(self, *args, **kwargs): + raise NotImplementedError + + +class SeriesPlotMethods(BasePlotMethods): + """Series plotting accessor and method + + Examples + -------- + >>> s.plot.line() + >>> s.plot.bar() + >>> s.plot.hist() + + Plotting methods can also be accessed by calling the accessor as a method + with the ``kind`` argument: + ``s.plot(kind='line')`` is equivalent to ``s.plot.line()`` + """ + + def __call__(self, kind='line', ax=None, + figsize=None, use_index=True, title=None, grid=None, + legend=False, style=None, logx=False, logy=False, + loglog=False, xticks=None, yticks=None, + xlim=None, ylim=None, + rot=None, fontsize=None, colormap=None, table=False, + yerr=None, xerr=None, + label=None, secondary_y=False, **kwds): + return plot_series(self._data, kind=kind, ax=ax, figsize=figsize, + use_index=use_index, title=title, grid=grid, + legend=legend, style=style, logx=logx, logy=logy, + loglog=loglog, xticks=xticks, yticks=yticks, + xlim=xlim, ylim=ylim, rot=rot, fontsize=fontsize, + colormap=colormap, table=table, yerr=yerr, + xerr=xerr, label=label, secondary_y=secondary_y, + **kwds) + __call__.__doc__ = plot_series.__doc__ + + def line(self, **kwds): + """ + Line plot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='line', **kwds) + + def bar(self, **kwds): + """ + Vertical bar plot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. 
+ + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='bar', **kwds) + + def barh(self, **kwds): + """ + Horizontal bar plot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='barh', **kwds) + + def box(self, **kwds): + """ + Boxplot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='box', **kwds) + + def hist(self, bins=10, **kwds): + """ + Histogram + + .. versionadded:: 0.17.0 + + Parameters + ---------- + bins: integer, default 10 + Number of histogram bins to be used + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='hist', bins=bins, **kwds) + + def kde(self, **kwds): + """ + Kernel Density Estimate plot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='kde', **kwds) + + density = kde + + def area(self, **kwds): + """ + Area plot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='area', **kwds) + + def pie(self, **kwds): + """ + Pie chart + + .. versionadded:: 0.17.0 + + Parameters + ---------- + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='pie', **kwds) + + +class FramePlotMethods(BasePlotMethods): + """DataFrame plotting accessor and method + + Examples + -------- + >>> df.plot.line() + >>> df.plot.scatter('x', 'y') + >>> df.plot.hexbin() + + These plotting methods can also be accessed by calling the accessor as a + method with the ``kind`` argument: + ``df.plot(kind='line')`` is equivalent to ``df.plot.line()`` + """ + + def __call__(self, x=None, y=None, kind='line', ax=None, + subplots=False, sharex=None, sharey=False, layout=None, + figsize=None, use_index=True, title=None, grid=None, + legend=True, style=None, logx=False, logy=False, loglog=False, + xticks=None, yticks=None, xlim=None, ylim=None, + rot=None, fontsize=None, colormap=None, table=False, + yerr=None, xerr=None, + secondary_y=False, sort_columns=False, **kwds): + return plot_frame(self._data, kind=kind, x=x, y=y, ax=ax, + subplots=subplots, sharex=sharex, sharey=sharey, + layout=layout, figsize=figsize, use_index=use_index, + title=title, grid=grid, legend=legend, style=style, + logx=logx, logy=logy, loglog=loglog, xticks=xticks, + yticks=yticks, xlim=xlim, ylim=ylim, rot=rot, + fontsize=fontsize, colormap=colormap, table=table, + yerr=yerr, xerr=xerr, secondary_y=secondary_y, + sort_columns=sort_columns, **kwds) + __call__.__doc__ = plot_frame.__doc__ + + def line(self, x=None, y=None, **kwds): + """ + Line plot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + x, y : label or position, optional + Coordinates for each point. 
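+            If neither is given, the index is used for x and each
+            column of the frame is drawn as a separate line.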
+ **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='line', x=x, y=y, **kwds) + + def bar(self, x=None, y=None, **kwds): + """ + Vertical bar plot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + x, y : label or position, optional + Coordinates for each point. + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='bar', x=x, y=y, **kwds) + + def barh(self, x=None, y=None, **kwds): + """ + Horizontal bar plot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + x, y : label or position, optional + Coordinates for each point. + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='barh', x=x, y=y, **kwds) + + def box(self, by=None, **kwds): + """ + Boxplot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + by : string or sequence + Column in the DataFrame to group by. + \*\*kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='box', by=by, **kwds) + + def hist(self, by=None, bins=10, **kwds): + """ + Histogram + + .. versionadded:: 0.17.0 + + Parameters + ---------- + by : string or sequence + Column in the DataFrame to group by. + bins: integer, default 10 + Number of histogram bins to be used + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='hist', by=by, bins=bins, **kwds) + + def kde(self, **kwds): + """ + Kernel Density Estimate plot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='kde', **kwds) + + density = kde + + def area(self, x=None, y=None, **kwds): + """ + Area plot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + x, y : label or position, optional + Coordinates for each point. + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='area', x=x, y=y, **kwds) + + def pie(self, y=None, **kwds): + """ + Pie chart + + .. versionadded:: 0.17.0 + + Parameters + ---------- + y : label or position, optional + Column to plot. + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='pie', y=y, **kwds) + + def scatter(self, x, y, s=None, c=None, **kwds): + """ + Scatter plot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + x, y : label or position, optional + Coordinates for each point. + s : scalar or array_like, optional + Size of each point. + c : label or position, optional + Color of each point. + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. 
+ + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + return self(kind='scatter', x=x, y=y, c=c, s=s, **kwds) + + def hexbin(self, x, y, C=None, reduce_C_function=None, gridsize=None, + **kwds): + """ + Hexbin plot + + .. versionadded:: 0.17.0 + + Parameters + ---------- + x, y : label or position, optional + Coordinates for each point. + C : label or position, optional + The value at each `(x, y)` point. + reduce_C_function : callable, optional + Function of one argument that reduces all the values in a bin to + a single number (e.g. `mean`, `max`, `sum`, `std`). + gridsize : int, optional + Number of bins. + **kwds : optional + Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. + + Returns + ------- + axes : matplotlib.AxesSubplot or np.array of them + """ + if reduce_C_function is not None: + kwds['reduce_C_function'] = reduce_C_function + if gridsize is not None: + kwds['gridsize'] = gridsize + return self(kind='hexbin', x=x, y=y, C=C, **kwds) diff --git a/pandas/plotting/_misc.py b/pandas/plotting/_misc.py new file mode 100644 index 0000000000000..20ada033c0f58 --- /dev/null +++ b/pandas/plotting/_misc.py @@ -0,0 +1,573 @@ +# being a bit too dynamic +# pylint: disable=E1101 +from __future__ import division + +import numpy as np + +from pandas.util._decorators import deprecate_kwarg +from pandas.core.dtypes.missing import notnull +from pandas.compat import range, lrange, lmap, zip +from pandas.io.formats.printing import pprint_thing + + +from pandas.plotting._style import _get_standard_colors +from pandas.plotting._tools import _subplots, _set_ticks_props + + +def scatter_matrix(frame, alpha=0.5, figsize=None, ax=None, grid=False, + diagonal='hist', marker='.', density_kwds=None, + hist_kwds=None, range_padding=0.05, **kwds): + """ + Draw a matrix of scatter plots. + + Parameters + ---------- + frame : DataFrame + alpha : float, optional + amount of transparency applied + figsize : (float,float), optional + a tuple (width, height) in inches + ax : Matplotlib axis object, optional + grid : bool, optional + setting this to True will show the grid + diagonal : {'hist', 'kde'} + pick between 'kde' and 'hist' for + either Kernel Density Estimation or Histogram + plot in the diagonal + marker : str, optional + Matplotlib marker type, default '.' + hist_kwds : other plotting keyword arguments + To be passed to hist function + density_kwds : other plotting keyword arguments + To be passed to kernel density estimate plot + range_padding : float, optional + relative extension of axis range in x and y + with respect to (x_max - x_min) or (y_max - y_min), + default 0.05 + kwds : other plotting keyword arguments + To be passed to scatter function + + Examples + -------- + >>> df = DataFrame(np.random.randn(1000, 4), columns=['A','B','C','D']) + >>> scatter_matrix(df, alpha=0.2) + """ + + df = frame._get_numeric_data() + n = df.columns.size + naxes = n * n + fig, axes = _subplots(naxes=naxes, figsize=figsize, ax=ax, + squeeze=False) + + # no gaps between subplots + fig.subplots_adjust(wspace=0, hspace=0) + + mask = notnull(df) + + marker = _get_marker_compat(marker) + + hist_kwds = hist_kwds or {} + density_kwds = density_kwds or {} + + # GH 14855 + kwds.setdefault('edgecolors', 'none') + + boundaries_list = [] + for a in df.columns: + values = df[a].values[mask[a].values] + rmin_, rmax_ = np.min(values), np.max(values) + rdelta_ext = (rmax_ - rmin_) * range_padding / 2. 
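+        # e.g. range_padding=0.05 on a column spanning [0, 10] gives
+        # rdelta_ext = 0.25 and axis limits of (-0.25, 10.25)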
+ boundaries_list.append((rmin_ - rdelta_ext, rmax_ + rdelta_ext)) + + for i, a in zip(lrange(n), df.columns): + for j, b in zip(lrange(n), df.columns): + ax = axes[i, j] + + if i == j: + values = df[a].values[mask[a].values] + + # Deal with the diagonal by drawing a histogram there. + if diagonal == 'hist': + ax.hist(values, **hist_kwds) + + elif diagonal in ('kde', 'density'): + from scipy.stats import gaussian_kde + y = values + gkde = gaussian_kde(y) + ind = np.linspace(y.min(), y.max(), 1000) + ax.plot(ind, gkde.evaluate(ind), **density_kwds) + + ax.set_xlim(boundaries_list[i]) + + else: + common = (mask[a] & mask[b]).values + + ax.scatter(df[b][common], df[a][common], + marker=marker, alpha=alpha, **kwds) + + ax.set_xlim(boundaries_list[j]) + ax.set_ylim(boundaries_list[i]) + + ax.set_xlabel(b) + ax.set_ylabel(a) + + if j != 0: + ax.yaxis.set_visible(False) + if i != n - 1: + ax.xaxis.set_visible(False) + + if len(df.columns) > 1: + lim1 = boundaries_list[0] + locs = axes[0][1].yaxis.get_majorticklocs() + locs = locs[(lim1[0] <= locs) & (locs <= lim1[1])] + adj = (locs - lim1[0]) / (lim1[1] - lim1[0]) + + lim0 = axes[0][0].get_ylim() + adj = adj * (lim0[1] - lim0[0]) + lim0[0] + axes[0][0].yaxis.set_ticks(adj) + + if np.all(locs == locs.astype(int)): + # if all ticks are int + locs = locs.astype(int) + axes[0][0].yaxis.set_ticklabels(locs) + + _set_ticks_props(axes, xlabelsize=8, xrot=90, ylabelsize=8, yrot=0) + + return axes + + +def _get_marker_compat(marker): + import matplotlib.lines as mlines + import matplotlib as mpl + if mpl.__version__ < '1.1.0' and marker == '.': + return 'o' + if marker not in mlines.lineMarkers: + return 'o' + return marker + + +def radviz(frame, class_column, ax=None, color=None, colormap=None, **kwds): + """RadViz - a multivariate data visualization algorithm + + Parameters: + ----------- + frame: DataFrame + class_column: str + Column name containing class names + ax: Matplotlib axis object, optional + color: list or tuple, optional + Colors to use for the different classes + colormap : str or matplotlib colormap object, default None + Colormap to select colors from. If string, load colormap with that name + from matplotlib. 
+ kwds: keywords + Options to pass to matplotlib scatter plotting method + + Returns: + -------- + ax: Matplotlib axis object + """ + import matplotlib.pyplot as plt + import matplotlib.patches as patches + + def normalize(series): + a = min(series) + b = max(series) + return (series - a) / (b - a) + + n = len(frame) + classes = frame[class_column].drop_duplicates() + class_col = frame[class_column] + df = frame.drop(class_column, axis=1).apply(normalize) + + if ax is None: + ax = plt.gca(xlim=[-1, 1], ylim=[-1, 1]) + + to_plot = {} + colors = _get_standard_colors(num_colors=len(classes), colormap=colormap, + color_type='random', color=color) + + for kls in classes: + to_plot[kls] = [[], []] + + m = len(frame.columns) - 1 + s = np.array([(np.cos(t), np.sin(t)) + for t in [2.0 * np.pi * (i / float(m)) + for i in range(m)]]) + + for i in range(n): + row = df.iloc[i].values + row_ = np.repeat(np.expand_dims(row, axis=1), 2, axis=1) + y = (s * row_).sum(axis=0) / row.sum() + kls = class_col.iat[i] + to_plot[kls][0].append(y[0]) + to_plot[kls][1].append(y[1]) + + for i, kls in enumerate(classes): + ax.scatter(to_plot[kls][0], to_plot[kls][1], color=colors[i], + label=pprint_thing(kls), **kwds) + ax.legend() + + ax.add_patch(patches.Circle((0.0, 0.0), radius=1.0, facecolor='none')) + + for xy, name in zip(s, df.columns): + + ax.add_patch(patches.Circle(xy, radius=0.025, facecolor='gray')) + + if xy[0] < 0.0 and xy[1] < 0.0: + ax.text(xy[0] - 0.025, xy[1] - 0.025, name, + ha='right', va='top', size='small') + elif xy[0] < 0.0 and xy[1] >= 0.0: + ax.text(xy[0] - 0.025, xy[1] + 0.025, name, + ha='right', va='bottom', size='small') + elif xy[0] >= 0.0 and xy[1] < 0.0: + ax.text(xy[0] + 0.025, xy[1] - 0.025, name, + ha='left', va='top', size='small') + elif xy[0] >= 0.0 and xy[1] >= 0.0: + ax.text(xy[0] + 0.025, xy[1] + 0.025, name, + ha='left', va='bottom', size='small') + + ax.axis('equal') + return ax + + +@deprecate_kwarg(old_arg_name='data', new_arg_name='frame') +def andrews_curves(frame, class_column, ax=None, samples=200, color=None, + colormap=None, **kwds): + """ + Generates a matplotlib plot of Andrews curves, for visualising clusters of + multivariate data. + + Andrews curves have the functional form: + + f(t) = x_1/sqrt(2) + x_2 sin(t) + x_3 cos(t) + + x_4 sin(2t) + x_5 cos(2t) + ... + + Where x coefficients correspond to the values of each dimension and t is + linearly spaced between -pi and +pi. Each row of frame then corresponds to + a single curve. + + Parameters: + ----------- + frame : DataFrame + Data to be plotted, preferably normalized to (0.0, 1.0) + class_column : Name of the column containing class names + ax : matplotlib axes object, default None + samples : Number of points to plot in each curve + color: list or tuple, optional + Colors to use for the different classes + colormap : str or matplotlib colormap object, default None + Colormap to select colors from. If string, load colormap with that name + from matplotlib. + kwds: keywords + Options to pass to matplotlib plotting method + + Returns: + -------- + ax: Matplotlib axis object + + """ + from math import sqrt, pi + import matplotlib.pyplot as plt + + def function(amplitudes): + def f(t): + x1 = amplitudes[0] + result = x1 / sqrt(2.0) + + # Take the rest of the coefficients and resize them + # appropriately. Take a copy of amplitudes as otherwise numpy + # deletes the element from amplitudes itself. 
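+            # For a row (x1, ..., x5) the reshape below yields
+            # [[x2, x3], [x4, x5]]; with harmonics [1, 2] this
+            # reproduces f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t)
+            #                   + x4 sin(2t) + x5 cos(2t)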
+ coeffs = np.delete(np.copy(amplitudes), 0) + coeffs.resize(int((coeffs.size + 1) / 2), 2) + + # Generate the harmonics and arguments for the sin and cos + # functions. + harmonics = np.arange(0, coeffs.shape[0]) + 1 + trig_args = np.outer(harmonics, t) + + result += np.sum(coeffs[:, 0, np.newaxis] * np.sin(trig_args) + + coeffs[:, 1, np.newaxis] * np.cos(trig_args), + axis=0) + return result + return f + + n = len(frame) + class_col = frame[class_column] + classes = frame[class_column].drop_duplicates() + df = frame.drop(class_column, axis=1) + t = np.linspace(-pi, pi, samples) + used_legends = set([]) + + color_values = _get_standard_colors(num_colors=len(classes), + colormap=colormap, color_type='random', + color=color) + colors = dict(zip(classes, color_values)) + if ax is None: + ax = plt.gca(xlim=(-pi, pi)) + for i in range(n): + row = df.iloc[i].values + f = function(row) + y = f(t) + kls = class_col.iat[i] + label = pprint_thing(kls) + if label not in used_legends: + used_legends.add(label) + ax.plot(t, y, color=colors[kls], label=label, **kwds) + else: + ax.plot(t, y, color=colors[kls], **kwds) + + ax.legend(loc='upper right') + ax.grid() + return ax + + +def bootstrap_plot(series, fig=None, size=50, samples=500, **kwds): + """Bootstrap plot. + + Parameters: + ----------- + series: Time series + fig: matplotlib figure object, optional + size: number of data points to consider during each sampling + samples: number of times the bootstrap procedure is performed + kwds: optional keyword arguments for plotting commands, must be accepted + by both hist and plot + + Returns: + -------- + fig: matplotlib figure + """ + import random + import matplotlib.pyplot as plt + + # random.sample(ndarray, int) fails on python 3.3, sigh + data = list(series.values) + samplings = [random.sample(data, size) for _ in range(samples)] + + means = np.array([np.mean(sampling) for sampling in samplings]) + medians = np.array([np.median(sampling) for sampling in samplings]) + midranges = np.array([(min(sampling) + max(sampling)) * 0.5 + for sampling in samplings]) + if fig is None: + fig = plt.figure() + x = lrange(samples) + axes = [] + ax1 = fig.add_subplot(2, 3, 1) + ax1.set_xlabel("Sample") + axes.append(ax1) + ax1.plot(x, means, **kwds) + ax2 = fig.add_subplot(2, 3, 2) + ax2.set_xlabel("Sample") + axes.append(ax2) + ax2.plot(x, medians, **kwds) + ax3 = fig.add_subplot(2, 3, 3) + ax3.set_xlabel("Sample") + axes.append(ax3) + ax3.plot(x, midranges, **kwds) + ax4 = fig.add_subplot(2, 3, 4) + ax4.set_xlabel("Mean") + axes.append(ax4) + ax4.hist(means, **kwds) + ax5 = fig.add_subplot(2, 3, 5) + ax5.set_xlabel("Median") + axes.append(ax5) + ax5.hist(medians, **kwds) + ax6 = fig.add_subplot(2, 3, 6) + ax6.set_xlabel("Midrange") + axes.append(ax6) + ax6.hist(midranges, **kwds) + for axis in axes: + plt.setp(axis.get_xticklabels(), fontsize=8) + plt.setp(axis.get_yticklabels(), fontsize=8) + return fig + + +@deprecate_kwarg(old_arg_name='colors', new_arg_name='color') +@deprecate_kwarg(old_arg_name='data', new_arg_name='frame', stacklevel=3) +def parallel_coordinates(frame, class_column, cols=None, ax=None, color=None, + use_columns=False, xticks=None, colormap=None, + axvlines=True, axvlines_kwds=None, sort_labels=False, + **kwds): + """Parallel coordinates plotting. 
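+
+    Each row of ``frame`` is drawn as one polyline across a shared set
+    of vertical axes, one axis per column, colored by the row's value
+    in ``class_column``.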
+ + Parameters + ---------- + frame: DataFrame + class_column: str + Column name containing class names + cols: list, optional + A list of column names to use + ax: matplotlib.axis, optional + matplotlib axis object + color: list or tuple, optional + Colors to use for the different classes + use_columns: bool, optional + If true, columns will be used as xticks + xticks: list or tuple, optional + A list of values to use for xticks + colormap: str or matplotlib colormap, default None + Colormap to use for line colors. + axvlines: bool, optional + If true, vertical lines will be added at each xtick + axvlines_kwds: keywords, optional + Options to be passed to axvline method for vertical lines + sort_labels: bool, False + Sort class_column labels, useful when assigning colours + + .. versionadded:: 0.20.0 + + kwds: keywords + Options to pass to matplotlib plotting method + + Returns + ------- + ax: matplotlib axis object + + Examples + -------- + >>> from pandas import read_csv + >>> from pandas.tools.plotting import parallel_coordinates + >>> from matplotlib import pyplot as plt + >>> df = read_csv('https://raw.github.com/pandas-dev/pandas/master' + '/pandas/tests/data/iris.csv') + >>> parallel_coordinates(df, 'Name', color=('#556270', + '#4ECDC4', '#C7F464')) + >>> plt.show() + """ + if axvlines_kwds is None: + axvlines_kwds = {'linewidth': 1, 'color': 'black'} + import matplotlib.pyplot as plt + + n = len(frame) + classes = frame[class_column].drop_duplicates() + class_col = frame[class_column] + + if cols is None: + df = frame.drop(class_column, axis=1) + else: + df = frame[cols] + + used_legends = set([]) + + ncols = len(df.columns) + + # determine values to use for xticks + if use_columns is True: + if not np.all(np.isreal(list(df.columns))): + raise ValueError('Columns must be numeric to be used as xticks') + x = df.columns + elif xticks is not None: + if not np.all(np.isreal(xticks)): + raise ValueError('xticks specified must be numeric') + elif len(xticks) != ncols: + raise ValueError('Length of xticks must match number of columns') + x = xticks + else: + x = lrange(ncols) + + if ax is None: + ax = plt.gca() + + color_values = _get_standard_colors(num_colors=len(classes), + colormap=colormap, color_type='random', + color=color) + + if sort_labels: + classes = sorted(classes) + color_values = sorted(color_values) + colors = dict(zip(classes, color_values)) + + for i in range(n): + y = df.iloc[i].values + kls = class_col.iat[i] + label = pprint_thing(kls) + if label not in used_legends: + used_legends.add(label) + ax.plot(x, y, color=colors[kls], label=label, **kwds) + else: + ax.plot(x, y, color=colors[kls], **kwds) + + if axvlines: + for i in x: + ax.axvline(i, **axvlines_kwds) + + ax.set_xticks(x) + ax.set_xticklabels(df.columns) + ax.set_xlim(x[0], x[-1]) + ax.legend(loc='upper right') + ax.grid() + return ax + + +def lag_plot(series, lag=1, ax=None, **kwds): + """Lag plot for time series. 
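+
+    Plots ``y(t + lag)`` against ``y(t)``; visible structure in the
+    scatter suggests the series is not independent noise.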
+
+    Parameters
+    ----------
+    series : Time series
+    lag : lag of the scatter plot, default 1
+    ax : Matplotlib axis object, optional
+    kwds : Matplotlib scatter method keyword arguments, optional
+
+    Returns
+    -------
+    ax : Matplotlib axis object
+    """
+    import matplotlib.pyplot as plt
+
+    # workaround because `c='b'` is hardcoded in matplotlib's scatter method
+    kwds.setdefault('c', plt.rcParams['patch.facecolor'])
+
+    data = series.values
+    y1 = data[:-lag]
+    y2 = data[lag:]
+    if ax is None:
+        ax = plt.gca()
+    ax.set_xlabel("y(t)")
+    ax.set_ylabel("y(t + %s)" % lag)
+    ax.scatter(y1, y2, **kwds)
+    return ax
+
+
+def autocorrelation_plot(series, ax=None, **kwds):
+    """Autocorrelation plot for time series.
+
+    Parameters
+    ----------
+    series : Time series
+    ax : Matplotlib axis object, optional
+    kwds : keywords
+        Options to pass to matplotlib plotting method
+
+    Returns
+    -------
+    ax : Matplotlib axis object
+    """
+    import matplotlib.pyplot as plt
+    n = len(series)
+    data = np.asarray(series)
+    if ax is None:
+        ax = plt.gca(xlim=(1, n), ylim=(-1.0, 1.0))
+    mean = np.mean(data)
+    c0 = np.sum((data - mean) ** 2) / float(n)
+
+    def r(h):
+        return ((data[:n - h] - mean) *
+                (data[h:] - mean)).sum() / float(n) / c0
+    x = np.arange(n) + 1
+    y = lmap(r, x)
+    # two-sided z-scores for the 95% and 99% confidence bands
+    z95 = 1.959963984540054
+    z99 = 2.5758293035489004
+    ax.axhline(y=z99 / np.sqrt(n), linestyle='--', color='grey')
+    ax.axhline(y=z95 / np.sqrt(n), color='grey')
+    ax.axhline(y=0.0, color='black')
+    ax.axhline(y=-z95 / np.sqrt(n), color='grey')
+    ax.axhline(y=-z99 / np.sqrt(n), linestyle='--', color='grey')
+    ax.set_xlabel("Lag")
+    ax.set_ylabel("Autocorrelation")
+    ax.plot(x, y, **kwds)
+    if 'label' in kwds:
+        ax.legend()
+    ax.grid()
+    return ax
diff --git a/pandas/plotting/_style.py b/pandas/plotting/_style.py
new file mode 100644
index 0000000000000..8cb4e30e0d91c
--- /dev/null
+++ b/pandas/plotting/_style.py
@@ -0,0 +1,246 @@
+# being a bit too dynamic
+# pylint: disable=E1101
+from __future__ import division
+
+import warnings
+from contextlib import contextmanager
+import re
+
+import numpy as np
+
+from pandas.core.dtypes.common import is_list_like
+from pandas.compat import range, lrange, lmap
+import pandas.compat as compat
+from pandas.plotting._compat import _mpl_ge_2_0_0
+
+
+# Extracted from https://gist.github.com/huyng/816622
+# this is the rcParams set when setting display.with_mpl_style
+# to True.
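+# A minimal usage sketch (illustrative only): ``mpl_stylesheet`` is a plain
+# dict, so it can be applied with, e.g.,
+#
+#     import matplotlib
+#     matplotlib.rcParams.update(mpl_stylesheet)
+#
+# Note that some keys (e.g. 'axes.color_cycle') are only understood by
+# matplotlib < 2.0.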
+mpl_stylesheet = { + 'axes.axisbelow': True, + 'axes.color_cycle': ['#348ABD', + '#7A68A6', + '#A60628', + '#467821', + '#CF4457', + '#188487', + '#E24A33'], + 'axes.edgecolor': '#bcbcbc', + 'axes.facecolor': '#eeeeee', + 'axes.grid': True, + 'axes.labelcolor': '#555555', + 'axes.labelsize': 'large', + 'axes.linewidth': 1.0, + 'axes.titlesize': 'x-large', + 'figure.edgecolor': 'white', + 'figure.facecolor': 'white', + 'figure.figsize': (6.0, 4.0), + 'figure.subplot.hspace': 0.5, + 'font.family': 'monospace', + 'font.monospace': ['Andale Mono', + 'Nimbus Mono L', + 'Courier New', + 'Courier', + 'Fixed', + 'Terminal', + 'monospace'], + 'font.size': 10, + 'interactive': True, + 'keymap.all_axes': ['a'], + 'keymap.back': ['left', 'c', 'backspace'], + 'keymap.forward': ['right', 'v'], + 'keymap.fullscreen': ['f'], + 'keymap.grid': ['g'], + 'keymap.home': ['h', 'r', 'home'], + 'keymap.pan': ['p'], + 'keymap.save': ['s'], + 'keymap.xscale': ['L', 'k'], + 'keymap.yscale': ['l'], + 'keymap.zoom': ['o'], + 'legend.fancybox': True, + 'lines.antialiased': True, + 'lines.linewidth': 1.0, + 'patch.antialiased': True, + 'patch.edgecolor': '#EEEEEE', + 'patch.facecolor': '#348ABD', + 'patch.linewidth': 0.5, + 'toolbar': 'toolbar2', + 'xtick.color': '#555555', + 'xtick.direction': 'in', + 'xtick.major.pad': 6.0, + 'xtick.major.size': 0.0, + 'xtick.minor.pad': 6.0, + 'xtick.minor.size': 0.0, + 'ytick.color': '#555555', + 'ytick.direction': 'in', + 'ytick.major.pad': 6.0, + 'ytick.major.size': 0.0, + 'ytick.minor.pad': 6.0, + 'ytick.minor.size': 0.0 +} + + +def _get_standard_colors(num_colors=None, colormap=None, color_type='default', + color=None): + import matplotlib.pyplot as plt + + if color is None and colormap is not None: + if isinstance(colormap, compat.string_types): + import matplotlib.cm as cm + cmap = colormap + colormap = cm.get_cmap(colormap) + if colormap is None: + raise ValueError("Colormap {0} is not recognized".format(cmap)) + colors = lmap(colormap, np.linspace(0, 1, num=num_colors)) + elif color is not None: + if colormap is not None: + warnings.warn("'color' and 'colormap' cannot be used " + "simultaneously. 
Using 'color'") + colors = list(color) if is_list_like(color) else color + else: + if color_type == 'default': + # need to call list() on the result to copy so we don't + # modify the global rcParams below + try: + colors = [c['color'] + for c in list(plt.rcParams['axes.prop_cycle'])] + except KeyError: + colors = list(plt.rcParams.get('axes.color_cycle', + list('bgrcmyk'))) + if isinstance(colors, compat.string_types): + colors = list(colors) + elif color_type == 'random': + import random + + def random_color(column): + random.seed(column) + return [random.random() for _ in range(3)] + + colors = lmap(random_color, lrange(num_colors)) + else: + raise ValueError("color_type must be either 'default' or 'random'") + + if isinstance(colors, compat.string_types): + import matplotlib.colors + conv = matplotlib.colors.ColorConverter() + + def _maybe_valid_colors(colors): + try: + [conv.to_rgba(c) for c in colors] + return True + except ValueError: + return False + + # check whether the string can be convertable to single color + maybe_single_color = _maybe_valid_colors([colors]) + # check whether each character can be convertable to colors + maybe_color_cycle = _maybe_valid_colors(list(colors)) + if maybe_single_color and maybe_color_cycle and len(colors) > 1: + # Special case for single str 'CN' match and convert to hex + # for supporting matplotlib < 2.0.0 + if re.match(r'\AC[0-9]\Z', colors) and _mpl_ge_2_0_0(): + hex_color = [c['color'] + for c in list(plt.rcParams['axes.prop_cycle'])] + colors = [hex_color[int(colors[1])]] + else: + # this may no longer be required + msg = ("'{0}' can be parsed as both single color and " + "color cycle. Specify each color using a list " + "like ['{0}'] or {1}") + raise ValueError(msg.format(colors, list(colors))) + elif maybe_single_color: + colors = [colors] + else: + # ``colors`` is regarded as color cycle. + # mpl will raise error any of them is invalid + pass + + if len(colors) != num_colors: + try: + multiple = num_colors // len(colors) - 1 + except ZeroDivisionError: + raise ValueError("Invalid color argument: ''") + mod = num_colors % len(colors) + + colors += multiple * colors + colors += colors[:mod] + + return colors + + +class _Options(dict): + """ + Stores pandas plotting options. + Allows for parameter aliasing so you can just use parameter names that are + the same as the plot function parameters, but is stored in a canonical + format that makes it easy to breakdown into groups later + """ + + # alias so the names are same as plotting method parameter names + _ALIASES = {'x_compat': 'xaxis.compat'} + _DEFAULT_KEYS = ['xaxis.compat'] + + def __init__(self, deprecated=False): + self._deprecated = deprecated + # self['xaxis.compat'] = False + super(_Options, self).__setitem__('xaxis.compat', False) + + def _warn_if_deprecated(self): + if self._deprecated: + warnings.warn("'pandas.plot_params' is deprecated. 
Use " + "'pandas.plotting.plot_params' instead", + FutureWarning, stacklevel=3) + + def __getitem__(self, key): + self._warn_if_deprecated() + key = self._get_canonical_key(key) + if key not in self: + raise ValueError('%s is not a valid pandas plotting option' % key) + return super(_Options, self).__getitem__(key) + + def __setitem__(self, key, value): + self._warn_if_deprecated() + key = self._get_canonical_key(key) + return super(_Options, self).__setitem__(key, value) + + def __delitem__(self, key): + key = self._get_canonical_key(key) + if key in self._DEFAULT_KEYS: + raise ValueError('Cannot remove default parameter %s' % key) + return super(_Options, self).__delitem__(key) + + def __contains__(self, key): + key = self._get_canonical_key(key) + return super(_Options, self).__contains__(key) + + def reset(self): + """ + Reset the option store to its initial state + + Returns + ------- + None + """ + self._warn_if_deprecated() + self.__init__() + + def _get_canonical_key(self, key): + return self._ALIASES.get(key, key) + + @contextmanager + def use(self, key, value): + """ + Temporarily set a parameter value using the with statement. + Aliasing allowed. + """ + self._warn_if_deprecated() + old_value = self[key] + try: + self[key] = value + yield self + finally: + self[key] = old_value + + +plot_params = _Options() diff --git a/pandas/plotting/_timeseries.py b/pandas/plotting/_timeseries.py new file mode 100644 index 0000000000000..3d04973ed0009 --- /dev/null +++ b/pandas/plotting/_timeseries.py @@ -0,0 +1,339 @@ +# TODO: Use the fact that axis can have units to simplify the process + +import numpy as np + +from matplotlib import pylab +from pandas.core.indexes.period import Period +from pandas.tseries.offsets import DateOffset +import pandas.tseries.frequencies as frequencies +from pandas.core.indexes.datetimes import DatetimeIndex +from pandas.core.indexes.period import PeriodIndex +from pandas.core.indexes.timedeltas import TimedeltaIndex +from pandas.io.formats.printing import pprint_thing +import pandas.compat as compat + +from pandas.plotting._converter import (TimeSeries_DateLocator, + TimeSeries_DateFormatter, + TimeSeries_TimedeltaFormatter) + +# --------------------------------------------------------------------- +# Plotting functions and monkey patches + + +def tsplot(series, plotf, ax=None, **kwargs): + """ + Plots a Series on the given Matplotlib axes or the current axes + + Parameters + ---------- + axes : Axes + series : Series + + Notes + _____ + Supports same kwargs as Axes.plot + + """ + # Used inferred freq is possible, need a test case for inferred + if ax is None: + import matplotlib.pyplot as plt + ax = plt.gca() + + freq, series = _maybe_resample(series, ax, kwargs) + + # Set ax with freq info + _decorate_axes(ax, freq, kwargs) + ax._plot_data.append((series, plotf, kwargs)) + lines = plotf(ax, series.index._mpl_repr(), series.values, **kwargs) + + # set date formatter, locators and rescale limits + format_dateaxis(ax, ax.freq, series.index) + return lines + + +def _maybe_resample(series, ax, kwargs): + # resample against axes freq if necessary + freq, ax_freq = _get_freq(ax, series) + + if freq is None: # pragma: no cover + raise ValueError('Cannot use dynamic axis without frequency info') + + # Convert DatetimeIndex to PeriodIndex + if isinstance(series.index, DatetimeIndex): + series = series.to_period(freq=freq) + + if ax_freq is not None and freq != ax_freq: + if frequencies.is_superperiod(freq, ax_freq): # upsample input + series = series.copy() + 
+            series.index = series.index.asfreq(ax_freq, how='s')
+            freq = ax_freq
+        elif _is_sup(freq, ax_freq):  # one is weekly
+            how = kwargs.pop('how', 'last')
+            series = getattr(series.resample('D'), how)().dropna()
+            series = getattr(series.resample(ax_freq), how)().dropna()
+            freq = ax_freq
+        elif frequencies.is_subperiod(freq, ax_freq) or _is_sub(freq, ax_freq):
+            _upsample_others(ax, freq, kwargs)
+            ax_freq = freq
+        else:  # pragma: no cover
+            raise ValueError('Incompatible frequency conversion')
+    return freq, series
+
+
+def _is_sub(f1, f2):
+    return ((f1.startswith('W') and frequencies.is_subperiod('D', f2)) or
+            (f2.startswith('W') and frequencies.is_subperiod(f1, 'D')))
+
+
+def _is_sup(f1, f2):
+    return ((f1.startswith('W') and frequencies.is_superperiod('D', f2)) or
+            (f2.startswith('W') and frequencies.is_superperiod(f1, 'D')))
+
+
+def _upsample_others(ax, freq, kwargs):
+    legend = ax.get_legend()
+    lines, labels = _replot_ax(ax, freq, kwargs)
+
+    other_ax = None
+    if hasattr(ax, 'left_ax'):
+        other_ax = ax.left_ax
+    if hasattr(ax, 'right_ax'):
+        other_ax = ax.right_ax
+
+    if other_ax is not None:
+        rlines, rlabels = _replot_ax(other_ax, freq, kwargs)
+        lines.extend(rlines)
+        labels.extend(rlabels)
+
+    if (legend is not None and kwargs.get('legend', True) and
+            len(lines) > 0):
+        title = legend.get_title().get_text()
+        if title == 'None':
+            title = None
+        ax.legend(lines, labels, loc='best', title=title)
+
+
+def _replot_ax(ax, freq, kwargs):
+    data = getattr(ax, '_plot_data', None)
+
+    # clear current axes and data
+    ax._plot_data = []
+    ax.clear()
+
+    _decorate_axes(ax, freq, kwargs)
+
+    lines = []
+    labels = []
+    if data is not None:
+        for series, plotf, kwds in data:
+            series = series.copy()
+            idx = series.index.asfreq(freq, how='S')
+            series.index = idx
+            ax._plot_data.append((series, plotf, kwds))
+
+            # for tsplot
+            if isinstance(plotf, compat.string_types):
+                from pandas.plotting._core import _plot_klass
+                plotf = _plot_klass[plotf]._plot
+
+            lines.append(plotf(ax, series.index._mpl_repr(),
+                               series.values, **kwds)[0])
+            labels.append(pprint_thing(series.name))
+
+    return lines, labels
+
+
+def _decorate_axes(ax, freq, kwargs):
+    """Initialize axes for time-series plotting"""
+    if not hasattr(ax, '_plot_data'):
+        ax._plot_data = []
+
+    ax.freq = freq
+    xaxis = ax.get_xaxis()
+    xaxis.freq = freq
+    if not hasattr(ax, 'legendlabels'):
+        ax.legendlabels = [kwargs.get('label', None)]
+    else:
+        ax.legendlabels.append(kwargs.get('label', None))
+    ax.view_interval = None
+    ax.date_axis_info = None
+
+
+def _get_ax_freq(ax):
+    """
+    Get the freq attribute of the ax object if set.
+
+    Also checks shared axes (e.g. when using a secondary yaxis, sharex=True
+    or twinx)
+    """
+    ax_freq = getattr(ax, 'freq', None)
+    if ax_freq is None:
+        # check for left/right ax in case of secondary yaxis
+        if hasattr(ax, 'left_ax'):
+            ax_freq = getattr(ax.left_ax, 'freq', None)
+        elif hasattr(ax, 'right_ax'):
+            ax_freq = getattr(ax.right_ax, 'freq', None)
+    if ax_freq is None:
+        # check if a shared ax (sharex/twinx) already has its freq set
+        shared_axes = ax.get_shared_x_axes().get_siblings(ax)
+        if len(shared_axes) > 1:
+            for shared_ax in shared_axes:
+                ax_freq = getattr(shared_ax, 'freq', None)
+                if ax_freq is not None:
+                    break
+    return ax_freq
+
+
+def _get_freq(ax, series):
+    # get frequency from data
+    freq = getattr(series.index, 'freq', None)
+    if freq is None:
+        freq = getattr(series.index, 'inferred_freq', None)
+
+    ax_freq = _get_ax_freq(ax)
+
+    # use axes freq if no data freq
+    if freq is None:
+        freq = ax_freq
+
+    # get the period frequency
+    if isinstance(freq, DateOffset):
+        freq = freq.rule_code
+    else:
+        freq = frequencies.get_base_alias(freq)
+
+    freq = frequencies.get_period_alias(freq)
+    return freq, ax_freq
+
+
+def _use_dynamic_x(ax, data):
+    freq = _get_index_freq(data)
+    ax_freq = _get_ax_freq(ax)
+
+    if freq is None:  # convert irregular if axes has freq info
+        freq = ax_freq
+    else:  # do not use tsplot if irregular was plotted first
+        if (ax_freq is None) and (len(ax.get_lines()) > 0):
+            return False
+
+    if freq is None:
+        return False
+
+    if isinstance(freq, DateOffset):
+        freq = freq.rule_code
+    else:
+        freq = frequencies.get_base_alias(freq)
+    freq = frequencies.get_period_alias(freq)
+
+    if freq is None:
+        return False
+
+    # hack this for 0.10.1, creating more technical debt...sigh
+    if isinstance(data.index, DatetimeIndex):
+        base = frequencies.get_freq(freq)
+        x = data.index
+        if (base <= frequencies.FreqGroup.FR_DAY):
+            return x[:1].is_normalized
+        return Period(x[0], freq).to_timestamp(tz=x.tz) == x[0]
+    return True
+
+
+def _get_index_freq(data):
+    freq = getattr(data.index, 'freq', None)
+    if freq is None:
+        freq = getattr(data.index, 'inferred_freq', None)
+        if freq == 'B':
+            weekdays = np.unique(data.index.dayofweek)
+            if (5 in weekdays) or (6 in weekdays):
+                freq = None
+    return freq
+
+
+def _maybe_convert_index(ax, data):
+    # tsplot converts automatically, but don't want to convert index
+    # over and over for DataFrames
+    if isinstance(data.index, DatetimeIndex):
+        freq = getattr(data.index, 'freq', None)
+
+        if freq is None:
+            freq = getattr(data.index, 'inferred_freq', None)
+        if isinstance(freq, DateOffset):
+            freq = freq.rule_code
+
+        if freq is None:
+            freq = _get_ax_freq(ax)
+
+        if freq is None:
+            raise ValueError('Could not get frequency alias for plotting')
+
+        freq = frequencies.get_base_alias(freq)
+        freq = frequencies.get_period_alias(freq)
+
+        data = data.to_period(freq=freq)
+    return data
+
+
+# Patch methods for subplot. Only format_dateaxis is currently used.
+# Do we need the rest for convenience?
+
+def format_timedelta_ticks(x, pos, n_decimals):
+    """
+    Convert a tick value in nanoseconds to 'D days HH:MM:SS.F'
+    """
+    s, ns = divmod(x, 1e9)
+    m, s = divmod(s, 60)
+    h, m = divmod(m, 60)
+    d, h = divmod(h, 24)
+    decimals = int(ns * 10**(n_decimals - 9))
+    s = r'{:02d}:{:02d}:{:02d}'.format(int(h), int(m), int(s))
+    if n_decimals > 0:
+        s += '.{{:0{:0d}d}}'.format(n_decimals).format(decimals)
+    if d != 0:
+        s = '{:d} days '.format(int(d)) + s
+    return s
+
+
+def format_dateaxis(subplot, freq, index):
+    """
+    Pretty-formats the date axis (x-axis).
+
+    Major and minor ticks are automatically set for the frequency of the
+    current underlying series. As the dynamic mode is activated by
+    default, changing the limits of the x axis will intelligently change
+    the positions of the ticks.
+    """
+
+    # handle index specific formatting
+    # Note: DatetimeIndex does not use this
+    # interface. DatetimeIndex uses matplotlib.date directly
+    if isinstance(index, PeriodIndex):
+
+        majlocator = TimeSeries_DateLocator(freq, dynamic_mode=True,
+                                            minor_locator=False,
+                                            plot_obj=subplot)
+        minlocator = TimeSeries_DateLocator(freq, dynamic_mode=True,
+                                            minor_locator=True,
+                                            plot_obj=subplot)
+        subplot.xaxis.set_major_locator(majlocator)
+        subplot.xaxis.set_minor_locator(minlocator)
+
+        majformatter = TimeSeries_DateFormatter(freq, dynamic_mode=True,
+                                                minor_locator=False,
+                                                plot_obj=subplot)
+        minformatter = TimeSeries_DateFormatter(freq, dynamic_mode=True,
+                                                minor_locator=True,
+                                                plot_obj=subplot)
+        subplot.xaxis.set_major_formatter(majformatter)
+        subplot.xaxis.set_minor_formatter(minformatter)
+
+        # x and y coord info
+        subplot.format_coord = lambda t, y: (
+            "t = {0} y = {1:8f}".format(Period(ordinal=int(t), freq=freq), y))
+
+    elif isinstance(index, TimedeltaIndex):
+        subplot.xaxis.set_major_formatter(
+            TimeSeries_TimedeltaFormatter())
+    else:
+        raise TypeError('index type not supported')
+
+    pylab.draw_if_interactive()
diff --git a/pandas/plotting/_tools.py b/pandas/plotting/_tools.py
new file mode 100644
index 0000000000000..0c2314087525c
--- /dev/null
+++ b/pandas/plotting/_tools.py
@@ -0,0 +1,383 @@
+# being a bit too dynamic
+# pylint: disable=E1101
+from __future__ import division
+
+import warnings
+from math import ceil
+
+import numpy as np
+
+from pandas.core.dtypes.common import is_list_like
+from pandas.core.index import Index
+from pandas.core.series import Series
+from pandas.compat import range
+
+
+def format_date_labels(ax, rot):
+    # mini version of autofmt_xdate
+    try:
+        for label in ax.get_xticklabels():
+            label.set_ha('right')
+            label.set_rotation(rot)
+        fig = ax.get_figure()
+        fig.subplots_adjust(bottom=0.2)
+    except Exception:  # pragma: no cover
+        pass
+
+
+def table(ax, data, rowLabels=None, colLabels=None,
+          **kwargs):
+    """
+    Helper function to convert DataFrame and Series to matplotlib.table
+
+    Parameters
+    ----------
+    ax : Matplotlib axes object
+    data : DataFrame or Series
+        data for table contents
+    kwargs : keywords, optional
+        keyword arguments which are passed to matplotlib.table.table.
+        If `rowLabels` or `colLabels` is not specified, data index or column
+        name will be used.
+
+    Returns
+    -------
+    matplotlib table object
+    """
+    from pandas import DataFrame
+    if isinstance(data, Series):
+        data = DataFrame(data, columns=[data.name])
+    elif isinstance(data, DataFrame):
+        pass
+    else:
+        raise ValueError('Input data must be DataFrame or Series')
+
+    if rowLabels is None:
+        rowLabels = data.index
+
+    if colLabels is None:
+        colLabels = data.columns
+
+    cellText = data.values
+
+    import matplotlib.table
+    table = matplotlib.table.table(ax, cellText=cellText,
+                                   rowLabels=rowLabels,
+                                   colLabels=colLabels, **kwargs)
+    return table
+
+
+def _get_layout(nplots, layout=None, layout_type='box'):
+    if layout is not None:
+        if not isinstance(layout, (tuple, list)) or len(layout) != 2:
+            raise ValueError('Layout must be a tuple of (rows, columns)')
+
+        nrows, ncols = layout
+
+        # Python 2 compat
+        ceil_ = lambda x: int(ceil(x))
+        if nrows == -1 and ncols > 0:
+            layout = nrows, ncols = (ceil_(float(nplots) / ncols), ncols)
+        elif ncols == -1 and nrows > 0:
+            layout = nrows, ncols = (nrows, ceil_(float(nplots) / nrows))
+        elif ncols <= 0 and nrows <= 0:
+            msg = "At least one dimension of layout must be positive"
+            raise ValueError(msg)
+
+        if nrows * ncols < nplots:
+            raise ValueError('Layout of %sx%s must be larger than '
+                             'required size %s' % (nrows, ncols, nplots))
+
+        return layout
+
+    if layout_type == 'single':
+        return (1, 1)
+    elif layout_type == 'horizontal':
+        return (1, nplots)
+    elif layout_type == 'vertical':
+        return (nplots, 1)
+
+    layouts = {1: (1, 1), 2: (1, 2), 3: (2, 2), 4: (2, 2)}
+    try:
+        return layouts[nplots]
+    except KeyError:
+        k = 1
+        while k ** 2 < nplots:
+            k += 1
+
+        if (k - 1) * k >= nplots:
+            return k, (k - 1)
+        else:
+            return k, k
+
+# copied from matplotlib/pyplot.py and modified for pandas.plotting
+
+
+def _subplots(naxes=None, sharex=False, sharey=False, squeeze=True,
+              subplot_kw=None, ax=None, layout=None, layout_type='box',
+              **fig_kw):
+    """Create a figure with a set of subplots already made.
+
+    This utility wrapper makes it convenient to create common layouts of
+    subplots, including the enclosing figure object, in a single call.
+
+    Keyword arguments:
+
+    naxes : int
+        Number of required axes. Excess axes are set invisible. Default is
+        nrows * ncols.
+
+    sharex : bool
+        If True, the X axis will be shared amongst all subplots.
+
+    sharey : bool
+        If True, the Y axis will be shared amongst all subplots.
+
+    squeeze : bool
+
+        If True, extra dimensions are squeezed out from the returned axis
+        object:
+        - if only one subplot is constructed (nrows=ncols=1), the resulting
+          single Axes object is returned as a scalar.
+        - for Nx1 or 1xN subplots, a 1-d numpy object array of Axes objects
+          is returned.
+        - NxM subplots with N>1 and M>1 are returned as a 2-d array.
+
+        If False, no squeezing at all is done: the returned axis object is
+        always a 2-d array containing Axes instances, even if it ends up
+        being 1x1.
+
+    subplot_kw : dict
+        Dict with keywords passed to the add_subplot() call used to create
+        each subplot.
+
+    ax : Matplotlib axis object, optional
+
+    layout : tuple
+        Number of rows and columns of the subplot grid.
+        If not specified, calculated from naxes and layout_type
+
+    layout_type : {'box', 'horizontal', 'vertical'}, default 'box'
+        Specify how to lay out the subplot grid.
+
+    fig_kw : Other keyword arguments to be passed to the figure() call.
+        Note that all keywords not recognized above will be
+        automatically included here.
+
+    Returns:
+
+    fig, ax : tuple
+      - fig is the Matplotlib Figure object
+      - ax can be either a single axis object or an array of axis objects
+        if more than one subplot was created. The dimensions of the
+        resulting array can be controlled with the squeeze keyword, see
+        above.
+
+    **Examples:**
+
+    x = np.linspace(0, 2*np.pi, 400)
+    y = np.sin(x**2)
+
+    # Just a figure and one subplot
+    f, ax = plt.subplots()
+    ax.plot(x, y)
+    ax.set_title('Simple plot')
+
+    # Two subplots, unpack the output array immediately
+    f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
+    ax1.plot(x, y)
+    ax1.set_title('Sharing Y axis')
+    ax2.scatter(x, y)
+
+    # Four polar axes
+    plt.subplots(2, 2, subplot_kw=dict(polar=True))
+    """
+    import matplotlib.pyplot as plt
+
+    if subplot_kw is None:
+        subplot_kw = {}
+
+    if ax is None:
+        fig = plt.figure(**fig_kw)
+    else:
+        if is_list_like(ax):
+            ax = _flatten(ax)
+            if layout is not None:
+                warnings.warn("When passing multiple axes, layout keyword is "
+                              "ignored", UserWarning)
+            if sharex or sharey:
+                warnings.warn("When passing multiple axes, sharex and sharey "
+                              "are ignored. These settings must be specified "
+                              "when creating axes", UserWarning,
+                              stacklevel=4)
+            if len(ax) == naxes:
+                fig = ax[0].get_figure()
+                return fig, ax
+            else:
+                raise ValueError("The number of passed axes must be {0}, the "
+                                 "same as the output plot".format(naxes))
+
+        fig = ax.get_figure()
+        # if ax is passed and the number of subplots is 1, return ax as it is
+        if naxes == 1:
+            if squeeze:
+                return fig, ax
+            else:
+                return fig, _flatten(ax)
+        else:
+            warnings.warn("To output multiple subplots, the figure containing "
+                          "the passed axes is being cleared", UserWarning,
+                          stacklevel=4)
+            fig.clear()
+
+    nrows, ncols = _get_layout(naxes, layout=layout, layout_type=layout_type)
+    nplots = nrows * ncols
+
+    # Create an empty object array to hold all axes. It's easiest to make it
+    # 1-d so we can just append subplots upon creation, and then reshape it
+    # at the end.
+    axarr = np.empty(nplots, dtype=object)
+
+    # Create first subplot separately, so we can share it if requested
+    ax0 = fig.add_subplot(nrows, ncols, 1, **subplot_kw)
+
+    if sharex:
+        subplot_kw['sharex'] = ax0
+    if sharey:
+        subplot_kw['sharey'] = ax0
+    axarr[0] = ax0
+
+    # Note off-by-one counting because add_subplot uses the MATLAB 1-based
+    # convention.
+    for i in range(1, nplots):
+        kwds = subplot_kw.copy()
+        # Set sharex and sharey to None for blank/dummy axes, these can
+        # interfere with proper axis limits on the visible axes if
+        # they share axes e.g. issue #7528
+        if i >= naxes:
+            kwds['sharex'] = None
+            kwds['sharey'] = None
+        ax = fig.add_subplot(nrows, ncols, i + 1, **kwds)
+        axarr[i] = ax
+
+    if naxes != nplots:
+        for ax in axarr[naxes:]:
+            ax.set_visible(False)
+
+    _handle_shared_axes(axarr, nplots, naxes, nrows, ncols, sharex, sharey)
+
+    if squeeze:
+        # Reshape the array to have the final desired dimension (nrow,ncol),
+        # though discarding unneeded dimensions that equal 1. If we only have
+        # one subplot, just return it instead of a 1-element array.
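+        # (this mirrors the squeeze semantics of matplotlib.pyplot.subplots)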
+        if nplots == 1:
+            axes = axarr[0]
+        else:
+            axes = axarr.reshape(nrows, ncols).squeeze()
+    else:
+        # returned axis array will be always 2-d, even if nrows=ncols=1
+        axes = axarr.reshape(nrows, ncols)
+
+    return fig, axes
+
+
+def _remove_labels_from_axis(axis):
+    for t in axis.get_majorticklabels():
+        t.set_visible(False)
+
+    try:
+        # set_visible will not be effective if
+        # minor axis has NullLocator and NullFormatter (default)
+        import matplotlib.ticker as ticker
+        if isinstance(axis.get_minor_locator(), ticker.NullLocator):
+            axis.set_minor_locator(ticker.AutoLocator())
+        if isinstance(axis.get_minor_formatter(), ticker.NullFormatter):
+            axis.set_minor_formatter(ticker.FormatStrFormatter(''))
+        for t in axis.get_minorticklabels():
+            t.set_visible(False)
+    except Exception:  # pragma: no cover
+        raise
+    axis.get_label().set_visible(False)
+
+
+def _handle_shared_axes(axarr, nplots, naxes, nrows, ncols, sharex, sharey):
+    if nplots > 1:
+
+        if nrows > 1:
+            try:
+                # first find out the ax layout,
+                # so that we can correctly handle 'gaps'
+                layout = np.zeros((nrows + 1, ncols + 1), dtype=np.bool)
+                for ax in axarr:
+                    layout[ax.rowNum, ax.colNum] = ax.get_visible()
+
+                for ax in axarr:
+                    # only the last row of subplots should get x labels ->
+                    # all others off; the layout check handles the case that
+                    # the subplot is the last in its column, because there is
+                    # no subplot/gap below it.
+                    if not layout[ax.rowNum + 1, ax.colNum]:
+                        continue
+                    if sharex or len(ax.get_shared_x_axes()
+                                     .get_siblings(ax)) > 1:
+                        _remove_labels_from_axis(ax.xaxis)
+
+            except IndexError:
+                # if gridspec is used, ax.rowNum and ax.colNum may differ
+                # from the layout shape. in this case, use last_row logic
+                for ax in axarr:
+                    if ax.is_last_row():
+                        continue
+                    if sharex or len(ax.get_shared_x_axes()
+                                     .get_siblings(ax)) > 1:
+                        _remove_labels_from_axis(ax.xaxis)
+
+        if ncols > 1:
+            for ax in axarr:
+                # only the first column should get y labels -> set all
+                # others to off. As we only have labels in the first column
+                # and we always have a subplot there, we can skip the
+                # layout test
+                if ax.is_first_col():
+                    continue
+                if sharey or len(ax.get_shared_y_axes().get_siblings(ax)) > 1:
+                    _remove_labels_from_axis(ax.yaxis)
+
+
+def _flatten(axes):
+    if not is_list_like(axes):
+        return np.array([axes])
+    elif isinstance(axes, (np.ndarray, Index)):
+        return axes.ravel()
+    return np.array(axes)
+
+
+def _get_all_lines(ax):
+    lines = ax.get_lines()
+
+    if hasattr(ax, 'right_ax'):
+        lines += ax.right_ax.get_lines()
+
+    if hasattr(ax, 'left_ax'):
+        lines += ax.left_ax.get_lines()
+
+    return lines
+
+
+def _get_xlim(lines):
+    left, right = np.inf, -np.inf
+    for l in lines:
+        x = l.get_xdata(orig=False)
+        left = min(x[0], left)
+        right = max(x[-1], right)
+    return left, right
+
+
+def _set_ticks_props(axes, xlabelsize=None, xrot=None,
+                     ylabelsize=None, yrot=None):
+    import matplotlib.pyplot as plt
+
+    for ax in _flatten(axes):
+        if xlabelsize is not None:
+            plt.setp(ax.get_xticklabels(), fontsize=xlabelsize)
+        if xrot is not None:
+            plt.setp(ax.get_xticklabels(), rotation=xrot)
+        if ylabelsize is not None:
+            plt.setp(ax.get_yticklabels(), fontsize=ylabelsize)
+        if yrot is not None:
+            plt.setp(ax.get_yticklabels(), rotation=yrot)
+    return axes
diff --git a/pandas/rpy/__init__.py b/pandas/rpy/__init__.py
deleted file mode 100644
index b771a3d8374a3..0000000000000
--- a/pandas/rpy/__init__.py
+++ /dev/null
@@ -1,18 +0,0 @@
-
-# GH9602
-# deprecate rpy to instead directly use rpy2
-
-# flake8: noqa
-
-import warnings
-warnings.warn("The pandas.rpy module is 
deprecated and will be " - "removed in a future version. We refer to external packages " - "like rpy2. " - "\nSee here for a guide on how to port your code to rpy2: " - "http://pandas.pydata.org/pandas-docs/stable/r_interface.html", - FutureWarning, stacklevel=2) - -try: - from .common import importr, r, load_data -except ImportError: - pass diff --git a/pandas/rpy/base.py b/pandas/rpy/base.py deleted file mode 100644 index ac339dd366b0b..0000000000000 --- a/pandas/rpy/base.py +++ /dev/null @@ -1,14 +0,0 @@ -# flake8: noqa - -import pandas.rpy.util as util - - -class lm(object): - """ - Examples - -------- - >>> model = lm('x ~ y + z', data) - >>> model.coef - """ - def __init__(self, formula, data): - pass diff --git a/pandas/rpy/common.py b/pandas/rpy/common.py deleted file mode 100644 index 95a072154dc68..0000000000000 --- a/pandas/rpy/common.py +++ /dev/null @@ -1,372 +0,0 @@ -""" -Utilities for making working with rpy2 more user- and -developer-friendly. -""" - -# flake8: noqa - -from __future__ import print_function - -from distutils.version import LooseVersion -from pandas.compat import zip, range -import numpy as np - -import pandas as pd -import pandas.core.common as com -import pandas.util.testing as _test - -from rpy2.robjects.packages import importr -from rpy2.robjects import r -import rpy2.robjects as robj - -import itertools as IT - - -__all__ = ['convert_robj', 'load_data', 'convert_to_r_dataframe', - 'convert_to_r_matrix'] - - -def load_data(name, package=None, convert=True): - if package: - importr(package) - - r.data(name) - - robj = r[name] - - if convert: - return convert_robj(robj) - else: - return robj - - -def _rclass(obj): - """ - Return R class name for input object - """ - return r['class'](obj)[0] - - -def _is_null(obj): - return _rclass(obj) == 'NULL' - - -def _convert_list(obj): - """ - Convert named Vector to dict, factors to list - """ - try: - values = [convert_robj(x) for x in obj] - keys = r['names'](obj) - return dict(zip(keys, values)) - except TypeError: - # For state.division and state.region - factors = list(r['factor'](obj)) - level = list(r['levels'](obj)) - result = [level[index-1] for index in factors] - return result - - -def _convert_array(obj): - """ - Convert Array to DataFrame - """ - def _list(item): - try: - return list(item) - except TypeError: - return [] - - # For iris3, HairEyeColor, UCBAdmissions, Titanic - dim = list(obj.dim) - values = np.array(list(obj)) - names = r['dimnames'](obj) - try: - columns = list(r['names'](names))[::-1] - except TypeError: - columns = ['X{:d}'.format(i) for i in range(len(names))][::-1] - columns.append('value') - name_list = [(_list(x) or range(d)) for x, d in zip(names, dim)][::-1] - arr = np.array(list(IT.product(*name_list))) - arr = np.column_stack([arr,values]) - df = pd.DataFrame(arr, columns=columns) - return df - - -def _convert_vector(obj): - if isinstance(obj, robj.IntVector): - return _convert_int_vector(obj) - elif isinstance(obj, robj.StrVector): - return _convert_str_vector(obj) - # Check if the vector has extra information attached to it that can be used - # as an index - try: - attributes = set(r['attributes'](obj).names) - except AttributeError: - return list(obj) - if 'names' in attributes: - return pd.Series(list(obj), index=r['names'](obj)) - elif 'tsp' in attributes: - return pd.Series(list(obj), index=r['time'](obj)) - elif 'labels' in attributes: - return pd.Series(list(obj), index=r['labels'](obj)) - if _rclass(obj) == 'dist': - # For 'eurodist'. 
WARNING: This results in a DataFrame, not a Series or list. - matrix = r['as.matrix'](obj) - return convert_robj(matrix) - else: - return list(obj) - -NA_INTEGER = -2147483648 - - -def _convert_int_vector(obj): - arr = np.asarray(obj) - mask = arr == NA_INTEGER - if mask.any(): - arr = arr.astype(float) - arr[mask] = np.nan - return arr - - -def _convert_str_vector(obj): - arr = np.asarray(obj, dtype=object) - mask = arr == robj.NA_Character - if mask.any(): - arr[mask] = np.nan - return arr - - -def _convert_DataFrame(rdf): - columns = list(rdf.colnames) - rows = np.array(rdf.rownames) - - data = {} - for i, col in enumerate(columns): - vec = rdf.rx2(i + 1) - values = _convert_vector(vec) - - if isinstance(vec, robj.FactorVector): - levels = np.asarray(vec.levels) - if com.is_float_dtype(values): - mask = np.isnan(values) - notmask = -mask - result = np.empty(len(values), dtype=object) - result[mask] = np.nan - - locs = (values[notmask] - 1).astype(np.int_) - result[notmask] = levels.take(locs) - values = result - else: - values = np.asarray(vec.levels).take(values - 1) - - data[col] = values - - return pd.DataFrame(data, index=_check_int(rows), columns=columns) - - -def _convert_Matrix(mat): - columns = mat.colnames - rows = mat.rownames - - columns = None if _is_null(columns) else list(columns) - index = r['time'](mat) if _is_null(rows) else list(rows) - return pd.DataFrame(np.array(mat), index=_check_int(index), - columns=columns) - - -def _check_int(vec): - try: - # R observation numbers come through as strings - vec = vec.astype(int) - except Exception: - pass - - return vec - -_pandas_converters = [ - (robj.DataFrame, _convert_DataFrame), - (robj.Matrix, _convert_Matrix), - (robj.StrVector, _convert_vector), - (robj.FloatVector, _convert_vector), - (robj.Array, _convert_array), - (robj.Vector, _convert_list), -] - -_converters = [ - (robj.DataFrame, lambda x: _convert_DataFrame(x).toRecords(index=False)), - (robj.Matrix, lambda x: _convert_Matrix(x).toRecords(index=False)), - (robj.IntVector, _convert_vector), - (robj.StrVector, _convert_vector), - (robj.FloatVector, _convert_vector), - (robj.Array, _convert_array), - (robj.Vector, _convert_list), -] - - -def convert_robj(obj, use_pandas=True): - """ - Convert rpy2 object to a pandas-friendly form - - Parameters - ---------- - obj : rpy2 object - - Returns - ------- - Non-rpy data structure, mix of NumPy and pandas objects - """ - if not isinstance(obj, robj.RObjectMixin): - return obj - - converters = _pandas_converters if use_pandas else _converters - - for rpy_type, converter in converters: - if isinstance(obj, rpy_type): - return converter(obj) - - raise TypeError('Do not know what to do with %s object' % type(obj)) - - -def convert_to_r_posixct(obj): - """ - Convert DatetimeIndex or np.datetime array to R POSIXct using - m8[s] format. 
- - Parameters - ---------- - obj : source pandas object (one of [DatetimeIndex, np.datetime]) - - Returns - ------- - An R POSIXct vector (rpy2.robjects.vectors.POSIXct) - - """ - import time - from rpy2.rinterface import StrSexpVector - - # convert m8[ns] to m8[s] - vals = robj.vectors.FloatSexpVector(obj.values.view('i8') / 1E9) - as_posixct = robj.baseenv.get('as.POSIXct') - origin = StrSexpVector([time.strftime("%Y-%m-%d", - time.gmtime(0)), ]) - - # We will be sending ints as UTC - tz = obj.tz.zone if hasattr( - obj, 'tz') and hasattr(obj.tz, 'zone') else 'UTC' - tz = StrSexpVector([tz]) - utc_tz = StrSexpVector(['UTC']) - - posixct = as_posixct(vals, origin=origin, tz=utc_tz) - posixct.do_slot_assign('tzone', tz) - return posixct - - -VECTOR_TYPES = {np.float64: robj.FloatVector, - np.float32: robj.FloatVector, - np.float: robj.FloatVector, - np.int: robj.IntVector, - np.int32: robj.IntVector, - np.int64: robj.IntVector, - np.object_: robj.StrVector, - np.str: robj.StrVector, - np.bool: robj.BoolVector} - - -NA_TYPES = {np.float64: robj.NA_Real, - np.float32: robj.NA_Real, - np.float: robj.NA_Real, - np.int: robj.NA_Integer, - np.int32: robj.NA_Integer, - np.int64: robj.NA_Integer, - np.object_: robj.NA_Character, - np.str: robj.NA_Character, - np.bool: robj.NA_Logical} - - -if LooseVersion(np.__version__) >= LooseVersion('1.8'): - for dict_ in (VECTOR_TYPES, NA_TYPES): - dict_.update({ - np.bool_: dict_[np.bool], - np.int_: dict_[np.int], - np.float_: dict_[np.float], - np.string_: dict_[np.str] - }) - - -def convert_to_r_dataframe(df, strings_as_factors=False): - """ - Convert a pandas DataFrame to a R data.frame. - - Parameters - ---------- - df: The DataFrame being converted - strings_as_factors: Whether to turn strings into R factors (default: False) - - Returns - ------- - A R data.frame - - """ - - import rpy2.rlike.container as rlc - - columns = rlc.OrdDict() - - # FIXME: This doesn't handle MultiIndex - - for column in df: - value = df[column] - value_type = value.dtype.type - - if value_type == np.datetime64: - value = convert_to_r_posixct(value) - else: - value = [item if pd.notnull(item) else NA_TYPES[value_type] - for item in value] - - value = VECTOR_TYPES[value_type](value) - - if not strings_as_factors: - I = robj.baseenv.get("I") - value = I(value) - - columns[column] = value - - r_dataframe = robj.DataFrame(columns) - - del columns - - r_dataframe.rownames = robj.StrVector(df.index) - - return r_dataframe - - -def convert_to_r_matrix(df, strings_as_factors=False): - - """ - Convert a pandas DataFrame to a R matrix. 
- - Parameters - ---------- - df: The DataFrame being converted - strings_as_factors: Whether to turn strings into R factors (default: False) - - Returns - ------- - A R matrix - - """ - - if df._is_mixed_type: - raise TypeError("Conversion to matrix only possible with non-mixed " - "type DataFrames") - - r_dataframe = convert_to_r_dataframe(df, strings_as_factors) - as_matrix = robj.baseenv.get("as.matrix") - r_matrix = as_matrix(r_dataframe) - - return r_matrix - -if __name__ == '__main__': - pass diff --git a/pandas/rpy/mass.py b/pandas/rpy/mass.py deleted file mode 100644 index 12fbbdfa4dc98..0000000000000 --- a/pandas/rpy/mass.py +++ /dev/null @@ -1,2 +0,0 @@ -class rlm(object): - pass diff --git a/pandas/rpy/tests/test_common.py b/pandas/rpy/tests/test_common.py deleted file mode 100644 index c3f09e21b1545..0000000000000 --- a/pandas/rpy/tests/test_common.py +++ /dev/null @@ -1,216 +0,0 @@ -""" -Testing that functions from rpy work as expected -""" - -# flake8: noqa - -import pandas as pd -import numpy as np -import unittest -import nose -import warnings -import pandas.util.testing as tm - -try: - import pandas.rpy.common as com - from rpy2.robjects import r - import rpy2.robjects as robj -except ImportError: - raise nose.SkipTest('R not installed') - - -class TestCommon(unittest.TestCase): - def test_convert_list(self): - obj = r('list(a=1, b=2, c=3)') - - converted = com.convert_robj(obj) - expected = {'a': [1], 'b': [2], 'c': [3]} - - tm.assert_dict_equal(converted, expected) - - def test_convert_nested_list(self): - obj = r('list(a=list(foo=1, bar=2))') - - converted = com.convert_robj(obj) - expected = {'a': {'foo': [1], 'bar': [2]}} - - tm.assert_dict_equal(converted, expected) - - def test_convert_frame(self): - # built-in dataset - df = r['faithful'] - - converted = com.convert_robj(df) - - assert np.array_equal(converted.columns, ['eruptions', 'waiting']) - assert np.array_equal(converted.index, np.arange(1, 273)) - - def _test_matrix(self): - r('mat <- matrix(rnorm(9), ncol=3)') - r('colnames(mat) <- c("one", "two", "three")') - r('rownames(mat) <- c("a", "b", "c")') - - return r['mat'] - - def test_convert_matrix(self): - mat = self._test_matrix() - - converted = com.convert_robj(mat) - - assert np.array_equal(converted.index, ['a', 'b', 'c']) - assert np.array_equal(converted.columns, ['one', 'two', 'three']) - - def test_convert_r_dataframe(self): - - is_na = robj.baseenv.get("is.na") - - seriesd = tm.getSeriesData() - frame = pd.DataFrame(seriesd, columns=['D', 'C', 'B', 'A']) - - # Null data - frame["E"] = [np.nan for item in frame["A"]] - # Some mixed type data - frame["F"] = ["text" if item % - 2 == 0 else np.nan for item in range(30)] - - r_dataframe = com.convert_to_r_dataframe(frame) - - assert np.array_equal( - com.convert_robj(r_dataframe.rownames), frame.index) - assert np.array_equal( - com.convert_robj(r_dataframe.colnames), frame.columns) - assert all(is_na(item) for item in r_dataframe.rx2("E")) - - for column in frame[["A", "B", "C", "D"]]: - coldata = r_dataframe.rx2(column) - original_data = frame[column] - assert np.array_equal(com.convert_robj(coldata), original_data) - - for column in frame[["D", "E"]]: - for original, converted in zip(frame[column], - r_dataframe.rx2(column)): - - if pd.isnull(original): - assert is_na(converted) - else: - assert original == converted - - def test_convert_r_matrix(self): - - is_na = robj.baseenv.get("is.na") - - seriesd = tm.getSeriesData() - frame = pd.DataFrame(seriesd, columns=['D', 'C', 'B', 'A']) - # Null data - 
frame["E"] = [np.nan for item in frame["A"]] - - r_dataframe = com.convert_to_r_matrix(frame) - - assert np.array_equal( - com.convert_robj(r_dataframe.rownames), frame.index) - assert np.array_equal( - com.convert_robj(r_dataframe.colnames), frame.columns) - assert all(is_na(item) for item in r_dataframe.rx(True, "E")) - - for column in frame[["A", "B", "C", "D"]]: - coldata = r_dataframe.rx(True, column) - original_data = frame[column] - assert np.array_equal(com.convert_robj(coldata), - original_data) - - # Pandas bug 1282 - frame["F"] = ["text" if item % - 2 == 0 else np.nan for item in range(30)] - - try: - wrong_matrix = com.convert_to_r_matrix(frame) - except TypeError: - pass - except Exception: - raise - - def test_dist(self): - for name in ('eurodist',): - df = com.load_data(name) - dist = r[name] - labels = r['labels'](dist) - assert np.array_equal(df.index, labels) - assert np.array_equal(df.columns, labels) - - def test_timeseries(self): - """ - Test that the series has an informative index. - Unfortunately the code currently does not build a DateTimeIndex - """ - for name in ( - 'austres', 'co2', 'fdeaths', 'freeny.y', 'JohnsonJohnson', - 'ldeaths', 'mdeaths', 'nottem', 'presidents', 'sunspot.month', 'sunspots', - 'UKDriverDeaths', 'UKgas', 'USAccDeaths', - 'airmiles', 'discoveries', 'EuStockMarkets', - 'LakeHuron', 'lh', 'lynx', 'nhtemp', 'Nile', - 'Seatbelts', 'sunspot.year', 'treering', 'uspop'): - series = com.load_data(name) - ts = r[name] - assert np.array_equal(series.index, r['time'](ts)) - - def test_numeric(self): - for name in ('euro', 'islands', 'precip'): - series = com.load_data(name) - numeric = r[name] - names = numeric.names - assert np.array_equal(series.index, names) - - def test_table(self): - iris3 = pd.DataFrame({'X0': {0: '0', 1: '1', 2: '2', 3: '3', 4: '4'}, - 'X1': {0: 'Sepal L.', - 1: 'Sepal L.', - 2: 'Sepal L.', - 3: 'Sepal L.', - 4: 'Sepal L.'}, - 'X2': {0: 'Setosa', - 1: 'Setosa', - 2: 'Setosa', - 3: 'Setosa', - 4: 'Setosa'}, - 'value': {0: '5.1', 1: '4.9', 2: '4.7', 3: '4.6', 4: '5.0'}}) - hec = pd.DataFrame( - { - 'Eye': {0: 'Brown', 1: 'Brown', 2: 'Brown', 3: 'Brown', 4: 'Blue'}, - 'Hair': {0: 'Black', 1: 'Brown', 2: 'Red', 3: 'Blond', 4: 'Black'}, - 'Sex': {0: 'Male', 1: 'Male', 2: 'Male', 3: 'Male', 4: 'Male'}, - 'value': {0: '32.0', 1: '53.0', 2: '10.0', 3: '3.0', 4: '11.0'}}) - titanic = pd.DataFrame( - { - 'Age': {0: 'Child', 1: 'Child', 2: 'Child', 3: 'Child', 4: 'Child'}, - 'Class': {0: '1st', 1: '2nd', 2: '3rd', 3: 'Crew', 4: '1st'}, - 'Sex': {0: 'Male', 1: 'Male', 2: 'Male', 3: 'Male', 4: 'Female'}, - 'Survived': {0: 'No', 1: 'No', 2: 'No', 3: 'No', 4: 'No'}, - 'value': {0: '0.0', 1: '0.0', 2: '35.0', 3: '0.0', 4: '0.0'}}) - for name, expected in zip(('HairEyeColor', 'Titanic', 'iris3'), - (hec, titanic, iris3)): - df = com.load_data(name) - table = r[name] - names = r['dimnames'](table) - try: - columns = list(r['names'](names))[::-1] - except TypeError: - columns = ['X{:d}'.format(i) for i in range(len(names))][::-1] - columns.append('value') - assert np.array_equal(df.columns, columns) - result = df.head() - cond = ((result.sort(axis=1) == expected.sort(axis=1))).values - assert np.all(cond) - - def test_factor(self): - for name in ('state.division', 'state.region'): - vector = r[name] - factors = list(r['factor'](vector)) - level = list(r['levels'](vector)) - factors = [level[index - 1] for index in factors] - result = com.load_data(name) - assert np.equal(result, factors) - -if __name__ == '__main__': - 
nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - # '--with-coverage', '--cover-package=pandas.core'], - exit=False) diff --git a/pandas/rpy/vars.py b/pandas/rpy/vars.py deleted file mode 100644 index 2073b47483141..0000000000000 --- a/pandas/rpy/vars.py +++ /dev/null @@ -1,22 +0,0 @@ -# flake8: noqa - -import pandas.rpy.util as util - - -class VAR(object): - """ - - Parameters - ---------- - y : - p : - type : {"const", "trend", "both", "none"} - season : - exogen : - lag_max : - ic : {"AIC", "HQ", "SC", "FPE"} - Information criterion to use, if lag_max is not None - """ - def __init__(y, p=1, type="none", season=None, exogen=None, - lag_max=None, ic=None): - pass diff --git a/pandas/sparse/api.py b/pandas/sparse/api.py deleted file mode 100644 index 55841fbeffa2d..0000000000000 --- a/pandas/sparse/api.py +++ /dev/null @@ -1,6 +0,0 @@ -# pylint: disable=W0611 -# flake8: noqa -from pandas.sparse.array import SparseArray -from pandas.sparse.list import SparseList -from pandas.sparse.series import SparseSeries, SparseTimeSeries -from pandas.sparse.frame import SparseDataFrame diff --git a/pandas/src/hashtable_func_helper.pxi.in b/pandas/src/hashtable_func_helper.pxi.in deleted file mode 100644 index 1840b914f3328..0000000000000 --- a/pandas/src/hashtable_func_helper.pxi.in +++ /dev/null @@ -1,114 +0,0 @@ -""" -Template for each `dtype` helper function for hashtable - -WARNING: DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in -""" - -#---------------------------------------------------------------------- -# VectorData -#---------------------------------------------------------------------- - -{{py: - -# name -dtypes = ['float64', 'int64'] - -}} - -{{for dtype in dtypes}} - - -@cython.wraparound(False) -@cython.boundscheck(False) -cdef build_count_table_{{dtype}}({{dtype}}_t[:] values, - kh_{{dtype}}_t *table, bint dropna): - cdef: - khiter_t k - Py_ssize_t i, n = len(values) - {{dtype}}_t val - int ret = 0 - - with nogil: - kh_resize_{{dtype}}(table, n) - - for i in range(n): - val = values[i] - if val == val or not dropna: - k = kh_get_{{dtype}}(table, val) - if k != table.n_buckets: - table.vals[k] += 1 - else: - k = kh_put_{{dtype}}(table, val, &ret) - table.vals[k] = 1 - - -@cython.wraparound(False) -@cython.boundscheck(False) -cpdef value_count_{{dtype}}({{dtype}}_t[:] values, bint dropna): - cdef: - Py_ssize_t i=0 - kh_{{dtype}}_t *table - {{dtype}}_t[:] result_keys - int64_t[:] result_counts - int k - - table = kh_init_{{dtype}}() - build_count_table_{{dtype}}(values, table, dropna) - - result_keys = np.empty(table.n_occupied, dtype=np.{{dtype}}) - result_counts = np.zeros(table.n_occupied, dtype=np.int64) - - with nogil: - for k in range(table.n_buckets): - if kh_exist_{{dtype}}(table, k): - result_keys[i] = table.keys[k] - result_counts[i] = table.vals[k] - i += 1 - kh_destroy_{{dtype}}(table) - - return np.asarray(result_keys), np.asarray(result_counts) - - -@cython.wraparound(False) -@cython.boundscheck(False) -def duplicated_{{dtype}}({{dtype}}_t[:] values, - object keep='first'): - cdef: - int ret = 0, k - {{dtype}}_t value - Py_ssize_t i, n = len(values) - kh_{{dtype}}_t * table = kh_init_{{dtype}}() - ndarray[uint8_t, ndim=1, cast=True] out = np.empty(n, dtype='bool') - - kh_resize_{{dtype}}(table, min(n, _SIZE_HINT_LIMIT)) - - if keep not in ('last', 'first', False): - raise ValueError('keep must be either "first", "last" or False') - - if keep == 'last': - with nogil: - for i from n > i >=0: - kh_put_{{dtype}}(table, values[i], &ret) - 
out[i] = ret == 0 - elif keep == 'first': - with nogil: - for i from 0 <= i < n: - kh_put_{{dtype}}(table, values[i], &ret) - out[i] = ret == 0 - else: - with nogil: - for i from 0 <= i < n: - value = values[i] - k = kh_get_{{dtype}}(table, value) - if k != table.n_buckets: - out[table.vals[k]] = 1 - out[i] = 1 - else: - k = kh_put_{{dtype}}(table, value, &ret) - table.keys[k] = value - table.vals[k] = i - out[i] = 0 - kh_destroy_{{dtype}}(table) - return out - -{{endfor}} diff --git a/pandas/src/helper.h b/pandas/src/helper.h deleted file mode 100644 index b8c3cecbb2dc7..0000000000000 --- a/pandas/src/helper.h +++ /dev/null @@ -1,16 +0,0 @@ -#ifndef C_HELPER_H -#define C_HELPER_H - -#ifndef PANDAS_INLINE - #if defined(__GNUC__) - #define PANDAS_INLINE static __inline__ - #elif defined(_MSC_VER) - #define PANDAS_INLINE static __inline - #elif defined (__STDC_VERSION__) && __STDC_VERSION__ >= 199901L - #define PANDAS_INLINE static inline - #else - #define PANDAS_INLINE - #endif -#endif - -#endif diff --git a/pandas/src/joins_func_helper.pxi.in b/pandas/src/joins_func_helper.pxi.in deleted file mode 100644 index 33926a23f7f41..0000000000000 --- a/pandas/src/joins_func_helper.pxi.in +++ /dev/null @@ -1,165 +0,0 @@ -# cython: boundscheck=False, wraparound=False -""" -Template for each `dtype` helper function for hashtable - -WARNING: DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in -""" - -#---------------------------------------------------------------------- -# asof_join_by -#---------------------------------------------------------------------- - -{{py: - -# table_type, by_dtype -by_dtypes = [('PyObjectHashTable', 'object'), ('Int64HashTable', 'int64_t')] - -# on_dtype -on_dtypes = ['uint8_t', 'uint16_t', 'uint32_t', 'uint64_t', - 'int8_t', 'int16_t', 'int32_t', 'int64_t', - 'float', 'double'] - -}} - - -from hashtable cimport * - -{{for table_type, by_dtype in by_dtypes}} -{{for on_dtype in on_dtypes}} - - -def asof_join_{{on_dtype}}_by_{{by_dtype}}(ndarray[{{on_dtype}}] left_values, - ndarray[{{on_dtype}}] right_values, - ndarray[{{by_dtype}}] left_by_values, - ndarray[{{by_dtype}}] right_by_values, - bint allow_exact_matches=1, - tolerance=None): - - cdef: - Py_ssize_t left_pos, right_pos, left_size, right_size, found_right_pos - ndarray[int64_t] left_indexer, right_indexer - bint has_tolerance = 0 - {{on_dtype}} tolerance_ - {{table_type}} hash_table - {{by_dtype}} by_value - - # if we are using tolerance, set our objects - if tolerance is not None: - has_tolerance = 1 - tolerance_ = tolerance - - left_size = len(left_values) - right_size = len(right_values) - - left_indexer = np.empty(left_size, dtype=np.int64) - right_indexer = np.empty(left_size, dtype=np.int64) - - hash_table = {{table_type}}(right_size) - - right_pos = 0 - for left_pos in range(left_size): - # restart right_pos if it went negative in a previous iteration - if right_pos < 0: - right_pos = 0 - - # find last position in right whose value is less than left's value - if allow_exact_matches: - while right_pos < right_size and\ - right_values[right_pos] <= left_values[left_pos]: - hash_table.set_item(right_by_values[right_pos], right_pos) - right_pos += 1 - else: - while right_pos < right_size and\ - right_values[right_pos] < left_values[left_pos]: - hash_table.set_item(right_by_values[right_pos], right_pos) - right_pos += 1 - right_pos -= 1 - - # save positions as the desired index - by_value = left_by_values[left_pos] - found_right_pos = hash_table.get_item(by_value)\ - if by_value in hash_table else -1 - 
left_indexer[left_pos] = left_pos - right_indexer[left_pos] = found_right_pos - - # if needed, verify that tolerance is met - if has_tolerance and found_right_pos != -1: - diff = left_values[left_pos] - right_values[found_right_pos] - if diff > tolerance_: - right_indexer[left_pos] = -1 - - return left_indexer, right_indexer - -{{endfor}} -{{endfor}} - - -#---------------------------------------------------------------------- -# asof_join -#---------------------------------------------------------------------- - -{{py: - -# on_dtype -dtypes = ['uint8_t', 'uint16_t', 'uint32_t', 'uint64_t', - 'int8_t', 'int16_t', 'int32_t', 'int64_t', - 'float', 'double'] - -}} - -{{for on_dtype in dtypes}} - - -def asof_join_{{on_dtype}}(ndarray[{{on_dtype}}] left_values, - ndarray[{{on_dtype}}] right_values, - bint allow_exact_matches=1, - tolerance=None): - - cdef: - Py_ssize_t left_pos, right_pos, left_size, right_size - ndarray[int64_t] left_indexer, right_indexer - bint has_tolerance = 0 - {{on_dtype}} tolerance_ - - # if we are using tolerance, set our objects - if tolerance is not None: - has_tolerance = 1 - tolerance_ = tolerance - - left_size = len(left_values) - right_size = len(right_values) - - left_indexer = np.empty(left_size, dtype=np.int64) - right_indexer = np.empty(left_size, dtype=np.int64) - - right_pos = 0 - for left_pos in range(left_size): - # restart right_pos if it went negative in a previous iteration - if right_pos < 0: - right_pos = 0 - - # find last position in right whose value is less than left's value - if allow_exact_matches: - while right_pos < right_size and\ - right_values[right_pos] <= left_values[left_pos]: - right_pos += 1 - else: - while right_pos < right_size and\ - right_values[right_pos] < left_values[left_pos]: - right_pos += 1 - right_pos -= 1 - - # save positions as the desired index - left_indexer[left_pos] = left_pos - right_indexer[left_pos] = right_pos - - # if needed, verify that tolerance is met - if has_tolerance and right_pos != -1: - diff = left_values[left_pos] - right_values[right_pos] - if diff > tolerance_: - right_indexer[left_pos] = -1 - - return left_indexer, right_indexer - -{{endfor}} - diff --git a/pandas/src/numpy_helper.h b/pandas/src/numpy_helper.h deleted file mode 100644 index 9f406890c4e68..0000000000000 --- a/pandas/src/numpy_helper.h +++ /dev/null @@ -1,204 +0,0 @@ -#include "Python.h" -#include "numpy/arrayobject.h" -#include "numpy/arrayscalars.h" -#include "helper.h" - -#define PANDAS_FLOAT 0 -#define PANDAS_INT 1 -#define PANDAS_BOOL 2 -#define PANDAS_STRING 3 -#define PANDAS_OBJECT 4 -#define PANDAS_DATETIME 5 - -PANDAS_INLINE int -infer_type(PyObject* obj) { - if (PyBool_Check(obj)) { - return PANDAS_BOOL; - } - else if (PyArray_IsIntegerScalar(obj)) { - return PANDAS_INT; - } - else if (PyArray_IsScalar(obj, Datetime)) { - return PANDAS_DATETIME; - } - else if (PyFloat_Check(obj) || PyArray_IsScalar(obj, Floating)) { - return PANDAS_FLOAT; - } - else if (PyString_Check(obj) || PyUnicode_Check(obj)) { - return PANDAS_STRING; - } - else { - return PANDAS_OBJECT; - } -} - -PANDAS_INLINE npy_int64 -get_nat(void) { - return NPY_MIN_INT64; -} - -PANDAS_INLINE npy_datetime -get_datetime64_value(PyObject* obj) { - return ((PyDatetimeScalarObject*) obj)->obval; -} - -PANDAS_INLINE npy_timedelta -get_timedelta64_value(PyObject* obj) { - return ((PyTimedeltaScalarObject*) obj)->obval; -} - -PANDAS_INLINE int -is_integer_object(PyObject* obj) { - return (!PyBool_Check(obj)) && PyArray_IsIntegerScalar(obj); -// return 
PyArray_IsIntegerScalar(obj); -} - -PANDAS_INLINE int -is_float_object(PyObject* obj) { - return (PyFloat_Check(obj) || PyArray_IsScalar(obj, Floating)); -} -PANDAS_INLINE int -is_complex_object(PyObject* obj) { - return (PyComplex_Check(obj) || PyArray_IsScalar(obj, ComplexFloating)); -} - -PANDAS_INLINE int -is_bool_object(PyObject* obj) { - return (PyBool_Check(obj) || PyArray_IsScalar(obj, Bool)); -} - -PANDAS_INLINE int -is_string_object(PyObject* obj) { - return (PyString_Check(obj) || PyUnicode_Check(obj)); -} - -PANDAS_INLINE int -is_datetime64_object(PyObject *obj) { - return PyArray_IsScalar(obj, Datetime); -} - -PANDAS_INLINE int -is_timedelta64_object(PyObject *obj) { - return PyArray_IsScalar(obj, Timedelta); -} - -PANDAS_INLINE int -assign_value_1d(PyArrayObject* ap, Py_ssize_t _i, PyObject* v) { - npy_intp i = (npy_intp) _i; - char *item = (char *) PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0); - return PyArray_DESCR(ap)->f->setitem(v, item, ap); -} - -PANDAS_INLINE PyObject* -get_value_1d(PyArrayObject* ap, Py_ssize_t i) { - char *item = (char *) PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0); - return PyArray_Scalar(item, PyArray_DESCR(ap), (PyObject*) ap); -} - - -PANDAS_INLINE char* -get_c_string(PyObject* obj) { -#if PY_VERSION_HEX >= 0x03000000 - PyObject* enc_str = PyUnicode_AsEncodedString(obj, "utf-8", "error"); - - char *ret; - ret = PyBytes_AS_STRING(enc_str); - - // TODO: memory leak here - - // Py_XDECREF(enc_str); - return ret; -#else - return PyString_AsString(obj); -#endif -} - -PANDAS_INLINE PyObject* -char_to_string(char* data) { -#if PY_VERSION_HEX >= 0x03000000 - return PyUnicode_FromString(data); -#else - return PyString_FromString(data); -#endif -} - -// PANDAS_INLINE int -// is_string(PyObject* obj) { -// #if PY_VERSION_HEX >= 0x03000000 -// return PyUnicode_Check(obj); -// #else -// return PyString_Check(obj); -// #endif - -PyObject* sarr_from_data(PyArray_Descr *descr, int length, void* data) { - PyArrayObject *result; - npy_intp dims[1] = {length}; - Py_INCREF(descr); // newfromdescr steals a reference to descr - result = (PyArrayObject*) PyArray_NewFromDescr(&PyArray_Type, descr, 1, dims, - NULL, data, 0, NULL); - - // Returned array doesn't own data by default - result->flags |= NPY_OWNDATA; - - return (PyObject*) result; -} - - -void transfer_object_column(char *dst, char *src, size_t stride, - size_t length) { - int i; - size_t sz = sizeof(PyObject*); - - for (i = 0; i < length; ++i) - { - // uninitialized data - - // Py_XDECREF(*((PyObject**) dst)); - - memcpy(dst, src, sz); - Py_INCREF(*((PyObject**) dst)); - src += sz; - dst += stride; - } -} - -void set_array_owndata(PyArrayObject *ao) { - ao->flags |= NPY_OWNDATA; -} - -void set_array_not_contiguous(PyArrayObject *ao) { - ao->flags &= ~(NPY_C_CONTIGUOUS | NPY_F_CONTIGUOUS); -} - - -// If arr is zerodim array, return a proper array scalar (e.g. np.int64). -// Otherwise, return arr as is. 
-PANDAS_INLINE PyObject* -unbox_if_zerodim(PyObject* arr) { - if (PyArray_IsZeroDim(arr)) { - PyObject *ret; - ret = PyArray_ToScalar(PyArray_DATA(arr), arr); - return ret; - } else { - Py_INCREF(arr); - return arr; - } -} - - -// PANDAS_INLINE PyObject* -// get_base_ndarray(PyObject* ap) { -// // if (!ap || (NULL == ap)) { -// // Py_RETURN_NONE; -// // } - -// while (!PyArray_CheckExact(ap)) { -// ap = PyArray_BASE((PyArrayObject*) ap); -// if (ap == Py_None) Py_RETURN_NONE; -// } -// // PyArray_BASE is a borrowed reference -// if(ap) { -// Py_INCREF(ap); -// } -// return ap; -// } diff --git a/pandas/src/parser/io.c b/pandas/src/parser/io.c deleted file mode 100644 index 566de72804968..0000000000000 --- a/pandas/src/parser/io.c +++ /dev/null @@ -1,282 +0,0 @@ -#include "io.h" - - /* - On-disk FILE, uncompressed - */ - - -void *new_file_source(char *fname, size_t buffer_size) { - file_source *fs = (file_source *) malloc(sizeof(file_source)); - fs->fp = fopen(fname, "rb"); - - if (fs->fp == NULL) { - free(fs); - return NULL; - } - setbuf(fs->fp, NULL); - - fs->initial_file_pos = ftell(fs->fp); - - // Only allocate this heap memory if we are not memory-mapping the file - fs->buffer = (char*) malloc((buffer_size + 1) * sizeof(char)); - - if (fs->buffer == NULL) { - return NULL; - } - - memset(fs->buffer, 0, buffer_size + 1); - fs->buffer[buffer_size] = '\0'; - - return (void *) fs; -} - - -// XXX handle on systems without the capability - - -/* - * void *new_file_buffer(FILE *f, int buffer_size) - * - * Allocate a new file_buffer. - * Returns NULL if the memory allocation fails or if the call to mmap fails. - * - * buffer_size is ignored. - */ - - -void* new_rd_source(PyObject *obj) { - rd_source *rds = (rd_source *) malloc(sizeof(rd_source)); - - /* hold on to this object */ - Py_INCREF(obj); - rds->obj = obj; - rds->buffer = NULL; - rds->position = 0; - - return (void*) rds; -} - -/* - - Cleanup callbacks - - */ - -int del_file_source(void *fs) { - // fseek(FS(fs)->fp, FS(fs)->initial_file_pos, SEEK_SET); - if (fs == NULL) - return 0; - - /* allocated on the heap */ - free(FS(fs)->buffer); - fclose(FS(fs)->fp); - free(fs); - - return 0; -} - -int del_rd_source(void *rds) { - Py_XDECREF(RDS(rds)->obj); - Py_XDECREF(RDS(rds)->buffer); - free(rds); - - return 0; -} - -/* - - IO callbacks - - */ - - -void* buffer_file_bytes(void *source, size_t nbytes, - size_t *bytes_read, int *status) { - file_source *src = FS(source); - - *bytes_read = fread((void*) src->buffer, sizeof(char), nbytes, - src->fp); - - if (*bytes_read == 0) { - *status = REACHED_EOF; - } else { - *status = 0; - } - - return (void*) src->buffer; - -} - - -void* buffer_rd_bytes(void *source, size_t nbytes, - size_t *bytes_read, int *status) { - PyGILState_STATE state; - PyObject *result, *func, *args, *tmp; - - void *retval; - - size_t length; - rd_source *src = RDS(source); - state = PyGILState_Ensure(); - - /* delete old object */ - Py_XDECREF(src->buffer); - src->buffer = NULL; - args = Py_BuildValue("(i)", nbytes); - - func = PyObject_GetAttrString(src->obj, "read"); - /* printf("%s\n", PyBytes_AsString(PyObject_Repr(func))); */ - - /* TODO: does this release the GIL? 
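-       Most likely not by itself: PyObject_CallObject invokes read() while
-       we hold the GIL acquired above, though a file-like object may release
-       it internally around blocking I/O.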
*/ - result = PyObject_CallObject(func, args); - Py_XDECREF(args); - Py_XDECREF(func); - - /* PyObject_Print(PyObject_Type(result), stdout, 0); */ - if (result == NULL) { - PyGILState_Release(state); - *bytes_read = 0; - *status = CALLING_READ_FAILED; - return NULL; - } - else if (!PyBytes_Check(result)) { - tmp = PyUnicode_AsUTF8String(result); - Py_XDECREF(result); - result = tmp; - } - - length = PySequence_Length(result); - - if (length == 0) - *status = REACHED_EOF; - else - *status = 0; - - /* hang on to the Python object */ - src->buffer = result; - retval = (void*) PyBytes_AsString(result); - - - PyGILState_Release(state); - - /* TODO: more error handling */ - *bytes_read = length; - - return retval; -} - - -#ifdef HAVE_MMAP - -#include -#include - -void *new_mmap(char *fname) -{ - struct stat buf; - int fd; - memory_map *mm; - /* off_t position; */ - off_t filesize; - - mm = (memory_map *) malloc(sizeof(memory_map)); - mm->fp = fopen(fname, "rb"); - - fd = fileno(mm->fp); - if (fstat(fd, &buf) == -1) { - fprintf(stderr, "new_file_buffer: fstat() failed. errno =%d\n", errno); - return NULL; - } - filesize = buf.st_size; /* XXX This might be 32 bits. */ - - - if (mm == NULL) { - /* XXX Eventually remove this print statement. */ - fprintf(stderr, "new_file_buffer: malloc() failed.\n"); - return NULL; - } - mm->size = (off_t) filesize; - mm->line_number = 0; - - mm->fileno = fd; - mm->position = ftell(mm->fp); - mm->last_pos = (off_t) filesize; - - mm->memmap = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0); - if (mm->memmap == NULL) { - /* XXX Eventually remove this print statement. */ - fprintf(stderr, "new_file_buffer: mmap() failed.\n"); - free(mm); - mm = NULL; - } - - return (void*) mm; -} - - -int del_mmap(void *src) -{ - munmap(MM(src)->memmap, MM(src)->size); - - fclose(MM(src)->fp); - - /* - * With a memory mapped file, there is no need to do - * anything if restore == RESTORE_INITIAL. - */ - /* if (restore == RESTORE_FINAL) { */ - /* fseek(FB(fb)->file, FB(fb)->current_pos, SEEK_SET); */ - /* } */ - free(src); - - return 0; -} - -void* buffer_mmap_bytes(void *source, size_t nbytes, - size_t *bytes_read, int *status) { - void *retval; - memory_map *src = MM(source); - - if (src->position == src->last_pos) { - *bytes_read = 0; - *status = REACHED_EOF; - return NULL; - } - - retval = src->memmap + src->position; - - if (src->position + nbytes > src->last_pos) { - // fewer than nbytes remaining - *bytes_read = src->last_pos - src->position; - } else { - *bytes_read = nbytes; - } - - *status = 0; - - /* advance position in mmap data structure */ - src->position += *bytes_read; - - return retval; -} - -#else - -/* kludgy */ - -void *new_mmap(char *fname) { - return NULL; -} - -int del_mmap(void *src) { - return 0; -} - -/* don't use this! */ - -void* buffer_mmap_bytes(void *source, size_t nbytes, - size_t *bytes_read, int *status) { - return NULL; -} - -#endif diff --git a/pandas/src/parser/io.h b/pandas/src/parser/io.h deleted file mode 100644 index 2ae72ff8a7fe0..0000000000000 --- a/pandas/src/parser/io.h +++ /dev/null @@ -1,85 +0,0 @@ -#include "Python.h" -#include "tokenizer.h" - - -typedef struct _file_source { - /* The file being read. */ - FILE *fp; - - char *buffer; - /* Size of the file, in bytes. */ - /* off_t size; */ - - /* file position when the file_buffer was created. */ - off_t initial_file_pos; - - /* Offset in the file of the data currently in the buffer. */ - off_t buffer_file_pos; - - /* Actual number of bytes in the current buffer. 
(Can be less than buffer_size.) */ - off_t last_pos; - - /* Size (in bytes) of the buffer. */ - // off_t buffer_size; - - /* Pointer to the buffer. */ - // char *buffer; - -} file_source; - -#define FS(source) ((file_source *)source) - -#if !defined(_WIN32) && !defined(HAVE_MMAP) -#define HAVE_MMAP -#endif - -typedef struct _memory_map { - - FILE *fp; - - /* Size of the file, in bytes. */ - off_t size; - - /* file position when the file_buffer was created. */ - off_t initial_file_pos; - - int line_number; - - int fileno; - off_t position; - off_t last_pos; - char *memmap; - -} memory_map; - -#define MM(src) ((memory_map*) src) - -void *new_mmap(char *fname); - -int del_mmap(void *src); - -void* buffer_mmap_bytes(void *source, size_t nbytes, - size_t *bytes_read, int *status); - - -typedef struct _rd_source { - PyObject* obj; - PyObject* buffer; - size_t position; -} rd_source; - -#define RDS(source) ((rd_source *)source) - -void *new_file_source(char *fname, size_t buffer_size); - -void *new_rd_source(PyObject *obj); - -int del_file_source(void *src); -int del_rd_source(void *src); - -void* buffer_file_bytes(void *source, size_t nbytes, - size_t *bytes_read, int *status); - -void* buffer_rd_bytes(void *source, size_t nbytes, - size_t *bytes_read, int *status); - diff --git a/pandas/src/parser/tokenizer.c b/pandas/src/parser/tokenizer.c deleted file mode 100644 index 450abcf6c325c..0000000000000 --- a/pandas/src/parser/tokenizer.c +++ /dev/null @@ -1,2173 +0,0 @@ -/* - -Copyright (c) 2012, Lambda Foundry, Inc., except where noted - -Incorporates components of WarrenWeckesser/textreader, licensed under 3-clause -BSD - -See LICENSE for the license - -*/ - - /* - Low-level ascii-file processing for pandas. Combines some elements from - Python's built-in csv module and Warren Weckesser's textreader project on - GitHub. See Python Software Foundation License and BSD licenses for these. - - */ - - -#include "tokenizer.h" - -#include -#include -#include - - -//#define READ_ERROR_OUT_OF_MEMORY 1 - - -/* -* restore: -* RESTORE_NOT (0): -* Free memory, but leave the file position wherever it -* happend to be. -* RESTORE_INITIAL (1): -* Restore the file position to the location at which -* the file_buffer was created. -* RESTORE_FINAL (2): -* Put the file position at the next byte after the -* data read from the file_buffer. -* -#define RESTORE_NOT 0 -#define RESTORE_INITIAL 1 -#define RESTORE_FINAL 2 -*/ - -static void *safe_realloc(void *buffer, size_t size) { - void *result; - // OS X is weird - // http://stackoverflow.com/questions/9560609/ - // different-realloc-behaviour-in-linux-and-osx - - result = realloc(buffer, size); - TRACE(("safe_realloc: buffer = %p, size = %zu, result = %p\n", buffer, size, result)) - -/* if (result != NULL) { - // errno gets set to 12 on my OS Xmachine in some cases even when the - // realloc succeeds. 
annoying - errno = 0; - } else { - return buffer; - }*/ - return result; -} - - -void coliter_setup(coliter_t *self, parser_t *parser, int i, int start) { - // column i, starting at 0 - self->words = parser->words; - self->col = i; - self->line_start = parser->line_start + start; -} - -coliter_t *coliter_new(parser_t *self, int i) { - // column i, starting at 0 - coliter_t *iter = (coliter_t*) malloc(sizeof(coliter_t)); - - if (NULL == iter) { - return NULL; - } - - coliter_setup(iter, self, i, 0); - return iter; -} - - - /* int64_t str_to_int64(const char *p_item, int64_t int_min, int64_t int_max, int *error); */ - /* uint64_t str_to_uint64(const char *p_item, uint64_t uint_max, int *error); */ - - -static void free_if_not_null(void **ptr) { - TRACE(("free_if_not_null %p\n", *ptr)) - if (*ptr != NULL) { - free(*ptr); - *ptr = NULL; - } - } - - - - /* - - Parser / tokenizer - - */ - - -static void *grow_buffer(void *buffer, int length, int *capacity, - int space, int elsize, int *error) { - int cap = *capacity; - void *newbuffer = buffer; - - // Can we fit potentially nbytes tokens (+ null terminators) in the stream? - while ( (length + space >= cap) && (newbuffer != NULL) ){ - cap = cap? cap << 1 : 2; - buffer = newbuffer; - newbuffer = safe_realloc(newbuffer, elsize * cap); - } - - if (newbuffer == NULL) { - // realloc failed so don't change *capacity, set *error to errno - // and return the last good realloc'd buffer so it can be freed - *error = errno; - newbuffer = buffer; - } else { - // realloc worked, update *capacity and set *error to 0 - // sigh, multiple return values - *capacity = cap; - *error = 0; - } - return newbuffer; - } - - -void parser_set_default_options(parser_t *self) { - self->decimal = '.'; - self->sci = 'E'; - - // For tokenization - self->state = START_RECORD; - - self->delimiter = ','; // XXX - self->delim_whitespace = 0; - - self->doublequote = 0; - self->quotechar = '"'; - self->escapechar = 0; - - self->lineterminator = '\0'; /* NUL->standard logic */ - - self->skipinitialspace = 0; - self->quoting = QUOTE_MINIMAL; - self->allow_embedded_newline = 1; - self->strict = 0; - - self->expected_fields = -1; - self->error_bad_lines = 0; - self->warn_bad_lines = 0; - - self->commentchar = '#'; - self->thousands = '\0'; - - self->skipset = NULL; - self-> skip_first_N_rows = -1; - self->skip_footer = 0; -} - -int get_parser_memory_footprint(parser_t *self) { - return 0; -} - -parser_t* parser_new() { - return (parser_t*) calloc(1, sizeof(parser_t)); -} - -int parser_clear_data_buffers(parser_t *self) { - free_if_not_null((void *)&self->stream); - free_if_not_null((void *)&self->words); - free_if_not_null((void *)&self->word_starts); - free_if_not_null((void *)&self->line_start); - free_if_not_null((void *)&self->line_fields); - return 0; -} - -int parser_cleanup(parser_t *self) { - int status = 0; - - // XXX where to put this - free_if_not_null((void *) &self->error_msg); - free_if_not_null((void *) &self->warn_msg); - - if (self->skipset != NULL) { - kh_destroy_int64((kh_int64_t*) self->skipset); - self->skipset = NULL; - } - - if (parser_clear_data_buffers(self) < 0) { - status = -1; - } - - if (self->cb_cleanup != NULL) { - if (self->cb_cleanup(self->source) < 0) { - status = -1; - } - } - - return status; -} - - - -int parser_init(parser_t *self) { - int sz; - - /* - Initialize data buffers - */ - - self->stream = NULL; - self->words = NULL; - self->word_starts = NULL; - self->line_start = NULL; - self->line_fields = NULL; - self->error_msg = NULL; - self->warn_msg 
= NULL; - - // token stream - self->stream = (char*) malloc(STREAM_INIT_SIZE * sizeof(char)); - if (self->stream == NULL) { - parser_cleanup(self); - return PARSER_OUT_OF_MEMORY; - } - self->stream_cap = STREAM_INIT_SIZE; - self->stream_len = 0; - - // word pointers and metadata - sz = STREAM_INIT_SIZE / 10; - sz = sz? sz : 1; - self->words = (char**) malloc(sz * sizeof(char*)); - self->word_starts = (int*) malloc(sz * sizeof(int)); - self->words_cap = sz; - self->words_len = 0; - - // line pointers and metadata - self->line_start = (int*) malloc(sz * sizeof(int)); - - self->line_fields = (int*) malloc(sz * sizeof(int)); - - self->lines_cap = sz; - self->lines = 0; - self->file_lines = 0; - - if (self->stream == NULL || self->words == NULL || - self->word_starts == NULL || self->line_start == NULL || - self->line_fields == NULL) { - - parser_cleanup(self); - - return PARSER_OUT_OF_MEMORY; - } - - /* amount of bytes buffered */ - self->datalen = 0; - self->datapos = 0; - - self->line_start[0] = 0; - self->line_fields[0] = 0; - - self->pword_start = self->stream; - self->word_start = 0; - - self->state = START_RECORD; - - self->error_msg = NULL; - self->warn_msg = NULL; - - self->commentchar = '\0'; - - return 0; -} - - -void parser_free(parser_t *self) { - // opposite of parser_init - parser_cleanup(self); - free(self); -} - -static int make_stream_space(parser_t *self, size_t nbytes) { - int i, status, cap; - void *orig_ptr, *newptr; - - // Can we fit potentially nbytes tokens (+ null terminators) in the stream? - - /* TRACE(("maybe growing buffers\n")); */ - - /* - TOKEN STREAM - */ - - orig_ptr = (void *) self->stream; - TRACE(("\n\nmake_stream_space: nbytes = %zu. grow_buffer(self->stream...)\n", nbytes)) - self->stream = (char*) grow_buffer((void *) self->stream, - self->stream_len, - &self->stream_cap, nbytes * 2, - sizeof(char), &status); - TRACE(("make_stream_space: self->stream=%p, self->stream_len = %zu, self->stream_cap=%zu, status=%zu\n", - self->stream, self->stream_len, self->stream_cap, status)) - - if (status != 0) { - return PARSER_OUT_OF_MEMORY; - } - - // realloc sets errno when moving buffer? 
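-    // If realloc moved the block, every pointer previously handed out into
-    // the old stream is dangling. Offsets survive a move where raw pointers
-    // do not, so each word is rebuilt below as
-    //
-    //     self->words[i] = new_base + word_starts[i]
-    //
-    // with new_base being the freshly returned self->stream.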
- if (self->stream != orig_ptr) { - // uff - /* TRACE(("Moving word pointers\n")) */ - - self->pword_start = self->stream + self->word_start; - - for (i = 0; i < self->words_len; ++i) - { - self->words[i] = self->stream + self->word_starts[i]; - } - } - - - /* - WORD VECTORS - */ - - cap = self->words_cap; - self->words = (char**) grow_buffer((void *) self->words, - self->words_len, - &self->words_cap, nbytes, - sizeof(char*), &status); - TRACE(("make_stream_space: grow_buffer(self->self->words, %zu, %zu, %zu, %d)\n", - self->words_len, self->words_cap, nbytes, status)) - if (status != 0) { - return PARSER_OUT_OF_MEMORY; - } - - - // realloc took place - if (cap != self->words_cap) { - TRACE(("make_stream_space: cap != self->words_cap, nbytes = %d, self->words_cap=%d\n", nbytes, self->words_cap)) - newptr = safe_realloc((void *) self->word_starts, sizeof(int) * self->words_cap); - if (newptr == NULL) { - return PARSER_OUT_OF_MEMORY; - } else { - self->word_starts = (int*) newptr; - } - } - - - /* - LINE VECTORS - */ - /* - printf("Line_start: "); - - for (j = 0; j < self->lines + 1; ++j) { - printf("%d ", self->line_fields[j]); - } - printf("\n"); - - printf("lines_cap: %d\n", self->lines_cap); - */ - cap = self->lines_cap; - self->line_start = (int*) grow_buffer((void *) self->line_start, - self->lines + 1, - &self->lines_cap, nbytes, - sizeof(int), &status); - TRACE(("make_stream_space: grow_buffer(self->line_start, %zu, %zu, %zu, %d)\n", - self->lines + 1, self->lines_cap, nbytes, status)) - if (status != 0) { - return PARSER_OUT_OF_MEMORY; - } - - // realloc took place - if (cap != self->lines_cap) { - TRACE(("make_stream_space: cap != self->lines_cap, nbytes = %d\n", nbytes)) - newptr = safe_realloc((void *) self->line_fields, sizeof(int) * self->lines_cap); - if (newptr == NULL) { - return PARSER_OUT_OF_MEMORY; - } else { - self->line_fields = (int*) newptr; - } - } - - /* TRACE(("finished growing buffers\n")); */ - - return 0; -} - - -static int push_char(parser_t *self, char c) { - /* TRACE(("pushing %c \n", c)) */ - TRACE(("push_char: self->stream[%zu] = %x, stream_cap=%zu\n", self->stream_len+1, c, self->stream_cap)) - if (self->stream_len >= self->stream_cap) { - TRACE(("push_char: ERROR!!! self->stream_len(%d) >= self->stream_cap(%d)\n", - self->stream_len, self->stream_cap)) - self->error_msg = (char*) malloc(64); - sprintf(self->error_msg, "Buffer overflow caught - possible malformed input file.\n"); - return PARSER_OUT_OF_MEMORY; - } - self->stream[self->stream_len++] = c; - return 0; -} - -int P_INLINE end_field(parser_t *self) { - // XXX cruft -// self->numeric_field = 0; - if (self->words_len >= self->words_cap) { - TRACE(("end_field: ERROR!!! self->words_len(%zu) >= self->words_cap(%zu)\n", self->words_len, self->words_cap)) - self->error_msg = (char*) malloc(64); - sprintf(self->error_msg, "Buffer overflow caught - possible malformed input file.\n"); - return PARSER_OUT_OF_MEMORY; - } - - // null terminate token - push_char(self, '\0'); - - // set pointer and metadata - self->words[self->words_len] = self->pword_start; - - TRACE(("end_field: Char diff: %d\n", self->pword_start - self->words[0])); - - TRACE(("end_field: Saw word %s at: %d. 
Total: %d\n", - self->pword_start, self->word_start, self->words_len + 1)) - - self->word_starts[self->words_len] = self->word_start; - self->words_len++; - - // increment line field count - self->line_fields[self->lines]++; - - // New field begin in stream - self->pword_start = self->stream + self->stream_len; - self->word_start = self->stream_len; - - return 0; -} - - -static void append_warning(parser_t *self, const char *msg) { - int ex_length; - int length = strlen(msg); - void *newptr; - - if (self->warn_msg == NULL) { - self->warn_msg = (char*) malloc(length + 1); - strcpy(self->warn_msg, msg); - } else { - ex_length = strlen(self->warn_msg); - newptr = safe_realloc(self->warn_msg, ex_length + length + 1); - if (newptr != NULL) { - self->warn_msg = (char*) newptr; - strcpy(self->warn_msg + ex_length, msg); - } - } -} - -static int end_line(parser_t *self) { - int fields; - int ex_fields = self->expected_fields; - char *msg; - - fields = self->line_fields[self->lines]; - - TRACE(("end_line: Line end, nfields: %d\n", fields)); - - if (self->lines > 0) { - if (self->expected_fields >= 0) { - ex_fields = self->expected_fields; - } else { - ex_fields = self->line_fields[self->lines - 1]; - } - } - - if (self->state == START_FIELD_IN_SKIP_LINE || \ - self->state == IN_FIELD_IN_SKIP_LINE || \ - self->state == IN_QUOTED_FIELD_IN_SKIP_LINE || \ - self->state == QUOTE_IN_QUOTED_FIELD_IN_SKIP_LINE - ) { - TRACE(("end_line: Skipping row %d\n", self->file_lines)); - // increment file line count - self->file_lines++; - - // skip the tokens from this bad line - self->line_start[self->lines] += fields; - - // reset field count - self->line_fields[self->lines] = 0; - return 0; - } - - if (!(self->lines <= self->header_end + 1) - && (self->expected_fields < 0 && fields > ex_fields) - && !(self->usecols)) { - // increment file line count - self->file_lines++; - - // skip the tokens from this bad line - self->line_start[self->lines] += fields; - - // reset field count - self->line_fields[self->lines] = 0; - - // file_lines is now the actual file line number (starting at 1) - if (self->error_bad_lines) { - self->error_msg = (char*) malloc(100); - sprintf(self->error_msg, "Expected %d fields in line %d, saw %d\n", - ex_fields, self->file_lines, fields); - - TRACE(("Error at line %d, %d fields\n", self->file_lines, fields)); - - return -1; - } else { - // simply skip bad lines - if (self->warn_bad_lines) { - // pass up error message - msg = (char*) malloc(100); - sprintf(msg, "Skipping line %d: expected %d fields, saw %d\n", - self->file_lines, ex_fields, fields); - append_warning(self, msg); - free(msg); - } - } - } else { - // missing trailing delimiters - if ((self->lines >= self->header_end + 1) && fields < ex_fields) { - - // might overrun the buffer when closing fields - if (make_stream_space(self, ex_fields - fields) < 0) { - self->error_msg = "out of memory"; - return -1; - } - - while (fields < ex_fields){ - end_field(self); - fields++; - } - } - - // increment both line counts - self->file_lines++; - self->lines++; - - // good line, set new start point - if (self->lines >= self->lines_cap) { - TRACE(("end_line: ERROR!!! 
self->lines(%zu) >= self->lines_cap(%zu)\n", self->lines, self->lines_cap)) \ - self->error_msg = (char*) malloc(100); \ - sprintf(self->error_msg, "Buffer overflow caught - possible malformed input file.\n"); \ - return PARSER_OUT_OF_MEMORY; \ - } - self->line_start[self->lines] = (self->line_start[self->lines - 1] + - fields); - - TRACE(("end_line: new line start: %d\n", self->line_start[self->lines])); - - // new line start with 0 fields - self->line_fields[self->lines] = 0; - } - - TRACE(("end_line: Finished line, at %d\n", self->lines)); - - return 0; -} - -int parser_add_skiprow(parser_t *self, int64_t row) { - khiter_t k; - kh_int64_t *set; - int ret = 0; - - if (self->skipset == NULL) { - self->skipset = (void*) kh_init_int64(); - } - - set = (kh_int64_t*) self->skipset; - - k = kh_put_int64(set, row, &ret); - set->keys[k] = row; - - return 0; -} - -int parser_set_skipfirstnrows(parser_t *self, int64_t nrows) { - // self->file_lines is zero based so subtract 1 from nrows - if (nrows > 0) { - self->skip_first_N_rows = nrows - 1; - } - - return 0; -} - -static int parser_buffer_bytes(parser_t *self, size_t nbytes) { - int status; - size_t bytes_read; - - status = 0; - self->datapos = 0; - self->data = self->cb_io(self->source, nbytes, &bytes_read, &status); - TRACE(("parser_buffer_bytes self->cb_io: nbytes=%zu, datalen: %d, status=%d\n", - nbytes, bytes_read, status)); - self->datalen = bytes_read; - - if (status != REACHED_EOF && self->data == NULL) { - self->error_msg = (char*) malloc(200); - - if (status == CALLING_READ_FAILED) { - sprintf(self->error_msg, ("Calling read(nbytes) on source failed. " - "Try engine='python'.")); - } else { - sprintf(self->error_msg, "Unknown error in IO callback"); - } - return -1; - } - - TRACE(("datalen: %d\n", self->datalen)); - - return status; -} - - -/* - - Tokenization macros and state machine code - -*/ - -// printf("pushing %c\n", c); - -#define PUSH_CHAR(c) \ - TRACE(("PUSH_CHAR: Pushing %c, slen= %d, stream_cap=%zu, stream_len=%zu\n", c, slen, self->stream_cap, self->stream_len)) \ - if (slen >= maxstreamsize) { \ - TRACE(("PUSH_CHAR: ERROR!!! 
slen(%d) >= maxstreamsize(%d)\n", slen, maxstreamsize)) \ - self->error_msg = (char*) malloc(100); \ - sprintf(self->error_msg, "Buffer overflow caught - possible malformed input file.\n"); \ - return PARSER_OUT_OF_MEMORY; \ - } \ - *stream++ = c; \ - slen++; - -// This is a little bit of a hack but works for now - -#define END_FIELD() \ - self->stream_len = slen; \ - if (end_field(self) < 0) { \ - goto parsingerror; \ - } \ - stream = self->stream + self->stream_len; \ - slen = self->stream_len; - -#define END_LINE_STATE(STATE) \ - self->stream_len = slen; \ - if (end_line(self) < 0) { \ - goto parsingerror; \ - } \ - stream = self->stream + self->stream_len; \ - slen = self->stream_len; \ - self->state = STATE; \ - if (line_limit > 0 && self->lines == start_lines + line_limit) { \ - goto linelimit; \ - \ - } - -#define END_LINE_AND_FIELD_STATE(STATE) \ - self->stream_len = slen; \ - if (end_line(self) < 0) { \ - goto parsingerror; \ - } \ - if (end_field(self) < 0) { \ - goto parsingerror; \ - } \ - stream = self->stream + self->stream_len; \ - slen = self->stream_len; \ - self->state = STATE; \ - if (line_limit > 0 && self->lines == start_lines + line_limit) { \ - goto linelimit; \ - \ - } - -#define END_LINE() END_LINE_STATE(START_RECORD) - -#define IS_WHITESPACE(c) ((c == ' ' || c == '\t')) - -#define IS_TERMINATOR(c) ((self->lineterminator == '\0' && c == '\n') || \ - (self->lineterminator != '\0' && \ - c == self->lineterminator)) - -#define IS_QUOTE(c) ((c == self->quotechar && self->quoting != QUOTE_NONE)) - -// don't parse '\r' with a custom line terminator -#define IS_CARRIAGE(c) ((self->lineterminator == '\0' && c == '\r')) - -#define IS_COMMENT_CHAR(c) ((self->commentchar != '\0' && c == self->commentchar)) - -#define IS_ESCAPE_CHAR(c) ((self->escapechar != '\0' && c == self->escapechar)) - -#define IS_SKIPPABLE_SPACE(c) ((!self->delim_whitespace && c == ' ' && \ - self->skipinitialspace)) - -// applied when in a field -#define IS_DELIMITER(c) ((!self->delim_whitespace && c == self->delimiter) || \ - (self->delim_whitespace && IS_WHITESPACE(c))) - -#define _TOKEN_CLEANUP() \ - self->stream_len = slen; \ - self->datapos = i; \ - TRACE(("_TOKEN_CLEANUP: datapos: %d, datalen: %d\n", self->datapos, self->datalen)); - -#define CHECK_FOR_BOM() \ - if (*buf == '\xef' && *(buf + 1) == '\xbb' && *(buf + 2) == '\xbf') { \ - buf += 3; \ - self->datapos += 3; \ - } - -int skip_this_line(parser_t *self, int64_t rownum) { - if (self->skipset != NULL) { - return ( kh_get_int64((kh_int64_t*) self->skipset, self->file_lines) != - ((kh_int64_t*)self->skipset)->n_buckets ); - } - else { - return ( rownum <= self->skip_first_N_rows ); - } -} - -int tokenize_bytes(parser_t *self, size_t line_limit, int start_lines) -{ - int i, slen; - long maxstreamsize; - char c; - char *stream; - char *buf = self->data + self->datapos; - - if (make_stream_space(self, self->datalen - self->datapos) < 0) { - self->error_msg = "out of memory"; - return -1; - } - - stream = self->stream + self->stream_len; - slen = self->stream_len; - maxstreamsize = self->stream_cap; - - TRACE(("%s\n", buf)); - - if (self->file_lines == 0) { - CHECK_FOR_BOM(); - } - - for (i = self->datapos; i < self->datalen; ++i) - { - // next character in file - c = *buf++; - - TRACE(("tokenize_bytes - Iter: %d Char: 0x%x Line %d field_count %d, state %d\n", - i, c, self->file_lines + 1, self->line_fields[self->lines], - self->state)); - - switch(self->state) { - - case START_FIELD_IN_SKIP_LINE: - if (IS_TERMINATOR(c)) { - END_LINE(); - } else 
if (IS_CARRIAGE(c)) { - self->file_lines++; - self->state = EAT_CRNL_NOP; - } else if (IS_QUOTE(c)) { - self->state = IN_QUOTED_FIELD_IN_SKIP_LINE; - } else if (IS_DELIMITER(c)) { - // Do nothing, we're starting a new field again. - } else { - self->state = IN_FIELD_IN_SKIP_LINE; - } - break; - - case IN_FIELD_IN_SKIP_LINE: - if (IS_TERMINATOR(c)) { - END_LINE(); - } else if (IS_CARRIAGE(c)) { - self->file_lines++; - self->state = EAT_CRNL_NOP; - } else if (IS_DELIMITER(c)) { - self->state = START_FIELD_IN_SKIP_LINE; - } - break; - - case IN_QUOTED_FIELD_IN_SKIP_LINE: - if (IS_QUOTE(c)) { - if (self->doublequote) { - self->state = QUOTE_IN_QUOTED_FIELD_IN_SKIP_LINE; - } else { - self->state = IN_FIELD_IN_SKIP_LINE; - } - } - break; - - case QUOTE_IN_QUOTED_FIELD_IN_SKIP_LINE: - if (IS_QUOTE(c)) { - self->state = IN_QUOTED_FIELD_IN_SKIP_LINE; - } else if (IS_TERMINATOR(c)) { - END_LINE(); - } else if (IS_CARRIAGE(c)) { - self->file_lines++; - self->state = EAT_CRNL_NOP; - } else if (IS_DELIMITER(c)) { - self->state = START_FIELD_IN_SKIP_LINE; - } else { - self->state = IN_FIELD_IN_SKIP_LINE; - } - break; - - case WHITESPACE_LINE: - if (IS_TERMINATOR(c)) { - self->file_lines++; - self->state = START_RECORD; - break; - } else if (IS_CARRIAGE(c)) { - self->file_lines++; - self->state = EAT_CRNL_NOP; - break; - } else if (!self->delim_whitespace) { - if (IS_WHITESPACE(c) && c != self->delimiter) { - ; - } else { // backtrack - // use i + 1 because buf has been incremented but not i - do { - --buf; - --i; - } while (i + 1 > self->datapos && !IS_TERMINATOR(*buf)); - - // reached a newline rather than the beginning - if (IS_TERMINATOR(*buf)) { - ++buf; // move pointer to first char after newline - ++i; - } - self->state = START_FIELD; - } - break; - } - // fall through - - case EAT_WHITESPACE: - if (IS_TERMINATOR(c)) { - END_LINE(); - self->state = START_RECORD; - break; - } else if (IS_CARRIAGE(c)) { - self->state = EAT_CRNL; - break; - } else if (!IS_WHITESPACE(c)) { - self->state = START_FIELD; - // fall through to subsequent state - } else { - // if whitespace char, keep slurping - break; - } - - case START_RECORD: - // start of record - if (skip_this_line(self, self->file_lines)) { - if (IS_QUOTE(c)) { - self->state = IN_QUOTED_FIELD_IN_SKIP_LINE; - } else { - self->state = IN_FIELD_IN_SKIP_LINE; - - if (IS_TERMINATOR(c)) { - END_LINE(); - } - } - break; - } else if (IS_TERMINATOR(c)) { - // \n\r possible? 
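-                    // (A bare \n followed by \r would just be two
-                    // terminators in a row: the \n ends or skips a line
-                    // here, and the \r then routes the next, empty record
-                    // through the IS_CARRIAGE branch below.)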
- if (self->skip_empty_lines) { - self->file_lines++; - } else { - END_LINE(); - } - break; - } else if (IS_CARRIAGE(c)) { - if (self->skip_empty_lines) { - self->file_lines++; - self->state = EAT_CRNL_NOP; - } else { - self->state = EAT_CRNL; - } - break; - } else if (IS_COMMENT_CHAR(c)) { - self->state = EAT_LINE_COMMENT; - break; - } else if (IS_WHITESPACE(c)) { - if (self->delim_whitespace) { - if (self->skip_empty_lines) { - self->state = WHITESPACE_LINE; - } else { - self->state = EAT_WHITESPACE; - } - break; - } else if (c != self->delimiter && self->skip_empty_lines) { - self->state = WHITESPACE_LINE; - break; - } - // fall through - } - - // normal character - fall through - // to handle as START_FIELD - self->state = START_FIELD; - - case START_FIELD: - // expecting field - if (IS_TERMINATOR(c)) { - END_FIELD(); - END_LINE(); - } else if (IS_CARRIAGE(c)) { - END_FIELD(); - self->state = EAT_CRNL; - } else if (IS_QUOTE(c)) { - // start quoted field - self->state = IN_QUOTED_FIELD; - } else if (IS_ESCAPE_CHAR(c)) { - // possible escaped character - self->state = ESCAPED_CHAR; - } else if (IS_SKIPPABLE_SPACE(c)) { - // ignore space at start of field - ; - } else if (IS_DELIMITER(c)) { - if (self->delim_whitespace) { - self->state = EAT_WHITESPACE; - } else { - // save empty field - END_FIELD(); - } - } else if (IS_COMMENT_CHAR(c)) { - END_FIELD(); - self->state = EAT_COMMENT; - } else { - // begin new unquoted field - // if (self->delim_whitespace && \ - // self->quoting == QUOTE_NONNUMERIC) { - // self->numeric_field = 1; - // } - - PUSH_CHAR(c); - self->state = IN_FIELD; - } - break; - - case ESCAPED_CHAR: - PUSH_CHAR(c); - self->state = IN_FIELD; - break; - - case EAT_LINE_COMMENT: - if (IS_TERMINATOR(c)) { - self->file_lines++; - self->state = START_RECORD; - } else if (IS_CARRIAGE(c)) { - self->file_lines++; - self->state = EAT_CRNL_NOP; - } - break; - - case IN_FIELD: - // in unquoted field - if (IS_TERMINATOR(c)) { - END_FIELD(); - END_LINE(); - } else if (IS_CARRIAGE(c)) { - END_FIELD(); - self->state = EAT_CRNL; - } else if (IS_ESCAPE_CHAR(c)) { - // possible escaped character - self->state = ESCAPED_CHAR; - } else if (IS_DELIMITER(c)) { - // end of field - end of line not reached yet - END_FIELD(); - - if (self->delim_whitespace) { - self->state = EAT_WHITESPACE; - } else { - self->state = START_FIELD; - } - } else if (IS_COMMENT_CHAR(c)) { - END_FIELD(); - self->state = EAT_COMMENT; - } else { - // normal character - save in field - PUSH_CHAR(c); - } - break; - - case IN_QUOTED_FIELD: - // in quoted field - if (IS_ESCAPE_CHAR(c)) { - // possible escape character - self->state = ESCAPE_IN_QUOTED_FIELD; - } else if (IS_QUOTE(c)) { - if (self->doublequote) { - // double quote - " represented by "" - self->state = QUOTE_IN_QUOTED_FIELD; - } else { - // end of quote part of field - self->state = IN_FIELD; - } - } else { - // normal character - save in field - PUSH_CHAR(c); - } - break; - - case ESCAPE_IN_QUOTED_FIELD: - PUSH_CHAR(c); - self->state = IN_QUOTED_FIELD; - break; - - case QUOTE_IN_QUOTED_FIELD: - // double quote - seen a quote in an quoted field - if (IS_QUOTE(c)) { - // save "" as " - - PUSH_CHAR(c); - self->state = IN_QUOTED_FIELD; - } else if (IS_DELIMITER(c)) { - // end of field - end of line not reached yet - END_FIELD(); - - if (self->delim_whitespace) { - self->state = EAT_WHITESPACE; - } else { - self->state = START_FIELD; - } - } else if (IS_TERMINATOR(c)) { - END_FIELD(); - END_LINE(); - } else if (IS_CARRIAGE(c)) { - END_FIELD(); - self->state = 
EAT_CRNL; - } else if (!self->strict) { - PUSH_CHAR(c); - self->state = IN_FIELD; - } else { - self->error_msg = (char*) malloc(50); - sprintf(self->error_msg, - "delimiter expected after " - "quote in quote"); - goto parsingerror; - } - break; - - case EAT_COMMENT: - if (IS_TERMINATOR(c)) { - END_LINE(); - } else if (IS_CARRIAGE(c)) { - self->state = EAT_CRNL; - } - break; - - // only occurs with non-custom line terminator, - // which is why we directly check for '\n' - case EAT_CRNL: - if (c == '\n') { - END_LINE(); - } else if (IS_DELIMITER(c)){ - - if (self->delim_whitespace) { - END_LINE_STATE(EAT_WHITESPACE); - } else { - // Handle \r-delimited files - END_LINE_AND_FIELD_STATE(START_FIELD); - } - } else { - if (self->delim_whitespace) { - /* XXX - * first character of a new record--need to back up and reread - * to handle properly... - */ - i--; buf--; // back up one character (HACK!) - END_LINE_STATE(START_RECORD); - } else { - // \r line terminator - // UGH. we don't actually want - // to consume the token. fix this later - self->stream_len = slen; - if (end_line(self) < 0) { - goto parsingerror; - } - - stream = self->stream + self->stream_len; - slen = self->stream_len; - self->state = START_RECORD; - - --i; buf--; // let's try this character again (HACK!) - if (line_limit > 0 && self->lines == start_lines + line_limit) { - goto linelimit; - } - } - } - break; - - // only occurs with non-custom line terminator, - // which is why we directly check for '\n' - case EAT_CRNL_NOP: // inside an ignored comment line - self->state = START_RECORD; - // \r line terminator -- parse this character again - if (c != '\n' && !IS_DELIMITER(c)) { - --i; - --buf; - } - break; - default: - break; - } - } - - _TOKEN_CLEANUP(); - - TRACE(("Finished tokenizing input\n")) - - return 0; - -parsingerror: - i++; - _TOKEN_CLEANUP(); - - return -1; - -linelimit: - i++; - _TOKEN_CLEANUP(); - - return 0; -} - -static int parser_handle_eof(parser_t *self) { - TRACE(("handling eof, datalen: %d, pstate: %d\n", self->datalen, self->state)) - - if (self->datalen != 0) - return -1; - - switch (self->state) { - case START_RECORD: - case WHITESPACE_LINE: - case EAT_CRNL_NOP: - case EAT_LINE_COMMENT: - return 0; - - case ESCAPE_IN_QUOTED_FIELD: - case IN_QUOTED_FIELD: - self->error_msg = (char*)malloc(100); - sprintf(self->error_msg, "EOF inside string starting at line %d", - self->file_lines); - return -1; - - case ESCAPED_CHAR: - self->error_msg = (char*)malloc(100); - sprintf(self->error_msg, "EOF following escape character"); - return -1; - - case IN_FIELD: - case START_FIELD: - case QUOTE_IN_QUOTED_FIELD: - if (end_field(self) < 0) - return -1; - break; - - default: - break; - } - - if (end_line(self) < 0) - return -1; - else - return 0; -} - -int parser_consume_rows(parser_t *self, size_t nrows) { - int i, offset, word_deletions, char_count; - - if (nrows > self->lines) { - nrows = self->lines; - } - - /* do nothing */ - if (nrows == 0) - return 0; - - /* cannot guarantee that nrows + 1 has been observed */ - word_deletions = self->line_start[nrows - 1] + self->line_fields[nrows - 1]; - char_count = (self->word_starts[word_deletions - 1] + - strlen(self->words[word_deletions - 1]) + 1); - - TRACE(("parser_consume_rows: Deleting %d words, %d chars\n", word_deletions, char_count)); - - /* move stream, only if something to move */ - if (char_count < self->stream_len) { - memmove((void*) self->stream, (void*) (self->stream + char_count), - self->stream_len - char_count); - } - /* buffer counts */ - self->stream_len 
-= char_count; - - /* move token metadata */ - for (i = 0; i < self->words_len - word_deletions; ++i) { - offset = i + word_deletions; - - self->words[i] = self->words[offset] - char_count; - self->word_starts[i] = self->word_starts[offset] - char_count; - } - self->words_len -= word_deletions; - - /* move current word pointer to stream */ - self->pword_start -= char_count; - self->word_start -= char_count; - /* - printf("Line_start: "); - for (i = 0; i < self->lines + 1; ++i) { - printf("%d ", self->line_fields[i]); - } - printf("\n"); - */ - /* move line metadata */ - for (i = 0; i < self->lines - nrows + 1; ++i) - { - offset = i + nrows; - self->line_start[i] = self->line_start[offset] - word_deletions; - - /* TRACE(("First word in line %d is now %s\n", i, */ - /* self->words[self->line_start[i]])); */ - - self->line_fields[i] = self->line_fields[offset]; - } - self->lines -= nrows; - /* self->line_fields[self->lines] = 0; */ - - return 0; -} - -static size_t _next_pow2(size_t sz) { - size_t result = 1; - while (result < sz) result *= 2; - return result; -} - -int parser_trim_buffers(parser_t *self) { - /* - Free memory - */ - size_t new_cap; - void *newptr; - - int i; - - /* trim words, word_starts */ - new_cap = _next_pow2(self->words_len) + 1; - if (new_cap < self->words_cap) { - TRACE(("parser_trim_buffers: new_cap < self->words_cap\n")); - newptr = safe_realloc((void*) self->words, new_cap * sizeof(char*)); - if (newptr == NULL) { - return PARSER_OUT_OF_MEMORY; - } else { - self->words = (char**) newptr; - } - newptr = safe_realloc((void*) self->word_starts, new_cap * sizeof(int)); - if (newptr == NULL) { - return PARSER_OUT_OF_MEMORY; - } else { - self->word_starts = (int*) newptr; - self->words_cap = new_cap; - } - } - - /* trim stream */ - new_cap = _next_pow2(self->stream_len) + 1; - TRACE(("parser_trim_buffers: new_cap = %zu, stream_cap = %zu, lines_cap = %zu\n", - new_cap, self->stream_cap, self->lines_cap)); - if (new_cap < self->stream_cap) { - TRACE(("parser_trim_buffers: new_cap < self->stream_cap, calling safe_realloc\n")); - newptr = safe_realloc((void*) self->stream, new_cap); - if (newptr == NULL) { - return PARSER_OUT_OF_MEMORY; - } else { - // Update the pointers in the self->words array (char **) if `safe_realloc` - // moved the `self->stream` buffer. This block mirrors a similar block in - // `make_stream_space`. 
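-            // At this point self->stream still holds the old base address,
-            // so the comparison below detects whether realloc actually moved
-            // the allocation; only then are the saved word_starts offsets
-            // re-applied against the new base.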
- if (self->stream != newptr) { - /* TRACE(("Moving word pointers\n")) */ - self->pword_start = (char*) newptr + self->word_start; - - for (i = 0; i < self->words_len; ++i) - { - self->words[i] = (char*) newptr + self->word_starts[i]; - } - } - - self->stream = newptr; - self->stream_cap = new_cap; - - } - } - - /* trim line_start, line_fields */ - new_cap = _next_pow2(self->lines) + 1; - if (new_cap < self->lines_cap) { - TRACE(("parser_trim_buffers: new_cap < self->lines_cap\n")); - newptr = safe_realloc((void*) self->line_start, new_cap * sizeof(int)); - if (newptr == NULL) { - return PARSER_OUT_OF_MEMORY; - } else { - self->line_start = (int*) newptr; - } - newptr = safe_realloc((void*) self->line_fields, new_cap * sizeof(int)); - if (newptr == NULL) { - return PARSER_OUT_OF_MEMORY; - } else { - self->line_fields = (int*) newptr; - self->lines_cap = new_cap; - } - } - - return 0; -} - -void debug_print_parser(parser_t *self) { - int j, line; - char *token; - - for (line = 0; line < self->lines; ++line) - { - printf("(Parsed) Line %d: ", line); - - for (j = 0; j < self->line_fields[j]; ++j) - { - token = self->words[j + self->line_start[line]]; - printf("%s ", token); - } - printf("\n"); - } -} - -/*int clear_parsed_lines(parser_t *self, size_t nlines) { - // TODO. move data up in stream, shift relevant word pointers - - return 0; -}*/ - - -/* - nrows : number of rows to tokenize (or until reach EOF) - all : tokenize all the data vs. certain number of rows - */ - -int _tokenize_helper(parser_t *self, size_t nrows, int all) { - int status = 0; - int start_lines = self->lines; - - if (self->state == FINISHED) { - return 0; - } - - TRACE(("_tokenize_helper: Asked to tokenize %d rows, datapos=%d, datalen=%d\n", \ - (int) nrows, self->datapos, self->datalen)); - - while (1) { - if (!all && self->lines - start_lines >= nrows) - break; - - if (self->datapos == self->datalen) { - status = parser_buffer_bytes(self, self->chunksize); - - if (status == REACHED_EOF) { - // close out last line - status = parser_handle_eof(self); - self->state = FINISHED; - break; - } else if (status != 0) { - return status; - } - } - - TRACE(("_tokenize_helper: Trying to process %d bytes, datalen=%d, datapos= %d\n", - self->datalen - self->datapos, self->datalen, self->datapos)); - - status = tokenize_bytes(self, nrows, start_lines); - - if (status < 0) { - // XXX - TRACE(("_tokenize_helper: Status %d returned from tokenize_bytes, breaking\n", - status)); - status = -1; - break; - } - } - TRACE(("leaving tokenize_helper\n")); - return status; -} - -int tokenize_nrows(parser_t *self, size_t nrows) { - int status = _tokenize_helper(self, nrows, 0); - return status; -} - -int tokenize_all_rows(parser_t *self) { - int status = _tokenize_helper(self, -1, 1); - return status; -} - -/* SEL - does not look like this routine is used anywhere -void test_count_lines(char *fname) { - clock_t start = clock(); - - char *buffer, *tmp; - size_t bytes, lines = 0; - int i; - FILE *fp = fopen(fname, "rb"); - - buffer = (char*) malloc(CHUNKSIZE * sizeof(char)); - - while(1) { - tmp = buffer; - bytes = fread((void *) buffer, sizeof(char), CHUNKSIZE, fp); - // printf("Read %d bytes\n", bytes); - - if (bytes == 0) { - break; - } - - for (i = 0; i < bytes; ++i) - { - if (*tmp++ == '\n') { - lines++; - } - } - } - - - printf("Saw %d lines\n", (int) lines); - - free(buffer); - fclose(fp); - - printf("Time elapsed: %f\n", ((double)clock() - start) / CLOCKS_PER_SEC); -}*/ - - -P_INLINE void uppercase(char *p) { - for ( ; *p; ++p) *p = 
toupper(*p); -} - -/* SEL - does not look like these routines are used anywhere -P_INLINE void lowercase(char *p) { - for ( ; *p; ++p) *p = tolower(*p); -} - -int P_INLINE to_complex(char *item, double *p_real, double *p_imag, char sci, char decimal) -{ - char *p_end; - - *p_real = xstrtod(item, &p_end, decimal, sci, '\0', FALSE); - if (*p_end == '\0') { - *p_imag = 0.0; - return errno == 0; - } - if (*p_end == 'i' || *p_end == 'j') { - *p_imag = *p_real; - *p_real = 0.0; - ++p_end; - } - else { - if (*p_end == '+') { - ++p_end; - } - *p_imag = xstrtod(p_end, &p_end, decimal, sci, '\0', FALSE); - if (errno || ((*p_end != 'i') && (*p_end != 'j'))) { - return FALSE; - } - ++p_end; - } - while(*p_end == ' ') { - ++p_end; - } - return *p_end == '\0'; -}*/ - - -int P_INLINE to_longlong(char *item, long long *p_value) -{ - char *p_end; - - // Try integer conversion. We explicitly give the base to be 10. If - // we used 0, strtoll() would convert '012' to 10, because the leading 0 in - // '012' signals an octal number in C. For a general purpose reader, that - // would be a bug, not a feature. - *p_value = strtoll(item, &p_end, 10); - - // Allow trailing spaces. - while (isspace(*p_end)) ++p_end; - - return (errno == 0) && (!*p_end); -} - -/* does not look like this routine is used anywhere -int P_INLINE to_longlong_thousands(char *item, long long *p_value, char tsep) -{ - int i, pos, status, n = strlen(item), count = 0; - char *tmp; - char *p_end; - - for (i = 0; i < n; ++i) - { - if (*(item + i) == tsep) { - count++; - } - } - - if (count == 0) { - return to_longlong(item, p_value); - } - - tmp = (char*) malloc((n - count + 1) * sizeof(char)); - if (tmp == NULL) { - return 0; - } - - pos = 0; - for (i = 0; i < n; ++i) - { - if (item[i] != tsep) - tmp[pos++] = item[i]; - } - - tmp[pos] = '\0'; - - status = to_longlong(tmp, p_value); - free(tmp); - - return status; -}*/ - -int to_boolean(const char *item, uint8_t *val) { - char *tmp; - int i, status = 0; - - static const char *tstrs[1] = {"TRUE"}; - static const char *fstrs[1] = {"FALSE"}; - - tmp = malloc(sizeof(char) * (strlen(item) + 1)); - strcpy(tmp, item); - uppercase(tmp); - - for (i = 0; i < 1; ++i) - { - if (strcmp(tmp, tstrs[i]) == 0) { - *val = 1; - goto done; - } - } - - for (i = 0; i < 1; ++i) - { - if (strcmp(tmp, fstrs[i]) == 0) { - *val = 0; - goto done; - } - } - - status = -1; - -done: - free(tmp); - return status; -} - -// #define TEST - -#ifdef TEST - -int main(int argc, char *argv[]) -{ - double x, y; - long long xi; - int status; - char *s; - - //s = "0.10e-3-+5.5e2i"; - // s = "1-0j"; - // status = to_complex(s, &x, &y, 'e', '.'); - s = "123,789"; - status = to_longlong_thousands(s, &xi, ','); - printf("s = '%s'\n", s); - printf("status = %d\n", status); - printf("x = %d\n", (int) xi); - - // printf("x = %lg, y = %lg\n", x, y); - - return 0; -} -#endif - -// --------------------------------------------------------------------------- -// Implementation of xstrtod - -// -// strtod.c -// -// Convert string to double -// -// Copyright (C) 2002 Michael Ringgaard. All rights reserved. -// -// Redistribution and use in source and binary forms, with or without -// modification, are permitted provided that the following conditions -// are met: -// -// 1. Redistributions of source code must retain the above copyright -// notice, this list of conditions and the following disclaimer. -// 2. 
Redistributions in binary form must reproduce the above copyright -// notice, this list of conditions and the following disclaimer in the -// documentation and/or other materials provided with the distribution. -// 3. Neither the name of the project nor the names of its contributors -// may be used to endorse or promote products derived from this software -// without specific prior written permission. -// -// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND -// ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -// ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE -// FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL -// DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS -// OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) -// HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT -// LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY -// OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF -// SUCH DAMAGE. -// -// ----------------------------------------------------------------------- -// Modifications by Warren Weckesser, March 2011: -// * Rename strtod() to xstrtod(). -// * Added decimal and sci arguments. -// * Skip trailing spaces. -// * Commented out the other functions. -// Modifications by Richard T Guy, August 2013: -// * Add tsep argument for thousands separator -// - -double xstrtod(const char *str, char **endptr, char decimal, - char sci, char tsep, int skip_trailing) -{ - double number; - int exponent; - int negative; - char *p = (char *) str; - double p10; - int n; - int num_digits; - int num_decimals; - - errno = 0; - - // Skip leading whitespace - while (isspace(*p)) p++; - - // Handle optional sign - negative = 0; - switch (*p) - { - case '-': negative = 1; // Fall through to increment position - case '+': p++; - } - - number = 0.; - exponent = 0; - num_digits = 0; - num_decimals = 0; - - // Process string of digits - while (isdigit(*p)) - { - number = number * 10. + (*p - '0'); - p++; - num_digits++; - - p += (tsep != '\0' && *p == tsep); - } - - // Process decimal part - if (*p == decimal) - { - p++; - - while (isdigit(*p)) - { - number = number * 10. 
+ (*p - '0'); - p++; - num_digits++; - num_decimals++; - } - - exponent -= num_decimals; - } - - if (num_digits == 0) - { - errno = ERANGE; - return 0.0; - } - - // Correct for sign - if (negative) number = -number; - - // Process an exponent string - if (toupper(*p) == toupper(sci)) - { - // Handle optional sign - negative = 0; - switch (*++p) - { - case '-': negative = 1; // Fall through to increment pos - case '+': p++; - } - - // Process string of digits - num_digits = 0; - n = 0; - while (isdigit(*p)) - { - n = n * 10 + (*p - '0'); - num_digits++; - p++; - } - - if (negative) - exponent -= n; - else - exponent += n; - - // If no digits, after the 'e'/'E', un-consume it - if (num_digits == 0) - p--; - } - - - if (exponent < DBL_MIN_EXP || exponent > DBL_MAX_EXP) - { - - errno = ERANGE; - return HUGE_VAL; - } - - // Scale the result - p10 = 10.; - n = exponent; - if (n < 0) n = -n; - while (n) - { - if (n & 1) - { - if (exponent < 0) - number /= p10; - else - number *= p10; - } - n >>= 1; - p10 *= p10; - } - - - if (number == HUGE_VAL) { - errno = ERANGE; - } - - if (skip_trailing) { - // Skip trailing whitespace - while (isspace(*p)) p++; - } - - if (endptr) *endptr = p; - - - return number; -} - -double precise_xstrtod(const char *str, char **endptr, char decimal, - char sci, char tsep, int skip_trailing) -{ - double number; - int exponent; - int negative; - char *p = (char *) str; - int num_digits; - int num_decimals; - int max_digits = 17; - int n; - // Cache powers of 10 in memory - static double e[] = {1., 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10, - 1e11, 1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, - 1e21, 1e22, 1e23, 1e24, 1e25, 1e26, 1e27, 1e28, 1e29, 1e30, - 1e31, 1e32, 1e33, 1e34, 1e35, 1e36, 1e37, 1e38, 1e39, 1e40, - 1e41, 1e42, 1e43, 1e44, 1e45, 1e46, 1e47, 1e48, 1e49, 1e50, - 1e51, 1e52, 1e53, 1e54, 1e55, 1e56, 1e57, 1e58, 1e59, 1e60, - 1e61, 1e62, 1e63, 1e64, 1e65, 1e66, 1e67, 1e68, 1e69, 1e70, - 1e71, 1e72, 1e73, 1e74, 1e75, 1e76, 1e77, 1e78, 1e79, 1e80, - 1e81, 1e82, 1e83, 1e84, 1e85, 1e86, 1e87, 1e88, 1e89, 1e90, - 1e91, 1e92, 1e93, 1e94, 1e95, 1e96, 1e97, 1e98, 1e99, 1e100, - 1e101, 1e102, 1e103, 1e104, 1e105, 1e106, 1e107, 1e108, 1e109, 1e110, - 1e111, 1e112, 1e113, 1e114, 1e115, 1e116, 1e117, 1e118, 1e119, 1e120, - 1e121, 1e122, 1e123, 1e124, 1e125, 1e126, 1e127, 1e128, 1e129, 1e130, - 1e131, 1e132, 1e133, 1e134, 1e135, 1e136, 1e137, 1e138, 1e139, 1e140, - 1e141, 1e142, 1e143, 1e144, 1e145, 1e146, 1e147, 1e148, 1e149, 1e150, - 1e151, 1e152, 1e153, 1e154, 1e155, 1e156, 1e157, 1e158, 1e159, 1e160, - 1e161, 1e162, 1e163, 1e164, 1e165, 1e166, 1e167, 1e168, 1e169, 1e170, - 1e171, 1e172, 1e173, 1e174, 1e175, 1e176, 1e177, 1e178, 1e179, 1e180, - 1e181, 1e182, 1e183, 1e184, 1e185, 1e186, 1e187, 1e188, 1e189, 1e190, - 1e191, 1e192, 1e193, 1e194, 1e195, 1e196, 1e197, 1e198, 1e199, 1e200, - 1e201, 1e202, 1e203, 1e204, 1e205, 1e206, 1e207, 1e208, 1e209, 1e210, - 1e211, 1e212, 1e213, 1e214, 1e215, 1e216, 1e217, 1e218, 1e219, 1e220, - 1e221, 1e222, 1e223, 1e224, 1e225, 1e226, 1e227, 1e228, 1e229, 1e230, - 1e231, 1e232, 1e233, 1e234, 1e235, 1e236, 1e237, 1e238, 1e239, 1e240, - 1e241, 1e242, 1e243, 1e244, 1e245, 1e246, 1e247, 1e248, 1e249, 1e250, - 1e251, 1e252, 1e253, 1e254, 1e255, 1e256, 1e257, 1e258, 1e259, 1e260, - 1e261, 1e262, 1e263, 1e264, 1e265, 1e266, 1e267, 1e268, 1e269, 1e270, - 1e271, 1e272, 1e273, 1e274, 1e275, 1e276, 1e277, 1e278, 1e279, 1e280, - 1e281, 1e282, 1e283, 1e284, 1e285, 1e286, 1e287, 1e288, 1e289, 1e290, - 1e291, 1e292, 1e293, 1e294, 
1e295, 1e296, 1e297, 1e298, 1e299, 1e300, - 1e301, 1e302, 1e303, 1e304, 1e305, 1e306, 1e307, 1e308}; - errno = 0; - - // Skip leading whitespace - while (isspace(*p)) p++; - - // Handle optional sign - negative = 0; - switch (*p) - { - case '-': negative = 1; // Fall through to increment position - case '+': p++; - } - - number = 0.; - exponent = 0; - num_digits = 0; - num_decimals = 0; - - // Process string of digits - while (isdigit(*p)) - { - if (num_digits < max_digits) - { - number = number * 10. + (*p - '0'); - num_digits++; - } - else - ++exponent; - - p++; - p += (tsep != '\0' && *p == tsep); - } - - // Process decimal part - if (*p == decimal) - { - p++; - - while (num_digits < max_digits && isdigit(*p)) - { - number = number * 10. + (*p - '0'); - p++; - num_digits++; - num_decimals++; - } - - if (num_digits >= max_digits) // consume extra decimal digits - while (isdigit(*p)) - ++p; - - exponent -= num_decimals; - } - - if (num_digits == 0) - { - errno = ERANGE; - return 0.0; - } - - // Correct for sign - if (negative) number = -number; - - // Process an exponent string - if (toupper(*p) == toupper(sci)) - { - // Handle optional sign - negative = 0; - switch (*++p) - { - case '-': negative = 1; // Fall through to increment pos - case '+': p++; - } - - // Process string of digits - num_digits = 0; - n = 0; - while (isdigit(*p)) - { - n = n * 10 + (*p - '0'); - num_digits++; - p++; - } - - if (negative) - exponent -= n; - else - exponent += n; - - // If no digits, after the 'e'/'E', un-consume it - if (num_digits == 0) - p--; - } - - if (exponent > 308) - { - errno = ERANGE; - return HUGE_VAL; - } - else if (exponent > 0) - number *= e[exponent]; - else if (exponent < -308) // subnormal - { - if (exponent < -616) // prevent invalid array access - number = 0.; - number /= e[-308 - exponent]; - number /= e[308]; - } - else - number /= e[-exponent]; - - if (number == HUGE_VAL || number == -HUGE_VAL) - errno = ERANGE; - - if (skip_trailing) { - // Skip trailing whitespace - while (isspace(*p)) p++; - } - - if (endptr) *endptr = p; - return number; -} - -double round_trip(const char *p, char **q, char decimal, char sci, - char tsep, int skip_trailing) -{ -#if PY_VERSION_HEX >= 0x02070000 - return PyOS_string_to_double(p, q, 0); -#else - return strtod(p, q); -#endif -} - -/* -float strtof(const char *str, char **endptr) -{ - return (float) strtod(str, endptr); -} - - -long double strtold(const char *str, char **endptr) -{ - return strtod(str, endptr); -} - -double atof(const char *str) -{ - return strtod(str, NULL); -} -*/ - -// End of xstrtod code -// --------------------------------------------------------------------------- - -int64_t str_to_int64(const char *p_item, int64_t int_min, int64_t int_max, - int *error, char tsep) -{ - const char *p = (const char *) p_item; - int isneg = 0; - int64_t number = 0; - int d; - - // Skip leading spaces. - while (isspace(*p)) { - ++p; - } - - // Handle sign. - if (*p == '-') { - isneg = 1; - ++p; - } - else if (*p == '+') { - p++; - } - - // Check that there is a first digit. - if (!isdigit(*p)) { - // Error... - *error = ERROR_NO_DIGITS; - return 0; - } - - if (isneg) { - // If number is greater than pre_min, at least one more digit - // can be processed without overflowing. - int dig_pre_min = -(int_min % 10); - int64_t pre_min = int_min / 10; - - // Process the digits. 
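-        // Worked example of the guard below: for int64, int_min is
-        // -9223372036854775808, so pre_min == -922337203685477580 and
-        // dig_pre_min == 8. Accepting one more digit d is safe iff
-        //     number > pre_min, or
-        //     number == pre_min and (d - '0') <= 8,
-        // since then number * 10 - (d - '0') cannot pass below int_min.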
- d = *p; - if (tsep != '\0') { - while (1) { - if (d == tsep) { - d = *++p; - continue; - } else if (!isdigit(d)) { - break; - } - if ((number > pre_min) || - ((number == pre_min) && (d - '0' <= dig_pre_min))) { - - number = number * 10 - (d - '0'); - d = *++p; - } - else { - *error = ERROR_OVERFLOW; - return 0; - } - } - } else { - while (isdigit(d)) { - if ((number > pre_min) || - ((number == pre_min) && (d - '0' <= dig_pre_min))) { - - number = number * 10 - (d - '0'); - d = *++p; - } - else { - *error = ERROR_OVERFLOW; - return 0; - } - } - } - } - else { - // If number is less than pre_max, at least one more digit - // can be processed without overflowing. - int64_t pre_max = int_max / 10; - int dig_pre_max = int_max % 10; - - //printf("pre_max = %lld dig_pre_max = %d\n", pre_max, dig_pre_max); - - // Process the digits. - d = *p; - if (tsep != '\0') { - while (1) { - if (d == tsep) { - d = *++p; - continue; - } else if (!isdigit(d)) { - break; - } - if ((number < pre_max) || - ((number == pre_max) && (d - '0' <= dig_pre_max))) { - - number = number * 10 + (d - '0'); - d = *++p; - - } - else { - *error = ERROR_OVERFLOW; - return 0; - } - } - } else { - while (isdigit(d)) { - if ((number < pre_max) || - ((number == pre_max) && (d - '0' <= dig_pre_max))) { - - number = number * 10 + (d - '0'); - d = *++p; - - } - else { - *error = ERROR_OVERFLOW; - return 0; - } - } - } - } - - // Skip trailing spaces. - while (isspace(*p)) { - ++p; - } - - // Did we use up all the characters? - if (*p) { - *error = ERROR_INVALID_CHARS; - return 0; - } - - *error = 0; - return number; -} - -/* does not look like this routine is used anywhere -uint64_t str_to_uint64(const char *p_item, uint64_t uint_max, int *error) -{ - int d, dig_pre_max; - uint64_t pre_max; - const char *p = (const char *) p_item; - uint64_t number = 0; - - // Skip leading spaces. - while (isspace(*p)) { - ++p; - } - - // Handle sign. - if (*p == '-') { - *error = ERROR_MINUS_SIGN; - return 0; - } - if (*p == '+') { - p++; - } - - // Check that there is a first digit. - if (!isdigit(*p)) { - // Error... - *error = ERROR_NO_DIGITS; - return 0; - } - - // If number is less than pre_max, at least one more digit - // can be processed without overflowing. - pre_max = uint_max / 10; - dig_pre_max = uint_max % 10; - - // Process the digits. - d = *p; - while (isdigit(d)) { - if ((number < pre_max) || ((number == pre_max) && (d - '0' <= dig_pre_max))) { - number = number * 10 + (d - '0'); - d = *++p; - } - else { - *error = ERROR_OVERFLOW; - return 0; - } - } - - // Skip trailing spaces. - while (isspace(*p)) { - ++p; - } - - // Did we use up all the characters? - if (*p) { - *error = ERROR_INVALID_CHARS; - return 0; - } - - *error = 0; - return number; -} -*/ diff --git a/pandas/src/period_helper.h b/pandas/src/period_helper.h deleted file mode 100644 index 0351321926fa2..0000000000000 --- a/pandas/src/period_helper.h +++ /dev/null @@ -1,170 +0,0 @@ -/* - * Borrowed and derived code from scikits.timeseries that we will expose via - * Cython to pandas. This primarily concerns interval representation and - * frequency conversion routines. 
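- * Ordinals are anchored at the unix epoch: with the monthly frequency
- * (FR_MTH below) ordinal 0 is January 1970, and ORD_OFFSET (719163) is
- * the day count from proleptic-Gregorian day 1 to 1970-01-01, i.e. the
- * shift that lines daily ordinals up with Python's date.toordinal().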
- */ - -#ifndef C_PERIOD_H -#define C_PERIOD_H - -#include -#include "helper.h" -#include "numpy/ndarraytypes.h" -#include "headers/stdint.h" -#include "limits.h" - -/* - * declarations from period here - */ - -#define GREGORIAN_CALENDAR 0 -#define JULIAN_CALENDAR 1 - -#define SECONDS_PER_DAY ((double) 86400.0) - -#define Py_AssertWithArg(x,errortype,errorstr,a1) {if (!(x)) {PyErr_Format(errortype,errorstr,a1);goto onError;}} -#define Py_Error(errortype,errorstr) {PyErr_SetString(errortype,errorstr);goto onError;} - -/*** FREQUENCY CONSTANTS ***/ - -// HIGHFREQ_ORIG is the datetime ordinal from which to begin the second -// frequency ordinal sequence - -// typedef int64_t npy_int64; -// begins second ordinal at 1/1/1970 unix epoch - -// #define HIGHFREQ_ORIG 62135683200LL -#define BASE_YEAR 1970 -#define ORD_OFFSET 719163LL // days until 1970-01-01 -#define BDAY_OFFSET 513689LL // days until 1970-01-01 -#define WEEK_OFFSET 102737LL -#define BASE_WEEK_TO_DAY_OFFSET 1 // difference between day 0 and end of week in days -#define DAYS_PER_WEEK 7 -#define BUSINESS_DAYS_PER_WEEK 5 -#define HIGHFREQ_ORIG 0 // ORD_OFFSET * 86400LL // days until 1970-01-01 - -#define FR_ANN 1000 /* Annual */ -#define FR_ANNDEC FR_ANN /* Annual - December year end*/ -#define FR_ANNJAN 1001 /* Annual - January year end*/ -#define FR_ANNFEB 1002 /* Annual - February year end*/ -#define FR_ANNMAR 1003 /* Annual - March year end*/ -#define FR_ANNAPR 1004 /* Annual - April year end*/ -#define FR_ANNMAY 1005 /* Annual - May year end*/ -#define FR_ANNJUN 1006 /* Annual - June year end*/ -#define FR_ANNJUL 1007 /* Annual - July year end*/ -#define FR_ANNAUG 1008 /* Annual - August year end*/ -#define FR_ANNSEP 1009 /* Annual - September year end*/ -#define FR_ANNOCT 1010 /* Annual - October year end*/ -#define FR_ANNNOV 1011 /* Annual - November year end*/ - -/* The standard quarterly frequencies with various fiscal year ends - eg, Q42005 for Q@OCT runs Aug 1, 2005 to Oct 31, 2005 */ -#define FR_QTR 2000 /* Quarterly - December year end (default quarterly) */ -#define FR_QTRDEC FR_QTR /* Quarterly - December year end */ -#define FR_QTRJAN 2001 /* Quarterly - January year end */ -#define FR_QTRFEB 2002 /* Quarterly - February year end */ -#define FR_QTRMAR 2003 /* Quarterly - March year end */ -#define FR_QTRAPR 2004 /* Quarterly - April year end */ -#define FR_QTRMAY 2005 /* Quarterly - May year end */ -#define FR_QTRJUN 2006 /* Quarterly - June year end */ -#define FR_QTRJUL 2007 /* Quarterly - July year end */ -#define FR_QTRAUG 2008 /* Quarterly - August year end */ -#define FR_QTRSEP 2009 /* Quarterly - September year end */ -#define FR_QTROCT 2010 /* Quarterly - October year end */ -#define FR_QTRNOV 2011 /* Quarterly - November year end */ - -#define FR_MTH 3000 /* Monthly */ - -#define FR_WK 4000 /* Weekly */ -#define FR_WKSUN FR_WK /* Weekly - Sunday end of week */ -#define FR_WKMON 4001 /* Weekly - Monday end of week */ -#define FR_WKTUE 4002 /* Weekly - Tuesday end of week */ -#define FR_WKWED 4003 /* Weekly - Wednesday end of week */ -#define FR_WKTHU 4004 /* Weekly - Thursday end of week */ -#define FR_WKFRI 4005 /* Weekly - Friday end of week */ -#define FR_WKSAT 4006 /* Weekly - Saturday end of week */ - -#define FR_BUS 5000 /* Business days */ -#define FR_DAY 6000 /* Daily */ -#define FR_HR 7000 /* Hourly */ -#define FR_MIN 8000 /* Minutely */ -#define FR_SEC 9000 /* Secondly */ -#define FR_MS 10000 /* Millisecondly */ -#define FR_US 11000 /* Microsecondly */ -#define FR_NS 12000 /* Nanosecondly */ - -#define 
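
The FR_* numbering above is systematic: thousands select the frequency group, and the remainder selects the anchor (year-end month for annual/quarterly, end-of-week day for weekly). A small sketch of that factoring; get_freq_group is a hypothetical helper name, not a declaration from this header:

```c
#include <stdio.h>

#define FR_QTR    2000   /* quarterly group, December year end */
#define FR_QTROCT 2010   /* quarterly, October year end        */

/* Thousands encode the group; the remainder encodes the anchor. */
static int get_freq_group(int freq) { return (freq / 1000) * 1000; }

int main(void)
{
    printf("%d\n", get_freq_group(FR_QTROCT)); /* 2000: quarterly group */
    printf("%d\n", FR_QTROCT - FR_QTR);        /* 10: October year end  */
    return 0;
}
```
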
FR_UND -10000 /* Undefined */ - -#define INT_ERR_CODE INT32_MIN - -#define MEM_CHECK(item) if (item == NULL) { return PyErr_NoMemory(); } -#define ERR_CHECK(item) if (item == NULL) { return NULL; } - -typedef struct asfreq_info { - int from_week_end; // day the week ends on in the "from" frequency - int to_week_end; // day the week ends on in the "to" frequency - - int from_a_year_end; // month the year ends on in the "from" frequency - int to_a_year_end; // month the year ends on in the "to" frequency - - int from_q_year_end; // month the year ends on in the "from" frequency - int to_q_year_end; // month the year ends on in the "to" frequency - - npy_int64 intraday_conversion_factor; -} asfreq_info; - - -typedef struct date_info { - npy_int64 absdate; - double abstime; - - double second; - int minute; - int hour; - int day; - int month; - int quarter; - int year; - int day_of_week; - int day_of_year; - int calendar; -} date_info; - -typedef npy_int64 (*freq_conv_func)(npy_int64, char, asfreq_info*); - -/* - * new pandas API helper functions here - */ - -npy_int64 asfreq(npy_int64 period_ordinal, int freq1, int freq2, char relation); - -npy_int64 get_period_ordinal(int year, int month, int day, - int hour, int minute, int second, int microseconds, int picoseconds, - int freq); - -npy_int64 get_python_ordinal(npy_int64 period_ordinal, int freq); - -int get_date_info(npy_int64 ordinal, int freq, struct date_info *dinfo); -freq_conv_func get_asfreq_func(int fromFreq, int toFreq); -void get_asfreq_info(int fromFreq, int toFreq, asfreq_info *af_info); - -int pyear(npy_int64 ordinal, int freq); -int pqyear(npy_int64 ordinal, int freq); -int pquarter(npy_int64 ordinal, int freq); -int pmonth(npy_int64 ordinal, int freq); -int pday(npy_int64 ordinal, int freq); -int pweekday(npy_int64 ordinal, int freq); -int pday_of_week(npy_int64 ordinal, int freq); -int pday_of_year(npy_int64 ordinal, int freq); -int pweek(npy_int64 ordinal, int freq); -int phour(npy_int64 ordinal, int freq); -int pminute(npy_int64 ordinal, int freq); -int psecond(npy_int64 ordinal, int freq); -int pdays_in_month(npy_int64 ordinal, int freq); - -double getAbsTime(int freq, npy_int64 dailyDate, npy_int64 originalDate); -char *c_strftime(struct date_info *dinfo, char *fmt); -int get_yq(npy_int64 ordinal, int freq, int *quarter, int *year); - -void initialize_daytime_conversion_factor_matrix(void); -#endif diff --git a/pandas/src/skiplist.h b/pandas/src/skiplist.h deleted file mode 100644 index 3bf63aedce9cb..0000000000000 --- a/pandas/src/skiplist.h +++ /dev/null @@ -1,298 +0,0 @@ - -/* - Flexibly-sized, indexable skiplist data structure for maintaining a sorted - list of values - - Port of Wes McKinney's Cython version of Raymond Hettinger's original pure - Python recipe (http://rhettinger.wordpress.com/2010/02/06/lost-knowledge/) - */ - -// #include -// #include - - -#include -#include -#include -#include - -#ifndef PANDAS_INLINE - #if defined(__GNUC__) - #define PANDAS_INLINE static __inline__ - #elif defined(_MSC_VER) - #define PANDAS_INLINE static __inline - #elif defined (__STDC_VERSION__) && __STDC_VERSION__ >= 199901L - #define PANDAS_INLINE static inline - #else - #define PANDAS_INLINE - #endif -#endif - -PANDAS_INLINE float __skiplist_nanf(void) -{ - const union { int __i; float __f;} __bint = {0x7fc00000UL}; - return __bint.__f; -} -#define PANDAS_NAN ((double) __skiplist_nanf()) - - -PANDAS_INLINE double Log2(double val) { - return log(val) / log(2.); -} - -typedef struct node_t node_t; - -struct node_t { - node_t 
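
The last two helpers above sidestep portability gaps: NAN is not guaranteed on pre-C99 compilers, so a quiet NaN is built from its IEEE-754 bit pattern, and log2 is derived by change of base. A standalone re-derivation:

```c
#include <math.h>
#include <stdio.h>

/* Quiet NaN from its single-precision bit pattern 0x7fc00000. */
static float skiplist_nanf(void)
{
    const union { int i; float f; } bits = { 0x7fc00000 };
    return bits.f;
}

/* log2 via change of base, for toolchains lacking C99 log2(). */
static double Log2(double val) { return log(val) / log(2.); }

int main(void)
{
    double nan_d = (double) skiplist_nanf();
    printf("%d %g\n", nan_d != nan_d, Log2(1024.0)); /* 1 10 */
    return 0;
}
```
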
**next; - int *width; - double value; - int is_nil; - int levels; - int ref_count; -}; - -typedef struct { - node_t *head; - node_t **tmp_chain; - int *tmp_steps; - int size; - int maxlevels; -} skiplist_t; - -PANDAS_INLINE double urand(void) { - return ((double) rand() + 1) / ((double) RAND_MAX + 2); -} - -PANDAS_INLINE int int_min(int a, int b) { - return a < b ? a : b; -} - -PANDAS_INLINE node_t *node_init(double value, int levels) { - node_t *result; - result = (node_t*) malloc(sizeof(node_t)); - if (result) { - result->value = value; - result->levels = levels; - result->is_nil = 0; - result->ref_count = 0; - result->next = (node_t**) malloc(levels * sizeof(node_t*)); - result->width = (int*) malloc(levels * sizeof(int)); - if (!(result->next && result->width) && (levels != 0)) { - free(result->next); - free(result->width); - free(result); - return NULL; - } - } - return result; -} - -// do this ourselves -PANDAS_INLINE void node_incref(node_t *node) { - ++(node->ref_count); -} - -PANDAS_INLINE void node_decref(node_t *node) { - --(node->ref_count); -} - -static void node_destroy(node_t *node) { - int i; - if (node) { - if (node->ref_count <= 1) { - for (i = 0; i < node->levels; ++i) { - node_destroy(node->next[i]); - } - free(node->next); - free(node->width); - // printf("Reference count was 1, freeing\n"); - free(node); - } - else { - node_decref(node); - } - // pretty sure that freeing the struct above will be enough - } -} - -PANDAS_INLINE void skiplist_destroy(skiplist_t *skp) { - if (skp) { - node_destroy(skp->head); - free(skp->tmp_steps); - free(skp->tmp_chain); - free(skp); - } -} - -PANDAS_INLINE skiplist_t *skiplist_init(int expected_size) { - skiplist_t *result; - node_t *NIL, *head; - int maxlevels, i; - - maxlevels = 1 + Log2((double) expected_size); - result = (skiplist_t*) malloc(sizeof(skiplist_t)); - if (!result) { - return NULL; - } - result->tmp_chain = (node_t**) malloc(maxlevels * sizeof(node_t*)); - result->tmp_steps = (int*) malloc(maxlevels * sizeof(int)); - result->maxlevels = maxlevels; - result->size = 0; - - head = result->head = node_init(PANDAS_NAN, maxlevels); - NIL = node_init(0.0, 0); - - if (!(result->tmp_chain && result->tmp_steps && result->head && NIL)) { - skiplist_destroy(result); - node_destroy(NIL); - return NULL; - } - - node_incref(head); - - NIL->is_nil = 1; - - for (i = 0; i < maxlevels; ++i) - { - head->next[i] = NIL; - head->width[i] = 1; - node_incref(NIL); - } - - return result; -} - -// 1 if left < right, 0 if left == right, -1 if left > right -PANDAS_INLINE int _node_cmp(node_t* node, double value){ - if (node->is_nil || node->value > value) { - return -1; - } - else if (node->value < value) { - return 1; - } - else { - return 0; - } -} - -PANDAS_INLINE double skiplist_get(skiplist_t *skp, int i, int *ret) { - node_t *node; - int level; - - if (i < 0 || i >= skp->size) { - *ret = 0; - return 0; - } - - node = skp->head; - ++i; - for (level = skp->maxlevels - 1; level >= 0; --level) - { - while (node->width[level] <= i) - { - i -= node->width[level]; - node = node->next[level]; - } - } - - *ret = 1; - return node->value; -} - -PANDAS_INLINE int skiplist_insert(skiplist_t *skp, double value) { - node_t *node, *prevnode, *newnode, *next_at_level; - int *steps_at_level; - int size, steps, level; - node_t **chain; - - chain = skp->tmp_chain; - - steps_at_level = skp->tmp_steps; - memset(steps_at_level, 0, skp->maxlevels * sizeof(int)); - - node = skp->head; - - for (level = skp->maxlevels - 1; level >= 0; --level) - { - next_at_level = 
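
skiplist_insert above draws each node's height with 1 - (int)Log2(urand()): urand() is uniform on (0, 1), so the height is 1 with probability 1/2, 2 with probability 1/4, and so on, the classic geometric distribution of skiplist levels. A quick standalone check of that distribution:

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Uniform on (0, 1), excluding both endpoints, as in the header above. */
static double urand(void)
{
    return ((double) rand() + 1) / ((double) RAND_MAX + 2);
}

int main(void)
{
    int counts[8] = {0}, i;
    for (i = 0; i < 100000; ++i) {
        int level = 1 - (int) (log(urand()) / log(2.));
        if (level > 8) level = 8;            /* cap, as maxlevels does */
        counts[level - 1]++;
    }
    for (i = 0; i < 8; ++i)                  /* roughly 50000 25000 12500 ... */
        printf("level %d: %d\n", i + 1, counts[i]);
    return 0;
}
```
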
node->next[level]; - while (_node_cmp(next_at_level, value) >= 0) { - steps_at_level[level] += node->width[level]; - node = next_at_level; - next_at_level = node->next[level]; - } - chain[level] = node; - } - - size = int_min(skp->maxlevels, 1 - ((int) Log2(urand()))); - - newnode = node_init(value, size); - if (!newnode) { - return -1; - } - steps = 0; - - for (level = 0; level < size; ++level) { - prevnode = chain[level]; - newnode->next[level] = prevnode->next[level]; - - prevnode->next[level] = newnode; - node_incref(newnode); // increment the reference count - - newnode->width[level] = prevnode->width[level] - steps; - prevnode->width[level] = steps + 1; - - steps += steps_at_level[level]; - } - - for (level = size; level < skp->maxlevels; ++level) { - chain[level]->width[level] += 1; - } - - ++(skp->size); - - return 1; -} - -PANDAS_INLINE int skiplist_remove(skiplist_t *skp, double value) { - int level, size; - node_t *node, *prevnode, *tmpnode, *next_at_level; - node_t **chain; - - chain = skp->tmp_chain; - node = skp->head; - - for (level = skp->maxlevels - 1; level >= 0; --level) - { - next_at_level = node->next[level]; - while (_node_cmp(next_at_level, value) > 0) { - node = next_at_level; - next_at_level = node->next[level]; - } - chain[level] = node; - } - - if (value != chain[0]->next[0]->value) { - return 0; - } - - size = chain[0]->next[0]->levels; - - for (level = 0; level < size; ++level) { - prevnode = chain[level]; - - tmpnode = prevnode->next[level]; - - prevnode->width[level] += tmpnode->width[level] - 1; - prevnode->next[level] = tmpnode->next[level]; - - tmpnode->next[level] = NULL; - node_destroy(tmpnode); // decrement refcount or free - } - - for (level = size; level < skp->maxlevels; ++level) { - --(chain[level]->width[level]); - } - - --(skp->size); - return 1; -} diff --git a/pandas/src/ujson/lib/ultrajsondec.c b/pandas/src/ujson/lib/ultrajsondec.c deleted file mode 100644 index 5496068832f2e..0000000000000 --- a/pandas/src/ujson/lib/ultrajsondec.c +++ /dev/null @@ -1,923 +0,0 @@ -/* -Copyright (c) 2011-2013, ESN Social Software AB and Jonas Tarnstrom -All rights reserved. - -Redistribution and use in source and binary forms, with or without -modification, are permitted provided that the following conditions are met: -* Redistributions of source code must retain the above copyright -notice, this list of conditions and the following disclaimer. -* Redistributions in binary form must reproduce the above copyright -notice, this list of conditions and the following disclaimer in the -documentation and/or other materials provided with the distribution. -* Neither the name of the ESN Social Software AB nor the -names of its contributors may be used to endorse or promote products -derived from this software without specific prior written permission. - -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND -ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED -WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE -DISCLAIMED. 
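
A usage sketch of the structure above, assuming the deleted header is saved as skiplist.h somewhere on the include path; this is the rolling-median access pattern it exists to serve in pandas:

```c
#include <stdio.h>
#include "skiplist.h"   /* assumption: the deleted header, kept locally */

int main(void)
{
    int ok;
    skiplist_t *skp = skiplist_init(8);

    skiplist_insert(skp, 3.0);
    skiplist_insert(skp, 1.0);
    skiplist_insert(skp, 2.0);
    printf("%g\n", skiplist_get(skp, 1, &ok));  /* 2: median of {1,2,3} */

    skiplist_remove(skp, 3.0);                  /* value leaves the window */
    printf("%g\n", skiplist_get(skp, 1, &ok));  /* 2: largest of {1,2}    */

    skiplist_destroy(skp);
    return 0;
}
```
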
IN NO EVENT SHALL ESN SOCIAL SOFTWARE AB OR JONAS TARNSTROM BE LIABLE -FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES -(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; -LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND -ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS -SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - - -Portions of code from MODP_ASCII - Ascii transformations (upper/lower, etc) -https://github.com/client9/stringencoders -Copyright (c) 2007 Nick Galbreath -- nickg [at] modp [dot] com. All rights reserved. - -Numeric decoder derived from from TCL library -http://www.opensource.apple.com/source/tcl/tcl-14/tcl/license.terms -* Copyright (c) 1988-1993 The Regents of the University of California. -* Copyright (c) 1994 Sun Microsystems, Inc. -*/ - -#include "ultrajson.h" -#include -#include -#include -#include -#include -#include -#include -#include - -#ifndef TRUE -#define TRUE 1 -#define FALSE 0 -#endif -#ifndef NULL -#define NULL 0 -#endif - -struct DecoderState -{ - char *start; - char *end; - wchar_t *escStart; - wchar_t *escEnd; - int escHeap; - int lastType; - JSUINT32 objDepth; - void *prv; - JSONObjectDecoder *dec; -}; - -JSOBJ FASTCALL_MSVC decode_any( struct DecoderState *ds) FASTCALL_ATTR; -typedef JSOBJ (*PFN_DECODER)( struct DecoderState *ds); - -static JSOBJ SetError( struct DecoderState *ds, int offset, const char *message) -{ - ds->dec->errorOffset = ds->start + offset; - ds->dec->errorStr = (char *) message; - return NULL; -} - -double createDouble(double intNeg, double intValue, double frcValue, int frcDecimalCount) -{ - static const double g_pow10[] = {1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001,0.0000001, 0.00000001, 0.000000001, 0.0000000001, 0.00000000001, 0.000000000001, 0.0000000000001, 0.00000000000001, 0.000000000000001}; - return (intValue + (frcValue * g_pow10[frcDecimalCount])) * intNeg; -} - -FASTCALL_ATTR JSOBJ FASTCALL_MSVC decodePreciseFloat(struct DecoderState *ds) -{ - char *end; - double value; - errno = 0; - - value = strtod(ds->start, &end); - - if (errno == ERANGE) - { - return SetError(ds, -1, "Range error when decoding numeric as double"); - } - - ds->start = end; - return ds->dec->newDouble(ds->prv, value); -} - -FASTCALL_ATTR JSOBJ FASTCALL_MSVC decode_numeric (struct DecoderState *ds) -{ - int intNeg = 1; - int mantSize = 0; - JSUINT64 intValue; - int chr; - int decimalCount = 0; - double frcValue = 0.0; - double expNeg; - double expValue; - char *offset = ds->start; - - JSUINT64 overflowLimit = LLONG_MAX; - - if (*(offset) == '-') - { - offset ++; - intNeg = -1; - overflowLimit = LLONG_MIN; - } - - // Scan integer part - intValue = 0; - - while (1) - { - chr = (int) (unsigned char) *(offset); - - switch (chr) - { - case '0': - case '1': - case '2': - case '3': - case '4': - case '5': - case '6': - case '7': - case '8': - case '9': - { - //FIXME: Check for arithemtic overflow here - //PERF: Don't do 64-bit arithmetic here unless we know we have to - intValue = intValue * 10ULL + (JSLONG) (chr - 48); - - if (intValue > overflowLimit) - { - return SetError(ds, -1, overflowLimit == LLONG_MAX ? 
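
createDouble above assembles the final value from pieces scanned as plain integers, applying the decimal shift with one table lookup instead of per-digit division. A standalone sketch; create_double and the trimmed g_pow10 are local stand-ins, not the deleted definitions:

```c
#include <stdio.h>

static const double g_pow10[] = {1.0, 0.1, 0.01, 0.001, 0.0001};

/* Combine sign, integer part, fraction digits, and decimal count. */
static double create_double(double sign, double intpart, double fracpart,
                            int frac_digits)
{
    return (intpart + fracpart * g_pow10[frac_digits]) * sign;
}

int main(void)
{
    /* "123.45" scans as intpart=123, fracpart=45, frac_digits=2 */
    printf("%g\n", create_double(1.0, 123.0, 45.0, 2));  /* 123.45 */
    return 0;
}
```
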
"Value is too big" : "Value is too small"); - } - - offset ++; - mantSize ++; - break; - } - case '.': - { - offset ++; - goto DECODE_FRACTION; - break; - } - case 'e': - case 'E': - { - offset ++; - goto DECODE_EXPONENT; - break; - } - - default: - { - goto BREAK_INT_LOOP; - break; - } - } - } - -BREAK_INT_LOOP: - - ds->lastType = JT_INT; - ds->start = offset; - - if ((intValue >> 31)) - { - return ds->dec->newLong(ds->prv, (JSINT64) (intValue * (JSINT64) intNeg)); - } - else - { - return ds->dec->newInt(ds->prv, (JSINT32) (intValue * intNeg)); - } - -DECODE_FRACTION: - - if (ds->dec->preciseFloat) - { - return decodePreciseFloat(ds); - } - - // Scan fraction part - frcValue = 0.0; - for (;;) - { - chr = (int) (unsigned char) *(offset); - - switch (chr) - { - case '0': - case '1': - case '2': - case '3': - case '4': - case '5': - case '6': - case '7': - case '8': - case '9': - { - if (decimalCount < JSON_DOUBLE_MAX_DECIMALS) - { - frcValue = frcValue * 10.0 + (double) (chr - 48); - decimalCount ++; - } - offset ++; - break; - } - case 'e': - case 'E': - { - offset ++; - goto DECODE_EXPONENT; - break; - } - default: - { - goto BREAK_FRC_LOOP; - } - } - } - -BREAK_FRC_LOOP: - //FIXME: Check for arithemtic overflow here - ds->lastType = JT_DOUBLE; - ds->start = offset; - return ds->dec->newDouble (ds->prv, createDouble( (double) intNeg, (double) intValue, frcValue, decimalCount)); - -DECODE_EXPONENT: - if (ds->dec->preciseFloat) - { - return decodePreciseFloat(ds); - } - - expNeg = 1.0; - - if (*(offset) == '-') - { - expNeg = -1.0; - offset ++; - } - else - if (*(offset) == '+') - { - expNeg = +1.0; - offset ++; - } - - expValue = 0.0; - - for (;;) - { - chr = (int) (unsigned char) *(offset); - - switch (chr) - { - case '0': - case '1': - case '2': - case '3': - case '4': - case '5': - case '6': - case '7': - case '8': - case '9': - { - expValue = expValue * 10.0 + (double) (chr - 48); - offset ++; - break; - } - default: - { - goto BREAK_EXP_LOOP; - } - } - } - -BREAK_EXP_LOOP: - //FIXME: Check for arithemtic overflow here - ds->lastType = JT_DOUBLE; - ds->start = offset; - return ds->dec->newDouble (ds->prv, createDouble( (double) intNeg, (double) intValue , frcValue, decimalCount) * pow(10.0, expValue * expNeg)); -} - -FASTCALL_ATTR JSOBJ FASTCALL_MSVC decode_true ( struct DecoderState *ds) -{ - char *offset = ds->start; - offset ++; - - if (*(offset++) != 'r') - goto SETERROR; - if (*(offset++) != 'u') - goto SETERROR; - if (*(offset++) != 'e') - goto SETERROR; - - ds->lastType = JT_TRUE; - ds->start = offset; - return ds->dec->newTrue(ds->prv); - -SETERROR: - return SetError(ds, -1, "Unexpected character found when decoding 'true'"); -} - -FASTCALL_ATTR JSOBJ FASTCALL_MSVC decode_false ( struct DecoderState *ds) -{ - char *offset = ds->start; - offset ++; - - if (*(offset++) != 'a') - goto SETERROR; - if (*(offset++) != 'l') - goto SETERROR; - if (*(offset++) != 's') - goto SETERROR; - if (*(offset++) != 'e') - goto SETERROR; - - ds->lastType = JT_FALSE; - ds->start = offset; - return ds->dec->newFalse(ds->prv); - -SETERROR: - return SetError(ds, -1, "Unexpected character found when decoding 'false'"); -} - -FASTCALL_ATTR JSOBJ FASTCALL_MSVC decode_null ( struct DecoderState *ds) -{ - char *offset = ds->start; - offset ++; - - if (*(offset++) != 'u') - goto SETERROR; - if (*(offset++) != 'l') - goto SETERROR; - if (*(offset++) != 'l') - goto SETERROR; - - ds->lastType = JT_NULL; - ds->start = offset; - return ds->dec->newNull(ds->prv); - -SETERROR: - return SetError(ds, -1, "Unexpected 
character found when decoding 'null'"); -} - -FASTCALL_ATTR void FASTCALL_MSVC SkipWhitespace(struct DecoderState *ds) -{ - char *offset; - - for (offset = ds->start; (ds->end - offset) > 0; offset ++) - { - switch (*offset) - { - case ' ': - case '\t': - case '\r': - case '\n': - break; - - default: - ds->start = offset; - return; - } - } - - if (offset == ds->end) - { - ds->start = ds->end; - } -} - -enum DECODESTRINGSTATE -{ - DS_ISNULL = 0x32, - DS_ISQUOTE, - DS_ISESCAPE, - DS_UTFLENERROR, - -}; - -static const JSUINT8 g_decoderLookup[256] = -{ - /* 0x00 */ DS_ISNULL, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, - /* 0x10 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, - /* 0x20 */ 1, 1, DS_ISQUOTE, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, - /* 0x30 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, - /* 0x40 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, - /* 0x50 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, DS_ISESCAPE, 1, 1, 1, - /* 0x60 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, - /* 0x70 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, - /* 0x80 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, - /* 0x90 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, - /* 0xa0 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, - /* 0xb0 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, - /* 0xc0 */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, - /* 0xd0 */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, - /* 0xe0 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, - /* 0xf0 */ 4, 4, 4, 4, 4, 4, 4, 4, DS_UTFLENERROR, DS_UTFLENERROR, DS_UTFLENERROR, DS_UTFLENERROR, DS_UTFLENERROR, DS_UTFLENERROR, DS_UTFLENERROR, DS_UTFLENERROR, -}; - -FASTCALL_ATTR JSOBJ FASTCALL_MSVC decode_string ( struct DecoderState *ds) -{ - JSUTF16 sur[2] = { 0 }; - int iSur = 0; - int index; - wchar_t *escOffset; - wchar_t *escStart; - size_t escLen = (ds->escEnd - ds->escStart); - JSUINT8 *inputOffset; - JSUINT8 oct; - JSUTF32 ucs; - ds->lastType = JT_INVALID; - ds->start ++; - - if ( (size_t) (ds->end - ds->start) > escLen) - { - size_t newSize = (ds->end - ds->start); - - if (ds->escHeap) - { - if (newSize > (SIZE_MAX / sizeof(wchar_t))) - { - return SetError(ds, -1, "Could not reserve memory block"); - } - escStart = (wchar_t *)ds->dec->realloc(ds->escStart, newSize * sizeof(wchar_t)); - if (!escStart) - { - ds->dec->free(ds->escStart); - return SetError(ds, -1, "Could not reserve memory block"); - } - ds->escStart = escStart; - } - else - { - wchar_t *oldStart = ds->escStart; - if (newSize > (SIZE_MAX / sizeof(wchar_t))) - { - return SetError(ds, -1, "Could not reserve memory block"); - } - ds->escStart = (wchar_t *) ds->dec->malloc(newSize * sizeof(wchar_t)); - if (!ds->escStart) - { - return SetError(ds, -1, "Could not reserve memory block"); - } - ds->escHeap = 1; - memcpy(ds->escStart, oldStart, escLen * sizeof(wchar_t)); - } - - ds->escEnd = ds->escStart + newSize; - } - - escOffset = ds->escStart; - inputOffset = (JSUINT8 *) ds->start; - - for (;;) - { - switch (g_decoderLookup[(JSUINT8)(*inputOffset)]) - { - case DS_ISNULL: - { - return SetError(ds, -1, "Unmatched ''\"' when when decoding 'string'"); - } - case DS_ISQUOTE: - { - ds->lastType = JT_UTF8; - inputOffset ++; - ds->start += ( (char *) inputOffset - (ds->start)); - return ds->dec->newString(ds->prv, ds->escStart, escOffset); - } - case DS_UTFLENERROR: - { - return SetError (ds, -1, "Invalid UTF-8 sequence length when decoding 'string'"); - } - case DS_ISESCAPE: - inputOffset ++; - switch (*inputOffset) - { - case '\\': *(escOffset++) = 
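
decode_string above dispatches on g_decoderLookup, a 256-entry byte classifier, instead of chained comparisons: ordinary bytes copy through, special bytes carry an action code, and UTF-8 lead bytes carry their sequence length. A compressed standalone sketch with made-up action codes and only a few classes filled in:

```c
#include <stdio.h>

enum { COPY = 1, QUOTE = 90 };   /* hypothetical action codes */

int main(void)
{
    unsigned char lookup[256];
    int i;
    for (i = 0; i < 256; ++i) lookup[i] = COPY;
    lookup['"'] = QUOTE;
    for (i = 0xc0; i <= 0xdf; ++i) lookup[i] = 2;  /* 2-byte UTF-8 lead */
    for (i = 0xe0; i <= 0xef; ++i) lookup[i] = 3;  /* 3-byte UTF-8 lead */

    const unsigned char *s = (const unsigned char *) "a\xc3\xa9\"";
    while (lookup[*s] != QUOTE) {
        printf("class %d\n", lookup[*s]);          /* class 1, then class 2 */
        s += (lookup[*s] >= 2) ? lookup[*s] : 1;   /* skip whole sequences  */
    }
    return 0;
}
```
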
L'\\'; inputOffset++; continue; - case '\"': *(escOffset++) = L'\"'; inputOffset++; continue; - case '/': *(escOffset++) = L'/'; inputOffset++; continue; - case 'b': *(escOffset++) = L'\b'; inputOffset++; continue; - case 'f': *(escOffset++) = L'\f'; inputOffset++; continue; - case 'n': *(escOffset++) = L'\n'; inputOffset++; continue; - case 'r': *(escOffset++) = L'\r'; inputOffset++; continue; - case 't': *(escOffset++) = L'\t'; inputOffset++; continue; - - case 'u': - { - int index; - inputOffset ++; - - for (index = 0; index < 4; index ++) - { - switch (*inputOffset) - { - case '\0': return SetError (ds, -1, "Unterminated unicode escape sequence when decoding 'string'"); - default: return SetError (ds, -1, "Unexpected character in unicode escape sequence when decoding 'string'"); - - case '0': - case '1': - case '2': - case '3': - case '4': - case '5': - case '6': - case '7': - case '8': - case '9': - sur[iSur] = (sur[iSur] << 4) + (JSUTF16) (*inputOffset - '0'); - break; - - case 'a': - case 'b': - case 'c': - case 'd': - case 'e': - case 'f': - sur[iSur] = (sur[iSur] << 4) + 10 + (JSUTF16) (*inputOffset - 'a'); - break; - - case 'A': - case 'B': - case 'C': - case 'D': - case 'E': - case 'F': - sur[iSur] = (sur[iSur] << 4) + 10 + (JSUTF16) (*inputOffset - 'A'); - break; - } - - inputOffset ++; - } - - if (iSur == 0) - { - if((sur[iSur] & 0xfc00) == 0xd800) - { - // First of a surrogate pair, continue parsing - iSur ++; - break; - } - (*escOffset++) = (wchar_t) sur[iSur]; - iSur = 0; - } - else - { - // Decode pair - if ((sur[1] & 0xfc00) != 0xdc00) - { - return SetError (ds, -1, "Unpaired high surrogate when decoding 'string'"); - } -#if WCHAR_MAX == 0xffff - (*escOffset++) = (wchar_t) sur[0]; - (*escOffset++) = (wchar_t) sur[1]; -#else - (*escOffset++) = (wchar_t) 0x10000 + (((sur[0] - 0xd800) << 10) | (sur[1] - 0xdc00)); -#endif - iSur = 0; - } - break; - } - - case '\0': return SetError(ds, -1, "Unterminated escape sequence when decoding 'string'"); - default: return SetError(ds, -1, "Unrecognized escape sequence when decoding 'string'"); - } - break; - - case 1: - { - *(escOffset++) = (wchar_t) (*inputOffset++); - break; - } - - case 2: - { - ucs = (*inputOffset++) & 0x1f; - ucs <<= 6; - if (((*inputOffset) & 0x80) != 0x80) - { - return SetError(ds, -1, "Invalid octet in UTF-8 sequence when decoding 'string'"); - } - ucs |= (*inputOffset++) & 0x3f; - if (ucs < 0x80) return SetError (ds, -1, "Overlong 2 byte UTF-8 sequence detected when decoding 'string'"); - *(escOffset++) = (wchar_t) ucs; - break; - } - - case 3: - { - JSUTF32 ucs = 0; - ucs |= (*inputOffset++) & 0x0f; - - for (index = 0; index < 2; index ++) - { - ucs <<= 6; - oct = (*inputOffset++); - - if ((oct & 0x80) != 0x80) - { - return SetError(ds, -1, "Invalid octet in UTF-8 sequence when decoding 'string'"); - } - - ucs |= oct & 0x3f; - } - - if (ucs < 0x800) return SetError (ds, -1, "Overlong 3 byte UTF-8 sequence detected when encoding string"); - *(escOffset++) = (wchar_t) ucs; - break; - } - - case 4: - { - JSUTF32 ucs = 0; - ucs |= (*inputOffset++) & 0x07; - - for (index = 0; index < 3; index ++) - { - ucs <<= 6; - oct = (*inputOffset++); - - if ((oct & 0x80) != 0x80) - { - return SetError(ds, -1, "Invalid octet in UTF-8 sequence when decoding 'string'"); - } - - ucs |= oct & 0x3f; - } - - if (ucs < 0x10000) return SetError (ds, -1, "Overlong 4 byte UTF-8 sequence detected when decoding 'string'"); - -#if WCHAR_MAX == 0xffff - if (ucs >= 0x10000) - { - ucs -= 0x10000; - *(escOffset++) = (wchar_t) (ucs >> 10) + 
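
The \u branch above stitches two escaped UTF-16 code units back into one code point with 0x10000 + ((hi - 0xD800) << 10 | (lo - 0xDC00)). The arithmetic in isolation:

```c
#include <stdio.h>

int main(void)
{
    unsigned hi = 0xD83D, lo = 0xDE00;   /* the escapes "\uD83D\uDE00" */

    /* high surrogates sit in D800-DBFF, low surrogates in DC00-DFFF */
    if ((hi & 0xFC00) == 0xD800 && (lo & 0xFC00) == 0xDC00) {
        unsigned cp = 0x10000 + (((hi - 0xD800) << 10) | (lo - 0xDC00));
        printf("U+%04X\n", cp);          /* U+1F600 */
    }
    return 0;
}
```
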
0xd800; - *(escOffset++) = (wchar_t) (ucs & 0x3ff) + 0xdc00; - } - else - { - *(escOffset++) = (wchar_t) ucs; - } -#else - *(escOffset++) = (wchar_t) ucs; -#endif - break; - } - } - } -} - -FASTCALL_ATTR JSOBJ FASTCALL_MSVC decode_array(struct DecoderState *ds) -{ - JSOBJ itemValue; - JSOBJ newObj; - int len; - ds->objDepth++; - if (ds->objDepth > JSON_MAX_OBJECT_DEPTH) { - return SetError(ds, -1, "Reached object decoding depth limit"); - } - - newObj = ds->dec->newArray(ds->prv, ds->dec); - len = 0; - - ds->lastType = JT_INVALID; - ds->start ++; - - for (;;) - { - SkipWhitespace(ds); - - if ((*ds->start) == ']') - { - ds->objDepth--; - if (len == 0) - { - ds->start ++; - return ds->dec->endArray(ds->prv, newObj); - } - - ds->dec->releaseObject(ds->prv, newObj, ds->dec); - return SetError(ds, -1, "Unexpected character found when decoding array value (1)"); - } - - itemValue = decode_any(ds); - - if (itemValue == NULL) - { - ds->dec->releaseObject(ds->prv, newObj, ds->dec); - return NULL; - } - - if (!ds->dec->arrayAddItem (ds->prv, newObj, itemValue)) - { - ds->dec->releaseObject(ds->prv, newObj, ds->dec); - return NULL; - } - - SkipWhitespace(ds); - - switch (*(ds->start++)) - { - case ']': - { - ds->objDepth--; - return ds->dec->endArray(ds->prv, newObj); - } - case ',': - break; - - default: - ds->dec->releaseObject(ds->prv, newObj, ds->dec); - return SetError(ds, -1, "Unexpected character found when decoding array value (2)"); - } - - len ++; - } -} - -FASTCALL_ATTR JSOBJ FASTCALL_MSVC decode_object( struct DecoderState *ds) -{ - JSOBJ itemName; - JSOBJ itemValue; - JSOBJ newObj; - - ds->objDepth++; - if (ds->objDepth > JSON_MAX_OBJECT_DEPTH) { - return SetError(ds, -1, "Reached object decoding depth limit"); - } - - newObj = ds->dec->newObject(ds->prv, ds->dec); - - ds->start ++; - - for (;;) - { - SkipWhitespace(ds); - - if ((*ds->start) == '}') - { - ds->objDepth--; - ds->start ++; - return ds->dec->endObject(ds->prv, newObj); - } - - ds->lastType = JT_INVALID; - itemName = decode_any(ds); - - if (itemName == NULL) - { - ds->dec->releaseObject(ds->prv, newObj, ds->dec); - return NULL; - } - - if (ds->lastType != JT_UTF8) - { - ds->dec->releaseObject(ds->prv, newObj, ds->dec); - ds->dec->releaseObject(ds->prv, itemName, ds->dec); - return SetError(ds, -1, "Key name of object must be 'string' when decoding 'object'"); - } - - SkipWhitespace(ds); - - if (*(ds->start++) != ':') - { - ds->dec->releaseObject(ds->prv, newObj, ds->dec); - ds->dec->releaseObject(ds->prv, itemName, ds->dec); - return SetError(ds, -1, "No ':' found when decoding object value"); - } - - SkipWhitespace(ds); - - itemValue = decode_any(ds); - - if (itemValue == NULL) - { - ds->dec->releaseObject(ds->prv, newObj, ds->dec); - ds->dec->releaseObject(ds->prv, itemName, ds->dec); - return NULL; - } - - if (!ds->dec->objectAddKey (ds->prv, newObj, itemName, itemValue)) - { - ds->dec->releaseObject(ds->prv, newObj, ds->dec); - ds->dec->releaseObject(ds->prv, itemName, ds->dec); - ds->dec->releaseObject(ds->prv, itemValue, ds->dec); - return NULL; - } - - SkipWhitespace(ds); - - switch (*(ds->start++)) - { - case '}': - { - ds->objDepth--; - return ds->dec->endObject(ds->prv, newObj); - } - case ',': - break; - - default: - ds->dec->releaseObject(ds->prv, newObj, ds->dec); - return SetError(ds, -1, "Unexpected character found when decoding object value"); - } - } -} - -FASTCALL_ATTR JSOBJ FASTCALL_MSVC decode_any(struct DecoderState *ds) -{ - for (;;) - { - switch (*ds->start) - { - case '\"': - return decode_string (ds); 
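
decode_array and decode_object above follow one ownership rule: every early return first releases the partially built container, so the caller never sees a half-owned object. A plain-C analogue with malloc/free standing in for the decoder callbacks; fill and build_array are hypothetical names:

```c
#include <stdlib.h>

/* Stand-in item producer that always fails, to exercise the cleanup. */
static int fill(double *slot) { (void) slot; return 0; }

static double *build_array(int n)
{
    double *arr = malloc(n * sizeof *arr);
    int i;
    if (!arr) return NULL;
    for (i = 0; i < n; ++i) {
        if (!fill(&arr[i])) {   /* on any item failure ...           */
            free(arr);          /* ... release the container first   */
            return NULL;        /* ... then propagate the error      */
        }
    }
    return arr;
}

int main(void) { return build_array(4) == NULL ? 0 : 1; }
```
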
- case '0': - case '1': - case '2': - case '3': - case '4': - case '5': - case '6': - case '7': - case '8': - case '9': - case '-': - return decode_numeric (ds); - - case '[': return decode_array (ds); - case '{': return decode_object (ds); - case 't': return decode_true (ds); - case 'f': return decode_false (ds); - case 'n': return decode_null (ds); - - case ' ': - case '\t': - case '\r': - case '\n': - // White space - ds->start ++; - break; - - default: - return SetError(ds, -1, "Expected object or value"); - } - } -} - -JSOBJ JSON_DecodeObject(JSONObjectDecoder *dec, const char *buffer, size_t cbBuffer) -{ - /* - FIXME: Base the size of escBuffer of that of cbBuffer so that the unicode escaping doesn't run into the wall each time */ - char *locale; - struct DecoderState ds; - wchar_t escBuffer[(JSON_MAX_STACK_BUFFER_SIZE / sizeof(wchar_t))]; - JSOBJ ret; - - ds.start = (char *) buffer; - ds.end = ds.start + cbBuffer; - - ds.escStart = escBuffer; - ds.escEnd = ds.escStart + (JSON_MAX_STACK_BUFFER_SIZE / sizeof(wchar_t)); - ds.escHeap = 0; - ds.prv = dec->prv; - ds.dec = dec; - ds.dec->errorStr = NULL; - ds.dec->errorOffset = NULL; - ds.objDepth = 0; - - ds.dec = dec; - - locale = setlocale(LC_NUMERIC, NULL); - if (strcmp(locale, "C")) - { - locale = strdup(locale); - if (!locale) - { - return SetError(&ds, -1, "Could not reserve memory block"); - } - setlocale(LC_NUMERIC, "C"); - ret = decode_any (&ds); - setlocale(LC_NUMERIC, locale); - free(locale); - } - else - { - ret = decode_any (&ds); - } - - if (ds.escHeap) - { - dec->free(ds.escStart); - } - - SkipWhitespace(&ds); - - if (ds.start != ds.end && ret) - { - dec->releaseObject(ds.prv, ret, ds.dec); - return SetError(&ds, -1, "Trailing data"); - } - - return ret; -} diff --git a/pandas/src/ujson/lib/ultrajsonenc.c b/pandas/src/ujson/lib/ultrajsonenc.c deleted file mode 100644 index 2adf3cb707bdb..0000000000000 --- a/pandas/src/ujson/lib/ultrajsonenc.c +++ /dev/null @@ -1,947 +0,0 @@ -/* -Copyright (c) 2011-2013, ESN Social Software AB and Jonas Tarnstrom -All rights reserved. - -Redistribution and use in source and binary forms, with or without -modification, are permitted provided that the following conditions are met: - * Redistributions of source code must retain the above copyright - notice, this list of conditions and the following disclaimer. - * Redistributions in binary form must reproduce the above copyright - notice, this list of conditions and the following disclaimer in the - documentation and/or other materials provided with the distribution. - * Neither the name of the ESN Social Software AB nor the - names of its contributors may be used to endorse or promote products - derived from this software without specific prior written permission. - -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND -ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED -WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE -DISCLAIMED. IN NO EVENT SHALL ESN SOCIAL SOFTWARE AB OR JONAS TARNSTROM BE LIABLE -FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES -(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; -LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND -ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS -SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
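
JSON_DecodeObject above pins LC_NUMERIC to "C" because strtod is locale-sensitive: under a German locale, for instance, the decimal point is a comma. The same save/switch/restore dance, standalone:

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *saved = setlocale(LC_NUMERIC, NULL);
    if (strcmp(saved, "C") != 0) {
        saved = strdup(saved);   /* setlocale may reuse its own buffer */
        setlocale(LC_NUMERIC, "C");
    } else {
        saved = NULL;            /* already "C": nothing to restore    */
    }

    printf("%g\n", strtod("1.5", NULL));  /* 1.5 regardless of user locale */

    if (saved) {
        setlocale(LC_NUMERIC, saved);     /* put back what the caller had */
        free(saved);
    }
    return 0;
}
```
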
- - -Portions of code from MODP_ASCII - Ascii transformations (upper/lower, etc) -https://github.com/client9/stringencoders -Copyright (c) 2007 Nick Galbreath -- nickg [at] modp [dot] com. All rights reserved. - -Numeric decoder derived from from TCL library -http://www.opensource.apple.com/source/tcl/tcl-14/tcl/license.terms - * Copyright (c) 1988-1993 The Regents of the University of California. - * Copyright (c) 1994 Sun Microsystems, Inc. -*/ - -#include "ultrajson.h" -#include -#include -#include -#include -#include -#include - -#include - -#ifndef TRUE -#define TRUE 1 -#endif -#ifndef FALSE -#define FALSE 0 -#endif - -/* -Worst cases being: - -Control characters (ASCII < 32) -0x00 (1 byte) input => \u0000 output (6 bytes) -1 * 6 => 6 (6 bytes required) - -or UTF-16 surrogate pairs -4 bytes input in UTF-8 => \uXXXX\uYYYY (12 bytes). - -4 * 6 => 24 bytes (12 bytes required) - -The extra 2 bytes are for the quotes around the string - -*/ -#define RESERVE_STRING(_len) (2 + ((_len) * 6)) - -static const double g_pow10[] = {1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000, 10000000000, 100000000000, 1000000000000, 10000000000000, 100000000000000, 1000000000000000}; -static const char g_hexChars[] = "0123456789abcdef"; -static const char g_escapeChars[] = "0123456789\\b\\t\\n\\f\\r\\\"\\\\\\/"; - -/* -FIXME: While this is fine dandy and working it's a magic value mess which probably only the author understands. -Needs a cleanup and more documentation */ - -/* -Table for pure ascii output escaping all characters above 127 to \uXXXX */ -static const JSUINT8 g_asciiOutputTable[256] = -{ -/* 0x00 */ 0, 30, 30, 30, 30, 30, 30, 30, 10, 12, 14, 30, 16, 18, 30, 30, -/* 0x10 */ 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, -/* 0x20 */ 1, 1, 20, 1, 1, 1, 29, 1, 1, 1, 1, 1, 1, 1, 1, 24, -/* 0x30 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 29, 1, 29, 1, -/* 0x40 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -/* 0x50 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 22, 1, 1, 1, -/* 0x60 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -/* 0x70 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -/* 0x80 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -/* 0x90 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -/* 0xa0 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -/* 0xb0 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -/* 0xc0 */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -/* 0xd0 */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -/* 0xe0 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, -/* 0xf0 */ 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1 -}; - -static void SetError (JSOBJ obj, JSONObjectEncoder *enc, const char *message) -{ - enc->errorMsg = message; - enc->errorObj = obj; -} - -/* -FIXME: Keep track of how big these get across several encoder calls and try to make an estimate -That way we won't run our head into the wall each call */ -void Buffer_Realloc (JSONObjectEncoder *enc, size_t cbNeeded) -{ - size_t curSize = enc->end - enc->start; - size_t newSize = curSize * 2; - size_t offset = enc->offset - enc->start; - - while (newSize < curSize + cbNeeded) - { - newSize *= 2; - } - - if (enc->heap) - { - enc->start = (char *) enc->realloc (enc->start, newSize); - if (!enc->start) - { - SetError (NULL, enc, "Could not reserve memory block"); - return; - } - } - else - { - char *oldStart = enc->start; - enc->heap = 1; - enc->start = (char *) enc->malloc (newSize); - if (!enc->start) - { - SetError (NULL, enc, "Could not reserve memory 
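
The RESERVE_STRING bound above comes straight from the worst-case comment: every input byte can expand to a six-byte \u00XX escape, plus two bytes for the surrounding quotes, so 2 + 6*len can never be outgrown mid-string. Checking it on a tiny input:

```c
#include <stdio.h>

#define RESERVE_STRING(len) (2 + ((len) * 6))

int main(void)
{
    /* three control characters: the all-escaped worst case       */
    const char input[] = "\x01\x02\x03";
    /* output "\u0001\u0002\u0003" with quotes is 20 bytes exactly */
    printf("%d\n", RESERVE_STRING((int)(sizeof input - 1)));  /* 20 */
    return 0;
}
```
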
block"); - return; - } - memcpy (enc->start, oldStart, offset); - } - enc->offset = enc->start + offset; - enc->end = enc->start + newSize; -} - -FASTCALL_ATTR INLINE_PREFIX void FASTCALL_MSVC Buffer_AppendShortHexUnchecked (char *outputOffset, unsigned short value) -{ - *(outputOffset++) = g_hexChars[(value & 0xf000) >> 12]; - *(outputOffset++) = g_hexChars[(value & 0x0f00) >> 8]; - *(outputOffset++) = g_hexChars[(value & 0x00f0) >> 4]; - *(outputOffset++) = g_hexChars[(value & 0x000f) >> 0]; -} - -int Buffer_EscapeStringUnvalidated (JSONObjectEncoder *enc, const char *io, const char *end) -{ - char *of = (char *) enc->offset; - - for (;;) - { - switch (*io) - { - case 0x00: - { - if (io < end) - { - *(of++) = '\\'; - *(of++) = 'u'; - *(of++) = '0'; - *(of++) = '0'; - *(of++) = '0'; - *(of++) = '0'; - break; - } - else - { - enc->offset += (of - enc->offset); - return TRUE; - } - } - case '\"': (*of++) = '\\'; (*of++) = '\"'; break; - case '\\': (*of++) = '\\'; (*of++) = '\\'; break; - case '/': (*of++) = '\\'; (*of++) = '/'; break; - case '\b': (*of++) = '\\'; (*of++) = 'b'; break; - case '\f': (*of++) = '\\'; (*of++) = 'f'; break; - case '\n': (*of++) = '\\'; (*of++) = 'n'; break; - case '\r': (*of++) = '\\'; (*of++) = 'r'; break; - case '\t': (*of++) = '\\'; (*of++) = 't'; break; - - case 0x26: // '/' - case 0x3c: // '<' - case 0x3e: // '>' - { - if (enc->encodeHTMLChars) - { - // Fall through to \u00XX case below. - } - else - { - // Same as default case below. - (*of++) = (*io); - break; - } - } - case 0x01: - case 0x02: - case 0x03: - case 0x04: - case 0x05: - case 0x06: - case 0x07: - case 0x0b: - case 0x0e: - case 0x0f: - case 0x10: - case 0x11: - case 0x12: - case 0x13: - case 0x14: - case 0x15: - case 0x16: - case 0x17: - case 0x18: - case 0x19: - case 0x1a: - case 0x1b: - case 0x1c: - case 0x1d: - case 0x1e: - case 0x1f: - { - *(of++) = '\\'; - *(of++) = 'u'; - *(of++) = '0'; - *(of++) = '0'; - *(of++) = g_hexChars[ (unsigned char) (((*io) & 0xf0) >> 4)]; - *(of++) = g_hexChars[ (unsigned char) ((*io) & 0x0f)]; - break; - } - default: (*of++) = (*io); break; - } - io++; - } -} - -int Buffer_EscapeStringValidated (JSOBJ obj, JSONObjectEncoder *enc, const char *io, const char *end) -{ - JSUTF32 ucs; - char *of = (char *) enc->offset; - - for (;;) - { - JSUINT8 utflen = g_asciiOutputTable[(unsigned char) *io]; - - switch (utflen) - { - case 0: - { - if (io < end) - { - *(of++) = '\\'; - *(of++) = 'u'; - *(of++) = '0'; - *(of++) = '0'; - *(of++) = '0'; - *(of++) = '0'; - io ++; - continue; - } - else - { - enc->offset += (of - enc->offset); - return TRUE; - } - } - - case 1: - { - *(of++)= (*io++); - continue; - } - - case 2: - { - JSUTF32 in; - JSUTF16 in16; - - if (end - io < 1) - { - enc->offset += (of - enc->offset); - SetError (obj, enc, "Unterminated UTF-8 sequence when encoding string"); - return FALSE; - } - - memcpy(&in16, io, sizeof(JSUTF16)); - in = (JSUTF32) in16; - -#ifdef __LITTLE_ENDIAN__ - ucs = ((in & 0x1f) << 6) | ((in >> 8) & 0x3f); -#else - ucs = ((in & 0x1f00) >> 2) | (in & 0x3f); -#endif - - if (ucs < 0x80) - { - enc->offset += (of - enc->offset); - SetError (obj, enc, "Overlong 2 byte UTF-8 sequence detected when encoding string"); - return FALSE; - } - - io += 2; - break; - } - - case 3: - { - JSUTF32 in; - JSUTF16 in16; - JSUINT8 in8; - - if (end - io < 2) - { - enc->offset += (of - enc->offset); - SetError (obj, enc, "Unterminated UTF-8 sequence when encoding string"); - return FALSE; - } - - memcpy(&in16, io, sizeof(JSUTF16)); - memcpy(&in8, io + 2, 
sizeof(JSUINT8)); -#ifdef __LITTLE_ENDIAN__ - in = (JSUTF32) in16; - in |= in8 << 16; - ucs = ((in & 0x0f) << 12) | ((in & 0x3f00) >> 2) | ((in & 0x3f0000) >> 16); -#else - in = in16 << 8; - in |= in8; - ucs = ((in & 0x0f0000) >> 4) | ((in & 0x3f00) >> 2) | (in & 0x3f); -#endif - - if (ucs < 0x800) - { - enc->offset += (of - enc->offset); - SetError (obj, enc, "Overlong 3 byte UTF-8 sequence detected when encoding string"); - return FALSE; - } - - io += 3; - break; - } - case 4: - { - JSUTF32 in; - - if (end - io < 3) - { - enc->offset += (of - enc->offset); - SetError (obj, enc, "Unterminated UTF-8 sequence when encoding string"); - return FALSE; - } - - memcpy(&in, io, sizeof(JSUTF32)); -#ifdef __LITTLE_ENDIAN__ - ucs = ((in & 0x07) << 18) | ((in & 0x3f00) << 4) | ((in & 0x3f0000) >> 10) | ((in & 0x3f000000) >> 24); -#else - ucs = ((in & 0x07000000) >> 6) | ((in & 0x3f0000) >> 4) | ((in & 0x3f00) >> 2) | (in & 0x3f); -#endif - if (ucs < 0x10000) - { - enc->offset += (of - enc->offset); - SetError (obj, enc, "Overlong 4 byte UTF-8 sequence detected when encoding string"); - return FALSE; - } - - io += 4; - break; - } - - - case 5: - case 6: - { - enc->offset += (of - enc->offset); - SetError (obj, enc, "Unsupported UTF-8 sequence length when encoding string"); - return FALSE; - } - - case 29: - { - if (enc->encodeHTMLChars) - { - // Fall through to \u00XX case 30 below. - } - else - { - // Same as case 1 above. - *(of++) = (*io++); - continue; - } - } - - case 30: - { - // \uXXXX encode - *(of++) = '\\'; - *(of++) = 'u'; - *(of++) = '0'; - *(of++) = '0'; - *(of++) = g_hexChars[ (unsigned char) (((*io) & 0xf0) >> 4)]; - *(of++) = g_hexChars[ (unsigned char) ((*io) & 0x0f)]; - io ++; - continue; - } - case 10: - case 12: - case 14: - case 16: - case 18: - case 20: - case 22: - case 24: - { - *(of++) = *( (char *) (g_escapeChars + utflen + 0)); - *(of++) = *( (char *) (g_escapeChars + utflen + 1)); - io ++; - continue; - } - // This can never happen, it's here to make L4 VC++ happy - default: - { - ucs = 0; - break; - } - } - - /* - If the character is a UTF8 sequence of length > 1 we end up here */ - if (ucs >= 0x10000) - { - ucs -= 0x10000; - *(of++) = '\\'; - *(of++) = 'u'; - Buffer_AppendShortHexUnchecked(of, (unsigned short) (ucs >> 10) + 0xd800); - of += 4; - - *(of++) = '\\'; - *(of++) = 'u'; - Buffer_AppendShortHexUnchecked(of, (unsigned short) (ucs & 0x3ff) + 0xdc00); - of += 4; - } - else - { - *(of++) = '\\'; - *(of++) = 'u'; - Buffer_AppendShortHexUnchecked(of, (unsigned short) ucs); - of += 4; - } - } -} - -#define Buffer_Reserve(__enc, __len) \ - if ( (size_t) ((__enc)->end - (__enc)->offset) < (size_t) (__len)) \ - { \ - Buffer_Realloc((__enc), (__len));\ - } \ - - -#define Buffer_AppendCharUnchecked(__enc, __chr) \ - *((__enc)->offset++) = __chr; \ - -FASTCALL_ATTR INLINE_PREFIX void FASTCALL_MSVC strreverse(char* begin, char* end) -{ - char aux; - while (end > begin) - aux = *end, *end-- = *begin, *begin++ = aux; -} - -void Buffer_AppendIntUnchecked(JSONObjectEncoder *enc, JSINT32 value) -{ - char* wstr; - JSUINT32 uvalue = (value < 0) ? -value : value; - - wstr = enc->offset; - // Conversion. Number is reversed. - - do *wstr++ = (char)(48 + (uvalue % 10)); while(uvalue /= 10); - if (value < 0) *wstr++ = '-'; - - // Reverse string - strreverse(enc->offset,wstr - 1); - enc->offset += (wstr - (enc->offset)); -} - -void Buffer_AppendLongUnchecked(JSONObjectEncoder *enc, JSINT64 value) -{ - char* wstr; - JSUINT64 uvalue = (value < 0) ? 
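
Buffer_AppendIntUnchecked above writes digits least-significant first and then reverses in place, avoiding either a second division pass or an snprintf detour. The same idea, standalone:

```c
#include <stdio.h>

static void strreverse(char *begin, char *end)
{
    char aux;
    while (end > begin) { aux = *end; *end-- = *begin; *begin++ = aux; }
}

int main(void)
{
    char out[16], *w = out;
    int value = -3047;
    unsigned uvalue = value < 0 ? -value : value;

    /* emit digits in reverse order, then the sign */
    do { *w++ = (char)('0' + uvalue % 10); } while (uvalue /= 10);
    if (value < 0) *w++ = '-';

    strreverse(out, w - 1);
    *w = '\0';
    printf("%s\n", out);   /* -3047 */
    return 0;
}
```
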
-value : value; - - wstr = enc->offset; - // Conversion. Number is reversed. - - do *wstr++ = (char)(48 + (uvalue % 10ULL)); while(uvalue /= 10ULL); - if (value < 0) *wstr++ = '-'; - - // Reverse string - strreverse(enc->offset,wstr - 1); - enc->offset += (wstr - (enc->offset)); -} - -int Buffer_AppendDoubleUnchecked(JSOBJ obj, JSONObjectEncoder *enc, double value) -{ - /* if input is beyond the thresholds, revert to exponential */ - const double thres_max = (double) 1e16 - 1; - const double thres_min = (double) 1e-15; - char precision_str[20]; - int count; - double diff = 0.0; - char* str = enc->offset; - char* wstr = str; - unsigned long long whole; - double tmp; - unsigned long long frac; - int neg; - double pow10; - - if (value == HUGE_VAL || value == -HUGE_VAL) - { - SetError (obj, enc, "Invalid Inf value when encoding double"); - return FALSE; - } - - if (!(value == value)) - { - SetError (obj, enc, "Invalid Nan value when encoding double"); - return FALSE; - } - - /* we'll work in positive values and deal with the - negative sign issue later */ - neg = 0; - if (value < 0) - { - neg = 1; - value = -value; - } - - /* - for very large or small numbers switch back to native sprintf for - exponentials. anyone want to write code to replace this? */ - if (value > thres_max || (value != 0.0 && fabs(value) < thres_min)) - { - precision_str[0] = '%'; - precision_str[1] = '.'; -#if defined(_WIN32) && defined(_MSC_VER) - sprintf_s(precision_str+2, sizeof(precision_str)-2, "%ug", enc->doublePrecision); - enc->offset += sprintf_s(str, enc->end - enc->offset, precision_str, neg ? -value : value); -#else - snprintf(precision_str+2, sizeof(precision_str)-2, "%ug", enc->doublePrecision); - enc->offset += snprintf(str, enc->end - enc->offset, precision_str, neg ? -value : value); -#endif - return TRUE; - } - - pow10 = g_pow10[enc->doublePrecision]; - - whole = (unsigned long long) value; - tmp = (value - whole) * pow10; - frac = (unsigned long long)(tmp); - diff = tmp - frac; - - if (diff > 0.5) - { - ++frac; - /* handle rollover, e.g. case 0.99 with prec 1 is 1.0 */ - if (frac >= pow10) - { - frac = 0; - ++whole; - } - } - else - if (diff == 0.5 && ((frac == 0) || (frac & 1))) - { - /* if halfway, round up if odd, OR - if last digit is 0. That last part is strange */ - ++frac; - } - - if (enc->doublePrecision == 0) - { - diff = value - whole; - - if (diff > 0.5) - { - /* greater than 0.5, round up, e.g. 1.6 -> 2 */ - ++whole; - } - else - if (diff == 0.5 && (whole & 1)) - { - /* exactly 0.5 and ODD, then round up */ - /* 1.5 -> 2, but 2.5 -> 2 */ - ++whole; - } - - //vvvvvvvvvvvvvvvvvvv Diff from modp_dto2 - } - else - if (frac) - { - count = enc->doublePrecision; - // now do fractional part, as an unsigned number - // we know it is not 0 but we can have leading zeros, these - // should be removed - while (!(frac % 10)) - { - --count; - frac /= 10; - } - //^^^^^^^^^^^^^^^^^^^ Diff from modp_dto2 - - // now do fractional part, as an unsigned number - do - { - --count; - *wstr++ = (char)(48 + (frac % 10)); - } while (frac /= 10); - // add extra 0s - while (count-- > 0) - { - *wstr++ = '0'; - } - // add decimal - *wstr++ = '.'; - } - else - { - *wstr++ = '0'; - *wstr++ = '.'; - } - - // do whole part - // Take care of sign - // Conversion. Number is reversed. 
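
The fixed-point path of Buffer_AppendDoubleUnchecked above splits the value into whole and fractional integers at the requested precision, then must handle rounding rollover: 0.99 at precision 1 rounds to 1.0, not 0.10. A simplified sketch of just that step, ignoring the half-even tiebreak:

```c
#include <stdio.h>

int main(void)
{
    double value = 0.99, pow10 = 10.0;   /* precision 1 */
    unsigned long long whole = (unsigned long long) value;
    double tmp = (value - whole) * pow10;
    unsigned long long frac = (unsigned long long) tmp;
    double diff = tmp - frac;

    if (diff > 0.5) {
        ++frac;
        if (frac >= pow10) {             /* rollover: .99 -> 1.0 */
            frac = 0;
            ++whole;
        }
    }
    printf("%llu.%llu\n", whole, frac);  /* 1.0 */
    return 0;
}
```
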
- do *wstr++ = (char)(48 + (whole % 10)); while (whole /= 10); - - if (neg) - { - *wstr++ = '-'; - } - strreverse(str, wstr-1); - enc->offset += (wstr - (enc->offset)); - - return TRUE; -} - -/* -FIXME: -Handle integration functions returning NULL here */ - -/* -FIXME: -Perhaps implement recursion detection */ - -void encode(JSOBJ obj, JSONObjectEncoder *enc, const char *name, size_t cbName) -{ - const char *value; - char *objName; - int count; - JSOBJ iterObj; - size_t szlen; - JSONTypeContext tc; - tc.encoder = enc; - - if (enc->level > enc->recursionMax) - { - SetError (obj, enc, "Maximum recursion level reached"); - return; - } - - /* - This reservation must hold - - length of _name as encoded worst case + - maxLength of double to string OR maxLength of JSLONG to string - */ - - Buffer_Reserve(enc, 256 + RESERVE_STRING(cbName)); - if (enc->errorMsg) - { - return; - } - - if (name) - { - Buffer_AppendCharUnchecked(enc, '\"'); - - if (enc->forceASCII) - { - if (!Buffer_EscapeStringValidated(obj, enc, name, name + cbName)) - { - return; - } - } - else - { - if (!Buffer_EscapeStringUnvalidated(enc, name, name + cbName)) - { - return; - } - } - - Buffer_AppendCharUnchecked(enc, '\"'); - - Buffer_AppendCharUnchecked (enc, ':'); -#ifndef JSON_NO_EXTRA_WHITESPACE - Buffer_AppendCharUnchecked (enc, ' '); -#endif - } - - enc->beginTypeContext(obj, &tc); - - switch (tc.type) - { - case JT_INVALID: - { - return; - } - - case JT_ARRAY: - { - count = 0; - enc->iterBegin(obj, &tc); - - Buffer_AppendCharUnchecked (enc, '['); - - while (enc->iterNext(obj, &tc)) - { - if (count > 0) - { - Buffer_AppendCharUnchecked (enc, ','); -#ifndef JSON_NO_EXTRA_WHITESPACE - Buffer_AppendCharUnchecked (buffer, ' '); -#endif - } - - iterObj = enc->iterGetValue(obj, &tc); - - enc->level ++; - encode (iterObj, enc, NULL, 0); - count ++; - } - - enc->iterEnd(obj, &tc); - Buffer_AppendCharUnchecked (enc, ']'); - break; - } - - case JT_OBJECT: - { - count = 0; - enc->iterBegin(obj, &tc); - - Buffer_AppendCharUnchecked (enc, '{'); - - while (enc->iterNext(obj, &tc)) - { - if (count > 0) - { - Buffer_AppendCharUnchecked (enc, ','); -#ifndef JSON_NO_EXTRA_WHITESPACE - Buffer_AppendCharUnchecked (enc, ' '); -#endif - } - - iterObj = enc->iterGetValue(obj, &tc); - objName = enc->iterGetName(obj, &tc, &szlen); - - enc->level ++; - encode (iterObj, enc, objName, szlen); - count ++; - } - - enc->iterEnd(obj, &tc); - Buffer_AppendCharUnchecked (enc, '}'); - break; - } - - case JT_LONG: - { - Buffer_AppendLongUnchecked (enc, enc->getLongValue(obj, &tc)); - break; - } - - case JT_INT: - { - Buffer_AppendIntUnchecked (enc, enc->getIntValue(obj, &tc)); - break; - } - - case JT_TRUE: - { - Buffer_AppendCharUnchecked (enc, 't'); - Buffer_AppendCharUnchecked (enc, 'r'); - Buffer_AppendCharUnchecked (enc, 'u'); - Buffer_AppendCharUnchecked (enc, 'e'); - break; - } - - case JT_FALSE: - { - Buffer_AppendCharUnchecked (enc, 'f'); - Buffer_AppendCharUnchecked (enc, 'a'); - Buffer_AppendCharUnchecked (enc, 'l'); - Buffer_AppendCharUnchecked (enc, 's'); - Buffer_AppendCharUnchecked (enc, 'e'); - break; - } - - - case JT_NULL: - { - Buffer_AppendCharUnchecked (enc, 'n'); - Buffer_AppendCharUnchecked (enc, 'u'); - Buffer_AppendCharUnchecked (enc, 'l'); - Buffer_AppendCharUnchecked (enc, 'l'); - break; - } - - case JT_DOUBLE: - { - if (!Buffer_AppendDoubleUnchecked (obj, enc, enc->getDoubleValue(obj, &tc))) - { - enc->endTypeContext(obj, &tc); - enc->level --; - return; - } - break; - } - - case JT_UTF8: - { - value = enc->getStringValue(obj, &tc, 
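
Both container branches of encode() above use the same separator rule: emit a comma before every element except the first, tracked by a running count, so no trailing comma ever needs trimming. Reduced to its core:

```c
#include <stdio.h>

int main(void)
{
    int values[] = {1, 2, 3}, count;
    putchar('[');
    for (count = 0; count < 3; ++count) {
        if (count > 0) putchar(',');   /* separator before all but first */
        printf("%d", values[count]);
    }
    puts("]");                         /* prints [1,2,3] */
    return 0;
}
```
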
&szlen); - Buffer_Reserve(enc, RESERVE_STRING(szlen)); - if (enc->errorMsg) - { - enc->endTypeContext(obj, &tc); - return; - } - Buffer_AppendCharUnchecked (enc, '\"'); - - if (enc->forceASCII) - { - if (!Buffer_EscapeStringValidated(obj, enc, value, value + szlen)) - { - enc->endTypeContext(obj, &tc); - enc->level --; - return; - } - } - else - { - if (!Buffer_EscapeStringUnvalidated(enc, value, value + szlen)) - { - enc->endTypeContext(obj, &tc); - enc->level --; - return; - } - } - - Buffer_AppendCharUnchecked (enc, '\"'); - break; - } - } - - enc->endTypeContext(obj, &tc); - enc->level --; -} - -char *JSON_EncodeObject(JSOBJ obj, JSONObjectEncoder *enc, char *_buffer, size_t _cbBuffer) -{ - char *locale; - enc->malloc = enc->malloc ? enc->malloc : malloc; - enc->free = enc->free ? enc->free : free; - enc->realloc = enc->realloc ? enc->realloc : realloc; - enc->errorMsg = NULL; - enc->errorObj = NULL; - enc->level = 0; - - if (enc->recursionMax < 1) - { - enc->recursionMax = JSON_MAX_RECURSION_DEPTH; - } - - if (enc->doublePrecision < 0 || - enc->doublePrecision > JSON_DOUBLE_MAX_DECIMALS) - { - enc->doublePrecision = JSON_DOUBLE_MAX_DECIMALS; - } - - if (_buffer == NULL) - { - _cbBuffer = 32768; - enc->start = (char *) enc->malloc (_cbBuffer); - if (!enc->start) - { - SetError(obj, enc, "Could not reserve memory block"); - return NULL; - } - enc->heap = 1; - } - else - { - enc->start = _buffer; - enc->heap = 0; - } - - enc->end = enc->start + _cbBuffer; - enc->offset = enc->start; - - locale = setlocale(LC_NUMERIC, NULL); - if (strcmp(locale, "C")) - { - locale = strdup(locale); - if (!locale) - { - SetError(NULL, enc, "Could not reserve memory block"); - return NULL; - } - setlocale(LC_NUMERIC, "C"); - encode (obj, enc, NULL, 0); - setlocale(LC_NUMERIC, locale); - free(locale); - } - else - { - encode (obj, enc, NULL, 0); - } - - Buffer_Reserve(enc, 1); - if (enc->errorMsg) - { - return NULL; - } - Buffer_AppendCharUnchecked(enc, '\0'); - - return enc->start; -} diff --git a/pandas/src/ujson/python/JSONtoObj.c b/pandas/src/ujson/python/JSONtoObj.c deleted file mode 100644 index e4d02db4cb60a..0000000000000 --- a/pandas/src/ujson/python/JSONtoObj.c +++ /dev/null @@ -1,736 +0,0 @@ -/* -Copyright (c) 2011-2013, ESN Social Software AB and Jonas Tarnstrom -All rights reserved. - -Redistribution and use in source and binary forms, with or without -modification, are permitted provided that the following conditions are met: - * Redistributions of source code must retain the above copyright - notice, this list of conditions and the following disclaimer. - * Redistributions in binary form must reproduce the above copyright - notice, this list of conditions and the following disclaimer in the - documentation and/or other materials provided with the distribution. - * Neither the name of the ESN Social Software AB nor the - names of its contributors may be used to endorse or promote products - derived from this software without specific prior written permission. - -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND -ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED -WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE -DISCLAIMED. 
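
JSON_EncodeObject above accepts a caller-supplied buffer or allocates its own, recording which case applies in enc->heap so that cleanup frees only what it owns. A standalone sketch of that policy; outbuf and its functions are hypothetical names:

```c
#include <stdlib.h>

typedef struct { char *start; size_t size; int heap; } outbuf;

static int outbuf_init(outbuf *b, char *caller_buf, size_t cb)
{
    if (caller_buf) {
        b->start = caller_buf; b->size = cb; b->heap = 0;
    } else {
        b->size = 32768;              /* default size from the code above */
        b->start = malloc(b->size);
        if (!b->start) return 0;
        b->heap = 1;                  /* we own it, so we free it */
    }
    return 1;
}

static void outbuf_free(outbuf *b) { if (b->heap) free(b->start); }

int main(void)
{
    outbuf b;
    if (outbuf_init(&b, NULL, 0)) outbuf_free(&b);
    return 0;
}
```
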
IN NO EVENT SHALL ESN SOCIAL SOFTWARE AB OR JONAS TARNSTROM BE LIABLE -FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES -(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; -LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND -ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS -SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - - -Portions of code from MODP_ASCII - Ascii transformations (upper/lower, etc) -https://github.com/client9/stringencoders -Copyright (c) 2007 Nick Galbreath -- nickg [at] modp [dot] com. All rights reserved. - -Numeric decoder derived from from TCL library -http://www.opensource.apple.com/source/tcl/tcl-14/tcl/license.terms - * Copyright (c) 1988-1993 The Regents of the University of California. - * Copyright (c) 1994 Sun Microsystems, Inc. -*/ - -#include "py_defines.h" -#define PY_ARRAY_UNIQUE_SYMBOL UJSON_NUMPY -#define NO_IMPORT_ARRAY -#include -#include - - -//#define PRINTMARK() fprintf(stderr, "%s: MARK(%d)\n", __FILE__, __LINE__) -#define PRINTMARK() - -typedef struct __PyObjectDecoder -{ - JSONObjectDecoder dec; - - void* npyarr; // Numpy context buffer - void* npyarr_addr; // Ref to npyarr ptr to track DECREF calls - npy_intp curdim; // Current array dimension - - PyArray_Descr* dtype; -} PyObjectDecoder; - -typedef struct __NpyArrContext -{ - PyObject* ret; - PyObject* labels[2]; - PyArray_Dims shape; - - PyObjectDecoder* dec; - - npy_intp i; - npy_intp elsize; - npy_intp elcount; -} NpyArrContext; - -// Numpy handling based on numpy internal code, specifically the function -// PyArray_FromIter. - -// numpy related functions are inter-dependent so declare them all here, -// to ensure the compiler catches any errors - -// standard numpy array handling -JSOBJ Object_npyNewArray(void *prv, void* decoder); -JSOBJ Object_npyEndArray(void *prv, JSOBJ obj); -int Object_npyArrayAddItem(void *prv, JSOBJ obj, JSOBJ value); - -// for more complex dtypes (object and string) fill a standard Python list -// and convert to a numpy array when done. 
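
The comment above points at numpy's PyArray_FromIter; its capacity schedule (quoted further down as 0, 4, 8, 14, 23, 36, 56, 86, ...) grows by roughly 50% per step. The exact constants below are reconstructed from that sequence, so treat them as an assumption rather than the deleted code:

```c
#include <stdio.h>

int main(void)
{
    int i, cap = 0;
    for (i = 0; i < 8; ++i) {
        printf("%d ", cap);
        /* ~50% overallocation, plus a small constant to get started */
        cap = cap + (cap >> 1) + (cap < 4 ? 4 : 2);
    }
    putchar('\n');   /* 0 4 8 14 23 36 56 86 */
    return 0;
}
```
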
-JSOBJ Object_npyNewArrayList(void *prv, void* decoder); -JSOBJ Object_npyEndArrayList(void *prv, JSOBJ obj); -int Object_npyArrayListAddItem(void *prv, JSOBJ obj, JSOBJ value); - -// labelled support, encode keys and values of JS object into separate numpy -// arrays -JSOBJ Object_npyNewObject(void *prv, void* decoder); -JSOBJ Object_npyEndObject(void *prv, JSOBJ obj); -int Object_npyObjectAddKey(void *prv, JSOBJ obj, JSOBJ name, JSOBJ value); - -// free the numpy context buffer -void Npy_releaseContext(NpyArrContext* npyarr) -{ - PRINTMARK(); - if (npyarr) - { - if (npyarr->shape.ptr) - { - PyObject_Free(npyarr->shape.ptr); - } - if (npyarr->dec) - { - npyarr->dec->npyarr = NULL; - npyarr->dec->curdim = 0; - } - Py_XDECREF(npyarr->labels[0]); - Py_XDECREF(npyarr->labels[1]); - Py_XDECREF(npyarr->ret); - PyObject_Free(npyarr); - } -} - -JSOBJ Object_npyNewArray(void *prv, void* _decoder) -{ - NpyArrContext* npyarr; - PyObjectDecoder* decoder = (PyObjectDecoder*) _decoder; - PRINTMARK(); - if (decoder->curdim <= 0) - { - // start of array - initialise the context buffer - npyarr = decoder->npyarr = PyObject_Malloc(sizeof(NpyArrContext)); - decoder->npyarr_addr = npyarr; - - if (!npyarr) - { - PyErr_NoMemory(); - return NULL; - } - - npyarr->dec = decoder; - npyarr->labels[0] = npyarr->labels[1] = NULL; - - npyarr->shape.ptr = PyObject_Malloc(sizeof(npy_intp)*NPY_MAXDIMS); - npyarr->shape.len = 1; - npyarr->ret = NULL; - - npyarr->elsize = 0; - npyarr->elcount = 4; - npyarr->i = 0; - } - else - { - // starting a new dimension continue the current array (and reshape after) - npyarr = (NpyArrContext*) decoder->npyarr; - if (decoder->curdim >= npyarr->shape.len) - { - npyarr->shape.len++; - } - } - - npyarr->shape.ptr[decoder->curdim] = 0; - decoder->curdim++; - return npyarr; -} - -PyObject* Npy_returnLabelled(NpyArrContext* npyarr) -{ - PyObject* ret = npyarr->ret; - npy_intp i; - - if (npyarr->labels[0] || npyarr->labels[1]) - { - // finished decoding, build tuple with values and labels - ret = PyTuple_New(npyarr->shape.len+1); - for (i = 0; i < npyarr->shape.len; i++) - { - if (npyarr->labels[i]) - { - PyTuple_SET_ITEM(ret, i+1, npyarr->labels[i]); - npyarr->labels[i] = NULL; - } - else - { - Py_INCREF(Py_None); - PyTuple_SET_ITEM(ret, i+1, Py_None); - } - } - PyTuple_SET_ITEM(ret, 0, npyarr->ret); - } - - return ret; -} - -JSOBJ Object_npyEndArray(void *prv, JSOBJ obj) -{ - PyObject *ret; - char* new_data; - NpyArrContext* npyarr = (NpyArrContext*) obj; - int emptyType = NPY_DEFAULT_TYPE; - npy_intp i; - PRINTMARK(); - if (!npyarr) - { - return NULL; - } - - ret = npyarr->ret; - i = npyarr->i; - - npyarr->dec->curdim--; - - if (i == 0 || !npyarr->ret) { - // empty array would not have been initialised so do it now. 
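 /* Annotation, not part of the original source: array allocation is deferred
    until the first element arrives so that the dtype can be "sniffed" from
    the data (see Object_npyArrayAddItem). An array that finishes with zero
    elements was therefore never created, and is built here from the
    decoder's explicit dtype if one was given, else NPY_DEFAULT_TYPE. */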
- if (npyarr->dec->dtype) - { - emptyType = npyarr->dec->dtype->type_num; - } - npyarr->ret = ret = PyArray_EMPTY(npyarr->shape.len, npyarr->shape.ptr, emptyType, 0); - } - else if (npyarr->dec->curdim <= 0) - { - // realloc to final size - new_data = PyDataMem_RENEW(PyArray_DATA(ret), i * npyarr->elsize); - if (new_data == NULL) { - PyErr_NoMemory(); - Npy_releaseContext(npyarr); - return NULL; - } - ((PyArrayObject*) ret)->data = (void*) new_data; - // PyArray_BYTES(ret) = new_data; - } - - if (npyarr->dec->curdim <= 0) - { - // finished decoding array, reshape if necessary - if (npyarr->shape.len > 1) - { - npyarr->ret = PyArray_Newshape((PyArrayObject*) ret, &npyarr->shape, NPY_ANYORDER); - Py_DECREF(ret); - } - - ret = Npy_returnLabelled(npyarr); - - npyarr->ret = NULL; - Npy_releaseContext(npyarr); - } - - return ret; -} - -int Object_npyArrayAddItem(void *prv, JSOBJ obj, JSOBJ value) -{ - PyObject* type; - PyArray_Descr* dtype; - npy_intp i; - char *new_data, *item; - NpyArrContext* npyarr = (NpyArrContext*) obj; - PRINTMARK(); - if (!npyarr) - { - return 0; - } - - i = npyarr->i; - - npyarr->shape.ptr[npyarr->dec->curdim-1]++; - - if (PyArray_Check((PyObject*)value)) - { - // multidimensional array, keep decoding values. - return 1; - } - - if (!npyarr->ret) - { - // Array not initialised yet. - // We do it here so we can 'sniff' the data type if none was provided - if (!npyarr->dec->dtype) - { - type = PyObject_Type(value); - if(!PyArray_DescrConverter(type, &dtype)) - { - Py_DECREF(type); - goto fail; - } - Py_INCREF(dtype); - Py_DECREF(type); - } - else - { - dtype = PyArray_DescrNew(npyarr->dec->dtype); - } - - // If it's an object or string then fill a Python list and subsequently - // convert. Otherwise we would need to somehow mess about with - // reference counts when renewing memory. - npyarr->elsize = dtype->elsize; - if (PyDataType_REFCHK(dtype) || npyarr->elsize == 0) - { - Py_XDECREF(dtype); - - if (npyarr->dec->curdim > 1) - { - PyErr_SetString(PyExc_ValueError, "Cannot decode multidimensional arrays with variable length elements to numpy"); - goto fail; - } - npyarr->elcount = 0; - npyarr->ret = PyList_New(0); - if (!npyarr->ret) - { - goto fail; - } - ((JSONObjectDecoder*)npyarr->dec)->newArray = Object_npyNewArrayList; - ((JSONObjectDecoder*)npyarr->dec)->arrayAddItem = Object_npyArrayListAddItem; - ((JSONObjectDecoder*)npyarr->dec)->endArray = Object_npyEndArrayList; - return Object_npyArrayListAddItem(prv, obj, value); - } - - npyarr->ret = PyArray_NewFromDescr(&PyArray_Type, dtype, 1, - &npyarr->elcount, NULL,NULL, 0, NULL); - - if (!npyarr->ret) - { - goto fail; - } - } - - if (i >= npyarr->elcount) { - // Grow PyArray_DATA(ret): - // this is similar for the strategy for PyListObject, but we use - // 50% overallocation => 0, 4, 8, 14, 23, 36, 56, 86 ... - if (npyarr->elsize == 0) - { - PyErr_SetString(PyExc_ValueError, "Cannot decode multidimensional arrays with variable length elements to numpy"); - goto fail; - } - - npyarr->elcount = (i >> 1) + (i < 4 ? 
4 : 2) + i; - if (npyarr->elcount <= NPY_MAX_INTP/npyarr->elsize) { - new_data = PyDataMem_RENEW(PyArray_DATA(npyarr->ret), npyarr->elcount * npyarr->elsize); - } - else { - PyErr_NoMemory(); - goto fail; - } - ((PyArrayObject*) npyarr->ret)->data = (void*) new_data; - - // PyArray_BYTES(npyarr->ret) = new_data; - } - - PyArray_DIMS(npyarr->ret)[0] = i + 1; - - if ((item = PyArray_GETPTR1(npyarr->ret, i)) == NULL - || PyArray_SETITEM(npyarr->ret, item, value) == -1) { - goto fail; - } - - Py_DECREF( (PyObject *) value); - npyarr->i++; - return 1; - -fail: - - Npy_releaseContext(npyarr); - return 0; -} - -JSOBJ Object_npyNewArrayList(void *prv, void* _decoder) -{ - PyObjectDecoder* decoder = (PyObjectDecoder*) _decoder; - PRINTMARK(); - PyErr_SetString(PyExc_ValueError, "nesting not supported for object or variable length dtypes"); - Npy_releaseContext(decoder->npyarr); - return NULL; -} - -JSOBJ Object_npyEndArrayList(void *prv, JSOBJ obj) -{ - PyObject *list, *ret; - NpyArrContext* npyarr = (NpyArrContext*) obj; - PRINTMARK(); - if (!npyarr) - { - return NULL; - } - - // convert decoded list to numpy array - list = (PyObject *) npyarr->ret; - npyarr->ret = PyArray_FROM_O(list); - - ret = Npy_returnLabelled(npyarr); - npyarr->ret = list; - - ((JSONObjectDecoder*)npyarr->dec)->newArray = Object_npyNewArray; - ((JSONObjectDecoder*)npyarr->dec)->arrayAddItem = Object_npyArrayAddItem; - ((JSONObjectDecoder*)npyarr->dec)->endArray = Object_npyEndArray; - Npy_releaseContext(npyarr); - return ret; -} - -int Object_npyArrayListAddItem(void *prv, JSOBJ obj, JSOBJ value) -{ - NpyArrContext* npyarr = (NpyArrContext*) obj; - PRINTMARK(); - if (!npyarr) - { - return 0; - } - PyList_Append((PyObject*) npyarr->ret, value); - Py_DECREF( (PyObject *) value); - npyarr->elcount++; - return 1; -} - - -JSOBJ Object_npyNewObject(void *prv, void* _decoder) -{ - PyObjectDecoder* decoder = (PyObjectDecoder*) _decoder; - PRINTMARK(); - if (decoder->curdim > 1) - { - PyErr_SetString(PyExc_ValueError, "labels only supported up to 2 dimensions"); - return NULL; - } - - return ((JSONObjectDecoder*)decoder)->newArray(prv, decoder); -} - -JSOBJ Object_npyEndObject(void *prv, JSOBJ obj) -{ - PyObject *list; - npy_intp labelidx; - NpyArrContext* npyarr = (NpyArrContext*) obj; - PRINTMARK(); - if (!npyarr) - { - return NULL; - } - - labelidx = npyarr->dec->curdim-1; - - list = npyarr->labels[labelidx]; - if (list) - { - npyarr->labels[labelidx] = PyArray_FROM_O(list); - Py_DECREF(list); - } - - return (PyObject*) ((JSONObjectDecoder*)npyarr->dec)->endArray(prv, obj); -} - -int Object_npyObjectAddKey(void *prv, JSOBJ obj, JSOBJ name, JSOBJ value) -{ - PyObject *label; - npy_intp labelidx; - // add key to label array, value to values array - NpyArrContext* npyarr = (NpyArrContext*) obj; - PRINTMARK(); - if (!npyarr) - { - return 0; - } - - label = (PyObject*) name; - labelidx = npyarr->dec->curdim-1; - - if (!npyarr->labels[labelidx]) - { - npyarr->labels[labelidx] = PyList_New(0); - } - - // only fill label array once, assumes all column labels are the same - // for 2-dimensional arrays. 
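
Stepping back to the reallocation in `Object_npyArrayAddItem` above: `(i >> 1) + (i < 4 ? 4 : 2) + i` is the growth schedule used by numpy's `PyArray_FromIter`, roughly 50% overallocation in the same spirit as CPython's list. A small standalone sketch (illustrative only) reproduces the sequence quoted in the comment:

```c
/* Sketch: the ~50% overallocation schedule used when growing the target
 * array. Starting from a full buffer of i elements, the next capacity is
 * i + i/2 + (4 if i < 4 else 2). */
#include <stdio.h>

static long next_capacity(long i)
{
    return (i >> 1) + (i < 4 ? 4 : 2) + i;
}

int main(void)
{
    long cap = 0;
    int step;

    for (step = 0; step < 8; step++)
    {
        printf("%ld ", cap); /* prints: 0 4 8 14 23 36 56 86 */
        cap = next_capacity(cap);
    }
    printf("\n");
    return 0;
}
```

The `if` that follows picks `Object_npyObjectAddKey` back up: a label is appended only while the label list is no longer than the current element count, which is exactly the fill-once rule the comment above describes.
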
- if (PyList_GET_SIZE(npyarr->labels[labelidx]) <= npyarr->elcount) - { - PyList_Append(npyarr->labels[labelidx], label); - } - - if(((JSONObjectDecoder*)npyarr->dec)->arrayAddItem(prv, obj, value)) - { - Py_DECREF(label); - return 1; - } - return 0; -} - -int Object_objectAddKey(void *prv, JSOBJ obj, JSOBJ name, JSOBJ value) -{ - PyDict_SetItem (obj, name, value); - Py_DECREF( (PyObject *) name); - Py_DECREF( (PyObject *) value); - return 1; -} - -int Object_arrayAddItem(void *prv, JSOBJ obj, JSOBJ value) -{ - PyList_Append(obj, value); - Py_DECREF( (PyObject *) value); - return 1; -} - -JSOBJ Object_newString(void *prv, wchar_t *start, wchar_t *end) -{ - return PyUnicode_FromWideChar (start, (end - start)); -} - -JSOBJ Object_newTrue(void *prv) -{ - Py_RETURN_TRUE; -} - -JSOBJ Object_newFalse(void *prv) -{ - Py_RETURN_FALSE; -} - -JSOBJ Object_newNull(void *prv) -{ - Py_RETURN_NONE; -} - -JSOBJ Object_newObject(void *prv, void* decoder) -{ - return PyDict_New(); -} - -JSOBJ Object_endObject(void *prv, JSOBJ obj) -{ - return obj; -} - -JSOBJ Object_newArray(void *prv, void* decoder) -{ - return PyList_New(0); -} - -JSOBJ Object_endArray(void *prv, JSOBJ obj) -{ - return obj; -} - -JSOBJ Object_newInteger(void *prv, JSINT32 value) -{ - return PyInt_FromLong( (long) value); -} - -JSOBJ Object_newLong(void *prv, JSINT64 value) -{ - return PyLong_FromLongLong (value); -} - -JSOBJ Object_newDouble(void *prv, double value) -{ - return PyFloat_FromDouble(value); -} - -static void Object_releaseObject(void *prv, JSOBJ obj, void* _decoder) -{ - PyObjectDecoder* decoder = (PyObjectDecoder*) _decoder; - if (obj != decoder->npyarr_addr) - { - Py_XDECREF( ((PyObject *)obj)); - } -} - -static char *g_kwlist[] = {"obj", "precise_float", "numpy", "labelled", "dtype", NULL}; - -PyObject* JSONToObj(PyObject* self, PyObject *args, PyObject *kwargs) -{ - PyObject *ret; - PyObject *sarg; - PyObject *arg; - PyObject *opreciseFloat = NULL; - JSONObjectDecoder *decoder; - PyObjectDecoder pyDecoder; - PyArray_Descr *dtype = NULL; - int numpy = 0, labelled = 0; - - JSONObjectDecoder dec = - { - Object_newString, - Object_objectAddKey, - Object_arrayAddItem, - Object_newTrue, - Object_newFalse, - Object_newNull, - Object_newObject, - Object_endObject, - Object_newArray, - Object_endArray, - Object_newInteger, - Object_newLong, - Object_newDouble, - Object_releaseObject, - PyObject_Malloc, - PyObject_Free, - PyObject_Realloc - }; - - dec.preciseFloat = 0; - dec.prv = NULL; - - pyDecoder.dec = dec; - pyDecoder.curdim = 0; - pyDecoder.npyarr = NULL; - pyDecoder.npyarr_addr = NULL; - - decoder = (JSONObjectDecoder*) &pyDecoder; - - if (!PyArg_ParseTupleAndKeywords(args, kwargs, "O|OiiO&", g_kwlist, &arg, &opreciseFloat, &numpy, &labelled, PyArray_DescrConverter2, &dtype)) - { - Npy_releaseContext(pyDecoder.npyarr); - return NULL; - } - - if (opreciseFloat && PyObject_IsTrue(opreciseFloat)) - { - decoder->preciseFloat = 1; - } - - if (PyString_Check(arg)) - { - sarg = arg; - } - else - if (PyUnicode_Check(arg)) - { - sarg = PyUnicode_AsUTF8String(arg); - if (sarg == NULL) - { - //Exception raised above us by codec according to docs - return NULL; - } - } - else - { - PyErr_Format(PyExc_TypeError, "Expected String or Unicode"); - return NULL; - } - - decoder->errorStr = NULL; - decoder->errorOffset = NULL; - - if (numpy) - { - pyDecoder.dtype = dtype; - decoder->newArray = Object_npyNewArray; - decoder->endArray = Object_npyEndArray; - decoder->arrayAddItem = Object_npyArrayAddItem; - - if (labelled) - { - 
decoder->newObject = Object_npyNewObject; - decoder->endObject = Object_npyEndObject; - decoder->objectAddKey = Object_npyObjectAddKey; - } - } - - ret = JSON_DecodeObject(decoder, PyString_AS_STRING(sarg), PyString_GET_SIZE(sarg)); - - if (sarg != arg) - { - Py_DECREF(sarg); - } - - if (PyErr_Occurred()) - { - if (ret) - { - Py_DECREF( (PyObject *) ret); - } - Npy_releaseContext(pyDecoder.npyarr); - return NULL; - } - - if (decoder->errorStr) - { - /* - FIXME: It's possible to give a much nicer error message here with actual failing element in input etc*/ - - PyErr_Format (PyExc_ValueError, "%s", decoder->errorStr); - - if (ret) - { - Py_DECREF( (PyObject *) ret); - } - Npy_releaseContext(pyDecoder.npyarr); - - return NULL; - } - - return ret; -} - -PyObject* JSONFileToObj(PyObject* self, PyObject *args, PyObject *kwargs) -{ - PyObject *read; - PyObject *string; - PyObject *result; - PyObject *file = NULL; - PyObject *argtuple; - - if (!PyArg_ParseTuple (args, "O", &file)) - { - return NULL; - } - - if (!PyObject_HasAttrString (file, "read")) - { - PyErr_Format (PyExc_TypeError, "expected file"); - return NULL; - } - - read = PyObject_GetAttrString (file, "read"); - - if (!PyCallable_Check (read)) { - Py_XDECREF(read); - PyErr_Format (PyExc_TypeError, "expected file"); - return NULL; - } - - string = PyObject_CallObject (read, NULL); - Py_XDECREF(read); - - if (string == NULL) - { - return NULL; - } - - argtuple = PyTuple_Pack(1, string); - - result = JSONToObj (self, argtuple, kwargs); - - Py_XDECREF(argtuple); - Py_XDECREF(string); - - if (result == NULL) { - return NULL; - } - - return result; -} diff --git a/pandas/src/ujson/python/objToJSON.c b/pandas/src/ujson/python/objToJSON.c deleted file mode 100644 index 75de63acbd7d6..0000000000000 --- a/pandas/src/ujson/python/objToJSON.c +++ /dev/null @@ -1,2924 +0,0 @@ -/* -Copyright (c) 2011-2013, ESN Social Software AB and Jonas Tarnstrom -All rights reserved. - -Redistribution and use in source and binary forms, with or without -modification, are permitted provided that the following conditions are met: -* Redistributions of source code must retain the above copyright -notice, this list of conditions and the following disclaimer. -* Redistributions in binary form must reproduce the above copyright -notice, this list of conditions and the following disclaimer in the -documentation and/or other materials provided with the distribution. -* Neither the name of the ESN Social Software AB nor the -names of its contributors may be used to endorse or promote products -derived from this software without specific prior written permission. - -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND -ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED -WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE -DISCLAIMED. IN NO EVENT SHALL ESN SOCIAL SOFTWARE AB OR JONAS TARNSTROM BE LIABLE -FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES -(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; -LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND -ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS -SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
- - -Portions of code from MODP_ASCII - Ascii transformations (upper/lower, etc) -https://github.com/client9/stringencoders -Copyright (c) 2007 Nick Galbreath -- nickg [at] modp [dot] com. All rights reserved. - -Numeric decoder derived from from TCL library -http://www.opensource.apple.com/source/tcl/tcl-14/tcl/license.terms -* Copyright (c) 1988-1993 The Regents of the University of California. -* Copyright (c) 1994 Sun Microsystems, Inc. -*/ -#define PY_ARRAY_UNIQUE_SYMBOL UJSON_NUMPY - -#include "py_defines.h" -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -static PyObject* type_decimal; - -#define NPY_JSON_BUFSIZE 32768 - -static PyTypeObject* cls_dataframe; -static PyTypeObject* cls_series; -static PyTypeObject* cls_index; -static PyTypeObject* cls_nat; - -typedef void *(*PFN_PyTypeToJSON)(JSOBJ obj, JSONTypeContext *ti, void *outValue, size_t *_outLen); - -#if (PY_VERSION_HEX < 0x02050000) -typedef ssize_t Py_ssize_t; -#endif - - -typedef struct __NpyArrContext -{ - PyObject *array; - char* dataptr; - int curdim; // current dimension in array's order - int stridedim; // dimension we are striding over - int inc; // stride dimension increment (+/- 1) - npy_intp dim; - npy_intp stride; - npy_intp ndim; - npy_intp index[NPY_MAXDIMS]; - int type_num; - PyArray_GetItemFunc* getitem; - - char** rowLabels; - char** columnLabels; -} NpyArrContext; - -typedef struct __PdBlockContext -{ - int colIdx; - int ncols; - int transpose; - - int* cindices; // frame column -> block column map - NpyArrContext** npyCtxts; // NpyArrContext for each column -} PdBlockContext; - -typedef struct __TypeContext -{ - JSPFN_ITERBEGIN iterBegin; - JSPFN_ITEREND iterEnd; - JSPFN_ITERNEXT iterNext; - JSPFN_ITERGETNAME iterGetName; - JSPFN_ITERGETVALUE iterGetValue; - PFN_PyTypeToJSON PyTypeToJSON; - PyObject *newObj; - PyObject *dictObj; - Py_ssize_t index; - Py_ssize_t size; - PyObject *itemValue; - PyObject *itemName; - PyObject *attrList; - PyObject *iterator; - - double doubleValue; - JSINT64 longValue; - - char *cStr; - NpyArrContext *npyarr; - PdBlockContext *pdblock; - int transpose; - char** rowLabels; - char** columnLabels; - npy_intp rowLabelsLen; - npy_intp columnLabelsLen; -} TypeContext; - -typedef struct __PyObjectEncoder -{ - JSONObjectEncoder enc; - - // pass through the NpyArrContext when encoding multi-dimensional arrays - NpyArrContext* npyCtxtPassthru; - - // pass through the PdBlockContext when encoding blocks - PdBlockContext* blkCtxtPassthru; - - // pass-through to encode numpy data directly - int npyType; - void* npyValue; - TypeContext basicTypeContext; - - int datetimeIso; - PANDAS_DATETIMEUNIT datetimeUnit; - - // output format style for pandas data types - int outputFormat; - int originalOutputFormat; - - PyObject *defaultHandler; -} PyObjectEncoder; - -#define GET_TC(__ptrtc) ((TypeContext *)((__ptrtc)->prv)) - -enum PANDAS_FORMAT -{ - SPLIT, - RECORDS, - INDEX, - COLUMNS, - VALUES -}; - -//#define PRINTMARK() fprintf(stderr, "%s: MARK(%d)\n", __FILE__, __LINE__) -#define PRINTMARK() - -int PdBlock_iterNext(JSOBJ, JSONTypeContext *); - -// import_array() compat -#if (PY_VERSION_HEX >= 0x03000000) -void *initObjToJSON(void) - -#else -void initObjToJSON(void) -#endif -{ - PyObject *mod_pandas; - PyObject *mod_tslib; - PyObject* mod_decimal = PyImport_ImportModule("decimal"); - type_decimal = PyObject_GetAttrString(mod_decimal, "Decimal"); - Py_INCREF(type_decimal); - Py_DECREF(mod_decimal); - - PyDateTime_IMPORT; - - mod_pandas 
= PyImport_ImportModule("pandas"); - if (mod_pandas) - { - cls_dataframe = (PyTypeObject*) PyObject_GetAttrString(mod_pandas, "DataFrame"); - cls_index = (PyTypeObject*) PyObject_GetAttrString(mod_pandas, "Index"); - cls_series = (PyTypeObject*) PyObject_GetAttrString(mod_pandas, "Series"); - Py_DECREF(mod_pandas); - } - - mod_tslib = PyImport_ImportModule("pandas.tslib"); - if (mod_tslib) - { - cls_nat = (PyTypeObject*) PyObject_GetAttrString(mod_tslib, "NaTType"); - Py_DECREF(mod_tslib); - } - - /* Initialise numpy API and use 2/3 compatible return */ - import_array(); - return NUMPY_IMPORT_ARRAY_RETVAL; -} - -static TypeContext* createTypeContext(void) -{ - TypeContext *pc; - - pc = PyObject_Malloc(sizeof(TypeContext)); - if (!pc) - { - PyErr_NoMemory(); - return NULL; - } - pc->newObj = NULL; - pc->dictObj = NULL; - pc->itemValue = NULL; - pc->itemName = NULL; - pc->attrList = NULL; - pc->index = 0; - pc->size = 0; - pc->longValue = 0; - pc->doubleValue = 0.0; - pc->cStr = NULL; - pc->npyarr = NULL; - pc->pdblock = NULL; - pc->rowLabels = NULL; - pc->columnLabels = NULL; - pc->transpose = 0; - pc->rowLabelsLen = 0; - pc->columnLabelsLen = 0; - - return pc; -} - -static PyObject* get_values(PyObject *obj) -{ - PyObject *values = PyObject_GetAttrString(obj, "values"); - PRINTMARK(); - - if (values && !PyArray_CheckExact(values)) - { - if (PyObject_HasAttrString(values, "values")) - { - PyObject *subvals = get_values(values); - PyErr_Clear(); - PRINTMARK(); - // subvals are sometimes missing a dimension - if (subvals) - { - PyArrayObject *reshape = (PyArrayObject*) subvals; - PyObject *shape = PyObject_GetAttrString(obj, "shape"); - PyArray_Dims dims; - PRINTMARK(); - - if (!shape || !PyArray_IntpConverter(shape, &dims)) - { - subvals = NULL; - } - else - { - subvals = PyArray_Newshape(reshape, &dims, NPY_ANYORDER); - PyDimMem_FREE(dims.ptr); - } - Py_DECREF(reshape); - Py_XDECREF(shape); - } - Py_DECREF(values); - values = subvals; - } - else - { - PRINTMARK(); - Py_DECREF(values); - values = NULL; - } - } - - if (!values && PyObject_HasAttrString(obj, "get_values")) - { - PRINTMARK(); - values = PyObject_CallMethod(obj, "get_values", NULL); - if (values && !PyArray_CheckExact(values)) - { - PRINTMARK(); - Py_DECREF(values); - values = NULL; - } - } - - if (!values) - { - PyObject *typeRepr = PyObject_Repr((PyObject*) Py_TYPE(obj)); - PyObject *repr; - PRINTMARK(); - if (PyObject_HasAttrString(obj, "dtype")) - { - PyObject *dtype = PyObject_GetAttrString(obj, "dtype"); - repr = PyObject_Repr(dtype); - Py_DECREF(dtype); - } - else - { - repr = PyString_FromString(""); - } - - PyErr_Format(PyExc_ValueError, - "%s or %s are not JSON serializable yet", - PyString_AS_STRING(repr), - PyString_AS_STRING(typeRepr)); - Py_DECREF(repr); - Py_DECREF(typeRepr); - - return NULL; - } - - return values; -} - -static PyObject* get_sub_attr(PyObject *obj, char *attr, char *subAttr) -{ - PyObject *tmp = PyObject_GetAttrString(obj, attr); - PyObject *ret; - - if (tmp == 0) - { - return 0; - } - ret = PyObject_GetAttrString(tmp, subAttr); - Py_DECREF(tmp); - - return ret; -} - -static int is_simple_frame(PyObject *obj) -{ - PyObject *check = get_sub_attr(obj, "_data", "is_mixed_type"); - int ret = (check == Py_False); - - if (!check) - { - return 0; - } - - Py_DECREF(check); - return ret; -} - -static Py_ssize_t get_attr_length(PyObject *obj, char *attr) -{ - PyObject *tmp = PyObject_GetAttrString(obj, attr); - Py_ssize_t ret; - - if (tmp == 0) - { - return 0; - } - ret = PyObject_Length(tmp); - 
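 /* Annotation, not part of the original source: these helpers collapse every
    failure mode (missing attribute, failed length lookup) to 0 rather than
    propagating an error, so a caller such as PdBlock_iterBegin treats
    "error" and "empty" identically. */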
Py_DECREF(tmp); - - if (ret == -1) - { - return 0; - } - - return ret; -} - -static PyObject* get_item(PyObject *obj, Py_ssize_t i) -{ - PyObject *tmp = PyInt_FromSsize_t(i); - PyObject *ret; - - if (tmp == 0) - { - return 0; - } - ret = PyObject_GetItem(obj, tmp); - Py_DECREF(tmp); - - return ret; -} - -static void *CDouble(JSOBJ obj, JSONTypeContext *tc, void *outValue, size_t *_outLen) -{ - PRINTMARK(); - *((double *) outValue) = GET_TC(tc)->doubleValue; - return NULL; -} - -static void *CLong(JSOBJ obj, JSONTypeContext *tc, void *outValue, size_t *_outLen) -{ - PRINTMARK(); - *((JSINT64 *) outValue) = GET_TC(tc)->longValue; - return NULL; -} - -#ifdef _LP64 -static void *PyIntToINT64(JSOBJ _obj, JSONTypeContext *tc, void *outValue, size_t *_outLen) -{ - PyObject *obj = (PyObject *) _obj; - *((JSINT64 *) outValue) = PyInt_AS_LONG (obj); - return NULL; -} -#else -static void *PyIntToINT32(JSOBJ _obj, JSONTypeContext *tc, void *outValue, size_t *_outLen) -{ - PyObject *obj = (PyObject *) _obj; - *((JSINT32 *) outValue) = PyInt_AS_LONG (obj); - return NULL; -} -#endif - -static void *PyLongToINT64(JSOBJ _obj, JSONTypeContext *tc, void *outValue, size_t *_outLen) -{ - *((JSINT64 *) outValue) = GET_TC(tc)->longValue; - return NULL; -} - -static void *NpyFloatToDOUBLE(JSOBJ _obj, JSONTypeContext *tc, void *outValue, size_t *_outLen) -{ - PyObject *obj = (PyObject *) _obj; - PyArray_CastScalarToCtype(obj, outValue, PyArray_DescrFromType(NPY_DOUBLE)); - return NULL; -} - -static void *PyFloatToDOUBLE(JSOBJ _obj, JSONTypeContext *tc, void *outValue, size_t *_outLen) -{ - PyObject *obj = (PyObject *) _obj; - *((double *) outValue) = PyFloat_AsDouble (obj); - return NULL; -} - -static void *PyStringToUTF8(JSOBJ _obj, JSONTypeContext *tc, void *outValue, size_t *_outLen) -{ - PyObject *obj = (PyObject *) _obj; - *_outLen = PyString_GET_SIZE(obj); - return PyString_AS_STRING(obj); -} - -static void *PyUnicodeToUTF8(JSOBJ _obj, JSONTypeContext *tc, void *outValue, size_t *_outLen) -{ - PyObject *obj = (PyObject *) _obj; - PyObject *newObj = PyUnicode_EncodeUTF8 (PyUnicode_AS_UNICODE(obj), PyUnicode_GET_SIZE(obj), NULL); - - GET_TC(tc)->newObj = newObj; - - *_outLen = PyString_GET_SIZE(newObj); - return PyString_AS_STRING(newObj); -} - -static void *PandasDateTimeStructToJSON(pandas_datetimestruct *dts, JSONTypeContext *tc, void *outValue, size_t *_outLen) -{ - PANDAS_DATETIMEUNIT base = ((PyObjectEncoder*) tc->encoder)->datetimeUnit; - - if (((PyObjectEncoder*) tc->encoder)->datetimeIso) - { - PRINTMARK(); - *_outLen = (size_t) get_datetime_iso_8601_strlen(0, base); - GET_TC(tc)->cStr = PyObject_Malloc(sizeof(char) * (*_outLen)); - if (!GET_TC(tc)->cStr) - { - PyErr_NoMemory(); - ((JSONObjectEncoder*) tc->encoder)->errorMsg = ""; - return NULL; - } - - if (!make_iso_8601_datetime(dts, GET_TC(tc)->cStr, *_outLen, 0, base, -1, NPY_UNSAFE_CASTING)) - { - PRINTMARK(); - *_outLen = strlen(GET_TC(tc)->cStr); - return GET_TC(tc)->cStr; - } - else - { - PRINTMARK(); - PyErr_SetString(PyExc_ValueError, "Could not convert datetime value to string"); - ((JSONObjectEncoder*) tc->encoder)->errorMsg = ""; - PyObject_Free(GET_TC(tc)->cStr); - return NULL; - } - } - else - { - PRINTMARK(); - *((JSINT64*)outValue) = pandas_datetimestruct_to_datetime(base, dts); - return NULL; - } -} - -static void *NpyDateTimeScalarToJSON(JSOBJ _obj, JSONTypeContext *tc, void *outValue, size_t *_outLen) -{ - pandas_datetimestruct dts; - PyDatetimeScalarObject *obj = (PyDatetimeScalarObject *) _obj; - PRINTMARK(); - - 
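 /* Annotation, not part of the original source: the call below unpacks the
    datetime64 scalar's raw int64 payload (obval) together with its unit
    metadata (obmeta.base) into a pandas_datetimestruct, then defers to
    PandasDateTimeStructToJSON above, which emits either an ISO-8601 string
    (when datetimeIso is set) or an epoch integer in the encoder's unit. */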
pandas_datetime_to_datetimestruct(obj->obval, (PANDAS_DATETIMEUNIT)obj->obmeta.base, &dts); - return PandasDateTimeStructToJSON(&dts, tc, outValue, _outLen); -} - -static void *PyDateTimeToJSON(JSOBJ _obj, JSONTypeContext *tc, void *outValue, size_t *_outLen) -{ - pandas_datetimestruct dts; - PyObject *obj = (PyObject *) _obj; - - PRINTMARK(); - - if (!convert_pydatetime_to_datetimestruct(obj, &dts, NULL, 1)) - { - PRINTMARK(); - return PandasDateTimeStructToJSON(&dts, tc, outValue, _outLen); - } - else - { - if (!PyErr_Occurred()) - { - PyErr_SetString(PyExc_ValueError, "Could not convert datetime value to string"); - } - ((JSONObjectEncoder*) tc->encoder)->errorMsg = ""; - return NULL; - } -} - -static void *NpyDatetime64ToJSON(JSOBJ _obj, JSONTypeContext *tc, void *outValue, size_t *_outLen) -{ - pandas_datetimestruct dts; - PRINTMARK(); - - pandas_datetime_to_datetimestruct( - (npy_datetime) GET_TC(tc)->longValue, - PANDAS_FR_ns, &dts); - return PandasDateTimeStructToJSON(&dts, tc, outValue, _outLen); -} - -static void *PyTimeToJSON(JSOBJ _obj, JSONTypeContext *tc, void *outValue, size_t *outLen) -{ - PyObject *obj = (PyObject *) _obj; - PyObject *str; - PyObject *tmp; - - str = PyObject_CallMethod(obj, "isoformat", NULL); - if (str == NULL) { - PRINTMARK(); - *outLen = 0; - if (!PyErr_Occurred()) - { - PyErr_SetString(PyExc_ValueError, "Failed to convert time"); - } - ((JSONObjectEncoder*) tc->encoder)->errorMsg = ""; - return NULL; - } - if (PyUnicode_Check(str)) - { - tmp = str; - str = PyUnicode_AsUTF8String(str); - Py_DECREF(tmp); - } - - GET_TC(tc)->newObj = str; - - *outLen = PyString_GET_SIZE(str); - outValue = (void *) PyString_AS_STRING (str); - return outValue; -} - -static int NpyTypeToJSONType(PyObject* obj, JSONTypeContext* tc, int npyType, void* value) -{ - PyArray_VectorUnaryFunc* castfunc; - npy_double doubleVal; - npy_int64 longVal; - - if (PyTypeNum_ISFLOAT(npyType)) - { - PRINTMARK(); - castfunc = PyArray_GetCastFunc(PyArray_DescrFromType(npyType), NPY_DOUBLE); - if (!castfunc) - { - PyErr_Format ( - PyExc_ValueError, - "Cannot cast numpy dtype %d to double", - npyType); - } - castfunc(value, &doubleVal, 1, NULL, NULL); - if (npy_isnan(doubleVal) || npy_isinf(doubleVal)) - { - PRINTMARK(); - return JT_NULL; - } - GET_TC(tc)->doubleValue = (double) doubleVal; - GET_TC(tc)->PyTypeToJSON = CDouble; - return JT_DOUBLE; - } - - if (PyTypeNum_ISDATETIME(npyType)) - { - PRINTMARK(); - castfunc = PyArray_GetCastFunc(PyArray_DescrFromType(npyType), NPY_INT64); - if (!castfunc) - { - PyErr_Format ( - PyExc_ValueError, - "Cannot cast numpy dtype %d to long", - npyType); - } - castfunc(value, &longVal, 1, NULL, NULL); - if (longVal == get_nat()) - { - PRINTMARK(); - return JT_NULL; - } - GET_TC(tc)->longValue = (JSINT64) longVal; - GET_TC(tc)->PyTypeToJSON = NpyDatetime64ToJSON; - return ((PyObjectEncoder *) tc->encoder)->datetimeIso ? JT_UTF8 : JT_LONG; - } - - if (PyTypeNum_ISINTEGER(npyType)) - { - PRINTMARK(); - castfunc = PyArray_GetCastFunc(PyArray_DescrFromType(npyType), NPY_INT64); - if (!castfunc) - { - PyErr_Format ( - PyExc_ValueError, - "Cannot cast numpy dtype %d to long", - npyType); - } - castfunc(value, &longVal, 1, NULL, NULL); - GET_TC(tc)->longValue = (JSINT64) longVal; - GET_TC(tc)->PyTypeToJSON = CLong; - return JT_LONG; - } - - if (PyTypeNum_ISBOOL(npyType)) - { - PRINTMARK(); - return *((npy_bool *) value) == NPY_TRUE ? 
JT_TRUE : JT_FALSE; - } - - PRINTMARK(); - return JT_INVALID; -} - - -//============================================================================= -// Numpy array iteration functions -//============================================================================= - -static void NpyArr_freeItemValue(JSOBJ _obj, JSONTypeContext *tc) -{ - if (GET_TC(tc)->npyarr && GET_TC(tc)->itemValue != GET_TC(tc)->npyarr->array) - { - PRINTMARK(); - Py_XDECREF(GET_TC(tc)->itemValue); - GET_TC(tc)->itemValue = NULL; - } -} - -int NpyArr_iterNextNone(JSOBJ _obj, JSONTypeContext *tc) -{ - return 0; -} - -void NpyArr_iterBegin(JSOBJ _obj, JSONTypeContext *tc) -{ - PyArrayObject *obj; - NpyArrContext *npyarr; - - if (GET_TC(tc)->newObj) - { - obj = (PyArrayObject *) GET_TC(tc)->newObj; - } - else - { - obj = (PyArrayObject *) _obj; - } - - if (PyArray_SIZE(obj) < 0) - { - PRINTMARK(); - GET_TC(tc)->iterNext = NpyArr_iterNextNone; - } - else - { - PRINTMARK(); - npyarr = PyObject_Malloc(sizeof(NpyArrContext)); - GET_TC(tc)->npyarr = npyarr; - - if (!npyarr) - { - PyErr_NoMemory(); - GET_TC(tc)->iterNext = NpyArr_iterNextNone; - return; - } - - npyarr->array = (PyObject*) obj; - npyarr->getitem = (PyArray_GetItemFunc*) PyArray_DESCR(obj)->f->getitem; - npyarr->dataptr = PyArray_DATA(obj); - npyarr->ndim = PyArray_NDIM(obj) - 1; - npyarr->curdim = 0; - npyarr->type_num = PyArray_DESCR(obj)->type_num; - - if (GET_TC(tc)->transpose) - { - npyarr->dim = PyArray_DIM(obj, npyarr->ndim); - npyarr->stride = PyArray_STRIDE(obj, npyarr->ndim); - npyarr->stridedim = npyarr->ndim; - npyarr->index[npyarr->ndim] = 0; - npyarr->inc = -1; - } - else - { - npyarr->dim = PyArray_DIM(obj, 0); - npyarr->stride = PyArray_STRIDE(obj, 0); - npyarr->stridedim = 0; - npyarr->index[0] = 0; - npyarr->inc = 1; - } - - npyarr->columnLabels = GET_TC(tc)->columnLabels; - npyarr->rowLabels = GET_TC(tc)->rowLabels; - } -} - -void NpyArr_iterEnd(JSOBJ obj, JSONTypeContext *tc) -{ - NpyArrContext *npyarr = GET_TC(tc)->npyarr; - PRINTMARK(); - - if (npyarr) - { - NpyArr_freeItemValue(obj, tc); - PyObject_Free(npyarr); - } -} - -void NpyArrPassThru_iterBegin(JSOBJ obj, JSONTypeContext *tc) -{ - PRINTMARK(); -} - -void NpyArrPassThru_iterEnd(JSOBJ obj, JSONTypeContext *tc) -{ - NpyArrContext* npyarr = GET_TC(tc)->npyarr; - PRINTMARK(); - // finished this dimension, reset the data pointer - npyarr->curdim--; - npyarr->dataptr -= npyarr->stride * npyarr->index[npyarr->stridedim]; - npyarr->stridedim -= npyarr->inc; - npyarr->dim = PyArray_DIM(npyarr->array, npyarr->stridedim); - npyarr->stride = PyArray_STRIDE(npyarr->array, npyarr->stridedim); - npyarr->dataptr += npyarr->stride; - - NpyArr_freeItemValue(obj, tc); -} - -int NpyArr_iterNextItem(JSOBJ obj, JSONTypeContext *tc) -{ - NpyArrContext* npyarr = GET_TC(tc)->npyarr; - PRINTMARK(); - - if (PyErr_Occurred()) - { - return 0; - } - - if (npyarr->index[npyarr->stridedim] >= npyarr->dim) - { - PRINTMARK(); - return 0; - } - - NpyArr_freeItemValue(obj, tc); - -#if NPY_API_VERSION < 0x00000007 - if(PyArray_ISDATETIME(npyarr->array)) - { - PRINTMARK(); - GET_TC(tc)->itemValue = PyArray_ToScalar(npyarr->dataptr, npyarr->array); - } - else - if (PyArray_ISNUMBER(npyarr->array)) -#else - if (PyArray_ISNUMBER(npyarr->array) || PyArray_ISDATETIME(npyarr->array)) -#endif - { - PRINTMARK(); - GET_TC(tc)->itemValue = obj; - Py_INCREF(obj); - ((PyObjectEncoder*) tc->encoder)->npyType = PyArray_TYPE(npyarr->array); - ((PyObjectEncoder*) tc->encoder)->npyValue = npyarr->dataptr; - ((PyObjectEncoder*) 
tc->encoder)->npyCtxtPassthru = npyarr; - } - else - { - PRINTMARK(); - GET_TC(tc)->itemValue = npyarr->getitem(npyarr->dataptr, npyarr->array); - } - - npyarr->dataptr += npyarr->stride; - npyarr->index[npyarr->stridedim]++; - return 1; -} - -int NpyArr_iterNext(JSOBJ _obj, JSONTypeContext *tc) -{ - NpyArrContext* npyarr = GET_TC(tc)->npyarr; - PRINTMARK(); - - if (PyErr_Occurred()) - { - PRINTMARK(); - return 0; - } - - if (npyarr->curdim >= npyarr->ndim || npyarr->index[npyarr->stridedim] >= npyarr->dim) - { - PRINTMARK(); - // innermost dimension, start retrieving item values - GET_TC(tc)->iterNext = NpyArr_iterNextItem; - return NpyArr_iterNextItem(_obj, tc); - } - - // dig a dimension deeper - npyarr->index[npyarr->stridedim]++; - - npyarr->curdim++; - npyarr->stridedim += npyarr->inc; - npyarr->dim = PyArray_DIM(npyarr->array, npyarr->stridedim); - npyarr->stride = PyArray_STRIDE(npyarr->array, npyarr->stridedim); - npyarr->index[npyarr->stridedim] = 0; - - ((PyObjectEncoder*) tc->encoder)->npyCtxtPassthru = npyarr; - GET_TC(tc)->itemValue = npyarr->array; - return 1; -} - -JSOBJ NpyArr_iterGetValue(JSOBJ obj, JSONTypeContext *tc) -{ - PRINTMARK(); - return GET_TC(tc)->itemValue; -} - -static void NpyArr_getLabel(JSOBJ obj, JSONTypeContext *tc, size_t *outLen, npy_intp idx, char** labels) -{ - JSONObjectEncoder* enc = (JSONObjectEncoder*) tc->encoder; - PRINTMARK(); - *outLen = strlen(labels[idx]); - memcpy(enc->offset, labels[idx], sizeof(char)*(*outLen)); - enc->offset += *outLen; - *outLen = 0; -} - -char *NpyArr_iterGetName(JSOBJ obj, JSONTypeContext *tc, size_t *outLen) -{ - NpyArrContext* npyarr = GET_TC(tc)->npyarr; - npy_intp idx; - PRINTMARK(); - - if (GET_TC(tc)->iterNext == NpyArr_iterNextItem) - { - idx = npyarr->index[npyarr->stridedim] - 1; - NpyArr_getLabel(obj, tc, outLen, idx, npyarr->columnLabels); - } - else - { - idx = npyarr->index[npyarr->stridedim - npyarr->inc] - 1; - NpyArr_getLabel(obj, tc, outLen, idx, npyarr->rowLabels); - } - return NULL; -} - - -//============================================================================= -// Pandas block iteration functions -// -// Serialises a DataFrame column by column to avoid unnecessary data copies and -// more representative serialisation when dealing with mixed dtypes. -// -// Uses a dedicated NpyArrContext for each column. -//============================================================================= - - -void PdBlockPassThru_iterEnd(JSOBJ obj, JSONTypeContext *tc) -{ - PdBlockContext *blkCtxt = GET_TC(tc)->pdblock; - PRINTMARK(); - - if (blkCtxt->transpose) - { - blkCtxt->colIdx++; - } - else - { - blkCtxt->colIdx = 0; - } - - NpyArr_freeItemValue(obj, tc); -} - -int PdBlock_iterNextItem(JSOBJ obj, JSONTypeContext *tc) -{ - PdBlockContext *blkCtxt = GET_TC(tc)->pdblock; - PRINTMARK(); - - if (blkCtxt->colIdx >= blkCtxt->ncols) - { - return 0; - } - - GET_TC(tc)->npyarr = blkCtxt->npyCtxts[blkCtxt->colIdx]; - blkCtxt->colIdx++; - return NpyArr_iterNextItem(obj, tc); -} - -char *PdBlock_iterGetName(JSOBJ obj, JSONTypeContext *tc, size_t *outLen) -{ - PdBlockContext *blkCtxt = GET_TC(tc)->pdblock; - NpyArrContext *npyarr = blkCtxt->npyCtxts[0]; - npy_intp idx; - PRINTMARK(); - - if (GET_TC(tc)->iterNext == PdBlock_iterNextItem) - { - idx = blkCtxt->colIdx - 1; - NpyArr_getLabel(obj, tc, outLen, idx, npyarr->columnLabels); - } - else - { - idx = GET_TC(tc)->iterNext != PdBlock_iterNext - ? 
npyarr->index[npyarr->stridedim - npyarr->inc] - 1 - : npyarr->index[npyarr->stridedim]; - - NpyArr_getLabel(obj, tc, outLen, idx, npyarr->rowLabels); - } - return NULL; -} - -char *PdBlock_iterGetName_Transpose(JSOBJ obj, JSONTypeContext *tc, size_t *outLen) -{ - PdBlockContext *blkCtxt = GET_TC(tc)->pdblock; - NpyArrContext* npyarr = blkCtxt->npyCtxts[blkCtxt->colIdx]; - npy_intp idx; - PRINTMARK(); - - if (GET_TC(tc)->iterNext == NpyArr_iterNextItem) - { - idx = npyarr->index[npyarr->stridedim] - 1; - NpyArr_getLabel(obj, tc, outLen, idx, npyarr->columnLabels); - } - else - { - idx = blkCtxt->colIdx; - NpyArr_getLabel(obj, tc, outLen, idx, npyarr->rowLabels); - } - return NULL; -} - -int PdBlock_iterNext(JSOBJ obj, JSONTypeContext *tc) -{ - PdBlockContext *blkCtxt = GET_TC(tc)->pdblock; - NpyArrContext* npyarr; - PRINTMARK(); - - if (PyErr_Occurred()) - { - return 0; - } - - if (blkCtxt->transpose) - { - if (blkCtxt->colIdx >= blkCtxt->ncols) - { - return 0; - } - } - else - { - npyarr = blkCtxt->npyCtxts[0]; - if (npyarr->index[npyarr->stridedim] >= npyarr->dim) - { - return 0; - } - } - - ((PyObjectEncoder*) tc->encoder)->blkCtxtPassthru = blkCtxt; - GET_TC(tc)->itemValue = obj; - - return 1; -} - -void PdBlockPassThru_iterBegin(JSOBJ obj, JSONTypeContext *tc) -{ - PdBlockContext *blkCtxt = GET_TC(tc)->pdblock; - PRINTMARK(); - - if (blkCtxt->transpose) - { - // if transposed we exhaust each column before moving to the next - GET_TC(tc)->iterNext = NpyArr_iterNextItem; - GET_TC(tc)->iterGetName = PdBlock_iterGetName_Transpose; - GET_TC(tc)->npyarr = blkCtxt->npyCtxts[blkCtxt->colIdx]; - } -} - -void PdBlock_iterBegin(JSOBJ _obj, JSONTypeContext *tc) -{ - PyObject *obj, *blocks, *block, *values, *tmp; - PyArrayObject *locs; - PdBlockContext *blkCtxt; - NpyArrContext *npyarr; - Py_ssize_t i; - PyArray_Descr *dtype; - NpyIter *iter; - NpyIter_IterNextFunc *iternext; - npy_int64 **dataptr; - npy_int64 colIdx; - npy_intp idx; - - PRINTMARK(); - - i = 0; - blocks = NULL; - dtype = PyArray_DescrFromType(NPY_INT64); - obj = (PyObject *)_obj; - - GET_TC(tc)->iterGetName = GET_TC(tc)->transpose ? 
PdBlock_iterGetName_Transpose : PdBlock_iterGetName; - - blkCtxt = PyObject_Malloc(sizeof(PdBlockContext)); - if (!blkCtxt) - { - PyErr_NoMemory(); - GET_TC(tc)->iterNext = NpyArr_iterNextNone; - goto BLKRET; - } - GET_TC(tc)->pdblock = blkCtxt; - - blkCtxt->colIdx = 0; - blkCtxt->transpose = GET_TC(tc)->transpose; - blkCtxt->ncols = get_attr_length(obj, "columns"); - - if (blkCtxt->ncols == 0) - { - blkCtxt->npyCtxts = NULL; - blkCtxt->cindices = NULL; - - GET_TC(tc)->iterNext = NpyArr_iterNextNone; - goto BLKRET; - } - - blkCtxt->npyCtxts = PyObject_Malloc(sizeof(NpyArrContext*) * blkCtxt->ncols); - if (!blkCtxt->npyCtxts) - { - PyErr_NoMemory(); - GET_TC(tc)->iterNext = NpyArr_iterNextNone; - goto BLKRET; - } - for (i = 0; i < blkCtxt->ncols; i++) - { - blkCtxt->npyCtxts[i] = NULL; - } - - blkCtxt->cindices = PyObject_Malloc(sizeof(int) * blkCtxt->ncols); - if (!blkCtxt->cindices) - { - PyErr_NoMemory(); - GET_TC(tc)->iterNext = NpyArr_iterNextNone; - goto BLKRET; - } - - blocks = get_sub_attr(obj, "_data", "blocks"); - if (!blocks) - { - GET_TC(tc)->iterNext = NpyArr_iterNextNone; - goto BLKRET; - } - - // force transpose so each NpyArrContext strides down its column - GET_TC(tc)->transpose = 1; - - for (i = 0; i < PyObject_Length(blocks); i++) - { - block = get_item(blocks, i); - if (!block) - { - GET_TC(tc)->iterNext = NpyArr_iterNextNone; - goto BLKRET; - } - - tmp = get_values(block); - if (!tmp) - { - ((JSONObjectEncoder*) tc->encoder)->errorMsg = ""; - Py_DECREF(block); - GET_TC(tc)->iterNext = NpyArr_iterNextNone; - goto BLKRET; - } - - values = PyArray_Transpose((PyArrayObject*) tmp, NULL); - Py_DECREF(tmp); - if (!values) - { - Py_DECREF(block); - GET_TC(tc)->iterNext = NpyArr_iterNextNone; - goto BLKRET; - } - - - locs = (PyArrayObject*) get_sub_attr(block, "mgr_locs", "as_array"); - if (!locs) - { - Py_DECREF(block); - Py_DECREF(values); - GET_TC(tc)->iterNext = NpyArr_iterNextNone; - goto BLKRET; - } - - iter = NpyIter_New(locs, NPY_ITER_READONLY, NPY_KEEPORDER, NPY_NO_CASTING, dtype); - if (!iter) - { - Py_DECREF(block); - Py_DECREF(values); - Py_DECREF(locs); - GET_TC(tc)->iterNext = NpyArr_iterNextNone; - goto BLKRET; - } - iternext = NpyIter_GetIterNext(iter, NULL); - if (!iternext) - { - NpyIter_Deallocate(iter); - Py_DECREF(block); - Py_DECREF(values); - Py_DECREF(locs); - GET_TC(tc)->iterNext = NpyArr_iterNextNone; - goto BLKRET; - } - dataptr = (npy_int64 **) NpyIter_GetDataPtrArray(iter); - do - { - colIdx = **dataptr; - idx = NpyIter_GetIterIndex(iter); - - blkCtxt->cindices[colIdx] = idx; - - // Reference freed in Pdblock_iterend - Py_INCREF(values); - GET_TC(tc)->newObj = values; - - // init a dedicated context for this column - NpyArr_iterBegin(obj, tc); - npyarr = GET_TC(tc)->npyarr; - - // set the dataptr to our desired column and initialise - if (npyarr != NULL) - { - npyarr->dataptr += npyarr->stride * idx; - NpyArr_iterNext(obj, tc); - } - GET_TC(tc)->itemValue = NULL; - ((PyObjectEncoder*) tc->encoder)->npyCtxtPassthru = NULL; - - blkCtxt->npyCtxts[colIdx] = npyarr; - GET_TC(tc)->newObj = NULL; - - } while (iternext(iter)); - - NpyIter_Deallocate(iter); - Py_DECREF(block); - Py_DECREF(values); - Py_DECREF(locs); - } - GET_TC(tc)->npyarr = blkCtxt->npyCtxts[0]; - -BLKRET: - Py_XDECREF(dtype); - Py_XDECREF(blocks); -} - -void PdBlock_iterEnd(JSOBJ obj, JSONTypeContext *tc) -{ - PdBlockContext *blkCtxt; - NpyArrContext *npyarr; - int i; - PRINTMARK(); - - GET_TC(tc)->itemValue = NULL; - npyarr = GET_TC(tc)->npyarr; - - blkCtxt = GET_TC(tc)->pdblock; - - if 
(blkCtxt) - { - for (i = 0; i < blkCtxt->ncols; i++) - { - npyarr = blkCtxt->npyCtxts[i]; - if (npyarr) - { - if (npyarr->array) - { - Py_DECREF(npyarr->array); - npyarr->array = NULL; - } - - GET_TC(tc)->npyarr = npyarr; - NpyArr_iterEnd(obj, tc); - - blkCtxt->npyCtxts[i] = NULL; - } - } - - if (blkCtxt->npyCtxts) - { - PyObject_Free(blkCtxt->npyCtxts); - } - if (blkCtxt->cindices) - { - PyObject_Free(blkCtxt->cindices); - } - PyObject_Free(blkCtxt); - } -} - - -//============================================================================= -// Tuple iteration functions -// itemValue is borrowed reference, no ref counting -//============================================================================= -void Tuple_iterBegin(JSOBJ obj, JSONTypeContext *tc) -{ - GET_TC(tc)->index = 0; - GET_TC(tc)->size = PyTuple_GET_SIZE( (PyObject *) obj); - GET_TC(tc)->itemValue = NULL; -} - -int Tuple_iterNext(JSOBJ obj, JSONTypeContext *tc) -{ - PyObject *item; - - if (GET_TC(tc)->index >= GET_TC(tc)->size) - { - return 0; - } - - item = PyTuple_GET_ITEM (obj, GET_TC(tc)->index); - - GET_TC(tc)->itemValue = item; - GET_TC(tc)->index ++; - return 1; -} - -void Tuple_iterEnd(JSOBJ obj, JSONTypeContext *tc) -{ -} - -JSOBJ Tuple_iterGetValue(JSOBJ obj, JSONTypeContext *tc) -{ - return GET_TC(tc)->itemValue; -} - -char *Tuple_iterGetName(JSOBJ obj, JSONTypeContext *tc, size_t *outLen) -{ - return NULL; -} - -//============================================================================= -// Iterator iteration functions -// itemValue is borrowed reference, no ref counting -//============================================================================= -void Iter_iterBegin(JSOBJ obj, JSONTypeContext *tc) -{ - GET_TC(tc)->itemValue = NULL; - GET_TC(tc)->iterator = PyObject_GetIter(obj); -} - -int Iter_iterNext(JSOBJ obj, JSONTypeContext *tc) -{ - PyObject *item; - - if (GET_TC(tc)->itemValue) - { - Py_DECREF(GET_TC(tc)->itemValue); - GET_TC(tc)->itemValue = NULL; - } - - item = PyIter_Next(GET_TC(tc)->iterator); - - if (item == NULL) - { - return 0; - } - - GET_TC(tc)->itemValue = item; - return 1; -} - -void Iter_iterEnd(JSOBJ obj, JSONTypeContext *tc) -{ - if (GET_TC(tc)->itemValue) - { - Py_DECREF(GET_TC(tc)->itemValue); - GET_TC(tc)->itemValue = NULL; - } - - if (GET_TC(tc)->iterator) - { - Py_DECREF(GET_TC(tc)->iterator); - GET_TC(tc)->iterator = NULL; - } -} - -JSOBJ Iter_iterGetValue(JSOBJ obj, JSONTypeContext *tc) -{ - return GET_TC(tc)->itemValue; -} - -char *Iter_iterGetName(JSOBJ obj, JSONTypeContext *tc, size_t *outLen) -{ - return NULL; -} - -//============================================================================= -// Dir iteration functions -// itemName ref is borrowed from PyObject_Dir (attrList). No refcount -// itemValue ref is from PyObject_GetAttr. 
Ref counted -//============================================================================= -void Dir_iterBegin(JSOBJ obj, JSONTypeContext *tc) -{ - GET_TC(tc)->attrList = PyObject_Dir(obj); - GET_TC(tc)->index = 0; - GET_TC(tc)->size = PyList_GET_SIZE(GET_TC(tc)->attrList); - PRINTMARK(); -} - -void Dir_iterEnd(JSOBJ obj, JSONTypeContext *tc) -{ - if (GET_TC(tc)->itemValue) - { - Py_DECREF(GET_TC(tc)->itemValue); - GET_TC(tc)->itemValue = NULL; - } - - if (GET_TC(tc)->itemName) - { - Py_DECREF(GET_TC(tc)->itemName); - GET_TC(tc)->itemName = NULL; - } - - Py_DECREF( (PyObject *) GET_TC(tc)->attrList); - PRINTMARK(); -} - -int Dir_iterNext(JSOBJ _obj, JSONTypeContext *tc) -{ - PyObject *obj = (PyObject *) _obj; - PyObject *itemValue = GET_TC(tc)->itemValue; - PyObject *itemName = GET_TC(tc)->itemName; - PyObject* attr; - PyObject* attrName; - char* attrStr; - - if (itemValue) - { - Py_DECREF(GET_TC(tc)->itemValue); - GET_TC(tc)->itemValue = itemValue = NULL; - } - - if (itemName) - { - Py_DECREF(GET_TC(tc)->itemName); - GET_TC(tc)->itemName = itemName = NULL; - } - - for (; GET_TC(tc)->index < GET_TC(tc)->size; GET_TC(tc)->index ++) - { - attrName = PyList_GET_ITEM(GET_TC(tc)->attrList, GET_TC(tc)->index); -#if PY_MAJOR_VERSION >= 3 - attr = PyUnicode_AsUTF8String(attrName); -#else - attr = attrName; - Py_INCREF(attr); -#endif - attrStr = PyString_AS_STRING(attr); - - if (attrStr[0] == '_') - { - PRINTMARK(); - Py_DECREF(attr); - continue; - } - - itemValue = PyObject_GetAttr(obj, attrName); - if (itemValue == NULL) - { - PyErr_Clear(); - Py_DECREF(attr); - PRINTMARK(); - continue; - } - - if (PyCallable_Check(itemValue)) - { - Py_DECREF(itemValue); - Py_DECREF(attr); - PRINTMARK(); - continue; - } - - GET_TC(tc)->itemName = itemName; - GET_TC(tc)->itemValue = itemValue; - GET_TC(tc)->index ++; - - PRINTMARK(); - itemName = attr; - break; - } - - if (itemName == NULL) - { - GET_TC(tc)->index = GET_TC(tc)->size; - GET_TC(tc)->itemValue = NULL; - return 0; - } - - GET_TC(tc)->itemName = itemName; - GET_TC(tc)->itemValue = itemValue; - GET_TC(tc)->index ++; - - PRINTMARK(); - return 1; -} - -JSOBJ Dir_iterGetValue(JSOBJ obj, JSONTypeContext *tc) -{ - PRINTMARK(); - return GET_TC(tc)->itemValue; -} - -char *Dir_iterGetName(JSOBJ obj, JSONTypeContext *tc, size_t *outLen) -{ - PRINTMARK(); - *outLen = PyString_GET_SIZE(GET_TC(tc)->itemName); - return PyString_AS_STRING(GET_TC(tc)->itemName); -} - - -//============================================================================= -// List iteration functions -// itemValue is borrowed from object (which is list). 
No refcounting -//============================================================================= -void List_iterBegin(JSOBJ obj, JSONTypeContext *tc) -{ - GET_TC(tc)->index = 0; - GET_TC(tc)->size = PyList_GET_SIZE( (PyObject *) obj); -} - -int List_iterNext(JSOBJ obj, JSONTypeContext *tc) -{ - if (GET_TC(tc)->index >= GET_TC(tc)->size) - { - PRINTMARK(); - return 0; - } - - GET_TC(tc)->itemValue = PyList_GET_ITEM (obj, GET_TC(tc)->index); - GET_TC(tc)->index ++; - return 1; -} - -void List_iterEnd(JSOBJ obj, JSONTypeContext *tc) -{ -} - -JSOBJ List_iterGetValue(JSOBJ obj, JSONTypeContext *tc) -{ - return GET_TC(tc)->itemValue; -} - -char *List_iterGetName(JSOBJ obj, JSONTypeContext *tc, size_t *outLen) -{ - return NULL; -} - -//============================================================================= -// pandas Index iteration functions -//============================================================================= -void Index_iterBegin(JSOBJ obj, JSONTypeContext *tc) -{ - GET_TC(tc)->index = 0; - GET_TC(tc)->cStr = PyObject_Malloc(20 * sizeof(char)); - if (!GET_TC(tc)->cStr) - { - PyErr_NoMemory(); - } - PRINTMARK(); -} - -int Index_iterNext(JSOBJ obj, JSONTypeContext *tc) -{ - Py_ssize_t index; - if (!GET_TC(tc)->cStr) - { - return 0; - } - - index = GET_TC(tc)->index; - Py_XDECREF(GET_TC(tc)->itemValue); - if (index == 0) - { - memcpy(GET_TC(tc)->cStr, "name", sizeof(char)*5); - GET_TC(tc)->itemValue = PyObject_GetAttrString(obj, "name"); - } - else - if (index == 1) - { - memcpy(GET_TC(tc)->cStr, "data", sizeof(char)*5); - GET_TC(tc)->itemValue = get_values(obj); - if (!GET_TC(tc)->itemValue) - { - return 0; - } - } - else - { - PRINTMARK(); - return 0; - } - - GET_TC(tc)->index++; - PRINTMARK(); - return 1; -} - -void Index_iterEnd(JSOBJ obj, JSONTypeContext *tc) -{ - PRINTMARK(); -} - -JSOBJ Index_iterGetValue(JSOBJ obj, JSONTypeContext *tc) -{ - return GET_TC(tc)->itemValue; -} - -char *Index_iterGetName(JSOBJ obj, JSONTypeContext *tc, size_t *outLen) -{ - *outLen = strlen(GET_TC(tc)->cStr); - return GET_TC(tc)->cStr; -} - -//============================================================================= -// pandas Series iteration functions -//============================================================================= -void Series_iterBegin(JSOBJ obj, JSONTypeContext *tc) -{ - PyObjectEncoder* enc = (PyObjectEncoder*) tc->encoder; - GET_TC(tc)->index = 0; - GET_TC(tc)->cStr = PyObject_Malloc(20 * sizeof(char)); - enc->outputFormat = VALUES; // for contained series - if (!GET_TC(tc)->cStr) - { - PyErr_NoMemory(); - } - PRINTMARK(); -} - -int Series_iterNext(JSOBJ obj, JSONTypeContext *tc) -{ - Py_ssize_t index; - if (!GET_TC(tc)->cStr) - { - return 0; - } - - index = GET_TC(tc)->index; - Py_XDECREF(GET_TC(tc)->itemValue); - if (index == 0) - { - memcpy(GET_TC(tc)->cStr, "name", sizeof(char)*5); - GET_TC(tc)->itemValue = PyObject_GetAttrString(obj, "name"); - } - else - if (index == 1) - { - memcpy(GET_TC(tc)->cStr, "index", sizeof(char)*6); - GET_TC(tc)->itemValue = PyObject_GetAttrString(obj, "index"); - } - else - if (index == 2) - { - memcpy(GET_TC(tc)->cStr, "data", sizeof(char)*5); - GET_TC(tc)->itemValue = get_values(obj); - if (!GET_TC(tc)->itemValue) - { - return 0; - } - } - else - { - PRINTMARK(); - return 0; - } - - GET_TC(tc)->index++; - PRINTMARK(); - return 1; -} - -void Series_iterEnd(JSOBJ obj, JSONTypeContext *tc) -{ - PyObjectEncoder* enc = (PyObjectEncoder*) tc->encoder; - enc->outputFormat = enc->originalOutputFormat; - PRINTMARK(); -} - -JSOBJ 
Series_iterGetValue(JSOBJ obj, JSONTypeContext *tc) -{ - return GET_TC(tc)->itemValue; -} - -char *Series_iterGetName(JSOBJ obj, JSONTypeContext *tc, size_t *outLen) -{ - *outLen = strlen(GET_TC(tc)->cStr); - return GET_TC(tc)->cStr; -} - -//============================================================================= -// pandas DataFrame iteration functions -//============================================================================= -void DataFrame_iterBegin(JSOBJ obj, JSONTypeContext *tc) -{ - PyObjectEncoder* enc = (PyObjectEncoder*) tc->encoder; - GET_TC(tc)->index = 0; - GET_TC(tc)->cStr = PyObject_Malloc(20 * sizeof(char)); - enc->outputFormat = VALUES; // for contained series & index - if (!GET_TC(tc)->cStr) - { - PyErr_NoMemory(); - } - PRINTMARK(); -} - -int DataFrame_iterNext(JSOBJ obj, JSONTypeContext *tc) -{ - Py_ssize_t index; - if (!GET_TC(tc)->cStr) - { - return 0; - } - - index = GET_TC(tc)->index; - Py_XDECREF(GET_TC(tc)->itemValue); - if (index == 0) - { - memcpy(GET_TC(tc)->cStr, "columns", sizeof(char)*8); - GET_TC(tc)->itemValue = PyObject_GetAttrString(obj, "columns"); - } - else - if (index == 1) - { - memcpy(GET_TC(tc)->cStr, "index", sizeof(char)*6); - GET_TC(tc)->itemValue = PyObject_GetAttrString(obj, "index"); - } - else - if (index == 2) - { - memcpy(GET_TC(tc)->cStr, "data", sizeof(char)*5); - if (is_simple_frame(obj)) - { - GET_TC(tc)->itemValue = get_values(obj); - if (!GET_TC(tc)->itemValue) - { - return 0; - } - } - else - { - Py_INCREF(obj); - GET_TC(tc)->itemValue = obj; - } - } - else - { - PRINTMARK(); - return 0; - } - - GET_TC(tc)->index++; - PRINTMARK(); - return 1; -} - -void DataFrame_iterEnd(JSOBJ obj, JSONTypeContext *tc) -{ - PyObjectEncoder* enc = (PyObjectEncoder*) tc->encoder; - enc->outputFormat = enc->originalOutputFormat; - PRINTMARK(); -} - -JSOBJ DataFrame_iterGetValue(JSOBJ obj, JSONTypeContext *tc) -{ - return GET_TC(tc)->itemValue; -} - -char *DataFrame_iterGetName(JSOBJ obj, JSONTypeContext *tc, size_t *outLen) -{ - *outLen = strlen(GET_TC(tc)->cStr); - return GET_TC(tc)->cStr; -} - -//============================================================================= -// Dict iteration functions -// itemName might converted to string (Python_Str). Do refCounting -// itemValue is borrowed from object (which is dict). 
No refCounting -//============================================================================= -void Dict_iterBegin(JSOBJ obj, JSONTypeContext *tc) -{ - GET_TC(tc)->index = 0; - PRINTMARK(); -} - -int Dict_iterNext(JSOBJ obj, JSONTypeContext *tc) -{ -#if PY_MAJOR_VERSION >= 3 - PyObject* itemNameTmp; -#endif - - if (GET_TC(tc)->itemName) - { - Py_DECREF(GET_TC(tc)->itemName); - GET_TC(tc)->itemName = NULL; - } - - - if (!PyDict_Next ( (PyObject *)GET_TC(tc)->dictObj, &GET_TC(tc)->index, &GET_TC(tc)->itemName, &GET_TC(tc)->itemValue)) - { - PRINTMARK(); - return 0; - } - - if (PyUnicode_Check(GET_TC(tc)->itemName)) - { - GET_TC(tc)->itemName = PyUnicode_AsUTF8String (GET_TC(tc)->itemName); - } - else - if (!PyString_Check(GET_TC(tc)->itemName)) - { - GET_TC(tc)->itemName = PyObject_Str(GET_TC(tc)->itemName); -#if PY_MAJOR_VERSION >= 3 - itemNameTmp = GET_TC(tc)->itemName; - GET_TC(tc)->itemName = PyUnicode_AsUTF8String (GET_TC(tc)->itemName); - Py_DECREF(itemNameTmp); -#endif - } - else - { - Py_INCREF(GET_TC(tc)->itemName); - } - PRINTMARK(); - return 1; -} - -void Dict_iterEnd(JSOBJ obj, JSONTypeContext *tc) -{ - if (GET_TC(tc)->itemName) - { - Py_DECREF(GET_TC(tc)->itemName); - GET_TC(tc)->itemName = NULL; - } - Py_DECREF(GET_TC(tc)->dictObj); - PRINTMARK(); -} - -JSOBJ Dict_iterGetValue(JSOBJ obj, JSONTypeContext *tc) -{ - return GET_TC(tc)->itemValue; -} - -char *Dict_iterGetName(JSOBJ obj, JSONTypeContext *tc, size_t *outLen) -{ - *outLen = PyString_GET_SIZE(GET_TC(tc)->itemName); - return PyString_AS_STRING(GET_TC(tc)->itemName); -} - -void NpyArr_freeLabels(char** labels, npy_intp len) -{ - npy_intp i; - - if (labels) - { - for (i = 0; i < len; i++) - { - PyObject_Free(labels[i]); - } - PyObject_Free(labels); - } -} - -char** NpyArr_encodeLabels(PyArrayObject* labels, JSONObjectEncoder* enc, npy_intp num) -{ - // NOTE this function steals a reference to labels. 
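 /* Annotation, not part of the original source: each label is rendered once
    through the JSON encoder into the scratch buffer below and cached as a
    ready-made "label": prefix; quotes are added when the encoded form is not
    already a string (e.g. numeric labels), so row and column keys are never
    re-encoded per cell. */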
- PyObjectEncoder* pyenc = (PyObjectEncoder *) enc; - PyObject* item = NULL; - npy_intp i, stride, len, need_quotes; - char** ret; - char *dataptr, *cLabel, *origend, *origst, *origoffset; - char labelBuffer[NPY_JSON_BUFSIZE]; - PyArray_GetItemFunc* getitem; - int type_num; - PRINTMARK(); - - if (!labels) - { - return 0; - } - - if (PyArray_SIZE(labels) < num) - { - PyErr_SetString(PyExc_ValueError, "Label array sizes do not match corresponding data shape"); - Py_DECREF(labels); - return 0; - } - - ret = PyObject_Malloc(sizeof(char*)*num); - if (!ret) - { - PyErr_NoMemory(); - Py_DECREF(labels); - return 0; - } - - for (i = 0; i < num; i++) - { - ret[i] = NULL; - } - - origst = enc->start; - origend = enc->end; - origoffset = enc->offset; - - stride = PyArray_STRIDE(labels, 0); - dataptr = PyArray_DATA(labels); - getitem = (PyArray_GetItemFunc*) PyArray_DESCR(labels)->f->getitem; - type_num = PyArray_TYPE(labels); - - for (i = 0; i < num; i++) - { -#if NPY_API_VERSION < 0x00000007 - if(PyTypeNum_ISDATETIME(type_num)) - { - item = PyArray_ToScalar(dataptr, labels); - } - else if(PyTypeNum_ISNUMBER(type_num)) -#else - if(PyTypeNum_ISDATETIME(type_num) || PyTypeNum_ISNUMBER(type_num)) -#endif - { - item = (PyObject *) labels; - pyenc->npyType = type_num; - pyenc->npyValue = dataptr; - } - else - { - item = getitem(dataptr, labels); - if (!item) - { - NpyArr_freeLabels(ret, num); - ret = 0; - break; - } - } - - cLabel = JSON_EncodeObject(item, enc, labelBuffer, NPY_JSON_BUFSIZE); - - if (item != (PyObject *) labels) - { - Py_DECREF(item); - } - - if (PyErr_Occurred() || enc->errorMsg) - { - NpyArr_freeLabels(ret, num); - ret = 0; - break; - } - - need_quotes = ((*cLabel) != '"'); - len = enc->offset - cLabel + 1 + 2 * need_quotes; - ret[i] = PyObject_Malloc(sizeof(char)*len); - - if (!ret[i]) - { - PyErr_NoMemory(); - ret = 0; - break; - } - - if (need_quotes) - { - ret[i][0] = '"'; - memcpy(ret[i]+1, cLabel, sizeof(char)*(len-4)); - ret[i][len-3] = '"'; - } - else - { - memcpy(ret[i], cLabel, sizeof(char)*(len-2)); - } - ret[i][len-2] = ':'; - ret[i][len-1] = '\0'; - dataptr += stride; - } - - enc->start = origst; - enc->end = origend; - enc->offset = origoffset; - - Py_DECREF(labels); - return ret; -} - -void Object_invokeDefaultHandler(PyObject *obj, PyObjectEncoder *enc) -{ - PyObject *tmpObj = NULL; - PRINTMARK(); - tmpObj = PyObject_CallFunctionObjArgs(enc->defaultHandler, obj, NULL); - if (!PyErr_Occurred()) - { - if (tmpObj == NULL) - { - PyErr_SetString(PyExc_TypeError, "Failed to execute default handler"); - } - else - { - encode (tmpObj, (JSONObjectEncoder*) enc, NULL, 0); - } - } - Py_XDECREF(tmpObj); - return; -} - -void Object_beginTypeContext (JSOBJ _obj, JSONTypeContext *tc) -{ - PyObject *obj, *exc, *toDictFunc, *tmpObj, *values; - TypeContext *pc; - PyObjectEncoder *enc; - double val; - npy_int64 value; - int base; - PRINTMARK(); - - tc->prv = NULL; - - if (!_obj) { - tc->type = JT_INVALID; - return; - } - - obj = (PyObject*) _obj; - enc = (PyObjectEncoder*) tc->encoder; - - if (enc->npyType >= 0) - { - PRINTMARK(); - tc->prv = &(enc->basicTypeContext); - tc->type = NpyTypeToJSONType(obj, tc, enc->npyType, enc->npyValue); - - if (tc->type == JT_INVALID) - { - if(enc->defaultHandler) - { - enc->npyType = -1; - PRINTMARK(); - Object_invokeDefaultHandler(enc->npyCtxtPassthru->getitem(enc->npyValue, enc->npyCtxtPassthru->array), enc); - } - else - { - PyErr_Format ( - PyExc_RuntimeError, - "Unhandled numpy dtype %d", - enc->npyType); - } - } - enc->npyCtxtPassthru = NULL; - 
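 /* Annotation, not part of the original source: this branch is the raw-numpy
    fast path. NpyArr_iterNextItem stores the element's dtype and data
    pointer on the encoder (npyType/npyValue) instead of materialising a
    Python scalar, and the shared basicTypeContext avoids allocating a
    TypeContext per element. Both passthrough fields are reset before
    returning so the next value starts clean. */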
enc->npyType = -1; - return; - } - - if (PyBool_Check(obj)) - { - PRINTMARK(); - tc->type = (obj == Py_True) ? JT_TRUE : JT_FALSE; - return; - } - else - if (obj == Py_None) - { - PRINTMARK(); - tc->type = JT_NULL; - return; - } - - pc = createTypeContext(); - if (!pc) - { - tc->type = JT_INVALID; - return; - } - tc->prv = pc; - - if (PyIter_Check(obj) || (PyArray_Check(obj) && !PyArray_CheckScalar(obj) )) - { - PRINTMARK(); - goto ISITERABLE; - } - - if (PyLong_Check(obj)) - { - PRINTMARK(); - pc->PyTypeToJSON = PyLongToINT64; - tc->type = JT_LONG; - GET_TC(tc)->longValue = PyLong_AsLongLong(obj); - - exc = PyErr_Occurred(); - - if (exc && PyErr_ExceptionMatches(PyExc_OverflowError)) - { - PRINTMARK(); - goto INVALID; - } - - return; - } - else - if (PyInt_Check(obj)) - { - PRINTMARK(); - -#ifdef _LP64 - pc->PyTypeToJSON = PyIntToINT64; tc->type = JT_LONG; -#else - pc->PyTypeToJSON = PyIntToINT32; tc->type = JT_INT; -#endif - return; - } - else - if (PyFloat_Check(obj)) - { - PRINTMARK(); - val = PyFloat_AS_DOUBLE (obj); - if (npy_isnan(val) || npy_isinf(val)) - { - tc->type = JT_NULL; - } - else - { - pc->PyTypeToJSON = PyFloatToDOUBLE; tc->type = JT_DOUBLE; - } - return; - } - else - if (PyString_Check(obj)) - { - PRINTMARK(); - pc->PyTypeToJSON = PyStringToUTF8; tc->type = JT_UTF8; - return; - } - else - if (PyUnicode_Check(obj)) - { - PRINTMARK(); - pc->PyTypeToJSON = PyUnicodeToUTF8; tc->type = JT_UTF8; - return; - } - else - if (PyObject_IsInstance(obj, type_decimal)) - { - PRINTMARK(); - pc->PyTypeToJSON = PyFloatToDOUBLE; tc->type = JT_DOUBLE; - return; - } - else - if (PyDateTime_Check(obj) || PyDate_Check(obj)) - { - if (PyObject_TypeCheck(obj, cls_nat)) - { - PRINTMARK(); - tc->type = JT_NULL; - return; - } - - PRINTMARK(); - pc->PyTypeToJSON = PyDateTimeToJSON; - if (enc->datetimeIso) - { - PRINTMARK(); - tc->type = JT_UTF8; - } - else - { - PRINTMARK(); - tc->type = JT_LONG; - } - return; - } - else - if (PyTime_Check(obj)) - { - PRINTMARK(); - pc->PyTypeToJSON = PyTimeToJSON; tc->type = JT_UTF8; - return; - } - else - if (PyArray_IsScalar(obj, Datetime)) - { - PRINTMARK(); - if (((PyDatetimeScalarObject*) obj)->obval == get_nat()) { - PRINTMARK(); - tc->type = JT_NULL; - return; - } - - PRINTMARK(); - pc->PyTypeToJSON = NpyDateTimeScalarToJSON; - tc->type = enc->datetimeIso ? 
JT_UTF8 : JT_LONG; - return; - } - else - if (PyDelta_Check(obj)) - { - if (PyObject_HasAttrString(obj, "value")) - { - PRINTMARK(); - value = get_long_attr(obj, "value"); - } - else - { - PRINTMARK(); - value = total_seconds(obj) * 1000000000LL; // nanoseconds per second - } - - base = ((PyObjectEncoder*) tc->encoder)->datetimeUnit; - switch (base) - { - case PANDAS_FR_ns: - break; - case PANDAS_FR_us: - value /= 1000LL; - break; - case PANDAS_FR_ms: - value /= 1000000LL; - break; - case PANDAS_FR_s: - value /= 1000000000LL; - break; - } - - exc = PyErr_Occurred(); - - if (exc && PyErr_ExceptionMatches(PyExc_OverflowError)) - { - PRINTMARK(); - goto INVALID; - } - - if (value == get_nat()) - { - PRINTMARK(); - tc->type = JT_NULL; - return; - } - - GET_TC(tc)->longValue = value; - - PRINTMARK(); - pc->PyTypeToJSON = PyLongToINT64; - tc->type = JT_LONG; - return; - } - else - if (PyArray_IsScalar(obj, Integer)) - { - PRINTMARK(); - pc->PyTypeToJSON = PyLongToINT64; - tc->type = JT_LONG; - PyArray_CastScalarToCtype(obj, &(GET_TC(tc)->longValue), PyArray_DescrFromType(NPY_INT64)); - - exc = PyErr_Occurred(); - - if (exc && PyErr_ExceptionMatches(PyExc_OverflowError)) - { - PRINTMARK(); - goto INVALID; - } - - return; - } - else - if (PyArray_IsScalar(obj, Bool)) - { - PRINTMARK(); - PyArray_CastScalarToCtype(obj, &(GET_TC(tc)->longValue), PyArray_DescrFromType(NPY_BOOL)); - tc->type = (GET_TC(tc)->longValue) ? JT_TRUE : JT_FALSE; - return; - } - else - if (PyArray_IsScalar(obj, Float) || PyArray_IsScalar(obj, Double)) - { - PRINTMARK(); - pc->PyTypeToJSON = NpyFloatToDOUBLE; tc->type = JT_DOUBLE; - return; - } - else - if (PyArray_Check(obj) && PyArray_CheckScalar(obj)) { - tmpObj = PyObject_Repr(obj); - PyErr_Format( - PyExc_TypeError, - "%s (0d array) is not JSON serializable at the moment", - PyString_AS_STRING(tmpObj) - ); - Py_DECREF(tmpObj); - goto INVALID; - } - -ISITERABLE: - - if (PyObject_TypeCheck(obj, cls_index)) - { - if (enc->outputFormat == SPLIT) - { - PRINTMARK(); - tc->type = JT_OBJECT; - pc->iterBegin = Index_iterBegin; - pc->iterEnd = Index_iterEnd; - pc->iterNext = Index_iterNext; - pc->iterGetValue = Index_iterGetValue; - pc->iterGetName = Index_iterGetName; - return; - } - - pc->newObj = get_values(obj); - if (pc->newObj) - { - PRINTMARK(); - tc->type = JT_ARRAY; - pc->iterBegin = NpyArr_iterBegin; - pc->iterEnd = NpyArr_iterEnd; - pc->iterNext = NpyArr_iterNext; - pc->iterGetValue = NpyArr_iterGetValue; - pc->iterGetName = NpyArr_iterGetName; - } - else - { - goto INVALID; - } - - return; - } - else - if (PyObject_TypeCheck(obj, cls_series)) - { - if (enc->outputFormat == SPLIT) - { - PRINTMARK(); - tc->type = JT_OBJECT; - pc->iterBegin = Series_iterBegin; - pc->iterEnd = Series_iterEnd; - pc->iterNext = Series_iterNext; - pc->iterGetValue = Series_iterGetValue; - pc->iterGetName = Series_iterGetName; - return; - } - - pc->newObj = get_values(obj); - if (!pc->newObj) - { - goto INVALID; - } - - if (enc->outputFormat == INDEX || enc->outputFormat == COLUMNS) - { - PRINTMARK(); - tc->type = JT_OBJECT; - tmpObj = PyObject_GetAttrString(obj, "index"); - if (!tmpObj) - { - goto INVALID; - } - values = get_values(tmpObj); - Py_DECREF(tmpObj); - if (!values) - { - goto INVALID; - } - pc->columnLabelsLen = PyArray_DIM(pc->newObj, 0); - pc->columnLabels = NpyArr_encodeLabels((PyArrayObject*) values, (JSONObjectEncoder*) enc, pc->columnLabelsLen); - if (!pc->columnLabels) - { - goto INVALID; - } - } - else - { - PRINTMARK(); - tc->type = JT_ARRAY; - } - pc->iterBegin = 
NpyArr_iterBegin; - pc->iterEnd = NpyArr_iterEnd; - pc->iterNext = NpyArr_iterNext; - pc->iterGetValue = NpyArr_iterGetValue; - pc->iterGetName = NpyArr_iterGetName; - return; - } - else - if (PyArray_Check(obj)) - { - if (enc->npyCtxtPassthru) - { - PRINTMARK(); - pc->npyarr = enc->npyCtxtPassthru; - tc->type = (pc->npyarr->columnLabels ? JT_OBJECT : JT_ARRAY); - - pc->iterBegin = NpyArrPassThru_iterBegin; - pc->iterNext = NpyArr_iterNext; - pc->iterEnd = NpyArrPassThru_iterEnd; - pc->iterGetValue = NpyArr_iterGetValue; - pc->iterGetName = NpyArr_iterGetName; - - enc->npyCtxtPassthru = NULL; - return; - } - - PRINTMARK(); - tc->type = JT_ARRAY; - pc->iterBegin = NpyArr_iterBegin; - pc->iterEnd = NpyArr_iterEnd; - pc->iterNext = NpyArr_iterNext; - pc->iterGetValue = NpyArr_iterGetValue; - pc->iterGetName = NpyArr_iterGetName; - return; - } - else - if (PyObject_TypeCheck(obj, cls_dataframe)) - { - if (enc->blkCtxtPassthru) - { - PRINTMARK(); - pc->pdblock = enc->blkCtxtPassthru; - tc->type = (pc->pdblock->npyCtxts[0]->columnLabels ? JT_OBJECT : JT_ARRAY); - - pc->iterBegin = PdBlockPassThru_iterBegin; - pc->iterEnd = PdBlockPassThru_iterEnd; - pc->iterNext = PdBlock_iterNextItem; - pc->iterGetName = PdBlock_iterGetName; - pc->iterGetValue = NpyArr_iterGetValue; - - enc->blkCtxtPassthru = NULL; - return; - } - - if (enc->outputFormat == SPLIT) - { - PRINTMARK(); - tc->type = JT_OBJECT; - pc->iterBegin = DataFrame_iterBegin; - pc->iterEnd = DataFrame_iterEnd; - pc->iterNext = DataFrame_iterNext; - pc->iterGetValue = DataFrame_iterGetValue; - pc->iterGetName = DataFrame_iterGetName; - return; - } - - PRINTMARK(); - if (is_simple_frame(obj)) - { - pc->iterBegin = NpyArr_iterBegin; - pc->iterEnd = NpyArr_iterEnd; - pc->iterNext = NpyArr_iterNext; - pc->iterGetName = NpyArr_iterGetName; - - pc->newObj = get_values(obj); - if (!pc->newObj) - { - goto INVALID; - } - } - else - { - pc->iterBegin = PdBlock_iterBegin; - pc->iterEnd = PdBlock_iterEnd; - pc->iterNext = PdBlock_iterNext; - pc->iterGetName = PdBlock_iterGetName; - } - pc->iterGetValue = NpyArr_iterGetValue; - - if (enc->outputFormat == VALUES) - { - PRINTMARK(); - tc->type = JT_ARRAY; - } - else - if (enc->outputFormat == RECORDS) - { - PRINTMARK(); - tc->type = JT_ARRAY; - tmpObj = PyObject_GetAttrString(obj, "columns"); - if (!tmpObj) - { - goto INVALID; - } - values = get_values(tmpObj); - if (!values) - { - Py_DECREF(tmpObj); - goto INVALID; - } - pc->columnLabelsLen = PyObject_Size(tmpObj); - pc->columnLabels = NpyArr_encodeLabels((PyArrayObject*) values, (JSONObjectEncoder*) enc, pc->columnLabelsLen); - Py_DECREF(tmpObj); - if (!pc->columnLabels) - { - goto INVALID; - } - } - else - if (enc->outputFormat == INDEX || enc->outputFormat == COLUMNS) - { - PRINTMARK(); - tc->type = JT_OBJECT; - tmpObj = (enc->outputFormat == INDEX ? PyObject_GetAttrString(obj, "index") : PyObject_GetAttrString(obj, "columns")); - if (!tmpObj) - { - goto INVALID; - } - values = get_values(tmpObj); - if (!values) - { - Py_DECREF(tmpObj); - goto INVALID; - } - pc->rowLabelsLen = PyObject_Size(tmpObj); - pc->rowLabels = NpyArr_encodeLabels((PyArrayObject*) values, (JSONObjectEncoder*) enc, pc->rowLabelsLen); - Py_DECREF(tmpObj); - tmpObj = (enc->outputFormat == INDEX ? 
PyObject_GetAttrString(obj, "columns") : PyObject_GetAttrString(obj, "index")); - if (!tmpObj) - { - NpyArr_freeLabels(pc->rowLabels, pc->rowLabelsLen); - pc->rowLabels = NULL; - goto INVALID; - } - values = get_values(tmpObj); - if (!values) - { - Py_DECREF(tmpObj); - NpyArr_freeLabels(pc->rowLabels, pc->rowLabelsLen); - pc->rowLabels = NULL; - goto INVALID; - } - pc->columnLabelsLen = PyObject_Size(tmpObj); - pc->columnLabels = NpyArr_encodeLabels((PyArrayObject*) values, (JSONObjectEncoder*) enc, pc->columnLabelsLen); - Py_DECREF(tmpObj); - if (!pc->columnLabels) - { - NpyArr_freeLabels(pc->rowLabels, pc->rowLabelsLen); - pc->rowLabels = NULL; - goto INVALID; - } - - if (enc->outputFormat == COLUMNS) - { - PRINTMARK(); - pc->transpose = 1; - } - } - else - { - goto INVALID; - } - return; - } - else - if (PyDict_Check(obj)) - { - PRINTMARK(); - tc->type = JT_OBJECT; - pc->iterBegin = Dict_iterBegin; - pc->iterEnd = Dict_iterEnd; - pc->iterNext = Dict_iterNext; - pc->iterGetValue = Dict_iterGetValue; - pc->iterGetName = Dict_iterGetName; - pc->dictObj = obj; - Py_INCREF(obj); - - return; - } - else - if (PyList_Check(obj)) - { - PRINTMARK(); - tc->type = JT_ARRAY; - pc->iterBegin = List_iterBegin; - pc->iterEnd = List_iterEnd; - pc->iterNext = List_iterNext; - pc->iterGetValue = List_iterGetValue; - pc->iterGetName = List_iterGetName; - return; - } - else - if (PyTuple_Check(obj)) - { - PRINTMARK(); - tc->type = JT_ARRAY; - pc->iterBegin = Tuple_iterBegin; - pc->iterEnd = Tuple_iterEnd; - pc->iterNext = Tuple_iterNext; - pc->iterGetValue = Tuple_iterGetValue; - pc->iterGetName = Tuple_iterGetName; - return; - } - else - if (PyAnySet_Check(obj)) - { - PRINTMARK(); - tc->type = JT_ARRAY; - pc->iterBegin = Iter_iterBegin; - pc->iterEnd = Iter_iterEnd; - pc->iterNext = Iter_iterNext; - pc->iterGetValue = Iter_iterGetValue; - pc->iterGetName = Iter_iterGetName; - return; - } - - toDictFunc = PyObject_GetAttrString(obj, "toDict"); - - if (toDictFunc) - { - PyObject* tuple = PyTuple_New(0); - PyObject* toDictResult = PyObject_Call(toDictFunc, tuple, NULL); - Py_DECREF(tuple); - Py_DECREF(toDictFunc); - - if (toDictResult == NULL) - { - PyErr_Clear(); - tc->type = JT_NULL; - return; - } - - if (!PyDict_Check(toDictResult)) - { - Py_DECREF(toDictResult); - tc->type = JT_NULL; - return; - } - - PRINTMARK(); - tc->type = JT_OBJECT; - pc->iterBegin = Dict_iterBegin; - pc->iterEnd = Dict_iterEnd; - pc->iterNext = Dict_iterNext; - pc->iterGetValue = Dict_iterGetValue; - pc->iterGetName = Dict_iterGetName; - pc->dictObj = toDictResult; - return; - } - - PyErr_Clear(); - - if (enc->defaultHandler) - { - Object_invokeDefaultHandler(obj, enc); - goto INVALID; - } - - PRINTMARK(); - tc->type = JT_OBJECT; - pc->iterBegin = Dir_iterBegin; - pc->iterEnd = Dir_iterEnd; - pc->iterNext = Dir_iterNext; - pc->iterGetValue = Dir_iterGetValue; - pc->iterGetName = Dir_iterGetName; - return; - -INVALID: - tc->type = JT_INVALID; - PyObject_Free(tc->prv); - tc->prv = NULL; - return; -} - -void Object_endTypeContext(JSOBJ obj, JSONTypeContext *tc) -{ - PRINTMARK(); - if(tc->prv) - { - Py_XDECREF(GET_TC(tc)->newObj); - GET_TC(tc)->newObj = NULL; - NpyArr_freeLabels(GET_TC(tc)->rowLabels, GET_TC(tc)->rowLabelsLen); - GET_TC(tc)->rowLabels = NULL; - NpyArr_freeLabels(GET_TC(tc)->columnLabels, GET_TC(tc)->columnLabelsLen); - GET_TC(tc)->columnLabels = NULL; - - PyObject_Free(GET_TC(tc)->cStr); - GET_TC(tc)->cStr = NULL; - if (tc->prv != &(((PyObjectEncoder*) tc->encoder)->basicTypeContext)) - { - PyObject_Free(tc->prv); - } - 
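
The toDict branch above gives arbitrary objects an opt-in hook: if an object exposes a toDict() method that returns a dict, it is encoded as a JSON object through the Dict_iter* callbacks; otherwise, absent a default handler, the encoder falls back to the Dir_iter* attribute walk. A sketch, assuming the extension module is importable as pandas.json, as in this era's layout (the Point class is hypothetical):

```python
import pandas.json as ujson  # assumed import path for this era

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def toDict(self):
        # Picked up by the encoder's toDict branch above.
        return {"x": self.x, "y": self.y}

print(ujson.dumps(Point(1, 2)))  # expected: {"x":1,"y":2}
```
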
tc->prv = NULL; - } -} - -const char *Object_getStringValue(JSOBJ obj, JSONTypeContext *tc, size_t *_outLen) -{ - return GET_TC(tc)->PyTypeToJSON (obj, tc, NULL, _outLen); -} - -JSINT64 Object_getLongValue(JSOBJ obj, JSONTypeContext *tc) -{ - JSINT64 ret; - GET_TC(tc)->PyTypeToJSON (obj, tc, &ret, NULL); - return ret; -} - -JSINT32 Object_getIntValue(JSOBJ obj, JSONTypeContext *tc) -{ - JSINT32 ret; - GET_TC(tc)->PyTypeToJSON (obj, tc, &ret, NULL); - return ret; -} - -double Object_getDoubleValue(JSOBJ obj, JSONTypeContext *tc) -{ - double ret; - GET_TC(tc)->PyTypeToJSON (obj, tc, &ret, NULL); - return ret; -} - -static void Object_releaseObject(JSOBJ _obj) -{ - Py_DECREF( (PyObject *) _obj); -} - -void Object_iterBegin(JSOBJ obj, JSONTypeContext *tc) -{ - GET_TC(tc)->iterBegin(obj, tc); -} - -int Object_iterNext(JSOBJ obj, JSONTypeContext *tc) -{ - return GET_TC(tc)->iterNext(obj, tc); -} - -void Object_iterEnd(JSOBJ obj, JSONTypeContext *tc) -{ - GET_TC(tc)->iterEnd(obj, tc); -} - -JSOBJ Object_iterGetValue(JSOBJ obj, JSONTypeContext *tc) -{ - return GET_TC(tc)->iterGetValue(obj, tc); -} - -char *Object_iterGetName(JSOBJ obj, JSONTypeContext *tc, size_t *outLen) -{ - return GET_TC(tc)->iterGetName(obj, tc, outLen); -} - -PyObject* objToJSON(PyObject* self, PyObject *args, PyObject *kwargs) -{ - static char *kwlist[] = { "obj", "ensure_ascii", "double_precision", "encode_html_chars", "orient", "date_unit", "iso_dates", "default_handler", NULL}; - - char buffer[65536]; - char *ret; - PyObject *newobj; - PyObject *oinput = NULL; - PyObject *oensureAscii = NULL; - int idoublePrecision = 10; // default double precision setting - PyObject *oencodeHTMLChars = NULL; - char *sOrient = NULL; - char *sdateFormat = NULL; - PyObject *oisoDates = 0; - PyObject *odefHandler = 0; - - PyObjectEncoder pyEncoder = - { - { - Object_beginTypeContext, - Object_endTypeContext, - Object_getStringValue, - Object_getLongValue, - Object_getIntValue, - Object_getDoubleValue, - Object_iterBegin, - Object_iterNext, - Object_iterEnd, - Object_iterGetValue, - Object_iterGetName, - Object_releaseObject, - PyObject_Malloc, - PyObject_Realloc, - PyObject_Free, - -1, //recursionMax - idoublePrecision, - 1, //forceAscii - 0, //encodeHTMLChars - } - }; - JSONObjectEncoder* encoder = (JSONObjectEncoder*) &pyEncoder; - - pyEncoder.npyCtxtPassthru = NULL; - pyEncoder.blkCtxtPassthru = NULL; - pyEncoder.npyType = -1; - pyEncoder.npyValue = NULL; - pyEncoder.datetimeIso = 0; - pyEncoder.datetimeUnit = PANDAS_FR_ms; - pyEncoder.outputFormat = COLUMNS; - pyEncoder.defaultHandler = 0; - pyEncoder.basicTypeContext.newObj = NULL; - pyEncoder.basicTypeContext.dictObj = NULL; - pyEncoder.basicTypeContext.itemValue = NULL; - pyEncoder.basicTypeContext.itemName = NULL; - pyEncoder.basicTypeContext.attrList = NULL; - pyEncoder.basicTypeContext.iterator = NULL; - pyEncoder.basicTypeContext.cStr = NULL; - pyEncoder.basicTypeContext.npyarr = NULL; - pyEncoder.basicTypeContext.rowLabels = NULL; - pyEncoder.basicTypeContext.columnLabels = NULL; - - PRINTMARK(); - - if (!PyArg_ParseTupleAndKeywords(args, kwargs, "O|OiOssOO", kwlist, &oinput, &oensureAscii, &idoublePrecision, &oencodeHTMLChars, &sOrient, &sdateFormat, &oisoDates, &odefHandler)) - { - return NULL; - } - - if (oensureAscii != NULL && !PyObject_IsTrue(oensureAscii)) - { - encoder->forceASCII = 0; - } - - if (oencodeHTMLChars != NULL && PyObject_IsTrue(oencodeHTMLChars)) - { - encoder->encodeHTMLChars = 1; - } - - if (idoublePrecision > JSON_DOUBLE_MAX_DECIMALS || idoublePrecision < 
0) - { - PyErr_Format ( - PyExc_ValueError, - "Invalid value '%d' for option 'double_precision', max is '%u'", - idoublePrecision, - JSON_DOUBLE_MAX_DECIMALS); - return NULL; - } - encoder->doublePrecision = idoublePrecision; - - if (sOrient != NULL) - { - if (strcmp(sOrient, "records") == 0) - { - pyEncoder.outputFormat = RECORDS; - } - else - if (strcmp(sOrient, "index") == 0) - { - pyEncoder.outputFormat = INDEX; - } - else - if (strcmp(sOrient, "split") == 0) - { - pyEncoder.outputFormat = SPLIT; - } - else - if (strcmp(sOrient, "values") == 0) - { - pyEncoder.outputFormat = VALUES; - } - else - if (strcmp(sOrient, "columns") != 0) - { - PyErr_Format (PyExc_ValueError, "Invalid value '%s' for option 'orient'", sOrient); - return NULL; - } - } - - if (sdateFormat != NULL) - { - if (strcmp(sdateFormat, "s") == 0) - { - pyEncoder.datetimeUnit = PANDAS_FR_s; - } - else - if (strcmp(sdateFormat, "ms") == 0) - { - pyEncoder.datetimeUnit = PANDAS_FR_ms; - } - else - if (strcmp(sdateFormat, "us") == 0) - { - pyEncoder.datetimeUnit = PANDAS_FR_us; - } - else - if (strcmp(sdateFormat, "ns") == 0) - { - pyEncoder.datetimeUnit = PANDAS_FR_ns; - } - else - { - PyErr_Format (PyExc_ValueError, "Invalid value '%s' for option 'date_unit'", sdateFormat); - return NULL; - } - } - - if (oisoDates != NULL && PyObject_IsTrue(oisoDates)) - { - pyEncoder.datetimeIso = 1; - } - - - if (odefHandler != NULL && odefHandler != Py_None) - { - if (!PyCallable_Check(odefHandler)) - { - PyErr_SetString (PyExc_TypeError, "Default handler is not callable"); - return NULL; - } - pyEncoder.defaultHandler = odefHandler; - } - - pyEncoder.originalOutputFormat = pyEncoder.outputFormat; - PRINTMARK(); - ret = JSON_EncodeObject (oinput, encoder, buffer, sizeof (buffer)); - PRINTMARK(); - - if (PyErr_Occurred()) - { - PRINTMARK(); - return NULL; - } - - if (encoder->errorMsg) - { - PRINTMARK(); - if (ret != buffer) - { - encoder->free (ret); - } - - PyErr_Format (PyExc_OverflowError, "%s", encoder->errorMsg); - return NULL; - } - - newobj = PyString_FromString (ret); - - if (ret != buffer) - { - encoder->free (ret); - } - - PRINTMARK(); - - return newobj; -} - -PyObject* objToJSONFile(PyObject* self, PyObject *args, PyObject *kwargs) -{ - PyObject *data; - PyObject *file; - PyObject *string; - PyObject *write; - PyObject *argtuple; - - PRINTMARK(); - - if (!PyArg_ParseTuple (args, "OO", &data, &file)) - { - return NULL; - } - - if (!PyObject_HasAttrString (file, "write")) - { - PyErr_Format (PyExc_TypeError, "expected file"); - return NULL; - } - - write = PyObject_GetAttrString (file, "write"); - - if (!PyCallable_Check (write)) - { - Py_XDECREF(write); - PyErr_Format (PyExc_TypeError, "expected file"); - return NULL; - } - - argtuple = PyTuple_Pack(1, data); - - string = objToJSON (self, argtuple, kwargs); - - if (string == NULL) - { - Py_XDECREF(write); - Py_XDECREF(argtuple); - return NULL; - } - - Py_XDECREF(argtuple); - - argtuple = PyTuple_Pack (1, string); - if (argtuple == NULL) - { - Py_XDECREF(write); - return NULL; - } - if (PyObject_CallObject (write, argtuple) == NULL) - { - Py_XDECREF(write); - Py_XDECREF(argtuple); - return NULL; - } - - Py_XDECREF(write); - Py_DECREF(argtuple); - Py_XDECREF(string); - - PRINTMARK(); - - Py_RETURN_NONE; -} diff --git a/pandas/src/ujson/python/ujson.c b/pandas/src/ujson/python/ujson.c deleted file mode 100644 index 48ea92ed3bc8c..0000000000000 --- a/pandas/src/ujson/python/ujson.c +++ /dev/null @@ -1,112 +0,0 @@ -/* -Copyright (c) 2011-2013, ESN Social Software AB and Jonas 
Tarnstrom
-All rights reserved.
-
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are met:
-* Redistributions of source code must retain the above copyright
-notice, this list of conditions and the following disclaimer.
-* Redistributions in binary form must reproduce the above copyright
-notice, this list of conditions and the following disclaimer in the
-documentation and/or other materials provided with the distribution.
-* Neither the name of the ESN Social Software AB nor the
-names of its contributors may be used to endorse or promote products
-derived from this software without specific prior written permission.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-DISCLAIMED. IN NO EVENT SHALL ESN SOCIAL SOFTWARE AB OR JONAS TARNSTROM BE LIABLE
-FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
-
-Portions of code from MODP_ASCII - Ascii transformations (upper/lower, etc)
-https://github.com/client9/stringencoders
-Copyright (c) 2007 Nick Galbreath -- nickg [at] modp [dot] com. All rights reserved.
-
-Numeric decoder derived from TCL library
-http://www.opensource.apple.com/source/tcl/tcl-14/tcl/license.terms
-* Copyright (c) 1988-1993 The Regents of the University of California.
-* Copyright (c) 1994 Sun Microsystems, Inc.
-*/
-
-#include "py_defines.h"
-#include "version.h"
-
-/* objToJSON */
-PyObject* objToJSON(PyObject* self, PyObject *args, PyObject *kwargs);
-void initObjToJSON(void);
-
-/* JSONToObj */
-PyObject* JSONToObj(PyObject* self, PyObject *args, PyObject *kwargs);
-
-/* objToJSONFile */
-PyObject* objToJSONFile(PyObject* self, PyObject *args, PyObject *kwargs);
-
-/* JSONFileToObj */
-PyObject* JSONFileToObj(PyObject* self, PyObject *args, PyObject *kwargs);
-
-
-#define ENCODER_HELP_TEXT "Use ensure_ascii=false to output UTF-8. Pass in double_precision to alter the maximum digit precision of doubles. Set encode_html_chars=True to encode < > & as unicode escape sequences."
-
-static PyMethodDef ujsonMethods[] = {
-  {"encode", (PyCFunction) objToJSON, METH_VARARGS | METH_KEYWORDS, "Converts arbitrary object recursively into JSON. " ENCODER_HELP_TEXT},
-  {"decode", (PyCFunction) JSONToObj, METH_VARARGS | METH_KEYWORDS, "Converts JSON as string to dict object structure. Use precise_float=True to use high precision float decoder."},
-  {"dumps", (PyCFunction) objToJSON, METH_VARARGS | METH_KEYWORDS, "Converts arbitrary object recursively into JSON. " ENCODER_HELP_TEXT},
-  {"loads", (PyCFunction) JSONToObj, METH_VARARGS | METH_KEYWORDS, "Converts JSON as string to dict object structure. Use precise_float=True to use high precision float decoder."},
-  {"dump", (PyCFunction) objToJSONFile, METH_VARARGS | METH_KEYWORDS, "Converts arbitrary object recursively into JSON file. 
" ENCODER_HELP_TEXT}, - {"load", (PyCFunction) JSONFileToObj, METH_VARARGS | METH_KEYWORDS, "Converts JSON as file to dict object structure. Use precise_float=True to use high precision float decoder."}, - {NULL, NULL, 0, NULL} /* Sentinel */ -}; - -#if PY_MAJOR_VERSION >= 3 - -static struct PyModuleDef moduledef = { - PyModuleDef_HEAD_INIT, - "_pandasujson", - 0, /* m_doc */ - -1, /* m_size */ - ujsonMethods, /* m_methods */ - NULL, /* m_reload */ - NULL, /* m_traverse */ - NULL, /* m_clear */ - NULL /* m_free */ -}; - -#define PYMODINITFUNC PyMODINIT_FUNC PyInit_json(void) -#define PYMODULE_CREATE() PyModule_Create(&moduledef) -#define MODINITERROR return NULL - -#else - -#define PYMODINITFUNC PyMODINIT_FUNC initjson(void) -#define PYMODULE_CREATE() Py_InitModule("json", ujsonMethods) -#define MODINITERROR return - -#endif - -PYMODINITFUNC -{ - PyObject *module; - PyObject *version_string; - - initObjToJSON(); - module = PYMODULE_CREATE(); - - if (module == NULL) - { - MODINITERROR; - } - - version_string = PyString_FromString (UJSON_VERSION); - PyModule_AddObject (module, "__version__", version_string); - -#if PY_MAJOR_VERSION >= 3 - return module; -#endif -} diff --git a/pandas/stats/api.py b/pandas/stats/api.py index fd81b875faa91..2a11456d4f9e5 100644 --- a/pandas/stats/api.py +++ b/pandas/stats/api.py @@ -2,10 +2,6 @@ Common namespace of statistical functions """ -# pylint: disable-msg=W0611,W0614,W0401 - # flake8: noqa from pandas.stats.moments import * -from pandas.stats.interface import ols -from pandas.stats.fama_macbeth import fama_macbeth diff --git a/pandas/stats/common.py b/pandas/stats/common.py deleted file mode 100644 index be3b842e93cc8..0000000000000 --- a/pandas/stats/common.py +++ /dev/null @@ -1,45 +0,0 @@ - -_WINDOW_TYPES = { - 0: 'full_sample', - 1: 'rolling', - 2: 'expanding' -} -# also allow 'rolling' as key -_WINDOW_TYPES.update((v, v) for k, v in list(_WINDOW_TYPES.items())) -_ADDITIONAL_CLUSTER_TYPES = set(("entity", "time")) - - -def _get_cluster_type(cluster_type): - # this was previous behavior - if cluster_type is None: - return cluster_type - try: - return _get_window_type(cluster_type) - except ValueError: - final_type = str(cluster_type).lower().replace("_", " ") - if final_type in _ADDITIONAL_CLUSTER_TYPES: - return final_type - raise ValueError('Unrecognized cluster type: %s' % cluster_type) - - -def _get_window_type(window_type): - # e.g., 0, 1, 2 - final_type = _WINDOW_TYPES.get(window_type) - # e.g., 'full_sample' - final_type = final_type or _WINDOW_TYPES.get( - str(window_type).lower().replace(" ", "_")) - if final_type is None: - raise ValueError('Unrecognized window type: %s' % window_type) - return final_type - - -def banner(text, width=80): - """ - - """ - toFill = width - len(text) - - left = toFill // 2 - right = toFill - left - - return '%s%s%s' % ('-' * left, text, '-' * right) diff --git a/pandas/stats/fama_macbeth.py b/pandas/stats/fama_macbeth.py deleted file mode 100644 index f7d50e8e72a5c..0000000000000 --- a/pandas/stats/fama_macbeth.py +++ /dev/null @@ -1,240 +0,0 @@ -from pandas.core.base import StringMixin -from pandas.compat import StringIO, range - -import numpy as np - -from pandas.core.api import Series, DataFrame -import pandas.stats.common as common -from pandas.util.decorators import cache_readonly - -# flake8: noqa - -def fama_macbeth(**kwargs): - """Runs Fama-MacBeth regression. 
- - Parameters - ---------- - Takes the same arguments as a panel OLS, in addition to: - - nw_lags_beta: int - Newey-West adjusts the betas by the given lags - """ - window_type = kwargs.get('window_type') - if window_type is None: - klass = FamaMacBeth - else: - klass = MovingFamaMacBeth - - return klass(**kwargs) - - -class FamaMacBeth(StringMixin): - - def __init__(self, y, x, intercept=True, nw_lags=None, - nw_lags_beta=None, - entity_effects=False, time_effects=False, x_effects=None, - cluster=None, dropped_dummies=None, verbose=False): - import warnings - warnings.warn("The pandas.stats.fama_macbeth module is deprecated and will be " - "removed in a future version. We refer to external packages " - "like statsmodels, see here: " - "http://www.statsmodels.org/stable/index.html", - FutureWarning, stacklevel=4) - - if dropped_dummies is None: - dropped_dummies = {} - self._nw_lags_beta = nw_lags_beta - - from pandas.stats.plm import MovingPanelOLS - self._ols_result = MovingPanelOLS( - y=y, x=x, window_type='rolling', window=1, - intercept=intercept, - nw_lags=nw_lags, entity_effects=entity_effects, - time_effects=time_effects, x_effects=x_effects, cluster=cluster, - dropped_dummies=dropped_dummies, verbose=verbose) - - self._cols = self._ols_result._x.columns - - @cache_readonly - def _beta_raw(self): - return self._ols_result._beta_raw - - @cache_readonly - def _stats(self): - return _calc_t_stat(self._beta_raw, self._nw_lags_beta) - - @cache_readonly - def _mean_beta_raw(self): - return self._stats[0] - - @cache_readonly - def _std_beta_raw(self): - return self._stats[1] - - @cache_readonly - def _t_stat_raw(self): - return self._stats[2] - - def _make_result(self, result): - return Series(result, index=self._cols) - - @cache_readonly - def mean_beta(self): - return self._make_result(self._mean_beta_raw) - - @cache_readonly - def std_beta(self): - return self._make_result(self._std_beta_raw) - - @cache_readonly - def t_stat(self): - return self._make_result(self._t_stat_raw) - - @cache_readonly - def _results(self): - return { - 'mean_beta': self._mean_beta_raw, - 'std_beta': self._std_beta_raw, - 't_stat': self._t_stat_raw, - } - - @cache_readonly - def _coef_table(self): - buffer = StringIO() - buffer.write('%13s %13s %13s %13s %13s %13s\n' % - ('Variable', 'Beta', 'Std Err', 't-stat', 'CI 2.5%', 'CI 97.5%')) - template = '%13s %13.4f %13.4f %13.2f %13.4f %13.4f\n' - - for i, name in enumerate(self._cols): - if i and not (i % 5): - buffer.write('\n' + common.banner('')) - - mean_beta = self._results['mean_beta'][i] - std_beta = self._results['std_beta'][i] - t_stat = self._results['t_stat'][i] - ci1 = mean_beta - 1.96 * std_beta - ci2 = mean_beta + 1.96 * std_beta - - values = '(%s)' % name, mean_beta, std_beta, t_stat, ci1, ci2 - - buffer.write(template % values) - - if self._nw_lags_beta is not None: - buffer.write('\n') - buffer.write('*** The Std Err, t-stat are Newey-West ' - 'adjusted with Lags %5d\n' % self._nw_lags_beta) - - return buffer.getvalue() - - def __unicode__(self): - return self.summary - - @cache_readonly - def summary(self): - template = """ -----------------------Summary of Fama-MacBeth Analysis------------------------- - -Formula: Y ~ %(formulaRHS)s -# betas : %(nu)3d - -----------------------Summary of Estimated Coefficients------------------------ -%(coefTable)s ---------------------------------End of Summary--------------------------------- -""" - params = { - 'formulaRHS': ' + '.join(self._cols), - 'nu': len(self._beta_raw), - 'coefTable': 
self._coef_table, - } - - return template % params - - -class MovingFamaMacBeth(FamaMacBeth): - - def __init__(self, y, x, window_type='rolling', window=10, - intercept=True, nw_lags=None, nw_lags_beta=None, - entity_effects=False, time_effects=False, x_effects=None, - cluster=None, dropped_dummies=None, verbose=False): - if dropped_dummies is None: - dropped_dummies = {} - self._window_type = common._get_window_type(window_type) - self._window = window - - FamaMacBeth.__init__( - self, y=y, x=x, intercept=intercept, - nw_lags=nw_lags, nw_lags_beta=nw_lags_beta, - entity_effects=entity_effects, time_effects=time_effects, - x_effects=x_effects, cluster=cluster, - dropped_dummies=dropped_dummies, verbose=verbose) - - self._index = self._ols_result._index - self._T = len(self._index) - - @property - def _is_rolling(self): - return self._window_type == 'rolling' - - def _calc_stats(self): - mean_betas = [] - std_betas = [] - t_stats = [] - - # XXX - - mask = self._ols_result._rolling_ols_call[2] - obs_total = mask.astype(int).cumsum() - - start = self._window - 1 - betas = self._beta_raw - for i in range(start, self._T): - if self._is_rolling: - begin = i - start - else: - begin = 0 - - B = betas[max(obs_total[begin] - 1, 0): obs_total[i]] - mean_beta, std_beta, t_stat = _calc_t_stat(B, self._nw_lags_beta) - mean_betas.append(mean_beta) - std_betas.append(std_beta) - t_stats.append(t_stat) - - return np.array([mean_betas, std_betas, t_stats]) - - _stats = cache_readonly(_calc_stats) - - def _make_result(self, result): - return DataFrame(result, index=self._result_index, columns=self._cols) - - @cache_readonly - def _result_index(self): - mask = self._ols_result._rolling_ols_call[2] - # HACK XXX - return self._index[mask.cumsum() >= self._window] - - @cache_readonly - def _results(self): - return { - 'mean_beta': self._mean_beta_raw[-1], - 'std_beta': self._std_beta_raw[-1], - 't_stat': self._t_stat_raw[-1], - } - - -def _calc_t_stat(beta, nw_lags_beta): - N = len(beta) - B = beta - beta.mean(0) - C = np.dot(B.T, B) / N - - if nw_lags_beta is not None: - for i in range(nw_lags_beta + 1): - - cov = np.dot(B[i:].T, B[:(N - i)]) / N - weight = i / (nw_lags_beta + 1) - C += 2 * (1 - weight) * cov - - mean_beta = beta.mean(0) - std_beta = np.sqrt(np.diag(C)) / np.sqrt(N) - t_stat = mean_beta / std_beta - - return mean_beta, std_beta, t_stat diff --git a/pandas/stats/interface.py b/pandas/stats/interface.py deleted file mode 100644 index caf468b4f85fe..0000000000000 --- a/pandas/stats/interface.py +++ /dev/null @@ -1,143 +0,0 @@ -from pandas.core.api import Series, DataFrame, Panel, MultiIndex -from pandas.stats.ols import OLS, MovingOLS -from pandas.stats.plm import PanelOLS, MovingPanelOLS, NonPooledPanelOLS -import pandas.stats.common as common - - -def ols(**kwargs): - """Returns the appropriate OLS object depending on whether you need - simple or panel OLS, and a full-sample or rolling/expanding OLS. 
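
The deleted _calc_t_stat above is the statistical core of the fama_macbeth module: a Fama-MacBeth t-statistic is the time-series mean of the per-period betas divided by the standard error of that mean, optionally with Bartlett-weighted autocovariance terms. A standalone sketch in the textbook Newey-West form (note the deleted loop differs in detail: it starts at lag 0 and symmetrizes by doubling):

```python
import numpy as np

def fm_tstat(betas, nw_lags=None):
    """t-stats for the mean of per-period betas of shape (T, K)."""
    betas = np.asarray(betas, dtype=float)
    n = len(betas)
    demeaned = betas - betas.mean(0)
    cov = demeaned.T @ demeaned / n
    if nw_lags is not None:
        # Bartlett-weighted autocovariances (standard Newey-West).
        for lag in range(1, nw_lags + 1):
            auto = demeaned[lag:].T @ demeaned[:-lag] / n
            cov += (1 - lag / (nw_lags + 1)) * (auto + auto.T)
    std = np.sqrt(np.diag(cov)) / np.sqrt(n)
    return betas.mean(0) / std
```
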
-
-    Will be a normal linear regression or a (pooled) panel regression depending
-    on the type of the inputs:
-
-    y : Series, x : DataFrame -> OLS
-    y : Series, x : dict of DataFrame -> OLS
-    y : DataFrame, x : DataFrame -> PanelOLS
-    y : DataFrame, x : dict of DataFrame/Panel -> PanelOLS
-    y : Series with MultiIndex, x : Panel/DataFrame + MultiIndex -> PanelOLS
-
-    Parameters
-    ----------
-    y: Series or DataFrame
-        See above for types
-    x: Series, DataFrame, dict of Series, dict of DataFrame, Panel
-    weights : Series or ndarray
-        The weights are presumed to be (proportional to) the inverse of the
-        variance of the observations. That is, if the variables are to be
-        transformed by 1/sqrt(W) you must supply weights = 1/W
-    intercept: bool
-        True if you want an intercept. Defaults to True.
-    nw_lags: None or int
-        Number of Newey-West lags. Defaults to None.
-    nw_overlap: bool
-        Whether there are overlaps in the NW lags. Defaults to False.
-    window_type: {'full sample', 'rolling', 'expanding'}
-        'full sample' by default
-    window: int
-        Size of window (for rolling/expanding OLS). If a window is passed and
-        no explicit window_type, 'rolling' will be used as the window_type
-
-    Panel OLS options:
-        pool: bool
-            Whether to run pooled panel regression. Defaults to True.
-        entity_effects: bool
-            Whether to account for entity fixed effects. Defaults to False.
-        time_effects: bool
-            Whether to account for time fixed effects. Defaults to False.
-        x_effects: list
-            List of x's to account for fixed effects. Defaults to None.
-        dropped_dummies: dict
-            Key is the name of the variable for the fixed effect.
-            Value is the value of that variable for which we drop the dummy.
-
-            For entity fixed effects, key equals 'entity'.
-
-            By default, the first dummy is dropped if no dummy is specified.
-        cluster: {'time', 'entity'}
-            Cluster variances
-
-    Examples
-    --------
-    # Run simple OLS.
-    result = ols(y=y, x=x)
-
-    # Run rolling simple OLS with window of size 10.
-    result = ols(y=y, x=x, window_type='rolling', window=10)
-    print(result.beta)
-
-    result = ols(y=y, x=x, nw_lags=1)
-
-    # Set up LHS and RHS for data across all items
-    y = A
-    x = {'B' : B, 'C' : C}
-
-    # Run panel OLS.
-    result = ols(y=y, x=x)
-
-    # Run expanding panel OLS with window 10 and entity clustering.
-    result = ols(y=y, x=x, cluster='entity', window_type='expanding',
-                 window=10)
-
-    Returns
-    -------
-    The appropriate OLS object, which allows you to obtain betas and various
-    statistics, such as std err, t-stat, etc.
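
Since the whole pandas.stats.interface module is going away, the replacement that the deprecation warnings elsewhere in this patch point to is statsmodels. A minimal statsmodels equivalent of `result = ols(y=y, x=x)` (a sketch with made-up data, not from the source):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.RandomState(42)
y = pd.Series(rng.randn(100))
x = pd.DataFrame({"b": rng.randn(100)})

# add_constant plays the role of intercept=True in the removed API.
result = sm.OLS(y, sm.add_constant(x)).fit()
print(result.params)  # betas, cf. the removed OLS.beta
print(result.bse)     # standard errors, cf. the removed OLS.std_err
```
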
- """ - - if (kwargs.get('cluster') is not None and - kwargs.get('nw_lags') is not None): - raise ValueError( - 'Pandas OLS does not work with Newey-West correction ' - 'and clustering.') - - pool = kwargs.get('pool') - if 'pool' in kwargs: - del kwargs['pool'] - - window_type = kwargs.get('window_type') - window = kwargs.get('window') - - if window_type is None: - if window is None: - window_type = 'full_sample' - else: - window_type = 'rolling' - else: - window_type = common._get_window_type(window_type) - - if window_type != 'full_sample': - kwargs['window_type'] = common._get_window_type(window_type) - - y = kwargs.get('y') - x = kwargs.get('x') - - panel = False - if isinstance(y, DataFrame) or (isinstance(y, Series) and - isinstance(y.index, MultiIndex)): - panel = True - if isinstance(x, Panel): - panel = True - - if window_type == 'full_sample': - for rolling_field in ('window_type', 'window', 'min_periods'): - if rolling_field in kwargs: - del kwargs[rolling_field] - - if panel: - if pool is False: - klass = NonPooledPanelOLS - else: - klass = PanelOLS - else: - klass = OLS - else: - if panel: - if pool is False: - klass = NonPooledPanelOLS - else: - klass = MovingPanelOLS - else: - klass = MovingOLS - - return klass(**kwargs) diff --git a/pandas/stats/math.py b/pandas/stats/math.py deleted file mode 100644 index 505415bebf89e..0000000000000 --- a/pandas/stats/math.py +++ /dev/null @@ -1,130 +0,0 @@ -# pylint: disable-msg=E1103 -# pylint: disable-msg=W0212 - -from __future__ import division - -from pandas.compat import range -import numpy as np -import numpy.linalg as linalg - - -def rank(X, cond=1.0e-12): - """ - Return the rank of a matrix X based on its generalized inverse, - not the SVD. - """ - X = np.asarray(X) - if len(X.shape) == 2: - import scipy.linalg as SL - D = SL.svdvals(X) - result = np.add.reduce(np.greater(D / D.max(), cond)) - return int(result.astype(np.int32)) - else: - return int(not np.alltrue(np.equal(X, 0.))) - - -def solve(a, b): - """Returns the solution of A X = B.""" - try: - return linalg.solve(a, b) - except linalg.LinAlgError: - return np.dot(linalg.pinv(a), b) - - -def inv(a): - """Returns the inverse of A.""" - try: - return np.linalg.inv(a) - except linalg.LinAlgError: - return np.linalg.pinv(a) - - -def is_psd(m): - eigvals = linalg.eigvals(m) - return np.isreal(eigvals).all() and (eigvals >= 0).all() - - -def newey_west(m, max_lags, nobs, df, nw_overlap=False): - """ - Compute Newey-West adjusted covariance matrix, taking into account - specified number of leads / lags - - Parameters - ---------- - m : (N x K) - max_lags : int - nobs : int - Number of observations in model - df : int - Degrees of freedom in explanatory variables - nw_overlap : boolean, default False - Assume data is overlapping - - Returns - ------- - ndarray (K x K) - - Reference - --------- - Newey, W. K. & West, K. D. (1987) A Simple, Positive - Semi-definite, Heteroskedasticity and Autocorrelation Consistent - Covariance Matrix, Econometrica, vol. 55(3), 703-708 - """ - Xeps = np.dot(m.T, m) - for lag in range(1, max_lags + 1): - auto_cov = np.dot(m[:-lag].T, m[lag:]) - weight = lag / (max_lags + 1) - if nw_overlap: - weight = 0 - bb = auto_cov + auto_cov.T - dd = (1 - weight) * bb - Xeps += dd - - Xeps *= nobs / (nobs - df) - - if nw_overlap and not is_psd(Xeps): - new_max_lags = int(np.ceil(max_lags * 1.5)) -# print('nw_overlap is True and newey_west generated a non positive ' -# 'semidefinite matrix, so using newey_west with max_lags of %d.' 
-# % new_max_lags) - return newey_west(m, new_max_lags, nobs, df) - - return Xeps - - -def calc_F(R, r, beta, var_beta, nobs, df): - """ - Computes the standard F-test statistic for linear restriction - hypothesis testing - - Parameters - ---------- - R: ndarray (N x N) - Restriction matrix - r: ndarray (N x 1) - Restriction vector - beta: ndarray (N x 1) - Estimated model coefficients - var_beta: ndarray (N x N) - Variance covariance matrix of regressors - nobs: int - Number of observations in model - df: int - Model degrees of freedom - - Returns - ------- - F value, (q, df_resid), p value - """ - from scipy.stats import f - - hyp = np.dot(R, beta.reshape(len(beta), 1)) - r - RSR = np.dot(R, np.dot(var_beta, R.T)) - - q = len(r) - - F = np.dot(hyp.T, np.dot(inv(RSR), hyp)).squeeze() / q - - p_value = 1 - f.cdf(F, q, nobs - df) - - return F, (q, nobs - df), p_value diff --git a/pandas/stats/misc.py b/pandas/stats/misc.py deleted file mode 100644 index 1a077dcb6f9a1..0000000000000 --- a/pandas/stats/misc.py +++ /dev/null @@ -1,389 +0,0 @@ -from numpy import NaN -from pandas import compat -import numpy as np - -from pandas.core.api import Series, DataFrame -from pandas.core.series import remove_na -from pandas.compat import zip, lrange -import pandas.core.common as com - - -def zscore(series): - return (series - series.mean()) / np.std(series, ddof=0) - - -def correl_ts(frame1, frame2): - """ - Pairwise correlation of columns of two DataFrame objects - - Parameters - ---------- - - Returns - ------- - y : Series - """ - results = {} - for col, series in compat.iteritems(frame1): - if col in frame2: - other = frame2[col] - - idx1 = series.valid().index - idx2 = other.valid().index - - common_index = idx1.intersection(idx2) - - seriesStand = zscore(series.reindex(common_index)) - otherStand = zscore(other.reindex(common_index)) - results[col] = (seriesStand * otherStand).mean() - - return Series(results) - - -def correl_xs(frame1, frame2): - return correl_ts(frame1.T, frame2.T) - - -def percentileofscore(a, score, kind='rank'): - """The percentile rank of a score relative to a list of scores. - - A `percentileofscore` of, for example, 80% means that 80% of the - scores in `a` are below the given score. In the case of gaps or - ties, the exact definition depends on the optional keyword, `kind`. - - Parameters - ---------- - a: array like - Array of scores to which `score` is compared. - score: int or float - Score that is compared to the elements in `a`. - kind: {'rank', 'weak', 'strict', 'mean'}, optional - This optional parameter specifies the interpretation of the - resulting score: - - - "rank": Average percentage ranking of score. In case of - multiple matches, average the percentage rankings of - all matching scores. - - "weak": This kind corresponds to the definition of a cumulative - distribution function. A percentileofscore of 80% - means that 80% of values are less than or equal - to the provided score. - - "strict": Similar to "weak", except that only values that are - strictly less than the given score are counted. - - "mean": The average of the "weak" and "strict" scores, often used in - testing. See - - http://en.wikipedia.org/wiki/Percentile_rank - - Returns - ------- - pcos : float - Percentile-position of score (0-100) relative to `a`. 
- - Examples - -------- - Three-quarters of the given values lie below a given score: - - >>> percentileofscore([1, 2, 3, 4], 3) - 75.0 - - With multiple matches, note how the scores of the two matches, 0.6 - and 0.8 respectively, are averaged: - - >>> percentileofscore([1, 2, 3, 3, 4], 3) - 70.0 - - Only 2/5 values are strictly less than 3: - - >>> percentileofscore([1, 2, 3, 3, 4], 3, kind='strict') - 40.0 - - But 4/5 values are less than or equal to 3: - - >>> percentileofscore([1, 2, 3, 3, 4], 3, kind='weak') - 80.0 - - The average between the weak and the strict scores is - - >>> percentileofscore([1, 2, 3, 3, 4], 3, kind='mean') - 60.0 - - """ - a = np.array(a) - n = len(a) - - if kind == 'rank': - if not(np.any(a == score)): - a = np.append(a, score) - a_len = np.array(lrange(len(a))) - else: - a_len = np.array(lrange(len(a))) + 1.0 - - a = np.sort(a) - idx = [a == score] - pct = (np.mean(a_len[idx]) / n) * 100.0 - return pct - - elif kind == 'strict': - return sum(a < score) / float(n) * 100 - elif kind == 'weak': - return sum(a <= score) / float(n) * 100 - elif kind == 'mean': - return (sum(a < score) + sum(a <= score)) * 50 / float(n) - else: - raise ValueError("kind can only be 'rank', 'strict', 'weak' or 'mean'") - - -def percentileRank(frame, column=None, kind='mean'): - """ - Return score at percentile for each point in time (cross-section) - - Parameters - ---------- - frame: DataFrame - column: string or Series, optional - Column name or specific Series to compute percentiles for. - If not provided, percentiles are computed for all values at each - point in time. Note that this can take a LONG time. - kind: {'rank', 'weak', 'strict', 'mean'}, optional - This optional parameter specifies the interpretation of the - resulting score: - - - "rank": Average percentage ranking of score. In case of - multiple matches, average the percentage rankings of - all matching scores. - - "weak": This kind corresponds to the definition of a cumulative - distribution function. A percentileofscore of 80% - means that 80% of values are less than or equal - to the provided score. - - "strict": Similar to "weak", except that only values that are - strictly less than the given score are counted. - - "mean": The average of the "weak" and "strict" scores, often used in - testing. 
See - - http://en.wikipedia.org/wiki/Percentile_rank - - Returns - ------- - TimeSeries or DataFrame, depending on input - """ - fun = lambda xs, score: percentileofscore(remove_na(xs), - score, kind=kind) - - results = {} - framet = frame.T - if column is not None: - if isinstance(column, Series): - for date, xs in compat.iteritems(frame.T): - results[date] = fun(xs, column.get(date, NaN)) - else: - for date, xs in compat.iteritems(frame.T): - results[date] = fun(xs, xs[column]) - results = Series(results) - else: - for column in frame.columns: - for date, xs in compat.iteritems(framet): - results.setdefault(date, {})[column] = fun(xs, xs[column]) - results = DataFrame(results).T - return results - - -def bucket(series, k, by=None): - """ - Produce DataFrame representing quantiles of a Series - - Parameters - ---------- - series : Series - k : int - number of quantiles - by : Series or same-length array - bucket by value - - Returns - ------- - DataFrame - """ - if by is None: - by = series - else: - by = by.reindex(series.index) - - split = _split_quantile(by, k) - mat = np.empty((len(series), k), dtype=float) * np.NaN - - for i, v in enumerate(split): - mat[:, i][v] = series.take(v) - - return DataFrame(mat, index=series.index, columns=np.arange(k) + 1) - - -def _split_quantile(arr, k): - arr = np.asarray(arr) - mask = np.isfinite(arr) - order = arr[mask].argsort() - n = len(arr) - - return np.array_split(np.arange(n)[mask].take(order), k) - - -def bucketcat(series, cats): - """ - Produce DataFrame representing quantiles of a Series - - Parameters - ---------- - series : Series - cat : Series or same-length array - bucket by category; mutually exclusive with 'by' - - Returns - ------- - DataFrame - """ - if not isinstance(series, Series): - series = Series(series, index=np.arange(len(series))) - - cats = np.asarray(cats) - - unique_labels = np.unique(cats) - unique_labels = unique_labels[com.notnull(unique_labels)] - - # group by - data = {} - - for label in unique_labels: - data[label] = series[cats == label] - - return DataFrame(data, columns=unique_labels) - - -def bucketpanel(series, bins=None, by=None, cat=None): - """ - Bucket data by two Series to create summary panel - - Parameters - ---------- - series : Series - bins : tuple (length-2) - e.g. 
(2, 2) - by : tuple of Series - bucket by value - cat : tuple of Series - bucket by category; mutually exclusive with 'by' - - Returns - ------- - DataFrame - """ - use_by = by is not None - use_cat = cat is not None - - if use_by and use_cat: - raise Exception('must specify by or cat, but not both') - elif use_by: - if len(by) != 2: - raise Exception('must provide two bucketing series') - - xby, yby = by - xbins, ybins = bins - - return _bucketpanel_by(series, xby, yby, xbins, ybins) - - elif use_cat: - xcat, ycat = cat - return _bucketpanel_cat(series, xcat, ycat) - else: - raise Exception('must specify either values or categories ' - 'to bucket by') - - -def _bucketpanel_by(series, xby, yby, xbins, ybins): - xby = xby.reindex(series.index) - yby = yby.reindex(series.index) - - xlabels = _bucket_labels(xby.reindex(series.index), xbins) - ylabels = _bucket_labels(yby.reindex(series.index), ybins) - - labels = _uniquify(xlabels, ylabels, xbins, ybins) - - mask = com.isnull(labels) - labels[mask] = -1 - - unique_labels = np.unique(labels) - bucketed = bucketcat(series, labels) - - _ulist = list(labels) - index_map = dict((x, _ulist.index(x)) for x in unique_labels) - - def relabel(key): - pos = index_map[key] - - xlab = xlabels[pos] - ylab = ylabels[pos] - - return '%sx%s' % (int(xlab) if com.notnull(xlab) else 'NULL', - int(ylab) if com.notnull(ylab) else 'NULL') - - return bucketed.rename(columns=relabel) - - -def _bucketpanel_cat(series, xcat, ycat): - xlabels, xmapping = _intern(xcat) - ylabels, ymapping = _intern(ycat) - - shift = 10 ** (np.ceil(np.log10(ylabels.max()))) - labels = xlabels * shift + ylabels - - sorter = labels.argsort() - sorted_labels = labels.take(sorter) - sorted_xlabels = xlabels.take(sorter) - sorted_ylabels = ylabels.take(sorter) - - unique_labels = np.unique(labels) - unique_labels = unique_labels[com.notnull(unique_labels)] - - locs = sorted_labels.searchsorted(unique_labels) - xkeys = sorted_xlabels.take(locs) - ykeys = sorted_ylabels.take(locs) - - stringified = ['(%s, %s)' % arg - for arg in zip(xmapping.take(xkeys), ymapping.take(ykeys))] - - result = bucketcat(series, labels) - result.columns = stringified - - return result - - -def _intern(values): - # assumed no NaN values - values = np.asarray(values) - - uniqued = np.unique(values) - labels = uniqued.searchsorted(values) - return labels, uniqued - - -def _uniquify(xlabels, ylabels, xbins, ybins): - # encode the stuff, create unique label - shifter = 10 ** max(xbins, ybins) - _xpiece = xlabels * shifter - _ypiece = ylabels - - return _xpiece + _ypiece - - -def _bucket_labels(series, k): - arr = np.asarray(series) - mask = np.isfinite(arr) - order = arr[mask].argsort() - n = len(series) - - split = np.array_split(np.arange(n)[mask].take(order), k) - - mat = np.empty(n, dtype=float) * np.NaN - for i, v in enumerate(split): - mat[v] = i - - return mat + 1 diff --git a/pandas/stats/moments.py b/pandas/stats/moments.py index 95b209aee0b0c..f6c3a08c6721a 100644 --- a/pandas/stats/moments.py +++ b/pandas/stats/moments.py @@ -6,9 +6,9 @@ import warnings import numpy as np -from pandas.types.common import is_scalar +from pandas.core.dtypes.common import is_scalar from pandas.core.api import DataFrame, Series -from pandas.util.decorators import Substitution, Appender +from pandas.util._decorators import Substitution, Appender __all__ = ['rolling_count', 'rolling_max', 'rolling_min', 'rolling_sum', 'rolling_mean', 'rolling_std', 'rolling_cov', @@ -385,6 +385,7 @@ def ewmstd(arg, com=None, span=None, halflife=None, 
alpha=None, min_periods=0,
                   bias=bias,
                   func_kw=['bias'])
 
+
 ewmvol = ewmstd
 
 
@@ -476,6 +477,7 @@ def f(arg, window, min_periods=None, freq=None, center=False,
                      **kwargs)
     return f
 
+
 rolling_max = _rolling_func('max', 'Moving maximum.', how='max')
 rolling_min = _rolling_func('min', 'Moving minimum.', how='min')
 rolling_sum = _rolling_func('sum', 'Moving sum.')
@@ -683,6 +685,7 @@ def f(arg, min_periods=1, freq=None, **kwargs):
                        **kwargs)
     return f
 
+
 expanding_max = _expanding_func('max', 'Expanding maximum.')
 expanding_min = _expanding_func('min', 'Expanding minimum.')
 expanding_sum = _expanding_func('sum', 'Expanding sum.')
diff --git a/pandas/stats/ols.py b/pandas/stats/ols.py
deleted file mode 100644
index b533d255bd196..0000000000000
--- a/pandas/stats/ols.py
+++ /dev/null
@@ -1,1376 +0,0 @@
-"""
-Ordinary least squares regression
-"""
-
-# pylint: disable-msg=W0201
-
-# flake8: noqa
-
-from pandas.compat import zip, range, StringIO
-from itertools import starmap
-from pandas import compat
-import numpy as np
-
-from pandas.core.api import DataFrame, Series, isnull
-from pandas.core.base import StringMixin
-from pandas.types.common import _ensure_float64
-from pandas.core.index import MultiIndex
-from pandas.core.panel import Panel
-from pandas.util.decorators import cache_readonly
-
-import pandas.stats.common as scom
-import pandas.stats.math as math
-import pandas.stats.moments as moments
-
-_FP_ERR = 1e-8
-
-class OLS(StringMixin):
-    """
-    Runs a full sample ordinary least squares regression.
-
-    Parameters
-    ----------
-    y : Series
-    x : Series, DataFrame, dict of Series
-    intercept : bool
-        True if you want an intercept.
-    weights : array-like, optional
-        1d array of weights. If you supply 1/W then the variables are pre-
-        multiplied by 1/sqrt(W). If no weights are supplied the default value
-        is 1 and WLS results are the same as OLS.
-    nw_lags : None or int
-        Number of Newey-West lags.
-    nw_overlap : boolean, default False
-        Assume data is overlapping when computing Newey-West estimator
-
-    """
-    _panel_model = False
-
-    def __init__(self, y, x, intercept=True, weights=None, nw_lags=None,
-                 nw_overlap=False):
-        import warnings
-        warnings.warn("The pandas.stats.ols module is deprecated and will be "
-                      "removed in a future version. We refer to external packages "
-                      "like statsmodels, see some examples here: "
-                      "http://www.statsmodels.org/stable/regression.html",
-                      FutureWarning, stacklevel=4)
-
-        try:
-            import statsmodels.api as sm
-        except ImportError:
-            import scikits.statsmodels.api as sm
-
-        self._x_orig = x
-        self._y_orig = y
-        self._weights_orig = weights
-        self._intercept = intercept
-        self._nw_lags = nw_lags
-        self._nw_overlap = nw_overlap
-
-        (self._y, self._x, self._weights, self._x_filtered,
-         self._index, self._time_has_obs) = self._prepare_data()
-
-        if self._weights is not None:
-            self._x_trans = self._x.mul(np.sqrt(self._weights), axis=0)
-            self._y_trans = self._y * np.sqrt(self._weights)
-            self.sm_ols = sm.WLS(self._y.get_values(),
-                                 self._x.get_values(),
-                                 weights=self._weights.values).fit()
-        else:
-            self._x_trans = self._x
-            self._y_trans = self._y
-            self.sm_ols = sm.OLS(self._y.get_values(),
-                                 self._x.get_values()).fit()
-
-    def _prepare_data(self):
-        """
-        Cleans the input for single OLS.
-
-        Parameters
-        ----------
-        lhs: Series
-            Dependent variable in the regression.
-        rhs: dict, whose values are Series, DataFrame, or dict
-            Explanatory variables of the regression. 
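
The weights handling in OLS.__init__ above leans on the identity that weighted least squares equals ordinary least squares on sqrt(w)-scaled data, which is exactly the transformation applied to _x_trans and _y_trans. A quick illustrative check of that equivalence with statsmodels (a sketch with random data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
x = sm.add_constant(rng.randn(50, 1))
y = x @ np.array([1.0, 2.0]) + rng.randn(50)
w = rng.uniform(0.5, 2.0, size=50)

# WLS(y, X, weights=w) equals OLS on sqrt(w)-scaled y and X.
wls = sm.WLS(y, x, weights=w).fit()
ols = sm.OLS(np.sqrt(w) * y, np.sqrt(w)[:, None] * x).fit()
assert np.allclose(wls.params, ols.params)
```
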
- - Returns - ------- - Series, DataFrame - Cleaned lhs and rhs - """ - (filt_lhs, filt_rhs, filt_weights, - pre_filt_rhs, index, valid) = _filter_data(self._y_orig, self._x_orig, - self._weights_orig) - if self._intercept: - filt_rhs['intercept'] = 1. - pre_filt_rhs['intercept'] = 1. - - if hasattr(filt_weights, 'to_dense'): - filt_weights = filt_weights.to_dense() - - return (filt_lhs, filt_rhs, filt_weights, - pre_filt_rhs, index, valid) - - @property - def nobs(self): - return self._nobs - - @property - def _nobs(self): - return len(self._y) - - @property - def nw_lags(self): - return self._nw_lags - - @property - def x(self): - """Returns the filtered x used in the regression.""" - return self._x - - @property - def y(self): - """Returns the filtered y used in the regression.""" - return self._y - - @cache_readonly - def _beta_raw(self): - """Runs the regression and returns the beta.""" - return self.sm_ols.params - - @cache_readonly - def beta(self): - """Returns the betas in Series form.""" - return Series(self._beta_raw, index=self._x.columns) - - @cache_readonly - def _df_raw(self): - """Returns the degrees of freedom.""" - return math.rank(self._x.values) - - @cache_readonly - def df(self): - """Returns the degrees of freedom. - - This equals the rank of the X matrix. - """ - return self._df_raw - - @cache_readonly - def _df_model_raw(self): - """Returns the raw model degrees of freedom.""" - return self.sm_ols.df_model - - @cache_readonly - def df_model(self): - """Returns the degrees of freedom of the model.""" - return self._df_model_raw - - @cache_readonly - def _df_resid_raw(self): - """Returns the raw residual degrees of freedom.""" - return self.sm_ols.df_resid - - @cache_readonly - def df_resid(self): - """Returns the degrees of freedom of the residuals.""" - return self._df_resid_raw - - @cache_readonly - def _f_stat_raw(self): - """Returns the raw f-stat value.""" - from scipy.stats import f - - cols = self._x.columns - - if self._nw_lags is None: - F = self._r2_raw / (self._r2_raw - self._r2_adj_raw) - - q = len(cols) - if 'intercept' in cols: - q -= 1 - - shape = q, self.df_resid - p_value = 1 - f.cdf(F, shape[0], shape[1]) - return F, shape, p_value - - k = len(cols) - R = np.eye(k) - r = np.zeros((k, 1)) - - try: - intercept = cols.get_loc('intercept') - R = np.concatenate((R[0: intercept], R[intercept + 1:])) - r = np.concatenate((r[0: intercept], r[intercept + 1:])) - except KeyError: - # no intercept - pass - - return math.calc_F(R, r, self._beta_raw, self._var_beta_raw, - self._nobs, self.df) - - @cache_readonly - def f_stat(self): - """Returns the f-stat value.""" - return f_stat_to_dict(self._f_stat_raw) - - def f_test(self, hypothesis): - """Runs the F test, given a joint hypothesis. The hypothesis is - represented by a collection of equations, in the form - - A*x_1+B*x_2=C - - You must provide the coefficients even if they're 1. No spaces. - - The equations can be passed as either a single string or a - list of strings. - - Examples - -------- - o = ols(...) 
- o.f_test('1*x1+2*x2=0,1*x3=0') - o.f_test(['1*x1+2*x2=0','1*x3=0']) - """ - - x_names = self._x.columns - - R = [] - r = [] - - if isinstance(hypothesis, str): - eqs = hypothesis.split(',') - elif isinstance(hypothesis, list): - eqs = hypothesis - else: # pragma: no cover - raise Exception('hypothesis must be either string or list') - for equation in eqs: - row = np.zeros(len(x_names)) - lhs, rhs = equation.split('=') - for s in lhs.split('+'): - ss = s.split('*') - coeff = float(ss[0]) - x_name = ss[1] - - if x_name not in x_names: - raise Exception('no coefficient named %s' % x_name) - idx = x_names.get_loc(x_name) - row[idx] = coeff - rhs = float(rhs) - - R.append(row) - r.append(rhs) - - R = np.array(R) - q = len(r) - r = np.array(r).reshape(q, 1) - - result = math.calc_F(R, r, self._beta_raw, self._var_beta_raw, - self._nobs, self.df) - - return f_stat_to_dict(result) - - @cache_readonly - def _p_value_raw(self): - """Returns the raw p values.""" - from scipy.stats import t - - return 2 * t.sf(np.fabs(self._t_stat_raw), - self._df_resid_raw) - - @cache_readonly - def p_value(self): - """Returns the p values.""" - return Series(self._p_value_raw, index=self.beta.index) - - @cache_readonly - def _r2_raw(self): - """Returns the raw r-squared values.""" - if self._use_centered_tss: - return 1 - self.sm_ols.ssr / self.sm_ols.centered_tss - else: - return 1 - self.sm_ols.ssr / self.sm_ols.uncentered_tss - - @property - def _use_centered_tss(self): - # has_intercept = np.abs(self._resid_raw.sum()) < _FP_ERR - return self._intercept - - @cache_readonly - def r2(self): - """Returns the r-squared values.""" - return self._r2_raw - - @cache_readonly - def _r2_adj_raw(self): - """Returns the raw r-squared adjusted values.""" - return self.sm_ols.rsquared_adj - - @cache_readonly - def r2_adj(self): - """Returns the r-squared adjusted values.""" - return self._r2_adj_raw - - @cache_readonly - def _resid_raw(self): - """Returns the raw residuals.""" - return self.sm_ols.resid - - @cache_readonly - def resid(self): - """Returns the residuals.""" - return Series(self._resid_raw, index=self._x.index) - - @cache_readonly - def _rmse_raw(self): - """Returns the raw rmse values.""" - return np.sqrt(self.sm_ols.mse_resid) - - @cache_readonly - def rmse(self): - """Returns the rmse value.""" - return self._rmse_raw - - @cache_readonly - def _std_err_raw(self): - """Returns the raw standard err values.""" - return np.sqrt(np.diag(self._var_beta_raw)) - - @cache_readonly - def std_err(self): - """Returns the standard err values of the betas.""" - return Series(self._std_err_raw, index=self.beta.index) - - @cache_readonly - def _t_stat_raw(self): - """Returns the raw t-stat value.""" - return self._beta_raw / self._std_err_raw - - @cache_readonly - def t_stat(self): - """Returns the t-stat values of the betas.""" - return Series(self._t_stat_raw, index=self.beta.index) - - @cache_readonly - def _var_beta_raw(self): - """ - Returns the raw covariance of beta. 
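The `f_test` method above parses hypothesis strings like `'1*x1+2*x2=0'` by hand. For reference, a hedged sketch of the equivalent joint test in statsmodels, which accepts comparable comma-separated constraint strings on named parameters (column names here are illustrative):

```python
# Joint-hypothesis F test in statsmodels, the analogue of f_test above.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = pd.DataFrame(rng.standard_normal((100, 2)), columns=['a', 'b'])
y = x['a'] - 2 * x['b'] + rng.standard_normal(100)

result = sm.OLS(y, sm.add_constant(x)).fit()
ftest = result.f_test('a = 0, b = 0')   # joint hypothesis, comma-separated
print(ftest.fvalue, ftest.pvalue)
```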
- """ - x = self._x.values - y = self._y.values - - xx = np.dot(x.T, x) - - if self._nw_lags is None: - return math.inv(xx) * (self._rmse_raw ** 2) - else: - resid = y - np.dot(x, self._beta_raw) - m = (x.T * resid).T - - xeps = math.newey_west(m, self._nw_lags, self._nobs, self._df_raw, - self._nw_overlap) - - xx_inv = math.inv(xx) - return np.dot(xx_inv, np.dot(xeps, xx_inv)) - - @cache_readonly - def var_beta(self): - """Returns the variance-covariance matrix of beta.""" - return DataFrame(self._var_beta_raw, index=self.beta.index, - columns=self.beta.index) - - @cache_readonly - def _y_fitted_raw(self): - """Returns the raw fitted y values.""" - if self._weights is None: - X = self._x_filtered.values - else: - # XXX - return self.sm_ols.fittedvalues - - b = self._beta_raw - return np.dot(X, b) - - @cache_readonly - def y_fitted(self): - """Returns the fitted y values. This equals BX.""" - if self._weights is None: - index = self._x_filtered.index - orig_index = index - else: - index = self._y.index - orig_index = self._y_orig.index - - result = Series(self._y_fitted_raw, index=index) - return result.reindex(orig_index) - - @cache_readonly - def _y_predict_raw(self): - """Returns the raw predicted y values.""" - return self._y_fitted_raw - - @cache_readonly - def y_predict(self): - """Returns the predicted y values. - - For in-sample, this is same as y_fitted.""" - return self.y_fitted - - def predict(self, beta=None, x=None, fill_value=None, - fill_method=None, axis=0): - """ - Parameters - ---------- - beta : Series - x : Series or DataFrame - fill_value : scalar or dict, default None - fill_method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None - axis : {0, 1}, default 0 - See DataFrame.fillna for more details - - Notes - ----- - 1. If both fill_value and fill_method are None then NaNs are dropped - (this is the default behavior) - 2. An intercept will be automatically added to the new_y_values if - the model was fitted using an intercept - - Returns - ------- - Series of predicted values - """ - if beta is None and x is None: - return self.y_predict - - if beta is None: - beta = self.beta - else: - beta = beta.reindex(self.beta.index) - if isnull(beta).any(): - raise ValueError('Must supply betas for same variables') - - if x is None: - x = self._x - orig_x = x - else: - orig_x = x - if fill_value is None and fill_method is None: - x = x.dropna(how='any') - else: - x = x.fillna(value=fill_value, method=fill_method, axis=axis) - if isinstance(x, Series): - x = DataFrame({'x': x}) - if self._intercept: - x['intercept'] = 1. 
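`_var_beta_raw` above hand-rolls the Newey-West sandwich estimator when `nw_lags` is set. In statsmodels the same adjustment is normally requested at fit time through `cov_type='HAC'`; a self-contained sketch under that assumption (data is illustrative):

```python
# Newey-West (HAC) standard errors in statsmodels, the counterpart of
# the nw_lags handling in _var_beta_raw above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.standard_normal((100, 2)))
y = X @ np.array([0.5, 1.0, -1.0]) + rng.standard_normal(100)

result = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 2})
print(result.bse)   # HAC-adjusted standard errors of the betas
```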
- - x = x.reindex(columns=self._x.columns) - - rs = np.dot(x.values, beta.values) - return Series(rs, x.index).reindex(orig_x.index) - - RESULT_FIELDS = ['r2', 'r2_adj', 'df', 'df_model', 'df_resid', 'rmse', - 'f_stat', 'beta', 'std_err', 't_stat', 'p_value', 'nobs'] - - @cache_readonly - def _results(self): - results = {} - for result in self.RESULT_FIELDS: - results[result] = getattr(self, result) - - return results - - @cache_readonly - def _coef_table(self): - buf = StringIO() - - buf.write('%14s %10s %10s %10s %10s %10s %10s\n' % - ('Variable', 'Coef', 'Std Err', 't-stat', - 'p-value', 'CI 2.5%', 'CI 97.5%')) - buf.write(scom.banner('')) - coef_template = '\n%14s %10.4f %10.4f %10.2f %10.4f %10.4f %10.4f' - - results = self._results - - beta = results['beta'] - - for i, name in enumerate(beta.index): - if i and not (i % 5): - buf.write('\n' + scom.banner('')) - - std_err = results['std_err'][name] - CI1 = beta[name] - 1.96 * std_err - CI2 = beta[name] + 1.96 * std_err - - t_stat = results['t_stat'][name] - p_value = results['p_value'][name] - - line = coef_template % (name, - beta[name], std_err, t_stat, p_value, CI1, CI2) - - buf.write(line) - - if self.nw_lags is not None: - buf.write('\n') - buf.write('*** The calculations are Newey-West ' - 'adjusted with lags %5d\n' % self.nw_lags) - - return buf.getvalue() - - @cache_readonly - def summary_as_matrix(self): - """Returns the formatted results of the OLS as a DataFrame.""" - results = self._results - beta = results['beta'] - data = {'beta': results['beta'], - 't-stat': results['t_stat'], - 'p-value': results['p_value'], - 'std err': results['std_err']} - return DataFrame(data, beta.index).T - - @cache_readonly - def summary(self): - """ - This returns the formatted result of the OLS computation - """ - template = """ -%(bannerTop)s - -Formula: Y ~ %(formula)s - -Number of Observations: %(nobs)d -Number of Degrees of Freedom: %(df)d - -R-squared: %(r2)10.4f -Adj R-squared: %(r2_adj)10.4f - -Rmse: %(rmse)10.4f - -F-stat %(f_stat_shape)s: %(f_stat)10.4f, p-value: %(f_stat_p_value)10.4f - -Degrees of Freedom: model %(df_model)d, resid %(df_resid)d - -%(bannerCoef)s -%(coef_table)s -%(bannerEnd)s -""" - coef_table = self._coef_table - - results = self._results - - f_stat = results['f_stat'] - - bracketed = ['<%s>' % str(c) for c in results['beta'].index] - - formula = StringIO() - formula.write(bracketed[0]) - tot = len(bracketed[0]) - line = 1 - for coef in bracketed[1:]: - tot = tot + len(coef) + 3 - - if tot // (68 * line): - formula.write('\n' + ' ' * 12) - line += 1 - - formula.write(' + ' + coef) - - params = { - 'bannerTop': scom.banner('Summary of Regression Analysis'), - 'bannerCoef': scom.banner('Summary of Estimated Coefficients'), - 'bannerEnd': scom.banner('End of Summary'), - 'formula': formula.getvalue(), - 'r2': results['r2'], - 'r2_adj': results['r2_adj'], - 'nobs': results['nobs'], - 'df': results['df'], - 'df_model': results['df_model'], - 'df_resid': results['df_resid'], - 'coef_table': coef_table, - 'rmse': results['rmse'], - 'f_stat': f_stat['f-stat'], - 'f_stat_shape': '(%d, %d)' % (f_stat['DF X'], f_stat['DF Resid']), - 'f_stat_p_value': f_stat['p-value'], - } - - return template % params - - def __unicode__(self): - return self.summary - - @cache_readonly - def _time_obs_count(self): - # XXX - return self._time_has_obs.astype(int) - - @property - def _total_times(self): - return self._time_has_obs.sum() - - -class MovingOLS(OLS): - """ - Runs a rolling/expanding simple OLS. 
- - Parameters - ---------- - y : Series - x : Series, DataFrame, or dict of Series - weights : array-like, optional - 1d array of weights. If None, equivalent to an unweighted OLS. - window_type : {'full sample', 'rolling', 'expanding'} - Default expanding - window : int - size of window (for rolling/expanding OLS) - min_periods : int - Threshold of non-null data points to require. - If None, defaults to size of window for window_type='rolling' and 1 - otherwise - intercept : bool - True if you want an intercept. - nw_lags : None or int - Number of Newey-West lags. - nw_overlap : boolean, default False - Assume data is overlapping when computing Newey-West estimator - - """ - - def __init__(self, y, x, weights=None, window_type='expanding', - window=None, min_periods=None, intercept=True, - nw_lags=None, nw_overlap=False): - - self._args = dict(intercept=intercept, nw_lags=nw_lags, - nw_overlap=nw_overlap) - - OLS.__init__(self, y=y, x=x, weights=weights, **self._args) - - self._set_window(window_type, window, min_periods) - - def _set_window(self, window_type, window, min_periods): - self._window_type = scom._get_window_type(window_type) - - if self._is_rolling: - if window is None: - raise AssertionError("Must specify window.") - if min_periods is None: - min_periods = window - else: - window = len(self._x) - if min_periods is None: - min_periods = 1 - - self._window = int(window) - self._min_periods = min_periods - -#------------------------------------------------------------------------------ -# "Public" results - - @cache_readonly - def beta(self): - """Returns the betas in Series/DataFrame form.""" - return DataFrame(self._beta_raw, - index=self._result_index, - columns=self._x.columns) - - @cache_readonly - def rank(self): - return Series(self._rank_raw, index=self._result_index) - - @cache_readonly - def df(self): - """Returns the degrees of freedom.""" - return Series(self._df_raw, index=self._result_index) - - @cache_readonly - def df_model(self): - """Returns the model degrees of freedom.""" - return Series(self._df_model_raw, index=self._result_index) - - @cache_readonly - def df_resid(self): - """Returns the residual degrees of freedom.""" - return Series(self._df_resid_raw, index=self._result_index) - - @cache_readonly - def f_stat(self): - """Returns the f-stat value.""" - f_stat_dicts = dict((date, f_stat_to_dict(f_stat)) - for date, f_stat in zip(self.beta.index, - self._f_stat_raw)) - - return DataFrame(f_stat_dicts).T - - def f_test(self, hypothesis): - raise NotImplementedError('must use full sample') - - @cache_readonly - def forecast_mean(self): - return Series(self._forecast_mean_raw, index=self._result_index) - - @cache_readonly - def forecast_vol(self): - return Series(self._forecast_vol_raw, index=self._result_index) - - @cache_readonly - def p_value(self): - """Returns the p values.""" - cols = self.beta.columns - return DataFrame(self._p_value_raw, columns=cols, - index=self._result_index) - - @cache_readonly - def r2(self): - """Returns the r-squared values.""" - return Series(self._r2_raw, index=self._result_index) - - @cache_readonly - def resid(self): - """Returns the residuals.""" - return Series(self._resid_raw[self._valid_obs_labels], - index=self._result_index) - - @cache_readonly - def r2_adj(self): - """Returns the r-squared adjusted values.""" - index = self.r2.index - - return Series(self._r2_adj_raw, index=index) - - @cache_readonly - def rmse(self): - """Returns the rmse values.""" - return Series(self._rmse_raw, index=self._result_index) - - 
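`MovingOLS` above reimplements windowed regression via cumulative cross-products (see `_calc_betas` below). Newer statsmodels releases ship a direct equivalent; a hedged sketch, assuming a statsmodels version that includes `RollingOLS` (0.11 or later):

```python
# Rolling-window OLS in statsmodels, the modern counterpart of MovingOLS.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

rng = np.random.default_rng(0)
x = pd.DataFrame(rng.standard_normal((200, 2)), columns=['a', 'b'])
y = x['a'] + rng.standard_normal(200)

res = RollingOLS(y, sm.add_constant(x), window=60).fit()
print(res.params.tail())   # one row of betas per window end
```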
@cache_readonly - def std_err(self): - """Returns the standard err values.""" - return DataFrame(self._std_err_raw, columns=self.beta.columns, - index=self._result_index) - - @cache_readonly - def t_stat(self): - """Returns the t-stat value.""" - return DataFrame(self._t_stat_raw, columns=self.beta.columns, - index=self._result_index) - - @cache_readonly - def var_beta(self): - """Returns the covariance of beta.""" - result = {} - result_index = self._result_index - for i in range(len(self._var_beta_raw)): - dm = DataFrame(self._var_beta_raw[i], columns=self.beta.columns, - index=self.beta.columns) - result[result_index[i]] = dm - - return Panel.from_dict(result, intersect=False) - - @cache_readonly - def y_fitted(self): - """Returns the fitted y values.""" - return Series(self._y_fitted_raw[self._valid_obs_labels], - index=self._result_index) - - @cache_readonly - def y_predict(self): - """Returns the predicted y values.""" - return Series(self._y_predict_raw[self._valid_obs_labels], - index=self._result_index) - -#------------------------------------------------------------------------------ -# "raw" attributes, calculations - - @property - def _is_rolling(self): - return self._window_type == 'rolling' - - @cache_readonly - def _beta_raw(self): - """Runs the regression and returns the beta.""" - beta, indices, mask = self._rolling_ols_call - - return beta[indices] - - @cache_readonly - def _result_index(self): - return self._index[self._valid_indices] - - @property - def _valid_indices(self): - return self._rolling_ols_call[1] - - @cache_readonly - def _rolling_ols_call(self): - return self._calc_betas(self._x_trans, self._y_trans) - - def _calc_betas(self, x, y): - N = len(self._index) - K = len(self._x.columns) - - betas = np.empty((N, K), dtype=float) - betas[:] = np.NaN - - valid = self._time_has_obs - enough = self._enough_obs - window = self._window - - # Use transformed (demeaned) Y, X variables - cum_xx = self._cum_xx(x) - cum_xy = self._cum_xy(x, y) - - for i in range(N): - if not valid[i] or not enough[i]: - continue - - xx = cum_xx[i] - xy = cum_xy[i] - if self._is_rolling and i >= window: - xx = xx - cum_xx[i - window] - xy = xy - cum_xy[i - window] - - betas[i] = math.solve(xx, xy) - - mask = ~np.isnan(betas).any(axis=1) - have_betas = np.arange(N)[mask] - - return betas, have_betas, mask - - def _rolling_rank(self): - dates = self._index - window = self._window - - ranks = np.empty(len(dates), dtype=float) - ranks[:] = np.NaN - for i, date in enumerate(dates): - if self._is_rolling and i >= window: - prior_date = dates[i - window + 1] - else: - prior_date = dates[0] - - x_slice = self._x.truncate(before=prior_date, after=date).values - - if len(x_slice) == 0: - continue - - ranks[i] = math.rank(x_slice) - - return ranks - - def _cum_xx(self, x): - dates = self._index - K = len(x.columns) - valid = self._time_has_obs - cum_xx = [] - - slicer = lambda df, dt: df.truncate(dt, dt).values - if not self._panel_model: - _get_index = x.index.get_loc - - def slicer(df, dt): - i = _get_index(dt) - return df.values[i:i + 1, :] - - last = np.zeros((K, K)) - - for i, date in enumerate(dates): - if not valid[i]: - cum_xx.append(last) - continue - - x_slice = slicer(x, date) - xx = last = last + np.dot(x_slice.T, x_slice) - cum_xx.append(xx) - - return cum_xx - - def _cum_xy(self, x, y): - dates = self._index - valid = self._time_has_obs - cum_xy = [] - - x_slicer = lambda df, dt: df.truncate(dt, dt).values - if not self._panel_model: - _get_index = x.index.get_loc - - def x_slicer(df, 
dt): - i = _get_index(dt) - return df.values[i:i + 1] - - _y_get_index = y.index.get_loc - _values = y.values - if isinstance(y.index, MultiIndex): - def y_slicer(df, dt): - loc = _y_get_index(dt) - return _values[loc] - else: - def y_slicer(df, dt): - i = _y_get_index(dt) - return _values[i:i + 1] - - last = np.zeros(len(x.columns)) - for i, date in enumerate(dates): - if not valid[i]: - cum_xy.append(last) - continue - - x_slice = x_slicer(x, date) - y_slice = y_slicer(y, date) - - xy = last = last + np.dot(x_slice.T, y_slice) - cum_xy.append(xy) - - return cum_xy - - @cache_readonly - def _rank_raw(self): - rank = self._rolling_rank() - return rank[self._valid_indices] - - @cache_readonly - def _df_raw(self): - """Returns the degrees of freedom.""" - return self._rank_raw - - @cache_readonly - def _df_model_raw(self): - """Returns the raw model degrees of freedom.""" - return self._df_raw - 1 - - @cache_readonly - def _df_resid_raw(self): - """Returns the raw residual degrees of freedom.""" - return self._nobs - self._df_raw - - @cache_readonly - def _f_stat_raw(self): - """Returns the raw f-stat value.""" - from scipy.stats import f - - items = self.beta.columns - nobs = self._nobs - df = self._df_raw - df_resid = nobs - df - - # var_beta has not been newey-west adjusted - if self._nw_lags is None: - F = self._r2_raw / (self._r2_raw - self._r2_adj_raw) - - q = len(items) - if 'intercept' in items: - q -= 1 - - def get_result_simple(Fst, d): - return Fst, (q, d), 1 - f.cdf(Fst, q, d) - - # Compute the P-value for each pair - result = starmap(get_result_simple, zip(F, df_resid)) - - return list(result) - - K = len(items) - R = np.eye(K) - r = np.zeros((K, 1)) - - try: - intercept = items.get_loc('intercept') - R = np.concatenate((R[0: intercept], R[intercept + 1:])) - r = np.concatenate((r[0: intercept], r[intercept + 1:])) - except KeyError: - # no intercept - pass - - def get_result(beta, vcov, n, d): - return math.calc_F(R, r, beta, vcov, n, d) - - results = starmap(get_result, - zip(self._beta_raw, self._var_beta_raw, nobs, df)) - - return list(results) - - @cache_readonly - def _p_value_raw(self): - """Returns the raw p values.""" - from scipy.stats import t - - result = [2 * t.sf(a, b) - for a, b in zip(np.fabs(self._t_stat_raw), - self._df_resid_raw)] - - return np.array(result) - - @cache_readonly - def _resid_stats(self): - uncentered_sst = [] - sst = [] - sse = [] - - Yreg = self._y - Y = self._y_trans - X = self._x_trans - weights = self._weights - - dates = self._index - window = self._window - for n, index in enumerate(self._valid_indices): - if self._is_rolling and index >= window: - prior_date = dates[index - window + 1] - else: - prior_date = dates[0] - - date = dates[index] - beta = self._beta_raw[n] - - X_slice = X.truncate(before=prior_date, after=date).values - Y_slice = _y_converter(Y.truncate(before=prior_date, after=date)) - - resid = Y_slice - np.dot(X_slice, beta) - - if weights is not None: - Y_slice = _y_converter(Yreg.truncate(before=prior_date, - after=date)) - weights_slice = weights.truncate(prior_date, date) - demeaned = Y_slice - np.average(Y_slice, weights=weights_slice) - SS_total = (weights_slice * demeaned ** 2).sum() - else: - SS_total = ((Y_slice - Y_slice.mean()) ** 2).sum() - - SS_err = (resid ** 2).sum() - SST_uncentered = (Y_slice ** 2).sum() - - sse.append(SS_err) - sst.append(SS_total) - uncentered_sst.append(SST_uncentered) - - return { - 'sse': np.array(sse), - 'centered_tss': np.array(sst), - 'uncentered_tss': np.array(uncentered_sst), - } 
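`_resid_stats` above tracks both centered and uncentered total sums of squares, which feed the `_r2_raw` / `_rmse_raw` properties that follow. A compact, self-contained sketch of that bookkeeping in plain numpy (synthetic data for illustration):

```python
# Centered vs. uncentered R^2, the bookkeeping done by _resid_stats above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = X @ np.array([1.5, -0.5]) + rng.standard_normal(100)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = ((y - X @ beta) ** 2).sum()
r2_centered = 1 - sse / ((y - y.mean()) ** 2).sum()    # model with intercept
r2_uncentered = 1 - sse / (y ** 2).sum()               # model without intercept
```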
- - @cache_readonly - def _rmse_raw(self): - """Returns the raw rmse values.""" - return np.sqrt(self._resid_stats['sse'] / self._df_resid_raw) - - @cache_readonly - def _r2_raw(self): - rs = self._resid_stats - - if self._use_centered_tss: - return 1 - rs['sse'] / rs['centered_tss'] - else: - return 1 - rs['sse'] / rs['uncentered_tss'] - - @cache_readonly - def _r2_adj_raw(self): - """Returns the raw r-squared adjusted values.""" - nobs = self._nobs - factors = (nobs - 1) / (nobs - self._df_raw) - return 1 - (1 - self._r2_raw) * factors - - @cache_readonly - def _resid_raw(self): - """Returns the raw residuals.""" - return (self._y.values - self._y_fitted_raw) - - @cache_readonly - def _std_err_raw(self): - """Returns the raw standard err values.""" - results = [] - for i in range(len(self._var_beta_raw)): - results.append(np.sqrt(np.diag(self._var_beta_raw[i]))) - - return np.array(results) - - @cache_readonly - def _t_stat_raw(self): - """Returns the raw t-stat value.""" - return self._beta_raw / self._std_err_raw - - @cache_readonly - def _var_beta_raw(self): - """Returns the raw covariance of beta.""" - x = self._x_trans - y = self._y_trans - dates = self._index - nobs = self._nobs - rmse = self._rmse_raw - beta = self._beta_raw - df = self._df_raw - window = self._window - cum_xx = self._cum_xx(self._x) - - results = [] - for n, i in enumerate(self._valid_indices): - xx = cum_xx[i] - date = dates[i] - - if self._is_rolling and i >= window: - xx = xx - cum_xx[i - window] - prior_date = dates[i - window + 1] - else: - prior_date = dates[0] - - x_slice = x.truncate(before=prior_date, after=date) - y_slice = y.truncate(before=prior_date, after=date) - xv = x_slice.values - yv = np.asarray(y_slice) - - if self._nw_lags is None: - result = math.inv(xx) * (rmse[n] ** 2) - else: - resid = yv - np.dot(xv, beta[n]) - m = (xv.T * resid).T - - xeps = math.newey_west(m, self._nw_lags, nobs[n], df[n], - self._nw_overlap) - - xx_inv = math.inv(xx) - result = np.dot(xx_inv, np.dot(xeps, xx_inv)) - - results.append(result) - - return np.array(results) - - @cache_readonly - def _forecast_mean_raw(self): - """Returns the raw covariance of beta.""" - nobs = self._nobs - window = self._window - - # x should be ones - dummy = DataFrame(index=self._y.index) - dummy['y'] = 1 - - cum_xy = self._cum_xy(dummy, self._y) - - results = [] - for n, i in enumerate(self._valid_indices): - sumy = cum_xy[i] - - if self._is_rolling and i >= window: - sumy = sumy - cum_xy[i - window] - - results.append(sumy[0] / nobs[n]) - - return np.array(results) - - @cache_readonly - def _forecast_vol_raw(self): - """Returns the raw covariance of beta.""" - beta = self._beta_raw - window = self._window - dates = self._index - x = self._x - - results = [] - for n, i in enumerate(self._valid_indices): - date = dates[i] - if self._is_rolling and i >= window: - prior_date = dates[i - window + 1] - else: - prior_date = dates[0] - - x_slice = x.truncate(prior_date, date).values - x_demeaned = x_slice - x_slice.mean(0) - x_cov = np.dot(x_demeaned.T, x_demeaned) / (len(x_slice) - 1) - - B = beta[n] - result = np.dot(B, np.dot(x_cov, B)) - results.append(np.sqrt(result)) - - return np.array(results) - - @cache_readonly - def _y_fitted_raw(self): - """Returns the raw fitted y values.""" - return (self._x.values * self._beta_matrix(lag=0)).sum(1) - - @cache_readonly - def _y_predict_raw(self): - """Returns the raw predicted y values.""" - return (self._x.values * self._beta_matrix(lag=1)).sum(1) - - @cache_readonly - def _results(self): - 
results = {} - for result in self.RESULT_FIELDS: - value = getattr(self, result) - if isinstance(value, Series): - value = value[self.beta.index[-1]] - elif isinstance(value, DataFrame): - value = value.xs(self.beta.index[-1]) - else: # pragma: no cover - raise Exception('Problem retrieving %s' % result) - results[result] = value - - return results - - @cache_readonly - def _window_time_obs(self): - window_obs = (Series(self._time_obs_count > 0) - .rolling(self._window, min_periods=1) - .sum() - .values - ) - - window_obs[np.isnan(window_obs)] = 0 - return window_obs.astype(int) - - @cache_readonly - def _nobs_raw(self): - if self._is_rolling: - window = self._window - else: - # expanding case - window = len(self._index) - - result = Series(self._time_obs_count).rolling( - window, min_periods=1).sum().values - - return result.astype(int) - - def _beta_matrix(self, lag=0): - if lag < 0: - raise AssertionError("'lag' must be greater than or equal to 0, " - "input was {0}".format(lag)) - - betas = self._beta_raw - - labels = np.arange(len(self._y)) - lag - indexer = self._valid_obs_labels.searchsorted(labels, side='left') - indexer[indexer == len(betas)] = len(betas) - 1 - - beta_matrix = betas[indexer] - beta_matrix[labels < self._valid_obs_labels[0]] = np.NaN - - return beta_matrix - - @cache_readonly - def _valid_obs_labels(self): - dates = self._index[self._valid_indices] - return self._y.index.searchsorted(dates) - - @cache_readonly - def _nobs(self): - return self._nobs_raw[self._valid_indices] - - @property - def nobs(self): - return Series(self._nobs, index=self._result_index) - - @cache_readonly - def _enough_obs(self): - # XXX: what's the best way to determine where to start? - return self._nobs_raw >= max(self._min_periods, - len(self._x.columns) + 1) - - -def _safe_update(d, other): - """ - Combine dictionaries with non-overlapping keys - """ - for k, v in compat.iteritems(other): - if k in d: - raise Exception('Duplicate regressor: %s' % k) - - d[k] = v - - -def _filter_data(lhs, rhs, weights=None): - """ - Cleans the input for single OLS. - - Parameters - ---------- - lhs : Series - Dependent variable in the regression. - rhs : dict, whose values are Series, DataFrame, or dict - Explanatory variables of the regression. - weights : array-like, optional - 1d array of weights. If None, equivalent to an unweighted OLS. 
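`_filter_data` (whose body follows) boils down to an outer join of y onto the regressors plus a completeness mask. A hedged standalone sketch of that alignment step, with deliberately misaligned toy data:

```python
# Sketch of the alignment _filter_data performs: outer-join y onto the
# regressors, then keep only fully observed rows.
import numpy as np
import pandas as pd

idx = pd.date_range('2000-01-01', periods=6)
rhs = pd.DataFrame({'a': np.arange(5.0)}, index=idx[1:])   # misses first date
lhs = pd.DataFrame({'__y__': pd.Series(np.arange(6.0), index=idx)})

combined = rhs.join(lhs, how='outer')
valid = combined.notnull().all(axis=1)
filt_lhs = combined.loc[valid, '__y__']
filt_rhs = combined.loc[valid, ['a']]
```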
- - Returns - ------- - Series, DataFrame - Cleaned lhs and rhs - """ - if not isinstance(lhs, Series): - if len(lhs) != len(rhs): - raise AssertionError("length of lhs must equal length of rhs") - lhs = Series(lhs, index=rhs.index) - - rhs = _combine_rhs(rhs) - lhs = DataFrame({'__y__': lhs}, dtype=float) - pre_filt_rhs = rhs.dropna(how='any') - - combined = rhs.join(lhs, how='outer') - if weights is not None: - combined['__weights__'] = weights - - valid = (combined.count(1) == len(combined.columns)).values - index = combined.index - combined = combined[valid] - - if weights is not None: - filt_weights = combined.pop('__weights__') - else: - filt_weights = None - - filt_lhs = combined.pop('__y__') - filt_rhs = combined - - if hasattr(filt_weights, 'to_dense'): - filt_weights = filt_weights.to_dense() - - return (filt_lhs.to_dense(), filt_rhs.to_dense(), filt_weights, - pre_filt_rhs.to_dense(), index, valid) - - -def _combine_rhs(rhs): - """ - Glue input X variables together while checking for potential - duplicates - """ - series = {} - - if isinstance(rhs, Series): - series['x'] = rhs - elif isinstance(rhs, DataFrame): - series = rhs.copy() - elif isinstance(rhs, dict): - for name, value in compat.iteritems(rhs): - if isinstance(value, Series): - _safe_update(series, {name: value}) - elif isinstance(value, (dict, DataFrame)): - _safe_update(series, value) - else: # pragma: no cover - raise Exception('Invalid RHS data type: %s' % type(value)) - else: # pragma: no cover - raise Exception('Invalid RHS type: %s' % type(rhs)) - - if not isinstance(series, DataFrame): - series = DataFrame(series, dtype=float) - - return series - -# A little kludge so we can use this method for both -# MovingOLS and MovingPanelOLS - - -def _y_converter(y): - y = y.values.squeeze() - if y.ndim == 0: # pragma: no cover - return np.array([y]) - else: - return y - - -def f_stat_to_dict(result): - f_stat, shape, p_value = result - - result = {} - result['f-stat'] = f_stat - result['DF X'] = shape[0] - result['DF Resid'] = shape[1] - result['p-value'] = p_value - - return result diff --git a/pandas/stats/plm.py b/pandas/stats/plm.py deleted file mode 100644 index 099c45d5ec60b..0000000000000 --- a/pandas/stats/plm.py +++ /dev/null @@ -1,863 +0,0 @@ -""" -Linear regression objects for panel data -""" - -# pylint: disable-msg=W0231 -# pylint: disable-msg=E1101,E1103 - -# flake8: noqa - -from __future__ import division -from pandas.compat import range -from pandas import compat -import warnings - -import numpy as np - -from pandas.core.panel import Panel -from pandas.core.frame import DataFrame -from pandas.core.reshape import get_dummies -from pandas.core.series import Series -from pandas.stats.ols import OLS, MovingOLS -import pandas.stats.common as com -import pandas.stats.math as math -from pandas.util.decorators import cache_readonly - - -class PanelOLS(OLS): - """Implements panel OLS. - - See ols function docs - """ - _panel_model = True - - def __init__(self, y, x, weights=None, intercept=True, nw_lags=None, - entity_effects=False, time_effects=False, x_effects=None, - cluster=None, dropped_dummies=None, verbose=False, - nw_overlap=False): - import warnings - warnings.warn("The pandas.stats.plm module is deprecated and will be " - "removed in a future version. 
We refer to external packages " - "like statsmodels, see some examples here: " - "http://www.statsmodels.org/stable/mixed_linear.html", - FutureWarning, stacklevel=4) - self._x_orig = x - self._y_orig = y - self._weights = weights - - self._intercept = intercept - self._nw_lags = nw_lags - self._nw_overlap = nw_overlap - self._entity_effects = entity_effects - self._time_effects = time_effects - self._x_effects = x_effects - self._dropped_dummies = dropped_dummies or {} - self._cluster = com._get_cluster_type(cluster) - self._verbose = verbose - - (self._x, self._x_trans, - self._x_filtered, self._y, - self._y_trans) = self._prepare_data() - - self._index = self._x.index.levels[0] - - self._T = len(self._index) - - def log(self, msg): - if self._verbose: # pragma: no cover - print(msg) - - def _prepare_data(self): - """Cleans and stacks input data into DataFrame objects - - If time effects is True, then we turn off intercepts and omit an item - from every (entity and x) fixed effect. - - Otherwise: - - If we have an intercept, we omit an item from every fixed effect. - - Else, we omit an item from every fixed effect except one of them. - - The categorical variables will get dropped from x. - """ - (x, x_filtered, y, weights, cat_mapping) = self._filter_data() - - self.log('Adding dummies to X variables') - x = self._add_dummies(x, cat_mapping) - - self.log('Adding dummies to filtered X variables') - x_filtered = self._add_dummies(x_filtered, cat_mapping) - - if self._x_effects: - x = x.drop(self._x_effects, axis=1) - x_filtered = x_filtered.drop(self._x_effects, axis=1) - - if self._time_effects: - x_regressor = x.sub(x.mean(level=0), level=0) - - unstacked_y = y.unstack() - y_regressor = unstacked_y.sub(unstacked_y.mean(1), axis=0).stack() - y_regressor.index = y.index - - elif self._intercept: - # only add intercept when no time effects - self.log('Adding intercept') - x = x_regressor = add_intercept(x) - x_filtered = add_intercept(x_filtered) - y_regressor = y - else: - self.log('No intercept added') - x_regressor = x - y_regressor = y - - if weights is not None: - if not y_regressor.index.equals(weights.index): - raise AssertionError("y_regressor and weights must have the " - "same index") - if not x_regressor.index.equals(weights.index): - raise AssertionError("x_regressor and weights must have the " - "same index") - - rt_weights = np.sqrt(weights) - y_regressor = y_regressor * rt_weights - x_regressor = x_regressor.mul(rt_weights, axis=0) - - return x, x_regressor, x_filtered, y, y_regressor - - def _filter_data(self): - """ - - """ - data = self._x_orig - cat_mapping = {} - - if isinstance(data, DataFrame): - data = data.to_panel() - else: - if isinstance(data, Panel): - data = data.copy() - - data, cat_mapping = self._convert_x(data) - - if not isinstance(data, Panel): - data = Panel.from_dict(data, intersect=True) - - x_names = data.items - - if self._weights is not None: - data['__weights__'] = self._weights - - # Filter x's without y (so we can make a prediction) - filtered = data.to_frame() - - # Filter all data together using to_frame - - # convert to DataFrame - y = self._y_orig - if isinstance(y, Series): - y = y.unstack() - - data['__y__'] = y - data_long = data.to_frame() - - x_filt = filtered.filter(x_names) - x = data_long.filter(x_names) - y = data_long['__y__'] - - if self._weights is not None and not self._weights.empty: - weights = data_long['__weights__'] - else: - weights = None - - return x, x_filt, y, weights, cat_mapping - - def _convert_x(self, x): - # 
Converts non-numeric data in x to floats. x_converted is the - # DataFrame with converted values, and x_conversion is a dict that - # provides the reverse mapping. For example, if 'A' was converted to 0 - # for x named 'variety', then x_conversion['variety'][0] is 'A'. - x_converted = {} - cat_mapping = {} - # x can be either a dict or a Panel, but in Python 3, dicts don't have - # .iteritems - iteritems = getattr(x, 'iteritems', x.items) - for key, df in iteritems(): - if not isinstance(df, DataFrame): - raise AssertionError("all input items must be DataFrames, " - "at least one is of " - "type {0}".format(type(df))) - - if _is_numeric(df): - x_converted[key] = df - else: - try: - df = df.astype(float) - except (TypeError, ValueError): - values = df.values - distinct_values = sorted(set(values.flat)) - cat_mapping[key] = dict(enumerate(distinct_values)) - new_values = np.searchsorted(distinct_values, values) - x_converted[key] = DataFrame(new_values, index=df.index, - columns=df.columns) - - if len(cat_mapping) == 0: - x_converted = x - - return x_converted, cat_mapping - - def _add_dummies(self, panel, mapping): - """ - Add entity and / or categorical dummies to input X DataFrame - - Returns - ------- - DataFrame - """ - panel = self._add_entity_effects(panel) - panel = self._add_categorical_dummies(panel, mapping) - - return panel - - def _add_entity_effects(self, panel): - """ - Add entity dummies to panel - - Returns - ------- - DataFrame - """ - from pandas.core.reshape import make_axis_dummies - - if not self._entity_effects: - return panel - - self.log('-- Adding entity fixed effect dummies') - - dummies = make_axis_dummies(panel, 'minor') - - if not self._use_all_dummies: - if 'entity' in self._dropped_dummies: - to_exclude = str(self._dropped_dummies.get('entity')) - else: - to_exclude = dummies.columns[0] - - if to_exclude not in dummies.columns: - raise Exception('%s not in %s' % (to_exclude, - dummies.columns)) - - self.log('-- Excluding dummy for entity: %s' % to_exclude) - - dummies = dummies.filter(dummies.columns.difference([to_exclude])) - - dummies = dummies.add_prefix('FE_') - panel = panel.join(dummies) - - return panel - - def _add_categorical_dummies(self, panel, cat_mappings): - """ - Add categorical dummies to panel - - Returns - ------- - DataFrame - """ - if not self._x_effects: - return panel - - dropped_dummy = (self._entity_effects and not self._use_all_dummies) - - for effect in self._x_effects: - self.log('-- Adding fixed effect dummies for %s' % effect) - - dummies = get_dummies(panel[effect]) - - val_map = cat_mappings.get(effect) - if val_map: - val_map = dict((v, k) for k, v in compat.iteritems(val_map)) - - if dropped_dummy or not self._use_all_dummies: - if effect in self._dropped_dummies: - to_exclude = mapped_name = self._dropped_dummies.get( - effect) - - if val_map: - mapped_name = val_map[to_exclude] - else: - to_exclude = mapped_name = dummies.columns[0] - - if mapped_name not in dummies.columns: # pragma: no cover - raise Exception('%s not in %s' % (to_exclude, - dummies.columns)) - - self.log( - '-- Excluding dummy for %s: %s' % (effect, to_exclude)) - - dummies = dummies.filter( - dummies.columns.difference([mapped_name])) - dropped_dummy = True - - dummies = _convertDummies(dummies, cat_mappings.get(effect)) - dummies = dummies.add_prefix('%s_' % effect) - panel = panel.join(dummies) - - return panel - - @property - def _use_all_dummies(self): - """ - In the case of using an intercept or including time fixed - effects, completely partitioning 
the sample would make the X - not full rank. - """ - return (not self._intercept and not self._time_effects) - - @cache_readonly - def _beta_raw(self): - """Runs the regression and returns the beta.""" - X = self._x_trans.values - Y = self._y_trans.values.squeeze() - - beta, _, _, _ = np.linalg.lstsq(X, Y) - - return beta - - @cache_readonly - def beta(self): - return Series(self._beta_raw, index=self._x.columns) - - @cache_readonly - def _df_model_raw(self): - """Returns the raw model degrees of freedom.""" - return self._df_raw - 1 - - @cache_readonly - def _df_resid_raw(self): - """Returns the raw residual degrees of freedom.""" - return self._nobs - self._df_raw - - @cache_readonly - def _df_raw(self): - """Returns the degrees of freedom.""" - df = math.rank(self._x_trans.values) - if self._time_effects: - df += self._total_times - - return df - - @cache_readonly - def _r2_raw(self): - Y = self._y_trans.values.squeeze() - X = self._x_trans.values - - resid = Y - np.dot(X, self._beta_raw) - - SSE = (resid ** 2).sum() - - if self._use_centered_tss: - SST = ((Y - np.mean(Y)) ** 2).sum() - else: - SST = (Y ** 2).sum() - - return 1 - SSE / SST - - @property - def _use_centered_tss(self): - # has_intercept = np.abs(self._resid_raw.sum()) < _FP_ERR - return self._intercept or self._entity_effects or self._time_effects - - @cache_readonly - def _r2_adj_raw(self): - """Returns the raw r-squared adjusted values.""" - nobs = self._nobs - factors = (nobs - 1) / (nobs - self._df_raw) - return 1 - (1 - self._r2_raw) * factors - - @cache_readonly - def _resid_raw(self): - Y = self._y.values.squeeze() - X = self._x.values - return Y - np.dot(X, self._beta_raw) - - @cache_readonly - def resid(self): - return self._unstack_vector(self._resid_raw) - - @cache_readonly - def _rmse_raw(self): - """Returns the raw rmse values.""" - # X = self._x.values - # Y = self._y.values.squeeze() - - X = self._x_trans.values - Y = self._y_trans.values.squeeze() - - resid = Y - np.dot(X, self._beta_raw) - ss = (resid ** 2).sum() - return np.sqrt(ss / (self._nobs - self._df_raw)) - - @cache_readonly - def _var_beta_raw(self): - cluster_axis = None - if self._cluster == 'time': - cluster_axis = 0 - elif self._cluster == 'entity': - cluster_axis = 1 - - x = self._x - y = self._y - - if self._time_effects: - xx = _xx_time_effects(x, y) - else: - xx = np.dot(x.values.T, x.values) - - return _var_beta_panel(y, x, self._beta_raw, xx, - self._rmse_raw, cluster_axis, self._nw_lags, - self._nobs, self._df_raw, self._nw_overlap) - - @cache_readonly - def _y_fitted_raw(self): - """Returns the raw fitted y values.""" - return np.dot(self._x.values, self._beta_raw) - - @cache_readonly - def y_fitted(self): - return self._unstack_vector(self._y_fitted_raw, index=self._x.index) - - def _unstack_vector(self, vec, index=None): - if index is None: - index = self._y_trans.index - panel = DataFrame(vec, index=index, columns=['dummy']) - return panel.to_panel()['dummy'] - - def _unstack_y(self, vec): - unstacked = self._unstack_vector(vec) - return unstacked.reindex(self.beta.index) - - @cache_readonly - def _time_obs_count(self): - return self._y_trans.count(level=0).values - - @cache_readonly - def _time_has_obs(self): - return self._time_obs_count > 0 - - @property - def _nobs(self): - return len(self._y) - - -def _convertDummies(dummies, mapping): - # cleans up the names of the generated dummies - new_items = [] - for item in dummies.columns: - if not mapping: - var = str(item) - if isinstance(item, float): - var = '%g' % item - - 
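The entity/categorical dummy machinery above (`_add_entity_effects`, `_add_categorical_dummies`, `_use_all_dummies`) can be reproduced with pandas directly; dropping one level avoids the rank deficiency that `_use_all_dummies` guards against when an intercept is present. A hedged sketch with invented data:

```python
# The categorical-dummy expansion PanelOLS wires up by hand, done with
# pandas; drop_first=True omits one level to keep X full rank.
import pandas as pd

df = pd.DataFrame({'variety': ['A', 'B', 'A', 'C']})
dummies = pd.get_dummies(df['variety'], prefix='variety', drop_first=True)
print(dummies)
```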
new_items.append(var) - else: - # renames the dummies if a conversion dict is provided - new_items.append(mapping[int(item)]) - - dummies = DataFrame(dummies.values, index=dummies.index, - columns=new_items) - - return dummies - - -def _is_numeric(df): - for col in df: - if df[col].dtype.name == 'object': - return False - - return True - - -def add_intercept(panel, name='intercept'): - """ - Add column of ones to input panel - - Parameters - ---------- - panel: Panel / DataFrame - name: string, default 'intercept'] - - Returns - ------- - New object (same type as input) - """ - panel = panel.copy() - panel[name] = 1. - - return panel.consolidate() - - -class MovingPanelOLS(MovingOLS, PanelOLS): - """Implements rolling/expanding panel OLS. - - See ols function docs - """ - _panel_model = True - - def __init__(self, y, x, weights=None, - window_type='expanding', window=None, - min_periods=None, - min_obs=None, - intercept=True, - nw_lags=None, nw_overlap=False, - entity_effects=False, - time_effects=False, - x_effects=None, - cluster=None, - dropped_dummies=None, - verbose=False): - - self._args = dict(intercept=intercept, - nw_lags=nw_lags, - nw_overlap=nw_overlap, - entity_effects=entity_effects, - time_effects=time_effects, - x_effects=x_effects, - cluster=cluster, - dropped_dummies=dropped_dummies, - verbose=verbose) - - PanelOLS.__init__(self, y=y, x=x, weights=weights, - **self._args) - - self._set_window(window_type, window, min_periods) - - if min_obs is None: - min_obs = len(self._x.columns) + 1 - - self._min_obs = min_obs - - @cache_readonly - def resid(self): - return self._unstack_y(self._resid_raw) - - @cache_readonly - def y_fitted(self): - return self._unstack_y(self._y_fitted_raw) - - @cache_readonly - def y_predict(self): - """Returns the predicted y values.""" - return self._unstack_y(self._y_predict_raw) - - def lagged_y_predict(self, lag=1): - """ - Compute forecast Y value lagging coefficient by input number - of time periods - - Parameters - ---------- - lag : int - - Returns - ------- - DataFrame - """ - x = self._x.values - betas = self._beta_matrix(lag=lag) - return self._unstack_y((betas * x).sum(1)) - - @cache_readonly - def _rolling_ols_call(self): - return self._calc_betas(self._x_trans, self._y_trans) - - @cache_readonly - def _df_raw(self): - """Returns the degrees of freedom.""" - df = self._rolling_rank() - - if self._time_effects: - df += self._window_time_obs - - return df[self._valid_indices] - - @cache_readonly - def _var_beta_raw(self): - """Returns the raw covariance of beta.""" - x = self._x - y = self._y - - dates = x.index.levels[0] - - cluster_axis = None - if self._cluster == 'time': - cluster_axis = 0 - elif self._cluster == 'entity': - cluster_axis = 1 - - nobs = self._nobs - rmse = self._rmse_raw - beta = self._beta_raw - df = self._df_raw - window = self._window - - if not self._time_effects: - # Non-transformed X - cum_xx = self._cum_xx(x) - - results = [] - for n, i in enumerate(self._valid_indices): - if self._is_rolling and i >= window: - prior_date = dates[i - window + 1] - else: - prior_date = dates[0] - - date = dates[i] - - x_slice = x.truncate(prior_date, date) - y_slice = y.truncate(prior_date, date) - - if self._time_effects: - xx = _xx_time_effects(x_slice, y_slice) - else: - xx = cum_xx[i] - if self._is_rolling and i >= window: - xx = xx - cum_xx[i - window] - - result = _var_beta_panel(y_slice, x_slice, beta[n], xx, rmse[n], - cluster_axis, self._nw_lags, - nobs[n], df[n], self._nw_overlap) - - results.append(result) - - return 
np.array(results) - - @cache_readonly - def _resid_raw(self): - beta_matrix = self._beta_matrix(lag=0) - - Y = self._y.values.squeeze() - X = self._x.values - resid = Y - (X * beta_matrix).sum(1) - - return resid - - @cache_readonly - def _y_fitted_raw(self): - x = self._x.values - betas = self._beta_matrix(lag=0) - return (betas * x).sum(1) - - @cache_readonly - def _y_predict_raw(self): - """Returns the raw predicted y values.""" - x = self._x.values - betas = self._beta_matrix(lag=1) - return (betas * x).sum(1) - - def _beta_matrix(self, lag=0): - if lag < 0: - raise AssertionError("'lag' must be greater than or equal to 0, " - "input was {0}".format(lag)) - - index = self._y_trans.index - major_labels = index.labels[0] - labels = major_labels - lag - indexer = self._valid_indices.searchsorted(labels, side='left') - - beta_matrix = self._beta_raw[indexer] - beta_matrix[labels < self._valid_indices[0]] = np.NaN - - return beta_matrix - - @cache_readonly - def _enough_obs(self): - # XXX: what's the best way to determine where to start? - # TODO: write unit tests for this - - rank_threshold = len(self._x.columns) + 1 - if self._min_obs < rank_threshold: # pragma: no cover - warnings.warn('min_obs is smaller than rank of X matrix') - - enough_observations = self._nobs_raw >= self._min_obs - enough_time_periods = self._window_time_obs >= self._min_periods - return enough_time_periods & enough_observations - - -def create_ols_dict(attr): - def attr_getter(self): - d = {} - for k, v in compat.iteritems(self.results): - result = getattr(v, attr) - d[k] = result - - return d - - return attr_getter - - -def create_ols_attr(attr): - return property(create_ols_dict(attr)) - - -class NonPooledPanelOLS(object): - """Implements non-pooled panel OLS. - - Parameters - ---------- - y : DataFrame - x : Series, DataFrame, or dict of Series - intercept : bool - True if you want an intercept. - nw_lags : None or int - Number of Newey-West lags. - window_type : {'full_sample', 'rolling', 'expanding'} - 'full_sample' by default - window : int - size of window (for rolling/expanding OLS) - """ - - ATTRIBUTES = [ - 'beta', - 'df', - 'df_model', - 'df_resid', - 'f_stat', - 'p_value', - 'r2', - 'r2_adj', - 'resid', - 'rmse', - 'std_err', - 'summary_as_matrix', - 't_stat', - 'var_beta', - 'x', - 'y', - 'y_fitted', - 'y_predict' - ] - - def __init__(self, y, x, window_type='full_sample', window=None, - min_periods=None, intercept=True, nw_lags=None, - nw_overlap=False): - - import warnings - warnings.warn("The pandas.stats.plm module is deprecated and will be " - "removed in a future version. 
We refer to external packages " - "like statsmodels, see some examples here: " - "http://www.statsmodels.org/stable/mixed_linear.html", - FutureWarning, stacklevel=4) - - for attr in self.ATTRIBUTES: - setattr(self.__class__, attr, create_ols_attr(attr)) - - results = {} - - for entity in y: - entity_y = y[entity] - - entity_x = {} - for x_var in x: - entity_x[x_var] = x[x_var][entity] - - from pandas.stats.interface import ols - results[entity] = ols(y=entity_y, - x=entity_x, - window_type=window_type, - window=window, - min_periods=min_periods, - intercept=intercept, - nw_lags=nw_lags, - nw_overlap=nw_overlap) - - self.results = results - - -def _var_beta_panel(y, x, beta, xx, rmse, cluster_axis, - nw_lags, nobs, df, nw_overlap): - xx_inv = math.inv(xx) - - yv = y.values - - if cluster_axis is None: - if nw_lags is None: - return xx_inv * (rmse ** 2) - else: - resid = yv - np.dot(x.values, beta) - m = (x.values.T * resid).T - - xeps = math.newey_west(m, nw_lags, nobs, df, nw_overlap) - - return np.dot(xx_inv, np.dot(xeps, xx_inv)) - else: - Xb = np.dot(x.values, beta).reshape((len(x.values), 1)) - resid = DataFrame(yv[:, None] - Xb, index=y.index, columns=['resid']) - - if cluster_axis == 1: - x = x.swaplevel(0, 1).sortlevel(0) - resid = resid.swaplevel(0, 1).sortlevel(0) - - m = _group_agg(x.values * resid.values, x.index._bounds, - lambda x: np.sum(x, axis=0)) - - if nw_lags is None: - nw_lags = 0 - - xox = 0 - for i in range(len(x.index.levels[0])): - xox += math.newey_west(m[i: i + 1], nw_lags, - nobs, df, nw_overlap) - - return np.dot(xx_inv, np.dot(xox, xx_inv)) - - -def _group_agg(values, bounds, f): - """ - R-style aggregator - - Parameters - ---------- - values : N-length or N x K ndarray - bounds : B-length ndarray - f : ndarray aggregation function - - Returns - ------- - ndarray with same length as bounds array - """ - if values.ndim == 1: - N = len(values) - result = np.empty(len(bounds), dtype=float) - elif values.ndim == 2: - N, K = values.shape - result = np.empty((len(bounds), K), dtype=float) - - testagg = f(values[:min(1, len(values))]) - if isinstance(testagg, np.ndarray) and testagg.ndim == 2: - raise AssertionError('Function must reduce') - - for i, left_bound in enumerate(bounds): - if i == len(bounds) - 1: - right_bound = N - else: - right_bound = bounds[i + 1] - - result[i] = f(values[left_bound:right_bound]) - - return result - - -def _xx_time_effects(x, y): - """ - Returns X'X - (X'T) (T'T)^-1 (T'X) - """ - # X'X - xx = np.dot(x.values.T, x.values) - xt = x.sum(level=0).values - - count = y.unstack().count(1).values - selector = count > 0 - - # X'X - (T'T)^-1 (T'X) - xt = xt[selector] - count = count[selector] - - return xx - np.dot(xt.T / count, xt) diff --git a/pandas/stats/tests/common.py b/pandas/stats/tests/common.py deleted file mode 100644 index 0ce4b20a4b719..0000000000000 --- a/pandas/stats/tests/common.py +++ /dev/null @@ -1,162 +0,0 @@ -# pylint: disable-msg=W0611,W0402 -# flake8: noqa - -from datetime import datetime -import string -import nose - -import numpy as np - -from pandas import DataFrame, bdate_range -from pandas.util.testing import assert_almost_equal # imported in other tests -import pandas.util.testing as tm - -N = 100 -K = 4 - -start = datetime(2007, 1, 1) -DATE_RANGE = bdate_range(start, periods=N) - -COLS = ['Col' + c for c in string.ascii_uppercase[:K]] - - -def makeDataFrame(): - data = DataFrame(np.random.randn(N, K), - columns=COLS, - index=DATE_RANGE) - - return data - - -def getBasicDatasets(): - A = makeDataFrame() - B = 
makeDataFrame() - C = makeDataFrame() - - return A, B, C - - -def check_for_scipy(): - try: - import scipy - except ImportError: - raise nose.SkipTest('no scipy') - - -def check_for_statsmodels(): - _have_statsmodels = True - try: - import statsmodels.api as sm - except ImportError: - try: - import scikits.statsmodels.api as sm - except ImportError: - raise nose.SkipTest('no statsmodels') - - -class BaseTest(tm.TestCase): - - def setUp(self): - check_for_scipy() - check_for_statsmodels() - - self.A, self.B, self.C = getBasicDatasets() - - self.createData1() - self.createData2() - self.createData3() - - def createData1(self): - date = datetime(2007, 1, 1) - date2 = datetime(2007, 1, 15) - date3 = datetime(2007, 1, 22) - - A = self.A.copy() - B = self.B.copy() - C = self.C.copy() - - A['ColA'][date] = np.NaN - B['ColA'][date] = np.NaN - C['ColA'][date] = np.NaN - C['ColA'][date2] = np.NaN - - # truncate data to save time - A = A[:30] - B = B[:30] - C = C[:30] - - self.panel_y = A - self.panel_x = {'B': B, 'C': C} - - self.series_panel_y = A.filter(['ColA']) - self.series_panel_x = {'B': B.filter(['ColA']), - 'C': C.filter(['ColA'])} - self.series_y = A['ColA'] - self.series_x = {'B': B['ColA'], - 'C': C['ColA']} - - def createData2(self): - y_data = [[1, np.NaN], - [2, 3], - [4, 5]] - y_index = [datetime(2000, 1, 1), - datetime(2000, 1, 2), - datetime(2000, 1, 3)] - y_cols = ['A', 'B'] - self.panel_y2 = DataFrame(np.array(y_data), index=y_index, - columns=y_cols) - - x1_data = [[6, np.NaN], - [7, 8], - [9, 30], - [11, 12]] - x1_index = [datetime(2000, 1, 1), - datetime(2000, 1, 2), - datetime(2000, 1, 3), - datetime(2000, 1, 4)] - x1_cols = ['A', 'B'] - x1 = DataFrame(np.array(x1_data), index=x1_index, - columns=x1_cols) - - x2_data = [[13, 14, np.NaN], - [15, np.NaN, np.NaN], - [16, 17, 48], - [19, 20, 21], - [22, 23, 24]] - x2_index = [datetime(2000, 1, 1), - datetime(2000, 1, 2), - datetime(2000, 1, 3), - datetime(2000, 1, 4), - datetime(2000, 1, 5)] - x2_cols = ['C', 'A', 'B'] - x2 = DataFrame(np.array(x2_data), index=x2_index, - columns=x2_cols) - - self.panel_x2 = {'x1': x1, 'x2': x2} - - def createData3(self): - y_data = [[1, 2], - [3, 4]] - y_index = [datetime(2000, 1, 1), - datetime(2000, 1, 2)] - y_cols = ['A', 'B'] - self.panel_y3 = DataFrame(np.array(y_data), index=y_index, - columns=y_cols) - - x1_data = [['A', 'B'], - ['C', 'A']] - x1_index = [datetime(2000, 1, 1), - datetime(2000, 1, 2)] - x1_cols = ['A', 'B'] - x1 = DataFrame(np.array(x1_data), index=x1_index, - columns=x1_cols) - - x2_data = [['foo', 'bar'], - ['baz', 'foo']] - x2_index = [datetime(2000, 1, 1), - datetime(2000, 1, 2)] - x2_cols = ['A', 'B'] - x2 = DataFrame(np.array(x2_data), index=x2_index, - columns=x2_cols) - - self.panel_x3 = {'x1': x1, 'x2': x2} diff --git a/pandas/stats/tests/test_fama_macbeth.py b/pandas/stats/tests/test_fama_macbeth.py deleted file mode 100644 index 706becfa730c4..0000000000000 --- a/pandas/stats/tests/test_fama_macbeth.py +++ /dev/null @@ -1,73 +0,0 @@ -# flake8: noqa - -from pandas import DataFrame, Panel -from pandas.stats.api import fama_macbeth -from .common import assert_almost_equal, BaseTest - -from pandas.compat import range -from pandas import compat -import pandas.util.testing as tm -import numpy as np - - -class TestFamaMacBeth(BaseTest): - - def testFamaMacBethRolling(self): - # self.checkFamaMacBethExtended('rolling', self.panel_x, self.panel_y, - # nw_lags_beta=2) - - # df = DataFrame(np.random.randn(50, 10)) - x = dict((k, DataFrame(np.random.randn(50, 10))) for k in 
'abcdefg') - x = Panel.from_dict(x) - y = (DataFrame(np.random.randn(50, 10)) + - DataFrame(0.01 * np.random.randn(50, 10))) - self.checkFamaMacBethExtended('rolling', x, y, nw_lags_beta=2) - self.checkFamaMacBethExtended('expanding', x, y, nw_lags_beta=2) - - def checkFamaMacBethExtended(self, window_type, x, y, **kwds): - window = 25 - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = fama_macbeth(y=y, x=x, window_type=window_type, window=window, - **kwds) - self._check_stuff_works(result) - - index = result._index - time = len(index) - - for i in range(time - window + 1): - if window_type == 'rolling': - start = index[i] - else: - start = index[0] - - end = index[i + window - 1] - - x2 = {} - for k, v in x.iteritems(): - x2[k] = v.truncate(start, end) - y2 = y.truncate(start, end) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - reference = fama_macbeth(y=y2, x=x2, **kwds) - # reference._stats is tuple - assert_almost_equal(reference._stats, result._stats[:, i], - check_dtype=False) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - static = fama_macbeth(y=y2, x=x2, **kwds) - self._check_stuff_works(static) - - def _check_stuff_works(self, result): - # does it work? - attrs = ['mean_beta', 'std_beta', 't_stat'] - for attr in attrs: - getattr(result, attr) - - # does it work? - result.summary - -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/stats/tests/test_math.py b/pandas/stats/tests/test_math.py deleted file mode 100644 index bc09f33d2f467..0000000000000 --- a/pandas/stats/tests/test_math.py +++ /dev/null @@ -1,63 +0,0 @@ -import nose - -from datetime import datetime -from numpy.random import randn -import numpy as np - -from pandas.core.api import Series, DataFrame, date_range -import pandas.util.testing as tm -import pandas.stats.math as pmath -from pandas import ols - -N, K = 100, 10 - -_have_statsmodels = True -try: - import statsmodels.api as sm -except ImportError: - try: - import scikits.statsmodels.api as sm # noqa - except ImportError: - _have_statsmodels = False - - -class TestMath(tm.TestCase): - - _nan_locs = np.arange(20, 40) - _inf_locs = np.array([]) - - def setUp(self): - arr = randn(N) - arr[self._nan_locs] = np.NaN - - self.arr = arr - self.rng = date_range(datetime(2009, 1, 1), periods=N) - - self.series = Series(arr.copy(), index=self.rng) - - self.frame = DataFrame(randn(N, K), index=self.rng, - columns=np.arange(K)) - - def test_rank_1d(self): - self.assertEqual(1, pmath.rank(self.series)) - self.assertEqual(0, pmath.rank(Series(0, self.series.index))) - - def test_solve_rect(self): - if not _have_statsmodels: - raise nose.SkipTest("no statsmodels") - - b = Series(np.random.randn(N), self.frame.index) - result = pmath.solve(self.frame, b) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - expected = ols(y=b, x=self.frame, intercept=False).beta - self.assertTrue(np.allclose(result, expected)) - - def test_inv_illformed(self): - singular = DataFrame(np.array([[1, 1], [2, 2]])) - rs = pmath.inv(singular) - expected = np.array([[0.1, 0.2], [0.1, 0.2]]) - self.assertTrue(np.allclose(rs, expected)) - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/stats/tests/test_ols.py b/pandas/stats/tests/test_ols.py deleted file mode 100644 index 6f688649affb0..0000000000000 --- 
a/pandas/stats/tests/test_ols.py +++ /dev/null @@ -1,982 +0,0 @@ -""" -Unit test suite for OLS and PanelOLS classes -""" - -# pylint: disable-msg=W0212 - -# flake8: noqa - -from __future__ import division - -from datetime import datetime -from pandas import compat -from distutils.version import LooseVersion -import nose -import numpy as np - -from pandas import date_range, bdate_range -from pandas.core.panel import Panel -from pandas import DataFrame, Index, Series, notnull, offsets -from pandas.stats.api import ols -from pandas.stats.ols import _filter_data -from pandas.stats.plm import NonPooledPanelOLS, PanelOLS -from pandas.util.testing import (assert_almost_equal, assert_series_equal, - assert_frame_equal, assertRaisesRegexp, slow) -import pandas.util.testing as tm -import pandas.compat as compat -from pandas.stats.tests.common import BaseTest - -_have_statsmodels = True -try: - import statsmodels.api as sm -except ImportError: - try: - import scikits.statsmodels.api as sm - except ImportError: - _have_statsmodels = False - - -def _check_repr(obj): - repr(obj) - str(obj) - - -def _compare_ols_results(model1, model2): - tm.assertIsInstance(model1, type(model2)) - - if hasattr(model1, '_window_type'): - _compare_moving_ols(model1, model2) - else: - _compare_fullsample_ols(model1, model2) - - -def _compare_fullsample_ols(model1, model2): - assert_series_equal(model1.beta, model2.beta) - - -def _compare_moving_ols(model1, model2): - assert_frame_equal(model1.beta, model2.beta) - - -class TestOLS(BaseTest): - - _multiprocess_can_split_ = True - - # TODO: Add tests for OLS y predict - # TODO: Right now we just check for consistency between full-sample and - # rolling/expanding results of the panel OLS. We should also cross-check - # with trusted implementations of panel OLS (e.g. R). - # TODO: Add tests for non pooled OLS. 
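These tests validate the deprecated estimators against statsmodels reference fits. The core pattern, reduced to a self-contained sketch (synthetic data, illustrative coefficients):

```python
# The cross-check pattern used throughout these tests: fit the same data
# with statsmodels and assert the coefficient estimates agree.
import numpy as np
import statsmodels.api as sm
from numpy.testing import assert_allclose

rng = np.random.default_rng(0)
exog = sm.add_constant(rng.standard_normal((50, 3)), prepend=False)
endog = exog @ np.array([1.0, 0.5, -2.0, 0.3]) + rng.standard_normal(50)

reference = sm.OLS(endog, exog).fit()
beta, *_ = np.linalg.lstsq(exog, endog, rcond=None)
assert_allclose(reference.params, beta)   # same least-squares solution
```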
- - @classmethod - def setUpClass(cls): - super(TestOLS, cls).setUpClass() - try: - import matplotlib as mpl - mpl.use('Agg', warn=False) - except ImportError: - pass - - if not _have_statsmodels: - raise nose.SkipTest("no statsmodels") - - def testOLSWithDatasets_ccard(self): - self.checkDataSet(sm.datasets.ccard.load(), skip_moving=True) - self.checkDataSet(sm.datasets.cpunish.load(), skip_moving=True) - self.checkDataSet(sm.datasets.longley.load(), skip_moving=True) - self.checkDataSet(sm.datasets.stackloss.load(), skip_moving=True) - - @slow - def testOLSWithDatasets_copper(self): - self.checkDataSet(sm.datasets.copper.load()) - - @slow - def testOLSWithDatasets_scotland(self): - self.checkDataSet(sm.datasets.scotland.load()) - - # degenerate case fails on some platforms - # self.checkDataSet(datasets.ccard.load(), 39, 49) # one col in X all - # 0s - - def testWLS(self): - # WLS centered SS changed (fixed) in 0.5.0 - sm_version = sm.version.version - if sm_version < LooseVersion('0.5.0'): - raise nose.SkipTest("WLS centered SS not fixed in statsmodels" - " version {0}".format(sm_version)) - - X = DataFrame(np.random.randn(30, 4), columns=['A', 'B', 'C', 'D']) - Y = Series(np.random.randn(30)) - weights = X.std(1) - - self._check_wls(X, Y, weights) - - weights.ix[[5, 15]] = np.nan - Y[[2, 21]] = np.nan - self._check_wls(X, Y, weights) - - def _check_wls(self, x, y, weights): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = ols(y=y, x=x, weights=1 / weights) - - combined = x.copy() - combined['__y__'] = y - combined['__weights__'] = weights - combined = combined.dropna() - - endog = combined.pop('__y__').values - aweights = combined.pop('__weights__').values - exog = sm.add_constant(combined.values, prepend=False) - - sm_result = sm.WLS(endog, exog, weights=1 / aweights).fit() - - assert_almost_equal(sm_result.params, result._beta_raw) - assert_almost_equal(sm_result.resid, result._resid_raw) - - self.checkMovingOLS('rolling', x, y, weights=weights) - self.checkMovingOLS('expanding', x, y, weights=weights) - - def checkDataSet(self, dataset, start=None, end=None, skip_moving=False): - exog = dataset.exog[start: end] - endog = dataset.endog[start: end] - x = DataFrame(exog, index=np.arange(exog.shape[0]), - columns=np.arange(exog.shape[1])) - y = Series(endog, index=np.arange(len(endog))) - - self.checkOLS(exog, endog, x, y) - - if not skip_moving: - self.checkMovingOLS('rolling', x, y) - self.checkMovingOLS('rolling', x, y, nw_lags=0) - self.checkMovingOLS('expanding', x, y, nw_lags=0) - self.checkMovingOLS('rolling', x, y, nw_lags=1) - self.checkMovingOLS('expanding', x, y, nw_lags=1) - self.checkMovingOLS('expanding', x, y, nw_lags=1, nw_overlap=True) - - def checkOLS(self, exog, endog, x, y): - reference = sm.OLS(endog, sm.add_constant(exog, prepend=False)).fit() - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = ols(y=y, x=x) - - # check that sparse version is the same - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - sparse_result = ols(y=y.to_sparse(), x=x.to_sparse()) - _compare_ols_results(result, sparse_result) - - assert_almost_equal(reference.params, result._beta_raw) - assert_almost_equal(reference.df_model, result._df_model_raw) - assert_almost_equal(reference.df_resid, result._df_resid_raw) - assert_almost_equal(reference.fvalue, result._f_stat_raw[0]) - assert_almost_equal(reference.pvalues, result._p_value_raw) - assert_almost_equal(reference.rsquared, result._r2_raw) - 
assert_almost_equal(reference.rsquared_adj, result._r2_adj_raw) - assert_almost_equal(reference.resid, result._resid_raw) - assert_almost_equal(reference.bse, result._std_err_raw) - assert_almost_equal(reference.tvalues, result._t_stat_raw) - assert_almost_equal(reference.cov_params(), result._var_beta_raw) - assert_almost_equal(reference.fittedvalues, result._y_fitted_raw) - - _check_non_raw_results(result) - - def checkMovingOLS(self, window_type, x, y, weights=None, **kwds): - window = np.linalg.matrix_rank(x.values) * 2 - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - moving = ols(y=y, x=x, weights=weights, window_type=window_type, - window=window, **kwds) - - # check that sparse version is the same - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - sparse_moving = ols(y=y.to_sparse(), x=x.to_sparse(), - weights=weights, - window_type=window_type, - window=window, **kwds) - _compare_ols_results(moving, sparse_moving) - - index = moving._index - - for n, i in enumerate(moving._valid_indices): - if window_type == 'rolling' and i >= window: - prior_date = index[i - window + 1] - else: - prior_date = index[0] - - date = index[i] - - x_iter = {} - for k, v in compat.iteritems(x): - x_iter[k] = v.truncate(before=prior_date, after=date) - y_iter = y.truncate(before=prior_date, after=date) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - static = ols(y=y_iter, x=x_iter, weights=weights, **kwds) - - self.compare(static, moving, event_index=i, - result_index=n) - - _check_non_raw_results(moving) - - FIELDS = ['beta', 'df', 'df_model', 'df_resid', 'f_stat', 'p_value', - 'r2', 'r2_adj', 'rmse', 'std_err', 't_stat', - 'var_beta'] - - def compare(self, static, moving, event_index=None, - result_index=None): - - index = moving._index - - # Check resid if we have a time index specified - if event_index is not None: - ref = static._resid_raw[-1] - - label = index[event_index] - - res = moving.resid[label] - - assert_almost_equal(ref, res) - - ref = static._y_fitted_raw[-1] - res = moving.y_fitted[label] - - assert_almost_equal(ref, res) - - # Check y_fitted - - for field in self.FIELDS: - attr = '_%s_raw' % field - - ref = getattr(static, attr) - res = getattr(moving, attr) - - if result_index is not None: - res = res[result_index] - - assert_almost_equal(ref, res) - - def test_ols_object_dtype(self): - df = DataFrame(np.random.randn(20, 2), dtype=object) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols(y=df[0], x=df[1]) - summary = repr(model) - - -class TestOLSMisc(tm.TestCase): - - _multiprocess_can_split_ = True - - """ - For test coverage with faux data - """ - @classmethod - def setUpClass(cls): - super(TestOLSMisc, cls).setUpClass() - if not _have_statsmodels: - raise nose.SkipTest("no statsmodels") - - def test_f_test(self): - x = tm.makeTimeDataFrame() - y = x.pop('A') - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols(y=y, x=x) - - hyp = '1*B+1*C+1*D=0' - result = model.f_test(hyp) - - hyp = ['1*B=0', - '1*C=0', - '1*D=0'] - result = model.f_test(hyp) - assert_almost_equal(result['f-stat'], model.f_stat['f-stat']) - - self.assertRaises(Exception, model.f_test, '1*A=0') - - def test_r2_no_intercept(self): - y = tm.makeTimeSeries() - x = tm.makeTimeDataFrame() - - x_with = x.copy() - x_with['intercept'] = 1. 
- - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model1 = ols(y=y, x=x) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model2 = ols(y=y, x=x_with, intercept=False) - assert_series_equal(model1.beta, model2.beta) - - # TODO: can we infer whether the intercept is there... - self.assertNotEqual(model1.r2, model2.r2) - - # rolling - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model1 = ols(y=y, x=x, window=20) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model2 = ols(y=y, x=x_with, window=20, intercept=False) - assert_frame_equal(model1.beta, model2.beta) - self.assertTrue((model1.r2 != model2.r2).all()) - - def test_summary_many_terms(self): - x = DataFrame(np.random.randn(100, 20)) - y = np.random.randn(100) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols(y=y, x=x) - model.summary - - def test_y_predict(self): - y = tm.makeTimeSeries() - x = tm.makeTimeDataFrame() - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model1 = ols(y=y, x=x) - assert_series_equal(model1.y_predict, model1.y_fitted) - assert_almost_equal(model1._y_predict_raw, model1._y_fitted_raw) - - def test_predict(self): - y = tm.makeTimeSeries() - x = tm.makeTimeDataFrame() - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model1 = ols(y=y, x=x) - assert_series_equal(model1.predict(), model1.y_predict) - assert_series_equal(model1.predict(x=x), model1.y_predict) - assert_series_equal(model1.predict(beta=model1.beta), model1.y_predict) - - exog = x.copy() - exog['intercept'] = 1. - rs = Series(np.dot(exog.values, model1.beta.values), x.index) - assert_series_equal(model1.y_predict, rs) - - x2 = x.reindex(columns=x.columns[::-1]) - assert_series_equal(model1.predict(x=x2), model1.y_predict) - - x3 = x2 + 10 - pred3 = model1.predict(x=x3) - x3['intercept'] = 1. 
- x3 = x3.reindex(columns=model1.beta.index) - expected = Series(np.dot(x3.values, model1.beta.values), x3.index) - assert_series_equal(expected, pred3) - - beta = Series(0., model1.beta.index) - pred4 = model1.predict(beta=beta) - assert_series_equal(Series(0., pred4.index), pred4) - - def test_predict_longer_exog(self): - exogenous = {"1998": "4760", "1999": "5904", "2000": "4504", - "2001": "9808", "2002": "4241", "2003": "4086", - "2004": "4687", "2005": "7686", "2006": "3740", - "2007": "3075", "2008": "3753", "2009": "4679", - "2010": "5468", "2011": "7154", "2012": "4292", - "2013": "4283", "2014": "4595", "2015": "9194", - "2016": "4221", "2017": "4520"} - endogenous = {"1998": "691", "1999": "1580", "2000": "80", - "2001": "1450", "2002": "555", "2003": "956", - "2004": "877", "2005": "614", "2006": "468", - "2007": "191"} - - endog = Series(endogenous) - exog = Series(exogenous) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols(y=endog, x=exog) - - pred = model.y_predict - self.assert_index_equal(pred.index, exog.index) - - def test_longpanel_series_combo(self): - wp = tm.makePanel() - lp = wp.to_frame() - - y = lp.pop('ItemA') - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols(y=y, x=lp, entity_effects=True, window=20) - self.assertTrue(notnull(model.beta.values).all()) - tm.assertIsInstance(model, PanelOLS) - model.summary - - def test_series_rhs(self): - y = tm.makeTimeSeries() - x = tm.makeTimeSeries() - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols(y=y, x=x) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - expected = ols(y=y, x={'x': x}) - assert_series_equal(model.beta, expected.beta) - - # GH 5233/5250 - assert_series_equal(model.y_predict, model.predict(x=x)) - - def test_various_attributes(self): - # just make sure everything "works". 
test correctness elsewhere - - x = DataFrame(np.random.randn(100, 5)) - y = np.random.randn(100) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols(y=y, x=x, window=20) - - series_attrs = ['rank', 'df', 'forecast_mean', 'forecast_vol'] - - for attr in series_attrs: - value = getattr(model, attr) - tm.assertIsInstance(value, Series) - - # works - model._results - - def test_catch_regressor_overlap(self): - df1 = tm.makeTimeDataFrame().ix[:, ['A', 'B']] - df2 = tm.makeTimeDataFrame().ix[:, ['B', 'C', 'D']] - y = tm.makeTimeSeries() - - data = {'foo': df1, 'bar': df2} - - def f(): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - ols(y=y, x=data) - self.assertRaises(Exception, f) - - def test_plm_ctor(self): - y = tm.makeTimeDataFrame() - x = {'a': tm.makeTimeDataFrame(), - 'b': tm.makeTimeDataFrame()} - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols(y=y, x=x, intercept=False) - model.summary - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols(y=y, x=Panel(x)) - model.summary - - def test_plm_attrs(self): - y = tm.makeTimeDataFrame() - x = {'a': tm.makeTimeDataFrame(), - 'b': tm.makeTimeDataFrame()} - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - rmodel = ols(y=y, x=x, window=10) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols(y=y, x=x) - model.resid - rmodel.resid - - def test_plm_lagged_y_predict(self): - y = tm.makeTimeDataFrame() - x = {'a': tm.makeTimeDataFrame(), - 'b': tm.makeTimeDataFrame()} - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols(y=y, x=x, window=10) - result = model.lagged_y_predict(2) - - def test_plm_f_test(self): - y = tm.makeTimeDataFrame() - x = {'a': tm.makeTimeDataFrame(), - 'b': tm.makeTimeDataFrame()} - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols(y=y, x=x) - - hyp = '1*a+1*b=0' - result = model.f_test(hyp) - - hyp = ['1*a=0', - '1*b=0'] - result = model.f_test(hyp) - assert_almost_equal(result['f-stat'], model.f_stat['f-stat']) - - def test_plm_exclude_dummy_corner(self): - y = tm.makeTimeDataFrame() - x = {'a': tm.makeTimeDataFrame(), - 'b': tm.makeTimeDataFrame()} - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols( - y=y, x=x, entity_effects=True, dropped_dummies={'entity': 'D'}) - model.summary - - def f(): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - ols(y=y, x=x, entity_effects=True, - dropped_dummies={'entity': 'E'}) - self.assertRaises(Exception, f) - - def test_columns_tuples_summary(self): - # #1837 - X = DataFrame(np.random.randn(10, 2), columns=[('a', 'b'), ('c', 'd')]) - Y = Series(np.random.randn(10)) - - # it works! 
- with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - model = ols(y=Y, x=X) - model.summary - - -class TestPanelOLS(BaseTest): - - _multiprocess_can_split_ = True - - FIELDS = ['beta', 'df', 'df_model', 'df_resid', 'f_stat', - 'p_value', 'r2', 'r2_adj', 'rmse', 'std_err', - 't_stat', 'var_beta'] - - _other_fields = ['resid', 'y_fitted'] - - def testFiltering(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = ols(y=self.panel_y2, x=self.panel_x2) - - x = result._x - index = x.index.get_level_values(0) - index = Index(sorted(set(index))) - exp_index = Index([datetime(2000, 1, 1), datetime(2000, 1, 3)]) - self.assert_index_equal(exp_index, index) - - index = x.index.get_level_values(1) - index = Index(sorted(set(index))) - exp_index = Index(['A', 'B']) - self.assert_index_equal(exp_index, index) - - x = result._x_filtered - index = x.index.get_level_values(0) - index = Index(sorted(set(index))) - exp_index = Index([datetime(2000, 1, 1), - datetime(2000, 1, 3), - datetime(2000, 1, 4)]) - self.assert_index_equal(exp_index, index) - - # .flat is flatiter instance - assert_almost_equal(result._y.values.flat, [1, 4, 5], - check_dtype=False) - - exp_x = np.array([[6, 14, 1], [9, 17, 1], - [30, 48, 1]], dtype=np.float64) - assert_almost_equal(exp_x, result._x.values) - - exp_x_filtered = np.array([[6, 14, 1], [9, 17, 1], [30, 48, 1], - [11, 20, 1], [12, 21, 1]], dtype=np.float64) - assert_almost_equal(exp_x_filtered, result._x_filtered.values) - - self.assert_index_equal(result._x_filtered.index.levels[0], - result.y_fitted.index) - - def test_wls_panel(self): - y = tm.makeTimeDataFrame() - x = Panel({'x1': tm.makeTimeDataFrame(), - 'x2': tm.makeTimeDataFrame()}) - - y.ix[[1, 7], 'A'] = np.nan - y.ix[[6, 15], 'B'] = np.nan - y.ix[[3, 20], 'C'] = np.nan - y.ix[[5, 11], 'D'] = np.nan - - stack_y = y.stack() - stack_x = DataFrame(dict((k, v.stack()) - for k, v in x.iteritems())) - - weights = x.std('items') - stack_weights = weights.stack() - - stack_y.index = stack_y.index._tuple_index - stack_x.index = stack_x.index._tuple_index - stack_weights.index = stack_weights.index._tuple_index - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = ols(y=y, x=x, weights=1 / weights) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - expected = ols(y=stack_y, x=stack_x, weights=1 / stack_weights) - - assert_almost_equal(result.beta, expected.beta) - - for attr in ['resid', 'y_fitted']: - rvals = getattr(result, attr).stack().values - evals = getattr(expected, attr).values - assert_almost_equal(rvals, evals) - - def testWithTimeEffects(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = ols(y=self.panel_y2, x=self.panel_x2, time_effects=True) - - # .flat is flatiter instance - assert_almost_equal(result._y_trans.values.flat, [0, -0.5, 0.5], - check_dtype=False) - - exp_x = np.array([[0, 0], [-10.5, -15.5], [10.5, 15.5]]) - assert_almost_equal(result._x_trans.values, exp_x) - - # _check_non_raw_results(result) - - def testWithEntityEffects(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = ols(y=self.panel_y2, x=self.panel_x2, entity_effects=True) - - # .flat is flatiter instance - assert_almost_equal(result._y.values.flat, [1, 4, 5], - check_dtype=False) - - exp_x = DataFrame([[0., 6., 14., 1.], [0, 9, 17, 1], [1, 30, 48, 1]], - index=result._x.index, columns=['FE_B', 'x1', 'x2', - 'intercept'], - dtype=float) - 
tm.assert_frame_equal(result._x, exp_x.ix[:, result._x.columns]) - # _check_non_raw_results(result) - - def testWithEntityEffectsAndDroppedDummies(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = ols(y=self.panel_y2, x=self.panel_x2, entity_effects=True, - dropped_dummies={'entity': 'B'}) - - # .flat is flatiter instance - assert_almost_equal(result._y.values.flat, [1, 4, 5], - check_dtype=False) - exp_x = DataFrame([[1., 6., 14., 1.], [1, 9, 17, 1], [0, 30, 48, 1]], - index=result._x.index, columns=['FE_A', 'x1', 'x2', - 'intercept'], - dtype=float) - tm.assert_frame_equal(result._x, exp_x.ix[:, result._x.columns]) - # _check_non_raw_results(result) - - def testWithXEffects(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = ols(y=self.panel_y2, x=self.panel_x2, x_effects=['x1']) - - # .flat is flatiter instance - assert_almost_equal(result._y.values.flat, [1, 4, 5], - check_dtype=False) - - res = result._x - exp_x = DataFrame([[0., 0., 14., 1.], [0, 1, 17, 1], [1, 0, 48, 1]], - columns=['x1_30', 'x1_9', 'x2', 'intercept'], - index=res.index, dtype=float) - exp_x[['x1_30', 'x1_9']] = exp_x[['x1_30', 'x1_9']].astype(np.uint8) - assert_frame_equal(res, exp_x.reindex(columns=res.columns)) - - def testWithXEffectsAndDroppedDummies(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = ols(y=self.panel_y2, x=self.panel_x2, x_effects=['x1'], - dropped_dummies={'x1': 30}) - - res = result._x - # .flat is flatiter instance - assert_almost_equal(result._y.values.flat, [1, 4, 5], - check_dtype=False) - exp_x = DataFrame([[1., 0., 14., 1.], [0, 1, 17, 1], [0, 0, 48, 1]], - columns=['x1_6', 'x1_9', 'x2', 'intercept'], - index=res.index, dtype=float) - exp_x[['x1_6', 'x1_9']] = exp_x[['x1_6', 'x1_9']].astype(np.uint8) - - assert_frame_equal(res, exp_x.reindex(columns=res.columns)) - - def testWithXEffectsAndConversion(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = ols(y=self.panel_y3, x=self.panel_x3, - x_effects=['x1', 'x2']) - - # .flat is flatiter instance - assert_almost_equal(result._y.values.flat, [1, 2, 3, 4], - check_dtype=False) - exp_x = np.array([[0, 0, 0, 1, 1], [1, 0, 0, 0, 1], [0, 1, 1, 0, 1], - [0, 0, 0, 1, 1]], dtype=np.float64) - assert_almost_equal(result._x.values, exp_x) - - exp_index = Index(['x1_B', 'x1_C', 'x2_baz', 'x2_foo', 'intercept']) - self.assert_index_equal(exp_index, result._x.columns) - - # _check_non_raw_results(result) - - def testWithXEffectsAndConversionAndDroppedDummies(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = ols(y=self.panel_y3, x=self.panel_x3, x_effects=['x1', 'x2'], - dropped_dummies={'x2': 'foo'}) - # .flat is flatiter instance - assert_almost_equal(result._y.values.flat, [1, 2, 3, 4], - check_dtype=False) - exp_x = np.array([[0, 0, 0, 0, 1], [1, 0, 1, 0, 1], [0, 1, 0, 1, 1], - [0, 0, 0, 0, 1]], dtype=np.float64) - assert_almost_equal(result._x.values, exp_x) - - exp_index = Index(['x1_B', 'x1_C', 'x2_bar', 'x2_baz', 'intercept']) - self.assert_index_equal(exp_index, result._x.columns) - - # _check_non_raw_results(result) - - def testForSeries(self): - self.checkForSeries(self.series_panel_x, self.series_panel_y, - self.series_x, self.series_y) - - self.checkForSeries(self.series_panel_x, self.series_panel_y, - self.series_x, self.series_y, nw_lags=0) - - self.checkForSeries(self.series_panel_x, self.series_panel_y, - self.series_x, 
self.series_y, nw_lags=1, - nw_overlap=True) - - def testRolling(self): - self.checkMovingOLS(self.panel_x, self.panel_y) - - def testRollingWithFixedEffects(self): - self.checkMovingOLS(self.panel_x, self.panel_y, - entity_effects=True) - self.checkMovingOLS(self.panel_x, self.panel_y, intercept=False, - entity_effects=True) - - def testRollingWithTimeEffects(self): - self.checkMovingOLS(self.panel_x, self.panel_y, - time_effects=True) - - def testRollingWithNeweyWest(self): - self.checkMovingOLS(self.panel_x, self.panel_y, - nw_lags=1) - - def testRollingWithEntityCluster(self): - self.checkMovingOLS(self.panel_x, self.panel_y, - cluster='entity') - - def testUnknownClusterRaisesValueError(self): - assertRaisesRegexp(ValueError, "Unrecognized cluster.*ridiculous", - self.checkMovingOLS, self.panel_x, self.panel_y, - cluster='ridiculous') - - def testRollingWithTimeEffectsAndEntityCluster(self): - self.checkMovingOLS(self.panel_x, self.panel_y, - time_effects=True, cluster='entity') - - def testRollingWithTimeCluster(self): - self.checkMovingOLS(self.panel_x, self.panel_y, - cluster='time') - - def testRollingWithNeweyWestAndEntityCluster(self): - self.assertRaises(ValueError, self.checkMovingOLS, - self.panel_x, self.panel_y, - nw_lags=1, cluster='entity') - - def testRollingWithNeweyWestAndTimeEffectsAndEntityCluster(self): - self.assertRaises(ValueError, - self.checkMovingOLS, self.panel_x, self.panel_y, - nw_lags=1, cluster='entity', - time_effects=True) - - def testExpanding(self): - self.checkMovingOLS( - self.panel_x, self.panel_y, window_type='expanding') - - def testNonPooled(self): - self.checkNonPooled(y=self.panel_y, x=self.panel_x) - self.checkNonPooled(y=self.panel_y, x=self.panel_x, - window_type='rolling', window=25, min_periods=10) - - def testUnknownWindowType(self): - assertRaisesRegexp(ValueError, "window.*ridiculous", - self.checkNonPooled, y=self.panel_y, x=self.panel_x, - window_type='ridiculous', window=25, min_periods=10) - - def checkNonPooled(self, x, y, **kwds): - # For now, just check that it doesn't crash - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = ols(y=y, x=x, pool=False, **kwds) - - _check_repr(result) - for attr in NonPooledPanelOLS.ATTRIBUTES: - _check_repr(getattr(result, attr)) - - def checkMovingOLS(self, x, y, window_type='rolling', **kwds): - window = 25 # must be larger than rank of x - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - moving = ols(y=y, x=x, window_type=window_type, - window=window, **kwds) - - index = moving._index - - for n, i in enumerate(moving._valid_indices): - if window_type == 'rolling' and i >= window: - prior_date = index[i - window + 1] - else: - prior_date = index[0] - - date = index[i] - - x_iter = {} - for k, v in compat.iteritems(x): - x_iter[k] = v.truncate(before=prior_date, after=date) - y_iter = y.truncate(before=prior_date, after=date) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - static = ols(y=y_iter, x=x_iter, **kwds) - - self.compare(static, moving, event_index=i, - result_index=n) - - _check_non_raw_results(moving) - - def checkForSeries(self, x, y, series_x, series_y, **kwds): - # Consistency check with simple OLS. 
- with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = ols(y=y, x=x, **kwds) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - reference = ols(y=series_y, x=series_x, **kwds) - - self.compare(reference, result) - - def compare(self, static, moving, event_index=None, - result_index=None): - - # Check resid if we have a time index specified - if event_index is not None: - staticSlice = _period_slice(static, -1) - movingSlice = _period_slice(moving, event_index) - - ref = static._resid_raw[staticSlice] - res = moving._resid_raw[movingSlice] - - assert_almost_equal(ref, res) - - ref = static._y_fitted_raw[staticSlice] - res = moving._y_fitted_raw[movingSlice] - - assert_almost_equal(ref, res) - - # Check y_fitted - - for field in self.FIELDS: - attr = '_%s_raw' % field - - ref = getattr(static, attr) - res = getattr(moving, attr) - - if result_index is not None: - res = res[result_index] - - assert_almost_equal(ref, res) - - def test_auto_rolling_window_type(self): - data = tm.makeTimeDataFrame() - y = data.pop('A') - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - window_model = ols(y=y, x=data, window=20, min_periods=10) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - rolling_model = ols(y=y, x=data, window=20, min_periods=10, - window_type='rolling') - - assert_frame_equal(window_model.beta, rolling_model.beta) - - def test_group_agg(self): - from pandas.stats.plm import _group_agg - - values = np.ones((10, 2)) * np.arange(10).reshape((10, 1)) - bounds = np.arange(5) * 2 - f = lambda x: x.mean(axis=0) - - agged = _group_agg(values, bounds, f) - - assert(agged[1][0] == 2.5) - assert(agged[2][0] == 4.5) - - # test a function that doesn't aggregate - f2 = lambda x: np.zeros((2, 2)) - self.assertRaises(Exception, _group_agg, values, bounds, f2) - - -def _check_non_raw_results(model): - _check_repr(model) - _check_repr(model.resid) - _check_repr(model.summary_as_matrix) - _check_repr(model.y_fitted) - _check_repr(model.y_predict) - - -def _period_slice(panelModel, i): - index = panelModel._x_trans.index - period = index.levels[0][i] - - L, R = index.get_major_bounds(period, period) - - return slice(L, R) - - -class TestOLSFilter(tm.TestCase): - - _multiprocess_can_split_ = True - - def setUp(self): - date_index = date_range(datetime(2009, 12, 11), periods=3, - freq=offsets.BDay()) - ts = Series([3, 1, 4], index=date_index) - self.TS1 = ts - - date_index = date_range(datetime(2009, 12, 11), periods=5, - freq=offsets.BDay()) - ts = Series([1, 5, 9, 2, 6], index=date_index) - self.TS2 = ts - - date_index = date_range(datetime(2009, 12, 11), periods=3, - freq=offsets.BDay()) - ts = Series([5, np.nan, 3], index=date_index) - self.TS3 = ts - - date_index = date_range(datetime(2009, 12, 11), periods=5, - freq=offsets.BDay()) - ts = Series([np.nan, 5, 8, 9, 7], index=date_index) - self.TS4 = ts - - data = {'x1': self.TS2, 'x2': self.TS4} - self.DF1 = DataFrame(data=data) - - data = {'x1': self.TS2, 'x2': self.TS4} - self.DICT1 = data - - def testFilterWithSeriesRHS(self): - (lhs, rhs, weights, rhs_pre, - index, valid) = _filter_data(self.TS1, {'x1': self.TS2}, None) - self.tsAssertEqual(self.TS1.astype(np.float64), lhs, check_names=False) - self.tsAssertEqual(self.TS2[:3].astype(np.float64), rhs['x1'], - check_names=False) - self.tsAssertEqual(self.TS2.astype(np.float64), rhs_pre['x1'], - check_names=False) - - def testFilterWithSeriesRHS2(self): - (lhs, rhs, weights, rhs_pre, - index, valid) 
= _filter_data(self.TS2, {'x1': self.TS1}, None) - self.tsAssertEqual(self.TS2[:3].astype(np.float64), lhs, - check_names=False) - self.tsAssertEqual(self.TS1.astype(np.float64), rhs['x1'], - check_names=False) - self.tsAssertEqual(self.TS1.astype(np.float64), rhs_pre['x1'], - check_names=False) - - def testFilterWithSeriesRHS3(self): - (lhs, rhs, weights, rhs_pre, - index, valid) = _filter_data(self.TS3, {'x1': self.TS4}, None) - exp_lhs = self.TS3[2:3] - exp_rhs = self.TS4[2:3] - exp_rhs_pre = self.TS4[1:] - self.tsAssertEqual(exp_lhs, lhs, check_names=False) - self.tsAssertEqual(exp_rhs, rhs['x1'], check_names=False) - self.tsAssertEqual(exp_rhs_pre, rhs_pre['x1'], check_names=False) - - def testFilterWithDataFrameRHS(self): - (lhs, rhs, weights, rhs_pre, - index, valid) = _filter_data(self.TS1, self.DF1, None) - exp_lhs = self.TS1[1:].astype(np.float64) - exp_rhs1 = self.TS2[1:3] - exp_rhs2 = self.TS4[1:3].astype(np.float64) - self.tsAssertEqual(exp_lhs, lhs, check_names=False) - self.tsAssertEqual(exp_rhs1, rhs['x1'], check_names=False) - self.tsAssertEqual(exp_rhs2, rhs['x2'], check_names=False) - - def testFilterWithDictRHS(self): - (lhs, rhs, weights, rhs_pre, - index, valid) = _filter_data(self.TS1, self.DICT1, None) - exp_lhs = self.TS1[1:].astype(np.float64) - exp_rhs1 = self.TS2[1:3].astype(np.float64) - exp_rhs2 = self.TS4[1:3].astype(np.float64) - self.tsAssertEqual(exp_lhs, lhs, check_names=False) - self.tsAssertEqual(exp_rhs1, rhs['x1'], check_names=False) - self.tsAssertEqual(exp_rhs2, rhs['x2'], check_names=False) - - def tsAssertEqual(self, ts1, ts2, **kwargs): - self.assert_series_equal(ts1, ts2, **kwargs) - - -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/stats/tests/test_var.py b/pandas/stats/tests/test_var.py deleted file mode 100644 index 9f2c95a2d3d5c..0000000000000 --- a/pandas/stats/tests/test_var.py +++ /dev/null @@ -1,99 +0,0 @@ -# flake8: noqa - -from __future__ import print_function - -import pandas.util.testing as tm - -from pandas.compat import range -import nose -import unittest - -raise nose.SkipTest('skipping this for now') - -try: - import statsmodels.tsa.var as sm_var - import statsmodels as sm -except ImportError: - import scikits.statsmodels.tsa.var as sm_var - import scikits.statsmodels as sm - - -import pandas.stats.var as _pvar -reload(_pvar) -from pandas.stats.var import VAR - -DECIMAL_6 = 6 -DECIMAL_5 = 5 -DECIMAL_4 = 4 -DECIMAL_3 = 3 -DECIMAL_2 = 2 - - -class CheckVAR(object): - - def test_params(self): - tm.assert_almost_equal(self.res1.params, self.res2.params, DECIMAL_3) - - def test_neqs(self): - tm.assert_numpy_array_equal(self.res1.neqs, self.res2.neqs) - - def test_nobs(self): - tm.assert_numpy_array_equal(self.res1.avobs, self.res2.nobs) - - def test_df_eq(self): - tm.assert_numpy_array_equal(self.res1.df_eq, self.res2.df_eq) - - def test_rmse(self): - results = self.res1.results - for i in range(len(results)): - tm.assert_almost_equal(results[i].mse_resid ** .5, - eval('self.res2.rmse_' + str(i + 1)), - DECIMAL_6) - - def test_rsquared(self): - results = self.res1.results - for i in range(len(results)): - tm.assert_almost_equal(results[i].rsquared, - eval('self.res2.rsquared_' + str(i + 1)), - DECIMAL_3) - - def test_llf(self): - results = self.res1.results - tm.assert_almost_equal(self.res1.llf, self.res2.llf, DECIMAL_2) - for i in range(len(results)): - tm.assert_almost_equal(results[i].llf, - eval('self.res2.llf_' + str(i + 1)), - 
DECIMAL_2) - - def test_aic(self): - tm.assert_almost_equal(self.res1.aic, self.res2.aic) - - def test_bic(self): - tm.assert_almost_equal(self.res1.bic, self.res2.bic) - - def test_hqic(self): - tm.assert_almost_equal(self.res1.hqic, self.res2.hqic) - - def test_fpe(self): - tm.assert_almost_equal(self.res1.fpe, self.res2.fpe) - - def test_detsig(self): - tm.assert_almost_equal(self.res1.detomega, self.res2.detsig) - - def test_bse(self): - tm.assert_almost_equal(self.res1.bse, self.res2.bse, DECIMAL_4) - - -class Foo(object): - - def __init__(self): - data = sm.datasets.macrodata.load() - data = data.data[['realinv', 'realgdp', 'realcons']].view((float, 3)) - data = diff(log(data), axis=0) - self.res1 = VAR2(endog=data).fit(maxlag=2) - from results import results_var - self.res2 = results_var.MacrodataResults() - - -if __name__ == '__main__': - unittest.main() diff --git a/pandas/stats/var.py b/pandas/stats/var.py deleted file mode 100644 index db4028d60f5c8..0000000000000 --- a/pandas/stats/var.py +++ /dev/null @@ -1,605 +0,0 @@ -# flake8: noqa - -from __future__ import division - -from pandas.compat import range, lrange, zip, reduce -from pandas import compat -import numpy as np -from pandas.core.base import StringMixin -from pandas.util.decorators import cache_readonly -from pandas.core.frame import DataFrame -from pandas.core.panel import Panel -from pandas.core.series import Series -import pandas.stats.common as common -from pandas.stats.math import inv -from pandas.stats.ols import _combine_rhs - - -class VAR(StringMixin): - """ - Estimates VAR(p) regression on multivariate time series data - presented in pandas data structures. - - Parameters - ---------- - data : DataFrame or dict of Series - p : lags to include - - """ - - def __init__(self, data, p=1, intercept=True): - import warnings - warnings.warn("The pandas.stats.var module is deprecated and will be " - "removed in a future version. We refer to external packages " - "like statsmodels, see some examples here: " - "http://www.statsmodels.org/stable/vector_ar.html#var", - FutureWarning, stacklevel=4) - - try: - import statsmodels.tsa.vector_ar.api as sm_var - except ImportError: - import scikits.statsmodels.tsa.var as sm_var - - self._data = DataFrame(_combine_rhs(data)) - self._p = p - - self._columns = self._data.columns - self._index = self._data.index - - self._intercept = intercept - - @cache_readonly - def aic(self): - """Returns the Akaike information criterion.""" - return self._ic['aic'] - - @cache_readonly - def bic(self): - """Returns the Bayesian information criterion.""" - return self._ic['bic'] - - @cache_readonly - def beta(self): - """ - Returns a DataFrame, where each column x1 contains the betas - calculated by regressing the x1 column of the VAR input with - the lagged input. - - Returns - ------- - DataFrame - """ - d = dict([(key, value.beta) - for (key, value) in compat.iteritems(self.ols_results)]) - return DataFrame(d) - - def forecast(self, h): - """ - Returns a DataFrame containing the forecasts for 1, 2, ..., n time - steps. Each column x1 contains the forecasts of the x1 column. - - Parameters - ---------- - n: int - Number of time steps ahead to forecast. - - Returns - ------- - DataFrame - """ - forecast = self._forecast_raw(h)[:, 0, :] - return DataFrame(forecast, index=lrange(1, 1 + h), - columns=self._columns) - - def forecast_cov(self, h): - """ - Returns the covariance of the forecast residuals. 
- - Returns - ------- - DataFrame - """ - return [DataFrame(value, index=self._columns, columns=self._columns) - for value in self._forecast_cov_raw(h)] - - def forecast_std_err(self, h): - """ - Returns the standard errors of the forecast residuals. - - Returns - ------- - DataFrame - """ - return DataFrame(self._forecast_std_err_raw(h), - index=lrange(1, 1 + h), columns=self._columns) - - @cache_readonly - def granger_causality(self): - """Returns the f-stats and p-values from the Granger Causality Test. - - If the data consists of columns x1, x2, x3, then we perform the - following regressions: - - x1 ~ L(x2, x3) - x1 ~ L(x1, x3) - x1 ~ L(x1, x2) - - The f-stats of these results are placed in the 'x1' column of the - returned DataFrame. We then repeat for x2, x3. - - Returns - ------- - Dict, where 'f-stat' returns the DataFrame containing the f-stats, - and 'p-value' returns the DataFrame containing the corresponding - p-values of the f-stats. - """ - from pandas.stats.api import ols - from scipy.stats import f - - d = {} - for col in self._columns: - d[col] = {} - for i in range(1, 1 + self._p): - lagged_data = self._lagged_data[i].filter( - self._columns - [col]) - - for key, value in compat.iteritems(lagged_data): - d[col][_make_param_name(i, key)] = value - - f_stat_dict = {} - p_value_dict = {} - - for col, y in compat.iteritems(self._data): - ssr_full = (self.resid[col] ** 2).sum() - - f_stats = [] - p_values = [] - - for col2 in self._columns: - result = ols(y=y, x=d[col2]) - - resid = result.resid - ssr_reduced = (resid ** 2).sum() - - M = self._p - N = self._nobs - K = self._k * self._p + 1 - f_stat = ((ssr_reduced - ssr_full) / M) / (ssr_full / (N - K)) - f_stats.append(f_stat) - - p_value = f.sf(f_stat, M, N - K) - p_values.append(p_value) - - f_stat_dict[col] = Series(f_stats, self._columns) - p_value_dict[col] = Series(p_values, self._columns) - - f_stat_mat = DataFrame(f_stat_dict) - p_value_mat = DataFrame(p_value_dict) - - return { - 'f-stat': f_stat_mat, - 'p-value': p_value_mat, - } - - @cache_readonly - def ols_results(self): - """ - Returns the results of the regressions: - x_1 ~ L(X) - x_2 ~ L(X) - ... - x_k ~ L(X) - - where X = [x_1, x_2, ..., x_k] - and L(X) represents the columns of X lagged 1, 2, ..., n lags - (n is the user-provided number of lags). - - Returns - ------- - dict - """ - from pandas.stats.api import ols - - d = {} - for i in range(1, 1 + self._p): - for col, series in compat.iteritems(self._lagged_data[i]): - d[_make_param_name(i, col)] = series - - result = dict([(col, ols(y=y, x=d, intercept=self._intercept)) - for col, y in compat.iteritems(self._data)]) - - return result - - @cache_readonly - def resid(self): - """ - Returns the DataFrame containing the residuals of the VAR regressions. - Each column x1 contains the residuals generated by regressing the x1 - column of the input against the lagged input. 
- - Returns - ------- - DataFrame - """ - d = dict([(col, series.resid) - for (col, series) in compat.iteritems(self.ols_results)]) - return DataFrame(d, index=self._index) - - @cache_readonly - def summary(self): - template = """ -%(banner_top)s - -Number of Observations: %(nobs)d -AIC: %(aic).3f -BIC: %(bic).3f - -%(banner_coef)s -%(coef_table)s -%(banner_end)s -""" - params = { - 'banner_top': common.banner('Summary of VAR'), - 'banner_coef': common.banner('Summary of Estimated Coefficients'), - 'banner_end': common.banner('End of Summary'), - 'coef_table': self.beta, - 'aic': self.aic, - 'bic': self.bic, - 'nobs': self._nobs, - } - - return template % params - - @cache_readonly - def _alpha(self): - """ - Returns array where the i-th element contains the intercept - when regressing the i-th column of self._data with the lagged data. - """ - if self._intercept: - return self._beta_raw[-1] - else: - return np.zeros(self._k) - - @cache_readonly - def _beta_raw(self): - return np.array([list(self.beta[col].values()) for col in self._columns]).T - - def _trans_B(self, h): - """ - Returns 0, 1, ..., (h-1)-th power of transpose of B as defined in - equation (4) on p. 142 of the Stata 11 Time Series reference book. - """ - result = [np.eye(1 + self._k * self._p)] - - row1 = np.zeros((1, 1 + self._k * self._p)) - row1[0, 0] = 1 - - v = self._alpha.reshape((self._k, 1)) - row2 = np.hstack(tuple([v] + self._lag_betas)) - - m = self._k * (self._p - 1) - row3 = np.hstack(( - np.zeros((m, 1)), - np.eye(m), - np.zeros((m, self._k)) - )) - - trans_B = np.vstack((row1, row2, row3)).T - - result.append(trans_B) - - for i in range(2, h): - result.append(np.dot(trans_B, result[i - 1])) - - return result - - @cache_readonly - def _x(self): - values = np.array([ - list(self._lagged_data[i][col].values()) - for i in range(1, 1 + self._p) - for col in self._columns - ]).T - - x = np.hstack((np.ones((len(values), 1)), values))[self._p:] - - return x - - @cache_readonly - def _cov_beta(self): - cov_resid = self._sigma - - x = self._x - - inv_cov_x = inv(np.dot(x.T, x)) - - return np.kron(inv_cov_x, cov_resid) - - def _data_xs(self, i): - """ - Returns the cross-section of the data at the given timestep. - """ - return self._data.values[i] - - def _forecast_cov_raw(self, n): - resid = self._forecast_cov_resid_raw(n) - # beta = self._forecast_cov_beta_raw(n) - - # return [a + b for a, b in zip(resid, beta)] - # TODO: ignore the beta forecast std err until it's verified - - return resid - - def _forecast_cov_beta_raw(self, n): - """ - Returns the covariance of the beta errors for the forecast at - 1, 2, ..., n timesteps. - """ - p = self._p - - values = self._data.values - T = len(values) - self._p - 1 - - results = [] - - for h in range(1, n + 1): - psi = self._psi(h) - trans_B = self._trans_B(h) - - sum = 0 - - cov_beta = self._cov_beta - - for t in range(T + 1): - index = t + p - y = values.take(lrange(index, index - p, -1), axis=0).ravel() - trans_Z = np.hstack(([1], y)) - trans_Z = trans_Z.reshape(1, len(trans_Z)) - - sum2 = 0 - for i in range(h): - ZB = np.dot(trans_Z, trans_B[h - 1 - i]) - - prod = np.kron(ZB, psi[i]) - sum2 = sum2 + prod - - sum = sum + chain_dot(sum2, cov_beta, sum2.T) - - results.append(sum / (T + 1)) - - return results - - def _forecast_cov_resid_raw(self, h): - """ - Returns the covariance of the residual errors for the forecast at - 1, 2, ..., h timesteps. 
- """ - psi_values = self._psi(h) - sum = 0 - result = [] - for i in range(h): - psi = psi_values[i] - sum = sum + chain_dot(psi, self._sigma, psi.T) - result.append(sum) - - return result - - def _forecast_raw(self, h): - """ - Returns the forecast at 1, 2, ..., h timesteps in the future. - """ - k = self._k - result = [] - for i in range(h): - sum = self._alpha.reshape(1, k) - for j in range(self._p): - beta = self._lag_betas[j] - idx = i - j - if idx > 0: - y = result[idx - 1] - else: - y = self._data_xs(idx - 1) - - sum = sum + np.dot(beta, y.T).T - result.append(sum) - - return np.array(result) - - def _forecast_std_err_raw(self, h): - """ - Returns the standard error of the forecasts - at 1, 2, ..., n timesteps. - """ - return np.array([np.sqrt(np.diag(value)) - for value in self._forecast_cov_raw(h)]) - - @cache_readonly - def _ic(self): - """ - Returns the Akaike/Bayesian information criteria. - """ - RSS = self._rss - k = self._p * (self._k * self._p + 1) - n = self._nobs * self._k - - return {'aic': 2 * k + n * np.log(RSS / n), - 'bic': n * np.log(RSS / n) + k * np.log(n)} - - @cache_readonly - def _k(self): - return len(self._columns) - - @cache_readonly - def _lag_betas(self): - """ - Returns list of B_i, where B_i represents the (k, k) matrix - with the j-th row containing the betas of regressing the j-th - column of self._data with self._data lagged i time steps. - First element is B_1, second element is B_2, etc. - """ - k = self._k - b = self._beta_raw - return [b[k * i: k * (i + 1)].T for i in range(self._p)] - - @cache_readonly - def _lagged_data(self): - return dict([(i, self._data.shift(i)) - for i in range(1, 1 + self._p)]) - - @cache_readonly - def _nobs(self): - return len(self._data) - self._p - - def _psi(self, h): - """ - psi value used for calculating standard error. - - Returns [psi_0, psi_1, ..., psi_(h - 1)] - """ - k = self._k - result = [np.eye(k)] - for i in range(1, h): - result.append(sum( - [np.dot(result[i - j], self._lag_betas[j - 1]) - for j in range(1, 1 + i) - if j <= self._p])) - - return result - - @cache_readonly - def _resid_raw(self): - resid = np.array([self.ols_results[col]._resid_raw - for col in self._columns]) - return resid - - @cache_readonly - def _rss(self): - """Returns the sum of the squares of the residuals.""" - return (self._resid_raw ** 2).sum() - - @cache_readonly - def _sigma(self): - """Returns covariance of resids.""" - k = self._k - n = self._nobs - - resid = self._resid_raw - - return np.dot(resid, resid.T) / (n - k) - - def __unicode__(self): - return self.summary - - -def lag_select(data, max_lags=5, ic=None): - """ - Select number of lags based on a variety of information criteria - - Parameters - ---------- - data : DataFrame-like - max_lags : int - Maximum number of lags to evaluate - ic : {None, 'aic', 'bic', ...} - Choosing None will just display the results - - Returns - ------- - None - """ - pass - - -class PanelVAR(VAR): - """ - Performs Vector Autoregression on panel data. 
- - Parameters - ---------- - data: Panel or dict of DataFrame - lags: int - """ - - def __init__(self, data, lags, intercept=True): - self._data = _prep_panel_data(data) - self._p = lags - self._intercept = intercept - - self._columns = self._data.items - - @cache_readonly - def _nobs(self): - """Returns the number of observations.""" - _, timesteps, entities = self._data.values.shape - return (timesteps - self._p) * entities - - @cache_readonly - def _rss(self): - """Returns the sum of the squares of the residuals.""" - return (self.resid.values ** 2).sum() - - def forecast(self, h): - """ - Returns the forecasts at 1, 2, ..., n timesteps in the future. - """ - forecast = self._forecast_raw(h).T.swapaxes(1, 2) - index = lrange(1, 1 + h) - w = Panel(forecast, items=self._data.items, major_axis=index, - minor_axis=self._data.minor_axis) - return w - - @cache_readonly - def resid(self): - """ - Returns the DataFrame containing the residuals of the VAR regressions. - Each column x1 contains the residuals generated by regressing the x1 - column of the input against the lagged input. - - Returns - ------- - DataFrame - """ - d = dict([(key, value.resid) - for (key, value) in compat.iteritems(self.ols_results)]) - return Panel.fromDict(d) - - def _data_xs(self, i): - return self._data.values[:, i, :].T - - @cache_readonly - def _sigma(self): - """Returns covariance of resids.""" - k = self._k - resid = _drop_incomplete_rows(self.resid.toLong().values) - n = len(resid) - return np.dot(resid.T, resid) / (n - k) - - -def _prep_panel_data(data): - """Converts the given data into a Panel.""" - if isinstance(data, Panel): - return data - - return Panel.fromDict(data) - - -def _drop_incomplete_rows(array): - mask = np.isfinite(array).all(1) - indices = np.arange(len(array))[mask] - return array.take(indices, 0) - - -def _make_param_name(lag, name): - return 'L%d.%s' % (lag, name) - - -def chain_dot(*matrices): - """ - Returns the dot product of the given matrices. - - Parameters - ---------- - matrices: argument list of ndarray - """ - return reduce(lambda x, y: np.dot(y, x), matrices[::-1]) diff --git a/pandas/testing.py b/pandas/testing.py new file mode 100644 index 0000000000000..3baf99957cb33 --- /dev/null +++ b/pandas/testing.py @@ -0,0 +1,8 @@ +# flake8: noqa + +""" +Public testing utility functions. 
+""" + +from pandas.util.testing import ( + assert_frame_equal, assert_series_equal, assert_index_equal) diff --git a/pandas/sparse/tests/__init__.py b/pandas/tests/api/__init__.py similarity index 100% rename from pandas/sparse/tests/__init__.py rename to pandas/tests/api/__init__.py diff --git a/pandas/tests/api/test_api.py b/pandas/tests/api/test_api.py new file mode 100644 index 0000000000000..b1652cf6eb6db --- /dev/null +++ b/pandas/tests/api/test_api.py @@ -0,0 +1,233 @@ +# -*- coding: utf-8 -*- + +from warnings import catch_warnings + +import pytest +import pandas as pd +from pandas import api +from pandas.util import testing as tm + + +class Base(object): + + def check(self, namespace, expected, ignored=None): + # see which names are in the namespace, minus optional + # ignored ones + # compare vs the expected + + result = sorted([f for f in dir(namespace) if not f.startswith('_')]) + if ignored is not None: + result = sorted(list(set(result) - set(ignored))) + + expected = sorted(expected) + tm.assert_almost_equal(result, expected) + + +class TestPDApi(Base): + + # these are optionally imported based on testing + # & need to be ignored + ignored = ['tests', 'locale', 'conftest'] + + # top-level sub-packages + lib = ['api', 'compat', 'core', 'errors', 'pandas', + 'plotting', 'test', 'testing', 'tools', 'tseries', + 'util', 'options', 'io'] + + # these are already deprecated; awaiting removal + deprecated_modules = ['stats', 'datetools', 'parser', + 'json', 'lib', 'tslib'] + + # misc + misc = ['IndexSlice', 'NaT'] + + # top-level classes + classes = ['Categorical', 'CategoricalIndex', 'DataFrame', 'DateOffset', + 'DatetimeIndex', 'ExcelFile', 'ExcelWriter', 'Float64Index', + 'Grouper', 'HDFStore', 'Index', 'Int64Index', 'MultiIndex', + 'Period', 'PeriodIndex', 'RangeIndex', 'UInt64Index', + 'Series', 'SparseArray', 'SparseDataFrame', + 'SparseSeries', 'TimeGrouper', 'Timedelta', + 'TimedeltaIndex', 'Timestamp', 'Interval', 'IntervalIndex'] + + # these are already deprecated; awaiting removal + deprecated_classes = ['WidePanel', 'Panel4D', + 'SparseList', 'Expr', 'Term'] + + # these should be deprecated in the future + deprecated_classes_in_future = ['Panel'] + + # external modules exposed in pandas namespace + modules = ['np', 'datetime'] + + # top-level functions + funcs = ['bdate_range', 'concat', 'crosstab', 'cut', + 'date_range', 'interval_range', 'eval', + 'factorize', 'get_dummies', + 'infer_freq', 'isnull', 'lreshape', + 'melt', 'notnull', 'offsets', + 'merge', 'merge_ordered', 'merge_asof', + 'period_range', + 'pivot', 'pivot_table', 'qcut', + 'show_versions', 'timedelta_range', 'unique', + 'value_counts', 'wide_to_long'] + + # top-level option funcs + funcs_option = ['reset_option', 'describe_option', 'get_option', + 'option_context', 'set_option', + 'set_eng_float_format'] + + # top-level read_* funcs + funcs_read = ['read_clipboard', 'read_csv', 'read_excel', 'read_fwf', + 'read_gbq', 'read_hdf', 'read_html', 'read_json', + 'read_msgpack', 'read_pickle', 'read_sas', 'read_sql', + 'read_sql_query', 'read_sql_table', 'read_stata', + 'read_table', 'read_feather'] + + # top-level to_* funcs + funcs_to = ['to_datetime', 'to_msgpack', + 'to_numeric', 'to_pickle', 'to_timedelta'] + + # these are already deprecated; awaiting removal + deprecated_funcs = ['ewma', 'ewmcorr', 'ewmcov', 'ewmstd', 'ewmvar', + 'ewmvol', 'expanding_apply', 'expanding_corr', + 'expanding_count', 'expanding_cov', 'expanding_kurt', + 'expanding_max', 'expanding_mean', 'expanding_median', + 'expanding_min', 
'expanding_quantile', + 'expanding_skew', 'expanding_std', 'expanding_sum', + 'expanding_var', 'rolling_apply', + 'rolling_corr', 'rolling_count', 'rolling_cov', + 'rolling_kurt', 'rolling_max', 'rolling_mean', + 'rolling_median', 'rolling_min', 'rolling_quantile', + 'rolling_skew', 'rolling_std', 'rolling_sum', + 'rolling_var', 'rolling_window', 'ordered_merge', + 'pnow', 'match', 'groupby', 'get_store', + 'plot_params', 'scatter_matrix'] + + def test_api(self): + + self.check(pd, + self.lib + self.misc + + self.modules + self.deprecated_modules + + self.classes + self.deprecated_classes + + self.deprecated_classes_in_future + + self.funcs + self.funcs_option + + self.funcs_read + self.funcs_to + + self.deprecated_funcs, + self.ignored) + + +class TestApi(Base): + + allowed = ['types'] + + def test_api(self): + + self.check(api, self.allowed) + + +class TestTesting(Base): + + funcs = ['assert_frame_equal', 'assert_series_equal', + 'assert_index_equal'] + + def test_testing(self): + + from pandas import testing + self.check(testing, self.funcs) + + +class TestDatetoolsDeprecation(object): + + def test_deprecation_access_func(self): + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + pd.datetools.to_datetime('2016-01-01') + + def test_deprecation_access_obj(self): + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + pd.datetools.monthEnd + + +class TestTopLevelDeprecations(object): + + # top-level API deprecations + # GH 13790 + + def test_pnow(self): + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + pd.pnow(freq='M') + + def test_term(self): + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + pd.Term('index>=date') + + def test_expr(self): + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + pd.Expr('2>1') + + def test_match(self): + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + pd.match([1, 2, 3], [1]) + + def test_groupby(self): + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + pd.groupby(pd.Series([1, 2, 3]), [1, 1, 1]) + + # GH 15940 + + def test_get_store(self): + pytest.importorskip('tables') + with tm.ensure_clean() as path: + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + s = pd.get_store(path) + s.close() + + +class TestJson(object): + + def test_deprecation_access_func(self): + with catch_warnings(record=True): + pd.json.dumps([]) + + +class TestParser(object): + + def test_deprecation_access_func(self): + with catch_warnings(record=True): + pd.parser.na_values + + +class TestLib(object): + + def test_deprecation_access_func(self): + with catch_warnings(record=True): + pd.lib.infer_dtype('foo') + + +class TestTSLib(object): + + def test_deprecation_access_func(self): + with catch_warnings(record=True): + pd.tslib.Timestamp('20160101') + + +class TestTypes(object): + + def test_deprecation_access_func(self): + with tm.assert_produces_warning( + FutureWarning, check_stacklevel=False): + from pandas.types.concat import union_categoricals + c1 = pd.Categorical(list('aabc')) + c2 = pd.Categorical(list('abcd')) + union_categoricals( + [c1, c2], + sort_categories=True, + ignore_order=True) diff --git a/pandas/tests/api/test_types.py b/pandas/tests/api/test_types.py new file mode 100644 index 0000000000000..1cbcf3f9109a4 --- /dev/null +++ b/pandas/tests/api/test_types.py @@ -0,0 +1,103 @@ +# -*- coding: utf-8 -*- + +import pytest + +from warnings import catch_warnings 
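The new `test_types.py` starting here pins down the public `pandas.api.types` namespace that replaces the old `pandas.core.common` introspectors (the `allowed` list below). A quick sketch of typical usage of these public helpers (standard API calls, illustrative inputs):

```python
import pandas as pd
from pandas.api import types

s = pd.Series([1, 2, 3], dtype='int64')

assert types.is_integer_dtype(s)        # dtype introspection
assert not types.is_float_dtype(s)
assert types.is_list_like([1, 2, 3])    # duck-typing helpers
assert types.is_scalar(3.14)
assert types.pandas_dtype('int64') == s.dtype   # dtype construction
```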
+import numpy as np + +import pandas +from pandas.core import common as com +from pandas.api import types +from pandas.util import testing as tm + +from .test_api import Base + + +class TestTypes(Base): + + allowed = ['is_bool', 'is_bool_dtype', + 'is_categorical', 'is_categorical_dtype', 'is_complex', + 'is_complex_dtype', 'is_datetime64_any_dtype', + 'is_datetime64_dtype', 'is_datetime64_ns_dtype', + 'is_datetime64tz_dtype', 'is_datetimetz', 'is_dtype_equal', + 'is_extension_type', 'is_float', 'is_float_dtype', + 'is_int64_dtype', 'is_integer', + 'is_integer_dtype', 'is_number', 'is_numeric_dtype', + 'is_object_dtype', 'is_scalar', 'is_sparse', + 'is_string_dtype', 'is_signed_integer_dtype', + 'is_timedelta64_dtype', 'is_timedelta64_ns_dtype', + 'is_unsigned_integer_dtype', 'is_period', + 'is_period_dtype', 'is_interval', 'is_interval_dtype', + 'is_re', 'is_re_compilable', + 'is_dict_like', 'is_iterator', 'is_file_like', + 'is_list_like', 'is_hashable', + 'is_named_tuple', + 'pandas_dtype', 'union_categoricals', 'infer_dtype'] + deprecated = ['is_any_int_dtype', 'is_floating_dtype', 'is_sequence'] + dtypes = ['CategoricalDtype', 'DatetimeTZDtype', + 'PeriodDtype', 'IntervalDtype'] + + def test_types(self): + + self.check(types, self.allowed + self.dtypes + self.deprecated) + + def check_deprecation(self, fold, fnew): + with tm.assert_produces_warning(DeprecationWarning): + try: + result = fold('foo') + expected = fnew('foo') + assert result == expected + except TypeError: + pytest.raises(TypeError, lambda: fnew('foo')) + except AttributeError: + pytest.raises(AttributeError, lambda: fnew('foo')) + + def test_deprecation_core_common(self): + + # test that we are in fact deprecating + # the pandas.core.common introspectors + for t in self.allowed: + self.check_deprecation(getattr(com, t), getattr(types, t)) + + def test_deprecation_core_common_array_equivalent(self): + + with tm.assert_produces_warning(DeprecationWarning): + com.array_equivalent(np.array([1, 2]), np.array([1, 2])) + + def test_deprecation_core_common_moved(self): + + # these are in pandas.core.dtypes.common + l = ['is_datetime_arraylike', + 'is_datetime_or_timedelta_dtype', + 'is_datetimelike', + 'is_datetimelike_v_numeric', + 'is_datetimelike_v_object', + 'is_datetimetz', + 'is_int_or_datetime_dtype', + 'is_period_arraylike', + 'is_string_like', + 'is_string_like_dtype'] + + from pandas.core.dtypes import common as c + for t in l: + self.check_deprecation(getattr(com, t), getattr(c, t)) + + def test_removed_from_core_common(self): + + for t in ['is_null_datelike_scalar', + 'ensure_float']: + pytest.raises(AttributeError, lambda: getattr(com, t)) + + def test_deprecated_from_api_types(self): + + for t in self.deprecated: + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + getattr(types, t)(1) + + +def test_moved_infer_dtype(): + + with catch_warnings(record=True): + e = pandas.lib.infer_dtype('foo') + assert e is not None diff --git a/pandas/stats/tests/__init__.py b/pandas/tests/computation/__init__.py similarity index 100% rename from pandas/stats/tests/__init__.py rename to pandas/tests/computation/__init__.py diff --git a/pandas/tests/computation/test_compat.py b/pandas/tests/computation/test_compat.py new file mode 100644 index 0000000000000..ed569625177d3 --- /dev/null +++ b/pandas/tests/computation/test_compat.py @@ -0,0 +1,46 @@ +import pytest +from distutils.version import LooseVersion + +import pandas as pd + +from pandas.core.computation.engines import _engines +import 
pandas.core.computation.expr as expr +from pandas.core.computation import _MIN_NUMEXPR_VERSION + + +def test_compat(): + # test we have compat with our version of numexpr + + from pandas.core.computation import _NUMEXPR_INSTALLED + try: + import numexpr as ne + ver = ne.__version__ + if ver < LooseVersion(_MIN_NUMEXPR_VERSION): + assert not _NUMEXPR_INSTALLED + else: + assert _NUMEXPR_INSTALLED + except ImportError: + pytest.skip("not testing numexpr version compat") + + +@pytest.mark.parametrize('engine', _engines) +@pytest.mark.parametrize('parser', expr._parsers) +def test_invalid_numexpr_version(engine, parser): + def testit(): + a, b = 1, 2 # noqa + res = pd.eval('a + b', engine=engine, parser=parser) + assert res == 3 + + if engine == 'numexpr': + try: + import numexpr as ne + except ImportError: + pytest.skip("no numexpr") + else: + if ne.__version__ < LooseVersion(_MIN_NUMEXPR_VERSION): + with pytest.raises(ImportError): + testit() + else: + testit() + else: + testit() diff --git a/pandas/computation/tests/test_eval.py b/pandas/tests/computation/test_eval.py similarity index 73% rename from pandas/computation/tests/test_eval.py rename to pandas/tests/computation/test_eval.py index ffa2cb0684b72..89ab4531877a4 100644 --- a/pandas/computation/tests/test_eval.py +++ b/pandas/tests/computation/test_eval.py @@ -1,45 +1,59 @@ -#!/usr/bin/env python - -# flake8: noqa - import warnings +from warnings import catch_warnings import operator from itertools import product -from distutils.version import LooseVersion -import nose -from nose.tools import assert_raises +import pytest from numpy.random import randn, rand, randint import numpy as np -from pandas.types.common import is_list_like, is_scalar +from pandas.core.dtypes.common import is_bool, is_list_like, is_scalar import pandas as pd from pandas.core import common as com +from pandas.errors import PerformanceWarning from pandas import DataFrame, Series, Panel, date_range from pandas.util.testing import makeCustomDataframe as mkdf -from pandas.computation import pytables -from pandas.computation.engines import _engines, NumExprClobberingError -from pandas.computation.expr import PythonExprVisitor, PandasExprVisitor -from pandas.computation.ops import (_binary_ops_dict, - _special_case_arith_ops_syms, - _arith_ops_syms, _bool_ops_syms, - _unary_math_ops, _binary_math_ops) - -import pandas.computation.expr as expr +from pandas.core.computation import pytables +from pandas.core.computation.engines import _engines, NumExprClobberingError +from pandas.core.computation.expr import PythonExprVisitor, PandasExprVisitor +from pandas.core.computation.expressions import ( + _USE_NUMEXPR, _NUMEXPR_INSTALLED) +from pandas.core.computation.ops import ( + _binary_ops_dict, + _special_case_arith_ops_syms, + _arith_ops_syms, _bool_ops_syms, + _unary_math_ops, _binary_math_ops) + +import pandas.core.computation.expr as expr import pandas.util.testing as tm -import pandas.lib as lib from pandas.util.testing import (assert_frame_equal, randbool, - assertRaisesRegexp, assert_numpy_array_equal, - assert_produces_warning, assert_series_equal, - slow) -from pandas.compat import PY3, u, reduce + assert_numpy_array_equal, assert_series_equal, + assert_produces_warning, slow) +from pandas.compat import PY3, reduce _series_frame_incompatible = _bool_ops_syms _scalar_skip = 'in', 'not in' +@pytest.fixture(params=( + pytest.mark.skipif(engine == 'numexpr' and not _USE_NUMEXPR, + reason='numexpr enabled->{enabled}, ' + 'installed->{installed}'.format( + enabled=_USE_NUMEXPR, 
+ installed=_NUMEXPR_INSTALLED))(engine) + for engine in _engines # noqa +)) +def engine(request): + return request.param + + +@pytest.fixture(params=expr._parsers) +def parser(request): + return request.param + + def engine_has_neg_frac(engine): return _engines[engine].has_neg_frac @@ -50,7 +64,8 @@ def _eval_single_bin(lhs, cmp1, rhs, engine): try: return c(lhs, rhs) except ValueError as e: - if str(e).startswith('negative number cannot be raised to a fractional power'): + if str(e).startswith('negative number cannot be ' + 'raised to a fractional power'): return np.nan raise return c(lhs, rhs) @@ -58,14 +73,14 @@ def _eval_single_bin(lhs, cmp1, rhs, engine): def _series_and_2d_ndarray(lhs, rhs): return ((isinstance(lhs, Series) and - isinstance(rhs, np.ndarray) and rhs.ndim > 1) - or (isinstance(rhs, Series) and - isinstance(lhs, np.ndarray) and lhs.ndim > 1)) + isinstance(rhs, np.ndarray) and rhs.ndim > 1) or + (isinstance(rhs, Series) and + isinstance(lhs, np.ndarray) and lhs.ndim > 1)) def _series_and_frame(lhs, rhs): - return ((isinstance(lhs, Series) and isinstance(rhs, DataFrame)) - or (isinstance(rhs, Series) and isinstance(lhs, DataFrame))) + return ((isinstance(lhs, Series) and isinstance(rhs, DataFrame)) or + (isinstance(rhs, Series) and isinstance(lhs, DataFrame))) def _bool_and_frame(lhs, rhs): @@ -80,11 +95,10 @@ def _is_py3_complex_incompat(result, expected): _good_arith_ops = com.difference(_arith_ops_syms, _special_case_arith_ops_syms) -class TestEvalNumexprPandas(tm.TestCase): +class TestEvalNumexprPandas(object): @classmethod - def setUpClass(cls): - super(TestEvalNumexprPandas, cls).setUpClass() + def setup_class(cls): tm.skip_if_no_ne() import numexpr as ne cls.ne = ne @@ -92,8 +106,7 @@ def setUpClass(cls): cls.parser = 'pandas' @classmethod - def tearDownClass(cls): - super(TestEvalNumexprPandas, cls).tearDownClass() + def teardown_class(cls): del cls.engine, cls.parser if hasattr(cls, 'ne'): del cls.ne @@ -122,12 +135,12 @@ def setup_ops(self): self.arith_ops = _good_arith_ops self.unary_ops = '-', '~', 'not ' - def setUp(self): + def setup_method(self, method): self.setup_ops() self.setup_data() self.current_engines = filter(lambda x: x != self.engine, _engines) - def tearDown(self): + def teardown_method(self, method): del self.lhses, self.rhses, self.scalar_rhses, self.scalar_lhses del self.pandas_rhses, self.pandas_lhses, self.current_engines @@ -194,7 +207,7 @@ def check_equal(self, result, expected): elif isinstance(result, np.ndarray): tm.assert_numpy_array_equal(result, expected) else: - self.assertEqual(result, expected) + assert result == expected def check_complex_cmp_op(self, lhs, cmp1, rhs, binop, cmp2): skip_these = _scalar_skip @@ -202,29 +215,32 @@ def check_complex_cmp_op(self, lhs, cmp1, rhs, binop, cmp2): binop=binop, cmp2=cmp2) scalar_with_in_notin = (is_scalar(rhs) and (cmp1 in skip_these or - cmp2 in skip_these)) + cmp2 in skip_these)) if scalar_with_in_notin: - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): pd.eval(ex, engine=self.engine, parser=self.parser) - self.assertRaises(TypeError, pd.eval, ex, engine=self.engine, - parser=self.parser, local_dict={'lhs': lhs, - 'rhs': rhs}) + with pytest.raises(TypeError): + pd.eval(ex, engine=self.engine, parser=self.parser, + local_dict={'lhs': lhs, 'rhs': rhs}) else: lhs_new = _eval_single_bin(lhs, cmp1, rhs, self.engine) rhs_new = _eval_single_bin(lhs, cmp2, rhs, self.engine) - if (isinstance(lhs_new, Series) and isinstance(rhs_new, DataFrame) - and binop in 
_series_frame_incompatible): + if (isinstance(lhs_new, Series) and + isinstance(rhs_new, DataFrame) and + binop in _series_frame_incompatible): pass # TODO: the code below should be added back when left and right # hand side bool ops are fixed. - + # # try: - # self.assertRaises(Exception, pd.eval, ex, - #local_dict={'lhs': lhs, 'rhs': rhs}, - # engine=self.engine, parser=self.parser) + # pytest.raises(Exception, pd.eval, ex, + # local_dict={'lhs': lhs, 'rhs': rhs}, + # engine=self.engine, parser=self.parser) # except AssertionError: - #import ipdb; ipdb.set_trace() - # raise + # import ipdb + # + # ipdb.set_trace() + # raise else: expected = _eval_single_bin( lhs_new, binop, rhs_new, self.engine) @@ -232,7 +248,6 @@ def check_complex_cmp_op(self, lhs, cmp1, rhs, binop, cmp2): self.check_equal(result, expected) def check_chained_cmp_op(self, lhs, cmp1, mid, cmp2, rhs): - skip_these = _scalar_skip def check_operands(left, right, cmp_op): return _eval_single_bin(left, cmp_op, right, self.engine) @@ -255,9 +270,9 @@ def check_operands(left, right, cmp_op): def check_simple_cmp_op(self, lhs, cmp1, rhs): ex = 'lhs {0} rhs'.format(cmp1) if cmp1 in ('in', 'not in') and not is_list_like(rhs): - self.assertRaises(TypeError, pd.eval, ex, engine=self.engine, - parser=self.parser, local_dict={'lhs': lhs, - 'rhs': rhs}) + pytest.raises(TypeError, pd.eval, ex, engine=self.engine, + parser=self.parser, local_dict={'lhs': lhs, + 'rhs': rhs}) else: expected = _eval_single_bin(lhs, cmp1, rhs, self.engine) result = pd.eval(ex, engine=self.engine, parser=self.parser) @@ -310,17 +325,18 @@ def check_floor_division(self, lhs, arith1, rhs): expected = lhs // rhs self.check_equal(res, expected) else: - self.assertRaises(TypeError, pd.eval, ex, local_dict={'lhs': lhs, - 'rhs': rhs}, - engine=self.engine, parser=self.parser) + pytest.raises(TypeError, pd.eval, ex, + local_dict={'lhs': lhs, 'rhs': rhs}, + engine=self.engine, parser=self.parser) def get_expected_pow_result(self, lhs, rhs): try: expected = _eval_single_bin(lhs, '**', rhs, self.engine) except ValueError as e: - if str(e).startswith('negative number cannot be raised to a fractional power'): + if str(e).startswith('negative number cannot be ' + 'raised to a fractional power'): if self.engine == 'python': - raise nose.SkipTest(str(e)) + pytest.skip(str(e)) else: expected = np.nan else: @@ -334,8 +350,8 @@ def check_pow(self, lhs, arith1, rhs): if (is_scalar(lhs) and is_scalar(rhs) and _is_py3_complex_incompat(result, expected)): - self.assertRaises(AssertionError, tm.assert_numpy_array_equal, - result, expected) + pytest.raises(AssertionError, tm.assert_numpy_array_equal, + result, expected) else: tm.assert_almost_equal(result, expected) @@ -366,9 +382,9 @@ def check_compound_invert_op(self, lhs, cmp1, rhs): ex = '~(lhs {0} rhs)'.format(cmp1) if is_scalar(rhs) and cmp1 in skip_these: - self.assertRaises(TypeError, pd.eval, ex, engine=self.engine, - parser=self.parser, local_dict={'lhs': lhs, - 'rhs': rhs}) + pytest.raises(TypeError, pd.eval, ex, engine=self.engine, + parser=self.parser, local_dict={'lhs': lhs, + 'rhs': rhs}) else: # compound if is_scalar(lhs) and is_scalar(rhs): @@ -398,16 +414,16 @@ def test_frame_invert(self): # float always raises lhs = DataFrame(randn(5, 2)) if self.engine == 'numexpr': - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): result = pd.eval(expr, engine=self.engine, parser=self.parser) else: - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): result = pd.eval(expr, 
engine=self.engine, parser=self.parser) # int raises on numexpr lhs = DataFrame(randint(5, size=(5, 2))) if self.engine == 'numexpr': - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): result = pd.eval(expr, engine=self.engine, parser=self.parser) else: expect = ~lhs @@ -423,10 +439,10 @@ def test_frame_invert(self): # object raises lhs = DataFrame({'b': ['a', 1, 2.0], 'c': rand(3) > 0.5}) if self.engine == 'numexpr': - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): result = pd.eval(expr, engine=self.engine, parser=self.parser) else: - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): result = pd.eval(expr, engine=self.engine, parser=self.parser) def test_series_invert(self): @@ -437,16 +453,16 @@ def test_series_invert(self): # float raises lhs = Series(randn(5)) if self.engine == 'numexpr': - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): result = pd.eval(expr, engine=self.engine, parser=self.parser) else: - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): result = pd.eval(expr, engine=self.engine, parser=self.parser) # int raises on numexpr lhs = Series(randint(5, size=5)) if self.engine == 'numexpr': - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): result = pd.eval(expr, engine=self.engine, parser=self.parser) else: expect = ~lhs @@ -466,10 +482,10 @@ def test_series_invert(self): # object lhs = Series(['a', 1, 2.0]) if self.engine == 'numexpr': - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): result = pd.eval(expr, engine=self.engine, parser=self.parser) else: - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): result = pd.eval(expr, engine=self.engine, parser=self.parser) def test_frame_negate(self): @@ -490,7 +506,7 @@ def test_frame_negate(self): # bool doesn't work with numexpr but works elsewhere lhs = DataFrame(rand(5, 2) > 0.5) if self.engine == 'numexpr': - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): result = pd.eval(expr, engine=self.engine, parser=self.parser) else: expect = -lhs @@ -515,7 +531,7 @@ def test_series_negate(self): # bool doesn't work with numexpr but works elsewhere lhs = Series(rand(5) > 0.5) if self.engine == 'numexpr': - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): result = pd.eval(expr, engine=self.engine, parser=self.parser) else: expect = -lhs @@ -528,7 +544,7 @@ def test_frame_pos(self): # float lhs = DataFrame(randn(5, 2)) if self.engine == 'python': - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): result = pd.eval(expr, engine=self.engine, parser=self.parser) else: expect = lhs @@ -538,7 +554,7 @@ def test_frame_pos(self): # int lhs = DataFrame(randint(5, size=(5, 2))) if self.engine == 'python': - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): result = pd.eval(expr, engine=self.engine, parser=self.parser) else: expect = lhs @@ -548,7 +564,7 @@ def test_frame_pos(self): # bool doesn't work with numexpr but works elsewhere lhs = DataFrame(rand(5, 2) > 0.5) if self.engine == 'python': - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): result = pd.eval(expr, engine=self.engine, parser=self.parser) else: expect = lhs @@ -561,7 +577,7 @@ def test_series_pos(self): # float lhs = Series(randn(5)) if self.engine == 'python': - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): result = pd.eval(expr, 
engine=self.engine, parser=self.parser) else: expect = lhs @@ -571,7 +587,7 @@ def test_series_pos(self): # int lhs = Series(randint(5, size=5)) if self.engine == 'python': - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): result = pd.eval(expr, engine=self.engine, parser=self.parser) else: expect = lhs @@ -581,7 +597,7 @@ def test_series_pos(self): # bool doesn't work with numexpr but works elsewhere lhs = Series(rand(5) > 0.5) if self.engine == 'python': - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): result = pd.eval(expr, engine=self.engine, parser=self.parser) else: expect = lhs @@ -589,33 +605,31 @@ def test_series_pos(self): assert_series_equal(expect, result) def test_scalar_unary(self): - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): pd.eval('~1.0', engine=self.engine, parser=self.parser) - self.assertEqual( - pd.eval('-1.0', parser=self.parser, engine=self.engine), -1.0) - self.assertEqual( - pd.eval('+1.0', parser=self.parser, engine=self.engine), +1.0) - - self.assertEqual( - pd.eval('~1', parser=self.parser, engine=self.engine), ~1) - self.assertEqual( - pd.eval('-1', parser=self.parser, engine=self.engine), -1) - self.assertEqual( - pd.eval('+1', parser=self.parser, engine=self.engine), +1) - - self.assertEqual( - pd.eval('~True', parser=self.parser, engine=self.engine), ~True) - self.assertEqual( - pd.eval('~False', parser=self.parser, engine=self.engine), ~False) - self.assertEqual( - pd.eval('-True', parser=self.parser, engine=self.engine), -True) - self.assertEqual( - pd.eval('-False', parser=self.parser, engine=self.engine), -False) - self.assertEqual( - pd.eval('+True', parser=self.parser, engine=self.engine), +True) - self.assertEqual( - pd.eval('+False', parser=self.parser, engine=self.engine), +False) + assert pd.eval('-1.0', parser=self.parser, + engine=self.engine) == -1.0 + assert pd.eval('+1.0', parser=self.parser, + engine=self.engine) == +1.0 + assert pd.eval('~1', parser=self.parser, + engine=self.engine) == ~1 + assert pd.eval('-1', parser=self.parser, + engine=self.engine) == -1 + assert pd.eval('+1', parser=self.parser, + engine=self.engine) == +1 + assert pd.eval('~True', parser=self.parser, + engine=self.engine) == ~True + assert pd.eval('~False', parser=self.parser, + engine=self.engine) == ~False + assert pd.eval('-True', parser=self.parser, + engine=self.engine) == -True + assert pd.eval('-False', parser=self.parser, + engine=self.engine) == -False + assert pd.eval('+True', parser=self.parser, + engine=self.engine) == +True + assert pd.eval('+False', parser=self.parser, + engine=self.engine) == +False def test_unary_in_array(self): # GH 11235 @@ -634,63 +648,64 @@ def test_disallow_scalar_bool_ops(self): exprs += '2 * x > 2 or 1 and 2', exprs += '2 * df > 3 and 1 or a', - x, a, b, df = np.random.randn(3), 1, 2, DataFrame(randn(3, 2)) + x, a, b, df = np.random.randn(3), 1, 2, DataFrame(randn(3, 2)) # noqa for ex in exprs: - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.eval(ex, engine=self.engine, parser=self.parser) def test_identical(self): - # GH 10546 + # see gh-10546 x = 1 result = pd.eval('x', engine=self.engine, parser=self.parser) - self.assertEqual(result, 1) - self.assertTrue(is_scalar(result)) + assert result == 1 + assert is_scalar(result) x = 1.5 result = pd.eval('x', engine=self.engine, parser=self.parser) - self.assertEqual(result, 1.5) - self.assertTrue(is_scalar(result)) + assert result == 1.5 + assert is_scalar(result) x = False result 
= pd.eval('x', engine=self.engine, parser=self.parser) - self.assertEqual(result, False) - self.assertTrue(is_scalar(result)) + assert not result + assert is_bool(result) + assert is_scalar(result) x = np.array([1]) result = pd.eval('x', engine=self.engine, parser=self.parser) tm.assert_numpy_array_equal(result, np.array([1])) - self.assertEqual(result.shape, (1, )) + assert result.shape == (1, ) x = np.array([1.5]) result = pd.eval('x', engine=self.engine, parser=self.parser) tm.assert_numpy_array_equal(result, np.array([1.5])) - self.assertEqual(result.shape, (1, )) + assert result.shape == (1, ) - x = np.array([False]) + x = np.array([False]) # noqa result = pd.eval('x', engine=self.engine, parser=self.parser) tm.assert_numpy_array_equal(result, np.array([False])) - self.assertEqual(result.shape, (1, )) + assert result.shape == (1, ) def test_line_continuation(self): # GH 11149 exp = """1 + 2 * \ 5 - 1 + 2 """ result = pd.eval(exp, engine=self.engine, parser=self.parser) - self.assertEqual(result, 12) + assert result == 12 def test_float_truncation(self): # GH 14241 exp = '1000000000.006' result = pd.eval(exp, engine=self.engine, parser=self.parser) expected = np.float64(exp) - self.assertEqual(result, expected) + assert result == expected df = pd.DataFrame({'A': [1000000000.0009, 1000000000.0011, 1000000000.0015]}) cutoff = 1000000000.0006 result = df.query("A < %.4f" % cutoff) - self.assertTrue(result.empty) + assert result.empty cutoff = 1000000000.0010 result = df.query("A > %.4f" % cutoff) @@ -703,12 +718,11 @@ def test_float_truncation(self): tm.assert_frame_equal(expected, result) - class TestEvalNumexprPython(TestEvalNumexprPandas): @classmethod - def setUpClass(cls): - super(TestEvalNumexprPython, cls).setUpClass() + def setup_class(cls): + super(TestEvalNumexprPython, cls).setup_class() tm.skip_if_no_ne() import numexpr as ne cls.ne = ne @@ -727,15 +741,15 @@ def setup_ops(self): def check_chained_cmp_op(self, lhs, cmp1, mid, cmp2, rhs): ex1 = 'lhs {0} mid {1} rhs'.format(cmp1, cmp2) - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.eval(ex1, engine=self.engine, parser=self.parser) class TestEvalPythonPython(TestEvalNumexprPython): @classmethod - def setUpClass(cls): - super(TestEvalPythonPython, cls).setUpClass() + def setup_class(cls): + super(TestEvalPythonPython, cls).setup_class() cls.engine = 'python' cls.parser = 'python' @@ -764,8 +778,8 @@ def check_alignment(self, result, nlhs, ghs, op): class TestEvalPythonPandas(TestEvalPythonPython): @classmethod - def setUpClass(cls): - super(TestEvalPythonPandas, cls).setUpClass() + def setup_class(cls): + super(TestEvalPythonPandas, cls).setup_class() cls.engine = 'python' cls.parser = 'pandas' @@ -777,16 +791,16 @@ def check_chained_cmp_op(self, lhs, cmp1, mid, cmp2, rhs): f = lambda *args, **kwargs: np.random.randn() -ENGINES_PARSERS = list(product(_engines, expr._parsers)) +# ------------------------------------- +# gh-12388: Typecasting rules consistency with python -#------------------------------------- -# typecasting rules consistency with python -# issue #12388 class TestTypeCasting(object): - - def check_binop_typecasting(self, engine, parser, op, dt): - tm.skip_if_no_ne(engine) + @pytest.mark.parametrize('op', ['+', '-', '*', '**', '/']) + # maybe someday... 
numexpr has too many upcasting rules now + # chain(*(np.sctypes[x] for x in ['uint', 'int', 'float'])) + @pytest.mark.parametrize('dt', [np.float32, np.float64]) + def test_binop_typecasting(self, engine, parser, op, dt): df = mkdf(5, 3, data_gen_f=f, dtype=dt) s = 'df {} 3'.format(op) res = pd.eval(s, engine=engine, parser=parser) @@ -800,17 +814,9 @@ def check_binop_typecasting(self, engine, parser, op, dt): assert res.values.dtype == dt assert_frame_equal(res, eval(s)) - def test_binop_typecasting(self): - for engine, parser in ENGINES_PARSERS: - for op in ['+', '-', '*', '**', '/']: - # maybe someday... numexpr has too many upcasting rules now - #for dt in chain(*(np.sctypes[x] for x in ['uint', 'int', 'float'])): - for dt in [np.float32, np.float64]: - yield self.check_binop_typecasting, engine, parser, op, dt - -#------------------------------------- -# basic and complex alignment +# ------------------------------------- +# Basic and complex alignment def _is_datetime(x): return issubclass(x.dtype.type, np.datetime64) @@ -827,19 +833,13 @@ class TestAlignment(object): index_types = 'i', 'u', 'dt' lhs_index_types = index_types + ('s',) # 'p' - def check_align_nested_unary_op(self, engine, parser): - tm.skip_if_no_ne(engine) + def test_align_nested_unary_op(self, engine, parser): s = 'df * ~2' df = mkdf(5, 3, data_gen_f=f) res = pd.eval(s, engine=engine, parser=parser) assert_frame_equal(res, df * ~2) - def test_align_nested_unary_op(self): - for engine, parser in ENGINES_PARSERS: - yield self.check_align_nested_unary_op, engine, parser - - def check_basic_frame_alignment(self, engine, parser): - tm.skip_if_no_ne(engine) + def test_basic_frame_alignment(self, engine, parser): args = product(self.lhs_index_types, self.index_types, self.index_types) with warnings.catch_warnings(record=True): @@ -857,12 +857,7 @@ def check_basic_frame_alignment(self, engine, parser): res = pd.eval('df + df2', engine=engine, parser=parser) assert_frame_equal(res, df + df2) - def test_basic_frame_alignment(self): - for engine, parser in ENGINES_PARSERS: - yield self.check_basic_frame_alignment, engine, parser - - def check_frame_comparison(self, engine, parser): - tm.skip_if_no_ne(engine) + def test_frame_comparison(self, engine, parser): args = product(self.lhs_index_types, repeat=2) for r_idx_type, c_idx_type in args: df = mkdf(10, 10, data_gen_f=f, r_idx_type=r_idx_type, @@ -875,12 +870,8 @@ def check_frame_comparison(self, engine, parser): res = pd.eval('df < df3', engine=engine, parser=parser) assert_frame_equal(res, df < df3) - def test_frame_comparison(self): - for engine, parser in ENGINES_PARSERS: - yield self.check_frame_comparison, engine, parser - - def check_medium_complex_frame_alignment(self, engine, parser): - tm.skip_if_no_ne(engine) + @slow + def test_medium_complex_frame_alignment(self, engine, parser): args = product(self.lhs_index_types, self.index_types, self.index_types, self.index_types) @@ -900,14 +891,7 @@ def check_medium_complex_frame_alignment(self, engine, parser): engine=engine, parser=parser) assert_frame_equal(res, df + df2 + df3) - @slow - def test_medium_complex_frame_alignment(self): - for engine, parser in ENGINES_PARSERS: - yield self.check_medium_complex_frame_alignment, engine, parser - - def check_basic_frame_series_alignment(self, engine, parser): - tm.skip_if_no_ne(engine) - + def test_basic_frame_series_alignment(self, engine, parser): def testit(r_idx_type, c_idx_type, index_name): df = mkdf(10, 10, data_gen_f=f, r_idx_type=r_idx_type, c_idx_type=c_idx_type) @@ 
-933,13 +917,7 @@ def testit(r_idx_type, c_idx_type, index_name): for r_idx_type, c_idx_type, index_name in args: testit(r_idx_type, c_idx_type, index_name) - def test_basic_frame_series_alignment(self): - for engine, parser in ENGINES_PARSERS: - yield self.check_basic_frame_series_alignment, engine, parser - - def check_basic_series_frame_alignment(self, engine, parser): - tm.skip_if_no_ne(engine) - + def test_basic_series_frame_alignment(self, engine, parser): def testit(r_idx_type, c_idx_type, index_name): df = mkdf(10, 7, data_gen_f=f, r_idx_type=r_idx_type, c_idx_type=c_idx_type) @@ -969,12 +947,7 @@ def testit(r_idx_type, c_idx_type, index_name): for r_idx_type, c_idx_type, index_name in args: testit(r_idx_type, c_idx_type, index_name) - def test_basic_series_frame_alignment(self): - for engine, parser in ENGINES_PARSERS: - yield self.check_basic_series_frame_alignment, engine, parser - - def check_series_frame_commutativity(self, engine, parser): - tm.skip_if_no_ne(engine) + def test_series_frame_commutativity(self, engine, parser): args = product(self.lhs_index_types, self.index_types, ('+', '*'), ('index', 'columns')) @@ -1001,13 +974,8 @@ def check_series_frame_commutativity(self, engine, parser): if engine == 'numexpr': assert_frame_equal(a, b) - def test_series_frame_commutativity(self): - for engine, parser in ENGINES_PARSERS: - yield self.check_series_frame_commutativity, engine, parser - - def check_complex_series_frame_alignment(self, engine, parser): - tm.skip_if_no_ne(engine) - + @slow + def test_complex_series_frame_alignment(self, engine, parser): import random args = product(self.lhs_index_types, self.index_types, self.index_types, self.index_types) @@ -1048,20 +1016,14 @@ def check_complex_series_frame_alignment(self, engine, parser): parser=parser) else: res = pd.eval('df2 + s + df', engine=engine, parser=parser) - tm.assert_equal(res.shape, expected.shape) + assert res.shape == expected.shape assert_frame_equal(res, expected) - @slow - def test_complex_series_frame_alignment(self): - for engine, parser in ENGINES_PARSERS: - yield self.check_complex_series_frame_alignment, engine, parser - - def check_performance_warning_for_poor_alignment(self, engine, parser): - tm.skip_if_no_ne(engine) + def test_performance_warning_for_poor_alignment(self, engine, parser): df = DataFrame(randn(1000, 10)) s = Series(randn(10000)) if engine == 'numexpr': - seen = pd.core.common.PerformanceWarning + seen = PerformanceWarning else: seen = False @@ -1083,7 +1045,7 @@ def check_performance_warning_for_poor_alignment(self, engine, parser): is_python_engine = engine == 'python' if not is_python_engine: - wrn = pd.core.common.PerformanceWarning + wrn = PerformanceWarning else: wrn = False @@ -1091,36 +1053,29 @@ def check_performance_warning_for_poor_alignment(self, engine, parser): pd.eval('df + s', engine=engine, parser=parser) if not is_python_engine: - tm.assert_equal(len(w), 1) + assert len(w) == 1 msg = str(w[0].message) expected = ("Alignment difference on axis {0} is larger" " than an order of magnitude on term {1!r}, " "by more than {2:.4g}; performance may suffer" "".format(1, 'df', np.log10(s.size - df.shape[1]))) - tm.assert_equal(msg, expected) - - def test_performance_warning_for_poor_alignment(self): - for engine, parser in ENGINES_PARSERS: - yield (self.check_performance_warning_for_poor_alignment, engine, - parser) + assert msg == expected -#------------------------------------ -# slightly more complex ops +# ------------------------------------ +# Slightly more complex 
ops -class TestOperationsNumExprPandas(tm.TestCase): +class TestOperationsNumExprPandas(object): @classmethod - def setUpClass(cls): - super(TestOperationsNumExprPandas, cls).setUpClass() + def setup_class(cls): tm.skip_if_no_ne() cls.engine = 'numexpr' cls.parser = 'pandas' cls.arith_ops = expr._arith_ops_syms + expr._cmp_ops_syms @classmethod - def tearDownClass(cls): - super(TestOperationsNumExprPandas, cls).tearDownClass() + def teardown_class(cls): del cls.engine, cls.parser def eval(self, *args, **kwargs): @@ -1138,22 +1093,22 @@ def test_simple_arith_ops(self): ex3 = '1 {0} (x + 1)'.format(op) if op in ('in', 'not in'): - self.assertRaises(TypeError, pd.eval, ex, - engine=self.engine, parser=self.parser) + pytest.raises(TypeError, pd.eval, ex, + engine=self.engine, parser=self.parser) else: expec = _eval_single_bin(1, op, 1, self.engine) x = self.eval(ex, engine=self.engine, parser=self.parser) - tm.assert_equal(x, expec) + assert x == expec expec = _eval_single_bin(x, op, 1, self.engine) y = self.eval(ex2, local_dict={'x': x}, engine=self.engine, parser=self.parser) - tm.assert_equal(y, expec) + assert y == expec expec = _eval_single_bin(1, op, x + 1, self.engine) y = self.eval(ex3, local_dict={'x': x}, engine=self.engine, parser=self.parser) - tm.assert_equal(y, expec) + assert y == expec def test_simple_bool_ops(self): for op, lhs, rhs in product(expr._bool_ops_syms, (True, False), @@ -1161,7 +1116,7 @@ def test_simple_bool_ops(self): ex = '{0} {1} {2}'.format(lhs, op, rhs) res = self.eval(ex) exp = eval(ex) - self.assertEqual(res, exp) + assert res == exp def test_bool_ops_with_constants(self): for op, lhs, rhs in product(expr._bool_ops_syms, ('True', 'False'), @@ -1169,23 +1124,26 @@ def test_bool_ops_with_constants(self): ex = '{0} {1} {2}'.format(lhs, op, rhs) res = self.eval(ex) exp = eval(ex) - self.assertEqual(res, exp) + assert res == exp def test_panel_fails(self): - x = Panel(randn(3, 4, 5)) - y = Series(randn(10)) - assert_raises(NotImplementedError, self.eval, 'x + y', - local_dict={'x': x, 'y': y}) + with catch_warnings(record=True): + x = Panel(randn(3, 4, 5)) + y = Series(randn(10)) + with pytest.raises(NotImplementedError): + self.eval('x + y', + local_dict={'x': x, 'y': y}) def test_4d_ndarray_fails(self): x = randn(3, 4, 5, 6) y = Series(randn(10)) - assert_raises(NotImplementedError, self.eval, 'x + y', + with pytest.raises(NotImplementedError): + self.eval('x + y', local_dict={'x': x, 'y': y}) def test_constant(self): x = self.eval('1') - tm.assert_equal(x, 1) + assert x == 1 def test_single_variable(self): df = DataFrame(randn(10, 2)) @@ -1195,7 +1153,7 @@ def test_single_variable(self): def test_truediv(self): s = np.array([1]) ex = 's / 1' - d = {'s': s} + d = {'s': s} # noqa if PY3: res = self.eval(ex, truediv=False) @@ -1206,19 +1164,19 @@ def test_truediv(self): res = self.eval('1 / 2', truediv=True) expec = 0.5 - self.assertEqual(res, expec) + assert res == expec res = self.eval('1 / 2', truediv=False) expec = 0.5 - self.assertEqual(res, expec) + assert res == expec res = self.eval('s / 2', truediv=False) expec = 0.5 - self.assertEqual(res, expec) + assert res == expec res = self.eval('s / 2', truediv=True) expec = 0.5 - self.assertEqual(res, expec) + assert res == expec else: res = self.eval(ex, truediv=False) tm.assert_numpy_array_equal(res, np.array([1])) @@ -1228,23 +1186,23 @@ def test_truediv(self): res = self.eval('1 / 2', truediv=True) expec = 0.5 - self.assertEqual(res, expec) + assert res == expec res = self.eval('1 / 2', truediv=False) expec 
= 0 - self.assertEqual(res, expec) + assert res == expec res = self.eval('s / 2', truediv=False) expec = 0 - self.assertEqual(res, expec) + assert res == expec res = self.eval('s / 2', truediv=True) expec = 0.5 - self.assertEqual(res, expec) + assert res == expec def test_failing_subscript_with_name_error(self): - df = DataFrame(np.random.randn(5, 3)) - with tm.assertRaises(NameError): + df = DataFrame(np.random.randn(5, 3)) # noqa + with pytest.raises(NameError): self.eval('df[x > 2] > 2') def test_lhs_expression_subscript(self): @@ -1270,21 +1228,19 @@ def test_assignment_fails(self): df = DataFrame(np.random.randn(5, 3), columns=list('abc')) df2 = DataFrame(np.random.randn(5, 3)) expr1 = 'df = df2' - self.assertRaises(ValueError, self.eval, expr1, - local_dict={'df': df, 'df2': df2}) + pytest.raises(ValueError, self.eval, expr1, + local_dict={'df': df, 'df2': df2}) def test_assignment_column(self): - tm.skip_if_no_ne('numexpr') df = DataFrame(np.random.randn(5, 2), columns=list('ab')) orig_df = df.copy() # multiple assignees - self.assertRaises(SyntaxError, df.eval, 'd c = a + b') + pytest.raises(SyntaxError, df.eval, 'd c = a + b') # invalid assignees - self.assertRaises(SyntaxError, df.eval, 'd,c = a + b') - self.assertRaises( - SyntaxError, df.eval, 'Timestamp("20131001") = a + b') + pytest.raises(SyntaxError, df.eval, 'd,c = a + b') + pytest.raises(SyntaxError, df.eval, 'Timestamp("20131001") = a + b') # single assignment - existing variable expected = orig_df.copy() @@ -1320,14 +1276,14 @@ def f(): df.eval('a = a + b', inplace=True) result = old_a + df.b assert_series_equal(result, df.a, check_names=False) - self.assertTrue(result.name is None) + assert result.name is None f() # multiple assignment df = orig_df.copy() df.eval('c = a + b', inplace=True) - self.assertRaises(SyntaxError, df.eval, 'c = a = b') + pytest.raises(SyntaxError, df.eval, 'c = a = b') # explicit targets df = orig_df.copy() @@ -1345,17 +1301,17 @@ def test_column_in(self): assert_series_equal(result, expected) def assignment_not_inplace(self): - # GH 9297 - tm.skip_if_no_ne('numexpr') + # see gh-9297 df = DataFrame(np.random.randn(5, 2), columns=list('ab')) actual = df.eval('c = a + b', inplace=False) - self.assertIsNotNone(actual) + assert actual is not None + expected = df.copy() expected['c'] = expected['a'] + expected['b'] - assert_frame_equal(df, expected) + tm.assert_frame_equal(df, expected) - # default for inplace will change + # Default for inplace will change with tm.assert_produces_warnings(FutureWarning): df.eval('c = a + b') @@ -1365,7 +1321,6 @@ def assignment_not_inplace(self): def test_multi_line_expression(self): # GH 11149 - tm.skip_if_no_ne('numexpr') df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}) expected = df.copy() @@ -1375,7 +1330,7 @@ def test_multi_line_expression(self): c = a + b d = c + b""", inplace=True) assert_frame_equal(expected, df) - self.assertIsNone(ans) + assert ans is None expected['a'] = expected['a'] - 1 expected['e'] = expected['a'] + 2 @@ -1383,17 +1338,16 @@ def test_multi_line_expression(self): a = a - 1 e = a + 2""", inplace=True) assert_frame_equal(expected, df) - self.assertIsNone(ans) + assert ans is None # multi-line not valid if not all assignments - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.eval(""" a = b + 2 b - 2""", inplace=False) def test_multi_line_expression_not_inplace(self): # GH 11149 - tm.skip_if_no_ne('numexpr') df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}) expected = df.copy() @@ -1411,11 +1365,26 @@ def 
test_multi_line_expression_not_inplace(self): e = a + 2""", inplace=False) assert_frame_equal(expected, df) + def test_multi_line_expression_local_variable(self): + # GH 15342 + df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}) + expected = df.copy() + + local_var = 7 + expected['c'] = expected['a'] * local_var + expected['d'] = expected['c'] + local_var + ans = df.eval(""" + c = a * @local_var + d = c + @local_var + """, inplace=True) + assert_frame_equal(expected, df) + assert ans is None + def test_assignment_in_query(self): # GH 8664 df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}) df_orig = df.copy() - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.query('a = 1') assert_frame_equal(df, df_orig) @@ -1461,57 +1430,57 @@ def test_simple_in_ops(self): if self.parser != 'python': res = pd.eval('1 in [1, 2]', engine=self.engine, parser=self.parser) - self.assertTrue(res) + assert res res = pd.eval('2 in (1, 2)', engine=self.engine, parser=self.parser) - self.assertTrue(res) + assert res res = pd.eval('3 in (1, 2)', engine=self.engine, parser=self.parser) - self.assertFalse(res) + assert not res res = pd.eval('3 not in (1, 2)', engine=self.engine, parser=self.parser) - self.assertTrue(res) + assert res res = pd.eval('[3] not in (1, 2)', engine=self.engine, parser=self.parser) - self.assertTrue(res) + assert res res = pd.eval('[3] in ([3], 2)', engine=self.engine, parser=self.parser) - self.assertTrue(res) + assert res res = pd.eval('[[3]] in [[[3]], 2]', engine=self.engine, parser=self.parser) - self.assertTrue(res) + assert res res = pd.eval('(3,) in [(3,), 2]', engine=self.engine, parser=self.parser) - self.assertTrue(res) + assert res res = pd.eval('(3,) not in [(3,), 2]', engine=self.engine, parser=self.parser) - self.assertFalse(res) + assert not res res = pd.eval('[(3,)] in [[(3,)], 2]', engine=self.engine, parser=self.parser) - self.assertTrue(res) + assert res else: - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.eval('1 in [1, 2]', engine=self.engine, parser=self.parser) - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.eval('2 in (1, 2)', engine=self.engine, parser=self.parser) - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.eval('3 in (1, 2)', engine=self.engine, parser=self.parser) - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.eval('3 not in (1, 2)', engine=self.engine, parser=self.parser) - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.eval('[(3,)] in (1, 2, [(3,)])', engine=self.engine, parser=self.parser) - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.eval('[3] not in (1, 2, [[3]])', engine=self.engine, parser=self.parser) @@ -1519,8 +1488,8 @@ def test_simple_in_ops(self): class TestOperationsNumExprPython(TestOperationsNumExprPandas): @classmethod - def setUpClass(cls): - super(TestOperationsNumExprPython, cls).setUpClass() + def setup_class(cls): + super(TestOperationsNumExprPython, cls).setup_class() cls.engine = 'numexpr' cls.parser = 'python' tm.skip_if_no_ne(cls.engine) @@ -1529,40 +1498,40 @@ def setUpClass(cls): cls.arith_ops) def test_check_many_exprs(self): - a = 1 + a = 1 # noqa expr = ' * '.join('a' * 33) expected = 1 res = pd.eval(expr, engine=self.engine, parser=self.parser) - tm.assert_equal(res, expected) + assert res == expected def test_fails_and(self): df = 
DataFrame(np.random.randn(5, 3)) - self.assertRaises(NotImplementedError, pd.eval, 'df > 2 and df > 3', - local_dict={'df': df}, parser=self.parser, - engine=self.engine) + pytest.raises(NotImplementedError, pd.eval, 'df > 2 and df > 3', + local_dict={'df': df}, parser=self.parser, + engine=self.engine) def test_fails_or(self): df = DataFrame(np.random.randn(5, 3)) - self.assertRaises(NotImplementedError, pd.eval, 'df > 2 or df > 3', - local_dict={'df': df}, parser=self.parser, - engine=self.engine) + pytest.raises(NotImplementedError, pd.eval, 'df > 2 or df > 3', + local_dict={'df': df}, parser=self.parser, + engine=self.engine) def test_fails_not(self): df = DataFrame(np.random.randn(5, 3)) - self.assertRaises(NotImplementedError, pd.eval, 'not df > 2', - local_dict={'df': df}, parser=self.parser, - engine=self.engine) + pytest.raises(NotImplementedError, pd.eval, 'not df > 2', + local_dict={'df': df}, parser=self.parser, + engine=self.engine) def test_fails_ampersand(self): - df = DataFrame(np.random.randn(5, 3)) + df = DataFrame(np.random.randn(5, 3)) # noqa ex = '(df + 2)[df > 1] > 0 & (df > 0)' - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.eval(ex, parser=self.parser, engine=self.engine) def test_fails_pipe(self): - df = DataFrame(np.random.randn(5, 3)) + df = DataFrame(np.random.randn(5, 3)) # noqa ex = '(df + 2)[df > 1] > 0 | (df > 0)' - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.eval(ex, parser=self.parser, engine=self.engine) def test_bool_ops_with_constants(self): @@ -1570,31 +1539,31 @@ def test_bool_ops_with_constants(self): ('True', 'False')): ex = '{0} {1} {2}'.format(lhs, op, rhs) if op in ('and', 'or'): - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): self.eval(ex) else: res = self.eval(ex) exp = eval(ex) - self.assertEqual(res, exp) + assert res == exp def test_simple_bool_ops(self): for op, lhs, rhs in product(expr._bool_ops_syms, (True, False), (True, False)): ex = 'lhs {0} rhs'.format(op) if op in ('and', 'or'): - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.eval(ex, engine=self.engine, parser=self.parser) else: res = pd.eval(ex, engine=self.engine, parser=self.parser) exp = eval(ex) - self.assertEqual(res, exp) + assert res == exp class TestOperationsPythonPython(TestOperationsNumExprPython): @classmethod - def setUpClass(cls): - super(TestOperationsPythonPython, cls).setUpClass() + def setup_class(cls): + super(TestOperationsPythonPython, cls).setup_class() cls.engine = cls.parser = 'python' cls.arith_ops = expr._arith_ops_syms + expr._cmp_ops_syms cls.arith_ops = filter(lambda x: x not in ('in', 'not in'), @@ -1604,18 +1573,17 @@ def setUpClass(cls): class TestOperationsPythonPandas(TestOperationsNumExprPandas): @classmethod - def setUpClass(cls): - super(TestOperationsPythonPandas, cls).setUpClass() + def setup_class(cls): + super(TestOperationsPythonPandas, cls).setup_class() cls.engine = 'python' cls.parser = 'pandas' cls.arith_ops = expr._arith_ops_syms + expr._cmp_ops_syms -class TestMathPythonPython(tm.TestCase): +class TestMathPythonPython(object): @classmethod - def setUpClass(cls): - super(TestMathPythonPython, cls).setUpClass() + def setup_class(cls): tm.skip_if_no_ne() cls.engine = 'python' cls.parser = 'pandas' @@ -1623,7 +1591,7 @@ def setUpClass(cls): cls.binary_fns = _binary_math_ops @classmethod - def tearDownClass(cls): + def teardown_class(cls): del cls.engine, cls.parser def 
eval(self, *args, **kwargs): @@ -1676,14 +1644,14 @@ def test_df_arithmetic_subexpression(self): def check_result_type(self, dtype, expect_dtype): df = DataFrame({'a': np.random.randn(10).astype(dtype)}) - self.assertEqual(df.a.dtype, dtype) + assert df.a.dtype == dtype df.eval("b = sin(a)", engine=self.engine, parser=self.parser, inplace=True) got = df.b expect = np.sin(df.a) - self.assertEqual(expect.dtype, got.dtype) - self.assertEqual(expect_dtype, got.dtype) + assert expect.dtype == got.dtype + assert expect_dtype == got.dtype tm.assert_series_equal(got, expect, check_names=False) def test_result_types(self): @@ -1694,7 +1662,7 @@ def test_result_types(self): def test_result_types2(self): # xref https://github.com/pandas-dev/pandas/issues/12293 - raise nose.SkipTest("unreliable tests on complex128") + pytest.skip("unreliable tests on complex128") # Did not test complex64 because DataFrame is converting it to # complex128. Due to https://github.com/pandas-dev/pandas/issues/10952 @@ -1702,17 +1670,17 @@ def test_result_types2(self): def test_undefined_func(self): df = DataFrame({'a': np.random.randn(10)}) - with tm.assertRaisesRegexp(ValueError, - "\"mysin\" is not a supported function"): + with tm.assert_raises_regex( + ValueError, "\"mysin\" is not a supported function"): df.eval("mysin(a)", engine=self.engine, parser=self.parser) def test_keyword_arg(self): df = DataFrame({'a': np.random.randn(10)}) - with tm.assertRaisesRegexp(TypeError, - "Function \"sin\" does not support " - "keyword arguments"): + with tm.assert_raises_regex(TypeError, + "Function \"sin\" does not support " + "keyword arguments"): df.eval("sin(x=a)", engine=self.engine, parser=self.parser) @@ -1721,8 +1689,8 @@ def test_keyword_arg(self): class TestMathPythonPandas(TestMathPythonPython): @classmethod - def setUpClass(cls): - super(TestMathPythonPandas, cls).setUpClass() + def setup_class(cls): + super(TestMathPythonPandas, cls).setup_class() cls.engine = 'python' cls.parser = 'pandas' @@ -1730,8 +1698,8 @@ def setUpClass(cls): class TestMathNumExprPandas(TestMathPythonPython): @classmethod - def setUpClass(cls): - super(TestMathNumExprPandas, cls).setUpClass() + def setup_class(cls): + super(TestMathNumExprPandas, cls).setup_class() cls.engine = 'numexpr' cls.parser = 'pandas' @@ -1739,8 +1707,8 @@ def setUpClass(cls): class TestMathNumExprPython(TestMathPythonPython): @classmethod - def setUpClass(cls): - super(TestMathNumExprPython, cls).setUpClass() + def setup_class(cls): + super(TestMathNumExprPython, cls).setup_class() cls.engine = 'numexpr' cls.parser = 'python' @@ -1750,208 +1718,142 @@ def setUpClass(cls): class TestScope(object): - def check_global_scope(self, e, engine, parser): - tm.skip_if_no_ne(engine) + def test_global_scope(self, engine, parser): + e = '_var_s * 2' tm.assert_numpy_array_equal(_var_s * 2, pd.eval(e, engine=engine, parser=parser)) - def test_global_scope(self): - e = '_var_s * 2' - for engine, parser in product(_engines, expr._parsers): - yield self.check_global_scope, e, engine, parser - - def check_no_new_locals(self, engine, parser): - tm.skip_if_no_ne(engine) - x = 1 + def test_no_new_locals(self, engine, parser): + x = 1 # noqa lcls = locals().copy() pd.eval('x + 1', local_dict=lcls, engine=engine, parser=parser) lcls2 = locals().copy() lcls2.pop('lcls') - tm.assert_equal(lcls, lcls2) - - def test_no_new_locals(self): - for engine, parser in product(_engines, expr._parsers): - yield self.check_no_new_locals, engine, parser + assert lcls == lcls2 - def 
check_no_new_globals(self, engine, parser): - tm.skip_if_no_ne(engine) - x = 1 + def test_no_new_globals(self, engine, parser): + x = 1 # noqa gbls = globals().copy() pd.eval('x + 1', engine=engine, parser=parser) gbls2 = globals().copy() - tm.assert_equal(gbls, gbls2) - - def test_no_new_globals(self): - for engine, parser in product(_engines, expr._parsers): - yield self.check_no_new_globals, engine, parser + assert gbls == gbls2 def test_invalid_engine(): tm.skip_if_no_ne() - assertRaisesRegexp(KeyError, 'Invalid engine \'asdf\' passed', - pd.eval, 'x + y', local_dict={'x': 1, 'y': 2}, - engine='asdf') + tm.assert_raises_regex(KeyError, 'Invalid engine \'asdf\' passed', + pd.eval, 'x + y', local_dict={'x': 1, 'y': 2}, + engine='asdf') def test_invalid_parser(): tm.skip_if_no_ne() - assertRaisesRegexp(KeyError, 'Invalid parser \'asdf\' passed', - pd.eval, 'x + y', local_dict={'x': 1, 'y': 2}, - parser='asdf') + tm.assert_raises_regex(KeyError, 'Invalid parser \'asdf\' passed', + pd.eval, 'x + y', local_dict={'x': 1, 'y': 2}, + parser='asdf') _parsers = {'python': PythonExprVisitor, 'pytables': pytables.ExprVisitor, 'pandas': PandasExprVisitor} -def check_disallowed_nodes(engine, parser): +@pytest.mark.parametrize('engine', _parsers) +@pytest.mark.parametrize('parser', _parsers) +def test_disallowed_nodes(engine, parser): tm.skip_if_no_ne(engine) VisitorClass = _parsers[parser] uns_ops = VisitorClass.unsupported_nodes inst = VisitorClass('x + 1', engine, parser) for ops in uns_ops: - assert_raises(NotImplementedError, getattr(inst, ops)) - + with pytest.raises(NotImplementedError): + getattr(inst, ops)() -def test_disallowed_nodes(): - for engine, visitor in product(_parsers, repeat=2): - yield check_disallowed_nodes, engine, visitor - -def check_syntax_error_exprs(engine, parser): - tm.skip_if_no_ne(engine) +def test_syntax_error_exprs(engine, parser): e = 's +' - assert_raises(SyntaxError, pd.eval, e, engine=engine, parser=parser) - - -def test_syntax_error_exprs(): - for engine, parser in ENGINES_PARSERS: - yield check_syntax_error_exprs, engine, parser + with pytest.raises(SyntaxError): + pd.eval(e, engine=engine, parser=parser) -def check_name_error_exprs(engine, parser): - tm.skip_if_no_ne(engine) +def test_name_error_exprs(engine, parser): e = 's + t' - with tm.assertRaises(NameError): + with pytest.raises(NameError): pd.eval(e, engine=engine, parser=parser) -def test_name_error_exprs(): - for engine, parser in ENGINES_PARSERS: - yield check_name_error_exprs, engine, parser - - -def check_invalid_local_variable_reference(engine, parser): - tm.skip_if_no_ne(engine) - - a, b = 1, 2 +def test_invalid_local_variable_reference(engine, parser): + a, b = 1, 2 # noqa exprs = 'a + @b', '@a + b', '@a + @b' - for expr in exprs: + + for _expr in exprs: if parser != 'pandas': - with tm.assertRaisesRegexp(SyntaxError, "The '@' prefix is only"): - pd.eval(exprs, engine=engine, parser=parser) + with tm.assert_raises_regex(SyntaxError, + "The '@' prefix is only"): + pd.eval(_expr, engine=engine, parser=parser) else: - with tm.assertRaisesRegexp(SyntaxError, "The '@' prefix is not"): - pd.eval(exprs, engine=engine, parser=parser) - + with tm.assert_raises_regex(SyntaxError, + "The '@' prefix is not"): + pd.eval(_expr, engine=engine, parser=parser) -def test_invalid_local_variable_reference(): - for engine, parser in ENGINES_PARSERS: - yield check_invalid_local_variable_reference, engine, parser - -def check_numexpr_builtin_raises(engine, parser): - tm.skip_if_no_ne(engine) +def 
test_numexpr_builtin_raises(engine, parser): sin, dotted_line = 1, 2 if engine == 'numexpr': - with tm.assertRaisesRegexp(NumExprClobberingError, - 'Variables in expression .+'): + with tm.assert_raises_regex(NumExprClobberingError, + 'Variables in expression .+'): pd.eval('sin + dotted_line', engine=engine, parser=parser) else: res = pd.eval('sin + dotted_line', engine=engine, parser=parser) - tm.assert_equal(res, sin + dotted_line) - + assert res == sin + dotted_line -def test_numexpr_builtin_raises(): - for engine, parser in ENGINES_PARSERS: - yield check_numexpr_builtin_raises, engine, parser - -def check_bad_resolver_raises(engine, parser): - tm.skip_if_no_ne(engine) +def test_bad_resolver_raises(engine, parser): cannot_resolve = 42, 3.0 - with tm.assertRaisesRegexp(TypeError, 'Resolver of type .+'): + with tm.assert_raises_regex(TypeError, 'Resolver of type .+'): pd.eval('1 + 2', resolvers=cannot_resolve, engine=engine, parser=parser) -def test_bad_resolver_raises(): - for engine, parser in ENGINES_PARSERS: - yield check_bad_resolver_raises, engine, parser - - -def check_empty_string_raises(engine, parser): +def test_empty_string_raises(engine, parser): # GH 13139 - tm.skip_if_no_ne(engine) - with tm.assertRaisesRegexp(ValueError, 'expr cannot be an empty string'): + with tm.assert_raises_regex(ValueError, + 'expr cannot be an empty string'): pd.eval('', engine=engine, parser=parser) -def test_empty_string_raises(): - for engine, parser in ENGINES_PARSERS: - yield check_empty_string_raises, engine, parser - - -def check_more_than_one_expression_raises(engine, parser): - tm.skip_if_no_ne(engine) - with tm.assertRaisesRegexp(SyntaxError, - 'only a single expression is allowed'): +def test_more_than_one_expression_raises(engine, parser): + with tm.assert_raises_regex(SyntaxError, + 'only a single expression is allowed'): pd.eval('1 + 1; 2 + 2', engine=engine, parser=parser) -def test_more_than_one_expression_raises(): - for engine, parser in ENGINES_PARSERS: - yield check_more_than_one_expression_raises, engine, parser +@pytest.mark.parametrize('cmp', ('and', 'or')) +@pytest.mark.parametrize('lhs', (int, float)) +@pytest.mark.parametrize('rhs', (int, float)) +def test_bool_ops_fails_on_scalars(lhs, cmp, rhs, engine, parser): + gen = {int: lambda: np.random.randint(10), float: np.random.randn} + mid = gen[lhs]() # noqa + lhs = gen[lhs]() # noqa + rhs = gen[rhs]() # noqa -def check_bool_ops_fails_on_scalars(gen, lhs, cmp, rhs, engine, parser): - tm.skip_if_no_ne(engine) - mid = gen[type(lhs)]() ex1 = 'lhs {0} mid {1} rhs'.format(cmp, cmp) ex2 = 'lhs {0} mid and mid {1} rhs'.format(cmp, cmp) ex3 = '(lhs {0} mid) & (mid {1} rhs)'.format(cmp, cmp) for ex in (ex1, ex2, ex3): - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.eval(ex, engine=engine, parser=parser) -def test_bool_ops_fails_on_scalars(): - _bool_ops_syms = 'and', 'or' - dtypes = int, float - gen = {int: lambda: np.random.randint(10), float: np.random.randn} - for engine, parser, dtype1, cmp, dtype2 in product(_engines, expr._parsers, - dtypes, _bool_ops_syms, - dtypes): - yield (check_bool_ops_fails_on_scalars, gen, gen[dtype1](), cmp, - gen[dtype2](), engine, parser) - - -def check_inf(engine, parser): - tm.skip_if_no_ne(engine) +def test_inf(engine, parser): s = 'inf + 1' expected = np.inf result = pd.eval(s, engine=engine, parser=parser) - tm.assert_equal(result, expected) - + assert result == expected -def test_inf(): - for engine, parser in ENGINES_PARSERS: - yield check_inf, engine, 
parser - -def check_negate_lt_eq_le(engine, parser): - tm.skip_if_no_ne(engine) +def test_negate_lt_eq_le(engine, parser): df = pd.DataFrame([[0, 10], [1, 20]], columns=['cat', 'count']) expected = df[~(df.cat > 0)] @@ -1959,18 +1861,18 @@ def check_negate_lt_eq_le(engine, parser): tm.assert_frame_equal(result, expected) if parser == 'python': - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): df.query('not (cat > 0)', engine=engine, parser=parser) else: result = df.query('not (cat > 0)', engine=engine, parser=parser) tm.assert_frame_equal(result, expected) -def test_negate_lt_eq_le(): - for engine, parser in product(_engines, expr._parsers): - yield check_negate_lt_eq_le, engine, parser +class TestValidate(object): + def test_validate_bool_args(self): + invalid_values = [1, "True", [1, 2, 3], 5.0] -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + for value in invalid_values: + with pytest.raises(ValueError): + pd.eval("2+2", inplace=value) diff --git a/pandas/tests/formats/__init__.py b/pandas/tests/dtypes/__init__.py similarity index 100% rename from pandas/tests/formats/__init__.py rename to pandas/tests/dtypes/__init__.py diff --git a/pandas/tests/dtypes/test_cast.py b/pandas/tests/dtypes/test_cast.py new file mode 100644 index 0000000000000..e92724a5d9cd4 --- /dev/null +++ b/pandas/tests/dtypes/test_cast.py @@ -0,0 +1,323 @@ +# -*- coding: utf-8 -*- + +""" +These test the private routines in types/cast.py + +""" + +import pytest +from datetime import datetime, timedelta, date +import numpy as np + +from pandas import Timedelta, Timestamp, DatetimeIndex + +from pandas.core.dtypes.cast import ( + maybe_downcast_to_dtype, + maybe_convert_objects, + infer_dtype_from_scalar, + infer_dtype_from_array, + maybe_convert_string_to_object, + maybe_convert_scalar, + find_common_type) +from pandas.core.dtypes.dtypes import ( + CategoricalDtype, + DatetimeTZDtype, + PeriodDtype) +from pandas.util import testing as tm + + +class TestMaybeDowncast(object): + + def test_downcast_conv(self): + # test downcasting + + arr = np.array([8.5, 8.6, 8.7, 8.8, 8.9999999999995]) + result = maybe_downcast_to_dtype(arr, 'infer') + assert (np.array_equal(result, arr)) + + arr = np.array([8., 8., 8., 8., 8.9999999999995]) + result = maybe_downcast_to_dtype(arr, 'infer') + expected = np.array([8, 8, 8, 8, 9]) + assert (np.array_equal(result, expected)) + + arr = np.array([8., 8., 8., 8., 9.0000000000005]) + result = maybe_downcast_to_dtype(arr, 'infer') + expected = np.array([8, 8, 8, 8, 9]) + assert (np.array_equal(result, expected)) + + # conversions + + expected = np.array([1, 2]) + for dtype in [np.float64, object, np.int64]: + arr = np.array([1.0, 2.0], dtype=dtype) + result = maybe_downcast_to_dtype(arr, 'infer') + tm.assert_almost_equal(result, expected, check_dtype=False) + + for dtype in [np.float64, object]: + expected = np.array([1.0, 2.0, np.nan], dtype=dtype) + arr = np.array([1.0, 2.0, np.nan], dtype=dtype) + result = maybe_downcast_to_dtype(arr, 'infer') + tm.assert_almost_equal(result, expected) + + # empties + for dtype in [np.int32, np.float64, np.float32, np.bool_, + np.int64, object]: + arr = np.array([], dtype=dtype) + result = maybe_downcast_to_dtype(arr, 'int64') + tm.assert_almost_equal(result, np.array([], dtype=np.int64)) + assert result.dtype == np.int64 + + def test_datetimelikes_nan(self): + arr = np.array([1, 2, np.nan]) + exp = np.array([1, 2, np.datetime64('NaT')], 
dtype='datetime64[ns]') + res = maybe_downcast_to_dtype(arr, 'datetime64[ns]') + tm.assert_numpy_array_equal(res, exp) + + exp = np.array([1, 2, np.timedelta64('NaT')], dtype='timedelta64[ns]') + res = maybe_downcast_to_dtype(arr, 'timedelta64[ns]') + tm.assert_numpy_array_equal(res, exp) + + def test_datetime_with_timezone(self): + # GH 15426 + ts = Timestamp("2016-01-01 12:00:00", tz='US/Pacific') + exp = DatetimeIndex([ts, ts]) + res = maybe_downcast_to_dtype(exp, exp.dtype) + tm.assert_index_equal(res, exp) + + res = maybe_downcast_to_dtype(exp.asi8, exp.dtype) + tm.assert_index_equal(res, exp) + + +class TestInferDtype(object): + + def test_infer_dtype_from_scalar(self): + # Test that _infer_dtype_from_scalar is returning correct dtype for int + # and float. + + for dtypec in [np.uint8, np.int8, np.uint16, np.int16, np.uint32, + np.int32, np.uint64, np.int64]: + data = dtypec(12) + dtype, val = infer_dtype_from_scalar(data) + assert dtype == type(data) + + data = 12 + dtype, val = infer_dtype_from_scalar(data) + assert dtype == np.int64 + + for dtypec in [np.float16, np.float32, np.float64]: + data = dtypec(12) + dtype, val = infer_dtype_from_scalar(data) + assert dtype == dtypec + + data = np.float(12) + dtype, val = infer_dtype_from_scalar(data) + assert dtype == np.float64 + + for data in [True, False]: + dtype, val = infer_dtype_from_scalar(data) + assert dtype == np.bool_ + + for data in [np.complex64(1), np.complex128(1)]: + dtype, val = infer_dtype_from_scalar(data) + assert dtype == np.complex_ + + for data in [np.datetime64(1, 'ns'), Timestamp(1), + datetime(2000, 1, 1, 0, 0)]: + dtype, val = infer_dtype_from_scalar(data) + assert dtype == 'M8[ns]' + + for data in [np.timedelta64(1, 'ns'), Timedelta(1), + timedelta(1)]: + dtype, val = infer_dtype_from_scalar(data) + assert dtype == 'm8[ns]' + + for data in [date(2000, 1, 1), + Timestamp(1, tz='US/Eastern'), 'foo']: + dtype, val = infer_dtype_from_scalar(data) + assert dtype == np.object_ + + @pytest.mark.parametrize( + "arr, expected", + [('foo', np.object_), + (b'foo', np.object_), + (1, np.int_), + (1.5, np.float_), + ([1], np.int_), + (np.array([1]), np.int_), + ([np.nan, 1, ''], np.object_), + (np.array([[1.0, 2.0]]), np.float_), + (Timestamp('20160101'), np.object_), + (np.datetime64('2016-01-01'), np.dtype('= '1.7.0': + assert (s[8].value == np.datetime64('NaT').astype(np.int64)) + + +def test_is_scipy_sparse(spmatrix): # noqa: F811 + tm._skip_if_no_scipy() + assert is_scipy_sparse(spmatrix([[0, 1]])) + assert not is_scipy_sparse(np.array([1])) + + +def test_ensure_int32(): + values = np.arange(10, dtype=np.int32) + result = _ensure_int32(values) + assert (result.dtype == np.int32) + + values = np.arange(10, dtype=np.int64) + result = _ensure_int32(values) + assert (result.dtype == np.int32) + + +def test_ensure_categorical(): + values = np.arange(10, dtype=np.int32) + result = _ensure_categorical(values) + assert (result.dtype == 'category') + + values = Categorical(values) + result = _ensure_categorical(values) + tm.assert_categorical_equal(result, values) diff --git a/pandas/tests/types/test_io.py b/pandas/tests/dtypes/test_io.py similarity index 74% rename from pandas/tests/types/test_io.py rename to pandas/tests/dtypes/test_io.py index 545edf8f1386c..ae92e9ecca681 100644 --- a/pandas/tests/types/test_io.py +++ b/pandas/tests/dtypes/test_io.py @@ -1,25 +1,25 @@ # -*- coding: utf-8 -*- import numpy as np -import pandas.lib as lib +import pandas._libs.lib as lib import pandas.util.testing as tm from pandas.compat 
import long, u -class TestParseSQL(tm.TestCase): +class TestParseSQL(object): def test_convert_sql_column_floats(self): arr = np.array([1.5, None, 3, 4.2], dtype=object) result = lib.convert_sql_column(arr) expected = np.array([1.5, np.nan, 3, 4.2], dtype='f8') - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) def test_convert_sql_column_strings(self): arr = np.array(['1.5', None, '3', '4.2'], dtype=object) result = lib.convert_sql_column(arr) expected = np.array(['1.5', np.nan, '3', '4.2'], dtype=object) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) def test_convert_sql_column_unicode(self): arr = np.array([u('1.5'), None, u('3'), u('4.2')], @@ -27,7 +27,7 @@ def test_convert_sql_column_unicode(self): result = lib.convert_sql_column(arr) expected = np.array([u('1.5'), np.nan, u('3'), u('4.2')], dtype=object) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) def test_convert_sql_column_ints(self): arr = np.array([1, 2, 3, 4], dtype='O') @@ -35,82 +35,75 @@ def test_convert_sql_column_ints(self): result = lib.convert_sql_column(arr) result2 = lib.convert_sql_column(arr2) expected = np.array([1, 2, 3, 4], dtype='i8') - self.assert_numpy_array_equal(result, expected) - self.assert_numpy_array_equal(result2, expected) + tm.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result2, expected) arr = np.array([1, 2, 3, None, 4], dtype='O') result = lib.convert_sql_column(arr) expected = np.array([1, 2, 3, np.nan, 4], dtype='f8') - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) def test_convert_sql_column_longs(self): arr = np.array([long(1), long(2), long(3), long(4)], dtype='O') result = lib.convert_sql_column(arr) expected = np.array([1, 2, 3, 4], dtype='i8') - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) arr = np.array([long(1), long(2), long(3), None, long(4)], dtype='O') result = lib.convert_sql_column(arr) expected = np.array([1, 2, 3, np.nan, 4], dtype='f8') - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) def test_convert_sql_column_bools(self): arr = np.array([True, False, True, False], dtype='O') result = lib.convert_sql_column(arr) expected = np.array([True, False, True, False], dtype=bool) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) arr = np.array([True, False, None, False], dtype='O') result = lib.convert_sql_column(arr) expected = np.array([True, False, np.nan, False], dtype=object) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) def test_convert_sql_column_decimals(self): from decimal import Decimal arr = np.array([Decimal('1.5'), None, Decimal('3'), Decimal('4.2')]) result = lib.convert_sql_column(arr) expected = np.array([1.5, np.nan, 3, 4.2], dtype='f8') - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) def test_convert_downcast_int64(self): - from pandas.parser import na_values + from pandas._libs.parsers import na_values arr = np.array([1, 2, 7, 8, 10], dtype=np.int64) expected = np.array([1, 2, 7, 8, 10], dtype=np.int8) # default argument result = lib.downcast_int64(arr, na_values) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) result = lib.downcast_int64(arr, 
na_values, use_unsigned=False) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) expected = np.array([1, 2, 7, 8, 10], dtype=np.uint8) result = lib.downcast_int64(arr, na_values, use_unsigned=True) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) # still cast to int8 despite use_unsigned=True # because of the negative number as an element arr = np.array([1, 2, -7, 8, 10], dtype=np.int64) expected = np.array([1, 2, -7, 8, 10], dtype=np.int8) result = lib.downcast_int64(arr, na_values, use_unsigned=True) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) arr = np.array([1, 2, 7, 8, 300], dtype=np.int64) expected = np.array([1, 2, 7, 8, 300], dtype=np.int16) result = lib.downcast_int64(arr, na_values) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) int8_na = na_values[np.int8] int64_na = na_values[np.int64] arr = np.array([int64_na, 2, 3, 10, 15], dtype=np.int64) expected = np.array([int8_na, 2, 3, 10, 15], dtype=np.int8) result = lib.downcast_int64(arr, na_values) - self.assert_numpy_array_equal(result, expected) - - -if __name__ == '__main__': - import nose - - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + tm.assert_numpy_array_equal(result, expected) diff --git a/pandas/tests/types/test_missing.py b/pandas/tests/dtypes/test_missing.py similarity index 86% rename from pandas/tests/types/test_missing.py rename to pandas/tests/dtypes/test_missing.py index fa2bd535bb8d5..90993890b7553 100644 --- a/pandas/tests/types/test_missing.py +++ b/pandas/tests/dtypes/test_missing.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- -import nose +from warnings import catch_warnings import numpy as np from datetime import datetime from pandas.util import testing as tm @@ -8,14 +8,13 @@ import pandas as pd from pandas.core import config as cf from pandas.compat import u -from pandas.tslib import iNaT +from pandas._libs.tslib import iNaT from pandas import (NaT, Float64Index, Series, DatetimeIndex, TimedeltaIndex, date_range) -from pandas.types.dtypes import DatetimeTZDtype -from pandas.types.missing import (array_equivalent, isnull, notnull, - na_value_for_dtype) - -_multiprocess_can_split_ = True +from pandas.core.dtypes.dtypes import DatetimeTZDtype +from pandas.core.dtypes.missing import ( + array_equivalent, isnull, notnull, + na_value_for_dtype) def test_notnull(): @@ -46,30 +45,38 @@ def test_notnull(): assert (isinstance(isnull(s), Series)) -class TestIsNull(tm.TestCase): +class TestIsNull(object): def test_0d_array(self): - self.assertTrue(isnull(np.array(np.nan))) - self.assertFalse(isnull(np.array(0.0))) - self.assertFalse(isnull(np.array(0))) + assert isnull(np.array(np.nan)) + assert not isnull(np.array(0.0)) + assert not isnull(np.array(0)) # test object dtype - self.assertTrue(isnull(np.array(np.nan, dtype=object))) - self.assertFalse(isnull(np.array(0.0, dtype=object))) - self.assertFalse(isnull(np.array(0, dtype=object))) + assert isnull(np.array(np.nan, dtype=object)) + assert not isnull(np.array(0.0, dtype=object)) + assert not isnull(np.array(0, dtype=object)) + + def test_empty_object(self): + + for shape in [(4, 0), (4,)]: + arr = np.empty(shape=shape, dtype=object) + result = isnull(arr) + expected = np.ones(shape=shape, dtype=bool) + tm.assert_numpy_array_equal(result, expected) def test_isnull(self): - self.assertFalse(isnull(1.)) - 
self.assertTrue(isnull(None)) - self.assertTrue(isnull(np.NaN)) - self.assertTrue(float('nan')) - self.assertFalse(isnull(np.inf)) - self.assertFalse(isnull(-np.inf)) + assert not isnull(1.) + assert isnull(None) + assert isnull(np.NaN) + assert float('nan') + assert not isnull(np.inf) + assert not isnull(-np.inf) # series for s in [tm.makeFloatSeries(), tm.makeStringSeries(), tm.makeObjectSeries(), tm.makeTimeSeries(), tm.makePeriodSeries()]: - self.assertIsInstance(isnull(s), Series) + assert isinstance(isnull(s), Series) # frame for df in [tm.makeTimeDataFrame(), tm.makePeriodFrame(), @@ -79,14 +86,15 @@ def test_isnull(self): tm.assert_frame_equal(result, expected) # panel - for p in [tm.makePanel(), tm.makePeriodPanel(), - tm.add_nans(tm.makePanel())]: - result = isnull(p) - expected = p.apply(isnull) - tm.assert_panel_equal(result, expected) + with catch_warnings(record=True): + for p in [tm.makePanel(), tm.makePeriodPanel(), + tm.add_nans(tm.makePanel())]: + result = isnull(p) + expected = p.apply(isnull) + tm.assert_panel_equal(result, expected) # panel 4d - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): for p in [tm.makePanel4D(), tm.add_nans_panel4d(tm.makePanel4D())]: result = isnull(p) expected = p.apply(isnull) @@ -127,8 +135,8 @@ def test_isnull_numpy_nat(self): tm.assert_numpy_array_equal(result, expected) def test_isnull_datetime(self): - self.assertFalse(isnull(datetime.now())) - self.assertTrue(notnull(datetime.now())) + assert not isnull(datetime.now()) + assert notnull(datetime.now()) idx = date_range('1/1/1990', periods=20) exp = np.ones(len(idx), dtype=bool) @@ -138,20 +146,20 @@ def test_isnull_datetime(self): idx[0] = iNaT idx = DatetimeIndex(idx) mask = isnull(idx) - self.assertTrue(mask[0]) + assert mask[0] exp = np.array([True] + [False] * (len(idx) - 1), dtype=bool) - self.assert_numpy_array_equal(mask, exp) + tm.assert_numpy_array_equal(mask, exp) # GH 9129 pidx = idx.to_period(freq='M') mask = isnull(pidx) - self.assertTrue(mask[0]) + assert mask[0] exp = np.array([True] + [False] * (len(idx) - 1), dtype=bool) - self.assert_numpy_array_equal(mask, exp) + tm.assert_numpy_array_equal(mask, exp) mask = isnull(pidx[1:]) exp = np.zeros(len(mask), dtype=bool) - self.assert_numpy_array_equal(mask, exp) + tm.assert_numpy_array_equal(mask, exp) def test_datetime_other_units(self): idx = pd.DatetimeIndex(['2011-01-01', 'NaT', '2011-01-02']) @@ -304,8 +312,3 @@ def test_na_value_for_dtype(): for dtype in ['O']: assert np.isnan(na_value_for_dtype(np.dtype(dtype))) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/formats/test_format.py b/pandas/tests/formats/test_format.py deleted file mode 100644 index e7c32a4baa4ea..0000000000000 --- a/pandas/tests/formats/test_format.py +++ /dev/null @@ -1,4439 +0,0 @@ -# -*- coding: utf-8 -*- - -# TODO(wesm): lots of issues making flake8 hard -# flake8: noqa - -from __future__ import print_function -from distutils.version import LooseVersion -import re - -from pandas.compat import (range, zip, lrange, StringIO, PY3, - u, lzip, is_platform_windows, - is_platform_32bit) -import pandas.compat as compat -import itertools -from operator import methodcaller -import os -import sys -from textwrap import dedent -import warnings - -from numpy import nan -from numpy.random import randn -import numpy as np - -import codecs - -div_style = '' -try: - import IPython - if IPython.__version__ < 
LooseVersion('3.0.0'): - div_style = ' style="max-width:1500px;overflow:auto;"' -except (ImportError, AttributeError): - pass - -from pandas import DataFrame, Series, Index, Timestamp, MultiIndex, date_range, NaT - -import pandas.formats.format as fmt -import pandas.util.testing as tm -import pandas.core.common as com -import pandas.formats.printing as printing -from pandas.util.terminal import get_terminal_size -import pandas as pd -from pandas.core.config import (set_option, get_option, option_context, - reset_option) -from datetime import datetime - -import nose - -use_32bit_repr = is_platform_windows() or is_platform_32bit() - -_frame = DataFrame(tm.getSeriesData()) - - -def curpath(): - pth, _ = os.path.split(os.path.abspath(__file__)) - return pth - - -def has_info_repr(df): - r = repr(df) - c1 = r.split('\n')[0].startswith(", 2. Index, 3. Columns, 4. dtype, 5. memory usage, 6. trailing newline - return has_info and nv - - -def has_horizontally_truncated_repr(df): - try: # Check header row - fst_line = np.array(repr(df).splitlines()[0].split()) - cand_col = np.where(fst_line == '...')[0][0] - except: - return False - # Make sure each row has this ... in the same place - r = repr(df) - for ix, l in enumerate(r.splitlines()): - if not r.split()[cand_col] == '...': - return False - return True - - -def has_vertically_truncated_repr(df): - r = repr(df) - only_dot_row = False - for row in r.splitlines(): - if re.match(r'^[\.\ ]+$', row): - only_dot_row = True - return only_dot_row - - -def has_truncated_repr(df): - return has_horizontally_truncated_repr( - df) or has_vertically_truncated_repr(df) - - -def has_doubly_truncated_repr(df): - return has_horizontally_truncated_repr( - df) and has_vertically_truncated_repr(df) - - -def has_expanded_repr(df): - r = repr(df) - for line in r.split('\n'): - if line.endswith('\\'): - return True - return False - - -class TestDataFrameFormatting(tm.TestCase): - _multiprocess_can_split_ = True - - def setUp(self): - self.warn_filters = warnings.filters - warnings.filterwarnings('ignore', category=FutureWarning, - module=".*format") - - self.frame = _frame.copy() - - def tearDown(self): - warnings.filters = self.warn_filters - - def test_repr_embedded_ndarray(self): - arr = np.empty(10, dtype=[('err', object)]) - for i in range(len(arr)): - arr['err'][i] = np.random.randn(i) - - df = DataFrame(arr) - repr(df['err']) - repr(df) - df.to_string() - - def test_eng_float_formatter(self): - self.frame.ix[5] = 0 - - fmt.set_eng_float_format() - repr(self.frame) - - fmt.set_eng_float_format(use_eng_prefix=True) - repr(self.frame) - - fmt.set_eng_float_format(accuracy=0) - repr(self.frame) - self.reset_display_options() - - def test_show_null_counts(self): - - df = DataFrame(1, columns=range(10), index=range(10)) - df.iloc[1, 1] = np.nan - - def check(null_counts, result): - buf = StringIO() - df.info(buf=buf, null_counts=null_counts) - self.assertTrue(('non-null' in buf.getvalue()) is result) - - with option_context('display.max_info_rows', 20, - 'display.max_info_columns', 20): - check(None, True) - check(True, True) - check(False, False) - - with option_context('display.max_info_rows', 5, - 'display.max_info_columns', 5): - check(None, False) - check(True, False) - check(False, False) - - def test_repr_tuples(self): - buf = StringIO() - - df = DataFrame({'tups': lzip(range(10), range(10))}) - repr(df) - df.to_string(col_space=10, buf=buf) - - def test_repr_truncation(self): - max_len = 20 - with option_context("display.max_colwidth", max_len): - df = 
DataFrame({'A': np.random.randn(10), - 'B': [tm.rands(np.random.randint( - max_len - 1, max_len + 1)) for i in range(10) - ]}) - r = repr(df) - r = r[r.find('\n') + 1:] - - adj = fmt._get_adjustment() - - for line, value in lzip(r.split('\n'), df['B']): - if adj.len(value) + 1 > max_len: - self.assertIn('...', line) - else: - self.assertNotIn('...', line) - - with option_context("display.max_colwidth", 999999): - self.assertNotIn('...', repr(df)) - - with option_context("display.max_colwidth", max_len + 2): - self.assertNotIn('...', repr(df)) - - def test_repr_chop_threshold(self): - df = DataFrame([[0.1, 0.5], [0.5, -0.1]]) - pd.reset_option("display.chop_threshold") # default None - self.assertEqual(repr(df), ' 0 1\n0 0.1 0.5\n1 0.5 -0.1') - - with option_context("display.chop_threshold", 0.2): - self.assertEqual(repr(df), ' 0 1\n0 0.0 0.5\n1 0.5 0.0') - - with option_context("display.chop_threshold", 0.6): - self.assertEqual(repr(df), ' 0 1\n0 0.0 0.0\n1 0.0 0.0') - - with option_context("display.chop_threshold", None): - self.assertEqual(repr(df), ' 0 1\n0 0.1 0.5\n1 0.5 -0.1') - - def test_repr_obeys_max_seq_limit(self): - with option_context("display.max_seq_items", 2000): - self.assertTrue(len(printing.pprint_thing(lrange(1000))) > 1000) - - with option_context("display.max_seq_items", 5): - self.assertTrue(len(printing.pprint_thing(lrange(1000))) < 100) - - def test_repr_set(self): - self.assertEqual(printing.pprint_thing(set([1])), '{1}') - - def test_repr_is_valid_construction_code(self): - # for the case of Index, where the repr is traditional rather then - # stylized - idx = Index(['a', 'b']) - res = eval("pd." + repr(idx)) - tm.assert_series_equal(Series(res), Series(idx)) - - def test_repr_should_return_str(self): - # http://docs.python.org/py3k/reference/datamodel.html#object.__repr__ - # http://docs.python.org/reference/datamodel.html#object.__repr__ - # "...The return value must be a string object." 
- - # (str on py2.x, str (unicode) on py3) - - data = [8, 5, 3, 5] - index1 = [u("\u03c3"), u("\u03c4"), u("\u03c5"), u("\u03c6")] - cols = [u("\u03c8")] - df = DataFrame(data, columns=cols, index=index1) - self.assertTrue(type(df.__repr__()) == str) # both py2 / 3 - - def test_repr_no_backslash(self): - with option_context('mode.sim_interactive', True): - df = DataFrame(np.random.randn(10, 4)) - self.assertTrue('\\' not in repr(df)) - - def test_expand_frame_repr(self): - df_small = DataFrame('hello', [0], [0]) - df_wide = DataFrame('hello', [0], lrange(10)) - df_tall = DataFrame('hello', lrange(30), lrange(5)) - - with option_context('mode.sim_interactive', True): - with option_context('display.max_columns', 10, 'display.width', 20, - 'display.max_rows', 20, - 'display.show_dimensions', True): - with option_context('display.expand_frame_repr', True): - self.assertFalse(has_truncated_repr(df_small)) - self.assertFalse(has_expanded_repr(df_small)) - self.assertFalse(has_truncated_repr(df_wide)) - self.assertTrue(has_expanded_repr(df_wide)) - self.assertTrue(has_vertically_truncated_repr(df_tall)) - self.assertTrue(has_expanded_repr(df_tall)) - - with option_context('display.expand_frame_repr', False): - self.assertFalse(has_truncated_repr(df_small)) - self.assertFalse(has_expanded_repr(df_small)) - self.assertFalse(has_horizontally_truncated_repr(df_wide)) - self.assertFalse(has_expanded_repr(df_wide)) - self.assertTrue(has_vertically_truncated_repr(df_tall)) - self.assertFalse(has_expanded_repr(df_tall)) - - def test_repr_non_interactive(self): - # in non interactive mode, there can be no dependency on the - # result of terminal auto size detection - df = DataFrame('hello', lrange(1000), lrange(5)) - - with option_context('mode.sim_interactive', False, 'display.width', 0, - 'display.height', 0, 'display.max_rows', 5000): - self.assertFalse(has_truncated_repr(df)) - self.assertFalse(has_expanded_repr(df)) - - def test_repr_max_columns_max_rows(self): - term_width, term_height = get_terminal_size() - if term_width < 10 or term_height < 10: - raise nose.SkipTest("terminal size too small, " - "{0} x {1}".format(term_width, term_height)) - - def mkframe(n): - index = ['%05d' % i for i in range(n)] - return DataFrame(0, index, index) - - df6 = mkframe(6) - df10 = mkframe(10) - with option_context('mode.sim_interactive', True): - with option_context('display.width', term_width * 2): - with option_context('display.max_rows', 5, - 'display.max_columns', 5): - self.assertFalse(has_expanded_repr(mkframe(4))) - self.assertFalse(has_expanded_repr(mkframe(5))) - self.assertFalse(has_expanded_repr(df6)) - self.assertTrue(has_doubly_truncated_repr(df6)) - - with option_context('display.max_rows', 20, - 'display.max_columns', 10): - # Out off max_columns boundary, but no extending - # since not exceeding width - self.assertFalse(has_expanded_repr(df6)) - self.assertFalse(has_truncated_repr(df6)) - - with option_context('display.max_rows', 9, - 'display.max_columns', 10): - # out vertical bounds can not result in exanded repr - self.assertFalse(has_expanded_repr(df10)) - self.assertTrue(has_vertically_truncated_repr(df10)) - - # width=None in terminal, auto detection - with option_context('display.max_columns', 100, 'display.max_rows', - term_width * 20, 'display.width', None): - df = mkframe((term_width // 7) - 2) - self.assertFalse(has_expanded_repr(df)) - df = mkframe((term_width // 7) + 2) - printing.pprint_thing(df._repr_fits_horizontal_()) - self.assertTrue(has_expanded_repr(df)) - - def 
test_str_max_colwidth(self): - # GH 7856 - df = pd.DataFrame([{'a': 'foo', - 'b': 'bar', - 'c': 'uncomfortably long line with lots of stuff', - 'd': 1}, {'a': 'foo', - 'b': 'bar', - 'c': 'stuff', - 'd': 1}]) - df.set_index(['a', 'b', 'c']) - self.assertTrue( - str(df) == - ' a b c d\n' - '0 foo bar uncomfortably long line with lots of stuff 1\n' - '1 foo bar stuff 1') - with option_context('max_colwidth', 20): - self.assertTrue(str(df) == ' a b c d\n' - '0 foo bar uncomfortably lo... 1\n' - '1 foo bar stuff 1') - - def test_auto_detect(self): - term_width, term_height = get_terminal_size() - fac = 1.05 # Arbitrary large factor to exceed term widht - cols = range(int(term_width * fac)) - index = range(10) - df = DataFrame(index=index, columns=cols) - with option_context('mode.sim_interactive', True): - with option_context('max_rows', None): - with option_context('max_columns', None): - # Wrap around with None - self.assertTrue(has_expanded_repr(df)) - with option_context('max_rows', 0): - with option_context('max_columns', 0): - # Truncate with auto detection. - self.assertTrue(has_horizontally_truncated_repr(df)) - - index = range(int(term_height * fac)) - df = DataFrame(index=index, columns=cols) - with option_context('max_rows', 0): - with option_context('max_columns', None): - # Wrap around with None - self.assertTrue(has_expanded_repr(df)) - # Truncate vertically - self.assertTrue(has_vertically_truncated_repr(df)) - - with option_context('max_rows', None): - with option_context('max_columns', 0): - self.assertTrue(has_horizontally_truncated_repr(df)) - - def test_to_string_repr_unicode(self): - buf = StringIO() - - unicode_values = [u('\u03c3')] * 10 - unicode_values = np.array(unicode_values, dtype=object) - df = DataFrame({'unicode': unicode_values}) - df.to_string(col_space=10, buf=buf) - - # it works! 
- repr(df) - - idx = Index(['abc', u('\u03c3a'), 'aegdvg']) - ser = Series(np.random.randn(len(idx)), idx) - rs = repr(ser).split('\n') - line_len = len(rs[0]) - for line in rs[1:]: - try: - line = line.decode(get_option("display.encoding")) - except: - pass - if not line.startswith('dtype:'): - self.assertEqual(len(line), line_len) - - # it works even if sys.stdin in None - _stdin = sys.stdin - try: - sys.stdin = None - repr(df) - finally: - sys.stdin = _stdin - - def test_to_string_unicode_columns(self): - df = DataFrame({u('\u03c3'): np.arange(10.)}) - - buf = StringIO() - df.to_string(buf=buf) - buf.getvalue() - - buf = StringIO() - df.info(buf=buf) - buf.getvalue() - - result = self.frame.to_string() - tm.assertIsInstance(result, compat.text_type) - - def test_to_string_utf8_columns(self): - n = u("\u05d0").encode('utf-8') - - with option_context('display.max_rows', 1): - df = DataFrame([1, 2], columns=[n]) - repr(df) - - def test_to_string_unicode_two(self): - dm = DataFrame({u('c/\u03c3'): []}) - buf = StringIO() - dm.to_string(buf) - - def test_to_string_unicode_three(self): - dm = DataFrame(['\xc2']) - buf = StringIO() - dm.to_string(buf) - - def test_to_string_with_formatters(self): - df = DataFrame({'int': [1, 2, 3], - 'float': [1.0, 2.0, 3.0], - 'object': [(1, 2), True, False]}, - columns=['int', 'float', 'object']) - - formatters = [('int', lambda x: '0x%x' % x), - ('float', lambda x: '[% 4.1f]' % x), - ('object', lambda x: '-%s-' % str(x))] - result = df.to_string(formatters=dict(formatters)) - result2 = df.to_string(formatters=lzip(*formatters)[1]) - self.assertEqual(result, (' int float object\n' - '0 0x1 [ 1.0] -(1, 2)-\n' - '1 0x2 [ 2.0] -True-\n' - '2 0x3 [ 3.0] -False-')) - self.assertEqual(result, result2) - - def test_to_string_with_datetime64_monthformatter(self): - months = [datetime(2016, 1, 1), datetime(2016, 2, 2)] - x = DataFrame({'months': months}) - - def format_func(x): - return x.strftime('%Y-%m') - result = x.to_string(formatters={'months': format_func}) - expected = 'months\n0 2016-01\n1 2016-02' - self.assertEqual(result.strip(), expected) - - def test_to_string_with_datetime64_hourformatter(self): - - x = DataFrame({'hod': pd.to_datetime(['10:10:10.100', '12:12:12.120'], - format='%H:%M:%S.%f')}) - - def format_func(x): - return x.strftime('%H:%M') - - result = x.to_string(formatters={'hod': format_func}) - expected = 'hod\n0 10:10\n1 12:12' - self.assertEqual(result.strip(), expected) - - def test_to_string_with_formatters_unicode(self): - df = DataFrame({u('c/\u03c3'): [1, 2, 3]}) - result = df.to_string(formatters={u('c/\u03c3'): lambda x: '%s' % x}) - self.assertEqual(result, u(' c/\u03c3\n') + '0 1\n1 2\n2 3') - - def test_east_asian_unicode_frame(self): - if PY3: - _rep = repr - else: - _rep = unicode - - # not alighned properly because of east asian width - - # mid col - df = DataFrame({'a': [u'あ', u'いいい', u'う', u'ええええええ'], - 'b': [1, 222, 33333, 4]}, - index=['a', 'bb', 'c', 'ddd']) - expected = (u" a b\na あ 1\n" - u"bb いいい 222\nc う 33333\n" - u"ddd ええええええ 4") - self.assertEqual(_rep(df), expected) - - # last col - df = DataFrame({'a': [1, 222, 33333, 4], - 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, - index=['a', 'bb', 'c', 'ddd']) - expected = (u" a b\na 1 あ\n" - u"bb 222 いいい\nc 33333 う\n" - u"ddd 4 ええええええ") - self.assertEqual(_rep(df), expected) - - # all col - df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], - 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, - index=['a', 'bb', 'c', 'ddd']) - expected = (u" a b\na あああああ あ\n" - u"bb い いいい\nc う う\n" - 
u"ddd えええ ええええええ") - self.assertEqual(_rep(df), expected) - - # column name - df = DataFrame({u'あああああ': [1, 222, 33333, 4], - 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, - index=['a', 'bb', 'c', 'ddd']) - expected = (u" b あああああ\na あ 1\n" - u"bb いいい 222\nc う 33333\n" - u"ddd ええええええ 4") - self.assertEqual(_rep(df), expected) - - # index - df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], - 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, - index=[u'あああ', u'いいいいいい', u'うう', u'え']) - expected = (u" a b\nあああ あああああ あ\n" - u"いいいいいい い いいい\nうう う う\n" - u"え えええ ええええええ") - self.assertEqual(_rep(df), expected) - - # index name - df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], - 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, - index=pd.Index([u'あ', u'い', u'うう', u'え'], name=u'おおおお')) - expected = (u" a b\nおおおお \nあ あああああ あ\n" - u"い い いいい\nうう う う\nえ えええ ええええええ" - ) - self.assertEqual(_rep(df), expected) - - # all - df = DataFrame({u'あああ': [u'あああ', u'い', u'う', u'えええええ'], - u'いいいいい': [u'あ', u'いいい', u'う', u'ええ']}, - index=pd.Index([u'あ', u'いいい', u'うう', u'え'], name=u'お')) - expected = (u" あああ いいいいい\nお \nあ あああ あ\n" - u"いいい い いいい\nうう う う\nえ えええええ ええ") - self.assertEqual(_rep(df), expected) - - # MultiIndex - idx = pd.MultiIndex.from_tuples([(u'あ', u'いい'), (u'う', u'え'), ( - u'おおお', u'かかかか'), (u'き', u'くく')]) - df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], - 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, index=idx) - expected = (u" a b\nあ いい あああああ あ\n" - u"う え い いいい\nおおお かかかか う う\n" - u"き くく えええ ええええええ") - self.assertEqual(_rep(df), expected) - - # truncate - with option_context('display.max_rows', 3, 'display.max_columns', 3): - df = pd.DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], - 'b': [u'あ', u'いいい', u'う', u'ええええええ'], - 'c': [u'お', u'か', u'ききき', u'くくくくくく'], - u'ああああ': [u'さ', u'し', u'す', u'せ']}, - columns=['a', 'b', 'c', u'ああああ']) - - expected = (u" a ... ああああ\n0 あああああ ... さ\n" - u".. ... ... ...\n3 えええ ... せ\n" - u"\n[4 rows x 4 columns]") - self.assertEqual(_rep(df), expected) - - df.index = [u'あああ', u'いいいい', u'う', 'aaa'] - expected = (u" a ... ああああ\nあああ あああああ ... さ\n" - u".. ... ... ...\naaa えええ ... 
せ\n" - u"\n[4 rows x 4 columns]") - self.assertEqual(_rep(df), expected) - - # Emable Unicode option ----------------------------------------- - with option_context('display.unicode.east_asian_width', True): - - # mid col - df = DataFrame({'a': [u'あ', u'いいい', u'う', u'ええええええ'], - 'b': [1, 222, 33333, 4]}, - index=['a', 'bb', 'c', 'ddd']) - expected = (u" a b\na あ 1\n" - u"bb いいい 222\nc う 33333\n" - u"ddd ええええええ 4") - self.assertEqual(_rep(df), expected) - - # last col - df = DataFrame({'a': [1, 222, 33333, 4], - 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, - index=['a', 'bb', 'c', 'ddd']) - expected = (u" a b\na 1 あ\n" - u"bb 222 いいい\nc 33333 う\n" - u"ddd 4 ええええええ") - self.assertEqual(_rep(df), expected) - - # all col - df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], - 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, - index=['a', 'bb', 'c', 'ddd']) - expected = (u" a b\na あああああ あ\n" - u"bb い いいい\nc う う\n" - u"ddd えええ ええええええ" - "") - self.assertEqual(_rep(df), expected) - - # column name - df = DataFrame({u'あああああ': [1, 222, 33333, 4], - 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, - index=['a', 'bb', 'c', 'ddd']) - expected = (u" b あああああ\na あ 1\n" - u"bb いいい 222\nc う 33333\n" - u"ddd ええええええ 4") - self.assertEqual(_rep(df), expected) - - # index - df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], - 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, - index=[u'あああ', u'いいいいいい', u'うう', u'え']) - expected = (u" a b\nあああ あああああ あ\n" - u"いいいいいい い いいい\nうう う う\n" - u"え えええ ええええええ") - self.assertEqual(_rep(df), expected) - - # index name - df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], - 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, - index=pd.Index([u'あ', u'い', u'うう', u'え'], name=u'おおおお')) - expected = (u" a b\nおおおお \n" - u"あ あああああ あ\nい い いいい\n" - u"うう う う\nえ えええ ええええええ" - ) - self.assertEqual(_rep(df), expected) - - # all - df = DataFrame({u'あああ': [u'あああ', u'い', u'う', u'えええええ'], - u'いいいいい': [u'あ', u'いいい', u'う', u'ええ']}, - index=pd.Index([u'あ', u'いいい', u'うう', u'え'], name=u'お')) - expected = (u" あああ いいいいい\nお \n" - u"あ あああ あ\nいいい い いいい\n" - u"うう う う\nえ えええええ ええ") - self.assertEqual(_rep(df), expected) - - # MultiIndex - idx = pd.MultiIndex.from_tuples([(u'あ', u'いい'), (u'う', u'え'), ( - u'おおお', u'かかかか'), (u'き', u'くく')]) - df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], - 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, index=idx) - expected = (u" a b\nあ いい あああああ あ\n" - u"う え い いいい\nおおお かかかか う う\n" - u"き くく えええ ええええええ") - self.assertEqual(_rep(df), expected) - - # truncate - with option_context('display.max_rows', 3, 'display.max_columns', - 3): - - df = pd.DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], - 'b': [u'あ', u'いいい', u'う', u'ええええええ'], - 'c': [u'お', u'か', u'ききき', u'くくくくくく'], - u'ああああ': [u'さ', u'し', u'す', u'せ']}, - columns=['a', 'b', 'c', u'ああああ']) - - expected = (u" a ... ああああ\n0 あああああ ... さ\n" - u".. ... ... ...\n3 えええ ... せ\n" - u"\n[4 rows x 4 columns]") - self.assertEqual(_rep(df), expected) - - df.index = [u'あああ', u'いいいい', u'う', 'aaa'] - expected = (u" a ... ああああ\nあああ あああああ ... さ\n" - u"... ... ... ...\naaa えええ ... 
せ\n" - u"\n[4 rows x 4 columns]") - self.assertEqual(_rep(df), expected) - - # ambiguous unicode - df = DataFrame({u'あああああ': [1, 222, 33333, 4], - 'b': [u'あ', u'いいい', u'¡¡', u'ええええええ']}, - index=['a', 'bb', 'c', '¡¡¡']) - expected = (u" b あああああ\na あ 1\n" - u"bb いいい 222\nc ¡¡ 33333\n" - u"¡¡¡ ええええええ 4") - self.assertEqual(_rep(df), expected) - - def test_to_string_buffer_all_unicode(self): - buf = StringIO() - - empty = DataFrame({u('c/\u03c3'): Series()}) - nonempty = DataFrame({u('c/\u03c3'): Series([1, 2, 3])}) - - print(empty, file=buf) - print(nonempty, file=buf) - - # this should work - buf.getvalue() - - def test_to_string_with_col_space(self): - df = DataFrame(np.random.random(size=(1, 3))) - c10 = len(df.to_string(col_space=10).split("\n")[1]) - c20 = len(df.to_string(col_space=20).split("\n")[1]) - c30 = len(df.to_string(col_space=30).split("\n")[1]) - self.assertTrue(c10 < c20 < c30) - - # GH 8230 - # col_space wasn't being applied with header=False - with_header = df.to_string(col_space=20) - with_header_row1 = with_header.splitlines()[1] - no_header = df.to_string(col_space=20, header=False) - self.assertEqual(len(with_header_row1), len(no_header)) - - def test_to_string_truncate_indices(self): - for index in [tm.makeStringIndex, tm.makeUnicodeIndex, tm.makeIntIndex, - tm.makeDateIndex, tm.makePeriodIndex]: - for column in [tm.makeStringIndex]: - for h in [10, 20]: - for w in [10, 20]: - with option_context("display.expand_frame_repr", - False): - df = DataFrame(index=index(h), columns=column(w)) - with option_context("display.max_rows", 15): - if h == 20: - self.assertTrue( - has_vertically_truncated_repr(df)) - else: - self.assertFalse( - has_vertically_truncated_repr(df)) - with option_context("display.max_columns", 15): - if w == 20: - self.assertTrue( - has_horizontally_truncated_repr(df)) - else: - self.assertFalse( - has_horizontally_truncated_repr(df)) - with option_context("display.max_rows", 15, - "display.max_columns", 15): - if h == 20 and w == 20: - self.assertTrue(has_doubly_truncated_repr( - df)) - else: - self.assertFalse(has_doubly_truncated_repr( - df)) - - def test_to_string_truncate_multilevel(self): - arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], - ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] - df = DataFrame(index=arrays, columns=arrays) - with option_context("display.max_rows", 7, "display.max_columns", 7): - self.assertTrue(has_doubly_truncated_repr(df)) - - def test_truncate_with_different_dtypes(self): - - # 11594, 12045 - # when truncated the dtypes of the splits can differ - - # 11594 - import datetime - s = Series([datetime.datetime(2012, 1, 1)]*10 + [datetime.datetime(1012,1,2)] + [datetime.datetime(2012, 1, 3)]*10) - - with pd.option_context('display.max_rows', 8): - result = str(s) - self.assertTrue('object' in result) - - # 12045 - df = DataFrame({'text': ['some words'] + [None]*9}) - - with pd.option_context('display.max_rows', 8, 'display.max_columns', 3): - result = str(df) - self.assertTrue('None' in result) - self.assertFalse('NaN' in result) - - def test_datetimelike_frame(self): - - # GH 12211 - df = DataFrame({'date' : [pd.Timestamp('20130101').tz_localize('UTC')] + [pd.NaT]*5}) - - with option_context("display.max_rows", 5): - result = str(df) - self.assertTrue('2013-01-01 00:00:00+00:00' in result) - self.assertTrue('NaT' in result) - self.assertTrue('...' 
in result) - self.assertTrue('[6 rows x 1 columns]' in result) - - dts = [pd.Timestamp('2011-01-01', tz='US/Eastern')] * 5 + [pd.NaT] * 5 - df = pd.DataFrame({"dt": dts, - "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}) - with option_context('display.max_rows', 5): - expected = (' dt x\n' - '0 2011-01-01 00:00:00-05:00 1\n' - '1 2011-01-01 00:00:00-05:00 2\n' - '.. ... ..\n' - '8 NaT 9\n' - '9 NaT 10\n\n' - '[10 rows x 2 columns]') - self.assertEqual(repr(df), expected) - - dts = [pd.NaT] * 5 + [pd.Timestamp('2011-01-01', tz='US/Eastern')] * 5 - df = pd.DataFrame({"dt": dts, - "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}) - with option_context('display.max_rows', 5): - expected = (' dt x\n' - '0 NaT 1\n' - '1 NaT 2\n' - '.. ... ..\n' - '8 2011-01-01 00:00:00-05:00 9\n' - '9 2011-01-01 00:00:00-05:00 10\n\n' - '[10 rows x 2 columns]') - self.assertEqual(repr(df), expected) - - dts = ([pd.Timestamp('2011-01-01', tz='Asia/Tokyo')] * 5 + - [pd.Timestamp('2011-01-01', tz='US/Eastern')] * 5) - df = pd.DataFrame({"dt": dts, - "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}) - with option_context('display.max_rows', 5): - expected = (' dt x\n' - '0 2011-01-01 00:00:00+09:00 1\n' - '1 2011-01-01 00:00:00+09:00 2\n' - '.. ... ..\n' - '8 2011-01-01 00:00:00-05:00 9\n' - '9 2011-01-01 00:00:00-05:00 10\n\n' - '[10 rows x 2 columns]') - self.assertEqual(repr(df), expected) - - def test_to_html_with_col_space(self): - def check_with_width(df, col_space): - import re - # check that col_space affects HTML generation - # and be very brittle about it. - html = df.to_html(col_space=col_space) - hdrs = [x for x in html.split(r"\n") if re.search(r"\s]", x)] - self.assertTrue(len(hdrs) > 0) - for h in hdrs: - self.assertTrue("min-width" in h) - self.assertTrue(str(col_space) in h) - - df = DataFrame(np.random.random(size=(1, 3))) - - check_with_width(df, 30) - check_with_width(df, 50) - - def test_to_html_with_empty_string_label(self): - # GH3547, to_html regards empty string labels as repeated labels - data = {'c1': ['a', 'b'], 'c2': ['a', ''], 'data': [1, 2]} - df = DataFrame(data).set_index(['c1', 'c2']) - res = df.to_html() - self.assertTrue("rowspan" not in res) - - def test_to_html_unicode(self): - df = DataFrame({u('\u03c3'): np.arange(10.)}) - expected = u'\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
[expected to_html markup lost in extraction; recoverable cells: header '\u03c3'; body rows 0–9 with values 0.0 … 9.0]
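[editor's note: the verbatim expected string above is unrecoverable, so here is a minimal sketch, written for this rewrite, of the property `test_to_html_unicode` exercised; the frame construction mirrors the deleted code, the assertions are my own.]

```python
import numpy as np
import pandas as pd

# same frame as the deleted test: one unicode-named float column
df = pd.DataFrame({u'\u03c3': np.arange(10.)})
html = df.to_html()

# the full expected markup is lost, so check only what the residue
# above still shows: the unicode header cell and ten data cells
assert u'\u03c3' in html
assert html.count('<td>') == 10
```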
' - self.assertEqual(df.to_html(), expected) - df = DataFrame({'A': [u('\u03c3')]}) - expected = u'\n \n \n \n \n \n \n \n \n \n \n \n \n
[expected to_html markup lost in extraction; recoverable cells: header 'A'; row 0 with value '\u03c3']
' - self.assertEqual(df.to_html(), expected) - - def test_to_html_decimal(self): - # GH 12031 - df = DataFrame({'A': [6.0, 3.1, 2.2]}) - result = df.to_html(decimal=',') - expected = ('\n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - '
[expected to_html markup lost in extraction; recoverable cells: header 'A'; rows 0–2 with comma-decimal values '6,0', '3,1', '2,2']
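[editor's note: a self-contained illustration, added in this rewrite, of the `decimal` keyword the deleted GH 12031 test covered; the assertions restate the cell values visible in the residue above.]

```python
import pandas as pd

df = pd.DataFrame({'A': [6.0, 3.1, 2.2]})
html = df.to_html(decimal=',')

# floats are rendered with a comma as the decimal separator
for cell in ('6,0', '3,1', '2,2'):
    assert cell in html
```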
') - self.assertEqual(result, expected) - - def test_to_html_escaped(self): - a = 'str", - b: ""}, - 'co>l2': {a: "", - b: ""}} - rs = DataFrame(test_dict).to_html() - xp = """ - - - - - - - - - - - - - - - - - - - -
[expected escaped markup lost in extraction; recoverable cells: headers 'co<l1' and 'co>l2'; index entries 'str<ing1 &amp;' and 'stri>ng2 &amp;' with "<type 'str'>" values]
""" - - self.assertEqual(xp, rs) - - def test_to_html_escape_disabled(self): - a = 'strbold", - b: "bold"}, - 'co>l2': {a: "bold", - b: "bold"}} - rs = DataFrame(test_dict).to_html(escape=False) - xp = """ - - - - - - - - - - - - - - - - - -
[expected markup (escape=False) lost in extraction; recoverable cells: headers 'co<l1' and 'co>l2'; index entries 'str<ing1 &' and 'stri>ng2 &' with bold cell values]
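[editor's note: the expected strings of the two escaping tests above are garbled, so this is a hedged reconstruction of the behaviour they pinned down, using simplified data of my own.]

```python
import pandas as pd

df = pd.DataFrame({'co<l1': ['str<ing1 &'], 'co>l2': ['<b>bold</b>']})

escaped = df.to_html()                 # escape=True is the default
unescaped = df.to_html(escape=False)

# special characters are entity-escaped by default,
# and passed through verbatim when escaping is disabled
assert 'co&lt;l1' in escaped and '&amp;' in escaped
assert 'co<l1' in unescaped and '<b>bold</b>' in unescaped
```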
""" - - self.assertEqual(xp, rs) - - def test_to_html_multiindex_index_false(self): - # issue 8452 - df = DataFrame({ - 'a': range(2), - 'b': range(3, 5), - 'c': range(5, 7), - 'd': range(3, 5) - }) - df.columns = MultiIndex.from_product([['a', 'b'], ['c', 'd']]) - result = df.to_html(index=False) - expected = """\ - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[expected to_html markup lost in extraction; recoverable cells: header rows 'a b' / 'c d c d'; data rows 0 → 0 3 5 3 and 1 → 1 4 6 4]
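[editor's note: a short sketch, added here, of the `index=False` behaviour with MultiIndex columns that the deleted GH 8452 test checked; the frame mirrors the deleted code, the assertion is mine.]

```python
import pandas as pd

df = pd.DataFrame({'a': range(2), 'b': range(3, 5),
                   'c': range(5, 7), 'd': range(3, 5)})
df.columns = pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']])

html = df.to_html(index=False)
# no row-label cells: the 0/1 index never appears as a header cell
assert '<th>0</th>' not in html
```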
""" - - self.assertEqual(result, expected) - - df.index = Index(df.index.values, name='idx') - result = df.to_html(index=False) - self.assertEqual(result, expected) - - def test_to_html_multiindex_sparsify_false_multi_sparse(self): - with option_context('display.multi_sparse', False): - index = MultiIndex.from_arrays([[0, 0, 1, 1], [0, 1, 0, 1]], - names=['foo', None]) - - df = DataFrame([[0, 1], [2, 3], [4, 5], [6, 7]], index=index) - - result = df.to_html() - expected = """\ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[expected to_html markup lost in extraction; recoverable cells: column headers 0 1; index name 'foo'; unsparsified rows (0,0) → 0 1, (0,1) → 2 3, (1,0) → 4 5, (1,1) → 6 7]
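[editor's note: a compact demonstration, added in this rewrite, of the `display.multi_sparse` option the expected tables above contrast; only the inequality is asserted, not the exact markup.]

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([[0, 0, 1, 1], [0, 1, 0, 1]],
                                  names=['foo', None])
df = pd.DataFrame([[0, 1], [2, 3], [4, 5], [6, 7]], index=index)

with pd.option_context('display.multi_sparse', False):
    flat = df.to_html()    # every row repeats its outer 'foo' label

sparse = df.to_html()      # default: repeated outer labels are grouped
assert flat != sparse
```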
""" - - self.assertEqual(result, expected) - - df = DataFrame([[0, 1], [2, 3], [4, 5], [6, 7]], - columns=index[::2], index=index) - - result = df.to_html() - expected = """\ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[expected to_html markup lost in extraction; recoverable cells: column MultiIndex header rows 'foo: 0 1' / '0 0'; index name 'foo'; unsparsified rows (0,0) → 0 1, (0,1) → 2 3, (1,0) → 4 5, (1,1) → 6 7]
""" - - self.assertEqual(result, expected) - - def test_to_html_multiindex_sparsify(self): - index = MultiIndex.from_arrays([[0, 0, 1, 1], [0, 1, 0, 1]], - names=['foo', None]) - - df = DataFrame([[0, 1], [2, 3], [4, 5], [6, 7]], index=index) - - result = df.to_html() - expected = """ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[expected to_html markup lost in extraction; recoverable cells: column headers 0 1; index name 'foo'; sparsified rows (0,0) → 0 1, (·,1) → 2 3, (1,0) → 4 5, (·,1) → 6 7]
""" - - self.assertEqual(result, expected) - - df = DataFrame([[0, 1], [2, 3], [4, 5], [6, 7]], columns=index[::2], - index=index) - - result = df.to_html() - expected = """\ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[expected to_html markup lost in extraction; recoverable cells: column MultiIndex header rows 'foo: 0 1' / '0 0'; index name 'foo'; sparsified rows (0,0) → 0 1, (·,1) → 2 3, (1,0) → 4 5, (·,1) → 6 7]
""" - - self.assertEqual(result, expected) - - def test_to_html_index_formatter(self): - df = DataFrame([[0, 1], [2, 3], [4, 5], [6, 7]], columns=['foo', None], - index=lrange(4)) - - f = lambda x: 'abcd' [x] - result = df.to_html(formatters={'__index__': f}) - expected = """\ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[expected to_html markup lost in extraction; recoverable cells: headers 'foo', 'None'; index formatted a–d with rows a → 0 1, b → 2 3, c → 4 5, d → 6 7]
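[editor's note: a minimal sketch, added here, of the special `'__index__'` formatter key shown in the deleted test; the frame and formatter mirror the deleted code, the assertion is mine.]

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [2, 3], [4, 5], [6, 7]],
                  columns=['foo', None])

# '__index__' formats the row labels rather than a data column
html = df.to_html(formatters={'__index__': lambda i: 'abcd'[i]})
assert '<th>a</th>' in html and '<th>d</th>' in html
```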
""" - - self.assertEqual(result, expected) - - def test_to_html_datetime64_monthformatter(self): - months = [datetime(2016, 1, 1), datetime(2016, 2, 2)] - x = DataFrame({'months': months}) - - def format_func(x): - return x.strftime('%Y-%m') - result = x.to_html(formatters={'months': format_func}) - expected = """\ - - - - - - - - - - - - - - - - - -
[expected to_html markup lost in extraction; recoverable cells: header 'months'; rows 0 → 2016-01, 1 → 2016-02]
""" - self.assertEqual(result, expected) - - def test_to_html_datetime64_hourformatter(self): - - x = DataFrame({'hod': pd.to_datetime(['10:10:10.100', '12:12:12.120'], - format='%H:%M:%S.%f')}) - - def format_func(x): - return x.strftime('%H:%M') - result = x.to_html(formatters={'hod': format_func}) - expected = """\ - - - - - - - - - - - - - - - - - -
[expected to_html markup lost in extraction; recoverable cells: header 'hod'; rows 0 → 10:10, 1 → 12:12]
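[editor's note: the month and hour formatter tests above share one idea; this added sketch shows it once, with assertions limited to the substrings visible in the residue.]

```python
from datetime import datetime
import pandas as pd

x = pd.DataFrame({'months': [datetime(2016, 1, 1), datetime(2016, 2, 2)]})

# a per-column formatter receives each Timestamp and returns its cell text
html = x.to_html(formatters={'months': lambda ts: ts.strftime('%Y-%m')})
assert '2016-01' in html and '2016-02' in html
```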
""" - self.assertEqual(result, expected) - - def test_to_html_regression_GH6098(self): - df = DataFrame({u('clé1'): [u('a'), u('a'), u('b'), u('b'), u('a')], - u('clé2'): [u('1er'), u('2ème'), u('1er'), u('2ème'), - u('1er')], - 'données1': np.random.randn(5), - 'données2': np.random.randn(5)}) - # it works - df.pivot_table(index=[u('clé1')], columns=[u('clé2')])._repr_html_() - - def test_to_html_truncate(self): - raise nose.SkipTest("unreliable on travis") - index = pd.DatetimeIndex(start='20010101', freq='D', periods=20) - df = DataFrame(index=index, columns=range(20)) - fmt.set_option('display.max_rows', 8) - fmt.set_option('display.max_columns', 4) - result = df._repr_html_() - expected = '''\ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[expected truncated repr markup lost in extraction; recoverable: columns 0, 1, ..., 18, 19; all-NaN rows 2001-01-01 … 2001-01-04 and 2001-01-17 … 2001-01-20 around an ellipsis row; footer '20 rows × 20 columns']
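[editor's note: the truncation test above was skipped as unreliable on travis and its expected markup is garbled; this added sketch only demonstrates the options it set and the dimensions footer recorded in the residue.]

```python
import pandas as pd

df = pd.DataFrame(index=pd.date_range('2001-01-01', periods=20),
                  columns=range(20))

with pd.option_context('display.max_rows', 8, 'display.max_columns', 4):
    html = df._repr_html_()

# a truncated repr carries the dimensions footer seen above
assert '20 rows' in html and '20 columns' in html
```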
-'''.format(div_style) - if compat.PY2: - expected = expected.decode('utf-8') - self.assertEqual(result, expected) - - def test_to_html_truncate_multi_index(self): - raise nose.SkipTest("unreliable on travis") - arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], - ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] - df = DataFrame(index=arrays, columns=arrays) - fmt.set_option('display.max_rows', 7) - fmt.set_option('display.max_columns', 7) - result = df._repr_html_() - expected = '''\ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[expected truncated repr markup lost in extraction; recoverable: sparsified MultiIndex bar/baz/foo/qux × one/two on both axes, all NaN, truncated with ellipses; footer '8 rows × 8 columns']
-'''.format(div_style) - if compat.PY2: - expected = expected.decode('utf-8') - self.assertEqual(result, expected) - - def test_to_html_truncate_multi_index_sparse_off(self): - raise nose.SkipTest("unreliable on travis") - arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], - ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] - df = DataFrame(index=arrays, columns=arrays) - fmt.set_option('display.max_rows', 7) - fmt.set_option('display.max_columns', 7) - fmt.set_option('display.multi_sparse', False) - result = df._repr_html_() - expected = '''\ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[expected truncated repr markup lost in extraction; recoverable: unsparsified MultiIndex bar/baz/foo/qux × one/two on both axes, all NaN, truncated with ellipses; footer '8 rows × 8 columns']
-'''.format(div_style) - if compat.PY2: - expected = expected.decode('utf-8') - self.assertEqual(result, expected) - - def test_to_html_border(self): - df = DataFrame({'A': [1, 2]}) - result = df.to_html() - assert 'border="1"' in result - - def test_to_html_border_option(self): - df = DataFrame({'A': [1, 2]}) - with pd.option_context('html.border', 0): - result = df.to_html() - self.assertTrue('border="0"' in result) - self.assertTrue('border="0"' in df._repr_html_()) - - def test_to_html_border_zero(self): - df = DataFrame({'A': [1, 2]}) - result = df.to_html(border=0) - self.assertTrue('border="0"' in result) - - def test_nonunicode_nonascii_alignment(self): - df = DataFrame([["aa\xc3\xa4\xc3\xa4", 1], ["bbbb", 2]]) - rep_str = df.to_string() - lines = rep_str.split('\n') - self.assertEqual(len(lines[1]), len(lines[2])) - - def test_unicode_problem_decoding_as_ascii(self): - dm = DataFrame({u('c/\u03c3'): Series({'test': np.NaN})}) - compat.text_type(dm.to_string()) - - def test_string_repr_encoding(self): - filepath = tm.get_data_path('unicode_series.csv') - df = pd.read_csv(filepath, header=None, encoding='latin1') - repr(df) - repr(df[1]) - - def test_repr_corner(self): - # representing infs poses no problems - df = DataFrame({'foo': [-np.inf, np.inf]}) - repr(df) - - def test_frame_info_encoding(self): - index = ['\'Til There Was You (1997)', - 'ldum klaka (Cold Fever) (1994)'] - fmt.set_option('display.max_rows', 1) - df = DataFrame(columns=['a', 'b', 'c'], index=index) - repr(df) - repr(df.T) - fmt.set_option('display.max_rows', 200) - - def test_pprint_thing(self): - from pandas.formats.printing import pprint_thing as pp_t - - if PY3: - raise nose.SkipTest("doesn't work on Python 3") - - self.assertEqual(pp_t('a'), u('a')) - self.assertEqual(pp_t(u('a')), u('a')) - self.assertEqual(pp_t(None), 'None') - self.assertEqual(pp_t(u('\u05d0'), quote_strings=True), u("u'\u05d0'")) - self.assertEqual(pp_t(u('\u05d0'), quote_strings=False), u('\u05d0')) - self.assertEqual(pp_t((u('\u05d0'), - u('\u05d1')), quote_strings=True), - u("(u'\u05d0', u'\u05d1')")) - self.assertEqual(pp_t((u('\u05d0'), (u('\u05d1'), - u('\u05d2'))), - quote_strings=True), - u("(u'\u05d0', (u'\u05d1', u'\u05d2'))")) - self.assertEqual(pp_t(('foo', u('\u05d0'), (u('\u05d0'), - u('\u05d0'))), - quote_strings=True), - u("(u'foo', u'\u05d0', (u'\u05d0', u'\u05d0'))")) - - # escape embedded tabs in string - # GH #2038 - self.assertTrue(not "\t" in pp_t("a\tb", escape_chars=("\t", ))) - - def test_wide_repr(self): - with option_context('mode.sim_interactive', True, - 'display.show_dimensions', True): - max_cols = get_option('display.max_columns') - df = DataFrame(tm.rands_array(25, size=(10, max_cols - 1))) - set_option('display.expand_frame_repr', False) - rep_str = repr(df) - - assert "10 rows x %d columns" % (max_cols - 1) in rep_str - set_option('display.expand_frame_repr', True) - wide_repr = repr(df) - self.assertNotEqual(rep_str, wide_repr) - - with option_context('display.width', 120): - wider_repr = repr(df) - self.assertTrue(len(wider_repr) < len(wide_repr)) - - reset_option('display.expand_frame_repr') - - def test_wide_repr_wide_columns(self): - with option_context('mode.sim_interactive', True): - df = DataFrame(randn(5, 3), columns=['a' * 90, 'b' * 90, 'c' * 90]) - rep_str = repr(df) - - self.assertEqual(len(rep_str.splitlines()), 20) - - def test_wide_repr_named(self): - with option_context('mode.sim_interactive', True): - max_cols = get_option('display.max_columns') - df = DataFrame(tm.rands_array(25, 
size=(10, max_cols - 1))) - df.index.name = 'DataFrame Index' - set_option('display.expand_frame_repr', False) - - rep_str = repr(df) - set_option('display.expand_frame_repr', True) - wide_repr = repr(df) - self.assertNotEqual(rep_str, wide_repr) - - with option_context('display.width', 150): - wider_repr = repr(df) - self.assertTrue(len(wider_repr) < len(wide_repr)) - - for line in wide_repr.splitlines()[1::13]: - self.assertIn('DataFrame Index', line) - - reset_option('display.expand_frame_repr') - - def test_wide_repr_multiindex(self): - with option_context('mode.sim_interactive', True): - midx = MultiIndex.from_arrays(tm.rands_array(5, size=(2, 10))) - max_cols = get_option('display.max_columns') - df = DataFrame(tm.rands_array(25, size=(10, max_cols - 1)), - index=midx) - df.index.names = ['Level 0', 'Level 1'] - set_option('display.expand_frame_repr', False) - rep_str = repr(df) - set_option('display.expand_frame_repr', True) - wide_repr = repr(df) - self.assertNotEqual(rep_str, wide_repr) - - with option_context('display.width', 150): - wider_repr = repr(df) - self.assertTrue(len(wider_repr) < len(wide_repr)) - - for line in wide_repr.splitlines()[1::13]: - self.assertIn('Level 0 Level 1', line) - - reset_option('display.expand_frame_repr') - - def test_wide_repr_multiindex_cols(self): - with option_context('mode.sim_interactive', True): - max_cols = get_option('display.max_columns') - midx = MultiIndex.from_arrays(tm.rands_array(5, size=(2, 10))) - mcols = MultiIndex.from_arrays(tm.rands_array(3, size=(2, max_cols - - 1))) - df = DataFrame(tm.rands_array(25, (10, max_cols - 1)), - index=midx, columns=mcols) - df.index.names = ['Level 0', 'Level 1'] - set_option('display.expand_frame_repr', False) - rep_str = repr(df) - set_option('display.expand_frame_repr', True) - wide_repr = repr(df) - self.assertNotEqual(rep_str, wide_repr) - - with option_context('display.width', 150): - wider_repr = repr(df) - self.assertTrue(len(wider_repr) < len(wide_repr)) - - reset_option('display.expand_frame_repr') - - def test_wide_repr_unicode(self): - with option_context('mode.sim_interactive', True): - max_cols = get_option('display.max_columns') - df = DataFrame(tm.rands_array(25, size=(10, max_cols - 1))) - set_option('display.expand_frame_repr', False) - rep_str = repr(df) - set_option('display.expand_frame_repr', True) - wide_repr = repr(df) - self.assertNotEqual(rep_str, wide_repr) - - with option_context('display.width', 150): - wider_repr = repr(df) - self.assertTrue(len(wider_repr) < len(wide_repr)) - - reset_option('display.expand_frame_repr') - - def test_wide_repr_wide_long_columns(self): - with option_context('mode.sim_interactive', True): - df = DataFrame({'a': ['a' * 30, 'b' * 30], - 'b': ['c' * 70, 'd' * 80]}) - - result = repr(df) - self.assertTrue('ccccc' in result) - self.assertTrue('ddddd' in result) - - def test_long_series(self): - n = 1000 - s = Series( - np.random.randint(-50, 50, n), - index=['s%04d' % x for x in range(n)], dtype='int64') - - import re - str_rep = str(s) - nmatches = len(re.findall('dtype', str_rep)) - self.assertEqual(nmatches, 1) - - def test_index_with_nan(self): - # GH 2850 - df = DataFrame({'id1': {0: '1a3', - 1: '9h4'}, - 'id2': {0: np.nan, - 1: 'd67'}, - 'id3': {0: '78d', - 1: '79d'}, - 'value': {0: 123, - 1: 64}}) - - # multi-index - y = df.set_index(['id1', 'id2', 'id3']) - result = y.to_string() - expected = u( - ' value\nid1 id2 id3 \n1a3 NaN 78d 123\n9h4 d67 79d 64') - self.assertEqual(result, expected) - - # index - y = df.set_index('id2') - 
result = y.to_string() - expected = u( - ' id1 id3 value\nid2 \nNaN 1a3 78d 123\nd67 9h4 79d 64') - self.assertEqual(result, expected) - - # with append (this failed in 0.12) - y = df.set_index(['id1', 'id2']).set_index('id3', append=True) - result = y.to_string() - expected = u( - ' value\nid1 id2 id3 \n1a3 NaN 78d 123\n9h4 d67 79d 64') - self.assertEqual(result, expected) - - # all-nan in mi - df2 = df.copy() - df2.ix[:, 'id2'] = np.nan - y = df2.set_index('id2') - result = y.to_string() - expected = u( - ' id1 id3 value\nid2 \nNaN 1a3 78d 123\nNaN 9h4 79d 64') - self.assertEqual(result, expected) - - # partial nan in mi - df2 = df.copy() - df2.ix[:, 'id2'] = np.nan - y = df2.set_index(['id2', 'id3']) - result = y.to_string() - expected = u( - ' id1 value\nid2 id3 \nNaN 78d 1a3 123\n 79d 9h4 64') - self.assertEqual(result, expected) - - df = DataFrame({'id1': {0: np.nan, - 1: '9h4'}, - 'id2': {0: np.nan, - 1: 'd67'}, - 'id3': {0: np.nan, - 1: '79d'}, - 'value': {0: 123, - 1: 64}}) - - y = df.set_index(['id1', 'id2', 'id3']) - result = y.to_string() - expected = u( - ' value\nid1 id2 id3 \nNaN NaN NaN 123\n9h4 d67 79d 64') - self.assertEqual(result, expected) - - def test_to_string(self): - from pandas import read_table - import re - - # big mixed - biggie = DataFrame({'A': randn(200), - 'B': tm.makeStringIndex(200)}, - index=lrange(200)) - - biggie.loc[:20, 'A'] = nan - biggie.loc[:20, 'B'] = nan - s = biggie.to_string() - - buf = StringIO() - retval = biggie.to_string(buf=buf) - self.assertIsNone(retval) - self.assertEqual(buf.getvalue(), s) - - tm.assertIsInstance(s, compat.string_types) - - # print in right order - result = biggie.to_string(columns=['B', 'A'], col_space=17, - float_format='%.5f'.__mod__) - lines = result.split('\n') - header = lines[0].strip().split() - joined = '\n'.join([re.sub(r'\s+', ' ', x).strip() for x in lines[1:]]) - recons = read_table(StringIO(joined), names=header, - header=None, sep=' ') - tm.assert_series_equal(recons['B'], biggie['B']) - self.assertEqual(recons['A'].count(), biggie['A'].count()) - self.assertTrue((np.abs(recons['A'].dropna() - biggie['A'].dropna()) < - 0.1).all()) - - # expected = ['B', 'A'] - # self.assertEqual(header, expected) - - result = biggie.to_string(columns=['A'], col_space=17) - header = result.split('\n')[0].strip().split() - expected = ['A'] - self.assertEqual(header, expected) - - biggie.to_string(columns=['B', 'A'], - formatters={'A': lambda x: '%.1f' % x}) - - biggie.to_string(columns=['B', 'A'], float_format=str) - biggie.to_string(columns=['B', 'A'], col_space=12, float_format=str) - - frame = DataFrame(index=np.arange(200)) - frame.to_string() - - def test_to_string_no_header(self): - df = DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}) - - df_s = df.to_string(header=False) - expected = "0 1 4\n1 2 5\n2 3 6" - - self.assertEqual(df_s, expected) - - def test_to_string_no_index(self): - df = DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}) - - df_s = df.to_string(index=False) - expected = "x y\n1 4\n2 5\n3 6" - - self.assertEqual(df_s, expected) - - def test_to_string_line_width_no_index(self): - df = DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}) - - df_s = df.to_string(line_width=1, index=False) - expected = "x \\\n1 \n2 \n3 \n\ny \n4 \n5 \n6" - - self.assertEqual(df_s, expected) - - def test_to_string_float_formatting(self): - self.reset_display_options() - fmt.set_option('display.precision', 5, 'display.column_space', 12, - 'display.notebook_repr_html', False) - - df = DataFrame({'x': [0, 0.25, 3456.000, 12e+45, 1.64e+6, 
1.7e+8, - 1.253456, np.pi, -1e6]}) - - df_s = df.to_string() - - # Python 2.5 just wants me to be sad. And debian 32-bit - # sys.version_info[0] == 2 and sys.version_info[1] < 6: - if _three_digit_exp(): - expected = (' x\n0 0.00000e+000\n1 2.50000e-001\n' - '2 3.45600e+003\n3 1.20000e+046\n4 1.64000e+006\n' - '5 1.70000e+008\n6 1.25346e+000\n7 3.14159e+000\n' - '8 -1.00000e+006') - else: - expected = (' x\n0 0.00000e+00\n1 2.50000e-01\n' - '2 3.45600e+03\n3 1.20000e+46\n4 1.64000e+06\n' - '5 1.70000e+08\n6 1.25346e+00\n7 3.14159e+00\n' - '8 -1.00000e+06') - self.assertEqual(df_s, expected) - - df = DataFrame({'x': [3234, 0.253]}) - df_s = df.to_string() - - expected = (' x\n' '0 3234.000\n' '1 0.253') - self.assertEqual(df_s, expected) - - self.reset_display_options() - self.assertEqual(get_option("display.precision"), 6) - - df = DataFrame({'x': [1e9, 0.2512]}) - df_s = df.to_string() - # Python 2.5 just wants me to be sad. And debian 32-bit - # sys.version_info[0] == 2 and sys.version_info[1] < 6: - if _three_digit_exp(): - expected = (' x\n' - '0 1.000000e+009\n' - '1 2.512000e-001') - else: - expected = (' x\n' - '0 1.000000e+09\n' - '1 2.512000e-01') - self.assertEqual(df_s, expected) - - def test_to_string_small_float_values(self): - df = DataFrame({'a': [1.5, 1e-17, -5.5e-7]}) - - result = df.to_string() - # sadness per above - if '%.4g' % 1.7e8 == '1.7e+008': - expected = (' a\n' - '0 1.500000e+000\n' - '1 1.000000e-017\n' - '2 -5.500000e-007') - else: - expected = (' a\n' - '0 1.500000e+00\n' - '1 1.000000e-17\n' - '2 -5.500000e-07') - self.assertEqual(result, expected) - - # but not all exactly zero - df = df * 0 - result = df.to_string() - expected = (' 0\n' '0 0\n' '1 0\n' '2 -0') - - def test_to_string_float_index(self): - index = Index([1.5, 2, 3, 4, 5]) - df = DataFrame(lrange(5), index=index) - - result = df.to_string() - expected = (' 0\n' - '1.5 0\n' - '2.0 1\n' - '3.0 2\n' - '4.0 3\n' - '5.0 4') - self.assertEqual(result, expected) - - def test_to_string_ascii_error(self): - data = [('0 ', u(' .gitignore '), u(' 5 '), - ' \xe2\x80\xa2\xe2\x80\xa2\xe2\x80' - '\xa2\xe2\x80\xa2\xe2\x80\xa2')] - df = DataFrame(data) - - # it works! 
- repr(df) - - def test_to_string_int_formatting(self): - df = DataFrame({'x': [-15, 20, 25, -35]}) - self.assertTrue(issubclass(df['x'].dtype.type, np.integer)) - - output = df.to_string() - expected = (' x\n' '0 -15\n' '1 20\n' '2 25\n' '3 -35') - self.assertEqual(output, expected) - - def test_to_string_index_formatter(self): - df = DataFrame([lrange(5), lrange(5, 10), lrange(10, 15)]) - - rs = df.to_string(formatters={'__index__': lambda x: 'abc' [x]}) - - xp = """\ - 0 1 2 3 4 -a 0 1 2 3 4 -b 5 6 7 8 9 -c 10 11 12 13 14\ -""" - - self.assertEqual(rs, xp) - - def test_to_string_left_justify_cols(self): - self.reset_display_options() - df = DataFrame({'x': [3234, 0.253]}) - df_s = df.to_string(justify='left') - expected = (' x \n' '0 3234.000\n' '1 0.253') - self.assertEqual(df_s, expected) - - def test_to_string_format_na(self): - self.reset_display_options() - df = DataFrame({'A': [np.nan, -1, -2.1234, 3, 4], - 'B': [np.nan, 'foo', 'foooo', 'fooooo', 'bar']}) - result = df.to_string() - - expected = (' A B\n' - '0 NaN NaN\n' - '1 -1.0000 foo\n' - '2 -2.1234 foooo\n' - '3 3.0000 fooooo\n' - '4 4.0000 bar') - self.assertEqual(result, expected) - - df = DataFrame({'A': [np.nan, -1., -2., 3., 4.], - 'B': [np.nan, 'foo', 'foooo', 'fooooo', 'bar']}) - result = df.to_string() - - expected = (' A B\n' - '0 NaN NaN\n' - '1 -1.0 foo\n' - '2 -2.0 foooo\n' - '3 3.0 fooooo\n' - '4 4.0 bar') - self.assertEqual(result, expected) - - def test_to_string_line_width(self): - df = DataFrame(123, lrange(10, 15), lrange(30)) - s = df.to_string(line_width=80) - self.assertEqual(max(len(l) for l in s.split('\n')), 80) - - def test_show_dimensions(self): - df = DataFrame(123, lrange(10, 15), lrange(30)) - - with option_context('display.max_rows', 10, 'display.max_columns', 40, - 'display.width', 500, 'display.expand_frame_repr', - 'info', 'display.show_dimensions', True): - self.assertTrue('5 rows' in str(df)) - self.assertTrue('5 rows' in df._repr_html_()) - with option_context('display.max_rows', 10, 'display.max_columns', 40, - 'display.width', 500, 'display.expand_frame_repr', - 'info', 'display.show_dimensions', False): - self.assertFalse('5 rows' in str(df)) - self.assertFalse('5 rows' in df._repr_html_()) - with option_context('display.max_rows', 2, 'display.max_columns', 2, - 'display.width', 500, 'display.expand_frame_repr', - 'info', 'display.show_dimensions', 'truncate'): - self.assertTrue('5 rows' in str(df)) - self.assertTrue('5 rows' in df._repr_html_()) - with option_context('display.max_rows', 10, 'display.max_columns', 40, - 'display.width', 500, 'display.expand_frame_repr', - 'info', 'display.show_dimensions', 'truncate'): - self.assertFalse('5 rows' in str(df)) - self.assertFalse('5 rows' in df._repr_html_()) - - def test_to_html(self): - # big mixed - biggie = DataFrame({'A': randn(200), - 'B': tm.makeStringIndex(200)}, - index=lrange(200)) - - biggie.loc[:20, 'A'] = nan - biggie.loc[:20, 'B'] = nan - s = biggie.to_html() - - buf = StringIO() - retval = biggie.to_html(buf=buf) - self.assertIsNone(retval) - self.assertEqual(buf.getvalue(), s) - - tm.assertIsInstance(s, compat.string_types) - - biggie.to_html(columns=['B', 'A'], col_space=17) - biggie.to_html(columns=['B', 'A'], - formatters={'A': lambda x: '%.1f' % x}) - - biggie.to_html(columns=['B', 'A'], float_format=str) - biggie.to_html(columns=['B', 'A'], col_space=12, float_format=str) - - frame = DataFrame(index=np.arange(200)) - frame.to_html() - - def test_to_html_filename(self): - biggie = DataFrame({'A': randn(200), - 'B': 
tm.makeStringIndex(200)}, - index=lrange(200)) - - biggie.loc[:20, 'A'] = nan - biggie.loc[:20, 'B'] = nan - with tm.ensure_clean('test.html') as path: - biggie.to_html(path) - with open(path, 'r') as f: - s = biggie.to_html() - s2 = f.read() - self.assertEqual(s, s2) - - frame = DataFrame(index=np.arange(200)) - with tm.ensure_clean('test.html') as path: - frame.to_html(path) - with open(path, 'r') as f: - self.assertEqual(frame.to_html(), f.read()) - - def test_to_html_with_no_bold(self): - x = DataFrame({'x': randn(5)}) - ashtml = x.to_html(bold_rows=False) - self.assertFalse('")]) - - def test_to_html_columns_arg(self): - result = self.frame.to_html(columns=['A']) - self.assertNotIn('B', result) - - def test_to_html_multiindex(self): - columns = MultiIndex.from_tuples(list(zip(np.arange(2).repeat(2), - np.mod(lrange(4), 2))), - names=['CL0', 'CL1']) - df = DataFrame([list('abcd'), list('efgh')], columns=columns) - result = df.to_html(justify='left') - expected = ('\n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - '
CL001
CL10101
0abcd
1efgh
') - - self.assertEqual(result, expected) - - columns = MultiIndex.from_tuples(list(zip( - range(4), np.mod( - lrange(4), 2)))) - df = DataFrame([list('abcd'), list('efgh')], columns=columns) - - result = df.to_html(justify='right') - expected = ('\n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - '
0123
0101
0abcd
1efgh
') - - self.assertEqual(result, expected) - - def test_to_html_justify(self): - df = DataFrame({'A': [6, 30000, 2], - 'B': [1, 2, 70000], - 'C': [223442, 0, 1]}, - columns=['A', 'B', 'C']) - result = df.to_html(justify='left') - expected = ('\n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - '
ABC
061223442
13000020
22700001
') - self.assertEqual(result, expected) - - result = df.to_html(justify='right') - expected = ('\n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - '
ABC
061223442
13000020
22700001
') - self.assertEqual(result, expected) - - def test_to_html_index(self): - index = ['foo', 'bar', 'baz'] - df = DataFrame({'A': [1, 2, 3], - 'B': [1.2, 3.4, 5.6], - 'C': ['one', 'two', np.NaN]}, - columns=['A', 'B', 'C'], - index=index) - expected_with_index = ('\n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - '
ABC
foo11.2one
bar23.4two
baz35.6NaN
') - self.assertEqual(df.to_html(), expected_with_index) - - expected_without_index = ('\n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - '
ABC
11.2one
23.4two
35.6NaN
') - result = df.to_html(index=False) - for i in index: - self.assertNotIn(i, result) - self.assertEqual(result, expected_without_index) - df.index = Index(['foo', 'bar', 'baz'], name='idx') - expected_with_index = ('\n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - '
ABC
idx
foo11.2one
bar23.4two
baz35.6NaN
') - self.assertEqual(df.to_html(), expected_with_index) - self.assertEqual(df.to_html(index=False), expected_without_index) - - tuples = [('foo', 'car'), ('foo', 'bike'), ('bar', 'car')] - df.index = MultiIndex.from_tuples(tuples) - - expected_with_index = ('\n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - '
ABC
foocar11.2one
bike23.4two
barcar35.6NaN
') - self.assertEqual(df.to_html(), expected_with_index) - - result = df.to_html(index=False) - for i in ['foo', 'bar', 'car', 'bike']: - self.assertNotIn(i, result) - # must be the same result as normal index - self.assertEqual(result, expected_without_index) - - df.index = MultiIndex.from_tuples(tuples, names=['idx1', 'idx2']) - expected_with_index = ('\n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - ' \n' - '
ABC
idx1idx2
foocar11.2one
bike23.4two
barcar35.6NaN
') - self.assertEqual(df.to_html(), expected_with_index) - self.assertEqual(df.to_html(index=False), expected_without_index) - - def test_repr_html(self): - self.frame._repr_html_() - - fmt.set_option('display.max_rows', 1, 'display.max_columns', 1) - self.frame._repr_html_() - - fmt.set_option('display.notebook_repr_html', False) - self.frame._repr_html_() - - self.reset_display_options() - - df = DataFrame([[1, 2], [3, 4]]) - fmt.set_option('display.show_dimensions', True) - self.assertTrue('2 rows' in df._repr_html_()) - fmt.set_option('display.show_dimensions', False) - self.assertFalse('2 rows' in df._repr_html_()) - - self.reset_display_options() - - def test_repr_html_wide(self): - max_cols = get_option('display.max_columns') - df = DataFrame(tm.rands_array(25, size=(10, max_cols - 1))) - reg_repr = df._repr_html_() - assert "..." not in reg_repr - - wide_df = DataFrame(tm.rands_array(25, size=(10, max_cols + 1))) - wide_repr = wide_df._repr_html_() - assert "..." in wide_repr - - def test_repr_html_wide_multiindex_cols(self): - max_cols = get_option('display.max_columns') - - mcols = MultiIndex.from_product([np.arange(max_cols // 2), - ['foo', 'bar']], - names=['first', 'second']) - df = DataFrame(tm.rands_array(25, size=(10, len(mcols))), - columns=mcols) - reg_repr = df._repr_html_() - assert '...' not in reg_repr - - mcols = MultiIndex.from_product((np.arange(1 + (max_cols // 2)), - ['foo', 'bar']), - names=['first', 'second']) - df = DataFrame(tm.rands_array(25, size=(10, len(mcols))), - columns=mcols) - wide_repr = df._repr_html_() - assert '...' in wide_repr - - def test_repr_html_long(self): - max_rows = get_option('display.max_rows') - h = max_rows - 1 - df = DataFrame({'A': np.arange(1, 1 + h), 'B': np.arange(41, 41 + h)}) - reg_repr = df._repr_html_() - assert '..' not in reg_repr - assert str(41 + max_rows // 2) in reg_repr - - h = max_rows + 1 - df = DataFrame({'A': np.arange(1, 1 + h), 'B': np.arange(41, 41 + h)}) - long_repr = df._repr_html_() - assert '..' in long_repr - assert str(41 + max_rows // 2) not in long_repr - assert u('%d rows ') % h in long_repr - assert u('2 columns') in long_repr - - def test_repr_html_float(self): - max_rows = get_option('display.max_rows') - h = max_rows - 1 - df = DataFrame({'idx': np.linspace(-10, 10, h), - 'A': np.arange(1, 1 + h), - 'B': np.arange(41, 41 + h)}).set_index('idx') - reg_repr = df._repr_html_() - assert '..' not in reg_repr - assert str(40 + h) in reg_repr - - h = max_rows + 1 - df = DataFrame({'idx': np.linspace(-10, 10, h), - 'A': np.arange(1, 1 + h), - 'B': np.arange(41, 41 + h)}).set_index('idx') - long_repr = df._repr_html_() - assert '..' in long_repr - assert '31' not in long_repr - assert u('%d rows ') % h in long_repr - assert u('2 columns') in long_repr - - def test_repr_html_long_multiindex(self): - max_rows = get_option('display.max_rows') - max_L1 = max_rows // 2 - - tuples = list(itertools.product(np.arange(max_L1), ['foo', 'bar'])) - idx = MultiIndex.from_tuples(tuples, names=['first', 'second']) - df = DataFrame(np.random.randn(max_L1 * 2, 2), index=idx, - columns=['A', 'B']) - reg_repr = df._repr_html_() - assert '...' not in reg_repr - - tuples = list(itertools.product(np.arange(max_L1 + 1), ['foo', 'bar'])) - idx = MultiIndex.from_tuples(tuples, names=['first', 'second']) - df = DataFrame(np.random.randn((max_L1 + 1) * 2, 2), index=idx, - columns=['A', 'B']) - long_repr = df._repr_html_() - assert '...' 
in long_repr - - def test_repr_html_long_and_wide(self): - max_cols = get_option('display.max_columns') - max_rows = get_option('display.max_rows') - - h, w = max_rows - 1, max_cols - 1 - df = DataFrame(dict((k, np.arange(1, 1 + h)) for k in np.arange(w))) - assert '...' not in df._repr_html_() - - h, w = max_rows + 1, max_cols + 1 - df = DataFrame(dict((k, np.arange(1, 1 + h)) for k in np.arange(w))) - assert '...' in df._repr_html_() - - def test_info_repr(self): - max_rows = get_option('display.max_rows') - max_cols = get_option('display.max_columns') - # Long - h, w = max_rows + 1, max_cols - 1 - df = DataFrame(dict((k, np.arange(1, 1 + h)) for k in np.arange(w))) - assert has_vertically_truncated_repr(df) - with option_context('display.large_repr', 'info'): - assert has_info_repr(df) - - # Wide - h, w = max_rows - 1, max_cols + 1 - df = DataFrame(dict((k, np.arange(1, 1 + h)) for k in np.arange(w))) - assert has_horizontally_truncated_repr(df) - with option_context('display.large_repr', 'info'): - assert has_info_repr(df) - - def test_info_repr_max_cols(self): - # GH #6939 - df = DataFrame(randn(10, 5)) - with option_context('display.large_repr', 'info', - 'display.max_columns', 1, - 'display.max_info_columns', 4): - self.assertTrue(has_non_verbose_info_repr(df)) - - with option_context('display.large_repr', 'info', - 'display.max_columns', 1, - 'display.max_info_columns', 5): - self.assertFalse(has_non_verbose_info_repr(df)) - - # test verbose overrides - # fmt.set_option('display.max_info_columns', 4) # exceeded - - def test_info_repr_html(self): - max_rows = get_option('display.max_rows') - max_cols = get_option('display.max_columns') - # Long - h, w = max_rows + 1, max_cols - 1 - df = DataFrame(dict((k, np.arange(1, 1 + h)) for k in np.arange(w))) - assert r'<class' not in df._repr_html_() - with option_context('display.large_repr', 'info'): - assert r'<class' in df._repr_html_() - - # Wide - h, w = max_rows - 1, max_cols + 1 - df = DataFrame(dict((k, np.arange(1, 1 + h)) for k in np.arange(w))) - assert ' - - - - - - - - - - """).strip() - self.assertEqual(result, expected) - - result = df.to_html(classes=["sortable", "draggable"]) - self.assertEqual(result, expected) - - def test_pprint_pathological_object(self): - """ - if the test fails, the stack will overflow and nose crash, - but it won't hang. 
- """ - - class A: - - def __getitem__(self, key): - return 3 # obviously simplified - - df = DataFrame([A()]) - repr(df) # just don't dine - - def test_float_trim_zeros(self): - vals = [2.08430917305e+10, 3.52205017305e+10, 2.30674817305e+10, - 2.03954217305e+10, 5.59897817305e+10] - skip = True - for line in repr(DataFrame({'A': vals})).split('\n')[:-2]: - if line.startswith('dtype:'): - continue - if _three_digit_exp(): - self.assertTrue(('+010' in line) or skip) - else: - self.assertTrue(('+10' in line) or skip) - skip = False - - def test_dict_entries(self): - df = DataFrame({'A': [{'a': 1, 'b': 2}]}) - - val = df.to_string() - self.assertTrue("'a': 1" in val) - self.assertTrue("'b': 2" in val) - - def test_to_latex_filename(self): - with tm.ensure_clean('test.tex') as path: - self.frame.to_latex(path) - - with open(path, 'r') as f: - self.assertEqual(self.frame.to_latex(), f.read()) - - # test with utf-8 and encoding option (GH 7061) - df = DataFrame([[u'au\xdfgangen']]) - with tm.ensure_clean('test.tex') as path: - df.to_latex(path, encoding='utf-8') - with codecs.open(path, 'r', encoding='utf-8') as f: - self.assertEqual(df.to_latex(), f.read()) - - # test with utf-8 without encoding option - if compat.PY3: # python3: pandas default encoding is utf-8 - with tm.ensure_clean('test.tex') as path: - df.to_latex(path) - with codecs.open(path, 'r', encoding='utf-8') as f: - self.assertEqual(df.to_latex(), f.read()) - else: - # python2 default encoding is ascii, so an error should be raised - with tm.ensure_clean('test.tex') as path: - self.assertRaises(UnicodeEncodeError, df.to_latex, path) - - def test_to_latex(self): - # it works! - self.frame.to_latex() - - df = DataFrame({'a': [1, 2], 'b': ['b1', 'b2']}) - withindex_result = df.to_latex() - withindex_expected = r"""\begin{tabular}{lrl} -\toprule -{} & a & b \\ -\midrule -0 & 1 & b1 \\ -1 & 2 & b2 \\ -\bottomrule -\end{tabular} -""" - - self.assertEqual(withindex_result, withindex_expected) - - withoutindex_result = df.to_latex(index=False) - withoutindex_expected = r"""\begin{tabular}{rl} -\toprule - a & b \\ -\midrule - 1 & b1 \\ - 2 & b2 \\ -\bottomrule -\end{tabular} -""" - - self.assertEqual(withoutindex_result, withoutindex_expected) - - def test_to_latex_format(self): - # GH Bug #9402 - self.frame.to_latex(column_format='ccc') - - df = DataFrame({'a': [1, 2], 'b': ['b1', 'b2']}) - withindex_result = df.to_latex(column_format='ccc') - withindex_expected = r"""\begin{tabular}{ccc} -\toprule -{} & a & b \\ -\midrule -0 & 1 & b1 \\ -1 & 2 & b2 \\ -\bottomrule -\end{tabular} -""" - - self.assertEqual(withindex_result, withindex_expected) - - def test_to_latex_with_formatters(self): - df = DataFrame({'int': [1, 2, 3], - 'float': [1.0, 2.0, 3.0], - 'object': [(1, 2), True, False], - 'datetime64': [datetime(2016, 1, 1), - datetime(2016, 2, 5), - datetime(2016, 3, 3)]}) - - formatters = {'int': lambda x: '0x%x' % x, - 'float': lambda x: '[% 4.1f]' % x, - 'object': lambda x: '-%s-' % str(x), - 'datetime64': lambda x: x.strftime('%Y-%m'), - '__index__': lambda x: 'index: %s' % x} - result = df.to_latex(formatters=dict(formatters)) - - expected = r"""\begin{tabular}{llrrl} -\toprule -{} & datetime64 & float & int & object \\ -\midrule -index: 0 & 2016-01 & [ 1.0] & 0x1 & -(1, 2)- \\ -index: 1 & 2016-02 & [ 2.0] & 0x2 & -True- \\ -index: 2 & 2016-03 & [ 3.0] & 0x3 & -False- \\ -\bottomrule -\end{tabular} -""" - self.assertEqual(result, expected) - - def test_to_latex_multiindex(self): - df = DataFrame({('x', 'y'): ['a']}) - result = 
df.to_latex() - expected = r"""\begin{tabular}{ll} -\toprule -{} & x \\ -{} & y \\ -\midrule -0 & a \\ -\bottomrule -\end{tabular} -""" - - self.assertEqual(result, expected) - - result = df.T.to_latex() - expected = r"""\begin{tabular}{lll} -\toprule - & & 0 \\ -\midrule -x & y & a \\ -\bottomrule -\end{tabular} -""" - - self.assertEqual(result, expected) - - df = DataFrame.from_dict({ - ('c1', 0): pd.Series(dict((x, x) for x in range(4))), - ('c1', 1): pd.Series(dict((x, x + 4) for x in range(4))), - ('c2', 0): pd.Series(dict((x, x) for x in range(4))), - ('c2', 1): pd.Series(dict((x, x + 4) for x in range(4))), - ('c3', 0): pd.Series(dict((x, x) for x in range(4))), - }).T - result = df.to_latex() - expected = r"""\begin{tabular}{llrrrr} -\toprule - & & 0 & 1 & 2 & 3 \\ -\midrule -c1 & 0 & 0 & 1 & 2 & 3 \\ - & 1 & 4 & 5 & 6 & 7 \\ -c2 & 0 & 0 & 1 & 2 & 3 \\ - & 1 & 4 & 5 & 6 & 7 \\ -c3 & 0 & 0 & 1 & 2 & 3 \\ -\bottomrule -\end{tabular} -""" - - self.assertEqual(result, expected) - - # GH 10660 - df = pd.DataFrame({'a': [0, 0, 1, 1], - 'b': list('abab'), - 'c': [1, 2, 3, 4]}) - result = df.set_index(['a', 'b']).to_latex() - expected = r"""\begin{tabular}{llr} -\toprule - & & c \\ -a & b & \\ -\midrule -0 & a & 1 \\ - & b & 2 \\ -1 & a & 3 \\ - & b & 4 \\ -\bottomrule -\end{tabular} -""" - - self.assertEqual(result, expected) - - result = df.groupby('a').describe().to_latex() - expected = r"""\begin{tabular}{llr} -\toprule - & & c \\ -a & {} & \\ -\midrule -0 & count & 2.000000 \\ - & mean & 1.500000 \\ - & std & 0.707107 \\ - & min & 1.000000 \\ - & 25\% & 1.250000 \\ - & 50\% & 1.500000 \\ - & 75\% & 1.750000 \\ - & max & 2.000000 \\ -1 & count & 2.000000 \\ - & mean & 3.500000 \\ - & std & 0.707107 \\ - & min & 3.000000 \\ - & 25\% & 3.250000 \\ - & 50\% & 3.500000 \\ - & 75\% & 3.750000 \\ - & max & 4.000000 \\ -\bottomrule -\end{tabular} -""" - - self.assertEqual(result, expected) - - def test_to_latex_escape(self): - a = 'a' - b = 'b' - - test_dict = {u('co^l1'): {a: "a", - b: "b"}, - u('co$e^x$'): {a: "a", - b: "b"}} - - unescaped_result = DataFrame(test_dict).to_latex(escape=False) - escaped_result = DataFrame(test_dict).to_latex( - ) # default: escape=True - - unescaped_expected = r'''\begin{tabular}{lll} -\toprule -{} & co$e^x$ & co^l1 \\ -\midrule -a & a & a \\ -b & b & b \\ -\bottomrule -\end{tabular} -''' - - escaped_expected = r'''\begin{tabular}{lll} -\toprule -{} & co\$e\textasciicircumx\$ & co\textasciicircuml1 \\ -\midrule -a & a & a \\ -b & b & b \\ -\bottomrule -\end{tabular} -''' - - self.assertEqual(unescaped_result, unescaped_expected) - self.assertEqual(escaped_result, escaped_expected) - - def test_to_latex_longtable(self): - self.frame.to_latex(longtable=True) - - df = DataFrame({'a': [1, 2], 'b': ['b1', 'b2']}) - withindex_result = df.to_latex(longtable=True) - withindex_expected = r"""\begin{longtable}{lrl} -\toprule -{} & a & b \\ -\midrule -\endhead -\midrule -\multicolumn{3}{r}{{Continued on next page}} \\ -\midrule -\endfoot - -\bottomrule -\endlastfoot -0 & 1 & b1 \\ -1 & 2 & b2 \\ -\end{longtable} -""" - - self.assertEqual(withindex_result, withindex_expected) - - withoutindex_result = df.to_latex(index=False, longtable=True) - withoutindex_expected = r"""\begin{longtable}{rl} -\toprule - a & b \\ -\midrule -\endhead -\midrule -\multicolumn{3}{r}{{Continued on next page}} \\ -\midrule -\endfoot - -\bottomrule -\endlastfoot - 1 & b1 \\ - 2 & b2 \\ -\end{longtable} -""" - - self.assertEqual(withoutindex_result, withoutindex_expected) - - def 
test_to_latex_escape_special_chars(self):
-        special_characters = ['&', '%', '$', '#', '_', '{', '}', '~', '^',
-                              '\\']
-        df = DataFrame(data=special_characters)
-        observed = df.to_latex()
-        expected = r"""\begin{tabular}{ll}
-\toprule
-{} & 0 \\
-\midrule
-0 & \& \\
-1 & \% \\
-2 & \$ \\
-3 & \# \\
-4 & \_ \\
-5 & \{ \\
-6 & \} \\
-7 & \textasciitilde \\
-8 & \textasciicircum \\
-9 & \textbackslash \\
-\bottomrule
-\end{tabular}
-"""
-
-        self.assertEqual(observed, expected)
-
-    def test_to_latex_no_header(self):
-        # GH 7124
-        df = DataFrame({'a': [1, 2], 'b': ['b1', 'b2']})
-        withindex_result = df.to_latex(header=False)
-        withindex_expected = r"""\begin{tabular}{lrl}
-\toprule
-0 & 1 & b1 \\
-1 & 2 & b2 \\
-\bottomrule
-\end{tabular}
-"""
-
-        self.assertEqual(withindex_result, withindex_expected)
-
-        withoutindex_result = df.to_latex(index=False, header=False)
-        withoutindex_expected = r"""\begin{tabular}{rl}
-\toprule
- 1 & b1 \\
- 2 & b2 \\
-\bottomrule
-\end{tabular}
-"""
-
-        self.assertEqual(withoutindex_result, withoutindex_expected)
-
-    def test_to_latex_decimal(self):
-        # GH 12031
-        self.frame.to_latex()
-        df = DataFrame({'a': [1.0, 2.1], 'b': ['b1', 'b2']})
-        withindex_result = df.to_latex(decimal=',')
-        withindex_expected = r"""\begin{tabular}{lrl}
-\toprule
-{} & a & b \\
-\midrule
-0 & 1,0 & b1 \\
-1 & 2,1 & b2 \\
-\bottomrule
-\end{tabular}
-"""
-
-        self.assertEqual(withindex_result, withindex_expected)
-
-    def test_to_csv_quotechar(self):
-        df = DataFrame({'col': [1, 2]})
-        expected = """\
-"","col"
-"0","1"
-"1","2"
-"""
-
-        with tm.ensure_clean('test.csv') as path:
-            df.to_csv(path, quoting=1)  # 1=QUOTE_ALL
-            with open(path, 'r') as f:
-                self.assertEqual(f.read(), expected)
-
-        expected = """\
-$$,$col$
-$0$,$1$
-$1$,$2$
-"""
-
-        with tm.ensure_clean('test.csv') as path:
-            df.to_csv(path, quoting=1, quotechar="$")
-            with open(path, 'r') as f:
-                self.assertEqual(f.read(), expected)
-
-        with tm.ensure_clean('test.csv') as path:
-            with tm.assertRaisesRegexp(TypeError, 'quotechar'):
-                df.to_csv(path, quoting=1, quotechar=None)
-
-    def test_to_csv_doublequote(self):
-        df = DataFrame({'col': ['a"a', '"bb"']})
-        expected = '''\
-"","col"
-"0","a""a"
-"1","""bb"""
-'''
-
-        with tm.ensure_clean('test.csv') as path:
-            df.to_csv(path, quoting=1, doublequote=True)  # QUOTE_ALL
-            with open(path, 'r') as f:
-                self.assertEqual(f.read(), expected)
-
-        from _csv import Error
-        with tm.ensure_clean('test.csv') as path:
-            with tm.assertRaisesRegexp(Error, 'escapechar'):
-                df.to_csv(path, doublequote=False)  # no escapechar set
-
-    def test_to_csv_escapechar(self):
-        df = DataFrame({'col': ['a"a', '"bb"']})
-        expected = '''\
-"","col"
-"0","a\\"a"
-"1","\\"bb\\""
-'''
-
-        with tm.ensure_clean('test.csv') as path:  # QUOTE_ALL
-            df.to_csv(path, quoting=1, doublequote=False, escapechar='\\')
-            with open(path, 'r') as f:
-                self.assertEqual(f.read(), expected)
-
-        df = DataFrame({'col': ['a,a', ',bb,']})
-        expected = """\
-,col
-0,a\\,a
-1,\\,bb\\,
-"""
-
-        with tm.ensure_clean('test.csv') as path:
-            df.to_csv(path, quoting=3, escapechar='\\')  # QUOTE_NONE
-            with open(path, 'r') as f:
-                self.assertEqual(f.read(), expected)
-
-    def test_csv_to_string(self):
-        df = DataFrame({'col': [1, 2]})
-        expected = ',col\n0,1\n1,2\n'
-        self.assertEqual(df.to_csv(), expected)
-
-    def test_to_csv_decimal(self):
-        # GH 781
-        df = DataFrame({'col1': [1], 'col2': ['a'], 'col3': [10.1]})
-
-        expected_default = ',col1,col2,col3\n0,1,a,10.1\n'
-        self.assertEqual(df.to_csv(), expected_default)
-
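Reviewer's note: the integer `quoting` values in the tests above are the stdlib `csv` module's constants. A small illustrative sketch, not from the deleted file (the frame is arbitrary), with the constants spelled out:

```python
import csv
import pandas as pd

df = pd.DataFrame({'col': ['a"a', '"bb"']})

# csv.QUOTE_ALL is the literal 1 used above: every field is quoted, and
# embedded quote characters are doubled unless doublequote=False, which
# then requires an explicit escapechar.
print(df.to_csv(quoting=csv.QUOTE_ALL))
print(df.to_csv(quoting=csv.QUOTE_ALL, doublequote=False, escapechar='\\'))
```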
expected_european_excel = ';col1;col2;col3\n0;1;a;10,1\n' - self.assertEqual( - df.to_csv(decimal=',', sep=';'), expected_european_excel) - - expected_float_format_default = ',col1,col2,col3\n0,1,a,10.10\n' - self.assertEqual( - df.to_csv(float_format='%.2f'), expected_float_format_default) - - expected_float_format = ';col1;col2;col3\n0;1;a;10,10\n' - self.assertEqual( - df.to_csv(decimal=',', sep=';', - float_format='%.2f'), expected_float_format) - - # GH 11553: testing if decimal is taken into account for '0.0' - df = pd.DataFrame({'a': [0, 1.1], 'b': [2.2, 3.3], 'c': 1}) - expected = 'a,b,c\n0^0,2^2,1\n1^1,3^3,1\n' - self.assertEqual(df.to_csv(index=False, decimal='^'), expected) - - # same but for an index - self.assertEqual(df.set_index('a').to_csv(decimal='^'), expected) - - # same for a multi-index - self.assertEqual( - df.set_index(['a', 'b']).to_csv(decimal="^"), expected) - - def test_to_csv_float_format(self): - # testing if float_format is taken into account for the index - # GH 11553 - df = pd.DataFrame({'a': [0, 1], 'b': [2.2, 3.3], 'c': 1}) - expected = 'a,b,c\n0,2.20,1\n1,3.30,1\n' - self.assertEqual( - df.set_index('a').to_csv(float_format='%.2f'), expected) - - # same for a multi-index - self.assertEqual( - df.set_index(['a', 'b']).to_csv(float_format='%.2f'), expected) - - def test_to_csv_na_rep(self): - # testing if NaN values are correctly represented in the index - # GH 11553 - df = DataFrame({'a': [0, np.NaN], 'b': [0, 1], 'c': [2, 3]}) - expected = "a,b,c\n0.0,0,2\n_,1,3\n" - self.assertEqual(df.set_index('a').to_csv(na_rep='_'), expected) - self.assertEqual(df.set_index(['a', 'b']).to_csv(na_rep='_'), expected) - - # now with an index containing only NaNs - df = DataFrame({'a': np.NaN, 'b': [0, 1], 'c': [2, 3]}) - expected = "a,b,c\n_,0,2\n_,1,3\n" - self.assertEqual(df.set_index('a').to_csv(na_rep='_'), expected) - self.assertEqual(df.set_index(['a', 'b']).to_csv(na_rep='_'), expected) - - # check if na_rep parameter does not break anything when no NaN - df = DataFrame({'a': 0, 'b': [0, 1], 'c': [2, 3]}) - expected = "a,b,c\n0,0,2\n0,1,3\n" - self.assertEqual(df.set_index('a').to_csv(na_rep='_'), expected) - self.assertEqual(df.set_index(['a', 'b']).to_csv(na_rep='_'), expected) - - def test_to_csv_date_format(self): - # GH 10209 - df_sec = DataFrame({'A': pd.date_range('20130101', periods=5, freq='s') - }) - df_day = DataFrame({'A': pd.date_range('20130101', periods=5, freq='d') - }) - - expected_default_sec = ',A\n0,2013-01-01 00:00:00\n1,2013-01-01 00:00:01\n2,2013-01-01 00:00:02' + \ - '\n3,2013-01-01 00:00:03\n4,2013-01-01 00:00:04\n' - self.assertEqual(df_sec.to_csv(), expected_default_sec) - - expected_ymdhms_day = ',A\n0,2013-01-01 00:00:00\n1,2013-01-02 00:00:00\n2,2013-01-03 00:00:00' + \ - '\n3,2013-01-04 00:00:00\n4,2013-01-05 00:00:00\n' - self.assertEqual( - df_day.to_csv( - date_format='%Y-%m-%d %H:%M:%S'), expected_ymdhms_day) - - expected_ymd_sec = ',A\n0,2013-01-01\n1,2013-01-01\n2,2013-01-01\n3,2013-01-01\n4,2013-01-01\n' - self.assertEqual( - df_sec.to_csv(date_format='%Y-%m-%d'), expected_ymd_sec) - - expected_default_day = ',A\n0,2013-01-01\n1,2013-01-02\n2,2013-01-03\n3,2013-01-04\n4,2013-01-05\n' - self.assertEqual(df_day.to_csv(), expected_default_day) - self.assertEqual( - df_day.to_csv(date_format='%Y-%m-%d'), expected_default_day) - - # testing if date_format parameter is taken into account for - # multi-indexed dataframes (GH 7791) - df_sec['B'] = 0 - df_sec['C'] = 1 - expected_ymd_sec = 'A,B,C\n2013-01-01,0,1\n' - df_sec_grouped = 
df_sec.groupby([pd.Grouper(key='A', freq='1h'), 'B']) - self.assertEqual(df_sec_grouped.mean().to_csv(date_format='%Y-%m-%d'), - expected_ymd_sec) - - def test_to_csv_multi_index(self): - # see gh-6618 - df = DataFrame([1], columns=pd.MultiIndex.from_arrays([[1],[2]])) - - exp = ",1\n,2\n0,1\n" - self.assertEqual(df.to_csv(), exp) - - exp = "1\n2\n1\n" - self.assertEqual(df.to_csv(index=False), exp) - - df = DataFrame([1], columns=pd.MultiIndex.from_arrays([[1],[2]]), - index=pd.MultiIndex.from_arrays([[1],[2]])) - - exp = ",,1\n,,2\n1,2,1\n" - self.assertEqual(df.to_csv(), exp) - - exp = "1\n2\n1\n" - self.assertEqual(df.to_csv(index=False), exp) - - df = DataFrame([1], columns=pd.MultiIndex.from_arrays([['foo'],['bar']])) - - exp = ",foo\n,bar\n0,1\n" - self.assertEqual(df.to_csv(), exp) - - exp = "foo\nbar\n1\n" - self.assertEqual(df.to_csv(index=False), exp) - - def test_period(self): - # GH 12615 - df = pd.DataFrame({'A': pd.period_range('2013-01', - periods=4, freq='M'), - 'B': [pd.Period('2011-01', freq='M'), - pd.Period('2011-02-01', freq='D'), - pd.Period('2011-03-01 09:00', freq='H'), - pd.Period('2011-04', freq='M')], - 'C': list('abcd')}) - exp = (" A B C\n0 2013-01 2011-01 a\n" - "1 2013-02 2011-02-01 b\n2 2013-03 2011-03-01 09:00 c\n" - "3 2013-04 2011-04 d") - self.assertEqual(str(df), exp) - - -class TestSeriesFormatting(tm.TestCase): - - _multiprocess_can_split_ = True - - def setUp(self): - self.ts = tm.makeTimeSeries() - - def test_repr_unicode(self): - s = Series([u('\u03c3')] * 10) - repr(s) - - a = Series([u("\u05d0")] * 1000) - a.name = 'title1' - repr(a) - - def test_to_string(self): - buf = StringIO() - - s = self.ts.to_string() - - retval = self.ts.to_string(buf=buf) - self.assertIsNone(retval) - self.assertEqual(buf.getvalue().strip(), s) - - # pass float_format - format = '%.4f'.__mod__ - result = self.ts.to_string(float_format=format) - result = [x.split()[1] for x in result.split('\n')[:-1]] - expected = [format(x) for x in self.ts] - self.assertEqual(result, expected) - - # empty string - result = self.ts[:0].to_string() - self.assertEqual(result, 'Series([], Freq: B)') - - result = self.ts[:0].to_string(length=0) - self.assertEqual(result, 'Series([], Freq: B)') - - # name and length - cp = self.ts.copy() - cp.name = 'foo' - result = cp.to_string(length=True, name=True, dtype=True) - last_line = result.split('\n')[-1].strip() - self.assertEqual(last_line, - "Freq: B, Name: foo, Length: %d, dtype: float64" % - len(cp)) - - def test_freq_name_separation(self): - s = Series(np.random.randn(10), - index=date_range('1/1/2000', periods=10), name=0) - - result = repr(s) - self.assertTrue('Freq: D, Name: 0' in result) - - def test_to_string_mixed(self): - s = Series(['foo', np.nan, -1.23, 4.56]) - result = s.to_string() - expected = (u('0 foo\n') + u('1 NaN\n') + u('2 -1.23\n') + - u('3 4.56')) - self.assertEqual(result, expected) - - # but don't count NAs as floats - s = Series(['foo', np.nan, 'bar', 'baz']) - result = s.to_string() - expected = (u('0 foo\n') + '1 NaN\n' + '2 bar\n' + '3 baz') - self.assertEqual(result, expected) - - s = Series(['foo', 5, 'bar', 'baz']) - result = s.to_string() - expected = (u('0 foo\n') + '1 5\n' + '2 bar\n' + '3 baz') - self.assertEqual(result, expected) - - def test_to_string_float_na_spacing(self): - s = Series([0., 1.5678, 2., -3., 4.]) - s[::2] = np.nan - - result = s.to_string() - expected = (u('0 NaN\n') + '1 1.5678\n' + '2 NaN\n' + - '3 -3.0000\n' + '4 NaN') - self.assertEqual(result, expected) - - def 
test_to_string_without_index(self):
-        # GH 11729 Test index=False option
-        s = Series([1, 2, 3, 4])
-        result = s.to_string(index=False)
-        expected = (u('1\n') + '2\n' + '3\n' + '4')
-        self.assertEqual(result, expected)
-
-    def test_unicode_name_in_footer(self):
-        s = Series([1, 2], name=u('\u05e2\u05d1\u05e8\u05d9\u05ea'))
-        sf = fmt.SeriesFormatter(s, name=u('\u05e2\u05d1\u05e8\u05d9\u05ea'))
-        sf._get_footer()  # should not raise exception
-
-    def test_east_asian_unicode_series(self):
-        if PY3:
-            _rep = repr
-        else:
-            _rep = unicode
-        # not aligned properly because of east asian width
-
-        # unicode index
-        s = Series(['a', 'bb', 'CCC', 'D'],
-                   index=[u'あ', u'いい', u'ううう', u'ええええ'])
-        expected = (u"あ a\nいい bb\nううう CCC\n"
-                    u"ええええ D\ndtype: object")
-        self.assertEqual(_rep(s), expected)
-
-        # unicode values
-        s = Series([u'あ', u'いい', u'ううう', u'ええええ'],
-                   index=['a', 'bb', 'c', 'ddd'])
-        expected = (u"a あ\nbb いい\nc ううう\n"
-                    u"ddd ええええ\ndtype: object")
-        self.assertEqual(_rep(s), expected)
-
-        # both
-        s = Series([u'あ', u'いい', u'ううう', u'ええええ'],
-                   index=[u'ああ', u'いいいい', u'う', u'えええ'])
-        expected = (u"ああ あ\nいいいい いい\nう ううう\n"
-                    u"えええ ええええ\ndtype: object")
-        self.assertEqual(_rep(s), expected)
-
-        # unicode footer
-        s = Series([u'あ', u'いい', u'ううう', u'ええええ'],
-                   index=[u'ああ', u'いいいい', u'う', u'えええ'], name=u'おおおおおおお')
-        expected = (u"ああ あ\nいいいい いい\nう ううう\n"
-                    u"えええ ええええ\nName: おおおおおおお, dtype: object")
-        self.assertEqual(_rep(s), expected)
-
-        # MultiIndex
-        idx = pd.MultiIndex.from_tuples([(u'あ', u'いい'), (u'う', u'え'), (
-            u'おおお', u'かかかか'), (u'き', u'くく')])
-        s = Series([1, 22, 3333, 44444], index=idx)
-        expected = (u"あ いい 1\nう え 22\nおおお かかかか 3333\n"
-                    u"き くく 44444\ndtype: int64")
-        self.assertEqual(_rep(s), expected)
-
-        # object dtype, shorter than unicode repr
-        s = Series([1, 22, 3333, 44444], index=[1, 'AB', np.nan, u'あああ'])
-        expected = (u"1 1\nAB 22\nNaN 3333\n"
-                    u"あああ 44444\ndtype: int64")
-        self.assertEqual(_rep(s), expected)
-
-        # object dtype, longer than unicode repr
-        s = Series([1, 22, 3333, 44444],
-                   index=[1, 'AB', pd.Timestamp('2011-01-01'), u'あああ'])
-        expected = (u"1 1\nAB 22\n"
-                    u"2011-01-01 00:00:00 3333\nあああ 44444\ndtype: int64"
-                    )
-        self.assertEqual(_rep(s), expected)
-
-        # truncate
-        with option_context('display.max_rows', 3):
-            s = Series([u'あ', u'いい', u'ううう', u'ええええ'], name=u'おおおおおおお')
-
-            expected = (u"0 あ\n ... \n"
-                        u"3 ええええ\nName: おおおおおおお, dtype: object")
-            self.assertEqual(_rep(s), expected)
-
-            s.index = [u'ああ', u'いいいい', u'う', u'えええ']
-            expected = (u"ああ あ\n ... \n"
-                        u"えええ ええええ\nName: おおおおおおお, dtype: object")
-            self.assertEqual(_rep(s), expected)
-
-        # Enable Unicode option -----------------------------------------
-        with option_context('display.unicode.east_asian_width', True):
-
-            # unicode index
-            s = Series(['a', 'bb', 'CCC', 'D'],
-                       index=[u'あ', u'いい', u'ううう', u'ええええ'])
-            expected = (u"あ a\nいい bb\nううう CCC\n"
-                        u"ええええ D\ndtype: object")
-            self.assertEqual(_rep(s), expected)
-
-            # unicode values
-            s = Series([u'あ', u'いい', u'ううう', u'ええええ'],
-                       index=['a', 'bb', 'c', 'ddd'])
-            expected = (u"a あ\nbb いい\nc ううう\n"
-                        u"ddd ええええ\ndtype: object")
-            self.assertEqual(_rep(s), expected)
-
-            # both
-            s = Series([u'あ', u'いい', u'ううう', u'ええええ'],
-                       index=[u'ああ', u'いいいい', u'う', u'えええ'])
-            expected = (u"ああ あ\nいいいい いい\nう ううう\n"
-                        u"えええ ええええ\ndtype: object")
-            self.assertEqual(_rep(s), expected)
-
-            # unicode footer
-            s = Series([u'あ', u'いい', u'ううう', u'ええええ'],
-                       index=[u'ああ', u'いいいい', u'う', u'えええ'], name=u'おおおおおおお')
-            expected = (u"ああ あ\nいいいい いい\nう ううう\n"
-                        u"えええ ええええ\nName: おおおおおおお, dtype: object")
-            self.assertEqual(_rep(s), expected)
-
-            # MultiIndex
-            idx = pd.MultiIndex.from_tuples([(u'あ', u'いい'), (u'う', u'え'), (
-                u'おおお', u'かかかか'), (u'き', u'くく')])
-            s = Series([1, 22, 3333, 44444], index=idx)
-            expected = (u"あ いい 1\nう え 22\nおおお かかかか 3333\n"
-                        u"き くく 44444\ndtype: int64")
-            self.assertEqual(_rep(s), expected)
-
-            # object dtype, shorter than unicode repr
-            s = Series([1, 22, 3333, 44444], index=[1, 'AB', np.nan, u'あああ'])
-            expected = (u"1 1\nAB 22\nNaN 3333\n"
-                        u"あああ 44444\ndtype: int64")
-            self.assertEqual(_rep(s), expected)
-
-            # object dtype, longer than unicode repr
-            s = Series([1, 22, 3333, 44444],
-                       index=[1, 'AB', pd.Timestamp('2011-01-01'), u'あああ'])
-            expected = (u"1 1\nAB 22\n"
-                        u"2011-01-01 00:00:00 3333\nあああ 44444\ndtype: int64"
-                        )
-            self.assertEqual(_rep(s), expected)
-
-            # truncate
-            with option_context('display.max_rows', 3):
-                s = Series([u'あ', u'いい', u'ううう', u'ええええ'], name=u'おおおおおおお')
-                expected = (u"0 あ\n ... \n"
-                            u"3 ええええ\nName: おおおおおおお, dtype: object")
-                self.assertEqual(_rep(s), expected)
-
-                s.index = [u'ああ', u'いいいい', u'う', u'えええ']
-                expected = (u"ああ あ\n ...
\n" - u"えええ ええええ\nName: おおおおおおお, dtype: object") - self.assertEqual(_rep(s), expected) - - # ambiguous unicode - s = Series([u'¡¡', u'い¡¡', u'ううう', u'ええええ'], - index=[u'ああ', u'¡¡¡¡いい', u'¡¡', u'えええ']) - expected = (u"ああ ¡¡\n¡¡¡¡いい い¡¡\n¡¡ ううう\n" - u"えええ ええええ\ndtype: object") - self.assertEqual(_rep(s), expected) - - def test_float_trim_zeros(self): - vals = [2.08430917305e+10, 3.52205017305e+10, 2.30674817305e+10, - 2.03954217305e+10, 5.59897817305e+10] - for line in repr(Series(vals)).split('\n'): - if line.startswith('dtype:'): - continue - if _three_digit_exp(): - self.assertIn('+010', line) - else: - self.assertIn('+10', line) - - def test_datetimeindex(self): - - index = date_range('20130102', periods=6) - s = Series(1, index=index) - result = s.to_string() - self.assertTrue('2013-01-02' in result) - - # nat in index - s2 = Series(2, index=[Timestamp('20130111'), NaT]) - s = s2.append(s) - result = s.to_string() - self.assertTrue('NaT' in result) - - # nat in summary - result = str(s2.index) - self.assertTrue('NaT' in result) - - def test_timedelta64(self): - - from datetime import datetime, timedelta - - Series(np.array([1100, 20], dtype='timedelta64[ns]')).to_string() - - s = Series(date_range('2012-1-1', periods=3, freq='D')) - - # GH2146 - - # adding NaTs - y = s - s.shift(1) - result = y.to_string() - self.assertTrue('1 days' in result) - self.assertTrue('00:00:00' not in result) - self.assertTrue('NaT' in result) - - # with frac seconds - o = Series([datetime(2012, 1, 1, microsecond=150)] * 3) - y = s - o - result = y.to_string() - self.assertTrue('-1 days +23:59:59.999850' in result) - - # rounding? - o = Series([datetime(2012, 1, 1, 1)] * 3) - y = s - o - result = y.to_string() - self.assertTrue('-1 days +23:00:00' in result) - self.assertTrue('1 days 23:00:00' in result) - - o = Series([datetime(2012, 1, 1, 1, 1)] * 3) - y = s - o - result = y.to_string() - self.assertTrue('-1 days +22:59:00' in result) - self.assertTrue('1 days 22:59:00' in result) - - o = Series([datetime(2012, 1, 1, 1, 1, microsecond=150)] * 3) - y = s - o - result = y.to_string() - self.assertTrue('-1 days +22:58:59.999850' in result) - self.assertTrue('0 days 22:58:59.999850' in result) - - # neg time - td = timedelta(minutes=5, seconds=3) - s2 = Series(date_range('2012-1-1', periods=3, freq='D')) + td - y = s - s2 - result = y.to_string() - self.assertTrue('-1 days +23:54:57' in result) - - td = timedelta(microseconds=550) - s2 = Series(date_range('2012-1-1', periods=3, freq='D')) + td - y = s - td - result = y.to_string() - self.assertTrue('2012-01-01 23:59:59.999450' in result) - - # no boxing of the actual elements - td = Series(pd.timedelta_range('1 days', periods=3)) - result = td.to_string() - self.assertEqual(result, u("0 1 days\n1 2 days\n2 3 days")) - - def test_mixed_datetime64(self): - df = DataFrame({'A': [1, 2], 'B': ['2012-01-01', '2012-01-02']}) - df['B'] = pd.to_datetime(df.B) - - result = repr(df.ix[0]) - self.assertTrue('2012-01-01' in result) - - def test_period(self): - # GH 12615 - index = pd.period_range('2013-01', periods=6, freq='M') - s = Series(np.arange(6, dtype='int64'), index=index) - exp = ("2013-01 0\n2013-02 1\n2013-03 2\n2013-04 3\n" - "2013-05 4\n2013-06 5\nFreq: M, dtype: int64") - self.assertEqual(str(s), exp) - - s = Series(index) - exp = ("0 2013-01\n1 2013-02\n2 2013-03\n3 2013-04\n" - "4 2013-05\n5 2013-06\ndtype: object") - self.assertEqual(str(s), exp) - - # periods with mixed freq - s = Series([pd.Period('2011-01', freq='M'), - pd.Period('2011-02-01', 
freq='D'), - pd.Period('2011-03-01 09:00', freq='H')]) - exp = ("0 2011-01\n1 2011-02-01\n" - "2 2011-03-01 09:00\ndtype: object") - self.assertEqual(str(s), exp) - - def test_max_multi_index_display(self): - # GH 7101 - - # doc example (indexing.rst) - - # multi-index - arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], - ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] - tuples = list(zip(*arrays)) - index = MultiIndex.from_tuples(tuples, names=['first', 'second']) - s = Series(randn(8), index=index) - - with option_context("display.max_rows", 10): - self.assertEqual(len(str(s).split('\n')), 10) - with option_context("display.max_rows", 3): - self.assertEqual(len(str(s).split('\n')), 5) - with option_context("display.max_rows", 2): - self.assertEqual(len(str(s).split('\n')), 5) - with option_context("display.max_rows", 1): - self.assertEqual(len(str(s).split('\n')), 4) - with option_context("display.max_rows", 0): - self.assertEqual(len(str(s).split('\n')), 10) - - # index - s = Series(randn(8), None) - - with option_context("display.max_rows", 10): - self.assertEqual(len(str(s).split('\n')), 9) - with option_context("display.max_rows", 3): - self.assertEqual(len(str(s).split('\n')), 4) - with option_context("display.max_rows", 2): - self.assertEqual(len(str(s).split('\n')), 4) - with option_context("display.max_rows", 1): - self.assertEqual(len(str(s).split('\n')), 3) - with option_context("display.max_rows", 0): - self.assertEqual(len(str(s).split('\n')), 9) - - # Make sure #8532 is fixed - def test_consistent_format(self): - s = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.9999, 1, 1] * 10) - with option_context("display.max_rows", 10): - res = repr(s) - exp = ('0 1.0000\n1 1.0000\n2 1.0000\n3 ' - '1.0000\n4 1.0000\n ... \n125 ' - '1.0000\n126 1.0000\n127 0.9999\n128 ' - '1.0000\n129 1.0000\ndtype: float64') - self.assertEqual(res, exp) - - @staticmethod - def gen_test_series(): - s1 = pd.Series(['a'] * 100) - s2 = pd.Series(['ab'] * 100) - s3 = pd.Series(['a', 'ab', 'abc', 'abcd', 'abcde', 'abcdef']) - s4 = s3[::-1] - test_sers = {'onel': s1, 'twol': s2, 'asc': s3, 'desc': s4} - return test_sers - - def chck_ncols(self, s): - with option_context("display.max_rows", 10): - res = repr(s) - lines = res.split('\n') - lines = [line for line in repr(s).split('\n') - if not re.match(r'[^\.]*\.+', line)][:-1] - ncolsizes = len(set(len(line.strip()) for line in lines)) - self.assertEqual(ncolsizes, 1) - - def test_format_explicit(self): - test_sers = self.gen_test_series() - with option_context("display.max_rows", 4): - res = repr(test_sers['onel']) - exp = '0 a\n1 a\n ..\n98 a\n99 a\ndtype: object' - self.assertEqual(exp, res) - res = repr(test_sers['twol']) - exp = ('0 ab\n1 ab\n ..\n98 ab\n99 ab\ndtype:' - ' object') - self.assertEqual(exp, res) - res = repr(test_sers['asc']) - exp = ('0 a\n1 ab\n ... \n4 abcde\n5' - ' abcdef\ndtype: object') - self.assertEqual(exp, res) - res = repr(test_sers['desc']) - exp = ('5 abcdef\n4 abcde\n ... 
\n1 ab\n0' - ' a\ndtype: object') - self.assertEqual(exp, res) - - def test_ncols(self): - test_sers = self.gen_test_series() - for s in test_sers.values(): - self.chck_ncols(s) - - def test_max_rows_eq_one(self): - s = Series(range(10), dtype='int64') - with option_context("display.max_rows", 1): - strrepr = repr(s).split('\n') - exp1 = ['0', '0'] - res1 = strrepr[0].split() - self.assertEqual(exp1, res1) - exp2 = ['..'] - res2 = strrepr[1].split() - self.assertEqual(exp2, res2) - - def test_truncate_ndots(self): - def getndots(s): - return len(re.match(r'[^\.]*(\.*)', s).groups()[0]) - - s = Series([0, 2, 3, 6]) - with option_context("display.max_rows", 2): - strrepr = repr(s).replace('\n', '') - self.assertEqual(getndots(strrepr), 2) - - s = Series([0, 100, 200, 400]) - with option_context("display.max_rows", 2): - strrepr = repr(s).replace('\n', '') - self.assertEqual(getndots(strrepr), 3) - - def test_to_string_name(self): - s = Series(range(100), dtype='int64') - s.name = 'myser' - res = s.to_string(max_rows=2, name=True) - exp = '0 0\n ..\n99 99\nName: myser' - self.assertEqual(res, exp) - res = s.to_string(max_rows=2, name=False) - exp = '0 0\n ..\n99 99' - self.assertEqual(res, exp) - - def test_to_string_dtype(self): - s = Series(range(100), dtype='int64') - res = s.to_string(max_rows=2, dtype=True) - exp = '0 0\n ..\n99 99\ndtype: int64' - self.assertEqual(res, exp) - res = s.to_string(max_rows=2, dtype=False) - exp = '0 0\n ..\n99 99' - self.assertEqual(res, exp) - - def test_to_string_length(self): - s = Series(range(100), dtype='int64') - res = s.to_string(max_rows=2, length=True) - exp = '0 0\n ..\n99 99\nLength: 100' - self.assertEqual(res, exp) - - def test_to_string_na_rep(self): - s = pd.Series(index=range(100)) - res = s.to_string(na_rep='foo', max_rows=2) - exp = '0 foo\n ..\n99 foo' - self.assertEqual(res, exp) - - def test_to_string_float_format(self): - s = pd.Series(range(10), dtype='float64') - res = s.to_string(float_format=lambda x: '{0:2.1f}'.format(x), - max_rows=2) - exp = '0 0.0\n ..\n9 9.0' - self.assertEqual(res, exp) - - def test_to_string_header(self): - s = pd.Series(range(10), dtype='int64') - s.index.name = 'foo' - res = s.to_string(header=True, max_rows=2) - exp = 'foo\n0 0\n ..\n9 9' - self.assertEqual(res, exp) - res = s.to_string(header=False, max_rows=2) - exp = '0 0\n ..\n9 9' - self.assertEqual(res, exp) - - -class TestEngFormatter(tm.TestCase): - _multiprocess_can_split_ = True - - def test_eng_float_formatter(self): - df = DataFrame({'A': [1.41, 141., 14100, 1410000.]}) - - fmt.set_eng_float_format() - result = df.to_string() - expected = (' A\n' - '0 1.410E+00\n' - '1 141.000E+00\n' - '2 14.100E+03\n' - '3 1.410E+06') - self.assertEqual(result, expected) - - fmt.set_eng_float_format(use_eng_prefix=True) - result = df.to_string() - expected = (' A\n' - '0 1.410\n' - '1 141.000\n' - '2 14.100k\n' - '3 1.410M') - self.assertEqual(result, expected) - - fmt.set_eng_float_format(accuracy=0) - result = df.to_string() - expected = (' A\n' - '0 1E+00\n' - '1 141E+00\n' - '2 14E+03\n' - '3 1E+06') - self.assertEqual(result, expected) - - self.reset_display_options() - - def compare(self, formatter, input, output): - formatted_input = formatter(input) - msg = ("formatting of %s results in '%s', expected '%s'" % - (str(input), formatted_input, output)) - self.assertEqual(formatted_input, output, msg) - - def compare_all(self, formatter, in_out): - """ - Parameters: - ----------- - formatter: EngFormatter under test - in_out: list of tuples. 
Each tuple = (number, expected_formatting) - - It is tested if 'formatter(number) == expected_formatting'. - *number* should be >= 0 because formatter(-number) == fmt is also - tested. *fmt* is derived from *expected_formatting* - """ - for input, output in in_out: - self.compare(formatter, input, output) - self.compare(formatter, -input, "-" + output[1:]) - - def test_exponents_with_eng_prefix(self): - formatter = fmt.EngFormatter(accuracy=3, use_eng_prefix=True) - f = np.sqrt(2) - in_out = [(f * 10 ** -24, " 1.414y"), (f * 10 ** -23, " 14.142y"), - (f * 10 ** -22, " 141.421y"), (f * 10 ** -21, " 1.414z"), - (f * 10 ** -20, " 14.142z"), (f * 10 ** -19, " 141.421z"), - (f * 10 ** -18, " 1.414a"), (f * 10 ** -17, " 14.142a"), - (f * 10 ** -16, " 141.421a"), (f * 10 ** -15, " 1.414f"), - (f * 10 ** -14, " 14.142f"), (f * 10 ** -13, " 141.421f"), - (f * 10 ** -12, " 1.414p"), (f * 10 ** -11, " 14.142p"), - (f * 10 ** -10, " 141.421p"), (f * 10 ** -9, " 1.414n"), - (f * 10 ** -8, " 14.142n"), (f * 10 ** -7, " 141.421n"), - (f * 10 ** -6, " 1.414u"), (f * 10 ** -5, " 14.142u"), - (f * 10 ** -4, " 141.421u"), (f * 10 ** -3, " 1.414m"), - (f * 10 ** -2, " 14.142m"), (f * 10 ** -1, " 141.421m"), - (f * 10 ** 0, " 1.414"), (f * 10 ** 1, " 14.142"), - (f * 10 ** 2, " 141.421"), (f * 10 ** 3, " 1.414k"), - (f * 10 ** 4, " 14.142k"), (f * 10 ** 5, " 141.421k"), - (f * 10 ** 6, " 1.414M"), (f * 10 ** 7, " 14.142M"), - (f * 10 ** 8, " 141.421M"), (f * 10 ** 9, " 1.414G"), ( - f * 10 ** 10, " 14.142G"), (f * 10 ** 11, " 141.421G"), - (f * 10 ** 12, " 1.414T"), (f * 10 ** 13, " 14.142T"), ( - f * 10 ** 14, " 141.421T"), (f * 10 ** 15, " 1.414P"), ( - f * 10 ** 16, " 14.142P"), (f * 10 ** 17, " 141.421P"), ( - f * 10 ** 18, " 1.414E"), (f * 10 ** 19, " 14.142E"), - (f * 10 ** 20, " 141.421E"), (f * 10 ** 21, " 1.414Z"), ( - f * 10 ** 22, " 14.142Z"), (f * 10 ** 23, " 141.421Z"), ( - f * 10 ** 24, " 1.414Y"), (f * 10 ** 25, " 14.142Y"), ( - f * 10 ** 26, " 141.421Y")] - self.compare_all(formatter, in_out) - - def test_exponents_without_eng_prefix(self): - formatter = fmt.EngFormatter(accuracy=4, use_eng_prefix=False) - f = np.pi - in_out = [(f * 10 ** -24, " 3.1416E-24"), - (f * 10 ** -23, " 31.4159E-24"), - (f * 10 ** -22, " 314.1593E-24"), - (f * 10 ** -21, " 3.1416E-21"), - (f * 10 ** -20, " 31.4159E-21"), - (f * 10 ** -19, " 314.1593E-21"), - (f * 10 ** -18, " 3.1416E-18"), - (f * 10 ** -17, " 31.4159E-18"), - (f * 10 ** -16, " 314.1593E-18"), - (f * 10 ** -15, " 3.1416E-15"), - (f * 10 ** -14, " 31.4159E-15"), - (f * 10 ** -13, " 314.1593E-15"), - (f * 10 ** -12, " 3.1416E-12"), - (f * 10 ** -11, " 31.4159E-12"), - (f * 10 ** -10, " 314.1593E-12"), - (f * 10 ** -9, " 3.1416E-09"), (f * 10 ** -8, " 31.4159E-09"), - (f * 10 ** -7, " 314.1593E-09"), (f * 10 ** -6, " 3.1416E-06"), - (f * 10 ** -5, " 31.4159E-06"), (f * 10 ** -4, - " 314.1593E-06"), - (f * 10 ** -3, " 3.1416E-03"), (f * 10 ** -2, " 31.4159E-03"), - (f * 10 ** -1, " 314.1593E-03"), (f * 10 ** 0, " 3.1416E+00"), ( - f * 10 ** 1, " 31.4159E+00"), (f * 10 ** 2, " 314.1593E+00"), - (f * 10 ** 3, " 3.1416E+03"), (f * 10 ** 4, " 31.4159E+03"), ( - f * 10 ** 5, " 314.1593E+03"), (f * 10 ** 6, " 3.1416E+06"), - (f * 10 ** 7, " 31.4159E+06"), (f * 10 ** 8, " 314.1593E+06"), ( - f * 10 ** 9, " 3.1416E+09"), (f * 10 ** 10, " 31.4159E+09"), - (f * 10 ** 11, " 314.1593E+09"), (f * 10 ** 12, " 3.1416E+12"), - (f * 10 ** 13, " 31.4159E+12"), (f * 10 ** 14, " 314.1593E+12"), - (f * 10 ** 15, " 3.1416E+15"), (f * 10 ** 16, " 31.4159E+15"), - (f * 10 ** 17, " 
314.1593E+15"), (f * 10 ** 18, " 3.1416E+18"), - (f * 10 ** 19, " 31.4159E+18"), (f * 10 ** 20, " 314.1593E+18"), - (f * 10 ** 21, " 3.1416E+21"), (f * 10 ** 22, " 31.4159E+21"), - (f * 10 ** 23, " 314.1593E+21"), (f * 10 ** 24, " 3.1416E+24"), - (f * 10 ** 25, " 31.4159E+24"), (f * 10 ** 26, " 314.1593E+24")] - self.compare_all(formatter, in_out) - - def test_rounding(self): - formatter = fmt.EngFormatter(accuracy=3, use_eng_prefix=True) - in_out = [(5.55555, ' 5.556'), (55.5555, ' 55.556'), - (555.555, ' 555.555'), (5555.55, ' 5.556k'), - (55555.5, ' 55.556k'), (555555, ' 555.555k')] - self.compare_all(formatter, in_out) - - formatter = fmt.EngFormatter(accuracy=1, use_eng_prefix=True) - in_out = [(5.55555, ' 5.6'), (55.5555, ' 55.6'), (555.555, ' 555.6'), - (5555.55, ' 5.6k'), (55555.5, ' 55.6k'), (555555, ' 555.6k')] - self.compare_all(formatter, in_out) - - formatter = fmt.EngFormatter(accuracy=0, use_eng_prefix=True) - in_out = [(5.55555, ' 6'), (55.5555, ' 56'), (555.555, ' 556'), - (5555.55, ' 6k'), (55555.5, ' 56k'), (555555, ' 556k')] - self.compare_all(formatter, in_out) - - formatter = fmt.EngFormatter(accuracy=3, use_eng_prefix=True) - result = formatter(0) - self.assertEqual(result, u(' 0.000')) - - def test_nan(self): - # Issue #11981 - - formatter = fmt.EngFormatter(accuracy=1, use_eng_prefix=True) - result = formatter(np.nan) - self.assertEqual(result, u('NaN')) - - df = pd.DataFrame({'a':[1.5, 10.3, 20.5], - 'b':[50.3, 60.67, 70.12], - 'c':[100.2, 101.33, 120.33]}) - pt = df.pivot_table(values='a', index='b', columns='c') - fmt.set_eng_float_format(accuracy=1) - result = pt.to_string() - self.assertTrue('NaN' in result) - self.reset_display_options() - - def test_inf(self): - # Issue #11981 - - formatter = fmt.EngFormatter(accuracy=1, use_eng_prefix=True) - result = formatter(np.inf) - self.assertEqual(result, u('inf')) - - -def _three_digit_exp(): - return '%.4g' % 1.7e8 == '1.7e+008' - - -class TestFloatArrayFormatter(tm.TestCase): - - def test_misc(self): - obj = fmt.FloatArrayFormatter(np.array([], dtype=np.float64)) - result = obj.get_result() - self.assertTrue(len(result) == 0) - - def test_format(self): - obj = fmt.FloatArrayFormatter(np.array([12, 0], dtype=np.float64)) - result = obj.get_result() - self.assertEqual(result[0], " 12.0") - self.assertEqual(result[1], " 0.0") - - def test_output_significant_digits(self): - # Issue #9764 - - # In case default display precision changes: - with pd.option_context('display.precision', 6): - # DataFrame example from issue #9764 - d = pd.DataFrame( - {'col1': [9.999e-8, 1e-7, 1.0001e-7, 2e-7, 4.999e-7, 5e-7, - 5.0001e-7, 6e-7, 9.999e-7, 1e-6, 1.0001e-6, 2e-6, - 4.999e-6, 5e-6, 5.0001e-6, 6e-6]}) - - expected_output = { - (0, 6): - ' col1\n0 9.999000e-08\n1 1.000000e-07\n2 1.000100e-07\n3 2.000000e-07\n4 4.999000e-07\n5 5.000000e-07', - (1, 6): - ' col1\n1 1.000000e-07\n2 1.000100e-07\n3 2.000000e-07\n4 4.999000e-07\n5 5.000000e-07', - (1, 8): - ' col1\n1 1.000000e-07\n2 1.000100e-07\n3 2.000000e-07\n4 4.999000e-07\n5 5.000000e-07\n6 5.000100e-07\n7 6.000000e-07', - (8, 16): - ' col1\n8 9.999000e-07\n9 1.000000e-06\n10 1.000100e-06\n11 2.000000e-06\n12 4.999000e-06\n13 5.000000e-06\n14 5.000100e-06\n15 6.000000e-06', - (9, 16): - ' col1\n9 0.000001\n10 0.000001\n11 0.000002\n12 0.000005\n13 0.000005\n14 0.000005\n15 0.000006' - } - - for (start, stop), v in expected_output.items(): - self.assertEqual(str(d[start:stop]), v) - - def test_too_long(self): - # GH 10451 - with pd.option_context('display.precision', 4): - # need 
both a number > 1e6 and something that normally formats to - # having length > display.precision + 6 - df = pd.DataFrame(dict(x=[12345.6789])) - self.assertEqual(str(df), ' x\n0 12345.6789') - df = pd.DataFrame(dict(x=[2e6])) - self.assertEqual(str(df), ' x\n0 2000000.0') - df = pd.DataFrame(dict(x=[12345.6789, 2e6])) - self.assertEqual( - str(df), ' x\n0 1.2346e+04\n1 2.0000e+06') - - -class TestRepr_timedelta64(tm.TestCase): - - def test_none(self): - delta_1d = pd.to_timedelta(1, unit='D') - delta_0d = pd.to_timedelta(0, unit='D') - delta_1s = pd.to_timedelta(1, unit='s') - delta_500ms = pd.to_timedelta(500, unit='ms') - - drepr = lambda x: x._repr_base() - self.assertEqual(drepr(delta_1d), "1 days") - self.assertEqual(drepr(-delta_1d), "-1 days") - self.assertEqual(drepr(delta_0d), "0 days") - self.assertEqual(drepr(delta_1s), "0 days 00:00:01") - self.assertEqual(drepr(delta_500ms), "0 days 00:00:00.500000") - self.assertEqual(drepr(delta_1d + delta_1s), "1 days 00:00:01") - self.assertEqual( - drepr(delta_1d + delta_500ms), "1 days 00:00:00.500000") - - def test_even_day(self): - delta_1d = pd.to_timedelta(1, unit='D') - delta_0d = pd.to_timedelta(0, unit='D') - delta_1s = pd.to_timedelta(1, unit='s') - delta_500ms = pd.to_timedelta(500, unit='ms') - - drepr = lambda x: x._repr_base(format='even_day') - self.assertEqual(drepr(delta_1d), "1 days") - self.assertEqual(drepr(-delta_1d), "-1 days") - self.assertEqual(drepr(delta_0d), "0 days") - self.assertEqual(drepr(delta_1s), "0 days 00:00:01") - self.assertEqual(drepr(delta_500ms), "0 days 00:00:00.500000") - self.assertEqual(drepr(delta_1d + delta_1s), "1 days 00:00:01") - self.assertEqual( - drepr(delta_1d + delta_500ms), "1 days 00:00:00.500000") - - def test_sub_day(self): - delta_1d = pd.to_timedelta(1, unit='D') - delta_0d = pd.to_timedelta(0, unit='D') - delta_1s = pd.to_timedelta(1, unit='s') - delta_500ms = pd.to_timedelta(500, unit='ms') - - drepr = lambda x: x._repr_base(format='sub_day') - self.assertEqual(drepr(delta_1d), "1 days") - self.assertEqual(drepr(-delta_1d), "-1 days") - self.assertEqual(drepr(delta_0d), "00:00:00") - self.assertEqual(drepr(delta_1s), "00:00:01") - self.assertEqual(drepr(delta_500ms), "00:00:00.500000") - self.assertEqual(drepr(delta_1d + delta_1s), "1 days 00:00:01") - self.assertEqual( - drepr(delta_1d + delta_500ms), "1 days 00:00:00.500000") - - def test_long(self): - delta_1d = pd.to_timedelta(1, unit='D') - delta_0d = pd.to_timedelta(0, unit='D') - delta_1s = pd.to_timedelta(1, unit='s') - delta_500ms = pd.to_timedelta(500, unit='ms') - - drepr = lambda x: x._repr_base(format='long') - self.assertEqual(drepr(delta_1d), "1 days 00:00:00") - self.assertEqual(drepr(-delta_1d), "-1 days +00:00:00") - self.assertEqual(drepr(delta_0d), "0 days 00:00:00") - self.assertEqual(drepr(delta_1s), "0 days 00:00:01") - self.assertEqual(drepr(delta_500ms), "0 days 00:00:00.500000") - self.assertEqual(drepr(delta_1d + delta_1s), "1 days 00:00:01") - self.assertEqual( - drepr(delta_1d + delta_500ms), "1 days 00:00:00.500000") - - def test_all(self): - delta_1d = pd.to_timedelta(1, unit='D') - delta_0d = pd.to_timedelta(0, unit='D') - delta_1ns = pd.to_timedelta(1, unit='ns') - - drepr = lambda x: x._repr_base(format='all') - self.assertEqual(drepr(delta_1d), "1 days 00:00:00.000000000") - self.assertEqual(drepr(delta_0d), "0 days 00:00:00.000000000") - self.assertEqual(drepr(delta_1ns), "0 days 00:00:00.000000001") - - -class TestTimedelta64Formatter(tm.TestCase): - - def test_days(self): - x = 
pd.to_timedelta(list(range(5)) + [pd.NaT], unit='D')
-        result = fmt.Timedelta64Formatter(x, box=True).get_result()
-        self.assertEqual(result[0].strip(), "'0 days'")
-        self.assertEqual(result[1].strip(), "'1 days'")
-
-        result = fmt.Timedelta64Formatter(x[1:2], box=True).get_result()
-        self.assertEqual(result[0].strip(), "'1 days'")
-
-        result = fmt.Timedelta64Formatter(x, box=False).get_result()
-        self.assertEqual(result[0].strip(), "0 days")
-        self.assertEqual(result[1].strip(), "1 days")
-
-        result = fmt.Timedelta64Formatter(x[1:2], box=False).get_result()
-        self.assertEqual(result[0].strip(), "1 days")
-
-    def test_days_neg(self):
-        x = pd.to_timedelta(list(range(5)) + [pd.NaT], unit='D')
-        result = fmt.Timedelta64Formatter(-x, box=True).get_result()
-        self.assertEqual(result[0].strip(), "'0 days'")
-        self.assertEqual(result[1].strip(), "'-1 days'")
-
-    def test_subdays(self):
-        y = pd.to_timedelta(list(range(5)) + [pd.NaT], unit='s')
-        result = fmt.Timedelta64Formatter(y, box=True).get_result()
-        self.assertEqual(result[0].strip(), "'00:00:00'")
-        self.assertEqual(result[1].strip(), "'00:00:01'")
-
-    def test_subdays_neg(self):
-        y = pd.to_timedelta(list(range(5)) + [pd.NaT], unit='s')
-        result = fmt.Timedelta64Formatter(-y, box=True).get_result()
-        self.assertEqual(result[0].strip(), "'00:00:00'")
-        self.assertEqual(result[1].strip(), "'-1 days +23:59:59'")
-
-    def test_zero(self):
-        x = pd.to_timedelta(list(range(1)) + [pd.NaT], unit='D')
-        result = fmt.Timedelta64Formatter(x, box=True).get_result()
-        self.assertEqual(result[0].strip(), "'0 days'")
-
-        x = pd.to_timedelta(list(range(1)), unit='D')
-        result = fmt.Timedelta64Formatter(x, box=True).get_result()
-        self.assertEqual(result[0].strip(), "'0 days'")
-
-
-class TestDatetime64Formatter(tm.TestCase):
-
-    def test_mixed(self):
-        x = Series([datetime(2013, 1, 1), datetime(2013, 1, 1, 12), pd.NaT])
-        result = fmt.Datetime64Formatter(x).get_result()
-        self.assertEqual(result[0].strip(), "2013-01-01 00:00:00")
-        self.assertEqual(result[1].strip(), "2013-01-01 12:00:00")
-
-    def test_dates(self):
-        x = Series([datetime(2013, 1, 1), datetime(2013, 1, 2), pd.NaT])
-        result = fmt.Datetime64Formatter(x).get_result()
-        self.assertEqual(result[0].strip(), "2013-01-01")
-        self.assertEqual(result[1].strip(), "2013-01-02")
-
-    def test_date_nanos(self):
-        x = Series([Timestamp(200)])
-        result = fmt.Datetime64Formatter(x).get_result()
-        self.assertEqual(result[0].strip(), "1970-01-01 00:00:00.000000200")
-
-    def test_dates_display(self):
-
-        # 10170
-        # make sure that we display date formatting consistently
-        x = Series(date_range('20130101 09:00:00', periods=5, freq='D'))
-        x.iloc[1] = np.nan
-        result = fmt.Datetime64Formatter(x).get_result()
-        self.assertEqual(result[0].strip(), "2013-01-01 09:00:00")
-        self.assertEqual(result[1].strip(), "NaT")
-        self.assertEqual(result[4].strip(), "2013-01-05 09:00:00")
-
-        x = Series(date_range('20130101 09:00:00', periods=5, freq='s'))
-        x.iloc[1] = np.nan
-        result = fmt.Datetime64Formatter(x).get_result()
-        self.assertEqual(result[0].strip(), "2013-01-01 09:00:00")
-        self.assertEqual(result[1].strip(), "NaT")
-        self.assertEqual(result[4].strip(), "2013-01-01 09:00:04")
-
-        x = Series(date_range('20130101 09:00:00', periods=5, freq='ms'))
-        x.iloc[1] = np.nan
-        result = fmt.Datetime64Formatter(x).get_result()
-        self.assertEqual(result[0].strip(), "2013-01-01 09:00:00.000")
-        self.assertEqual(result[1].strip(), "NaT")
-        self.assertEqual(result[4].strip(), "2013-01-01 09:00:00.004")
-
-        x =
Series(date_range('20130101 09:00:00', periods=5, freq='us')) - x.iloc[1] = np.nan - result = fmt.Datetime64Formatter(x).get_result() - self.assertEqual(result[0].strip(), "2013-01-01 09:00:00.000000") - self.assertEqual(result[1].strip(), "NaT") - self.assertEqual(result[4].strip(), "2013-01-01 09:00:00.000004") - - x = Series(date_range('20130101 09:00:00', periods=5, freq='N')) - x.iloc[1] = np.nan - result = fmt.Datetime64Formatter(x).get_result() - self.assertEqual(result[0].strip(), "2013-01-01 09:00:00.000000000") - self.assertEqual(result[1].strip(), "NaT") - self.assertEqual(result[4].strip(), "2013-01-01 09:00:00.000000004") - - def test_datetime64formatter_yearmonth(self): - x = Series([datetime(2016, 1, 1), datetime(2016, 2, 2)]) - - def format_func(x): - return x.strftime('%Y-%m') - - formatter = fmt.Datetime64Formatter(x, formatter=format_func) - result = formatter.get_result() - self.assertEqual(result, ['2016-01', '2016-02']) - - def test_datetime64formatter_hoursecond(self): - - x = Series(pd.to_datetime(['10:10:10.100', '12:12:12.120'], - format='%H:%M:%S.%f')) - - def format_func(x): - return x.strftime('%H:%M') - - formatter = fmt.Datetime64Formatter(x, formatter=format_func) - result = formatter.get_result() - self.assertEqual(result, ['10:10', '12:12']) - - -class TestNaTFormatting(tm.TestCase): - - def test_repr(self): - self.assertEqual(repr(pd.NaT), "NaT") - - def test_str(self): - self.assertEqual(str(pd.NaT), "NaT") - - -class TestDatetimeIndexFormat(tm.TestCase): - - def test_datetime(self): - formatted = pd.to_datetime([datetime(2003, 1, 1, 12), pd.NaT]).format() - self.assertEqual(formatted[0], "2003-01-01 12:00:00") - self.assertEqual(formatted[1], "NaT") - - def test_date(self): - formatted = pd.to_datetime([datetime(2003, 1, 1), pd.NaT]).format() - self.assertEqual(formatted[0], "2003-01-01") - self.assertEqual(formatted[1], "NaT") - - def test_date_tz(self): - formatted = pd.to_datetime([datetime(2013, 1, 1)], utc=True).format() - self.assertEqual(formatted[0], "2013-01-01 00:00:00+00:00") - - formatted = pd.to_datetime( - [datetime(2013, 1, 1), pd.NaT], utc=True).format() - self.assertEqual(formatted[0], "2013-01-01 00:00:00+00:00") - - def test_date_explict_date_format(self): - formatted = pd.to_datetime([datetime(2003, 2, 1), pd.NaT]).format( - date_format="%m-%d-%Y", na_rep="UT") - self.assertEqual(formatted[0], "02-01-2003") - self.assertEqual(formatted[1], "UT") - - -class TestDatetimeIndexUnicode(tm.TestCase): - - def test_dates(self): - text = str(pd.to_datetime([datetime(2013, 1, 1), datetime(2014, 1, 1) - ])) - self.assertTrue("['2013-01-01'," in text) - self.assertTrue(", '2014-01-01']" in text) - - def test_mixed(self): - text = str(pd.to_datetime([datetime(2013, 1, 1), datetime( - 2014, 1, 1, 12), datetime(2014, 1, 1)])) - self.assertTrue("'2013-01-01 00:00:00'," in text) - self.assertTrue("'2014-01-01 00:00:00']" in text) - - -class TestStringRepTimestamp(tm.TestCase): - - def test_no_tz(self): - dt_date = datetime(2013, 1, 2) - self.assertEqual(str(dt_date), str(Timestamp(dt_date))) - - dt_datetime = datetime(2013, 1, 2, 12, 1, 3) - self.assertEqual(str(dt_datetime), str(Timestamp(dt_datetime))) - - dt_datetime_us = datetime(2013, 1, 2, 12, 1, 3, 45) - self.assertEqual(str(dt_datetime_us), str(Timestamp(dt_datetime_us))) - - ts_nanos_only = Timestamp(200) - self.assertEqual(str(ts_nanos_only), "1970-01-01 00:00:00.000000200") - - ts_nanos_micros = Timestamp(1200) - self.assertEqual(str(ts_nanos_micros), "1970-01-01 00:00:00.000001200") - - 
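The `test_no_tz` assertions above pin down how `Timestamp` renders sub-microsecond precision. A minimal sketch of that behaviour outside the test harness (plain pandas API; the variable name is illustrative):

```python
import pandas as pd

# An integer passed to Timestamp is interpreted as nanoseconds since
# the epoch, and str() keeps the trailing sub-microsecond digits
# instead of rounding to microseconds.
ts = pd.Timestamp(1200)   # 1200 ns after 1970-01-01
str(ts)                   # '1970-01-01 00:00:00.000001200'
```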
def test_tz_pytz(self): - tm._skip_if_no_pytz() - - import pytz - - dt_date = datetime(2013, 1, 2, tzinfo=pytz.utc) - self.assertEqual(str(dt_date), str(Timestamp(dt_date))) - - dt_datetime = datetime(2013, 1, 2, 12, 1, 3, tzinfo=pytz.utc) - self.assertEqual(str(dt_datetime), str(Timestamp(dt_datetime))) - - dt_datetime_us = datetime(2013, 1, 2, 12, 1, 3, 45, tzinfo=pytz.utc) - self.assertEqual(str(dt_datetime_us), str(Timestamp(dt_datetime_us))) - - def test_tz_dateutil(self): - tm._skip_if_no_dateutil() - import dateutil - utc = dateutil.tz.tzutc() - - dt_date = datetime(2013, 1, 2, tzinfo=utc) - self.assertEqual(str(dt_date), str(Timestamp(dt_date))) - - dt_datetime = datetime(2013, 1, 2, 12, 1, 3, tzinfo=utc) - self.assertEqual(str(dt_datetime), str(Timestamp(dt_datetime))) - - dt_datetime_us = datetime(2013, 1, 2, 12, 1, 3, 45, tzinfo=utc) - self.assertEqual(str(dt_datetime_us), str(Timestamp(dt_datetime_us))) - - def test_nat_representations(self): - for f in (str, repr, methodcaller('isoformat')): - self.assertEqual(f(pd.NaT), 'NaT') - - -def test_format_percentiles(): - result = fmt.format_percentiles([0.01999, 0.02001, 0.5, 0.666666, 0.9999]) - expected = ['1.999%', '2.001%', '50%', '66.667%', '99.99%'] - tm.assert_equal(result, expected) - - result = fmt.format_percentiles([0, 0.5, 0.02001, 0.5, 0.666666, 0.9999]) - expected = ['0%', '50%', '2.0%', '50%', '66.67%', '99.99%'] - tm.assert_equal(result, expected) - - tm.assertRaises(ValueError, fmt.format_percentiles, [0.1, np.nan, 0.5]) - tm.assertRaises(ValueError, fmt.format_percentiles, [-0.001, 0.1, 0.5]) - tm.assertRaises(ValueError, fmt.format_percentiles, [2, 0.1, 0.5]) - tm.assertRaises(ValueError, fmt.format_percentiles, [0.1, 0.5, 'a']) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/formats/test_printing.py b/pandas/tests/formats/test_printing.py deleted file mode 100644 index 3bcceca1f50a7..0000000000000 --- a/pandas/tests/formats/test_printing.py +++ /dev/null @@ -1,142 +0,0 @@ -# -*- coding: utf-8 -*- -import nose -from pandas import compat -import pandas.formats.printing as printing -import pandas.formats.format as fmt -import pandas.util.testing as tm -import pandas.core.config as cf - -_multiprocess_can_split_ = True - - -def test_adjoin(): - data = [['a', 'b', 'c'], ['dd', 'ee', 'ff'], ['ggg', 'hhh', 'iii']] - expected = 'a dd ggg\nb ee hhh\nc ff iii' - - adjoined = printing.adjoin(2, *data) - - assert (adjoined == expected) - - -def test_repr_binary_type(): - import string - letters = string.ascii_letters - btype = compat.binary_type - try: - raw = btype(letters, encoding=cf.get_option('display.encoding')) - except TypeError: - raw = btype(letters) - b = compat.text_type(compat.bytes_to_str(raw)) - res = printing.pprint_thing(b, quote_strings=True) - tm.assert_equal(res, repr(b)) - res = printing.pprint_thing(b, quote_strings=False) - tm.assert_equal(res, b) - - -class TestFormattBase(tm.TestCase): - - def test_adjoin(self): - data = [['a', 'b', 'c'], ['dd', 'ee', 'ff'], ['ggg', 'hhh', 'iii']] - expected = 'a dd ggg\nb ee hhh\nc ff iii' - - adjoined = printing.adjoin(2, *data) - - self.assertEqual(adjoined, expected) - - def test_adjoin_unicode(self): - data = [[u'あ', 'b', 'c'], ['dd', u'ええ', 'ff'], ['ggg', 'hhh', u'いいい']] - expected = u'あ dd ggg\nb ええ hhh\nc ff いいい' - adjoined = printing.adjoin(2, *data) - self.assertEqual(adjoined, expected) - - adj = fmt.EastAsianTextAdjustment() - - expected = u"""あ dd ggg -b ええ 
hhh -c ff いいい""" - - adjoined = adj.adjoin(2, *data) - self.assertEqual(adjoined, expected) - cols = adjoined.split('\n') - self.assertEqual(adj.len(cols[0]), 13) - self.assertEqual(adj.len(cols[1]), 13) - self.assertEqual(adj.len(cols[2]), 16) - - expected = u"""あ dd ggg -b ええ hhh -c ff いいい""" - - adjoined = adj.adjoin(7, *data) - self.assertEqual(adjoined, expected) - cols = adjoined.split('\n') - self.assertEqual(adj.len(cols[0]), 23) - self.assertEqual(adj.len(cols[1]), 23) - self.assertEqual(adj.len(cols[2]), 26) - - def test_justify(self): - adj = fmt.EastAsianTextAdjustment() - - def just(x, *args, **kwargs): - # wrapper to test single str - return adj.justify([x], *args, **kwargs)[0] - - self.assertEqual(just('abc', 5, mode='left'), 'abc ') - self.assertEqual(just('abc', 5, mode='center'), ' abc ') - self.assertEqual(just('abc', 5, mode='right'), ' abc') - self.assertEqual(just(u'abc', 5, mode='left'), 'abc ') - self.assertEqual(just(u'abc', 5, mode='center'), ' abc ') - self.assertEqual(just(u'abc', 5, mode='right'), ' abc') - - self.assertEqual(just(u'パンダ', 5, mode='left'), u'パンダ') - self.assertEqual(just(u'パンダ', 5, mode='center'), u'パンダ') - self.assertEqual(just(u'パンダ', 5, mode='right'), u'パンダ') - - self.assertEqual(just(u'パンダ', 10, mode='left'), u'パンダ ') - self.assertEqual(just(u'パンダ', 10, mode='center'), u' パンダ ') - self.assertEqual(just(u'パンダ', 10, mode='right'), u' パンダ') - - def test_east_asian_len(self): - adj = fmt.EastAsianTextAdjustment() - - self.assertEqual(adj.len('abc'), 3) - self.assertEqual(adj.len(u'abc'), 3) - - self.assertEqual(adj.len(u'パンダ'), 6) - self.assertEqual(adj.len(u'ﾊﾟﾝﾀﾞ'), 5) - self.assertEqual(adj.len(u'パンダpanda'), 11) - self.assertEqual(adj.len(u'ﾊﾟﾝﾀﾞpanda'), 10) - - def test_ambiguous_width(self): - adj = fmt.EastAsianTextAdjustment() - self.assertEqual(adj.len(u'¡¡ab'), 4) - - with cf.option_context('display.unicode.ambiguous_as_wide', True): - adj = fmt.EastAsianTextAdjustment() - self.assertEqual(adj.len(u'¡¡ab'), 6) - - data = [[u'あ', 'b', 'c'], ['dd', u'ええ', 'ff'], - ['ggg', u'¡¡ab', u'いいい']] - expected = u'あ dd ggg \nb ええ ¡¡ab\nc ff いいい' - adjoined = adj.adjoin(2, *data) - self.assertEqual(adjoined, expected) - - -# TODO: fix this broken test - -# def test_console_encode(): -# """ -# On Python 2, if sys.stdin.encoding is None (IPython with zmq frontend) -# common.console_encode should encode things as utf-8. 
-# """ -# if compat.PY3: -# raise nose.SkipTest - -# with tm.stdin_encoding(encoding=None): -# result = printing.console_encode(u"\u05d0") -# expected = u"\u05d0".encode('utf-8') -# assert (result == expected) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/frame/common.py b/pandas/tests/frame/common.py index 37f67712e1b58..b475d25eb5dac 100644 --- a/pandas/tests/frame/common.py +++ b/pandas/tests/frame/common.py @@ -1,7 +1,7 @@ import numpy as np from pandas import compat -from pandas.util.decorators import cache_readonly +from pandas.util._decorators import cache_readonly import pandas.util.testing as tm import pandas as pd @@ -89,11 +89,11 @@ def empty(self): @cache_readonly def ts1(self): - return tm.makeTimeSeries() + return tm.makeTimeSeries(nper=30) @cache_readonly def ts2(self): - return tm.makeTimeSeries()[5:] + return tm.makeTimeSeries(nper=30)[5:] @cache_readonly def simple(self): diff --git a/pandas/tests/frame/test_alter_axes.py b/pandas/tests/frame/test_alter_axes.py index 46f0fff7bb4b8..e6313dfc602a8 100644 --- a/pandas/tests/frame/test_alter_axes.py +++ b/pandas/tests/frame/test_alter_axes.py @@ -2,27 +2,30 @@ from __future__ import print_function +import pytest + from datetime import datetime, timedelta import numpy as np from pandas.compat import lrange from pandas import (DataFrame, Series, Index, MultiIndex, - RangeIndex) + RangeIndex, date_range, IntervalIndex, + to_datetime) +from pandas.core.dtypes.common import ( + is_object_dtype, + is_categorical_dtype, + is_interval_dtype) import pandas as pd -from pandas.util.testing import (assert_series_equal, - assert_frame_equal, - assertRaisesRegexp) +from pandas.util.testing import assert_series_equal, assert_frame_equal import pandas.util.testing as tm from pandas.tests.frame.common import TestData -class TestDataFrameAlterAxes(tm.TestCase, TestData): - - _multiprocess_can_split_ = True +class TestDataFrameAlterAxes(TestData): def test_set_index(self): idx = Index(np.arange(len(self.mixed_frame))) @@ -30,8 +33,8 @@ def test_set_index(self): # cache it _ = self.mixed_frame['foo'] # noqa self.mixed_frame.index = idx - self.assertIs(self.mixed_frame['foo'].index, idx) - with assertRaisesRegexp(ValueError, 'Length mismatch'): + assert self.mixed_frame['foo'].index is idx + with tm.assert_raises_regex(ValueError, 'Length mismatch'): self.mixed_frame.index = idx[::2] def test_set_index_cast(self): @@ -39,10 +42,10 @@ def test_set_index_cast(self): # issue casting an index then set_index df = DataFrame({'A': [1.1, 2.2, 3.3], 'B': [5.0, 6.1, 7.2]}, index=[2010, 2011, 2012]) - expected = df.ix[2010] + expected = df.loc[2010] new_index = df.index.astype(np.int32) df.index = new_index - result = df.ix[2010] + result = df.loc[2010] assert_series_equal(result, expected) def test_set_index2(self): @@ -58,7 +61,7 @@ def test_set_index2(self): index = Index(df['C'], name='C') - expected = df.ix[:, ['A', 'B', 'D', 'E']] + expected = df.loc[:, ['A', 'B', 'D', 'E']] expected.index = index expected_nodrop = df.copy() @@ -66,7 +69,7 @@ def test_set_index2(self): assert_frame_equal(result, expected) assert_frame_equal(result_nodrop, expected_nodrop) - self.assertEqual(result.index.name, index.name) + assert result.index.name == index.name # inplace, single df2 = df.copy() @@ -86,7 +89,7 @@ def test_set_index2(self): index = MultiIndex.from_arrays([df['A'], df['B']], names=['A', 'B']) - expected = df.ix[:, ['C', 'D', 'E']] + expected = df.loc[:, 
['C', 'D', 'E']] expected.index = index expected_nodrop = df.copy() @@ -94,7 +97,7 @@ def test_set_index2(self): assert_frame_equal(result, expected) assert_frame_equal(result_nodrop, expected_nodrop) - self.assertEqual(result.index.names, index.names) + assert result.index.names == index.names # inplace df2 = df.copy() @@ -106,7 +109,8 @@ def test_set_index2(self): assert_frame_equal(df3, expected_nodrop) # corner case - with assertRaisesRegexp(ValueError, 'Index has duplicate keys'): + with tm.assert_raises_regex(ValueError, + 'Index has duplicate keys'): df.set_index('A', verify_integrity=True) # append @@ -123,7 +127,7 @@ def test_set_index2(self): # Series result = df.set_index(df.C) - self.assertEqual(result.index.name, 'C') + assert result.index.name == 'C' def test_set_index_nonuniq(self): df = DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar'], @@ -131,9 +135,10 @@ def test_set_index_nonuniq(self): 'C': ['a', 'b', 'c', 'd', 'e'], 'D': np.random.randn(5), 'E': np.random.randn(5)}) - with assertRaisesRegexp(ValueError, 'Index has duplicate keys'): + with tm.assert_raises_regex(ValueError, + 'Index has duplicate keys'): df.set_index('A', verify_integrity=True, inplace=True) - self.assertIn('A', df) + assert 'A' in df def test_set_index_bug(self): # GH1590 @@ -169,7 +174,7 @@ def test_construction_with_categorical_index(self): idf = df.set_index('B') str(idf) tm.assert_index_equal(idf.index, ci, check_names=False) - self.assertEqual(idf.index.name, 'B') + assert idf.index.name == 'B' # from a CategoricalIndex df = DataFrame({'A': np.random.randn(10), @@ -177,17 +182,17 @@ def test_construction_with_categorical_index(self): idf = df.set_index('B') str(idf) tm.assert_index_equal(idf.index, ci, check_names=False) - self.assertEqual(idf.index.name, 'B') + assert idf.index.name == 'B' idf = df.set_index('B').reset_index().set_index('B') str(idf) tm.assert_index_equal(idf.index, ci, check_names=False) - self.assertEqual(idf.index.name, 'B') + assert idf.index.name == 'B' new_df = idf.reset_index() new_df.index = df.B tm.assert_index_equal(new_df.index, ci, check_names=False) - self.assertEqual(idf.index.name, 'B') + assert idf.index.name == 'B' def test_set_index_cast_datetimeindex(self): df = DataFrame({'A': [datetime(2000, 1, 1) + timedelta(i) @@ -195,13 +200,13 @@ def test_set_index_cast_datetimeindex(self): 'B': np.random.randn(1000)}) idf = df.set_index('A') - tm.assertIsInstance(idf.index, pd.DatetimeIndex) + assert isinstance(idf.index, pd.DatetimeIndex) # don't cast a DatetimeIndex WITH a tz, leave as object # GH 6032 i = (pd.DatetimeIndex( - pd.tseries.tools.to_datetime(['2013-1-1 13:00', - '2013-1-2 14:00'], errors="raise")) + to_datetime(['2013-1-1 13:00', + '2013-1-2 14:00'], errors="raise")) .tz_localize('US/Pacific')) df = DataFrame(np.random.randn(2, 1), columns=['A']) @@ -219,7 +224,7 @@ def test_set_index_cast_datetimeindex(self): df['B'] = i result = df['B'] assert_series_equal(result, expected, check_names=False) - self.assertEqual(result.name, 'B') + assert result.name == 'B' # keep the timezone result = i.to_series(keep_tz=True) @@ -230,13 +235,13 @@ def test_set_index_cast_datetimeindex(self): result = df['C'] comp = pd.DatetimeIndex(expected.values).copy() comp.tz = None - self.assert_numpy_array_equal(result.values, comp.values) + tm.assert_numpy_array_equal(result.values, comp.values) # list of datetimes with a tz df['D'] = i.to_pydatetime() result = df['D'] assert_series_equal(result, expected, check_names=False) - self.assertEqual(result.name, 'D') + assert 
result.name == 'D' # GH 6785 # set the index manually @@ -274,9 +279,9 @@ def test_set_index_timezone(self): i = pd.to_datetime(["2014-01-01 10:10:10"], utc=True).tz_convert('Europe/Rome') df = DataFrame({'i': i}) - self.assertEqual(df.set_index(i).index[0].hour, 11) - self.assertEqual(pd.DatetimeIndex(pd.Series(df.i))[0].hour, 11) - self.assertEqual(df.set_index(df.i).index[0].hour, 11) + assert df.set_index(i).index[0].hour == 11 + assert pd.DatetimeIndex(pd.Series(df.i))[0].hour == 11 + assert df.set_index(df.i).index[0].hour == 11 def test_set_index_dst(self): di = pd.date_range('2006-10-29 00:00:00', periods=3, @@ -297,12 +302,23 @@ def test_set_index_dst(self): exp = pd.DataFrame({'b': [3, 4, 5]}, index=exp_index) tm.assert_frame_equal(res, exp) + def test_reset_index_with_intervals(self): + idx = pd.IntervalIndex.from_breaks(np.arange(11), name='x') + original = pd.DataFrame({'x': idx, 'y': np.arange(10)})[['x', 'y']] + + result = original.set_index('x') + expected = pd.DataFrame({'y': np.arange(10)}, index=idx) + assert_frame_equal(result, expected) + + result2 = result.reset_index() + assert_frame_equal(result2, original) + def test_set_index_multiindexcolumns(self): columns = MultiIndex.from_tuples([('foo', 1), ('foo', 2), ('bar', 1)]) df = DataFrame(np.random.randn(3, 3), columns=columns) rs = df.set_index(df.columns[0]) - xp = df.ix[:, 1:] - xp.index = df.ix[:, 0].values + xp = df.iloc[:, 1:] + xp.index = df.iloc[:, 0].values xp.index.names = [df.columns[0]] assert_frame_equal(rs, xp) @@ -322,9 +338,35 @@ def test_set_index_empty_column(self): def test_set_columns(self): cols = Index(np.arange(len(self.mixed_frame.columns))) self.mixed_frame.columns = cols - with assertRaisesRegexp(ValueError, 'Length mismatch'): + with tm.assert_raises_regex(ValueError, 'Length mismatch'): self.mixed_frame.columns = cols[::2] + def test_dti_set_index_reindex(self): + # GH 6631 + df = DataFrame(np.random.random(6)) + idx1 = date_range('2011/01/01', periods=6, freq='M', tz='US/Eastern') + idx2 = date_range('2013', periods=6, freq='A', tz='Asia/Tokyo') + + df = df.set_index(idx1) + tm.assert_index_equal(df.index, idx1) + df = df.reindex(idx2) + tm.assert_index_equal(df.index, idx2) + + # 11314 + # with tz + index = date_range(datetime(2015, 10, 1), + datetime(2015, 10, 1, 23), + freq='H', tz='US/Eastern') + df = DataFrame(np.random.randn(24, 1), columns=['a'], index=index) + new_index = date_range(datetime(2015, 10, 2), + datetime(2015, 10, 2, 23), + freq='H', tz='US/Eastern') + + # TODO: unused? 
+ result = df.set_index(new_index) # noqa + + assert new_index.freq == index.freq + # Renaming def test_rename(self): @@ -356,7 +398,7 @@ def test_rename(self): tm.assert_index_equal(renamed.index, pd.Index(['BAR', 'FOO'])) # have to pass something - self.assertRaises(TypeError, self.frame.rename) + pytest.raises(TypeError, self.frame.rename) # partial columns renamed = self.frame.rename(columns={'C': 'foo', 'D': 'bar'}) @@ -374,45 +416,100 @@ def test_rename(self): renamed = renamer.rename(index={'foo': 'bar', 'bar': 'foo'}) tm.assert_index_equal(renamed.index, pd.Index(['bar', 'foo'], name='name')) - self.assertEqual(renamed.index.name, renamer.index.name) + assert renamed.index.name == renamer.index.name + + def test_rename_multiindex(self): - # MultiIndex tuples_index = [('foo1', 'bar1'), ('foo2', 'bar2')] tuples_columns = [('fizz1', 'buzz1'), ('fizz2', 'buzz2')] index = MultiIndex.from_tuples(tuples_index, names=['foo', 'bar']) columns = MultiIndex.from_tuples( tuples_columns, names=['fizz', 'buzz']) - renamer = DataFrame([(0, 0), (1, 1)], index=index, columns=columns) - renamed = renamer.rename(index={'foo1': 'foo3', 'bar2': 'bar3'}, - columns={'fizz1': 'fizz3', 'buzz2': 'buzz3'}) + df = DataFrame([(0, 0), (1, 1)], index=index, columns=columns) + + # + # without specifying level -> across all levels + + renamed = df.rename(index={'foo1': 'foo3', 'bar2': 'bar3'}, + columns={'fizz1': 'fizz3', 'buzz2': 'buzz3'}) new_index = MultiIndex.from_tuples([('foo3', 'bar1'), ('foo2', 'bar3')], names=['foo', 'bar']) new_columns = MultiIndex.from_tuples([('fizz3', 'buzz1'), ('fizz2', 'buzz3')], names=['fizz', 'buzz']) - self.assert_index_equal(renamed.index, new_index) - self.assert_index_equal(renamed.columns, new_columns) - self.assertEqual(renamed.index.names, renamer.index.names) - self.assertEqual(renamed.columns.names, renamer.columns.names) + tm.assert_index_equal(renamed.index, new_index) + tm.assert_index_equal(renamed.columns, new_columns) + assert renamed.index.names == df.index.names + assert renamed.columns.names == df.columns.names + + # + # with specifying a level (GH13766) + + # dict + new_columns = MultiIndex.from_tuples([('fizz3', 'buzz1'), + ('fizz2', 'buzz2')], + names=['fizz', 'buzz']) + renamed = df.rename(columns={'fizz1': 'fizz3', 'buzz2': 'buzz3'}, + level=0) + tm.assert_index_equal(renamed.columns, new_columns) + renamed = df.rename(columns={'fizz1': 'fizz3', 'buzz2': 'buzz3'}, + level='fizz') + tm.assert_index_equal(renamed.columns, new_columns) + + new_columns = MultiIndex.from_tuples([('fizz1', 'buzz1'), + ('fizz2', 'buzz3')], + names=['fizz', 'buzz']) + renamed = df.rename(columns={'fizz1': 'fizz3', 'buzz2': 'buzz3'}, + level=1) + tm.assert_index_equal(renamed.columns, new_columns) + renamed = df.rename(columns={'fizz1': 'fizz3', 'buzz2': 'buzz3'}, + level='buzz') + tm.assert_index_equal(renamed.columns, new_columns) + + # function + func = str.upper + new_columns = MultiIndex.from_tuples([('FIZZ1', 'buzz1'), + ('FIZZ2', 'buzz2')], + names=['fizz', 'buzz']) + renamed = df.rename(columns=func, level=0) + tm.assert_index_equal(renamed.columns, new_columns) + renamed = df.rename(columns=func, level='fizz') + tm.assert_index_equal(renamed.columns, new_columns) + + new_columns = MultiIndex.from_tuples([('fizz1', 'BUZZ1'), + ('fizz2', 'BUZZ2')], + names=['fizz', 'buzz']) + renamed = df.rename(columns=func, level=1) + tm.assert_index_equal(renamed.columns, new_columns) + renamed = df.rename(columns=func, level='buzz') + tm.assert_index_equal(renamed.columns, new_columns) + + 
# index + new_index = MultiIndex.from_tuples([('foo3', 'bar1'), + ('foo2', 'bar2')], + names=['foo', 'bar']) + renamed = df.rename(index={'foo1': 'foo3', 'bar2': 'bar3'}, + level=0) + tm.assert_index_equal(renamed.index, new_index) def test_rename_nocopy(self): renamed = self.frame.rename(columns={'C': 'foo'}, copy=False) renamed['foo'] = 1. - self.assertTrue((self.frame['C'] == 1.).all()) + assert (self.frame['C'] == 1.).all() def test_rename_inplace(self): self.frame.rename(columns={'C': 'foo'}) - self.assertIn('C', self.frame) - self.assertNotIn('foo', self.frame) + assert 'C' in self.frame + assert 'foo' not in self.frame c_id = id(self.frame['C']) frame = self.frame.copy() frame.rename(columns={'C': 'foo'}, inplace=True) - self.assertNotIn('C', frame) - self.assertIn('foo', frame) - self.assertNotEqual(id(frame['foo']), c_id) + assert 'C' not in frame + assert 'foo' in frame + assert id(frame['foo']) != c_id def test_rename_bug(self): # GH 5344 @@ -492,27 +589,27 @@ def test_reset_index(self): # default name assigned rdf = self.frame.reset_index() exp = pd.Series(self.frame.index.values, name='index') - self.assert_series_equal(rdf['index'], exp) + tm.assert_series_equal(rdf['index'], exp) # default name assigned, corner case df = self.frame.copy() df['index'] = 'foo' rdf = df.reset_index() exp = pd.Series(self.frame.index.values, name='level_0') - self.assert_series_equal(rdf['level_0'], exp) + tm.assert_series_equal(rdf['level_0'], exp) # but this is ok self.frame.index.name = 'index' deleveled = self.frame.reset_index() - self.assert_series_equal(deleveled['index'], - pd.Series(self.frame.index)) - self.assert_index_equal(deleveled.index, - pd.Index(np.arange(len(deleveled)))) + tm.assert_series_equal(deleveled['index'], + pd.Series(self.frame.index)) + tm.assert_index_equal(deleveled.index, + pd.Index(np.arange(len(deleveled)))) # preserve column names self.frame.columns.name = 'columns' resetted = self.frame.reset_index() - self.assertEqual(resetted.columns.name, 'columns') + assert resetted.columns.name == 'columns' # only remove certain columns frame = self.frame.reset_index().set_index(['index', 'A', 'B']) @@ -552,10 +649,10 @@ def test_reset_index_right_dtype(self): df = DataFrame(s1) resetted = s1.reset_index() - self.assertEqual(resetted['time'].dtype, np.float64) + assert resetted['time'].dtype == np.float64 resetted = df.reset_index() - self.assertEqual(resetted['time'].dtype, np.float64) + assert resetted['time'].dtype == np.float64 def test_reset_index_multiindex_col(self): vals = np.random.randn(3, 3).astype(object) @@ -600,6 +697,33 @@ def test_reset_index_multiindex_col(self): ['a', 'mean', 'median', 'mean']]) assert_frame_equal(rs, xp) + def test_reset_index_multiindex_nan(self): + # GH6322, testing reset_index on MultiIndexes + # when we have a nan or all nan + df = pd.DataFrame({'A': ['a', 'b', 'c'], + 'B': [0, 1, np.nan], + 'C': np.random.rand(3)}) + rs = df.set_index(['A', 'B']).reset_index() + assert_frame_equal(rs, df) + + df = pd.DataFrame({'A': [np.nan, 'b', 'c'], + 'B': [0, 1, 2], + 'C': np.random.rand(3)}) + rs = df.set_index(['A', 'B']).reset_index() + assert_frame_equal(rs, df) + + df = pd.DataFrame({'A': ['a', 'b', 'c'], + 'B': [0, 1, 2], + 'C': [np.nan, 1.1, 2.2]}) + rs = df.set_index(['A', 'B']).reset_index() + assert_frame_equal(rs, df) + + df = pd.DataFrame({'A': ['a', 'b', 'c'], + 'B': [np.nan, np.nan, np.nan], + 'C': np.random.rand(3)}) + rs = df.set_index(['A', 'B']).reset_index() + assert_frame_equal(rs, df) + def 
test_reset_index_with_datetimeindex_cols(self): # GH5818 # @@ -618,7 +742,7 @@ def test_reset_index_range(self): df = pd.DataFrame([[0, 0], [1, 1]], columns=['A', 'B'], index=RangeIndex(stop=2)) result = df.reset_index() - tm.assertIsInstance(result.index, RangeIndex) + assert isinstance(result.index, RangeIndex) expected = pd.DataFrame([[0, 0, 0], [1, 1, 1]], columns=['index', 'A', 'B'], index=RangeIndex(stop=2)) @@ -628,7 +752,7 @@ def test_set_index_names(self): df = pd.util.testing.makeDataFrame() df.index.name = 'name' - self.assertEqual(df.set_index(df.index).index.names, ['name']) + assert df.set_index(df.index).index.names == ['name'] mi = MultiIndex.from_arrays(df[['A', 'B']].T.values, names=['A', 'B']) mi2 = MultiIndex.from_arrays(df[['A', 'B', 'A', 'B']].T.values, @@ -636,26 +760,27 @@ def test_set_index_names(self): df = df.set_index(['A', 'B']) - self.assertEqual(df.set_index(df.index).index.names, ['A', 'B']) + assert df.set_index(df.index).index.names == ['A', 'B'] # Check that set_index isn't converting a MultiIndex into an Index - self.assertTrue(isinstance(df.set_index(df.index).index, MultiIndex)) + assert isinstance(df.set_index(df.index).index, MultiIndex) # Check actual equality tm.assert_index_equal(df.set_index(df.index).index, mi) # Check that [MultiIndex, MultiIndex] yields a MultiIndex rather # than a pair of tuples - self.assertTrue(isinstance(df.set_index( - [df.index, df.index]).index, MultiIndex)) + assert isinstance(df.set_index( + [df.index, df.index]).index, MultiIndex) # Check equality tm.assert_index_equal(df.set_index([df.index, df.index]).index, mi2) def test_rename_objects(self): renamed = self.mixed_frame.rename(columns=str.upper) - self.assertIn('FOO', renamed) - self.assertNotIn('foo', renamed) + + assert 'FOO' in renamed + assert 'foo' not in renamed def test_assign_columns(self): self.frame['hi'] = 'there' @@ -679,3 +804,53 @@ def test_set_index_preserve_categorical_dtype(self): result = df.set_index(cols).reset_index() result = result.reindex(columns=df.columns) tm.assert_frame_equal(result, df) + + +class TestIntervalIndex(object): + + def test_setitem(self): + + df = DataFrame({'A': range(10)}) + s = pd.cut(df.A, 5) + assert isinstance(s.cat.categories, IntervalIndex) + + # B & D end up as Categoricals + # the remainder are converted to in-line objects + # containing an IntervalIndex.values + df['B'] = s + df['C'] = np.array(s) + df['D'] = s.values + df['E'] = np.array(s.values) + + assert is_categorical_dtype(df['B']) + assert is_interval_dtype(df['B'].cat.categories) + assert is_categorical_dtype(df['D']) + assert is_interval_dtype(df['D'].cat.categories) + + assert is_object_dtype(df['C']) + assert is_object_dtype(df['E']) + + # they compare equal as Index + # when converted to numpy objects + c = lambda x: Index(np.array(x)) + tm.assert_index_equal(c(df.B), c(df.B), check_names=False) + tm.assert_index_equal(c(df.B), c(df.C), check_names=False) + tm.assert_index_equal(c(df.B), c(df.D), check_names=False) + tm.assert_index_equal(c(df.B), c(df.D), check_names=False) + + # B & D are the same Series + tm.assert_series_equal(df['B'], df['B'], check_names=False) + tm.assert_series_equal(df['B'], df['D'], check_names=False) + + # C & E are the same Series + tm.assert_series_equal(df['C'], df['C'], check_names=False) + tm.assert_series_equal(df['C'], df['E'], check_names=False) + + def test_set_reset_index(self): + + df = DataFrame({'A': range(10)}) + s = pd.cut(df.A, 5) + df['B'] = s + df = df.set_index('B') + + df = df.reset_index() diff --git 
a/pandas/tests/frame/test_analytics.py b/pandas/tests/frame/test_analytics.py index f6081e14d4081..be89b27912d1c 100644 --- a/pandas/tests/frame/test_analytics.py +++ b/pandas/tests/frame/test_analytics.py @@ -2,29 +2,29 @@ from __future__ import print_function -from datetime import timedelta, datetime +from datetime import timedelta from distutils.version import LooseVersion import sys -import nose +import pytest +from string import ascii_lowercase from numpy import nan from numpy.random import randn import numpy as np -from pandas.compat import lrange +from pandas.compat import lrange, product from pandas import (compat, isnull, notnull, DataFrame, Series, MultiIndex, date_range, Timestamp) import pandas as pd import pandas.core.nanops as nanops -import pandas.formats.printing as printing +import pandas.core.algorithms as algorithms +import pandas.io.formats.printing as printing import pandas.util.testing as tm from pandas.tests.frame.common import TestData -class TestDataFrameAnalytics(tm.TestCase, TestData): - - _multiprocess_can_split_ = True +class TestDataFrameAnalytics(TestData): # ---------------------------------------------------------------------- # Correlation and covariance @@ -58,7 +58,7 @@ def _check_method(self, method='pearson', check_minp=False): else: result = self.frame.corr(min_periods=len(self.frame) - 8) expected = self.frame.corr() - expected.ix['A', 'B'] = expected.ix['B', 'A'] = nan + expected.loc['A', 'B'] = expected.loc['B', 'A'] = nan tm.assert_frame_equal(result, expected) def test_corr_non_numeric(self): @@ -68,7 +68,7 @@ def test_corr_non_numeric(self): # exclude non-numeric types result = self.mixed_frame.corr() - expected = self.mixed_frame.ix[:, ['A', 'B', 'C', 'D']].corr() + expected = self.mixed_frame.loc[:, ['A', 'B', 'C', 'D']].corr() tm.assert_frame_equal(result, expected) def test_corr_nooverlap(self): @@ -81,11 +81,11 @@ def test_corr_nooverlap(self): 'C': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}) rs = df.corr(meth) - self.assertTrue(isnull(rs.ix['A', 'B'])) - self.assertTrue(isnull(rs.ix['B', 'A'])) - self.assertEqual(rs.ix['A', 'A'], 1) - self.assertEqual(rs.ix['B', 'B'], 1) - self.assertTrue(isnull(rs.ix['C', 'C'])) + assert isnull(rs.loc['A', 'B']) + assert isnull(rs.loc['B', 'A']) + assert rs.loc['A', 'A'] == 1 + assert rs.loc['B', 'B'] == 1 + assert isnull(rs.loc['C', 'C']) def test_corr_constant(self): tm._skip_if_no_scipy() @@ -96,7 +96,7 @@ def test_corr_constant(self): df = DataFrame({'A': [1, 1, 1, np.nan, np.nan, np.nan], 'B': [np.nan, np.nan, np.nan, 1, 1, 1]}) rs = df.corr(meth) - self.assertTrue(isnull(rs.values).all()) + assert isnull(rs.values).all() def test_corr_int(self): # dtypes other than float64 #1761 @@ -119,6 +119,15 @@ def test_corr_int_and_boolean(self): for meth in ['pearson', 'kendall', 'spearman']: tm.assert_frame_equal(df.corr(meth), expected) + def test_corr_cov_independent_index_column(self): + # GH 14617 + df = pd.DataFrame(np.random.randn(4 * 10).reshape(10, 4), + columns=list("abcd")) + for method in ['cov', 'corr']: + result = getattr(df, method)() + assert result.index is not result.columns + assert result.index.equals(result.columns) + def test_cov(self): # min_periods no NAs (corner case) expected = self.frame.cov() @@ -127,7 +136,7 @@ def test_cov(self): tm.assert_frame_equal(expected, result) result = self.frame.cov(min_periods=len(self.frame) + 1) - self.assertTrue(isnull(result.values).all()) + assert isnull(result.values).all() # with NAs frame = self.frame.copy() @@ -135,8 +144,8 @@ def 
test_cov(self): frame['B'][5:10] = nan result = self.frame.cov(min_periods=len(self.frame) - 8) expected = self.frame.cov() - expected.ix['A', 'B'] = np.nan - expected.ix['B', 'A'] = np.nan + expected.loc['A', 'B'] = np.nan + expected.loc['B', 'A'] = np.nan # regular self.frame['A'][:5] = nan @@ -148,7 +157,7 @@ def test_cov(self): # exclude non-numeric types result = self.mixed_frame.cov() - expected = self.mixed_frame.ix[:, ['A', 'B', 'C', 'D']].cov() + expected = self.mixed_frame.loc[:, ['A', 'B', 'C', 'D']].cov() tm.assert_frame_equal(result, expected) # Single column frame @@ -157,7 +166,7 @@ def test_cov(self): expected = DataFrame(np.cov(df.values.T).reshape((1, 1)), index=df.columns, columns=df.columns) tm.assert_frame_equal(result, expected) - df.ix[0] = np.nan + df.loc[0] = np.nan result = df.cov() expected = DataFrame(np.cov(df.values[1:].T).reshape((1, 1)), index=df.columns, columns=df.columns) @@ -181,10 +190,10 @@ def test_corrwith(self): dropped = a.corrwith(b, axis=0, drop=True) tm.assert_almost_equal(dropped['A'], a['A'].corr(b['A'])) - self.assertNotIn('B', dropped) + assert 'B' not in dropped dropped = a.corrwith(b, axis=1, drop=True) - self.assertNotIn(a.index[-1], dropped.index) + assert a.index[-1] not in dropped.index # non time-series data index = ['a', 'b', 'c', 'd', 'e'] @@ -193,7 +202,8 @@ def test_corrwith(self): df2 = DataFrame(randn(4, 4), index=index[:4], columns=columns) correls = df1.corrwith(df2, axis=1) for row in index[:4]: - tm.assert_almost_equal(correls[row], df1.ix[row].corr(df2.ix[row])) + tm.assert_almost_equal(correls[row], + df1.loc[row].corr(df2.loc[row])) def test_corrwith_with_objects(self): df1 = tm.makeTimeDataFrame() @@ -204,11 +214,11 @@ def test_corrwith_with_objects(self): df2['obj'] = 'bar' result = df1.corrwith(df2) - expected = df1.ix[:, cols].corrwith(df2.ix[:, cols]) + expected = df1.loc[:, cols].corrwith(df2.loc[:, cols]) tm.assert_series_equal(result, expected) result = df1.corrwith(df2, axis=1) - expected = df1.ix[:, cols].corrwith(df2.ix[:, cols], axis=1) + expected = df1.loc[:, cols].corrwith(df2.loc[:, cols], axis=1) tm.assert_series_equal(result, expected) def test_corrwith_series(self): @@ -224,7 +234,7 @@ def test_corrwith_matches_corrcoef(self): c2 = np.corrcoef(df1['a'], df2['a'])[0][1] tm.assert_almost_equal(c1, c2) - self.assertTrue(c1 < 1) + assert c1 < 1 def test_bool_describe_in_mixed_frame(self): df = DataFrame({ @@ -325,8 +335,8 @@ def test_describe_datetime_columns(self): '50%', '75%', 'max']) expected.columns = exp_columns tm.assert_frame_equal(result, expected) - self.assertEqual(result.columns.freq, 'MS') - self.assertEqual(result.columns.tz, expected.columns.tz) + assert result.columns.freq == 'MS' + assert result.columns.tz == expected.columns.tz def test_describe_timedelta_values(self): # GH 6145 @@ -363,7 +373,7 @@ def test_describe_timedelta_values(self): "50% 3 days 00:00:00 0 days 03:00:00\n" "75% 4 days 00:00:00 0 days 04:00:00\n" "max 5 days 00:00:00 0 days 05:00:00") - self.assertEqual(repr(res), exp_repr) + assert repr(res) == exp_repr def test_reduce_mixed_frame(self): # GH 6806 @@ -389,10 +399,10 @@ def test_count(self): # corner case frame = DataFrame() ct1 = frame.count(1) - tm.assertIsInstance(ct1, Series) + assert isinstance(ct1, Series) ct2 = frame.count(0) - tm.assertIsInstance(ct2, Series) + assert isinstance(ct2, Series) # GH #423 df = DataFrame(index=lrange(10)) @@ -410,6 +420,21 @@ def test_count(self): expected = Series(0, index=[]) tm.assert_series_equal(result, expected) + def 
test_nunique(self): + f = lambda s: len(algorithms.unique1d(s.dropna())) + self._check_stat_op('nunique', f, has_skipna=False, + check_dtype=False, check_dates=True) + + df = DataFrame({'A': [1, 1, 1], + 'B': [1, 2, 3], + 'C': [1, np.nan, 3]}) + tm.assert_series_equal(df.nunique(), Series({'A': 1, 'B': 3, 'C': 2})) + tm.assert_series_equal(df.nunique(dropna=False), + Series({'A': 1, 'B': 3, 'C': 3})) + tm.assert_series_equal(df.nunique(axis=1), Series({0: 1, 1: 2, 2: 2})) + tm.assert_series_equal(df.nunique(axis=1, dropna=False), + Series({0: 1, 1: 3, 2: 2})) + def test_sum(self): self._check_stat_op('sum', np.sum, has_numeric_only=True) @@ -437,7 +462,7 @@ def test_stat_operators_attempt_obj_array(self): for df in [df1, df2]: for meth in methods: - self.assertEqual(df.values.dtype, np.object_) + assert df.values.dtype == np.object_ result = getattr(df, meth)(1) expected = getattr(df.astype('f8'), meth)(1) @@ -463,9 +488,9 @@ def test_min(self): self._check_stat_op('min', np.min, frame=self.intframe) def test_cummin(self): - self.tsframe.ix[5:10, 0] = nan - self.tsframe.ix[10:15, 1] = nan - self.tsframe.ix[15:, 2] = nan + self.tsframe.loc[5:10, 0] = nan + self.tsframe.loc[10:15, 1] = nan + self.tsframe.loc[15:, 2] = nan # axis = 0 cummin = self.tsframe.cummin() @@ -483,12 +508,12 @@ def test_cummin(self): # fix issue cummin_xs = self.tsframe.cummin(axis=1) - self.assertEqual(np.shape(cummin_xs), np.shape(self.tsframe)) + assert np.shape(cummin_xs) == np.shape(self.tsframe) def test_cummax(self): - self.tsframe.ix[5:10, 0] = nan - self.tsframe.ix[10:15, 1] = nan - self.tsframe.ix[15:, 2] = nan + self.tsframe.loc[5:10, 0] = nan + self.tsframe.loc[10:15, 1] = nan + self.tsframe.loc[15:, 2] = nan # axis = 0 cummax = self.tsframe.cummax() @@ -506,7 +531,7 @@ def test_cummax(self): # fix issue cummax_xs = self.tsframe.cummax(axis=1) - self.assertEqual(np.shape(cummax_xs), np.shape(self.tsframe)) + assert np.shape(cummax_xs) == np.shape(self.tsframe) def test_max(self): self._check_stat_op('max', np.max, check_dates=True) @@ -533,11 +558,11 @@ def test_var_std(self): arr = np.repeat(np.random.random((1, 1000)), 1000, 0) result = nanops.nanvar(arr, axis=0) - self.assertFalse((result < 0).any()) + assert not (result < 0).any() if nanops._USE_BOTTLENECK: nanops._USE_BOTTLENECK = False result = nanops.nanvar(arr, axis=0) - self.assertFalse((result < 0).any()) + assert not (result < 0).any() nanops._USE_BOTTLENECK = True def test_numeric_only_flag(self): @@ -545,11 +570,11 @@ def test_numeric_only_flag(self): methods = ['sem', 'var', 'std'] df1 = DataFrame(np.random.randn(5, 3), columns=['foo', 'bar', 'baz']) # set one entry to a number in str format - df1.ix[0, 'foo'] = '100' + df1.loc[0, 'foo'] = '100' df2 = DataFrame(np.random.randn(5, 3), columns=['foo', 'bar', 'baz']) # set one entry to a non-number str - df2.ix[0, 'foo'] = 'a' + df2.loc[0, 'foo'] = 'a' for meth in methods: result = getattr(df1, meth)(axis=1, numeric_only=True) @@ -561,15 +586,32 @@ def test_numeric_only_flag(self): tm.assert_series_equal(expected, result) # df1 has all numbers, df2 has a letter inside - self.assertRaises(TypeError, lambda: getattr(df1, meth) - (axis=1, numeric_only=False)) - self.assertRaises(TypeError, lambda: getattr(df2, meth) - (axis=1, numeric_only=False)) + pytest.raises(TypeError, lambda: getattr(df1, meth)( + axis=1, numeric_only=False)) + pytest.raises(TypeError, lambda: getattr(df2, meth)( + axis=1, numeric_only=False)) + + def test_mixed_ops(self): + # GH 16116 + df = DataFrame({'int': [1, 2, 3, 4], + 
'float': [1., 2., 3., 4.], + 'str': ['a', 'b', 'c', 'd']}) + + for op in ['mean', 'std', 'var', 'skew', + 'kurt', 'sem']: + result = getattr(df, op)() + assert len(result) == 2 + + if nanops._USE_BOTTLENECK: + nanops._USE_BOTTLENECK = False + result = getattr(df, op)() + assert len(result) == 2 + nanops._USE_BOTTLENECK = True def test_cumsum(self): - self.tsframe.ix[5:10, 0] = nan - self.tsframe.ix[10:15, 1] = nan - self.tsframe.ix[15:, 2] = nan + self.tsframe.loc[5:10, 0] = nan + self.tsframe.loc[10:15, 1] = nan + self.tsframe.loc[15:, 2] = nan # axis = 0 cumsum = self.tsframe.cumsum() @@ -587,12 +629,12 @@ def test_cumsum(self): # fix issue cumsum_xs = self.tsframe.cumsum(axis=1) - self.assertEqual(np.shape(cumsum_xs), np.shape(self.tsframe)) + assert np.shape(cumsum_xs) == np.shape(self.tsframe) def test_cumprod(self): - self.tsframe.ix[5:10, 0] = nan - self.tsframe.ix[10:15, 1] = nan - self.tsframe.ix[15:, 2] = nan + self.tsframe.loc[5:10, 0] = nan + self.tsframe.loc[10:15, 1] = nan + self.tsframe.loc[15:, 2] = nan # axis = 0 cumprod = self.tsframe.cumprod() @@ -606,7 +648,7 @@ def test_cumprod(self): # fix issue cumprod_xs = self.tsframe.cumprod(axis=1) - self.assertEqual(np.shape(cumprod_xs), np.shape(self.tsframe)) + assert np.shape(cumprod_xs) == np.shape(self.tsframe) # ints df = self.tsframe.fillna(0).astype(int) @@ -618,173 +660,6 @@ def test_cumprod(self): df.cumprod(0) df.cumprod(1) - def test_rank(self): - tm._skip_if_no_scipy() - from scipy.stats import rankdata - - self.frame['A'][::2] = np.nan - self.frame['B'][::3] = np.nan - self.frame['C'][::4] = np.nan - self.frame['D'][::5] = np.nan - - ranks0 = self.frame.rank() - ranks1 = self.frame.rank(1) - mask = np.isnan(self.frame.values) - - fvals = self.frame.fillna(np.inf).values - - exp0 = np.apply_along_axis(rankdata, 0, fvals) - exp0[mask] = np.nan - - exp1 = np.apply_along_axis(rankdata, 1, fvals) - exp1[mask] = np.nan - - tm.assert_almost_equal(ranks0.values, exp0) - tm.assert_almost_equal(ranks1.values, exp1) - - # integers - df = DataFrame(np.random.randint(0, 5, size=40).reshape((10, 4))) - - result = df.rank() - exp = df.astype(float).rank() - tm.assert_frame_equal(result, exp) - - result = df.rank(1) - exp = df.astype(float).rank(1) - tm.assert_frame_equal(result, exp) - - def test_rank2(self): - df = DataFrame([[1, 3, 2], [1, 2, 3]]) - expected = DataFrame([[1.0, 3.0, 2.0], [1, 2, 3]]) / 3.0 - result = df.rank(1, pct=True) - tm.assert_frame_equal(result, expected) - - df = DataFrame([[1, 3, 2], [1, 2, 3]]) - expected = df.rank(0) / 2.0 - result = df.rank(0, pct=True) - tm.assert_frame_equal(result, expected) - - df = DataFrame([['b', 'c', 'a'], ['a', 'c', 'b']]) - expected = DataFrame([[2.0, 3.0, 1.0], [1, 3, 2]]) - result = df.rank(1, numeric_only=False) - tm.assert_frame_equal(result, expected) - - expected = DataFrame([[2.0, 1.5, 1.0], [1, 1.5, 2]]) - result = df.rank(0, numeric_only=False) - tm.assert_frame_equal(result, expected) - - df = DataFrame([['b', np.nan, 'a'], ['a', 'c', 'b']]) - expected = DataFrame([[2.0, nan, 1.0], [1.0, 3.0, 2.0]]) - result = df.rank(1, numeric_only=False) - tm.assert_frame_equal(result, expected) - - expected = DataFrame([[2.0, nan, 1.0], [1.0, 1.0, 2.0]]) - result = df.rank(0, numeric_only=False) - tm.assert_frame_equal(result, expected) - - # f7u12, this does not work without extensive workaround - data = [[datetime(2001, 1, 5), nan, datetime(2001, 1, 2)], - [datetime(2000, 1, 2), datetime(2000, 1, 3), - datetime(2000, 1, 1)]] - df = DataFrame(data) - - # check the rank - 
expected = DataFrame([[2., nan, 1.], - [2., 3., 1.]]) - result = df.rank(1, numeric_only=False, ascending=True) - tm.assert_frame_equal(result, expected) - - expected = DataFrame([[1., nan, 2.], - [2., 1., 3.]]) - result = df.rank(1, numeric_only=False, ascending=False) - tm.assert_frame_equal(result, expected) - - # mixed-type frames - self.mixed_frame['datetime'] = datetime.now() - self.mixed_frame['timedelta'] = timedelta(days=1, seconds=1) - - result = self.mixed_frame.rank(1) - expected = self.mixed_frame.rank(1, numeric_only=True) - tm.assert_frame_equal(result, expected) - - df = DataFrame({"a": [1e-20, -5, 1e-20 + 1e-40, 10, - 1e60, 1e80, 1e-30]}) - exp = DataFrame({"a": [3.5, 1., 3.5, 5., 6., 7., 2.]}) - tm.assert_frame_equal(df.rank(), exp) - - def test_rank_na_option(self): - tm._skip_if_no_scipy() - from scipy.stats import rankdata - - self.frame['A'][::2] = np.nan - self.frame['B'][::3] = np.nan - self.frame['C'][::4] = np.nan - self.frame['D'][::5] = np.nan - - # bottom - ranks0 = self.frame.rank(na_option='bottom') - ranks1 = self.frame.rank(1, na_option='bottom') - - fvals = self.frame.fillna(np.inf).values - - exp0 = np.apply_along_axis(rankdata, 0, fvals) - exp1 = np.apply_along_axis(rankdata, 1, fvals) - - tm.assert_almost_equal(ranks0.values, exp0) - tm.assert_almost_equal(ranks1.values, exp1) - - # top - ranks0 = self.frame.rank(na_option='top') - ranks1 = self.frame.rank(1, na_option='top') - - fval0 = self.frame.fillna((self.frame.min() - 1).to_dict()).values - fval1 = self.frame.T - fval1 = fval1.fillna((fval1.min() - 1).to_dict()).T - fval1 = fval1.fillna(np.inf).values - - exp0 = np.apply_along_axis(rankdata, 0, fval0) - exp1 = np.apply_along_axis(rankdata, 1, fval1) - - tm.assert_almost_equal(ranks0.values, exp0) - tm.assert_almost_equal(ranks1.values, exp1) - - # descending - - # bottom - ranks0 = self.frame.rank(na_option='top', ascending=False) - ranks1 = self.frame.rank(1, na_option='top', ascending=False) - - fvals = self.frame.fillna(np.inf).values - - exp0 = np.apply_along_axis(rankdata, 0, -fvals) - exp1 = np.apply_along_axis(rankdata, 1, -fvals) - - tm.assert_almost_equal(ranks0.values, exp0) - tm.assert_almost_equal(ranks1.values, exp1) - - # descending - - # top - ranks0 = self.frame.rank(na_option='bottom', ascending=False) - ranks1 = self.frame.rank(1, na_option='bottom', ascending=False) - - fval0 = self.frame.fillna((self.frame.min() - 1).to_dict()).values - fval1 = self.frame.T - fval1 = fval1.fillna((fval1.min() - 1).to_dict()).T - fval1 = fval1.fillna(np.inf).values - - exp0 = np.apply_along_axis(rankdata, 0, -fval0) - exp1 = np.apply_along_axis(rankdata, 1, -fval1) - - tm.assert_numpy_array_equal(ranks0.values, exp0) - tm.assert_numpy_array_equal(ranks1.values, exp1) - - def test_rank_axis(self): - # check if using axes' names gives the same result - df = pd.DataFrame([[2, 1], [4, 3]]) - tm.assert_frame_equal(df.rank(axis=0), df.rank(axis='index')) - tm.assert_frame_equal(df.rank(axis=1), df.rank(axis='columns')) - def test_sem(self): alt = lambda x: np.std(x, ddof=1) / np.sqrt(len(x)) self._check_stat_op('sem', alt) @@ -796,33 +671,13 @@ def test_sem(self): arr = np.repeat(np.random.random((1, 1000)), 1000, 0) result = nanops.nansem(arr, axis=0) - self.assertFalse((result < 0).any()) + assert not (result < 0).any() if nanops._USE_BOTTLENECK: nanops._USE_BOTTLENECK = False result = nanops.nansem(arr, axis=0) - self.assertFalse((result < 0).any()) + assert not (result < 0).any() nanops._USE_BOTTLENECK = True - def test_sort_invalid_kwargs(self): 
- df = DataFrame([1, 2, 3], columns=['a']) - - msg = r"sort\(\) got an unexpected keyword argument 'foo'" - tm.assertRaisesRegexp(TypeError, msg, df.sort, foo=2) - - # Neither of these should raise an error because they - # are explicit keyword arguments in the signature and - # hence should not be swallowed by the kwargs parameter - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): - df.sort(axis=1) - - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): - df.sort(kind='mergesort') - - msg = "the 'order' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, df.sort, order=2) - def test_skew(self): tm._skip_if_no_scipy() from scipy.stats import skew @@ -855,8 +710,8 @@ def alt(x): kurt = df.kurt() kurt2 = df.kurt(level=0).xs('bar') tm.assert_series_equal(kurt, kurt2, check_names=False) - self.assertTrue(kurt.name is None) - self.assertEqual(kurt2.name, 'bar') + assert kurt.name is None + assert kurt2.name == 'bar' def _check_stat_op(self, name, alternative, frame=None, has_skipna=True, has_numeric_only=False, check_dtype=True, @@ -864,8 +719,8 @@ def _check_stat_op(self, name, alternative, frame=None, has_skipna=True, if frame is None: frame = self.frame # set some NAs - frame.ix[5:10] = np.nan - frame.ix[15:20, -2:] = np.nan + frame.loc[5:10] = np.nan + frame.loc[15:20, -2:] = np.nan f = getattr(frame, name) @@ -873,12 +728,12 @@ def _check_stat_op(self, name, alternative, frame=None, has_skipna=True, df = DataFrame({'b': date_range('1/1/2001', periods=2)}) _f = getattr(df, name) result = _f() - self.assertIsInstance(result, Series) + assert isinstance(result, Series) df['a'] = lrange(len(df)) result = getattr(df, name)() - self.assertIsInstance(result, Series) - self.assertTrue(len(result)) + assert isinstance(result, Series) + assert len(result) if has_skipna: def skipna_wrapper(x): @@ -916,15 +771,15 @@ def wrapper(x): # check dtypes if check_dtype: lcd_dtype = frame.values.dtype - self.assertEqual(lcd_dtype, result0.dtype) - self.assertEqual(lcd_dtype, result1.dtype) + assert lcd_dtype == result0.dtype + assert lcd_dtype == result1.dtype # result = f(axis=1) # comp = frame.apply(alternative, axis=1).reindex(result.index) # assert_series_equal(result, comp) # bad axis - tm.assertRaisesRegexp(ValueError, 'No axis named 2', f, axis=2) + tm.assert_raises_regex(ValueError, 'No axis named 2', f, axis=2) # make sure works on mixed-type frame getattr(self.mixed_frame, name)(axis=0) getattr(self.mixed_frame, name)(axis=1) @@ -941,8 +796,8 @@ def wrapper(x): r0 = getattr(all_na, name)(axis=0) r1 = getattr(all_na, name)(axis=1) if not tm._incompat_bottleneck_version(name): - self.assertTrue(np.isnan(r0).all()) - self.assertTrue(np.isnan(r1).all()) + assert np.isnan(r0).all() + assert np.isnan(r1).all() def test_mode(self): df = pd.DataFrame({"A": [12, 12, 11, 12, 19, 11], @@ -952,18 +807,23 @@ def test_mode(self): "E": [8, 8, 1, 1, 3, 3]}) tm.assert_frame_equal(df[["A"]].mode(), pd.DataFrame({"A": [12]})) - expected = pd.Series([], dtype='int64', name='D').to_frame() + expected = pd.Series([0, 1, 2, 3, 4, 5], dtype='int64', name='D').\ + to_frame() tm.assert_frame_equal(df[["D"]].mode(), expected) expected = pd.Series([1, 3, 8], dtype='int64', name='E').to_frame() tm.assert_frame_equal(df[["E"]].mode(), expected) tm.assert_frame_equal(df[["A", "B"]].mode(), pd.DataFrame({"A": [12], "B": [10.]})) tm.assert_frame_equal(df.mode(), - pd.DataFrame({"A": [12, np.nan, np.nan], - "B": [10, np.nan, np.nan], - "C": [8, 9, np.nan], - "D": 
[np.nan, np.nan, np.nan], - "E": [1, 3, 8]})) + pd.DataFrame({"A": [12, np.nan, np.nan, np.nan, + np.nan, np.nan], + "B": [10, np.nan, np.nan, np.nan, + np.nan, np.nan], + "C": [8, 9, np.nan, np.nan, np.nan, + np.nan], + "D": [0, 1, 2, 3, 4, 5], + "E": [1, 3, 8, np.nan, np.nan, + np.nan]})) # outputs in sorted order df["C"] = list(reversed(df["C"])) @@ -980,20 +840,12 @@ def test_mode(self): df = pd.DataFrame({"A": np.arange(6, dtype='int64'), "B": pd.date_range('2011', periods=6), "C": list('abcdef')}) - exp = pd.DataFrame({"A": pd.Series([], dtype=df["A"].dtype), - "B": pd.Series([], dtype=df["B"].dtype), - "C": pd.Series([], dtype=df["C"].dtype)}) - tm.assert_frame_equal(df.mode(), exp) - - # and also when not empty - df.loc[1, "A"] = 0 - df.loc[4, "B"] = df.loc[3, "B"] - df.loc[5, "C"] = 'e' - exp = pd.DataFrame({"A": pd.Series([0], dtype=df["A"].dtype), - "B": pd.Series([df.loc[3, "B"]], + exp = pd.DataFrame({"A": pd.Series(np.arange(6, dtype='int64'), + dtype=df["A"].dtype), + "B": pd.Series(pd.date_range('2011', periods=6), dtype=df["B"].dtype), - "C": pd.Series(['e'], dtype=df["C"].dtype)}) - + "C": pd.Series(list('abcdef'), + dtype=df["C"].dtype)}) tm.assert_frame_equal(df.mode(), exp) def test_operators_timedelta64(self): @@ -1008,19 +860,19 @@ def test_operators_timedelta64(self): # min result = diffs.min() - self.assertEqual(result[0], diffs.ix[0, 'A']) - self.assertEqual(result[1], diffs.ix[0, 'B']) + assert result[0] == diffs.loc[0, 'A'] + assert result[1] == diffs.loc[0, 'B'] result = diffs.min(axis=1) - self.assertTrue((result == diffs.ix[0, 'B']).all()) + assert (result == diffs.loc[0, 'B']).all() # max result = diffs.max() - self.assertEqual(result[0], diffs.ix[2, 'A']) - self.assertEqual(result[1], diffs.ix[2, 'B']) + assert result[0] == diffs.loc[2, 'A'] + assert result[1] == diffs.loc[2, 'B'] result = diffs.max(axis=1) - self.assertTrue((result == diffs['A']).all()) + assert (result == diffs['A']).all() # abs result = diffs.abs() @@ -1038,7 +890,7 @@ def test_operators_timedelta64(self): mixed['F'] = Timestamp('20130101') # results in an object array - from pandas.tseries.timedeltas import ( + from pandas.core.tools.timedeltas import ( _coerce_scalar_to_timedelta_type as _coerce) result = mixed.min() @@ -1068,20 +920,20 @@ def test_operators_timedelta64(self): df = DataFrame({'time': date_range('20130102', periods=5), 'time2': date_range('20130105', periods=5)}) df['off1'] = df['time2'] - df['time'] - self.assertEqual(df['off1'].dtype, 'timedelta64[ns]') + assert df['off1'].dtype == 'timedelta64[ns]' df['off2'] = df['time'] - df['time2'] df._consolidate_inplace() - self.assertTrue(df['off1'].dtype == 'timedelta64[ns]') - self.assertTrue(df['off2'].dtype == 'timedelta64[ns]') + assert df['off1'].dtype == 'timedelta64[ns]' + assert df['off2'].dtype == 'timedelta64[ns]' def test_sum_corner(self): axis0 = self.empty.sum(0) axis1 = self.empty.sum(1) - tm.assertIsInstance(axis0, Series) - tm.assertIsInstance(axis1, Series) - self.assertEqual(len(axis0), 0) - self.assertEqual(len(axis1), 0) + assert isinstance(axis0, Series) + assert isinstance(axis1, Series) + assert len(axis0) == 0 + assert len(axis1) == 0 def test_sum_object(self): values = self.frame.values.astype(int) @@ -1100,18 +952,18 @@ def test_mean_corner(self): # unit test when have object data the_mean = self.mixed_frame.mean(axis=0) the_sum = self.mixed_frame.sum(axis=0, numeric_only=True) - self.assert_index_equal(the_sum.index, the_mean.index) - self.assertTrue(len(the_mean.index) < 
len(self.mixed_frame.columns)) + tm.assert_index_equal(the_sum.index, the_mean.index) + assert len(the_mean.index) < len(self.mixed_frame.columns) # xs sum mixed type, just want to know it works... the_mean = self.mixed_frame.mean(axis=1) the_sum = self.mixed_frame.sum(axis=1, numeric_only=True) - self.assert_index_equal(the_sum.index, the_mean.index) + tm.assert_index_equal(the_sum.index, the_mean.index) # take mean of boolean column self.frame['bool'] = self.frame['A'] > 0 means = self.frame.mean(0) - self.assertEqual(means['bool'], self.frame['bool'].values.mean()) + assert means['bool'] == self.frame['bool'].values.mean() def test_stats_mixed_type(self): # don't blow up @@ -1147,14 +999,14 @@ def test_cumsum_corner(self): def test_sum_bools(self): df = DataFrame(index=lrange(1), columns=lrange(10)) bools = isnull(df) - self.assertEqual(bools.sum(axis=1)[0], 10) + assert bools.sum(axis=1)[0] == 10 # Index of max / min def test_idxmin(self): frame = self.frame - frame.ix[5:10] = np.nan - frame.ix[15:20, -2:] = np.nan + frame.loc[5:10] = np.nan + frame.loc[15:20, -2:] = np.nan for skipna in [True, False]: for axis in [0, 1]: for df in [frame, self.intframe]: @@ -1163,12 +1015,12 @@ def test_idxmin(self): skipna=skipna) tm.assert_series_equal(result, expected) - self.assertRaises(ValueError, frame.idxmin, axis=2) + pytest.raises(ValueError, frame.idxmin, axis=2) def test_idxmax(self): frame = self.frame - frame.ix[5:10] = np.nan - frame.ix[15:20, -2:] = np.nan + frame.loc[5:10] = np.nan + frame.loc[15:20, -2:] = np.nan for skipna in [True, False]: for axis in [0, 1]: for df in [frame, self.intframe]: @@ -1177,7 +1029,7 @@ def test_idxmax(self): skipna=skipna) tm.assert_series_equal(result, expected) - self.assertRaises(ValueError, frame.idxmax, axis=2) + pytest.raises(ValueError, frame.idxmax, axis=2) # ---------------------------------------------------------------------- # Logical reductions @@ -1219,8 +1071,8 @@ def _check_bool_op(self, name, alternative, frame=None, has_skipna=True, # set some NAs frame = DataFrame(frame.values.astype(object), frame.index, frame.columns) - frame.ix[5:10] = np.nan - frame.ix[15:20, -2:] = np.nan + frame.loc[5:10] = np.nan + frame.loc[15:20, -2:] = np.nan f = getattr(frame, name) @@ -1252,7 +1104,7 @@ def wrapper(x): # assert_series_equal(result, comp) # bad axis - self.assertRaises(ValueError, f, axis=2) + pytest.raises(ValueError, f, axis=2) # make sure works on mixed-type frame mixed = self.mixed_frame @@ -1279,79 +1131,12 @@ def __nonzero__(self): r0 = getattr(all_na, name)(axis=0) r1 = getattr(all_na, name)(axis=1) if name == 'any': - self.assertFalse(r0.any()) - self.assertFalse(r1.any()) + assert not r0.any() + assert not r1.any() else: - self.assertTrue(r0.all()) - self.assertTrue(r1.all()) - - # ---------------------------------------------------------------------- - # Top / bottom - - def test_nlargest(self): - # GH10393 - from string import ascii_lowercase - df = pd.DataFrame({'a': np.random.permutation(10), - 'b': list(ascii_lowercase[:10])}) - result = df.nlargest(5, 'a') - expected = df.sort_values('a', ascending=False).head(5) - tm.assert_frame_equal(result, expected) - - def test_nlargest_multiple_columns(self): - from string import ascii_lowercase - df = pd.DataFrame({'a': np.random.permutation(10), - 'b': list(ascii_lowercase[:10]), - 'c': np.random.permutation(10).astype('float64')}) - result = df.nlargest(5, ['a', 'b']) - expected = df.sort_values(['a', 'b'], ascending=False).head(5) - tm.assert_frame_equal(result, expected) - - def 
test_nsmallest(self): - from string import ascii_lowercase - df = pd.DataFrame({'a': np.random.permutation(10), - 'b': list(ascii_lowercase[:10])}) - result = df.nsmallest(5, 'a') - expected = df.sort_values('a').head(5) - tm.assert_frame_equal(result, expected) - - def test_nsmallest_multiple_columns(self): - from string import ascii_lowercase - df = pd.DataFrame({'a': np.random.permutation(10), - 'b': list(ascii_lowercase[:10]), - 'c': np.random.permutation(10).astype('float64')}) - result = df.nsmallest(5, ['a', 'c']) - expected = df.sort_values(['a', 'c']).head(5) - tm.assert_frame_equal(result, expected) - - def test_nsmallest_nlargest_duplicate_index(self): - # GH 13412 - df = pd.DataFrame({'a': [1, 2, 3, 4], - 'b': [4, 3, 2, 1], - 'c': [0, 1, 2, 3]}, - index=[0, 0, 1, 1]) - result = df.nsmallest(4, 'a') - expected = df.sort_values('a').head(4) - tm.assert_frame_equal(result, expected) - - result = df.nlargest(4, 'a') - expected = df.sort_values('a', ascending=False).head(4) - tm.assert_frame_equal(result, expected) + assert r0.all() + assert r1.all() - result = df.nsmallest(4, ['a', 'c']) - expected = df.sort_values(['a', 'c']).head(4) - tm.assert_frame_equal(result, expected) - - result = df.nsmallest(4, ['c', 'a']) - expected = df.sort_values(['c', 'a']).head(4) - tm.assert_frame_equal(result, expected) - - result = df.nlargest(4, ['a', 'c']) - expected = df.sort_values(['a', 'c'], ascending=False).head(4) - tm.assert_frame_equal(result, expected) - - result = df.nlargest(4, ['c', 'a']) - expected = df.sort_values(['c', 'a'], ascending=False).head(4) - tm.assert_frame_equal(result, expected) # ---------------------------------------------------------------------- # Isin @@ -1395,10 +1180,10 @@ def test_isin_with_string_scalar(self): df = DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'], 'ids2': ['a', 'n', 'c', 'n']}, index=['foo', 'bar', 'baz', 'qux']) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): df.isin('a') - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): df.isin('aaa') def test_isin_df(self): @@ -1421,18 +1206,18 @@ def test_isin_df_dupe_values(self): # just cols duped df2 = DataFrame([[0, 2], [12, 4], [2, np.nan], [4, 5]], columns=['B', 'B']) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df1.isin(df2) # just index duped df2 = DataFrame([[0, 2], [12, 4], [2, np.nan], [4, 5]], columns=['A', 'B'], index=[0, 0, 1, 1]) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df1.isin(df2) # cols and index: df2.columns = ['B', 'B'] - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df1.isin(df2) def test_isin_dupe_self(self): @@ -1478,6 +1263,27 @@ def test_isin_multiIndex(self): result = df1.isin(df2) tm.assert_frame_equal(result, expected) + def test_isin_empty_datetimelike(self): + # GH 15473 + df1_ts = DataFrame({'date': + pd.to_datetime(['2014-01-01', '2014-01-02'])}) + df1_td = DataFrame({'date': + [pd.Timedelta(1, 's'), pd.Timedelta(2, 's')]}) + df2 = DataFrame({'date': []}) + df3 = DataFrame() + + expected = DataFrame({'date': [False, False]}) + + result = df1_ts.isin(df2) + tm.assert_frame_equal(result, expected) + result = df1_ts.isin(df3) + tm.assert_frame_equal(result, expected) + + result = df1_td.isin(df2) + tm.assert_frame_equal(result, expected) + result = df1_td.isin(df3) + tm.assert_frame_equal(result, expected) + # ---------------------------------------------------------------------- # Row deduplication @@ -1495,43 +1301,31 @@ def 
test_drop_duplicates(self): tm.assert_frame_equal(result, expected) result = df.drop_duplicates('AAA', keep='last') - expected = df.ix[[6, 7]] + expected = df.loc[[6, 7]] tm.assert_frame_equal(result, expected) result = df.drop_duplicates('AAA', keep=False) - expected = df.ix[[]] + expected = df.loc[[]] tm.assert_frame_equal(result, expected) - self.assertEqual(len(result), 0) - - # deprecate take_last - with tm.assert_produces_warning(FutureWarning): - result = df.drop_duplicates('AAA', take_last=True) - expected = df.ix[[6, 7]] - tm.assert_frame_equal(result, expected) + assert len(result) == 0 # multi column - expected = df.ix[[0, 1, 2, 3]] + expected = df.loc[[0, 1, 2, 3]] result = df.drop_duplicates(np.array(['AAA', 'B'])) tm.assert_frame_equal(result, expected) result = df.drop_duplicates(['AAA', 'B']) tm.assert_frame_equal(result, expected) result = df.drop_duplicates(('AAA', 'B'), keep='last') - expected = df.ix[[0, 5, 6, 7]] + expected = df.loc[[0, 5, 6, 7]] tm.assert_frame_equal(result, expected) result = df.drop_duplicates(('AAA', 'B'), keep=False) - expected = df.ix[[0]] - tm.assert_frame_equal(result, expected) - - # deprecate take_last - with tm.assert_produces_warning(FutureWarning): - result = df.drop_duplicates(('AAA', 'B'), take_last=True) - expected = df.ix[[0, 5, 6, 7]] + expected = df.loc[[0]] tm.assert_frame_equal(result, expected) # consider everything - df2 = df.ix[:, ['AAA', 'B', 'C']] + df2 = df.loc[:, ['AAA', 'B', 'C']] result = df2.drop_duplicates() # in this case only @@ -1546,13 +1340,6 @@ def test_drop_duplicates(self): expected = df2.drop_duplicates(['AAA', 'B'], keep=False) tm.assert_frame_equal(result, expected) - # deprecate take_last - with tm.assert_produces_warning(FutureWarning): - result = df2.drop_duplicates(take_last=True) - with tm.assert_produces_warning(FutureWarning): - expected = df2.drop_duplicates(['AAA', 'B'], take_last=True) - tm.assert_frame_equal(result, expected) - # integers result = df.drop_duplicates('C') expected = df.iloc[[0, 2]] @@ -1593,7 +1380,7 @@ def test_drop_duplicates(self): df = df.append([[1] + [0] * 8], ignore_index=True) for keep in ['first', 'last', False]: - self.assertEqual(df.duplicated(keep=keep).sum(), 0) + assert df.duplicated(keep=keep).sum() == 0 def test_drop_duplicates_for_take_all(self): df = DataFrame({'AAA': ['foo', 'bar', 'baz', 'bar', @@ -1643,22 +1430,16 @@ def test_drop_duplicates_tuple(self): tm.assert_frame_equal(result, expected) result = df.drop_duplicates(('AA', 'AB'), keep='last') - expected = df.ix[[6, 7]] + expected = df.loc[[6, 7]] tm.assert_frame_equal(result, expected) result = df.drop_duplicates(('AA', 'AB'), keep=False) - expected = df.ix[[]] # empty df - self.assertEqual(len(result), 0) - tm.assert_frame_equal(result, expected) - - # deprecate take_last - with tm.assert_produces_warning(FutureWarning): - result = df.drop_duplicates(('AA', 'AB'), take_last=True) - expected = df.ix[[6, 7]] + expected = df.loc[[]] # empty df + assert len(result) == 0 tm.assert_frame_equal(result, expected) # multi column - expected = df.ix[[0, 1, 2, 3]] + expected = df.loc[[0, 1, 2, 3]] result = df.drop_duplicates((('AA', 'AB'), 'B')) tm.assert_frame_equal(result, expected) @@ -1673,41 +1454,29 @@ def test_drop_duplicates_NA(self): # single column result = df.drop_duplicates('A') - expected = df.ix[[0, 2, 3]] + expected = df.loc[[0, 2, 3]] tm.assert_frame_equal(result, expected) result = df.drop_duplicates('A', keep='last') - expected = df.ix[[1, 6, 7]] + expected = df.loc[[1, 6, 7]] 
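+        # keep='last' retains the final occurrence of each duplicate key;
+        # NaN values in 'A' are treated as duplicates of each other here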
tm.assert_frame_equal(result, expected) result = df.drop_duplicates('A', keep=False) - expected = df.ix[[]] # empty df - tm.assert_frame_equal(result, expected) - self.assertEqual(len(result), 0) - - # deprecate take_last - with tm.assert_produces_warning(FutureWarning): - result = df.drop_duplicates('A', take_last=True) - expected = df.ix[[1, 6, 7]] + expected = df.loc[[]] # empty df tm.assert_frame_equal(result, expected) + assert len(result) == 0 # multi column result = df.drop_duplicates(['A', 'B']) - expected = df.ix[[0, 2, 3, 6]] + expected = df.loc[[0, 2, 3, 6]] tm.assert_frame_equal(result, expected) result = df.drop_duplicates(['A', 'B'], keep='last') - expected = df.ix[[1, 5, 6, 7]] + expected = df.loc[[1, 5, 6, 7]] tm.assert_frame_equal(result, expected) result = df.drop_duplicates(['A', 'B'], keep=False) - expected = df.ix[[6]] - tm.assert_frame_equal(result, expected) - - # deprecate take_last - with tm.assert_produces_warning(FutureWarning): - result = df.drop_duplicates(['A', 'B'], take_last=True) - expected = df.ix[[1, 5, 6, 7]] + expected = df.loc[[6]] tm.assert_frame_equal(result, expected) # nan @@ -1724,37 +1493,25 @@ def test_drop_duplicates_NA(self): tm.assert_frame_equal(result, expected) result = df.drop_duplicates('C', keep='last') - expected = df.ix[[3, 7]] + expected = df.loc[[3, 7]] tm.assert_frame_equal(result, expected) result = df.drop_duplicates('C', keep=False) - expected = df.ix[[]] # empty df - tm.assert_frame_equal(result, expected) - self.assertEqual(len(result), 0) - - # deprecate take_last - with tm.assert_produces_warning(FutureWarning): - result = df.drop_duplicates('C', take_last=True) - expected = df.ix[[3, 7]] + expected = df.loc[[]] # empty df tm.assert_frame_equal(result, expected) + assert len(result) == 0 # multi column result = df.drop_duplicates(['C', 'B']) - expected = df.ix[[0, 1, 2, 4]] + expected = df.loc[[0, 1, 2, 4]] tm.assert_frame_equal(result, expected) result = df.drop_duplicates(['C', 'B'], keep='last') - expected = df.ix[[1, 3, 6, 7]] + expected = df.loc[[1, 3, 6, 7]] tm.assert_frame_equal(result, expected) result = df.drop_duplicates(['C', 'B'], keep=False) - expected = df.ix[[1]] - tm.assert_frame_equal(result, expected) - - # deprecate take_last - with tm.assert_produces_warning(FutureWarning): - result = df.drop_duplicates(['C', 'B'], take_last=True) - expected = df.ix[[1, 3, 6, 7]] + expected = df.loc[[1]] tm.assert_frame_equal(result, expected) def test_drop_duplicates_NA_for_take_all(self): @@ -1808,54 +1565,38 @@ def test_drop_duplicates_inplace(self): df = orig.copy() df.drop_duplicates('A', keep='last', inplace=True) - expected = orig.ix[[6, 7]] + expected = orig.loc[[6, 7]] result = df tm.assert_frame_equal(result, expected) df = orig.copy() df.drop_duplicates('A', keep=False, inplace=True) - expected = orig.ix[[]] - result = df - tm.assert_frame_equal(result, expected) - self.assertEqual(len(df), 0) - - # deprecate take_last - df = orig.copy() - with tm.assert_produces_warning(FutureWarning): - df.drop_duplicates('A', take_last=True, inplace=True) - expected = orig.ix[[6, 7]] + expected = orig.loc[[]] result = df tm.assert_frame_equal(result, expected) + assert len(df) == 0 # multi column df = orig.copy() df.drop_duplicates(['A', 'B'], inplace=True) - expected = orig.ix[[0, 1, 2, 3]] + expected = orig.loc[[0, 1, 2, 3]] result = df tm.assert_frame_equal(result, expected) df = orig.copy() df.drop_duplicates(['A', 'B'], keep='last', inplace=True) - expected = orig.ix[[0, 5, 6, 7]] + expected = orig.loc[[0, 5, 6, 7]] 
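+        # drop_duplicates(..., inplace=True) mutates df and returns None,
+        # so df itself is compared against the expected slice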
result = df tm.assert_frame_equal(result, expected) df = orig.copy() df.drop_duplicates(['A', 'B'], keep=False, inplace=True) - expected = orig.ix[[0]] - result = df - tm.assert_frame_equal(result, expected) - - # deprecate take_last - df = orig.copy() - with tm.assert_produces_warning(FutureWarning): - df.drop_duplicates(['A', 'B'], take_last=True, inplace=True) - expected = orig.ix[[0, 5, 6, 7]] + expected = orig.loc[[0]] result = df tm.assert_frame_equal(result, expected) # consider everything - orig2 = orig.ix[:, ['A', 'B', 'C']].copy() + orig2 = orig.loc[:, ['A', 'B', 'C']].copy() df2 = orig2.copy() df2.drop_duplicates(inplace=True) @@ -1876,17 +1617,7 @@ def test_drop_duplicates_inplace(self): result = df2 tm.assert_frame_equal(result, expected) - # deprecate take_last - df2 = orig2.copy() - with tm.assert_produces_warning(FutureWarning): - df2.drop_duplicates(take_last=True, inplace=True) - with tm.assert_produces_warning(FutureWarning): - expected = orig2.drop_duplicates(['A', 'B'], take_last=True) - result = df2 - tm.assert_frame_equal(result, expected) - # Rounding - def test_round(self): # GH 2665 @@ -1915,7 +1646,7 @@ def test_round(self): # Round with a list round_list = [1, 2] - with self.assertRaises(TypeError): + with pytest.raises(TypeError): df.round(round_list) # Round with a dictionary @@ -1938,34 +1669,34 @@ def test_round(self): # float input to `decimals` non_int_round_dict = {'col1': 1, 'col2': 0.5} - with self.assertRaises(TypeError): + with pytest.raises(TypeError): df.round(non_int_round_dict) # String input non_int_round_dict = {'col1': 1, 'col2': 'foo'} - with self.assertRaises(TypeError): + with pytest.raises(TypeError): df.round(non_int_round_dict) non_int_round_Series = Series(non_int_round_dict) - with self.assertRaises(TypeError): + with pytest.raises(TypeError): df.round(non_int_round_Series) # List input non_int_round_dict = {'col1': 1, 'col2': [1, 2]} - with self.assertRaises(TypeError): + with pytest.raises(TypeError): df.round(non_int_round_dict) non_int_round_Series = Series(non_int_round_dict) - with self.assertRaises(TypeError): + with pytest.raises(TypeError): df.round(non_int_round_Series) # Non integer Series inputs non_int_round_Series = Series(non_int_round_dict) - with self.assertRaises(TypeError): + with pytest.raises(TypeError): df.round(non_int_round_Series) non_int_round_Series = Series(non_int_round_dict) - with self.assertRaises(TypeError): + with pytest.raises(TypeError): df.round(non_int_round_Series) # Negative numbers @@ -1986,10 +1717,10 @@ def test_round(self): if sys.version < LooseVersion('2.7'): # Rounding with decimal is a ValueError in Python < 2.7 - with self.assertRaises(ValueError): + with pytest.raises(ValueError): df.round(nan_round_Series) else: - with self.assertRaises(TypeError): + with pytest.raises(TypeError): df.round(nan_round_Series) # Make sure this doesn't break existing Series.round @@ -2018,7 +1749,7 @@ def test_numpy_round(self): tm.assert_frame_equal(out, expected) msg = "the 'out' parameter is not supported" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): np.round(df, decimals=0, out=df) def test_round_mixed_type(self): @@ -2044,15 +1775,15 @@ def test_round_issue(self): dfs = pd.concat((df, df), axis=1) rounded = dfs.round() - self.assert_index_equal(rounded.index, dfs.index) + tm.assert_index_equal(rounded.index, dfs.index) decimals = pd.Series([1, 0, 2], index=['A', 'B', 'A']) - self.assertRaises(ValueError, df.round, decimals) + pytest.raises(ValueError, 
df.round, decimals)
 
     def test_built_in_round(self):
         if not compat.PY3:
-            raise nose.SkipTest("build in round cannot be overriden "
-                                "prior to Python 3")
+            pytest.skip("built-in round cannot be overridden "
+                        "prior to Python 3")
 
         # GH11763
         # Here's the test frame we'll be working with
@@ -2070,13 +1801,13 @@ def test_clip(self):
         median = self.frame.median().median()
 
         capped = self.frame.clip_upper(median)
-        self.assertFalse((capped.values > median).any())
+        assert not (capped.values > median).any()
 
         floored = self.frame.clip_lower(median)
-        self.assertFalse((floored.values < median).any())
+        assert not (floored.values < median).any()
 
         double = self.frame.clip(upper=median, lower=median)
-        self.assertFalse((double.values != median).any())
+        assert not (double.values != median).any()
 
     def test_dataframe_clip(self):
         # GH #2747
@@ -2089,10 +1820,9 @@ def test_dataframe_clip(self):
         lb_mask = df.values <= lb
         ub_mask = df.values >= ub
         mask = ~lb_mask & ~ub_mask
-        self.assertTrue((clipped_df.values[lb_mask] == lb).all())
-        self.assertTrue((clipped_df.values[ub_mask] == ub).all())
-        self.assertTrue((clipped_df.values[mask] ==
-                         df.values[mask]).all())
+        assert (clipped_df.values[lb_mask] == lb).all()
+        assert (clipped_df.values[ub_mask] == ub).all()
+        assert (clipped_df.values[mask] == df.values[mask]).all()
 
     def test_clip_against_series(self):
         # GH #6966
@@ -2110,11 +1840,11 @@ def test_clip_against_series(self):
             result = clipped_df.loc[lb_mask, i]
             tm.assert_series_equal(result, lb[lb_mask], check_names=False)
-            self.assertEqual(result.name, i)
+            assert result.name == i
 
             result = clipped_df.loc[ub_mask, i]
             tm.assert_series_equal(result, ub[ub_mask], check_names=False)
-            self.assertEqual(result.name, i)
+            assert result.name == i
 
             tm.assert_series_equal(clipped_df.loc[mask, i], df.loc[mask, i])
 
@@ -2153,20 +1883,21 @@ def test_dot(self):
         # Check series argument
         result = a.dot(b['one'])
         tm.assert_series_equal(result, expected['one'], check_names=False)
-        self.assertTrue(result.name is None)
+        assert result.name is None
 
         result = a.dot(b1['one'])
         tm.assert_series_equal(result, expected['one'], check_names=False)
-        self.assertTrue(result.name is None)
+        assert result.name is None
 
         # can pass correct-length arrays
-        row = a.ix[0].values
+        row = a.iloc[0].values
 
         result = a.dot(row)
-        exp = a.dot(a.ix[0])
+        exp = a.dot(a.iloc[0])
         tm.assert_series_equal(result, exp)
 
-        with tm.assertRaisesRegexp(ValueError, 'Dot product shape mismatch'):
+        with tm.assert_raises_regex(ValueError,
+                                    'Dot product shape mismatch'):
             a.dot(row[:-1])
 
         a = np.random.rand(1, 5)
@@ -2183,9 +1914,134 @@ def test_dot(self):
         df = DataFrame(randn(3, 4), index=[1, 2, 3], columns=lrange(4))
         df2 = DataFrame(randn(5, 3), index=lrange(5), columns=[1, 2, 3])
 
-        with tm.assertRaisesRegexp(ValueError, 'aligned'):
+        with tm.assert_raises_regex(ValueError, 'aligned'):
             df.dot(df2)
 
-if __name__ == '__main__':
-    nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
-                   exit=False)
+
+@pytest.fixture
+def df_duplicates():
+    return pd.DataFrame({'a': [1, 2, 3, 4, 4],
+                         'b': [1, 1, 1, 1, 1],
+                         'c': [0, 1, 2, 5, 4]},
+                        index=[0, 0, 1, 1, 1])
+
+
+@pytest.fixture
+def df_strings():
+    return pd.DataFrame({'a': np.random.permutation(10),
+                         'b': list(ascii_lowercase[:10]),
+                         'c': np.random.permutation(10).astype('float64')})
+
+
+@pytest.fixture
+def df_main_dtypes():
+    return pd.DataFrame(
+        {'group': [1, 1, 2],
+         'int': [1, 2, 3],
+         'float': [4., 5., 6.],
+         'string': list('abc'),
+         'category_string': pd.Series(list('abc')).astype('category'),
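+         # one column of each common dtype: nlargest/nsmallest are expected
+         # to work on the numeric/datetime/timedelta columns and to raise
+         # for the string and category ones (see test_n_error below)
+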
'category_int': [7, 8, 9], + 'datetime': pd.date_range('20130101', periods=3), + 'datetimetz': pd.date_range('20130101', + periods=3, + tz='US/Eastern'), + 'timedelta': pd.timedelta_range('1 s', periods=3, freq='s')}, + columns=['group', 'int', 'float', 'string', + 'category_string', 'category_int', + 'datetime', 'datetimetz', + 'timedelta']) + + +class TestNLargestNSmallest(object): + + dtype_error_msg_template = ("Column {column!r} has dtype {dtype}, cannot " + "use method {method!r} with this dtype") + + # ---------------------------------------------------------------------- + # Top / bottom + @pytest.mark.parametrize( + 'method, n, order', + product(['nsmallest', 'nlargest'], range(1, 11), + [['a'], + ['c'], + ['a', 'b'], + ['a', 'c'], + ['b', 'a'], + ['b', 'c'], + ['a', 'b', 'c'], + ['c', 'a', 'b'], + ['c', 'b', 'a'], + ['b', 'c', 'a'], + ['b', 'a', 'c'], + + # dups! + ['b', 'c', 'c'], + + ])) + def test_n(self, df_strings, method, n, order): + # GH10393 + df = df_strings + if 'b' in order: + + error_msg = self.dtype_error_msg_template.format( + column='b', method=method, dtype='object') + with tm.assert_raises_regex(TypeError, error_msg): + getattr(df, method)(n, order) + else: + ascending = method == 'nsmallest' + result = getattr(df, method)(n, order) + expected = df.sort_values(order, ascending=ascending).head(n) + tm.assert_frame_equal(result, expected) + + @pytest.mark.parametrize( + 'method, columns', + product(['nsmallest', 'nlargest'], + product(['group'], ['category_string', 'string']) + )) + def test_n_error(self, df_main_dtypes, method, columns): + df = df_main_dtypes + error_msg = self.dtype_error_msg_template.format( + column=columns[1], method=method, dtype=df[columns[1]].dtype) + with tm.assert_raises_regex(TypeError, error_msg): + getattr(df, method)(2, columns) + + def test_n_all_dtypes(self, df_main_dtypes): + df = df_main_dtypes + df.nsmallest(2, list(set(df) - {'category_string', 'string'})) + df.nlargest(2, list(set(df) - {'category_string', 'string'})) + + def test_n_identical_values(self): + # GH15297 + df = pd.DataFrame({'a': [1] * 5, 'b': [1, 2, 3, 4, 5]}) + + result = df.nlargest(3, 'a') + expected = pd.DataFrame( + {'a': [1] * 3, 'b': [1, 2, 3]}, index=[0, 1, 2] + ) + tm.assert_frame_equal(result, expected) + + result = df.nsmallest(3, 'a') + expected = pd.DataFrame({'a': [1] * 3, 'b': [1, 2, 3]}) + tm.assert_frame_equal(result, expected) + + @pytest.mark.parametrize( + 'n, order', + product([1, 2, 3, 4, 5], + [['a', 'b', 'c'], + ['c', 'b', 'a'], + ['a'], + ['b'], + ['a', 'b'], + ['c', 'b']])) + def test_n_duplicate_index(self, df_duplicates, n, order): + # GH 13412 + + df = df_duplicates + result = df.nsmallest(n, order) + expected = df.sort_values(order).head(n) + tm.assert_frame_equal(result, expected) + + result = df.nlargest(n, order) + expected = df.sort_values(order, ascending=False).head(n) + tm.assert_frame_equal(result, expected) diff --git a/pandas/tests/frame/test_misc_api.py b/pandas/tests/frame/test_api.py similarity index 52% rename from pandas/tests/frame/test_misc_api.py rename to pandas/tests/frame/test_api.py index 089b71b30119b..f63918c97c614 100644 --- a/pandas/tests/frame/test_misc_api.py +++ b/pandas/tests/frame/test_api.py @@ -1,10 +1,12 @@ # -*- coding: utf-8 -*- from __future__ import print_function + +import pytest + # pylint: disable-msg=W0612,E1101 from copy import deepcopy import sys -import nose from distutils.version import LooseVersion from pandas.compat import range, lrange @@ -13,13 +15,12 @@ from numpy.random import 
randn import numpy as np -from pandas import DataFrame, Series +from pandas import DataFrame, Series, date_range, timedelta_range import pandas as pd from pandas.util.testing import (assert_almost_equal, assert_series_equal, - assert_frame_equal, - assertRaisesRegexp) + assert_frame_equal) import pandas.util.testing as tm @@ -28,8 +29,6 @@ class SharedWithSparse(object): - _multiprocess_can_split_ = True - def test_copy_index_name_checking(self): # don't want to be able to modify the index stored elsewhere after # making a copy @@ -38,20 +37,20 @@ def test_copy_index_name_checking(self): ind.name = None cp = self.frame.copy() getattr(cp, attr).name = 'foo' - self.assertIsNone(getattr(self.frame, attr).name) + assert getattr(self.frame, attr).name is None def test_getitem_pop_assign_name(self): s = self.frame['A'] - self.assertEqual(s.name, 'A') + assert s.name == 'A' s = self.frame.pop('A') - self.assertEqual(s.name, 'A') + assert s.name == 'A' - s = self.frame.ix[:, 'B'] - self.assertEqual(s.name, 'B') + s = self.frame.loc[:, 'B'] + assert s.name == 'B' - s2 = s.ix[:] - self.assertEqual(s2.name, 'B') + s2 = s.loc[:] + assert s2.name == 'B' def test_get_value(self): for idx in self.frame.index: @@ -60,134 +59,49 @@ def test_get_value(self): expected = self.frame[col][idx] tm.assert_almost_equal(result, expected) - def test_join_index(self): - # left / right - - f = self.frame.reindex(columns=['A', 'B'])[:10] - f2 = self.frame.reindex(columns=['C', 'D']) - - joined = f.join(f2) - self.assert_index_equal(f.index, joined.index) - self.assertEqual(len(joined.columns), 4) - - joined = f.join(f2, how='left') - self.assert_index_equal(joined.index, f.index) - self.assertEqual(len(joined.columns), 4) - - joined = f.join(f2, how='right') - self.assert_index_equal(joined.index, f2.index) - self.assertEqual(len(joined.columns), 4) - - # inner - - f = self.frame.reindex(columns=['A', 'B'])[:10] - f2 = self.frame.reindex(columns=['C', 'D']) - - joined = f.join(f2, how='inner') - self.assert_index_equal(joined.index, f.index.intersection(f2.index)) - self.assertEqual(len(joined.columns), 4) - - # outer - - f = self.frame.reindex(columns=['A', 'B'])[:10] - f2 = self.frame.reindex(columns=['C', 'D']) - - joined = f.join(f2, how='outer') - self.assertTrue(tm.equalContents(self.frame.index, joined.index)) - self.assertEqual(len(joined.columns), 4) - - assertRaisesRegexp(ValueError, 'join method', f.join, f2, how='foo') - - # corner case - overlapping columns - for how in ('outer', 'left', 'inner'): - with assertRaisesRegexp(ValueError, 'columns overlap but ' - 'no suffix'): - self.frame.join(self.frame, how=how) - - def test_join_index_more(self): - af = self.frame.ix[:, ['A', 'B']] - bf = self.frame.ix[::2, ['C', 'D']] - - expected = af.copy() - expected['C'] = self.frame['C'][::2] - expected['D'] = self.frame['D'][::2] - - result = af.join(bf) - assert_frame_equal(result, expected) - - result = af.join(bf, how='right') - assert_frame_equal(result, expected[::2]) - - result = bf.join(af, how='right') - assert_frame_equal(result, expected.ix[:, result.columns]) - - def test_join_index_series(self): - df = self.frame.copy() - s = df.pop(self.frame.columns[-1]) - joined = df.join(s) - - # TODO should this check_names ? 
- assert_frame_equal(joined, self.frame, check_names=False) - - s.name = None - assertRaisesRegexp(ValueError, 'must have a name', df.join, s) - - def test_join_overlap(self): - df1 = self.frame.ix[:, ['A', 'B', 'C']] - df2 = self.frame.ix[:, ['B', 'C', 'D']] - - joined = df1.join(df2, lsuffix='_df1', rsuffix='_df2') - df1_suf = df1.ix[:, ['B', 'C']].add_suffix('_df1') - df2_suf = df2.ix[:, ['B', 'C']].add_suffix('_df2') - - no_overlap = self.frame.ix[:, ['A', 'D']] - expected = df1_suf.join(df2_suf).join(no_overlap) - - # column order not necessarily sorted - assert_frame_equal(joined, expected.ix[:, joined.columns]) - def test_add_prefix_suffix(self): with_prefix = self.frame.add_prefix('foo#') expected = pd.Index(['foo#%s' % c for c in self.frame.columns]) - self.assert_index_equal(with_prefix.columns, expected) + tm.assert_index_equal(with_prefix.columns, expected) with_suffix = self.frame.add_suffix('#foo') expected = pd.Index(['%s#foo' % c for c in self.frame.columns]) - self.assert_index_equal(with_suffix.columns, expected) + tm.assert_index_equal(with_suffix.columns, expected) -class TestDataFrameMisc(tm.TestCase, SharedWithSparse, TestData): +class TestDataFrameMisc(SharedWithSparse, TestData): klass = DataFrame - _multiprocess_can_split_ = True - def test_get_axis(self): f = self.frame - self.assertEqual(f._get_axis_number(0), 0) - self.assertEqual(f._get_axis_number(1), 1) - self.assertEqual(f._get_axis_number('index'), 0) - self.assertEqual(f._get_axis_number('rows'), 0) - self.assertEqual(f._get_axis_number('columns'), 1) - - self.assertEqual(f._get_axis_name(0), 'index') - self.assertEqual(f._get_axis_name(1), 'columns') - self.assertEqual(f._get_axis_name('index'), 'index') - self.assertEqual(f._get_axis_name('rows'), 'index') - self.assertEqual(f._get_axis_name('columns'), 'columns') - - self.assertIs(f._get_axis(0), f.index) - self.assertIs(f._get_axis(1), f.columns) - - assertRaisesRegexp(ValueError, 'No axis named', f._get_axis_number, 2) - assertRaisesRegexp(ValueError, 'No axis.*foo', f._get_axis_name, 'foo') - assertRaisesRegexp(ValueError, 'No axis.*None', f._get_axis_name, None) - assertRaisesRegexp(ValueError, 'No axis named', f._get_axis_number, - None) + assert f._get_axis_number(0) == 0 + assert f._get_axis_number(1) == 1 + assert f._get_axis_number('index') == 0 + assert f._get_axis_number('rows') == 0 + assert f._get_axis_number('columns') == 1 + + assert f._get_axis_name(0) == 'index' + assert f._get_axis_name(1) == 'columns' + assert f._get_axis_name('index') == 'index' + assert f._get_axis_name('rows') == 'index' + assert f._get_axis_name('columns') == 'columns' + + assert f._get_axis(0) is f.index + assert f._get_axis(1) is f.columns + + tm.assert_raises_regex( + ValueError, 'No axis named', f._get_axis_number, 2) + tm.assert_raises_regex( + ValueError, 'No axis.*foo', f._get_axis_name, 'foo') + tm.assert_raises_regex( + ValueError, 'No axis.*None', f._get_axis_name, None) + tm.assert_raises_regex(ValueError, 'No axis named', + f._get_axis_number, None) def test_keys(self): getkeys = self.frame.keys - self.assertIs(getkeys(), self.frame.columns) + assert getkeys() is self.frame.columns def test_column_contains_typeerror(self): try: @@ -197,53 +111,53 @@ def test_column_contains_typeerror(self): def test_not_hashable(self): df = pd.DataFrame([1]) - self.assertRaises(TypeError, hash, df) - self.assertRaises(TypeError, hash, self.empty) + pytest.raises(TypeError, hash, df) + pytest.raises(TypeError, hash, self.empty) def test_new_empty_index(self): df1 = 
DataFrame(randn(0, 3)) df2 = DataFrame(randn(0, 3)) df1.index.name = 'foo' - self.assertIsNone(df2.index.name) + assert df2.index.name is None def test_array_interface(self): with np.errstate(all='ignore'): result = np.sqrt(self.frame) - tm.assertIsInstance(result, type(self.frame)) - self.assertIs(result.index, self.frame.index) - self.assertIs(result.columns, self.frame.columns) + assert isinstance(result, type(self.frame)) + assert result.index is self.frame.index + assert result.columns is self.frame.columns assert_frame_equal(result, self.frame.apply(np.sqrt)) def test_get_agg_axis(self): cols = self.frame._get_agg_axis(0) - self.assertIs(cols, self.frame.columns) + assert cols is self.frame.columns idx = self.frame._get_agg_axis(1) - self.assertIs(idx, self.frame.index) + assert idx is self.frame.index - self.assertRaises(ValueError, self.frame._get_agg_axis, 2) + pytest.raises(ValueError, self.frame._get_agg_axis, 2) def test_nonzero(self): - self.assertTrue(self.empty.empty) + assert self.empty.empty - self.assertFalse(self.frame.empty) - self.assertFalse(self.mixed_frame.empty) + assert not self.frame.empty + assert not self.mixed_frame.empty # corner case df = DataFrame({'A': [1., 2., 3.], 'B': ['a', 'b', 'c']}, index=np.arange(3)) del df['A'] - self.assertFalse(df.empty) + assert not df.empty def test_iteritems(self): df = DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'a', 'b']) for k, v in compat.iteritems(df): - self.assertEqual(type(v), Series) + assert type(v) == Series def test_iter(self): - self.assertTrue(tm.equalContents(list(self.frame), self.frame.columns)) + assert tm.equalContents(list(self.frame), self.frame.columns) def test_iterrows(self): for i, (k, v) in enumerate(self.frame.iterrows()): @@ -258,47 +172,45 @@ def test_itertuples(self): for i, tup in enumerate(self.frame.itertuples()): s = Series(tup[1:]) s.name = tup[0] - expected = self.frame.ix[i, :].reset_index(drop=True) + expected = self.frame.iloc[i, :].reset_index(drop=True) assert_series_equal(s, expected) df = DataFrame({'floats': np.random.randn(5), 'ints': lrange(5)}, columns=['floats', 'ints']) for tup in df.itertuples(index=False): - tm.assertIsInstance(tup[1], np.integer) + assert isinstance(tup[1], np.integer) df = DataFrame(data={"a": [1, 2, 3], "b": [4, 5, 6]}) dfaa = df[['a', 'a']] - self.assertEqual(list(dfaa.itertuples()), [ - (0, 1, 1), (1, 2, 2), (2, 3, 3)]) - self.assertEqual(repr(list(df.itertuples(name=None))), - '[(0, 1, 4), (1, 2, 5), (2, 3, 6)]') + assert (list(dfaa.itertuples()) == + [(0, 1, 1), (1, 2, 2), (2, 3, 3)]) + assert (repr(list(df.itertuples(name=None))) == + '[(0, 1, 4), (1, 2, 5), (2, 3, 6)]') tup = next(df.itertuples(name='TestName')) - # no support for field renaming in Python 2.6, regular tuples are - # returned if sys.version >= LooseVersion('2.7'): - self.assertEqual(tup._fields, ('Index', 'a', 'b')) - self.assertEqual((tup.Index, tup.a, tup.b), tup) - self.assertEqual(type(tup).__name__, 'TestName') + assert tup._fields == ('Index', 'a', 'b') + assert (tup.Index, tup.a, tup.b) == tup + assert type(tup).__name__ == 'TestName' df.columns = ['def', 'return'] tup2 = next(df.itertuples(name='TestName')) - self.assertEqual(tup2, (0, 1, 4)) + assert tup2 == (0, 1, 4) if sys.version >= LooseVersion('2.7'): - self.assertEqual(tup2._fields, ('Index', '_1', '_2')) + assert tup2._fields == ('Index', '_1', '_2') df3 = DataFrame(dict(('f' + str(i), [i]) for i in range(1024))) # will raise SyntaxError if trying to create namedtuple tup3 = next(df3.itertuples()) - 
self.assertFalse(hasattr(tup3, '_fields')) - self.assertIsInstance(tup3, tuple) + assert not hasattr(tup3, '_fields') + assert isinstance(tup3, tuple) def test_len(self): - self.assertEqual(len(self.frame), len(self.frame.index)) + assert len(self.frame) == len(self.frame.index) def test_as_matrix(self): frame = self.frame @@ -309,17 +221,17 @@ def test_as_matrix(self): for j, value in enumerate(row): col = frameCols[j] if np.isnan(value): - self.assertTrue(np.isnan(frame[col][i])) + assert np.isnan(frame[col][i]) else: - self.assertEqual(value, frame[col][i]) + assert value == frame[col][i] # mixed type mat = self.mixed_frame.as_matrix(['foo', 'A']) - self.assertEqual(mat[0, 0], 'bar') + assert mat[0, 0] == 'bar' df = DataFrame({'real': [1, 2, 3], 'complex': [1j, 2j, 3j]}) mat = df.as_matrix() - self.assertEqual(mat[0, 0], 1j) + assert mat[0, 0] == 1j # single block corner case mat = self.frame.as_matrix(['A', 'B']) @@ -328,14 +240,14 @@ def test_as_matrix(self): def test_values(self): self.frame.values[:, 0] = 5. - self.assertTrue((self.frame.values[:, 0] == 5).all()) + assert (self.frame.values[:, 0] == 5).all() def test_deepcopy(self): cp = deepcopy(self.frame) series = cp['A'] series[:] = 10 for idx, value in compat.iteritems(series): - self.assertNotEqual(self.frame['A'][idx], value) + assert self.frame['A'][idx] != value # --------------------------------------------------------------------- # Transposing @@ -346,9 +258,9 @@ def test_transpose(self): for idx, series in compat.iteritems(dft): for col, value in compat.iteritems(series): if np.isnan(value): - self.assertTrue(np.isnan(frame[col][idx])) + assert np.isnan(frame[col][idx]) else: - self.assertEqual(value, frame[col][idx]) + assert value == frame[col][idx] # mixed type index, data = tm.getMixedTypeDict() @@ -356,20 +268,20 @@ def test_transpose(self): mixed_T = mixed.T for col, s in compat.iteritems(mixed_T): - self.assertEqual(s.dtype, np.object_) + assert s.dtype == np.object_ def test_transpose_get_view(self): dft = self.frame.T dft.values[:, 5:10] = 5 - self.assertTrue((self.frame.values[5:10] == 5).all()) + assert (self.frame.values[5:10] == 5).all() def test_swapaxes(self): df = DataFrame(np.random.randn(10, 5)) assert_frame_equal(df.T, df.swapaxes(0, 1)) assert_frame_equal(df.T, df.swapaxes(1, 0)) assert_frame_equal(df, df.swapaxes(0, 0)) - self.assertRaises(ValueError, df.swapaxes, 2, 5) + pytest.raises(ValueError, df.swapaxes, 2, 5) def test_axis_aliases(self): f = self.frame @@ -385,34 +297,49 @@ def test_axis_aliases(self): def test_more_asMatrix(self): values = self.mixed_frame.as_matrix() - self.assertEqual(values.shape[1], len(self.mixed_frame.columns)) + assert values.shape[1] == len(self.mixed_frame.columns) def test_repr_with_mi_nat(self): df = DataFrame({'X': [1, 2]}, index=[[pd.NaT, pd.Timestamp('20130101')], ['a', 'b']]) res = repr(df) exp = ' X\nNaT a 1\n2013-01-01 b 2' - self.assertEqual(res, exp) + assert res == exp - def test_iterkv_deprecation(self): - with tm.assert_produces_warning(FutureWarning): - self.mixed_float.iterkv() - - def test_iterkv_names(self): + def test_iteritems_names(self): for k, v in compat.iteritems(self.mixed_frame): - self.assertEqual(v.name, k) + assert v.name == k def test_series_put_names(self): series = self.mixed_frame._series for k, v in compat.iteritems(series): - self.assertEqual(v.name, k) + assert v.name == k def test_empty_nonzero(self): df = DataFrame([1, 2, 3]) - self.assertFalse(df.empty) + assert not df.empty + df = pd.DataFrame(index=[1], columns=[1]) + assert 
not df.empty df = DataFrame(index=['a', 'b'], columns=['c', 'd']).dropna() - self.assertTrue(df.empty) - self.assertTrue(df.T.empty) + assert df.empty + assert df.T.empty + empty_frames = [pd.DataFrame(), + pd.DataFrame(index=[1]), + pd.DataFrame(columns=[1]), + pd.DataFrame({1: []})] + for df in empty_frames: + assert df.empty + assert df.T.empty + + def test_with_datetimelikes(self): + + df = DataFrame({'A': date_range('20130101', periods=10), + 'B': timedelta_range('1 day', periods=10)}) + t = df.T + + result = t.get_dtype_counts() + expected = Series({'object': 10}) + tm.assert_series_equal(result, expected) def test_inplace_return_self(self): # re #1893 @@ -423,7 +350,7 @@ def test_inplace_return_self(self): def _check_f(base, f): result = f(base) - self.assertTrue(result is None) + assert result is None # -----DataFrame----- @@ -447,10 +374,6 @@ def _check_f(base, f): f = lambda x: x.sort_index(inplace=True) _check_f(data.copy(), f) - # sortlevel - f = lambda x: x.sortlevel(0, inplace=True) - _check_f(data.set_index(['a', 'b']), f) - # fillna f = lambda x: x.fillna(0, inplace=True) _check_f(data.copy(), f) @@ -481,8 +404,3 @@ def _check_f(base, f): # rename f = lambda x: x.rename({1: 'foo'}, inplace=True) _check_f(d.copy(), f) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/frame/test_apply.py b/pandas/tests/frame/test_apply.py index 5cadb4dba577f..aa7c7a7120c1b 100644 --- a/pandas/tests/frame/test_apply.py +++ b/pandas/tests/frame/test_apply.py @@ -2,6 +2,8 @@ from __future__ import print_function +import pytest + from datetime import datetime import warnings @@ -10,44 +12,43 @@ from pandas import (notnull, DataFrame, Series, MultiIndex, date_range, Timestamp, compat) import pandas as pd -from pandas.types.dtypes import CategoricalDtype +from pandas.core.dtypes.dtypes import CategoricalDtype from pandas.util.testing import (assert_series_equal, assert_frame_equal) import pandas.util.testing as tm from pandas.tests.frame.common import TestData -class TestDataFrameApply(tm.TestCase, TestData): - - _multiprocess_can_split_ = True +class TestDataFrameApply(TestData): def test_apply(self): with np.errstate(all='ignore'): # ufunc applied = self.frame.apply(np.sqrt) - assert_series_equal(np.sqrt(self.frame['A']), applied['A']) + tm.assert_series_equal(np.sqrt(self.frame['A']), applied['A']) # aggregator applied = self.frame.apply(np.mean) - self.assertEqual(applied['A'], np.mean(self.frame['A'])) + assert applied['A'] == np.mean(self.frame['A']) d = self.frame.index[0] applied = self.frame.apply(np.mean, axis=1) - self.assertEqual(applied[d], np.mean(self.frame.xs(d))) - self.assertIs(applied.index, self.frame.index) # want this + assert applied[d] == np.mean(self.frame.xs(d)) + assert applied.index is self.frame.index # want this # invalid axis df = DataFrame( [[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=['a', 'a', 'c']) - self.assertRaises(ValueError, df.apply, lambda x: x, 2) + pytest.raises(ValueError, df.apply, lambda x: x, 2) - # GH9573 + # see gh-9573 df = DataFrame({'c0': ['A', 'A', 'B', 'B'], 'c1': ['C', 'C', 'D', 'D']}) df = df.apply(lambda ts: ts.astype('category')) - self.assertEqual(df.shape, (4, 2)) - self.assertTrue(isinstance(df['c0'].dtype, CategoricalDtype)) - self.assertTrue(isinstance(df['c1'].dtype, CategoricalDtype)) + + assert df.shape == (4, 2) + assert isinstance(df['c0'].dtype, CategoricalDtype) + assert isinstance(df['c1'].dtype, CategoricalDtype) def 
test_apply_mixed_datetimelike(self): # mixed datetimelike @@ -60,17 +61,17 @@ def test_apply_mixed_datetimelike(self): def test_apply_empty(self): # empty applied = self.empty.apply(np.sqrt) - self.assertTrue(applied.empty) + assert applied.empty applied = self.empty.apply(np.mean) - self.assertTrue(applied.empty) + assert applied.empty no_rows = self.frame[:0] result = no_rows.apply(lambda x: x.mean()) expected = Series(np.nan, index=self.frame.columns) assert_series_equal(result, expected) - no_cols = self.frame.ix[:, []] + no_cols = self.frame.loc[:, []] result = no_cols.apply(lambda x: x.mean(), axis=1) expected = Series(np.nan, index=self.frame.index) assert_series_equal(result, expected) @@ -96,7 +97,7 @@ def test_apply_empty(self): [], index=pd.Index([], dtype=object))) # Ensure that x.append hasn't been called - self.assertEqual(x, []) + assert x == [] def test_apply_standard_nonunique(self): df = DataFrame( @@ -108,17 +109,28 @@ def test_apply_standard_nonunique(self): rs = df.T.apply(lambda s: s[0], axis=0) assert_series_equal(rs, xp) + def test_with_string_args(self): + + for arg in ['sum', 'mean', 'min', 'max', 'std']: + result = self.frame.apply(arg) + expected = getattr(self.frame, arg)() + tm.assert_series_equal(result, expected) + + result = self.frame.apply(arg, axis=1) + expected = getattr(self.frame, arg)(axis=1) + tm.assert_series_equal(result, expected) + def test_apply_broadcast(self): broadcasted = self.frame.apply(np.mean, broadcast=True) agged = self.frame.apply(np.mean) for col, ts in compat.iteritems(broadcasted): - self.assertTrue((ts == agged[col]).all()) + assert (ts == agged[col]).all() broadcasted = self.frame.apply(np.mean, axis=1, broadcast=True) agged = self.frame.apply(np.mean, axis=1) for idx in broadcasted.index: - self.assertTrue((broadcasted.xs(idx) == agged[idx]).all()) + assert (broadcasted.xs(idx) == agged[idx]).all() def test_apply_raw(self): result0 = self.frame.apply(np.mean, raw=True) @@ -138,7 +150,7 @@ def test_apply_raw(self): def test_apply_axis1(self): d = self.frame.index[0] tapplied = self.frame.apply(np.mean, axis=1) - self.assertEqual(tapplied[d], np.mean(self.frame.xs(d))) + assert tapplied[d] == np.mean(self.frame.xs(d)) def test_apply_ignore_failures(self): result = self.mixed_frame._apply_standard(np.mean, 0, @@ -178,10 +190,10 @@ def _checkit(axis=0, raw=False): res = df.apply(f, axis=axis, raw=raw) if is_reduction: agg_axis = df._get_agg_axis(axis) - tm.assertIsInstance(res, Series) - self.assertIs(res.index, agg_axis) + assert isinstance(res, Series) + assert res.index is agg_axis else: - tm.assertIsInstance(res, DataFrame) + assert isinstance(res, DataFrame) _checkit() _checkit(axis=1) @@ -195,7 +207,7 @@ def _checkit(axis=0, raw=False): _check(no_index, lambda x: x.mean()) result = no_cols.apply(lambda x: x.mean(), broadcast=True) - tm.assertIsInstance(result, DataFrame) + assert isinstance(result, DataFrame) def test_apply_with_args_kwds(self): def add_some(x, howmuch=0): @@ -224,7 +236,7 @@ def test_apply_yield_list(self): assert_frame_equal(result, self.frame) def test_apply_reduce_Series(self): - self.frame.ix[::2, 'A'] = np.nan + self.frame.loc[::2, 'A'] = np.nan expected = self.frame.mean(1) result = self.frame.apply(np.mean, axis=1) assert_series_equal(result, expected) @@ -272,12 +284,11 @@ def transform2(row): return row try: - transformed = data.apply(transform, axis=1) # noqa + data.apply(transform, axis=1) except AttributeError as e: - self.assertEqual(len(e.args), 2) - self.assertEqual(e.args[1], 'occurred at 
index 4') - self.assertEqual( - e.args[0], "'float' object has no attribute 'startswith'") + assert len(e.args) == 2 + assert e.args[1] == 'occurred at index 4' + assert e.args[0] == "'float' object has no attribute 'startswith'" def test_apply_bug(self): @@ -348,7 +359,7 @@ def test_apply_multi_index(self): s.index = MultiIndex.from_arrays([['a', 'a', 'b'], ['c', 'd', 'd']]) s.columns = ['col1', 'col2'] res = s.apply(lambda x: Series({'min': min(x), 'max': max(x)}), 1) - tm.assertIsInstance(res.index, MultiIndex) + assert isinstance(res.index, MultiIndex) def test_apply_dict(self): @@ -371,23 +382,23 @@ def test_apply_dict(self): def test_applymap(self): applied = self.frame.applymap(lambda x: x * 2) - assert_frame_equal(applied, self.frame * 2) - result = self.frame.applymap(type) + tm.assert_frame_equal(applied, self.frame * 2) + self.frame.applymap(type) - # GH #465, function returning tuples + # gh-465: function returning tuples result = self.frame.applymap(lambda x: (x, x)) - tm.assertIsInstance(result['A'][0], tuple) + assert isinstance(result['A'][0], tuple) - # GH 2909, object conversion to float in constructor? + # gh-2909: object conversion to float in constructor? df = DataFrame(data=[1, 'a']) result = df.applymap(lambda x: x) - self.assertEqual(result.dtypes[0], object) + assert result.dtypes[0] == object df = DataFrame(data=[1., 'a']) result = df.applymap(lambda x: x) - self.assertEqual(result.dtypes[0], object) + assert result.dtypes[0] == object - # GH2786 + # see gh-2786 df = DataFrame(np.random.random((3, 4))) df2 = df.copy() cols = ['a', 'a', 'a', 'a'] @@ -396,14 +407,24 @@ def test_applymap(self): expected = df2.applymap(str) expected.columns = cols result = df.applymap(str) - assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) # datetime/timedelta df['datetime'] = Timestamp('20130101') df['timedelta'] = pd.Timedelta('1 min') result = df.applymap(str) for f in ['datetime', 'timedelta']: - self.assertEqual(result.loc[0, f], str(df.loc[0, f])) + assert result.loc[0, f] == str(df.loc[0, f]) + + # see gh-8222 + empty_frames = [pd.DataFrame(), + pd.DataFrame(columns=list('ABC')), + pd.DataFrame(index=list('ABC')), + pd.DataFrame({'A': [], 'B': [], 'C': []})] + for frame in empty_frames: + for func in [round, lambda x: x]: + result = frame.applymap(func) + tm.assert_frame_equal(result, frame) def test_applymap_box(self): # ufunc will not be boxed. 
Same test cases as the test_map_box @@ -423,6 +444,15 @@ def test_applymap_box(self): 'd': ['Period', 'Period']}) tm.assert_frame_equal(res, exp) + def test_frame_apply_dont_convert_datetime64(self): + from pandas.tseries.offsets import BDay + df = DataFrame({'x1': [datetime(1996, 1, 1)]}) + + df = df.applymap(lambda x: x + BDay()) + df = df.applymap(lambda x: x + BDay()) + + assert df.x1.dtype == 'M8[ns]' + # See gh-12244 def test_apply_non_numpy_dtype(self): df = DataFrame({'dt': pd.date_range( @@ -438,3 +468,170 @@ def test_apply_non_numpy_dtype(self): df = DataFrame({'dt': ['a', 'b', 'c', 'a']}, dtype='category') result = df.apply(lambda x: x) assert_frame_equal(result, df) + + +def zip_frames(*frames): + """ + take a list of frames, zip the columns together for each + assume that these all have the first frame columns + + return a new frame + """ + columns = frames[0].columns + zipped = [f[c] for c in columns for f in frames] + return pd.concat(zipped, axis=1) + + +class TestDataFrameAggregate(TestData): + + _multiprocess_can_split_ = True + + def test_agg_transform(self): + + with np.errstate(all='ignore'): + + f_sqrt = np.sqrt(self.frame) + f_abs = np.abs(self.frame) + + # ufunc + result = self.frame.transform(np.sqrt) + expected = f_sqrt.copy() + assert_frame_equal(result, expected) + + result = self.frame.apply(np.sqrt) + assert_frame_equal(result, expected) + + result = self.frame.transform(np.sqrt) + assert_frame_equal(result, expected) + + # list-like + result = self.frame.apply([np.sqrt]) + expected = f_sqrt.copy() + expected.columns = pd.MultiIndex.from_product( + [self.frame.columns, ['sqrt']]) + assert_frame_equal(result, expected) + + result = self.frame.transform([np.sqrt]) + assert_frame_equal(result, expected) + + # multiple items in list + # these are in the order as if we are applying both + # functions per series and then concatting + expected = zip_frames(f_sqrt, f_abs) + expected.columns = pd.MultiIndex.from_product( + [self.frame.columns, ['sqrt', 'absolute']]) + result = self.frame.apply([np.sqrt, np.abs]) + assert_frame_equal(result, expected) + + result = self.frame.transform(['sqrt', np.abs]) + assert_frame_equal(result, expected) + + def test_transform_and_agg_err(self): + # cannot both transform and agg + def f(): + self.frame.transform(['max', 'min']) + pytest.raises(ValueError, f) + + def f(): + with np.errstate(all='ignore'): + self.frame.agg(['max', 'sqrt']) + pytest.raises(ValueError, f) + + def f(): + with np.errstate(all='ignore'): + self.frame.transform(['max', 'sqrt']) + pytest.raises(ValueError, f) + + df = pd.DataFrame({'A': range(5), 'B': 5}) + + def f(): + with np.errstate(all='ignore'): + df.agg({'A': ['abs', 'sum'], 'B': ['mean', 'max']}) + + def test_demo(self): + # demonstration tests + df = pd.DataFrame({'A': range(5), 'B': 5}) + + result = df.agg(['min', 'max']) + expected = DataFrame({'A': [0, 4], 'B': [5, 5]}, + columns=['A', 'B'], + index=['min', 'max']) + tm.assert_frame_equal(result, expected) + + result = df.agg({'A': ['min', 'max'], 'B': ['sum', 'max']}) + expected = DataFrame({'A': [4.0, 0.0, np.nan], + 'B': [5.0, np.nan, 25.0]}, + columns=['A', 'B'], + index=['max', 'min', 'sum']) + tm.assert_frame_equal(result.reindex_like(expected), expected) + + def test_agg_dict_nested_renaming_depr(self): + + df = pd.DataFrame({'A': range(5), 'B': 5}) + + # nested renaming + with tm.assert_produces_warning(FutureWarning): + df.agg({'A': {'foo': 'min'}, + 'B': {'bar': 'max'}}) + + def test_agg_reduce(self): + # all reducers + expected = 
zip_frames(self.frame.mean().to_frame(),
+                              self.frame.max().to_frame(),
+                              self.frame.sum().to_frame()).T
+        expected.index = ['mean', 'max', 'sum']
+        result = self.frame.agg(['mean', 'max', 'sum'])
+        assert_frame_equal(result, expected)
+
+        # dict input with scalars
+        result = self.frame.agg({'A': 'mean', 'B': 'sum'})
+        expected = Series([self.frame.A.mean(), self.frame.B.sum()],
+                          index=['A', 'B'])
+        assert_series_equal(result.reindex_like(expected), expected)
+
+        # dict input with lists
+        result = self.frame.agg({'A': ['mean'], 'B': ['sum']})
+        expected = DataFrame({'A': Series([self.frame.A.mean()],
+                                          index=['mean']),
+                              'B': Series([self.frame.B.sum()],
+                                          index=['sum'])})
+        assert_frame_equal(result.reindex_like(expected), expected)
+
+        # dict input with lists of multiple aggregations
+        result = self.frame.agg({'A': ['mean', 'sum'],
+                                 'B': ['sum', 'max']})
+        expected = DataFrame({'A': Series([self.frame.A.mean(),
+                                           self.frame.A.sum()],
+                                          index=['mean', 'sum']),
+                              'B': Series([self.frame.B.sum(),
+                                           self.frame.B.max()],
+                                          index=['sum', 'max'])})
+        assert_frame_equal(result.reindex_like(expected), expected)
+
+    def test_nuisance_columns(self):
+
+        # GH 15015
+        df = DataFrame({'A': [1, 2, 3],
+                        'B': [1., 2., 3.],
+                        'C': ['foo', 'bar', 'baz'],
+                        'D': pd.date_range('20130101', periods=3)})
+
+        result = df.agg('min')
+        expected = Series([1, 1., 'bar', pd.Timestamp('20130101')],
+                          index=df.columns)
+        assert_series_equal(result, expected)
+
+        result = df.agg(['min'])
+        expected = DataFrame([[1, 1., 'bar', pd.Timestamp('20130101')]],
+                             index=['min'], columns=df.columns)
+        assert_frame_equal(result, expected)
+
+        result = df.agg('sum')
+        expected = Series([6, 6., 'foobarbaz'],
+                          index=['A', 'B', 'C'])
+        assert_series_equal(result, expected)
+
+        result = df.agg(['sum'])
+        expected = DataFrame([[6, 6., 'foobarbaz']],
+                             index=['sum'], columns=['A', 'B', 'C'])
+        assert_frame_equal(result, expected)
diff --git a/pandas/tests/frame/test_asof.py b/pandas/tests/frame/test_asof.py
index 6c15c75cb5427..d4e3d541937dc 100644
--- a/pandas/tests/frame/test_asof.py
+++ b/pandas/tests/frame/test_asof.py
@@ -1,72 +1,108 @@
 # coding=utf-8
 
-import nose
-
 import numpy as np
-from pandas import DataFrame, date_range
+from pandas import (DataFrame, date_range, Timestamp, Series,
+                    to_datetime)
 
-from pandas.util.testing import assert_frame_equal
 import pandas.util.testing as tm
 
 from .common import TestData
 
 
-class TestFrameAsof(TestData, tm.TestCase):
-    _multiprocess_can_split_ = True
-
-    def setUp(self):
+class TestFrameAsof(TestData):
    def setup_method(self, method):
         self.N = N = 50
-        rng = date_range('1/1/1990', periods=N, freq='53s')
+        self.rng = date_range('1/1/1990', periods=N, freq='53s')
         self.df = DataFrame({'A': np.arange(N), 'B': np.arange(N)},
-                            index=rng)
+                            index=self.rng)
 
     def test_basic(self):
-
         df = self.df.copy()
-        df.ix[15:30, 'A'] = np.nan
+        df.loc[15:30, 'A'] = np.nan
         dates = date_range('1/1/1990', periods=self.N * 3, freq='25s')
 
         result = df.asof(dates)
-        self.assertTrue(result.notnull().all(1).all())
+        assert result.notnull().all(1).all()
         lb = df.index[14]
         ub = df.index[30]
 
         dates = list(dates)
         result = df.asof(dates)
-        self.assertTrue(result.notnull().all(1).all())
+        assert result.notnull().all(1).all()
 
         mask = (result.index >= lb) & (result.index < ub)
         rs = result[mask]
-        self.assertTrue((rs == 14).all(1).all())
+        assert (rs == 14).all(1).all()
 
     def test_subset(self):
-
         N = 10
         rng = date_range('1/1/1990', periods=N, freq='53s')
         df = DataFrame({'A': np.arange(N), 'B': np.arange(N)},
                        index=rng)
-        df.ix[4:8, 'A'] = 
np.nan + df.loc[4:8, 'A'] = np.nan dates = date_range('1/1/1990', periods=N * 3, freq='25s') # with a subset of A should be the same result = df.asof(dates, subset='A') expected = df.asof(dates) - assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) # same with A/B result = df.asof(dates, subset=['A', 'B']) expected = df.asof(dates) - assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) # B gives self.df.asof result = df.asof(dates, subset='B') expected = df.resample('25s', closed='right').ffill().reindex(dates) expected.iloc[20:] = 9 - assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + def test_missing(self): + # GH 15118 + # no match found - `where` value before earliest date in index + N = 10 + rng = date_range('1/1/1990', periods=N, freq='53s') + df = DataFrame({'A': np.arange(N), 'B': np.arange(N)}, + index=rng) + result = df.asof('1989-12-31') + + expected = Series(index=['A', 'B'], name=Timestamp('1989-12-31')) + tm.assert_series_equal(result, expected) + + result = df.asof(to_datetime(['1989-12-31'])) + expected = DataFrame(index=to_datetime(['1989-12-31']), + columns=['A', 'B'], dtype='float64') + tm.assert_frame_equal(result, expected) + + def test_all_nans(self): + # GH 15713 + # DataFrame is all nans + result = DataFrame([np.nan]).asof([0]) + expected = DataFrame([np.nan]) + tm.assert_frame_equal(result, expected) + + # testing non-default indexes, multiple inputs + dates = date_range('1/1/1990', periods=self.N * 3, freq='25s') + result = DataFrame(np.nan, index=self.rng, columns=['A']).asof(dates) + expected = DataFrame(np.nan, index=dates, columns=['A']) + tm.assert_frame_equal(result, expected) + + # testing multiple columns + dates = date_range('1/1/1990', periods=self.N * 3, freq='25s') + result = DataFrame(np.nan, index=self.rng, + columns=['A', 'B', 'C']).asof(dates) + expected = DataFrame(np.nan, index=dates, columns=['A', 'B', 'C']) + tm.assert_frame_equal(result, expected) + + # testing scalar input + result = DataFrame(np.nan, index=[1, 2], columns=['A', 'B']).asof([3]) + expected = DataFrame(np.nan, index=[3], columns=['A', 'B']) + tm.assert_frame_equal(result, expected) + + result = DataFrame(np.nan, index=[1, 2], columns=['A', 'B']).asof(3) + expected = Series(np.nan, index=['A', 'B'], name=3) + tm.assert_series_equal(result, expected) diff --git a/pandas/tests/frame/test_axis_select_reindex.py b/pandas/tests/frame/test_axis_select_reindex.py index 9da1b31d259c5..a6326083c1bee 100644 --- a/pandas/tests/frame/test_axis_select_reindex.py +++ b/pandas/tests/frame/test_axis_select_reindex.py @@ -2,6 +2,8 @@ from __future__ import print_function +import pytest + from datetime import datetime from numpy import random @@ -12,22 +14,18 @@ date_range, isnull) import pandas as pd -from pandas.util.testing import (assert_series_equal, - assert_frame_equal, - assertRaisesRegexp) +from pandas.util.testing import assert_frame_equal -from pandas.core.common import PerformanceWarning +from pandas.errors import PerformanceWarning import pandas.util.testing as tm from pandas.tests.frame.common import TestData -class TestDataFrameSelectReindex(tm.TestCase, TestData): +class TestDataFrameSelectReindex(TestData): # These are specific reindex-based tests; other indexing tests should go in # test_indexing - _multiprocess_can_split_ = True - def test_drop_names(self): df = DataFrame([[1, 2, 
                            3], [3, 4, 5], [5, 6, 7]],
                        index=['a', 'b', 'c'],
@@ -39,29 +37,29 @@ def test_drop_names(self):
         df_inplace_b.drop('b', inplace=True)
         df_inplace_e.drop('e', axis=1, inplace=True)
         for obj in (df_dropped_b, df_dropped_e, df_inplace_b, df_inplace_e):
-            self.assertEqual(obj.index.name, 'first')
-            self.assertEqual(obj.columns.name, 'second')
-        self.assertEqual(list(df.columns), ['d', 'e', 'f'])
+            assert obj.index.name == 'first'
+            assert obj.columns.name == 'second'
+        assert list(df.columns) == ['d', 'e', 'f']

-        self.assertRaises(ValueError, df.drop, ['g'])
-        self.assertRaises(ValueError, df.drop, ['g'], 1)
+        pytest.raises(ValueError, df.drop, ['g'])
+        pytest.raises(ValueError, df.drop, ['g'], 1)

         # errors = 'ignore'
         dropped = df.drop(['g'], errors='ignore')
         expected = Index(['a', 'b', 'c'], name='first')
-        self.assert_index_equal(dropped.index, expected)
+        tm.assert_index_equal(dropped.index, expected)

         dropped = df.drop(['b', 'g'], errors='ignore')
         expected = Index(['a', 'c'], name='first')
-        self.assert_index_equal(dropped.index, expected)
+        tm.assert_index_equal(dropped.index, expected)

         dropped = df.drop(['g'], axis=1, errors='ignore')
         expected = Index(['d', 'e', 'f'], name='second')
-        self.assert_index_equal(dropped.columns, expected)
+        tm.assert_index_equal(dropped.columns, expected)

         dropped = df.drop(['d', 'g'], axis=1, errors='ignore')
         expected = Index(['e', 'f'], name='second')
-        self.assert_index_equal(dropped.columns, expected)
+        tm.assert_index_equal(dropped.columns, expected)

     def test_drop_col_still_multiindex(self):
         arrays = [['a', 'b', 'c', 'top'],
@@ -80,19 +78,19 @@ def test_drop(self):
         assert_frame_equal(simple.drop("A", axis=1), simple[['B']])
         assert_frame_equal(simple.drop(["A", "B"], axis='columns'),
                            simple[[]])
-        assert_frame_equal(simple.drop([0, 1, 3], axis=0), simple.ix[[2], :])
+        assert_frame_equal(simple.drop([0, 1, 3], axis=0), simple.loc[[2], :])
         assert_frame_equal(simple.drop(
-            [0, 3], axis='index'), simple.ix[[1, 2], :])
+            [0, 3], axis='index'), simple.loc[[1, 2], :])

-        self.assertRaises(ValueError, simple.drop, 5)
-        self.assertRaises(ValueError, simple.drop, 'C', 1)
-        self.assertRaises(ValueError, simple.drop, [1, 5])
-        self.assertRaises(ValueError, simple.drop, ['A', 'C'], 1)
+        pytest.raises(ValueError, simple.drop, 5)
+        pytest.raises(ValueError, simple.drop, 'C', 1)
+        pytest.raises(ValueError, simple.drop, [1, 5])
+        pytest.raises(ValueError, simple.drop, ['A', 'C'], 1)

         # errors = 'ignore'
         assert_frame_equal(simple.drop(5, errors='ignore'), simple)
         assert_frame_equal(simple.drop([0, 5], errors='ignore'),
-                           simple.ix[[1, 2, 3], :])
+                           simple.loc[[1, 2, 3], :])
         assert_frame_equal(simple.drop('C', axis=1, errors='ignore'), simple)
         assert_frame_equal(simple.drop(['A', 'C'], axis=1, errors='ignore'),
                            simple[['B']])
@@ -105,8 +103,8 @@ def test_drop(self):
         nu_df = nu_df.set_index(pd.Index(['X', 'Y', 'X']))
         nu_df.columns = list('abc')
-        assert_frame_equal(nu_df.drop('X', axis='rows'), nu_df.ix[["Y"], :])
-        assert_frame_equal(nu_df.drop(['X', 'Y'], axis=0), nu_df.ix[[], :])
+        assert_frame_equal(nu_df.drop('X', axis='rows'), nu_df.loc[["Y"], :])
+        assert_frame_equal(nu_df.drop(['X', 'Y'], axis=0), nu_df.loc[[], :])

         # inplace cache issue
         # GH 5628
@@ -122,7 +120,7 @@ def test_drop_multiindex_not_lexsorted(self):
         lexsorted_mi = MultiIndex.from_tuples(
             [('a', ''), ('b1', 'c1'), ('b2', 'c2')], names=['b', 'c'])
         lexsorted_df = DataFrame([[1, 3, 4]], columns=lexsorted_mi)
-        self.assertTrue(lexsorted_df.columns.is_lexsorted())
+        assert lexsorted_df.columns.is_lexsorted()

         # define the non-lexsorted version
         not_lexsorted_df = DataFrame(columns=['a', 'b', 'c', 'd'],
@@ -131,7 +129,7 @@ def test_drop_multiindex_not_lexsorted(self):
         not_lexsorted_df = not_lexsorted_df.pivot_table(
             index='a', columns=['b', 'c'], values='d')
         not_lexsorted_df = not_lexsorted_df.reset_index()
-        self.assertFalse(not_lexsorted_df.columns.is_lexsorted())
+        assert not not_lexsorted_df.columns.is_lexsorted()

         # compare the results
         tm.assert_frame_equal(lexsorted_df, not_lexsorted_df)
@@ -174,16 +172,16 @@ def test_reindex(self):
             for idx, val in compat.iteritems(newFrame[col]):
                 if idx in self.frame.index:
                     if np.isnan(val):
-                        self.assertTrue(np.isnan(self.frame[col][idx]))
+                        assert np.isnan(self.frame[col][idx])
                     else:
-                        self.assertEqual(val, self.frame[col][idx])
+                        assert val == self.frame[col][idx]
                 else:
-                    self.assertTrue(np.isnan(val))
+                    assert np.isnan(val)

         for col, series in compat.iteritems(newFrame):
-            self.assertTrue(tm.equalContents(series.index, newFrame.index))
+            assert tm.equalContents(series.index, newFrame.index)
         emptyFrame = self.frame.reindex(Index([]))
-        self.assertEqual(len(emptyFrame.index), 0)
+        assert len(emptyFrame.index) == 0

         # Cython code should be unit-tested directly
         nonContigFrame = self.frame.reindex(self.ts1.index[::2])
@@ -192,41 +190,40 @@ def test_reindex(self):
             for idx, val in compat.iteritems(nonContigFrame[col]):
                 if idx in self.frame.index:
                     if np.isnan(val):
-                        self.assertTrue(np.isnan(self.frame[col][idx]))
+                        assert np.isnan(self.frame[col][idx])
                     else:
-                        self.assertEqual(val, self.frame[col][idx])
+                        assert val == self.frame[col][idx]
                 else:
-                    self.assertTrue(np.isnan(val))
+                    assert np.isnan(val)

         for col, series in compat.iteritems(nonContigFrame):
-            self.assertTrue(tm.equalContents(series.index,
-                                             nonContigFrame.index))
+            assert tm.equalContents(series.index, nonContigFrame.index)

         # corner cases

         # Same index, copies values but not index if copy=False
         newFrame = self.frame.reindex(self.frame.index, copy=False)
-        self.assertIs(newFrame.index, self.frame.index)
+        assert newFrame.index is self.frame.index

         # length zero
         newFrame = self.frame.reindex([])
-        self.assertTrue(newFrame.empty)
-        self.assertEqual(len(newFrame.columns), len(self.frame.columns))
+        assert newFrame.empty
+        assert len(newFrame.columns) == len(self.frame.columns)

         # length zero with columns reindexed with non-empty index
         newFrame = self.frame.reindex([])
         newFrame = newFrame.reindex(self.frame.index)
-        self.assertEqual(len(newFrame.index), len(self.frame.index))
-        self.assertEqual(len(newFrame.columns), len(self.frame.columns))
+        assert len(newFrame.index) == len(self.frame.index)
+        assert len(newFrame.columns) == len(self.frame.columns)

         # pass non-Index
         newFrame = self.frame.reindex(list(self.ts1.index))
-        self.assert_index_equal(newFrame.index, self.ts1.index)
+        tm.assert_index_equal(newFrame.index, self.ts1.index)

         # copy with no axes
         result = self.frame.reindex()
         assert_frame_equal(result, self.frame)
-        self.assertFalse(result is self.frame)
+        assert result is not self.frame

     def test_reindex_nan(self):
         df = pd.DataFrame([[1, 2], [3, 5], [7, 11], [9, 23]],
@@ -258,27 +255,27 @@ def test_reindex_name_remains(self):
         i = Series(np.arange(10), name='iname')

         df = df.reindex(i)
-        self.assertEqual(df.index.name, 'iname')
+        assert df.index.name == 'iname'

         df = df.reindex(Index(np.arange(10), name='tmpname'))
-        self.assertEqual(df.index.name, 'tmpname')
+        assert df.index.name == 'tmpname'

         s = Series(random.rand(10))
         df = DataFrame(s.T, index=np.arange(len(s)))
         i = Series(np.arange(10), name='iname')
         df = df.reindex(columns=i)
-        self.assertEqual(df.columns.name, 'iname')
+        assert df.columns.name == 'iname'

     def test_reindex_int(self):
         smaller = self.intframe.reindex(self.intframe.index[::2])

-        self.assertEqual(smaller['A'].dtype, np.int64)
+        assert smaller['A'].dtype == np.int64

         bigger = smaller.reindex(self.intframe.index)
-        self.assertEqual(bigger['A'].dtype, np.float64)
+        assert bigger['A'].dtype == np.float64

         smaller = self.intframe.reindex(columns=['A', 'B'])
-        self.assertEqual(smaller['A'].dtype, np.int64)
+        assert smaller['A'].dtype == np.int64

     def test_reindex_like(self):
         other = self.frame.reindex(index=self.frame.index[:10],
@@ -287,15 +284,53 @@ def test_reindex_like(self):
         assert_frame_equal(other, self.frame.reindex_like(other))

     def test_reindex_columns(self):
-        newFrame = self.frame.reindex(columns=['A', 'B', 'E'])
+        new_frame = self.frame.reindex(columns=['A', 'B', 'E'])
+
+        tm.assert_series_equal(new_frame['B'], self.frame['B'])
+        assert np.isnan(new_frame['E']).all()
+        assert 'C' not in new_frame
+
+        # Length zero
+        new_frame = self.frame.reindex(columns=[])
+        assert new_frame.empty
+
+    def test_reindex_columns_method(self):
+
+        # GH 14992, reindexing over columns ignored method
+        df = DataFrame(data=[[11, 12, 13], [21, 22, 23], [31, 32, 33]],
+                       index=[1, 2, 4],
+                       columns=[1, 2, 4],
+                       dtype=float)
+
+        # default method
+        result = df.reindex(columns=range(6))
+        expected = DataFrame(data=[[np.nan, 11, 12, np.nan, 13, np.nan],
+                                   [np.nan, 21, 22, np.nan, 23, np.nan],
+                                   [np.nan, 31, 32, np.nan, 33, np.nan]],
+                             index=[1, 2, 4],
+                             columns=range(6),
+                             dtype=float)
+        assert_frame_equal(result, expected)

-        assert_series_equal(newFrame['B'], self.frame['B'])
-        self.assertTrue(np.isnan(newFrame['E']).all())
-        self.assertNotIn('C', newFrame)
+        # method='ffill'
+        result = df.reindex(columns=range(6), method='ffill')
+        expected = DataFrame(data=[[np.nan, 11, 12, 12, 13, 13],
+                                   [np.nan, 21, 22, 22, 23, 23],
+                                   [np.nan, 31, 32, 32, 33, 33]],
+                             index=[1, 2, 4],
+                             columns=range(6),
+                             dtype=float)
+        assert_frame_equal(result, expected)

-        # length zero
-        newFrame = self.frame.reindex(columns=[])
-        self.assertTrue(newFrame.empty)
+        # method='bfill'
+        result = df.reindex(columns=range(6), method='bfill')
+        expected = DataFrame(data=[[11, 11, 12, 13, 13, np.nan],
+                                   [21, 21, 22, 23, 23, np.nan],
+                                   [31, 31, 32, 33, 33, np.nan]],
+                             index=[1, 2, 4],
+                             columns=range(6),
+                             dtype=float)
+        assert_frame_equal(result, expected)

     def test_reindex_axes(self):
         # GH 3317, reindexing by both axes loses freq of the index
@@ -311,15 +346,15 @@ def test_reindex_axes(self):
         both_freq = df.reindex(index=time_freq, columns=some_cols).index.freq
         seq_freq = df.reindex(index=time_freq).reindex(
             columns=some_cols).index.freq
-        self.assertEqual(index_freq, both_freq)
-        self.assertEqual(index_freq, seq_freq)
+        assert index_freq == both_freq
+        assert index_freq == seq_freq

     def test_reindex_fill_value(self):
         df = DataFrame(np.random.randn(10, 4))

         # axis=0
         result = df.reindex(lrange(15))
-        self.assertTrue(np.isnan(result.values[-5:]).all())
+        assert np.isnan(result.values[-5:]).all()

         result = df.reindex(lrange(15), fill_value=0)
         expected = df.reindex(lrange(15)).fillna(0)
@@ -369,37 +404,39 @@ def test_reindex_dups(self):
         assert_frame_equal(result, expected)

         # reindex fails
-        self.assertRaises(ValueError, df.reindex, index=list(range(len(df))))
+        pytest.raises(ValueError, df.reindex, index=list(range(len(df))))

     def test_align(self):
         af, bf = self.frame.align(self.frame)
-        self.assertIsNot(af._data, self.frame._data)
+        assert af._data is not self.frame._data

         af, bf = self.frame.align(self.frame, copy=False)
-        self.assertIs(af._data, self.frame._data)
+        assert af._data is self.frame._data

         # axis = 0
-        other = self.frame.ix[:-5, :3]
+        other = self.frame.iloc[:-5, :3]
         af, bf = self.frame.align(other, axis=0, fill_value=-1)
-        self.assert_index_equal(bf.columns, other.columns)
+
+        tm.assert_index_equal(bf.columns, other.columns)
+
         # test fill value
         join_idx = self.frame.index.join(other.index)
         diff_a = self.frame.index.difference(join_idx)
         diff_b = other.index.difference(join_idx)
         diff_a_vals = af.reindex(diff_a).values
         diff_b_vals = bf.reindex(diff_b).values
-        self.assertTrue((diff_a_vals == -1).all())
+        assert (diff_a_vals == -1).all()

         af, bf = self.frame.align(other, join='right', axis=0)
-        self.assert_index_equal(bf.columns, other.columns)
-        self.assert_index_equal(bf.index, other.index)
-        self.assert_index_equal(af.index, other.index)
+        tm.assert_index_equal(bf.columns, other.columns)
+        tm.assert_index_equal(bf.index, other.index)
+        tm.assert_index_equal(af.index, other.index)

         # axis = 1
-        other = self.frame.ix[:-5, :3].copy()
+        other = self.frame.iloc[:-5, :3].copy()
         af, bf = self.frame.align(other, axis=1)
-        self.assert_index_equal(bf.columns, self.frame.columns)
-        self.assert_index_equal(bf.index, other.index)
+        tm.assert_index_equal(bf.columns, self.frame.columns)
+        tm.assert_index_equal(bf.index, other.index)

         # test fill value
         join_idx = self.frame.index.join(other.index)
@@ -410,42 +447,42 @@ def test_align(self):
         # TODO(wesm): unused?
         diff_b_vals = bf.reindex(diff_b).values  # noqa

-        self.assertTrue((diff_a_vals == -1).all())
+        assert (diff_a_vals == -1).all()

         af, bf = self.frame.align(other, join='inner', axis=1)
-        self.assert_index_equal(bf.columns, other.columns)
+        tm.assert_index_equal(bf.columns, other.columns)

         af, bf = self.frame.align(other, join='inner', axis=1, method='pad')
-        self.assert_index_equal(bf.columns, other.columns)
+        tm.assert_index_equal(bf.columns, other.columns)

         # test other non-float types
         af, bf = self.intframe.align(other, join='inner', axis=1,
                                      method='pad')
-        self.assert_index_equal(bf.columns, other.columns)
+        tm.assert_index_equal(bf.columns, other.columns)

         af, bf = self.mixed_frame.align(self.mixed_frame,
                                         join='inner', axis=1, method='pad')
-        self.assert_index_equal(bf.columns, self.mixed_frame.columns)
+        tm.assert_index_equal(bf.columns, self.mixed_frame.columns)

-        af, bf = self.frame.align(other.ix[:, 0], join='inner', axis=1,
+        af, bf = self.frame.align(other.iloc[:, 0], join='inner', axis=1,
                                   method=None, fill_value=None)
-        self.assert_index_equal(bf.index, Index([]))
+        tm.assert_index_equal(bf.index, Index([]))

-        af, bf = self.frame.align(other.ix[:, 0], join='inner', axis=1,
+        af, bf = self.frame.align(other.iloc[:, 0], join='inner', axis=1,
                                   method=None, fill_value=0)
-        self.assert_index_equal(bf.index, Index([]))
+        tm.assert_index_equal(bf.index, Index([]))

         # mixed floats/ints
-        af, bf = self.mixed_float.align(other.ix[:, 0], join='inner', axis=1,
+        af, bf = self.mixed_float.align(other.iloc[:, 0], join='inner', axis=1,
                                         method=None, fill_value=0)
-        self.assert_index_equal(bf.index, Index([]))
+        tm.assert_index_equal(bf.index, Index([]))

-        af, bf = self.mixed_int.align(other.ix[:, 0], join='inner', axis=1,
+        af, bf = self.mixed_int.align(other.iloc[:, 0], join='inner', axis=1,
                                       method=None, fill_value=0)
-        self.assert_index_equal(bf.index, Index([]))
+        tm.assert_index_equal(bf.index, Index([]))

-        # try to align dataframe to series along bad axis
-        self.assertRaises(ValueError, self.frame.align, af.ix[0, :3],
-                          join='inner', axis=2)
+        # Try to align DataFrame to Series along bad axis
+        with pytest.raises(ValueError):
+            self.frame.align(af.iloc[0, :3], join='inner', axis=2)

         # align dataframe to series with broadcast or not
         idx = self.frame.index
@@ -454,7 +491,7 @@ def test_align(self):
         left, right = self.frame.align(s, axis=0)
         tm.assert_index_equal(left.index, self.frame.index)
         tm.assert_index_equal(right.index, self.frame.index)
-        self.assertTrue(isinstance(right, Series))
+        assert isinstance(right, Series)

         left, right = self.frame.align(s, broadcast_axis=1)
         tm.assert_index_equal(left.index, self.frame.index)
@@ -463,17 +500,17 @@ def test_align(self):
             expected[c] = s
         expected = DataFrame(expected, index=self.frame.index,
                              columns=self.frame.columns)
-        assert_frame_equal(right, expected)
+        tm.assert_frame_equal(right, expected)

-        # GH 9558
+        # see gh-9558
         df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
         result = df[df['a'] == 2]
         expected = DataFrame([[2, 5]], index=[1], columns=['a', 'b'])
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         result = df.where(df['a'] == 2, 0)
         expected = DataFrame({'a': [0, 2, 0], 'b': [0, 5, 0]})
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

     def _check_align(self, a, b, axis, fill_axis, how, method, limit=None):
         aa, ab = a.align(b, axis=axis, join=how, method=method, limit=limit,
@@ -523,9 +560,9 @@ def test_align_fill_method_right(self):
                     self._check_align_fill('right', meth, ax, fax)

     def _check_align_fill(self, kind, meth, ax, fax):
-        left = self.frame.ix[0:4, :10]
-        right = self.frame.ix[2:, 6:]
-        empty = self.frame.ix[:0, :0]
+        left = self.frame.iloc[0:4, :10]
+        right = self.frame.iloc[2:, 6:]
+        empty = self.frame.iloc[:0, :0]

         self._check_align(left, right, axis=ax, fill_axis=fax,
                           how=kind, method=meth)
@@ -619,33 +656,33 @@ def test_align_series_combinations(self):
         tm.assert_frame_equal(res2, exp1)

     def test_filter(self):
-        # items
+        # Items
         filtered = self.frame.filter(['A', 'B', 'E'])
-        self.assertEqual(len(filtered.columns), 2)
-        self.assertNotIn('E', filtered)
+        assert len(filtered.columns) == 2
+        assert 'E' not in filtered

         filtered = self.frame.filter(['A', 'B', 'E'], axis='columns')
-        self.assertEqual(len(filtered.columns), 2)
-        self.assertNotIn('E', filtered)
+        assert len(filtered.columns) == 2
+        assert 'E' not in filtered

-        # other axis
+        # Other axis
         idx = self.frame.index[0:4]
         filtered = self.frame.filter(idx, axis='index')
         expected = self.frame.reindex(index=idx)
-        assert_frame_equal(filtered, expected)
+        tm.assert_frame_equal(filtered, expected)

         # like
         fcopy = self.frame.copy()
         fcopy['AA'] = 1

         filtered = fcopy.filter(like='A')
-        self.assertEqual(len(filtered.columns), 2)
-        self.assertIn('AA', filtered)
+        assert len(filtered.columns) == 2
+        assert 'AA' in filtered

         # like with ints in column names
         df = DataFrame(0., index=[0, 1, 2], columns=[0, 1, '_A', '_B'])
         filtered = df.filter(like='_')
-        self.assertEqual(len(filtered.columns), 2)
+        assert len(filtered.columns) == 2

         # regex with ints in column names
         # from PR #10384
@@ -653,41 +690,41 @@ def test_filter(self):
         expected = DataFrame(
             0., index=[0, 1, 2], columns=pd.Index([1, 2], dtype=object))
         filtered = df.filter(regex='^[0-9]+$')
-        assert_frame_equal(filtered, expected)
+        tm.assert_frame_equal(filtered, expected)

         expected = DataFrame(0., index=[0, 1, 2], columns=[0, '0', 1, '1'])
         # shouldn't remove anything
         filtered = expected.filter(regex='^[0-9]+$')
-        assert_frame_equal(filtered, expected)
+        tm.assert_frame_equal(filtered, expected)

         # pass in None
-        with assertRaisesRegexp(TypeError, 'Must pass'):
+        with tm.assert_raises_regex(TypeError, 'Must pass'):
             self.frame.filter()
-        with assertRaisesRegexp(TypeError, 'Must pass'):
+        with tm.assert_raises_regex(TypeError, 'Must pass'):
             self.frame.filter(items=None)
-        with assertRaisesRegexp(TypeError, 'Must pass'):
+        with tm.assert_raises_regex(TypeError, 'Must pass'):
             self.frame.filter(axis=1)

         # test mutually exclusive arguments
-        with assertRaisesRegexp(TypeError, 'mutually exclusive'):
+        with tm.assert_raises_regex(TypeError, 'mutually exclusive'):
            self.frame.filter(items=['one', 'three'], regex='e$', like='bbi')
-        with assertRaisesRegexp(TypeError, 'mutually exclusive'):
+        with tm.assert_raises_regex(TypeError, 'mutually exclusive'):
             self.frame.filter(items=['one', 'three'], regex='e$', axis=1)
-        with assertRaisesRegexp(TypeError, 'mutually exclusive'):
+        with tm.assert_raises_regex(TypeError, 'mutually exclusive'):
             self.frame.filter(items=['one', 'three'], regex='e$')
-        with assertRaisesRegexp(TypeError, 'mutually exclusive'):
+        with tm.assert_raises_regex(TypeError, 'mutually exclusive'):
             self.frame.filter(items=['one', 'three'], like='bbi', axis=0)
-        with assertRaisesRegexp(TypeError, 'mutually exclusive'):
+        with tm.assert_raises_regex(TypeError, 'mutually exclusive'):
             self.frame.filter(items=['one', 'three'], like='bbi')

         # objects
         filtered = self.mixed_frame.filter(like='foo')
-        self.assertIn('foo', filtered)
+        assert 'foo' in filtered

         # unicode columns, won't ascii-encode
         df = self.frame.rename(columns={'B': u('\u2202')})
         filtered = df.filter(like='C')
-        self.assertTrue('C' in filtered)
+        assert 'C' in filtered

     def test_filter_regex_search(self):
         fcopy = self.frame.copy()
@@ -695,8 +732,8 @@ def test_filter_regex_search(self):

         # regex
         filtered = fcopy.filter(regex='[A]+')
-        self.assertEqual(len(filtered.columns), 2)
-        self.assertIn('AA', filtered)
+        assert len(filtered.columns) == 2
+        assert 'AA' in filtered

         # doesn't have to be at beginning
         df = DataFrame({'aBBa': [1, 2],
@@ -741,7 +778,7 @@ def test_take(self):

         # axis = 1
         result = df.take(order, axis=1)
-        expected = df.ix[:, ['D', 'B', 'C', 'A']]
+        expected = df.loc[:, ['D', 'B', 'C', 'A']]
         assert_frame_equal(result, expected, check_names=False)

         # neg indicies
@@ -754,14 +791,14 @@ def test_take(self):

         # axis = 1
         result = df.take(order, axis=1)
-        expected = df.ix[:, ['C', 'B', 'D']]
+        expected = df.loc[:, ['C', 'B', 'D']]
         assert_frame_equal(result, expected, check_names=False)

         # illegal indices
-        self.assertRaises(IndexError, df.take, [3, 1, 2, 30], axis=0)
-        self.assertRaises(IndexError, df.take, [3, 1, 2, -31], axis=0)
-        self.assertRaises(IndexError, df.take, [3, 1, 2, 5], axis=1)
-        self.assertRaises(IndexError, df.take, [3, 1, 2, -5], axis=1)
+        pytest.raises(IndexError, df.take, [3, 1, 2, 30], axis=0)
+        pytest.raises(IndexError, df.take, [3, 1, 2, -31], axis=0)
+        pytest.raises(IndexError, df.take, [3, 1, 2, 5], axis=1)
+        pytest.raises(IndexError, df.take, [3, 1, 2, -5], axis=1)

         # mixed-dtype
         order = [4, 1, 2, 0, 3]
@@ -773,7 +810,7 @@ def test_take(self):

         # axis = 1
         result = df.take(order, axis=1)
-        expected = df.ix[:, ['foo', 'B', 'C', 'A', 'D']]
+        expected = df.loc[:, ['foo', 'B', 'C', 'A', 'D']]
         assert_frame_equal(result, expected)

         # neg indicies
@@ -786,7 +823,7 @@ def test_take(self):

         # axis = 1
         result = df.take(order, axis=1)
-        expected = df.ix[:, ['foo', 'B', 'D']]
+        expected = df.loc[:, ['foo', 'B', 'D']]
         assert_frame_equal(result, expected)

         # by dtype
@@ -799,7 +836,7 @@ def test_take(self):

         # axis = 1
         result = df.take(order, axis=1)
-        expected = df.ix[:, ['B', 'C', 'A', 'D']]
+        expected = df.loc[:, ['B', 'C', 'A', 'D']]
         assert_frame_equal(result, expected)

     def test_reindex_boolean(self):
@@ -808,29 +845,29 @@ def test_reindex_boolean(self):
                           columns=[0, 2])

         reindexed = frame.reindex(np.arange(10))
-        self.assertEqual(reindexed.values.dtype, np.object_)
-        self.assertTrue(isnull(reindexed[0][1]))
+        assert reindexed.values.dtype == np.object_
+        assert isnull(reindexed[0][1])

         reindexed = frame.reindex(columns=lrange(3))
-        self.assertEqual(reindexed.values.dtype, np.object_)
-        self.assertTrue(isnull(reindexed[1]).all())
+        assert reindexed.values.dtype == np.object_
+        assert isnull(reindexed[1]).all()

     def test_reindex_objects(self):
         reindexed = self.mixed_frame.reindex(columns=['foo', 'A', 'B'])
-        self.assertIn('foo', reindexed)
+        assert 'foo' in reindexed

         reindexed = self.mixed_frame.reindex(columns=['A', 'B'])
-        self.assertNotIn('foo', reindexed)
+        assert 'foo' not in reindexed

     def test_reindex_corner(self):
         index = Index(['a', 'b', 'c'])
         dm = self.empty.reindex(index=[1, 2, 3])
         reindexed = dm.reindex(columns=index)
-        self.assert_index_equal(reindexed.columns, index)
+        tm.assert_index_equal(reindexed.columns, index)

         # ints are weird
         smaller = self.intframe.reindex(columns=['A', 'B', 'E'])
-        self.assertEqual(smaller['E'].dtype, np.float64)
+        assert smaller['E'].dtype == np.float64

     def test_reindex_axis(self):
         cols = ['A', 'B', 'E']
@@ -843,7 +880,7 @@ def test_reindex_axis(self):
         reindexed2 = self.intframe.reindex(index=rows)
         assert_frame_equal(reindexed1, reindexed2)

-        self.assertRaises(ValueError, self.intframe.reindex_axis, rows, axis=2)
+        pytest.raises(ValueError, self.intframe.reindex_axis, rows, axis=2)

         # no-op case
         cols = self.frame.columns.copy()
diff --git a/pandas/tests/frame/test_block_internals.py b/pandas/tests/frame/test_block_internals.py
index e51cc0f5a6ec7..c1a5b437be5d0 100644
--- a/pandas/tests/frame/test_block_internals.py
+++ b/pandas/tests/frame/test_block_internals.py
@@ -2,6 +2,8 @@

 from __future__ import print_function

+import pytest
+
 from datetime import datetime, timedelta
 import itertools
@@ -15,8 +17,7 @@
 from pandas.util.testing import (assert_almost_equal,
                                  assert_series_equal,
-                                 assert_frame_equal,
-                                 assertRaisesRegexp)
+                                 assert_frame_equal)

 import pandas.util.testing as tm
@@ -27,9 +28,7 @@
 # structure


-class TestDataFrameBlockInternals(tm.TestCase, TestData):
-
-    _multiprocess_can_split_ = True
+class TestDataFrameBlockInternals(TestData):

     def test_cast_internals(self):
         casted = DataFrame(self.frame._data, dtype=int)
@@ -42,18 +41,24 @@ def test_cast_internals(self):

     def test_consolidate(self):
         self.frame['E'] = 7.
-        consolidated = self.frame.consolidate()
-        self.assertEqual(len(consolidated._data.blocks), 1)
+        consolidated = self.frame._consolidate()
+        assert len(consolidated._data.blocks) == 1

         # Ensure copy, do I want this?
-        recons = consolidated.consolidate()
-        self.assertIsNot(recons, consolidated)
-        assert_frame_equal(recons, consolidated)
+        recons = consolidated._consolidate()
+        assert recons is not consolidated
+        tm.assert_frame_equal(recons, consolidated)

         self.frame['F'] = 8.
-        self.assertEqual(len(self.frame._data.blocks), 3)
-        self.frame.consolidate(inplace=True)
-        self.assertEqual(len(self.frame._data.blocks), 1)
+        assert len(self.frame._data.blocks) == 3
+
+        self.frame._consolidate(inplace=True)
+        assert len(self.frame._data.blocks) == 1
+
+    def test_consolidate_deprecation(self):
+        self.frame['E'] = 7
+        with tm.assert_produces_warning(FutureWarning):
+            self.frame.consolidate()

     def test_consolidate_inplace(self):
         frame = self.frame.copy()  # noqa
@@ -64,18 +69,18 @@ def test_consolidate_inplace(self):

     def test_as_matrix_consolidate(self):
         self.frame['E'] = 7.
-        self.assertFalse(self.frame._data.is_consolidated())
+        assert not self.frame._data.is_consolidated()
         _ = self.frame.as_matrix()  # noqa
-        self.assertTrue(self.frame._data.is_consolidated())
+        assert self.frame._data.is_consolidated()

     def test_modify_values(self):
         self.frame.values[5] = 5
-        self.assertTrue((self.frame.values[5] == 5).all())
+        assert (self.frame.values[5] == 5).all()

         # unconsolidated
         self.frame['E'] = 7.
         self.frame.values[6] = 6
-        self.assertTrue((self.frame.values[6] == 6).all())
+        assert (self.frame.values[6] == 6).all()

     def test_boolean_set_uncons(self):
         self.frame['E'] = 7.
@@ -90,47 +95,47 @@ def test_as_matrix_numeric_cols(self):
         self.frame['foo'] = 'bar'

         values = self.frame.as_matrix(['A', 'B', 'C', 'D'])
-        self.assertEqual(values.dtype, np.float64)
+        assert values.dtype == np.float64

     def test_as_matrix_lcd(self):

         # mixed lcd
         values = self.mixed_float.as_matrix(['A', 'B', 'C', 'D'])
-        self.assertEqual(values.dtype, np.float64)
+        assert values.dtype == np.float64

         values = self.mixed_float.as_matrix(['A', 'B', 'C'])
-        self.assertEqual(values.dtype, np.float32)
+        assert values.dtype == np.float32

         values = self.mixed_float.as_matrix(['C'])
-        self.assertEqual(values.dtype, np.float16)
+        assert values.dtype == np.float16

         # GH 10364
         # B uint64 forces float because there are other signed int types
         values = self.mixed_int.as_matrix(['A', 'B', 'C', 'D'])
-        self.assertEqual(values.dtype, np.float64)
+        assert values.dtype == np.float64

         values = self.mixed_int.as_matrix(['A', 'D'])
-        self.assertEqual(values.dtype, np.int64)
+        assert values.dtype == np.int64

         # B uint64 forces float because there are other signed int types
         values = self.mixed_int.as_matrix(['A', 'B', 'C'])
-        self.assertEqual(values.dtype, np.float64)
+        assert values.dtype == np.float64

         # as B and C are both unsigned, no forcing to float is needed
         values = self.mixed_int.as_matrix(['B', 'C'])
-        self.assertEqual(values.dtype, np.uint64)
+        assert values.dtype == np.uint64

         values = self.mixed_int.as_matrix(['A', 'C'])
-        self.assertEqual(values.dtype, np.int32)
+        assert values.dtype == np.int32

         values = self.mixed_int.as_matrix(['C', 'D'])
-        self.assertEqual(values.dtype, np.int64)
+        assert values.dtype == np.int64

         values = self.mixed_int.as_matrix(['A'])
-        self.assertEqual(values.dtype, np.int32)
+        assert values.dtype == np.int32

         values = self.mixed_int.as_matrix(['C'])
-        self.assertEqual(values.dtype, np.uint8)
+        assert values.dtype == np.uint8

     def test_constructor_with_convert(self):
         # this is actually mostly a test of lib.maybe_convert_objects
@@ -142,7 +147,7 @@ def test_constructor_with_convert(self):

         df = DataFrame({'A': [2 ** 63]})
         result = df['A']
-        expected = Series(np.asarray([2 ** 63], np.object_), name='A')
+        expected = Series(np.asarray([2 ** 63], np.uint64), name='A')
         assert_series_equal(result, expected)

         df = DataFrame({'A': [datetime(2005, 1, 1), True]})
@@ -215,8 +220,8 @@ def test_construction_with_mixed(self):
         # mixed-type frames
         self.mixed_frame['datetime'] = datetime.now()
         self.mixed_frame['timedelta'] = timedelta(days=1, seconds=1)
-        self.assertEqual(self.mixed_frame['datetime'].dtype, 'M8[ns]')
-        self.assertEqual(self.mixed_frame['timedelta'].dtype, 'm8[ns]')
+        assert self.mixed_frame['datetime'].dtype == 'M8[ns]'
+        assert self.mixed_frame['timedelta'].dtype == 'm8[ns]'
         result = self.mixed_frame.get_dtype_counts().sort_values()
         expected = Series({'float64': 4,
                            'object': 1,
@@ -281,10 +286,10 @@ def f(dtype):
                              columns=["A", "B", "C"],
                              dtype=dtype)

-        self.assertRaises(NotImplementedError, f,
-                          [("A", "datetime64[h]"),
-                           ("B", "str"),
-                           ("C", "int32")])
+        pytest.raises(NotImplementedError, f,
+                      [("A", "datetime64[h]"),
+                       ("B", "str"),
+                       ("C", "int32")])

         # these work (though results may be unexpected)
         f('int64')
@@ -302,12 +307,12 @@ def test_equals_different_blocks(self):
         df1 = df0.reset_index()[["A", "B", "C"]]
         # this assert verifies that the above operations have
         # induced a block rearrangement
-        self.assertTrue(df0._data.blocks[0].dtype !=
-                        df1._data.blocks[0].dtype)
+        assert (df0._data.blocks[0].dtype != df1._data.blocks[0].dtype)
+
         # do the real tests
         assert_frame_equal(df0, df1)
-        self.assertTrue(df0.equals(df1))
-        self.assertTrue(df1.equals(df0))
+        assert df0.equals(df1)
+        assert df1.equals(df0)

     def test_copy_blocks(self):
         # API/ENH 9607
@@ -318,10 +323,10 @@ def test_copy_blocks(self):
         blocks = df.as_blocks()
         for dtype, _df in blocks.items():
             if column in _df:
-                _df.ix[:, column] = _df[column] + 1
+                _df.loc[:, column] = _df[column] + 1

         # make sure we did not change the original DataFrame
-        self.assertFalse(_df[column].equals(df[column]))
+        assert not _df[column].equals(df[column])

     def test_no_copy_blocks(self):
         # API/ENH 9607
@@ -332,33 +337,33 @@ def test_no_copy_blocks(self):
         blocks = df.as_blocks(copy=False)
         for dtype, _df in blocks.items():
             if column in _df:
-                _df.ix[:, column] = _df[column] + 1
+                _df.loc[:, column] = _df[column] + 1

         # make sure we did change the original DataFrame
-        self.assertTrue(_df[column].equals(df[column]))
+        assert _df[column].equals(df[column])

     def test_copy(self):
         cop = self.frame.copy()
         cop['E'] = cop['A']
-        self.assertNotIn('E', self.frame)
+        assert 'E' not in self.frame

         # copy objects
         copy = self.mixed_frame.copy()
-        self.assertIsNot(copy._data, self.mixed_frame._data)
+        assert copy._data is not self.mixed_frame._data

     def test_pickle(self):
-        unpickled = self.round_trip_pickle(self.mixed_frame)
+        unpickled = tm.round_trip_pickle(self.mixed_frame)
         assert_frame_equal(self.mixed_frame, unpickled)

         # buglet
         self.mixed_frame._data.ndim

         # empty
-        unpickled = self.round_trip_pickle(self.empty)
+        unpickled = tm.round_trip_pickle(self.empty)
         repr(unpickled)

         # tz frame
-        unpickled = self.round_trip_pickle(self.tzframe)
+        unpickled = tm.round_trip_pickle(self.tzframe)
         assert_frame_equal(self.tzframe, unpickled)

     def test_consolidate_datetime64(self):
@@ -394,8 +399,8 @@ def test_consolidate_datetime64(self):
         tm.assert_index_equal(pd.DatetimeIndex(df.ending), ser_ending.index)

     def test_is_mixed_type(self):
-        self.assertFalse(self.frame._is_mixed_type)
-        self.assertTrue(self.mixed_frame._is_mixed_type)
+        assert not self.frame._is_mixed_type
+        assert self.mixed_frame._is_mixed_type

     def test_get_numeric_data(self):
         # TODO(wesm): unused?
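The hunks on either side of this point all make the same mechanical substitution: the removed, deprecated `.ix` indexer becomes `.loc` where the lookup is by label and `.iloc` where it is by position. A minimal sketch of why the explicit spelling matters, using a throwaway frame that is not part of the patch:

```python
import pandas as pd

# Integer labels deliberately out of positional order.
df = pd.DataFrame({'x': [10, 20, 30]}, index=[2, 0, 1])

# .loc is strictly label-based: this selects the row labelled 2.
assert df.loc[2, 'x'] == 10

# .iloc is strictly position-based: this selects the third row.
assert df.iloc[2, 0] == 30

# .ix guessed between the two conventions depending on the index's
# dtype, which is why the tests in this patch now spell out .loc or
# .iloc explicitly.
```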
@@ -423,12 +428,12 @@ def test_get_numeric_data(self):
                         index=np.arange(10))

         result = df._get_numeric_data()
-        expected = df.ix[:, ['a', 'b', 'd', 'e', 'f']]
+        expected = df.loc[:, ['a', 'b', 'd', 'e', 'f']]
         assert_frame_equal(result, expected)

-        only_obj = df.ix[:, ['c', 'g']]
+        only_obj = df.loc[:, ['c', 'g']]
         result = only_obj._get_numeric_data()
-        expected = df.ix[:, []]
+        expected = df.loc[:, []]
         assert_frame_equal(result, expected)

         df = DataFrame.from_dict(
@@ -447,7 +452,7 @@ def test_convert_objects(self):
         oops = self.mixed_frame.T.T
         converted = oops._convert(datetime=True)
         assert_frame_equal(converted, self.mixed_frame)
-        self.assertEqual(converted['A'].dtype, np.float64)
+        assert converted['A'].dtype == np.float64

         # force numeric conversion
         self.mixed_frame['H'] = '1.'
@@ -457,25 +462,25 @@ def test_convert_objects(self):
         l = len(self.mixed_frame)
         self.mixed_frame['J'] = '1.'
         self.mixed_frame['K'] = '1'
-        self.mixed_frame.ix[0:5, ['J', 'K']] = 'garbled'
+        self.mixed_frame.loc[0:5, ['J', 'K']] = 'garbled'
         converted = self.mixed_frame._convert(datetime=True, numeric=True)
-        self.assertEqual(converted['H'].dtype, 'float64')
-        self.assertEqual(converted['I'].dtype, 'int64')
-        self.assertEqual(converted['J'].dtype, 'float64')
-        self.assertEqual(converted['K'].dtype, 'float64')
-        self.assertEqual(len(converted['J'].dropna()), l - 5)
-        self.assertEqual(len(converted['K'].dropna()), l - 5)
+        assert converted['H'].dtype == 'float64'
+        assert converted['I'].dtype == 'int64'
+        assert converted['J'].dtype == 'float64'
+        assert converted['K'].dtype == 'float64'
+        assert len(converted['J'].dropna()) == l - 5
+        assert len(converted['K'].dropna()) == l - 5

         # via astype
         converted = self.mixed_frame.copy()
         converted['H'] = converted['H'].astype('float64')
         converted['I'] = converted['I'].astype('int64')
-        self.assertEqual(converted['H'].dtype, 'float64')
-        self.assertEqual(converted['I'].dtype, 'int64')
+        assert converted['H'].dtype == 'float64'
+        assert converted['I'].dtype == 'int64'

         # via astype, but errors
         converted = self.mixed_frame.copy()
-        with assertRaisesRegexp(ValueError, 'invalid literal'):
+        with tm.assert_raises_regex(ValueError, 'invalid literal'):
             converted['H'].astype('int32')

         # mixed in a single column
@@ -502,7 +507,7 @@ def test_stale_cached_series_bug_473(self):
         repr(Y)
         result = Y.sum()  # noqa
         exp = Y['g'].sum()  # noqa
-        self.assertTrue(pd.isnull(Y['g']['c']))
+        assert pd.isnull(Y['g']['c'])

     def test_get_X_columns(self):
         # numeric and object columns
         df = DataFrame({'a': [1, 2, 3],
@@ -513,8 +518,8 @@ def test_get_X_columns(self):
                         'd': [None, None, None],
                         'e': [3.14, 0.577, 2.773]})

-        self.assert_index_equal(df._get_numeric_data().columns,
-                                pd.Index(['a', 'b', 'e']))
+        tm.assert_index_equal(df._get_numeric_data().columns,
+                              pd.Index(['a', 'b', 'e']))

     def test_strange_column_corruption_issue(self):
         # (wesm) Unclear how exactly this is related to internal matters
@@ -535,6 +540,6 @@ def test_strange_column_corruption_issue(self):

         myid = 100

-        first = len(df.ix[pd.isnull(df[myid]), [myid]])
-        second = len(df.ix[pd.isnull(df[myid]), [myid]])
-        self.assertTrue(first == second == 0)
+        first = len(df.loc[pd.isnull(df[myid]), [myid]])
+        second = len(df.loc[pd.isnull(df[myid]), [myid]])
+        assert first == second == 0
diff --git a/pandas/tests/frame/test_combine_concat.py b/pandas/tests/frame/test_combine_concat.py
index c6b69dad3e6b5..688cacdee263e 100644
--- a/pandas/tests/frame/test_combine_concat.py
+++ b/pandas/tests/frame/test_combine_concat.py
@@ -9,20 +9,16 @@

 import pandas as pd

-from pandas import DataFrame, Index, Series, Timestamp
+from pandas import DataFrame, Index, Series, Timestamp, date_range
 from pandas.compat import lrange
 from pandas.tests.frame.common import TestData

 import pandas.util.testing as tm
-from pandas.util.testing import (assertRaisesRegexp,
-                                 assert_frame_equal,
-                                 assert_series_equal)
+from pandas.util.testing import assert_frame_equal, assert_series_equal


-class TestDataFrameConcatCommon(tm.TestCase, TestData):
-
-    _multiprocess_can_split_ = True
+class TestDataFrameConcatCommon(TestData):

     def test_concat_multiple_frames_dtypes(self):

@@ -79,12 +75,14 @@ def test_append_series_dict(self):
         df = DataFrame(np.random.randn(5, 4),
                        columns=['foo', 'bar', 'baz', 'qux'])

-        series = df.ix[4]
-        with assertRaisesRegexp(ValueError, 'Indexes have overlapping values'):
+        series = df.loc[4]
+        with tm.assert_raises_regex(ValueError,
+                                    'Indexes have overlapping values'):
             df.append(series, verify_integrity=True)

         series.name = None
-        with assertRaisesRegexp(TypeError, 'Can only append a Series if '
-                                'ignore_index=True'):
+        with tm.assert_raises_regex(TypeError,
+                                    'Can only append a Series if '
+                                    'ignore_index=True'):
             df.append(series, verify_integrity=True)

         result = df.append(series[::-1], ignore_index=True)
@@ -99,10 +97,10 @@ def test_append_series_dict(self):
         result = df.append(series[::-1][:3], ignore_index=True)
         expected = df.append(DataFrame({0: series[::-1][:3]}).T,
                              ignore_index=True)
-        assert_frame_equal(result, expected.ix[:, result.columns])
+        assert_frame_equal(result, expected.loc[:, result.columns])

         # can append when name set
-        row = df.ix[4]
+        row = df.loc[4]
         row.name = 5
         result = df.append(row)
         expected = df.append(df[-1:], ignore_index=True)
@@ -272,7 +270,7 @@ def test_update_raise(self):
         other = DataFrame([[2., nan],
                            [nan, 7]], index=[1, 3], columns=[1, 2])

-        with assertRaisesRegexp(ValueError, "Data overlaps"):
+        with tm.assert_raises_regex(ValueError, "Data overlaps"):
             df.update(other, raise_conflict=True)

     def test_update_from_non_df(self):
@@ -305,7 +303,7 @@ def test_join_str_datetime(self):

         tst = A.join(C, on='aa')

-        self.assertEqual(len(tst.columns), 3)
+        assert len(tst.columns) == 3

     def test_join_multiindex_leftright(self):
         # GH 10741
@@ -421,13 +419,29 @@ def test_concat_axis_parameter(self):
         assert_frame_equal(concatted_1_series, expected_columns_series)

         # Testing ValueError
-        with assertRaisesRegexp(ValueError, 'No axis named'):
+        with tm.assert_raises_regex(ValueError, 'No axis named'):
             pd.concat([series1, series2], axis='something')

-
-class TestDataFrameCombineFirst(tm.TestCase, TestData):
-
-    _multiprocess_can_split_ = True
+    def test_concat_numerical_names(self):
+        # #15262  # #12223
+        df = pd.DataFrame({'col': range(9)},
+                          dtype='int32',
+                          index=(pd.MultiIndex
+                                 .from_product([['A0', 'A1', 'A2'],
+                                                ['B0', 'B1', 'B2']],
+                                               names=[1, 2])))
+        result = pd.concat((df.iloc[:2, :], df.iloc[-2:, :]))
+        expected = pd.DataFrame({'col': [0, 1, 7, 8]},
+                                dtype='int32',
+                                index=pd.MultiIndex.from_tuples([('A0', 'B0'),
+                                                                 ('A0', 'B1'),
+                                                                 ('A2', 'B1'),
+                                                                 ('A2', 'B2')],
+                                                                names=[1, 2]))
+        tm.assert_frame_equal(result, expected)
+
+
+class TestDataFrameCombineFirst(TestData):

     def test_combine_first_mixed(self):
         a = Series(['a', 'b'], index=lrange(2))
@@ -450,7 +464,7 @@ def test_combine_first(self):
         combined = head.combine_first(tail)
         reordered_frame = self.frame.reindex(combined.index)
         assert_frame_equal(combined, reordered_frame)
-        self.assertTrue(tm.equalContents(combined.columns, self.frame.columns))
+        assert tm.equalContents(combined.columns, self.frame.columns)
         assert_series_equal(combined['A'], reordered_frame['A'])

         # same index
@@ -464,7 +478,7 @@ def test_combine_first(self):

         combined = fcopy.combine_first(fcopy2)

-        self.assertTrue((combined['A'] == 1).all())
+        assert (combined['A'] == 1).all()
         assert_series_equal(combined['B'], fcopy['B'])
         assert_series_equal(combined['C'], fcopy2['C'])
         assert_series_equal(combined['D'], fcopy['D'])
@@ -474,12 +488,12 @@ def test_combine_first(self):
         head['A'] = 1

         combined = head.combine_first(tail)
-        self.assertTrue((combined['A'][:10] == 1).all())
+        assert (combined['A'][:10] == 1).all()

         # reverse overlap
         tail['A'][:10] = 0
         combined = tail.combine_first(head)
-        self.assertTrue((combined['A'][:10] == 0).all())
+        assert (combined['A'][:10] == 0).all()

         # no overlap
         f = self.frame[:10]
@@ -496,13 +510,13 @@ def test_combine_first(self):
         assert_frame_equal(comb, self.frame)

         comb = self.frame.combine_first(DataFrame(index=["faz", "boo"]))
-        self.assertTrue("faz" in comb.index)
+        assert "faz" in comb.index

         # #2525
         df = DataFrame({'a': [1]}, index=[datetime(2012, 1, 1)])
         df2 = DataFrame({}, columns=['b'])
         result = df.combine_first(df2)
-        self.assertTrue('b' in result)
+        assert 'b' in result

     def test_combine_first_mixed_bug(self):
         idx = Index(['a', 'b', 'c', 'e'])
@@ -524,7 +538,7 @@ def test_combine_first_mixed_bug(self):
                              "col5": ser3})

         combined = frame1.combine_first(frame2)
-        self.assertEqual(len(combined.columns), 5)
+        assert len(combined.columns) == 5

         # gh 3016 (same as in update)
         df = DataFrame([[1., 2., False, True], [4., 5., True, False]],
@@ -534,9 +548,9 @@ def test_combine_first_mixed_bug(self):
         result = df.combine_first(other)
         assert_frame_equal(result, df)

-        df.ix[0, 'A'] = np.nan
+        df.loc[0, 'A'] = np.nan
         result = df.combine_first(other)
-        df.ix[0, 'A'] = 45
+        df.loc[0, 'A'] = 45
         assert_frame_equal(result, df)

         # doc example
@@ -589,28 +603,28 @@ def test_combine_first_align_nan(self):
         dfa = pd.DataFrame([[pd.Timestamp('2011-01-01'), 2]],
                            columns=['a', 'b'])
         dfb = pd.DataFrame([[4], [5]], columns=['b'])
-        self.assertEqual(dfa['a'].dtype, 'datetime64[ns]')
-        self.assertEqual(dfa['b'].dtype, 'int64')
+        assert dfa['a'].dtype == 'datetime64[ns]'
+        assert dfa['b'].dtype == 'int64'

         res = dfa.combine_first(dfb)
         exp = pd.DataFrame({'a': [pd.Timestamp('2011-01-01'), pd.NaT],
                             'b': [2., 5.]}, columns=['a', 'b'])
         tm.assert_frame_equal(res, exp)
-        self.assertEqual(res['a'].dtype, 'datetime64[ns]')
+        assert res['a'].dtype == 'datetime64[ns]'
         # ToDo: this must be int64
-        self.assertEqual(res['b'].dtype, 'float64')
+        assert res['b'].dtype == 'float64'

         res = dfa.iloc[:0].combine_first(dfb)
         exp = pd.DataFrame({'a': [np.nan, np.nan],
                             'b': [4, 5]}, columns=['a', 'b'])
         tm.assert_frame_equal(res, exp)
         # ToDo: this must be datetime64
-        self.assertEqual(res['a'].dtype, 'float64')
+        assert res['a'].dtype == 'float64'
         # ToDo: this must be int64
-        self.assertEqual(res['b'].dtype, 'int64')
+        assert res['b'].dtype == 'int64'

     def test_combine_first_timezone(self):
-        # GH 7630
+        # see gh-7630
         data1 = pd.to_datetime('20100101 01:01').tz_localize('UTC')
         df1 = pd.DataFrame(columns=['UTCdatetime', 'abc'],
                            data=data1,
@@ -630,10 +644,10 @@ def test_combine_first_timezone(self):
                            index=pd.date_range('20140627', periods=2,
                                                freq='D'))
         tm.assert_frame_equal(res, exp)
-        self.assertEqual(res['UTCdatetime'].dtype, 'datetime64[ns, UTC]')
-        self.assertEqual(res['abc'].dtype, 'datetime64[ns, UTC]')
+        assert res['UTCdatetime'].dtype == 'datetime64[ns, UTC]'
+        assert res['abc'].dtype == 'datetime64[ns, UTC]'

-        # GH 10567
+        # see gh-10567
         dts1 = pd.date_range('2015-01-01', '2015-01-05', tz='UTC')
         df1 = pd.DataFrame({'DATE': dts1})
         dts2 = pd.date_range('2015-01-03', '2015-01-05', tz='UTC')
@@ -641,7 +655,7 @@ def test_combine_first_timezone(self):

         res = df1.combine_first(df2)
         tm.assert_frame_equal(res, df1)
-        self.assertEqual(res['DATE'].dtype, 'datetime64[ns, UTC]')
+        assert res['DATE'].dtype == 'datetime64[ns, UTC]'

         dts1 = pd.DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03',
                                  '2011-01-04'], tz='US/Eastern')
@@ -666,7 +680,7 @@ def test_combine_first_timezone(self):
         # if df1 doesn't have NaN, keep its dtype
         res = df1.combine_first(df2)
         tm.assert_frame_equal(res, df1)
-        self.assertEqual(res['DATE'].dtype, 'datetime64[ns, US/Eastern]')
+        assert res['DATE'].dtype == 'datetime64[ns, US/Eastern]'

         dts1 = pd.date_range('2015-01-01', '2015-01-02', tz='US/Eastern')
         df1 = pd.DataFrame({'DATE': dts1})
@@ -679,7 +693,7 @@ def test_combine_first_timezone(self):
                    pd.Timestamp('2015-01-03')]
         exp = pd.DataFrame({'DATE': exp_dts})
         tm.assert_frame_equal(res, exp)
-        self.assertEqual(res['DATE'].dtype, 'object')
+        assert res['DATE'].dtype == 'object'

     def test_combine_first_timedelta(self):
         data1 = pd.TimedeltaIndex(['1 day', 'NaT', '3 day', '4day'])
@@ -692,7 +706,7 @@ def test_combine_first_timedelta(self):
                                   '11 day', '3 day', '4 day'])
         exp = pd.DataFrame({'TD': exp_dts}, index=[1, 2, 3, 4, 5, 7])
         tm.assert_frame_equal(res, exp)
-        self.assertEqual(res['TD'].dtype, 'timedelta64[ns]')
+        assert res['TD'].dtype == 'timedelta64[ns]'

     def test_combine_first_period(self):
         data1 = pd.PeriodIndex(['2011-01', 'NaT', '2011-03',
@@ -708,7 +722,7 @@ def test_combine_first_period(self):
                                  freq='M')
         exp = pd.DataFrame({'P': exp_dts}, index=[1, 2, 3, 4, 5, 7])
         tm.assert_frame_equal(res, exp)
-        self.assertEqual(res['P'].dtype, 'object')
+        assert res['P'].dtype == 'object'

         # different freq
         dts2 = pd.PeriodIndex(['2012-01-01', '2012-01-02',
@@ -724,7 +738,7 @@ def test_combine_first_period(self):
                    pd.Period('2011-04', freq='M')]
         exp = pd.DataFrame({'P': exp_dts}, index=[1, 2, 3, 4, 5, 7])
         tm.assert_frame_equal(res, exp)
-        self.assertEqual(res['P'].dtype, 'object')
+        assert res['P'].dtype == 'object'

     def test_combine_first_int(self):
         # GH14687 - integer series that do no align exactly
@@ -734,4 +748,18 @@ def test_combine_first_int(self):

         res = df1.combine_first(df2)
         tm.assert_frame_equal(res, df1)
-        self.assertEqual(res['a'].dtype, 'int64')
+        assert res['a'].dtype == 'int64'
+
+    def test_concat_datetime_datetime64_frame(self):
+        # #2624
+        rows = []
+        rows.append([datetime(2010, 1, 1), 1])
+        rows.append([datetime(2010, 1, 2), 'hi'])
+
+        df2_obj = DataFrame.from_records(rows, columns=['date', 'test'])
+
+        ind = date_range(start="2000/1/1", freq="D", periods=10)
+        df1 = DataFrame({'date': ind, 'test': lrange(10)})
+
+        # it works!
+        pd.concat([df1, df2_obj])
diff --git a/pandas/tests/frame/test_constructors.py b/pandas/tests/frame/test_constructors.py
index 489c85a7234b8..8459900ea1059 100644
--- a/pandas/tests/frame/test_constructors.py
+++ b/pandas/tests/frame/test_constructors.py
@@ -6,25 +6,22 @@
 import functools
 import itertools

-import nose
-
+import pytest
 from numpy.random import randn

 import numpy as np
 import numpy.ma as ma
 import numpy.ma.mrecords as mrecords

-from pandas.types.common import is_integer_dtype
+from pandas.core.dtypes.common import is_integer_dtype
 from pandas.compat import (lmap, long, zip, range, lrange, lzip,
                            OrderedDict, is_platform_little_endian)
 from pandas import compat
 from pandas import (DataFrame, Index, Series, isnull,
                     MultiIndex, Timedelta, Timestamp,
                     date_range)
-from pandas.core.common import PandasError
 import pandas as pd
-import pandas.core.common as com
-import pandas.lib as lib
+import pandas._libs.lib as lib
 import pandas.util.testing as tm

 from pandas.tests.frame.common import TestData
@@ -35,16 +32,14 @@
            'int32', 'int64']


-class TestDataFrameConstructors(tm.TestCase, TestData):
-
-    _multiprocess_can_split_ = True
+class TestDataFrameConstructors(TestData):

     def test_constructor(self):
         df = DataFrame()
-        self.assertEqual(len(df.index), 0)
+        assert len(df.index) == 0

         df = DataFrame(data={})
-        self.assertEqual(len(df.index), 0)
+        assert len(df.index) == 0

     def test_constructor_mixed(self):
         index, data = tm.getMixedTypeDict()
@@ -53,11 +48,11 @@ def test_constructor_mixed(self):
         indexed_frame = DataFrame(data, index=index)  # noqa
         unindexed_frame = DataFrame(data)  # noqa

-        self.assertEqual(self.mixed_frame['foo'].dtype, np.object_)
+        assert self.mixed_frame['foo'].dtype == np.object_

     def test_constructor_cast_failure(self):
         foo = DataFrame({'a': ['a', 'b', 'c']}, dtype=np.float64)
-        self.assertEqual(foo['a'].dtype, object)
+        assert foo['a'].dtype == object

         # GH 3010, constructing with odd arrays
         df = DataFrame(np.ones((4, 2)))
@@ -66,8 +61,8 @@ def test_constructor_cast_failure(self):
         df['foo'] = np.ones((4, 2)).tolist()

         # this is not ok
-        self.assertRaises(ValueError, df.__setitem__, tuple(['test']),
-                          np.ones((4, 2)))
+        pytest.raises(ValueError, df.__setitem__, tuple(['test']),
+                      np.ones((4, 2)))

         # this is ok
         df['foo2'] = np.ones((4, 2)).tolist()
@@ -81,32 +76,31 @@ def test_constructor_dtype_copy(self):
         new_df = pd.DataFrame(orig_df, dtype=float, copy=True)

         new_df['col1'] = 200.
-        self.assertEqual(orig_df['col1'][0], 1.)
+        assert orig_df['col1'][0] == 1.
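The converted constructor tests below rely on two idioms for expected failures: bare `pytest.raises(...)` when only the exception type matters, and `tm.assert_raises_regex(...)` when the message should be checked as well. A rough standalone sketch of both, reusing an error case that appears in this diff; `pytest.raises(..., match=...)` is the plain-pytest counterpart, assuming pytest ≥ 3.1:

```python
import pandas as pd
import pytest


def test_scalar_dict_requires_index():
    # Type-only check, like the bare pytest.raises calls in this patch.
    with pytest.raises(ValueError):
        pd.DataFrame({'a': 0.7})

    # Message check via a regex, comparable to tm.assert_raises_regex.
    with pytest.raises(ValueError, match='If using all scalar'):
        pd.DataFrame({'a': False, 'b': True})
```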
     def test_constructor_dtype_nocast_view(self):
         df = DataFrame([[1, 2]])
         should_be_view = DataFrame(df, dtype=df[0].dtype)
         should_be_view[0][0] = 99
-        self.assertEqual(df.values[0, 0], 99)
+        assert df.values[0, 0] == 99

         should_be_view = DataFrame(df.values, dtype=df[0].dtype)
         should_be_view[0][0] = 97
-        self.assertEqual(df.values[0, 0], 97)
+        assert df.values[0, 0] == 97

     def test_constructor_dtype_list_data(self):
         df = DataFrame([[1, '2'],
                         [None, 'a']], dtype=object)
-        self.assertIsNone(df.ix[1, 0])
-        self.assertEqual(df.ix[0, 1], '2')
+        assert df.loc[1, 0] is None
+        assert df.loc[0, 1] == '2'

     def test_constructor_list_frames(self):
-
-        # GH 3243
+        # see gh-3243
         result = DataFrame([DataFrame([])])
-        self.assertEqual(result.shape, (1, 0))
+        assert result.shape == (1, 0)

         result = DataFrame([DataFrame(dict(A=lrange(5)))])
-        tm.assertIsInstance(result.iloc[0, 0], DataFrame)
+        assert isinstance(result.iloc[0, 0], DataFrame)

     def test_constructor_mixed_dtypes(self):

@@ -154,8 +148,8 @@ def test_constructor_complex_dtypes(self):
         b = np.random.rand(10).astype(np.complex128)

         df = DataFrame({'a': a, 'b': b})
-        self.assertEqual(a.dtype, df.a.dtype)
-        self.assertEqual(b.dtype, df.b.dtype)
+        assert a.dtype == df.a.dtype
+        assert b.dtype == df.b.dtype

     def test_constructor_rec(self):
         rec = self.frame.to_records(index=False)
@@ -166,11 +160,11 @@ def test_constructor_rec(self):
         index = self.frame.index

         df = DataFrame(rec)
-        self.assert_index_equal(df.columns, pd.Index(rec.dtype.names))
+        tm.assert_index_equal(df.columns, pd.Index(rec.dtype.names))

         df2 = DataFrame(rec, index=index)
-        self.assert_index_equal(df2.columns, pd.Index(rec.dtype.names))
-        self.assert_index_equal(df2.index, index)
+        tm.assert_index_equal(df2.columns, pd.Index(rec.dtype.names))
+        tm.assert_index_equal(df2.index, index)

         rng = np.arange(len(rec))[::-1]
         df3 = DataFrame(rec, index=rng, columns=['C', 'B'])
@@ -180,16 +174,17 @@ def test_constructor_rec(self):
     def test_constructor_bool(self):
         df = DataFrame({0: np.ones(10, dtype=bool),
                         1: np.zeros(10, dtype=bool)})
-        self.assertEqual(df.values.dtype, np.bool_)
+        assert df.values.dtype == np.bool_

     def test_constructor_overflow_int64(self):
+        # see gh-14881
         values = np.array([2 ** 64 - i for i in range(1, 10)],
                           dtype=np.uint64)

         result = DataFrame({'a': values})
-        self.assertEqual(result['a'].dtype, object)
+        assert result['a'].dtype == np.uint64

-        # #2355
+        # see gh-2355
         data_scores = [(6311132704823138710, 273), (2685045978526272070, 23),
                        (8921811264899370420, 45),
                        (long(17019687244989530680), 270),
@@ -198,7 +193,7 @@ def test_constructor_overflow_int64(self):
         data = np.zeros((len(data_scores),), dtype=dtype)
         data[:] = data_scores
         df_crawls = DataFrame(data)
-        self.assertEqual(df_crawls['uid'].dtype, object)
+        assert df_crawls['uid'].dtype == np.uint64

     def test_constructor_ordereddict(self):
         import random
@@ -207,15 +202,15 @@ def test_constructor_ordereddict(self):
         random.shuffle(nums)
         expected = ['A%d' % i for i in nums]
         df = DataFrame(OrderedDict(zip(expected, [[0]] * nitems)))
-        self.assertEqual(expected, list(df.columns))
+        assert expected == list(df.columns)

     def test_constructor_dict(self):
         frame = DataFrame({'col1': self.ts1,
                            'col2': self.ts2})

         # col2 is padded with NaN
-        self.assertEqual(len(self.ts1), 30)
-        self.assertEqual(len(self.ts2), 25)
+        assert len(self.ts1) == 30
+        assert len(self.ts2) == 25

         tm.assert_series_equal(self.ts1, frame['col1'], check_names=False)

@@ -227,55 +222,55 @@ def test_constructor_dict(self):
                            'col2': self.ts2},
                           columns=['col2', 'col3', 'col4'])

-        self.assertEqual(len(frame), len(self.ts2))
-        self.assertNotIn('col1', frame)
-        self.assertTrue(isnull(frame['col3']).all())
+        assert len(frame) == len(self.ts2)
+        assert 'col1' not in frame
+        assert isnull(frame['col3']).all()

         # Corner cases
-        self.assertEqual(len(DataFrame({})), 0)
+        assert len(DataFrame({})) == 0

         # mix dict and array, wrong size - no spec for which error should raise
         # first
-        with tm.assertRaises(ValueError):
+        with pytest.raises(ValueError):
             DataFrame({'A': {'a': 'a', 'b': 'b'}, 'B': ['a', 'b', 'c']})

         # Length-one dict micro-optimization
         frame = DataFrame({'A': {'1': 1, '2': 2}})
-        self.assert_index_equal(frame.index, pd.Index(['1', '2']))
+        tm.assert_index_equal(frame.index, pd.Index(['1', '2']))

         # empty dict plus index
         idx = Index([0, 1, 2])
         frame = DataFrame({}, index=idx)
-        self.assertIs(frame.index, idx)
+        assert frame.index is idx

         # empty with index and columns
         idx = Index([0, 1, 2])
         frame = DataFrame({}, index=idx, columns=idx)
-        self.assertIs(frame.index, idx)
-        self.assertIs(frame.columns, idx)
-        self.assertEqual(len(frame._series), 3)
+        assert frame.index is idx
+        assert frame.columns is idx
+        assert len(frame._series) == 3

         # with dict of empty list and Series
         frame = DataFrame({'A': [], 'B': []}, columns=['A', 'B'])
-        self.assert_index_equal(frame.index, Index([], dtype=np.int64))
+        tm.assert_index_equal(frame.index, Index([], dtype=np.int64))

         # GH 14381
         # Dict with None value
         frame_none = DataFrame(dict(a=None), index=[0])
         frame_none_list = DataFrame(dict(a=[None]), index=[0])
-        tm.assert_equal(frame_none.get_value(0, 'a'), None)
-        tm.assert_equal(frame_none_list.get_value(0, 'a'), None)
+        assert frame_none.get_value(0, 'a') is None
+        assert frame_none_list.get_value(0, 'a') is None
         tm.assert_frame_equal(frame_none, frame_none_list)

         # GH10856
         # dict with scalar values should raise error, even if columns passed
-        with tm.assertRaises(ValueError):
+        with pytest.raises(ValueError):
             DataFrame({'a': 0.7})

-        with tm.assertRaises(ValueError):
+        with pytest.raises(ValueError):
             DataFrame({'a': 0.7}, columns=['a'])

-        with tm.assertRaises(ValueError):
+        with pytest.raises(ValueError):
             DataFrame({'a': 0.7}, columns=['b'])

     def test_constructor_multi_index(self):
@@ -284,47 +279,50 @@ def test_constructor_multi_index(self):
         tuples = [(2, 3), (3, 3), (3, 3)]
         mi = MultiIndex.from_tuples(tuples)
         df = DataFrame(index=mi, columns=mi)
-        self.assertTrue(pd.isnull(df).values.ravel().all())
+        assert pd.isnull(df).values.ravel().all()

         tuples = [(3, 3), (2, 3), (3, 3)]
         mi = MultiIndex.from_tuples(tuples)
         df = DataFrame(index=mi, columns=mi)
-        self.assertTrue(pd.isnull(df).values.ravel().all())
+        assert pd.isnull(df).values.ravel().all()

     def test_constructor_error_msgs(self):
         msg = "Empty data passed with indices specified."
         # passing an empty array with columns specified.
-        with tm.assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             DataFrame(np.empty(0), columns=list('abc'))

         msg = "Mixing dicts with non-Series may lead to ambiguous ordering."
         # mix dict and array, wrong size
-        with tm.assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             DataFrame({'A': {'a': 'a', 'b': 'b'},
                        'B': ['a', 'b', 'c']})

         # wrong size ndarray, GH 3105
         msg = r"Shape of passed values is \(3, 4\), indices imply \(3, 3\)"
-        with tm.assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             DataFrame(np.arange(12).reshape((4, 3)),
                       columns=['foo', 'bar', 'baz'],
                       index=pd.date_range('2000-01-01', periods=3))

         # higher dim raise exception
-        with tm.assertRaisesRegexp(ValueError, 'Must pass 2-d input'):
+        with tm.assert_raises_regex(ValueError, 'Must pass 2-d input'):
             DataFrame(np.zeros((3, 3, 3)), columns=['A', 'B', 'C'], index=[1])

         # wrong size axis labels
-        with tm.assertRaisesRegexp(ValueError, "Shape of passed values is "
-                                   r"\(3, 2\), indices imply \(3, 1\)"):
+        with tm.assert_raises_regex(ValueError, "Shape of passed values "
+                                    "is \(3, 2\), indices "
+                                    "imply \(3, 1\)"):
             DataFrame(np.random.rand(2, 3), columns=['A', 'B', 'C'], index=[1])

-        with tm.assertRaisesRegexp(ValueError, "Shape of passed values is "
-                                   r"\(3, 2\), indices imply \(2, 2\)"):
+        with tm.assert_raises_regex(ValueError, "Shape of passed values "
+                                    "is \(3, 2\), indices "
+                                    "imply \(2, 2\)"):
             DataFrame(np.random.rand(2, 3), columns=['A', 'B'], index=[1, 2])

-        with tm.assertRaisesRegexp(ValueError, 'If using all scalar values, '
-                                   'you must pass an index'):
+        with tm.assert_raises_regex(ValueError, "If using all scalar "
+                                    "values, you must pass "
+                                    "an index"):
             DataFrame({'a': False, 'b': True})

     def test_constructor_with_embedded_frames(self):
@@ -379,14 +377,14 @@ def test_constructor_dict_cast(self):
             'B': {'1': '1', '2': '2', '3': '3'},
         }
         frame = DataFrame(test_data, dtype=float)
-        self.assertEqual(len(frame), 3)
-        self.assertEqual(frame['B'].dtype, np.float64)
-        self.assertEqual(frame['A'].dtype, np.float64)
+        assert len(frame) == 3
+        assert frame['B'].dtype == np.float64
+        assert frame['A'].dtype == np.float64

         frame = DataFrame(test_data)
-        self.assertEqual(len(frame), 3)
-        self.assertEqual(frame['B'].dtype, np.object_)
-        self.assertEqual(frame['A'].dtype, np.float64)
+        assert len(frame) == 3
+        assert frame['B'].dtype == np.object_
+        assert frame['A'].dtype == np.float64

         # can't cast to float
         test_data = {
@@ -394,17 +392,17 @@ def test_constructor_dict_cast(self):
             'B': dict(zip(range(15), randn(15)))
         }
         frame = DataFrame(test_data, dtype=float)
-        self.assertEqual(len(frame), 20)
-        self.assertEqual(frame['A'].dtype, np.object_)
-        self.assertEqual(frame['B'].dtype, np.float64)
+        assert len(frame) == 20
+        assert frame['A'].dtype == np.object_
+        assert frame['B'].dtype == np.float64

     def test_constructor_dict_dont_upcast(self):
         d = {'Col1': {'Row1': 'A String', 'Row2': np.nan}}
         df = DataFrame(d)
-        tm.assertIsInstance(df['Col1']['Row2'], float)
+        assert isinstance(df['Col1']['Row2'], float)

         dm = DataFrame([[1, 2], ['a', 'b']], index=[1, 2], columns=[1, 2])
-        tm.assertIsInstance(dm[1][1], int)
+        assert isinstance(dm[1][1], int)

     def test_constructor_dict_of_tuples(self):
         # GH #1491
@@ -495,14 +493,14 @@ def test_constructor_period(self):
         a = pd.PeriodIndex(['2012-01', 'NaT', '2012-04'], freq='M')
         b = pd.PeriodIndex(['2012-02-01', '2012-03-01', 'NaT'], freq='D')
         df = pd.DataFrame({'a': a, 'b': b})
-        self.assertEqual(df['a'].dtype, 'object')
-        self.assertEqual(df['b'].dtype, 'object')
+        assert df['a'].dtype == 'object'
+        assert df['b'].dtype == 'object'

         # list of periods
         df = pd.DataFrame({'a': a.asobject.tolist(),
                            'b': b.asobject.tolist()})
-        self.assertEqual(df['a'].dtype, 'object')
-        self.assertEqual(df['b'].dtype, 'object')
+        assert df['a'].dtype == 'object'
+        assert df['b'].dtype == 'object'

     def test_nested_dict_frame_constructor(self):
         rng = pd.period_range('1/1/2000', periods=5)
@@ -531,55 +529,55 @@ def _check_basic_constructor(self, empty):

         # 2-D input
         frame = DataFrame(mat, columns=['A', 'B', 'C'], index=[1, 2])

-        self.assertEqual(len(frame.index), 2)
-        self.assertEqual(len(frame.columns), 3)
+        assert len(frame.index) == 2
+        assert len(frame.columns) == 3

         # 1-D input
         frame = DataFrame(empty((3,)), columns=['A'], index=[1, 2, 3])
-        self.assertEqual(len(frame.index), 3)
-        self.assertEqual(len(frame.columns), 1)
+        assert len(frame.index) == 3
+        assert len(frame.columns) == 1

         # cast type
         frame = DataFrame(mat, columns=['A', 'B', 'C'], index=[1, 2],
                           dtype=np.int64)
-        self.assertEqual(frame.values.dtype, np.int64)
+        assert frame.values.dtype == np.int64

         # wrong size axis labels
         msg = r'Shape of passed values is \(3, 2\), indices imply \(3, 1\)'
-        with tm.assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             DataFrame(mat, columns=['A', 'B', 'C'], index=[1])
         msg = r'Shape of passed values is \(3, 2\), indices imply \(2, 2\)'
-        with tm.assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             DataFrame(mat, columns=['A', 'B'], index=[1, 2])

         # higher dim raise exception
-        with tm.assertRaisesRegexp(ValueError, 'Must pass 2-d input'):
+        with tm.assert_raises_regex(ValueError, 'Must pass 2-d input'):
             DataFrame(empty((3, 3, 3)), columns=['A', 'B', 'C'],
                       index=[1])

         # automatic labeling
         frame = DataFrame(mat)
-        self.assert_index_equal(frame.index, pd.Index(lrange(2)))
-        self.assert_index_equal(frame.columns, pd.Index(lrange(3)))
+        tm.assert_index_equal(frame.index, pd.Index(lrange(2)))
+        tm.assert_index_equal(frame.columns, pd.Index(lrange(3)))

         frame = DataFrame(mat, index=[1, 2])
-        self.assert_index_equal(frame.columns, pd.Index(lrange(3)))
+        tm.assert_index_equal(frame.columns, pd.Index(lrange(3)))

         frame = DataFrame(mat, columns=['A', 'B', 'C'])
-        self.assert_index_equal(frame.index, pd.Index(lrange(2)))
+        tm.assert_index_equal(frame.index, pd.Index(lrange(2)))

         # 0-length axis
         frame = DataFrame(empty((0, 3)))
-        self.assertEqual(len(frame.index), 0)
+        assert len(frame.index) == 0

         frame = DataFrame(empty((3, 0)))
-        self.assertEqual(len(frame.columns), 0)
+        assert len(frame.columns) == 0

     def test_constructor_ndarray(self):
         self._check_basic_constructor(np.ones)

         frame = DataFrame(['foo', 'bar'], index=[0, 1], columns=['A'])
-        self.assertEqual(len(frame), 2)
+        assert len(frame) == 2

     def test_constructor_maskedarray(self):
         self._check_basic_constructor(ma.masked_all)
@@ -589,13 +587,13 @@ def test_constructor_maskedarray(self):
         mat[0, 0] = 1.0
         mat[1, 2] = 2.0
         frame = DataFrame(mat, columns=['A', 'B', 'C'], index=[1, 2])
-        self.assertEqual(1.0, frame['A'][1])
-        self.assertEqual(2.0, frame['C'][2])
+        assert 1.0 == frame['A'][1]
+        assert 2.0 == frame['C'][2]

         # what is this even checking??
mat = ma.masked_all((2, 3), dtype=float) frame = DataFrame(mat, columns=['A', 'B', 'C'], index=[1, 2]) - self.assertTrue(np.all(~np.asarray(frame == frame))) + assert np.all(~np.asarray(frame == frame)) def test_constructor_maskedarray_nonfloat(self): # masked int promoted to float @@ -603,66 +601,66 @@ def test_constructor_maskedarray_nonfloat(self): # 2-D input frame = DataFrame(mat, columns=['A', 'B', 'C'], index=[1, 2]) - self.assertEqual(len(frame.index), 2) - self.assertEqual(len(frame.columns), 3) - self.assertTrue(np.all(~np.asarray(frame == frame))) + assert len(frame.index) == 2 + assert len(frame.columns) == 3 + assert np.all(~np.asarray(frame == frame)) # cast type frame = DataFrame(mat, columns=['A', 'B', 'C'], index=[1, 2], dtype=np.float64) - self.assertEqual(frame.values.dtype, np.float64) + assert frame.values.dtype == np.float64 # Check non-masked values mat2 = ma.copy(mat) mat2[0, 0] = 1 mat2[1, 2] = 2 frame = DataFrame(mat2, columns=['A', 'B', 'C'], index=[1, 2]) - self.assertEqual(1, frame['A'][1]) - self.assertEqual(2, frame['C'][2]) + assert 1 == frame['A'][1] + assert 2 == frame['C'][2] # masked np.datetime64 stays (use lib.NaT as null) mat = ma.masked_all((2, 3), dtype='M8[ns]') # 2-D input frame = DataFrame(mat, columns=['A', 'B', 'C'], index=[1, 2]) - self.assertEqual(len(frame.index), 2) - self.assertEqual(len(frame.columns), 3) - self.assertTrue(isnull(frame).values.all()) + assert len(frame.index) == 2 + assert len(frame.columns) == 3 + assert isnull(frame).values.all() # cast type frame = DataFrame(mat, columns=['A', 'B', 'C'], index=[1, 2], dtype=np.int64) - self.assertEqual(frame.values.dtype, np.int64) + assert frame.values.dtype == np.int64 # Check non-masked values mat2 = ma.copy(mat) mat2[0, 0] = 1 mat2[1, 2] = 2 frame = DataFrame(mat2, columns=['A', 'B', 'C'], index=[1, 2]) - self.assertEqual(1, frame['A'].view('i8')[1]) - self.assertEqual(2, frame['C'].view('i8')[2]) + assert 1 == frame['A'].view('i8')[1] + assert 2 == frame['C'].view('i8')[2] # masked bool promoted to object mat = ma.masked_all((2, 3), dtype=bool) # 2-D input frame = DataFrame(mat, columns=['A', 'B', 'C'], index=[1, 2]) - self.assertEqual(len(frame.index), 2) - self.assertEqual(len(frame.columns), 3) - self.assertTrue(np.all(~np.asarray(frame == frame))) + assert len(frame.index) == 2 + assert len(frame.columns) == 3 + assert np.all(~np.asarray(frame == frame)) # cast type frame = DataFrame(mat, columns=['A', 'B', 'C'], index=[1, 2], dtype=object) - self.assertEqual(frame.values.dtype, object) + assert frame.values.dtype == object # Check non-masked values mat2 = ma.copy(mat) mat2[0, 0] = True mat2[1, 2] = False frame = DataFrame(mat2, columns=['A', 'B', 'C'], index=[1, 2]) - self.assertEqual(True, frame['A'][1]) - self.assertEqual(False, frame['C'][2]) + assert frame['A'][1] is True + assert frame['C'][2] is False def test_constructor_mrecarray(self): # Ensure mrecarray produces frame identical to dict of masked arrays @@ -709,41 +707,41 @@ def test_constructor_mrecarray(self): def test_constructor_corner(self): df = DataFrame(index=[]) - self.assertEqual(df.values.shape, (0, 0)) + assert df.values.shape == (0, 0) # empty but with specified dtype df = DataFrame(index=lrange(10), columns=['a', 'b'], dtype=object) - self.assertEqual(df.values.dtype, np.object_) + assert df.values.dtype == np.object_ # does not error but ends up float df = DataFrame(index=lrange(10), columns=['a', 'b'], dtype=int) - self.assertEqual(df.values.dtype, np.object_) + assert df.values.dtype == np.object_ # 
#1783 empty dtype object df = DataFrame({}, columns=['foo', 'bar']) - self.assertEqual(df.values.dtype, np.object_) + assert df.values.dtype == np.object_ df = DataFrame({'b': 1}, index=lrange(10), columns=list('abc'), dtype=int) - self.assertEqual(df.values.dtype, np.object_) + assert df.values.dtype == np.object_ def test_constructor_scalar_inference(self): data = {'int': 1, 'bool': True, 'float': 3., 'complex': 4j, 'object': 'foo'} df = DataFrame(data, index=np.arange(10)) - self.assertEqual(df['int'].dtype, np.int64) - self.assertEqual(df['bool'].dtype, np.bool_) - self.assertEqual(df['float'].dtype, np.float64) - self.assertEqual(df['complex'].dtype, np.complex128) - self.assertEqual(df['object'].dtype, np.object_) + assert df['int'].dtype == np.int64 + assert df['bool'].dtype == np.bool_ + assert df['float'].dtype == np.float64 + assert df['complex'].dtype == np.complex128 + assert df['object'].dtype == np.object_ def test_constructor_arrays_and_scalars(self): df = DataFrame({'a': randn(10), 'b': True}) exp = DataFrame({'a': df['a'].values, 'b': [True] * 10}) tm.assert_frame_equal(df, exp) - with tm.assertRaisesRegexp(ValueError, 'must pass an index'): + with tm.assert_raises_regex(ValueError, 'must pass an index'): DataFrame({'a': False, 'b': True}) def test_constructor_DataFrame(self): @@ -751,38 +749,38 @@ def test_constructor_DataFrame(self): tm.assert_frame_equal(df, self.frame) df_casted = DataFrame(self.frame, dtype=np.int64) - self.assertEqual(df_casted.values.dtype, np.int64) + assert df_casted.values.dtype == np.int64 def test_constructor_more(self): # used to be in test_matrix.py arr = randn(10) dm = DataFrame(arr, columns=['A'], index=np.arange(10)) - self.assertEqual(dm.values.ndim, 2) + assert dm.values.ndim == 2 arr = randn(0) dm = DataFrame(arr) - self.assertEqual(dm.values.ndim, 2) - self.assertEqual(dm.values.ndim, 2) + assert dm.values.ndim == 2 + assert dm.values.ndim == 2 # no data specified dm = DataFrame(columns=['A', 'B'], index=np.arange(10)) - self.assertEqual(dm.values.shape, (10, 2)) + assert dm.values.shape == (10, 2) dm = DataFrame(columns=['A', 'B']) - self.assertEqual(dm.values.shape, (0, 2)) + assert dm.values.shape == (0, 2) dm = DataFrame(index=np.arange(10)) - self.assertEqual(dm.values.shape, (10, 0)) + assert dm.values.shape == (10, 0) # corner, silly # TODO: Fix this Exception to be better... 
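The hunk just below swaps the library-specific `PandasError` for the builtin `ValueError`, so callers no longer need a pandas import to catch constructor misuse. A sketch of the surviving contract, using the all-scalar case because it still raises the same `ValueError` in current pandas (the bare-tuple case below was later relaxed):

```python
import pandas as pd
import pytest

# Scalars with no index cannot determine a shape, so the constructor
# refuses them with a plain builtin ValueError.
with pytest.raises(ValueError, match='If using all scalar values'):
    pd.DataFrame({'a': False, 'b': True})
```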
- with tm.assertRaisesRegexp(PandasError, 'constructor not ' - 'properly called'): + with tm.assert_raises_regex(ValueError, 'constructor not ' + 'properly called'): DataFrame((1, 2, 3)) # can't cast mat = np.array(['foo', 'bar'], dtype=object).reshape(2, 1) - with tm.assertRaisesRegexp(ValueError, 'cast'): + with tm.assert_raises_regex(ValueError, 'cast'): DataFrame(mat, index=[0, 1], columns=[0], dtype=float) dm = DataFrame(DataFrame(self.frame._series)) @@ -793,8 +791,8 @@ def test_constructor_more(self): 'B': np.ones(10, dtype=np.float64)}, index=np.arange(10)) - self.assertEqual(len(dm.columns), 2) - self.assertEqual(dm.values.dtype, np.float64) + assert len(dm.columns) == 2 + assert dm.values.dtype == np.float64 def test_constructor_empty_list(self): df = DataFrame([], index=[]) @@ -818,8 +816,8 @@ def test_constructor_list_of_lists(self): # GH #484 l = [[1, 'a'], [2, 'b']] df = DataFrame(data=l, columns=["num", "str"]) - self.assertTrue(is_integer_dtype(df['num'])) - self.assertEqual(df['str'].dtype, np.object_) + assert is_integer_dtype(df['num']) + assert df['str'].dtype == np.object_ # GH 4851 # list of 0-dim ndarrays @@ -1008,8 +1006,8 @@ class CustomDict(dict): def test_constructor_ragged(self): data = {'A': randn(10), 'B': randn(8)} - with tm.assertRaisesRegexp(ValueError, - 'arrays must all be same length'): + with tm.assert_raises_regex(ValueError, + 'arrays must all be same length'): DataFrame(data) def test_constructor_scalar(self): @@ -1028,10 +1026,10 @@ def test_constructor_mixed_dict_and_Series(self): data['B'] = Series([4, 3, 2, 1], index=['bar', 'qux', 'baz', 'foo']) result = DataFrame(data) - self.assertTrue(result.index.is_monotonic) + assert result.index.is_monotonic # ordering ambiguous, raise exception - with tm.assertRaisesRegexp(ValueError, 'ambiguous ordering'): + with tm.assert_raises_regex(ValueError, 'ambiguous ordering'): DataFrame({'A': ['a', 'b'], 'B': {'a': 'a', 'b': 'b'}}) # this is OK though @@ -1076,8 +1074,8 @@ def test_constructor_orient(self): def test_constructor_Series_named(self): a = Series([1, 2, 3], index=['a', 'b', 'c'], name='x') df = DataFrame(a) - self.assertEqual(df.columns[0], 'x') - self.assert_index_equal(df.index, a.index) + assert df.columns[0] == 'x' + tm.assert_index_equal(df.index, a.index) # ndarray like arr = np.random.randn(10) @@ -1091,12 +1089,12 @@ def test_constructor_Series_named(self): expected = DataFrame({0: s}) tm.assert_frame_equal(df, expected) - self.assertRaises(ValueError, DataFrame, s, columns=[1, 2]) + pytest.raises(ValueError, DataFrame, s, columns=[1, 2]) # #2234 a = Series([], name='x') df = DataFrame(a) - self.assertEqual(df.columns[0], 'x') + assert df.columns[0] == 'x' # series with name and w/o s1 = Series(arr, name='x') @@ -1121,13 +1119,13 @@ def test_constructor_Series_differently_indexed(self): df1 = DataFrame(s1, index=other_index) exp1 = DataFrame(s1.reindex(other_index)) - self.assertEqual(df1.columns[0], 'x') + assert df1.columns[0] == 'x' tm.assert_frame_equal(df1, exp1) df2 = DataFrame(s2, index=other_index) exp2 = DataFrame(s2.reindex(other_index)) - self.assertEqual(df2.columns[0], 0) - self.assert_index_equal(df2.index, other_index) + assert df2.columns[0] == 0 + tm.assert_index_equal(df2.index, other_index) tm.assert_frame_equal(df2, exp2) def test_constructor_manager_resize(self): @@ -1136,8 +1134,8 @@ def test_constructor_manager_resize(self): result = DataFrame(self.frame._data, index=index, columns=columns) - self.assert_index_equal(result.index, Index(index)) - 
self.assert_index_equal(result.columns, Index(columns)) + tm.assert_index_equal(result.index, Index(index)) + tm.assert_index_equal(result.columns, Index(columns)) def test_constructor_from_items(self): items = [(c, self.frame[c]) for c in self.frame.columns] @@ -1146,7 +1144,7 @@ def test_constructor_from_items(self): # pass some columns recons = DataFrame.from_items(items, columns=['C', 'B', 'A']) - tm.assert_frame_equal(recons, self.frame.ix[:, ['C', 'B', 'A']]) + tm.assert_frame_equal(recons, self.frame.loc[:, ['C', 'B', 'A']]) # orient='index' @@ -1157,10 +1155,11 @@ def test_constructor_from_items(self): columns=self.mixed_frame.columns, orient='index') tm.assert_frame_equal(recons, self.mixed_frame) - self.assertEqual(recons['A'].dtype, np.float64) + assert recons['A'].dtype == np.float64 - with tm.assertRaisesRegexp(TypeError, - "Must pass columns with orient='index'"): + with tm.assert_raises_regex(TypeError, + "Must pass columns with " + "orient='index'"): DataFrame.from_items(row_items, orient='index') # orient='index', but thar be tuples @@ -1173,7 +1172,7 @@ def test_constructor_from_items(self): columns=self.mixed_frame.columns, orient='index') tm.assert_frame_equal(recons, self.mixed_frame) - tm.assertIsInstance(recons['foo'][0], tuple) + assert isinstance(recons['foo'][0], tuple) rs = DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])], orient='index', @@ -1185,9 +1184,10 @@ def test_constructor_from_items(self): def test_constructor_mix_series_nonseries(self): df = DataFrame({'A': self.frame['A'], 'B': list(self.frame['B'])}, columns=['A', 'B']) - tm.assert_frame_equal(df, self.frame.ix[:, ['A', 'B']]) + tm.assert_frame_equal(df, self.frame.loc[:, ['A', 'B']]) - with tm.assertRaisesRegexp(ValueError, 'does not match index length'): + with tm.assert_raises_regex(ValueError, 'does not match ' + 'index length'): DataFrame({'A': self.frame['A'], 'B': list(self.frame['B'])[:-2]}) def test_constructor_miscast_na_int_dtype(self): @@ -1196,8 +1196,8 @@ def test_constructor_miscast_na_int_dtype(self): tm.assert_frame_equal(df, expected) def test_constructor_iterator_failure(self): - with tm.assertRaisesRegexp(TypeError, 'iterator'): - df = DataFrame(iter([1, 2, 3])) # noqa + with tm.assert_raises_regex(TypeError, 'iterator'): + DataFrame(iter([1, 2, 3])) def test_constructor_column_duplicates(self): # it works! 
#2079 @@ -1211,9 +1211,9 @@ def test_constructor_column_duplicates(self): [('a', [8]), ('a', [5])], columns=['a', 'a']) tm.assert_frame_equal(idf, edf) - self.assertRaises(ValueError, DataFrame.from_items, - [('a', [8]), ('a', [5]), ('b', [6])], - columns=['b', 'a', 'a']) + pytest.raises(ValueError, DataFrame.from_items, + [('a', [8]), ('a', [5]), ('b', [6])], + columns=['b', 'a', 'a']) def test_constructor_empty_with_string_dtype(self): # GH 9428 @@ -1244,9 +1244,10 @@ def test_constructor_single_value(self): dtype=object), index=[1, 2], columns=['a', 'c'])) - self.assertRaises(com.PandasError, DataFrame, 'a', [1, 2]) - self.assertRaises(com.PandasError, DataFrame, 'a', columns=['a', 'c']) - with tm.assertRaisesRegexp(TypeError, 'incompatible data and dtype'): + pytest.raises(ValueError, DataFrame, 'a', [1, 2]) + pytest.raises(ValueError, DataFrame, 'a', columns=['a', 'c']) + with tm.assert_raises_regex(TypeError, 'incompatible data ' + 'and dtype'): DataFrame('a', [1, 2], ['a', 'c'], float) def test_constructor_with_datetimes(self): @@ -1303,7 +1304,7 @@ def test_constructor_with_datetimes(self): ind = date_range(start="2000-01-01", freq="D", periods=10) datetimes = [ts.to_pydatetime() for ts in ind] datetime_s = Series(datetimes) - self.assertEqual(datetime_s.dtype, 'M8[ns]') + assert datetime_s.dtype == 'M8[ns]' df = DataFrame({'datetime_s': datetime_s}) result = df.get_dtype_counts() expected = Series({datetime64name: 1}) @@ -1329,12 +1330,12 @@ def test_constructor_with_datetimes(self): dt = tz.localize(datetime(2012, 1, 1)) df = DataFrame({'End Date': dt}, index=[0]) - self.assertEqual(df.iat[0, 0], dt) + assert df.iat[0, 0] == dt tm.assert_series_equal(df.dtypes, Series( {'End Date': 'datetime64[ns, US/Eastern]'})) df = DataFrame([{'End Date': dt}]) - self.assertEqual(df.iat[0, 0], dt) + assert df.iat[0, 0] == dt tm.assert_series_equal(df.dtypes, Series( {'End Date': 'datetime64[ns, US/Eastern]'})) @@ -1342,13 +1343,13 @@ def test_constructor_with_datetimes(self): # GH 8411 dr = date_range('20130101', periods=3) df = DataFrame({'value': dr}) - self.assertTrue(df.iat[0, 0].tz is None) + assert df.iat[0, 0].tz is None dr = date_range('20130101', periods=3, tz='UTC') df = DataFrame({'value': dr}) - self.assertTrue(str(df.iat[0, 0].tz) == 'UTC') + assert str(df.iat[0, 0].tz) == 'UTC' dr = date_range('20130101', periods=3, tz='US/Eastern') df = DataFrame({'value': dr}) - self.assertTrue(str(df.iat[0, 0].tz) == 'US/Eastern') + assert str(df.iat[0, 0].tz) == 'US/Eastern' # GH 7822 # preserver an index with a tz on dict construction @@ -1370,6 +1371,15 @@ def test_constructor_with_datetimes(self): .reset_index(drop=True), 'b': i_no_tz}) tm.assert_frame_equal(df, expected) + def test_constructor_datetimes_with_nulls(self): + # gh-15869 + for arr in [np.array([None, None, None, None, + datetime.now(), None]), + np.array([None, None, datetime.now(), None])]: + result = DataFrame(arr).get_dtype_counts() + expected = Series({'datetime64[ns]': 1}) + tm.assert_series_equal(result, expected) + def test_constructor_for_list_with_dtypes(self): # TODO(wesm): unused intname = np.dtype(np.int_).name # noqa @@ -1440,18 +1450,18 @@ def test_constructor_for_list_with_dtypes(self): def test_constructor_frame_copy(self): cop = DataFrame(self.frame, copy=True) cop['A'] = 5 - self.assertTrue((cop['A'] == 5).all()) - self.assertFalse((self.frame['A'] == 5).all()) + assert (cop['A'] == 5).all() + assert not (self.frame['A'] == 5).all() def test_constructor_ndarray_copy(self): df = DataFrame(self.frame.values) 
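The mutation on the next line probes view-versus-copy semantics: without `copy=True` the frame could share the source buffer. A self-contained sketch of the `copy=True` half only, since under pandas 2.x copy-on-write the aliasing half no longer holds:

```python
import numpy as np
import pandas as pd

# copy=True must detach the frame from the source ndarray: mutating
# the array afterwards does not leak into the DataFrame.
arr = np.zeros((10, 4))
df = pd.DataFrame(arr, copy=True)
arr[5] = 5.0
assert not (df.values[5] == 5.0).any()
```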
self.frame.values[5] = 5 - self.assertTrue((df.values[5] == 5).all()) + assert (df.values[5] == 5).all() df = DataFrame(self.frame.values, copy=True) self.frame.values[6] = 6 - self.assertFalse((df.values[6] == 6).all()) + assert not (df.values[6] == 6).all() def test_constructor_series_copy(self): series = self.frame._series @@ -1459,7 +1469,7 @@ def test_constructor_series_copy(self): df = DataFrame({'A': series['A']}) df['A'][:] = 5 - self.assertFalse((series['A'] == 5).all()) + assert not (series['A'] == 5).all() def test_constructor_with_nas(self): # GH 5016 @@ -1481,7 +1491,7 @@ def check(df): def f(): df.loc[:, np.nan] - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) df = DataFrame([[1, 2, 3], [4, 5, 6]], index=[1, np.nan]) check(df) @@ -1500,8 +1510,8 @@ def f(): def test_constructor_lists_to_object_dtype(self): # from #1074 d = DataFrame({'a': [np.nan, False]}) - self.assertEqual(d['a'].dtype, np.object_) - self.assertFalse(d['a'][1]) + assert d['a'].dtype == np.object_ + assert not d['a'][1] def test_from_records_to_records(self): # from numpy documentation @@ -1513,7 +1523,7 @@ def test_from_records_to_records(self): index = pd.Index(np.arange(len(arr))[::-1]) indexed_frame = DataFrame.from_records(arr, index=index) - self.assert_index_equal(indexed_frame.index, index) + tm.assert_index_equal(indexed_frame.index, index) # without names, it should go to last ditch arr2 = np.zeros((2, 3)) @@ -1521,18 +1531,18 @@ def test_from_records_to_records(self): # wrong length msg = r'Shape of passed values is \(3, 2\), indices imply \(3, 1\)' - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): DataFrame.from_records(arr, index=index[:-1]) indexed_frame = DataFrame.from_records(arr, index='f1') # what to do? 
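Whatever the answer to that open question, the round trip itself is easy to demonstrate; a minimal sketch with a hypothetical two-field structured array:

```python
import numpy as np
import pandas as pd

# A named field can become the index on the way in, and
# to_records(index=False) drops it again on the way out.
arr = np.array([(1.0, 2), (3.0, 4)],
               dtype=[('value', 'f8'), ('id', 'i8')])
df = pd.DataFrame.from_records(arr, index='id')
rec = df.to_records(index=False)
assert rec.dtype.names == ('value',)
```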
records = indexed_frame.to_records() - self.assertEqual(len(records.dtype.names), 3) + assert len(records.dtype.names) == 3 records = indexed_frame.to_records(index=False) - self.assertEqual(len(records.dtype.names), 2) - self.assertNotIn('index', records.dtype.names) + assert len(records.dtype.names) == 2 + assert 'index' not in records.dtype.names def test_from_records_nones(self): tuples = [(1, 2, None, 3), @@ -1540,7 +1550,7 @@ def test_from_records_nones(self): (None, 2, 5, 3)] df = DataFrame.from_records(tuples, columns=['a', 'b', 'c', 'd']) - self.assertTrue(np.isnan(df['c'][0])) + assert np.isnan(df['c'][0]) def test_from_records_iterator(self): arr = np.array([(1.0, 1.0, 2, 2), (3.0, 3.0, 4, 4), (5., 5., 6, 6), @@ -1605,7 +1615,7 @@ def test_from_records_columns_not_modified(self): df = DataFrame.from_records(tuples, columns=columns, index='a') # noqa - self.assertEqual(columns, original_columns) + assert columns == original_columns def test_from_records_decimal(self): from decimal import Decimal @@ -1613,11 +1623,11 @@ def test_from_records_decimal(self): tuples = [(Decimal('1.5'),), (Decimal('2.5'),), (None,)] df = DataFrame.from_records(tuples, columns=['a']) - self.assertEqual(df['a'].dtype, object) + assert df['a'].dtype == object df = DataFrame.from_records(tuples, columns=['a'], coerce_float=True) - self.assertEqual(df['a'].dtype, np.float64) - self.assertTrue(np.isnan(df['a'].values[-1])) + assert df['a'].dtype == np.float64 + assert np.isnan(df['a'].values[-1]) def test_from_records_duplicates(self): result = DataFrame.from_records([(1, 2, 3), (4, 5, 6)], @@ -1637,12 +1647,12 @@ def create_dict(order_id): documents.append({'order_id': 10, 'quantity': 5}) result = DataFrame.from_records(documents, index='order_id') - self.assertEqual(result.index.name, 'order_id') + assert result.index.name == 'order_id' # MultiIndex result = DataFrame.from_records(documents, index=['order_id', 'quantity']) - self.assertEqual(result.index.names, ('order_id', 'quantity')) + assert result.index.names == ('order_id', 'quantity') def test_from_records_misc_brokenness(self): # #2179 @@ -1691,20 +1701,20 @@ def test_from_records_empty_with_nonempty_fields_gh3682(self): a = np.array([(1, 2)], dtype=[('id', np.int64), ('value', np.int64)]) df = DataFrame.from_records(a, index='id') tm.assert_index_equal(df.index, Index([1], name='id')) - self.assertEqual(df.index.name, 'id') + assert df.index.name == 'id' tm.assert_index_equal(df.columns, Index(['value'])) b = np.array([], dtype=[('id', np.int64), ('value', np.int64)]) df = DataFrame.from_records(b, index='id') tm.assert_index_equal(df.index, Index([], name='id')) - self.assertEqual(df.index.name, 'id') + assert df.index.name == 'id' def test_from_records_with_datetimes(self): # this may fail on certain platforms because of a numpy issue # related GH6140 if not is_platform_little_endian(): - raise nose.SkipTest("known failure of test on non-little endian") + pytest.skip("known failure of test on non-little endian") # construction with a null in a recarray # GH 6140 @@ -1716,7 +1726,7 @@ def test_from_records_with_datetimes(self): try: recarray = np.core.records.fromarrays(arrdata, dtype=dtypes) except (ValueError): - raise nose.SkipTest("known failure of numpy rec array creation") + pytest.skip("known failure of numpy rec array creation") result = DataFrame.from_records(recarray) tm.assert_frame_equal(result, expected) @@ -1793,13 +1803,13 @@ def test_from_records_sequencelike(self): # empty case result = DataFrame.from_records([], columns=['foo', 
'bar', 'baz']) - self.assertEqual(len(result), 0) - self.assert_index_equal(result.columns, - pd.Index(['foo', 'bar', 'baz'])) + assert len(result) == 0 + tm.assert_index_equal(result.columns, + pd.Index(['foo', 'bar', 'baz'])) result = DataFrame.from_records([]) - self.assertEqual(len(result), 0) - self.assertEqual(len(result.columns), 0) + assert len(result) == 0 + assert len(result.columns) == 0 def test_from_records_dictlike(self): @@ -1852,8 +1862,8 @@ def test_from_records_bad_index_column(self): tm.assert_index_equal(df1.index, Index(df.C)) # should fail - self.assertRaises(ValueError, DataFrame.from_records, df, index=[2]) - self.assertRaises(KeyError, DataFrame.from_records, df, index=2) + pytest.raises(ValueError, DataFrame.from_records, df, index=[2]) + pytest.raises(KeyError, DataFrame.from_records, df, index=2) def test_from_records_non_tuple(self): class Record(object): @@ -1879,14 +1889,21 @@ def test_from_records_len0_with_columns(self): result = DataFrame.from_records([], index='foo', columns=['foo', 'bar']) - self.assertTrue(np.array_equal(result.columns, ['bar'])) - self.assertEqual(len(result), 0) - self.assertEqual(result.index.name, 'foo') + assert np.array_equal(result.columns, ['bar']) + assert len(result) == 0 + assert result.index.name == 'foo' + + def test_to_frame_with_falsey_names(self): + # GH 16114 + result = Series(name=0).to_frame().dtypes + expected = Series({0: np.float64}) + tm.assert_series_equal(result, expected) + result = DataFrame(Series(name=0)).dtypes + tm.assert_series_equal(result, expected) -class TestDataFrameConstructorWithDatetimeTZ(tm.TestCase, TestData): - _multiprocess_can_split_ = True +class TestDataFrameConstructorWithDatetimeTZ(TestData): def test_from_dict(self): @@ -1899,8 +1916,8 @@ def test_from_dict(self): # construction df = DataFrame({'A': idx, 'B': dr}) - self.assertTrue(df['A'].dtype, 'M8[ns, US/Eastern') - self.assertTrue(df['A'].name == 'A') + assert df['A'].dtype, 'M8[ns, US/Eastern' + assert df['A'].name == 'A' tm.assert_series_equal(df['A'], Series(idx, name='A')) tm.assert_series_equal(df['B'], Series(dr, name='B')) @@ -1919,8 +1936,28 @@ def test_from_index(self): df2 = DataFrame(Series(idx2)) tm.assert_series_equal(df2[0], Series(idx2, name=0)) -if __name__ == '__main__': - import nose # noqa + def test_frame_dict_constructor_datetime64_1680(self): + dr = date_range('1/1/2012', periods=10) + s = Series(dr, index=dr) + + # it works! + DataFrame({'a': 'foo', 'b': s}, index=dr) + DataFrame({'a': 'foo', 'b': s.values}, index=dr) + + def test_frame_datetime64_mixed_index_ctor_1681(self): + dr = date_range('2011/1/1', '2012/1/1', freq='W-FRI') + ts = Series(dr) + + # it works! 
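Before the construction below, it is worth spelling out why `d['B']` comes out all-null: the Series carries a default integer index that shares no labels with `dr`, so it reindexes to all-missing while the scalar broadcasts. A hedged standalone sketch of that alignment rule:

```python
import pandas as pd

# 'B' aligns on labels: a RangeIndex of 0..4 has no overlap with the
# weekly DatetimeIndex, so every aligned value is missing, while the
# scalar 'foo' broadcasts across all rows.
dr = pd.date_range('2011-01-07', periods=5, freq='W-FRI')
ts = pd.Series(dr)
d = pd.DataFrame({'A': 'foo', 'B': ts}, index=dr)
assert d['B'].isnull().all()
assert (d['A'] == 'foo').all()
```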
+ d = DataFrame({'A': 'foo', 'B': ts}, index=dr) + assert d['B'].isnull().all() + + def test_frame_timeseries_to_records(self): + index = date_range('1/1/2000', periods=10) + df = DataFrame(np.random.randn(10, 3), index=index, + columns=['a', 'b', 'c']) + + result = df.to_records() + result['index'].dtype == 'M8[ns]' - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + result = df.to_records(index=False) diff --git a/pandas/tests/frame/test_convert_to.py b/pandas/tests/frame/test_convert_to.py index 53083a602e183..e0cdca7904db7 100644 --- a/pandas/tests/frame/test_convert_to.py +++ b/pandas/tests/frame/test_convert_to.py @@ -1,8 +1,6 @@ # -*- coding: utf-8 -*- -from __future__ import print_function - -from numpy import nan +import pytest import numpy as np from pandas import compat @@ -10,13 +8,10 @@ date_range) import pandas.util.testing as tm - from pandas.tests.frame.common import TestData -class TestDataFrameConvertTo(tm.TestCase, TestData): - - _multiprocess_can_split_ = True +class TestDataFrameConvertTo(TestData): def test_to_dict(self): test_data = { @@ -27,31 +22,31 @@ def test_to_dict(self): for k, v in compat.iteritems(test_data): for k2, v2 in compat.iteritems(v): - self.assertEqual(v2, recons_data[k][k2]) + assert v2 == recons_data[k][k2] recons_data = DataFrame(test_data).to_dict("l") for k, v in compat.iteritems(test_data): for k2, v2 in compat.iteritems(v): - self.assertEqual(v2, recons_data[k][int(k2) - 1]) + assert v2 == recons_data[k][int(k2) - 1] recons_data = DataFrame(test_data).to_dict("s") for k, v in compat.iteritems(test_data): for k2, v2 in compat.iteritems(v): - self.assertEqual(v2, recons_data[k][k2]) + assert v2 == recons_data[k][k2] recons_data = DataFrame(test_data).to_dict("sp") expected_split = {'columns': ['A', 'B'], 'index': ['1', '2', '3'], - 'data': [[1.0, '1'], [2.0, '2'], [nan, '3']]} + 'data': [[1.0, '1'], [2.0, '2'], [np.nan, '3']]} tm.assert_dict_equal(recons_data, expected_split) recons_data = DataFrame(test_data).to_dict("r") expected_records = [{'A': 1.0, 'B': '1'}, {'A': 2.0, 'B': '2'}, - {'A': nan, 'B': '3'}] - tm.assertIsInstance(recons_data, list) - self.assertEqual(len(recons_data), 3) + {'A': np.nan, 'B': '3'}] + assert isinstance(recons_data, list) + assert len(recons_data) == 3 for l, r in zip(recons_data, expected_records): tm.assert_dict_equal(l, r) @@ -60,7 +55,7 @@ def test_to_dict(self): for k, v in compat.iteritems(test_data): for k2, v2 in compat.iteritems(v): - self.assertEqual(v2, recons_data[k2][k]) + assert v2 == recons_data[k2][k] def test_to_dict_timestamp(self): @@ -77,10 +72,10 @@ def test_to_dict_timestamp(self): expected_records_mixed = [{'A': tsmp, 'B': 1}, {'A': tsmp, 'B': 2}] - self.assertEqual(test_data.to_dict(orient='records'), - expected_records) - self.assertEqual(test_data_mixed.to_dict(orient='records'), - expected_records_mixed) + assert (test_data.to_dict(orient='records') == + expected_records) + assert (test_data_mixed.to_dict(orient='records') == + expected_records_mixed) expected_series = { 'A': Series([tsmp, tsmp], name='A'), @@ -116,16 +111,16 @@ def test_to_dict_timestamp(self): def test_to_dict_invalid_orient(self): df = DataFrame({'A': [0, 1]}) - self.assertRaises(ValueError, df.to_dict, orient='xinvalid') + pytest.raises(ValueError, df.to_dict, orient='xinvalid') def test_to_records_dt64(self): df = DataFrame([["one", "two", "three"], ["four", "five", "six"]], index=date_range("2012-01-01", "2012-01-02")) - self.assertEqual(df.to_records()['index'][0], 
                         df.index[0])
+        assert df.to_records()['index'][0] == df.index[0]

        rs = df.to_records(convert_datetime64=False)
-        self.assertEqual(rs['index'][0], df.index.values[0])
+        assert rs['index'][0] == df.index.values[0]

    def test_to_records_with_multindex(self):
        # GH3189
@@ -134,8 +129,8 @@
        data = np.zeros((8, 4))
        df = DataFrame(data, index=index)
        r = df.to_records(index=True)['level_0']
-        self.assertTrue('bar' in r)
-        self.assertTrue('one' not in r)
+        assert 'bar' in r
+        assert 'one' not in r

    def test_to_records_with_Mapping_type(self):
        import email
@@ -161,16 +156,16 @@ def test_to_records_index_name(self):
        df = DataFrame(np.random.randn(3, 3))
        df.index.name = 'X'
        rs = df.to_records()
-        self.assertIn('X', rs.dtype.fields)
+        assert 'X' in rs.dtype.fields

        df = DataFrame(np.random.randn(3, 3))
        rs = df.to_records()
-        self.assertIn('index', rs.dtype.fields)
+        assert 'index' in rs.dtype.fields

        df.index = MultiIndex.from_tuples([('a', 'x'), ('a', 'y'),
                                           ('b', 'z')])
        df.index.names = ['A', None]
        rs = df.to_records()
-        self.assertIn('level_0', rs.dtype.fields)
+        assert 'level_0' in rs.dtype.fields

    def test_to_records_with_unicode_index(self):
        # GH13172
@@ -179,3 +174,33 @@
            .to_records()
        expected = np.rec.array([('x', 'y')], dtype=[('a', 'O'), ('b', 'O')])
        tm.assert_almost_equal(result, expected)
+
+    def test_to_records_with_unicode_column_names(self):
+        # xref issue: https://github.com/numpy/numpy/issues/2407
+        # Issue #11879: to_records used to raise an exception when used
+        # with column names containing non-ASCII characters in Python 2
+        result = DataFrame(data={u"accented_name_é": [1.0]}).to_records()
+
+        # Note that numpy allows for unicode field names but dtypes need
+        # to be specified using a dictionary instead of a list of tuples.
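+        # (With a list-of-tuples spec such as [('index', '<i8'), ...],
+        # numpy rejects the unicode field name on Python 2; the dict
+        # form of 'names'/'formats' below accepts it.)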
+        expected = np.rec.array(
+            [(0, 1.0)],
+            dtype={"names": ["index", u"accented_name_é"],
+                   "formats": ['<i8', '<f8']})
+        tm.assert_almost_equal(result, expected)
diff --git a/pandas/tests/frame/test_indexing.py b/pandas/tests/frame/test_indexing.py
[...]
        casted = df[df > 0]
        result = casted.get_dtype_counts()
        expected = Series({'float64': 6, 'int32': 1, 'int64': 1})
@@ -313,7 +322,7 @@ def test_getitem_boolean_list(self):
        def _checkit(lst):
            result = df[lst]
-            expected = df.ix[df.index[lst]]
+            expected = df.loc[df.index[lst]]
            assert_frame_equal(result, expected)

        _checkit([True, False, True])
@@ -345,12 +354,13 @@ def test_getitem_ix_mixed_integer(self):
        df = DataFrame(np.random.randn(4, 3),
                       index=[1, 10, 'C', 'E'], columns=[1, 2, 3])

-        result = df.ix[:-1]
-        expected = df.ix[df.index[:-1]]
+        result = df.iloc[:-1]
+        expected = df.loc[df.index[:-1]]
        assert_frame_equal(result, expected)

-        result = df.ix[[1, 10]]
-        expected = df.ix[Index([1, 10], dtype=object)]
+        with catch_warnings(record=True):
+            result = df.ix[[1, 10]]
+            expected = df.ix[Index([1, 10], dtype=object)]
        assert_frame_equal(result, expected)

        # 11320
@@ -367,48 +377,55 @@ def test_getitem_ix_mixed_integer(self):
        assert_frame_equal(result, expected)

    def test_getitem_setitem_ix_negative_integers(self):
-        result = self.frame.ix[:, -1]
+        with catch_warnings(record=True):
+            result = self.frame.ix[:, -1]
        assert_series_equal(result, self.frame['D'])

-        result = self.frame.ix[:, [-1]]
+        with catch_warnings(record=True):
+            result = self.frame.ix[:, [-1]]
        assert_frame_equal(result, self.frame[['D']])

-        result = self.frame.ix[:, [-1, -2]]
+        with catch_warnings(record=True):
+            result = self.frame.ix[:, [-1, -2]]
        assert_frame_equal(result, self.frame[['D', 'C']])

-        self.frame.ix[:, [-1]] = 0
-        self.assertTrue((self.frame['D'] == 0).all())
+        with catch_warnings(record=True):
+            self.frame.ix[:, [-1]] = 0
+        assert (self.frame['D'] == 0).all()

        df = DataFrame(np.random.randn(8, 4))
-        self.assertTrue(isnull(df.ix[:, [-1]].values).all())
+        with catch_warnings(record=True):
+            assert isnull(df.ix[:, [-1]].values).all()

        # #1942
        a = DataFrame(randn(20, 2), index=[chr(x + 65) for x in range(20)])
-        a.ix[-1] = a.ix[-2]
+        with catch_warnings(record=True):
+            a.ix[-1] = a.ix[-2]

-        assert_series_equal(a.ix[-1], a.ix[-2], check_names=False)
-        self.assertEqual(a.ix[-1].name, 'T')
-        self.assertEqual(a.ix[-2].name, 'S')
+        with catch_warnings(record=True):
+            assert_series_equal(a.ix[-1], a.ix[-2], check_names=False)
+            assert a.ix[-1].name == 'T'
+            assert a.ix[-2].name == 'S'

    def test_getattr(self):
        assert_series_equal(self.frame.A, self.frame['A'])
-        self.assertRaises(AttributeError, getattr, self.frame,
-                          'NONEXISTENT_NAME')
+        pytest.raises(AttributeError, getattr, self.frame,
+                      'NONEXISTENT_NAME')

    def test_setattr_column(self):
        df = DataFrame({'foobar': 1}, index=lrange(10))

        df.foobar = 5
-        self.assertTrue((df.foobar == 5).all())
+        assert (df.foobar == 5).all()

    def test_setitem(self):
        # not sure what else to do here
        series = self.frame['A'][::2]
        self.frame['col5'] = series
-        self.assertIn('col5', self.frame)
+        assert 'col5' in self.frame

-        self.assertEqual(len(series), 15)
-        self.assertEqual(len(self.frame), 30)
+        assert len(series) == 15
+        assert len(self.frame) == 30

        exp = np.ravel(np.column_stack((series.values, [np.nan] * 15)))
        exp = Series(exp, index=self.frame.index, name='col5')
@@ -418,13 +435,13 @@ def test_setitem(self):
        self.frame['col6'] = series
        tm.assert_series_equal(series, self.frame['col6'], check_names=False)

-        with tm.assertRaises(KeyError):
+        with pytest.raises(KeyError):
            self.frame[randn(len(self.frame) + 1)] = 1

        # set ndarray
        arr = randn(len(self.frame))
        self.frame['col9'] = arr
-
self.assertTrue((self.frame['col9'] == arr).all()) + assert (self.frame['col9'] == arr).all() self.frame['col7'] = 5 assert((self.frame['col7'] == 5).all()) @@ -441,14 +458,14 @@ def test_setitem(self): def f(): smaller['col10'] = ['1', '2'] - self.assertRaises(com.SettingWithCopyError, f) - self.assertEqual(smaller['col10'].dtype, np.object_) - self.assertTrue((smaller['col10'] == ['1', '2']).all()) + pytest.raises(com.SettingWithCopyError, f) + assert smaller['col10'].dtype == np.object_ + assert (smaller['col10'] == ['1', '2']).all() # with a dtype for dtype in ['int32', 'int64', 'float32', 'float64']: self.frame[dtype] = np.array(arr, dtype=dtype) - self.assertEqual(self.frame[dtype].dtype.name, dtype) + assert self.frame[dtype].dtype.name == dtype # dtype changing GH4204 df = DataFrame([[0, 0]]) @@ -470,7 +487,7 @@ def test_setitem_always_copy(self): self.frame['E'] = s self.frame['E'][5:10] = nan - self.assertTrue(notnull(s[5:10]).all()) + assert notnull(s[5:10]).all() def test_setitem_boolean(self): df = self.frame.copy() @@ -505,8 +522,9 @@ def test_setitem_boolean(self): values[values == 2] = 3 assert_almost_equal(df.values, values) - with assertRaisesRegexp(TypeError, 'Must pass DataFrame with boolean ' - 'values only'): + with tm.assert_raises_regex(TypeError, 'Must pass ' + 'DataFrame with ' + 'boolean values only'): df[df * 0] = 2 # index with DataFrame @@ -524,32 +542,32 @@ def test_setitem_boolean(self): def test_setitem_cast(self): self.frame['D'] = self.frame['D'].astype('i8') - self.assertEqual(self.frame['D'].dtype, np.int64) + assert self.frame['D'].dtype == np.int64 # #669, should not cast? # this is now set to int64, which means a replacement of the column to # the value dtype (and nothing to do with the existing dtype) self.frame['B'] = 0 - self.assertEqual(self.frame['B'].dtype, np.int64) + assert self.frame['B'].dtype == np.int64 # cast if pass array of course self.frame['B'] = np.arange(len(self.frame)) - self.assertTrue(issubclass(self.frame['B'].dtype.type, np.integer)) + assert issubclass(self.frame['B'].dtype.type, np.integer) self.frame['foo'] = 'bar' self.frame['foo'] = 0 - self.assertEqual(self.frame['foo'].dtype, np.int64) + assert self.frame['foo'].dtype == np.int64 self.frame['foo'] = 'bar' self.frame['foo'] = 2.5 - self.assertEqual(self.frame['foo'].dtype, np.float64) + assert self.frame['foo'].dtype == np.float64 self.frame['something'] = 0 - self.assertEqual(self.frame['something'].dtype, np.int64) + assert self.frame['something'].dtype == np.int64 self.frame['something'] = 2 - self.assertEqual(self.frame['something'].dtype, np.int64) + assert self.frame['something'].dtype == np.int64 self.frame['something'] = 2.5 - self.assertEqual(self.frame['something'].dtype, np.float64) + assert self.frame['something'].dtype == np.float64 # GH 7704 # dtype conversion on setting @@ -563,15 +581,15 @@ def test_setitem_cast(self): # Test that data type is preserved . 
#5782 df = DataFrame({'one': np.arange(6, dtype=np.int8)}) df.loc[1, 'one'] = 6 - self.assertEqual(df.dtypes.one, np.dtype(np.int8)) + assert df.dtypes.one == np.dtype(np.int8) df.one = np.int8(7) - self.assertEqual(df.dtypes.one, np.dtype(np.int8)) + assert df.dtypes.one == np.dtype(np.int8) def test_setitem_boolean_column(self): expected = self.frame.copy() mask = self.frame['A'] > 0 - self.frame.ix[mask, 'B'] = 0 + self.frame.loc[mask, 'B'] = 0 expected.values[mask.values, 1] = 0 assert_frame_equal(self.frame, expected) @@ -583,8 +601,8 @@ def test_setitem_corner(self): index=np.arange(3)) del df['B'] df['B'] = [1., 2., 3.] - self.assertIn('B', df) - self.assertEqual(len(df.columns), 2) + assert 'B' in df + assert len(df.columns) == 2 df['A'] = 'beginning' df['E'] = 'foo' @@ -596,29 +614,29 @@ def test_setitem_corner(self): dm = DataFrame(index=self.frame.index) dm['A'] = 'foo' dm['B'] = 'bar' - self.assertEqual(len(dm.columns), 2) - self.assertEqual(dm.values.dtype, np.object_) + assert len(dm.columns) == 2 + assert dm.values.dtype == np.object_ # upcast dm['C'] = 1 - self.assertEqual(dm['C'].dtype, np.int64) + assert dm['C'].dtype == np.int64 dm['E'] = 1. - self.assertEqual(dm['E'].dtype, np.float64) + assert dm['E'].dtype == np.float64 # set existing column dm['A'] = 'bar' - self.assertEqual('bar', dm['A'][0]) + assert 'bar' == dm['A'][0] dm = DataFrame(index=np.arange(3)) dm['A'] = 1 dm['foo'] = 'bar' del dm['foo'] dm['foo'] = 'bar' - self.assertEqual(dm['foo'].dtype, np.object_) + assert dm['foo'].dtype == np.object_ dm['coercable'] = ['1', '2', '3'] - self.assertEqual(dm['coercable'].dtype, np.object_) + assert dm['coercable'].dtype == np.object_ def test_setitem_corner2(self): data = {"title": ['foobar', 'bar', 'foobar'] + ['foobar'] * 17, @@ -627,17 +645,17 @@ def test_setitem_corner2(self): df = DataFrame(data) ix = df[df['title'] == 'bar'].index - df.ix[ix, ['title']] = 'foobar' - df.ix[ix, ['cruft']] = 0 + df.loc[ix, ['title']] = 'foobar' + df.loc[ix, ['cruft']] = 0 - assert(df.ix[1, 'title'] == 'foobar') - assert(df.ix[1, 'cruft'] == 0) + assert df.loc[1, 'title'] == 'foobar' + assert df.loc[1, 'cruft'] == 0 def test_setitem_ambig(self): - # difficulties with mixed-type data + # Difficulties with mixed-type data from decimal import Decimal - # created as float type + # Created as float type dm = DataFrame(index=lrange(3), columns=lrange(3)) coercable_series = Series([Decimal(1) for _ in range(3)], @@ -645,32 +663,29 @@ def test_setitem_ambig(self): uncoercable_series = Series(['foo', 'bzr', 'baz'], index=lrange(3)) dm[0] = np.ones(3) - self.assertEqual(len(dm.columns), 3) - # self.assertIsNone(dm.objects) + assert len(dm.columns) == 3 dm[1] = coercable_series - self.assertEqual(len(dm.columns), 3) - # self.assertIsNone(dm.objects) + assert len(dm.columns) == 3 dm[2] = uncoercable_series - self.assertEqual(len(dm.columns), 3) - # self.assertIsNotNone(dm.objects) - self.assertEqual(dm[2].dtype, np.object_) + assert len(dm.columns) == 3 + assert dm[2].dtype == np.object_ def test_setitem_clear_caches(self): - # GH #304 + # see gh-304 df = DataFrame({'x': [1.1, 2.1, 3.1, 4.1], 'y': [5.1, 6.1, 7.1, 8.1]}, index=[0, 1, 2, 3]) df.insert(2, 'z', np.nan) # cache it foo = df['z'] - - df.ix[2:, 'z'] = 42 + df.loc[df.index[2:], 'z'] = 42 expected = Series([np.nan, np.nan, 42, 42], index=df.index, name='z') - self.assertIsNot(df['z'], foo) - assert_series_equal(df['z'], expected) + + assert df['z'] is not foo + tm.assert_series_equal(df['z'], expected) def test_setitem_None(self): # GH 
#766 @@ -716,79 +731,87 @@ def test_getitem_empty_frame_with_boolean(self): def test_delitem_corner(self): f = self.frame.copy() del f['D'] - self.assertEqual(len(f.columns), 3) - self.assertRaises(KeyError, f.__delitem__, 'D') + assert len(f.columns) == 3 + pytest.raises(KeyError, f.__delitem__, 'D') del f['B'] - self.assertEqual(len(f.columns), 2) + assert len(f.columns) == 2 def test_getitem_fancy_2d(self): f = self.frame - ix = f.ix - assert_frame_equal(ix[:, ['B', 'A']], f.reindex(columns=['B', 'A'])) + with catch_warnings(record=True): + assert_frame_equal(f.ix[:, ['B', 'A']], + f.reindex(columns=['B', 'A'])) subidx = self.frame.index[[5, 4, 1]] - assert_frame_equal(ix[subidx, ['B', 'A']], - f.reindex(index=subidx, columns=['B', 'A'])) + with catch_warnings(record=True): + assert_frame_equal(f.ix[subidx, ['B', 'A']], + f.reindex(index=subidx, columns=['B', 'A'])) # slicing rows, etc. - assert_frame_equal(ix[5:10], f[5:10]) - assert_frame_equal(ix[5:10, :], f[5:10]) - assert_frame_equal(ix[:5, ['A', 'B']], - f.reindex(index=f.index[:5], columns=['A', 'B'])) + with catch_warnings(record=True): + assert_frame_equal(f.ix[5:10], f[5:10]) + assert_frame_equal(f.ix[5:10, :], f[5:10]) + assert_frame_equal(f.ix[:5, ['A', 'B']], + f.reindex(index=f.index[:5], + columns=['A', 'B'])) # slice rows with labels, inclusive! - expected = ix[5:11] - result = ix[f.index[5]:f.index[10]] + with catch_warnings(record=True): + expected = f.ix[5:11] + result = f.ix[f.index[5]:f.index[10]] assert_frame_equal(expected, result) # slice columns - assert_frame_equal(ix[:, :2], f.reindex(columns=['A', 'B'])) + with catch_warnings(record=True): + assert_frame_equal(f.ix[:, :2], f.reindex(columns=['A', 'B'])) # get view - exp = f.copy() - ix[5:10].values[:] = 5 - exp.values[5:10] = 5 - assert_frame_equal(f, exp) + with catch_warnings(record=True): + exp = f.copy() + f.ix[5:10].values[:] = 5 + exp.values[5:10] = 5 + assert_frame_equal(f, exp) - self.assertRaises(ValueError, ix.__getitem__, f > 0.5) + with catch_warnings(record=True): + pytest.raises(ValueError, f.ix.__getitem__, f > 0.5) def test_slice_floats(self): index = [52195.504153, 52196.303147, 52198.369883] df = DataFrame(np.random.rand(3, 2), index=index) - s1 = df.ix[52195.1:52196.5] - self.assertEqual(len(s1), 2) + s1 = df.loc[52195.1:52196.5] + assert len(s1) == 2 - s1 = df.ix[52195.1:52196.6] - self.assertEqual(len(s1), 2) + s1 = df.loc[52195.1:52196.6] + assert len(s1) == 2 - s1 = df.ix[52195.1:52198.9] - self.assertEqual(len(s1), 3) + s1 = df.loc[52195.1:52198.9] + assert len(s1) == 3 def test_getitem_fancy_slice_integers_step(self): df = DataFrame(np.random.randn(10, 5)) # this is OK - result = df.ix[:8:2] # noqa - df.ix[:8:2] = np.nan - self.assertTrue(isnull(df.ix[:8:2]).values.all()) + result = df.iloc[:8:2] # noqa + df.iloc[:8:2] = np.nan + assert isnull(df.iloc[:8:2]).values.all() def test_getitem_setitem_integer_slice_keyerrors(self): df = DataFrame(np.random.randn(10, 5), index=lrange(0, 20, 2)) # this is OK cp = df.copy() - cp.ix[4:10] = 0 - self.assertTrue((cp.ix[4:10] == 0).values.all()) + cp.iloc[4:10] = 0 + assert (cp.iloc[4:10] == 0).values.all() # so is this cp = df.copy() - cp.ix[3:11] = 0 - self.assertTrue((cp.ix[3:11] == 0).values.all()) + cp.iloc[3:11] = 0 + assert (cp.iloc[3:11] == 0).values.all() - result = df.ix[4:10] - result2 = df.ix[3:11] + result = df.iloc[2:6] + result2 = df.loc[3:11] expected = df.reindex([4, 6, 8, 10]) assert_frame_equal(result, expected) @@ -796,18 +819,17 @@ def 
test_getitem_setitem_integer_slice_keyerrors(self): # non-monotonic, raise KeyError df2 = df.iloc[lrange(5) + lrange(5, 10)[::-1]] - self.assertRaises(KeyError, df2.ix.__getitem__, slice(3, 11)) - self.assertRaises(KeyError, df2.ix.__setitem__, slice(3, 11), 0) + pytest.raises(KeyError, df2.loc.__getitem__, slice(3, 11)) + pytest.raises(KeyError, df2.loc.__setitem__, slice(3, 11), 0) def test_setitem_fancy_2d(self): - f = self.frame - - ix = f.ix # noqa # case 1 frame = self.frame.copy() expected = frame.copy() - frame.ix[:, ['B', 'A']] = 1 + + with catch_warnings(record=True): + frame.ix[:, ['B', 'A']] = 1 expected['B'] = 1. expected['A'] = 1. assert_frame_equal(frame, expected) @@ -821,11 +843,12 @@ def test_setitem_fancy_2d(self): subidx = self.frame.index[[5, 4, 1]] values = randn(3, 2) - frame.ix[subidx, ['B', 'A']] = values - frame2.ix[[5, 4, 1], ['B', 'A']] = values + with catch_warnings(record=True): + frame.ix[subidx, ['B', 'A']] = values + frame2.ix[[5, 4, 1], ['B', 'A']] = values - expected['B'].ix[subidx] = values[:, 0] - expected['A'].ix[subidx] = values[:, 1] + expected['B'].ix[subidx] = values[:, 0] + expected['A'].ix[subidx] = values[:, 1] assert_frame_equal(frame, expected) assert_frame_equal(frame2, expected) @@ -833,60 +856,67 @@ def test_setitem_fancy_2d(self): # case 3: slicing rows, etc. frame = self.frame.copy() - expected1 = self.frame.copy() - frame.ix[5:10] = 1. - expected1.values[5:10] = 1. + with catch_warnings(record=True): + expected1 = self.frame.copy() + frame.ix[5:10] = 1. + expected1.values[5:10] = 1. assert_frame_equal(frame, expected1) - expected2 = self.frame.copy() - arr = randn(5, len(frame.columns)) - frame.ix[5:10] = arr - expected2.values[5:10] = arr + with catch_warnings(record=True): + expected2 = self.frame.copy() + arr = randn(5, len(frame.columns)) + frame.ix[5:10] = arr + expected2.values[5:10] = arr assert_frame_equal(frame, expected2) # case 4 - frame = self.frame.copy() - frame.ix[5:10, :] = 1. - assert_frame_equal(frame, expected1) - frame.ix[5:10, :] = arr + with catch_warnings(record=True): + frame = self.frame.copy() + frame.ix[5:10, :] = 1. + assert_frame_equal(frame, expected1) + frame.ix[5:10, :] = arr assert_frame_equal(frame, expected2) # case 5 - frame = self.frame.copy() - frame2 = self.frame.copy() + with catch_warnings(record=True): + frame = self.frame.copy() + frame2 = self.frame.copy() - expected = self.frame.copy() - values = randn(5, 2) + expected = self.frame.copy() + values = randn(5, 2) - frame.ix[:5, ['A', 'B']] = values - expected['A'][:5] = values[:, 0] - expected['B'][:5] = values[:, 1] + frame.ix[:5, ['A', 'B']] = values + expected['A'][:5] = values[:, 0] + expected['B'][:5] = values[:, 1] assert_frame_equal(frame, expected) - frame2.ix[:5, [0, 1]] = values + with catch_warnings(record=True): + frame2.ix[:5, [0, 1]] = values assert_frame_equal(frame2, expected) # case 6: slice rows with labels, inclusive! - frame = self.frame.copy() - expected = self.frame.copy() + with catch_warnings(record=True): + frame = self.frame.copy() + expected = self.frame.copy() - frame.ix[frame.index[5]:frame.index[10]] = 5. - expected.values[5:11] = 5 + frame.ix[frame.index[5]:frame.index[10]] = 5. + expected.values[5:11] = 5 assert_frame_equal(frame, expected) # case 7: slice columns - frame = self.frame.copy() - frame2 = self.frame.copy() - expected = self.frame.copy() + with catch_warnings(record=True): + frame = self.frame.copy() + frame2 = self.frame.copy() + expected = self.frame.copy() - # slice indices - frame.ix[:, 1:3] = 4. 
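These column-slice cases are where `.ix`'s label-versus-position guessing bit hardest; `.ix` was deprecated in 0.20 and removed in pandas 1.0. A sketch of the explicit replacements, including the end-inclusive label slice:

```python
import numpy as np
import pandas as pd

# iloc slices by position (end-exclusive), loc by label (end-inclusive).
df = pd.DataFrame(np.zeros((3, 4)), columns=list('ABCD'))

df.iloc[:, 1:3] = 4.0            # positions 1 and 2 -> columns B, C
assert (df['B'] == 4.0).all() and (df['D'] == 0.0).all()

df.loc[:, 'B':'C'] = 5.0         # labels 'B' through 'C', inclusive
assert (df['C'] == 5.0).all()
```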
- expected.values[:, 1:3] = 4. - assert_frame_equal(frame, expected) + # slice indices + frame.ix[:, 1:3] = 4. + expected.values[:, 1:3] = 4. + assert_frame_equal(frame, expected) - # slice with labels - frame.ix[:, 'B':'C'] = 4. - assert_frame_equal(frame, expected) + # slice with labels + frame.ix[:, 'B':'C'] = 4. + assert_frame_equal(frame, expected) # new corner case of boolean slicing / setting frame = DataFrame(lzip([2, 3, 9, 6, 7], [np.nan] * 5), @@ -899,38 +929,41 @@ def test_setitem_fancy_2d(self): assert_frame_equal(frame, expected) def test_fancy_getitem_slice_mixed(self): - sliced = self.mixed_frame.ix[:, -3:] - self.assertEqual(sliced['D'].dtype, np.float64) + sliced = self.mixed_frame.iloc[:, -3:] + assert sliced['D'].dtype == np.float64 # get view with single block # setting it triggers setting with copy - sliced = self.frame.ix[:, -3:] + sliced = self.frame.iloc[:, -3:] def f(): sliced['C'] = 4. - self.assertRaises(com.SettingWithCopyError, f) - self.assertTrue((self.frame['C'] == 4).all()) + pytest.raises(com.SettingWithCopyError, f) + assert (self.frame['C'] == 4).all() def test_fancy_setitem_int_labels(self): # integer index defers to label-based indexing df = DataFrame(np.random.randn(10, 5), index=np.arange(0, 20, 2)) - tmp = df.copy() - exp = df.copy() - tmp.ix[[0, 2, 4]] = 5 - exp.values[:3] = 5 + with catch_warnings(record=True): + tmp = df.copy() + exp = df.copy() + tmp.ix[[0, 2, 4]] = 5 + exp.values[:3] = 5 assert_frame_equal(tmp, exp) - tmp = df.copy() - exp = df.copy() - tmp.ix[6] = 5 - exp.values[3] = 5 + with catch_warnings(record=True): + tmp = df.copy() + exp = df.copy() + tmp.ix[6] = 5 + exp.values[3] = 5 assert_frame_equal(tmp, exp) - tmp = df.copy() - exp = df.copy() - tmp.ix[:, 2] = 5 + with catch_warnings(record=True): + tmp = df.copy() + exp = df.copy() + tmp.ix[:, 2] = 5 # tmp correctly sets the dtype # so match the exp way @@ -940,133 +973,148 @@ def test_fancy_setitem_int_labels(self): def test_fancy_getitem_int_labels(self): df = DataFrame(np.random.randn(10, 5), index=np.arange(0, 20, 2)) - result = df.ix[[4, 2, 0], [2, 0]] - expected = df.reindex(index=[4, 2, 0], columns=[2, 0]) + with catch_warnings(record=True): + result = df.ix[[4, 2, 0], [2, 0]] + expected = df.reindex(index=[4, 2, 0], columns=[2, 0]) assert_frame_equal(result, expected) - result = df.ix[[4, 2, 0]] - expected = df.reindex(index=[4, 2, 0]) + with catch_warnings(record=True): + result = df.ix[[4, 2, 0]] + expected = df.reindex(index=[4, 2, 0]) assert_frame_equal(result, expected) - result = df.ix[4] - expected = df.xs(4) + with catch_warnings(record=True): + result = df.ix[4] + expected = df.xs(4) assert_series_equal(result, expected) - result = df.ix[:, 3] - expected = df[3] + with catch_warnings(record=True): + result = df.ix[:, 3] + expected = df[3] assert_series_equal(result, expected) def test_fancy_index_int_labels_exceptions(self): df = DataFrame(np.random.randn(10, 5), index=np.arange(0, 20, 2)) - # labels that aren't contained - self.assertRaises(KeyError, df.ix.__setitem__, + with catch_warnings(record=True): + + # labels that aren't contained + pytest.raises(KeyError, df.ix.__setitem__, ([0, 1, 2], [2, 3, 4]), 5) - # try to set indices not contained in frame - self.assertRaises(KeyError, - self.frame.ix.__setitem__, + # try to set indices not contained in frame + pytest.raises(KeyError, self.frame.ix.__setitem__, ['foo', 'bar', 'baz'], 1) - self.assertRaises(KeyError, - self.frame.ix.__setitem__, + pytest.raises(KeyError, self.frame.ix.__setitem__, (slice(None, 
None), ['E']), 1) - # partial setting now allows this GH2578 - # self.assertRaises(KeyError, - # self.frame.ix.__setitem__, - # (slice(None, None), 'E'), 1) + # partial setting now allows this GH2578 + # pytest.raises(KeyError, self.frame.ix.__setitem__, + # (slice(None, None), 'E'), 1) def test_setitem_fancy_mixed_2d(self): - self.mixed_frame.ix[:5, ['C', 'B', 'A']] = 5 - result = self.mixed_frame.ix[:5, ['C', 'B', 'A']] - self.assertTrue((result.values == 5).all()) - self.mixed_frame.ix[5] = np.nan - self.assertTrue(isnull(self.mixed_frame.ix[5]).all()) + with catch_warnings(record=True): + self.mixed_frame.ix[:5, ['C', 'B', 'A']] = 5 + result = self.mixed_frame.ix[:5, ['C', 'B', 'A']] + assert (result.values == 5).all() + + self.mixed_frame.ix[5] = np.nan + assert isnull(self.mixed_frame.ix[5]).all() - self.mixed_frame.ix[5] = self.mixed_frame.ix[6] - assert_series_equal(self.mixed_frame.ix[5], self.mixed_frame.ix[6], - check_names=False) + self.mixed_frame.ix[5] = self.mixed_frame.ix[6] + assert_series_equal(self.mixed_frame.ix[5], self.mixed_frame.ix[6], + check_names=False) # #1432 - df = DataFrame({1: [1., 2., 3.], - 2: [3, 4, 5]}) - self.assertTrue(df._is_mixed_type) + with catch_warnings(record=True): + df = DataFrame({1: [1., 2., 3.], + 2: [3, 4, 5]}) + assert df._is_mixed_type - df.ix[1] = [5, 10] + df.ix[1] = [5, 10] - expected = DataFrame({1: [1., 5., 3.], - 2: [3, 10, 5]}) + expected = DataFrame({1: [1., 5., 3.], + 2: [3, 10, 5]}) - assert_frame_equal(df, expected) + assert_frame_equal(df, expected) def test_ix_align(self): b = Series(randn(10), name=0).sort_values() df_orig = DataFrame(randn(10, 4)) df = df_orig.copy() - df.ix[:, 0] = b - assert_series_equal(df.ix[:, 0].reindex(b.index), b) - - dft = df_orig.T - dft.ix[0, :] = b - assert_series_equal(dft.ix[0, :].reindex(b.index), b) - - df = df_orig.copy() - df.ix[:5, 0] = b - s = df.ix[:5, 0] - assert_series_equal(s, b.reindex(s.index)) - - dft = df_orig.T - dft.ix[0, :5] = b - s = dft.ix[0, :5] - assert_series_equal(s, b.reindex(s.index)) - - df = df_orig.copy() - idx = [0, 1, 3, 5] - df.ix[idx, 0] = b - s = df.ix[idx, 0] - assert_series_equal(s, b.reindex(s.index)) - - dft = df_orig.T - dft.ix[0, idx] = b - s = dft.ix[0, idx] - assert_series_equal(s, b.reindex(s.index)) + with catch_warnings(record=True): + df.ix[:, 0] = b + assert_series_equal(df.ix[:, 0].reindex(b.index), b) + + with catch_warnings(record=True): + dft = df_orig.T + dft.ix[0, :] = b + assert_series_equal(dft.ix[0, :].reindex(b.index), b) + + with catch_warnings(record=True): + df = df_orig.copy() + df.ix[:5, 0] = b + s = df.ix[:5, 0] + assert_series_equal(s, b.reindex(s.index)) + + with catch_warnings(record=True): + dft = df_orig.T + dft.ix[0, :5] = b + s = dft.ix[0, :5] + assert_series_equal(s, b.reindex(s.index)) + + with catch_warnings(record=True): + df = df_orig.copy() + idx = [0, 1, 3, 5] + df.ix[idx, 0] = b + s = df.ix[idx, 0] + assert_series_equal(s, b.reindex(s.index)) + + with catch_warnings(record=True): + dft = df_orig.T + dft.ix[0, idx] = b + s = dft.ix[0, idx] + assert_series_equal(s, b.reindex(s.index)) def test_ix_frame_align(self): b = DataFrame(np.random.randn(3, 4)) df_orig = DataFrame(randn(10, 4)) df = df_orig.copy() - df.ix[:3] = b - out = b.ix[:3] - assert_frame_equal(out, b) + with catch_warnings(record=True): + df.ix[:3] = b + out = b.ix[:3] + assert_frame_equal(out, b) b.sort_index(inplace=True) - df = df_orig.copy() - df.ix[[0, 1, 2]] = b - out = df.ix[[0, 1, 2]].reindex(b.index) - assert_frame_equal(out, b) + with 
catch_warnings(record=True): + df = df_orig.copy() + df.ix[[0, 1, 2]] = b + out = df.ix[[0, 1, 2]].reindex(b.index) + assert_frame_equal(out, b) - df = df_orig.copy() - df.ix[:3] = b - out = df.ix[:3] - assert_frame_equal(out, b.reindex(out.index)) + with catch_warnings(record=True): + df = df_orig.copy() + df.ix[:3] = b + out = df.ix[:3] + assert_frame_equal(out, b.reindex(out.index)) def test_getitem_setitem_non_ix_labels(self): df = tm.makeTimeDataFrame() start, end = df.index[[5, 10]] - result = df.ix[start:end] + result = df.loc[start:end] result2 = df[start:end] expected = df[5:11] assert_frame_equal(result, expected) assert_frame_equal(result2, expected) result = df.copy() - result.ix[start:end] = 0 + result.loc[start:end] = 0 result2 = df.copy() result2[start:end] = 0 expected = df.copy() @@ -1076,13 +1124,13 @@ def test_getitem_setitem_non_ix_labels(self): def test_ix_multi_take(self): df = DataFrame(np.random.randn(3, 2)) - rs = df.ix[df.index == 0, :] + rs = df.loc[df.index == 0, :] xp = df.reindex([0]) assert_frame_equal(rs, xp) """ #1321 df = DataFrame(np.random.randn(3, 2)) - rs = df.ix[df.index==0, df.columns==1] + rs = df.loc[df.index==0, df.columns==1] xp = df.reindex([0], [1]) assert_frame_equal(rs, xp) """ @@ -1090,14 +1138,16 @@ def test_ix_multi_take(self): def test_ix_multi_take_nonint_index(self): df = DataFrame(np.random.randn(3, 2), index=['x', 'y', 'z'], columns=['a', 'b']) - rs = df.ix[[0], [0]] + with catch_warnings(record=True): + rs = df.ix[[0], [0]] xp = df.reindex(['x'], columns=['a']) assert_frame_equal(rs, xp) def test_ix_multi_take_multiindex(self): df = DataFrame(np.random.randn(3, 2), index=['x', 'y', 'z'], columns=[['a', 'b'], ['1', '2']]) - rs = df.ix[[0], [0]] + with catch_warnings(record=True): + rs = df.ix[[0], [0]] xp = df.reindex(['x'], columns=[('a', '1')]) assert_frame_equal(rs, xp) @@ -1105,57 +1155,68 @@ def test_ix_dup(self): idx = Index(['a', 'a', 'b', 'c', 'd', 'd']) df = DataFrame(np.random.randn(len(idx), 3), idx) - sub = df.ix[:'d'] - assert_frame_equal(sub, df) + with catch_warnings(record=True): + sub = df.ix[:'d'] + assert_frame_equal(sub, df) - sub = df.ix['a':'c'] - assert_frame_equal(sub, df.ix[0:4]) + with catch_warnings(record=True): + sub = df.ix['a':'c'] + assert_frame_equal(sub, df.ix[0:4]) - sub = df.ix['b':'d'] - assert_frame_equal(sub, df.ix[2:]) + with catch_warnings(record=True): + sub = df.ix['b':'d'] + assert_frame_equal(sub, df.ix[2:]) def test_getitem_fancy_1d(self): f = self.frame - ix = f.ix # return self if no slicing...for now - self.assertIs(ix[:, :], f) + with catch_warnings(record=True): + assert f.ix[:, :] is f # low dimensional slice - xs1 = ix[2, ['C', 'B', 'A']] + with catch_warnings(record=True): + xs1 = f.ix[2, ['C', 'B', 'A']] xs2 = f.xs(f.index[2]).reindex(['C', 'B', 'A']) - assert_series_equal(xs1, xs2) + tm.assert_series_equal(xs1, xs2) - ts1 = ix[5:10, 2] + with catch_warnings(record=True): + ts1 = f.ix[5:10, 2] ts2 = f[f.columns[2]][5:10] - assert_series_equal(ts1, ts2) + tm.assert_series_equal(ts1, ts2) # positional xs - xs1 = ix[0] + with catch_warnings(record=True): + xs1 = f.ix[0] xs2 = f.xs(f.index[0]) - assert_series_equal(xs1, xs2) + tm.assert_series_equal(xs1, xs2) - xs1 = ix[f.index[5]] + with catch_warnings(record=True): + xs1 = f.ix[f.index[5]] xs2 = f.xs(f.index[5]) - assert_series_equal(xs1, xs2) + tm.assert_series_equal(xs1, xs2) # single column - assert_series_equal(ix[:, 'A'], f['A']) + with catch_warnings(record=True): + assert_series_equal(f.ix[:, 'A'], f['A']) # return view - exp 
= f.copy() - exp.values[5] = 4 - ix[5][:] = 4 - assert_frame_equal(exp, f) + with catch_warnings(record=True): + exp = f.copy() + exp.values[5] = 4 + f.ix[5][:] = 4 + tm.assert_frame_equal(exp, f) - exp.values[:, 1] = 6 - ix[:, 1][:] = 6 - assert_frame_equal(exp, f) + with catch_warnings(record=True): + exp.values[:, 1] = 6 + f.ix[:, 1][:] = 6 + tm.assert_frame_equal(exp, f) # slice of mixed-frame - xs = self.mixed_frame.ix[5] + with catch_warnings(record=True): + xs = self.mixed_frame.ix[5] exp = self.mixed_frame.xs(self.mixed_frame.index[5]) - assert_series_equal(xs, exp) + tm.assert_series_equal(xs, exp) def test_setitem_fancy_1d(self): @@ -1163,62 +1224,71 @@ def test_setitem_fancy_1d(self): frame = self.frame.copy() expected = self.frame.copy() - frame.ix[2, ['C', 'B', 'A']] = [1., 2., 3.] + with catch_warnings(record=True): + frame.ix[2, ['C', 'B', 'A']] = [1., 2., 3.] expected['C'][2] = 1. expected['B'][2] = 2. expected['A'][2] = 3. assert_frame_equal(frame, expected) - frame2 = self.frame.copy() - frame2.ix[2, [3, 2, 1]] = [1., 2., 3.] + with catch_warnings(record=True): + frame2 = self.frame.copy() + frame2.ix[2, [3, 2, 1]] = [1., 2., 3.] assert_frame_equal(frame, expected) # case 2, set a section of a column frame = self.frame.copy() expected = self.frame.copy() - vals = randn(5) - expected.values[5:10, 2] = vals - frame.ix[5:10, 2] = vals + with catch_warnings(record=True): + vals = randn(5) + expected.values[5:10, 2] = vals + frame.ix[5:10, 2] = vals assert_frame_equal(frame, expected) - frame2 = self.frame.copy() - frame2.ix[5:10, 'B'] = vals + with catch_warnings(record=True): + frame2 = self.frame.copy() + frame2.ix[5:10, 'B'] = vals assert_frame_equal(frame, expected) # case 3: full xs frame = self.frame.copy() expected = self.frame.copy() - frame.ix[4] = 5. - expected.values[4] = 5. + with catch_warnings(record=True): + frame.ix[4] = 5. + expected.values[4] = 5. assert_frame_equal(frame, expected) - frame.ix[frame.index[4]] = 6. - expected.values[4] = 6. + with catch_warnings(record=True): + frame.ix[frame.index[4]] = 6. + expected.values[4] = 6. assert_frame_equal(frame, expected) # single column frame = self.frame.copy() expected = self.frame.copy() - frame.ix[:, 'A'] = 7. - expected['A'] = 7. + with catch_warnings(record=True): + frame.ix[:, 'A'] = 7. + expected['A'] = 7. 
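Throughout this file, every surviving `.ix` call is wrapped in `catch_warnings(record=True)` so its deprecation warning cannot fail a warnings-strict test run. The shape of that pattern as a minimal self-contained sketch, with a stand-in `noisy` function rather than a pandas call:

```python
import warnings

def noisy():
    # Stand-in for a deprecated API such as .ix access.
    warnings.warn("deprecated access pattern", DeprecationWarning)
    return 42

# catch_warnings isolates filter state; simplefilter('always') makes
# even normally-ignored DeprecationWarnings land in `caught` instead
# of reaching the console or an error-on-warning filter.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert noisy() == 42

assert len(caught) == 1
assert issubclass(caught[0].category, DeprecationWarning)
```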
assert_frame_equal(frame, expected) def test_getitem_fancy_scalar(self): f = self.frame - ix = f.ix + ix = f.loc + # individual value for col in f.columns: ts = f[col] for idx in f.index[::5]: - self.assertEqual(ix[idx, col], ts[idx]) + assert ix[idx, col] == ts[idx] def test_setitem_fancy_scalar(self): f = self.frame expected = self.frame.copy() - ix = f.ix + ix = f.loc + # individual value for j, col in enumerate(f.columns): ts = f[col] # noqa @@ -1226,19 +1296,20 @@ def test_setitem_fancy_scalar(self): i = f.index.get_loc(idx) val = randn() expected.values[i, j] = val + ix[idx, col] = val assert_frame_equal(f, expected) def test_getitem_fancy_boolean(self): f = self.frame - ix = f.ix + ix = f.loc expected = f.reindex(columns=['B', 'D']) result = ix[:, [False, True, False, True]] assert_frame_equal(result, expected) expected = f.reindex(index=f.index[5:10], columns=['B', 'D']) - result = ix[5:10, [False, True, False, True]] + result = ix[f.index[5:10], [False, True, False, True]] assert_frame_equal(result, expected) boolvec = f.index > f.index[7] @@ -1248,7 +1319,7 @@ def test_getitem_fancy_boolean(self): result = ix[boolvec, :] assert_frame_equal(result, expected) - result = ix[boolvec, 2:] + result = ix[boolvec, f.columns[2:]] expected = f.reindex(index=f.index[boolvec], columns=['C', 'D']) assert_frame_equal(result, expected) @@ -1259,45 +1330,45 @@ def test_setitem_fancy_boolean(self): expected = self.frame.copy() mask = frame['A'] > 0 - frame.ix[mask] = 0. + frame.loc[mask] = 0. expected.values[mask.values] = 0. assert_frame_equal(frame, expected) frame = self.frame.copy() expected = self.frame.copy() - frame.ix[mask, ['A', 'B']] = 0. + frame.loc[mask, ['A', 'B']] = 0. expected.values[mask.values, :2] = 0. assert_frame_equal(frame, expected) def test_getitem_fancy_ints(self): - result = self.frame.ix[[1, 4, 7]] - expected = self.frame.ix[self.frame.index[[1, 4, 7]]] + result = self.frame.iloc[[1, 4, 7]] + expected = self.frame.loc[self.frame.index[[1, 4, 7]]] assert_frame_equal(result, expected) - result = self.frame.ix[:, [2, 0, 1]] - expected = self.frame.ix[:, self.frame.columns[[2, 0, 1]]] + result = self.frame.iloc[:, [2, 0, 1]] + expected = self.frame.loc[:, self.frame.columns[[2, 0, 1]]] assert_frame_equal(result, expected) def test_getitem_setitem_fancy_exceptions(self): - ix = self.frame.ix - with assertRaisesRegexp(IndexingError, 'Too many indexers'): + ix = self.frame.iloc + with tm.assert_raises_regex(IndexingError, 'Too many indexers'): ix[:, :, :] - with assertRaises(IndexingError): + with pytest.raises(IndexingError): ix[:, :, :] = 1 def test_getitem_setitem_boolean_misaligned(self): # boolean index misaligned labels mask = self.frame['A'][::-1] > 1 - result = self.frame.ix[mask] - expected = self.frame.ix[mask[::-1]] + result = self.frame.loc[mask] + expected = self.frame.loc[mask[::-1]] assert_frame_equal(result, expected) cp = self.frame.copy() expected = self.frame.copy() - cp.ix[mask] = 0 - expected.ix[mask] = 0 + cp.loc[mask] = 0 + expected.loc[mask] = 0 assert_frame_equal(cp, expected) def test_getitem_setitem_boolean_multi(self): @@ -1306,105 +1377,105 @@ def test_getitem_setitem_boolean_multi(self): # get k1 = np.array([True, False, True]) k2 = np.array([False, True]) - result = df.ix[k1, k2] - expected = df.ix[[0, 2], [1]] + result = df.loc[k1, k2] + expected = df.loc[[0, 2], [1]] assert_frame_equal(result, expected) expected = df.copy() - df.ix[np.array([True, False, True]), - np.array([False, True])] = 5 - expected.ix[[0, 2], [1]] = 5 + 
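A pattern that recurs throughout these hunks: `.loc` is purely label-based, so the positional windows the old `.ix` tolerated (`ix[5:10, ...]`, `boolvec, 2:`) are translated through the axis first. Condensed, with an illustrative frame:

```python
import numpy as np
import pandas as pd

f = pd.DataFrame(np.random.randn(20, 4), columns=list('ABCD'))

# row positions become labels via the index; a boolean column mask
# is accepted by .loc as-is
result = f.loc[f.index[5:10], [False, True, False, True]]
expected = f.reindex(index=f.index[5:10], columns=['B', 'D'])
pd.testing.assert_frame_equal(result, expected)

# a positional column slice likewise becomes a slice of labels
boolvec = f.index > f.index[7]
result = f.loc[boolvec, f.columns[2:]]
expected = f.reindex(index=f.index[boolvec], columns=['C', 'D'])
pd.testing.assert_frame_equal(result, expected)
```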
df.loc[np.array([True, False, True]), + np.array([False, True])] = 5 + expected.loc[[0, 2], [1]] = 5 assert_frame_equal(df, expected) def test_getitem_setitem_float_labels(self): index = Index([1.5, 2, 3, 4, 5]) df = DataFrame(np.random.randn(5, 5), index=index) - result = df.ix[1.5:4] + result = df.loc[1.5:4] expected = df.reindex([1.5, 2, 3, 4]) assert_frame_equal(result, expected) - self.assertEqual(len(result), 4) + assert len(result) == 4 - result = df.ix[4:5] + result = df.loc[4:5] expected = df.reindex([4, 5]) # reindex with int assert_frame_equal(result, expected, check_index_type=False) - self.assertEqual(len(result), 2) + assert len(result) == 2 - result = df.ix[4:5] + result = df.loc[4:5] expected = df.reindex([4.0, 5.0]) # reindex with float assert_frame_equal(result, expected) - self.assertEqual(len(result), 2) + assert len(result) == 2 # loc_float changes this to work properly - result = df.ix[1:2] + result = df.loc[1:2] expected = df.iloc[0:2] assert_frame_equal(result, expected) - df.ix[1:2] = 0 + df.loc[1:2] = 0 result = df[1:2] - self.assertTrue((result == 0).all().all()) + assert (result == 0).all().all() # #2727 index = Index([1.0, 2.5, 3.5, 4.5, 5.0]) df = DataFrame(np.random.randn(5, 5), index=index) # positional slicing only via iloc! - self.assertRaises(TypeError, lambda: df.iloc[1.0:5]) + pytest.raises(TypeError, lambda: df.iloc[1.0:5]) result = df.iloc[4:5] expected = df.reindex([5.0]) assert_frame_equal(result, expected) - self.assertEqual(len(result), 1) + assert len(result) == 1 cp = df.copy() def f(): cp.iloc[1.0:5] = 0 - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) def f(): result = cp.iloc[1.0:5] == 0 # noqa - self.assertRaises(TypeError, f) - self.assertTrue(result.values.all()) - self.assertTrue((cp.iloc[0:1] == df.iloc[0:1]).values.all()) + pytest.raises(TypeError, f) + assert result.values.all() + assert (cp.iloc[0:1] == df.iloc[0:1]).values.all() cp = df.copy() cp.iloc[4:5] = 0 - self.assertTrue((cp.iloc[4:5] == 0).values.all()) - self.assertTrue((cp.iloc[0:4] == df.iloc[0:4]).values.all()) + assert (cp.iloc[4:5] == 0).values.all() + assert (cp.iloc[0:4] == df.iloc[0:4]).values.all() # float slicing - result = df.ix[1.0:5] + result = df.loc[1.0:5] expected = df assert_frame_equal(result, expected) - self.assertEqual(len(result), 5) + assert len(result) == 5 - result = df.ix[1.1:5] + result = df.loc[1.1:5] expected = df.reindex([2.5, 3.5, 4.5, 5.0]) assert_frame_equal(result, expected) - self.assertEqual(len(result), 4) + assert len(result) == 4 - result = df.ix[4.51:5] + result = df.loc[4.51:5] expected = df.reindex([5.0]) assert_frame_equal(result, expected) - self.assertEqual(len(result), 1) + assert len(result) == 1 - result = df.ix[1.0:5.0] + result = df.loc[1.0:5.0] expected = df.reindex([1.0, 2.5, 3.5, 4.5, 5.0]) assert_frame_equal(result, expected) - self.assertEqual(len(result), 5) + assert len(result) == 5 cp = df.copy() - cp.ix[1.0:5.0] = 0 - result = cp.ix[1.0:5.0] - self.assertTrue((result == 0).values.all()) + cp.loc[1.0:5.0] = 0 + result = cp.loc[1.0:5.0] + assert (result == 0).values.all() def test_setitem_single_column_mixed(self): df = DataFrame(randn(5, 3), index=['a', 'b', 'c', 'd', 'e'], columns=['foo', 'bar', 'baz']) df['str'] = 'qux' - df.ix[::2, 'str'] = nan + df.loc[df.index[::2], 'str'] = nan expected = np.array([nan, 'qux', nan, 'qux', nan], dtype=object) assert_almost_equal(df['str'].values, expected) @@ -1420,28 +1491,28 @@ def test_setitem_single_column_mixed_datetime(self): assert_series_equal(result, 
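`test_getitem_setitem_float_labels` is the motivating case for retiring `.ix`: on a float index it is ambiguous whether a number means a position or a label. The split indexers make the choice explicit, with label slices inclusive of both endpoints. The core assertions, standalone:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 5), index=[1.0, 2.5, 3.5, 4.5, 5.0])

# .loc slices by label, inclusive on both ends
assert len(df.loc[1.0:5.0]) == 5
assert len(df.loc[1.1:5]) == 4    # 2.5, 3.5, 4.5, 5.0
assert len(df.loc[4.51:5]) == 1   # 5.0 only

# positional slicing is .iloc's job, and float bounds are an error
try:
    df.iloc[1.0:5]
except TypeError:
    pass  # positional slicing only via iloc (#2727)
```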
expected) # set an allowable datetime64 type - from pandas import tslib - df.ix['b', 'timestamp'] = tslib.iNaT - self.assertTrue(isnull(df.ix['b', 'timestamp'])) + df.loc['b', 'timestamp'] = iNaT + assert isnull(df.loc['b', 'timestamp']) # allow this syntax - df.ix['c', 'timestamp'] = nan - self.assertTrue(isnull(df.ix['c', 'timestamp'])) + df.loc['c', 'timestamp'] = nan + assert isnull(df.loc['c', 'timestamp']) # allow this syntax - df.ix['d', :] = nan - self.assertTrue(isnull(df.ix['c', :]).all() == False) # noqa + df.loc['d', :] = nan + assert not isnull(df.loc['c', :]).all() # as of GH 3216 this will now work! # try to set with a list like item - # self.assertRaises( - # Exception, df.ix.__setitem__, ('d', 'timestamp'), [nan]) + # pytest.raises( + # Exception, df.loc.__setitem__, ('d', 'timestamp'), [nan]) def test_setitem_frame(self): - piece = self.frame.ix[:2, ['A', 'B']] - self.frame.ix[-2:, ['A', 'B']] = piece.values - assert_almost_equal(self.frame.ix[-2:, ['A', 'B']].values, - piece.values) + piece = self.frame.loc[self.frame.index[:2], ['A', 'B']] + self.frame.loc[self.frame.index[-2]:, ['A', 'B']] = piece.values + result = self.frame.loc[self.frame.index[-2:], ['A', 'B']].values + expected = piece.values + assert_almost_equal(result, expected) # GH 3216 @@ -1450,8 +1521,8 @@ def test_setitem_frame(self): piece = DataFrame([[1., 2.], [3., 4.]], index=f.index[0:2], columns=['A', 'B']) key = (slice(None, 2), ['A', 'B']) - f.ix[key] = piece - assert_almost_equal(f.ix[0:2, ['A', 'B']].values, + f.loc[key] = piece + assert_almost_equal(f.loc[f.index[0:2], ['A', 'B']].values, piece.values) # rows unaligned @@ -1460,60 +1531,61 @@ def test_setitem_frame(self): index=list(f.index[0:2]) + ['foo', 'bar'], columns=['A', 'B']) key = (slice(None, 2), ['A', 'B']) - f.ix[key] = piece - assert_almost_equal(f.ix[0:2:, ['A', 'B']].values, + f.loc[key] = piece + assert_almost_equal(f.loc[f.index[0:2:], ['A', 'B']].values, piece.values[0:2]) # key is unaligned with values f = self.mixed_frame.copy() - piece = f.ix[:2, ['A']] + piece = f.loc[f.index[:2], ['A']] piece.index = f.index[-2:] key = (slice(-2, None), ['A', 'B']) - f.ix[key] = piece + f.loc[key] = piece piece['B'] = np.nan - assert_almost_equal(f.ix[-2:, ['A', 'B']].values, + assert_almost_equal(f.loc[f.index[-2:], ['A', 'B']].values, piece.values) # ndarray f = self.mixed_frame.copy() - piece = self.mixed_frame.ix[:2, ['A', 'B']] + piece = self.mixed_frame.loc[f.index[:2], ['A', 'B']] key = (slice(-2, None), ['A', 'B']) - f.ix[key] = piece.values - assert_almost_equal(f.ix[-2:, ['A', 'B']].values, + f.loc[key] = piece.values + assert_almost_equal(f.loc[f.index[-2:], ['A', 'B']].values, piece.values) # needs upcasting df = DataFrame([[1, 2, 'foo'], [3, 4, 'bar']], columns=['A', 'B', 'C']) df2 = df.copy() - df2.ix[:, ['A', 'B']] = df.ix[:, ['A', 'B']] + 0.5 + df2.loc[:, ['A', 'B']] = df.loc[:, ['A', 'B']] + 0.5 expected = df.reindex(columns=['A', 'B']) expected += 0.5 expected['C'] = df['C'] assert_frame_equal(df2, expected) def test_setitem_frame_align(self): - piece = self.frame.ix[:2, ['A', 'B']] + piece = self.frame.loc[self.frame.index[:2], ['A', 'B']] piece.index = self.frame.index[-2:] piece.columns = ['A', 'B'] - self.frame.ix[-2:, ['A', 'B']] = piece - assert_almost_equal(self.frame.ix[-2:, ['A', 'B']].values, - piece.values) + self.frame.loc[self.frame.index[-2:], ['A', 'B']] = piece + result = self.frame.loc[self.frame.index[-2:], ['A', 'B']].values + expected = piece.values + assert_almost_equal(result, expected) def 
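`test_setitem_frame` above draws the line between assigning a DataFrame, which is aligned on its labels first, and assigning raw `.values`, which is written positionally. A compact sketch of that pitfall, with made-up numbers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8.0).reshape(4, 2), columns=['A', 'B'])
piece = pd.DataFrame([[100.0, 200.0], [300.0, 400.0]],
                     index=[10, 11], columns=['A', 'B'])

# a DataFrame value is aligned on its index first; nothing matches the
# target labels here, so the assignment writes NaN
aligned = df.copy()
aligned.loc[df.index[-2:], ['A', 'B']] = piece
assert aligned.loc[df.index[-2:], ['A', 'B']].isnull().all().all()

# bypassing alignment with .values writes the numbers positionally
raw = df.copy()
raw.loc[df.index[-2:], ['A', 'B']] = piece.values
assert (raw.loc[df.index[-2:], 'A'] == [100.0, 300.0]).all()
```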
test_getitem_setitem_ix_duplicates(self): # #1201 df = DataFrame(np.random.randn(5, 3), index=['foo', 'foo', 'bar', 'baz', 'bar']) - result = df.ix['foo'] + result = df.loc['foo'] expected = df[:2] assert_frame_equal(result, expected) - result = df.ix['bar'] - expected = df.ix[[2, 4]] + result = df.loc['bar'] + expected = df.iloc[[2, 4]] assert_frame_equal(result, expected) - result = df.ix['baz'] - expected = df.ix[3] + result = df.loc['baz'] + expected = df.iloc[3] assert_series_equal(result, expected) def test_getitem_ix_boolean_duplicates_multiple(self): @@ -1521,15 +1593,15 @@ def test_getitem_ix_boolean_duplicates_multiple(self): df = DataFrame(np.random.randn(5, 3), index=['foo', 'foo', 'bar', 'baz', 'bar']) - result = df.ix[['bar']] - exp = df.ix[[2, 4]] + result = df.loc[['bar']] + exp = df.iloc[[2, 4]] assert_frame_equal(result, exp) - result = df.ix[df[1] > 0] + result = df.loc[df[1] > 0] exp = df[df[1] > 0] assert_frame_equal(result, exp) - result = df.ix[df[0] > 0] + result = df.loc[df[0] > 0] exp = df[df[0] > 0] assert_frame_equal(result, exp) @@ -1537,11 +1609,11 @@ def test_getitem_setitem_ix_bool_keyerror(self): # #2199 df = DataFrame({'a': [1, 2, 3]}) - self.assertRaises(KeyError, df.ix.__getitem__, False) - self.assertRaises(KeyError, df.ix.__getitem__, True) + pytest.raises(KeyError, df.loc.__getitem__, False) + pytest.raises(KeyError, df.loc.__getitem__, True) - self.assertRaises(KeyError, df.ix.__setitem__, False, 0) - self.assertRaises(KeyError, df.ix.__setitem__, True, 0) + pytest.raises(KeyError, df.loc.__setitem__, False, 0) + pytest.raises(KeyError, df.loc.__setitem__, True, 0) def test_getitem_list_duplicates(self): # #1943 @@ -1549,9 +1621,9 @@ def test_getitem_list_duplicates(self): df.columns.name = 'foo' result = df[['B', 'C']] - self.assertEqual(result.columns.name, 'foo') + assert result.columns.name == 'foo' - expected = df.ix[:, 2:] + expected = df.iloc[:, 2:] assert_frame_equal(result, expected) def test_get_value(self): @@ -1559,7 +1631,7 @@ def test_get_value(self): for col in self.frame.columns: result = self.frame.get_value(idx, col) expected = self.frame[col][idx] - self.assertEqual(result, expected) + assert result == expected def test_lookup(self): def alt(df, rows, cols, dtype): @@ -1585,46 +1657,46 @@ def testit(df): df['mask'] = df.lookup(df.index, 'mask_' + df['label']) exp_mask = alt(df, df.index, 'mask_' + df['label'], dtype=np.bool_) tm.assert_series_equal(df['mask'], pd.Series(exp_mask, name='mask')) - self.assertEqual(df['mask'].dtype, np.bool_) + assert df['mask'].dtype == np.bool_ - with tm.assertRaises(KeyError): + with pytest.raises(KeyError): self.frame.lookup(['xyz'], ['A']) - with tm.assertRaises(KeyError): + with pytest.raises(KeyError): self.frame.lookup([self.frame.index[0]], ['xyz']) - with tm.assertRaisesRegexp(ValueError, 'same size'): + with tm.assert_raises_regex(ValueError, 'same size'): self.frame.lookup(['a', 'b', 'c'], ['a']) def test_set_value(self): for idx in self.frame.index: for col in self.frame.columns: self.frame.set_value(idx, col, 1) - self.assertEqual(self.frame[col][idx], 1) + assert self.frame[col][idx] == 1 def test_set_value_resize(self): res = self.frame.set_value('foobar', 'B', 0) - self.assertIs(res, self.frame) - self.assertEqual(res.index[-1], 'foobar') - self.assertEqual(res.get_value('foobar', 'B'), 0) + assert res is self.frame + assert res.index[-1] == 'foobar' + assert res.get_value('foobar', 'B') == 0 self.frame.loc['foobar', 'qux'] = 0 - self.assertEqual(self.frame.get_value('foobar', 'qux'), 
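`test_getitem_setitem_ix_duplicates` (#1201) pins down lookup under duplicate row labels, and the rewrite spells the positional expectations with `.iloc`. In isolation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['foo', 'foo', 'bar', 'baz', 'bar'])

# a duplicated label selects every matching row (positions 2 and 4 here)
pd.testing.assert_frame_equal(df.loc['bar'], df.iloc[[2, 4]])

# a unique label drops a dimension and comes back as a Series
pd.testing.assert_series_equal(df.loc['baz'], df.iloc[3])
```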
0) + assert self.frame.get_value('foobar', 'qux') == 0 res = self.frame.copy() res3 = res.set_value('foobar', 'baz', 'sam') - self.assertEqual(res3['baz'].dtype, np.object_) + assert res3['baz'].dtype == np.object_ res = self.frame.copy() res3 = res.set_value('foobar', 'baz', True) - self.assertEqual(res3['baz'].dtype, np.object_) + assert res3['baz'].dtype == np.object_ res = self.frame.copy() res3 = res.set_value('foobar', 'baz', 5) - self.assertTrue(is_float_dtype(res3['baz'])) - self.assertTrue(isnull(res3['baz'].drop(['foobar'])).all()) - self.assertRaises(ValueError, res3.set_value, 'foobar', 'baz', 'sam') + assert is_float_dtype(res3['baz']) + assert isnull(res3['baz'].drop(['foobar'])).all() + pytest.raises(ValueError, res3.set_value, 'foobar', 'baz', 'sam') def test_set_value_with_index_dtype_change(self): df_orig = DataFrame(randn(3, 3), index=lrange(3), columns=list('ABC')) @@ -1633,65 +1705,81 @@ def test_set_value_with_index_dtype_change(self): # so column is not created df = df_orig.copy() df.set_value('C', 2, 1.0) - self.assertEqual(list(df.index), list(df_orig.index) + ['C']) - # self.assertEqual(list(df.columns), list(df_orig.columns) + [2]) + assert list(df.index) == list(df_orig.index) + ['C'] + # assert list(df.columns) == list(df_orig.columns) + [2] df = df_orig.copy() df.loc['C', 2] = 1.0 - self.assertEqual(list(df.index), list(df_orig.index) + ['C']) - # self.assertEqual(list(df.columns), list(df_orig.columns) + [2]) + assert list(df.index) == list(df_orig.index) + ['C'] + # assert list(df.columns) == list(df_orig.columns) + [2] # create both new df = df_orig.copy() df.set_value('C', 'D', 1.0) - self.assertEqual(list(df.index), list(df_orig.index) + ['C']) - self.assertEqual(list(df.columns), list(df_orig.columns) + ['D']) + assert list(df.index) == list(df_orig.index) + ['C'] + assert list(df.columns) == list(df_orig.columns) + ['D'] df = df_orig.copy() df.loc['C', 'D'] = 1.0 - self.assertEqual(list(df.index), list(df_orig.index) + ['C']) - self.assertEqual(list(df.columns), list(df_orig.columns) + ['D']) + assert list(df.index) == list(df_orig.index) + ['C'] + assert list(df.columns) == list(df_orig.columns) + ['D'] def test_get_set_value_no_partial_indexing(self): # partial w/ MultiIndex raise exception index = MultiIndex.from_tuples([(0, 1), (0, 2), (1, 1), (1, 2)]) df = DataFrame(index=index, columns=lrange(4)) - self.assertRaises(KeyError, df.get_value, 0, 1) - # self.assertRaises(KeyError, df.set_value, 0, 1, 0) + pytest.raises(KeyError, df.get_value, 0, 1) + # pytest.raises(KeyError, df.set_value, 0, 1, 0) def test_single_element_ix_dont_upcast(self): self.frame['E'] = 1 - self.assertTrue(issubclass(self.frame['E'].dtype.type, - (int, np.integer))) + assert issubclass(self.frame['E'].dtype.type, (int, np.integer)) - result = self.frame.ix[self.frame.index[5], 'E'] - self.assertTrue(is_integer(result)) + with catch_warnings(record=True): + result = self.frame.ix[self.frame.index[5], 'E'] + assert is_integer(result) - def test_irow(self): - df = DataFrame(np.random.randn(10, 4), index=lrange(0, 20, 2)) + result = self.frame.loc[self.frame.index[5], 'E'] + assert is_integer(result) + + # GH 11617 + df = pd.DataFrame(dict(a=[1.23])) + df["b"] = 666 + + with catch_warnings(record=True): + result = df.ix[0, "b"] + assert is_integer(result) + result = df.loc[0, "b"] + assert is_integer(result) - # 10711, deprecated - with tm.assert_produces_warning(FutureWarning): - df.irow(1) + expected = Series([666], [0], name='b') + with catch_warnings(record=True): + result = 
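`test_set_value_with_index_dtype_change` shows that plain `.loc` assignment, like `set_value`, enlarges the frame when the target labels do not exist yet. The `.loc` spelling, which is the one that survived deprecation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 3), index=range(3), columns=list('ABC'))

# assigning to labels that do not exist yet enlarges the frame: a new
# row 'C' and a new column 'D' appear, NaN-filled elsewhere
df.loc['C', 'D'] = 1.0
assert list(df.index) == [0, 1, 2, 'C']
assert list(df.columns) == ['A', 'B', 'C', 'D']
assert df.loc['C', 'D'] == 1.0
```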
df.ix[[0], "b"] + assert_series_equal(result, expected) + result = df.loc[[0], "b"] + assert_series_equal(result, expected) + + def test_iloc_row(self): + df = DataFrame(np.random.randn(10, 4), index=lrange(0, 20, 2)) result = df.iloc[1] - exp = df.ix[2] + exp = df.loc[2] assert_series_equal(result, exp) result = df.iloc[2] - exp = df.ix[4] + exp = df.loc[4] assert_series_equal(result, exp) # slice result = df.iloc[slice(4, 8)] - expected = df.ix[8:14] + expected = df.loc[8:14] assert_frame_equal(result, expected) # verify slice is view # setting it makes it raise/warn def f(): result[2] = 0. - self.assertRaises(com.SettingWithCopyError, f) + pytest.raises(com.SettingWithCopyError, f) exp_col = df[2].copy() exp_col[4:8] = 0. assert_series_equal(df[2], exp_col) @@ -1701,54 +1789,51 @@ def f(): expected = df.reindex(df.index[[1, 2, 4, 6]]) assert_frame_equal(result, expected) - def test_icol(self): + def test_iloc_col(self): df = DataFrame(np.random.randn(4, 10), columns=lrange(0, 20, 2)) - # 10711, deprecated - with tm.assert_produces_warning(FutureWarning): - df.icol(1) - result = df.iloc[:, 1] - exp = df.ix[:, 2] + exp = df.loc[:, 2] assert_series_equal(result, exp) result = df.iloc[:, 2] - exp = df.ix[:, 4] + exp = df.loc[:, 4] assert_series_equal(result, exp) # slice result = df.iloc[:, slice(4, 8)] - expected = df.ix[:, 8:14] + expected = df.loc[:, 8:14] assert_frame_equal(result, expected) # verify slice is view # and that we are setting a copy def f(): result[8] = 0. - self.assertRaises(com.SettingWithCopyError, f) - self.assertTrue((df[8] == 0).all()) + pytest.raises(com.SettingWithCopyError, f) + assert (df[8] == 0).all() # list of integers result = df.iloc[:, [1, 2, 4, 6]] expected = df.reindex(columns=df.columns[[1, 2, 4, 6]]) assert_frame_equal(result, expected) - def test_irow_icol_duplicates(self): - # 10711, deprecated + def test_iloc_duplicates(self): df = DataFrame(np.random.rand(3, 3), columns=list('ABC'), index=list('aab')) result = df.iloc[0] - result2 = df.ix[0] - tm.assertIsInstance(result, Series) + with catch_warnings(record=True): + result2 = df.ix[0] + assert isinstance(result, Series) assert_almost_equal(result.values, df.values[0]) assert_series_equal(result, result2) - result = df.T.iloc[:, 0] - result2 = df.T.ix[:, 0] - tm.assertIsInstance(result, Series) + with catch_warnings(record=True): + result = df.T.iloc[:, 0] + result2 = df.T.ix[:, 0] + assert isinstance(result, Series) assert_almost_equal(result.values, df.values[0]) assert_series_equal(result, result2) @@ -1757,16 +1842,19 @@ def test_irow_icol_duplicates(self): columns=[['i', 'i', 'j'], ['A', 'A', 'B']], index=[['i', 'i', 'j'], ['X', 'X', 'Y']]) - rs = df.iloc[0] - xp = df.ix[0] + with catch_warnings(record=True): + rs = df.iloc[0] + xp = df.ix[0] assert_series_equal(rs, xp) - rs = df.iloc[:, 0] - xp = df.T.ix[0] + with catch_warnings(record=True): + rs = df.iloc[:, 0] + xp = df.T.ix[0] assert_series_equal(rs, xp) - rs = df.iloc[:, [0]] - xp = df.ix[:, [0]] + with catch_warnings(record=True): + rs = df.iloc[:, [0]] + xp = df.ix[:, [0]] assert_frame_equal(rs, xp) # #2259 @@ -1775,22 +1863,18 @@ def test_irow_icol_duplicates(self): expected = df.take([0], axis=1) assert_frame_equal(result, expected) - def test_icol_sparse_propegate_fill_value(self): - from pandas.sparse.api import SparseDataFrame + def test_iloc_sparse_propegate_fill_value(self): + from pandas.core.sparse.api import SparseDataFrame df = SparseDataFrame({'A': [999, 1]}, default_fill_value=999) - self.assertTrue(len(df['A'].sp_values) == 
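`test_iloc_row` in this hunk asserts two things: an `.iloc` slice is positional (so on this index it lands on the labels 8..14), and the slice is a view, so writing through it trips the chained-assignment check. A sketch, assuming pandas >= 1.5 (where `SettingWithCopyError` is public in `pandas.errors`; the patch itself imports it from `pandas.core.common`) and copy-on-write disabled:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4), index=range(0, 20, 2))

# positions 4:8 correspond to the integer *labels* 8, 10, 12, 14 here
pd.testing.assert_frame_equal(df.iloc[4:8], df.loc[8:14])

# the slice is a view; assigning through it is chained assignment,
# which the test suite runs with mode.chained_assignment='raise'
with pd.option_context('mode.chained_assignment', 'raise'):
    view = df.iloc[4:8]
    try:
        view[2] = 0.0
    except pd.errors.SettingWithCopyError:
        pass
```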
len(df.iloc[:, 0].sp_values)) - - def test_iget_value(self): - # 10711 deprecated + assert len(df['A'].sp_values) == len(df.iloc[:, 0].sp_values) - with tm.assert_produces_warning(FutureWarning): - self.frame.iget_value(0, 0) + def test_iat(self): for i, row in enumerate(self.frame.index): for j, col in enumerate(self.frame.columns): result = self.frame.iat[i, j] expected = self.frame.at[row, col] - self.assertEqual(result, expected) + assert result == expected def test_nested_exception(self): # Ignore the strange way of triggering the problem @@ -1806,7 +1890,7 @@ def test_nested_exception(self): try: repr(df) except Exception as e: - self.assertNotEqual(type(e), UnboundLocalError) + assert type(e) != UnboundLocalError def test_reindex_methods(self): df = pd.DataFrame({'x': list(range(5))}) @@ -1844,6 +1928,21 @@ def test_reindex_methods(self): actual = df.reindex(target, method='nearest', tolerance=0.2) assert_frame_equal(expected, actual) + def test_reindex_frame_add_nat(self): + rng = date_range('1/1/2000 00:00:00', periods=10, freq='10s') + df = DataFrame({'A': np.random.randn(len(rng)), 'B': rng}) + + result = df.reindex(lrange(15)) + assert np.issubdtype(result['B'].dtype, np.dtype('M8[ns]')) + + mask = com.isnull(result)['B'] + assert mask[-5:].all() + assert not mask[:-5].any() + + def test_set_dataframe_column_ns_dtype(self): + x = DataFrame([datetime.now(), datetime.now()]) + assert x[0].dtype == np.dtype('M8[ns]') + def test_non_monotonic_reindex_methods(self): dr = pd.date_range('2013-08-01', periods=6, freq='B') data = np.random.randn(6, 1) @@ -1851,11 +1950,10 @@ def test_non_monotonic_reindex_methods(self): df_rev = pd.DataFrame(data, index=dr[[3, 4, 5] + [0, 1, 2]], columns=list('A')) # index is not monotonic increasing or decreasing - self.assertRaises(ValueError, df_rev.reindex, df.index, method='pad') - self.assertRaises(ValueError, df_rev.reindex, df.index, method='ffill') - self.assertRaises(ValueError, df_rev.reindex, df.index, method='bfill') - self.assertRaises(ValueError, df_rev.reindex, - df.index, method='nearest') + pytest.raises(ValueError, df_rev.reindex, df.index, method='pad') + pytest.raises(ValueError, df_rev.reindex, df.index, method='ffill') + pytest.raises(ValueError, df_rev.reindex, df.index, method='bfill') + pytest.raises(ValueError, df_rev.reindex, df.index, method='nearest') def test_reindex_level(self): from itertools import permutations @@ -1933,7 +2031,8 @@ def test_getitem_ix_float_duplicates(self): index=[0.1, 0.2, 0.2], columns=list('abc')) expect = df.iloc[1:] assert_frame_equal(df.loc[0.2], expect) - assert_frame_equal(df.ix[0.2], expect) + with catch_warnings(record=True): + assert_frame_equal(df.ix[0.2], expect) expect = df.iloc[1:, 0] assert_series_equal(df.loc[0.2, 'a'], expect) @@ -1941,7 +2040,8 @@ def test_getitem_ix_float_duplicates(self): df.index = [1, 0.2, 0.2] expect = df.iloc[1:] assert_frame_equal(df.loc[0.2], expect) - assert_frame_equal(df.ix[0.2], expect) + with catch_warnings(record=True): + assert_frame_equal(df.ix[0.2], expect) expect = df.iloc[1:, 0] assert_series_equal(df.loc[0.2, 'a'], expect) @@ -1950,7 +2050,8 @@ def test_getitem_ix_float_duplicates(self): index=[1, 0.2, 0.2, 1], columns=list('abc')) expect = df.iloc[1:-1] assert_frame_equal(df.loc[0.2], expect) - assert_frame_equal(df.ix[0.2], expect) + with catch_warnings(record=True): + assert_frame_equal(df.ix[0.2], expect) expect = df.iloc[1:-1, 0] assert_series_equal(df.loc[0.2, 'a'], expect) @@ -1958,7 +2059,8 @@ def test_getitem_ix_float_duplicates(self): 
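`test_iat` (the renamed `test_iget_value`) cross-checks the positional scalar accessor `.iat` against the label-based `.at` cell by cell; the same loop works on any frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))

# .iat takes positions, .at takes labels; they must agree everywhere
for i, row in enumerate(frame.index):
    for j, col in enumerate(frame.columns):
        assert frame.iat[i, j] == frame.at[row, col]
```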
df.index = [0.1, 0.2, 2, 0.2] expect = df.iloc[[1, -1]] assert_frame_equal(df.loc[0.2], expect) - assert_frame_equal(df.ix[0.2], expect) + with catch_warnings(record=True): + assert_frame_equal(df.ix[0.2], expect) expect = df.iloc[[1, -1], 0] assert_series_equal(df.loc[0.2, 'a'], expect) @@ -1993,13 +2095,13 @@ def test_setitem_with_unaligned_tz_aware_datetime_column(self): assert_series_equal(df['dates'], column) def test_setitem_datetime_coercion(self): - # GH 1048 + # gh-1048 df = pd.DataFrame({'c': [pd.Timestamp('2010-10-01')] * 3}) df.loc[0:1, 'c'] = np.datetime64('2008-08-08') - self.assertEqual(pd.Timestamp('2008-08-08'), df.loc[0, 'c']) - self.assertEqual(pd.Timestamp('2008-08-08'), df.loc[1, 'c']) + assert pd.Timestamp('2008-08-08') == df.loc[0, 'c'] + assert pd.Timestamp('2008-08-08') == df.loc[1, 'c'] df.loc[2, 'c'] = date(2005, 5, 5) - self.assertEqual(pd.Timestamp('2005-05-05'), df.loc[2, 'c']) + assert pd.Timestamp('2005-05-05') == df.loc[2, 'c'] def test_setitem_datetimelike_with_inference(self): # GH 7592 @@ -2010,11 +2112,13 @@ def test_setitem_datetimelike_with_inference(self): df['A'] = np.array([1 * one_hour] * 4, dtype='m8[ns]') df.loc[:, 'B'] = np.array([2 * one_hour] * 4, dtype='m8[ns]') df.loc[:3, 'C'] = np.array([3 * one_hour] * 3, dtype='m8[ns]') - df.ix[:, 'D'] = np.array([4 * one_hour] * 4, dtype='m8[ns]') - df.ix[:3, 'E'] = np.array([5 * one_hour] * 3, dtype='m8[ns]') + df.loc[:, 'D'] = np.array([4 * one_hour] * 4, dtype='m8[ns]') + df.loc[df.index[:3], 'E'] = np.array([5 * one_hour] * 3, + dtype='m8[ns]') df['F'] = np.timedelta64('NaT') - df.ix[:-1, 'F'] = np.array([6 * one_hour] * 3, dtype='m8[ns]') - df.ix[-3:, 'G'] = date_range('20130101', periods=3) + df.loc[df.index[:-1], 'F'] = np.array([6 * one_hour] * 3, + dtype='m8[ns]') + df.loc[df.index[-3]:, 'G'] = date_range('20130101', periods=3) df['H'] = np.datetime64('NaT') result = df.dtypes expected = Series([np.dtype('timedelta64[ns]')] * 6 + @@ -2031,41 +2135,41 @@ def test_at_time_between_time_datetimeindex(self): binds = [26, 27, 28, 74, 75, 76, 122, 123, 124, 170, 171, 172] result = df.at_time(akey) - expected = df.ix[akey] - expected2 = df.ix[ainds] + expected = df.loc[akey] + expected2 = df.iloc[ainds] assert_frame_equal(result, expected) assert_frame_equal(result, expected2) - self.assertEqual(len(result), 4) + assert len(result) == 4 result = df.between_time(bkey.start, bkey.stop) - expected = df.ix[bkey] - expected2 = df.ix[binds] + expected = df.loc[bkey] + expected2 = df.iloc[binds] assert_frame_equal(result, expected) assert_frame_equal(result, expected2) - self.assertEqual(len(result), 12) + assert len(result) == 12 result = df.copy() - result.ix[akey] = 0 - result = result.ix[akey] - expected = df.ix[akey].copy() - expected.ix[:] = 0 + result.loc[akey] = 0 + result = result.loc[akey] + expected = df.loc[akey].copy() + expected.loc[:] = 0 assert_frame_equal(result, expected) result = df.copy() - result.ix[akey] = 0 - result.ix[akey] = df.ix[ainds] + result.loc[akey] = 0 + result.loc[akey] = df.iloc[ainds] assert_frame_equal(result, df) result = df.copy() - result.ix[bkey] = 0 - result = result.ix[bkey] - expected = df.ix[bkey].copy() - expected.ix[:] = 0 + result.loc[bkey] = 0 + result = result.loc[bkey] + expected = df.loc[bkey].copy() + expected.loc[:] = 0 assert_frame_equal(result, expected) result = df.copy() - result.ix[bkey] = 0 - result.ix[bkey] = df.ix[binds] + result.loc[bkey] = 0 + result.loc[bkey] = df.iloc[binds] assert_frame_equal(result, df) def test_xs(self): @@ -2073,9 +2177,9 
@@ def test_xs(self): xs = self.frame.xs(idx) for item, value in compat.iteritems(xs): if np.isnan(value): - self.assertTrue(np.isnan(self.frame[item][idx])) + assert np.isnan(self.frame[item][idx]) else: - self.assertEqual(value, self.frame[item][idx]) + assert value == self.frame[item][idx] # mixed-type xs test_data = { @@ -2084,11 +2188,11 @@ def test_xs(self): } frame = DataFrame(test_data) xs = frame.xs('1') - self.assertEqual(xs.dtype, np.object_) - self.assertEqual(xs['A'], 1) - self.assertEqual(xs['B'], '1') + assert xs.dtype == np.object_ + assert xs['A'] == 1 + assert xs['B'] == '1' - with tm.assertRaises(KeyError): + with pytest.raises(KeyError): self.tsframe.xs(self.tsframe.index[0] - BDay()) # xs get column @@ -2099,7 +2203,7 @@ def test_xs(self): # view is returned if possible series = self.frame.xs('A', axis=1) series[:] = 5 - self.assertTrue((expected == 5).all()) + assert (expected == 5).all() def test_xs_corner(self): # pathological mixed-type reordering case @@ -2149,7 +2253,7 @@ def test_xs_view(self): index=lrange(4), columns=lrange(5)) dm.xs(2)[:] = 10 - self.assertTrue((dm.xs(2) == 10).all()) + assert (dm.xs(2) == 10).all() def test_index_namedtuple(self): from collections import namedtuple @@ -2159,8 +2263,13 @@ def test_index_namedtuple(self): index = Index([idx1, idx2], name="composite_index", tupleize_cols=False) df = DataFrame([(1, 2), (3, 4)], index=index, columns=["A", "B"]) - result = df.ix[IndexType("foo", "bar")]["A"] - self.assertEqual(result, 1) + + with catch_warnings(record=True): + result = df.ix[IndexType("foo", "bar")]["A"] + assert result == 1 + + result = df.loc[IndexType("foo", "bar")]["A"] + assert result == 1 def test_boolean_indexing(self): idx = lrange(3) @@ -2180,7 +2289,7 @@ def test_boolean_indexing(self): df1[df1 > 2.0 * df2] = -1 assert_frame_equal(df1, expected) - with assertRaisesRegexp(ValueError, 'Item wrong length'): + with tm.assert_raises_regex(ValueError, 'Item wrong length'): df1[df1.index[:-1] > 2] = -1 def test_boolean_indexing_mixed(self): @@ -2211,7 +2320,8 @@ def test_boolean_indexing_mixed(self): assert_frame_equal(df2, expected) df['foo'] = 'test' - with tm.assertRaisesRegexp(TypeError, 'boolean setting on mixed-type'): + with tm.assert_raises_regex(TypeError, 'boolean setting ' + 'on mixed-type'): df[df > 0.3] = 1 def test_where(self): @@ -2239,7 +2349,7 @@ def _check_get(df, cond, check_dtypes=True): # dtypes if check_dtypes: - self.assertTrue((rs.dtypes == df.dtypes).all()) + assert (rs.dtypes == df.dtypes).all() # check getting for df in [default_frame, self.mixed_frame, @@ -2251,7 +2361,7 @@ def _check_get(df, cond, check_dtypes=True): df = DataFrame(dict([(c, Series([1] * 3, dtype=c)) for c in ['int64', 'int32', 'float32', 'float64']])) - df.ix[1, :] = 0 + df.iloc[1, :] = 0 result = df.where(df >= 0).get_dtype_counts() # when we don't preserve boolean casts @@ -2288,7 +2398,7 @@ def _check_align(df, cond, other, check_dtypes=True): # can't check dtype when other is an ndarray if check_dtypes and not isinstance(other, np.ndarray): - self.assertTrue((rs.dtypes == df.dtypes).all()) + assert (rs.dtypes == df.dtypes).all() for df in [self.mixed_frame, self.mixed_float, self.mixed_int]: @@ -2309,14 +2419,14 @@ def _check_align(df, cond, other, check_dtypes=True): # invalid conditions df = default_frame err1 = (df + 1).values[0:2, :] - self.assertRaises(ValueError, df.where, cond, err1) + pytest.raises(ValueError, df.where, cond, err1) - err2 = cond.ix[:2, :].values + err2 = cond.iloc[:2, :].values other1 = _safe_add(df) - 
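The `test_where` machinery here, together with the new gh-15414 tests a bit further down, fixes the condition contract of `DataFrame.where`: list-like conditions are accepted, but only if strictly boolean. Standalone:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# a boolean list-like keeps values where True and inserts NaN elsewhere
result = df.where([[False], [True], [True]])
pd.testing.assert_frame_equal(result, pd.DataFrame({'a': [np.nan, 2, 3]}))

# non-boolean conditions are rejected rather than coerced
try:
    df.where([[1], [0], [1]])
except ValueError as err:
    assert "Boolean array expected" in str(err)
```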
self.assertRaises(ValueError, df.where, err2, other1) + pytest.raises(ValueError, df.where, err2, other1) - self.assertRaises(ValueError, df.mask, True) - self.assertRaises(ValueError, df.mask, 0) + pytest.raises(ValueError, df.mask, True) + pytest.raises(ValueError, df.mask, 0) # where inplace def _check_set(df, cond, check_dtypes=True): @@ -2332,7 +2442,7 @@ def _check_set(df, cond, check_dtypes=True): for k, v in compat.iteritems(df.dtypes): if issubclass(v.type, np.integer) and not cond[k].all(): v = np.dtype('float64') - self.assertEqual(dfi[k].dtype, v) + assert dfi[k].dtype == v for df in [default_frame, self.mixed_frame, self.mixed_float, self.mixed_int]: @@ -2354,6 +2464,95 @@ def _check_set(df, cond, check_dtypes=True): expected = df[df['a'] == 1].reindex(df.index) assert_frame_equal(result, expected) + def test_where_array_like(self): + # see gh-15414 + klasses = [list, tuple, np.array] + + df = DataFrame({'a': [1, 2, 3]}) + cond = [[False], [True], [True]] + expected = DataFrame({'a': [np.nan, 2, 3]}) + + for klass in klasses: + result = df.where(klass(cond)) + assert_frame_equal(result, expected) + + df['b'] = 2 + expected['b'] = [2, np.nan, 2] + cond = [[False, True], [True, False], [True, True]] + + for klass in klasses: + result = df.where(klass(cond)) + assert_frame_equal(result, expected) + + def test_where_invalid_input(self): + # see gh-15414: only boolean arrays accepted + df = DataFrame({'a': [1, 2, 3]}) + msg = "Boolean array expected for the condition" + + conds = [ + [[1], [0], [1]], + Series([[2], [5], [7]]), + DataFrame({'a': [2, 5, 7]}), + [["True"], ["False"], ["True"]], + [[Timestamp("2017-01-01")], + [pd.NaT], [Timestamp("2017-01-02")]] + ] + + for cond in conds: + with tm.assert_raises_regex(ValueError, msg): + df.where(cond) + + df['b'] = 2 + conds = [ + [[0, 1], [1, 0], [1, 1]], + Series([[0, 2], [5, 0], [4, 7]]), + [["False", "True"], ["True", "False"], + ["True", "True"]], + DataFrame({'a': [2, 5, 7], 'b': [4, 8, 9]}), + [[pd.NaT, Timestamp("2017-01-01")], + [Timestamp("2017-01-02"), pd.NaT], + [Timestamp("2017-01-03"), Timestamp("2017-01-03")]] + ] + + for cond in conds: + with tm.assert_raises_regex(ValueError, msg): + df.where(cond) + + def test_where_dataframe_col_match(self): + df = DataFrame([[1, 2, 3], [4, 5, 6]]) + cond = DataFrame([[True, False, True], [False, False, True]]) + + out = df.where(cond) + expected = DataFrame([[1.0, np.nan, 3], [np.nan, np.nan, 6]]) + tm.assert_frame_equal(out, expected) + + cond.columns = ["a", "b", "c"] # Columns no longer match. 
+ msg = "Boolean array expected for the condition" + with tm.assert_raises_regex(ValueError, msg): + df.where(cond) + + def test_where_ndframe_align(self): + msg = "Array conditional must be same shape as self" + df = DataFrame([[1, 2, 3], [4, 5, 6]]) + + cond = [True] + with tm.assert_raises_regex(ValueError, msg): + df.where(cond) + + expected = DataFrame([[1, 2, 3], [np.nan, np.nan, np.nan]]) + + out = df.where(Series(cond)) + tm.assert_frame_equal(out, expected) + + cond = np.array([False, True, False, True]) + with tm.assert_raises_regex(ValueError, msg): + df.where(cond) + + expected = DataFrame([[np.nan, np.nan, np.nan], [4, 5, 6]]) + + out = df.where(Series(cond)) + tm.assert_frame_equal(out, expected) + def test_where_bug(self): # GH 2793 @@ -2434,7 +2633,8 @@ def test_where_none(self): df = DataFrame([{'A': 1, 'B': np.nan, 'C': 'Test'}, { 'A': np.nan, 'B': 'Test', 'C': np.nan}]) expected = df.where(~isnull(df), None) - with tm.assertRaisesRegexp(TypeError, 'boolean setting on mixed-type'): + with tm.assert_raises_regex(TypeError, 'boolean setting ' + 'on mixed-type'): df.where(~isnull(df), None, inplace=True) def test_where_align(self): @@ -2692,7 +2892,7 @@ def test_type_error_multiindex(self): dg = df.pivot_table(index='i', columns='c', values=['x', 'y']) - with assertRaisesRegexp(TypeError, "is an invalid key"): + with tm.assert_raises_regex(TypeError, "is an invalid key"): str(dg[:, 0]) index = Index(range(2), name='i') @@ -2712,11 +2912,9 @@ def test_type_error_multiindex(self): assert_series_equal(result, expected) -class TestDataFrameIndexingDatetimeWithTZ(tm.TestCase, TestData): - - _multiprocess_can_split_ = True +class TestDataFrameIndexingDatetimeWithTZ(TestData): - def setUp(self): + def setup_method(self, method): self.idx = Index(date_range('20130101', periods=3, tz='US/Eastern'), name='foo') self.dr = date_range('20130110', periods=3) @@ -2740,9 +2938,8 @@ def test_setitem(self): # are copies) b1 = df._data.blocks[1] b2 = df._data.blocks[2] - self.assertTrue(b1.values.equals(b2.values)) - self.assertFalse(id(b1.values.values.base) == - id(b2.values.values.base)) + assert b1.values.equals(b2.values) + assert id(b1.values.values.base) != id(b2.values.values.base) # with nan df2 = df.copy() @@ -2760,9 +2957,9 @@ def test_set_reset(self): # set/reset df = DataFrame({'A': [0, 1, 2]}, index=idx) result = df.reset_index() - self.assertTrue(result['foo'].dtype, 'M8[ns, US/Eastern') + assert result['foo'].dtype, 'M8[ns, US/Eastern' - result = result.set_index('foo') + df = result.set_index('foo') tm.assert_index_equal(df.index, idx) def test_transpose(self): @@ -2773,7 +2970,55 @@ def test_transpose(self): assert_frame_equal(result, expected) -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) +class TestDataFrameIndexingUInt64(TestData): + + def setup_method(self, method): + self.ir = Index(np.arange(3), dtype=np.uint64) + self.idx = Index([2**63, 2**63 + 5, 2**63 + 10], name='foo') + + self.df = DataFrame({'A': self.idx, 'B': self.ir}) + + def test_setitem(self): + + df = self.df + idx = self.idx + + # setitem + df['C'] = idx + assert_series_equal(df['C'], Series(idx, name='C')) + + df['D'] = 'foo' + df['D'] = idx + assert_series_equal(df['D'], Series(idx, name='D')) + del df['D'] + + # With NaN: because uint64 has no NaN element, + # the column should be cast to object. 
+ df2 = df.copy() + df2.iloc[1, 1] = pd.NaT + df2.iloc[1, 2] = pd.NaT + result = df2['B'] + assert_series_equal(notnull(result), Series( + [True, False, True], name='B')) + assert_series_equal(df2.dtypes, Series([np.dtype('uint64'), + np.dtype('O'), np.dtype('O')], + index=['A', 'B', 'C'])) + + def test_set_reset(self): + + idx = self.idx + + # set/reset + df = DataFrame({'A': [0, 1, 2]}, index=idx) + result = df.reset_index() + assert result['foo'].dtype == np.dtype('uint64') + + df = result.set_index('foo') + tm.assert_index_equal(df.index, idx) + + def test_transpose(self): + + result = self.df.T + expected = DataFrame(self.df.values.T) + expected.index = ['A', 'B'] + assert_frame_equal(result, expected) diff --git a/pandas/tests/frame/test_join.py b/pandas/tests/frame/test_join.py new file mode 100644 index 0000000000000..21807cb42aa6e --- /dev/null +++ b/pandas/tests/frame/test_join.py @@ -0,0 +1,141 @@ +# -*- coding: utf-8 -*- + +import pytest +import numpy as np + +from pandas import DataFrame, Index +from pandas.tests.frame.common import TestData +import pandas.util.testing as tm + + +@pytest.fixture +def frame(): + return TestData().frame + + +@pytest.fixture +def left(): + return DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0]) + + +@pytest.fixture +def right(): + return DataFrame({'b': [300, 100, 200]}, index=[3, 1, 2]) + + +@pytest.mark.parametrize( + "how, sort, expected", + [('inner', False, DataFrame({'a': [20, 10], + 'b': [200, 100]}, + index=[2, 1])), + ('inner', True, DataFrame({'a': [10, 20], + 'b': [100, 200]}, + index=[1, 2])), + ('left', False, DataFrame({'a': [20, 10, 0], + 'b': [200, 100, np.nan]}, + index=[2, 1, 0])), + ('left', True, DataFrame({'a': [0, 10, 20], + 'b': [np.nan, 100, 200]}, + index=[0, 1, 2])), + ('right', False, DataFrame({'a': [np.nan, 10, 20], + 'b': [300, 100, 200]}, + index=[3, 1, 2])), + ('right', True, DataFrame({'a': [10, 20, np.nan], + 'b': [100, 200, 300]}, + index=[1, 2, 3])), + ('outer', False, DataFrame({'a': [0, 10, 20, np.nan], + 'b': [np.nan, 100, 200, 300]}, + index=[0, 1, 2, 3])), + ('outer', True, DataFrame({'a': [0, 10, 20, np.nan], + 'b': [np.nan, 100, 200, 300]}, + index=[0, 1, 2, 3]))]) +def test_join(left, right, how, sort, expected): + + result = left.join(right, how=how, sort=sort) + tm.assert_frame_equal(result, expected) + + +def test_join_index(frame): + # left / right + + f = frame.loc[frame.index[:10], ['A', 'B']] + f2 = frame.loc[frame.index[5:], ['C', 'D']].iloc[::-1] + + joined = f.join(f2) + tm.assert_index_equal(f.index, joined.index) + expected_columns = Index(['A', 'B', 'C', 'D']) + tm.assert_index_equal(joined.columns, expected_columns) + + joined = f.join(f2, how='left') + tm.assert_index_equal(joined.index, f.index) + tm.assert_index_equal(joined.columns, expected_columns) + + joined = f.join(f2, how='right') + tm.assert_index_equal(joined.index, f2.index) + tm.assert_index_equal(joined.columns, expected_columns) + + # inner + + joined = f.join(f2, how='inner') + tm.assert_index_equal(joined.index, f.index[5:10]) + tm.assert_index_equal(joined.columns, expected_columns) + + # outer + + joined = f.join(f2, how='outer') + tm.assert_index_equal(joined.index, frame.index.sort_values()) + tm.assert_index_equal(joined.columns, expected_columns) + + tm.assert_raises_regex( + ValueError, 'join method', f.join, f2, how='foo') + + # corner case - overlapping columns + for how in ('outer', 'left', 'inner'): + with tm.assert_raises_regex(ValueError, 'columns overlap but ' + 'no suffix'): + frame.join(frame, how=how) + + 
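The new `pandas/tests/frame/test_join.py` drives `DataFrame.join` through every `how`/`sort` combination from one fixture pair. The essence of that parametrization:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])
right = pd.DataFrame({'b': [300, 100, 200]}, index=[3, 1, 2])

# join aligns on the index; 'how' picks which labels survive and
# 'sort' controls whether the result index is sorted
inner = left.join(right, how='inner', sort=True)
pd.testing.assert_frame_equal(
    inner, pd.DataFrame({'a': [10, 20], 'b': [100, 200]}, index=[1, 2]))

outer = left.join(right, how='outer', sort=True)
assert list(outer.index) == [0, 1, 2, 3]
```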
+def test_join_index_more(frame): + af = frame.loc[:, ['A', 'B']] + bf = frame.loc[::2, ['C', 'D']] + + expected = af.copy() + expected['C'] = frame['C'][::2] + expected['D'] = frame['D'][::2] + + result = af.join(bf) + tm.assert_frame_equal(result, expected) + + result = af.join(bf, how='right') + tm.assert_frame_equal(result, expected[::2]) + + result = bf.join(af, how='right') + tm.assert_frame_equal(result, expected.loc[:, result.columns]) + + +def test_join_index_series(frame): + df = frame.copy() + s = df.pop(frame.columns[-1]) + joined = df.join(s) + + # TODO should this check_names ? + tm.assert_frame_equal(joined, frame, check_names=False) + + s.name = None + tm.assert_raises_regex(ValueError, 'must have a name', df.join, s) + + +def test_join_overlap(frame): + df1 = frame.loc[:, ['A', 'B', 'C']] + df2 = frame.loc[:, ['B', 'C', 'D']] + + joined = df1.join(df2, lsuffix='_df1', rsuffix='_df2') + df1_suf = df1.loc[:, ['B', 'C']].add_suffix('_df1') + df2_suf = df2.loc[:, ['B', 'C']].add_suffix('_df2') + + no_overlap = frame.loc[:, ['A', 'D']] + expected = df1_suf.join(df2_suf).join(no_overlap) + + # column order not necessarily sorted + tm.assert_frame_equal(joined, expected.loc[:, joined.columns]) diff --git a/pandas/tests/frame/test_missing.py b/pandas/tests/frame/test_missing.py index 8a6cbe44465c1..77f0357685cab 100644 --- a/pandas/tests/frame/test_missing.py +++ b/pandas/tests/frame/test_missing.py @@ -2,6 +2,8 @@ from __future__ import print_function +import pytest + from distutils.version import LooseVersion from numpy import nan, random import numpy as np @@ -11,25 +13,28 @@ date_range) import pandas as pd -from pandas.util.testing import (assert_series_equal, - assert_frame_equal, - assertRaisesRegexp) +from pandas.util.testing import assert_series_equal, assert_frame_equal import pandas.util.testing as tm from pandas.tests.frame.common import TestData, _check_mixed_float +try: + import scipy + _is_scipy_ge_0190 = scipy.__version__ >= LooseVersion('0.19.0') +except: + _is_scipy_ge_0190 = False + + def _skip_if_no_pchip(): try: from scipy.interpolate import pchip_interpolate # noqa except ImportError: - import nose - raise nose.SkipTest('scipy.interpolate.pchip missing') - + import pytest + pytest.skip('scipy.interpolate.pchip missing') -class TestDataFrameMissingData(tm.TestCase, TestData): - _multiprocess_can_split_ = True +class TestDataFrameMissingData(TestData): def test_dropEmptyRows(self): N = len(self.frame.index) @@ -73,24 +78,24 @@ def test_dropIncompleteRows(self): samesize_frame = frame.dropna(subset=['bar']) assert_series_equal(frame['foo'], original) - self.assertTrue((frame['bar'] == 5).all()) + assert (frame['bar'] == 5).all() inp_frame2.dropna(subset=['bar'], inplace=True) - self.assert_index_equal(samesize_frame.index, self.frame.index) - self.assert_index_equal(inp_frame2.index, self.frame.index) + tm.assert_index_equal(samesize_frame.index, self.frame.index) + tm.assert_index_equal(inp_frame2.index, self.frame.index) def test_dropna(self): df = DataFrame(np.random.randn(6, 4)) df[2][:2] = nan dropped = df.dropna(axis=1) - expected = df.ix[:, [0, 1, 3]] + expected = df.loc[:, [0, 1, 3]] inp = df.copy() inp.dropna(axis=1, inplace=True) assert_frame_equal(dropped, expected) assert_frame_equal(inp, expected) dropped = df.dropna(axis=0) - expected = df.ix[lrange(2, 6)] + expected = df.loc[lrange(2, 6)] inp = df.copy() inp.dropna(axis=0, inplace=True) assert_frame_equal(dropped, expected) @@ -98,14 +103,14 @@ def test_dropna(self): # threshold dropped = 
df.dropna(axis=1, thresh=5) - expected = df.ix[:, [0, 1, 3]] + expected = df.loc[:, [0, 1, 3]] inp = df.copy() inp.dropna(axis=1, thresh=5, inplace=True) assert_frame_equal(dropped, expected) assert_frame_equal(inp, expected) dropped = df.dropna(axis=0, thresh=4) - expected = df.ix[lrange(2, 6)] + expected = df.loc[lrange(2, 6)] inp = df.copy() inp.dropna(axis=0, thresh=4, inplace=True) assert_frame_equal(dropped, expected) @@ -130,11 +135,11 @@ def test_dropna(self): df[2] = nan dropped = df.dropna(axis=1, how='all') - expected = df.ix[:, [0, 1, 3]] + expected = df.loc[:, [0, 1, 3]] assert_frame_equal(dropped, expected) # bad input - self.assertRaises(ValueError, df.dropna, axis=3) + pytest.raises(ValueError, df.dropna, axis=3) def test_drop_and_dropna_caching(self): # tst that cacher updates @@ -153,10 +158,10 @@ def test_drop_and_dropna_caching(self): def test_dropna_corner(self): # bad input - self.assertRaises(ValueError, self.frame.dropna, how='foo') - self.assertRaises(TypeError, self.frame.dropna, how=None) + pytest.raises(ValueError, self.frame.dropna, how='foo') + pytest.raises(TypeError, self.frame.dropna, how=None) # non-existent column - 8303 - self.assertRaises(KeyError, self.frame.dropna, subset=['A', 'X']) + pytest.raises(KeyError, self.frame.dropna, subset=['A', 'X']) def test_dropna_multiple_axes(self): df = DataFrame([[1, np.nan, 2, 3], @@ -177,28 +182,31 @@ def test_dropna_multiple_axes(self): assert_frame_equal(inp, expected) def test_fillna(self): - self.tsframe.ix[:5, 'A'] = nan - self.tsframe.ix[-5:, 'A'] = nan + tf = self.tsframe + tf.loc[tf.index[:5], 'A'] = nan + tf.loc[tf.index[-5:], 'A'] = nan zero_filled = self.tsframe.fillna(0) - self.assertTrue((zero_filled.ix[:5, 'A'] == 0).all()) + assert (zero_filled.loc[zero_filled.index[:5], 'A'] == 0).all() padded = self.tsframe.fillna(method='pad') - self.assertTrue(np.isnan(padded.ix[:5, 'A']).all()) - self.assertTrue((padded.ix[-5:, 'A'] == padded.ix[-5, 'A']).all()) + assert np.isnan(padded.loc[padded.index[:5], 'A']).all() + assert (padded.loc[padded.index[-5:], 'A'] == + padded.loc[padded.index[-5], 'A']).all() # mixed type - self.mixed_frame.ix[5:20, 'foo'] = nan - self.mixed_frame.ix[-10:, 'A'] = nan + mf = self.mixed_frame + mf.loc[mf.index[5:20], 'foo'] = nan + mf.loc[mf.index[-10:], 'A'] = nan result = self.mixed_frame.fillna(value=0) result = self.mixed_frame.fillna(method='pad') - self.assertRaises(ValueError, self.tsframe.fillna) - self.assertRaises(ValueError, self.tsframe.fillna, 5, method='ffill') + pytest.raises(ValueError, self.tsframe.fillna) + pytest.raises(ValueError, self.tsframe.fillna, 5, method='ffill') # mixed numeric (but no float16) mf = self.mixed_float.reindex(columns=['A', 'B', 'D']) - mf.ix[-10:, 'A'] = nan + mf.loc[mf.index[-10:], 'A'] = nan result = mf.fillna(value=0) _check_mixed_float(result, dtype=dict(C=None)) @@ -208,7 +216,7 @@ def test_fillna(self): # empty frame (GH #2778) df = DataFrame(columns=['x']) for m in ['pad', 'backfill']: - df.x.fillna(method=m, inplace=1) + df.x.fillna(method=m, inplace=True) df.x.fillna(method=m) # with different dtype (GH3386) @@ -243,10 +251,39 @@ def test_fillna(self): }) expected = df.copy() - expected['Date'] = expected['Date'].fillna(df.ix[0, 'Date2']) + expected['Date'] = expected['Date'].fillna( + df.loc[df.index[0], 'Date2']) result = df.fillna(value={'Date': df['Date2']}) assert_frame_equal(result, expected) + # with timezone + # GH 15855 + df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), + pd.NaT]}) + exp = 
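`test_dropna` above also covers `thresh`, which keeps a row or column only if it holds at least that many non-NA values. In brief:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4))
df.iloc[:2, 2] = np.nan

# column 2 has only 4 non-NA values, so thresh=5 drops it
dropped = df.dropna(axis=1, thresh=5)
pd.testing.assert_frame_equal(dropped, df.loc[:, [0, 1, 3]])
```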
pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), + pd.Timestamp('2012-11-11 00:00:00+01:00')]}) + assert_frame_equal(df.fillna(method='pad'), exp) + + df = pd.DataFrame({'A': [pd.NaT, + pd.Timestamp('2012-11-11 00:00:00+01:00')]}) + exp = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), + pd.Timestamp('2012-11-11 00:00:00+01:00')]}) + assert_frame_equal(df.fillna(method='bfill'), exp) + + def test_fillna_downcast(self): + # GH 15277 + # infer int64 from float64 + df = pd.DataFrame({'a': [1., np.nan]}) + result = df.fillna(0, downcast='infer') + expected = pd.DataFrame({'a': [1, 0]}) + assert_frame_equal(result, expected) + + # infer int64 from float64 when fillna value is a dict + df = pd.DataFrame({'a': [1., np.nan]}) + result = df.fillna({'a': 0}, downcast='infer') + expected = pd.DataFrame({'a': [1, 0]}) + assert_frame_equal(result, expected) + def test_fillna_dtype_conversion(self): # make sure that fillna on an empty frame works df = DataFrame(index=["A", "B", "C"], columns=[1, 2, 3, 4, 5]) @@ -286,7 +323,7 @@ def test_fillna_datetime_columns(self): 'C': ['foo', 'bar', '?'], 'D': ['foo2', 'bar2', '?']}, index=date_range('20130110', periods=3)) - self.assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) df = pd.DataFrame({'A': [-1, -2, np.nan], 'B': [pd.Timestamp('2013-01-01'), @@ -301,7 +338,7 @@ def test_fillna_datetime_columns(self): 'C': ['foo', 'bar', '?'], 'D': ['foo2', 'bar2', '?']}, index=pd.date_range('20130110', periods=3)) - self.assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) def test_ffill(self): self.tsframe['A'][:5] = nan @@ -317,6 +354,40 @@ def test_bfill(self): assert_frame_equal(self.tsframe.bfill(), self.tsframe.fillna(method='bfill')) + def test_frame_pad_backfill_limit(self): + index = np.arange(10) + df = DataFrame(np.random.randn(10, 4), index=index) + + result = df[:2].reindex(index, method='pad', limit=5) + + expected = df[:2].reindex(index).fillna(method='pad') + expected.values[-3:] = np.nan + tm.assert_frame_equal(result, expected) + + result = df[-2:].reindex(index, method='backfill', limit=5) + + expected = df[-2:].reindex(index).fillna(method='backfill') + expected.values[:3] = np.nan + tm.assert_frame_equal(result, expected) + + def test_frame_fillna_limit(self): + index = np.arange(10) + df = DataFrame(np.random.randn(10, 4), index=index) + + result = df[:2].reindex(index) + result = result.fillna(method='pad', limit=5) + + expected = df[:2].reindex(index).fillna(method='pad') + expected.values[-3:] = np.nan + tm.assert_frame_equal(result, expected) + + result = df[-2:].reindex(index) + result = result.fillna(method='backfill', limit=5) + + expected = df[-2:].reindex(index).fillna(method='backfill') + expected.values[:3] = np.nan + tm.assert_frame_equal(result, expected) + def test_fillna_skip_certain_blocks(self): # don't try to fill boolean, int blocks @@ -331,18 +402,18 @@ def test_fillna_inplace(self): df[3][-4:] = np.nan expected = df.fillna(value=0) - self.assertIsNot(expected, df) + assert expected is not df df.fillna(value=0, inplace=True) - assert_frame_equal(df, expected) + tm.assert_frame_equal(df, expected) df[1][:4] = np.nan df[3][-4:] = np.nan expected = df.fillna(method='ffill') - self.assertIsNot(expected, df) + assert expected is not df df.fillna(method='ffill', inplace=True) - assert_frame_equal(df, expected) + tm.assert_frame_equal(df, expected) def test_fillna_dict_series(self): df = DataFrame({'a': [nan, 1, 2, nan, nan], @@ -365,7 +436,8 @@ def 
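The new `test_fillna_downcast` (GH 15277) exercises `downcast='infer'`, which lets `fillna` return a column to int64 once the NaN that forced the float dtype is gone. Note the keyword belongs to the pandas of this era (it was deprecated much later); a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan]})

# with the NaN filled, downcast='infer' shrinks float64 back to int64
result = df.fillna(0, downcast='infer')
pd.testing.assert_frame_equal(result, pd.DataFrame({'a': [1, 0]}))
```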
test_fillna_dict_series(self): assert_frame_equal(result, expected) # disable this for now - with assertRaisesRegexp(NotImplementedError, 'column by column'): + with tm.assert_raises_regex(NotImplementedError, + 'column by column'): df.fillna(df.max(1), axis=1) def test_fillna_dataframe(self): @@ -405,31 +477,31 @@ def test_fillna_columns(self): assert_frame_equal(result, expected) def test_fillna_invalid_method(self): - with assertRaisesRegexp(ValueError, 'ffil'): + with tm.assert_raises_regex(ValueError, 'ffil'): self.frame.fillna(method='ffil') def test_fillna_invalid_value(self): # list - self.assertRaises(TypeError, self.frame.fillna, [1, 2]) + pytest.raises(TypeError, self.frame.fillna, [1, 2]) # tuple - self.assertRaises(TypeError, self.frame.fillna, (1, 2)) + pytest.raises(TypeError, self.frame.fillna, (1, 2)) # frame with series - self.assertRaises(ValueError, self.frame.iloc[:, 0].fillna, - self.frame) + pytest.raises(ValueError, self.frame.iloc[:, 0].fillna, self.frame) def test_fillna_col_reordering(self): cols = ["COL." + str(i) for i in range(5, 0, -1)] data = np.random.rand(20, 5) df = DataFrame(index=lrange(20), columns=cols, data=data) filled = df.fillna(method='ffill') - self.assertEqual(df.columns.tolist(), filled.columns.tolist()) + assert df.columns.tolist() == filled.columns.tolist() def test_fill_corner(self): - self.mixed_frame.ix[5:20, 'foo'] = nan - self.mixed_frame.ix[-10:, 'A'] = nan + mf = self.mixed_frame + mf.loc[mf.index[5:20], 'foo'] = nan + mf.loc[mf.index[-10:], 'A'] = nan filled = self.mixed_frame.fillna(value=0) - self.assertTrue((filled.ix[5:20, 'foo'] == 0).all()) + assert (filled.loc[filled.index[5:20], 'foo'] == 0).all() del self.mixed_frame['foo'] empty_float = self.frame.reindex(columns=[]) @@ -447,7 +519,7 @@ def test_fill_value_when_combine_const(self): assert_frame_equal(res, exp) -class TestDataFrameInterpolate(tm.TestCase, TestData): +class TestDataFrameInterpolate(TestData): def test_interp_basic(self): df = DataFrame({'A': [1, 2, np.nan, 4], @@ -472,7 +544,7 @@ def test_interp_bad_method(self): 'B': [1, 4, 9, np.nan], 'C': [1, 2, 3, 5], 'D': list('abcd')}) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.interpolate(method='not_a_method') def test_interp_combo(self): @@ -492,11 +564,12 @@ def test_interp_combo(self): def test_interp_nan_idx(self): df = DataFrame({'A': [1, 2, np.nan, 4], 'B': [np.nan, 2, 3, 4]}) df = df.set_index('A') - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): df.interpolate(method='values') def test_interp_various(self): tm._skip_if_no_scipy() + df = DataFrame({'A': [1, 2, np.nan, 4, 5, np.nan, 7], 'C': [1, 2, 3, 5, 8, 13, 21]}) df = df.set_index('C') @@ -508,8 +581,15 @@ def test_interp_various(self): assert_frame_equal(result, expected) result = df.interpolate(method='cubic') - expected.A.loc[3] = 2.81621174 - expected.A.loc[13] = 5.64146581 + # GH #15662. + # new cubic and quadratic interpolation algorithms from scipy 0.19.0. + # previously `splmake` was used. 
See scipy/scipy#6710 + if _is_scipy_ge_0190: + expected.A.loc[3] = 2.81547781 + expected.A.loc[13] = 5.52964175 + else: + expected.A.loc[3] = 2.81621174 + expected.A.loc[13] = 5.64146581 assert_frame_equal(result, expected) result = df.interpolate(method='nearest') @@ -518,8 +598,12 @@ def test_interp_various(self): assert_frame_equal(result, expected, check_dtype=False) result = df.interpolate(method='quadratic') - expected.A.loc[3] = 2.82533638 - expected.A.loc[13] = 6.02817974 + if _is_scipy_ge_0190: + expected.A.loc[3] = 2.82150771 + expected.A.loc[13] = 6.12648668 + else: + expected.A.loc[3] = 2.82533638 + expected.A.loc[13] = 6.02817974 assert_frame_equal(result, expected) result = df.interpolate(method='slinear') @@ -532,19 +616,14 @@ def test_interp_various(self): expected.A.loc[13] = 5 assert_frame_equal(result, expected, check_dtype=False) - result = df.interpolate(method='quadratic') - expected.A.loc[3] = 2.82533638 - expected.A.loc[13] = 6.02817974 - assert_frame_equal(result, expected) - def test_interp_alt_scipy(self): tm._skip_if_no_scipy() df = DataFrame({'A': [1, 2, np.nan, 4, 5, np.nan, 7], 'C': [1, 2, 3, 5, 8, 13, 21]}) result = df.interpolate(method='barycentric') expected = df.copy() - expected.ix[2, 'A'] = 3 - expected.ix[5, 'A'] = 6 + expected.loc[2, 'A'] = 3 + expected.loc[5, 'A'] = 6 assert_frame_equal(result, expected) result = df.interpolate(method='barycentric', downcast='infer') @@ -558,12 +637,12 @@ def test_interp_alt_scipy(self): _skip_if_no_pchip() import scipy result = df.interpolate(method='pchip') - expected.ix[2, 'A'] = 3 + expected.loc[2, 'A'] = 3 if LooseVersion(scipy.__version__) >= '0.17.0': - expected.ix[5, 'A'] = 6.0 + expected.loc[5, 'A'] = 6.0 else: - expected.ix[5, 'A'] = 6.125 + expected.loc[5, 'A'] = 6.125 assert_frame_equal(result, expected) @@ -613,7 +692,7 @@ def test_interp_raise_on_only_mixed(self): 'C': [np.nan, 2, 5, 7], 'D': [np.nan, np.nan, 9, 9], 'E': [1, 2, 3, 4]}) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): df.interpolate(axis=1) def test_interp_inplace(self): @@ -657,10 +736,3 @@ def test_interp_ignore_all_good(self): # all good result = df[['B', 'D']].interpolate(downcast=None) assert_frame_equal(result, df[['B', 'D']]) - - -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - # '--with-coverage', '--cover-package=pandas.core'] - exit=False) diff --git a/pandas/tests/frame/test_mutate_columns.py b/pandas/tests/frame/test_mutate_columns.py index 5beab1565e538..4462260a290d9 100644 --- a/pandas/tests/frame/test_mutate_columns.py +++ b/pandas/tests/frame/test_mutate_columns.py @@ -1,15 +1,13 @@ # -*- coding: utf-8 -*- from __future__ import print_function - +import pytest from pandas.compat import range, lrange import numpy as np -from pandas import DataFrame, Series, Index +from pandas import DataFrame, Series, Index, MultiIndex -from pandas.util.testing import (assert_series_equal, - assert_frame_equal, - assertRaisesRegexp) +from pandas.util.testing import assert_frame_equal import pandas.util.testing as tm @@ -19,9 +17,7 @@ # Column add, remove, delete. 
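The mutate-columns suite that follows contrasts the non-mutating `assign` with the in-place `pop`/`del`. The contrast in miniature:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# assign returns a new frame and leaves the original alone
df2 = df.assign(C=lambda x: x.A + x.B)
assert 'C' not in df.columns
assert list(df2.C) == [5, 7, 9]

# pop removes a column in place and hands it back as a Series
b = df.pop('B')
assert 'B' not in df.columns
assert list(b) == [4, 5, 6]
```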
@@ -19,9 +17,7 @@
 # Column add, remove, delete.


-class TestDataFrameMutateColumns(tm.TestCase, TestData):
-
-    _multiprocess_can_split_ = True
+class TestDataFrameMutateColumns(TestData):

     def test_assign(self):
         df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
@@ -78,13 +74,13 @@ def test_assign_alphabetical(self):

     def test_assign_bad(self):
         df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

         # non-keyword argument
-        with tm.assertRaises(TypeError):
+        with pytest.raises(TypeError):
             df.assign(lambda x: x.A)
-        with tm.assertRaises(AttributeError):
+        with pytest.raises(AttributeError):
             df.assign(C=df.A, D=df.A + df.C)
-        with tm.assertRaises(KeyError):
+        with pytest.raises(KeyError):
             df.assign(C=lambda df: df.A, D=lambda df: df['A'] + df['C'])
-        with tm.assertRaises(KeyError):
+        with pytest.raises(KeyError):
             df.assign(C=df.A, D=lambda x: x['A'] + x['C'])

     def test_insert_error_msmgs(self):
@@ -95,7 +91,7 @@ def test_insert_error_msmgs(self):
         s = DataFrame({'foo': ['a', 'b', 'c', 'a'], 'fiz': [
                       'g', 'h', 'i', 'j']}).set_index('foo')
         msg = 'cannot reindex from a duplicate axis'
-        with assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             df['newcol'] = s

         # GH 4107, more descriptive error message
@@ -103,7 +99,7 @@ def test_insert_error_msmgs(self):
                        columns=['a', 'b', 'c', 'd'])

         msg = 'incompatible index of inserted column with frame index'
-        with assertRaisesRegexp(TypeError, msg):
+        with tm.assert_raises_regex(TypeError, msg):
             df['gr'] = df.groupby(['b', 'c']).count()

     def test_insert_benchmark(self):
@@ -123,12 +119,12 @@ def test_insert(self):
                        columns=['c', 'b', 'a'])

         df.insert(0, 'foo', df['a'])
-        self.assert_index_equal(df.columns, Index(['foo', 'c', 'b', 'a']))
+        tm.assert_index_equal(df.columns, Index(['foo', 'c', 'b', 'a']))
         tm.assert_series_equal(df['a'], df['foo'], check_names=False)

         df.insert(2, 'bar', df['c'])
-        self.assert_index_equal(df.columns,
-                                Index(['foo', 'c', 'bar', 'b', 'a']))
+        tm.assert_index_equal(df.columns,
+                              Index(['foo', 'c', 'bar', 'b', 'a']))
         tm.assert_almost_equal(df['c'], df['bar'], check_names=False)

         # diff dtype
@@ -136,25 +132,25 @@ def test_insert(self):

         # new item
         df['x'] = df['a'].astype('float32')
         result = Series(dict(float64=5, float32=1))
-        self.assertTrue((df.get_dtype_counts() == result).all())
+        assert (df.get_dtype_counts() == result).all()

         # replacing current (in different block)
         df['a'] = df['a'].astype('float32')
         result = Series(dict(float64=4, float32=2))
-        self.assertTrue((df.get_dtype_counts() == result).all())
+        assert (df.get_dtype_counts() == result).all()

         df['y'] = df['a'].astype('int32')
         result = Series(dict(float64=4, float32=2, int32=1))
-        self.assertTrue((df.get_dtype_counts() == result).all())
+        assert (df.get_dtype_counts() == result).all()

-        with assertRaisesRegexp(ValueError, 'already exists'):
+        with tm.assert_raises_regex(ValueError, 'already exists'):
             df.insert(1, 'a', df['b'])
-        self.assertRaises(ValueError, df.insert, 1, 'c', df['b'])
+        pytest.raises(ValueError, df.insert, 1, 'c', df['b'])

         df.columns.name = 'some_name'
         # preserve columns name field
         df.insert(0, 'baz', df['c'])
-        self.assertEqual(df.columns.name, 'some_name')
+        assert df.columns.name == 'some_name'

         # GH 13522
         df = DataFrame(index=['A', 'B', 'C'])
@@ -165,21 +161,45 @@ def test_insert(self):

     def test_delitem(self):
         del self.frame['A']
-        self.assertNotIn('A', self.frame)
+        assert 'A' not in self.frame
+
+    def test_delitem_multiindex(self):
+        midx = MultiIndex.from_product([['A', 'B'], [1, 2]])
+        df = DataFrame(np.random.randn(4, 4), columns=midx)
+        assert len(df.columns) == 4
+        assert ('A', ) in df.columns
+        assert 'A' in df.columns
+
+        result = df['A']
+        assert isinstance(result, DataFrame)
+        del df['A']
+
+        assert len(df.columns) == 2
+
+        # A still in the levels, BUT get a KeyError if trying
+        # to delete
+        assert ('A', ) not in df.columns
+        with pytest.raises(KeyError):
+            del df[('A',)]
+
+        # xref: https://github.com/pandas-dev/pandas/issues/2770
+        # the 'A' is STILL in the columns!
+        assert 'A' in df.columns
+        with pytest.raises(KeyError):
+            del df['A']

     def test_pop(self):
         self.frame.columns.name = 'baz'

         self.frame.pop('A')
-        self.assertNotIn('A', self.frame)
+        assert 'A' not in self.frame

         self.frame['foo'] = 'bar'
         self.frame.pop('foo')
-        self.assertNotIn('foo', self.frame)
-        # TODO self.assertEqual(self.frame.columns.name, 'baz')
+        assert 'foo' not in self.frame
+        # TODO assert self.frame.columns.name == 'baz'

-        # 10912
-        # inplace ops cause caching issue
+        # gh-10912: inplace ops cause caching issue
         a = DataFrame([[1, 2, 3], [4, 5, 6]], columns=[
                       'A', 'B', 'C'], index=['X', 'Y'])
         b = a.pop('B')
@@ -188,23 +208,23 @@ def test_pop(self):

         # original frame
         expected = DataFrame([[1, 3], [4, 6]], columns=[
                              'A', 'C'], index=['X', 'Y'])
-        assert_frame_equal(a, expected)
+        tm.assert_frame_equal(a, expected)

         # result
         expected = Series([2, 5], index=['X', 'Y'], name='B') + 1
-        assert_series_equal(b, expected)
+        tm.assert_series_equal(b, expected)

     def test_pop_non_unique_cols(self):
         df = DataFrame({0: [0, 1], 1: [0, 1], 2: [4, 5]})
         df.columns = ["a", "b", "a"]

         res = df.pop("a")
-        self.assertEqual(type(res), DataFrame)
-        self.assertEqual(len(res), 2)
-        self.assertEqual(len(df.columns), 1)
-        self.assertTrue("b" in df.columns)
-        self.assertFalse("a" in df.columns)
-        self.assertEqual(len(df.index), 2)
+        assert type(res) == DataFrame
+        assert len(res) == 2
+        assert len(df.columns) == 1
+        assert "b" in df.columns
+        assert "a" not in df.columns
+        assert len(df.index) == 2

     def test_insert_column_bug_4032(self):
diff --git a/pandas/tests/frame/test_nonunique_indexes.py b/pandas/tests/frame/test_nonunique_indexes.py
index 220d29f624942..4f77ba0ae1f5a 100644
--- a/pandas/tests/frame/test_nonunique_indexes.py
+++ b/pandas/tests/frame/test_nonunique_indexes.py
@@ -2,24 +2,21 @@

 from __future__ import print_function

+import pytest
 import numpy as np

 from pandas.compat import lrange, u
 from pandas import DataFrame, Series, MultiIndex, date_range
 import pandas as pd

-from pandas.util.testing import (assert_series_equal,
-                                 assert_frame_equal,
-                                 assertRaisesRegexp)
+from pandas.util.testing import assert_series_equal, assert_frame_equal

 import pandas.util.testing as tm

 from pandas.tests.frame.common import TestData


-class TestDataFrameNonuniqueIndexes(tm.TestCase, TestData):
-
-    _multiprocess_can_split_ = True
+class TestDataFrameNonuniqueIndexes(TestData):

     def test_column_dups_operations(self):
@@ -54,7 +51,7 @@ def check(result, expected=None):
                              [2, 1, 3, 5, 'bah']],
                             columns=['foo', 'bar', 'foo', 'hello', 'string'])
         check(df, expected)
-        with assertRaisesRegexp(ValueError, 'Length of value'):
+        with tm.assert_raises_regex(ValueError, 'Length of value'):
             df.insert(0, 'AnotherColumn', range(len(df.index) - 1))

         # insert same dtype
@@ -89,7 +86,7 @@ def check(result, expected=None):
         check(df, expected)

         # consolidate
-        df = df.consolidate()
+        df = df._consolidate()
         expected = DataFrame([[1, 1, 'bah', 3], [1, 2, 'bah', 3],
                               [2, 3, 'bah', 3]],
                              columns=['foo', 'foo', 'string', 'foo2'])
@@ -104,8 +101,8 @@ def check(result, expected=None):
         check(df, expected)

         # insert a dup
-        assertRaisesRegexp(ValueError, 'cannot insert',
-                           df.insert, 2, 'new_col', 4.)
+        tm.assert_raises_regex(ValueError, 'cannot insert',
+                               df.insert, 2, 'new_col', 4.)
         df.insert(2, 'new_col', 4., allow_duplicates=True)
         expected = DataFrame([[1, 1, 4., 5., 'bah', 3],
                               [1, 2, 4., 5., 'bah', 3],
@@ -154,7 +151,7 @@ def check(result, expected=None):
         df = DataFrame([[1, 2.5], [3, 4.5]], index=[1, 2], columns=['x', 'x'])
         result = df.values
         expected = np.array([[1, 2.5], [3, 4.5]])
-        self.assertTrue((result == expected).all().all())
+        assert (result == expected).all().all()

         # rename, GH 4403
         df4 = DataFrame(
@@ -191,8 +188,8 @@ def check(result, expected=None):
         # reindex is invalid!
         df = DataFrame([[1, 5, 7.], [1, 5, 7.], [1, 5, 7.]],
                        columns=['bar', 'a', 'a'])
-        self.assertRaises(ValueError, df.reindex, columns=['bar'])
-        self.assertRaises(ValueError, df.reindex, columns=['bar', 'foo'])
+        pytest.raises(ValueError, df.reindex, columns=['bar'])
+        pytest.raises(ValueError, df.reindex, columns=['bar', 'foo'])

         # drop
         df = DataFrame([[1, 5, 7.], [1, 5, 7.], [1, 5, 7.]],
@@ -309,7 +306,7 @@ def check(result, expected=None):
         # boolean with the duplicate raises
         df = DataFrame(np.arange(12).reshape(3, 4),
                        columns=dups, dtype='float64')
-        self.assertRaises(ValueError, lambda: df[df.A > 6])
+        pytest.raises(ValueError, lambda: df[df.A > 6])

         # dup aligining operations should work
         # GH 5185
@@ -326,7 +323,7 @@ def check(result, expected=None):
                         columns=['A', 'A'])

         # not-comparing like-labelled
-        self.assertRaises(ValueError, lambda: df1 == df2)
+        pytest.raises(ValueError, lambda: df1 == df2)

         df1r = df1.reindex_like(df2)
         result = df1r == df2
@@ -353,13 +350,13 @@ def check(result, expected=None):
                        index=['a', 'b', 'c', 'd', 'e'],
                        columns=['A', 'B', 'C', 'D', 'E'])
         z = df[['A', 'C', 'A']].copy()
-        expected = z.ix[['a', 'c', 'a']]
+        expected = z.loc[['a', 'c', 'a']]

         df = DataFrame(np.arange(25.).reshape(5, 5),
                        index=['a', 'b', 'c', 'd', 'e'],
                        columns=['A', 'B', 'C', 'D', 'E'])
         z = df[['A', 'C', 'A']]
-        result = z.ix[['a', 'c', 'a']]
+        result = z.loc[['a', 'c', 'a']]
         check(result, expected)

     def test_column_dups_indexing2(self):
@@ -413,7 +410,7 @@ def test_columns_with_dups(self):
         assert_frame_equal(df, expected)

         # this is an error because we cannot disambiguate the dup columns
-        self.assertRaises(Exception, lambda x: DataFrame(
+        pytest.raises(Exception, lambda x: DataFrame(
             [[1, 2, 'foo', 'bar']], columns=['a', 'a', 'a', 'a']))

         # dups across blocks
@@ -428,10 +425,10 @@ def test_columns_with_dups(self):
                               columns=df_float.columns)
         df = pd.concat([df_float, df_int, df_bool, df_object, df_dt], axis=1)

-        self.assertEqual(len(df._data._blknos), len(df.columns))
-        self.assertEqual(len(df._data._blklocs), len(df.columns))
+        assert len(df._data._blknos) == len(df.columns)
+        assert len(df._data._blklocs) == len(df.columns)

-        # testing iget
+        # testing iloc
         for i in range(len(df.columns)):
             df.iloc[:, i]

@@ -451,7 +448,7 @@ def test_as_matrix_duplicates(self):
         expected = np.array([[1, 2, 'a', 'b'], [1, 2, 'a', 'b']],
                             dtype=object)

-        self.assertTrue(np.array_equal(result, expected))
+        assert np.array_equal(result, expected)

     def test_set_value_by_index(self):
         # See gh-12344
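The duplicate-column hunks above hinge on one behavior: `DataFrame.insert` refuses a column name that already exists unless `allow_duplicates=True` is passed. A minimal standalone sketch of that behavior (toy data, not from the suite):

```python
import pytest
from pandas import DataFrame

df = DataFrame({'foo': [1, 2], 'bar': [3, 4]})

# inserting an existing name raises 'cannot insert foo, already exists'
with pytest.raises(ValueError):
    df.insert(1, 'foo', 5.)

# opting in to duplicates is allowed explicitly
df.insert(1, 'foo', 5., allow_duplicates=True)
assert list(df.columns) == ['foo', 'foo', 'bar']
```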
diff --git a/pandas/tests/frame/test_operators.py b/pandas/tests/frame/test_operators.py
index 85aadee8b0900..8ec6c6e6263d8 100644
--- a/pandas/tests/frame/test_operators.py
+++ b/pandas/tests/frame/test_operators.py
@@ -5,7 +5,7 @@
 from datetime import datetime
 import operator

-import nose
+import pytest

 from numpy import nan, random
 import numpy as np
@@ -15,13 +15,12 @@
 from pandas import (DataFrame, Series, MultiIndex, Timestamp,
                     date_range)
 import pandas.core.common as com
-import pandas.formats.printing as printing
+import pandas.io.formats.printing as printing
 import pandas as pd

 from pandas.util.testing import (assert_numpy_array_equal,
                                  assert_series_equal,
-                                 assert_frame_equal,
-                                 assertRaisesRegexp)
+                                 assert_frame_equal)

 import pandas.util.testing as tm

@@ -29,9 +28,7 @@
                                   _check_mixed_int)


-class TestDataFrameOperators(tm.TestCase, TestData):
-
-    _multiprocess_can_split_ = True
+class TestDataFrameOperators(TestData):

     def test_operators(self):
         garbage = random.random(4)
@@ -44,17 +41,17 @@ def test_operators(self):
         for idx, val in compat.iteritems(series):
             origVal = self.frame[col][idx] * 2
             if not np.isnan(val):
-                self.assertEqual(val, origVal)
+                assert val == origVal
             else:
-                self.assertTrue(np.isnan(origVal))
+                assert np.isnan(origVal)

         for col, series in compat.iteritems(seriesSum):
             for idx, val in compat.iteritems(series):
                 origVal = self.frame[col][idx] + colSeries[col]
                 if not np.isnan(val):
-                    self.assertEqual(val, origVal)
+                    assert val == origVal
                 else:
-                    self.assertTrue(np.isnan(origVal))
+                    assert np.isnan(origVal)

         added = self.frame2 + self.frame2
         expected = self.frame2 * 2
@@ -71,7 +68,7 @@ def test_operators(self):
             DataFrame(index=[0], dtype=dtype),
         ]
         for df in frames:
-            self.assertTrue((df + df).equals(df))
+            assert (df + df).equals(df)
             assert_frame_equal(df + df, df)

     def test_ops_np_scalar(self):
@@ -121,12 +118,12 @@ def test_operators_boolean(self):
         def f():
             DataFrame(1.0, index=[1], columns=['A']) | DataFrame(
                 True, index=[1], columns=['A'])
-        self.assertRaises(TypeError, f)
+        pytest.raises(TypeError, f)

         def f():
             DataFrame('foo', index=[1], columns=['A']) | DataFrame(
                 True, index=[1], columns=['A'])
-        self.assertRaises(TypeError, f)
+        pytest.raises(TypeError, f)

     def test_operators_none_as_na(self):
         df = DataFrame({"col1": [2, 5.0, 123, None],
@@ -159,12 +156,12 @@ def test_comparison_invalid(self):

         def check(df, df2):
             for (x, y) in [(df, df2), (df2, df)]:
-                self.assertRaises(TypeError, lambda: x == y)
-                self.assertRaises(TypeError, lambda: x != y)
-                self.assertRaises(TypeError, lambda: x >= y)
-                self.assertRaises(TypeError, lambda: x > y)
-                self.assertRaises(TypeError, lambda: x < y)
-                self.assertRaises(TypeError, lambda: x <= y)
+                pytest.raises(TypeError, lambda: x == y)
+                pytest.raises(TypeError, lambda: x != y)
+                pytest.raises(TypeError, lambda: x >= y)
+                pytest.raises(TypeError, lambda: x > y)
+                pytest.raises(TypeError, lambda: x < y)
+                pytest.raises(TypeError, lambda: x <= y)

         # GH4968
         # invalid date/int comparisons
@@ -239,7 +236,7 @@ def test_modulo(self):
         s = p[0]
         res = s % p
         res2 = p % s
-        self.assertFalse(np.array_equal(res.fillna(0), res2.fillna(0)))
+        assert not np.array_equal(res.fillna(0), res2.fillna(0))

     def test_div(self):

@@ -273,7 +270,7 @@ def test_div(self):
         s = p[0]
         res = s / p
         res2 = p / s
-        self.assertFalse(np.array_equal(res.fillna(0), res2.fillna(0)))
+        assert not np.array_equal(res.fillna(0), res2.fillna(0))

     def test_logical_operators(self):
@@ -281,14 +278,14 @@ def _check_bin_op(op):
             result = op(df1, df2)
             expected = DataFrame(op(df1.values, df2.values), index=df1.index,
                                  columns=df1.columns)
-            self.assertEqual(result.values.dtype, np.bool_)
+            assert result.values.dtype == np.bool_
             assert_frame_equal(result, expected)

         def _check_unary_op(op):
             result = op(df1)
             expected = DataFrame(op(df1.values), index=df1.index,
                                  columns=df1.columns)
-            self.assertEqual(result.values.dtype, np.bool_)
+            assert result.values.dtype == np.bool_
             assert_frame_equal(result, expected)

         df1 = {'a': {'a': True, 'b': False, 'c': False, 'd': True, 'e': True},
@@ -320,12 +317,12 @@ def _check_unary_op(op):

     def test_logical_typeerror(self):
         if not compat.PY3:
-            self.assertRaises(TypeError, self.frame.__eq__, 'foo')
-            self.assertRaises(TypeError, self.frame.__lt__, 'foo')
-            self.assertRaises(TypeError, self.frame.__gt__, 'foo')
-            self.assertRaises(TypeError, self.frame.__ne__, 'foo')
+            pytest.raises(TypeError, self.frame.__eq__, 'foo')
+            pytest.raises(TypeError, self.frame.__lt__, 'foo')
+            pytest.raises(TypeError, self.frame.__gt__, 'foo')
+            pytest.raises(TypeError, self.frame.__ne__, 'foo')
         else:
-            raise nose.SkipTest('test_logical_typeerror not tested on PY3')
+            pytest.skip('test_logical_typeerror not tested on PY3')

     def test_logical_with_nas(self):
         d = DataFrame({'a': [np.nan, False], 'b': [True, True]})
@@ -378,10 +375,10 @@ def test_arith_flex_frame(self):
             result = getattr(self.mixed_int, op)(2 + self.mixed_int)
             exp = f(self.mixed_int, 2 + self.mixed_int)

-            # overflow in the uint
+            # no overflow in the uint
             dtype = None
             if op in ['sub']:
-                dtype = dict(B='object', C=None)
+                dtype = dict(B='uint64', C=None)
             elif op in ['add', 'mul']:
                 dtype = dict(C=None)
             assert_frame_equal(result, exp)
@@ -410,10 +407,10 @@ def test_arith_flex_frame(self):
                                                    2 + self.mixed_int)
                 exp = f(self.mixed_int, 2 + self.mixed_int)

-                # overflow in the uint
+                # no overflow in the uint
                 dtype = None
                 if op in ['sub']:
-                    dtype = dict(B='object', C=None)
+                    dtype = dict(B='uint64', C=None)
                 elif op in ['add', 'mul']:
                     dtype = dict(C=None)
                 assert_frame_equal(result, exp)
@@ -425,10 +422,10 @@ def test_arith_flex_frame(self):
         # ndim >= 3
         ndim_5 = np.ones(self.frame.shape + (3, 4, 5))
         msg = "Unable to coerce to Series/DataFrame"
-        with assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             f(self.frame, ndim_5)

-        with assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             getattr(self.frame, op)(ndim_5)

         # res_add = self.frame.add(self.frame)
@@ -450,9 +447,9 @@ def test_arith_flex_frame(self):
         result = self.frame[:0].add(self.frame)
         assert_frame_equal(result, self.frame * np.nan)
-        with assertRaisesRegexp(NotImplementedError, 'fill_value'):
+        with tm.assert_raises_regex(NotImplementedError, 'fill_value'):
             self.frame.add(self.frame.iloc[0], fill_value=3)
-        with assertRaisesRegexp(NotImplementedError, 'fill_value'):
+        with tm.assert_raises_regex(NotImplementedError, 'fill_value'):
             self.frame.add(self.frame.iloc[0], axis='index', fill_value=3)

     def test_binary_ops_align(self):
@@ -467,7 +464,7 @@ def test_binary_ops_align(self):

         df = DataFrame(np.arange(27 * 3).reshape(27, 3),
                        index=index,
-                       columns=['value1', 'value2', 'value3']).sortlevel()
+                       columns=['value1', 'value2', 'value3']).sort_index()

         idx = pd.IndexSlice
         for op in ['add', 'sub', 'mul', 'div', 'truediv']:
@@ -479,7 +476,7 @@ def test_binary_ops_align(self):

             result = getattr(df, op)(x, level='third', axis=0)

             expected = pd.concat([opa(df.loc[idx[:, :, i], :], v)
-                                  for i, v in x.iteritems()]).sortlevel()
+                                  for i, v in x.iteritems()]).sort_index()
             assert_frame_equal(result, expected)

             x = Series([1.0, 10.0], ['two', 'three'])
@@ -487,7 +484,7 @@ def test_binary_ops_align(self):

             expected = (pd.concat([opa(df.loc[idx[:, i], :], v)
                                    for i, v in x.iteritems()])
-                        .reindex_like(df).sortlevel())
+                        .reindex_like(df).sort_index())
             assert_frame_equal(result, expected)

         # GH9463 (alignment level of dataframe with series)
@@ -570,14 +567,14 @@ def test_bool_flex_frame(self):

         # Unaligned
         def _check_unaligned_frame(meth, op, df, other):
-            part_o = other.ix[3:, 1:].copy()
+            part_o = other.loc[3:, 1:].copy()
             rs = meth(part_o)
             xp = op(df, part_o.reindex(index=df.index, columns=df.columns))
             assert_frame_equal(rs, xp)

         # DataFrame
-        self.assertTrue(df.eq(df).values.all())
-        self.assertFalse(df.ne(df).values.any())
+        assert df.eq(df).values.all()
+        assert not df.ne(df).values.any()
         for op in ['eq', 'ne', 'gt', 'lt', 'ge', 'le']:
             f = getattr(df, op)
             o = getattr(operator, op)
@@ -591,7 +588,7 @@ def _check_unaligned_frame(meth, op, df, other):
             # NAs
             msg = "Unable to coerce to Series/DataFrame"
             assert_frame_equal(f(np.nan), o(df, np.nan))
-            with assertRaisesRegexp(ValueError, msg):
+            with tm.assert_raises_regex(ValueError, msg):
                 f(ndim_5)

         # Series
@@ -635,19 +632,19 @@ def _test_seq(df, idx_ser, col_ser):
         _test_seq(df, idx_ser.values, col_ser.values)

         # NA
-        df.ix[0, 0] = np.nan
+        df.loc[0, 0] = np.nan
         rs = df.eq(df)
-        self.assertFalse(rs.ix[0, 0])
+        assert not rs.loc[0, 0]
         rs = df.ne(df)
-        self.assertTrue(rs.ix[0, 0])
+        assert rs.loc[0, 0]
         rs = df.gt(df)
-        self.assertFalse(rs.ix[0, 0])
+        assert not rs.loc[0, 0]
         rs = df.lt(df)
-        self.assertFalse(rs.ix[0, 0])
+        assert not rs.loc[0, 0]
         rs = df.ge(df)
-        self.assertFalse(rs.ix[0, 0])
+        assert not rs.loc[0, 0]
        rs = df.le(df)
-        self.assertFalse(rs.ix[0, 0])
+        assert not rs.loc[0, 0]

         # complex
         arr = np.array([np.nan, 1, 6, np.nan])
@@ -655,14 +652,14 @@ def _test_seq(df, idx_ser, col_ser):
         df = DataFrame({'a': arr})
         df2 = DataFrame({'a': arr2})
         rs = df.gt(df2)
-        self.assertFalse(rs.values.any())
+        assert not rs.values.any()
         rs = df.ne(df2)
-        self.assertTrue(rs.values.all())
+        assert rs.values.all()

         arr3 = np.array([2j, np.nan, None])
         df3 = DataFrame({'a': arr3})
         rs = df3.gt(2j)
-        self.assertFalse(rs.values.any())
+        assert not rs.values.any()

         # corner, dtype=object
         df1 = DataFrame({'col': ['foo', np.nan, 'bar']})
@@ -671,6 +668,22 @@ def _test_seq(df, idx_ser, col_ser):
         exp = DataFrame({'col': [False, True, False]})
         assert_frame_equal(result, exp)

+    def test_return_dtypes_bool_op_constant(self):
+        # GH15077
+        df = DataFrame({'x': [1, 2, 3], 'y': [1., 2., 3.]})
+        const = 2
+
+        # non-empty DataFrame
+        for op in ['eq', 'ne', 'gt', 'lt', 'ge', 'le']:
+            result = getattr(df, op)(const).get_dtype_counts()
+            tm.assert_series_equal(result, Series([2], ['bool']))
+
+        # empty DataFrame
+        empty = df.iloc[:0]
+        for op in ['eq', 'ne', 'gt', 'lt', 'ge', 'le']:
+            result = getattr(empty, op)(const).get_dtype_counts()
+            tm.assert_series_equal(result, Series([2], ['bool']))
+
     def test_dti_tz_convert_to_utc(self):
         base = pd.DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'],
                                 tz='UTC')
@@ -753,31 +766,30 @@ def test_combineFrame(self):
         exp.loc[~exp.index.isin(indexer)] = np.nan
         tm.assert_series_equal(added['A'], exp.loc[added['A'].index])

-        self.assertTrue(
-            np.isnan(added['C'].reindex(frame_copy.index)[:5]).all())
+        assert np.isnan(added['C'].reindex(frame_copy.index)[:5]).all()

         # assert(False)

-        self.assertTrue(np.isnan(added['D']).all())
+        assert np.isnan(added['D']).all()

         self_added = self.frame + self.frame
-        self.assert_index_equal(self_added.index, self.frame.index)
+        tm.assert_index_equal(self_added.index, self.frame.index)

         added_rev = frame_copy + self.frame
-        self.assertTrue(np.isnan(added['D']).all())
-        self.assertTrue(np.isnan(added_rev['D']).all())
+        assert np.isnan(added['D']).all()
+        assert np.isnan(added_rev['D']).all()

         # corner cases

         # empty
         plus_empty = self.frame + self.empty
-        self.assertTrue(np.isnan(plus_empty.values).all())
+        assert np.isnan(plus_empty.values).all()

         empty_plus = self.empty + self.frame
-        self.assertTrue(np.isnan(empty_plus.values).all())
+        assert np.isnan(empty_plus.values).all()

         empty_empty = self.empty + self.empty
-        self.assertTrue(empty_empty.empty)
+        assert empty_empty.empty

         # out of order
         reverse = self.frame.reindex(columns=self.frame.columns[::-1])

@@ -817,8 +829,8 @@ def test_combineSeries(self):
         for key, s in compat.iteritems(self.frame):
             assert_series_equal(larger_added[key], s + series[key])
-        self.assertIn('E', larger_added)
-        self.assertTrue(np.isnan(larger_added['E']).all())
+        assert 'E' in larger_added
+        assert np.isnan(larger_added['E']).all()

         # vs mix (upcast) as needed
         added = self.mixed_float + series
@@ -849,16 +861,16 @@ def test_combineSeries(self):
         for key, col in compat.iteritems(self.tsframe):
             result = col + ts
             assert_series_equal(added[key], result, check_names=False)
-            self.assertEqual(added[key].name, key)
+            assert added[key].name == key
             if col.name == ts.name:
-                self.assertEqual(result.name, 'A')
+                assert result.name == 'A'
             else:
-                self.assertTrue(result.name is None)
+                assert result.name is None

         smaller_frame = self.tsframe[:-5]
         smaller_added = smaller_frame.add(ts, axis='index')
-        self.assert_index_equal(smaller_added.index, self.tsframe.index)
+        tm.assert_index_equal(smaller_added.index, self.tsframe.index)

         smaller_ts = ts[:-5]
         smaller_added2 = self.tsframe.add(smaller_ts, axis='index')
@@ -879,22 +891,22 @@ def test_combineSeries(self):
         # empty but with non-empty index
         frame = self.tsframe[:1].reindex(columns=[])
         result = frame.mul(ts, axis='index')
-        self.assertEqual(len(result), len(ts))
+        assert len(result) == len(ts)

     def test_combineFunc(self):
         result = self.frame * 2
-        self.assert_numpy_array_equal(result.values, self.frame.values * 2)
+        tm.assert_numpy_array_equal(result.values, self.frame.values * 2)

         # vs mix
         result = self.mixed_float * 2
         for c, s in compat.iteritems(result):
-            self.assert_numpy_array_equal(
+            tm.assert_numpy_array_equal(
                 s.values, self.mixed_float[c].values * 2)
         _check_mixed_float(result, dtype=dict(C=None))

         result = self.empty * 2
-        self.assertIs(result.index, self.empty.index)
-        self.assertEqual(len(result.columns), 0)
+        assert result.index is self.empty.index
+        assert len(result.columns) == 0

     def test_comparisons(self):
         df1 = tm.makeTimeDataFrame()
@@ -905,21 +917,23 @@ def test_comparisons(self):

         def test_comp(func):
             result = func(df1, df2)
-            self.assert_numpy_array_equal(result.values,
-                                          func(df1.values, df2.values))
-            with assertRaisesRegexp(ValueError, 'Wrong number of dimensions'):
+            tm.assert_numpy_array_equal(result.values,
+                                        func(df1.values, df2.values))
+            with tm.assert_raises_regex(ValueError,
+                                        'Wrong number of dimensions'):
                 func(df1, ndim_5)

             result2 = func(self.simple, row)
-            self.assert_numpy_array_equal(result2.values,
-                                          func(self.simple.values, row.values))
+            tm.assert_numpy_array_equal(result2.values,
+                                        func(self.simple.values, row.values))

             result3 = func(self.frame, 0)
-            self.assert_numpy_array_equal(result3.values,
-                                          func(self.frame.values, 0))
+            tm.assert_numpy_array_equal(result3.values,
+                                        func(self.frame.values, 0))

-            with assertRaisesRegexp(ValueError, 'Can only compare '
-                                    'identically-labeled DataFrame'):
+            with tm.assert_raises_regex(ValueError,
+                                        'Can only compare identically'
+                                        '-labeled DataFrame'):
                 func(self.simple, self.simple[:2])

         test_comp(operator.eq)
@@ -936,23 +950,23 @@ def test_comparison_protected_from_errstate(self):
         expected = missing_df.values < 0
         with np.errstate(invalid='raise'):
             result = (missing_df < 0).values
-        self.assert_numpy_array_equal(result, expected)
+        tm.assert_numpy_array_equal(result, expected)

     def test_string_comparison(self):
         df = DataFrame([{"a": 1, "b": "foo"}, {"a": 2, "b": "bar"}])
         mask_a = df.a > 1
-        assert_frame_equal(df[mask_a], df.ix[1:1, :])
-        assert_frame_equal(df[-mask_a], df.ix[0:0, :])
+        assert_frame_equal(df[mask_a], df.loc[1:1, :])
+        assert_frame_equal(df[-mask_a], df.loc[0:0, :])

         mask_b = df.b == "foo"
-        assert_frame_equal(df[mask_b], df.ix[0:0, :])
-        assert_frame_equal(df[-mask_b], df.ix[1:1, :])
+        assert_frame_equal(df[mask_b], df.loc[0:0, :])
+        assert_frame_equal(df[-mask_b], df.loc[1:1, :])

     def test_float_none_comparison(self):
         df = DataFrame(np.random.randn(8, 3), index=lrange(8),
                        columns=['A', 'B', 'C'])

-        self.assertRaises(TypeError, df.__eq__, None)
+        pytest.raises(TypeError, df.__eq__, None)

     def test_boolean_comparison(self):

@@ -985,8 +999,8 @@ def test_boolean_comparison(self):
         result = df.values > b_r
         assert_numpy_array_equal(result, expected.values)

-        self.assertRaises(ValueError, df.__gt__, b_c)
-        self.assertRaises(ValueError, df.values.__gt__, b_c)
+        pytest.raises(ValueError, df.__gt__, b_c)
+        pytest.raises(ValueError, df.values.__gt__, b_c)

         # ==
         expected = DataFrame([[False, False], [True, False], [False, False]])
@@ -1005,8 +1019,8 @@ def test_boolean_comparison(self):
         result = df.values == b_r
         assert_numpy_array_equal(result, expected.values)

-        self.assertRaises(ValueError, lambda: df == b_c)
-        self.assertFalse(np.array_equal(df.values, b_c))
+        pytest.raises(ValueError, lambda: df == b_c)
+        assert not np.array_equal(df.values, b_c)

         # with alignment
         df = DataFrame(np.arange(6).reshape((3, 2)),
@@ -1021,90 +1035,23 @@ def test_boolean_comparison(self):
         assert_frame_equal(result, expected)

         # not shape compatible
-        self.assertRaises(ValueError, lambda: df == (2, 2))
-        self.assertRaises(ValueError, lambda: df == [2, 2])
-
-    def test_combineAdd(self):
-
-        with tm.assert_produces_warning(FutureWarning):
-            # trivial
-            comb = self.frame.combineAdd(self.frame)
-            assert_frame_equal(comb, self.frame * 2)
-
-        # more rigorous
-        a = DataFrame([[1., nan, nan, 2., nan]],
-                      columns=np.arange(5))
-        b = DataFrame([[2., 3., nan, 2., 6., nan]],
-                      columns=np.arange(6))
-        expected = DataFrame([[3., 3., nan, 4., 6., nan]],
-                             columns=np.arange(6))
-
-        with tm.assert_produces_warning(FutureWarning):
-            result = a.combineAdd(b)
-        assert_frame_equal(result, expected)
-
-        with tm.assert_produces_warning(FutureWarning):
-            result2 = a.T.combineAdd(b.T)
-        assert_frame_equal(result2, expected.T)
-
-        expected2 = a.combine(b, operator.add, fill_value=0.)
-        assert_frame_equal(expected, expected2)
-
-        # corner cases
-        with tm.assert_produces_warning(FutureWarning):
-            comb = self.frame.combineAdd(self.empty)
-        assert_frame_equal(comb, self.frame)
-
-        with tm.assert_produces_warning(FutureWarning):
-            comb = self.empty.combineAdd(self.frame)
-        assert_frame_equal(comb, self.frame)
-
-        # integer corner case
-        df1 = DataFrame({'x': [5]})
-        df2 = DataFrame({'x': [1]})
-        df3 = DataFrame({'x': [6]})
-
-        with tm.assert_produces_warning(FutureWarning):
-            comb = df1.combineAdd(df2)
-        assert_frame_equal(comb, df3)
-
-        # mixed type GH2191
-        df1 = DataFrame({'A': [1, 2], 'B': [3, 4]})
-        df2 = DataFrame({'A': [1, 2], 'C': [5, 6]})
-        with tm.assert_produces_warning(FutureWarning):
-            rs = df1.combineAdd(df2)
-        xp = DataFrame({'A': [2, 4], 'B': [3, 4.], 'C': [5, 6.]})
-        assert_frame_equal(xp, rs)
-
-        # TODO: test integer fill corner?
-
-    def test_combineMult(self):
-        with tm.assert_produces_warning(FutureWarning):
-            # trivial
-            comb = self.frame.combineMult(self.frame)
-
-        assert_frame_equal(comb, self.frame ** 2)
-
-        # corner cases
-        comb = self.frame.combineMult(self.empty)
-        assert_frame_equal(comb, self.frame)
-
-        comb = self.empty.combineMult(self.frame)
-        assert_frame_equal(comb, self.frame)
+        pytest.raises(ValueError, lambda: df == (2, 2))
+        pytest.raises(ValueError, lambda: df == [2, 2])
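The deleted `test_combineAdd`/`test_combineMult` exercised already-deprecated methods; the removed assertions themselves recorded the supported replacement, `DataFrame.combine` with an operator and a `fill_value`. A standalone version of that equivalence, using the same data as the deleted test:

```python
import operator

import numpy as np
from pandas import DataFrame

nan = np.nan
a = DataFrame([[1., nan, nan, 2., nan]], columns=np.arange(5))
b = DataFrame([[2., 3., nan, 2., 6., nan]], columns=np.arange(6))

# equivalent of the removed a.combineAdd(b): NaN on one side is
# treated as 0, cells missing on both sides stay NaN
result = a.combine(b, operator.add, fill_value=0.)
expected = DataFrame([[3., 3., nan, 4., 6., nan]], columns=np.arange(6))
assert result.equals(expected)
```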
     def test_combine_generic(self):
         df1 = self.frame
-        df2 = self.frame.ix[:-5, ['A', 'B', 'C']]
+        df2 = self.frame.loc[self.frame.index[:-5], ['A', 'B', 'C']]

         combined = df1.combine(df2, np.add)
         combined2 = df2.combine(df1, np.add)
-        self.assertTrue(combined['D'].isnull().all())
-        self.assertTrue(combined2['D'].isnull().all())
+        assert combined['D'].isnull().all()
+        assert combined2['D'].isnull().all()

-        chunk = combined.ix[:-5, ['A', 'B', 'C']]
-        chunk2 = combined2.ix[:-5, ['A', 'B', 'C']]
+        chunk = combined.loc[combined.index[:-5], ['A', 'B', 'C']]
+        chunk2 = combined2.loc[combined2.index[:-5], ['A', 'B', 'C']]

-        exp = self.frame.ix[:-5, ['A', 'B', 'C']].reindex_like(chunk) * 2
+        exp = self.frame.loc[self.frame.index[:-5],
+                             ['A', 'B', 'C']].reindex_like(chunk) * 2
         assert_frame_equal(chunk, exp)
         assert_frame_equal(chunk2, exp)

@@ -1168,16 +1115,16 @@ def test_inplace_ops_identity(self):
         s += 1
         assert_series_equal(s, s2)
         assert_series_equal(s_orig + 1, s)
-        self.assertIs(s, s2)
-        self.assertIs(s._data, s2._data)
+        assert s is s2
+        assert s._data is s2._data

         df = df_orig.copy()
         df2 = df
         df += 1
         assert_frame_equal(df, df2)
         assert_frame_equal(df_orig + 1, df)
-        self.assertIs(df, df2)
-        self.assertIs(df._data, df2._data)
+        assert df is df2
+        assert df._data is df2._data

         # dtype change
         s = s_orig.copy()
@@ -1191,8 +1138,8 @@ def test_inplace_ops_identity(self):
         df += 1.5
         assert_frame_equal(df, df2)
         assert_frame_equal(df_orig + 1.5, df)
-        self.assertIs(df, df2)
-        self.assertIs(df._data, df2._data)
+        assert df is df2
+        assert df._data is df2._data

         # mixed dtype
         arr = np.random.randint(0, 10, size=5)
@@ -1203,7 +1150,7 @@ def test_inplace_ops_identity(self):
         expected = DataFrame({'A': arr.copy() + 1, 'B': 'foo'})
         assert_frame_equal(df, expected)
         assert_frame_equal(df2, expected)
-        self.assertIs(df._data, df2._data)
+        assert df._data is df2._data

         df = df_orig.copy()
         df2 = df
@@ -1211,7 +1158,7 @@ def test_inplace_ops_identity(self):
         expected = DataFrame({'A': arr.copy() + 1.5, 'B': 'foo'})
         assert_frame_equal(df, expected)
         assert_frame_equal(df2, expected)
-        self.assertIs(df._data, df2._data)
+        assert df._data is df2._data

     def test_alignment_non_pandas(self):
         index = ['A', 'B', 'C']
@@ -1230,10 +1177,10 @@ def test_alignment_non_pandas(self):
         # length mismatch
         msg = 'Unable to coerce to Series, length must be 3: given 2'
         for val in [[1, 2], (1, 2), np.array([1, 2])]:
-            with tm.assertRaisesRegexp(ValueError, msg):
+            with tm.assert_raises_regex(ValueError, msg):
                 align(df, val, 'index')

-            with tm.assertRaisesRegexp(ValueError, msg):
+            with tm.assert_raises_regex(ValueError, msg):
                 align(df, val, 'columns')

         val = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
@@ -1247,19 +1194,14 @@ def test_alignment_non_pandas(self):
         # shape mismatch
         msg = 'Unable to coerce to DataFrame, shape must be'
         val = np.array([[1, 2, 3], [4, 5, 6]])
-        with tm.assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             align(df, val, 'index')

-        with tm.assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             align(df, val, 'columns')

         val = np.zeros((3, 3, 3))
-        with tm.assertRaises(ValueError):
+        with pytest.raises(ValueError):
             align(df, val, 'index')
-        with tm.assertRaises(ValueError):
+        with pytest.raises(ValueError):
             align(df, val, 'columns')
-
-
-if __name__ == '__main__':
-    nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
-                   exit=False)
diff --git a/pandas/tests/frame/test_period.py b/pandas/tests/frame/test_period.py
new file mode 100644
index 0000000000000..482210966fe6b
--- /dev/null
+++ b/pandas/tests/frame/test_period.py
@@ -0,0 +1,140 @@
+import numpy as np
+from numpy.random import randn
+from datetime import timedelta
+
+import pandas as pd
+import pandas.util.testing as tm
+from pandas import (PeriodIndex, period_range, DataFrame, date_range,
+                    Index, to_datetime, DatetimeIndex)
+
+
+def _permute(obj):
+    return obj.take(np.random.permutation(len(obj)))
+
+
+class TestPeriodIndex(object):
+
+    def setup_method(self, method):
+        pass
+
+    def test_as_frame_columns(self):
+        rng = period_range('1/1/2000', periods=5)
+        df = DataFrame(randn(10, 5), columns=rng)
+
+        ts = df[rng[0]]
+        tm.assert_series_equal(ts, df.iloc[:, 0])
+
+        # GH # 1211
+        repr(df)
+
+        ts = df['1/1/2000']
+        tm.assert_series_equal(ts, df.iloc[:, 0])
+
+    def test_frame_setitem(self):
+        rng = period_range('1/1/2000', periods=5, name='index')
+        df = DataFrame(randn(5, 3), index=rng)
+
+        df['Index'] = rng
+        rs = Index(df['Index'])
+        tm.assert_index_equal(rs, rng, check_names=False)
+        assert rs.name == 'Index'
+        assert rng.name == 'index'
+
+        rs = df.reset_index().set_index('index')
+        assert isinstance(rs.index, PeriodIndex)
+        tm.assert_index_equal(rs.index, rng)
+
+    def test_frame_to_time_stamp(self):
+        K = 5
+        index = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009')
+        df = DataFrame(randn(len(index), K), index=index)
+        df['mix'] = 'a'
+
+        exp_index = date_range('1/1/2001', end='12/31/2009', freq='A-DEC')
+        result = df.to_timestamp('D', 'end')
+        tm.assert_index_equal(result.index, exp_index)
+        tm.assert_numpy_array_equal(result.values, df.values)
+
+        exp_index = date_range('1/1/2001', end='1/1/2009', freq='AS-JAN')
+        result = df.to_timestamp('D', 'start')
+        tm.assert_index_equal(result.index, exp_index)
+
+        def _get_with_delta(delta, freq='A-DEC'):
+            return date_range(to_datetime('1/1/2001') + delta,
+                              to_datetime('12/31/2009') + delta, freq=freq)
+
+        delta = timedelta(hours=23)
+        result = df.to_timestamp('H', 'end')
+        exp_index = _get_with_delta(delta)
+        tm.assert_index_equal(result.index, exp_index)
+
+        delta = timedelta(hours=23, minutes=59)
+        result = df.to_timestamp('T', 'end')
+        exp_index = _get_with_delta(delta)
+        tm.assert_index_equal(result.index, exp_index)
+
+        result = df.to_timestamp('S', 'end')
+        delta = timedelta(hours=23, minutes=59, seconds=59)
+        exp_index = _get_with_delta(delta)
+        tm.assert_index_equal(result.index, exp_index)
+
+        # columns
+        df = df.T
+
+        exp_index = date_range('1/1/2001', end='12/31/2009', freq='A-DEC')
+        result = df.to_timestamp('D', 'end', axis=1)
+        tm.assert_index_equal(result.columns, exp_index)
+        tm.assert_numpy_array_equal(result.values, df.values)
+
+        exp_index = date_range('1/1/2001', end='1/1/2009', freq='AS-JAN')
+        result = df.to_timestamp('D', 'start', axis=1)
+        tm.assert_index_equal(result.columns, exp_index)
+
+        delta = timedelta(hours=23)
+        result = df.to_timestamp('H', 'end', axis=1)
+        exp_index = _get_with_delta(delta)
+        tm.assert_index_equal(result.columns, exp_index)
+
+        delta = timedelta(hours=23, minutes=59)
+        result = df.to_timestamp('T', 'end', axis=1)
+        exp_index = _get_with_delta(delta)
+        tm.assert_index_equal(result.columns, exp_index)
+
+        result = df.to_timestamp('S', 'end', axis=1)
+        delta = timedelta(hours=23, minutes=59, seconds=59)
+        exp_index = _get_with_delta(delta)
+        tm.assert_index_equal(result.columns, exp_index)
+
+        # invalid axis
+        tm.assert_raises_regex(
+            ValueError, 'axis', df.to_timestamp, axis=2)
+
+        result1 = df.to_timestamp('5t', axis=1)
+        result2 = df.to_timestamp('t', axis=1)
+        expected = pd.date_range('2001-01-01', '2009-01-01', freq='AS')
+        assert isinstance(result1.columns, DatetimeIndex)
+        assert isinstance(result2.columns, DatetimeIndex)
+        tm.assert_numpy_array_equal(result1.columns.asi8, expected.asi8)
+        tm.assert_numpy_array_equal(result2.columns.asi8, expected.asi8)
+        # PeriodIndex.to_timestamp always uses 'infer'
+        assert result1.columns.freqstr == 'AS-JAN'
+        assert result2.columns.freqstr == 'AS-JAN'
+
+    def test_frame_index_to_string(self):
+        index = PeriodIndex(['2011-1', '2011-2', '2011-3'], freq='M')
+        frame = DataFrame(np.random.randn(3, 4), index=index)
+
+        # it works!
+        frame.to_string()
+
+    def test_align_frame(self):
+        rng = period_range('1/1/2000', '1/1/2010', freq='A')
+        ts = DataFrame(np.random.randn(len(rng), 3), index=rng)
+
+        result = ts + ts[::2]
+        expected = ts + ts
+        expected.values[1::2] = np.nan
+        tm.assert_frame_equal(result, expected)
+
+        result = ts + _permute(ts[::2])
+        tm.assert_frame_equal(result, expected)
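The new `test_period.py` above pivots on `DataFrame.to_timestamp`, which converts a `PeriodIndex` to a `DatetimeIndex`; `'end'` anchors each period at its last day, `'start'` at its first. A compact sketch of the conversion it exercises (toy frame, not from the suite):

```python
import numpy as np
from pandas import DataFrame, period_range

df = DataFrame(np.arange(3.), index=period_range('2001', periods=3, freq='A'))

stamped_end = df.to_timestamp('D', 'end')      # last calendar day of each year
stamped_start = df.to_timestamp('D', 'start')  # Jan 1 of each year
```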
diff --git a/pandas/tests/frame/test_quantile.py b/pandas/tests/frame/test_quantile.py
index 22414a6ba8a53..2482e493dbefd 100644
--- a/pandas/tests/frame/test_quantile.py
+++ b/pandas/tests/frame/test_quantile.py
@@ -3,15 +3,13 @@

 from __future__ import print_function

-import nose
+import pytest
 import numpy as np

 from pandas import (DataFrame, Series, Timestamp,
                     _np_version_under1p11)
 import pandas as pd

-from pandas.util.testing import (assert_series_equal,
-                                 assert_frame_equal,
-                                 assertRaisesRegexp)
+from pandas.util.testing import assert_series_equal, assert_frame_equal
 import pandas.util.testing as tm

 from pandas import _np_version_under1p9
@@ -19,20 +17,18 @@

 from pandas.tests.frame.common import TestData


-class TestDataFrameQuantile(tm.TestCase, TestData):
-
-    _multiprocess_can_split_ = True
+class TestDataFrameQuantile(TestData):

     def test_quantile(self):
         from numpy import percentile

         q = self.tsframe.quantile(0.1, axis=0)
-        self.assertEqual(q['A'], percentile(self.tsframe['A'], 10))
+        assert q['A'] == percentile(self.tsframe['A'], 10)
         tm.assert_index_equal(q.index, self.tsframe.columns)

         q = self.tsframe.quantile(0.9, axis=1)
-        self.assertEqual(q['2000-01-17'],
-                         percentile(self.tsframe.loc['2000-01-17'], 90))
+        assert (q['2000-01-17'] ==
+                percentile(self.tsframe.loc['2000-01-17'], 90))
         tm.assert_index_equal(q.index, self.tsframe.index)

         # test degenerate case
@@ -79,7 +75,7 @@ def test_quantile_axis_mixed(self):
         # must raise
         def f():
             df.quantile(.5, axis=1, numeric_only=False)
-        self.assertRaises(TypeError, f)
+        pytest.raises(TypeError, f)

     def test_quantile_axis_parameter(self):
         # GH 9543/9544
@@ -102,44 +98,44 @@ def test_quantile_axis_parameter(self):

         result = df.quantile(.5, axis="columns")
         assert_series_equal(result, expected)

-        self.assertRaises(ValueError, df.quantile, 0.1, axis=-1)
-        self.assertRaises(ValueError, df.quantile, 0.1, axis="column")
+        pytest.raises(ValueError, df.quantile, 0.1, axis=-1)
+        pytest.raises(ValueError, df.quantile, 0.1, axis="column")

     def test_quantile_interpolation(self):
-        # GH #10174
+        # see gh-10174
         if _np_version_under1p9:
-            raise nose.SkipTest("Numpy version under 1.9")
+            pytest.skip("Numpy version under 1.9")

         from numpy import percentile

         # interpolation = linear (default case)
         q = self.tsframe.quantile(0.1, axis=0, interpolation='linear')
-        self.assertEqual(q['A'], percentile(self.tsframe['A'], 10))
+        assert q['A'] == percentile(self.tsframe['A'], 10)
         q = self.intframe.quantile(0.1)
-        self.assertEqual(q['A'], percentile(self.intframe['A'], 10))
+        assert q['A'] == percentile(self.intframe['A'], 10)

         # test with and without interpolation keyword
         q1 = self.intframe.quantile(0.1)
-        self.assertEqual(q1['A'], np.percentile(self.intframe['A'], 10))
-        assert_series_equal(q, q1)
+        assert q1['A'] == np.percentile(self.intframe['A'], 10)
+        tm.assert_series_equal(q, q1)

         # interpolation method other than default linear
         df = DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]}, index=[1, 2, 3])
         result = df.quantile(.5, axis=1, interpolation='nearest')
         expected = Series([1, 2, 3], index=[1, 2, 3], name=0.5)
-        assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)

         # cross-check interpolation=nearest results in original dtype
         exp = np.percentile(np.array([[1, 2, 3], [2, 3, 4]]), .5,
                             axis=0, interpolation='nearest')
         expected = Series(exp, index=[1, 2, 3], name=0.5, dtype='int64')
-        assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)

         # float
         df = DataFrame({"A": [1., 2., 3.], "B": [2., 3., 4.]},
                        index=[1, 2, 3])
         result = df.quantile(.5, axis=1, interpolation='nearest')
         expected = Series([1., 2., 3.], index=[1, 2, 3], name=0.5)
-        assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)
         exp = np.percentile(np.array([[1., 2., 3.], [2., 3., 4.]]), .5,
                             axis=0, interpolation='nearest')
         expected = Series(exp, index=[1, 2, 3], name=0.5, dtype='float64')
@@ -171,41 +167,41 @@ def test_quantile_interpolation(self):
         assert_frame_equal(result, expected)

     def test_quantile_interpolation_np_lt_1p9(self):
-        # GH #10174
+        # see gh-10174
         if not _np_version_under1p9:
-            raise nose.SkipTest("Numpy version is greater than 1.9")
+            pytest.skip("Numpy version is greater than 1.9")

         from numpy import percentile

         # interpolation = linear (default case)
         q = self.tsframe.quantile(0.1, axis=0, interpolation='linear')
-        self.assertEqual(q['A'], percentile(self.tsframe['A'], 10))
+        assert q['A'] == percentile(self.tsframe['A'], 10)
         q = self.intframe.quantile(0.1)
-        self.assertEqual(q['A'], percentile(self.intframe['A'], 10))
+        assert q['A'] == percentile(self.intframe['A'], 10)

         # test with and without interpolation keyword
         q1 = self.intframe.quantile(0.1)
-        self.assertEqual(q1['A'], np.percentile(self.intframe['A'], 10))
+        assert q1['A'] == np.percentile(self.intframe['A'], 10)
         assert_series_equal(q, q1)

         # interpolation method other than default linear
-        expErrMsg = "Interpolation methods other than linear"
+        msg = "Interpolation methods other than linear"
         df = DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]}, index=[1, 2, 3])
-        with assertRaisesRegexp(ValueError, expErrMsg):
+        with tm.assert_raises_regex(ValueError, msg):
             df.quantile(.5, axis=1, interpolation='nearest')

-        with assertRaisesRegexp(ValueError, expErrMsg):
+        with tm.assert_raises_regex(ValueError, msg):
             df.quantile([.5, .75], axis=1, interpolation='lower')

         # test degenerate case
         df = DataFrame({'x': [], 'y': []})
-        with assertRaisesRegexp(ValueError, expErrMsg):
+        with tm.assert_raises_regex(ValueError, msg):
             q = df.quantile(0.1, axis=0, interpolation='higher')

         # multi
         df = DataFrame([[1, 1, 1], [2, 2, 2], [3, 3, 3]],
                        columns=['a', 'b', 'c'])
-        with assertRaisesRegexp(ValueError, expErrMsg):
+        with tm.assert_raises_regex(ValueError, msg):
             df.quantile([.25, .5], interpolation='midpoint')

     def test_quantile_multi(self):
@@ -270,7 +266,7 @@ def test_quantile_datetime(self):

     def test_quantile_invalid(self):
         msg = 'percentiles should all be in the interval \\[0, 1\\]'
         for invalid in [-1, 2, [0.5, -1], [0.5, 2]]:
-            with tm.assertRaisesRegexp(ValueError, msg):
+            with tm.assert_raises_regex(ValueError, msg):
                 self.tsframe.quantile(invalid)

     def test_quantile_box(self):
@@ -433,7 +429,7 @@ def test_quantile_empty(self):
         # res = df.quantile(0.5)

         # datetimes
-        df = DataFrame(columns=['a', 'b'], dtype='datetime64')
+        df = DataFrame(columns=['a', 'b'], dtype='datetime64[ns]')

         # FIXME (gives NaNs instead of NaT in 0.18.1 or 0.19.0)
         # res = df.quantile(0.5, numeric_only=False)
diff --git a/pandas/tests/frame/test_query_eval.py b/pandas/tests/frame/test_query_eval.py
index 36ae5dac733a5..f0f1a2df27e93 100644
--- a/pandas/tests/frame/test_query_eval.py
+++ b/pandas/tests/frame/test_query_eval.py
@@ -3,8 +3,7 @@
 from __future__ import print_function

 import operator
-import nose
-from itertools import product
+import pytest

 from pandas.compat import (zip, range, lrange, StringIO)
 from pandas import DataFrame, Series, Index, MultiIndex, date_range
@@ -15,11 +14,10 @@

 from pandas.util.testing import (assert_series_equal,
                                  assert_frame_equal,
-                                 assertRaises,
                                  makeCustomDataframe as mkdf)
 import pandas.util.testing as tm

-from pandas.computation import _NUMEXPR_INSTALLED
+from pandas.core.computation import _NUMEXPR_INSTALLED
 from pandas.tests.frame.common import TestData

@@ -28,21 +26,31 @@
 ENGINES = 'python', 'numexpr'


+@pytest.fixture(params=PARSERS, ids=lambda x: x)
+def parser(request):
+    return request.param
+
+
+@pytest.fixture(params=ENGINES, ids=lambda x: x)
+def engine(request):
+    return request.param
+
+
 def skip_if_no_pandas_parser(parser):
     if parser != 'pandas':
-        raise nose.SkipTest("cannot evaluate with parser {0!r}".format(parser))
+        pytest.skip("cannot evaluate with parser {0!r}".format(parser))


 def skip_if_no_ne(engine='numexpr'):
     if engine == 'numexpr':
         if not _NUMEXPR_INSTALLED:
-            raise nose.SkipTest("cannot query engine numexpr when numexpr not "
-                                "installed")
+            pytest.skip("cannot query engine numexpr when numexpr not "
+                        "installed")


-class TestCompat(tm.TestCase):
+class TestCompat(object):

-    def setUp(self):
+    def setup_method(self, method):
         self.df = DataFrame({'A': [1, 2, 3]})
         self.expected1 = self.df[self.df.A > 0]
         self.expected2 = self.df.A + 1
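The `parser`/`engine` fixtures just added replace the old nose-style `yield self.check_..., parser, engine` drivers removed throughout this file: any test that names a fixture as an argument is now collected once per parameter. A minimal self-contained sketch of the mechanism (toy fixture, not from this patch):

```python
import pytest

COLORS = 'red', 'green'


@pytest.fixture(params=COLORS, ids=lambda x: x)
def color(request):
    # pytest invokes the dependent test once per entry in ``params``
    return request.param


def test_color_is_known(color):
    # collected as test_color_is_known[red] and test_color_is_known[green]
    assert color in COLORS
```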
@@ -82,15 +90,13 @@ def test_query_numexpr(self):
             result = df.eval('A+1', engine='numexpr')
             assert_series_equal(result, self.expected2, check_names=False)
         else:
-            self.assertRaises(ImportError,
-                              lambda: df.query('A>0', engine='numexpr'))
-            self.assertRaises(ImportError,
-                              lambda: df.eval('A+1', engine='numexpr'))
+            pytest.raises(ImportError,
+                          lambda: df.query('A>0', engine='numexpr'))
+            pytest.raises(ImportError,
+                          lambda: df.eval('A+1', engine='numexpr'))


-class TestDataFrameEval(tm.TestCase, TestData):
-
-    _multiprocess_can_split_ = True
+class TestDataFrameEval(TestData):

     def test_ops(self):

@@ -141,10 +147,10 @@ def test_query_non_str(self):
         df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'b']})

         msg = "expr must be a string to be evaluated"
-        with tm.assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             df.query(lambda x: x.B == "b")

-        with tm.assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             df.query(111)

     def test_query_empty_string(self):
@@ -152,7 +158,7 @@ def test_query_empty_string(self):
         df = pd.DataFrame({'A': [1, 2, 3]})

         msg = "expr cannot be an empty string"
-        with tm.assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             df.query('')

     def test_eval_resolvers_as_list(self):
@@ -160,18 +166,17 @@ def test_eval_resolvers_as_list(self):
         df = DataFrame(randn(10, 2), columns=list('ab'))
         dict1 = {'a': 1}
         dict2 = {'b': 2}
-        self.assertTrue(df.eval('a + b', resolvers=[dict1, dict2]) ==
-                        dict1['a'] + dict2['b'])
-        self.assertTrue(pd.eval('a + b', resolvers=[dict1, dict2]) ==
-                        dict1['a'] + dict2['b'])
-
+        assert (df.eval('a + b', resolvers=[dict1, dict2]) ==
+                dict1['a'] + dict2['b'])
+        assert (pd.eval('a + b', resolvers=[dict1, dict2]) ==
+                dict1['a'] + dict2['b'])

-class TestDataFrameQueryWithMultiIndex(tm.TestCase):

-    _multiprocess_can_split_ = True
+class TestDataFrameQueryWithMultiIndex(object):

-    def check_query_with_named_multiindex(self, parser, engine):
+    def test_query_with_named_multiindex(self, parser, engine):
         tm.skip_if_no_ne(engine)
+        skip_if_no_pandas_parser(parser)
         a = np.random.choice(['red', 'green'], size=10)
         b = np.random.choice(['eggs', 'ham'], size=10)
         index = MultiIndex.from_arrays([a, b], names=['color', 'food'])
@@ -219,12 +224,9 @@ def check_query_with_named_multiindex(self, parser, engine):
         assert_frame_equal(res1, exp)
         assert_frame_equal(res2, exp)

-    def test_query_with_named_multiindex(self):
-        for parser, engine in product(['pandas'], ENGINES):
-            yield self.check_query_with_named_multiindex, parser, engine
-
-    def check_query_with_unnamed_multiindex(self, parser, engine):
+    def test_query_with_unnamed_multiindex(self, parser, engine):
         tm.skip_if_no_ne(engine)
+        skip_if_no_pandas_parser(parser)
         a = np.random.choice(['red', 'green'], size=10)
         b = np.random.choice(['eggs', 'ham'], size=10)
         index = MultiIndex.from_arrays([a, b])
@@ -313,12 +315,9 @@ def check_query_with_unnamed_multiindex(self, parser, engine):
         assert_frame_equal(res1, exp)
         assert_frame_equal(res2, exp)

-    def test_query_with_unnamed_multiindex(self):
-        for parser, engine in product(['pandas'], ENGINES):
-            yield self.check_query_with_unnamed_multiindex, parser, engine
-
-    def check_query_with_partially_named_multiindex(self, parser, engine):
+    def test_query_with_partially_named_multiindex(self, parser, engine):
         tm.skip_if_no_ne(engine)
+        skip_if_no_pandas_parser(parser)
         a = np.random.choice(['red', 'green'], size=10)
         b = np.arange(10)
         index = MultiIndex.from_arrays([a, b])
@@ -346,17 +345,7 @@ def check_query_with_partially_named_multiindex(self, parser, engine):
         exp = df[ind != "red"]
         assert_frame_equal(res, exp)

-    def test_query_with_partially_named_multiindex(self):
-        for parser, engine in product(['pandas'], ENGINES):
-            yield (self.check_query_with_partially_named_multiindex,
-                   parser, engine)
-
     def test_query_multiindex_get_index_resolvers(self):
-        for parser, engine in product(['pandas'], ENGINES):
-            yield (self.check_query_multiindex_get_index_resolvers, parser,
-                   engine)
-
-    def check_query_multiindex_get_index_resolvers(self, parser, engine):
         df = mkdf(10, 3, r_idx_nlevels=2, r_idx_names=['spam', 'eggs'])
         resolvers = df._get_index_resolvers()
@@ -380,41 +369,31 @@ def to_series(mi, level):
         else:
             raise AssertionError("object must be a Series or Index")

-    def test_raise_on_panel_with_multiindex(self):
-        for parser, engine in product(PARSERS, ENGINES):
-            yield self.check_raise_on_panel_with_multiindex, parser, engine
-
-    def check_raise_on_panel_with_multiindex(self, parser, engine):
+    def test_raise_on_panel_with_multiindex(self, parser, engine):
         tm.skip_if_no_ne()
         p = tm.makePanel(7)
         p.items = tm.makeCustomIndex(len(p.items), nlevels=2)
-        with tm.assertRaises(NotImplementedError):
+        with pytest.raises(NotImplementedError):
             pd.eval('p + 1', parser=parser, engine=engine)

-    def test_raise_on_panel4d_with_multiindex(self):
-        for parser, engine in product(PARSERS, ENGINES):
-            yield self.check_raise_on_panel4d_with_multiindex, parser, engine
-
-    def check_raise_on_panel4d_with_multiindex(self, parser, engine):
+    def test_raise_on_panel4d_with_multiindex(self, parser, engine):
         tm.skip_if_no_ne()
         p4d = tm.makePanel4D(7)
         p4d.items = tm.makeCustomIndex(len(p4d.items), nlevels=2)
-        with tm.assertRaises(NotImplementedError):
+        with pytest.raises(NotImplementedError):
             pd.eval('p4d + 1', parser=parser, engine=engine)


-class TestDataFrameQueryNumExprPandas(tm.TestCase):
+class TestDataFrameQueryNumExprPandas(object):

     @classmethod
-    def setUpClass(cls):
-        super(TestDataFrameQueryNumExprPandas, cls).setUpClass()
+    def setup_class(cls):
         cls.engine = 'numexpr'
         cls.parser = 'pandas'
         tm.skip_if_no_ne(cls.engine)

     @classmethod
-    def tearDownClass(cls):
-        super(TestDataFrameQueryNumExprPandas, cls).tearDownClass()
+    def teardown_class(cls):
         del cls.engine, cls.parser

     def test_date_query_with_attribute_access(self):
@@ -488,7 +467,7 @@ def test_date_index_query_with_NaT_duplicates(self):
         df = DataFrame(d)
         df.loc[np.random.rand(n) > 0.5, 'dates1'] = pd.NaT
         df.set_index('dates1', inplace=True, drop=True)
-        res = df.query('index < 20130101 < dates3', engine=engine,
+        res = df.query('dates1 < 20130101 < dates3', engine=engine,
                        parser=parser)
         expec = df[(df.index.to_series() < '20130101') &
                    ('20130101' < df.dates3)]
@@ -504,18 +483,18 @@ def test_date_query_with_non_date(self):

         ops = '==', '!=', '<', '>', '<=', '>='

         for op in ops:
-            with tm.assertRaises(TypeError):
+            with pytest.raises(TypeError):
                 df.query('dates %s nondate' % op, parser=parser,
                          engine=engine)

     def test_query_syntax_error(self):
         engine, parser = self.engine, self.parser
         df = DataFrame({"i": lrange(10), "+": lrange(3, 13),
                         "r": lrange(4, 14)})
-        with tm.assertRaises(SyntaxError):
+        with pytest.raises(SyntaxError):
             df.query('i - +', engine=engine, parser=parser)

     def test_query_scope(self):
-        from pandas.computation.ops import UndefinedVariableError
+        from pandas.core.computation.ops import UndefinedVariableError
         engine, parser = self.engine, self.parser
         skip_if_no_pandas_parser(parser)

@@ -531,34 +510,34 @@ def test_query_scope(self):
         assert_frame_equal(res, expected)

         # no local variable c
-        with tm.assertRaises(UndefinedVariableError):
+        with pytest.raises(UndefinedVariableError):
             df.query('@a > b > @c', engine=engine, parser=parser)

         # no column named 'c'
-        with tm.assertRaises(UndefinedVariableError):
+        with pytest.raises(UndefinedVariableError):
             df.query('@a > b > c', engine=engine, parser=parser)

     def test_query_doesnt_pickup_local(self):
-        from pandas.computation.ops import UndefinedVariableError
+        from pandas.core.computation.ops import UndefinedVariableError
         engine, parser = self.engine, self.parser
         n = m = 10
         df = DataFrame(np.random.randint(m, size=(n, 3)), columns=list('abc'))

         # we don't pick up the local 'sin'
-        with tm.assertRaises(UndefinedVariableError):
+        with pytest.raises(UndefinedVariableError):
             df.query('sin > 5', engine=engine, parser=parser)

     def test_query_builtin(self):
-        from pandas.computation.engines import NumExprClobberingError
+        from pandas.core.computation.engines import NumExprClobberingError
         engine, parser = self.engine, self.parser

         n = m = 10
         df = DataFrame(np.random.randint(m, size=(n, 3)), columns=list('abc'))

         df.index.name = 'sin'
-        with tm.assertRaisesRegexp(NumExprClobberingError,
-                                   'Variables in expression.+'):
+        with tm.assert_raises_regex(NumExprClobberingError,
+                                    'Variables in expression.+'):
             df.query('sin > 5', engine=engine, parser=parser)

     def test_query(self):
@@ -628,12 +607,12 @@ def test_nested_scope(self):
         assert_frame_equal(result, expected)

     def test_nested_raises_on_local_self_reference(self):
-        from pandas.computation.ops import UndefinedVariableError
+        from pandas.core.computation.ops import UndefinedVariableError

         df = DataFrame(np.random.randn(5, 3))

         # can't reference ourself b/c we're a local so @ is necessary
-        with tm.assertRaises(UndefinedVariableError):
+        with pytest.raises(UndefinedVariableError):
             df.query('df > 0', engine=self.engine, parser=self.parser)

     def test_local_syntax(self):
@@ -687,12 +666,12 @@ def test_at_inside_string(self):
         assert_frame_equal(result, expected)

     def test_query_undefined_local(self):
-        from pandas.computation.ops import UndefinedVariableError
+        from pandas.core.computation.ops import UndefinedVariableError
         engine, parser = self.engine, self.parser
         skip_if_no_pandas_parser(parser)
         df = DataFrame(np.random.rand(10, 2), columns=list('ab'))
-        with tm.assertRaisesRegexp(UndefinedVariableError,
-                                   "local variable 'c' is not defined"):
+        with tm.assert_raises_regex(UndefinedVariableError,
+                                    "local variable 'c' is not defined"):
             df.query('a == @c', engine=engine, parser=parser)

     def test_index_resolvers_come_after_columns_with_the_same_name(self):
@@ -738,8 +717,8 @@ def test_inf(self):
 class TestDataFrameQueryNumExprPython(TestDataFrameQueryNumExprPandas):

     @classmethod
-    def setUpClass(cls):
-        super(TestDataFrameQueryNumExprPython, cls).setUpClass()
+    def setup_class(cls):
+        super(TestDataFrameQueryNumExprPython, cls).setup_class()
         cls.engine = 'numexpr'
         cls.parser = 'python'
         tm.skip_if_no_ne(cls.engine)
@@ -803,26 +782,26 @@ def test_date_index_query_with_NaT_duplicates(self):
         df['dates3'] = date_range('1/1/2014', periods=n)
         df.loc[np.random.rand(n) > 0.5, 'dates1'] = pd.NaT
         df.set_index('dates1', inplace=True, drop=True)
-        with tm.assertRaises(NotImplementedError):
+        with pytest.raises(NotImplementedError):
             df.query('index < 20130101 < dates3', engine=engine,
                      parser=parser)

     def test_nested_scope(self):
-        from pandas.computation.ops import UndefinedVariableError
+        from pandas.core.computation.ops import UndefinedVariableError
         engine = self.engine
         parser = self.parser
         # smoke test
         x = 1  # noqa
         result = pd.eval('x + 1', engine=engine, parser=parser)
-        self.assertEqual(result, 2)
+        assert result == 2

         df = DataFrame(np.random.randn(5, 3))
         df2 = DataFrame(np.random.randn(5, 3))

         # don't have the pandas parser
-        with tm.assertRaises(SyntaxError):
+        with pytest.raises(SyntaxError):
             df.query('(@df>0) & (@df2>0)', engine=engine, parser=parser)

-        with tm.assertRaises(UndefinedVariableError):
+        with pytest.raises(UndefinedVariableError):
             df.query('(df>0) & (df2>0)', engine=engine, parser=parser)

         expected = df[(df > 0) & (df2 > 0)]
@@ -839,8 +818,8 @@ def test_nested_scope(self):
 class TestDataFrameQueryPythonPandas(TestDataFrameQueryNumExprPandas):

     @classmethod
-    def setUpClass(cls):
-        super(TestDataFrameQueryPythonPandas, cls).setUpClass()
+    def setup_class(cls):
+        super(TestDataFrameQueryPythonPandas, cls).setup_class()
         cls.engine = 'python'
         cls.parser = 'pandas'
         cls.frame = TestData().frame
@@ -860,8 +839,8 @@ def test_query_builtin(self):

 class TestDataFrameQueryPythonPython(TestDataFrameQueryNumExprPython):

     @classmethod
-    def setUpClass(cls):
-        super(TestDataFrameQueryPythonPython, cls).setUpClass()
+    def setup_class(cls):
+        super(TestDataFrameQueryPythonPython, cls).setup_class()
         cls.engine = cls.parser = 'python'
         cls.frame = TestData().frame

@@ -877,9 +856,9 @@ def test_query_builtin(self):
         assert_frame_equal(expected, result)
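The class converted below exercises string comparisons inside `DataFrame.query`. A toy sketch of the two idioms it covers, equality against a literal and membership in a list (the latter needs the default pandas parser):

```python
from pandas import DataFrame

df = DataFrame({'b': [1.0, 2.0, 3.0], 'strings': ['a', 'b', 'a']})

eq_a = df.query('strings == "a"')          # rows where strings equals 'a'
in_ab = df.query('strings in ["a", "b"]')  # pandas parser only
```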
df.query, ex, + engine=engine, parser=parser, + local_dict={'strings': df.strings}) else: res = df.query('"a" == strings', engine=engine, parser=parser) assert_frame_equal(res, expect) @@ -915,15 +895,7 @@ def check_str_query_method(self, parser, engine): assert_frame_equal(res, expect) assert_frame_equal(res, df[~df.strings.isin(['a'])]) - def test_str_query_method(self): - for parser, engine in product(PARSERS, ENGINES): - yield self.check_str_query_method, parser, engine - - def test_str_list_query_method(self): - for parser, engine in product(PARSERS, ENGINES): - yield self.check_str_list_query_method, parser, engine - - def check_str_list_query_method(self, parser, engine): + def test_str_list_query_method(self, parser, engine): tm.skip_if_no_ne(engine) df = DataFrame(randn(10, 1), columns=['b']) df['strings'] = Series(list('aabbccddee')) @@ -941,7 +913,7 @@ def check_str_list_query_method(self, parser, engine): for lhs, op, rhs in zip(lhs, ops, rhs): ex = '{lhs} {op} {rhs}'.format(lhs=lhs, op=op, rhs=rhs) - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): df.query(ex, engine=engine, parser=parser) else: res = df.query('strings == ["a", "b"]', engine=engine, @@ -962,7 +934,7 @@ def check_str_list_query_method(self, parser, engine): parser=parser) assert_frame_equal(res, expect) - def check_query_with_string_columns(self, parser, engine): + def test_query_with_string_columns(self, parser, engine): tm.skip_if_no_ne(engine) df = DataFrame({'a': list('aaaabbbbcccc'), 'b': list('aabbccddeeff'), @@ -977,17 +949,13 @@ def check_query_with_string_columns(self, parser, engine): expec = df[df.a.isin(df.b) & (df.c < df.d)] assert_frame_equal(res, expec) else: - with assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): df.query('a in b', parser=parser, engine=engine) - with assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): df.query('a in b and c < d', parser=parser, engine=engine) - def test_query_with_string_columns(self): - for parser, engine in product(PARSERS, ENGINES): - yield self.check_query_with_string_columns, parser, engine - - def check_object_array_eq_ne(self, parser, engine): + def test_object_array_eq_ne(self, parser, engine): tm.skip_if_no_ne(engine) df = DataFrame({'a': list('aaaabbbbcccc'), 'b': list('aabbccddeeff'), @@ -1001,11 +969,7 @@ def check_object_array_eq_ne(self, parser, engine): exp = df[df.a != df.b] assert_frame_equal(res, exp) - def test_object_array_eq_ne(self): - for parser, engine in product(PARSERS, ENGINES): - yield self.check_object_array_eq_ne, parser, engine - - def check_query_with_nested_strings(self, parser, engine): + def test_query_with_nested_strings(self, parser, engine): tm.skip_if_no_ne(engine) skip_if_no_pandas_parser(parser) raw = """id event timestamp @@ -1029,11 +993,7 @@ def check_query_with_nested_strings(self, parser, engine): engine=engine) assert_frame_equal(expected, res) - def test_query_with_nested_string(self): - for parser, engine in product(PARSERS, ENGINES): - yield self.check_query_with_nested_strings, parser, engine - - def check_query_with_nested_special_character(self, parser, engine): + def test_query_with_nested_special_character(self, parser, engine): skip_if_no_pandas_parser(parser) tm.skip_if_no_ne(engine) df = DataFrame({'a': ['a', 'b', 'test & test'], @@ -1042,12 +1002,7 @@ def check_query_with_nested_special_character(self, parser, engine): expec = df[df.a == 'test & test'] assert_frame_equal(res, expec) - def 
test_query_with_nested_special_character(self): - for parser, engine in product(PARSERS, ENGINES): - yield (self.check_query_with_nested_special_character, - parser, engine) - - def check_query_lex_compare_strings(self, parser, engine): + def test_query_lex_compare_strings(self, parser, engine): tm.skip_if_no_ne(engine=engine) import operator as opr @@ -1062,11 +1017,7 @@ def check_query_lex_compare_strings(self, parser, engine): expected = df[func(df.X, 'd')] assert_frame_equal(res, expected) - def test_query_lex_compare_strings(self): - for parser, engine in product(PARSERS, ENGINES): - yield self.check_query_lex_compare_strings, parser, engine - - def check_query_single_element_booleans(self, parser, engine): + def test_query_single_element_booleans(self, parser, engine): tm.skip_if_no_ne(engine) columns = 'bid', 'bidsize', 'ask', 'asksize' data = np.random.randint(2, size=(1, len(columns))).astype(bool) @@ -1075,12 +1026,9 @@ def check_query_single_element_booleans(self, parser, engine): expected = df[df.bid & df.ask] assert_frame_equal(res, expected) - def test_query_single_element_booleans(self): - for parser, engine in product(PARSERS, ENGINES): - yield self.check_query_single_element_booleans, parser, engine - - def check_query_string_scalar_variable(self, parser, engine): + def test_query_string_scalar_variable(self, parser, engine): tm.skip_if_no_ne(engine) + skip_if_no_pandas_parser(parser) df = pd.DataFrame({'Symbol': ['BUD US', 'BUD US', 'IBM US', 'IBM US'], 'Price': [109.70, 109.72, 183.30, 183.35]}) e = df[df.Symbol == 'BUD US'] @@ -1088,24 +1036,19 @@ def check_query_string_scalar_variable(self, parser, engine): r = df.query('Symbol == @symb', parser=parser, engine=engine) assert_frame_equal(e, r) - def test_query_string_scalar_variable(self): - for parser, engine in product(['pandas'], ENGINES): - yield self.check_query_string_scalar_variable, parser, engine - -class TestDataFrameEvalNumExprPandas(tm.TestCase): +class TestDataFrameEvalNumExprPandas(object): @classmethod - def setUpClass(cls): - super(TestDataFrameEvalNumExprPandas, cls).setUpClass() + def setup_class(cls): cls.engine = 'numexpr' cls.parser = 'pandas' tm.skip_if_no_ne() - def setUp(self): + def setup_method(self, method): self.frame = DataFrame(randn(10, 3), columns=list('abc')) - def tearDown(self): + def teardown_method(self, method): del self.frame def test_simple_expr(self): @@ -1123,9 +1066,9 @@ def test_invalid_type_for_operator_raises(self): df = DataFrame({'a': [1, 2], 'b': ['c', 'd']}) ops = '+', '-', '*', '/' for op in ops: - with tm.assertRaisesRegexp(TypeError, - r"unsupported operand type\(s\) for " - r".+: '.+' and '.+'"): + with tm.assert_raises_regex(TypeError, + "unsupported operand type\(s\) " + "for .+: '.+' and '.+'"): df.eval('a {0} b'.format(op), engine=self.engine, parser=self.parser) @@ -1133,8 +1076,8 @@ def test_invalid_type_for_operator_raises(self): class TestDataFrameEvalNumExprPython(TestDataFrameEvalNumExprPandas): @classmethod - def setUpClass(cls): - super(TestDataFrameEvalNumExprPython, cls).setUpClass() + def setup_class(cls): + super(TestDataFrameEvalNumExprPython, cls).setup_class() cls.engine = 'numexpr' cls.parser = 'python' tm.skip_if_no_ne(cls.engine) @@ -1143,8 +1086,8 @@ def setUpClass(cls): class TestDataFrameEvalPythonPandas(TestDataFrameEvalNumExprPandas): @classmethod - def setUpClass(cls): - super(TestDataFrameEvalPythonPandas, cls).setUpClass() + def setup_class(cls): + super(TestDataFrameEvalPythonPandas, cls).setup_class() cls.engine = 'python' cls.parser = 
'pandas' @@ -1152,11 +1095,5 @@ def setUpClass(cls): class TestDataFrameEvalPythonPython(TestDataFrameEvalNumExprPython): @classmethod - def setUpClass(cls): - super(TestDataFrameEvalPythonPython, cls).tearDownClass() + def setup_class(cls): cls.engine = cls.parser = 'python' - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/frame/test_rank.py b/pandas/tests/frame/test_rank.py new file mode 100644 index 0000000000000..acf887d047c9e --- /dev/null +++ b/pandas/tests/frame/test_rank.py @@ -0,0 +1,269 @@ +# -*- coding: utf-8 -*- +from datetime import timedelta, datetime +from distutils.version import LooseVersion +from numpy import nan +import numpy as np + +from pandas import Series, DataFrame + +from pandas.compat import product +from pandas.util.testing import assert_frame_equal +import pandas.util.testing as tm +from pandas.tests.frame.common import TestData + + +class TestRank(TestData): + s = Series([1, 3, 4, 2, nan, 2, 1, 5, nan, 3]) + df = DataFrame({'A': s, 'B': s}) + + results = { + 'average': np.array([1.5, 5.5, 7.0, 3.5, nan, + 3.5, 1.5, 8.0, nan, 5.5]), + 'min': np.array([1, 5, 7, 3, nan, 3, 1, 8, nan, 5]), + 'max': np.array([2, 6, 7, 4, nan, 4, 2, 8, nan, 6]), + 'first': np.array([1, 5, 7, 3, nan, 4, 2, 8, nan, 6]), + 'dense': np.array([1, 3, 4, 2, nan, 2, 1, 5, nan, 3]), + } + + def test_rank(self): + tm._skip_if_no_scipy() + from scipy.stats import rankdata + + self.frame['A'][::2] = np.nan + self.frame['B'][::3] = np.nan + self.frame['C'][::4] = np.nan + self.frame['D'][::5] = np.nan + + ranks0 = self.frame.rank() + ranks1 = self.frame.rank(1) + mask = np.isnan(self.frame.values) + + fvals = self.frame.fillna(np.inf).values + + exp0 = np.apply_along_axis(rankdata, 0, fvals) + exp0[mask] = np.nan + + exp1 = np.apply_along_axis(rankdata, 1, fvals) + exp1[mask] = np.nan + + tm.assert_almost_equal(ranks0.values, exp0) + tm.assert_almost_equal(ranks1.values, exp1) + + # integers + df = DataFrame(np.random.randint(0, 5, size=40).reshape((10, 4))) + + result = df.rank() + exp = df.astype(float).rank() + tm.assert_frame_equal(result, exp) + + result = df.rank(1) + exp = df.astype(float).rank(1) + tm.assert_frame_equal(result, exp) + + def test_rank2(self): + df = DataFrame([[1, 3, 2], [1, 2, 3]]) + expected = DataFrame([[1.0, 3.0, 2.0], [1, 2, 3]]) / 3.0 + result = df.rank(1, pct=True) + tm.assert_frame_equal(result, expected) + + df = DataFrame([[1, 3, 2], [1, 2, 3]]) + expected = df.rank(0) / 2.0 + result = df.rank(0, pct=True) + tm.assert_frame_equal(result, expected) + + df = DataFrame([['b', 'c', 'a'], ['a', 'c', 'b']]) + expected = DataFrame([[2.0, 3.0, 1.0], [1, 3, 2]]) + result = df.rank(1, numeric_only=False) + tm.assert_frame_equal(result, expected) + + expected = DataFrame([[2.0, 1.5, 1.0], [1, 1.5, 2]]) + result = df.rank(0, numeric_only=False) + tm.assert_frame_equal(result, expected) + + df = DataFrame([['b', np.nan, 'a'], ['a', 'c', 'b']]) + expected = DataFrame([[2.0, nan, 1.0], [1.0, 3.0, 2.0]]) + result = df.rank(1, numeric_only=False) + tm.assert_frame_equal(result, expected) + + expected = DataFrame([[2.0, nan, 1.0], [1.0, 1.0, 2.0]]) + result = df.rank(0, numeric_only=False) + tm.assert_frame_equal(result, expected) + + # f7u12, this does not work without extensive workaround + data = [[datetime(2001, 1, 5), nan, datetime(2001, 1, 2)], + [datetime(2000, 1, 2), datetime(2000, 1, 3), + datetime(2000, 1, 1)]] + df = DataFrame(data) + + # check the rank + expected = 
DataFrame([[2., nan, 1.], + [2., 3., 1.]]) + result = df.rank(1, numeric_only=False, ascending=True) + tm.assert_frame_equal(result, expected) + + expected = DataFrame([[1., nan, 2.], + [2., 1., 3.]]) + result = df.rank(1, numeric_only=False, ascending=False) + tm.assert_frame_equal(result, expected) + + # mixed-type frames + self.mixed_frame['datetime'] = datetime.now() + self.mixed_frame['timedelta'] = timedelta(days=1, seconds=1) + + result = self.mixed_frame.rank(1) + expected = self.mixed_frame.rank(1, numeric_only=True) + tm.assert_frame_equal(result, expected) + + df = DataFrame({"a": [1e-20, -5, 1e-20 + 1e-40, 10, + 1e60, 1e80, 1e-30]}) + exp = DataFrame({"a": [3.5, 1., 3.5, 5., 6., 7., 2.]}) + tm.assert_frame_equal(df.rank(), exp) + + def test_rank_na_option(self): + tm._skip_if_no_scipy() + from scipy.stats import rankdata + + self.frame['A'][::2] = np.nan + self.frame['B'][::3] = np.nan + self.frame['C'][::4] = np.nan + self.frame['D'][::5] = np.nan + + # bottom + ranks0 = self.frame.rank(na_option='bottom') + ranks1 = self.frame.rank(1, na_option='bottom') + + fvals = self.frame.fillna(np.inf).values + + exp0 = np.apply_along_axis(rankdata, 0, fvals) + exp1 = np.apply_along_axis(rankdata, 1, fvals) + + tm.assert_almost_equal(ranks0.values, exp0) + tm.assert_almost_equal(ranks1.values, exp1) + + # top + ranks0 = self.frame.rank(na_option='top') + ranks1 = self.frame.rank(1, na_option='top') + + fval0 = self.frame.fillna((self.frame.min() - 1).to_dict()).values + fval1 = self.frame.T + fval1 = fval1.fillna((fval1.min() - 1).to_dict()).T + fval1 = fval1.fillna(np.inf).values + + exp0 = np.apply_along_axis(rankdata, 0, fval0) + exp1 = np.apply_along_axis(rankdata, 1, fval1) + + tm.assert_almost_equal(ranks0.values, exp0) + tm.assert_almost_equal(ranks1.values, exp1) + + # descending + + # bottom + ranks0 = self.frame.rank(na_option='top', ascending=False) + ranks1 = self.frame.rank(1, na_option='top', ascending=False) + + fvals = self.frame.fillna(np.inf).values + + exp0 = np.apply_along_axis(rankdata, 0, -fvals) + exp1 = np.apply_along_axis(rankdata, 1, -fvals) + + tm.assert_almost_equal(ranks0.values, exp0) + tm.assert_almost_equal(ranks1.values, exp1) + + # descending + + # top + ranks0 = self.frame.rank(na_option='bottom', ascending=False) + ranks1 = self.frame.rank(1, na_option='bottom', ascending=False) + + fval0 = self.frame.fillna((self.frame.min() - 1).to_dict()).values + fval1 = self.frame.T + fval1 = fval1.fillna((fval1.min() - 1).to_dict()).T + fval1 = fval1.fillna(np.inf).values + + exp0 = np.apply_along_axis(rankdata, 0, -fval0) + exp1 = np.apply_along_axis(rankdata, 1, -fval1) + + tm.assert_numpy_array_equal(ranks0.values, exp0) + tm.assert_numpy_array_equal(ranks1.values, exp1) + + def test_rank_axis(self): + # check if using axes' names gives the same result + df = DataFrame([[2, 1], [4, 3]]) + tm.assert_frame_equal(df.rank(axis=0), df.rank(axis='index')) + tm.assert_frame_equal(df.rank(axis=1), df.rank(axis='columns')) + + def test_rank_methods_frame(self): + tm.skip_if_no_package('scipy', min_version='0.13', + app='scipy.stats.rankdata') + import scipy + from scipy.stats import rankdata + + xs = np.random.randint(0, 21, (100, 26)) + xs = (xs - 10.0) / 10.0 + cols = [chr(ord('z') - i) for i in range(xs.shape[1])] + + for vals in [xs, xs + 1e6, xs * 1e-6]: + df = DataFrame(vals, columns=cols) + + for ax in [0, 1]: + for m in ['average', 'min', 'max', 'first', 'dense']: + result = df.rank(axis=ax, method=m) + sprank = np.apply_along_axis( + rankdata, ax, vals, + m 
if m != 'first' else 'ordinal') + sprank = sprank.astype(np.float64) + expected = DataFrame(sprank, columns=cols) + + if LooseVersion(scipy.__version__) >= '0.17.0': + expected = expected.astype('float64') + tm.assert_frame_equal(result, expected) + + def test_rank_descending(self): + dtypes = ['O', 'f8', 'i8'] + + for dtype, method in product(dtypes, self.results): + if 'i' in dtype: + df = self.df.dropna() + else: + df = self.df.astype(dtype) + + res = df.rank(ascending=False) + expected = (df.max() - df).rank() + assert_frame_equal(res, expected) + + if method == 'first' and dtype == 'O': + continue + + expected = (df.max() - df).rank(method=method) + + if dtype != 'O': + res2 = df.rank(method=method, ascending=False, + numeric_only=True) + assert_frame_equal(res2, expected) + + res3 = df.rank(method=method, ascending=False, + numeric_only=False) + assert_frame_equal(res3, expected) + + def test_rank_2d_tie_methods(self): + df = self.df + + def _check2d(df, expected, method='average', axis=0): + exp_df = DataFrame({'A': expected, 'B': expected}) + + if axis == 1: + df = df.T + exp_df = exp_df.T + + result = df.rank(method=method, axis=axis) + assert_frame_equal(result, exp_df) + + dtypes = [None, object] + disabled = set([(object, 'first')]) + results = self.results + + for method, axis, dtype in product(results, [0, 1], dtypes): + if (dtype, method) in disabled: + continue + frame = df if dtype is None else df.astype(dtype) + _check2d(frame, results[method], method=method, axis=axis) diff --git a/pandas/tests/frame/test_replace.py b/pandas/tests/frame/test_replace.py index 3bc388da5bec8..fbc4accd0e41e 100644 --- a/pandas/tests/frame/test_replace.py +++ b/pandas/tests/frame/test_replace.py @@ -2,6 +2,8 @@ from __future__ import print_function +import pytest + from datetime import datetime import re @@ -21,9 +23,7 @@ from pandas.tests.frame.common import TestData -class TestDataFrameReplace(tm.TestCase, TestData): - - _multiprocess_can_split_ = True +class TestDataFrameReplace(TestData): def test_replace_inplace(self): self.tsframe['A'][:5] = nan @@ -33,12 +33,13 @@ def test_replace_inplace(self): tsframe.replace(nan, 0, inplace=True) assert_frame_equal(tsframe, self.tsframe.fillna(0)) - self.assertRaises(TypeError, self.tsframe.replace, nan, inplace=True) - self.assertRaises(TypeError, self.tsframe.replace, nan) + pytest.raises(TypeError, self.tsframe.replace, nan, inplace=True) + pytest.raises(TypeError, self.tsframe.replace, nan) # mixed type - self.mixed_frame.ix[5:20, 'foo'] = nan - self.mixed_frame.ix[-10:, 'A'] = nan + mf = self.mixed_frame + mf.iloc[5:20, mf.columns.get_loc('foo')] = nan + mf.iloc[-10:, mf.columns.get_loc('A')] = nan result = self.mixed_frame.replace(np.nan, 0) expected = self.mixed_frame.fillna(value=0) @@ -547,7 +548,7 @@ def test_regex_replace_numeric_to_object_conversion(self): expec = DataFrame({'a': ['a', 1, 2, 3], 'b': mix['b'], 'c': mix['c']}) res = df.replace(0, 'a') assert_frame_equal(res, expec) - self.assertEqual(res.a.dtype, np.object_) + assert res.a.dtype == np.object_ def test_replace_regex_metachar(self): metachars = '[]', '()', r'\d', r'\w', r'\s' @@ -639,8 +640,9 @@ def test_replace_convert(self): assert_series_equal(expec, res) def test_replace_mixed(self): - self.mixed_frame.ix[5:20, 'foo'] = nan - self.mixed_frame.ix[-10:, 'A'] = nan + mf = self.mixed_frame + mf.iloc[5:20, mf.columns.get_loc('foo')] = nan + mf.iloc[-10:, mf.columns.get_loc('A')] = nan result = self.mixed_frame.replace(np.nan, -18) expected = 
self.mixed_frame.fillna(value=-18) @@ -718,7 +720,7 @@ def test_replace_simple_nested_dict_with_nonexistent_value(self): assert_frame_equal(expected, result) def test_replace_value_is_none(self): - self.assertRaises(TypeError, self.tsframe.replace, nan) + pytest.raises(TypeError, self.tsframe.replace, nan) orig_value = self.tsframe.iloc[0, 0] orig2 = self.tsframe.iloc[1, 0] @@ -779,7 +781,7 @@ def test_replace_dtypes(self): # bools df = DataFrame({'bools': [True, False, True]}) result = df.replace(False, True) - self.assertTrue(result.values.all()) + assert result.values.all() # complex blocks df = DataFrame({'complex': [1j, 2j, 3j]}) @@ -795,7 +797,7 @@ def test_replace_dtypes(self): expected = DataFrame({'datetime64': Index([now] * 3)}) assert_frame_equal(result, expected) - def test_replace_input_formats(self): + def test_replace_input_formats_listlike(self): # both dicts to_rep = {'A': np.nan, 'B': 0, 'C': ''} values = {'A': 0, 'B': -1, 'C': 'missing'} @@ -812,15 +814,6 @@ def test_replace_input_formats(self): 'C': ['', 'asdf', 'fd']}) assert_frame_equal(result, expected) - # dict to scalar - filled = df.replace(to_rep, 0) - expected = {} - for k, v in compat.iteritems(df): - expected[k] = v.replace(to_rep[k], 0) - assert_frame_equal(filled, DataFrame(expected)) - - self.assertRaises(TypeError, df.replace, to_rep, [np.nan, 0, '']) - # scalar to dict values = {'A': 0, 'B': -1, 'C': 'missing'} df = DataFrame({'A': [np.nan, 0, np.nan], 'B': [0, 2, 5], @@ -840,7 +833,21 @@ def test_replace_input_formats(self): expected.replace(to_rep[i], values[i], inplace=True) assert_frame_equal(result, expected) - self.assertRaises(ValueError, df.replace, to_rep, values[1:]) + pytest.raises(ValueError, df.replace, to_rep, values[1:]) + + def test_replace_input_formats_scalar(self): + df = DataFrame({'A': [np.nan, 0, np.inf], 'B': [0, 2, 5], + 'C': ['', 'asdf', 'fd']}) + + # dict to scalar + to_rep = {'A': np.nan, 'B': 0, 'C': ''} + filled = df.replace(to_rep, 0) + expected = {} + for k, v in compat.iteritems(df): + expected[k] = v.replace(to_rep[k], 0) + assert_frame_equal(filled, DataFrame(expected)) + + pytest.raises(TypeError, df.replace, to_rep, [np.nan, 0, '']) # list to scalar to_rep = [np.nan, 0, ''] @@ -911,7 +918,7 @@ def test_replace_bool_with_bool(self): def test_replace_with_dict_with_bool_keys(self): df = DataFrame({0: [True, False], 1: [False, True]}) - with tm.assertRaisesRegexp(TypeError, 'Cannot compare types .+'): + with tm.assert_raises_regex(TypeError, 'Cannot compare types .+'): df.replace({'asdf': 'asdb', True: 'yes'}) def test_replace_truthy(self): @@ -922,7 +929,8 @@ def test_replace_truthy(self): def test_replace_int_to_int_chain(self): df = DataFrame({'a': lrange(1, 5)}) - with tm.assertRaisesRegexp(ValueError, "Replacement not allowed .+"): + with tm.assert_raises_regex(ValueError, + "Replacement not allowed .+"): df.replace({'a': dict(zip(range(1, 5), range(2, 6)))}) def test_replace_str_to_str_chain(self): @@ -930,7 +938,8 @@ def test_replace_str_to_str_chain(self): astr = a.astype(str) bstr = np.arange(2, 6).astype(str) df = DataFrame({'a': astr}) - with tm.assertRaisesRegexp(ValueError, "Replacement not allowed .+"): + with tm.assert_raises_regex(ValueError, + "Replacement not allowed .+"): df.replace({'a': dict(zip(astr, bstr))}) def test_replace_swapping_bug(self): @@ -969,7 +978,7 @@ def test_replace_period(self): 'out_augmented_MAY_2011.json', 'out_augmented_AUG_2011.json', 'out_augmented_JAN_2011.json'], columns=['fname']) - tm.assert_equal(set(df.fname.values), 
set(d['fname'].keys())) + assert set(df.fname.values) == set(d['fname'].keys()) expected = DataFrame({'fname': [d['fname'][k] for k in df.fname.values]}) result = df.replace(d) @@ -992,7 +1001,7 @@ def test_replace_datetime(self): 'out_augmented_MAY_2011.json', 'out_augmented_AUG_2011.json', 'out_augmented_JAN_2011.json'], columns=['fname']) - tm.assert_equal(set(df.fname.values), set(d['fname'].keys())) + assert set(df.fname.values) == set(d['fname'].keys()) expected = DataFrame({'fname': [d['fname'][k] for k in df.fname.values]}) result = df.replace(d) @@ -1053,3 +1062,13 @@ def test_replace_datetimetz(self): Timestamp('20130103', tz='US/Eastern')], 'B': [0, np.nan, 2]}) assert_frame_equal(result, expected) + + def test_replace_with_empty_dictlike(self): + # GH 15289 + mix = {'a': lrange(4), 'b': list('ab..'), 'c': ['a', 'b', nan, 'd']} + df = DataFrame(mix) + assert_frame_equal(df, df.replace({})) + assert_frame_equal(df, df.replace(Series([]))) + + assert_frame_equal(df, df.replace({'b': {}})) + assert_frame_equal(df, df.replace(Series({'b': {}}))) diff --git a/pandas/tests/frame/test_repr_info.py b/pandas/tests/frame/test_repr_info.py index 12cd62f8b4cc0..cc37f8cc3cb02 100644 --- a/pandas/tests/frame/test_repr_info.py +++ b/pandas/tests/frame/test_repr_info.py @@ -11,7 +11,7 @@ from pandas import (DataFrame, compat, option_context) from pandas.compat import StringIO, lrange, u -import pandas.formats.format as fmt +import pandas.io.formats.format as fmt import pandas as pd import pandas.util.testing as tm @@ -23,9 +23,7 @@ # structure -class TestDataFrameReprInfoEtc(tm.TestCase, TestData): - - _multiprocess_can_split_ = True +class TestDataFrameReprInfoEtc(TestData): def test_repr_empty(self): # empty @@ -74,20 +72,20 @@ def test_repr(self): self.empty.info(buf=buf) df = DataFrame(["a\n\r\tb"], columns=["a\n\r\td"], index=["a\n\r\tf"]) - self.assertFalse("\t" in repr(df)) - self.assertFalse("\r" in repr(df)) - self.assertFalse("a\n" in repr(df)) + assert "\t" not in repr(df) + assert "\r" not in repr(df) + assert "a\n" not in repr(df) def test_repr_dimensions(self): df = DataFrame([[1, 2, ], [3, 4]]) with option_context('display.show_dimensions', True): - self.assertTrue("2 rows x 2 columns" in repr(df)) + assert "2 rows x 2 columns" in repr(df) with option_context('display.show_dimensions', False): - self.assertFalse("2 rows x 2 columns" in repr(df)) + assert "2 rows x 2 columns" not in repr(df) with option_context('display.show_dimensions', 'truncate'): - self.assertFalse("2 rows x 2 columns" in repr(df)) + assert "2 rows x 2 columns" not in repr(df) @tm.slow def test_repr_big(self): @@ -120,7 +118,7 @@ def test_repr_unsortable(self): fmt.set_option('display.max_rows', 1000, 'display.max_columns', 1000) repr(self.frame) - self.reset_display_options() + tm.reset_display_options() warnings.filters = warn_filters @@ -134,11 +132,11 @@ def test_repr_unicode(self): result = repr(df) ex_top = ' A' - self.assertEqual(result.split('\n')[0].rstrip(), ex_top) + assert result.split('\n')[0].rstrip() == ex_top df = DataFrame({'A': [uval, uval]}) result = repr(df) - self.assertEqual(result.split('\n')[0].rstrip(), ex_top) + assert result.split('\n')[0].rstrip() == ex_top def test_unicode_string_with_unicode(self): df = DataFrame({'A': [u("\u05d0")]}) @@ -173,7 +171,7 @@ def test_repr_column_name_unicode_truncation_bug(self): ' the File through the code..')}) result = repr(df) - self.assertIn('StringCol', result) + assert 'StringCol' in result def test_latex_repr(self): result = 
r"""\begin{tabular}{llll} @@ -188,11 +186,12 @@ def test_latex_repr(self): with option_context("display.latex.escape", False, 'display.latex.repr', True): df = DataFrame([[r'$\alpha$', 'b', 'c'], [1, 2, 3]]) - self.assertEqual(result, df._repr_latex_()) + assert result == df._repr_latex_() # GH 12182 - self.assertIsNone(df._repr_latex_()) + assert df._repr_latex_() is None + @tm.capture_stdout def test_info(self): io = StringIO() self.frame.info(buf=io) @@ -200,11 +199,8 @@ def test_info(self): frame = DataFrame(np.random.randn(5, 3)) - import sys - sys.stdout = StringIO() frame.info() frame.info(verbose=False) - sys.stdout = sys.__stdout__ def test_info_wide(self): from pandas import set_option, reset_option @@ -215,13 +211,13 @@ def test_info_wide(self): io = StringIO() df.info(buf=io, max_cols=101) rs = io.getvalue() - self.assertTrue(len(rs.splitlines()) > 100) + assert len(rs.splitlines()) > 100 xp = rs set_option('display.max_info_columns', 101) io = StringIO() df.info(buf=io) - self.assertEqual(rs, xp) + assert rs == xp reset_option('display.max_info_columns') def test_info_duplicate_columns(self): @@ -241,8 +237,8 @@ def test_info_duplicate_columns_shows_correct_dtypes(self): frame.info(buf=io) io.seek(0) lines = io.readlines() - self.assertEqual('a 1 non-null int64\n', lines[3]) - self.assertEqual('a 1 non-null float64\n', lines[4]) + assert 'a 1 non-null int64\n' == lines[3] + assert 'a 1 non-null float64\n' == lines[4] def test_info_shows_column_dtypes(self): dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]', @@ -267,7 +263,7 @@ def test_info_max_cols(self): buf = StringIO() df.info(buf=buf, verbose=verbose) res = buf.getvalue() - self.assertEqual(len(res.strip().split('\n')), len_) + assert len(res.strip().split('\n')) == len_ for len_, verbose in [(10, None), (5, False), (10, True)]: @@ -276,7 +272,7 @@ def test_info_max_cols(self): buf = StringIO() df.info(buf=buf, verbose=verbose) res = buf.getvalue() - self.assertEqual(len(res.strip().split('\n')), len_) + assert len(res.strip().split('\n')) == len_ for len_, max_cols in [(10, 5), (5, 4)]: # setting truncates @@ -284,14 +280,14 @@ def test_info_max_cols(self): buf = StringIO() df.info(buf=buf, max_cols=max_cols) res = buf.getvalue() - self.assertEqual(len(res.strip().split('\n')), len_) + assert len(res.strip().split('\n')) == len_ # setting wouldn't truncate with option_context('max_info_columns', 5): buf = StringIO() df.info(buf=buf, max_cols=max_cols) res = buf.getvalue() - self.assertEqual(len(res.strip().split('\n')), len_) + assert len(res.strip().split('\n')) == len_ def test_info_memory_usage(self): # Ensure memory usage is displayed, when asserted, on the last line @@ -303,41 +299,45 @@ def test_info_memory_usage(self): data[i] = np.random.randint(2, size=n).astype(dtype) df = DataFrame(data) buf = StringIO() + # display memory usage case df.info(buf=buf, memory_usage=True) res = buf.getvalue().splitlines() - self.assertTrue("memory usage: " in res[-1]) + assert "memory usage: " in res[-1] + # do not display memory usage cas df.info(buf=buf, memory_usage=False) res = buf.getvalue().splitlines() - self.assertTrue("memory usage: " not in res[-1]) + assert "memory usage: " not in res[-1] df.info(buf=buf, memory_usage=True) res = buf.getvalue().splitlines() + # memory usage is a lower bound, so print it as XYZ+ MB - self.assertTrue(re.match(r"memory usage: [^+]+\+", res[-1])) + assert re.match(r"memory usage: [^+]+\+", res[-1]) df.iloc[:, :5].info(buf=buf, memory_usage=True) res = 
buf.getvalue().splitlines() + # excluded column with object dtype, so estimate is accurate - self.assertFalse(re.match(r"memory usage: [^+]+\+", res[-1])) + assert not re.match(r"memory usage: [^+]+\+", res[-1]) df_with_object_index = pd.DataFrame({'a': [1]}, index=['foo']) df_with_object_index.info(buf=buf, memory_usage=True) res = buf.getvalue().splitlines() - self.assertTrue(re.match(r"memory usage: [^+]+\+", res[-1])) + assert re.match(r"memory usage: [^+]+\+", res[-1]) df_with_object_index.info(buf=buf, memory_usage='deep') res = buf.getvalue().splitlines() - self.assertTrue(re.match(r"memory usage: [^+]+$", res[-1])) + assert re.match(r"memory usage: [^+]+$", res[-1]) - self.assertGreater(df_with_object_index.memory_usage(index=True, - deep=True).sum(), - df_with_object_index.memory_usage(index=True).sum()) + assert (df_with_object_index.memory_usage( + index=True, deep=True).sum() > df_with_object_index.memory_usage( + index=True).sum()) df_object = pd.DataFrame({'a': ['a']}) - self.assertGreater(df_object.memory_usage(deep=True).sum(), - df_object.memory_usage().sum()) + assert (df_object.memory_usage(deep=True).sum() > + df_object.memory_usage().sum()) # Test a DataFrame with duplicate columns dtypes = ['int64', 'int64', 'int64', 'float64'] @@ -352,15 +352,14 @@ def test_info_memory_usage(self): # (cols * rows * bytes) + index size df_size = df.memory_usage().sum() exp_size = len(dtypes) * n * 8 + df.index.nbytes - self.assertEqual(df_size, exp_size) + assert df_size == exp_size # Ensure number of cols in memory_usage is the same as df size_df = np.size(df.columns.values) + 1 # index=True; default - self.assertEqual(size_df, np.size(df.memory_usage())) + assert size_df == np.size(df.memory_usage()) # assert deep works only on object - self.assertEqual(df.memory_usage().sum(), - df.memory_usage(deep=True).sum()) + assert df.memory_usage().sum() == df.memory_usage(deep=True).sum() # test for validity DataFrame(1, index=['a'], columns=['A'] @@ -380,7 +379,35 @@ def test_info_memory_usage(self): # sys.getsizeof will call the .memory_usage with # deep=True, and add on some GC overhead diff = df.memory_usage(deep=True).sum() - sys.getsizeof(df) - self.assertTrue(abs(diff) < 100) + assert abs(diff) < 100 + + def test_info_memory_usage_qualified(self): + + buf = StringIO() + df = DataFrame(1, columns=list('ab'), + index=[1, 2, 3]) + df.info(buf=buf) + assert '+' not in buf.getvalue() + + buf = StringIO() + df = DataFrame(1, columns=list('ab'), + index=list('ABC')) + df.info(buf=buf) + assert '+' in buf.getvalue() + + buf = StringIO() + df = DataFrame(1, columns=list('ab'), + index=pd.MultiIndex.from_product( + [range(3), range(3)])) + df.info(buf=buf) + assert '+' not in buf.getvalue() + + buf = StringIO() + df = DataFrame(1, columns=list('ab'), + index=pd.MultiIndex.from_product( + [range(3), ['foo', 'bar']])) + df.info(buf=buf) + assert '+' in buf.getvalue() def test_info_memory_usage_bug_on_multiindex(self): # GH 14308 @@ -400,11 +427,11 @@ def memory_usage(f): df = DataFrame({'value': np.random.randn(N * M)}, index=index) unstacked = df.unstack('id') - self.assertEqual(df.values.nbytes, unstacked.values.nbytes) - self.assertTrue(memory_usage(df) > memory_usage(unstacked)) + assert df.values.nbytes == unstacked.values.nbytes + assert memory_usage(df) > memory_usage(unstacked) # high upper bound - self.assertTrue(memory_usage(unstacked) - memory_usage(df) < 2000) + assert memory_usage(unstacked) - memory_usage(df) < 2000 def test_info_categorical(self): # GH14298 diff --git 
a/pandas/tests/frame/test_reshape.py b/pandas/tests/frame/test_reshape.py index 6b0dd38cdb82c..fdb0119d8ae60 100644 --- a/pandas/tests/frame/test_reshape.py +++ b/pandas/tests/frame/test_reshape.py @@ -2,8 +2,11 @@ from __future__ import print_function +from warnings import catch_warnings from datetime import datetime + import itertools +import pytest from numpy.random import randn from numpy import nan @@ -14,18 +17,14 @@ Timedelta, Period) import pandas as pd -from pandas.util.testing import (assert_series_equal, - assert_frame_equal, - assertRaisesRegexp) +from pandas.util.testing import assert_series_equal, assert_frame_equal import pandas.util.testing as tm from pandas.tests.frame.common import TestData -class TestDataFrameReshape(tm.TestCase, TestData): - - _multiprocess_can_split_ = True +class TestDataFrameReshape(TestData): def test_pivot(self): data = { @@ -42,44 +41,45 @@ def test_pivot(self): 'One': {'A': 1., 'B': 2., 'C': 3.}, 'Two': {'A': 1., 'B': 2., 'C': 3.} }) - expected.index.name, expected.columns.name = 'index', 'columns' - assert_frame_equal(pivoted, expected) + expected.index.name, expected.columns.name = 'index', 'columns' + tm.assert_frame_equal(pivoted, expected) # name tracking - self.assertEqual(pivoted.index.name, 'index') - self.assertEqual(pivoted.columns.name, 'columns') + assert pivoted.index.name == 'index' + assert pivoted.columns.name == 'columns' # don't specify values pivoted = frame.pivot(index='index', columns='columns') - self.assertEqual(pivoted.index.name, 'index') - self.assertEqual(pivoted.columns.names, (None, 'columns')) + assert pivoted.index.name == 'index' + assert pivoted.columns.names == (None, 'columns') - # pivot multiple columns - wp = tm.makePanel() - lp = wp.to_frame() - df = lp.reset_index() - assert_frame_equal(df.pivot('major', 'minor'), lp.unstack()) + with catch_warnings(record=True): + # pivot multiple columns + wp = tm.makePanel() + lp = wp.to_frame() + df = lp.reset_index() + tm.assert_frame_equal(df.pivot('major', 'minor'), lp.unstack()) def test_pivot_duplicates(self): data = DataFrame({'a': ['bar', 'bar', 'foo', 'foo', 'foo'], 'b': ['one', 'two', 'one', 'one', 'two'], 'c': [1., 2., 3., 3., 4.]}) - with assertRaisesRegexp(ValueError, 'duplicate entries'): + with tm.assert_raises_regex(ValueError, 'duplicate entries'): data.pivot('a', 'b', 'c') def test_pivot_empty(self): df = DataFrame({}, columns=['a', 'b', 'c']) result = df.pivot('a', 'b', 'c') expected = DataFrame({}) - assert_frame_equal(result, expected, check_names=False) + tm.assert_frame_equal(result, expected, check_names=False) def test_pivot_integer_bug(self): df = DataFrame(data=[("A", "1", "A1"), ("B", "2", "B2")]) result = df.pivot(index=1, columns=0, values=2) repr(result) - self.assert_index_equal(result.columns, Index(['A', 'B'], name=0)) + tm.assert_index_equal(result.columns, Index(['A', 'B'], name=0)) def test_pivot_index_none(self): # gh-3962 @@ -106,36 +106,32 @@ def test_pivot_index_none(self): ('values', 'Two')], names=[None, 'columns']) expected.index.name = 'index' - assert_frame_equal(result, expected, check_names=False) - self.assertEqual(result.index.name, 'index',) - self.assertEqual(result.columns.names, (None, 'columns')) + tm.assert_frame_equal(result, expected, check_names=False) + assert result.index.name == 'index' + assert result.columns.names == (None, 'columns') expected.columns = expected.columns.droplevel(0) - - data = { - 'index': range(7), - 'columns': ['One', 'One', 'One', 'Two', 'Two', 'Two'], - 'values': [1., 2., 3., 3., 2., 1.] 
- } - result = frame.pivot(columns='columns', values='values') expected.columns.name = 'columns' - assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) def test_stack_unstack(self): - stacked = self.frame.stack() + f = self.frame.copy() + f[:] = np.arange(np.prod(f.shape)).reshape(f.shape) + + stacked = f.stack() stacked_df = DataFrame({'foo': stacked, 'bar': stacked}) unstacked = stacked.unstack() unstacked_df = stacked_df.unstack() - assert_frame_equal(unstacked, self.frame) - assert_frame_equal(unstacked_df['bar'], self.frame) + assert_frame_equal(unstacked, f) + assert_frame_equal(unstacked_df['bar'], f) unstacked_cols = stacked.unstack(0) unstacked_cols_df = stacked_df.unstack(0) - assert_frame_equal(unstacked_cols.T, self.frame) - assert_frame_equal(unstacked_cols_df['bar'].T, self.frame) + assert_frame_equal(unstacked_cols.T, f) + assert_frame_equal(unstacked_cols_df['bar'].T, f) def test_unstack_fill(self): @@ -361,7 +357,7 @@ def test_stack_mixed_levels(self): # When mixed types are passed and the ints are not level # names, raise - self.assertRaises(ValueError, df2.stack, level=['animal', 0]) + pytest.raises(ValueError, df2.stack, level=['animal', 0]) # GH #8584: Having 0 in the level names could raise a # strange error about lexsort depth @@ -442,7 +438,7 @@ def test_unstack_to_series(self): # check reversibility data = self.frame.unstack() - self.assertTrue(isinstance(data, Series)) + assert isinstance(data, Series) undo = data.unstack().T assert_frame_equal(undo, self.frame) @@ -513,17 +509,17 @@ def test_unstack_dtypes(self): right = right.set_index(['A', 'B']).unstack(0) right[('D', 'a')] = right[('D', 'a')].astype('int64') - self.assertEqual(left.shape, (3, 2)) - assert_frame_equal(left, right) + assert left.shape == (3, 2) + tm.assert_frame_equal(left, right) def test_unstack_non_unique_index_names(self): idx = MultiIndex.from_tuples([('a', 'b'), ('c', 'd')], names=['c1', 'c1']) df = DataFrame([1, 2], index=idx) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.unstack('c1') - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.T.stack('c1') def test_unstack_nan_index(self): # GH7466 @@ -537,7 +533,7 @@ def verify(df): left = sorted(df.iloc[i, j].split('.')) right = mk_list(df.index[i]) + mk_list(df.columns[j]) right = sorted(list(map(cast, right))) - self.assertEqual(left, right) + assert left == right df = DataFrame({'jim': ['a', 'b', nan, 'd'], 'joe': ['w', 'x', 'y', 'z'], @@ -551,7 +547,7 @@ def verify(df): mi = df.set_index(list(idx)) for lev in range(2): udf = mi.unstack(level=lev) - self.assertEqual(udf.notnull().values.sum(), len(df)) + assert udf.notnull().values.sum() == len(df) verify(udf['jolie']) df = DataFrame({'1st': ['d'] * 3 + [nan] * 5 + ['a'] * 2 + @@ -569,7 +565,7 @@ def verify(df): mi = df.set_index(list(idx)) for lev in range(3): udf = mi.unstack(level=lev) - self.assertEqual(udf.notnull().values.sum(), 2 * len(df)) + assert udf.notnull().values.sum() == 2 * len(df) for col in ['4th', '5th']: verify(udf[col]) @@ -659,7 +655,7 @@ def verify(df): right = DataFrame(vals, columns=cols, index=idx) assert_frame_equal(left, right) - left = df.ix[17264:].copy().set_index(['s_id', 'dosage', 'agent']) + left = df.loc[17264:].copy().set_index(['s_id', 'dosage', 'agent']) assert_frame_equal(left.unstack(), right) # GH9497 - multiple unstack with nulls @@ -674,12 +670,12 @@ def verify(df): df.loc[1, '3rd'] = df.loc[4, '3rd'] = nan left = df.set_index(['1st', '2nd', '3rd']).unstack(['2nd', '3rd']) 
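Editor's note, not part of the patch: the `unstack` assertions above all pin down one invariant, which this minimal sketch spells out (`jim`, `joe`, `jolie` mirror the fixture names; the data is illustrative):

```python
import pandas as pd

# Unstacking moves an index level into the columns; the reshape may
# create structural NaN holes, but every original value must survive
# exactly once, which is what the tests' notnull() counts assert.
df = pd.DataFrame({'jim': ['a', 'b', 'c', 'd'],
                   'joe': ['w', 'x', 'w', 'z'],
                   'jolie': [0.1, 0.2, 0.3, 0.4]})
mi = df.set_index(['jim', 'joe'])
udf = mi.unstack(level=0)  # 'jim' values become column labels
assert udf.notnull().values.sum() == len(df)
```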
- self.assertEqual(left.notnull().values.sum(), 2 * len(df)) + assert left.notnull().values.sum() == 2 * len(df) for col in ['jim', 'joe']: for _, r in df.iterrows(): key = r['1st'], (col, r['2nd'], r['3rd']) - self.assertEqual(r[col], left.loc[key]) + assert r[col] == left.loc[key] def test_stack_datetime_column_multiIndex(self): # GH 8039 diff --git a/pandas/tests/frame/test_sorting.py b/pandas/tests/frame/test_sorting.py index b7a38e9e13ebd..98f7f82c0ace7 100644 --- a/pandas/tests/frame/test_sorting.py +++ b/pandas/tests/frame/test_sorting.py @@ -2,83 +2,31 @@ from __future__ import print_function +import pytest +import random import numpy as np +import pandas as pd from pandas.compat import lrange from pandas import (DataFrame, Series, MultiIndex, Timestamp, - date_range) + date_range, NaT, IntervalIndex) -from pandas.util.testing import (assert_series_equal, - assert_frame_equal, - assertRaisesRegexp) +from pandas.util.testing import assert_series_equal, assert_frame_equal import pandas.util.testing as tm from pandas.tests.frame.common import TestData -class TestDataFrameSorting(tm.TestCase, TestData): - - _multiprocess_can_split_ = True - - def test_sort_index(self): - # GH13496 - - frame = DataFrame(np.arange(16).reshape(4, 4), index=[1, 2, 3, 4], - columns=['A', 'B', 'C', 'D']) - - # axis=0 : sort rows by index labels - unordered = frame.ix[[3, 2, 4, 1]] - result = unordered.sort_index(axis=0) - expected = frame - assert_frame_equal(result, expected) - - result = unordered.sort_index(ascending=False) - expected = frame[::-1] - assert_frame_equal(result, expected) - - # axis=1 : sort columns by column names - unordered = frame.ix[:, [2, 1, 3, 0]] - result = unordered.sort_index(axis=1) - assert_frame_equal(result, frame) - - result = unordered.sort_index(axis=1, ascending=False) - expected = frame.ix[:, ::-1] - assert_frame_equal(result, expected) - - def test_sort_index_multiindex(self): - # GH13496 - - # sort rows by specified level of multi-index - mi = MultiIndex.from_tuples([[2, 1, 3], [1, 1, 1]], names=list('ABC')) - df = DataFrame([[1, 2], [3, 4]], mi) - - result = df.sort_index(level='A', sort_remaining=False) - expected = df.sortlevel('A', sort_remaining=False) - assert_frame_equal(result, expected) - - # sort columns by specified level of multi-index - df = df.T - result = df.sort_index(level='A', axis=1, sort_remaining=False) - expected = df.sortlevel('A', axis=1, sort_remaining=False) - assert_frame_equal(result, expected) - - # MI sort, but no level: sort_level has no effect - mi = MultiIndex.from_tuples([[1, 1, 3], [1, 1, 1]], names=list('ABC')) - df = DataFrame([[1, 2], [3, 4]], mi) - result = df.sort_index(sort_remaining=False) - expected = df.sort_index() - assert_frame_equal(result, expected) +class TestDataFrameSorting(TestData): def test_sort(self): frame = DataFrame(np.arange(16).reshape(4, 4), index=[1, 2, 3, 4], columns=['A', 'B', 'C', 'D']) - # 9816 deprecated + # see gh-9816 with tm.assert_produces_warning(FutureWarning): - frame.sort(columns='A') - with tm.assert_produces_warning(FutureWarning): - frame.sort() + frame.sortlevel() def test_sort_values(self): frame = DataFrame([[1, 1, 2], [3, 1, 0], [4, 5, 6]], @@ -87,12 +35,12 @@ def test_sort_values(self): # by column (axis=0) sorted_df = frame.sort_values(by='A') indexer = frame['A'].argsort().values - expected = frame.ix[frame.index[indexer]] + expected = frame.loc[frame.index[indexer]] assert_frame_equal(sorted_df, expected) sorted_df = frame.sort_values(by='A', ascending=False) indexer = indexer[::-1] - 
expected = frame.ix[frame.index[indexer]] + expected = frame.loc[frame.index[indexer]] assert_frame_equal(sorted_df, expected) sorted_df = frame.sort_values(by='A', ascending=False) @@ -113,7 +61,7 @@ def test_sort_values(self): sorted_df = frame.sort_values(by=['B', 'A'], ascending=[True, False]) assert_frame_equal(sorted_df, expected) - self.assertRaises(ValueError, lambda: frame.sort_values( + pytest.raises(ValueError, lambda: frame.sort_values( by=['A', 'B'], axis=2, inplace=True)) # by row (axis=1): GH 10806 @@ -138,7 +86,7 @@ def test_sort_values(self): assert_frame_equal(sorted_df, expected) msg = r'Length of ascending \(5\) != length of by \(2\)' - with assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): frame.sort_values(by=['A', 'B'], axis=0, ascending=[True] * 5) def test_sort_values_inplace(self): @@ -165,21 +113,6 @@ def test_sort_values_inplace(self): expected = frame.sort_values(by=['A', 'B'], ascending=False) assert_frame_equal(sorted_df, expected) - def test_sort_index_categorical_index(self): - - df = (DataFrame({'A': np.arange(6, dtype='int64'), - 'B': Series(list('aabbca')) - .astype('category', categories=list('cab'))}) - .set_index('B')) - - result = df.sort_index() - expected = df.iloc[[4, 0, 1, 5, 2, 3]] - assert_frame_equal(result, expected) - - result = df.sort_index(ascending=False) - expected = df.iloc[[3, 2, 5, 1, 0, 4]] - assert_frame_equal(result, expected) - def test_sort_nan(self): # GH3917 nan = np.nan @@ -305,8 +238,86 @@ def test_stable_descending_multicolumn_sort(self): kind='mergesort') assert_frame_equal(sorted_df, expected) + def test_sort_datetimes(self): + + # GH 3461, argsort / lexsort differences for a datetime column + df = DataFrame(['a', 'a', 'a', 'b', 'c', 'd', 'e', 'f', 'g'], + columns=['A'], + index=date_range('20130101', periods=9)) + dts = [Timestamp(x) + for x in ['2004-02-11', '2004-01-21', '2004-01-26', + '2005-09-20', '2010-10-04', '2009-05-12', + '2008-11-12', '2010-09-28', '2010-09-28']] + df['B'] = dts[::2] + dts[1::2] + df['C'] = 2. + df['A1'] = 3. + + df1 = df.sort_values(by='A') + df2 = df.sort_values(by=['A']) + assert_frame_equal(df1, df2) + + df1 = df.sort_values(by='B') + df2 = df.sort_values(by=['B']) + assert_frame_equal(df1, df2) + + def test_frame_column_inplace_sort_exception(self): + s = self.frame['A'] + with tm.assert_raises_regex(ValueError, "This Series is a view"): + s.sort_values(inplace=True) + + cp = s.copy() + cp.sort_values() # it works! + + def test_sort_nat_values_in_int_column(self): + + # GH 14922: "sorting with large float and multiple columns incorrect" + + # cause was that the int64 value NaT was considered as "na", which is + # only correct for datetime64 columns. 
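Editorial aside, not part of the patch: the comment above is worth a stand-alone illustration. The sketch below uses only public API; `np.iinfo` is used here to spell out the iNaT bit pattern, which is an assumption of this illustration rather than anything the test relies on:

```python
import numpy as np
import pandas as pd

# In an int64 column, the NaT bit pattern is just the smallest int64:
# an ordinary (very negative) number, so it sorts first ascending and
# na_position cannot move it.
imin = np.iinfo(np.int64).min  # same bits as iNaT
ints = pd.DataFrame({'int': [2, imin]})
assert list(ints.sort_values('int')['int']) == [imin, 2]

# In a datetime64 column the same bit pattern is missing data, so
# na_position decides where NaT lands.
dts = pd.DataFrame({'dt': [pd.Timestamp('2016-01-01'), pd.NaT]})
out = dts.sort_values('dt', na_position='first')
assert pd.isnull(out['dt'].iloc[0])
```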
+ + int_values = (2, int(NaT)) + float_values = (2.0, -1.797693e308) + + df = DataFrame(dict(int=int_values, float=float_values), + columns=["int", "float"]) + + df_reversed = DataFrame(dict(int=int_values[::-1], + float=float_values[::-1]), + columns=["int", "float"], + index=[1, 0]) + + # NaT is not a "na" for int64 columns, so na_position must not + # influence the result: + df_sorted = df.sort_values(["int", "float"], na_position="last") + assert_frame_equal(df_sorted, df_reversed) + + df_sorted = df.sort_values(["int", "float"], na_position="first") + assert_frame_equal(df_sorted, df_reversed) + + # reverse sorting order + df_sorted = df.sort_values(["int", "float"], ascending=False) + assert_frame_equal(df_sorted, df) + + # and now check if NaT is still considered as "na" for datetime64 + # columns: + df = DataFrame(dict(datetime=[Timestamp("2016-01-01"), NaT], + float=float_values), columns=["datetime", "float"]) + + df_reversed = DataFrame(dict(datetime=[NaT, Timestamp("2016-01-01")], + float=float_values[::-1]), + columns=["datetime", "float"], + index=[1, 0]) + + df_sorted = df.sort_values(["datetime", "float"], na_position="first") + assert_frame_equal(df_sorted, df_reversed) + + df_sorted = df.sort_values(["datetime", "float"], na_position="last") + assert_frame_equal(df_sorted, df_reversed) + + +class TestDataFrameSortIndexKinds(TestData): + def test_sort_index_multicolumn(self): - import random A = np.arange(5).repeat(20) B = np.tile(np.arange(5), 20) random.shuffle(A) @@ -344,13 +355,13 @@ def test_sort_index_inplace(self): columns=['A', 'B', 'C', 'D']) # axis=0 - unordered = frame.ix[[3, 2, 4, 1]] + unordered = frame.loc[[3, 2, 4, 1]] a_id = id(unordered['A']) df = unordered.copy() df.sort_index(inplace=True) expected = frame assert_frame_equal(df, expected) - self.assertNotEqual(a_id, id(df['A'])) + assert a_id != id(df['A']) df = unordered.copy() df.sort_index(ascending=False, inplace=True) @@ -358,7 +369,7 @@ def test_sort_index_inplace(self): assert_frame_equal(df, expected) # axis=1 - unordered = frame.ix[:, ['D', 'B', 'C', 'A']] + unordered = frame.loc[:, ['D', 'B', 'C', 'A']] df = unordered.copy() df.sort_index(axis=1, inplace=True) expected = frame @@ -366,7 +377,7 @@ def test_sort_index_inplace(self): df = unordered.copy() df.sort_index(axis=1, ascending=False, inplace=True) - expected = frame.ix[:, ::-1] + expected = frame.iloc[:, ::-1] assert_frame_equal(df, expected) def test_sort_index_different_sortorder(self): @@ -407,26 +418,26 @@ def test_sort_index_duplicates(self): df = DataFrame([lrange(5, 9), lrange(4)], columns=['a', 'a', 'b', 'b']) - with assertRaisesRegexp(ValueError, 'duplicate'): + with tm.assert_raises_regex(ValueError, 'duplicate'): # use .sort_values #9816 with tm.assert_produces_warning(FutureWarning): df.sort_index(by='a') - with assertRaisesRegexp(ValueError, 'duplicate'): + with tm.assert_raises_regex(ValueError, 'duplicate'): df.sort_values(by='a') - with assertRaisesRegexp(ValueError, 'duplicate'): + with tm.assert_raises_regex(ValueError, 'duplicate'): # use .sort_values #9816 with tm.assert_produces_warning(FutureWarning): df.sort_index(by=['a']) - with assertRaisesRegexp(ValueError, 'duplicate'): + with tm.assert_raises_regex(ValueError, 'duplicate'): df.sort_values(by=['a']) - with assertRaisesRegexp(ValueError, 'duplicate'): + with tm.assert_raises_regex(ValueError, 'duplicate'): # use .sort_values #9816 with tm.assert_produces_warning(FutureWarning): # multi-column 'by' is separate codepath df.sort_index(by=['a', 'b']) - with 
assertRaisesRegexp(ValueError, 'duplicate'): + with tm.assert_raises_regex(ValueError, 'duplicate'): # multi-column 'by' is separate codepath df.sort_values(by=['a', 'b']) @@ -434,11 +445,11 @@ def test_sort_index_duplicates(self): # GH4370 df = DataFrame(np.random.randn(4, 2), columns=MultiIndex.from_tuples([('a', 0), ('a', 1)])) - with assertRaisesRegexp(ValueError, 'levels'): + with tm.assert_raises_regex(ValueError, 'levels'): # use .sort_values #9816 with tm.assert_produces_warning(FutureWarning): df.sort_index(by='a') - with assertRaisesRegexp(ValueError, 'levels'): + with tm.assert_raises_regex(ValueError, 'levels'): df.sort_values(by='a') # convert tuples to a list of tuples @@ -453,41 +464,82 @@ def test_sort_index_duplicates(self): result = df.sort_values(by=('a', 1)) assert_frame_equal(result, expected) - def test_sortlevel(self): + def test_sort_index_level(self): mi = MultiIndex.from_tuples([[1, 1, 3], [1, 1, 1]], names=list('ABC')) df = DataFrame([[1, 2], [3, 4]], mi) - res = df.sortlevel('A', sort_remaining=False) + res = df.sort_index(level='A', sort_remaining=False) assert_frame_equal(df, res) - res = df.sortlevel(['A', 'B'], sort_remaining=False) + res = df.sort_index(level=['A', 'B'], sort_remaining=False) assert_frame_equal(df, res) - def test_sort_datetimes(self): + def test_sort_index_categorical_index(self): - # GH 3461, argsort / lexsort differences for a datetime column - df = DataFrame(['a', 'a', 'a', 'b', 'c', 'd', 'e', 'f', 'g'], - columns=['A'], - index=date_range('20130101', periods=9)) - dts = [Timestamp(x) - for x in ['2004-02-11', '2004-01-21', '2004-01-26', - '2005-09-20', '2010-10-04', '2009-05-12', - '2008-11-12', '2010-09-28', '2010-09-28']] - df['B'] = dts[::2] + dts[1::2] - df['C'] = 2. - df['A1'] = 3. + df = (DataFrame({'A': np.arange(6, dtype='int64'), + 'B': Series(list('aabbca')) + .astype('category', categories=list('cab'))}) + .set_index('B')) - df1 = df.sort_values(by='A') - df2 = df.sort_values(by=['A']) - assert_frame_equal(df1, df2) + result = df.sort_index() + expected = df.iloc[[4, 0, 1, 5, 2, 3]] + assert_frame_equal(result, expected) - df1 = df.sort_values(by='B') - df2 = df.sort_values(by=['B']) - assert_frame_equal(df1, df2) + result = df.sort_index(ascending=False) + expected = df.iloc[[3, 2, 5, 1, 0, 4]] + assert_frame_equal(result, expected) - def test_frame_column_inplace_sort_exception(self): - s = self.frame['A'] - with assertRaisesRegexp(ValueError, "This Series is a view"): - s.sort_values(inplace=True) + def test_sort_index(self): + # GH13496 - cp = s.copy() - cp.sort_values() # it works! 
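Aside, not part of the diff: the rename above tracks the deprecation of `DataFrame.sortlevel` in favour of `sort_index(level=...)`. A minimal sketch of the behaviour the renamed test checks, with illustrative index values:

```python
import pandas as pd

# With sort_remaining=False only the named level is sorted; rows that
# tie on that level keep their original relative order (stable sort).
mi = pd.MultiIndex.from_tuples([(1, 'b'), (1, 'a'), (0, 'c')],
                               names=['A', 'B'])
df = pd.DataFrame({'v': [1, 2, 3]}, index=mi)
res = df.sort_index(level='A', sort_remaining=False)
assert list(res['v']) == [3, 1, 2]  # level 'B' is left unsorted
```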
+ frame = DataFrame(np.arange(16).reshape(4, 4), index=[1, 2, 3, 4], + columns=['A', 'B', 'C', 'D']) + + # axis=0 : sort rows by index labels + unordered = frame.loc[[3, 2, 4, 1]] + result = unordered.sort_index(axis=0) + expected = frame + assert_frame_equal(result, expected) + + result = unordered.sort_index(ascending=False) + expected = frame[::-1] + assert_frame_equal(result, expected) + + # axis=1 : sort columns by column names + unordered = frame.iloc[:, [2, 1, 3, 0]] + result = unordered.sort_index(axis=1) + assert_frame_equal(result, frame) + + result = unordered.sort_index(axis=1, ascending=False) + expected = frame.iloc[:, ::-1] + assert_frame_equal(result, expected) + + def test_sort_index_multiindex(self): + # GH13496 + + # sort rows by specified level of multi-index + mi = MultiIndex.from_tuples([[2, 1, 3], [1, 1, 1]], names=list('ABC')) + df = DataFrame([[1, 2], [3, 4]], mi) + + # MI sort, but no level: sort_level has no effect + mi = MultiIndex.from_tuples([[1, 1, 3], [1, 1, 1]], names=list('ABC')) + df = DataFrame([[1, 2], [3, 4]], mi) + result = df.sort_index(sort_remaining=False) + expected = df.sort_index() + assert_frame_equal(result, expected) + + def test_sort_index_intervalindex(self): + # this is a de-facto sort via unstack + # confirming that we sort in the order of the bins + y = Series(np.random.randn(100)) + x1 = Series(np.sign(np.random.randn(100))) + x2 = pd.cut(Series(np.random.randn(100)), + bins=[-3, -0.5, 0, 0.5, 3]) + model = pd.concat([y, x1, x2], axis=1, keys=['Y', 'X1', 'X2']) + + result = model.groupby(['X1', 'X2']).mean().unstack() + expected = IntervalIndex.from_tuples( + [(-3.0, -0.5), (-0.5, 0.0), + (0.0, 0.5), (0.5, 3.0)], + closed='right') + result = result.columns.levels[1].categories + tm.assert_index_equal(result, expected) diff --git a/pandas/tests/frame/test_subclass.py b/pandas/tests/frame/test_subclass.py index 6a57f67a6cb3d..52c591e4dcbb0 100644 --- a/pandas/tests/frame/test_subclass.py +++ b/pandas/tests/frame/test_subclass.py @@ -2,6 +2,7 @@ from __future__ import print_function +from warnings import catch_warnings import numpy as np from pandas import DataFrame, Series, MultiIndex, Panel @@ -11,9 +12,7 @@ from pandas.tests.frame.common import TestData -class TestDataFrameSubclassing(tm.TestCase, TestData): - - _multiprocess_can_split_ = True +class TestDataFrameSubclassing(TestData): def test_frame_subclassing_and_slicing(self): # Subclass frame and ensure it returns the right class on slicing it @@ -51,45 +50,45 @@ def custom_frame_function(self): cdf = CustomDataFrame(data) # Did we get back our own DF class? - self.assertTrue(isinstance(cdf, CustomDataFrame)) + assert isinstance(cdf, CustomDataFrame) # Do we get back our own Series class after selecting a column? cdf_series = cdf.col1 - self.assertTrue(isinstance(cdf_series, CustomSeries)) - self.assertEqual(cdf_series.custom_series_function(), 'OK') + assert isinstance(cdf_series, CustomSeries) + assert cdf_series.custom_series_function() == 'OK' # Do we get back our own DF class after slicing row-wise? 
cdf_rows = cdf[1:5] - self.assertTrue(isinstance(cdf_rows, CustomDataFrame)) - self.assertEqual(cdf_rows.custom_frame_function(), 'OK') + assert isinstance(cdf_rows, CustomDataFrame) + assert cdf_rows.custom_frame_function() == 'OK' # Make sure sliced part of multi-index frame is custom class mcol = pd.MultiIndex.from_tuples([('A', 'A'), ('A', 'B')]) cdf_multi = CustomDataFrame([[0, 1], [2, 3]], columns=mcol) - self.assertTrue(isinstance(cdf_multi['A'], CustomDataFrame)) + assert isinstance(cdf_multi['A'], CustomDataFrame) mcol = pd.MultiIndex.from_tuples([('A', ''), ('B', '')]) cdf_multi2 = CustomDataFrame([[0, 1], [2, 3]], columns=mcol) - self.assertTrue(isinstance(cdf_multi2['A'], CustomSeries)) + assert isinstance(cdf_multi2['A'], CustomSeries) def test_dataframe_metadata(self): df = tm.SubclassedDataFrame({'X': [1, 2, 3], 'Y': [1, 2, 3]}, index=['a', 'b', 'c']) df.testattr = 'XXX' - self.assertEqual(df.testattr, 'XXX') - self.assertEqual(df[['X']].testattr, 'XXX') - self.assertEqual(df.loc[['a', 'b'], :].testattr, 'XXX') - self.assertEqual(df.iloc[[0, 1], :].testattr, 'XXX') + assert df.testattr == 'XXX' + assert df[['X']].testattr == 'XXX' + assert df.loc[['a', 'b'], :].testattr == 'XXX' + assert df.iloc[[0, 1], :].testattr == 'XXX' - # GH9776 - self.assertEqual(df.iloc[0:1, :].testattr, 'XXX') + # see gh-9776 + assert df.iloc[0:1, :].testattr == 'XXX' - # GH10553 - unpickled = self.round_trip_pickle(df) + # see gh-10553 + unpickled = tm.round_trip_pickle(df) tm.assert_frame_equal(df, unpickled) - self.assertEqual(df._metadata, unpickled._metadata) - self.assertEqual(df.testattr, unpickled.testattr) + assert df._metadata == unpickled._metadata + assert df.testattr == unpickled.testattr def test_indexing_sliced(self): # GH 11559 @@ -100,54 +99,55 @@ def test_indexing_sliced(self): res = df.loc[:, 'X'] exp = tm.SubclassedSeries([1, 2, 3], index=list('abc'), name='X') tm.assert_series_equal(res, exp) - tm.assertIsInstance(res, tm.SubclassedSeries) + assert isinstance(res, tm.SubclassedSeries) res = df.iloc[:, 1] exp = tm.SubclassedSeries([4, 5, 6], index=list('abc'), name='Y') tm.assert_series_equal(res, exp) - tm.assertIsInstance(res, tm.SubclassedSeries) + assert isinstance(res, tm.SubclassedSeries) - res = df.ix[:, 'Z'] + res = df.loc[:, 'Z'] exp = tm.SubclassedSeries([7, 8, 9], index=list('abc'), name='Z') tm.assert_series_equal(res, exp) - tm.assertIsInstance(res, tm.SubclassedSeries) + assert isinstance(res, tm.SubclassedSeries) res = df.loc['a', :] exp = tm.SubclassedSeries([1, 4, 7], index=list('XYZ'), name='a') tm.assert_series_equal(res, exp) - tm.assertIsInstance(res, tm.SubclassedSeries) + assert isinstance(res, tm.SubclassedSeries) res = df.iloc[1, :] exp = tm.SubclassedSeries([2, 5, 8], index=list('XYZ'), name='b') tm.assert_series_equal(res, exp) - tm.assertIsInstance(res, tm.SubclassedSeries) + assert isinstance(res, tm.SubclassedSeries) - res = df.ix['c', :] + res = df.loc['c', :] exp = tm.SubclassedSeries([3, 6, 9], index=list('XYZ'), name='c') tm.assert_series_equal(res, exp) - tm.assertIsInstance(res, tm.SubclassedSeries) + assert isinstance(res, tm.SubclassedSeries) def test_to_panel_expanddim(self): # GH 9762 - class SubclassedFrame(DataFrame): + with catch_warnings(record=True): + class SubclassedFrame(DataFrame): - @property - def _constructor_expanddim(self): - return SubclassedPanel - - class SubclassedPanel(Panel): - pass - - index = MultiIndex.from_tuples([(0, 0), (0, 1), (0, 2)]) - df = SubclassedFrame({'X': [1, 2, 3], 'Y': [4, 5, 6]}, index=index) - result 
= df.to_panel() - self.assertTrue(isinstance(result, SubclassedPanel)) - expected = SubclassedPanel([[[1, 2, 3]], [[4, 5, 6]]], - items=['X', 'Y'], major_axis=[0], - minor_axis=[0, 1, 2], - dtype='int64') - tm.assert_panel_equal(result, expected) + @property + def _constructor_expanddim(self): + return SubclassedPanel + + class SubclassedPanel(Panel): + pass + + index = MultiIndex.from_tuples([(0, 0), (0, 1), (0, 2)]) + df = SubclassedFrame({'X': [1, 2, 3], 'Y': [4, 5, 6]}, index=index) + result = df.to_panel() + assert isinstance(result, SubclassedPanel) + expected = SubclassedPanel([[[1, 2, 3]], [[4, 5, 6]]], + items=['X', 'Y'], major_axis=[0], + minor_axis=[0, 1, 2], + dtype='int64') + tm.assert_panel_equal(result, expected) def test_subclass_attr_err_propagation(self): # GH 11808 @@ -156,7 +156,7 @@ class A(DataFrame): @property def bar(self): return self.i_dont_exist - with tm.assertRaisesRegexp(AttributeError, '.*i_dont_exist.*'): + with tm.assert_raises_regex(AttributeError, '.*i_dont_exist.*'): A().bar def test_subclass_align(self): @@ -173,15 +173,15 @@ def test_subclass_align(self): exp2 = tm.SubclassedDataFrame({'c': [1, 2, np.nan, 4, np.nan], 'd': [1, 2, np.nan, 4, np.nan]}, index=list('ABCDE')) - tm.assertIsInstance(res1, tm.SubclassedDataFrame) + assert isinstance(res1, tm.SubclassedDataFrame) tm.assert_frame_equal(res1, exp1) - tm.assertIsInstance(res2, tm.SubclassedDataFrame) + assert isinstance(res2, tm.SubclassedDataFrame) tm.assert_frame_equal(res2, exp2) res1, res2 = df1.a.align(df2.c) - tm.assertIsInstance(res1, tm.SubclassedSeries) + assert isinstance(res1, tm.SubclassedSeries) tm.assert_series_equal(res1, exp1.a) - tm.assertIsInstance(res2, tm.SubclassedSeries) + assert isinstance(res2, tm.SubclassedSeries) tm.assert_series_equal(res2, exp2.c) def test_subclass_align_combinations(self): @@ -199,23 +199,23 @@ def test_subclass_align_combinations(self): exp2 = pd.Series([1, 2, np.nan, 4, np.nan], index=list('ABCDE'), name='x') - tm.assertIsInstance(res1, tm.SubclassedDataFrame) + assert isinstance(res1, tm.SubclassedDataFrame) tm.assert_frame_equal(res1, exp1) - tm.assertIsInstance(res2, tm.SubclassedSeries) + assert isinstance(res2, tm.SubclassedSeries) tm.assert_series_equal(res2, exp2) # series + frame res1, res2 = s.align(df) - tm.assertIsInstance(res1, tm.SubclassedSeries) + assert isinstance(res1, tm.SubclassedSeries) tm.assert_series_equal(res1, exp2) - tm.assertIsInstance(res2, tm.SubclassedDataFrame) + assert isinstance(res2, tm.SubclassedDataFrame) tm.assert_frame_equal(res2, exp1) def test_subclass_iterrows(self): # GH 13977 df = tm.SubclassedDataFrame({'a': [1]}) for i, row in df.iterrows(): - tm.assertIsInstance(row, tm.SubclassedSeries) + assert isinstance(row, tm.SubclassedSeries) tm.assert_series_equal(row, df.loc[i]) def test_subclass_sparse_slice(self): @@ -229,9 +229,9 @@ def test_subclass_sparse_slice(self): tm.SubclassedSparseDataFrame(rows[:2])) tm.assert_sp_frame_equal(ssdf[:2], tm.SubclassedSparseDataFrame(rows[:2])) - tm.assert_equal(ssdf.loc[:2].testattr, "testattr") - tm.assert_equal(ssdf.iloc[:2].testattr, "testattr") - tm.assert_equal(ssdf[:2].testattr, "testattr") + assert ssdf.loc[:2].testattr == "testattr" + assert ssdf.iloc[:2].testattr == "testattr" + assert ssdf[:2].testattr == "testattr" tm.assert_sp_series_equal(ssdf.loc[1], tm.SubclassedSparseSeries(rows[1]), diff --git a/pandas/tests/frame/test_timeseries.py b/pandas/tests/frame/test_timeseries.py index 9758c2b9c805e..143a7ea8f6fb2 100644 --- 
a/pandas/tests/frame/test_timeseries.py +++ b/pandas/tests/frame/test_timeseries.py @@ -2,29 +2,29 @@ from __future__ import print_function -from datetime import datetime +from datetime import datetime, time + +import pytest from numpy import nan from numpy.random import randn import numpy as np -from pandas import DataFrame, Series, Index, Timestamp, DatetimeIndex +from pandas import (DataFrame, Series, Index, + Timestamp, DatetimeIndex, + to_datetime, date_range) import pandas as pd import pandas.tseries.offsets as offsets -from pandas.util.testing import (assert_almost_equal, - assert_series_equal, - assert_frame_equal, - assertRaisesRegexp) +from pandas.util.testing import assert_series_equal, assert_frame_equal import pandas.util.testing as tm +from pandas.compat import product from pandas.tests.frame.common import TestData -class TestDataFrameTimeSeriesMethods(tm.TestCase, TestData): - - _multiprocess_can_split_ = True +class TestDataFrameTimeSeriesMethods(TestData): def test_diff(self): the_diff = self.tsframe.diff(1) @@ -38,7 +38,7 @@ def test_diff(self): s = Series([a, b]) rs = DataFrame({'s': s}).diff() - self.assertEqual(rs.s[1], 1) + assert rs.s[1] == 1 # mixed numeric tf = self.tsframe.astype('float32') @@ -71,7 +71,7 @@ def test_diff_mixed_dtype(self): df['A'] = np.array([1, 2, 3, 4, 5], dtype=object) result = df.diff() - self.assertEqual(result[0].dtype, np.float64) + assert result[0].dtype == np.float64 def test_diff_neg_n(self): rs = self.tsframe.diff(-1) @@ -117,16 +117,70 @@ def test_pct_change_shift_over_nas(self): edf = DataFrame({'a': expected, 'b': expected}) assert_frame_equal(chg, edf) + def test_frame_ctor_datetime64_column(self): + rng = date_range('1/1/2000 00:00:00', '1/1/2000 1:59:50', freq='10s') + dates = np.asarray(rng) + + df = DataFrame({'A': np.random.randn(len(rng)), 'B': dates}) + assert np.issubdtype(df['B'].dtype, np.dtype('M8[ns]')) + + def test_frame_add_datetime64_column(self): + rng = date_range('1/1/2000 00:00:00', '1/1/2000 1:59:50', freq='10s') + df = DataFrame(index=np.arange(len(rng))) + + df['A'] = rng + assert np.issubdtype(df['A'].dtype, np.dtype('M8[ns]')) + + def test_frame_datetime64_pre1900_repr(self): + df = DataFrame({'year': date_range('1/1/1700', periods=50, + freq='A-DEC')}) + # it works! 
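+        # ("it works" here means repr() simply does not raise:
+        # datetime formatting has historically been fragile for
+        # pre-1900 dates, e.g. platform strftime limits, so
+        # succeeding is the whole assertion)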
+ repr(df) + + def test_frame_add_datetime64_col_other_units(self): + n = 100 + + units = ['h', 'm', 's', 'ms', 'D', 'M', 'Y'] + + ns_dtype = np.dtype('M8[ns]') + + for unit in units: + dtype = np.dtype('M8[%s]' % unit) + vals = np.arange(n, dtype=np.int64).view(dtype) + + df = DataFrame({'ints': np.arange(n)}, index=np.arange(n)) + df[unit] = vals + + ex_vals = to_datetime(vals.astype('O')).values + + assert df[unit].dtype == ns_dtype + assert (df[unit].values == ex_vals).all() + + # Test insertion into existing datetime64 column + df = DataFrame({'ints': np.arange(n)}, index=np.arange(n)) + df['dates'] = np.arange(n, dtype=np.int64).view(ns_dtype) + + for unit in units: + dtype = np.dtype('M8[%s]' % unit) + vals = np.arange(n, dtype=np.int64).view(dtype) + + tmp = df.copy() + + tmp['dates'] = vals + ex_vals = to_datetime(vals.astype('O')).values + + assert (tmp['dates'].values == ex_vals).all() + def test_shift(self): # naive shift shiftedFrame = self.tsframe.shift(5) - self.assert_index_equal(shiftedFrame.index, self.tsframe.index) + tm.assert_index_equal(shiftedFrame.index, self.tsframe.index) shiftedSeries = self.tsframe['A'].shift(5) assert_series_equal(shiftedFrame['A'], shiftedSeries) shiftedFrame = self.tsframe.shift(-5) - self.assert_index_equal(shiftedFrame.index, self.tsframe.index) + tm.assert_index_equal(shiftedFrame.index, self.tsframe.index) shiftedSeries = self.tsframe['A'].shift(-5) assert_series_equal(shiftedFrame['A'], shiftedSeries) @@ -137,7 +191,7 @@ def test_shift(self): # shift by DateOffset shiftedFrame = self.tsframe.shift(5, freq=offsets.BDay()) - self.assertEqual(len(shiftedFrame), len(self.tsframe)) + assert len(shiftedFrame) == len(self.tsframe) shiftedFrame2 = self.tsframe.shift(5, freq='B') assert_frame_equal(shiftedFrame, shiftedFrame2) @@ -154,18 +208,19 @@ def test_shift(self): ps = tm.makePeriodFrame() shifted = ps.shift(1) unshifted = shifted.shift(-1) - self.assert_index_equal(shifted.index, ps.index) - self.assert_index_equal(unshifted.index, ps.index) - tm.assert_numpy_array_equal(unshifted.ix[:, 0].valid().values, - ps.ix[:-1, 0].values) + tm.assert_index_equal(shifted.index, ps.index) + tm.assert_index_equal(unshifted.index, ps.index) + tm.assert_numpy_array_equal(unshifted.iloc[:, 0].valid().values, + ps.iloc[:-1, 0].values) shifted2 = ps.shift(1, 'B') shifted3 = ps.shift(1, offsets.BDay()) assert_frame_equal(shifted2, shifted3) assert_frame_equal(ps, shifted2.shift(-1, 'B')) - assertRaisesRegexp(ValueError, 'does not match PeriodIndex freq', - ps.shift, freq='D') + tm.assert_raises_regex(ValueError, + 'does not match PeriodIndex freq', + ps.shift, freq='D') # shift other axis # GH 6371 @@ -225,7 +280,8 @@ def test_tshift(self): shifted3 = ps.tshift(freq=offsets.BDay()) assert_frame_equal(shifted, shifted3) - assertRaisesRegexp(ValueError, 'does not match', ps.tshift, freq='M') + tm.assert_raises_regex( + ValueError, 'does not match', ps.tshift, freq='M') # DatetimeIndex shifted = self.tsframe.tshift(1) @@ -244,8 +300,8 @@ def test_tshift(self): assert_frame_equal(shifted, self.tsframe.tshift(1)) assert_frame_equal(unshifted, inferred_ts) - no_freq = self.tsframe.ix[[0, 5, 7], :] - self.assertRaises(ValueError, no_freq.tshift) + no_freq = self.tsframe.iloc[[0, 5, 7], :] + pytest.raises(ValueError, no_freq.tshift) def test_truncate(self): ts = self.tsframe[::3] @@ -286,21 +342,21 @@ def test_truncate(self): truncated = ts.truncate(after=end_missing) assert_frame_equal(truncated, expected) - self.assertRaises(ValueError, ts.truncate, - 
before=ts.index[-1] - 1, - after=ts.index[0] + 1) + pytest.raises(ValueError, ts.truncate, + before=ts.index[-1] - 1, + after=ts.index[0] + 1) def test_truncate_copy(self): index = self.tsframe.index truncated = self.tsframe.truncate(index[5], index[10]) truncated.values[:] = 5. - self.assertFalse((self.tsframe.values[5:11] == 5).any()) + assert not (self.tsframe.values[5:11] == 5).any() def test_asfreq(self): offset_monthly = self.tsframe.asfreq(offsets.BMonthEnd()) rule_monthly = self.tsframe.asfreq('BM') - assert_almost_equal(offset_monthly['A'], rule_monthly['A']) + tm.assert_almost_equal(offset_monthly['A'], rule_monthly['A']) filled = rule_monthly.asfreq('B', method='pad') # noqa # TODO: actually check that this worked. @@ -311,17 +367,37 @@ def test_asfreq(self): # test does not blow up on length-0 DataFrame zero_length = self.tsframe.reindex([]) result = zero_length.asfreq('BM') - self.assertIsNot(result, zero_length) + assert result is not zero_length def test_asfreq_datetimeindex(self): df = DataFrame({'A': [1, 2, 3]}, index=[datetime(2011, 11, 1), datetime(2011, 11, 2), datetime(2011, 11, 3)]) df = df.asfreq('B') - tm.assertIsInstance(df.index, DatetimeIndex) + assert isinstance(df.index, DatetimeIndex) ts = df['A'].asfreq('B') - tm.assertIsInstance(ts.index, DatetimeIndex) + assert isinstance(ts.index, DatetimeIndex) + + def test_asfreq_fillvalue(self): + # test for fill value during upsampling, related to issue 3715 + + # setup + rng = pd.date_range('1/1/2016', periods=10, freq='2S') + ts = pd.Series(np.arange(len(rng)), index=rng) + df = pd.DataFrame({'one': ts}) + + # insert pre-existing missing value + df.loc['2016-01-01 00:00:08', 'one'] = None + + actual_df = df.asfreq(freq='1S', fill_value=9.0) + expected_df = df.asfreq(freq='1S').fillna(9.0) + expected_df.loc['2016-01-01 00:00:08', 'one'] = None + assert_frame_equal(expected_df, actual_df) + + expected_series = ts.asfreq(freq='1S').fillna(9.0) + actual_series = ts.asfreq(freq='1S', fill_value=9.0) + assert_series_equal(expected_series, actual_series) def test_first_last_valid(self): N = len(self.frame.index) @@ -332,15 +408,105 @@ def test_first_last_valid(self): frame = DataFrame({'foo': mat}, index=self.frame.index) index = frame.first_valid_index() - self.assertEqual(index, frame.index[5]) + assert index == frame.index[5] index = frame.last_valid_index() - self.assertEqual(index, frame.index[-6]) + assert index == frame.index[-6] # GH12800 empty = DataFrame() - self.assertIsNone(empty.last_valid_index()) - self.assertIsNone(empty.first_valid_index()) + assert empty.last_valid_index() is None + assert empty.first_valid_index() is None + + def test_at_time_frame(self): + rng = date_range('1/1/2000', '1/5/2000', freq='5min') + ts = DataFrame(np.random.randn(len(rng), 2), index=rng) + rs = ts.at_time(rng[1]) + assert (rs.index.hour == rng[1].hour).all() + assert (rs.index.minute == rng[1].minute).all() + assert (rs.index.second == rng[1].second).all() + + result = ts.at_time('9:30') + expected = ts.at_time(time(9, 30)) + assert_frame_equal(result, expected) + + result = ts.loc[time(9, 30)] + expected = ts.loc[(rng.hour == 9) & (rng.minute == 30)] + + assert_frame_equal(result, expected) + + # midnight, everything + rng = date_range('1/1/2000', '1/31/2000') + ts = DataFrame(np.random.randn(len(rng), 3), index=rng) + + result = ts.at_time(time(0, 0)) + assert_frame_equal(result, ts) + + # time doesn't exist + rng = date_range('1/1/2012', freq='23Min', periods=384) + ts = DataFrame(np.random.randn(len(rng), 2), rng) + rs = 
ts.at_time('16:00') + assert len(rs) == 0 + + def test_between_time_frame(self): + rng = date_range('1/1/2000', '1/5/2000', freq='5min') + ts = DataFrame(np.random.randn(len(rng), 2), index=rng) + stime = time(0, 0) + etime = time(1, 0) + + close_open = product([True, False], [True, False]) + for inc_start, inc_end in close_open: + filtered = ts.between_time(stime, etime, inc_start, inc_end) + exp_len = 13 * 4 + 1 + if not inc_start: + exp_len -= 5 + if not inc_end: + exp_len -= 4 + + assert len(filtered) == exp_len + for rs in filtered.index: + t = rs.time() + if inc_start: + assert t >= stime + else: + assert t > stime + + if inc_end: + assert t <= etime + else: + assert t < etime + + result = ts.between_time('00:00', '01:00') + expected = ts.between_time(stime, etime) + assert_frame_equal(result, expected) + + # across midnight + rng = date_range('1/1/2000', '1/5/2000', freq='5min') + ts = DataFrame(np.random.randn(len(rng), 2), index=rng) + stime = time(22, 0) + etime = time(9, 0) + + close_open = product([True, False], [True, False]) + for inc_start, inc_end in close_open: + filtered = ts.between_time(stime, etime, inc_start, inc_end) + exp_len = (12 * 11 + 1) * 4 + 1 + if not inc_start: + exp_len -= 4 + if not inc_end: + exp_len -= 4 + + assert len(filtered) == exp_len + for rs in filtered.index: + t = rs.time() + if inc_start: + assert (t >= stime) or (t <= etime) + else: + assert (t > stime) or (t <= etime) + + if inc_end: + assert (t <= etime) or (t >= stime) + else: + assert (t < etime) or (t >= stime) def test_operation_on_NaT(self): # Both NaT and Timestamp are in DataFrame. @@ -366,8 +532,45 @@ def test_operation_on_NaT(self): exp = pd.Series([pd.NaT], index=["foo"]) tm.assert_series_equal(res, exp) - -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + def test_datetime_assignment_with_NaT_and_diff_time_units(self): + # GH 7492 + data_ns = np.array([1, 'nat'], dtype='datetime64[ns]') + result = pd.Series(data_ns).to_frame() + result['new'] = data_ns + expected = pd.DataFrame({0: [1, None], + 'new': [1, None]}, dtype='datetime64[ns]') + tm.assert_frame_equal(result, expected) + # OutOfBoundsDatetime error shouldn't occur + data_s = np.array([1, 'nat'], dtype='datetime64[s]') + result['new'] = data_s + expected = pd.DataFrame({0: [1, None], + 'new': [1e9, None]}, dtype='datetime64[ns]') + tm.assert_frame_equal(result, expected) + + def test_frame_to_period(self): + K = 5 + from pandas.core.indexes.period import period_range + + dr = date_range('1/1/2000', '1/1/2001') + pr = period_range('1/1/2000', '1/1/2001') + df = DataFrame(randn(len(dr), K), index=dr) + df['mix'] = 'a' + + pts = df.to_period() + exp = df.copy() + exp.index = pr + assert_frame_equal(pts, exp) + + pts = df.to_period('M') + tm.assert_index_equal(pts.index, exp.index.asfreq('M')) + + df = df.T + pts = df.to_period(axis=1) + exp = df.copy() + exp.columns = pr + assert_frame_equal(pts, exp) + + pts = df.to_period('M', axis=1) + tm.assert_index_equal(pts.columns, exp.columns.asfreq('M')) + + pytest.raises(ValueError, df.to_period, axis=2) diff --git a/pandas/tests/frame/test_to_csv.py b/pandas/tests/frame/test_to_csv.py index d888a5d334099..69bd2b008416f 100644 --- a/pandas/tests/frame/test_to_csv.py +++ b/pandas/tests/frame/test_to_csv.py @@ -3,12 +3,13 @@ from __future__ import print_function import csv +import pytest from numpy import nan import numpy as np from pandas.compat import (lmap, range, lrange, StringIO, u) -from 
pandas.parser import CParserError +from pandas.errors import ParserError from pandas import (DataFrame, Index, Series, MultiIndex, Timestamp, date_range, read_csv, compat, to_datetime) import pandas as pd @@ -16,9 +17,8 @@ from pandas.util.testing import (assert_almost_equal, assert_series_equal, assert_frame_equal, - ensure_clean, - makeCustomDataframe as mkdf, - assertRaisesRegexp, slow) + ensure_clean, slow, + makeCustomDataframe as mkdf) import pandas.util.testing as tm from pandas.tests.frame.common import TestData @@ -29,9 +29,7 @@ 'int32', 'int64'] -class TestDataFrameToCSV(tm.TestCase, TestData): - - _multiprocess_can_split_ = True +class TestDataFrameToCSV(TestData): def test_to_csv_from_csv1(self): @@ -96,8 +94,8 @@ def test_to_csv_from_csv2(self): assert_frame_equal(xp, rs) - self.assertRaises(ValueError, self.frame2.to_csv, path, - header=['AA', 'X']) + pytest.raises(ValueError, self.frame2.to_csv, path, + header=['AA', 'X']) def test_to_csv_from_csv3(self): @@ -435,13 +433,13 @@ def test_to_csv_no_index(self): assert_frame_equal(df, result) def test_to_csv_with_mix_columns(self): - # GH11637, incorrect output when a mix of integer and string column + # gh-11637: incorrect output when a mix of integer and string column # names passed as columns parameter in to_csv df = DataFrame({0: ['a', 'b', 'c'], 1: ['aa', 'bb', 'cc']}) df['test'] = 'txt' - self.assertEqual(df.to_csv(), df.to_csv(columns=[0, 1, 'test'])) + assert df.to_csv() == df.to_csv(columns=[0, 1, 'test']) def test_to_csv_headers(self): # GH6186, the presence or absence of `index` incorrectly @@ -477,7 +475,7 @@ def test_to_csv_multiindex(self): # TODO to_csv drops column name assert_frame_equal(frame, df, check_names=False) - self.assertEqual(frame.index.names, df.index.names) + assert frame.index.names == df.index.names # needed if setUP becomes a classmethod self.frame.index = old_index @@ -496,7 +494,7 @@ def test_to_csv_multiindex(self): # do not load index tsframe.to_csv(path) recons = DataFrame.from_csv(path, index_col=None) - self.assertEqual(len(recons.columns), len(tsframe.columns) + 2) + assert len(recons.columns) == len(tsframe.columns) + 2 # no index tsframe.to_csv(path, index=False) @@ -550,7 +548,7 @@ def _make_frame(names=None): df = _make_frame(True) df.to_csv(path, tupleize_cols=False, index=False) result = read_csv(path, header=[0, 1], tupleize_cols=False) - self.assertTrue(all([x is None for x in result.columns.names])) + assert all([x is None for x in result.columns.names]) result.columns.names = df.columns.names assert_frame_equal(df, result) @@ -589,13 +587,13 @@ def _make_frame(names=None): for i in [6, 7]: msg = 'len of {i}, but only 5 lines in file'.format(i=i) - with assertRaisesRegexp(CParserError, msg): + with tm.assert_raises_regex(ParserError, msg): read_csv(path, tupleize_cols=False, header=lrange(i), index_col=0) # write with cols - with assertRaisesRegexp(TypeError, 'cannot specify cols with a ' - 'MultiIndex'): + with tm.assert_raises_regex(TypeError, 'cannot specify cols ' + 'with a MultiIndex'): df.to_csv(path, tupleize_cols=False, columns=['foo', 'bar']) with ensure_clean('__tmp_to_csv_multiindex__') as path: @@ -605,8 +603,8 @@ def _make_frame(names=None): exp = tsframe[:0] exp.index = [] - self.assert_index_equal(recons.columns, exp.columns) - self.assertEqual(len(recons), 0) + tm.assert_index_equal(recons.columns, exp.columns) + assert len(recons) == 0 def test_to_csv_float32_nanrep(self): df = DataFrame(np.random.randn(1, 4).astype(np.float32)) @@ -617,7 +615,7 @@ def 
test_to_csv_float32_nanrep(self): with open(path) as f: lines = f.readlines() - self.assertEqual(lines[1].split(',')[2], '999') + assert lines[1].split(',')[2] == '999' def test_to_csv_withcommas(self): @@ -646,10 +644,10 @@ def create_cols(name): index=df_float.index, columns=create_cols('date')) # add in some nans - df_float.ix[30:50, 1:3] = np.nan + df_float.loc[30:50, 1:3] = np.nan # ## this is a bug in read_csv right now #### - # df_dt.ix[30:50,1:3] = np.nan + # df_dt.loc[30:50,1:3] = np.nan df = pd.concat([df_float, df_int, df_bool, df_object, df_dt], axis=1) @@ -815,7 +813,7 @@ def test_to_csv_unicodewriter_quoting(self): '2,"bar"\n' '3,"baz"\n') - self.assertEqual(result, expected) + assert result == expected def test_to_csv_quote_none(self): # GH4328 @@ -826,7 +824,7 @@ def test_to_csv_quote_none(self): encoding=encoding, index=False) result = buf.getvalue() expected = 'A\nhello\n{"hello"}\n' - self.assertEqual(result, expected) + assert result == expected def test_to_csv_index_no_leading_comma(self): df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, @@ -838,7 +836,7 @@ def test_to_csv_index_no_leading_comma(self): 'one,1,4\n' 'two,2,5\n' 'three,3,6\n') - self.assertEqual(buf.getvalue(), expected) + assert buf.getvalue() == expected def test_to_csv_line_terminators(self): df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, @@ -850,7 +848,7 @@ def test_to_csv_line_terminators(self): 'one,1,4\r\n' 'two,2,5\r\n' 'three,3,6\r\n') - self.assertEqual(buf.getvalue(), expected) + assert buf.getvalue() == expected buf = StringIO() df.to_csv(buf) # The default line terminator remains \n @@ -858,7 +856,7 @@ def test_to_csv_line_terminators(self): 'one,1,4\n' 'two,2,5\n' 'three,3,6\n') - self.assertEqual(buf.getvalue(), expected) + assert buf.getvalue() == expected def test_to_csv_from_csv_categorical(self): @@ -870,7 +868,7 @@ def test_to_csv_from_csv_categorical(self): s.to_csv(res) exp = StringIO() s2.to_csv(exp) - self.assertEqual(res.getvalue(), exp.getvalue()) + assert res.getvalue() == exp.getvalue() df = DataFrame({"s": s}) df2 = DataFrame({"s": s2}) @@ -878,14 +876,14 @@ def test_to_csv_from_csv_categorical(self): df.to_csv(res) exp = StringIO() df2.to_csv(exp) - self.assertEqual(res.getvalue(), exp.getvalue()) + assert res.getvalue() == exp.getvalue() def test_to_csv_path_is_none(self): # GH 8215 # Make sure we return string for consistency with # Series.to_csv() csv_str = self.frame.to_csv(path_or_buf=None) - self.assertIsInstance(csv_str, str) + assert isinstance(csv_str, str) recons = pd.read_csv(StringIO(csv_str), index_col=0) assert_frame_equal(self.frame, recons) @@ -910,7 +908,7 @@ def test_to_csv_compression_gzip(self): text = f.read().decode('utf8') f.close() for col in df.columns: - self.assertIn(col, text) + assert col in text def test_to_csv_compression_bz2(self): # GH7615 @@ -933,7 +931,7 @@ def test_to_csv_compression_bz2(self): text = f.read().decode('utf8') f.close() for col in df.columns: - self.assertIn(col, text) + assert col in text def test_to_csv_compression_xz(self): # GH11852 @@ -967,8 +965,8 @@ def test_to_csv_compression_value_error(self): with ensure_clean() as filename: # zip compression is not supported and should raise ValueError import zipfile - self.assertRaises(zipfile.BadZipfile, df.to_csv, - filename, compression="zip") + pytest.raises(zipfile.BadZipfile, df.to_csv, + filename, compression="zip") def test_to_csv_date_format(self): with ensure_clean('__tmp_to_csv_date_format__') as path: @@ -1080,13 +1078,13 @@ def test_to_csv_quoting(self): 
1,False,3.2,,"b,c" """ result = df.to_csv() - self.assertEqual(result, expected) + assert result == expected result = df.to_csv(quoting=None) - self.assertEqual(result, expected) + assert result == expected result = df.to_csv(quoting=csv.QUOTE_MINIMAL) - self.assertEqual(result, expected) + assert result == expected expected = """\ "","c_bool","c_float","c_int","c_string" @@ -1094,7 +1092,7 @@ def test_to_csv_quoting(self): "1","False","3.2","","b,c" """ result = df.to_csv(quoting=csv.QUOTE_ALL) - self.assertEqual(result, expected) + assert result == expected # see gh-12922, gh-13259: make sure changes to # the formatters do not break this behaviour @@ -1104,14 +1102,14 @@ def test_to_csv_quoting(self): 1,False,3.2,"","b,c" """ result = df.to_csv(quoting=csv.QUOTE_NONNUMERIC) - self.assertEqual(result, expected) + assert result == expected msg = "need to escape, but no escapechar set" - tm.assertRaisesRegexp(csv.Error, msg, df.to_csv, - quoting=csv.QUOTE_NONE) - tm.assertRaisesRegexp(csv.Error, msg, df.to_csv, - quoting=csv.QUOTE_NONE, - escapechar=None) + tm.assert_raises_regex(csv.Error, msg, df.to_csv, + quoting=csv.QUOTE_NONE) + tm.assert_raises_regex(csv.Error, msg, df.to_csv, + quoting=csv.QUOTE_NONE, + escapechar=None) expected = """\ ,c_bool,c_float,c_int,c_string @@ -1120,7 +1118,7 @@ def test_to_csv_quoting(self): """ result = df.to_csv(quoting=csv.QUOTE_NONE, escapechar='!') - self.assertEqual(result, expected) + assert result == expected expected = """\ ,c_bool,c_ffloat,c_int,c_string @@ -1129,7 +1127,7 @@ def test_to_csv_quoting(self): """ result = df.to_csv(quoting=csv.QUOTE_NONE, escapechar='f') - self.assertEqual(result, expected) + assert result == expected # see gh-3503: quoting Windows line terminators # presents with encoding? @@ -1137,17 +1135,39 @@ def test_to_csv_quoting(self): df = pd.read_csv(StringIO(text)) buf = StringIO() df.to_csv(buf, encoding='utf-8', index=False) - self.assertEqual(buf.getvalue(), text) + assert buf.getvalue() == text # xref gh-7791: make sure the quoting parameter is passed through # with multi-indexes df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}) df = df.set_index(['a', 'b']) expected = '"a","b","c"\n"1","3","5"\n"2","4","6"\n' - self.assertEqual(df.to_csv(quoting=csv.QUOTE_ALL), expected) + assert df.to_csv(quoting=csv.QUOTE_ALL) == expected + + def test_period_index_date_overflow(self): + # see gh-15982 + + dates = ["1990-01-01", "2000-01-01", "3005-01-01"] + index = pd.PeriodIndex(dates, freq="D") + + df = pd.DataFrame([4, 5, 6], index=index) + result = df.to_csv() + expected = ',0\n1990-01-01,4\n2000-01-01,5\n3005-01-01,6\n' + assert result == expected + + date_format = "%m-%d-%Y" + result = df.to_csv(date_format=date_format) + + expected = ',0\n01-01-1990,4\n01-01-2000,5\n01-01-3005,6\n' + assert result == expected + + # Overflow with pd.NaT + dates = ["1990-01-01", pd.NaT, "3005-01-01"] + index = pd.PeriodIndex(dates, freq="D") + + df = pd.DataFrame([4, 5, 6], index=index) + result = df.to_csv() -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + expected = ',0\n1990-01-01,4\n,5\n3005-01-01,6\n' + assert result == expected diff --git a/pandas/tests/frame/test_validate.py b/pandas/tests/frame/test_validate.py new file mode 100644 index 0000000000000..d6065e6042908 --- /dev/null +++ b/pandas/tests/frame/test_validate.py @@ -0,0 +1,34 @@ +from pandas.core.frame import DataFrame + +import pytest + + +class TestDataFrameValidate(object): + 
"""Tests for error handling related to data types of method arguments.""" + df = DataFrame({'a': [1, 2], 'b': [3, 4]}) + + def test_validate_bool_args(self): + # Tests for error handling related to boolean arguments. + invalid_values = [1, "True", [1, 2, 3], 5.0] + + for value in invalid_values: + with pytest.raises(ValueError): + self.df.query('a > b', inplace=value) + + with pytest.raises(ValueError): + self.df.eval('a + b', inplace=value) + + with pytest.raises(ValueError): + self.df.set_index(keys=['a'], inplace=value) + + with pytest.raises(ValueError): + self.df.reset_index(inplace=value) + + with pytest.raises(ValueError): + self.df.dropna(inplace=value) + + with pytest.raises(ValueError): + self.df.drop_duplicates(inplace=value) + + with pytest.raises(ValueError): + self.df.sort_values(by=['a'], inplace=value) diff --git a/pandas/tests/types/__init__.py b/pandas/tests/groupby/__init__.py similarity index 100% rename from pandas/tests/types/__init__.py rename to pandas/tests/groupby/__init__.py diff --git a/pandas/tests/groupby/common.py b/pandas/tests/groupby/common.py new file mode 100644 index 0000000000000..3e99e8211b4f8 --- /dev/null +++ b/pandas/tests/groupby/common.py @@ -0,0 +1,62 @@ +""" Base setup """ + +import pytest +import numpy as np +from pandas.util import testing as tm +from pandas import DataFrame, MultiIndex + + +@pytest.fixture +def mframe(): + index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', + 'three']], + labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], + [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], + names=['first', 'second']) + return DataFrame(np.random.randn(10, 3), index=index, + columns=['A', 'B', 'C']) + + +@pytest.fixture +def df(): + return DataFrame( + {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], + 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], + 'C': np.random.randn(8), + 'D': np.random.randn(8)}) + + +class MixIn(object): + + def setup_method(self, method): + self.ts = tm.makeTimeSeries() + + self.seriesd = tm.getSeriesData() + self.tsd = tm.getTimeSeriesData() + self.frame = DataFrame(self.seriesd) + self.tsframe = DataFrame(self.tsd) + + self.df = df() + self.df_mixed_floats = DataFrame( + {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], + 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], + 'C': np.random.randn(8), + 'D': np.array( + np.random.randn(8), dtype='float32')}) + + self.mframe = mframe() + + self.three_group = DataFrame( + {'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', + 'foo', 'foo', 'foo'], + 'B': ['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', + 'two', 'two', 'one'], + 'C': ['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', + 'dull', 'shiny', 'shiny', 'shiny'], + 'D': np.random.randn(11), + 'E': np.random.randn(11), + 'F': np.random.randn(11)}) + + +def assert_fp_equal(a, b): + assert (np.abs(a - b) < 1e-12).all() diff --git a/pandas/tests/groupby/test_aggregate.py b/pandas/tests/groupby/test_aggregate.py new file mode 100644 index 0000000000000..d7b46e6748b99 --- /dev/null +++ b/pandas/tests/groupby/test_aggregate.py @@ -0,0 +1,846 @@ +# -*- coding: utf-8 -*- + +""" +we test .agg behavior / note that .apply is tested +generally in test_groupby.py +""" + +from __future__ import print_function + +import pytest + +from datetime import datetime, timedelta +from functools import partial + +import numpy as np +from numpy import nan +import pandas as pd + +from pandas import (date_range, MultiIndex, DataFrame, + Series, Index, 
bdate_range, concat) +from pandas.util.testing import assert_frame_equal, assert_series_equal +from pandas.core.groupby import SpecificationError, DataError +from pandas.compat import OrderedDict +from pandas.io.formats.printing import pprint_thing +import pandas.util.testing as tm + + +class TestGroupByAggregate(object): + + def setup_method(self, method): + self.ts = tm.makeTimeSeries() + + self.seriesd = tm.getSeriesData() + self.tsd = tm.getTimeSeriesData() + self.frame = DataFrame(self.seriesd) + self.tsframe = DataFrame(self.tsd) + + self.df = DataFrame( + {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], + 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], + 'C': np.random.randn(8), + 'D': np.random.randn(8)}) + + self.df_mixed_floats = DataFrame( + {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], + 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], + 'C': np.random.randn(8), + 'D': np.array( + np.random.randn(8), dtype='float32')}) + + index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', + 'three']], + labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], + [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], + names=['first', 'second']) + self.mframe = DataFrame(np.random.randn(10, 3), index=index, + columns=['A', 'B', 'C']) + + self.three_group = DataFrame( + {'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', + 'foo', 'foo', 'foo'], + 'B': ['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', + 'two', 'two', 'one'], + 'C': ['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', + 'dull', 'shiny', 'shiny', 'shiny'], + 'D': np.random.randn(11), + 'E': np.random.randn(11), + 'F': np.random.randn(11)}) + + def test_agg_api(self): + + # GH 6337 + # http://stackoverflow.com/questions/21706030/pandas-groupby-agg-function-column-dtype-error + # different api for agg when passed custom function with mixed frame + + df = DataFrame({'data1': np.random.randn(5), + 'data2': np.random.randn(5), + 'key1': ['a', 'a', 'b', 'b', 'a'], + 'key2': ['one', 'two', 'one', 'two', 'one']}) + grouped = df.groupby('key1') + + def peak_to_peak(arr): + return arr.max() - arr.min() + + expected = grouped.agg([peak_to_peak]) + expected.columns = ['data1', 'data2'] + result = grouped.agg(peak_to_peak) + assert_frame_equal(result, expected) + + def test_agg_regression1(self): + grouped = self.tsframe.groupby([lambda x: x.year, lambda x: x.month]) + result = grouped.agg(np.mean) + expected = grouped.mean() + assert_frame_equal(result, expected) + + def test_agg_datetimes_mixed(self): + data = [[1, '2012-01-01', 1.0], [2, '2012-01-02', 2.0], [3, None, 3.0]] + + df1 = DataFrame({'key': [x[0] for x in data], + 'date': [x[1] for x in data], + 'value': [x[2] for x in data]}) + + data = [[row[0], datetime.strptime(row[1], '%Y-%m-%d').date() if row[1] + else None, row[2]] for row in data] + + df2 = DataFrame({'key': [x[0] for x in data], + 'date': [x[1] for x in data], + 'value': [x[2] for x in data]}) + + df1['weights'] = df1['value'] / df1['value'].sum() + gb1 = df1.groupby('date').aggregate(np.sum) + + df2['weights'] = df1['value'] / df1['value'].sum() + gb2 = df2.groupby('date').aggregate(np.sum) + + assert (len(gb1) == len(gb2)) + + def test_agg_period_index(self): + from pandas import period_range, PeriodIndex + prng = period_range('2012-1-1', freq='M', periods=3) + df = DataFrame(np.random.randn(3, 2), index=prng) + rs = df.groupby(level=0).sum() + assert isinstance(rs.index, PeriodIndex) + + # GH 3579 + index = period_range(start='1999-01', 
periods=5, freq='M') + s1 = Series(np.random.rand(len(index)), index=index) + s2 = Series(np.random.rand(len(index)), index=index) + series = [('s1', s1), ('s2', s2)] + df = DataFrame.from_items(series) + grouped = df.groupby(df.index.month) + list(grouped) + + def test_agg_dict_parameter_cast_result_dtypes(self): + # GH 12821 + + df = DataFrame( + {'class': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], + 'time': date_range('1/1/2011', periods=8, freq='H')}) + df.loc[[0, 1, 2, 5], 'time'] = None + + # test for `first` function + exp = df.loc[[0, 3, 4, 6]].set_index('class') + grouped = df.groupby('class') + assert_frame_equal(grouped.first(), exp) + assert_frame_equal(grouped.agg('first'), exp) + assert_frame_equal(grouped.agg({'time': 'first'}), exp) + assert_series_equal(grouped.time.first(), exp['time']) + assert_series_equal(grouped.time.agg('first'), exp['time']) + + # test for `last` function + exp = df.loc[[0, 3, 4, 7]].set_index('class') + grouped = df.groupby('class') + assert_frame_equal(grouped.last(), exp) + assert_frame_equal(grouped.agg('last'), exp) + assert_frame_equal(grouped.agg({'time': 'last'}), exp) + assert_series_equal(grouped.time.last(), exp['time']) + assert_series_equal(grouped.time.agg('last'), exp['time']) + + # count + exp = pd.Series([2, 2, 2, 2], + index=Index(list('ABCD'), name='class'), + name='time') + assert_series_equal(grouped.time.agg(len), exp) + assert_series_equal(grouped.time.size(), exp) + + exp = pd.Series([0, 1, 1, 2], + index=Index(list('ABCD'), name='class'), + name='time') + assert_series_equal(grouped.time.count(), exp) + + def test_agg_cast_results_dtypes(self): + # similar to GH12821 + # xref #11444 + u = [datetime(2015, x + 1, 1) for x in range(12)] + v = list('aaabbbbbbccd') + df = pd.DataFrame({'X': v, 'Y': u}) + + result = df.groupby('X')['Y'].agg(len) + expected = df.groupby('X')['Y'].count() + assert_series_equal(result, expected) + + def test_agg_must_agg(self): + grouped = self.df.groupby('A')['C'] + pytest.raises(Exception, grouped.agg, lambda x: x.describe()) + pytest.raises(Exception, grouped.agg, lambda x: x.index[:2]) + + def test_agg_ser_multi_key(self): + # TODO(wesm): unused + ser = self.df.C # noqa + + f = lambda x: x.sum() + results = self.df.C.groupby([self.df.A, self.df.B]).aggregate(f) + expected = self.df.groupby(['A', 'B']).sum()['C'] + assert_series_equal(results, expected) + + def test_agg_apply_corner(self): + # nothing to group, all NA + grouped = self.ts.groupby(self.ts * np.nan) + assert self.ts.dtype == np.float64 + + # groupby float64 values results in Float64Index + exp = Series([], dtype=np.float64, index=pd.Index( + [], dtype=np.float64)) + assert_series_equal(grouped.sum(), exp) + assert_series_equal(grouped.agg(np.sum), exp) + assert_series_equal(grouped.apply(np.sum), exp, check_index_type=False) + + # DataFrame + grouped = self.tsframe.groupby(self.tsframe['A'] * np.nan) + exp_df = DataFrame(columns=self.tsframe.columns, dtype=float, + index=pd.Index([], dtype=np.float64)) + assert_frame_equal(grouped.sum(), exp_df, check_names=False) + assert_frame_equal(grouped.agg(np.sum), exp_df, check_names=False) + assert_frame_equal(grouped.apply(np.sum), exp_df.iloc[:, :0], + check_names=False) + + def test_agg_grouping_is_list_tuple(self): + from pandas.core.groupby import Grouping + + df = tm.makeTimeDataFrame() + + grouped = df.groupby(lambda x: x.year) + grouper = grouped.grouper.groupings[0].grouper + grouped.grouper.groupings[0] = Grouping(self.ts.index, list(grouper)) + + result = grouped.agg(np.mean) + 
expected = grouped.mean() + tm.assert_frame_equal(result, expected) + + grouped.grouper.groupings[0] = Grouping(self.ts.index, tuple(grouper)) + + result = grouped.agg(np.mean) + expected = grouped.mean() + tm.assert_frame_equal(result, expected) + + def test_aggregate_api_consistency(self): + # GH 9052 + # make sure that the aggregates via dict + # are consistent + + df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar', + 'foo', 'bar', 'foo', 'foo'], + 'B': ['one', 'one', 'two', 'two', + 'two', 'two', 'one', 'two'], + 'C': np.random.randn(8) + 1.0, + 'D': np.arange(8)}) + + grouped = df.groupby(['A', 'B']) + c_mean = grouped['C'].mean() + c_sum = grouped['C'].sum() + d_mean = grouped['D'].mean() + d_sum = grouped['D'].sum() + + result = grouped['D'].agg(['sum', 'mean']) + expected = pd.concat([d_sum, d_mean], + axis=1) + expected.columns = ['sum', 'mean'] + assert_frame_equal(result, expected, check_like=True) + + result = grouped.agg([np.sum, np.mean]) + expected = pd.concat([c_sum, + c_mean, + d_sum, + d_mean], + axis=1) + expected.columns = MultiIndex.from_product([['C', 'D'], + ['sum', 'mean']]) + assert_frame_equal(result, expected, check_like=True) + + result = grouped[['D', 'C']].agg([np.sum, np.mean]) + expected = pd.concat([d_sum, + d_mean, + c_sum, + c_mean], + axis=1) + expected.columns = MultiIndex.from_product([['D', 'C'], + ['sum', 'mean']]) + assert_frame_equal(result, expected, check_like=True) + + result = grouped.agg({'C': 'mean', 'D': 'sum'}) + expected = pd.concat([d_sum, + c_mean], + axis=1) + assert_frame_equal(result, expected, check_like=True) + + result = grouped.agg({'C': ['mean', 'sum'], + 'D': ['mean', 'sum']}) + expected = pd.concat([c_mean, + c_sum, + d_mean, + d_sum], + axis=1) + expected.columns = MultiIndex.from_product([['C', 'D'], + ['mean', 'sum']]) + + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + result = grouped[['D', 'C']].agg({'r': np.sum, + 'r2': np.mean}) + expected = pd.concat([d_sum, + c_sum, + d_mean, + c_mean], + axis=1) + expected.columns = MultiIndex.from_product([['r', 'r2'], + ['D', 'C']]) + assert_frame_equal(result, expected, check_like=True) + + def test_agg_dict_renaming_deprecation(self): + # 15931 + df = pd.DataFrame({'A': [1, 1, 1, 2, 2], + 'B': range(5), + 'C': range(5)}) + + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False) as w: + df.groupby('A').agg({'B': {'foo': ['sum', 'max']}, + 'C': {'bar': ['count', 'min']}}) + assert "using a dict with renaming" in str(w[0].message) + + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + df.groupby('A')[['B', 'C']].agg({'ma': 'max'}) + + with tm.assert_produces_warning(FutureWarning) as w: + df.groupby('A').B.agg({'foo': 'count'}) + assert "using a dict on a Series for aggregation" in str( + w[0].message) + + def test_agg_compat(self): + + # GH 12334 + + df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar', + 'foo', 'bar', 'foo', 'foo'], + 'B': ['one', 'one', 'two', 'two', + 'two', 'two', 'one', 'two'], + 'C': np.random.randn(8) + 1.0, + 'D': np.arange(8)}) + + g = df.groupby(['A', 'B']) + + expected = pd.concat([g['D'].sum(), + g['D'].std()], + axis=1) + expected.columns = MultiIndex.from_tuples([('C', 'sum'), + ('C', 'std')]) + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + result = g['D'].agg({'C': ['sum', 'std']}) + assert_frame_equal(result, expected, check_like=True) + + expected = pd.concat([g['D'].sum(), + g['D'].std()], + axis=1) + expected.columns = ['C', 'D'] + + with 
tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + result = g['D'].agg({'C': 'sum', 'D': 'std'}) + assert_frame_equal(result, expected, check_like=True) + + def test_agg_nested_dicts(self): + + # API change for disallowing these types of nested dicts + df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar', + 'foo', 'bar', 'foo', 'foo'], + 'B': ['one', 'one', 'two', 'two', + 'two', 'two', 'one', 'two'], + 'C': np.random.randn(8) + 1.0, + 'D': np.arange(8)}) + + g = df.groupby(['A', 'B']) + + def f(): + g.aggregate({'r1': {'C': ['mean', 'sum']}, + 'r2': {'D': ['mean', 'sum']}}) + + pytest.raises(SpecificationError, f) + + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + result = g.agg({'C': {'ra': ['mean', 'std']}, + 'D': {'rb': ['mean', 'std']}}) + expected = pd.concat([g['C'].mean(), g['C'].std(), g['D'].mean(), + g['D'].std()], axis=1) + expected.columns = pd.MultiIndex.from_tuples([('ra', 'mean'), ( + 'ra', 'std'), ('rb', 'mean'), ('rb', 'std')]) + assert_frame_equal(result, expected, check_like=True) + + # same name as the original column + # GH9052 + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + expected = g['D'].agg({'result1': np.sum, 'result2': np.mean}) + expected = expected.rename(columns={'result1': 'D'}) + + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + result = g['D'].agg({'D': np.sum, 'result2': np.mean}) + assert_frame_equal(result, expected, check_like=True) + + def test_agg_python_multiindex(self): + grouped = self.mframe.groupby(['A', 'B']) + + result = grouped.agg(np.mean) + expected = grouped.mean() + tm.assert_frame_equal(result, expected) + + def test_aggregate_str_func(self): + def _check_results(grouped): + # single series + result = grouped['A'].agg('std') + expected = grouped['A'].std() + assert_series_equal(result, expected) + + # group frame by function name + result = grouped.aggregate('var') + expected = grouped.var() + assert_frame_equal(result, expected) + + # group frame by function dict + result = grouped.agg(OrderedDict([['A', 'var'], ['B', 'std'], + ['C', 'mean'], ['D', 'sem']])) + expected = DataFrame(OrderedDict([['A', grouped['A'].var( + )], ['B', grouped['B'].std()], ['C', grouped['C'].mean()], + ['D', grouped['D'].sem()]])) + assert_frame_equal(result, expected) + + by_weekday = self.tsframe.groupby(lambda x: x.weekday()) + _check_results(by_weekday) + + by_mwkday = self.tsframe.groupby([lambda x: x.month, + lambda x: x.weekday()]) + _check_results(by_mwkday) + + def test_aggregate_item_by_item(self): + + df = self.df.copy() + df['E'] = ['a'] * len(self.df) + grouped = self.df.groupby('A') + + # API change in 0.11 + # def aggfun(ser): + # return len(ser + 'a') + # result = grouped.agg(aggfun) + # assert len(result.columns) == 1 + + aggfun = lambda ser: ser.size + result = grouped.agg(aggfun) + foo = (self.df.A == 'foo').sum() + bar = (self.df.A == 'bar').sum() + K = len(result.columns) + + # GH5782 + # odd comparisons can result here, so cast to make easy + exp = pd.Series(np.array([foo] * K), index=list('BCD'), + dtype=np.float64, name='foo') + tm.assert_series_equal(result.xs('foo'), exp) + + exp = pd.Series(np.array([bar] * K), index=list('BCD'), + dtype=np.float64, name='bar') + tm.assert_almost_equal(result.xs('bar'), exp) + + def aggfun(ser): + return ser.size + + result = DataFrame().groupby(self.df.A).agg(aggfun) + assert isinstance(result, DataFrame) + assert len(result) == 0 + + def test_agg_item_by_item_raise_typeerror(self): + from 
numpy.random import randint + + df = DataFrame(randint(10, size=(20, 10))) + + def raiseException(df): + pprint_thing('----------------------------------------') + pprint_thing(df.to_string()) + raise TypeError + + pytest.raises(TypeError, df.groupby(0).agg, raiseException) + + def test_series_agg_multikey(self): + ts = tm.makeTimeSeries() + grouped = ts.groupby([lambda x: x.year, lambda x: x.month]) + + result = grouped.agg(np.sum) + expected = grouped.sum() + assert_series_equal(result, expected) + + def test_series_agg_multi_pure_python(self): + data = DataFrame( + {'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', + 'foo', 'foo', 'foo'], + 'B': ['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', + 'two', 'two', 'one'], + 'C': ['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', + 'dull', 'shiny', 'shiny', 'shiny'], + 'D': np.random.randn(11), + 'E': np.random.randn(11), + 'F': np.random.randn(11)}) + + def bad(x): + assert (len(x.base) > 0) + return 'foo' + + result = data.groupby(['A', 'B']).agg(bad) + expected = data.groupby(['A', 'B']).agg(lambda x: 'foo') + assert_frame_equal(result, expected) + + def test_cythonized_aggers(self): + data = {'A': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1., nan, nan], + 'B': ['A', 'B'] * 6, + 'C': np.random.randn(12)} + df = DataFrame(data) + df.loc[2:10:2, 'C'] = nan + + def _testit(name): + + op = lambda x: getattr(x, name)() + + # single column + grouped = df.drop(['B'], axis=1).groupby('A') + exp = {} + for cat, group in grouped: + exp[cat] = op(group['C']) + exp = DataFrame({'C': exp}) + exp.index.name = 'A' + result = op(grouped) + assert_frame_equal(result, exp) + + # multiple columns + grouped = df.groupby(['A', 'B']) + expd = {} + for (cat1, cat2), group in grouped: + expd.setdefault(cat1, {})[cat2] = op(group['C']) + exp = DataFrame(expd).T.stack(dropna=False) + exp.index.names = ['A', 'B'] + exp.name = 'C' + + result = op(grouped)['C'] + if not tm._incompat_bottleneck_version(name): + assert_series_equal(result, exp) + + _testit('count') + _testit('sum') + _testit('std') + _testit('var') + _testit('sem') + _testit('mean') + _testit('median') + _testit('prod') + _testit('min') + _testit('max') + + def test_cython_agg_boolean(self): + frame = DataFrame({'a': np.random.randint(0, 5, 50), + 'b': np.random.randint(0, 2, 50).astype('bool')}) + result = frame.groupby('a')['b'].mean() + expected = frame.groupby('a')['b'].agg(np.mean) + + assert_series_equal(result, expected) + + def test_cython_agg_nothing_to_agg(self): + frame = DataFrame({'a': np.random.randint(0, 5, 50), + 'b': ['foo', 'bar'] * 25}) + pytest.raises(DataError, frame.groupby('a')['b'].mean) + + frame = DataFrame({'a': np.random.randint(0, 5, 50), + 'b': ['foo', 'bar'] * 25}) + pytest.raises(DataError, frame[['b']].groupby(frame['a']).mean) + + def test_cython_agg_nothing_to_agg_with_dates(self): + frame = DataFrame({'a': np.random.randint(0, 5, 50), + 'b': ['foo', 'bar'] * 25, + 'dates': pd.date_range('now', periods=50, + freq='T')}) + with tm.assert_raises_regex(DataError, + "No numeric types to aggregate"): + frame.groupby('b').dates.mean() + + def test_cython_agg_frame_columns(self): + # #2113 + df = DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]}) + + df.groupby(level=0, axis='columns').mean() + df.groupby(level=0, axis='columns').mean() + df.groupby(level=0, axis='columns').mean() + df.groupby(level=0, axis='columns').mean() + + def test_cython_fail_agg(self): + dr = bdate_range('1/1/2000', periods=50) + ts = Series(['A', 'B', 'C', 'D', 'E'] * 10, index=dr) + + 
grouped = ts.groupby(lambda x: x.month) + summed = grouped.sum() + expected = grouped.agg(np.sum) + assert_series_equal(summed, expected) + + def test_agg_consistency(self): + # agg with ([]) and () not consistent + # GH 6715 + + def P1(a): + try: + return np.percentile(a.dropna(), q=1) + except: + return np.nan + + import datetime as dt + df = DataFrame({'col1': [1, 2, 3, 4], + 'col2': [10, 25, 26, 31], + 'date': [dt.date(2013, 2, 10), dt.date(2013, 2, 10), + dt.date(2013, 2, 11), dt.date(2013, 2, 11)]}) + + g = df.groupby('date') + + expected = g.agg([P1]) + expected.columns = expected.columns.levels[0] + + result = g.agg(P1) + assert_frame_equal(result, expected) + + def test_wrap_agg_out(self): + grouped = self.three_group.groupby(['A', 'B']) + + def func(ser): + if ser.dtype == np.object: + raise TypeError + else: + return ser.sum() + + result = grouped.aggregate(func) + exp_grouped = self.three_group.loc[:, self.three_group.columns != 'C'] + expected = exp_grouped.groupby(['A', 'B']).aggregate(func) + assert_frame_equal(result, expected) + + def test_agg_multiple_functions_maintain_order(self): + # GH #610 + funcs = [('mean', np.mean), ('max', np.max), ('min', np.min)] + result = self.df.groupby('A')['C'].agg(funcs) + exp_cols = Index(['mean', 'max', 'min']) + + tm.assert_index_equal(result.columns, exp_cols) + + def test_multiple_functions_tuples_and_non_tuples(self): + # #1359 + + funcs = [('foo', 'mean'), 'std'] + ex_funcs = [('foo', 'mean'), ('std', 'std')] + + result = self.df.groupby('A')['C'].agg(funcs) + expected = self.df.groupby('A')['C'].agg(ex_funcs) + assert_frame_equal(result, expected) + + result = self.df.groupby('A').agg(funcs) + expected = self.df.groupby('A').agg(ex_funcs) + assert_frame_equal(result, expected) + + def test_agg_multiple_functions_too_many_lambdas(self): + grouped = self.df.groupby('A') + funcs = ['mean', lambda x: x.mean(), lambda x: x.std()] + + pytest.raises(SpecificationError, grouped.agg, funcs) + + def test_more_flexible_frame_multi_function(self): + + grouped = self.df.groupby('A') + + exmean = grouped.agg(OrderedDict([['C', np.mean], ['D', np.mean]])) + exstd = grouped.agg(OrderedDict([['C', np.std], ['D', np.std]])) + + expected = concat([exmean, exstd], keys=['mean', 'std'], axis=1) + expected = expected.swaplevel(0, 1, axis=1).sort_index(level=0, axis=1) + + d = OrderedDict([['C', [np.mean, np.std]], ['D', [np.mean, np.std]]]) + result = grouped.aggregate(d) + + assert_frame_equal(result, expected) + + # be careful + result = grouped.aggregate(OrderedDict([['C', np.mean], + ['D', [np.mean, np.std]]])) + expected = grouped.aggregate(OrderedDict([['C', np.mean], + ['D', [np.mean, np.std]]])) + assert_frame_equal(result, expected) + + def foo(x): + return np.mean(x) + + def bar(x): + return np.std(x, ddof=1) + + # this uses column selection & renaming + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + d = OrderedDict([['C', np.mean], ['D', OrderedDict( + [['foo', np.mean], ['bar', np.std]])]]) + result = grouped.aggregate(d) + + d = OrderedDict([['C', [np.mean]], ['D', [foo, bar]]]) + expected = grouped.aggregate(d) + + assert_frame_equal(result, expected) + + def test_multi_function_flexible_mix(self): + # GH #1268 + grouped = self.df.groupby('A') + + d = OrderedDict([['C', OrderedDict([['foo', 'mean'], [ + 'bar', 'std' + ]])], ['D', 'sum']]) + + # this uses column selection & renaming + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + result = grouped.aggregate(d) + + d2 = 
OrderedDict([['C', OrderedDict([['foo', 'mean'], [ + 'bar', 'std' + ]])], ['D', ['sum']]]) + + # this uses column selection & renaming + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + result2 = grouped.aggregate(d2) + + d3 = OrderedDict([['C', OrderedDict([['foo', 'mean'], [ + 'bar', 'std' + ]])], ['D', {'sum': 'sum'}]]) + + # this uses column selection & renaming + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + expected = grouped.aggregate(d3) + + assert_frame_equal(result, expected) + assert_frame_equal(result2, expected) + + def test_agg_callables(self): + # GH 7929 + df = DataFrame({'foo': [1, 2], 'bar': [3, 4]}).astype(np.int64) + + class fn_class(object): + + def __call__(self, x): + return sum(x) + + equiv_callables = [sum, np.sum, lambda x: sum(x), lambda x: x.sum(), + partial(sum), fn_class()] + + expected = df.groupby("foo").agg(sum) + for ecall in equiv_callables: + result = df.groupby('foo').agg(ecall) + assert_frame_equal(result, expected) + + def test__cython_agg_general(self): + ops = [('mean', np.mean), + ('median', np.median), + ('var', np.var), + ('add', np.sum), + ('prod', np.prod), + ('min', np.min), + ('max', np.max), + ('first', lambda x: x.iloc[0]), + ('last', lambda x: x.iloc[-1]), ] + df = DataFrame(np.random.randn(1000)) + labels = np.random.randint(0, 50, size=1000).astype(float) + + for op, targop in ops: + result = df.groupby(labels)._cython_agg_general(op) + expected = df.groupby(labels).agg(targop) + try: + tm.assert_frame_equal(result, expected) + except BaseException as exc: + exc.args += ('operation: %s' % op, ) + raise + + def test_cython_agg_empty_buckets(self): + ops = [('mean', np.mean), + ('median', lambda x: np.median(x) if len(x) > 0 else np.nan), + ('var', lambda x: np.var(x, ddof=1)), + ('add', lambda x: np.sum(x) if len(x) > 0 else np.nan), + ('prod', np.prod), + ('min', np.min), + ('max', np.max), ] + + df = pd.DataFrame([11, 12, 13]) + grps = range(0, 55, 5) + + for op, targop in ops: + result = df.groupby(pd.cut(df[0], grps))._cython_agg_general(op) + expected = df.groupby(pd.cut(df[0], grps)).agg(lambda x: targop(x)) + try: + tm.assert_frame_equal(result, expected) + except BaseException as exc: + exc.args += ('operation: %s' % op,) + raise + + def test_agg_over_numpy_arrays(self): + # GH 3788 + df = pd.DataFrame([[1, np.array([10, 20, 30])], + [1, np.array([40, 50, 60])], + [2, np.array([20, 30, 40])]], + columns=['category', 'arraydata']) + result = df.groupby('category').agg(sum) + + expected_data = [[np.array([50, 70, 90])], [np.array([20, 30, 40])]] + expected_index = pd.Index([1, 2], name='category') + expected_column = ['arraydata'] + expected = pd.DataFrame(expected_data, + index=expected_index, + columns=expected_column) + + assert_frame_equal(result, expected) + + def test_agg_timezone_round_trip(self): + # GH 15426 + ts = pd.Timestamp("2016-01-01 12:00:00", tz='US/Pacific') + df = pd.DataFrame({'a': 1, 'b': [ts + timedelta(minutes=nn) + for nn in range(10)]}) + + result1 = df.groupby('a')['b'].agg(np.min).iloc[0] + result2 = df.groupby('a')['b'].agg(lambda x: np.min(x)).iloc[0] + result3 = df.groupby('a')['b'].min().iloc[0] + + assert result1 == ts + assert result2 == ts + assert result3 == ts + + dates = [pd.Timestamp("2016-01-0%d 12:00:00" % i, tz='US/Pacific') + for i in range(1, 5)] + df = pd.DataFrame({'A': ['a', 'b'] * 2, 'B': dates}) + grouped = df.groupby('A') + + ts = df['B'].iloc[0] + assert ts == grouped.nth(0)['B'].iloc[0] + assert ts == 
grouped.head(1)['B'].iloc[0] + assert ts == grouped.first()['B'].iloc[0] + assert ts == grouped.apply(lambda x: x.iloc[0])[0] + + ts = df['B'].iloc[2] + assert ts == grouped.last()['B'].iloc[0] + assert ts == grouped.apply(lambda x: x.iloc[-1])[0] diff --git a/pandas/tseries/tests/test_bin_groupby.py b/pandas/tests/groupby/test_bin_groupby.py similarity index 84% rename from pandas/tseries/tests/test_bin_groupby.py rename to pandas/tests/groupby/test_bin_groupby.py index 08c0833be0cd6..f527c732fb76b 100644 --- a/pandas/tseries/tests/test_bin_groupby.py +++ b/pandas/tests/groupby/test_bin_groupby.py @@ -1,14 +1,15 @@ # -*- coding: utf-8 -*- +import pytest + from numpy import nan import numpy as np -from pandas.types.common import _ensure_int64 +from pandas.core.dtypes.common import _ensure_int64 from pandas import Index, isnull from pandas.util.testing import assert_almost_equal import pandas.util.testing as tm -import pandas.lib as lib -import pandas.algos as algos +from pandas._libs import lib, groupby def test_series_grouper(): @@ -45,10 +46,9 @@ def test_series_bin_grouper(): assert_almost_equal(counts, exp_counts) -class TestBinGroupers(tm.TestCase): - _multiprocess_can_split_ = True +class TestBinGroupers(object): - def setUp(self): + def setup_method(self, method): self.obj = np.random.randn(10, 1) self.labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2], dtype=np.int64) self.bins = np.array([3, 6], dtype=np.int64) @@ -72,15 +72,15 @@ def test_generate_bins(self): bins = func(values, binner, closed='right') assert ((bins == np.array([3, 6])).all()) - self.assertRaises(ValueError, generate_bins_generic, values, [], - 'right') - self.assertRaises(ValueError, generate_bins_generic, values[:0], - binner, 'right') + pytest.raises(ValueError, generate_bins_generic, values, [], + 'right') + pytest.raises(ValueError, generate_bins_generic, values[:0], + binner, 'right') - self.assertRaises(ValueError, generate_bins_generic, values, [4], - 'right') - self.assertRaises(ValueError, generate_bins_generic, values, [-3, -1], - 'right') + pytest.raises(ValueError, generate_bins_generic, values, [4], + 'right') + pytest.raises(ValueError, generate_bins_generic, values, [-3, -1], + 'right') def test_group_ohlc(): @@ -93,7 +93,7 @@ def _check(dtype): labels = _ensure_int64(np.repeat(np.arange(3), np.diff(np.r_[0, bins]))) - func = getattr(algos, 'group_ohlc_%s' % dtype) + func = getattr(groupby, 'group_ohlc_%s' % dtype) func(out, counts, obj[:, None], labels) def _ohlc(group): @@ -117,11 +117,12 @@ def _ohlc(group): _check('float64') -class TestMoments(tm.TestCase): +class TestMoments(object): pass -class TestReducer(tm.TestCase): +class TestReducer(object): + def test_int_index(self): from pandas.core.series import Series diff --git a/pandas/tests/groupby/test_categorical.py b/pandas/tests/groupby/test_categorical.py new file mode 100644 index 0000000000000..fdc03acd3e931 --- /dev/null +++ b/pandas/tests/groupby/test_categorical.py @@ -0,0 +1,528 @@ +# -*- coding: utf-8 -*- +from __future__ import print_function +from datetime import datetime + +import pytest + +import numpy as np +from numpy import nan + +import pandas as pd +from pandas import (Index, MultiIndex, CategoricalIndex, + DataFrame, Categorical, Series, Interval) +from pandas.util.testing import assert_frame_equal, assert_series_equal +import pandas.util.testing as tm +from .common import MixIn + + +class TestGroupByCategorical(MixIn): + + def test_level_groupby_get_group(self): + # GH15155 + df = DataFrame(data=np.arange(2, 22, 2), + 
index=MultiIndex( + levels=[pd.CategoricalIndex(["a", "b"]), range(10)], + labels=[[0] * 5 + [1] * 5, range(10)], + names=["Index1", "Index2"])) + g = df.groupby(level=["Index1"]) + + # expected should equal test.loc[["a"]] + # GH15166 + expected = DataFrame(data=np.arange(2, 12, 2), + index=pd.MultiIndex(levels=[pd.CategoricalIndex( + ["a", "b"]), range(5)], + labels=[[0] * 5, range(5)], + names=["Index1", "Index2"])) + result = g.get_group('a') + + assert_frame_equal(result, expected) + + def test_apply_use_categorical_name(self): + from pandas import qcut + cats = qcut(self.df.C, 4) + + def get_stats(group): + return {'min': group.min(), + 'max': group.max(), + 'count': group.count(), + 'mean': group.mean()} + + result = self.df.groupby(cats).D.apply(get_stats) + assert result.index.names[0] == 'C' + + def test_apply_categorical_data(self): + # GH 10138 + for ordered in [True, False]: + dense = Categorical(list('abc'), ordered=ordered) + # 'b' is in the categories but not in the list + missing = Categorical( + list('aaa'), categories=['a', 'b'], ordered=ordered) + values = np.arange(len(dense)) + df = DataFrame({'missing': missing, + 'dense': dense, + 'values': values}) + grouped = df.groupby(['missing', 'dense']) + + # missing category 'b' should still exist in the output index + idx = MultiIndex.from_product( + [Categorical(['a', 'b'], ordered=ordered), + Categorical(['a', 'b', 'c'], ordered=ordered)], + names=['missing', 'dense']) + expected = DataFrame([0, 1, 2, np.nan, np.nan, np.nan], + index=idx, + columns=['values']) + + assert_frame_equal(grouped.apply(lambda x: np.mean(x)), expected) + assert_frame_equal(grouped.mean(), expected) + assert_frame_equal(grouped.agg(np.mean), expected) + + # but for transform we should still get back the original index + idx = MultiIndex.from_product([['a'], ['a', 'b', 'c']], + names=['missing', 'dense']) + expected = Series(1, index=idx) + assert_series_equal(grouped.apply(lambda x: 1), expected) + + def test_groupby_categorical(self): + levels = ['foo', 'bar', 'baz', 'qux'] + codes = np.random.randint(0, 4, size=100) + + cats = Categorical.from_codes(codes, levels, ordered=True) + + data = DataFrame(np.random.randn(100, 4)) + + result = data.groupby(cats).mean() + + expected = data.groupby(np.asarray(cats)).mean() + exp_idx = CategoricalIndex(levels, categories=cats.categories, + ordered=True) + expected = expected.reindex(exp_idx) + + assert_frame_equal(result, expected) + + grouped = data.groupby(cats) + desc_result = grouped.describe() + + idx = cats.codes.argsort() + ord_labels = np.asarray(cats).take(idx) + ord_data = data.take(idx) + + exp_cats = Categorical(ord_labels, ordered=True, + categories=['foo', 'bar', 'baz', 'qux']) + expected = ord_data.groupby(exp_cats, sort=False).describe() + assert_frame_equal(desc_result, expected) + + # GH 10460 + expc = Categorical.from_codes(np.arange(4).repeat(8), + levels, ordered=True) + exp = CategoricalIndex(expc) + tm.assert_index_equal((desc_result.stack().index + .get_level_values(0)), exp) + exp = Index(['count', 'mean', 'std', 'min', '25%', '50%', + '75%', 'max'] * 4) + tm.assert_index_equal((desc_result.stack().index + .get_level_values(1)), exp) + + def test_groupby_datetime_categorical(self): + # GH9049: ensure backward compatibility + levels = pd.date_range('2014-01-01', periods=4) + codes = np.random.randint(0, 4, size=100) + + cats = Categorical.from_codes(codes, levels, ordered=True) + + data = DataFrame(np.random.randn(100, 4)) + result = data.groupby(cats).mean() + + expected = 
data.groupby(np.asarray(cats)).mean() + expected = expected.reindex(levels) + expected.index = CategoricalIndex(expected.index, + categories=expected.index, + ordered=True) + + assert_frame_equal(result, expected) + + grouped = data.groupby(cats) + desc_result = grouped.describe() + + idx = cats.codes.argsort() + ord_labels = cats.take_nd(idx) + ord_data = data.take(idx) + expected = ord_data.groupby(ord_labels).describe() + assert_frame_equal(desc_result, expected) + tm.assert_index_equal(desc_result.index, expected.index) + tm.assert_index_equal( + desc_result.index.get_level_values(0), + expected.index.get_level_values(0)) + + # GH 10460 + expc = Categorical.from_codes( + np.arange(4).repeat(8), levels, ordered=True) + exp = CategoricalIndex(expc) + tm.assert_index_equal((desc_result.stack().index + .get_level_values(0)), exp) + exp = Index(['count', 'mean', 'std', 'min', '25%', '50%', + '75%', 'max'] * 4) + tm.assert_index_equal((desc_result.stack().index + .get_level_values(1)), exp) + + def test_groupby_categorical_index(self): + + s = np.random.RandomState(12345) + levels = ['foo', 'bar', 'baz', 'qux'] + codes = s.randint(0, 4, size=20) + cats = Categorical.from_codes(codes, levels, ordered=True) + df = DataFrame( + np.repeat( + np.arange(20), 4).reshape(-1, 4), columns=list('abcd')) + df['cats'] = cats + + # with a cat index + result = df.set_index('cats').groupby(level=0).sum() + expected = df[list('abcd')].groupby(cats.codes).sum() + expected.index = CategoricalIndex( + Categorical.from_codes( + [0, 1, 2, 3], levels, ordered=True), name='cats') + assert_frame_equal(result, expected) + + # with a cat column, should produce a cat index + result = df.groupby('cats').sum() + expected = df[list('abcd')].groupby(cats.codes).sum() + expected.index = CategoricalIndex( + Categorical.from_codes( + [0, 1, 2, 3], levels, ordered=True), name='cats') + assert_frame_equal(result, expected) + + def test_groupby_describe_categorical_columns(self): + # GH 11558 + cats = pd.CategoricalIndex(['qux', 'foo', 'baz', 'bar'], + categories=['foo', 'bar', 'baz', 'qux'], + ordered=True) + df = DataFrame(np.random.randn(20, 4), columns=cats) + result = df.groupby([1, 2, 3, 4] * 5).describe() + + tm.assert_index_equal(result.stack().columns, cats) + tm.assert_categorical_equal(result.stack().columns.values, cats.values) + + def test_groupby_unstack_categorical(self): + # GH11558 (example is taken from the original issue) + df = pd.DataFrame({'a': range(10), + 'medium': ['A', 'B'] * 5, + 'artist': list('XYXXY') * 2}) + df['medium'] = df['medium'].astype('category') + + gcat = df.groupby(['artist', 'medium'])['a'].count().unstack() + result = gcat.describe() + + exp_columns = pd.CategoricalIndex(['A', 'B'], ordered=False, + name='medium') + tm.assert_index_equal(result.columns, exp_columns) + tm.assert_categorical_equal(result.columns.values, exp_columns.values) + + result = gcat['A'] + gcat['B'] + expected = pd.Series([6, 4], index=pd.Index(['X', 'Y'], name='artist')) + tm.assert_series_equal(result, expected) + + def test_groupby_bins_unequal_len(self): + # GH3011 + series = Series([np.nan, np.nan, 1, 1, 2, 2, 3, 3, 4, 4]) + bins = pd.cut(series.dropna().values, 4) + + # len(bins) != len(series) here + def f(): + series.groupby(bins).mean() + pytest.raises(ValueError, f) + + def test_groupby_multi_categorical_as_index(self): + # GH13204 + df = DataFrame({'cat': Categorical([1, 2, 2], [1, 2, 3]), + 'A': [10, 11, 11], + 'B': [101, 102, 103]}) + result = df.groupby(['cat', 'A'], as_index=False).sum() + expected 
= DataFrame({'cat': Categorical([1, 1, 2, 2, 3, 3]), + 'A': [10, 11, 10, 11, 10, 11], + 'B': [101.0, nan, nan, 205.0, nan, nan]}, + columns=['cat', 'A', 'B']) + tm.assert_frame_equal(result, expected) + + # function grouper + f = lambda r: df.loc[r, 'A'] + result = df.groupby(['cat', f], as_index=False).sum() + expected = DataFrame({'cat': Categorical([1, 1, 2, 2, 3, 3]), + 'A': [10.0, nan, nan, 22.0, nan, nan], + 'B': [101.0, nan, nan, 205.0, nan, nan]}, + columns=['cat', 'A', 'B']) + tm.assert_frame_equal(result, expected) + + # another not in-axis grouper (conflicting names in index) + s = Series(['a', 'b', 'b'], name='cat') + result = df.groupby(['cat', s], as_index=False).sum() + expected = DataFrame({'cat': Categorical([1, 1, 2, 2, 3, 3]), + 'A': [10.0, nan, nan, 22.0, nan, nan], + 'B': [101.0, nan, nan, 205.0, nan, nan]}, + columns=['cat', 'A', 'B']) + tm.assert_frame_equal(result, expected) + + # is original index dropped? + expected = DataFrame({'cat': Categorical([1, 1, 2, 2, 3, 3]), + 'A': [10, 11, 10, 11, 10, 11], + 'B': [101.0, nan, nan, 205.0, nan, nan]}, + columns=['cat', 'A', 'B']) + + group_columns = ['cat', 'A'] + + for name in [None, 'X', 'B', 'cat']: + df.index = Index(list("abc"), name=name) + + if name in group_columns and name in df.index.names: + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + result = df.groupby(group_columns, as_index=False).sum() + + else: + result = df.groupby(group_columns, as_index=False).sum() + + tm.assert_frame_equal(result, expected, check_index_type=True) + + def test_groupby_preserve_categories(self): + # GH-13179 + categories = list('abc') + + # ordered=True + df = DataFrame({'A': pd.Categorical(list('ba'), + categories=categories, + ordered=True)}) + index = pd.CategoricalIndex(categories, categories, ordered=True) + tm.assert_index_equal(df.groupby('A', sort=True).first().index, index) + tm.assert_index_equal(df.groupby('A', sort=False).first().index, index) + + # ordered=False + df = DataFrame({'A': pd.Categorical(list('ba'), + categories=categories, + ordered=False)}) + sort_index = pd.CategoricalIndex(categories, categories, ordered=False) + nosort_index = pd.CategoricalIndex(list('bac'), list('bac'), + ordered=False) + tm.assert_index_equal(df.groupby('A', sort=True).first().index, + sort_index) + tm.assert_index_equal(df.groupby('A', sort=False).first().index, + nosort_index) + + def test_groupby_preserve_categorical_dtype(self): + # GH13743, GH13854 + df = DataFrame({'A': [1, 2, 1, 1, 2], + 'B': [10, 16, 22, 28, 34], + 'C1': Categorical(list("abaab"), + categories=list("bac"), + ordered=False), + 'C2': Categorical(list("abaab"), + categories=list("bac"), + ordered=True)}) + # single grouper + exp_full = DataFrame({'A': [2.0, 1.0, np.nan], + 'B': [25.0, 20.0, np.nan], + 'C1': Categorical(list("bac"), + categories=list("bac"), + ordered=False), + 'C2': Categorical(list("bac"), + categories=list("bac"), + ordered=True)}) + for col in ['C1', 'C2']: + result1 = df.groupby(by=col, as_index=False).mean() + result2 = df.groupby(by=col, as_index=True).mean().reset_index() + expected = exp_full.reindex(columns=result1.columns) + tm.assert_frame_equal(result1, expected) + tm.assert_frame_equal(result2, expected) + + # multiple grouper + exp_full = DataFrame({'A': [1, 1, 1, 2, 2, 2], + 'B': [np.nan, 20.0, np.nan, 25.0, np.nan, + np.nan], + 'C1': Categorical(list("bacbac"), + categories=list("bac"), + ordered=False), + 'C2': Categorical(list("bacbac"), + categories=list("bac"), + ordered=True)}) + for cols in 
[['A', 'C1'], ['A', 'C2']]: + result1 = df.groupby(by=cols, as_index=False).mean() + result2 = df.groupby(by=cols, as_index=True).mean().reset_index() + expected = exp_full.reindex(columns=result1.columns) + tm.assert_frame_equal(result1, expected) + tm.assert_frame_equal(result2, expected) + + def test_groupby_categorical_no_compress(self): + data = Series(np.random.randn(9)) + + codes = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]) + cats = Categorical.from_codes(codes, [0, 1, 2], ordered=True) + + result = data.groupby(cats).mean() + exp = data.groupby(codes).mean() + + exp.index = CategoricalIndex(exp.index, categories=cats.categories, + ordered=cats.ordered) + assert_series_equal(result, exp) + + codes = np.array([0, 0, 0, 1, 1, 1, 3, 3, 3]) + cats = Categorical.from_codes(codes, [0, 1, 2, 3], ordered=True) + + result = data.groupby(cats).mean() + exp = data.groupby(codes).mean().reindex(cats.categories) + exp.index = CategoricalIndex(exp.index, categories=cats.categories, + ordered=cats.ordered) + assert_series_equal(result, exp) + + cats = Categorical(["a", "a", "a", "b", "b", "b", "c", "c", "c"], + categories=["a", "b", "c", "d"], ordered=True) + data = DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 4, 5], "b": cats}) + + result = data.groupby("b").mean() + result = result["a"].values + exp = np.array([1, 2, 4, np.nan]) + tm.assert_numpy_array_equal(result, exp) + + def test_groupby_sort_categorical(self): + # GH 8868: DataFrame groupby sort was being ignored + df = DataFrame([['(7.5, 10]', 10, 10], + ['(7.5, 10]', 8, 20], + ['(2.5, 5]', 5, 30], + ['(5, 7.5]', 6, 40], + ['(2.5, 5]', 4, 50], + ['(0, 2.5]', 1, 60], + ['(5, 7.5]', 7, 70]], columns=['range', 'foo', 'bar']) + df['range'] = Categorical(df['range'], ordered=True) + index = CategoricalIndex(['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', + '(7.5, 10]'], name='range', ordered=True) + result_sort = DataFrame([[1, 60], [5, 30], [6, 40], [10, 10]], + columns=['foo', 'bar'], index=index) + + col = 'range' + assert_frame_equal(result_sort, df.groupby(col, sort=True).first()) + # when categories are ordered, groups are sorted by category order + assert_frame_equal(result_sort, df.groupby(col, sort=False).first()) + + df['range'] = Categorical(df['range'], ordered=False) + index = CategoricalIndex(['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', + '(7.5, 10]'], name='range') + result_sort = DataFrame([[1, 60], [5, 30], [6, 40], [10, 10]], + columns=['foo', 'bar'], index=index) + + index = CategoricalIndex(['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', + '(0, 2.5]'], + categories=['(7.5, 10]', '(2.5, 5]', + '(5, 7.5]', '(0, 2.5]'], + name='range') + result_nosort = DataFrame([[10, 10], [5, 30], [6, 40], [1, 60]], + index=index, columns=['foo', 'bar']) + + col = 'range' + # this is an unordered categorical, but we allow this + assert_frame_equal(result_sort, df.groupby(col, sort=True).first()) + assert_frame_equal(result_nosort, df.groupby(col, sort=False).first()) + + def test_groupby_sort_categorical_datetimelike(self): + # GH10505 + + # use the same data as test_groupby_sort_categorical, whose categories + # correspond to datetime.month + df = DataFrame({'dt': [datetime(2011, 7, 1), datetime(2011, 7, 1), + datetime(2011, 2, 1), datetime(2011, 5, 1), + datetime(2011, 2, 1), datetime(2011, 1, 1), + datetime(2011, 5, 1)], + 'foo': [10, 8, 5, 6, 4, 1, 7], + 'bar': [10, 20, 30, 40, 50, 60, 70]}, + columns=['dt', 'foo', 'bar']) + + # ordered=True + df['dt'] = Categorical(df['dt'], ordered=True) + index = [datetime(2011, 1, 1), datetime(2011, 2, 1), + datetime(2011, 5, 1), 
datetime(2011, 7, 1)] + result_sort = DataFrame( + [[1, 60], [5, 30], [6, 40], [10, 10]], columns=['foo', 'bar']) + result_sort.index = CategoricalIndex(index, name='dt', ordered=True) + + index = [datetime(2011, 7, 1), datetime(2011, 2, 1), + datetime(2011, 5, 1), datetime(2011, 1, 1)] + result_nosort = DataFrame([[10, 10], [5, 30], [6, 40], [1, 60]], + columns=['foo', 'bar']) + result_nosort.index = CategoricalIndex(index, categories=index, + name='dt', ordered=True) + + col = 'dt' + assert_frame_equal(result_sort, df.groupby(col, sort=True).first()) + # when categories are ordered, groups are sorted by category order + assert_frame_equal(result_sort, df.groupby(col, sort=False).first()) + + # ordered=False + df['dt'] = Categorical(df['dt'], ordered=False) + index = [datetime(2011, 1, 1), datetime(2011, 2, 1), + datetime(2011, 5, 1), datetime(2011, 7, 1)] + result_sort = DataFrame( + [[1, 60], [5, 30], [6, 40], [10, 10]], columns=['foo', 'bar']) + result_sort.index = CategoricalIndex(index, name='dt') + + index = [datetime(2011, 7, 1), datetime(2011, 2, 1), + datetime(2011, 5, 1), datetime(2011, 1, 1)] + result_nosort = DataFrame([[10, 10], [5, 30], [6, 40], [1, 60]], + columns=['foo', 'bar']) + result_nosort.index = CategoricalIndex(index, categories=index, + name='dt') + + col = 'dt' + assert_frame_equal(result_sort, df.groupby(col, sort=True).first()) + assert_frame_equal(result_nosort, df.groupby(col, sort=False).first()) + + def test_groupby_categorical_two_columns(self): + + # https://github.com/pandas-dev/pandas/issues/8138 + d = {'cat': + pd.Categorical(["a", "b", "a", "b"], categories=["a", "b", "c"], + ordered=True), + 'ints': [1, 1, 2, 2], + 'val': [10, 20, 30, 40]} + test = pd.DataFrame(d) + + # Grouping on a single column + groups_single_key = test.groupby("cat") + res = groups_single_key.agg('mean') + + exp_index = pd.CategoricalIndex(["a", "b", "c"], name="cat", + ordered=True) + exp = DataFrame({"ints": [1.5, 1.5, np.nan], "val": [20, 30, np.nan]}, + index=exp_index) + tm.assert_frame_equal(res, exp) + + # Grouping on two columns + groups_double_key = test.groupby(["cat", "ints"]) + res = groups_double_key.agg('mean') + exp = DataFrame({"val": [10, 30, 20, 40, np.nan, np.nan], + "cat": pd.Categorical(["a", "a", "b", "b", "c", "c"], + ordered=True), + "ints": [1, 2, 1, 2, 1, 2]}).set_index(["cat", "ints" + ]) + tm.assert_frame_equal(res, exp) + + # GH 10132 + for key in [('a', 1), ('b', 2), ('b', 1), ('a', 2)]: + c, i = key + result = groups_double_key.get_group(key) + expected = test[(test.cat == c) & (test.ints == i)] + assert_frame_equal(result, expected) + + d = {'C1': [3, 3, 4, 5], 'C2': [1, 2, 3, 4], 'C3': [10, 100, 200, 34]} + test = pd.DataFrame(d) + values = pd.cut(test['C1'], [1, 2, 3, 6]) + values.name = "cat" + groups_double_key = test.groupby([values, 'C2']) + + res = groups_double_key.agg('mean') + nan = np.nan + idx = MultiIndex.from_product( + [Categorical([Interval(1, 2), Interval(2, 3), + Interval(3, 6)], ordered=True), + [1, 2, 3, 4]], + names=["cat", "C2"]) + exp = DataFrame({"C1": [nan, nan, nan, nan, 3, 3, + nan, nan, nan, nan, 4, 5], + "C3": [nan, nan, nan, nan, 10, 100, + nan, nan, nan, nan, 200, 34]}, index=idx) + tm.assert_frame_equal(res, exp) diff --git a/pandas/tests/groupby/test_filters.py b/pandas/tests/groupby/test_filters.py new file mode 100644 index 0000000000000..cac6b46af8f87 --- /dev/null +++ b/pandas/tests/groupby/test_filters.py @@ -0,0 +1,622 @@ +# -*- coding: utf-8 -*- +from __future__ import print_function +from numpy 
import nan + +import pytest + +from pandas import Timestamp +from pandas.core.index import MultiIndex +from pandas.core.api import DataFrame + +from pandas.core.series import Series + +from pandas.util.testing import (assert_frame_equal, assert_series_equal + ) +from pandas.compat import (lmap) + +from pandas import compat + +import pandas.core.common as com +import numpy as np + +import pandas.util.testing as tm +import pandas as pd + + +class TestGroupByFilter(object): + + def setup_method(self, method): + self.ts = tm.makeTimeSeries() + + self.seriesd = tm.getSeriesData() + self.tsd = tm.getTimeSeriesData() + self.frame = DataFrame(self.seriesd) + self.tsframe = DataFrame(self.tsd) + + self.df = DataFrame( + {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], + 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], + 'C': np.random.randn(8), + 'D': np.random.randn(8)}) + + self.df_mixed_floats = DataFrame( + {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], + 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], + 'C': np.random.randn(8), + 'D': np.array( + np.random.randn(8), dtype='float32')}) + + index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', + 'three']], + labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], + [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], + names=['first', 'second']) + self.mframe = DataFrame(np.random.randn(10, 3), index=index, + columns=['A', 'B', 'C']) + + self.three_group = DataFrame( + {'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', + 'foo', 'foo', 'foo'], + 'B': ['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', + 'two', 'two', 'one'], + 'C': ['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', + 'dull', 'shiny', 'shiny', 'shiny'], + 'D': np.random.randn(11), + 'E': np.random.randn(11), + 'F': np.random.randn(11)}) + + def test_filter_series(self): + s = pd.Series([1, 3, 20, 5, 22, 24, 7]) + expected_odd = pd.Series([1, 3, 5, 7], index=[0, 1, 3, 6]) + expected_even = pd.Series([20, 22, 24], index=[2, 4, 5]) + grouper = s.apply(lambda x: x % 2) + grouped = s.groupby(grouper) + assert_series_equal( + grouped.filter(lambda x: x.mean() < 10), expected_odd) + assert_series_equal( + grouped.filter(lambda x: x.mean() > 10), expected_even) + # Test dropna=False. + assert_series_equal( + grouped.filter(lambda x: x.mean() < 10, dropna=False), + expected_odd.reindex(s.index)) + assert_series_equal( + grouped.filter(lambda x: x.mean() > 10, dropna=False), + expected_even.reindex(s.index)) + + def test_filter_single_column_df(self): + df = pd.DataFrame([1, 3, 20, 5, 22, 24, 7]) + expected_odd = pd.DataFrame([1, 3, 5, 7], index=[0, 1, 3, 6]) + expected_even = pd.DataFrame([20, 22, 24], index=[2, 4, 5]) + grouper = df[0].apply(lambda x: x % 2) + grouped = df.groupby(grouper) + assert_frame_equal( + grouped.filter(lambda x: x.mean() < 10), expected_odd) + assert_frame_equal( + grouped.filter(lambda x: x.mean() > 10), expected_even) + # Test dropna=False. 
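+ # With dropna=False the result keeps the original index: rows whose group fails the predicate are left in place as NaN rather than being dropped.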
+ assert_frame_equal( + grouped.filter(lambda x: x.mean() < 10, dropna=False), + expected_odd.reindex(df.index)) + assert_frame_equal( + grouped.filter(lambda x: x.mean() > 10, dropna=False), + expected_even.reindex(df.index)) + + def test_filter_multi_column_df(self): + df = pd.DataFrame({'A': [1, 12, 12, 1], 'B': [1, 1, 1, 1]}) + grouper = df['A'].apply(lambda x: x % 2) + grouped = df.groupby(grouper) + expected = pd.DataFrame({'A': [12, 12], 'B': [1, 1]}, index=[1, 2]) + assert_frame_equal( + grouped.filter(lambda x: x['A'].sum() - x['B'].sum() > 10), + expected) + + def test_filter_mixed_df(self): + df = pd.DataFrame({'A': [1, 12, 12, 1], 'B': 'a b c d'.split()}) + grouper = df['A'].apply(lambda x: x % 2) + grouped = df.groupby(grouper) + expected = pd.DataFrame({'A': [12, 12], 'B': ['b', 'c']}, index=[1, 2]) + assert_frame_equal( + grouped.filter(lambda x: x['A'].sum() > 10), expected) + + def test_filter_out_all_groups(self): + s = pd.Series([1, 3, 20, 5, 22, 24, 7]) + grouper = s.apply(lambda x: x % 2) + grouped = s.groupby(grouper) + assert_series_equal(grouped.filter(lambda x: x.mean() > 1000), s[[]]) + df = pd.DataFrame({'A': [1, 12, 12, 1], 'B': 'a b c d'.split()}) + grouper = df['A'].apply(lambda x: x % 2) + grouped = df.groupby(grouper) + assert_frame_equal( + grouped.filter(lambda x: x['A'].sum() > 1000), df.loc[[]]) + + def test_filter_out_no_groups(self): + s = pd.Series([1, 3, 20, 5, 22, 24, 7]) + grouper = s.apply(lambda x: x % 2) + grouped = s.groupby(grouper) + filtered = grouped.filter(lambda x: x.mean() > 0) + assert_series_equal(filtered, s) + df = pd.DataFrame({'A': [1, 12, 12, 1], 'B': 'a b c d'.split()}) + grouper = df['A'].apply(lambda x: x % 2) + grouped = df.groupby(grouper) + filtered = grouped.filter(lambda x: x['A'].mean() > 0) + assert_frame_equal(filtered, df) + + def test_filter_out_all_groups_in_df(self): + # GH12768 + df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 0]}) + res = df.groupby('a') + res = res.filter(lambda x: x['b'].sum() > 5, dropna=False) + expected = pd.DataFrame({'a': [nan] * 3, 'b': [nan] * 3}) + assert_frame_equal(expected, res) + + df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 0]}) + res = df.groupby('a') + res = res.filter(lambda x: x['b'].sum() > 5, dropna=True) + expected = pd.DataFrame({'a': [], 'b': []}, dtype="int64") + assert_frame_equal(expected, res) + + def test_filter_condition_raises(self): + def raise_if_sum_is_zero(x): + if x.sum() == 0: + raise ValueError + else: + return x.sum() > 0 + + s = pd.Series([-1, 0, 1, 2]) + grouper = s.apply(lambda x: x % 2) + grouped = s.groupby(grouper) + pytest.raises(TypeError, + lambda: grouped.filter(raise_if_sum_is_zero)) + + def test_filter_with_axis_in_groupby(self): + # issue 11041 + index = pd.MultiIndex.from_product([range(10), [0, 1]]) + data = pd.DataFrame( + np.arange(100).reshape(-1, 20), columns=index, dtype='int64') + result = data.groupby(level=0, + axis=1).filter(lambda x: x.iloc[0, 0] > 10) + expected = data.iloc[:, 12:20] + assert_frame_equal(result, expected) + + def test_filter_bad_shapes(self): + df = DataFrame({'A': np.arange(8), + 'B': list('aabbbbcc'), + 'C': np.arange(8)}) + s = df['B'] + g_df = df.groupby('B') + g_s = s.groupby(s) + + f = lambda x: x + pytest.raises(TypeError, lambda: g_df.filter(f)) + pytest.raises(TypeError, lambda: g_s.filter(f)) + + f = lambda x: x == 1 + pytest.raises(TypeError, lambda: g_df.filter(f)) + pytest.raises(TypeError, lambda: g_s.filter(f)) + + f = lambda x: np.outer(x, x) + pytest.raises(TypeError, lambda: g_df.filter(f)) + 
pytest.raises(TypeError, lambda: g_s.filter(f)) + + def test_filter_nan_is_false(self): + df = DataFrame({'A': np.arange(8), + 'B': list('aabbbbcc'), + 'C': np.arange(8)}) + s = df['B'] + g_df = df.groupby(df['B']) + g_s = s.groupby(s) + + f = lambda x: np.nan + assert_frame_equal(g_df.filter(f), df.loc[[]]) + assert_series_equal(g_s.filter(f), s[[]]) + + def test_filter_against_workaround(self): + np.random.seed(0) + # Series of ints + s = Series(np.random.randint(0, 100, 1000)) + grouper = s.apply(lambda x: np.round(x, -1)) + grouped = s.groupby(grouper) + f = lambda x: x.mean() > 10 + + old_way = s[grouped.transform(f).astype('bool')] + new_way = grouped.filter(f) + assert_series_equal(new_way.sort_values(), old_way.sort_values()) + + # Series of floats + s = 100 * Series(np.random.random(1000)) + grouper = s.apply(lambda x: np.round(x, -1)) + grouped = s.groupby(grouper) + f = lambda x: x.mean() > 10 + old_way = s[grouped.transform(f).astype('bool')] + new_way = grouped.filter(f) + assert_series_equal(new_way.sort_values(), old_way.sort_values()) + + # Set up DataFrame of ints, floats, strings. + from string import ascii_lowercase + letters = np.array(list(ascii_lowercase)) + N = 1000 + random_letters = letters.take(np.random.randint(0, 26, N)) + df = DataFrame({'ints': Series(np.random.randint(0, 100, N)), + 'floats': N / 10 * Series(np.random.random(N)), + 'letters': Series(random_letters)}) + + # Group by ints; filter on floats. + grouped = df.groupby('ints') + old_way = df[grouped.floats. + transform(lambda x: x.mean() > N / 20).astype('bool')] + new_way = grouped.filter(lambda x: x['floats'].mean() > N / 20) + assert_frame_equal(new_way, old_way) + + # Group by floats (rounded); filter on strings. + grouper = df.floats.apply(lambda x: np.round(x, -1)) + grouped = df.groupby(grouper) + old_way = df[grouped.letters. + transform(lambda x: len(x) < N / 10).astype('bool')] + new_way = grouped.filter(lambda x: len(x.letters) < N / 10) + assert_frame_equal(new_way, old_way) + + # Group by strings; filter on ints. + grouped = df.groupby('letters') + old_way = df[grouped.ints. + transform(lambda x: x.mean() > N / 20).astype('bool')] + new_way = grouped.filter(lambda x: x['ints'].mean() > N / 20) + assert_frame_equal(new_way, old_way) + + def test_filter_using_len(self): + # BUG GH4447 + df = DataFrame({'A': np.arange(8), + 'B': list('aabbbbcc'), + 'C': np.arange(8)}) + grouped = df.groupby('B') + actual = grouped.filter(lambda x: len(x) > 2) + expected = DataFrame( + {'A': np.arange(2, 6), + 'B': list('bbbb'), + 'C': np.arange(2, 6)}, index=np.arange(2, 6)) + assert_frame_equal(actual, expected) + + actual = grouped.filter(lambda x: len(x) > 4) + expected = df.loc[[]] + assert_frame_equal(actual, expected) + + # Series have always worked properly, but we'll test anyway. + s = df['B'] + grouped = s.groupby(s) + actual = grouped.filter(lambda x: len(x) > 2) + expected = Series(4 * ['b'], index=np.arange(2, 6), name='B') + assert_series_equal(actual, expected) + + actual = grouped.filter(lambda x: len(x) > 4) + expected = s[[]] + assert_series_equal(actual, expected) + + def test_filter_maintains_ordering(self): + # Simple case: index is sequential. 
#4621 + df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3], + 'tag': [23, 45, 62, 24, 45, 34, 25, 62]}) + s = df['pid'] + grouped = df.groupby('tag') + actual = grouped.filter(lambda x: len(x) > 1) + expected = df.iloc[[1, 2, 4, 7]] + assert_frame_equal(actual, expected) + + grouped = s.groupby(df['tag']) + actual = grouped.filter(lambda x: len(x) > 1) + expected = s.iloc[[1, 2, 4, 7]] + assert_series_equal(actual, expected) + + # Now index is sequentially decreasing. + df.index = np.arange(len(df) - 1, -1, -1) + s = df['pid'] + grouped = df.groupby('tag') + actual = grouped.filter(lambda x: len(x) > 1) + expected = df.iloc[[1, 2, 4, 7]] + assert_frame_equal(actual, expected) + + grouped = s.groupby(df['tag']) + actual = grouped.filter(lambda x: len(x) > 1) + expected = s.iloc[[1, 2, 4, 7]] + assert_series_equal(actual, expected) + + # Index is shuffled. + SHUFFLED = [4, 6, 7, 2, 1, 0, 5, 3] + df.index = df.index[SHUFFLED] + s = df['pid'] + grouped = df.groupby('tag') + actual = grouped.filter(lambda x: len(x) > 1) + expected = df.iloc[[1, 2, 4, 7]] + assert_frame_equal(actual, expected) + + grouped = s.groupby(df['tag']) + actual = grouped.filter(lambda x: len(x) > 1) + expected = s.iloc[[1, 2, 4, 7]] + assert_series_equal(actual, expected) + + def test_filter_multiple_timestamp(self): + # GH 10114 + df = DataFrame({'A': np.arange(5, dtype='int64'), + 'B': ['foo', 'bar', 'foo', 'bar', 'bar'], + 'C': Timestamp('20130101')}) + + grouped = df.groupby(['B', 'C']) + + result = grouped['A'].filter(lambda x: True) + assert_series_equal(df['A'], result) + + result = grouped['A'].transform(len) + expected = Series([2, 3, 2, 3, 3], name='A') + assert_series_equal(result, expected) + + result = grouped.filter(lambda x: True) + assert_frame_equal(df, result) + + result = grouped.transform('sum') + expected = DataFrame({'A': [2, 8, 2, 8, 8]}) + assert_frame_equal(result, expected) + + result = grouped.transform(len) + expected = DataFrame({'A': [2, 3, 2, 3, 3]}) + assert_frame_equal(result, expected) + + def test_filter_and_transform_with_non_unique_int_index(self): + # GH4620 + index = [1, 1, 1, 2, 1, 1, 0, 1] + df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3], + 'tag': [23, 45, 62, 24, 45, 34, 25, 62]}, index=index) + grouped_df = df.groupby('tag') + ser = df['pid'] + grouped_ser = ser.groupby(df['tag']) + expected_indexes = [1, 2, 4, 7] + + # Filter DataFrame + actual = grouped_df.filter(lambda x: len(x) > 1) + expected = df.iloc[expected_indexes] + assert_frame_equal(actual, expected) + + actual = grouped_df.filter(lambda x: len(x) > 1, dropna=False) + expected = df.copy() + expected.iloc[[0, 3, 5, 6]] = np.nan + assert_frame_equal(actual, expected) + + # Filter Series + actual = grouped_ser.filter(lambda x: len(x) > 1) + expected = ser.take(expected_indexes) + assert_series_equal(actual, expected) + + actual = grouped_ser.filter(lambda x: len(x) > 1, dropna=False) + NA = np.nan + expected = Series([NA, 1, 1, NA, 2, NA, NA, 3], index, name='pid') + # ^ made manually because this can get confusing! 
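+ # (dropna=False keeps every position, so rows from singleton groups stay in place as NaN instead of being filtered out.)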
+ assert_series_equal(actual, expected) + + # Transform Series + actual = grouped_ser.transform(len) + expected = Series([1, 2, 2, 1, 2, 1, 1, 2], index, name='pid') + assert_series_equal(actual, expected) + + # Transform (a column from) DataFrameGroupBy + actual = grouped_df.pid.transform(len) + assert_series_equal(actual, expected) + + def test_filter_and_transform_with_multiple_non_unique_int_index(self): + # GH4620 + index = [1, 1, 1, 2, 0, 0, 0, 1] + df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3], + 'tag': [23, 45, 62, 24, 45, 34, 25, 62]}, index=index) + grouped_df = df.groupby('tag') + ser = df['pid'] + grouped_ser = ser.groupby(df['tag']) + expected_indexes = [1, 2, 4, 7] + + # Filter DataFrame + actual = grouped_df.filter(lambda x: len(x) > 1) + expected = df.iloc[expected_indexes] + assert_frame_equal(actual, expected) + + actual = grouped_df.filter(lambda x: len(x) > 1, dropna=False) + expected = df.copy() + expected.iloc[[0, 3, 5, 6]] = np.nan + assert_frame_equal(actual, expected) + + # Filter Series + actual = grouped_ser.filter(lambda x: len(x) > 1) + expected = ser.take(expected_indexes) + assert_series_equal(actual, expected) + + actual = grouped_ser.filter(lambda x: len(x) > 1, dropna=False) + NA = np.nan + expected = Series([NA, 1, 1, NA, 2, NA, NA, 3], index, name='pid') + # ^ made manually because this can get confusing! + assert_series_equal(actual, expected) + + # Transform Series + actual = grouped_ser.transform(len) + expected = Series([1, 2, 2, 1, 2, 1, 1, 2], index, name='pid') + assert_series_equal(actual, expected) + + # Transform (a column from) DataFrameGroupBy + actual = grouped_df.pid.transform(len) + assert_series_equal(actual, expected) + + def test_filter_and_transform_with_non_unique_float_index(self): + # GH4620 + index = np.array([1, 1, 1, 2, 1, 1, 0, 1], dtype=float) + df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3], + 'tag': [23, 45, 62, 24, 45, 34, 25, 62]}, index=index) + grouped_df = df.groupby('tag') + ser = df['pid'] + grouped_ser = ser.groupby(df['tag']) + expected_indexes = [1, 2, 4, 7] + + # Filter DataFrame + actual = grouped_df.filter(lambda x: len(x) > 1) + expected = df.iloc[expected_indexes] + assert_frame_equal(actual, expected) + + actual = grouped_df.filter(lambda x: len(x) > 1, dropna=False) + expected = df.copy() + expected.iloc[[0, 3, 5, 6]] = np.nan + assert_frame_equal(actual, expected) + + # Filter Series + actual = grouped_ser.filter(lambda x: len(x) > 1) + expected = ser.take(expected_indexes) + assert_series_equal(actual, expected) + + actual = grouped_ser.filter(lambda x: len(x) > 1, dropna=False) + NA = np.nan + expected = Series([NA, 1, 1, NA, 2, NA, NA, 3], index, name='pid') + # ^ made manually because this can get confusing! 
+ assert_series_equal(actual, expected) + + # Transform Series + actual = grouped_ser.transform(len) + expected = Series([1, 2, 2, 1, 2, 1, 1, 2], index, name='pid') + assert_series_equal(actual, expected) + + # Transform (a column from) DataFrameGroupBy + actual = grouped_df.pid.transform(len) + assert_series_equal(actual, expected) + + def test_filter_and_transform_with_non_unique_timestamp_index(self): + # GH4620 + t0 = Timestamp('2013-09-30 00:05:00') + t1 = Timestamp('2013-10-30 00:05:00') + t2 = Timestamp('2013-11-30 00:05:00') + index = [t1, t1, t1, t2, t1, t1, t0, t1] + df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3], + 'tag': [23, 45, 62, 24, 45, 34, 25, 62]}, index=index) + grouped_df = df.groupby('tag') + ser = df['pid'] + grouped_ser = ser.groupby(df['tag']) + expected_indexes = [1, 2, 4, 7] + + # Filter DataFrame + actual = grouped_df.filter(lambda x: len(x) > 1) + expected = df.iloc[expected_indexes] + assert_frame_equal(actual, expected) + + actual = grouped_df.filter(lambda x: len(x) > 1, dropna=False) + expected = df.copy() + expected.iloc[[0, 3, 5, 6]] = np.nan + assert_frame_equal(actual, expected) + + # Filter Series + actual = grouped_ser.filter(lambda x: len(x) > 1) + expected = ser.take(expected_indexes) + assert_series_equal(actual, expected) + + actual = grouped_ser.filter(lambda x: len(x) > 1, dropna=False) + NA = np.nan + expected = Series([NA, 1, 1, NA, 2, NA, NA, 3], index, name='pid') + # ^ made manually because this can get confusing! + assert_series_equal(actual, expected) + + # Transform Series + actual = grouped_ser.transform(len) + expected = Series([1, 2, 2, 1, 2, 1, 1, 2], index, name='pid') + assert_series_equal(actual, expected) + + # Transform (a column from) DataFrameGroupBy + actual = grouped_df.pid.transform(len) + assert_series_equal(actual, expected) + + def test_filter_and_transform_with_non_unique_string_index(self): + # GH4620 + index = list('bbbcbbab') + df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3], + 'tag': [23, 45, 62, 24, 45, 34, 25, 62]}, index=index) + grouped_df = df.groupby('tag') + ser = df['pid'] + grouped_ser = ser.groupby(df['tag']) + expected_indexes = [1, 2, 4, 7] + + # Filter DataFrame + actual = grouped_df.filter(lambda x: len(x) > 1) + expected = df.iloc[expected_indexes] + assert_frame_equal(actual, expected) + + actual = grouped_df.filter(lambda x: len(x) > 1, dropna=False) + expected = df.copy() + expected.iloc[[0, 3, 5, 6]] = np.nan + assert_frame_equal(actual, expected) + + # Filter Series + actual = grouped_ser.filter(lambda x: len(x) > 1) + expected = ser.take(expected_indexes) + assert_series_equal(actual, expected) + + actual = grouped_ser.filter(lambda x: len(x) > 1, dropna=False) + NA = np.nan + expected = Series([NA, 1, 1, NA, 2, NA, NA, 3], index, name='pid') + # ^ made manually because this can get confusing! + assert_series_equal(actual, expected) + + # Transform Series + actual = grouped_ser.transform(len) + expected = Series([1, 2, 2, 1, 2, 1, 1, 2], index, name='pid') + assert_series_equal(actual, expected) + + # Transform (a column from) DataFrameGroupBy + actual = grouped_df.pid.transform(len) + assert_series_equal(actual, expected) + + def test_filter_has_access_to_grouped_cols(self): + df = DataFrame([[1, 2], [1, 3], [5, 6]], columns=['A', 'B']) + g = df.groupby('A') + # previously the filter function didn't have access to the grouped col A 
+ filt = g.filter(lambda x: x['A'].sum() == 2) + assert_frame_equal(filt, df.iloc[[0, 1]]) + + def test_filter_enforces_scalarness(self): + df = pd.DataFrame([ + ['best', 'a', 'x'], + ['worst', 'b', 'y'], + ['best', 'c', 'x'], + ['best', 'd', 'y'], + ['worst', 'd', 'y'], + ['worst', 'd', 'y'], + ['best', 'd', 'z'], + ], columns=['a', 'b', 'c']) + with tm.assert_raises_regex(TypeError, + 'filter function returned a.*'): + df.groupby('c').filter(lambda g: g['a'] == 'best') + + def test_filter_non_bool_raises(self): + df = pd.DataFrame([ + ['best', 'a', 1], + ['worst', 'b', 1], + ['best', 'c', 1], + ['best', 'd', 1], + ['worst', 'd', 1], + ['worst', 'd', 1], + ['best', 'd', 1], + ], columns=['a', 'b', 'c']) + with tm.assert_raises_regex(TypeError, + 'filter function returned a.*'): + df.groupby('a').filter(lambda g: g.c.mean()) + + def test_filter_dropna_with_empty_groups(self): + # GH 10780 + data = pd.Series(np.random.rand(9), index=np.repeat([1, 2, 3], 3)) + grouped = data.groupby(level=0) + result_false = grouped.filter(lambda x: x.mean() > 1, dropna=False) + expected_false = pd.Series([np.nan] * 9, + index=np.repeat([1, 2, 3], 3)) + tm.assert_series_equal(result_false, expected_false) + + result_true = grouped.filter(lambda x: x.mean() > 1, dropna=True) + expected_true = pd.Series(index=pd.Index([], dtype=int)) + tm.assert_series_equal(result_true, expected_true) + + +def assert_fp_equal(a, b): + assert (np.abs(a - b) < 1e-12).all() + + +def _check_groupby(df, result, keys, field, f=lambda x: x.sum()): + tups = lmap(tuple, df[keys].values) + tups = com._asarray_tuplesafe(tups) + expected = f(df.groupby(tups)[field]) + for k, v in compat.iteritems(expected): + assert (result[k] == v) diff --git a/pandas/tests/groupby/test_groupby.py b/pandas/tests/groupby/test_groupby.py new file mode 100644 index 0000000000000..88afa51e46b6c --- /dev/null +++ b/pandas/tests/groupby/test_groupby.py @@ -0,0 +1,3954 @@ +# -*- coding: utf-8 -*- +from __future__ import print_function + +import pytest + +from warnings import catch_warnings +from string import ascii_lowercase +from datetime import datetime +from numpy import nan + +from pandas import (date_range, bdate_range, Timestamp, + Index, MultiIndex, DataFrame, Series, + concat, Panel, DatetimeIndex) +from pandas.errors import UnsupportedFunctionCall, PerformanceWarning +from pandas.util.testing import (assert_panel_equal, assert_frame_equal, + assert_series_equal, assert_almost_equal, + assert_index_equal) +from pandas.compat import (range, long, lrange, StringIO, lmap, lzip, map, zip, + builtins, OrderedDict, product as cart_product) +from pandas import compat +from collections import defaultdict +import pandas.core.common as com +import numpy as np + +import pandas.core.nanops as nanops +import pandas.util.testing as tm +import pandas as pd +from .common import MixIn + + +class TestGroupBy(MixIn): + + def test_basic(self): + def checkit(dtype): + data = Series(np.arange(9) // 3, index=np.arange(9), dtype=dtype) + + index = np.arange(9) + np.random.shuffle(index) + data = data.reindex(index) + + grouped = data.groupby(lambda x: x // 3) + + for k, v in grouped: + assert len(v) == 3 + + agged = grouped.aggregate(np.mean) + assert agged[1] == 1 + + assert_series_equal(agged, grouped.agg(np.mean)) # shorthand + assert_series_equal(agged, grouped.mean()) + assert_series_equal(grouped.agg(np.sum), grouped.sum()) + + expected = grouped.apply(lambda x: x * x.sum()) + transformed = grouped.transform(lambda x: x * x.sum()) + assert transformed[7] == 12 + 
assert_series_equal(transformed, expected) + + value_grouped = data.groupby(data) + assert_series_equal(value_grouped.aggregate(np.mean), agged, + check_index_type=False) + + # complex agg + agged = grouped.aggregate([np.mean, np.std]) + + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + agged = grouped.aggregate({'one': np.mean, 'two': np.std}) + + group_constants = {0: 10, 1: 20, 2: 30} + agged = grouped.agg(lambda x: group_constants[x.name] + x.mean()) + assert agged[1] == 21 + + # corner cases + pytest.raises(Exception, grouped.aggregate, lambda x: x * 2) + + for dtype in ['int64', 'int32', 'float64', 'float32']: + checkit(dtype) + + def test_select_bad_cols(self): + df = DataFrame([[1, 2]], columns=['A', 'B']) + g = df.groupby('A') + pytest.raises(KeyError, g.__getitem__, ['C']) # g[['C']] + + pytest.raises(KeyError, g.__getitem__, ['A', 'C']) # g[['A', 'C']] + with tm.assert_raises_regex(KeyError, '^[^A]+$'): + # A should not be referenced as a bad column... + # will have to rethink regex if you change message! + g[['A', 'C']] + + def test_group_selection_cache(self): + # GH 12839 nth, head, and tail should return same result consistently + df = DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B']) + expected = df.iloc[[0, 2]].set_index('A') + + g = df.groupby('A') + result1 = g.head(n=2) + result2 = g.nth(0) + assert_frame_equal(result1, df) + assert_frame_equal(result2, expected) + + g = df.groupby('A') + result1 = g.tail(n=2) + result2 = g.nth(0) + assert_frame_equal(result1, df) + assert_frame_equal(result2, expected) + + g = df.groupby('A') + result1 = g.nth(0) + result2 = g.head(n=2) + assert_frame_equal(result1, expected) + assert_frame_equal(result2, df) + + g = df.groupby('A') + result1 = g.nth(0) + result2 = g.tail(n=2) + assert_frame_equal(result1, expected) + assert_frame_equal(result2, df) + + def test_grouper_index_types(self): + # related GH5375 + # groupby misbehaving when using a Floatlike index + df = DataFrame(np.arange(10).reshape(5, 2), columns=list('AB')) + for index in [tm.makeFloatIndex, tm.makeStringIndex, + tm.makeUnicodeIndex, tm.makeIntIndex, tm.makeDateIndex, + tm.makePeriodIndex]: + + df.index = index(len(df)) + df.groupby(list('abcde')).apply(lambda x: x) + + df.index = list(reversed(df.index.tolist())) + df.groupby(list('abcde')).apply(lambda x: x) + + def test_grouper_multilevel_freq(self): + + # GH 7885 + # with level and freq specified in a pd.Grouper + from datetime import date, timedelta + d0 = date.today() - timedelta(days=14) + dates = date_range(d0, date.today()) + date_index = pd.MultiIndex.from_product( + [dates, dates], names=['foo', 'bar']) + df = pd.DataFrame(np.random.randint(0, 100, 225), index=date_index) + + # Check string level + expected = df.reset_index().groupby([pd.Grouper( + key='foo', freq='W'), pd.Grouper(key='bar', freq='W')]).sum() + # reset index changes columns dtype to object + expected.columns = pd.Index([0], dtype='int64') + + result = df.groupby([pd.Grouper(level='foo', freq='W'), pd.Grouper( + level='bar', freq='W')]).sum() + assert_frame_equal(result, expected) + + # Check integer level + result = df.groupby([pd.Grouper(level=0, freq='W'), pd.Grouper( + level=1, freq='W')]).sum() + assert_frame_equal(result, expected) + + def test_grouper_creation_bug(self): + + # GH 8795 + df = DataFrame({'A': [0, 0, 1, 1, 2, 2], 'B': [1, 2, 3, 4, 5, 6]}) + g = df.groupby('A') + expected = g.sum() + + g = df.groupby(pd.Grouper(key='A')) + result = g.sum() + assert_frame_equal(result, expected) + + result 
= g.apply(lambda x: x.sum()) + assert_frame_equal(result, expected) + + g = df.groupby(pd.Grouper(key='A', axis=0)) + result = g.sum() + assert_frame_equal(result, expected) + + # GH14334 + # pd.Grouper(key=...) may be passed in a list + df = DataFrame({'A': [0, 0, 0, 1, 1, 1], + 'B': [1, 1, 2, 2, 3, 3], + 'C': [1, 2, 3, 4, 5, 6]}) + # Group by single column + expected = df.groupby('A').sum() + g = df.groupby([pd.Grouper(key='A')]) + result = g.sum() + assert_frame_equal(result, expected) + + # Group by two columns + # using a combination of strings and Grouper objects + expected = df.groupby(['A', 'B']).sum() + + # Group with two Grouper objects + g = df.groupby([pd.Grouper(key='A'), pd.Grouper(key='B')]) + result = g.sum() + assert_frame_equal(result, expected) + + # Group with a string and a Grouper object + g = df.groupby(['A', pd.Grouper(key='B')]) + result = g.sum() + assert_frame_equal(result, expected) + + # Group with a Grouper object and a string + g = df.groupby([pd.Grouper(key='A'), 'B']) + result = g.sum() + assert_frame_equal(result, expected) + + # GH8866 + s = Series(np.arange(8, dtype='int64'), + index=pd.MultiIndex.from_product( + [list('ab'), range(2), + date_range('20130101', periods=2)], + names=['one', 'two', 'three'])) + result = s.groupby(pd.Grouper(level='three', freq='M')).sum() + expected = Series([28], index=Index( + [Timestamp('2013-01-31')], freq='M', name='three')) + assert_series_equal(result, expected) + + # just specifying a level breaks + result = s.groupby(pd.Grouper(level='one')).sum() + expected = s.groupby(level='one').sum() + assert_series_equal(result, expected) + + def test_grouper_column_and_index(self): + # GH 14327 + + # Grouping a multi-index frame by a column and an index level should + # be equivalent to resetting the index and grouping by two columns + idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 3), + ('b', 1), ('b', 2), ('b', 3)]) + idx.names = ['outer', 'inner'] + df_multi = pd.DataFrame({"A": np.arange(6), + 'B': ['one', 'one', 'two', + 'two', 'one', 'one']}, + index=idx) + result = df_multi.groupby(['B', pd.Grouper(level='inner')]).mean() + expected = df_multi.reset_index().groupby(['B', 'inner']).mean() + assert_frame_equal(result, expected) + + # Test the reverse grouping order + result = df_multi.groupby([pd.Grouper(level='inner'), 'B']).mean() + expected = df_multi.reset_index().groupby(['inner', 'B']).mean() + assert_frame_equal(result, expected) + + # Grouping a single-index frame by a column and the index should + # be equivalent to resetting the index and grouping by two columns + df_single = df_multi.reset_index('outer') + result = df_single.groupby(['B', pd.Grouper(level='inner')]).mean() + expected = df_single.reset_index().groupby(['B', 'inner']).mean() + assert_frame_equal(result, expected) + + # Test the reverse grouping order + result = df_single.groupby([pd.Grouper(level='inner'), 'B']).mean() + expected = df_single.reset_index().groupby(['inner', 'B']).mean() + assert_frame_equal(result, expected) + + def test_grouper_index_level_as_string(self): + # GH 5677, allow strings passed as the `by` parameter to reference + # columns or index levels + + idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 3), + ('b', 1), ('b', 2), ('b', 3)]) + idx.names = ['outer', 'inner'] + df_multi = pd.DataFrame({"A": np.arange(6), + 'B': ['one', 'one', 'two', + 'two', 'one', 'one']}, + index=idx) + + df_single = df_multi.reset_index('outer') + + # Column and Index on MultiIndex + result = df_multi.groupby(['B', 
'inner']).mean() + expected = df_multi.groupby(['B', pd.Grouper(level='inner')]).mean() + assert_frame_equal(result, expected) + + # Index and Column on MultiIndex + result = df_multi.groupby(['inner', 'B']).mean() + expected = df_multi.groupby([pd.Grouper(level='inner'), 'B']).mean() + assert_frame_equal(result, expected) + + # Column and Index on single Index + result = df_single.groupby(['B', 'inner']).mean() + expected = df_single.groupby(['B', pd.Grouper(level='inner')]).mean() + assert_frame_equal(result, expected) + + # Index and Column on single Index + result = df_single.groupby(['inner', 'B']).mean() + expected = df_single.groupby([pd.Grouper(level='inner'), 'B']).mean() + assert_frame_equal(result, expected) + + # Single element list of Index on MultiIndex + result = df_multi.groupby(['inner']).mean() + expected = df_multi.groupby(pd.Grouper(level='inner')).mean() + assert_frame_equal(result, expected) + + # Single element list of Index on single Index + result = df_single.groupby(['inner']).mean() + expected = df_single.groupby(pd.Grouper(level='inner')).mean() + assert_frame_equal(result, expected) + + # Index on MultiIndex + result = df_multi.groupby('inner').mean() + expected = df_multi.groupby(pd.Grouper(level='inner')).mean() + assert_frame_equal(result, expected) + + # Index on single Index + result = df_single.groupby('inner').mean() + expected = df_single.groupby(pd.Grouper(level='inner')).mean() + assert_frame_equal(result, expected) + + def test_grouper_column_index_level_precedence(self): + # GH 5677, when a string passed as the `by` parameter + # matches a column and an index level the column takes + # precedence + + idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 3), + ('b', 1), ('b', 2), ('b', 3)]) + idx.names = ['outer', 'inner'] + df_multi_both = pd.DataFrame({"A": np.arange(6), + 'B': ['one', 'one', 'two', + 'two', 'one', 'one'], + 'inner': [1, 1, 1, 1, 1, 1]}, + index=idx) + + df_single_both = df_multi_both.reset_index('outer') + + # Group MultiIndex by single key + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + result = df_multi_both.groupby('inner').mean() + + expected = df_multi_both.groupby([pd.Grouper(key='inner')]).mean() + assert_frame_equal(result, expected) + not_expected = df_multi_both.groupby(pd.Grouper(level='inner')).mean() + assert not result.index.equals(not_expected.index) + + # Group single Index by single key + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + result = df_single_both.groupby('inner').mean() + + expected = df_single_both.groupby([pd.Grouper(key='inner')]).mean() + assert_frame_equal(result, expected) + not_expected = df_single_both.groupby(pd.Grouper(level='inner')).mean() + assert not result.index.equals(not_expected.index) + + # Group MultiIndex by single key list + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + result = df_multi_both.groupby(['inner']).mean() + + expected = df_multi_both.groupby([pd.Grouper(key='inner')]).mean() + assert_frame_equal(result, expected) + not_expected = df_multi_both.groupby(pd.Grouper(level='inner')).mean() + assert not result.index.equals(not_expected.index) + + # Group single Index by single key list + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + result = df_single_both.groupby(['inner']).mean() + + expected = df_single_both.groupby([pd.Grouper(key='inner')]).mean() + assert_frame_equal(result, expected) + not_expected = 
df_single_both.groupby(pd.Grouper(level='inner')).mean() + assert not result.index.equals(not_expected.index) + + # Group MultiIndex by two keys (1) + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + result = df_multi_both.groupby(['B', 'inner']).mean() + + expected = df_multi_both.groupby(['B', + pd.Grouper(key='inner')]).mean() + assert_frame_equal(result, expected) + not_expected = df_multi_both.groupby(['B', + pd.Grouper(level='inner') + ]).mean() + assert not result.index.equals(not_expected.index) + + # Group MultiIndex by two keys (2) + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + result = df_multi_both.groupby(['inner', 'B']).mean() + + expected = df_multi_both.groupby([pd.Grouper(key='inner'), + 'B']).mean() + assert_frame_equal(result, expected) + not_expected = df_multi_both.groupby([pd.Grouper(level='inner'), + 'B']).mean() + assert not result.index.equals(not_expected.index) + + # Group single Index by two keys (1) + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + result = df_single_both.groupby(['B', 'inner']).mean() + + expected = df_single_both.groupby(['B', + pd.Grouper(key='inner')]).mean() + assert_frame_equal(result, expected) + not_expected = df_single_both.groupby(['B', + pd.Grouper(level='inner') + ]).mean() + assert not result.index.equals(not_expected.index) + + # Group single Index by two keys (2) + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + result = df_single_both.groupby(['inner', 'B']).mean() + + expected = df_single_both.groupby([pd.Grouper(key='inner'), + 'B']).mean() + assert_frame_equal(result, expected) + not_expected = df_single_both.groupby([pd.Grouper(level='inner'), + 'B']).mean() + assert not result.index.equals(not_expected.index) + + def test_grouper_getting_correct_binner(self): + + # GH 10063 + # using a non-time-based grouper and a time-based grouper + # and specifying levels + df = DataFrame({'A': 1}, index=pd.MultiIndex.from_product( + [list('ab'), date_range('20130101', periods=80)], names=['one', + 'two'])) + result = df.groupby([pd.Grouper(level='one'), pd.Grouper( + level='two', freq='M')]).sum() + expected = DataFrame({'A': [31, 28, 21, 31, 28, 21]}, + index=MultiIndex.from_product( + [list('ab'), + date_range('20130101', freq='M', periods=3)], + names=['one', 'two'])) + assert_frame_equal(result, expected) + + def test_grouper_iter(self): + assert sorted(self.df.groupby('A').grouper) == ['bar', 'foo'] + + def test_empty_groups(self): + # see gh-1048 + pytest.raises(ValueError, self.df.groupby, []) + + def test_groupby_grouper(self): + grouped = self.df.groupby('A') + + result = self.df.groupby(grouped.grouper).mean() + expected = grouped.mean() + tm.assert_frame_equal(result, expected) + + def test_groupby_duplicated_column_errormsg(self): + # GH7511 + df = DataFrame(columns=['A', 'B', 'A', 'C'], + data=[range(4), range(2, 6), range(0, 8, 2)]) + + pytest.raises(ValueError, df.groupby, 'A') + pytest.raises(ValueError, df.groupby, ['A', 'B']) + + grouped = df.groupby('B') + c = grouped.count() + assert c.columns.nlevels == 1 + assert c.columns.size == 3 + + def test_groupby_dict_mapping(self): + # GH #679 + from pandas import Series + s = Series({'T1': 5}) + result = s.groupby({'T1': 'T2'}).agg(sum) + expected = s.groupby(['T2']).agg(sum) + assert_series_equal(result, expected) + + s = Series([1., 2., 3., 4.], index=list('abcd')) + mapping = {'a': 0, 'b': 0, 'c': 1, 'd': 1} + + result = s.groupby(mapping).mean() + result2 = 
s.groupby(mapping).agg(np.mean) + expected = s.groupby([0, 0, 1, 1]).mean() + expected2 = s.groupby([0, 0, 1, 1]).mean() + assert_series_equal(result, expected) + assert_series_equal(result, result2) + assert_series_equal(result, expected2) + + def test_groupby_grouper_f_sanity_checked(self): + dates = date_range('01-Jan-2013', periods=12, freq='MS') + ts = Series(np.random.randn(12), index=dates) + + # GH3035 + # index.map is used to apply grouper to the index + # if it fails on the elements, map tries it on the entire index as + # a sequence. That can yield invalid results that cause trouble + # down the line. + # the surprise comes from using key[0:6] rather than str(key)[0:6] + # when the elements are Timestamp. + # the result is Index[0:6], very confusing. + + pytest.raises(AssertionError, ts.groupby, lambda key: key[0:6]) + + def test_groupby_nonobject_dtype(self): + key = self.mframe.index.labels[0] + grouped = self.mframe.groupby(key) + result = grouped.sum() + + expected = self.mframe.groupby(key.astype('O')).sum() + assert_frame_equal(result, expected) + + # GH 3911, mixed frame non-conversion + df = self.df_mixed_floats.copy() + df['value'] = lrange(len(df)) + + def max_value(group): + return group.loc[group['value'].idxmax()] + + applied = df.groupby('A').apply(max_value) + result = applied.get_dtype_counts().sort_values() + expected = Series({'object': 2, + 'float64': 2, + 'int64': 1}).sort_values() + assert_series_equal(result, expected) + + def test_groupby_return_type(self): + + # GH2893, return a reduced type + df1 = DataFrame( + [{"val1": 1, "val2": 20}, + {"val1": 1, "val2": 19}, + {"val1": 2, "val2": 27}, + {"val1": 2, "val2": 12} + ]) + + def func(dataf): + return dataf["val2"] - dataf["val2"].mean() + + result = df1.groupby("val1", squeeze=True).apply(func) + assert isinstance(result, Series) + + df2 = DataFrame( + [{"val1": 1, "val2": 20}, + {"val1": 1, "val2": 19}, + {"val1": 1, "val2": 27}, + {"val1": 1, "val2": 12} + ]) + + def func(dataf): + return dataf["val2"] - dataf["val2"].mean() + + result = df2.groupby("val1", squeeze=True).apply(func) + assert isinstance(result, Series) + + # GH3596, return a consistent type (regression in 0.11 from 0.10.1) + df = DataFrame([[1, 1], [1, 1]], columns=['X', 'Y']) + result = df.groupby('X', squeeze=False).count() + assert isinstance(result, DataFrame) + + # GH5592 + # inconsistent return type + df = DataFrame(dict(A=['Tiger', 'Tiger', 'Tiger', 'Lamb', 'Lamb', + 'Pony', 'Pony'], B=Series( + np.arange(7), dtype='int64'), C=date_range( + '20130101', periods=7))) + + def f(grp): + return grp.iloc[0] + + expected = df.groupby('A').first()[['B']] + result = df.groupby('A').apply(f)[['B']] + assert_frame_equal(result, expected) + + def f(grp): + if grp.name == 'Tiger': + return None + return grp.iloc[0] + + result = df.groupby('A').apply(f)[['B']] + e = expected.copy() + e.loc['Tiger'] = np.nan + assert_frame_equal(result, e) + + def f(grp): + if grp.name == 'Pony': + return None + return grp.iloc[0] + + result = df.groupby('A').apply(f)[['B']] + e = expected.copy() + e.loc['Pony'] = np.nan + assert_frame_equal(result, e) + + # GH 5592 revisited, with datetimes + def f(grp): + if grp.name == 'Pony': + return None + return grp.iloc[0] + + result = df.groupby('A').apply(f)[['C']] + e = df.groupby('A').first()[['C']] + e.loc['Pony'] = pd.NaT + assert_frame_equal(result, e) + + # scalar outputs + def f(grp): + if grp.name == 'Pony': + return None + return grp.iloc[0].loc['C'] + + result = df.groupby('A').apply(f) + e = 
df.groupby('A').first()['C'].copy() + e.loc['Pony'] = np.nan + e.name = None + assert_series_equal(result, e) + + def test_get_group(self): + with catch_warnings(record=True): + wp = tm.makePanel() + grouped = wp.groupby(lambda x: x.month, axis='major') + + gp = grouped.get_group(1) + expected = wp.reindex( + major=[x for x in wp.major_axis if x.month == 1]) + assert_panel_equal(gp, expected) + + # GH 5267 + # be datelike friendly + df = DataFrame({'DATE': pd.to_datetime( + ['10-Oct-2013', '10-Oct-2013', '10-Oct-2013', '11-Oct-2013', + '11-Oct-2013', '11-Oct-2013']), + 'label': ['foo', 'foo', 'bar', 'foo', 'foo', 'bar'], + 'VAL': [1, 2, 3, 4, 5, 6]}) + + g = df.groupby('DATE') + key = list(g.groups)[0] + result1 = g.get_group(key) + result2 = g.get_group(Timestamp(key).to_pydatetime()) + result3 = g.get_group(str(Timestamp(key))) + assert_frame_equal(result1, result2) + assert_frame_equal(result1, result3) + + g = df.groupby(['DATE', 'label']) + + key = list(g.groups)[0] + result1 = g.get_group(key) + result2 = g.get_group((Timestamp(key[0]).to_pydatetime(), key[1])) + result3 = g.get_group((str(Timestamp(key[0])), key[1])) + assert_frame_equal(result1, result2) + assert_frame_equal(result1, result3) + + # must pass a same-length tuple with multiple keys + pytest.raises(ValueError, lambda: g.get_group('foo')) + pytest.raises(ValueError, lambda: g.get_group(('foo'))) + pytest.raises(ValueError, + lambda: g.get_group(('foo', 'bar', 'baz'))) + + def test_get_group_empty_bins(self): + + d = pd.DataFrame([3, 1, 7, 6]) + bins = [0, 5, 10, 15] + g = d.groupby(pd.cut(d[0], bins)) + + # TODO: should prob allow a str of Interval work as well + # IOW '(0, 5]' + result = g.get_group(pd.Interval(0, 5)) + expected = DataFrame([3, 1], index=[0, 1]) + assert_frame_equal(result, expected) + + pytest.raises(KeyError, lambda: g.get_group(pd.Interval(10, 15))) + + def test_get_group_grouped_by_tuple(self): + # GH 8121 + df = DataFrame([[(1, ), (1, 2), (1, ), (1, 2)]], index=['ids']).T + gr = df.groupby('ids') + expected = DataFrame({'ids': [(1, ), (1, )]}, index=[0, 2]) + result = gr.get_group((1, )) + assert_frame_equal(result, expected) + + dt = pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-01', + '2010-01-02']) + df = DataFrame({'ids': [(x, ) for x in dt]}) + gr = df.groupby('ids') + result = gr.get_group(('2010-01-01', )) + expected = DataFrame({'ids': [(dt[0], ), (dt[0], )]}, index=[0, 2]) + assert_frame_equal(result, expected) + + def test_grouping_error_on_multidim_input(self): + from pandas.core.groupby import Grouping + pytest.raises(ValueError, + Grouping, self.df.index, self.df[['A', 'A']]) + + def test_apply_describe_bug(self): + grouped = self.mframe.groupby(level='first') + grouped.describe() # it works! 
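Two conveniences pinned down by the `get_group` tests above are worth spelling out: datelike keys can be passed as a `Timestamp`, a `datetime`, or a string, and interval-binned groups are looked up with the `pd.Interval` itself. A minimal sketch of the interval case (assuming pandas >= 0.20, where `pd.Interval` exists; the frame here is illustrative, not from the test suite):

```python
import pandas as pd

d = pd.DataFrame([3, 1, 7, 6])
g = d.groupby(pd.cut(d[0], [0, 5, 10, 15]))

g.get_group(pd.Interval(0, 5))      # rows 0 and 1 (values 3 and 1)
# g.get_group(pd.Interval(10, 15))  # empty bin -> KeyError
```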
+ + def test_apply_issues(self): + # GH 5788 + + s = """2011.05.16,00:00,1.40893 +2011.05.16,01:00,1.40760 +2011.05.16,02:00,1.40750 +2011.05.16,03:00,1.40649 +2011.05.17,02:00,1.40893 +2011.05.17,03:00,1.40760 +2011.05.17,04:00,1.40750 +2011.05.17,05:00,1.40649 +2011.05.18,02:00,1.40893 +2011.05.18,03:00,1.40760 +2011.05.18,04:00,1.40750 +2011.05.18,05:00,1.40649""" + + df = pd.read_csv( + StringIO(s), header=None, names=['date', 'time', 'value'], + parse_dates=[['date', 'time']]) + df = df.set_index('date_time') + + expected = df.groupby(df.index.date).idxmax() + result = df.groupby(df.index.date).apply(lambda x: x.idxmax()) + assert_frame_equal(result, expected) + + # GH 5789 + # don't auto coerce dates + df = pd.read_csv( + StringIO(s), header=None, names=['date', 'time', 'value']) + exp_idx = pd.Index( + ['2011.05.16', '2011.05.17', '2011.05.18' + ], dtype=object, name='date') + expected = Series(['00:00', '02:00', '02:00'], index=exp_idx) + result = df.groupby('date').apply( + lambda x: x['time'][x['value'].idxmax()]) + assert_series_equal(result, expected) + + def test_time_field_bug(self): + # Test a fix for the following error related to GH issue 11324. When + # non-key fields in a group-by dataframe contained time-based fields + # that were not returned by the apply function, an exception would be + # raised. + + df = pd.DataFrame({'a': 1, 'b': [datetime.now() for nn in range(10)]}) + + def func_with_no_date(batch): + return pd.Series({'c': 2}) + + def func_with_date(batch): + return pd.Series({'c': 2, 'b': datetime(2015, 1, 1)}) + + dfg_no_conversion = df.groupby(by=['a']).apply(func_with_no_date) + dfg_no_conversion_expected = pd.DataFrame({'c': 2}, index=[1]) + dfg_no_conversion_expected.index.name = 'a' + + dfg_conversion = df.groupby(by=['a']).apply(func_with_date) + dfg_conversion_expected = pd.DataFrame( + {'b': datetime(2015, 1, 1), + 'c': 2}, index=[1]) + dfg_conversion_expected.index.name = 'a' + + tm.assert_frame_equal(dfg_no_conversion, dfg_no_conversion_expected) + tm.assert_frame_equal(dfg_conversion, dfg_conversion_expected) + + def test_len(self): + df = tm.makeTimeDataFrame() + grouped = df.groupby([lambda x: x.year, lambda x: x.month, + lambda x: x.day]) + assert len(grouped) == len(df) + + grouped = df.groupby([lambda x: x.year, lambda x: x.month]) + expected = len(set([(x.year, x.month) for x in df.index])) + assert len(grouped) == expected + + # issue 11016 + df = pd.DataFrame(dict(a=[np.nan] * 3, b=[1, 2, 3])) + assert len(df.groupby(('a'))) == 0 + assert len(df.groupby(('b'))) == 3 + assert len(df.groupby(('a', 'b'))) == 3 + + def test_groups(self): + grouped = self.df.groupby(['A']) + groups = grouped.groups + assert groups is grouped.groups # caching works + + for k, v in compat.iteritems(grouped.groups): + assert (self.df.loc[v]['A'] == k).all() + + grouped = self.df.groupby(['A', 'B']) + groups = grouped.groups + assert groups is grouped.groups # caching works + + for k, v in compat.iteritems(grouped.groups): + assert (self.df.loc[v]['A'] == k[0]).all() + assert (self.df.loc[v]['B'] == k[1]).all() + + def test_basic_regression(self): + # regression + T = [1.0 * x for x in lrange(1, 10) * 10][:1095] + result = Series(T, lrange(0, len(T))) + + groupings = np.random.random((1100, )) + groupings = Series(groupings, lrange(0, len(groupings))) * 10.
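A behavior that `test_len` above leans on, shown as a small sketch (the frame is illustrative): rows whose group key is NaN are dropped, so a groupby over an all-NaN key produces no groups at all.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan] * 3, 'b': [1, 2, 3]})

len(df.groupby('a'))  # 0 -- NaN keys are dropped, no groups form
len(df.groupby('b'))  # 3 -- one group per distinct value
```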
+ + grouped = result.groupby(groupings) + grouped.mean() + + def test_with_na(self): + index = Index(np.arange(10)) + + for dtype in ['float64', 'float32', 'int64', 'int32', 'int16', 'int8']: + values = Series(np.ones(10), index, dtype=dtype) + labels = Series([nan, 'foo', 'bar', 'bar', nan, nan, 'bar', + 'bar', nan, 'foo'], index=index) + + # this SHOULD be an int + grouped = values.groupby(labels) + agged = grouped.agg(len) + expected = Series([4, 2], index=['bar', 'foo']) + + assert_series_equal(agged, expected, check_dtype=False) + + # assert issubclass(agged.dtype.type, np.integer) + + # explicitly return a float from my function + def f(x): + return float(len(x)) + + agged = grouped.agg(f) + expected = Series([4, 2], index=['bar', 'foo']) + + assert_series_equal(agged, expected, check_dtype=False) + assert issubclass(agged.dtype.type, np.dtype(dtype).type) + + def test_indices_concatenation_order(self): + + # GH 2808 + + def f1(x): + y = x[(x.b % 2) == 1] ** 2 + if y.empty: + multiindex = MultiIndex(levels=[[]] * 2, labels=[[]] * 2, + names=['b', 'c']) + res = DataFrame(None, columns=['a'], index=multiindex) + return res + else: + y = y.set_index(['b', 'c']) + return y + + def f2(x): + y = x[(x.b % 2) == 1] ** 2 + if y.empty: + return DataFrame() + else: + y = y.set_index(['b', 'c']) + return y + + def f3(x): + y = x[(x.b % 2) == 1] ** 2 + if y.empty: + multiindex = MultiIndex(levels=[[]] * 2, labels=[[]] * 2, + names=['foo', 'bar']) + res = DataFrame(None, columns=['a', 'b'], index=multiindex) + return res + else: + return y + + df = DataFrame({'a': [1, 2, 2, 2], 'b': lrange(4), 'c': lrange(5, 9)}) + + df2 = DataFrame({'a': [3, 2, 2, 2], 'b': lrange(4), 'c': lrange(5, 9)}) + + # correct result + result1 = df.groupby('a').apply(f1) + result2 = df2.groupby('a').apply(f1) + assert_frame_equal(result1, result2) + + # should fail (not the same number of levels) + pytest.raises(AssertionError, df.groupby('a').apply, f2) + pytest.raises(AssertionError, df2.groupby('a').apply, f2) + + # should fail (incorrect shape) + pytest.raises(AssertionError, df.groupby('a').apply, f3) + pytest.raises(AssertionError, df2.groupby('a').apply, f3) + + def test_attr_wrapper(self): + grouped = self.ts.groupby(lambda x: x.weekday()) + + result = grouped.std() + expected = grouped.agg(lambda x: np.std(x, ddof=1)) + assert_series_equal(result, expected) + + # this is pretty cool + result = grouped.describe() + expected = {} + for name, gp in grouped: + expected[name] = gp.describe() + expected = DataFrame(expected).T + assert_frame_equal(result, expected) + + # get attribute + result = grouped.dtype + expected = grouped.agg(lambda x: x.dtype) + + # make sure it raises an error + pytest.raises(AttributeError, getattr, grouped, 'foo') + + def test_series_describe_multikey(self): + ts = tm.makeTimeSeries() + grouped = ts.groupby([lambda x: x.year, lambda x: x.month]) + result = grouped.describe() + assert_series_equal(result['mean'], grouped.mean(), check_names=False) + assert_series_equal(result['std'], grouped.std(), check_names=False) + assert_series_equal(result['min'], grouped.min(), check_names=False) + + def test_series_describe_single(self): + ts = tm.makeTimeSeries() + grouped = ts.groupby(lambda x: x.month) + result = grouped.apply(lambda x: x.describe()) + expected = grouped.describe().stack() + assert_series_equal(result, expected) + + def test_series_index_name(self): + grouped = self.df.loc[:, ['C']].groupby(self.df['A']) + result = grouped.agg(lambda x: x.mean()) + assert result.index.name == 'A' + + def
test_frame_describe_multikey(self): + grouped = self.tsframe.groupby([lambda x: x.year, lambda x: x.month]) + result = grouped.describe() + desc_groups = [] + for col in self.tsframe: + group = grouped[col].describe() + group_col = pd.MultiIndex([[col] * len(group.columns), + group.columns], + [[0] * len(group.columns), + range(len(group.columns))]) + group = pd.DataFrame(group.values, + columns=group_col, + index=group.index) + desc_groups.append(group) + expected = pd.concat(desc_groups, axis=1) + tm.assert_frame_equal(result, expected) + + groupedT = self.tsframe.groupby({'A': 0, 'B': 0, + 'C': 1, 'D': 1}, axis=1) + result = groupedT.describe() + expected = self.tsframe.describe().T + expected.index = pd.MultiIndex([[0, 0, 1, 1], expected.index], + [range(4), range(len(expected.index))]) + tm.assert_frame_equal(result, expected) + + def test_frame_describe_tupleindex(self): + + # GH 14848 - regression from 0.19.0 to 0.19.1 + df1 = DataFrame({'x': [1, 2, 3, 4, 5] * 3, + 'y': [10, 20, 30, 40, 50] * 3, + 'z': [100, 200, 300, 400, 500] * 3}) + df1['k'] = [(0, 0, 1), (0, 1, 0), (1, 0, 0)] * 5 + df2 = df1.rename(columns={'k': 'key'}) + pytest.raises(ValueError, lambda: df1.groupby('k').describe()) + pytest.raises(ValueError, lambda: df2.groupby('key').describe()) + + def test_frame_describe_unstacked_format(self): + # GH 4792 + prices = {pd.Timestamp('2011-01-06 10:59:05', tz=None): 24990, + pd.Timestamp('2011-01-06 12:43:33', tz=None): 25499, + pd.Timestamp('2011-01-06 12:54:09', tz=None): 25499} + volumes = {pd.Timestamp('2011-01-06 10:59:05', tz=None): 1500000000, + pd.Timestamp('2011-01-06 12:43:33', tz=None): 5000000000, + pd.Timestamp('2011-01-06 12:54:09', tz=None): 100000000} + df = pd.DataFrame({'PRICE': prices, + 'VOLUME': volumes}) + result = df.groupby('PRICE').VOLUME.describe() + data = [df[df.PRICE == 24990].VOLUME.describe().values.tolist(), + df[df.PRICE == 25499].VOLUME.describe().values.tolist()] + expected = pd.DataFrame(data, + index=pd.Index([24990, 25499], name='PRICE'), + columns=['count', 'mean', 'std', 'min', + '25%', '50%', '75%', 'max']) + tm.assert_frame_equal(result, expected) + + def test_frame_groupby(self): + grouped = self.tsframe.groupby(lambda x: x.weekday()) + + # aggregate + aggregated = grouped.aggregate(np.mean) + assert len(aggregated) == 5 + assert len(aggregated.columns) == 4 + + # by string + tscopy = self.tsframe.copy() + tscopy['weekday'] = [x.weekday() for x in tscopy.index] + stragged = tscopy.groupby('weekday').aggregate(np.mean) + assert_frame_equal(stragged, aggregated, check_names=False) + + # transform + grouped = self.tsframe.head(30).groupby(lambda x: x.weekday()) + transformed = grouped.transform(lambda x: x - x.mean()) + assert len(transformed) == 30 + assert len(transformed.columns) == 4 + + # transform propagate + transformed = grouped.transform(lambda x: x.mean()) + for name, group in grouped: + mean = group.mean() + for idx in group.index: + tm.assert_series_equal(transformed.xs(idx), mean, + check_names=False) + + # iterate + for weekday, group in grouped: + assert group.index[0].weekday() == weekday + + # groups / group_indices + groups = grouped.groups + indices = grouped.indices + + for k, v in compat.iteritems(groups): + samething = self.tsframe.index.take(indices[k]) + assert (samething == v).all() + + def test_grouping_is_iterable(self): + # this code path isn't used anywhere else + # not sure it's useful + grouped = self.tsframe.groupby([lambda x: x.weekday(), lambda x: x.year + ]) + + # test it works + for g in 
grouped.grouper.groupings[0]: + pass + + def test_frame_groupby_columns(self): + mapping = {'A': 0, 'B': 0, 'C': 1, 'D': 1} + grouped = self.tsframe.groupby(mapping, axis=1) + + # aggregate + aggregated = grouped.aggregate(np.mean) + assert len(aggregated) == len(self.tsframe) + assert len(aggregated.columns) == 2 + + # transform + tf = lambda x: x - x.mean() + groupedT = self.tsframe.T.groupby(mapping, axis=0) + assert_frame_equal(groupedT.transform(tf).T, grouped.transform(tf)) + + # iterate + for k, v in grouped: + assert len(v.columns) == 2 + + def test_frame_set_name_single(self): + grouped = self.df.groupby('A') + + result = grouped.mean() + assert result.index.name == 'A' + + result = self.df.groupby('A', as_index=False).mean() + assert result.index.name != 'A' + + result = grouped.agg(np.mean) + assert result.index.name == 'A' + + result = grouped.agg({'C': np.mean, 'D': np.std}) + assert result.index.name == 'A' + + result = grouped['C'].mean() + assert result.index.name == 'A' + result = grouped['C'].agg(np.mean) + assert result.index.name == 'A' + result = grouped['C'].agg([np.mean, np.std]) + assert result.index.name == 'A' + + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + result = grouped['C'].agg({'foo': np.mean, 'bar': np.std}) + assert result.index.name == 'A' + + def test_multi_iter(self): + s = Series(np.arange(6)) + k1 = np.array(['a', 'a', 'a', 'b', 'b', 'b']) + k2 = np.array(['1', '2', '1', '2', '1', '2']) + + grouped = s.groupby([k1, k2]) + + iterated = list(grouped) + expected = [('a', '1', s[[0, 2]]), ('a', '2', s[[1]]), + ('b', '1', s[[4]]), ('b', '2', s[[3, 5]])] + for i, ((one, two), three) in enumerate(iterated): + e1, e2, e3 = expected[i] + assert e1 == one + assert e2 == two + assert_series_equal(three, e3) + + def test_multi_iter_frame(self): + k1 = np.array(['b', 'b', 'b', 'a', 'a', 'a']) + k2 = np.array(['1', '2', '1', '2', '1', '2']) + df = DataFrame({'v1': np.random.randn(6), + 'v2': np.random.randn(6), + 'k1': k1, 'k2': k2}, + index=['one', 'two', 'three', 'four', 'five', 'six']) + + grouped = df.groupby(['k1', 'k2']) + + # things get sorted! 
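The iteration tests above depend on a multi-key groupby yielding a tuple of keys per group, in sorted key order. An illustrative sketch of that contract (the series and keys are made up for the example):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(6))
k1 = np.array(['a', 'a', 'a', 'b', 'b', 'b'])
k2 = np.array(['1', '2', '1', '2', '1', '2'])

for (one, two), group in s.groupby([k1, k2]):
    # keys arrive sorted: ('a', '1'), ('a', '2'), ('b', '1'), ('b', '2')
    print(one, two, group.tolist())
```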
+ iterated = list(grouped) + idx = df.index + expected = [('a', '1', df.loc[idx[[4]]]), + ('a', '2', df.loc[idx[[3, 5]]]), + ('b', '1', df.loc[idx[[0, 2]]]), + ('b', '2', df.loc[idx[[1]]])] + for i, ((one, two), three) in enumerate(iterated): + e1, e2, e3 = expected[i] + assert e1 == one + assert e2 == two + assert_frame_equal(three, e3) + + # don't iterate through groups with no data + df['k1'] = np.array(['b', 'b', 'b', 'a', 'a', 'a']) + df['k2'] = np.array(['1', '1', '1', '2', '2', '2']) + grouped = df.groupby(['k1', 'k2']) + groups = {} + for key, gp in grouped: + groups[key] = gp + assert len(groups) == 2 + + # axis = 1 + three_levels = self.three_group.groupby(['A', 'B', 'C']).mean() + grouped = three_levels.T.groupby(axis=1, level=(1, 2)) + for key, group in grouped: + pass + + def test_multi_iter_panel(self): + with catch_warnings(record=True): + wp = tm.makePanel() + grouped = wp.groupby([lambda x: x.month, lambda x: x.weekday()], + axis=1) + + for (month, wd), group in grouped: + exp_axis = [x + for x in wp.major_axis + if x.month == month and x.weekday() == wd] + expected = wp.reindex(major=exp_axis) + assert_panel_equal(group, expected) + + def test_multi_func(self): + col1 = self.df['A'] + col2 = self.df['B'] + + grouped = self.df.groupby([col1.get, col2.get]) + agged = grouped.mean() + expected = self.df.groupby(['A', 'B']).mean() + + # TODO groupby get drops names + assert_frame_equal(agged.loc[:, ['C', 'D']], + expected.loc[:, ['C', 'D']], + check_names=False) + + # some "groups" with no data + df = DataFrame({'v1': np.random.randn(6), + 'v2': np.random.randn(6), + 'k1': np.array(['b', 'b', 'b', 'a', 'a', 'a']), + 'k2': np.array(['1', '1', '1', '2', '2', '2'])}, + index=['one', 'two', 'three', 'four', 'five', 'six']) + # only verify that it works for now + grouped = df.groupby(['k1', 'k2']) + grouped.agg(np.sum) + + def test_multi_key_multiple_functions(self): + grouped = self.df.groupby(['A', 'B'])['C'] + + agged = grouped.agg([np.mean, np.std]) + expected = DataFrame({'mean': grouped.agg(np.mean), + 'std': grouped.agg(np.std)}) + assert_frame_equal(agged, expected) + + def test_frame_multi_key_function_list(self): + data = DataFrame( + {'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', + 'foo', 'foo', 'foo'], + 'B': ['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', + 'two', 'two', 'one'], + 'C': ['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', + 'dull', 'shiny', 'shiny', 'shiny'], + 'D': np.random.randn(11), + 'E': np.random.randn(11), + 'F': np.random.randn(11)}) + + grouped = data.groupby(['A', 'B']) + funcs = [np.mean, np.std] + agged = grouped.agg(funcs) + expected = concat([grouped['D'].agg(funcs), grouped['E'].agg(funcs), + grouped['F'].agg(funcs)], + keys=['D', 'E', 'F'], axis=1) + assert (isinstance(agged.index, MultiIndex)) + assert (isinstance(expected.index, MultiIndex)) + assert_frame_equal(agged, expected) + + def test_groupby_multiple_columns(self): + data = self.df + grouped = data.groupby(['A', 'B']) + + def _check_op(op): + + with catch_warnings(record=True): + result1 = op(grouped) + + expected = defaultdict(dict) + for n1, gp1 in data.groupby('A'): + for n2, gp2 in gp1.groupby('B'): + expected[n1][n2] = op(gp2.loc[:, ['C', 'D']]) + expected = dict((k, DataFrame(v)) + for k, v in compat.iteritems(expected)) + expected = Panel.fromDict(expected).swapaxes(0, 1) + expected.major_axis.name, expected.minor_axis.name = 'A', 'B' + + # a little bit crude + for col in ['C', 'D']: + result_col = op(grouped[col]) + exp = expected[col] 
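Passing a list of functions to `agg`, as the multi-function tests above do, yields one output column per function; with several value columns the result grows a column `MultiIndex`, one block per original column. A minimal sketch with an illustrative frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                   'C': [1., 2., 3., 4.]})

agged = df.groupby('A')['C'].agg([np.mean, np.std])
# columns are ['mean', 'std']; one row per group ('x' and 'y')
```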
+ pivoted = result1[col].unstack() + pivoted2 = result_col.unstack() + assert_frame_equal(pivoted.reindex_like(exp), exp) + assert_frame_equal(pivoted2.reindex_like(exp), exp) + + _check_op(lambda x: x.sum()) + _check_op(lambda x: x.mean()) + + # test single series works the same + result = data['C'].groupby([data['A'], data['B']]).mean() + expected = data.groupby(['A', 'B']).mean()['C'] + + assert_series_equal(result, expected) + + def test_groupby_as_index_agg(self): + grouped = self.df.groupby('A', as_index=False) + + # single-key + + result = grouped.agg(np.mean) + expected = grouped.mean() + assert_frame_equal(result, expected) + + result2 = grouped.agg(OrderedDict([['C', np.mean], ['D', np.sum]])) + expected2 = grouped.mean() + expected2['D'] = grouped.sum()['D'] + assert_frame_equal(result2, expected2) + + grouped = self.df.groupby('A', as_index=True) + expected3 = grouped['C'].sum() + expected3 = DataFrame(expected3).rename(columns={'C': 'Q'}) + + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + result3 = grouped['C'].agg({'Q': np.sum}) + assert_frame_equal(result3, expected3) + + # multi-key + + grouped = self.df.groupby(['A', 'B'], as_index=False) + + result = grouped.agg(np.mean) + expected = grouped.mean() + assert_frame_equal(result, expected) + + result2 = grouped.agg(OrderedDict([['C', np.mean], ['D', np.sum]])) + expected2 = grouped.mean() + expected2['D'] = grouped.sum()['D'] + assert_frame_equal(result2, expected2) + + expected3 = grouped['C'].sum() + expected3 = DataFrame(expected3).rename(columns={'C': 'Q'}) + result3 = grouped['C'].agg({'Q': np.sum}) + assert_frame_equal(result3, expected3) + + # GH7115 & GH8112 & GH8582 + df = DataFrame(np.random.randint(0, 100, (50, 3)), + columns=['jim', 'joe', 'jolie']) + ts = Series(np.random.randint(5, 10, 50), name='jim') + + gr = df.groupby(ts) + gr.nth(0) # invokes set_selection_from_grouper internally + assert_frame_equal(gr.apply(sum), df.groupby(ts).apply(sum)) + + for attr in ['mean', 'max', 'count', 'idxmax', 'cumsum', 'all']: + gr = df.groupby(ts, as_index=False) + left = getattr(gr, attr)() + + gr = df.groupby(ts.values, as_index=True) + right = getattr(gr, attr)().reset_index(drop=True) + + assert_frame_equal(left, right) + + def test_series_groupby_nunique(self): + + def check_nunique(df, keys, as_index=True): + for sort, dropna in cart_product((False, True), repeat=2): + gr = df.groupby(keys, as_index=as_index, sort=sort) + left = gr['julie'].nunique(dropna=dropna) + + gr = df.groupby(keys, as_index=as_index, sort=sort) + right = gr['julie'].apply(Series.nunique, dropna=dropna) + if not as_index: + right = right.reset_index(drop=True) + + assert_series_equal(left, right, check_names=False) + + days = date_range('2015-08-23', periods=10) + + for n, m in cart_product(10 ** np.arange(2, 6), (10, 100, 1000)): + frame = DataFrame({ + 'jim': np.random.choice( + list(ascii_lowercase), n), + 'joe': np.random.choice(days, n), + 'julie': np.random.randint(0, m, n) + }) + + check_nunique(frame, ['jim']) + check_nunique(frame, ['jim', 'joe']) + + frame.loc[1::17, 'jim'] = None + frame.loc[3::37, 'joe'] = None + frame.loc[7::19, 'julie'] = None + frame.loc[8::19, 'julie'] = None + frame.loc[9::19, 'julie'] = None + + check_nunique(frame, ['jim']) + check_nunique(frame, ['jim', 'joe']) + check_nunique(frame, ['jim'], as_index=False) + check_nunique(frame, ['jim', 'joe'], as_index=False) + + def test_multiindex_passthru(self): + + # GH 7997 + # regression from 0.14.1 + df = pd.DataFrame([[1, 2, 3], [4, 5, 
6], [7, 8, 9]]) + df.columns = pd.MultiIndex.from_tuples([(0, 1), (1, 1), (2, 1)]) + + result = df.groupby(axis=1, level=[0, 1]).first() + assert_frame_equal(result, df) + + def test_multiindex_negative_level(self): + # GH 13901 + result = self.mframe.groupby(level=-1).sum() + expected = self.mframe.groupby(level='second').sum() + assert_frame_equal(result, expected) + + result = self.mframe.groupby(level=-2).sum() + expected = self.mframe.groupby(level='first').sum() + assert_frame_equal(result, expected) + + result = self.mframe.groupby(level=[-2, -1]).sum() + expected = self.mframe + assert_frame_equal(result, expected) + + result = self.mframe.groupby(level=[-1, 'first']).sum() + expected = self.mframe.groupby(level=['second', 'first']).sum() + assert_frame_equal(result, expected) + + def test_multifunc_select_col_integer_cols(self): + df = self.df + df.columns = np.arange(len(df.columns)) + + # it works! + df.groupby(1, as_index=False)[2].agg({'Q': np.mean}) + + def test_as_index_series_return_frame(self): + grouped = self.df.groupby('A', as_index=False) + grouped2 = self.df.groupby(['A', 'B'], as_index=False) + + result = grouped['C'].agg(np.sum) + expected = grouped.agg(np.sum).loc[:, ['A', 'C']] + assert isinstance(result, DataFrame) + assert_frame_equal(result, expected) + + result2 = grouped2['C'].agg(np.sum) + expected2 = grouped2.agg(np.sum).loc[:, ['A', 'B', 'C']] + assert isinstance(result2, DataFrame) + assert_frame_equal(result2, expected2) + + result = grouped['C'].sum() + expected = grouped.sum().loc[:, ['A', 'C']] + assert isinstance(result, DataFrame) + assert_frame_equal(result, expected) + + result2 = grouped2['C'].sum() + expected2 = grouped2.sum().loc[:, ['A', 'B', 'C']] + assert isinstance(result2, DataFrame) + assert_frame_equal(result2, expected2) + + # corner case + pytest.raises(Exception, grouped['C'].__getitem__, 'D') + + def test_groupby_as_index_cython(self): + data = self.df + + # single-key + grouped = data.groupby('A', as_index=False) + result = grouped.mean() + expected = data.groupby(['A']).mean() + expected.insert(0, 'A', expected.index) + expected.index = np.arange(len(expected)) + assert_frame_equal(result, expected) + + # multi-key + grouped = data.groupby(['A', 'B'], as_index=False) + result = grouped.mean() + expected = data.groupby(['A', 'B']).mean() + + arrays = lzip(*expected.index.values) + expected.insert(0, 'A', arrays[0]) + expected.insert(1, 'B', arrays[1]) + expected.index = np.arange(len(expected)) + assert_frame_equal(result, expected) + + def test_groupby_as_index_series_scalar(self): + grouped = self.df.groupby(['A', 'B'], as_index=False) + + # GH #421 + + result = grouped['C'].agg(len) + expected = grouped.agg(len).loc[:, ['A', 'B', 'C']] + assert_frame_equal(result, expected) + + def test_groupby_as_index_corner(self): + pytest.raises(TypeError, self.ts.groupby, lambda x: x.weekday(), + as_index=False) + + pytest.raises(ValueError, self.df.groupby, lambda x: x.lower(), + as_index=False, axis=1) + + def test_groupby_as_index_apply(self): + # GH #4648 and #3417 + df = DataFrame({'item_id': ['b', 'b', 'a', 'c', 'a', 'b'], + 'user_id': [1, 2, 1, 1, 3, 1], + 'time': range(6)}) + + g_as = df.groupby('user_id', as_index=True) + g_not_as = df.groupby('user_id', as_index=False) + + res_as = g_as.head(2).index + res_not_as = g_not_as.head(2).index + exp = Index([0, 1, 2, 4]) + assert_index_equal(res_as, exp) + assert_index_equal(res_not_as, exp) + + res_as_apply = g_as.apply(lambda x: x.head(2)).index + res_not_as_apply = 
g_not_as.apply(lambda x: x.head(2)).index + + # apply doesn't maintain the original ordering + # changed in GH5610 as the as_index=False returns a MI here + exp_not_as_apply = MultiIndex.from_tuples([(0, 0), (0, 2), (1, 1), ( + 2, 4)]) + tp = [(1, 0), (1, 2), (2, 1), (3, 4)] + exp_as_apply = MultiIndex.from_tuples(tp, names=['user_id', None]) + + assert_index_equal(res_as_apply, exp_as_apply) + assert_index_equal(res_not_as_apply, exp_not_as_apply) + + ind = Index(list('abcde')) + df = DataFrame([[1, 2], [2, 3], [1, 4], [1, 5], [2, 6]], index=ind) + res = df.groupby(0, as_index=False).apply(lambda x: x).index + assert_index_equal(res, ind) + + def test_groupby_head_tail(self): + df = DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B']) + g_as = df.groupby('A', as_index=True) + g_not_as = df.groupby('A', as_index=False) + + # as_index= False, much easier + assert_frame_equal(df.loc[[0, 2]], g_not_as.head(1)) + assert_frame_equal(df.loc[[1, 2]], g_not_as.tail(1)) + + empty_not_as = DataFrame(columns=df.columns, + index=pd.Index([], dtype=df.index.dtype)) + empty_not_as['A'] = empty_not_as['A'].astype(df.A.dtype) + empty_not_as['B'] = empty_not_as['B'].astype(df.B.dtype) + assert_frame_equal(empty_not_as, g_not_as.head(0)) + assert_frame_equal(empty_not_as, g_not_as.tail(0)) + assert_frame_equal(empty_not_as, g_not_as.head(-1)) + assert_frame_equal(empty_not_as, g_not_as.tail(-1)) + + assert_frame_equal(df, g_not_as.head(7)) # contains all + assert_frame_equal(df, g_not_as.tail(7)) + + # as_index=True, (used to be different) + df_as = df + + assert_frame_equal(df_as.loc[[0, 2]], g_as.head(1)) + assert_frame_equal(df_as.loc[[1, 2]], g_as.tail(1)) + + empty_as = DataFrame(index=df_as.index[:0], columns=df.columns) + empty_as['A'] = empty_not_as['A'].astype(df.A.dtype) + empty_as['B'] = empty_not_as['B'].astype(df.B.dtype) + assert_frame_equal(empty_as, g_as.head(0)) + assert_frame_equal(empty_as, g_as.tail(0)) + assert_frame_equal(empty_as, g_as.head(-1)) + assert_frame_equal(empty_as, g_as.tail(-1)) + + assert_frame_equal(df_as, g_as.head(7)) # contains all + assert_frame_equal(df_as, g_as.tail(7)) + + # test with selection + assert_frame_equal(g_as[[]].head(1), df_as.loc[[0, 2], []]) + assert_frame_equal(g_as[['A']].head(1), df_as.loc[[0, 2], ['A']]) + assert_frame_equal(g_as[['B']].head(1), df_as.loc[[0, 2], ['B']]) + assert_frame_equal(g_as[['A', 'B']].head(1), df_as.loc[[0, 2]]) + + assert_frame_equal(g_not_as[[]].head(1), df_as.loc[[0, 2], []]) + assert_frame_equal(g_not_as[['A']].head(1), df_as.loc[[0, 2], ['A']]) + assert_frame_equal(g_not_as[['B']].head(1), df_as.loc[[0, 2], ['B']]) + assert_frame_equal(g_not_as[['A', 'B']].head(1), df_as.loc[[0, 2]]) + + def test_groupby_multiple_key(self): + df = tm.makeTimeDataFrame() + grouped = df.groupby([lambda x: x.year, lambda x: x.month, + lambda x: x.day]) + agged = grouped.sum() + assert_almost_equal(df.values, agged.values) + + grouped = df.T.groupby([lambda x: x.year, + lambda x: x.month, + lambda x: x.day], axis=1) + + agged = grouped.agg(lambda x: x.sum()) + tm.assert_index_equal(agged.index, df.columns) + assert_almost_equal(df.T.values, agged.values) + + agged = grouped.agg(lambda x: x.sum()) + assert_almost_equal(df.T.values, agged.values) + + def test_groupby_multi_corner(self): + # test that having an all-NA column doesn't mess you up + df = self.df.copy() + df['bad'] = np.nan + agged = df.groupby(['A', 'B']).mean() + + expected = self.df.groupby(['A', 'B']).mean() + expected['bad'] = np.nan + + assert_frame_equal(agged, 
expected) + + def test_omit_nuisance(self): + grouped = self.df.groupby('A') + + result = grouped.mean() + expected = self.df.loc[:, ['A', 'C', 'D']].groupby('A').mean() + assert_frame_equal(result, expected) + + agged = grouped.agg(np.mean) + exp = grouped.mean() + assert_frame_equal(agged, exp) + + df = self.df.loc[:, ['A', 'C', 'D']] + df['E'] = datetime.now() + grouped = df.groupby('A') + result = grouped.agg(np.sum) + expected = grouped.sum() + assert_frame_equal(result, expected) + + # won't work with axis = 1 + grouped = df.groupby({'A': 0, 'C': 0, 'D': 1, 'E': 1}, axis=1) + result = pytest.raises(TypeError, grouped.agg, + lambda x: x.sum(0, numeric_only=False)) + + def test_omit_nuisance_python_multiple(self): + grouped = self.three_group.groupby(['A', 'B']) + + agged = grouped.agg(np.mean) + exp = grouped.mean() + assert_frame_equal(agged, exp) + + def test_empty_groups_corner(self): + # handle empty groups + df = DataFrame({'k1': np.array(['b', 'b', 'b', 'a', 'a', 'a']), + 'k2': np.array(['1', '1', '1', '2', '2', '2']), + 'k3': ['foo', 'bar'] * 3, + 'v1': np.random.randn(6), + 'v2': np.random.randn(6)}) + + grouped = df.groupby(['k1', 'k2']) + result = grouped.agg(np.mean) + expected = grouped.mean() + assert_frame_equal(result, expected) + + grouped = self.mframe[3:5].groupby(level=0) + agged = grouped.apply(lambda x: x.mean()) + agged_A = grouped['A'].apply(np.mean) + assert_series_equal(agged['A'], agged_A) + assert agged.index.name == 'first' + + def test_apply_concat_preserve_names(self): + grouped = self.three_group.groupby(['A', 'B']) + + def desc(group): + result = group.describe() + result.index.name = 'stat' + return result + + def desc2(group): + result = group.describe() + result.index.name = 'stat' + result = result[:len(group)] + # weirdo + return result + + def desc3(group): + result = group.describe() + + # names are different + result.index.name = 'stat_%d' % len(group) + + result = result[:len(group)] + # weirdo + return result + + result = grouped.apply(desc) + assert result.index.names == ('A', 'B', 'stat') + + result2 = grouped.apply(desc2) + assert result2.index.names == ('A', 'B', 'stat') + + result3 = grouped.apply(desc3) + assert result3.index.names == ('A', 'B', None) + + def test_nonsense_func(self): + df = DataFrame([0]) + pytest.raises(Exception, df.groupby, lambda x: x + 'foo') + + def test_builtins_apply(self): # GH8155 + df = pd.DataFrame(np.random.randint(1, 50, (1000, 2)), + columns=['jim', 'joe']) + df['jolie'] = np.random.randn(1000) + + for keys in ['jim', ['jim', 'joe']]: # single key & multi-key + if keys == 'jim': + continue + for f in [max, min, sum]: + fname = f.__name__ + result = df.groupby(keys).apply(f) + result.shape + ngroups = len(df.drop_duplicates(subset=keys)) + assert result.shape == (ngroups, 3), 'invalid frame shape: '\ + '{} (expected ({}, 3))'.format(result.shape, ngroups) + + assert_frame_equal(result, # numpy's equivalent function + df.groupby(keys).apply(getattr(np, fname))) + + if f != sum: + expected = df.groupby(keys).agg(fname).reset_index() + expected.set_index(keys, inplace=True, drop=False) + assert_frame_equal(result, expected, check_dtype=False) + + assert_series_equal(getattr(result, fname)(), + getattr(df, fname)()) + + def test_max_min_non_numeric(self): + # #2700 + aa = DataFrame({'nn': [11, 11, 22, 22], + 'ii': [1, 2, 3, 4], + 'ss': 4 * ['mama']}) + + result = aa.groupby('nn').max() + assert 'ss' in result + + result = aa.groupby('nn').max(numeric_only=False) + assert 'ss' in result + + result = 
aa.groupby('nn').min() + assert 'ss' in result + + result = aa.groupby('nn').min(numeric_only=False) + assert 'ss' in result + + def test_arg_passthru(self): + # make sure that we are passing thru kwargs + # to our agg functions + + # GH3668 + # GH5724 + df = pd.DataFrame( + {'group': [1, 1, 2], + 'int': [1, 2, 3], + 'float': [4., 5., 6.], + 'string': list('abc'), + 'category_string': pd.Series(list('abc')).astype('category'), + 'category_int': [7, 8, 9], + 'datetime': pd.date_range('20130101', periods=3), + 'datetimetz': pd.date_range('20130101', + periods=3, + tz='US/Eastern'), + 'timedelta': pd.timedelta_range('1 s', periods=3, freq='s')}, + columns=['group', 'int', 'float', 'string', + 'category_string', 'category_int', + 'datetime', 'datetimetz', + 'timedelta']) + + expected_columns_numeric = Index(['int', 'float', 'category_int']) + + # mean / median + expected = pd.DataFrame( + {'category_int': [7.5, 9], + 'float': [4.5, 6.], + 'timedelta': [pd.Timedelta('1.5s'), + pd.Timedelta('3s')], + 'int': [1.5, 3], + 'datetime': [pd.Timestamp('2013-01-01 12:00:00'), + pd.Timestamp('2013-01-03 00:00:00')], + 'datetimetz': [ + pd.Timestamp('2013-01-01 12:00:00', tz='US/Eastern'), + pd.Timestamp('2013-01-03 00:00:00', tz='US/Eastern')]}, + index=Index([1, 2], name='group'), + columns=['int', 'float', 'category_int', + 'datetime', 'datetimetz', 'timedelta']) + for attr in ['mean', 'median']: + f = getattr(df.groupby('group'), attr) + result = f() + tm.assert_index_equal(result.columns, expected_columns_numeric) + + result = f(numeric_only=False) + assert_frame_equal(result.reindex_like(expected), expected) + + # TODO: min, max *should* handle + # categorical (ordered) dtype + expected_columns = Index(['int', 'float', 'string', + 'category_int', + 'datetime', 'datetimetz', + 'timedelta']) + for attr in ['min', 'max']: + f = getattr(df.groupby('group'), attr) + result = f() + tm.assert_index_equal(result.columns, expected_columns) + + result = f(numeric_only=False) + tm.assert_index_equal(result.columns, expected_columns) + + expected_columns = Index(['int', 'float', 'string', + 'category_string', 'category_int', + 'datetime', 'datetimetz', + 'timedelta']) + for attr in ['first', 'last']: + f = getattr(df.groupby('group'), attr) + result = f() + tm.assert_index_equal(result.columns, expected_columns) + + result = f(numeric_only=False) + tm.assert_index_equal(result.columns, expected_columns) + + expected_columns = Index(['int', 'float', 'string', + 'category_int', 'timedelta']) + for attr in ['sum']: + f = getattr(df.groupby('group'), attr) + result = f() + tm.assert_index_equal(result.columns, expected_columns_numeric) + + result = f(numeric_only=False) + tm.assert_index_equal(result.columns, expected_columns) + + expected_columns = Index(['int', 'float', 'category_int']) + for attr in ['prod', 'cumprod']: + f = getattr(df.groupby('group'), attr) + result = f() + tm.assert_index_equal(result.columns, expected_columns_numeric) + + result = f(numeric_only=False) + tm.assert_index_equal(result.columns, expected_columns) + + # like min, max, but don't include strings + expected_columns = Index(['int', 'float', + 'category_int', + 'datetime', 'datetimetz', + 'timedelta']) + for attr in ['cummin', 'cummax']: + f = getattr(df.groupby('group'), attr) + result = f() + # GH 15561: numeric_only=False set by default like min/max + tm.assert_index_equal(result.columns, expected_columns) + + result = f(numeric_only=False) + tm.assert_index_equal(result.columns, expected_columns) + + expected_columns = 
Index(['int', 'float', 'category_int', + 'timedelta']) + for attr in ['cumsum']: + f = getattr(df.groupby('group'), attr) + result = f() + tm.assert_index_equal(result.columns, expected_columns_numeric) + + result = f(numeric_only=False) + tm.assert_index_equal(result.columns, expected_columns) + + def test_groupby_timedelta_cython_count(self): + df = DataFrame({'g': list('ab' * 2), + 'delt': np.arange(4).astype('timedelta64[ns]')}) + expected = Series([ + 2, 2 + ], index=pd.Index(['a', 'b'], name='g'), name='delt') + result = df.groupby('g').delt.count() + tm.assert_series_equal(expected, result) + + def test_wrap_aggregated_output_multindex(self): + df = self.mframe.T + df['baz', 'two'] = 'peekaboo' + + keys = [np.array([0, 0, 1]), np.array([0, 0, 1])] + agged = df.groupby(keys).agg(np.mean) + assert isinstance(agged.columns, MultiIndex) + + def aggfun(ser): + if ser.name == ('foo', 'one'): + raise TypeError + else: + return ser.sum() + + agged2 = df.groupby(keys).aggregate(aggfun) + assert len(agged2.columns) + 1 == len(df.columns) + + def test_groupby_level(self): + frame = self.mframe + deleveled = frame.reset_index() + + result0 = frame.groupby(level=0).sum() + result1 = frame.groupby(level=1).sum() + + expected0 = frame.groupby(deleveled['first'].values).sum() + expected1 = frame.groupby(deleveled['second'].values).sum() + + expected0 = expected0.reindex(frame.index.levels[0]) + expected1 = expected1.reindex(frame.index.levels[1]) + + assert result0.index.name == 'first' + assert result1.index.name == 'second' + + assert_frame_equal(result0, expected0) + assert_frame_equal(result1, expected1) + assert result0.index.name == frame.index.names[0] + assert result1.index.name == frame.index.names[1] + + # groupby level name + result0 = frame.groupby(level='first').sum() + result1 = frame.groupby(level='second').sum() + assert_frame_equal(result0, expected0) + assert_frame_equal(result1, expected1) + + # axis=1 + + result0 = frame.T.groupby(level=0, axis=1).sum() + result1 = frame.T.groupby(level=1, axis=1).sum() + assert_frame_equal(result0, expected0.T) + assert_frame_equal(result1, expected1.T) + + # raise exception for non-MultiIndex + pytest.raises(ValueError, self.df.groupby, level=1) + + def test_groupby_level_index_names(self): + # GH4014 this used to raise ValueError since 'exp'>1 (in py2) + df = DataFrame({'exp': ['A'] * 3 + ['B'] * 3, + 'var1': lrange(6), }).set_index('exp') + df.groupby(level='exp') + pytest.raises(ValueError, df.groupby, level='foo') + + def test_groupby_level_with_nas(self): + index = MultiIndex(levels=[[1, 0], [0, 1, 2, 3]], + labels=[[1, 1, 1, 1, 0, 0, 0, 0], [0, 1, 2, 3, 0, 1, + 2, 3]]) + + # factorizing doesn't confuse things + s = Series(np.arange(8.), index=index) + result = s.groupby(level=0).sum() + expected = Series([22., 6.], index=[1, 0]) + assert_series_equal(result, expected) + + index = MultiIndex(levels=[[1, 0], [0, 1, 2, 3]], + labels=[[1, 1, 1, 1, -1, 0, 0, 0], [0, 1, 2, 3, 0, + 1, 2, 3]]) + + # factorizing doesn't confuse things + s = Series(np.arange(8.), index=index) + result = s.groupby(level=0).sum() + expected = Series([18., 6.], index=[1, 0]) + assert_series_equal(result, expected) + + def test_groupby_level_apply(self): + frame = self.mframe + + result = frame.groupby(level=0).count() + assert result.index.name == 'first' + result = frame.groupby(level=1).count() + assert result.index.name == 'second' + + result = frame['A'].groupby(level=0).count() + assert result.index.name == 'first' + + def test_groupby_args(self): + # PR8618 and 
issue 8015 + frame = self.mframe + + def j(): + frame.groupby() + + tm.assert_raises_regex(TypeError, "You have to supply one of " + "'by' and 'level'", j) + + def k(): + frame.groupby(by=None, level=None) + + tm.assert_raises_regex(TypeError, "You have to supply one of " + "'by' and 'level'", k) + + def test_groupby_level_mapper(self): + frame = self.mframe + deleveled = frame.reset_index() + + mapper0 = {'foo': 0, 'bar': 0, 'baz': 1, 'qux': 1} + mapper1 = {'one': 0, 'two': 0, 'three': 1} + + result0 = frame.groupby(mapper0, level=0).sum() + result1 = frame.groupby(mapper1, level=1).sum() + + mapped_level0 = np.array([mapper0.get(x) for x in deleveled['first']]) + mapped_level1 = np.array([mapper1.get(x) for x in deleveled['second']]) + expected0 = frame.groupby(mapped_level0).sum() + expected1 = frame.groupby(mapped_level1).sum() + expected0.index.name, expected1.index.name = 'first', 'second' + + assert_frame_equal(result0, expected0) + assert_frame_equal(result1, expected1) + + def test_groupby_level_nonmulti(self): + # GH 1313, GH 13901 + s = Series([1, 2, 3, 10, 4, 5, 20, 6], + Index([1, 2, 3, 1, 4, 5, 2, 6], name='foo')) + expected = Series([11, 22, 3, 4, 5, 6], + Index(range(1, 7), name='foo')) + + result = s.groupby(level=0).sum() + tm.assert_series_equal(result, expected) + result = s.groupby(level=[0]).sum() + tm.assert_series_equal(result, expected) + result = s.groupby(level=-1).sum() + tm.assert_series_equal(result, expected) + result = s.groupby(level=[-1]).sum() + tm.assert_series_equal(result, expected) + + pytest.raises(ValueError, s.groupby, level=1) + pytest.raises(ValueError, s.groupby, level=-2) + pytest.raises(ValueError, s.groupby, level=[]) + pytest.raises(ValueError, s.groupby, level=[0, 0]) + pytest.raises(ValueError, s.groupby, level=[0, 1]) + pytest.raises(ValueError, s.groupby, level=[1]) + + def test_groupby_complex(self): + # GH 12902 + a = Series(data=np.arange(4) * (1 + 2j), index=[0, 0, 1, 1]) + expected = Series((1 + 2j, 5 + 10j)) + + result = a.groupby(level=0).sum() + assert_series_equal(result, expected) + + result = a.sum(level=0) + assert_series_equal(result, expected) + + def test_level_preserve_order(self): + grouped = self.mframe.groupby(level=0) + exp_labels = np.array([0, 0, 0, 1, 1, 2, 2, 3, 3, 3], np.intp) + assert_almost_equal(grouped.grouper.labels[0], exp_labels) + + def test_grouping_labels(self): + grouped = self.mframe.groupby(self.mframe.index.get_level_values(0)) + exp_labels = np.array([2, 2, 2, 0, 0, 1, 1, 3, 3, 3], dtype=np.intp) + assert_almost_equal(grouped.grouper.labels[0], exp_labels) + + def test_apply_series_to_frame(self): + def f(piece): + with np.errstate(invalid='ignore'): + logged = np.log(piece) + return DataFrame({'value': piece, + 'demeaned': piece - piece.mean(), + 'logged': logged}) + + dr = bdate_range('1/1/2000', periods=100) + ts = Series(np.random.randn(100), index=dr) + + grouped = ts.groupby(lambda x: x.month) + result = grouped.apply(f) + + assert isinstance(result, DataFrame) + tm.assert_index_equal(result.index, ts.index) + + def test_apply_series_yield_constant(self): + result = self.df.groupby(['A', 'B'])['C'].apply(len) + assert result.index.names[:2] == ('A', 'B') + + def test_apply_frame_yield_constant(self): + # GH13568 + result = self.df.groupby(['A', 'B']).apply(len) + assert isinstance(result, Series) + assert result.name is None + + result = self.df.groupby(['A', 'B'])[['C', 'D']].apply(len) + assert isinstance(result, Series) + assert result.name is None + + def test_apply_frame_to_series(self): 
+ grouped = self.df.groupby(['A', 'B']) + result = grouped.apply(len) + expected = grouped.count()['C'] + tm.assert_index_equal(result.index, expected.index) + tm.assert_numpy_array_equal(result.values, expected.values) + + def test_apply_frame_concat_series(self): + def trans(group): + return group.groupby('B')['C'].sum().sort_values()[:2] + + def trans2(group): + grouped = group.groupby(df.reindex(group.index)['B']) + return grouped.sum().sort_values()[:2] + + df = DataFrame({'A': np.random.randint(0, 5, 1000), + 'B': np.random.randint(0, 5, 1000), + 'C': np.random.randn(1000)}) + + result = df.groupby('A').apply(trans) + exp = df.groupby('A')['C'].apply(trans2) + assert_series_equal(result, exp, check_names=False) + assert result.name == 'C' + + def test_apply_transform(self): + grouped = self.ts.groupby(lambda x: x.month) + result = grouped.apply(lambda x: x * 2) + expected = grouped.transform(lambda x: x * 2) + assert_series_equal(result, expected) + + def test_apply_multikey_corner(self): + grouped = self.tsframe.groupby([lambda x: x.year, lambda x: x.month]) + + def f(group): + return group.sort_values('A')[-5:] + + result = grouped.apply(f) + for key, group in grouped: + assert_frame_equal(result.loc[key], f(group)) + + def test_mutate_groups(self): + + # GH3380 + + mydf = DataFrame({ + 'cat1': ['a'] * 8 + ['b'] * 6, + 'cat2': ['c'] * 2 + ['d'] * 2 + ['e'] * 2 + ['f'] * 2 + ['c'] * 2 + + ['d'] * 2 + ['e'] * 2, + 'cat3': lmap(lambda x: 'g%s' % x, lrange(1, 15)), + 'val': np.random.randint(100, size=14), + }) + + def f_copy(x): + x = x.copy() + x['rank'] = x.val.rank(method='min') + return x.groupby('cat2')['rank'].min() + + def f_no_copy(x): + x['rank'] = x.val.rank(method='min') + return x.groupby('cat2')['rank'].min() + + grpby_copy = mydf.groupby('cat1').apply(f_copy) + grpby_no_copy = mydf.groupby('cat1').apply(f_no_copy) + assert_series_equal(grpby_copy, grpby_no_copy) + + def test_no_mutate_but_looks_like(self): + + # GH 8467 + # the first shows the mutation indicator + # the second does not, but should yield the same results + df = DataFrame({'key': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'value': range(9)}) + + result1 = df.groupby('key', group_keys=True).apply(lambda x: x[:].key) + result2 = df.groupby('key', group_keys=True).apply(lambda x: x.key) + assert_series_equal(result1, result2) + + def test_apply_chunk_view(self): + # Low-level tinkering could be unsafe; make sure it is not + df = DataFrame({'key': [1, 1, 1, 2, 2, 2, 3, 3, 3], + 'value': lrange(9)}) + + # return view + f = lambda x: x[:2] + + result = df.groupby('key', group_keys=False).apply(f) + expected = df.take([0, 1, 3, 4, 6, 7]) + assert_frame_equal(result, expected) + + def test_apply_no_name_column_conflict(self): + df = DataFrame({'name': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2], + 'name2': [0, 0, 0, 1, 1, 1, 0, 0, 1, 1], + 'value': lrange(10)[::-1]}) + + # it works!
#2605 + grouped = df.groupby(['name', 'name2']) + grouped.apply(lambda x: x.sort_values('value', inplace=True)) + + def test_groupby_series_indexed_differently(self): + s1 = Series([5.0, -9.0, 4.0, 100., -5., 55., 6.7], + index=Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'])) + s2 = Series([1.0, 1.0, 4.0, 5.0, 5.0, 7.0], + index=Index(['a', 'b', 'd', 'f', 'g', 'h'])) + + grouped = s1.groupby(s2) + agged = grouped.mean() + exp = s1.groupby(s2.reindex(s1.index).get).mean() + assert_series_equal(agged, exp) + + def test_groupby_with_hier_columns(self): + tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', + 'qux'], ['one', 'two', 'one', 'two', 'one', 'two', + 'one', 'two']])) + index = MultiIndex.from_tuples(tuples) + columns = MultiIndex.from_tuples([('A', 'cat'), ('B', 'dog'), ( + 'B', 'cat'), ('A', 'dog')]) + df = DataFrame(np.random.randn(8, 4), index=index, columns=columns) + + result = df.groupby(level=0).mean() + tm.assert_index_equal(result.columns, columns) + + result = df.groupby(level=0, axis=1).mean() + tm.assert_index_equal(result.index, df.index) + + result = df.groupby(level=0).agg(np.mean) + tm.assert_index_equal(result.columns, columns) + + result = df.groupby(level=0).apply(lambda x: x.mean()) + tm.assert_index_equal(result.columns, columns) + + result = df.groupby(level=0, axis=1).agg(lambda x: x.mean(1)) + tm.assert_index_equal(result.columns, Index(['A', 'B'])) + tm.assert_index_equal(result.index, df.index) + + # add a nuisance column + sorted_columns, _ = columns.sortlevel(0) + df['A', 'foo'] = 'bar' + result = df.groupby(level=0).mean() + tm.assert_index_equal(result.columns, df.columns[:-1]) + + def test_pass_args_kwargs(self): + from numpy import percentile + + def f(x, q=None, axis=0): + return percentile(x, q, axis=axis) + + g = lambda x: percentile(x, 80, axis=0) + + # Series + ts_grouped = self.ts.groupby(lambda x: x.month) + agg_result = ts_grouped.agg(percentile, 80, axis=0) + apply_result = ts_grouped.apply(percentile, 80, axis=0) + trans_result = ts_grouped.transform(percentile, 80, axis=0) + + agg_expected = ts_grouped.quantile(.8) + trans_expected = ts_grouped.transform(g) + + assert_series_equal(apply_result, agg_expected) + assert_series_equal(agg_result, agg_expected, check_names=False) + assert_series_equal(trans_result, trans_expected) + + agg_result = ts_grouped.agg(f, q=80) + apply_result = ts_grouped.apply(f, q=80) + trans_result = ts_grouped.transform(f, q=80) + assert_series_equal(agg_result, agg_expected) + assert_series_equal(apply_result, agg_expected) + assert_series_equal(trans_result, trans_expected) + + # DataFrame + df_grouped = self.tsframe.groupby(lambda x: x.month) + agg_result = df_grouped.agg(percentile, 80, axis=0) + apply_result = df_grouped.apply(DataFrame.quantile, .8) + expected = df_grouped.quantile(.8) + assert_frame_equal(apply_result, expected) + assert_frame_equal(agg_result, expected, check_names=False) + + agg_result = df_grouped.agg(f, q=80) + apply_result = df_grouped.apply(DataFrame.quantile, q=.8) + assert_frame_equal(agg_result, expected, check_names=False) + assert_frame_equal(apply_result, expected) + + def test_size(self): + grouped = self.df.groupby(['A', 'B']) + result = grouped.size() + for key, group in grouped: + assert result[key] == len(group) + + grouped = self.df.groupby('A') + result = grouped.size() + for key, group in grouped: + assert result[key] == len(group) + + grouped = self.df.groupby('B') + result = grouped.size() + for key, group in grouped: + assert result[key] == len(group) + + df 
= DataFrame(np.random.choice(20, (1000, 3)), columns=list('abc')) + for sort, key in cart_product((False, True), ('a', 'b', ['a', 'b'])): + left = df.groupby(key, sort=sort).size() + right = df.groupby(key, sort=sort)['c'].apply(lambda a: a.shape[0]) + assert_series_equal(left, right, check_names=False) + + # GH11699 + df = DataFrame([], columns=['A', 'B']) + out = Series([], dtype='int64', index=Index([], name='A')) + assert_series_equal(df.groupby('A').size(), out) + + def test_count(self): + from string import ascii_lowercase + n = 1 << 15 + dr = date_range('2015-08-30', periods=n // 10, freq='T') + + df = DataFrame({ + '1st': np.random.choice( + list(ascii_lowercase), n), + '2nd': np.random.randint(0, 5, n), + '3rd': np.random.randn(n).round(3), + '4th': np.random.randint(-10, 10, n), + '5th': np.random.choice(dr, n), + '6th': np.random.randn(n).round(3), + '7th': np.random.randn(n).round(3), + '8th': np.random.choice(dr, n) - np.random.choice(dr, 1), + '9th': np.random.choice( + list(ascii_lowercase), n) + }) + + for col in df.columns.drop(['1st', '2nd', '4th']): + df.loc[np.random.choice(n, n // 10), col] = np.nan + + df['9th'] = df['9th'].astype('category') + + for key in '1st', '2nd', ['1st', '2nd']: + left = df.groupby(key).count() + right = df.groupby(key).apply(DataFrame.count).drop(key, axis=1) + assert_frame_equal(left, right) + + # GH5610 + # count counts non-nulls + df = pd.DataFrame([[1, 2, 'foo'], [1, nan, 'bar'], [3, nan, nan]], + columns=['A', 'B', 'C']) + + count_as = df.groupby('A').count() + count_not_as = df.groupby('A', as_index=False).count() + + expected = DataFrame([[1, 2], [0, 0]], columns=['B', 'C'], + index=[1, 3]) + expected.index.name = 'A' + assert_frame_equal(count_not_as, expected.reset_index()) + assert_frame_equal(count_as, expected) + + count_B = df.groupby('A')['B'].count() + assert_series_equal(count_B, expected['B']) + + def test_count_object(self): + df = pd.DataFrame({'a': ['a'] * 3 + ['b'] * 3, 'c': [2] * 3 + [3] * 3}) + result = df.groupby('c').a.count() + expected = pd.Series([ + 3, 3 + ], index=pd.Index([2, 3], name='c'), name='a') + tm.assert_series_equal(result, expected) + + df = pd.DataFrame({'a': ['a', np.nan, np.nan] + ['b'] * 3, + 'c': [2] * 3 + [3] * 3}) + result = df.groupby('c').a.count() + expected = pd.Series([ + 1, 3 + ], index=pd.Index([2, 3], name='c'), name='a') + tm.assert_series_equal(result, expected) + + def test_count_cross_type(self): # GH8169 + vals = np.hstack((np.random.randint(0, 5, (100, 2)), np.random.randint( + 0, 2, (100, 2)))) + + df = pd.DataFrame(vals, columns=['a', 'b', 'c', 'd']) + df[df == 2] = np.nan + expected = df.groupby(['c', 'd']).count() + + for t in ['float32', 'object']: + df['a'] = df['a'].astype(t) + df['b'] = df['b'].astype(t) + result = df.groupby(['c', 'd']).count() + tm.assert_frame_equal(result, expected) + + def test_nunique(self): + df = DataFrame({ + 'A': list('abbacc'), + 'B': list('abxacc'), + 'C': list('abbacx'), + }) + + expected = DataFrame({'A': [1] * 3, 'B': [1, 2, 1], 'C': [1, 1, 2]}) + result = df.groupby('A', as_index=False).nunique() + tm.assert_frame_equal(result, expected) + + # as_index + expected.index = list('abc') + expected.index.name = 'A' + result = df.groupby('A').nunique() + tm.assert_frame_equal(result, expected) + + # with na + result = df.replace({'x': None}).groupby('A').nunique(dropna=False) + tm.assert_frame_equal(result, expected) + + # dropna + expected = DataFrame({'A': [1] * 3, 'B': [1] * 3, 'C': [1] * 3}, + index=list('abc')) + expected.index.name = 'A' + 
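The `size` and `count` tests above hinge on a distinction that is easy to miss: `size` counts rows per group, NaNs included, while `count` counts non-null values per column. A small illustrative sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3], 'B': [2., np.nan, np.nan]})

df.groupby('A').size()    # A=1 -> 2 rows, A=3 -> 1 row
df.groupby('A').count()   # non-null 'B' per group: A=1 -> 1, A=3 -> 0
```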
result = df.replace({'x': None}).groupby('A').nunique() + tm.assert_frame_equal(result, expected) + + def test_non_cython_api(self): + + # GH5610 + # non-cython calls should not include the grouper + + df = DataFrame( + [[1, 2, 'foo'], [1, + nan, + 'bar', ], [3, nan, 'baz'] + ], columns=['A', 'B', 'C']) + g = df.groupby('A') + gni = df.groupby('A', as_index=False) + + # mad + expected = DataFrame([[0], [nan]], columns=['B'], index=[1, 3]) + expected.index.name = 'A' + result = g.mad() + assert_frame_equal(result, expected) + + expected = DataFrame([[0., 0.], [0, nan]], columns=['A', 'B'], + index=[0, 1]) + result = gni.mad() + assert_frame_equal(result, expected) + + # describe + expected_index = pd.Index([1, 3], name='A') + expected_col = pd.MultiIndex(levels=[['B'], + ['count', 'mean', 'std', 'min', + '25%', '50%', '75%', 'max']], + labels=[[0] * 8, list(range(8))]) + expected = pd.DataFrame([[1.0, 2.0, nan, 2.0, 2.0, 2.0, 2.0, 2.0], + [0.0, nan, nan, nan, nan, nan, nan, nan]], + index=expected_index, + columns=expected_col) + result = g.describe() + assert_frame_equal(result, expected) + + expected = pd.concat([df[df.A == 1].describe().unstack().to_frame().T, + df[df.A == 3].describe().unstack().to_frame().T]) + expected.index = pd.Index([0, 1]) + result = gni.describe() + assert_frame_equal(result, expected) + + # any + expected = DataFrame([[True, True], [False, True]], columns=['B', 'C'], + index=[1, 3]) + expected.index.name = 'A' + result = g.any() + assert_frame_equal(result, expected) + + # idxmax + expected = DataFrame([[0], [nan]], columns=['B'], index=[1, 3]) + expected.index.name = 'A' + result = g.idxmax() + assert_frame_equal(result, expected) + + def test_cython_api2(self): + + # this takes the fast apply path + + # cumsum (GH5614) + df = DataFrame( + [[1, 2, np.nan], [1, np.nan, 9], [3, 4, 9] + ], columns=['A', 'B', 'C']) + expected = DataFrame( + [[2, np.nan], [np.nan, 9], [4, 9]], columns=['B', 'C']) + result = df.groupby('A').cumsum() + assert_frame_equal(result, expected) + + # GH 5755 - cumsum is a transformer and should ignore as_index + result = df.groupby('A', as_index=False).cumsum() + assert_frame_equal(result, expected) + + # GH 13994 + result = df.groupby('A').cumsum(axis=1) + expected = df.cumsum(axis=1) + assert_frame_equal(result, expected) + result = df.groupby('A').cumprod(axis=1) + expected = df.cumprod(axis=1) + assert_frame_equal(result, expected) + + def test_grouping_ndarray(self): + grouped = self.df.groupby(self.df['A'].values) + + result = grouped.sum() + expected = self.df.groupby('A').sum() + assert_frame_equal(result, expected, check_names=False + ) # Note: no names when grouping by value + + def test_apply_typecast_fail(self): + df = DataFrame({'d': [1., 1., 1., 2., 2., 2.], + 'c': np.tile( + ['a', 'b', 'c'], 2), + 'v': np.arange(1., 7.)}) + + def f(group): + v = group['v'] + group['v2'] = (v - v.min()) / (v.max() - v.min()) + return group + + result = df.groupby('d').apply(f) + + expected = df.copy() + expected['v2'] = np.tile([0., 0.5, 1], 2) + + assert_frame_equal(result, expected) + + def test_apply_multiindex_fail(self): + index = MultiIndex.from_arrays([[0, 0, 0, 1, 1, 1], [1, 2, 3, 1, 2, 3] + ]) + df = DataFrame({'d': [1., 1., 1., 2., 2., 2.], + 'c': np.tile(['a', 'b', 'c'], 2), + 'v': np.arange(1., 7.)}, index=index) + + def f(group): + v = group['v'] + group['v2'] = (v - v.min()) / (v.max() - v.min()) + return group + + result = df.groupby('d').apply(f) + + expected = df.copy() + expected['v2'] = np.tile([0., 0.5, 1], 2) + + 
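As `test_cython_api2` above checks, transformers such as `cumsum` keep the caller's index, drop the grouping column from the output, and ignore `as_index` (GH 5755). A minimal sketch with an illustrative frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3], 'B': [2., np.nan, 4.]})

df.groupby('A').cumsum()                  # only 'B' remains; index kept
df.groupby('A', as_index=False).cumsum()  # identical: as_index ignored
```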
assert_frame_equal(result, expected) + + def test_apply_corner(self): + result = self.tsframe.groupby(lambda x: x.year).apply(lambda x: x * 2) + expected = self.tsframe * 2 + assert_frame_equal(result, expected) + + def test_apply_without_copy(self): + # GH 5545 + # returning a non-copy in an applied function fails + + data = DataFrame({'id_field': [100, 100, 200, 300], + 'category': ['a', 'b', 'c', 'c'], + 'value': [1, 2, 3, 4]}) + + def filt1(x): + if x.shape[0] == 1: + return x.copy() + else: + return x[x.category == 'c'] + + def filt2(x): + if x.shape[0] == 1: + return x + else: + return x[x.category == 'c'] + + expected = data.groupby('id_field').apply(filt1) + result = data.groupby('id_field').apply(filt2) + assert_frame_equal(result, expected) + + def test_apply_corner_cases(self): + # #535, can't use sliding iterator + + N = 1000 + labels = np.random.randint(0, 100, size=N) + df = DataFrame({'key': labels, + 'value1': np.random.randn(N), + 'value2': ['foo', 'bar', 'baz', 'qux'] * (N // 4)}) + + grouped = df.groupby('key') + + def f(g): + g['value3'] = g['value1'] * 2 + return g + + result = grouped.apply(f) + assert 'value3' in result + + def test_groupby_wrong_multi_labels(self): + from pandas import read_csv + data = """index,foo,bar,baz,spam,data +0,foo1,bar1,baz1,spam2,20 +1,foo1,bar2,baz1,spam3,30 +2,foo2,bar2,baz1,spam2,40 +3,foo1,bar1,baz2,spam1,50 +4,foo3,bar1,baz2,spam1,60""" + + data = read_csv(StringIO(data), index_col=0) + + grouped = data.groupby(['foo', 'bar', 'baz', 'spam']) + + result = grouped.agg(np.mean) + expected = grouped.mean() + assert_frame_equal(result, expected) + + def test_groupby_series_with_name(self): + result = self.df.groupby(self.df['A']).mean() + result2 = self.df.groupby(self.df['A'], as_index=False).mean() + assert result.index.name == 'A' + assert 'A' in result2 + + result = self.df.groupby([self.df['A'], self.df['B']]).mean() + result2 = self.df.groupby([self.df['A'], self.df['B']], + as_index=False).mean() + assert result.index.names == ('A', 'B') + assert 'A' in result2 + assert 'B' in result2 + + def test_seriesgroupby_name_attr(self): + # GH 6265 + result = self.df.groupby('A')['C'] + assert result.count().name == 'C' + assert result.mean().name == 'C' + + testFunc = lambda x: np.sum(x) * 2 + assert result.agg(testFunc).name == 'C' + + def test_consistency_name(self): + # GH 12363 + + df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar', + 'foo', 'bar', 'foo', 'foo'], + 'B': ['one', 'one', 'two', 'two', + 'two', 'two', 'one', 'two'], + 'C': np.random.randn(8) + 1.0, + 'D': np.arange(8)}) + + expected = df.groupby(['A']).B.count() + result = df.B.groupby(df.A).count() + assert_series_equal(result, expected) + + def test_groupby_name_propagation(self): + # GH 6124 + def summarize(df, name=None): + return Series({'count': 1, 'mean': 2, 'omissions': 3, }, name=name) + + def summarize_random_name(df): + # Provide a different name for each Series. In this case, groupby + # should not attempt to propagate the Series name since they are + # inconsistent. 
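+            # (each Series is named df.iloc[0]['A'], so names differ by group)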
+        return Series({
+            'count': 1,
+            'mean': 2,
+            'omissions': 3,
+        }, name=df.iloc[0]['A'])
+
+        metrics = self.df.groupby('A').apply(summarize)
+        assert metrics.columns.name is None
+        metrics = self.df.groupby('A').apply(summarize, 'metrics')
+        assert metrics.columns.name == 'metrics'
+        metrics = self.df.groupby('A').apply(summarize_random_name)
+        assert metrics.columns.name is None
+
+    def test_groupby_nonstring_columns(self):
+        df = DataFrame([np.arange(10) for x in range(10)])
+        grouped = df.groupby(0)
+        result = grouped.mean()
+        expected = df.groupby(df[0]).mean()
+        assert_frame_equal(result, expected)
+
+    def test_groupby_mixed_type_columns(self):
+        # GH 13432, unorderable types in py3
+        df = DataFrame([[0, 1, 2]], columns=['A', 'B', 0])
+        expected = DataFrame([[1, 2]], columns=['B', 0],
+                             index=Index([0], name='A'))
+
+        result = df.groupby('A').first()
+        tm.assert_frame_equal(result, expected)
+
+        result = df.groupby('A').sum()
+        tm.assert_frame_equal(result, expected)
+
+    def test_cython_grouper_series_bug_noncontig(self):
+        arr = np.empty((100, 100))
+        arr.fill(np.nan)
+        obj = Series(arr[:, 0], index=lrange(100))
+        inds = np.tile(lrange(10), 10)
+
+        result = obj.groupby(inds).agg(Series.median)
+        assert result.isnull().all()
+
+    def test_series_grouper_noncontig_index(self):
+        index = Index(tm.rands_array(10, 100))
+
+        values = Series(np.random.randn(50), index=index[::2])
+        labels = np.random.randint(0, 5, 50)
+
+        # it works!
+        grouped = values.groupby(labels)
+
+        # accessing the index elements causes segfault
+        f = lambda x: len(set(map(id, x.index)))
+        grouped.agg(f)
+
+    def test_convert_objects_leave_decimal_alone(self):
+
+        from decimal import Decimal
+
+        s = Series(lrange(5))
+        labels = np.array(['a', 'b', 'c', 'd', 'e'], dtype='O')
+
+        def convert_fast(x):
+            return Decimal(str(x.mean()))
+
+        def convert_force_pure(x):
+            # the grouped values are a view, so x.base is populated
+            assert (len(x.base) > 0)
+            return Decimal(str(x.mean()))
+
+        grouped = s.groupby(labels)
+
+        result = grouped.agg(convert_fast)
+        assert result.dtype == np.object_
+        assert isinstance(result[0], Decimal)
+
+        result = grouped.agg(convert_force_pure)
+        assert result.dtype == np.object_
+        assert isinstance(result[0], Decimal)
+
+    def test_fast_apply(self):
+        # make sure that fast apply is correctly called
+        # rather than raising any kind of error;
+        # otherwise the python path will be called,
+        # which slows things down
+        N = 1000
+        labels = np.random.randint(0, 2000, size=N)
+        labels2 = np.random.randint(0, 3, size=N)
+        df = DataFrame({'key': labels,
+                        'key2': labels2,
+                        'value1': np.random.randn(N),
+                        'value2': ['foo', 'bar', 'baz', 'qux'] * (N // 4)})
+
+        def f(g):
+            return 1
+
+        g = df.groupby(['key', 'key2'])
+
+        grouper = g.grouper
+
+        splitter = grouper._get_splitter(g._selected_obj, axis=g.axis)
+        group_keys = grouper._get_group_keys()
+
+        values, mutated = splitter.fast_apply(f, group_keys)
+        assert not mutated
+
+    def test_apply_with_mixed_dtype(self):
+        # GH3480, apply with mixed dtype on axis=1 breaks in 0.11
+        df = DataFrame({'foo1': ['one', 'two', 'two', 'three', 'one', 'two'],
+                        'foo2': np.random.randn(6)})
+        result = df.apply(lambda x: x, axis=1)
+        assert_series_equal(df.get_dtype_counts(), result.get_dtype_counts())
+
+        # GH 3610 incorrect dtype conversion with as_index=False
+        df = DataFrame({"c1": [1, 2, 6, 6, 8]})
+        df["c2"] = df.c1 / 2.0
+        result1 = df.groupby("c2").mean().reset_index().c2
+        result2 = df.groupby("c2", as_index=False).mean().c2
+        assert_series_equal(result1, result2)
+
+    def test_groupby_aggregation_mixed_dtype(self):
+
+        # GH 6212
+        expected = DataFrame({
+            'v1': [5, 5, 7, np.nan, 3, 3, 4, 1],
+            'v2': [55, 55, 77, np.nan, 33, 33, 44, 11]},
+            index=MultiIndex.from_tuples([(1, 95), (1, 99), (2, 95), (2, 99),
+                                          ('big', 'damp'),
+                                          ('blue', 'dry'),
+                                          ('red', 'red'), ('red', 'wet')],
+                                         names=['by1', 'by2']))
+
+        df = DataFrame({
+            'v1': [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
+            'v2': [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
+            'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan,
+                    12],
+            'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99,
+                    np.nan, np.nan]
+        })
+
+        g = df.groupby(['by1', 'by2'])
+        result = g[['v1', 'v2']].mean()
+        assert_frame_equal(result, expected)
+
+    def test_groupby_dtype_inference_empty(self):
+        # GH 6733
+        df = DataFrame({'x': [], 'range': np.arange(0, dtype='int64')})
+        assert df['x'].dtype == np.float64
+
+        result = df.groupby('x').first()
+        exp_index = Index([], name='x', dtype=np.float64)
+        expected = DataFrame({'range': Series(
+            [], index=exp_index, dtype='int64')})
+        assert_frame_equal(result, expected, by_blocks=True)
+
+    def test_groupby_list_infer_array_like(self):
+        result = self.df.groupby(list(self.df['A'])).mean()
+        expected = self.df.groupby(self.df['A']).mean()
+        assert_frame_equal(result, expected, check_names=False)
+
+        pytest.raises(Exception, self.df.groupby, list(self.df['A'][:-1]))
+
+        # pathological case of ambiguity
+        df = DataFrame({'foo': [0, 1],
+                        'bar': [3, 4],
+                        'val': np.random.randn(2)})
+
+        result = df.groupby(['foo', 'bar']).mean()
+        expected = df.groupby([df['foo'], df['bar']]).mean()[['val']]
+        assert_frame_equal(result, expected)
+
+    def test_groupby_keys_same_size_as_index(self):
+        # GH 11185
+        freq = 's'
+        index = pd.date_range(start=pd.Timestamp('2015-09-29T11:34:44-0700'),
+                              periods=2, freq=freq)
+        df = pd.DataFrame([['A', 10], ['B', 15]], columns=[
+            'metric', 'values'
+        ], index=index)
+        result = df.groupby([pd.Grouper(level=0, freq=freq), 'metric']).mean()
+        expected = df.set_index([df.index, 'metric'])
+
+        assert_frame_equal(result, expected)
+
+    def test_groupby_one_row(self):
+        # GH 11741
+        df1 = pd.DataFrame(np.random.randn(1, 4), columns=list('ABCD'))
+        pytest.raises(KeyError, df1.groupby, 'Z')
+        df2 = pd.DataFrame(np.random.randn(2, 4), columns=list('ABCD'))
+        pytest.raises(KeyError, df2.groupby, 'Z')
+
+    def test_groupby_nat_exclude(self):
+        # GH 6992
+        df = pd.DataFrame(
+            {'values': np.random.randn(8),
+             'dt': [np.nan, pd.Timestamp('2013-01-01'), np.nan, pd.Timestamp(
+                 '2013-02-01'), np.nan, pd.Timestamp('2013-02-01'), np.nan,
+                 pd.Timestamp('2013-01-01')],
+             'str': [np.nan, 'a', np.nan, 'a', np.nan, 'a', np.nan, 'b']})
+        grouped = df.groupby('dt')
+
+        expected = [pd.Index([1, 7]), pd.Index([3, 5])]
+        keys = sorted(grouped.groups.keys())
+        assert len(keys) == 2
+        for k, e in zip(keys, expected):
+            # grouped.groups keys are np.datetime64 with the system tz;
+            # compare only the values so the check is tz-independent
+            tm.assert_index_equal(grouped.groups[k], e)
+
+        # confirm obj is not filtered
+        tm.assert_frame_equal(grouped.grouper.groupings[0].obj, df)
+        assert grouped.ngroups == 2
+
+        expected = {
+            Timestamp('2013-01-01 00:00:00'): np.array([1, 7], dtype=np.int64),
+            Timestamp('2013-02-01 00:00:00'): np.array([3, 5], dtype=np.int64)
+        }
+
+        for k in grouped.indices:
+            tm.assert_numpy_array_equal(grouped.indices[k], expected[k])
+
+        tm.assert_frame_equal(
+            grouped.get_group(Timestamp('2013-01-01')), df.iloc[[1, 7]])
+        tm.assert_frame_equal(
+            grouped.get_group(Timestamp('2013-02-01')), 
df.iloc[[3, 5]]) + + pytest.raises(KeyError, grouped.get_group, pd.NaT) + + nan_df = DataFrame({'nan': [np.nan, np.nan, np.nan], + 'nat': [pd.NaT, pd.NaT, pd.NaT]}) + assert nan_df['nan'].dtype == 'float64' + assert nan_df['nat'].dtype == 'datetime64[ns]' + + for key in ['nan', 'nat']: + grouped = nan_df.groupby(key) + assert grouped.groups == {} + assert grouped.ngroups == 0 + assert grouped.indices == {} + pytest.raises(KeyError, grouped.get_group, np.nan) + pytest.raises(KeyError, grouped.get_group, pd.NaT) + + def test_dictify(self): + dict(iter(self.df.groupby('A'))) + dict(iter(self.df.groupby(['A', 'B']))) + dict(iter(self.df['C'].groupby(self.df['A']))) + dict(iter(self.df['C'].groupby([self.df['A'], self.df['B']]))) + dict(iter(self.df.groupby('A')['C'])) + dict(iter(self.df.groupby(['A', 'B'])['C'])) + + def test_sparse_friendly(self): + sdf = self.df[['C', 'D']].to_sparse() + with catch_warnings(record=True): + panel = tm.makePanel() + tm.add_nans(panel) + + def _check_work(gp): + gp.mean() + gp.agg(np.mean) + dict(iter(gp)) + + # it works! + _check_work(sdf.groupby(lambda x: x // 2)) + _check_work(sdf['C'].groupby(lambda x: x // 2)) + _check_work(sdf.groupby(self.df['A'])) + + # do this someday + # _check_work(panel.groupby(lambda x: x.month, axis=1)) + + def test_panel_groupby(self): + with catch_warnings(record=True): + self.panel = tm.makePanel() + tm.add_nans(self.panel) + grouped = self.panel.groupby({'ItemA': 0, 'ItemB': 0, 'ItemC': 1}, + axis='items') + agged = grouped.mean() + agged2 = grouped.agg(lambda x: x.mean('items')) + + tm.assert_panel_equal(agged, agged2) + + tm.assert_index_equal(agged.items, Index([0, 1])) + + grouped = self.panel.groupby(lambda x: x.month, axis='major') + agged = grouped.mean() + + exp = Index(sorted(list(set(self.panel.major_axis.month)))) + tm.assert_index_equal(agged.major_axis, exp) + + grouped = self.panel.groupby({'A': 0, 'B': 0, 'C': 1, 'D': 1}, + axis='minor') + agged = grouped.mean() + tm.assert_index_equal(agged.minor_axis, Index([0, 1])) + + def test_groupby_2d_malformed(self): + d = DataFrame(index=lrange(2)) + d['group'] = ['g1', 'g2'] + d['zeros'] = [0, 0] + d['ones'] = [1, 1] + d['label'] = ['l1', 'l2'] + tmp = d.groupby(['group']).mean() + res_values = np.array([[0, 1], [0, 1]], dtype=np.int64) + tm.assert_index_equal(tmp.columns, Index(['zeros', 'ones'])) + tm.assert_numpy_array_equal(tmp.values, res_values) + + def test_int32_overflow(self): + B = np.concatenate((np.arange(10000), np.arange(10000), np.arange(5000) + )) + A = np.arange(25000) + df = DataFrame({'A': A, + 'B': B, + 'C': A, + 'D': B, + 'E': np.random.randn(25000)}) + + left = df.groupby(['A', 'B', 'C', 'D']).sum() + right = df.groupby(['D', 'C', 'B', 'A']).sum() + assert len(left) == len(right) + + def test_groupby_sort_multi(self): + df = DataFrame({'a': ['foo', 'bar', 'baz'], + 'b': [3, 2, 1], + 'c': [0, 1, 2], + 'd': np.random.randn(3)}) + + tups = lmap(tuple, df[['a', 'b', 'c']].values) + tups = com._asarray_tuplesafe(tups) + result = df.groupby(['a', 'b', 'c'], sort=True).sum() + tm.assert_numpy_array_equal(result.index.values, tups[[1, 2, 0]]) + + tups = lmap(tuple, df[['c', 'a', 'b']].values) + tups = com._asarray_tuplesafe(tups) + result = df.groupby(['c', 'a', 'b'], sort=True).sum() + tm.assert_numpy_array_equal(result.index.values, tups) + + tups = lmap(tuple, df[['b', 'c', 'a']].values) + tups = com._asarray_tuplesafe(tups) + result = df.groupby(['b', 'c', 'a'], sort=True).sum() + tm.assert_numpy_array_equal(result.index.values, tups[[2, 1, 0]]) + + 
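+        # now sum a single column over two keys and validate against the
+        # _check_groupby helper defined at module level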
df = DataFrame({'a': [0, 1, 2, 0, 1, 2], + 'b': [0, 0, 0, 1, 1, 1], + 'd': np.random.randn(6)}) + grouped = df.groupby(['a', 'b'])['d'] + result = grouped.sum() + _check_groupby(df, result, ['a', 'b'], 'd') + + def test_intercept_builtin_sum(self): + s = Series([1., 2., np.nan, 3.]) + grouped = s.groupby([0, 1, 2, 2]) + + result = grouped.agg(builtins.sum) + result2 = grouped.apply(builtins.sum) + expected = grouped.sum() + assert_series_equal(result, expected) + assert_series_equal(result2, expected) + + def test_column_select_via_attr(self): + result = self.df.groupby('A').C.sum() + expected = self.df.groupby('A')['C'].sum() + assert_series_equal(result, expected) + + self.df['mean'] = 1.5 + result = self.df.groupby('A').mean() + expected = self.df.groupby('A').agg(np.mean) + assert_frame_equal(result, expected) + + def test_rank_apply(self): + lev1 = tm.rands_array(10, 100) + lev2 = tm.rands_array(10, 130) + lab1 = np.random.randint(0, 100, size=500) + lab2 = np.random.randint(0, 130, size=500) + + df = DataFrame({'value': np.random.randn(500), + 'key1': lev1.take(lab1), + 'key2': lev2.take(lab2)}) + + result = df.groupby(['key1', 'key2']).value.rank() + + expected = [] + for key, piece in df.groupby(['key1', 'key2']): + expected.append(piece.value.rank()) + expected = concat(expected, axis=0) + expected = expected.reindex(result.index) + assert_series_equal(result, expected) + + result = df.groupby(['key1', 'key2']).value.rank(pct=True) + + expected = [] + for key, piece in df.groupby(['key1', 'key2']): + expected.append(piece.value.rank(pct=True)) + expected = concat(expected, axis=0) + expected = expected.reindex(result.index) + assert_series_equal(result, expected) + + def test_dont_clobber_name_column(self): + df = DataFrame({'key': ['a', 'a', 'a', 'b', 'b', 'b'], + 'name': ['foo', 'bar', 'baz'] * 2}) + + result = df.groupby('key').apply(lambda x: x) + assert_frame_equal(result, df) + + def test_skip_group_keys(self): + from pandas import concat + + tsf = tm.makeTimeDataFrame() + + grouped = tsf.groupby(lambda x: x.month, group_keys=False) + result = grouped.apply(lambda x: x.sort_values(by='A')[:3]) + + pieces = [] + for key, group in grouped: + pieces.append(group.sort_values(by='A')[:3]) + + expected = concat(pieces) + assert_frame_equal(result, expected) + + grouped = tsf['A'].groupby(lambda x: x.month, group_keys=False) + result = grouped.apply(lambda x: x.sort_values()[:3]) + + pieces = [] + for key, group in grouped: + pieces.append(group.sort_values()[:3]) + + expected = concat(pieces) + assert_series_equal(result, expected) + + def test_no_nonsense_name(self): + # GH #995 + s = self.frame['C'].copy() + s.name = None + + result = s.groupby(self.frame['A']).agg(np.sum) + assert result.name is None + + def test_multifunc_sum_bug(self): + # GH #1065 + x = DataFrame(np.arange(9).reshape(3, 3)) + x['test'] = 0 + x['fl'] = [1.3, 1.5, 1.6] + + grouped = x.groupby('test') + result = grouped.agg({'fl': 'sum', 2: 'size'}) + assert result['fl'].dtype == np.float64 + + def test_handle_dict_return_value(self): + def f(group): + return {'min': group.min(), 'max': group.max()} + + def g(group): + return Series({'min': group.min(), 'max': group.max()}) + + result = self.df.groupby('A')['C'].apply(f) + expected = self.df.groupby('A')['C'].apply(g) + + assert isinstance(result, Series) + assert_series_equal(result, expected) + + def test_getitem_list_of_columns(self): + df = DataFrame( + {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], + 'B': ['one', 'one', 'two', 'three', 
'two', 'two', 'one', 'three'], + 'C': np.random.randn(8), + 'D': np.random.randn(8), + 'E': np.random.randn(8)}) + + result = df.groupby('A')[['C', 'D']].mean() + result2 = df.groupby('A')['C', 'D'].mean() + result3 = df.groupby('A')[df.columns[2:4]].mean() + + expected = df.loc[:, ['A', 'C', 'D']].groupby('A').mean() + + assert_frame_equal(result, expected) + assert_frame_equal(result2, expected) + assert_frame_equal(result3, expected) + + def test_getitem_numeric_column_names(self): + # GH #13731 + df = DataFrame({0: list('abcd') * 2, + 2: np.random.randn(8), + 4: np.random.randn(8), + 6: np.random.randn(8)}) + result = df.groupby(0)[df.columns[1:3]].mean() + result2 = df.groupby(0)[2, 4].mean() + result3 = df.groupby(0)[[2, 4]].mean() + + expected = df.loc[:, [0, 2, 4]].groupby(0).mean() + + assert_frame_equal(result, expected) + assert_frame_equal(result2, expected) + assert_frame_equal(result3, expected) + + def test_set_group_name(self): + def f(group): + assert group.name is not None + return group + + def freduce(group): + assert group.name is not None + return group.sum() + + def foo(x): + return freduce(x) + + def _check_all(grouped): + # make sure all these work + grouped.apply(f) + grouped.aggregate(freduce) + grouped.aggregate({'C': freduce, 'D': freduce}) + grouped.transform(f) + + grouped['C'].apply(f) + grouped['C'].aggregate(freduce) + grouped['C'].aggregate([freduce, foo]) + grouped['C'].transform(f) + + _check_all(self.df.groupby('A')) + _check_all(self.df.groupby(['A', 'B'])) + + def test_group_name_available_in_inference_pass(self): + # gh-15062 + df = pd.DataFrame({'a': [0, 0, 1, 1, 2, 2], 'b': np.arange(6)}) + + names = [] + + def f(group): + names.append(group.name) + return group.copy() + + df.groupby('a', sort=False, group_keys=False).apply(f) + # we expect 2 zeros because we call ``f`` once to see if a faster route + # can be used. + expected_names = [0, 0, 1, 2] + assert names == expected_names + + def test_no_dummy_key_names(self): + # see gh-1291 + result = self.df.groupby(self.df['A'].values).sum() + assert result.index.name is None + + result = self.df.groupby([self.df['A'].values, self.df['B'].values + ]).sum() + assert result.index.names == (None, None) + + def test_groupby_sort_multiindex_series(self): + # series multiindex groupby sort argument was not being passed through + # _compress_group_index + # GH 9444 + index = MultiIndex(levels=[[1, 2], [1, 2]], + labels=[[0, 0, 0, 0, 1, 1], [1, 1, 0, 0, 0, 0]], + names=['a', 'b']) + mseries = Series([0, 1, 2, 3, 4, 5], index=index) + index = MultiIndex(levels=[[1, 2], [1, 2]], + labels=[[0, 0, 1], [1, 0, 0]], names=['a', 'b']) + mseries_result = Series([0, 2, 4], index=index) + + result = mseries.groupby(level=['a', 'b'], sort=False).first() + assert_series_equal(result, mseries_result) + result = mseries.groupby(level=['a', 'b'], sort=True).first() + assert_series_equal(result, mseries_result.sort_index()) + + def test_groupby_reindex_inside_function(self): + + periods = 1000 + ind = DatetimeIndex(start='2012/1/1', freq='5min', periods=periods) + df = DataFrame({'high': np.arange( + periods), 'low': np.arange(periods)}, index=ind) + + def agg_before(hour, func, fix=False): + """ + Run an aggregate func on the subset of data. 
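+
+            Only rows with index.hour < 11 are kept; passing fix=True forces
+            an element access on the group before aggregating.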
+ """ + + def _func(data): + d = data.select(lambda x: x.hour < 11).dropna() + if fix: + data[data.index[0]] + if len(d) == 0: + return None + return func(d) + + return _func + + def afunc(data): + d = data.select(lambda x: x.hour < 11).dropna() + return np.max(d) + + grouped = df.groupby(lambda x: datetime(x.year, x.month, x.day)) + closure_bad = grouped.agg({'high': agg_before(11, np.max)}) + closure_good = grouped.agg({'high': agg_before(11, np.max, True)}) + + assert_frame_equal(closure_bad, closure_good) + + def test_multiindex_columns_empty_level(self): + l = [['count', 'values'], ['to filter', '']] + midx = MultiIndex.from_tuples(l) + + df = DataFrame([[long(1), 'A']], columns=midx) + + grouped = df.groupby('to filter').groups + assert grouped['A'] == [0] + + grouped = df.groupby([('to filter', '')]).groups + assert grouped['A'] == [0] + + df = DataFrame([[long(1), 'A'], [long(2), 'B']], columns=midx) + + expected = df.groupby('to filter').groups + result = df.groupby([('to filter', '')]).groups + assert result == expected + + df = DataFrame([[long(1), 'A'], [long(2), 'A']], columns=midx) + + expected = df.groupby('to filter').groups + result = df.groupby([('to filter', '')]).groups + tm.assert_dict_equal(result, expected) + + def test_cython_median(self): + df = DataFrame(np.random.randn(1000)) + df.values[::2] = np.nan + + labels = np.random.randint(0, 50, size=1000).astype(float) + labels[::17] = np.nan + + result = df.groupby(labels).median() + exp = df.groupby(labels).agg(nanops.nanmedian) + assert_frame_equal(result, exp) + + df = DataFrame(np.random.randn(1000, 5)) + rs = df.groupby(labels).agg(np.median) + xp = df.groupby(labels).median() + assert_frame_equal(rs, xp) + + def test_median_empty_bins(self): + df = pd.DataFrame(np.random.randint(0, 44, 500)) + + grps = range(0, 55, 5) + bins = pd.cut(df[0], grps) + + result = df.groupby(bins).median() + expected = df.groupby(bins).agg(lambda x: x.median()) + assert_frame_equal(result, expected) + + def test_groupby_non_arithmetic_agg_types(self): + # GH9311, GH6620 + df = pd.DataFrame( + [{'a': 1, 'b': 1}, + {'a': 1, 'b': 2}, + {'a': 2, 'b': 3}, + {'a': 2, 'b': 4}]) + + dtypes = ['int8', 'int16', 'int32', 'int64', 'float32', 'float64'] + + grp_exp = {'first': {'df': [{'a': 1, 'b': 1}, {'a': 2, 'b': 3}]}, + 'last': {'df': [{'a': 1, 'b': 2}, {'a': 2, 'b': 4}]}, + 'min': {'df': [{'a': 1, 'b': 1}, {'a': 2, 'b': 3}]}, + 'max': {'df': [{'a': 1, 'b': 2}, {'a': 2, 'b': 4}]}, + 'nth': {'df': [{'a': 1, 'b': 2}, {'a': 2, 'b': 4}], + 'args': [1]}, + 'count': {'df': [{'a': 1, 'b': 2}, {'a': 2, 'b': 2}], + 'out_type': 'int64'}} + + for dtype in dtypes: + df_in = df.copy() + df_in['b'] = df_in.b.astype(dtype) + + for method, data in compat.iteritems(grp_exp): + if 'args' not in data: + data['args'] = [] + + if 'out_type' in data: + out_type = data['out_type'] + else: + out_type = dtype + + exp = data['df'] + df_out = pd.DataFrame(exp) + + df_out['b'] = df_out.b.astype(out_type) + df_out.set_index('a', inplace=True) + + grpd = df_in.groupby('a') + t = getattr(grpd, method)(*data['args']) + assert_frame_equal(t, df_out) + + def test_groupby_non_arithmetic_agg_intlike_precision(self): + # GH9311, GH6620 + c = 24650000000000000 + + inputs = ((Timestamp('2011-01-15 12:50:28.502376'), + Timestamp('2011-01-20 12:50:28.593448')), (1 + c, 2 + c)) + + for i in inputs: + df = pd.DataFrame([{'a': 1, 'b': i[0]}, {'a': 1, 'b': i[1]}]) + + grp_exp = {'first': {'expected': i[0]}, + 'last': {'expected': i[1]}, + 'min': {'expected': i[0]}, + 'max': {'expected': 
i[1]}, + 'nth': {'expected': i[1], + 'args': [1]}, + 'count': {'expected': 2}} + + for method, data in compat.iteritems(grp_exp): + if 'args' not in data: + data['args'] = [] + + grpd = df.groupby('a') + res = getattr(grpd, method)(*data['args']) + assert res.iloc[0].b == data['expected'] + + def test_groupby_multiindex_missing_pair(self): + # GH9049 + df = DataFrame({'group1': ['a', 'a', 'a', 'b'], + 'group2': ['c', 'c', 'd', 'c'], + 'value': [1, 1, 1, 5]}) + df = df.set_index(['group1', 'group2']) + df_grouped = df.groupby(level=['group1', 'group2'], sort=True) + + res = df_grouped.agg('sum') + idx = MultiIndex.from_tuples( + [('a', 'c'), ('a', 'd'), ('b', 'c')], names=['group1', 'group2']) + exp = DataFrame([[2], [1], [5]], index=idx, columns=['value']) + + tm.assert_frame_equal(res, exp) + + def test_groupby_multiindex_not_lexsorted(self): + # GH 11640 + + # define the lexsorted version + lexsorted_mi = MultiIndex.from_tuples( + [('a', ''), ('b1', 'c1'), ('b2', 'c2')], names=['b', 'c']) + lexsorted_df = DataFrame([[1, 3, 4]], columns=lexsorted_mi) + assert lexsorted_df.columns.is_lexsorted() + + # define the non-lexsorted version + not_lexsorted_df = DataFrame(columns=['a', 'b', 'c', 'd'], + data=[[1, 'b1', 'c1', 3], + [1, 'b2', 'c2', 4]]) + not_lexsorted_df = not_lexsorted_df.pivot_table( + index='a', columns=['b', 'c'], values='d') + not_lexsorted_df = not_lexsorted_df.reset_index() + assert not not_lexsorted_df.columns.is_lexsorted() + + # compare the results + tm.assert_frame_equal(lexsorted_df, not_lexsorted_df) + + expected = lexsorted_df.groupby('a').mean() + with tm.assert_produces_warning(PerformanceWarning): + result = not_lexsorted_df.groupby('a').mean() + tm.assert_frame_equal(expected, result) + + # a transforming function should work regardless of sort + # GH 14776 + df = DataFrame({'x': ['a', 'a', 'b', 'a'], + 'y': [1, 1, 2, 2], + 'z': [1, 2, 3, 4]}).set_index(['x', 'y']) + assert not df.index.is_lexsorted() + + for level in [0, 1, [0, 1]]: + for sort in [False, True]: + result = df.groupby(level=level, sort=sort).apply( + DataFrame.drop_duplicates) + expected = df + tm.assert_frame_equal(expected, result) + + result = df.sort_index().groupby(level=level, sort=sort).apply( + DataFrame.drop_duplicates) + expected = df.sort_index() + tm.assert_frame_equal(expected, result) + + def test_groupby_levels_and_columns(self): + # GH9344, GH9049 + idx_names = ['x', 'y'] + idx = pd.MultiIndex.from_tuples( + [(1, 1), (1, 2), (3, 4), (5, 6)], names=idx_names) + df = pd.DataFrame(np.arange(12).reshape(-1, 3), index=idx) + + by_levels = df.groupby(level=idx_names).mean() + # reset_index changes columns dtype to object + by_columns = df.reset_index().groupby(idx_names).mean() + + tm.assert_frame_equal(by_levels, by_columns, check_column_type=False) + + by_columns.columns = pd.Index(by_columns.columns, dtype=np.int64) + tm.assert_frame_equal(by_levels, by_columns) + + def test_gb_apply_list_of_unequal_len_arrays(self): + + # GH1738 + df = DataFrame({'group1': ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a', + 'b', 'b', 'b'], + 'group2': ['c', 'c', 'd', 'd', 'd', 'e', 'c', 'c', 'd', + 'd', 'd', 'e'], + 'weight': [1.1, 2, 3, 4, 5, 6, 2, 4, 6, 8, 1, 2], + 'value': [7.1, 8, 9, 10, 11, 12, 8, 7, 6, 5, 4, 3]}) + df = df.set_index(['group1', 'group2']) + df_grouped = df.groupby(level=['group1', 'group2'], sort=True) + + def noddy(value, weight): + out = np.array(value * weight).repeat(3) + return out + + # the kernel function returns arrays of unequal length + # pandas sniffs the first one, sees 
it's an array and not
+        # a list, and assumes the rest are of equal length,
+        # and so tries a vstack
+
+        # don't die
+        df_grouped.apply(lambda x: noddy(x.value, x.weight))
+
+    def test_groupby_with_empty(self):
+        index = pd.DatetimeIndex(())
+        data = ()
+        series = pd.Series(data, index)
+        grouper = pd.core.resample.TimeGrouper('D')
+        grouped = series.groupby(grouper)
+        assert next(iter(grouped), None) is None
+
+    def test_groupby_with_single_column(self):
+        df = pd.DataFrame({'a': list('abssbab')})
+        tm.assert_frame_equal(df.groupby('a').get_group('a'), df.iloc[[0, 5]])
+        # GH 13530
+        exp = pd.DataFrame([], index=pd.Index(['a', 'b', 's'], name='a'))
+        tm.assert_frame_equal(df.groupby('a').count(), exp)
+        tm.assert_frame_equal(df.groupby('a').sum(), exp)
+        tm.assert_frame_equal(df.groupby('a').nth(1), exp)
+
+    def test_groupby_with_small_elem(self):
+        # GH 8542
+        # length=2
+        df = pd.DataFrame({'event': ['start', 'start'],
+                           'change': [1234, 5678]},
+                          index=pd.DatetimeIndex(['2014-09-10',
+                                                  '2013-10-10']))
+        grouped = df.groupby([pd.TimeGrouper(freq='M'), 'event'])
+        assert len(grouped.groups) == 2
+        assert grouped.ngroups == 2
+        assert (pd.Timestamp('2014-09-30'), 'start') in grouped.groups
+        assert (pd.Timestamp('2013-10-31'), 'start') in grouped.groups
+
+        res = grouped.get_group((pd.Timestamp('2014-09-30'), 'start'))
+        tm.assert_frame_equal(res, df.iloc[[0], :])
+        res = grouped.get_group((pd.Timestamp('2013-10-31'), 'start'))
+        tm.assert_frame_equal(res, df.iloc[[1], :])
+
+        df = pd.DataFrame({'event': ['start', 'start', 'start'],
+                           'change': [1234, 5678, 9123]},
+                          index=pd.DatetimeIndex(['2014-09-10', '2013-10-10',
+                                                  '2014-09-15']))
+        grouped = df.groupby([pd.TimeGrouper(freq='M'), 'event'])
+        assert len(grouped.groups) == 2
+        assert grouped.ngroups == 2
+        assert (pd.Timestamp('2014-09-30'), 'start') in grouped.groups
+        assert (pd.Timestamp('2013-10-31'), 'start') in grouped.groups
+
+        res = grouped.get_group((pd.Timestamp('2014-09-30'), 'start'))
+        tm.assert_frame_equal(res, df.iloc[[0, 2], :])
+        res = grouped.get_group((pd.Timestamp('2013-10-31'), 'start'))
+        tm.assert_frame_equal(res, df.iloc[[1], :])
+
+        # length=3
+        df = pd.DataFrame({'event': ['start', 'start', 'start'],
+                           'change': [1234, 5678, 9123]},
+                          index=pd.DatetimeIndex(['2014-09-10', '2013-10-10',
+                                                  '2014-08-05']))
+        grouped = df.groupby([pd.TimeGrouper(freq='M'), 'event'])
+        assert len(grouped.groups) == 3
+        assert grouped.ngroups == 3
+        assert (pd.Timestamp('2014-09-30'), 'start') in grouped.groups
+        assert (pd.Timestamp('2013-10-31'), 'start') in grouped.groups
+        assert (pd.Timestamp('2014-08-31'), 'start') in grouped.groups
+
+        res = grouped.get_group((pd.Timestamp('2014-09-30'), 'start'))
+        tm.assert_frame_equal(res, df.iloc[[0], :])
+        res = grouped.get_group((pd.Timestamp('2013-10-31'), 'start'))
+        tm.assert_frame_equal(res, df.iloc[[1], :])
+        res = grouped.get_group((pd.Timestamp('2014-08-31'), 'start'))
+        tm.assert_frame_equal(res, df.iloc[[2], :])
+
+    def test_cumcount(self):
+        df = DataFrame([['a'], ['a'], ['a'], ['b'], ['a']], columns=['A'])
+        g = df.groupby('A')
+        sg = g.A
+
+        expected = Series([0, 1, 2, 0, 3])
+
+        assert_series_equal(expected, g.cumcount())
+        assert_series_equal(expected, sg.cumcount())
+
+    def test_cumcount_empty(self):
+        ge = DataFrame().groupby(level=0)
+        se = Series().groupby(level=0)
+
+        # edge case, as this is usually considered float
+        e = Series(dtype='int64')
+
+        assert_series_equal(e, ge.cumcount())
+        assert_series_equal(e, se.cumcount())
+
+    def test_cumcount_dupe_index(self):
+        df = DataFrame([['a'], ['a'], ['a'], ['b'], ['a']], columns=['A'],
+                       index=[0] * 5)
+        g = df.groupby('A')
+        sg = g.A
+
+        expected = Series([0, 1, 2, 0, 3], index=[0] * 5)
+
+        assert_series_equal(expected, g.cumcount())
+        assert_series_equal(expected, sg.cumcount())
+
+    def test_cumcount_mi(self):
+        mi = MultiIndex.from_tuples([[0, 1], [1, 2], [2, 2], [2, 2], [1, 0]])
+        df = DataFrame([['a'], ['a'], ['a'], ['b'], ['a']], columns=['A'],
+                       index=mi)
+        g = df.groupby('A')
+        sg = g.A
+
+        expected = Series([0, 1, 2, 0, 3], index=mi)
+
+        assert_series_equal(expected, g.cumcount())
+        assert_series_equal(expected, sg.cumcount())
+
+    def test_cumcount_groupby_not_col(self):
+        df = DataFrame([['a'], ['a'], ['a'], ['b'], ['a']], columns=['A'],
+                       index=[0] * 5)
+        g = df.groupby([0, 0, 0, 1, 0])
+        sg = g.A
+
+        expected = Series([0, 1, 2, 0, 3], index=[0] * 5)
+
+        assert_series_equal(expected, g.cumcount())
+        assert_series_equal(expected, sg.cumcount())
+
+    def test_fill_consistency(self):
+
+        # GH9221
+        # keyword arguments passed thru to the generated wrapper
+        # are set only if the passed kw is None
+        df = DataFrame(index=pd.MultiIndex.from_product(
+            [['value1', 'value2'], date_range('2014-01-01', '2014-01-06')]),
+            columns=Index(
+                ['1', '2'], name='id'))
+        df['1'] = [np.nan, 1, np.nan, np.nan, 11, np.nan, np.nan, 2, np.nan,
+                   np.nan, 22, np.nan]
+        df['2'] = [np.nan, 3, np.nan, np.nan, 33, np.nan, np.nan, 4, np.nan,
+                   np.nan, 44, np.nan]
+
+        expected = df.groupby(level=0, axis=0).fillna(method='ffill')
+        result = df.T.groupby(level=0, axis=1).fillna(method='ffill').T
+        assert_frame_equal(result, expected)
+
+    def test_index_label_overlaps_location(self):
+        # checking we don't have any label/location confusion in the
+        # wake of GH5375
+        df = DataFrame(list('ABCDE'), index=[2, 0, 2, 1, 1])
+        g = df.groupby(list('ababb'))
+        actual = g.filter(lambda x: len(x) > 2)
+        expected = df.iloc[[1, 3, 4]]
+        assert_frame_equal(actual, expected)
+
+        ser = df[0]
+        g = ser.groupby(list('ababb'))
+        actual = g.filter(lambda x: len(x) > 2)
+        expected = ser.take([1, 3, 4])
+        assert_series_equal(actual, expected)
+
+        # ... 
and again, with a generic Index of floats + df.index = df.index.astype(float) + g = df.groupby(list('ababb')) + actual = g.filter(lambda x: len(x) > 2) + expected = df.iloc[[1, 3, 4]] + assert_frame_equal(actual, expected) + + ser = df[0] + g = ser.groupby(list('ababb')) + actual = g.filter(lambda x: len(x) > 2) + expected = ser.take([1, 3, 4]) + assert_series_equal(actual, expected) + + def test_lower_int_prec_count(self): + df = DataFrame({'a': np.array( + [0, 1, 2, 100], np.int8), + 'b': np.array( + [1, 2, 3, 6], np.uint32), + 'c': np.array( + [4, 5, 6, 8], np.int16), + 'grp': list('ab' * 2)}) + result = df.groupby('grp').count() + expected = DataFrame({'a': [2, 2], + 'b': [2, 2], + 'c': [2, 2]}, index=pd.Index(list('ab'), + name='grp')) + tm.assert_frame_equal(result, expected) + + def test_count_uses_size_on_exception(self): + class RaisingObjectException(Exception): + pass + + class RaisingObject(object): + + def __init__(self, msg='I will raise inside Cython'): + super(RaisingObject, self).__init__() + self.msg = msg + + def __eq__(self, other): + # gets called in Cython to check that raising calls the method + raise RaisingObjectException(self.msg) + + df = DataFrame({'a': [RaisingObject() for _ in range(4)], + 'grp': list('ab' * 2)}) + result = df.groupby('grp').count() + expected = DataFrame({'a': [2, 2]}, index=pd.Index( + list('ab'), name='grp')) + tm.assert_frame_equal(result, expected) + + def test_groupby_cumprod(self): + # GH 4095 + df = pd.DataFrame({'key': ['b'] * 10, 'value': 2}) + + actual = df.groupby('key')['value'].cumprod() + expected = df.groupby('key')['value'].apply(lambda x: x.cumprod()) + expected.name = 'value' + tm.assert_series_equal(actual, expected) + + df = pd.DataFrame({'key': ['b'] * 100, 'value': 2}) + actual = df.groupby('key')['value'].cumprod() + # if overflows, groupby product casts to float + # while numpy passes back invalid values + df['value'] = df['value'].astype(float) + expected = df.groupby('key')['value'].apply(lambda x: x.cumprod()) + expected.name = 'value' + tm.assert_series_equal(actual, expected) + + def test_ops_general(self): + ops = [('mean', np.mean), + ('median', np.median), + ('std', np.std), + ('var', np.var), + ('sum', np.sum), + ('prod', np.prod), + ('min', np.min), + ('max', np.max), + ('first', lambda x: x.iloc[0]), + ('last', lambda x: x.iloc[-1]), + ('count', np.size), ] + try: + from scipy.stats import sem + except ImportError: + pass + else: + ops.append(('sem', sem)) + df = DataFrame(np.random.randn(1000)) + labels = np.random.randint(0, 50, size=1000).astype(float) + + for op, targop in ops: + result = getattr(df.groupby(labels), op)().astype(float) + expected = df.groupby(labels).agg(targop) + try: + tm.assert_frame_equal(result, expected) + except BaseException as exc: + exc.args += ('operation: %s' % op, ) + raise + + def test_max_nan_bug(self): + raw = """,Date,app,File +2013-04-23,2013-04-23 00:00:00,,log080001.log +2013-05-06,2013-05-06 00:00:00,,log.log +2013-05-07,2013-05-07 00:00:00,OE,xlsx""" + + df = pd.read_csv(StringIO(raw), parse_dates=[0]) + gb = df.groupby('Date') + r = gb[['File']].max() + e = gb['File'].max().to_frame() + tm.assert_frame_equal(r, e) + assert not r['File'].isnull().any() + + def test_nlargest(self): + a = Series([1, 3, 5, 7, 2, 9, 0, 4, 6, 10]) + b = Series(list('a' * 5 + 'b' * 5)) + gb = a.groupby(b) + r = gb.nlargest(3) + e = Series([ + 7, 5, 3, 10, 9, 6 + ], index=MultiIndex.from_arrays([list('aaabbb'), [3, 2, 1, 9, 5, 8]])) + tm.assert_series_equal(r, e) + + a = Series([1, 1, 3, 
2, 0, 3, 3, 2, 1, 0]) + gb = a.groupby(b) + e = Series([ + 3, 2, 1, 3, 3, 2 + ], index=MultiIndex.from_arrays([list('aaabbb'), [2, 3, 1, 6, 5, 7]])) + assert_series_equal(gb.nlargest(3, keep='last'), e) + + def test_nsmallest(self): + a = Series([1, 3, 5, 7, 2, 9, 0, 4, 6, 10]) + b = Series(list('a' * 5 + 'b' * 5)) + gb = a.groupby(b) + r = gb.nsmallest(3) + e = Series([ + 1, 2, 3, 0, 4, 6 + ], index=MultiIndex.from_arrays([list('aaabbb'), [0, 4, 1, 6, 7, 8]])) + tm.assert_series_equal(r, e) + + a = Series([1, 1, 3, 2, 0, 3, 3, 2, 1, 0]) + gb = a.groupby(b) + e = Series([ + 0, 1, 1, 0, 1, 2 + ], index=MultiIndex.from_arrays([list('aaabbb'), [4, 1, 0, 9, 8, 7]])) + assert_series_equal(gb.nsmallest(3, keep='last'), e) + + def test_transform_doesnt_clobber_ints(self): + # GH 7972 + n = 6 + x = np.arange(n) + df = DataFrame({'a': x // 2, 'b': 2.0 * x, 'c': 3.0 * x}) + df2 = DataFrame({'a': x // 2 * 1.0, 'b': 2.0 * x, 'c': 3.0 * x}) + + gb = df.groupby('a') + result = gb.transform('mean') + + gb2 = df2.groupby('a') + expected = gb2.transform('mean') + tm.assert_frame_equal(result, expected) + + def test_groupby_apply_all_none(self): + # Tests to make sure no errors if apply function returns all None + # values. Issue 9684. + test_df = DataFrame({'groups': [0, 0, 1, 1], + 'random_vars': [8, 7, 4, 5]}) + + def test_func(x): + pass + + result = test_df.groupby('groups').apply(test_func) + expected = DataFrame() + tm.assert_frame_equal(result, expected) + + def test_groupby_apply_none_first(self): + # GH 12824. Tests if apply returns None first. + test_df1 = DataFrame({'groups': [1, 1, 1, 2], 'vars': [0, 1, 2, 3]}) + test_df2 = DataFrame({'groups': [1, 2, 2, 2], 'vars': [0, 1, 2, 3]}) + + def test_func(x): + if x.shape[0] < 2: + return None + return x.iloc[[0, -1]] + + result1 = test_df1.groupby('groups').apply(test_func) + result2 = test_df2.groupby('groups').apply(test_func) + index1 = MultiIndex.from_arrays([[1, 1], [0, 2]], + names=['groups', None]) + index2 = MultiIndex.from_arrays([[2, 2], [1, 3]], + names=['groups', None]) + expected1 = DataFrame({'groups': [1, 1], 'vars': [0, 2]}, + index=index1) + expected2 = DataFrame({'groups': [2, 2], 'vars': [1, 3]}, + index=index2) + tm.assert_frame_equal(result1, expected1) + tm.assert_frame_equal(result2, expected2) + + def test_groupby_preserves_sort(self): + # Test to ensure that groupby always preserves sort order of original + # object. 
Issue #8588 and #9651
+
+        df = DataFrame(
+            {'int_groups': [3, 1, 0, 1, 0, 3, 3, 3],
+             'string_groups': ['z', 'a', 'z', 'a', 'a', 'g', 'g', 'g'],
+             'ints': [8, 7, 4, 5, 2, 9, 1, 1],
+             'floats': [2.3, 5.3, 6.2, -2.4, 2.2, 1.1, 1.1, 5],
+             'strings': ['z', 'd', 'a', 'e', 'word', 'word2', '42', '47']})
+
+        # Try sorting on different types and with different group types
+        for sort_column in ['ints', 'floats', 'strings', ['ints', 'floats'],
+                            ['ints', 'strings']]:
+            for group_column in ['int_groups', 'string_groups',
+                                 ['int_groups', 'string_groups']]:
+
+                df = df.sort_values(by=sort_column)
+
+                g = df.groupby(group_column)
+
+                def test_sort(x):
+                    assert_frame_equal(x, x.sort_values(by=sort_column))
+
+                g.apply(test_sort)
+
+    def test_nunique_with_object(self):
+        # GH 11077
+        data = pd.DataFrame(
+            [[100, 1, 'Alice'],
+             [200, 2, 'Bob'],
+             [300, 3, 'Charlie'],
+             [-400, 4, 'Dan'],
+             [500, 5, 'Edith']],
+            columns=['amount', 'id', 'name']
+        )
+
+        result = data.groupby(['id', 'amount'])['name'].nunique()
+        index = MultiIndex.from_arrays([data.id, data.amount])
+        expected = pd.Series([1] * 5, name='name', index=index)
+        tm.assert_series_equal(result, expected)
+
+    def test_nunique_with_empty_series(self):
+        # GH 12553
+        data = pd.Series(name='name')
+        result = data.groupby(level=0).nunique()
+        expected = pd.Series(name='name', dtype='int64')
+        tm.assert_series_equal(result, expected)
+
+    def test_nunique_with_timegrouper(self):
+        # GH 13453
+        test = pd.DataFrame({
+            'time': [Timestamp('2016-06-28 09:35:35'),
+                     Timestamp('2016-06-28 16:09:30'),
+                     Timestamp('2016-06-28 16:46:28')],
+            'data': ['1', '2', '3']}).set_index('time')
+        result = test.groupby(pd.TimeGrouper(freq='h'))['data'].nunique()
+        expected = test.groupby(
+            pd.TimeGrouper(freq='h')
+        )['data'].apply(pd.Series.nunique)
+        tm.assert_series_equal(result, expected)
+
+    def test_numpy_compat(self):
+        # see gh-12811
+        df = pd.DataFrame({'A': [1, 2, 1], 'B': [1, 2, 3]})
+        g = df.groupby('A')
+
+        msg = "numpy operations are not valid with groupby"
+
+        for func in ('mean', 'var', 'std', 'cumprod', 'cumsum'):
+            tm.assert_raises_regex(UnsupportedFunctionCall, msg,
+                                   getattr(g, func), 1, 2, 3)
+            tm.assert_raises_regex(UnsupportedFunctionCall, msg,
+                                   getattr(g, func), foo=1)
+
+    def test_grouping_string_repr(self):
+        # GH 13394
+        mi = MultiIndex.from_arrays([list("AAB"), list("aba")])
+        df = DataFrame([[1, 2, 3]], columns=mi)
+        gr = df.groupby(df[('A', 'a')])
+
+        result = gr.grouper.groupings[0].__repr__()
+        expected = "Grouping(('A', 'a'))"
+        assert result == expected
+
+    def test_group_shift_with_null_key(self):
+        # This test is designed to replicate the segfault in issue #13813.
+        n_rows = 1200
+
+        # Generate a moderately large dataframe with occasional missing
+        # values in column `B`, and then group by [`A`, `B`]. This should
+        # force `-1` in the `labels` array of `g.grouper.group_info` exactly
+        # at those places where the group-by key is partially missing.
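+        # (column `B` is NaN whenever i % 3 == 0, i.e. in every third row)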
+ df = DataFrame([(i % 12, i % 3 if i % 3 else np.nan, i) + for i in range(n_rows)], dtype=float, + columns=["A", "B", "Z"], index=None) + g = df.groupby(["A", "B"]) + + expected = DataFrame([(i + 12 if i % 3 and i < n_rows - 12 + else np.nan) + for i in range(n_rows)], dtype=float, + columns=["Z"], index=None) + result = g.shift(-1) + + assert_frame_equal(result, expected) + + def test_pivot_table_values_key_error(self): + # This test is designed to replicate the error in issue #14938 + df = pd.DataFrame({'eventDate': + pd.date_range(pd.datetime.today(), + periods=20, freq='M').tolist(), + 'thename': range(0, 20)}) + + df['year'] = df.set_index('eventDate').index.year + df['month'] = df.set_index('eventDate').index.month + + with pytest.raises(KeyError): + df.reset_index().pivot_table(index='year', columns='month', + values='badname', aggfunc='count') + + def test_cummin_cummax(self): + # GH 15048 + num_types = [np.int32, np.int64, np.float32, np.float64] + num_mins = [np.iinfo(np.int32).min, np.iinfo(np.int64).min, + np.finfo(np.float32).min, np.finfo(np.float64).min] + num_max = [np.iinfo(np.int32).max, np.iinfo(np.int64).max, + np.finfo(np.float32).max, np.finfo(np.float64).max] + base_df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2], + 'B': [3, 4, 3, 2, 2, 3, 2, 1]}) + expected_mins = [3, 3, 3, 2, 2, 2, 2, 1] + expected_maxs = [3, 4, 4, 4, 2, 3, 3, 3] + + for dtype, min_val, max_val in zip(num_types, num_mins, num_max): + df = base_df.astype(dtype) + + # cummin + expected = pd.DataFrame({'B': expected_mins}).astype(dtype) + result = df.groupby('A').cummin() + tm.assert_frame_equal(result, expected) + result = df.groupby('A').B.apply(lambda x: x.cummin()).to_frame() + tm.assert_frame_equal(result, expected) + + # Test cummin w/ min value for dtype + df.loc[[2, 6], 'B'] = min_val + expected.loc[[2, 3, 6, 7], 'B'] = min_val + result = df.groupby('A').cummin() + tm.assert_frame_equal(result, expected) + expected = df.groupby('A').B.apply(lambda x: x.cummin()).to_frame() + tm.assert_frame_equal(result, expected) + + # cummax + expected = pd.DataFrame({'B': expected_maxs}).astype(dtype) + result = df.groupby('A').cummax() + tm.assert_frame_equal(result, expected) + result = df.groupby('A').B.apply(lambda x: x.cummax()).to_frame() + tm.assert_frame_equal(result, expected) + + # Test cummax w/ max value for dtype + df.loc[[2, 6], 'B'] = max_val + expected.loc[[2, 3, 6, 7], 'B'] = max_val + result = df.groupby('A').cummax() + tm.assert_frame_equal(result, expected) + expected = df.groupby('A').B.apply(lambda x: x.cummax()).to_frame() + tm.assert_frame_equal(result, expected) + + # Test nan in some values + base_df.loc[[0, 2, 4, 6], 'B'] = np.nan + expected = pd.DataFrame({'B': [np.nan, 4, np.nan, 2, + np.nan, 3, np.nan, 1]}) + result = base_df.groupby('A').cummin() + tm.assert_frame_equal(result, expected) + expected = (base_df.groupby('A') + .B + .apply(lambda x: x.cummin()) + .to_frame()) + tm.assert_frame_equal(result, expected) + + expected = pd.DataFrame({'B': [np.nan, 4, np.nan, 4, + np.nan, 3, np.nan, 3]}) + result = base_df.groupby('A').cummax() + tm.assert_frame_equal(result, expected) + expected = (base_df.groupby('A') + .B + .apply(lambda x: x.cummax()) + .to_frame()) + tm.assert_frame_equal(result, expected) + + # Test nan in entire column + base_df['B'] = np.nan + expected = pd.DataFrame({'B': [np.nan] * 8}) + result = base_df.groupby('A').cummin() + tm.assert_frame_equal(expected, result) + result = base_df.groupby('A').B.apply(lambda x: x.cummin()).to_frame() + 
tm.assert_frame_equal(expected, result) + result = base_df.groupby('A').cummax() + tm.assert_frame_equal(expected, result) + result = base_df.groupby('A').B.apply(lambda x: x.cummax()).to_frame() + tm.assert_frame_equal(expected, result) + + # GH 15561 + df = pd.DataFrame(dict(a=[1], b=pd.to_datetime(['2001']))) + expected = pd.Series(pd.to_datetime('2001'), index=[0], name='b') + for method in ['cummax', 'cummin']: + result = getattr(df.groupby('a')['b'], method)() + tm.assert_series_equal(expected, result) + + # GH 15635 + df = pd.DataFrame(dict(a=[1, 2, 1], b=[2, 1, 1])) + result = df.groupby('a').b.cummax() + expected = pd.Series([2, 1, 2], name='b') + tm.assert_series_equal(result, expected) + + df = pd.DataFrame(dict(a=[1, 2, 1], b=[1, 2, 2])) + result = df.groupby('a').b.cummin() + expected = pd.Series([1, 2, 1], name='b') + tm.assert_series_equal(result, expected) + + def test_apply_numeric_coercion_when_datetime(self): + # In the past, group-by/apply operations have been over-eager + # in converting dtypes to numeric, in the presence of datetime + # columns. Various GH issues were filed, the reproductions + # for which are here. + + # GH 15670 + df = pd.DataFrame({'Number': [1, 2], + 'Date': ["2017-03-02"] * 2, + 'Str': ["foo", "inf"]}) + expected = df.groupby(['Number']).apply(lambda x: x.iloc[0]) + df.Date = pd.to_datetime(df.Date) + result = df.groupby(['Number']).apply(lambda x: x.iloc[0]) + tm.assert_series_equal(result['Str'], expected['Str']) + + # GH 15421 + df = pd.DataFrame({'A': [10, 20, 30], + 'B': ['foo', '3', '4'], + 'T': [pd.Timestamp("12:31:22")] * 3}) + + def get_B(g): + return g.iloc[0][['B']] + result = df.groupby('A').apply(get_B)['B'] + expected = df.B + expected.index = df.A + tm.assert_series_equal(result, expected) + + # GH 14423 + def predictions(tool): + out = pd.Series(index=['p1', 'p2', 'useTime'], dtype=object) + if 'step1' in list(tool.State): + out['p1'] = str(tool[tool.State == 'step1'].Machine.values[0]) + if 'step2' in list(tool.State): + out['p2'] = str(tool[tool.State == 'step2'].Machine.values[0]) + out['useTime'] = str( + tool[tool.State == 'step2'].oTime.values[0]) + return out + df1 = pd.DataFrame({'Key': ['B', 'B', 'A', 'A'], + 'State': ['step1', 'step2', 'step1', 'step2'], + 'oTime': ['', '2016-09-19 05:24:33', + '', '2016-09-19 23:59:04'], + 'Machine': ['23', '36L', '36R', '36R']}) + df2 = df1.copy() + df2.oTime = pd.to_datetime(df2.oTime) + expected = df1.groupby('Key').apply(predictions).p1 + result = df2.groupby('Key').apply(predictions).p1 + tm.assert_series_equal(expected, result) + + +def _check_groupby(df, result, keys, field, f=lambda x: x.sum()): + tups = lmap(tuple, df[keys].values) + tups = com._asarray_tuplesafe(tups) + expected = f(df.groupby(tups)[field]) + for k, v in compat.iteritems(expected): + assert (result[k] == v) diff --git a/pandas/tests/groupby/test_nth.py b/pandas/tests/groupby/test_nth.py new file mode 100644 index 0000000000000..7912b4bf3bdf6 --- /dev/null +++ b/pandas/tests/groupby/test_nth.py @@ -0,0 +1,247 @@ +import numpy as np +import pandas as pd +from pandas import DataFrame, MultiIndex, Index, Series, isnull +from pandas.compat import lrange +from pandas.util.testing import assert_frame_equal, assert_series_equal + +from .common import MixIn + + +class TestNth(MixIn): + + def test_first_last_nth(self): + # tests for first / last / nth + grouped = self.df.groupby('A') + first = grouped.first() + expected = self.df.loc[[1, 0], ['B', 'C', 'D']] + expected.index = Index(['bar', 'foo'], name='A') + expected 
= expected.sort_index() + assert_frame_equal(first, expected) + + nth = grouped.nth(0) + assert_frame_equal(nth, expected) + + last = grouped.last() + expected = self.df.loc[[5, 7], ['B', 'C', 'D']] + expected.index = Index(['bar', 'foo'], name='A') + assert_frame_equal(last, expected) + + nth = grouped.nth(-1) + assert_frame_equal(nth, expected) + + nth = grouped.nth(1) + expected = self.df.loc[[2, 3], ['B', 'C', 'D']].copy() + expected.index = Index(['foo', 'bar'], name='A') + expected = expected.sort_index() + assert_frame_equal(nth, expected) + + # it works! + grouped['B'].first() + grouped['B'].last() + grouped['B'].nth(0) + + self.df.loc[self.df['A'] == 'foo', 'B'] = np.nan + assert isnull(grouped['B'].first()['foo']) + assert isnull(grouped['B'].last()['foo']) + assert isnull(grouped['B'].nth(0)['foo']) + + # v0.14.0 whatsnew + df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B']) + g = df.groupby('A') + result = g.first() + expected = df.iloc[[1, 2]].set_index('A') + assert_frame_equal(result, expected) + + expected = df.iloc[[1, 2]].set_index('A') + result = g.nth(0, dropna='any') + assert_frame_equal(result, expected) + + def test_first_last_nth_dtypes(self): + + df = self.df_mixed_floats.copy() + df['E'] = True + df['F'] = 1 + + # tests for first / last / nth + grouped = df.groupby('A') + first = grouped.first() + expected = df.loc[[1, 0], ['B', 'C', 'D', 'E', 'F']] + expected.index = Index(['bar', 'foo'], name='A') + expected = expected.sort_index() + assert_frame_equal(first, expected) + + last = grouped.last() + expected = df.loc[[5, 7], ['B', 'C', 'D', 'E', 'F']] + expected.index = Index(['bar', 'foo'], name='A') + expected = expected.sort_index() + assert_frame_equal(last, expected) + + nth = grouped.nth(1) + expected = df.loc[[3, 2], ['B', 'C', 'D', 'E', 'F']] + expected.index = Index(['bar', 'foo'], name='A') + expected = expected.sort_index() + assert_frame_equal(nth, expected) + + # GH 2763, first/last shifting dtypes + idx = lrange(10) + idx.append(9) + s = Series(data=lrange(11), index=idx, name='IntCol') + assert s.dtype == 'int64' + f = s.groupby(level=0).first() + assert f.dtype == 'int64' + + def test_nth(self): + df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B']) + g = df.groupby('A') + + assert_frame_equal(g.nth(0), df.iloc[[0, 2]].set_index('A')) + assert_frame_equal(g.nth(1), df.iloc[[1]].set_index('A')) + assert_frame_equal(g.nth(2), df.loc[[]].set_index('A')) + assert_frame_equal(g.nth(-1), df.iloc[[1, 2]].set_index('A')) + assert_frame_equal(g.nth(-2), df.iloc[[0]].set_index('A')) + assert_frame_equal(g.nth(-3), df.loc[[]].set_index('A')) + assert_series_equal(g.B.nth(0), df.set_index('A').B.iloc[[0, 2]]) + assert_series_equal(g.B.nth(1), df.set_index('A').B.iloc[[1]]) + assert_frame_equal(g[['B']].nth(0), + df.loc[[0, 2], ['A', 'B']].set_index('A')) + + exp = df.set_index('A') + assert_frame_equal(g.nth(0, dropna='any'), exp.iloc[[1, 2]]) + assert_frame_equal(g.nth(-1, dropna='any'), exp.iloc[[1, 2]]) + + exp['B'] = np.nan + assert_frame_equal(g.nth(7, dropna='any'), exp.iloc[[1, 2]]) + assert_frame_equal(g.nth(2, dropna='any'), exp.iloc[[1, 2]]) + + # out of bounds, regression from 0.13.1 + # GH 6621 + df = DataFrame({'color': {0: 'green', + 1: 'green', + 2: 'red', + 3: 'red', + 4: 'red'}, + 'food': {0: 'ham', + 1: 'eggs', + 2: 'eggs', + 3: 'ham', + 4: 'pork'}, + 'two': {0: 1.5456590000000001, + 1: -0.070345000000000005, + 2: -2.4004539999999999, + 3: 0.46206000000000003, + 4: 0.52350799999999997}, + 'one': {0: 
0.56573799999999996,
+                               1: -0.9742360000000001,
+                               2: 1.033801,
+                               3: -0.78543499999999999,
+                               4: 0.70422799999999997}}).set_index(['color',
+                                                                    'food'])
+
+        result = df.groupby(level=0, as_index=False).nth(2)
+        expected = df.iloc[[-1]]
+        assert_frame_equal(result, expected)
+
+        result = df.groupby(level=0, as_index=False).nth(3)
+        expected = df.loc[[]]
+        assert_frame_equal(result, expected)
+
+        # GH 7559
+        # from the vbench
+        df = DataFrame(np.random.randint(1, 10, (100, 2)), dtype='int64')
+        s = df[1]
+        g = df[0]
+        expected = s.groupby(g).first()
+        expected2 = s.groupby(g).apply(lambda x: x.iloc[0])
+        assert_series_equal(expected2, expected, check_names=False)
+        assert expected.name == 1
+
+        # validate first
+        v = s[g == 1].iloc[0]
+        assert expected.iloc[0] == v
+        assert expected2.iloc[0] == v
+
+        # this is NOT the same as .first (as sort=True is the default!)
+        # as it keeps the order in the series (and not the group order)
+        # related GH 7287
+        expected = s.groupby(g, sort=False).first()
+        result = s.groupby(g, sort=False).nth(0, dropna='all')
+        assert_series_equal(result, expected)
+
+        # doc example
+        df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
+        g = df.groupby('A')
+        result = g.B.nth(0, dropna=True)
+        expected = g.B.first()
+        assert_series_equal(result, expected)
+
+        # test multiple nth values
+        df = DataFrame([[1, np.nan], [1, 3], [1, 4], [5, 6], [5, 7]],
+                       columns=['A', 'B'])
+        g = df.groupby('A')
+
+        assert_frame_equal(g.nth(0), df.iloc[[0, 3]].set_index('A'))
+        assert_frame_equal(g.nth([0]), df.iloc[[0, 3]].set_index('A'))
+        assert_frame_equal(g.nth([0, 1]), df.iloc[[0, 1, 3, 4]].set_index('A'))
+        assert_frame_equal(
+            g.nth([0, -1]), df.iloc[[0, 2, 3, 4]].set_index('A'))
+        assert_frame_equal(
+            g.nth([0, 1, 2]), df.iloc[[0, 1, 2, 3, 4]].set_index('A'))
+        assert_frame_equal(
+            g.nth([0, 1, -1]), df.iloc[[0, 1, 2, 3, 4]].set_index('A'))
+        assert_frame_equal(g.nth([2]), df.iloc[[2]].set_index('A'))
+        assert_frame_equal(g.nth([3, 4]), df.loc[[]].set_index('A'))
+
+        business_dates = pd.date_range(start='4/1/2014', end='6/30/2014',
+                                       freq='B')
+        df = DataFrame(1, index=business_dates, columns=['a', 'b'])
+        # get the first, fourth and last two business days for each month
+        key = (df.index.year, df.index.month)
+        result = df.groupby(key, as_index=False).nth([0, 3, -2, -1])
+        expected_dates = pd.to_datetime(
+            ['2014/4/1', '2014/4/4', '2014/4/29', '2014/4/30', '2014/5/1',
+             '2014/5/6', '2014/5/29', '2014/5/30', '2014/6/2', '2014/6/5',
+             '2014/6/27', '2014/6/30'])
+        expected = DataFrame(1, columns=['a', 'b'], index=expected_dates)
+        assert_frame_equal(result, expected)
+
+    def test_nth_multi_index(self):
+        # PR 9090, related to issue 8979
+        # test nth on MultiIndex, should match .first()
+        grouped = self.three_group.groupby(['A', 'B'])
+        result = grouped.nth(0)
+        expected = grouped.first()
+        assert_frame_equal(result, expected)
+
+    def test_nth_multi_index_as_expected(self):
+        # PR 9090, related to issue 8979
+        # test nth on MultiIndex
+        three_group = DataFrame(
+            {'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar',
+                   'foo', 'foo', 'foo'],
+             'B': ['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two',
+                   'two', 'two', 'one'],
+             'C': ['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny',
+                   'dull', 'shiny', 'shiny', 'shiny']})
+        grouped = three_group.groupby(['A', 'B'])
+        result = grouped.nth(0)
+        expected = DataFrame(
+            {'C': ['dull', 'dull', 'dull', 'dull']},
+            index=MultiIndex.from_arrays([['bar', 'bar', 'foo', 'foo'],
+                                          ['one', 'two', 'one', 'two']],
+                                         names=['A', 'B']))
+        assert_frame_equal(result, expected)
+
+
+def test_nth_empty():
+    # GH 16064
+    df = DataFrame(index=[0], columns=['a', 'b', 'c'])
+    result = df.groupby('a').nth(10)
+    expected = DataFrame(index=Index([], name='a'), columns=['b', 'c'])
+    assert_frame_equal(result, expected)
+
+    result = df.groupby(['a', 'b']).nth(10)
+    expected = DataFrame(index=MultiIndex([[], []], [[], []],
+                                          names=['a', 'b']),
+                         columns=['c'])
+    assert_frame_equal(result, expected)
diff --git a/pandas/tests/groupby/test_timegrouper.py b/pandas/tests/groupby/test_timegrouper.py
new file mode 100644
index 0000000000000..2196318d1920e
--- /dev/null
+++ b/pandas/tests/groupby/test_timegrouper.py
@@ -0,0 +1,611 @@
+""" test with the TimeGrouper / grouping with datetimes """
+
+import pytest
+
+from datetime import datetime
+import numpy as np
+from numpy import nan
+
+import pandas as pd
+from pandas import (DataFrame, date_range, Index,
+                    Series, MultiIndex, Timestamp, DatetimeIndex)
+from pandas.compat import StringIO
+from pandas.util import testing as tm
+from pandas.util.testing import assert_frame_equal, assert_series_equal
+
+
+class TestGroupBy(object):
+
+    def test_groupby_with_timegrouper(self):
+        # GH 4161
+        # TimeGrouper requires a sorted index
+        # also verifies that the resultant index has the correct name
+        df_original = DataFrame({
+            'Buyer': 'Carl Carl Carl Carl Joe Carl'.split(),
+            'Quantity': [18, 3, 5, 1, 9, 3],
+            'Date': [
+                datetime(2013, 9, 1, 13, 0),
+                datetime(2013, 9, 1, 13, 5),
+                datetime(2013, 10, 1, 20, 0),
+                datetime(2013, 10, 3, 10, 0),
+                datetime(2013, 12, 2, 12, 0),
+                datetime(2013, 9, 2, 14, 0),
+            ]
+        })
+
+        # GH 6908 change target column's order
+        df_reordered = df_original.sort_values(by='Quantity')
+
+        for df in [df_original, df_reordered]:
+            df = df.set_index(['Date'])
+
+            expected = DataFrame(
+                {'Quantity': np.nan},
+                index=date_range('20130901 13:00:00',
+                                 '20131205 13:00:00', freq='5D',
+                                 name='Date', closed='left'))
+            expected.iloc[[0, 6, 18], 0] = np.array(
+                [24., 6., 9.], dtype='float64')
+
+            result1 = df.resample('5D').sum()
+            assert_frame_equal(result1, expected)
+
+            df_sorted = df.sort_index()
+            result2 = df_sorted.groupby(pd.TimeGrouper(freq='5D')).sum()
+            assert_frame_equal(result2, expected)
+
+            result3 = df.groupby(pd.TimeGrouper(freq='5D')).sum()
+            assert_frame_equal(result3, expected)
+
+    def test_groupby_with_timegrouper_methods(self):
+        # GH 3881
+        # make sure API of timegrouper conforms
+
+        df_original = pd.DataFrame({
+            'Branch': 'A A A A A B'.split(),
+            'Buyer': 'Carl Mark Carl Joe Joe Carl'.split(),
+            'Quantity': [1, 3, 5, 8, 9, 3],
+            'Date': [
+                datetime(2013, 1, 1, 13, 0),
+                datetime(2013, 1, 1, 13, 5),
+                datetime(2013, 10, 1, 20, 0),
+                datetime(2013, 10, 2, 10, 0),
+                datetime(2013, 12, 2, 12, 0),
+                datetime(2013, 12, 2, 14, 0),
+            ]
+        })
+
+        df_sorted = df_original.sort_values(by='Quantity', ascending=False)
+
+        for df in [df_original, df_sorted]:
+            df = df.set_index('Date', drop=False)
+            g = df.groupby(pd.TimeGrouper('6M'))
+            assert g.group_keys
+            assert isinstance(g.grouper, pd.core.groupby.BinGrouper)
+            groups = g.groups
+            assert isinstance(groups, dict)
+            assert len(groups) == 3
+
+    def test_timegrouper_with_reg_groups(self):
+
+        # GH 3794
+        # allow combination of timegrouper/reg groups
+
+        df_original = DataFrame({
+            'Branch': 'A A A A A A A B'.split(),
+            'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
+            'Quantity': [1, 3, 5, 1, 8, 1, 9, 3],
+            'Date': [
+                datetime(2013, 1, 1, 13, 0),
+                datetime(2013, 1, 1, 13, 
5), + datetime(2013, 10, 1, 20, 0), + datetime(2013, 10, 2, 10, 0), + datetime(2013, 10, 1, 20, 0), + datetime(2013, 10, 2, 10, 0), + datetime(2013, 12, 2, 12, 0), + datetime(2013, 12, 2, 14, 0), + ] + }).set_index('Date') + + df_sorted = df_original.sort_values(by='Quantity', ascending=False) + + for df in [df_original, df_sorted]: + expected = DataFrame({ + 'Buyer': 'Carl Joe Mark'.split(), + 'Quantity': [10, 18, 3], + 'Date': [ + datetime(2013, 12, 31, 0, 0), + datetime(2013, 12, 31, 0, 0), + datetime(2013, 12, 31, 0, 0), + ] + }).set_index(['Date', 'Buyer']) + + result = df.groupby([pd.Grouper(freq='A'), 'Buyer']).sum() + assert_frame_equal(result, expected) + + expected = DataFrame({ + 'Buyer': 'Carl Mark Carl Joe'.split(), + 'Quantity': [1, 3, 9, 18], + 'Date': [ + datetime(2013, 1, 1, 0, 0), + datetime(2013, 1, 1, 0, 0), + datetime(2013, 7, 1, 0, 0), + datetime(2013, 7, 1, 0, 0), + ] + }).set_index(['Date', 'Buyer']) + result = df.groupby([pd.Grouper(freq='6MS'), 'Buyer']).sum() + assert_frame_equal(result, expected) + + df_original = DataFrame({ + 'Branch': 'A A A A A A A B'.split(), + 'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(), + 'Quantity': [1, 3, 5, 1, 8, 1, 9, 3], + 'Date': [ + datetime(2013, 10, 1, 13, 0), + datetime(2013, 10, 1, 13, 5), + datetime(2013, 10, 1, 20, 0), + datetime(2013, 10, 2, 10, 0), + datetime(2013, 10, 1, 20, 0), + datetime(2013, 10, 2, 10, 0), + datetime(2013, 10, 2, 12, 0), + datetime(2013, 10, 2, 14, 0), + ] + }).set_index('Date') + + df_sorted = df_original.sort_values(by='Quantity', ascending=False) + for df in [df_original, df_sorted]: + + expected = DataFrame({ + 'Buyer': 'Carl Joe Mark Carl Joe'.split(), + 'Quantity': [6, 8, 3, 4, 10], + 'Date': [ + datetime(2013, 10, 1, 0, 0), + datetime(2013, 10, 1, 0, 0), + datetime(2013, 10, 1, 0, 0), + datetime(2013, 10, 2, 0, 0), + datetime(2013, 10, 2, 0, 0), + ] + }).set_index(['Date', 'Buyer']) + + result = df.groupby([pd.Grouper(freq='1D'), 'Buyer']).sum() + assert_frame_equal(result, expected) + + result = df.groupby([pd.Grouper(freq='1M'), 'Buyer']).sum() + expected = DataFrame({ + 'Buyer': 'Carl Joe Mark'.split(), + 'Quantity': [10, 18, 3], + 'Date': [ + datetime(2013, 10, 31, 0, 0), + datetime(2013, 10, 31, 0, 0), + datetime(2013, 10, 31, 0, 0), + ] + }).set_index(['Date', 'Buyer']) + assert_frame_equal(result, expected) + + # passing the name + df = df.reset_index() + result = df.groupby([pd.Grouper(freq='1M', key='Date'), 'Buyer' + ]).sum() + assert_frame_equal(result, expected) + + with pytest.raises(KeyError): + df.groupby([pd.Grouper(freq='1M', key='foo'), 'Buyer']).sum() + + # passing the level + df = df.set_index('Date') + result = df.groupby([pd.Grouper(freq='1M', level='Date'), 'Buyer' + ]).sum() + assert_frame_equal(result, expected) + result = df.groupby([pd.Grouper(freq='1M', level=0), 'Buyer']).sum( + ) + assert_frame_equal(result, expected) + + with pytest.raises(ValueError): + df.groupby([pd.Grouper(freq='1M', level='foo'), + 'Buyer']).sum() + + # multi names + df = df.copy() + df['Date'] = df.index + pd.offsets.MonthEnd(2) + result = df.groupby([pd.Grouper(freq='1M', key='Date'), 'Buyer' + ]).sum() + expected = DataFrame({ + 'Buyer': 'Carl Joe Mark'.split(), + 'Quantity': [10, 18, 3], + 'Date': [ + datetime(2013, 11, 30, 0, 0), + datetime(2013, 11, 30, 0, 0), + datetime(2013, 11, 30, 0, 0), + ] + }).set_index(['Date', 'Buyer']) + assert_frame_equal(result, expected) + + # error as we have both a level and a name! 
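+            # (pd.Grouper accepts either ``key=`` naming a column or
+            #  ``level=`` naming an index level; passing both at once, as in
+            #  pd.Grouper(freq='1M', key='Date', level='Date') below, is
+            #  ambiguous and is expected to raise)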
+ with pytest.raises(ValueError): + df.groupby([pd.Grouper(freq='1M', key='Date', + level='Date'), 'Buyer']).sum() + + # single groupers + expected = DataFrame({'Quantity': [31], + 'Date': [datetime(2013, 10, 31, 0, 0) + ]}).set_index('Date') + result = df.groupby(pd.Grouper(freq='1M')).sum() + assert_frame_equal(result, expected) + + result = df.groupby([pd.Grouper(freq='1M')]).sum() + assert_frame_equal(result, expected) + + expected = DataFrame({'Quantity': [31], + 'Date': [datetime(2013, 11, 30, 0, 0) + ]}).set_index('Date') + result = df.groupby(pd.Grouper(freq='1M', key='Date')).sum() + assert_frame_equal(result, expected) + + result = df.groupby([pd.Grouper(freq='1M', key='Date')]).sum() + assert_frame_equal(result, expected) + + # GH 6764 multiple grouping with/without sort + df = DataFrame({ + 'date': pd.to_datetime([ + '20121002', '20121007', '20130130', '20130202', '20130305', + '20121002', '20121207', '20130130', '20130202', '20130305', + '20130202', '20130305' + ]), + 'user_id': [1, 1, 1, 1, 1, 3, 3, 3, 5, 5, 5, 5], + 'whole_cost': [1790, 364, 280, 259, 201, 623, 90, 312, 359, 301, + 359, 801], + 'cost1': [12, 15, 10, 24, 39, 1, 0, 90, 45, 34, 1, 12] + }).set_index('date') + + for freq in ['D', 'M', 'A', 'Q-APR']: + expected = df.groupby('user_id')[ + 'whole_cost'].resample( + freq).sum().dropna().reorder_levels( + ['date', 'user_id']).sort_index().astype('int64') + expected.name = 'whole_cost' + + result1 = df.sort_index().groupby([pd.TimeGrouper(freq=freq), + 'user_id'])['whole_cost'].sum() + assert_series_equal(result1, expected) + + result2 = df.groupby([pd.TimeGrouper(freq=freq), 'user_id'])[ + 'whole_cost'].sum() + assert_series_equal(result2, expected) + + def test_timegrouper_get_group(self): + # GH 6914 + + df_original = DataFrame({ + 'Buyer': 'Carl Joe Joe Carl Joe Carl'.split(), + 'Quantity': [18, 3, 5, 1, 9, 3], + 'Date': [datetime(2013, 9, 1, 13, 0), + datetime(2013, 9, 1, 13, 5), + datetime(2013, 10, 1, 20, 0), + datetime(2013, 10, 3, 10, 0), + datetime(2013, 12, 2, 12, 0), + datetime(2013, 9, 2, 14, 0), ] + }) + df_reordered = df_original.sort_values(by='Quantity') + + # single grouping + expected_list = [df_original.iloc[[0, 1, 5]], df_original.iloc[[2, 3]], + df_original.iloc[[4]]] + dt_list = ['2013-09-30', '2013-10-31', '2013-12-31'] + + for df in [df_original, df_reordered]: + grouped = df.groupby(pd.Grouper(freq='M', key='Date')) + for t, expected in zip(dt_list, expected_list): + dt = pd.Timestamp(t) + result = grouped.get_group(dt) + assert_frame_equal(result, expected) + + # multiple grouping + expected_list = [df_original.iloc[[1]], df_original.iloc[[3]], + df_original.iloc[[4]]] + g_list = [('Joe', '2013-09-30'), ('Carl', '2013-10-31'), + ('Joe', '2013-12-31')] + + for df in [df_original, df_reordered]: + grouped = df.groupby(['Buyer', pd.Grouper(freq='M', key='Date')]) + for (b, t), expected in zip(g_list, expected_list): + dt = pd.Timestamp(t) + result = grouped.get_group((b, dt)) + assert_frame_equal(result, expected) + + # with index + df_original = df_original.set_index('Date') + df_reordered = df_original.sort_values(by='Quantity') + + expected_list = [df_original.iloc[[0, 1, 5]], df_original.iloc[[2, 3]], + df_original.iloc[[4]]] + + for df in [df_original, df_reordered]: + grouped = df.groupby(pd.Grouper(freq='M')) + for t, expected in zip(dt_list, expected_list): + dt = pd.Timestamp(t) + result = grouped.get_group(dt) + assert_frame_equal(result, expected) + + def test_timegrouper_apply_return_type_series(self): + # Using `apply` with the 
`TimeGrouper` should give the + # same return type as an `apply` with a `Grouper`. + # Issue #11742 + df = pd.DataFrame({'date': ['10/10/2000', '11/10/2000'], + 'value': [10, 13]}) + df_dt = df.copy() + df_dt['date'] = pd.to_datetime(df_dt['date']) + + def sumfunc_series(x): + return pd.Series([x['value'].sum()], ('sum',)) + + expected = df.groupby(pd.Grouper(key='date')).apply(sumfunc_series) + result = (df_dt.groupby(pd.TimeGrouper(freq='M', key='date')) + .apply(sumfunc_series)) + assert_frame_equal(result.reset_index(drop=True), + expected.reset_index(drop=True)) + + def test_timegrouper_apply_return_type_value(self): + # Using `apply` with the `TimeGrouper` should give the + # same return type as an `apply` with a `Grouper`. + # Issue #11742 + df = pd.DataFrame({'date': ['10/10/2000', '11/10/2000'], + 'value': [10, 13]}) + df_dt = df.copy() + df_dt['date'] = pd.to_datetime(df_dt['date']) + + def sumfunc_value(x): + return x.value.sum() + + expected = df.groupby(pd.Grouper(key='date')).apply(sumfunc_value) + result = (df_dt.groupby(pd.TimeGrouper(freq='M', key='date')) + .apply(sumfunc_value)) + assert_series_equal(result.reset_index(drop=True), + expected.reset_index(drop=True)) + + def test_groupby_groups_datetimeindex(self): + # #1430 + periods = 1000 + ind = DatetimeIndex(start='2012/1/1', freq='5min', periods=periods) + df = DataFrame({'high': np.arange(periods), + 'low': np.arange(periods)}, index=ind) + grouped = df.groupby(lambda x: datetime(x.year, x.month, x.day)) + + # it works! + groups = grouped.groups + assert isinstance(list(groups.keys())[0], datetime) + + # GH 11442 + index = pd.date_range('2015/01/01', periods=5, name='date') + df = pd.DataFrame({'A': [5, 6, 7, 8, 9], + 'B': [1, 2, 3, 4, 5]}, index=index) + result = df.groupby(level='date').groups + dates = ['2015-01-05', '2015-01-04', '2015-01-03', + '2015-01-02', '2015-01-01'] + expected = {pd.Timestamp(date): pd.DatetimeIndex([date], name='date') + for date in dates} + tm.assert_dict_equal(result, expected) + + grouped = df.groupby(level='date') + for date in dates: + result = grouped.get_group(date) + data = [[df.loc[date, 'A'], df.loc[date, 'B']]] + expected_index = pd.DatetimeIndex([date], name='date') + expected = pd.DataFrame(data, + columns=list('AB'), + index=expected_index) + tm.assert_frame_equal(result, expected) + + def test_groupby_groups_datetimeindex_tz(self): + # GH 3950 + dates = ['2011-07-19 07:00:00', '2011-07-19 08:00:00', + '2011-07-19 09:00:00', '2011-07-19 07:00:00', + '2011-07-19 08:00:00', '2011-07-19 09:00:00'] + df = DataFrame({'label': ['a', 'a', 'a', 'b', 'b', 'b'], + 'datetime': dates, + 'value1': np.arange(6, dtype='int64'), + 'value2': [1, 2] * 3}) + df['datetime'] = df['datetime'].apply( + lambda d: Timestamp(d, tz='US/Pacific')) + + exp_idx1 = pd.DatetimeIndex(['2011-07-19 07:00:00', + '2011-07-19 07:00:00', + '2011-07-19 08:00:00', + '2011-07-19 08:00:00', + '2011-07-19 09:00:00', + '2011-07-19 09:00:00'], + tz='US/Pacific', name='datetime') + exp_idx2 = Index(['a', 'b'] * 3, name='label') + exp_idx = MultiIndex.from_arrays([exp_idx1, exp_idx2]) + expected = DataFrame({'value1': [0, 3, 1, 4, 2, 5], + 'value2': [1, 2, 2, 1, 1, 2]}, + index=exp_idx, columns=['value1', 'value2']) + + result = df.groupby(['datetime', 'label']).sum() + assert_frame_equal(result, expected) + + # by level + didx = pd.DatetimeIndex(dates, tz='Asia/Tokyo') + df = DataFrame({'value1': np.arange(6, dtype='int64'), + 'value2': [1, 2, 3, 1, 2, 3]}, + index=didx) + + exp_idx = pd.DatetimeIndex(['2011-07-19 
07:00:00', + '2011-07-19 08:00:00', + '2011-07-19 09:00:00'], tz='Asia/Tokyo') + expected = DataFrame({'value1': [3, 5, 7], 'value2': [2, 4, 6]}, + index=exp_idx, columns=['value1', 'value2']) + + result = df.groupby(level=0).sum() + assert_frame_equal(result, expected) + + def test_frame_datetime64_handling_groupby(self): + # it works! + df = DataFrame([(3, np.datetime64('2012-07-03')), + (3, np.datetime64('2012-07-04'))], + columns=['a', 'date']) + result = df.groupby('a').first() + assert result['date'][3] == Timestamp('2012-07-03') + + def test_groupby_multi_timezone(self): + + # combining multiple / different timezones yields UTC + + data = """0,2000-01-28 16:47:00,America/Chicago +1,2000-01-29 16:48:00,America/Chicago +2,2000-01-30 16:49:00,America/Los_Angeles +3,2000-01-31 16:50:00,America/Chicago +4,2000-01-01 16:50:00,America/New_York""" + + df = pd.read_csv(StringIO(data), header=None, + names=['value', 'date', 'tz']) + result = df.groupby('tz').date.apply( + lambda x: pd.to_datetime(x).dt.tz_localize(x.name)) + + expected = Series([Timestamp('2000-01-28 16:47:00-0600', + tz='America/Chicago'), + Timestamp('2000-01-29 16:48:00-0600', + tz='America/Chicago'), + Timestamp('2000-01-30 16:49:00-0800', + tz='America/Los_Angeles'), + Timestamp('2000-01-31 16:50:00-0600', + tz='America/Chicago'), + Timestamp('2000-01-01 16:50:00-0500', + tz='America/New_York')], + name='date', + dtype=object) + assert_series_equal(result, expected) + + tz = 'America/Chicago' + res_values = df.groupby('tz').date.get_group(tz) + result = pd.to_datetime(res_values).dt.tz_localize(tz) + exp_values = Series(['2000-01-28 16:47:00', '2000-01-29 16:48:00', + '2000-01-31 16:50:00'], + index=[0, 1, 3], name='date') + expected = pd.to_datetime(exp_values).dt.tz_localize(tz) + assert_series_equal(result, expected) + + def test_groupby_groups_periods(self): + dates = ['2011-07-19 07:00:00', '2011-07-19 08:00:00', + '2011-07-19 09:00:00', '2011-07-19 07:00:00', + '2011-07-19 08:00:00', '2011-07-19 09:00:00'] + df = DataFrame({'label': ['a', 'a', 'a', 'b', 'b', 'b'], + 'period': [pd.Period(d, freq='H') for d in dates], + 'value1': np.arange(6, dtype='int64'), + 'value2': [1, 2] * 3}) + + exp_idx1 = pd.PeriodIndex(['2011-07-19 07:00:00', + '2011-07-19 07:00:00', + '2011-07-19 08:00:00', + '2011-07-19 08:00:00', + '2011-07-19 09:00:00', + '2011-07-19 09:00:00'], + freq='H', name='period') + exp_idx2 = Index(['a', 'b'] * 3, name='label') + exp_idx = MultiIndex.from_arrays([exp_idx1, exp_idx2]) + expected = DataFrame({'value1': [0, 3, 1, 4, 2, 5], + 'value2': [1, 2, 2, 1, 1, 2]}, + index=exp_idx, columns=['value1', 'value2']) + + result = df.groupby(['period', 'label']).sum() + assert_frame_equal(result, expected) + + # by level + didx = pd.PeriodIndex(dates, freq='H') + df = DataFrame({'value1': np.arange(6, dtype='int64'), + 'value2': [1, 2, 3, 1, 2, 3]}, + index=didx) + + exp_idx = pd.PeriodIndex(['2011-07-19 07:00:00', + '2011-07-19 08:00:00', + '2011-07-19 09:00:00'], freq='H') + expected = DataFrame({'value1': [3, 5, 7], 'value2': [2, 4, 6]}, + index=exp_idx, columns=['value1', 'value2']) + + result = df.groupby(level=0).sum() + assert_frame_equal(result, expected) + + def test_groupby_first_datetime64(self): + df = DataFrame([(1, 1351036800000000000), (2, 1351036800000000000)]) + df[1] = df[1].view('M8[ns]') + + assert issubclass(df[1].dtype.type, np.datetime64) + + result = df.groupby(level=0).first() + got_dt = result[1].dtype + assert issubclass(got_dt.type, np.datetime64) + + result = 
df[1].groupby(level=0).first() + got_dt = result.dtype + assert issubclass(got_dt.type, np.datetime64) + + def test_groupby_max_datetime64(self): + # GH 5869 + # datetimelike dtype conversion from int + df = DataFrame(dict(A=Timestamp('20130101'), B=np.arange(5))) + expected = df.groupby('A')['A'].apply(lambda x: x.max()) + result = df.groupby('A')['A'].max() + assert_series_equal(result, expected) + + def test_groupby_datetime64_32_bit(self): + # GH 6410 / numpy 4328 + # 32-bit under 1.9-dev indexing issue + + df = DataFrame({"A": range(2), "B": [pd.Timestamp('2000-01-1')] * 2}) + result = df.groupby("A")["B"].transform(min) + expected = Series([pd.Timestamp('2000-01-1')] * 2, name='B') + assert_series_equal(result, expected) + + def test_groupby_with_timezone_selection(self): + # GH 11616 + # Test that column selection returns output in correct timezone. + np.random.seed(42) + df = pd.DataFrame({ + 'factor': np.random.randint(0, 3, size=60), + 'time': pd.date_range('01/01/2000 00:00', periods=60, + freq='s', tz='UTC') + }) + df1 = df.groupby('factor').max()['time'] + df2 = df.groupby('factor')['time'].max() + tm.assert_series_equal(df1, df2) + + def test_timezone_info(self): + # GH 11682 + # Timezone info lost when broadcasting scalar datetime to DataFrame + tm._skip_if_no_pytz() + import pytz + + df = pd.DataFrame({'a': [1], 'b': [datetime.now(pytz.utc)]}) + assert df['b'][0].tzinfo == pytz.utc + df = pd.DataFrame({'a': [1, 2, 3]}) + df['b'] = datetime.now(pytz.utc) + assert df['b'][0].tzinfo == pytz.utc + + def test_datetime_count(self): + df = DataFrame({'a': [1, 2, 3] * 2, + 'dates': pd.date_range('now', periods=6, freq='T')}) + result = df.groupby('a').dates.count() + expected = Series([ + 2, 2, 2 + ], index=Index([1, 2, 3], name='a'), name='dates') + tm.assert_series_equal(result, expected) + + def test_first_last_max_min_on_time_data(self): + # GH 10295 + # Verify that NaT is not in the result of max, min, first and last on + # Dataframe with datetime or timedelta values. 
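+        # e.g. the 'dt' column built below holds two NaT rows inside the
+        # single group 'A'; max/min/first/last must skip them, so the
+        # grouped results should match those on df_ref, which drops the rows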
+ from datetime import timedelta as td + df_test = DataFrame( + {'dt': [nan, '2015-07-24 10:10', '2015-07-25 11:11', + '2015-07-23 12:12', nan], + 'td': [nan, td(days=1), td(days=2), td(days=3), nan]}) + df_test.dt = pd.to_datetime(df_test.dt) + df_test['group'] = 'A' + df_ref = df_test[df_test.dt.notnull()] + + grouped_test = df_test.groupby('group') + grouped_ref = df_ref.groupby('group') + + assert_frame_equal(grouped_ref.max(), grouped_test.max()) + assert_frame_equal(grouped_ref.min(), grouped_test.min()) + assert_frame_equal(grouped_ref.first(), grouped_test.first()) + assert_frame_equal(grouped_ref.last(), grouped_test.last()) diff --git a/pandas/tests/groupby/test_transform.py b/pandas/tests/groupby/test_transform.py new file mode 100644 index 0000000000000..40434ff510421 --- /dev/null +++ b/pandas/tests/groupby/test_transform.py @@ -0,0 +1,563 @@ +""" test with the .transform """ + +import pytest + +import numpy as np +import pandas as pd +from pandas.util import testing as tm +from pandas import Series, DataFrame, Timestamp, MultiIndex, concat, date_range +from pandas.core.dtypes.common import ( + _ensure_platform_int, is_timedelta64_dtype) +from pandas.compat import StringIO +from pandas._libs import groupby +from .common import MixIn, assert_fp_equal + +from pandas.util.testing import assert_frame_equal, assert_series_equal +from pandas.core.groupby import DataError +from pandas.core.config import option_context + + +class TestGroupBy(MixIn): + + def test_transform(self): + data = Series(np.arange(9) // 3, index=np.arange(9)) + + index = np.arange(9) + np.random.shuffle(index) + data = data.reindex(index) + + grouped = data.groupby(lambda x: x // 3) + + transformed = grouped.transform(lambda x: x * x.sum()) + assert transformed[7] == 12 + + # GH 8046 + # make sure that we preserve the input order + + df = DataFrame( + np.arange(6, dtype='int64').reshape( + 3, 2), columns=["a", "b"], index=[0, 2, 1]) + key = [0, 0, 1] + expected = df.sort_index().groupby(key).transform( + lambda x: x - x.mean()).groupby(key).mean() + result = df.groupby(key).transform(lambda x: x - x.mean()).groupby( + key).mean() + assert_frame_equal(result, expected) + + def demean(arr): + return arr - arr.mean() + + people = DataFrame(np.random.randn(5, 5), + columns=['a', 'b', 'c', 'd', 'e'], + index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis']) + key = ['one', 'two', 'one', 'two', 'one'] + result = people.groupby(key).transform(demean).groupby(key).mean() + expected = people.groupby(key).apply(demean).groupby(key).mean() + assert_frame_equal(result, expected) + + # GH 8430 + df = tm.makeTimeDataFrame() + g = df.groupby(pd.TimeGrouper('M')) + g.transform(lambda x: x - 1) + + # GH 9700 + df = DataFrame({'a': range(5, 10), 'b': range(5)}) + result = df.groupby('a').transform(max) + expected = DataFrame({'b': range(5)}) + tm.assert_frame_equal(result, expected) + + def test_transform_fast(self): + + df = DataFrame({'id': np.arange(100000) / 3, + 'val': np.random.randn(100000)}) + + grp = df.groupby('id')['val'] + + values = np.repeat(grp.mean().values, + _ensure_platform_int(grp.count().values)) + expected = pd.Series(values, index=df.index, name='val') + + result = grp.transform(np.mean) + assert_series_equal(result, expected) + + result = grp.transform('mean') + assert_series_equal(result, expected) + + # GH 12737 + df = pd.DataFrame({'grouping': [0, 1, 1, 3], 'f': [1.1, 2.1, 3.1, 4.5], + 'd': pd.date_range('2014-1-1', '2014-1-4'), + 'i': [1, 2, 3, 4]}, + columns=['grouping', 'f', 'i', 'd']) + result = 
df.groupby('grouping').transform('first') + + dates = [pd.Timestamp('2014-1-1'), pd.Timestamp('2014-1-2'), + pd.Timestamp('2014-1-2'), pd.Timestamp('2014-1-4')] + expected = pd.DataFrame({'f': [1.1, 2.1, 2.1, 4.5], + 'd': dates, + 'i': [1, 2, 2, 4]}, + columns=['f', 'i', 'd']) + assert_frame_equal(result, expected) + + # selection + result = df.groupby('grouping')[['f', 'i']].transform('first') + expected = expected[['f', 'i']] + assert_frame_equal(result, expected) + + # dup columns + df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['g', 'a', 'a']) + result = df.groupby('g').transform('first') + expected = df.drop('g', axis=1) + assert_frame_equal(result, expected) + + def test_transform_broadcast(self): + grouped = self.ts.groupby(lambda x: x.month) + result = grouped.transform(np.mean) + + tm.assert_index_equal(result.index, self.ts.index) + for _, gp in grouped: + assert_fp_equal(result.reindex(gp.index), gp.mean()) + + grouped = self.tsframe.groupby(lambda x: x.month) + result = grouped.transform(np.mean) + tm.assert_index_equal(result.index, self.tsframe.index) + for _, gp in grouped: + agged = gp.mean() + res = result.reindex(gp.index) + for col in self.tsframe: + assert_fp_equal(res[col], agged[col]) + + # group columns + grouped = self.tsframe.groupby({'A': 0, 'B': 0, 'C': 1, 'D': 1}, + axis=1) + result = grouped.transform(np.mean) + tm.assert_index_equal(result.index, self.tsframe.index) + tm.assert_index_equal(result.columns, self.tsframe.columns) + for _, gp in grouped: + agged = gp.mean(1) + res = result.reindex(columns=gp.columns) + for idx in gp.index: + assert_fp_equal(res.xs(idx), agged[idx]) + + def test_transform_axis(self): + + # make sure that we are setting the axes + # correctly when on axis=0 or 1 + # in the presence of a non-monotonic indexer + # GH12713 + + base = self.tsframe.iloc[0:5] + r = len(base.index) + c = len(base.columns) + tso = DataFrame(np.random.randn(r, c), + index=base.index, + columns=base.columns, + dtype='float64') + # monotonic + ts = tso + grouped = ts.groupby(lambda x: x.weekday()) + result = ts - grouped.transform('mean') + expected = grouped.apply(lambda x: x - x.mean()) + assert_frame_equal(result, expected) + + ts = ts.T + grouped = ts.groupby(lambda x: x.weekday(), axis=1) + result = ts - grouped.transform('mean') + expected = grouped.apply(lambda x: (x.T - x.mean(1)).T) + assert_frame_equal(result, expected) + + # non-monotonic + ts = tso.iloc[[1, 0] + list(range(2, len(base)))] + grouped = ts.groupby(lambda x: x.weekday()) + result = ts - grouped.transform('mean') + expected = grouped.apply(lambda x: x - x.mean()) + assert_frame_equal(result, expected) + + ts = ts.T + grouped = ts.groupby(lambda x: x.weekday(), axis=1) + result = ts - grouped.transform('mean') + expected = grouped.apply(lambda x: (x.T - x.mean(1)).T) + assert_frame_equal(result, expected) + + def test_transform_dtype(self): + # GH 9807 + # Check transform dtype output is preserved + df = DataFrame([[1, 3], [2, 3]]) + result = df.groupby(1).transform('mean') + expected = DataFrame([[1.5], [1.5]]) + assert_frame_equal(result, expected) + + def test_transform_bug(self): + # GH 5712 + # transforming on a datetime column + df = DataFrame(dict(A=Timestamp('20130101'), B=np.arange(5))) + result = df.groupby('A')['B'].transform( + lambda x: x.rank(ascending=False)) + expected = Series(np.arange(5, 0, step=-1), name='B') + assert_series_equal(result, expected) + + def test_transform_datetime_to_timedelta(self): + # GH 15429 + # transforming a datetime to timedelta + df = 
DataFrame(dict(A=Timestamp('20130101'), B=np.arange(5))) + expected = pd.Series([ + Timestamp('20130101') - Timestamp('20130101')] * 5, name='A') + + # this does date math without changing result type in transform + base_time = df['A'][0] + result = df.groupby('A')['A'].transform( + lambda x: x.max() - x.min() + base_time) - base_time + assert_series_equal(result, expected) + + # this does date math and causes the transform to return timedelta + result = df.groupby('A')['A'].transform(lambda x: x.max() - x.min()) + assert_series_equal(result, expected) + + def test_transform_datetime_to_numeric(self): + # GH 10972 + # convert dt to float + df = DataFrame({ + 'a': 1, 'b': date_range('2015-01-01', periods=2, freq='D')}) + result = df.groupby('a').b.transform( + lambda x: x.dt.dayofweek - x.dt.dayofweek.mean()) + + expected = Series([-0.5, 0.5], name='b') + assert_series_equal(result, expected) + + # convert dt to int + df = DataFrame({ + 'a': 1, 'b': date_range('2015-01-01', periods=2, freq='D')}) + result = df.groupby('a').b.transform( + lambda x: x.dt.dayofweek - x.dt.dayofweek.min()) + + expected = Series([0, 1], name='b') + assert_series_equal(result, expected) + + def test_transform_casting(self): + # 13046 + data = """ + idx A ID3 DATETIME + 0 B-028 b76cd912ff "2014-10-08 13:43:27" + 1 B-054 4a57ed0b02 "2014-10-08 14:26:19" + 2 B-076 1a682034f8 "2014-10-08 14:29:01" + 3 B-023 b76cd912ff "2014-10-08 18:39:34" + 4 B-023 f88g8d7sds "2014-10-08 18:40:18" + 5 B-033 b76cd912ff "2014-10-08 18:44:30" + 6 B-032 b76cd912ff "2014-10-08 18:46:00" + 7 B-037 b76cd912ff "2014-10-08 18:52:15" + 8 B-046 db959faf02 "2014-10-08 18:59:59" + 9 B-053 b76cd912ff "2014-10-08 19:17:48" + 10 B-065 b76cd912ff "2014-10-08 19:21:38" + """ + df = pd.read_csv(StringIO(data), sep='\s+', + index_col=[0], parse_dates=['DATETIME']) + + result = df.groupby('ID3')['DATETIME'].transform(lambda x: x.diff()) + assert is_timedelta64_dtype(result.dtype) + + result = df[['ID3', 'DATETIME']].groupby('ID3').transform( + lambda x: x.diff()) + assert is_timedelta64_dtype(result.DATETIME.dtype) + + def test_transform_multiple(self): + grouped = self.ts.groupby([lambda x: x.year, lambda x: x.month]) + + grouped.transform(lambda x: x * 2) + grouped.transform(np.mean) + + def test_dispatch_transform(self): + df = self.tsframe[::5].reindex(self.tsframe.index) + + grouped = df.groupby(lambda x: x.month) + + filled = grouped.fillna(method='pad') + fillit = lambda x: x.fillna(method='pad') + expected = df.groupby(lambda x: x.month).transform(fillit) + assert_frame_equal(filled, expected) + + def test_transform_select_columns(self): + f = lambda x: x.mean() + result = self.df.groupby('A')['C', 'D'].transform(f) + + selection = self.df[['C', 'D']] + expected = selection.groupby(self.df['A']).transform(f) + + assert_frame_equal(result, expected) + + def test_transform_exclude_nuisance(self): + + # this also tests orderings in transform between + # series/frame to make sure it's consistent + expected = {} + grouped = self.df.groupby('A') + expected['C'] = grouped['C'].transform(np.mean) + expected['D'] = grouped['D'].transform(np.mean) + expected = DataFrame(expected) + result = self.df.groupby('A').transform(np.mean) + + assert_frame_equal(result, expected) + + def test_transform_function_aliases(self): + result = self.df.groupby('A').transform('mean') + expected = self.df.groupby('A').transform(np.mean) + assert_frame_equal(result, expected) + + result = self.df.groupby('A')['C'].transform('mean') + expected = 
self.df.groupby('A')['C'].transform(np.mean) + assert_series_equal(result, expected) + + def test_series_fast_transform_date(self): + # GH 13191 + df = pd.DataFrame({'grouping': [np.nan, 1, 1, 3], + 'd': pd.date_range('2014-1-1', '2014-1-4')}) + result = df.groupby('grouping')['d'].transform('first') + dates = [pd.NaT, pd.Timestamp('2014-1-2'), pd.Timestamp('2014-1-2'), + pd.Timestamp('2014-1-4')] + expected = pd.Series(dates, name='d') + assert_series_equal(result, expected) + + def test_transform_length(self): + # GH 9697 + df = pd.DataFrame({'col1': [1, 1, 2, 2], 'col2': [1, 2, 3, np.nan]}) + expected = pd.Series([3.0] * 4) + + def nsum(x): + return np.nansum(x) + + results = [df.groupby('col1').transform(sum)['col2'], + df.groupby('col1')['col2'].transform(sum), + df.groupby('col1').transform(nsum)['col2'], + df.groupby('col1')['col2'].transform(nsum)] + for result in results: + assert_series_equal(result, expected, check_names=False) + + def test_transform_coercion(self): + + # 14457 + # when we are transforming be sure to not coerce + # via assignment + df = pd.DataFrame(dict(A=['a', 'a'], B=[0, 1])) + g = df.groupby('A') + + expected = g.transform(np.mean) + result = g.transform(lambda x: np.mean(x)) + assert_frame_equal(result, expected) + + def test_groupby_transform_with_int(self): + + # GH 3740, make sure that we might upcast on item-by-item transform + + # floats + df = DataFrame(dict(A=[1, 1, 1, 2, 2, 2], B=Series(1, dtype='float64'), + C=Series( + [1, 2, 3, 1, 2, 3], dtype='float64'), D='foo')) + with np.errstate(all='ignore'): + result = df.groupby('A').transform( + lambda x: (x - x.mean()) / x.std()) + expected = DataFrame(dict(B=np.nan, C=Series( + [-1, 0, 1, -1, 0, 1], dtype='float64'))) + assert_frame_equal(result, expected) + + # int case + df = DataFrame(dict(A=[1, 1, 1, 2, 2, 2], B=1, + C=[1, 2, 3, 1, 2, 3], D='foo')) + with np.errstate(all='ignore'): + result = df.groupby('A').transform( + lambda x: (x - x.mean()) / x.std()) + expected = DataFrame(dict(B=np.nan, C=[-1, 0, 1, -1, 0, 1])) + assert_frame_equal(result, expected) + + # int that needs float conversion + s = Series([2, 3, 4, 10, 5, -1]) + df = DataFrame(dict(A=[1, 1, 1, 2, 2, 2], B=1, C=s, D='foo')) + with np.errstate(all='ignore'): + result = df.groupby('A').transform( + lambda x: (x - x.mean()) / x.std()) + + s1 = s.iloc[0:3] + s1 = (s1 - s1.mean()) / s1.std() + s2 = s.iloc[3:6] + s2 = (s2 - s2.mean()) / s2.std() + expected = DataFrame(dict(B=np.nan, C=concat([s1, s2]))) + assert_frame_equal(result, expected) + + # int downcasting + result = df.groupby('A').transform(lambda x: x * 2 / 2) + expected = DataFrame(dict(B=1, C=[2, 3, 4, 10, 5, -1])) + assert_frame_equal(result, expected) + + def test_groupby_transform_with_nan_group(self): + # GH 9941 + df = pd.DataFrame({'a': range(10), + 'b': [1, 1, 2, 3, np.nan, 4, 4, 5, 5, 5]}) + result = df.groupby(df.b)['a'].transform(max) + expected = pd.Series([1., 1., 2., 3., np.nan, 6., 6., 9., 9., 9.], + name='a') + assert_series_equal(result, expected) + + def test_transform_mixed_type(self): + index = MultiIndex.from_arrays([[0, 0, 0, 1, 1, 1], [1, 2, 3, 1, 2, 3] + ]) + df = DataFrame({'d': [1., 1., 1., 2., 2., 2.], + 'c': np.tile(['a', 'b', 'c'], 2), + 'v': np.arange(1., 7.)}, index=index) + + def f(group): + group['g'] = group['d'] * 2 + return group[:1] + + grouped = df.groupby('c') + result = grouped.apply(f) + + assert result['d'].dtype == np.float64 + + # this is by definition a mutating operation! 
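+        # (f assigns a new column 'g' inside each group and returns only the
+        #  first row, so re-applying f per group below needs the
+        #  chained-assignment warning suppressed)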
+        with option_context('mode.chained_assignment', None):
+            for key, group in grouped:
+                res = f(group)
+                assert_frame_equal(res, result.loc[key])
+
+    def test_cython_group_transform_algos(self):
+        # GH 4095
+        dtypes = [np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint32,
+                  np.uint64, np.float32, np.float64]
+
+        ops = [(groupby.group_cumprod_float64, np.cumproduct, [np.float64]),
+               (groupby.group_cumsum, np.cumsum, dtypes)]
+
+        is_datetimelike = False
+        for pd_op, np_op, dtypes in ops:
+            for dtype in dtypes:
+                data = np.array([[1], [2], [3], [4]], dtype=dtype)
+                ans = np.zeros_like(data)
+                labels = np.array([0, 0, 0, 0], dtype=np.int64)
+                pd_op(ans, data, labels, is_datetimelike)
+                tm.assert_numpy_array_equal(np_op(data), ans[:, 0],
+                                            check_dtype=False)
+
+        # with nans
+        labels = np.array([0, 0, 0, 0, 0], dtype=np.int64)
+
+        data = np.array([[1], [2], [3], [np.nan], [4]], dtype='float64')
+        actual = np.zeros_like(data)
+        actual.fill(np.nan)
+        groupby.group_cumprod_float64(actual, data, labels, is_datetimelike)
+        expected = np.array([1, 2, 6, np.nan, 24], dtype='float64')
+        tm.assert_numpy_array_equal(actual[:, 0], expected)
+
+        actual = np.zeros_like(data)
+        actual.fill(np.nan)
+        groupby.group_cumsum(actual, data, labels, is_datetimelike)
+        expected = np.array([1, 3, 6, np.nan, 10], dtype='float64')
+        tm.assert_numpy_array_equal(actual[:, 0], expected)
+
+        # timedelta
+        is_datetimelike = True
+        data = np.array([np.timedelta64(1, 'ns')] * 5, dtype='m8[ns]')[:, None]
+        actual = np.zeros_like(data, dtype='int64')
+        groupby.group_cumsum(actual, data.view('int64'), labels,
+                             is_datetimelike)
+        expected = np.array([np.timedelta64(1, 'ns'), np.timedelta64(
+            2, 'ns'), np.timedelta64(3, 'ns'), np.timedelta64(4, 'ns'),
+            np.timedelta64(5, 'ns')])
+        tm.assert_numpy_array_equal(actual[:, 0].view('m8[ns]'), expected)
+
+    def test_cython_transform(self):
+        # GH 4095
+        ops = [(('cumprod',
+                 ()), lambda x: x.cumprod()), (('cumsum', ()),
+                                               lambda x: x.cumsum()),
+               (('shift', (-1, )),
+                lambda x: x.shift(-1)), (('shift',
+                                          (1, )), lambda x: x.shift())]
+
+        s = Series(np.random.randn(1000))
+        s_missing = s.copy()
+        s_missing.iloc[2:10] = np.nan
+        labels = np.random.randint(0, 50, size=1000).astype(float)
+
+        # series
+        for (op, args), targop in ops:
+            for data in [s, s_missing]:
+                # print(data.head())
+                expected = data.groupby(labels).transform(targop)
+
+                tm.assert_series_equal(expected,
+                                       data.groupby(labels).transform(op,
+                                                                      *args))
+                tm.assert_series_equal(expected, getattr(
+                    data.groupby(labels), op)(*args))
+
+        strings = list('qwertyuiopasdfghjklz')
+        strings_missing = strings[:]
+        strings_missing[5] = np.nan
+        df = DataFrame({'float': s,
+                        'float_missing': s_missing,
+                        'int': [1, 1, 1, 1, 2] * 200,
+                        'datetime': pd.date_range('1990-1-1', periods=1000),
+                        'timedelta': pd.timedelta_range(1, freq='s',
+                                                        periods=1000),
+                        'string': strings * 50,
+                        'string_missing': strings_missing * 50})
+        df['cat'] = df['string'].astype('category')
+
+        df2 = df.copy()
+        df2.index = pd.MultiIndex.from_product([range(100), range(10)])
+
+        # DataFrame - Single and MultiIndex,
+        # group by values, index level, columns
+        for df in [df, df2]:
+            for gb_target in [dict(by=labels), dict(level=0), dict(by='string')
+                              ]:  # dict(by='string_missing')]:
+                # dict(by=['int','string'])]:
+
+                gb = df.groupby(**gb_target)
+                # whitelisted methods set the selection before applying
+                # a bit of a hack to make sure the cythonized shift
+                # is equivalent to pre 0.17.1 behavior
+                if op == 'shift':
+                    gb._set_group_selection()
+
+                for (op, args), targop in ops:
+                    if op != 'shift' and 'int' not in gb_target:
+                        # numeric apply fastpath promotes dtype so have
+                        # to apply separately and concat
+                        i = gb[['int']].apply(targop)
+                        f = gb[['float', 'float_missing']].apply(targop)
+                        expected = pd.concat([f, i], axis=1)
+                    else:
+                        expected = gb.apply(targop)
+
+                    expected = expected.sort_index(axis=1)
+                    tm.assert_frame_equal(expected,
+                                          gb.transform(op, *args).sort_index(
+                                              axis=1))
+                    tm.assert_frame_equal(expected, getattr(gb, op)(*args))
+                    # individual columns
+                    for c in df:
+                        if c not in ['float', 'int', 'float_missing'
+                                     ] and op != 'shift':
+                            pytest.raises(DataError, gb[c].transform, op)
+                            pytest.raises(DataError, getattr(gb[c], op))
+                        else:
+                            expected = gb[c].apply(targop)
+                            expected.name = c
+                            tm.assert_series_equal(expected,
+                                                   gb[c].transform(op, *args))
+                            tm.assert_series_equal(expected,
+                                                   getattr(gb[c], op)(*args))
+
+    def test_transform_with_non_scalar_group(self):
+        # GH 10165
+        cols = pd.MultiIndex.from_tuples([
+            ('syn', 'A'), ('mis', 'A'), ('non', 'A'),
+            ('syn', 'C'), ('mis', 'C'), ('non', 'C'),
+            ('syn', 'T'), ('mis', 'T'), ('non', 'T'),
+            ('syn', 'G'), ('mis', 'G'), ('non', 'G')])
+        df = pd.DataFrame(np.random.randint(1, 10, (4, 12)),
+                          columns=cols,
+                          index=['A', 'C', 'G', 'T'])
+        tm.assert_raises_regex(ValueError, 'transform must return '
+                               'a scalar value for each '
+                               'group.*',
+                               df.groupby(axis=1, level=1).transform,
+                               lambda z: z.div(z.sum(axis=1), axis=0))
diff --git a/pandas/tests/groupby/test_value_counts.py b/pandas/tests/groupby/test_value_counts.py
new file mode 100644
index 0000000000000..b70a03ec3a1d3
--- /dev/null
+++ b/pandas/tests/groupby/test_value_counts.py
@@ -0,0 +1,61 @@
+import pytest
+
+from itertools import product
+import numpy as np
+
+from pandas.util import testing as tm
+from pandas import MultiIndex, DataFrame, Series, date_range
+
+
+@pytest.mark.slow
+@pytest.mark.parametrize("n,m", product((100, 1000), (5, 20)))
+def test_series_groupby_value_counts(n, m):
+    np.random.seed(1234)
+
+    def rebuild_index(df):
+        arr = list(map(df.index.get_level_values, range(df.index.nlevels)))
+        df.index = MultiIndex.from_arrays(arr, names=df.index.names)
+        return df
+
+    def check_value_counts(df, keys, bins):
+        for isort, normalize, sort, ascending, dropna \
+                in product((False, True), repeat=5):
+
+            kwargs = dict(normalize=normalize, sort=sort,
+                          ascending=ascending, dropna=dropna, bins=bins)
+
+            gr = df.groupby(keys, sort=isort)
+            left = gr['3rd'].value_counts(**kwargs)
+
+            gr = df.groupby(keys, sort=isort)
+            right = gr['3rd'].apply(Series.value_counts, **kwargs)
+            right.index.names = right.index.names[:-1] + ['3rd']
+
+            # have to sort on index because of unstable sort on values
+            left, right = map(rebuild_index, (left, right))  # xref GH9212
+            tm.assert_series_equal(left.sort_index(), right.sort_index())
+
+    def loop(df):
+        bins = None, np.arange(0, max(5, df['3rd'].max()) + 1, 2)
+        keys = '1st', '2nd', ('1st', '2nd')
+        for k, b in product(keys, bins):
+            check_value_counts(df, k, b)
+
+    days = date_range('2015-08-24', periods=10)
+
+    frame = DataFrame({
+        '1st': np.random.choice(
+            list('abcd'), n),
+        '2nd': np.random.choice(days, n),
+        '3rd': np.random.randint(1, m + 1, n)
+    })
+
+    loop(frame)
+
+    frame.loc[1::11, '1st'] = np.nan
+    frame.loc[3::17, '2nd'] = np.nan
+    frame.loc[7::19, '3rd'] = np.nan
+    frame.loc[8::19, '3rd'] = np.nan
+    frame.loc[9::19, '3rd'] = np.nan
+
+    loop(frame)
diff --git a/pandas/tests/groupby/test_whitelist.py b/pandas/tests/groupby/test_whitelist.py
new file mode 100644
index
0000000000000..5d131717f8345 --- /dev/null +++ b/pandas/tests/groupby/test_whitelist.py @@ -0,0 +1,301 @@ +""" +test methods relating to generic function evaluation +the so-called white/black lists +""" + +import pytest +from string import ascii_lowercase +import numpy as np +from pandas import DataFrame, Series, compat, date_range, Index, MultiIndex +from pandas.util import testing as tm +from pandas.compat import lrange, product + +AGG_FUNCTIONS = ['sum', 'prod', 'min', 'max', 'median', 'mean', 'skew', + 'mad', 'std', 'var', 'sem'] +AGG_FUNCTIONS_WITH_SKIPNA = ['skew', 'mad'] + +df_whitelist = frozenset([ + 'last', + 'first', + 'mean', + 'sum', + 'min', + 'max', + 'head', + 'tail', + 'cumcount', + 'resample', + 'rank', + 'quantile', + 'fillna', + 'mad', + 'any', + 'all', + 'take', + 'idxmax', + 'idxmin', + 'shift', + 'tshift', + 'ffill', + 'bfill', + 'pct_change', + 'skew', + 'plot', + 'boxplot', + 'hist', + 'median', + 'dtypes', + 'corrwith', + 'corr', + 'cov', + 'diff', +]) + +s_whitelist = frozenset([ + 'last', + 'first', + 'mean', + 'sum', + 'min', + 'max', + 'head', + 'tail', + 'cumcount', + 'resample', + 'rank', + 'quantile', + 'fillna', + 'mad', + 'any', + 'all', + 'take', + 'idxmax', + 'idxmin', + 'shift', + 'tshift', + 'ffill', + 'bfill', + 'pct_change', + 'skew', + 'plot', + 'hist', + 'median', + 'dtype', + 'corr', + 'cov', + 'diff', + 'unique', + 'nlargest', + 'nsmallest', +]) + + +@pytest.fixture +def mframe(): + index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', + 'three']], + labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], + [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], + names=['first', 'second']) + return DataFrame(np.random.randn(10, 3), index=index, + columns=['A', 'B', 'C']) + + +@pytest.fixture +def df(): + return DataFrame( + {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], + 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], + 'C': np.random.randn(8), + 'D': np.random.randn(8)}) + + +@pytest.fixture +def df_letters(): + letters = np.array(list(ascii_lowercase)) + N = 10 + random_letters = letters.take(np.random.randint(0, 26, N)) + df = DataFrame({'floats': N / 10 * Series(np.random.random(N)), + 'letters': Series(random_letters)}) + return df + + +@pytest.mark.parametrize( + "obj, whitelist", zip((df_letters(), df_letters().floats), + (df_whitelist, s_whitelist))) +def test_groupby_whitelist(df_letters, obj, whitelist): + df = df_letters + + # these are aliases so ok to have the alias __name__ + alias = {'bfill': 'backfill', + 'ffill': 'pad', + 'boxplot': None} + + gb = obj.groupby(df.letters) + + assert whitelist == gb._apply_whitelist + for m in whitelist: + + m = alias.get(m, m) + if m is None: + continue + + f = getattr(type(gb), m) + + # name + try: + n = f.__name__ + except AttributeError: + continue + assert n == m + + # qualname + if compat.PY3: + try: + n = f.__qualname__ + except AttributeError: + continue + assert n.endswith(m) + + +@pytest.fixture +def raw_frame(): + index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', + 'three']], + labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], + [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], + names=['first', 'second']) + raw_frame = DataFrame(np.random.randn(10, 3), index=index, + columns=Index(['A', 'B', 'C'], name='exp')) + raw_frame.iloc[1, [1, 2]] = np.nan + raw_frame.iloc[7, [0, 1]] = np.nan + return raw_frame + + +@pytest.mark.parametrize( + "op, level, axis, skipna", + product(AGG_FUNCTIONS, + lrange(2), lrange(2), + [True, False])) +def 
test_regression_whitelist_methods(raw_frame, op, level, axis, skipna):
+    # GH6944
+    # explicitly test the whitelisted methods
+
+    if axis == 0:
+        frame = raw_frame
+    else:
+        frame = raw_frame.T
+
+    if op in AGG_FUNCTIONS_WITH_SKIPNA:
+        grouped = frame.groupby(level=level, axis=axis)
+        result = getattr(grouped, op)(skipna=skipna)
+        expected = getattr(frame, op)(level=level, axis=axis,
+                                      skipna=skipna)
+        tm.assert_frame_equal(result, expected)
+    else:
+        grouped = frame.groupby(level=level, axis=axis)
+        result = getattr(grouped, op)()
+        expected = getattr(frame, op)(level=level, axis=axis)
+        tm.assert_frame_equal(result, expected)
+
+
+def test_groupby_blacklist(df_letters):
+    df = df_letters
+    s = df_letters.floats
+
+    blacklist = [
+        'eval', 'query', 'abs', 'where',
+        'mask', 'align', 'groupby', 'clip', 'astype',
+        'at', 'combine', 'consolidate', 'convert_objects',
+    ]
+    to_methods = [method for method in dir(df) if method.startswith('to_')]
+
+    blacklist.extend(to_methods)
+
+    # e.g., to_csv
+    defined_but_not_allowed = ("(?:^Cannot.+{0!r}.+{1!r}.+try using the "
+                               "'apply' method$)")
+
+    # e.g., query, eval
+    not_defined = "(?:^{1!r} object has no attribute {0!r}$)"
+    fmt = defined_but_not_allowed + '|' + not_defined
+    for bl in blacklist:
+        for obj in (df, s):
+            gb = obj.groupby(df.letters)
+            msg = fmt.format(bl, type(gb).__name__)
+            with tm.assert_raises_regex(AttributeError, msg):
+                getattr(gb, bl)
+
+
+def test_tab_completion(mframe):
+    grp = mframe.groupby(level='second')
+    results = set([v for v in dir(grp) if not v.startswith('_')])
+    expected = set(
+        ['A', 'B', 'C', 'agg', 'aggregate', 'apply', 'boxplot', 'filter',
+         'first', 'get_group', 'groups', 'hist', 'indices', 'last', 'max',
+         'mean', 'median', 'min', 'ngroups', 'nth', 'ohlc', 'plot',
+         'prod', 'size', 'std', 'sum', 'transform', 'var', 'sem', 'count',
+         'nunique', 'head', 'describe', 'cummax', 'quantile',
+         'rank', 'cumprod', 'tail', 'resample', 'cummin', 'fillna',
+         'cumsum', 'cumcount', 'all', 'shift', 'skew',
+         'take', 'tshift', 'pct_change', 'any', 'mad', 'corr', 'corrwith',
+         'cov', 'dtypes', 'ndim', 'diff', 'idxmax', 'idxmin',
+         'ffill', 'bfill', 'pad', 'backfill', 'rolling', 'expanding'])
+    assert results == expected
+
+
+def test_groupby_function_rename(mframe):
+    grp = mframe.groupby(level='second')
+    for name in ['sum', 'prod', 'min', 'max', 'first', 'last']:
+        f = getattr(grp, name)
+        assert f.__name__ == name
+
+
+def test_groupby_selection_with_methods(df):
+    # some methods which require DatetimeIndex
+    rng = date_range('2014', periods=len(df))
+    df.index = rng
+
+    g = df.groupby(['A'])[['C']]
+    g_exp = df[['C']].groupby(df['A'])
+    # TODO check groupby with > 1 col ?
+
+    # methods which are called as .foo()
+    methods = ['count',
+               'corr',
+               'cummax',
+               'cummin',
+               'cumprod',
+               'describe',
+               'rank',
+               'quantile',
+               'diff',
+               'shift',
+               'all',
+               'any',
+               'idxmin',
+               'idxmax',
+               'ffill',
+               'bfill',
+               'pct_change',
+               'tshift']
+
+    for m in methods:
+        res = getattr(g, m)()
+        exp = getattr(g_exp, m)()
+
+        # should always be frames!
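+        # (both g and g_exp select columns with a list, ``[['C']]``, so each
+        #  whitelisted method should return a DataFrame rather than a Series)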
+ tm.assert_frame_equal(res, exp) + + # methods which aren't just .foo() + tm.assert_frame_equal(g.fillna(0), g_exp.fillna(0)) + tm.assert_frame_equal(g.dtypes, g_exp.dtypes) + tm.assert_frame_equal(g.apply(lambda x: x.sum()), + g_exp.apply(lambda x: x.sum())) + + tm.assert_frame_equal(g.resample('D').mean(), g_exp.resample('D').mean()) + tm.assert_frame_equal(g.resample('D').ohlc(), + g_exp.resample('D').ohlc()) + + tm.assert_frame_equal(g.filter(lambda x: len(x) == 3), + g_exp.filter(lambda x: len(x) == 3)) diff --git a/pandas/tests/indexes/common.py b/pandas/tests/indexes/common.py index 1b373baf9b3c1..bbde902fb87bf 100644 --- a/pandas/tests/indexes/common.py +++ b/pandas/tests/indexes/common.py @@ -1,15 +1,19 @@ # -*- coding: utf-8 -*- +import pytest + from pandas import compat from pandas.compat import PY3 import numpy as np -from pandas import (Series, Index, Float64Index, Int64Index, RangeIndex, - MultiIndex, CategoricalIndex, DatetimeIndex, - TimedeltaIndex, PeriodIndex, notnull) -from pandas.types.common import needs_i8_conversion -from pandas.util.testing import assertRaisesRegexp +from pandas import (Series, Index, Float64Index, Int64Index, UInt64Index, + RangeIndex, MultiIndex, CategoricalIndex, DatetimeIndex, + TimedeltaIndex, PeriodIndex, IntervalIndex, + notnull, isnull) +from pandas.core.indexes.datetimelike import DatetimeIndexOpsMixin +from pandas.core.dtypes.common import needs_i8_conversion +from pandas._libs.tslib import iNaT import pandas.util.testing as tm @@ -26,8 +30,8 @@ def setup_indices(self): setattr(self, name, idx) def verify_pickle(self, index): - unpickled = self.round_trip_pickle(index) - self.assertTrue(index.equals(unpickled)) + unpickled = tm.round_trip_pickle(index) + assert index.equals(unpickled) def test_pickle_compat_construction(self): # this is testing for pickle compat @@ -35,14 +39,23 @@ def test_pickle_compat_construction(self): return # need an object to create with - self.assertRaises(TypeError, self._holder) + pytest.raises(TypeError, self._holder) + + def test_to_series(self): + # assert that we are creating a copy of the index + + idx = self.create_index() + s = idx.to_series() + assert s.values is not idx.values + assert s.index is not idx + assert s.name == idx.name def test_shift(self): # GH8083 test the base class for shift idx = self.create_index() - self.assertRaises(NotImplementedError, idx.shift, 1) - self.assertRaises(NotImplementedError, idx.shift, 1, 2) + pytest.raises(NotImplementedError, idx.shift, 1) + pytest.raises(NotImplementedError, idx.shift, 1, 2) def test_create_index_existing_name(self): @@ -77,26 +90,26 @@ def test_create_index_existing_name(self): def test_numeric_compat(self): idx = self.create_index() - tm.assertRaisesRegexp(TypeError, "cannot perform __mul__", - lambda: idx * 1) - tm.assertRaisesRegexp(TypeError, "cannot perform __mul__", - lambda: 1 * idx) + tm.assert_raises_regex(TypeError, "cannot perform __mul__", + lambda: idx * 1) + tm.assert_raises_regex(TypeError, "cannot perform __mul__", + lambda: 1 * idx) div_err = "cannot perform __truediv__" if PY3 \ else "cannot perform __div__" - tm.assertRaisesRegexp(TypeError, div_err, lambda: idx / 1) - tm.assertRaisesRegexp(TypeError, div_err, lambda: 1 / idx) - tm.assertRaisesRegexp(TypeError, "cannot perform __floordiv__", - lambda: idx // 1) - tm.assertRaisesRegexp(TypeError, "cannot perform __floordiv__", - lambda: 1 // idx) + tm.assert_raises_regex(TypeError, div_err, lambda: idx / 1) + tm.assert_raises_regex(TypeError, div_err, lambda: 1 / idx) + 
tm.assert_raises_regex(TypeError, "cannot perform __floordiv__", + lambda: idx // 1) + tm.assert_raises_regex(TypeError, "cannot perform __floordiv__", + lambda: 1 // idx) def test_logical_compat(self): idx = self.create_index() - tm.assertRaisesRegexp(TypeError, 'cannot perform all', - lambda: idx.all()) - tm.assertRaisesRegexp(TypeError, 'cannot perform any', - lambda: idx.any()) + tm.assert_raises_regex(TypeError, 'cannot perform all', + lambda: idx.all()) + tm.assert_raises_regex(TypeError, 'cannot perform any', + lambda: idx.any()) def test_boolean_context_compat(self): @@ -107,7 +120,7 @@ def f(): if idx: pass - tm.assertRaisesRegexp(ValueError, 'The truth value of a', f) + tm.assert_raises_regex(ValueError, 'The truth value of a', f) def test_reindex_base(self): idx = self.create_index() @@ -116,18 +129,17 @@ def test_reindex_base(self): actual = idx.get_indexer(idx) tm.assert_numpy_array_equal(expected, actual) - with tm.assertRaisesRegexp(ValueError, 'Invalid fill method'): + with tm.assert_raises_regex(ValueError, 'Invalid fill method'): idx.get_indexer(idx, method='invalid') def test_ndarray_compat_properties(self): - idx = self.create_index() - self.assertTrue(idx.T.equals(idx)) - self.assertTrue(idx.transpose().equals(idx)) + assert idx.T.equals(idx) + assert idx.transpose().equals(idx) values = idx.values for prop in self._compat_props: - self.assertEqual(getattr(idx, prop), getattr(values, prop)) + assert getattr(idx, prop) == getattr(values, prop) # test for validity idx.nbytes @@ -143,14 +155,14 @@ def test_str(self): # test the string repr idx = self.create_index() idx.name = 'foo' - self.assertTrue("'foo'" in str(idx)) - self.assertTrue(idx.__class__.__name__ in str(idx)) + assert "'foo'" in str(idx) + assert idx.__class__.__name__ in str(idx) def test_dtype_str(self): for idx in self.indices.values(): dtype = idx.dtype_str - self.assertIsInstance(dtype, compat.string_types) - self.assertEqual(dtype, str(idx.dtype)) + assert isinstance(dtype, compat.string_types) + assert dtype == str(idx.dtype) def test_repr_max_seq_item_setting(self): # GH10182 @@ -158,14 +170,14 @@ def test_repr_max_seq_item_setting(self): idx = idx.repeat(50) with pd.option_context("display.max_seq_items", None): repr(idx) - self.assertFalse('...' in str(idx)) + assert '...' 
not in str(idx) def test_wrong_number_names(self): def testit(ind): ind.names = ["apple", "banana", "carrot"] for ind in self.indices.values(): - assertRaisesRegexp(ValueError, "^Length", testit, ind) + tm.assert_raises_regex(ValueError, "^Length", testit, ind) def test_set_name_methods(self): new_name = "This is the new name for this index" @@ -177,35 +189,36 @@ def test_set_name_methods(self): original_name = ind.name new_ind = ind.set_names([new_name]) - self.assertEqual(new_ind.name, new_name) - self.assertEqual(ind.name, original_name) + assert new_ind.name == new_name + assert ind.name == original_name res = ind.rename(new_name, inplace=True) # should return None - self.assertIsNone(res) - self.assertEqual(ind.name, new_name) - self.assertEqual(ind.names, [new_name]) - # with assertRaisesRegexp(TypeError, "list-like"): + assert res is None + assert ind.name == new_name + assert ind.names == [new_name] + # with tm.assert_raises_regex(TypeError, "list-like"): # # should still fail even if it would be the right length # ind.set_names("a") - with assertRaisesRegexp(ValueError, "Level must be None"): + with tm.assert_raises_regex(ValueError, "Level must be None"): ind.set_names("a", level=0) # rename in place just leaves tuples and other containers alone name = ('A', 'B') ind.rename(name, inplace=True) - self.assertEqual(ind.name, name) - self.assertEqual(ind.names, [name]) + assert ind.name == name + assert ind.names == [name] def test_hash_error(self): for ind in self.indices.values(): - with tm.assertRaisesRegexp(TypeError, "unhashable type: %r" % - type(ind).__name__): + with tm.assert_raises_regex(TypeError, "unhashable type: %r" % + type(ind).__name__): hash(ind) def test_copy_name(self): - # Check that "name" argument passed at initialization is honoured - # GH12309 + # gh-12309: Check that the "name" argument + # passed at initialization is honored. + for name, index in compat.iteritems(self.indices): if isinstance(index, MultiIndex): continue @@ -214,18 +227,21 @@ def test_copy_name(self): second = first.__class__(first, copy=False) # Even though "copy=False", we want a new object. - self.assertIsNot(first, second) - # Not using tm.assert_index_equal() since names differ: - self.assertTrue(index.equals(first)) + assert first is not second - self.assertEqual(first.name, 'mario') - self.assertEqual(second.name, 'mario') + # Not using tm.assert_index_equal() since names differ. 
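+        # (first was constructed with name='mario' while the source index
+        #  keeps its original name, so only the values can be compared)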
+ assert index.equals(first) + + assert first.name == 'mario' + assert second.name == 'mario' s1 = Series(2, index=first) s2 = Series(3, index=second[:-1]) - if not isinstance(index, CategoricalIndex): # See GH13365 + + if not isinstance(index, CategoricalIndex): + # See gh-13365 s3 = s1 * s2 - self.assertEqual(s3.index.name, 'mario') + assert s3.index.name == 'mario' def test_ensure_copied_data(self): # Check the "copy" argument of each Index.__new__ is honoured @@ -246,18 +262,21 @@ def test_ensure_copied_data(self): tm.assert_numpy_array_equal(index.values, result.values, check_same='copy') - if not isinstance(index, PeriodIndex): - result = index_type(index.values, copy=False, **init_kwargs) - tm.assert_numpy_array_equal(index.values, result.values, - check_same='same') - tm.assert_numpy_array_equal(index._values, result._values, - check_same='same') - else: + if isinstance(index, PeriodIndex): # .values an object array of Period, thus copied result = index_type(ordinal=index.asi8, copy=False, **init_kwargs) tm.assert_numpy_array_equal(index._values, result._values, check_same='same') + elif isinstance(index, IntervalIndex): + # checked in test_interval.py + pass + else: + result = index_type(index.values, copy=False, **init_kwargs) + tm.assert_numpy_array_equal(index.values, result.values, + check_same='same') + tm.assert_numpy_array_equal(index._values, result._values, + check_same='same') def test_copy_and_deepcopy(self): from copy import copy, deepcopy @@ -270,11 +289,11 @@ def test_copy_and_deepcopy(self): for func in (copy, deepcopy): idx_copy = func(ind) - self.assertIsNot(idx_copy, ind) - self.assertTrue(idx_copy.equals(ind)) + assert idx_copy is not ind + assert idx_copy.equals(ind) new_copy = ind.copy(deep=True, name="banana") - self.assertEqual(new_copy.name, "banana") + assert new_copy.name == "banana" def test_duplicates(self): for ind in self.indices.values(): @@ -284,15 +303,15 @@ def test_duplicates(self): if isinstance(ind, MultiIndex): continue idx = self._holder([ind[0]] * 5) - self.assertFalse(idx.is_unique) - self.assertTrue(idx.has_duplicates) + assert not idx.is_unique + assert idx.has_duplicates # GH 10115 # preserve names idx.name = 'foo' result = idx.drop_duplicates() - self.assertEqual(result.name, 'foo') - self.assert_index_equal(result, Index([ind[0]], name='foo')) + assert result.name == 'foo' + tm.assert_index_equal(result, Index([ind[0]], name='foo')) def test_get_unique_index(self): for ind in self.indices.values(): @@ -303,26 +322,26 @@ def test_get_unique_index(self): idx = ind[[0] * 5] idx_unique = ind[[0]] + # We test against `idx_unique`, so first we make sure it's unique # and doesn't contain nans. 
- self.assertTrue(idx_unique.is_unique) + assert idx_unique.is_unique try: - self.assertFalse(idx_unique.hasnans) + assert not idx_unique.hasnans except NotImplementedError: pass for dropna in [False, True]: result = idx._get_unique_index(dropna=dropna) - self.assert_index_equal(result, idx_unique) + tm.assert_index_equal(result, idx_unique) # nans: - if not ind._can_hold_na: continue if needs_i8_conversion(ind): vals = ind.asi8[[0] * 5] - vals[0] = pd.tslib.iNaT + vals[0] = iNaT else: vals = ind.values[[0] * 5] vals[0] = np.nan @@ -330,41 +349,56 @@ def test_get_unique_index(self): vals_unique = vals[:2] idx_nan = ind._shallow_copy(vals) idx_unique_nan = ind._shallow_copy(vals_unique) - self.assertTrue(idx_unique_nan.is_unique) + assert idx_unique_nan.is_unique - self.assertEqual(idx_nan.dtype, ind.dtype) - self.assertEqual(idx_unique_nan.dtype, ind.dtype) + assert idx_nan.dtype == ind.dtype + assert idx_unique_nan.dtype == ind.dtype for dropna, expected in zip([False, True], [idx_unique_nan, idx_unique]): for i in [idx_nan, idx_unique_nan]: result = i._get_unique_index(dropna=dropna) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) def test_sort(self): for ind in self.indices.values(): - self.assertRaises(TypeError, ind.sort) - - def test_order(self): - for ind in self.indices.values(): - # 9816 deprecated - with tm.assert_produces_warning(FutureWarning): - ind.order() + pytest.raises(TypeError, ind.sort) def test_mutability(self): for ind in self.indices.values(): if not len(ind): continue - self.assertRaises(TypeError, ind.__setitem__, 0, ind[0]) + pytest.raises(TypeError, ind.__setitem__, 0, ind[0]) def test_view(self): for ind in self.indices.values(): i_view = ind.view() - self.assertEqual(i_view.name, ind.name) + assert i_view.name == ind.name def test_compat(self): for ind in self.indices.values(): - self.assertEqual(ind.tolist(), list(ind)) + assert ind.tolist() == list(ind) + + def test_memory_usage(self): + for name, index in compat.iteritems(self.indices): + result = index.memory_usage() + if len(index): + index.get_loc(index[0]) + result2 = index.memory_usage() + result3 = index.memory_usage(deep=True) + + # RangeIndex, IntervalIndex + # don't have engines + if not isinstance(index, (RangeIndex, IntervalIndex)): + assert result2 > result + + if index.inferred_type == 'object': + assert result3 > result2 + + else: + + # we report 0 for no-length + assert result == 0 def test_argsort(self): for k, ind in self.indices.items(): @@ -387,21 +421,21 @@ def test_numpy_argsort(self): # pandas compatibility input validation - the # rest already perform separate (or no) such # validation via their 'values' attribute as - # defined in pandas/indexes/base.py - they + # defined in pandas.core.indexes/base.py - they # cannot be changed at the moment due to # backwards compatibility concerns if isinstance(type(ind), (CategoricalIndex, RangeIndex)): msg = "the 'axis' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, - np.argsort, ind, axis=1) + tm.assert_raises_regex(ValueError, msg, + np.argsort, ind, axis=1) msg = "the 'kind' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.argsort, - ind, kind='mergesort') + tm.assert_raises_regex(ValueError, msg, np.argsort, + ind, kind='mergesort') msg = "the 'order' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.argsort, - ind, order=('a', 'b')) + tm.assert_raises_regex(ValueError, msg, np.argsort, + ind, order=('a', 'b')) def test_pickle(self): for 
ind in self.indices.values(): @@ -419,12 +453,12 @@ def test_take(self): result = ind.take(indexer) expected = ind[indexer] - self.assertTrue(result.equals(expected)) + assert result.equals(expected) if not isinstance(ind, (DatetimeIndex, PeriodIndex, TimedeltaIndex)): # GH 10791 - with tm.assertRaises(AttributeError): + with pytest.raises(AttributeError): ind.freq def test_take_invalid_kwargs(self): @@ -432,16 +466,16 @@ def test_take_invalid_kwargs(self): indices = [1, 2] msg = r"take\(\) got an unexpected keyword argument 'foo'" - tm.assertRaisesRegexp(TypeError, msg, idx.take, - indices, foo=2) + tm.assert_raises_regex(TypeError, msg, idx.take, + indices, foo=2) msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, idx.take, - indices, out=indices) + tm.assert_raises_regex(ValueError, msg, idx.take, + indices, out=indices) msg = "the 'mode' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, idx.take, - indices, mode='clip') + tm.assert_raises_regex(ValueError, msg, idx.take, + indices, mode='clip') def test_repeat(self): rep = 2 @@ -461,8 +495,8 @@ def test_numpy_repeat(self): tm.assert_index_equal(np.repeat(i, rep), expected) msg = "the 'axis' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.repeat, - i, rep, axis=0) + tm.assert_raises_regex(ValueError, msg, np.repeat, + i, rep, axis=0) def test_where(self): i = self.create_index() @@ -470,12 +504,25 @@ def test_where(self): expected = i tm.assert_index_equal(result, expected) - i2 = i.copy() - i2 = pd.Index([np.nan, np.nan] + i[2:].tolist()) - result = i.where(notnull(i2)) - expected = i2 + _nan = i._na_value + cond = [False] + [True] * len(i[1:]) + expected = pd.Index([_nan] + i[1:].tolist(), dtype=i.dtype) + + result = i.where(cond) tm.assert_index_equal(result, expected) + def test_where_array_like(self): + i = self.create_index() + + _nan = i._na_value + cond = [False] + [True] * (len(i) - 1) + klasses = [list, tuple, np.array, pd.Series] + expected = pd.Index([_nan] + i[1:].tolist(), dtype=i.dtype) + + for klass in klasses: + result = i.where(klass(cond)) + tm.assert_index_equal(result, expected) + def test_setops_errorcases(self): for name, idx in compat.iteritems(self.indices): # # non-iterable input @@ -485,9 +532,10 @@ def test_setops_errorcases(self): for method in methods: for case in cases: - assertRaisesRegexp(TypeError, - "Input must be Index or array-like", - method, case) + tm.assert_raises_regex(TypeError, + "Input must be Index " + "or array-like", + method, case) def test_intersection_base(self): for name, idx in compat.iteritems(self.indices): @@ -498,7 +546,7 @@ def test_intersection_base(self): if isinstance(idx, CategoricalIndex): pass else: - self.assertTrue(tm.equalContents(intersect, second)) + assert tm.equalContents(intersect, second) # GH 10149 cases = [klass(second.values) @@ -506,17 +554,17 @@ def test_intersection_base(self): for case in cases: if isinstance(idx, PeriodIndex): msg = "can only call with other PeriodIndex-ed objects" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): result = first.intersection(case) elif isinstance(idx, CategoricalIndex): pass else: result = first.intersection(case) - self.assertTrue(tm.equalContents(result, second)) + assert tm.equalContents(result, second) if isinstance(idx, MultiIndex): msg = "other must be a MultiIndex or a list of tuples" - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): result = 
first.intersection([1, 2, 3]) def test_union_base(self): @@ -525,7 +573,7 @@ def test_union_base(self): second = idx[:5] everything = idx union = first.union(second) - self.assertTrue(tm.equalContents(union, everything)) + assert tm.equalContents(union, everything) # GH 10149 cases = [klass(second.values) @@ -533,17 +581,17 @@ def test_union_base(self): for case in cases: if isinstance(idx, PeriodIndex): msg = "can only call with other PeriodIndex-ed objects" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): result = first.union(case) elif isinstance(idx, CategoricalIndex): pass else: result = first.union(case) - self.assertTrue(tm.equalContents(result, everything)) + assert tm.equalContents(result, everything) if isinstance(idx, MultiIndex): msg = "other must be a MultiIndex or a list of tuples" - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): result = first.union([1, 2, 3]) def test_difference_base(self): @@ -556,7 +604,7 @@ def test_difference_base(self): if isinstance(idx, CategoricalIndex): pass else: - self.assertTrue(tm.equalContents(result, answer)) + assert tm.equalContents(result, answer) # GH 10149 cases = [klass(second.values) @@ -564,20 +612,20 @@ def test_difference_base(self): for case in cases: if isinstance(idx, PeriodIndex): msg = "can only call with other PeriodIndex-ed objects" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): result = first.difference(case) elif isinstance(idx, CategoricalIndex): pass elif isinstance(idx, (DatetimeIndex, TimedeltaIndex)): - self.assertEqual(result.__class__, answer.__class__) + assert result.__class__ == answer.__class__ tm.assert_numpy_array_equal(result.asi8, answer.asi8) else: result = first.difference(case) - self.assertTrue(tm.equalContents(result, answer)) + assert tm.equalContents(result, answer) if isinstance(idx, MultiIndex): msg = "other must be a MultiIndex or a list of tuples" - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): result = first.difference([1, 2, 3]) def test_symmetric_difference(self): @@ -589,7 +637,7 @@ def test_symmetric_difference(self): else: answer = idx[[0, -1]] result = first.symmetric_difference(second) - self.assertTrue(tm.equalContents(result, answer)) + assert tm.equalContents(result, answer) # GH 10149 cases = [klass(second.values) @@ -597,17 +645,17 @@ def test_symmetric_difference(self): for case in cases: if isinstance(idx, PeriodIndex): msg = "can only call with other PeriodIndex-ed objects" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): result = first.symmetric_difference(case) elif isinstance(idx, CategoricalIndex): pass else: result = first.symmetric_difference(case) - self.assertTrue(tm.equalContents(result, answer)) + assert tm.equalContents(result, answer) if isinstance(idx, MultiIndex): msg = "other must be a MultiIndex or a list of tuples" - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): result = first.symmetric_difference([1, 2, 3]) # 12591 deprecated @@ -623,7 +671,7 @@ def test_insert_base(self): continue # test 0th element - self.assertTrue(idx[0:4].equals(result.insert(0, idx[0]))) + assert idx[0:4].equals(result.insert(0, idx[0])) def test_delete_base(self): @@ -638,31 +686,37 @@ def test_delete_base(self): expected = idx[1:] result = idx.delete(0) - self.assertTrue(result.equals(expected)) - 
self.assertEqual(result.name, expected.name) + assert result.equals(expected) + assert result.name == expected.name expected = idx[:-1] result = idx.delete(-1) - self.assertTrue(result.equals(expected)) - self.assertEqual(result.name, expected.name) + assert result.equals(expected) + assert result.name == expected.name - with tm.assertRaises((IndexError, ValueError)): + with pytest.raises((IndexError, ValueError)): # either depending on numpy version result = idx.delete(len(idx)) def test_equals(self): for name, idx in compat.iteritems(self.indices): - self.assertTrue(idx.equals(idx)) - self.assertTrue(idx.equals(idx.copy())) - self.assertTrue(idx.equals(idx.astype(object))) + assert idx.equals(idx) + assert idx.equals(idx.copy()) + assert idx.equals(idx.astype(object)) + + assert not idx.equals(list(idx)) + assert not idx.equals(np.array(idx)) - self.assertFalse(idx.equals(list(idx))) - self.assertFalse(idx.equals(np.array(idx))) + # Cannot pass in non-int64 dtype to RangeIndex + if not isinstance(idx, RangeIndex): + same_values = Index(idx, dtype=object) + assert idx.equals(same_values) + assert same_values.equals(idx) if idx.nlevels == 1: # do not test MultiIndex - self.assertFalse(idx.equals(pd.Series(idx))) + assert not idx.equals(pd.Series(idx)) def test_equals_op(self): # GH9947, GH10637 @@ -674,7 +728,7 @@ def test_equals_op(self): index_b = index_a[0:-1] index_c = index_a[0:-1].append(index_a[-2:-1]) index_d = index_a[0:1] - with tm.assertRaisesRegexp(ValueError, "Lengths must match"): + with tm.assert_raises_regex(ValueError, "Lengths must match"): index_a == index_b expected1 = np.array([True] * n) expected2 = np.array([True] * (n - 1) + [False]) @@ -686,7 +740,7 @@ def test_equals_op(self): array_b = np.array(index_a[0:-1]) array_c = np.array(index_a[0:-1].append(index_a[-2:-1])) array_d = np.array(index_a[0:1]) - with tm.assertRaisesRegexp(ValueError, "Lengths must match"): + with tm.assert_raises_regex(ValueError, "Lengths must match"): index_a == array_b tm.assert_numpy_array_equal(index_a == array_a, expected1) tm.assert_numpy_array_equal(index_a == array_c, expected2) @@ -696,22 +750,22 @@ def test_equals_op(self): series_b = Series(array_b) series_c = Series(array_c) series_d = Series(array_d) - with tm.assertRaisesRegexp(ValueError, "Lengths must match"): + with tm.assert_raises_regex(ValueError, "Lengths must match"): index_a == series_b tm.assert_numpy_array_equal(index_a == series_a, expected1) tm.assert_numpy_array_equal(index_a == series_c, expected2) # cases where length is 1 for one of them - with tm.assertRaisesRegexp(ValueError, "Lengths must match"): + with tm.assert_raises_regex(ValueError, "Lengths must match"): index_a == index_d - with tm.assertRaisesRegexp(ValueError, "Lengths must match"): + with tm.assert_raises_regex(ValueError, "Lengths must match"): index_a == series_d - with tm.assertRaisesRegexp(ValueError, "Lengths must match"): + with tm.assert_raises_regex(ValueError, "Lengths must match"): index_a == array_d msg = "Can only compare identically-labeled Series objects" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): series_a == series_d - with tm.assertRaisesRegexp(ValueError, "Lengths must match"): + with tm.assert_raises_regex(ValueError, "Lengths must match"): series_a == array_d # comparing with a scalar should broadcast; note that we are excluding @@ -738,44 +792,44 @@ def test_numpy_ufuncs(self): np.arccos, np.arctan, np.sinh, np.cosh, np.tanh, np.arcsinh, np.arccosh, np.arctanh, np.deg2rad, 
np.rad2deg]: - if isinstance(idx, pd.tseries.base.DatetimeIndexOpsMixin): + if isinstance(idx, DatetimeIndexOpsMixin): # raise TypeError or ValueError (PeriodIndex) # PeriodIndex behavior should be changed in future version - with tm.assertRaises(Exception): + with pytest.raises(Exception): with np.errstate(all='ignore'): func(idx) - elif isinstance(idx, (Float64Index, Int64Index)): + elif isinstance(idx, (Float64Index, Int64Index, UInt64Index)): # coerces to float (e.g. np.sin) with np.errstate(all='ignore'): result = func(idx) exp = Index(func(idx.values), name=idx.name) - self.assert_index_equal(result, exp) - self.assertIsInstance(result, pd.Float64Index) + + tm.assert_index_equal(result, exp) + assert isinstance(result, pd.Float64Index) else: # raise AttributeError or TypeError if len(idx) == 0: continue else: - with tm.assertRaises(Exception): + with pytest.raises(Exception): with np.errstate(all='ignore'): func(idx) for func in [np.isfinite, np.isinf, np.isnan, np.signbit]: - if isinstance(idx, pd.tseries.base.DatetimeIndexOpsMixin): + if isinstance(idx, DatetimeIndexOpsMixin): # raise TypeError or ValueError (PeriodIndex) - with tm.assertRaises(Exception): + with pytest.raises(Exception): func(idx) - elif isinstance(idx, (Float64Index, Int64Index)): - # results in bool array + elif isinstance(idx, (Float64Index, Int64Index, UInt64Index)): + # Results in bool array result = func(idx) - exp = func(idx.values) - self.assertIsInstance(result, np.ndarray) - tm.assertNotIsInstance(result, Index) + assert isinstance(result, np.ndarray) + assert not isinstance(result, Index) else: if len(idx) == 0: continue else: - with tm.assertRaises(Exception): + with pytest.raises(Exception): func(idx) def test_hasnans_isnans(self): @@ -788,17 +842,17 @@ def test_hasnans_isnans(self): # cases in indices doesn't include NaN expected = np.array([False] * len(idx), dtype=bool) - self.assert_numpy_array_equal(idx._isnan, expected) - self.assertFalse(idx.hasnans) + tm.assert_numpy_array_equal(idx._isnan, expected) + assert not idx.hasnans idx = index.copy() values = idx.values if len(index) == 0: continue - elif isinstance(index, pd.tseries.base.DatetimeIndexOpsMixin): - values[1] = pd.tslib.iNaT - elif isinstance(index, Int64Index): + elif isinstance(index, DatetimeIndexOpsMixin): + values[1] = iNaT + elif isinstance(index, (Int64Index, UInt64Index)): continue else: values[1] = np.nan @@ -810,8 +864,8 @@ def test_hasnans_isnans(self): expected = np.array([False] * len(idx), dtype=bool) expected[1] = True - self.assert_numpy_array_equal(idx._isnan, expected) - self.assertTrue(idx.hasnans) + tm.assert_numpy_array_equal(idx._isnan, expected) + assert idx.hasnans def test_fillna(self): # GH 11343 @@ -821,24 +875,24 @@ def test_fillna(self): elif isinstance(index, MultiIndex): idx = index.copy() msg = "isnull is not defined for MultiIndex" - with self.assertRaisesRegexp(NotImplementedError, msg): + with tm.assert_raises_regex(NotImplementedError, msg): idx.fillna(idx[0]) else: idx = index.copy() result = idx.fillna(idx[0]) - self.assert_index_equal(result, idx) - self.assertFalse(result is idx) + tm.assert_index_equal(result, idx) + assert result is not idx msg = "'value' must be a scalar, passed: " - with self.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): idx.fillna([idx[0]]) idx = index.copy() values = idx.values - if isinstance(index, pd.tseries.base.DatetimeIndexOpsMixin): - values[1] = pd.tslib.iNaT - elif isinstance(index, Int64Index): + if isinstance(index, 
DatetimeIndexOpsMixin): + values[1] = iNaT + elif isinstance(index, (Int64Index, UInt64Index)): + continue + else: + values[1] = np.nan @@ -850,5 +904,36 @@ def test_fillna(self): expected = np.array([False] * len(idx), dtype=bool) expected[1] = True - self.assert_numpy_array_equal(idx._isnan, expected) - self.assertTrue(idx.hasnans) + tm.assert_numpy_array_equal(idx._isnan, expected) + assert idx.hasnans + + def test_nulls(self): + # this is really a smoke test for the methods + # as these are adequately tested elsewhere + + for name, index in self.indices.items(): + if len(index) == 0: + tm.assert_numpy_array_equal( + index.isnull(), np.array([], dtype=bool)) + elif isinstance(index, MultiIndex): + idx = index.copy() + msg = "isnull is not defined for MultiIndex" + with tm.assert_raises_regex(NotImplementedError, msg): + idx.isnull() + else: + + if not index.hasnans: + tm.assert_numpy_array_equal( + index.isnull(), np.zeros(len(index), dtype=bool)) + tm.assert_numpy_array_equal( + index.notnull(), np.ones(len(index), dtype=bool)) + else: + result = isnull(index) + tm.assert_numpy_array_equal(index.isnull(), result) + tm.assert_numpy_array_equal(index.notnull(), ~result) + + def test_empty(self): + # GH 15270 + index = self.create_index() + assert not index.empty + assert index[:0].empty diff --git a/pandas/tests/indexes/data/s1-0.12.0.pickle b/pandas/tests/indexes/data/s1-0.12.0.pickle deleted file mode 100644 index 0ce9cfdf3aa94..0000000000000 Binary files a/pandas/tests/indexes/data/s1-0.12.0.pickle and /dev/null differ diff --git a/pandas/tests/indexes/data/s2-0.12.0.pickle b/pandas/tests/indexes/data/s2-0.12.0.pickle deleted file mode 100644 index 2318be2d9978b..0000000000000 Binary files a/pandas/tests/indexes/data/s2-0.12.0.pickle and /dev/null differ diff --git a/pandas/tests/indexes/datetimelike.py b/pandas/tests/indexes/datetimelike.py new file mode 100644 index 0000000000000..114940009377c --- /dev/null +++ b/pandas/tests/indexes/datetimelike.py @@ -0,0 +1,40 @@ +""" generic datetimelike tests """ + +from .common import Base +import pandas.util.testing as tm + + +class DatetimeLike(Base): + + def test_shift_identity(self): + + idx = self.create_index() + tm.assert_index_equal(idx, idx.shift(0)) + + def test_str(self): + + # test the string repr + idx = self.create_index() + idx.name = 'foo' + assert not "length=%s" % len(idx) in str(idx) + assert "'foo'" in str(idx) + assert idx.__class__.__name__ in str(idx) + + if hasattr(idx, 'tz'): + if idx.tz is not None: + assert idx.tz in str(idx) + if hasattr(idx, 'freq'): + assert "freq='%s'" % idx.freqstr in str(idx) + + def test_view(self): + super(DatetimeLike, self).test_view() + + i = self.create_index() + + i_view = i.view('i8') + result = self._holder(i) + tm.assert_index_equal(result, i) + + i_view = i.view(self._holder) + result = self._holder(i) + tm.assert_index_equal(result, i_view) diff --git a/pandas/tools/tests/__init__.py b/pandas/tests/indexes/datetimes/__init__.py similarity index 100% rename from pandas/tools/tests/__init__.py rename to pandas/tests/indexes/datetimes/__init__.py diff --git a/pandas/tests/indexes/datetimes/test_astype.py b/pandas/tests/indexes/datetimes/test_astype.py new file mode 100644 index 0000000000000..0f7acf1febae8 --- /dev/null +++ b/pandas/tests/indexes/datetimes/test_astype.py @@ -0,0 +1,312 @@ +import pytest + +import numpy as np + +from datetime import datetime +import pandas as pd +import pandas.util.testing as tm +from pandas import (DatetimeIndex, date_range, Series, NaT,
Index, Timestamp, + Int64Index, Period) + + +class TestDatetimeIndex(object): + + def test_astype(self): + # GH 13149, GH 13209 + idx = DatetimeIndex(['2016-05-16', 'NaT', NaT, np.NaN]) + + result = idx.astype(object) + expected = Index([Timestamp('2016-05-16')] + [NaT] * 3, dtype=object) + tm.assert_index_equal(result, expected) + + result = idx.astype(int) + expected = Int64Index([1463356800000000000] + + [-9223372036854775808] * 3, dtype=np.int64) + tm.assert_index_equal(result, expected) + + rng = date_range('1/1/2000', periods=10) + result = rng.astype('i8') + tm.assert_index_equal(result, Index(rng.asi8)) + tm.assert_numpy_array_equal(result.values, rng.asi8) + + def test_astype_with_tz(self): + + # with tz + rng = date_range('1/1/2000', periods=10, tz='US/Eastern') + result = rng.astype('datetime64[ns]') + expected = (date_range('1/1/2000', periods=10, + tz='US/Eastern') + .tz_convert('UTC').tz_localize(None)) + tm.assert_index_equal(result, expected) + + # GH 10442: testing astype(str) is correct for Series/DatetimeIndex + result = pd.Series(pd.date_range('2012-01-01', periods=3)).astype(str) + expected = pd.Series( + ['2012-01-01', '2012-01-02', '2012-01-03'], dtype=object) + tm.assert_series_equal(result, expected) + + result = Series(pd.date_range('2012-01-01', periods=3, + tz='US/Eastern')).astype(str) + expected = Series(['2012-01-01 00:00:00-05:00', + '2012-01-02 00:00:00-05:00', + '2012-01-03 00:00:00-05:00'], + dtype=object) + tm.assert_series_equal(result, expected) + + def test_astype_str_compat(self): + # GH 13149, GH 13209 + # verify that we are returning NaT as a string (and not unicode) + + idx = DatetimeIndex(['2016-05-16', 'NaT', NaT, np.NaN]) + result = idx.astype(str) + expected = Index(['2016-05-16', 'NaT', 'NaT', 'NaT'], dtype=object) + tm.assert_index_equal(result, expected) + + def test_astype_str(self): + # test astype string - #10442 + result = date_range('2012-01-01', periods=4, + name='test_name').astype(str) + expected = Index(['2012-01-01', '2012-01-02', '2012-01-03', + '2012-01-04'], name='test_name', dtype=object) + tm.assert_index_equal(result, expected) + + # test astype string with tz and name + result = date_range('2012-01-01', periods=3, name='test_name', + tz='US/Eastern').astype(str) + expected = Index(['2012-01-01 00:00:00-05:00', + '2012-01-02 00:00:00-05:00', + '2012-01-03 00:00:00-05:00'], + name='test_name', dtype=object) + tm.assert_index_equal(result, expected) + + # test astype string with freq='H' and name + result = date_range('1/1/2011', periods=3, freq='H', + name='test_name').astype(str) + expected = Index(['2011-01-01 00:00:00', '2011-01-01 01:00:00', + '2011-01-01 02:00:00'], + name='test_name', dtype=object) + tm.assert_index_equal(result, expected) + + # test astype string with freq='H' and timezone + result = date_range('3/6/2012 00:00', periods=2, freq='H', + tz='Europe/London', name='test_name').astype(str) + expected = Index(['2012-03-06 00:00:00+00:00', + '2012-03-06 01:00:00+00:00'], + dtype=object, name='test_name') + tm.assert_index_equal(result, expected) + + def test_astype_datetime64(self): + # GH 13149, GH 13209 + idx = DatetimeIndex(['2016-05-16', 'NaT', NaT, np.NaN]) + + result = idx.astype('datetime64[ns]') + tm.assert_index_equal(result, idx) + assert result is not idx + + result = idx.astype('datetime64[ns]', copy=False) + tm.assert_index_equal(result, idx) + assert result is idx + + idx_tz = DatetimeIndex(['2016-05-16', 'NaT', NaT, np.NaN], tz='EST') + result = idx_tz.astype('datetime64[ns]') + expected =
DatetimeIndex(['2016-05-16 05:00:00', 'NaT', 'NaT', 'NaT'], + dtype='datetime64[ns]') + tm.assert_index_equal(result, expected) + + def test_astype_raises(self): + # GH 13149, GH 13209 + idx = DatetimeIndex(['2016-05-16', 'NaT', NaT, np.NaN]) + + pytest.raises(ValueError, idx.astype, float) + pytest.raises(ValueError, idx.astype, 'timedelta64') + pytest.raises(ValueError, idx.astype, 'timedelta64[ns]') + pytest.raises(ValueError, idx.astype, 'datetime64') + pytest.raises(ValueError, idx.astype, 'datetime64[D]') + + def test_index_convert_to_datetime_array(self): + tm._skip_if_no_pytz() + + def _check_rng(rng): + converted = rng.to_pydatetime() + assert isinstance(converted, np.ndarray) + for x, stamp in zip(converted, rng): + assert isinstance(x, datetime) + assert x == stamp.to_pydatetime() + assert x.tzinfo == stamp.tzinfo + + rng = date_range('20090415', '20090519') + rng_eastern = date_range('20090415', '20090519', tz='US/Eastern') + rng_utc = date_range('20090415', '20090519', tz='utc') + + _check_rng(rng) + _check_rng(rng_eastern) + _check_rng(rng_utc) + + def test_index_convert_to_datetime_array_explicit_pytz(self): + tm._skip_if_no_pytz() + import pytz + + def _check_rng(rng): + converted = rng.to_pydatetime() + assert isinstance(converted, np.ndarray) + for x, stamp in zip(converted, rng): + assert isinstance(x, datetime) + assert x == stamp.to_pydatetime() + assert x.tzinfo == stamp.tzinfo + + rng = date_range('20090415', '20090519') + rng_eastern = date_range('20090415', '20090519', + tz=pytz.timezone('US/Eastern')) + rng_utc = date_range('20090415', '20090519', tz=pytz.utc) + + _check_rng(rng) + _check_rng(rng_eastern) + _check_rng(rng_utc) + + def test_index_convert_to_datetime_array_dateutil(self): + tm._skip_if_no_dateutil() + import dateutil + + def _check_rng(rng): + converted = rng.to_pydatetime() + assert isinstance(converted, np.ndarray) + for x, stamp in zip(converted, rng): + assert isinstance(x, datetime) + assert x == stamp.to_pydatetime() + assert x.tzinfo == stamp.tzinfo + + rng = date_range('20090415', '20090519') + rng_eastern = date_range('20090415', '20090519', + tz='dateutil/US/Eastern') + rng_utc = date_range('20090415', '20090519', tz=dateutil.tz.tzutc()) + + _check_rng(rng) + _check_rng(rng_eastern) + _check_rng(rng_utc) + + +class TestToPeriod(object): + + def setup_method(self, method): + data = [Timestamp('2007-01-01 10:11:12.123456Z'), + Timestamp('2007-01-01 10:11:13.789123Z')] + self.index = DatetimeIndex(data) + + def test_to_period_millisecond(self): + index = self.index + + period = index.to_period(freq='L') + assert 2 == len(period) + assert period[0] == Period('2007-01-01 10:11:12.123Z', 'L') + assert period[1] == Period('2007-01-01 10:11:13.789Z', 'L') + + def test_to_period_microsecond(self): + index = self.index + + period = index.to_period(freq='U') + assert 2 == len(period) + assert period[0] == Period('2007-01-01 10:11:12.123456Z', 'U') + assert period[1] == Period('2007-01-01 10:11:13.789123Z', 'U') + + def test_to_period_tz_pytz(self): + tm._skip_if_no_pytz() + from dateutil.tz import tzlocal + from pytz import utc as UTC + + xp = date_range('1/1/2000', '4/1/2000').to_period() + + ts = date_range('1/1/2000', '4/1/2000', tz='US/Eastern') + + result = ts.to_period()[0] + expected = ts[0].to_period() + + assert result == expected + tm.assert_index_equal(ts.to_period(), xp) + + ts = date_range('1/1/2000', '4/1/2000', tz=UTC) + + result = ts.to_period()[0] + expected = ts[0].to_period() + + assert result == expected + 
tm.assert_index_equal(ts.to_period(), xp) + + ts = date_range('1/1/2000', '4/1/2000', tz=tzlocal()) + + result = ts.to_period()[0] + expected = ts[0].to_period() + + assert result == expected + tm.assert_index_equal(ts.to_period(), xp) + + def test_to_period_tz_explicit_pytz(self): + tm._skip_if_no_pytz() + import pytz + from dateutil.tz import tzlocal + + xp = date_range('1/1/2000', '4/1/2000').to_period() + + ts = date_range('1/1/2000', '4/1/2000', tz=pytz.timezone('US/Eastern')) + + result = ts.to_period()[0] + expected = ts[0].to_period() + + assert result == expected + tm.assert_index_equal(ts.to_period(), xp) + + ts = date_range('1/1/2000', '4/1/2000', tz=pytz.utc) + + result = ts.to_period()[0] + expected = ts[0].to_period() + + assert result == expected + tm.assert_index_equal(ts.to_period(), xp) + + ts = date_range('1/1/2000', '4/1/2000', tz=tzlocal()) + + result = ts.to_period()[0] + expected = ts[0].to_period() + + assert result == expected + tm.assert_index_equal(ts.to_period(), xp) + + def test_to_period_tz_dateutil(self): + tm._skip_if_no_dateutil() + import dateutil + from dateutil.tz import tzlocal + + xp = date_range('1/1/2000', '4/1/2000').to_period() + + ts = date_range('1/1/2000', '4/1/2000', tz='dateutil/US/Eastern') + + result = ts.to_period()[0] + expected = ts[0].to_period() + + assert result == expected + tm.assert_index_equal(ts.to_period(), xp) + + ts = date_range('1/1/2000', '4/1/2000', tz=dateutil.tz.tzutc()) + + result = ts.to_period()[0] + expected = ts[0].to_period() + + assert result == expected + tm.assert_index_equal(ts.to_period(), xp) + + ts = date_range('1/1/2000', '4/1/2000', tz=tzlocal()) + + result = ts.to_period()[0] + expected = ts[0].to_period() + + assert result == expected + tm.assert_index_equal(ts.to_period(), xp) + + def test_astype_object(self): + # NumPy 1.6.1 weak ns support + rng = date_range('1/1/2000', periods=20) + + casted = rng.astype('O') + exp_values = list(rng) + + tm.assert_index_equal(casted, Index(exp_values, dtype=np.object_)) + assert casted.tolist() == exp_values diff --git a/pandas/tests/indexes/datetimes/test_construction.py b/pandas/tests/indexes/datetimes/test_construction.py new file mode 100644 index 0000000000000..fcfc56ea823da --- /dev/null +++ b/pandas/tests/indexes/datetimes/test_construction.py @@ -0,0 +1,593 @@ +import pytest + +import numpy as np +from datetime import timedelta + +import pandas as pd +from pandas import offsets +import pandas.util.testing as tm +from pandas._libs import tslib, lib +from pandas._libs.tslib import OutOfBoundsDatetime +from pandas import (DatetimeIndex, Index, Timestamp, datetime, date_range, + to_datetime) + + +class TestDatetimeIndex(object): + + def test_construction_caching(self): + + df = pd.DataFrame({'dt': pd.date_range('20130101', periods=3), + 'dttz': pd.date_range('20130101', periods=3, + tz='US/Eastern'), + 'dt_with_null': [pd.Timestamp('20130101'), pd.NaT, + pd.Timestamp('20130103')], + 'dtns': pd.date_range('20130101', periods=3, + freq='ns')}) + assert df.dttz.dtype.tz.zone == 'US/Eastern' + + def test_construction_with_alt(self): + + i = pd.date_range('20130101', periods=5, freq='H', tz='US/Eastern') + i2 = DatetimeIndex(i, dtype=i.dtype) + tm.assert_index_equal(i, i2) + assert i.tz.zone == 'US/Eastern' + + i2 = DatetimeIndex(i.tz_localize(None).asi8, tz=i.dtype.tz) + tm.assert_index_equal(i, i2) + assert i.tz.zone == 'US/Eastern' + + i2 = DatetimeIndex(i.tz_localize(None).asi8, dtype=i.dtype) + tm.assert_index_equal(i, i2) + assert i.tz.zone == 'US/Eastern' + + i2 
= DatetimeIndex( + i.tz_localize(None).asi8, dtype=i.dtype, tz=i.dtype.tz) + tm.assert_index_equal(i, i2) + assert i.tz.zone == 'US/Eastern' + + # localize into the provided tz + i2 = DatetimeIndex(i.tz_localize(None).asi8, tz='UTC') + expected = i.tz_localize(None).tz_localize('UTC') + tm.assert_index_equal(i2, expected) + + # incompat tz/dtype + pytest.raises(ValueError, lambda: DatetimeIndex( + i.tz_localize(None).asi8, dtype=i.dtype, tz='US/Pacific')) + + def test_construction_index_with_mixed_timezones(self): + # gh-11488: no tz results in DatetimeIndex + result = Index([Timestamp('2011-01-01'), + Timestamp('2011-01-02')], name='idx') + exp = DatetimeIndex([Timestamp('2011-01-01'), + Timestamp('2011-01-02')], name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + assert result.tz is None + + # same tz results in DatetimeIndex + result = Index([Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), + Timestamp('2011-01-02 10:00', tz='Asia/Tokyo')], + name='idx') + exp = DatetimeIndex( + [Timestamp('2011-01-01 10:00'), Timestamp('2011-01-02 10:00') + ], tz='Asia/Tokyo', name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + assert result.tz is not None + assert result.tz == exp.tz + + # same tz results in DatetimeIndex (DST) + result = Index([Timestamp('2011-01-01 10:00', tz='US/Eastern'), + Timestamp('2011-08-01 10:00', tz='US/Eastern')], + name='idx') + exp = DatetimeIndex([Timestamp('2011-01-01 10:00'), + Timestamp('2011-08-01 10:00')], + tz='US/Eastern', name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + assert result.tz is not None + assert result.tz == exp.tz + + # Different tz results in Index(dtype=object) + result = Index([Timestamp('2011-01-01 10:00'), + Timestamp('2011-01-02 10:00', tz='US/Eastern')], + name='idx') + exp = Index([Timestamp('2011-01-01 10:00'), + Timestamp('2011-01-02 10:00', tz='US/Eastern')], + dtype='object', name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert not isinstance(result, DatetimeIndex) + + result = Index([Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), + Timestamp('2011-01-02 10:00', tz='US/Eastern')], + name='idx') + exp = Index([Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), + Timestamp('2011-01-02 10:00', tz='US/Eastern')], + dtype='object', name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert not isinstance(result, DatetimeIndex) + + # length = 1 + result = Index([Timestamp('2011-01-01')], name='idx') + exp = DatetimeIndex([Timestamp('2011-01-01')], name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + assert result.tz is None + + # length = 1 with tz + result = Index( + [Timestamp('2011-01-01 10:00', tz='Asia/Tokyo')], name='idx') + exp = DatetimeIndex([Timestamp('2011-01-01 10:00')], tz='Asia/Tokyo', + name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + assert result.tz is not None + assert result.tz == exp.tz + + def test_construction_index_with_mixed_timezones_with_NaT(self): + # see gh-11488 + result = Index([pd.NaT, Timestamp('2011-01-01'), + pd.NaT, Timestamp('2011-01-02')], name='idx') + exp = DatetimeIndex([pd.NaT, Timestamp('2011-01-01'), + pd.NaT, Timestamp('2011-01-02')], name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + assert result.tz is None + + # Same tz results in DatetimeIndex + 
result = Index([pd.NaT, Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), + pd.NaT, Timestamp('2011-01-02 10:00', + tz='Asia/Tokyo')], + name='idx') + exp = DatetimeIndex([pd.NaT, Timestamp('2011-01-01 10:00'), + pd.NaT, Timestamp('2011-01-02 10:00')], + tz='Asia/Tokyo', name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + assert result.tz is not None + assert result.tz == exp.tz + + # same tz results in DatetimeIndex (DST) + result = Index([Timestamp('2011-01-01 10:00', tz='US/Eastern'), + pd.NaT, + Timestamp('2011-08-01 10:00', tz='US/Eastern')], + name='idx') + exp = DatetimeIndex([Timestamp('2011-01-01 10:00'), pd.NaT, + Timestamp('2011-08-01 10:00')], + tz='US/Eastern', name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + assert result.tz is not None + assert result.tz == exp.tz + + # different tz results in Index(dtype=object) + result = Index([pd.NaT, Timestamp('2011-01-01 10:00'), + pd.NaT, Timestamp('2011-01-02 10:00', + tz='US/Eastern')], + name='idx') + exp = Index([pd.NaT, Timestamp('2011-01-01 10:00'), + pd.NaT, Timestamp('2011-01-02 10:00', tz='US/Eastern')], + dtype='object', name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert not isinstance(result, DatetimeIndex) + + result = Index([pd.NaT, Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), + pd.NaT, Timestamp('2011-01-02 10:00', + tz='US/Eastern')], name='idx') + exp = Index([pd.NaT, Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), + pd.NaT, Timestamp('2011-01-02 10:00', tz='US/Eastern')], + dtype='object', name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert not isinstance(result, DatetimeIndex) + + # all NaT + result = Index([pd.NaT, pd.NaT], name='idx') + exp = DatetimeIndex([pd.NaT, pd.NaT], name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + assert result.tz is None + + # all NaT with tz + result = Index([pd.NaT, pd.NaT], tz='Asia/Tokyo', name='idx') + exp = DatetimeIndex([pd.NaT, pd.NaT], tz='Asia/Tokyo', name='idx') + + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + assert result.tz is not None + assert result.tz == exp.tz + + def test_construction_dti_with_mixed_timezones(self): + # GH 11488 (not changed, added explicit tests) + + # no tz results in DatetimeIndex + result = DatetimeIndex( + [Timestamp('2011-01-01'), Timestamp('2011-01-02')], name='idx') + exp = DatetimeIndex( + [Timestamp('2011-01-01'), Timestamp('2011-01-02')], name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + + # same tz results in DatetimeIndex + result = DatetimeIndex([Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), + Timestamp('2011-01-02 10:00', + tz='Asia/Tokyo')], + name='idx') + exp = DatetimeIndex([Timestamp('2011-01-01 10:00'), + Timestamp('2011-01-02 10:00')], + tz='Asia/Tokyo', name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + + # same tz results in DatetimeIndex (DST) + result = DatetimeIndex([Timestamp('2011-01-01 10:00', tz='US/Eastern'), + Timestamp('2011-08-01 10:00', + tz='US/Eastern')], + name='idx') + exp = DatetimeIndex([Timestamp('2011-01-01 10:00'), + Timestamp('2011-08-01 10:00')], + tz='US/Eastern', name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + + # different tz coerces tz-naive to tz-aware + result =
DatetimeIndex([Timestamp('2011-01-01 10:00'), + Timestamp('2011-01-02 10:00', + tz='US/Eastern')], name='idx') + exp = DatetimeIndex([Timestamp('2011-01-01 05:00'), + Timestamp('2011-01-02 10:00')], + tz='US/Eastern', name='idx') + tm.assert_index_equal(result, exp, exact=True) + assert isinstance(result, DatetimeIndex) + + # tz mismatch with tz-aware data raises TypeError/ValueError + + with pytest.raises(ValueError): + DatetimeIndex([Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), + Timestamp('2011-01-02 10:00', tz='US/Eastern')], + name='idx') + + with tm.assert_raises_regex(TypeError, + 'data is already tz-aware'): + DatetimeIndex([Timestamp('2011-01-01 10:00'), + Timestamp('2011-01-02 10:00', tz='US/Eastern')], + tz='Asia/Tokyo', name='idx') + + with pytest.raises(ValueError): + DatetimeIndex([Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), + Timestamp('2011-01-02 10:00', tz='US/Eastern')], + tz='US/Eastern', name='idx') + + with tm.assert_raises_regex(TypeError, + 'data is already tz-aware'): + # passing tz should result in DatetimeIndex, then mismatch raises + # TypeError + Index([pd.NaT, Timestamp('2011-01-01 10:00'), + pd.NaT, Timestamp('2011-01-02 10:00', tz='US/Eastern')], + tz='Asia/Tokyo', name='idx') + + def test_construction_base_constructor(self): + arr = [pd.Timestamp('2011-01-01'), pd.NaT, pd.Timestamp('2011-01-03')] + tm.assert_index_equal(pd.Index(arr), pd.DatetimeIndex(arr)) + tm.assert_index_equal(pd.Index(np.array(arr)), + pd.DatetimeIndex(np.array(arr))) + + arr = [np.nan, pd.NaT, pd.Timestamp('2011-01-03')] + tm.assert_index_equal(pd.Index(arr), pd.DatetimeIndex(arr)) + tm.assert_index_equal(pd.Index(np.array(arr)), + pd.DatetimeIndex(np.array(arr))) + + def test_construction_outofbounds(self): + # GH 13663 + dates = [datetime(3000, 1, 1), datetime(4000, 1, 1), + datetime(5000, 1, 1), datetime(6000, 1, 1)] + exp = Index(dates, dtype=object) + # coerces to object + tm.assert_index_equal(Index(dates), exp) + + with pytest.raises(OutOfBoundsDatetime): + # can't create DatetimeIndex + DatetimeIndex(dates) + + def test_construction_with_ndarray(self): + # GH 5152 + dates = [datetime(2013, 10, 7), + datetime(2013, 10, 8), + datetime(2013, 10, 9)] + data = DatetimeIndex(dates, freq=pd.tseries.frequencies.BDay()).values + result = DatetimeIndex(data, freq=pd.tseries.frequencies.BDay()) + expected = DatetimeIndex(['2013-10-07', + '2013-10-08', + '2013-10-09'], + freq='B') + tm.assert_index_equal(result, expected) + + def test_constructor_coverage(self): + rng = date_range('1/1/2000', periods=10.5) + exp = date_range('1/1/2000', periods=10) + tm.assert_index_equal(rng, exp) + + pytest.raises(ValueError, DatetimeIndex, start='1/1/2000', + periods='foo', freq='D') + + pytest.raises(ValueError, DatetimeIndex, start='1/1/2000', + end='1/10/2000') + + pytest.raises(ValueError, DatetimeIndex, '1/1/2000') + + # generator expression + gen = (datetime(2000, 1, 1) + timedelta(i) for i in range(10)) + result = DatetimeIndex(gen) + expected = DatetimeIndex([datetime(2000, 1, 1) + timedelta(i) + for i in range(10)]) + tm.assert_index_equal(result, expected) + + # NumPy string array + strings = np.array(['2000-01-01', '2000-01-02', '2000-01-03']) + result = DatetimeIndex(strings) + expected = DatetimeIndex(strings.astype('O')) + tm.assert_index_equal(result, expected) + + from_ints = DatetimeIndex(expected.asi8) + tm.assert_index_equal(from_ints, expected) + + # string with NaT + strings = np.array(['2000-01-01', '2000-01-02', 'NaT']) + result = DatetimeIndex(strings) + expected =
DatetimeIndex(strings.astype('O')) + tm.assert_index_equal(result, expected) + + from_ints = DatetimeIndex(expected.asi8) + tm.assert_index_equal(from_ints, expected) + + # non-conforming + pytest.raises(ValueError, DatetimeIndex, + ['2000-01-01', '2000-01-02', '2000-01-04'], freq='D') + + pytest.raises(ValueError, DatetimeIndex, start='2011-01-01', + freq='b') + pytest.raises(ValueError, DatetimeIndex, end='2011-01-01', + freq='B') + pytest.raises(ValueError, DatetimeIndex, periods=10, freq='D') + + def test_constructor_datetime64_tzformat(self): + # GH 6572 + tm._skip_if_no_pytz() + import pytz + # ISO 8601 format results in pytz.FixedOffset + for freq in ['AS', 'W-SUN']: + idx = date_range('2013-01-01T00:00:00-05:00', + '2016-01-01T23:59:59-05:00', freq=freq) + expected = date_range('2013-01-01T00:00:00', '2016-01-01T23:59:59', + freq=freq, tz=pytz.FixedOffset(-300)) + tm.assert_index_equal(idx, expected) + # Unable to use `US/Eastern` because of DST + expected_i8 = date_range('2013-01-01T00:00:00', + '2016-01-01T23:59:59', freq=freq, + tz='America/Lima') + tm.assert_numpy_array_equal(idx.asi8, expected_i8.asi8) + + idx = date_range('2013-01-01T00:00:00+09:00', + '2016-01-01T23:59:59+09:00', freq=freq) + expected = date_range('2013-01-01T00:00:00', '2016-01-01T23:59:59', + freq=freq, tz=pytz.FixedOffset(540)) + tm.assert_index_equal(idx, expected) + expected_i8 = date_range('2013-01-01T00:00:00', + '2016-01-01T23:59:59', freq=freq, + tz='Asia/Tokyo') + tm.assert_numpy_array_equal(idx.asi8, expected_i8.asi8) + + tm._skip_if_no_dateutil() + + # Non ISO 8601 format results in dateutil.tz.tzoffset + for freq in ['AS', 'W-SUN']: + idx = date_range('2013/1/1 0:00:00-5:00', '2016/1/1 23:59:59-5:00', + freq=freq) + expected = date_range('2013-01-01T00:00:00', '2016-01-01T23:59:59', + freq=freq, tz=pytz.FixedOffset(-300)) + tm.assert_index_equal(idx, expected) + # Unable to use `US/Eastern` because of DST + expected_i8 = date_range('2013-01-01T00:00:00', + '2016-01-01T23:59:59', freq=freq, + tz='America/Lima') + tm.assert_numpy_array_equal(idx.asi8, expected_i8.asi8) + + idx = date_range('2013/1/1 0:00:00+9:00', + '2016/1/1 23:59:59+09:00', freq=freq) + expected = date_range('2013-01-01T00:00:00', '2016-01-01T23:59:59', + freq=freq, tz=pytz.FixedOffset(540)) + tm.assert_index_equal(idx, expected) + expected_i8 = date_range('2013-01-01T00:00:00', + '2016-01-01T23:59:59', freq=freq, + tz='Asia/Tokyo') + tm.assert_numpy_array_equal(idx.asi8, expected_i8.asi8) + + def test_constructor_dtype(self): + + # passing a dtype with a tz should localize + idx = DatetimeIndex(['2013-01-01', '2013-01-02'], + dtype='datetime64[ns, US/Eastern]') + expected = DatetimeIndex(['2013-01-01', '2013-01-02'] + ).tz_localize('US/Eastern') + tm.assert_index_equal(idx, expected) + + idx = DatetimeIndex(['2013-01-01', '2013-01-02'], + tz='US/Eastern') + tm.assert_index_equal(idx, expected) + + # if we already have a tz and it's not the same, then raise + idx = DatetimeIndex(['2013-01-01', '2013-01-02'], + dtype='datetime64[ns, US/Eastern]') + + pytest.raises(ValueError, + lambda: DatetimeIndex(idx, + dtype='datetime64[ns]')) + + # this is effectively trying to convert tz's + pytest.raises(TypeError, + lambda: DatetimeIndex(idx, + dtype='datetime64[ns, CET]')) + pytest.raises(ValueError, + lambda: DatetimeIndex( + idx, tz='CET', + dtype='datetime64[ns, US/Eastern]')) + result = DatetimeIndex(idx, dtype='datetime64[ns, US/Eastern]') + tm.assert_index_equal(idx, result) + + def test_constructor_name(self): + idx =
DatetimeIndex(start='2000-01-01', periods=1, freq='A', + name='TEST') + assert idx.name == 'TEST' + + def test_000constructor_resolution(self): + # 2252 + t1 = Timestamp((1352934390 * 1000000000) + 1000000 + 1000 + 1) + idx = DatetimeIndex([t1]) + + assert idx.nanosecond[0] == t1.nanosecond + + +class TestTimeSeries(object): + + def test_dti_constructor_preserve_dti_freq(self): + rng = date_range('1/1/2000', '1/2/2000', freq='5min') + + rng2 = DatetimeIndex(rng) + assert rng.freq == rng2.freq + + def test_dti_constructor_years_only(self): + # GH 6961 + for tz in [None, 'UTC', 'Asia/Tokyo', 'dateutil/US/Pacific']: + rng1 = date_range('2014', '2015', freq='M', tz=tz) + expected1 = date_range('2014-01-31', '2014-12-31', freq='M', tz=tz) + + rng2 = date_range('2014', '2015', freq='MS', tz=tz) + expected2 = date_range('2014-01-01', '2015-01-01', freq='MS', + tz=tz) + + rng3 = date_range('2014', '2020', freq='A', tz=tz) + expected3 = date_range('2014-12-31', '2019-12-31', freq='A', tz=tz) + + rng4 = date_range('2014', '2020', freq='AS', tz=tz) + expected4 = date_range('2014-01-01', '2020-01-01', freq='AS', + tz=tz) + + for rng, expected in [(rng1, expected1), (rng2, expected2), + (rng3, expected3), (rng4, expected4)]: + tm.assert_index_equal(rng, expected) + + def test_dti_constructor_small_int(self): + # GH 13721 + exp = DatetimeIndex(['1970-01-01 00:00:00.00000000', + '1970-01-01 00:00:00.00000001', + '1970-01-01 00:00:00.00000002']) + + for dtype in [np.int64, np.int32, np.int16, np.int8]: + arr = np.array([0, 10, 20], dtype=dtype) + tm.assert_index_equal(DatetimeIndex(arr), exp) + + def test_ctor_str_intraday(self): + rng = DatetimeIndex(['1-1-2000 00:00:01']) + assert rng[0].second == 1 + + def test_is_(self): + dti = DatetimeIndex(start='1/1/2005', end='12/1/2005', freq='M') + assert dti.is_(dti) + assert dti.is_(dti.view()) + assert not dti.is_(dti.copy()) + + def test_index_cast_datetime64_other_units(self): + arr = np.arange(0, 100, 10, dtype=np.int64).view('M8[D]') + idx = Index(arr) + + assert (idx.values == tslib.cast_to_nanoseconds(arr)).all() + + def test_constructor_int64_nocopy(self): + # #1624 + arr = np.arange(1000, dtype=np.int64) + index = DatetimeIndex(arr) + + arr[50:100] = -1 + assert (index.asi8[50:100] == -1).all() + + arr = np.arange(1000, dtype=np.int64) + index = DatetimeIndex(arr, copy=True) + + arr[50:100] = -1 + assert (index.asi8[50:100] != -1).all() + + def test_from_freq_recreate_from_data(self): + freqs = ['M', 'Q', 'A', 'D', 'B', 'BH', 'T', 'S', 'L', 'U', 'H', 'N', + 'C'] + + for f in freqs: + org = DatetimeIndex(start='2001/02/01 09:00', freq=f, periods=1) + idx = DatetimeIndex(org, freq=f) + tm.assert_index_equal(idx, org) + + org = DatetimeIndex(start='2001/02/01 09:00', freq=f, + tz='US/Pacific', periods=1) + idx = DatetimeIndex(org, freq=f, tz='US/Pacific') + tm.assert_index_equal(idx, org) + + def test_datetimeindex_constructor_misc(self): + arr = ['1/1/2005', '1/2/2005', 'Jn 3, 2005', '2005-01-04'] + pytest.raises(Exception, DatetimeIndex, arr) + + arr = ['1/1/2005', '1/2/2005', '1/3/2005', '2005-01-04'] + idx1 = DatetimeIndex(arr) + + arr = [datetime(2005, 1, 1), '1/2/2005', '1/3/2005', '2005-01-04'] + idx2 = DatetimeIndex(arr) + + arr = [lib.Timestamp(datetime(2005, 1, 1)), '1/2/2005', '1/3/2005', + '2005-01-04'] + idx3 = DatetimeIndex(arr) + + arr = np.array(['1/1/2005', '1/2/2005', '1/3/2005', + '2005-01-04'], dtype='O') + idx4 = DatetimeIndex(arr) + + arr = to_datetime(['1/1/2005', '1/2/2005', '1/3/2005', '2005-01-04']) + idx5 = DatetimeIndex(arr) 
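(Aside, not part of the patch: the `idx1`..`idx5` constructions above all normalize to the same underlying datetime64 values, which is what the `(idx1.values == other.values).all()` check below asserts; a minimal runnable sketch with hypothetical dates.)

```python
# Sketch only -- hypothetical values, assuming a pandas version
# contemporary with this patch or later.
import numpy as np
import pandas as pd
from datetime import datetime

idx_from_str = pd.DatetimeIndex(['1/1/2005', '1/2/2005', '1/3/2005'])
idx_from_dt = pd.DatetimeIndex([datetime(2005, 1, 1),
                                datetime(2005, 1, 2),
                                datetime(2005, 1, 3)])
idx_from_arr = pd.DatetimeIndex(np.array(['2005-01-01', '2005-01-02',
                                          '2005-01-03'], dtype='O'))

# All three inputs parse to identical nanosecond values.
for other in (idx_from_dt, idx_from_arr):
    assert (idx_from_str.values == other.values).all()
```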
+ + arr = to_datetime(['1/1/2005', '1/2/2005', 'Jan 3, 2005', '2005-01-04' + ]) + idx6 = DatetimeIndex(arr) + + idx7 = DatetimeIndex(['12/05/2007', '25/01/2008'], dayfirst=True) + idx8 = DatetimeIndex(['2007/05/12', '2008/01/25'], dayfirst=False, + yearfirst=True) + tm.assert_index_equal(idx7, idx8) + + for other in [idx2, idx3, idx4, idx5, idx6]: + assert (idx1.values == other.values).all() + + sdate = datetime(1999, 12, 25) + edate = datetime(2000, 1, 1) + idx = DatetimeIndex(start=sdate, freq='1B', periods=20) + assert len(idx) == 20 + assert idx[0] == sdate + 0 * offsets.BDay() + assert idx.freq == 'B' + + idx = DatetimeIndex(end=edate, freq=('D', 5), periods=20) + assert len(idx) == 20 + assert idx[-1] == edate + assert idx.freq == '5D' + + idx1 = DatetimeIndex(start=sdate, end=edate, freq='W-SUN') + idx2 = DatetimeIndex(start=sdate, end=edate, + freq=offsets.Week(weekday=6)) + assert len(idx1) == len(idx2) + assert idx1.offset == idx2.offset + + idx1 = DatetimeIndex(start=sdate, end=edate, freq='QS') + idx2 = DatetimeIndex(start=sdate, end=edate, + freq=offsets.QuarterBegin(startingMonth=1)) + assert len(idx1) == len(idx2) + assert idx1.offset == idx2.offset + + idx1 = DatetimeIndex(start=sdate, end=edate, freq='BQ') + idx2 = DatetimeIndex(start=sdate, end=edate, + freq=offsets.BQuarterEnd(startingMonth=12)) + assert len(idx1) == len(idx2) + assert idx1.offset == idx2.offset diff --git a/pandas/tests/indexes/datetimes/test_date_range.py b/pandas/tests/indexes/datetimes/test_date_range.py new file mode 100644 index 0000000000000..0586ea9c4db2b --- /dev/null +++ b/pandas/tests/indexes/datetimes/test_date_range.py @@ -0,0 +1,565 @@ +""" +test date_range, bdate_range, cdate_range +construction from the convenience range functions +""" + +import pytest + +import numpy as np +from datetime import datetime, timedelta, time + +import pandas as pd +import pandas.util.testing as tm +from pandas import compat +from pandas.core.indexes.datetimes import bdate_range, cdate_range +from pandas import date_range, offsets, DatetimeIndex, Timestamp +from pandas.tseries.offsets import (generate_range, CDay, BDay, + DateOffset, MonthEnd) + +from pandas.tests.series.common import TestData + +START, END = datetime(2009, 1, 1), datetime(2010, 1, 1) + + +def eq_gen_range(kwargs, expected): + rng = generate_range(**kwargs) + assert (np.array_equal(list(rng), expected)) + + +class TestDateRanges(TestData): + + def test_date_range_gen_error(self): + rng = date_range('1/1/2000 00:00', '1/1/2000 00:18', freq='5min') + assert len(rng) == 4 + + def test_date_range_negative_freq(self): + # GH 11018 + rng = date_range('2011-12-31', freq='-2A', periods=3) + exp = pd.DatetimeIndex(['2011-12-31', '2009-12-31', + '2007-12-31'], freq='-2A') + tm.assert_index_equal(rng, exp) + assert rng.freq == '-2A' + + rng = date_range('2011-01-31', freq='-2M', periods=3) + exp = pd.DatetimeIndex(['2011-01-31', '2010-11-30', + '2010-09-30'], freq='-2M') + tm.assert_index_equal(rng, exp) + assert rng.freq == '-2M' + + def test_date_range_bms_bug(self): + # #1645 + rng = date_range('1/1/2000', periods=10, freq='BMS') + + ex_first = Timestamp('2000-01-03') + assert rng[0] == ex_first + + def test_date_range_normalize(self): + snap = datetime.today() + n = 50 + + rng = date_range(snap, periods=n, normalize=False, freq='2D') + + offset = timedelta(2) + values = DatetimeIndex([snap + i * offset for i in range(n)]) + + tm.assert_index_equal(rng, values) + + rng = date_range('1/1/2000 08:15', periods=n, normalize=False, + freq='B') + the_time = 
time(8, 15) + for val in rng: + assert val.time() == the_time + + def test_date_range_fy5252(self): + dr = date_range(start="2013-01-01", periods=2, freq=offsets.FY5253( + startingMonth=1, weekday=3, variation="nearest")) + assert dr[0] == Timestamp('2013-01-31') + assert dr[1] == Timestamp('2014-01-30') + + def test_date_range_ambiguous_arguments(self): + # #2538 + start = datetime(2011, 1, 1, 5, 3, 40) + end = datetime(2011, 1, 1, 8, 9, 40) + + pytest.raises(ValueError, date_range, start, end, freq='s', + periods=10) + + def test_date_range_businesshour(self): + idx = DatetimeIndex(['2014-07-04 09:00', '2014-07-04 10:00', + '2014-07-04 11:00', + '2014-07-04 12:00', '2014-07-04 13:00', + '2014-07-04 14:00', + '2014-07-04 15:00', '2014-07-04 16:00'], + freq='BH') + rng = date_range('2014-07-04 09:00', '2014-07-04 16:00', freq='BH') + tm.assert_index_equal(idx, rng) + + idx = DatetimeIndex( + ['2014-07-04 16:00', '2014-07-07 09:00'], freq='BH') + rng = date_range('2014-07-04 16:00', '2014-07-07 09:00', freq='BH') + tm.assert_index_equal(idx, rng) + + idx = DatetimeIndex(['2014-07-04 09:00', '2014-07-04 10:00', + '2014-07-04 11:00', + '2014-07-04 12:00', '2014-07-04 13:00', + '2014-07-04 14:00', + '2014-07-04 15:00', '2014-07-04 16:00', + '2014-07-07 09:00', '2014-07-07 10:00', + '2014-07-07 11:00', + '2014-07-07 12:00', '2014-07-07 13:00', + '2014-07-07 14:00', + '2014-07-07 15:00', '2014-07-07 16:00', + '2014-07-08 09:00', '2014-07-08 10:00', + '2014-07-08 11:00', + '2014-07-08 12:00', '2014-07-08 13:00', + '2014-07-08 14:00', + '2014-07-08 15:00', '2014-07-08 16:00'], + freq='BH') + rng = date_range('2014-07-04 09:00', '2014-07-08 16:00', freq='BH') + tm.assert_index_equal(idx, rng) + + def test_range_misspecified(self): + # GH #1095 + + pytest.raises(ValueError, date_range, '1/1/2000') + pytest.raises(ValueError, date_range, end='1/1/2000') + pytest.raises(ValueError, date_range, periods=10) + + pytest.raises(ValueError, date_range, '1/1/2000', freq='H') + pytest.raises(ValueError, date_range, end='1/1/2000', freq='H') + pytest.raises(ValueError, date_range, periods=10, freq='H') + + def test_compat_replace(self): + # https://github.com/statsmodels/statsmodels/issues/3349 + # replace should take ints/longs for compat + + for f in [compat.long, int]: + result = date_range(Timestamp('1960-04-01 00:00:00', + freq='QS-JAN'), + periods=f(76), + freq='QS-JAN') + assert len(result) == 76 + + def test_catch_infinite_loop(self): + offset = offsets.DateOffset(minute=5) + # blow up, don't loop forever + pytest.raises(Exception, date_range, datetime(2011, 11, 11), + datetime(2011, 11, 12), freq=offset) + + +class TestGenRangeGeneration(object): + + def test_generate(self): + rng1 = list(generate_range(START, END, offset=BDay())) + rng2 = list(generate_range(START, END, time_rule='B')) + assert rng1 == rng2 + + def test_generate_cday(self): + rng1 = list(generate_range(START, END, offset=CDay())) + rng2 = list(generate_range(START, END, time_rule='C')) + assert rng1 == rng2 + + def test_1(self): + eq_gen_range(dict(start=datetime(2009, 3, 25), periods=2), + [datetime(2009, 3, 25), datetime(2009, 3, 26)]) + + def test_2(self): + eq_gen_range(dict(start=datetime(2008, 1, 1), + end=datetime(2008, 1, 3)), + [datetime(2008, 1, 1), + datetime(2008, 1, 2), + datetime(2008, 1, 3)]) + + def test_3(self): + eq_gen_range(dict(start=datetime(2008, 1, 5), + end=datetime(2008, 1, 6)), + []) + + def test_precision_finer_than_offset(self): + # GH 9907 + result1 = DatetimeIndex(start='2015-04-15 00:00:03', + 
end='2016-04-22 00:00:00', freq='Q') + result2 = DatetimeIndex(start='2015-04-15 00:00:03', + end='2015-06-22 00:00:04', freq='W') + expected1_list = ['2015-06-30 00:00:03', '2015-09-30 00:00:03', + '2015-12-31 00:00:03', '2016-03-31 00:00:03'] + expected2_list = ['2015-04-19 00:00:03', '2015-04-26 00:00:03', + '2015-05-03 00:00:03', '2015-05-10 00:00:03', + '2015-05-17 00:00:03', '2015-05-24 00:00:03', + '2015-05-31 00:00:03', '2015-06-07 00:00:03', + '2015-06-14 00:00:03', '2015-06-21 00:00:03'] + expected1 = DatetimeIndex(expected1_list, dtype='datetime64[ns]', + freq='Q-DEC', tz=None) + expected2 = DatetimeIndex(expected2_list, dtype='datetime64[ns]', + freq='W-SUN', tz=None) + tm.assert_index_equal(result1, expected1) + tm.assert_index_equal(result2, expected2) + + +class TestBusinessDateRange(object): + + def setup_method(self, method): + self.rng = bdate_range(START, END) + + def test_constructor(self): + bdate_range(START, END, freq=BDay()) + bdate_range(START, periods=20, freq=BDay()) + bdate_range(end=START, periods=20, freq=BDay()) + pytest.raises(ValueError, date_range, '2011-1-1', '2012-1-1', 'B') + pytest.raises(ValueError, bdate_range, '2011-1-1', '2012-1-1', 'B') + + def test_naive_aware_conflicts(self): + naive = bdate_range(START, END, freq=BDay(), tz=None) + aware = bdate_range(START, END, freq=BDay(), + tz="Asia/Hong_Kong") + tm.assert_raises_regex(TypeError, "tz-naive.*tz-aware", + naive.join, aware) + tm.assert_raises_regex(TypeError, "tz-naive.*tz-aware", + aware.join, naive) + + def test_cached_range(self): + DatetimeIndex._cached_range(START, END, offset=BDay()) + DatetimeIndex._cached_range(START, periods=20, offset=BDay()) + DatetimeIndex._cached_range(end=START, periods=20, offset=BDay()) + + tm.assert_raises_regex(TypeError, "offset", + DatetimeIndex._cached_range, + START, END) + + tm.assert_raises_regex(TypeError, "specify period", + DatetimeIndex._cached_range, START, + offset=BDay()) + + tm.assert_raises_regex(TypeError, "specify period", + DatetimeIndex._cached_range, end=END, + offset=BDay()) + + tm.assert_raises_regex(TypeError, "start or end", + DatetimeIndex._cached_range, periods=20, + offset=BDay()) + + def test_cached_range_bug(self): + rng = date_range('2010-09-01 05:00:00', periods=50, + freq=DateOffset(hours=6)) + assert len(rng) == 50 + assert rng[0] == datetime(2010, 9, 1, 5) + + def test_timezone_comparaison_bug(self): + # smoke test + start = Timestamp('20130220 10:00', tz='US/Eastern') + result = date_range(start, periods=2, tz='US/Eastern') + assert len(result) == 2 + + def test_timezone_comparaison_assert(self): + start = Timestamp('20130220 10:00', tz='US/Eastern') + pytest.raises(AssertionError, date_range, start, periods=2, + tz='Europe/Berlin') + + def test_misc(self): + end = datetime(2009, 5, 13) + dr = bdate_range(end=end, periods=20) + firstDate = end - 19 * BDay() + + assert len(dr) == 20 + assert dr[0] == firstDate + assert dr[-1] == end + + def test_date_parse_failure(self): + badly_formed_date = '2007/100/1' + + pytest.raises(ValueError, Timestamp, badly_formed_date) + + pytest.raises(ValueError, bdate_range, start=badly_formed_date, + periods=10) + pytest.raises(ValueError, bdate_range, end=badly_formed_date, + periods=10) + pytest.raises(ValueError, bdate_range, badly_formed_date, + badly_formed_date) + + def test_daterange_bug_456(self): + # GH #456 + rng1 = bdate_range('12/5/2011', '12/5/2011') + rng2 = bdate_range('12/2/2011', '12/5/2011') + rng2.offset = BDay() + + result = rng1.union(rng2) + assert isinstance(result, 
DatetimeIndex) + + def test_error_with_zero_monthends(self): + pytest.raises(ValueError, date_range, '1/1/2000', '1/1/2001', + freq=MonthEnd(0)) + + def test_range_bug(self): + # GH #770 + offset = DateOffset(months=3) + result = date_range("2011-1-1", "2012-1-31", freq=offset) + + start = datetime(2011, 1, 1) + exp_values = [start + i * offset for i in range(5)] + tm.assert_index_equal(result, DatetimeIndex(exp_values)) + + def test_range_tz_pytz(self): + # GH 2906 + tm._skip_if_no_pytz() + from pytz import timezone + + tz = timezone('US/Eastern') + start = tz.localize(datetime(2011, 1, 1)) + end = tz.localize(datetime(2011, 1, 3)) + + dr = date_range(start=start, periods=3) + assert dr.tz.zone == tz.zone + assert dr[0] == start + assert dr[2] == end + + dr = date_range(end=end, periods=3) + assert dr.tz.zone == tz.zone + assert dr[0] == start + assert dr[2] == end + + dr = date_range(start=start, end=end) + assert dr.tz.zone == tz.zone + assert dr[0] == start + assert dr[2] == end + + def test_range_tz_dst_straddle_pytz(self): + + tm._skip_if_no_pytz() + from pytz import timezone + tz = timezone('US/Eastern') + dates = [(tz.localize(datetime(2014, 3, 6)), + tz.localize(datetime(2014, 3, 12))), + (tz.localize(datetime(2013, 11, 1)), + tz.localize(datetime(2013, 11, 6)))] + for (start, end) in dates: + dr = date_range(start, end, freq='D') + assert dr[0] == start + assert dr[-1] == end + assert np.all(dr.hour == 0) + + dr = date_range(start, end, freq='D', tz='US/Eastern') + assert dr[0] == start + assert dr[-1] == end + assert np.all(dr.hour == 0) + + dr = date_range(start.replace(tzinfo=None), end.replace( + tzinfo=None), freq='D', tz='US/Eastern') + assert dr[0] == start + assert dr[-1] == end + assert np.all(dr.hour == 0) + + def test_range_tz_dateutil(self): + # GH 2906 + tm._skip_if_no_dateutil() + # Use maybe_get_tz to fix filename in tz under dateutil. 
+ from pandas._libs.tslib import maybe_get_tz + tz = lambda x: maybe_get_tz('dateutil/' + x) + + start = datetime(2011, 1, 1, tzinfo=tz('US/Eastern')) + end = datetime(2011, 1, 3, tzinfo=tz('US/Eastern')) + + dr = date_range(start=start, periods=3) + assert dr.tz == tz('US/Eastern') + assert dr[0] == start + assert dr[2] == end + + dr = date_range(end=end, periods=3) + assert dr.tz == tz('US/Eastern') + assert dr[0] == start + assert dr[2] == end + + dr = date_range(start=start, end=end) + assert dr.tz == tz('US/Eastern') + assert dr[0] == start + assert dr[2] == end + + def test_range_closed(self): + begin = datetime(2011, 1, 1) + end = datetime(2014, 1, 1) + + for freq in ["1D", "3D", "2M", "7W", "3H", "A"]: + closed = date_range(begin, end, closed=None, freq=freq) + left = date_range(begin, end, closed="left", freq=freq) + right = date_range(begin, end, closed="right", freq=freq) + expected_left = left + expected_right = right + + if end == closed[-1]: + expected_left = closed[:-1] + if begin == closed[0]: + expected_right = closed[1:] + + tm.assert_index_equal(expected_left, left) + tm.assert_index_equal(expected_right, right) + + def test_range_closed_with_tz_aware_start_end(self): + # GH12409, GH12684 + begin = Timestamp('2011/1/1', tz='US/Eastern') + end = Timestamp('2014/1/1', tz='US/Eastern') + + for freq in ["1D", "3D", "2M", "7W", "3H", "A"]: + closed = date_range(begin, end, closed=None, freq=freq) + left = date_range(begin, end, closed="left", freq=freq) + right = date_range(begin, end, closed="right", freq=freq) + expected_left = left + expected_right = right + + if end == closed[-1]: + expected_left = closed[:-1] + if begin == closed[0]: + expected_right = closed[1:] + + tm.assert_index_equal(expected_left, left) + tm.assert_index_equal(expected_right, right) + + begin = Timestamp('2011/1/1') + end = Timestamp('2014/1/1') + begintz = Timestamp('2011/1/1', tz='US/Eastern') + endtz = Timestamp('2014/1/1', tz='US/Eastern') + + for freq in ["1D", "3D", "2M", "7W", "3H", "A"]: + closed = date_range(begin, end, closed=None, freq=freq, + tz='US/Eastern') + left = date_range(begin, end, closed="left", freq=freq, + tz='US/Eastern') + right = date_range(begin, end, closed="right", freq=freq, + tz='US/Eastern') + expected_left = left + expected_right = right + + if endtz == closed[-1]: + expected_left = closed[:-1] + if begintz == closed[0]: + expected_right = closed[1:] + + tm.assert_index_equal(expected_left, left) + tm.assert_index_equal(expected_right, right) + + def test_range_closed_boundary(self): + # GH 11804 + for closed in ['right', 'left', None]: + right_boundary = date_range('2015-09-12', '2015-12-01', + freq='QS-MAR', closed=closed) + left_boundary = date_range('2015-09-01', '2015-09-12', + freq='QS-MAR', closed=closed) + both_boundary = date_range('2015-09-01', '2015-12-01', + freq='QS-MAR', closed=closed) + expected_right = expected_left = expected_both = both_boundary + + if closed == 'right': + expected_left = both_boundary[1:] + if closed == 'left': + expected_right = both_boundary[:-1] + if closed is None: + expected_right = both_boundary[1:] + expected_left = both_boundary[:-1] + + tm.assert_index_equal(right_boundary, expected_right) + tm.assert_index_equal(left_boundary, expected_left) + tm.assert_index_equal(both_boundary, expected_both) + + def test_years_only(self): + # GH 6961 + dr = date_range('2014', '2015', freq='M') + assert dr[0] == datetime(2014, 1, 31) + assert dr[-1] == datetime(2014, 12, 31) + + def test_freq_divides_end_in_nanos(self): + # GH 10885 + 
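# 345min steps from 10:00 land on 10:00 and 15:45 only; the next step (21:30) would overshoot the 16:00 end, so the range must stop at 15:45 rather than run past the end bound. + 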
result_1 = date_range('2005-01-12 10:00', '2005-01-12 16:00', + freq='345min') + result_2 = date_range('2005-01-13 10:00', '2005-01-13 16:00', + freq='345min') + expected_1 = DatetimeIndex(['2005-01-12 10:00:00', + '2005-01-12 15:45:00'], + dtype='datetime64[ns]', freq='345T', + tz=None) + expected_2 = DatetimeIndex(['2005-01-13 10:00:00', + '2005-01-13 15:45:00'], + dtype='datetime64[ns]', freq='345T', + tz=None) + tm.assert_index_equal(result_1, expected_1) + tm.assert_index_equal(result_2, expected_2) + + +class TestCustomDateRange(object): + def setup_method(self, method): + self.rng = cdate_range(START, END) + + def test_constructor(self): + cdate_range(START, END, freq=CDay()) + cdate_range(START, periods=20, freq=CDay()) + cdate_range(end=START, periods=20, freq=CDay()) + pytest.raises(ValueError, date_range, '2011-1-1', '2012-1-1', 'C') + pytest.raises(ValueError, cdate_range, '2011-1-1', '2012-1-1', 'C') + + def test_cached_range(self): + DatetimeIndex._cached_range(START, END, offset=CDay()) + DatetimeIndex._cached_range(START, periods=20, + offset=CDay()) + DatetimeIndex._cached_range(end=START, periods=20, + offset=CDay()) + + pytest.raises(Exception, DatetimeIndex._cached_range, START, END) + + pytest.raises(Exception, DatetimeIndex._cached_range, START, + freq=CDay()) + + pytest.raises(Exception, DatetimeIndex._cached_range, end=END, + freq=CDay()) + + pytest.raises(Exception, DatetimeIndex._cached_range, periods=20, + freq=CDay()) + + def test_misc(self): + end = datetime(2009, 5, 13) + dr = cdate_range(end=end, periods=20) + firstDate = end - 19 * CDay() + + assert len(dr) == 20 + assert dr[0] == firstDate + assert dr[-1] == end + + def test_date_parse_failure(self): + badly_formed_date = '2007/100/1' + + pytest.raises(ValueError, Timestamp, badly_formed_date) + + pytest.raises(ValueError, cdate_range, start=badly_formed_date, + periods=10) + pytest.raises(ValueError, cdate_range, end=badly_formed_date, + periods=10) + pytest.raises(ValueError, cdate_range, badly_formed_date, + badly_formed_date) + + def test_daterange_bug_456(self): + # GH #456 + rng1 = cdate_range('12/5/2011', '12/5/2011') + rng2 = cdate_range('12/2/2011', '12/5/2011') + rng2.offset = CDay() + + result = rng1.union(rng2) + assert isinstance(result, DatetimeIndex) + + def test_cdaterange(self): + rng = cdate_range('2013-05-01', periods=3) + xp = DatetimeIndex(['2013-05-01', '2013-05-02', '2013-05-03']) + tm.assert_index_equal(xp, rng) + + def test_cdaterange_weekmask(self): + rng = cdate_range('2013-05-01', periods=3, + weekmask='Sun Mon Tue Wed Thu') + xp = DatetimeIndex(['2013-05-01', '2013-05-02', '2013-05-05']) + tm.assert_index_equal(xp, rng) + + def test_cdaterange_holidays(self): + rng = cdate_range('2013-05-01', periods=3, holidays=['2013-05-01']) + xp = DatetimeIndex(['2013-05-02', '2013-05-03', '2013-05-06']) + tm.assert_index_equal(xp, rng) + + def test_cdaterange_weekmask_and_holidays(self): + rng = cdate_range('2013-05-01', periods=3, + weekmask='Sun Mon Tue Wed Thu', + holidays=['2013-05-01']) + xp = DatetimeIndex(['2013-05-02', '2013-05-05', '2013-05-06']) + tm.assert_index_equal(xp, rng) diff --git a/pandas/tests/indexes/datetimes/test_datetime.py b/pandas/tests/indexes/datetimes/test_datetime.py new file mode 100644 index 0000000000000..96c8da546ff9d --- /dev/null +++ b/pandas/tests/indexes/datetimes/test_datetime.py @@ -0,0 +1,776 @@ +import pytest + +import numpy as np +from datetime import date, timedelta, time + +import pandas as pd +import pandas.util.testing as tm +from pandas.compat 
import lrange +from pandas.compat.numpy import np_datetime64_compat +from pandas import (DatetimeIndex, Index, date_range, Series, DataFrame, + Timestamp, datetime, offsets, _np_version_under1p8) + +from pandas.util.testing import assert_series_equal, assert_almost_equal + +randn = np.random.randn + + +class TestDatetimeIndex(object): + + def test_get_loc(self): + idx = pd.date_range('2000-01-01', periods=3) + + for method in [None, 'pad', 'backfill', 'nearest']: + assert idx.get_loc(idx[1], method) == 1 + assert idx.get_loc(idx[1].to_pydatetime(), method) == 1 + assert idx.get_loc(str(idx[1]), method) == 1 + + if method is not None: + assert idx.get_loc(idx[1], method, + tolerance=pd.Timedelta('0 days')) == 1 + + assert idx.get_loc('2000-01-01', method='nearest') == 0 + assert idx.get_loc('2000-01-01T12', method='nearest') == 1 + + assert idx.get_loc('2000-01-01T12', method='nearest', + tolerance='1 day') == 1 + assert idx.get_loc('2000-01-01T12', method='nearest', + tolerance=pd.Timedelta('1D')) == 1 + assert idx.get_loc('2000-01-01T12', method='nearest', + tolerance=np.timedelta64(1, 'D')) == 1 + assert idx.get_loc('2000-01-01T12', method='nearest', + tolerance=timedelta(1)) == 1 + with tm.assert_raises_regex(ValueError, 'must be convertible'): + idx.get_loc('2000-01-01T12', method='nearest', tolerance='foo') + with pytest.raises(KeyError): + idx.get_loc('2000-01-01T03', method='nearest', tolerance='2 hours') + + assert idx.get_loc('2000', method='nearest') == slice(0, 3) + assert idx.get_loc('2000-01', method='nearest') == slice(0, 3) + + assert idx.get_loc('1999', method='nearest') == 0 + assert idx.get_loc('2001', method='nearest') == 2 + + with pytest.raises(KeyError): + idx.get_loc('1999', method='pad') + with pytest.raises(KeyError): + idx.get_loc('2001', method='backfill') + + with pytest.raises(KeyError): + idx.get_loc('foobar') + with pytest.raises(TypeError): + idx.get_loc(slice(2)) + + idx = pd.to_datetime(['2000-01-01', '2000-01-04']) + assert idx.get_loc('2000-01-02', method='nearest') == 0 + assert idx.get_loc('2000-01-03', method='nearest') == 1 + assert idx.get_loc('2000-01', method='nearest') == slice(0, 2) + + # time indexing + idx = pd.date_range('2000-01-01', periods=24, freq='H') + tm.assert_numpy_array_equal(idx.get_loc(time(12)), + np.array([12]), check_dtype=False) + tm.assert_numpy_array_equal(idx.get_loc(time(12, 30)), + np.array([]), check_dtype=False) + with pytest.raises(NotImplementedError): + idx.get_loc(time(12, 30), method='pad') + + def test_get_indexer(self): + idx = pd.date_range('2000-01-01', periods=3) + exp = np.array([0, 1, 2], dtype=np.intp) + tm.assert_numpy_array_equal(idx.get_indexer(idx), exp) + + target = idx[0] + pd.to_timedelta(['-1 hour', '12 hours', + '1 day 1 hour']) + tm.assert_numpy_array_equal(idx.get_indexer(target, 'pad'), + np.array([-1, 0, 1], dtype=np.intp)) + tm.assert_numpy_array_equal(idx.get_indexer(target, 'backfill'), + np.array([0, 1, 2], dtype=np.intp)) + tm.assert_numpy_array_equal(idx.get_indexer(target, 'nearest'), + np.array([0, 1, 1], dtype=np.intp)) + tm.assert_numpy_array_equal( + idx.get_indexer(target, 'nearest', + tolerance=pd.Timedelta('1 hour')), + np.array([0, -1, 1], dtype=np.intp)) + with pytest.raises(ValueError): + idx.get_indexer(idx[[0]], method='nearest', tolerance='foo') + + def test_reasonable_keyerror(self): + # GH #1062 + index = DatetimeIndex(['1/3/2000']) + try: + index.get_loc('1/1/2000') + except KeyError as e: + assert '2000' in str(e) + + def test_roundtrip_pickle_with_tz(self): + + # GH 
8367 + # round-trip of timezone + index = date_range('20130101', periods=3, tz='US/Eastern', name='foo') + unpickled = tm.round_trip_pickle(index) + tm.assert_index_equal(index, unpickled) + + def test_reindex_preserves_tz_if_target_is_empty_list_or_array(self): + # GH7774 + index = date_range('20130101', periods=3, tz='US/Eastern') + assert str(index.reindex([])[0].tz) == 'US/Eastern' + assert str(index.reindex(np.array([]))[0].tz) == 'US/Eastern' + + def test_time_loc(self): # GH8667 + from datetime import time + from pandas._libs.index import _SIZE_CUTOFF + + ns = _SIZE_CUTOFF + np.array([-100, 100], dtype=np.int64) + key = time(15, 11, 30) + start = key.hour * 3600 + key.minute * 60 + key.second + step = 24 * 3600 + + for n in ns: + idx = pd.date_range('2014-11-26', periods=n, freq='S') + ts = pd.Series(np.random.randn(n), index=idx) + i = np.arange(start, n, step) + + tm.assert_numpy_array_equal(ts.index.get_loc(key), i, + check_dtype=False) + tm.assert_series_equal(ts[key], ts.iloc[i]) + + left, right = ts.copy(), ts.copy() + left[key] *= -10 + right.iloc[i] *= -10 + tm.assert_series_equal(left, right) + + def test_time_overflow_for_32bit_machines(self): + # GH8943. On some machines NumPy defaults to np.int32 (for example, + # 32-bit Linux machines). In the function _generate_regular_range + # found in tseries/index.py, `periods` gets multiplied by `strides` + # (which has value 1e9) and since the max value for np.int32 is ~2e9, + # and since those machines won't promote np.int32 to np.int64, we get + # overflow. + periods = np.int_(1000) + + idx1 = pd.date_range(start='2000', periods=periods, freq='S') + assert len(idx1) == periods + + idx2 = pd.date_range(end='2000', periods=periods, freq='S') + assert len(idx2) == periods + + def test_nat(self): + assert DatetimeIndex([np.nan])[0] is pd.NaT + + def test_ufunc_coercions(self): + idx = date_range('2011-01-01', periods=3, freq='2D', name='x') + + delta = np.timedelta64(1, 'D') + for result in [idx + delta, np.add(idx, delta)]: + assert isinstance(result, DatetimeIndex) + exp = date_range('2011-01-02', periods=3, freq='2D', name='x') + tm.assert_index_equal(result, exp) + assert result.freq == '2D' + + for result in [idx - delta, np.subtract(idx, delta)]: + assert isinstance(result, DatetimeIndex) + exp = date_range('2010-12-31', periods=3, freq='2D', name='x') + tm.assert_index_equal(result, exp) + assert result.freq == '2D' + + delta = np.array([np.timedelta64(1, 'D'), np.timedelta64(2, 'D'), + np.timedelta64(3, 'D')]) + for result in [idx + delta, np.add(idx, delta)]: + assert isinstance(result, DatetimeIndex) + exp = DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-08'], + freq='3D', name='x') + tm.assert_index_equal(result, exp) + assert result.freq == '3D' + + for result in [idx - delta, np.subtract(idx, delta)]: + assert isinstance(result, DatetimeIndex) + exp = DatetimeIndex(['2010-12-31', '2011-01-01', '2011-01-02'], + freq='D', name='x') + tm.assert_index_equal(result, exp) + assert result.freq == 'D' + + def test_week_of_month_frequency(self): + # GH 5348: "ValueError: Could not evaluate WOM-1SUN" shouldn't raise + d1 = date(2002, 9, 1) + d2 = date(2013, 10, 27) + d3 = date(2012, 9, 30) + idx1 = DatetimeIndex([d1, d2]) + idx2 = DatetimeIndex([d3]) + result_append = idx1.append(idx2) + expected = DatetimeIndex([d1, d2, d3]) + tm.assert_index_equal(result_append, expected) + result_union = idx1.union(idx2) + expected = DatetimeIndex([d1, d3, d2]) + tm.assert_index_equal(result_union, expected) + + # GH 5115 + result = 
date_range("2013-1-1", periods=4, freq='WOM-1SAT') + dates = ['2013-01-05', '2013-02-02', '2013-03-02', '2013-04-06'] + expected = DatetimeIndex(dates, freq='WOM-1SAT') + tm.assert_index_equal(result, expected) + + def test_hash_error(self): + index = date_range('20010101', periods=10) + with tm.assert_raises_regex(TypeError, "unhashable type: %r" % + type(index).__name__): + hash(index) + + def test_stringified_slice_with_tz(self): + # GH2658 + import datetime + start = datetime.datetime.now() + idx = DatetimeIndex(start=start, freq="1d", periods=10) + df = DataFrame(lrange(10), index=idx) + df["2013-01-14 23:44:34.437768-05:00":] # no exception here + + def test_append_join_nondatetimeindex(self): + rng = date_range('1/1/2000', periods=10) + idx = Index(['a', 'b', 'c', 'd']) + + result = rng.append(idx) + assert isinstance(result[0], Timestamp) + + # it works + rng.join(idx, how='outer') + + def test_to_period_nofreq(self): + idx = DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-04']) + pytest.raises(ValueError, idx.to_period) + + idx = DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'], + freq='infer') + assert idx.freqstr == 'D' + expected = pd.PeriodIndex(['2000-01-01', '2000-01-02', + '2000-01-03'], freq='D') + tm.assert_index_equal(idx.to_period(), expected) + + # GH 7606 + idx = DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03']) + assert idx.freqstr is None + tm.assert_index_equal(idx.to_period(), expected) + + def test_comparisons_coverage(self): + rng = date_range('1/1/2000', periods=10) + + # raise TypeError for now + pytest.raises(TypeError, rng.__lt__, rng[3].value) + + result = rng == list(rng) + exp = rng == rng + tm.assert_numpy_array_equal(result, exp) + + def test_comparisons_nat(self): + + fidx1 = pd.Index([1.0, np.nan, 3.0, np.nan, 5.0, 7.0]) + fidx2 = pd.Index([2.0, 3.0, np.nan, np.nan, 6.0, 7.0]) + + didx1 = pd.DatetimeIndex(['2014-01-01', pd.NaT, '2014-03-01', pd.NaT, + '2014-05-01', '2014-07-01']) + didx2 = pd.DatetimeIndex(['2014-02-01', '2014-03-01', pd.NaT, pd.NaT, + '2014-06-01', '2014-07-01']) + darr = np.array([np_datetime64_compat('2014-02-01 00:00Z'), + np_datetime64_compat('2014-03-01 00:00Z'), + np_datetime64_compat('nat'), np.datetime64('nat'), + np_datetime64_compat('2014-06-01 00:00Z'), + np_datetime64_compat('2014-07-01 00:00Z')]) + + if _np_version_under1p8: + # cannot test array because np.datetime('nat') returns today's date + cases = [(fidx1, fidx2), (didx1, didx2)] + else: + cases = [(fidx1, fidx2), (didx1, didx2), (didx1, darr)] + + # Check pd.NaT is handles as the same as np.nan + with tm.assert_produces_warning(None): + for idx1, idx2 in cases: + + result = idx1 < idx2 + expected = np.array([True, False, False, False, True, False]) + tm.assert_numpy_array_equal(result, expected) + + result = idx2 > idx1 + expected = np.array([True, False, False, False, True, False]) + tm.assert_numpy_array_equal(result, expected) + + result = idx1 <= idx2 + expected = np.array([True, False, False, False, True, True]) + tm.assert_numpy_array_equal(result, expected) + + result = idx2 >= idx1 + expected = np.array([True, False, False, False, True, True]) + tm.assert_numpy_array_equal(result, expected) + + result = idx1 == idx2 + expected = np.array([False, False, False, False, False, True]) + tm.assert_numpy_array_equal(result, expected) + + result = idx1 != idx2 + expected = np.array([True, True, True, True, True, False]) + tm.assert_numpy_array_equal(result, expected) + + with tm.assert_produces_warning(None): + for idx1, val in [(fidx1, 
np.nan), (didx1, pd.NaT)]: + result = idx1 < val + expected = np.array([False, False, False, False, False, False]) + tm.assert_numpy_array_equal(result, expected) + result = idx1 > val + tm.assert_numpy_array_equal(result, expected) + + result = idx1 <= val + tm.assert_numpy_array_equal(result, expected) + result = idx1 >= val + tm.assert_numpy_array_equal(result, expected) + + result = idx1 == val + tm.assert_numpy_array_equal(result, expected) + + result = idx1 != val + expected = np.array([True, True, True, True, True, True]) + tm.assert_numpy_array_equal(result, expected) + + # Check pd.NaT is handled the same as np.nan + with tm.assert_produces_warning(None): + for idx1, val in [(fidx1, 3), (didx1, datetime(2014, 3, 1))]: + result = idx1 < val + expected = np.array([True, False, False, False, False, False]) + tm.assert_numpy_array_equal(result, expected) + result = idx1 > val + expected = np.array([False, False, False, False, True, True]) + tm.assert_numpy_array_equal(result, expected) + + result = idx1 <= val + expected = np.array([True, False, True, False, False, False]) + tm.assert_numpy_array_equal(result, expected) + result = idx1 >= val + expected = np.array([False, False, True, False, True, True]) + tm.assert_numpy_array_equal(result, expected) + + result = idx1 == val + expected = np.array([False, False, True, False, False, False]) + tm.assert_numpy_array_equal(result, expected) + + result = idx1 != val + expected = np.array([True, True, False, True, True, True]) + tm.assert_numpy_array_equal(result, expected) + + def test_map(self): + rng = date_range('1/1/2000', periods=10) + + f = lambda x: x.strftime('%Y%m%d') + result = rng.map(f) + exp = Index([f(x) for x in rng], dtype='<U8') + tm.assert_index_equal(result, exp) + + def test_take_fill_value(self): + # GH 12631 + idx = pd.DatetimeIndex(['2011-01-01', '2011-02-01', '2011-03-01'], + name='xxx') + result = idx.take(np.array([1, 0, -1])) + expected = pd.DatetimeIndex(['2011-02-01', '2011-01-01', '2011-03-01'], + name='xxx') + tm.assert_index_equal(result, expected) + + # fill_value + result = idx.take(np.array([1, 0, -1]), fill_value=True) + expected = pd.DatetimeIndex(['2011-02-01', '2011-01-01', 'NaT'], + name='xxx') + tm.assert_index_equal(result, expected) + + # allow_fill=False + result = idx.take(np.array([1, 0, -1]), allow_fill=False, + fill_value=True) + expected = pd.DatetimeIndex(['2011-02-01', '2011-01-01', '2011-03-01'], + name='xxx') + tm.assert_index_equal(result, expected) + + msg = ('When allow_fill=True and fill_value is not None, ' + 'all indices must be >= -1') + with tm.assert_raises_regex(ValueError, msg): + idx.take(np.array([1, 0, -2]), fill_value=True) + with tm.assert_raises_regex(ValueError, msg): + idx.take(np.array([1, 0, -5]), fill_value=True) + + with pytest.raises(IndexError): + idx.take(np.array([1, -5])) + + def test_take_fill_value_with_timezone(self): + idx = pd.DatetimeIndex(['2011-01-01', '2011-02-01', '2011-03-01'], + name='xxx', tz='US/Eastern') + result = idx.take(np.array([1, 0, -1])) + expected = pd.DatetimeIndex(['2011-02-01', '2011-01-01', '2011-03-01'], + name='xxx', tz='US/Eastern') + tm.assert_index_equal(result, expected) + + # fill_value + result = idx.take(np.array([1, 0, -1]), fill_value=True) + expected = pd.DatetimeIndex(['2011-02-01', '2011-01-01', 'NaT'], + name='xxx', tz='US/Eastern') + tm.assert_index_equal(result, expected) + + # allow_fill=False + result = idx.take(np.array([1, 0, -1]), allow_fill=False, + fill_value=True) + expected = pd.DatetimeIndex(['2011-02-01', '2011-01-01', '2011-03-01'], + name='xxx', tz='US/Eastern') + tm.assert_index_equal(result, expected) + + msg = ('When allow_fill=True and fill_value is not None, ' + 'all indices must be >= -1') + with tm.assert_raises_regex(ValueError, msg): + idx.take(np.array([1, 0, -2]), fill_value=True) + with tm.assert_raises_regex(ValueError, msg): + idx.take(np.array([1, 0, -5]), fill_value=True) + + with pytest.raises(IndexError): + idx.take(np.array([1, -5])) + + def test_map_bug_1677(self): + index = DatetimeIndex(['2012-04-25 09:30:00.393000']) + f = index.asof + + result = index.map(f) + expected = Index([f(index[0])]) + tm.assert_index_equal(result, expected) + + def test_groupby_function_tuple_1677(self): + df = DataFrame(np.random.rand(100), + index=date_range("1/1/2000", periods=100)) + monthly_group = df.groupby(lambda x: (x.year, x.month)) + 
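# the grouping key maps each date to a (year, month) tuple, so the aggregated frame should be indexed by those tuples + 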
result = monthly_group.mean() + assert isinstance(result.index[0], tuple) + + def test_append_numpy_bug_1681(self): + # another datetime64 bug + dr = date_range('2011/1/1', '2012/1/1', freq='W-FRI') + a = DataFrame() + c = DataFrame({'A': 'foo', 'B': dr}, index=dr) + + result = a.append(c) + assert (result['B'] == dr).all() + + def test_isin(self): + index = tm.makeDateIndex(4) + result = index.isin(index) + assert result.all() + + result = index.isin(list(index)) + assert result.all() + + assert_almost_equal(index.isin([index[2], 5]), + np.array([False, False, True, False])) + + def test_time(self): + rng = pd.date_range('1/1/2000', freq='12min', periods=10) + result = pd.Index(rng).time + expected = [t.time() for t in rng] + assert (result == expected).all() + + def test_date(self): + rng = pd.date_range('1/1/2000', freq='12H', periods=10) + result = pd.Index(rng).date + expected = [t.date() for t in rng] + assert (result == expected).all() + + def test_does_not_convert_mixed_integer(self): + df = tm.makeCustomDataframe(10, 10, + data_gen_f=lambda *args, **kwargs: randn(), + r_idx_type='i', c_idx_type='dt') + cols = df.columns.join(df.index, how='outer') + joined = cols.join(df.columns) + assert cols.dtype == np.dtype('O') + assert cols.dtype == joined.dtype + tm.assert_numpy_array_equal(cols.values, joined.values) + + def test_slice_keeps_name(self): + # GH4226 + st = pd.Timestamp('2013-07-01 00:00:00', tz='America/Los_Angeles') + et = pd.Timestamp('2013-07-02 00:00:00', tz='America/Los_Angeles') + dr = pd.date_range(st, et, freq='H', name='timebucket') + assert dr[1:].name == dr.name + + def test_join_self(self): + index = date_range('1/1/2000', periods=10) + kinds = 'outer', 'inner', 'left', 'right' + for kind in kinds: + joined = index.join(index, how=kind) + assert index is joined + + def assert_index_parameters(self, index): + assert index.freq == '40960N' + assert index.inferred_freq == '40960N' + + def test_ns_index(self): + nsamples = 400 + ns = int(1e9 / 24414) + dtstart = np.datetime64('2012-09-20T00:00:00') + + dt = dtstart + np.arange(nsamples) * np.timedelta64(ns, 'ns') + freq = ns * offsets.Nano() + index = pd.DatetimeIndex(dt, freq=freq, name='time') + self.assert_index_parameters(index) + + new_index = pd.DatetimeIndex(start=index[0], end=index[-1], + freq=index.freq) + self.assert_index_parameters(new_index) + + def test_join_with_period_index(self): + df = tm.makeCustomDataframe( + 10, 10, data_gen_f=lambda *args: np.random.randint(2), + c_idx_type='p', r_idx_type='dt') + s = df.iloc[:5, 0] + joins = 'left', 'right', 'inner', 'outer' + + for join in joins: + with tm.assert_raises_regex(ValueError, + 'can only call with other ' + 'PeriodIndex-ed objects'): + df.columns.join(s.index, how=join) + + def test_factorize(self): + idx1 = DatetimeIndex(['2014-01', '2014-01', '2014-02', '2014-02', + '2014-03', '2014-03']) + + exp_arr = np.array([0, 0, 1, 1, 2, 2], dtype=np.intp) + exp_idx = DatetimeIndex(['2014-01', '2014-02', '2014-03']) + + arr, idx = idx1.factorize() + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(idx, exp_idx) + + arr, idx = idx1.factorize(sort=True) + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(idx, exp_idx) + + # tz must be preserved + idx1 = idx1.tz_localize('Asia/Tokyo') + exp_idx = exp_idx.tz_localize('Asia/Tokyo') + + arr, idx = idx1.factorize() + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(idx, exp_idx) + + idx2 = pd.DatetimeIndex(['2014-03', '2014-03', '2014-02', '2014-01', + '2014-03', 
'2014-01']) + + exp_arr = np.array([2, 2, 1, 0, 2, 0], dtype=np.intp) + exp_idx = DatetimeIndex(['2014-01', '2014-02', '2014-03']) + arr, idx = idx2.factorize(sort=True) + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(idx, exp_idx) + + exp_arr = np.array([0, 0, 1, 2, 0, 2], dtype=np.intp) + exp_idx = DatetimeIndex(['2014-03', '2014-02', '2014-01']) + arr, idx = idx2.factorize() + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(idx, exp_idx) + + # freq must be preserved + idx3 = date_range('2000-01', periods=4, freq='M', tz='Asia/Tokyo') + exp_arr = np.array([0, 1, 2, 3], dtype=np.intp) + arr, idx = idx3.factorize() + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(idx, idx3) + + def test_factorize_tz(self): + # GH 13750 + for tz in [None, 'UTC', 'US/Eastern', 'Asia/Tokyo']: + base = pd.date_range('2016-11-05', freq='H', periods=100, tz=tz) + idx = base.repeat(5) + + exp_arr = np.arange(100, dtype=np.intp).repeat(5) + + for obj in [idx, pd.Series(idx)]: + arr, res = obj.factorize() + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(res, base) + + def test_factorize_dst(self): + # GH 13750 + idx = pd.date_range('2016-11-06', freq='H', periods=12, + tz='US/Eastern') + + for obj in [idx, pd.Series(idx)]: + arr, res = obj.factorize() + tm.assert_numpy_array_equal(arr, np.arange(12, dtype=np.intp)) + tm.assert_index_equal(res, idx) + + idx = pd.date_range('2016-06-13', freq='H', periods=12, + tz='US/Eastern') + + for obj in [idx, pd.Series(idx)]: + arr, res = obj.factorize() + tm.assert_numpy_array_equal(arr, np.arange(12, dtype=np.intp)) + tm.assert_index_equal(res, idx) + + def test_slice_with_negative_step(self): + ts = Series(np.arange(20), + date_range('2014-01-01', periods=20, freq='MS')) + SLC = pd.IndexSlice + + def assert_slices_equivalent(l_slc, i_slc): + assert_series_equal(ts[l_slc], ts.iloc[i_slc]) + assert_series_equal(ts.loc[l_slc], ts.iloc[i_slc]) + assert_series_equal(ts.loc[l_slc], ts.iloc[i_slc]) + + assert_slices_equivalent(SLC[Timestamp('2014-10-01')::-1], SLC[9::-1]) + assert_slices_equivalent(SLC['2014-10-01'::-1], SLC[9::-1]) + + assert_slices_equivalent(SLC[:Timestamp('2014-10-01'):-1], SLC[:8:-1]) + assert_slices_equivalent(SLC[:'2014-10-01':-1], SLC[:8:-1]) + + assert_slices_equivalent(SLC['2015-02-01':'2014-10-01':-1], + SLC[13:8:-1]) + assert_slices_equivalent(SLC[Timestamp('2015-02-01'):Timestamp( + '2014-10-01'):-1], SLC[13:8:-1]) + assert_slices_equivalent(SLC['2015-02-01':Timestamp('2014-10-01'):-1], + SLC[13:8:-1]) + assert_slices_equivalent(SLC[Timestamp('2015-02-01'):'2014-10-01':-1], + SLC[13:8:-1]) + + assert_slices_equivalent(SLC['2014-10-01':'2015-02-01':-1], SLC[:0]) + + def test_slice_with_zero_step_raises(self): + ts = Series(np.arange(20), + date_range('2014-01-01', periods=20, freq='MS')) + tm.assert_raises_regex(ValueError, 'slice step cannot be zero', + lambda: ts[::0]) + tm.assert_raises_regex(ValueError, 'slice step cannot be zero', + lambda: ts.loc[::0]) + tm.assert_raises_regex(ValueError, 'slice step cannot be zero', + lambda: ts.loc[::0]) + + def test_slice_bounds_empty(self): + # GH 14354 + empty_idx = DatetimeIndex(freq='1H', periods=0, end='2015') + + right = empty_idx._maybe_cast_slice_bound('2015-01-02', 'right', 'loc') + exp = Timestamp('2015-01-02 23:59:59.999999999') + assert right == exp + + left = empty_idx._maybe_cast_slice_bound('2015-01-02', 'left', 'loc') + exp = Timestamp('2015-01-02 00:00:00') + assert left == exp diff --git 
a/pandas/tests/indexes/datetimes/test_datetimelike.py b/pandas/tests/indexes/datetimes/test_datetimelike.py new file mode 100644 index 0000000000000..3b970ee382521 --- /dev/null +++ b/pandas/tests/indexes/datetimes/test_datetimelike.py @@ -0,0 +1,76 @@ +""" generic tests from the Datetimelike class """ + +import numpy as np +import pandas as pd +from pandas.util import testing as tm +from pandas import Series, Index, DatetimeIndex, date_range + +from ..datetimelike import DatetimeLike + + +class TestDatetimeIndex(DatetimeLike): + _holder = DatetimeIndex + + def setup_method(self, method): + self.indices = dict(index=tm.makeDateIndex(10)) + self.setup_indices() + + def create_index(self): + return date_range('20130101', periods=5) + + def test_shift(self): + + # test shift for datetimeIndex and non datetimeIndex + # GH8083 + + drange = self.create_index() + result = drange.shift(1) + expected = DatetimeIndex(['2013-01-02', '2013-01-03', '2013-01-04', + '2013-01-05', + '2013-01-06'], freq='D') + tm.assert_index_equal(result, expected) + + result = drange.shift(-1) + expected = DatetimeIndex(['2012-12-31', '2013-01-01', '2013-01-02', + '2013-01-03', '2013-01-04'], + freq='D') + tm.assert_index_equal(result, expected) + + result = drange.shift(3, freq='2D') + expected = DatetimeIndex(['2013-01-07', '2013-01-08', '2013-01-09', + '2013-01-10', + '2013-01-11'], freq='D') + tm.assert_index_equal(result, expected) + + def test_pickle_compat_construction(self): + pass + + def test_intersection(self): + first = self.index + second = self.index[5:] + intersect = first.intersection(second) + assert tm.equalContents(intersect, second) + + # GH 10149 + cases = [klass(second.values) for klass in [np.array, Series, list]] + for case in cases: + result = first.intersection(case) + assert tm.equalContents(result, second) + + third = Index(['a', 'b', 'c']) + result = first.intersection(third) + expected = pd.Index([], dtype=object) + tm.assert_index_equal(result, expected) + + def test_union(self): + first = self.index[:5] + second = self.index[5:] + everything = self.index + union = first.union(second) + assert tm.equalContents(union, everything) + + # GH 10149 + cases = [klass(second.values) for klass in [np.array, Series, list]] + for case in cases: + result = first.union(case) + assert tm.equalContents(result, everything) diff --git a/pandas/tests/indexes/datetimes/test_formats.py b/pandas/tests/indexes/datetimes/test_formats.py new file mode 100644 index 0000000000000..ea2731f66f0ef --- /dev/null +++ b/pandas/tests/indexes/datetimes/test_formats.py @@ -0,0 +1,47 @@ +from pandas import DatetimeIndex + +import numpy as np + +import pandas.util.testing as tm +import pandas as pd + + +def test_to_native_types(): + index = DatetimeIndex(freq='1D', periods=3, start='2017-01-01') + + # First, with no arguments. 
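+ # to_native_types renders the index as an object array of formatted strings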
+ expected = np.array(['2017-01-01', '2017-01-02', + '2017-01-03'], dtype=object) + + result = index.to_native_types() + tm.assert_numpy_array_equal(result, expected) + + # No NaN values, so na_rep has no effect + result = index.to_native_types(na_rep='pandas') + tm.assert_numpy_array_equal(result, expected) + + # Make sure slicing works + expected = np.array(['2017-01-01', '2017-01-03'], dtype=object) + + result = index.to_native_types([0, 2]) + tm.assert_numpy_array_equal(result, expected) + + # Make sure date formatting works + expected = np.array(['01-2017-01', '01-2017-02', + '01-2017-03'], dtype=object) + + result = index.to_native_types(date_format='%m-%Y-%d') + tm.assert_numpy_array_equal(result, expected) + + # NULL object handling should work + index = DatetimeIndex(['2017-01-01', pd.NaT, '2017-01-03']) + expected = np.array(['2017-01-01', 'NaT', '2017-01-03'], dtype=object) + + result = index.to_native_types() + tm.assert_numpy_array_equal(result, expected) + + expected = np.array(['2017-01-01', 'pandas', + '2017-01-03'], dtype=object) + + result = index.to_native_types(na_rep='pandas') + tm.assert_numpy_array_equal(result, expected) diff --git a/pandas/tests/indexes/datetimes/test_indexing.py b/pandas/tests/indexes/datetimes/test_indexing.py new file mode 100644 index 0000000000000..a9ea028c9d0f7 --- /dev/null +++ b/pandas/tests/indexes/datetimes/test_indexing.py @@ -0,0 +1,243 @@ +import pytest + +import numpy as np +import pandas as pd +import pandas.util.testing as tm +import pandas.compat as compat +from pandas import notnull, Index, DatetimeIndex, datetime, date_range + + +class TestDatetimeIndex(object): + + def test_where_other(self): + + # other is ndarray or Index + i = pd.date_range('20130101', periods=3, tz='US/Eastern') + + for arr in [np.nan, pd.NaT]: + result = i.where(notnull(i), other=np.nan) + expected = i + tm.assert_index_equal(result, expected) + + i2 = i.copy() + i2 = Index([pd.NaT, pd.NaT] + i[2:].tolist()) + result = i.where(notnull(i2), i2) + tm.assert_index_equal(result, i2) + + i2 = i.copy() + i2 = Index([pd.NaT, pd.NaT] + i[2:].tolist()) + result = i.where(notnull(i2), i2.values) + tm.assert_index_equal(result, i2) + + def test_where_tz(self): + i = pd.date_range('20130101', periods=3, tz='US/Eastern') + result = i.where(notnull(i)) + expected = i + tm.assert_index_equal(result, expected) + + i2 = i.copy() + i2 = Index([pd.NaT, pd.NaT] + i[2:].tolist()) + result = i.where(notnull(i2)) + expected = i2 + tm.assert_index_equal(result, expected) + + def test_insert(self): + idx = DatetimeIndex( + ['2000-01-04', '2000-01-01', '2000-01-02'], name='idx') + + result = idx.insert(2, datetime(2000, 1, 5)) + exp = DatetimeIndex(['2000-01-04', '2000-01-01', '2000-01-05', + '2000-01-02'], name='idx') + tm.assert_index_equal(result, exp) + + # insertion of non-datetime should coerce to object index + result = idx.insert(1, 'inserted') + expected = Index([datetime(2000, 1, 4), 'inserted', + datetime(2000, 1, 1), + datetime(2000, 1, 2)], name='idx') + assert not isinstance(result, DatetimeIndex) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + + idx = date_range('1/1/2000', periods=3, freq='M', name='idx') + + # preserve freq + expected_0 = DatetimeIndex(['1999-12-31', '2000-01-31', '2000-02-29', + '2000-03-31'], name='idx', freq='M') + expected_3 = DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', + '2000-04-30'], name='idx', freq='M') + + # reset freq to None + expected_1_nofreq = DatetimeIndex(['2000-01-31', '2000-01-31', + 
'2000-02-29', + '2000-03-31'], name='idx', + freq=None) + expected_3_nofreq = DatetimeIndex(['2000-01-31', '2000-02-29', + '2000-03-31', + '2000-01-02'], name='idx', + freq=None) + + cases = [(0, datetime(1999, 12, 31), expected_0), + (-3, datetime(1999, 12, 31), expected_0), + (3, datetime(2000, 4, 30), expected_3), + (1, datetime(2000, 1, 31), expected_1_nofreq), + (3, datetime(2000, 1, 2), expected_3_nofreq)] + + for n, d, expected in cases: + result = idx.insert(n, d) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == expected.freq + + # reset freq to None + result = idx.insert(3, datetime(2000, 1, 2)) + expected = DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', + '2000-01-02'], name='idx', freq=None) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq is None + + # GH 7299 + tm._skip_if_no_pytz() + import pytz + + idx = date_range('1/1/2000', periods=3, freq='D', tz='Asia/Tokyo', + name='idx') + with pytest.raises(ValueError): + idx.insert(3, pd.Timestamp('2000-01-04')) + with pytest.raises(ValueError): + idx.insert(3, datetime(2000, 1, 4)) + with pytest.raises(ValueError): + idx.insert(3, pd.Timestamp('2000-01-04', tz='US/Eastern')) + with pytest.raises(ValueError): + idx.insert(3, datetime(2000, 1, 4, + tzinfo=pytz.timezone('US/Eastern'))) + + for tz in ['US/Pacific', 'Asia/Singapore']: + idx = date_range('1/1/2000 09:00', periods=6, freq='H', tz=tz, + name='idx') + # preserve freq + expected = date_range('1/1/2000 09:00', periods=7, freq='H', tz=tz, + name='idx') + for d in [pd.Timestamp('2000-01-01 15:00', tz=tz), + pytz.timezone(tz).localize(datetime(2000, 1, 1, 15))]: + + result = idx.insert(6, d) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == expected.freq + assert result.tz == expected.tz + + expected = DatetimeIndex(['2000-01-01 09:00', '2000-01-01 10:00', + '2000-01-01 11:00', + '2000-01-01 12:00', '2000-01-01 13:00', + '2000-01-01 14:00', + '2000-01-01 10:00'], name='idx', + tz=tz, freq=None) + # reset freq to None + for d in [pd.Timestamp('2000-01-01 10:00', tz=tz), + pytz.timezone(tz).localize(datetime(2000, 1, 1, 10))]: + result = idx.insert(6, d) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.tz == expected.tz + assert result.freq is None + + def test_delete(self): + idx = date_range(start='2000-01-01', periods=5, freq='M', name='idx') + + # preserve freq + expected_0 = date_range(start='2000-02-01', periods=4, freq='M', + name='idx') + expected_4 = date_range(start='2000-01-01', periods=4, freq='M', + name='idx') + + # reset freq to None + expected_1 = DatetimeIndex(['2000-01-31', '2000-03-31', '2000-04-30', + '2000-05-31'], freq=None, name='idx') + + cases = {0: expected_0, + -5: expected_0, + -1: expected_4, + 4: expected_4, + 1: expected_1} + for n, expected in compat.iteritems(cases): + result = idx.delete(n) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == expected.freq + + with pytest.raises((IndexError, ValueError)): + # either, depending on numpy version + result = idx.delete(5) + + for tz in [None, 'Asia/Tokyo', 'US/Pacific']: + idx = date_range(start='2000-01-01 09:00', periods=10, freq='H', + name='idx', tz=tz) + + expected = date_range(start='2000-01-01 10:00', periods=9, + freq='H', name='idx', tz=tz) + result = idx.delete(0) + tm.assert_index_equal(result, expected) + assert result.name == 
expected.name + assert result.freqstr == 'H' + assert result.tz == expected.tz + + expected = date_range(start='2000-01-01 09:00', periods=9, + freq='H', name='idx', tz=tz) + result = idx.delete(-1) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freqstr == 'H' + assert result.tz == expected.tz + + def test_delete_slice(self): + idx = date_range(start='2000-01-01', periods=10, freq='D', name='idx') + + # preserve freq + expected_0_2 = date_range(start='2000-01-04', periods=7, freq='D', + name='idx') + expected_7_9 = date_range(start='2000-01-01', periods=7, freq='D', + name='idx') + + # reset freq to None + expected_3_5 = DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', + '2000-01-07', '2000-01-08', '2000-01-09', + '2000-01-10'], freq=None, name='idx') + + cases = {(0, 1, 2): expected_0_2, + (7, 8, 9): expected_7_9, + (3, 4, 5): expected_3_5} + for n, expected in compat.iteritems(cases): + result = idx.delete(n) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == expected.freq + + result = idx.delete(slice(n[0], n[-1] + 1)) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == expected.freq + + for tz in [None, 'Asia/Tokyo', 'US/Pacific']: + ts = pd.Series(1, index=pd.date_range( + '2000-01-01 09:00', periods=10, freq='H', name='idx', tz=tz)) + # preserve freq + result = ts.drop(ts.index[:5]).index + expected = pd.date_range('2000-01-01 14:00', periods=5, freq='H', + name='idx', tz=tz) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == expected.freq + assert result.tz == expected.tz + + # reset freq to None + result = ts.drop(ts.index[[1, 3, 5, 7, 9]]).index + expected = DatetimeIndex(['2000-01-01 09:00', '2000-01-01 11:00', + '2000-01-01 13:00', + '2000-01-01 15:00', '2000-01-01 17:00'], + freq=None, name='idx', tz=tz) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == expected.freq + assert result.tz == expected.tz diff --git a/pandas/tests/indexes/datetimes/test_misc.py b/pandas/tests/indexes/datetimes/test_misc.py new file mode 100644 index 0000000000000..951aa2c520d0f --- /dev/null +++ b/pandas/tests/indexes/datetimes/test_misc.py @@ -0,0 +1,349 @@ +import pytest + +import numpy as np +import pandas as pd +import pandas.util.testing as tm +from pandas import (Index, DatetimeIndex, datetime, offsets, + Float64Index, date_range, Timestamp) + + +class TestDateTimeIndexToJulianDate(object): + + def test_1700(self): + r1 = Float64Index([2345897.5, 2345898.5, 2345899.5, 2345900.5, + 2345901.5]) + r2 = date_range(start=Timestamp('1710-10-01'), periods=5, + freq='D').to_julian_date() + assert isinstance(r2, Float64Index) + tm.assert_index_equal(r1, r2) + + def test_2000(self): + r1 = Float64Index([2451601.5, 2451602.5, 2451603.5, 2451604.5, + 2451605.5]) + r2 = date_range(start=Timestamp('2000-02-27'), periods=5, + freq='D').to_julian_date() + assert isinstance(r2, Float64Index) + tm.assert_index_equal(r1, r2) + + def test_hour(self): + r1 = Float64Index( + [2451601.5, 2451601.5416666666666666, 2451601.5833333333333333, + 2451601.625, 2451601.6666666666666666]) + r2 = date_range(start=Timestamp('2000-02-27'), periods=5, + freq='H').to_julian_date() + assert isinstance(r2, Float64Index) + tm.assert_index_equal(r1, r2) + + def test_minute(self): + r1 = Float64Index( + [2451601.5, 2451601.5006944444444444, 2451601.5013888888888888, 
2451601.5020833333333333, 2451601.5027777777777777]) + r2 = date_range(start=Timestamp('2000-02-27'), periods=5, + freq='T').to_julian_date() + assert isinstance(r2, Float64Index) + tm.assert_index_equal(r1, r2) + + def test_second(self): + r1 = Float64Index( + [2451601.5, 2451601.500011574074074, 2451601.5000231481481481, + 2451601.5000347222222222, 2451601.5000462962962962]) + r2 = date_range(start=Timestamp('2000-02-27'), periods=5, + freq='S').to_julian_date() + assert isinstance(r2, Float64Index) + tm.assert_index_equal(r1, r2) + + +class TestTimeSeries(object): + + def test_pass_datetimeindex_to_index(self): + # Bugs in #1396 + rng = date_range('1/1/2000', '3/1/2000') + idx = Index(rng, dtype=object) + + expected = Index(rng.to_pydatetime(), dtype=object) + + tm.assert_numpy_array_equal(idx.values, expected.values) + + def test_range_edges(self): + # GH 13672 + idx = DatetimeIndex(start=Timestamp('1970-01-01 00:00:00.000000001'), + end=Timestamp('1970-01-01 00:00:00.000000004'), + freq='N') + exp = DatetimeIndex(['1970-01-01 00:00:00.000000001', + '1970-01-01 00:00:00.000000002', + '1970-01-01 00:00:00.000000003', + '1970-01-01 00:00:00.000000004']) + tm.assert_index_equal(idx, exp) + + idx = DatetimeIndex(start=Timestamp('1970-01-01 00:00:00.000000004'), + end=Timestamp('1970-01-01 00:00:00.000000001'), + freq='N') + exp = DatetimeIndex([]) + tm.assert_index_equal(idx, exp) + + idx = DatetimeIndex(start=Timestamp('1970-01-01 00:00:00.000000001'), + end=Timestamp('1970-01-01 00:00:00.000000001'), + freq='N') + exp = DatetimeIndex(['1970-01-01 00:00:00.000000001']) + tm.assert_index_equal(idx, exp) + + idx = DatetimeIndex(start=Timestamp('1970-01-01 00:00:00.000001'), + end=Timestamp('1970-01-01 00:00:00.000004'), + freq='U') + exp = DatetimeIndex(['1970-01-01 00:00:00.000001', + '1970-01-01 00:00:00.000002', + '1970-01-01 00:00:00.000003', + '1970-01-01 00:00:00.000004']) + tm.assert_index_equal(idx, exp) + + idx = DatetimeIndex(start=Timestamp('1970-01-01 00:00:00.001'), + end=Timestamp('1970-01-01 00:00:00.004'), + freq='L') + exp = DatetimeIndex(['1970-01-01 00:00:00.001', + '1970-01-01 00:00:00.002', + '1970-01-01 00:00:00.003', + '1970-01-01 00:00:00.004']) + tm.assert_index_equal(idx, exp) + + idx = DatetimeIndex(start=Timestamp('1970-01-01 00:00:01'), + end=Timestamp('1970-01-01 00:00:04'), freq='S') + exp = DatetimeIndex(['1970-01-01 00:00:01', '1970-01-01 00:00:02', + '1970-01-01 00:00:03', '1970-01-01 00:00:04']) + tm.assert_index_equal(idx, exp) + + idx = DatetimeIndex(start=Timestamp('1970-01-01 00:01'), + end=Timestamp('1970-01-01 00:04'), freq='T') + exp = DatetimeIndex(['1970-01-01 00:01', '1970-01-01 00:02', + '1970-01-01 00:03', '1970-01-01 00:04']) + tm.assert_index_equal(idx, exp) + + idx = DatetimeIndex(start=Timestamp('1970-01-01 01:00'), + end=Timestamp('1970-01-01 04:00'), freq='H') + exp = DatetimeIndex(['1970-01-01 01:00', '1970-01-01 02:00', + '1970-01-01 03:00', '1970-01-01 04:00']) + tm.assert_index_equal(idx, exp) + + idx = DatetimeIndex(start=Timestamp('1970-01-01'), + end=Timestamp('1970-01-04'), freq='D') + exp = DatetimeIndex(['1970-01-01', '1970-01-02', + '1970-01-03', '1970-01-04']) + tm.assert_index_equal(idx, exp) + + def test_datetimeindex_integers_shift(self): + rng = date_range('1/1/2000', periods=20) + + result = rng + 5 + expected = rng.shift(5) + tm.assert_index_equal(result, expected) + + result = rng - 5 + expected = rng.shift(-5) + tm.assert_index_equal(result, expected) + + def test_datetimeindex_repr_short(self): + dr = 
date_range(start='1/1/2012', periods=1) + repr(dr) + + dr = date_range(start='1/1/2012', periods=2) + repr(dr) + + dr = date_range(start='1/1/2012', periods=3) + repr(dr) + + def test_normalize(self): + rng = date_range('1/1/2000 9:30', periods=10, freq='D') + + result = rng.normalize() + expected = date_range('1/1/2000', periods=10, freq='D') + tm.assert_index_equal(result, expected) + + rng_ns = pd.DatetimeIndex(np.array([1380585623454345752, + 1380585612343234312]).astype( + "datetime64[ns]")) + rng_ns_normalized = rng_ns.normalize() + expected = pd.DatetimeIndex(np.array([1380585600000000000, + 1380585600000000000]).astype( + "datetime64[ns]")) + tm.assert_index_equal(rng_ns_normalized, expected) + + assert result.is_normalized + assert not rng.is_normalized + + +class TestDatetime64(object): + + def test_datetimeindex_accessors(self): + + dti_naive = DatetimeIndex(freq='D', start=datetime(1998, 1, 1), + periods=365) + # GH 13303 + dti_tz = DatetimeIndex(freq='D', start=datetime(1998, 1, 1), + periods=365, tz='US/Eastern') + for dti in [dti_naive, dti_tz]: + + assert dti.year[0] == 1998 + assert dti.month[0] == 1 + assert dti.day[0] == 1 + assert dti.hour[0] == 0 + assert dti.minute[0] == 0 + assert dti.second[0] == 0 + assert dti.microsecond[0] == 0 + assert dti.dayofweek[0] == 3 + + assert dti.dayofyear[0] == 1 + assert dti.dayofyear[120] == 121 + + assert dti.weekofyear[0] == 1 + assert dti.weekofyear[120] == 18 + + assert dti.quarter[0] == 1 + assert dti.quarter[120] == 2 + + assert dti.days_in_month[0] == 31 + assert dti.days_in_month[90] == 30 + + assert dti.is_month_start[0] + assert not dti.is_month_start[1] + assert dti.is_month_start[31] + assert dti.is_quarter_start[0] + assert dti.is_quarter_start[90] + assert dti.is_year_start[0] + assert not dti.is_year_start[364] + assert not dti.is_month_end[0] + assert dti.is_month_end[30] + assert not dti.is_month_end[31] + assert dti.is_month_end[364] + assert not dti.is_quarter_end[0] + assert not dti.is_quarter_end[30] + assert dti.is_quarter_end[89] + assert dti.is_quarter_end[364] + assert not dti.is_year_end[0] + assert dti.is_year_end[364] + + # GH 11128 + assert dti.weekday_name[4] == u'Monday' + assert dti.weekday_name[5] == u'Tuesday' + assert dti.weekday_name[6] == u'Wednesday' + assert dti.weekday_name[7] == u'Thursday' + assert dti.weekday_name[8] == u'Friday' + assert dti.weekday_name[9] == u'Saturday' + assert dti.weekday_name[10] == u'Sunday' + + assert Timestamp('2016-04-04').weekday_name == u'Monday' + assert Timestamp('2016-04-05').weekday_name == u'Tuesday' + assert Timestamp('2016-04-06').weekday_name == u'Wednesday' + assert Timestamp('2016-04-07').weekday_name == u'Thursday' + assert Timestamp('2016-04-08').weekday_name == u'Friday' + assert Timestamp('2016-04-09').weekday_name == u'Saturday' + assert Timestamp('2016-04-10').weekday_name == u'Sunday' + + assert len(dti.year) == 365 + assert len(dti.month) == 365 + assert len(dti.day) == 365 + assert len(dti.hour) == 365 + assert len(dti.minute) == 365 + assert len(dti.second) == 365 + assert len(dti.microsecond) == 365 + assert len(dti.dayofweek) == 365 + assert len(dti.dayofyear) == 365 + assert len(dti.weekofyear) == 365 + assert len(dti.quarter) == 365 + assert len(dti.is_month_start) == 365 + assert len(dti.is_month_end) == 365 + assert len(dti.is_quarter_start) == 365 + assert len(dti.is_quarter_end) == 365 + assert len(dti.is_year_start) == 365 + assert len(dti.is_year_end) == 365 + assert len(dti.weekday_name) == 365 + + dti.name = 'name' + + # non 
boolean accessors -> return Index + for accessor in DatetimeIndex._field_ops: + res = getattr(dti, accessor) + assert len(res) == 365 + assert isinstance(res, Index) + assert res.name == 'name' + + # boolean accessors -> return array + for accessor in DatetimeIndex._bool_ops: + res = getattr(dti, accessor) + assert len(res) == 365 + assert isinstance(res, np.ndarray) + + # test boolean indexing + res = dti[dti.is_quarter_start] + exp = dti[[0, 90, 181, 273]] + tm.assert_index_equal(res, exp) + res = dti[dti.is_leap_year] + exp = DatetimeIndex([], freq='D', tz=dti.tz, name='name') + tm.assert_index_equal(res, exp) + + dti = DatetimeIndex(freq='BQ-FEB', start=datetime(1998, 1, 1), + periods=4) + + assert sum(dti.is_quarter_start) == 0 + assert sum(dti.is_quarter_end) == 4 + assert sum(dti.is_year_start) == 0 + assert sum(dti.is_year_end) == 1 + + # Ensure is_start/end accessors throw ValueError for CustomBusinessDay, + # CBD requires np >= 1.7 + bday_egypt = offsets.CustomBusinessDay(weekmask='Sun Mon Tue Wed Thu') + dti = date_range(datetime(2013, 4, 30), periods=5, freq=bday_egypt) + pytest.raises(ValueError, lambda: dti.is_month_start) + + dti = DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03']) + + assert dti.is_month_start[0] == 1 + + tests = [ + (Timestamp('2013-06-01', freq='M').is_month_start, 1), + (Timestamp('2013-06-01', freq='BM').is_month_start, 0), + (Timestamp('2013-06-03', freq='M').is_month_start, 0), + (Timestamp('2013-06-03', freq='BM').is_month_start, 1), + (Timestamp('2013-02-28', freq='Q-FEB').is_month_end, 1), + (Timestamp('2013-02-28', freq='Q-FEB').is_quarter_end, 1), + (Timestamp('2013-02-28', freq='Q-FEB').is_year_end, 1), + (Timestamp('2013-03-01', freq='Q-FEB').is_month_start, 1), + (Timestamp('2013-03-01', freq='Q-FEB').is_quarter_start, 1), + (Timestamp('2013-03-01', freq='Q-FEB').is_year_start, 1), + (Timestamp('2013-03-31', freq='QS-FEB').is_month_end, 1), + (Timestamp('2013-03-31', freq='QS-FEB').is_quarter_end, 0), + (Timestamp('2013-03-31', freq='QS-FEB').is_year_end, 0), + (Timestamp('2013-02-01', freq='QS-FEB').is_month_start, 1), + (Timestamp('2013-02-01', freq='QS-FEB').is_quarter_start, 1), + (Timestamp('2013-02-01', freq='QS-FEB').is_year_start, 1), + (Timestamp('2013-06-30', freq='BQ').is_month_end, 0), + (Timestamp('2013-06-30', freq='BQ').is_quarter_end, 0), + (Timestamp('2013-06-30', freq='BQ').is_year_end, 0), + (Timestamp('2013-06-28', freq='BQ').is_month_end, 1), + (Timestamp('2013-06-28', freq='BQ').is_quarter_end, 1), + (Timestamp('2013-06-28', freq='BQ').is_year_end, 0), + (Timestamp('2013-06-30', freq='BQS-APR').is_month_end, 0), + (Timestamp('2013-06-30', freq='BQS-APR').is_quarter_end, 0), + (Timestamp('2013-06-30', freq='BQS-APR').is_year_end, 0), + (Timestamp('2013-06-28', freq='BQS-APR').is_month_end, 1), + (Timestamp('2013-06-28', freq='BQS-APR').is_quarter_end, 1), + (Timestamp('2013-03-29', freq='BQS-APR').is_year_end, 1), + (Timestamp('2013-11-01', freq='AS-NOV').is_year_start, 1), + (Timestamp('2013-10-31', freq='AS-NOV').is_year_end, 1), + (Timestamp('2012-02-01').days_in_month, 29), + (Timestamp('2013-02-01').days_in_month, 28)] + + for ts, value in tests: + assert ts == value + + # GH 6538: Check that DatetimeIndex and its TimeStamp elements + # return the same weekofyear accessor close to new year w/ tz + dates = ["2013/12/29", "2013/12/30", "2013/12/31"] + dates = DatetimeIndex(dates, tz="Europe/Brussels") + expected = [52, 1, 1] + assert dates.weekofyear.tolist() == expected + assert [d.weekofyear for d in dates] == 
expected + + def test_nanosecond_field(self): + dti = DatetimeIndex(np.arange(10)) + + tm.assert_index_equal(dti.nanosecond, + pd.Index(np.arange(10, dtype=np.int64))) diff --git a/pandas/tests/indexes/datetimes/test_missing.py b/pandas/tests/indexes/datetimes/test_missing.py new file mode 100644 index 0000000000000..adc0b7b3d81e8 --- /dev/null +++ b/pandas/tests/indexes/datetimes/test_missing.py @@ -0,0 +1,50 @@ +import pandas as pd +import pandas.util.testing as tm + + +class TestDatetimeIndex(object): + + def test_fillna_datetime64(self): + # GH 11343 + for tz in ['US/Eastern', 'Asia/Tokyo']: + idx = pd.DatetimeIndex(['2011-01-01 09:00', pd.NaT, + '2011-01-01 11:00']) + + exp = pd.DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', + '2011-01-01 11:00']) + tm.assert_index_equal( + idx.fillna(pd.Timestamp('2011-01-01 10:00')), exp) + + # tz mismatch + exp = pd.Index([pd.Timestamp('2011-01-01 09:00'), + pd.Timestamp('2011-01-01 10:00', tz=tz), + pd.Timestamp('2011-01-01 11:00')], dtype=object) + tm.assert_index_equal( + idx.fillna(pd.Timestamp('2011-01-01 10:00', tz=tz)), exp) + + # object + exp = pd.Index([pd.Timestamp('2011-01-01 09:00'), 'x', + pd.Timestamp('2011-01-01 11:00')], dtype=object) + tm.assert_index_equal(idx.fillna('x'), exp) + + idx = pd.DatetimeIndex(['2011-01-01 09:00', pd.NaT, + '2011-01-01 11:00'], tz=tz) + + exp = pd.DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', + '2011-01-01 11:00'], tz=tz) + tm.assert_index_equal( + idx.fillna(pd.Timestamp('2011-01-01 10:00', tz=tz)), exp) + + exp = pd.Index([pd.Timestamp('2011-01-01 09:00', tz=tz), + pd.Timestamp('2011-01-01 10:00'), + pd.Timestamp('2011-01-01 11:00', tz=tz)], + dtype=object) + tm.assert_index_equal( + idx.fillna(pd.Timestamp('2011-01-01 10:00')), exp) + + # object + exp = pd.Index([pd.Timestamp('2011-01-01 09:00', tz=tz), + 'x', + pd.Timestamp('2011-01-01 11:00', tz=tz)], + dtype=object) + tm.assert_index_equal(idx.fillna('x'), exp) diff --git a/pandas/tests/indexes/datetimes/test_ops.py b/pandas/tests/indexes/datetimes/test_ops.py new file mode 100644 index 0000000000000..80e93a1f76a66 --- /dev/null +++ b/pandas/tests/indexes/datetimes/test_ops.py @@ -0,0 +1,1292 @@ +import pytest +import warnings +import numpy as np +from datetime import timedelta + +from itertools import product +import pandas as pd +import pandas._libs.tslib as tslib +import pandas.util.testing as tm +from pandas.errors import PerformanceWarning +from pandas.core.indexes.datetimes import cdate_range +from pandas import (DatetimeIndex, PeriodIndex, Series, Timestamp, Timedelta, + date_range, TimedeltaIndex, _np_version_under1p10, Index, + datetime, Float64Index, offsets, bdate_range) +from pandas.tseries.offsets import BMonthEnd, CDay, BDay +from pandas.tests.test_base import Ops + + +START, END = datetime(2009, 1, 1), datetime(2010, 1, 1) + + +class TestDatetimeIndexOps(Ops): + tz = [None, 'UTC', 'Asia/Tokyo', 'US/Eastern', 'dateutil/Asia/Singapore', + 'dateutil/US/Pacific'] + + def setup_method(self, method): + super(TestDatetimeIndexOps, self).setup_method(method) + mask = lambda x: (isinstance(x, DatetimeIndex) or + isinstance(x, PeriodIndex)) + self.is_valid_objs = [o for o in self.objs if mask(o)] + self.not_valid_objs = [o for o in self.objs if not mask(o)] + + def test_ops_properties(self): + f = lambda x: isinstance(x, DatetimeIndex) + self.check_ops_properties(DatetimeIndex._field_ops, f) + self.check_ops_properties(DatetimeIndex._object_ops, f) + self.check_ops_properties(DatetimeIndex._bool_ops, f) + + def 
test_ops_properties_basic(self): + + # sanity check that the behavior didn't change + # GH7206 + for op in ['year', 'day', 'second', 'weekday']: + pytest.raises(TypeError, lambda x: getattr(self.dt_series, op)) + + # attribute access should still work! + s = Series(dict(year=2000, month=1, day=10)) + assert s.year == 2000 + assert s.month == 1 + assert s.day == 10 + pytest.raises(AttributeError, lambda: s.weekday) + + def test_asobject_tolist(self): + idx = pd.date_range(start='2013-01-01', periods=4, freq='M', + name='idx') + expected_list = [Timestamp('2013-01-31'), + Timestamp('2013-02-28'), + Timestamp('2013-03-31'), + Timestamp('2013-04-30')] + expected = pd.Index(expected_list, dtype=object, name='idx') + result = idx.asobject + assert isinstance(result, Index) + + assert result.dtype == object + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert idx.tolist() == expected_list + + idx = pd.date_range(start='2013-01-01', periods=4, freq='M', + name='idx', tz='Asia/Tokyo') + expected_list = [Timestamp('2013-01-31', tz='Asia/Tokyo'), + Timestamp('2013-02-28', tz='Asia/Tokyo'), + Timestamp('2013-03-31', tz='Asia/Tokyo'), + Timestamp('2013-04-30', tz='Asia/Tokyo')] + expected = pd.Index(expected_list, dtype=object, name='idx') + result = idx.asobject + assert isinstance(result, Index) + assert result.dtype == object + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert idx.tolist() == expected_list + + idx = DatetimeIndex([datetime(2013, 1, 1), datetime(2013, 1, 2), + pd.NaT, datetime(2013, 1, 4)], name='idx') + expected_list = [Timestamp('2013-01-01'), + Timestamp('2013-01-02'), pd.NaT, + Timestamp('2013-01-04')] + expected = pd.Index(expected_list, dtype=object, name='idx') + result = idx.asobject + assert isinstance(result, Index) + assert result.dtype == object + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert idx.tolist() == expected_list + + def test_minmax(self): + for tz in self.tz: + # monotonic + idx1 = pd.DatetimeIndex(['2011-01-01', '2011-01-02', + '2011-01-03'], tz=tz) + assert idx1.is_monotonic + + # non-monotonic + idx2 = pd.DatetimeIndex(['2011-01-01', pd.NaT, '2011-01-03', + '2011-01-02', pd.NaT], tz=tz) + assert not idx2.is_monotonic + + for idx in [idx1, idx2]: + assert idx.min() == Timestamp('2011-01-01', tz=tz) + assert idx.max() == Timestamp('2011-01-03', tz=tz) + assert idx.argmin() == 0 + assert idx.argmax() == 2 + + for op in ['min', 'max']: + # Return NaT + obj = DatetimeIndex([]) + assert pd.isnull(getattr(obj, op)()) + + obj = DatetimeIndex([pd.NaT]) + assert pd.isnull(getattr(obj, op)()) + + obj = DatetimeIndex([pd.NaT, pd.NaT, pd.NaT]) + assert pd.isnull(getattr(obj, op)()) + + def test_numpy_minmax(self): + dr = pd.date_range(start='2016-01-15', end='2016-01-20') + + assert np.min(dr) == Timestamp('2016-01-15 00:00:00', freq='D') + assert np.max(dr) == Timestamp('2016-01-20 00:00:00', freq='D') + + errmsg = "the 'out' parameter is not supported" + tm.assert_raises_regex(ValueError, errmsg, np.min, dr, out=0) + tm.assert_raises_regex(ValueError, errmsg, np.max, dr, out=0) + + assert np.argmin(dr) == 0 + assert np.argmax(dr) == 5 + + if not _np_version_under1p10: + errmsg = "the 'out' parameter is not supported" + tm.assert_raises_regex( + ValueError, errmsg, np.argmin, dr, out=0) + tm.assert_raises_regex( + ValueError, errmsg, np.argmax, dr, out=0) + + def test_round(self): + for tz in self.tz: + rng = pd.date_range(start='2016-01-01', periods=5, + 
freq='30Min', tz=tz) + elt = rng[1] + + expected_rng = DatetimeIndex([ + Timestamp('2016-01-01 00:00:00', tz=tz, freq='30T'), + Timestamp('2016-01-01 00:00:00', tz=tz, freq='30T'), + Timestamp('2016-01-01 01:00:00', tz=tz, freq='30T'), + Timestamp('2016-01-01 02:00:00', tz=tz, freq='30T'), + Timestamp('2016-01-01 02:00:00', tz=tz, freq='30T'), + ]) + expected_elt = expected_rng[1] + + tm.assert_index_equal(rng.round(freq='H'), expected_rng) + assert elt.round(freq='H') == expected_elt + + msg = pd.tseries.frequencies._INVALID_FREQ_ERROR + with tm.assert_raises_regex(ValueError, msg): + rng.round(freq='foo') + with tm.assert_raises_regex(ValueError, msg): + elt.round(freq='foo') + + msg = " is a non-fixed frequency" + tm.assert_raises_regex(ValueError, msg, rng.round, freq='M') + tm.assert_raises_regex(ValueError, msg, elt.round, freq='M') + + # GH 14440 & 15578 + index = pd.DatetimeIndex(['2016-10-17 12:00:00.0015'], tz=tz) + result = index.round('ms') + expected = pd.DatetimeIndex(['2016-10-17 12:00:00.002000'], tz=tz) + tm.assert_index_equal(result, expected) + + for freq in ['us', 'ns']: + tm.assert_index_equal(index, index.round(freq)) + + index = pd.DatetimeIndex(['2016-10-17 12:00:00.00149'], tz=tz) + result = index.round('ms') + expected = pd.DatetimeIndex(['2016-10-17 12:00:00.001000'], tz=tz) + tm.assert_index_equal(result, expected) + + index = pd.DatetimeIndex(['2016-10-17 12:00:00.001501031']) + result = index.round('10ns') + expected = pd.DatetimeIndex(['2016-10-17 12:00:00.001501030']) + tm.assert_index_equal(result, expected) + + with tm.assert_produces_warning(): + ts = '2016-10-17 12:00:00.001501031' + pd.DatetimeIndex([ts]).round('1010ns') + + def test_repeat_range(self): + rng = date_range('1/1/2000', '1/1/2001') + + result = rng.repeat(5) + assert result.freq is None + assert len(result) == 5 * len(rng) + + for tz in self.tz: + index = pd.date_range('2001-01-01', periods=2, freq='D', tz=tz) + exp = pd.DatetimeIndex(['2001-01-01', '2001-01-01', + '2001-01-02', '2001-01-02'], tz=tz) + for res in [index.repeat(2), np.repeat(index, 2)]: + tm.assert_index_equal(res, exp) + assert res.freq is None + + index = pd.date_range('2001-01-01', periods=2, freq='2D', tz=tz) + exp = pd.DatetimeIndex(['2001-01-01', '2001-01-01', + '2001-01-03', '2001-01-03'], tz=tz) + for res in [index.repeat(2), np.repeat(index, 2)]: + tm.assert_index_equal(res, exp) + assert res.freq is None + + index = pd.DatetimeIndex(['2001-01-01', 'NaT', '2003-01-01'], + tz=tz) + exp = pd.DatetimeIndex(['2001-01-01', '2001-01-01', '2001-01-01', + 'NaT', 'NaT', 'NaT', + '2003-01-01', '2003-01-01', '2003-01-01'], + tz=tz) + for res in [index.repeat(3), np.repeat(index, 3)]: + tm.assert_index_equal(res, exp) + assert res.freq is None + + def test_repeat(self): + reps = 2 + msg = "the 'axis' parameter is not supported" + + for tz in self.tz: + rng = pd.date_range(start='2016-01-01', periods=2, + freq='30Min', tz=tz) + + expected_rng = DatetimeIndex([ + Timestamp('2016-01-01 00:00:00', tz=tz, freq='30T'), + Timestamp('2016-01-01 00:00:00', tz=tz, freq='30T'), + Timestamp('2016-01-01 00:30:00', tz=tz, freq='30T'), + Timestamp('2016-01-01 00:30:00', tz=tz, freq='30T'), + ]) + + res = rng.repeat(reps) + tm.assert_index_equal(res, expected_rng) + assert res.freq is None + + tm.assert_index_equal(np.repeat(rng, reps), expected_rng) + tm.assert_raises_regex(ValueError, msg, np.repeat, + rng, reps, axis=1) + + def test_representation(self): + + idx = [] + idx.append(DatetimeIndex([], freq='D')) + 
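# the fixture list below is intended to cover an empty index, daily
+        # indexes of several lengths, and tz-aware / NaT-containing cases;
+        # each repr collected in `exp` is then compared verbatim
+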
idx.append(DatetimeIndex(['2011-01-01'], freq='D')) + idx.append(DatetimeIndex(['2011-01-01', '2011-01-02'], freq='D')) + idx.append(DatetimeIndex( + ['2011-01-01', '2011-01-02', '2011-01-03'], freq='D')) + idx.append(DatetimeIndex( + ['2011-01-01 09:00', '2011-01-01 10:00', '2011-01-01 11:00' + ], freq='H', tz='Asia/Tokyo')) + idx.append(DatetimeIndex( + ['2011-01-01 09:00', '2011-01-01 10:00', pd.NaT], tz='US/Eastern')) + idx.append(DatetimeIndex( + ['2011-01-01 09:00', '2011-01-01 10:00', pd.NaT], tz='UTC')) + + exp = [] + exp.append("""DatetimeIndex([], dtype='datetime64[ns]', freq='D')""") + exp.append("DatetimeIndex(['2011-01-01'], dtype='datetime64[ns]', " + "freq='D')") + exp.append("DatetimeIndex(['2011-01-01', '2011-01-02'], " + "dtype='datetime64[ns]', freq='D')") + exp.append("DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], " + "dtype='datetime64[ns]', freq='D')") + exp.append("DatetimeIndex(['2011-01-01 09:00:00+09:00', " + "'2011-01-01 10:00:00+09:00', '2011-01-01 11:00:00+09:00']" + ", dtype='datetime64[ns, Asia/Tokyo]', freq='H')") + exp.append("DatetimeIndex(['2011-01-01 09:00:00-05:00', " + "'2011-01-01 10:00:00-05:00', 'NaT'], " + "dtype='datetime64[ns, US/Eastern]', freq=None)") + exp.append("DatetimeIndex(['2011-01-01 09:00:00+00:00', " + "'2011-01-01 10:00:00+00:00', 'NaT'], " + "dtype='datetime64[ns, UTC]', freq=None)""") + + with pd.option_context('display.width', 300): + for indx, expected in zip(idx, exp): + for func in ['__repr__', '__unicode__', '__str__']: + result = getattr(indx, func)() + assert result == expected + + def test_representation_to_series(self): + idx1 = DatetimeIndex([], freq='D') + idx2 = DatetimeIndex(['2011-01-01'], freq='D') + idx3 = DatetimeIndex(['2011-01-01', '2011-01-02'], freq='D') + idx4 = DatetimeIndex( + ['2011-01-01', '2011-01-02', '2011-01-03'], freq='D') + idx5 = DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', + '2011-01-01 11:00'], freq='H', tz='Asia/Tokyo') + idx6 = DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', pd.NaT], + tz='US/Eastern') + idx7 = DatetimeIndex(['2011-01-01 09:00', '2011-01-02 10:15']) + + exp1 = """Series([], dtype: datetime64[ns])""" + + exp2 = """0 2011-01-01 +dtype: datetime64[ns]""" + + exp3 = """0 2011-01-01 +1 2011-01-02 +dtype: datetime64[ns]""" + + exp4 = """0 2011-01-01 +1 2011-01-02 +2 2011-01-03 +dtype: datetime64[ns]""" + + exp5 = """0 2011-01-01 09:00:00+09:00 +1 2011-01-01 10:00:00+09:00 +2 2011-01-01 11:00:00+09:00 +dtype: datetime64[ns, Asia/Tokyo]""" + + exp6 = """0 2011-01-01 09:00:00-05:00 +1 2011-01-01 10:00:00-05:00 +2 NaT +dtype: datetime64[ns, US/Eastern]""" + + exp7 = """0 2011-01-01 09:00:00 +1 2011-01-02 10:15:00 +dtype: datetime64[ns]""" + + with pd.option_context('display.width', 300): + for idx, expected in zip([idx1, idx2, idx3, idx4, + idx5, idx6, idx7], + [exp1, exp2, exp3, exp4, + exp5, exp6, exp7]): + result = repr(Series(idx)) + assert result == expected + + def test_summary(self): + # GH9116 + idx1 = DatetimeIndex([], freq='D') + idx2 = DatetimeIndex(['2011-01-01'], freq='D') + idx3 = DatetimeIndex(['2011-01-01', '2011-01-02'], freq='D') + idx4 = DatetimeIndex( + ['2011-01-01', '2011-01-02', '2011-01-03'], freq='D') + idx5 = DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', + '2011-01-01 11:00'], + freq='H', tz='Asia/Tokyo') + idx6 = DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', pd.NaT], + tz='US/Eastern') + + exp1 = """DatetimeIndex: 0 entries +Freq: D""" + + exp2 = """DatetimeIndex: 1 entries, 2011-01-01 to 2011-01-01 +Freq: D""" + + 
exp3 = """DatetimeIndex: 2 entries, 2011-01-01 to 2011-01-02 +Freq: D""" + + exp4 = """DatetimeIndex: 3 entries, 2011-01-01 to 2011-01-03 +Freq: D""" + + exp5 = ("DatetimeIndex: 3 entries, 2011-01-01 09:00:00+09:00 " + "to 2011-01-01 11:00:00+09:00\n" + "Freq: H") + + exp6 = """DatetimeIndex: 3 entries, 2011-01-01 09:00:00-05:00 to NaT""" + + for idx, expected in zip([idx1, idx2, idx3, idx4, idx5, idx6], + [exp1, exp2, exp3, exp4, exp5, exp6]): + result = idx.summary() + assert result == expected + + def test_resolution(self): + for freq, expected in zip(['A', 'Q', 'M', 'D', 'H', 'T', + 'S', 'L', 'U'], + ['day', 'day', 'day', 'day', 'hour', + 'minute', 'second', 'millisecond', + 'microsecond']): + for tz in self.tz: + idx = pd.date_range(start='2013-04-01', periods=30, freq=freq, + tz=tz) + assert idx.resolution == expected + + def test_union(self): + for tz in self.tz: + # union + rng1 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) + other1 = pd.date_range('1/6/2000', freq='D', periods=5, tz=tz) + expected1 = pd.date_range('1/1/2000', freq='D', periods=10, tz=tz) + + rng2 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) + other2 = pd.date_range('1/4/2000', freq='D', periods=5, tz=tz) + expected2 = pd.date_range('1/1/2000', freq='D', periods=8, tz=tz) + + rng3 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) + other3 = pd.DatetimeIndex([], tz=tz) + expected3 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) + + for rng, other, expected in [(rng1, other1, expected1), + (rng2, other2, expected2), + (rng3, other3, expected3)]: + + result_union = rng.union(other) + tm.assert_index_equal(result_union, expected) + + def test_add_iadd(self): + for tz in self.tz: + + # offset + offsets = [pd.offsets.Hour(2), timedelta(hours=2), + np.timedelta64(2, 'h'), Timedelta(hours=2)] + + for delta in offsets: + rng = pd.date_range('2000-01-01', '2000-02-01', tz=tz) + result = rng + delta + expected = pd.date_range('2000-01-01 02:00', + '2000-02-01 02:00', tz=tz) + tm.assert_index_equal(result, expected) + rng += delta + tm.assert_index_equal(rng, expected) + + # int + rng = pd.date_range('2000-01-01 09:00', freq='H', periods=10, + tz=tz) + result = rng + 1 + expected = pd.date_range('2000-01-01 10:00', freq='H', periods=10, + tz=tz) + tm.assert_index_equal(result, expected) + rng += 1 + tm.assert_index_equal(rng, expected) + + idx = DatetimeIndex(['2011-01-01', '2011-01-02']) + msg = "cannot add a datelike to a DatetimeIndex" + with tm.assert_raises_regex(TypeError, msg): + idx + Timestamp('2011-01-01') + + with tm.assert_raises_regex(TypeError, msg): + Timestamp('2011-01-01') + idx + + def test_add_dti_dti(self): + # previously performed setop (deprecated in 0.16.0), now raises + # TypeError (GH14164) + + dti = date_range('20130101', periods=3) + dti_tz = date_range('20130101', periods=3).tz_localize('US/Eastern') + + with pytest.raises(TypeError): + dti + dti + + with pytest.raises(TypeError): + dti_tz + dti_tz + + with pytest.raises(TypeError): + dti_tz + dti + + with pytest.raises(TypeError): + dti + dti_tz + + def test_difference(self): + for tz in self.tz: + # diff + rng1 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) + other1 = pd.date_range('1/6/2000', freq='D', periods=5, tz=tz) + expected1 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) + + rng2 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) + other2 = pd.date_range('1/4/2000', freq='D', periods=5, tz=tz) + expected2 = pd.date_range('1/1/2000', freq='D', periods=3, tz=tz) + + rng3 = 
pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) + other3 = pd.DatetimeIndex([], tz=tz) + expected3 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) + + for rng, other, expected in [(rng1, other1, expected1), + (rng2, other2, expected2), + (rng3, other3, expected3)]: + result_diff = rng.difference(other) + tm.assert_index_equal(result_diff, expected) + + def test_sub_isub(self): + for tz in self.tz: + + # offset + offsets = [pd.offsets.Hour(2), timedelta(hours=2), + np.timedelta64(2, 'h'), Timedelta(hours=2)] + + for delta in offsets: + rng = pd.date_range('2000-01-01', '2000-02-01', tz=tz) + expected = pd.date_range('1999-12-31 22:00', + '2000-01-31 22:00', tz=tz) + + result = rng - delta + tm.assert_index_equal(result, expected) + rng -= delta + tm.assert_index_equal(rng, expected) + + # int + rng = pd.date_range('2000-01-01 09:00', freq='H', periods=10, + tz=tz) + result = rng - 1 + expected = pd.date_range('2000-01-01 08:00', freq='H', periods=10, + tz=tz) + tm.assert_index_equal(result, expected) + rng -= 1 + tm.assert_index_equal(rng, expected) + + def test_sub_dti_dti(self): + # previously performed setop (deprecated in 0.16.0), now changed to + # return subtraction -> TimeDeltaIndex (GH ...) + + dti = date_range('20130101', periods=3) + dti_tz = date_range('20130101', periods=3).tz_localize('US/Eastern') + dti_tz2 = date_range('20130101', periods=3).tz_localize('UTC') + expected = TimedeltaIndex([0, 0, 0]) + + result = dti - dti + tm.assert_index_equal(result, expected) + + result = dti_tz - dti_tz + tm.assert_index_equal(result, expected) + + with pytest.raises(TypeError): + dti_tz - dti + + with pytest.raises(TypeError): + dti - dti_tz + + with pytest.raises(TypeError): + dti_tz - dti_tz2 + + # isub + dti -= dti + tm.assert_index_equal(dti, expected) + + # different length raises ValueError + dti1 = date_range('20130101', periods=3) + dti2 = date_range('20130101', periods=4) + with pytest.raises(ValueError): + dti1 - dti2 + + # NaN propagation + dti1 = DatetimeIndex(['2012-01-01', np.nan, '2012-01-03']) + dti2 = DatetimeIndex(['2012-01-02', '2012-01-03', np.nan]) + expected = TimedeltaIndex(['1 days', np.nan, np.nan]) + result = dti2 - dti1 + tm.assert_index_equal(result, expected) + + def test_sub_period(self): + # GH 13078 + # not supported, check TypeError + p = pd.Period('2011-01-01', freq='D') + + for freq in [None, 'D']: + idx = pd.DatetimeIndex(['2011-01-01', '2011-01-02'], freq=freq) + + with pytest.raises(TypeError): + idx - p + + with pytest.raises(TypeError): + p - idx + + def test_comp_nat(self): + left = pd.DatetimeIndex([pd.Timestamp('2011-01-01'), pd.NaT, + pd.Timestamp('2011-01-03')]) + right = pd.DatetimeIndex([pd.NaT, pd.NaT, pd.Timestamp('2011-01-03')]) + + for l, r in [(left, right), (left.asobject, right.asobject)]: + result = l == r + expected = np.array([False, False, True]) + tm.assert_numpy_array_equal(result, expected) + + result = l != r + expected = np.array([True, True, False]) + tm.assert_numpy_array_equal(result, expected) + + expected = np.array([False, False, False]) + tm.assert_numpy_array_equal(l == pd.NaT, expected) + tm.assert_numpy_array_equal(pd.NaT == r, expected) + + expected = np.array([True, True, True]) + tm.assert_numpy_array_equal(l != pd.NaT, expected) + tm.assert_numpy_array_equal(pd.NaT != l, expected) + + expected = np.array([False, False, False]) + tm.assert_numpy_array_equal(l < pd.NaT, expected) + tm.assert_numpy_array_equal(pd.NaT > l, expected) + + def test_value_counts_unique(self): + # GH 7735 + for tz in 
self.tz: + idx = pd.date_range('2011-01-01 09:00', freq='H', periods=10) + # create repeated values, 'n'th element is repeated by n+1 times + idx = DatetimeIndex(np.repeat(idx.values, range(1, len(idx) + 1)), + tz=tz) + + exp_idx = pd.date_range('2011-01-01 18:00', freq='-1H', periods=10, + tz=tz) + expected = Series(range(10, 0, -1), index=exp_idx, dtype='int64') + + for obj in [idx, Series(idx)]: + tm.assert_series_equal(obj.value_counts(), expected) + + expected = pd.date_range('2011-01-01 09:00', freq='H', periods=10, + tz=tz) + tm.assert_index_equal(idx.unique(), expected) + + idx = DatetimeIndex(['2013-01-01 09:00', '2013-01-01 09:00', + '2013-01-01 09:00', '2013-01-01 08:00', + '2013-01-01 08:00', pd.NaT], tz=tz) + + exp_idx = DatetimeIndex(['2013-01-01 09:00', '2013-01-01 08:00'], + tz=tz) + expected = Series([3, 2], index=exp_idx) + + for obj in [idx, Series(idx)]: + tm.assert_series_equal(obj.value_counts(), expected) + + exp_idx = DatetimeIndex(['2013-01-01 09:00', '2013-01-01 08:00', + pd.NaT], tz=tz) + expected = Series([3, 2, 1], index=exp_idx) + + for obj in [idx, Series(idx)]: + tm.assert_series_equal(obj.value_counts(dropna=False), + expected) + + tm.assert_index_equal(idx.unique(), exp_idx) + + def test_nonunique_contains(self): + # GH 9512 + for idx in map(DatetimeIndex, + ([0, 1, 0], [0, 0, -1], [0, -1, -1], + ['2015', '2015', '2016'], ['2015', '2015', '2014'])): + assert idx[0] in idx + + def test_order(self): + # with freq + idx1 = DatetimeIndex(['2011-01-01', '2011-01-02', + '2011-01-03'], freq='D', name='idx') + idx2 = DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', + '2011-01-01 11:00'], freq='H', + tz='Asia/Tokyo', name='tzidx') + + for idx in [idx1, idx2]: + ordered = idx.sort_values() + tm.assert_index_equal(ordered, idx) + assert ordered.freq == idx.freq + + ordered = idx.sort_values(ascending=False) + expected = idx[::-1] + tm.assert_index_equal(ordered, expected) + assert ordered.freq == expected.freq + assert ordered.freq.n == -1 + + ordered, indexer = idx.sort_values(return_indexer=True) + tm.assert_index_equal(ordered, idx) + tm.assert_numpy_array_equal(indexer, np.array([0, 1, 2]), + check_dtype=False) + assert ordered.freq == idx.freq + + ordered, indexer = idx.sort_values(return_indexer=True, + ascending=False) + expected = idx[::-1] + tm.assert_index_equal(ordered, expected) + tm.assert_numpy_array_equal(indexer, + np.array([2, 1, 0]), + check_dtype=False) + assert ordered.freq == expected.freq + assert ordered.freq.n == -1 + + # without freq + for tz in self.tz: + idx1 = DatetimeIndex(['2011-01-01', '2011-01-03', '2011-01-05', + '2011-01-02', '2011-01-01'], + tz=tz, name='idx1') + exp1 = DatetimeIndex(['2011-01-01', '2011-01-01', '2011-01-02', + '2011-01-03', '2011-01-05'], + tz=tz, name='idx1') + + idx2 = DatetimeIndex(['2011-01-01', '2011-01-03', '2011-01-05', + '2011-01-02', '2011-01-01'], + tz=tz, name='idx2') + + exp2 = DatetimeIndex(['2011-01-01', '2011-01-01', '2011-01-02', + '2011-01-03', '2011-01-05'], + tz=tz, name='idx2') + + idx3 = DatetimeIndex([pd.NaT, '2011-01-03', '2011-01-05', + '2011-01-02', pd.NaT], tz=tz, name='idx3') + exp3 = DatetimeIndex([pd.NaT, pd.NaT, '2011-01-02', '2011-01-03', + '2011-01-05'], tz=tz, name='idx3') + + for idx, expected in [(idx1, exp1), (idx2, exp2), (idx3, exp3)]: + ordered = idx.sort_values() + tm.assert_index_equal(ordered, expected) + assert ordered.freq is None + + ordered = idx.sort_values(ascending=False) + tm.assert_index_equal(ordered, expected[::-1]) + assert ordered.freq is None + + 
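# return_indexer=True should also hand back the positions of the
+                # original elements; note that NaT is expected to sort first
+                # in the expectations above
+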
ordered, indexer = idx.sort_values(return_indexer=True) + tm.assert_index_equal(ordered, expected) + + exp = np.array([0, 4, 3, 1, 2]) + tm.assert_numpy_array_equal(indexer, exp, check_dtype=False) + assert ordered.freq is None + + ordered, indexer = idx.sort_values(return_indexer=True, + ascending=False) + tm.assert_index_equal(ordered, expected[::-1]) + + exp = np.array([2, 1, 3, 4, 0]) + tm.assert_numpy_array_equal(indexer, exp, check_dtype=False) + assert ordered.freq is None + + def test_getitem(self): + idx1 = pd.date_range('2011-01-01', '2011-01-31', freq='D', name='idx') + idx2 = pd.date_range('2011-01-01', '2011-01-31', freq='D', + tz='Asia/Tokyo', name='idx') + + for idx in [idx1, idx2]: + result = idx[0] + assert result == Timestamp('2011-01-01', tz=idx.tz) + + result = idx[0:5] + expected = pd.date_range('2011-01-01', '2011-01-05', freq='D', + tz=idx.tz, name='idx') + tm.assert_index_equal(result, expected) + assert result.freq == expected.freq + + result = idx[0:10:2] + expected = pd.date_range('2011-01-01', '2011-01-09', freq='2D', + tz=idx.tz, name='idx') + tm.assert_index_equal(result, expected) + assert result.freq == expected.freq + + result = idx[-20:-5:3] + expected = pd.date_range('2011-01-12', '2011-01-24', freq='3D', + tz=idx.tz, name='idx') + tm.assert_index_equal(result, expected) + assert result.freq == expected.freq + + result = idx[4::-1] + expected = DatetimeIndex(['2011-01-05', '2011-01-04', '2011-01-03', + '2011-01-02', '2011-01-01'], + freq='-1D', tz=idx.tz, name='idx') + tm.assert_index_equal(result, expected) + assert result.freq == expected.freq + + def test_drop_duplicates_metadata(self): + # GH 10115 + idx = pd.date_range('2011-01-01', '2011-01-31', freq='D', name='idx') + result = idx.drop_duplicates() + tm.assert_index_equal(idx, result) + assert idx.freq == result.freq + + idx_dup = idx.append(idx) + assert idx_dup.freq is None # freq is reset + result = idx_dup.drop_duplicates() + tm.assert_index_equal(idx, result) + assert result.freq is None + + def test_drop_duplicates(self): + # to check Index/Series compat + base = pd.date_range('2011-01-01', '2011-01-31', freq='D', name='idx') + idx = base.append(base[:5]) + + res = idx.drop_duplicates() + tm.assert_index_equal(res, base) + res = Series(idx).drop_duplicates() + tm.assert_series_equal(res, Series(base)) + + res = idx.drop_duplicates(keep='last') + exp = base[5:].append(base[:5]) + tm.assert_index_equal(res, exp) + res = Series(idx).drop_duplicates(keep='last') + tm.assert_series_equal(res, Series(exp, index=np.arange(5, 36))) + + res = idx.drop_duplicates(keep=False) + tm.assert_index_equal(res, base[5:]) + res = Series(idx).drop_duplicates(keep=False) + tm.assert_series_equal(res, Series(base[5:], index=np.arange(5, 31))) + + def test_take(self): + # GH 10295 + idx1 = pd.date_range('2011-01-01', '2011-01-31', freq='D', name='idx') + idx2 = pd.date_range('2011-01-01', '2011-01-31', freq='D', + tz='Asia/Tokyo', name='idx') + + for idx in [idx1, idx2]: + result = idx.take([0]) + assert result == Timestamp('2011-01-01', tz=idx.tz) + + result = idx.take([0, 1, 2]) + expected = pd.date_range('2011-01-01', '2011-01-03', freq='D', + tz=idx.tz, name='idx') + tm.assert_index_equal(result, expected) + assert result.freq == expected.freq + + result = idx.take([0, 2, 4]) + expected = pd.date_range('2011-01-01', '2011-01-05', freq='2D', + tz=idx.tz, name='idx') + tm.assert_index_equal(result, expected) + assert result.freq == expected.freq + + result = idx.take([7, 4, 1]) + expected = 
pd.date_range('2011-01-08', '2011-01-02', freq='-3D',
+                                      tz=idx.tz, name='idx')
+            tm.assert_index_equal(result, expected)
+            assert result.freq == expected.freq
+
+            result = idx.take([3, 2, 5])
+            expected = DatetimeIndex(['2011-01-04', '2011-01-03',
+                                      '2011-01-06'],
+                                     freq=None, tz=idx.tz, name='idx')
+            tm.assert_index_equal(result, expected)
+            assert result.freq is None
+
+            result = idx.take([-3, 2, 5])
+            expected = DatetimeIndex(['2011-01-29', '2011-01-03',
+                                      '2011-01-06'],
+                                     freq=None, tz=idx.tz, name='idx')
+            tm.assert_index_equal(result, expected)
+            assert result.freq is None
+
+    def test_take_invalid_kwargs(self):
+        idx = pd.date_range('2011-01-01', '2011-01-31', freq='D', name='idx')
+        indices = [1, 6, 5, 9, 10, 13, 15, 3]
+
+        msg = r"take\(\) got an unexpected keyword argument 'foo'"
+        tm.assert_raises_regex(TypeError, msg, idx.take,
+                               indices, foo=2)
+
+        msg = "the 'out' parameter is not supported"
+        tm.assert_raises_regex(ValueError, msg, idx.take,
+                               indices, out=indices)
+
+        msg = "the 'mode' parameter is not supported"
+        tm.assert_raises_regex(ValueError, msg, idx.take,
+                               indices, mode='clip')
+
+    def test_infer_freq(self):
+        # GH 11018
+        for freq in ['A', '2A', '-2A', 'Q', '-1Q', 'M', '-1M', 'D', '3D',
+                     '-3D', 'W', '-1W', 'H', '2H', '-2H', 'T', '2T', 'S',
+                     '-3S']:
+            idx = pd.date_range('2011-01-01 09:00:00', freq=freq, periods=10)
+            result = pd.DatetimeIndex(idx.asi8, freq='infer')
+            tm.assert_index_equal(idx, result)
+            assert result.freq == freq
+
+    def test_nat_new(self):
+        idx = pd.date_range('2011-01-01', freq='D', periods=5, name='x')
+        result = idx._nat_new()
+        exp = pd.DatetimeIndex([pd.NaT] * 5, name='x')
+        tm.assert_index_equal(result, exp)
+
+        result = idx._nat_new(box=False)
+        exp = np.array([tslib.iNaT] * 5, dtype=np.int64)
+        tm.assert_numpy_array_equal(result, exp)
+
+    def test_shift(self):
+        # GH 9903
+        for tz in self.tz:
+            idx = pd.DatetimeIndex([], name='xxx', tz=tz)
+            tm.assert_index_equal(idx.shift(0, freq='H'), idx)
+            tm.assert_index_equal(idx.shift(3, freq='H'), idx)
+
+            idx = pd.DatetimeIndex(['2011-01-01 10:00', '2011-01-01 11:00',
+                                    '2011-01-01 12:00'], name='xxx', tz=tz)
+            tm.assert_index_equal(idx.shift(0, freq='H'), idx)
+            exp = pd.DatetimeIndex(['2011-01-01 13:00', '2011-01-01 14:00',
+                                    '2011-01-01 15:00'], name='xxx', tz=tz)
+            tm.assert_index_equal(idx.shift(3, freq='H'), exp)
+            exp = pd.DatetimeIndex(['2011-01-01 07:00', '2011-01-01 08:00',
+                                    '2011-01-01 09:00'], name='xxx', tz=tz)
+            tm.assert_index_equal(idx.shift(-3, freq='H'), exp)
+
+    def test_nat(self):
+        assert pd.DatetimeIndex._na_value is pd.NaT
+        assert pd.DatetimeIndex([])._na_value is pd.NaT
+
+        for tz in [None, 'US/Eastern', 'UTC']:
+            idx = pd.DatetimeIndex(['2011-01-01', '2011-01-02'], tz=tz)
+            assert idx._can_hold_na
+
+            tm.assert_numpy_array_equal(idx._isnan, np.array([False, False]))
+            assert not idx.hasnans
+            tm.assert_numpy_array_equal(idx._nan_idxs,
+                                        np.array([], dtype=np.intp))
+
+            idx = pd.DatetimeIndex(['2011-01-01', 'NaT'], tz=tz)
+            assert idx._can_hold_na
+
+            tm.assert_numpy_array_equal(idx._isnan, np.array([False, True]))
+            assert idx.hasnans
+            tm.assert_numpy_array_equal(idx._nan_idxs,
+                                        np.array([1], dtype=np.intp))
+
+    def test_equals(self):
+        # GH 13107
+        for tz in [None, 'UTC', 'US/Eastern', 'Asia/Tokyo']:
+            idx = pd.DatetimeIndex(['2011-01-01', '2011-01-02', 'NaT'])
+            assert idx.equals(idx)
+            assert idx.equals(idx.copy())
+            assert idx.equals(idx.asobject)
+            assert idx.asobject.equals(idx)
+            assert idx.asobject.equals(idx.asobject)
+            assert not idx.equals(list(idx))
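+            # container type matters for equality: a plain list (above) or a
+            # Series (below) holding the same values should not compare equal
+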
assert not idx.equals(pd.Series(idx)) + + idx2 = pd.DatetimeIndex(['2011-01-01', '2011-01-02', 'NaT'], + tz='US/Pacific') + assert not idx.equals(idx2) + assert not idx.equals(idx2.copy()) + assert not idx.equals(idx2.asobject) + assert not idx.asobject.equals(idx2) + assert not idx.equals(list(idx2)) + assert not idx.equals(pd.Series(idx2)) + + # same internal, different tz + idx3 = pd.DatetimeIndex._simple_new(idx.asi8, tz='US/Pacific') + tm.assert_numpy_array_equal(idx.asi8, idx3.asi8) + assert not idx.equals(idx3) + assert not idx.equals(idx3.copy()) + assert not idx.equals(idx3.asobject) + assert not idx.asobject.equals(idx3) + assert not idx.equals(list(idx3)) + assert not idx.equals(pd.Series(idx3)) + + +class TestDateTimeIndexToJulianDate(object): + + def test_1700(self): + r1 = Float64Index([2345897.5, 2345898.5, 2345899.5, 2345900.5, + 2345901.5]) + r2 = date_range(start=Timestamp('1710-10-01'), periods=5, + freq='D').to_julian_date() + assert isinstance(r2, Float64Index) + tm.assert_index_equal(r1, r2) + + def test_2000(self): + r1 = Float64Index([2451601.5, 2451602.5, 2451603.5, 2451604.5, + 2451605.5]) + r2 = date_range(start=Timestamp('2000-02-27'), periods=5, + freq='D').to_julian_date() + assert isinstance(r2, Float64Index) + tm.assert_index_equal(r1, r2) + + def test_hour(self): + r1 = Float64Index( + [2451601.5, 2451601.5416666666666666, 2451601.5833333333333333, + 2451601.625, 2451601.6666666666666666]) + r2 = date_range(start=Timestamp('2000-02-27'), periods=5, + freq='H').to_julian_date() + assert isinstance(r2, Float64Index) + tm.assert_index_equal(r1, r2) + + def test_minute(self): + r1 = Float64Index( + [2451601.5, 2451601.5006944444444444, 2451601.5013888888888888, + 2451601.5020833333333333, 2451601.5027777777777777]) + r2 = date_range(start=Timestamp('2000-02-27'), periods=5, + freq='T').to_julian_date() + assert isinstance(r2, Float64Index) + tm.assert_index_equal(r1, r2) + + def test_second(self): + r1 = Float64Index( + [2451601.5, 2451601.500011574074074, 2451601.5000231481481481, + 2451601.5000347222222222, 2451601.5000462962962962]) + r2 = date_range(start=Timestamp('2000-02-27'), periods=5, + freq='S').to_julian_date() + assert isinstance(r2, Float64Index) + tm.assert_index_equal(r1, r2) + + +# GH 10699 +@pytest.mark.parametrize('klass,assert_func', zip([Series, DatetimeIndex], + [tm.assert_series_equal, + tm.assert_index_equal])) +def test_datetime64_with_DateOffset(klass, assert_func): + s = klass(date_range('2000-01-01', '2000-01-31'), name='a') + result = s + pd.DateOffset(years=1) + result2 = pd.DateOffset(years=1) + s + exp = klass(date_range('2001-01-01', '2001-01-31'), name='a') + assert_func(result, exp) + assert_func(result2, exp) + + result = s - pd.DateOffset(years=1) + exp = klass(date_range('1999-01-01', '1999-01-31'), name='a') + assert_func(result, exp) + + s = klass([Timestamp('2000-01-15 00:15:00', tz='US/Central'), + pd.Timestamp('2000-02-15', tz='US/Central')], name='a') + result = s + pd.offsets.Day() + result2 = pd.offsets.Day() + s + exp = klass([Timestamp('2000-01-16 00:15:00', tz='US/Central'), + Timestamp('2000-02-16', tz='US/Central')], name='a') + assert_func(result, exp) + assert_func(result2, exp) + + s = klass([Timestamp('2000-01-15 00:15:00', tz='US/Central'), + pd.Timestamp('2000-02-15', tz='US/Central')], name='a') + result = s + pd.offsets.MonthEnd() + result2 = pd.offsets.MonthEnd() + s + exp = klass([Timestamp('2000-01-31 00:15:00', tz='US/Central'), + Timestamp('2000-02-29', tz='US/Central')], name='a') + 
assert_func(result, exp) + assert_func(result2, exp) + + # array of offsets - valid for Series only + if klass is Series: + with tm.assert_produces_warning(PerformanceWarning): + s = klass([Timestamp('2000-1-1'), Timestamp('2000-2-1')]) + result = s + Series([pd.offsets.DateOffset(years=1), + pd.offsets.MonthEnd()]) + exp = klass([Timestamp('2001-1-1'), Timestamp('2000-2-29') + ]) + assert_func(result, exp) + + # same offset + result = s + Series([pd.offsets.DateOffset(years=1), + pd.offsets.DateOffset(years=1)]) + exp = klass([Timestamp('2001-1-1'), Timestamp('2001-2-1')]) + assert_func(result, exp) + + s = klass([Timestamp('2000-01-05 00:15:00'), + Timestamp('2000-01-31 00:23:00'), + Timestamp('2000-01-01'), + Timestamp('2000-03-31'), + Timestamp('2000-02-29'), + Timestamp('2000-12-31'), + Timestamp('2000-05-15'), + Timestamp('2001-06-15')]) + + # DateOffset relativedelta fastpath + relative_kwargs = [('years', 2), ('months', 5), ('days', 3), + ('hours', 5), ('minutes', 10), ('seconds', 2), + ('microseconds', 5)] + for i, kwd in enumerate(relative_kwargs): + op = pd.DateOffset(**dict([kwd])) + assert_func(klass([x + op for x in s]), s + op) + assert_func(klass([x - op for x in s]), s - op) + op = pd.DateOffset(**dict(relative_kwargs[:i + 1])) + assert_func(klass([x + op for x in s]), s + op) + assert_func(klass([x - op for x in s]), s - op) + + # assert these are equal on a piecewise basis + offsets = ['YearBegin', ('YearBegin', {'month': 5}), + 'YearEnd', ('YearEnd', {'month': 5}), + 'MonthBegin', 'MonthEnd', + 'SemiMonthEnd', 'SemiMonthBegin', + 'Week', ('Week', {'weekday': 3}), + 'BusinessDay', 'BDay', 'QuarterEnd', 'QuarterBegin', + 'CustomBusinessDay', 'CDay', 'CBMonthEnd', + 'CBMonthBegin', 'BMonthBegin', 'BMonthEnd', + 'BusinessHour', 'BYearBegin', 'BYearEnd', + 'BQuarterBegin', ('LastWeekOfMonth', {'weekday': 2}), + ('FY5253Quarter', {'qtr_with_extra_week': 1, + 'startingMonth': 1, + 'weekday': 2, + 'variation': 'nearest'}), + ('FY5253', {'weekday': 0, + 'startingMonth': 2, + 'variation': + 'nearest'}), + ('WeekOfMonth', {'weekday': 2, + 'week': 2}), + 'Easter', ('DateOffset', {'day': 4}), + ('DateOffset', {'month': 5})] + + with warnings.catch_warnings(record=True): + for normalize in (True, False): + for do in offsets: + if isinstance(do, tuple): + do, kwargs = do + else: + do = do + kwargs = {} + + for n in [0, 5]: + if (do in ['WeekOfMonth', 'LastWeekOfMonth', + 'FY5253Quarter', 'FY5253'] and n == 0): + continue + op = getattr(pd.offsets, do)(n, + normalize=normalize, + **kwargs) + assert_func(klass([x + op for x in s]), s + op) + assert_func(klass([x - op for x in s]), s - op) + assert_func(klass([op + x for x in s]), op + s) + + +@pytest.mark.parametrize('years,months', product([-1, 0, 1], [-2, 0, 2])) +def test_shift_months(years, months): + s = DatetimeIndex([Timestamp('2000-01-05 00:15:00'), + Timestamp('2000-01-31 00:23:00'), + Timestamp('2000-01-01'), + Timestamp('2000-02-29'), + Timestamp('2000-12-31')]) + actual = DatetimeIndex(tslib.shift_months(s.asi8, years * 12 + + months)) + expected = DatetimeIndex([x + offsets.DateOffset( + years=years, months=months) for x in s]) + tm.assert_index_equal(actual, expected) + + +class TestBusinessDatetimeIndex(object): + + def setup_method(self, method): + self.rng = bdate_range(START, END) + + def test_comparison(self): + d = self.rng[10] + + comp = self.rng > d + assert comp[11] + assert not comp[9] + + def test_pickle_unpickle(self): + unpickled = tm.round_trip_pickle(self.rng) + assert unpickled.offset is not None + + def 
test_copy(self): + cp = self.rng.copy() + repr(cp) + tm.assert_index_equal(cp, self.rng) + + def test_repr(self): + # only really care that it works + repr(self.rng) + + def test_getitem(self): + smaller = self.rng[:5] + exp = DatetimeIndex(self.rng.view(np.ndarray)[:5]) + tm.assert_index_equal(smaller, exp) + + assert smaller.offset == self.rng.offset + + sliced = self.rng[::5] + assert sliced.offset == BDay() * 5 + + fancy_indexed = self.rng[[4, 3, 2, 1, 0]] + assert len(fancy_indexed) == 5 + assert isinstance(fancy_indexed, DatetimeIndex) + assert fancy_indexed.freq is None + + # 32-bit vs. 64-bit platforms + assert self.rng[4] == self.rng[np.int_(4)] + + def test_getitem_matplotlib_hackaround(self): + values = self.rng[:, None] + expected = self.rng.values[:, None] + tm.assert_numpy_array_equal(values, expected) + + def test_shift(self): + shifted = self.rng.shift(5) + assert shifted[0] == self.rng[5] + assert shifted.offset == self.rng.offset + + shifted = self.rng.shift(-5) + assert shifted[5] == self.rng[0] + assert shifted.offset == self.rng.offset + + shifted = self.rng.shift(0) + assert shifted[0] == self.rng[0] + assert shifted.offset == self.rng.offset + + rng = date_range(START, END, freq=BMonthEnd()) + shifted = rng.shift(1, freq=BDay()) + assert shifted[0] == rng[0] + BDay() + + def test_summary(self): + self.rng.summary() + self.rng[2:2].summary() + + def test_summary_pytz(self): + tm._skip_if_no_pytz() + import pytz + bdate_range('1/1/2005', '1/1/2009', tz=pytz.utc).summary() + + def test_summary_dateutil(self): + tm._skip_if_no_dateutil() + import dateutil + bdate_range('1/1/2005', '1/1/2009', tz=dateutil.tz.tzutc()).summary() + + def test_equals(self): + assert not self.rng.equals(list(self.rng)) + + def test_identical(self): + t1 = self.rng.copy() + t2 = self.rng.copy() + assert t1.identical(t2) + + # name + t1 = t1.rename('foo') + assert t1.equals(t2) + assert not t1.identical(t2) + t2 = t2.rename('foo') + assert t1.identical(t2) + + # freq + t2v = Index(t2.values) + assert t1.equals(t2v) + assert not t1.identical(t2v) + + +class TestCustomDatetimeIndex(object): + + def setup_method(self, method): + self.rng = cdate_range(START, END) + + def test_comparison(self): + d = self.rng[10] + + comp = self.rng > d + assert comp[11] + assert not comp[9] + + def test_copy(self): + cp = self.rng.copy() + repr(cp) + tm.assert_index_equal(cp, self.rng) + + def test_repr(self): + # only really care that it works + repr(self.rng) + + def test_getitem(self): + smaller = self.rng[:5] + exp = DatetimeIndex(self.rng.view(np.ndarray)[:5]) + tm.assert_index_equal(smaller, exp) + assert smaller.offset == self.rng.offset + + sliced = self.rng[::5] + assert sliced.offset == CDay() * 5 + + fancy_indexed = self.rng[[4, 3, 2, 1, 0]] + assert len(fancy_indexed) == 5 + assert isinstance(fancy_indexed, DatetimeIndex) + assert fancy_indexed.freq is None + + # 32-bit vs. 
64-bit platforms + assert self.rng[4] == self.rng[np.int_(4)] + + def test_getitem_matplotlib_hackaround(self): + values = self.rng[:, None] + expected = self.rng.values[:, None] + tm.assert_numpy_array_equal(values, expected) + + def test_shift(self): + + shifted = self.rng.shift(5) + assert shifted[0] == self.rng[5] + assert shifted.offset == self.rng.offset + + shifted = self.rng.shift(-5) + assert shifted[5] == self.rng[0] + assert shifted.offset == self.rng.offset + + shifted = self.rng.shift(0) + assert shifted[0] == self.rng[0] + assert shifted.offset == self.rng.offset + + # PerformanceWarning + with warnings.catch_warnings(record=True): + rng = date_range(START, END, freq=BMonthEnd()) + shifted = rng.shift(1, freq=CDay()) + assert shifted[0] == rng[0] + CDay() + + def test_pickle_unpickle(self): + unpickled = tm.round_trip_pickle(self.rng) + assert unpickled.offset is not None + + def test_summary(self): + self.rng.summary() + self.rng[2:2].summary() + + def test_summary_pytz(self): + tm._skip_if_no_pytz() + import pytz + cdate_range('1/1/2005', '1/1/2009', tz=pytz.utc).summary() + + def test_summary_dateutil(self): + tm._skip_if_no_dateutil() + import dateutil + cdate_range('1/1/2005', '1/1/2009', tz=dateutil.tz.tzutc()).summary() + + def test_equals(self): + assert not self.rng.equals(list(self.rng)) diff --git a/pandas/tests/indexes/datetimes/test_partial_slicing.py b/pandas/tests/indexes/datetimes/test_partial_slicing.py new file mode 100644 index 0000000000000..e7d03aa193cbd --- /dev/null +++ b/pandas/tests/indexes/datetimes/test_partial_slicing.py @@ -0,0 +1,270 @@ +""" test partial slicing on Series/Frame """ + +import pytest + +from datetime import datetime +import numpy as np +import pandas as pd + +from pandas import (DatetimeIndex, Series, DataFrame, + date_range, Index, Timedelta, Timestamp) +from pandas.util import testing as tm + + +class TestSlicing(object): + + def test_slice_year(self): + dti = DatetimeIndex(freq='B', start=datetime(2005, 1, 1), periods=500) + + s = Series(np.arange(len(dti)), index=dti) + result = s['2005'] + expected = s[s.index.year == 2005] + tm.assert_series_equal(result, expected) + + df = DataFrame(np.random.rand(len(dti), 5), index=dti) + result = df.loc['2005'] + expected = df[df.index.year == 2005] + tm.assert_frame_equal(result, expected) + + rng = date_range('1/1/2000', '1/1/2010') + + result = rng.get_loc('2009') + expected = slice(3288, 3653) + assert result == expected + + def test_slice_quarter(self): + dti = DatetimeIndex(freq='D', start=datetime(2000, 6, 1), periods=500) + + s = Series(np.arange(len(dti)), index=dti) + assert len(s['2001Q1']) == 90 + + df = DataFrame(np.random.rand(len(dti), 5), index=dti) + assert len(df.loc['1Q01']) == 90 + + def test_slice_month(self): + dti = DatetimeIndex(freq='D', start=datetime(2005, 1, 1), periods=500) + s = Series(np.arange(len(dti)), index=dti) + assert len(s['2005-11']) == 30 + + df = DataFrame(np.random.rand(len(dti), 5), index=dti) + assert len(df.loc['2005-11']) == 30 + + tm.assert_series_equal(s['2005-11'], s['11-2005']) + + def test_partial_slice(self): + rng = DatetimeIndex(freq='D', start=datetime(2005, 1, 1), periods=500) + s = Series(np.arange(len(rng)), index=rng) + + result = s['2005-05':'2006-02'] + expected = s['20050501':'20060228'] + tm.assert_series_equal(result, expected) + + result = s['2005-05':] + expected = s['20050501':] + tm.assert_series_equal(result, expected) + + result = s[:'2006-02'] + expected = s[:'20060228'] + tm.assert_series_equal(result, expected) + + 
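# a string that matches the index resolution exactly (a full day
+        # against a daily index) is expected to behave as a scalar lookup,
+        # not a slice
+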
result = s['2005-1-1'] + assert result == s.iloc[0] + + pytest.raises(Exception, s.__getitem__, '2004-12-31') + + def test_partial_slice_daily(self): + rng = DatetimeIndex(freq='H', start=datetime(2005, 1, 31), periods=500) + s = Series(np.arange(len(rng)), index=rng) + + result = s['2005-1-31'] + tm.assert_series_equal(result, s.iloc[:24]) + + pytest.raises(Exception, s.__getitem__, '2004-12-31 00') + + def test_partial_slice_hourly(self): + rng = DatetimeIndex(freq='T', start=datetime(2005, 1, 1, 20, 0, 0), + periods=500) + s = Series(np.arange(len(rng)), index=rng) + + result = s['2005-1-1'] + tm.assert_series_equal(result, s.iloc[:60 * 4]) + + result = s['2005-1-1 20'] + tm.assert_series_equal(result, s.iloc[:60]) + + assert s['2005-1-1 20:00'] == s.iloc[0] + pytest.raises(Exception, s.__getitem__, '2004-12-31 00:15') + + def test_partial_slice_minutely(self): + rng = DatetimeIndex(freq='S', start=datetime(2005, 1, 1, 23, 59, 0), + periods=500) + s = Series(np.arange(len(rng)), index=rng) + + result = s['2005-1-1 23:59'] + tm.assert_series_equal(result, s.iloc[:60]) + + result = s['2005-1-1'] + tm.assert_series_equal(result, s.iloc[:60]) + + assert s[Timestamp('2005-1-1 23:59:00')] == s.iloc[0] + pytest.raises(Exception, s.__getitem__, '2004-12-31 00:00:00') + + def test_partial_slice_second_precision(self): + rng = DatetimeIndex(start=datetime(2005, 1, 1, 0, 0, 59, + microsecond=999990), + periods=20, freq='US') + s = Series(np.arange(20), rng) + + tm.assert_series_equal(s['2005-1-1 00:00'], s.iloc[:10]) + tm.assert_series_equal(s['2005-1-1 00:00:59'], s.iloc[:10]) + + tm.assert_series_equal(s['2005-1-1 00:01'], s.iloc[10:]) + tm.assert_series_equal(s['2005-1-1 00:01:00'], s.iloc[10:]) + + assert s[Timestamp('2005-1-1 00:00:59.999990')] == s.iloc[0] + tm.assert_raises_regex(KeyError, '2005-1-1 00:00:00', + lambda: s['2005-1-1 00:00:00']) + + def test_partial_slicing_dataframe(self): + # GH14856 + # Test various combinations of string slicing resolution vs. 
+ # index resolution + # - If string resolution is less precise than index resolution, + # string is considered a slice + # - If string resolution is equal to or more precise than index + # resolution, string is considered an exact match + formats = ['%Y', '%Y-%m', '%Y-%m-%d', '%Y-%m-%d %H', + '%Y-%m-%d %H:%M', '%Y-%m-%d %H:%M:%S'] + resolutions = ['year', 'month', 'day', 'hour', 'minute', 'second'] + for rnum, resolution in enumerate(resolutions[2:], 2): + # we check only 'day', 'hour', 'minute' and 'second' + unit = Timedelta("1 " + resolution) + middate = datetime(2012, 1, 1, 0, 0, 0) + index = DatetimeIndex([middate - unit, + middate, middate + unit]) + values = [1, 2, 3] + df = DataFrame({'a': values}, index, dtype=np.int64) + assert df.index.resolution == resolution + + # Timestamp with the same resolution as index + # Should be exact match for Series (return scalar) + # and raise KeyError for Frame + for timestamp, expected in zip(index, values): + ts_string = timestamp.strftime(formats[rnum]) + # make ts_string as precise as index + result = df['a'][ts_string] + assert isinstance(result, np.int64) + assert result == expected + pytest.raises(KeyError, df.__getitem__, ts_string) + + # Timestamp with resolution less precise than index + for fmt in formats[:rnum]: + for element, theslice in [[0, slice(None, 1)], + [1, slice(1, None)]]: + ts_string = index[element].strftime(fmt) + + # Series should return slice + result = df['a'][ts_string] + expected = df['a'][theslice] + tm.assert_series_equal(result, expected) + + # Frame should return slice as well + result = df[ts_string] + expected = df[theslice] + tm.assert_frame_equal(result, expected) + + # Timestamp with resolution more precise than index + # Compatible with existing key + # Should return scalar for Series + # and raise KeyError for Frame + for fmt in formats[rnum + 1:]: + ts_string = index[1].strftime(fmt) + result = df['a'][ts_string] + assert isinstance(result, np.int64) + assert result == 2 + pytest.raises(KeyError, df.__getitem__, ts_string) + + # Not compatible with existing key + # Should raise KeyError + for fmt, res in list(zip(formats, resolutions))[rnum + 1:]: + ts = index[1] + Timedelta("1 " + res) + ts_string = ts.strftime(fmt) + pytest.raises(KeyError, df['a'].__getitem__, ts_string) + pytest.raises(KeyError, df.__getitem__, ts_string) + + def test_partial_slicing_with_multiindex(self): + + # GH 4758 + # partial string indexing with a multi-index buggy + df = DataFrame({'ACCOUNT': ["ACCT1", "ACCT1", "ACCT1", "ACCT2"], + 'TICKER': ["ABC", "MNP", "XYZ", "XYZ"], + 'val': [1, 2, 3, 4]}, + index=date_range("2013-06-19 09:30:00", + periods=4, freq='5T')) + df_multi = df.set_index(['ACCOUNT', 'TICKER'], append=True) + + expected = DataFrame([ + [1] + ], index=Index(['ABC'], name='TICKER'), columns=['val']) + result = df_multi.loc[('2013-06-19 09:30:00', 'ACCT1')] + tm.assert_frame_equal(result, expected) + + expected = df_multi.loc[ + (pd.Timestamp('2013-06-19 09:30:00', tz=None), 'ACCT1', 'ABC')] + result = df_multi.loc[('2013-06-19 09:30:00', 'ACCT1', 'ABC')] + tm.assert_series_equal(result, expected) + + # this is a KeyError as we don't do partial string selection on + # multi-levels + def f(): + df_multi.loc[('2013-06-19', 'ACCT1', 'ABC')] + + pytest.raises(KeyError, f) + + # GH 4294 + # partial slice on a series mi + s = pd.DataFrame(np.random.rand(1000, 1000), index=pd.date_range( + '2000-1-1', periods=1000)).stack() + + s2 = s[:-1].copy() + expected = s2['2000-1-4'] + result = s2[pd.Timestamp('2000-1-4')] + 
tm.assert_series_equal(result, expected) + + result = s[pd.Timestamp('2000-1-4')] + expected = s['2000-1-4'] + tm.assert_series_equal(result, expected) + + df2 = pd.DataFrame(s) + expected = df2.xs('2000-1-4') + result = df2.loc[pd.Timestamp('2000-1-4')] + tm.assert_frame_equal(result, expected) + + def test_partial_slice_doesnt_require_monotonicity(self): + # For historical reasons. + s = pd.Series(np.arange(10), pd.date_range('2014-01-01', periods=10)) + + nonmonotonic = s[[3, 5, 4]] + expected = nonmonotonic.iloc[:0] + timestamp = pd.Timestamp('2014-01-10') + + tm.assert_series_equal(nonmonotonic['2014-01-10':], expected) + tm.assert_raises_regex(KeyError, + r"Timestamp\('2014-01-10 00:00:00'\)", + lambda: nonmonotonic[timestamp:]) + + tm.assert_series_equal(nonmonotonic.loc['2014-01-10':], expected) + tm.assert_raises_regex(KeyError, + r"Timestamp\('2014-01-10 00:00:00'\)", + lambda: nonmonotonic.loc[timestamp:]) + + def test_loc_datetime_length_one(self): + # GH16071 + df = pd.DataFrame(columns=['1'], + index=pd.date_range('2016-10-01T00:00:00', + '2016-10-01T23:59:59')) + result = df.loc[datetime(2016, 10, 1):] + tm.assert_frame_equal(result, df) + + result = df.loc['2016-10-01T00:00:00':] + tm.assert_frame_equal(result, df) diff --git a/pandas/tests/indexes/datetimes/test_setops.py b/pandas/tests/indexes/datetimes/test_setops.py new file mode 100644 index 0000000000000..f3af7dd30c27f --- /dev/null +++ b/pandas/tests/indexes/datetimes/test_setops.py @@ -0,0 +1,419 @@ +from datetime import datetime + +import numpy as np + +import pandas as pd +import pandas.util.testing as tm +from pandas.core.indexes.datetimes import cdate_range +from pandas import (DatetimeIndex, date_range, Series, bdate_range, DataFrame, + Int64Index, Index, to_datetime) +from pandas.tseries.offsets import Minute, BMonthEnd, MonthEnd + +START, END = datetime(2009, 1, 1), datetime(2010, 1, 1) + + +class TestDatetimeIndex(object): + + def test_union(self): + i1 = Int64Index(np.arange(0, 20, 2)) + i2 = Int64Index(np.arange(10, 30, 2)) + result = i1.union(i2) + expected = Int64Index(np.arange(0, 30, 2)) + tm.assert_index_equal(result, expected) + + def test_union_coverage(self): + idx = DatetimeIndex(['2000-01-03', '2000-01-01', '2000-01-02']) + ordered = DatetimeIndex(idx.sort_values(), freq='infer') + result = ordered.union(idx) + tm.assert_index_equal(result, ordered) + + result = ordered[:0].union(ordered) + tm.assert_index_equal(result, ordered) + assert result.freq == ordered.freq + + def test_union_bug_1730(self): + rng_a = date_range('1/1/2012', periods=4, freq='3H') + rng_b = date_range('1/1/2012', periods=4, freq='4H') + + result = rng_a.union(rng_b) + exp = DatetimeIndex(sorted(set(list(rng_a)) | set(list(rng_b)))) + tm.assert_index_equal(result, exp) + + def test_union_bug_1745(self): + left = DatetimeIndex(['2012-05-11 15:19:49.695000']) + right = DatetimeIndex(['2012-05-29 13:04:21.322000', + '2012-05-11 15:27:24.873000', + '2012-05-11 15:31:05.350000']) + + result = left.union(right) + exp = DatetimeIndex(sorted(set(list(left)) | set(list(right)))) + tm.assert_index_equal(result, exp) + + def test_union_bug_4564(self): + from pandas import DateOffset + left = date_range("2013-01-01", "2013-02-01") + right = left + DateOffset(minutes=15) + + result = left.union(right) + exp = DatetimeIndex(sorted(set(list(left)) | set(list(right)))) + tm.assert_index_equal(result, exp) + + def test_union_freq_both_none(self): + # GH11086 + expected = bdate_range('20150101', periods=10) + expected.freq = None + + result = 
expected.union(expected) + tm.assert_index_equal(result, expected) + assert result.freq is None + + def test_union_dataframe_index(self): + rng1 = date_range('1/1/1999', '1/1/2012', freq='MS') + s1 = Series(np.random.randn(len(rng1)), rng1) + + rng2 = date_range('1/1/1980', '12/1/2001', freq='MS') + s2 = Series(np.random.randn(len(rng2)), rng2) + df = DataFrame({'s1': s1, 's2': s2}) + + exp = pd.date_range('1/1/1980', '1/1/2012', freq='MS') + tm.assert_index_equal(df.index, exp) + + def test_union_with_DatetimeIndex(self): + i1 = Int64Index(np.arange(0, 20, 2)) + i2 = DatetimeIndex(start='2012-01-03 00:00:00', periods=10, freq='D') + i1.union(i2) # Works + i2.union(i1) # Fails with "AttributeError: can't set attribute" + + def test_intersection(self): + # GH 4690 (with tz) + for tz in [None, 'Asia/Tokyo', 'US/Eastern', 'dateutil/US/Pacific']: + base = date_range('6/1/2000', '6/30/2000', freq='D', name='idx') + + # if target has the same name, it is preserved + rng2 = date_range('5/15/2000', '6/20/2000', freq='D', name='idx') + expected2 = date_range('6/1/2000', '6/20/2000', freq='D', + name='idx') + + # if target name is different, it will be reset + rng3 = date_range('5/15/2000', '6/20/2000', freq='D', name='other') + expected3 = date_range('6/1/2000', '6/20/2000', freq='D', + name=None) + + rng4 = date_range('7/1/2000', '7/31/2000', freq='D', name='idx') + expected4 = DatetimeIndex([], name='idx') + + for (rng, expected) in [(rng2, expected2), (rng3, expected3), + (rng4, expected4)]: + result = base.intersection(rng) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == expected.freq + assert result.tz == expected.tz + + # non-monotonic + base = DatetimeIndex(['2011-01-05', '2011-01-04', + '2011-01-02', '2011-01-03'], + tz=tz, name='idx') + + rng2 = DatetimeIndex(['2011-01-04', '2011-01-02', + '2011-02-02', '2011-02-03'], + tz=tz, name='idx') + expected2 = DatetimeIndex( + ['2011-01-04', '2011-01-02'], tz=tz, name='idx') + + rng3 = DatetimeIndex(['2011-01-04', '2011-01-02', + '2011-02-02', '2011-02-03'], + tz=tz, name='other') + expected3 = DatetimeIndex( + ['2011-01-04', '2011-01-02'], tz=tz, name=None) + + # GH 7880 + rng4 = date_range('7/1/2000', '7/31/2000', freq='D', tz=tz, + name='idx') + expected4 = DatetimeIndex([], tz=tz, name='idx') + + for (rng, expected) in [(rng2, expected2), (rng3, expected3), + (rng4, expected4)]: + result = base.intersection(rng) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq is None + assert result.tz == expected.tz + + # empty same freq GH2129 + rng = date_range('6/1/2000', '6/15/2000', freq='T') + result = rng[0:0].intersection(rng) + assert len(result) == 0 + + result = rng.intersection(rng[0:0]) + assert len(result) == 0 + + def test_intersection_bug_1708(self): + from pandas import DateOffset + index_1 = date_range('1/1/2012', periods=4, freq='12H') + index_2 = index_1 + DateOffset(hours=1) + + result = index_1 & index_2 + assert len(result) == 0 + + def test_difference_freq(self): + # GH14323: difference of DatetimeIndex should not preserve frequency + + index = date_range("20160920", "20160925", freq="D") + other = date_range("20160921", "20160924", freq="D") + expected = DatetimeIndex(["20160920", "20160925"], freq=None) + idx_diff = index.difference(other) + tm.assert_index_equal(idx_diff, expected) + tm.assert_attr_equal('freq', idx_diff, expected) + + other = date_range("20160922", "20160925", freq="D") + idx_diff = index.difference(other) + 
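# the frequency should be dropped even though the expected dates
+        # happen to be contiguous
+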
expected = DatetimeIndex(["20160920", "20160921"], freq=None) + tm.assert_index_equal(idx_diff, expected) + tm.assert_attr_equal('freq', idx_diff, expected) + + def test_datetimeindex_diff(self): + dti1 = DatetimeIndex(freq='Q-JAN', start=datetime(1997, 12, 31), + periods=100) + dti2 = DatetimeIndex(freq='Q-JAN', start=datetime(1997, 12, 31), + periods=98) + assert len(dti1.difference(dti2)) == 2 + + def test_datetimeindex_union_join_empty(self): + dti = DatetimeIndex(start='1/1/2001', end='2/1/2001', freq='D') + empty = Index([]) + + result = dti.union(empty) + assert isinstance(result, DatetimeIndex) + assert result is result + + result = dti.join(empty) + assert isinstance(result, DatetimeIndex) + + def test_join_nonunique(self): + idx1 = to_datetime(['2012-11-06 16:00:11.477563', + '2012-11-06 16:00:11.477563']) + idx2 = to_datetime(['2012-11-06 15:11:09.006507', + '2012-11-06 15:11:09.006507']) + rs = idx1.join(idx2, how='outer') + assert rs.is_monotonic + + +class TestBusinessDatetimeIndex(object): + + def setup_method(self, method): + self.rng = bdate_range(START, END) + + def test_union(self): + # overlapping + left = self.rng[:10] + right = self.rng[5:10] + + the_union = left.union(right) + assert isinstance(the_union, DatetimeIndex) + + # non-overlapping, gap in middle + left = self.rng[:5] + right = self.rng[10:] + + the_union = left.union(right) + assert isinstance(the_union, Index) + + # non-overlapping, no gap + left = self.rng[:5] + right = self.rng[5:10] + + the_union = left.union(right) + assert isinstance(the_union, DatetimeIndex) + + # order does not matter + tm.assert_index_equal(right.union(left), the_union) + + # overlapping, but different offset + rng = date_range(START, END, freq=BMonthEnd()) + + the_union = self.rng.union(rng) + assert isinstance(the_union, DatetimeIndex) + + def test_outer_join(self): + # should just behave as union + + # overlapping + left = self.rng[:10] + right = self.rng[5:10] + + the_join = left.join(right, how='outer') + assert isinstance(the_join, DatetimeIndex) + + # non-overlapping, gap in middle + left = self.rng[:5] + right = self.rng[10:] + + the_join = left.join(right, how='outer') + assert isinstance(the_join, DatetimeIndex) + assert the_join.freq is None + + # non-overlapping, no gap + left = self.rng[:5] + right = self.rng[5:10] + + the_join = left.join(right, how='outer') + assert isinstance(the_join, DatetimeIndex) + + # overlapping, but different offset + rng = date_range(START, END, freq=BMonthEnd()) + + the_join = self.rng.join(rng, how='outer') + assert isinstance(the_join, DatetimeIndex) + assert the_join.freq is None + + def test_union_not_cacheable(self): + rng = date_range('1/1/2000', periods=50, freq=Minute()) + rng1 = rng[10:] + rng2 = rng[:25] + the_union = rng1.union(rng2) + tm.assert_index_equal(the_union, rng) + + rng1 = rng[10:] + rng2 = rng[15:35] + the_union = rng1.union(rng2) + expected = rng[10:] + tm.assert_index_equal(the_union, expected) + + def test_intersection(self): + rng = date_range('1/1/2000', periods=50, freq=Minute()) + rng1 = rng[10:] + rng2 = rng[:25] + the_int = rng1.intersection(rng2) + expected = rng[10:25] + tm.assert_index_equal(the_int, expected) + assert isinstance(the_int, DatetimeIndex) + assert the_int.offset == rng.offset + + the_int = rng1.intersection(rng2.view(DatetimeIndex)) + tm.assert_index_equal(the_int, expected) + + # non-overlapping + the_int = rng[:10].intersection(rng[10:]) + expected = DatetimeIndex([]) + tm.assert_index_equal(the_int, expected) + + def 
test_intersection_bug(self): + # GH #771 + a = bdate_range('11/30/2011', '12/31/2011') + b = bdate_range('12/10/2011', '12/20/2011') + result = a.intersection(b) + tm.assert_index_equal(result, b) + + def test_month_range_union_tz_pytz(self): + tm._skip_if_no_pytz() + from pytz import timezone + tz = timezone('US/Eastern') + + early_start = datetime(2011, 1, 1) + early_end = datetime(2011, 3, 1) + + late_start = datetime(2011, 3, 1) + late_end = datetime(2011, 5, 1) + + early_dr = date_range(start=early_start, end=early_end, tz=tz, + freq=MonthEnd()) + late_dr = date_range(start=late_start, end=late_end, tz=tz, + freq=MonthEnd()) + + early_dr.union(late_dr) + + def test_month_range_union_tz_dateutil(self): + tm._skip_if_windows_python_3() + tm._skip_if_no_dateutil() + from pandas._libs.tslib import _dateutil_gettz as timezone + tz = timezone('US/Eastern') + + early_start = datetime(2011, 1, 1) + early_end = datetime(2011, 3, 1) + + late_start = datetime(2011, 3, 1) + late_end = datetime(2011, 5, 1) + + early_dr = date_range(start=early_start, end=early_end, tz=tz, + freq=MonthEnd()) + late_dr = date_range(start=late_start, end=late_end, tz=tz, + freq=MonthEnd()) + + early_dr.union(late_dr) + + +class TestCustomDatetimeIndex(object): + + def setup_method(self, method): + self.rng = cdate_range(START, END) + + def test_union(self): + # overlapping + left = self.rng[:10] + right = self.rng[5:10] + + the_union = left.union(right) + assert isinstance(the_union, DatetimeIndex) + + # non-overlapping, gap in middle + left = self.rng[:5] + right = self.rng[10:] + + the_union = left.union(right) + assert isinstance(the_union, Index) + + # non-overlapping, no gap + left = self.rng[:5] + right = self.rng[5:10] + + the_union = left.union(right) + assert isinstance(the_union, DatetimeIndex) + + # order does not matter + tm.assert_index_equal(right.union(left), the_union) + + # overlapping, but different offset + rng = date_range(START, END, freq=BMonthEnd()) + + the_union = self.rng.union(rng) + assert isinstance(the_union, DatetimeIndex) + + def test_outer_join(self): + # should just behave as union + + # overlapping + left = self.rng[:10] + right = self.rng[5:10] + + the_join = left.join(right, how='outer') + assert isinstance(the_join, DatetimeIndex) + + # non-overlapping, gap in middle + left = self.rng[:5] + right = self.rng[10:] + + the_join = left.join(right, how='outer') + assert isinstance(the_join, DatetimeIndex) + assert the_join.freq is None + + # non-overlapping, no gap + left = self.rng[:5] + right = self.rng[5:10] + + the_join = left.join(right, how='outer') + assert isinstance(the_join, DatetimeIndex) + + # overlapping, but different offset + rng = date_range(START, END, freq=BMonthEnd()) + + the_join = self.rng.join(rng, how='outer') + assert isinstance(the_join, DatetimeIndex) + assert the_join.freq is None + + def test_intersection_bug(self): + # GH #771 + a = cdate_range('11/30/2011', '12/31/2011') + b = cdate_range('12/10/2011', '12/20/2011') + result = a.intersection(b) + tm.assert_index_equal(result, b) diff --git a/pandas/tests/indexes/datetimes/test_tools.py b/pandas/tests/indexes/datetimes/test_tools.py new file mode 100644 index 0000000000000..648df01be5289 --- /dev/null +++ b/pandas/tests/indexes/datetimes/test_tools.py @@ -0,0 +1,1615 @@ +""" test to_datetime """ + +import sys +import pytest +import locale +import calendar +import numpy as np +from datetime import datetime, date, time +from distutils.version import LooseVersion + +import pandas as pd +from pandas._libs 
diff --git a/pandas/tests/indexes/datetimes/test_tools.py b/pandas/tests/indexes/datetimes/test_tools.py
new file mode 100644
index 0000000000000..648df01be5289
--- /dev/null
+++ b/pandas/tests/indexes/datetimes/test_tools.py
@@ -0,0 +1,1615 @@
+""" test to_datetime """
+
+import sys
+import pytest
+import locale
+import calendar
+import numpy as np
+from datetime import datetime, date, time
+from distutils.version import LooseVersion
+
+import pandas as pd
+from pandas._libs import tslib, lib
+from pandas.core.tools import datetimes as tools
+from pandas.core.tools.datetimes import normalize_date
+from pandas.compat import lmap
+from pandas.compat.numpy import np_array_datetime64_compat
+from pandas.core.dtypes.common import is_datetime64_ns_dtype
+from pandas.util import testing as tm
+from pandas.util.testing import assert_series_equal, _skip_if_has_locale
+from pandas import (isnull, to_datetime, Timestamp, Series, DataFrame,
+                    Index, DatetimeIndex, NaT, date_range, bdate_range,
+                    compat)
+
+
+class TestTimeConversionFormats(object):
+
+    def test_to_datetime_format(self):
+        values = ['1/1/2000', '1/2/2000', '1/3/2000']
+
+        results1 = [Timestamp('20000101'), Timestamp('20000201'),
+                    Timestamp('20000301')]
+        results2 = [Timestamp('20000101'), Timestamp('20000102'),
+                    Timestamp('20000103')]
+        for vals, expecteds in [(values, (Index(results1), Index(results2))),
+                                (Series(values),
+                                 (Series(results1), Series(results2))),
+                                (values[0], (results1[0], results2[0])),
+                                (values[1], (results1[1], results2[1])),
+                                (values[2], (results1[2], results2[2]))]:
+
+            for i, fmt in enumerate(['%d/%m/%Y', '%m/%d/%Y']):
+                result = to_datetime(vals, format=fmt)
+                expected = expecteds[i]
+
+                if isinstance(expected, Series):
+                    assert_series_equal(result, Series(expected))
+                elif isinstance(expected, Timestamp):
+                    assert result == expected
+                else:
+                    tm.assert_index_equal(result, expected)
+
+    def test_to_datetime_format_YYYYMMDD(self):
+        s = Series([19801222, 19801222] + [19810105] * 5)
+        expected = Series([Timestamp(x) for x in s.apply(str)])
+
+        result = to_datetime(s, format='%Y%m%d')
+        assert_series_equal(result, expected)
+
+        result = to_datetime(s.apply(str), format='%Y%m%d')
+        assert_series_equal(result, expected)
+
+        # with NaT
+        expected = Series([Timestamp("19801222"), Timestamp("19801222")] +
+                          [Timestamp("19810105")] * 5)
+        expected[2] = np.nan
+        s[2] = np.nan
+
+        result = to_datetime(s, format='%Y%m%d')
+        assert_series_equal(result, expected)
+
+        # string with NaT
+        s = s.apply(str)
+        s[2] = 'nat'
+        result = to_datetime(s, format='%Y%m%d')
+        assert_series_equal(result, expected)
+
+        # coercion
+        # GH 7930
+        s = Series([20121231, 20141231, 99991231])
+        result = pd.to_datetime(s, format='%Y%m%d', errors='ignore')
+        expected = Series([datetime(2012, 12, 31),
+                           datetime(2014, 12, 31), datetime(9999, 12, 31)],
+                          dtype=object)
+        tm.assert_series_equal(result, expected)
+
+        result = pd.to_datetime(s, format='%Y%m%d', errors='coerce')
+        expected = Series(['20121231', '20141231', 'NaT'], dtype='M8[ns]')
+        assert_series_equal(result, expected)
+
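The GH 7930 case above encodes the three `errors=` modes when a value overflows the nanosecond-resolution `datetime64`. A minimal sketch:

```python
import pandas as pd

s = pd.Series([20121231, 20141231, 99991231])   # the last year overflows datetime64[ns]
pd.to_datetime(s, format='%Y%m%d', errors='coerce')   # out-of-bounds value becomes NaT
pd.to_datetime(s, format='%Y%m%d', errors='ignore')   # object dtype, values left unparsed
# errors='raise' (the default) would raise on the 9999 value
```
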
+    # GH 10178
+    def test_to_datetime_format_integer(self):
+        s = Series([2000, 2001, 2002])
+        expected = Series([Timestamp(x) for x in s.apply(str)])
+
+        result = to_datetime(s, format='%Y')
+        assert_series_equal(result, expected)
+
+        s = Series([200001, 200105, 200206])
+        expected = Series([Timestamp(x[:4] + '-' + x[4:]) for x in s.apply(str)
+                           ])
+
+        result = to_datetime(s, format='%Y%m')
+        assert_series_equal(result, expected)
+
+    def test_to_datetime_format_microsecond(self):
+
+        # these are locale dependent
+        lang, _ = locale.getlocale()
+        month_abbr = calendar.month_abbr[4]
+        val = '01-{}-2011 00:00:01.978'.format(month_abbr)
+
+        format = '%d-%b-%Y %H:%M:%S.%f'
+        result = to_datetime(val, format=format)
+        exp = datetime.strptime(val, format)
+        assert result == exp
+
+    def test_to_datetime_format_time(self):
+        data = [
+            ['01/10/2010 15:20', '%m/%d/%Y %H:%M',
+             Timestamp('2010-01-10 15:20')],
+            ['01/10/2010 05:43', '%m/%d/%Y %I:%M',
+             Timestamp('2010-01-10 05:43')],
+            ['01/10/2010 13:56:01', '%m/%d/%Y %H:%M:%S',
+             Timestamp('2010-01-10 13:56:01')]  # ,
+            # ['01/10/2010 08:14 PM', '%m/%d/%Y %I:%M %p',
+            #  Timestamp('2010-01-10 20:14')],
+            # ['01/10/2010 07:40 AM', '%m/%d/%Y %I:%M %p',
+            #  Timestamp('2010-01-10 07:40')],
+            # ['01/10/2010 09:12:56 AM', '%m/%d/%Y %I:%M:%S %p',
+            #  Timestamp('2010-01-10 09:12:56')]
+        ]
+        for s, format, dt in data:
+            assert to_datetime(s, format=format) == dt
+
+    def test_to_datetime_with_non_exact(self):
+        # GH 10834
+        tm._skip_if_has_locale()
+
+        # 8904
+        # exact kw
+        if sys.version_info < (2, 7):
+            pytest.skip('on python version < 2.7')
+
+        s = Series(['19MAY11', 'foobar19MAY11', '19MAY11:00:00:00',
+                    '19MAY11 00:00:00Z'])
+        result = to_datetime(s, format='%d%b%y', exact=False)
+        expected = to_datetime(s.str.extract(r'(\d+\w+\d+)', expand=False),
+                               format='%d%b%y')
+        assert_series_equal(result, expected)
+
+    def test_parse_nanoseconds_with_formula(self):
+
+        # GH8989
+        # truncating the nanoseconds when a format was provided
+        for v in ["2012-01-01 09:00:00.000000001",
+                  "2012-01-01 09:00:00.000001",
+                  "2012-01-01 09:00:00.001",
+                  "2012-01-01 09:00:00.001000",
+                  "2012-01-01 09:00:00.001000000", ]:
+            expected = pd.to_datetime(v)
+            result = pd.to_datetime(v, format="%Y-%m-%d %H:%M:%S.%f")
+            assert result == expected
+
+    def test_to_datetime_format_weeks(self):
+        data = [
+            ['2009324', '%Y%W%w', Timestamp('2009-08-13')],
+            ['2013020', '%Y%U%w', Timestamp('2013-01-13')]
+        ]
+        for s, format, dt in data:
+            assert to_datetime(s, format=format) == dt
+
+
+class TestToDatetime(object):
+
+    def test_to_datetime_dt64s(self):
+        in_bound_dts = [
+            np.datetime64('2000-01-01'),
+            np.datetime64('2000-01-02'),
+        ]
+
+        for dt in in_bound_dts:
+            assert pd.to_datetime(dt) == Timestamp(dt)
+
+        oob_dts = [np.datetime64('1000-01-01'), np.datetime64('5000-01-02'), ]
+
+        for dt in oob_dts:
+            pytest.raises(ValueError, pd.to_datetime, dt, errors='raise')
+            pytest.raises(ValueError, Timestamp, dt)
+            assert pd.to_datetime(dt, errors='coerce') is NaT
+
+    def test_to_datetime_array_of_dt64s(self):
+        dts = [np.datetime64('2000-01-01'), np.datetime64('2000-01-02'), ]
+
+        # Assuming all datetimes are in bounds, to_datetime() returns
+        # an array that is equal to Timestamp() parsing
+        tm.assert_numpy_array_equal(
+            pd.to_datetime(dts, box=False),
+            np.array([Timestamp(x).asm8 for x in dts])
+        )
+
+        # A list of datetimes where the last one is out of bounds
+        dts_with_oob = dts + [np.datetime64('9999-01-01')]
+
+        pytest.raises(ValueError, pd.to_datetime, dts_with_oob,
+                      errors='raise')
+
+        tm.assert_numpy_array_equal(
+            pd.to_datetime(dts_with_oob, box=False, errors='coerce'),
+            np.array(
+                [
+                    Timestamp(dts_with_oob[0]).asm8,
+                    Timestamp(dts_with_oob[1]).asm8,
+                    tslib.iNaT,
+                ],
+                dtype='M8'
+            )
+        )
+
+        # With errors='ignore', out of bounds datetime64s
+        # are converted to their .item(), which depending on the version of
+        # numpy is either a python datetime.datetime or datetime.date
+        tm.assert_numpy_array_equal(
+            pd.to_datetime(dts_with_oob, box=False, errors='ignore'),
+            np.array(
+                [dt.item() for dt in dts_with_oob],
+                dtype='O'
+            )
+        )
+
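The out-of-bounds cases above come from `datetime64[ns]` only spanning roughly 1677-09-21 through 2262-04-11. A minimal sketch of the behaviour the assertions encode:

```python
import numpy as np
import pandas as pd

dt = np.datetime64('1000-01-01')        # outside the nanosecond-resolution window
pd.to_datetime(dt, errors='coerce')     # NaT
# pd.to_datetime(dt, errors='raise') raises, as the tests above assert
```
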
+    def test_to_datetime_tz(self):
+
+        # xref 8260
+        # uniform returns a DatetimeIndex
+        arr = [pd.Timestamp('2013-01-01 13:00:00-0800', tz='US/Pacific'),
+               pd.Timestamp('2013-01-02 14:00:00-0800', tz='US/Pacific')]
+        result = pd.to_datetime(arr)
+        expected = DatetimeIndex(
+            ['2013-01-01 13:00:00', '2013-01-02 14:00:00'], tz='US/Pacific')
+        tm.assert_index_equal(result, expected)
+
+        # mixed tzs will raise
+        arr = [pd.Timestamp('2013-01-01 13:00:00', tz='US/Pacific'),
+               pd.Timestamp('2013-01-02 14:00:00', tz='US/Eastern')]
+        pytest.raises(ValueError, lambda: pd.to_datetime(arr))
+
+    def test_to_datetime_tz_pytz(self):
+
+        # xref 8260
+        tm._skip_if_no_pytz()
+        import pytz
+
+        us_eastern = pytz.timezone('US/Eastern')
+        arr = np.array([us_eastern.localize(datetime(year=2000, month=1, day=1,
+                                                     hour=3, minute=0)),
+                        us_eastern.localize(datetime(year=2000, month=6, day=1,
+                                                     hour=3, minute=0))],
+                       dtype=object)
+        result = pd.to_datetime(arr, utc=True)
+        expected = DatetimeIndex(['2000-01-01 08:00:00+00:00',
+                                  '2000-06-01 07:00:00+00:00'],
+                                 dtype='datetime64[ns, UTC]', freq=None)
+        tm.assert_index_equal(result, expected)
+
+    def test_to_datetime_utc_is_true(self):
+        # See gh-11934
+        start = pd.Timestamp('2014-01-01', tz='utc')
+        end = pd.Timestamp('2014-01-03', tz='utc')
+        date_range = pd.bdate_range(start, end)
+
+        result = pd.to_datetime(date_range, utc=True)
+        expected = pd.DatetimeIndex(data=date_range)
+        tm.assert_index_equal(result, expected)
+
+    def test_to_datetime_tz_psycopg2(self):
+
+        # xref 8260
+        try:
+            import psycopg2
+        except ImportError:
+            pytest.skip("no psycopg2 installed")
+
+        # misc cases
+        tz1 = psycopg2.tz.FixedOffsetTimezone(offset=-300, name=None)
+        tz2 = psycopg2.tz.FixedOffsetTimezone(offset=-240, name=None)
+        arr = np.array([datetime(2000, 1, 1, 3, 0, tzinfo=tz1),
+                        datetime(2000, 6, 1, 3, 0, tzinfo=tz2)],
+                       dtype=object)
+
+        result = pd.to_datetime(arr, errors='coerce', utc=True)
+        expected = DatetimeIndex(['2000-01-01 08:00:00+00:00',
+                                  '2000-06-01 07:00:00+00:00'],
+                                 dtype='datetime64[ns, UTC]', freq=None)
+        tm.assert_index_equal(result, expected)
+
+        # dtype coercion
+        i = pd.DatetimeIndex([
+            '2000-01-01 08:00:00+00:00'
+        ], tz=psycopg2.tz.FixedOffsetTimezone(offset=-300, name=None))
+        assert is_datetime64_ns_dtype(i)
+
+        # tz coercion
+        result = pd.to_datetime(i, errors='coerce')
+        tm.assert_index_equal(result, i)
+
+        result = pd.to_datetime(i, errors='coerce', utc=True)
+        expected = pd.DatetimeIndex(['2000-01-01 13:00:00'],
+                                    dtype='datetime64[ns, UTC]')
+        tm.assert_index_equal(result, expected)
+
+    def test_datetime_bool(self):
+        # GH13176
+        with pytest.raises(TypeError):
+            to_datetime(False)
+        assert to_datetime(False, errors="coerce") is NaT
+        assert to_datetime(False, errors="ignore") is False
+        with pytest.raises(TypeError):
+            to_datetime(True)
+        assert to_datetime(True, errors="coerce") is NaT
+        assert to_datetime(True, errors="ignore") is True
+        with pytest.raises(TypeError):
+            to_datetime([False, datetime.today()])
+        with pytest.raises(TypeError):
+            to_datetime(['20130101', True])
+        tm.assert_index_equal(to_datetime([0, False, NaT, 0.0],
+                                          errors="coerce"),
+                              DatetimeIndex([to_datetime(0), NaT,
+                                             NaT, to_datetime(0)]))
+
+    def test_datetime_invalid_datatype(self):
+        # GH13176
+
+        with pytest.raises(TypeError):
+            pd.to_datetime(bool)
+        with pytest.raises(TypeError):
+            pd.to_datetime(pd.to_datetime)
+
+
+class TestToDatetimeUnit(object):
+
+    def test_unit(self):
+        # GH 11758
+        # test proper behavior with errors
+
+        with pytest.raises(ValueError):
+            to_datetime([1], unit='D', format='%Y%m%d')
+
+        values = [11111111, 1, 1.0, tslib.iNaT, NaT, np.nan,
+                  'NaT', '']
+        result = to_datetime(values, unit='D', errors='ignore')
+        expected = Index([11111111, Timestamp('1970-01-02'),
+                          Timestamp('1970-01-02'), NaT,
+                          NaT, NaT, NaT, NaT],
+                         dtype=object)
+        tm.assert_index_equal(result, expected)
+
+        result = to_datetime(values, unit='D', errors='coerce')
+        expected = DatetimeIndex(['NaT', '1970-01-02', '1970-01-02',
+                                  'NaT', 'NaT', 'NaT', 'NaT', 'NaT'])
+        tm.assert_index_equal(result, expected)
+
+        with pytest.raises(tslib.OutOfBoundsDatetime):
+            to_datetime(values, unit='D', errors='raise')
+
+        values = [1420043460000, tslib.iNaT, NaT, np.nan, 'NaT']
+
+        result = to_datetime(values, errors='ignore', unit='s')
+        expected = Index([1420043460000, NaT, NaT,
+                          NaT, NaT], dtype=object)
+        tm.assert_index_equal(result, expected)
+
+        result = to_datetime(values, errors='coerce', unit='s')
+        expected = DatetimeIndex(['NaT', 'NaT', 'NaT', 'NaT', 'NaT'])
+        tm.assert_index_equal(result, expected)
+
+        with pytest.raises(tslib.OutOfBoundsDatetime):
+            to_datetime(values, errors='raise', unit='s')
+
+        # if we have a string, then we raise a ValueError
+        # and NOT an OutOfBoundsDatetime
+        for val in ['foo', Timestamp('20130101')]:
+            try:
+                to_datetime(val, errors='raise', unit='s')
+            except tslib.OutOfBoundsDatetime:
+                raise AssertionError("incorrect exception raised")
+            except ValueError:
+                pass
+
+    def test_unit_consistency(self):
+
+        # consistency of conversions
+        expected = Timestamp('1970-05-09 14:25:11')
+        result = pd.to_datetime(11111111, unit='s', errors='raise')
+        assert result == expected
+        assert isinstance(result, Timestamp)
+
+        result = pd.to_datetime(11111111, unit='s', errors='coerce')
+        assert result == expected
+        assert isinstance(result, Timestamp)
+
+        result = pd.to_datetime(11111111, unit='s', errors='ignore')
+        assert result == expected
+        assert isinstance(result, Timestamp)
+
+    def test_unit_with_numeric(self):
+
+        # GH 13180
+        # coercions from floats/ints are ok
+        expected = DatetimeIndex(['2015-06-19 05:33:20',
+                                  '2015-05-27 22:33:20'])
+        arr1 = [1.434692e+18, 1.432766e+18]
+        arr2 = np.array(arr1).astype('int64')
+        for errors in ['ignore', 'raise', 'coerce']:
+            result = pd.to_datetime(arr1, errors=errors)
+            tm.assert_index_equal(result, expected)
+
+            result = pd.to_datetime(arr2, errors=errors)
+            tm.assert_index_equal(result, expected)
+
+        # but we want to make sure that we are coercing
+        # if we have ints/strings
+        expected = DatetimeIndex(['NaT',
+                                  '2015-06-19 05:33:20',
+                                  '2015-05-27 22:33:20'])
+        arr = ['foo', 1.434692e+18, 1.432766e+18]
+        result = pd.to_datetime(arr, errors='coerce')
+        tm.assert_index_equal(result, expected)
+
+        expected = DatetimeIndex(['2015-06-19 05:33:20',
+                                  '2015-05-27 22:33:20',
+                                  'NaT',
+                                  'NaT'])
+        arr = [1.434692e+18, 1.432766e+18, 'foo', 'NaT']
+        result = pd.to_datetime(arr, errors='coerce')
+        tm.assert_index_equal(result, expected)
+
+    def test_unit_mixed(self):
+
+        # mixed integers/datetimes
+        expected = DatetimeIndex(['2013-01-01', 'NaT', 'NaT'])
+        arr = [pd.Timestamp('20130101'), 1.434692e+18, 1.432766e+18]
+        result = pd.to_datetime(arr, errors='coerce')
+        tm.assert_index_equal(result, expected)
+
+        with pytest.raises(ValueError):
+            pd.to_datetime(arr, errors='raise')
+
+        expected = DatetimeIndex(['NaT',
+                                  'NaT',
+                                  '2013-01-01'])
+        arr = [1.434692e+18, 1.432766e+18, pd.Timestamp('20130101')]
+        result = pd.to_datetime(arr, errors='coerce')
+        tm.assert_index_equal(result, expected)
+
+        with pytest.raises(ValueError):
+            pd.to_datetime(arr, errors='raise')
+
+    def test_dataframe(self):
+
+        df = DataFrame({'year': [2015, 2016],
+                        'month': [2, 3],
+                        'day': [4, 5],
+                        'hour': [6, 7],
+                        'minute': [58, 59],
+                        'second': [10, 11],
+                        'ms': [1, 1],
+                        'us': [2, 2],
+                        'ns': [3, 3]})
+
+        result = to_datetime({'year': df['year'],
+                              'month': df['month'],
+                              'day': df['day']})
+        expected = Series([Timestamp('20150204 00:00:00'),
+                           Timestamp('20160305 00:00:00')])
+        assert_series_equal(result, expected)
+
+        # dict-like
+        result = to_datetime(df[['year', 'month', 'day']].to_dict())
+        assert_series_equal(result, expected)
+
+        # dict but with constructible values
+        df2 = df[['year', 'month', 'day']].to_dict()
+        df2['month'] = 2
+        result = to_datetime(df2)
+        expected2 = Series([Timestamp('20150204 00:00:00'),
+                            Timestamp('20160205 00:00:00')])
+        assert_series_equal(result, expected2)
+
+        # unit mappings
+        units = [{'year': 'years',
+                  'month': 'months',
+                  'day': 'days',
+                  'hour': 'hours',
+                  'minute': 'minutes',
+                  'second': 'seconds'},
+                 {'year': 'year',
+                  'month': 'month',
+                  'day': 'day',
+                  'hour': 'hour',
+                  'minute': 'minute',
+                  'second': 'second'},
+                 ]
+
+        for d in units:
+            result = to_datetime(df[list(d.keys())].rename(columns=d))
+            expected = Series([Timestamp('20150204 06:58:10'),
+                               Timestamp('20160305 07:59:11')])
+            assert_series_equal(result, expected)
+
+        d = {'year': 'year',
+             'month': 'month',
+             'day': 'day',
+             'hour': 'hour',
+             'minute': 'minute',
+             'second': 'second',
+             'ms': 'ms',
+             'us': 'us',
+             'ns': 'ns'}
+
+        result = to_datetime(df.rename(columns=d))
+        expected = Series([Timestamp('20150204 06:58:10.001002003'),
+                           Timestamp('20160305 07:59:11.001002003')])
+        assert_series_equal(result, expected)
+
+        # coerce back to int
+        result = to_datetime(df.astype(str))
+        assert_series_equal(result, expected)
+
+        # passing coerce
+        df2 = DataFrame({'year': [2015, 2016],
+                         'month': [2, 20],
+                         'day': [4, 5]})
+        with pytest.raises(ValueError):
+            to_datetime(df2)
+        result = to_datetime(df2, errors='coerce')
+        expected = Series([Timestamp('20150204 00:00:00'),
+                           NaT])
+        assert_series_equal(result, expected)
+
+        # extra columns
+        with pytest.raises(ValueError):
+            df2 = df.copy()
+            df2['foo'] = 1
+            to_datetime(df2)
+
+        # not enough
+        for c in [['year'],
+                  ['year', 'month'],
+                  ['year', 'month', 'second'],
+                  ['month', 'day'],
+                  ['year', 'day', 'second']]:
+            with pytest.raises(ValueError):
+                to_datetime(df[c])
+
+        # duplicates
+        df2 = DataFrame({'year': [2015, 2016],
+                         'month': [2, 20],
+                         'day': [4, 5]})
+        df2.columns = ['year', 'year', 'day']
+        with pytest.raises(ValueError):
+            to_datetime(df2)
+
+        df2 = DataFrame({'year': [2015, 2016],
+                         'month': [2, 20],
+                         'day': [4, 5],
+                         'hour': [4, 5]})
+        df2.columns = ['year', 'month', 'day', 'day']
+        with pytest.raises(ValueError):
+            to_datetime(df2)
+
+    def test_dataframe_dtypes(self):
+        # #13451
+        df = DataFrame({'year': [2015, 2016],
+                        'month': [2, 3],
+                        'day': [4, 5]})
+
+        # int16
+        result = to_datetime(df.astype('int16'))
+        expected = Series([Timestamp('20150204 00:00:00'),
+                           Timestamp('20160305 00:00:00')])
+        assert_series_equal(result, expected)
+
+        # mixed dtypes
+        df['month'] = df['month'].astype('int8')
+        df['day'] = df['day'].astype('int8')
+        result = to_datetime(df)
+        expected = Series([Timestamp('20150204 00:00:00'),
+                           Timestamp('20160305 00:00:00')])
+        assert_series_equal(result, expected)
+
+        # float
+        df = DataFrame({'year': [2000, 2001],
+                        'month': [1.5, 1],
+                        'day': [1, 1]})
+        with pytest.raises(ValueError):
+            to_datetime(df)
+
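The DataFrame cases above exercise assembling datetimes from named component columns. A minimal sketch of the public behaviour:

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016], 'month': [2, 3], 'day': [4, 5]})
pd.to_datetime(df)
# 0   2015-02-04
# 1   2016-03-05
# dtype: datetime64[ns]
```
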
+
+class TestToDatetimeMisc(object):
+
+    def test_index_to_datetime(self):
+        idx = Index(['1/1/2000', '1/2/2000', '1/3/2000'])
+
+        with tm.assert_produces_warning(FutureWarning,
+                                        check_stacklevel=False):
+            result = idx.to_datetime()
+            expected = DatetimeIndex(pd.to_datetime(idx.values))
+            tm.assert_index_equal(result, expected)
+
+        with tm.assert_produces_warning(FutureWarning,
+                                        check_stacklevel=False):
+            today = datetime.today()
+            idx = Index([today], dtype=object)
+            result = idx.to_datetime()
+            expected = DatetimeIndex([today])
+            tm.assert_index_equal(result, expected)
+
+    def test_to_datetime_iso8601(self):
+        result = to_datetime(["2012-01-01 00:00:00"])
+        exp = Timestamp("2012-01-01 00:00:00")
+        assert result[0] == exp
+
+        result = to_datetime(['20121001'])  # bad iso 8601
+        exp = Timestamp('2012-10-01')
+        assert result[0] == exp
+
+    def test_to_datetime_default(self):
+        rs = to_datetime('2001')
+        xp = datetime(2001, 1, 1)
+        assert rs == xp
+
+        # dayfirst is essentially broken
+
+        # to_datetime('01-13-2012', dayfirst=True)
+        # pytest.raises(ValueError, to_datetime('01-13-2012',
+        #                                       dayfirst=True))
+
+    def test_to_datetime_on_datetime64_series(self):
+        # #2699
+        s = Series(date_range('1/1/2000', periods=10))
+
+        result = to_datetime(s)
+        assert result[0] == s[0]
+
+    def test_to_datetime_with_space_in_series(self):
+        # GH 6428
+        s = Series(['10/18/2006', '10/18/2008', ' '])
+        pytest.raises(ValueError, lambda: to_datetime(s, errors='raise'))
+        result_coerce = to_datetime(s, errors='coerce')
+        expected_coerce = Series([datetime(2006, 10, 18),
+                                  datetime(2008, 10, 18),
+                                  NaT])
+        tm.assert_series_equal(result_coerce, expected_coerce)
+        result_ignore = to_datetime(s, errors='ignore')
+        tm.assert_series_equal(result_ignore, s)
+
+    def test_to_datetime_with_apply(self):
+        # this is only locale tested with US/None locales
+        tm._skip_if_has_locale()
+
+        # GH 5195
+        # with a format and coerce a single item to_datetime fails
+        td = Series(['May 04', 'Jun 02', 'Dec 11'], index=[1, 2, 3])
+        expected = pd.to_datetime(td, format='%b %y')
+        result = td.apply(pd.to_datetime, format='%b %y')
+        assert_series_equal(result, expected)
+
+        td = pd.Series(['May 04', 'Jun 02', ''], index=[1, 2, 3])
+        pytest.raises(ValueError,
+                      lambda: pd.to_datetime(td, format='%b %y',
+                                             errors='raise'))
+        pytest.raises(ValueError,
+                      lambda: td.apply(pd.to_datetime, format='%b %y',
+                                       errors='raise'))
+        expected = pd.to_datetime(td, format='%b %y', errors='coerce')
+
+        result = td.apply(
+            lambda x: pd.to_datetime(x, format='%b %y', errors='coerce'))
+        assert_series_equal(result, expected)
+
+    def test_to_datetime_types(self):
+
+        # empty string
+        result = to_datetime('')
+        assert result is NaT
+
+        result = to_datetime(['', ''])
+        assert isnull(result).all()
+
+        # ints
+        result = Timestamp(0)
+        expected = to_datetime(0)
+        assert result == expected
+
+        # GH 3888 (strings)
+        expected = to_datetime(['2012'])[0]
+        result = to_datetime('2012')
+        assert result == expected
+
+        # array = ['2012','20120101','20120101 12:01:01']
+        array = ['20120101', '20120101 12:01:01']
+        expected = list(to_datetime(array))
+        result = lmap(Timestamp, array)
+        tm.assert_almost_equal(result, expected)
+
+        # currently fails ###
+        # result = Timestamp('2012')
+        # expected = to_datetime('2012')
+        # assert result == expected
+
+    def test_to_datetime_unprocessable_input(self):
+        # GH 4928
+        tm.assert_numpy_array_equal(
+            to_datetime([1, '1'], errors='ignore'),
+            np.array([1, '1'], dtype='O')
+        )
+        pytest.raises(TypeError, to_datetime, [1, '1'], errors='raise')
+
+    def test_to_datetime_other_datetime64_units(self):
+        # 5/25/2012
+        scalar = np.int64(1337904000000000).view('M8[us]')
+        as_obj = scalar.astype('O')
+
+        index = DatetimeIndex([scalar])
+        assert index[0] == scalar.astype('O')
+
+        value = Timestamp(scalar)
+        assert value == as_obj
+
+    def test_to_datetime_list_of_integers(self):
+        rng = date_range('1/1/2000', periods=20)
+        rng = DatetimeIndex(rng.values)
+
+        ints = list(rng.asi8)
+
+        result = DatetimeIndex(ints)
+
+        tm.assert_index_equal(rng, result)
+
+    def test_to_datetime_freq(self):
+        xp = bdate_range('2000-1-1', periods=10, tz='UTC')
+        rs = xp.to_datetime()
+        assert xp.freq == rs.freq
+        assert xp.tzinfo == rs.tzinfo
+
+    def test_string_na_nat_conversion(self):
+        # GH #999, #858
+
+        from pandas.compat import parse_date
+
+        strings = np.array(['1/1/2000', '1/2/2000', np.nan,
+                            '1/4/2000, 12:34:56'], dtype=object)
+
+        expected = np.empty(4, dtype='M8[ns]')
+        for i, val in enumerate(strings):
+            if isnull(val):
+                expected[i] = tslib.iNaT
+            else:
+                expected[i] = parse_date(val)
+
+        result = tslib.array_to_datetime(strings)
+        tm.assert_almost_equal(result, expected)
+
+        result2 = to_datetime(strings)
+        assert isinstance(result2, DatetimeIndex)
+        tm.assert_numpy_array_equal(result, result2.values)
+
+        malformed = np.array(['1/100/2000', np.nan], dtype=object)
+
+        # GH 10636, default is now 'raise'
+        pytest.raises(ValueError,
+                      lambda: to_datetime(malformed, errors='raise'))
+
+        result = to_datetime(malformed, errors='ignore')
+        tm.assert_numpy_array_equal(result, malformed)
+
+        pytest.raises(ValueError, to_datetime, malformed, errors='raise')
+
+        idx = ['a', 'b', 'c', 'd', 'e']
+        series = Series(['1/1/2000', np.nan, '1/3/2000', np.nan,
+                         '1/5/2000'], index=idx, name='foo')
+        dseries = Series([to_datetime('1/1/2000'), np.nan,
+                          to_datetime('1/3/2000'), np.nan,
+                          to_datetime('1/5/2000')], index=idx, name='foo')
+
+        result = to_datetime(series)
+        dresult = to_datetime(dseries)
+
+        expected = Series(np.empty(5, dtype='M8[ns]'), index=idx)
+        for i in range(5):
+            x = series[i]
+            if isnull(x):
+                expected[i] = tslib.iNaT
+            else:
+                expected[i] = to_datetime(x)
+
+        assert_series_equal(result, expected, check_names=False)
+        assert result.name == 'foo'
+
+        assert_series_equal(dresult, expected, check_names=False)
+        assert dresult.name == 'foo'
+
+    def test_dti_constructor_numpy_timeunits(self):
+        # GH 9114
+        base = pd.to_datetime(['2000-01-01T00:00', '2000-01-02T00:00', 'NaT'])
+
+        for dtype in ['datetime64[h]', 'datetime64[m]', 'datetime64[s]',
+                      'datetime64[ms]', 'datetime64[us]', 'datetime64[ns]']:
+            values = base.values.astype(dtype)
+
+            tm.assert_index_equal(DatetimeIndex(values), base)
+            tm.assert_index_equal(to_datetime(values), base)
+
+    def test_dayfirst(self):
+        # GH 5917
+        arr = ['10/02/2014', '11/02/2014', '12/02/2014']
+        expected = DatetimeIndex([datetime(2014, 2, 10), datetime(2014, 2, 11),
+                                  datetime(2014, 2, 12)])
+        idx1 = DatetimeIndex(arr, dayfirst=True)
+        idx2 = DatetimeIndex(np.array(arr), dayfirst=True)
+        idx3 = to_datetime(arr, dayfirst=True)
+        idx4 = to_datetime(np.array(arr), dayfirst=True)
+        idx5 = DatetimeIndex(Index(arr), dayfirst=True)
+        idx6 = DatetimeIndex(Series(arr), dayfirst=True)
+        tm.assert_index_equal(expected, idx1)
+        tm.assert_index_equal(expected, idx2)
+        tm.assert_index_equal(expected, idx3)
+        tm.assert_index_equal(expected, idx4)
+        tm.assert_index_equal(expected, idx5)
+        tm.assert_index_equal(expected, idx6)
+
+
+class TestGuessDatetimeFormat(object):
+
+    def test_guess_datetime_format_with_parseable_formats(self):
+        tm._skip_if_not_us_locale()
+        dt_string_to_format = (('20111230', '%Y%m%d'),
+                               ('2011-12-30', '%Y-%m-%d'),
+                               ('30-12-2011', '%d-%m-%Y'),
+                               ('2011-12-30 00:00:00', '%Y-%m-%d %H:%M:%S'),
+                               ('2011-12-30T00:00:00', '%Y-%m-%dT%H:%M:%S'),
+                               ('2011-12-30 00:00:00.000000',
+                                '%Y-%m-%d %H:%M:%S.%f'), )
+
+        for dt_string, dt_format in dt_string_to_format:
+            assert tools._guess_datetime_format(dt_string) == dt_format
+
+    def test_guess_datetime_format_with_dayfirst(self):
+        ambiguous_string = '01/01/2011'
+        assert tools._guess_datetime_format(
+            ambiguous_string, dayfirst=True) == '%d/%m/%Y'
+        assert tools._guess_datetime_format(
+            ambiguous_string, dayfirst=False) == '%m/%d/%Y'
+
+    def test_guess_datetime_format_with_locale_specific_formats(self):
+        # The month names will vary depending on the locale, in which
+        # case these won't be parsed properly (dateutil can't parse them)
+        tm._skip_if_has_locale()
+
+        dt_string_to_format = (('30/Dec/2011', '%d/%b/%Y'),
+                               ('30/December/2011', '%d/%B/%Y'),
+                               ('30/Dec/2011 00:00:00', '%d/%b/%Y %H:%M:%S'), )
+
+        for dt_string, dt_format in dt_string_to_format:
+            assert tools._guess_datetime_format(dt_string) == dt_format
+
+    def test_guess_datetime_format_invalid_inputs(self):
+        # A datetime string must include a year, month and a day for it
+        # to be guessable, in addition to being a string that looks like
+        # a datetime
+        invalid_dts = [
+            '2013',
+            '01/2013',
+            '12:00:00',
+            '1/1/1/1',
+            'this_is_not_a_datetime',
+            '51a',
+            9,
+            datetime(2011, 1, 1),
+        ]
+
+        for invalid_dt in invalid_dts:
+            assert tools._guess_datetime_format(invalid_dt) is None
+
+    def test_guess_datetime_format_nopadding(self):
+        # GH 11142
+        dt_string_to_format = (('2011-1-1', '%Y-%m-%d'),
+                               ('30-1-2011', '%d-%m-%Y'),
+                               ('1/1/2011', '%m/%d/%Y'),
+                               ('2011-1-1 00:00:00', '%Y-%m-%d %H:%M:%S'),
+                               ('2011-1-1 0:0:0', '%Y-%m-%d %H:%M:%S'),
+                               ('2011-1-3T00:00:0', '%Y-%m-%dT%H:%M:%S'))
+
+        for dt_string, dt_format in dt_string_to_format:
+            assert tools._guess_datetime_format(dt_string) == dt_format
+
+    def test_guess_datetime_format_for_array(self):
+        tm._skip_if_not_us_locale()
+        expected_format = '%Y-%m-%d %H:%M:%S.%f'
+        dt_string = datetime(2011, 12, 30, 0, 0, 0).strftime(expected_format)
+
+        test_arrays = [
+            np.array([dt_string, dt_string, dt_string], dtype='O'),
+            np.array([np.nan, np.nan, dt_string], dtype='O'),
+            np.array([dt_string, 'random_string'], dtype='O'),
+        ]
+
+        for test_array in test_arrays:
+            assert tools._guess_datetime_format_for_array(
+                test_array) == expected_format
+
+        format_for_string_of_nans = tools._guess_datetime_format_for_array(
+            np.array(
+                [np.nan, np.nan, np.nan], dtype='O'))
+        assert format_for_string_of_nans is None
+
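`_guess_datetime_format` is a private helper, but its result is what powers the public `infer_datetime_format` flag tested next. A minimal sketch of the public equivalence:

```python
import pandas as pd

s = pd.Series(['2011-12-30 00:00:00'] * 3)
inferred = pd.to_datetime(s, infer_datetime_format=True)
explicit = pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S')
inferred.equals(explicit)   # True for a consistently formatted column
```
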
+
+class TestToDatetimeInferFormat(object):
+
+    def test_to_datetime_infer_datetime_format_consistent_format(self):
+        s = pd.Series(pd.date_range('20000101', periods=50, freq='H'))
+
+        test_formats = ['%m-%d-%Y', '%m/%d/%Y %H:%M:%S.%f',
+                        '%Y-%m-%dT%H:%M:%S.%f']
+
+        for test_format in test_formats:
+            s_as_dt_strings = s.apply(lambda x: x.strftime(test_format))
+
+            with_format = pd.to_datetime(s_as_dt_strings, format=test_format)
+            no_infer = pd.to_datetime(s_as_dt_strings,
+                                      infer_datetime_format=False)
+            yes_infer = pd.to_datetime(s_as_dt_strings,
+                                       infer_datetime_format=True)
+
+            # Whether the format is explicitly passed, inferred, or
+            # not inferred at all, the results should all be the same
+            tm.assert_series_equal(with_format, no_infer)
+            tm.assert_series_equal(no_infer, yes_infer)
+
+    def test_to_datetime_infer_datetime_format_inconsistent_format(self):
+        s = pd.Series(np.array(['01/01/2011 00:00:00',
+                                '01-02-2011 00:00:00',
+                                '2011-01-03T00:00:00']))
+
+        # When the format is inconsistent, infer_datetime_format should just
+        # fall back to the default parsing
+        tm.assert_series_equal(pd.to_datetime(s, infer_datetime_format=False),
+                               pd.to_datetime(s, infer_datetime_format=True))
+
+        s = pd.Series(np.array(['Jan/01/2011', 'Feb/01/2011', 'Mar/01/2011']))
+
+        tm.assert_series_equal(pd.to_datetime(s, infer_datetime_format=False),
+                               pd.to_datetime(s, infer_datetime_format=True))
+
+    def test_to_datetime_infer_datetime_format_series_with_nans(self):
+        s = pd.Series(np.array(['01/01/2011 00:00:00', np.nan,
+                                '01/03/2011 00:00:00', np.nan]))
+        tm.assert_series_equal(pd.to_datetime(s, infer_datetime_format=False),
+                               pd.to_datetime(s, infer_datetime_format=True))
+
+    def test_to_datetime_infer_datetime_format_series_starting_with_nans(self):
+        s = pd.Series(np.array([np.nan, np.nan, '01/01/2011 00:00:00',
+                                '01/02/2011 00:00:00', '01/03/2011 00:00:00']))
+
+        tm.assert_series_equal(pd.to_datetime(s, infer_datetime_format=False),
+                               pd.to_datetime(s, infer_datetime_format=True))
+
+    def test_to_datetime_iso8601_noleading_0s(self):
+        # GH 11871
+        s = pd.Series(['2014-1-1', '2014-2-2', '2015-3-3'])
+        expected = pd.Series([pd.Timestamp('2014-01-01'),
+                              pd.Timestamp('2014-02-02'),
+                              pd.Timestamp('2015-03-03')])
+        tm.assert_series_equal(pd.to_datetime(s), expected)
+        tm.assert_series_equal(pd.to_datetime(s, format='%Y-%m-%d'), expected)
+
+
+class TestDaysInMonth(object):
+    # tests for issue #10154
+
+    def test_day_not_in_month_coerce(self):
+        assert isnull(to_datetime('2015-02-29', errors='coerce'))
+        assert isnull(to_datetime('2015-02-29', format="%Y-%m-%d",
+                                  errors='coerce'))
+        assert isnull(to_datetime('2015-02-32', format="%Y-%m-%d",
+                                  errors='coerce'))
+        assert isnull(to_datetime('2015-04-31', format="%Y-%m-%d",
+                                  errors='coerce'))
+
+    def test_day_not_in_month_raise(self):
+        pytest.raises(ValueError, to_datetime, '2015-02-29',
+                      errors='raise')
+        pytest.raises(ValueError, to_datetime, '2015-02-29',
+                      errors='raise', format="%Y-%m-%d")
+        pytest.raises(ValueError, to_datetime, '2015-02-32',
+                      errors='raise', format="%Y-%m-%d")
+        pytest.raises(ValueError, to_datetime, '2015-04-31',
+                      errors='raise', format="%Y-%m-%d")
+
+    def test_day_not_in_month_ignore(self):
+        assert to_datetime('2015-02-29', errors='ignore') == '2015-02-29'
+        assert to_datetime('2015-02-29', errors='ignore',
+                           format="%Y-%m-%d") == '2015-02-29'
+        assert to_datetime('2015-02-32', errors='ignore',
+                           format="%Y-%m-%d") == '2015-02-32'
+        assert to_datetime('2015-04-31', errors='ignore',
+                           format="%Y-%m-%d") == '2015-04-31'
+
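The days-in-month cases above summarize how the three `errors=` modes treat a calendar-invalid day. A minimal sketch (2015 is not a leap year):

```python
import pandas as pd

pd.to_datetime('2015-02-29', errors='coerce')   # NaT
pd.to_datetime('2015-02-29', errors='ignore')   # '2015-02-29' returned unchanged
# errors='raise' (the default) raises ValueError
```
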
+
+class TestDatetimeParsingWrappers(object):
+    def test_does_not_convert_mixed_integer(self):
+        bad_date_strings = ('-50000', '999', '123.1234', 'm', 'T')
+
+        for bad_date_string in bad_date_strings:
+            assert not tslib._does_string_look_like_datetime(bad_date_string)
+
+        good_date_strings = ('2012-01-01',
+                             '01/01/2012',
+                             'Mon Sep 16, 2013',
+                             '01012012',
+                             '0101',
+                             '1-1', )
+
+        for good_date_string in good_date_strings:
+            assert tslib._does_string_look_like_datetime(good_date_string)
+
+    def test_parsers(self):
+
+        # https://github.com/dateutil/dateutil/issues/217
+        import dateutil
+        yearfirst = dateutil.__version__ >= LooseVersion('2.5.0')
+
+        cases = {'2011-01-01': datetime(2011, 1, 1),
+                 '2Q2005': datetime(2005, 4, 1),
+                 '2Q05': datetime(2005, 4, 1),
+                 '2005Q1': datetime(2005, 1, 1),
+                 '05Q1': datetime(2005, 1, 1),
+                 '2011Q3': datetime(2011, 7, 1),
+                 '11Q3': datetime(2011, 7, 1),
+                 '3Q2011': datetime(2011, 7, 1),
+                 '3Q11': datetime(2011, 7, 1),
+
+                 # quarterly without space
+                 '2000Q4': datetime(2000, 10, 1),
+                 '00Q4': datetime(2000, 10, 1),
+                 '4Q2000': datetime(2000, 10, 1),
+                 '4Q00': datetime(2000, 10, 1),
+                 '2000q4': datetime(2000, 10, 1),
+                 '2000-Q4': datetime(2000, 10, 1),
+                 '00-Q4': datetime(2000, 10, 1),
+                 '4Q-2000': datetime(2000, 10, 1),
+                 '4Q-00': datetime(2000, 10, 1),
+                 '00q4': datetime(2000, 10, 1),
+                 '2005': datetime(2005, 1, 1),
+                 '2005-11': datetime(2005, 11, 1),
+                 '2005 11': datetime(2005, 11, 1),
+                 '11-2005': datetime(2005, 11, 1),
+                 '11 2005': datetime(2005, 11, 1),
+                 '200511': datetime(2020, 5, 11),
+                 '20051109': datetime(2005, 11, 9),
+                 '20051109 10:15': datetime(2005, 11, 9, 10, 15),
+                 '20051109 08H': datetime(2005, 11, 9, 8, 0),
+                 '2005-11-09 10:15': datetime(2005, 11, 9, 10, 15),
+                 '2005-11-09 08H': datetime(2005, 11, 9, 8, 0),
+                 '2005/11/09 10:15': datetime(2005, 11, 9, 10, 15),
+                 '2005/11/09 08H': datetime(2005, 11, 9, 8, 0),
+                 "Thu Sep 25 10:36:28 2003": datetime(2003, 9, 25, 10,
+                                                      36, 28),
+                 "Thu Sep 25 2003": datetime(2003, 9, 25),
+                 "Sep 25 2003": datetime(2003, 9, 25),
+                 "January 1 2014": datetime(2014, 1, 1),
+
+                 # GH 10537
+                 '2014-06': datetime(2014, 6, 1),
+                 '06-2014': datetime(2014, 6, 1),
+                 '2014-6': datetime(2014, 6, 1),
+                 '6-2014': datetime(2014, 6, 1),
+
+                 '20010101 12': datetime(2001, 1, 1, 12),
+                 '20010101 1234': datetime(2001, 1, 1, 12, 34),
+                 '20010101 123456': datetime(2001, 1, 1, 12, 34, 56),
+                 }
+
+        for date_str, expected in compat.iteritems(cases):
+            result1, _, _ = tools.parse_time_string(date_str,
+                                                    yearfirst=yearfirst)
+            result2 = to_datetime(date_str, yearfirst=yearfirst)
+            result3 = to_datetime([date_str], yearfirst=yearfirst)
+            # result5 is used below
+            result4 = to_datetime(np.array([date_str], dtype=object),
+                                  yearfirst=yearfirst)
+            result6 = DatetimeIndex([date_str], yearfirst=yearfirst)
+            # result7 is used below
+            result8 = DatetimeIndex(Index([date_str]), yearfirst=yearfirst)
+            result9 = DatetimeIndex(Series([date_str]), yearfirst=yearfirst)
+
+            for res in [result1, result2]:
+                assert res == expected
+            for res in [result3, result4, result6, result8, result9]:
+                exp = DatetimeIndex([pd.Timestamp(expected)])
+                tm.assert_index_equal(res, exp)
+
+            # these really need to have yearfirst, but we don't support it
+            if not yearfirst:
+                result5 = Timestamp(date_str)
+                assert result5 == expected
+                result7 = date_range(date_str, freq='S', periods=1,
+                                     yearfirst=yearfirst)
+                assert result7 == expected
+
+        # NaT
+        result1, _, _ = tools.parse_time_string('NaT')
+        result2 = to_datetime('NaT')
+        result3 = Timestamp('NaT')
+        result4 = DatetimeIndex(['NaT'])[0]
+        assert result1 is tslib.NaT
+        assert result2 is tslib.NaT
+        assert result3 is tslib.NaT
+        assert result4 is tslib.NaT
+
+    def test_parsers_quarter_invalid(self):
+
+        cases = ['2Q 2005', '2Q-200A', '2Q-200', '22Q2005', '6Q-20', '2Q200.']
+        for case in cases:
+            pytest.raises(ValueError, tools.parse_time_string, case)
+
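The next test documents every `dayfirst`/`yearfirst` combination for an ambiguous string. A minimal public-API sketch of the same table (behaviour assumes dateutil >= 2.5.2, per the version notes below):

```python
import pandas as pd

pd.to_datetime('10-11-12')                                  # 2012-10-11 (month first)
pd.to_datetime('10-11-12', dayfirst=True)                   # 2012-11-10
pd.to_datetime('10-11-12', yearfirst=True)                  # 2010-11-12
pd.to_datetime('10-11-12', dayfirst=True, yearfirst=True)   # 2010-12-11
```
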
+    def test_parsers_dayfirst_yearfirst(self):
+        tm._skip_if_no_dateutil()
+
+        # OK
+        # 2.5.1 10-11-12   [dayfirst=0, yearfirst=0] -> 2012-10-11 00:00:00
+        # 2.5.2 10-11-12   [dayfirst=0, yearfirst=0] -> 2012-10-11 00:00:00
+        # 2.5.3 10-11-12   [dayfirst=0, yearfirst=0] -> 2012-10-11 00:00:00
+
+        # OK
+        # 2.5.1 10-11-12   [dayfirst=0, yearfirst=1] -> 2010-11-12 00:00:00
+        # 2.5.2 10-11-12   [dayfirst=0, yearfirst=1] -> 2010-11-12 00:00:00
+        # 2.5.3 10-11-12   [dayfirst=0, yearfirst=1] -> 2010-11-12 00:00:00
+
+        # bug fix in 2.5.2
+        # 2.5.1 10-11-12   [dayfirst=1, yearfirst=1] -> 2010-11-12 00:00:00
+        # 2.5.2 10-11-12   [dayfirst=1, yearfirst=1] -> 2010-12-11 00:00:00
+        # 2.5.3 10-11-12   [dayfirst=1, yearfirst=1] -> 2010-12-11 00:00:00
+
+        # OK
+        # 2.5.1 10-11-12   [dayfirst=1, yearfirst=0] -> 2012-11-10 00:00:00
+        # 2.5.2 10-11-12   [dayfirst=1, yearfirst=0] -> 2012-11-10 00:00:00
+        # 2.5.3 10-11-12   [dayfirst=1, yearfirst=0] -> 2012-11-10 00:00:00
+
+        # OK
+        # 2.5.1 20/12/21   [dayfirst=0, yearfirst=0] -> 2021-12-20 00:00:00
+        # 2.5.2 20/12/21   [dayfirst=0, yearfirst=0] -> 2021-12-20 00:00:00
+        # 2.5.3 20/12/21   [dayfirst=0, yearfirst=0] -> 2021-12-20 00:00:00
+
+        # OK
+        # 2.5.1 20/12/21   [dayfirst=0, yearfirst=1] -> 2020-12-21 00:00:00
+        # 2.5.2 20/12/21   [dayfirst=0, yearfirst=1] -> 2020-12-21 00:00:00
+        # 2.5.3 20/12/21   [dayfirst=0, yearfirst=1] -> 2020-12-21 00:00:00
+
+        # revert of bug in 2.5.2
+        # 2.5.1 20/12/21   [dayfirst=1, yearfirst=1] -> 2020-12-21 00:00:00
+        # 2.5.2 20/12/21   [dayfirst=1, yearfirst=1] -> month must be in 1..12
+        # 2.5.3 20/12/21   [dayfirst=1, yearfirst=1] -> 2020-12-21 00:00:00
+
+        # OK
+        # 2.5.1 20/12/21   [dayfirst=1, yearfirst=0] -> 2021-12-20 00:00:00
+        # 2.5.2 20/12/21   [dayfirst=1, yearfirst=0] -> 2021-12-20 00:00:00
+        # 2.5.3 20/12/21   [dayfirst=1, yearfirst=0] -> 2021-12-20 00:00:00
+
+        import dateutil
+        is_lt_253 = dateutil.__version__ < LooseVersion('2.5.3')
+
+        # str : dayfirst, yearfirst, expected
+        cases = {'10-11-12': [(False, False,
+                               datetime(2012, 10, 11)),
+                              (True, False,
+                               datetime(2012, 11, 10)),
+                              (False, True,
+                               datetime(2010, 11, 12)),
+                              (True, True,
+                               datetime(2010, 12, 11))],
+                 '20/12/21': [(False, False,
+                               datetime(2021, 12, 20)),
+                              (True, False,
+                               datetime(2021, 12, 20)),
+                              (False, True,
+                               datetime(2020, 12, 21)),
+                              (True, True,
+                               datetime(2020, 12, 21))]}
+
+        from dateutil.parser import parse
+        for date_str, values in compat.iteritems(cases):
+            for dayfirst, yearfirst, expected in values:
+
+                # odd comparisons across versions
+                # let's just skip
+                if dayfirst and yearfirst and is_lt_253:
+                    continue
+
+                # compare with dateutil result
+                dateutil_result = parse(date_str, dayfirst=dayfirst,
+                                        yearfirst=yearfirst)
+                assert dateutil_result == expected
+
+                result1, _, _ = tools.parse_time_string(date_str,
+                                                        dayfirst=dayfirst,
+                                                        yearfirst=yearfirst)
+
+                # we don't support dayfirst/yearfirst here:
+                if not dayfirst and not yearfirst:
+                    result2 = Timestamp(date_str)
+                    assert result2 == expected
+
+                result3 = to_datetime(date_str, dayfirst=dayfirst,
+                                      yearfirst=yearfirst)
+
+                result4 = DatetimeIndex([date_str], dayfirst=dayfirst,
+                                        yearfirst=yearfirst)[0]
+
+                assert result1 == expected
+                assert result3 == expected
+                assert result4 == expected
+
+    def test_parsers_timestring(self):
+        tm._skip_if_no_dateutil()
+        from dateutil.parser import parse
+
+        # must be the same as dateutil result
+        cases = {'10:15': (parse('10:15'), datetime(1, 1, 1, 10, 15)),
+                 '9:05': (parse('9:05'), datetime(1, 1, 1, 9, 5))}
+
+        for date_str, (exp_now, exp_def) in compat.iteritems(cases):
+            result1, _, _ = tools.parse_time_string(date_str)
+            result2 = to_datetime(date_str)
+            result3 = to_datetime([date_str])
+            result4 = Timestamp(date_str)
+            result5 = DatetimeIndex([date_str])[0]
+            # parse_time_string() anchors a bare time to the default date;
+            # the others anchor to the current date, and that can't be
+            # changed because it is relied on by time-series plotting
+            assert result1 == exp_def
+            assert result2 == exp_now
+            assert result3 == exp_now
+            assert result4 == exp_now
+            assert result5 == exp_now
+
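As the assertions above encode, a bare time string parses relative to the current day through the public entry points. A minimal sketch:

```python
import pandas as pd

# anchored to today's date, mirroring dateutil's behaviour
pd.to_datetime('10:15')   # Timestamp('<today> 10:15:00')
pd.Timestamp('10:15')     # same anchoring
```
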
"20:20"] + expected_arr = [time(14, 15), time(20, 20)] + assert tools.to_time(arg) == expected_arr + assert tools.to_time(arg, format="%H:%M") == expected_arr + assert tools.to_time(arg, infer_time_format=True) == expected_arr + assert tools.to_time(arg, format="%I:%M%p", + errors="coerce") == [None, None] + + res = tools.to_time(arg, format="%I:%M%p", errors="ignore") + tm.assert_numpy_array_equal(res, np.array(arg, dtype=np.object_)) + + with pytest.raises(ValueError): + tools.to_time(arg, format="%I:%M%p", errors="raise") + + tm.assert_series_equal(tools.to_time(Series(arg, name="test")), + Series(expected_arr, name="test")) + + res = tools.to_time(np.array(arg)) + assert isinstance(res, list) + assert res == expected_arr + + def test_parsers_monthfreq(self): + cases = {'201101': datetime(2011, 1, 1, 0, 0), + '200005': datetime(2000, 5, 1, 0, 0)} + + for date_str, expected in compat.iteritems(cases): + result1, _, _ = tools.parse_time_string(date_str, freq='M') + assert result1 == expected + + def test_parsers_quarterly_with_freq(self): + msg = ('Incorrect quarterly string is given, quarter ' + 'must be between 1 and 4: 2013Q5') + with tm.assert_raises_regex(tslib.DateParseError, msg): + tools.parse_time_string('2013Q5') + + # GH 5418 + msg = ('Unable to retrieve month information from given freq: ' + 'INVLD-L-DEC-SAT') + with tm.assert_raises_regex(tslib.DateParseError, msg): + tools.parse_time_string('2013Q1', freq='INVLD-L-DEC-SAT') + + cases = {('2013Q2', None): datetime(2013, 4, 1), + ('2013Q2', 'A-APR'): datetime(2012, 8, 1), + ('2013-Q2', 'A-DEC'): datetime(2013, 4, 1)} + + for (date_str, freq), exp in compat.iteritems(cases): + result, _, _ = tools.parse_time_string(date_str, freq=freq) + assert result == exp + + def test_parsers_timezone_minute_offsets_roundtrip(self): + # GH11708 + base = to_datetime("2013-01-01 00:00:00") + dt_strings = [ + ('2013-01-01 05:45+0545', + "Asia/Katmandu", + "Timestamp('2013-01-01 05:45:00+0545', tz='Asia/Katmandu')"), + ('2013-01-01 05:30+0530', + "Asia/Kolkata", + "Timestamp('2013-01-01 05:30:00+0530', tz='Asia/Kolkata')") + ] + + for dt_string, tz, dt_string_repr in dt_strings: + dt_time = to_datetime(dt_string) + assert base == dt_time + converted_time = dt_time.tz_localize('UTC').tz_convert(tz) + assert dt_string_repr == repr(converted_time) + + def test_parsers_iso8601(self): + # GH 12060 + # test only the iso parser - flexibility to different + # separators and leadings 0s + # Timestamp construction falls back to dateutil + cases = {'2011-01-02': datetime(2011, 1, 2), + '2011-1-2': datetime(2011, 1, 2), + '2011-01': datetime(2011, 1, 1), + '2011-1': datetime(2011, 1, 1), + '2011 01 02': datetime(2011, 1, 2), + '2011.01.02': datetime(2011, 1, 2), + '2011/01/02': datetime(2011, 1, 2), + '2011\\01\\02': datetime(2011, 1, 2), + '2013-01-01 05:30:00': datetime(2013, 1, 1, 5, 30), + '2013-1-1 5:30:00': datetime(2013, 1, 1, 5, 30)} + for date_str, exp in compat.iteritems(cases): + actual = tslib._test_parse_iso8601(date_str) + assert actual == exp + + # seperators must all match - YYYYMM not valid + invalid_cases = ['2011-01/02', '2011^11^11', + '201401', '201111', '200101', + # mixed separated and unseparated + '2005-0101', '200501-01', + '20010101 12:3456', '20010101 1234:56', + # HHMMSS must have two digits in each component + # if unseparated + '20010101 1', '20010101 123', '20010101 12345', + '20010101 12345Z', + # wrong separator for HHMMSS + '2001-01-01 12-34-56'] + for date_str in invalid_cases: + with pytest.raises(ValueError): + 
+
+class TestArrayToDatetime(object):
+
+    def test_try_parse_dates(self):
+        from dateutil.parser import parse
+        arr = np.array(['5/1/2000', '6/1/2000', '7/1/2000'], dtype=object)
+
+        result = lib.try_parse_dates(arr, dayfirst=True)
+        expected = [parse(d, dayfirst=True) for d in arr]
+        assert np.array_equal(result, expected)
+
+    def test_parsing_valid_dates(self):
+        arr = np.array(['01-01-2013', '01-02-2013'], dtype=object)
+        tm.assert_numpy_array_equal(
+            tslib.array_to_datetime(arr),
+            np_array_datetime64_compat(
+                [
+                    '2013-01-01T00:00:00.000000000-0000',
+                    '2013-01-02T00:00:00.000000000-0000'
+                ],
+                dtype='M8[ns]'
+            )
+        )
+
+        arr = np.array(['Mon Sep 16 2013', 'Tue Sep 17 2013'], dtype=object)
+        tm.assert_numpy_array_equal(
+            tslib.array_to_datetime(arr),
+            np_array_datetime64_compat(
+                [
+                    '2013-09-16T00:00:00.000000000-0000',
+                    '2013-09-17T00:00:00.000000000-0000'
+                ],
+                dtype='M8[ns]'
+            )
+        )
+
+    def test_parsing_timezone_offsets(self):
+        # All of these datetime strings with offsets are equivalent
+        # to the same datetime after the timezone offset is added
+        dt_strings = [
+            '01-01-2013 08:00:00+08:00',
+            '2013-01-01T08:00:00.000000000+0800',
+            '2012-12-31T16:00:00.000000000-0800',
+            '12-31-2012 23:00:00-01:00'
+        ]
+
+        expected_output = tslib.array_to_datetime(np.array(
+            ['01-01-2013 00:00:00'], dtype=object))
+
+        for dt_string in dt_strings:
+            tm.assert_numpy_array_equal(
+                tslib.array_to_datetime(
+                    np.array([dt_string], dtype=object)
+                ),
+                expected_output
+            )
+
+    def test_number_looking_strings_not_into_datetime(self):
+        # #4601
+        # These strings don't look like datetimes so they shouldn't be
+        # attempted to be converted
+        arr = np.array(['-352.737091', '183.575577'], dtype=object)
+        tm.assert_numpy_array_equal(
+            tslib.array_to_datetime(arr, errors='ignore'), arr)
+
+        arr = np.array(['1', '2', '3', '4', '5'], dtype=object)
+        tm.assert_numpy_array_equal(
+            tslib.array_to_datetime(arr, errors='ignore'), arr)
+
+    def test_coercing_dates_outside_of_datetime64_ns_bounds(self):
+        invalid_dates = [
+            date(1000, 1, 1),
+            datetime(1000, 1, 1),
+            '1000-01-01',
+            'Jan 1, 1000',
+            np.datetime64('1000-01-01'),
+        ]
+
+        for invalid_date in invalid_dates:
+            pytest.raises(ValueError,
+                          tslib.array_to_datetime,
+                          np.array([invalid_date], dtype='object'),
+                          errors='raise', )
+            tm.assert_numpy_array_equal(
+                tslib.array_to_datetime(
+                    np.array([invalid_date], dtype='object'),
+                    errors='coerce'),
+                np.array([tslib.iNaT], dtype='M8[ns]')
+            )
+
+        arr = np.array(['1/1/1000', '1/1/2000'], dtype=object)
+        tm.assert_numpy_array_equal(
+            tslib.array_to_datetime(arr, errors='coerce'),
+            np_array_datetime64_compat(
+                [
+                    tslib.iNaT,
+                    '2000-01-01T00:00:00.000000000-0000'
+                ],
+                dtype='M8[ns]'
+            )
+        )
+
+    def test_coerce_of_invalid_datetimes(self):
+        arr = np.array(['01-01-2013', 'not_a_date', '1'], dtype=object)
+
+        # Without coercing, the presence of any invalid dates prevents
+        # any values from being converted
+        tm.assert_numpy_array_equal(
+            tslib.array_to_datetime(arr, errors='ignore'), arr)
+
+        # With coercing, the invalid dates become iNaT
+        tm.assert_numpy_array_equal(
+            tslib.array_to_datetime(arr, errors='coerce'),
+            np_array_datetime64_compat(
+                [
+                    '2013-01-01T00:00:00.000000000-0000',
+                    tslib.iNaT,
+                    tslib.iNaT
+                ],
+                dtype='M8[ns]'
+            )
+        )
+
+
+def test_normalize_date():
+    value = date(2012, 9, 7)
+
+    result = normalize_date(value)
+    assert (result == datetime(2012,
9, 7)) + + value = datetime(2012, 9, 7, 12) + + result = normalize_date(value) + assert (result == datetime(2012, 9, 7)) + + +@pytest.fixture(params=['D', 's', 'ms', 'us', 'ns']) +def units(request): + return request.param + + +@pytest.fixture +def epoch_1960(): + # for origin as 1960-01-01 + return Timestamp('1960-01-01') + + +@pytest.fixture +def units_from_epochs(): + return list(range(5)) + + +@pytest.fixture(params=[epoch_1960(), + epoch_1960().to_pydatetime(), + epoch_1960().to_datetime64(), + str(epoch_1960())]) +def epochs(request): + return request.param + + +@pytest.fixture +def julian_dates(): + return pd.date_range('2014-1-1', periods=10).to_julian_date().values + + +class TestOrigin(object): + + def test_to_basic(self, julian_dates): + # gh-11276, gh-11745 + # for origin as julian + + result = Series(pd.to_datetime( + julian_dates, unit='D', origin='julian')) + expected = Series(pd.to_datetime( + julian_dates - pd.Timestamp(0).to_julian_date(), unit='D')) + assert_series_equal(result, expected) + + result = Series(pd.to_datetime( + [0, 1, 2], unit='D', origin='unix')) + expected = Series([Timestamp('1970-01-01'), + Timestamp('1970-01-02'), + Timestamp('1970-01-03')]) + assert_series_equal(result, expected) + + # default + result = Series(pd.to_datetime( + [0, 1, 2], unit='D')) + expected = Series([Timestamp('1970-01-01'), + Timestamp('1970-01-02'), + Timestamp('1970-01-03')]) + assert_series_equal(result, expected) + + def test_julian_round_trip(self): + result = pd.to_datetime(2456658, origin='julian', unit='D') + assert result.to_julian_date() == 2456658 + + # out-of-bounds + with pytest.raises(ValueError): + pd.to_datetime(1, origin="julian", unit='D') + + def test_invalid_unit(self, units, julian_dates): + + # checking for invalid combination of origin='julian' and unit != D + if units != 'D': + with pytest.raises(ValueError): + pd.to_datetime(julian_dates, unit=units, origin='julian') + + def test_invalid_origin(self): + + # need to have a numeric specified + with pytest.raises(ValueError): + pd.to_datetime("2005-01-01", origin="1960-01-01") + + with pytest.raises(ValueError): + pd.to_datetime("2005-01-01", origin="1960-01-01", unit='D') + + def test_epoch(self, units, epochs, epoch_1960, units_from_epochs): + + expected = Series( + [pd.Timedelta(x, unit=units) + + epoch_1960 for x in units_from_epochs]) + + result = Series(pd.to_datetime( + units_from_epochs, unit=units, origin=epochs)) + assert_series_equal(result, expected) + + @pytest.mark.parametrize("origin, exc", + [('random_string', ValueError), + ('epoch', ValueError), + ('13-24-1990', ValueError), + (datetime(1, 1, 1), tslib.OutOfBoundsDatetime)]) + def test_invalid_origins(self, origin, exc, units, units_from_epochs): + + with pytest.raises(exc): + pd.to_datetime(units_from_epochs, unit=units, + origin=origin) + + def test_processing_order(self): + # make sure we handle out-of-bounds *before* + # constructing the dates + + result = pd.to_datetime(200 * 365, unit='D') + expected = Timestamp('2169-11-13 00:00:00') + assert result == expected + + result = pd.to_datetime(200 * 365, unit='D', origin='1870-01-01') + expected = Timestamp('2069-11-13 00:00:00') + assert result == expected + + result = pd.to_datetime(300 * 365, unit='D', origin='1870-01-01') + expected = Timestamp('2169-10-20 00:00:00') + assert result == expected diff --git a/pandas/tseries/tests/__init__.py b/pandas/tests/indexes/period/__init__.py similarity index 100% rename from pandas/tseries/tests/__init__.py rename to 
pandas/tests/indexes/period/__init__.py diff --git a/pandas/tests/indexes/period/test_asfreq.py b/pandas/tests/indexes/period/test_asfreq.py new file mode 100644 index 0000000000000..c8724b2a3bc91 --- /dev/null +++ b/pandas/tests/indexes/period/test_asfreq.py @@ -0,0 +1,155 @@ +import pytest + +import numpy as np +import pandas as pd +from pandas.util import testing as tm +from pandas import PeriodIndex, Series, DataFrame + + +class TestPeriodIndex(object): + + def setup_method(self, method): + pass + + def test_asfreq(self): + pi1 = PeriodIndex(freq='A', start='1/1/2001', end='1/1/2001') + pi2 = PeriodIndex(freq='Q', start='1/1/2001', end='1/1/2001') + pi3 = PeriodIndex(freq='M', start='1/1/2001', end='1/1/2001') + pi4 = PeriodIndex(freq='D', start='1/1/2001', end='1/1/2001') + pi5 = PeriodIndex(freq='H', start='1/1/2001', end='1/1/2001 00:00') + pi6 = PeriodIndex(freq='Min', start='1/1/2001', end='1/1/2001 00:00') + pi7 = PeriodIndex(freq='S', start='1/1/2001', end='1/1/2001 00:00:00') + + assert pi1.asfreq('Q', 'S') == pi2 + assert pi1.asfreq('Q', 's') == pi2 + assert pi1.asfreq('M', 'start') == pi3 + assert pi1.asfreq('D', 'StarT') == pi4 + assert pi1.asfreq('H', 'beGIN') == pi5 + assert pi1.asfreq('Min', 'S') == pi6 + assert pi1.asfreq('S', 'S') == pi7 + + assert pi2.asfreq('A', 'S') == pi1 + assert pi2.asfreq('M', 'S') == pi3 + assert pi2.asfreq('D', 'S') == pi4 + assert pi2.asfreq('H', 'S') == pi5 + assert pi2.asfreq('Min', 'S') == pi6 + assert pi2.asfreq('S', 'S') == pi7 + + assert pi3.asfreq('A', 'S') == pi1 + assert pi3.asfreq('Q', 'S') == pi2 + assert pi3.asfreq('D', 'S') == pi4 + assert pi3.asfreq('H', 'S') == pi5 + assert pi3.asfreq('Min', 'S') == pi6 + assert pi3.asfreq('S', 'S') == pi7 + + assert pi4.asfreq('A', 'S') == pi1 + assert pi4.asfreq('Q', 'S') == pi2 + assert pi4.asfreq('M', 'S') == pi3 + assert pi4.asfreq('H', 'S') == pi5 + assert pi4.asfreq('Min', 'S') == pi6 + assert pi4.asfreq('S', 'S') == pi7 + + assert pi5.asfreq('A', 'S') == pi1 + assert pi5.asfreq('Q', 'S') == pi2 + assert pi5.asfreq('M', 'S') == pi3 + assert pi5.asfreq('D', 'S') == pi4 + assert pi5.asfreq('Min', 'S') == pi6 + assert pi5.asfreq('S', 'S') == pi7 + + assert pi6.asfreq('A', 'S') == pi1 + assert pi6.asfreq('Q', 'S') == pi2 + assert pi6.asfreq('M', 'S') == pi3 + assert pi6.asfreq('D', 'S') == pi4 + assert pi6.asfreq('H', 'S') == pi5 + assert pi6.asfreq('S', 'S') == pi7 + + assert pi7.asfreq('A', 'S') == pi1 + assert pi7.asfreq('Q', 'S') == pi2 + assert pi7.asfreq('M', 'S') == pi3 + assert pi7.asfreq('D', 'S') == pi4 + assert pi7.asfreq('H', 'S') == pi5 + assert pi7.asfreq('Min', 'S') == pi6 + + pytest.raises(ValueError, pi7.asfreq, 'T', 'foo') + result1 = pi1.asfreq('3M') + result2 = pi1.asfreq('M') + expected = PeriodIndex(freq='M', start='2001-12', end='2001-12') + tm.assert_numpy_array_equal(result1.asi8, expected.asi8) + assert result1.freqstr == '3M' + tm.assert_numpy_array_equal(result2.asi8, expected.asi8) + assert result2.freqstr == 'M' + + def test_asfreq_nat(self): + idx = PeriodIndex(['2011-01', '2011-02', 'NaT', '2011-04'], freq='M') + result = idx.asfreq(freq='Q') + expected = PeriodIndex(['2011Q1', '2011Q1', 'NaT', '2011Q2'], freq='Q') + tm.assert_index_equal(result, expected) + + def test_asfreq_mult_pi(self): + pi = PeriodIndex(['2001-01', '2001-02', 'NaT', '2001-03'], freq='2M') + + for freq in ['D', '3D']: + result = pi.asfreq(freq) + exp = PeriodIndex(['2001-02-28', '2001-03-31', 'NaT', + '2001-04-30'], freq=freq) + tm.assert_index_equal(result, exp) + assert result.freq == 
exp.freq + + result = pi.asfreq(freq, how='S') + exp = PeriodIndex(['2001-01-01', '2001-02-01', 'NaT', + '2001-03-01'], freq=freq) + tm.assert_index_equal(result, exp) + assert result.freq == exp.freq + + def test_asfreq_combined_pi(self): + pi = pd.PeriodIndex(['2001-01-01 00:00', '2001-01-02 02:00', 'NaT'], + freq='H') + exp = PeriodIndex(['2001-01-01 00:00', '2001-01-02 02:00', 'NaT'], + freq='25H') + for freq, how in zip(['1D1H', '1H1D'], ['S', 'E']): + result = pi.asfreq(freq, how=how) + tm.assert_index_equal(result, exp) + assert result.freq == exp.freq + + for freq in ['1D1H', '1H1D']: + pi = pd.PeriodIndex(['2001-01-01 00:00', '2001-01-02 02:00', + 'NaT'], freq=freq) + result = pi.asfreq('H') + exp = PeriodIndex(['2001-01-02 00:00', '2001-01-03 02:00', 'NaT'], + freq='H') + tm.assert_index_equal(result, exp) + assert result.freq == exp.freq + + pi = pd.PeriodIndex(['2001-01-01 00:00', '2001-01-02 02:00', + 'NaT'], freq=freq) + result = pi.asfreq('H', how='S') + exp = PeriodIndex(['2001-01-01 00:00', '2001-01-02 02:00', 'NaT'], + freq='H') + tm.assert_index_equal(result, exp) + assert result.freq == exp.freq + + def test_asfreq_ts(self): + index = PeriodIndex(freq='A', start='1/1/2001', end='12/31/2010') + ts = Series(np.random.randn(len(index)), index=index) + df = DataFrame(np.random.randn(len(index), 3), index=index) + + result = ts.asfreq('D', how='end') + df_result = df.asfreq('D', how='end') + exp_index = index.asfreq('D', how='end') + assert len(result) == len(ts) + tm.assert_index_equal(result.index, exp_index) + tm.assert_index_equal(df_result.index, exp_index) + + result = ts.asfreq('D', how='start') + assert len(result) == len(ts) + tm.assert_index_equal(result.index, index.asfreq('D', how='start')) + + def test_astype_asfreq(self): + pi1 = PeriodIndex(['2011-01-01', '2011-02-01', '2011-03-01'], freq='D') + exp = PeriodIndex(['2011-01', '2011-02', '2011-03'], freq='M') + tm.assert_index_equal(pi1.asfreq('M'), exp) + tm.assert_index_equal(pi1.astype('period[M]'), exp) + + exp = PeriodIndex(['2011-01', '2011-02', '2011-03'], freq='3M') + tm.assert_index_equal(pi1.asfreq('3M'), exp) + tm.assert_index_equal(pi1.astype('period[3M]'), exp) diff --git a/pandas/tests/indexes/period/test_construction.py b/pandas/tests/indexes/period/test_construction.py new file mode 100644 index 0000000000000..6a188c0987f91 --- /dev/null +++ b/pandas/tests/indexes/period/test_construction.py @@ -0,0 +1,489 @@ +import pytest + +import numpy as np +import pandas as pd +import pandas.util.testing as tm +import pandas.core.indexes.period as period +from pandas.compat import lrange, PY3, text_type, lmap +from pandas import (Period, PeriodIndex, period_range, offsets, date_range, + Series, Index) + + +class TestPeriodIndex(object): + + def setup_method(self, method): + pass + + def test_construction_base_constructor(self): + # GH 13664 + arr = [pd.Period('2011-01', freq='M'), pd.NaT, + pd.Period('2011-03', freq='M')] + tm.assert_index_equal(pd.Index(arr), pd.PeriodIndex(arr)) + tm.assert_index_equal(pd.Index(np.array(arr)), + pd.PeriodIndex(np.array(arr))) + + arr = [np.nan, pd.NaT, pd.Period('2011-03', freq='M')] + tm.assert_index_equal(pd.Index(arr), pd.PeriodIndex(arr)) + tm.assert_index_equal(pd.Index(np.array(arr)), + pd.PeriodIndex(np.array(arr))) + + arr = [pd.Period('2011-01', freq='M'), pd.NaT, + pd.Period('2011-03', freq='D')] + tm.assert_index_equal(pd.Index(arr), pd.Index(arr, dtype=object)) + + tm.assert_index_equal(pd.Index(np.array(arr)), + pd.Index(np.array(arr), dtype=object)) + + def 
test_constructor_use_start_freq(self): + # GH #1118 + p = Period('4/2/2012', freq='B') + index = PeriodIndex(start=p, periods=10) + expected = PeriodIndex(start='4/2/2012', periods=10, freq='B') + tm.assert_index_equal(index, expected) + + def test_constructor_field_arrays(self): + # GH #1264 + + years = np.arange(1990, 2010).repeat(4)[2:-2] + quarters = np.tile(np.arange(1, 5), 20)[2:-2] + + index = PeriodIndex(year=years, quarter=quarters, freq='Q-DEC') + expected = period_range('1990Q3', '2009Q2', freq='Q-DEC') + tm.assert_index_equal(index, expected) + + index2 = PeriodIndex(year=years, quarter=quarters, freq='2Q-DEC') + tm.assert_numpy_array_equal(index.asi8, index2.asi8) + + index = PeriodIndex(year=years, quarter=quarters) + tm.assert_index_equal(index, expected) + + years = [2007, 2007, 2007] + months = [1, 2] + pytest.raises(ValueError, PeriodIndex, year=years, month=months, + freq='M') + pytest.raises(ValueError, PeriodIndex, year=years, month=months, + freq='2M') + pytest.raises(ValueError, PeriodIndex, year=years, month=months, + freq='M', start=Period('2007-01', freq='M')) + + years = [2007, 2007, 2007] + months = [1, 2, 3] + idx = PeriodIndex(year=years, month=months, freq='M') + exp = period_range('2007-01', periods=3, freq='M') + tm.assert_index_equal(idx, exp) + + def test_constructor_U(self): + # U was used as undefined period + pytest.raises(ValueError, period_range, '2007-1-1', periods=500, + freq='X') + + def test_constructor_nano(self): + idx = period_range(start=Period(ordinal=1, freq='N'), + end=Period(ordinal=4, freq='N'), freq='N') + exp = PeriodIndex([Period(ordinal=1, freq='N'), + Period(ordinal=2, freq='N'), + Period(ordinal=3, freq='N'), + Period(ordinal=4, freq='N')], freq='N') + tm.assert_index_equal(idx, exp) + + def test_constructor_arrays_negative_year(self): + years = np.arange(1960, 2000, dtype=np.int64).repeat(4) + quarters = np.tile(np.array([1, 2, 3, 4], dtype=np.int64), 40) + + pindex = PeriodIndex(year=years, quarter=quarters) + + tm.assert_index_equal(pindex.year, pd.Index(years)) + tm.assert_index_equal(pindex.quarter, pd.Index(quarters)) + + def test_constructor_invalid_quarters(self): + pytest.raises(ValueError, PeriodIndex, year=lrange(2000, 2004), + quarter=lrange(4), freq='Q-DEC') + + def test_constructor_corner(self): + pytest.raises(ValueError, PeriodIndex, periods=10, freq='A') + + start = Period('2007', freq='A-JUN') + end = Period('2010', freq='A-DEC') + pytest.raises(ValueError, PeriodIndex, start=start, end=end) + pytest.raises(ValueError, PeriodIndex, start=start) + pytest.raises(ValueError, PeriodIndex, end=end) + + result = period_range('2007-01', periods=10.5, freq='M') + exp = period_range('2007-01', periods=10, freq='M') + tm.assert_index_equal(result, exp) + + def test_constructor_fromarraylike(self): + idx = period_range('2007-01', periods=20, freq='M') + + # values is an array of Period, thus can retrieve freq + tm.assert_index_equal(PeriodIndex(idx.values), idx) + tm.assert_index_equal(PeriodIndex(list(idx.values)), idx) + + pytest.raises(ValueError, PeriodIndex, idx._values) + pytest.raises(ValueError, PeriodIndex, list(idx._values)) + pytest.raises(TypeError, PeriodIndex, + data=Period('2007', freq='A')) + + result = PeriodIndex(iter(idx)) + tm.assert_index_equal(result, idx) + + result = PeriodIndex(idx) + tm.assert_index_equal(result, idx) + + result = PeriodIndex(idx, freq='M') + tm.assert_index_equal(result, idx) + + result = PeriodIndex(idx, freq=offsets.MonthEnd()) + tm.assert_index_equal(result, idx) + assert 
result.freq == 'M'
+
+        result = PeriodIndex(idx, freq='2M')
+        tm.assert_index_equal(result, idx.asfreq('2M'))
+        assert result.freq == '2M'
+
+        result = PeriodIndex(idx, freq=offsets.MonthEnd(2))
+        tm.assert_index_equal(result, idx.asfreq('2M'))
+        assert result.freq == '2M'
+
+        result = PeriodIndex(idx, freq='D')
+        exp = idx.asfreq('D', 'e')
+        tm.assert_index_equal(result, exp)
+
+    def test_constructor_datetime64arr(self):
+        vals = np.arange(100000, 100000 + 10000, 100, dtype=np.int64)
+        vals = vals.view(np.dtype('M8[us]'))
+
+        pytest.raises(ValueError, PeriodIndex, vals, freq='D')
+
+    def test_constructor_dtype(self):
+        # passing a period dtype should construct the index with that freq
+        idx = PeriodIndex(['2013-01', '2013-03'], dtype='period[M]')
+        exp = PeriodIndex(['2013-01', '2013-03'], freq='M')
+        tm.assert_index_equal(idx, exp)
+        assert idx.dtype == 'period[M]'
+
+        idx = PeriodIndex(['2013-01-05', '2013-03-05'], dtype='period[3D]')
+        exp = PeriodIndex(['2013-01-05', '2013-03-05'], freq='3D')
+        tm.assert_index_equal(idx, exp)
+        assert idx.dtype == 'period[3D]'
+
+        # if we already have a freq and it's not the same, then asfreq
+        # (not changed)
+        idx = PeriodIndex(['2013-01-01', '2013-01-02'], freq='D')
+
+        res = PeriodIndex(idx, dtype='period[M]')
+        exp = PeriodIndex(['2013-01', '2013-01'], freq='M')
+        tm.assert_index_equal(res, exp)
+        assert res.dtype == 'period[M]'
+
+        res = PeriodIndex(idx, freq='M')
+        tm.assert_index_equal(res, exp)
+        assert res.dtype == 'period[M]'
+
+        msg = 'specified freq and dtype are different'
+        with tm.assert_raises_regex(period.IncompatibleFrequency, msg):
+            PeriodIndex(['2011-01'], freq='M', dtype='period[D]')
+
+    def test_constructor_empty(self):
+        idx = pd.PeriodIndex([], freq='M')
+        assert isinstance(idx, PeriodIndex)
+        assert len(idx) == 0
+        assert idx.freq == 'M'
+
+        with tm.assert_raises_regex(ValueError, 'freq not specified'):
+            pd.PeriodIndex([])
+
+    def test_constructor_pi_nat(self):
+        idx = PeriodIndex([Period('2011-01', freq='M'), pd.NaT,
+                           Period('2011-01', freq='M')])
+        exp = PeriodIndex(['2011-01', 'NaT', '2011-01'], freq='M')
+        tm.assert_index_equal(idx, exp)
+
+        idx = PeriodIndex(np.array([Period('2011-01', freq='M'), pd.NaT,
+                                    Period('2011-01', freq='M')]))
+        tm.assert_index_equal(idx, exp)
+
+        idx = PeriodIndex([pd.NaT, pd.NaT, Period('2011-01', freq='M'),
+                           Period('2011-01', freq='M')])
+        exp = PeriodIndex(['NaT', 'NaT', '2011-01', '2011-01'], freq='M')
+        tm.assert_index_equal(idx, exp)
+
+        idx = PeriodIndex(np.array([pd.NaT, pd.NaT,
+                                    Period('2011-01', freq='M'),
+                                    Period('2011-01', freq='M')]))
+        tm.assert_index_equal(idx, exp)
+
+        idx = PeriodIndex([pd.NaT, pd.NaT, '2011-01', '2011-01'], freq='M')
+        tm.assert_index_equal(idx, exp)
+
+        with tm.assert_raises_regex(ValueError, 'freq not specified'):
+            PeriodIndex([pd.NaT, pd.NaT])
+
+        with tm.assert_raises_regex(ValueError, 'freq not specified'):
+            PeriodIndex(np.array([pd.NaT, pd.NaT]))
+
+        with tm.assert_raises_regex(ValueError, 'freq not specified'):
+            PeriodIndex(['NaT', 'NaT'])
+
+        with tm.assert_raises_regex(ValueError, 'freq not specified'):
+            PeriodIndex(np.array(['NaT', 'NaT']))
+
+    def test_constructor_incompat_freq(self):
+        msg = "Input has different freq=D from PeriodIndex\\(freq=M\\)"
+
+        with tm.assert_raises_regex(period.IncompatibleFrequency, msg):
+            PeriodIndex([Period('2011-01', freq='M'), pd.NaT,
+                         Period('2011-01', freq='D')])
+
+        with tm.assert_raises_regex(period.IncompatibleFrequency, msg):
+            PeriodIndex(np.array([Period('2011-01', freq='M'), pd.NaT,
+                                  Period('2011-01', freq='D')]))
+
+        # 
first element is pd.NaT + with tm.assert_raises_regex(period.IncompatibleFrequency, msg): + PeriodIndex([pd.NaT, Period('2011-01', freq='M'), + Period('2011-01', freq='D')]) + + with tm.assert_raises_regex(period.IncompatibleFrequency, msg): + PeriodIndex(np.array([pd.NaT, Period('2011-01', freq='M'), + Period('2011-01', freq='D')])) + + def test_constructor_mixed(self): + idx = PeriodIndex(['2011-01', pd.NaT, Period('2011-01', freq='M')]) + exp = PeriodIndex(['2011-01', 'NaT', '2011-01'], freq='M') + tm.assert_index_equal(idx, exp) + + idx = PeriodIndex(['NaT', pd.NaT, Period('2011-01', freq='M')]) + exp = PeriodIndex(['NaT', 'NaT', '2011-01'], freq='M') + tm.assert_index_equal(idx, exp) + + idx = PeriodIndex([Period('2011-01-01', freq='D'), pd.NaT, + '2012-01-01']) + exp = PeriodIndex(['2011-01-01', 'NaT', '2012-01-01'], freq='D') + tm.assert_index_equal(idx, exp) + + def test_constructor_simple_new(self): + idx = period_range('2007-01', name='p', periods=2, freq='M') + result = idx._simple_new(idx, 'p', freq=idx.freq) + tm.assert_index_equal(result, idx) + + result = idx._simple_new(idx.astype('i8'), 'p', freq=idx.freq) + tm.assert_index_equal(result, idx) + + result = idx._simple_new([pd.Period('2007-01', freq='M'), + pd.Period('2007-02', freq='M')], + 'p', freq=idx.freq) + tm.assert_index_equal(result, idx) + + result = idx._simple_new(np.array([pd.Period('2007-01', freq='M'), + pd.Period('2007-02', freq='M')]), + 'p', freq=idx.freq) + tm.assert_index_equal(result, idx) + + def test_constructor_simple_new_empty(self): + # GH13079 + idx = PeriodIndex([], freq='M', name='p') + result = idx._simple_new(idx, name='p', freq='M') + tm.assert_index_equal(result, idx) + + def test_constructor_floats(self): + # GH13079 + for floats in [[1.1, 2.1], np.array([1.1, 2.1])]: + with pytest.raises(TypeError): + pd.PeriodIndex._simple_new(floats, freq='M') + + with pytest.raises(TypeError): + pd.PeriodIndex(floats, freq='M') + + def test_constructor_nat(self): + pytest.raises(ValueError, period_range, start='NaT', + end='2011-01-01', freq='M') + pytest.raises(ValueError, period_range, start='2011-01-01', + end='NaT', freq='M') + + def test_constructor_year_and_quarter(self): + year = pd.Series([2001, 2002, 2003]) + quarter = year - 2000 + idx = PeriodIndex(year=year, quarter=quarter) + strs = ['%dQ%d' % t for t in zip(quarter, year)] + lops = list(map(Period, strs)) + p = PeriodIndex(lops) + tm.assert_index_equal(p, idx) + + def test_constructor_freq_mult(self): + # GH #7811 + for func in [PeriodIndex, period_range]: + # must be the same, but for sure... 
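+            # a multiplied freq such as '2M' spans two months per period,
+            # so consecutive periods below land two months apart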
+ pidx = func(start='2014-01', freq='2M', periods=4) + expected = PeriodIndex(['2014-01', '2014-03', + '2014-05', '2014-07'], freq='2M') + tm.assert_index_equal(pidx, expected) + + pidx = func(start='2014-01-02', end='2014-01-15', freq='3D') + expected = PeriodIndex(['2014-01-02', '2014-01-05', + '2014-01-08', '2014-01-11', + '2014-01-14'], freq='3D') + tm.assert_index_equal(pidx, expected) + + pidx = func(end='2014-01-01 17:00', freq='4H', periods=3) + expected = PeriodIndex(['2014-01-01 09:00', '2014-01-01 13:00', + '2014-01-01 17:00'], freq='4H') + tm.assert_index_equal(pidx, expected) + + msg = ('Frequency must be positive, because it' + ' represents span: -1M') + with tm.assert_raises_regex(ValueError, msg): + PeriodIndex(['2011-01'], freq='-1M') + + msg = ('Frequency must be positive, because it' ' represents span: 0M') + with tm.assert_raises_regex(ValueError, msg): + PeriodIndex(['2011-01'], freq='0M') + + msg = ('Frequency must be positive, because it' ' represents span: 0M') + with tm.assert_raises_regex(ValueError, msg): + period_range('2011-01', periods=3, freq='0M') + + def test_constructor_freq_mult_dti_compat(self): + import itertools + mults = [1, 2, 3, 4, 5] + freqs = ['A', 'M', 'D', 'T', 'S'] + for mult, freq in itertools.product(mults, freqs): + freqstr = str(mult) + freq + pidx = PeriodIndex(start='2014-04-01', freq=freqstr, periods=10) + expected = date_range(start='2014-04-01', freq=freqstr, + periods=10).to_period(freqstr) + tm.assert_index_equal(pidx, expected) + + def test_constructor_freq_combined(self): + for freq in ['1D1H', '1H1D']: + pidx = PeriodIndex(['2016-01-01', '2016-01-02'], freq=freq) + expected = PeriodIndex(['2016-01-01 00:00', '2016-01-02 00:00'], + freq='25H') + for freq, func in zip(['1D1H', '1H1D'], [PeriodIndex, period_range]): + pidx = func(start='2016-01-01', periods=2, freq=freq) + expected = PeriodIndex(['2016-01-01 00:00', '2016-01-02 01:00'], + freq='25H') + tm.assert_index_equal(pidx, expected) + + def test_constructor(self): + pi = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009') + assert len(pi) == 9 + + pi = PeriodIndex(freq='Q', start='1/1/2001', end='12/1/2009') + assert len(pi) == 4 * 9 + + pi = PeriodIndex(freq='M', start='1/1/2001', end='12/1/2009') + assert len(pi) == 12 * 9 + + pi = PeriodIndex(freq='D', start='1/1/2001', end='12/31/2009') + assert len(pi) == 365 * 9 + 2 + + pi = PeriodIndex(freq='B', start='1/1/2001', end='12/31/2009') + assert len(pi) == 261 * 9 + + pi = PeriodIndex(freq='H', start='1/1/2001', end='12/31/2001 23:00') + assert len(pi) == 365 * 24 + + pi = PeriodIndex(freq='Min', start='1/1/2001', end='1/1/2001 23:59') + assert len(pi) == 24 * 60 + + pi = PeriodIndex(freq='S', start='1/1/2001', end='1/1/2001 23:59:59') + assert len(pi) == 24 * 60 * 60 + + start = Period('02-Apr-2005', 'B') + i1 = PeriodIndex(start=start, periods=20) + assert len(i1) == 20 + assert i1.freq == start.freq + assert i1[0] == start + + end_intv = Period('2006-12-31', 'W') + i1 = PeriodIndex(end=end_intv, periods=10) + assert len(i1) == 10 + assert i1.freq == end_intv.freq + assert i1[-1] == end_intv + + end_intv = Period('2006-12-31', '1w') + i2 = PeriodIndex(end=end_intv, periods=10) + assert len(i1) == len(i2) + assert (i1 == i2).all() + assert i1.freq == i2.freq + + end_intv = Period('2006-12-31', ('w', 1)) + i2 = PeriodIndex(end=end_intv, periods=10) + assert len(i1) == len(i2) + assert (i1 == i2).all() + assert i1.freq == i2.freq + + end_intv = Period('2005-05-01', 'B') + i1 = PeriodIndex(start=start, end=end_intv) + + # 
infer freq from first element + i2 = PeriodIndex([end_intv, Period('2005-05-05', 'B')]) + assert len(i2) == 2 + assert i2[0] == end_intv + + i2 = PeriodIndex(np.array([end_intv, Period('2005-05-05', 'B')])) + assert len(i2) == 2 + assert i2[0] == end_intv + + # Mixed freq should fail + vals = [end_intv, Period('2006-12-31', 'w')] + pytest.raises(ValueError, PeriodIndex, vals) + vals = np.array(vals) + pytest.raises(ValueError, PeriodIndex, vals) + + def test_constructor_error(self): + start = Period('02-Apr-2005', 'B') + end_intv = Period('2006-12-31', ('w', 1)) + + msg = 'Start and end must have same freq' + with tm.assert_raises_regex(ValueError, msg): + PeriodIndex(start=start, end=end_intv) + + msg = 'Must specify 2 of start, end, periods' + with tm.assert_raises_regex(ValueError, msg): + PeriodIndex(start=start) + + def test_recreate_from_data(self): + for o in ['M', 'Q', 'A', 'D', 'B', 'T', 'S', 'L', 'U', 'N', 'H']: + org = PeriodIndex(start='2001/04/01', freq=o, periods=1) + idx = PeriodIndex(org.values, freq=o) + tm.assert_index_equal(idx, org) + + def test_map_with_string_constructor(self): + raw = [2005, 2007, 2009] + index = PeriodIndex(raw, freq='A') + types = str, + + if PY3: + # unicode + types += text_type, + + for t in types: + expected = Index(lmap(t, raw)) + res = index.map(t) + + # should return an Index + assert isinstance(res, Index) + + # preserve element types + assert all(isinstance(resi, t) for resi in res) + + # lastly, values should compare equal + tm.assert_index_equal(res, expected) + + +class TestSeriesPeriod(object): + + def setup_method(self, method): + self.series = Series(period_range('2000-01-01', periods=10, freq='D')) + + def test_constructor_cant_cast_period(self): + with pytest.raises(TypeError): + Series(period_range('2000-01-01', periods=10, freq='D'), + dtype=float) + + def test_constructor_cast_object(self): + s = Series(period_range('1/1/2000', periods=10), dtype=object) + exp = Series(period_range('1/1/2000', periods=10)) + tm.assert_series_equal(s, exp) diff --git a/pandas/tests/indexes/period/test_formats.py b/pandas/tests/indexes/period/test_formats.py new file mode 100644 index 0000000000000..533481ce051f7 --- /dev/null +++ b/pandas/tests/indexes/period/test_formats.py @@ -0,0 +1,48 @@ +from pandas import PeriodIndex + +import numpy as np + +import pandas.util.testing as tm +import pandas as pd + + +def test_to_native_types(): + index = PeriodIndex(['2017-01-01', '2017-01-02', + '2017-01-03'], freq='D') + + # First, with no arguments. 
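+    # the default rendering is the str form of each period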
+ expected = np.array(['2017-01-01', '2017-01-02', + '2017-01-03'], dtype='= -1') + with tm.assert_raises_regex(ValueError, msg): + idx.take(np.array([1, 0, -2]), fill_value=True) + with tm.assert_raises_regex(ValueError, msg): + idx.take(np.array([1, 0, -5]), fill_value=True) + + with pytest.raises(IndexError): + idx.take(np.array([1, -5])) diff --git a/pandas/tests/indexes/period/test_ops.py b/pandas/tests/indexes/period/test_ops.py new file mode 100644 index 0000000000000..7acc335c31be4 --- /dev/null +++ b/pandas/tests/indexes/period/test_ops.py @@ -0,0 +1,1347 @@ +import pytest + +import numpy as np +from datetime import timedelta + +import pandas as pd +import pandas._libs.tslib as tslib +import pandas.util.testing as tm +import pandas.core.indexes.period as period +from pandas import (DatetimeIndex, PeriodIndex, period_range, Series, Period, + _np_version_under1p10, Index, Timedelta, offsets) + +from pandas.tests.test_base import Ops + + +class TestPeriodIndexOps(Ops): + + def setup_method(self, method): + super(TestPeriodIndexOps, self).setup_method(method) + mask = lambda x: (isinstance(x, DatetimeIndex) or + isinstance(x, PeriodIndex)) + self.is_valid_objs = [o for o in self.objs if mask(o)] + self.not_valid_objs = [o for o in self.objs if not mask(o)] + + def test_ops_properties(self): + f = lambda x: isinstance(x, PeriodIndex) + self.check_ops_properties(PeriodIndex._field_ops, f) + self.check_ops_properties(PeriodIndex._object_ops, f) + self.check_ops_properties(PeriodIndex._bool_ops, f) + + def test_asobject_tolist(self): + idx = pd.period_range(start='2013-01-01', periods=4, freq='M', + name='idx') + expected_list = [pd.Period('2013-01-31', freq='M'), + pd.Period('2013-02-28', freq='M'), + pd.Period('2013-03-31', freq='M'), + pd.Period('2013-04-30', freq='M')] + expected = pd.Index(expected_list, dtype=object, name='idx') + result = idx.asobject + assert isinstance(result, Index) + assert result.dtype == object + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert idx.tolist() == expected_list + + idx = PeriodIndex(['2013-01-01', '2013-01-02', 'NaT', + '2013-01-04'], freq='D', name='idx') + expected_list = [pd.Period('2013-01-01', freq='D'), + pd.Period('2013-01-02', freq='D'), + pd.Period('NaT', freq='D'), + pd.Period('2013-01-04', freq='D')] + expected = pd.Index(expected_list, dtype=object, name='idx') + result = idx.asobject + assert isinstance(result, Index) + assert result.dtype == object + tm.assert_index_equal(result, expected) + for i in [0, 1, 3]: + assert result[i] == expected[i] + assert result[2] is pd.NaT + assert result.name == expected.name + + result_list = idx.tolist() + for i in [0, 1, 3]: + assert result_list[i] == expected_list[i] + assert result_list[2] is pd.NaT + + def test_minmax(self): + + # monotonic + idx1 = pd.PeriodIndex([pd.NaT, '2011-01-01', '2011-01-02', + '2011-01-03'], freq='D') + assert idx1.is_monotonic + + # non-monotonic + idx2 = pd.PeriodIndex(['2011-01-01', pd.NaT, '2011-01-03', + '2011-01-02', pd.NaT], freq='D') + assert not idx2.is_monotonic + + for idx in [idx1, idx2]: + assert idx.min() == pd.Period('2011-01-01', freq='D') + assert idx.max() == pd.Period('2011-01-03', freq='D') + assert idx1.argmin() == 1 + assert idx2.argmin() == 0 + assert idx1.argmax() == 3 + assert idx2.argmax() == 2 + + for op in ['min', 'max']: + # Return NaT + obj = PeriodIndex([], freq='M') + result = getattr(obj, op)() + assert result is tslib.NaT + + obj = PeriodIndex([pd.NaT], freq='M') + result = getattr(obj, op)() + 
assert result is tslib.NaT + + obj = PeriodIndex([pd.NaT, pd.NaT, pd.NaT], freq='M') + result = getattr(obj, op)() + assert result is tslib.NaT + + def test_numpy_minmax(self): + pr = pd.period_range(start='2016-01-15', end='2016-01-20') + + assert np.min(pr) == Period('2016-01-15', freq='D') + assert np.max(pr) == Period('2016-01-20', freq='D') + + errmsg = "the 'out' parameter is not supported" + tm.assert_raises_regex(ValueError, errmsg, np.min, pr, out=0) + tm.assert_raises_regex(ValueError, errmsg, np.max, pr, out=0) + + assert np.argmin(pr) == 0 + assert np.argmax(pr) == 5 + + if not _np_version_under1p10: + errmsg = "the 'out' parameter is not supported" + tm.assert_raises_regex( + ValueError, errmsg, np.argmin, pr, out=0) + tm.assert_raises_regex( + ValueError, errmsg, np.argmax, pr, out=0) + + def test_representation(self): + # GH 7601 + idx1 = PeriodIndex([], freq='D') + idx2 = PeriodIndex(['2011-01-01'], freq='D') + idx3 = PeriodIndex(['2011-01-01', '2011-01-02'], freq='D') + idx4 = PeriodIndex(['2011-01-01', '2011-01-02', '2011-01-03'], + freq='D') + idx5 = PeriodIndex(['2011', '2012', '2013'], freq='A') + idx6 = PeriodIndex(['2011-01-01 09:00', '2012-02-01 10:00', + 'NaT'], freq='H') + idx7 = pd.period_range('2013Q1', periods=1, freq="Q") + idx8 = pd.period_range('2013Q1', periods=2, freq="Q") + idx9 = pd.period_range('2013Q1', periods=3, freq="Q") + idx10 = PeriodIndex(['2011-01-01', '2011-02-01'], freq='3D') + + exp1 = """PeriodIndex([], dtype='period[D]', freq='D')""" + + exp2 = """PeriodIndex(['2011-01-01'], dtype='period[D]', freq='D')""" + + exp3 = ("PeriodIndex(['2011-01-01', '2011-01-02'], dtype='period[D]', " + "freq='D')") + + exp4 = ("PeriodIndex(['2011-01-01', '2011-01-02', '2011-01-03'], " + "dtype='period[D]', freq='D')") + + exp5 = ("PeriodIndex(['2011', '2012', '2013'], dtype='period[A-DEC]', " + "freq='A-DEC')") + + exp6 = ("PeriodIndex(['2011-01-01 09:00', '2012-02-01 10:00', 'NaT'], " + "dtype='period[H]', freq='H')") + + exp7 = ("PeriodIndex(['2013Q1'], dtype='period[Q-DEC]', " + "freq='Q-DEC')") + + exp8 = ("PeriodIndex(['2013Q1', '2013Q2'], dtype='period[Q-DEC]', " + "freq='Q-DEC')") + + exp9 = ("PeriodIndex(['2013Q1', '2013Q2', '2013Q3'], " + "dtype='period[Q-DEC]', freq='Q-DEC')") + + exp10 = ("PeriodIndex(['2011-01-01', '2011-02-01'], " + "dtype='period[3D]', freq='3D')") + + for idx, expected in zip([idx1, idx2, idx3, idx4, idx5, + idx6, idx7, idx8, idx9, idx10], + [exp1, exp2, exp3, exp4, exp5, + exp6, exp7, exp8, exp9, exp10]): + for func in ['__repr__', '__unicode__', '__str__']: + result = getattr(idx, func)() + assert result == expected + + def test_representation_to_series(self): + # GH 10971 + idx1 = PeriodIndex([], freq='D') + idx2 = PeriodIndex(['2011-01-01'], freq='D') + idx3 = PeriodIndex(['2011-01-01', '2011-01-02'], freq='D') + idx4 = PeriodIndex(['2011-01-01', '2011-01-02', + '2011-01-03'], freq='D') + idx5 = PeriodIndex(['2011', '2012', '2013'], freq='A') + idx6 = PeriodIndex(['2011-01-01 09:00', '2012-02-01 10:00', + 'NaT'], freq='H') + + idx7 = pd.period_range('2013Q1', periods=1, freq="Q") + idx8 = pd.period_range('2013Q1', periods=2, freq="Q") + idx9 = pd.period_range('2013Q1', periods=3, freq="Q") + + exp1 = """Series([], dtype: object)""" + + exp2 = """0 2011-01-01 +dtype: object""" + + exp3 = """0 2011-01-01 +1 2011-01-02 +dtype: object""" + + exp4 = """0 2011-01-01 +1 2011-01-02 +2 2011-01-03 +dtype: object""" + + exp5 = """0 2011 +1 2012 +2 2013 +dtype: object""" + + exp6 = """0 2011-01-01 09:00 +1 2012-02-01 10:00 +2 NaT 
+dtype: object""" + + exp7 = """0 2013Q1 +dtype: object""" + + exp8 = """0 2013Q1 +1 2013Q2 +dtype: object""" + + exp9 = """0 2013Q1 +1 2013Q2 +2 2013Q3 +dtype: object""" + + for idx, expected in zip([idx1, idx2, idx3, idx4, idx5, + idx6, idx7, idx8, idx9], + [exp1, exp2, exp3, exp4, exp5, + exp6, exp7, exp8, exp9]): + result = repr(pd.Series(idx)) + assert result == expected + + def test_summary(self): + # GH9116 + idx1 = PeriodIndex([], freq='D') + idx2 = PeriodIndex(['2011-01-01'], freq='D') + idx3 = PeriodIndex(['2011-01-01', '2011-01-02'], freq='D') + idx4 = PeriodIndex( + ['2011-01-01', '2011-01-02', '2011-01-03'], freq='D') + idx5 = PeriodIndex(['2011', '2012', '2013'], freq='A') + idx6 = PeriodIndex( + ['2011-01-01 09:00', '2012-02-01 10:00', 'NaT'], freq='H') + + idx7 = pd.period_range('2013Q1', periods=1, freq="Q") + idx8 = pd.period_range('2013Q1', periods=2, freq="Q") + idx9 = pd.period_range('2013Q1', periods=3, freq="Q") + + exp1 = """PeriodIndex: 0 entries +Freq: D""" + + exp2 = """PeriodIndex: 1 entries, 2011-01-01 to 2011-01-01 +Freq: D""" + + exp3 = """PeriodIndex: 2 entries, 2011-01-01 to 2011-01-02 +Freq: D""" + + exp4 = """PeriodIndex: 3 entries, 2011-01-01 to 2011-01-03 +Freq: D""" + + exp5 = """PeriodIndex: 3 entries, 2011 to 2013 +Freq: A-DEC""" + + exp6 = """PeriodIndex: 3 entries, 2011-01-01 09:00 to NaT +Freq: H""" + + exp7 = """PeriodIndex: 1 entries, 2013Q1 to 2013Q1 +Freq: Q-DEC""" + + exp8 = """PeriodIndex: 2 entries, 2013Q1 to 2013Q2 +Freq: Q-DEC""" + + exp9 = """PeriodIndex: 3 entries, 2013Q1 to 2013Q3 +Freq: Q-DEC""" + + for idx, expected in zip([idx1, idx2, idx3, idx4, idx5, + idx6, idx7, idx8, idx9], + [exp1, exp2, exp3, exp4, exp5, + exp6, exp7, exp8, exp9]): + result = idx.summary() + assert result == expected + + def test_resolution(self): + for freq, expected in zip(['A', 'Q', 'M', 'D', 'H', + 'T', 'S', 'L', 'U'], + ['day', 'day', 'day', 'day', + 'hour', 'minute', 'second', + 'millisecond', 'microsecond']): + + idx = pd.period_range(start='2013-04-01', periods=30, freq=freq) + assert idx.resolution == expected + + def test_add_iadd(self): + rng = pd.period_range('1/1/2000', freq='D', periods=5) + other = pd.period_range('1/6/2000', freq='D', periods=5) + + # previously performed setop union, now raises TypeError (GH14164) + with pytest.raises(TypeError): + rng + other + + with pytest.raises(TypeError): + rng += other + + # offset + # DateOffset + rng = pd.period_range('2014', '2024', freq='A') + result = rng + pd.offsets.YearEnd(5) + expected = pd.period_range('2019', '2029', freq='A') + tm.assert_index_equal(result, expected) + rng += pd.offsets.YearEnd(5) + tm.assert_index_equal(rng, expected) + + for o in [pd.offsets.YearBegin(2), pd.offsets.MonthBegin(1), + pd.offsets.Minute(), np.timedelta64(365, 'D'), + timedelta(365), Timedelta(days=365)]: + msg = ('Input has different freq(=.+)? ' + 'from PeriodIndex\\(freq=A-DEC\\)') + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + rng + o + + rng = pd.period_range('2014-01', '2016-12', freq='M') + result = rng + pd.offsets.MonthEnd(5) + expected = pd.period_range('2014-06', '2017-05', freq='M') + tm.assert_index_equal(result, expected) + rng += pd.offsets.MonthEnd(5) + tm.assert_index_equal(rng, expected) + + for o in [pd.offsets.YearBegin(2), pd.offsets.MonthBegin(1), + pd.offsets.Minute(), np.timedelta64(365, 'D'), + timedelta(365), Timedelta(days=365)]: + rng = pd.period_range('2014-01', '2016-12', freq='M') + msg = 'Input has different freq(=.+)? 
from PeriodIndex\\(freq=M\\)' + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + rng + o + + # Tick + offsets = [pd.offsets.Day(3), timedelta(days=3), + np.timedelta64(3, 'D'), pd.offsets.Hour(72), + timedelta(minutes=60 * 24 * 3), np.timedelta64(72, 'h'), + Timedelta('72:00:00')] + for delta in offsets: + rng = pd.period_range('2014-05-01', '2014-05-15', freq='D') + result = rng + delta + expected = pd.period_range('2014-05-04', '2014-05-18', freq='D') + tm.assert_index_equal(result, expected) + rng += delta + tm.assert_index_equal(rng, expected) + + for o in [pd.offsets.YearBegin(2), pd.offsets.MonthBegin(1), + pd.offsets.Minute(), np.timedelta64(4, 'h'), + timedelta(hours=23), Timedelta('23:00:00')]: + rng = pd.period_range('2014-05-01', '2014-05-15', freq='D') + msg = 'Input has different freq(=.+)? from PeriodIndex\\(freq=D\\)' + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + rng + o + + offsets = [pd.offsets.Hour(2), timedelta(hours=2), + np.timedelta64(2, 'h'), pd.offsets.Minute(120), + timedelta(minutes=120), np.timedelta64(120, 'm'), + Timedelta(minutes=120)] + for delta in offsets: + rng = pd.period_range('2014-01-01 10:00', '2014-01-05 10:00', + freq='H') + result = rng + delta + expected = pd.period_range('2014-01-01 12:00', '2014-01-05 12:00', + freq='H') + tm.assert_index_equal(result, expected) + rng += delta + tm.assert_index_equal(rng, expected) + + for delta in [pd.offsets.YearBegin(2), timedelta(minutes=30), + np.timedelta64(30, 's'), Timedelta(seconds=30)]: + rng = pd.period_range('2014-01-01 10:00', '2014-01-05 10:00', + freq='H') + msg = 'Input has different freq(=.+)? from PeriodIndex\\(freq=H\\)' + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + rng + delta + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + rng += delta + + # int + rng = pd.period_range('2000-01-01 09:00', freq='H', periods=10) + result = rng + 1 + expected = pd.period_range('2000-01-01 10:00', freq='H', periods=10) + tm.assert_index_equal(result, expected) + rng += 1 + tm.assert_index_equal(rng, expected) + + def test_sub(self): + rng = period_range('2007-01', periods=50) + + result = rng - 5 + exp = rng + (-5) + tm.assert_index_equal(result, exp) + + def test_sub_isub(self): + + # previously performed setop, now raises TypeError (GH14164) + # TODO needs to wait on #13077 for decision on result type + rng = pd.period_range('1/1/2000', freq='D', periods=5) + other = pd.period_range('1/6/2000', freq='D', periods=5) + + with pytest.raises(TypeError): + rng - other + + with pytest.raises(TypeError): + rng -= other + + # offset + # DateOffset + rng = pd.period_range('2014', '2024', freq='A') + result = rng - pd.offsets.YearEnd(5) + expected = pd.period_range('2009', '2019', freq='A') + tm.assert_index_equal(result, expected) + rng -= pd.offsets.YearEnd(5) + tm.assert_index_equal(rng, expected) + + for o in [pd.offsets.YearBegin(2), pd.offsets.MonthBegin(1), + pd.offsets.Minute(), np.timedelta64(365, 'D'), + timedelta(365)]: + rng = pd.period_range('2014', '2024', freq='A') + msg = ('Input has different freq(=.+)? 
' + 'from PeriodIndex\\(freq=A-DEC\\)') + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + rng - o + + rng = pd.period_range('2014-01', '2016-12', freq='M') + result = rng - pd.offsets.MonthEnd(5) + expected = pd.period_range('2013-08', '2016-07', freq='M') + tm.assert_index_equal(result, expected) + rng -= pd.offsets.MonthEnd(5) + tm.assert_index_equal(rng, expected) + + for o in [pd.offsets.YearBegin(2), pd.offsets.MonthBegin(1), + pd.offsets.Minute(), np.timedelta64(365, 'D'), + timedelta(365)]: + rng = pd.period_range('2014-01', '2016-12', freq='M') + msg = 'Input has different freq(=.+)? from PeriodIndex\\(freq=M\\)' + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + rng - o + + # Tick + offsets = [pd.offsets.Day(3), timedelta(days=3), + np.timedelta64(3, 'D'), pd.offsets.Hour(72), + timedelta(minutes=60 * 24 * 3), np.timedelta64(72, 'h')] + for delta in offsets: + rng = pd.period_range('2014-05-01', '2014-05-15', freq='D') + result = rng - delta + expected = pd.period_range('2014-04-28', '2014-05-12', freq='D') + tm.assert_index_equal(result, expected) + rng -= delta + tm.assert_index_equal(rng, expected) + + for o in [pd.offsets.YearBegin(2), pd.offsets.MonthBegin(1), + pd.offsets.Minute(), np.timedelta64(4, 'h'), + timedelta(hours=23)]: + rng = pd.period_range('2014-05-01', '2014-05-15', freq='D') + msg = 'Input has different freq(=.+)? from PeriodIndex\\(freq=D\\)' + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + rng - o + + offsets = [pd.offsets.Hour(2), timedelta(hours=2), + np.timedelta64(2, 'h'), pd.offsets.Minute(120), + timedelta(minutes=120), np.timedelta64(120, 'm')] + for delta in offsets: + rng = pd.period_range('2014-01-01 10:00', '2014-01-05 10:00', + freq='H') + result = rng - delta + expected = pd.period_range('2014-01-01 08:00', '2014-01-05 08:00', + freq='H') + tm.assert_index_equal(result, expected) + rng -= delta + tm.assert_index_equal(rng, expected) + + for delta in [pd.offsets.YearBegin(2), timedelta(minutes=30), + np.timedelta64(30, 's')]: + rng = pd.period_range('2014-01-01 10:00', '2014-01-05 10:00', + freq='H') + msg = 'Input has different freq(=.+)? 
from PeriodIndex\\(freq=H\\)'
+            with tm.assert_raises_regex(
+                    period.IncompatibleFrequency, msg):
+                rng - delta
+            with tm.assert_raises_regex(
+                    period.IncompatibleFrequency, msg):
+                rng -= delta
+
+        # int
+        rng = pd.period_range('2000-01-01 09:00', freq='H', periods=10)
+        result = rng - 1
+        expected = pd.period_range('2000-01-01 08:00', freq='H', periods=10)
+        tm.assert_index_equal(result, expected)
+        rng -= 1
+        tm.assert_index_equal(rng, expected)
+
+    def test_comp_nat(self):
+        left = pd.PeriodIndex([pd.Period('2011-01-01'), pd.NaT,
+                               pd.Period('2011-01-03')])
+        right = pd.PeriodIndex([pd.NaT, pd.NaT, pd.Period('2011-01-03')])
+
+        for l, r in [(left, right), (left.asobject, right.asobject)]:
+            result = l == r
+            expected = np.array([False, False, True])
+            tm.assert_numpy_array_equal(result, expected)
+
+            result = l != r
+            expected = np.array([True, True, False])
+            tm.assert_numpy_array_equal(result, expected)
+
+            expected = np.array([False, False, False])
+            tm.assert_numpy_array_equal(l == pd.NaT, expected)
+            tm.assert_numpy_array_equal(pd.NaT == r, expected)
+
+            expected = np.array([True, True, True])
+            tm.assert_numpy_array_equal(l != pd.NaT, expected)
+            tm.assert_numpy_array_equal(pd.NaT != l, expected)
+
+            expected = np.array([False, False, False])
+            tm.assert_numpy_array_equal(l < pd.NaT, expected)
+            tm.assert_numpy_array_equal(pd.NaT > l, expected)
+
+    def test_value_counts_unique(self):
+        # GH 7735
+        idx = pd.period_range('2011-01-01 09:00', freq='H', periods=10)
+        # create repeated values: the n-th element is repeated n + 1 times
+        idx = PeriodIndex(np.repeat(idx.values, range(1, len(idx) + 1)),
+                          freq='H')
+
+        exp_idx = PeriodIndex(['2011-01-01 18:00', '2011-01-01 17:00',
+                               '2011-01-01 16:00', '2011-01-01 15:00',
+                               '2011-01-01 14:00', '2011-01-01 13:00',
+                               '2011-01-01 12:00', '2011-01-01 11:00',
+                               '2011-01-01 10:00',
+                               '2011-01-01 09:00'], freq='H')
+        expected = Series(range(10, 0, -1), index=exp_idx, dtype='int64')
+
+        for obj in [idx, Series(idx)]:
+            tm.assert_series_equal(obj.value_counts(), expected)
+
+        expected = pd.period_range('2011-01-01 09:00', freq='H',
+                                   periods=10)
+        tm.assert_index_equal(idx.unique(), expected)
+
+        idx = PeriodIndex(['2013-01-01 09:00', '2013-01-01 09:00',
+                           '2013-01-01 09:00', '2013-01-01 08:00',
+                           '2013-01-01 08:00', pd.NaT], freq='H')
+
+        exp_idx = PeriodIndex(['2013-01-01 09:00', '2013-01-01 08:00'],
+                              freq='H')
+        expected = Series([3, 2], index=exp_idx)
+
+        for obj in [idx, Series(idx)]:
+            tm.assert_series_equal(obj.value_counts(), expected)
+
+        exp_idx = PeriodIndex(['2013-01-01 09:00', '2013-01-01 08:00',
+                               pd.NaT], freq='H')
+        expected = Series([3, 2, 1], index=exp_idx)
+
+        for obj in [idx, Series(idx)]:
+            tm.assert_series_equal(obj.value_counts(dropna=False), expected)
+
+        tm.assert_index_equal(idx.unique(), exp_idx)
+
+    def test_drop_duplicates_metadata(self):
+        # GH 10115
+        idx = pd.period_range('2011-01-01', '2011-01-31', freq='D', name='idx')
+        result = idx.drop_duplicates()
+        tm.assert_index_equal(idx, result)
+        assert idx.freq == result.freq
+
+        idx_dup = idx.append(idx)  # freq will not be reset
+        result = idx_dup.drop_duplicates()
+        tm.assert_index_equal(idx, result)
+        assert idx.freq == result.freq
+
+    def test_drop_duplicates(self):
+        # to check Index/Series compat
+        base = pd.period_range('2011-01-01', '2011-01-31', freq='D',
+                               name='idx')
+        idx = base.append(base[:5])
+
+        res = idx.drop_duplicates()
+        tm.assert_index_equal(res, base)
+        res = Series(idx).drop_duplicates()
+        tm.assert_series_equal(res, Series(base))
+
+        res = 
idx.drop_duplicates(keep='last') + exp = base[5:].append(base[:5]) + tm.assert_index_equal(res, exp) + res = Series(idx).drop_duplicates(keep='last') + tm.assert_series_equal(res, Series(exp, index=np.arange(5, 36))) + + res = idx.drop_duplicates(keep=False) + tm.assert_index_equal(res, base[5:]) + res = Series(idx).drop_duplicates(keep=False) + tm.assert_series_equal(res, Series(base[5:], index=np.arange(5, 31))) + + def test_order_compat(self): + def _check_freq(index, expected_index): + if isinstance(index, PeriodIndex): + assert index.freq == expected_index.freq + + pidx = PeriodIndex(['2011', '2012', '2013'], name='pidx', freq='A') + # for compatibility check + iidx = Index([2011, 2012, 2013], name='idx') + for idx in [pidx, iidx]: + ordered = idx.sort_values() + tm.assert_index_equal(ordered, idx) + _check_freq(ordered, idx) + + ordered = idx.sort_values(ascending=False) + tm.assert_index_equal(ordered, idx[::-1]) + _check_freq(ordered, idx[::-1]) + + ordered, indexer = idx.sort_values(return_indexer=True) + tm.assert_index_equal(ordered, idx) + tm.assert_numpy_array_equal(indexer, np.array([0, 1, 2]), + check_dtype=False) + _check_freq(ordered, idx) + + ordered, indexer = idx.sort_values(return_indexer=True, + ascending=False) + tm.assert_index_equal(ordered, idx[::-1]) + tm.assert_numpy_array_equal(indexer, np.array([2, 1, 0]), + check_dtype=False) + _check_freq(ordered, idx[::-1]) + + pidx = PeriodIndex(['2011', '2013', '2015', '2012', + '2011'], name='pidx', freq='A') + pexpected = PeriodIndex( + ['2011', '2011', '2012', '2013', '2015'], name='pidx', freq='A') + # for compatibility check + iidx = Index([2011, 2013, 2015, 2012, 2011], name='idx') + iexpected = Index([2011, 2011, 2012, 2013, 2015], name='idx') + for idx, expected in [(pidx, pexpected), (iidx, iexpected)]: + ordered = idx.sort_values() + tm.assert_index_equal(ordered, expected) + _check_freq(ordered, idx) + + ordered = idx.sort_values(ascending=False) + tm.assert_index_equal(ordered, expected[::-1]) + _check_freq(ordered, idx) + + ordered, indexer = idx.sort_values(return_indexer=True) + tm.assert_index_equal(ordered, expected) + + exp = np.array([0, 4, 3, 1, 2]) + tm.assert_numpy_array_equal(indexer, exp, check_dtype=False) + _check_freq(ordered, idx) + + ordered, indexer = idx.sort_values(return_indexer=True, + ascending=False) + tm.assert_index_equal(ordered, expected[::-1]) + + exp = np.array([2, 1, 3, 4, 0]) + tm.assert_numpy_array_equal(indexer, exp, check_dtype=False) + _check_freq(ordered, idx) + + pidx = PeriodIndex(['2011', '2013', 'NaT', '2011'], name='pidx', + freq='D') + + result = pidx.sort_values() + expected = PeriodIndex(['NaT', '2011', '2011', '2013'], + name='pidx', freq='D') + tm.assert_index_equal(result, expected) + assert result.freq == 'D' + + result = pidx.sort_values(ascending=False) + expected = PeriodIndex( + ['2013', '2011', '2011', 'NaT'], name='pidx', freq='D') + tm.assert_index_equal(result, expected) + assert result.freq == 'D' + + def test_order(self): + for freq in ['D', '2D', '4D']: + idx = PeriodIndex(['2011-01-01', '2011-01-02', '2011-01-03'], + freq=freq, name='idx') + + ordered = idx.sort_values() + tm.assert_index_equal(ordered, idx) + assert ordered.freq == idx.freq + + ordered = idx.sort_values(ascending=False) + expected = idx[::-1] + tm.assert_index_equal(ordered, expected) + assert ordered.freq == expected.freq + assert ordered.freq == freq + + ordered, indexer = idx.sort_values(return_indexer=True) + tm.assert_index_equal(ordered, idx) + 
tm.assert_numpy_array_equal(indexer, np.array([0, 1, 2]),
+                                        check_dtype=False)
+            assert ordered.freq == idx.freq
+            assert ordered.freq == freq
+
+            ordered, indexer = idx.sort_values(return_indexer=True,
+                                               ascending=False)
+            expected = idx[::-1]
+            tm.assert_index_equal(ordered, expected)
+            tm.assert_numpy_array_equal(indexer, np.array([2, 1, 0]),
+                                        check_dtype=False)
+            assert ordered.freq == expected.freq
+            assert ordered.freq == freq
+
+        idx1 = PeriodIndex(['2011-01-01', '2011-01-03', '2011-01-05',
+                            '2011-01-02', '2011-01-01'], freq='D', name='idx1')
+        exp1 = PeriodIndex(['2011-01-01', '2011-01-01', '2011-01-02',
+                            '2011-01-03', '2011-01-05'], freq='D', name='idx1')
+
+        idx2 = PeriodIndex(['2011-01-01', '2011-01-03', '2011-01-05',
+                            '2011-01-02', '2011-01-01'],
+                           freq='D', name='idx2')
+        exp2 = PeriodIndex(['2011-01-01', '2011-01-01', '2011-01-02',
+                            '2011-01-03', '2011-01-05'],
+                           freq='D', name='idx2')
+
+        idx3 = PeriodIndex([pd.NaT, '2011-01-03', '2011-01-05',
+                            '2011-01-02', pd.NaT], freq='D', name='idx3')
+        exp3 = PeriodIndex([pd.NaT, pd.NaT, '2011-01-02', '2011-01-03',
+                            '2011-01-05'], freq='D', name='idx3')
+
+        for idx, expected in [(idx1, exp1), (idx2, exp2), (idx3, exp3)]:
+            ordered = idx.sort_values()
+            tm.assert_index_equal(ordered, expected)
+            assert ordered.freq == 'D'
+
+            ordered = idx.sort_values(ascending=False)
+            tm.assert_index_equal(ordered, expected[::-1])
+            assert ordered.freq == 'D'
+
+            ordered, indexer = idx.sort_values(return_indexer=True)
+            tm.assert_index_equal(ordered, expected)
+
+            exp = np.array([0, 4, 3, 1, 2])
+            tm.assert_numpy_array_equal(indexer, exp, check_dtype=False)
+            assert ordered.freq == 'D'
+
+            ordered, indexer = idx.sort_values(return_indexer=True,
+                                               ascending=False)
+            tm.assert_index_equal(ordered, expected[::-1])
+
+            exp = np.array([2, 1, 3, 4, 0])
+            tm.assert_numpy_array_equal(indexer, exp, check_dtype=False)
+            assert ordered.freq == 'D'
+
+    def test_nat_new(self):
+
+        idx = pd.period_range('2011-01', freq='M', periods=5, name='x')
+        result = idx._nat_new()
+        exp = pd.PeriodIndex([pd.NaT] * 5, freq='M', name='x')
+        tm.assert_index_equal(result, exp)
+
+        result = idx._nat_new(box=False)
+        exp = np.array([tslib.iNaT] * 5, dtype=np.int64)
+        tm.assert_numpy_array_equal(result, exp)
+
+    def test_shift(self):
+        # GH 9903
+        idx = pd.PeriodIndex([], name='xxx', freq='H')
+
+        with pytest.raises(TypeError):
+            # period shift doesn't accept freq
+            idx.shift(1, freq='H')
+
+        tm.assert_index_equal(idx.shift(0), idx)
+        tm.assert_index_equal(idx.shift(3), idx)
+
+        idx = pd.PeriodIndex(['2011-01-01 10:00', '2011-01-01 11:00',
+                              '2011-01-01 12:00'], name='xxx', freq='H')
+        tm.assert_index_equal(idx.shift(0), idx)
+        exp = pd.PeriodIndex(['2011-01-01 13:00', '2011-01-01 14:00',
+                              '2011-01-01 15:00'], name='xxx', freq='H')
+        tm.assert_index_equal(idx.shift(3), exp)
+        exp = pd.PeriodIndex(['2011-01-01 07:00', '2011-01-01 08:00',
+                              '2011-01-01 09:00'], name='xxx', freq='H')
+        tm.assert_index_equal(idx.shift(-3), exp)
+
+    def test_repeat(self):
+        index = pd.period_range('2001-01-01', periods=2, freq='D')
+        exp = pd.PeriodIndex(['2001-01-01', '2001-01-01',
+                              '2001-01-02', '2001-01-02'], freq='D')
+        for res in [index.repeat(2), np.repeat(index, 2)]:
+            tm.assert_index_equal(res, exp)
+
+        index = pd.period_range('2001-01-01', periods=2, freq='2D')
+        exp = pd.PeriodIndex(['2001-01-01', '2001-01-01',
+                              '2001-01-03', '2001-01-03'], freq='2D')
+        for res in [index.repeat(2), np.repeat(index, 2)]:
+            tm.assert_index_equal(res, exp)
+
+        index = pd.PeriodIndex(['2001-01', 'NaT', 
'2003-01'], freq='M') + exp = pd.PeriodIndex(['2001-01', '2001-01', '2001-01', + 'NaT', 'NaT', 'NaT', + '2003-01', '2003-01', '2003-01'], freq='M') + for res in [index.repeat(3), np.repeat(index, 3)]: + tm.assert_index_equal(res, exp) + + def test_nat(self): + assert pd.PeriodIndex._na_value is pd.NaT + assert pd.PeriodIndex([], freq='M')._na_value is pd.NaT + + idx = pd.PeriodIndex(['2011-01-01', '2011-01-02'], freq='D') + assert idx._can_hold_na + + tm.assert_numpy_array_equal(idx._isnan, np.array([False, False])) + assert not idx.hasnans + tm.assert_numpy_array_equal(idx._nan_idxs, + np.array([], dtype=np.intp)) + + idx = pd.PeriodIndex(['2011-01-01', 'NaT'], freq='D') + assert idx._can_hold_na + + tm.assert_numpy_array_equal(idx._isnan, np.array([False, True])) + assert idx.hasnans + tm.assert_numpy_array_equal(idx._nan_idxs, + np.array([1], dtype=np.intp)) + + def test_equals(self): + # GH 13107 + for freq in ['D', 'M']: + idx = pd.PeriodIndex(['2011-01-01', '2011-01-02', 'NaT'], + freq=freq) + assert idx.equals(idx) + assert idx.equals(idx.copy()) + assert idx.equals(idx.asobject) + assert idx.asobject.equals(idx) + assert idx.asobject.equals(idx.asobject) + assert not idx.equals(list(idx)) + assert not idx.equals(pd.Series(idx)) + + idx2 = pd.PeriodIndex(['2011-01-01', '2011-01-02', 'NaT'], + freq='H') + assert not idx.equals(idx2) + assert not idx.equals(idx2.copy()) + assert not idx.equals(idx2.asobject) + assert not idx.asobject.equals(idx2) + assert not idx.equals(list(idx2)) + assert not idx.equals(pd.Series(idx2)) + + # same internal, different tz + idx3 = pd.PeriodIndex._simple_new(idx.asi8, freq='H') + tm.assert_numpy_array_equal(idx.asi8, idx3.asi8) + assert not idx.equals(idx3) + assert not idx.equals(idx3.copy()) + assert not idx.equals(idx3.asobject) + assert not idx.asobject.equals(idx3) + assert not idx.equals(list(idx3)) + assert not idx.equals(pd.Series(idx3)) + + +class TestPeriodIndexSeriesMethods(object): + """ Test PeriodIndex and Period Series Ops consistency """ + + def _check(self, values, func, expected): + idx = pd.PeriodIndex(values) + result = func(idx) + if isinstance(expected, pd.Index): + tm.assert_index_equal(result, expected) + else: + # comp op results in bool + tm.assert_numpy_array_equal(result, expected) + + s = pd.Series(values) + result = func(s) + + exp = pd.Series(expected, name=values.name) + tm.assert_series_equal(result, exp) + + def test_pi_ops(self): + idx = PeriodIndex(['2011-01', '2011-02', '2011-03', + '2011-04'], freq='M', name='idx') + + expected = PeriodIndex(['2011-03', '2011-04', + '2011-05', '2011-06'], freq='M', name='idx') + self._check(idx, lambda x: x + 2, expected) + self._check(idx, lambda x: 2 + x, expected) + + self._check(idx + 2, lambda x: x - 2, idx) + result = idx - Period('2011-01', freq='M') + exp = pd.Index([0, 1, 2, 3], name='idx') + tm.assert_index_equal(result, exp) + + result = Period('2011-01', freq='M') - idx + exp = pd.Index([0, -1, -2, -3], name='idx') + tm.assert_index_equal(result, exp) + + def test_pi_ops_errors(self): + idx = PeriodIndex(['2011-01', '2011-02', '2011-03', + '2011-04'], freq='M', name='idx') + s = pd.Series(idx) + + msg = r"unsupported operand type\(s\)" + + for obj in [idx, s]: + for ng in ["str", 1.5]: + with tm.assert_raises_regex(TypeError, msg): + obj + ng + + with pytest.raises(TypeError): + # error message differs between PY2 and 3 + ng + obj + + with tm.assert_raises_regex(TypeError, msg): + obj - ng + + with pytest.raises(TypeError): + np.add(obj, ng) + + if 
_np_version_under1p10: + assert np.add(ng, obj) is NotImplemented + else: + with pytest.raises(TypeError): + np.add(ng, obj) + + with pytest.raises(TypeError): + np.subtract(obj, ng) + + if _np_version_under1p10: + assert np.subtract(ng, obj) is NotImplemented + else: + with pytest.raises(TypeError): + np.subtract(ng, obj) + + def test_pi_ops_nat(self): + idx = PeriodIndex(['2011-01', '2011-02', 'NaT', + '2011-04'], freq='M', name='idx') + expected = PeriodIndex(['2011-03', '2011-04', + 'NaT', '2011-06'], freq='M', name='idx') + self._check(idx, lambda x: x + 2, expected) + self._check(idx, lambda x: 2 + x, expected) + self._check(idx, lambda x: np.add(x, 2), expected) + + self._check(idx + 2, lambda x: x - 2, idx) + self._check(idx + 2, lambda x: np.subtract(x, 2), idx) + + # freq with mult + idx = PeriodIndex(['2011-01', '2011-02', 'NaT', + '2011-04'], freq='2M', name='idx') + expected = PeriodIndex(['2011-07', '2011-08', + 'NaT', '2011-10'], freq='2M', name='idx') + self._check(idx, lambda x: x + 3, expected) + self._check(idx, lambda x: 3 + x, expected) + self._check(idx, lambda x: np.add(x, 3), expected) + + self._check(idx + 3, lambda x: x - 3, idx) + self._check(idx + 3, lambda x: np.subtract(x, 3), idx) + + def test_pi_ops_array_int(self): + idx = PeriodIndex(['2011-01', '2011-02', 'NaT', + '2011-04'], freq='M', name='idx') + f = lambda x: x + np.array([1, 2, 3, 4]) + exp = PeriodIndex(['2011-02', '2011-04', 'NaT', + '2011-08'], freq='M', name='idx') + self._check(idx, f, exp) + + f = lambda x: np.add(x, np.array([4, -1, 1, 2])) + exp = PeriodIndex(['2011-05', '2011-01', 'NaT', + '2011-06'], freq='M', name='idx') + self._check(idx, f, exp) + + f = lambda x: x - np.array([1, 2, 3, 4]) + exp = PeriodIndex(['2010-12', '2010-12', 'NaT', + '2010-12'], freq='M', name='idx') + self._check(idx, f, exp) + + f = lambda x: np.subtract(x, np.array([3, 2, 3, -2])) + exp = PeriodIndex(['2010-10', '2010-12', 'NaT', + '2011-06'], freq='M', name='idx') + self._check(idx, f, exp) + + def test_pi_ops_offset(self): + idx = PeriodIndex(['2011-01-01', '2011-02-01', '2011-03-01', + '2011-04-01'], freq='D', name='idx') + f = lambda x: x + offsets.Day() + exp = PeriodIndex(['2011-01-02', '2011-02-02', '2011-03-02', + '2011-04-02'], freq='D', name='idx') + self._check(idx, f, exp) + + f = lambda x: x + offsets.Day(2) + exp = PeriodIndex(['2011-01-03', '2011-02-03', '2011-03-03', + '2011-04-03'], freq='D', name='idx') + self._check(idx, f, exp) + + f = lambda x: x - offsets.Day(2) + exp = PeriodIndex(['2010-12-30', '2011-01-30', '2011-02-27', + '2011-03-30'], freq='D', name='idx') + self._check(idx, f, exp) + + def test_pi_offset_errors(self): + idx = PeriodIndex(['2011-01-01', '2011-02-01', '2011-03-01', + '2011-04-01'], freq='D', name='idx') + s = pd.Series(idx) + + # Series op is applied per Period instance, thus error is raised + # from Period + msg_idx = r"Input has different freq from PeriodIndex\(freq=D\)" + msg_s = r"Input cannot be converted to Period\(freq=D\)" + for obj, msg in [(idx, msg_idx), (s, msg_s)]: + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + obj + offsets.Hour(2) + + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + offsets.Hour(2) + obj + + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + obj - offsets.Hour(2) + + def test_pi_sub_period(self): + # GH 13071 + idx = PeriodIndex(['2011-01', '2011-02', '2011-03', + '2011-04'], freq='M', name='idx') + + result = idx - pd.Period('2012-01', freq='M') + exp = pd.Index([-12, -11, 
-10, -9], name='idx') + tm.assert_index_equal(result, exp) + + result = np.subtract(idx, pd.Period('2012-01', freq='M')) + tm.assert_index_equal(result, exp) + + result = pd.Period('2012-01', freq='M') - idx + exp = pd.Index([12, 11, 10, 9], name='idx') + tm.assert_index_equal(result, exp) + + result = np.subtract(pd.Period('2012-01', freq='M'), idx) + if _np_version_under1p10: + assert result is NotImplemented + else: + tm.assert_index_equal(result, exp) + + exp = pd.TimedeltaIndex([np.nan, np.nan, np.nan, np.nan], name='idx') + tm.assert_index_equal(idx - pd.Period('NaT', freq='M'), exp) + tm.assert_index_equal(pd.Period('NaT', freq='M') - idx, exp) + + def test_pi_sub_pdnat(self): + # GH 13071 + idx = PeriodIndex(['2011-01', '2011-02', 'NaT', + '2011-04'], freq='M', name='idx') + exp = pd.TimedeltaIndex([pd.NaT] * 4, name='idx') + tm.assert_index_equal(pd.NaT - idx, exp) + tm.assert_index_equal(idx - pd.NaT, exp) + + def test_pi_sub_period_nat(self): + # GH 13071 + idx = PeriodIndex(['2011-01', 'NaT', '2011-03', + '2011-04'], freq='M', name='idx') + + result = idx - pd.Period('2012-01', freq='M') + exp = pd.Index([-12, np.nan, -10, -9], name='idx') + tm.assert_index_equal(result, exp) + + result = pd.Period('2012-01', freq='M') - idx + exp = pd.Index([12, np.nan, 10, 9], name='idx') + tm.assert_index_equal(result, exp) + + exp = pd.TimedeltaIndex([np.nan, np.nan, np.nan, np.nan], name='idx') + tm.assert_index_equal(idx - pd.Period('NaT', freq='M'), exp) + tm.assert_index_equal(pd.Period('NaT', freq='M') - idx, exp) + + def test_pi_comp_period(self): + idx = PeriodIndex(['2011-01', '2011-02', '2011-03', + '2011-04'], freq='M', name='idx') + + f = lambda x: x == pd.Period('2011-03', freq='M') + exp = np.array([False, False, True, False], dtype=np.bool) + self._check(idx, f, exp) + f = lambda x: pd.Period('2011-03', freq='M') == x + self._check(idx, f, exp) + + f = lambda x: x != pd.Period('2011-03', freq='M') + exp = np.array([True, True, False, True], dtype=np.bool) + self._check(idx, f, exp) + f = lambda x: pd.Period('2011-03', freq='M') != x + self._check(idx, f, exp) + + f = lambda x: pd.Period('2011-03', freq='M') >= x + exp = np.array([True, True, True, False], dtype=np.bool) + self._check(idx, f, exp) + + f = lambda x: x > pd.Period('2011-03', freq='M') + exp = np.array([False, False, False, True], dtype=np.bool) + self._check(idx, f, exp) + + f = lambda x: pd.Period('2011-03', freq='M') >= x + exp = np.array([True, True, True, False], dtype=np.bool) + self._check(idx, f, exp) + + def test_pi_comp_period_nat(self): + idx = PeriodIndex(['2011-01', 'NaT', '2011-03', + '2011-04'], freq='M', name='idx') + + f = lambda x: x == pd.Period('2011-03', freq='M') + exp = np.array([False, False, True, False], dtype=np.bool) + self._check(idx, f, exp) + f = lambda x: pd.Period('2011-03', freq='M') == x + self._check(idx, f, exp) + + f = lambda x: x == tslib.NaT + exp = np.array([False, False, False, False], dtype=np.bool) + self._check(idx, f, exp) + f = lambda x: tslib.NaT == x + self._check(idx, f, exp) + + f = lambda x: x != pd.Period('2011-03', freq='M') + exp = np.array([True, True, False, True], dtype=np.bool) + self._check(idx, f, exp) + f = lambda x: pd.Period('2011-03', freq='M') != x + self._check(idx, f, exp) + + f = lambda x: x != tslib.NaT + exp = np.array([True, True, True, True], dtype=np.bool) + self._check(idx, f, exp) + f = lambda x: tslib.NaT != x + self._check(idx, f, exp) + + f = lambda x: pd.Period('2011-03', freq='M') >= x + exp = np.array([True, False, True, False], 
dtype=np.bool) + self._check(idx, f, exp) + + f = lambda x: x < pd.Period('2011-03', freq='M') + exp = np.array([True, False, False, False], dtype=np.bool) + self._check(idx, f, exp) + + f = lambda x: x > tslib.NaT + exp = np.array([False, False, False, False], dtype=np.bool) + self._check(idx, f, exp) + + f = lambda x: tslib.NaT >= x + exp = np.array([False, False, False, False], dtype=np.bool) + self._check(idx, f, exp) + + +class TestSeriesPeriod(object): + + def setup_method(self, method): + self.series = Series(period_range('2000-01-01', periods=10, freq='D')) + + def test_ops_series_timedelta(self): + # GH 13043 + s = pd.Series([pd.Period('2015-01-01', freq='D'), + pd.Period('2015-01-02', freq='D')], name='xxx') + assert s.dtype == object + + exp = pd.Series([pd.Period('2015-01-02', freq='D'), + pd.Period('2015-01-03', freq='D')], name='xxx') + tm.assert_series_equal(s + pd.Timedelta('1 days'), exp) + tm.assert_series_equal(pd.Timedelta('1 days') + s, exp) + + tm.assert_series_equal(s + pd.tseries.offsets.Day(), exp) + tm.assert_series_equal(pd.tseries.offsets.Day() + s, exp) + + def test_ops_series_period(self): + # GH 13043 + s = pd.Series([pd.Period('2015-01-01', freq='D'), + pd.Period('2015-01-02', freq='D')], name='xxx') + assert s.dtype == object + + p = pd.Period('2015-01-10', freq='D') + # dtype will be object because of original dtype + exp = pd.Series([9, 8], name='xxx', dtype=object) + tm.assert_series_equal(p - s, exp) + tm.assert_series_equal(s - p, -exp) + + s2 = pd.Series([pd.Period('2015-01-05', freq='D'), + pd.Period('2015-01-04', freq='D')], name='xxx') + assert s2.dtype == object + + exp = pd.Series([4, 2], name='xxx', dtype=object) + tm.assert_series_equal(s2 - s, exp) + tm.assert_series_equal(s - s2, -exp) + + +class TestFramePeriod(object): + + def test_ops_frame_period(self): + # GH 13043 + df = pd.DataFrame({'A': [pd.Period('2015-01', freq='M'), + pd.Period('2015-02', freq='M')], + 'B': [pd.Period('2014-01', freq='M'), + pd.Period('2014-02', freq='M')]}) + assert df['A'].dtype == object + assert df['B'].dtype == object + + p = pd.Period('2015-03', freq='M') + # dtype will be object because of original dtype + exp = pd.DataFrame({'A': np.array([2, 1], dtype=object), + 'B': np.array([14, 13], dtype=object)}) + tm.assert_frame_equal(p - df, exp) + tm.assert_frame_equal(df - p, -exp) + + df2 = pd.DataFrame({'A': [pd.Period('2015-05', freq='M'), + pd.Period('2015-06', freq='M')], + 'B': [pd.Period('2015-05', freq='M'), + pd.Period('2015-06', freq='M')]}) + assert df2['A'].dtype == object + assert df2['B'].dtype == object + + exp = pd.DataFrame({'A': np.array([4, 4], dtype=object), + 'B': np.array([16, 16], dtype=object)}) + tm.assert_frame_equal(df2 - df, exp) + tm.assert_frame_equal(df - df2, -exp) + + +class TestPeriodIndexComparisons(object): + + def test_pi_pi_comp(self): + + for freq in ['M', '2M', '3M']: + base = PeriodIndex(['2011-01', '2011-02', + '2011-03', '2011-04'], freq=freq) + p = Period('2011-02', freq=freq) + + exp = np.array([False, True, False, False]) + tm.assert_numpy_array_equal(base == p, exp) + tm.assert_numpy_array_equal(p == base, exp) + + exp = np.array([True, False, True, True]) + tm.assert_numpy_array_equal(base != p, exp) + tm.assert_numpy_array_equal(p != base, exp) + + exp = np.array([False, False, True, True]) + tm.assert_numpy_array_equal(base > p, exp) + tm.assert_numpy_array_equal(p < base, exp) + + exp = np.array([True, False, False, False]) + tm.assert_numpy_array_equal(base < p, exp) + tm.assert_numpy_array_equal(p > base, exp) 
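+            # scalar Period comparisons broadcast elementwise; each check
+            # below is asserted for both operand orders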
+ + exp = np.array([False, True, True, True]) + tm.assert_numpy_array_equal(base >= p, exp) + tm.assert_numpy_array_equal(p <= base, exp) + + exp = np.array([True, True, False, False]) + tm.assert_numpy_array_equal(base <= p, exp) + tm.assert_numpy_array_equal(p >= base, exp) + + idx = PeriodIndex(['2011-02', '2011-01', '2011-03', + '2011-05'], freq=freq) + + exp = np.array([False, False, True, False]) + tm.assert_numpy_array_equal(base == idx, exp) + + exp = np.array([True, True, False, True]) + tm.assert_numpy_array_equal(base != idx, exp) + + exp = np.array([False, True, False, False]) + tm.assert_numpy_array_equal(base > idx, exp) + + exp = np.array([True, False, False, True]) + tm.assert_numpy_array_equal(base < idx, exp) + + exp = np.array([False, True, True, False]) + tm.assert_numpy_array_equal(base >= idx, exp) + + exp = np.array([True, False, True, True]) + tm.assert_numpy_array_equal(base <= idx, exp) + + # different base freq + msg = "Input has different freq=A-DEC from PeriodIndex" + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + base <= Period('2011', freq='A') + + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + Period('2011', freq='A') >= base + + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + idx = PeriodIndex(['2011', '2012', '2013', '2014'], freq='A') + base <= idx + + # Different frequency + msg = "Input has different freq=4M from PeriodIndex" + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + base <= Period('2011', freq='4M') + + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + Period('2011', freq='4M') >= base + + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + idx = PeriodIndex(['2011', '2012', '2013', '2014'], freq='4M') + base <= idx + + def test_pi_nat_comp(self): + for freq in ['M', '2M', '3M']: + idx1 = PeriodIndex( + ['2011-01', '2011-02', 'NaT', '2011-05'], freq=freq) + + result = idx1 > Period('2011-02', freq=freq) + exp = np.array([False, False, False, True]) + tm.assert_numpy_array_equal(result, exp) + result = Period('2011-02', freq=freq) < idx1 + tm.assert_numpy_array_equal(result, exp) + + result = idx1 == Period('NaT', freq=freq) + exp = np.array([False, False, False, False]) + tm.assert_numpy_array_equal(result, exp) + result = Period('NaT', freq=freq) == idx1 + tm.assert_numpy_array_equal(result, exp) + + result = idx1 != Period('NaT', freq=freq) + exp = np.array([True, True, True, True]) + tm.assert_numpy_array_equal(result, exp) + result = Period('NaT', freq=freq) != idx1 + tm.assert_numpy_array_equal(result, exp) + + idx2 = PeriodIndex(['2011-02', '2011-01', '2011-04', + 'NaT'], freq=freq) + result = idx1 < idx2 + exp = np.array([True, False, False, False]) + tm.assert_numpy_array_equal(result, exp) + + result = idx1 == idx2 + exp = np.array([False, False, False, False]) + tm.assert_numpy_array_equal(result, exp) + + result = idx1 != idx2 + exp = np.array([True, True, True, True]) + tm.assert_numpy_array_equal(result, exp) + + result = idx1 == idx1 + exp = np.array([True, True, False, True]) + tm.assert_numpy_array_equal(result, exp) + + result = idx1 != idx1 + exp = np.array([False, False, True, False]) + tm.assert_numpy_array_equal(result, exp) + + diff = PeriodIndex(['2011-02', '2011-01', '2011-04', + 'NaT'], freq='4M') + msg = "Input has different freq=4M from PeriodIndex" + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + idx1 > diff + + with tm.assert_raises_regex( + period.IncompatibleFrequency, 
msg): + idx1 == diff diff --git a/pandas/tests/indexes/period/test_partial_slicing.py b/pandas/tests/indexes/period/test_partial_slicing.py new file mode 100644 index 0000000000000..6d142722c315a --- /dev/null +++ b/pandas/tests/indexes/period/test_partial_slicing.py @@ -0,0 +1,141 @@ +import pytest + +import numpy as np + +import pandas as pd +from pandas.util import testing as tm +from pandas import (Series, period_range, DatetimeIndex, PeriodIndex, + DataFrame, _np_version_under1p12, Period) + + +class TestPeriodIndex(object): + + def setup_method(self, method): + pass + + def test_slice_with_negative_step(self): + ts = Series(np.arange(20), + period_range('2014-01', periods=20, freq='M')) + SLC = pd.IndexSlice + + def assert_slices_equivalent(l_slc, i_slc): + tm.assert_series_equal(ts[l_slc], ts.iloc[i_slc]) + tm.assert_series_equal(ts.loc[l_slc], ts.iloc[i_slc]) + tm.assert_series_equal(ts.loc[l_slc], ts.iloc[i_slc]) + + assert_slices_equivalent(SLC[Period('2014-10')::-1], SLC[9::-1]) + assert_slices_equivalent(SLC['2014-10'::-1], SLC[9::-1]) + + assert_slices_equivalent(SLC[:Period('2014-10'):-1], SLC[:8:-1]) + assert_slices_equivalent(SLC[:'2014-10':-1], SLC[:8:-1]) + + assert_slices_equivalent(SLC['2015-02':'2014-10':-1], SLC[13:8:-1]) + assert_slices_equivalent(SLC[Period('2015-02'):Period('2014-10'):-1], + SLC[13:8:-1]) + assert_slices_equivalent(SLC['2015-02':Period('2014-10'):-1], + SLC[13:8:-1]) + assert_slices_equivalent(SLC[Period('2015-02'):'2014-10':-1], + SLC[13:8:-1]) + + assert_slices_equivalent(SLC['2014-10':'2015-02':-1], SLC[:0]) + + def test_slice_with_zero_step_raises(self): + ts = Series(np.arange(20), + period_range('2014-01', periods=20, freq='M')) + tm.assert_raises_regex(ValueError, 'slice step cannot be zero', + lambda: ts[::0]) + tm.assert_raises_regex(ValueError, 'slice step cannot be zero', + lambda: ts.loc[::0]) + tm.assert_raises_regex(ValueError, 'slice step cannot be zero', + lambda: ts.loc[::0]) + + def test_slice_keep_name(self): + idx = period_range('20010101', periods=10, freq='D', name='bob') + assert idx.name == idx[1:].name + + def test_pindex_slice_index(self): + pi = PeriodIndex(start='1/1/10', end='12/31/12', freq='M') + s = Series(np.random.rand(len(pi)), index=pi) + res = s['2010'] + exp = s[0:12] + tm.assert_series_equal(res, exp) + res = s['2011'] + exp = s[12:24] + tm.assert_series_equal(res, exp) + + def test_range_slice_day(self): + # GH 6716 + didx = DatetimeIndex(start='2013/01/01', freq='D', periods=400) + pidx = PeriodIndex(start='2013/01/01', freq='D', periods=400) + + # changed to TypeError in 1.12 + # https://github.com/numpy/numpy/pull/6271 + exc = IndexError if _np_version_under1p12 else TypeError + + for idx in [didx, pidx]: + # slices against index should raise IndexError + values = ['2014', '2013/02', '2013/01/02', '2013/02/01 9H', + '2013/02/01 09:00'] + for v in values: + with pytest.raises(exc): + idx[v:] + + s = Series(np.random.rand(len(idx)), index=idx) + + tm.assert_series_equal(s['2013/01/02':], s[1:]) + tm.assert_series_equal(s['2013/01/02':'2013/01/05'], s[1:5]) + tm.assert_series_equal(s['2013/02':], s[31:]) + tm.assert_series_equal(s['2014':], s[365:]) + + invalid = ['2013/02/01 9H', '2013/02/01 09:00'] + for v in invalid: + with pytest.raises(exc): + idx[v:] + + def test_range_slice_seconds(self): + # GH 6716 + didx = DatetimeIndex(start='2013/01/01 09:00:00', freq='S', + periods=4000) + pidx = PeriodIndex(start='2013/01/01 09:00:00', freq='S', periods=4000) + + # changed to TypeError in 1.12 + # 
https://github.com/numpy/numpy/pull/6271 + exc = IndexError if _np_version_under1p12 else TypeError + + for idx in [didx, pidx]: + # slices against index should raise IndexError + values = ['2014', '2013/02', '2013/01/02', '2013/02/01 9H', + '2013/02/01 09:00'] + for v in values: + with pytest.raises(exc): + idx[v:] + + s = Series(np.random.rand(len(idx)), index=idx) + + tm.assert_series_equal(s['2013/01/01 09:05':'2013/01/01 09:10'], + s[300:660]) + tm.assert_series_equal(s['2013/01/01 10:00':'2013/01/01 10:05'], + s[3600:3960]) + tm.assert_series_equal(s['2013/01/01 10H':], s[3600:]) + tm.assert_series_equal(s[:'2013/01/01 09:30'], s[:1860]) + for d in ['2013/01/01', '2013/01', '2013']: + tm.assert_series_equal(s[d:], s) + + def test_range_slice_outofbounds(self): + # GH 5407 + didx = DatetimeIndex(start='2013/10/01', freq='D', periods=10) + pidx = PeriodIndex(start='2013/10/01', freq='D', periods=10) + + for idx in [didx, pidx]: + df = DataFrame(dict(units=[100 + i for i in range(10)]), index=idx) + empty = DataFrame(index=idx.__class__([], freq='D'), + columns=['units']) + empty['units'] = empty['units'].astype('int64') + + tm.assert_frame_equal(df['2013/09/01':'2013/09/30'], empty) + tm.assert_frame_equal(df['2013/09/30':'2013/10/02'], df.iloc[:2]) + tm.assert_frame_equal(df['2013/10/01':'2013/10/02'], df.iloc[:2]) + tm.assert_frame_equal(df['2013/10/02':'2013/09/30'], empty) + tm.assert_frame_equal(df['2013/10/15':'2013/10/17'], empty) + tm.assert_frame_equal(df['2013-06':'2013-09'], empty) + tm.assert_frame_equal(df['2013-11':'2013-12'], empty) diff --git a/pandas/tests/indexes/period/test_period.py b/pandas/tests/indexes/period/test_period.py new file mode 100644 index 0000000000000..6f73e7c15e4d9 --- /dev/null +++ b/pandas/tests/indexes/period/test_period.py @@ -0,0 +1,775 @@ +import pytest + +import numpy as np +from numpy.random import randn +from datetime import timedelta + +import pandas as pd +from pandas.util import testing as tm +from pandas import (PeriodIndex, period_range, notnull, DatetimeIndex, NaT, + Index, Period, Int64Index, Series, DataFrame, date_range, + offsets, compat) + +from ..datetimelike import DatetimeLike + + +class TestPeriodIndex(DatetimeLike): + _holder = PeriodIndex + _multiprocess_can_split_ = True + + def setup_method(self, method): + self.indices = dict(index=tm.makePeriodIndex(10)) + self.setup_indices() + + def create_index(self): + return period_range('20130101', periods=5, freq='D') + + def test_astype(self): + # GH 13149, GH 13209 + idx = PeriodIndex(['2016-05-16', 'NaT', NaT, np.NaN], freq='D') + + result = idx.astype(object) + expected = Index([Period('2016-05-16', freq='D')] + + [Period(NaT, freq='D')] * 3, dtype='object') + tm.assert_index_equal(result, expected) + + result = idx.astype(int) + expected = Int64Index([16937] + [-9223372036854775808] * 3, + dtype=np.int64) + tm.assert_index_equal(result, expected) + + idx = period_range('1990', '2009', freq='A') + result = idx.astype('i8') + tm.assert_index_equal(result, Index(idx.asi8)) + tm.assert_numpy_array_equal(result.values, idx.asi8) + + def test_astype_raises(self): + # GH 13149, GH 13209 + idx = PeriodIndex(['2016-05-16', 'NaT', NaT, np.NaN], freq='D') + + pytest.raises(ValueError, idx.astype, str) + pytest.raises(ValueError, idx.astype, float) + pytest.raises(ValueError, idx.astype, 'timedelta64') + pytest.raises(ValueError, idx.astype, 'timedelta64[ns]') + + def test_pickle_compat_construction(self): + pass + + def test_pickle_round_trip(self): + for freq in ['D', 'M', 
'A']:
+            idx = PeriodIndex(['2016-05-16', 'NaT', NaT, np.NaN], freq=freq)
+            result = tm.round_trip_pickle(idx)
+            tm.assert_index_equal(result, idx)
+
+    def test_get_loc(self):
+        idx = pd.period_range('2000-01-01', periods=3)
+
+        for method in [None, 'pad', 'backfill', 'nearest']:
+            assert idx.get_loc(idx[1], method) == 1
+            assert idx.get_loc(idx[1].asfreq('H', how='start'), method) == 1
+            assert idx.get_loc(idx[1].to_timestamp(), method) == 1
+            assert idx.get_loc(idx[1].to_timestamp()
+                               .to_pydatetime(), method) == 1
+            assert idx.get_loc(str(idx[1]), method) == 1
+
+        idx = pd.period_range('2000-01-01', periods=5)[::2]
+        assert idx.get_loc('2000-01-02T12', method='nearest',
+                           tolerance='1 day') == 1
+        assert idx.get_loc('2000-01-02T12', method='nearest',
+                           tolerance=pd.Timedelta('1D')) == 1
+        assert idx.get_loc('2000-01-02T12', method='nearest',
+                           tolerance=np.timedelta64(1, 'D')) == 1
+        assert idx.get_loc('2000-01-02T12', method='nearest',
+                           tolerance=timedelta(1)) == 1
+        with tm.assert_raises_regex(ValueError, 'must be convertible'):
+            idx.get_loc('2000-01-10', method='nearest', tolerance='foo')
+
+        msg = 'Input has different freq from PeriodIndex\\(freq=D\\)'
+        with tm.assert_raises_regex(ValueError, msg):
+            idx.get_loc('2000-01-10', method='nearest', tolerance='1 hour')
+        with pytest.raises(KeyError):
+            idx.get_loc('2000-01-10', method='nearest', tolerance='1 day')
+
+    def test_where(self):
+        i = self.create_index()
+        result = i.where(notnull(i))
+        expected = i
+        tm.assert_index_equal(result, expected)
+
+        i2 = pd.PeriodIndex([pd.NaT, pd.NaT] + i[2:].tolist(),
+                            freq='D')
+        result = i.where(notnull(i2))
+        expected = i2
+        tm.assert_index_equal(result, expected)
+
+    def test_where_array_like(self):
+        i = self.create_index()
+        cond = [False] + [True] * (len(i) - 1)
+        klasses = [list, tuple, np.array, Series]
+        expected = pd.PeriodIndex([pd.NaT] + i[1:].tolist(), freq='D')
+
+        for klass in klasses:
+            result = i.where(klass(cond))
+            tm.assert_index_equal(result, expected)
+
+    def test_where_other(self):
+
+        i = self.create_index()
+        for arr in [np.nan, pd.NaT]:
+            result = i.where(notnull(i), other=arr)
+            expected = i
+            tm.assert_index_equal(result, expected)
+
+        i2 = i.copy()
+        i2 = pd.PeriodIndex([pd.NaT, pd.NaT] + i[2:].tolist(),
+                            freq='D')
+        result = i.where(notnull(i2), i2)
+        tm.assert_index_equal(result, i2)
+
+        i2 = i.copy()
+        i2 = pd.PeriodIndex([pd.NaT, pd.NaT] + i[2:].tolist(),
+                            freq='D')
+        result = i.where(notnull(i2), i2.values)
+        tm.assert_index_equal(result, i2)
+
+    def test_get_indexer(self):
+        idx = pd.period_range('2000-01-01', periods=3).asfreq('H', how='start')
+        tm.assert_numpy_array_equal(idx.get_indexer(idx),
+                                    np.array([0, 1, 2], dtype=np.intp))
+
+        target = pd.PeriodIndex(['1999-12-31T23', '2000-01-01T12',
+                                 '2000-01-02T01'], freq='H')
+        tm.assert_numpy_array_equal(idx.get_indexer(target, 'pad'),
+                                    np.array([-1, 0, 1], dtype=np.intp))
+        tm.assert_numpy_array_equal(idx.get_indexer(target, 'backfill'),
+                                    np.array([0, 1, 2], dtype=np.intp))
+        tm.assert_numpy_array_equal(idx.get_indexer(target, 'nearest'),
+                                    np.array([0, 1, 1], dtype=np.intp))
+        tm.assert_numpy_array_equal(idx.get_indexer(target, 'nearest',
+                                                    tolerance='1 hour'),
+                                    np.array([0, -1, 1], dtype=np.intp))
+
+        msg = 'Input has different freq from PeriodIndex\\(freq=H\\)'
+        with tm.assert_raises_regex(ValueError, msg):
+            idx.get_indexer(target, 'nearest', tolerance='1 minute')
+
+        tm.assert_numpy_array_equal(idx.get_indexer(target, 'nearest',
+                                                    tolerance='1 day'),
+                                    np.array([0, 1, 1], dtype=np.intp))
+
+
def test_repeat(self): + # GH10183 + idx = pd.period_range('2000-01-01', periods=3, freq='D') + res = idx.repeat(3) + exp = PeriodIndex(idx.values.repeat(3), freq='D') + tm.assert_index_equal(res, exp) + assert res.freqstr == 'D' + + def test_period_index_indexer(self): + # GH4125 + idx = pd.period_range('2002-01', '2003-12', freq='M') + df = pd.DataFrame(pd.np.random.randn(24, 10), index=idx) + tm.assert_frame_equal(df, df.loc[idx]) + tm.assert_frame_equal(df, df.loc[list(idx)]) + tm.assert_frame_equal(df, df.loc[list(idx)]) + tm.assert_frame_equal(df.iloc[0:5], df.loc[idx[0:5]]) + tm.assert_frame_equal(df, df.loc[list(idx)]) + + def test_fillna_period(self): + # GH 11343 + idx = pd.PeriodIndex(['2011-01-01 09:00', pd.NaT, + '2011-01-01 11:00'], freq='H') + + exp = pd.PeriodIndex(['2011-01-01 09:00', '2011-01-01 10:00', + '2011-01-01 11:00'], freq='H') + tm.assert_index_equal( + idx.fillna(pd.Period('2011-01-01 10:00', freq='H')), exp) + + exp = pd.Index([pd.Period('2011-01-01 09:00', freq='H'), 'x', + pd.Period('2011-01-01 11:00', freq='H')], dtype=object) + tm.assert_index_equal(idx.fillna('x'), exp) + + exp = pd.Index([pd.Period('2011-01-01 09:00', freq='H'), + pd.Period('2011-01-01', freq='D'), + pd.Period('2011-01-01 11:00', freq='H')], dtype=object) + tm.assert_index_equal(idx.fillna( + pd.Period('2011-01-01', freq='D')), exp) + + def test_no_millisecond_field(self): + with pytest.raises(AttributeError): + DatetimeIndex.millisecond + + with pytest.raises(AttributeError): + DatetimeIndex([]).millisecond + + def test_difference_freq(self): + # GH14323: difference of Period MUST preserve frequency + # but the ability to union results must be preserved + + index = period_range("20160920", "20160925", freq="D") + + other = period_range("20160921", "20160924", freq="D") + expected = PeriodIndex(["20160920", "20160925"], freq='D') + idx_diff = index.difference(other) + tm.assert_index_equal(idx_diff, expected) + tm.assert_attr_equal('freq', idx_diff, expected) + + other = period_range("20160922", "20160925", freq="D") + idx_diff = index.difference(other) + expected = PeriodIndex(["20160920", "20160921"], freq='D') + tm.assert_index_equal(idx_diff, expected) + tm.assert_attr_equal('freq', idx_diff, expected) + + def test_hash_error(self): + index = period_range('20010101', periods=10) + with tm.assert_raises_regex(TypeError, "unhashable type: %r" % + type(index).__name__): + hash(index) + + def test_make_time_series(self): + index = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009') + series = Series(1, index=index) + assert isinstance(series, Series) + + def test_shallow_copy_empty(self): + + # GH13067 + idx = PeriodIndex([], freq='M') + result = idx._shallow_copy() + expected = idx + + tm.assert_index_equal(result, expected) + + def test_dtype_str(self): + pi = pd.PeriodIndex([], freq='M') + assert pi.dtype_str == 'period[M]' + assert pi.dtype_str == str(pi.dtype) + + pi = pd.PeriodIndex([], freq='3M') + assert pi.dtype_str == 'period[3M]' + assert pi.dtype_str == str(pi.dtype) + + def test_view_asi8(self): + idx = pd.PeriodIndex([], freq='M') + + exp = np.array([], dtype=np.int64) + tm.assert_numpy_array_equal(idx.view('i8'), exp) + tm.assert_numpy_array_equal(idx.asi8, exp) + + idx = pd.PeriodIndex(['2011-01', pd.NaT], freq='M') + + exp = np.array([492, -9223372036854775808], dtype=np.int64) + tm.assert_numpy_array_equal(idx.view('i8'), exp) + tm.assert_numpy_array_equal(idx.asi8, exp) + + exp = np.array([14975, -9223372036854775808], dtype=np.int64) + idx = 
pd.PeriodIndex(['2011-01-01', pd.NaT], freq='D') + tm.assert_numpy_array_equal(idx.view('i8'), exp) + tm.assert_numpy_array_equal(idx.asi8, exp) + + def test_values(self): + idx = pd.PeriodIndex([], freq='M') + + exp = np.array([], dtype=np.object) + tm.assert_numpy_array_equal(idx.values, exp) + tm.assert_numpy_array_equal(idx.get_values(), exp) + exp = np.array([], dtype=np.int64) + tm.assert_numpy_array_equal(idx._values, exp) + + idx = pd.PeriodIndex(['2011-01', pd.NaT], freq='M') + + exp = np.array([pd.Period('2011-01', freq='M'), pd.NaT], dtype=object) + tm.assert_numpy_array_equal(idx.values, exp) + tm.assert_numpy_array_equal(idx.get_values(), exp) + exp = np.array([492, -9223372036854775808], dtype=np.int64) + tm.assert_numpy_array_equal(idx._values, exp) + + idx = pd.PeriodIndex(['2011-01-01', pd.NaT], freq='D') + + exp = np.array([pd.Period('2011-01-01', freq='D'), pd.NaT], + dtype=object) + tm.assert_numpy_array_equal(idx.values, exp) + tm.assert_numpy_array_equal(idx.get_values(), exp) + exp = np.array([14975, -9223372036854775808], dtype=np.int64) + tm.assert_numpy_array_equal(idx._values, exp) + + def test_period_index_length(self): + pi = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009') + assert len(pi) == 9 + + pi = PeriodIndex(freq='Q', start='1/1/2001', end='12/1/2009') + assert len(pi) == 4 * 9 + + pi = PeriodIndex(freq='M', start='1/1/2001', end='12/1/2009') + assert len(pi) == 12 * 9 + + start = Period('02-Apr-2005', 'B') + i1 = PeriodIndex(start=start, periods=20) + assert len(i1) == 20 + assert i1.freq == start.freq + assert i1[0] == start + + end_intv = Period('2006-12-31', 'W') + i1 = PeriodIndex(end=end_intv, periods=10) + assert len(i1) == 10 + assert i1.freq == end_intv.freq + assert i1[-1] == end_intv + + end_intv = Period('2006-12-31', '1w') + i2 = PeriodIndex(end=end_intv, periods=10) + assert len(i1) == len(i2) + assert (i1 == i2).all() + assert i1.freq == i2.freq + + end_intv = Period('2006-12-31', ('w', 1)) + i2 = PeriodIndex(end=end_intv, periods=10) + assert len(i1) == len(i2) + assert (i1 == i2).all() + assert i1.freq == i2.freq + + try: + PeriodIndex(start=start, end=end_intv) + raise AssertionError('Cannot allow mixed freq for start and end') + except ValueError: + pass + + end_intv = Period('2005-05-01', 'B') + i1 = PeriodIndex(start=start, end=end_intv) + + try: + PeriodIndex(start=start) + raise AssertionError( + 'Must specify periods if missing start or end') + except ValueError: + pass + + # infer freq from first element + i2 = PeriodIndex([end_intv, Period('2005-05-05', 'B')]) + assert len(i2) == 2 + assert i2[0] == end_intv + + i2 = PeriodIndex(np.array([end_intv, Period('2005-05-05', 'B')])) + assert len(i2) == 2 + assert i2[0] == end_intv + + # Mixed freq should fail + vals = [end_intv, Period('2006-12-31', 'w')] + pytest.raises(ValueError, PeriodIndex, vals) + vals = np.array(vals) + pytest.raises(ValueError, PeriodIndex, vals) + + def test_fields(self): + # year, month, day, hour, minute + # second, weekofyear, week, dayofweek, weekday, dayofyear, quarter + # qyear + pi = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2005') + self._check_all_fields(pi) + + pi = PeriodIndex(freq='Q', start='1/1/2001', end='12/1/2002') + self._check_all_fields(pi) + + pi = PeriodIndex(freq='M', start='1/1/2001', end='1/1/2002') + self._check_all_fields(pi) + + pi = PeriodIndex(freq='D', start='12/1/2001', end='6/1/2001') + self._check_all_fields(pi) + + pi = PeriodIndex(freq='B', start='12/1/2001', end='6/1/2001') + self._check_all_fields(pi) + + 
pi = PeriodIndex(freq='H', start='12/31/2001', end='1/1/2002 23:00') + self._check_all_fields(pi) + + pi = PeriodIndex(freq='Min', start='12/31/2001', end='1/1/2002 00:20') + self._check_all_fields(pi) + + pi = PeriodIndex(freq='S', start='12/31/2001 00:00:00', + end='12/31/2001 00:05:00') + self._check_all_fields(pi) + + end_intv = Period('2006-12-31', 'W') + i1 = PeriodIndex(end=end_intv, periods=10) + self._check_all_fields(i1) + + def _check_all_fields(self, periodindex): + fields = ['year', 'month', 'day', 'hour', 'minute', 'second', + 'weekofyear', 'week', 'dayofweek', 'dayofyear', + 'quarter', 'qyear', 'days_in_month'] + + periods = list(periodindex) + s = pd.Series(periodindex) + + for field in fields: + field_idx = getattr(periodindex, field) + assert len(periodindex) == len(field_idx) + for x, val in zip(periods, field_idx): + assert getattr(x, field) == val + + if len(s) == 0: + continue + + field_s = getattr(s.dt, field) + assert len(periodindex) == len(field_s) + for x, val in zip(periods, field_s): + assert getattr(x, field) == val + + def test_indexing(self): + + # GH 4390, iat incorrectly indexing + index = period_range('1/1/2001', periods=10) + s = Series(randn(10), index=index) + expected = s[index[0]] + result = s.iat[0] + assert expected == result + + def test_period_set_index_reindex(self): + # GH 6631 + df = DataFrame(np.random.random(6)) + idx1 = period_range('2011/01/01', periods=6, freq='M') + idx2 = period_range('2013', periods=6, freq='A') + + df = df.set_index(idx1) + tm.assert_index_equal(df.index, idx1) + df = df.set_index(idx2) + tm.assert_index_equal(df.index, idx2) + + def test_factorize(self): + idx1 = PeriodIndex(['2014-01', '2014-01', '2014-02', '2014-02', + '2014-03', '2014-03'], freq='M') + + exp_arr = np.array([0, 0, 1, 1, 2, 2], dtype=np.intp) + exp_idx = PeriodIndex(['2014-01', '2014-02', '2014-03'], freq='M') + + arr, idx = idx1.factorize() + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(idx, exp_idx) + + arr, idx = idx1.factorize(sort=True) + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(idx, exp_idx) + + idx2 = pd.PeriodIndex(['2014-03', '2014-03', '2014-02', '2014-01', + '2014-03', '2014-01'], freq='M') + + exp_arr = np.array([2, 2, 1, 0, 2, 0], dtype=np.intp) + arr, idx = idx2.factorize(sort=True) + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(idx, exp_idx) + + exp_arr = np.array([0, 0, 1, 2, 0, 2], dtype=np.intp) + exp_idx = PeriodIndex(['2014-03', '2014-02', '2014-01'], freq='M') + arr, idx = idx2.factorize() + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(idx, exp_idx) + + def test_asobject_like(self): + idx = pd.PeriodIndex([], freq='M') + + exp = np.array([], dtype=object) + tm.assert_numpy_array_equal(idx.asobject.values, exp) + tm.assert_numpy_array_equal(idx._mpl_repr(), exp) + + idx = pd.PeriodIndex(['2011-01', pd.NaT], freq='M') + + exp = np.array([pd.Period('2011-01', freq='M'), pd.NaT], dtype=object) + tm.assert_numpy_array_equal(idx.asobject.values, exp) + tm.assert_numpy_array_equal(idx._mpl_repr(), exp) + + exp = np.array([pd.Period('2011-01-01', freq='D'), pd.NaT], + dtype=object) + idx = pd.PeriodIndex(['2011-01-01', pd.NaT], freq='D') + tm.assert_numpy_array_equal(idx.asobject.values, exp) + tm.assert_numpy_array_equal(idx._mpl_repr(), exp) + + def test_is_(self): + create_index = lambda: PeriodIndex(freq='A', start='1/1/2001', + end='12/1/2009') + index = create_index() + assert index.is_(index) + assert not index.is_(create_index()) + assert 
index.is_(index.view()) + assert index.is_(index.view().view().view().view().view()) + assert index.view().is_(index) + ind2 = index.view() + index.name = "Apple" + assert ind2.is_(index) + assert not index.is_(index[:]) + assert not index.is_(index.asfreq('M')) + assert not index.is_(index.asfreq('A')) + assert not index.is_(index - 2) + assert not index.is_(index - 0) + + def test_comp_period(self): + idx = period_range('2007-01', periods=20, freq='M') + + result = idx < idx[10] + exp = idx.values < idx.values[10] + tm.assert_numpy_array_equal(result, exp) + + def test_contains(self): + rng = period_range('2007-01', freq='M', periods=10) + + assert Period('2007-01', freq='M') in rng + assert not Period('2007-01', freq='D') in rng + assert not Period('2007-01', freq='2M') in rng + + def test_contains_nat(self): + # see gh-13582 + idx = period_range('2007-01', freq='M', periods=10) + assert pd.NaT not in idx + assert None not in idx + assert float('nan') not in idx + assert np.nan not in idx + + idx = pd.PeriodIndex(['2011-01', 'NaT', '2011-02'], freq='M') + assert pd.NaT in idx + assert None in idx + assert float('nan') in idx + assert np.nan in idx + + def test_periods_number_check(self): + with pytest.raises(ValueError): + period_range('2011-1-1', '2012-1-1', 'B') + + def test_start_time(self): + index = PeriodIndex(freq='M', start='2016-01-01', end='2016-05-31') + expected_index = date_range('2016-01-01', end='2016-05-31', freq='MS') + tm.assert_index_equal(index.start_time, expected_index) + + def test_end_time(self): + index = PeriodIndex(freq='M', start='2016-01-01', end='2016-05-31') + expected_index = date_range('2016-01-01', end='2016-05-31', freq='M') + tm.assert_index_equal(index.end_time, expected_index) + + def test_index_duplicate_periods(self): + # monotonic + idx = PeriodIndex([2000, 2007, 2007, 2009, 2009], freq='A-JUN') + ts = Series(np.random.randn(len(idx)), index=idx) + + result = ts[2007] + expected = ts[1:3] + tm.assert_series_equal(result, expected) + result[:] = 1 + assert (ts[1:3] == 1).all() + + # not monotonic + idx = PeriodIndex([2000, 2007, 2007, 2009, 2007], freq='A-JUN') + ts = Series(np.random.randn(len(idx)), index=idx) + + result = ts[2007] + expected = ts[idx == 2007] + tm.assert_series_equal(result, expected) + + def test_index_unique(self): + idx = PeriodIndex([2000, 2007, 2007, 2009, 2009], freq='A-JUN') + expected = PeriodIndex([2000, 2007, 2009], freq='A-JUN') + tm.assert_index_equal(idx.unique(), expected) + assert idx.nunique() == 3 + + idx = PeriodIndex([2000, 2007, 2007, 2009, 2007], freq='A-JUN', + tz='US/Eastern') + expected = PeriodIndex([2000, 2007, 2009], freq='A-JUN', + tz='US/Eastern') + tm.assert_index_equal(idx.unique(), expected) + assert idx.nunique() == 3 + + def test_shift_gh8083(self): + + # test shift for PeriodIndex + # GH8083 + drange = self.create_index() + result = drange.shift(1) + expected = PeriodIndex(['2013-01-02', '2013-01-03', '2013-01-04', + '2013-01-05', '2013-01-06'], freq='D') + tm.assert_index_equal(result, expected) + + def test_shift(self): + pi1 = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009') + pi2 = PeriodIndex(freq='A', start='1/1/2002', end='12/1/2010') + + tm.assert_index_equal(pi1.shift(0), pi1) + + assert len(pi1) == len(pi2) + tm.assert_index_equal(pi1.shift(1), pi2) + + pi1 = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009') + pi2 = PeriodIndex(freq='A', start='1/1/2000', end='12/1/2008') + assert len(pi1) == len(pi2) + tm.assert_index_equal(pi1.shift(-1), pi2) + + pi1 = 
PeriodIndex(freq='M', start='1/1/2001', end='12/1/2009') + pi2 = PeriodIndex(freq='M', start='2/1/2001', end='1/1/2010') + assert len(pi1) == len(pi2) + tm.assert_index_equal(pi1.shift(1), pi2) + + pi1 = PeriodIndex(freq='M', start='1/1/2001', end='12/1/2009') + pi2 = PeriodIndex(freq='M', start='12/1/2000', end='11/1/2009') + assert len(pi1) == len(pi2) + tm.assert_index_equal(pi1.shift(-1), pi2) + + pi1 = PeriodIndex(freq='D', start='1/1/2001', end='12/1/2009') + pi2 = PeriodIndex(freq='D', start='1/2/2001', end='12/2/2009') + assert len(pi1) == len(pi2) + tm.assert_index_equal(pi1.shift(1), pi2) + + pi1 = PeriodIndex(freq='D', start='1/1/2001', end='12/1/2009') + pi2 = PeriodIndex(freq='D', start='12/31/2000', end='11/30/2009') + assert len(pi1) == len(pi2) + tm.assert_index_equal(pi1.shift(-1), pi2) + + def test_shift_nat(self): + idx = PeriodIndex(['2011-01', '2011-02', 'NaT', + '2011-04'], freq='M', name='idx') + result = idx.shift(1) + expected = PeriodIndex(['2011-02', '2011-03', 'NaT', + '2011-05'], freq='M', name='idx') + tm.assert_index_equal(result, expected) + assert result.name == expected.name + + def test_ndarray_compat_properties(self): + if compat.is_platform_32bit(): + pytest.skip("skipping on 32bit") + super(TestPeriodIndex, self).test_ndarray_compat_properties() + + def test_shift_ndarray(self): + idx = PeriodIndex(['2011-01', '2011-02', 'NaT', + '2011-04'], freq='M', name='idx') + result = idx.shift(np.array([1, 2, 3, 4])) + expected = PeriodIndex(['2011-02', '2011-04', 'NaT', + '2011-08'], freq='M', name='idx') + tm.assert_index_equal(result, expected) + + idx = PeriodIndex(['2011-01', '2011-02', 'NaT', + '2011-04'], freq='M', name='idx') + result = idx.shift(np.array([1, -2, 3, -4])) + expected = PeriodIndex(['2011-02', '2010-12', 'NaT', + '2010-12'], freq='M', name='idx') + tm.assert_index_equal(result, expected) + + def test_negative_ordinals(self): + Period(ordinal=-1000, freq='A') + Period(ordinal=0, freq='A') + + idx1 = PeriodIndex(ordinal=[-1, 0, 1], freq='A') + idx2 = PeriodIndex(ordinal=np.array([-1, 0, 1]), freq='A') + tm.assert_index_equal(idx1, idx2) + + def test_pindex_fieldaccessor_nat(self): + idx = PeriodIndex(['2011-01', '2011-02', 'NaT', + '2012-03', '2012-04'], freq='D', name='name') + + exp = Index([2011, 2011, -1, 2012, 2012], dtype=np.int64, name='name') + tm.assert_index_equal(idx.year, exp) + exp = Index([1, 2, -1, 3, 4], dtype=np.int64, name='name') + tm.assert_index_equal(idx.month, exp) + + def test_pindex_qaccess(self): + pi = PeriodIndex(['2Q05', '3Q05', '4Q05', '1Q06', '2Q06'], freq='Q') + s = Series(np.random.rand(len(pi)), index=pi).cumsum() + # Todo: fix these accessors! 
+ assert s['05Q4'] == s[2] + + def test_numpy_repeat(self): + index = period_range('20010101', periods=2) + expected = PeriodIndex([Period('2001-01-01'), Period('2001-01-01'), + Period('2001-01-02'), Period('2001-01-02')]) + + tm.assert_index_equal(np.repeat(index, 2), expected) + + msg = "the 'axis' parameter is not supported" + tm.assert_raises_regex( + ValueError, msg, np.repeat, index, 2, axis=1) + + def test_pindex_multiples(self): + pi = PeriodIndex(start='1/1/11', end='12/31/11', freq='2M') + expected = PeriodIndex(['2011-01', '2011-03', '2011-05', '2011-07', + '2011-09', '2011-11'], freq='2M') + tm.assert_index_equal(pi, expected) + assert pi.freq == offsets.MonthEnd(2) + assert pi.freqstr == '2M' + + pi = period_range(start='1/1/11', end='12/31/11', freq='2M') + tm.assert_index_equal(pi, expected) + assert pi.freq == offsets.MonthEnd(2) + assert pi.freqstr == '2M' + + pi = period_range(start='1/1/11', periods=6, freq='2M') + tm.assert_index_equal(pi, expected) + assert pi.freq == offsets.MonthEnd(2) + assert pi.freqstr == '2M' + + def test_iteration(self): + index = PeriodIndex(start='1/1/10', periods=4, freq='B') + + result = list(index) + assert isinstance(result[0], Period) + assert result[0].freq == index.freq + + def test_is_full(self): + index = PeriodIndex([2005, 2007, 2009], freq='A') + assert not index.is_full + + index = PeriodIndex([2005, 2006, 2007], freq='A') + assert index.is_full + + index = PeriodIndex([2005, 2005, 2007], freq='A') + assert not index.is_full + + index = PeriodIndex([2005, 2005, 2006], freq='A') + assert index.is_full + + index = PeriodIndex([2006, 2005, 2005], freq='A') + pytest.raises(ValueError, getattr, index, 'is_full') + + assert index[:0].is_full + + def test_with_multi_index(self): + # #1705 + index = date_range('1/1/2012', periods=4, freq='12H') + index_as_arrays = [index.to_period(freq='D'), index.hour] + + s = Series([0, 1, 2, 3], index_as_arrays) + + assert isinstance(s.index.levels[0], PeriodIndex) + + assert isinstance(s.index.values[0][0], Period) + + def test_convert_array_of_periods(self): + rng = period_range('1/1/2000', periods=20, freq='D') + periods = list(rng) + + result = pd.Index(periods) + assert isinstance(result, PeriodIndex) + + def test_append_concat(self): + # #1815 + d1 = date_range('12/31/1990', '12/31/1999', freq='A-DEC') + d2 = date_range('12/31/2000', '12/31/2009', freq='A-DEC') + + s1 = Series(np.random.randn(10), d1) + s2 = Series(np.random.randn(10), d2) + + s1 = s1.to_period() + s2 = s2.to_period() + + # drops index + result = pd.concat([s1, s2]) + assert isinstance(result.index, PeriodIndex) + assert result.index[0] == s1.index[0] + + def test_pickle_freq(self): + # GH2891 + prng = period_range('1/1/2011', '1/1/2012', freq='M') + new_prng = tm.round_trip_pickle(prng) + assert new_prng.freq == offsets.MonthEnd() + assert new_prng.freqstr == 'M' + + def test_map(self): + index = PeriodIndex([2005, 2007, 2009], freq='A') + result = index.map(lambda x: x + 1) + expected = index + 1 + tm.assert_index_equal(result, expected) + + result = index.map(lambda x: x.ordinal) + exp = Index([x.ordinal for x in index]) + tm.assert_index_equal(result, exp) diff --git a/pandas/tests/indexes/period/test_setops.py b/pandas/tests/indexes/period/test_setops.py new file mode 100644 index 0000000000000..1ac05f9fa94b7 --- /dev/null +++ b/pandas/tests/indexes/period/test_setops.py @@ -0,0 +1,252 @@ +import pytest + +import numpy as np + +import pandas as pd +import pandas.util.testing as tm +import pandas.core.indexes.period as period 
+from pandas import period_range, PeriodIndex, Index, date_range
+
+
+def _permute(obj):
+    return obj.take(np.random.permutation(len(obj)))
+
+
+class TestPeriodIndex(object):
+
+    def setup_method(self, method):
+        pass
+
+    def test_joins(self):
+        index = period_range('1/1/2000', '1/20/2000', freq='D')
+
+        for kind in ['inner', 'outer', 'left', 'right']:
+            joined = index.join(index[:-5], how=kind)
+
+            assert isinstance(joined, PeriodIndex)
+            assert joined.freq == index.freq
+
+    def test_join_self(self):
+        index = period_range('1/1/2000', '1/20/2000', freq='D')
+
+        for kind in ['inner', 'outer', 'left', 'right']:
+            res = index.join(index, how=kind)
+            assert index is res
+
+    def test_join_does_not_recur(self):
+        df = tm.makeCustomDataframe(
+            3, 2, data_gen_f=lambda *args: np.random.randint(2),
+            c_idx_type='p', r_idx_type='dt')
+        s = df.iloc[:2, 0]
+
+        res = s.index.join(df.columns, how='outer')
+        expected = Index([s.index[0], s.index[1],
+                          df.columns[0], df.columns[1]], object)
+        tm.assert_index_equal(res, expected)
+
+    def test_union(self):
+        # union
+        rng1 = pd.period_range('1/1/2000', freq='D', periods=5)
+        other1 = pd.period_range('1/6/2000', freq='D', periods=5)
+        expected1 = pd.period_range('1/1/2000', freq='D', periods=10)
+
+        rng2 = pd.period_range('1/1/2000', freq='D', periods=5)
+        other2 = pd.period_range('1/4/2000', freq='D', periods=5)
+        expected2 = pd.period_range('1/1/2000', freq='D', periods=8)
+
+        rng3 = pd.period_range('1/1/2000', freq='D', periods=5)
+        other3 = pd.PeriodIndex([], freq='D')
+        expected3 = pd.period_range('1/1/2000', freq='D', periods=5)
+
+        rng4 = pd.period_range('2000-01-01 09:00', freq='H', periods=5)
+        other4 = pd.period_range('2000-01-02 09:00', freq='H', periods=5)
+        expected4 = pd.PeriodIndex(['2000-01-01 09:00', '2000-01-01 10:00',
+                                    '2000-01-01 11:00', '2000-01-01 12:00',
+                                    '2000-01-01 13:00', '2000-01-02 09:00',
+                                    '2000-01-02 10:00', '2000-01-02 11:00',
+                                    '2000-01-02 12:00', '2000-01-02 13:00'],
+                                   freq='H')
+
+        rng5 = pd.PeriodIndex(['2000-01-01 09:01', '2000-01-01 09:03',
+                               '2000-01-01 09:05'], freq='T')
+        other5 = pd.PeriodIndex(['2000-01-01 09:01', '2000-01-01 09:05',
+                                 '2000-01-01 09:08'],
+                                freq='T')
+        expected5 = pd.PeriodIndex(['2000-01-01 09:01', '2000-01-01 09:03',
+                                    '2000-01-01 09:05', '2000-01-01 09:08'],
+                                   freq='T')
+
+        rng6 = pd.period_range('2000-01-01', freq='M', periods=7)
+        other6 = pd.period_range('2000-04-01', freq='M', periods=7)
+        expected6 = pd.period_range('2000-01-01', freq='M', periods=10)
+
+        rng7 = pd.period_range('2003-01-01', freq='A', periods=5)
+        other7 = pd.period_range('1998-01-01', freq='A', periods=8)
+        expected7 = pd.period_range('1998-01-01', freq='A', periods=10)
+
+        for rng, other, expected in [(rng1, other1, expected1),
+                                     (rng2, other2, expected2),
+                                     (rng3, other3, expected3),
+                                     (rng4, other4, expected4),
+                                     (rng5, other5, expected5),
+                                     (rng6, other6, expected6),
+                                     (rng7, other7, expected7)]:
+
+            result_union = rng.union(other)
+            tm.assert_index_equal(result_union, expected)
+
+    def test_union_misc(self):
+        index = period_range('1/1/2000', '1/20/2000', freq='D')
+
+        result = index[:-5].union(index[10:])
+        tm.assert_index_equal(result, index)
+
+        # not in order
+        result = _permute(index[:-5]).union(_permute(index[10:]))
+        tm.assert_index_equal(result, index)
+
+        # raise if different frequencies
+        index = period_range('1/1/2000', '1/20/2000', freq='D')
+        index2 = period_range('1/1/2000', '1/20/2000', freq='W-WED')
+        with pytest.raises(period.IncompatibleFrequency):
+            index.union(index2)
+
+        msg = 'can only call
with other PeriodIndex-ed objects' + with tm.assert_raises_regex(ValueError, msg): + index.join(index.to_timestamp()) + + index3 = period_range('1/1/2000', '1/20/2000', freq='2D') + with pytest.raises(period.IncompatibleFrequency): + index.join(index3) + + def test_union_dataframe_index(self): + rng1 = pd.period_range('1/1/1999', '1/1/2012', freq='M') + s1 = pd.Series(np.random.randn(len(rng1)), rng1) + + rng2 = pd.period_range('1/1/1980', '12/1/2001', freq='M') + s2 = pd.Series(np.random.randn(len(rng2)), rng2) + df = pd.DataFrame({'s1': s1, 's2': s2}) + + exp = pd.period_range('1/1/1980', '1/1/2012', freq='M') + tm.assert_index_equal(df.index, exp) + + def test_intersection(self): + index = period_range('1/1/2000', '1/20/2000', freq='D') + + result = index[:-5].intersection(index[10:]) + tm.assert_index_equal(result, index[10:-5]) + + # not in order + left = _permute(index[:-5]) + right = _permute(index[10:]) + result = left.intersection(right).sort_values() + tm.assert_index_equal(result, index[10:-5]) + + # raise if different frequencies + index = period_range('1/1/2000', '1/20/2000', freq='D') + index2 = period_range('1/1/2000', '1/20/2000', freq='W-WED') + with pytest.raises(period.IncompatibleFrequency): + index.intersection(index2) + + index3 = period_range('1/1/2000', '1/20/2000', freq='2D') + with pytest.raises(period.IncompatibleFrequency): + index.intersection(index3) + + def test_intersection_cases(self): + base = period_range('6/1/2000', '6/30/2000', freq='D', name='idx') + + # if target has the same name, it is preserved + rng2 = period_range('5/15/2000', '6/20/2000', freq='D', name='idx') + expected2 = period_range('6/1/2000', '6/20/2000', freq='D', + name='idx') + + # if target name is different, it will be reset + rng3 = period_range('5/15/2000', '6/20/2000', freq='D', name='other') + expected3 = period_range('6/1/2000', '6/20/2000', freq='D', + name=None) + + rng4 = period_range('7/1/2000', '7/31/2000', freq='D', name='idx') + expected4 = PeriodIndex([], name='idx', freq='D') + + for (rng, expected) in [(rng2, expected2), (rng3, expected3), + (rng4, expected4)]: + result = base.intersection(rng) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == expected.freq + + # non-monotonic + base = PeriodIndex(['2011-01-05', '2011-01-04', '2011-01-02', + '2011-01-03'], freq='D', name='idx') + + rng2 = PeriodIndex(['2011-01-04', '2011-01-02', + '2011-02-02', '2011-02-03'], + freq='D', name='idx') + expected2 = PeriodIndex(['2011-01-04', '2011-01-02'], freq='D', + name='idx') + + rng3 = PeriodIndex(['2011-01-04', '2011-01-02', '2011-02-02', + '2011-02-03'], + freq='D', name='other') + expected3 = PeriodIndex(['2011-01-04', '2011-01-02'], freq='D', + name=None) + + rng4 = period_range('7/1/2000', '7/31/2000', freq='D', name='idx') + expected4 = PeriodIndex([], freq='D', name='idx') + + for (rng, expected) in [(rng2, expected2), (rng3, expected3), + (rng4, expected4)]: + result = base.intersection(rng) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == 'D' + + # empty same freq + rng = date_range('6/1/2000', '6/15/2000', freq='T') + result = rng[0:0].intersection(rng) + assert len(result) == 0 + + result = rng.intersection(rng[0:0]) + assert len(result) == 0 + + def test_difference(self): + # diff + rng1 = pd.period_range('1/1/2000', freq='D', periods=5) + other1 = pd.period_range('1/6/2000', freq='D', periods=5) + expected1 = pd.period_range('1/1/2000', freq='D', periods=5) + + 
rng2 = pd.period_range('1/1/2000', freq='D', periods=5) + other2 = pd.period_range('1/4/2000', freq='D', periods=5) + expected2 = pd.period_range('1/1/2000', freq='D', periods=3) + + rng3 = pd.period_range('1/1/2000', freq='D', periods=5) + other3 = pd.PeriodIndex([], freq='D') + expected3 = pd.period_range('1/1/2000', freq='D', periods=5) + + rng4 = pd.period_range('2000-01-01 09:00', freq='H', periods=5) + other4 = pd.period_range('2000-01-02 09:00', freq='H', periods=5) + expected4 = rng4 + + rng5 = pd.PeriodIndex(['2000-01-01 09:01', '2000-01-01 09:03', + '2000-01-01 09:05'], freq='T') + other5 = pd.PeriodIndex( + ['2000-01-01 09:01', '2000-01-01 09:05'], freq='T') + expected5 = pd.PeriodIndex(['2000-01-01 09:03'], freq='T') + + rng6 = pd.period_range('2000-01-01', freq='M', periods=7) + other6 = pd.period_range('2000-04-01', freq='M', periods=7) + expected6 = pd.period_range('2000-01-01', freq='M', periods=3) + + rng7 = pd.period_range('2003-01-01', freq='A', periods=5) + other7 = pd.period_range('1998-01-01', freq='A', periods=8) + expected7 = pd.period_range('2006-01-01', freq='A', periods=2) + + for rng, other, expected in [(rng1, other1, expected1), + (rng2, other2, expected2), + (rng3, other3, expected3), + (rng4, other4, expected4), + (rng5, other5, expected5), + (rng6, other6, expected6), + (rng7, other7, expected7), ]: + result_union = rng.difference(other) + tm.assert_index_equal(result_union, expected) diff --git a/pandas/tests/indexes/period/test_tools.py b/pandas/tests/indexes/period/test_tools.py new file mode 100644 index 0000000000000..074678164e6f9 --- /dev/null +++ b/pandas/tests/indexes/period/test_tools.py @@ -0,0 +1,434 @@ +import numpy as np +from datetime import datetime, timedelta + +import pandas as pd +import pandas.util.testing as tm +import pandas.core.indexes.period as period +from pandas.compat import lrange +from pandas.tseries.frequencies import get_freq, MONTHS +from pandas._libs.period import period_ordinal, period_asfreq +from pandas import (PeriodIndex, Period, DatetimeIndex, Timestamp, Series, + date_range, to_datetime, period_range) + + +class TestPeriodRepresentation(object): + """ + Wish to match NumPy units + """ + + def _check_freq(self, freq, base_date): + rng = PeriodIndex(start=base_date, periods=10, freq=freq) + exp = np.arange(10, dtype=np.int64) + tm.assert_numpy_array_equal(rng._values, exp) + tm.assert_numpy_array_equal(rng.asi8, exp) + + def test_annual(self): + self._check_freq('A', 1970) + + def test_monthly(self): + self._check_freq('M', '1970-01') + + def test_weekly(self): + self._check_freq('W-THU', '1970-01-01') + + def test_daily(self): + self._check_freq('D', '1970-01-01') + + def test_business_daily(self): + self._check_freq('B', '1970-01-01') + + def test_hourly(self): + self._check_freq('H', '1970-01-01') + + def test_minutely(self): + self._check_freq('T', '1970-01-01') + + def test_secondly(self): + self._check_freq('S', '1970-01-01') + + def test_millisecondly(self): + self._check_freq('L', '1970-01-01') + + def test_microsecondly(self): + self._check_freq('U', '1970-01-01') + + def test_nanosecondly(self): + self._check_freq('N', '1970-01-01') + + def test_negone_ordinals(self): + freqs = ['A', 'M', 'Q', 'D', 'H', 'T', 'S'] + + period = Period(ordinal=-1, freq='D') + for freq in freqs: + repr(period.asfreq(freq)) + + for freq in freqs: + period = Period(ordinal=-1, freq=freq) + repr(period) + assert period.year == 1969 + + period = Period(ordinal=-1, freq='B') + repr(period) + period = Period(ordinal=-1, freq='W') + 
repr(period) + + +class TestTslib(object): + def test_intraday_conversion_factors(self): + assert period_asfreq(1, get_freq('D'), get_freq('H'), False) == 24 + assert period_asfreq(1, get_freq('D'), get_freq('T'), False) == 1440 + assert period_asfreq(1, get_freq('D'), get_freq('S'), False) == 86400 + assert period_asfreq(1, get_freq('D'), + get_freq('L'), False) == 86400000 + assert period_asfreq(1, get_freq('D'), + get_freq('U'), False) == 86400000000 + assert period_asfreq(1, get_freq('D'), + get_freq('N'), False) == 86400000000000 + + assert period_asfreq(1, get_freq('H'), get_freq('T'), False) == 60 + assert period_asfreq(1, get_freq('H'), get_freq('S'), False) == 3600 + assert period_asfreq(1, get_freq('H'), + get_freq('L'), False) == 3600000 + assert period_asfreq(1, get_freq('H'), + get_freq('U'), False) == 3600000000 + assert period_asfreq(1, get_freq('H'), + get_freq('N'), False) == 3600000000000 + + assert period_asfreq(1, get_freq('T'), get_freq('S'), False) == 60 + assert period_asfreq(1, get_freq('T'), get_freq('L'), False) == 60000 + assert period_asfreq(1, get_freq('T'), + get_freq('U'), False) == 60000000 + assert period_asfreq(1, get_freq('T'), + get_freq('N'), False) == 60000000000 + + assert period_asfreq(1, get_freq('S'), get_freq('L'), False) == 1000 + assert period_asfreq(1, get_freq('S'), + get_freq('U'), False) == 1000000 + assert period_asfreq(1, get_freq('S'), + get_freq('N'), False) == 1000000000 + + assert period_asfreq(1, get_freq('L'), get_freq('U'), False) == 1000 + assert period_asfreq(1, get_freq('L'), + get_freq('N'), False) == 1000000 + + assert period_asfreq(1, get_freq('U'), get_freq('N'), False) == 1000 + + def test_period_ordinal_start_values(self): + # information for 1.1.1970 + assert period_ordinal(1970, 1, 1, 0, 0, 0, 0, 0, get_freq('A')) == 0 + assert period_ordinal(1970, 1, 1, 0, 0, 0, 0, 0, get_freq('M')) == 0 + assert period_ordinal(1970, 1, 1, 0, 0, 0, 0, 0, get_freq('W')) == 1 + assert period_ordinal(1970, 1, 1, 0, 0, 0, 0, 0, get_freq('D')) == 0 + assert period_ordinal(1970, 1, 1, 0, 0, 0, 0, 0, get_freq('B')) == 0 + + def test_period_ordinal_week(self): + assert period_ordinal(1970, 1, 4, 0, 0, 0, 0, 0, get_freq('W')) == 1 + assert period_ordinal(1970, 1, 5, 0, 0, 0, 0, 0, get_freq('W')) == 2 + assert period_ordinal(2013, 10, 6, 0, + 0, 0, 0, 0, get_freq('W')) == 2284 + assert period_ordinal(2013, 10, 7, 0, + 0, 0, 0, 0, get_freq('W')) == 2285 + + def test_period_ordinal_business_day(self): + # Thursday + assert period_ordinal(2013, 10, 3, 0, + 0, 0, 0, 0, get_freq('B')) == 11415 + # Friday + assert period_ordinal(2013, 10, 4, 0, + 0, 0, 0, 0, get_freq('B')) == 11416 + # Saturday + assert period_ordinal(2013, 10, 5, 0, + 0, 0, 0, 0, get_freq('B')) == 11417 + # Sunday + assert period_ordinal(2013, 10, 6, 0, + 0, 0, 0, 0, get_freq('B')) == 11417 + # Monday + assert period_ordinal(2013, 10, 7, 0, + 0, 0, 0, 0, get_freq('B')) == 11417 + # Tuesday + assert period_ordinal(2013, 10, 8, 0, + 0, 0, 0, 0, get_freq('B')) == 11418 + + +class TestPeriodIndex(object): + + def setup_method(self, method): + pass + + def test_tolist(self): + index = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009') + rs = index.tolist() + for x in rs: + assert isinstance(x, Period) + + recon = PeriodIndex(rs) + tm.assert_index_equal(index, recon) + + def test_to_timestamp(self): + index = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009') + series = Series(1, index=index, name='foo') + + exp_index = date_range('1/1/2001', end='12/31/2009', freq='A-DEC') + 
result = series.to_timestamp(how='end') + tm.assert_index_equal(result.index, exp_index) + assert result.name == 'foo' + + exp_index = date_range('1/1/2001', end='1/1/2009', freq='AS-JAN') + result = series.to_timestamp(how='start') + tm.assert_index_equal(result.index, exp_index) + + def _get_with_delta(delta, freq='A-DEC'): + return date_range(to_datetime('1/1/2001') + delta, + to_datetime('12/31/2009') + delta, freq=freq) + + delta = timedelta(hours=23) + result = series.to_timestamp('H', 'end') + exp_index = _get_with_delta(delta) + tm.assert_index_equal(result.index, exp_index) + + delta = timedelta(hours=23, minutes=59) + result = series.to_timestamp('T', 'end') + exp_index = _get_with_delta(delta) + tm.assert_index_equal(result.index, exp_index) + + result = series.to_timestamp('S', 'end') + delta = timedelta(hours=23, minutes=59, seconds=59) + exp_index = _get_with_delta(delta) + tm.assert_index_equal(result.index, exp_index) + + index = PeriodIndex(freq='H', start='1/1/2001', end='1/2/2001') + series = Series(1, index=index, name='foo') + + exp_index = date_range('1/1/2001 00:59:59', end='1/2/2001 00:59:59', + freq='H') + result = series.to_timestamp(how='end') + tm.assert_index_equal(result.index, exp_index) + assert result.name == 'foo' + + def test_to_timestamp_quarterly_bug(self): + years = np.arange(1960, 2000).repeat(4) + quarters = np.tile(lrange(1, 5), 40) + + pindex = PeriodIndex(year=years, quarter=quarters) + + stamps = pindex.to_timestamp('D', 'end') + expected = DatetimeIndex([x.to_timestamp('D', 'end') for x in pindex]) + tm.assert_index_equal(stamps, expected) + + def test_to_timestamp_preserve_name(self): + index = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009', + name='foo') + assert index.name == 'foo' + + conv = index.to_timestamp('D') + assert conv.name == 'foo' + + def test_to_timestamp_repr_is_code(self): + zs = [Timestamp('99-04-17 00:00:00', tz='UTC'), + Timestamp('2001-04-17 00:00:00', tz='UTC'), + Timestamp('2001-04-17 00:00:00', tz='America/Los_Angeles'), + Timestamp('2001-04-17 00:00:00', tz=None)] + for z in zs: + assert eval(repr(z)) == z + + def test_to_timestamp_pi_nat(self): + # GH 7228 + index = PeriodIndex(['NaT', '2011-01', '2011-02'], freq='M', + name='idx') + + result = index.to_timestamp('D') + expected = DatetimeIndex([pd.NaT, datetime(2011, 1, 1), + datetime(2011, 2, 1)], name='idx') + tm.assert_index_equal(result, expected) + assert result.name == 'idx' + + result2 = result.to_period(freq='M') + tm.assert_index_equal(result2, index) + assert result2.name == 'idx' + + result3 = result.to_period(freq='3M') + exp = PeriodIndex(['NaT', '2011-01', '2011-02'], freq='3M', name='idx') + tm.assert_index_equal(result3, exp) + assert result3.freqstr == '3M' + + msg = ('Frequency must be positive, because it' + ' represents span: -2A') + with tm.assert_raises_regex(ValueError, msg): + result.to_period(freq='-2A') + + def test_to_timestamp_pi_mult(self): + idx = PeriodIndex(['2011-01', 'NaT', '2011-02'], freq='2M', name='idx') + result = idx.to_timestamp() + expected = DatetimeIndex( + ['2011-01-01', 'NaT', '2011-02-01'], name='idx') + tm.assert_index_equal(result, expected) + result = idx.to_timestamp(how='E') + expected = DatetimeIndex( + ['2011-02-28', 'NaT', '2011-03-31'], name='idx') + tm.assert_index_equal(result, expected) + + def test_to_timestamp_pi_combined(self): + idx = PeriodIndex(start='2011', periods=2, freq='1D1H', name='idx') + result = idx.to_timestamp() + expected = DatetimeIndex( + ['2011-01-01 00:00', '2011-01-02 01:00'], 
name='idx')
+        tm.assert_index_equal(result, expected)
+        result = idx.to_timestamp(how='E')
+        expected = DatetimeIndex(
+            ['2011-01-02 00:59:59', '2011-01-03 01:59:59'], name='idx')
+        tm.assert_index_equal(result, expected)
+        result = idx.to_timestamp(how='E', freq='H')
+        expected = DatetimeIndex(
+            ['2011-01-02 00:00', '2011-01-03 01:00'], name='idx')
+        tm.assert_index_equal(result, expected)
+
+    def test_to_timestamp_to_period_astype(self):
+        idx = DatetimeIndex([pd.NaT, '2011-01-01', '2011-02-01'], name='idx')
+
+        res = idx.astype('period[M]')
+        exp = PeriodIndex(['NaT', '2011-01', '2011-02'], freq='M', name='idx')
+        tm.assert_index_equal(res, exp)
+
+        res = idx.astype('period[3M]')
+        exp = PeriodIndex(['NaT', '2011-01', '2011-02'], freq='3M', name='idx')
+        tm.assert_index_equal(res, exp)
+
+    def test_dti_to_period(self):
+        dti = DatetimeIndex(start='1/1/2005', end='12/1/2005', freq='M')
+        pi1 = dti.to_period()
+        pi2 = dti.to_period(freq='D')
+        pi3 = dti.to_period(freq='3D')
+
+        assert pi1[0] == Period('Jan 2005', freq='M')
+        assert pi2[0] == Period('1/31/2005', freq='D')
+        assert pi3[0] == Period('1/31/2005', freq='3D')
+
+        assert pi1[-1] == Period('Nov 2005', freq='M')
+        assert pi2[-1] == Period('11/30/2005', freq='D')
+        assert pi3[-1] == Period('11/30/2005', freq='3D')
+
+        tm.assert_index_equal(pi1, period_range('1/1/2005', '11/1/2005',
+                                                freq='M'))
+        tm.assert_index_equal(pi2, period_range('1/1/2005', '11/1/2005',
+                                                freq='M').asfreq('D'))
+        tm.assert_index_equal(pi3, period_range('1/1/2005', '11/1/2005',
+                                                freq='M').asfreq('3D'))
+
+    def test_period_astype_to_timestamp(self):
+        pi = pd.PeriodIndex(['2011-01', '2011-02', '2011-03'], freq='M')
+
+        exp = pd.DatetimeIndex(['2011-01-01', '2011-02-01', '2011-03-01'])
+        tm.assert_index_equal(pi.astype('datetime64[ns]'), exp)
+
+        exp = pd.DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31'])
+        tm.assert_index_equal(pi.astype('datetime64[ns]', how='end'), exp)
+
+        exp = pd.DatetimeIndex(['2011-01-01', '2011-02-01', '2011-03-01'],
+                               tz='US/Eastern')
+        res = pi.astype('datetime64[ns, US/Eastern]')
+        tm.assert_index_equal(res, exp)
+
+        exp = pd.DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31'],
+                               tz='US/Eastern')
+        res = pi.astype('datetime64[ns, US/Eastern]', how='end')
+        tm.assert_index_equal(res, exp)
+
+    def test_to_period_quarterly(self):
+        # make sure we can make the round trip
+        for month in MONTHS:
+            freq = 'Q-%s' % month
+            rng = period_range('1989Q3', '1991Q3', freq=freq)
+            stamps = rng.to_timestamp()
+            result = stamps.to_period(freq)
+            tm.assert_index_equal(rng, result)
+
+    def test_to_period_quarterlyish(self):
+        offsets = ['BQ', 'QS', 'BQS']
+        for off in offsets:
+            rng = date_range('01-Jan-2012', periods=8, freq=off)
+            prng = rng.to_period()
+            assert prng.freq == 'Q-DEC'
+
+    def test_to_period_annualish(self):
+        offsets = ['BA', 'AS', 'BAS']
+        for off in offsets:
+            rng = date_range('01-Jan-2012', periods=8, freq=off)
+            prng = rng.to_period()
+            assert prng.freq == 'A-DEC'
+
+    def test_to_period_monthish(self):
+        offsets = ['MS', 'BM']
+        for off in offsets:
+            rng = date_range('01-Jan-2012', periods=8, freq=off)
+            prng = rng.to_period()
+            assert prng.freq == 'M'
+
+        rng = date_range('01-Jan-2012', periods=8, freq='M')
+        prng = rng.to_period()
+        assert prng.freq == 'M'
+
+        msg = pd.tseries.frequencies._INVALID_FREQ_ERROR
+        with tm.assert_raises_regex(ValueError, msg):
+            date_range('01-Jan-2012', periods=8, freq='EOM')
+
+    def test_period_dt64_round_trip(self):
+        dti = date_range('1/1/2000',
'1/7/2002', freq='B') + pi = dti.to_period() + tm.assert_index_equal(pi.to_timestamp(), dti) + + dti = date_range('1/1/2000', '1/7/2002', freq='B') + pi = dti.to_period(freq='H') + tm.assert_index_equal(pi.to_timestamp(), dti) + + def test_to_timestamp_1703(self): + index = period_range('1/1/2012', periods=4, freq='D') + + result = index.to_timestamp() + assert result[0] == Timestamp('1/1/2012') + + def test_to_datetime_depr(self): + index = period_range('1/1/2012', periods=4, freq='D') + + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + result = index.to_datetime() + assert result[0] == Timestamp('1/1/2012') + + def test_combine_first(self): + # GH 3367 + didx = pd.DatetimeIndex(start='1950-01-31', end='1950-07-31', freq='M') + pidx = pd.PeriodIndex(start=pd.Period('1950-1'), + end=pd.Period('1950-7'), freq='M') + # check to be consistent with DatetimeIndex + for idx in [didx, pidx]: + a = pd.Series([1, np.nan, np.nan, 4, 5, np.nan, 7], index=idx) + b = pd.Series([9, 9, 9, 9, 9, 9, 9], index=idx) + + result = a.combine_first(b) + expected = pd.Series([1, 9, 9, 4, 5, 9, 7], index=idx, + dtype=np.float64) + tm.assert_series_equal(result, expected) + + def test_searchsorted(self): + for freq in ['D', '2D']: + pidx = pd.PeriodIndex(['2014-01-01', '2014-01-02', '2014-01-03', + '2014-01-04', '2014-01-05'], freq=freq) + + p1 = pd.Period('2014-01-01', freq=freq) + assert pidx.searchsorted(p1) == 0 + + p2 = pd.Period('2014-01-04', freq=freq) + assert pidx.searchsorted(p2) == 3 + + msg = "Input has different freq=H from PeriodIndex" + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + pidx.searchsorted(pd.Period('2014-01-01', freq='H')) + + msg = "Input has different freq=5D from PeriodIndex" + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + pidx.searchsorted(pd.Period('2014-01-01', freq='5D')) + + with tm.assert_produces_warning(FutureWarning): + pidx.searchsorted(key=p2) diff --git a/pandas/tests/indexes/test_base.py b/pandas/tests/indexes/test_base.py index afcd6889bf4d9..6a2087b37631e 100644 --- a/pandas/tests/indexes/test_base.py +++ b/pandas/tests/indexes/test_base.py @@ -1,44 +1,45 @@ # -*- coding: utf-8 -*- +import pytest + from datetime import datetime, timedelta import pandas.util.testing as tm -from pandas.indexes.api import Index, MultiIndex -from .common import Base +from pandas.core.indexes.api import Index, MultiIndex +from pandas.tests.indexes.common import Base from pandas.compat import (range, lrange, lzip, u, - zip, PY3, PY36) + text_type, zip, PY3, PY36) import operator -import os - import numpy as np from pandas import (period_range, date_range, Series, - Float64Index, Int64Index, + DataFrame, Float64Index, Int64Index, CategoricalIndex, DatetimeIndex, TimedeltaIndex, - PeriodIndex) + PeriodIndex, isnull) +from pandas.core.index import _get_combined_index from pandas.util.testing import assert_almost_equal from pandas.compat.numpy import np_datetime64_compat import pandas.core.config as cf -from pandas.tseries.index import _to_m8 +from pandas.core.indexes.datetimes import _to_m8 import pandas as pd -from pandas.lib import Timestamp +from pandas._libs.lib import Timestamp -class TestIndex(Base, tm.TestCase): +class TestIndex(Base): _holder = Index - _multiprocess_can_split_ = True - def setUp(self): + def setup_method(self, method): self.indices = dict(unicodeIndex=tm.makeUnicodeIndex(100), strIndex=tm.makeStringIndex(100), dateIndex=tm.makeDateIndex(100), periodIndex=tm.makePeriodIndex(100), 
tdIndex=tm.makeTimedeltaIndex(100), intIndex=tm.makeIntIndex(100), + uintIndex=tm.makeUIntIndex(100), rangeIndex=tm.makeIntIndex(100), floatIndex=tm.makeFloatIndex(100), boolIndex=Index([True, False]), @@ -53,14 +54,14 @@ def create_index(self): def test_new_axis(self): new_index = self.dateIndex[None, :] - self.assertEqual(new_index.ndim, 2) - tm.assertIsInstance(new_index, np.ndarray) + assert new_index.ndim == 2 + assert isinstance(new_index, np.ndarray) def test_copy_and_deepcopy(self): super(TestIndex, self).test_copy_and_deepcopy() new_copy2 = self.intIndex.copy(dtype=int) - self.assertEqual(new_copy2.dtype.kind, 'i') + assert new_copy2.dtype.kind == 'i' def test_constructor(self): # regular instance creation @@ -76,41 +77,41 @@ def test_constructor(self): # copy arr = np.array(self.strIndex) index = Index(arr, copy=True, name='name') - tm.assertIsInstance(index, Index) - self.assertEqual(index.name, 'name') + assert isinstance(index, Index) + assert index.name == 'name' tm.assert_numpy_array_equal(arr, index.values) arr[0] = "SOMEBIGLONGSTRING" - self.assertNotEqual(index[0], "SOMEBIGLONGSTRING") + assert index[0] != "SOMEBIGLONGSTRING" # what to do here? # arr = np.array(5.) - # self.assertRaises(Exception, arr.view, Index) + # pytest.raises(Exception, arr.view, Index) def test_constructor_corner(self): # corner case - self.assertRaises(TypeError, Index, 0) + pytest.raises(TypeError, Index, 0) def test_construction_list_mixed_tuples(self): - # 10697 - # if we are constructing from a mixed list of tuples, make sure that we - # are independent of the sorting order + # see gh-10697: if we are constructing from a mixed list of tuples, + # make sure that we are independent of the sorting order. idx1 = Index([('A', 1), 'B']) - self.assertIsInstance(idx1, Index) and self.assertNotInstance( - idx1, MultiIndex) + assert isinstance(idx1, Index) + assert not isinstance(idx1, MultiIndex) + idx2 = Index(['B', ('A', 1)]) - self.assertIsInstance(idx2, Index) and self.assertNotInstance( - idx2, MultiIndex) + assert isinstance(idx2, Index) + assert not isinstance(idx2, MultiIndex) def test_constructor_from_index_datetimetz(self): idx = pd.date_range('2015-01-01 10:00', freq='D', periods=3, tz='US/Eastern') result = pd.Index(idx) tm.assert_index_equal(result, idx) - self.assertEqual(result.tz, idx.tz) + assert result.tz == idx.tz result = pd.Index(idx.asobject) tm.assert_index_equal(result, idx) - self.assertEqual(result.tz, idx.tz) + assert result.tz == idx.tz def test_constructor_from_index_timedelta(self): idx = pd.timedelta_range('1 days', freq='D', periods=3) @@ -133,7 +134,7 @@ def test_constructor_from_series_datetimetz(self): tz='US/Eastern') result = pd.Index(pd.Series(idx)) tm.assert_index_equal(result, idx) - self.assertEqual(result.tz, idx.tz) + assert result.tz == idx.tz def test_constructor_from_series_timedelta(self): idx = pd.timedelta_range('1 days', freq='D', periods=3) @@ -152,9 +153,9 @@ def test_constructor_from_series(self): s = Series([Timestamp('20110101'), Timestamp('20120101'), Timestamp('20130101')]) result = Index(s) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) result = DatetimeIndex(s) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) # GH 6273 # create from a series, passing a freq @@ -163,24 +164,24 @@ def test_constructor_from_series(self): result = DatetimeIndex(s, freq='MS') expected = DatetimeIndex(['1-1-1990', '2-1-1990', '3-1-1990', '4-1-1990', '5-1-1990'], freq='MS') - 
self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) df = pd.DataFrame(np.random.rand(5, 3)) df['date'] = ['1-1-1990', '2-1-1990', '3-1-1990', '4-1-1990', '5-1-1990'] result = DatetimeIndex(df['date'], freq='MS') expected.name = 'date' - self.assert_index_equal(result, expected) - self.assertEqual(df['date'].dtype, object) + tm.assert_index_equal(result, expected) + assert df['date'].dtype == object exp = pd.Series(['1-1-1990', '2-1-1990', '3-1-1990', '4-1-1990', '5-1-1990'], name='date') - self.assert_series_equal(df['date'], exp) + tm.assert_series_equal(df['date'], exp) # GH 6274 # infer freq of same result = pd.infer_freq(df['date']) - self.assertEqual(result, 'MS') + assert result == 'MS' def test_constructor_ndarray_like(self): # GH 5460#issuecomment-44474502 @@ -198,22 +199,39 @@ def __array__(self, dtype=None): date_range('2000-01-01', periods=3).values]: expected = pd.Index(array) result = pd.Index(ArrayLike(array)) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) + + def test_constructor_int_dtype_nan(self): + # see gh-15187 + data = [np.nan] + msg = "cannot convert" + + with tm.assert_raises_regex(ValueError, msg): + Index(data, dtype='int64') + + with tm.assert_raises_regex(ValueError, msg): + Index(data, dtype='uint64') + + # This, however, should not break + # because NaN is float. + expected = Float64Index(data) + result = Index(data, dtype='float') + tm.assert_index_equal(result, expected) def test_index_ctor_infer_nan_nat(self): # GH 13467 exp = pd.Float64Index([np.nan, np.nan]) - self.assertEqual(exp.dtype, np.float64) + assert exp.dtype == np.float64 tm.assert_index_equal(Index([np.nan, np.nan]), exp) tm.assert_index_equal(Index(np.array([np.nan, np.nan])), exp) exp = pd.DatetimeIndex([pd.NaT, pd.NaT]) - self.assertEqual(exp.dtype, 'datetime64[ns]') + assert exp.dtype == 'datetime64[ns]' tm.assert_index_equal(Index([pd.NaT, pd.NaT]), exp) tm.assert_index_equal(Index(np.array([pd.NaT, pd.NaT])), exp) exp = pd.DatetimeIndex([pd.NaT, pd.NaT]) - self.assertEqual(exp.dtype, 'datetime64[ns]') + assert exp.dtype == 'datetime64[ns]' for data in [[pd.NaT, np.nan], [np.nan, pd.NaT], [np.nan, np.datetime64('nat')], @@ -222,7 +240,7 @@ def test_index_ctor_infer_nan_nat(self): tm.assert_index_equal(Index(np.array(data, dtype=object)), exp) exp = pd.TimedeltaIndex([pd.NaT, pd.NaT]) - self.assertEqual(exp.dtype, 'timedelta64[ns]') + assert exp.dtype == 'timedelta64[ns]' for data in [[np.nan, np.timedelta64('nat')], [np.timedelta64('nat'), np.nan], @@ -247,47 +265,47 @@ def test_index_ctor_infer_periodindex(self): xp = period_range('2012-1-1', freq='M', periods=3) rs = Index(xp) tm.assert_index_equal(rs, xp) - tm.assertIsInstance(rs, PeriodIndex) + assert isinstance(rs, PeriodIndex) def test_constructor_simple_new(self): idx = Index([1, 2, 3, 4, 5], name='int') result = idx._simple_new(idx, 'int') - self.assert_index_equal(result, idx) + tm.assert_index_equal(result, idx) idx = Index([1.1, np.nan, 2.2, 3.0], name='float') result = idx._simple_new(idx, 'float') - self.assert_index_equal(result, idx) + tm.assert_index_equal(result, idx) idx = Index(['A', 'B', 'C', np.nan], name='obj') result = idx._simple_new(idx, 'obj') - self.assert_index_equal(result, idx) + tm.assert_index_equal(result, idx) def test_constructor_dtypes(self): for idx in [Index(np.array([1, 2, 3], dtype=int)), Index(np.array([1, 2, 3], dtype=int), dtype=int), Index([1, 2, 3], dtype=int)]: - self.assertIsInstance(idx, Int64Index) + assert 
isinstance(idx, Int64Index) - # these should coerce + # These should coerce for idx in [Index(np.array([1., 2., 3.], dtype=float), dtype=int), Index([1., 2., 3.], dtype=int)]: - self.assertIsInstance(idx, Int64Index) + assert isinstance(idx, Int64Index) for idx in [Index(np.array([1., 2., 3.], dtype=float)), Index(np.array([1, 2, 3], dtype=int), dtype=float), Index(np.array([1., 2., 3.], dtype=float), dtype=float), Index([1, 2, 3], dtype=float), Index([1., 2., 3.], dtype=float)]: - self.assertIsInstance(idx, Float64Index) + assert isinstance(idx, Float64Index) for idx in [Index(np.array([True, False, True], dtype=bool)), Index([True, False, True]), Index(np.array([True, False, True], dtype=bool), dtype=bool), Index([True, False, True], dtype=bool)]: - self.assertIsInstance(idx, Index) - self.assertEqual(idx.dtype, object) + assert isinstance(idx, Index) + assert idx.dtype == object for idx in [Index(np.array([1, 2, 3], dtype=int), dtype='category'), Index([1, 2, 3], dtype='category'), @@ -296,32 +314,32 @@ def test_constructor_dtypes(self): dtype='category'), Index([datetime(2011, 1, 1), datetime(2011, 1, 2)], dtype='category')]: - self.assertIsInstance(idx, CategoricalIndex) + assert isinstance(idx, CategoricalIndex) for idx in [Index(np.array([np_datetime64_compat('2011-01-01'), np_datetime64_compat('2011-01-02')])), Index([datetime(2011, 1, 1), datetime(2011, 1, 2)])]: - self.assertIsInstance(idx, DatetimeIndex) + assert isinstance(idx, DatetimeIndex) for idx in [Index(np.array([np_datetime64_compat('2011-01-01'), np_datetime64_compat('2011-01-02')]), dtype=object), Index([datetime(2011, 1, 1), datetime(2011, 1, 2)], dtype=object)]: - self.assertNotIsInstance(idx, DatetimeIndex) - self.assertIsInstance(idx, Index) - self.assertEqual(idx.dtype, object) + assert not isinstance(idx, DatetimeIndex) + assert isinstance(idx, Index) + assert idx.dtype == object for idx in [Index(np.array([np.timedelta64(1, 'D'), np.timedelta64( 1, 'D')])), Index([timedelta(1), timedelta(1)])]: - self.assertIsInstance(idx, TimedeltaIndex) + assert isinstance(idx, TimedeltaIndex) for idx in [Index(np.array([np.timedelta64(1, 'D'), np.timedelta64(1, 'D')]), dtype=object), Index([timedelta(1), timedelta(1)], dtype=object)]: - self.assertNotIsInstance(idx, TimedeltaIndex) - self.assertIsInstance(idx, Index) - self.assertEqual(idx.dtype, object) + assert not isinstance(idx, TimedeltaIndex) + assert isinstance(idx, Index) + assert idx.dtype == object def test_constructor_dtypes_datetime(self): @@ -371,7 +389,7 @@ def test_view_with_args(self): ind = self.indices[i] # with arguments - self.assertRaises(TypeError, lambda: ind.view('i8')) + pytest.raises(TypeError, lambda: ind.view('i8')) # these are ok for i in list(set(self.indices.keys()) - set(restricted)): @@ -380,15 +398,6 @@ def test_view_with_args(self): # with arguments ind.view('i8') - def test_legacy_pickle_identity(self): - - # GH 8431 - pth = tm.get_data_path() - s1 = pd.read_pickle(os.path.join(pth, 's1-0.12.0.pickle')) - s2 = pd.read_pickle(os.path.join(pth, 's2-0.12.0.pickle')) - self.assertFalse(s1.index.identical(s2.index)) - self.assertFalse(s1.index.equals(s2.index)) - def test_astype(self): casted = self.intIndex.astype('i8') @@ -398,20 +407,20 @@ def test_astype(self): # pass on name self.intIndex.name = 'foobar' casted = self.intIndex.astype('i8') - self.assertEqual(casted.name, 'foobar') + assert casted.name == 'foobar' def test_equals_object(self): # same - self.assertTrue(Index(['a', 'b', 'c']).equals(Index(['a', 'b', 'c']))) + assert 
Index(['a', 'b', 'c']).equals(Index(['a', 'b', 'c'])) # different length - self.assertFalse(Index(['a', 'b', 'c']).equals(Index(['a', 'b']))) + assert not Index(['a', 'b', 'c']).equals(Index(['a', 'b'])) # same length, different values - self.assertFalse(Index(['a', 'b', 'c']).equals(Index(['a', 'b', 'd']))) + assert not Index(['a', 'b', 'c']).equals(Index(['a', 'b', 'd'])) # Must also be an Index - self.assertFalse(Index(['a', 'b', 'c']).equals(['a', 'b', 'c'])) + assert not Index(['a', 'b', 'c']).equals(['a', 'b', 'c']) def test_insert(self): @@ -420,35 +429,35 @@ def test_insert(self): result = Index(['b', 'c', 'd']) # test 0th element - self.assert_index_equal(Index(['a', 'b', 'c', 'd']), - result.insert(0, 'a')) + tm.assert_index_equal(Index(['a', 'b', 'c', 'd']), + result.insert(0, 'a')) # test Nth element that follows Python list behavior - self.assert_index_equal(Index(['b', 'c', 'e', 'd']), - result.insert(-1, 'e')) + tm.assert_index_equal(Index(['b', 'c', 'e', 'd']), + result.insert(-1, 'e')) # test loc +/- neq (0, -1) - self.assert_index_equal(result.insert(1, 'z'), result.insert(-2, 'z')) + tm.assert_index_equal(result.insert(1, 'z'), result.insert(-2, 'z')) # test empty null_index = Index([]) - self.assert_index_equal(Index(['a']), null_index.insert(0, 'a')) + tm.assert_index_equal(Index(['a']), null_index.insert(0, 'a')) def test_delete(self): idx = Index(['a', 'b', 'c', 'd'], name='idx') expected = Index(['b', 'c', 'd'], name='idx') result = idx.delete(0) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) + tm.assert_index_equal(result, expected) + assert result.name == expected.name expected = Index(['a', 'b', 'c'], name='idx') result = idx.delete(-1) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) + tm.assert_index_equal(result, expected) + assert result.name == expected.name - with tm.assertRaises((IndexError, ValueError)): - # either depeidnig on numpy version + with pytest.raises((IndexError, ValueError)): + # either depending on numpy version result = idx.delete(5) def test_identical(self): @@ -457,60 +466,60 @@ def test_identical(self): i1 = Index(['a', 'b', 'c']) i2 = Index(['a', 'b', 'c']) - self.assertTrue(i1.identical(i2)) + assert i1.identical(i2) i1 = i1.rename('foo') - self.assertTrue(i1.equals(i2)) - self.assertFalse(i1.identical(i2)) + assert i1.equals(i2) + assert not i1.identical(i2) i2 = i2.rename('foo') - self.assertTrue(i1.identical(i2)) + assert i1.identical(i2) i3 = Index([('a', 'a'), ('a', 'b'), ('b', 'a')]) i4 = Index([('a', 'a'), ('a', 'b'), ('b', 'a')], tupleize_cols=False) - self.assertFalse(i3.identical(i4)) + assert not i3.identical(i4) def test_is_(self): ind = Index(range(10)) - self.assertTrue(ind.is_(ind)) - self.assertTrue(ind.is_(ind.view().view().view().view())) - self.assertFalse(ind.is_(Index(range(10)))) - self.assertFalse(ind.is_(ind.copy())) - self.assertFalse(ind.is_(ind.copy(deep=False))) - self.assertFalse(ind.is_(ind[:])) - self.assertFalse(ind.is_(ind.view(np.ndarray).view(Index))) - self.assertFalse(ind.is_(np.array(range(10)))) + assert ind.is_(ind) + assert ind.is_(ind.view().view().view().view()) + assert not ind.is_(Index(range(10))) + assert not ind.is_(ind.copy()) + assert not ind.is_(ind.copy(deep=False)) + assert not ind.is_(ind[:]) + assert not ind.is_(ind.view(np.ndarray).view(Index)) + assert not ind.is_(np.array(range(10))) # quasi-implementation dependent - self.assertTrue(ind.is_(ind.view())) + assert ind.is_(ind.view()) ind2 = 
ind.view() ind2.name = 'bob' - self.assertTrue(ind.is_(ind2)) - self.assertTrue(ind2.is_(ind)) + assert ind.is_(ind2) + assert ind2.is_(ind) # doesn't matter if Indices are *actually* views of underlying data, - self.assertFalse(ind.is_(Index(ind.values))) + assert not ind.is_(Index(ind.values)) arr = np.array(range(1, 11)) ind1 = Index(arr, copy=False) ind2 = Index(arr, copy=False) - self.assertFalse(ind1.is_(ind2)) + assert not ind1.is_(ind2) def test_asof(self): d = self.dateIndex[0] - self.assertEqual(self.dateIndex.asof(d), d) - self.assertTrue(np.isnan(self.dateIndex.asof(d - timedelta(1)))) + assert self.dateIndex.asof(d) == d + assert isnull(self.dateIndex.asof(d - timedelta(1))) d = self.dateIndex[-1] - self.assertEqual(self.dateIndex.asof(d + timedelta(1)), d) + assert self.dateIndex.asof(d + timedelta(1)) == d d = self.dateIndex[0].to_pydatetime() - tm.assertIsInstance(self.dateIndex.asof(d), Timestamp) + assert isinstance(self.dateIndex.asof(d), Timestamp) def test_asof_datetime_partial(self): idx = pd.date_range('2010-01-01', periods=2, freq='m') expected = Timestamp('2010-02-28') result = idx.asof('2010-02') - self.assertEqual(result, expected) - self.assertFalse(isinstance(result, Index)) + assert result == expected + assert not isinstance(result, Index) def test_nanosecond_index_access(self): s = Series([Timestamp('20130101')]).values.view('i8')[0] @@ -520,12 +529,11 @@ def test_nanosecond_index_access(self): first_value = x.asof(x.index[0]) # this does not yet work, as parsing strings is done via dateutil - # self.assertEqual(first_value, - # x['2013-01-01 00:00:00.000000050+0000']) + # assert first_value == x['2013-01-01 00:00:00.000000050+0000'] exp_ts = np_datetime64_compat('2013-01-01 00:00:00.000000050+0000', 'ns') - self.assertEqual(first_value, x[Timestamp(exp_ts)]) + assert first_value == x[Timestamp(exp_ts)] def test_comparators(self): index = self.dateIndex @@ -538,7 +546,7 @@ def _check(op): arr_result = op(arr, element) index_result = op(index, element) - self.assertIsInstance(index_result, np.ndarray) + assert isinstance(index_result, np.ndarray) tm.assert_numpy_array_equal(arr_result, index_result) _check(operator.eq) @@ -555,16 +563,16 @@ def test_booleanindex(self): subIndex = self.strIndex[boolIdx] for i, val in enumerate(subIndex): - self.assertEqual(subIndex.get_loc(val), i) + assert subIndex.get_loc(val) == i subIndex = self.strIndex[list(boolIdx)] for i, val in enumerate(subIndex): - self.assertEqual(subIndex.get_loc(val), i) + assert subIndex.get_loc(val) == i def test_fancy(self): sl = self.strIndex[[1, 2, 3]] for i in sl: - self.assertEqual(i, sl[sl.get_loc(i)]) + assert i == sl[sl.get_loc(i)] def test_empty_fancy(self): empty_farr = np.array([], dtype=np.float_) @@ -576,64 +584,69 @@ def test_empty_fancy(self): for idx in [self.strIndex, self.intIndex, self.floatIndex]: empty_idx = idx.__class__([]) - self.assertTrue(idx[[]].identical(empty_idx)) - self.assertTrue(idx[empty_iarr].identical(empty_idx)) - self.assertTrue(idx[empty_barr].identical(empty_idx)) + assert idx[[]].identical(empty_idx) + assert idx[empty_iarr].identical(empty_idx) + assert idx[empty_barr].identical(empty_idx) # np.ndarray only accepts ndarray of int & bool dtypes, so should # Index. 
- self.assertRaises(IndexError, idx.__getitem__, empty_farr) + pytest.raises(IndexError, idx.__getitem__, empty_farr) def test_getitem(self): arr = np.array(self.dateIndex) exp = self.dateIndex[5] exp = _to_m8(exp) - self.assertEqual(exp, arr[5]) + assert exp == arr[5] def test_intersection(self): first = self.strIndex[:20] second = self.strIndex[:10] intersect = first.intersection(second) - self.assertTrue(tm.equalContents(intersect, second)) + assert tm.equalContents(intersect, second) # Corner cases inter = first.intersection(first) - self.assertIs(inter, first) + assert inter is first idx1 = Index([1, 2, 3, 4, 5], name='idx') # if target has the same name, it is preserved idx2 = Index([3, 4, 5, 6, 7], name='idx') expected2 = Index([3, 4, 5], name='idx') result2 = idx1.intersection(idx2) - self.assert_index_equal(result2, expected2) - self.assertEqual(result2.name, expected2.name) + tm.assert_index_equal(result2, expected2) + assert result2.name == expected2.name # if target name is different, it will be reset idx3 = Index([3, 4, 5, 6, 7], name='other') expected3 = Index([3, 4, 5], name=None) result3 = idx1.intersection(idx3) - self.assert_index_equal(result3, expected3) - self.assertEqual(result3.name, expected3.name) + tm.assert_index_equal(result3, expected3) + assert result3.name == expected3.name # non monotonic idx1 = Index([5, 3, 2, 4, 1], name='idx') idx2 = Index([4, 7, 6, 5, 3], name='idx') - result2 = idx1.intersection(idx2) - self.assertTrue(tm.equalContents(result2, expected2)) - self.assertEqual(result2.name, expected2.name) + expected = Index([5, 3, 4], name='idx') + result = idx1.intersection(idx2) + tm.assert_index_equal(result, expected) - idx3 = Index([4, 7, 6, 5, 3], name='other') - result3 = idx1.intersection(idx3) - self.assertTrue(tm.equalContents(result3, expected3)) - self.assertEqual(result3.name, expected3.name) + idx2 = Index([4, 7, 6, 5, 3], name='other') + expected = Index([5, 3, 4], name=None) + result = idx1.intersection(idx2) + tm.assert_index_equal(result, expected) # non-monotonic non-unique idx1 = Index(['A', 'B', 'A', 'C']) idx2 = Index(['B', 'D']) expected = Index(['B'], dtype='object') result = idx1.intersection(idx2) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) + + idx2 = Index(['B', 'D', 'A']) + expected = Index(['A', 'B', 'A'], dtype='object') + result = idx1.intersection(idx2) + tm.assert_index_equal(result, expected) # preserve names first = self.strIndex[5:20] @@ -641,39 +654,39 @@ def test_intersection(self): first.name = 'A' second.name = 'A' intersect = first.intersection(second) - self.assertEqual(intersect.name, 'A') + assert intersect.name == 'A' second.name = 'B' intersect = first.intersection(second) - self.assertIsNone(intersect.name) + assert intersect.name is None first.name = None second.name = 'B' intersect = first.intersection(second) - self.assertIsNone(intersect.name) + assert intersect.name is None def test_union(self): first = self.strIndex[5:20] second = self.strIndex[:10] everything = self.strIndex[:20] union = first.union(second) - self.assertTrue(tm.equalContents(union, everything)) + assert tm.equalContents(union, everything) # GH 10149 cases = [klass(second.values) for klass in [np.array, Series, list]] for case in cases: result = first.union(case) - self.assertTrue(tm.equalContents(result, everything)) + assert tm.equalContents(result, everything) # Corner cases union = first.union(first) - self.assertIs(union, first) + assert union is first union = first.union([]) - 
self.assertIs(union, first) + assert union is first union = Index([]).union(first) - self.assertIs(union, first) + assert union is first # preserve names first = Index(list('ab'), name='A') @@ -739,8 +752,8 @@ def test_union(self): else: appended = np.append(self.strIndex, self.dateIndex.astype('O')) - self.assertTrue(tm.equalContents(firstCat, appended)) - self.assertTrue(tm.equalContents(secondCat, self.strIndex)) + assert tm.equalContents(firstCat, appended) + assert tm.equalContents(secondCat, self.strIndex) tm.assert_contains_all(self.strIndex, firstCat) tm.assert_contains_all(self.strIndex, secondCat) tm.assert_contains_all(self.dateIndex, firstCat) @@ -748,63 +761,105 @@ def test_union(self): def test_add(self): idx = self.strIndex expected = Index(self.strIndex.values * 2) - self.assert_index_equal(idx + idx, expected) - self.assert_index_equal(idx + idx.tolist(), expected) - self.assert_index_equal(idx.tolist() + idx, expected) + tm.assert_index_equal(idx + idx, expected) + tm.assert_index_equal(idx + idx.tolist(), expected) + tm.assert_index_equal(idx.tolist() + idx, expected) # test add and radd idx = Index(list('abc')) expected = Index(['a1', 'b1', 'c1']) - self.assert_index_equal(idx + '1', expected) + tm.assert_index_equal(idx + '1', expected) expected = Index(['1a', '1b', '1c']) - self.assert_index_equal('1' + idx, expected) + tm.assert_index_equal('1' + idx, expected) def test_sub(self): idx = self.strIndex - self.assertRaises(TypeError, lambda: idx - 'a') - self.assertRaises(TypeError, lambda: idx - idx) - self.assertRaises(TypeError, lambda: idx - idx.tolist()) - self.assertRaises(TypeError, lambda: idx.tolist() - idx) + pytest.raises(TypeError, lambda: idx - 'a') + pytest.raises(TypeError, lambda: idx - idx) + pytest.raises(TypeError, lambda: idx - idx.tolist()) + pytest.raises(TypeError, lambda: idx.tolist() - idx) + + def test_map_identity_mapping(self): + # GH 12766 + for name, cur_index in self.indices.items(): + tm.assert_index_equal(cur_index, cur_index.map(lambda x: x)) + + def test_map_with_tuples(self): + # GH 12766 + + # Test that returning a single tuple from an Index + # returns an Index. + boolean_index = tm.makeIntIndex(3).map(lambda x: (x,)) + expected = Index([(0,), (1,), (2,)]) + tm.assert_index_equal(boolean_index, expected) + + # Test that returning a tuple from a map of a single index + # returns a MultiIndex object. + boolean_index = tm.makeIntIndex(3).map(lambda x: (x, x == 1)) + expected = MultiIndex.from_tuples([(0, False), (1, True), (2, False)]) + tm.assert_index_equal(boolean_index, expected) + + # Test that returning a single object from a MultiIndex + # returns an Index. 
+ first_level = ['foo', 'bar', 'baz'] + multi_index = MultiIndex.from_tuples(lzip(first_level, [1, 2, 3])) + reduced_index = multi_index.map(lambda x: x[0]) + tm.assert_index_equal(reduced_index, Index(first_level)) + + def test_map_tseries_indices_return_index(self): + date_index = tm.makeDateIndex(10) + exp = Index([1] * 10) + tm.assert_index_equal(exp, date_index.map(lambda x: 1)) + + period_index = tm.makePeriodIndex(10) + tm.assert_index_equal(exp, period_index.map(lambda x: 1)) + + tdelta_index = tm.makeTimedeltaIndex(10) + tm.assert_index_equal(exp, tdelta_index.map(lambda x: 1)) + + date_index = tm.makeDateIndex(24, freq='h', name='hourly') + exp = Index(range(24), name='hourly') + tm.assert_index_equal(exp, date_index.map(lambda x: x.hour)) def test_append_multiple(self): index = Index(['a', 'b', 'c', 'd', 'e', 'f']) foos = [index[:2], index[2:4], index[4:]] result = foos[0].append(foos[1:]) - self.assert_index_equal(result, index) + tm.assert_index_equal(result, index) # empty result = index.append([]) - self.assert_index_equal(result, index) + tm.assert_index_equal(result, index) def test_append_empty_preserve_name(self): left = Index([], name='foo') right = Index([1, 2, 3], name='foo') result = left.append(right) - self.assertEqual(result.name, 'foo') + assert result.name == 'foo' left = Index([], name='foo') right = Index([1, 2, 3], name='bar') result = left.append(right) - self.assertIsNone(result.name) + assert result.name is None def test_add_string(self): # from bug report index = Index(['a', 'b', 'c']) index2 = index + 'foo' - self.assertNotIn('a', index2) - self.assertIn('afoo', index2) + assert 'a' not in index2 + assert 'afoo' in index2 def test_iadd_string(self): index = pd.Index(['a', 'b', 'c']) # doesn't fail test unless there is a check before `+=` - self.assertIn('a', index) + assert 'a' in index index += '_x' - self.assertIn('a_x', index) + assert 'a_x' in index def test_difference(self): @@ -815,23 +870,23 @@ def test_difference(self): # different names result = first.difference(second) - self.assertTrue(tm.equalContents(result, answer)) - self.assertEqual(result.name, None) + assert tm.equalContents(result, answer) + assert result.name is None # same names second.name = 'name' result = first.difference(second) - self.assertEqual(result.name, 'name') + assert result.name == 'name' # with empty result = first.difference([]) - self.assertTrue(tm.equalContents(result, first)) - self.assertEqual(result.name, first.name) + assert tm.equalContents(result, first) + assert result.name == first.name - # with everythin + # with everything result = first.difference(first) - self.assertEqual(len(result), 0) - self.assertEqual(result.name, first.name) + assert len(result) == 0 + assert result.name == first.name def test_symmetric_difference(self): # smoke @@ -839,20 +894,20 @@ def test_symmetric_difference(self): idx2 = Index([2, 3, 4, 5]) result = idx1.symmetric_difference(idx2) expected = Index([1, 5]) - self.assertTrue(tm.equalContents(result, expected)) - self.assertIsNone(result.name) + assert tm.equalContents(result, expected) + assert result.name is None # __xor__ syntax expected = idx1 ^ idx2 - self.assertTrue(tm.equalContents(result, expected)) - self.assertIsNone(result.name) + assert tm.equalContents(result, expected) + assert result.name is None # multiIndex idx1 = MultiIndex.from_tuples(self.tuples) idx2 = MultiIndex.from_tuples([('foo', 1), ('bar', 3)]) result = idx1.symmetric_difference(idx2) expected = MultiIndex.from_tuples([('bar', 2), ('baz', 3), ('bar', 
3)]) - self.assertTrue(tm.equalContents(result, expected)) + assert tm.equalContents(result, expected) # nans: # GH 13514 change: {nan} - {nan} == {} @@ -874,32 +929,32 @@ def test_symmetric_difference(self): idx2 = np.array([2, 3, 4, 5]) expected = Index([1, 5]) result = idx1.symmetric_difference(idx2) - self.assertTrue(tm.equalContents(result, expected)) - self.assertEqual(result.name, 'idx1') + assert tm.equalContents(result, expected) + assert result.name == 'idx1' result = idx1.symmetric_difference(idx2, result_name='new_name') - self.assertTrue(tm.equalContents(result, expected)) - self.assertEqual(result.name, 'new_name') + assert tm.equalContents(result, expected) + assert result.name == 'new_name' def test_is_numeric(self): - self.assertFalse(self.dateIndex.is_numeric()) - self.assertFalse(self.strIndex.is_numeric()) - self.assertTrue(self.intIndex.is_numeric()) - self.assertTrue(self.floatIndex.is_numeric()) - self.assertFalse(self.catIndex.is_numeric()) + assert not self.dateIndex.is_numeric() + assert not self.strIndex.is_numeric() + assert self.intIndex.is_numeric() + assert self.floatIndex.is_numeric() + assert not self.catIndex.is_numeric() def test_is_object(self): - self.assertTrue(self.strIndex.is_object()) - self.assertTrue(self.boolIndex.is_object()) - self.assertFalse(self.catIndex.is_object()) - self.assertFalse(self.intIndex.is_object()) - self.assertFalse(self.dateIndex.is_object()) - self.assertFalse(self.floatIndex.is_object()) + assert self.strIndex.is_object() + assert self.boolIndex.is_object() + assert not self.catIndex.is_object() + assert not self.intIndex.is_object() + assert not self.dateIndex.is_object() + assert not self.floatIndex.is_object() def test_is_all_dates(self): - self.assertTrue(self.dateIndex.is_all_dates) - self.assertFalse(self.strIndex.is_all_dates) - self.assertFalse(self.intIndex.is_all_dates) + assert self.dateIndex.is_all_dates + assert not self.strIndex.is_all_dates + assert not self.intIndex.is_all_dates def test_summary(self): self._check_method_works(Index.summary) @@ -907,8 +962,8 @@ def test_summary(self): ind = Index(['{other}%s', "~:{range}:0"], name='A') result = ind.summary() # shouldn't be formatted accidentally. - self.assertIn('~:{range}:0', result) - self.assertIn('{other}%s', result) + assert '~:{range}:0' in result + assert '{other}%s' in result def test_format(self): self._check_method_works(Index.format) @@ -922,19 +977,19 @@ def test_format(self): index = Index([now]) formatted = index.format() expected = [str(index[0])] - self.assertEqual(formatted, expected) + assert formatted == expected # 2845 index = Index([1, 2.0 + 3.0j, np.nan]) formatted = index.format() expected = [str(index[0]), str(index[1]), u('NaN')] - self.assertEqual(formatted, expected) + assert formatted == expected # is this really allowed? 
index = Index([1, 2.0 + 3.0j, None]) formatted = index.format() expected = [str(index[0]), str(index[1]), u('NaN')] - self.assertEqual(formatted, expected) + assert formatted == expected self.strIndex[:0].format() @@ -944,27 +999,27 @@ def test_format_with_name_time_info(self): dates = Index([dt + inc for dt in self.dateIndex], name='something') formatted = dates.format(name=True) - self.assertEqual(formatted[0], 'something') + assert formatted[0] == 'something' def test_format_datetime_with_time(self): t = Index([datetime(2012, 2, 7), datetime(2012, 2, 7, 23)]) result = t.format() expected = ['2012-02-07 00:00:00', '2012-02-07 23:00:00'] - self.assertEqual(len(result), 2) - self.assertEqual(result, expected) + assert len(result) == 2 + assert result == expected def test_format_none(self): values = ['a', 'b', 'c', None] idx = Index(values) idx.format() - self.assertIsNone(idx[3]) + assert idx[3] is None def test_logical_compat(self): idx = self.create_index() - self.assertEqual(idx.all(), idx.values.all()) - self.assertEqual(idx.any(), idx.values.any()) + assert idx.all() == idx.values.all() + assert idx.any() == idx.values.any() def _check_method_works(self, method): method(self.empty) @@ -1006,10 +1061,10 @@ def test_get_indexer_invalid(self): # GH10411 idx = Index(np.arange(10)) - with tm.assertRaisesRegexp(ValueError, 'tolerance argument'): + with tm.assert_raises_regex(ValueError, 'tolerance argument'): idx.get_indexer([1, 0], tolerance=1) - with tm.assertRaisesRegexp(ValueError, 'limit argument'): + with tm.assert_raises_regex(ValueError, 'limit argument'): idx.get_indexer([1, 0], limit=1) def test_get_indexer_nearest(self): @@ -1043,7 +1098,7 @@ def test_get_indexer_nearest(self): tm.assert_numpy_array_equal(actual, np.array(expected, dtype=np.intp)) - with tm.assertRaisesRegexp(ValueError, 'limit argument'): + with tm.assert_raises_regex(ValueError, 'limit argument'): idx.get_indexer([1, 0], method='nearest', limit=1) def test_get_indexer_nearest_decreasing(self): @@ -1072,41 +1127,41 @@ def test_get_indexer_strings(self): expected = np.array([0, 0, 1, -1], dtype=np.intp) tm.assert_numpy_array_equal(actual, expected) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): idx.get_indexer(['a', 'b', 'c', 'd'], method='nearest') - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): idx.get_indexer(['a', 'b', 'c', 'd'], method='pad', tolerance=2) def test_get_loc(self): idx = pd.Index([0, 1, 2]) all_methods = [None, 'pad', 'backfill', 'nearest'] for method in all_methods: - self.assertEqual(idx.get_loc(1, method=method), 1) + assert idx.get_loc(1, method=method) == 1 if method is not None: - self.assertEqual(idx.get_loc(1, method=method, tolerance=0), 1) - with tm.assertRaises(TypeError): + assert idx.get_loc(1, method=method, tolerance=0) == 1 + with pytest.raises(TypeError): idx.get_loc([1, 2], method=method) for method, loc in [('pad', 1), ('backfill', 2), ('nearest', 1)]: - self.assertEqual(idx.get_loc(1.1, method), loc) + assert idx.get_loc(1.1, method) == loc for method, loc in [('pad', 1), ('backfill', 2), ('nearest', 1)]: - self.assertEqual(idx.get_loc(1.1, method, tolerance=1), loc) + assert idx.get_loc(1.1, method, tolerance=1) == loc for method in ['pad', 'backfill', 'nearest']: - with tm.assertRaises(KeyError): + with pytest.raises(KeyError): idx.get_loc(1.1, method, tolerance=0.05) - with tm.assertRaisesRegexp(ValueError, 'must be numeric'): + with tm.assert_raises_regex(ValueError, 'must be numeric'): idx.get_loc(1.1, 'nearest', 
tolerance='invalid') - with tm.assertRaisesRegexp(ValueError, 'tolerance .* valid if'): + with tm.assert_raises_regex(ValueError, 'tolerance .* valid if'): idx.get_loc(1.1, tolerance=1) idx = pd.Index(['a', 'c']) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): idx.get_loc('a', method='nearest') - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): idx.get_loc('a', method='pad', tolerance='invalid') def test_slice_locs(self): @@ -1114,71 +1169,71 @@ def test_slice_locs(self): idx = Index(np.array([0, 1, 2, 5, 6, 7, 9, 10], dtype=dtype)) n = len(idx) - self.assertEqual(idx.slice_locs(start=2), (2, n)) - self.assertEqual(idx.slice_locs(start=3), (3, n)) - self.assertEqual(idx.slice_locs(3, 8), (3, 6)) - self.assertEqual(idx.slice_locs(5, 10), (3, n)) - self.assertEqual(idx.slice_locs(end=8), (0, 6)) - self.assertEqual(idx.slice_locs(end=9), (0, 7)) + assert idx.slice_locs(start=2) == (2, n) + assert idx.slice_locs(start=3) == (3, n) + assert idx.slice_locs(3, 8) == (3, 6) + assert idx.slice_locs(5, 10) == (3, n) + assert idx.slice_locs(end=8) == (0, 6) + assert idx.slice_locs(end=9) == (0, 7) # reversed idx2 = idx[::-1] - self.assertEqual(idx2.slice_locs(8, 2), (2, 6)) - self.assertEqual(idx2.slice_locs(7, 3), (2, 5)) + assert idx2.slice_locs(8, 2) == (2, 6) + assert idx2.slice_locs(7, 3) == (2, 5) # float slicing idx = Index(np.array([0, 1, 2, 5, 6, 7, 9, 10], dtype=float)) n = len(idx) - self.assertEqual(idx.slice_locs(5.0, 10.0), (3, n)) - self.assertEqual(idx.slice_locs(4.5, 10.5), (3, 8)) + assert idx.slice_locs(5.0, 10.0) == (3, n) + assert idx.slice_locs(4.5, 10.5) == (3, 8) idx2 = idx[::-1] - self.assertEqual(idx2.slice_locs(8.5, 1.5), (2, 6)) - self.assertEqual(idx2.slice_locs(10.5, -1), (0, n)) + assert idx2.slice_locs(8.5, 1.5) == (2, 6) + assert idx2.slice_locs(10.5, -1) == (0, n) # int slicing with floats # GH 4892, these are all TypeErrors idx = Index(np.array([0, 1, 2, 5, 6, 7, 9, 10], dtype=int)) - self.assertRaises(TypeError, - lambda: idx.slice_locs(5.0, 10.0), (3, n)) - self.assertRaises(TypeError, - lambda: idx.slice_locs(4.5, 10.5), (3, 8)) + pytest.raises(TypeError, + lambda: idx.slice_locs(5.0, 10.0), (3, n)) + pytest.raises(TypeError, + lambda: idx.slice_locs(4.5, 10.5), (3, 8)) idx2 = idx[::-1] - self.assertRaises(TypeError, - lambda: idx2.slice_locs(8.5, 1.5), (2, 6)) - self.assertRaises(TypeError, - lambda: idx2.slice_locs(10.5, -1), (0, n)) + pytest.raises(TypeError, + lambda: idx2.slice_locs(8.5, 1.5), (2, 6)) + pytest.raises(TypeError, + lambda: idx2.slice_locs(10.5, -1), (0, n)) def test_slice_locs_dup(self): idx = Index(['a', 'a', 'b', 'c', 'd', 'd']) - self.assertEqual(idx.slice_locs('a', 'd'), (0, 6)) - self.assertEqual(idx.slice_locs(end='d'), (0, 6)) - self.assertEqual(idx.slice_locs('a', 'c'), (0, 4)) - self.assertEqual(idx.slice_locs('b', 'd'), (2, 6)) + assert idx.slice_locs('a', 'd') == (0, 6) + assert idx.slice_locs(end='d') == (0, 6) + assert idx.slice_locs('a', 'c') == (0, 4) + assert idx.slice_locs('b', 'd') == (2, 6) idx2 = idx[::-1] - self.assertEqual(idx2.slice_locs('d', 'a'), (0, 6)) - self.assertEqual(idx2.slice_locs(end='a'), (0, 6)) - self.assertEqual(idx2.slice_locs('d', 'b'), (0, 4)) - self.assertEqual(idx2.slice_locs('c', 'a'), (2, 6)) + assert idx2.slice_locs('d', 'a') == (0, 6) + assert idx2.slice_locs(end='a') == (0, 6) + assert idx2.slice_locs('d', 'b') == (0, 4) + assert idx2.slice_locs('c', 'a') == (2, 6) for dtype in [int, float]: idx = Index(np.array([10, 12, 12, 14], dtype=dtype)) - 
self.assertEqual(idx.slice_locs(12, 12), (1, 3)) - self.assertEqual(idx.slice_locs(11, 13), (1, 3)) + assert idx.slice_locs(12, 12) == (1, 3) + assert idx.slice_locs(11, 13) == (1, 3) idx2 = idx[::-1] - self.assertEqual(idx2.slice_locs(12, 12), (1, 3)) - self.assertEqual(idx2.slice_locs(13, 11), (1, 3)) + assert idx2.slice_locs(12, 12) == (1, 3) + assert idx2.slice_locs(13, 11) == (1, 3) def test_slice_locs_na(self): idx = Index([np.nan, 1, 2]) - self.assertRaises(KeyError, idx.slice_locs, start=1.5) - self.assertRaises(KeyError, idx.slice_locs, end=1.5) - self.assertEqual(idx.slice_locs(1), (1, 3)) - self.assertEqual(idx.slice_locs(np.nan), (0, 3)) + pytest.raises(KeyError, idx.slice_locs, start=1.5) + pytest.raises(KeyError, idx.slice_locs, end=1.5) + assert idx.slice_locs(1) == (1, 3) + assert idx.slice_locs(np.nan) == (0, 3) idx = Index([0, np.nan, np.nan, 1, 2]) - self.assertEqual(idx.slice_locs(np.nan), (1, 5)) + assert idx.slice_locs(np.nan) == (1, 5) def test_slice_locs_negative_step(self): idx = Index(list('bcdxy')) @@ -1190,19 +1245,19 @@ def check_slice(in_slice, expected): in_slice.step) result = idx[s_start:s_stop:in_slice.step] expected = pd.Index(list(expected)) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) for in_slice, expected in [ - (SLC[::-1], 'yxdcb'), (SLC['b':'y':-1], ''), - (SLC['b'::-1], 'b'), (SLC[:'b':-1], 'yxdcb'), - (SLC[:'y':-1], 'y'), (SLC['y'::-1], 'yxdcb'), - (SLC['y'::-4], 'yb'), - # absent labels - (SLC[:'a':-1], 'yxdcb'), (SLC[:'a':-2], 'ydb'), - (SLC['z'::-1], 'yxdcb'), (SLC['z'::-3], 'yc'), - (SLC['m'::-1], 'dcb'), (SLC[:'m':-1], 'yx'), - (SLC['a':'a':-1], ''), (SLC['z':'z':-1], ''), - (SLC['m':'m':-1], '') + (SLC[::-1], 'yxdcb'), (SLC['b':'y':-1], ''), + (SLC['b'::-1], 'b'), (SLC[:'b':-1], 'yxdcb'), + (SLC[:'y':-1], 'y'), (SLC['y'::-1], 'yxdcb'), + (SLC['y'::-4], 'yb'), + # absent labels + (SLC[:'a':-1], 'yxdcb'), (SLC[:'a':-2], 'ydb'), + (SLC['z'::-1], 'yxdcb'), (SLC['z'::-3], 'yc'), + (SLC['m'::-1], 'dcb'), (SLC[:'m':-1], 'yx'), + (SLC['a':'a':-1], ''), (SLC['z':'z':-1], ''), + (SLC['m':'m':-1], '') ]: check_slice(in_slice, expected) @@ -1212,40 +1267,40 @@ def test_drop(self): drop = self.strIndex[lrange(5, 10)] dropped = self.strIndex.drop(drop) expected = self.strIndex[lrange(5) + lrange(10, n)] - self.assert_index_equal(dropped, expected) + tm.assert_index_equal(dropped, expected) - self.assertRaises(ValueError, self.strIndex.drop, ['foo', 'bar']) - self.assertRaises(ValueError, self.strIndex.drop, ['1', 'bar']) + pytest.raises(ValueError, self.strIndex.drop, ['foo', 'bar']) + pytest.raises(ValueError, self.strIndex.drop, ['1', 'bar']) # errors='ignore' mixed = drop.tolist() + ['foo'] dropped = self.strIndex.drop(mixed, errors='ignore') expected = self.strIndex[lrange(5) + lrange(10, n)] - self.assert_index_equal(dropped, expected) + tm.assert_index_equal(dropped, expected) dropped = self.strIndex.drop(['foo', 'bar'], errors='ignore') expected = self.strIndex[lrange(n)] - self.assert_index_equal(dropped, expected) + tm.assert_index_equal(dropped, expected) dropped = self.strIndex.drop(self.strIndex[0]) expected = self.strIndex[1:] - self.assert_index_equal(dropped, expected) + tm.assert_index_equal(dropped, expected) ser = Index([1, 2, 3]) dropped = ser.drop(1) expected = Index([2, 3]) - self.assert_index_equal(dropped, expected) + tm.assert_index_equal(dropped, expected) # errors='ignore' - self.assertRaises(ValueError, ser.drop, [3, 4]) + pytest.raises(ValueError, ser.drop, [3, 4]) dropped = ser.drop(4, 
errors='ignore') expected = Index([1, 2, 3]) - self.assert_index_equal(dropped, expected) + tm.assert_index_equal(dropped, expected) dropped = ser.drop([3, 4, 5], errors='ignore') expected = Index([1, 2]) - self.assert_index_equal(dropped, expected) + tm.assert_index_equal(dropped, expected) def test_tuple_union_bug(self): import pandas @@ -1264,19 +1319,19 @@ def test_tuple_union_bug(self): int_idx = idx1.intersection(idx2) # needs to be 1d like idx1 and idx2 expected = idx1[:4] # pandas.Index(sorted(set(idx1) & set(idx2))) - self.assertEqual(int_idx.ndim, 1) - self.assert_index_equal(int_idx, expected) + assert int_idx.ndim == 1 + tm.assert_index_equal(int_idx, expected) # union broken union_idx = idx1.union(idx2) expected = idx2 - self.assertEqual(union_idx.ndim, 1) - self.assert_index_equal(union_idx, expected) + assert union_idx.ndim == 1 + tm.assert_index_equal(union_idx, expected) def test_is_monotonic_incomparable(self): index = Index([5, datetime.now(), 7]) - self.assertFalse(index.is_monotonic) - self.assertFalse(index.is_monotonic_decreasing) + assert not index.is_monotonic + assert not index.is_monotonic_decreasing def test_get_set_value(self): values = np.random.randn(100) @@ -1285,7 +1340,7 @@ def test_get_set_value(self): assert_almost_equal(self.dateIndex.get_value(values, date), values[67]) self.dateIndex.set_value(values, date, 10) - self.assertEqual(values[67], 10) + assert values[67] == 10 def test_isin(self): values = ['foo', 'bar', 'quux'] @@ -1302,8 +1357,8 @@ def test_isin(self): # empty, return dtype bool idx = Index([]) result = idx.isin(values) - self.assertEqual(len(result), 0) - self.assertEqual(result.dtype, np.bool_) + assert len(result) == 0 + assert result.dtype == np.bool_ def test_isin_nan(self): tm.assert_numpy_array_equal(Index(['a', np.nan]).isin([np.nan]), @@ -1314,14 +1369,17 @@ def test_isin_nan(self): np.array([False, False])) tm.assert_numpy_array_equal(Index(['a', np.nan]).isin([pd.NaT]), np.array([False, False])) + # Float64Index overrides isin, so must be checked separately tm.assert_numpy_array_equal(Float64Index([1.0, np.nan]).isin([np.nan]), np.array([False, True])) tm.assert_numpy_array_equal( Float64Index([1.0, np.nan]).isin([float('nan')]), np.array([False, True])) + + # we cannot compare NaT with NaN tm.assert_numpy_array_equal(Float64Index([1.0, np.nan]).isin([pd.NaT]), - np.array([False, True])) + np.array([False, False])) def test_isin_level_kwarg(self): def check_idx(idx): @@ -1331,19 +1389,19 @@ def check_idx(idx): tm.assert_numpy_array_equal(expected, idx.isin(values, level=0)) tm.assert_numpy_array_equal(expected, idx.isin(values, level=-1)) - self.assertRaises(IndexError, idx.isin, values, level=1) - self.assertRaises(IndexError, idx.isin, values, level=10) - self.assertRaises(IndexError, idx.isin, values, level=-2) + pytest.raises(IndexError, idx.isin, values, level=1) + pytest.raises(IndexError, idx.isin, values, level=10) + pytest.raises(IndexError, idx.isin, values, level=-2) - self.assertRaises(KeyError, idx.isin, values, level=1.0) - self.assertRaises(KeyError, idx.isin, values, level='foobar') + pytest.raises(KeyError, idx.isin, values, level=1.0) + pytest.raises(KeyError, idx.isin, values, level='foobar') idx.name = 'foobar' tm.assert_numpy_array_equal(expected, idx.isin(values, level='foobar')) - self.assertRaises(KeyError, idx.isin, values, level='xyzzy') - self.assertRaises(KeyError, idx.isin, values, level=np.nan) + pytest.raises(KeyError, idx.isin, values, level='xyzzy') + pytest.raises(KeyError, idx.isin, values, 
level=np.nan) check_idx(Index(['qux', 'baz', 'foo', 'bar'])) # Float64Index overrides isin, so must be checked separately @@ -1360,11 +1418,11 @@ def test_boolean_cmp(self): def test_get_level_values(self): result = self.strIndex.get_level_values(0) - self.assert_index_equal(result, self.strIndex) + tm.assert_index_equal(result, self.strIndex) def test_slice_keep_name(self): idx = Index(['a', 'b'], name='asdf') - self.assertEqual(idx.name, idx[1:].name) + assert idx.name == idx[1:].name def test_join_self(self): # instance attributes of the form self.Index @@ -1375,7 +1433,7 @@ def test_join_self(self): for kind in kinds: joined = res.join(res, how=kind) - self.assertIs(res, joined) + assert res is joined def test_str_attribute(self): # GH9068 @@ -1391,8 +1449,8 @@ def test_str_attribute(self): MultiIndex.from_tuples([('foo', '1'), ('bar', '3')]), PeriodIndex(start='2000', end='2010', freq='A')] for idx in indices: - with self.assertRaisesRegexp(AttributeError, - 'only use .str accessor'): + with tm.assert_raises_regex(AttributeError, + 'only use .str accessor'): idx.str.repeat(2) idx = Index(['a b c', 'd e', 'f']) @@ -1408,7 +1466,7 @@ def test_str_attribute(self): idx = Index(['a1', 'a2', 'b1', 'b2']) expected = np.array([True, True, False, False]) tm.assert_numpy_array_equal(idx.str.startswith('a'), expected) - self.assertIsInstance(idx.str.startswith('a'), np.ndarray) + assert isinstance(idx.str.startswith('a'), np.ndarray) s = Series(range(4), index=idx) expected = Series(range(2), index=['a1', 'a2']) tm.assert_series_equal(s[s.index.str.startswith('a')], expected) @@ -1416,17 +1474,16 @@ def test_str_attribute(self): def test_tab_completion(self): # GH 9910 idx = Index(list('abcd')) - self.assertTrue('str' in dir(idx)) + assert 'str' in dir(idx) idx = Index(range(4)) - self.assertTrue('str' not in dir(idx)) + assert 'str' not in dir(idx) def test_indexing_doesnt_change_class(self): idx = Index([1, 2, 3, 'a', 'b', 'c']) - self.assertTrue(idx[1:3].identical(pd.Index([2, 3], dtype=np.object_))) - self.assertTrue(idx[[0, 1]].identical(pd.Index( - [1, 2], dtype=np.object_))) + assert idx[1:3].identical(pd.Index([2, 3], dtype=np.object_)) + assert idx[[0, 1]].identical(pd.Index([1, 2], dtype=np.object_)) def test_outer_join_sort(self): left_idx = Index(np.random.permutation(15)) @@ -1467,19 +1524,19 @@ def test_take_fill_value(self): msg = ('When allow_fill=True and fill_value is not None, ' 'all indices must be >= -1') - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): idx.take(np.array([1, 0, -2]), fill_value=True) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): idx.take(np.array([1, 0, -5]), fill_value=True) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): idx.take(np.array([1, -5])) def test_reshape_raise(self): msg = "reshaping is not supported" idx = pd.Index([0, 1, 2]) - tm.assertRaisesRegexp(NotImplementedError, msg, - idx.reshape, idx.shape) + tm.assert_raises_regex(NotImplementedError, msg, + idx.reshape, idx.shape) def test_reindex_preserves_name_if_target_is_list_or_ndarray(self): # GH6552 @@ -1488,28 +1545,28 @@ def test_reindex_preserves_name_if_target_is_list_or_ndarray(self): dt_idx = pd.date_range('20130101', periods=3) idx.name = None - self.assertEqual(idx.reindex([])[0].name, None) - self.assertEqual(idx.reindex(np.array([]))[0].name, None) - self.assertEqual(idx.reindex(idx.tolist())[0].name, None) - self.assertEqual(idx.reindex(idx.tolist()[:-1])[0].name, 
None) - self.assertEqual(idx.reindex(idx.values)[0].name, None) - self.assertEqual(idx.reindex(idx.values[:-1])[0].name, None) + assert idx.reindex([])[0].name is None + assert idx.reindex(np.array([]))[0].name is None + assert idx.reindex(idx.tolist())[0].name is None + assert idx.reindex(idx.tolist()[:-1])[0].name is None + assert idx.reindex(idx.values)[0].name is None + assert idx.reindex(idx.values[:-1])[0].name is None # Must preserve name even if dtype changes. - self.assertEqual(idx.reindex(dt_idx.values)[0].name, None) - self.assertEqual(idx.reindex(dt_idx.tolist())[0].name, None) + assert idx.reindex(dt_idx.values)[0].name is None + assert idx.reindex(dt_idx.tolist())[0].name is None idx.name = 'foobar' - self.assertEqual(idx.reindex([])[0].name, 'foobar') - self.assertEqual(idx.reindex(np.array([]))[0].name, 'foobar') - self.assertEqual(idx.reindex(idx.tolist())[0].name, 'foobar') - self.assertEqual(idx.reindex(idx.tolist()[:-1])[0].name, 'foobar') - self.assertEqual(idx.reindex(idx.values)[0].name, 'foobar') - self.assertEqual(idx.reindex(idx.values[:-1])[0].name, 'foobar') + assert idx.reindex([])[0].name == 'foobar' + assert idx.reindex(np.array([]))[0].name == 'foobar' + assert idx.reindex(idx.tolist())[0].name == 'foobar' + assert idx.reindex(idx.tolist()[:-1])[0].name == 'foobar' + assert idx.reindex(idx.values)[0].name == 'foobar' + assert idx.reindex(idx.values[:-1])[0].name == 'foobar' # Must preserve name even if dtype changes. - self.assertEqual(idx.reindex(dt_idx.values)[0].name, 'foobar') - self.assertEqual(idx.reindex(dt_idx.tolist())[0].name, 'foobar') + assert idx.reindex(dt_idx.values)[0].name == 'foobar' + assert idx.reindex(dt_idx.tolist())[0].name == 'foobar' def test_reindex_preserves_type_if_target_is_empty_list_or_array(self): # GH7774 @@ -1518,10 +1575,9 @@ def test_reindex_preserves_type_if_target_is_empty_list_or_array(self): def get_reindex_type(target): return idx.reindex(target)[0].dtype.type - self.assertEqual(get_reindex_type([]), np.object_) - self.assertEqual(get_reindex_type(np.array([])), np.object_) - self.assertEqual(get_reindex_type(np.array([], dtype=np.int64)), - np.object_) + assert get_reindex_type([]) == np.object_ + assert get_reindex_type(np.array([])) == np.object_ + assert get_reindex_type(np.array([], dtype=np.int64)) == np.object_ def test_reindex_doesnt_preserve_type_if_target_is_empty_index(self): # GH7774 @@ -1530,14 +1586,14 @@ def test_reindex_doesnt_preserve_type_if_target_is_empty_index(self): def get_reindex_type(target): return idx.reindex(target)[0].dtype.type - self.assertEqual(get_reindex_type(pd.Int64Index([])), np.int64) - self.assertEqual(get_reindex_type(pd.Float64Index([])), np.float64) - self.assertEqual(get_reindex_type(pd.DatetimeIndex([])), np.datetime64) + assert get_reindex_type(pd.Int64Index([])) == np.int64 + assert get_reindex_type(pd.Float64Index([])) == np.float64 + assert get_reindex_type(pd.DatetimeIndex([])) == np.datetime64 reindexed = idx.reindex(pd.MultiIndex( [pd.Int64Index([]), pd.Float64Index([])], [[], []]))[0] - self.assertEqual(reindexed.levels[0].dtype.type, np.int64) - self.assertEqual(reindexed.levels[1].dtype.type, np.float64) + assert reindexed.levels[0].dtype.type == np.int64 + assert reindexed.levels[1].dtype.type == np.float64 def test_groupby(self): idx = Index(range(5)) @@ -1558,11 +1614,11 @@ def test_equals_op_multiindex(self): mi2 = MultiIndex.from_tuples([(1, 2), (4, 6)]) tm.assert_numpy_array_equal(df.index == mi2, np.array([True, False])) mi3 = MultiIndex.from_tuples([(1, 2), 
(4, 5), (8, 9)]) - with tm.assertRaisesRegexp(ValueError, "Lengths must match"): + with tm.assert_raises_regex(ValueError, "Lengths must match"): df.index == mi3 index_a = Index(['foo', 'bar', 'baz']) - with tm.assertRaisesRegexp(ValueError, "Lengths must match"): + with tm.assert_raises_regex(ValueError, "Lengths must match"): df.index == index_a tm.assert_numpy_array_equal(index_a == mi3, np.array([False, False, False])) @@ -1570,27 +1626,26 @@ def test_equals_op_multiindex(self): def test_conversion_preserves_name(self): # GH 10875 i = pd.Index(['01:02:03', '01:02:04'], name='label') - self.assertEqual(i.name, pd.to_datetime(i).name) - self.assertEqual(i.name, pd.to_timedelta(i).name) + assert i.name == pd.to_datetime(i).name + assert i.name == pd.to_timedelta(i).name def test_string_index_repr(self): # py3/py2 repr can differ because of "u" prefix # which also affects the displayed element size - # suppress flake8 warnings if PY3: coerce = lambda x: x else: - coerce = unicode + coerce = unicode # noqa # short idx = pd.Index(['a', 'bb', 'ccc']) if PY3: expected = u"""Index(['a', 'bb', 'ccc'], dtype='object')""" - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = u"""Index([u'a', u'bb', u'ccc'], dtype='object')""" - self.assertEqual(coerce(idx), expected) + assert coerce(idx) == expected # multiple lines idx = pd.Index(['a', 'bb', 'ccc'] * 10) @@ -1601,7 +1656,7 @@ def test_string_index_repr(self): 'a', 'bb', 'ccc', 'a', 'bb', 'ccc'], dtype='object')""" - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = u"""\ Index([u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc'], dtype='object')""" - self.assertEqual(coerce(idx), expected) + assert coerce(idx) == expected # truncated idx = pd.Index(['a', 'bb', 'ccc'] * 100) @@ -1620,7 +1675,7 @@ def test_string_index_repr(self): 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc'], dtype='object', length=300)""" - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = u"""\ Index([u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc'], dtype='object', length=300)""" - self.assertEqual(coerce(idx), expected) + assert coerce(idx) == expected # short idx = pd.Index([u'あ', u'いい', u'ううう']) if PY3: expected = u"""Index(['あ', 'いい', 'ううう'], dtype='object')""" - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = u"""Index([u'あ', u'いい', u'ううう'], dtype='object')""" - self.assertEqual(coerce(idx), expected) + assert coerce(idx) == expected # multiple lines idx = pd.Index([u'あ', u'いい', u'ううう'] * 10) @@ -1649,7 +1704,7 @@ def test_string_index_repr(self): u" 'あ', 'いい', 'ううう', 'あ', 'いい', " u"'ううう'],\n" u" dtype='object')") - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = (u"Index([u'あ', u'いい', u'ううう', u'あ', u'いい', " u"u'ううう', u'あ', u'いい', u'ううう', u'あ',\n" @@ -1658,7 +1713,7 @@ def test_string_index_repr(self): u" u'ううう', u'あ', u'いい', u'ううう', u'あ', " u"u'いい', u'ううう', u'あ', u'いい', u'ううう'],\n" u" dtype='object')") - self.assertEqual(coerce(idx), expected) + assert coerce(idx) == expected # truncated idx = pd.Index([u'あ', u'いい', u'ううう'] * 100) @@ -1669,7 +1724,7 @@ def test_string_index_repr(self): u" 
'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', " u"'ううう', 'あ', 'いい', 'ううう'],\n" u" dtype='object', length=300)") - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = (u"Index([u'あ', u'いい', u'ううう', u'あ', u'いい', " u"u'ううう', u'あ', u'いい', u'ううう', u'あ',\n" @@ -1678,7 +1733,7 @@ def test_string_index_repr(self): u"u'いい', u'ううう', u'あ', u'いい', u'ううう'],\n" u" dtype='object', length=300)") - self.assertEqual(coerce(idx), expected) + assert coerce(idx) == expected # Enable Unicode option ----------------------------------------- with cf.option_context('display.unicode.east_asian_width', True): @@ -1688,11 +1743,11 @@ def test_string_index_repr(self): if PY3: expected = (u"Index(['あ', 'いい', 'ううう'], " u"dtype='object')") - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = (u"Index([u'あ', u'いい', u'ううう'], " u"dtype='object')") - self.assertEqual(coerce(idx), expected) + assert coerce(idx) == expected # multiple lines idx = pd.Index([u'あ', u'いい', u'ううう'] * 10) @@ -1706,7 +1761,7 @@ def test_string_index_repr(self): u" 'あ', 'いい', 'ううう'],\n" u" dtype='object')""") - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = (u"Index([u'あ', u'いい', u'ううう', u'あ', u'いい', " u"u'ううう', u'あ', u'いい',\n" u" u'あ', u'いい', u'ううう'],\n" u" dtype='object')") - self.assertEqual(coerce(idx), expected) + assert coerce(idx) == expected # truncated idx = pd.Index([u'あ', u'いい', u'ううう'] * 100) @@ -1732,7 +1787,7 @@ def test_string_index_repr(self): u" 'ううう'],\n" u" dtype='object', length=300)") - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = (u"Index([u'あ', u'いい', u'ううう', u'あ', u'いい', " u"u'ううう', u'あ', u'いい',\n" u" u'いい', u'ううう'],\n" u" dtype='object', length=300)") - self.assertEqual(coerce(idx), expected) + assert coerce(idx) == expected -class TestMixedIntIndex(Base, tm.TestCase): +class TestMixedIntIndex(Base): # Mostly the tests from common.py for which the results differ # in py2 and py3 because ints and strings are uncomparable in py3 # (GH 13514) _holder = Index - _multiprocess_can_split_ = True - def setUp(self): + def setup_method(self, method): self.indices = dict(mixedIndex=Index([0, 'a', 1, 'b', 2, 'c'])) self.setup_indices() def create_index(self): return self.mixedIndex - def test_order(self): - idx = self.create_index() - # 9816 deprecated - if PY36: - with tm.assertRaisesRegexp(TypeError, "'>' not supported " - "between instances of 'str' and 'int'"): - with tm.assert_produces_warning(FutureWarning): - idx.order() - elif PY3: - with tm.assertRaisesRegexp(TypeError, "unorderable types"): - with tm.assert_produces_warning(FutureWarning): - idx.order() - else: - with tm.assert_produces_warning(FutureWarning): - idx.order() - def test_argsort(self): idx = self.create_index() if PY36: - with tm.assertRaisesRegexp(TypeError, "'>' not supported " - "between instances of 'str' and 'int'"): + with tm.assert_raises_regex(TypeError, "'>' not supported"): result = idx.argsort() elif PY3: - with tm.assertRaisesRegexp(TypeError, "unorderable types"): + with tm.assert_raises_regex(TypeError, "unorderable types"): result = idx.argsort() else: result = idx.argsort() @@ -1794,11 +1831,10 @@ def test_argsort(self): def test_numpy_argsort(self): idx = self.create_index() if PY36: - with tm.assertRaisesRegexp(TypeError, "'>' not supported " - "between instances of 'str' and 'int'"): + with 
tm.assert_raises_regex(TypeError, "'>' not supported"): result = np.argsort(idx) elif PY3: - with tm.assertRaisesRegexp(TypeError, "unorderable types"): + with tm.assert_raises_regex(TypeError, "unorderable types"): result = np.argsort(idx) else: result = np.argsort(idx) @@ -1814,22 +1850,22 @@ def test_copy_name(self): second = first.__class__(first, copy=False) # Even though "copy=False", we want a new object. - self.assertIsNot(first, second) + assert first is not second # Not using tm.assert_index_equal() since names differ: - self.assertTrue(idx.equals(first)) + assert idx.equals(first) - self.assertEqual(first.name, 'mario') - self.assertEqual(second.name, 'mario') + assert first.name == 'mario' + assert second.name == 'mario' s1 = Series(2, index=first) s2 = Series(3, index=second[:-1]) - if PY3: - with tm.assert_produces_warning(RuntimeWarning): - # unorderable types - s3 = s1 * s2 - else: + + warning_type = RuntimeWarning if PY3 else None + with tm.assert_produces_warning(warning_type): + # Python 3: Unorderable types s3 = s1 * s2 - self.assertEqual(s3.index.name, 'mario') + + assert s3.index.name == 'mario' def test_copy_name2(self): # Check that adding a "name" parameter to the copy is honored @@ -1837,23 +1873,23 @@ def test_copy_name2(self): idx = pd.Index([1, 2], name='MyName') idx1 = idx.copy() - self.assertTrue(idx.equals(idx1)) - self.assertEqual(idx.name, 'MyName') - self.assertEqual(idx1.name, 'MyName') + assert idx.equals(idx1) + assert idx.name == 'MyName' + assert idx1.name == 'MyName' idx2 = idx.copy(name='NewName') - self.assertTrue(idx.equals(idx2)) - self.assertEqual(idx.name, 'MyName') - self.assertEqual(idx2.name, 'NewName') + assert idx.equals(idx2) + assert idx.name == 'MyName' + assert idx2.name == 'NewName' idx3 = idx.copy(names=['NewName']) - self.assertTrue(idx.equals(idx3)) - self.assertEqual(idx.name, 'MyName') - self.assertEqual(idx.names, ['MyName']) - self.assertEqual(idx3.name, 'NewName') - self.assertEqual(idx3.names, ['NewName']) + assert idx.equals(idx3) + assert idx.name == 'MyName' + assert idx.names == ['MyName'] + assert idx3.name == 'NewName' + assert idx3.names == ['NewName'] def test_union_base(self): idx = self.create_index() @@ -1865,11 +1901,11 @@ def test_union_base(self): # unorderable types result = first.union(second) expected = Index(['b', 2, 'c', 0, 'a', 1]) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) else: result = first.union(second) expected = Index(['b', 2, 'c', 0, 'a', 1]) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) # GH 10149 cases = [klass(second.values) @@ -1879,10 +1915,10 @@ def test_union_base(self): with tm.assert_produces_warning(RuntimeWarning): # unorderable types result = first.union(case) - self.assertTrue(tm.equalContents(result, idx)) + assert tm.equalContents(result, idx) else: result = first.union(case) - self.assertTrue(tm.equalContents(result, idx)) + assert tm.equalContents(result, idx) def test_intersection_base(self): # (same results for py2 and py3 but sortedness not tested elsewhere) @@ -1891,14 +1927,14 @@ def test_intersection_base(self): second = idx[:3] result = first.intersection(second) expected = Index([0, 'a', 1]) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) # GH 10149 cases = [klass(second.values) for klass in [np.array, Series, list]] for case in cases: result = first.intersection(case) - self.assertTrue(tm.equalContents(result, second)) + assert tm.equalContents(result, 
second) def test_difference_base(self): # (same results for py2 and py3 but sortedness not tested elsewhere) @@ -1908,7 +1944,7 @@ def test_difference_base(self): result = first.difference(second) expected = Index([0, 1, 'a']) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) def test_symmetric_difference(self): # (same results for py2 and py3 but sortedness not tested elsewhere) @@ -1918,12 +1954,12 @@ def test_symmetric_difference(self): result = first.symmetric_difference(second) expected = Index([0, 1, 2, 'a', 'c']) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) def test_logical_compat(self): idx = self.create_index() - self.assertEqual(idx.all(), idx.values.all()) - self.assertEqual(idx.any(), idx.values.any()) + assert idx.all() == idx.values.all() + assert idx.any() == idx.values.any() def test_dropna(self): # GH 6194 @@ -1953,7 +1989,7 @@ def test_dropna(self): idx = pd.TimedeltaIndex(['1 days', '2 days', '3 days']) tm.assert_index_equal(idx.dropna(), idx) nanidx = pd.TimedeltaIndex([pd.NaT, '1 days', '2 days', - '3 days', pd.NaT]) + '3 days', pd.NaT]) tm.assert_index_equal(nanidx.dropna(), idx) idx = pd.PeriodIndex(['2012-02', '2012-04', '2012-05'], freq='M') @@ -1963,11 +1999,77 @@ def test_dropna(self): tm.assert_index_equal(nanidx.dropna(), idx) msg = "invalid how option: xxx" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): pd.Index([1, 2, 3]).dropna(how='xxx') + def test_get_combined_index(self): + result = _get_combined_index([]) + tm.assert_index_equal(result, Index([])) + + def test_repeat(self): + repeats = 2 + idx = pd.Index([1, 2, 3]) + expected = pd.Index([1, 1, 2, 2, 3, 3]) + + result = idx.repeat(repeats) + tm.assert_index_equal(result, expected) + + with tm.assert_produces_warning(FutureWarning): + result = idx.repeat(n=repeats) + tm.assert_index_equal(result, expected) + + def test_is_monotonic_na(self): + examples = [pd.Index([np.nan]), + pd.Index([np.nan, 1]), + pd.Index([1, 2, np.nan]), + pd.Index(['a', 'b', np.nan]), + pd.to_datetime(['NaT']), + pd.to_datetime(['NaT', '2000-01-01']), + pd.to_datetime(['2000-01-01', 'NaT', '2000-01-02']), + pd.to_timedelta(['1 day', 'NaT']), ] + for index in examples: + assert not index.is_monotonic_increasing + assert not index.is_monotonic_decreasing + + def test_repr_summary(self): + with cf.option_context('display.max_seq_items', 10): + r = repr(pd.Index(np.arange(1000))) + assert len(r) < 200 + assert "..." 
in r + + def test_int_name_format(self): + index = Index(['a', 'b', 'c'], name=0) + s = Series(lrange(3), index) + df = DataFrame(lrange(3), index=index) + repr(s) + repr(df) + + def test_print_unicode_columns(self): + df = pd.DataFrame({u("\u05d0"): [1, 2, 3], + "\u05d1": [4, 5, 6], + "c": [7, 8, 9]}) + repr(df.columns) # should not raise UnicodeDecodeError + + def test_unicode_string_with_unicode(self): + idx = Index(lrange(1000)) + + if PY3: + str(idx) + else: + text_type(idx) + + def test_bytestring_with_unicode(self): + idx = Index(lrange(1000)) + if PY3: + bytes(idx) + else: + str(idx) + + def test_intersect_str_dates(self): + dt_dates = [datetime(2012, 2, 9), datetime(2012, 2, 22)] + + i1 = Index(dt_dates, dtype=object) + i2 = Index(['aa'], dtype=object) + res = i2.intersection(i1) -def test_get_combined_index(): - from pandas.core.index import _get_combined_index - result = _get_combined_index([]) - tm.assert_index_equal(result, Index([])) + assert len(res) == 0 diff --git a/pandas/tests/indexes/test_category.py b/pandas/tests/indexes/test_category.py index 819b88bf4c5d3..4e4f9b29f9a4c 100644 --- a/pandas/tests/indexes/test_category.py +++ b/pandas/tests/indexes/test_category.py @@ -1,17 +1,16 @@ # -*- coding: utf-8 -*- -# TODO(wesm): fix long line flake8 issues -# flake8: noqa +import pytest import pandas.util.testing as tm -from pandas.indexes.api import Index, CategoricalIndex +from pandas.core.indexes.api import Index, CategoricalIndex from .common import Base from pandas.compat import range, PY3 import numpy as np -from pandas import Categorical, compat, notnull +from pandas import Categorical, IntervalIndex, compat, notnull from pandas.util.testing import assert_almost_equal import pandas.core.config as cf import pandas as pd @@ -20,10 +19,10 @@ unicode = lambda x: x -class TestCategoricalIndex(Base, tm.TestCase): +class TestCategoricalIndex(Base): _holder = CategoricalIndex - def setUp(self): + def setup_method(self, method): self.indices = dict(catIndex=tm.makeCategoricalIndex(100)) self.setup_indices() @@ -40,62 +39,66 @@ def test_construction(self): result = Index(ci) tm.assert_index_equal(result, ci, exact=True) - self.assertFalse(result.ordered) + assert not result.ordered result = Index(ci.values) tm.assert_index_equal(result, ci, exact=True) - self.assertFalse(result.ordered) + assert not result.ordered # empty result = CategoricalIndex(categories=categories) - self.assert_index_equal(result.categories, Index(categories)) + tm.assert_index_equal(result.categories, Index(categories)) tm.assert_numpy_array_equal(result.codes, np.array([], dtype='int8')) - self.assertFalse(result.ordered) + assert not result.ordered # passing categories result = CategoricalIndex(list('aabbca'), categories=categories) - self.assert_index_equal(result.categories, Index(categories)) + tm.assert_index_equal(result.categories, Index(categories)) tm.assert_numpy_array_equal(result.codes, - np.array([0, 0, 1, 1, 2, 0], dtype='int8')) + np.array([0, 0, 1, + 1, 2, 0], dtype='int8')) c = pd.Categorical(list('aabbca')) result = CategoricalIndex(c) - self.assert_index_equal(result.categories, Index(list('abc'))) + tm.assert_index_equal(result.categories, Index(list('abc'))) tm.assert_numpy_array_equal(result.codes, - np.array([0, 0, 1, 1, 2, 0], dtype='int8')) - self.assertFalse(result.ordered) + np.array([0, 0, 1, + 1, 2, 0], dtype='int8')) + assert not result.ordered result = CategoricalIndex(c, categories=categories) - self.assert_index_equal(result.categories, Index(categories)) + 
tm.assert_index_equal(result.categories, Index(categories)) tm.assert_numpy_array_equal(result.codes, - np.array([0, 0, 1, 1, 2, 0], dtype='int8')) - self.assertFalse(result.ordered) + np.array([0, 0, 1, + 1, 2, 0], dtype='int8')) + assert not result.ordered ci = CategoricalIndex(c, categories=list('abcd')) result = CategoricalIndex(ci) - self.assert_index_equal(result.categories, Index(categories)) + tm.assert_index_equal(result.categories, Index(categories)) tm.assert_numpy_array_equal(result.codes, - np.array([0, 0, 1, 1, 2, 0], dtype='int8')) - self.assertFalse(result.ordered) + np.array([0, 0, 1, + 1, 2, 0], dtype='int8')) + assert not result.ordered result = CategoricalIndex(ci, categories=list('ab')) - self.assert_index_equal(result.categories, Index(list('ab'))) + tm.assert_index_equal(result.categories, Index(list('ab'))) tm.assert_numpy_array_equal(result.codes, - np.array([0, 0, 1, 1, -1, 0], - dtype='int8')) - self.assertFalse(result.ordered) + np.array([0, 0, 1, + 1, -1, 0], dtype='int8')) + assert not result.ordered result = CategoricalIndex(ci, categories=list('ab'), ordered=True) - self.assert_index_equal(result.categories, Index(list('ab'))) + tm.assert_index_equal(result.categories, Index(list('ab'))) tm.assert_numpy_array_equal(result.codes, - np.array([0, 0, 1, 1, -1, 0], - dtype='int8')) - self.assertTrue(result.ordered) + np.array([0, 0, 1, + 1, -1, 0], dtype='int8')) + assert result.ordered # turn me to an Index result = Index(np.array(ci)) - self.assertIsInstance(result, Index) - self.assertNotIsInstance(result, CategoricalIndex) + assert isinstance(result, Index) + assert not isinstance(result, CategoricalIndex) def test_construction_with_dtype(self): @@ -128,12 +131,12 @@ def test_disallow_set_ops(self): # set ops (+/-) raise TypeError idx = pd.Index(pd.Categorical(['a', 'b'])) - self.assertRaises(TypeError, lambda: idx - idx) - self.assertRaises(TypeError, lambda: idx + idx) - self.assertRaises(TypeError, lambda: idx - ['a', 'b']) - self.assertRaises(TypeError, lambda: idx + ['a', 'b']) - self.assertRaises(TypeError, lambda: ['a', 'b'] - idx) - self.assertRaises(TypeError, lambda: ['a', 'b'] + idx) + pytest.raises(TypeError, lambda: idx - idx) + pytest.raises(TypeError, lambda: idx + idx) + pytest.raises(TypeError, lambda: idx - ['a', 'b']) + pytest.raises(TypeError, lambda: idx + ['a', 'b']) + pytest.raises(TypeError, lambda: ['a', 'b'] - idx) + pytest.raises(TypeError, lambda: ['a', 'b'] + idx) def test_method_delegation(self): @@ -167,70 +170,69 @@ def test_method_delegation(self): list('aabbca'), categories=list('cabdef'), ordered=True)) # invalid - self.assertRaises(ValueError, lambda: ci.set_categories( + pytest.raises(ValueError, lambda: ci.set_categories( list('cab'), inplace=True)) def test_contains(self): ci = self.create_index(categories=list('cabdef')) - self.assertTrue('a' in ci) - self.assertTrue('z' not in ci) - self.assertTrue('e' not in ci) - self.assertTrue(np.nan not in ci) + assert 'a' in ci + assert 'z' not in ci + assert 'e' not in ci + assert np.nan not in ci # assert codes NOT in index - self.assertFalse(0 in ci) - self.assertFalse(1 in ci) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - ci = CategoricalIndex( - list('aabbca'), categories=list('cabdef') + [np.nan]) - self.assertFalse(np.nan in ci) + assert 0 not in ci + assert 1 not in ci ci = CategoricalIndex( list('aabbca') + [np.nan], categories=list('cabdef')) - self.assertTrue(np.nan in ci) + assert np.nan in ci def test_min_max(self): ci = 
self.create_index(ordered=False) - self.assertRaises(TypeError, lambda: ci.min()) - self.assertRaises(TypeError, lambda: ci.max()) + pytest.raises(TypeError, lambda: ci.min()) + pytest.raises(TypeError, lambda: ci.max()) ci = self.create_index(ordered=True) - self.assertEqual(ci.min(), 'c') - self.assertEqual(ci.max(), 'b') + assert ci.min() == 'c' + assert ci.max() == 'b' def test_map(self): ci = pd.CategoricalIndex(list('ABABC'), categories=list('CBA'), ordered=True) result = ci.map(lambda x: x.lower()) - exp = pd.Categorical(list('ababc'), categories=list('cba'), - ordered=True) - tm.assert_categorical_equal(result, exp) + exp = pd.CategoricalIndex(list('ababc'), categories=list('cba'), + ordered=True) + tm.assert_index_equal(result, exp) ci = pd.CategoricalIndex(list('ABABC'), categories=list('BAC'), ordered=False, name='XXX') result = ci.map(lambda x: x.lower()) - exp = pd.Categorical(list('ababc'), categories=list('bac'), - ordered=False) - tm.assert_categorical_equal(result, exp) + exp = pd.CategoricalIndex(list('ababc'), categories=list('bac'), + ordered=False, name='XXX') + tm.assert_index_equal(result, exp) - tm.assert_numpy_array_equal(ci.map(lambda x: 1), - np.array([1] * 5, dtype=np.int64)) + # GH 12766: Return an index not an array + tm.assert_index_equal(ci.map(lambda x: 1), + Index(np.array([1] * 5, dtype=np.int64), + name='XXX')) # change categories dtype ci = pd.CategoricalIndex(list('ABABC'), categories=list('BAC'), ordered=False) + def f(x): return {'A': 10, 'B': 20, 'C': 30}.get(x) result = ci.map(f) - exp = pd.Categorical([10, 20, 10, 20, 30], categories=[20, 10, 30], - ordered=False) - tm.assert_categorical_equal(result, exp) + exp = pd.CategoricalIndex([10, 20, 10, 20, 30], + categories=[20, 10, 30], + ordered=False) + tm.assert_index_equal(result, exp) def test_where(self): i = self.create_index() @@ -238,13 +240,23 @@ def test_where(self): expected = i tm.assert_index_equal(result, expected) - i2 = i.copy() i2 = pd.CategoricalIndex([np.nan, np.nan] + i[2:].tolist(), categories=i.categories) result = i.where(notnull(i2)) expected = i2 tm.assert_index_equal(result, expected) + def test_where_array_like(self): + i = self.create_index() + cond = [False] + [True] * (len(i) - 1) + klasses = [list, tuple, np.array, pd.Series] + expected = pd.CategoricalIndex([np.nan] + i[1:].tolist(), + categories=i.categories) + + for klass in klasses: + result = i.where(klass(cond)) + tm.assert_index_equal(result, expected) + def test_append(self): ci = self.create_index() @@ -263,10 +275,10 @@ def test_append(self): tm.assert_index_equal(result, ci, exact=True) # appending with different categories or reoreded is not ok - self.assertRaises( + pytest.raises( TypeError, lambda: ci.append(ci.values.set_categories(list('abcd')))) - self.assertRaises( + pytest.raises( TypeError, lambda: ci.append(ci.values.reorder_categories(list('abc')))) @@ -276,7 +288,7 @@ def test_append(self): tm.assert_index_equal(result, expected, exact=True) # invalid objects - self.assertRaises(TypeError, lambda: ci.append(Index(['a', 'd']))) + pytest.raises(TypeError, lambda: ci.append(Index(['a', 'd']))) # GH14298 - if base object is not categorical -> coerce to object result = Index(['c', 'a']).append(ci) @@ -304,7 +316,7 @@ def test_insert(self): tm.assert_index_equal(result, expected, exact=True) # invalid - self.assertRaises(TypeError, lambda: ci.insert(0, 'd')) + pytest.raises(TypeError, lambda: ci.insert(0, 'd')) def test_delete(self): @@ -319,9 +331,9 @@ def test_delete(self): expected = 
CategoricalIndex(list('aabbc'), categories=categories) tm.assert_index_equal(result, expected, exact=True) - with tm.assertRaises((IndexError, ValueError)): - # either depeidnig on numpy version - result = ci.delete(10) + with pytest.raises((IndexError, ValueError)): + # Either depending on NumPy version + ci.delete(10) def test_astype(self): @@ -330,23 +342,38 @@ def test_astype(self): tm.assert_index_equal(result, ci, exact=True) result = ci.astype(object) - self.assert_index_equal(result, Index(np.array(ci))) + tm.assert_index_equal(result, Index(np.array(ci))) # this IS equal, but not the same class - self.assertTrue(result.equals(ci)) - self.assertIsInstance(result, Index) - self.assertNotIsInstance(result, CategoricalIndex) + assert result.equals(ci) + assert isinstance(result, Index) + assert not isinstance(result, CategoricalIndex) + + # interval + ii = IntervalIndex.from_arrays(left=[-0.001, 2.0], + right=[2, 4], + closed='right') + + ci = CategoricalIndex(Categorical.from_codes( + [0, 1, -1], categories=ii, ordered=True)) + + result = ci.astype('interval') + expected = ii.take([0, 1, -1]) + tm.assert_index_equal(result, expected) + + result = IntervalIndex.from_intervals(result.values) + tm.assert_index_equal(result, expected) def test_reindex_base(self): # determined by cat ordering idx = self.create_index() - expected = np.array([4, 0, 1, 5, 2, 3], dtype=np.intp) + expected = np.arange(len(idx), dtype=np.intp) actual = idx.get_indexer(idx) tm.assert_numpy_array_equal(expected, actual) - with tm.assertRaisesRegexp(ValueError, 'Invalid fill method'): + with tm.assert_raises_regex(ValueError, 'Invalid fill method'): idx.get_indexer(idx, method='invalid') def test_reindexing(self): @@ -359,7 +386,8 @@ def test_reindexing(self): expected = oidx.get_indexer_non_unique(finder)[0] actual = ci.get_indexer(finder) - tm.assert_numpy_array_equal(expected.values, actual, check_dtype=False) + tm.assert_numpy_array_equal( + expected.values, actual, check_dtype=False) def test_reindex_dtype(self): c = CategoricalIndex(['a', 'b', 'c', 'a']) @@ -395,12 +423,12 @@ def test_reindex_dtype(self): def test_duplicates(self): idx = CategoricalIndex([0, 0, 0], name='foo') - self.assertFalse(idx.is_unique) - self.assertTrue(idx.has_duplicates) + assert not idx.is_unique + assert idx.has_duplicates expected = CategoricalIndex([0], name='foo') - self.assert_index_equal(idx.drop_duplicates(), expected) - self.assert_index_equal(idx.unique(), expected) + tm.assert_index_equal(idx.drop_duplicates(), expected) + tm.assert_index_equal(idx.unique(), expected) def test_get_indexer(self): @@ -411,22 +439,22 @@ def test_get_indexer(self): r1 = idx1.get_indexer(idx2) assert_almost_equal(r1, np.array([0, 1, 2, -1], dtype=np.intp)) - self.assertRaises(NotImplementedError, - lambda: idx2.get_indexer(idx1, method='pad')) - self.assertRaises(NotImplementedError, - lambda: idx2.get_indexer(idx1, method='backfill')) - self.assertRaises(NotImplementedError, - lambda: idx2.get_indexer(idx1, method='nearest')) + pytest.raises(NotImplementedError, + lambda: idx2.get_indexer(idx1, method='pad')) + pytest.raises(NotImplementedError, + lambda: idx2.get_indexer(idx1, method='backfill')) + pytest.raises(NotImplementedError, + lambda: idx2.get_indexer(idx1, method='nearest')) def test_get_loc(self): # GH 12531 cidx1 = CategoricalIndex(list('abcde'), categories=list('edabc')) idx1 = Index(list('abcde')) - self.assertEqual(cidx1.get_loc('a'), idx1.get_loc('a')) - self.assertEqual(cidx1.get_loc('e'), idx1.get_loc('e')) + assert 
cidx1.get_loc('a') == idx1.get_loc('a') + assert cidx1.get_loc('e') == idx1.get_loc('e') for i in [cidx1, idx1]: - with tm.assertRaises(KeyError): + with pytest.raises(KeyError): i.get_loc('NOT-EXIST') # non-unique @@ -435,16 +463,16 @@ def test_get_loc(self): # results in bool array res = cidx2.get_loc('d') - self.assert_numpy_array_equal(res, idx2.get_loc('d')) - self.assert_numpy_array_equal(res, np.array([False, False, False, - True, False, True])) + tm.assert_numpy_array_equal(res, idx2.get_loc('d')) + tm.assert_numpy_array_equal(res, np.array([False, False, False, + True, False, True])) # unique element results in scalar res = cidx2.get_loc('e') - self.assertEqual(res, idx2.get_loc('e')) - self.assertEqual(res, 4) + assert res == idx2.get_loc('e') + assert res == 4 for i in [cidx2, idx2]: - with tm.assertRaises(KeyError): + with pytest.raises(KeyError): i.get_loc('NOT-EXIST') # non-unique, slicable @@ -453,15 +481,15 @@ def test_get_loc(self): # results in slice res = cidx3.get_loc('a') - self.assertEqual(res, idx3.get_loc('a')) - self.assertEqual(res, slice(0, 2, None)) + assert res == idx3.get_loc('a') + assert res == slice(0, 2, None) res = cidx3.get_loc('b') - self.assertEqual(res, idx3.get_loc('b')) - self.assertEqual(res, slice(2, 5, None)) + assert res == idx3.get_loc('b') + assert res == slice(2, 5, None) for i in [cidx3, idx3]: - with tm.assertRaises(KeyError): + with pytest.raises(KeyError): i.get_loc('c') def test_repr_roundtrip(self): @@ -509,92 +537,85 @@ def test_identical(self): ci1 = CategoricalIndex(['a', 'b'], categories=['a', 'b'], ordered=True) ci2 = CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], ordered=True) - self.assertTrue(ci1.identical(ci1)) - self.assertTrue(ci1.identical(ci1.copy())) - self.assertFalse(ci1.identical(ci2)) + assert ci1.identical(ci1) + assert ci1.identical(ci1.copy()) + assert not ci1.identical(ci2) def test_ensure_copied_data(self): - # Check the "copy" argument of each Index.__new__ is honoured - # GH12309 + # gh-12309: Check the "copy" argument of each + # Index.__new__ is honored. + # # Must be tested separately from other indexes because - # self.value is not an ndarray - _base = lambda ar : ar if ar.base is None else ar.base + # self.value is not an ndarray. 
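The `_base` helper used in `test_ensure_copied_data` leans on NumPy's view semantics: slicing an ndarray yields a view whose `.base` attribute points back at the owning buffer, while a true copy has `.base is None`. A minimal, self-contained sketch of that idiom on plain ndarrays (the test applies the same trick to `index.values`, which is a `Categorical` rather than an ndarray):

```python
import numpy as np

a = np.arange(5)
view = a[:]       # a view onto `a`; its .base points at the owning buffer
copy = a.copy()   # an independent copy; its .base is None

def base(ar):
    # Step up to the owning buffer, as the test's _base lambda does.
    return ar if ar.base is None else ar.base

assert base(view) is a      # shares memory with `a`
assert base(copy) is copy   # owns its own memory
```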
+ _base = lambda ar: ar if ar.base is None else ar.base + for index in self.indices.values(): result = CategoricalIndex(index.values, copy=True) tm.assert_index_equal(index, result) - self.assertIsNot(_base(index.values), _base(result.values)) + assert _base(index.values) is not _base(result.values) result = CategoricalIndex(index.values, copy=False) - self.assertIs(_base(index.values), _base(result.values)) + assert _base(index.values) is _base(result.values) def test_equals_categorical(self): - ci1 = CategoricalIndex(['a', 'b'], categories=['a', 'b'], ordered=True) ci2 = CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], ordered=True) - self.assertTrue(ci1.equals(ci1)) - self.assertFalse(ci1.equals(ci2)) - self.assertTrue(ci1.equals(ci1.astype(object))) - self.assertTrue(ci1.astype(object).equals(ci1)) + assert ci1.equals(ci1) + assert not ci1.equals(ci2) + assert ci1.equals(ci1.astype(object)) + assert ci1.astype(object).equals(ci1) - self.assertTrue((ci1 == ci1).all()) - self.assertFalse((ci1 != ci1).all()) - self.assertFalse((ci1 > ci1).all()) - self.assertFalse((ci1 < ci1).all()) - self.assertTrue((ci1 <= ci1).all()) - self.assertTrue((ci1 >= ci1).all()) + assert (ci1 == ci1).all() + assert not (ci1 != ci1).all() + assert not (ci1 > ci1).all() + assert not (ci1 < ci1).all() + assert (ci1 <= ci1).all() + assert (ci1 >= ci1).all() - self.assertFalse((ci1 == 1).all()) - self.assertTrue((ci1 == Index(['a', 'b'])).all()) - self.assertTrue((ci1 == ci1.values).all()) + assert not (ci1 == 1).all() + assert (ci1 == Index(['a', 'b'])).all() + assert (ci1 == ci1.values).all() # invalid comparisons - with tm.assertRaisesRegexp(ValueError, "Lengths must match"): + with tm.assert_raises_regex(ValueError, "Lengths must match"): ci1 == Index(['a', 'b', 'c']) - self.assertRaises(TypeError, lambda: ci1 == ci2) - self.assertRaises( + pytest.raises(TypeError, lambda: ci1 == ci2) + pytest.raises( TypeError, lambda: ci1 == Categorical(ci1.values, ordered=False)) - self.assertRaises( + pytest.raises( TypeError, lambda: ci1 == Categorical(ci1.values, categories=list('abc'))) # tests # make sure that we are testing for category inclusion properly ci = CategoricalIndex(list('aabca'), categories=['c', 'a', 'b']) - self.assertFalse(ci.equals(list('aabca'))) - self.assertFalse(ci.equals(CategoricalIndex(list('aabca')))) - self.assertTrue(ci.equals(ci.copy())) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - ci = CategoricalIndex(list('aabca'), - categories=['c', 'a', 'b', np.nan]) - self.assertFalse(ci.equals(list('aabca'))) - self.assertFalse(ci.equals(CategoricalIndex(list('aabca')))) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - self.assertTrue(ci.equals(ci.copy())) + assert not ci.equals(list('aabca')) + assert not ci.equals(CategoricalIndex(list('aabca'))) + assert ci.equals(ci.copy()) ci = CategoricalIndex(list('aabca') + [np.nan], categories=['c', 'a', 'b']) - self.assertFalse(ci.equals(list('aabca'))) - self.assertFalse(ci.equals(CategoricalIndex(list('aabca')))) - self.assertTrue(ci.equals(ci.copy())) + assert not ci.equals(list('aabca')) + assert not ci.equals(CategoricalIndex(list('aabca'))) + assert ci.equals(ci.copy()) ci = CategoricalIndex(list('aabca') + [np.nan], categories=['c', 'a', 'b']) - self.assertFalse(ci.equals(list('aabca') + [np.nan])) - self.assertFalse(ci.equals(CategoricalIndex(list('aabca') + [np.nan]))) - self.assertTrue(ci.equals(ci.copy())) + assert not ci.equals(list('aabca') + [np.nan]) + assert not 
ci.equals(CategoricalIndex(list('aabca') + [np.nan])) + assert ci.equals(ci.copy()) def test_string_categorical_index_repr(self): # short idx = pd.CategoricalIndex(['a', 'bb', 'ccc']) if PY3: - expected = u"""CategoricalIndex(['a', 'bb', 'ccc'], categories=['a', 'bb', 'ccc'], ordered=False, dtype='category')""" - self.assertEqual(repr(idx), expected) + expected = u"""CategoricalIndex(['a', 'bb', 'ccc'], categories=['a', 'bb', 'ccc'], ordered=False, dtype='category')""" # noqa + assert repr(idx) == expected else: - expected = u"""CategoricalIndex([u'a', u'bb', u'ccc'], categories=[u'a', u'bb', u'ccc'], ordered=False, dtype='category')""" - self.assertEqual(unicode(idx), expected) + expected = u"""CategoricalIndex([u'a', u'bb', u'ccc'], categories=[u'a', u'bb', u'ccc'], ordered=False, dtype='category')""" # noqa + assert unicode(idx) == expected # multiple lines idx = pd.CategoricalIndex(['a', 'bb', 'ccc'] * 10) @@ -602,17 +623,17 @@ def test_string_categorical_index_repr(self): expected = u"""CategoricalIndex(['a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc'], - categories=['a', 'bb', 'ccc'], ordered=False, dtype='category')""" + categories=['a', 'bb', 'ccc'], ordered=False, dtype='category')""" # noqa - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = u"""CategoricalIndex([u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc'], - categories=[u'a', u'bb', u'ccc'], ordered=False, dtype='category')""" + categories=[u'a', u'bb', u'ccc'], ordered=False, dtype='category')""" # noqa - self.assertEqual(unicode(idx), expected) + assert unicode(idx) == expected # truncated idx = pd.CategoricalIndex(['a', 'bb', 'ccc'] * 100) @@ -620,42 +641,42 @@ def test_string_categorical_index_repr(self): expected = u"""CategoricalIndex(['a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', ... 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc'], - categories=['a', 'bb', 'ccc'], ordered=False, dtype='category', length=300)""" + categories=['a', 'bb', 'ccc'], ordered=False, dtype='category', length=300)""" # noqa - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = u"""CategoricalIndex([u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', ... 
u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc', u'a', u'bb', u'ccc'], - categories=[u'a', u'bb', u'ccc'], ordered=False, dtype='category', length=300)""" + categories=[u'a', u'bb', u'ccc'], ordered=False, dtype='category', length=300)""" # noqa - self.assertEqual(unicode(idx), expected) + assert unicode(idx) == expected # larger categories idx = pd.CategoricalIndex(list('abcdefghijklmmo')) if PY3: expected = u"""CategoricalIndex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'm', 'o'], - categories=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', ...], ordered=False, dtype='category')""" + categories=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', ...], ordered=False, dtype='category')""" # noqa - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = u"""CategoricalIndex([u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'm', u'm', u'o'], - categories=[u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', ...], ordered=False, dtype='category')""" + categories=[u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', ...], ordered=False, dtype='category')""" # noqa - self.assertEqual(unicode(idx), expected) + assert unicode(idx) == expected # short idx = pd.CategoricalIndex([u'あ', u'いい', u'ううう']) if PY3: - expected = u"""CategoricalIndex(['あ', 'いい', 'ううう'], categories=['あ', 'いい', 'ううう'], ordered=False, dtype='category')""" - self.assertEqual(repr(idx), expected) + expected = u"""CategoricalIndex(['あ', 'いい', 'ううう'], categories=['あ', 'いい', 'ううう'], ordered=False, dtype='category')""" # noqa + assert repr(idx) == expected else: - expected = u"""CategoricalIndex([u'あ', u'いい', u'ううう'], categories=[u'あ', u'いい', u'ううう'], ordered=False, dtype='category')""" - self.assertEqual(unicode(idx), expected) + expected = u"""CategoricalIndex([u'あ', u'いい', u'ううう'], categories=[u'あ', u'いい', u'ううう'], ordered=False, dtype='category')""" # noqa + assert unicode(idx) == expected # multiple lines idx = pd.CategoricalIndex([u'あ', u'いい', u'ううう'] * 10) @@ -663,17 +684,17 @@ def test_string_categorical_index_repr(self): expected = u"""CategoricalIndex(['あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう'], - categories=['あ', 'いい', 'ううう'], ordered=False, dtype='category')""" + categories=['あ', 'いい', 'ううう'], ordered=False, dtype='category')""" # noqa - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = u"""CategoricalIndex([u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう'], - categories=[u'あ', u'いい', u'ううう'], ordered=False, dtype='category')""" + categories=[u'あ', u'いい', u'ううう'], ordered=False, dtype='category')""" # noqa - self.assertEqual(unicode(idx), expected) + assert unicode(idx) == expected # truncated idx = pd.CategoricalIndex([u'あ', u'いい', u'ううう'] * 100) @@ -681,33 +702,33 @@ def test_string_categorical_index_repr(self): expected = u"""CategoricalIndex(['あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', ... 
'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう'], - categories=['あ', 'いい', 'ううう'], ordered=False, dtype='category', length=300)""" + categories=['あ', 'いい', 'ううう'], ordered=False, dtype='category', length=300)""" # noqa - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = u"""CategoricalIndex([u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', ... u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう'], - categories=[u'あ', u'いい', u'ううう'], ordered=False, dtype='category', length=300)""" + categories=[u'あ', u'いい', u'ううう'], ordered=False, dtype='category', length=300)""" # noqa - self.assertEqual(unicode(idx), expected) + assert unicode(idx) == expected # larger categories idx = pd.CategoricalIndex(list(u'あいうえおかきくけこさしすせそ')) if PY3: expected = u"""CategoricalIndex(['あ', 'い', 'う', 'え', 'お', 'か', 'き', 'く', 'け', 'こ', 'さ', 'し', 'す', 'せ', 'そ'], - categories=['あ', 'い', 'う', 'え', 'お', 'か', 'き', 'く', ...], ordered=False, dtype='category')""" + categories=['あ', 'い', 'う', 'え', 'お', 'か', 'き', 'く', ...], ordered=False, dtype='category')""" # noqa - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = u"""CategoricalIndex([u'あ', u'い', u'う', u'え', u'お', u'か', u'き', u'く', u'け', u'こ', u'さ', u'し', u'す', u'せ', u'そ'], - categories=[u'あ', u'い', u'う', u'え', u'お', u'か', u'き', u'く', ...], ordered=False, dtype='category')""" + categories=[u'あ', u'い', u'う', u'え', u'お', u'か', u'き', u'く', ...], ordered=False, dtype='category')""" # noqa - self.assertEqual(unicode(idx), expected) + assert unicode(idx) == expected # Emable Unicode option ----------------------------------------- with cf.option_context('display.unicode.east_asian_width', True): @@ -715,11 +736,11 @@ def test_string_categorical_index_repr(self): # short idx = pd.CategoricalIndex([u'あ', u'いい', u'ううう']) if PY3: - expected = u"""CategoricalIndex(['あ', 'いい', 'ううう'], categories=['あ', 'いい', 'ううう'], ordered=False, dtype='category')""" - self.assertEqual(repr(idx), expected) + expected = u"""CategoricalIndex(['あ', 'いい', 'ううう'], categories=['あ', 'いい', 'ううう'], ordered=False, dtype='category')""" # noqa + assert repr(idx) == expected else: - expected = u"""CategoricalIndex([u'あ', u'いい', u'ううう'], categories=[u'あ', u'いい', u'ううう'], ordered=False, dtype='category')""" - self.assertEqual(unicode(idx), expected) + expected = u"""CategoricalIndex([u'あ', u'いい', u'ううう'], categories=[u'あ', u'いい', u'ううう'], ordered=False, dtype='category')""" # noqa + assert unicode(idx) == expected # multiple lines idx = pd.CategoricalIndex([u'あ', u'いい', u'ううう'] * 10) @@ -728,18 +749,18 @@ def test_string_categorical_index_repr(self): 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう'], - categories=['あ', 'いい', 'ううう'], ordered=False, dtype='category')""" + categories=['あ', 'いい', 'ううう'], ordered=False, dtype='category')""" # noqa - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = u"""CategoricalIndex([u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう'], - categories=[u'あ', u'いい', u'ううう'], ordered=False, dtype='category')""" + categories=[u'あ', u'いい', u'ううう'], ordered=False, dtype='category')""" # noqa - self.assertEqual(unicode(idx), expected) + assert unicode(idx) == expected # truncated idx = 
pd.CategoricalIndex([u'あ', u'いい', u'ううう'] * 100) @@ -749,44 +770,44 @@ def test_string_categorical_index_repr(self): ... 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう', 'あ', 'いい', 'ううう'], - categories=['あ', 'いい', 'ううう'], ordered=False, dtype='category', length=300)""" + categories=['あ', 'いい', 'ううう'], ordered=False, dtype='category', length=300)""" # noqa - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = u"""CategoricalIndex([u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', ... u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう', u'あ', u'いい', u'ううう'], - categories=[u'あ', u'いい', u'ううう'], ordered=False, dtype='category', length=300)""" + categories=[u'あ', u'いい', u'ううう'], ordered=False, dtype='category', length=300)""" # noqa - self.assertEqual(unicode(idx), expected) + assert unicode(idx) == expected # larger categories idx = pd.CategoricalIndex(list(u'あいうえおかきくけこさしすせそ')) if PY3: expected = u"""CategoricalIndex(['あ', 'い', 'う', 'え', 'お', 'か', 'き', 'く', 'け', 'こ', 'さ', 'し', 'す', 'せ', 'そ'], - categories=['あ', 'い', 'う', 'え', 'お', 'か', 'き', 'く', ...], ordered=False, dtype='category')""" + categories=['あ', 'い', 'う', 'え', 'お', 'か', 'き', 'く', ...], ordered=False, dtype='category')""" # noqa - self.assertEqual(repr(idx), expected) + assert repr(idx) == expected else: expected = u"""CategoricalIndex([u'あ', u'い', u'う', u'え', u'お', u'か', u'き', u'く', u'け', u'こ', u'さ', u'し', u'す', u'せ', u'そ'], - categories=[u'あ', u'い', u'う', u'え', u'お', u'か', u'き', u'く', ...], ordered=False, dtype='category')""" + categories=[u'あ', u'い', u'う', u'え', u'お', u'か', u'き', u'く', ...], ordered=False, dtype='category')""" # noqa - self.assertEqual(unicode(idx), expected) + assert unicode(idx) == expected def test_fillna_categorical(self): # GH 11343 idx = CategoricalIndex([1.0, np.nan, 3.0, 1.0], name='x') # fill by value in categories exp = CategoricalIndex([1.0, 1.0, 3.0, 1.0], name='x') - self.assert_index_equal(idx.fillna(1.0), exp) + tm.assert_index_equal(idx.fillna(1.0), exp) # fill by value not in categories raises ValueError - with tm.assertRaisesRegexp(ValueError, - 'fill value must be in categories'): + with tm.assert_raises_regex(ValueError, + 'fill value must be in categories'): idx.fillna(2.0) def test_take_fill_value(self): @@ -840,12 +861,12 @@ def test_take_fill_value(self): msg = ('When allow_fill=True and fill_value is not None, ' 'all indices must be >= -1') - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): idx.take(np.array([1, 0, -2]), fill_value=True) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): idx.take(np.array([1, 0, -5]), fill_value=True) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): idx.take(np.array([1, -5])) def test_take_fill_value_datetime(self): @@ -878,12 +899,12 @@ def test_take_fill_value_datetime(self): msg = ('When allow_fill=True and fill_value is not None, ' 'all indices must be >= -1') - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): idx.take(np.array([1, 0, -2]), fill_value=True) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): idx.take(np.array([1, 0, -5]), fill_value=True) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): idx.take(np.array([1, -5])) def test_take_invalid_kwargs(self): @@ -891,13 +912,13 @@ def test_take_invalid_kwargs(self): indices = [1, 0, -1] msg = r"take\(\) got an unexpected keyword argument 
'foo'" - tm.assertRaisesRegexp(TypeError, msg, idx.take, - indices, foo=2) + tm.assert_raises_regex(TypeError, msg, idx.take, + indices, foo=2) msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, idx.take, - indices, out=indices) + tm.assert_raises_regex(ValueError, msg, idx.take, + indices, out=indices) msg = "the 'mode' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, idx.take, - indices, mode='clip') + tm.assert_raises_regex(ValueError, msg, idx.take, + indices, mode='clip') diff --git a/pandas/tests/indexes/test_datetimelike.py b/pandas/tests/indexes/test_datetimelike.py deleted file mode 100644 index b04e840ffc849..0000000000000 --- a/pandas/tests/indexes/test_datetimelike.py +++ /dev/null @@ -1,1225 +0,0 @@ -# -*- coding: utf-8 -*- - -from datetime import datetime, timedelta, time - -import numpy as np - -from pandas import (DatetimeIndex, Float64Index, Index, Int64Index, - NaT, Period, PeriodIndex, Series, Timedelta, - TimedeltaIndex, date_range, period_range, - timedelta_range, notnull) - -import pandas.util.testing as tm - -import pandas as pd -from pandas.tslib import Timestamp, OutOfBoundsDatetime - -from .common import Base - - -class DatetimeLike(Base): - - def test_shift_identity(self): - - idx = self.create_index() - self.assert_index_equal(idx, idx.shift(0)) - - def test_str(self): - - # test the string repr - idx = self.create_index() - idx.name = 'foo' - self.assertFalse("length=%s" % len(idx) in str(idx)) - self.assertTrue("'foo'" in str(idx)) - self.assertTrue(idx.__class__.__name__ in str(idx)) - - if hasattr(idx, 'tz'): - if idx.tz is not None: - self.assertTrue(idx.tz in str(idx)) - if hasattr(idx, 'freq'): - self.assertTrue("freq='%s'" % idx.freqstr in str(idx)) - - def test_view(self): - super(DatetimeLike, self).test_view() - - i = self.create_index() - - i_view = i.view('i8') - result = self._holder(i) - tm.assert_index_equal(result, i) - - i_view = i.view(self._holder) - result = self._holder(i) - tm.assert_index_equal(result, i_view) - - -class TestDatetimeIndex(DatetimeLike, tm.TestCase): - _holder = DatetimeIndex - _multiprocess_can_split_ = True - - def setUp(self): - self.indices = dict(index=tm.makeDateIndex(10)) - self.setup_indices() - - def create_index(self): - return date_range('20130101', periods=5) - - def test_shift(self): - - # test shift for datetimeIndex and non datetimeIndex - # GH8083 - - drange = self.create_index() - result = drange.shift(1) - expected = DatetimeIndex(['2013-01-02', '2013-01-03', '2013-01-04', - '2013-01-05', - '2013-01-06'], freq='D') - self.assert_index_equal(result, expected) - - result = drange.shift(-1) - expected = DatetimeIndex(['2012-12-31', '2013-01-01', '2013-01-02', - '2013-01-03', '2013-01-04'], - freq='D') - self.assert_index_equal(result, expected) - - result = drange.shift(3, freq='2D') - expected = DatetimeIndex(['2013-01-07', '2013-01-08', '2013-01-09', - '2013-01-10', - '2013-01-11'], freq='D') - self.assert_index_equal(result, expected) - - def test_construction_with_alt(self): - - i = pd.date_range('20130101', periods=5, freq='H', tz='US/Eastern') - i2 = DatetimeIndex(i, dtype=i.dtype) - self.assert_index_equal(i, i2) - - i2 = DatetimeIndex(i.tz_localize(None).asi8, tz=i.dtype.tz) - self.assert_index_equal(i, i2) - - i2 = DatetimeIndex(i.tz_localize(None).asi8, dtype=i.dtype) - self.assert_index_equal(i, i2) - - i2 = DatetimeIndex( - i.tz_localize(None).asi8, dtype=i.dtype, tz=i.dtype.tz) - self.assert_index_equal(i, i2) - - # localize into the 
provided tz - i2 = DatetimeIndex(i.tz_localize(None).asi8, tz='UTC') - expected = i.tz_localize(None).tz_localize('UTC') - self.assert_index_equal(i2, expected) - - # incompat tz/dtype - self.assertRaises(ValueError, lambda: DatetimeIndex( - i.tz_localize(None).asi8, dtype=i.dtype, tz='US/Pacific')) - - def test_pickle_compat_construction(self): - pass - - def test_construction_index_with_mixed_timezones(self): - # GH 11488 - # no tz results in DatetimeIndex - result = Index([Timestamp('2011-01-01'), - Timestamp('2011-01-02')], name='idx') - exp = DatetimeIndex([Timestamp('2011-01-01'), - Timestamp('2011-01-02')], name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - self.assertIsNone(result.tz) - - # same tz results in DatetimeIndex - result = Index([Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), - Timestamp('2011-01-02 10:00', tz='Asia/Tokyo')], - name='idx') - exp = DatetimeIndex( - [Timestamp('2011-01-01 10:00'), Timestamp('2011-01-02 10:00') - ], tz='Asia/Tokyo', name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - self.assertIsNotNone(result.tz) - self.assertEqual(result.tz, exp.tz) - - # same tz results in DatetimeIndex (DST) - result = Index([Timestamp('2011-01-01 10:00', tz='US/Eastern'), - Timestamp('2011-08-01 10:00', tz='US/Eastern')], - name='idx') - exp = DatetimeIndex([Timestamp('2011-01-01 10:00'), - Timestamp('2011-08-01 10:00')], - tz='US/Eastern', name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - self.assertIsNotNone(result.tz) - self.assertEqual(result.tz, exp.tz) - - # different tz results in Index(dtype=object) - result = Index([Timestamp('2011-01-01 10:00'), - Timestamp('2011-01-02 10:00', tz='US/Eastern')], - name='idx') - exp = Index([Timestamp('2011-01-01 10:00'), - Timestamp('2011-01-02 10:00', tz='US/Eastern')], - dtype='object', name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertFalse(isinstance(result, DatetimeIndex)) - - result = Index([Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), - Timestamp('2011-01-02 10:00', tz='US/Eastern')], - name='idx') - exp = Index([Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), - Timestamp('2011-01-02 10:00', tz='US/Eastern')], - dtype='object', name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertFalse(isinstance(result, DatetimeIndex)) - - # length = 1 - result = Index([Timestamp('2011-01-01')], name='idx') - exp = DatetimeIndex([Timestamp('2011-01-01')], name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - self.assertIsNone(result.tz) - - # length = 1 with tz - result = Index( - [Timestamp('2011-01-01 10:00', tz='Asia/Tokyo')], name='idx') - exp = DatetimeIndex([Timestamp('2011-01-01 10:00')], tz='Asia/Tokyo', - name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - self.assertIsNotNone(result.tz) - self.assertEqual(result.tz, exp.tz) - - def test_construction_index_with_mixed_timezones_with_NaT(self): - # GH 11488 - result = Index([pd.NaT, Timestamp('2011-01-01'), - pd.NaT, Timestamp('2011-01-02')], name='idx') - exp = DatetimeIndex([pd.NaT, Timestamp('2011-01-01'), - pd.NaT, Timestamp('2011-01-02')], name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - self.assertIsNone(result.tz) - - # same 
tz results in DatetimeIndex - result = Index([pd.NaT, Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), - pd.NaT, Timestamp('2011-01-02 10:00', - tz='Asia/Tokyo')], - name='idx') - exp = DatetimeIndex([pd.NaT, Timestamp('2011-01-01 10:00'), - pd.NaT, Timestamp('2011-01-02 10:00')], - tz='Asia/Tokyo', name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - self.assertIsNotNone(result.tz) - self.assertEqual(result.tz, exp.tz) - - # same tz results in DatetimeIndex (DST) - result = Index([Timestamp('2011-01-01 10:00', tz='US/Eastern'), - pd.NaT, - Timestamp('2011-08-01 10:00', tz='US/Eastern')], - name='idx') - exp = DatetimeIndex([Timestamp('2011-01-01 10:00'), pd.NaT, - Timestamp('2011-08-01 10:00')], - tz='US/Eastern', name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - self.assertIsNotNone(result.tz) - self.assertEqual(result.tz, exp.tz) - - # different tz results in Index(dtype=object) - result = Index([pd.NaT, Timestamp('2011-01-01 10:00'), - pd.NaT, Timestamp('2011-01-02 10:00', - tz='US/Eastern')], - name='idx') - exp = Index([pd.NaT, Timestamp('2011-01-01 10:00'), - pd.NaT, Timestamp('2011-01-02 10:00', tz='US/Eastern')], - dtype='object', name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertFalse(isinstance(result, DatetimeIndex)) - - result = Index([pd.NaT, Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), - pd.NaT, Timestamp('2011-01-02 10:00', - tz='US/Eastern')], name='idx') - exp = Index([pd.NaT, Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), - pd.NaT, Timestamp('2011-01-02 10:00', tz='US/Eastern')], - dtype='object', name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertFalse(isinstance(result, DatetimeIndex)) - - # all NaT - result = Index([pd.NaT, pd.NaT], name='idx') - exp = DatetimeIndex([pd.NaT, pd.NaT], name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - self.assertIsNone(result.tz) - - # all NaT with tz - result = Index([pd.NaT, pd.NaT], tz='Asia/Tokyo', name='idx') - exp = DatetimeIndex([pd.NaT, pd.NaT], tz='Asia/Tokyo', name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - self.assertIsNotNone(result.tz) - self.assertEqual(result.tz, exp.tz) - - def test_construction_dti_with_mixed_timezones(self): - # GH 11488 (not changed, added explicit tests) - - # no tz results in DatetimeIndex - result = DatetimeIndex( - [Timestamp('2011-01-01'), Timestamp('2011-01-02')], name='idx') - exp = DatetimeIndex( - [Timestamp('2011-01-01'), Timestamp('2011-01-02')], name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - - # same tz results in DatetimeIndex - result = DatetimeIndex([Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), - Timestamp('2011-01-02 10:00', - tz='Asia/Tokyo')], - name='idx') - exp = DatetimeIndex([Timestamp('2011-01-01 10:00'), - Timestamp('2011-01-02 10:00')], - tz='Asia/Tokyo', name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - - # same tz results in DatetimeIndex (DST) - result = DatetimeIndex([Timestamp('2011-01-01 10:00', tz='US/Eastern'), - Timestamp('2011-08-01 10:00', - tz='US/Eastern')], - name='idx') - exp = DatetimeIndex([Timestamp('2011-01-01 10:00'), - Timestamp('2011-08-01 10:00')], - tz='US/Eastern', name='idx') - 
self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - - # different tz coerces tz-naive to tz-awareIndex(dtype=object) - result = DatetimeIndex([Timestamp('2011-01-01 10:00'), - Timestamp('2011-01-02 10:00', - tz='US/Eastern')], name='idx') - exp = DatetimeIndex([Timestamp('2011-01-01 05:00'), - Timestamp('2011-01-02 10:00')], - tz='US/Eastern', name='idx') - self.assert_index_equal(result, exp, exact=True) - self.assertTrue(isinstance(result, DatetimeIndex)) - - # tz mismatch affecting to tz-aware raises TypeError/ValueError - - with tm.assertRaises(ValueError): - DatetimeIndex([Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), - Timestamp('2011-01-02 10:00', tz='US/Eastern')], - name='idx') - - with tm.assertRaisesRegexp(TypeError, 'data is already tz-aware'): - DatetimeIndex([Timestamp('2011-01-01 10:00'), - Timestamp('2011-01-02 10:00', tz='US/Eastern')], - tz='Asia/Tokyo', name='idx') - - with tm.assertRaises(ValueError): - DatetimeIndex([Timestamp('2011-01-01 10:00', tz='Asia/Tokyo'), - Timestamp('2011-01-02 10:00', tz='US/Eastern')], - tz='US/Eastern', name='idx') - - with tm.assertRaisesRegexp(TypeError, 'data is already tz-aware'): - # passing tz should results in DatetimeIndex, then mismatch raises - # TypeError - Index([pd.NaT, Timestamp('2011-01-01 10:00'), - pd.NaT, Timestamp('2011-01-02 10:00', tz='US/Eastern')], - tz='Asia/Tokyo', name='idx') - - def test_construction_base_constructor(self): - arr = [pd.Timestamp('2011-01-01'), pd.NaT, pd.Timestamp('2011-01-03')] - tm.assert_index_equal(pd.Index(arr), pd.DatetimeIndex(arr)) - tm.assert_index_equal(pd.Index(np.array(arr)), - pd.DatetimeIndex(np.array(arr))) - - arr = [np.nan, pd.NaT, pd.Timestamp('2011-01-03')] - tm.assert_index_equal(pd.Index(arr), pd.DatetimeIndex(arr)) - tm.assert_index_equal(pd.Index(np.array(arr)), - pd.DatetimeIndex(np.array(arr))) - - def test_construction_outofbounds(self): - # GH 13663 - dates = [datetime(3000, 1, 1), datetime(4000, 1, 1), - datetime(5000, 1, 1), datetime(6000, 1, 1)] - exp = Index(dates, dtype=object) - # coerces to object - tm.assert_index_equal(Index(dates), exp) - - with tm.assertRaises(OutOfBoundsDatetime): - # can't create DatetimeIndex - DatetimeIndex(dates) - - def test_astype(self): - # GH 13149, GH 13209 - idx = DatetimeIndex(['2016-05-16', 'NaT', NaT, np.NaN]) - - result = idx.astype(object) - expected = Index([Timestamp('2016-05-16')] + [NaT] * 3, dtype=object) - tm.assert_index_equal(result, expected) - - result = idx.astype(int) - expected = Int64Index([1463356800000000000] + - [-9223372036854775808] * 3, dtype=np.int64) - tm.assert_index_equal(result, expected) - - rng = date_range('1/1/2000', periods=10) - result = rng.astype('i8') - self.assert_index_equal(result, Index(rng.asi8)) - self.assert_numpy_array_equal(result.values, rng.asi8) - - def test_astype_with_tz(self): - - # with tz - rng = date_range('1/1/2000', periods=10, tz='US/Eastern') - result = rng.astype('datetime64[ns]') - expected = (date_range('1/1/2000', periods=10, - tz='US/Eastern') - .tz_convert('UTC').tz_localize(None)) - tm.assert_index_equal(result, expected) - - # BUG#10442 : testing astype(str) is correct for Series/DatetimeIndex - result = pd.Series(pd.date_range('2012-01-01', periods=3)).astype(str) - expected = pd.Series( - ['2012-01-01', '2012-01-02', '2012-01-03'], dtype=object) - tm.assert_series_equal(result, expected) - - result = Series(pd.date_range('2012-01-01', periods=3, - tz='US/Eastern')).astype(str) - expected = Series(['2012-01-01 
00:00:00-05:00', - '2012-01-02 00:00:00-05:00', - '2012-01-03 00:00:00-05:00'], - dtype=object) - tm.assert_series_equal(result, expected) - - def test_astype_str_compat(self): - # GH 13149, GH 13209 - # verify that we are returing NaT as a string (and not unicode) - - idx = DatetimeIndex(['2016-05-16', 'NaT', NaT, np.NaN]) - result = idx.astype(str) - expected = Index(['2016-05-16', 'NaT', 'NaT', 'NaT'], dtype=object) - tm.assert_index_equal(result, expected) - - def test_astype_str(self): - # test astype string - #10442 - result = date_range('2012-01-01', periods=4, - name='test_name').astype(str) - expected = Index(['2012-01-01', '2012-01-02', '2012-01-03', - '2012-01-04'], name='test_name', dtype=object) - tm.assert_index_equal(result, expected) - - # test astype string with tz and name - result = date_range('2012-01-01', periods=3, name='test_name', - tz='US/Eastern').astype(str) - expected = Index(['2012-01-01 00:00:00-05:00', - '2012-01-02 00:00:00-05:00', - '2012-01-03 00:00:00-05:00'], - name='test_name', dtype=object) - tm.assert_index_equal(result, expected) - - # test astype string with freqH and name - result = date_range('1/1/2011', periods=3, freq='H', - name='test_name').astype(str) - expected = Index(['2011-01-01 00:00:00', '2011-01-01 01:00:00', - '2011-01-01 02:00:00'], - name='test_name', dtype=object) - tm.assert_index_equal(result, expected) - - # test astype string with freqH and timezone - result = date_range('3/6/2012 00:00', periods=2, freq='H', - tz='Europe/London', name='test_name').astype(str) - expected = Index(['2012-03-06 00:00:00+00:00', - '2012-03-06 01:00:00+00:00'], - dtype=object, name='test_name') - tm.assert_index_equal(result, expected) - - def test_astype_datetime64(self): - # GH 13149, GH 13209 - idx = DatetimeIndex(['2016-05-16', 'NaT', NaT, np.NaN]) - - result = idx.astype('datetime64[ns]') - tm.assert_index_equal(result, idx) - self.assertFalse(result is idx) - - result = idx.astype('datetime64[ns]', copy=False) - tm.assert_index_equal(result, idx) - self.assertTrue(result is idx) - - idx_tz = DatetimeIndex(['2016-05-16', 'NaT', NaT, np.NaN], tz='EST') - result = idx_tz.astype('datetime64[ns]') - expected = DatetimeIndex(['2016-05-16 05:00:00', 'NaT', 'NaT', 'NaT'], - dtype='datetime64[ns]') - tm.assert_index_equal(result, expected) - - def test_astype_raises(self): - # GH 13149, GH 13209 - idx = DatetimeIndex(['2016-05-16', 'NaT', NaT, np.NaN]) - - self.assertRaises(ValueError, idx.astype, float) - self.assertRaises(ValueError, idx.astype, 'timedelta64') - self.assertRaises(ValueError, idx.astype, 'timedelta64[ns]') - self.assertRaises(ValueError, idx.astype, 'datetime64') - self.assertRaises(ValueError, idx.astype, 'datetime64[D]') - - def test_where_other(self): - - # other is ndarray or Index - i = pd.date_range('20130101', periods=3, tz='US/Eastern') - - for arr in [np.nan, pd.NaT]: - result = i.where(notnull(i), other=np.nan) - expected = i - tm.assert_index_equal(result, expected) - - i2 = i.copy() - i2 = Index([pd.NaT, pd.NaT] + i[2:].tolist()) - result = i.where(notnull(i2), i2) - tm.assert_index_equal(result, i2) - - i2 = i.copy() - i2 = Index([pd.NaT, pd.NaT] + i[2:].tolist()) - result = i.where(notnull(i2), i2.values) - tm.assert_index_equal(result, i2) - - def test_where_tz(self): - i = pd.date_range('20130101', periods=3, tz='US/Eastern') - result = i.where(notnull(i)) - expected = i - tm.assert_index_equal(result, expected) - - i2 = i.copy() - i2 = Index([pd.NaT, pd.NaT] + i[2:].tolist()) - result = i.where(notnull(i2)) - expected 
= i2 - tm.assert_index_equal(result, expected) - - def test_get_loc(self): - idx = pd.date_range('2000-01-01', periods=3) - - for method in [None, 'pad', 'backfill', 'nearest']: - self.assertEqual(idx.get_loc(idx[1], method), 1) - self.assertEqual(idx.get_loc(idx[1].to_pydatetime(), method), 1) - self.assertEqual(idx.get_loc(str(idx[1]), method), 1) - if method is not None: - self.assertEqual(idx.get_loc(idx[1], method, - tolerance=pd.Timedelta('0 days')), - 1) - - self.assertEqual(idx.get_loc('2000-01-01', method='nearest'), 0) - self.assertEqual(idx.get_loc('2000-01-01T12', method='nearest'), 1) - - self.assertEqual(idx.get_loc('2000-01-01T12', method='nearest', - tolerance='1 day'), 1) - self.assertEqual(idx.get_loc('2000-01-01T12', method='nearest', - tolerance=pd.Timedelta('1D')), 1) - self.assertEqual(idx.get_loc('2000-01-01T12', method='nearest', - tolerance=np.timedelta64(1, 'D')), 1) - self.assertEqual(idx.get_loc('2000-01-01T12', method='nearest', - tolerance=timedelta(1)), 1) - with tm.assertRaisesRegexp(ValueError, 'must be convertible'): - idx.get_loc('2000-01-01T12', method='nearest', tolerance='foo') - with tm.assertRaises(KeyError): - idx.get_loc('2000-01-01T03', method='nearest', tolerance='2 hours') - - self.assertEqual(idx.get_loc('2000', method='nearest'), slice(0, 3)) - self.assertEqual(idx.get_loc('2000-01', method='nearest'), slice(0, 3)) - - self.assertEqual(idx.get_loc('1999', method='nearest'), 0) - self.assertEqual(idx.get_loc('2001', method='nearest'), 2) - - with tm.assertRaises(KeyError): - idx.get_loc('1999', method='pad') - with tm.assertRaises(KeyError): - idx.get_loc('2001', method='backfill') - - with tm.assertRaises(KeyError): - idx.get_loc('foobar') - with tm.assertRaises(TypeError): - idx.get_loc(slice(2)) - - idx = pd.to_datetime(['2000-01-01', '2000-01-04']) - self.assertEqual(idx.get_loc('2000-01-02', method='nearest'), 0) - self.assertEqual(idx.get_loc('2000-01-03', method='nearest'), 1) - self.assertEqual(idx.get_loc('2000-01', method='nearest'), slice(0, 2)) - - # time indexing - idx = pd.date_range('2000-01-01', periods=24, freq='H') - tm.assert_numpy_array_equal(idx.get_loc(time(12)), - np.array([12]), check_dtype=False) - tm.assert_numpy_array_equal(idx.get_loc(time(12, 30)), - np.array([]), check_dtype=False) - with tm.assertRaises(NotImplementedError): - idx.get_loc(time(12, 30), method='pad') - - def test_get_indexer(self): - idx = pd.date_range('2000-01-01', periods=3) - exp = np.array([0, 1, 2], dtype=np.intp) - tm.assert_numpy_array_equal(idx.get_indexer(idx), exp) - - target = idx[0] + pd.to_timedelta(['-1 hour', '12 hours', - '1 day 1 hour']) - tm.assert_numpy_array_equal(idx.get_indexer(target, 'pad'), - np.array([-1, 0, 1], dtype=np.intp)) - tm.assert_numpy_array_equal(idx.get_indexer(target, 'backfill'), - np.array([0, 1, 2], dtype=np.intp)) - tm.assert_numpy_array_equal(idx.get_indexer(target, 'nearest'), - np.array([0, 1, 1], dtype=np.intp)) - tm.assert_numpy_array_equal( - idx.get_indexer(target, 'nearest', - tolerance=pd.Timedelta('1 hour')), - np.array([0, -1, 1], dtype=np.intp)) - with tm.assertRaises(ValueError): - idx.get_indexer(idx[[0]], method='nearest', tolerance='foo') - - def test_roundtrip_pickle_with_tz(self): - - # GH 8367 - # round-trip of timezone - index = date_range('20130101', periods=3, tz='US/Eastern', name='foo') - unpickled = self.round_trip_pickle(index) - self.assert_index_equal(index, unpickled) - - def test_reindex_preserves_tz_if_target_is_empty_list_or_array(self): - # GH7774 - index = 
date_range('20130101', periods=3, tz='US/Eastern') - self.assertEqual(str(index.reindex([])[0].tz), 'US/Eastern') - self.assertEqual(str(index.reindex(np.array([]))[0].tz), 'US/Eastern') - - def test_time_loc(self): # GH8667 - from datetime import time - from pandas.index import _SIZE_CUTOFF - - ns = _SIZE_CUTOFF + np.array([-100, 100], dtype=np.int64) - key = time(15, 11, 30) - start = key.hour * 3600 + key.minute * 60 + key.second - step = 24 * 3600 - - for n in ns: - idx = pd.date_range('2014-11-26', periods=n, freq='S') - ts = pd.Series(np.random.randn(n), index=idx) - i = np.arange(start, n, step) - - tm.assert_numpy_array_equal(ts.index.get_loc(key), i, - check_dtype=False) - tm.assert_series_equal(ts[key], ts.iloc[i]) - - left, right = ts.copy(), ts.copy() - left[key] *= -10 - right.iloc[i] *= -10 - tm.assert_series_equal(left, right) - - def test_time_overflow_for_32bit_machines(self): - # GH8943. On some machines NumPy defaults to np.int32 (for example, - # 32-bit Linux machines). In the function _generate_regular_range - # found in tseries/index.py, `periods` gets multiplied by `strides` - # (which has value 1e9) and since the max value for np.int32 is ~2e9, - # and since those machines won't promote np.int32 to np.int64, we get - # overflow. - periods = np.int_(1000) - - idx1 = pd.date_range(start='2000', periods=periods, freq='S') - self.assertEqual(len(idx1), periods) - - idx2 = pd.date_range(end='2000', periods=periods, freq='S') - self.assertEqual(len(idx2), periods) - - def test_intersection(self): - first = self.index - second = self.index[5:] - intersect = first.intersection(second) - self.assertTrue(tm.equalContents(intersect, second)) - - # GH 10149 - cases = [klass(second.values) for klass in [np.array, Series, list]] - for case in cases: - result = first.intersection(case) - self.assertTrue(tm.equalContents(result, second)) - - third = Index(['a', 'b', 'c']) - result = first.intersection(third) - expected = pd.Index([], dtype=object) - self.assert_index_equal(result, expected) - - def test_union(self): - first = self.index[:5] - second = self.index[5:] - everything = self.index - union = first.union(second) - self.assertTrue(tm.equalContents(union, everything)) - - # GH 10149 - cases = [klass(second.values) for klass in [np.array, Series, list]] - for case in cases: - result = first.union(case) - self.assertTrue(tm.equalContents(result, everything)) - - def test_nat(self): - self.assertIs(DatetimeIndex([np.nan])[0], pd.NaT) - - def test_ufunc_coercions(self): - idx = date_range('2011-01-01', periods=3, freq='2D', name='x') - - delta = np.timedelta64(1, 'D') - for result in [idx + delta, np.add(idx, delta)]: - tm.assertIsInstance(result, DatetimeIndex) - exp = date_range('2011-01-02', periods=3, freq='2D', name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, '2D') - - for result in [idx - delta, np.subtract(idx, delta)]: - tm.assertIsInstance(result, DatetimeIndex) - exp = date_range('2010-12-31', periods=3, freq='2D', name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, '2D') - - delta = np.array([np.timedelta64(1, 'D'), np.timedelta64(2, 'D'), - np.timedelta64(3, 'D')]) - for result in [idx + delta, np.add(idx, delta)]: - tm.assertIsInstance(result, DatetimeIndex) - exp = DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-08'], - freq='3D', name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, '3D') - - for result in [idx - delta, np.subtract(idx, delta)]: - tm.assertIsInstance(result, 
DatetimeIndex) - exp = DatetimeIndex(['2010-12-31', '2011-01-01', '2011-01-02'], - freq='D', name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, 'D') - - def test_fillna_datetime64(self): - # GH 11343 - for tz in ['US/Eastern', 'Asia/Tokyo']: - idx = pd.DatetimeIndex(['2011-01-01 09:00', pd.NaT, - '2011-01-01 11:00']) - - exp = pd.DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', - '2011-01-01 11:00']) - self.assert_index_equal( - idx.fillna(pd.Timestamp('2011-01-01 10:00')), exp) - - # tz mismatch - exp = pd.Index([pd.Timestamp('2011-01-01 09:00'), - pd.Timestamp('2011-01-01 10:00', tz=tz), - pd.Timestamp('2011-01-01 11:00')], dtype=object) - self.assert_index_equal( - idx.fillna(pd.Timestamp('2011-01-01 10:00', tz=tz)), exp) - - # object - exp = pd.Index([pd.Timestamp('2011-01-01 09:00'), 'x', - pd.Timestamp('2011-01-01 11:00')], dtype=object) - self.assert_index_equal(idx.fillna('x'), exp) - - idx = pd.DatetimeIndex(['2011-01-01 09:00', pd.NaT, - '2011-01-01 11:00'], tz=tz) - - exp = pd.DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', - '2011-01-01 11:00'], tz=tz) - self.assert_index_equal( - idx.fillna(pd.Timestamp('2011-01-01 10:00', tz=tz)), exp) - - exp = pd.Index([pd.Timestamp('2011-01-01 09:00', tz=tz), - pd.Timestamp('2011-01-01 10:00'), - pd.Timestamp('2011-01-01 11:00', tz=tz)], - dtype=object) - self.assert_index_equal( - idx.fillna(pd.Timestamp('2011-01-01 10:00')), exp) - - # object - exp = pd.Index([pd.Timestamp('2011-01-01 09:00', tz=tz), - 'x', - pd.Timestamp('2011-01-01 11:00', tz=tz)], - dtype=object) - self.assert_index_equal(idx.fillna('x'), exp) - - def test_difference_of_union(self): - # GH14323: Test taking the union of differences of an Index. - # Difference of DatetimeIndex does not preserve frequency, - # so a differencing operation should not retain the freq field of the - # original index. 
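For context, a minimal sketch of the invariant this test pins down, using only the public pandas API (the values mirror the assertions below): removing interior dates leaves an irregular index, so `difference` cannot keep the original `freq='D'`.

```python
import pandas as pd

i = pd.date_range("20160920", "20160925", freq="D")
a = pd.date_range("20160921", "20160924", freq="D")

diff = i.difference(a)  # DatetimeIndex(['2016-09-20', '2016-09-25'])
print(diff.freq)        # None -- the daily freq is not retained
```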
- i = pd.date_range("20160920", "20160925", freq="D") - - a = pd.date_range("20160921", "20160924", freq="D") - expected = pd.DatetimeIndex(["20160920", "20160925"], freq=None) - a_diff = i.difference(a) - tm.assert_index_equal(a_diff, expected) - tm.assert_attr_equal('freq', a_diff, expected) - - b = pd.date_range("20160922", "20160925", freq="D") - b_diff = i.difference(b) - expected = pd.DatetimeIndex(["20160920", "20160921"], freq=None) - tm.assert_index_equal(b_diff, expected) - tm.assert_attr_equal('freq', b_diff, expected) - - union_of_diff = a_diff.union(b_diff) - expected = pd.DatetimeIndex(["20160920", "20160921", "20160925"], - freq=None) - tm.assert_index_equal(union_of_diff, expected) - tm.assert_attr_equal('freq', union_of_diff, expected) - - -class TestPeriodIndex(DatetimeLike, tm.TestCase): - _holder = PeriodIndex - _multiprocess_can_split_ = True - - def setUp(self): - self.indices = dict(index=tm.makePeriodIndex(10)) - self.setup_indices() - - def create_index(self): - return period_range('20130101', periods=5, freq='D') - - def test_construction_base_constructor(self): - # GH 13664 - arr = [pd.Period('2011-01', freq='M'), pd.NaT, - pd.Period('2011-03', freq='M')] - tm.assert_index_equal(pd.Index(arr), pd.PeriodIndex(arr)) - tm.assert_index_equal(pd.Index(np.array(arr)), - pd.PeriodIndex(np.array(arr))) - - arr = [np.nan, pd.NaT, pd.Period('2011-03', freq='M')] - tm.assert_index_equal(pd.Index(arr), pd.PeriodIndex(arr)) - tm.assert_index_equal(pd.Index(np.array(arr)), - pd.PeriodIndex(np.array(arr))) - - arr = [pd.Period('2011-01', freq='M'), pd.NaT, - pd.Period('2011-03', freq='D')] - tm.assert_index_equal(pd.Index(arr), pd.Index(arr, dtype=object)) - - tm.assert_index_equal(pd.Index(np.array(arr)), - pd.Index(np.array(arr), dtype=object)) - - def test_astype(self): - # GH 13149, GH 13209 - idx = PeriodIndex(['2016-05-16', 'NaT', NaT, np.NaN], freq='D') - - result = idx.astype(object) - expected = Index([Period('2016-05-16', freq='D')] + - [Period(NaT, freq='D')] * 3, dtype='object') - tm.assert_index_equal(result, expected) - - result = idx.astype(int) - expected = Int64Index([16937] + [-9223372036854775808] * 3, - dtype=np.int64) - tm.assert_index_equal(result, expected) - - idx = period_range('1990', '2009', freq='A') - result = idx.astype('i8') - self.assert_index_equal(result, Index(idx.asi8)) - self.assert_numpy_array_equal(result.values, idx.asi8) - - def test_astype_raises(self): - # GH 13149, GH 13209 - idx = PeriodIndex(['2016-05-16', 'NaT', NaT, np.NaN], freq='D') - - self.assertRaises(ValueError, idx.astype, str) - self.assertRaises(ValueError, idx.astype, float) - self.assertRaises(ValueError, idx.astype, 'timedelta64') - self.assertRaises(ValueError, idx.astype, 'timedelta64[ns]') - - def test_shift(self): - - # test shift for PeriodIndex - # GH8083 - drange = self.create_index() - result = drange.shift(1) - expected = PeriodIndex(['2013-01-02', '2013-01-03', '2013-01-04', - '2013-01-05', '2013-01-06'], freq='D') - self.assert_index_equal(result, expected) - - def test_pickle_compat_construction(self): - pass - - def test_get_loc(self): - idx = pd.period_range('2000-01-01', periods=3) - - for method in [None, 'pad', 'backfill', 'nearest']: - self.assertEqual(idx.get_loc(idx[1], method), 1) - self.assertEqual( - idx.get_loc(idx[1].asfreq('H', how='start'), method), 1) - self.assertEqual(idx.get_loc(idx[1].to_timestamp(), method), 1) - self.assertEqual( - idx.get_loc(idx[1].to_timestamp().to_pydatetime(), method), 1) - 
self.assertEqual(idx.get_loc(str(idx[1]), method), 1) - - idx = pd.period_range('2000-01-01', periods=5)[::2] - self.assertEqual(idx.get_loc('2000-01-02T12', method='nearest', - tolerance='1 day'), 1) - self.assertEqual(idx.get_loc('2000-01-02T12', method='nearest', - tolerance=pd.Timedelta('1D')), 1) - self.assertEqual(idx.get_loc('2000-01-02T12', method='nearest', - tolerance=np.timedelta64(1, 'D')), 1) - self.assertEqual(idx.get_loc('2000-01-02T12', method='nearest', - tolerance=timedelta(1)), 1) - with tm.assertRaisesRegexp(ValueError, 'must be convertible'): - idx.get_loc('2000-01-10', method='nearest', tolerance='foo') - - msg = 'Input has different freq from PeriodIndex\\(freq=D\\)' - with tm.assertRaisesRegexp(ValueError, msg): - idx.get_loc('2000-01-10', method='nearest', tolerance='1 hour') - with tm.assertRaises(KeyError): - idx.get_loc('2000-01-10', method='nearest', tolerance='1 day') - - def test_where(self): - i = self.create_index() - result = i.where(notnull(i)) - expected = i - tm.assert_index_equal(result, expected) - - i2 = i.copy() - i2 = pd.PeriodIndex([pd.NaT, pd.NaT] + i[2:].tolist(), - freq='D') - result = i.where(notnull(i2)) - expected = i2 - tm.assert_index_equal(result, expected) - - def test_where_other(self): - - i = self.create_index() - for arr in [np.nan, pd.NaT]: - result = i.where(notnull(i), other=np.nan) - expected = i - tm.assert_index_equal(result, expected) - - i2 = i.copy() - i2 = pd.PeriodIndex([pd.NaT, pd.NaT] + i[2:].tolist(), - freq='D') - result = i.where(notnull(i2), i2) - tm.assert_index_equal(result, i2) - - i2 = i.copy() - i2 = pd.PeriodIndex([pd.NaT, pd.NaT] + i[2:].tolist(), - freq='D') - result = i.where(notnull(i2), i2.values) - tm.assert_index_equal(result, i2) - - def test_get_indexer(self): - idx = pd.period_range('2000-01-01', periods=3).asfreq('H', how='start') - tm.assert_numpy_array_equal(idx.get_indexer(idx), - np.array([0, 1, 2], dtype=np.intp)) - - target = pd.PeriodIndex(['1999-12-31T23', '2000-01-01T12', - '2000-01-02T01'], freq='H') - tm.assert_numpy_array_equal(idx.get_indexer(target, 'pad'), - np.array([-1, 0, 1], dtype=np.intp)) - tm.assert_numpy_array_equal(idx.get_indexer(target, 'backfill'), - np.array([0, 1, 2], dtype=np.intp)) - tm.assert_numpy_array_equal(idx.get_indexer(target, 'nearest'), - np.array([0, 1, 1], dtype=np.intp)) - tm.assert_numpy_array_equal(idx.get_indexer(target, 'nearest', - tolerance='1 hour'), - np.array([0, -1, 1], dtype=np.intp)) - - msg = 'Input has different freq from PeriodIndex\\(freq=H\\)' - with self.assertRaisesRegexp(ValueError, msg): - idx.get_indexer(target, 'nearest', tolerance='1 minute') - - tm.assert_numpy_array_equal(idx.get_indexer(target, 'nearest', - tolerance='1 day'), - np.array([0, 1, 1], dtype=np.intp)) - - def test_repeat(self): - # GH10183 - idx = pd.period_range('2000-01-01', periods=3, freq='D') - res = idx.repeat(3) - exp = PeriodIndex(idx.values.repeat(3), freq='D') - self.assert_index_equal(res, exp) - self.assertEqual(res.freqstr, 'D') - - def test_period_index_indexer(self): - # GH4125 - idx = pd.period_range('2002-01', '2003-12', freq='M') - df = pd.DataFrame(pd.np.random.randn(24, 10), index=idx) - self.assert_frame_equal(df, df.ix[idx]) - self.assert_frame_equal(df, df.ix[list(idx)]) - self.assert_frame_equal(df, df.loc[list(idx)]) - self.assert_frame_equal(df.iloc[0:5], df.loc[idx[0:5]]) - self.assert_frame_equal(df, df.loc[list(idx)]) - - def test_fillna_period(self): - # GH 11343 - idx = pd.PeriodIndex(['2011-01-01 09:00', pd.NaT, - '2011-01-01 11:00'], 
freq='H') - - exp = pd.PeriodIndex(['2011-01-01 09:00', '2011-01-01 10:00', - '2011-01-01 11:00'], freq='H') - self.assert_index_equal( - idx.fillna(pd.Period('2011-01-01 10:00', freq='H')), exp) - - exp = pd.Index([pd.Period('2011-01-01 09:00', freq='H'), 'x', - pd.Period('2011-01-01 11:00', freq='H')], dtype=object) - self.assert_index_equal(idx.fillna('x'), exp) - - exp = pd.Index([pd.Period('2011-01-01 09:00', freq='H'), - pd.Period('2011-01-01', freq='D'), - pd.Period('2011-01-01 11:00', freq='H')], dtype=object) - self.assert_index_equal(idx.fillna(pd.Period('2011-01-01', freq='D')), - exp) - - def test_no_millisecond_field(self): - with self.assertRaises(AttributeError): - DatetimeIndex.millisecond - - with self.assertRaises(AttributeError): - DatetimeIndex([]).millisecond - - def test_difference_of_union(self): - # GH14323: Test taking the union of differences of an Index. - # Difference of Period MUST preserve frequency, but the ability - # to union results must be preserved - i = pd.period_range("20160920", "20160925", freq="D") - - a = pd.period_range("20160921", "20160924", freq="D") - expected = pd.PeriodIndex(["20160920", "20160925"], freq='D') - a_diff = i.difference(a) - tm.assert_index_equal(a_diff, expected) - tm.assert_attr_equal('freq', a_diff, expected) - - b = pd.period_range("20160922", "20160925", freq="D") - b_diff = i.difference(b) - expected = pd.PeriodIndex(["20160920", "20160921"], freq='D') - tm.assert_index_equal(b_diff, expected) - tm.assert_attr_equal('freq', b_diff, expected) - - union_of_diff = a_diff.union(b_diff) - expected = pd.PeriodIndex(["20160920", "20160921", "20160925"], - freq='D') - tm.assert_index_equal(union_of_diff, expected) - tm.assert_attr_equal('freq', union_of_diff, expected) - - -class TestTimedeltaIndex(DatetimeLike, tm.TestCase): - _holder = TimedeltaIndex - _multiprocess_can_split_ = True - - def setUp(self): - self.indices = dict(index=tm.makeTimedeltaIndex(10)) - self.setup_indices() - - def create_index(self): - return pd.to_timedelta(range(5), unit='d') + pd.offsets.Hour(1) - - def test_construction_base_constructor(self): - arr = [pd.Timedelta('1 days'), pd.NaT, pd.Timedelta('3 days')] - tm.assert_index_equal(pd.Index(arr), pd.TimedeltaIndex(arr)) - tm.assert_index_equal(pd.Index(np.array(arr)), - pd.TimedeltaIndex(np.array(arr))) - - arr = [np.nan, pd.NaT, pd.Timedelta('1 days')] - tm.assert_index_equal(pd.Index(arr), pd.TimedeltaIndex(arr)) - tm.assert_index_equal(pd.Index(np.array(arr)), - pd.TimedeltaIndex(np.array(arr))) - - def test_shift(self): - # test shift for TimedeltaIndex - # err8083 - - drange = self.create_index() - result = drange.shift(1) - expected = TimedeltaIndex(['1 days 01:00:00', '2 days 01:00:00', - '3 days 01:00:00', - '4 days 01:00:00', '5 days 01:00:00'], - freq='D') - self.assert_index_equal(result, expected) - - result = drange.shift(3, freq='2D 1s') - expected = TimedeltaIndex(['6 days 01:00:03', '7 days 01:00:03', - '8 days 01:00:03', '9 days 01:00:03', - '10 days 01:00:03'], freq='D') - self.assert_index_equal(result, expected) - - def test_astype(self): - # GH 13149, GH 13209 - idx = TimedeltaIndex([1e14, 'NaT', pd.NaT, np.NaN]) - - result = idx.astype(object) - expected = Index([Timedelta('1 days 03:46:40')] + [pd.NaT] * 3, - dtype=object) - tm.assert_index_equal(result, expected) - - result = idx.astype(int) - expected = Int64Index([100000000000000] + [-9223372036854775808] * 3, - dtype=np.int64) - tm.assert_index_equal(result, expected) - - rng = timedelta_range('1 days', periods=10) - - 
result = rng.astype('i8') - self.assert_index_equal(result, Index(rng.asi8)) - self.assert_numpy_array_equal(rng.asi8, result.values) - - def test_astype_timedelta64(self): - # GH 13149, GH 13209 - idx = TimedeltaIndex([1e14, 'NaT', pd.NaT, np.NaN]) - - result = idx.astype('timedelta64') - expected = Float64Index([1e+14] + [np.NaN] * 3, dtype='float64') - tm.assert_index_equal(result, expected) - - result = idx.astype('timedelta64[ns]') - tm.assert_index_equal(result, idx) - self.assertFalse(result is idx) - - result = idx.astype('timedelta64[ns]', copy=False) - tm.assert_index_equal(result, idx) - self.assertTrue(result is idx) - - def test_astype_raises(self): - # GH 13149, GH 13209 - idx = TimedeltaIndex([1e14, 'NaT', pd.NaT, np.NaN]) - - self.assertRaises(ValueError, idx.astype, float) - self.assertRaises(ValueError, idx.astype, str) - self.assertRaises(ValueError, idx.astype, 'datetime64') - self.assertRaises(ValueError, idx.astype, 'datetime64[ns]') - - def test_get_loc(self): - idx = pd.to_timedelta(['0 days', '1 days', '2 days']) - - for method in [None, 'pad', 'backfill', 'nearest']: - self.assertEqual(idx.get_loc(idx[1], method), 1) - self.assertEqual(idx.get_loc(idx[1].to_pytimedelta(), method), 1) - self.assertEqual(idx.get_loc(str(idx[1]), method), 1) - - self.assertEqual( - idx.get_loc(idx[1], 'pad', tolerance=pd.Timedelta(0)), 1) - self.assertEqual( - idx.get_loc(idx[1], 'pad', tolerance=np.timedelta64(0, 's')), 1) - self.assertEqual(idx.get_loc(idx[1], 'pad', tolerance=timedelta(0)), 1) - - with tm.assertRaisesRegexp(ValueError, 'must be convertible'): - idx.get_loc(idx[1], method='nearest', tolerance='foo') - - for method, loc in [('pad', 1), ('backfill', 2), ('nearest', 1)]: - self.assertEqual(idx.get_loc('1 day 1 hour', method), loc) - - def test_get_indexer(self): - idx = pd.to_timedelta(['0 days', '1 days', '2 days']) - tm.assert_numpy_array_equal(idx.get_indexer(idx), - np.array([0, 1, 2], dtype=np.intp)) - - target = pd.to_timedelta(['-1 hour', '12 hours', '1 day 1 hour']) - tm.assert_numpy_array_equal(idx.get_indexer(target, 'pad'), - np.array([-1, 0, 1], dtype=np.intp)) - tm.assert_numpy_array_equal(idx.get_indexer(target, 'backfill'), - np.array([0, 1, 2], dtype=np.intp)) - tm.assert_numpy_array_equal(idx.get_indexer(target, 'nearest'), - np.array([0, 1, 1], dtype=np.intp)) - - res = idx.get_indexer(target, 'nearest', - tolerance=pd.Timedelta('1 hour')) - tm.assert_numpy_array_equal(res, np.array([0, -1, 1], dtype=np.intp)) - - def test_numeric_compat(self): - - idx = self._holder(np.arange(5, dtype='int64')) - didx = self._holder(np.arange(5, dtype='int64') ** 2) - result = idx * 1 - tm.assert_index_equal(result, idx) - - result = 1 * idx - tm.assert_index_equal(result, idx) - - result = idx / 1 - tm.assert_index_equal(result, idx) - - result = idx // 1 - tm.assert_index_equal(result, idx) - - result = idx * np.array(5, dtype='int64') - tm.assert_index_equal(result, - self._holder(np.arange(5, dtype='int64') * 5)) - - result = idx * np.arange(5, dtype='int64') - tm.assert_index_equal(result, didx) - - result = idx * Series(np.arange(5, dtype='int64')) - tm.assert_index_equal(result, didx) - - result = idx * Series(np.arange(5, dtype='float64') + 0.1) - tm.assert_index_equal(result, self._holder(np.arange( - 5, dtype='float64') * (np.arange(5, dtype='float64') + 0.1))) - - # invalid - self.assertRaises(TypeError, lambda: idx * idx) - self.assertRaises(ValueError, lambda: idx * self._holder(np.arange(3))) - self.assertRaises(ValueError, lambda: idx * np.array([1, 
2])) - - def test_pickle_compat_construction(self): - pass - - def test_ufunc_coercions(self): - # normal ops are also tested in tseries/test_timedeltas.py - idx = TimedeltaIndex(['2H', '4H', '6H', '8H', '10H'], - freq='2H', name='x') - - for result in [idx * 2, np.multiply(idx, 2)]: - tm.assertIsInstance(result, TimedeltaIndex) - exp = TimedeltaIndex(['4H', '8H', '12H', '16H', '20H'], - freq='4H', name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, '4H') - - for result in [idx / 2, np.divide(idx, 2)]: - tm.assertIsInstance(result, TimedeltaIndex) - exp = TimedeltaIndex(['1H', '2H', '3H', '4H', '5H'], - freq='H', name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, 'H') - - idx = TimedeltaIndex(['2H', '4H', '6H', '8H', '10H'], - freq='2H', name='x') - for result in [-idx, np.negative(idx)]: - tm.assertIsInstance(result, TimedeltaIndex) - exp = TimedeltaIndex(['-2H', '-4H', '-6H', '-8H', '-10H'], - freq='-2H', name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, '-2H') - - idx = TimedeltaIndex(['-2H', '-1H', '0H', '1H', '2H'], - freq='H', name='x') - for result in [abs(idx), np.absolute(idx)]: - tm.assertIsInstance(result, TimedeltaIndex) - exp = TimedeltaIndex(['2H', '1H', '0H', '1H', '2H'], - freq=None, name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, None) - - def test_fillna_timedelta(self): - # GH 11343 - idx = pd.TimedeltaIndex(['1 day', pd.NaT, '3 day']) - - exp = pd.TimedeltaIndex(['1 day', '2 day', '3 day']) - self.assert_index_equal(idx.fillna(pd.Timedelta('2 day')), exp) - - exp = pd.TimedeltaIndex(['1 day', '3 hour', '3 day']) - idx.fillna(pd.Timedelta('3 hour')) - - exp = pd.Index( - [pd.Timedelta('1 day'), 'x', pd.Timedelta('3 day')], dtype=object) - self.assert_index_equal(idx.fillna('x'), exp) - - def test_difference_of_union(self): - # GH14323: Test taking the union of differences of an Index. - # Difference of TimedeltaIndex does not preserve frequency, - # so a differencing operation should not retain the freq field of the - # original index. 
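To contrast with the PeriodIndex case above, a minimal sketch using only the public pandas API (behavior as asserted by these tests): a Period's frequency is part of each element, so set operations must preserve it, while a TimedeltaIndex `freq` only records regular spacing and is dropped once that spacing breaks.

```python
import pandas as pd

p = pd.period_range("20160920", "20160925", freq="D")
p_diff = p.difference(pd.period_range("20160921", "20160924", freq="D"))
print(p_diff.freqstr)  # 'D' -- Period frequency survives the difference

t = pd.timedelta_range("0 days", "5 days", freq="D")
t_diff = t.difference(pd.timedelta_range("1 days", "4 days", freq="D"))
print(t_diff.freq)     # None -- the spacing is now irregular
```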
- i = pd.timedelta_range("0 days", "5 days", freq="D") - - a = pd.timedelta_range("1 days", "4 days", freq="D") - expected = pd.TimedeltaIndex(["0 days", "5 days"], freq=None) - a_diff = i.difference(a) - tm.assert_index_equal(a_diff, expected) - tm.assert_attr_equal('freq', a_diff, expected) - - b = pd.timedelta_range("2 days", "5 days", freq="D") - b_diff = i.difference(b) - expected = pd.TimedeltaIndex(["0 days", "1 days"], freq=None) - tm.assert_index_equal(b_diff, expected) - tm.assert_attr_equal('freq', b_diff, expected) - - union_of_difference = a_diff.union(b_diff) - expected = pd.TimedeltaIndex(["0 days", "1 days", "5 days"], - freq=None) - tm.assert_index_equal(union_of_difference, expected) - tm.assert_attr_equal('freq', union_of_difference, expected) diff --git a/pandas/tests/indexes/test_frozen.py b/pandas/tests/indexes/test_frozen.py new file mode 100644 index 0000000000000..ca9841112b1d5 --- /dev/null +++ b/pandas/tests/indexes/test_frozen.py @@ -0,0 +1,71 @@ +import numpy as np +from pandas.util import testing as tm +from pandas.tests.test_base import CheckImmutable, CheckStringMixin +from pandas.core.indexes.frozen import FrozenList, FrozenNDArray +from pandas.compat import u + + +class TestFrozenList(CheckImmutable, CheckStringMixin): + mutable_methods = ('extend', 'pop', 'remove', 'insert') + unicode_container = FrozenList([u("\u05d0"), u("\u05d1"), "c"]) + + def setup_method(self, method): + self.lst = [1, 2, 3, 4, 5] + self.container = FrozenList(self.lst) + self.klass = FrozenList + + def test_add(self): + result = self.container + (1, 2, 3) + expected = FrozenList(self.lst + [1, 2, 3]) + self.check_result(result, expected) + + result = (1, 2, 3) + self.container + expected = FrozenList([1, 2, 3] + self.lst) + self.check_result(result, expected) + + def test_inplace(self): + q = r = self.container + q += [5] + self.check_result(q, self.lst + [5]) + # other shouldn't be mutated + self.check_result(r, self.lst) + + +class TestFrozenNDArray(CheckImmutable, CheckStringMixin): + mutable_methods = ('put', 'itemset', 'fill') + unicode_container = FrozenNDArray([u("\u05d0"), u("\u05d1"), "c"]) + + def setup_method(self, method): + self.lst = [3, 5, 7, -2] + self.container = FrozenNDArray(self.lst) + self.klass = FrozenNDArray + + def test_shallow_copying(self): + original = self.container.copy() + assert isinstance(self.container.view(), FrozenNDArray) + assert not isinstance(self.container.view(np.ndarray), FrozenNDArray) + assert self.container.view() is not self.container + tm.assert_numpy_array_equal(self.container, original) + + # Shallow copy should be the same too + assert isinstance(self.container._shallow_copy(), FrozenNDArray) + + # setting should not be allowed + def testit(container): + container[0] = 16 + + self.check_mutable_error(testit, self.container) + + def test_values(self): + original = self.container.view(np.ndarray).copy() + n = original[0] + 15 + + vals = self.container.values() + tm.assert_numpy_array_equal(original, vals) + + assert original is not vals + vals[0] = n + + assert isinstance(self.container, FrozenNDArray) + tm.assert_numpy_array_equal(self.container.values(), original) + assert vals[0] == n diff --git a/pandas/tests/indexes/test_interval.py b/pandas/tests/indexes/test_interval.py new file mode 100644 index 0000000000000..33745017fe3d6 --- /dev/null +++ b/pandas/tests/indexes/test_interval.py @@ -0,0 +1,804 @@ +from __future__ import division + +import pytest +import numpy as np + +from pandas import (Interval, IntervalIndex, Index, 
isnull, + interval_range, Timestamp, Timedelta, + compat) +from pandas._libs.interval import IntervalTree +from pandas.tests.indexes.common import Base +import pandas.util.testing as tm +import pandas as pd + + +class TestIntervalIndex(Base): + _holder = IntervalIndex + + def setup_method(self, method): + self.index = IntervalIndex.from_arrays([0, 1], [1, 2]) + self.index_with_nan = IntervalIndex.from_tuples( + [(0, 1), np.nan, (1, 2)]) + self.indices = dict(intervalIndex=tm.makeIntervalIndex(10)) + + def create_index(self): + return IntervalIndex.from_breaks(np.arange(10)) + + def test_constructors(self): + expected = self.index + actual = IntervalIndex.from_breaks(np.arange(3), closed='right') + assert expected.equals(actual) + + alternate = IntervalIndex.from_breaks(np.arange(3), closed='left') + assert not expected.equals(alternate) + + actual = IntervalIndex.from_intervals([Interval(0, 1), Interval(1, 2)]) + assert expected.equals(actual) + + actual = IntervalIndex([Interval(0, 1), Interval(1, 2)]) + assert expected.equals(actual) + + actual = IntervalIndex.from_arrays(np.arange(2), np.arange(2) + 1, + closed='right') + assert expected.equals(actual) + + actual = Index([Interval(0, 1), Interval(1, 2)]) + assert isinstance(actual, IntervalIndex) + assert expected.equals(actual) + + actual = Index(expected) + assert isinstance(actual, IntervalIndex) + assert expected.equals(actual) + + def test_constructors_other(self): + + # all-nan + result = IntervalIndex.from_intervals([np.nan]) + expected = np.array([np.nan], dtype=object) + tm.assert_numpy_array_equal(result.values, expected) + + # empty + result = IntervalIndex.from_intervals([]) + expected = np.array([], dtype=object) + tm.assert_numpy_array_equal(result.values, expected) + + def test_constructors_errors(self): + + # scalar + with pytest.raises(TypeError): + IntervalIndex(5) + + # not an interval + with pytest.raises(TypeError): + IntervalIndex([0, 1]) + + with pytest.raises(TypeError): + IntervalIndex.from_intervals([0, 1]) + + # invalid closed + with pytest.raises(ValueError): + IntervalIndex.from_arrays([0, 1], [1, 2], closed='invalid') + + # mismatched closed + with pytest.raises(ValueError): + IntervalIndex.from_intervals([Interval(0, 1), + Interval(1, 2, closed='left')]) + + with pytest.raises(ValueError): + IntervalIndex.from_arrays([0, 10], [3, 5]) + + with pytest.raises(ValueError): + Index([Interval(0, 1), Interval(2, 3, closed='left')]) + + # no point in nesting periods in an IntervalIndex + with pytest.raises(ValueError): + IntervalIndex.from_breaks( + pd.period_range('2000-01-01', periods=3)) + + def test_constructors_datetimelike(self): + + # DTI / TDI + for idx in [pd.date_range('20130101', periods=5), + pd.timedelta_range('1 day', periods=5)]: + result = IntervalIndex.from_breaks(idx) + expected = IntervalIndex.from_breaks(idx.values) + tm.assert_index_equal(result, expected) + + expected_scalar_type = type(idx[0]) + i = result[0] + assert isinstance(i.left, expected_scalar_type) + assert isinstance(i.right, expected_scalar_type) + + def test_constructors_error(self): + + # non-intervals + def f(): + IntervalIndex.from_intervals([0.997, 4.0]) + pytest.raises(TypeError, f) + + def test_properties(self): + index = self.index + assert len(index) == 2 + assert index.size == 2 + assert index.shape == (2, ) + + tm.assert_index_equal(index.left, Index([0, 1])) + tm.assert_index_equal(index.right, Index([1, 2])) + tm.assert_index_equal(index.mid, Index([0.5, 1.5])) + + assert index.closed == 'right' + + expected = 
np.array([Interval(0, 1), Interval(1, 2)], dtype=object) + tm.assert_numpy_array_equal(np.asarray(index), expected) + tm.assert_numpy_array_equal(index.values, expected) + + # with nans + index = self.index_with_nan + assert len(index) == 3 + assert index.size == 3 + assert index.shape == (3, ) + + tm.assert_index_equal(index.left, Index([0, np.nan, 1])) + tm.assert_index_equal(index.right, Index([1, np.nan, 2])) + tm.assert_index_equal(index.mid, Index([0.5, np.nan, 1.5])) + + assert index.closed == 'right' + + expected = np.array([Interval(0, 1), np.nan, + Interval(1, 2)], dtype=object) + tm.assert_numpy_array_equal(np.asarray(index), expected) + tm.assert_numpy_array_equal(index.values, expected) + + def test_with_nans(self): + index = self.index + assert not index.hasnans + tm.assert_numpy_array_equal(index.isnull(), + np.array([False, False])) + tm.assert_numpy_array_equal(index.notnull(), + np.array([True, True])) + + index = self.index_with_nan + assert index.hasnans + tm.assert_numpy_array_equal(index.notnull(), + np.array([True, False, True])) + tm.assert_numpy_array_equal(index.isnull(), + np.array([False, True, False])) + + def test_copy(self): + actual = self.index.copy() + assert actual.equals(self.index) + + actual = self.index.copy(deep=True) + assert actual.equals(self.index) + assert actual.left is not self.index.left + + def test_ensure_copied_data(self): + # exercise the copy flag in the constructor + + # not copying + index = self.index + result = IntervalIndex(index, copy=False) + tm.assert_numpy_array_equal(index.left.values, result.left.values, + check_same='same') + tm.assert_numpy_array_equal(index.right.values, result.right.values, + check_same='same') + + # by-definition make a copy + result = IntervalIndex.from_intervals(index.values, copy=False) + tm.assert_numpy_array_equal(index.left.values, result.left.values, + check_same='copy') + tm.assert_numpy_array_equal(index.right.values, result.right.values, + check_same='copy') + + def test_equals(self): + + idx = self.index + assert idx.equals(idx) + assert idx.equals(idx.copy()) + + assert not idx.equals(idx.astype(object)) + assert not idx.equals(np.array(idx)) + assert not idx.equals(list(idx)) + + assert not idx.equals([1, 2]) + assert not idx.equals(np.array([1, 2])) + assert not idx.equals(pd.date_range('20130101', periods=2)) + + def test_astype(self): + + idx = self.index + + for dtype in [np.int64, np.float64, 'datetime64[ns]', + 'datetime64[ns, US/Eastern]', 'timedelta64', + 'period[M]']: + pytest.raises(ValueError, idx.astype, dtype) + + result = idx.astype(object) + tm.assert_index_equal(result, Index(idx.values, dtype='object')) + assert not idx.equals(result) + assert idx.equals(IntervalIndex.from_intervals(result)) + + result = idx.astype('interval') + tm.assert_index_equal(result, idx) + assert result.equals(idx) + + result = idx.astype('category') + expected = pd.Categorical(idx, ordered=True) + tm.assert_categorical_equal(result, expected) + + def test_where(self): + expected = self.index + result = self.index.where(self.index.notnull()) + tm.assert_index_equal(result, expected) + + idx = IntervalIndex.from_breaks([1, 2]) + result = idx.where([True, False]) + expected = IntervalIndex.from_intervals( + [Interval(1.0, 2.0, closed='right'), np.nan]) + tm.assert_index_equal(result, expected) + + def test_where_array_like(self): + pass + + def test_delete(self): + expected = IntervalIndex.from_breaks([1, 2]) + actual = self.index.delete(0) + assert expected.equals(actual) + + def test_insert(self): + 
expected = IntervalIndex.from_breaks(range(4)) + actual = self.index.insert(2, Interval(2, 3)) + assert expected.equals(actual) + + pytest.raises(ValueError, self.index.insert, 0, 1) + pytest.raises(ValueError, self.index.insert, 0, + Interval(2, 3, closed='left')) + + def test_take(self): + actual = self.index.take([0, 1]) + assert self.index.equals(actual) + + expected = IntervalIndex.from_arrays([0, 0, 1], [1, 1, 2]) + actual = self.index.take([0, 0, 1]) + assert expected.equals(actual) + + def test_monotonic_and_unique(self): + assert self.index.is_monotonic + assert self.index.is_unique + + idx = IntervalIndex.from_tuples([(0, 1), (0.5, 1.5)]) + assert idx.is_monotonic + assert idx.is_unique + + idx = IntervalIndex.from_tuples([(0, 1), (2, 3), (1, 2)]) + assert not idx.is_monotonic + assert idx.is_unique + + idx = IntervalIndex.from_tuples([(0, 2), (0, 2)]) + assert not idx.is_unique + assert idx.is_monotonic + + @pytest.mark.xfail(reason='not a valid repr as we use interval notation') + def test_repr(self): + i = IntervalIndex.from_tuples([(0, 1), (1, 2)], closed='right') + expected = ("IntervalIndex(left=[0, 1]," + "\n right=[1, 2]," + "\n closed='right'," + "\n dtype='interval[int64]')") + assert repr(i) == expected + + i = IntervalIndex.from_tuples((Timestamp('20130101'), + Timestamp('20130102')), + (Timestamp('20130102'), + Timestamp('20130103')), + closed='right') + expected = ("IntervalIndex(left=['2013-01-01', '2013-01-02']," + "\n right=['2013-01-02', '2013-01-03']," + "\n closed='right'," + "\n dtype='interval[datetime64[ns]]')") + assert repr(i) == expected + + @pytest.mark.xfail(reason='not a valid repr as we use interval notation') + def test_repr_max_seq_item_setting(self): + super(TestIntervalIndex, self).test_repr_max_seq_item_setting() + + @pytest.mark.xfail(reason='not a valid repr as we use interval notation') + def test_repr_roundtrip(self): + super(TestIntervalIndex, self).test_repr_roundtrip() + + def test_get_item(self): + i = IntervalIndex.from_arrays((0, 1, np.nan), (1, 2, np.nan), + closed='right') + assert i[0] == Interval(0.0, 1.0) + assert i[1] == Interval(1.0, 2.0) + assert isnull(i[2]) + + result = i[0:1] + expected = IntervalIndex.from_arrays((0.,), (1.,), closed='right') + tm.assert_index_equal(result, expected) + + result = i[0:2] + expected = IntervalIndex.from_arrays((0., 1), (1., 2.), closed='right') + tm.assert_index_equal(result, expected) + + result = i[1:3] + expected = IntervalIndex.from_arrays((1., np.nan), (2., np.nan), + closed='right') + tm.assert_index_equal(result, expected) + + def test_get_loc_value(self): + pytest.raises(KeyError, self.index.get_loc, 0) + assert self.index.get_loc(0.5) == 0 + assert self.index.get_loc(1) == 0 + assert self.index.get_loc(1.5) == 1 + assert self.index.get_loc(2) == 1 + pytest.raises(KeyError, self.index.get_loc, -1) + pytest.raises(KeyError, self.index.get_loc, 3) + + idx = IntervalIndex.from_tuples([(0, 2), (1, 3)]) + assert idx.get_loc(0.5) == 0 + assert idx.get_loc(1) == 0 + tm.assert_numpy_array_equal(idx.get_loc(1.5), + np.array([0, 1], dtype='int64')) + tm.assert_numpy_array_equal(np.sort(idx.get_loc(2)), + np.array([0, 1], dtype='int64')) + assert idx.get_loc(3) == 1 + pytest.raises(KeyError, idx.get_loc, 3.5) + + idx = IntervalIndex.from_arrays([0, 2], [1, 3]) + pytest.raises(KeyError, idx.get_loc, 1.5) + + def slice_locs_cases(self, breaks): + # TODO: same tests for more index types + index = IntervalIndex.from_breaks([0, 1, 2], closed='right') + assert index.slice_locs() == (0, 2) + assert 
index.slice_locs(0, 1) == (0, 1)
+        assert index.slice_locs(1, 1) == (0, 1)
+        assert index.slice_locs(0, 2) == (0, 2)
+        assert index.slice_locs(0.5, 1.5) == (0, 2)
+        assert index.slice_locs(0, 0.5) == (0, 1)
+        assert index.slice_locs(start=1) == (0, 2)
+        assert index.slice_locs(start=1.2) == (1, 2)
+        assert index.slice_locs(end=1) == (0, 1)
+        assert index.slice_locs(end=1.1) == (0, 2)
+        assert index.slice_locs(end=1.0) == (0, 1)
+        assert index.slice_locs(-1, -1) == (0, 0)
+
+        index = IntervalIndex.from_breaks([0, 1, 2], closed='neither')
+        assert index.slice_locs(0, 1) == (0, 1)
+        assert index.slice_locs(0, 2) == (0, 2)
+        assert index.slice_locs(0.5, 1.5) == (0, 2)
+        assert index.slice_locs(1, 1) == (1, 1)
+        assert index.slice_locs(1, 2) == (1, 2)
+
+        index = IntervalIndex.from_breaks([0, 1, 2], closed='both')
+        assert index.slice_locs(1, 1) == (0, 2)
+        assert index.slice_locs(1, 2) == (0, 2)
+
+    def test_slice_locs_int64(self):
+        self.slice_locs_cases([0, 1, 2])
+
+    def test_slice_locs_float64(self):
+        self.slice_locs_cases([0.0, 1.0, 2.0])
+
+    def slice_locs_decreasing_cases(self, tuples):
+        index = IntervalIndex.from_tuples(tuples)
+        assert index.slice_locs(1.5, 0.5) == (1, 3)
+        assert index.slice_locs(2, 0) == (1, 3)
+        assert index.slice_locs(2, 1) == (1, 3)
+        assert index.slice_locs(3, 1.1) == (0, 3)
+        assert index.slice_locs(3, 3) == (0, 2)
+        assert index.slice_locs(3.5, 3.3) == (0, 1)
+        assert index.slice_locs(1, -3) == (2, 3)
+
+        slice_locs = index.slice_locs(-1, -1)
+        assert slice_locs[0] == slice_locs[1]
+
+    def test_slice_locs_decreasing_int64(self):
+        self.slice_locs_decreasing_cases([(2, 4), (1, 3), (0, 2)])
+
+    def test_slice_locs_decreasing_float64(self):
+        self.slice_locs_decreasing_cases([(2., 4.), (1., 3.), (0., 2.)])
+
+    def test_slice_locs_fails(self):
+        index = IntervalIndex.from_tuples([(1, 2), (0, 1), (2, 3)])
+        with pytest.raises(KeyError):
+            index.slice_locs(1, 2)
+
+    def test_get_loc_interval(self):
+        assert self.index.get_loc(Interval(0, 1)) == 0
+        assert self.index.get_loc(Interval(0, 0.5)) == 0
+        assert self.index.get_loc(Interval(0, 1, 'left')) == 0
+        pytest.raises(KeyError, self.index.get_loc, Interval(2, 3))
+        pytest.raises(KeyError, self.index.get_loc,
+                      Interval(-1, 0, 'left'))
+
+    def test_get_indexer(self):
+        actual = self.index.get_indexer([-1, 0, 0.5, 1, 1.5, 2, 3])
+        expected = np.array([-1, -1, 0, 0, 1, 1, -1], dtype='intp')
+        tm.assert_numpy_array_equal(actual, expected)
+
+        actual = self.index.get_indexer(self.index)
+        expected = np.array([0, 1], dtype='intp')
+        tm.assert_numpy_array_equal(actual, expected)
+
+        index = IntervalIndex.from_breaks([0, 1, 2], closed='left')
+        actual = index.get_indexer([-1, 0, 0.5, 1, 1.5, 2, 3])
+        expected = np.array([-1, 0, 0, 1, 1, -1, -1], dtype='intp')
+        tm.assert_numpy_array_equal(actual, expected)
+
+        actual = self.index.get_indexer(index[:1])
+        expected = np.array([0], dtype='intp')
+        tm.assert_numpy_array_equal(actual, expected)
+
+        actual = self.index.get_indexer(index)
+        expected = np.array([-1, 1], dtype='intp')
+        tm.assert_numpy_array_equal(actual, expected)
+
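A minimal sketch of the matching rule `test_get_indexer` exercises, using only the public pandas API (expected positions taken from the assertions above): each target point maps to the position of the interval containing it, honoring the closed side, with `-1` for points no interval covers.

```python
import pandas as pd

idx = pd.IntervalIndex.from_arrays([0, 1], [1, 2])  # (0, 1], (1, 2]
print(idx.get_indexer([-1, 0.5, 1, 1.5, 3]))
# [-1  0  0  1 -1]: the point 1 falls in (0, 1] because these intervals
# are closed on the right; -1 marks points outside every interval
```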
+    def test_get_indexer_subintervals(self):
+
+        # TODO: is this right?
+        # return indexers for wholly contained subintervals
+        target = IntervalIndex.from_breaks(np.linspace(0, 2, 5))
+        actual = self.index.get_indexer(target)
+        expected = np.array([0, 0, 1, 1], dtype='intp')
+        tm.assert_numpy_array_equal(actual, expected)
+
+        target = IntervalIndex.from_breaks([0, 0.67, 1.33, 2])
+        actual = self.index.get_indexer(target)
+        expected = np.array([0, 0, 1, 1], dtype='intp')
+        tm.assert_numpy_array_equal(actual, expected)
+
+        actual = self.index.get_indexer(target[[0, -1]])
+        expected = np.array([0, 1], dtype='intp')
+        tm.assert_numpy_array_equal(actual, expected)
+
+        target = IntervalIndex.from_breaks([0, 0.33, 0.67, 1], closed='left')
+        actual = self.index.get_indexer(target)
+        expected = np.array([0, 0, 0], dtype='intp')
+        tm.assert_numpy_array_equal(actual, expected)
+
+    def test_contains(self):
+        # __contains__ accepts Interval objects, not scalars
+        # (not even exact endpoints)
+        i = IntervalIndex.from_arrays([0, 1], [1, 2])
+
+        # Invalid
+        assert 0 not in i
+        assert 1 not in i
+        assert 2 not in i
+
+        # Valid
+        assert Interval(0, 1) in i
+        assert Interval(0, 2) in i
+        assert Interval(0, 0.5) in i
+        assert Interval(3, 5) not in i
+        assert Interval(-1, 0, closed='left') not in i
+
+    def testcontains(self):
+        # can select values that fall within the range of an interval
+        i = IntervalIndex.from_arrays([0, 1], [1, 2])
+
+        assert i.contains(0.1)
+        assert i.contains(0.5)
+        assert i.contains(1)
+        assert i.contains(Interval(0, 1))
+        assert i.contains(Interval(0, 2))
+
+        # these overlap completely
+        assert i.contains(Interval(0, 3))
+        assert i.contains(Interval(1, 3))
+
+        assert not i.contains(20)
+        assert not i.contains(-20)
+
+    def test_dropna(self):
+
+        expected = IntervalIndex.from_tuples([(0.0, 1.0), (1.0, 2.0)])
+
+        ii = IntervalIndex.from_tuples([(0, 1), (1, 2), np.nan])
+        result = ii.dropna()
+        tm.assert_index_equal(result, expected)
+
+        ii = IntervalIndex.from_arrays([0, 1, np.nan], [1, 2, np.nan])
+        result = ii.dropna()
+        tm.assert_index_equal(result, expected)
+
+    def test_non_contiguous(self):
+        index = IntervalIndex.from_tuples([(0, 1), (2, 3)])
+        target = [0.5, 1.5, 2.5]
+        actual = index.get_indexer(target)
+        expected = np.array([0, -1, 1], dtype='intp')
+        tm.assert_numpy_array_equal(actual, expected)
+
+        assert 1.5 not in index
+
+    def test_union(self):
+        other = IntervalIndex.from_arrays([2], [3])
+        expected = IntervalIndex.from_arrays(range(3), range(1, 4))
+        actual = self.index.union(other)
+        assert expected.equals(actual)
+
+        actual = other.union(self.index)
+        assert expected.equals(actual)
+
+        tm.assert_index_equal(self.index.union(self.index), self.index)
+        tm.assert_index_equal(self.index.union(self.index[:1]),
+                              self.index)
+
+    def test_intersection(self):
+        other = IntervalIndex.from_breaks([1, 2, 3])
+        expected = IntervalIndex.from_breaks([1, 2])
+        actual = self.index.intersection(other)
+        assert expected.equals(actual)
+
+        tm.assert_index_equal(self.index.intersection(self.index),
+                              self.index)
+
+    def test_difference(self):
+        tm.assert_index_equal(self.index.difference(self.index[:1]),
+                              self.index[1:])
+
+    def test_symmetric_difference(self):
+        result = self.index[:1].symmetric_difference(self.index[1:])
+        expected = self.index
+        tm.assert_index_equal(result, expected)
+
+    def test_set_operation_errors(self):
+        pytest.raises(ValueError, self.index.union, self.index.left)
+
+        other = IntervalIndex.from_breaks([0, 1, 2], closed='neither')
+        pytest.raises(ValueError, self.index.union, other)
+
+    def test_isin(self):
+        actual = self.index.isin(self.index)
+        tm.assert_numpy_array_equal(np.array([True, True]),
actual) + + actual = self.index.isin(self.index[:1]) + tm.assert_numpy_array_equal(np.array([True, False]), actual) + + def test_comparison(self): + actual = Interval(0, 1) < self.index + expected = np.array([False, True]) + tm.assert_numpy_array_equal(actual, expected) + + actual = Interval(0.5, 1.5) < self.index + expected = np.array([False, True]) + tm.assert_numpy_array_equal(actual, expected) + actual = self.index > Interval(0.5, 1.5) + tm.assert_numpy_array_equal(actual, expected) + + actual = self.index == self.index + expected = np.array([True, True]) + tm.assert_numpy_array_equal(actual, expected) + actual = self.index <= self.index + tm.assert_numpy_array_equal(actual, expected) + actual = self.index >= self.index + tm.assert_numpy_array_equal(actual, expected) + + actual = self.index < self.index + expected = np.array([False, False]) + tm.assert_numpy_array_equal(actual, expected) + actual = self.index > self.index + tm.assert_numpy_array_equal(actual, expected) + + actual = self.index == IntervalIndex.from_breaks([0, 1, 2], 'left') + tm.assert_numpy_array_equal(actual, expected) + + actual = self.index == self.index.values + tm.assert_numpy_array_equal(actual, np.array([True, True])) + actual = self.index.values == self.index + tm.assert_numpy_array_equal(actual, np.array([True, True])) + actual = self.index <= self.index.values + tm.assert_numpy_array_equal(actual, np.array([True, True])) + actual = self.index != self.index.values + tm.assert_numpy_array_equal(actual, np.array([False, False])) + actual = self.index > self.index.values + tm.assert_numpy_array_equal(actual, np.array([False, False])) + actual = self.index.values > self.index + tm.assert_numpy_array_equal(actual, np.array([False, False])) + + # invalid comparisons + actual = self.index == 0 + tm.assert_numpy_array_equal(actual, np.array([False, False])) + actual = self.index == self.index.left + tm.assert_numpy_array_equal(actual, np.array([False, False])) + + with tm.assert_raises_regex(TypeError, 'unorderable types'): + self.index > 0 + with tm.assert_raises_regex(TypeError, 'unorderable types'): + self.index <= 0 + with pytest.raises(TypeError): + self.index > np.arange(2) + with pytest.raises(ValueError): + self.index > np.arange(3) + + def test_missing_values(self): + idx = pd.Index([np.nan, pd.Interval(0, 1), pd.Interval(1, 2)]) + idx2 = pd.IntervalIndex.from_arrays([np.nan, 0, 1], [np.nan, 1, 2]) + assert idx.equals(idx2) + + with pytest.raises(ValueError): + IntervalIndex.from_arrays([np.nan, 0, 1], np.array([0, 1, 2])) + + tm.assert_numpy_array_equal(isnull(idx), + np.array([True, False, False])) + + def test_sort_values(self): + expected = IntervalIndex.from_breaks([1, 2, 3, 4]) + actual = IntervalIndex.from_tuples([(3, 4), (1, 2), + (2, 3)]).sort_values() + tm.assert_index_equal(expected, actual) + + # nan + idx = self.index_with_nan + mask = idx.isnull() + tm.assert_numpy_array_equal(mask, np.array([False, True, False])) + + result = idx.sort_values() + mask = result.isnull() + tm.assert_numpy_array_equal(mask, np.array([False, False, True])) + + result = idx.sort_values(ascending=False) + mask = result.isnull() + tm.assert_numpy_array_equal(mask, np.array([True, False, False])) + + def test_datetime(self): + dates = pd.date_range('2000', periods=3) + idx = IntervalIndex.from_breaks(dates) + + tm.assert_index_equal(idx.left, dates[:2]) + tm.assert_index_equal(idx.right, dates[-2:]) + + expected = pd.date_range('2000-01-01T12:00', periods=2) + tm.assert_index_equal(idx.mid, expected) + + assert 
pd.Timestamp('2000-01-01T12') not in idx + assert pd.Timestamp('2000-01-01T12') not in idx + + target = pd.date_range('1999-12-31T12:00', periods=7, freq='12H') + actual = idx.get_indexer(target) + + expected = np.array([-1, -1, 0, 0, 1, 1, -1], dtype='intp') + tm.assert_numpy_array_equal(actual, expected) + + def test_append(self): + + index1 = IntervalIndex.from_arrays([0, 1], [1, 2]) + index2 = IntervalIndex.from_arrays([1, 2], [2, 3]) + + result = index1.append(index2) + expected = IntervalIndex.from_arrays([0, 1, 1, 2], [1, 2, 2, 3]) + tm.assert_index_equal(result, expected) + + result = index1.append([index1, index2]) + expected = IntervalIndex.from_arrays([0, 1, 0, 1, 1, 2], + [1, 2, 1, 2, 2, 3]) + tm.assert_index_equal(result, expected) + + def f(): + index1.append(IntervalIndex.from_arrays([0, 1], [1, 2], + closed='both')) + + pytest.raises(ValueError, f) + + +class TestIntervalRange(object): + + def test_construction(self): + result = interval_range(0, 5, name='foo', closed='both') + expected = IntervalIndex.from_breaks( + np.arange(0, 5), name='foo', closed='both') + tm.assert_index_equal(result, expected) + + def test_errors(self): + + # not enough params + def f(): + interval_range(0) + + pytest.raises(ValueError, f) + + def f(): + interval_range(periods=2) + + pytest.raises(ValueError, f) + + def f(): + interval_range() + + pytest.raises(ValueError, f) + + # mixed units + def f(): + interval_range(0, Timestamp('20130101'), freq=2) + + pytest.raises(ValueError, f) + + def f(): + interval_range(0, 10, freq=Timedelta('1day')) + + pytest.raises(ValueError, f) + + +class TestIntervalTree(object): + def setup_method(self, method): + gentree = lambda dtype: IntervalTree(np.arange(5, dtype=dtype), + np.arange(5, dtype=dtype) + 2) + self.tree = gentree('int64') + self.trees = {dtype: gentree(dtype) + for dtype in ['int32', 'int64', 'float32', 'float64']} + + def test_get_loc(self): + for dtype, tree in self.trees.items(): + tm.assert_numpy_array_equal(tree.get_loc(1), + np.array([0], dtype='int64')) + tm.assert_numpy_array_equal(np.sort(tree.get_loc(2)), + np.array([0, 1], dtype='int64')) + with pytest.raises(KeyError): + tree.get_loc(-1) + + def test_get_indexer(self): + for dtype, tree in self.trees.items(): + tm.assert_numpy_array_equal( + tree.get_indexer(np.array([1.0, 5.5, 6.5])), + np.array([0, 4, -1], dtype='int64')) + with pytest.raises(KeyError): + tree.get_indexer(np.array([3.0])) + + def test_get_indexer_non_unique(self): + indexer, missing = self.tree.get_indexer_non_unique( + np.array([1.0, 2.0, 6.5])) + tm.assert_numpy_array_equal(indexer[:1], + np.array([0], dtype='int64')) + tm.assert_numpy_array_equal(np.sort(indexer[1:3]), + np.array([0, 1], dtype='int64')) + tm.assert_numpy_array_equal(np.sort(indexer[3:]), + np.array([-1], dtype='int64')) + tm.assert_numpy_array_equal(missing, np.array([2], dtype='int64')) + + def test_duplicates(self): + tree = IntervalTree([0, 0, 0], [1, 1, 1]) + tm.assert_numpy_array_equal(np.sort(tree.get_loc(0.5)), + np.array([0, 1, 2], dtype='int64')) + + with pytest.raises(KeyError): + tree.get_indexer(np.array([0.5])) + + indexer, missing = tree.get_indexer_non_unique(np.array([0.5])) + tm.assert_numpy_array_equal(np.sort(indexer), + np.array([0, 1, 2], dtype='int64')) + tm.assert_numpy_array_equal(missing, np.array([], dtype='int64')) + + def test_get_loc_closed(self): + for closed in ['left', 'right', 'both', 'neither']: + tree = IntervalTree([0], [1], closed=closed) + for p, errors in [(0, tree.open_left), + (1, tree.open_right)]: + if 
errors: + with pytest.raises(KeyError): + tree.get_loc(p) + else: + tm.assert_numpy_array_equal(tree.get_loc(p), + np.array([0], dtype='int64')) + + @pytest.mark.skipif(compat.is_platform_32bit(), + reason="int type mismatch on 32bit") + def test_get_indexer_closed(self): + x = np.arange(1000, dtype='float64') + found = x.astype('intp') + not_found = (-1 * np.ones(1000)).astype('intp') + + for leaf_size in [1, 10, 100, 10000]: + for closed in ['left', 'right', 'both', 'neither']: + tree = IntervalTree(x, x + 0.5, closed=closed, + leaf_size=leaf_size) + tm.assert_numpy_array_equal(found, + tree.get_indexer(x + 0.25)) + + expected = found if tree.closed_left else not_found + tm.assert_numpy_array_equal(expected, + tree.get_indexer(x + 0.0)) + + expected = found if tree.closed_right else not_found + tm.assert_numpy_array_equal(expected, + tree.get_indexer(x + 0.5)) diff --git a/pandas/tests/indexes/test_multi.py b/pandas/tests/indexes/test_multi.py index 61a4ea53f06fb..1fe4d85815c4b 100644 --- a/pandas/tests/indexes/test_multi.py +++ b/pandas/tests/indexes/test_multi.py @@ -1,36 +1,37 @@ # -*- coding: utf-8 -*- -from datetime import timedelta -from itertools import product -import nose import re import warnings -from pandas import (DataFrame, date_range, period_range, MultiIndex, Index, - CategoricalIndex, compat) -from pandas.core.common import PerformanceWarning -from pandas.indexes.base import InvalidIndexError -from pandas.compat import range, lrange, u, PY3, long, lzip +from datetime import timedelta +from itertools import product + +import pytest import numpy as np -from pandas.util.testing import (assert_almost_equal, assertRaises, - assertRaisesRegexp, assert_copy) +import pandas as pd + +from pandas import (CategoricalIndex, DataFrame, Index, MultiIndex, + compat, date_range, period_range) +from pandas.compat import PY3, long, lrange, lzip, range, u +from pandas.errors import PerformanceWarning, UnsortedIndexError +from pandas.core.indexes.base import InvalidIndexError +from pandas._libs import lib +from pandas._libs.lib import Timestamp import pandas.util.testing as tm -import pandas as pd -from pandas.lib import Timestamp +from pandas.util.testing import assert_almost_equal, assert_copy from .common import Base -class TestMultiIndex(Base, tm.TestCase): +class TestMultiIndex(Base): _holder = MultiIndex - _multiprocess_can_split_ = True _compat_props = ['shape', 'ndim', 'size', 'itemsize'] - def setUp(self): + def setup_method(self, method): major_axis = Index(['foo', 'bar', 'baz', 'qux']) minor_axis = Index(['one', 'two']) @@ -58,25 +59,25 @@ def f(): if common: pass - tm.assertRaisesRegexp(ValueError, 'The truth value of a', f) + tm.assert_raises_regex(ValueError, 'The truth value of a', f) def test_labels_dtypes(self): # GH 8456 i = MultiIndex.from_tuples([('A', 1), ('A', 2)]) - self.assertTrue(i.labels[0].dtype == 'int8') - self.assertTrue(i.labels[1].dtype == 'int8') + assert i.labels[0].dtype == 'int8' + assert i.labels[1].dtype == 'int8' i = MultiIndex.from_product([['a'], range(40)]) - self.assertTrue(i.labels[1].dtype == 'int8') + assert i.labels[1].dtype == 'int8' i = MultiIndex.from_product([['a'], range(400)]) - self.assertTrue(i.labels[1].dtype == 'int16') + assert i.labels[1].dtype == 'int16' i = MultiIndex.from_product([['a'], range(40000)]) - self.assertTrue(i.labels[1].dtype == 'int32') + assert i.labels[1].dtype == 'int32' i = pd.MultiIndex.from_product([['a'], range(1000)]) - self.assertTrue((i.labels[0] >= 0).all()) - self.assertTrue((i.labels[1] >= 0).all()) + 
assert (i.labels[0] >= 0).all() + assert (i.labels[1] >= 0).all() def test_where(self): i = MultiIndex.from_tuples([('A', 1), ('A', 2)]) @@ -84,7 +85,16 @@ def test_where(self): def f(): i.where(True) - self.assertRaises(NotImplementedError, f) + pytest.raises(NotImplementedError, f) + + def test_where_array_like(self): + i = MultiIndex.from_tuples([('A', 1), ('A', 2)]) + klasses = [list, tuple, np.array, pd.Series] + cond = [False, True] + + for klass in klasses: + f = lambda: i.where(klass(cond)) + pytest.raises(NotImplementedError, f) def test_repeat(self): reps = 2 @@ -97,6 +107,10 @@ def test_repeat(self): numbers, names.repeat(reps)], names=names) tm.assert_index_equal(m.repeat(reps), expected) + with tm.assert_produces_warning(FutureWarning): + result = m.repeat(n=reps) + tm.assert_index_equal(result, expected) + def test_numpy_repeat(self): reps = 2 numbers = [1, 2, 3] @@ -109,39 +123,40 @@ def test_numpy_repeat(self): tm.assert_index_equal(np.repeat(m, reps), expected) msg = "the 'axis' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.repeat, m, reps, axis=1) + tm.assert_raises_regex( + ValueError, msg, np.repeat, m, reps, axis=1) def test_set_name_methods(self): # so long as these are synonyms, we don't need to test set_names - self.assertEqual(self.index.rename, self.index.set_names) + assert self.index.rename == self.index.set_names new_names = [name + "SUFFIX" for name in self.index_names] ind = self.index.set_names(new_names) - self.assertEqual(self.index.names, self.index_names) - self.assertEqual(ind.names, new_names) - with assertRaisesRegexp(ValueError, "^Length"): + assert self.index.names == self.index_names + assert ind.names == new_names + with tm.assert_raises_regex(ValueError, "^Length"): ind.set_names(new_names + new_names) new_names2 = [name + "SUFFIX2" for name in new_names] res = ind.set_names(new_names2, inplace=True) - self.assertIsNone(res) - self.assertEqual(ind.names, new_names2) + assert res is None + assert ind.names == new_names2 # set names for specific level (# GH7792) ind = self.index.set_names(new_names[0], level=0) - self.assertEqual(self.index.names, self.index_names) - self.assertEqual(ind.names, [new_names[0], self.index_names[1]]) + assert self.index.names == self.index_names + assert ind.names == [new_names[0], self.index_names[1]] res = ind.set_names(new_names2[0], level=0, inplace=True) - self.assertIsNone(res) - self.assertEqual(ind.names, [new_names2[0], self.index_names[1]]) + assert res is None + assert ind.names == [new_names2[0], self.index_names[1]] # set names for multiple levels ind = self.index.set_names(new_names, level=[0, 1]) - self.assertEqual(self.index.names, self.index_names) - self.assertEqual(ind.names, new_names) + assert self.index.names == self.index_names + assert ind.names == new_names res = ind.set_names(new_names2, level=[0, 1], inplace=True) - self.assertIsNone(res) - self.assertEqual(ind.names, new_names2) + assert res is None + assert ind.names == new_names2 def test_set_levels(self): # side note - you probably wouldn't want to use levels and labels @@ -152,7 +167,7 @@ def test_set_levels(self): def assert_matching(actual, expected, check_dtype=False): # avoid specifying internal representation # as much as possible - self.assertEqual(len(actual), len(expected)) + assert len(actual) == len(expected) for act, exp in zip(actual, expected): act = np.asarray(act) exp = np.asarray(exp) @@ -166,7 +181,7 @@ def assert_matching(actual, expected, check_dtype=False): # level changing [w/ mutation] 
ind2 = self.index.copy() inplace_return = ind2.set_levels(new_levels, inplace=True) - self.assertIsNone(inplace_return) + assert inplace_return is None assert_matching(ind2.levels, new_levels) # level changing specific level [w/o mutation] @@ -186,13 +201,13 @@ def assert_matching(actual, expected, check_dtype=False): # level changing specific level [w/ mutation] ind2 = self.index.copy() inplace_return = ind2.set_levels(new_levels[0], level=0, inplace=True) - self.assertIsNone(inplace_return) + assert inplace_return is None assert_matching(ind2.levels, [new_levels[0], levels[1]]) assert_matching(self.index.levels, levels) ind2 = self.index.copy() inplace_return = ind2.set_levels(new_levels[1], level=1, inplace=True) - self.assertIsNone(inplace_return) + assert inplace_return is None assert_matching(ind2.levels, [levels[0], new_levels[1]]) assert_matching(self.index.levels, levels) @@ -200,7 +215,7 @@ def assert_matching(actual, expected, check_dtype=False): ind2 = self.index.copy() inplace_return = ind2.set_levels(new_levels, level=[0, 1], inplace=True) - self.assertIsNone(inplace_return) + assert inplace_return is None assert_matching(ind2.levels, new_levels) assert_matching(self.index.levels, levels) @@ -208,23 +223,23 @@ def assert_matching(actual, expected, check_dtype=False): # GH 13754 original_index = self.index.copy() for inplace in [True, False]: - with assertRaisesRegexp(ValueError, "^On"): + with tm.assert_raises_regex(ValueError, "^On"): self.index.set_levels(['c'], level=0, inplace=inplace) assert_matching(self.index.levels, original_index.levels, check_dtype=True) - with assertRaisesRegexp(ValueError, "^On"): + with tm.assert_raises_regex(ValueError, "^On"): self.index.set_labels([0, 1, 2, 3, 4, 5], level=0, inplace=inplace) assert_matching(self.index.labels, original_index.labels, check_dtype=True) - with assertRaisesRegexp(TypeError, "^Levels"): + with tm.assert_raises_regex(TypeError, "^Levels"): self.index.set_levels('c', level=0, inplace=inplace) assert_matching(self.index.levels, original_index.levels, check_dtype=True) - with assertRaisesRegexp(TypeError, "^Labels"): + with tm.assert_raises_regex(TypeError, "^Labels"): self.index.set_labels(1, level=0, inplace=inplace) assert_matching(self.index.labels, original_index.labels, check_dtype=True) @@ -241,7 +256,7 @@ def test_set_labels(self): def assert_matching(actual, expected): # avoid specifying internal representation # as much as possible - self.assertEqual(len(actual), len(expected)) + assert len(actual) == len(expected) for act, exp in zip(actual, expected): act = np.asarray(act) exp = np.asarray(exp, dtype=np.int8) @@ -255,7 +270,7 @@ def assert_matching(actual, expected): # label changing [w/ mutation] ind2 = self.index.copy() inplace_return = ind2.set_labels(new_labels, inplace=True) - self.assertIsNone(inplace_return) + assert inplace_return is None assert_matching(ind2.labels, new_labels) # label changing specific level [w/o mutation] @@ -275,13 +290,13 @@ def assert_matching(actual, expected): # label changing specific level [w/ mutation] ind2 = self.index.copy() inplace_return = ind2.set_labels(new_labels[0], level=0, inplace=True) - self.assertIsNone(inplace_return) + assert inplace_return is None assert_matching(ind2.labels, [new_labels[0], labels[1]]) assert_matching(self.index.labels, labels) ind2 = self.index.copy() inplace_return = ind2.set_labels(new_labels[1], level=1, inplace=True) - self.assertIsNone(inplace_return) + assert inplace_return is None assert_matching(ind2.labels, [labels[0], 
new_labels[1]]) assert_matching(self.index.labels, labels) @@ -289,7 +304,7 @@ def assert_matching(actual, expected): ind2 = self.index.copy() inplace_return = ind2.set_labels(new_labels, level=[0, 1], inplace=True) - self.assertIsNone(inplace_return) + assert inplace_return is None assert_matching(ind2.labels, new_labels) assert_matching(self.index.labels, labels) @@ -297,46 +312,46 @@ def test_set_levels_labels_names_bad_input(self): levels, labels = self.index.levels, self.index.labels names = self.index.names - with tm.assertRaisesRegexp(ValueError, 'Length of levels'): + with tm.assert_raises_regex(ValueError, 'Length of levels'): self.index.set_levels([levels[0]]) - with tm.assertRaisesRegexp(ValueError, 'Length of labels'): + with tm.assert_raises_regex(ValueError, 'Length of labels'): self.index.set_labels([labels[0]]) - with tm.assertRaisesRegexp(ValueError, 'Length of names'): + with tm.assert_raises_regex(ValueError, 'Length of names'): self.index.set_names([names[0]]) # shouldn't scalar data error, instead should demand list-like - with tm.assertRaisesRegexp(TypeError, 'list of lists-like'): + with tm.assert_raises_regex(TypeError, 'list of lists-like'): self.index.set_levels(levels[0]) # shouldn't scalar data error, instead should demand list-like - with tm.assertRaisesRegexp(TypeError, 'list of lists-like'): + with tm.assert_raises_regex(TypeError, 'list of lists-like'): self.index.set_labels(labels[0]) # shouldn't scalar data error, instead should demand list-like - with tm.assertRaisesRegexp(TypeError, 'list-like'): + with tm.assert_raises_regex(TypeError, 'list-like'): self.index.set_names(names[0]) # should have equal lengths - with tm.assertRaisesRegexp(TypeError, 'list of lists-like'): + with tm.assert_raises_regex(TypeError, 'list of lists-like'): self.index.set_levels(levels[0], level=[0, 1]) - with tm.assertRaisesRegexp(TypeError, 'list-like'): + with tm.assert_raises_regex(TypeError, 'list-like'): self.index.set_levels(levels, level=0) # should have equal lengths - with tm.assertRaisesRegexp(TypeError, 'list of lists-like'): + with tm.assert_raises_regex(TypeError, 'list of lists-like'): self.index.set_labels(labels[0], level=[0, 1]) - with tm.assertRaisesRegexp(TypeError, 'list-like'): + with tm.assert_raises_regex(TypeError, 'list-like'): self.index.set_labels(labels, level=0) # should have equal lengths - with tm.assertRaisesRegexp(ValueError, 'Length of names'): + with tm.assert_raises_regex(ValueError, 'Length of names'): self.index.set_names(names[0], level=[0, 1]) - with tm.assertRaisesRegexp(TypeError, 'string'): + with tm.assert_raises_regex(TypeError, 'string'): self.index.set_names(names, level=0) def test_set_levels_categorical(self): @@ -359,57 +374,64 @@ def test_metadata_immutable(self): levels, labels = self.index.levels, self.index.labels # shouldn't be able to set at either the top level or base level mutable_regex = re.compile('does not support mutable operations') - with assertRaisesRegexp(TypeError, mutable_regex): + with tm.assert_raises_regex(TypeError, mutable_regex): levels[0] = levels[0] - with assertRaisesRegexp(TypeError, mutable_regex): + with tm.assert_raises_regex(TypeError, mutable_regex): levels[0][0] = levels[0][0] # ditto for labels - with assertRaisesRegexp(TypeError, mutable_regex): + with tm.assert_raises_regex(TypeError, mutable_regex): labels[0] = labels[0] - with assertRaisesRegexp(TypeError, mutable_regex): + with tm.assert_raises_regex(TypeError, mutable_regex): labels[0][0] = labels[0][0] # and for names names = 
         names = self.index.names
-        with assertRaisesRegexp(TypeError, mutable_regex):
+        with tm.assert_raises_regex(TypeError, mutable_regex):
            names[0] = names[0]

     def test_inplace_mutation_resets_values(self):
         levels = [['a', 'b', 'c'], [4]]
         levels2 = [[1, 2, 3], ['a']]
         labels = [[0, 1, 0, 2, 2, 0], [0, 0, 0, 0, 0, 0]]
+
         mi1 = MultiIndex(levels=levels, labels=labels)
         mi2 = MultiIndex(levels=levels2, labels=labels)
         vals = mi1.values.copy()
         vals2 = mi2.values.copy()

-        self.assertIsNotNone(mi1._tuples)
-        # make sure level setting works
+        assert mi1._tuples is not None
+
+        # Make sure level setting works
         new_vals = mi1.set_levels(levels2).values
-        assert_almost_equal(vals2, new_vals)
-        # non-inplace doesn't kill _tuples [implementation detail]
-        assert_almost_equal(mi1._tuples, vals)
-        # and values is still same too
-        assert_almost_equal(mi1.values, vals)
+        tm.assert_almost_equal(vals2, new_vals)
+
+        # Non-inplace doesn't kill _tuples [implementation detail]
+        tm.assert_almost_equal(mi1._tuples, vals)
+
+        # ...and values is still same too
+        tm.assert_almost_equal(mi1.values, vals)

-        # inplace should kill _tuples
+        # Inplace should kill _tuples
         mi1.set_levels(levels2, inplace=True)
-        assert_almost_equal(mi1.values, vals2)
+        tm.assert_almost_equal(mi1.values, vals2)

-        # make sure label setting works too
+        # Make sure label setting works too
         labels2 = [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
         exp_values = np.empty((6, ), dtype=object)
         exp_values[:] = [(long(1), 'a')] * 6
-        # must be 1d array of tuples
-        self.assertEqual(exp_values.shape, (6, ))
+
+        # Must be 1d array of tuples
+        assert exp_values.shape == (6, )

         new_values = mi2.set_labels(labels2).values
-        # not inplace shouldn't change
-        assert_almost_equal(mi2._tuples, vals2)
-        # should have correct values
-        assert_almost_equal(exp_values, new_values)
-        # and again setting inplace should kill _tuples, etc
+        # Not inplace shouldn't change
+        tm.assert_almost_equal(mi2._tuples, vals2)
+
+        # Should have correct values
+        tm.assert_almost_equal(exp_values, new_values)
+
+        # ...and again setting inplace should kill _tuples, etc
         mi2.set_labels(labels2, inplace=True)
-        assert_almost_equal(mi2.values, new_values)
+        tm.assert_almost_equal(mi2.values, new_values)

     def test_copy_in_constructor(self):
         levels = np.array(["a", "b", "c"])
@@ -417,12 +439,12 @@ def test_copy_in_constructor(self):
         val = labels[0]
         mi = MultiIndex(levels=[levels, levels], labels=[labels, labels],
                         copy=True)
-        self.assertEqual(mi.labels[0][0], val)
+        assert mi.labels[0][0] == val
         labels[0] = 15
-        self.assertEqual(mi.labels[0][0], val)
+        assert mi.labels[0][0] == val
         val = levels[0]
         levels[0] = "PANDA"
-        self.assertEqual(mi.levels[0][0], val)
+        assert mi.levels[0][0] == val

     def test_set_value_keeps_names(self):
         # motivating example from #3742
@@ -433,12 +455,12 @@ def test_set_value_keeps_names(self):
             np.random.randn(6, 4),
             columns=['one', 'two', 'three', 'four'],
             index=idx)
-        df = df.sortlevel()
-        self.assertIsNone(df.is_copy)
-        self.assertEqual(df.index.names, ('Name', 'Number'))
+        df = df.sort_index()
+        assert df.is_copy is None
+        assert df.index.names == ('Name', 'Number')
         df = df.set_value(('grethe', '4'), 'one', 99.34)
-        self.assertIsNone(df.is_copy)
-        self.assertEqual(df.index.names, ('Name', 'Number'))
+        assert df.is_copy is None
+        assert df.index.names == ('Name', 'Number')

     def test_copy_names(self):
         # Check that adding a "names" parameter to the copy is honored
@@ -446,62 +468,63 @@ def test_copy_names(self):
         multi_idx = pd.Index([(1, 2), (3, 4)], names=['MyName1', 'MyName2'])
         multi_idx1 = multi_idx.copy()

-        self.assertTrue(multi_idx.equals(multi_idx1))
-        self.assertEqual(multi_idx.names, ['MyName1', 'MyName2'])
-        self.assertEqual(multi_idx1.names, ['MyName1', 'MyName2'])
+        assert multi_idx.equals(multi_idx1)
+        assert multi_idx.names == ['MyName1', 'MyName2']
+        assert multi_idx1.names == ['MyName1', 'MyName2']

         multi_idx2 = multi_idx.copy(names=['NewName1', 'NewName2'])
-        self.assertTrue(multi_idx.equals(multi_idx2))
-        self.assertEqual(multi_idx.names, ['MyName1', 'MyName2'])
-        self.assertEqual(multi_idx2.names, ['NewName1', 'NewName2'])
+        assert multi_idx.equals(multi_idx2)
+        assert multi_idx.names == ['MyName1', 'MyName2']
+        assert multi_idx2.names == ['NewName1', 'NewName2']

         multi_idx3 = multi_idx.copy(name=['NewName1', 'NewName2'])
-        self.assertTrue(multi_idx.equals(multi_idx3))
-        self.assertEqual(multi_idx.names, ['MyName1', 'MyName2'])
-        self.assertEqual(multi_idx3.names, ['NewName1', 'NewName2'])
+        assert multi_idx.equals(multi_idx3)
+        assert multi_idx.names == ['MyName1', 'MyName2']
+        assert multi_idx3.names == ['NewName1', 'NewName2']

     def test_names(self):
-        # names are assigned in __init__
+        # names are assigned in setup
         names = self.index_names
         level_names = [level.name for level in self.index.levels]
-        self.assertEqual(names, level_names)
+        assert names == level_names

         # setting bad names on existing
         index = self.index
-        assertRaisesRegexp(ValueError, "^Length of names", setattr, index,
-                           "names", list(index.names) + ["third"])
-        assertRaisesRegexp(ValueError, "^Length of names", setattr, index,
-                           "names", [])
+        tm.assert_raises_regex(ValueError, "^Length of names",
+                               setattr, index, "names",
+                               list(index.names) + ["third"])
+        tm.assert_raises_regex(ValueError, "^Length of names",
+                               setattr, index, "names", [])

         # initializing with bad names (should always be equivalent)
         major_axis, minor_axis = self.index.levels
         major_labels, minor_labels = self.index.labels
-        assertRaisesRegexp(ValueError, "^Length of names", MultiIndex,
-                           levels=[major_axis, minor_axis],
-                           labels=[major_labels, minor_labels],
-                           names=['first'])
-        assertRaisesRegexp(ValueError, "^Length of names", MultiIndex,
-                           levels=[major_axis, minor_axis],
-                           labels=[major_labels, minor_labels],
-                           names=['first', 'second', 'third'])
+        tm.assert_raises_regex(ValueError, "^Length of names", MultiIndex,
+                               levels=[major_axis, minor_axis],
+                               labels=[major_labels, minor_labels],
+                               names=['first'])
+        tm.assert_raises_regex(ValueError, "^Length of names", MultiIndex,
+                               levels=[major_axis, minor_axis],
+                               labels=[major_labels, minor_labels],
+                               names=['first', 'second', 'third'])

         # names are assigned
         index.names = ["a", "b"]
         ind_names = list(index.names)
         level_names = [level.name for level in index.levels]
-        self.assertEqual(ind_names, level_names)
+        assert ind_names == level_names

     def test_reference_duplicate_name(self):
         idx = MultiIndex.from_tuples(
             [('a', 'b'), ('c', 'd')], names=['x', 'x'])
-        self.assertTrue(idx._reference_duplicate_name('x'))
+        assert idx._reference_duplicate_name('x')

         idx = MultiIndex.from_tuples(
             [('a', 'b'), ('c', 'd')], names=['x', 'y'])
-        self.assertFalse(idx._reference_duplicate_name('x'))
+        assert not idx._reference_duplicate_name('x')

     def test_astype(self):
         expected = self.index.copy()
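For readers following the refactor, the mechanical mapping is: `self.assert*` becomes a plain `assert`, `self.assertRaises` becomes `pytest.raises`, and the regex helper moves to `tm.assert_raises_regex`. A hedged sketch of the pattern, using the helper this change set introduces:

```python
import pandas as pd
import pandas.util.testing as tm

idx = pd.MultiIndex.from_arrays([[1, 2], ["a", "b"]], names=["x", "y"])

# Old: self.assertEqual(idx.nlevels, 2)
assert idx.nlevels == 2

# Old: with self.assertRaisesRegexp(ValueError, "Length of names"):
with tm.assert_raises_regex(ValueError, "Length of names"):
    idx.names = ["only-one-name"]   # wrong length -> ValueError
```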
@@ -510,79 +533,79 @@ def test_astype(self):
         assert_copy(actual.labels, expected.labels)
         self.check_level_names(actual, expected.names)

-        with assertRaisesRegexp(TypeError, "^Setting.*dtype.*object"):
+        with tm.assert_raises_regex(TypeError, "^Setting.*dtype.*object"):
             self.index.astype(np.dtype(int))

     def test_constructor_single_level(self):
         single_level = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux']],
                                   labels=[[0, 1, 2, 3]], names=['first'])
-        tm.assertIsInstance(single_level, Index)
-        self.assertNotIsInstance(single_level, MultiIndex)
-        self.assertEqual(single_level.name, 'first')
+        assert isinstance(single_level, Index)
+        assert not isinstance(single_level, MultiIndex)
+        assert single_level.name == 'first'

         single_level = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux']],
                                   labels=[[0, 1, 2, 3]])
-        self.assertIsNone(single_level.name)
+        assert single_level.name is None

     def test_constructor_no_levels(self):
-        assertRaisesRegexp(ValueError, "non-zero number of levels/labels",
-                           MultiIndex, levels=[], labels=[])
+        tm.assert_raises_regex(ValueError, "non-zero number "
+                               "of levels/labels",
+                               MultiIndex, levels=[], labels=[])

         both_re = re.compile('Must pass both levels and labels')
-        with tm.assertRaisesRegexp(TypeError, both_re):
+        with tm.assert_raises_regex(TypeError, both_re):
             MultiIndex(levels=[])
-        with tm.assertRaisesRegexp(TypeError, both_re):
+        with tm.assert_raises_regex(TypeError, both_re):
             MultiIndex(labels=[])

     def test_constructor_mismatched_label_levels(self):
         labels = [np.array([1]), np.array([2]), np.array([3])]
         levels = ["a"]
-        assertRaisesRegexp(ValueError, "Length of levels and labels must be"
-                           " the same", MultiIndex, levels=levels,
-                           labels=labels)
+        tm.assert_raises_regex(ValueError, "Length of levels and labels "
+                               "must be the same", MultiIndex,
+                               levels=levels, labels=labels)

         length_error = re.compile('>= length of level')
         label_error = re.compile(r'Unequal label lengths: \[4, 2\]')

         # important to check that it's looking at the right thing.
-        with tm.assertRaisesRegexp(ValueError, length_error):
+        with tm.assert_raises_regex(ValueError, length_error):
             MultiIndex(levels=[['a'], ['b']],
                        labels=[[0, 1, 2, 3], [0, 3, 4, 1]])

-        with tm.assertRaisesRegexp(ValueError, label_error):
+        with tm.assert_raises_regex(ValueError, label_error):
             MultiIndex(levels=[['a'], ['b']], labels=[[0, 0, 0, 0], [0, 0]])

         # external API
-        with tm.assertRaisesRegexp(ValueError, length_error):
+        with tm.assert_raises_regex(ValueError, length_error):
             self.index.copy().set_levels([['a'], ['b']])

-        with tm.assertRaisesRegexp(ValueError, label_error):
+        with tm.assert_raises_regex(ValueError, label_error):
             self.index.copy().set_labels([[0, 0, 0, 0], [0, 0]])

         # deprecated properties
         with warnings.catch_warnings():
             warnings.simplefilter('ignore')

-            with tm.assertRaisesRegexp(ValueError, length_error):
+            with tm.assert_raises_regex(ValueError, length_error):
                 self.index.copy().levels = [['a'], ['b']]

-            with tm.assertRaisesRegexp(ValueError, label_error):
+            with tm.assert_raises_regex(ValueError, label_error):
                 self.index.copy().labels = [[0, 0, 0, 0], [0, 0]]

     def assert_multiindex_copied(self, copy, original):
-        # levels should be (at least, shallow copied)
-        assert_copy(copy.levels, original.levels)
+        # Levels should be (at least, shallow copied)
+        tm.assert_copy(copy.levels, original.levels)
+        tm.assert_almost_equal(copy.labels, original.labels)

-        assert_almost_equal(copy.labels, original.labels)
+        # Labels don't matter which way copied
+        tm.assert_almost_equal(copy.labels, original.labels)
+        assert copy.labels is not original.labels

-        # labels doesn't matter which way copied
-        assert_almost_equal(copy.labels, original.labels)
-        self.assertIsNot(copy.labels, original.labels)
+        # Names don't matter which way copied
+        assert copy.names == original.names
+        assert copy.names is not original.names
-        # names doesn't matter which way copied
-        self.assertEqual(copy.names, original.names)
-        self.assertIsNot(copy.names, original.names)
-
-        # sort order should be copied
-        self.assertEqual(copy.sortorder, original.sortorder)
+        # Sort order should be copied
+        assert copy.sortorder == original.sortorder

     def test_copy(self):
         i_copy = self.index.copy()
@@ -600,7 +623,7 @@ def test_view(self):
         self.assert_multiindex_copied(i_view, self.index)

     def check_level_names(self, index, names):
-        self.assertEqual([level.name for level in index.levels], list(names))
+        assert [level.name for level in index.levels] == list(names)

     def test_changing_names(self):

@@ -628,16 +651,16 @@ def test_changing_names(self):
     def test_duplicate_names(self):
         self.index.names = ['foo', 'foo']
-        assertRaisesRegexp(KeyError, 'Level foo not found',
-                           self.index._get_level_number, 'foo')
+        tm.assert_raises_regex(KeyError, 'Level foo not found',
+                               self.index._get_level_number, 'foo')

     def test_get_level_number_integer(self):
         self.index.names = [1, 0]
-        self.assertEqual(self.index._get_level_number(1), 0)
-        self.assertEqual(self.index._get_level_number(0), 1)
-        self.assertRaises(IndexError, self.index._get_level_number, 2)
-        assertRaisesRegexp(KeyError, 'Level fourth not found',
-                           self.index._get_level_number, 'fourth')
+        assert self.index._get_level_number(1) == 0
+        assert self.index._get_level_number(0) == 1
+        pytest.raises(IndexError, self.index._get_level_number, 2)
+        tm.assert_raises_regex(KeyError, 'Level fourth not found',
+                               self.index._get_level_number, 'fourth')

     def test_from_arrays(self):
         arrays = []
@@ -645,14 +668,13 @@ def test_from_arrays(self):
             arrays.append(np.asarray(lev).take(lab))

         result = MultiIndex.from_arrays(arrays)
-        self.assertEqual(list(result), list(self.index))
+        assert list(result) == list(self.index)

         # infer correctly
         result = MultiIndex.from_arrays([[pd.NaT, Timestamp('20130101')],
                                          ['a', 'b']])
-        self.assertTrue(result.levels[0].equals(Index([Timestamp('20130101')
-                                                       ])))
-        self.assertTrue(result.levels[1].equals(Index(['a', 'b'])))
+        assert result.levels[0].equals(Index([Timestamp('20130101')]))
+        assert result.levels[1].equals(Index(['a', 'b']))

     def test_from_arrays_index_series_datetimetz(self):
         idx1 = pd.date_range('2015-01-01 10:00', freq='D', periods=3,
@@ -740,7 +762,7 @@ def test_from_arrays_index_series_categorical(self):

     def test_from_arrays_empty(self):
         # 0 levels
-        with tm.assertRaisesRegexp(
+        with tm.assert_raises_regex(
                 ValueError, "Must pass non-zero number of levels/labels"):
             MultiIndex.from_arrays(arrays=[])
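The `from_arrays` tests above (and the `from_product`/`from_tuples` hunks that follow) assume the three constructors agree on element order. A quick illustration on toy data, not taken from the test file itself:

```python
import pandas as pd

arrays = [["foo", "foo", "bar", "bar"], ["one", "two", "one", "two"]]

mi1 = pd.MultiIndex.from_arrays(arrays)
mi2 = pd.MultiIndex.from_tuples(list(zip(*arrays)))
mi3 = pd.MultiIndex.from_product([["foo", "bar"], ["one", "two"]])

# equals() compares element-by-element, ignoring level order and names
assert mi1.equals(mi2) and mi1.equals(mi3)
```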
@@ -762,24 +784,27 @@ def test_from_arrays_invalid_input(self):
         invalid_inputs = [1, [1], [1, 2], [[1], 2],
                           'a', ['a'], ['a', 'b'], [['a'], 'b']]
         for i in invalid_inputs:
-            tm.assertRaises(TypeError, MultiIndex.from_arrays, arrays=i)
+            pytest.raises(TypeError, MultiIndex.from_arrays, arrays=i)

     def test_from_arrays_different_lengths(self):
-        # GH13599
+        # see gh-13599
         idx1 = [1, 2, 3]
         idx2 = ['a', 'b']
-        assertRaisesRegexp(ValueError, '^all arrays must be same length$',
-                           MultiIndex.from_arrays, [idx1, idx2])
+        tm.assert_raises_regex(ValueError, '^all arrays must '
+                               'be same length$',
+                               MultiIndex.from_arrays, [idx1, idx2])

         idx1 = []
         idx2 = ['a', 'b']
-        assertRaisesRegexp(ValueError, '^all arrays must be same length$',
-                           MultiIndex.from_arrays, [idx1, idx2])
+        tm.assert_raises_regex(ValueError, '^all arrays must '
+                               'be same length$',
+                               MultiIndex.from_arrays, [idx1, idx2])

         idx1 = [1, 2, 3]
         idx2 = []
-        assertRaisesRegexp(ValueError, '^all arrays must be same length$',
-                           MultiIndex.from_arrays, [idx1, idx2])
+        tm.assert_raises_regex(ValueError, '^all arrays must '
+                               'be same length$',
+                               MultiIndex.from_arrays, [idx1, idx2])

     def test_from_product(self):

@@ -794,11 +819,11 @@ def test_from_product(self):
         expected = MultiIndex.from_tuples(tuples, names=names)

         tm.assert_index_equal(result, expected)
-        self.assertEqual(result.names, names)
+        assert result.names == names

     def test_from_product_empty(self):
         # 0 levels
-        with tm.assertRaisesRegexp(
+        with tm.assert_raises_regex(
                 ValueError, "Must pass non-zero number of levels/labels"):
             MultiIndex.from_product([])

@@ -831,12 +856,12 @@ def test_from_product_invalid_input(self):
         invalid_inputs = [1, [1], [1, 2], [[1], 2],
                           'a', ['a'], ['a', 'b'], [['a'], 'b']]
         for i in invalid_inputs:
-            tm.assertRaises(TypeError, MultiIndex.from_product, iterables=i)
+            pytest.raises(TypeError, MultiIndex.from_product, iterables=i)

     def test_from_product_datetimeindex(self):
         dt_index = date_range('2000-01-01', periods=2)
         mi = pd.MultiIndex.from_product([[1, 2], dt_index])
-        etalon = pd.lib.list_to_object_array([(1, pd.Timestamp(
+        etalon = lib.list_to_object_array([(1, pd.Timestamp(
             '2000-01-01')), (1, pd.Timestamp('2000-01-02')), (2, pd.Timestamp(
                 '2000-01-01')), (2, pd.Timestamp('2000-01-02'))])
         tm.assert_numpy_array_equal(mi.values, etalon)

@@ -863,21 +888,21 @@ def test_values_boxed(self):
                   (3, pd.Timestamp('2000-01-03'))]
         mi = pd.MultiIndex.from_tuples(tuples)
         tm.assert_numpy_array_equal(mi.values,
-                                    pd.lib.list_to_object_array(tuples))
+                                    lib.list_to_object_array(tuples))
         # Check that code branches for boxed values produce identical results
         tm.assert_numpy_array_equal(mi.values[:4], mi[:4].values)

     def test_append(self):
         result = self.index[:3].append(self.index[3:])
-        self.assertTrue(result.equals(self.index))
+        assert result.equals(self.index)

         foos = [self.index[:1], self.index[1:3], self.index[3:]]
         result = foos[0].append(foos[1:])
-        self.assertTrue(result.equals(self.index))
+        assert result.equals(self.index)

         # empty
         result = self.index.append([])
-        self.assertTrue(result.equals(self.index))
+        assert result.equals(self.index)

     def test_append_mixed_dtypes(self):
         # GH 13660
@@ -889,15 +914,15 @@ def test_append_mixed_dtypes(self):
                                       [1.1, np.nan, 3.3],
                                       ['a', 'b', 'c'],
                                       dti, dti_tz, pi])
-        self.assertEqual(mi.nlevels, 6)
+        assert mi.nlevels == 6

         res = mi.append(mi)
         exp = MultiIndex.from_arrays([[1, 2, 3, 1, 2, 3],
-                                     [1.1, np.nan, 3.3, 1.1, np.nan, 3.3],
-                                     ['a', 'b', 'c', 'a', 'b', 'c'],
-                                     dti.append(dti),
-                                     dti_tz.append(dti_tz),
-                                     pi.append(pi)])
+                                      [1.1, np.nan, 3.3, 1.1, np.nan, 3.3],
+                                      ['a', 'b', 'c', 'a', 'b', 'c'],
+                                      dti.append(dti),
+                                      dti_tz.append(dti_tz),
+                                      pi.append(pi)])
         tm.assert_index_equal(res, exp)

         other = MultiIndex.from_arrays([['x', 'y', 'z'], ['x', 'y', 'z'],
@@ -906,11 +931,11 @@ def test_append_mixed_dtypes(self):

         res = mi.append(other)
         exp = MultiIndex.from_arrays([[1, 2, 3, 'x', 'y', 'z'],
-                                     [1.1, np.nan, 3.3, 'x', 'y', 'z'],
-                                     ['a', 'b', 'c', 'x', 'y', 'z'],
-                                     dti.append(pd.Index(['x', 'y', 'z'])),
-                                     dti_tz.append(pd.Index(['x', 'y', 'z'])),
-                                     pi.append(pd.Index(['x', 'y', 'z']))])
+                                      [1.1, np.nan, 3.3, 'x', 'y', 'z'],
+                                      ['a', 'b', 'c', 'x', 'y', 'z'],
+                                      dti.append(pd.Index(['x', 'y', 'z'])),
+                                      dti_tz.append(pd.Index(['x', 'y', 'z'])),
+                                      pi.append(pd.Index(['x', 'y', 'z']))])
         tm.assert_index_equal(res, exp)
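`test_append` above boils down to a simple round-trip: splitting a `MultiIndex` and appending the pieces reproduces the original. A small sketch of that behavior on made-up data:

```python
import pandas as pd

mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]])

# append re-assembles the pieces in order
assert mi[:2].append(mi[2:]).equals(mi)

# appending an empty list is a no-op, as the test also checks
assert mi.append([]).equals(mi)
```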
     def test_get_level_values(self):
         result = self.index.get_level_values(0)
@@ -918,7 +943,7 @@ def test_get_level_values(self):
         expected = Index(['foo', 'foo', 'bar', 'baz', 'qux', 'qux'],
                          name='first')
         tm.assert_index_equal(result, expected)
-        self.assertEqual(result.name, 'first')
+        assert result.name == 'first'

         result = self.index.get_level_values('first')
         expected = self.index.get_level_values(0)
@@ -929,9 +954,9 @@ def test_get_level_values(self):
             ['A', 'B']), CategoricalIndex([1, 2, 3])], labels=[np.array(
                 [0, 0, 0, 1, 1, 1]), np.array([0, 1, 2, 0, 1, 2])])

         exp = CategoricalIndex(['A', 'A', 'A', 'B', 'B', 'B'])
-        self.assert_index_equal(index.get_level_values(0), exp)
+        tm.assert_index_equal(index.get_level_values(0), exp)
         exp = CategoricalIndex([1, 2, 3, 1, 2, 3])
-        self.assert_index_equal(index.get_level_values(1), exp)
+        tm.assert_index_equal(index.get_level_values(1), exp)

     def test_get_level_values_na(self):
         arrays = [['a', 'b', 'b'], [1, np.nan, 2]]
@@ -964,32 +989,32 @@ def test_get_level_values_na(self):
         arrays = [[], []]
         index = pd.MultiIndex.from_arrays(arrays)
         values = index.get_level_values(0)
-        self.assertEqual(values.shape, (0, ))
+        assert values.shape == (0, )

     def test_reorder_levels(self):
         # this blows up
-        assertRaisesRegexp(IndexError, '^Too many levels',
-                           self.index.reorder_levels, [2, 1, 0])
+        tm.assert_raises_regex(IndexError, '^Too many levels',
+                               self.index.reorder_levels, [2, 1, 0])

     def test_nlevels(self):
-        self.assertEqual(self.index.nlevels, 2)
+        assert self.index.nlevels == 2

     def test_iter(self):
         result = list(self.index)
         expected = [('foo', 'one'), ('foo', 'two'), ('bar', 'one'),
                     ('baz', 'two'), ('qux', 'one'), ('qux', 'two')]
-        self.assertEqual(result, expected)
+        assert result == expected

     def test_legacy_pickle(self):
         if PY3:
-            raise nose.SkipTest("testing for legacy pickles not "
-                                "support on py3")
+            pytest.skip("testing for legacy pickles not "
+                        "supported on py3")

         path = tm.get_data_path('multiindex_v1.pickle')
         obj = pd.read_pickle(path)

         obj2 = MultiIndex.from_tuples(obj.values)
-        self.assertTrue(obj.equals(obj2))
+        assert obj.equals(obj2)

         res = obj.get_indexer(obj)
         exp = np.arange(len(obj), dtype=np.intp)
@@ -1008,7 +1033,7 @@ def test_legacy_v2_unpickle(self):
         obj = pd.read_pickle(path)

         obj2 = MultiIndex.from_tuples(obj.values)
-        self.assertTrue(obj.equals(obj2))
+        assert obj.equals(obj2)

         res = obj.get_indexer(obj)
         exp = np.arange(len(obj), dtype=np.intp)
@@ -1028,73 +1053,99 @@ def test_roundtrip_pickle_with_tz(self):
            [[1, 2], ['a', 'b'], date_range('20130101', periods=3,
                                            tz='US/Eastern')
             ], names=['one', 'two', 'three'])
-        unpickled = self.round_trip_pickle(index)
-        self.assertTrue(index.equal_levels(unpickled))
+        unpickled = tm.round_trip_pickle(index)
+        assert index.equal_levels(unpickled)

     def test_from_tuples_index_values(self):
         result = MultiIndex.from_tuples(self.index)
-        self.assertTrue((result.values == self.index.values).all())
+        assert (result.values == self.index.values).all()

     def test_contains(self):
-        self.assertIn(('foo', 'two'), self.index)
-        self.assertNotIn(('bar', 'two'), self.index)
-        self.assertNotIn(None, self.index)
+        assert ('foo', 'two') in self.index
+        assert ('bar', 'two') not in self.index
+        assert None not in self.index
+
+    def test_contains_top_level(self):
+        midx = MultiIndex.from_product([['A', 'B'], [1, 2]])
+        assert 'A' in midx
+        assert 'A' not in midx._engine
+
+    def test_contains_with_nat(self):
+        # MI with a NaT
+        mi = MultiIndex(levels=[['C'],
+                                pd.date_range('2012-01-01', periods=5)],
+                        labels=[[0, 0, 0, 0, 0, 0], [-1, 0, 1, 2, 3, 4]],
+                        names=[None, 'B'])
+        assert ('C', pd.Timestamp('2012-01-01')) in mi
+        for val in mi.values:
+            assert val in mi
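The new `test_contains_top_level` codifies that `in` accepts both a complete tuple and a bare level-0 key. Illustratively, with hypothetical data but the same API:

```python
import pandas as pd

midx = pd.MultiIndex.from_product([["A", "B"], [1, 2]])

assert ("A", 1) in midx   # full key
assert "A" in midx        # a top-level key alone also matches
assert ("A", 3) not in midx
```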
     def test_is_all_dates(self):
-        self.assertFalse(self.index.is_all_dates)
+        assert not self.index.is_all_dates

     def test_is_numeric(self):
         # MultiIndex is never numeric
-        self.assertFalse(self.index.is_numeric())
+        assert not self.index.is_numeric()

     def test_getitem(self):
         # scalar
-        self.assertEqual(self.index[2], ('bar', 'one'))
+        assert self.index[2] == ('bar', 'one')

         # slice
         result = self.index[2:5]
         expected = self.index[[2, 3, 4]]
-        self.assertTrue(result.equals(expected))
+        assert result.equals(expected)

         # boolean
         result = self.index[[True, False, True, False, True, True]]
         result2 = self.index[np.array([True, False, True, False, True, True])]
         expected = self.index[[0, 2, 4, 5]]
-        self.assertTrue(result.equals(expected))
-        self.assertTrue(result2.equals(expected))
+        assert result.equals(expected)
+        assert result2.equals(expected)

     def test_getitem_group_select(self):
         sorted_idx, _ = self.index.sortlevel(0)
-        self.assertEqual(sorted_idx.get_loc('baz'), slice(3, 4))
-        self.assertEqual(sorted_idx.get_loc('foo'), slice(0, 2))
+        assert sorted_idx.get_loc('baz') == slice(3, 4)
+        assert sorted_idx.get_loc('foo') == slice(0, 2)

     def test_get_loc(self):
-        self.assertEqual(self.index.get_loc(('foo', 'two')), 1)
-        self.assertEqual(self.index.get_loc(('baz', 'two')), 3)
-        self.assertRaises(KeyError, self.index.get_loc, ('bar', 'two'))
-        self.assertRaises(KeyError, self.index.get_loc, 'quux')
+        assert self.index.get_loc(('foo', 'two')) == 1
+        assert self.index.get_loc(('baz', 'two')) == 3
+        pytest.raises(KeyError, self.index.get_loc, ('bar', 'two'))
+        pytest.raises(KeyError, self.index.get_loc, 'quux')

-        self.assertRaises(NotImplementedError, self.index.get_loc, 'foo',
-                          method='nearest')
+        pytest.raises(NotImplementedError, self.index.get_loc, 'foo',
+                      method='nearest')

         # 3 levels
         index = MultiIndex(levels=[Index(lrange(4)), Index(lrange(4)), Index(
             lrange(4))], labels=[np.array([0, 0, 1, 2, 2, 2, 3, 3]), np.array(
                 [0, 1, 0, 0, 0, 1, 0, 1]), np.array([1, 0, 1, 1, 0, 0, 1, 0])])
-        self.assertRaises(KeyError, index.get_loc, (1, 1))
-        self.assertEqual(index.get_loc((2, 0)), slice(3, 5))
+        pytest.raises(KeyError, index.get_loc, (1, 1))
+        assert index.get_loc((2, 0)) == slice(3, 5)

     def test_get_loc_duplicates(self):
         index = Index([2, 2, 2, 2])
         result = index.get_loc(2)
         expected = slice(0, 4)
-        self.assertEqual(result, expected)
-        # self.assertRaises(Exception, index.get_loc, 2)
+        assert result == expected
+        # pytest.raises(Exception, index.get_loc, 2)

         index = Index(['c', 'a', 'a', 'b', 'b'])
         rs = index.get_loc('c')
         xp = 0
-        assert (rs == xp)
+        assert rs == xp
+
+    def test_get_value_duplicates(self):
+        index = MultiIndex(levels=[['D', 'B', 'C'],
+                                   [0, 26, 27, 37, 57, 67, 75, 82]],
+                           labels=[[0, 0, 0, 1, 2, 2, 2, 2, 2, 2],
+                                   [1, 3, 4, 6, 0, 2, 2, 3, 5, 7]],
+                           names=['tag', 'day'])
+
+        assert index.get_loc('D') == slice(0, 3)
+        with pytest.raises(KeyError):
+            index._engine.get_value(np.array([]), 'D')
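`test_get_loc` and `test_getitem_group_select` above depend on a detail worth spelling out: on a lexsorted `MultiIndex`, a full key yields an integer position while a partial key yields a contiguous slice. A sketch with made-up data:

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["bar", "baz", "foo", "qux"],
                                 ["one", "two"]])

assert mi.get_loc(("baz", "two")) == 3      # full key -> position
assert mi.get_loc("foo") == slice(4, 6)     # partial key -> slice
```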
     def test_get_loc_level(self):
         index = MultiIndex(levels=[Index(lrange(4)), Index(lrange(4)), Index(
@@ -1104,22 +1155,22 @@ def test_get_loc_level(self):
         loc, new_index = index.get_loc_level((0, 1))
         expected = slice(1, 2)
         exp_index = index[expected].droplevel(0).droplevel(0)
-        self.assertEqual(loc, expected)
-        self.assertTrue(new_index.equals(exp_index))
+        assert loc == expected
+        assert new_index.equals(exp_index)

         loc, new_index = index.get_loc_level((0, 1, 0))
         expected = 1
-        self.assertEqual(loc, expected)
-        self.assertIsNone(new_index)
+        assert loc == expected
+        assert new_index is None

-        self.assertRaises(KeyError, index.get_loc_level, (2, 2))
+        pytest.raises(KeyError, index.get_loc_level, (2, 2))

         index = MultiIndex(levels=[[2000], lrange(4)], labels=[np.array(
             [0, 0, 0, 0]), np.array([0, 1, 2, 3])])
         result, new_index = index.get_loc_level((2000, slice(None, None)))
         expected = slice(None, None)
-        self.assertEqual(result, expected)
-        self.assertTrue(new_index.equals(index.droplevel(0)))
+        assert result == expected
+        assert new_index.equals(index.droplevel(0))

     def test_slice_locs(self):
         df = tm.makeTimeDataFrame()
@@ -1141,17 +1192,19 @@ def test_slice_locs_with_type_mismatch(self):
         df = tm.makeTimeDataFrame()
         stacked = df.stack()
         idx = stacked.index
-        assertRaisesRegexp(TypeError, '^Level type mismatch', idx.slice_locs,
-                           (1, 3))
-        assertRaisesRegexp(TypeError, '^Level type mismatch', idx.slice_locs,
-                           df.index[5] + timedelta(seconds=30), (5, 2))
+        tm.assert_raises_regex(TypeError, '^Level type mismatch',
+                               idx.slice_locs, (1, 3))
+        tm.assert_raises_regex(TypeError, '^Level type mismatch',
+                               idx.slice_locs,
+                               df.index[5] + timedelta(
+                                   seconds=30), (5, 2))
         df = tm.makeCustomDataframe(5, 5)
         stacked = df.stack()
         idx = stacked.index
-        with assertRaisesRegexp(TypeError, '^Level type mismatch'):
+        with tm.assert_raises_regex(TypeError, '^Level type mismatch'):
             idx.slice_locs(timedelta(seconds=30))
         # TODO: Try creating a UnicodeDecodeError in exception message
-        with assertRaisesRegexp(TypeError, '^Level type mismatch'):
+        with tm.assert_raises_regex(TypeError, '^Level type mismatch'):
             idx.slice_locs(df.index[1], (16, "a"))

     def test_slice_locs_not_sorted(self):
@@ -1159,9 +1212,9 @@ def test_slice_locs_not_sorted(self):
             lrange(4))], labels=[np.array([0, 0, 1, 2, 2, 2, 3, 3]), np.array(
                 [0, 1, 0, 0, 0, 1, 0, 1]), np.array([1, 0, 1, 1, 0, 0, 1, 0])])

-        assertRaisesRegexp(KeyError, "[Kk]ey length.*greater than MultiIndex"
-                           " lexsort depth", index.slice_locs, (1, 0, 1),
-                           (2, 1, 0))
+        tm.assert_raises_regex(KeyError, "[Kk]ey length.*greater than "
+                               "MultiIndex lexsort depth",
+                               index.slice_locs, (1, 0, 1), (2, 1, 0))

         # works
         sorted_index, _ = index.sortlevel(0)
@@ -1172,16 +1225,16 @@ def test_slice_locs_partial(self):
         sorted_idx, _ = self.index.sortlevel(0)

         result = sorted_idx.slice_locs(('foo', 'two'), ('qux', 'one'))
-        self.assertEqual(result, (1, 5))
+        assert result == (1, 5)

         result = sorted_idx.slice_locs(None, ('qux', 'one'))
-        self.assertEqual(result, (0, 5))
+        assert result == (0, 5)

         result = sorted_idx.slice_locs(('foo', 'two'), None)
-        self.assertEqual(result, (1, len(sorted_idx)))
+        assert result == (1, len(sorted_idx))

         result = sorted_idx.slice_locs('bar', 'baz')
-        self.assertEqual(result, (2, 4))
+        assert result == (2, 4)

     def test_slice_locs_not_contained(self):
         # some searchsorted action
         index = MultiIndex(levels=[[0, 2, 5, 7, 8], [0, 1, 3]],
                            labels=[[0, 0, 0, 1, 1, 2, 3, 3, 4],
@@ -1191,22 +1244,22 @@ def test_slice_locs_not_contained(self):
                                    [0, 1, 2, 1, 2, 2, 0, 1, 2]],
                            sortorder=0)

         result = index.slice_locs((1, 0), (5, 2))
-        self.assertEqual(result, (3, 6))
+        assert result == (3, 6)

         result = index.slice_locs(1, 5)
-        self.assertEqual(result, (3, 6))
+        assert result == (3, 6)

         result = index.slice_locs((2, 2), (5, 2))
-        self.assertEqual(result, (3, 6))
+        assert result == (3, 6)

         result = index.slice_locs(2, 5)
-        self.assertEqual(result, (3, 6))
+        assert result == (3, 6)

         result = index.slice_locs((1, 0), (6, 3))
-        self.assertEqual(result, (3, 8))
+        assert result == (3, 8)

         result = index.slice_locs(-1, 10)
-        self.assertEqual(result, (0, len(index)))
+        assert result == (0, len(index))
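The `slice_locs` assertions above reduce to one behavior: given start/end labels (full or partial keys), the method returns the positional `(start, stop)` window to slice with. For instance, on a hypothetical sorted index:

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["bar", "baz", "foo", "qux"],
                                 ["one", "two"]])

# (start, stop) positions; the end label is included in the window
assert mi.slice_locs(("baz", "two"), ("qux", "one")) == (3, 7)

# partial keys work too: everything under 'foo'
assert mi.slice_locs("foo", "foo") == (4, 6)
```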
     def test_consistency(self):
         # need to construct an overflow
@@ -1226,7 +1279,7 @@ def test_consistency(self):
         index = MultiIndex(levels=[major_axis, minor_axis],
                            labels=[major_labels, minor_labels])

-        self.assertFalse(index.is_unique)
+        assert not index.is_unique

     def test_truncate(self):
         major_axis = Index(lrange(4))
@@ -1239,18 +1292,18 @@ def test_truncate(self):
                            labels=[major_labels, minor_labels])

         result = index.truncate(before=1)
-        self.assertNotIn('foo', result.levels[0])
-        self.assertIn(1, result.levels[0])
+        assert 'foo' not in result.levels[0]
+        assert 1 in result.levels[0]

         result = index.truncate(after=1)
-        self.assertNotIn(2, result.levels[0])
-        self.assertIn(1, result.levels[0])
+        assert 2 not in result.levels[0]
+        assert 1 in result.levels[0]

         result = index.truncate(before=1, after=2)
-        self.assertEqual(len(result.levels[0]), 2)
+        assert len(result.levels[0]) == 2

         # after < before
-        self.assertRaises(ValueError, index.truncate, 3, 1)
+        pytest.raises(ValueError, index.truncate, 3, 1)

     def test_get_indexer(self):
         major_axis = Index(lrange(4))
@@ -1288,28 +1341,41 @@ def test_get_indexer(self):
         assert_almost_equal(r1, rbfill1)

         # pass non-MultiIndex
-        r1 = idx1.get_indexer(idx2._tuple_index)
+        r1 = idx1.get_indexer(idx2.values)
         rexp1 = idx1.get_indexer(idx2)
         assert_almost_equal(r1, rexp1)

         r1 = idx1.get_indexer([1, 2, 3])
-        self.assertTrue((r1 == [-1, -1, -1]).all())
+        assert (r1 == [-1, -1, -1]).all()

         # create index with duplicates
         idx1 = Index(lrange(10) + lrange(10))
         idx2 = Index(lrange(20))

         msg = "Reindexing only valid with uniquely valued Index objects"
-        with assertRaisesRegexp(InvalidIndexError, msg):
+        with tm.assert_raises_regex(InvalidIndexError, msg):
             idx1.get_indexer(idx2)

     def test_get_indexer_nearest(self):
         midx = MultiIndex.from_tuples([('a', 1), ('b', 2)])
-        with tm.assertRaises(NotImplementedError):
+        with pytest.raises(NotImplementedError):
             midx.get_indexer(['a'], method='nearest')
-        with tm.assertRaises(NotImplementedError):
+        with pytest.raises(NotImplementedError):
             midx.get_indexer(['a'], method='pad', tolerance=2)

+    def test_hash_collisions(self):
+        # non-smoke test that we don't get hash collisions
+
+        index = MultiIndex.from_product([np.arange(1000), np.arange(1000)],
+                                        names=['one', 'two'])
+        result = index.get_indexer(index.values)
+        tm.assert_numpy_array_equal(result, np.arange(
+            len(index), dtype='intp'))
+
+        for i in [0, 1, len(index) - 2, len(index) - 1]:
+            result = index.get_loc(index[i])
+            assert result == i
+
     def test_format(self):
         self.index.format()
         self.index[:0].format()
@@ -1325,7 +1391,7 @@ def test_format_sparse_display(self):
                                    [0, 1, 0, 0, 1, 0],
                                    [0, 0, 0, 0, 0, 0]])

         result = index.format()
-        self.assertEqual(result[3], '1  0  0  0')
+        assert result[3] == '1  0  0  0'

     def test_format_sparse_config(self):
         warn_filters = warnings.filters
@@ -1335,12 +1401,49 @@ def test_format_sparse_config(self):
         pd.set_option('display.multi_sparse', False)

         result = self.index.format()
-        self.assertEqual(result[1], 'foo  two')
+        assert result[1] == 'foo  two'

-        self.reset_display_options()
+        tm.reset_display_options()

         warnings.filters = warn_filters

+    def test_to_frame(self):
+        tuples = [(1, 'one'), (1, 'two'), (2, 'one'), (2, 'two')]
+
+        index = MultiIndex.from_tuples(tuples)
+        result = index.to_frame(index=False)
+        expected = DataFrame(tuples)
+        tm.assert_frame_equal(result, expected)
+
+        result = index.to_frame()
+        expected.index = index
+        tm.assert_frame_equal(result, expected)
+
+        tuples = [(1, 'one'), (1, 'two'), (2, 'one'), (2, 'two')]
+        index = MultiIndex.from_tuples(tuples, names=['first', 'second'])
+        result = index.to_frame(index=False)
+        expected = DataFrame(tuples)
+        expected.columns = ['first', 'second']
+        tm.assert_frame_equal(result, expected)
+
+        result = index.to_frame()
+        expected.index = index
+        tm.assert_frame_equal(result, expected)
+
+        index = MultiIndex.from_product([range(5),
+                                         pd.date_range('20130101',
+                                                       periods=3)])
+        result = index.to_frame(index=False)
+        expected = DataFrame(
+            {0: np.repeat(np.arange(5, dtype='int64'), 3),
+             1: np.tile(pd.date_range('20130101', periods=3), 5)})
+        tm.assert_frame_equal(result, expected)
+
+        index = MultiIndex.from_product([range(5),
+                                         pd.date_range('20130101',
+                                                       periods=3)])
+        result = index.to_frame()
+        expected.index = index
+        tm.assert_frame_equal(result, expected)
+
     def test_to_hierarchical(self):
         index = MultiIndex.from_tuples([(1, 'one'), (1, 'two'), (2, 'one'), (
             2, 'two')])
@@ -1349,7 +1452,7 @@ def test_to_hierarchical(self):
                               labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
                                       [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]])
         tm.assert_index_equal(result, expected)
-        self.assertEqual(result.names, index.names)
+        assert result.names == index.names

         # K > 1
         result = index.to_hierarchical(3, 2)
@@ -1357,7 +1460,7 @@ def test_to_hierarchical(self):
                               labels=[[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                                       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]])
         tm.assert_index_equal(result, expected)
-        self.assertEqual(result.names, index.names)
+        assert result.names == index.names

         # non-sorted
         index = MultiIndex.from_tuples([(2, 'c'), (1, 'b'),
@@ -1371,18 +1474,19 @@ def test_to_hierarchical(self):
                                         (2, 'b'), (2, 'b')],
                                        names=['N1', 'N2'])
         tm.assert_index_equal(result, expected)
-        self.assertEqual(result.names, index.names)
+        assert result.names == index.names

     def test_bounds(self):
         self.index._bounds

     def test_equals_multi(self):
-        self.assertTrue(self.index.equals(self.index))
-        self.assertTrue(self.index.equal_levels(self.index))
-
-        self.assertFalse(self.index.equals(self.index[:-1]))
+        assert self.index.equals(self.index)
+        assert not self.index.equals(self.index.values)
+        assert self.index.equals(Index(self.index.values))

-        self.assertTrue(self.index.equals(self.index._tuple_index))
+        assert self.index.equal_levels(self.index)
+        assert not self.index.equals(self.index[:-1])
+        assert not self.index.equals(self.index[-1])

         # different number of levels
         index = MultiIndex(levels=[Index(lrange(4)), Index(lrange(4)), Index(
@@ -1390,8 +1494,8 @@ def test_equals_multi(self):
                 [0, 1, 0, 0, 0, 1, 0, 1]), np.array([1, 0, 1, 1, 0, 0, 1, 0])])

         index2 = MultiIndex(levels=index.levels[:-1], labels=index.labels[:-1])
-        self.assertFalse(index.equals(index2))
-        self.assertFalse(index.equal_levels(index2))
+        assert not index.equals(index2)
+        assert not index.equal_levels(index2)

         # levels are different
         major_axis = Index(lrange(4))
@@ -1402,8 +1506,8 @@ def test_equals_multi(self):
         index = MultiIndex(levels=[major_axis, minor_axis],
                            labels=[major_labels, minor_labels])
-        self.assertFalse(self.index.equals(index))
-        self.assertFalse(self.index.equal_levels(index))
+        assert not self.index.equals(index)
+        assert not self.index.equal_levels(index)

         # some of the labels are different
         major_axis = Index(['foo', 'bar', 'baz', 'qux'])
@@ -1414,52 +1518,61 @@ def test_equals_multi(self):
         index = MultiIndex(levels=[major_axis, minor_axis],
                            labels=[major_labels, minor_labels])
-        self.assertFalse(self.index.equals(index))
+        assert not self.index.equals(index)
+
+    def test_equals_missing_values(self):
+        # make sure take is not using -1
+        i = pd.MultiIndex.from_tuples([(0, pd.NaT),
+                                       (0, pd.Timestamp('20130101'))])
+        result = i[0:1].equals(i[0])
+        assert not result
+        result = i[1:2].equals(i[1])
+        assert not result

     def test_identical(self):
         mi = self.index.copy()
         mi2 = self.index.copy()
-        self.assertTrue(mi.identical(mi2))
+        assert mi.identical(mi2)

         mi = mi.set_names(['new1', 'new2'])
-        self.assertTrue(mi.equals(mi2))
-        self.assertFalse(mi.identical(mi2))
+        assert mi.equals(mi2)
+        assert not mi.identical(mi2)

         mi2 = mi2.set_names(['new1', 'new2'])
-        self.assertTrue(mi.identical(mi2))
+        assert mi.identical(mi2)

         mi3 = Index(mi.tolist(), names=mi.names)
         mi4 = Index(mi.tolist(), names=mi.names, tupleize_cols=False)
-        self.assertTrue(mi.identical(mi3))
-        self.assertFalse(mi.identical(mi4))
-        self.assertTrue(mi.equals(mi4))
+        assert mi.identical(mi3)
+        assert not mi.identical(mi4)
+        assert mi.equals(mi4)

     def test_is_(self):
         mi = MultiIndex.from_tuples(lzip(range(10), range(10)))
-        self.assertTrue(mi.is_(mi))
-        self.assertTrue(mi.is_(mi.view()))
-        self.assertTrue(mi.is_(mi.view().view().view().view()))
+        assert mi.is_(mi)
+        assert mi.is_(mi.view())
+        assert mi.is_(mi.view().view().view().view())

         mi2 = mi.view()
         # names are metadata, they don't change id
         mi2.names = ["A", "B"]
-        self.assertTrue(mi2.is_(mi))
-        self.assertTrue(mi.is_(mi2))
+        assert mi2.is_(mi)
+        assert mi.is_(mi2)

-        self.assertTrue(mi.is_(mi.set_names(["C", "D"])))
+        assert mi.is_(mi.set_names(["C", "D"]))
         mi2 = mi.view()
         mi2.set_names(["E", "F"], inplace=True)
-        self.assertTrue(mi.is_(mi2))
+        assert mi.is_(mi2)

         # levels are inherent properties, they change identity
         mi3 = mi2.set_levels([lrange(10), lrange(10)])
-        self.assertFalse(mi3.is_(mi2))
+        assert not mi3.is_(mi2)
         # shouldn't change
-        self.assertTrue(mi2.is_(mi))
+        assert mi2.is_(mi)
         mi4 = mi3.view()
         mi4.set_levels([[1 for _ in range(10)], lrange(10)], inplace=True)
-        self.assertFalse(mi4.is_(mi3))
+        assert not mi4.is_(mi3)
         mi5 = mi.view()
         mi5.set_levels(mi5.levels, inplace=True)
-        self.assertFalse(mi5.is_(mi))
+        assert not mi5.is_(mi)

     def test_union(self):
         piece1 = self.index[:5][::-1]
@@ -1467,69 +1580,69 @@ def test_union(self):

         the_union = piece1 | piece2

-        tups = sorted(self.index._tuple_index)
+        tups = sorted(self.index.values)
         expected = MultiIndex.from_tuples(tups)

-        self.assertTrue(the_union.equals(expected))
+        assert the_union.equals(expected)

         # corner case, pass self or empty thing:
         the_union = self.index.union(self.index)
-        self.assertIs(the_union, self.index)
+        assert the_union is self.index

         the_union = self.index.union(self.index[:0])
-        self.assertIs(the_union, self.index)
+        assert the_union is self.index

         # won't work in python 3
-        # tuples = self.index._tuple_index
+        # tuples = self.index.values
         # result = self.index[:4] | tuples[4:]
-        # self.assertTrue(result.equals(tuples))
+        # assert result.equals(tuples)

     # not valid for python 3
     # def test_union_with_regular_index(self):
     #     other = Index(['A', 'B', 'C'])

     #     result = other.union(self.index)
-    #     self.assertIn(('foo', 'one'), result)
-    #     self.assertIn('B', result)
+    #     assert ('foo', 'one') in result
+    #     assert 'B' in result

     #     result2 = self.index.union(other)
-    #     self.assertTrue(result.equals(result2))
+    #     assert result.equals(result2)

     def test_intersection(self):
         piece1 = self.index[:5][::-1]
         piece2 = self.index[3:]

         the_int = piece1 & piece2
-        tups = sorted(self.index[3:5]._tuple_index)
+        tups = sorted(self.index[3:5].values)
         expected = MultiIndex.from_tuples(tups)
-        self.assertTrue(the_int.equals(expected))
+        assert the_int.equals(expected)

         # corner case, pass self
         the_int = self.index.intersection(self.index)
-        self.assertIs(the_int, self.index)
+        assert the_int is self.index

         # empty intersection: disjoint
         empty = self.index[:2] & self.index[2:]
         expected = self.index[:0]
-        self.assertTrue(empty.equals(expected))
+        assert empty.equals(expected)

         # can't do in python 3
-        # tuples = self.index._tuple_index
+        # tuples = self.index.values
         # result = self.index & tuples
-        # self.assertTrue(result.equals(tuples))
+        # assert result.equals(tuples)

     def test_sub(self):
         first = self.index

         # - now raises (previously was set op difference)
-        with tm.assertRaises(TypeError):
+        with pytest.raises(TypeError):
             first - self.index[-3:]
-        with tm.assertRaises(TypeError):
+        with pytest.raises(TypeError):
             self.index[-3:] - first
-        with tm.assertRaises(TypeError):
+        with pytest.raises(TypeError):
             self.index[-3:] - first.tolist()
-        with tm.assertRaises(TypeError):
+        with pytest.raises(TypeError):
             first.tolist() - self.index[-3:]
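The union/intersection/sub tests above lean on the usual set-operation semantics (with `-` now raising instead of behaving like `difference`). A compact sketch on toy data:

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1), ("b", 2)])

assert mi[:3].union(mi[1:]).equals(mi)
assert mi[:3].intersection(mi[1:]).equals(mi[1:3])
assert mi[:3].difference(mi[1:]).equals(mi[:1])
```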

     def test_difference(self):
@@ -1540,66 +1653,68 @@ def test_difference(self):
                                       sortorder=0,
                                       names=self.index.names)

-        tm.assertIsInstance(result, MultiIndex)
-        self.assertTrue(result.equals(expected))
-        self.assertEqual(result.names, self.index.names)
+        assert isinstance(result, MultiIndex)
+        assert result.equals(expected)
+        assert result.names == self.index.names

         # empty difference: reflexive
         result = self.index.difference(self.index)
         expected = self.index[:0]
-        self.assertTrue(result.equals(expected))
-        self.assertEqual(result.names, self.index.names)
+        assert result.equals(expected)
+        assert result.names == self.index.names

         # empty difference: superset
         result = self.index[-3:].difference(self.index)
         expected = self.index[:0]
-        self.assertTrue(result.equals(expected))
-        self.assertEqual(result.names, self.index.names)
+        assert result.equals(expected)
+        assert result.names == self.index.names

         # empty difference: degenerate
         result = self.index[:0].difference(self.index)
         expected = self.index[:0]
-        self.assertTrue(result.equals(expected))
-        self.assertEqual(result.names, self.index.names)
+        assert result.equals(expected)
+        assert result.names == self.index.names

         # names not the same
         chunklet = self.index[-3:]
         chunklet.names = ['foo', 'baz']
         result = first.difference(chunklet)
-        self.assertEqual(result.names, (None, None))
+        assert result.names == (None, None)

         # empty, but non-equal
         result = self.index.difference(self.index.sortlevel(1)[0])
-        self.assertEqual(len(result), 0)
+        assert len(result) == 0

         # raise Exception called with non-MultiIndex
-        result = first.difference(first._tuple_index)
-        self.assertTrue(result.equals(first[:0]))
+        result = first.difference(first.values)
+        assert result.equals(first[:0])

         # name from empty array
         result = first.difference([])
-        self.assertTrue(first.equals(result))
-        self.assertEqual(first.names, result.names)
+        assert first.equals(result)
+        assert first.names == result.names

         # name from non-empty array
         result = first.difference([('foo', 'one')])
         expected = pd.MultiIndex.from_tuples([('bar', 'one'), ('baz', 'two'), (
             'foo', 'two'), ('qux', 'one'), ('qux', 'two')])
         expected.names = first.names
-        self.assertEqual(first.names, result.names)
-        assertRaisesRegexp(TypeError, "other must be a MultiIndex or a list"
-                           " of tuples", first.difference, [1, 2, 3, 4, 5])
+        assert first.names == result.names
+        tm.assert_raises_regex(TypeError, "other must be a MultiIndex "
+                               "or a list of tuples",
+                               first.difference, [1, 2, 3, 4, 5])

     def test_from_tuples(self):
-        assertRaisesRegexp(TypeError, 'Cannot infer number of levels from'
-                           ' empty list', MultiIndex.from_tuples, [])
+        tm.assert_raises_regex(TypeError, 'Cannot infer number of levels '
+                               'from empty list',
+                               MultiIndex.from_tuples, [])

         idx = MultiIndex.from_tuples(((1, 2), (3, 4)), names=['a', 'b'])
-        self.assertEqual(len(idx), 2)
+        assert len(idx) == 2

     def test_argsort(self):
         result = self.index.argsort()
-        expected = self.index._tuple_index.argsort()
+        expected = self.index.values.argsort()
         tm.assert_numpy_array_equal(result, expected)

     def test_sortlevel(self):
@@ -1612,23 +1727,23 @@ def test_sortlevel(self):
         sorted_idx, _ = index.sortlevel(0)
         expected = MultiIndex.from_tuples(sorted(tuples))
-        self.assertTrue(sorted_idx.equals(expected))
+        assert sorted_idx.equals(expected)

         sorted_idx, _ = index.sortlevel(0, ascending=False)
-        self.assertTrue(sorted_idx.equals(expected[::-1]))
+        assert sorted_idx.equals(expected[::-1])

         sorted_idx, _ = index.sortlevel(1)
         by1 = sorted(tuples, key=lambda x: (x[1], x[0]))
         expected = MultiIndex.from_tuples(by1)
-        self.assertTrue(sorted_idx.equals(expected))
+        assert sorted_idx.equals(expected)

         sorted_idx, _ = index.sortlevel(1, ascending=False)
-        self.assertTrue(sorted_idx.equals(expected[::-1]))
+        assert sorted_idx.equals(expected[::-1])

     def test_sortlevel_not_sort_remaining(self):
         mi = MultiIndex.from_tuples([[1, 1, 3], [1, 1, 1]], names=list('ABC'))
         sorted_idx, _ = mi.sortlevel('A', sort_remaining=False)
-        self.assertTrue(sorted_idx.equals(mi))
+        assert sorted_idx.equals(mi)

     def test_sortlevel_deterministic(self):
         tuples = [('bar', 'one'), ('foo', 'two'), ('qux', 'two'),
@@ -1638,18 +1753,18 @@ def test_sortlevel_deterministic(self):

         sorted_idx, _ = index.sortlevel(0)
         expected = MultiIndex.from_tuples(sorted(tuples))
-        self.assertTrue(sorted_idx.equals(expected))
+        assert sorted_idx.equals(expected)

         sorted_idx, _ = index.sortlevel(0, ascending=False)
-        self.assertTrue(sorted_idx.equals(expected[::-1]))
+        assert sorted_idx.equals(expected[::-1])

         sorted_idx, _ = index.sortlevel(1)
         by1 = sorted(tuples, key=lambda x: (x[1], x[0]))
         expected = MultiIndex.from_tuples(by1)
-        self.assertTrue(sorted_idx.equals(expected))
+        assert sorted_idx.equals(expected)

         sorted_idx, _ = index.sortlevel(1, ascending=False)
-        self.assertTrue(sorted_idx.equals(expected[::-1]))
+        assert sorted_idx.equals(expected[::-1])

     def test_dims(self):
         pass

@@ -1661,66 +1776,66 @@ def test_drop(self):
         dropped2 = self.index.drop(index)

         expected = self.index[[0, 2, 3, 5]]
-        self.assert_index_equal(dropped, expected)
-        self.assert_index_equal(dropped2, expected)
+        tm.assert_index_equal(dropped, expected)
+        tm.assert_index_equal(dropped2, expected)

         dropped = self.index.drop(['bar'])
         expected = self.index[[0, 1, 3, 4, 5]]
-        self.assert_index_equal(dropped, expected)
+        tm.assert_index_equal(dropped, expected)

         dropped = self.index.drop('foo')
         expected = self.index[[2, 3, 4, 5]]
-        self.assert_index_equal(dropped, expected)
+        tm.assert_index_equal(dropped, expected)

         index = MultiIndex.from_tuples([('bar', 'two')])
-        self.assertRaises(KeyError, self.index.drop, [('bar', 'two')])
-        self.assertRaises(KeyError, self.index.drop, index)
-        self.assertRaises(KeyError, self.index.drop, ['foo', 'two'])
+        pytest.raises(KeyError, self.index.drop, [('bar', 'two')])
+        pytest.raises(KeyError, self.index.drop, index)
+        pytest.raises(KeyError, self.index.drop, ['foo', 'two'])

         # partially correct argument
         mixed_index = MultiIndex.from_tuples([('qux', 'one'), ('bar', 'two')])
-        self.assertRaises(KeyError, self.index.drop, mixed_index)
+        pytest.raises(KeyError, self.index.drop, mixed_index)

         # error='ignore'
         dropped = self.index.drop(index, errors='ignore')
         expected = self.index[[0, 1, 2, 3, 4, 5]]
-        self.assert_index_equal(dropped, expected)
+        tm.assert_index_equal(dropped, expected)

         dropped = self.index.drop(mixed_index, errors='ignore')
         expected = self.index[[0, 1, 2, 3, 5]]
-        self.assert_index_equal(dropped, expected)
+        tm.assert_index_equal(dropped, expected)

         dropped = self.index.drop(['foo', 'two'], errors='ignore')
         expected = self.index[[2, 3, 4, 5]]
-        self.assert_index_equal(dropped, expected)
+        tm.assert_index_equal(dropped, expected)

         # mixed partial / full drop
         dropped = self.index.drop(['foo', ('qux', 'one')])
         expected = self.index[[2, 3, 5]]
-        self.assert_index_equal(dropped, expected)
+        tm.assert_index_equal(dropped, expected)

         # mixed partial / full drop / error='ignore'
         mixed_index = ['foo', ('qux', 'one'), 'two']
-        self.assertRaises(KeyError, self.index.drop, mixed_index)
+        pytest.raises(KeyError, self.index.drop, mixed_index)
         dropped = self.index.drop(mixed_index, errors='ignore')
         expected = self.index[[2, 3, 5]]
-        self.assert_index_equal(dropped, expected)
+        tm.assert_index_equal(dropped, expected)

     def test_droplevel_with_names(self):
         index = self.index[self.index.get_loc('foo')]
         dropped = index.droplevel(0)
-        self.assertEqual(dropped.name, 'second')
+        assert dropped.name == 'second'

         index = MultiIndex(levels=[Index(lrange(4)), Index(lrange(4)), Index(
             lrange(4))], labels=[np.array([0, 0, 1, 2, 2, 2, 3, 3]), np.array(
                 [0, 1, 0, 0, 0, 1, 0, 1]), np.array([1, 0, 1, 1, 0, 0, 1, 0])],
             names=['one', 'two', 'three'])
         dropped = index.droplevel(0)
-        self.assertEqual(dropped.names, ('two', 'three'))
+        assert dropped.names == ('two', 'three')

         dropped = index.droplevel('two')
         expected = index.droplevel(1)
-        self.assertTrue(dropped.equals(expected))
+        assert dropped.equals(expected)

     def test_droplevel_multiple(self):
         index = MultiIndex(levels=[Index(lrange(4)), Index(lrange(4)), Index(
@@ -1730,7 +1845,7 @@ def test_droplevel_multiple(self):

         dropped = index[:2].droplevel(['three', 'one'])
         expected = index[:2].droplevel(2).droplevel(0)
-        self.assertTrue(dropped.equals(expected))
+        assert dropped.equals(expected)

     def test_drop_not_lexsorted(self):
         # GH 12078
@@ -1738,7 +1853,7 @@ def test_drop_not_lexsorted(self):
         # define the lexsorted version of the multi-index
         tuples = [('a', ''), ('b1', 'c1'), ('b2', 'c2')]
         lexsorted_mi = MultiIndex.from_tuples(tuples, names=['b', 'c'])
-        self.assertTrue(lexsorted_mi.is_lexsorted())
+        assert lexsorted_mi.is_lexsorted()

         # and the not-lexsorted version
         df = pd.DataFrame(columns=['a', 'b', 'c', 'd'],
@@ -1746,19 +1861,19 @@ def test_drop_not_lexsorted(self):
         df = df.pivot_table(index='a', columns=['b', 'c'], values='d')
         df = df.reset_index()
         not_lexsorted_mi = df.columns
-        self.assertFalse(not_lexsorted_mi.is_lexsorted())
+        assert not not_lexsorted_mi.is_lexsorted()

         # compare the results
-        self.assert_index_equal(lexsorted_mi, not_lexsorted_mi)
-        with self.assert_produces_warning(PerformanceWarning):
-            self.assert_index_equal(lexsorted_mi.drop('a'),
-                                    not_lexsorted_mi.drop('a'))
+        tm.assert_index_equal(lexsorted_mi, not_lexsorted_mi)
+        with tm.assert_produces_warning(PerformanceWarning):
+            tm.assert_index_equal(lexsorted_mi.drop('a'),
+                                  not_lexsorted_mi.drop('a'))

     def test_insert(self):
         # key contained in all levels
         new_index = self.index.insert(0, ('bar', 'two'))
-        self.assertTrue(new_index.equal_levels(self.index))
-        self.assertEqual(new_index[0], ('bar', 'two'))
+        assert new_index.equal_levels(self.index)
+        assert new_index[0] == ('bar', 'two')

         # key not contained in all levels
         new_index = self.index.insert(0, ('abc', 'three'))
@@ -1768,11 +1883,11 @@ def test_insert(self):
         exp1 = Index(list(self.index.levels[1]) + ['three'], name='second')
         tm.assert_index_equal(new_index.levels[1], exp1)
-        self.assertEqual(new_index[0], ('abc', 'three'))
+        assert new_index[0] == ('abc', 'three')

         # key wrong length
         msg = "Item must have length equal to number of levels"
-        with assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             self.index.insert(0, ('foo2', ))

         left = pd.DataFrame([['a', 'b', 0], ['b', 'd', 1]],
@@ -1813,7 +1928,7 @@ def test_insert(self):
                             pd.MultiIndex.from_tuples(idx[:-2]))

         left.loc[('test', 17)] = 11
-        left.ix[('test', 18)] = 12
+        left.loc[('test', 18)] = 12

         right = pd.Series(np.linspace(0, 12, 13),
                           pd.MultiIndex.from_tuples(idx))
@@ -1822,7 +1937,7 @@ def test_insert(self):

     def test_take_preserve_name(self):
         taken = self.index.take([3, 0, 1])
-        self.assertEqual(taken.names, self.index.names)
+        assert taken.names == self.index.names

     def test_take_fill_value(self):
         # GH 12631
@@ -1856,12 +1971,12 @@ def test_take_fill_value(self):

         msg = ('When allow_fill=True and fill_value is not None, '
                'all indices must be >= -1')
-        with tm.assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             idx.take(np.array([1, 0, -2]), fill_value=True)
-        with tm.assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             idx.take(np.array([1, 0, -5]), fill_value=True)

-        with tm.assertRaises(IndexError):
+        with pytest.raises(IndexError):
             idx.take(np.array([1, -5]))

     def take_invalid_kwargs(self):
@@ -1871,16 +1986,16 @@ def take_invalid_kwargs(self):
         indices = [1, 2]

         msg = r"take\(\) got an unexpected keyword argument 'foo'"
-        tm.assertRaisesRegexp(TypeError, msg, idx.take,
-                              indices, foo=2)
+        tm.assert_raises_regex(TypeError, msg, idx.take,
+                               indices, foo=2)

         msg = "the 'out' parameter is not supported"
-        tm.assertRaisesRegexp(ValueError, msg, idx.take,
-                              indices, out=indices)
+        tm.assert_raises_regex(ValueError, msg, idx.take,
+                               indices, out=indices)

         msg = "the 'mode' parameter is not supported"
-        tm.assertRaisesRegexp(ValueError, msg, idx.take,
-                              indices, mode='clip')
+        tm.assert_raises_regex(ValueError, msg, idx.take,
+                               indices, mode='clip')

     def test_join_level(self):
         def _check_how(other, how):
@@ -1889,8 +2004,8 @@ def _check_how(other, how):
                                                          return_indexers=True)

             exp_level = other.join(self.index.levels[1], how=how)
-            self.assertTrue(join_index.levels[0].equals(self.index.levels[0]))
-            self.assertTrue(join_index.levels[1].equals(exp_level))
+            assert join_index.levels[0].equals(self.index.levels[0])
+            assert join_index.levels[1].equals(exp_level)

             # pare down levels
             mask = np.array(
@@ -1903,7 +2018,7 @@ def _check_how(other, how):
                 self.index.join(other, how=how, level='second',
                                 return_indexers=True)

-            self.assertTrue(join_index.equals(join_index2))
+            assert join_index.equals(join_index2)
             tm.assert_numpy_array_equal(lidx, lidx2)
             tm.assert_numpy_array_equal(ridx, ridx2)
             tm.assert_numpy_array_equal(join_index2.values, exp_values)
@@ -1921,17 +2036,17 @@ def _check_all(other):
         # some corner cases
         idx = Index(['three', 'one', 'two'])
         result = idx.join(self.index, level='second')
-        tm.assertIsInstance(result, MultiIndex)
+        assert isinstance(result, MultiIndex)

-        assertRaisesRegexp(TypeError, "Join.*MultiIndex.*ambiguous",
-                           self.index.join, self.index, level=1)
+        tm.assert_raises_regex(TypeError, "Join.*MultiIndex.*ambiguous",
+                               self.index.join, self.index, level=1)

     def test_join_self(self):
         kinds = 'outer', 'inner', 'left', 'right'
         for kind in kinds:
             res = self.index
             joined = res.join(res, how=kind)
-            self.assertIs(res, joined)
+            assert res is joined

     def test_join_multi(self):
         # GH 10665
@@ -1945,36 +2060,36 @@ def test_join_multi(self):
                                          [np.arange(4), [1, 2]],
                                          names=['a', 'b'])
         exp_lidx = np.array([1, 2, 5, 6, 9, 10, 13, 14], dtype=np.intp)
         exp_ridx = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=np.intp)
-        self.assert_index_equal(jidx, exp_idx)
-        self.assert_numpy_array_equal(lidx, exp_lidx)
-        self.assert_numpy_array_equal(ridx, exp_ridx)
+        tm.assert_index_equal(jidx, exp_idx)
+        tm.assert_numpy_array_equal(lidx, exp_lidx)
+        tm.assert_numpy_array_equal(ridx, exp_ridx)

         # flip
         jidx, ridx, lidx = idx.join(midx, how='inner', return_indexers=True)
-        self.assert_index_equal(jidx, exp_idx)
-        self.assert_numpy_array_equal(lidx, exp_lidx)
-        self.assert_numpy_array_equal(ridx, exp_ridx)
+        tm.assert_index_equal(jidx, exp_idx)
+        tm.assert_numpy_array_equal(lidx, exp_lidx)
+        tm.assert_numpy_array_equal(ridx, exp_ridx)

         # keep MultiIndex
         jidx, lidx, ridx = midx.join(idx, how='left', return_indexers=True)
         exp_ridx = np.array([-1, 0, 1, -1, -1, 0, 1, -1, -1, 0, 1, -1, -1, 0,
                              1, -1], dtype=np.intp)
-        self.assert_index_equal(jidx, midx)
-        self.assertIsNone(lidx)
-        self.assert_numpy_array_equal(ridx, exp_ridx)
+        tm.assert_index_equal(jidx, midx)
+        assert lidx is None
+        tm.assert_numpy_array_equal(ridx, exp_ridx)

         # flip
         jidx, ridx, lidx = idx.join(midx, how='right', return_indexers=True)
-        self.assert_index_equal(jidx, midx)
-        self.assertIsNone(lidx)
-        self.assert_numpy_array_equal(ridx, exp_ridx)
+        tm.assert_index_equal(jidx, midx)
+        assert lidx is None
+        tm.assert_numpy_array_equal(ridx, exp_ridx)

     def test_reindex(self):
         result, indexer = self.index.reindex(list(self.index[:4]))
-        tm.assertIsInstance(result, MultiIndex)
+        assert isinstance(result, MultiIndex)
         self.check_level_names(result, self.index[:4].names)

         result, indexer = self.index.reindex(list(self.index))
-        tm.assertIsInstance(result, MultiIndex)
-        self.assertIsNone(indexer)
+        assert isinstance(result, MultiIndex)
+        assert indexer is None
         self.check_level_names(result, self.index.names)

     def test_reindex_level(self):
@@ -1986,28 +2101,29 @@ def test_reindex_level(self):
         exp_index = self.index.join(idx, level='second', how='right')
         exp_index2 = self.index.join(idx, level='second', how='left')

-        self.assertTrue(target.equals(exp_index))
+        assert target.equals(exp_index)
         exp_indexer = np.array([0, 2, 4])
         tm.assert_numpy_array_equal(indexer, exp_indexer, check_dtype=False)

-        self.assertTrue(target2.equals(exp_index2))
+        assert target2.equals(exp_index2)
         exp_indexer2 = np.array([0, -1, 0, -1, 0, -1])
         tm.assert_numpy_array_equal(indexer2, exp_indexer2, check_dtype=False)

-        assertRaisesRegexp(TypeError, "Fill method not supported",
-                           self.index.reindex, self.index, method='pad',
-                           level='second')
+        tm.assert_raises_regex(TypeError, "Fill method not supported",
+                               self.index.reindex, self.index,
+                               method='pad', level='second')

-        assertRaisesRegexp(TypeError, "Fill method not supported", idx.reindex,
-                           idx, method='bfill', level='first')
+        tm.assert_raises_regex(TypeError, "Fill method not supported",
+                               idx.reindex, idx, method='bfill',
+                               level='first')

     def test_duplicates(self):
-        self.assertFalse(self.index.has_duplicates)
-        self.assertTrue(self.index.append(self.index).has_duplicates)
+        assert not self.index.has_duplicates
+        assert self.index.append(self.index).has_duplicates

         index = MultiIndex(levels=[[0, 1], [0, 1, 2]], labels=[
             [0, 0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 0, 1, 2]])
-        self.assertTrue(index.has_duplicates)
+        assert index.has_duplicates

         # GH 9075
         t = [(u('x'), u('out'), u('z'), 5, u('y'), u('in'), u('z'), 169),
@@ -2030,7 +2146,7 @@ def test_duplicates(self):
              (u('x'), u('out'), u('z'), 12, u('y'), u('in'), u('z'), 144)]

         index = pd.MultiIndex.from_tuples(t)
-        self.assertFalse(index.has_duplicates)
+        assert not index.has_duplicates

         # handle int64 overflow if possible
         def check(nlevels, with_nulls):
@@ -2051,7 +2167,7 @@ def check(nlevels, with_nulls):

             # no dups
             index = MultiIndex(levels=levels, labels=labels)
-            self.assertFalse(index.has_duplicates)
+            assert not index.has_duplicates

             # with a dup
             if with_nulls:
@@ -2062,7 +2178,7 @@ def check(nlevels, with_nulls):
                 values = index.values.tolist()
                 index = MultiIndex.from_tuples(values + [values[0]])

-            self.assertTrue(index.has_duplicates)
+            assert index.has_duplicates

         # no overflow
         check(4, False)
@@ -2080,14 +2196,14 @@ def check(nlevels, with_nulls):

         for keep in ['first', 'last', False]:
             left = mi.duplicated(keep=keep)
-            right = pd.hashtable.duplicated_object(mi.values, keep=keep)
+            right = pd._libs.hashtable.duplicated_object(mi.values, keep=keep)
             tm.assert_numpy_array_equal(left, right)

         # GH5873
         for a in [101, 102]:
             mi = MultiIndex.from_arrays([[101, a], [3.5, np.nan]])
-            self.assertFalse(mi.has_duplicates)
-            self.assertEqual(mi.get_duplicates(), [])
+            assert not mi.has_duplicates
+            assert mi.get_duplicates() == []
             tm.assert_numpy_array_equal(mi.duplicated(), np.zeros(
                 2, dtype='bool'))

@@ -2097,9 +2213,9 @@ def check(nlevels, with_nulls):
                 lab = product(range(-1, n), range(-1, m))
                 mi = MultiIndex(levels=[list('abcde')[:n], list('WXYZ')[:m]],
                                 labels=np.random.permutation(list(lab)).T)
-                self.assertEqual(len(mi), (n + 1) * (m + 1))
-                self.assertFalse(mi.has_duplicates)
-                self.assertEqual(mi.get_duplicates(), [])
+                assert len(mi) == (n + 1) * (m + 1)
+                assert not mi.has_duplicates
+                assert mi.get_duplicates() == []
                 tm.assert_numpy_array_equal(mi.duplicated(), np.zeros(
                     len(mi), dtype='bool'))

@@ -2111,8 +2227,8 @@ def test_duplicate_meta_data(self):
                 index.set_names([None, None]),
                 index.set_names([None, 'Num']),
                 index.set_names(['Upper', 'Num']), ]:
-            self.assertTrue(idx.has_duplicates)
-            self.assertEqual(idx.drop_duplicates().names, idx.names)
+            assert idx.has_duplicates
+            assert idx.drop_duplicates().names == idx.names

     def test_get_unique_index(self):
         idx = self.index[[0, 1, 0, 1, 1, 0, 0]]
@@ -2120,8 +2236,8 @@ def test_get_unique_index(self):

         for dropna in [False, True]:
             result = idx._get_unique_index(dropna=dropna)
-            self.assertTrue(result.unique)
-            self.assert_index_equal(result, expected)
+            assert result.unique
+            tm.assert_index_equal(result, expected)

     def test_unique(self):
         mi = pd.MultiIndex.from_arrays([[1, 2, 1, 2], [1, 1, 1, 2]])
@@ -2158,14 +2274,13 @@ def test_unique_datetimelike(self):

     def test_tolist(self):
         result = self.index.tolist()
         exp = list(self.index.values)
-        self.assertEqual(result, exp)
+        assert result == exp

     def test_repr_with_unicode_data(self):
         with pd.core.config.option_context("display.encoding", 'UTF-8'):
             d = {"a": [u("\u05d0"), 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}
             index = pd.DataFrame(d).set_index(["a", "b"]).index
-            self.assertFalse("\\u" in repr(index)
-                             )  # we don't want unicode-escaped
+            assert "\\u" not in repr(index)  # we don't want unicode-escaped

     def test_repr_roundtrip(self):

@@ -2179,10 +2294,8 @@ def test_repr_roundtrip(self):
             result = eval(repr(mi))
             # string coerces to unicode
             tm.assert_index_equal(result, mi, exact=False)
-            self.assertEqual(
-                mi.get_level_values('first').inferred_type, 'string')
-            self.assertEqual(
-                result.get_level_values('first').inferred_type, 'unicode')
+            assert mi.get_level_values('first').inferred_type == 'string'
+            assert result.get_level_values('first').inferred_type == 'unicode'

         mi_u = MultiIndex.from_product(
             [list(u'ab'), range(3)], names=['first', 'second'])
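Several hunks above (`test_duplicates`, `test_duplicate_meta_data`, `test_get_unique_index`) probe the same trio of duplicate APIs. Their basic behavior, sketched on toy data:

```python
import pandas as pd

mi = pd.MultiIndex.from_arrays([[1, 1, 2], ["a", "a", "b"]])

assert mi.has_duplicates
# keep='first' flags every occurrence after the first
assert mi.duplicated(keep="first").tolist() == [False, True, False]
assert not mi.drop_duplicates().has_duplicates
```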
result.get_level_values('first').inferred_type, 'unicode') + assert mi.get_level_values('first').inferred_type == 'string' + assert result.get_level_values('first').inferred_type == 'unicode' mi_u = MultiIndex.from_product( [list(u'ab'), range(3)], names=['first', 'second']) @@ -2198,7 +2311,6 @@ def test_repr_roundtrip(self): # long format mi = MultiIndex.from_product([list('abcdefg'), range(10)], names=['first', 'second']) - result = str(mi) if PY3: tm.assert_index_equal(eval(repr(mi)), mi, exact=True) @@ -2206,13 +2318,9 @@ def test_repr_roundtrip(self): result = eval(repr(mi)) # string coerces to unicode tm.assert_index_equal(result, mi, exact=False) - self.assertEqual( - mi.get_level_values('first').inferred_type, 'string') - self.assertEqual( - result.get_level_values('first').inferred_type, 'unicode') + assert mi.get_level_values('first').inferred_type == 'string' + assert result.get_level_values('first').inferred_type == 'unicode' - mi = MultiIndex.from_product( - [list(u'abcdefg'), range(10)], names=['first', 'second']) result = eval(repr(mi_u)) tm.assert_index_equal(result, mi_u, exact=True) @@ -2241,13 +2349,13 @@ def test_bytestring_with_unicode(self): def test_slice_keep_name(self): x = MultiIndex.from_tuples([('a', 'b'), (1, 2), ('c', 'd')], names=['x', 'y']) - self.assertEqual(x[1:].names, x.names) + assert x[1:].names == x.names def test_isnull_behavior(self): # should not segfault GH5123 # NOTE: if MI representation changes, may make sense to allow # isnull(MI) - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.isnull(self.index) def test_level_setting_resets_attributes(self): @@ -2257,9 +2365,132 @@ def test_level_setting_resets_attributes(self): assert ind.is_monotonic ind.set_levels([['A', 'B', 'A', 'A', 'B'], [2, 1, 3, -2, 5]], inplace=True) + # if this fails, probably didn't reset the cache correctly. 
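[Aside: a minimal runnable sketch of the cache behaviour the comment above refers to — `is_monotonic` is computed lazily and cached, so an in-place `set_levels` must invalidate it. Values here are illustrative, not the test's own.]

```python
import pandas as pd

# A sorted MultiIndex; asking for is_monotonic populates the cache.
ind = pd.MultiIndex.from_tuples([("A", 1), ("B", 2), ("C", 3)])
assert ind.is_monotonic

# Mutating the levels in place must reset that cache ...
ind.set_levels([["C", "B", "A"], [3, 2, 1]], inplace=True)

# ... otherwise a stale True would be returned by the assert below.
assert not ind.is_monotonic
```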
assert not ind.is_monotonic + def test_is_monotonic(self): + i = MultiIndex.from_product([np.arange(10), + np.arange(10)], names=['one', 'two']) + assert i.is_monotonic + assert Index(i.values).is_monotonic + + i = MultiIndex.from_product([np.arange(10, 0, -1), + np.arange(10)], names=['one', 'two']) + assert not i.is_monotonic + assert not Index(i.values).is_monotonic + + i = MultiIndex.from_product([np.arange(10), + np.arange(10, 0, -1)], + names=['one', 'two']) + assert not i.is_monotonic + assert not Index(i.values).is_monotonic + + i = MultiIndex.from_product([[1.0, np.nan, 2.0], ['a', 'b', 'c']]) + assert not i.is_monotonic + assert not Index(i.values).is_monotonic + + # string ordering + i = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], + ['one', 'two', 'three']], + labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], + [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], + names=['first', 'second']) + assert not i.is_monotonic + assert not Index(i.values).is_monotonic + + i = MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], + ['mom', 'next', 'zenith']], + labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], + [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], + names=['first', 'second']) + assert i.is_monotonic + assert Index(i.values).is_monotonic + + # mixed levels, hits the TypeError + i = MultiIndex( + levels=[[1, 2, 3, 4], ['gb00b03mlx29', 'lu0197800237', + 'nl0000289783', + 'nl0000289965', 'nl0000301109']], + labels=[[0, 1, 1, 2, 2, 2, 3], [4, 2, 0, 0, 1, 3, -1]], + names=['household_id', 'asset_id']) + + assert not i.is_monotonic + + def test_reconstruct_sort(self): + + # starts off lexsorted & monotonic + mi = MultiIndex.from_arrays([ + ['A', 'A', 'B', 'B', 'B'], [1, 2, 1, 2, 3] + ]) + assert mi.is_lexsorted() + assert mi.is_monotonic + + recons = mi._sort_levels_monotonic() + assert recons.is_lexsorted() + assert recons.is_monotonic + assert mi is recons + + assert mi.equals(recons) + assert Index(mi.values).equals(Index(recons.values)) + + # cannot convert to lexsorted + mi = pd.MultiIndex.from_tuples([('z', 'a'), ('x', 'a'), ('y', 'b'), + ('x', 'b'), ('y', 'a'), ('z', 'b')], + names=['one', 'two']) + assert not mi.is_lexsorted() + assert not mi.is_monotonic + + recons = mi._sort_levels_monotonic() + assert not recons.is_lexsorted() + assert not recons.is_monotonic + + assert mi.equals(recons) + assert Index(mi.values).equals(Index(recons.values)) + + # cannot convert to lexsorted + mi = MultiIndex(levels=[['b', 'd', 'a'], [1, 2, 3]], + labels=[[0, 1, 0, 2], [2, 0, 0, 1]], + names=['col1', 'col2']) + assert not mi.is_lexsorted() + assert not mi.is_monotonic + + recons = mi._sort_levels_monotonic() + assert not recons.is_lexsorted() + assert not recons.is_monotonic + + assert mi.equals(recons) + assert Index(mi.values).equals(Index(recons.values)) + + def test_reconstruct_remove_unused(self): + # xref to GH 2770 + df = DataFrame([['deleteMe', 1, 9], + ['keepMe', 2, 9], + ['keepMeToo', 3, 9]], + columns=['first', 'second', 'third']) + df2 = df.set_index(['first', 'second'], drop=False) + df2 = df2[df2['first'] != 'deleteMe'] + + # removed levels are there + expected = MultiIndex(levels=[['deleteMe', 'keepMe', 'keepMeToo'], + [1, 2, 3]], + labels=[[1, 2], [1, 2]], + names=['first', 'second']) + result = df2.index + tm.assert_index_equal(result, expected) + + expected = MultiIndex(levels=[['keepMe', 'keepMeToo'], + [2, 3]], + labels=[[0, 1], [0, 1]], + names=['first', 'second']) + result = df2.index.remove_unused_levels() + tm.assert_index_equal(result, expected) + + # idempotent + result2 = result.remove_unused_levels() + 
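[For context on `remove_unused_levels()` being exercised here (xref GH 2770), a hedged standalone sketch — the frame and labels are illustrative, mirroring the test data above:]

```python
import pandas as pd

df = pd.DataFrame({'first': ['deleteMe', 'keepMe', 'keepMeToo'],
                   'second': [1, 2, 3],
                   'third': 9}).set_index(['first', 'second'])
df = df[df.index.get_level_values('first') != 'deleteMe']

# Filtering drops rows but keeps the now-unused level entries ...
assert 'deleteMe' in df.index.levels[0]

# ... and the new method produces an equivalent index without them.
trimmed = df.index.remove_unused_levels()
assert 'deleteMe' not in trimmed.levels[0]
```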
tm.assert_index_equal(result2, expected) + assert result2 is result + def test_isin(self): values = [('foo', 2), ('bar', 3), ('quux', 4)] @@ -2272,8 +2503,8 @@ def test_isin(self): # empty, return dtype bool idx = MultiIndex.from_arrays([[], []]) result = idx.isin(values) - self.assertEqual(len(result), 0) - self.assertEqual(result.dtype, np.bool_) + assert len(result) == 0 + assert result.dtype == np.bool_ def test_isin_nan(self): idx = MultiIndex.from_arrays([['foo', 'bar'], [1.0, np.nan]]) @@ -2296,18 +2527,18 @@ def test_isin_level_kwarg(self): tm.assert_numpy_array_equal(expected, idx.isin(vals_1, level=1)) tm.assert_numpy_array_equal(expected, idx.isin(vals_1, level=-1)) - self.assertRaises(IndexError, idx.isin, vals_0, level=5) - self.assertRaises(IndexError, idx.isin, vals_0, level=-5) + pytest.raises(IndexError, idx.isin, vals_0, level=5) + pytest.raises(IndexError, idx.isin, vals_0, level=-5) - self.assertRaises(KeyError, idx.isin, vals_0, level=1.0) - self.assertRaises(KeyError, idx.isin, vals_1, level=-1.0) - self.assertRaises(KeyError, idx.isin, vals_1, level='A') + pytest.raises(KeyError, idx.isin, vals_0, level=1.0) + pytest.raises(KeyError, idx.isin, vals_1, level=-1.0) + pytest.raises(KeyError, idx.isin, vals_1, level='A') idx.names = ['A', 'B'] tm.assert_numpy_array_equal(expected, idx.isin(vals_0, level='A')) tm.assert_numpy_array_equal(expected, idx.isin(vals_1, level='B')) - self.assertRaises(KeyError, idx.isin, vals_1, level='C') + pytest.raises(KeyError, idx.isin, vals_1, level='C') def test_reindex_preserves_names_when_target_is_list_or_ndarray(self): # GH6552 @@ -2318,39 +2549,33 @@ def test_reindex_preserves_names_when_target_is_list_or_ndarray(self): other_dtype = pd.MultiIndex.from_product([[1, 2], [3, 4]]) # list & ndarray cases - self.assertEqual(idx.reindex([])[0].names, [None, None]) - self.assertEqual(idx.reindex(np.array([]))[0].names, [None, None]) - self.assertEqual(idx.reindex(target.tolist())[0].names, [None, None]) - self.assertEqual(idx.reindex(target.values)[0].names, [None, None]) - self.assertEqual( - idx.reindex(other_dtype.tolist())[0].names, [None, None]) - self.assertEqual( - idx.reindex(other_dtype.values)[0].names, [None, None]) + assert idx.reindex([])[0].names == [None, None] + assert idx.reindex(np.array([]))[0].names == [None, None] + assert idx.reindex(target.tolist())[0].names == [None, None] + assert idx.reindex(target.values)[0].names == [None, None] + assert idx.reindex(other_dtype.tolist())[0].names == [None, None] + assert idx.reindex(other_dtype.values)[0].names == [None, None] idx.names = ['foo', 'bar'] - self.assertEqual(idx.reindex([])[0].names, ['foo', 'bar']) - self.assertEqual(idx.reindex(np.array([]))[0].names, ['foo', 'bar']) - self.assertEqual(idx.reindex(target.tolist())[0].names, ['foo', 'bar']) - self.assertEqual(idx.reindex(target.values)[0].names, ['foo', 'bar']) - self.assertEqual( - idx.reindex(other_dtype.tolist())[0].names, ['foo', 'bar']) - self.assertEqual( - idx.reindex(other_dtype.values)[0].names, ['foo', 'bar']) + assert idx.reindex([])[0].names == ['foo', 'bar'] + assert idx.reindex(np.array([]))[0].names == ['foo', 'bar'] + assert idx.reindex(target.tolist())[0].names == ['foo', 'bar'] + assert idx.reindex(target.values)[0].names == ['foo', 'bar'] + assert idx.reindex(other_dtype.tolist())[0].names == ['foo', 'bar'] + assert idx.reindex(other_dtype.values)[0].names == ['foo', 'bar'] def test_reindex_lvl_preserves_names_when_target_is_list_or_array(self): # GH7774 idx = pd.MultiIndex.from_product([[0, 1], 
['a', 'b']], names=['foo', 'bar']) - self.assertEqual(idx.reindex([], level=0)[0].names, ['foo', 'bar']) - self.assertEqual(idx.reindex([], level=1)[0].names, ['foo', 'bar']) + assert idx.reindex([], level=0)[0].names == ['foo', 'bar'] + assert idx.reindex([], level=1)[0].names == ['foo', 'bar'] def test_reindex_lvl_preserves_type_if_target_is_empty_list_or_array(self): # GH7774 idx = pd.MultiIndex.from_product([[0, 1], ['a', 'b']]) - self.assertEqual(idx.reindex([], level=0)[0].levels[0].dtype.type, - np.int64) - self.assertEqual(idx.reindex([], level=1)[0].levels[1].dtype.type, - np.object_) + assert idx.reindex([], level=0)[0].levels[0].dtype.type == np.int64 + assert idx.reindex([], level=1)[0].levels[1].dtype.type == np.object_ def test_groupby(self): groups = self.index.groupby(np.array([1, 1, 1, 2, 2, 2])) @@ -2378,23 +2603,23 @@ def test_index_name_retained(self): def test_equals_operator(self): # GH9785 - self.assertTrue((self.index == self.index).all()) + assert (self.index == self.index).all() def test_large_multiindex_error(self): # GH12527 df_below_1000000 = pd.DataFrame( 1, index=pd.MultiIndex.from_product([[1, 2], range(499999)]), columns=['dest']) - with assertRaises(KeyError): + with pytest.raises(KeyError): df_below_1000000.loc[(-1, 0), 'dest'] - with assertRaises(KeyError): + with pytest.raises(KeyError): df_below_1000000.loc[(3, 0), 'dest'] df_above_1000000 = pd.DataFrame( 1, index=pd.MultiIndex.from_product([[1, 2], range(500001)]), columns=['dest']) - with assertRaises(KeyError): + with pytest.raises(KeyError): df_above_1000000.loc[(-1, 0), 'dest'] - with assertRaises(KeyError): + with pytest.raises(KeyError): df_above_1000000.loc[(3, 0), 'dest'] def test_partial_string_timestamp_multiindex(self): @@ -2447,7 +2672,7 @@ def test_partial_string_timestamp_multiindex(self): # ambiguous and we don't want to extend this behavior forward to work # in multi-indexes. This would amount to selecting a scalar from a # column. 
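[A sketch of the rule spelled out in the comment above, using a hypothetical frame (the column name `c1` is illustrative): `.loc` supports partial date strings on a lexsorted datetime level, while bare `df[...]` stays a column lookup and raises, as the test below expects:]

```python
import pandas as pd

dates = pd.date_range('2016-01-01', periods=3, freq='D')
mi = pd.MultiIndex.from_product([dates, ['a', 'b', 'c']])
df = pd.DataFrame({'c1': range(9)}, index=mi)

# Partial string on the first (datetime) level selects rows via .loc:
df.loc['2016-01-01']

# ...but __getitem__ means column selection, so this raises KeyError:
try:
    df['2016-01-01']
except KeyError:
    pass
```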
- with assertRaises(KeyError): + with pytest.raises(KeyError): df['2016-01-01'] # partial string match on year only @@ -2476,7 +2701,7 @@ def test_partial_string_timestamp_multiindex(self): tm.assert_frame_equal(result, expected) # Slicing date on first level should break (of course) - with assertRaises(KeyError): + with pytest.raises(KeyError): df_swap.loc['2016-01-01'] # GH12685 (partial string with daily resolution or below) @@ -2529,5 +2754,54 @@ def test_dropna(self): tm.assert_index_equal(idx.dropna(how='all'), exp) msg = "invalid how option: xxx" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): idx.dropna(how='xxx') + + def test_unsortedindex(self): + # GH 11897 + mi = pd.MultiIndex.from_tuples([('z', 'a'), ('x', 'a'), ('y', 'b'), + ('x', 'b'), ('y', 'a'), ('z', 'b')], + names=['one', 'two']) + df = pd.DataFrame([[i, 10 * i] for i in lrange(6)], index=mi, + columns=['one', 'two']) + + with pytest.raises(UnsortedIndexError): + df.loc(axis=0)['z', :] + df.sort_index(inplace=True) + assert len(df.loc(axis=0)['z', :]) == 2 + + with pytest.raises(KeyError): + df.loc(axis=0)['q', :] + + def test_unsortedindex_doc_examples(self): + # http://pandas.pydata.org/pandas-docs/stable/advanced.html#sorting-a-multiindex # noqa + dfm = DataFrame({'jim': [0, 0, 1, 1], + 'joe': ['x', 'x', 'z', 'y'], + 'jolie': np.random.rand(4)}) + + dfm = dfm.set_index(['jim', 'joe']) + with tm.assert_produces_warning(PerformanceWarning): + dfm.loc[(1, 'z')] + + with pytest.raises(UnsortedIndexError): + dfm.loc[(0, 'y'):(1, 'z')] + + assert not dfm.index.is_lexsorted() + assert dfm.index.lexsort_depth == 1 + + # sort it + dfm = dfm.sort_index() + dfm.loc[(1, 'z')] + dfm.loc[(0, 'y'):(1, 'z')] + + assert dfm.index.is_lexsorted() + assert dfm.index.lexsort_depth == 2 + + def test_tuples_with_name_string(self): + # GH 15110 and GH 14848 + + li = [(0, 0, 1), (0, 1, 0), (1, 0, 0)] + with pytest.raises(ValueError): + pd.Index(li, name='abc') + with pytest.raises(ValueError): + pd.Index(li, name='a') diff --git a/pandas/tests/indexes/test_numeric.py b/pandas/tests/indexes/test_numeric.py index b362c9716b672..3d06f1672ae32 100644 --- a/pandas/tests/indexes/test_numeric.py +++ b/pandas/tests/indexes/test_numeric.py @@ -1,22 +1,21 @@ # -*- coding: utf-8 -*- +import pytest + from datetime import datetime -from pandas import compat -from pandas.compat import range, lrange, u, PY3 +from pandas.compat import range, PY3 import numpy as np -from pandas import (date_range, Series, DataFrame, - Index, Float64Index, Int64Index, RangeIndex) -from pandas.util.testing import assertRaisesRegexp +from pandas import (date_range, notnull, Series, Index, Float64Index, + Int64Index, UInt64Index, RangeIndex) import pandas.util.testing as tm -import pandas.core.config as cf import pandas as pd -from pandas.lib import Timestamp +from pandas._libs.lib import Timestamp -from .common import Base +from pandas.tests.indexes.common import Base def full_like(array, value): @@ -64,10 +63,11 @@ def test_numeric_compat(self): result = idx * np.array(5, dtype='int64') tm.assert_index_equal(result, idx * 5) - result = idx * np.arange(5, dtype='int64') + arr_dtype = 'uint64' if isinstance(idx, UInt64Index) else 'int64' + result = idx * np.arange(5, dtype=arr_dtype) tm.assert_index_equal(result, didx) - result = idx * Series(np.arange(5, dtype='int64')) + result = idx * Series(np.arange(5, dtype=arr_dtype)) tm.assert_index_equal(result, didx) result = idx * Series(np.arange(5, dtype='float64') + 0.1) @@ -76,10 
+76,10 @@ def test_numeric_compat(self): tm.assert_index_equal(result, expected) # invalid - self.assertRaises(TypeError, - lambda: idx * date_range('20130101', periods=5)) - self.assertRaises(ValueError, lambda: idx * idx[0:3]) - self.assertRaises(ValueError, lambda: idx * np.array([1, 2])) + pytest.raises(TypeError, + lambda: idx * date_range('20130101', periods=5)) + pytest.raises(ValueError, lambda: idx * idx[0:3]) + pytest.raises(ValueError, lambda: idx * np.array([1, 2])) result = divmod(idx, 2) with np.errstate(all='ignore'): @@ -105,6 +105,15 @@ def test_numeric_compat(self): for r, e in zip(result, expected): tm.assert_index_equal(r, e) + # test power calculations both ways, GH 14973 + expected = pd.Float64Index(2.0**idx.values) + result = 2.0**idx + tm.assert_index_equal(result, expected) + + expected = pd.Float64Index(idx.values**2.0) + result = idx**2.0 + tm.assert_index_equal(result, expected) + def test_explicit_conversions(self): # GH 8608 @@ -164,14 +173,13 @@ def test_modulo(self): # GH 9244 index = self.create_index() expected = Index(index.values % 2) - self.assert_index_equal(index % 2, expected) + tm.assert_index_equal(index % 2, expected) -class TestFloat64Index(Numeric, tm.TestCase): +class TestFloat64Index(Numeric): _holder = Float64Index - _multiprocess_can_split_ = True - def setUp(self): + def setup_method(self, method): self.indices = dict(mixed=Float64Index([1.5, 2, 3, 4, 5]), float=Float64Index(np.arange(5) * 2.5)) self.setup_indices() @@ -184,14 +192,14 @@ def test_repr_roundtrip(self): tm.assert_index_equal(eval(repr(ind)), ind) def check_is_index(self, i): - self.assertIsInstance(i, Index) - self.assertNotIsInstance(i, Float64Index) + assert isinstance(i, Index) + assert not isinstance(i, Float64Index) def check_coerce(self, a, b, is_float_index=True): - self.assertTrue(a.equals(b)) - self.assert_index_equal(a, b, exact=False) + assert a.equals(b) + tm.assert_index_equal(a, b, exact=False) if is_float_index: - self.assertIsInstance(b, Float64Index) + assert isinstance(b, Float64Index) else: self.check_is_index(b) @@ -199,39 +207,39 @@ def test_constructor(self): # explicit construction index = Float64Index([1, 2, 3, 4, 5]) - self.assertIsInstance(index, Float64Index) + assert isinstance(index, Float64Index) expected = np.array([1, 2, 3, 4, 5], dtype='float64') - self.assert_numpy_array_equal(index.values, expected) + tm.assert_numpy_array_equal(index.values, expected) index = Float64Index(np.array([1, 2, 3, 4, 5])) - self.assertIsInstance(index, Float64Index) + assert isinstance(index, Float64Index) index = Float64Index([1., 2, 3, 4, 5]) - self.assertIsInstance(index, Float64Index) + assert isinstance(index, Float64Index) index = Float64Index(np.array([1., 2, 3, 4, 5])) - self.assertIsInstance(index, Float64Index) - self.assertEqual(index.dtype, float) + assert isinstance(index, Float64Index) + assert index.dtype == float index = Float64Index(np.array([1., 2, 3, 4, 5]), dtype=np.float32) - self.assertIsInstance(index, Float64Index) - self.assertEqual(index.dtype, np.float64) + assert isinstance(index, Float64Index) + assert index.dtype == np.float64 index = Float64Index(np.array([1, 2, 3, 4, 5]), dtype=np.float32) - self.assertIsInstance(index, Float64Index) - self.assertEqual(index.dtype, np.float64) + assert isinstance(index, Float64Index) + assert index.dtype == np.float64 # nan handling result = Float64Index([np.nan, np.nan]) - self.assertTrue(pd.isnull(result.values).all()) + assert pd.isnull(result.values).all() result = 
Float64Index(np.array([np.nan])) - self.assertTrue(pd.isnull(result.values).all()) + assert pd.isnull(result.values).all() result = Index(np.array([np.nan])) - self.assertTrue(pd.isnull(result.values).all()) + assert pd.isnull(result.values).all() def test_constructor_invalid(self): # invalid - self.assertRaises(TypeError, Float64Index, 0.) - self.assertRaises(TypeError, Float64Index, ['a', 'b', 0.]) - self.assertRaises(TypeError, Float64Index, [Timestamp('20130101')]) + pytest.raises(TypeError, Float64Index, 0.) + pytest.raises(TypeError, Float64Index, ['a', 'b', 0.]) + pytest.raises(TypeError, Float64Index, [Timestamp('20130101')]) def test_constructor_coerce(self): @@ -252,15 +260,15 @@ def test_constructor_explicit(self): def test_astype(self): result = self.float.astype(object) - self.assertTrue(result.equals(self.float)) - self.assertTrue(self.float.equals(result)) + assert result.equals(self.float) + assert self.float.equals(result) self.check_is_index(result) i = self.mixed.copy() i.name = 'foo' result = i.astype(object) - self.assertTrue(result.equals(i)) - self.assertTrue(i.equals(result)) + assert result.equals(i) + assert i.equals(result) self.check_is_index(result) # GH 12881 @@ -289,28 +297,28 @@ def test_astype(self): # invalid for dtype in ['M8[ns]', 'm8[ns]']: - self.assertRaises(TypeError, lambda: i.astype(dtype)) + pytest.raises(TypeError, lambda: i.astype(dtype)) # GH 13149 for dtype in ['int16', 'int32', 'int64']: i = Float64Index([0, 1.1, np.NAN]) - self.assertRaises(ValueError, lambda: i.astype(dtype)) + pytest.raises(ValueError, lambda: i.astype(dtype)) def test_equals_numeric(self): i = Float64Index([1.0, 2.0]) - self.assertTrue(i.equals(i)) - self.assertTrue(i.identical(i)) + assert i.equals(i) + assert i.identical(i) i2 = Float64Index([1.0, 2.0]) - self.assertTrue(i.equals(i2)) + assert i.equals(i2) i = Float64Index([1.0, np.nan]) - self.assertTrue(i.equals(i)) - self.assertTrue(i.identical(i)) + assert i.equals(i) + assert i.identical(i) i2 = Float64Index([1.0, np.nan]) - self.assertTrue(i.equals(i2)) + assert i.equals(i2) def test_get_indexer(self): idx = Float64Index([0.0, 1.0, 2.0]) @@ -328,54 +336,54 @@ def test_get_indexer(self): def test_get_loc(self): idx = Float64Index([0.0, 1.0, 2.0]) for method in [None, 'pad', 'backfill', 'nearest']: - self.assertEqual(idx.get_loc(1, method), 1) + assert idx.get_loc(1, method) == 1 if method is not None: - self.assertEqual(idx.get_loc(1, method, tolerance=0), 1) + assert idx.get_loc(1, method, tolerance=0) == 1 for method, loc in [('pad', 1), ('backfill', 2), ('nearest', 1)]: - self.assertEqual(idx.get_loc(1.1, method), loc) - self.assertEqual(idx.get_loc(1.1, method, tolerance=0.9), loc) + assert idx.get_loc(1.1, method) == loc + assert idx.get_loc(1.1, method, tolerance=0.9) == loc - self.assertRaises(KeyError, idx.get_loc, 'foo') - self.assertRaises(KeyError, idx.get_loc, 1.5) - self.assertRaises(KeyError, idx.get_loc, 1.5, method='pad', - tolerance=0.1) + pytest.raises(KeyError, idx.get_loc, 'foo') + pytest.raises(KeyError, idx.get_loc, 1.5) + pytest.raises(KeyError, idx.get_loc, 1.5, method='pad', + tolerance=0.1) - with tm.assertRaisesRegexp(ValueError, 'must be numeric'): + with tm.assert_raises_regex(ValueError, 'must be numeric'): idx.get_loc(1.4, method='nearest', tolerance='foo') def test_get_loc_na(self): idx = Float64Index([np.nan, 1, 2]) - self.assertEqual(idx.get_loc(1), 1) - self.assertEqual(idx.get_loc(np.nan), 0) + assert idx.get_loc(1) == 1 + assert idx.get_loc(np.nan) == 0 idx = 
Float64Index([np.nan, 1, np.nan]) - self.assertEqual(idx.get_loc(1), 1) + assert idx.get_loc(1) == 1 # representable by slice [0:2:2] - # self.assertRaises(KeyError, idx.slice_locs, np.nan) + # pytest.raises(KeyError, idx.slice_locs, np.nan) sliced = idx.slice_locs(np.nan) - self.assertTrue(isinstance(sliced, tuple)) - self.assertEqual(sliced, (0, 3)) + assert isinstance(sliced, tuple) + assert sliced == (0, 3) # not representable by slice idx = Float64Index([np.nan, 1, np.nan, np.nan]) - self.assertEqual(idx.get_loc(1), 1) - self.assertRaises(KeyError, idx.slice_locs, np.nan) + assert idx.get_loc(1) == 1 + pytest.raises(KeyError, idx.slice_locs, np.nan) def test_contains_nans(self): i = Float64Index([1.0, 2.0, np.nan]) - self.assertTrue(np.nan in i) + assert np.nan in i def test_contains_not_nans(self): i = Float64Index([1.0, 2.0, np.nan]) - self.assertTrue(1.0 in i) + assert 1.0 in i def test_doesnt_contain_all_the_things(self): i = Float64Index([np.nan]) - self.assertFalse(i.isin([0]).item()) - self.assertFalse(i.isin([1]).item()) - self.assertTrue(i.isin([np.nan]).item()) + assert not i.isin([0]).item() + assert not i.isin([1]).item() + assert i.isin([np.nan]).item() def test_nan_multiple_containment(self): i = Float64Index([1.0, np.nan]) @@ -392,7 +400,7 @@ def test_astype_from_object(self): index = Index([1.0, np.nan, 0.2], dtype='object') result = index.astype(float) expected = Float64Index([1.0, np.nan, 0.2]) - self.assertEqual(result.dtype, expected.dtype) + assert result.dtype == expected.dtype tm.assert_index_equal(result, expected) def test_fillna_float64(self): @@ -400,15 +408,15 @@ def test_fillna_float64(self): idx = Index([1.0, np.nan, 3.0], dtype=float, name='x') # can't downcast exp = Index([1.0, 0.1, 3.0], name='x') - self.assert_index_equal(idx.fillna(0.1), exp) + tm.assert_index_equal(idx.fillna(0.1), exp) # downcast exp = Float64Index([1.0, 2.0, 3.0], name='x') - self.assert_index_equal(idx.fillna(2), exp) + tm.assert_index_equal(idx.fillna(2), exp) # object exp = Index([1.0, 'obj', 3.0], name='x') - self.assert_index_equal(idx.fillna('obj'), exp) + tm.assert_index_equal(idx.fillna('obj'), exp) def test_take_fill_value(self): # GH 12631 @@ -430,32 +438,200 @@ def test_take_fill_value(self): msg = ('When allow_fill=True and fill_value is not None, ' 'all indices must be >= -1') - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): + idx.take(np.array([1, 0, -2]), fill_value=True) + with tm.assert_raises_regex(ValueError, msg): + idx.take(np.array([1, 0, -5]), fill_value=True) + + with pytest.raises(IndexError): + idx.take(np.array([1, -5])) + + +class NumericInt(Numeric): + + def test_view(self): + super(NumericInt, self).test_view() + + i = self._holder([], name='Foo') + i_view = i.view() + assert i_view.name == 'Foo' + + i_view = i.view(self._dtype) + tm.assert_index_equal(i, self._holder(i_view, name='Foo')) + + i_view = i.view(self._holder) + tm.assert_index_equal(i, self._holder(i_view, name='Foo')) + + def test_is_monotonic(self): + assert self.index.is_monotonic + assert self.index.is_monotonic_increasing + assert not self.index.is_monotonic_decreasing + + index = self._holder([4, 3, 2, 1]) + assert not index.is_monotonic + assert index.is_monotonic_decreasing + + index = self._holder([1]) + assert index.is_monotonic + assert index.is_monotonic_increasing + assert index.is_monotonic_decreasing + + def test_logical_compat(self): + idx = self.create_index() + assert idx.all() == idx.values.all() + assert idx.any() == 
idx.values.any() + + def test_identical(self): + i = Index(self.index.copy()) + assert i.identical(self.index) + + same_values_different_type = Index(i, dtype=object) + assert not i.identical(same_values_different_type) + + i = self.index.copy(dtype=object) + i = i.rename('foo') + same_values = Index(i, dtype=object) + assert same_values.identical(i) + + assert not i.identical(self.index) + assert Index(same_values, name='foo', dtype=object).identical(i) + + assert not self.index.copy(dtype=object).identical( + self.index.copy(dtype=self._dtype)) + + def test_join_non_unique(self): + left = Index([4, 4, 3, 3]) + + joined, lidx, ridx = left.join(left, return_indexers=True) + + exp_joined = Index([3, 3, 3, 3, 4, 4, 4, 4]) + tm.assert_index_equal(joined, exp_joined) + + exp_lidx = np.array([2, 2, 3, 3, 0, 0, 1, 1], dtype=np.intp) + tm.assert_numpy_array_equal(lidx, exp_lidx) + + exp_ridx = np.array([2, 3, 2, 3, 0, 1, 0, 1], dtype=np.intp) + tm.assert_numpy_array_equal(ridx, exp_ridx) + + def test_join_self(self): + kinds = 'outer', 'inner', 'left', 'right' + for kind in kinds: + joined = self.index.join(self.index, how=kind) + assert self.index is joined + + def test_union_noncomparable(self): + from datetime import datetime, timedelta + # corner case, non-Int64Index + now = datetime.now() + other = Index([now + timedelta(i) for i in range(4)], dtype=object) + result = self.index.union(other) + expected = Index(np.concatenate((self.index, other))) + tm.assert_index_equal(result, expected) + + result = other.union(self.index) + expected = Index(np.concatenate((other, self.index))) + tm.assert_index_equal(result, expected) + + def test_cant_or_shouldnt_cast(self): + # can't + data = ['foo', 'bar', 'baz'] + pytest.raises(TypeError, self._holder, data) + + # shouldn't + data = ['0', '1', '2'] + pytest.raises(TypeError, self._holder, data) + + def test_view_index(self): + self.index.view(Index) + + def test_prevent_casting(self): + result = self.index.astype('O') + assert result.dtype == np.object_ + + def test_take_preserve_name(self): + index = self._holder([1, 2, 3, 4], name='foo') + taken = index.take([3, 0, 1]) + assert index.name == taken.name + + def test_take_fill_value(self): + # see gh-12631 + idx = self._holder([1, 2, 3], name='xxx') + result = idx.take(np.array([1, 0, -1])) + expected = self._holder([2, 1, 3], name='xxx') + tm.assert_index_equal(result, expected) + + name = self._holder.__name__ + msg = ("Unable to fill values because " + "{name} cannot contain NA").format(name=name) + + # fill_value=True + with tm.assert_raises_regex(ValueError, msg): + idx.take(np.array([1, 0, -1]), fill_value=True) + + # allow_fill=False + result = idx.take(np.array([1, 0, -1]), allow_fill=False, + fill_value=True) + expected = self._holder([2, 1, 3], name='xxx') + tm.assert_index_equal(result, expected) + + with tm.assert_raises_regex(ValueError, msg): idx.take(np.array([1, 0, -2]), fill_value=True) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): idx.take(np.array([1, 0, -5]), fill_value=True) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): idx.take(np.array([1, -5])) + def test_slice_keep_name(self): + idx = self._holder([1, 2], name='asdf') + assert idx.name == idx[1:].name + + def test_ufunc_coercions(self): + idx = self._holder([1, 2, 3, 4, 5], name='x') -class TestInt64Index(Numeric, tm.TestCase): + result = np.sqrt(idx) + assert isinstance(result, Float64Index) + exp = Float64Index(np.sqrt(np.array([1, 2, 3, 4, 5])), 
name='x') + tm.assert_index_equal(result, exp) + + result = np.divide(idx, 2.) + assert isinstance(result, Float64Index) + exp = Float64Index([0.5, 1., 1.5, 2., 2.5], name='x') + tm.assert_index_equal(result, exp) + + # _evaluate_numeric_binop + result = idx + 2. + assert isinstance(result, Float64Index) + exp = Float64Index([3., 4., 5., 6., 7.], name='x') + tm.assert_index_equal(result, exp) + + result = idx - 2. + assert isinstance(result, Float64Index) + exp = Float64Index([-1., 0., 1., 2., 3.], name='x') + tm.assert_index_equal(result, exp) + + result = idx * 1. + assert isinstance(result, Float64Index) + exp = Float64Index([1., 2., 3., 4., 5.], name='x') + tm.assert_index_equal(result, exp) + + result = idx / 2. + assert isinstance(result, Float64Index) + exp = Float64Index([0.5, 1., 1.5, 2., 2.5], name='x') + tm.assert_index_equal(result, exp) + + +class TestInt64Index(NumericInt): + _dtype = 'int64' _holder = Int64Index - _multiprocess_can_split_ = True - def setUp(self): + def setup_method(self, method): self.indices = dict(index=Int64Index(np.arange(0, 20, 2))) self.setup_indices() def create_index(self): return Int64Index(np.arange(5, dtype='int64')) - def test_too_many_names(self): - def testit(): - self.index.names = ["roger", "harold"] - - assertRaisesRegexp(ValueError, "^Length", testit) - def test_constructor(self): # pass list, coerce fine index = Int64Index([-5, 0, 1, 2]) @@ -467,7 +643,7 @@ def test_constructor(self): tm.assert_index_equal(index, expected) # scalar raise Exception - self.assertRaises(TypeError, Int64Index, 5) + pytest.raises(TypeError, Int64Index, 5) # copy arr = self.index.values @@ -477,7 +653,7 @@ def test_constructor(self): # this should not change index arr[0] = val - self.assertNotEqual(new_index[0], val) + assert new_index[0] != val # interpret list-like expected = Int64Index([5, 0]) @@ -490,103 +666,51 @@ def test_constructor(self): def test_constructor_corner(self): arr = np.array([1, 2, 3, 4], dtype=object) index = Int64Index(arr) - self.assertEqual(index.values.dtype, np.int64) - self.assert_index_equal(index, Index(arr)) + assert index.values.dtype == np.int64 + tm.assert_index_equal(index, Index(arr)) # preventing casting arr = np.array([1, '2', 3, '4'], dtype=object) - with tm.assertRaisesRegexp(TypeError, 'casting'): + with tm.assert_raises_regex(TypeError, 'casting'): Int64Index(arr) arr_with_floats = [0, 2, 3, 4, 5, 1.25, 3, -1] - with tm.assertRaisesRegexp(TypeError, 'casting'): + with tm.assert_raises_regex(TypeError, 'casting'): Int64Index(arr_with_floats) - def test_copy(self): - i = Int64Index([], name='Foo') - i_copy = i.copy() - self.assertEqual(i_copy.name, 'Foo') - - def test_view(self): - super(TestInt64Index, self).test_view() - - i = Int64Index([], name='Foo') - i_view = i.view() - self.assertEqual(i_view.name, 'Foo') - - i_view = i.view('i8') - tm.assert_index_equal(i, Int64Index(i_view, name='Foo')) - - i_view = i.view(Int64Index) - tm.assert_index_equal(i, Int64Index(i_view, name='Foo')) - def test_coerce_list(self): # coerce things arr = Index([1, 2, 3, 4]) - tm.assertIsInstance(arr, Int64Index) + assert isinstance(arr, Int64Index) # but not if explicit dtype passed arr = Index([1, 2, 3, 4], dtype=object) - tm.assertIsInstance(arr, Index) - - def test_dtype(self): - self.assertEqual(self.index.dtype, np.int64) - - def test_is_monotonic(self): - self.assertTrue(self.index.is_monotonic) - self.assertTrue(self.index.is_monotonic_increasing) - self.assertFalse(self.index.is_monotonic_decreasing) - - index = Int64Index([4, 3, 
2, 1]) - self.assertFalse(index.is_monotonic) - self.assertTrue(index.is_monotonic_decreasing) - - index = Int64Index([1]) - self.assertTrue(index.is_monotonic) - self.assertTrue(index.is_monotonic_increasing) - self.assertTrue(index.is_monotonic_decreasing) - - def test_is_monotonic_na(self): - examples = [Index([np.nan]), - Index([np.nan, 1]), - Index([1, 2, np.nan]), - Index(['a', 'b', np.nan]), - pd.to_datetime(['NaT']), - pd.to_datetime(['NaT', '2000-01-01']), - pd.to_datetime(['2000-01-01', 'NaT', '2000-01-02']), - pd.to_timedelta(['1 day', 'NaT']), ] - for index in examples: - self.assertFalse(index.is_monotonic_increasing) - self.assertFalse(index.is_monotonic_decreasing) - - def test_equals(self): - same_values = Index(self.index, dtype=object) - self.assertTrue(self.index.equals(same_values)) - self.assertTrue(same_values.equals(self.index)) + assert isinstance(arr, Index) - def test_logical_compat(self): - idx = self.create_index() - self.assertEqual(idx.all(), idx.values.all()) - self.assertEqual(idx.any(), idx.values.any()) + def test_where(self): + i = self.create_index() + result = i.where(notnull(i)) + expected = i + tm.assert_index_equal(result, expected) - def test_identical(self): - i = Index(self.index.copy()) - self.assertTrue(i.identical(self.index)) + _nan = i._na_value + cond = [False] + [True] * len(i[1:]) + expected = pd.Index([_nan] + i[1:].tolist()) - same_values_different_type = Index(i, dtype=object) - self.assertFalse(i.identical(same_values_different_type)) + result = i.where(cond) + tm.assert_index_equal(result, expected) - i = self.index.copy(dtype=object) - i = i.rename('foo') - same_values = Index(i, dtype=object) - self.assertTrue(same_values.identical(i)) + def test_where_array_like(self): + i = self.create_index() - self.assertFalse(i.identical(self.index)) - self.assertTrue(Index(same_values, name='foo', dtype=object).identical( - i)) + _nan = i._na_value + cond = [False] + [True] * (len(i) - 1) + klasses = [list, tuple, np.array, pd.Series] + expected = pd.Index([_nan] + i[1:].tolist()) - self.assertFalse(self.index.copy(dtype=object) - .identical(self.index.copy(dtype='int64'))) + for klass in klasses: + result = i.where(klass(cond)) + tm.assert_index_equal(result, expected) def test_get_indexer(self): target = Int64Index(np.arange(10)) @@ -594,54 +718,27 @@ def test_get_indexer(self): expected = np.array([0, -1, 1, -1, 2, -1, 3, -1, 4, -1], dtype=np.intp) tm.assert_numpy_array_equal(indexer, expected) - def test_get_indexer_pad(self): target = Int64Index(np.arange(10)) indexer = self.index.get_indexer(target, method='pad') expected = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4], dtype=np.intp) tm.assert_numpy_array_equal(indexer, expected) - def test_get_indexer_backfill(self): target = Int64Index(np.arange(10)) indexer = self.index.get_indexer(target, method='backfill') expected = np.array([0, 1, 1, 2, 2, 3, 3, 4, 4, 5], dtype=np.intp) tm.assert_numpy_array_equal(indexer, expected) - def test_join_outer(self): - other = Int64Index([7, 12, 25, 1, 2, 5]) - other_mono = Int64Index([1, 2, 5, 7, 12, 25]) - - # not monotonic - # guarantee of sortedness - res, lidx, ridx = self.index.join(other, how='outer', - return_indexers=True) - noidx_res = self.index.join(other, how='outer') - self.assert_index_equal(res, noidx_res) - - eres = Int64Index([0, 1, 2, 4, 5, 6, 7, 8, 10, 12, 14, 16, 18, 25]) - elidx = np.array([0, -1, 1, 2, -1, 3, -1, 4, 5, 6, 7, 8, 9, -1], - dtype=np.intp) - eridx = np.array([-1, 3, 4, -1, 5, -1, 0, -1, -1, 1, -1, -1, -1, 2], - 
dtype=np.intp) - - tm.assertIsInstance(res, Int64Index) - self.assert_index_equal(res, eres) - tm.assert_numpy_array_equal(lidx, elidx) - tm.assert_numpy_array_equal(ridx, eridx) - - # monotonic - res, lidx, ridx = self.index.join(other_mono, how='outer', - return_indexers=True) - noidx_res = self.index.join(other_mono, how='outer') - self.assert_index_equal(res, noidx_res) + def test_intersection(self): + other = Index([1, 2, 3, 4, 5]) + result = self.index.intersection(other) + expected = Index(np.sort(np.intersect1d(self.index.values, + other.values))) + tm.assert_index_equal(result, expected) - elidx = np.array([0, -1, 1, 2, -1, 3, -1, 4, 5, 6, 7, 8, 9, -1], - dtype=np.intp) - eridx = np.array([-1, 0, 1, -1, 2, -1, 3, -1, -1, 4, -1, -1, -1, 5], - dtype=np.intp) - tm.assertIsInstance(res, Int64Index) - self.assert_index_equal(res, eres) - tm.assert_numpy_array_equal(lidx, elidx) - tm.assert_numpy_array_equal(ridx, eridx) + result = other.intersection(self.index) + expected = Index(np.sort(np.asarray(np.intersect1d(self.index.values, + other.values)))) + tm.assert_index_equal(result, expected) def test_join_inner(self): other = Int64Index([7, 12, 25, 1, 2, 5]) @@ -661,8 +758,8 @@ def test_join_inner(self): elidx = np.array([1, 6], dtype=np.intp) eridx = np.array([4, 1], dtype=np.intp) - tm.assertIsInstance(res, Int64Index) - self.assert_index_equal(res, eres) + assert isinstance(res, Int64Index) + tm.assert_index_equal(res, eres) tm.assert_numpy_array_equal(lidx, elidx) tm.assert_numpy_array_equal(ridx, eridx) @@ -671,12 +768,12 @@ def test_join_inner(self): return_indexers=True) res2 = self.index.intersection(other_mono) - self.assert_index_equal(res, res2) + tm.assert_index_equal(res, res2) elidx = np.array([1, 6], dtype=np.intp) eridx = np.array([1, 4], dtype=np.intp) - tm.assertIsInstance(res, Int64Index) - self.assert_index_equal(res, eres) + assert isinstance(res, Int64Index) + tm.assert_index_equal(res, eres) tm.assert_numpy_array_equal(lidx, elidx) tm.assert_numpy_array_equal(ridx, eridx) @@ -691,9 +788,9 @@ def test_join_left(self): eridx = np.array([-1, 4, -1, -1, -1, -1, 1, -1, -1, -1], dtype=np.intp) - tm.assertIsInstance(res, Int64Index) - self.assert_index_equal(res, eres) - self.assertIsNone(lidx) + assert isinstance(res, Int64Index) + tm.assert_index_equal(res, eres) + assert lidx is None tm.assert_numpy_array_equal(ridx, eridx) # monotonic @@ -701,9 +798,9 @@ def test_join_left(self): return_indexers=True) eridx = np.array([-1, 1, -1, -1, -1, -1, 4, -1, -1, -1], dtype=np.intp) - tm.assertIsInstance(res, Int64Index) - self.assert_index_equal(res, eres) - self.assertIsNone(lidx) + assert isinstance(res, Int64Index) + tm.assert_index_equal(res, eres) + assert lidx is None tm.assert_numpy_array_equal(ridx, eridx) # non-unique @@ -713,7 +810,7 @@ def test_join_left(self): eres = Index([1, 1, 2, 5, 7, 9]) # 1 is in idx2, so it should be x2 eridx = np.array([0, 1, 2, 3, -1, -1], dtype=np.intp) elidx = np.array([0, 0, 1, 2, 3, 4], dtype=np.intp) - self.assert_index_equal(res, eres) + tm.assert_index_equal(res, eres) tm.assert_numpy_array_equal(lidx, elidx) tm.assert_numpy_array_equal(ridx, eridx) @@ -727,20 +824,20 @@ def test_join_right(self): eres = other elidx = np.array([-1, 6, -1, -1, 1, -1], dtype=np.intp) - tm.assertIsInstance(other, Int64Index) - self.assert_index_equal(res, eres) + assert isinstance(other, Int64Index) + tm.assert_index_equal(res, eres) tm.assert_numpy_array_equal(lidx, elidx) - self.assertIsNone(ridx) + assert ridx is None # monotonic res, lidx, ridx = 
self.index.join(other_mono, how='right', return_indexers=True) eres = other_mono elidx = np.array([-1, 1, -1, -1, 6, -1], dtype=np.intp) - tm.assertIsInstance(other, Int64Index) - self.assert_index_equal(res, eres) + assert isinstance(other, Int64Index) + tm.assert_index_equal(res, eres) tm.assert_numpy_array_equal(lidx, elidx) - self.assertIsNone(ridx) + assert ridx is None # non-unique idx = Index([1, 1, 2, 5]) @@ -749,7 +846,7 @@ def test_join_right(self): eres = Index([1, 1, 2, 5, 7, 9]) # 1 is in idx2, so it should be x2 elidx = np.array([0, 1, 2, 3, -1, -1], dtype=np.intp) eridx = np.array([0, 0, 1, 2, 3, 4], dtype=np.intp) - self.assert_index_equal(res, eres) + tm.assert_index_equal(res, eres) tm.assert_numpy_array_equal(lidx, elidx) tm.assert_numpy_array_equal(ridx, eridx) @@ -759,49 +856,116 @@ def test_join_non_int_index(self): outer = self.index.join(other, how='outer') outer2 = other.join(self.index, how='outer') expected = Index([0, 2, 3, 4, 6, 7, 8, 10, 12, 14, 16, 18]) - self.assert_index_equal(outer, outer2) - self.assert_index_equal(outer, expected) + tm.assert_index_equal(outer, outer2) + tm.assert_index_equal(outer, expected) inner = self.index.join(other, how='inner') inner2 = other.join(self.index, how='inner') expected = Index([6, 8, 10]) - self.assert_index_equal(inner, inner2) - self.assert_index_equal(inner, expected) + tm.assert_index_equal(inner, inner2) + tm.assert_index_equal(inner, expected) left = self.index.join(other, how='left') - self.assert_index_equal(left, self.index.astype(object)) + tm.assert_index_equal(left, self.index.astype(object)) left2 = other.join(self.index, how='left') - self.assert_index_equal(left2, other) + tm.assert_index_equal(left2, other) right = self.index.join(other, how='right') - self.assert_index_equal(right, other) + tm.assert_index_equal(right, other) right2 = other.join(self.index, how='right') - self.assert_index_equal(right2, self.index.astype(object)) + tm.assert_index_equal(right2, self.index.astype(object)) - def test_join_non_unique(self): - left = Index([4, 4, 3, 3]) + def test_join_outer(self): + other = Int64Index([7, 12, 25, 1, 2, 5]) + other_mono = Int64Index([1, 2, 5, 7, 12, 25]) - joined, lidx, ridx = left.join(left, return_indexers=True) + # not monotonic + # guarantee of sortedness + res, lidx, ridx = self.index.join(other, how='outer', + return_indexers=True) + noidx_res = self.index.join(other, how='outer') + tm.assert_index_equal(res, noidx_res) - exp_joined = Index([3, 3, 3, 3, 4, 4, 4, 4]) - self.assert_index_equal(joined, exp_joined) + eres = Int64Index([0, 1, 2, 4, 5, 6, 7, 8, 10, 12, 14, 16, 18, 25]) + elidx = np.array([0, -1, 1, 2, -1, 3, -1, 4, 5, 6, 7, 8, 9, -1], + dtype=np.intp) + eridx = np.array([-1, 3, 4, -1, 5, -1, 0, -1, -1, 1, -1, -1, -1, 2], + dtype=np.intp) - exp_lidx = np.array([2, 2, 3, 3, 0, 0, 1, 1], dtype=np.intp) - tm.assert_numpy_array_equal(lidx, exp_lidx) + assert isinstance(res, Int64Index) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_numpy_array_equal(ridx, eridx) - exp_ridx = np.array([2, 3, 2, 3, 0, 1, 0, 1], dtype=np.intp) - tm.assert_numpy_array_equal(ridx, exp_ridx) + # monotonic + res, lidx, ridx = self.index.join(other_mono, how='outer', + return_indexers=True) + noidx_res = self.index.join(other_mono, how='outer') + tm.assert_index_equal(res, noidx_res) - def test_join_self(self): - kinds = 'outer', 'inner', 'left', 'right' - for kind in kinds: - joined = self.index.join(self.index, how=kind) - self.assertIs(self.index, joined) 
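[The removed `test_join_self`/`test_join_non_unique` above reappear in the shared `NumericInt` base added earlier in this diff; the idiom change applied throughout the file is `self.assert*` helpers → plain `assert` and `pytest.raises`. A compact sketch mirroring the assertions in this file:]

```python
import numpy as np
import pytest
from pandas import Int64Index

idx = Int64Index(np.arange(0, 20, 2))

# was: self.assertIs(self.index, joined)
for kind in ('outer', 'inner', 'left', 'right'):
    assert idx.join(idx, how=kind) is idx

# was: self.assertRaises(TypeError, Int64Index, data)
pytest.raises(TypeError, Int64Index, ['foo', 'bar', 'baz'])
```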
+ elidx = np.array([0, -1, 1, 2, -1, 3, -1, 4, 5, 6, 7, 8, 9, -1], + dtype=np.intp) + eridx = np.array([-1, 0, 1, -1, 2, -1, 3, -1, -1, 4, -1, -1, -1, 5], + dtype=np.intp) + assert isinstance(res, Int64Index) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_numpy_array_equal(ridx, eridx) + + +class TestUInt64Index(NumericInt): + + _dtype = 'uint64' + _holder = UInt64Index + + def setup_method(self, method): + self.indices = dict(index=UInt64Index([2**63, 2**63 + 10, 2**63 + 15, + 2**63 + 20, 2**63 + 25])) + self.setup_indices() + + def create_index(self): + return UInt64Index(np.arange(5, dtype='uint64')) + + def test_constructor(self): + idx = UInt64Index([1, 2, 3]) + res = Index([1, 2, 3], dtype=np.uint64) + tm.assert_index_equal(res, idx) + + idx = UInt64Index([1, 2**63]) + res = Index([1, 2**63], dtype=np.uint64) + tm.assert_index_equal(res, idx) + + idx = UInt64Index([1, 2**63]) + res = Index([1, 2**63]) + tm.assert_index_equal(res, idx) + + idx = Index([-1, 2**63], dtype=object) + res = Index(np.array([-1, 2**63], dtype=object)) + tm.assert_index_equal(res, idx) + + def test_get_indexer(self): + target = UInt64Index(np.arange(10).astype('uint64') * 5 + 2**63) + indexer = self.index.get_indexer(target) + expected = np.array([0, -1, 1, 2, 3, 4, + -1, -1, -1, -1], dtype=np.intp) + tm.assert_numpy_array_equal(indexer, expected) + + target = UInt64Index(np.arange(10).astype('uint64') * 5 + 2**63) + indexer = self.index.get_indexer(target, method='pad') + expected = np.array([0, 0, 1, 2, 3, 4, + 4, 4, 4, 4], dtype=np.intp) + tm.assert_numpy_array_equal(indexer, expected) + + target = UInt64Index(np.arange(10).astype('uint64') * 5 + 2**63) + indexer = self.index.get_indexer(target, method='backfill') + expected = np.array([0, 1, 1, 2, 3, 4, + -1, -1, -1, -1], dtype=np.intp) + tm.assert_numpy_array_equal(indexer, expected) def test_intersection(self): - other = Index([1, 2, 3, 4, 5]) + other = Index([2**63, 2**63 + 5, 2**63 + 10, 2**63 + 15, 2**63 + 20]) result = self.index.intersection(other) expected = Index(np.sort(np.intersect1d(self.index.values, other.values))) @@ -812,147 +976,193 @@ def test_intersection(self): other.values)))) tm.assert_index_equal(result, expected) - def test_intersect_str_dates(self): - dt_dates = [datetime(2012, 2, 9), datetime(2012, 2, 22)] + def test_join_inner(self): + other = UInt64Index(2**63 + np.array( + [7, 12, 25, 1, 2, 10], dtype='uint64')) + other_mono = UInt64Index(2**63 + np.array( + [1, 2, 7, 10, 12, 25], dtype='uint64')) - i1 = Index(dt_dates, dtype=object) - i2 = Index(['aa'], dtype=object) - res = i2.intersection(i1) + # not monotonic + res, lidx, ridx = self.index.join(other, how='inner', + return_indexers=True) - self.assertEqual(len(res), 0) + # no guarantee of sortedness, so sort for comparison purposes + ind = res.argsort() + res = res.take(ind) + lidx = lidx.take(ind) + ridx = ridx.take(ind) - def test_union_noncomparable(self): - from datetime import datetime, timedelta - # corner case, non-Int64Index - now = datetime.now() - other = Index([now + timedelta(i) for i in range(4)], dtype=object) - result = self.index.union(other) - expected = Index(np.concatenate((self.index, other))) - tm.assert_index_equal(result, expected) + eres = UInt64Index(2**63 + np.array([10, 25], dtype='uint64')) + elidx = np.array([1, 4], dtype=np.intp) + eridx = np.array([5, 2], dtype=np.intp) - result = other.union(self.index) - expected = Index(np.concatenate((other, self.index))) - tm.assert_index_equal(result, expected) 
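[For context on the new `UInt64Index` coverage interleaved here, a short sketch of why the type exists, with values mirroring the constructor test: integers at or beyond 2**63 overflow int64, so inference must pick the unsigned index:]

```python
import numpy as np
from pandas import Index, UInt64Index

# 2**63 does not fit in int64, so inference yields a UInt64Index ...
idx = Index([1, 2**63])
assert isinstance(idx, UInt64Index)
assert idx.dtype == np.uint64

# ... while mixing in a negative forces object dtype instead.
mixed = Index(np.array([-1, 2**63], dtype=object))
assert mixed.dtype == object
```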
+ assert isinstance(res, UInt64Index) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_numpy_array_equal(ridx, eridx) - def test_cant_or_shouldnt_cast(self): - # can't - data = ['foo', 'bar', 'baz'] - self.assertRaises(TypeError, Int64Index, data) + # monotonic + res, lidx, ridx = self.index.join(other_mono, how='inner', + return_indexers=True) - # shouldn't - data = ['0', '1', '2'] - self.assertRaises(TypeError, Int64Index, data) + res2 = self.index.intersection(other_mono) + tm.assert_index_equal(res, res2) - def test_view_Index(self): - self.index.view(Index) + elidx = np.array([1, 4], dtype=np.intp) + eridx = np.array([3, 5], dtype=np.intp) - def test_prevent_casting(self): - result = self.index.astype('O') - self.assertEqual(result.dtype, np.object_) + assert isinstance(res, UInt64Index) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_numpy_array_equal(ridx, eridx) - def test_take_preserve_name(self): - index = Int64Index([1, 2, 3, 4], name='foo') - taken = index.take([3, 0, 1]) - self.assertEqual(index.name, taken.name) + def test_join_left(self): + other = UInt64Index(2**63 + np.array( + [7, 12, 25, 1, 2, 10], dtype='uint64')) + other_mono = UInt64Index(2**63 + np.array( + [1, 2, 7, 10, 12, 25], dtype='uint64')) - def test_take_fill_value(self): - # GH 12631 - idx = pd.Int64Index([1, 2, 3], name='xxx') - result = idx.take(np.array([1, 0, -1])) - expected = pd.Int64Index([2, 1, 3], name='xxx') - tm.assert_index_equal(result, expected) + # not monotonic + res, lidx, ridx = self.index.join(other, how='left', + return_indexers=True) + eres = self.index + eridx = np.array([-1, 5, -1, -1, 2], dtype=np.intp) - # fill_value - msg = "Unable to fill values because Int64Index cannot contain NA" - with tm.assertRaisesRegexp(ValueError, msg): - idx.take(np.array([1, 0, -1]), fill_value=True) + assert isinstance(res, UInt64Index) + tm.assert_index_equal(res, eres) + assert lidx is None + tm.assert_numpy_array_equal(ridx, eridx) - # allow_fill=False - result = idx.take(np.array([1, 0, -1]), allow_fill=False, - fill_value=True) - expected = pd.Int64Index([2, 1, 3], name='xxx') - tm.assert_index_equal(result, expected) + # monotonic + res, lidx, ridx = self.index.join(other_mono, how='left', + return_indexers=True) + eridx = np.array([-1, 3, -1, -1, 5], dtype=np.intp) - msg = "Unable to fill values because Int64Index cannot contain NA" - with tm.assertRaisesRegexp(ValueError, msg): - idx.take(np.array([1, 0, -2]), fill_value=True) - with tm.assertRaisesRegexp(ValueError, msg): - idx.take(np.array([1, 0, -5]), fill_value=True) + assert isinstance(res, UInt64Index) + tm.assert_index_equal(res, eres) + assert lidx is None + tm.assert_numpy_array_equal(ridx, eridx) - with tm.assertRaises(IndexError): - idx.take(np.array([1, -5])) + # non-unique + idx = UInt64Index(2**63 + np.array([1, 1, 2, 5], dtype='uint64')) + idx2 = UInt64Index(2**63 + np.array([1, 2, 5, 7, 9], dtype='uint64')) + res, lidx, ridx = idx2.join(idx, how='left', return_indexers=True) - def test_int_name_format(self): - index = Index(['a', 'b', 'c'], name=0) - s = Series(lrange(3), index) - df = DataFrame(lrange(3), index=index) - repr(s) - repr(df) - - def test_print_unicode_columns(self): - df = pd.DataFrame({u("\u05d0"): [1, 2, 3], - "\u05d1": [4, 5, 6], - "c": [7, 8, 9]}) - repr(df.columns) # should not raise UnicodeDecodeError - - def test_repr_summary(self): - with cf.option_context('display.max_seq_items', 10): - r = repr(pd.Index(np.arange(1000))) - 
self.assertTrue(len(r) < 200) - self.assertTrue("..." in r) + # 1 is in idx2, so it should be x2 + eres = UInt64Index(2**63 + np.array( + [1, 1, 2, 5, 7, 9], dtype='uint64')) + eridx = np.array([0, 1, 2, 3, -1, -1], dtype=np.intp) + elidx = np.array([0, 0, 1, 2, 3, 4], dtype=np.intp) - def test_repr_roundtrip(self): - tm.assert_index_equal(eval(repr(self.index)), self.index) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_numpy_array_equal(ridx, eridx) - def test_unicode_string_with_unicode(self): - idx = Index(lrange(1000)) + def test_join_right(self): + other = UInt64Index(2**63 + np.array( + [7, 12, 25, 1, 2, 10], dtype='uint64')) + other_mono = UInt64Index(2**63 + np.array( + [1, 2, 7, 10, 12, 25], dtype='uint64')) - if PY3: - str(idx) - else: - compat.text_type(idx) + # not monotonic + res, lidx, ridx = self.index.join(other, how='right', + return_indexers=True) + eres = other + elidx = np.array([-1, -1, 4, -1, -1, 1], dtype=np.intp) - def test_bytestring_with_unicode(self): - idx = Index(lrange(1000)) - if PY3: - bytes(idx) - else: - str(idx) + tm.assert_numpy_array_equal(lidx, elidx) + assert isinstance(other, UInt64Index) + tm.assert_index_equal(res, eres) + assert ridx is None - def test_slice_keep_name(self): - idx = Int64Index([1, 2], name='asdf') - self.assertEqual(idx.name, idx[1:].name) + # monotonic + res, lidx, ridx = self.index.join(other_mono, how='right', + return_indexers=True) + eres = other_mono + elidx = np.array([-1, -1, -1, 1, -1, 4], dtype=np.intp) - def test_ufunc_coercions(self): - idx = Int64Index([1, 2, 3, 4, 5], name='x') + assert isinstance(other, UInt64Index) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_index_equal(res, eres) + assert ridx is None - result = np.sqrt(idx) - tm.assertIsInstance(result, Float64Index) - exp = Float64Index(np.sqrt(np.array([1, 2, 3, 4, 5])), name='x') - tm.assert_index_equal(result, exp) + # non-unique + idx = UInt64Index(2**63 + np.array([1, 1, 2, 5], dtype='uint64')) + idx2 = UInt64Index(2**63 + np.array([1, 2, 5, 7, 9], dtype='uint64')) + res, lidx, ridx = idx.join(idx2, how='right', return_indexers=True) - result = np.divide(idx, 2.) - tm.assertIsInstance(result, Float64Index) - exp = Float64Index([0.5, 1., 1.5, 2., 2.5], name='x') - tm.assert_index_equal(result, exp) + # 1 is in idx2, so it should be x2 + eres = UInt64Index(2**63 + np.array( + [1, 1, 2, 5, 7, 9], dtype='uint64')) + elidx = np.array([0, 1, 2, 3, -1, -1], dtype=np.intp) + eridx = np.array([0, 0, 1, 2, 3, 4], dtype=np.intp) - # _evaluate_numeric_binop - result = idx + 2. - tm.assertIsInstance(result, Float64Index) - exp = Float64Index([3., 4., 5., 6., 7.], name='x') - tm.assert_index_equal(result, exp) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_numpy_array_equal(ridx, eridx) - result = idx - 2. - tm.assertIsInstance(result, Float64Index) - exp = Float64Index([-1., 0., 1., 2., 3.], name='x') - tm.assert_index_equal(result, exp) + def test_join_non_int_index(self): + other = Index(2**63 + np.array( + [1, 5, 7, 10, 20], dtype='uint64'), dtype=object) - result = idx * 1. 
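[The `test_ufunc_coercions` cases being removed around here (re-added on the `NumericInt` base) are easier to read consolidated; a runnable sketch of the promotion rule they check — ufuncs and float scalars lift an integer index to `Float64Index`:]

```python
import numpy as np
from pandas import Int64Index, Float64Index
import pandas.util.testing as tm

idx = Int64Index([1, 2, 3, 4, 5], name='x')

# a ufunc on an integer index comes back as a float index
assert isinstance(np.sqrt(idx), Float64Index)

# _evaluate_numeric_binop: float scalars promote as well
tm.assert_index_equal(idx + 2., Float64Index([3., 4., 5., 6., 7.], name='x'))
tm.assert_index_equal(idx / 2., Float64Index([0.5, 1., 1.5, 2., 2.5], name='x'))
```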
- tm.assertIsInstance(result, Float64Index) - exp = Float64Index([1., 2., 3., 4., 5.], name='x') - tm.assert_index_equal(result, exp) + outer = self.index.join(other, how='outer') + outer2 = other.join(self.index, how='outer') + expected = Index(2**63 + np.array( + [0, 1, 5, 7, 10, 15, 20, 25], dtype='uint64')) + tm.assert_index_equal(outer, outer2) + tm.assert_index_equal(outer, expected) - result = idx / 2. - tm.assertIsInstance(result, Float64Index) - exp = Float64Index([0.5, 1., 1.5, 2., 2.5], name='x') - tm.assert_index_equal(result, exp) + inner = self.index.join(other, how='inner') + inner2 = other.join(self.index, how='inner') + expected = Index(2**63 + np.array([10, 20], dtype='uint64')) + tm.assert_index_equal(inner, inner2) + tm.assert_index_equal(inner, expected) + + left = self.index.join(other, how='left') + tm.assert_index_equal(left, self.index.astype(object)) + + left2 = other.join(self.index, how='left') + tm.assert_index_equal(left2, other) + + right = self.index.join(other, how='right') + tm.assert_index_equal(right, other) + + right2 = other.join(self.index, how='right') + tm.assert_index_equal(right2, self.index.astype(object)) + + def test_join_outer(self): + other = UInt64Index(2**63 + np.array( + [7, 12, 25, 1, 2, 10], dtype='uint64')) + other_mono = UInt64Index(2**63 + np.array( + [1, 2, 7, 10, 12, 25], dtype='uint64')) + + # not monotonic + # guarantee of sortedness + res, lidx, ridx = self.index.join(other, how='outer', + return_indexers=True) + noidx_res = self.index.join(other, how='outer') + tm.assert_index_equal(res, noidx_res) + + eres = UInt64Index(2**63 + np.array( + [0, 1, 2, 7, 10, 12, 15, 20, 25], dtype='uint64')) + elidx = np.array([0, -1, -1, -1, 1, -1, 2, 3, 4], dtype=np.intp) + eridx = np.array([-1, 3, 4, 0, 5, 1, -1, -1, 2], dtype=np.intp) + + assert isinstance(res, UInt64Index) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_numpy_array_equal(ridx, eridx) + + # monotonic + res, lidx, ridx = self.index.join(other_mono, how='outer', + return_indexers=True) + noidx_res = self.index.join(other_mono, how='outer') + tm.assert_index_equal(res, noidx_res) + + elidx = np.array([0, -1, -1, -1, 1, -1, 2, 3, 4], dtype=np.intp) + eridx = np.array([-1, 0, 1, 2, 3, 4, -1, -1, 5], dtype=np.intp) + + assert isinstance(res, UInt64Index) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_numpy_array_equal(ridx, eridx) diff --git a/pandas/tests/indexes/test_range.py b/pandas/tests/indexes/test_range.py index 38e715fce2720..18539989084e9 100644 --- a/pandas/tests/indexes/test_range.py +++ b/pandas/tests/indexes/test_range.py @@ -1,5 +1,7 @@ # -*- coding: utf-8 -*- +import pytest + from datetime import datetime from itertools import combinations import operator @@ -8,8 +10,8 @@ import numpy as np -from pandas import (Series, Index, Float64Index, Int64Index, RangeIndex) -from pandas.util.testing import assertRaisesRegexp +from pandas import (notnull, Series, Index, Float64Index, + Int64Index, RangeIndex) import pandas.util.testing as tm @@ -18,11 +20,11 @@ from .test_numeric import Numeric -class TestRangeIndex(Numeric, tm.TestCase): +class TestRangeIndex(Numeric): _holder = RangeIndex _compat_props = ['shape', 'ndim', 'size', 'itemsize'] - def setUp(self): + def setup_method(self, method): self.indices = dict(index=RangeIndex(0, 20, 2, name='foo')) self.setup_indices() @@ -62,105 +64,105 @@ def test_too_many_names(self): def testit(): self.index.names = ["roger", "harold"] - 
assertRaisesRegexp(ValueError, "^Length", testit) + tm.assert_raises_regex(ValueError, "^Length", testit) def test_constructor(self): index = RangeIndex(5) expected = np.arange(5, dtype=np.int64) - self.assertIsInstance(index, RangeIndex) - self.assertEqual(index._start, 0) - self.assertEqual(index._stop, 5) - self.assertEqual(index._step, 1) - self.assertEqual(index.name, None) + assert isinstance(index, RangeIndex) + assert index._start == 0 + assert index._stop == 5 + assert index._step == 1 + assert index.name is None tm.assert_index_equal(Index(expected), index) index = RangeIndex(1, 5) expected = np.arange(1, 5, dtype=np.int64) - self.assertIsInstance(index, RangeIndex) - self.assertEqual(index._start, 1) + assert isinstance(index, RangeIndex) + assert index._start == 1 tm.assert_index_equal(Index(expected), index) index = RangeIndex(1, 5, 2) expected = np.arange(1, 5, 2, dtype=np.int64) - self.assertIsInstance(index, RangeIndex) - self.assertEqual(index._step, 2) + assert isinstance(index, RangeIndex) + assert index._step == 2 tm.assert_index_equal(Index(expected), index) msg = "RangeIndex\\(\\.\\.\\.\\) must be called with integers" - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): RangeIndex() for index in [RangeIndex(0), RangeIndex(start=0), RangeIndex(stop=0), RangeIndex(0, 0)]: expected = np.empty(0, dtype=np.int64) - self.assertIsInstance(index, RangeIndex) - self.assertEqual(index._start, 0) - self.assertEqual(index._stop, 0) - self.assertEqual(index._step, 1) + assert isinstance(index, RangeIndex) + assert index._start == 0 + assert index._stop == 0 + assert index._step == 1 tm.assert_index_equal(Index(expected), index) - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): RangeIndex(name='Foo') for index in [RangeIndex(0, name='Foo'), RangeIndex(start=0, name='Foo'), RangeIndex(stop=0, name='Foo'), RangeIndex(0, 0, name='Foo')]: - self.assertIsInstance(index, RangeIndex) - self.assertEqual(index.name, 'Foo') + assert isinstance(index, RangeIndex) + assert index.name == 'Foo' # we don't allow on a bare Index - self.assertRaises(TypeError, lambda: Index(0, 1000)) + pytest.raises(TypeError, lambda: Index(0, 1000)) # invalid args for i in [Index(['a', 'b']), Series(['a', 'b']), np.array(['a', 'b']), [], 'foo', datetime(2000, 1, 1, 0, 0), np.arange(0, 10), np.array([1]), [1]]: - self.assertRaises(TypeError, lambda: RangeIndex(i)) + pytest.raises(TypeError, lambda: RangeIndex(i)) def test_constructor_same(self): # pass thru w and w/o copy index = RangeIndex(1, 5, 2) result = RangeIndex(index, copy=False) - self.assertTrue(result.identical(index)) + assert result.identical(index) result = RangeIndex(index, copy=True) - self.assert_index_equal(result, index, exact=True) + tm.assert_index_equal(result, index, exact=True) result = RangeIndex(index) - self.assert_index_equal(result, index, exact=True) + tm.assert_index_equal(result, index, exact=True) - self.assertRaises(TypeError, - lambda: RangeIndex(index, dtype='float64')) + pytest.raises(TypeError, + lambda: RangeIndex(index, dtype='float64')) def test_constructor_range(self): - self.assertRaises(TypeError, lambda: RangeIndex(range(1, 5, 2))) + pytest.raises(TypeError, lambda: RangeIndex(range(1, 5, 2))) result = RangeIndex.from_range(range(1, 5, 2)) expected = RangeIndex(1, 5, 2) - self.assert_index_equal(result, expected, exact=True) + tm.assert_index_equal(result, expected, exact=True) result = RangeIndex.from_range(range(5, 6)) expected = 
RangeIndex(5, 6, 1) - self.assert_index_equal(result, expected, exact=True) + tm.assert_index_equal(result, expected, exact=True) # an invalid range result = RangeIndex.from_range(range(5, 1)) expected = RangeIndex(0, 0, 1) - self.assert_index_equal(result, expected, exact=True) + tm.assert_index_equal(result, expected, exact=True) result = RangeIndex.from_range(range(5)) expected = RangeIndex(0, 5, 1) - self.assert_index_equal(result, expected, exact=True) + tm.assert_index_equal(result, expected, exact=True) result = Index(range(1, 5, 2)) expected = RangeIndex(1, 5, 2) - self.assert_index_equal(result, expected, exact=True) + tm.assert_index_equal(result, expected, exact=True) - self.assertRaises(TypeError, - lambda: Index(range(1, 5, 2), dtype='float64')) + pytest.raises(TypeError, + lambda: Index(range(1, 5, 2), dtype='float64')) def test_constructor_name(self): # GH12288 @@ -170,16 +172,16 @@ def test_constructor_name(self): copy = RangeIndex(orig) copy.name = 'copy' - self.assertTrue(orig.name, 'original') - self.assertTrue(copy.name, 'copy') + assert orig.name == 'original' + assert copy.name == 'copy' new = Index(copy) - self.assertTrue(new.name, 'copy') + assert new.name == 'copy' new.name = 'new' - self.assertTrue(orig.name, 'original') - self.assertTrue(new.name, 'copy') - self.assertTrue(new.name, 'new') + assert orig.name == 'original' + assert copy.name == 'copy' + assert new.name == 'new' def test_numeric_compat2(self): # validate that we are handling the RangeIndex overrides to numeric ops @@ -189,15 +191,15 @@ def test_numeric_compat2(self): result = idx * 2 expected = RangeIndex(0, 20, 4) - self.assert_index_equal(result, expected, exact=True) + tm.assert_index_equal(result, expected, exact=True) result = idx + 2 expected = RangeIndex(2, 12, 2) - self.assert_index_equal(result, expected, exact=True) + tm.assert_index_equal(result, expected, exact=True) result = idx - 2 expected = RangeIndex(-2, 8, 2) - self.assert_index_equal(result, expected, exact=True) + tm.assert_index_equal(result, expected, exact=True) # truediv under PY3 result = idx / 2 @@ -206,11 +208,11 @@ def test_numeric_compat2(self): expected = RangeIndex(0, 5, 1).astype('float64') else: expected = RangeIndex(0, 5, 1) - self.assert_index_equal(result, expected, exact=True) + tm.assert_index_equal(result, expected, exact=True) result = idx / 4 expected = RangeIndex(0, 10, 2) / 4 - self.assert_index_equal(result, expected, exact=True) + tm.assert_index_equal(result, expected, exact=True) result = idx // 1 expected = idx @@ -244,25 +246,25 @@ def test_numeric_compat2(self): def test_constructor_corner(self): arr = np.array([1, 2, 3, 4], dtype=object) index = RangeIndex(1, 5) - self.assertEqual(index.values.dtype, np.int64) - self.assert_index_equal(index, Index(arr)) + assert index.values.dtype == np.int64 + tm.assert_index_equal(index, Index(arr)) # non-int raise Exception - self.assertRaises(TypeError, RangeIndex, '1', '10', '1') - self.assertRaises(TypeError, RangeIndex, 1.1, 10.2, 1.3) + pytest.raises(TypeError, RangeIndex, '1', '10', '1') + pytest.raises(TypeError, RangeIndex, 1.1, 10.2, 1.3) # invalid passed type - self.assertRaises(TypeError, lambda: RangeIndex(1, 5, dtype='float64')) + pytest.raises(TypeError, lambda: RangeIndex(1, 5, dtype='float64')) def test_copy(self): i = RangeIndex(5, name='Foo') i_copy = i.copy() - self.assertTrue(i_copy is not i) - self.assertTrue(i_copy.identical(i)) - self.assertEqual(i_copy._start, 0) - self.assertEqual(i_copy._stop, 5) - self.assertEqual(i_copy._step, 1) - 
self.assertEqual(i_copy.name, 'Foo') + assert i_copy is not i + assert i_copy.identical(i) + assert i_copy._start == 0 + assert i_copy._stop == 5 + assert i_copy._step == 1 + assert i_copy.name == 'Foo' def test_repr(self): i = RangeIndex(5, name='Foo') @@ -271,18 +273,18 @@ def test_repr(self): expected = "RangeIndex(start=0, stop=5, step=1, name='Foo')" else: expected = "RangeIndex(start=0, stop=5, step=1, name=u'Foo')" - self.assertTrue(result, expected) + assert result == expected result = eval(result) - self.assert_index_equal(result, i, exact=True) + tm.assert_index_equal(result, i, exact=True) i = RangeIndex(5, 0, -1) result = repr(i) expected = "RangeIndex(start=5, stop=0, step=-1)" - self.assertEqual(result, expected) + assert result == expected result = eval(result) - self.assert_index_equal(result, i, exact=True) + tm.assert_index_equal(result, i, exact=True) def test_insert(self): @@ -290,22 +292,22 @@ def test_insert(self): result = idx[1:4] # test 0th element - self.assert_index_equal(idx[0:4], result.insert(0, idx[0])) + tm.assert_index_equal(idx[0:4], result.insert(0, idx[0])) def test_delete(self): idx = RangeIndex(5, name='Foo') expected = idx[1:].astype(int) result = idx.delete(0) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) + tm.assert_index_equal(result, expected) + assert result.name == expected.name expected = idx[:-1].astype(int) result = idx.delete(-1) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) + tm.assert_index_equal(result, expected) + assert result.name == expected.name - with tm.assertRaises((IndexError, ValueError)): + with pytest.raises((IndexError, ValueError)): # either depending on numpy version result = idx.delete(len(idx)) @@ -314,7 +316,7 @@ def test_view(self): i = RangeIndex(0, name='Foo') i_view = i.view() - self.assertEqual(i_view.name, 'Foo') + assert i_view.name == 'Foo' i_view = i.view('i8') tm.assert_numpy_array_equal(i.values, i_view) @@ -323,31 +325,31 @@ def test_view(self): tm.assert_index_equal(i, i_view) def test_dtype(self): - self.assertEqual(self.index.dtype, np.int64) + assert self.index.dtype == np.int64 def test_is_monotonic(self): - self.assertTrue(self.index.is_monotonic) - self.assertTrue(self.index.is_monotonic_increasing) - self.assertFalse(self.index.is_monotonic_decreasing) + assert self.index.is_monotonic + assert self.index.is_monotonic_increasing + assert not self.index.is_monotonic_decreasing index = RangeIndex(4, 0, -1) - self.assertFalse(index.is_monotonic) - self.assertTrue(index.is_monotonic_decreasing) + assert not index.is_monotonic + assert index.is_monotonic_decreasing index = RangeIndex(1, 2) - self.assertTrue(index.is_monotonic) - self.assertTrue(index.is_monotonic_increasing) - self.assertTrue(index.is_monotonic_decreasing) + assert index.is_monotonic + assert index.is_monotonic_increasing + assert index.is_monotonic_decreasing index = RangeIndex(2, 1) - self.assertTrue(index.is_monotonic) - self.assertTrue(index.is_monotonic_increasing) - self.assertTrue(index.is_monotonic_decreasing) + assert index.is_monotonic + assert index.is_monotonic_increasing + assert index.is_monotonic_decreasing index = RangeIndex(1, 1) - self.assertTrue(index.is_monotonic) - self.assertTrue(index.is_monotonic_increasing) - self.assertTrue(index.is_monotonic_decreasing) + assert index.is_monotonic + assert index.is_monotonic_increasing + assert index.is_monotonic_decreasing def test_equals_range(self): equiv_pairs = [(RangeIndex(0, 9, 2),
RangeIndex(0, 10, 2)), @@ -355,54 +357,53 @@ def test_equals_range(self): (RangeIndex(1, 2, 3), RangeIndex(1, 3, 4)), (RangeIndex(0, -9, -2), RangeIndex(0, -10, -2))] for left, right in equiv_pairs: - self.assertTrue(left.equals(right)) - self.assertTrue(right.equals(left)) + assert left.equals(right) + assert right.equals(left) def test_logical_compat(self): idx = self.create_index() - self.assertEqual(idx.all(), idx.values.all()) - self.assertEqual(idx.any(), idx.values.any()) + assert idx.all() == idx.values.all() + assert idx.any() == idx.values.any() def test_identical(self): i = Index(self.index.copy()) - self.assertTrue(i.identical(self.index)) + assert i.identical(self.index) # we don't allow object dtype for RangeIndex if isinstance(self.index, RangeIndex): return same_values_different_type = Index(i, dtype=object) - self.assertFalse(i.identical(same_values_different_type)) + assert not i.identical(same_values_different_type) i = self.index.copy(dtype=object) i = i.rename('foo') same_values = Index(i, dtype=object) - self.assertTrue(same_values.identical(self.index.copy(dtype=object))) + assert same_values.identical(self.index.copy(dtype=object)) - self.assertFalse(i.identical(self.index)) - self.assertTrue(Index(same_values, name='foo', dtype=object).identical( - i)) + assert not i.identical(self.index) + assert Index(same_values, name='foo', dtype=object).identical(i) - self.assertFalse(self.index.copy(dtype=object) - .identical(self.index.copy(dtype='int64'))) + assert not self.index.copy(dtype=object).identical( + self.index.copy(dtype='int64')) def test_get_indexer(self): target = RangeIndex(10) indexer = self.index.get_indexer(target) expected = np.array([0, -1, 1, -1, 2, -1, 3, -1, 4, -1], dtype=np.intp) - self.assert_numpy_array_equal(indexer, expected) + tm.assert_numpy_array_equal(indexer, expected) def test_get_indexer_pad(self): target = RangeIndex(10) indexer = self.index.get_indexer(target, method='pad') expected = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4], dtype=np.intp) - self.assert_numpy_array_equal(indexer, expected) + tm.assert_numpy_array_equal(indexer, expected) def test_get_indexer_backfill(self): target = RangeIndex(10) indexer = self.index.get_indexer(target, method='backfill') expected = np.array([0, 1, 1, 2, 2, 3, 3, 4, 4, 5], dtype=np.intp) - self.assert_numpy_array_equal(indexer, expected) + tm.assert_numpy_array_equal(indexer, expected) def test_join_outer(self): # join with Int64Index @@ -411,7 +412,7 @@ def test_join_outer(self): res, lidx, ridx = self.index.join(other, how='outer', return_indexers=True) noidx_res = self.index.join(other, how='outer') - self.assert_index_equal(res, noidx_res) + tm.assert_index_equal(res, noidx_res) eres = Int64Index([0, 2, 4, 6, 8, 10, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]) @@ -420,11 +421,11 @@ def test_join_outer(self): eridx = np.array([-1, -1, -1, -1, -1, -1, -1, -1, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0], dtype=np.intp) - self.assertIsInstance(res, Int64Index) - self.assertFalse(isinstance(res, RangeIndex)) - self.assert_index_equal(res, eres) - self.assert_numpy_array_equal(lidx, elidx) - self.assert_numpy_array_equal(ridx, eridx) + assert isinstance(res, Int64Index) + assert not isinstance(res, RangeIndex) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_numpy_array_equal(ridx, eridx) # join with RangeIndex other = RangeIndex(25, 14, -1) @@ -432,13 +433,13 @@ def test_join_outer(self): res, lidx, ridx = self.index.join(other, how='outer', return_indexers=True) 
noidx_res = self.index.join(other, how='outer') - self.assert_index_equal(res, noidx_res) + tm.assert_index_equal(res, noidx_res) - self.assertIsInstance(res, Int64Index) - self.assertFalse(isinstance(res, RangeIndex)) - self.assert_index_equal(res, eres) - self.assert_numpy_array_equal(lidx, elidx) - self.assert_numpy_array_equal(ridx, eridx) + assert isinstance(res, Int64Index) + assert not isinstance(res, RangeIndex) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_numpy_array_equal(ridx, eridx) def test_join_inner(self): # Join with non-RangeIndex @@ -457,10 +458,10 @@ def test_join_inner(self): elidx = np.array([8, 9], dtype=np.intp) eridx = np.array([9, 7], dtype=np.intp) - self.assertIsInstance(res, Int64Index) - self.assert_index_equal(res, eres) - self.assert_numpy_array_equal(lidx, elidx) - self.assert_numpy_array_equal(ridx, eridx) + assert isinstance(res, Int64Index) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_numpy_array_equal(ridx, eridx) # Join two RangeIndex other = RangeIndex(25, 14, -1) @@ -468,10 +469,10 @@ def test_join_inner(self): res, lidx, ridx = self.index.join(other, how='inner', return_indexers=True) - self.assertIsInstance(res, RangeIndex) - self.assert_index_equal(res, eres) - self.assert_numpy_array_equal(lidx, elidx) - self.assert_numpy_array_equal(ridx, eridx) + assert isinstance(res, RangeIndex) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_numpy_array_equal(ridx, eridx) def test_join_left(self): # Join with Int64Index @@ -482,10 +483,10 @@ def test_join_left(self): eres = self.index eridx = np.array([-1, -1, -1, -1, -1, -1, -1, -1, 9, 7], dtype=np.intp) - self.assertIsInstance(res, RangeIndex) - self.assert_index_equal(res, eres) - self.assertIsNone(lidx) - self.assert_numpy_array_equal(ridx, eridx) + assert isinstance(res, RangeIndex) + tm.assert_index_equal(res, eres) + assert lidx is None + tm.assert_numpy_array_equal(ridx, eridx) # Join with RangeIndex other = Int64Index(np.arange(25, 14, -1)) @@ -493,10 +494,10 @@ def test_join_left(self): res, lidx, ridx = self.index.join(other, how='left', return_indexers=True) - self.assertIsInstance(res, RangeIndex) - self.assert_index_equal(res, eres) - self.assertIsNone(lidx) - self.assert_numpy_array_equal(ridx, eridx) + assert isinstance(res, RangeIndex) + tm.assert_index_equal(res, eres) + assert lidx is None + tm.assert_numpy_array_equal(ridx, eridx) def test_join_right(self): # Join with Int64Index @@ -508,10 +509,10 @@ def test_join_right(self): elidx = np.array([-1, -1, -1, -1, -1, -1, -1, 9, -1, 8, -1], dtype=np.intp) - self.assertIsInstance(other, Int64Index) - self.assert_index_equal(res, eres) - self.assert_numpy_array_equal(lidx, elidx) - self.assertIsNone(ridx) + assert isinstance(other, Int64Index) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + assert ridx is None # Join with RangeIndex other = RangeIndex(25, 14, -1) @@ -520,10 +521,10 @@ def test_join_right(self): return_indexers=True) eres = other - self.assertIsInstance(other, RangeIndex) - self.assert_index_equal(res, eres) - self.assert_numpy_array_equal(lidx, elidx) - self.assertIsNone(ridx) + assert isinstance(other, RangeIndex) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + assert ridx is None def test_join_non_int_index(self): other = Index([3, 6, 7, 8, 10], dtype=object) @@ -531,26 +532,26 @@ def test_join_non_int_index(self): outer =
self.index.join(other, how='outer') outer2 = other.join(self.index, how='outer') expected = Index([0, 2, 3, 4, 6, 7, 8, 10, 12, 14, 16, 18]) - self.assert_index_equal(outer, outer2) - self.assert_index_equal(outer, expected) + tm.assert_index_equal(outer, outer2) + tm.assert_index_equal(outer, expected) inner = self.index.join(other, how='inner') inner2 = other.join(self.index, how='inner') expected = Index([6, 8, 10]) - self.assert_index_equal(inner, inner2) - self.assert_index_equal(inner, expected) + tm.assert_index_equal(inner, inner2) + tm.assert_index_equal(inner, expected) left = self.index.join(other, how='left') - self.assert_index_equal(left, self.index.astype(object)) + tm.assert_index_equal(left, self.index.astype(object)) left2 = other.join(self.index, how='left') - self.assert_index_equal(left2, other) + tm.assert_index_equal(left2, other) right = self.index.join(other, how='right') - self.assert_index_equal(right, other) + tm.assert_index_equal(right, other) right2 = other.join(self.index, how='right') - self.assert_index_equal(right2, self.index.astype(object)) + tm.assert_index_equal(right2, self.index.astype(object)) def test_join_non_unique(self): other = Index([4, 4, 3, 3]) @@ -562,15 +563,15 @@ def test_join_non_unique(self): eridx = np.array([-1, -1, 0, 1, -1, -1, -1, -1, -1, -1, -1], dtype=np.intp) - self.assert_index_equal(res, eres) - self.assert_numpy_array_equal(lidx, elidx) - self.assert_numpy_array_equal(ridx, eridx) + tm.assert_index_equal(res, eres) + tm.assert_numpy_array_equal(lidx, elidx) + tm.assert_numpy_array_equal(ridx, eridx) def test_join_self(self): kinds = 'outer', 'inner', 'left', 'right' for kind in kinds: joined = self.index.join(self.index, how=kind) - self.assertIs(self.index, joined) + assert self.index is joined def test_intersection(self): # intersect with Int64Index @@ -578,26 +579,26 @@ def test_intersection(self): result = self.index.intersection(other) expected = Index(np.sort(np.intersect1d(self.index.values, other.values))) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) result = other.intersection(self.index) expected = Index(np.sort(np.asarray(np.intersect1d(self.index.values, other.values)))) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) # intersect with increasing RangeIndex other = RangeIndex(1, 6) result = self.index.intersection(other) expected = Index(np.sort(np.intersect1d(self.index.values, other.values))) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) # intersect with decreasing RangeIndex other = RangeIndex(5, 0, -1) result = self.index.intersection(other) expected = Index(np.sort(np.intersect1d(self.index.values, other.values))) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) index = RangeIndex(5) @@ -605,28 +606,28 @@ def test_intersection(self): other = RangeIndex(5, 10, 1) result = index.intersection(other) expected = RangeIndex(0, 0, 1) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) other = RangeIndex(-1, -5, -1) result = index.intersection(other) expected = RangeIndex(0, 0, 1) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) # intersection of empty indices other = RangeIndex(0, 0, 1) result = index.intersection(other) expected = RangeIndex(0, 0, 1) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) result = other.intersection(index) - 
self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) # intersection of non-overlapping values based on start value and gcd index = RangeIndex(1, 10, 2) other = RangeIndex(0, 10, 4) result = index.intersection(other) expected = RangeIndex(0, 0, 1) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) def test_intersect_str_dates(self): dt_dates = [datetime(2012, 2, 9), datetime(2012, 2, 22)] @@ -635,7 +636,7 @@ def test_intersect_str_dates(self): i2 = Index(['aa'], dtype=object) res = i2.intersection(i1) - self.assertEqual(len(res), 0) + assert len(res) == 0 def test_union_noncomparable(self): from datetime import datetime, timedelta @@ -644,11 +645,11 @@ def test_union_noncomparable(self): other = Index([now + timedelta(i) for i in range(4)], dtype=object) result = self.index.union(other) expected = Index(np.concatenate((self.index, other))) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) result = other.union(self.index) expected = Index(np.concatenate((other, self.index))) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) def test_union(self): RI = RangeIndex @@ -687,30 +688,30 @@ def test_nbytes(self): # memory savings vs int index i = RangeIndex(0, 1000) - self.assertTrue(i.nbytes < i.astype(int).nbytes / 10) + assert i.nbytes < i.astype(int).nbytes / 10 # constant memory usage i2 = RangeIndex(0, 10) - self.assertEqual(i.nbytes, i2.nbytes) + assert i.nbytes == i2.nbytes def test_cant_or_shouldnt_cast(self): # can't - self.assertRaises(TypeError, RangeIndex, 'foo', 'bar', 'baz') + pytest.raises(TypeError, RangeIndex, 'foo', 'bar', 'baz') # shouldn't - self.assertRaises(TypeError, RangeIndex, '0', '1', '2') + pytest.raises(TypeError, RangeIndex, '0', '1', '2') def test_view_Index(self): self.index.view(Index) def test_prevent_casting(self): result = self.index.astype('O') - self.assertEqual(result.dtype, np.object_) + assert result.dtype == np.object_ def test_take_preserve_name(self): index = RangeIndex(1, 5, name='foo') taken = index.take([3, 0, 1]) - self.assertEqual(index.name, taken.name) + assert index.name == taken.name def test_take_fill_value(self): # GH 12631 @@ -721,7 +722,7 @@ def test_take_fill_value(self): # fill_value msg = "Unable to fill values because RangeIndex cannot contain NA" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): idx.take(np.array([1, 0, -1]), fill_value=True) # allow_fill=False @@ -731,12 +732,12 @@ def test_take_fill_value(self): tm.assert_index_equal(result, expected) msg = "Unable to fill values because RangeIndex cannot contain NA" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): idx.take(np.array([1, 0, -2]), fill_value=True) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): idx.take(np.array([1, 0, -5]), fill_value=True) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): idx.take(np.array([1, -5])) def test_print_unicode_columns(self): @@ -750,7 +751,7 @@ def test_repr_roundtrip(self): def test_slice_keep_name(self): idx = RangeIndex(1, 2, name='asdf') - self.assertEqual(idx.name, idx[1:].name) + assert idx.name == idx[1:].name def test_explicit_conversions(self): @@ -782,8 +783,8 @@ def test_duplicates(self): if not len(ind): continue idx = self.indices[ind] - self.assertTrue(idx.is_unique) - self.assertFalse(idx.has_duplicates) + assert idx.is_unique + assert 
not idx.has_duplicates def test_ufunc_compat(self): idx = RangeIndex(5) @@ -793,48 +794,48 @@ def test_ufunc_compat(self): def test_extended_gcd(self): result = self.index._extended_gcd(6, 10) - self.assertEqual(result[0], result[1] * 6 + result[2] * 10) - self.assertEqual(2, result[0]) + assert result[0] == result[1] * 6 + result[2] * 10 + assert 2 == result[0] result = self.index._extended_gcd(10, 6) - self.assertEqual(2, result[1] * 10 + result[2] * 6) - self.assertEqual(2, result[0]) + assert 2 == result[1] * 10 + result[2] * 6 + assert 2 == result[0] def test_min_fitting_element(self): result = RangeIndex(0, 20, 2)._min_fitting_element(1) - self.assertEqual(2, result) + assert 2 == result result = RangeIndex(1, 6)._min_fitting_element(1) - self.assertEqual(1, result) + assert 1 == result result = RangeIndex(18, -2, -2)._min_fitting_element(1) - self.assertEqual(2, result) + assert 2 == result result = RangeIndex(5, 0, -1)._min_fitting_element(1) - self.assertEqual(1, result) + assert 1 == result big_num = 500000000000000000000000 result = RangeIndex(5, big_num * 2, 1)._min_fitting_element(big_num) - self.assertEqual(big_num, result) + assert big_num == result def test_max_fitting_element(self): result = RangeIndex(0, 20, 2)._max_fitting_element(17) - self.assertEqual(16, result) + assert 16 == result result = RangeIndex(1, 6)._max_fitting_element(4) - self.assertEqual(4, result) + assert 4 == result result = RangeIndex(18, -2, -2)._max_fitting_element(17) - self.assertEqual(16, result) + assert 16 == result result = RangeIndex(5, 0, -1)._max_fitting_element(4) - self.assertEqual(4, result) + assert 4 == result big_num = 500000000000000000000000 result = RangeIndex(5, big_num * 2, 1)._max_fitting_element(big_num) - self.assertEqual(big_num, result) + assert big_num == result def test_pickle_compat_construction(self): # RangeIndex() is a valid constructor @@ -845,53 +846,53 @@ def test_slice_specialised(self): # scalar indexing res = self.index[1] expected = 2 - self.assertEqual(res, expected) + assert res == expected res = self.index[-1] expected = 18 - self.assertEqual(res, expected) + assert res == expected # slicing # slice value completion index = self.index[:] expected = self.index - self.assert_index_equal(index, expected) + tm.assert_index_equal(index, expected) # positive slice values index = self.index[7:10:2] expected = Index(np.array([14, 18]), name='foo') - self.assert_index_equal(index, expected) + tm.assert_index_equal(index, expected) # negative slice values index = self.index[-1:-5:-2] expected = Index(np.array([18, 14]), name='foo') - self.assert_index_equal(index, expected) + tm.assert_index_equal(index, expected) # stop overshoot index = self.index[2:100:4] expected = Index(np.array([4, 12]), name='foo') - self.assert_index_equal(index, expected) + tm.assert_index_equal(index, expected) # reverse index = self.index[::-1] expected = Index(self.index.values[::-1], name='foo') - self.assert_index_equal(index, expected) + tm.assert_index_equal(index, expected) index = self.index[-8::-1] expected = Index(np.array([4, 2, 0]), name='foo') - self.assert_index_equal(index, expected) + tm.assert_index_equal(index, expected) index = self.index[-40::-1] expected = Index(np.array([], dtype=np.int64), name='foo') - self.assert_index_equal(index, expected) + tm.assert_index_equal(index, expected) index = self.index[40::-1] expected = Index(self.index.values[40::-1], name='foo') - self.assert_index_equal(index, expected) + tm.assert_index_equal(index, expected) index = 
self.index[10::-1] expected = Index(self.index.values[::-1], name='foo') - self.assert_index_equal(index, expected) + tm.assert_index_equal(index, expected) def test_len_specialised(self): @@ -902,16 +903,41 @@ def test_len_specialised(self): arr = np.arange(0, 5, step) i = RangeIndex(0, 5, step) - self.assertEqual(len(i), len(arr)) + assert len(i) == len(arr) i = RangeIndex(5, 0, step) - self.assertEqual(len(i), 0) + assert len(i) == 0 for step in np.arange(-6, -1, 1): arr = np.arange(5, 0, step) i = RangeIndex(5, 0, step) - self.assertEqual(len(i), len(arr)) + assert len(i) == len(arr) i = RangeIndex(0, 5, step) - self.assertEqual(len(i), 0) + assert len(i) == 0 + + def test_where(self): + i = self.create_index() + result = i.where(notnull(i)) + expected = i + tm.assert_index_equal(result, expected) + + _nan = i._na_value + cond = [False] + [True] * len(i[1:]) + expected = pd.Index([_nan] + i[1:].tolist()) + + result = i.where(cond) + tm.assert_index_equal(result, expected) + + def test_where_array_like(self): + i = self.create_index() + + _nan = i._na_value + cond = [False] + [True] * (len(i) - 1) + klasses = [list, tuple, np.array, pd.Series] + expected = pd.Index([_nan] + i[1:].tolist()) + + for klass in klasses: + result = i.where(klass(cond)) + tm.assert_index_equal(result, expected) diff --git a/vb_suite/source/_static/stub b/pandas/tests/indexes/timedeltas/__init__.py similarity index 100% rename from vb_suite/source/_static/stub rename to pandas/tests/indexes/timedeltas/__init__.py diff --git a/pandas/tests/indexes/timedeltas/test_astype.py b/pandas/tests/indexes/timedeltas/test_astype.py new file mode 100644 index 0000000000000..586b96f980f8f --- /dev/null +++ b/pandas/tests/indexes/timedeltas/test_astype.py @@ -0,0 +1,123 @@ +import pytest + +import numpy as np + +import pandas as pd +import pandas.util.testing as tm +from pandas import (TimedeltaIndex, timedelta_range, Int64Index, Float64Index, + Index, Timedelta, Series) + +from ..datetimelike import DatetimeLike + + +class TestTimedeltaIndex(DatetimeLike): + _holder = TimedeltaIndex + _multiprocess_can_split_ = True + + def setup_method(self, method): + self.indices = dict(index=tm.makeTimedeltaIndex(10)) + self.setup_indices() + + def create_index(self): + return pd.to_timedelta(range(5), unit='d') + pd.offsets.Hour(1) + + def test_astype(self): + # GH 13149, GH 13209 + idx = TimedeltaIndex([1e14, 'NaT', pd.NaT, np.NaN]) + + result = idx.astype(object) + expected = Index([Timedelta('1 days 03:46:40')] + [pd.NaT] * 3, + dtype=object) + tm.assert_index_equal(result, expected) + + result = idx.astype(int) + expected = Int64Index([100000000000000] + [-9223372036854775808] * 3, + dtype=np.int64) + tm.assert_index_equal(result, expected) + + rng = timedelta_range('1 days', periods=10) + + result = rng.astype('i8') + tm.assert_index_equal(result, Index(rng.asi8)) + tm.assert_numpy_array_equal(rng.asi8, result.values) + + def test_astype_timedelta64(self): + # GH 13149, GH 13209 + idx = TimedeltaIndex([1e14, 'NaT', pd.NaT, np.NaN]) + + result = idx.astype('timedelta64') + expected = Float64Index([1e+14] + [np.NaN] * 3, dtype='float64') + tm.assert_index_equal(result, expected) + + result = idx.astype('timedelta64[ns]') + tm.assert_index_equal(result, idx) + assert result is not idx + + result = idx.astype('timedelta64[ns]', copy=False) + tm.assert_index_equal(result, idx) + assert result is idx + + def test_astype_raises(self): + # GH 13149, GH 13209 + idx = TimedeltaIndex([1e14, 'NaT', pd.NaT, np.NaN]) + + 
pytest.raises(ValueError, idx.astype, float) + pytest.raises(ValueError, idx.astype, str) + pytest.raises(ValueError, idx.astype, 'datetime64') + pytest.raises(ValueError, idx.astype, 'datetime64[ns]') + + def test_pickle_compat_construction(self): + pass + + def test_shift(self): + # test shift for TimedeltaIndex + # GH 8083 + + drange = self.create_index() + result = drange.shift(1) + expected = TimedeltaIndex(['1 days 01:00:00', '2 days 01:00:00', + '3 days 01:00:00', + '4 days 01:00:00', '5 days 01:00:00'], + freq='D') + tm.assert_index_equal(result, expected) + + result = drange.shift(3, freq='2D 1s') + expected = TimedeltaIndex(['6 days 01:00:03', '7 days 01:00:03', + '8 days 01:00:03', '9 days 01:00:03', + '10 days 01:00:03'], freq='D') + tm.assert_index_equal(result, expected) + + def test_numeric_compat(self): + + idx = self._holder(np.arange(5, dtype='int64')) + didx = self._holder(np.arange(5, dtype='int64') ** 2) + result = idx * 1 + tm.assert_index_equal(result, idx) + + result = 1 * idx + tm.assert_index_equal(result, idx) + + result = idx / 1 + tm.assert_index_equal(result, idx) + + result = idx // 1 + tm.assert_index_equal(result, idx) + + result = idx * np.array(5, dtype='int64') + tm.assert_index_equal(result, + self._holder(np.arange(5, dtype='int64') * 5)) + + result = idx * np.arange(5, dtype='int64') + tm.assert_index_equal(result, didx) + + result = idx * Series(np.arange(5, dtype='int64')) + tm.assert_index_equal(result, didx) + + result = idx * Series(np.arange(5, dtype='float64') + 0.1) + tm.assert_index_equal(result, self._holder(np.arange( + 5, dtype='float64') * (np.arange(5, dtype='float64') + 0.1))) + + # invalid + pytest.raises(TypeError, lambda: idx * idx) + pytest.raises(ValueError, lambda: idx * self._holder(np.arange(3))) + pytest.raises(ValueError, lambda: idx * np.array([1, 2])) diff --git a/pandas/tests/indexes/timedeltas/test_construction.py b/pandas/tests/indexes/timedeltas/test_construction.py new file mode 100644 index 0000000000000..dd25e2cca2e55 --- /dev/null +++ b/pandas/tests/indexes/timedeltas/test_construction.py @@ -0,0 +1,88 @@ +import pytest + +import numpy as np +from datetime import timedelta + +import pandas as pd +import pandas.util.testing as tm +from pandas import TimedeltaIndex, timedelta_range, to_timedelta + + +class TestTimedeltaIndex(object): + _multiprocess_can_split_ = True + + def test_construction_base_constructor(self): + arr = [pd.Timedelta('1 days'), pd.NaT, pd.Timedelta('3 days')] + tm.assert_index_equal(pd.Index(arr), pd.TimedeltaIndex(arr)) + tm.assert_index_equal(pd.Index(np.array(arr)), + pd.TimedeltaIndex(np.array(arr))) + + arr = [np.nan, pd.NaT, pd.Timedelta('1 days')] + tm.assert_index_equal(pd.Index(arr), pd.TimedeltaIndex(arr)) + tm.assert_index_equal(pd.Index(np.array(arr)), + pd.TimedeltaIndex(np.array(arr))) + + def test_constructor(self): + expected = TimedeltaIndex(['1 days', '1 days 00:00:05', '2 days', + '2 days 00:00:02', '0 days 00:00:03']) + result = TimedeltaIndex(['1 days', '1 days, 00:00:05', np.timedelta64( + 2, 'D'), timedelta(days=2, seconds=2), pd.offsets.Second(3)]) + tm.assert_index_equal(result, expected) + + # unicode + result = TimedeltaIndex([u'1 days', '1 days, 00:00:05', np.timedelta64( + 2, 'D'), timedelta(days=2, seconds=2), pd.offsets.Second(3)]) + + expected = TimedeltaIndex(['0 days 00:00:00', '0 days 00:00:01', + '0 days 00:00:02']) + tm.assert_index_equal(TimedeltaIndex(range(3), unit='s'), expected) + expected = TimedeltaIndex(['0 days 00:00:00', '0 days 00:00:05', + '0 days 
00:00:09']) + tm.assert_index_equal(TimedeltaIndex([0, 5, 9], unit='s'), expected) + expected = TimedeltaIndex( + ['0 days 00:00:00.400', '0 days 00:00:00.450', + '0 days 00:00:01.200']) + tm.assert_index_equal(TimedeltaIndex([400, 450, 1200], unit='ms'), + expected) + + def test_constructor_coverage(self): + rng = timedelta_range('1 days', periods=10.5) + exp = timedelta_range('1 days', periods=10) + tm.assert_index_equal(rng, exp) + + pytest.raises(ValueError, TimedeltaIndex, start='1 days', + periods='foo', freq='D') + + pytest.raises(ValueError, TimedeltaIndex, start='1 days', + end='10 days') + + pytest.raises(ValueError, TimedeltaIndex, '1 days') + + # generator expression + gen = (timedelta(i) for i in range(10)) + result = TimedeltaIndex(gen) + expected = TimedeltaIndex([timedelta(i) for i in range(10)]) + tm.assert_index_equal(result, expected) + + # NumPy string array + strings = np.array(['1 days', '2 days', '3 days']) + result = TimedeltaIndex(strings) + expected = to_timedelta([1, 2, 3], unit='d') + tm.assert_index_equal(result, expected) + + from_ints = TimedeltaIndex(expected.asi8) + tm.assert_index_equal(from_ints, expected) + + # non-conforming freq + pytest.raises(ValueError, TimedeltaIndex, + ['1 days', '2 days', '4 days'], freq='D') + + pytest.raises(ValueError, TimedeltaIndex, periods=10, freq='D') + + def test_constructor_name(self): + idx = TimedeltaIndex(start='1 days', periods=1, freq='D', name='TEST') + assert idx.name == 'TEST' + + # GH10025 + idx2 = TimedeltaIndex(idx, name='something else') + assert idx2.name == 'something else' diff --git a/pandas/tests/indexes/timedeltas/test_indexing.py b/pandas/tests/indexes/timedeltas/test_indexing.py new file mode 100644 index 0000000000000..844033cc19eed --- /dev/null +++ b/pandas/tests/indexes/timedeltas/test_indexing.py @@ -0,0 +1,112 @@ +import pytest + +from datetime import timedelta + +import pandas.util.testing as tm +from pandas import TimedeltaIndex, timedelta_range, compat, Index, Timedelta + + +class TestTimedeltaIndex(object): + _multiprocess_can_split_ = True + + def test_insert(self): + + idx = TimedeltaIndex(['4day', '1day', '2day'], name='idx') + + result = idx.insert(2, timedelta(days=5)) + exp = TimedeltaIndex(['4day', '1day', '5day', '2day'], name='idx') + tm.assert_index_equal(result, exp) + + # insertion of non-datetime should coerce to object index + result = idx.insert(1, 'inserted') + expected = Index([Timedelta('4day'), 'inserted', Timedelta('1day'), + Timedelta('2day')], name='idx') + assert not isinstance(result, TimedeltaIndex) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + + idx = timedelta_range('1day 00:00:01', periods=3, freq='s', name='idx') + + # preserve freq + expected_0 = TimedeltaIndex(['1day', '1day 00:00:01', '1day 00:00:02', + '1day 00:00:03'], + name='idx', freq='s') + expected_3 = TimedeltaIndex(['1day 00:00:01', '1day 00:00:02', + '1day 00:00:03', '1day 00:00:04'], + name='idx', freq='s') + + # reset freq to None + expected_1_nofreq = TimedeltaIndex(['1day 00:00:01', '1day 00:00:01', + '1day 00:00:02', '1day 00:00:03'], + name='idx', freq=None) + expected_3_nofreq = TimedeltaIndex(['1day 00:00:01', '1day 00:00:02', + '1day 00:00:03', '1day 00:00:05'], + name='idx', freq=None) + + cases = [(0, Timedelta('1day'), expected_0), + (-3, Timedelta('1day'), expected_0), + (3, Timedelta('1day 00:00:04'), expected_3), + (1, Timedelta('1day 00:00:01'), expected_1_nofreq), + (3, Timedelta('1day 00:00:05'), expected_3_nofreq)] + + for n, d, expected in 
cases: + result = idx.insert(n, d) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == expected.freq + + def test_delete(self): + idx = timedelta_range(start='1 Days', periods=5, freq='D', name='idx') + + # preserve freq + expected_0 = timedelta_range(start='2 Days', periods=4, freq='D', + name='idx') + expected_4 = timedelta_range(start='1 Days', periods=4, freq='D', + name='idx') + + # reset freq to None + expected_1 = TimedeltaIndex( + ['1 day', '3 day', '4 day', '5 day'], freq=None, name='idx') + + cases = {0: expected_0, + -5: expected_0, + -1: expected_4, + 4: expected_4, + 1: expected_1} + for n, expected in compat.iteritems(cases): + result = idx.delete(n) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == expected.freq + + with pytest.raises((IndexError, ValueError)): + # either depending on numpy version + result = idx.delete(5) + + def test_delete_slice(self): + idx = timedelta_range(start='1 days', periods=10, freq='D', name='idx') + + # preserve freq + expected_0_2 = timedelta_range(start='4 days', periods=7, freq='D', + name='idx') + expected_7_9 = timedelta_range(start='1 days', periods=7, freq='D', + name='idx') + + # reset freq to None + expected_3_5 = TimedeltaIndex(['1 d', '2 d', '3 d', + '7 d', '8 d', '9 d', '10d'], + freq=None, name='idx') + + cases = {(0, 1, 2): expected_0_2, + (7, 8, 9): expected_7_9, + (3, 4, 5): expected_3_5} + for n, expected in compat.iteritems(cases): + result = idx.delete(n) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == expected.freq + + result = idx.delete(slice(n[0], n[-1] + 1)) + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert result.freq == expected.freq diff --git a/pandas/tests/indexes/timedeltas/test_ops.py b/pandas/tests/indexes/timedeltas/test_ops.py new file mode 100644 index 0000000000000..9a9912d4f0ab1 --- /dev/null +++ b/pandas/tests/indexes/timedeltas/test_ops.py @@ -0,0 +1,1284 @@ +import pytest + +import numpy as np +from datetime import timedelta +from distutils.version import LooseVersion + +import pandas as pd +import pandas.util.testing as tm +from pandas import to_timedelta +from pandas.util.testing import assert_series_equal, assert_frame_equal +from pandas import (Series, Timedelta, DataFrame, Timestamp, TimedeltaIndex, + timedelta_range, date_range, DatetimeIndex, Int64Index, + _np_version_under1p10, Float64Index, Index) +from pandas._libs.tslib import iNaT +from pandas.tests.test_base import Ops + + +class TestTimedeltaIndexOps(Ops): + def setup_method(self, method): + super(TestTimedeltaIndexOps, self).setup_method(method) + mask = lambda x: isinstance(x, TimedeltaIndex) + self.is_valid_objs = [o for o in self.objs if mask(o)] + self.not_valid_objs = [] + + def test_ops_properties(self): + f = lambda x: isinstance(x, TimedeltaIndex) + self.check_ops_properties(TimedeltaIndex._field_ops, f) + self.check_ops_properties(TimedeltaIndex._object_ops, f) + + def test_asobject_tolist(self): + idx = timedelta_range(start='1 days', periods=4, freq='D', name='idx') + expected_list = [Timedelta('1 days'), Timedelta('2 days'), + Timedelta('3 days'), Timedelta('4 days')] + expected = pd.Index(expected_list, dtype=object, name='idx') + result = idx.asobject + assert isinstance(result, Index) + + assert result.dtype == object + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert idx.tolist() ==
expected_list + + idx = TimedeltaIndex([timedelta(days=1), timedelta(days=2), pd.NaT, + timedelta(days=4)], name='idx') + expected_list = [Timedelta('1 days'), Timedelta('2 days'), pd.NaT, + Timedelta('4 days')] + expected = pd.Index(expected_list, dtype=object, name='idx') + result = idx.asobject + assert isinstance(result, Index) + assert result.dtype == object + tm.assert_index_equal(result, expected) + assert result.name == expected.name + assert idx.tolist() == expected_list + + def test_minmax(self): + + # monotonic + idx1 = TimedeltaIndex(['1 days', '2 days', '3 days']) + assert idx1.is_monotonic + + # non-monotonic + idx2 = TimedeltaIndex(['1 days', np.nan, '3 days', 'NaT']) + assert not idx2.is_monotonic + + for idx in [idx1, idx2]: + assert idx.min() == Timedelta('1 days') + assert idx.max() == Timedelta('3 days') + assert idx.argmin() == 0 + assert idx.argmax() == 2 + + for op in ['min', 'max']: + # Return NaT + obj = TimedeltaIndex([]) + assert pd.isnull(getattr(obj, op)()) + + obj = TimedeltaIndex([pd.NaT]) + assert pd.isnull(getattr(obj, op)()) + + obj = TimedeltaIndex([pd.NaT, pd.NaT, pd.NaT]) + assert pd.isnull(getattr(obj, op)()) + + def test_numpy_minmax(self): + dr = pd.date_range(start='2016-01-15', end='2016-01-20') + td = TimedeltaIndex(np.asarray(dr)) + + assert np.min(td) == Timedelta('16815 days') + assert np.max(td) == Timedelta('16820 days') + + errmsg = "the 'out' parameter is not supported" + tm.assert_raises_regex(ValueError, errmsg, np.min, td, out=0) + tm.assert_raises_regex(ValueError, errmsg, np.max, td, out=0) + + assert np.argmin(td) == 0 + assert np.argmax(td) == 5 + + if not _np_version_under1p10: + errmsg = "the 'out' parameter is not supported" + tm.assert_raises_regex( + ValueError, errmsg, np.argmin, td, out=0) + tm.assert_raises_regex( + ValueError, errmsg, np.argmax, td, out=0) + + def test_round(self): + td = pd.timedelta_range(start='16801 days', periods=5, freq='30Min') + elt = td[1] + + expected_rng = TimedeltaIndex([ + Timedelta('16801 days 00:00:00'), + Timedelta('16801 days 00:00:00'), + Timedelta('16801 days 01:00:00'), + Timedelta('16801 days 02:00:00'), + Timedelta('16801 days 02:00:00'), + ]) + expected_elt = expected_rng[1] + + tm.assert_index_equal(td.round(freq='H'), expected_rng) + assert elt.round(freq='H') == expected_elt + + msg = pd.tseries.frequencies._INVALID_FREQ_ERROR + with tm.assert_raises_regex(ValueError, msg): + td.round(freq='foo') + with tm.assert_raises_regex(ValueError, msg): + elt.round(freq='foo') + + msg = " is a non-fixed frequency" + tm.assert_raises_regex(ValueError, msg, td.round, freq='M') + tm.assert_raises_regex(ValueError, msg, elt.round, freq='M') + + def test_representation(self): + idx1 = TimedeltaIndex([], freq='D') + idx2 = TimedeltaIndex(['1 days'], freq='D') + idx3 = TimedeltaIndex(['1 days', '2 days'], freq='D') + idx4 = TimedeltaIndex(['1 days', '2 days', '3 days'], freq='D') + idx5 = TimedeltaIndex(['1 days 00:00:01', '2 days', '3 days']) + + exp1 = """TimedeltaIndex([], dtype='timedelta64[ns]', freq='D')""" + + exp2 = ("TimedeltaIndex(['1 days'], dtype='timedelta64[ns]', " + "freq='D')") + + exp3 = ("TimedeltaIndex(['1 days', '2 days'], " + "dtype='timedelta64[ns]', freq='D')") + + exp4 = ("TimedeltaIndex(['1 days', '2 days', '3 days'], " + "dtype='timedelta64[ns]', freq='D')") + + exp5 = ("TimedeltaIndex(['1 days 00:00:01', '2 days 00:00:00', " + "'3 days 00:00:00'], dtype='timedelta64[ns]', freq=None)") + + with pd.option_context('display.width', 300): + for idx, expected in zip([idx1, idx2, 
idx3, idx4, idx5], + [exp1, exp2, exp3, exp4, exp5]): + for func in ['__repr__', '__unicode__', '__str__']: + result = getattr(idx, func)() + assert result == expected + + def test_representation_to_series(self): + idx1 = TimedeltaIndex([], freq='D') + idx2 = TimedeltaIndex(['1 days'], freq='D') + idx3 = TimedeltaIndex(['1 days', '2 days'], freq='D') + idx4 = TimedeltaIndex(['1 days', '2 days', '3 days'], freq='D') + idx5 = TimedeltaIndex(['1 days 00:00:01', '2 days', '3 days']) + + exp1 = """Series([], dtype: timedelta64[ns])""" + + exp2 = """0 1 days +dtype: timedelta64[ns]""" + + exp3 = """0 1 days +1 2 days +dtype: timedelta64[ns]""" + + exp4 = """0 1 days +1 2 days +2 3 days +dtype: timedelta64[ns]""" + + exp5 = """0 1 days 00:00:01 +1 2 days 00:00:00 +2 3 days 00:00:00 +dtype: timedelta64[ns]""" + + with pd.option_context('display.width', 300): + for idx, expected in zip([idx1, idx2, idx3, idx4, idx5], + [exp1, exp2, exp3, exp4, exp5]): + result = repr(pd.Series(idx)) + assert result == expected + + def test_summary(self): + # GH9116 + idx1 = TimedeltaIndex([], freq='D') + idx2 = TimedeltaIndex(['1 days'], freq='D') + idx3 = TimedeltaIndex(['1 days', '2 days'], freq='D') + idx4 = TimedeltaIndex(['1 days', '2 days', '3 days'], freq='D') + idx5 = TimedeltaIndex(['1 days 00:00:01', '2 days', '3 days']) + + exp1 = """TimedeltaIndex: 0 entries +Freq: D""" + + exp2 = """TimedeltaIndex: 1 entries, 1 days to 1 days +Freq: D""" + + exp3 = """TimedeltaIndex: 2 entries, 1 days to 2 days +Freq: D""" + + exp4 = """TimedeltaIndex: 3 entries, 1 days to 3 days +Freq: D""" + + exp5 = ("TimedeltaIndex: 3 entries, 1 days 00:00:01 to 3 days " + "00:00:00") + + for idx, expected in zip([idx1, idx2, idx3, idx4, idx5], + [exp1, exp2, exp3, exp4, exp5]): + result = idx.summary() + assert result == expected + + def test_add_iadd(self): + + # only test adding/sub offsets as + is now numeric + + # offset + offsets = [pd.offsets.Hour(2), timedelta(hours=2), + np.timedelta64(2, 'h'), Timedelta(hours=2)] + + for delta in offsets: + rng = timedelta_range('1 days', '10 days') + result = rng + delta + expected = timedelta_range('1 days 02:00:00', '10 days 02:00:00', + freq='D') + tm.assert_index_equal(result, expected) + rng += delta + tm.assert_index_equal(rng, expected) + + # int + rng = timedelta_range('1 days 09:00:00', freq='H', periods=10) + result = rng + 1 + expected = timedelta_range('1 days 10:00:00', freq='H', periods=10) + tm.assert_index_equal(result, expected) + rng += 1 + tm.assert_index_equal(rng, expected) + + def test_sub_isub(self): + # only test adding/sub offsets as - is now numeric + + # offset + offsets = [pd.offsets.Hour(2), timedelta(hours=2), + np.timedelta64(2, 'h'), Timedelta(hours=2)] + + for delta in offsets: + rng = timedelta_range('1 days', '10 days') + result = rng - delta + expected = timedelta_range('0 days 22:00:00', '9 days 22:00:00') + tm.assert_index_equal(result, expected) + rng -= delta + tm.assert_index_equal(rng, expected) + + # int + rng = timedelta_range('1 days 09:00:00', freq='H', periods=10) + result = rng - 1 + expected = timedelta_range('1 days 08:00:00', freq='H', periods=10) + tm.assert_index_equal(result, expected) + rng -= 1 + tm.assert_index_equal(rng, expected) + + idx = TimedeltaIndex(['1 day', '2 day']) + msg = "cannot subtract a datelike from a TimedeltaIndex" + with tm.assert_raises_regex(TypeError, msg): + idx - Timestamp('2011-01-01') + + result = Timestamp('2011-01-01') + idx + expected = DatetimeIndex(['2011-01-02', '2011-01-03']) + 
tm.assert_index_equal(result, expected) + + def test_ops_compat(self): + + offsets = [pd.offsets.Hour(2), timedelta(hours=2), + np.timedelta64(2, 'h'), Timedelta(hours=2)] + + rng = timedelta_range('1 days', '10 days', name='foo') + + # multiply + for offset in offsets: + pytest.raises(TypeError, lambda: rng * offset) + + # divide + expected = Int64Index((np.arange(10) + 1) * 12, name='foo') + for offset in offsets: + result = rng / offset + tm.assert_index_equal(result, expected, exact=False) + + # floor divide + expected = Int64Index((np.arange(10) + 1) * 12, name='foo') + for offset in offsets: + result = rng // offset + tm.assert_index_equal(result, expected, exact=False) + + # divide with nats + rng = TimedeltaIndex(['1 days', pd.NaT, '2 days'], name='foo') + expected = Float64Index([12, np.nan, 24], name='foo') + for offset in offsets: + result = rng / offset + tm.assert_index_equal(result, expected) + + # don't allow division by NaT (maybe we could in the future) + pytest.raises(TypeError, lambda: rng / pd.NaT) + + def test_subtraction_ops(self): + + # with datetimes/timedelta and tdi/dti + tdi = TimedeltaIndex(['1 days', pd.NaT, '2 days'], name='foo') + dti = date_range('20130101', periods=3, name='bar') + td = Timedelta('1 days') + dt = Timestamp('20130101') + + pytest.raises(TypeError, lambda: tdi - dt) + pytest.raises(TypeError, lambda: tdi - dti) + pytest.raises(TypeError, lambda: td - dt) + pytest.raises(TypeError, lambda: td - dti) + + result = dt - dti + expected = TimedeltaIndex(['0 days', '-1 days', '-2 days'], name='bar') + tm.assert_index_equal(result, expected) + + result = dti - dt + expected = TimedeltaIndex(['0 days', '1 days', '2 days'], name='bar') + tm.assert_index_equal(result, expected) + + result = tdi - td + expected = TimedeltaIndex(['0 days', pd.NaT, '1 days'], name='foo') + tm.assert_index_equal(result, expected, check_names=False) + + result = td - tdi + expected = TimedeltaIndex(['0 days', pd.NaT, '-1 days'], name='foo') + tm.assert_index_equal(result, expected, check_names=False) + + result = dti - td + expected = DatetimeIndex( + ['20121231', '20130101', '20130102'], name='bar') + tm.assert_index_equal(result, expected, check_names=False) + + result = dt - tdi + expected = DatetimeIndex(['20121231', pd.NaT, '20121230'], name='foo') + tm.assert_index_equal(result, expected) + + def test_subtraction_ops_with_tz(self): + + # check that dt/dti subtraction ops with tz are validated + dti = date_range('20130101', periods=3) + ts = Timestamp('20130101') + dt = ts.to_pydatetime() + dti_tz = date_range('20130101', periods=3).tz_localize('US/Eastern') + ts_tz = Timestamp('20130101').tz_localize('US/Eastern') + ts_tz2 = Timestamp('20130101').tz_localize('CET') + dt_tz = ts_tz.to_pydatetime() + td = Timedelta('1 days') + + def _check(result, expected): + assert result == expected + assert isinstance(result, Timedelta) + + # scalars + result = ts - ts + expected = Timedelta('0 days') + _check(result, expected) + + result = dt_tz - ts_tz + expected = Timedelta('0 days') + _check(result, expected) + + result = ts_tz - dt_tz + expected = Timedelta('0 days') + _check(result, expected) + + # tz mismatches + pytest.raises(TypeError, lambda: dt_tz - ts) + pytest.raises(TypeError, lambda: dt_tz - dt) + pytest.raises(TypeError, lambda: dt_tz - ts_tz2) + pytest.raises(TypeError, lambda: dt - dt_tz) + pytest.raises(TypeError, lambda: ts - dt_tz) + pytest.raises(TypeError, lambda: ts_tz2 - ts) + pytest.raises(TypeError, lambda: ts_tz2 - dt) + pytest.raises(TypeError, lambda: ts_tz 
- ts_tz2) + + # with dti + pytest.raises(TypeError, lambda: dti - ts_tz) + pytest.raises(TypeError, lambda: dti_tz - ts) + pytest.raises(TypeError, lambda: dti_tz - ts_tz2) + + result = dti_tz - dt_tz + expected = TimedeltaIndex(['0 days', '1 days', '2 days']) + tm.assert_index_equal(result, expected) + + result = dt_tz - dti_tz + expected = TimedeltaIndex(['0 days', '-1 days', '-2 days']) + tm.assert_index_equal(result, expected) + + result = dti_tz - ts_tz + expected = TimedeltaIndex(['0 days', '1 days', '2 days']) + tm.assert_index_equal(result, expected) + + result = ts_tz - dti_tz + expected = TimedeltaIndex(['0 days', '-1 days', '-2 days']) + tm.assert_index_equal(result, expected) + + result = td - td + expected = Timedelta('0 days') + _check(result, expected) + + result = dti_tz - td + expected = DatetimeIndex( + ['20121231', '20130101', '20130102'], tz='US/Eastern') + tm.assert_index_equal(result, expected) + + def test_dti_tdi_numeric_ops(self): + + # These are normally union/diff set-like ops + tdi = TimedeltaIndex(['1 days', pd.NaT, '2 days'], name='foo') + dti = date_range('20130101', periods=3, name='bar') + + # TODO(wesm): unused? + # td = Timedelta('1 days') + # dt = Timestamp('20130101') + + result = tdi - tdi + expected = TimedeltaIndex(['0 days', pd.NaT, '0 days'], name='foo') + tm.assert_index_equal(result, expected) + + result = tdi + tdi + expected = TimedeltaIndex(['2 days', pd.NaT, '4 days'], name='foo') + tm.assert_index_equal(result, expected) + + result = dti - tdi # name will be reset + expected = DatetimeIndex(['20121231', pd.NaT, '20130101']) + tm.assert_index_equal(result, expected) + + def test_sub_period(self): + # GH 13078 + # not supported, check TypeError + p = pd.Period('2011-01-01', freq='D') + + for freq in [None, 'H']: + idx = pd.TimedeltaIndex(['1 hours', '2 hours'], freq=freq) + + with pytest.raises(TypeError): + idx - p + + with pytest.raises(TypeError): + p - idx + + def test_addition_ops(self): + + # with datetimes/timedelta and tdi/dti + tdi = TimedeltaIndex(['1 days', pd.NaT, '2 days'], name='foo') + dti = date_range('20130101', periods=3, name='bar') + td = Timedelta('1 days') + dt = Timestamp('20130101') + + result = tdi + dt + expected = DatetimeIndex(['20130102', pd.NaT, '20130103'], name='foo') + tm.assert_index_equal(result, expected) + + result = dt + tdi + expected = DatetimeIndex(['20130102', pd.NaT, '20130103'], name='foo') + tm.assert_index_equal(result, expected) + + result = td + tdi + expected = TimedeltaIndex(['2 days', pd.NaT, '3 days'], name='foo') + tm.assert_index_equal(result, expected) + + result = tdi + td + expected = TimedeltaIndex(['2 days', pd.NaT, '3 days'], name='foo') + tm.assert_index_equal(result, expected) + + # unequal length + pytest.raises(ValueError, lambda: tdi + dti[0:1]) + pytest.raises(ValueError, lambda: tdi[0:1] + dti) + + # random indexes + pytest.raises(TypeError, lambda: tdi + Int64Index([1, 2, 3])) + + # this is a union! 
+ # pytest.raises(TypeError, lambda : Int64Index([1,2,3]) + tdi) + + result = tdi + dti # name will be reset + expected = DatetimeIndex(['20130102', pd.NaT, '20130105']) + tm.assert_index_equal(result, expected) + + result = dti + tdi # name will be reset + expected = DatetimeIndex(['20130102', pd.NaT, '20130105']) + tm.assert_index_equal(result, expected) + + result = dt + td + expected = Timestamp('20130102') + assert result == expected + + result = td + dt + expected = Timestamp('20130102') + assert result == expected + + def test_comp_nat(self): + left = pd.TimedeltaIndex([pd.Timedelta('1 days'), pd.NaT, + pd.Timedelta('3 days')]) + right = pd.TimedeltaIndex([pd.NaT, pd.NaT, pd.Timedelta('3 days')]) + + for l, r in [(left, right), (left.asobject, right.asobject)]: + result = l == r + expected = np.array([False, False, True]) + tm.assert_numpy_array_equal(result, expected) + + result = l != r + expected = np.array([True, True, False]) + tm.assert_numpy_array_equal(result, expected) + + expected = np.array([False, False, False]) + tm.assert_numpy_array_equal(l == pd.NaT, expected) + tm.assert_numpy_array_equal(pd.NaT == r, expected) + + expected = np.array([True, True, True]) + tm.assert_numpy_array_equal(l != pd.NaT, expected) + tm.assert_numpy_array_equal(pd.NaT != l, expected) + + expected = np.array([False, False, False]) + tm.assert_numpy_array_equal(l < pd.NaT, expected) + tm.assert_numpy_array_equal(pd.NaT > l, expected) + + def test_value_counts_unique(self): + # GH 7735 + + idx = timedelta_range('1 days 09:00:00', freq='H', periods=10) + # create repeated values, 'n'th element is repeated by n+1 times + idx = TimedeltaIndex(np.repeat(idx.values, range(1, len(idx) + 1))) + + exp_idx = timedelta_range('1 days 18:00:00', freq='-1H', periods=10) + expected = Series(range(10, 0, -1), index=exp_idx, dtype='int64') + + for obj in [idx, Series(idx)]: + tm.assert_series_equal(obj.value_counts(), expected) + + expected = timedelta_range('1 days 09:00:00', freq='H', periods=10) + tm.assert_index_equal(idx.unique(), expected) + + idx = TimedeltaIndex(['1 days 09:00:00', '1 days 09:00:00', + '1 days 09:00:00', '1 days 08:00:00', + '1 days 08:00:00', pd.NaT]) + + exp_idx = TimedeltaIndex(['1 days 09:00:00', '1 days 08:00:00']) + expected = Series([3, 2], index=exp_idx) + + for obj in [idx, Series(idx)]: + tm.assert_series_equal(obj.value_counts(), expected) + + exp_idx = TimedeltaIndex(['1 days 09:00:00', '1 days 08:00:00', + pd.NaT]) + expected = Series([3, 2, 1], index=exp_idx) + + for obj in [idx, Series(idx)]: + tm.assert_series_equal(obj.value_counts(dropna=False), expected) + + tm.assert_index_equal(idx.unique(), exp_idx) + + def test_nonunique_contains(self): + # GH 9512 + for idx in map(TimedeltaIndex, ([0, 1, 0], [0, 0, -1], [0, -1, -1], + ['00:01:00', '00:01:00', '00:02:00'], + ['00:01:00', '00:01:00', '00:00:01'])): + assert idx[0] in idx + + def test_unknown_attribute(self): + # see gh-9680 + tdi = pd.timedelta_range(start=0, periods=10, freq='1s') + ts = pd.Series(np.random.normal(size=10), index=tdi) + assert 'foo' not in ts.__dict__.keys() + pytest.raises(AttributeError, lambda: ts.foo) + + def test_order(self): + # GH 10295 + idx1 = TimedeltaIndex(['1 day', '2 day', '3 day'], freq='D', + name='idx') + idx2 = TimedeltaIndex( + ['1 hour', '2 hour', '3 hour'], freq='H', name='idx') + + for idx in [idx1, idx2]: + ordered = idx.sort_values() + tm.assert_index_equal(ordered, idx) + assert ordered.freq == idx.freq + + ordered = idx.sort_values(ascending=False) + expected = idx[::-1] 
+ tm.assert_index_equal(ordered, expected) + assert ordered.freq == expected.freq + assert ordered.freq.n == -1 + + ordered, indexer = idx.sort_values(return_indexer=True) + tm.assert_index_equal(ordered, idx) + tm.assert_numpy_array_equal(indexer, np.array([0, 1, 2]), + check_dtype=False) + assert ordered.freq == idx.freq + + ordered, indexer = idx.sort_values(return_indexer=True, + ascending=False) + tm.assert_index_equal(ordered, idx[::-1]) + assert ordered.freq == expected.freq + assert ordered.freq.n == -1 + + idx1 = TimedeltaIndex(['1 hour', '3 hour', '5 hour', + '2 hour ', '1 hour'], name='idx1') + exp1 = TimedeltaIndex(['1 hour', '1 hour', '2 hour', + '3 hour', '5 hour'], name='idx1') + + idx2 = TimedeltaIndex(['1 day', '3 day', '5 day', + '2 day', '1 day'], name='idx2') + + # TODO(wesm): unused? + # exp2 = TimedeltaIndex(['1 day', '1 day', '2 day', + # '3 day', '5 day'], name='idx2') + + # idx3 = TimedeltaIndex([pd.NaT, '3 minute', '5 minute', + # '2 minute', pd.NaT], name='idx3') + # exp3 = TimedeltaIndex([pd.NaT, pd.NaT, '2 minute', '3 minute', + # '5 minute'], name='idx3') + + for idx, expected in [(idx1, exp1), (idx1, exp1), (idx1, exp1)]: + ordered = idx.sort_values() + tm.assert_index_equal(ordered, expected) + assert ordered.freq is None + + ordered = idx.sort_values(ascending=False) + tm.assert_index_equal(ordered, expected[::-1]) + assert ordered.freq is None + + ordered, indexer = idx.sort_values(return_indexer=True) + tm.assert_index_equal(ordered, expected) + + exp = np.array([0, 4, 3, 1, 2]) + tm.assert_numpy_array_equal(indexer, exp, check_dtype=False) + assert ordered.freq is None + + ordered, indexer = idx.sort_values(return_indexer=True, + ascending=False) + tm.assert_index_equal(ordered, expected[::-1]) + + exp = np.array([2, 1, 3, 4, 0]) + tm.assert_numpy_array_equal(indexer, exp, check_dtype=False) + assert ordered.freq is None + + def test_getitem(self): + idx1 = pd.timedelta_range('1 day', '31 day', freq='D', name='idx') + + for idx in [idx1]: + result = idx[0] + assert result == pd.Timedelta('1 day') + + result = idx[0:5] + expected = pd.timedelta_range('1 day', '5 day', freq='D', + name='idx') + tm.assert_index_equal(result, expected) + assert result.freq == expected.freq + + result = idx[0:10:2] + expected = pd.timedelta_range('1 day', '9 day', freq='2D', + name='idx') + tm.assert_index_equal(result, expected) + assert result.freq == expected.freq + + result = idx[-20:-5:3] + expected = pd.timedelta_range('12 day', '24 day', freq='3D', + name='idx') + tm.assert_index_equal(result, expected) + assert result.freq == expected.freq + + result = idx[4::-1] + expected = TimedeltaIndex(['5 day', '4 day', '3 day', + '2 day', '1 day'], + freq='-1D', name='idx') + tm.assert_index_equal(result, expected) + assert result.freq == expected.freq + + def test_drop_duplicates_metadata(self): + # GH 10115 + idx = pd.timedelta_range('1 day', '31 day', freq='D', name='idx') + result = idx.drop_duplicates() + tm.assert_index_equal(idx, result) + assert idx.freq == result.freq + + idx_dup = idx.append(idx) + assert idx_dup.freq is None # freq is reset + result = idx_dup.drop_duplicates() + tm.assert_index_equal(idx, result) + assert result.freq is None + + def test_drop_duplicates(self): + # to check Index/Series compat + base = pd.timedelta_range('1 day', '31 day', freq='D', name='idx') + idx = base.append(base[:5]) + + res = idx.drop_duplicates() + tm.assert_index_equal(res, base) + res = Series(idx).drop_duplicates() + tm.assert_series_equal(res, Series(base)) + + res = 
idx.drop_duplicates(keep='last') + exp = base[5:].append(base[:5]) + tm.assert_index_equal(res, exp) + res = Series(idx).drop_duplicates(keep='last') + tm.assert_series_equal(res, Series(exp, index=np.arange(5, 36))) + + res = idx.drop_duplicates(keep=False) + tm.assert_index_equal(res, base[5:]) + res = Series(idx).drop_duplicates(keep=False) + tm.assert_series_equal(res, Series(base[5:], index=np.arange(5, 31))) + + def test_take(self): + # GH 10295 + idx1 = pd.timedelta_range('1 day', '31 day', freq='D', name='idx') + + for idx in [idx1]: + result = idx.take([0]) + assert result == pd.Timedelta('1 day') + + result = idx.take([-1]) + assert result == pd.Timedelta('31 day') + + result = idx.take([0, 1, 2]) + expected = pd.timedelta_range('1 day', '3 day', freq='D', + name='idx') + tm.assert_index_equal(result, expected) + assert result.freq == expected.freq + + result = idx.take([0, 2, 4]) + expected = pd.timedelta_range('1 day', '5 day', freq='2D', + name='idx') + tm.assert_index_equal(result, expected) + assert result.freq == expected.freq + + result = idx.take([7, 4, 1]) + expected = pd.timedelta_range('8 day', '2 day', freq='-3D', + name='idx') + tm.assert_index_equal(result, expected) + assert result.freq == expected.freq + + result = idx.take([3, 2, 5]) + expected = TimedeltaIndex(['4 day', '3 day', '6 day'], name='idx') + tm.assert_index_equal(result, expected) + assert result.freq is None + + result = idx.take([-3, 2, 5]) + expected = TimedeltaIndex(['29 day', '3 day', '6 day'], name='idx') + tm.assert_index_equal(result, expected) + assert result.freq is None + + def test_take_invalid_kwargs(self): + idx = pd.timedelta_range('1 day', '31 day', freq='D', name='idx') + indices = [1, 6, 5, 9, 10, 13, 15, 3] + + msg = r"take\(\) got an unexpected keyword argument 'foo'" + tm.assert_raises_regex(TypeError, msg, idx.take, + indices, foo=2) + + msg = "the 'out' parameter is not supported" + tm.assert_raises_regex(ValueError, msg, idx.take, + indices, out=indices) + + msg = "the 'mode' parameter is not supported" + tm.assert_raises_regex(ValueError, msg, idx.take, + indices, mode='clip') + + def test_infer_freq(self): + # GH 11018 + for freq in ['D', '3D', '-3D', 'H', '2H', '-2H', 'T', '2T', 'S', '-3S' + ]: + idx = pd.timedelta_range('1', freq=freq, periods=10) + result = pd.TimedeltaIndex(idx.asi8, freq='infer') + tm.assert_index_equal(idx, result) + assert result.freq == freq + + def test_nat_new(self): + + idx = pd.timedelta_range('1', freq='D', periods=5, name='x') + result = idx._nat_new() + exp = pd.TimedeltaIndex([pd.NaT] * 5, name='x') + tm.assert_index_equal(result, exp) + + result = idx._nat_new(box=False) + exp = np.array([iNaT] * 5, dtype=np.int64) + tm.assert_numpy_array_equal(result, exp) + + def test_shift(self): + # GH 9903 + idx = pd.TimedeltaIndex([], name='xxx') + tm.assert_index_equal(idx.shift(0, freq='H'), idx) + tm.assert_index_equal(idx.shift(3, freq='H'), idx) + + idx = pd.TimedeltaIndex(['5 hours', '6 hours', '9 hours'], name='xxx') + tm.assert_index_equal(idx.shift(0, freq='H'), idx) + exp = pd.TimedeltaIndex(['8 hours', '9 hours', '12 hours'], name='xxx') + tm.assert_index_equal(idx.shift(3, freq='H'), exp) + exp = pd.TimedeltaIndex(['2 hours', '3 hours', '6 hours'], name='xxx') + tm.assert_index_equal(idx.shift(-3, freq='H'), exp) + + tm.assert_index_equal(idx.shift(0, freq='T'), idx) + exp = pd.TimedeltaIndex(['05:03:00', '06:03:00', '9:03:00'], + name='xxx') + tm.assert_index_equal(idx.shift(3, freq='T'), exp) + exp = pd.TimedeltaIndex(['04:57:00', 
'05:57:00', '8:57:00'], + name='xxx') + tm.assert_index_equal(idx.shift(-3, freq='T'), exp) + + def test_repeat(self): + index = pd.timedelta_range('1 days', periods=2, freq='D') + exp = pd.TimedeltaIndex(['1 days', '1 days', '2 days', '2 days']) + for res in [index.repeat(2), np.repeat(index, 2)]: + tm.assert_index_equal(res, exp) + assert res.freq is None + + index = TimedeltaIndex(['1 days', 'NaT', '3 days']) + exp = TimedeltaIndex(['1 days', '1 days', '1 days', + 'NaT', 'NaT', 'NaT', + '3 days', '3 days', '3 days']) + for res in [index.repeat(3), np.repeat(index, 3)]: + tm.assert_index_equal(res, exp) + assert res.freq is None + + def test_nat(self): + assert pd.TimedeltaIndex._na_value is pd.NaT + assert pd.TimedeltaIndex([])._na_value is pd.NaT + + idx = pd.TimedeltaIndex(['1 days', '2 days']) + assert idx._can_hold_na + + tm.assert_numpy_array_equal(idx._isnan, np.array([False, False])) + assert not idx.hasnans + tm.assert_numpy_array_equal(idx._nan_idxs, + np.array([], dtype=np.intp)) + + idx = pd.TimedeltaIndex(['1 days', 'NaT']) + assert idx._can_hold_na + + tm.assert_numpy_array_equal(idx._isnan, np.array([False, True])) + assert idx.hasnans + tm.assert_numpy_array_equal(idx._nan_idxs, + np.array([1], dtype=np.intp)) + + def test_equals(self): + # GH 13107 + idx = pd.TimedeltaIndex(['1 days', '2 days', 'NaT']) + assert idx.equals(idx) + assert idx.equals(idx.copy()) + assert idx.equals(idx.asobject) + assert idx.asobject.equals(idx) + assert idx.asobject.equals(idx.asobject) + assert not idx.equals(list(idx)) + assert not idx.equals(pd.Series(idx)) + + idx2 = pd.TimedeltaIndex(['2 days', '1 days', 'NaT']) + assert not idx.equals(idx2) + assert not idx.equals(idx2.copy()) + assert not idx.equals(idx2.asobject) + assert not idx.asobject.equals(idx2) + assert not idx.asobject.equals(idx2.asobject) + assert not idx.equals(list(idx2)) + assert not idx.equals(pd.Series(idx2)) + + +class TestTimedeltas(object): + _multiprocess_can_split_ = True + + def test_ops(self): + + td = Timedelta(10, unit='d') + assert -td == Timedelta(-10, unit='d') + assert +td == Timedelta(10, unit='d') + assert td - td == Timedelta(0, unit='ns') + assert (td - pd.NaT) is pd.NaT + assert td + td == Timedelta(20, unit='d') + assert (td + pd.NaT) is pd.NaT + assert td * 2 == Timedelta(20, unit='d') + assert (td * pd.NaT) is pd.NaT + assert td / 2 == Timedelta(5, unit='d') + assert td // 2 == Timedelta(5, unit='d') + assert abs(td) == td + assert abs(-td) == td + assert td / td == 1 + assert (td / pd.NaT) is np.nan + assert (td // pd.NaT) is np.nan + + # invert + assert -td == Timedelta('-10d') + assert td * -1 == Timedelta('-10d') + assert -1 * td == Timedelta('-10d') + assert abs(-td) == Timedelta('10d') + + # invalid multiply with another timedelta + pytest.raises(TypeError, lambda: td * td) + + # can't operate with integers + pytest.raises(TypeError, lambda: td + 2) + pytest.raises(TypeError, lambda: td - 2) + + def test_ops_offsets(self): + td = Timedelta(10, unit='d') + assert Timedelta(241, unit='h') == td + pd.offsets.Hour(1) + assert Timedelta(241, unit='h') == pd.offsets.Hour(1) + td + assert 240 == td / pd.offsets.Hour(1) + assert 1 / 240.0 == pd.offsets.Hour(1) / td + assert Timedelta(239, unit='h') == td - pd.offsets.Hour(1) + assert Timedelta(-239, unit='h') == pd.offsets.Hour(1) - td + + def test_ops_ndarray(self): + td = Timedelta('1 day') + + # timedelta, timedelta + other = pd.to_timedelta(['1 day']).values + expected = pd.to_timedelta(['2 days']).values + tm.assert_numpy_array_equal(td + 
other, expected) + if LooseVersion(np.__version__) >= '1.8': + tm.assert_numpy_array_equal(other + td, expected) + pytest.raises(TypeError, lambda: td + np.array([1])) + pytest.raises(TypeError, lambda: np.array([1]) + td) + + expected = pd.to_timedelta(['0 days']).values + tm.assert_numpy_array_equal(td - other, expected) + if LooseVersion(np.__version__) >= '1.8': + tm.assert_numpy_array_equal(-other + td, expected) + pytest.raises(TypeError, lambda: td - np.array([1])) + pytest.raises(TypeError, lambda: np.array([1]) - td) + + expected = pd.to_timedelta(['2 days']).values + tm.assert_numpy_array_equal(td * np.array([2]), expected) + tm.assert_numpy_array_equal(np.array([2]) * td, expected) + pytest.raises(TypeError, lambda: td * other) + pytest.raises(TypeError, lambda: other * td) + + tm.assert_numpy_array_equal(td / other, + np.array([1], dtype=np.float64)) + if LooseVersion(np.__version__) >= '1.8': + tm.assert_numpy_array_equal(other / td, + np.array([1], dtype=np.float64)) + + # timedelta, datetime + other = pd.to_datetime(['2000-01-01']).values + expected = pd.to_datetime(['2000-01-02']).values + tm.assert_numpy_array_equal(td + other, expected) + if LooseVersion(np.__version__) >= '1.8': + tm.assert_numpy_array_equal(other + td, expected) + + expected = pd.to_datetime(['1999-12-31']).values + tm.assert_numpy_array_equal(-td + other, expected) + if LooseVersion(np.__version__) >= '1.8': + tm.assert_numpy_array_equal(other - td, expected) + + def test_ops_series(self): + # regression test for GH8813 + td = Timedelta('1 day') + other = pd.Series([1, 2]) + expected = pd.Series(pd.to_timedelta(['1 day', '2 days'])) + tm.assert_series_equal(expected, td * other) + tm.assert_series_equal(expected, other * td) + + def test_ops_series_object(self): + # GH 13043 + s = pd.Series([pd.Timestamp('2015-01-01', tz='US/Eastern'), + pd.Timestamp('2015-01-01', tz='Asia/Tokyo')], + name='xxx') + assert s.dtype == object + + exp = pd.Series([pd.Timestamp('2015-01-02', tz='US/Eastern'), + pd.Timestamp('2015-01-02', tz='Asia/Tokyo')], + name='xxx') + tm.assert_series_equal(s + pd.Timedelta('1 days'), exp) + tm.assert_series_equal(pd.Timedelta('1 days') + s, exp) + + # object series & object series + s2 = pd.Series([pd.Timestamp('2015-01-03', tz='US/Eastern'), + pd.Timestamp('2015-01-05', tz='Asia/Tokyo')], + name='xxx') + assert s2.dtype == object + exp = pd.Series([pd.Timedelta('2 days'), pd.Timedelta('4 days')], + name='xxx') + tm.assert_series_equal(s2 - s, exp) + tm.assert_series_equal(s - s2, -exp) + + s = pd.Series([pd.Timedelta('01:00:00'), pd.Timedelta('02:00:00')], + name='xxx', dtype=object) + assert s.dtype == object + + exp = pd.Series([pd.Timedelta('01:30:00'), pd.Timedelta('02:30:00')], + name='xxx') + tm.assert_series_equal(s + pd.Timedelta('00:30:00'), exp) + tm.assert_series_equal(pd.Timedelta('00:30:00') + s, exp) + + def test_ops_notimplemented(self): + class Other: + pass + + other = Other() + + td = Timedelta('1 day') + assert td.__add__(other) is NotImplemented + assert td.__sub__(other) is NotImplemented + assert td.__truediv__(other) is NotImplemented + assert td.__mul__(other) is NotImplemented + assert td.__floordiv__(other) is NotImplemented + + def test_ops_error_str(self): + # GH 13624 + tdi = TimedeltaIndex(['1 day', '2 days']) + + for l, r in [(tdi, 'a'), ('a', tdi)]: + with pytest.raises(TypeError): + l + r + + with pytest.raises(TypeError): + l > r + + with pytest.raises(TypeError): + l == r + + with pytest.raises(TypeError): + l != r + + def test_timedelta_ops(self): + 
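The scalar arithmetic rules exercised in `test_ops` above boil down to: `NaT` propagates through addition and subtraction, division by `NaT` yields `NaN`, and bare integers are rejected because they carry no time unit. A minimal sketch, using only behavior the tests assert:

```python
import numpy as np
import pandas as pd

td = pd.Timedelta(10, unit='d')

assert (td + pd.NaT) is pd.NaT       # NaT propagates through +/-
assert np.isnan(td / pd.NaT)         # division by NaT yields NaN
assert td // 2 == pd.Timedelta(5, unit='d')

try:
    td + 2                           # unit-less integers are ambiguous
except TypeError:
    pass
```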
# GH4984 + # make sure ops return Timedelta + s = Series([Timestamp('20130101') + timedelta(seconds=i * i) + for i in range(10)]) + td = s.diff() + + result = td.mean() + expected = to_timedelta(timedelta(seconds=9)) + assert result == expected + + result = td.to_frame().mean() + assert result[0] == expected + + result = td.quantile(.1) + expected = Timedelta(np.timedelta64(2600, 'ms')) + assert result == expected + + result = td.median() + expected = to_timedelta('00:00:09') + assert result == expected + + result = td.to_frame().median() + assert result[0] == expected + + # GH 6462 + # consistency in returned values for sum + result = td.sum() + expected = to_timedelta('00:01:21') + assert result == expected + + result = td.to_frame().sum() + assert result[0] == expected + + # std + result = td.std() + expected = to_timedelta(Series(td.dropna().values).std()) + assert result == expected + + result = td.to_frame().std() + assert result[0] == expected + + # invalid ops + for op in ['skew', 'kurt', 'sem', 'prod']: + pytest.raises(TypeError, getattr(td, op)) + + # GH 10040 + # make sure NaT is properly handled by median() + s = Series([Timestamp('2015-02-03'), Timestamp('2015-02-07')]) + assert s.diff().median() == timedelta(days=4) + + s = Series([Timestamp('2015-02-03'), Timestamp('2015-02-07'), + Timestamp('2015-02-15')]) + assert s.diff().median() == timedelta(days=6) + + def test_timedelta_ops_scalar(self): + # GH 6808 + base = pd.to_datetime('20130101 09:01:12.123456') + expected_add = pd.to_datetime('20130101 09:01:22.123456') + expected_sub = pd.to_datetime('20130101 09:01:02.123456') + + for offset in [pd.to_timedelta(10, unit='s'), timedelta(seconds=10), + np.timedelta64(10, 's'), + np.timedelta64(10000000000, 'ns'), + pd.offsets.Second(10)]: + result = base + offset + assert result == expected_add + + result = base - offset + assert result == expected_sub + + base = pd.to_datetime('20130102 09:01:12.123456') + expected_add = pd.to_datetime('20130103 09:01:22.123456') + expected_sub = pd.to_datetime('20130101 09:01:02.123456') + + for offset in [pd.to_timedelta('1 day, 00:00:10'), + pd.to_timedelta('1 days, 00:00:10'), + timedelta(days=1, seconds=10), + np.timedelta64(1, 'D') + np.timedelta64(10, 's'), + pd.offsets.Day() + pd.offsets.Second(10)]: + result = base + offset + assert result == expected_add + + result = base - offset + assert result == expected_sub + + def test_timedelta_ops_with_missing_values(self): + # setup + s1 = pd.to_timedelta(Series(['00:00:01'])) + s2 = pd.to_timedelta(Series(['00:00:02'])) + sn = pd.to_timedelta(Series([pd.NaT])) + df1 = DataFrame(['00:00:01']).apply(pd.to_timedelta) + df2 = DataFrame(['00:00:02']).apply(pd.to_timedelta) + dfn = DataFrame([pd.NaT]).apply(pd.to_timedelta) + scalar1 = pd.to_timedelta('00:00:01') + scalar2 = pd.to_timedelta('00:00:02') + timedelta_NaT = pd.to_timedelta('NaT') + NA = np.nan + + actual = scalar1 + scalar1 + assert actual == scalar2 + actual = scalar2 - scalar1 + assert actual == scalar1 + + actual = s1 + s1 + assert_series_equal(actual, s2) + actual = s2 - s1 + assert_series_equal(actual, s1) + + actual = s1 + scalar1 + assert_series_equal(actual, s2) + actual = scalar1 + s1 + assert_series_equal(actual, s2) + actual = s2 - scalar1 + assert_series_equal(actual, s1) + actual = -scalar1 + s2 + assert_series_equal(actual, s1) + + actual = s1 + timedelta_NaT + assert_series_equal(actual, sn) + actual = timedelta_NaT + s1 + assert_series_equal(actual, sn) + actual = s1 - timedelta_NaT + assert_series_equal(actual, sn) + 
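`test_timedelta_ops` above verifies that reductions over a `timedelta64[ns]` Series skip `NaT` and come back as `Timedelta` scalars. The same check in isolation, with the values from the test (nine diffs of 1, 3, ..., 17 seconds):

```python
import pandas as pd
from pandas import Series, Timestamp
from datetime import timedelta

s = Series([Timestamp('20130101') + timedelta(seconds=i * i)
            for i in range(10)])
td = s.diff()  # first element is NaT

# NaT is skipped: the nine remaining diffs sum to 81s and average 9s.
assert td.sum() == pd.to_timedelta('00:01:21')
assert td.mean() == pd.to_timedelta('00:00:09')
```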
+        actual = -timedelta_NaT + s1
+        assert_series_equal(actual, sn)
+
+        actual = s1 + NA
+        assert_series_equal(actual, sn)
+        actual = NA + s1
+        assert_series_equal(actual, sn)
+        actual = s1 - NA
+        assert_series_equal(actual, sn)
+        actual = -NA + s1
+        assert_series_equal(actual, sn)
+
+        actual = s1 + pd.NaT
+        assert_series_equal(actual, sn)
+        actual = s2 - pd.NaT
+        assert_series_equal(actual, sn)
+
+        actual = s1 + df1
+        assert_frame_equal(actual, df2)
+        actual = s2 - df1
+        assert_frame_equal(actual, df1)
+        actual = df1 + s1
+        assert_frame_equal(actual, df2)
+        actual = df2 - s1
+        assert_frame_equal(actual, df1)
+
+        actual = df1 + df1
+        assert_frame_equal(actual, df2)
+        actual = df2 - df1
+        assert_frame_equal(actual, df1)
+
+        actual = df1 + scalar1
+        assert_frame_equal(actual, df2)
+        actual = df2 - scalar1
+        assert_frame_equal(actual, df1)
+
+        actual = df1 + timedelta_NaT
+        assert_frame_equal(actual, dfn)
+        actual = df1 - timedelta_NaT
+        assert_frame_equal(actual, dfn)
+
+        actual = df1 + NA
+        assert_frame_equal(actual, dfn)
+        actual = df1 - NA
+        assert_frame_equal(actual, dfn)
+
+        actual = df1 + pd.NaT  # NaT is datetime, not timedelta
+        assert_frame_equal(actual, dfn)
+        actual = df1 - pd.NaT
+        assert_frame_equal(actual, dfn)
+
+    def test_compare_timedelta_series(self):
+        # regression test for GH5963
+        s = pd.Series([timedelta(days=1), timedelta(days=2)])
+        actual = s > timedelta(days=1)
+        expected = pd.Series([False, True])
+        tm.assert_series_equal(actual, expected)
+
+    def test_compare_timedelta_ndarray(self):
+        # GH11835
+        periods = [Timedelta('0 days 01:00:00'), Timedelta('0 days 01:00:00')]
+        arr = np.array(periods)
+        result = arr[0] > arr
+        expected = np.array([False, False])
+        tm.assert_numpy_array_equal(result, expected)
+
+
+class TestSlicing(object):
+
+    def test_tdi_ops_attributes(self):
+        rng = timedelta_range('2 days', periods=5, freq='2D', name='x')
+
+        result = rng + 1
+        exp = timedelta_range('4 days', periods=5, freq='2D', name='x')
+        tm.assert_index_equal(result, exp)
+        assert result.freq == '2D'
+
+        result = rng - 2
+        exp = timedelta_range('-2 days', periods=5, freq='2D', name='x')
+        tm.assert_index_equal(result, exp)
+        assert result.freq == '2D'
+
+        result = rng * 2
+        exp = timedelta_range('4 days', periods=5, freq='4D', name='x')
+        tm.assert_index_equal(result, exp)
+        assert result.freq == '4D'
+
+        result = rng / 2
+        exp = timedelta_range('1 days', periods=5, freq='D', name='x')
+        tm.assert_index_equal(result, exp)
+        assert result.freq == 'D'
+
+        result = -rng
+        exp = timedelta_range('-2 days', periods=5, freq='-2D', name='x')
+        tm.assert_index_equal(result, exp)
+        assert result.freq == '-2D'
+
+        rng = pd.timedelta_range('-2 days', periods=5, freq='D', name='x')
+
+        result = abs(rng)
+        exp = TimedeltaIndex(['2 days', '1 days', '0 days', '1 days',
+                              '2 days'], name='x')
+        tm.assert_index_equal(result, exp)
+        assert result.freq is None
+
+    def test_add_overflow(self):
+        # see gh-14068
+        msg = "too (big|large) to convert"
+        with tm.assert_raises_regex(OverflowError, msg):
+            to_timedelta(106580, 'D') + Timestamp('2000')
+        with tm.assert_raises_regex(OverflowError, msg):
+            Timestamp('2000') + to_timedelta(106580, 'D')
+
+        _NaT = int(pd.NaT) + 1
+        msg = "Overflow in int64 addition"
+        with tm.assert_raises_regex(OverflowError, msg):
+            to_timedelta([106580], 'D') + Timestamp('2000')
+        with tm.assert_raises_regex(OverflowError, msg):
+            Timestamp('2000') + to_timedelta([106580], 'D')
+        with tm.assert_raises_regex(OverflowError, msg):
+            to_timedelta([_NaT]) - Timedelta('1 days')
+        with tm.assert_raises_regex(OverflowError, msg):
+            to_timedelta(['5 days', _NaT]) - Timedelta('1 days')
+        with tm.assert_raises_regex(OverflowError, msg):
+            (to_timedelta([_NaT, '5 days', '1 hours']) -
+             to_timedelta(['7 seconds', _NaT, '4 hours']))
+
+        # These should not overflow!
+        exp = TimedeltaIndex([pd.NaT])
+        result = to_timedelta([pd.NaT]) - Timedelta('1 days')
+        tm.assert_index_equal(result, exp)
+
+        exp = TimedeltaIndex(['4 days', pd.NaT])
+        result = to_timedelta(['5 days', pd.NaT]) - Timedelta('1 days')
+        tm.assert_index_equal(result, exp)
+
+        exp = TimedeltaIndex([pd.NaT, pd.NaT, '5 hours'])
+        result = (to_timedelta([pd.NaT, '5 days', '1 hours']) +
+                  to_timedelta(['7 seconds', pd.NaT, '4 hours']))
+        tm.assert_index_equal(result, exp)
diff --git a/pandas/tests/indexes/timedeltas/test_partial_slicing.py b/pandas/tests/indexes/timedeltas/test_partial_slicing.py
new file mode 100644
index 0000000000000..8e5eae2a7a3ef
--- /dev/null
+++ b/pandas/tests/indexes/timedeltas/test_partial_slicing.py
@@ -0,0 +1,83 @@
+import pytest
+
+import numpy as np
+import pandas.util.testing as tm
+
+import pandas as pd
+from pandas import Series, timedelta_range, Timedelta
+from pandas.util.testing import assert_series_equal
+
+
+class TestSlicing(object):
+
+    def test_partial_slice(self):
+        rng = timedelta_range('1 day 10:11:12', freq='h', periods=500)
+        s = Series(np.arange(len(rng)), index=rng)
+
+        result = s['5 day':'6 day']
+        expected = s.iloc[86:134]
+        assert_series_equal(result, expected)
+
+        result = s['5 day':]
+        expected = s.iloc[86:]
+        assert_series_equal(result, expected)
+
+        result = s[:'6 day']
+        expected = s.iloc[:134]
+        assert_series_equal(result, expected)
+
+        result = s['6 days, 23:11:12']
+        assert result == s.iloc[133]
+
+        pytest.raises(KeyError, s.__getitem__, '50 days')
+
+    def test_partial_slice_high_reso(self):
+
+        # higher reso
+        rng = timedelta_range('1 day 10:11:12', freq='us', periods=2000)
+        s = Series(np.arange(len(rng)), index=rng)
+
+        result = s['1 day 10:11:12':]
+        expected = s.iloc[0:]
+        assert_series_equal(result, expected)
+
+        result = s['1 day 10:11:12.001':]
+        expected = s.iloc[1000:]
+        assert_series_equal(result, expected)
+
+        result = s['1 days, 10:11:12.001001']
+        assert result == s.iloc[1001]
+
+    def test_slice_with_negative_step(self):
+        ts = Series(np.arange(20), timedelta_range('0', periods=20, freq='H'))
+        SLC = pd.IndexSlice
+
+        def assert_slices_equivalent(l_slc, i_slc):
+            assert_series_equal(ts[l_slc], ts.iloc[i_slc])
+            assert_series_equal(ts.loc[l_slc], ts.iloc[i_slc])
+
+        assert_slices_equivalent(SLC[Timedelta(hours=7)::-1], SLC[7::-1])
+        assert_slices_equivalent(SLC['7 hours'::-1], SLC[7::-1])
+
+        assert_slices_equivalent(SLC[:Timedelta(hours=7):-1], SLC[:6:-1])
+        assert_slices_equivalent(SLC[:'7 hours':-1], SLC[:6:-1])
+
+        assert_slices_equivalent(SLC['15 hours':'7 hours':-1], SLC[15:6:-1])
+        assert_slices_equivalent(
+            SLC[Timedelta(hours=15):Timedelta(hours=7):-1], SLC[15:6:-1])
+        assert_slices_equivalent(SLC['15 hours':Timedelta(hours=7):-1],
+                                 SLC[15:6:-1])
+        assert_slices_equivalent(SLC[Timedelta(hours=15):'7 hours':-1],
+                                 SLC[15:6:-1])
+
+        assert_slices_equivalent(SLC['7 hours':'15 hours':-1], SLC[:0])
+
+    def test_slice_with_zero_step_raises(self):
+        ts = Series(np.arange(20), timedelta_range('0', periods=20, freq='H'))
+        tm.assert_raises_regex(ValueError, 'slice step cannot be zero',
+                               lambda: ts[::0])
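The partial-slicing tests above depend on strings being parsed into `Timedelta`s at lookup time, so label slices need no explicit `Timedelta` construction. A condensed version of the first case, with the positions taken directly from the test:

```python
import numpy as np
import pandas as pd
from pandas import Series, timedelta_range

rng = timedelta_range('1 day 10:11:12', freq='h', periods=500)
s = Series(np.arange(len(rng)), index=rng)

# String endpoints are parsed and matched against the index labels.
assert s['5 day':'6 day'].equals(s.iloc[86:134])
assert s['6 days, 23:11:12'] == s.iloc[133]
```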
tm.assert_raises_regex(ValueError, 'slice step cannot be zero', + lambda: ts.loc[::0]) diff --git a/pandas/tests/indexes/timedeltas/test_setops.py b/pandas/tests/indexes/timedeltas/test_setops.py new file mode 100644 index 0000000000000..22546d25273a7 --- /dev/null +++ b/pandas/tests/indexes/timedeltas/test_setops.py @@ -0,0 +1,76 @@ +import numpy as np + +import pandas as pd +import pandas.util.testing as tm +from pandas import TimedeltaIndex, timedelta_range, Int64Index + + +class TestTimedeltaIndex(object): + _multiprocess_can_split_ = True + + def test_union(self): + + i1 = timedelta_range('1day', periods=5) + i2 = timedelta_range('3day', periods=5) + result = i1.union(i2) + expected = timedelta_range('1day', periods=7) + tm.assert_index_equal(result, expected) + + i1 = Int64Index(np.arange(0, 20, 2)) + i2 = TimedeltaIndex(start='1 day', periods=10, freq='D') + i1.union(i2) # Works + i2.union(i1) # Fails with "AttributeError: can't set attribute" + + def test_union_coverage(self): + + idx = TimedeltaIndex(['3d', '1d', '2d']) + ordered = TimedeltaIndex(idx.sort_values(), freq='infer') + result = ordered.union(idx) + tm.assert_index_equal(result, ordered) + + result = ordered[:0].union(ordered) + tm.assert_index_equal(result, ordered) + assert result.freq == ordered.freq + + def test_union_bug_1730(self): + + rng_a = timedelta_range('1 day', periods=4, freq='3H') + rng_b = timedelta_range('1 day', periods=4, freq='4H') + + result = rng_a.union(rng_b) + exp = TimedeltaIndex(sorted(set(list(rng_a)) | set(list(rng_b)))) + tm.assert_index_equal(result, exp) + + def test_union_bug_1745(self): + + left = TimedeltaIndex(['1 day 15:19:49.695000']) + right = TimedeltaIndex(['2 day 13:04:21.322000', + '1 day 15:27:24.873000', + '1 day 15:31:05.350000']) + + result = left.union(right) + exp = TimedeltaIndex(sorted(set(list(left)) | set(list(right)))) + tm.assert_index_equal(result, exp) + + def test_union_bug_4564(self): + + left = timedelta_range("1 day", "30d") + right = left + pd.offsets.Minute(15) + + result = left.union(right) + exp = TimedeltaIndex(sorted(set(list(left)) | set(list(right)))) + tm.assert_index_equal(result, exp) + + def test_intersection_bug_1708(self): + index_1 = timedelta_range('1 day', periods=4, freq='h') + index_2 = index_1 + pd.offsets.Hour(5) + + result = index_1 & index_2 + assert len(result) == 0 + + index_1 = timedelta_range('1 day', periods=4, freq='h') + index_2 = index_1 + pd.offsets.Hour(1) + + result = index_1 & index_2 + expected = timedelta_range('1 day 01:00:00', periods=3, freq='h') + tm.assert_index_equal(result, expected) diff --git a/pandas/tests/indexes/timedeltas/test_timedelta.py b/pandas/tests/indexes/timedeltas/test_timedelta.py new file mode 100644 index 0000000000000..79fe0a864f246 --- /dev/null +++ b/pandas/tests/indexes/timedeltas/test_timedelta.py @@ -0,0 +1,599 @@ +import pytest + +import numpy as np +from datetime import timedelta + +import pandas as pd +import pandas.util.testing as tm +from pandas import (timedelta_range, date_range, Series, Timedelta, + DatetimeIndex, TimedeltaIndex, Index, DataFrame, + Int64Index, _np_version_under1p8) +from pandas.util.testing import (assert_almost_equal, assert_series_equal, + assert_index_equal) + +from ..datetimelike import DatetimeLike + +randn = np.random.randn + + +class TestTimedeltaIndex(DatetimeLike): + _holder = TimedeltaIndex + _multiprocess_can_split_ = True + + def setup_method(self, method): + self.indices = dict(index=tm.makeTimedeltaIndex(10)) + self.setup_indices() + + def 
create_index(self): + return pd.to_timedelta(range(5), unit='d') + pd.offsets.Hour(1) + + def test_shift(self): + # test shift for TimedeltaIndex + # err8083 + + drange = self.create_index() + result = drange.shift(1) + expected = TimedeltaIndex(['1 days 01:00:00', '2 days 01:00:00', + '3 days 01:00:00', + '4 days 01:00:00', '5 days 01:00:00'], + freq='D') + tm.assert_index_equal(result, expected) + + result = drange.shift(3, freq='2D 1s') + expected = TimedeltaIndex(['6 days 01:00:03', '7 days 01:00:03', + '8 days 01:00:03', '9 days 01:00:03', + '10 days 01:00:03'], freq='D') + tm.assert_index_equal(result, expected) + + def test_get_loc(self): + idx = pd.to_timedelta(['0 days', '1 days', '2 days']) + + for method in [None, 'pad', 'backfill', 'nearest']: + assert idx.get_loc(idx[1], method) == 1 + assert idx.get_loc(idx[1].to_pytimedelta(), method) == 1 + assert idx.get_loc(str(idx[1]), method) == 1 + + assert idx.get_loc(idx[1], 'pad', + tolerance=pd.Timedelta(0)) == 1 + assert idx.get_loc(idx[1], 'pad', + tolerance=np.timedelta64(0, 's')) == 1 + assert idx.get_loc(idx[1], 'pad', + tolerance=timedelta(0)) == 1 + + with tm.assert_raises_regex(ValueError, 'must be convertible'): + idx.get_loc(idx[1], method='nearest', tolerance='foo') + + for method, loc in [('pad', 1), ('backfill', 2), ('nearest', 1)]: + assert idx.get_loc('1 day 1 hour', method) == loc + + def test_get_loc_nat(self): + tidx = TimedeltaIndex(['1 days 01:00:00', 'NaT', '2 days 01:00:00']) + + assert tidx.get_loc(pd.NaT) == 1 + assert tidx.get_loc(None) == 1 + assert tidx.get_loc(float('nan')) == 1 + assert tidx.get_loc(np.nan) == 1 + + def test_get_indexer(self): + idx = pd.to_timedelta(['0 days', '1 days', '2 days']) + tm.assert_numpy_array_equal(idx.get_indexer(idx), + np.array([0, 1, 2], dtype=np.intp)) + + target = pd.to_timedelta(['-1 hour', '12 hours', '1 day 1 hour']) + tm.assert_numpy_array_equal(idx.get_indexer(target, 'pad'), + np.array([-1, 0, 1], dtype=np.intp)) + tm.assert_numpy_array_equal(idx.get_indexer(target, 'backfill'), + np.array([0, 1, 2], dtype=np.intp)) + tm.assert_numpy_array_equal(idx.get_indexer(target, 'nearest'), + np.array([0, 1, 1], dtype=np.intp)) + + res = idx.get_indexer(target, 'nearest', + tolerance=pd.Timedelta('1 hour')) + tm.assert_numpy_array_equal(res, np.array([0, -1, 1], dtype=np.intp)) + + def test_numeric_compat(self): + + idx = self._holder(np.arange(5, dtype='int64')) + didx = self._holder(np.arange(5, dtype='int64') ** 2) + result = idx * 1 + tm.assert_index_equal(result, idx) + + result = 1 * idx + tm.assert_index_equal(result, idx) + + result = idx / 1 + tm.assert_index_equal(result, idx) + + result = idx // 1 + tm.assert_index_equal(result, idx) + + result = idx * np.array(5, dtype='int64') + tm.assert_index_equal(result, + self._holder(np.arange(5, dtype='int64') * 5)) + + result = idx * np.arange(5, dtype='int64') + tm.assert_index_equal(result, didx) + + result = idx * Series(np.arange(5, dtype='int64')) + tm.assert_index_equal(result, didx) + + result = idx * Series(np.arange(5, dtype='float64') + 0.1) + tm.assert_index_equal(result, self._holder(np.arange( + 5, dtype='float64') * (np.arange(5, dtype='float64') + 0.1))) + + # invalid + pytest.raises(TypeError, lambda: idx * idx) + pytest.raises(ValueError, lambda: idx * self._holder(np.arange(3))) + pytest.raises(ValueError, lambda: idx * np.array([1, 2])) + + def test_pickle_compat_construction(self): + pass + + def test_ufunc_coercions(self): + # normal ops are also tested in tseries/test_timedeltas.py + idx = 
TimedeltaIndex(['2H', '4H', '6H', '8H', '10H'],
+                             freq='2H', name='x')
+
+        for result in [idx * 2, np.multiply(idx, 2)]:
+            assert isinstance(result, TimedeltaIndex)
+            exp = TimedeltaIndex(['4H', '8H', '12H', '16H', '20H'],
+                                 freq='4H', name='x')
+            tm.assert_index_equal(result, exp)
+            assert result.freq == '4H'
+
+        for result in [idx / 2, np.divide(idx, 2)]:
+            assert isinstance(result, TimedeltaIndex)
+            exp = TimedeltaIndex(['1H', '2H', '3H', '4H', '5H'],
+                                 freq='H', name='x')
+            tm.assert_index_equal(result, exp)
+            assert result.freq == 'H'
+
+        idx = TimedeltaIndex(['2H', '4H', '6H', '8H', '10H'],
+                             freq='2H', name='x')
+        for result in [-idx, np.negative(idx)]:
+            assert isinstance(result, TimedeltaIndex)
+            exp = TimedeltaIndex(['-2H', '-4H', '-6H', '-8H', '-10H'],
+                                 freq='-2H', name='x')
+            tm.assert_index_equal(result, exp)
+            assert result.freq == '-2H'
+
+        idx = TimedeltaIndex(['-2H', '-1H', '0H', '1H', '2H'],
+                             freq='H', name='x')
+        for result in [abs(idx), np.absolute(idx)]:
+            assert isinstance(result, TimedeltaIndex)
+            exp = TimedeltaIndex(['2H', '1H', '0H', '1H', '2H'],
+                                 freq=None, name='x')
+            tm.assert_index_equal(result, exp)
+            assert result.freq is None
+
+    def test_fillna_timedelta(self):
+        # GH 11343
+        idx = pd.TimedeltaIndex(['1 day', pd.NaT, '3 day'])
+
+        exp = pd.TimedeltaIndex(['1 day', '2 day', '3 day'])
+        tm.assert_index_equal(idx.fillna(pd.Timedelta('2 day')), exp)
+
+        exp = pd.TimedeltaIndex(['1 day', '3 hour', '3 day'])
+        tm.assert_index_equal(idx.fillna(pd.Timedelta('3 hour')), exp)
+
+        exp = pd.Index(
+            [pd.Timedelta('1 day'), 'x', pd.Timedelta('3 day')], dtype=object)
+        tm.assert_index_equal(idx.fillna('x'), exp)
+
+    def test_difference_freq(self):
+        # GH14323: Difference of TimedeltaIndex should not preserve frequency
+
+        index = timedelta_range("0 days", "5 days", freq="D")
+
+        other = timedelta_range("1 days", "4 days", freq="D")
+        expected = TimedeltaIndex(["0 days", "5 days"], freq=None)
+        idx_diff = index.difference(other)
+        tm.assert_index_equal(idx_diff, expected)
+        tm.assert_attr_equal('freq', idx_diff, expected)
+
+        other = timedelta_range("2 days", "5 days", freq="D")
+        idx_diff = index.difference(other)
+        expected = TimedeltaIndex(["0 days", "1 days"], freq=None)
+        tm.assert_index_equal(idx_diff, expected)
+        tm.assert_attr_equal('freq', idx_diff, expected)
+
+    def test_take(self):
+
+        tds = ['1day 02:00:00', '1 day 04:00:00', '1 day 10:00:00']
+        idx = TimedeltaIndex(start='1d', end='2d', freq='H', name='idx')
+        expected = TimedeltaIndex(tds, freq=None, name='idx')
+
+        taken1 = idx.take([2, 4, 10])
+        taken2 = idx[[2, 4, 10]]
+
+        for taken in [taken1, taken2]:
+            tm.assert_index_equal(taken, expected)
+            assert isinstance(taken, TimedeltaIndex)
+            assert taken.freq is None
+            assert taken.name == expected.name
+
+    def test_take_fill_value(self):
+        # GH 12631
+        idx = pd.TimedeltaIndex(['1 days', '2 days', '3 days'],
+                                name='xxx')
+        result = idx.take(np.array([1, 0, -1]))
+        expected = pd.TimedeltaIndex(['2 days', '1 days', '3 days'],
+                                     name='xxx')
+        tm.assert_index_equal(result, expected)
+
+        # fill_value
+        result = idx.take(np.array([1, 0, -1]), fill_value=True)
+        expected = pd.TimedeltaIndex(['2 days', '1 days', 'NaT'],
+                                     name='xxx')
+        tm.assert_index_equal(result, expected)
+
+        # allow_fill=False
+        result = idx.take(np.array([1, 0, -1]), allow_fill=False,
+                          fill_value=True)
+        expected = pd.TimedeltaIndex(['2 days', '1 days', '3 days'],
+                                     name='xxx')
+        tm.assert_index_equal(result, expected)
+
+        msg = ('When allow_fill=True and fill_value is not None, '
+               'all indices must be >= -1')
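`test_take_fill_value` above separates the two meanings of `-1` in `take`: plain positional wrap-around by default, and a missing-value marker once `fill_value` is passed. A minimal sketch of the distinction:

```python
import numpy as np
import pandas as pd

idx = pd.TimedeltaIndex(['1 days', '2 days', '3 days'])

# Default: -1 wraps around to the last element.
assert idx.take(np.array([1, 0, -1]))[-1] == pd.Timedelta('3 days')

# With fill_value: -1 marks a slot to fill, here with NaT.
assert idx.take(np.array([1, 0, -1]), fill_value=True)[-1] is pd.NaT
```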
with tm.assert_raises_regex(ValueError, msg): + idx.take(np.array([1, 0, -2]), fill_value=True) + with tm.assert_raises_regex(ValueError, msg): + idx.take(np.array([1, 0, -5]), fill_value=True) + + with pytest.raises(IndexError): + idx.take(np.array([1, -5])) + + def test_isin(self): + + index = tm.makeTimedeltaIndex(4) + result = index.isin(index) + assert result.all() + + result = index.isin(list(index)) + assert result.all() + + assert_almost_equal(index.isin([index[2], 5]), + np.array([False, False, True, False])) + + def test_factorize(self): + idx1 = TimedeltaIndex(['1 day', '1 day', '2 day', '2 day', '3 day', + '3 day']) + + exp_arr = np.array([0, 0, 1, 1, 2, 2], dtype=np.intp) + exp_idx = TimedeltaIndex(['1 day', '2 day', '3 day']) + + arr, idx = idx1.factorize() + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(idx, exp_idx) + + arr, idx = idx1.factorize(sort=True) + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(idx, exp_idx) + + # freq must be preserved + idx3 = timedelta_range('1 day', periods=4, freq='s') + exp_arr = np.array([0, 1, 2, 3], dtype=np.intp) + arr, idx = idx3.factorize() + tm.assert_numpy_array_equal(arr, exp_arr) + tm.assert_index_equal(idx, idx3) + + def test_join_self(self): + + index = timedelta_range('1 day', periods=10) + kinds = 'outer', 'inner', 'left', 'right' + for kind in kinds: + joined = index.join(index, how=kind) + tm.assert_index_equal(index, joined) + + def test_slice_keeps_name(self): + + # GH4226 + dr = pd.timedelta_range('1d', '5d', freq='H', name='timebucket') + assert dr[1:].name == dr.name + + def test_does_not_convert_mixed_integer(self): + df = tm.makeCustomDataframe(10, 10, + data_gen_f=lambda *args, **kwargs: randn(), + r_idx_type='i', c_idx_type='td') + str(df) + + cols = df.columns.join(df.index, how='outer') + joined = cols.join(df.columns) + assert cols.dtype == np.dtype('O') + assert cols.dtype == joined.dtype + tm.assert_index_equal(cols, joined) + + def test_sort_values(self): + + idx = TimedeltaIndex(['4d', '1d', '2d']) + + ordered = idx.sort_values() + assert ordered.is_monotonic + + ordered = idx.sort_values(ascending=False) + assert ordered[::-1].is_monotonic + + ordered, dexer = idx.sort_values(return_indexer=True) + assert ordered.is_monotonic + + tm.assert_numpy_array_equal(dexer, np.array([1, 2, 0]), + check_dtype=False) + + ordered, dexer = idx.sort_values(return_indexer=True, ascending=False) + assert ordered[::-1].is_monotonic + + tm.assert_numpy_array_equal(dexer, np.array([0, 2, 1]), + check_dtype=False) + + def test_get_duplicates(self): + idx = TimedeltaIndex(['1 day', '2 day', '2 day', '3 day', '3day', + '4day']) + + result = idx.get_duplicates() + ex = TimedeltaIndex(['2 day', '3day']) + tm.assert_index_equal(result, ex) + + def test_argmin_argmax(self): + idx = TimedeltaIndex(['1 day 00:00:05', '1 day 00:00:01', + '1 day 00:00:02']) + assert idx.argmin() == 1 + assert idx.argmax() == 0 + + def test_misc_coverage(self): + + rng = timedelta_range('1 day', periods=5) + result = rng.groupby(rng.days) + assert isinstance(list(result.values())[0][0], Timedelta) + + idx = TimedeltaIndex(['3d', '1d', '2d']) + assert not idx.equals(list(idx)) + + non_td = Index(list('abc')) + assert not idx.equals(list(non_td)) + + def test_map(self): + + rng = timedelta_range('1 day', periods=10) + + f = lambda x: x.days + result = rng.map(f) + exp = Int64Index([f(x) for x in rng]) + tm.assert_index_equal(result, exp) + + def test_comparisons_nat(self): + + tdidx1 = pd.TimedeltaIndex(['1 day', pd.NaT, 
'1 day 00:00:01', pd.NaT,
+                                    '1 day 00:00:01', '5 day 00:00:03'])
+        tdidx2 = pd.TimedeltaIndex(['2 day', '2 day', pd.NaT, pd.NaT,
+                                    '1 day 00:00:02', '5 days 00:00:03'])
+        tdarr = np.array([np.timedelta64(2, 'D'),
+                          np.timedelta64(2, 'D'), np.timedelta64('nat'),
+                          np.timedelta64('nat'),
+                          np.timedelta64(1, 'D') + np.timedelta64(2, 's'),
+                          np.timedelta64(5, 'D') + np.timedelta64(3, 's')])
+
+        if _np_version_under1p8:
+            # cannot test array because np.datetime64('nat') returns
+            # today's date
+            cases = [(tdidx1, tdidx2)]
+        else:
+            cases = [(tdidx1, tdidx2), (tdidx1, tdarr)]
+
+        # Check pd.NaT is handled the same as np.nan
+        for idx1, idx2 in cases:
+
+            result = idx1 < idx2
+            expected = np.array([True, False, False, False, True, False])
+            tm.assert_numpy_array_equal(result, expected)
+
+            result = idx2 > idx1
+            expected = np.array([True, False, False, False, True, False])
+            tm.assert_numpy_array_equal(result, expected)
+
+            result = idx1 <= idx2
+            expected = np.array([True, False, False, False, True, True])
+            tm.assert_numpy_array_equal(result, expected)
+
+            result = idx2 >= idx1
+            expected = np.array([True, False, False, False, True, True])
+            tm.assert_numpy_array_equal(result, expected)
+
+            result = idx1 == idx2
+            expected = np.array([False, False, False, False, False, True])
+            tm.assert_numpy_array_equal(result, expected)
+
+            result = idx1 != idx2
+            expected = np.array([True, True, True, True, True, False])
+            tm.assert_numpy_array_equal(result, expected)
+
+    def test_comparisons_coverage(self):
+        rng = timedelta_range('1 days', periods=10)
+
+        result = rng < rng[3]
+        exp = np.array([True, True, True] + [False] * 7)
+        tm.assert_numpy_array_equal(result, exp)
+
+        # raise TypeError for now
+        pytest.raises(TypeError, rng.__lt__, rng[3].value)
+
+        result = rng == list(rng)
+        exp = rng == rng
+        tm.assert_numpy_array_equal(result, exp)
+
+    def test_total_seconds(self):
+        # GH 10939
+        # test index
+        rng = timedelta_range('1 days, 10:11:12.100123456', periods=2,
+                              freq='s')
+        expt = [1 * 86400 + 10 * 3600 + 11 * 60 + 12 + 100123456. / 1e9,
+                1 * 86400 + 10 * 3600 + 11 * 60 + 13 + 100123456. / 1e9]
+        tm.assert_almost_equal(rng.total_seconds(), Index(expt))
+
+        # test Series
+        s = Series(rng)
+        s_expt = Series(expt, index=[0, 1])
+        tm.assert_series_equal(s.dt.total_seconds(), s_expt)
+
+        # with nat
+        s[1] = np.nan
+        s_expt = Series([1 * 86400 + 10 * 3600 + 11 * 60 +
+                         12 + 100123456. 
/ 1e9, np.nan], index=[0, 1]) + tm.assert_series_equal(s.dt.total_seconds(), s_expt) + + # with both nat + s = Series([np.nan, np.nan], dtype='timedelta64[ns]') + tm.assert_series_equal(s.dt.total_seconds(), + Series([np.nan, np.nan], index=[0, 1])) + + def test_pass_TimedeltaIndex_to_index(self): + + rng = timedelta_range('1 days', '10 days') + idx = Index(rng, dtype=object) + + expected = Index(rng.to_pytimedelta(), dtype=object) + + tm.assert_numpy_array_equal(idx.values, expected.values) + + def test_pickle(self): + + rng = timedelta_range('1 days', periods=10) + rng_p = tm.round_trip_pickle(rng) + tm.assert_index_equal(rng, rng_p) + + def test_hash_error(self): + index = timedelta_range('1 days', periods=10) + with tm.assert_raises_regex(TypeError, "unhashable type: %r" % + type(index).__name__): + hash(index) + + def test_append_join_nondatetimeindex(self): + rng = timedelta_range('1 days', periods=10) + idx = Index(['a', 'b', 'c', 'd']) + + result = rng.append(idx) + assert isinstance(result[0], Timedelta) + + # it works + rng.join(idx, how='outer') + + def test_append_numpy_bug_1681(self): + + td = timedelta_range('1 days', '10 days', freq='2D') + a = DataFrame() + c = DataFrame({'A': 'foo', 'B': td}, index=td) + str(c) + + result = a.append(c) + assert (result['B'] == td).all() + + def test_fields(self): + rng = timedelta_range('1 days, 10:11:12.100123456', periods=2, + freq='s') + tm.assert_index_equal(rng.days, Index([1, 1], dtype='int64')) + tm.assert_index_equal( + rng.seconds, + Index([10 * 3600 + 11 * 60 + 12, 10 * 3600 + 11 * 60 + 13], + dtype='int64')) + tm.assert_index_equal( + rng.microseconds, + Index([100 * 1000 + 123, 100 * 1000 + 123], dtype='int64')) + tm.assert_index_equal(rng.nanoseconds, + Index([456, 456], dtype='int64')) + + pytest.raises(AttributeError, lambda: rng.hours) + pytest.raises(AttributeError, lambda: rng.minutes) + pytest.raises(AttributeError, lambda: rng.milliseconds) + + # with nat + s = Series(rng) + s[1] = np.nan + + tm.assert_series_equal(s.dt.days, Series([1, np.nan], index=[0, 1])) + tm.assert_series_equal(s.dt.seconds, Series( + [10 * 3600 + 11 * 60 + 12, np.nan], index=[0, 1])) + + # preserve name (GH15589) + rng.name = 'name' + assert rng.days.name == 'name' + + def test_freq_conversion(self): + + # doc example + + # series + td = Series(date_range('20130101', periods=4)) - \ + Series(date_range('20121201', periods=4)) + td[2] += timedelta(minutes=5, seconds=3) + td[3] = np.nan + + result = td / np.timedelta64(1, 'D') + expected = Series([31, 31, (31 * 86400 + 5 * 60 + 3) / 86400.0, np.nan + ]) + assert_series_equal(result, expected) + + result = td.astype('timedelta64[D]') + expected = Series([31, 31, 31, np.nan]) + assert_series_equal(result, expected) + + result = td / np.timedelta64(1, 's') + expected = Series([31 * 86400, 31 * 86400, 31 * 86400 + 5 * 60 + 3, + np.nan]) + assert_series_equal(result, expected) + + result = td.astype('timedelta64[s]') + assert_series_equal(result, expected) + + # tdi + td = TimedeltaIndex(td) + + result = td / np.timedelta64(1, 'D') + expected = Index([31, 31, (31 * 86400 + 5 * 60 + 3) / 86400.0, np.nan]) + assert_index_equal(result, expected) + + result = td.astype('timedelta64[D]') + expected = Index([31, 31, 31, np.nan]) + assert_index_equal(result, expected) + + result = td / np.timedelta64(1, 's') + expected = Index([31 * 86400, 31 * 86400, 31 * 86400 + 5 * 60 + 3, + np.nan]) + assert_index_equal(result, expected) + + result = td.astype('timedelta64[s]') + assert_index_equal(result, expected) + + 
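`test_freq_conversion` above captures the two unit-conversion idioms of this era of pandas: true division by a unit `timedelta64` keeps fractional parts, while `astype` to a coarser unit truncates and, under these 0.x semantics, comes back as floats. A short sketch, assuming that era's behavior:

```python
import numpy as np
import pandas as pd

td = pd.Series(pd.to_timedelta(['1 days 12:00:00', '3 days']))

# Exact, unit-valued floats: 1.5 and 3.0 days.
exact = td / np.timedelta64(1, 'D')

# astype truncates toward whole days and returns floats
# under the 0.x semantics tested above: 1.0 and 3.0.
floored = td.astype('timedelta64[D]')
```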
+class TestSlicing(object): + + def test_timedelta(self): + # this is valid too + index = date_range('1/1/2000', periods=50, freq='B') + shifted = index + timedelta(1) + back = shifted + timedelta(-1) + assert tm.equalContents(index, back) + assert shifted.freq == index.freq + assert shifted.freq == back.freq + + result = index - timedelta(1) + expected = index + timedelta(-1) + tm.assert_index_equal(result, expected) + + # GH4134, buggy with timedeltas + rng = date_range('2013', '2014') + s = Series(rng) + result1 = rng - pd.offsets.Hour(1) + result2 = DatetimeIndex(s - np.timedelta64(100000000)) + result3 = rng - np.timedelta64(100000000) + result4 = DatetimeIndex(s - pd.offsets.Hour(1)) + tm.assert_index_equal(result1, result4) + tm.assert_index_equal(result2, result3) + + +class TestTimeSeries(object): + _multiprocess_can_split_ = True + + def test_series_box_timedelta(self): + rng = timedelta_range('1 day 1 s', periods=5, freq='h') + s = Series(rng) + assert isinstance(s[1], Timedelta) + assert isinstance(s.iat[2], Timedelta) diff --git a/pandas/tests/indexes/timedeltas/test_timedelta_range.py b/pandas/tests/indexes/timedeltas/test_timedelta_range.py new file mode 100644 index 0000000000000..4732a0ce110de --- /dev/null +++ b/pandas/tests/indexes/timedeltas/test_timedelta_range.py @@ -0,0 +1,51 @@ +import numpy as np + +import pandas as pd +import pandas.util.testing as tm +from pandas.tseries.offsets import Day, Second +from pandas import to_timedelta, timedelta_range +from pandas.util.testing import assert_frame_equal + + +class TestTimedeltas(object): + _multiprocess_can_split_ = True + + def test_timedelta_range(self): + + expected = to_timedelta(np.arange(5), unit='D') + result = timedelta_range('0 days', periods=5, freq='D') + tm.assert_index_equal(result, expected) + + expected = to_timedelta(np.arange(11), unit='D') + result = timedelta_range('0 days', '10 days', freq='D') + tm.assert_index_equal(result, expected) + + expected = to_timedelta(np.arange(5), unit='D') + Second(2) + Day() + result = timedelta_range('1 days, 00:00:02', '5 days, 00:00:02', + freq='D') + tm.assert_index_equal(result, expected) + + expected = to_timedelta([1, 3, 5, 7, 9], unit='D') + Second(2) + result = timedelta_range('1 days, 00:00:02', periods=5, freq='2D') + tm.assert_index_equal(result, expected) + + expected = to_timedelta(np.arange(50), unit='T') * 30 + result = timedelta_range('0 days', freq='30T', periods=50) + tm.assert_index_equal(result, expected) + + # GH 11776 + arr = np.arange(10).reshape(2, 5) + df = pd.DataFrame(np.arange(10).reshape(2, 5)) + for arg in (arr, df): + with tm.assert_raises_regex(TypeError, "1-d array"): + to_timedelta(arg) + for errors in ['ignore', 'raise', 'coerce']: + with tm.assert_raises_regex(TypeError, "1-d array"): + to_timedelta(arg, errors=errors) + + # issue10583 + df = pd.DataFrame(np.random.normal(size=(10, 4))) + df.index = pd.timedelta_range(start='0s', periods=10, freq='s') + expected = df.loc[pd.Timedelta('0s'):, :] + result = df.loc['0s':, :] + assert_frame_equal(expected, result) diff --git a/pandas/tests/indexes/timedeltas/test_tools.py b/pandas/tests/indexes/timedeltas/test_tools.py new file mode 100644 index 0000000000000..a991b7bbe140a --- /dev/null +++ b/pandas/tests/indexes/timedeltas/test_tools.py @@ -0,0 +1,201 @@ +import pytest + +from datetime import time, timedelta +import numpy as np + +import pandas as pd +import pandas.util.testing as tm +from pandas.util.testing import assert_series_equal +from pandas import (Series, Timedelta, 
to_timedelta, isnull, + TimedeltaIndex) +from pandas._libs.tslib import iNaT + + +class TestTimedeltas(object): + _multiprocess_can_split_ = True + + def test_to_timedelta(self): + def conv(v): + return v.astype('m8[ns]') + + d1 = np.timedelta64(1, 'D') + + assert (to_timedelta('1 days 06:05:01.00003', box=False) == + conv(d1 + np.timedelta64(6 * 3600 + 5 * 60 + 1, 's') + + np.timedelta64(30, 'us'))) + assert (to_timedelta('15.5us', box=False) == + conv(np.timedelta64(15500, 'ns'))) + + # empty string + result = to_timedelta('', box=False) + assert result.astype('int64') == iNaT + + result = to_timedelta(['', '']) + assert isnull(result).all() + + # pass thru + result = to_timedelta(np.array([np.timedelta64(1, 's')])) + expected = pd.Index(np.array([np.timedelta64(1, 's')])) + tm.assert_index_equal(result, expected) + + # ints + result = np.timedelta64(0, 'ns') + expected = to_timedelta(0, box=False) + assert result == expected + + # Series + expected = Series([timedelta(days=1), timedelta(days=1, seconds=1)]) + result = to_timedelta(Series(['1d', '1days 00:00:01'])) + tm.assert_series_equal(result, expected) + + # with units + result = TimedeltaIndex([np.timedelta64(0, 'ns'), np.timedelta64( + 10, 's').astype('m8[ns]')]) + expected = to_timedelta([0, 10], unit='s') + tm.assert_index_equal(result, expected) + + # single element conversion + v = timedelta(seconds=1) + result = to_timedelta(v, box=False) + expected = np.timedelta64(timedelta(seconds=1)) + assert result == expected + + v = np.timedelta64(timedelta(seconds=1)) + result = to_timedelta(v, box=False) + expected = np.timedelta64(timedelta(seconds=1)) + assert result == expected + + # arrays of various dtypes + arr = np.array([1] * 5, dtype='int64') + result = to_timedelta(arr, unit='s') + expected = TimedeltaIndex([np.timedelta64(1, 's')] * 5) + tm.assert_index_equal(result, expected) + + arr = np.array([1] * 5, dtype='int64') + result = to_timedelta(arr, unit='m') + expected = TimedeltaIndex([np.timedelta64(1, 'm')] * 5) + tm.assert_index_equal(result, expected) + + arr = np.array([1] * 5, dtype='int64') + result = to_timedelta(arr, unit='h') + expected = TimedeltaIndex([np.timedelta64(1, 'h')] * 5) + tm.assert_index_equal(result, expected) + + arr = np.array([1] * 5, dtype='timedelta64[s]') + result = to_timedelta(arr) + expected = TimedeltaIndex([np.timedelta64(1, 's')] * 5) + tm.assert_index_equal(result, expected) + + arr = np.array([1] * 5, dtype='timedelta64[D]') + result = to_timedelta(arr) + expected = TimedeltaIndex([np.timedelta64(1, 'D')] * 5) + tm.assert_index_equal(result, expected) + + # Test with lists as input when box=false + expected = np.array(np.arange(3) * 1000000000, dtype='timedelta64[ns]') + result = to_timedelta(range(3), unit='s', box=False) + tm.assert_numpy_array_equal(expected, result) + + result = to_timedelta(np.arange(3), unit='s', box=False) + tm.assert_numpy_array_equal(expected, result) + + result = to_timedelta([0, 1, 2], unit='s', box=False) + tm.assert_numpy_array_equal(expected, result) + + # Tests with fractional seconds as input: + expected = np.array( + [0, 500000000, 800000000, 1200000000], dtype='timedelta64[ns]') + result = to_timedelta([0., 0.5, 0.8, 1.2], unit='s', box=False) + tm.assert_numpy_array_equal(expected, result) + + def test_to_timedelta_invalid(self): + + # bad value for errors parameter + msg = "errors must be one of" + tm.assert_raises_regex(ValueError, msg, to_timedelta, + ['foo'], errors='never') + + # these will error + pytest.raises(ValueError, lambda: 
to_timedelta([1, 2], unit='foo')) + pytest.raises(ValueError, lambda: to_timedelta(1, unit='foo')) + + # time not supported ATM + pytest.raises(ValueError, lambda: to_timedelta(time(second=1))) + assert to_timedelta(time(second=1), errors='coerce') is pd.NaT + + pytest.raises(ValueError, lambda: to_timedelta(['foo', 'bar'])) + tm.assert_index_equal(TimedeltaIndex([pd.NaT, pd.NaT]), + to_timedelta(['foo', 'bar'], errors='coerce')) + + tm.assert_index_equal(TimedeltaIndex(['1 day', pd.NaT, '1 min']), + to_timedelta(['1 day', 'bar', '1 min'], + errors='coerce')) + + # gh-13613: these should not error because errors='ignore' + invalid_data = 'apple' + assert invalid_data == to_timedelta(invalid_data, errors='ignore') + + invalid_data = ['apple', '1 days'] + tm.assert_numpy_array_equal( + np.array(invalid_data, dtype=object), + to_timedelta(invalid_data, errors='ignore')) + + invalid_data = pd.Index(['apple', '1 days']) + tm.assert_index_equal(invalid_data, to_timedelta( + invalid_data, errors='ignore')) + + invalid_data = Series(['apple', '1 days']) + tm.assert_series_equal(invalid_data, to_timedelta( + invalid_data, errors='ignore')) + + def test_to_timedelta_via_apply(self): + # GH 5458 + expected = Series([np.timedelta64(1, 's')]) + result = Series(['00:00:01']).apply(to_timedelta) + tm.assert_series_equal(result, expected) + + result = Series([to_timedelta('00:00:01')]) + tm.assert_series_equal(result, expected) + + def test_to_timedelta_on_missing_values(self): + # GH5438 + timedelta_NaT = np.timedelta64('NaT') + + actual = pd.to_timedelta(Series(['00:00:01', np.nan])) + expected = Series([np.timedelta64(1000000000, 'ns'), + timedelta_NaT], dtype=' obj.ndim - 1: + return + + def _print(result, error=None): + if error is not None: + error = str(error) + v = ("%-16.16s [%-16.16s]: [typ->%-8.8s,obj->%-8.8s," + "key1->(%-4.4s),key2->(%-4.4s),axis->%s] %s" % + (name, result, t, o, method1, method2, a, error or '')) + if _verbose: + pprint_thing(v) + + try: + rs = getattr(obj, method1).__getitem__(_axify(obj, k1, a)) + + try: + xp = self.get_result(obj, method2, k2, a) + except: + result = 'no comp' + _print(result) + return + + detail = None + + try: + if is_scalar(rs) and is_scalar(xp): + assert rs == xp + elif xp.ndim == 1: + tm.assert_series_equal(rs, xp) + elif xp.ndim == 2: + tm.assert_frame_equal(rs, xp) + elif xp.ndim == 3: + tm.assert_panel_equal(rs, xp) + result = 'ok' + except AssertionError as e: + detail = str(e) + result = 'fail' + + # reverse the checks + if fails is True: + if result == 'fail': + result = 'ok (fail)' + + _print(result) + if not result.startswith('ok'): + raise AssertionError(detail) + + except AssertionError: + raise + except Exception as detail: + + # if we are in fails, the ok, otherwise raise it + if fails is not None: + if isinstance(detail, fails): + result = 'ok (%s)' % type(detail).__name__ + _print(result) + return + + result = type(detail).__name__ + raise AssertionError(_print(result, error=detail)) + + if typs is None: + typs = self._typs + + if objs is None: + objs = self._objs + + if axes is not None: + if not isinstance(axes, (tuple, list)): + axes = [axes] + else: + axes = list(axes) + else: + axes = [0, 1, 2] + + # check + for o in objs: + if o not in self._objs: + continue + + d = getattr(self, o) + for a in axes: + for t in typs: + if t not in self._typs: + continue + + obj = d[t] + if obj is None: + continue + + def _call(obj=obj): + obj = obj.copy() + + k2 = key2 + _eq(t, o, a, obj, key1, k2) + + # Panel deprecations + if isinstance(obj, 
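The `errors=` tests above fix the contract of `to_timedelta` on unparseable input: `'raise'` (the default) throws `ValueError`, `'coerce'` substitutes `NaT`, and `'ignore'` hands the input back unchanged. Compactly, using only behavior the tests assert:

```python
import pandas as pd

assert pd.to_timedelta('foo', errors='coerce') is pd.NaT
assert pd.to_timedelta('foo', errors='ignore') == 'foo'

try:
    pd.to_timedelta('foo')          # errors='raise' is the default
except ValueError:
    pass
```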
Panel): + with catch_warnings(record=True): + _call() + else: + _call() diff --git a/pandas/tests/indexing/test_callable.py b/pandas/tests/indexing/test_callable.py index 3465d776bfa85..95b406517be62 100644 --- a/pandas/tests/indexing/test_callable.py +++ b/pandas/tests/indexing/test_callable.py @@ -1,15 +1,12 @@ # -*- coding: utf-8 -*- # pylint: disable-msg=W0612,E1101 -import nose import numpy as np import pandas as pd import pandas.util.testing as tm -class TestIndexingCallable(tm.TestCase): - - _multiprocess_can_split_ = True +class TestIndexingCallable(object): def test_frame_loc_ix_callable(self): # GH 11485 @@ -21,51 +18,51 @@ def test_frame_loc_ix_callable(self): res = df.loc[lambda x: x.A > 2] tm.assert_frame_equal(res, df.loc[df.A > 2]) - res = df.ix[lambda x: x.A > 2] - tm.assert_frame_equal(res, df.ix[df.A > 2]) + res = df.loc[lambda x: x.A > 2] + tm.assert_frame_equal(res, df.loc[df.A > 2]) res = df.loc[lambda x: x.A > 2, ] tm.assert_frame_equal(res, df.loc[df.A > 2, ]) - res = df.ix[lambda x: x.A > 2, ] - tm.assert_frame_equal(res, df.ix[df.A > 2, ]) + res = df.loc[lambda x: x.A > 2, ] + tm.assert_frame_equal(res, df.loc[df.A > 2, ]) res = df.loc[lambda x: x.B == 'b', :] tm.assert_frame_equal(res, df.loc[df.B == 'b', :]) - res = df.ix[lambda x: x.B == 'b', :] - tm.assert_frame_equal(res, df.ix[df.B == 'b', :]) + res = df.loc[lambda x: x.B == 'b', :] + tm.assert_frame_equal(res, df.loc[df.B == 'b', :]) res = df.loc[lambda x: x.A > 2, lambda x: x.columns == 'B'] tm.assert_frame_equal(res, df.loc[df.A > 2, [False, True, False]]) - res = df.ix[lambda x: x.A > 2, lambda x: x.columns == 'B'] - tm.assert_frame_equal(res, df.ix[df.A > 2, [False, True, False]]) + res = df.loc[lambda x: x.A > 2, lambda x: x.columns == 'B'] + tm.assert_frame_equal(res, df.loc[df.A > 2, [False, True, False]]) res = df.loc[lambda x: x.A > 2, lambda x: 'B'] tm.assert_series_equal(res, df.loc[df.A > 2, 'B']) - res = df.ix[lambda x: x.A > 2, lambda x: 'B'] - tm.assert_series_equal(res, df.ix[df.A > 2, 'B']) + res = df.loc[lambda x: x.A > 2, lambda x: 'B'] + tm.assert_series_equal(res, df.loc[df.A > 2, 'B']) res = df.loc[lambda x: x.A > 2, lambda x: ['A', 'B']] tm.assert_frame_equal(res, df.loc[df.A > 2, ['A', 'B']]) - res = df.ix[lambda x: x.A > 2, lambda x: ['A', 'B']] - tm.assert_frame_equal(res, df.ix[df.A > 2, ['A', 'B']]) + res = df.loc[lambda x: x.A > 2, lambda x: ['A', 'B']] + tm.assert_frame_equal(res, df.loc[df.A > 2, ['A', 'B']]) res = df.loc[lambda x: x.A == 2, lambda x: ['A', 'B']] tm.assert_frame_equal(res, df.loc[df.A == 2, ['A', 'B']]) - res = df.ix[lambda x: x.A == 2, lambda x: ['A', 'B']] - tm.assert_frame_equal(res, df.ix[df.A == 2, ['A', 'B']]) + res = df.loc[lambda x: x.A == 2, lambda x: ['A', 'B']] + tm.assert_frame_equal(res, df.loc[df.A == 2, ['A', 'B']]) # scalar res = df.loc[lambda x: 1, lambda x: 'A'] - self.assertEqual(res, df.loc[1, 'A']) + assert res == df.loc[1, 'A'] - res = df.ix[lambda x: 1, lambda x: 'A'] - self.assertEqual(res, df.ix[1, 'A']) + res = df.loc[lambda x: 1, lambda x: 'A'] + assert res == df.loc[1, 'A'] def test_frame_loc_ix_callable_mixture(self): # GH 11485 @@ -75,20 +72,20 @@ def test_frame_loc_ix_callable_mixture(self): res = df.loc[lambda x: x.A > 2, ['A', 'B']] tm.assert_frame_equal(res, df.loc[df.A > 2, ['A', 'B']]) - res = df.ix[lambda x: x.A > 2, ['A', 'B']] - tm.assert_frame_equal(res, df.ix[df.A > 2, ['A', 'B']]) + res = df.loc[lambda x: x.A > 2, ['A', 'B']] + tm.assert_frame_equal(res, df.loc[df.A > 2, ['A', 'B']]) res = df.loc[[2, 3], lambda x: ['A', 
'B']] tm.assert_frame_equal(res, df.loc[[2, 3], ['A', 'B']]) - res = df.ix[[2, 3], lambda x: ['A', 'B']] - tm.assert_frame_equal(res, df.ix[[2, 3], ['A', 'B']]) + res = df.loc[[2, 3], lambda x: ['A', 'B']] + tm.assert_frame_equal(res, df.loc[[2, 3], ['A', 'B']]) res = df.loc[3, lambda x: ['A', 'B']] tm.assert_series_equal(res, df.loc[3, ['A', 'B']]) - res = df.ix[3, lambda x: ['A', 'B']] - tm.assert_series_equal(res, df.ix[3, ['A', 'B']]) + res = df.loc[3, lambda x: ['A', 'B']] + tm.assert_series_equal(res, df.loc[3, ['A', 'B']]) def test_frame_loc_callable(self): # GH 11485 @@ -268,8 +265,3 @@ def test_frame_iloc_callable_setitem(self): exp = df.copy() exp.iloc[[1, 3], [0]] = [-5, -5] tm.assert_frame_equal(res, exp) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/indexing/test_categorical.py b/pandas/tests/indexing/test_categorical.py index 9ef2802cb950f..6874fedaa705f 100644 --- a/pandas/tests/indexing/test_categorical.py +++ b/pandas/tests/indexing/test_categorical.py @@ -1,15 +1,18 @@ # -*- coding: utf-8 -*- +import pytest + import pandas as pd import numpy as np -from pandas import Series, DataFrame +from pandas import (Series, DataFrame, Timestamp, + Categorical, CategoricalIndex) from pandas.util.testing import assert_series_equal, assert_frame_equal from pandas.util import testing as tm -class TestCategoricalIndex(tm.TestCase): +class TestCategoricalIndex(object): - def setUp(self): + def setup_method(self, method): self.df = DataFrame({'A': np.arange(6, dtype='int64'), 'B': Series(list('aabbca')).astype( @@ -47,22 +50,33 @@ def test_loc_scalar(self): assert_frame_equal(df, expected) # value not in the categories - self.assertRaises(KeyError, lambda: df.loc['d']) + pytest.raises(KeyError, lambda: df.loc['d']) def f(): df.loc['d'] = 10 - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) def f(): df.loc['d', 'A'] = 10 - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) def f(): df.loc['d', 'C'] = 10 - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) + + def test_getitem_scalar(self): + + cats = Categorical([Timestamp('12-31-1999'), + Timestamp('12-31-2000')]) + + s = Series([1, 2], index=cats) + + expected = s.iloc[0] + result = s[cats[0]] + assert result == expected def test_loc_listlike(self): @@ -72,70 +86,70 @@ def test_loc_listlike(self): assert_frame_equal(result, expected, check_index_type=True) result = self.df2.loc[['a', 'b', 'e']] - exp_index = pd.CategoricalIndex( + exp_index = CategoricalIndex( list('aaabbe'), categories=list('cabe'), name='B') expected = DataFrame({'A': [0, 1, 5, 2, 3, np.nan]}, index=exp_index) assert_frame_equal(result, expected, check_index_type=True) # element in the categories but not in the values - self.assertRaises(KeyError, lambda: self.df2.loc['e']) + pytest.raises(KeyError, lambda: self.df2.loc['e']) # assign is ok df = self.df2.copy() df.loc['e'] = 20 result = df.loc[['a', 'b', 'e']] - exp_index = pd.CategoricalIndex( + exp_index = CategoricalIndex( list('aaabbe'), categories=list('cabe'), name='B') expected = DataFrame({'A': [0, 1, 5, 2, 3, 20]}, index=exp_index) assert_frame_equal(result, expected) df = self.df2.copy() result = df.loc[['a', 'b', 'e']] - exp_index = pd.CategoricalIndex( + exp_index = CategoricalIndex( list('aaabbe'), categories=list('cabe'), name='B') expected = DataFrame({'A': [0, 1, 5, 2, 3, np.nan]}, index=exp_index) assert_frame_equal(result, expected, check_index_type=True) # not all 
labels in the categories - self.assertRaises(KeyError, lambda: self.df2.loc[['a', 'd']]) + pytest.raises(KeyError, lambda: self.df2.loc[['a', 'd']]) def test_loc_listlike_dtypes(self): # GH 11586 # unique categories and codes - index = pd.CategoricalIndex(['a', 'b', 'c']) + index = CategoricalIndex(['a', 'b', 'c']) df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=index) # unique slice res = df.loc[['a', 'b']] - exp_index = pd.CategoricalIndex(['a', 'b'], - categories=index.categories) + exp_index = CategoricalIndex(['a', 'b'], + categories=index.categories) exp = DataFrame({'A': [1, 2], 'B': [4, 5]}, index=exp_index) tm.assert_frame_equal(res, exp, check_index_type=True) # duplicated slice res = df.loc[['a', 'a', 'b']] - exp_index = pd.CategoricalIndex(['a', 'a', 'b'], - categories=index.categories) + exp_index = CategoricalIndex(['a', 'a', 'b'], + categories=index.categories) exp = DataFrame({'A': [1, 1, 2], 'B': [4, 4, 5]}, index=exp_index) tm.assert_frame_equal(res, exp, check_index_type=True) - with tm.assertRaisesRegexp( + with tm.assert_raises_regex( KeyError, 'a list-indexer must only include values that are ' 'in the categories'): df.loc[['a', 'x']] # duplicated categories and codes - index = pd.CategoricalIndex(['a', 'b', 'a']) + index = CategoricalIndex(['a', 'b', 'a']) df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=index) # unique slice res = df.loc[['a', 'b']] exp = DataFrame({'A': [1, 3, 2], 'B': [4, 6, 5]}, - index=pd.CategoricalIndex(['a', 'a', 'b'])) + index=CategoricalIndex(['a', 'a', 'b'])) tm.assert_frame_equal(res, exp, check_index_type=True) # duplicated slice @@ -143,94 +157,117 @@ def test_loc_listlike_dtypes(self): exp = DataFrame( {'A': [1, 3, 1, 3, 2], 'B': [4, 6, 4, 6, 5 - ]}, index=pd.CategoricalIndex(['a', 'a', 'a', 'a', 'b'])) + ]}, index=CategoricalIndex(['a', 'a', 'a', 'a', 'b'])) tm.assert_frame_equal(res, exp, check_index_type=True) - with tm.assertRaisesRegexp( + with tm.assert_raises_regex( KeyError, 'a list-indexer must only include values ' 'that are in the categories'): df.loc[['a', 'x']] # contains unused category - index = pd.CategoricalIndex( + index = CategoricalIndex( ['a', 'b', 'a', 'c'], categories=list('abcde')) df = DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=index) res = df.loc[['a', 'b']] - exp = DataFrame({'A': [1, 3, 2], - 'B': [5, 7, 6]}, index=pd.CategoricalIndex( - ['a', 'a', 'b'], categories=list('abcde'))) + exp = DataFrame({'A': [1, 3, 2], 'B': [5, 7, 6]}, + index=CategoricalIndex(['a', 'a', 'b'], + categories=list('abcde'))) tm.assert_frame_equal(res, exp, check_index_type=True) res = df.loc[['a', 'e']] exp = DataFrame({'A': [1, 3, np.nan], 'B': [5, 7, np.nan]}, - index=pd.CategoricalIndex(['a', 'a', 'e'], - categories=list('abcde'))) + index=CategoricalIndex(['a', 'a', 'e'], + categories=list('abcde'))) tm.assert_frame_equal(res, exp, check_index_type=True) # duplicated slice res = df.loc[['a', 'a', 'b']] exp = DataFrame({'A': [1, 3, 1, 3, 2], 'B': [5, 7, 5, 7, 6]}, - index=pd.CategoricalIndex(['a', 'a', 'a', 'a', 'b'], - categories=list('abcde'))) + index=CategoricalIndex(['a', 'a', 'a', 'a', 'b'], + categories=list('abcde'))) tm.assert_frame_equal(res, exp, check_index_type=True) - with tm.assertRaisesRegexp( + with tm.assert_raises_regex( KeyError, 'a list-indexer must only include values ' 'that are in the categories'): df.loc[['a', 'x']] + def test_get_indexer_array(self): + arr = np.array([Timestamp('1999-12-31 00:00:00'), + Timestamp('2000-12-31 00:00:00')], dtype=object) + cats = 
[Timestamp('1999-12-31 00:00:00'), + Timestamp('2000-12-31 00:00:00')] + ci = CategoricalIndex(cats, + categories=cats, + ordered=False, dtype='category') + result = ci.get_indexer(arr) + expected = np.array([0, 1], dtype='intp') + tm.assert_numpy_array_equal(result, expected) + + def test_getitem_with_listlike(self): + # GH 16115 + cats = Categorical([Timestamp('12-31-1999'), + Timestamp('12-31-2000')]) + + expected = DataFrame([[1, 0], [0, 1]], dtype='uint8', + index=[0, 1], columns=cats) + dummies = pd.get_dummies(cats) + result = dummies[[c for c in dummies.columns]] + assert_frame_equal(result, expected) + def test_ix_categorical_index(self): # GH 12531 - df = pd.DataFrame(np.random.randn(3, 3), - index=list('ABC'), columns=list('XYZ')) + df = DataFrame(np.random.randn(3, 3), + index=list('ABC'), columns=list('XYZ')) cdf = df.copy() - cdf.index = pd.CategoricalIndex(df.index) - cdf.columns = pd.CategoricalIndex(df.columns) + cdf.index = CategoricalIndex(df.index) + cdf.columns = CategoricalIndex(df.columns) - expect = pd.Series(df.ix['A', :], index=cdf.columns, name='A') - assert_series_equal(cdf.ix['A', :], expect) + expect = Series(df.loc['A', :], index=cdf.columns, name='A') + assert_series_equal(cdf.loc['A', :], expect) - expect = pd.Series(df.ix[:, 'X'], index=cdf.index, name='X') - assert_series_equal(cdf.ix[:, 'X'], expect) + expect = Series(df.loc[:, 'X'], index=cdf.index, name='X') + assert_series_equal(cdf.loc[:, 'X'], expect) - exp_index = pd.CategoricalIndex(list('AB'), categories=['A', 'B', 'C']) - expect = pd.DataFrame(df.ix[['A', 'B'], :], columns=cdf.columns, - index=exp_index) - assert_frame_equal(cdf.ix[['A', 'B'], :], expect) + exp_index = CategoricalIndex(list('AB'), categories=['A', 'B', 'C']) + expect = DataFrame(df.loc[['A', 'B'], :], columns=cdf.columns, + index=exp_index) + assert_frame_equal(cdf.loc[['A', 'B'], :], expect) - exp_columns = pd.CategoricalIndex(list('XY'), - categories=['X', 'Y', 'Z']) - expect = pd.DataFrame(df.ix[:, ['X', 'Y']], index=cdf.index, - columns=exp_columns) - assert_frame_equal(cdf.ix[:, ['X', 'Y']], expect) + exp_columns = CategoricalIndex(list('XY'), + categories=['X', 'Y', 'Z']) + expect = DataFrame(df.loc[:, ['X', 'Y']], index=cdf.index, + columns=exp_columns) + assert_frame_equal(cdf.loc[:, ['X', 'Y']], expect) # non-unique - df = pd.DataFrame(np.random.randn(3, 3), - index=list('ABA'), columns=list('XYX')) + df = DataFrame(np.random.randn(3, 3), + index=list('ABA'), columns=list('XYX')) cdf = df.copy() - cdf.index = pd.CategoricalIndex(df.index) - cdf.columns = pd.CategoricalIndex(df.columns) + cdf.index = CategoricalIndex(df.index) + cdf.columns = CategoricalIndex(df.columns) - exp_index = pd.CategoricalIndex(list('AA'), categories=['A', 'B']) - expect = pd.DataFrame(df.ix['A', :], columns=cdf.columns, - index=exp_index) - assert_frame_equal(cdf.ix['A', :], expect) + exp_index = CategoricalIndex(list('AA'), categories=['A', 'B']) + expect = DataFrame(df.loc['A', :], columns=cdf.columns, + index=exp_index) + assert_frame_equal(cdf.loc['A', :], expect) - exp_columns = pd.CategoricalIndex(list('XX'), categories=['X', 'Y']) - expect = pd.DataFrame(df.ix[:, 'X'], index=cdf.index, - columns=exp_columns) - assert_frame_equal(cdf.ix[:, 'X'], expect) + exp_columns = CategoricalIndex(list('XX'), categories=['X', 'Y']) + expect = DataFrame(df.loc[:, 'X'], index=cdf.index, + columns=exp_columns) + assert_frame_equal(cdf.loc[:, 'X'], expect) - expect = pd.DataFrame(df.ix[['A', 'B'], :], columns=cdf.columns, - 
index=pd.CategoricalIndex(list('AAB'))) - assert_frame_equal(cdf.ix[['A', 'B'], :], expect) + expect = DataFrame(df.loc[['A', 'B'], :], columns=cdf.columns, + index=CategoricalIndex(list('AAB'))) + assert_frame_equal(cdf.loc[['A', 'B'], :], expect) - expect = pd.DataFrame(df.ix[:, ['X', 'Y']], index=cdf.index, - columns=pd.CategoricalIndex(list('XXY'))) - assert_frame_equal(cdf.ix[:, ['X', 'Y']], expect) + expect = DataFrame(df.loc[:, ['X', 'Y']], index=cdf.index, + columns=CategoricalIndex(list('XXY'))) + assert_frame_equal(cdf.loc[:, ['X', 'Y']], expect) def test_read_only_source(self): # GH 10043 @@ -279,13 +316,13 @@ def test_reindexing(self): # then return a Categorical cats = list('cabe') - result = self.df2.reindex(pd.Categorical(['a', 'd'], categories=cats)) + result = self.df2.reindex(Categorical(['a', 'd'], categories=cats)) expected = DataFrame({'A': [0, 1, 5, np.nan], 'B': Series(list('aaad')).astype( 'category', categories=cats)}).set_index('B') assert_frame_equal(result, expected, check_index_type=True) - result = self.df2.reindex(pd.Categorical(['a'], categories=cats)) + result = self.df2.reindex(Categorical(['a'], categories=cats)) expected = DataFrame({'A': [0, 1, 5], 'B': Series(list('aaa')).astype( 'category', categories=cats)}).set_index('B') @@ -307,7 +344,7 @@ def test_reindexing(self): assert_frame_equal(result, expected, check_index_type=True) # give back the type of categorical that we received - result = self.df2.reindex(pd.Categorical( + result = self.df2.reindex(Categorical( ['a', 'd'], categories=cats, ordered=True)) expected = DataFrame( {'A': [0, 1, 5, np.nan], @@ -315,7 +352,7 @@ def test_reindexing(self): ordered=True)}).set_index('B') assert_frame_equal(result, expected, check_index_type=True) - result = self.df2.reindex(pd.Categorical( + result = self.df2.reindex(Categorical( ['a', 'd'], categories=['a', 'd'])) expected = DataFrame({'A': [0, 1, 5, np.nan], 'B': Series(list('aaad')).astype( @@ -324,22 +361,22 @@ def test_reindexing(self): assert_frame_equal(result, expected, check_index_type=True) # passed duplicate indexers are not allowed - self.assertRaises(ValueError, lambda: self.df2.reindex(['a', 'a'])) + pytest.raises(ValueError, lambda: self.df2.reindex(['a', 'a'])) # args NotImplemented ATM - self.assertRaises(NotImplementedError, - lambda: self.df2.reindex(['a'], method='ffill')) - self.assertRaises(NotImplementedError, - lambda: self.df2.reindex(['a'], level=1)) - self.assertRaises(NotImplementedError, - lambda: self.df2.reindex(['a'], limit=2)) + pytest.raises(NotImplementedError, + lambda: self.df2.reindex(['a'], method='ffill')) + pytest.raises(NotImplementedError, + lambda: self.df2.reindex(['a'], level=1)) + pytest.raises(NotImplementedError, + lambda: self.df2.reindex(['a'], limit=2)) def test_loc_slice(self): # slicing # not implemented ATM # GH9748 - self.assertRaises(TypeError, lambda: self.df.loc[1:5]) + pytest.raises(TypeError, lambda: self.df.loc[1:5]) # result = df.loc[1:5] # expected = df.iloc[[1,2,3,4]] @@ -387,8 +424,8 @@ def test_boolean_selection(self): # categories=[3, 2, 1], # ordered=False, # name=u'B') - self.assertRaises(TypeError, lambda: df4[df4.index < 2]) - self.assertRaises(TypeError, lambda: df4[df4.index > 1]) + pytest.raises(TypeError, lambda: df4[df4.index < 2]) + pytest.raises(TypeError, lambda: df4[df4.index > 1]) def test_indexing_with_category(self): diff --git a/pandas/tests/indexing/test_chaining_and_caching.py b/pandas/tests/indexing/test_chaining_and_caching.py new file mode 100644 index 
0000000000000..27a889e58e55e --- /dev/null +++ b/pandas/tests/indexing/test_chaining_and_caching.py @@ -0,0 +1,420 @@ +from warnings import catch_warnings + +import pytest + +import numpy as np +import pandas as pd +from pandas.core import common as com +from pandas import (compat, DataFrame, option_context, + Series, MultiIndex, date_range, Timestamp) +from pandas.util import testing as tm + + +class TestCaching(object): + + def test_slice_consolidate_invalidate_item_cache(self): + + # this is chained assignment, but will 'work' + with option_context('chained_assignment', None): + + # #3970 + df = DataFrame({"aa": compat.lrange(5), "bb": [2.2] * 5}) + + # Creates a second float block + df["cc"] = 0.0 + + # caches a reference to the 'bb' series + df["bb"] + + # repr machinery triggers consolidation + repr(df) + + # Assignment to wrong series + df['bb'].iloc[0] = 0.17 + df._clear_item_cache() + tm.assert_almost_equal(df['bb'][0], 0.17) + + def test_setitem_cache_updating(self): + # GH 5424 + cont = ['one', 'two', 'three', 'four', 'five', 'six', 'seven'] + + for do_ref in [False, True]: + df = DataFrame({'a': cont, + "b": cont[3:] + cont[:3], + 'c': np.arange(7)}) + + # ref the cache + if do_ref: + df.loc[0, "c"] + + # set it + df.loc[7, 'c'] = 1 + + assert df.loc[0, 'c'] == 0.0 + assert df.loc[7, 'c'] == 1.0 + + # GH 7084 + # not updating cache on series setting with slices + expected = DataFrame({'A': [600, 600, 600]}, + index=date_range('5/7/2014', '5/9/2014')) + out = DataFrame({'A': [0, 0, 0]}, + index=date_range('5/7/2014', '5/9/2014')) + df = DataFrame({'C': ['A', 'A', 'A'], 'D': [100, 200, 300]}) + + # loop through df to update out + six = Timestamp('5/7/2014') + eix = Timestamp('5/9/2014') + for ix, row in df.iterrows(): + out.loc[six:eix, row['C']] = out.loc[six:eix, row['C']] + row['D'] + + tm.assert_frame_equal(out, expected) + tm.assert_series_equal(out['A'], expected['A']) + + # try via a chain indexing + # this actually works + out = DataFrame({'A': [0, 0, 0]}, + index=date_range('5/7/2014', '5/9/2014')) + for ix, row in df.iterrows(): + v = out[row['C']][six:eix] + row['D'] + out[row['C']][six:eix] = v + + tm.assert_frame_equal(out, expected) + tm.assert_series_equal(out['A'], expected['A']) + + out = DataFrame({'A': [0, 0, 0]}, + index=date_range('5/7/2014', '5/9/2014')) + for ix, row in df.iterrows(): + out.loc[six:eix, row['C']] += row['D'] + + tm.assert_frame_equal(out, expected) + tm.assert_series_equal(out['A'], expected['A']) + + +class TestChaining(object): + + def test_setitem_chained_setfault(self): + + # GH6026 + # segfaults under numpy 1.7.1 (ok on 1.8) + data = ['right', 'left', 'left', 'left', 'right', 'left', 'timeout'] + mdata = ['right', 'left', 'left', 'left', 'right', 'left', 'none'] + + df = DataFrame({'response': np.array(data)}) + mask = df.response == 'timeout' + df.response[mask] = 'none' + tm.assert_frame_equal(df, DataFrame({'response': mdata})) + + recarray = np.rec.fromarrays([data], names=['response']) + df = DataFrame(recarray) + mask = df.response == 'timeout' + df.response[mask] = 'none' + tm.assert_frame_equal(df, DataFrame({'response': mdata})) + + df = DataFrame({'response': data, 'response1': data}) + mask = df.response == 'timeout' + df.response[mask] = 'none' + tm.assert_frame_equal(df, DataFrame({'response': mdata, + 'response1': data})) + + # GH 6056 + expected = DataFrame(dict(A=[np.nan, 'bar', 'bah', 'foo', 'bar'])) + df = DataFrame(dict(A=np.array(['foo', 'bar', 'bah', 'foo', 'bar']))) + df['A'].iloc[0] = np.nan + result = df.head()
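The chained-assignment hazard that these `TestCaching`/`TestChaining` cases exercise can be reproduced outside the test harness. A minimal sketch (the frame and values here are illustrative, not taken from the patch):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [1.5, 2.5]})

# Chained assignment: two indexing calls, so the write may land on a
# temporary copy. Under option_context('chained_assignment', 'raise')
# this is the shape that triggers SettingWithCopyError in the tests above.
# df["A"][0] = -5

# Supported spelling: a single indexer, so the write always targets df.
df.loc[0, "A"] = -5
assert df.loc[0, "A"] == -5
```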
+ tm.assert_frame_equal(result, expected) + + df = DataFrame(dict(A=np.array(['foo', 'bar', 'bah', 'foo', 'bar']))) + df.A.iloc[0] = np.nan + result = df.head() + tm.assert_frame_equal(result, expected) + + def test_detect_chained_assignment(self): + + pd.set_option('chained_assignment', 'raise') + + # work with the chain + expected = DataFrame([[-5, 1], [-6, 3]], columns=list('AB')) + df = DataFrame(np.arange(4).reshape(2, 2), + columns=list('AB'), dtype='int64') + assert df.is_copy is None + + df['A'][0] = -5 + df['A'][1] = -6 + tm.assert_frame_equal(df, expected) + + # test with the chaining + df = DataFrame({'A': Series(range(2), dtype='int64'), + 'B': np.array(np.arange(2, 4), dtype=np.float64)}) + assert df.is_copy is None + + with pytest.raises(com.SettingWithCopyError): + df['A'][0] = -5 + + with pytest.raises(com.SettingWithCopyError): + df['A'][1] = np.nan + + assert df['A'].is_copy is None + + # Using a copy (the chain), fails + df = DataFrame({'A': Series(range(2), dtype='int64'), + 'B': np.array(np.arange(2, 4), dtype=np.float64)}) + + with pytest.raises(com.SettingWithCopyError): + df.loc[0]['A'] = -5 + + # Doc example + df = DataFrame({'a': ['one', 'one', 'two', 'three', + 'two', 'one', 'six'], + 'c': Series(range(7), dtype='int64')}) + assert df.is_copy is None + + with pytest.raises(com.SettingWithCopyError): + indexer = df.a.str.startswith('o') + df[indexer]['c'] = 42 + + expected = DataFrame({'A': [111, 'bbb', 'ccc'], 'B': [1, 2, 3]}) + df = DataFrame({'A': ['aaa', 'bbb', 'ccc'], 'B': [1, 2, 3]}) + + with pytest.raises(com.SettingWithCopyError): + df['A'][0] = 111 + + with pytest.raises(com.SettingWithCopyError): + df.loc[0]['A'] = 111 + + df.loc[0, 'A'] = 111 + tm.assert_frame_equal(df, expected) + + # gh-5475: Make sure that is_copy is picked up reconstruction + df = DataFrame({"A": [1, 2]}) + assert df.is_copy is None + + with tm.ensure_clean('__tmp__pickle') as path: + df.to_pickle(path) + df2 = pd.read_pickle(path) + df2["B"] = df2["A"] + df2["B"] = df2["A"] + + # gh-5597: a spurious raise as we are setting the entire column here + from string import ascii_letters as letters + + def random_text(nobs=100): + df = [] + for i in range(nobs): + idx = np.random.randint(len(letters), size=2) + idx.sort() + + df.append([letters[idx[0]:idx[1]]]) + + return DataFrame(df, columns=['letters']) + + df = random_text(100000) + + # Always a copy + x = df.iloc[[0, 1, 2]] + assert x.is_copy is not None + + x = df.iloc[[0, 1, 2, 4]] + assert x.is_copy is not None + + # Explicitly copy + indexer = df.letters.apply(lambda x: len(x) > 10) + df = df.loc[indexer].copy() + + assert df.is_copy is None + df['letters'] = df['letters'].apply(str.lower) + + # Implicitly take + df = random_text(100000) + indexer = df.letters.apply(lambda x: len(x) > 10) + df = df.loc[indexer] + + assert df.is_copy is not None + df['letters'] = df['letters'].apply(str.lower) + + # Implicitly take 2 + df = random_text(100000) + indexer = df.letters.apply(lambda x: len(x) > 10) + + df = df.loc[indexer] + assert df.is_copy is not None + df.loc[:, 'letters'] = df['letters'].apply(str.lower) + + # Should be ok even though it's a copy! 
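The `is_copy` bookkeeping asserted around here can also be observed directly. A small sketch, assuming the pandas-0.20-era `NDFrame.is_copy` attribute that these tests rely on (it was deprecated and later removed):

```python
import pandas as pd

df = pd.DataFrame({"letters": ["alpha", "b", "gamma"]})

# a row selection keeps a weak back-reference to its parent ...
sub = df.loc[df["letters"].str.len() > 2]
assert sub.is_copy is not None

# ... while an explicit copy() drops it, so a later column assignment
# cannot trip the SettingWithCopy machinery
safe = df.loc[df["letters"].str.len() > 2].copy()
assert safe.is_copy is None
safe["letters"] = safe["letters"].str.upper()
```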
+ assert df.is_copy is None + + df['letters'] = df['letters'].apply(str.lower) + assert df.is_copy is None + + df = random_text(100000) + indexer = df.letters.apply(lambda x: len(x) > 10) + df.loc[indexer, 'letters'] = ( + df.loc[indexer, 'letters'].apply(str.lower)) + + # an identical take, so no copy + df = DataFrame({'a': [1]}).dropna() + assert df.is_copy is None + df['a'] += 1 + + # Inplace ops, originally from: + # http://stackoverflow.com/questions/20508968/series-fillna-in-a-multiindex-dataframe-does-not-fill-is-this-a-bug + a = [12, 23] + b = [123, None] + c = [1234, 2345] + d = [12345, 23456] + tuples = [('eyes', 'left'), ('eyes', 'right'), ('ears', 'left'), + ('ears', 'right')] + events = {('eyes', 'left'): a, + ('eyes', 'right'): b, + ('ears', 'left'): c, + ('ears', 'right'): d} + multiind = MultiIndex.from_tuples(tuples, names=['part', 'side']) + zed = DataFrame(events, index=['a', 'b'], columns=multiind) + + with pytest.raises(com.SettingWithCopyError): + zed['eyes']['right'].fillna(value=555, inplace=True) + + df = DataFrame(np.random.randn(10, 4)) + s = df.iloc[:, 0].sort_values() + + tm.assert_series_equal(s, df.iloc[:, 0].sort_values()) + tm.assert_series_equal(s, df[0].sort_values()) + + # see gh-6025: false positives + df = DataFrame({'column1': ['a', 'a', 'a'], 'column2': [4, 8, 9]}) + str(df) + + df['column1'] = df['column1'] + 'b' + str(df) + + df = df[df['column2'] != 8] + str(df) + + df['column1'] = df['column1'] + 'c' + str(df) + + # from SO: + # http://stackoverflow.com/questions/24054495/potential-bug-setting-value-for-undefined-column-using-iloc + df = DataFrame(np.arange(0, 9), columns=['count']) + df['group'] = 'b' + + with pytest.raises(com.SettingWithCopyError): + df.iloc[0:5]['group'] = 'a' + + # Mixed type setting but same dtype & changing dtype + df = DataFrame(dict(A=date_range('20130101', periods=5), + B=np.random.randn(5), + C=np.arange(5, dtype='int64'), + D=list('abcde'))) + + with pytest.raises(com.SettingWithCopyError): + df.loc[2]['D'] = 'foo' + + with pytest.raises(com.SettingWithCopyError): + df.loc[2]['C'] = 'foo' + + with pytest.raises(com.SettingWithCopyError): + df['C'][2] = 'foo' + + def test_setting_with_copy_bug(self): + + # operating on a copy + df = pd.DataFrame({'a': list(range(4)), + 'b': list('ab..'), + 'c': ['a', 'b', np.nan, 'd']}) + mask = pd.isnull(df.c) + + def f(): + df[['c']][mask] = df[['b']][mask] + + pytest.raises(com.SettingWithCopyError, f) + + # invalid warning as we are returning a new object + # GH 8730 + df1 = DataFrame({'x': Series(['a', 'b', 'c']), + 'y': Series(['d', 'e', 'f'])}) + df2 = df1[['x']] + + # this should not raise + df2['y'] = ['g', 'h', 'i'] + + def test_detect_chained_assignment_warnings(self): + + # warnings + with option_context('chained_assignment', 'warn'): + df = DataFrame({'A': ['aaa', 'bbb', 'ccc'], 'B': [1, 2, 3]}) + with tm.assert_produces_warning( + expected_warning=com.SettingWithCopyWarning): + df.loc[0]['A'] = 111 + + def test_chained_getitem_with_lists(self): + + # GH6394 + # Regression in chained getitem indexing with embedded list-like from + # 0.12 + def check(result, expected): + tm.assert_numpy_array_equal(result, expected) + assert isinstance(result, np.ndarray) + + df = DataFrame({'A': 5 * [np.zeros(3)], 'B': 5 * [np.ones(3)]}) + expected = df['A'].iloc[2] + result = df.loc[2, 'A'] + check(result, expected) + result2 = df.iloc[2]['A'] + check(result2, expected) + result3 = df['A'].loc[2] + check(result3, expected) + result4 = df['A'].iloc[2] + check(result4, expected) + + def 
test_cache_updating(self): + # GH 4939, make sure to update the cache on setitem + + df = tm.makeDataFrame() + df['A'] # cache series + with catch_warnings(record=True): + df.ix["Hello Friend"] = df.ix[0] + assert "Hello Friend" in df['A'].index + assert "Hello Friend" in df['B'].index + + with catch_warnings(record=True): + panel = tm.makePanel() + panel.ix[0] # get first item into cache + panel.ix[:, :, 'A+1'] = panel.ix[:, :, 'A'] + 1 + assert "A+1" in panel.ix[0].columns + assert "A+1" in panel.ix[1].columns + + # 5216 + # make sure that we don't try to set a dead cache + a = np.random.rand(10, 3) + df = DataFrame(a, columns=['x', 'y', 'z']) + tuples = [(i, j) for i in range(5) for j in range(2)] + index = MultiIndex.from_tuples(tuples) + df.index = index + + # setting via chained assignment + # but actually works, since everything is a view + df.loc[0]['z'].iloc[0] = 1. + result = df.loc[(0, 0), 'z'] + assert result == 1 + + # correct setting + df.loc[(0, 0), 'z'] = 2 + result = df.loc[(0, 0), 'z'] + assert result == 2 + + # 10264 + df = DataFrame(np.zeros((5, 5), dtype='int64'), columns=[ + 'a', 'b', 'c', 'd', 'e'], index=range(5)) + df['f'] = 0 + df.f.values[3] = 1 + + # TODO(wesm): unused? + # y = df.iloc[np.arange(2, len(df))] + + df.f.values[3] = 2 + expected = DataFrame(np.zeros((5, 6), dtype='int64'), columns=[ + 'a', 'b', 'c', 'd', 'e', 'f'], index=range(5)) + expected.at[3, 'f'] = 2 + tm.assert_frame_equal(df, expected) + expected = Series([0, 0, 0, 2, 0], name='f') + tm.assert_series_equal(df.f, expected) diff --git a/pandas/tests/indexing/test_coercion.py b/pandas/tests/indexing/test_coercion.py index 0cfa7258461f1..25cc810299678 100644 --- a/pandas/tests/indexing/test_coercion.py +++ b/pandas/tests/indexing/test_coercion.py @@ -1,6 +1,6 @@ # -*- coding: utf-8 -*- -import nose +import pytest import numpy as np import pandas as pd @@ -15,8 +15,6 @@ class CoercionBase(object): - _multiprocess_can_split_ = True - klasses = ['index', 'series'] dtypes = ['object', 'int64', 'float64', 'complex128', 'bool', 'datetime64', 'datetime64tz', 'timedelta64', 'period'] @@ -33,8 +31,8 @@ def _assert(self, left, right, dtype): tm.assert_index_equal(left, right) else: raise NotImplementedError - self.assertEqual(left.dtype, dtype) - self.assertEqual(right.dtype, dtype) + assert left.dtype == dtype + assert right.dtype == dtype def test_has_comprehensive_tests(self): for klass in self.klasses: @@ -46,7 +44,7 @@ def test_has_comprehensive_tests(self): raise AssertionError(msg.format(type(self), method_name)) -class TestSetitemCoercion(CoercionBase, tm.TestCase): +class TestSetitemCoercion(CoercionBase): method = 'setitem' @@ -57,7 +55,7 @@ def _assert_setitem_series_conversion(self, original_series, loc_value, temp[1] = loc_value tm.assert_series_equal(temp, expected_series) # check dtype explicitly for sure - self.assertEqual(temp.dtype, expected_dtype) + assert temp.dtype == expected_dtype # .loc works different rule, temporary disable # temp = original_series.copy() @@ -66,7 +64,7 @@ def _assert_setitem_series_conversion(self, original_series, loc_value, def test_setitem_series_object(self): obj = pd.Series(list('abcd')) - self.assertEqual(obj.dtype, np.object) + assert obj.dtype == np.object # object + int -> object exp = pd.Series(['a', 1, 'c', 'd']) @@ -86,7 +84,7 @@ def test_setitem_series_object(self): def test_setitem_series_int64(self): obj = pd.Series([1, 2, 3, 4]) - self.assertEqual(obj.dtype, np.int64) + assert obj.dtype == np.int64 # int + int -> int exp = pd.Series([1, 1, 3, 4]) 
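Stripped of the harness, the dtype rule that `_assert_setitem_series_conversion` pins down reduces, for the uncontroversial cases, to the following (values are arbitrary):

```python
import numpy as np
import pandas as pd

# object dtype absorbs any value without changing dtype
obj = pd.Series(list("abcd"))
obj[1] = 1
assert obj.dtype == object
assert obj[1] == 1

# an int set into a float64 Series coerces the value, not the dtype
flt = pd.Series([1.1, 2.2, 3.3, 4.4])
flt[1] = 1
assert flt.dtype == np.float64
assert flt[1] == 1.0
```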
@@ -95,7 +93,7 @@ def test_setitem_series_int64(self): # int + float -> float # TODO_GH12747 The result must be float # tm.assert_series_equal(temp, pd.Series([1, 1.1, 3, 4])) - # self.assertEqual(temp.dtype, np.float64) + # assert temp.dtype == np.float64 exp = pd.Series([1, 1, 3, 4]) self._assert_setitem_series_conversion(obj, 1.1, exp, np.int64) @@ -109,7 +107,7 @@ def test_setitem_series_int64(self): def test_setitem_series_float64(self): obj = pd.Series([1.1, 2.2, 3.3, 4.4]) - self.assertEqual(obj.dtype, np.float64) + assert obj.dtype == np.float64 # float + int -> float exp = pd.Series([1.1, 1.0, 3.3, 4.4]) @@ -130,7 +128,7 @@ def test_setitem_series_float64(self): def test_setitem_series_complex128(self): obj = pd.Series([1 + 1j, 2 + 2j, 3 + 3j, 4 + 4j]) - self.assertEqual(obj.dtype, np.complex128) + assert obj.dtype == np.complex128 # complex + int -> complex exp = pd.Series([1 + 1j, 1, 3 + 3j, 4 + 4j]) @@ -150,33 +148,33 @@ def test_setitem_series_complex128(self): def test_setitem_series_bool(self): obj = pd.Series([True, False, True, False]) - self.assertEqual(obj.dtype, np.bool) + assert obj.dtype == np.bool # bool + int -> int # TODO_GH12747 The result must be int # tm.assert_series_equal(temp, pd.Series([1, 1, 1, 0])) - # self.assertEqual(temp.dtype, np.int64) + # assert temp.dtype == np.int64 exp = pd.Series([True, True, True, False]) self._assert_setitem_series_conversion(obj, 1, exp, np.bool) # TODO_GH12747 The result must be int # assigning int greater than bool # tm.assert_series_equal(temp, pd.Series([1, 3, 1, 0])) - # self.assertEqual(temp.dtype, np.int64) + # assert temp.dtype == np.int64 exp = pd.Series([True, True, True, False]) self._assert_setitem_series_conversion(obj, 3, exp, np.bool) # bool + float -> float # TODO_GH12747 The result must be float # tm.assert_series_equal(temp, pd.Series([1., 1.1, 1., 0.])) - # self.assertEqual(temp.dtype, np.float64) + # assert temp.dtype == np.float64 exp = pd.Series([True, True, True, False]) self._assert_setitem_series_conversion(obj, 1.1, exp, np.bool) # bool + complex -> complex (buggy, results in bool) # TODO_GH12747 The result must be complex # tm.assert_series_equal(temp, pd.Series([1, 1 + 1j, 1, 0])) - # self.assertEqual(temp.dtype, np.complex128) + # assert temp.dtype == np.complex128 exp = pd.Series([True, True, True, False]) self._assert_setitem_series_conversion(obj, 1 + 1j, exp, np.bool) @@ -189,7 +187,7 @@ def test_setitem_series_datetime64(self): pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-03'), pd.Timestamp('2011-01-04')]) - self.assertEqual(obj.dtype, 'datetime64[ns]') + assert obj.dtype == 'datetime64[ns]' # datetime64 + datetime64 -> datetime64 exp = pd.Series([pd.Timestamp('2011-01-01'), @@ -215,7 +213,7 @@ def test_setitem_series_datetime64tz(self): pd.Timestamp('2011-01-02', tz=tz), pd.Timestamp('2011-01-03', tz=tz), pd.Timestamp('2011-01-04', tz=tz)]) - self.assertEqual(obj.dtype, 'datetime64[ns, US/Eastern]') + assert obj.dtype == 'datetime64[ns, US/Eastern]' # datetime64tz + datetime64tz -> datetime64tz exp = pd.Series([pd.Timestamp('2011-01-01', tz=tz), @@ -251,18 +249,18 @@ def _assert_setitem_index_conversion(self, original_series, loc_key, exp = pd.Series([1, 2, 3, 4, 5], index=expected_index) tm.assert_series_equal(temp, exp) # check dtype explicitly for sure - self.assertEqual(temp.index.dtype, expected_dtype) + assert temp.index.dtype == expected_dtype temp = original_series.copy() temp.loc[loc_key] = 5 exp = pd.Series([1, 2, 3, 4, 5], index=expected_index) tm.assert_series_equal(temp, 
exp) # check dtype explicitly for sure - self.assertEqual(temp.index.dtype, expected_dtype) + assert temp.index.dtype == expected_dtype def test_setitem_index_object(self): obj = pd.Series([1, 2, 3, 4], index=list('abcd')) - self.assertEqual(obj.index.dtype, np.object) + assert obj.index.dtype == np.object # object + object -> object exp_index = pd.Index(list('abcdx')) @@ -270,7 +268,7 @@ def test_setitem_index_object(self): # object + int -> IndexError, regarded as location temp = obj.copy() - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): temp[5] = 5 # object + float -> object @@ -280,7 +278,7 @@ def test_setitem_index_object(self): def test_setitem_index_int64(self): # tests setitem with non-existing numeric key obj = pd.Series([1, 2, 3, 4]) - self.assertEqual(obj.index.dtype, np.int64) + assert obj.index.dtype == np.int64 # int + int -> int exp_index = pd.Index([0, 1, 2, 3, 5]) @@ -297,12 +295,12 @@ def test_setitem_index_int64(self): def test_setitem_index_float64(self): # tests setitem with non-existing numeric key obj = pd.Series([1, 2, 3, 4], index=[1.1, 2.1, 3.1, 4.1]) - self.assertEqual(obj.index.dtype, np.float64) + assert obj.index.dtype == np.float64 # float + int -> int temp = obj.copy() # TODO_GH12747 The result must be float - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): temp[5] = 5 # float + float -> float @@ -332,7 +330,7 @@ def test_setitem_index_period(self): pass -class TestInsertIndexCoercion(CoercionBase, tm.TestCase): +class TestInsertIndexCoercion(CoercionBase): klasses = ['index'] method = 'insert' @@ -343,11 +341,11 @@ def _assert_insert_conversion(self, original, value, target = original.copy() res = target.insert(1, value) tm.assert_index_equal(res, expected) - self.assertEqual(res.dtype, expected_dtype) + assert res.dtype == expected_dtype def test_insert_index_object(self): obj = pd.Index(list('abcd')) - self.assertEqual(obj.dtype, np.object) + assert obj.dtype == np.object # object + int -> object exp = pd.Index(['a', 1, 'b', 'c', 'd']) @@ -360,7 +358,7 @@ def test_insert_index_object(self): # object + bool -> object res = obj.insert(1, False) tm.assert_index_equal(res, pd.Index(['a', False, 'b', 'c', 'd'])) - self.assertEqual(res.dtype, np.object) + assert res.dtype == np.object # object + object -> object exp = pd.Index(['a', 'x', 'b', 'c', 'd']) @@ -368,7 +366,7 @@ def test_insert_index_object(self): def test_insert_index_int64(self): obj = pd.Int64Index([1, 2, 3, 4]) - self.assertEqual(obj.dtype, np.int64) + assert obj.dtype == np.int64 # int + int -> int exp = pd.Index([1, 1, 2, 3, 4]) @@ -388,7 +386,7 @@ def test_insert_index_int64(self): def test_insert_index_float64(self): obj = pd.Float64Index([1., 2., 3., 4.]) - self.assertEqual(obj.dtype, np.float64) + assert obj.dtype == np.float64 # float + int -> int exp = pd.Index([1., 1., 2., 3., 4.]) @@ -415,7 +413,7 @@ def test_insert_index_bool(self): def test_insert_index_datetime64(self): obj = pd.DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04']) - self.assertEqual(obj.dtype, 'datetime64[ns]') + assert obj.dtype == 'datetime64[ns]' # datetime64 + datetime64 => datetime64 exp = pd.DatetimeIndex(['2011-01-01', '2012-01-01', '2011-01-02', @@ -425,18 +423,18 @@ def test_insert_index_datetime64(self): # ToDo: must coerce to object msg = "Passed item and index have different timezone" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): obj.insert(1, pd.Timestamp('2012-01-01', tz='US/Eastern')) # ToDo: 
must coerce to object msg = "cannot insert DatetimeIndex with incompatible label" - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): obj.insert(1, 1) def test_insert_index_datetime64tz(self): obj = pd.DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04'], tz='US/Eastern') - self.assertEqual(obj.dtype, 'datetime64[ns, US/Eastern]') + assert obj.dtype == 'datetime64[ns, US/Eastern]' # datetime64tz + datetime64tz => datetime64 exp = pd.DatetimeIndex(['2011-01-01', '2012-01-01', '2011-01-02', @@ -447,22 +445,22 @@ def test_insert_index_datetime64tz(self): # ToDo: must coerce to object msg = "Passed item and index have different timezone" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): obj.insert(1, pd.Timestamp('2012-01-01')) # ToDo: must coerce to object msg = "Passed item and index have different timezone" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): obj.insert(1, pd.Timestamp('2012-01-01', tz='Asia/Tokyo')) # ToDo: must coerce to object msg = "cannot insert DatetimeIndex with incompatible label" - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): obj.insert(1, 1) def test_insert_index_timedelta64(self): obj = pd.TimedeltaIndex(['1 day', '2 day', '3 day', '4 day']) - self.assertEqual(obj.dtype, 'timedelta64[ns]') + assert obj.dtype == 'timedelta64[ns]' # timedelta64 + timedelta64 => timedelta64 exp = pd.TimedeltaIndex(['1 day', '10 day', '2 day', '3 day', '4 day']) @@ -471,18 +469,18 @@ def test_insert_index_timedelta64(self): # ToDo: must coerce to object msg = "cannot insert TimedeltaIndex with incompatible label" - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): obj.insert(1, pd.Timestamp('2012-01-01')) # ToDo: must coerce to object msg = "cannot insert TimedeltaIndex with incompatible label" - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): obj.insert(1, 1) def test_insert_index_period(self): obj = pd.PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04'], freq='M') - self.assertEqual(obj.dtype, 'period[M]') + assert obj.dtype == 'period[M]' # period + period => period exp = pd.PeriodIndex(['2011-01', '2012-01', '2011-02', @@ -516,7 +514,7 @@ def test_insert_index_period(self): self._assert_insert_conversion(obj, 'x', exp, np.object) -class TestWhereCoercion(CoercionBase, tm.TestCase): +class TestWhereCoercion(CoercionBase): method = 'where' @@ -529,7 +527,7 @@ def _assert_where_conversion(self, original, cond, values, def _where_object_common(self, klass): obj = klass(list('abcd')) - self.assertEqual(obj.dtype, np.object) + assert obj.dtype == np.object cond = klass([True, False, True, False]) # object + int -> object @@ -582,7 +580,7 @@ def test_where_index_object(self): def _where_int64_common(self, klass): obj = klass([1, 2, 3, 4]) - self.assertEqual(obj.dtype, np.int64) + assert obj.dtype == np.int64 cond = klass([True, False, True, False]) # int + int -> int @@ -628,7 +626,7 @@ def test_where_index_int64(self): def _where_float64_common(self, klass): obj = klass([1.1, 2.2, 3.3, 4.4]) - self.assertEqual(obj.dtype, np.float64) + assert obj.dtype == np.float64 cond = klass([True, False, True, False]) # float + int -> float @@ -674,7 +672,7 @@ def test_where_index_float64(self): def test_where_series_complex128(self): obj = pd.Series([1 + 1j, 2 + 2j, 3 + 3j, 4 + 4j]) - self.assertEqual(obj.dtype, 
np.complex128) + assert obj.dtype == np.complex128 cond = pd.Series([True, False, True, False]) # complex + int -> complex @@ -714,7 +712,7 @@ def test_where_index_complex128(self): def test_where_series_bool(self): obj = pd.Series([True, False, True, False]) - self.assertEqual(obj.dtype, np.bool) + assert obj.dtype == np.bool cond = pd.Series([True, False, True, False]) # bool + int -> int @@ -757,7 +755,7 @@ def test_where_series_datetime64(self): pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-03'), pd.Timestamp('2011-01-04')]) - self.assertEqual(obj.dtype, 'datetime64[ns]') + assert obj.dtype == 'datetime64[ns]' cond = pd.Series([True, False, True, False]) # datetime64 + datetime64 -> datetime64 @@ -780,7 +778,7 @@ def test_where_series_datetime64(self): # ToDo: coerce to object msg = "cannot coerce a Timestamp with a tz on a naive Block" - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): obj.where(cond, pd.Timestamp('2012-01-01', tz='US/Eastern')) # ToDo: do not coerce to UTC, must be object @@ -799,13 +797,13 @@ def test_where_index_datetime64(self): pd.Timestamp('2011-01-02'), pd.Timestamp('2011-01-03'), pd.Timestamp('2011-01-04')]) - self.assertEqual(obj.dtype, 'datetime64[ns]') + assert obj.dtype == 'datetime64[ns]' cond = pd.Index([True, False, True, False]) # datetime64 + datetime64 -> datetime64 # must support scalar msg = "cannot coerce a Timestamp with a tz on a naive Block" - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): obj.where(cond, pd.Timestamp('2012-01-01')) values = pd.Index([pd.Timestamp('2012-01-01'), @@ -821,7 +819,7 @@ def test_where_index_datetime64(self): # ToDo: coerce to object msg = ("Index\\(\\.\\.\\.\\) must be called with a collection " "of some kind") - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): obj.where(cond, pd.Timestamp('2012-01-01', tz='US/Eastern')) # ToDo: do not ignore timezone, must be object @@ -854,7 +852,7 @@ def test_where_index_period(self): pass -class TestFillnaSeriesCoercion(CoercionBase, tm.TestCase): +class TestFillnaSeriesCoercion(CoercionBase): # not indexing, but place here for consisntency @@ -869,7 +867,7 @@ def _assert_fillna_conversion(self, original, value, def _fillna_object_common(self, klass): obj = klass(['a', np.nan, 'c', 'd']) - self.assertEqual(obj.dtype, np.object) + assert obj.dtype == np.object # object + int -> object exp = klass(['a', 1, 'c', 'd']) @@ -902,7 +900,7 @@ def test_fillna_index_int64(self): def _fillna_float64_common(self, klass): obj = klass([1.1, np.nan, 3.3, 4.4]) - self.assertEqual(obj.dtype, np.float64) + assert obj.dtype == np.float64 # float + int -> float exp = klass([1.1, 1.0, 3.3, 4.4]) @@ -935,7 +933,7 @@ def test_fillna_index_float64(self): def test_fillna_series_complex128(self): obj = pd.Series([1 + 1j, np.nan, 3 + 3j, 4 + 4j]) - self.assertEqual(obj.dtype, np.complex128) + assert obj.dtype == np.complex128 # complex + int -> complex exp = pd.Series([1 + 1j, 1, 3 + 3j, 4 + 4j]) @@ -968,7 +966,7 @@ def test_fillna_series_datetime64(self): pd.NaT, pd.Timestamp('2011-01-03'), pd.Timestamp('2011-01-04')]) - self.assertEqual(obj.dtype, 'datetime64[ns]') + assert obj.dtype == 'datetime64[ns]' # datetime64 + datetime64 => datetime64 exp = pd.Series([pd.Timestamp('2011-01-01'), @@ -1008,7 +1006,7 @@ def test_fillna_series_datetime64tz(self): pd.NaT, pd.Timestamp('2011-01-03', tz=tz), pd.Timestamp('2011-01-04', tz=tz)]) - self.assertEqual(obj.dtype, 'datetime64[ns, US/Eastern]') + assert 
obj.dtype == 'datetime64[ns, US/Eastern]' # datetime64tz + datetime64tz => datetime64tz exp = pd.Series([pd.Timestamp('2011-01-01', tz=tz), @@ -1060,7 +1058,7 @@ def test_fillna_series_period(self): def test_fillna_index_datetime64(self): obj = pd.DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03', '2011-01-04']) - self.assertEqual(obj.dtype, 'datetime64[ns]') + assert obj.dtype == 'datetime64[ns]' # datetime64 + datetime64 => datetime64 exp = pd.DatetimeIndex(['2011-01-01', '2012-01-01', @@ -1095,7 +1093,7 @@ def test_fillna_index_datetime64tz(self): obj = pd.DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03', '2011-01-04'], tz=tz) - self.assertEqual(obj.dtype, 'datetime64[ns, US/Eastern]') + assert obj.dtype == 'datetime64[ns, US/Eastern]' # datetime64tz + datetime64tz => datetime64tz exp = pd.DatetimeIndex(['2011-01-01', '2012-01-01', @@ -1141,25 +1139,40 @@ def test_fillna_index_period(self): pass -class TestReplaceSeriesCoercion(CoercionBase, tm.TestCase): +class TestReplaceSeriesCoercion(CoercionBase): # not indexing, but place here for consistency klasses = ['series'] method = 'replace' - def setUp(self): + def setup_method(self, method): self.rep = {} self.rep['object'] = ['a', 'b'] self.rep['int64'] = [4, 5] self.rep['float64'] = [1.1, 2.2] self.rep['complex128'] = [1 + 1j, 2 + 2j] self.rep['bool'] = [True, False] + self.rep['datetime64[ns]'] = [pd.Timestamp('2011-01-01'), + pd.Timestamp('2011-01-03')] + + for tz in ['UTC', 'US/Eastern']: + # to test tz => different tz replacement + key = 'datetime64[ns, {0}]'.format(tz) + self.rep[key] = [pd.Timestamp('2011-01-01', tz=tz), + pd.Timestamp('2011-01-03', tz=tz)] + + self.rep['timedelta64[ns]'] = [pd.Timedelta('1 day'), + pd.Timedelta('2 day')] def _assert_replace_conversion(self, from_key, to_key, how): index = pd.Index([3, 4], name='xxx') obj = pd.Series(self.rep[from_key], index=index, name='yyy') - self.assertEqual(obj.dtype, from_key) + assert obj.dtype == from_key + + if (from_key.startswith('datetime') and to_key.startswith('datetime')): + # different tz, currently mask_missing raises SystemError + return if how == 'dict': replacer = dict(zip(self.rep[from_key], self.rep[to_key])) @@ -1170,29 +1183,14 @@ def _assert_replace_conversion(self, from_key, to_key, how): result = obj.replace(replacer) - # buggy on windows for bool/int64 - if (from_key == 'bool' and - to_key == 'int64' and - tm.is_platform_windows()): - raise nose.SkipTest("windows platform buggy: {0} -> {1}".format - (from_key, to_key)) - - if ((from_key == 'float64' and - to_key in ('bool', 'int64')) or - + if ((from_key == 'float64' and to_key == 'int64') or (from_key == 'complex128' and - to_key in ('bool', 'int64', 'float64')) or - - (from_key == 'int64' and - to_key in ('bool')) or - - # TODO_GH12747 The result must be int?
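The `how='dict'` and `how='series'` replacers that `_assert_replace_conversion` feeds to `obj.replace` correspond to these plain calls. A sketch with arbitrary float-to-string pairs (strings force the uncontroversial object-dtype outcome):

```python
import pandas as pd

s = pd.Series([1.1, 2.2], index=pd.Index([3, 4], name="xxx"), name="yyy")

# how == 'dict': a mapping of old -> new values
res_dict = s.replace({1.1: "a", 2.2: "b"})

# how == 'series': the same pairs as a Series (index = old, values = new)
res_series = s.replace(pd.Series(["a", "b"], index=[1.1, 2.2]))

assert res_dict.equals(res_series)
assert res_dict.dtype == object
```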
- (from_key == 'bool' and to_key == 'int64')): + to_key in ('int64', 'float64'))): # buggy on 32-bit if tm.is_platform_32bit(): - raise nose.SkipTest("32-bit platform buggy: {0} -> {1}".format - (from_key, to_key)) + pytest.skip("32-bit platform buggy: {0} -> {1}".format + (from_key, to_key)) # Expected: do not downcast by replacement exp = pd.Series(self.rep[to_key], index=index, @@ -1200,7 +1198,7 @@ def _assert_replace_conversion(self, from_key, to_key, how): else: exp = pd.Series(self.rep[to_key], index=index, name='yyy') - self.assertEqual(exp.dtype, to_key) + assert exp.dtype == to_key tm.assert_series_equal(result, exp) @@ -1245,18 +1243,36 @@ def test_replace_series_bool(self): if compat.PY3: # doesn't work in PY3, though ...dict_from_bool works fine - raise nose.SkipTest("doesn't work as in PY3") + pytest.skip("doesn't work as in PY3") self._assert_replace_conversion(from_key, to_key, how='series') def test_replace_series_datetime64(self): - pass + from_key = 'datetime64[ns]' + for to_key in self.rep: + self._assert_replace_conversion(from_key, to_key, how='dict') + + from_key = 'datetime64[ns]' + for to_key in self.rep: + self._assert_replace_conversion(from_key, to_key, how='series') def test_replace_series_datetime64tz(self): - pass + from_key = 'datetime64[ns, US/Eastern]' + for to_key in self.rep: + self._assert_replace_conversion(from_key, to_key, how='dict') + + from_key = 'datetime64[ns, US/Eastern]' + for to_key in self.rep: + self._assert_replace_conversion(from_key, to_key, how='series') def test_replace_series_timedelta64(self): - pass + from_key = 'timedelta64[ns]' + for to_key in self.rep: + self._assert_replace_conversion(from_key, to_key, how='dict') + + from_key = 'timedelta64[ns]' + for to_key in self.rep: + self._assert_replace_conversion(from_key, to_key, how='series') def test_replace_series_period(self): pass diff --git a/pandas/tests/indexing/test_datetime.py b/pandas/tests/indexing/test_datetime.py new file mode 100644 index 0000000000000..da8a896cb6f4a --- /dev/null +++ b/pandas/tests/indexing/test_datetime.py @@ -0,0 +1,225 @@ +import pytest + +import numpy as np +import pandas as pd +from pandas import date_range, Index, DataFrame, Series, Timestamp +from pandas.util import testing as tm + + +class TestDatetimeIndex(object): + + def test_indexing_with_datetime_tz(self): + + # 8260 + # support datetime64 with tz + + idx = Index(date_range('20130101', periods=3, tz='US/Eastern'), + name='foo') + dr = date_range('20130110', periods=3) + df = DataFrame({'A': idx, 'B': dr}) + df['C'] = idx + df.iloc[1, 1] = pd.NaT + df.iloc[1, 2] = pd.NaT + + # indexing + result = df.iloc[1] + expected = Series([Timestamp('2013-01-02 00:00:00-0500', + tz='US/Eastern'), np.nan, np.nan], + index=list('ABC'), dtype='object', name=1) + tm.assert_series_equal(result, expected) + result = df.loc[1] + expected = Series([Timestamp('2013-01-02 00:00:00-0500', + tz='US/Eastern'), np.nan, np.nan], + index=list('ABC'), dtype='object', name=1) + tm.assert_series_equal(result, expected) + + # indexing - fast_xs + df = DataFrame({'a': date_range('2014-01-01', periods=10, tz='UTC')}) + result = df.iloc[5] + expected = Timestamp('2014-01-06 00:00:00+0000', tz='UTC', freq='D') + assert result == expected + + result = df.loc[5] + assert result == expected + + # indexing - boolean + result = df[df.a > df.a[3]] + expected = df.iloc[4:] + tm.assert_frame_equal(result, expected) + + # indexing - setting an element + df = DataFrame(data=pd.to_datetime( + ['2015-03-30 20:12:32', '2015-03-12 
00:11:11']), columns=['time']) + df['new_col'] = ['new', 'old'] + df.time = df.set_index('time').index.tz_localize('UTC') + v = df[df.new_col == 'new'].set_index('time').index.tz_convert( + 'US/Pacific') + + # trying to set a single element on a part of a different timezone + def f(): + df.loc[df.new_col == 'new', 'time'] = v + + pytest.raises(ValueError, f) + + v = df.loc[df.new_col == 'new', 'time'] + pd.Timedelta('1s') + df.loc[df.new_col == 'new', 'time'] = v + tm.assert_series_equal(df.loc[df.new_col == 'new', 'time'], v) + + def test_consistency_with_tz_aware_scalar(self): + # xref gh-12938 + # various ways of indexing the same tz-aware scalar + df = Series([Timestamp('2016-03-30 14:35:25', + tz='Europe/Brussels')]).to_frame() + + df = pd.concat([df, df]).reset_index(drop=True) + expected = Timestamp('2016-03-30 14:35:25+0200', + tz='Europe/Brussels') + + result = df[0][0] + assert result == expected + + result = df.iloc[0, 0] + assert result == expected + + result = df.loc[0, 0] + assert result == expected + + result = df.iat[0, 0] + assert result == expected + + result = df.at[0, 0] + assert result == expected + + result = df[0].loc[0] + assert result == expected + + result = df[0].at[0] + assert result == expected + + def test_indexing_with_datetimeindex_tz(self): + + # GH 12050 + # indexing on a series with a datetimeindex with tz + index = pd.date_range('2015-01-01', periods=2, tz='utc') + + ser = pd.Series(range(2), index=index, + dtype='int64') + + # list-like indexing + + for sel in (index, list(index)): + # getitem + tm.assert_series_equal(ser[sel], ser) + + # setitem + result = ser.copy() + result[sel] = 1 + expected = pd.Series(1, index=index) + tm.assert_series_equal(result, expected) + + # .loc getitem + tm.assert_series_equal(ser.loc[sel], ser) + + # .loc setitem + result = ser.copy() + result.loc[sel] = 1 + expected = pd.Series(1, index=index) + tm.assert_series_equal(result, expected) + + # single element indexing + + # getitem + assert ser[index[1]] == 1 + + # setitem + result = ser.copy() + result[index[1]] = 5 + expected = pd.Series([0, 5], index=index) + tm.assert_series_equal(result, expected) + + # .loc getitem + assert ser.loc[index[1]] == 1 + + # .loc setitem + result = ser.copy() + result.loc[index[1]] = 5 + expected = pd.Series([0, 5], index=index) + tm.assert_series_equal(result, expected) + + def test_partial_setting_with_datetimelike_dtype(self): + + # GH9478 + # a datetimeindex alignment issue with partial setting + df = pd.DataFrame(np.arange(6.).reshape(3, 2), columns=list('AB'), + index=pd.date_range('1/1/2000', periods=3, + freq='1H')) + expected = df.copy() + expected['C'] = [expected.index[0]] + [pd.NaT, pd.NaT] + + mask = df.A < 1 + df.loc[mask, 'C'] = df.loc[mask].index + tm.assert_frame_equal(df, expected) + + def test_loc_setitem_datetime(self): + + # GH 9516 + dt1 = Timestamp('20130101 09:00:00') + dt2 = Timestamp('20130101 10:00:00') + + for conv in [lambda x: x, lambda x: x.to_datetime64(), + lambda x: x.to_pydatetime(), lambda x: np.datetime64(x)]: + + df = pd.DataFrame() + df.loc[conv(dt1), 'one'] = 100 + df.loc[conv(dt2), 'one'] = 200 + + expected = DataFrame({'one': [100.0, 200.0]}, index=[dt1, dt2]) + tm.assert_frame_equal(df, expected) + + def test_series_partial_set_datetime(self): + # GH 11497 + + idx = date_range('2011-01-01', '2011-01-02', freq='D', name='idx') + ser = Series([0.1, 0.2], index=idx, name='s') + + result = ser.loc[[Timestamp('2011-01-01'), Timestamp('2011-01-02')]] + exp = Series([0.1, 0.2], index=idx, name='s')
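The accessor equivalences spelled out one call at a time in `test_consistency_with_tz_aware_scalar` collapse to a single rule: every scalar accessor returns the same tz-aware `Timestamp`. A minimal sketch:

```python
import pandas as pd

ts = pd.Timestamp("2016-03-30 14:35:25", tz="Europe/Brussels")
df = pd.DataFrame({0: [ts, ts]})

# positional, label-based and fast scalar access all agree
assert df.iloc[0, 0] == df.loc[0, 0] == df.iat[0, 0] == df.at[0, 0] == ts
```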
tm.assert_series_equal(result, exp, check_index_type=True) + + keys = [Timestamp('2011-01-02'), Timestamp('2011-01-02'), + Timestamp('2011-01-01')] + exp = Series([0.2, 0.2, 0.1], index=pd.DatetimeIndex(keys, name='idx'), + name='s') + tm.assert_series_equal(ser.loc[keys], exp, check_index_type=True) + + keys = [Timestamp('2011-01-03'), Timestamp('2011-01-02'), + Timestamp('2011-01-03')] + exp = Series([np.nan, 0.2, np.nan], + index=pd.DatetimeIndex(keys, name='idx'), name='s') + tm.assert_series_equal(ser.loc[keys], exp, check_index_type=True) + + def test_series_partial_set_period(self): + # GH 11497 + + idx = pd.period_range('2011-01-01', '2011-01-02', freq='D', name='idx') + ser = Series([0.1, 0.2], index=idx, name='s') + + result = ser.loc[[pd.Period('2011-01-01', freq='D'), + pd.Period('2011-01-02', freq='D')]] + exp = Series([0.1, 0.2], index=idx, name='s') + tm.assert_series_equal(result, exp, check_index_type=True) + + keys = [pd.Period('2011-01-02', freq='D'), + pd.Period('2011-01-02', freq='D'), + pd.Period('2011-01-01', freq='D')] + exp = Series([0.2, 0.2, 0.1], index=pd.PeriodIndex(keys, name='idx'), + name='s') + tm.assert_series_equal(ser.loc[keys], exp, check_index_type=True) + + keys = [pd.Period('2011-01-03', freq='D'), + pd.Period('2011-01-02', freq='D'), + pd.Period('2011-01-03', freq='D')] + exp = Series([np.nan, 0.2, np.nan], + index=pd.PeriodIndex(keys, name='idx'), name='s') + result = ser.loc[keys] + tm.assert_series_equal(result, exp) diff --git a/pandas/tests/indexing/test_floats.py b/pandas/tests/indexing/test_floats.py index 920aefa24b576..00a2b8166ceed 100644 --- a/pandas/tests/indexing/test_floats.py +++ b/pandas/tests/indexing/test_floats.py @@ -1,12 +1,15 @@ # -*- coding: utf-8 -*- +import pytest + +from warnings import catch_warnings import numpy as np from pandas import Series, DataFrame, Index, Float64Index from pandas.util.testing import assert_series_equal, assert_almost_equal import pandas.util.testing as tm -class TestFloatIndexers(tm.TestCase): +class TestFloatIndexers(object): def check(self, result, original, indexer, getitem): """ @@ -45,13 +48,13 @@ def test_scalar_error(self): def f(): s.iloc[3.0] - self.assertRaisesRegexp(TypeError, - 'cannot do positional indexing', - f) + tm.assert_raises_regex(TypeError, + 'cannot do positional indexing', + f) def f(): s.iloc[3.0] = 0 - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) def test_scalar_non_numeric(self): @@ -77,7 +80,8 @@ def test_scalar_non_numeric(self): (lambda x: x, True)]: def f(): - idxr(s)[3.0] + with catch_warnings(record=True): + idxr(s)[3.0] # gettitem on a DataFrame is a KeyError as it is indexing # via labels on the columns @@ -85,7 +89,7 @@ def f(): error = KeyError else: error = TypeError - self.assertRaises(error, f) + pytest.raises(error, f) # label based can be a TypeError or KeyError def f(): @@ -95,15 +99,15 @@ def f(): error = KeyError else: error = TypeError - self.assertRaises(error, f) + pytest.raises(error, f) # contains - self.assertFalse(3.0 in s) + assert 3.0 not in s # setting with a float fails with iloc def f(): s.iloc[3.0] = 0 - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) # setting with an indexer if s.index.inferred_type in ['categorical']: @@ -119,25 +123,26 @@ def f(): # s2 = s.copy() # def f(): # idxr(s2)[3.0] = 0 - # self.assertRaises(TypeError, f) + # pytest.raises(TypeError, f) pass else: s2 = s.copy() s2.loc[3.0] = 10 - self.assertTrue(s2.index.is_object()) + assert s2.index.is_object() for idxr in [lambda x: x.ix, lambda 
x: x]: s2 = s.copy() - idxr(s2)[3.0] = 0 - self.assertTrue(s2.index.is_object()) + with catch_warnings(record=True): + idxr(s2)[3.0] = 0 + assert s2.index.is_object() # fallsback to position selection, series only s = Series(np.arange(len(i)), index=i) s[3] - self.assertRaises(TypeError, lambda: s[3.0]) + pytest.raises(TypeError, lambda: s[3.0]) def test_scalar_with_mixed(self): @@ -151,15 +156,16 @@ def test_scalar_with_mixed(self): lambda x: x.iloc]: def f(): - idxr(s2)[1.0] + with catch_warnings(record=True): + idxr(s2)[1.0] - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) - self.assertRaises(KeyError, lambda: s2.loc[1.0]) + pytest.raises(KeyError, lambda: s2.loc[1.0]) result = s2.loc['b'] expected = 2 - self.assertEqual(result, expected) + assert result == expected # mixed index so we have label # indexing @@ -167,20 +173,21 @@ def f(): lambda x: x]: def f(): - idxr(s3)[1.0] + with catch_warnings(record=True): + idxr(s3)[1.0] - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) result = idxr(s3)[1] expected = 2 - self.assertEqual(result, expected) + assert result == expected - self.assertRaises(TypeError, lambda: s3.iloc[1.0]) - self.assertRaises(KeyError, lambda: s3.loc[1.0]) + pytest.raises(TypeError, lambda: s3.iloc[1.0]) + pytest.raises(KeyError, lambda: s3.loc[1.0]) result = s3.loc[1.5] expected = 3 - self.assertEqual(result, expected) + assert result == expected def test_scalar_integer(self): @@ -199,7 +206,8 @@ def test_scalar_integer(self): (lambda x: x.loc, False), (lambda x: x, True)]: - result = idxr(s)[3.0] + with catch_warnings(record=True): + result = idxr(s)[3.0] self.check(result, s, 3, getitem) # coerce to equal int @@ -208,7 +216,8 @@ def test_scalar_integer(self): (lambda x: x, True)]: if isinstance(s, Series): - compare = self.assertEqual + def compare(x, y): + assert x == y expected = 100 else: compare = tm.assert_series_equal @@ -220,17 +229,18 @@ def test_scalar_integer(self): index=range(len(s)), name=3) s2 = s.copy() - idxr(s2)[3.0] = 100 + with catch_warnings(record=True): + idxr(s2)[3.0] = 100 - result = idxr(s2)[3.0] - compare(result, expected) + result = idxr(s2)[3.0] + compare(result, expected) - result = idxr(s2)[3] - compare(result, expected) + result = idxr(s2)[3] + compare(result, expected) # contains # coerce to equal int - self.assertTrue(3.0 in s) + assert 3.0 in s def test_scalar_float(self): @@ -247,22 +257,26 @@ def test_scalar_float(self): (lambda x: x, True)]: # getting - result = idxr(s)[indexer] + with catch_warnings(record=True): + result = idxr(s)[indexer] self.check(result, s, 3, getitem) # setting s2 = s.copy() def f(): - idxr(s2)[indexer] = expected - result = idxr(s2)[indexer] + with catch_warnings(record=True): + idxr(s2)[indexer] = expected + with catch_warnings(record=True): + result = idxr(s2)[indexer] self.check(result, s, 3, getitem) # random integer is a KeyError - self.assertRaises(KeyError, lambda: idxr(s)[3.5]) + with catch_warnings(record=True): + pytest.raises(KeyError, lambda: idxr(s)[3.5]) # contains - self.assertTrue(3.0 in s) + assert 3.0 in s # iloc succeeds with an integer expected = s.iloc[3] @@ -273,11 +287,11 @@ def f(): self.check(result, s, 3, False) # iloc raises with a float - self.assertRaises(TypeError, lambda: s.iloc[3.0]) + pytest.raises(TypeError, lambda: s.iloc[3.0]) def g(): s2.iloc[3.0] = 0 - self.assertRaises(TypeError, g) + pytest.raises(TypeError, g) def test_slice_non_numeric(self): @@ -300,7 +314,7 @@ def test_slice_non_numeric(self): def f(): s.iloc[l] - 
self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) for idxr in [lambda x: x.ix, lambda x: x.loc, @@ -308,8 +322,9 @@ def f(): lambda x: x]: def f(): - idxr(s)[l] - self.assertRaises(TypeError, f) + with catch_warnings(record=True): + idxr(s)[l] + pytest.raises(TypeError, f) # setitem for l in [slice(3.0, 4), @@ -318,15 +333,16 @@ def f(): def f(): s.iloc[l] = 0 - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) for idxr in [lambda x: x.ix, lambda x: x.loc, lambda x: x.iloc, lambda x: x]: def f(): - idxr(s)[l] = 0 - self.assertRaises(TypeError, f) + with catch_warnings(record=True): + idxr(s)[l] = 0 + pytest.raises(TypeError, f) def test_slice_integer(self): @@ -349,7 +365,8 @@ def test_slice_integer(self): for idxr in [lambda x: x.loc, lambda x: x.ix]: - result = idxr(s)[l] + with catch_warnings(record=True): + result = idxr(s)[l] # these are all label indexing # except getitem which is positional @@ -364,7 +381,7 @@ def test_slice_integer(self): def f(): s[l] - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) # getitem out-of-bounds for l in [slice(-6, 6), @@ -372,7 +389,8 @@ def f(): for idxr in [lambda x: x.loc, lambda x: x.ix]: - result = idxr(s)[l] + with catch_warnings(record=True): + result = idxr(s)[l] # these are all label indexing # except getitem which is positional @@ -387,7 +405,7 @@ def f(): def f(): s[slice(-6.0, 6.0)] - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) # getitem odd floats for l, res1 in [(slice(2.5, 4), slice(3, 5)), @@ -397,7 +415,8 @@ def f(): for idxr in [lambda x: x.loc, lambda x: x.ix]: - result = idxr(s)[l] + with catch_warnings(record=True): + result = idxr(s)[l] if oob: res = slice(0, 0) else: @@ -409,7 +428,7 @@ def f(): def f(): s[l] - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) # setitem for l in [slice(3.0, 4), @@ -419,15 +438,16 @@ def f(): for idxr in [lambda x: x.loc, lambda x: x.ix]: sc = s.copy() - idxr(sc)[l] = 0 - result = idxr(sc)[l].values.ravel() - self.assertTrue((result == 0).all()) + with catch_warnings(record=True): + idxr(sc)[l] = 0 + result = idxr(sc)[l].values.ravel() + assert (result == 0).all() # positional indexing def f(): s[l] = 0 - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) def test_integer_positional_indexing(self): """ make sure that we are raising on positional indexing @@ -449,7 +469,7 @@ def test_integer_positional_indexing(self): def f(): idxr(s)[l] - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) def test_slice_integer_frame_getitem(self): @@ -467,7 +487,8 @@ def test_slice_integer_frame_getitem(self): slice(0, 1.0), slice(0.0, 1.0)]: - result = idxr(s)[l] + with catch_warnings(record=True): + result = idxr(s)[l] indexer = slice(0, 2) self.check(result, s, indexer, False) @@ -475,7 +496,7 @@ def test_slice_integer_frame_getitem(self): def f(): s[l] - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) # getitem out-of-bounds for l in [slice(-10, 10), @@ -488,21 +509,22 @@ def f(): def f(): s[slice(-10.0, 10.0)] - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) # getitem odd floats for l, res in [(slice(0.5, 1), slice(1, 2)), (slice(0, 0.5), slice(0, 1)), (slice(0.5, 1.5), slice(1, 2))]: - result = idxr(s)[l] + with catch_warnings(record=True): + result = idxr(s)[l] self.check(result, s, res, False) # positional indexing def f(): s[l] - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) # setitem for l in [slice(3.0, 4), @@ -510,15 +532,16 @@ def f(): slice(3.0, 4.0)]: sc = s.copy() 
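The rule these float-indexer tests keep re-asserting, outside the harness: `.iloc` is strictly positional and rejects floats, while a float index resolves floats as labels. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=[1.5, 2.0, 3.0, 4.5, 5.0])

assert s.loc[3.0] == 2   # label lookup on the float index
assert s.iloc[3] == 3    # purely positional

try:
    s.iloc[3.0]          # a float is never a valid position
except TypeError:
    pass
```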
- idxr(sc)[l] = 0 - result = idxr(sc)[l].values.ravel() - self.assertTrue((result == 0).all()) + with catch_warnings(record=True): + idxr(sc)[l] = 0 + result = idxr(sc)[l].values.ravel() + assert (result == 0).all() # positional indexing def f(): s[l] = 0 - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) def test_slice_float(self): @@ -537,25 +560,27 @@ def test_slice_float(self): lambda x: x]: # getitem - result = idxr(s)[l] + with catch_warnings(record=True): + result = idxr(s)[l] if isinstance(s, Series): - self.assert_series_equal(result, expected) + tm.assert_series_equal(result, expected) else: - self.assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) # setitem s2 = s.copy() - idxr(s2)[l] = 0 - result = idxr(s2)[l].values.ravel() - self.assertTrue((result == 0).all()) + with catch_warnings(record=True): + idxr(s2)[l] = 0 + result = idxr(s2)[l].values.ravel() + assert (result == 0).all() def test_floating_index_doc_example(self): index = Index([1.5, 2, 3, 4.5, 5]) s = Series(range(5), index=index) - self.assertEqual(s[3], 2) - self.assertEqual(s.ix[3], 2) - self.assertEqual(s.loc[3], 2) - self.assertEqual(s.iloc[3], 3) + assert s[3] == 2 + assert s.loc[3] == 2 + assert s.loc[3] == 2 + assert s.iloc[3] == 3 def test_floating_misc(self): @@ -565,7 +590,7 @@ def test_floating_misc(self): # label based slicing result1 = s[1.0:3.0] - result2 = s.ix[1.0:3.0] + result2 = s.loc[1.0:3.0] result3 = s.loc[1.0:3.0] assert_series_equal(result1, result2) assert_series_equal(result1, result3) @@ -573,24 +598,24 @@ def test_floating_misc(self): # exact indexing when found result1 = s[5.0] result2 = s.loc[5.0] - result3 = s.ix[5.0] - self.assertEqual(result1, result2) - self.assertEqual(result1, result3) + result3 = s.loc[5.0] + assert result1 == result2 + assert result1 == result3 result1 = s[5] result2 = s.loc[5] - result3 = s.ix[5] - self.assertEqual(result1, result2) - self.assertEqual(result1, result3) + result3 = s.loc[5] + assert result1 == result2 + assert result1 == result3 - self.assertEqual(s[5.0], s[5]) + assert s[5.0] == s[5] # value not found (and no fallbacking at all) # scalar integers - self.assertRaises(KeyError, lambda: s.loc[4]) - self.assertRaises(KeyError, lambda: s.ix[4]) - self.assertRaises(KeyError, lambda: s[4]) + pytest.raises(KeyError, lambda: s.loc[4]) + pytest.raises(KeyError, lambda: s.loc[4]) + pytest.raises(KeyError, lambda: s[4]) # fancy floats/integers create the correct entry (as nan) # fancy tests @@ -598,13 +623,13 @@ def test_floating_misc(self): for fancy_idx in [[5.0, 0.0], np.array([5.0, 0.0])]: # float assert_series_equal(s[fancy_idx], expected) assert_series_equal(s.loc[fancy_idx], expected) - assert_series_equal(s.ix[fancy_idx], expected) + assert_series_equal(s.loc[fancy_idx], expected) expected = Series([2, 0], index=Index([5, 0], dtype='int64')) for fancy_idx in [[5, 0], np.array([5, 0])]: # int assert_series_equal(s[fancy_idx], expected) assert_series_equal(s.loc[fancy_idx], expected) - assert_series_equal(s.ix[fancy_idx], expected) + assert_series_equal(s.loc[fancy_idx], expected) # all should return the same as we are slicing 'the same' result1 = s.loc[2:5] @@ -624,17 +649,17 @@ def test_floating_misc(self): assert_series_equal(result1, result3) assert_series_equal(result1, result4) - result1 = s.ix[2:5] - result2 = s.ix[2.0:5.0] - result3 = s.ix[2.0:5] - result4 = s.ix[2.1:5] + result1 = s.loc[2:5] + result2 = s.loc[2.0:5.0] + result3 = s.loc[2.0:5] + result4 = s.loc[2.1:5] assert_series_equal(result1, 
result2) assert_series_equal(result1, result3) assert_series_equal(result1, result4) # combined test result1 = s.loc[2:5] - result2 = s.ix[2:5] + result2 = s.loc[2:5] result3 = s[2:5] assert_series_equal(result1, result2) @@ -643,7 +668,7 @@ def test_floating_misc(self): # list selection result1 = s[[0.0, 5, 10]] result2 = s.loc[[0.0, 5, 10]] - result3 = s.ix[[0.0, 5, 10]] + result3 = s.loc[[0.0, 5, 10]] result4 = s.iloc[[0, 2, 4]] assert_series_equal(result1, result2) assert_series_equal(result1, result3) @@ -651,14 +676,14 @@ def test_floating_misc(self): result1 = s[[1.6, 5, 10]] result2 = s.loc[[1.6, 5, 10]] - result3 = s.ix[[1.6, 5, 10]] + result3 = s.loc[[1.6, 5, 10]] assert_series_equal(result1, result2) assert_series_equal(result1, result3) assert_series_equal(result1, Series( [np.nan, 2, 4], index=[1.6, 5, 10])) result1 = s[[0, 1, 2]] - result2 = s.ix[[0, 1, 2]] + result2 = s.loc[[0, 1, 2]] result3 = s.loc[[0, 1, 2]] assert_series_equal(result1, result2) assert_series_equal(result1, result3) @@ -666,24 +691,183 @@ def test_floating_misc(self): [0.0, np.nan, np.nan], index=[0, 1, 2])) result1 = s.loc[[2.5, 5]] - result2 = s.ix[[2.5, 5]] + result2 = s.loc[[2.5, 5]] assert_series_equal(result1, result2) assert_series_equal(result1, Series([1, 2], index=[2.5, 5.0])) result1 = s[[2.5]] - result2 = s.ix[[2.5]] + result2 = s.loc[[2.5]] result3 = s.loc[[2.5]] assert_series_equal(result1, result2) assert_series_equal(result1, result3) assert_series_equal(result1, Series([1], index=[2.5])) def test_floating_tuples(self): - # GH13509 + # see gh-13509 s = Series([(1, 1), (2, 2), (3, 3)], index=[0.0, 0.1, 0.2], name='foo') + result = s[0.0] - self.assertEqual(result, (1, 1)) + assert result == (1, 1) + expected = Series([(1, 1), (2, 2)], index=[0.0, 0.0], name='foo') s = Series([(1, 1), (2, 2), (3, 3)], index=[0.0, 0.0, 0.2], name='foo') + result = s[0.0] - expected = Series([(1, 1), (2, 2)], index=[0.0, 0.0], name='foo') - assert_series_equal(result, expected) + tm.assert_series_equal(result, expected) + + def test_float64index_slicing_bug(self): + # GH 5557, related to slicing a float index + ser = {256: 2321.0, + 1: 78.0, + 2: 2716.0, + 3: 0.0, + 4: 369.0, + 5: 0.0, + 6: 269.0, + 7: 0.0, + 8: 0.0, + 9: 0.0, + 10: 3536.0, + 11: 0.0, + 12: 24.0, + 13: 0.0, + 14: 931.0, + 15: 0.0, + 16: 101.0, + 17: 78.0, + 18: 9643.0, + 19: 0.0, + 20: 0.0, + 21: 0.0, + 22: 63761.0, + 23: 0.0, + 24: 446.0, + 25: 0.0, + 26: 34773.0, + 27: 0.0, + 28: 729.0, + 29: 78.0, + 30: 0.0, + 31: 0.0, + 32: 3374.0, + 33: 0.0, + 34: 1391.0, + 35: 0.0, + 36: 361.0, + 37: 0.0, + 38: 61808.0, + 39: 0.0, + 40: 0.0, + 41: 0.0, + 42: 6677.0, + 43: 0.0, + 44: 802.0, + 45: 0.0, + 46: 2691.0, + 47: 0.0, + 48: 3582.0, + 49: 0.0, + 50: 734.0, + 51: 0.0, + 52: 627.0, + 53: 70.0, + 54: 2584.0, + 55: 0.0, + 56: 324.0, + 57: 0.0, + 58: 605.0, + 59: 0.0, + 60: 0.0, + 61: 0.0, + 62: 3989.0, + 63: 10.0, + 64: 42.0, + 65: 0.0, + 66: 904.0, + 67: 0.0, + 68: 88.0, + 69: 70.0, + 70: 8172.0, + 71: 0.0, + 72: 0.0, + 73: 0.0, + 74: 64902.0, + 75: 0.0, + 76: 347.0, + 77: 0.0, + 78: 36605.0, + 79: 0.0, + 80: 379.0, + 81: 70.0, + 82: 0.0, + 83: 0.0, + 84: 3001.0, + 85: 0.0, + 86: 1630.0, + 87: 7.0, + 88: 364.0, + 89: 0.0, + 90: 67404.0, + 91: 9.0, + 92: 0.0, + 93: 0.0, + 94: 7685.0, + 95: 0.0, + 96: 1017.0, + 97: 0.0, + 98: 2831.0, + 99: 0.0, + 100: 2963.0, + 101: 0.0, + 102: 854.0, + 103: 0.0, + 104: 0.0, + 105: 0.0, + 106: 0.0, + 107: 0.0, + 108: 0.0, + 109: 0.0, + 110: 0.0, + 111: 0.0, + 112: 0.0, + 113: 0.0, + 114: 0.0, + 115: 0.0, + 116: 0.0, + 
117: 0.0,
+ 118: 0.0,
+ 119: 0.0,
+ 120: 0.0,
+ 121: 0.0,
+ 122: 0.0,
+ 123: 0.0,
+ 124: 0.0,
+ 125: 0.0,
+ 126: 67744.0,
+ 127: 22.0,
+ 128: 264.0,
+ 129: 0.0,
+ 260: 197.0,
+ 268: 0.0,
+ 265: 0.0,
+ 269: 0.0,
+ 261: 0.0,
+ 266: 1198.0,
+ 267: 0.0,
+ 262: 2629.0,
+ 258: 775.0,
+ 257: 0.0,
+ 263: 0.0,
+ 259: 0.0,
+ 264: 163.0,
+ 250: 10326.0,
+ 251: 0.0,
+ 252: 1228.0,
+ 253: 0.0,
+ 254: 2769.0,
+ 255: 0.0}
+
+ # smoke test for the repr
+ s = Series(ser)
+ result = s.value_counts()
+ str(result)
diff --git a/pandas/tests/indexing/test_iloc.py b/pandas/tests/indexing/test_iloc.py
new file mode 100644
index 0000000000000..af4b9e1f0cc25
--- /dev/null
+++ b/pandas/tests/indexing/test_iloc.py
@@ -0,0 +1,593 @@
+""" test positional based indexing with iloc """
+
+import pytest
+
+from warnings import catch_warnings
+import numpy as np
+
+import pandas as pd
+from pandas.compat import lrange, lmap
+from pandas import Series, DataFrame, date_range, concat, isnull
+from pandas.util import testing as tm
+from pandas.tests.indexing.common import Base
+
+
+class TestiLoc(Base):
+
+ def test_iloc_exceeds_bounds(self):
+
+ # GH6296
+ # iloc should allow indexers that exceed the bounds
+ df = DataFrame(np.random.random_sample((20, 5)), columns=list('ABCDE'))
+ expected = df
+
+ # lists of positions should raise IndexError!
+ with tm.assert_raises_regex(IndexError,
+ 'positional indexers '
+ 'are out-of-bounds'):
+ df.iloc[:, [0, 1, 2, 3, 4, 5]]
+ pytest.raises(IndexError, lambda: df.iloc[[1, 30]])
+ pytest.raises(IndexError, lambda: df.iloc[[1, -30]])
+ pytest.raises(IndexError, lambda: df.iloc[[100]])
+
+ s = df['A']
+ pytest.raises(IndexError, lambda: s.iloc[[100]])
+ pytest.raises(IndexError, lambda: s.iloc[[-100]])
+
+ # still raise on a single indexer
+ msg = 'single positional indexer is out-of-bounds'
+ with tm.assert_raises_regex(IndexError, msg):
+ df.iloc[30]
+ pytest.raises(IndexError, lambda: df.iloc[-30])
+
+ # GH10779
+ # single positive/negative indexer exceeding Series bounds should raise
+ # an IndexError
+ with tm.assert_raises_regex(IndexError, msg):
+ s.iloc[30]
+ pytest.raises(IndexError, lambda: s.iloc[-30])
+
+ # slices are ok
+ result = df.iloc[:, 4:10]  # 0 < start < len < stop
+ expected = df.iloc[:, 4:]
+ tm.assert_frame_equal(result, expected)
+
+ result = df.iloc[:, -4:-10]  # stop < 0 < start < len
+ expected = df.iloc[:, :0]
+ tm.assert_frame_equal(result, expected)
+
+ result = df.iloc[:, 10:4:-1]  # 0 < stop < len < start (down)
+ expected = df.iloc[:, :4:-1]
+ tm.assert_frame_equal(result, expected)
+
+ result = df.iloc[:, 4:-10:-1]  # stop < 0 < start < len (down)
+ expected = df.iloc[:, 4::-1]
+ tm.assert_frame_equal(result, expected)
+
+ result = df.iloc[:, -10:4]  # start < 0 < stop < len
+ expected = df.iloc[:, :4]
+ tm.assert_frame_equal(result, expected)
+
+ result = df.iloc[:, 10:4]  # 0 < stop < len < start
+ expected = df.iloc[:, :0]
+ tm.assert_frame_equal(result, expected)
+
+ result = df.iloc[:, -10:-11:-1]  # stop < start < 0 < len (down)
+ expected = df.iloc[:, :0]
+ tm.assert_frame_equal(result, expected)
+
+ result = df.iloc[:, 10:11]  # 0 < len < start < stop
+ expected = df.iloc[:, :0]
+ tm.assert_frame_equal(result, expected)
+
+ # slices with out-of-bounds endpoints are ok
+ result = s.iloc[18:30]
+ expected = s.iloc[18:]
+ tm.assert_series_equal(result, expected)
+
+ result = s.iloc[30:]
+ expected = s.iloc[:0]
+ tm.assert_series_equal(result, expected)
+
+ result = s.iloc[30::-1]
+ expected = s.iloc[::-1]
+ tm.assert_series_equal(result, expected)
+
+ # doc example
+
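# (note, summarizing the cases above: out-of-bounds *slice* endpoints are
+ # clipped to the valid range, possibly yielding an empty result, while
+ # list indexers and single integers still raise IndexError)
+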
def check(result, expected): + str(result) + result.dtypes + tm.assert_frame_equal(result, expected) + + dfl = DataFrame(np.random.randn(5, 2), columns=list('AB')) + check(dfl.iloc[:, 2:3], DataFrame(index=dfl.index)) + check(dfl.iloc[:, 1:3], dfl.iloc[:, [1]]) + check(dfl.iloc[4:6], dfl.iloc[[4]]) + + pytest.raises(IndexError, lambda: dfl.iloc[[4, 5, 6]]) + pytest.raises(IndexError, lambda: dfl.iloc[:, 4]) + + def test_iloc_getitem_int(self): + + # integer + self.check_result('integer', 'iloc', 2, 'ix', + {0: 4, 1: 6, 2: 8}, typs=['ints', 'uints']) + self.check_result('integer', 'iloc', 2, 'indexer', 2, + typs=['labels', 'mixed', 'ts', 'floats', 'empty'], + fails=IndexError) + + def test_iloc_getitem_neg_int(self): + + # neg integer + self.check_result('neg int', 'iloc', -1, 'ix', + {0: 6, 1: 9, 2: 12}, typs=['ints', 'uints']) + self.check_result('neg int', 'iloc', -1, 'indexer', -1, + typs=['labels', 'mixed', 'ts', 'floats', 'empty'], + fails=IndexError) + + def test_iloc_getitem_list_int(self): + + # list of ints + self.check_result('list int', 'iloc', [0, 1, 2], 'ix', + {0: [0, 2, 4], 1: [0, 3, 6], 2: [0, 4, 8]}, + typs=['ints', 'uints']) + self.check_result('list int', 'iloc', [2], 'ix', + {0: [4], 1: [6], 2: [8]}, typs=['ints', 'uints']) + self.check_result('list int', 'iloc', [0, 1, 2], 'indexer', [0, 1, 2], + typs=['labels', 'mixed', 'ts', 'floats', 'empty'], + fails=IndexError) + + # array of ints (GH5006), make sure that a single indexer is returning + # the correct type + self.check_result('array int', 'iloc', np.array([0, 1, 2]), 'ix', + {0: [0, 2, 4], + 1: [0, 3, 6], + 2: [0, 4, 8]}, typs=['ints', 'uints']) + self.check_result('array int', 'iloc', np.array([2]), 'ix', + {0: [4], 1: [6], 2: [8]}, typs=['ints', 'uints']) + self.check_result('array int', 'iloc', np.array([0, 1, 2]), 'indexer', + [0, 1, 2], + typs=['labels', 'mixed', 'ts', 'floats', 'empty'], + fails=IndexError) + + def test_iloc_getitem_neg_int_can_reach_first_index(self): + # GH10547 and GH10779 + # negative integers should be able to reach index 0 + df = DataFrame({'A': [2, 3, 5], 'B': [7, 11, 13]}) + s = df['A'] + + expected = df.iloc[0] + result = df.iloc[-3] + tm.assert_series_equal(result, expected) + + expected = df.iloc[[0]] + result = df.iloc[[-3]] + tm.assert_frame_equal(result, expected) + + expected = s.iloc[0] + result = s.iloc[-3] + assert result == expected + + expected = s.iloc[[0]] + result = s.iloc[[-3]] + tm.assert_series_equal(result, expected) + + # check the length 1 Series case highlighted in GH10547 + expected = pd.Series(['a'], index=['A']) + result = expected.iloc[[-1]] + tm.assert_series_equal(result, expected) + + def test_iloc_getitem_dups(self): + + # no dups in panel (bug?) 
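+ # (duplicated positions are allowed and simply repeat rows: positions
+ #  [0, 1, 1, 3] on the 'ints' fixture map to labels [0, 2, 2, 6] on axis 0)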
+ self.check_result('list int (dups)', 'iloc', [0, 1, 1, 3], 'ix', + {0: [0, 2, 2, 6], 1: [0, 3, 3, 9]}, + objs=['series', 'frame'], typs=['ints', 'uints']) + + # GH 6766 + df1 = DataFrame([{'A': None, 'B': 1}, {'A': 2, 'B': 2}]) + df2 = DataFrame([{'A': 3, 'B': 3}, {'A': 4, 'B': 4}]) + df = concat([df1, df2], axis=1) + + # cross-sectional indexing + result = df.iloc[0, 0] + assert isnull(result) + + result = df.iloc[0, :] + expected = Series([np.nan, 1, 3, 3], index=['A', 'B', 'A', 'B'], + name=0) + tm.assert_series_equal(result, expected) + + def test_iloc_getitem_array(self): + + # array like + s = Series(index=lrange(1, 4)) + self.check_result('array like', 'iloc', s.index, 'ix', + {0: [2, 4, 6], 1: [3, 6, 9], 2: [4, 8, 12]}, + typs=['ints', 'uints']) + + def test_iloc_getitem_bool(self): + + # boolean indexers + b = [True, False, True, False, ] + self.check_result('bool', 'iloc', b, 'ix', b, typs=['ints', 'uints']) + self.check_result('bool', 'iloc', b, 'ix', b, + typs=['labels', 'mixed', 'ts', 'floats', 'empty'], + fails=IndexError) + + def test_iloc_getitem_slice(self): + + # slices + self.check_result('slice', 'iloc', slice(1, 3), 'ix', + {0: [2, 4], 1: [3, 6], 2: [4, 8]}, + typs=['ints', 'uints']) + self.check_result('slice', 'iloc', slice(1, 3), 'indexer', + slice(1, 3), + typs=['labels', 'mixed', 'ts', 'floats', 'empty'], + fails=IndexError) + + def test_iloc_getitem_slice_dups(self): + + df1 = DataFrame(np.random.randn(10, 4), columns=['A', 'A', 'B', 'B']) + df2 = DataFrame(np.random.randint(0, 10, size=20).reshape(10, 2), + columns=['A', 'C']) + + # axis=1 + df = concat([df1, df2], axis=1) + tm.assert_frame_equal(df.iloc[:, :4], df1) + tm.assert_frame_equal(df.iloc[:, 4:], df2) + + df = concat([df2, df1], axis=1) + tm.assert_frame_equal(df.iloc[:, :2], df2) + tm.assert_frame_equal(df.iloc[:, 2:], df1) + + exp = concat([df2, df1.iloc[:, [0]]], axis=1) + tm.assert_frame_equal(df.iloc[:, 0:3], exp) + + # axis=0 + df = concat([df, df], axis=0) + tm.assert_frame_equal(df.iloc[0:10, :2], df2) + tm.assert_frame_equal(df.iloc[0:10, 2:], df1) + tm.assert_frame_equal(df.iloc[10:, :2], df2) + tm.assert_frame_equal(df.iloc[10:, 2:], df1) + + def test_iloc_setitem(self): + df = self.frame_ints + + df.iloc[1, 1] = 1 + result = df.iloc[1, 1] + assert result == 1 + + df.iloc[:, 2:3] = 0 + expected = df.iloc[:, 2:3] + result = df.iloc[:, 2:3] + tm.assert_frame_equal(result, expected) + + # GH5771 + s = Series(0, index=[4, 5, 6]) + s.iloc[1:2] += 1 + expected = Series([0, 1, 0], index=[4, 5, 6]) + tm.assert_series_equal(s, expected) + + def test_iloc_setitem_list(self): + + # setitem with an iloc list + df = DataFrame(np.arange(9).reshape((3, 3)), index=["A", "B", "C"], + columns=["A", "B", "C"]) + df.iloc[[0, 1], [1, 2]] + df.iloc[[0, 1], [1, 2]] += 100 + + expected = DataFrame( + np.array([0, 101, 102, 3, 104, 105, 6, 7, 8]).reshape((3, 3)), + index=["A", "B", "C"], columns=["A", "B", "C"]) + tm.assert_frame_equal(df, expected) + + def test_iloc_setitem_dups(self): + + # GH 6766 + # iloc with a mask aligning from another iloc + df1 = DataFrame([{'A': None, 'B': 1}, {'A': 2, 'B': 2}]) + df2 = DataFrame([{'A': 3, 'B': 3}, {'A': 4, 'B': 4}]) + df = concat([df1, df2], axis=1) + + expected = df.fillna(3) + expected['A'] = expected['A'].astype('float64') + inds = np.isnan(df.iloc[:, 0]) + mask = inds[inds].index + df.iloc[mask, 0] = df.iloc[mask, 2] + tm.assert_frame_equal(df, expected) + + # del a dup column across blocks + expected = DataFrame({0: [1, 2], 1: [3, 4]}) + expected.columns = ['B', 
'B']
+ del df['A']
+ tm.assert_frame_equal(df, expected)
+
+ # assign back to self
+ df.iloc[[0, 1], [0, 1]] = df.iloc[[0, 1], [0, 1]]
+ tm.assert_frame_equal(df, expected)
+
+ # reversed x 2
+ df.iloc[[1, 0], [0, 1]] = df.iloc[[1, 0], [0, 1]].reset_index(
+ drop=True)
+ df.iloc[[1, 0], [0, 1]] = df.iloc[[1, 0], [0, 1]].reset_index(
+ drop=True)
+ tm.assert_frame_equal(df, expected)
+
+ def test_iloc_getitem_frame(self):
+ df = DataFrame(np.random.randn(10, 4), index=lrange(0, 20, 2),
+ columns=lrange(0, 8, 2))
+
+ result = df.iloc[2]
+ with catch_warnings(record=True):
+ exp = df.ix[4]
+ tm.assert_series_equal(result, exp)
+
+ result = df.iloc[2, 2]
+ with catch_warnings(record=True):
+ exp = df.ix[4, 4]
+ assert result == exp
+
+ # slice
+ result = df.iloc[4:8]
+ with catch_warnings(record=True):
+ expected = df.ix[8:14]
+ tm.assert_frame_equal(result, expected)
+
+ result = df.iloc[:, 2:3]
+ with catch_warnings(record=True):
+ expected = df.ix[:, 4:5]
+ tm.assert_frame_equal(result, expected)
+
+ # list of integers
+ result = df.iloc[[0, 1, 3]]
+ with catch_warnings(record=True):
+ expected = df.ix[[0, 2, 6]]
+ tm.assert_frame_equal(result, expected)
+
+ result = df.iloc[[0, 1, 3], [0, 1]]
+ with catch_warnings(record=True):
+ expected = df.ix[[0, 2, 6], [0, 2]]
+ tm.assert_frame_equal(result, expected)
+
+ # neg indices
+ result = df.iloc[[-1, 1, 3], [-1, 1]]
+ with catch_warnings(record=True):
+ expected = df.ix[[18, 2, 6], [6, 2]]
+ tm.assert_frame_equal(result, expected)
+
+ # dup indices
+ result = df.iloc[[-1, -1, 1, 3], [-1, 1]]
+ with catch_warnings(record=True):
+ expected = df.ix[[18, 18, 2, 6], [6, 2]]
+ tm.assert_frame_equal(result, expected)
+
+ # with index-like
+ s = Series(index=lrange(1, 5))
+ result = df.iloc[s.index]
+ with catch_warnings(record=True):
+ expected = df.ix[[2, 4, 6, 8]]
+ tm.assert_frame_equal(result, expected)
+
+ def test_iloc_getitem_labelled_frame(self):
+ # try with labelled frame
+ df = DataFrame(np.random.randn(10, 4),
+ index=list('abcdefghij'), columns=list('ABCD'))
+
+ result = df.iloc[1, 1]
+ exp = df.loc['b', 'B']
+ assert result == exp
+
+ result = df.iloc[:, 2:3]
+ expected = df.loc[:, ['C']]
+ tm.assert_frame_equal(result, expected)
+
+ # negative indexing
+ result = df.iloc[-1, -1]
+ exp = df.loc['j', 'D']
+ assert result == exp
+
+ # out-of-bounds exception
+ pytest.raises(IndexError, df.iloc.__getitem__, tuple([10, 5]))
+
+ # trying to use a label
+ pytest.raises(ValueError, df.iloc.__getitem__, tuple(['j', 'D']))
+
+ def test_iloc_getitem_doc_issue(self):
+
+ # multi axis slicing issue with single block
+ # surfaced in GH 6059
+
+ arr = np.random.randn(6, 4)
+ index = date_range('20130101', periods=6)
+ columns = list('ABCD')
+ df = DataFrame(arr, index=index, columns=columns)
+
+ # defines ref_locs
+ df.describe()
+
+ result = df.iloc[3:5, 0:2]
+ str(result)
+ result.dtypes
+
+ expected = DataFrame(arr[3:5, 0:2], index=index[3:5],
+ columns=columns[0:2])
+ tm.assert_frame_equal(result, expected)
+
+ # for dups
+ df.columns = list('aaaa')
+ result = df.iloc[3:5, 0:2]
+ str(result)
+ result.dtypes
+
+ expected = DataFrame(arr[3:5, 0:2], index=index[3:5],
+ columns=list('aa'))
+ tm.assert_frame_equal(result, expected)
+
+ # related
+ arr = np.random.randn(6, 4)
+ index = list(range(0, 12, 2))
+ columns = list(range(0, 8, 2))
+ df = DataFrame(arr, index=index, columns=columns)
+
+ df._data.blocks[0].mgr_locs
+ result = df.iloc[1:5, 2:4]
+ str(result)
+ result.dtypes
+ expected = DataFrame(arr[1:5, 2:4], index=index[1:5],
+
columns=columns[2:4]) + tm.assert_frame_equal(result, expected) + + def test_iloc_setitem_series(self): + df = DataFrame(np.random.randn(10, 4), index=list('abcdefghij'), + columns=list('ABCD')) + + df.iloc[1, 1] = 1 + result = df.iloc[1, 1] + assert result == 1 + + df.iloc[:, 2:3] = 0 + expected = df.iloc[:, 2:3] + result = df.iloc[:, 2:3] + tm.assert_frame_equal(result, expected) + + s = Series(np.random.randn(10), index=lrange(0, 20, 2)) + + s.iloc[1] = 1 + result = s.iloc[1] + assert result == 1 + + s.iloc[:4] = 0 + expected = s.iloc[:4] + result = s.iloc[:4] + tm.assert_series_equal(result, expected) + + s = Series([-1] * 6) + s.iloc[0::2] = [0, 2, 4] + s.iloc[1::2] = [1, 3, 5] + result = s + expected = Series([0, 1, 2, 3, 4, 5]) + tm.assert_series_equal(result, expected) + + def test_iloc_setitem_list_of_lists(self): + + # GH 7551 + # list-of-list is set incorrectly in mixed vs. single dtyped frames + df = DataFrame(dict(A=np.arange(5, dtype='int64'), + B=np.arange(5, 10, dtype='int64'))) + df.iloc[2:4] = [[10, 11], [12, 13]] + expected = DataFrame(dict(A=[0, 1, 10, 12, 4], B=[5, 6, 11, 13, 9])) + tm.assert_frame_equal(df, expected) + + df = DataFrame( + dict(A=list('abcde'), B=np.arange(5, 10, dtype='int64'))) + df.iloc[2:4] = [['x', 11], ['y', 13]] + expected = DataFrame(dict(A=['a', 'b', 'x', 'y', 'e'], + B=[5, 6, 11, 13, 9])) + tm.assert_frame_equal(df, expected) + + def test_iloc_mask(self): + + # GH 3631, iloc with a mask (of a series) should raise + df = DataFrame(lrange(5), list('ABCDE'), columns=['a']) + mask = (df.a % 2 == 0) + pytest.raises(ValueError, df.iloc.__getitem__, tuple([mask])) + mask.index = lrange(len(mask)) + pytest.raises(NotImplementedError, df.iloc.__getitem__, + tuple([mask])) + + # ndarray ok + result = df.iloc[np.array([True] * len(mask), dtype=bool)] + tm.assert_frame_equal(result, df) + + # the possibilities + locs = np.arange(4) + nums = 2 ** locs + reps = lmap(bin, nums) + df = DataFrame({'locs': locs, 'nums': nums}, reps) + + expected = { + (None, ''): '0b1100', + (None, '.loc'): '0b1100', + (None, '.iloc'): '0b1100', + ('index', ''): '0b11', + ('index', '.loc'): '0b11', + ('index', '.iloc'): ('iLocation based boolean indexing ' + 'cannot use an indexable as a mask'), + ('locs', ''): 'Unalignable boolean Series provided as indexer ' + '(index of the boolean Series and of the indexed ' + 'object do not match', + ('locs', '.loc'): 'Unalignable boolean Series provided as indexer ' + '(index of the boolean Series and of the ' + 'indexed object do not match', + ('locs', '.iloc'): ('iLocation based boolean indexing on an ' + 'integer type is not available'), + } + + # UserWarnings from reindex of a boolean mask + with catch_warnings(record=True): + result = dict() + for idx in [None, 'index', 'locs']: + mask = (df.nums > 2).values + if idx: + mask = Series(mask, list(reversed(getattr(df, idx)))) + for method in ['', '.loc', '.iloc']: + try: + if method: + accessor = getattr(df, method[1:]) + else: + accessor = df + ans = str(bin(accessor[mask]['nums'].sum())) + except Exception as e: + ans = str(e) + + key = tuple([idx, method]) + r = expected.get(key) + if r != ans: + raise AssertionError( + "[%s] does not match [%s], received [%s]" + % (key, ans, r)) + + def test_iloc_non_unique_indexing(self): + + # GH 4017, non-unique indexing (on the axis) + df = DataFrame({'A': [0.1] * 3000, 'B': [1] * 3000}) + idx = np.array(lrange(30)) * 99 + expected = df.iloc[idx] + + df3 = pd.concat([df, 2 * df, 3 * df]) + result = df3.iloc[idx] + + 
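# (idx tops out at 29 * 99 == 2871, inside the first of the three
+ # concatenated copies, so positional lookup on df3 matches df.iloc[idx])
+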
tm.assert_frame_equal(result, expected) + + df2 = DataFrame({'A': [0.1] * 1000, 'B': [1] * 1000}) + df2 = pd.concat([df2, 2 * df2, 3 * df2]) + + sidx = df2.index.to_series() + expected = df2.iloc[idx[idx <= sidx.max()]] + + new_list = [] + for r, s in expected.iterrows(): + new_list.append(s) + new_list.append(s * 2) + new_list.append(s * 3) + + expected = DataFrame(new_list) + expected = pd.concat([expected, DataFrame(index=idx[idx > sidx.max()]) + ]) + result = df2.loc[idx] + tm.assert_frame_equal(result, expected, check_index_type=False) + + def test_iloc_empty_list_indexer_is_ok(self): + from pandas.util.testing import makeCustomDataframe as mkdf + df = mkdf(5, 2) + # vertical empty + tm.assert_frame_equal(df.iloc[:, []], df.iloc[:, :0], + check_index_type=True, check_column_type=True) + # horizontal empty + tm.assert_frame_equal(df.iloc[[], :], df.iloc[:0, :], + check_index_type=True, check_column_type=True) + # horizontal empty + tm.assert_frame_equal(df.iloc[[]], df.iloc[:0, :], + check_index_type=True, + check_column_type=True) diff --git a/pandas/tests/indexing/test_indexing.py b/pandas/tests/indexing/test_indexing.py index 9ca1fd2a76817..9fa677eb624ae 100644 --- a/pandas/tests/indexing/test_indexing.py +++ b/pandas/tests/indexing/test_indexing.py @@ -1,2808 +1,76 @@ # -*- coding: utf-8 -*- # pylint: disable-msg=W0612,E1101 -import sys -import nose -import itertools -import warnings -from datetime import datetime - -from pandas.types.common import (is_integer_dtype, - is_float_dtype, - is_scalar) -from pandas.compat import range, lrange, lzip, StringIO, lmap, map -from pandas.tslib import NaT -from numpy import nan -from numpy.random import randn -import numpy as np - -import pandas as pd -import pandas.core.common as com -from pandas import option_context -from pandas.core.indexing import _non_reducing_slice, _maybe_numeric_slice -from pandas.core.api import (DataFrame, Index, Series, Panel, isnull, - MultiIndex, Timestamp, Timedelta) -from pandas.formats.printing import pprint_thing -from pandas import concat -from pandas.core.common import PerformanceWarning - -import pandas.util.testing as tm -from pandas import date_range - - -_verbose = False - -# ------------------------------------------------------------------------ -# Indexing test cases - - -def _generate_indices(f, values=False): - """ generate the indicies - if values is True , use the axis values - is False, use the range - """ - - axes = f.axes - if values: - axes = [lrange(len(a)) for a in axes] - - return itertools.product(*axes) - - -def _get_value(f, i, values=False): - """ return the value for the location i """ - - # check agains values - if values: - return f.values[i] - - # this is equiv of f[col][row]..... 
- # v = f - # for a in reversed(i): - # v = v.__getitem__(a) - # return v - return f.ix[i] - - -def _get_result(obj, method, key, axis): - """ return the result for this obj with this key and this axis """ - - if isinstance(key, dict): - key = key[axis] - - # use an artifical conversion to map the key as integers to the labels - # so ix can work for comparisions - if method == 'indexer': - method = 'ix' - key = obj._get_axis(axis)[key] - - # in case we actually want 0 index slicing - try: - xp = getattr(obj, method).__getitem__(_axify(obj, key, axis)) - except: - xp = getattr(obj, method).__getitem__(key) - - return xp - - -def _axify(obj, key, axis): - # create a tuple accessor - axes = [slice(None)] * obj.ndim - axes[axis] = key - return tuple(axes) - - -def _mklbl(prefix, n): - return ["%s%s" % (prefix, i) for i in range(n)] - - -class TestIndexing(tm.TestCase): - - _multiprocess_can_split_ = True - - _objs = set(['series', 'frame', 'panel']) - _typs = set(['ints', 'labels', 'mixed', 'ts', 'floats', 'empty', 'ts_rev']) - - def setUp(self): - - import warnings - warnings.filterwarnings(action='ignore', category=FutureWarning) - - self.series_ints = Series(np.random.rand(4), index=lrange(0, 8, 2)) - self.frame_ints = DataFrame(np.random.randn(4, 4), - index=lrange(0, 8, 2), - columns=lrange(0, 12, 3)) - self.panel_ints = Panel(np.random.rand(4, 4, 4), - items=lrange(0, 8, 2), - major_axis=lrange(0, 12, 3), - minor_axis=lrange(0, 16, 4)) - - self.series_labels = Series(np.random.randn(4), index=list('abcd')) - self.frame_labels = DataFrame(np.random.randn(4, 4), - index=list('abcd'), columns=list('ABCD')) - self.panel_labels = Panel(np.random.randn(4, 4, 4), - items=list('abcd'), - major_axis=list('ABCD'), - minor_axis=list('ZYXW')) - - self.series_mixed = Series(np.random.randn(4), index=[2, 4, 'null', 8]) - self.frame_mixed = DataFrame(np.random.randn(4, 4), - index=[2, 4, 'null', 8]) - self.panel_mixed = Panel(np.random.randn(4, 4, 4), - items=[2, 4, 'null', 8]) - - self.series_ts = Series(np.random.randn(4), - index=date_range('20130101', periods=4)) - self.frame_ts = DataFrame(np.random.randn(4, 4), - index=date_range('20130101', periods=4)) - self.panel_ts = Panel(np.random.randn(4, 4, 4), - items=date_range('20130101', periods=4)) - - dates_rev = (date_range('20130101', periods=4) - .sort_values(ascending=False)) - self.series_ts_rev = Series(np.random.randn(4), - index=dates_rev) - self.frame_ts_rev = DataFrame(np.random.randn(4, 4), - index=dates_rev) - self.panel_ts_rev = Panel(np.random.randn(4, 4, 4), - items=dates_rev) - - self.frame_empty = DataFrame({}) - self.series_empty = Series({}) - self.panel_empty = Panel({}) - - # form agglomerates - for o in self._objs: - - d = dict() - for t in self._typs: - d[t] = getattr(self, '%s_%s' % (o, t), None) - - setattr(self, o, d) - - def check_values(self, f, func, values=False): - - if f is None: - return - axes = f.axes - indicies = itertools.product(*axes) - - for i in indicies: - result = getattr(f, func)[i] - - # check agains values - if values: - expected = f.values[i] - else: - expected = f - for a in reversed(i): - expected = expected.__getitem__(a) - - tm.assert_almost_equal(result, expected) - - def check_result(self, name, method1, key1, method2, key2, typs=None, - objs=None, axes=None, fails=None): - def _eq(t, o, a, obj, k1, k2): - """ compare equal for these 2 keys """ - - if a is not None and a > obj.ndim - 1: - return - - def _print(result, error=None): - if error is not None: - error = str(error) - v = ("%-16.16s 
[%-16.16s]: [typ->%-8.8s,obj->%-8.8s," - "key1->(%-4.4s),key2->(%-4.4s),axis->%s] %s" % - (name, result, t, o, method1, method2, a, error or '')) - if _verbose: - pprint_thing(v) - - try: - # if (name == 'bool' and t == 'empty' and o == 'series' and - # method1 == 'loc'): - # import pdb; pdb.set_trace() - - rs = getattr(obj, method1).__getitem__(_axify(obj, k1, a)) - - try: - xp = _get_result(obj, method2, k2, a) - except: - result = 'no comp' - _print(result) - return - - try: - if is_scalar(rs) and is_scalar(xp): - self.assertEqual(rs, xp) - elif xp.ndim == 1: - tm.assert_series_equal(rs, xp) - elif xp.ndim == 2: - tm.assert_frame_equal(rs, xp) - elif xp.ndim == 3: - tm.assert_panel_equal(rs, xp) - result = 'ok' - except (AssertionError): - result = 'fail' - - # reverse the checks - if fails is True: - if result == 'fail': - result = 'ok (fail)' - - if not result.startswith('ok'): - raise AssertionError(_print(result)) - - _print(result) - - except AssertionError: - raise - except Exception as detail: - - # if we are in fails, the ok, otherwise raise it - if fails is not None: - if isinstance(detail, fails): - result = 'ok (%s)' % type(detail).__name__ - _print(result) - return - - result = type(detail).__name__ - raise AssertionError(_print(result, error=detail)) - - if typs is None: - typs = self._typs - - if objs is None: - objs = self._objs - - if axes is not None: - if not isinstance(axes, (tuple, list)): - axes = [axes] - else: - axes = list(axes) - else: - axes = [0, 1, 2] - - # check - for o in objs: - if o not in self._objs: - continue - - d = getattr(self, o) - for a in axes: - for t in typs: - if t not in self._typs: - continue - - obj = d[t] - if obj is not None: - obj = obj.copy() - - k2 = key2 - _eq(t, o, a, obj, key1, k2) - - def test_indexer_caching(self): - # GH5727 - # make sure that indexers are in the _internal_names_set - n = 1000001 - arrays = [lrange(n), lrange(n)] - index = MultiIndex.from_tuples(lzip(*arrays)) - s = Series(np.zeros(n), index=index) - str(s) - - # setitem - expected = Series(np.ones(n), index=index) - s = Series(np.zeros(n), index=index) - s[s == 0] = 1 - tm.assert_series_equal(s, expected) - - def test_at_and_iat_get(self): - def _check(f, func, values=False): - - if f is not None: - indicies = _generate_indices(f, values) - for i in indicies: - result = getattr(f, func)[i] - expected = _get_value(f, i, values) - tm.assert_almost_equal(result, expected) - - for o in self._objs: - - d = getattr(self, o) - - # iat - _check(d['ints'], 'iat', values=True) - for f in [d['labels'], d['ts'], d['floats']]: - if f is not None: - self.assertRaises(ValueError, self.check_values, f, 'iat') - - # at - _check(d['ints'], 'at') - _check(d['labels'], 'at') - _check(d['ts'], 'at') - _check(d['floats'], 'at') - - def test_at_and_iat_set(self): - def _check(f, func, values=False): - - if f is not None: - indicies = _generate_indices(f, values) - for i in indicies: - getattr(f, func)[i] = 1 - expected = _get_value(f, i, values) - tm.assert_almost_equal(expected, 1) - - for t in self._objs: - - d = getattr(self, t) - - _check(d['ints'], 'iat', values=True) - for f in [d['labels'], d['ts'], d['floats']]: - if f is not None: - self.assertRaises(ValueError, _check, f, 'iat') - - # at - _check(d['ints'], 'at') - _check(d['labels'], 'at') - _check(d['ts'], 'at') - _check(d['floats'], 'at') - - def test_at_iat_coercion(self): - - # as timestamp is not a tuple! 
- dates = date_range('1/1/2000', periods=8) - df = DataFrame(randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D']) - s = df['A'] - - result = s.at[dates[5]] - xp = s.values[5] - self.assertEqual(result, xp) - - # GH 7729 - # make sure we are boxing the returns - s = Series(['2014-01-01', '2014-02-02'], dtype='datetime64[ns]') - expected = Timestamp('2014-02-02') - - for r in [lambda: s.iat[1], lambda: s.iloc[1]]: - result = r() - self.assertEqual(result, expected) - - s = Series(['1 days', '2 days'], dtype='timedelta64[ns]') - expected = Timedelta('2 days') - - for r in [lambda: s.iat[1], lambda: s.iloc[1]]: - result = r() - self.assertEqual(result, expected) - - def test_iat_invalid_args(self): - pass - - def test_imethods_with_dups(self): - - # GH6493 - # iat/iloc with dups - - s = Series(range(5), index=[1, 1, 2, 2, 3], dtype='int64') - result = s.iloc[2] - self.assertEqual(result, 2) - result = s.iat[2] - self.assertEqual(result, 2) - - self.assertRaises(IndexError, lambda: s.iat[10]) - self.assertRaises(IndexError, lambda: s.iat[-10]) - - result = s.iloc[[2, 3]] - expected = Series([2, 3], [2, 2], dtype='int64') - tm.assert_series_equal(result, expected) - - df = s.to_frame() - result = df.iloc[2] - expected = Series(2, index=[0], name=2) - tm.assert_series_equal(result, expected) - - result = df.iat[2, 0] - expected = 2 - self.assertEqual(result, 2) - - def test_repeated_getitem_dups(self): - # GH 5678 - # repeated gettitems on a dup index returing a ndarray - df = DataFrame( - np.random.random_sample((20, 5)), - index=['ABCDE' [x % 5] for x in range(20)]) - expected = df.loc['A', 0] - result = df.loc[:, 0].loc['A'] - tm.assert_series_equal(result, expected) - - def test_iloc_exceeds_bounds(self): - - # GH6296 - # iloc should allow indexers that exceed the bounds - df = DataFrame(np.random.random_sample((20, 5)), columns=list('ABCDE')) - expected = df - - # lists of positions should raise IndexErrror! 
- with tm.assertRaisesRegexp(IndexError, - 'positional indexers are out-of-bounds'): - df.iloc[:, [0, 1, 2, 3, 4, 5]] - self.assertRaises(IndexError, lambda: df.iloc[[1, 30]]) - self.assertRaises(IndexError, lambda: df.iloc[[1, -30]]) - self.assertRaises(IndexError, lambda: df.iloc[[100]]) - - s = df['A'] - self.assertRaises(IndexError, lambda: s.iloc[[100]]) - self.assertRaises(IndexError, lambda: s.iloc[[-100]]) - - # still raise on a single indexer - msg = 'single positional indexer is out-of-bounds' - with tm.assertRaisesRegexp(IndexError, msg): - df.iloc[30] - self.assertRaises(IndexError, lambda: df.iloc[-30]) - - # GH10779 - # single positive/negative indexer exceeding Series bounds should raise - # an IndexError - with tm.assertRaisesRegexp(IndexError, msg): - s.iloc[30] - self.assertRaises(IndexError, lambda: s.iloc[-30]) - - # slices are ok - result = df.iloc[:, 4:10] # 0 < start < len < stop - expected = df.iloc[:, 4:] - tm.assert_frame_equal(result, expected) - - result = df.iloc[:, -4:-10] # stop < 0 < start < len - expected = df.iloc[:, :0] - tm.assert_frame_equal(result, expected) - - result = df.iloc[:, 10:4:-1] # 0 < stop < len < start (down) - expected = df.iloc[:, :4:-1] - tm.assert_frame_equal(result, expected) - - result = df.iloc[:, 4:-10:-1] # stop < 0 < start < len (down) - expected = df.iloc[:, 4::-1] - tm.assert_frame_equal(result, expected) - - result = df.iloc[:, -10:4] # start < 0 < stop < len - expected = df.iloc[:, :4] - tm.assert_frame_equal(result, expected) - - result = df.iloc[:, 10:4] # 0 < stop < len < start - expected = df.iloc[:, :0] - tm.assert_frame_equal(result, expected) - - result = df.iloc[:, -10:-11:-1] # stop < start < 0 < len (down) - expected = df.iloc[:, :0] - tm.assert_frame_equal(result, expected) - - result = df.iloc[:, 10:11] # 0 < len < start < stop - expected = df.iloc[:, :0] - tm.assert_frame_equal(result, expected) - - # slice bounds exceeding is ok - result = s.iloc[18:30] - expected = s.iloc[18:] - tm.assert_series_equal(result, expected) - - result = s.iloc[30:] - expected = s.iloc[:0] - tm.assert_series_equal(result, expected) - - result = s.iloc[30::-1] - expected = s.iloc[::-1] - tm.assert_series_equal(result, expected) - - # doc example - def check(result, expected): - str(result) - result.dtypes - tm.assert_frame_equal(result, expected) - - dfl = DataFrame(np.random.randn(5, 2), columns=list('AB')) - check(dfl.iloc[:, 2:3], DataFrame(index=dfl.index)) - check(dfl.iloc[:, 1:3], dfl.iloc[:, [1]]) - check(dfl.iloc[4:6], dfl.iloc[[4]]) - - self.assertRaises(IndexError, lambda: dfl.iloc[[4, 5, 6]]) - self.assertRaises(IndexError, lambda: dfl.iloc[:, 4]) - - def test_iloc_getitem_int(self): - - # integer - self.check_result('integer', 'iloc', 2, 'ix', - {0: 4, 1: 6, 2: 8}, typs=['ints']) - self.check_result('integer', 'iloc', 2, 'indexer', 2, - typs=['labels', 'mixed', 'ts', 'floats', 'empty'], - fails=IndexError) - - def test_iloc_getitem_neg_int(self): - - # neg integer - self.check_result('neg int', 'iloc', -1, 'ix', - {0: 6, 1: 9, 2: 12}, typs=['ints']) - self.check_result('neg int', 'iloc', -1, 'indexer', -1, - typs=['labels', 'mixed', 'ts', 'floats', 'empty'], - fails=IndexError) - - def test_iloc_getitem_list_int(self): - - # list of ints - self.check_result('list int', 'iloc', [0, 1, 2], 'ix', - {0: [0, 2, 4], 1: [0, 3, 6], 2: [0, 4, 8]}, - typs=['ints']) - self.check_result('list int', 'iloc', [2], 'ix', - {0: [4], 1: [6], 2: [8]}, typs=['ints']) - self.check_result('list int', 'iloc', [0, 1, 2], 'indexer', [0, 1, 2], - 
typs=['labels', 'mixed', 'ts', 'floats', 'empty'], - fails=IndexError) - - # array of ints (GH5006), make sure that a single indexer is returning - # the correct type - self.check_result('array int', 'iloc', np.array([0, 1, 2]), 'ix', - {0: [0, 2, 4], - 1: [0, 3, 6], - 2: [0, 4, 8]}, typs=['ints']) - self.check_result('array int', 'iloc', np.array([2]), 'ix', - {0: [4], 1: [6], 2: [8]}, typs=['ints']) - self.check_result('array int', 'iloc', np.array([0, 1, 2]), 'indexer', - [0, 1, 2], - typs=['labels', 'mixed', 'ts', 'floats', 'empty'], - fails=IndexError) - - def test_iloc_getitem_neg_int_can_reach_first_index(self): - # GH10547 and GH10779 - # negative integers should be able to reach index 0 - df = DataFrame({'A': [2, 3, 5], 'B': [7, 11, 13]}) - s = df['A'] - - expected = df.iloc[0] - result = df.iloc[-3] - tm.assert_series_equal(result, expected) - - expected = df.iloc[[0]] - result = df.iloc[[-3]] - tm.assert_frame_equal(result, expected) - - expected = s.iloc[0] - result = s.iloc[-3] - self.assertEqual(result, expected) - - expected = s.iloc[[0]] - result = s.iloc[[-3]] - tm.assert_series_equal(result, expected) - - # check the length 1 Series case highlighted in GH10547 - expected = pd.Series(['a'], index=['A']) - result = expected.iloc[[-1]] - tm.assert_series_equal(result, expected) - - def test_iloc_getitem_dups(self): - - # no dups in panel (bug?) - self.check_result('list int (dups)', 'iloc', [0, 1, 1, 3], 'ix', - {0: [0, 2, 2, 6], 1: [0, 3, 3, 9]}, - objs=['series', 'frame'], typs=['ints']) - - # GH 6766 - df1 = DataFrame([{'A': None, 'B': 1}, {'A': 2, 'B': 2}]) - df2 = DataFrame([{'A': 3, 'B': 3}, {'A': 4, 'B': 4}]) - df = concat([df1, df2], axis=1) - - # cross-sectional indexing - result = df.iloc[0, 0] - self.assertTrue(isnull(result)) - - result = df.iloc[0, :] - expected = Series([np.nan, 1, 3, 3], index=['A', 'B', 'A', 'B'], - name=0) - tm.assert_series_equal(result, expected) - - def test_iloc_getitem_array(self): - - # array like - s = Series(index=lrange(1, 4)) - self.check_result('array like', 'iloc', s.index, 'ix', - {0: [2, 4, 6], 1: [3, 6, 9], 2: [4, 8, 12]}, - typs=['ints']) - - def test_iloc_getitem_bool(self): - - # boolean indexers - b = [True, False, True, False, ] - self.check_result('bool', 'iloc', b, 'ix', b, typs=['ints']) - self.check_result('bool', 'iloc', b, 'ix', b, - typs=['labels', 'mixed', 'ts', 'floats', 'empty'], - fails=IndexError) - - def test_iloc_getitem_slice(self): - - # slices - self.check_result('slice', 'iloc', slice(1, 3), 'ix', - {0: [2, 4], 1: [3, 6], 2: [4, 8]}, - typs=['ints']) - self.check_result('slice', 'iloc', slice(1, 3), 'indexer', - slice(1, 3), - typs=['labels', 'mixed', 'ts', 'floats', 'empty'], - fails=IndexError) - - def test_iloc_getitem_slice_dups(self): - - df1 = DataFrame(np.random.randn(10, 4), columns=['A', 'A', 'B', 'B']) - df2 = DataFrame(np.random.randint(0, 10, size=20).reshape(10, 2), - columns=['A', 'C']) - - # axis=1 - df = concat([df1, df2], axis=1) - tm.assert_frame_equal(df.iloc[:, :4], df1) - tm.assert_frame_equal(df.iloc[:, 4:], df2) - - df = concat([df2, df1], axis=1) - tm.assert_frame_equal(df.iloc[:, :2], df2) - tm.assert_frame_equal(df.iloc[:, 2:], df1) - - exp = concat([df2, df1.iloc[:, [0]]], axis=1) - tm.assert_frame_equal(df.iloc[:, 0:3], exp) - - # axis=0 - df = concat([df, df], axis=0) - tm.assert_frame_equal(df.iloc[0:10, :2], df2) - tm.assert_frame_equal(df.iloc[0:10, 2:], df1) - tm.assert_frame_equal(df.iloc[10:, :2], df2) - tm.assert_frame_equal(df.iloc[10:, 2:], df1) - - def 
test_iloc_getitem_multiindex2(self): - # TODO(wesm): fix this - raise nose.SkipTest('this test was being suppressed, ' - 'needs to be fixed') - - arr = np.random.randn(3, 3) - df = DataFrame(arr, columns=[[2, 2, 4], [6, 8, 10]], - index=[[4, 4, 8], [8, 10, 12]]) - - rs = df.iloc[2] - xp = Series(arr[2], index=df.columns) - tm.assert_series_equal(rs, xp) - - rs = df.iloc[:, 2] - xp = Series(arr[:, 2], index=df.index) - tm.assert_series_equal(rs, xp) - - rs = df.iloc[2, 2] - xp = df.values[2, 2] - self.assertEqual(rs, xp) - - # for multiple items - # GH 5528 - rs = df.iloc[[0, 1]] - xp = df.xs(4, drop_level=False) - tm.assert_frame_equal(rs, xp) - - tup = zip(*[['a', 'a', 'b', 'b'], ['x', 'y', 'x', 'y']]) - index = MultiIndex.from_tuples(tup) - df = DataFrame(np.random.randn(4, 4), index=index) - rs = df.iloc[[2, 3]] - xp = df.xs('b', drop_level=False) - tm.assert_frame_equal(rs, xp) - - def test_iloc_setitem(self): - df = self.frame_ints - - df.iloc[1, 1] = 1 - result = df.iloc[1, 1] - self.assertEqual(result, 1) - - df.iloc[:, 2:3] = 0 - expected = df.iloc[:, 2:3] - result = df.iloc[:, 2:3] - tm.assert_frame_equal(result, expected) - - # GH5771 - s = Series(0, index=[4, 5, 6]) - s.iloc[1:2] += 1 - expected = Series([0, 1, 0], index=[4, 5, 6]) - tm.assert_series_equal(s, expected) - - def test_loc_setitem_slice(self): - # GH10503 - - # assigning the same type should not change the type - df1 = DataFrame({'a': [0, 1, 1], - 'b': Series([100, 200, 300], dtype='uint32')}) - ix = df1['a'] == 1 - newb1 = df1.loc[ix, 'b'] + 1 - df1.loc[ix, 'b'] = newb1 - expected = DataFrame({'a': [0, 1, 1], - 'b': Series([100, 201, 301], dtype='uint32')}) - tm.assert_frame_equal(df1, expected) - - # assigning a new type should get the inferred type - df2 = DataFrame({'a': [0, 1, 1], 'b': [100, 200, 300]}, - dtype='uint64') - ix = df1['a'] == 1 - newb2 = df2.loc[ix, 'b'] - df1.loc[ix, 'b'] = newb2 - expected = DataFrame({'a': [0, 1, 1], 'b': [100, 200, 300]}, - dtype='uint64') - tm.assert_frame_equal(df2, expected) - - def test_ix_loc_setitem_consistency(self): - - # GH 5771 - # loc with slice and series - s = Series(0, index=[4, 5, 6]) - s.loc[4:5] += 1 - expected = Series([1, 1, 0], index=[4, 5, 6]) - tm.assert_series_equal(s, expected) - - # GH 5928 - # chained indexing assignment - df = DataFrame({'a': [0, 1, 2]}) - expected = df.copy() - expected.ix[[0, 1, 2], 'a'] = -expected.ix[[0, 1, 2], 'a'] - - df['a'].ix[[0, 1, 2]] = -df['a'].ix[[0, 1, 2]] - tm.assert_frame_equal(df, expected) - - df = DataFrame({'a': [0, 1, 2], 'b': [0, 1, 2]}) - df['a'].ix[[0, 1, 2]] = -df['a'].ix[[0, 1, 2]].astype('float64') + 0.5 - expected = DataFrame({'a': [0.5, -0.5, -1.5], 'b': [0, 1, 2]}) - tm.assert_frame_equal(df, expected) - - # GH 8607 - # ix setitem consistency - df = DataFrame({'timestamp': [1413840976, 1413842580, 1413760580], - 'delta': [1174, 904, 161], - 'elapsed': [7673, 9277, 1470]}) - expected = DataFrame({'timestamp': pd.to_datetime( - [1413840976, 1413842580, 1413760580], unit='s'), - 'delta': [1174, 904, 161], - 'elapsed': [7673, 9277, 1470]}) - - df2 = df.copy() - df2['timestamp'] = pd.to_datetime(df['timestamp'], unit='s') - tm.assert_frame_equal(df2, expected) - - df2 = df.copy() - df2.loc[:, 'timestamp'] = pd.to_datetime(df['timestamp'], unit='s') - tm.assert_frame_equal(df2, expected) - - df2 = df.copy() - df2.ix[:, 2] = pd.to_datetime(df['timestamp'], unit='s') - tm.assert_frame_equal(df2, expected) - - def test_ix_loc_consistency(self): - - # GH 8613 - # some edge cases where ix/loc should return the same 
- # this is not an exhaustive case - - def compare(result, expected): - if is_scalar(expected): - self.assertEqual(result, expected) - else: - self.assertTrue(expected.equals(result)) - - # failure cases for .loc, but these work for .ix - df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD')) - for key in [slice(1, 3), tuple([slice(0, 2), slice(0, 2)]), - tuple([slice(0, 2), df.columns[0:2]])]: - - for index in [tm.makeStringIndex, tm.makeUnicodeIndex, - tm.makeDateIndex, tm.makePeriodIndex, - tm.makeTimedeltaIndex]: - df.index = index(len(df.index)) - df.ix[key] - - self.assertRaises(TypeError, lambda: df.loc[key]) - - df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'), - index=pd.date_range('2012-01-01', periods=5)) - - for key in ['2012-01-03', - '2012-01-31', - slice('2012-01-03', '2012-01-03'), - slice('2012-01-03', '2012-01-04'), - slice('2012-01-03', '2012-01-06', 2), - slice('2012-01-03', '2012-01-31'), - tuple([[True, True, True, False, True]]), ]: - - # getitem - - # if the expected raises, then compare the exceptions - try: - expected = df.ix[key] - except KeyError: - self.assertRaises(KeyError, lambda: df.loc[key]) - continue - - result = df.loc[key] - compare(result, expected) - - # setitem - df1 = df.copy() - df2 = df.copy() - - df1.ix[key] = 10 - df2.loc[key] = 10 - compare(df2, df1) - - # edge cases - s = Series([1, 2, 3, 4], index=list('abde')) - - result1 = s['a':'c'] - result2 = s.ix['a':'c'] - result3 = s.loc['a':'c'] - tm.assert_series_equal(result1, result2) - tm.assert_series_equal(result1, result3) - - # now work rather than raising KeyError - s = Series(range(5), [-2, -1, 1, 2, 3]) - - result1 = s.ix[-10:3] - result2 = s.loc[-10:3] - tm.assert_series_equal(result1, result2) - - result1 = s.ix[0:3] - result2 = s.loc[0:3] - tm.assert_series_equal(result1, result2) - - def test_setitem_multiindex(self): - for index_fn in ('ix', 'loc'): - - def check(target, indexers, value, compare_fn, expected=None): - fn = getattr(target, index_fn) - fn.__setitem__(indexers, value) - result = fn.__getitem__(indexers) - if expected is None: - expected = value - compare_fn(result, expected) - # GH7190 - index = pd.MultiIndex.from_product([np.arange(0, 100), - np.arange(0, 80)], - names=['time', 'firm']) - t, n = 0, 2 - df = DataFrame(np.nan, columns=['A', 'w', 'l', 'a', 'x', - 'X', 'd', 'profit'], - index=index) - check(target=df, indexers=((t, n), 'X'), value=0, - compare_fn=self.assertEqual) - - df = DataFrame(-999, columns=['A', 'w', 'l', 'a', 'x', - 'X', 'd', 'profit'], - index=index) - check(target=df, indexers=((t, n), 'X'), value=1, - compare_fn=self.assertEqual) - - df = DataFrame(columns=['A', 'w', 'l', 'a', 'x', - 'X', 'd', 'profit'], - index=index) - check(target=df, indexers=((t, n), 'X'), value=2, - compare_fn=self.assertEqual) - - # GH 7218, assinging with 0-dim arrays - df = DataFrame(-999, columns=['A', 'w', 'l', 'a', 'x', - 'X', 'd', 'profit'], - index=index) - check(target=df, - indexers=((t, n), 'X'), - value=np.array(3), - compare_fn=self.assertEqual, - expected=3, ) - - # GH5206 - df = pd.DataFrame(np.arange(25).reshape(5, 5), - columns='A,B,C,D,E'.split(','), dtype=float) - df['F'] = 99 - row_selection = df['A'] % 2 == 0 - col_selection = ['B', 'C'] - df.ix[row_selection, col_selection] = df['F'] - output = pd.DataFrame(99., index=[0, 2, 4], columns=['B', 'C']) - tm.assert_frame_equal(df.ix[row_selection, col_selection], output) - check(target=df, - indexers=(row_selection, col_selection), - value=df['F'], - compare_fn=tm.assert_frame_equal, - 
expected=output, ) - - # GH11372 - idx = pd.MultiIndex.from_product([ - ['A', 'B', 'C'], - pd.date_range('2015-01-01', '2015-04-01', freq='MS')]) - cols = pd.MultiIndex.from_product([ - ['foo', 'bar'], - pd.date_range('2016-01-01', '2016-02-01', freq='MS')]) - - df = pd.DataFrame(np.random.random((12, 4)), - index=idx, columns=cols) - - subidx = pd.MultiIndex.from_tuples( - [('A', pd.Timestamp('2015-01-01')), - ('A', pd.Timestamp('2015-02-01'))]) - subcols = pd.MultiIndex.from_tuples( - [('foo', pd.Timestamp('2016-01-01')), - ('foo', pd.Timestamp('2016-02-01'))]) - - vals = pd.DataFrame(np.random.random((2, 2)), - index=subidx, columns=subcols) - check(target=df, - indexers=(subidx, subcols), - value=vals, - compare_fn=tm.assert_frame_equal, ) - # set all columns - vals = pd.DataFrame( - np.random.random((2, 4)), index=subidx, columns=cols) - check(target=df, - indexers=(subidx, slice(None, None, None)), - value=vals, - compare_fn=tm.assert_frame_equal, ) - # identity - copy = df.copy() - check(target=df, indexers=(df.index, df.columns), value=df, - compare_fn=tm.assert_frame_equal, expected=copy) - - def test_indexing_with_datetime_tz(self): - - # 8260 - # support datetime64 with tz - - idx = Index(date_range('20130101', periods=3, tz='US/Eastern'), - name='foo') - dr = date_range('20130110', periods=3) - df = DataFrame({'A': idx, 'B': dr}) - df['C'] = idx - df.iloc[1, 1] = pd.NaT - df.iloc[1, 2] = pd.NaT - - # indexing - result = df.iloc[1] - expected = Series([Timestamp('2013-01-02 00:00:00-0500', - tz='US/Eastern'), np.nan, np.nan], - index=list('ABC'), dtype='object', name=1) - tm.assert_series_equal(result, expected) - result = df.loc[1] - expected = Series([Timestamp('2013-01-02 00:00:00-0500', - tz='US/Eastern'), np.nan, np.nan], - index=list('ABC'), dtype='object', name=1) - tm.assert_series_equal(result, expected) - - # indexing - fast_xs - df = DataFrame({'a': date_range('2014-01-01', periods=10, tz='UTC')}) - result = df.iloc[5] - expected = Timestamp('2014-01-06 00:00:00+0000', tz='UTC', freq='D') - self.assertEqual(result, expected) - - result = df.loc[5] - self.assertEqual(result, expected) - - # indexing - boolean - result = df[df.a > df.a[3]] - expected = df.iloc[4:] - tm.assert_frame_equal(result, expected) - - # indexing - setting an element - df = DataFrame(data=pd.to_datetime( - ['2015-03-30 20:12:32', '2015-03-12 00:11:11']), columns=['time']) - df['new_col'] = ['new', 'old'] - df.time = df.set_index('time').index.tz_localize('UTC') - v = df[df.new_col == 'new'].set_index('time').index.tz_convert( - 'US/Pacific') - - # trying to set a single element on a part of a different timezone - def f(): - df.loc[df.new_col == 'new', 'time'] = v - - self.assertRaises(ValueError, f) - - v = df.loc[df.new_col == 'new', 'time'] + pd.Timedelta('1s') - df.loc[df.new_col == 'new', 'time'] = v - tm.assert_series_equal(df.loc[df.new_col == 'new', 'time'], v) - - def test_indexing_with_datetimeindex_tz(self): - - # GH 12050 - # indexing on a series with a datetimeindex with tz - index = pd.date_range('2015-01-01', periods=2, tz='utc') - - ser = pd.Series(range(2), index=index, - dtype='int64') - - # list-like indexing - - for sel in (index, list(index)): - # getitem - tm.assert_series_equal(ser[sel], ser) - - # setitem - result = ser.copy() - result[sel] = 1 - expected = pd.Series(1, index=index) - tm.assert_series_equal(result, expected) - - # .loc getitem - tm.assert_series_equal(ser.loc[sel], ser) - - # .loc setitem - result = ser.copy() - result.loc[sel] = 1 - expected = pd.Series(1, 
index=index) - tm.assert_series_equal(result, expected) - - # single element indexing - - # getitem - self.assertEqual(ser[index[1]], 1) - - # setitem - result = ser.copy() - result[index[1]] = 5 - expected = pd.Series([0, 5], index=index) - tm.assert_series_equal(result, expected) - - # .loc getitem - self.assertEqual(ser.loc[index[1]], 1) - - # .loc setitem - result = ser.copy() - result.loc[index[1]] = 5 - expected = pd.Series([0, 5], index=index) - tm.assert_series_equal(result, expected) - - def test_loc_setitem_dups(self): - - # GH 6541 - df_orig = DataFrame( - {'me': list('rttti'), - 'foo': list('aaade'), - 'bar': np.arange(5, dtype='float64') * 1.34 + 2, - 'bar2': np.arange(5, dtype='float64') * -.34 + 2}).set_index('me') - - indexer = tuple(['r', ['bar', 'bar2']]) - df = df_orig.copy() - df.loc[indexer] *= 2.0 - tm.assert_series_equal(df.loc[indexer], 2.0 * df_orig.loc[indexer]) - - indexer = tuple(['r', 'bar']) - df = df_orig.copy() - df.loc[indexer] *= 2.0 - self.assertEqual(df.loc[indexer], 2.0 * df_orig.loc[indexer]) - - indexer = tuple(['t', ['bar', 'bar2']]) - df = df_orig.copy() - df.loc[indexer] *= 2.0 - tm.assert_frame_equal(df.loc[indexer], 2.0 * df_orig.loc[indexer]) - - def test_iloc_setitem_dups(self): - - # GH 6766 - # iloc with a mask aligning from another iloc - df1 = DataFrame([{'A': None, 'B': 1}, {'A': 2, 'B': 2}]) - df2 = DataFrame([{'A': 3, 'B': 3}, {'A': 4, 'B': 4}]) - df = concat([df1, df2], axis=1) - - expected = df.fillna(3) - expected['A'] = expected['A'].astype('float64') - inds = np.isnan(df.iloc[:, 0]) - mask = inds[inds].index - df.iloc[mask, 0] = df.iloc[mask, 2] - tm.assert_frame_equal(df, expected) - - # del a dup column across blocks - expected = DataFrame({0: [1, 2], 1: [3, 4]}) - expected.columns = ['B', 'B'] - del df['A'] - tm.assert_frame_equal(df, expected) - - # assign back to self - df.iloc[[0, 1], [0, 1]] = df.iloc[[0, 1], [0, 1]] - tm.assert_frame_equal(df, expected) - - # reversed x 2 - df.iloc[[1, 0], [0, 1]] = df.iloc[[1, 0], [0, 1]].reset_index( - drop=True) - df.iloc[[1, 0], [0, 1]] = df.iloc[[1, 0], [0, 1]].reset_index( - drop=True) - tm.assert_frame_equal(df, expected) - - def test_chained_getitem_with_lists(self): - - # GH6394 - # Regression in chained getitem indexing with embedded list-like from - # 0.12 - def check(result, expected): - tm.assert_numpy_array_equal(result, expected) - tm.assertIsInstance(result, np.ndarray) - - df = DataFrame({'A': 5 * [np.zeros(3)], 'B': 5 * [np.ones(3)]}) - expected = df['A'].iloc[2] - result = df.loc[2, 'A'] - check(result, expected) - result2 = df.iloc[2]['A'] - check(result2, expected) - result3 = df['A'].loc[2] - check(result3, expected) - result4 = df['A'].iloc[2] - check(result4, expected) - - def test_loc_getitem_int(self): - - # int label - self.check_result('int label', 'loc', 2, 'ix', 2, typs=['ints'], - axes=0) - self.check_result('int label', 'loc', 3, 'ix', 3, typs=['ints'], - axes=1) - self.check_result('int label', 'loc', 4, 'ix', 4, typs=['ints'], - axes=2) - self.check_result('int label', 'loc', 2, 'ix', 2, typs=['label'], - fails=KeyError) - - def test_loc_getitem_label(self): - - # label - self.check_result('label', 'loc', 'c', 'ix', 'c', typs=['labels'], - axes=0) - self.check_result('label', 'loc', 'null', 'ix', 'null', typs=['mixed'], - axes=0) - self.check_result('label', 'loc', 8, 'ix', 8, typs=['mixed'], axes=0) - self.check_result('label', 'loc', Timestamp('20130102'), 'ix', 1, - typs=['ts'], axes=0) - self.check_result('label', 'loc', 'c', 'ix', 'c', typs=['empty'], - 
fails=KeyError) - - def test_loc_getitem_label_out_of_range(self): - - # out of range label - self.check_result('label range', 'loc', 'f', 'ix', 'f', - typs=['ints', 'labels', 'mixed', 'ts'], - fails=KeyError) - self.check_result('label range', 'loc', 'f', 'ix', 'f', - typs=['floats'], fails=TypeError) - self.check_result('label range', 'loc', 20, 'ix', 20, - typs=['ints', 'mixed'], fails=KeyError) - self.check_result('label range', 'loc', 20, 'ix', 20, - typs=['labels'], fails=TypeError) - self.check_result('label range', 'loc', 20, 'ix', 20, typs=['ts'], - axes=0, fails=TypeError) - self.check_result('label range', 'loc', 20, 'ix', 20, typs=['floats'], - axes=0, fails=TypeError) - - def test_loc_getitem_label_list(self): - - # list of labels - self.check_result('list lbl', 'loc', [0, 2, 4], 'ix', [0, 2, 4], - typs=['ints'], axes=0) - self.check_result('list lbl', 'loc', [3, 6, 9], 'ix', [3, 6, 9], - typs=['ints'], axes=1) - self.check_result('list lbl', 'loc', [4, 8, 12], 'ix', [4, 8, 12], - typs=['ints'], axes=2) - self.check_result('list lbl', 'loc', ['a', 'b', 'd'], 'ix', - ['a', 'b', 'd'], typs=['labels'], axes=0) - self.check_result('list lbl', 'loc', ['A', 'B', 'C'], 'ix', - ['A', 'B', 'C'], typs=['labels'], axes=1) - self.check_result('list lbl', 'loc', ['Z', 'Y', 'W'], 'ix', - ['Z', 'Y', 'W'], typs=['labels'], axes=2) - self.check_result('list lbl', 'loc', [2, 8, 'null'], 'ix', - [2, 8, 'null'], typs=['mixed'], axes=0) - self.check_result('list lbl', 'loc', - [Timestamp('20130102'), Timestamp('20130103')], 'ix', - [Timestamp('20130102'), Timestamp('20130103')], - typs=['ts'], axes=0) - - self.check_result('list lbl', 'loc', [0, 1, 2], 'indexer', [0, 1, 2], - typs=['empty'], fails=KeyError) - self.check_result('list lbl', 'loc', [0, 2, 3], 'ix', [0, 2, 3], - typs=['ints'], axes=0, fails=KeyError) - self.check_result('list lbl', 'loc', [3, 6, 7], 'ix', [3, 6, 7], - typs=['ints'], axes=1, fails=KeyError) - self.check_result('list lbl', 'loc', [4, 8, 10], 'ix', [4, 8, 10], - typs=['ints'], axes=2, fails=KeyError) - - def test_loc_getitem_label_list_fails(self): - # fails - self.check_result('list lbl', 'loc', [20, 30, 40], 'ix', [20, 30, 40], - typs=['ints'], axes=1, fails=KeyError) - self.check_result('list lbl', 'loc', [20, 30, 40], 'ix', [20, 30, 40], - typs=['ints'], axes=2, fails=KeyError) - - def test_loc_getitem_label_array_like(self): - # array like - self.check_result('array like', 'loc', Series(index=[0, 2, 4]).index, - 'ix', [0, 2, 4], typs=['ints'], axes=0) - self.check_result('array like', 'loc', Series(index=[3, 6, 9]).index, - 'ix', [3, 6, 9], typs=['ints'], axes=1) - self.check_result('array like', 'loc', Series(index=[4, 8, 12]).index, - 'ix', [4, 8, 12], typs=['ints'], axes=2) - - def test_loc_getitem_bool(self): - # boolean indexers - b = [True, False, True, False] - self.check_result('bool', 'loc', b, 'ix', b, - typs=['ints', 'labels', 'mixed', 'ts', 'floats']) - self.check_result('bool', 'loc', b, 'ix', b, typs=['empty'], - fails=KeyError) - - def test_loc_getitem_int_slice(self): - - # ok - self.check_result('int slice2', 'loc', slice(2, 4), 'ix', [2, 4], - typs=['ints'], axes=0) - self.check_result('int slice2', 'loc', slice(3, 6), 'ix', [3, 6], - typs=['ints'], axes=1) - self.check_result('int slice2', 'loc', slice(4, 8), 'ix', [4, 8], - typs=['ints'], axes=2) - - # GH 3053 - # loc should treat integer slices like label slices - from itertools import product - - index = MultiIndex.from_tuples([t for t in product( - [6, 7, 8], ['a', 'b'])]) - df = 
DataFrame(np.random.randn(6, 6), index, index) - result = df.loc[6:8, :] - expected = df.ix[6:8, :] - tm.assert_frame_equal(result, expected) - - index = MultiIndex.from_tuples([t - for t in product( - [10, 20, 30], ['a', 'b'])]) - df = DataFrame(np.random.randn(6, 6), index, index) - result = df.loc[20:30, :] - expected = df.ix[20:30, :] - tm.assert_frame_equal(result, expected) - - # doc examples - result = df.loc[10, :] - expected = df.ix[10, :] - tm.assert_frame_equal(result, expected) - - result = df.loc[:, 10] - # expected = df.ix[:,10] (this fails) - expected = df[10] - tm.assert_frame_equal(result, expected) - - def test_loc_to_fail(self): - - # GH3449 - df = DataFrame(np.random.random((3, 3)), - index=['a', 'b', 'c'], - columns=['e', 'f', 'g']) - - # raise a KeyError? - self.assertRaises(KeyError, df.loc.__getitem__, - tuple([[1, 2], [1, 2]])) - - # GH 7496 - # loc should not fallback - - s = Series() - s.loc[1] = 1 - s.loc['a'] = 2 - - self.assertRaises(KeyError, lambda: s.loc[-1]) - self.assertRaises(KeyError, lambda: s.loc[[-1, -2]]) - - self.assertRaises(KeyError, lambda: s.loc[['4']]) - - s.loc[-1] = 3 - result = s.loc[[-1, -2]] - expected = Series([3, np.nan], index=[-1, -2]) - tm.assert_series_equal(result, expected) - - s['a'] = 2 - self.assertRaises(KeyError, lambda: s.loc[[-2]]) - - del s['a'] - - def f(): - s.loc[[-2]] = 0 - - self.assertRaises(KeyError, f) - - # inconsistency between .loc[values] and .loc[values,:] - # GH 7999 - df = DataFrame([['a'], ['b']], index=[1, 2], columns=['value']) - - def f(): - df.loc[[3], :] - - self.assertRaises(KeyError, f) - - def f(): - df.loc[[3]] - - self.assertRaises(KeyError, f) - - def test_at_to_fail(self): - # at should not fallback - # GH 7814 - s = Series([1, 2, 3], index=list('abc')) - result = s.at['a'] - self.assertEqual(result, 1) - self.assertRaises(ValueError, lambda: s.at[0]) - - df = DataFrame({'A': [1, 2, 3]}, index=list('abc')) - result = df.at['a', 'A'] - self.assertEqual(result, 1) - self.assertRaises(ValueError, lambda: df.at['a', 0]) - - s = Series([1, 2, 3], index=[3, 2, 1]) - result = s.at[1] - self.assertEqual(result, 3) - self.assertRaises(ValueError, lambda: s.at['a']) - - df = DataFrame({0: [1, 2, 3]}, index=[3, 2, 1]) - result = df.at[1, 0] - self.assertEqual(result, 3) - self.assertRaises(ValueError, lambda: df.at['a', 0]) - - # GH 13822, incorrect error string with non-unique columns when missing - # column is accessed - df = DataFrame({'x': [1.], 'y': [2.], 'z': [3.]}) - df.columns = ['x', 'x', 'z'] - - # Check that we get the correct value in the KeyError - self.assertRaisesRegexp(KeyError, r"\['y'\] not in index", - lambda: df[['x', 'y', 'z']]) - - def test_loc_getitem_label_slice(self): - - # label slices (with ints) - self.check_result('lab slice', 'loc', slice(1, 3), - 'ix', slice(1, 3), - typs=['labels', 'mixed', 'empty', 'ts', 'floats'], - fails=TypeError) - - # real label slices - self.check_result('lab slice', 'loc', slice('a', 'c'), - 'ix', slice('a', 'c'), typs=['labels'], axes=0) - self.check_result('lab slice', 'loc', slice('A', 'C'), - 'ix', slice('A', 'C'), typs=['labels'], axes=1) - self.check_result('lab slice', 'loc', slice('W', 'Z'), - 'ix', slice('W', 'Z'), typs=['labels'], axes=2) - - self.check_result('ts slice', 'loc', slice('20130102', '20130104'), - 'ix', slice('20130102', '20130104'), - typs=['ts'], axes=0) - self.check_result('ts slice', 'loc', slice('20130102', '20130104'), - 'ix', slice('20130102', '20130104'), - typs=['ts'], axes=1, fails=TypeError) - self.check_result('ts 
slice', 'loc', slice('20130102', '20130104'),
-                          'ix', slice('20130102', '20130104'),
-                          typs=['ts'], axes=2, fails=TypeError)
-
-        # GH 14316
-        self.check_result('ts slice rev', 'loc', slice('20130104', '20130102'),
-                          'indexer', [0, 1, 2], typs=['ts_rev'], axes=0)
-
-        self.check_result('mixed slice', 'loc', slice(2, 8), 'ix', slice(2, 8),
-                          typs=['mixed'], axes=0, fails=TypeError)
-        self.check_result('mixed slice', 'loc', slice(2, 8), 'ix', slice(2, 8),
-                          typs=['mixed'], axes=1, fails=KeyError)
-        self.check_result('mixed slice', 'loc', slice(2, 8), 'ix', slice(2, 8),
-                          typs=['mixed'], axes=2, fails=KeyError)
-
-        self.check_result('mixed slice', 'loc', slice(2, 4, 2), 'ix', slice(
-            2, 4, 2), typs=['mixed'], axes=0, fails=TypeError)
-
-    def test_loc_general(self):
-
-        df = DataFrame(
-            np.random.rand(4, 4), columns=['A', 'B', 'C', 'D'],
-            index=['A', 'B', 'C', 'D'])
-
-        # want this to work
-        result = df.loc[:, "A":"B"].iloc[0:2, :]
-        self.assertTrue((result.columns == ['A', 'B']).all())
-        self.assertTrue((result.index == ['A', 'B']).all())
-
-        # mixed type
-        result = DataFrame({'a': [Timestamp('20130101')], 'b': [1]}).iloc[0]
-        expected = Series([Timestamp('20130101'), 1], index=['a', 'b'], name=0)
-        tm.assert_series_equal(result, expected)
-        self.assertEqual(result.dtype, object)
-
-    def test_loc_setitem_consistency(self):
-        # GH 6149
-        # coerce similarly for setitem and loc when rows have a null-slice
-        expected = DataFrame({'date': Series(0, index=range(5),
-                                             dtype=np.int64),
-                              'val': Series(range(5), dtype=np.int64)})
-
-        df = DataFrame({'date': date_range('2000-01-01', '2000-01-5'),
-                        'val': Series(
-                            range(5), dtype=np.int64)})
-        df.loc[:, 'date'] = 0
-        tm.assert_frame_equal(df, expected)
-
-        df = DataFrame({'date': date_range('2000-01-01', '2000-01-5'),
-                        'val': Series(range(5), dtype=np.int64)})
-        df.loc[:, 'date'] = np.array(0, dtype=np.int64)
-        tm.assert_frame_equal(df, expected)
-
-        df = DataFrame({'date': date_range('2000-01-01', '2000-01-5'),
-                        'val': Series(range(5), dtype=np.int64)})
-        df.loc[:, 'date'] = np.array([0, 0, 0, 0, 0], dtype=np.int64)
-        tm.assert_frame_equal(df, expected)
-
-        expected = DataFrame({'date': Series('foo', index=range(5)),
-                              'val': Series(range(5), dtype=np.int64)})
-        df = DataFrame({'date': date_range('2000-01-01', '2000-01-5'),
-                        'val': Series(range(5), dtype=np.int64)})
-        df.loc[:, 'date'] = 'foo'
-        tm.assert_frame_equal(df, expected)
-
-        expected = DataFrame({'date': Series(1.0, index=range(5)),
-                              'val': Series(range(5), dtype=np.int64)})
-        df = DataFrame({'date': date_range('2000-01-01', '2000-01-5'),
-                        'val': Series(range(5), dtype=np.int64)})
-        df.loc[:, 'date'] = 1.0
-        tm.assert_frame_equal(df, expected)
-
-    def test_loc_setitem_consistency_empty(self):
-        # empty (essentially noops)
-        expected = DataFrame(columns=['x', 'y'])
-        expected['x'] = expected['x'].astype(np.int64)
-        df = DataFrame(columns=['x', 'y'])
-        df.loc[:, 'x'] = 1
-        tm.assert_frame_equal(df, expected)
-
-        df = DataFrame(columns=['x', 'y'])
-        df['x'] = 1
-        tm.assert_frame_equal(df, expected)
-
-    def test_loc_setitem_consistency_slice_column_len(self):
-        # .loc[:,column] setting with slice == len of the column
-        # GH10408
-        data = """Level_0,,,Respondent,Respondent,Respondent,OtherCat,OtherCat
-Level_1,,,Something,StartDate,EndDate,Yes/No,SomethingElse
-Region,Site,RespondentID,,,,,
-Region_1,Site_1,3987227376,A,5/25/2015 10:59,5/25/2015 11:22,Yes,
-Region_1,Site_1,3980680971,A,5/21/2015 9:40,5/21/2015 9:52,Yes,Yes
-Region_1,Site_2,3977723249,A,5/20/2015 8:27,5/20/2015 8:41,Yes,
-Region_1,Site_2,3977723089,A,5/20/2015 8:33,5/20/2015 9:09,Yes,No""" - - df = pd.read_csv(StringIO(data), header=[0, 1], index_col=[0, 1, 2]) - df.loc[:, ('Respondent', 'StartDate')] = pd.to_datetime(df.loc[:, ( - 'Respondent', 'StartDate')]) - df.loc[:, ('Respondent', 'EndDate')] = pd.to_datetime(df.loc[:, ( - 'Respondent', 'EndDate')]) - df.loc[:, ('Respondent', 'Duration')] = df.loc[:, ( - 'Respondent', 'EndDate')] - df.loc[:, ('Respondent', 'StartDate')] - - df.loc[:, ('Respondent', 'Duration')] = df.loc[:, ( - 'Respondent', 'Duration')].astype('timedelta64[s]') - expected = Series([1380, 720, 840, 2160.], index=df.index, - name=('Respondent', 'Duration')) - tm.assert_series_equal(df[('Respondent', 'Duration')], expected) - - def test_loc_setitem_frame(self): - df = self.frame_labels - - result = df.iloc[0, 0] - - df.loc['a', 'A'] = 1 - result = df.loc['a', 'A'] - self.assertEqual(result, 1) - - result = df.iloc[0, 0] - self.assertEqual(result, 1) - - df.loc[:, 'B':'D'] = 0 - expected = df.loc[:, 'B':'D'] - result = df.ix[:, 1:] - tm.assert_frame_equal(result, expected) - - # GH 6254 - # setting issue - df = DataFrame(index=[3, 5, 4], columns=['A']) - df.loc[[4, 3, 5], 'A'] = np.array([1, 2, 3], dtype='int64') - expected = DataFrame(dict(A=Series( - [1, 2, 3], index=[4, 3, 5]))).reindex(index=[3, 5, 4]) - tm.assert_frame_equal(df, expected) - - # GH 6252 - # setting with an empty frame - keys1 = ['@' + str(i) for i in range(5)] - val1 = np.arange(5, dtype='int64') - - keys2 = ['@' + str(i) for i in range(4)] - val2 = np.arange(4, dtype='int64') - - index = list(set(keys1).union(keys2)) - df = DataFrame(index=index) - df['A'] = nan - df.loc[keys1, 'A'] = val1 - - df['B'] = nan - df.loc[keys2, 'B'] = val2 - - expected = DataFrame(dict(A=Series(val1, index=keys1), B=Series( - val2, index=keys2))).reindex(index=index) - tm.assert_frame_equal(df, expected) - - # GH 8669 - # invalid coercion of nan -> int - df = DataFrame({'A': [1, 2, 3], 'B': np.nan}) - df.loc[df.B > df.A, 'B'] = df.A - expected = DataFrame({'A': [1, 2, 3], 'B': np.nan}) - tm.assert_frame_equal(df, expected) - - # GH 6546 - # setting with mixed labels - df = DataFrame({1: [1, 2], 2: [3, 4], 'a': ['a', 'b']}) - - result = df.loc[0, [1, 2]] - expected = Series([1, 3], index=[1, 2], dtype=object, name=0) - tm.assert_series_equal(result, expected) - - expected = DataFrame({1: [5, 2], 2: [6, 4], 'a': ['a', 'b']}) - df.loc[0, [1, 2]] = [5, 6] - tm.assert_frame_equal(df, expected) - - def test_loc_setitem_frame_multiples(self): - # multiple setting - df = DataFrame({'A': ['foo', 'bar', 'baz'], - 'B': Series( - range(3), dtype=np.int64)}) - rhs = df.loc[1:2] - rhs.index = df.index[0:2] - df.loc[0:1] = rhs - expected = DataFrame({'A': ['bar', 'baz', 'baz'], - 'B': Series( - [1, 2, 2], dtype=np.int64)}) - tm.assert_frame_equal(df, expected) - - # multiple setting with frame on rhs (with M8) - df = DataFrame({'date': date_range('2000-01-01', '2000-01-5'), - 'val': Series( - range(5), dtype=np.int64)}) - expected = DataFrame({'date': [Timestamp('20000101'), Timestamp( - '20000102'), Timestamp('20000101'), Timestamp('20000102'), - Timestamp('20000103')], - 'val': Series( - [0, 1, 0, 1, 2], dtype=np.int64)}) - rhs = df.loc[0:2] - rhs.index = df.index[2:5] - df.loc[2:4] = rhs - tm.assert_frame_equal(df, expected) - - def test_iloc_getitem_frame(self): - df = DataFrame(np.random.randn(10, 4), index=lrange(0, 20, 2), - columns=lrange(0, 8, 2)) - - result = df.iloc[2] - exp = df.ix[4] - tm.assert_series_equal(result, exp) - - result = 
df.iloc[2, 2]
-        exp = df.ix[4, 4]
-        self.assertEqual(result, exp)
-
-        # slice
-        result = df.iloc[4:8]
-        expected = df.ix[8:14]
-        tm.assert_frame_equal(result, expected)
-
-        result = df.iloc[:, 2:3]
-        expected = df.ix[:, 4:5]
-        tm.assert_frame_equal(result, expected)
-
-        # list of integers
-        result = df.iloc[[0, 1, 3]]
-        expected = df.ix[[0, 2, 6]]
-        tm.assert_frame_equal(result, expected)
-
-        result = df.iloc[[0, 1, 3], [0, 1]]
-        expected = df.ix[[0, 2, 6], [0, 2]]
-        tm.assert_frame_equal(result, expected)
-
-        # neg indices
-        result = df.iloc[[-1, 1, 3], [-1, 1]]
-        expected = df.ix[[18, 2, 6], [6, 2]]
-        tm.assert_frame_equal(result, expected)
-
-        # dups indices
-        result = df.iloc[[-1, -1, 1, 3], [-1, 1]]
-        expected = df.ix[[18, 18, 2, 6], [6, 2]]
-        tm.assert_frame_equal(result, expected)
-
-        # with index-like
-        s = Series(index=lrange(1, 5))
-        result = df.iloc[s.index]
-        expected = df.ix[[2, 4, 6, 8]]
-        tm.assert_frame_equal(result, expected)
-
-    def test_iloc_getitem_labelled_frame(self):
-        # try with labelled frame
-        df = DataFrame(np.random.randn(10, 4),
-                       index=list('abcdefghij'), columns=list('ABCD'))
-
-        result = df.iloc[1, 1]
-        exp = df.ix['b', 'B']
-        self.assertEqual(result, exp)
-
-        result = df.iloc[:, 2:3]
-        expected = df.ix[:, ['C']]
-        tm.assert_frame_equal(result, expected)
-
-        # negative indexing
-        result = df.iloc[-1, -1]
-        exp = df.ix['j', 'D']
-        self.assertEqual(result, exp)
-
-        # out-of-bounds exception
-        self.assertRaises(IndexError, df.iloc.__getitem__, tuple([10, 5]))
-
-        # trying to use a label
-        self.assertRaises(ValueError, df.iloc.__getitem__, tuple(['j', 'D']))
-
-    def test_iloc_getitem_panel(self):
-
-        # GH 7189
-        p = Panel(np.arange(4 * 3 * 2).reshape(4, 3, 2),
-                  items=['A', 'B', 'C', 'D'],
-                  major_axis=['a', 'b', 'c'],
-                  minor_axis=['one', 'two'])
-
-        result = p.iloc[1]
-        expected = p.loc['B']
-        tm.assert_frame_equal(result, expected)
-
-        result = p.iloc[1, 1]
-        expected = p.loc['B', 'b']
-        tm.assert_series_equal(result, expected)
-
-        result = p.iloc[1, 1, 1]
-        expected = p.loc['B', 'b', 'two']
-        self.assertEqual(result, expected)
-
-        # slice
-        result = p.iloc[1:3]
-        expected = p.loc[['B', 'C']]
-        tm.assert_panel_equal(result, expected)
-
-        result = p.iloc[:, 0:2]
-        expected = p.loc[:, ['a', 'b']]
-        tm.assert_panel_equal(result, expected)
-
-        # list of integers
-        result = p.iloc[[0, 2]]
-        expected = p.loc[['A', 'C']]
-        tm.assert_panel_equal(result, expected)
-
-        # neg indices
-        result = p.iloc[[-1, 1], [-1, 1]]
-        expected = p.loc[['D', 'B'], ['c', 'b']]
-        tm.assert_panel_equal(result, expected)
-
-        # dups indices
-        result = p.iloc[[-1, -1, 1], [-1, 1]]
-        expected = p.loc[['D', 'D', 'B'], ['c', 'b']]
-        tm.assert_panel_equal(result, expected)
-
-        # combined
-        result = p.iloc[0, [True, True], [0, 1]]
-        expected = p.loc['A', ['a', 'b'], ['one', 'two']]
-        tm.assert_frame_equal(result, expected)
-
-        # out-of-bounds exception
-        self.assertRaises(IndexError, p.iloc.__getitem__, tuple([10, 5]))
-
-        def f():
-            p.iloc[0, [True, True], [0, 1, 2]]
-
-        self.assertRaises(IndexError, f)
-
-        # trying to use a label
-        self.assertRaises(ValueError, p.iloc.__getitem__, tuple(['j', 'D']))
-
-        # GH
-        p = Panel(
-            np.random.rand(4, 3, 2), items=['A', 'B', 'C', 'D'],
-            major_axis=['U', 'V', 'W'], minor_axis=['X', 'Y'])
-        expected = p['A']
-
-        result = p.iloc[0, :, :]
-        tm.assert_frame_equal(result, expected)
-
-        result = p.iloc[0, [True, True, True], :]
-        tm.assert_frame_equal(result, expected)
-
-        result = p.iloc[0, [True, True, True], [0, 1]]
-        tm.assert_frame_equal(result,
expected) - - def f(): - p.iloc[0, [True, True, True], [0, 1, 2]] - - self.assertRaises(IndexError, f) - - def f(): - p.iloc[0, [True, True, True], [2]] - - self.assertRaises(IndexError, f) - - def test_iloc_getitem_panel_multiindex(self): - # GH 7199 - # Panel with multi-index - multi_index = pd.MultiIndex.from_tuples([('ONE', 'one'), - ('TWO', 'two'), - ('THREE', 'three')], - names=['UPPER', 'lower']) - - simple_index = [x[0] for x in multi_index] - wd1 = Panel(items=['First', 'Second'], major_axis=['a', 'b', 'c', 'd'], - minor_axis=multi_index) - - wd2 = Panel(items=['First', 'Second'], major_axis=['a', 'b', 'c', 'd'], - minor_axis=simple_index) - - expected1 = wd1['First'].iloc[[True, True, True, False], [0, 2]] - result1 = wd1.iloc[0, [True, True, True, False], [0, 2]] # WRONG - tm.assert_frame_equal(result1, expected1) - - expected2 = wd2['First'].iloc[[True, True, True, False], [0, 2]] - result2 = wd2.iloc[0, [True, True, True, False], [0, 2]] - tm.assert_frame_equal(result2, expected2) - - expected1 = DataFrame(index=['a'], columns=multi_index, - dtype='float64') - result1 = wd1.iloc[0, [0], [0, 1, 2]] - tm.assert_frame_equal(result1, expected1) - - expected2 = DataFrame(index=['a'], columns=simple_index, - dtype='float64') - result2 = wd2.iloc[0, [0], [0, 1, 2]] - tm.assert_frame_equal(result2, expected2) - - # GH 7516 - mi = MultiIndex.from_tuples([(0, 'x'), (1, 'y'), (2, 'z')]) - p = Panel(np.arange(3 * 3 * 3, dtype='int64').reshape(3, 3, 3), - items=['a', 'b', 'c'], major_axis=mi, - minor_axis=['u', 'v', 'w']) - result = p.iloc[:, 1, 0] - expected = Series([3, 12, 21], index=['a', 'b', 'c'], name='u') - tm.assert_series_equal(result, expected) - - result = p.loc[:, (1, 'y'), 'u'] - tm.assert_series_equal(result, expected) - - def test_iloc_getitem_doc_issue(self): - - # multi axis slicing issue with single block - # surfaced in GH 6059 - - arr = np.random.randn(6, 4) - index = date_range('20130101', periods=6) - columns = list('ABCD') - df = DataFrame(arr, index=index, columns=columns) - - # defines ref_locs - df.describe() - - result = df.iloc[3:5, 0:2] - str(result) - result.dtypes - - expected = DataFrame(arr[3:5, 0:2], index=index[3:5], - columns=columns[0:2]) - tm.assert_frame_equal(result, expected) - - # for dups - df.columns = list('aaaa') - result = df.iloc[3:5, 0:2] - str(result) - result.dtypes - - expected = DataFrame(arr[3:5, 0:2], index=index[3:5], - columns=list('aa')) - tm.assert_frame_equal(result, expected) - - # related - arr = np.random.randn(6, 4) - index = list(range(0, 12, 2)) - columns = list(range(0, 8, 2)) - df = DataFrame(arr, index=index, columns=columns) - - df._data.blocks[0].mgr_locs - result = df.iloc[1:5, 2:4] - str(result) - result.dtypes - expected = DataFrame(arr[1:5, 2:4], index=index[1:5], - columns=columns[2:4]) - tm.assert_frame_equal(result, expected) - - def test_setitem_ndarray_1d(self): - # GH5508 - - # len of indexer vs length of the 1d ndarray - df = DataFrame(index=Index(lrange(1, 11))) - df['foo'] = np.zeros(10, dtype=np.float64) - df['bar'] = np.zeros(10, dtype=np.complex) - - # invalid - def f(): - df.ix[2:5, 'bar'] = np.array([2.33j, 1.23 + 0.1j, 2.2]) - - self.assertRaises(ValueError, f) - - # valid - df.ix[2:5, 'bar'] = np.array([2.33j, 1.23 + 0.1j, 2.2, 1.0]) - - result = df.ix[2:5, 'bar'] - expected = Series([2.33j, 1.23 + 0.1j, 2.2, 1.0], index=[2, 3, 4, 5], - name='bar') - tm.assert_series_equal(result, expected) - - # dtype getting changed? 
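The two halves of `test_setitem_ndarray_1d` around this point pin down that a 1d value must match the length of the indexer before any assignment happens. A minimal sketch of that contract, spelled with `.loc` instead of the deprecated `.ix` used in the test (that substitution is my assumption):

```python
import numpy as np
import pandas as pd
import pytest

df = pd.DataFrame(index=pd.Index(range(1, 11)))
df['bar'] = np.zeros(10, dtype=complex)

# labels 2..5 select four rows; supplying three values must raise
with pytest.raises(ValueError):
    df.loc[2:5, 'bar'] = np.array([2.33j, 1.23 + 0.1j, 2.2])

# a length-matched value is accepted
df.loc[2:5, 'bar'] = np.array([2.33j, 1.23 + 0.1j, 2.2, 1.0])
```

The `# dtype getting changed?` block that resumes below asks the same question for a whole-frame slice assignment.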
- df = DataFrame(index=Index(lrange(1, 11))) - df['foo'] = np.zeros(10, dtype=np.float64) - df['bar'] = np.zeros(10, dtype=np.complex) - - def f(): - df[2:5] = np.arange(1, 4) * 1j - - self.assertRaises(ValueError, f) - - def test_iloc_setitem_series(self): - df = DataFrame(np.random.randn(10, 4), index=list('abcdefghij'), - columns=list('ABCD')) - - df.iloc[1, 1] = 1 - result = df.iloc[1, 1] - self.assertEqual(result, 1) - - df.iloc[:, 2:3] = 0 - expected = df.iloc[:, 2:3] - result = df.iloc[:, 2:3] - tm.assert_frame_equal(result, expected) - - s = Series(np.random.randn(10), index=lrange(0, 20, 2)) - - s.iloc[1] = 1 - result = s.iloc[1] - self.assertEqual(result, 1) - - s.iloc[:4] = 0 - expected = s.iloc[:4] - result = s.iloc[:4] - tm.assert_series_equal(result, expected) - - s = Series([-1] * 6) - s.iloc[0::2] = [0, 2, 4] - s.iloc[1::2] = [1, 3, 5] - result = s - expected = Series([0, 1, 2, 3, 4, 5]) - tm.assert_series_equal(result, expected) - - def test_iloc_setitem_list_of_lists(self): - - # GH 7551 - # list-of-list is set incorrectly in mixed vs. single dtyped frames - df = DataFrame(dict(A=np.arange(5, dtype='int64'), - B=np.arange(5, 10, dtype='int64'))) - df.iloc[2:4] = [[10, 11], [12, 13]] - expected = DataFrame(dict(A=[0, 1, 10, 12, 4], B=[5, 6, 11, 13, 9])) - tm.assert_frame_equal(df, expected) - - df = DataFrame( - dict(A=list('abcde'), B=np.arange(5, 10, dtype='int64'))) - df.iloc[2:4] = [['x', 11], ['y', 13]] - expected = DataFrame(dict(A=['a', 'b', 'x', 'y', 'e'], - B=[5, 6, 11, 13, 9])) - tm.assert_frame_equal(df, expected) - - def test_iloc_getitem_multiindex(self): - mi_labels = DataFrame(np.random.randn(4, 3), - columns=[['i', 'i', 'j'], ['A', 'A', 'B']], - index=[['i', 'i', 'j', 'k'], - ['X', 'X', 'Y', 'Y']]) - - mi_int = DataFrame(np.random.randn(3, 3), - columns=[[2, 2, 4], [6, 8, 10]], - index=[[4, 4, 8], [8, 10, 12]]) - - # the first row - rs = mi_int.iloc[0] - xp = mi_int.ix[4].ix[8] - tm.assert_series_equal(rs, xp, check_names=False) - self.assertEqual(rs.name, (4, 8)) - self.assertEqual(xp.name, 8) - - # 2nd (last) columns - rs = mi_int.iloc[:, 2] - xp = mi_int.ix[:, 2] - tm.assert_series_equal(rs, xp) - - # corner column - rs = mi_int.iloc[2, 2] - xp = mi_int.ix[:, 2].ix[2] - self.assertEqual(rs, xp) - - # this is basically regular indexing - rs = mi_labels.iloc[2, 2] - xp = mi_labels.ix['j'].ix[:, 'j'].ix[0, 0] - self.assertEqual(rs, xp) - - def test_loc_multiindex(self): - - mi_labels = DataFrame(np.random.randn(3, 3), - columns=[['i', 'i', 'j'], ['A', 'A', 'B']], - index=[['i', 'i', 'j'], ['X', 'X', 'Y']]) - - mi_int = DataFrame(np.random.randn(3, 3), - columns=[[2, 2, 4], [6, 8, 10]], - index=[[4, 4, 8], [8, 10, 12]]) - - # the first row - rs = mi_labels.loc['i'] - xp = mi_labels.ix['i'] - tm.assert_frame_equal(rs, xp) - - # 2nd (last) columns - rs = mi_labels.loc[:, 'j'] - xp = mi_labels.ix[:, 'j'] - tm.assert_frame_equal(rs, xp) - - # corner column - rs = mi_labels.loc['j'].loc[:, 'j'] - xp = mi_labels.ix['j'].ix[:, 'j'] - tm.assert_frame_equal(rs, xp) - - # with a tuple - rs = mi_labels.loc[('i', 'X')] - xp = mi_labels.ix[('i', 'X')] - tm.assert_frame_equal(rs, xp) - - rs = mi_int.loc[4] - xp = mi_int.ix[4] - tm.assert_frame_equal(rs, xp) - - def test_loc_multiindex_indexer_none(self): - - # GH6788 - # multi-index indexer is None (meaning take all) - attributes = ['Attribute' + str(i) for i in range(1)] - attribute_values = ['Value' + str(i) for i in range(5)] - - index = MultiIndex.from_product([attributes, attribute_values]) - df = 0.1 * 
np.random.randn(10, 1 * 5) + 0.5
-        df = DataFrame(df, columns=index)
-        result = df[attributes]
-        tm.assert_frame_equal(result, df)
-
-        # GH 7349
-        # loc with a multi-index seems to be doing fallback
-        df = DataFrame(np.arange(12).reshape(-1, 1),
-                       index=pd.MultiIndex.from_product([[1, 2, 3, 4],
-                                                         [1, 2, 3]]))
-
-        expected = df.loc[([1, 2], ), :]
-        result = df.loc[[1, 2]]
-        tm.assert_frame_equal(result, expected)
-
-    def test_loc_multiindex_incomplete(self):
-
-        # GH 7399
-        # incomplete indexers
-        s = pd.Series(np.arange(15, dtype='int64'),
-                      MultiIndex.from_product([range(5), ['a', 'b', 'c']]))
-        expected = s.loc[:, 'a':'c']
-
-        result = s.loc[0:4, 'a':'c']
-        tm.assert_series_equal(result, expected)
-
-        result = s.loc[:4, 'a':'c']
-        tm.assert_series_equal(result, expected)
-
-        result = s.loc[0:, 'a':'c']
-        tm.assert_series_equal(result, expected)
-
-        # GH 7400
-        # multiindexer getitem with list of indexers skips wrong element
-        s = pd.Series(np.arange(15, dtype='int64'),
-                      MultiIndex.from_product([range(5), ['a', 'b', 'c']]))
-        expected = s.iloc[[6, 7, 8, 12, 13, 14]]
-        result = s.loc[2:4:2, 'a':'c']
-        tm.assert_series_equal(result, expected)
-
-    def test_multiindex_perf_warn(self):
-
-        if sys.version_info < (2, 7):
-            raise nose.SkipTest('python version < 2.7')
-
-        df = DataFrame({'jim': [0, 0, 1, 1],
-                        'joe': ['x', 'x', 'z', 'y'],
-                        'jolie': np.random.rand(4)}).set_index(['jim', 'joe'])
-
-        with tm.assert_produces_warning(PerformanceWarning,
-                                        clear=[pd.core.index]):
-            df.loc[(1, 'z')]
-
-        df = df.iloc[[2, 1, 3, 0]]
-        with tm.assert_produces_warning(PerformanceWarning):
-            df.loc[(0, )]
-
-    def test_series_getitem_multiindex(self):
-
-        # GH 6018
-        # series regression getitem with a multi-index
-
-        s = Series([1, 2, 3])
-        s.index = MultiIndex.from_tuples([(0, 0), (1, 1), (2, 1)])
-
-        result = s[:, 0]
-        expected = Series([1], index=[0])
-        tm.assert_series_equal(result, expected)
-
-        result = s.ix[:, 1]
-        expected = Series([2, 3], index=[1, 2])
-        tm.assert_series_equal(result, expected)
-
-        # xs
-        result = s.xs(0, level=0)
-        expected = Series([1], index=[0])
-        tm.assert_series_equal(result, expected)
-
-        result = s.xs(1, level=1)
-        expected = Series([2, 3], index=[1, 2])
-        tm.assert_series_equal(result, expected)
-
-        # GH6258
-        dt = list(date_range('20130903', periods=3))
-        idx = MultiIndex.from_product([list('AB'), dt])
-        s = Series([1, 3, 4, 1, 3, 4], index=idx)
-
-        result = s.xs('20130903', level=1)
-        expected = Series([1, 1], index=list('AB'))
-        tm.assert_series_equal(result, expected)
-
-        # GH5684
-        idx = MultiIndex.from_tuples([('a', 'one'), ('a', 'two'), ('b', 'one'),
-                                      ('b', 'two')])
-        s = Series([1, 2, 3, 4], index=idx)
-        s.index.set_names(['L1', 'L2'], inplace=True)
-        result = s.xs('one', level='L2')
-        expected = Series([1, 3], index=['a', 'b'])
-        expected.index.set_names(['L1'], inplace=True)
-        tm.assert_series_equal(result, expected)
-
-    def test_ix_general(self):
-
-        # ix general issues
-
-        # GH 2817
-        data = {'amount': {0: 700, 1: 600, 2: 222, 3: 333, 4: 444},
-                'col': {0: 3.5, 1: 3.5, 2: 4.0, 3: 4.0, 4: 4.0},
-                'year': {0: 2012, 1: 2011, 2: 2012, 3: 2012, 4: 2012}}
-        df = DataFrame(data).set_index(keys=['col', 'year'])
-        key = 4.0, 2012
-
-        # emits a PerformanceWarning, ok
-        with self.assert_produces_warning(PerformanceWarning):
-            tm.assert_frame_equal(df.ix[key], df.iloc[2:])
-
-        # this is ok
-        df.sortlevel(inplace=True)
-        res = df.ix[key]
-        # col 
has float dtype, result should be Float64Index - index = MultiIndex.from_arrays([[4.] * 3, [2012] * 3], - names=['col', 'year']) - expected = DataFrame({'amount': [222, 333, 444]}, index=index) - tm.assert_frame_equal(res, expected) - - def test_ix_weird_slicing(self): - # http://stackoverflow.com/q/17056560/1240268 - df = DataFrame({'one': [1, 2, 3, np.nan, np.nan], - 'two': [1, 2, 3, 4, 5]}) - df.ix[df['one'] > 1, 'two'] = -df['two'] - - expected = DataFrame({'one': {0: 1.0, - 1: 2.0, - 2: 3.0, - 3: nan, - 4: nan}, - 'two': {0: 1, - 1: -2, - 2: -3, - 3: 4, - 4: 5}}) - tm.assert_frame_equal(df, expected) - - def test_xs_multiindex(self): - - # GH2903 - columns = MultiIndex.from_tuples( - [('a', 'foo'), ('a', 'bar'), ('b', 'hello'), - ('b', 'world')], names=['lvl0', 'lvl1']) - df = DataFrame(np.random.randn(4, 4), columns=columns) - df.sortlevel(axis=1, inplace=True) - result = df.xs('a', level='lvl0', axis=1) - expected = df.iloc[:, 0:2].loc[:, 'a'] - tm.assert_frame_equal(result, expected) - - result = df.xs('foo', level='lvl1', axis=1) - expected = df.iloc[:, 1:2].copy() - expected.columns = expected.columns.droplevel('lvl1') - tm.assert_frame_equal(result, expected) - - def test_per_axis_per_level_getitem(self): - - # GH6134 - # example test case - ix = MultiIndex.from_product([_mklbl('A', 5), _mklbl('B', 7), _mklbl( - 'C', 4), _mklbl('D', 2)]) - df = DataFrame(np.arange(len(ix.get_values())), index=ix) - - result = df.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :] - expected = df.loc[[tuple([a, b, c, d]) - for a, b, c, d in df.index.values - if (a == 'A1' or a == 'A2' or a == 'A3') and ( - c == 'C1' or c == 'C3')]] - tm.assert_frame_equal(result, expected) - - expected = df.loc[[tuple([a, b, c, d]) - for a, b, c, d in df.index.values - if (a == 'A1' or a == 'A2' or a == 'A3') and ( - c == 'C1' or c == 'C2' or c == 'C3')]] - result = df.loc[(slice('A1', 'A3'), slice(None), slice('C1', 'C3')), :] - tm.assert_frame_equal(result, expected) - - # test multi-index slicing with per axis and per index controls - index = MultiIndex.from_tuples([('A', 1), ('A', 2), - ('A', 3), ('B', 1)], - names=['one', 'two']) - columns = MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'), - ('b', 'foo'), ('b', 'bah')], - names=['lvl0', 'lvl1']) - - df = DataFrame( - np.arange(16, dtype='int64').reshape( - 4, 4), index=index, columns=columns) - df = df.sortlevel(axis=0).sortlevel(axis=1) - - # identity - result = df.loc[(slice(None), slice(None)), :] - tm.assert_frame_equal(result, df) - result = df.loc[(slice(None), slice(None)), (slice(None), slice(None))] - tm.assert_frame_equal(result, df) - result = df.loc[:, (slice(None), slice(None))] - tm.assert_frame_equal(result, df) - - # index - result = df.loc[(slice(None), [1]), :] - expected = df.iloc[[0, 3]] - tm.assert_frame_equal(result, expected) - - result = df.loc[(slice(None), 1), :] - expected = df.iloc[[0, 3]] - tm.assert_frame_equal(result, expected) - - # columns - result = df.loc[:, (slice(None), ['foo'])] - expected = df.iloc[:, [1, 3]] - tm.assert_frame_equal(result, expected) - - # both - result = df.loc[(slice(None), 1), (slice(None), ['foo'])] - expected = df.iloc[[0, 3], [1, 3]] - tm.assert_frame_equal(result, expected) - - result = df.loc['A', 'a'] - expected = DataFrame(dict(bar=[1, 5, 9], foo=[0, 4, 8]), - index=Index([1, 2, 3], name='two'), - columns=Index(['bar', 'foo'], name='lvl1')) - tm.assert_frame_equal(result, expected) - - result = df.loc[(slice(None), [1, 2]), :] - expected = df.iloc[[0, 1, 3]] - 
tm.assert_frame_equal(result, expected) - - # multi-level series - s = Series(np.arange(len(ix.get_values())), index=ix) - result = s.loc['A1':'A3', :, ['C1', 'C3']] - expected = s.loc[[tuple([a, b, c, d]) - for a, b, c, d in s.index.values - if (a == 'A1' or a == 'A2' or a == 'A3') and ( - c == 'C1' or c == 'C3')]] - tm.assert_series_equal(result, expected) - - # boolean indexers - result = df.loc[(slice(None), df.loc[:, ('a', 'bar')] > 5), :] - expected = df.iloc[[2, 3]] - tm.assert_frame_equal(result, expected) - - def f(): - df.loc[(slice(None), np.array([True, False])), :] - - self.assertRaises(ValueError, f) - - # ambiguous cases - # these can be multiply interpreted (e.g. in this case - # as df.loc[slice(None),[1]] as well - self.assertRaises(KeyError, lambda: df.loc[slice(None), [1]]) - - result = df.loc[(slice(None), [1]), :] - expected = df.iloc[[0, 3]] - tm.assert_frame_equal(result, expected) - - # not lexsorted - self.assertEqual(df.index.lexsort_depth, 2) - df = df.sortlevel(level=1, axis=0) - self.assertEqual(df.index.lexsort_depth, 0) - with tm.assertRaisesRegexp( - KeyError, - 'MultiIndex Slicing requires the index to be fully ' - r'lexsorted tuple len \(2\), lexsort depth \(0\)'): - df.loc[(slice(None), df.loc[:, ('a', 'bar')] > 5), :] - - def test_multiindex_slicers_non_unique(self): - - # GH 7106 - # non-unique mi index support - df = (DataFrame(dict(A=['foo', 'foo', 'foo', 'foo'], - B=['a', 'a', 'a', 'a'], - C=[1, 2, 1, 3], - D=[1, 2, 3, 4])) - .set_index(['A', 'B', 'C']).sortlevel()) - self.assertFalse(df.index.is_unique) - expected = (DataFrame(dict(A=['foo', 'foo'], B=['a', 'a'], - C=[1, 1], D=[1, 3])) - .set_index(['A', 'B', 'C']).sortlevel()) - result = df.loc[(slice(None), slice(None), 1), :] - tm.assert_frame_equal(result, expected) - - # this is equivalent of an xs expression - result = df.xs(1, level=2, drop_level=False) - tm.assert_frame_equal(result, expected) - - df = (DataFrame(dict(A=['foo', 'foo', 'foo', 'foo'], - B=['a', 'a', 'a', 'a'], - C=[1, 2, 1, 2], - D=[1, 2, 3, 4])) - .set_index(['A', 'B', 'C']).sortlevel()) - self.assertFalse(df.index.is_unique) - expected = (DataFrame(dict(A=['foo', 'foo'], B=['a', 'a'], - C=[1, 1], D=[1, 3])) - .set_index(['A', 'B', 'C']).sortlevel()) - result = df.loc[(slice(None), slice(None), 1), :] - self.assertFalse(result.index.is_unique) - tm.assert_frame_equal(result, expected) - - # GH12896 - # numpy-implementation dependent bug - ints = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 14, 14, 16, - 17, 18, 19, 200000, 200000] - n = len(ints) - idx = MultiIndex.from_arrays([['a'] * n, ints]) - result = Series([1] * n, index=idx) - result = result.sort_index() - result = result.loc[(slice(None), slice(100000))] - expected = Series([1] * (n - 2), index=idx[:-2]).sort_index() - tm.assert_series_equal(result, expected) - - def test_multiindex_slicers_datetimelike(self): - - # GH 7429 - # buggy/inconsistent behavior when slicing with datetime-like - import datetime - dates = [datetime.datetime(2012, 1, 1, 12, 12, 12) + - datetime.timedelta(days=i) for i in range(6)] - freq = [1, 2] - index = MultiIndex.from_product( - [dates, freq], names=['date', 'frequency']) - - df = DataFrame( - np.arange(6 * 2 * 4, dtype='int64').reshape( - -1, 4), index=index, columns=list('ABCD')) - - # multi-axis slicing - idx = pd.IndexSlice - expected = df.iloc[[0, 2, 4], [0, 1]] - result = df.loc[(slice(Timestamp('2012-01-01 12:12:12'), - Timestamp('2012-01-03 12:12:12')), - slice(1, 1)), slice('A', 'B')] - tm.assert_frame_equal(result, expected) 
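These datetimelike slicer checks exercise two spellings of one selection; `pd.IndexSlice` (bound to `idx` above) just builds the tuple of slices. A minimal, self-contained sketch of the equivalence, with a smaller index of my own choosing:

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice
index = pd.MultiIndex.from_product(
    [pd.date_range('2012-01-01 12:12:12', periods=3), [1, 2]],
    names=['date', 'frequency'])
df = pd.DataFrame(np.arange(6 * 4).reshape(6, 4),
                  index=index, columns=list('ABCD'))

# the tuple-of-slices spelling and the IndexSlice spelling select the same rows
lhs = df.loc[(slice(None), slice(1, 1)), 'A':'B']
rhs = df.loc[idx[:, 1:1], 'A':'B']
assert lhs.equals(rhs)
```

Both spellings require a lexsorted MultiIndex, which is why the lexsort-depth KeyError is asserted earlier in this file.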
- - result = df.loc[(idx[Timestamp('2012-01-01 12:12:12'):Timestamp( - '2012-01-03 12:12:12')], idx[1:1]), slice('A', 'B')] - tm.assert_frame_equal(result, expected) - - result = df.loc[(slice(Timestamp('2012-01-01 12:12:12'), - Timestamp('2012-01-03 12:12:12')), 1), - slice('A', 'B')] - tm.assert_frame_equal(result, expected) - - # with strings - result = df.loc[(slice('2012-01-01 12:12:12', '2012-01-03 12:12:12'), - slice(1, 1)), slice('A', 'B')] - tm.assert_frame_equal(result, expected) - - result = df.loc[(idx['2012-01-01 12:12:12':'2012-01-03 12:12:12'], 1), - idx['A', 'B']] - tm.assert_frame_equal(result, expected) - - def test_multiindex_slicers_edges(self): - # GH 8132 - # various edge cases - df = DataFrame( - {'A': ['A0'] * 5 + ['A1'] * 5 + ['A2'] * 5, - 'B': ['B0', 'B0', 'B1', 'B1', 'B2'] * 3, - 'DATE': ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30", - "2013-08-06", "2013-06-11", "2013-07-02", "2013-07-09", - "2013-07-30", "2013-08-06", "2013-09-03", "2013-10-01", - "2013-07-09", "2013-08-06", "2013-09-03"], - 'VALUES': [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3, 4, 2]}) - - df['DATE'] = pd.to_datetime(df['DATE']) - df1 = df.set_index(['A', 'B', 'DATE']) - df1 = df1.sortlevel() - - # A1 - Get all values under "A0" and "A1" - result = df1.loc[(slice('A1')), :] - expected = df1.iloc[0:10] - tm.assert_frame_equal(result, expected) - - # A2 - Get all values from the start to "A2" - result = df1.loc[(slice('A2')), :] - expected = df1 - tm.assert_frame_equal(result, expected) - - # A3 - Get all values under "B1" or "B2" - result = df1.loc[(slice(None), slice('B1', 'B2')), :] - expected = df1.iloc[[2, 3, 4, 7, 8, 9, 12, 13, 14]] - tm.assert_frame_equal(result, expected) - - # A4 - Get all values between 2013-07-02 and 2013-07-09 - result = df1.loc[(slice(None), slice(None), - slice('20130702', '20130709')), :] - expected = df1.iloc[[1, 2, 6, 7, 12]] - tm.assert_frame_equal(result, expected) - - # B1 - Get all values in B0 that are also under A0, A1 and A2 - result = df1.loc[(slice('A2'), slice('B0')), :] - expected = df1.iloc[[0, 1, 5, 6, 10, 11]] - tm.assert_frame_equal(result, expected) - - # B2 - Get all values in B0, B1 and B2 (similar to what #2 is doing for - # the As) - result = df1.loc[(slice(None), slice('B2')), :] - expected = df1 - tm.assert_frame_equal(result, expected) - - # B3 - Get all values from B1 to B2 and up to 2013-08-06 - result = df1.loc[(slice(None), slice('B1', 'B2'), - slice('2013-08-06')), :] - expected = df1.iloc[[2, 3, 4, 7, 8, 9, 12, 13]] - tm.assert_frame_equal(result, expected) - - # B4 - Same as A4 but the start of the date slice is not a key. 
- # shows indexing on a partial selection slice - result = df1.loc[(slice(None), slice(None), - slice('20130701', '20130709')), :] - expected = df1.iloc[[1, 2, 6, 7, 12]] - tm.assert_frame_equal(result, expected) - - def test_per_axis_per_level_doc_examples(self): - - # test index maker - idx = pd.IndexSlice - - # from indexing.rst / advanced - index = MultiIndex.from_product([_mklbl('A', 4), _mklbl('B', 2), - _mklbl('C', 4), _mklbl('D', 2)]) - columns = MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'), - ('b', 'foo'), ('b', 'bah')], - names=['lvl0', 'lvl1']) - df = DataFrame(np.arange(len(index) * len(columns), dtype='int64') - .reshape((len(index), len(columns))), - index=index, columns=columns) - result = df.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :] - expected = df.loc[[tuple([a, b, c, d]) - for a, b, c, d in df.index.values - if (a == 'A1' or a == 'A2' or a == 'A3') and ( - c == 'C1' or c == 'C3')]] - tm.assert_frame_equal(result, expected) - result = df.loc[idx['A1':'A3', :, ['C1', 'C3']], :] - tm.assert_frame_equal(result, expected) - - result = df.loc[(slice(None), slice(None), ['C1', 'C3']), :] - expected = df.loc[[tuple([a, b, c, d]) - for a, b, c, d in df.index.values - if (c == 'C1' or c == 'C3')]] - tm.assert_frame_equal(result, expected) - result = df.loc[idx[:, :, ['C1', 'C3']], :] - tm.assert_frame_equal(result, expected) - - # not sorted - def f(): - df.loc['A1', (slice(None), 'foo')] - - self.assertRaises(KeyError, f) - df = df.sortlevel(axis=1) - - # slicing - df.loc['A1', (slice(None), 'foo')] - df.loc[(slice(None), slice(None), ['C1', 'C3']), (slice(None), 'foo')] - - # setitem - df.loc(axis=0)[:, :, ['C1', 'C3']] = -10 - - def test_loc_axis_arguments(self): - - index = MultiIndex.from_product([_mklbl('A', 4), _mklbl('B', 2), - _mklbl('C', 4), _mklbl('D', 2)]) - columns = MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'), - ('b', 'foo'), ('b', 'bah')], - names=['lvl0', 'lvl1']) - df = DataFrame(np.arange(len(index) * len(columns), dtype='int64') - .reshape((len(index), len(columns))), - index=index, - columns=columns).sortlevel().sortlevel(axis=1) - - # axis 0 - result = df.loc(axis=0)['A1':'A3', :, ['C1', 'C3']] - expected = df.loc[[tuple([a, b, c, d]) - for a, b, c, d in df.index.values - if (a == 'A1' or a == 'A2' or a == 'A3') and ( - c == 'C1' or c == 'C3')]] - tm.assert_frame_equal(result, expected) - - result = df.loc(axis='index')[:, :, ['C1', 'C3']] - expected = df.loc[[tuple([a, b, c, d]) - for a, b, c, d in df.index.values - if (c == 'C1' or c == 'C3')]] - tm.assert_frame_equal(result, expected) - - # axis 1 - result = df.loc(axis=1)[:, 'foo'] - expected = df.loc[:, (slice(None), 'foo')] - tm.assert_frame_equal(result, expected) - - result = df.loc(axis='columns')[:, 'foo'] - expected = df.loc[:, (slice(None), 'foo')] - tm.assert_frame_equal(result, expected) - - # invalid axis - def f(): - df.loc(axis=-1)[:, :, ['C1', 'C3']] - - self.assertRaises(ValueError, f) - - def f(): - df.loc(axis=2)[:, :, ['C1', 'C3']] - - self.assertRaises(ValueError, f) - - def f(): - df.loc(axis='foo')[:, :, ['C1', 'C3']] - - self.assertRaises(ValueError, f) - - def test_loc_coerceion(self): - - # 12411 - df = DataFrame({'date': [pd.Timestamp('20130101').tz_localize('UTC'), - pd.NaT]}) - expected = df.dtypes - - result = df.iloc[[0]] - tm.assert_series_equal(result.dtypes, expected) - - result = df.iloc[[1]] - tm.assert_series_equal(result.dtypes, expected) - - # 12045 - import datetime - df = DataFrame({'date': [datetime.datetime(2012, 1, 1), - 
datetime.datetime(1012, 1, 2)]}) - expected = df.dtypes - - result = df.iloc[[0]] - tm.assert_series_equal(result.dtypes, expected) - - result = df.iloc[[1]] - tm.assert_series_equal(result.dtypes, expected) - - # 11594 - df = DataFrame({'text': ['some words'] + [None] * 9}) - expected = df.dtypes - - result = df.iloc[0:2] - tm.assert_series_equal(result.dtypes, expected) - - result = df.iloc[3:] - tm.assert_series_equal(result.dtypes, expected) - - def test_per_axis_per_level_setitem(self): - - # test index maker - idx = pd.IndexSlice - - # test multi-index slicing with per axis and per index controls - index = MultiIndex.from_tuples([('A', 1), ('A', 2), - ('A', 3), ('B', 1)], - names=['one', 'two']) - columns = MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'), - ('b', 'foo'), ('b', 'bah')], - names=['lvl0', 'lvl1']) - - df_orig = DataFrame( - np.arange(16, dtype='int64').reshape( - 4, 4), index=index, columns=columns) - df_orig = df_orig.sortlevel(axis=0).sortlevel(axis=1) - # identity - df = df_orig.copy() - df.loc[(slice(None), slice(None)), :] = 100 - expected = df_orig.copy() - expected.iloc[:, :] = 100 - tm.assert_frame_equal(df, expected) - - df = df_orig.copy() - df.loc(axis=0)[:, :] = 100 - expected = df_orig.copy() - expected.iloc[:, :] = 100 - tm.assert_frame_equal(df, expected) - - df = df_orig.copy() - df.loc[(slice(None), slice(None)), (slice(None), slice(None))] = 100 - expected = df_orig.copy() - expected.iloc[:, :] = 100 - tm.assert_frame_equal(df, expected) +""" test fancy indexing & misc """ - df = df_orig.copy() - df.loc[:, (slice(None), slice(None))] = 100 - expected = df_orig.copy() - expected.iloc[:, :] = 100 - tm.assert_frame_equal(df, expected) - - # index - df = df_orig.copy() - df.loc[(slice(None), [1]), :] = 100 - expected = df_orig.copy() - expected.iloc[[0, 3]] = 100 - tm.assert_frame_equal(df, expected) +import pytest - df = df_orig.copy() - df.loc[(slice(None), 1), :] = 100 - expected = df_orig.copy() - expected.iloc[[0, 3]] = 100 - tm.assert_frame_equal(df, expected) +from warnings import catch_warnings +from datetime import datetime - df = df_orig.copy() - df.loc(axis=0)[:, 1] = 100 - expected = df_orig.copy() - expected.iloc[[0, 3]] = 100 - tm.assert_frame_equal(df, expected) +from pandas.core.dtypes.common import ( + is_integer_dtype, + is_float_dtype) +from pandas.compat import range, lrange, lzip, StringIO +import numpy as np - # columns - df = df_orig.copy() - df.loc[:, (slice(None), ['foo'])] = 100 - expected = df_orig.copy() - expected.iloc[:, [1, 3]] = 100 - tm.assert_frame_equal(df, expected) +import pandas as pd +from pandas.core.indexing import _non_reducing_slice, _maybe_numeric_slice +from pandas import NaT, DataFrame, Index, Series, MultiIndex +import pandas.util.testing as tm - # both - df = df_orig.copy() - df.loc[(slice(None), 1), (slice(None), ['foo'])] = 100 - expected = df_orig.copy() - expected.iloc[[0, 3], [1, 3]] = 100 - tm.assert_frame_equal(df, expected) +from pandas.tests.indexing.common import Base, _mklbl - df = df_orig.copy() - df.loc[idx[:, 1], idx[:, ['foo']]] = 100 - expected = df_orig.copy() - expected.iloc[[0, 3], [1, 3]] = 100 - tm.assert_frame_equal(df, expected) - df = df_orig.copy() - df.loc['A', 'a'] = 100 - expected = df_orig.copy() - expected.iloc[0:3, 0:2] = 100 - tm.assert_frame_equal(df, expected) +# ------------------------------------------------------------------------ +# Indexing test cases - # setting with a list-like - df = df_orig.copy() - df.loc[(slice(None), 1), (slice(None), ['foo'])] = np.array( - 
[[100, 100], [100, 100]], dtype='int64') - expected = df_orig.copy() - expected.iloc[[0, 3], [1, 3]] = 100 - tm.assert_frame_equal(df, expected) - # not enough values - df = df_orig.copy() +class TestFancy(Base): + """ pure get/set item & fancy indexing """ - def f(): - df.loc[(slice(None), 1), (slice(None), ['foo'])] = np.array( - [[100], [100, 100]], dtype='int64') + def test_setitem_ndarray_1d(self): + # GH5508 - self.assertRaises(ValueError, f) + # len of indexer vs length of the 1d ndarray + df = DataFrame(index=Index(lrange(1, 11))) + df['foo'] = np.zeros(10, dtype=np.float64) + df['bar'] = np.zeros(10, dtype=np.complex) + # invalid def f(): - df.loc[(slice(None), 1), (slice(None), ['foo'])] = np.array( - [100, 100, 100, 100], dtype='int64') - - self.assertRaises(ValueError, f) - - # with an alignable rhs - df = df_orig.copy() - df.loc[(slice(None), 1), (slice(None), ['foo'])] = df.loc[(slice( - None), 1), (slice(None), ['foo'])] * 5 - expected = df_orig.copy() - expected.iloc[[0, 3], [1, 3]] = expected.iloc[[0, 3], [1, 3]] * 5 - tm.assert_frame_equal(df, expected) - - df = df_orig.copy() - df.loc[(slice(None), 1), (slice(None), ['foo'])] *= df.loc[(slice( - None), 1), (slice(None), ['foo'])] - expected = df_orig.copy() - expected.iloc[[0, 3], [1, 3]] *= expected.iloc[[0, 3], [1, 3]] - tm.assert_frame_equal(df, expected) - - rhs = df_orig.loc[(slice(None), 1), (slice(None), ['foo'])].copy() - rhs.loc[:, ('c', 'bah')] = 10 - df = df_orig.copy() - df.loc[(slice(None), 1), (slice(None), ['foo'])] *= rhs - expected = df_orig.copy() - expected.iloc[[0, 3], [1, 3]] *= expected.iloc[[0, 3], [1, 3]] - tm.assert_frame_equal(df, expected) - - def test_multiindex_setitem(self): - - # GH 3738 - # setting with a multi-index right hand side - arrays = [np.array(['bar', 'bar', 'baz', 'qux', 'qux', 'bar']), - np.array(['one', 'two', 'one', 'one', 'two', 'one']), - np.arange(0, 6, 1)] - - df_orig = pd.DataFrame(np.random.randn(6, 3), - index=arrays, - columns=['A', 'B', 'C']).sort_index() + df.loc[df.index[2:5], 'bar'] = np.array([2.33j, 1.23 + 0.1j, + 2.2, 1.0]) - expected = df_orig.loc[['bar']] * 2 - df = df_orig.copy() - df.loc[['bar']] *= 2 - tm.assert_frame_equal(df.loc[['bar']], expected) - - # raise because these have differing levels - def f(): - df.loc['bar'] *= 2 - - self.assertRaises(TypeError, f) - - # from SO - # http://stackoverflow.com/questions/24572040/pandas-access-the-level-of-multiindex-for-inplace-operation - df_orig = DataFrame.from_dict({'price': { - ('DE', 'Coal', 'Stock'): 2, - ('DE', 'Gas', 'Stock'): 4, - ('DE', 'Elec', 'Demand'): 1, - ('FR', 'Gas', 'Stock'): 5, - ('FR', 'Solar', 'SupIm'): 0, - ('FR', 'Wind', 'SupIm'): 0 - }}) - df_orig.index = MultiIndex.from_tuples(df_orig.index, - names=['Sit', 'Com', 'Type']) - - expected = df_orig.copy() - expected.iloc[[0, 2, 3]] *= 2 - - idx = pd.IndexSlice - df = df_orig.copy() - df.loc[idx[:, :, 'Stock'], :] *= 2 - tm.assert_frame_equal(df, expected) + pytest.raises(ValueError, f) - df = df_orig.copy() - df.loc[idx[:, :, 'Stock'], 'price'] *= 2 - tm.assert_frame_equal(df, expected) + # valid + df.loc[df.index[2:6], 'bar'] = np.array([2.33j, 1.23 + 0.1j, + 2.2, 1.0]) - def test_getitem_multiindex(self): - # GH 5725 the 'A' happens to be a valid Timestamp so the doesn't raise - # the appropriate error, only in PY3 of course! 
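For the GH 5725 case starting here: selecting a level-0 label from a Series with a MultiIndex drops that level and returns the cross-section keyed by the remaining level, and the bug was that a label such as 'A', which also parses as a Timestamp, could surface the wrong error. A rough sketch of the expected path, with made-up values:

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [('D', 26), ('D', 37), ('D', 57), ('B', 67)],
    names=['tag', 'day'])
df = pd.DataFrame({'val': np.arange(4.0)}, index=mi)

# a level-0 label returns the slice indexed by the remaining 'day' level
print(df.val['D'])       # values 0.0, 1.0, 2.0 for days 26, 37, 57

# a genuinely missing label should raise KeyError, not a parse error
# df.val['X']            # -> KeyError
```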
- index = MultiIndex(levels=[['D', 'B', 'C'], - [0, 26, 27, 37, 57, 67, 75, 82]], - labels=[[0, 0, 0, 1, 2, 2, 2, 2, 2, 2], - [1, 3, 4, 6, 0, 2, 2, 3, 5, 7]], - names=['tag', 'day']) - arr = np.random.randn(len(index), 1) - df = DataFrame(arr, index=index, columns=['val']) - result = df.val['D'] - expected = Series(arr.ravel()[0:3], name='val', index=Index( - [26, 37, 57], name='day')) + result = df.loc[df.index[2:6], 'bar'] + expected = Series([2.33j, 1.23 + 0.1j, 2.2, 1.0], index=[3, 4, 5, 6], + name='bar') tm.assert_series_equal(result, expected) - def f(): - df.val['A'] - - self.assertRaises(KeyError, f) - - def f(): - df.val['X'] - - self.assertRaises(KeyError, f) - - # A is treated as a special Timestamp - index = MultiIndex(levels=[['A', 'B', 'C'], - [0, 26, 27, 37, 57, 67, 75, 82]], - labels=[[0, 0, 0, 1, 2, 2, 2, 2, 2, 2], - [1, 3, 4, 6, 0, 2, 2, 3, 5, 7]], - names=['tag', 'day']) - df = DataFrame(arr, index=index, columns=['val']) - result = df.val['A'] - expected = Series(arr.ravel()[0:3], name='val', index=Index( - [26, 37, 57], name='day')) - tm.assert_series_equal(result, expected) + # dtype getting changed? + df = DataFrame(index=Index(lrange(1, 11))) + df['foo'] = np.zeros(10, dtype=np.float64) + df['bar'] = np.zeros(10, dtype=np.complex) def f(): - df.val['X'] - - self.assertRaises(KeyError, f) - - # GH 7866 - # multi-index slicing with missing indexers - idx = pd.MultiIndex.from_product([['A', 'B', 'C'], - ['foo', 'bar', 'baz']], - names=['one', 'two']) - s = pd.Series(np.arange(9, dtype='int64'), index=idx).sortlevel() - - exp_idx = pd.MultiIndex.from_product([['A'], ['foo', 'bar', 'baz']], - names=['one', 'two']) - expected = pd.Series(np.arange(3, dtype='int64'), - index=exp_idx).sortlevel() - - result = s.loc[['A']] - tm.assert_series_equal(result, expected) - result = s.loc[['A', 'D']] - tm.assert_series_equal(result, expected) - - # not any values found - self.assertRaises(KeyError, lambda: s.loc[['D']]) - - # empty ok - result = s.loc[[]] - expected = s.iloc[[]] - tm.assert_series_equal(result, expected) - - idx = pd.IndexSlice - expected = pd.Series([0, 3, 6], index=pd.MultiIndex.from_product( - [['A', 'B', 'C'], ['foo']], names=['one', 'two'])).sortlevel() - - result = s.loc[idx[:, ['foo']]] - tm.assert_series_equal(result, expected) - result = s.loc[idx[:, ['foo', 'bah']]] - tm.assert_series_equal(result, expected) + df[2:5] = np.arange(1, 4) * 1j - # GH 8737 - # empty indexer - multi_index = pd.MultiIndex.from_product((['foo', 'bar', 'baz'], - ['alpha', 'beta'])) - df = DataFrame( - np.random.randn(5, 6), index=range(5), columns=multi_index) - df = df.sortlevel(0, axis=1) - - expected = DataFrame(index=range(5), - columns=multi_index.reindex([])[0]) - result1 = df.loc[:, ([], slice(None))] - result2 = df.loc[:, (['foo'], [])] - tm.assert_frame_equal(result1, expected) - tm.assert_frame_equal(result2, expected) - - # regression from < 0.14.0 - # GH 7914 - df = DataFrame([[np.mean, np.median], ['mean', 'median']], - columns=MultiIndex.from_tuples([('functs', 'mean'), - ('functs', 'median')]), - index=['function', 'name']) - result = df.loc['function', ('functs', 'mean')] - self.assertEqual(result, np.mean) + pytest.raises(ValueError, f) def test_setitem_dtype_upcast(self): # GH3216 df = DataFrame([{"a": 1}, {"a": 3, "b": 2}]) df['c'] = np.nan - self.assertEqual(df['c'].dtype, np.float64) + assert df['c'].dtype == np.float64 - df.ix[0, 'c'] = 'foo' + df.loc[0, 'c'] = 'foo' expected = DataFrame([{"a": 1, "c": 'foo'}, {"a": 3, "b": 2, "c": np.nan}]) 
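The upcast asserted just below: writing a string into a float64 column promotes that column to object while leaving the other columns alone. A minimal sketch mirroring the converted test (behaviour as of this era of pandas; later versions deprecate silent upcasting on `.loc` assignment):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([{"a": 1}, {"a": 3, "b": 2}])
df['c'] = np.nan
assert df['c'].dtype == np.float64

# assigning a string upcasts 'c' from float64 to object
df.loc[0, 'c'] = 'foo'
assert df['c'].dtype == object
```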
tm.assert_frame_equal(df, expected) @@ -2819,8 +87,8 @@ def test_setitem_dtype_upcast(self): columns=['foo', 'bar', 'baz']) tm.assert_frame_equal(left, right) - self.assertTrue(is_integer_dtype(left['foo'])) - self.assertTrue(is_integer_dtype(left['baz'])) + assert is_integer_dtype(left['foo']) + assert is_integer_dtype(left['baz']) left = DataFrame(np.arange(6, dtype='int64').reshape(2, 3) / 10.0, index=list('ab'), @@ -2831,21 +99,8 @@ def test_setitem_dtype_upcast(self): columns=['foo', 'bar', 'baz']) tm.assert_frame_equal(left, right) - self.assertTrue(is_float_dtype(left['foo'])) - self.assertTrue(is_float_dtype(left['baz'])) - - def test_setitem_iloc(self): - - # setitem with an iloc list - df = DataFrame(np.arange(9).reshape((3, 3)), index=["A", "B", "C"], - columns=["A", "B", "C"]) - df.iloc[[0, 1], [1, 2]] - df.iloc[[0, 1], [1, 2]] += 100 - - expected = DataFrame( - np.array([0, 101, 102, 3, 104, 105, 6, 7, 8]).reshape((3, 3)), - index=["A", "B", "C"], columns=["A", "B", "C"]) - tm.assert_frame_equal(df, expected) + assert is_float_dtype(left['foo']) + assert is_float_dtype(left['baz']) def test_dups_fancy_indexing(self): @@ -2855,7 +110,7 @@ def test_dups_fancy_indexing(self): df.columns = ['a', 'a', 'b'] result = df[['b', 'a']].columns expected = Index(['b', 'a', 'a']) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) # across dtypes df = DataFrame([[1, 2, 1., 2., 3., 'foo', 'bar']], @@ -2881,10 +136,10 @@ def test_dups_fancy_indexing(self): {'test': [11, 9], 'test1': [7., 6], 'other': ['d', 'c']}, index=rows) - result = df.ix[rows] + result = df.loc[rows] tm.assert_frame_equal(result, expected) - result = df.ix[Index(rows)] + result = df.loc[Index(rows)] tm.assert_frame_equal(result, expected) rows = ['C', 'B', 'E'] @@ -2893,7 +148,7 @@ def test_dups_fancy_indexing(self): 'test1': [7., 6, np.nan], 'other': ['d', 'c', np.nan]}, index=rows) - result = df.ix[rows] + result = df.loc[rows] tm.assert_frame_equal(result, expected) # see GH5553, make sure we use the right indexer @@ -2903,28 +158,29 @@ def test_dups_fancy_indexing(self): 'other': [np.nan, np.nan, np.nan, 'd', 'c', np.nan]}, index=rows) - result = df.ix[rows] + result = df.loc[rows] tm.assert_frame_equal(result, expected) # inconsistent returns for unique/duplicate indices when values are # missing - df = DataFrame(randn(4, 3), index=list('ABCD')) - expected = df.ix[['E']] + df = DataFrame(np.random.randn(4, 3), index=list('ABCD')) + expected = df.reindex(['E']) - dfnu = DataFrame(randn(5, 3), index=list('AABCD')) - result = dfnu.ix[['E']] + dfnu = DataFrame(np.random.randn(5, 3), index=list('AABCD')) + with catch_warnings(record=True): + result = dfnu.ix[['E']] tm.assert_frame_equal(result, expected) # ToDo: check_index_type can be True after GH 11497 # GH 4619; duplicate indexer with missing label df = DataFrame({"A": [0, 1, 2]}) - result = df.ix[[0, 8, 0]] + result = df.loc[[0, 8, 0]] expected = DataFrame({"A": [0, np.nan, 0]}, index=[0, 8, 0]) tm.assert_frame_equal(result, expected, check_index_type=False) df = DataFrame({"A": list('abc')}) - result = df.ix[[0, 8, 0]] + result = df.loc[[0, 8, 0]] expected = DataFrame({"A": ['a', np.nan, 'a']}, index=[0, 8, 0]) tm.assert_frame_equal(result, expected, check_index_type=False) @@ -2932,7 +188,7 @@ def test_dups_fancy_indexing(self): df = DataFrame({'test': [5, 7, 9, 11]}, index=['A', 'A', 'B', 'C']) expected = DataFrame( {'test': [5, 7, 5, 7, np.nan]}, index=['A', 'A', 'A', 'A', 'E']) - result = df.ix[['A', 'A', 'E']] + result = 
df.loc[['A', 'A', 'E']] tm.assert_frame_equal(result, expected) # GH 5835 @@ -2941,9 +197,9 @@ def test_dups_fancy_indexing(self): np.random.randn(5, 5), columns=['A', 'B', 'B', 'B', 'A']) expected = pd.concat( - [df.ix[:, ['A', 'B']], DataFrame(np.nan, columns=['C'], - index=df.index)], axis=1) - result = df.ix[:, ['A', 'B', 'C']] + [df.loc[:, ['A', 'B']], DataFrame(np.nan, columns=['C'], + index=df.index)], axis=1) + result = df.loc[:, ['A', 'B', 'C']] tm.assert_frame_equal(result, expected) # GH 6504, multi-axis indexing @@ -2973,9 +229,9 @@ def test_indexing_mixed_frame_bug(self): # this does not work, ie column test is not changed idx = df['test'] == '_' - temp = df.ix[idx, 'a'].apply(lambda x: '-----' if x == 'aaa' else x) - df.ix[idx, 'test'] = temp - self.assertEqual(df.iloc[0, 2], '-----') + temp = df.loc[idx, 'a'].apply(lambda x: '-----' if x == 'aaa' else x) + df.loc[idx, 'test'] = temp + assert df.iloc[0, 2] == '-----' # if I look at df, then element [0,2] equals '_'. If instead I type # df.ix[idx,'test'], I get '-----', finally by typing df.iloc[0,2] I @@ -2985,9 +241,10 @@ def test_multitype_list_index_access(self): # GH 10610 df = pd.DataFrame(np.random.random((10, 5)), columns=["a"] + [20, 21, 22, 23]) - with self.assertRaises(IndexError): + + with pytest.raises(KeyError): df[[22, 26, -8]] - self.assertEqual(df[21].shape[0], df.shape[0]) + assert df[21].shape[0] == df.shape[0] def test_set_index_nan(self): @@ -3009,17 +266,17 @@ def test_set_index_nan(self): 'QC': {17: 0.0, 18: 0.0, 19: 0.0, - 20: nan, - 21: nan, - 22: nan, - 23: nan, + 20: np.nan, + 21: np.nan, + 22: np.nan, + 23: np.nan, 24: 1.0, - 25: nan, - 26: nan, - 27: nan, - 28: nan, - 29: nan, - 30: nan}, + 25: np.nan, + 26: np.nan, + 27: np.nan, + 28: np.nan, + 29: np.nan, + 30: np.nan}, 'data': {17: 7.9544899999999998, 18: 8.0142609999999994, 19: 7.8591520000000008, @@ -3068,233 +325,6 @@ def test_multi_nan_indexing(self): Index(['C1', 'C2', 'C3', 'C4'], name='b')]) tm.assert_frame_equal(result, expected) - def test_iloc_panel_issue(self): - - # GH 3617 - p = Panel(randn(4, 4, 4)) - - self.assertEqual(p.iloc[:3, :3, :3].shape, (3, 3, 3)) - self.assertEqual(p.iloc[1, :3, :3].shape, (3, 3)) - self.assertEqual(p.iloc[:3, 1, :3].shape, (3, 3)) - self.assertEqual(p.iloc[:3, :3, 1].shape, (3, 3)) - self.assertEqual(p.iloc[1, 1, :3].shape, (3, )) - self.assertEqual(p.iloc[1, :3, 1].shape, (3, )) - self.assertEqual(p.iloc[:3, 1, 1].shape, (3, )) - - def test_panel_getitem(self): - # GH4016, date selection returns a frame when a partial string - # selection - ind = date_range(start="2000", freq="D", periods=1000) - df = DataFrame( - np.random.randn( - len(ind), 5), index=ind, columns=list('ABCDE')) - panel = Panel(dict([('frame_' + c, df) for c in list('ABC')])) - - test2 = panel.ix[:, "2002":"2002-12-31"] - test1 = panel.ix[:, "2002"] - tm.assert_panel_equal(test1, test2) - - # GH8710 - # multi-element getting with a list - panel = tm.makePanel() - - expected = panel.iloc[[0, 1]] - - result = panel.loc[['ItemA', 'ItemB']] - tm.assert_panel_equal(result, expected) - - result = panel.loc[['ItemA', 'ItemB'], :, :] - tm.assert_panel_equal(result, expected) - - result = panel[['ItemA', 'ItemB']] - tm.assert_panel_equal(result, expected) - - result = panel.loc['ItemA':'ItemB'] - tm.assert_panel_equal(result, expected) - - result = panel.ix['ItemA':'ItemB'] - tm.assert_panel_equal(result, expected) - - result = panel.ix[['ItemA', 'ItemB']] - tm.assert_panel_equal(result, expected) - - # with an object-like - # GH 9140 - class 
TestObject: - - def __str__(self): - return "TestObject" - - obj = TestObject() - - p = Panel(np.random.randn(1, 5, 4), items=[obj], - major_axis=date_range('1/1/2000', periods=5), - minor_axis=['A', 'B', 'C', 'D']) - - expected = p.iloc[0] - result = p[obj] - tm.assert_frame_equal(result, expected) - - def test_panel_setitem(self): - - # GH 7763 - # loc and setitem have setting differences - np.random.seed(0) - index = range(3) - columns = list('abc') - - panel = Panel({'A': DataFrame(np.random.randn(3, 3), - index=index, columns=columns), - 'B': DataFrame(np.random.randn(3, 3), - index=index, columns=columns), - 'C': DataFrame(np.random.randn(3, 3), - index=index, columns=columns)}) - - replace = DataFrame(np.eye(3, 3), index=range(3), columns=columns) - expected = Panel({'A': replace, 'B': replace, 'C': replace}) - - p = panel.copy() - for idx in list('ABC'): - p[idx] = replace - tm.assert_panel_equal(p, expected) - - p = panel.copy() - for idx in list('ABC'): - p.loc[idx, :, :] = replace - tm.assert_panel_equal(p, expected) - - def test_panel_setitem_with_multiindex(self): - - # 10360 - # failing with a multi-index - arr = np.array([[[1, 2, 3], [0, 0, 0]], [[0, 0, 0], [0, 0, 0]]], - dtype=np.float64) - - # reg index - axes = dict(items=['A', 'B'], major_axis=[0, 1], - minor_axis=['X', 'Y', 'Z']) - p1 = Panel(0., **axes) - p1.iloc[0, 0, :] = [1, 2, 3] - expected = Panel(arr, **axes) - tm.assert_panel_equal(p1, expected) - - # multi-indexes - axes['items'] = pd.MultiIndex.from_tuples([('A', 'a'), ('B', 'b')]) - p2 = Panel(0., **axes) - p2.iloc[0, 0, :] = [1, 2, 3] - expected = Panel(arr, **axes) - tm.assert_panel_equal(p2, expected) - - axes['major_axis'] = pd.MultiIndex.from_tuples([('A', 1), ('A', 2)]) - p3 = Panel(0., **axes) - p3.iloc[0, 0, :] = [1, 2, 3] - expected = Panel(arr, **axes) - tm.assert_panel_equal(p3, expected) - - axes['minor_axis'] = pd.MultiIndex.from_product([['X'], range(3)]) - p4 = Panel(0., **axes) - p4.iloc[0, 0, :] = [1, 2, 3] - expected = Panel(arr, **axes) - tm.assert_panel_equal(p4, expected) - - arr = np.array( - [[[1, 0, 0], [2, 0, 0]], [[0, 0, 0], [0, 0, 0]]], dtype=np.float64) - p5 = Panel(0., **axes) - p5.iloc[0, :, 0] = [1, 2] - expected = Panel(arr, **axes) - tm.assert_panel_equal(p5, expected) - - def test_panel_assignment(self): - # GH3777 - wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'], - major_axis=date_range('1/1/2000', periods=5), - minor_axis=['A', 'B', 'C', 'D']) - wp2 = Panel(randn(2, 5, 4), items=['Item1', 'Item2'], - major_axis=date_range('1/1/2000', periods=5), - minor_axis=['A', 'B', 'C', 'D']) - - # TODO: unused? 
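Reviewer note: the change above from `IndexError` to `KeyError` reflects that list-of-labels column selection is a label lookup, not a positional one. A minimal standalone sketch of the behavior the updated test pins down (my own example frame; pytest assumed):

```python
import numpy as np
import pandas as pd
import pytest

# mixed-type column labels, as in the test above
df = pd.DataFrame(np.random.random((4, 2)), columns=['a', 20])

# selecting with a list of labels is a lookup, so a missing label
# is a KeyError (not an IndexError, which would imply positions)
with pytest.raises(KeyError):
    df[[20, 26]]

# a present label behaves normally
assert df[20].shape[0] == df.shape[0]
```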
- # expected = wp.loc[['Item1', 'Item2'], :, ['A', 'B']] - - def f(): - wp.loc[['Item1', 'Item2'], :, ['A', 'B']] = wp2.loc[ - ['Item1', 'Item2'], :, ['A', 'B']] - - self.assertRaises(NotImplementedError, f) - - # to_assign = wp2.loc[['Item1', 'Item2'], :, ['A', 'B']] - # wp.loc[['Item1', 'Item2'], :, ['A', 'B']] = to_assign - # result = wp.loc[['Item1', 'Item2'], :, ['A', 'B']] - # tm.assert_panel_equal(result,expected) - - def test_multiindex_assignment(self): - - # GH3777 part 2 - - # mixed dtype - df = DataFrame(np.random.randint(5, 10, size=9).reshape(3, 3), - columns=list('abc'), - index=[[4, 4, 8], [8, 10, 12]]) - df['d'] = np.nan - arr = np.array([0., 1.]) - - df.ix[4, 'd'] = arr - tm.assert_series_equal(df.ix[4, 'd'], - Series(arr, index=[8, 10], name='d')) - - # single dtype - df = DataFrame(np.random.randint(5, 10, size=9).reshape(3, 3), - columns=list('abc'), - index=[[4, 4, 8], [8, 10, 12]]) - - df.ix[4, 'c'] = arr - exp = Series(arr, index=[8, 10], name='c', dtype='float64') - tm.assert_series_equal(df.ix[4, 'c'], exp) - - # scalar ok - df.ix[4, 'c'] = 10 - exp = Series(10, index=[8, 10], name='c', dtype='float64') - tm.assert_series_equal(df.ix[4, 'c'], exp) - - # invalid assignments - def f(): - df.ix[4, 'c'] = [0, 1, 2, 3] - - self.assertRaises(ValueError, f) - - def f(): - df.ix[4, 'c'] = [0] - - self.assertRaises(ValueError, f) - - # groupby example - NUM_ROWS = 100 - NUM_COLS = 10 - col_names = ['A' + num for num in - map(str, np.arange(NUM_COLS).tolist())] - index_cols = col_names[:5] - - df = DataFrame(np.random.randint(5, size=(NUM_ROWS, NUM_COLS)), - dtype=np.int64, columns=col_names) - df = df.set_index(index_cols).sort_index() - grp = df.groupby(level=index_cols[:4]) - df['new_col'] = np.nan - - f_index = np.arange(5) - - def f(name, df2): - return Series(np.arange(df2.shape[0]), - name=df2.index.values[0]).reindex(f_index) - - # TODO(wesm): unused? 
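Reviewer note: the removed `test_multiindex_assignment` exercised length-checking when assigning into a partial MultiIndex selection via `.ix`. A minimal sketch of the same rule spelled with the surviving `.loc` API (my own example values; pytest assumed):

```python
import numpy as np
import pandas as pd
import pytest

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  columns=list('abc'),
                  index=[[4, 4, 8], [8, 10, 12]])

# first-level label 4 selects two rows, so the value must have length 2
df.loc[4, 'c'] = np.array([0., 1.])
assert df.loc[4, 'c'].tolist() == [0.0, 1.0]

# a mismatched length is rejected, as the removed test asserted for .ix
with pytest.raises(ValueError):
    df.loc[4, 'c'] = [0, 1, 2, 3]
```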
- # new_df = pd.concat([f(name, df2) for name, df2 in grp], axis=1).T - - # we are actually operating on a copy here - # but in this case, that's ok - for name, df2 in grp: - new_vals = np.arange(df2.shape[0]) - df.ix[name, 'new_col'] = new_vals - def test_multi_assign(self): # GH 3626, an assignement of a sub-df to a df @@ -3302,14 +332,14 @@ def test_multi_assign(self): 'PF': [0, 0, 0, 0, 1, 1], 'col1': lrange(6), 'col2': lrange(6, 12)}) - df.ix[1, 0] = np.nan + df.iloc[1, 0] = np.nan df2 = df.copy() mask = ~df2.FC.isnull() cols = ['col1', 'col2'] dft = df2 * 2 - dft.ix[3, 3] = np.nan + dft.iloc[3, 3] = np.nan expected = DataFrame({'FC': ['a', np.nan, 'a', 'b', 'a', 'b'], 'PF': [0, 0, 0, 0, 1, 1], @@ -3317,17 +347,17 @@ def test_multi_assign(self): 'col2': [12, 7, 16, np.nan, 20, 22]}) # frame on rhs - df2.ix[mask, cols] = dft.ix[mask, cols] + df2.loc[mask, cols] = dft.loc[mask, cols] tm.assert_frame_equal(df2, expected) - df2.ix[mask, cols] = dft.ix[mask, cols] + df2.loc[mask, cols] = dft.loc[mask, cols] tm.assert_frame_equal(df2, expected) # with an ndarray on rhs df2 = df.copy() - df2.ix[mask, cols] = dft.ix[mask, cols].values + df2.loc[mask, cols] = dft.loc[mask, cols].values tm.assert_frame_equal(df2, expected) - df2.ix[mask, cols] = dft.ix[mask, cols].values + df2.loc[mask, cols] = dft.loc[mask, cols].values tm.assert_frame_equal(df2, expected) # broadcasting on the rhs is required @@ -3342,79 +372,18 @@ def test_multi_assign(self): df.loc[df['A'] == 0, ['A', 'B']] = df['D'] tm.assert_frame_equal(df, expected) - def test_ix_assign_column_mixed(self): - # GH #1142 - df = DataFrame(tm.getSeriesData()) - df['foo'] = 'bar' - - orig = df.ix[:, 'B'].copy() - df.ix[:, 'B'] = df.ix[:, 'B'] + 1 - tm.assert_series_equal(df.B, orig + 1) - - # GH 3668, mixed frame with series value - df = DataFrame({'x': lrange(10), 'y': lrange(10, 20), 'z': 'bar'}) - expected = df.copy() - - for i in range(5): - indexer = i * 2 - v = 1000 + i * 200 - expected.ix[indexer, 'y'] = v - self.assertEqual(expected.ix[indexer, 'y'], v) - - df.ix[df.x % 2 == 0, 'y'] = df.ix[df.x % 2 == 0, 'y'] * 100 - tm.assert_frame_equal(df, expected) - - # GH 4508, making sure consistency of assignments - df = DataFrame({'a': [1, 2, 3], 'b': [0, 1, 2]}) - df.ix[[0, 2, ], 'b'] = [100, -100] - expected = DataFrame({'a': [1, 2, 3], 'b': [100, 1, -100]}) - tm.assert_frame_equal(df, expected) - - df = pd.DataFrame({'a': lrange(4)}) - df['b'] = np.nan - df.ix[[1, 3], 'b'] = [100, -100] - expected = DataFrame({'a': [0, 1, 2, 3], - 'b': [np.nan, 100, np.nan, -100]}) - tm.assert_frame_equal(df, expected) - - # ok, but chained assignments are dangerous - # if we turn off chained assignement it will work - with option_context('chained_assignment', None): - df = pd.DataFrame({'a': lrange(4)}) - df['b'] = np.nan - df['b'].ix[[1, 3]] = [100, -100] - tm.assert_frame_equal(df, expected) - - def test_ix_get_set_consistency(self): - - # GH 4544 - # ix/loc get/set not consistent when - # a mixed int/string index - df = DataFrame(np.arange(16).reshape((4, 4)), - columns=['a', 'b', 8, 'c'], - index=['e', 7, 'f', 'g']) - - self.assertEqual(df.ix['e', 8], 2) - self.assertEqual(df.loc['e', 8], 2) - - df.ix['e', 8] = 42 - self.assertEqual(df.ix['e', 8], 42) - self.assertEqual(df.loc['e', 8], 42) - - df.loc['e', 8] = 45 - self.assertEqual(df.ix['e', 8], 45) - self.assertEqual(df.loc['e', 8], 45) - def test_setitem_list(self): # GH 6043 # ix with a list df = DataFrame(index=[0, 1], columns=[0]) - df.ix[1, 0] = [1, 2, 3] - df.ix[1, 0] = [1, 2] + with 
catch_warnings(record=True): + df.ix[1, 0] = [1, 2, 3] + df.ix[1, 0] = [1, 2] result = DataFrame(index=[0, 1], columns=[0]) - result.ix[1, 0] = [1, 2] + with catch_warnings(record=True): + result.ix[1, 0] = [1, 2] tm.assert_frame_equal(result, df) @@ -3430,188 +399,30 @@ def __str__(self): __repr__ = __str__ def __eq__(self, other): - return self.value == other.value - - def view(self): - return self - - df = DataFrame(index=[0, 1], columns=[0]) - df.ix[1, 0] = TO(1) - df.ix[1, 0] = TO(2) - - result = DataFrame(index=[0, 1], columns=[0]) - result.ix[1, 0] = TO(2) - - tm.assert_frame_equal(result, df) - - # remains object dtype even after setting it back - df = DataFrame(index=[0, 1], columns=[0]) - df.ix[1, 0] = TO(1) - df.ix[1, 0] = np.nan - result = DataFrame(index=[0, 1], columns=[0]) - - tm.assert_frame_equal(result, df) - - def test_iloc_mask(self): - - # GH 3631, iloc with a mask (of a series) should raise - df = DataFrame(lrange(5), list('ABCDE'), columns=['a']) - mask = (df.a % 2 == 0) - self.assertRaises(ValueError, df.iloc.__getitem__, tuple([mask])) - mask.index = lrange(len(mask)) - self.assertRaises(NotImplementedError, df.iloc.__getitem__, - tuple([mask])) - - # ndarray ok - result = df.iloc[np.array([True] * len(mask), dtype=bool)] - tm.assert_frame_equal(result, df) - - # the possibilities - locs = np.arange(4) - nums = 2 ** locs - reps = lmap(bin, nums) - df = DataFrame({'locs': locs, 'nums': nums}, reps) - - expected = { - (None, ''): '0b1100', - (None, '.loc'): '0b1100', - (None, '.iloc'): '0b1100', - ('index', ''): '0b11', - ('index', '.loc'): '0b11', - ('index', '.iloc'): ('iLocation based boolean indexing ' - 'cannot use an indexable as a mask'), - ('locs', ''): 'Unalignable boolean Series key provided', - ('locs', '.loc'): 'Unalignable boolean Series key provided', - ('locs', '.iloc'): ('iLocation based boolean indexing on an ' - 'integer type is not available'), - } - - # UserWarnings from reindex of a boolean mask - with warnings.catch_warnings(record=True): - result = dict() - for idx in [None, 'index', 'locs']: - mask = (df.nums > 2).values - if idx: - mask = Series(mask, list(reversed(getattr(df, idx)))) - for method in ['', '.loc', '.iloc']: - try: - if method: - accessor = getattr(df, method[1:]) - else: - accessor = df - ans = str(bin(accessor[mask]['nums'].sum())) - except Exception as e: - ans = str(e) - - key = tuple([idx, method]) - r = expected.get(key) - if r != ans: - raise AssertionError( - "[%s] does not match [%s], received [%s]" - % (key, ans, r)) - - def test_ix_slicing_strings(self): - # GH3836 - data = {'Classification': - ['SA EQUITY CFD', 'bbb', 'SA EQUITY', 'SA SSF', 'aaa'], - 'Random': [1, 2, 3, 4, 5], - 'X': ['correct', 'wrong', 'correct', 'correct', 'wrong']} - df = DataFrame(data) - x = df[~df.Classification.isin(['SA EQUITY CFD', 'SA EQUITY', 'SA SSF' - ])] - df.ix[x.index, 'X'] = df['Classification'] - - expected = DataFrame({'Classification': {0: 'SA EQUITY CFD', - 1: 'bbb', - 2: 'SA EQUITY', - 3: 'SA SSF', - 4: 'aaa'}, - 'Random': {0: 1, - 1: 2, - 2: 3, - 3: 4, - 4: 5}, - 'X': {0: 'correct', - 1: 'bbb', - 2: 'correct', - 3: 'correct', - 4: 'aaa'}}) # bug was 4: 'bbb' - - tm.assert_frame_equal(df, expected) - - def test_non_unique_loc(self): - # GH3659 - # non-unique indexer with loc slice - # https://groups.google.com/forum/?fromgroups#!topic/pydata/zTm2No0crYs - - # these are going to raise becuase the we are non monotonic - df = DataFrame({'A': [1, 2, 3, 4, 5, 6], - 'B': [3, 4, 5, 6, 7, 8]}, index=[0, 1, 0, 1, 2, 3]) - 
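Reviewer note: every `.ix` call that survives in this patch is wrapped in `catch_warnings(record=True)` to swallow the new deprecation warning. A minimal sketch of what that context manager does, using only the standard library:

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')            # ensure the warning fires
    warnings.warn('.ix is deprecated', DeprecationWarning)

# nothing reached the user; the warning was captured into `caught`
assert len(caught) == 1
assert issubclass(caught[0].category, DeprecationWarning)
```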
self.assertRaises(KeyError, df.loc.__getitem__, - tuple([slice(1, None)])) - self.assertRaises(KeyError, df.loc.__getitem__, - tuple([slice(0, None)])) - self.assertRaises(KeyError, df.loc.__getitem__, tuple([slice(1, 2)])) - - # monotonic are ok - df = DataFrame({'A': [1, 2, 3, 4, 5, 6], - 'B': [3, 4, 5, 6, 7, 8]}, - index=[0, 1, 0, 1, 2, 3]).sort_index(axis=0) - result = df.loc[1:] - expected = DataFrame({'A': [2, 4, 5, 6], 'B': [4, 6, 7, 8]}, - index=[1, 1, 2, 3]) - tm.assert_frame_equal(result, expected) - - result = df.loc[0:] - tm.assert_frame_equal(result, df) - - result = df.loc[1:2] - expected = DataFrame({'A': [2, 4, 5], 'B': [4, 6, 7]}, - index=[1, 1, 2]) - tm.assert_frame_equal(result, expected) - - def test_loc_name(self): - # GH 3880 - df = DataFrame([[1, 1], [1, 1]]) - df.index.name = 'index_name' - result = df.iloc[[0, 1]].index.name - self.assertEqual(result, 'index_name') - - result = df.ix[[0, 1]].index.name - self.assertEqual(result, 'index_name') - - result = df.loc[[0, 1]].index.name - self.assertEqual(result, 'index_name') - - def test_iloc_non_unique_indexing(self): - - # GH 4017, non-unique indexing (on the axis) - df = DataFrame({'A': [0.1] * 3000, 'B': [1] * 3000}) - idx = np.array(lrange(30)) * 99 - expected = df.iloc[idx] + return self.value == other.value - df3 = pd.concat([df, 2 * df, 3 * df]) - result = df3.iloc[idx] + def view(self): + return self - tm.assert_frame_equal(result, expected) + df = DataFrame(index=[0, 1], columns=[0]) + with catch_warnings(record=True): + df.ix[1, 0] = TO(1) + df.ix[1, 0] = TO(2) - df2 = DataFrame({'A': [0.1] * 1000, 'B': [1] * 1000}) - df2 = pd.concat([df2, 2 * df2, 3 * df2]) + result = DataFrame(index=[0, 1], columns=[0]) + with catch_warnings(record=True): + result.ix[1, 0] = TO(2) - sidx = df2.index.to_series() - expected = df2.iloc[idx[idx <= sidx.max()]] + tm.assert_frame_equal(result, df) - new_list = [] - for r, s in expected.iterrows(): - new_list.append(s) - new_list.append(s * 2) - new_list.append(s * 3) + # remains object dtype even after setting it back + df = DataFrame(index=[0, 1], columns=[0]) + with catch_warnings(record=True): + df.ix[1, 0] = TO(1) + df.ix[1, 0] = np.nan + result = DataFrame(index=[0, 1], columns=[0]) - expected = DataFrame(new_list) - expected = pd.concat([expected, DataFrame(index=idx[idx > sidx.max()]) - ]) - result = df2.loc[idx] - tm.assert_frame_equal(result, expected, check_index_type=False) + tm.assert_frame_equal(result, df) def test_string_slice(self): # GH 14424 @@ -3619,19 +430,19 @@ def test_string_slice(self): # dtype should properly raises KeyError df = pd.DataFrame([1], pd.Index([pd.Timestamp('2011-01-01')], dtype=object)) - self.assertTrue(df.index.is_all_dates) - with tm.assertRaises(KeyError): + assert df.index.is_all_dates + with pytest.raises(KeyError): df['2011'] - with tm.assertRaises(KeyError): + with pytest.raises(KeyError): df.loc['2011', 0] df = pd.DataFrame() - self.assertFalse(df.index.is_all_dates) - with tm.assertRaises(KeyError): + assert not df.index.is_all_dates + with pytest.raises(KeyError): df['2011'] - with tm.assertRaises(KeyError): + with pytest.raises(KeyError): df.loc['2011', 0] def test_mi_access(self): @@ -3673,43 +484,6 @@ def test_mi_access(self): result = df2['A']['B2'] tm.assert_frame_equal(result, expected) - def test_non_unique_loc_memory_error(self): - - # GH 4280 - # non_unique index with a large selection triggers a memory error - - columns = list('ABCDEFG') - - def gen_test(l, l2): - return pd.concat([DataFrame(randn(l, len(columns)), - 
index=lrange(l), columns=columns), - DataFrame(np.ones((l2, len(columns))), - index=[0] * l2, columns=columns)]) - - def gen_expected(df, mask): - l = len(mask) - return pd.concat([df.take([0], convert=False), - DataFrame(np.ones((l, len(columns))), - index=[0] * l, - columns=columns), - df.take(mask[1:], convert=False)]) - - df = gen_test(900, 100) - self.assertFalse(df.index.is_unique) - - mask = np.arange(100) - result = df.loc[mask] - expected = gen_expected(df, mask) - tm.assert_frame_equal(result, expected) - - df = gen_test(900000, 100000) - self.assertFalse(df.index.is_unique) - - mask = np.arange(100000) - result = df.loc[mask] - expected = gen_expected(df, mask) - tm.assert_frame_equal(result, expected) - def test_astype_assignment(self): # GH4312 (iloc) @@ -3762,1332 +536,85 @@ def test_astype_assignment_with_dups(self): index = df.index.copy() df['A'] = df['A'].astype(np.float64) - self.assert_index_equal(df.index, index) + tm.assert_index_equal(df.index, index) # TODO(wesm): unused variables # result = df.get_dtype_counts().sort_index() # expected = Series({'float64': 2, 'object': 1}).sort_index() - def test_dups_loc(self): - - # GH4726 - # dup indexing with iloc/loc - df = DataFrame([[1, 2, 'foo', 'bar', Timestamp('20130101')]], - columns=['a', 'a', 'a', 'a', 'a'], index=[1]) - expected = Series([1, 2, 'foo', 'bar', Timestamp('20130101')], - index=['a', 'a', 'a', 'a', 'a'], name=1) - - result = df.iloc[0] - tm.assert_series_equal(result, expected) - - result = df.loc[1] - tm.assert_series_equal(result, expected) - - def test_partial_setting(self): - - # GH2578, allow ix and friends to partially set - - # series - s_orig = Series([1, 2, 3]) - - s = s_orig.copy() - s[5] = 5 - expected = Series([1, 2, 3, 5], index=[0, 1, 2, 5]) - tm.assert_series_equal(s, expected) - - s = s_orig.copy() - s.loc[5] = 5 - expected = Series([1, 2, 3, 5], index=[0, 1, 2, 5]) - tm.assert_series_equal(s, expected) - - s = s_orig.copy() - s[5] = 5. - expected = Series([1, 2, 3, 5.], index=[0, 1, 2, 5]) - tm.assert_series_equal(s, expected) - - s = s_orig.copy() - s.loc[5] = 5. - expected = Series([1, 2, 3, 5.], index=[0, 1, 2, 5]) - tm.assert_series_equal(s, expected) - - # iloc/iat raise - s = s_orig.copy() - - def f(): - s.iloc[3] = 5. - - self.assertRaises(IndexError, f) - - def f(): - s.iat[3] = 5. - - self.assertRaises(IndexError, f) - - # ## frame ## - - df_orig = DataFrame( - np.arange(6).reshape(3, 2), columns=['A', 'B'], dtype='int64') - - # iloc/iat raise - df = df_orig.copy() - - def f(): - df.iloc[4, 2] = 5. - - self.assertRaises(IndexError, f) - - def f(): - df.iat[4, 2] = 5. 
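Reviewer note: the removed `test_partial_setting` pinned down which indexers may enlarge their target. A standalone sketch of that contract (my own example objects; pytest assumed): label-based setters grow the object, positional setters never do.

```python
import pandas as pd
import pytest

s = pd.Series([1, 2, 3])
s.loc[5] = 5                   # label-based setting may enlarge the object
assert list(s.index) == [0, 1, 2, 5]

df = pd.DataFrame({'A': [0, 2, 4], 'B': [1, 3, 5]})
with pytest.raises(IndexError):
    df.iloc[4, 2] = 5.         # positional setters refuse to enlarge
with pytest.raises(IndexError):
    df.iat[4, 2] = 5.
```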
- - self.assertRaises(IndexError, f) - - # row setting where it exists - expected = DataFrame(dict({'A': [0, 4, 4], 'B': [1, 5, 5]})) - df = df_orig.copy() - df.iloc[1] = df.iloc[2] - tm.assert_frame_equal(df, expected) - - expected = DataFrame(dict({'A': [0, 4, 4], 'B': [1, 5, 5]})) - df = df_orig.copy() - df.loc[1] = df.loc[2] - tm.assert_frame_equal(df, expected) - - # like 2578, partial setting with dtype preservation - expected = DataFrame(dict({'A': [0, 2, 4, 4], 'B': [1, 3, 5, 5]})) - df = df_orig.copy() - df.loc[3] = df.loc[2] - tm.assert_frame_equal(df, expected) - - # single dtype frame, overwrite - expected = DataFrame(dict({'A': [0, 2, 4], 'B': [0, 2, 4]})) - df = df_orig.copy() - df.ix[:, 'B'] = df.ix[:, 'A'] - tm.assert_frame_equal(df, expected) - - # mixed dtype frame, overwrite - expected = DataFrame(dict({'A': [0, 2, 4], 'B': Series([0, 2, 4])})) - df = df_orig.copy() - df['B'] = df['B'].astype(np.float64) - df.ix[:, 'B'] = df.ix[:, 'A'] - tm.assert_frame_equal(df, expected) - - # single dtype frame, partial setting - expected = df_orig.copy() - expected['C'] = df['A'] - df = df_orig.copy() - df.ix[:, 'C'] = df.ix[:, 'A'] - tm.assert_frame_equal(df, expected) - - # mixed frame, partial setting - expected = df_orig.copy() - expected['C'] = df['A'] - df = df_orig.copy() - df.ix[:, 'C'] = df.ix[:, 'A'] - tm.assert_frame_equal(df, expected) - - # ## panel ## - p_orig = Panel(np.arange(16).reshape(2, 4, 2), - items=['Item1', 'Item2'], - major_axis=pd.date_range('2001/1/12', periods=4), - minor_axis=['A', 'B'], dtype='float64') - - # panel setting via item - p_orig = Panel(np.arange(16).reshape(2, 4, 2), - items=['Item1', 'Item2'], - major_axis=pd.date_range('2001/1/12', periods=4), - minor_axis=['A', 'B'], dtype='float64') - expected = p_orig.copy() - expected['Item3'] = expected['Item1'] - p = p_orig.copy() - p.loc['Item3'] = p['Item1'] - tm.assert_panel_equal(p, expected) - - # panel with aligned series - expected = p_orig.copy() - expected = expected.transpose(2, 1, 0) - expected['C'] = DataFrame({'Item1': [30, 30, 30, 30], - 'Item2': [32, 32, 32, 32]}, - index=p_orig.major_axis) - expected = expected.transpose(2, 1, 0) - p = p_orig.copy() - p.loc[:, :, 'C'] = Series([30, 32], index=p_orig.items) - tm.assert_panel_equal(p, expected) - - # GH 8473 - dates = date_range('1/1/2000', periods=8) - df_orig = DataFrame(np.random.randn(8, 4), index=dates, - columns=['A', 'B', 'C', 'D']) - - expected = pd.concat([df_orig, DataFrame( - {'A': 7}, index=[dates[-1] + 1])]) - df = df_orig.copy() - df.loc[dates[-1] + 1, 'A'] = 7 - tm.assert_frame_equal(df, expected) - df = df_orig.copy() - df.at[dates[-1] + 1, 'A'] = 7 - tm.assert_frame_equal(df, expected) - - exp_other = DataFrame({0: 7}, index=[dates[-1] + 1]) - expected = pd.concat([df_orig, exp_other], axis=1) - - df = df_orig.copy() - df.loc[dates[-1] + 1, 0] = 7 - tm.assert_frame_equal(df, expected) - df = df_orig.copy() - df.at[dates[-1] + 1, 0] = 7 - tm.assert_frame_equal(df, expected) - - def test_partial_setting_mixed_dtype(self): - - # in a mixed dtype environment, try to preserve dtypes - # by appending - df = DataFrame([[True, 1], [False, 2]], columns=["female", "fitness"]) - - s = df.loc[1].copy() - s.name = 2 - expected = df.append(s) - - df.loc[2] = df.loc[1] - tm.assert_frame_equal(df, expected) - - # columns will align - df = DataFrame(columns=['A', 'B']) - df.loc[0] = Series(1, index=range(4)) - tm.assert_frame_equal(df, DataFrame(columns=['A', 'B'], index=[0])) - - # columns will align - df = DataFrame(columns=['A', 
'B']) - df.loc[0] = Series(1, index=['B']) - - exp = DataFrame([[np.nan, 1]], columns=['A', 'B'], - index=[0], dtype='float64') - tm.assert_frame_equal(df, exp) - - # list-like must conform - df = DataFrame(columns=['A', 'B']) - - def f(): - df.loc[0] = [1, 2, 3] - - self.assertRaises(ValueError, f) - - # these are coerced to float unavoidably (as its a list-like to begin) - df = DataFrame(columns=['A', 'B']) - df.loc[3] = [6, 7] - - exp = DataFrame([[6, 7]], index=[3], columns=['A', 'B'], - dtype='float64') - tm.assert_frame_equal(df, exp) - - def test_partial_setting_with_datetimelike_dtype(self): - - # GH9478 - # a datetimeindex alignment issue with partial setting - df = pd.DataFrame(np.arange(6.).reshape(3, 2), columns=list('AB'), - index=pd.date_range('1/1/2000', periods=3, - freq='1H')) - expected = df.copy() - expected['C'] = [expected.index[0]] + [pd.NaT, pd.NaT] - - mask = df.A < 1 - df.loc[mask, 'C'] = df.loc[mask].index - tm.assert_frame_equal(df, expected) - - def test_loc_setitem_datetime(self): - - # GH 9516 - dt1 = Timestamp('20130101 09:00:00') - dt2 = Timestamp('20130101 10:00:00') - - for conv in [lambda x: x, lambda x: x.to_datetime64(), - lambda x: x.to_pydatetime(), lambda x: np.datetime64(x)]: - - df = pd.DataFrame() - df.loc[conv(dt1), 'one'] = 100 - df.loc[conv(dt2), 'one'] = 200 - - expected = DataFrame({'one': [100.0, 200.0]}, index=[dt1, dt2]) - tm.assert_frame_equal(df, expected) - - def test_series_partial_set(self): - # partial set with new index - # Regression from GH4825 - ser = Series([0.1, 0.2], index=[1, 2]) - - # loc - expected = Series([np.nan, 0.2, np.nan], index=[3, 2, 3]) - result = ser.loc[[3, 2, 3]] - tm.assert_series_equal(result, expected, check_index_type=True) - - expected = Series([np.nan, 0.2, np.nan, np.nan], index=[3, 2, 3, 'x']) - result = ser.loc[[3, 2, 3, 'x']] - tm.assert_series_equal(result, expected, check_index_type=True) - - expected = Series([0.2, 0.2, 0.1], index=[2, 2, 1]) - result = ser.loc[[2, 2, 1]] - tm.assert_series_equal(result, expected, check_index_type=True) - - expected = Series([0.2, 0.2, np.nan, 0.1], index=[2, 2, 'x', 1]) - result = ser.loc[[2, 2, 'x', 1]] - tm.assert_series_equal(result, expected, check_index_type=True) - - # raises as nothing in in the index - self.assertRaises(KeyError, lambda: ser.loc[[3, 3, 3]]) - - expected = Series([0.2, 0.2, np.nan], index=[2, 2, 3]) - result = ser.loc[[2, 2, 3]] - tm.assert_series_equal(result, expected, check_index_type=True) - - expected = Series([0.3, np.nan, np.nan], index=[3, 4, 4]) - result = Series([0.1, 0.2, 0.3], index=[1, 2, 3]).loc[[3, 4, 4]] - tm.assert_series_equal(result, expected, check_index_type=True) - - expected = Series([np.nan, 0.3, 0.3], index=[5, 3, 3]) - result = Series([0.1, 0.2, 0.3, 0.4], - index=[1, 2, 3, 4]).loc[[5, 3, 3]] - tm.assert_series_equal(result, expected, check_index_type=True) - - expected = Series([np.nan, 0.4, 0.4], index=[5, 4, 4]) - result = Series([0.1, 0.2, 0.3, 0.4], - index=[1, 2, 3, 4]).loc[[5, 4, 4]] - tm.assert_series_equal(result, expected, check_index_type=True) - - expected = Series([0.4, np.nan, np.nan], index=[7, 2, 2]) - result = Series([0.1, 0.2, 0.3, 0.4], - index=[4, 5, 6, 7]).loc[[7, 2, 2]] - tm.assert_series_equal(result, expected, check_index_type=True) - - expected = Series([0.4, np.nan, np.nan], index=[4, 5, 5]) - result = Series([0.1, 0.2, 0.3, 0.4], - index=[1, 2, 3, 4]).loc[[4, 5, 5]] - tm.assert_series_equal(result, expected, check_index_type=True) - - # iloc - expected = Series([0.2, 0.2, 0.1, 0.1], 
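Reviewer note: the removed partial-setting tests also covered row enlargement on a frame: assigning to a label that does not yet exist appends it, through either `.loc` or `.at`. A compact sketch with a hypothetical frame of my own:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0]}, index=[0, 1])

df.loc[2, 'A'] = 7.0           # new row label: the frame grows by one row
assert list(df.index) == [0, 1, 2]

df.at[3, 'A'] = 8.0            # .at enlarges the same way for scalar sets
assert df.loc[3, 'A'] == 8.0
```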
index=[2, 2, 1, 1]) - result = ser.iloc[[1, 1, 0, 0]] - tm.assert_series_equal(result, expected, check_index_type=True) - - def test_series_partial_set_with_name(self): - # GH 11497 - - idx = Index([1, 2], dtype='int64', name='idx') - ser = Series([0.1, 0.2], index=idx, name='s') - - # loc - exp_idx = Index([3, 2, 3], dtype='int64', name='idx') - expected = Series([np.nan, 0.2, np.nan], index=exp_idx, name='s') - result = ser.loc[[3, 2, 3]] - tm.assert_series_equal(result, expected, check_index_type=True) - - exp_idx = Index([3, 2, 3, 'x'], dtype='object', name='idx') - expected = Series([np.nan, 0.2, np.nan, np.nan], index=exp_idx, - name='s') - result = ser.loc[[3, 2, 3, 'x']] - tm.assert_series_equal(result, expected, check_index_type=True) - - exp_idx = Index([2, 2, 1], dtype='int64', name='idx') - expected = Series([0.2, 0.2, 0.1], index=exp_idx, name='s') - result = ser.loc[[2, 2, 1]] - tm.assert_series_equal(result, expected, check_index_type=True) - - exp_idx = Index([2, 2, 'x', 1], dtype='object', name='idx') - expected = Series([0.2, 0.2, np.nan, 0.1], index=exp_idx, name='s') - result = ser.loc[[2, 2, 'x', 1]] - tm.assert_series_equal(result, expected, check_index_type=True) - - # raises as nothing in in the index - self.assertRaises(KeyError, lambda: ser.loc[[3, 3, 3]]) - - exp_idx = Index([2, 2, 3], dtype='int64', name='idx') - expected = Series([0.2, 0.2, np.nan], index=exp_idx, name='s') - result = ser.loc[[2, 2, 3]] - tm.assert_series_equal(result, expected, check_index_type=True) - - exp_idx = Index([3, 4, 4], dtype='int64', name='idx') - expected = Series([0.3, np.nan, np.nan], index=exp_idx, name='s') - idx = Index([1, 2, 3], dtype='int64', name='idx') - result = Series([0.1, 0.2, 0.3], index=idx, name='s').loc[[3, 4, 4]] - tm.assert_series_equal(result, expected, check_index_type=True) - - exp_idx = Index([5, 3, 3], dtype='int64', name='idx') - expected = Series([np.nan, 0.3, 0.3], index=exp_idx, name='s') - idx = Index([1, 2, 3, 4], dtype='int64', name='idx') - result = Series([0.1, 0.2, 0.3, 0.4], index=idx, - name='s').loc[[5, 3, 3]] - tm.assert_series_equal(result, expected, check_index_type=True) - - exp_idx = Index([5, 4, 4], dtype='int64', name='idx') - expected = Series([np.nan, 0.4, 0.4], index=exp_idx, name='s') - idx = Index([1, 2, 3, 4], dtype='int64', name='idx') - result = Series([0.1, 0.2, 0.3, 0.4], index=idx, - name='s').loc[[5, 4, 4]] - tm.assert_series_equal(result, expected, check_index_type=True) - - exp_idx = Index([7, 2, 2], dtype='int64', name='idx') - expected = Series([0.4, np.nan, np.nan], index=exp_idx, name='s') - idx = Index([4, 5, 6, 7], dtype='int64', name='idx') - result = Series([0.1, 0.2, 0.3, 0.4], index=idx, - name='s').loc[[7, 2, 2]] - tm.assert_series_equal(result, expected, check_index_type=True) - - exp_idx = Index([4, 5, 5], dtype='int64', name='idx') - expected = Series([0.4, np.nan, np.nan], index=exp_idx, name='s') - idx = Index([1, 2, 3, 4], dtype='int64', name='idx') - result = Series([0.1, 0.2, 0.3, 0.4], index=idx, - name='s').loc[[4, 5, 5]] - tm.assert_series_equal(result, expected, check_index_type=True) - - # iloc - exp_idx = Index([2, 2, 1, 1], dtype='int64', name='idx') - expected = Series([0.2, 0.2, 0.1, 0.1], index=exp_idx, name='s') - result = ser.iloc[[1, 1, 0, 0]] - tm.assert_series_equal(result, expected, check_index_type=True) - - def test_series_partial_set_datetime(self): - # GH 11497 - - idx = date_range('2011-01-01', '2011-01-02', freq='D', name='idx') - ser = Series([0.1, 0.2], index=idx, name='s') - - 
result = ser.loc[[Timestamp('2011-01-01'), Timestamp('2011-01-02')]] - exp = Series([0.1, 0.2], index=idx, name='s') - tm.assert_series_equal(result, exp, check_index_type=True) - - keys = [Timestamp('2011-01-02'), Timestamp('2011-01-02'), - Timestamp('2011-01-01')] - exp = Series([0.2, 0.2, 0.1], index=pd.DatetimeIndex(keys, name='idx'), - name='s') - tm.assert_series_equal(ser.loc[keys], exp, check_index_type=True) - - keys = [Timestamp('2011-01-03'), Timestamp('2011-01-02'), - Timestamp('2011-01-03')] - exp = Series([np.nan, 0.2, np.nan], - index=pd.DatetimeIndex(keys, name='idx'), name='s') - tm.assert_series_equal(ser.loc[keys], exp, check_index_type=True) - - def test_series_partial_set_period(self): - # GH 11497 - - idx = pd.period_range('2011-01-01', '2011-01-02', freq='D', name='idx') - ser = Series([0.1, 0.2], index=idx, name='s') - - result = ser.loc[[pd.Period('2011-01-01', freq='D'), - pd.Period('2011-01-02', freq='D')]] - exp = Series([0.1, 0.2], index=idx, name='s') - tm.assert_series_equal(result, exp, check_index_type=True) - - keys = [pd.Period('2011-01-02', freq='D'), - pd.Period('2011-01-02', freq='D'), - pd.Period('2011-01-01', freq='D')] - exp = Series([0.2, 0.2, 0.1], index=pd.PeriodIndex(keys, name='idx'), - name='s') - tm.assert_series_equal(ser.loc[keys], exp, check_index_type=True) - - keys = [pd.Period('2011-01-03', freq='D'), - pd.Period('2011-01-02', freq='D'), - pd.Period('2011-01-03', freq='D')] - exp = Series([np.nan, 0.2, np.nan], - index=pd.PeriodIndex(keys, name='idx'), name='s') - result = ser.loc[keys] - tm.assert_series_equal(result, exp) - - def test_partial_set_invalid(self): - - # GH 4940 - # allow only setting of 'valid' values - - orig = tm.makeTimeDataFrame() - df = orig.copy() - - # don't allow not string inserts - def f(): - df.loc[100.0, :] = df.ix[0] - - self.assertRaises(TypeError, f) - - def f(): - df.loc[100, :] = df.ix[0] - - self.assertRaises(TypeError, f) - - def f(): - df.ix[100.0, :] = df.ix[0] - - self.assertRaises(TypeError, f) - - def f(): - df.ix[100, :] = df.ix[0] - - self.assertRaises(ValueError, f) - - # allow object conversion here - df = orig.copy() - df.loc['a', :] = df.ix[0] - exp = orig.append(pd.Series(df.ix[0], name='a')) - tm.assert_frame_equal(df, exp) - tm.assert_index_equal(df.index, - pd.Index(orig.index.tolist() + ['a'])) - self.assertEqual(df.index.dtype, 'object') - - def test_partial_set_empty_series(self): - - # GH5226 - - # partially set with an empty object series - s = Series() - s.loc[1] = 1 - tm.assert_series_equal(s, Series([1], index=[1])) - s.loc[3] = 3 - tm.assert_series_equal(s, Series([1, 3], index=[1, 3])) - - s = Series() - s.loc[1] = 1. - tm.assert_series_equal(s, Series([1.], index=[1])) - s.loc[3] = 3. 
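Reviewer note: the long `test_series_partial_set` block relies on `.loc` filling labels missing from the index with NaN, behavior that later pandas deprecates. A minimal sketch of the version-stable spelling of the same alignment via `reindex` (my own example values):

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.1, 0.2], index=[1, 2])

# at the time of this change, ser.loc[[3, 2, 3]] produced this result
# directly; reindex expresses the same thing without relying on loc's
# missing-label fallback
result = ser.reindex([3, 2, 3])
expected = pd.Series([np.nan, 0.2, np.nan], index=[3, 2, 3])
pd.testing.assert_series_equal(result, expected)
```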
- tm.assert_series_equal(s, Series([1., 3.], index=[1, 3])) - - s = Series() - s.loc['foo'] = 1 - tm.assert_series_equal(s, Series([1], index=['foo'])) - s.loc['bar'] = 3 - tm.assert_series_equal(s, Series([1, 3], index=['foo', 'bar'])) - s.loc[3] = 4 - tm.assert_series_equal(s, Series([1, 3, 4], index=['foo', 'bar', 3])) - - def test_partial_set_empty_frame(self): - - # partially set with an empty object - # frame - df = DataFrame() - - def f(): - df.loc[1] = 1 - - self.assertRaises(ValueError, f) - - def f(): - df.loc[1] = Series([1], index=['foo']) - - self.assertRaises(ValueError, f) - - def f(): - df.loc[:, 1] = 1 - - self.assertRaises(ValueError, f) - - # these work as they don't really change - # anything but the index - # GH5632 - expected = DataFrame(columns=['foo'], index=pd.Index( - [], dtype='int64')) - - def f(): - df = DataFrame() - df['foo'] = Series([], dtype='object') - return df - - tm.assert_frame_equal(f(), expected) - - def f(): - df = DataFrame() - df['foo'] = Series(df.index) - return df - - tm.assert_frame_equal(f(), expected) - - def f(): - df = DataFrame() - df['foo'] = df.index - return df - - tm.assert_frame_equal(f(), expected) - - expected = DataFrame(columns=['foo'], - index=pd.Index([], dtype='int64')) - expected['foo'] = expected['foo'].astype('float64') - - def f(): - df = DataFrame() - df['foo'] = [] - return df - - tm.assert_frame_equal(f(), expected) - - def f(): - df = DataFrame() - df['foo'] = Series(range(len(df))) - return df - - tm.assert_frame_equal(f(), expected) - - def f(): - df = DataFrame() - tm.assert_index_equal(df.index, pd.Index([], dtype='object')) - df['foo'] = range(len(df)) - return df - - expected = DataFrame(columns=['foo'], - index=pd.Index([], dtype='int64')) - expected['foo'] = expected['foo'].astype('float64') - tm.assert_frame_equal(f(), expected) - - df = DataFrame() - tm.assert_index_equal(df.columns, pd.Index([], dtype=object)) - df2 = DataFrame() - df2[1] = Series([1], index=['foo']) - df.loc[:, 1] = Series([1], index=['foo']) - tm.assert_frame_equal(df, DataFrame([[1]], index=['foo'], columns=[1])) - tm.assert_frame_equal(df, df2) - - # no index to start - expected = DataFrame({0: Series(1, index=range(4))}, - columns=['A', 'B', 0]) - - df = DataFrame(columns=['A', 'B']) - df[0] = Series(1, index=range(4)) - df.dtypes - str(df) - tm.assert_frame_equal(df, expected) - - df = DataFrame(columns=['A', 'B']) - df.loc[:, 0] = Series(1, index=range(4)) - df.dtypes - str(df) - tm.assert_frame_equal(df, expected) - - def test_partial_set_empty_frame_row(self): - # GH5720, GH5744 - # don't create rows when empty - expected = DataFrame(columns=['A', 'B', 'New'], - index=pd.Index([], dtype='int64')) - expected['A'] = expected['A'].astype('int64') - expected['B'] = expected['B'].astype('float64') - expected['New'] = expected['New'].astype('float64') - - df = DataFrame({"A": [1, 2, 3], "B": [1.2, 4.2, 5.2]}) - y = df[df.A > 5] - y['New'] = np.nan - tm.assert_frame_equal(y, expected) - # tm.assert_frame_equal(y,expected) - - expected = DataFrame(columns=['a', 'b', 'c c', 'd']) - expected['d'] = expected['d'].astype('int64') - df = DataFrame(columns=['a', 'b', 'c c']) - df['d'] = 3 - tm.assert_frame_equal(df, expected) - tm.assert_series_equal(df['c c'], Series(name='c c', dtype=object)) - - # reindex columns is ok - df = DataFrame({"A": [1, 2, 3], "B": [1.2, 4.2, 5.2]}) - y = df[df.A > 5] - result = y.reindex(columns=['A', 'B', 'C']) - expected = DataFrame(columns=['A', 'B', 'C'], - index=pd.Index([], dtype='int64')) - expected['A'] = 
expected['A'].astype('int64') - expected['B'] = expected['B'].astype('float64') - expected['C'] = expected['C'].astype('float64') - tm.assert_frame_equal(result, expected) - - def test_partial_set_empty_frame_set_series(self): - # GH 5756 - # setting with empty Series - df = DataFrame(Series()) - tm.assert_frame_equal(df, DataFrame({0: Series()})) - - df = DataFrame(Series(name='foo')) - tm.assert_frame_equal(df, DataFrame({'foo': Series()})) - - def test_partial_set_empty_frame_empty_copy_assignment(self): - # GH 5932 - # copy on empty with assignment fails - df = DataFrame(index=[0]) - df = df.copy() - df['a'] = 0 - expected = DataFrame(0, index=[0], columns=['a']) - tm.assert_frame_equal(df, expected) - - def test_partial_set_empty_frame_empty_consistencies(self): - # GH 6171 - # consistency on empty frames - df = DataFrame(columns=['x', 'y']) - df['x'] = [1, 2] - expected = DataFrame(dict(x=[1, 2], y=[np.nan, np.nan])) - tm.assert_frame_equal(df, expected, check_dtype=False) - - df = DataFrame(columns=['x', 'y']) - df['x'] = ['1', '2'] - expected = DataFrame( - dict(x=['1', '2'], y=[np.nan, np.nan]), dtype=object) - tm.assert_frame_equal(df, expected) - - df = DataFrame(columns=['x', 'y']) - df.loc[0, 'x'] = 1 - expected = DataFrame(dict(x=[1], y=[np.nan])) - tm.assert_frame_equal(df, expected, check_dtype=False) - - def test_cache_updating(self): - # GH 4939, make sure to update the cache on setitem - - df = tm.makeDataFrame() - df['A'] # cache series - df.ix["Hello Friend"] = df.ix[0] - self.assertIn("Hello Friend", df['A'].index) - self.assertIn("Hello Friend", df['B'].index) - - panel = tm.makePanel() - panel.ix[0] # get first item into cache - panel.ix[:, :, 'A+1'] = panel.ix[:, :, 'A'] + 1 - self.assertIn("A+1", panel.ix[0].columns) - self.assertIn("A+1", panel.ix[1].columns) - - # 5216 - # make sure that we don't try to set a dead cache - a = np.random.rand(10, 3) - df = DataFrame(a, columns=['x', 'y', 'z']) - tuples = [(i, j) for i in range(5) for j in range(2)] - index = MultiIndex.from_tuples(tuples) - df.index = index - - # setting via chained assignment - # but actually works, since everything is a view - df.loc[0]['z'].iloc[0] = 1. - result = df.loc[(0, 0), 'z'] - self.assertEqual(result, 1) - - # correct setting - df.loc[(0, 0), 'z'] = 2 - result = df.loc[(0, 0), 'z'] - self.assertEqual(result, 2) - - # 10264 - df = DataFrame(np.zeros((5, 5), dtype='int64'), columns=[ - 'a', 'b', 'c', 'd', 'e'], index=range(5)) - df['f'] = 0 - df.f.values[3] = 1 - - # TODO(wesm): unused? 
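Reviewer note: `test_partial_set_empty_frame` separates operations that are valid on an empty frame from those that are not. A small sketch of the core distinction (pytest assumed): a scalar row cannot be set when there are no columns to align against, but adding a column only has to agree with the (empty) index.

```python
import pandas as pd
import pytest

df = pd.DataFrame()
with pytest.raises(ValueError):
    df.loc[1] = 1              # no columns to align a scalar row against

df['foo'] = pd.Series([], dtype='float64')   # column creation is fine
assert list(df.columns) == ['foo']
assert len(df) == 0
```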
- # y = df.iloc[np.arange(2, len(df))] - - df.f.values[3] = 2 - expected = DataFrame(np.zeros((5, 6), dtype='int64'), columns=[ - 'a', 'b', 'c', 'd', 'e', 'f'], index=range(5)) - expected.at[3, 'f'] = 2 - tm.assert_frame_equal(df, expected) - expected = Series([0, 0, 0, 2, 0], name='f') - tm.assert_series_equal(df.f, expected) - - def test_slice_consolidate_invalidate_item_cache(self): - - # this is chained assignment, but will 'work' - with option_context('chained_assignment', None): - - # #3970 - df = DataFrame({"aa": lrange(5), "bb": [2.2] * 5}) - - # Creates a second float block - df["cc"] = 0.0 - - # caches a reference to the 'bb' series - df["bb"] - - # repr machinery triggers consolidation - repr(df) - - # Assignment to wrong series - df['bb'].iloc[0] = 0.17 - df._clear_item_cache() - self.assertAlmostEqual(df['bb'][0], 0.17) - - def test_setitem_cache_updating(self): - # GH 5424 - cont = ['one', 'two', 'three', 'four', 'five', 'six', 'seven'] - - for do_ref in [False, False]: - df = DataFrame({'a': cont, - "b": cont[3:] + cont[:3], - 'c': np.arange(7)}) - - # ref the cache - if do_ref: - df.ix[0, "c"] - - # set it - df.ix[7, 'c'] = 1 - - self.assertEqual(df.ix[0, 'c'], 0.0) - self.assertEqual(df.ix[7, 'c'], 1.0) - - # GH 7084 - # not updating cache on series setting with slices - expected = DataFrame({'A': [600, 600, 600]}, - index=date_range('5/7/2014', '5/9/2014')) - out = DataFrame({'A': [0, 0, 0]}, - index=date_range('5/7/2014', '5/9/2014')) - df = DataFrame({'C': ['A', 'A', 'A'], 'D': [100, 200, 300]}) - - # loop through df to update out - six = Timestamp('5/7/2014') - eix = Timestamp('5/9/2014') - for ix, row in df.iterrows(): - out.loc[six:eix, row['C']] = out.loc[six:eix, row['C']] + row['D'] - - tm.assert_frame_equal(out, expected) - tm.assert_series_equal(out['A'], expected['A']) - - # try via a chain indexing - # this actually works - out = DataFrame({'A': [0, 0, 0]}, - index=date_range('5/7/2014', '5/9/2014')) - for ix, row in df.iterrows(): - v = out[row['C']][six:eix] + row['D'] - out[row['C']][six:eix] = v - - tm.assert_frame_equal(out, expected) - tm.assert_series_equal(out['A'], expected['A']) - - out = DataFrame({'A': [0, 0, 0]}, - index=date_range('5/7/2014', '5/9/2014')) - for ix, row in df.iterrows(): - out.loc[six:eix, row['C']] += row['D'] - - tm.assert_frame_equal(out, expected) - tm.assert_series_equal(out['A'], expected['A']) - - def test_setitem_chained_setfault(self): - - # GH6026 - # setfaults under numpy 1.7.1 (ok on 1.8) - data = ['right', 'left', 'left', 'left', 'right', 'left', 'timeout'] - mdata = ['right', 'left', 'left', 'left', 'right', 'left', 'none'] - - df = DataFrame({'response': np.array(data)}) - mask = df.response == 'timeout' - df.response[mask] = 'none' - tm.assert_frame_equal(df, DataFrame({'response': mdata})) - - recarray = np.rec.fromarrays([data], names=['response']) - df = DataFrame(recarray) - mask = df.response == 'timeout' - df.response[mask] = 'none' - tm.assert_frame_equal(df, DataFrame({'response': mdata})) - - df = DataFrame({'response': data, 'response1': data}) - mask = df.response == 'timeout' - df.response[mask] = 'none' - tm.assert_frame_equal(df, DataFrame({'response': mdata, - 'response1': data})) - - # GH 6056 - expected = DataFrame(dict(A=[np.nan, 'bar', 'bah', 'foo', 'bar'])) - df = DataFrame(dict(A=np.array(['foo', 'bar', 'bah', 'foo', 'bar']))) - df['A'].iloc[0] = np.nan - result = df.head() - tm.assert_frame_equal(result, expected) - - df = DataFrame(dict(A=np.array(['foo', 'bar', 'bah', 'foo', 'bar']))) - 
df.A.iloc[0] = np.nan - result = df.head() - tm.assert_frame_equal(result, expected) - - def test_detect_chained_assignment(self): - - pd.set_option('chained_assignment', 'raise') - - # work with the chain - expected = DataFrame([[-5, 1], [-6, 3]], columns=list('AB')) - df = DataFrame(np.arange(4).reshape(2, 2), - columns=list('AB'), dtype='int64') - self.assertIsNone(df.is_copy) - df['A'][0] = -5 - df['A'][1] = -6 - tm.assert_frame_equal(df, expected) - - # test with the chaining - df = DataFrame({'A': Series(range(2), dtype='int64'), - 'B': np.array(np.arange(2, 4), dtype=np.float64)}) - self.assertIsNone(df.is_copy) - - def f(): - df['A'][0] = -5 - - self.assertRaises(com.SettingWithCopyError, f) - - def f(): - df['A'][1] = np.nan - - self.assertRaises(com.SettingWithCopyError, f) - self.assertIsNone(df['A'].is_copy) - - # using a copy (the chain), fails - df = DataFrame({'A': Series(range(2), dtype='int64'), - 'B': np.array(np.arange(2, 4), dtype=np.float64)}) - - def f(): - df.loc[0]['A'] = -5 - - self.assertRaises(com.SettingWithCopyError, f) - - # doc example - df = DataFrame({'a': ['one', 'one', 'two', 'three', - 'two', 'one', 'six'], - 'c': Series(range(7), dtype='int64')}) - self.assertIsNone(df.is_copy) - expected = DataFrame({'a': ['one', 'one', 'two', 'three', - 'two', 'one', 'six'], - 'c': [42, 42, 2, 3, 4, 42, 6]}) - - def f(): - indexer = df.a.str.startswith('o') - df[indexer]['c'] = 42 - - self.assertRaises(com.SettingWithCopyError, f) - - expected = DataFrame({'A': [111, 'bbb', 'ccc'], 'B': [1, 2, 3]}) - df = DataFrame({'A': ['aaa', 'bbb', 'ccc'], 'B': [1, 2, 3]}) - - def f(): - df['A'][0] = 111 - - self.assertRaises(com.SettingWithCopyError, f) - - def f(): - df.loc[0]['A'] = 111 - - self.assertRaises(com.SettingWithCopyError, f) - - df.loc[0, 'A'] = 111 - tm.assert_frame_equal(df, expected) - - # make sure that is_copy is picked up reconstruction - # GH5475 - df = DataFrame({"A": [1, 2]}) - self.assertIsNone(df.is_copy) - with tm.ensure_clean('__tmp__pickle') as path: - df.to_pickle(path) - df2 = pd.read_pickle(path) - df2["B"] = df2["A"] - df2["B"] = df2["A"] - - # a suprious raise as we are setting the entire column here - # GH5597 - from string import ascii_letters as letters - - def random_text(nobs=100): - df = [] - for i in range(nobs): - idx = np.random.randint(len(letters), size=2) - idx.sort() - df.append([letters[idx[0]:idx[1]]]) - - return DataFrame(df, columns=['letters']) - - df = random_text(100000) - - # always a copy - x = df.iloc[[0, 1, 2]] - self.assertIsNotNone(x.is_copy) - x = df.iloc[[0, 1, 2, 4]] - self.assertIsNotNone(x.is_copy) - - # explicity copy - indexer = df.letters.apply(lambda x: len(x) > 10) - df = df.ix[indexer].copy() - self.assertIsNone(df.is_copy) - df['letters'] = df['letters'].apply(str.lower) - - # implicity take - df = random_text(100000) - indexer = df.letters.apply(lambda x: len(x) > 10) - df = df.ix[indexer] - self.assertIsNotNone(df.is_copy) - df['letters'] = df['letters'].apply(str.lower) - - # implicity take 2 - df = random_text(100000) - indexer = df.letters.apply(lambda x: len(x) > 10) - df = df.ix[indexer] - self.assertIsNotNone(df.is_copy) - df.loc[:, 'letters'] = df['letters'].apply(str.lower) - - # should be ok even though it's a copy! 
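Reviewer note: `test_detect_chained_assignment` guards the SettingWithCopy machinery. A minimal sketch of the failure mode it protects against and the single-indexer fix, under the copy semantics of this era (copy-on-write in much later pandas turns the failure into a guaranteed no-op on the original):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# chained form, df[df.a > 1]['b'] = 0, writes through a temporary copy;
# that is exactly what SettingWithCopyError/Warning flags

# the single .loc call does row and column selection in one indexing
# operation, so the write is guaranteed to hit df itself
df.loc[df.a > 1, 'b'] = 0
assert df['b'].tolist() == [4, 0, 0]
```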
- self.assertIsNone(df.is_copy) - df['letters'] = df['letters'].apply(str.lower) - self.assertIsNone(df.is_copy) - - df = random_text(100000) - indexer = df.letters.apply(lambda x: len(x) > 10) - df.ix[indexer, 'letters'] = df.ix[indexer, 'letters'].apply(str.lower) - - # an identical take, so no copy - df = DataFrame({'a': [1]}).dropna() - self.assertIsNone(df.is_copy) - df['a'] += 1 - - # inplace ops - # original from: - # http://stackoverflow.com/questions/20508968/series-fillna-in-a-multiindex-dataframe-does-not-fill-is-this-a-bug - a = [12, 23] - b = [123, None] - c = [1234, 2345] - d = [12345, 23456] - tuples = [('eyes', 'left'), ('eyes', 'right'), ('ears', 'left'), - ('ears', 'right')] - events = {('eyes', 'left'): a, - ('eyes', 'right'): b, - ('ears', 'left'): c, - ('ears', 'right'): d} - multiind = MultiIndex.from_tuples(tuples, names=['part', 'side']) - zed = DataFrame(events, index=['a', 'b'], columns=multiind) - - def f(): - zed['eyes']['right'].fillna(value=555, inplace=True) - - self.assertRaises(com.SettingWithCopyError, f) - - df = DataFrame(np.random.randn(10, 4)) - s = df.iloc[:, 0].sort_values() - tm.assert_series_equal(s, df.iloc[:, 0].sort_values()) - tm.assert_series_equal(s, df[0].sort_values()) - - # false positives GH6025 - df = DataFrame({'column1': ['a', 'a', 'a'], 'column2': [4, 8, 9]}) - str(df) - df['column1'] = df['column1'] + 'b' - str(df) - df = df[df['column2'] != 8] - str(df) - df['column1'] = df['column1'] + 'c' - str(df) - - # from SO: - # http://stackoverflow.com/questions/24054495/potential-bug-setting-value-for-undefined-column-using-iloc - df = DataFrame(np.arange(0, 9), columns=['count']) - df['group'] = 'b' - - def f(): - df.iloc[0:5]['group'] = 'a' - - self.assertRaises(com.SettingWithCopyError, f) - - # mixed type setting - # same dtype & changing dtype - df = DataFrame(dict(A=date_range('20130101', periods=5), - B=np.random.randn(5), - C=np.arange(5, dtype='int64'), - D=list('abcde'))) - - def f(): - df.ix[2]['D'] = 'foo' - - self.assertRaises(com.SettingWithCopyError, f) - - def f(): - df.ix[2]['C'] = 'foo' - - self.assertRaises(com.SettingWithCopyError, f) - - def f(): - df['C'][2] = 'foo' + def test_index_type_coercion(self): - self.assertRaises(com.SettingWithCopyError, f) + with catch_warnings(record=True): - def test_setting_with_copy_bug(self): + # GH 11836 + # if we have an index type and set it with something that looks + # to numpy like the same, but is actually, not + # (e.g. 
setting with a float or string '0') + # then we need to coerce to object - # operating on a copy - df = pd.DataFrame({'a': list(range(4)), - 'b': list('ab..'), - 'c': ['a', 'b', np.nan, 'd']}) - mask = pd.isnull(df.c) + # integer indexes + for s in [Series(range(5)), + Series(range(5), index=range(1, 6))]: - def f(): - df[['c']][mask] = df[['b']][mask] - - self.assertRaises(com.SettingWithCopyError, f) - - # invalid warning as we are returning a new object - # GH 8730 - df1 = DataFrame({'x': Series(['a', 'b', 'c']), - 'y': Series(['d', 'e', 'f'])}) - df2 = df1[['x']] - - # this should not raise - df2['y'] = ['g', 'h', 'i'] - - def test_detect_chained_assignment_warnings(self): - - # warnings - with option_context('chained_assignment', 'warn'): - df = DataFrame({'A': ['aaa', 'bbb', 'ccc'], 'B': [1, 2, 3]}) - with tm.assert_produces_warning( - expected_warning=com.SettingWithCopyWarning): - df.loc[0]['A'] = 111 - - def test_float64index_slicing_bug(self): - # GH 5557, related to slicing a float index - ser = {256: 2321.0, - 1: 78.0, - 2: 2716.0, - 3: 0.0, - 4: 369.0, - 5: 0.0, - 6: 269.0, - 7: 0.0, - 8: 0.0, - 9: 0.0, - 10: 3536.0, - 11: 0.0, - 12: 24.0, - 13: 0.0, - 14: 931.0, - 15: 0.0, - 16: 101.0, - 17: 78.0, - 18: 9643.0, - 19: 0.0, - 20: 0.0, - 21: 0.0, - 22: 63761.0, - 23: 0.0, - 24: 446.0, - 25: 0.0, - 26: 34773.0, - 27: 0.0, - 28: 729.0, - 29: 78.0, - 30: 0.0, - 31: 0.0, - 32: 3374.0, - 33: 0.0, - 34: 1391.0, - 35: 0.0, - 36: 361.0, - 37: 0.0, - 38: 61808.0, - 39: 0.0, - 40: 0.0, - 41: 0.0, - 42: 6677.0, - 43: 0.0, - 44: 802.0, - 45: 0.0, - 46: 2691.0, - 47: 0.0, - 48: 3582.0, - 49: 0.0, - 50: 734.0, - 51: 0.0, - 52: 627.0, - 53: 70.0, - 54: 2584.0, - 55: 0.0, - 56: 324.0, - 57: 0.0, - 58: 605.0, - 59: 0.0, - 60: 0.0, - 61: 0.0, - 62: 3989.0, - 63: 10.0, - 64: 42.0, - 65: 0.0, - 66: 904.0, - 67: 0.0, - 68: 88.0, - 69: 70.0, - 70: 8172.0, - 71: 0.0, - 72: 0.0, - 73: 0.0, - 74: 64902.0, - 75: 0.0, - 76: 347.0, - 77: 0.0, - 78: 36605.0, - 79: 0.0, - 80: 379.0, - 81: 70.0, - 82: 0.0, - 83: 0.0, - 84: 3001.0, - 85: 0.0, - 86: 1630.0, - 87: 7.0, - 88: 364.0, - 89: 0.0, - 90: 67404.0, - 91: 9.0, - 92: 0.0, - 93: 0.0, - 94: 7685.0, - 95: 0.0, - 96: 1017.0, - 97: 0.0, - 98: 2831.0, - 99: 0.0, - 100: 2963.0, - 101: 0.0, - 102: 854.0, - 103: 0.0, - 104: 0.0, - 105: 0.0, - 106: 0.0, - 107: 0.0, - 108: 0.0, - 109: 0.0, - 110: 0.0, - 111: 0.0, - 112: 0.0, - 113: 0.0, - 114: 0.0, - 115: 0.0, - 116: 0.0, - 117: 0.0, - 118: 0.0, - 119: 0.0, - 120: 0.0, - 121: 0.0, - 122: 0.0, - 123: 0.0, - 124: 0.0, - 125: 0.0, - 126: 67744.0, - 127: 22.0, - 128: 264.0, - 129: 0.0, - 260: 197.0, - 268: 0.0, - 265: 0.0, - 269: 0.0, - 261: 0.0, - 266: 1198.0, - 267: 0.0, - 262: 2629.0, - 258: 775.0, - 257: 0.0, - 263: 0.0, - 259: 0.0, - 264: 163.0, - 250: 10326.0, - 251: 0.0, - 252: 1228.0, - 253: 0.0, - 254: 2769.0, - 255: 0.0} - - # smoke test for the repr - s = Series(ser) - result = s.value_counts() - str(result) - - def test_set_ix_out_of_bounds_axis_0(self): - df = pd.DataFrame( - randn(2, 5), index=["row%s" % i for i in range(2)], - columns=["col%s" % i for i in range(5)]) - self.assertRaises(ValueError, df.ix.__setitem__, (2, 0), 100) - - def test_set_ix_out_of_bounds_axis_1(self): - df = pd.DataFrame( - randn(5, 2), index=["row%s" % i for i in range(5)], - columns=["col%s" % i for i in range(2)]) - self.assertRaises(ValueError, df.ix.__setitem__, (0, 2), 100) - - def test_iloc_empty_list_indexer_is_ok(self): - from pandas.util.testing import makeCustomDataframe as mkdf - df = mkdf(5, 2) - # vertical empty - 
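Reviewer note: the `*_empty_list_indexer_is_ok` tests assert that an empty list selects an empty slice of the matching axis while leaving the other axis intact. A compact sketch with a hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['a', 'b'])

# an empty list selects zero rows (or zero columns) but keeps the
# other axis and all dtypes, exactly like an empty positional slice
pd.testing.assert_frame_equal(df.iloc[[], :], df.iloc[:0, :])
pd.testing.assert_frame_equal(df.iloc[:, []], df.iloc[:, :0])
pd.testing.assert_frame_equal(df.loc[[], :], df.iloc[:0, :])
```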
tm.assert_frame_equal(df.iloc[:, []], df.iloc[:, :0], - check_index_type=True, check_column_type=True) - # horizontal empty - tm.assert_frame_equal(df.iloc[[], :], df.iloc[:0, :], - check_index_type=True, check_column_type=True) - # horizontal empty - tm.assert_frame_equal(df.iloc[[]], df.iloc[:0, :], - check_index_type=True, - check_column_type=True) - - def test_loc_empty_list_indexer_is_ok(self): - from pandas.util.testing import makeCustomDataframe as mkdf - df = mkdf(5, 2) - # vertical empty - tm.assert_frame_equal(df.loc[:, []], df.iloc[:, :0], - check_index_type=True, check_column_type=True) - # horizontal empty - tm.assert_frame_equal(df.loc[[], :], df.iloc[:0, :], - check_index_type=True, check_column_type=True) - # horizontal empty - tm.assert_frame_equal(df.loc[[]], df.iloc[:0, :], - check_index_type=True, - check_column_type=True) - - def test_ix_empty_list_indexer_is_ok(self): - from pandas.util.testing import makeCustomDataframe as mkdf - df = mkdf(5, 2) - # vertical empty - tm.assert_frame_equal(df.ix[:, []], df.iloc[:, :0], - check_index_type=True, - check_column_type=True) - # horizontal empty - tm.assert_frame_equal(df.ix[[], :], df.iloc[:0, :], - check_index_type=True, - check_column_type=True) - # horizontal empty - tm.assert_frame_equal(df.ix[[]], df.iloc[:0, :], - check_index_type=True, - check_column_type=True) + assert s.index.is_integer() - def test_index_type_coercion(self): + for indexer in [lambda x: x.ix, + lambda x: x.loc, + lambda x: x]: + s2 = s.copy() + indexer(s2)[0.1] = 0 + assert s2.index.is_floating() + assert indexer(s2)[0.1] == 0 - # GH 11836 - # if we have an index type and set it with something that looks - # to numpy like the same, but is actually, not - # (e.g. setting with a float or string '0') - # then we need to coerce to object + s2 = s.copy() + indexer(s2)[0.0] = 0 + exp = s.index + if 0 not in s: + exp = Index(s.index.tolist() + [0]) + tm.assert_index_equal(s2.index, exp) - # integer indexes - for s in [Series(range(5)), - Series(range(5), index=range(1, 6))]: + s2 = s.copy() + indexer(s2)['0'] = 0 + assert s2.index.is_object() - self.assertTrue(s.index.is_integer()) + for s in [Series(range(5), index=np.arange(5.))]: - for indexer in [lambda x: x.ix, - lambda x: x.loc, - lambda x: x]: - s2 = s.copy() - indexer(s2)[0.1] = 0 - self.assertTrue(s2.index.is_floating()) - self.assertTrue(indexer(s2)[0.1] == 0) + assert s.index.is_floating() - s2 = s.copy() - indexer(s2)[0.0] = 0 - exp = s.index - if 0 not in s: - exp = Index(s.index.tolist() + [0]) - tm.assert_index_equal(s2.index, exp) + for idxr in [lambda x: x.ix, + lambda x: x.loc, + lambda x: x]: - s2 = s.copy() - indexer(s2)['0'] = 0 - self.assertTrue(s2.index.is_object()) + s2 = s.copy() + idxr(s2)[0.1] = 0 + assert s2.index.is_floating() + assert idxr(s2)[0.1] == 0 - for s in [Series(range(5), index=np.arange(5.))]: + s2 = s.copy() + idxr(s2)[0.0] = 0 + tm.assert_index_equal(s2.index, s.index) - self.assertTrue(s.index.is_floating()) + s2 = s.copy() + idxr(s2)['0'] = 0 + assert s2.index.is_object() - for idxr in [lambda x: x.ix, - lambda x: x.loc, - lambda x: x]: - s2 = s.copy() - idxr(s2)[0.1] = 0 - self.assertTrue(s2.index.is_floating()) - self.assertTrue(idxr(s2)[0.1] == 0) +class TestMisc(Base): - s2 = s.copy() - idxr(s2)[0.0] = 0 - tm.assert_index_equal(s2.index, s.index) + def test_indexer_caching(self): + # GH5727 + # make sure that indexers are in the _internal_names_set + n = 1000001 + arrays = [lrange(n), lrange(n)] + index = MultiIndex.from_tuples(lzip(*arrays)) + s = 
Series(np.zeros(n), index=index) + str(s) - s2 = s.copy() - idxr(s2)['0'] = 0 - self.assertTrue(s2.index.is_object()) + # setitem + expected = Series(np.ones(n), index=index) + s = Series(np.zeros(n), index=index) + s[s == 0] = 1 + tm.assert_series_equal(s, expected) def test_float_index_to_mixed(self): df = DataFrame({0.0: np.random.rand(10), 1.0: np.random.rand(10)}) @@ -5097,13 +624,6 @@ def test_float_index_to_mixed(self): 'a': [10] * 10}), df) - def test_duplicate_ix_returns_series(self): - df = DataFrame(np.random.randn(3, 3), index=[0.1, 0.2, 0.2], - columns=list('abc')) - r = df.ix[0.2, 'a'] - e = df.loc[0.2, 'a'] - tm.assert_series_equal(r, e) - def test_float_index_non_scalar_assignment(self): df = DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]}, index=[1., 2., 3.]) df.loc[df.index[:2]] = 1 @@ -5118,9 +638,9 @@ def test_float_index_non_scalar_assignment(self): def test_float_index_at_iat(self): s = pd.Series([1, 2, 3], index=[0.1, 0.2, 0.3]) for el, item in s.iteritems(): - self.assertEqual(s.at[el], item) + assert s.at[el] == item for i in range(len(s)): - self.assertEqual(s.iat[i], i + 1) + assert s.iat[i] == i + 1 def test_rhs_alignment(self): # GH8258, tests that both rows & columns are aligned to what is @@ -5139,15 +659,18 @@ def run_tests(df, rhs, right): tm.assert_frame_equal(left, right) left = df.copy() - left.ix[s, l] = rhs + with catch_warnings(record=True): + left.ix[s, l] = rhs tm.assert_frame_equal(left, right) left = df.copy() - left.ix[i, j] = rhs + with catch_warnings(record=True): + left.ix[i, j] = rhs tm.assert_frame_equal(left, right) left = df.copy() - left.ix[r, c] = rhs + with catch_warnings(record=True): + left.ix[r, c] = rhs tm.assert_frame_equal(left, right) xs = np.arange(20).reshape(5, 4) @@ -5180,7 +703,7 @@ def assert_slices_equivalent(l_slc, i_slc): if not idx.is_integer: # For integer indices, ix and plain getitem are position-based. 
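Reviewer note: `test_index_type_coercion` checks that setting with a label whose type does not match the index coerces the index to a wider type rather than silently reusing the old one. A minimal sketch of two of the cases, using dtype checks instead of the `is_integer`/`is_floating` helpers (which later pandas deprecates):

```python
import pandas as pd

s = pd.Series(range(3))             # integer index [0, 1, 2]

s2 = s.copy()
s2[0.5] = 99                        # float label: index coerces to float64
assert s2.index.dtype == 'float64'
assert s2[0.5] == 99

s3 = s.copy()
s3['0'] = 99                        # string label: index coerces to object
assert s3.index.dtype == object
```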
tm.assert_series_equal(s[l_slc], s.iloc[i_slc]) - tm.assert_series_equal(s.ix[l_slc], s.iloc[i_slc]) + tm.assert_series_equal(s.loc[l_slc], s.iloc[i_slc]) for idx in [_mklbl('A', 20), np.arange(20) + 100, np.linspace(100, 150, 20)]: @@ -5191,42 +714,16 @@ def assert_slices_equivalent(l_slc, i_slc): assert_slices_equivalent(SLC[idx[13]:idx[9]:-1], SLC[13:8:-1]) assert_slices_equivalent(SLC[idx[9]:idx[13]:-1], SLC[:0]) - def test_multiindex_label_slicing_with_negative_step(self): - s = Series(np.arange(20), - MultiIndex.from_product([list('abcde'), np.arange(4)])) - SLC = pd.IndexSlice - - def assert_slices_equivalent(l_slc, i_slc): - tm.assert_series_equal(s.loc[l_slc], s.iloc[i_slc]) - tm.assert_series_equal(s[l_slc], s.iloc[i_slc]) - tm.assert_series_equal(s.ix[l_slc], s.iloc[i_slc]) - - assert_slices_equivalent(SLC[::-1], SLC[::-1]) - - assert_slices_equivalent(SLC['d'::-1], SLC[15::-1]) - assert_slices_equivalent(SLC[('d', )::-1], SLC[15::-1]) - - assert_slices_equivalent(SLC[:'d':-1], SLC[:11:-1]) - assert_slices_equivalent(SLC[:('d', ):-1], SLC[:11:-1]) - - assert_slices_equivalent(SLC['d':'b':-1], SLC[15:3:-1]) - assert_slices_equivalent(SLC[('d', ):'b':-1], SLC[15:3:-1]) - assert_slices_equivalent(SLC['d':('b', ):-1], SLC[15:3:-1]) - assert_slices_equivalent(SLC[('d', ):('b', ):-1], SLC[15:3:-1]) - assert_slices_equivalent(SLC['b':'d':-1], SLC[:0]) - - assert_slices_equivalent(SLC[('c', 2)::-1], SLC[10::-1]) - assert_slices_equivalent(SLC[:('c', 2):-1], SLC[:9:-1]) - assert_slices_equivalent(SLC[('e', 0):('c', 2):-1], SLC[16:9:-1]) - def test_slice_with_zero_step_raises(self): s = Series(np.arange(20), index=_mklbl('A', 20)) - self.assertRaisesRegexp(ValueError, 'slice step cannot be zero', - lambda: s[::0]) - self.assertRaisesRegexp(ValueError, 'slice step cannot be zero', - lambda: s.loc[::0]) - self.assertRaisesRegexp(ValueError, 'slice step cannot be zero', - lambda: s.ix[::0]) + tm.assert_raises_regex(ValueError, 'slice step cannot be zero', + lambda: s[::0]) + tm.assert_raises_regex(ValueError, 'slice step cannot be zero', + lambda: s.loc[::0]) + with catch_warnings(record=True): + tm.assert_raises_regex(ValueError, + 'slice step cannot be zero', + lambda: s.ix[::0]) def test_indexing_assignment_dict_already_exists(self): df = pd.DataFrame({'x': [1, 2, 6], @@ -5241,11 +738,13 @@ def test_indexing_assignment_dict_already_exists(self): def test_indexing_dtypes_on_empty(self): # Check that .iloc and .ix return correct dtypes GH9983 df = DataFrame({'a': [1, 2, 3], 'b': ['b', 'b2', 'b3']}) - df2 = df.ix[[], :] + with catch_warnings(record=True): + df2 = df.ix[[], :] - self.assertEqual(df2.loc[:, 'a'].dtype, np.int64) + assert df2.loc[:, 'a'].dtype == np.int64 tm.assert_series_equal(df2.loc[:, 'a'], df2.iloc[:, 0]) - tm.assert_series_equal(df2.loc[:, 'a'], df2.ix[:, 0]) + with catch_warnings(record=True): + tm.assert_series_equal(df2.loc[:, 'a'], df2.ix[:, 0]) def test_range_in_series_indexing(self): # range can cause an indexing error @@ -5277,7 +776,7 @@ def test_non_reducing_slice(self): ] for slice_ in slices: tslice_ = _non_reducing_slice(slice_) - self.assertTrue(isinstance(df.loc[tslice_], DataFrame)) + assert isinstance(df.loc[tslice_], DataFrame) def test_list_slice(self): # like dataframe getitem @@ -5292,16 +791,16 @@ def test_maybe_numeric_slice(self): df = pd.DataFrame({'A': [1, 2], 'B': ['c', 'd'], 'C': [True, False]}) result = _maybe_numeric_slice(df, slice_=None) expected = pd.IndexSlice[:, ['A']] - self.assertEqual(result, expected) + assert result == expected 
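Reviewer note: `test_slice_with_zero_step_raises` pins the error for a zero slice step across all indexers, with the message matched by the regex above. A minimal sketch (pytest assumed; error message as asserted in this patch):

```python
import numpy as np
import pandas as pd
import pytest

s = pd.Series(np.arange(5))

with pytest.raises(ValueError, match='slice step cannot be zero'):
    s[::0]
with pytest.raises(ValueError, match='slice step cannot be zero'):
    s.loc[::0]
```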
result = _maybe_numeric_slice(df, None, include_bool=True) expected = pd.IndexSlice[:, ['A', 'C']] result = _maybe_numeric_slice(df, [1]) expected = [1] - self.assertEqual(result, expected) + assert result == expected -class TestSeriesNoneCoercion(tm.TestCase): +class TestSeriesNoneCoercion(object): EXPECTED_RESULTS = [ # For numeric series, we should coerce to NaN. ([1, 2, 3], [np.nan, 2, 3]), @@ -5348,7 +847,7 @@ def test_coercion_with_loc_and_series(self): tm.assert_series_equal(start_series, expected_series) -class TestDataframeNoneCoercion(tm.TestCase): +class TestDataframeNoneCoercion(object): EXPECTED_SINGLE_ROW_RESULTS = [ # For numeric series, we should coerce to NaN. ([1, 2, 3], [np.nan, 2, 3]), @@ -5404,8 +903,3 @@ def test_none_coercion_mixed_dtypes(self): datetime(2000, 1, 3)], 'd': [None, 'b', 'c']}) tm.assert_frame_equal(start_dataframe, exp) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/indexing/test_indexing_slow.py b/pandas/tests/indexing/test_indexing_slow.py index 5d563e20087b9..08d390a6a213e 100644 --- a/pandas/tests/indexing/test_indexing_slow.py +++ b/pandas/tests/indexing/test_indexing_slow.py @@ -8,9 +8,7 @@ import pandas.util.testing as tm -class TestIndexingSlow(tm.TestCase): - - _multiprocess_can_split_ = True +class TestIndexingSlow(object): @tm.slow def test_multiindex_get_loc(self): # GH7724, GH2646 @@ -29,10 +27,10 @@ def validate(mi, df, key): mask &= df.iloc[:, i] == k if not mask.any(): - self.assertNotIn(key[:i + 1], mi.index) + assert key[:i + 1] not in mi.index continue - self.assertIn(key[:i + 1], mi.index) + assert key[:i + 1] in mi.index right = df[mask].copy() if i + 1 != len(key): # partial key diff --git a/pandas/tests/indexing/test_interval.py b/pandas/tests/indexing/test_interval.py new file mode 100644 index 0000000000000..2552fc066cc87 --- /dev/null +++ b/pandas/tests/indexing/test_interval.py @@ -0,0 +1,245 @@ +import pytest +import numpy as np +import pandas as pd + +from pandas import Series, DataFrame, IntervalIndex, Interval +import pandas.util.testing as tm + + +class TestIntervalIndex(object): + + def setup_method(self, method): + self.s = Series(np.arange(5), IntervalIndex.from_breaks(np.arange(6))) + + def test_loc_with_scalar(self): + + s = self.s + expected = 0 + + result = s.loc[0.5] + assert result == expected + + result = s.loc[1] + assert result == expected + + with pytest.raises(KeyError): + s.loc[0] + + expected = s.iloc[:3] + tm.assert_series_equal(expected, s.loc[:3]) + tm.assert_series_equal(expected, s.loc[:2.5]) + tm.assert_series_equal(expected, s.loc[0.1:2.5]) + tm.assert_series_equal(expected, s.loc[-1:3]) + + expected = s.iloc[1:4] + tm.assert_series_equal(expected, s.loc[[1.5, 2.5, 3.5]]) + tm.assert_series_equal(expected, s.loc[[2, 3, 4]]) + tm.assert_series_equal(expected, s.loc[[1.5, 3, 4]]) + + expected = s.iloc[2:5] + tm.assert_series_equal(expected, s.loc[s >= 2]) + + def test_getitem_with_scalar(self): + + s = self.s + expected = 0 + + result = s[0.5] + assert result == expected + + result = s[1] + assert result == expected + + with pytest.raises(KeyError): + s[0] + + expected = s.iloc[:3] + tm.assert_series_equal(expected, s[:3]) + tm.assert_series_equal(expected, s[:2.5]) + tm.assert_series_equal(expected, s[0.1:2.5]) + tm.assert_series_equal(expected, s[-1:3]) + + expected = s.iloc[1:4] + tm.assert_series_equal(expected, s[[1.5, 2.5, 3.5]]) + tm.assert_series_equal(expected, s[[2, 3, 4]]) + 
tm.assert_series_equal(expected, s[[1.5, 3, 4]])
+
+        expected = s.iloc[2:5]
+        tm.assert_series_equal(expected, s[s >= 2])
+
+    def test_with_interval(self):
+
+        s = self.s
+        expected = 0
+
+        result = s.loc[Interval(0, 1)]
+        assert result == expected
+
+        result = s[Interval(0, 1)]
+        assert result == expected
+
+        expected = s.iloc[3:5]
+        result = s.loc[Interval(3, 6)]
+        tm.assert_series_equal(expected, result)
+
+        expected = s.iloc[3:5]
+        result = s.loc[[Interval(3, 6)]]
+        tm.assert_series_equal(expected, result)
+
+        expected = s.iloc[3:5]
+        result = s.loc[[Interval(3, 5)]]
+        tm.assert_series_equal(expected, result)
+
+        # missing
+        with pytest.raises(KeyError):
+            s.loc[Interval(-2, 0)]
+
+        with pytest.raises(KeyError):
+            s[Interval(-2, 0)]
+
+        with pytest.raises(KeyError):
+            s.loc[Interval(5, 6)]
+
+        with pytest.raises(KeyError):
+            s[Interval(5, 6)]
+
+    def test_with_slices(self):
+
+        s = self.s
+
+        # slice of interval
+        with pytest.raises(NotImplementedError):
+            result = s.loc[Interval(3, 6):]
+
+        with pytest.raises(NotImplementedError):
+            result = s[Interval(3, 6):]
+
+        expected = s.iloc[3:5]
+        result = s[[Interval(3, 6)]]
+        tm.assert_series_equal(expected, result)
+
+        # slice of scalar with step != 1
+        with pytest.raises(ValueError):
+            s[0:4:2]
+
+    def test_with_overlaps(self):
+
+        s = self.s
+        expected = s.iloc[[3, 4, 3, 4]]
+        result = s.loc[[Interval(3, 6), Interval(3, 6)]]
+        tm.assert_series_equal(expected, result)
+
+        idx = IntervalIndex.from_tuples([(1, 5), (3, 7)])
+        s = Series(range(len(idx)), index=idx)
+
+        result = s[4]
+        expected = s
+        tm.assert_series_equal(expected, result)
+
+        result = s[[4]]
+        expected = s
+        tm.assert_series_equal(expected, result)
+
+        result = s.loc[[4]]
+        expected = s
+        tm.assert_series_equal(expected, result)
+
+        result = s[Interval(3, 5)]
+        expected = s
+        tm.assert_series_equal(expected, result)
+
+        result = s.loc[Interval(3, 5)]
+        expected = s
+        tm.assert_series_equal(expected, result)
+
+        # doesn't intersect unique set of intervals
+        with pytest.raises(KeyError):
+            s[[Interval(3, 5)]]
+
+        with pytest.raises(KeyError):
+            s.loc[[Interval(3, 5)]]
+
+    def test_non_unique(self):
+
+        idx = IntervalIndex.from_tuples([(1, 3), (3, 7)])
+
+        s = pd.Series(range(len(idx)), index=idx)
+
+        result = s.loc[Interval(1, 3)]
+        assert result == 0
+
+        result = s.loc[[Interval(1, 3)]]
+        expected = s.iloc[0:1]
+        tm.assert_series_equal(expected, result)
+
+    def test_non_unique_moar(self):
+
+        idx = IntervalIndex.from_tuples([(1, 3), (1, 3), (3, 7)])
+        s = Series(range(len(idx)), index=idx)
+
+        result = s.loc[Interval(1, 3)]
+        expected = s.iloc[[0, 1]]
+        tm.assert_series_equal(expected, result)
+
+        # non-unique index and slices not allowed
+        with pytest.raises(ValueError):
+            s.loc[Interval(1, 3):]
+
+        with pytest.raises(ValueError):
+            s[Interval(1, 3):]
+
+        # non-unique
+        with pytest.raises(ValueError):
+            s[[Interval(1, 3)]]
+
+    def test_non_matching(self):
+        s = self.s
+
+        # this is a departure from our current
+        # indexing scheme, but simpler
+        with pytest.raises(KeyError):
+            s.loc[[-1, 3, 4, 5]]
+
+        with pytest.raises(KeyError):
+            s.loc[[-1, 3]]
+
+    def test_large_series(self):
+        s = Series(np.arange(1000000),
+                   index=IntervalIndex.from_breaks(np.arange(1000001)))
+
+        result1 = s.loc[:80000]
+        result2 = s.loc[0:80000]
+        result3 = s.loc[0:80000:1]
+        tm.assert_series_equal(result1, result2)
+        tm.assert_series_equal(result1, result3)
+
+    def test_loc_getitem_frame(self):
+
+        df = DataFrame({'A': range(10)})
+        s = pd.cut(df.A, 5)
+        df['B'] = s
+        df = df.set_index('B')
+
+        result
= df.loc[4] + expected = df.iloc[4:6] + tm.assert_frame_equal(result, expected) + + with pytest.raises(KeyError): + df.loc[10] + + # single list-like + result = df.loc[[4]] + expected = df.iloc[4:6] + tm.assert_frame_equal(result, expected) + + # non-unique + result = df.loc[[4, 5]] + expected = df.take([4, 5, 4, 5]) + tm.assert_frame_equal(result, expected) + + with pytest.raises(KeyError): + df.loc[[10]] + + # partial missing + with pytest.raises(KeyError): + df.loc[[10, 4]] diff --git a/pandas/tests/indexing/test_ix.py b/pandas/tests/indexing/test_ix.py new file mode 100644 index 0000000000000..dc9a591ee3101 --- /dev/null +++ b/pandas/tests/indexing/test_ix.py @@ -0,0 +1,335 @@ +""" test indexing with ix """ + +import pytest + +from warnings import catch_warnings + +import numpy as np +import pandas as pd + +from pandas.core.dtypes.common import is_scalar +from pandas.compat import lrange +from pandas import Series, DataFrame, option_context, MultiIndex +from pandas.util import testing as tm +from pandas.errors import PerformanceWarning + + +class TestIX(object): + + def test_ix_deprecation(self): + # GH 15114 + + df = DataFrame({'A': [1, 2, 3]}) + with tm.assert_produces_warning(DeprecationWarning, + check_stacklevel=False): + df.ix[1, 'A'] + + def test_ix_loc_setitem_consistency(self): + + # GH 5771 + # loc with slice and series + s = Series(0, index=[4, 5, 6]) + s.loc[4:5] += 1 + expected = Series([1, 1, 0], index=[4, 5, 6]) + tm.assert_series_equal(s, expected) + + # GH 5928 + # chained indexing assignment + df = DataFrame({'a': [0, 1, 2]}) + expected = df.copy() + with catch_warnings(record=True): + expected.ix[[0, 1, 2], 'a'] = -expected.ix[[0, 1, 2], 'a'] + + with catch_warnings(record=True): + df['a'].ix[[0, 1, 2]] = -df['a'].ix[[0, 1, 2]] + tm.assert_frame_equal(df, expected) + + df = DataFrame({'a': [0, 1, 2], 'b': [0, 1, 2]}) + with catch_warnings(record=True): + df['a'].ix[[0, 1, 2]] = -df['a'].ix[[0, 1, 2]].astype( + 'float64') + 0.5 + expected = DataFrame({'a': [0.5, -0.5, -1.5], 'b': [0, 1, 2]}) + tm.assert_frame_equal(df, expected) + + # GH 8607 + # ix setitem consistency + df = DataFrame({'timestamp': [1413840976, 1413842580, 1413760580], + 'delta': [1174, 904, 161], + 'elapsed': [7673, 9277, 1470]}) + expected = DataFrame({'timestamp': pd.to_datetime( + [1413840976, 1413842580, 1413760580], unit='s'), + 'delta': [1174, 904, 161], + 'elapsed': [7673, 9277, 1470]}) + + df2 = df.copy() + df2['timestamp'] = pd.to_datetime(df['timestamp'], unit='s') + tm.assert_frame_equal(df2, expected) + + df2 = df.copy() + df2.loc[:, 'timestamp'] = pd.to_datetime(df['timestamp'], unit='s') + tm.assert_frame_equal(df2, expected) + + df2 = df.copy() + with catch_warnings(record=True): + df2.ix[:, 2] = pd.to_datetime(df['timestamp'], unit='s') + tm.assert_frame_equal(df2, expected) + + def test_ix_loc_consistency(self): + + # GH 8613 + # some edge cases where ix/loc should return the same + # this is not an exhaustive case + + def compare(result, expected): + if is_scalar(expected): + assert result == expected + else: + assert expected.equals(result) + + # failure cases for .loc, but these work for .ix + df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD')) + for key in [slice(1, 3), tuple([slice(0, 2), slice(0, 2)]), + tuple([slice(0, 2), df.columns[0:2]])]: + + for index in [tm.makeStringIndex, tm.makeUnicodeIndex, + tm.makeDateIndex, tm.makePeriodIndex, + tm.makeTimedeltaIndex]: + df.index = index(len(df.index)) + with catch_warnings(record=True): + df.ix[key] + + 
pytest.raises(TypeError, lambda: df.loc[key]) + + df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'), + index=pd.date_range('2012-01-01', periods=5)) + + for key in ['2012-01-03', + '2012-01-31', + slice('2012-01-03', '2012-01-03'), + slice('2012-01-03', '2012-01-04'), + slice('2012-01-03', '2012-01-06', 2), + slice('2012-01-03', '2012-01-31'), + tuple([[True, True, True, False, True]]), ]: + + # getitem + + # if the expected raises, then compare the exceptions + try: + with catch_warnings(record=True): + expected = df.ix[key] + except KeyError: + pytest.raises(KeyError, lambda: df.loc[key]) + continue + + result = df.loc[key] + compare(result, expected) + + # setitem + df1 = df.copy() + df2 = df.copy() + + with catch_warnings(record=True): + df1.ix[key] = 10 + df2.loc[key] = 10 + compare(df2, df1) + + # edge cases + s = Series([1, 2, 3, 4], index=list('abde')) + + result1 = s['a':'c'] + with catch_warnings(record=True): + result2 = s.ix['a':'c'] + result3 = s.loc['a':'c'] + tm.assert_series_equal(result1, result2) + tm.assert_series_equal(result1, result3) + + # now work rather than raising KeyError + s = Series(range(5), [-2, -1, 1, 2, 3]) + + with catch_warnings(record=True): + result1 = s.ix[-10:3] + result2 = s.loc[-10:3] + tm.assert_series_equal(result1, result2) + + with catch_warnings(record=True): + result1 = s.ix[0:3] + result2 = s.loc[0:3] + tm.assert_series_equal(result1, result2) + + def test_ix_weird_slicing(self): + # http://stackoverflow.com/q/17056560/1240268 + df = DataFrame({'one': [1, 2, 3, np.nan, np.nan], + 'two': [1, 2, 3, 4, 5]}) + df.loc[df['one'] > 1, 'two'] = -df['two'] + + expected = DataFrame({'one': {0: 1.0, + 1: 2.0, + 2: 3.0, + 3: np.nan, + 4: np.nan}, + 'two': {0: 1, + 1: -2, + 2: -3, + 3: 4, + 4: 5}}) + tm.assert_frame_equal(df, expected) + + def test_ix_general(self): + + # ix general issues + + # GH 2817 + data = {'amount': {0: 700, 1: 600, 2: 222, 3: 333, 4: 444}, + 'col': {0: 3.5, 1: 3.5, 2: 4.0, 3: 4.0, 4: 4.0}, + 'year': {0: 2012, 1: 2011, 2: 2012, 3: 2012, 4: 2012}} + df = DataFrame(data).set_index(keys=['col', 'year']) + key = 4.0, 2012 + + # emits a PerformanceWarning, ok + with tm.assert_produces_warning(PerformanceWarning): + tm.assert_frame_equal(df.loc[key], df.iloc[2:]) + + # this is ok + df.sort_index(inplace=True) + res = df.loc[key] + + # col has float dtype, result should be Float64Index + index = MultiIndex.from_arrays([[4.] 
* 3, [2012] * 3],
+                                        names=['col', 'year'])
+        expected = DataFrame({'amount': [222, 333, 444]}, index=index)
+        tm.assert_frame_equal(res, expected)
+
+    def test_ix_assign_column_mixed(self):
+        # GH #1142
+        df = DataFrame(tm.getSeriesData())
+        df['foo'] = 'bar'
+
+        orig = df.loc[:, 'B'].copy()
+        df.loc[:, 'B'] = df.loc[:, 'B'] + 1
+        tm.assert_series_equal(df.B, orig + 1)
+
+        # GH 3668, mixed frame with series value
+        df = DataFrame({'x': lrange(10), 'y': lrange(10, 20), 'z': 'bar'})
+        expected = df.copy()
+
+        for i in range(5):
+            indexer = i * 2
+            v = 1000 + i * 200
+            expected.loc[indexer, 'y'] = v
+            assert expected.loc[indexer, 'y'] == v
+
+        df.loc[df.x % 2 == 0, 'y'] = df.loc[df.x % 2 == 0, 'y'] * 100
+        tm.assert_frame_equal(df, expected)
+
+        # GH 4508, making sure consistency of assignments
+        df = DataFrame({'a': [1, 2, 3], 'b': [0, 1, 2]})
+        df.loc[[0, 2, ], 'b'] = [100, -100]
+        expected = DataFrame({'a': [1, 2, 3], 'b': [100, 1, -100]})
+        tm.assert_frame_equal(df, expected)
+
+        df = pd.DataFrame({'a': lrange(4)})
+        df['b'] = np.nan
+        df.loc[[1, 3], 'b'] = [100, -100]
+        expected = DataFrame({'a': [0, 1, 2, 3],
+                              'b': [np.nan, 100, np.nan, -100]})
+        tm.assert_frame_equal(df, expected)
+
+        # ok, but chained assignments are dangerous
+        # if we turn off chained assignment it will work
+        with option_context('chained_assignment', None):
+            df = pd.DataFrame({'a': lrange(4)})
+            df['b'] = np.nan
+            df['b'].loc[[1, 3]] = [100, -100]
+            tm.assert_frame_equal(df, expected)
+
+    def test_ix_get_set_consistency(self):
+
+        # GH 4544
+        # ix/loc get/set not consistent when
+        # using a mixed int/string index
+        df = DataFrame(np.arange(16).reshape((4, 4)),
+                       columns=['a', 'b', 8, 'c'],
+                       index=['e', 7, 'f', 'g'])
+
+        with catch_warnings(record=True):
+            assert df.ix['e', 8] == 2
+        assert df.loc['e', 8] == 2
+
+        with catch_warnings(record=True):
+            df.ix['e', 8] = 42
+            assert df.ix['e', 8] == 42
+        assert df.loc['e', 8] == 42
+
+        df.loc['e', 8] = 45
+        with catch_warnings(record=True):
+            assert df.ix['e', 8] == 45
+        assert df.loc['e', 8] == 45
+
+    def test_ix_slicing_strings(self):
+        # see gh-3836
+        data = {'Classification':
+                ['SA EQUITY CFD', 'bbb', 'SA EQUITY', 'SA SSF', 'aaa'],
+                'Random': [1, 2, 3, 4, 5],
+                'X': ['correct', 'wrong', 'correct', 'correct', 'wrong']}
+        df = DataFrame(data)
+        x = df[~df.Classification.isin(['SA EQUITY CFD', 'SA EQUITY', 'SA SSF'
+                                        ])]
+        with catch_warnings(record=True):
+            df.ix[x.index, 'X'] = df['Classification']
+
+        expected = DataFrame({'Classification': {0: 'SA EQUITY CFD',
+                                                 1: 'bbb',
+                                                 2: 'SA EQUITY',
+                                                 3: 'SA SSF',
+                                                 4: 'aaa'},
+                              'Random': {0: 1,
+                                         1: 2,
+                                         2: 3,
+                                         3: 4,
+                                         4: 5},
+                              'X': {0: 'correct',
+                                    1: 'bbb',
+                                    2: 'correct',
+                                    3: 'correct',
+                                    4: 'aaa'}})  # bug was 4: 'bbb'
+
+        tm.assert_frame_equal(df, expected)
+
+    def test_ix_setitem_out_of_bounds_axis_0(self):
+        df = pd.DataFrame(
+            np.random.randn(2, 5), index=["row%s" % i for i in range(2)],
+            columns=["col%s" % i for i in range(5)])
+        with catch_warnings(record=True):
+            pytest.raises(ValueError, df.ix.__setitem__, (2, 0), 100)
+
+    def test_ix_setitem_out_of_bounds_axis_1(self):
+        df = pd.DataFrame(
+            np.random.randn(5, 2), index=["row%s" % i for i in range(5)],
+            columns=["col%s" % i for i in range(2)])
+        with catch_warnings(record=True):
+            pytest.raises(ValueError, df.ix.__setitem__, (0, 2), 100)
+
+    def test_ix_empty_list_indexer_is_ok(self):
+        with catch_warnings(record=True):
+            from pandas.util.testing import makeCustomDataframe as mkdf
+            df = mkdf(5, 2)
+            # vertical empty
+            tm.assert_frame_equal(df.ix[:, []],
df.iloc[:, :0],
+                                  check_index_type=True,
+                                  check_column_type=True)
+            # horizontal empty
+            tm.assert_frame_equal(df.ix[[], :], df.iloc[:0, :],
+                                  check_index_type=True,
+                                  check_column_type=True)
+            # horizontal empty
+            tm.assert_frame_equal(df.ix[[]], df.iloc[:0, :],
+                                  check_index_type=True,
+                                  check_column_type=True)
+
+    def test_ix_duplicate_returns_series(self):
+        df = DataFrame(np.random.randn(3, 3), index=[0.1, 0.2, 0.2],
+                       columns=list('abc'))
+        with catch_warnings(record=True):
+            r = df.ix[0.2, 'a']
+            e = df.loc[0.2, 'a']
+            tm.assert_series_equal(r, e)
diff --git a/pandas/tests/indexing/test_loc.py b/pandas/tests/indexing/test_loc.py
new file mode 100644
index 0000000000000..fe2318be72eda
--- /dev/null
+++ b/pandas/tests/indexing/test_loc.py
@@ -0,0 +1,632 @@
+""" test label based indexing with loc """
+
+import itertools
+import pytest
+
+from warnings import catch_warnings
+import numpy as np
+
+import pandas as pd
+from pandas.compat import lrange, StringIO
+from pandas import (Series, DataFrame, Timestamp,
+                    date_range, MultiIndex)
+from pandas.util import testing as tm
+from pandas.tests.indexing.common import Base
+
+
+class TestLoc(Base):
+
+    def test_loc_getitem_dups(self):
+        # GH 5678
+        # repeated getitems on a dup index returning an ndarray
+        df = DataFrame(
+            np.random.random_sample((20, 5)),
+            index=['ABCDE'[x % 5] for x in range(20)])
+        expected = df.loc['A', 0]
+        result = df.loc[:, 0].loc['A']
+        tm.assert_series_equal(result, expected)
+
+    def test_loc_getitem_dups2(self):
+
+        # GH4726
+        # dup indexing with iloc/loc
+        df = DataFrame([[1, 2, 'foo', 'bar', Timestamp('20130101')]],
+                       columns=['a', 'a', 'a', 'a', 'a'], index=[1])
+        expected = Series([1, 2, 'foo', 'bar', Timestamp('20130101')],
+                          index=['a', 'a', 'a', 'a', 'a'], name=1)
+
+        result = df.iloc[0]
+        tm.assert_series_equal(result, expected)
+
+        result = df.loc[1]
+        tm.assert_series_equal(result, expected)
+
+    def test_loc_setitem_dups(self):
+
+        # GH 6541
+        df_orig = DataFrame(
+            {'me': list('rttti'),
+             'foo': list('aaade'),
+             'bar': np.arange(5, dtype='float64') * 1.34 + 2,
+             'bar2': np.arange(5, dtype='float64') * -.34 + 2}).set_index('me')
+
+        indexer = tuple(['r', ['bar', 'bar2']])
+        df = df_orig.copy()
+        df.loc[indexer] *= 2.0
+        tm.assert_series_equal(df.loc[indexer], 2.0 * df_orig.loc[indexer])
+
+        indexer = tuple(['r', 'bar'])
+        df = df_orig.copy()
+        df.loc[indexer] *= 2.0
+        assert df.loc[indexer] == 2.0 * df_orig.loc[indexer]
+
+        indexer = tuple(['t', ['bar', 'bar2']])
+        df = df_orig.copy()
+        df.loc[indexer] *= 2.0
+        tm.assert_frame_equal(df.loc[indexer], 2.0 * df_orig.loc[indexer])
+
+    def test_loc_setitem_slice(self):
+        # GH10503
+
+        # assigning the same type should not change the type
+        df1 = DataFrame({'a': [0, 1, 1],
+                         'b': Series([100, 200, 300], dtype='uint32')})
+        ix = df1['a'] == 1
+        newb1 = df1.loc[ix, 'b'] + 1
+        df1.loc[ix, 'b'] = newb1
+        expected = DataFrame({'a': [0, 1, 1],
+                              'b': Series([100, 201, 301], dtype='uint32')})
+        tm.assert_frame_equal(df1, expected)
+
+        # assigning a new type should get the inferred type
+        df2 = DataFrame({'a': [0, 1, 1], 'b': [100, 200, 300]},
+                        dtype='uint64')
+        ix = df1['a'] == 1
+        newb2 = df2.loc[ix, 'b']
+        df1.loc[ix, 'b'] = newb2
+        expected = DataFrame({'a': [0, 1, 1], 'b': [100, 200, 300]},
+                             dtype='uint64')
+        tm.assert_frame_equal(df2, expected)
+
+    def test_loc_getitem_int(self):
+
+        # int label
+        self.check_result('int label', 'loc', 2, 'ix', 2,
+                          typs=['ints', 'uints'], axes=0)
+        self.check_result('int label', 'loc', 3, 'ix', 3,
+                          typs=['ints', 'uints'],
axes=1) + self.check_result('int label', 'loc', 4, 'ix', 4, + typs=['ints', 'uints'], axes=2) + self.check_result('int label', 'loc', 2, 'ix', 2, + typs=['label'], fails=KeyError) + + def test_loc_getitem_label(self): + + # label + self.check_result('label', 'loc', 'c', 'ix', 'c', typs=['labels'], + axes=0) + self.check_result('label', 'loc', 'null', 'ix', 'null', typs=['mixed'], + axes=0) + self.check_result('label', 'loc', 8, 'ix', 8, typs=['mixed'], axes=0) + self.check_result('label', 'loc', Timestamp('20130102'), 'ix', 1, + typs=['ts'], axes=0) + self.check_result('label', 'loc', 'c', 'ix', 'c', typs=['empty'], + fails=KeyError) + + def test_loc_getitem_label_out_of_range(self): + + # out of range label + self.check_result('label range', 'loc', 'f', 'ix', 'f', + typs=['ints', 'uints', 'labels', 'mixed', 'ts'], + fails=KeyError) + self.check_result('label range', 'loc', 'f', 'ix', 'f', + typs=['floats'], fails=TypeError) + self.check_result('label range', 'loc', 20, 'ix', 20, + typs=['ints', 'uints', 'mixed'], fails=KeyError) + self.check_result('label range', 'loc', 20, 'ix', 20, + typs=['labels'], fails=TypeError) + self.check_result('label range', 'loc', 20, 'ix', 20, typs=['ts'], + axes=0, fails=TypeError) + self.check_result('label range', 'loc', 20, 'ix', 20, typs=['floats'], + axes=0, fails=TypeError) + + def test_loc_getitem_label_list(self): + + # list of labels + self.check_result('list lbl', 'loc', [0, 2, 4], 'ix', [0, 2, 4], + typs=['ints', 'uints'], axes=0) + self.check_result('list lbl', 'loc', [3, 6, 9], 'ix', [3, 6, 9], + typs=['ints', 'uints'], axes=1) + self.check_result('list lbl', 'loc', [4, 8, 12], 'ix', [4, 8, 12], + typs=['ints', 'uints'], axes=2) + self.check_result('list lbl', 'loc', ['a', 'b', 'd'], 'ix', + ['a', 'b', 'd'], typs=['labels'], axes=0) + self.check_result('list lbl', 'loc', ['A', 'B', 'C'], 'ix', + ['A', 'B', 'C'], typs=['labels'], axes=1) + self.check_result('list lbl', 'loc', ['Z', 'Y', 'W'], 'ix', + ['Z', 'Y', 'W'], typs=['labels'], axes=2) + self.check_result('list lbl', 'loc', [2, 8, 'null'], 'ix', + [2, 8, 'null'], typs=['mixed'], axes=0) + self.check_result('list lbl', 'loc', + [Timestamp('20130102'), Timestamp('20130103')], 'ix', + [Timestamp('20130102'), Timestamp('20130103')], + typs=['ts'], axes=0) + + self.check_result('list lbl', 'loc', [0, 1, 2], 'indexer', [0, 1, 2], + typs=['empty'], fails=KeyError) + self.check_result('list lbl', 'loc', [0, 2, 3], 'ix', [0, 2, 3], + typs=['ints', 'uints'], axes=0, fails=KeyError) + self.check_result('list lbl', 'loc', [3, 6, 7], 'ix', [3, 6, 7], + typs=['ints', 'uints'], axes=1, fails=KeyError) + self.check_result('list lbl', 'loc', [4, 8, 10], 'ix', [4, 8, 10], + typs=['ints', 'uints'], axes=2, fails=KeyError) + + def test_loc_getitem_label_list_fails(self): + # fails + self.check_result('list lbl', 'loc', [20, 30, 40], 'ix', [20, 30, 40], + typs=['ints', 'uints'], axes=1, fails=KeyError) + self.check_result('list lbl', 'loc', [20, 30, 40], 'ix', [20, 30, 40], + typs=['ints', 'uints'], axes=2, fails=KeyError) + + def test_loc_getitem_label_array_like(self): + # array like + self.check_result('array like', 'loc', Series(index=[0, 2, 4]).index, + 'ix', [0, 2, 4], typs=['ints', 'uints'], axes=0) + self.check_result('array like', 'loc', Series(index=[3, 6, 9]).index, + 'ix', [3, 6, 9], typs=['ints', 'uints'], axes=1) + self.check_result('array like', 'loc', Series(index=[4, 8, 12]).index, + 'ix', [4, 8, 12], typs=['ints', 'uints'], axes=2) + + def test_loc_getitem_bool(self): + # boolean indexers + b 
= [True, False, True, False] + self.check_result('bool', 'loc', b, 'ix', b, + typs=['ints', 'uints', 'labels', + 'mixed', 'ts', 'floats']) + self.check_result('bool', 'loc', b, 'ix', b, typs=['empty'], + fails=KeyError) + + def test_loc_getitem_int_slice(self): + + # ok + self.check_result('int slice2', 'loc', slice(2, 4), 'ix', [2, 4], + typs=['ints', 'uints'], axes=0) + self.check_result('int slice2', 'loc', slice(3, 6), 'ix', [3, 6], + typs=['ints', 'uints'], axes=1) + self.check_result('int slice2', 'loc', slice(4, 8), 'ix', [4, 8], + typs=['ints', 'uints'], axes=2) + + # GH 3053 + # loc should treat integer slices like label slices + + index = MultiIndex.from_tuples([t for t in itertools.product( + [6, 7, 8], ['a', 'b'])]) + df = DataFrame(np.random.randn(6, 6), index, index) + result = df.loc[6:8, :] + expected = df + tm.assert_frame_equal(result, expected) + + index = MultiIndex.from_tuples([t + for t in itertools.product( + [10, 20, 30], ['a', 'b'])]) + df = DataFrame(np.random.randn(6, 6), index, index) + result = df.loc[20:30, :] + expected = df.iloc[2:] + tm.assert_frame_equal(result, expected) + + # doc examples + result = df.loc[10, :] + expected = df.iloc[0:2] + expected.index = ['a', 'b'] + tm.assert_frame_equal(result, expected) + + result = df.loc[:, 10] + # expected = df.ix[:,10] (this fails) + expected = df[10] + tm.assert_frame_equal(result, expected) + + def test_loc_to_fail(self): + + # GH3449 + df = DataFrame(np.random.random((3, 3)), + index=['a', 'b', 'c'], + columns=['e', 'f', 'g']) + + # raise a KeyError? + pytest.raises(KeyError, df.loc.__getitem__, + tuple([[1, 2], [1, 2]])) + + # GH 7496 + # loc should not fallback + + s = Series() + s.loc[1] = 1 + s.loc['a'] = 2 + + pytest.raises(KeyError, lambda: s.loc[-1]) + pytest.raises(KeyError, lambda: s.loc[[-1, -2]]) + + pytest.raises(KeyError, lambda: s.loc[['4']]) + + s.loc[-1] = 3 + result = s.loc[[-1, -2]] + expected = Series([3, np.nan], index=[-1, -2]) + tm.assert_series_equal(result, expected) + + s['a'] = 2 + pytest.raises(KeyError, lambda: s.loc[[-2]]) + + del s['a'] + + def f(): + s.loc[[-2]] = 0 + + pytest.raises(KeyError, f) + + # inconsistency between .loc[values] and .loc[values,:] + # GH 7999 + df = DataFrame([['a'], ['b']], index=[1, 2], columns=['value']) + + def f(): + df.loc[[3], :] + + pytest.raises(KeyError, f) + + def f(): + df.loc[[3]] + + pytest.raises(KeyError, f) + + def test_loc_getitem_label_slice(self): + + # label slices (with ints) + self.check_result('lab slice', 'loc', slice(1, 3), + 'ix', slice(1, 3), + typs=['labels', 'mixed', 'empty', 'ts', 'floats'], + fails=TypeError) + + # real label slices + self.check_result('lab slice', 'loc', slice('a', 'c'), + 'ix', slice('a', 'c'), typs=['labels'], axes=0) + self.check_result('lab slice', 'loc', slice('A', 'C'), + 'ix', slice('A', 'C'), typs=['labels'], axes=1) + self.check_result('lab slice', 'loc', slice('W', 'Z'), + 'ix', slice('W', 'Z'), typs=['labels'], axes=2) + + self.check_result('ts slice', 'loc', slice('20130102', '20130104'), + 'ix', slice('20130102', '20130104'), + typs=['ts'], axes=0) + self.check_result('ts slice', 'loc', slice('20130102', '20130104'), + 'ix', slice('20130102', '20130104'), + typs=['ts'], axes=1, fails=TypeError) + self.check_result('ts slice', 'loc', slice('20130102', '20130104'), + 'ix', slice('20130102', '20130104'), + typs=['ts'], axes=2, fails=TypeError) + + # GH 14316 + self.check_result('ts slice rev', 'loc', slice('20130104', '20130102'), + 'indexer', [0, 1, 2], typs=['ts_rev'], axes=0) + + 
self.check_result('mixed slice', 'loc', slice(2, 8), 'ix', slice(2, 8),
+                          typs=['mixed'], axes=0, fails=TypeError)
+        self.check_result('mixed slice', 'loc', slice(2, 8), 'ix', slice(2, 8),
+                          typs=['mixed'], axes=1, fails=KeyError)
+        self.check_result('mixed slice', 'loc', slice(2, 8), 'ix', slice(2, 8),
+                          typs=['mixed'], axes=2, fails=KeyError)
+
+        self.check_result('mixed slice', 'loc', slice(2, 4, 2), 'ix', slice(
+            2, 4, 2), typs=['mixed'], axes=0, fails=TypeError)
+
+    def test_loc_general(self):
+
+        df = DataFrame(
+            np.random.rand(4, 4), columns=['A', 'B', 'C', 'D'],
+            index=['A', 'B', 'C', 'D'])
+
+        # want this to work
+        result = df.loc[:, "A":"B"].iloc[0:2, :]
+        assert (result.columns == ['A', 'B']).all()
+        assert (result.index == ['A', 'B']).all()
+
+        # mixed type
+        result = DataFrame({'a': [Timestamp('20130101')], 'b': [1]}).iloc[0]
+        expected = Series([Timestamp('20130101'), 1], index=['a', 'b'], name=0)
+        tm.assert_series_equal(result, expected)
+        assert result.dtype == object
+
+    def test_loc_setitem_consistency(self):
+        # GH 6149
+        # coerce similarly for setitem and loc when rows have a null-slice
+        expected = DataFrame({'date': Series(0, index=range(5),
+                                             dtype=np.int64),
+                              'val': Series(range(5), dtype=np.int64)})
+
+        df = DataFrame({'date': date_range('2000-01-01', '2000-01-5'),
+                        'val': Series(
+                            range(5), dtype=np.int64)})
+        df.loc[:, 'date'] = 0
+        tm.assert_frame_equal(df, expected)
+
+        df = DataFrame({'date': date_range('2000-01-01', '2000-01-5'),
+                        'val': Series(range(5), dtype=np.int64)})
+        df.loc[:, 'date'] = np.array(0, dtype=np.int64)
+        tm.assert_frame_equal(df, expected)
+
+        df = DataFrame({'date': date_range('2000-01-01', '2000-01-5'),
+                        'val': Series(range(5), dtype=np.int64)})
+        df.loc[:, 'date'] = np.array([0, 0, 0, 0, 0], dtype=np.int64)
+        tm.assert_frame_equal(df, expected)
+
+        expected = DataFrame({'date': Series('foo', index=range(5)),
+                              'val': Series(range(5), dtype=np.int64)})
+        df = DataFrame({'date': date_range('2000-01-01', '2000-01-5'),
+                        'val': Series(range(5), dtype=np.int64)})
+        df.loc[:, 'date'] = 'foo'
+        tm.assert_frame_equal(df, expected)
+
+        expected = DataFrame({'date': Series(1.0, index=range(5)),
+                              'val': Series(range(5), dtype=np.int64)})
+        df = DataFrame({'date': date_range('2000-01-01', '2000-01-5'),
+                        'val': Series(range(5), dtype=np.int64)})
+        df.loc[:, 'date'] = 1.0
+        tm.assert_frame_equal(df, expected)
+
+    def test_loc_setitem_consistency_empty(self):
+        # empty (essentially noops)
+        expected = DataFrame(columns=['x', 'y'])
+        expected['x'] = expected['x'].astype(np.int64)
+        df = DataFrame(columns=['x', 'y'])
+        df.loc[:, 'x'] = 1
+        tm.assert_frame_equal(df, expected)
+
+        df = DataFrame(columns=['x', 'y'])
+        df['x'] = 1
+        tm.assert_frame_equal(df, expected)
+
+    def test_loc_setitem_consistency_slice_column_len(self):
+        # .loc[:,column] setting with slice == len of the column
+        # GH10408
+        data = """Level_0,,,Respondent,Respondent,Respondent,OtherCat,OtherCat
+Level_1,,,Something,StartDate,EndDate,Yes/No,SomethingElse
+Region,Site,RespondentID,,,,,
+Region_1,Site_1,3987227376,A,5/25/2015 10:59,5/25/2015 11:22,Yes,
+Region_1,Site_1,3980680971,A,5/21/2015 9:40,5/21/2015 9:52,Yes,Yes
+Region_1,Site_2,3977723249,A,5/20/2015 8:27,5/20/2015 8:41,Yes,
+Region_1,Site_2,3977723089,A,5/20/2015 8:33,5/20/2015 9:09,Yes,No"""
+
+        df = pd.read_csv(StringIO(data), header=[0, 1], index_col=[0, 1, 2])
+        df.loc[:, ('Respondent', 'StartDate')] = pd.to_datetime(df.loc[:, (
+            'Respondent', 'StartDate')])
+        df.loc[:, ('Respondent', 'EndDate')] =
pd.to_datetime(df.loc[:, (
+            'Respondent', 'EndDate')])
+        df.loc[:, ('Respondent', 'Duration')] = df.loc[:, (
+            'Respondent', 'EndDate')] - df.loc[:, ('Respondent', 'StartDate')]
+
+        df.loc[:, ('Respondent', 'Duration')] = df.loc[:, (
+            'Respondent', 'Duration')].astype('timedelta64[s]')
+        expected = Series([1380, 720, 840, 2160.], index=df.index,
+                          name=('Respondent', 'Duration'))
+        tm.assert_series_equal(df[('Respondent', 'Duration')], expected)
+
+    def test_loc_setitem_frame(self):
+        df = self.frame_labels
+
+        result = df.iloc[0, 0]
+
+        df.loc['a', 'A'] = 1
+        result = df.loc['a', 'A']
+        assert result == 1
+
+        result = df.iloc[0, 0]
+        assert result == 1
+
+        df.loc[:, 'B':'D'] = 0
+        expected = df.loc[:, 'B':'D']
+        result = df.iloc[:, 1:]
+        tm.assert_frame_equal(result, expected)
+
+        # GH 6254
+        # setting issue
+        df = DataFrame(index=[3, 5, 4], columns=['A'])
+        df.loc[[4, 3, 5], 'A'] = np.array([1, 2, 3], dtype='int64')
+        expected = DataFrame(dict(A=Series(
+            [1, 2, 3], index=[4, 3, 5]))).reindex(index=[3, 5, 4])
+        tm.assert_frame_equal(df, expected)
+
+        # GH 6252
+        # setting with an empty frame
+        keys1 = ['@' + str(i) for i in range(5)]
+        val1 = np.arange(5, dtype='int64')
+
+        keys2 = ['@' + str(i) for i in range(4)]
+        val2 = np.arange(4, dtype='int64')
+
+        index = list(set(keys1).union(keys2))
+        df = DataFrame(index=index)
+        df['A'] = np.nan
+        df.loc[keys1, 'A'] = val1
+
+        df['B'] = np.nan
+        df.loc[keys2, 'B'] = val2
+
+        expected = DataFrame(dict(A=Series(val1, index=keys1), B=Series(
+            val2, index=keys2))).reindex(index=index)
+        tm.assert_frame_equal(df, expected)
+
+        # GH 8669
+        # invalid coercion of nan -> int
+        df = DataFrame({'A': [1, 2, 3], 'B': np.nan})
+        df.loc[df.B > df.A, 'B'] = df.A
+        expected = DataFrame({'A': [1, 2, 3], 'B': np.nan})
+        tm.assert_frame_equal(df, expected)
+
+        # GH 6546
+        # setting with mixed labels
+        df = DataFrame({1: [1, 2], 2: [3, 4], 'a': ['a', 'b']})
+
+        result = df.loc[0, [1, 2]]
+        expected = Series([1, 3], index=[1, 2], dtype=object, name=0)
+        tm.assert_series_equal(result, expected)
+
+        expected = DataFrame({1: [5, 2], 2: [6, 4], 'a': ['a', 'b']})
+        df.loc[0, [1, 2]] = [5, 6]
+        tm.assert_frame_equal(df, expected)
+
+    def test_loc_setitem_frame_multiples(self):
+        # multiple setting
+        df = DataFrame({'A': ['foo', 'bar', 'baz'],
+                        'B': Series(
+                            range(3), dtype=np.int64)})
+        rhs = df.loc[1:2]
+        rhs.index = df.index[0:2]
+        df.loc[0:1] = rhs
+        expected = DataFrame({'A': ['bar', 'baz', 'baz'],
+                              'B': Series(
+                                  [1, 2, 2], dtype=np.int64)})
+        tm.assert_frame_equal(df, expected)
+
+        # multiple setting with frame on rhs (with M8)
+        df = DataFrame({'date': date_range('2000-01-01', '2000-01-5'),
+                        'val': Series(
+                            range(5), dtype=np.int64)})
+        expected = DataFrame({'date': [Timestamp('20000101'), Timestamp(
+            '20000102'), Timestamp('20000101'), Timestamp('20000102'),
+            Timestamp('20000103')],
+            'val': Series(
+                [0, 1, 0, 1, 2], dtype=np.int64)})
+        rhs = df.loc[0:2]
+        rhs.index = df.index[2:5]
+        df.loc[2:4] = rhs
+        tm.assert_frame_equal(df, expected)
+
+    def test_loc_coercion(self):
+
+        # 12411
+        df = DataFrame({'date': [pd.Timestamp('20130101').tz_localize('UTC'),
+                                 pd.NaT]})
+        expected = df.dtypes
+
+        result = df.iloc[[0]]
+        tm.assert_series_equal(result.dtypes, expected)
+
+        result = df.iloc[[1]]
+        tm.assert_series_equal(result.dtypes, expected)
+
+        # 12045
+        import datetime
+        df = DataFrame({'date': [datetime.datetime(2012, 1, 1),
+                                 datetime.datetime(1012, 1, 2)]})
+        expected = df.dtypes
+
+        result = df.iloc[[0]]
+        tm.assert_series_equal(result.dtypes,
expected)
+
+        result = df.iloc[[1]]
+        tm.assert_series_equal(result.dtypes, expected)
+
+        # 11594
+        df = DataFrame({'text': ['some words'] + [None] * 9})
+        expected = df.dtypes
+
+        result = df.iloc[0:2]
+        tm.assert_series_equal(result.dtypes, expected)
+
+        result = df.iloc[3:]
+        tm.assert_series_equal(result.dtypes, expected)
+
+    def test_loc_non_unique(self):
+        # GH3659
+        # non-unique indexer with loc slice
+        # https://groups.google.com/forum/?fromgroups#!topic/pydata/zTm2No0crYs
+
+        # these are going to raise because we are non-monotonic
+        df = DataFrame({'A': [1, 2, 3, 4, 5, 6],
+                        'B': [3, 4, 5, 6, 7, 8]}, index=[0, 1, 0, 1, 2, 3])
+        pytest.raises(KeyError, df.loc.__getitem__,
+                      tuple([slice(1, None)]))
+        pytest.raises(KeyError, df.loc.__getitem__,
+                      tuple([slice(0, None)]))
+        pytest.raises(KeyError, df.loc.__getitem__, tuple([slice(1, 2)]))
+
+        # monotonic are ok
+        df = DataFrame({'A': [1, 2, 3, 4, 5, 6],
+                        'B': [3, 4, 5, 6, 7, 8]},
+                       index=[0, 1, 0, 1, 2, 3]).sort_index(axis=0)
+        result = df.loc[1:]
+        expected = DataFrame({'A': [2, 4, 5, 6], 'B': [4, 6, 7, 8]},
+                             index=[1, 1, 2, 3])
+        tm.assert_frame_equal(result, expected)
+
+        result = df.loc[0:]
+        tm.assert_frame_equal(result, df)
+
+        result = df.loc[1:2]
+        expected = DataFrame({'A': [2, 4, 5], 'B': [4, 6, 7]},
+                             index=[1, 1, 2])
+        tm.assert_frame_equal(result, expected)
+
+    def test_loc_non_unique_memory_error(self):
+
+        # GH 4280
+        # non_unique index with a large selection triggers a memory error
+
+        columns = list('ABCDEFG')
+
+        def gen_test(l, l2):
+            return pd.concat([
+                DataFrame(np.random.randn(l, len(columns)),
+                          index=lrange(l), columns=columns),
+                DataFrame(np.ones((l2, len(columns))),
+                          index=[0] * l2, columns=columns)])
+
+        def gen_expected(df, mask):
+            l = len(mask)
+            return pd.concat([df.take([0], convert=False),
+                              DataFrame(np.ones((l, len(columns))),
+                                        index=[0] * l,
+                                        columns=columns),
+                              df.take(mask[1:], convert=False)])
+
+        df = gen_test(900, 100)
+        assert not df.index.is_unique
+
+        mask = np.arange(100)
+        result = df.loc[mask]
+        expected = gen_expected(df, mask)
+        tm.assert_frame_equal(result, expected)
+
+        df = gen_test(900000, 100000)
+        assert not df.index.is_unique
+
+        mask = np.arange(100000)
+        result = df.loc[mask]
+        expected = gen_expected(df, mask)
+        tm.assert_frame_equal(result, expected)
+
+    def test_loc_name(self):
+        # GH 3880
+        df = DataFrame([[1, 1], [1, 1]])
+        df.index.name = 'index_name'
+        result = df.iloc[[0, 1]].index.name
+        assert result == 'index_name'
+
+        with catch_warnings(record=True):
+            result = df.ix[[0, 1]].index.name
+            assert result == 'index_name'
+
+        result = df.loc[[0, 1]].index.name
+        assert result == 'index_name'
+
+    def test_loc_empty_list_indexer_is_ok(self):
+        from pandas.util.testing import makeCustomDataframe as mkdf
+        df = mkdf(5, 2)
+        # vertical empty
+        tm.assert_frame_equal(df.loc[:, []], df.iloc[:, :0],
+                              check_index_type=True, check_column_type=True)
+        # horizontal empty
+        tm.assert_frame_equal(df.loc[[], :], df.iloc[:0, :],
+                              check_index_type=True, check_column_type=True)
+        # horizontal empty
+        tm.assert_frame_equal(df.loc[[]], df.iloc[:0, :],
+                              check_index_type=True,
+                              check_column_type=True)
diff --git a/pandas/tests/indexing/test_multiindex.py b/pandas/tests/indexing/test_multiindex.py
new file mode 100644
index 0000000000000..483c39ed8694e
--- /dev/null
+++ b/pandas/tests/indexing/test_multiindex.py
@@ -0,0 +1,1288 @@
+from warnings import catch_warnings
+import pytest
+import numpy as np
+import pandas as pd
+from pandas import (Panel, Series, MultiIndex,
DataFrame, + Timestamp, Index, date_range) +from pandas.util import testing as tm +from pandas.errors import PerformanceWarning, UnsortedIndexError +from pandas.tests.indexing.common import _mklbl + + +class TestMultiIndexBasic(object): + + def test_iloc_getitem_multiindex2(self): + # TODO(wesm): fix this + pytest.skip('this test was being suppressed, ' + 'needs to be fixed') + + arr = np.random.randn(3, 3) + df = DataFrame(arr, columns=[[2, 2, 4], [6, 8, 10]], + index=[[4, 4, 8], [8, 10, 12]]) + + rs = df.iloc[2] + xp = Series(arr[2], index=df.columns) + tm.assert_series_equal(rs, xp) + + rs = df.iloc[:, 2] + xp = Series(arr[:, 2], index=df.index) + tm.assert_series_equal(rs, xp) + + rs = df.iloc[2, 2] + xp = df.values[2, 2] + assert rs == xp + + # for multiple items + # GH 5528 + rs = df.iloc[[0, 1]] + xp = df.xs(4, drop_level=False) + tm.assert_frame_equal(rs, xp) + + tup = zip(*[['a', 'a', 'b', 'b'], ['x', 'y', 'x', 'y']]) + index = MultiIndex.from_tuples(tup) + df = DataFrame(np.random.randn(4, 4), index=index) + rs = df.iloc[[2, 3]] + xp = df.xs('b', drop_level=False) + tm.assert_frame_equal(rs, xp) + + def test_setitem_multiindex(self): + with catch_warnings(record=True): + + for index_fn in ('ix', 'loc'): + + def assert_equal(a, b): + assert a == b + + def check(target, indexers, value, compare_fn, expected=None): + fn = getattr(target, index_fn) + fn.__setitem__(indexers, value) + result = fn.__getitem__(indexers) + if expected is None: + expected = value + compare_fn(result, expected) + # GH7190 + index = pd.MultiIndex.from_product([np.arange(0, 100), + np.arange(0, 80)], + names=['time', 'firm']) + t, n = 0, 2 + df = DataFrame(np.nan, columns=['A', 'w', 'l', 'a', 'x', + 'X', 'd', 'profit'], + index=index) + check(target=df, indexers=((t, n), 'X'), value=0, + compare_fn=assert_equal) + + df = DataFrame(-999, columns=['A', 'w', 'l', 'a', 'x', + 'X', 'd', 'profit'], + index=index) + check(target=df, indexers=((t, n), 'X'), value=1, + compare_fn=assert_equal) + + df = DataFrame(columns=['A', 'w', 'l', 'a', 'x', + 'X', 'd', 'profit'], + index=index) + check(target=df, indexers=((t, n), 'X'), value=2, + compare_fn=assert_equal) + + # gh-7218: assigning with 0-dim arrays + df = DataFrame(-999, columns=['A', 'w', 'l', 'a', 'x', + 'X', 'd', 'profit'], + index=index) + check(target=df, + indexers=((t, n), 'X'), + value=np.array(3), + compare_fn=assert_equal, + expected=3, ) + + # GH5206 + df = pd.DataFrame(np.arange(25).reshape(5, 5), + columns='A,B,C,D,E'.split(','), dtype=float) + df['F'] = 99 + row_selection = df['A'] % 2 == 0 + col_selection = ['B', 'C'] + with catch_warnings(record=True): + df.ix[row_selection, col_selection] = df['F'] + output = pd.DataFrame(99., index=[0, 2, 4], columns=['B', 'C']) + with catch_warnings(record=True): + tm.assert_frame_equal(df.ix[row_selection, col_selection], + output) + check(target=df, + indexers=(row_selection, col_selection), + value=df['F'], + compare_fn=tm.assert_frame_equal, + expected=output, ) + + # GH11372 + idx = pd.MultiIndex.from_product([ + ['A', 'B', 'C'], + pd.date_range('2015-01-01', '2015-04-01', freq='MS')]) + cols = pd.MultiIndex.from_product([ + ['foo', 'bar'], + pd.date_range('2016-01-01', '2016-02-01', freq='MS')]) + + df = pd.DataFrame(np.random.random((12, 4)), + index=idx, columns=cols) + + subidx = pd.MultiIndex.from_tuples( + [('A', pd.Timestamp('2015-01-01')), + ('A', pd.Timestamp('2015-02-01'))]) + subcols = pd.MultiIndex.from_tuples( + [('foo', pd.Timestamp('2016-01-01')), + ('foo', pd.Timestamp('2016-02-01'))]) + + 
vals = pd.DataFrame(np.random.random((2, 2)), + index=subidx, columns=subcols) + check(target=df, + indexers=(subidx, subcols), + value=vals, + compare_fn=tm.assert_frame_equal, ) + # set all columns + vals = pd.DataFrame( + np.random.random((2, 4)), index=subidx, columns=cols) + check(target=df, + indexers=(subidx, slice(None, None, None)), + value=vals, + compare_fn=tm.assert_frame_equal, ) + # identity + copy = df.copy() + check(target=df, indexers=(df.index, df.columns), value=df, + compare_fn=tm.assert_frame_equal, expected=copy) + + def test_loc_getitem_series(self): + # GH14730 + # passing a series as a key with a MultiIndex + index = MultiIndex.from_product([[1, 2, 3], ['A', 'B', 'C']]) + x = Series(index=index, data=range(9), dtype=np.float64) + y = Series([1, 3]) + expected = Series( + data=[0, 1, 2, 6, 7, 8], + index=MultiIndex.from_product([[1, 3], ['A', 'B', 'C']]), + dtype=np.float64) + result = x.loc[y] + tm.assert_series_equal(result, expected) + + result = x.loc[[1, 3]] + tm.assert_series_equal(result, expected) + + # GH15424 + y1 = Series([1, 3], index=[1, 2]) + result = x.loc[y1] + tm.assert_series_equal(result, expected) + + empty = Series(data=[], dtype=np.float64) + expected = Series([], index=MultiIndex( + levels=index.levels, labels=[[], []], dtype=np.float64)) + result = x.loc[empty] + tm.assert_series_equal(result, expected) + + def test_loc_getitem_array(self): + # GH15434 + # passing an array as a key with a MultiIndex + index = MultiIndex.from_product([[1, 2, 3], ['A', 'B', 'C']]) + x = Series(index=index, data=range(9), dtype=np.float64) + y = np.array([1, 3]) + expected = Series( + data=[0, 1, 2, 6, 7, 8], + index=MultiIndex.from_product([[1, 3], ['A', 'B', 'C']]), + dtype=np.float64) + result = x.loc[y] + tm.assert_series_equal(result, expected) + + # empty array: + empty = np.array([]) + expected = Series([], index=MultiIndex( + levels=index.levels, labels=[[], []], dtype=np.float64)) + result = x.loc[empty] + tm.assert_series_equal(result, expected) + + # 0-dim array (scalar): + scalar = np.int64(1) + expected = Series( + data=[0, 1, 2], + index=['A', 'B', 'C'], + dtype=np.float64) + result = x.loc[scalar] + tm.assert_series_equal(result, expected) + + def test_iloc_getitem_multiindex(self): + mi_labels = DataFrame(np.random.randn(4, 3), + columns=[['i', 'i', 'j'], ['A', 'A', 'B']], + index=[['i', 'i', 'j', 'k'], + ['X', 'X', 'Y', 'Y']]) + + mi_int = DataFrame(np.random.randn(3, 3), + columns=[[2, 2, 4], [6, 8, 10]], + index=[[4, 4, 8], [8, 10, 12]]) + + # the first row + rs = mi_int.iloc[0] + with catch_warnings(record=True): + xp = mi_int.ix[4].ix[8] + tm.assert_series_equal(rs, xp, check_names=False) + assert rs.name == (4, 8) + assert xp.name == 8 + + # 2nd (last) columns + rs = mi_int.iloc[:, 2] + with catch_warnings(record=True): + xp = mi_int.ix[:, 2] + tm.assert_series_equal(rs, xp) + + # corner column + rs = mi_int.iloc[2, 2] + with catch_warnings(record=True): + xp = mi_int.ix[:, 2].ix[2] + assert rs == xp + + # this is basically regular indexing + rs = mi_labels.iloc[2, 2] + with catch_warnings(record=True): + xp = mi_labels.ix['j'].ix[:, 'j'].ix[0, 0] + assert rs == xp + + def test_loc_multiindex(self): + + mi_labels = DataFrame(np.random.randn(3, 3), + columns=[['i', 'i', 'j'], ['A', 'A', 'B']], + index=[['i', 'i', 'j'], ['X', 'X', 'Y']]) + + mi_int = DataFrame(np.random.randn(3, 3), + columns=[[2, 2, 4], [6, 8, 10]], + index=[[4, 4, 8], [8, 10, 12]]) + + # the first row + rs = mi_labels.loc['i'] + with catch_warnings(record=True): + xp = 
mi_labels.ix['i']
+        tm.assert_frame_equal(rs, xp)
+
+        # 2nd (last) column
+        rs = mi_labels.loc[:, 'j']
+        with catch_warnings(record=True):
+            xp = mi_labels.ix[:, 'j']
+        tm.assert_frame_equal(rs, xp)
+
+        # corner column
+        rs = mi_labels.loc['j'].loc[:, 'j']
+        with catch_warnings(record=True):
+            xp = mi_labels.ix['j'].ix[:, 'j']
+        tm.assert_frame_equal(rs, xp)
+
+        # with a tuple
+        rs = mi_labels.loc[('i', 'X')]
+        with catch_warnings(record=True):
+            xp = mi_labels.ix[('i', 'X')]
+        tm.assert_frame_equal(rs, xp)
+
+        rs = mi_int.loc[4]
+        with catch_warnings(record=True):
+            xp = mi_int.ix[4]
+        tm.assert_frame_equal(rs, xp)
+
+    def test_getitem_partial_int(self):
+        # GH 12416
+        # with single item
+        l1 = [10, 20]
+        l2 = ['a', 'b']
+        df = DataFrame(index=range(2),
+                       columns=pd.MultiIndex.from_product([l1, l2]))
+        expected = DataFrame(index=range(2),
+                             columns=l2)
+        result = df[20]
+        tm.assert_frame_equal(result, expected)
+
+        # with list
+        expected = DataFrame(index=range(2),
+                             columns=pd.MultiIndex.from_product([l1[1:], l2]))
+        result = df[[20]]
+        tm.assert_frame_equal(result, expected)
+
+        # missing item:
+        with tm.assert_raises_regex(KeyError, '1'):
+            df[1]
+        with tm.assert_raises_regex(KeyError, "'\[1\] not in index'"):
+            df[[1]]
+
+    def test_loc_multiindex_indexer_none(self):
+
+        # GH6788
+        # multi-index indexer is None (meaning take all)
+        attributes = ['Attribute' + str(i) for i in range(1)]
+        attribute_values = ['Value' + str(i) for i in range(5)]
+
+        index = MultiIndex.from_product([attributes, attribute_values])
+        df = 0.1 * np.random.randn(10, 1 * 5) + 0.5
+        df = DataFrame(df, columns=index)
+        result = df[attributes]
+        tm.assert_frame_equal(result, df)
+
+        # GH 7349
+        # loc with a multi-index seems to be doing fallback
+        df = DataFrame(np.arange(12).reshape(-1, 1),
+                       index=pd.MultiIndex.from_product([[1, 2, 3, 4],
+                                                         [1, 2, 3]]))
+
+        expected = df.loc[([1, 2], ), :]
+        result = df.loc[[1, 2]]
+        tm.assert_frame_equal(result, expected)
+
+    def test_loc_multiindex_incomplete(self):
+
+        # GH 7399
+        # incomplete indexers
+        s = pd.Series(np.arange(15, dtype='int64'),
+                      MultiIndex.from_product([range(5), ['a', 'b', 'c']]))
+        expected = s.loc[:, 'a':'c']
+
+        result = s.loc[0:4, 'a':'c']
+        tm.assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)
+
+        result = s.loc[:4, 'a':'c']
+        tm.assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)
+
+        result = s.loc[0:, 'a':'c']
+        tm.assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)
+
+        # GH 7400
+        # multiindexer getitem with list of indexers skips wrong element
+        s = pd.Series(np.arange(15, dtype='int64'),
+                      MultiIndex.from_product([range(5), ['a', 'b', 'c']]))
+        expected = s.iloc[[6, 7, 8, 12, 13, 14]]
+        result = s.loc[2:4:2, 'a':'c']
+        tm.assert_series_equal(result, expected)
+
+    def test_multiindex_perf_warn(self):
+
+        df = DataFrame({'jim': [0, 0, 1, 1],
+                        'joe': ['x', 'x', 'z', 'y'],
+                        'jolie': np.random.rand(4)}).set_index(['jim', 'joe'])
+
+        with tm.assert_produces_warning(PerformanceWarning,
+                                        clear=[pd.core.index]):
+            df.loc[(1, 'z')]
+
+        df = df.iloc[[2, 1, 3, 0]]
+        with tm.assert_produces_warning(PerformanceWarning):
+            df.loc[(0, )]
+
+    def test_series_getitem_multiindex(self):
+
+        # GH 6018
+        # series regression getitem with a multi-index
+
+        s = Series([1, 2, 3])
+        s.index = MultiIndex.from_tuples([(0, 0), (1, 1), (2, 1)])
+
+        result = s[:, 0]
+        expected = Series([1], index=[0])
+        tm.assert_series_equal(result, expected)
+
+        result = s.loc[:, 1]
+        expected =
Series([2, 3], index=[1, 2])
+        tm.assert_series_equal(result, expected)
+
+        # xs
+        result = s.xs(0, level=0)
+        expected = Series([1], index=[0])
+        tm.assert_series_equal(result, expected)
+
+        result = s.xs(1, level=1)
+        expected = Series([2, 3], index=[1, 2])
+        tm.assert_series_equal(result, expected)
+
+        # GH6258
+        dt = list(date_range('20130903', periods=3))
+        idx = MultiIndex.from_product([list('AB'), dt])
+        s = Series([1, 3, 4, 1, 3, 4], index=idx)
+
+        result = s.xs('20130903', level=1)
+        expected = Series([1, 1], index=list('AB'))
+        tm.assert_series_equal(result, expected)
+
+        # GH5684
+        idx = MultiIndex.from_tuples([('a', 'one'), ('a', 'two'), ('b', 'one'),
+                                      ('b', 'two')])
+        s = Series([1, 2, 3, 4], index=idx)
+        s.index.set_names(['L1', 'L2'], inplace=True)
+        result = s.xs('one', level='L2')
+        expected = Series([1, 3], index=['a', 'b'])
+        expected.index.set_names(['L1'], inplace=True)
+        tm.assert_series_equal(result, expected)
+
+    def test_xs_multiindex(self):
+
+        # GH2903
+        columns = MultiIndex.from_tuples(
+            [('a', 'foo'), ('a', 'bar'), ('b', 'hello'),
+             ('b', 'world')], names=['lvl0', 'lvl1'])
+        df = DataFrame(np.random.randn(4, 4), columns=columns)
+        df.sort_index(axis=1, inplace=True)
+        result = df.xs('a', level='lvl0', axis=1)
+        expected = df.iloc[:, 0:2].loc[:, 'a']
+        tm.assert_frame_equal(result, expected)
+
+        result = df.xs('foo', level='lvl1', axis=1)
+        expected = df.iloc[:, 1:2].copy()
+        expected.columns = expected.columns.droplevel('lvl1')
+        tm.assert_frame_equal(result, expected)
+
+    def test_multiindex_setitem(self):
+
+        # GH 3738
+        # setting with a multi-index right hand side
+        arrays = [np.array(['bar', 'bar', 'baz', 'qux', 'qux', 'bar']),
+                  np.array(['one', 'two', 'one', 'one', 'two', 'one']),
+                  np.arange(0, 6, 1)]
+
+        df_orig = pd.DataFrame(np.random.randn(6, 3),
+                               index=arrays,
+                               columns=['A', 'B', 'C']).sort_index()
+
+        expected = df_orig.loc[['bar']] * 2
+        df = df_orig.copy()
+        df.loc[['bar']] *= 2
+        tm.assert_frame_equal(df.loc[['bar']], expected)
+
+        # raise because these have differing levels
+        def f():
+            df.loc['bar'] *= 2
+
+        pytest.raises(TypeError, f)
+
+        # from SO
+        # http://stackoverflow.com/questions/24572040/pandas-access-the-level-of-multiindex-for-inplace-operation
+        df_orig = DataFrame.from_dict({'price': {
+            ('DE', 'Coal', 'Stock'): 2,
+            ('DE', 'Gas', 'Stock'): 4,
+            ('DE', 'Elec', 'Demand'): 1,
+            ('FR', 'Gas', 'Stock'): 5,
+            ('FR', 'Solar', 'SupIm'): 0,
+            ('FR', 'Wind', 'SupIm'): 0
+        }})
+        df_orig.index = MultiIndex.from_tuples(df_orig.index,
+                                               names=['Sit', 'Com', 'Type'])
+
+        expected = df_orig.copy()
+        expected.iloc[[0, 2, 3]] *= 2
+
+        idx = pd.IndexSlice
+        df = df_orig.copy()
+        df.loc[idx[:, :, 'Stock'], :] *= 2
+        tm.assert_frame_equal(df, expected)
+
+        df = df_orig.copy()
+        df.loc[idx[:, :, 'Stock'], 'price'] *= 2
+        tm.assert_frame_equal(df, expected)
+
+    def test_getitem_duplicates_multiindex(self):
+        # GH 5725 the 'A' happens to be a valid Timestamp so it doesn't raise
+        # the appropriate error, only in PY3 of course!
+ + index = MultiIndex(levels=[['D', 'B', 'C'], + [0, 26, 27, 37, 57, 67, 75, 82]], + labels=[[0, 0, 0, 1, 2, 2, 2, 2, 2, 2], + [1, 3, 4, 6, 0, 2, 2, 3, 5, 7]], + names=['tag', 'day']) + arr = np.random.randn(len(index), 1) + df = DataFrame(arr, index=index, columns=['val']) + result = df.val['D'] + expected = Series(arr.ravel()[0:3], name='val', index=Index( + [26, 37, 57], name='day')) + tm.assert_series_equal(result, expected) + + def f(): + df.val['A'] + + pytest.raises(KeyError, f) + + def f(): + df.val['X'] + + pytest.raises(KeyError, f) + + # A is treated as a special Timestamp + index = MultiIndex(levels=[['A', 'B', 'C'], + [0, 26, 27, 37, 57, 67, 75, 82]], + labels=[[0, 0, 0, 1, 2, 2, 2, 2, 2, 2], + [1, 3, 4, 6, 0, 2, 2, 3, 5, 7]], + names=['tag', 'day']) + df = DataFrame(arr, index=index, columns=['val']) + result = df.val['A'] + expected = Series(arr.ravel()[0:3], name='val', index=Index( + [26, 37, 57], name='day')) + tm.assert_series_equal(result, expected) + + def f(): + df.val['X'] + + pytest.raises(KeyError, f) + + # GH 7866 + # multi-index slicing with missing indexers + idx = pd.MultiIndex.from_product([['A', 'B', 'C'], + ['foo', 'bar', 'baz']], + names=['one', 'two']) + s = pd.Series(np.arange(9, dtype='int64'), index=idx).sort_index() + + exp_idx = pd.MultiIndex.from_product([['A'], ['foo', 'bar', 'baz']], + names=['one', 'two']) + expected = pd.Series(np.arange(3, dtype='int64'), + index=exp_idx).sort_index() + + result = s.loc[['A']] + tm.assert_series_equal(result, expected) + result = s.loc[['A', 'D']] + tm.assert_series_equal(result, expected) + + # not any values found + pytest.raises(KeyError, lambda: s.loc[['D']]) + + # empty ok + result = s.loc[[]] + expected = s.iloc[[]] + tm.assert_series_equal(result, expected) + + idx = pd.IndexSlice + expected = pd.Series([0, 3, 6], index=pd.MultiIndex.from_product( + [['A', 'B', 'C'], ['foo']], names=['one', 'two'])).sort_index() + + result = s.loc[idx[:, ['foo']]] + tm.assert_series_equal(result, expected) + result = s.loc[idx[:, ['foo', 'bah']]] + tm.assert_series_equal(result, expected) + + # GH 8737 + # empty indexer + multi_index = pd.MultiIndex.from_product((['foo', 'bar', 'baz'], + ['alpha', 'beta'])) + df = DataFrame( + np.random.randn(5, 6), index=range(5), columns=multi_index) + df = df.sort_index(level=0, axis=1) + + expected = DataFrame(index=range(5), + columns=multi_index.reindex([])[0]) + result1 = df.loc[:, ([], slice(None))] + result2 = df.loc[:, (['foo'], [])] + tm.assert_frame_equal(result1, expected) + tm.assert_frame_equal(result2, expected) + + # regression from < 0.14.0 + # GH 7914 + df = DataFrame([[np.mean, np.median], ['mean', 'median']], + columns=MultiIndex.from_tuples([('functs', 'mean'), + ('functs', 'median')]), + index=['function', 'name']) + result = df.loc['function', ('functs', 'mean')] + assert result == np.mean + + def test_multiindex_assignment(self): + + # GH3777 part 2 + + # mixed dtype + df = DataFrame(np.random.randint(5, 10, size=9).reshape(3, 3), + columns=list('abc'), + index=[[4, 4, 8], [8, 10, 12]]) + df['d'] = np.nan + arr = np.array([0., 1.]) + + with catch_warnings(record=True): + df.ix[4, 'd'] = arr + tm.assert_series_equal(df.ix[4, 'd'], + Series(arr, index=[8, 10], name='d')) + + # single dtype + df = DataFrame(np.random.randint(5, 10, size=9).reshape(3, 3), + columns=list('abc'), + index=[[4, 4, 8], [8, 10, 12]]) + + with catch_warnings(record=True): + df.ix[4, 'c'] = arr + exp = Series(arr, index=[8, 10], name='c', dtype='float64') + tm.assert_series_equal(df.ix[4, 
'c'], exp) + + # scalar ok + with catch_warnings(record=True): + df.ix[4, 'c'] = 10 + exp = Series(10, index=[8, 10], name='c', dtype='float64') + tm.assert_series_equal(df.ix[4, 'c'], exp) + + # invalid assignments + def f(): + with catch_warnings(record=True): + df.ix[4, 'c'] = [0, 1, 2, 3] + + pytest.raises(ValueError, f) + + def f(): + with catch_warnings(record=True): + df.ix[4, 'c'] = [0] + + pytest.raises(ValueError, f) + + # groupby example + NUM_ROWS = 100 + NUM_COLS = 10 + col_names = ['A' + num for num in + map(str, np.arange(NUM_COLS).tolist())] + index_cols = col_names[:5] + + df = DataFrame(np.random.randint(5, size=(NUM_ROWS, NUM_COLS)), + dtype=np.int64, columns=col_names) + df = df.set_index(index_cols).sort_index() + grp = df.groupby(level=index_cols[:4]) + df['new_col'] = np.nan + + f_index = np.arange(5) + + def f(name, df2): + return Series(np.arange(df2.shape[0]), + name=df2.index.values[0]).reindex(f_index) + + # TODO(wesm): unused? + # new_df = pd.concat([f(name, df2) for name, df2 in grp], axis=1).T + + # we are actually operating on a copy here + # but in this case, that's ok + for name, df2 in grp: + new_vals = np.arange(df2.shape[0]) + with catch_warnings(record=True): + df.ix[name, 'new_col'] = new_vals + + def test_multiindex_label_slicing_with_negative_step(self): + s = Series(np.arange(20), + MultiIndex.from_product([list('abcde'), np.arange(4)])) + SLC = pd.IndexSlice + + def assert_slices_equivalent(l_slc, i_slc): + tm.assert_series_equal(s.loc[l_slc], s.iloc[i_slc]) + tm.assert_series_equal(s[l_slc], s.iloc[i_slc]) + with catch_warnings(record=True): + tm.assert_series_equal(s.ix[l_slc], s.iloc[i_slc]) + + assert_slices_equivalent(SLC[::-1], SLC[::-1]) + + assert_slices_equivalent(SLC['d'::-1], SLC[15::-1]) + assert_slices_equivalent(SLC[('d', )::-1], SLC[15::-1]) + + assert_slices_equivalent(SLC[:'d':-1], SLC[:11:-1]) + assert_slices_equivalent(SLC[:('d', ):-1], SLC[:11:-1]) + + assert_slices_equivalent(SLC['d':'b':-1], SLC[15:3:-1]) + assert_slices_equivalent(SLC[('d', ):'b':-1], SLC[15:3:-1]) + assert_slices_equivalent(SLC['d':('b', ):-1], SLC[15:3:-1]) + assert_slices_equivalent(SLC[('d', ):('b', ):-1], SLC[15:3:-1]) + assert_slices_equivalent(SLC['b':'d':-1], SLC[:0]) + + assert_slices_equivalent(SLC[('c', 2)::-1], SLC[10::-1]) + assert_slices_equivalent(SLC[:('c', 2):-1], SLC[:9:-1]) + assert_slices_equivalent(SLC[('e', 0):('c', 2):-1], SLC[16:9:-1]) + + def test_multiindex_slice_first_level(self): + # GH 12697 + freq = ['a', 'b', 'c', 'd'] + idx = pd.MultiIndex.from_product([freq, np.arange(500)]) + df = pd.DataFrame(list(range(2000)), index=idx, columns=['Test']) + df_slice = df.loc[pd.IndexSlice[:, 30:70], :] + result = df_slice.loc['a'] + expected = pd.DataFrame(list(range(30, 71)), + columns=['Test'], + index=range(30, 71)) + tm.assert_frame_equal(result, expected) + result = df_slice.loc['d'] + expected = pd.DataFrame(list(range(1530, 1571)), + columns=['Test'], + index=range(30, 71)) + tm.assert_frame_equal(result, expected) + + +class TestMultiIndexSlicers(object): + + def test_per_axis_per_level_getitem(self): + + # GH6134 + # example test case + ix = MultiIndex.from_product([_mklbl('A', 5), _mklbl('B', 7), _mklbl( + 'C', 4), _mklbl('D', 2)]) + df = DataFrame(np.arange(len(ix.get_values())), index=ix) + + result = df.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :] + expected = df.loc[[tuple([a, b, c, d]) + for a, b, c, d in df.index.values + if (a == 'A1' or a == 'A2' or a == 'A3') and ( + c == 'C1' or c == 'C3')]] + 
tm.assert_frame_equal(result, expected) + + expected = df.loc[[tuple([a, b, c, d]) + for a, b, c, d in df.index.values + if (a == 'A1' or a == 'A2' or a == 'A3') and ( + c == 'C1' or c == 'C2' or c == 'C3')]] + result = df.loc[(slice('A1', 'A3'), slice(None), slice('C1', 'C3')), :] + tm.assert_frame_equal(result, expected) + + # test multi-index slicing with per axis and per index controls + index = MultiIndex.from_tuples([('A', 1), ('A', 2), + ('A', 3), ('B', 1)], + names=['one', 'two']) + columns = MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'), + ('b', 'foo'), ('b', 'bah')], + names=['lvl0', 'lvl1']) + + df = DataFrame( + np.arange(16, dtype='int64').reshape( + 4, 4), index=index, columns=columns) + df = df.sort_index(axis=0).sort_index(axis=1) + + # identity + result = df.loc[(slice(None), slice(None)), :] + tm.assert_frame_equal(result, df) + result = df.loc[(slice(None), slice(None)), (slice(None), slice(None))] + tm.assert_frame_equal(result, df) + result = df.loc[:, (slice(None), slice(None))] + tm.assert_frame_equal(result, df) + + # index + result = df.loc[(slice(None), [1]), :] + expected = df.iloc[[0, 3]] + tm.assert_frame_equal(result, expected) + + result = df.loc[(slice(None), 1), :] + expected = df.iloc[[0, 3]] + tm.assert_frame_equal(result, expected) + + # columns + result = df.loc[:, (slice(None), ['foo'])] + expected = df.iloc[:, [1, 3]] + tm.assert_frame_equal(result, expected) + + # both + result = df.loc[(slice(None), 1), (slice(None), ['foo'])] + expected = df.iloc[[0, 3], [1, 3]] + tm.assert_frame_equal(result, expected) + + result = df.loc['A', 'a'] + expected = DataFrame(dict(bar=[1, 5, 9], foo=[0, 4, 8]), + index=Index([1, 2, 3], name='two'), + columns=Index(['bar', 'foo'], name='lvl1')) + tm.assert_frame_equal(result, expected) + + result = df.loc[(slice(None), [1, 2]), :] + expected = df.iloc[[0, 1, 3]] + tm.assert_frame_equal(result, expected) + + # multi-level series + s = Series(np.arange(len(ix.get_values())), index=ix) + result = s.loc['A1':'A3', :, ['C1', 'C3']] + expected = s.loc[[tuple([a, b, c, d]) + for a, b, c, d in s.index.values + if (a == 'A1' or a == 'A2' or a == 'A3') and ( + c == 'C1' or c == 'C3')]] + tm.assert_series_equal(result, expected) + + # boolean indexers + result = df.loc[(slice(None), df.loc[:, ('a', 'bar')] > 5), :] + expected = df.iloc[[2, 3]] + tm.assert_frame_equal(result, expected) + + def f(): + df.loc[(slice(None), np.array([True, False])), :] + + pytest.raises(ValueError, f) + + # ambiguous cases + # these can be multiply interpreted (e.g. 
in this case
+        # as df.loc[slice(None), [1]] as well)
+        pytest.raises(KeyError, lambda: df.loc[slice(None), [1]])
+
+        result = df.loc[(slice(None), [1]), :]
+        expected = df.iloc[[0, 3]]
+        tm.assert_frame_equal(result, expected)
+
+        # not lexsorted
+        assert df.index.lexsort_depth == 2
+        df = df.sort_index(level=1, axis=0)
+        assert df.index.lexsort_depth == 0
+        with tm.assert_raises_regex(
+                UnsortedIndexError,
+                'MultiIndex Slicing requires the index to be fully '
+                r'lexsorted tuple len \(2\), lexsort depth \(0\)'):
+            df.loc[(slice(None), df.loc[:, ('a', 'bar')] > 5), :]
+
+    def test_multiindex_slicers_non_unique(self):
+
+        # GH 7106
+        # non-unique mi index support
+        df = (DataFrame(dict(A=['foo', 'foo', 'foo', 'foo'],
+                             B=['a', 'a', 'a', 'a'],
+                             C=[1, 2, 1, 3],
+                             D=[1, 2, 3, 4]))
+              .set_index(['A', 'B', 'C']).sort_index())
+        assert not df.index.is_unique
+        expected = (DataFrame(dict(A=['foo', 'foo'], B=['a', 'a'],
+                                   C=[1, 1], D=[1, 3]))
+                    .set_index(['A', 'B', 'C']).sort_index())
+        result = df.loc[(slice(None), slice(None), 1), :]
+        tm.assert_frame_equal(result, expected)
+
+        # this is equivalent to an xs expression
+        result = df.xs(1, level=2, drop_level=False)
+        tm.assert_frame_equal(result, expected)
+
+        df = (DataFrame(dict(A=['foo', 'foo', 'foo', 'foo'],
+                             B=['a', 'a', 'a', 'a'],
+                             C=[1, 2, 1, 2],
+                             D=[1, 2, 3, 4]))
+              .set_index(['A', 'B', 'C']).sort_index())
+        assert not df.index.is_unique
+        expected = (DataFrame(dict(A=['foo', 'foo'], B=['a', 'a'],
+                                   C=[1, 1], D=[1, 3]))
+                    .set_index(['A', 'B', 'C']).sort_index())
+        result = df.loc[(slice(None), slice(None), 1), :]
+        assert not result.index.is_unique
+        tm.assert_frame_equal(result, expected)
+
+        # GH12896
+        # numpy-implementation dependent bug
+        ints = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 14, 14, 16,
+                17, 18, 19, 200000, 200000]
+        n = len(ints)
+        idx = MultiIndex.from_arrays([['a'] * n, ints])
+        result = Series([1] * n, index=idx)
+        result = result.sort_index()
+        result = result.loc[(slice(None), slice(100000))]
+        expected = Series([1] * (n - 2), index=idx[:-2]).sort_index()
+        tm.assert_series_equal(result, expected)
+
+    def test_multiindex_slicers_datetimelike(self):
+
+        # GH 7429
+        # buggy/inconsistent behavior when slicing with datetime-like
+        import datetime
+        dates = [datetime.datetime(2012, 1, 1, 12, 12, 12) +
+                 datetime.timedelta(days=i) for i in range(6)]
+        freq = [1, 2]
+        index = MultiIndex.from_product(
+            [dates, freq], names=['date', 'frequency'])
+
+        df = DataFrame(
+            np.arange(6 * 2 * 4, dtype='int64').reshape(
+                -1, 4), index=index, columns=list('ABCD'))
+
+        # multi-axis slicing
+        idx = pd.IndexSlice
+        expected = df.iloc[[0, 2, 4], [0, 1]]
+        result = df.loc[(slice(Timestamp('2012-01-01 12:12:12'),
+                               Timestamp('2012-01-03 12:12:12')),
+                         slice(1, 1)), slice('A', 'B')]
+        tm.assert_frame_equal(result, expected)
+
+        result = df.loc[(idx[Timestamp('2012-01-01 12:12:12'):Timestamp(
+            '2012-01-03 12:12:12')], idx[1:1]), slice('A', 'B')]
+        tm.assert_frame_equal(result, expected)
+
+        result = df.loc[(slice(Timestamp('2012-01-01 12:12:12'),
+                               Timestamp('2012-01-03 12:12:12')), 1),
+                        slice('A', 'B')]
+        tm.assert_frame_equal(result, expected)
+
+        # with strings
+        result = df.loc[(slice('2012-01-01 12:12:12', '2012-01-03 12:12:12'),
+                         slice(1, 1)), slice('A', 'B')]
+        tm.assert_frame_equal(result, expected)
+
+        result = df.loc[(idx['2012-01-01 12:12:12':'2012-01-03 12:12:12'], 1),
+                        idx['A', 'B']]
+        tm.assert_frame_equal(result, expected)
+
+    def test_multiindex_slicers_edges(self):
+        # GH 8132
+        # various edge cases
+        df
= DataFrame( + {'A': ['A0'] * 5 + ['A1'] * 5 + ['A2'] * 5, + 'B': ['B0', 'B0', 'B1', 'B1', 'B2'] * 3, + 'DATE': ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30", + "2013-08-06", "2013-06-11", "2013-07-02", "2013-07-09", + "2013-07-30", "2013-08-06", "2013-09-03", "2013-10-01", + "2013-07-09", "2013-08-06", "2013-09-03"], + 'VALUES': [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3, 4, 2]}) + + df['DATE'] = pd.to_datetime(df['DATE']) + df1 = df.set_index(['A', 'B', 'DATE']) + df1 = df1.sort_index() + + # A1 - Get all values under "A0" and "A1" + result = df1.loc[(slice('A1')), :] + expected = df1.iloc[0:10] + tm.assert_frame_equal(result, expected) + + # A2 - Get all values from the start to "A2" + result = df1.loc[(slice('A2')), :] + expected = df1 + tm.assert_frame_equal(result, expected) + + # A3 - Get all values under "B1" or "B2" + result = df1.loc[(slice(None), slice('B1', 'B2')), :] + expected = df1.iloc[[2, 3, 4, 7, 8, 9, 12, 13, 14]] + tm.assert_frame_equal(result, expected) + + # A4 - Get all values between 2013-07-02 and 2013-07-09 + result = df1.loc[(slice(None), slice(None), + slice('20130702', '20130709')), :] + expected = df1.iloc[[1, 2, 6, 7, 12]] + tm.assert_frame_equal(result, expected) + + # B1 - Get all values in B0 that are also under A0, A1 and A2 + result = df1.loc[(slice('A2'), slice('B0')), :] + expected = df1.iloc[[0, 1, 5, 6, 10, 11]] + tm.assert_frame_equal(result, expected) + + # B2 - Get all values in B0, B1 and B2 (similar to what #2 is doing for + # the As) + result = df1.loc[(slice(None), slice('B2')), :] + expected = df1 + tm.assert_frame_equal(result, expected) + + # B3 - Get all values from B1 to B2 and up to 2013-08-06 + result = df1.loc[(slice(None), slice('B1', 'B2'), + slice('2013-08-06')), :] + expected = df1.iloc[[2, 3, 4, 7, 8, 9, 12, 13]] + tm.assert_frame_equal(result, expected) + + # B4 - Same as A4 but the start of the date slice is not a key. 
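+        # ('20130701' does not appear in the DATE level, so the slice
+        # bound is located by ordering rather than by an exact match)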
+ # shows indexing on a partial selection slice + result = df1.loc[(slice(None), slice(None), + slice('20130701', '20130709')), :] + expected = df1.iloc[[1, 2, 6, 7, 12]] + tm.assert_frame_equal(result, expected) + + def test_per_axis_per_level_doc_examples(self): + + # test index maker + idx = pd.IndexSlice + + # from indexing.rst / advanced + index = MultiIndex.from_product([_mklbl('A', 4), _mklbl('B', 2), + _mklbl('C', 4), _mklbl('D', 2)]) + columns = MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'), + ('b', 'foo'), ('b', 'bah')], + names=['lvl0', 'lvl1']) + df = DataFrame(np.arange(len(index) * len(columns), dtype='int64') + .reshape((len(index), len(columns))), + index=index, columns=columns) + result = df.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :] + expected = df.loc[[tuple([a, b, c, d]) + for a, b, c, d in df.index.values + if (a == 'A1' or a == 'A2' or a == 'A3') and ( + c == 'C1' or c == 'C3')]] + tm.assert_frame_equal(result, expected) + result = df.loc[idx['A1':'A3', :, ['C1', 'C3']], :] + tm.assert_frame_equal(result, expected) + + result = df.loc[(slice(None), slice(None), ['C1', 'C3']), :] + expected = df.loc[[tuple([a, b, c, d]) + for a, b, c, d in df.index.values + if (c == 'C1' or c == 'C3')]] + tm.assert_frame_equal(result, expected) + result = df.loc[idx[:, :, ['C1', 'C3']], :] + tm.assert_frame_equal(result, expected) + + # not sorted + def f(): + df.loc['A1', (slice(None), 'foo')] + + pytest.raises(UnsortedIndexError, f) + df = df.sort_index(axis=1) + + # slicing + df.loc['A1', (slice(None), 'foo')] + df.loc[(slice(None), slice(None), ['C1', 'C3']), (slice(None), 'foo')] + + # setitem + df.loc(axis=0)[:, :, ['C1', 'C3']] = -10 + + def test_loc_axis_arguments(self): + + index = MultiIndex.from_product([_mklbl('A', 4), _mklbl('B', 2), + _mklbl('C', 4), _mklbl('D', 2)]) + columns = MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'), + ('b', 'foo'), ('b', 'bah')], + names=['lvl0', 'lvl1']) + df = DataFrame(np.arange(len(index) * len(columns), dtype='int64') + .reshape((len(index), len(columns))), + index=index, + columns=columns).sort_index().sort_index(axis=1) + + # axis 0 + result = df.loc(axis=0)['A1':'A3', :, ['C1', 'C3']] + expected = df.loc[[tuple([a, b, c, d]) + for a, b, c, d in df.index.values + if (a == 'A1' or a == 'A2' or a == 'A3') and ( + c == 'C1' or c == 'C3')]] + tm.assert_frame_equal(result, expected) + + result = df.loc(axis='index')[:, :, ['C1', 'C3']] + expected = df.loc[[tuple([a, b, c, d]) + for a, b, c, d in df.index.values + if (c == 'C1' or c == 'C3')]] + tm.assert_frame_equal(result, expected) + + # axis 1 + result = df.loc(axis=1)[:, 'foo'] + expected = df.loc[:, (slice(None), 'foo')] + tm.assert_frame_equal(result, expected) + + result = df.loc(axis='columns')[:, 'foo'] + expected = df.loc[:, (slice(None), 'foo')] + tm.assert_frame_equal(result, expected) + + # invalid axis + def f(): + df.loc(axis=-1)[:, :, ['C1', 'C3']] + + pytest.raises(ValueError, f) + + def f(): + df.loc(axis=2)[:, :, ['C1', 'C3']] + + pytest.raises(ValueError, f) + + def f(): + df.loc(axis='foo')[:, :, ['C1', 'C3']] + + pytest.raises(ValueError, f) + + def test_per_axis_per_level_setitem(self): + + # test index maker + idx = pd.IndexSlice + + # test multi-index slicing with per axis and per index controls + index = MultiIndex.from_tuples([('A', 1), ('A', 2), + ('A', 3), ('B', 1)], + names=['one', 'two']) + columns = MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'), + ('b', 'foo'), ('b', 'bah')], + names=['lvl0', 'lvl1']) + + df_orig = DataFrame( + 
np.arange(16, dtype='int64').reshape( + 4, 4), index=index, columns=columns) + df_orig = df_orig.sort_index(axis=0).sort_index(axis=1) + + # identity + df = df_orig.copy() + df.loc[(slice(None), slice(None)), :] = 100 + expected = df_orig.copy() + expected.iloc[:, :] = 100 + tm.assert_frame_equal(df, expected) + + df = df_orig.copy() + df.loc(axis=0)[:, :] = 100 + expected = df_orig.copy() + expected.iloc[:, :] = 100 + tm.assert_frame_equal(df, expected) + + df = df_orig.copy() + df.loc[(slice(None), slice(None)), (slice(None), slice(None))] = 100 + expected = df_orig.copy() + expected.iloc[:, :] = 100 + tm.assert_frame_equal(df, expected) + + df = df_orig.copy() + df.loc[:, (slice(None), slice(None))] = 100 + expected = df_orig.copy() + expected.iloc[:, :] = 100 + tm.assert_frame_equal(df, expected) + + # index + df = df_orig.copy() + df.loc[(slice(None), [1]), :] = 100 + expected = df_orig.copy() + expected.iloc[[0, 3]] = 100 + tm.assert_frame_equal(df, expected) + + df = df_orig.copy() + df.loc[(slice(None), 1), :] = 100 + expected = df_orig.copy() + expected.iloc[[0, 3]] = 100 + tm.assert_frame_equal(df, expected) + + df = df_orig.copy() + df.loc(axis=0)[:, 1] = 100 + expected = df_orig.copy() + expected.iloc[[0, 3]] = 100 + tm.assert_frame_equal(df, expected) + + # columns + df = df_orig.copy() + df.loc[:, (slice(None), ['foo'])] = 100 + expected = df_orig.copy() + expected.iloc[:, [1, 3]] = 100 + tm.assert_frame_equal(df, expected) + + # both + df = df_orig.copy() + df.loc[(slice(None), 1), (slice(None), ['foo'])] = 100 + expected = df_orig.copy() + expected.iloc[[0, 3], [1, 3]] = 100 + tm.assert_frame_equal(df, expected) + + df = df_orig.copy() + df.loc[idx[:, 1], idx[:, ['foo']]] = 100 + expected = df_orig.copy() + expected.iloc[[0, 3], [1, 3]] = 100 + tm.assert_frame_equal(df, expected) + + df = df_orig.copy() + df.loc['A', 'a'] = 100 + expected = df_orig.copy() + expected.iloc[0:3, 0:2] = 100 + tm.assert_frame_equal(df, expected) + + # setting with a list-like + df = df_orig.copy() + df.loc[(slice(None), 1), (slice(None), ['foo'])] = np.array( + [[100, 100], [100, 100]], dtype='int64') + expected = df_orig.copy() + expected.iloc[[0, 3], [1, 3]] = 100 + tm.assert_frame_equal(df, expected) + + # not enough values + df = df_orig.copy() + + def f(): + df.loc[(slice(None), 1), (slice(None), ['foo'])] = np.array( + [[100], [100, 100]], dtype='int64') + + pytest.raises(ValueError, f) + + def f(): + df.loc[(slice(None), 1), (slice(None), ['foo'])] = np.array( + [100, 100, 100, 100], dtype='int64') + + pytest.raises(ValueError, f) + + # with an alignable rhs + df = df_orig.copy() + df.loc[(slice(None), 1), (slice(None), ['foo'])] = df.loc[(slice( + None), 1), (slice(None), ['foo'])] * 5 + expected = df_orig.copy() + expected.iloc[[0, 3], [1, 3]] = expected.iloc[[0, 3], [1, 3]] * 5 + tm.assert_frame_equal(df, expected) + + df = df_orig.copy() + df.loc[(slice(None), 1), (slice(None), ['foo'])] *= df.loc[(slice( + None), 1), (slice(None), ['foo'])] + expected = df_orig.copy() + expected.iloc[[0, 3], [1, 3]] *= expected.iloc[[0, 3], [1, 3]] + tm.assert_frame_equal(df, expected) + + rhs = df_orig.loc[(slice(None), 1), (slice(None), ['foo'])].copy() + rhs.loc[:, ('c', 'bah')] = 10 + df = df_orig.copy() + df.loc[(slice(None), 1), (slice(None), ['foo'])] *= rhs + expected = df_orig.copy() + expected.iloc[[0, 3], [1, 3]] *= expected.iloc[[0, 3], [1, 3]] + tm.assert_frame_equal(df, expected) + + +class TestMultiIndexPanel(object): + + def test_iloc_getitem_panel_multiindex(self): + + with 
catch_warnings(record=True): + + # GH 7199 + # Panel with multi-index + multi_index = pd.MultiIndex.from_tuples([('ONE', 'one'), + ('TWO', 'two'), + ('THREE', 'three')], + names=['UPPER', 'lower']) + + simple_index = [x[0] for x in multi_index] + wd1 = Panel(items=['First', 'Second'], + major_axis=['a', 'b', 'c', 'd'], + minor_axis=multi_index) + + wd2 = Panel(items=['First', 'Second'], + major_axis=['a', 'b', 'c', 'd'], + minor_axis=simple_index) + + expected1 = wd1['First'].iloc[[True, True, True, False], [0, 2]] + result1 = wd1.iloc[0, [True, True, True, False], [0, 2]] # WRONG + tm.assert_frame_equal(result1, expected1) + + expected2 = wd2['First'].iloc[[True, True, True, False], [0, 2]] + result2 = wd2.iloc[0, [True, True, True, False], [0, 2]] + tm.assert_frame_equal(result2, expected2) + + expected1 = DataFrame(index=['a'], columns=multi_index, + dtype='float64') + result1 = wd1.iloc[0, [0], [0, 1, 2]] + tm.assert_frame_equal(result1, expected1) + + expected2 = DataFrame(index=['a'], columns=simple_index, + dtype='float64') + result2 = wd2.iloc[0, [0], [0, 1, 2]] + tm.assert_frame_equal(result2, expected2) + + # GH 7516 + mi = MultiIndex.from_tuples([(0, 'x'), (1, 'y'), (2, 'z')]) + p = Panel(np.arange(3 * 3 * 3, dtype='int64').reshape(3, 3, 3), + items=['a', 'b', 'c'], major_axis=mi, + minor_axis=['u', 'v', 'w']) + result = p.iloc[:, 1, 0] + expected = Series([3, 12, 21], index=['a', 'b', 'c'], name='u') + tm.assert_series_equal(result, expected) + + result = p.loc[:, (1, 'y'), 'u'] + tm.assert_series_equal(result, expected) + + def test_panel_setitem_with_multiindex(self): + + with catch_warnings(record=True): + # 10360 + # failing with a multi-index + arr = np.array([[[1, 2, 3], [0, 0, 0]], + [[0, 0, 0], [0, 0, 0]]], + dtype=np.float64) + + # reg index + axes = dict(items=['A', 'B'], major_axis=[0, 1], + minor_axis=['X', 'Y', 'Z']) + p1 = Panel(0., **axes) + p1.iloc[0, 0, :] = [1, 2, 3] + expected = Panel(arr, **axes) + tm.assert_panel_equal(p1, expected) + + # multi-indexes + axes['items'] = pd.MultiIndex.from_tuples( + [('A', 'a'), ('B', 'b')]) + p2 = Panel(0., **axes) + p2.iloc[0, 0, :] = [1, 2, 3] + expected = Panel(arr, **axes) + tm.assert_panel_equal(p2, expected) + + axes['major_axis'] = pd.MultiIndex.from_tuples( + [('A', 1), ('A', 2)]) + p3 = Panel(0., **axes) + p3.iloc[0, 0, :] = [1, 2, 3] + expected = Panel(arr, **axes) + tm.assert_panel_equal(p3, expected) + + axes['minor_axis'] = pd.MultiIndex.from_product( + [['X'], range(3)]) + p4 = Panel(0., **axes) + p4.iloc[0, 0, :] = [1, 2, 3] + expected = Panel(arr, **axes) + tm.assert_panel_equal(p4, expected) + + arr = np.array( + [[[1, 0, 0], [2, 0, 0]], [[0, 0, 0], [0, 0, 0]]], + dtype=np.float64) + p5 = Panel(0., **axes) + p5.iloc[0, :, 0] = [1, 2] + expected = Panel(arr, **axes) + tm.assert_panel_equal(p5, expected) diff --git a/pandas/tests/indexing/test_panel.py b/pandas/tests/indexing/test_panel.py new file mode 100644 index 0000000000000..2d4ffd6a4e783 --- /dev/null +++ b/pandas/tests/indexing/test_panel.py @@ -0,0 +1,219 @@ +import pytest +from warnings import catch_warnings + +import numpy as np +from pandas.util import testing as tm +from pandas import Panel, date_range, DataFrame + + +class TestPanel(object): + + def test_iloc_getitem_panel(self): + + with catch_warnings(record=True): + # GH 7189 + p = Panel(np.arange(4 * 3 * 2).reshape(4, 3, 2), + items=['A', 'B', 'C', 'D'], + major_axis=['a', 'b', 'c'], + minor_axis=['one', 'two']) + + result = p.iloc[1] + expected = p.loc['B'] + tm.assert_frame_equal(result, 
expected)
+
+            result = p.iloc[1, 1]
+            expected = p.loc['B', 'b']
+            tm.assert_series_equal(result, expected)
+
+            result = p.iloc[1, 1, 1]
+            expected = p.loc['B', 'b', 'two']
+            assert result == expected
+
+            # slice
+            result = p.iloc[1:3]
+            expected = p.loc[['B', 'C']]
+            tm.assert_panel_equal(result, expected)
+
+            result = p.iloc[:, 0:2]
+            expected = p.loc[:, ['a', 'b']]
+            tm.assert_panel_equal(result, expected)
+
+            # list of integers
+            result = p.iloc[[0, 2]]
+            expected = p.loc[['A', 'C']]
+            tm.assert_panel_equal(result, expected)
+
+            # neg indices
+            result = p.iloc[[-1, 1], [-1, 1]]
+            expected = p.loc[['D', 'B'], ['c', 'b']]
+            tm.assert_panel_equal(result, expected)
+
+            # dups indices
+            result = p.iloc[[-1, -1, 1], [-1, 1]]
+            expected = p.loc[['D', 'D', 'B'], ['c', 'b']]
+            tm.assert_panel_equal(result, expected)
+
+            # combined
+            result = p.iloc[0, [True, True], [0, 1]]
+            expected = p.loc['A', ['a', 'b'], ['one', 'two']]
+            tm.assert_frame_equal(result, expected)
+
+            # out-of-bounds exception
+            with pytest.raises(IndexError):
+                p.iloc[tuple([10, 5])]
+
+            def f():
+                p.iloc[0, [True, True], [0, 1, 2]]
+
+            pytest.raises(IndexError, f)
+
+            # trying to use a label
+            with pytest.raises(ValueError):
+                p.iloc[tuple(['j', 'D'])]
+
+            # GH
+            p = Panel(
+                np.random.rand(4, 3, 2), items=['A', 'B', 'C', 'D'],
+                major_axis=['U', 'V', 'W'], minor_axis=['X', 'Y'])
+            expected = p['A']
+
+            result = p.iloc[0, :, :]
+            tm.assert_frame_equal(result, expected)
+
+            result = p.iloc[0, [True, True, True], :]
+            tm.assert_frame_equal(result, expected)
+
+            result = p.iloc[0, [True, True, True], [0, 1]]
+            tm.assert_frame_equal(result, expected)
+
+            def f():
+                p.iloc[0, [True, True, True], [0, 1, 2]]
+
+            pytest.raises(IndexError, f)
+
+            def f():
+                p.iloc[0, [True, True, True], [2]]
+
+            pytest.raises(IndexError, f)
+
+    def test_iloc_panel_issue(self):
+
+        with catch_warnings(record=True):
+            # see gh-3617
+            p = Panel(np.random.randn(4, 4, 4))
+
+            assert p.iloc[:3, :3, :3].shape == (3, 3, 3)
+            assert p.iloc[1, :3, :3].shape == (3, 3)
+            assert p.iloc[:3, 1, :3].shape == (3, 3)
+            assert p.iloc[:3, :3, 1].shape == (3, 3)
+            assert p.iloc[1, 1, :3].shape == (3, )
+            assert p.iloc[1, :3, 1].shape == (3, )
+            assert p.iloc[:3, 1, 1].shape == (3, )
+
+    def test_panel_getitem(self):
+
+        with catch_warnings(record=True):
+            # GH4016: date selection returns a frame when using a partial
+            # string selection
+            ind = date_range(start="2000", freq="D", periods=1000)
+            df = DataFrame(
+                np.random.randn(
+                    len(ind), 5), index=ind, columns=list('ABCDE'))
+            panel = Panel(dict([('frame_' + c, df) for c in list('ABC')]))
+
+            test2 = panel.loc[:, "2002":"2002-12-31"]
+            test1 = panel.loc[:, "2002"]
+            tm.assert_panel_equal(test1, test2)
+
+            # GH8710
+            # multi-element getting with a list
+            panel = tm.makePanel()
+
+            expected = panel.iloc[[0, 1]]
+
+            result = panel.loc[['ItemA', 'ItemB']]
+            tm.assert_panel_equal(result, expected)
+
+            result = panel.loc[['ItemA', 'ItemB'], :, :]
+            tm.assert_panel_equal(result, expected)
+
+            result = panel[['ItemA', 'ItemB']]
+            tm.assert_panel_equal(result, expected)
+
+            result = panel.loc['ItemA':'ItemB']
+            tm.assert_panel_equal(result, expected)
+
+            with catch_warnings(record=True):
+                result = panel.ix[['ItemA', 'ItemB']]
+                tm.assert_panel_equal(result, expected)
+
+            # with an object-like
+            # GH 9140
+            class TestObject:
+
+                def __str__(self):
+                    return "TestObject"
+
+            obj = TestObject()
+
+            p = Panel(np.random.randn(1, 5, 4), items=[obj],
+                      major_axis=date_range('1/1/2000', periods=5),
+                      minor_axis=['A', 'B', 'C', 'D'])
+
+            expected = p.iloc[0]
+            result = p[obj]
+            tm.assert_frame_equal(result, expected)
+
+    def test_panel_setitem(self):
+
+        with catch_warnings(record=True):
+            # GH 7763
+            # loc and setitem have setting differences
+            np.random.seed(0)
+            index = range(3)
+            columns = list('abc')
+
+            panel = Panel({'A': DataFrame(np.random.randn(3, 3),
+                                          index=index, columns=columns),
+                           'B': DataFrame(np.random.randn(3, 3),
+                                          index=index, columns=columns),
+                           'C': DataFrame(np.random.randn(3, 3),
+                                          index=index, columns=columns)})
+
+            replace = DataFrame(np.eye(3, 3), index=range(3), columns=columns)
+            expected = Panel({'A': replace, 'B': replace, 'C': replace})
+
+            p = panel.copy()
+            for idx in list('ABC'):
+                p[idx] = replace
+            tm.assert_panel_equal(p, expected)
+
+            p = panel.copy()
+            for idx in list('ABC'):
+                p.loc[idx, :, :] = replace
+            tm.assert_panel_equal(p, expected)
+
+    def test_panel_assignment(self):
+
+        with catch_warnings(record=True):
+            # GH3777
+            wp = Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
+                       major_axis=date_range('1/1/2000', periods=5),
+                       minor_axis=['A', 'B', 'C', 'D'])
+            wp2 = Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
+                        major_axis=date_range('1/1/2000', periods=5),
+                        minor_axis=['A', 'B', 'C', 'D'])
+
+            # TODO: unused?
+            # expected = wp.loc[['Item1', 'Item2'], :, ['A', 'B']]
+
+            def f():
+                wp.loc[['Item1', 'Item2'], :, ['A', 'B']] = wp2.loc[
+                    ['Item1', 'Item2'], :, ['A', 'B']]
+
+            pytest.raises(NotImplementedError, f)
+
+            # to_assign = wp2.loc[['Item1', 'Item2'], :, ['A', 'B']]
+            # wp.loc[['Item1', 'Item2'], :, ['A', 'B']] = to_assign
+            # result = wp.loc[['Item1', 'Item2'], :, ['A', 'B']]
+            # tm.assert_panel_equal(result,expected)
diff --git a/pandas/tests/indexing/test_partial.py b/pandas/tests/indexing/test_partial.py
new file mode 100644
index 0000000000000..93a85e247a787
--- /dev/null
+++ b/pandas/tests/indexing/test_partial.py
@@ -0,0 +1,591 @@
+"""
+test setting *parts* of objects both positionally and label based
+
+TODO: these should be split among the indexer tests
+"""
+
+import pytest
+
+from warnings import catch_warnings
+import numpy as np
+
+import pandas as pd
+from pandas import Series, DataFrame, Panel, Index, date_range
+from pandas.util import testing as tm
+
+
+class TestPartialSetting(object):
+
+    def test_partial_setting(self):
+
+        # GH2578, allow ix and friends to partially set
+
+        # series
+        s_orig = Series([1, 2, 3])
+
+        s = s_orig.copy()
+        s[5] = 5
+        expected = Series([1, 2, 3, 5], index=[0, 1, 2, 5])
+        tm.assert_series_equal(s, expected)
+
+        s = s_orig.copy()
+        s.loc[5] = 5
+        expected = Series([1, 2, 3, 5], index=[0, 1, 2, 5])
+        tm.assert_series_equal(s, expected)
+
+        s = s_orig.copy()
+        s[5] = 5.
+        expected = Series([1, 2, 3, 5.], index=[0, 1, 2, 5])
+        tm.assert_series_equal(s, expected)
+
+        s = s_orig.copy()
+        s.loc[5] = 5.
+        expected = Series([1, 2, 3, 5.], index=[0, 1, 2, 5])
+        tm.assert_series_equal(s, expected)
+
+        # iloc/iat raise
+        s = s_orig.copy()
+
+        def f():
+            s.iloc[3] = 5.
+
+        pytest.raises(IndexError, f)
+
+        def f():
+            s.iat[3] = 5.
+
+        pytest.raises(IndexError, f)
+
+        # ## frame ##
+
+        df_orig = DataFrame(
+            np.arange(6).reshape(3, 2), columns=['A', 'B'], dtype='int64')
+
+        # iloc/iat raise
+        df = df_orig.copy()
+
+        def f():
+            df.iloc[4, 2] = 5.
+
+        pytest.raises(IndexError, f)
+
+        def f():
+            df.iat[4, 2] = 5.
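+
+        # note: .iloc/.iat are strictly positional and cannot enlarge the
+        # object, so setting at an out-of-bounds position raises IndexError,
+        # whereas the label-based .loc sets above append the new label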
+ + pytest.raises(IndexError, f) + + # row setting where it exists + expected = DataFrame(dict({'A': [0, 4, 4], 'B': [1, 5, 5]})) + df = df_orig.copy() + df.iloc[1] = df.iloc[2] + tm.assert_frame_equal(df, expected) + + expected = DataFrame(dict({'A': [0, 4, 4], 'B': [1, 5, 5]})) + df = df_orig.copy() + df.loc[1] = df.loc[2] + tm.assert_frame_equal(df, expected) + + # like 2578, partial setting with dtype preservation + expected = DataFrame(dict({'A': [0, 2, 4, 4], 'B': [1, 3, 5, 5]})) + df = df_orig.copy() + df.loc[3] = df.loc[2] + tm.assert_frame_equal(df, expected) + + # single dtype frame, overwrite + expected = DataFrame(dict({'A': [0, 2, 4], 'B': [0, 2, 4]})) + df = df_orig.copy() + with catch_warnings(record=True): + df.ix[:, 'B'] = df.ix[:, 'A'] + tm.assert_frame_equal(df, expected) + + # mixed dtype frame, overwrite + expected = DataFrame(dict({'A': [0, 2, 4], 'B': Series([0, 2, 4])})) + df = df_orig.copy() + df['B'] = df['B'].astype(np.float64) + with catch_warnings(record=True): + df.ix[:, 'B'] = df.ix[:, 'A'] + tm.assert_frame_equal(df, expected) + + # single dtype frame, partial setting + expected = df_orig.copy() + expected['C'] = df['A'] + df = df_orig.copy() + with catch_warnings(record=True): + df.ix[:, 'C'] = df.ix[:, 'A'] + tm.assert_frame_equal(df, expected) + + # mixed frame, partial setting + expected = df_orig.copy() + expected['C'] = df['A'] + df = df_orig.copy() + with catch_warnings(record=True): + df.ix[:, 'C'] = df.ix[:, 'A'] + tm.assert_frame_equal(df, expected) + + with catch_warnings(record=True): + # ## panel ## + p_orig = Panel(np.arange(16).reshape(2, 4, 2), + items=['Item1', 'Item2'], + major_axis=pd.date_range('2001/1/12', periods=4), + minor_axis=['A', 'B'], dtype='float64') + + # panel setting via item + p_orig = Panel(np.arange(16).reshape(2, 4, 2), + items=['Item1', 'Item2'], + major_axis=pd.date_range('2001/1/12', periods=4), + minor_axis=['A', 'B'], dtype='float64') + expected = p_orig.copy() + expected['Item3'] = expected['Item1'] + p = p_orig.copy() + p.loc['Item3'] = p['Item1'] + tm.assert_panel_equal(p, expected) + + # panel with aligned series + expected = p_orig.copy() + expected = expected.transpose(2, 1, 0) + expected['C'] = DataFrame({'Item1': [30, 30, 30, 30], + 'Item2': [32, 32, 32, 32]}, + index=p_orig.major_axis) + expected = expected.transpose(2, 1, 0) + p = p_orig.copy() + p.loc[:, :, 'C'] = Series([30, 32], index=p_orig.items) + tm.assert_panel_equal(p, expected) + + # GH 8473 + dates = date_range('1/1/2000', periods=8) + df_orig = DataFrame(np.random.randn(8, 4), index=dates, + columns=['A', 'B', 'C', 'D']) + + expected = pd.concat([df_orig, DataFrame( + {'A': 7}, index=[dates[-1] + 1])]) + df = df_orig.copy() + df.loc[dates[-1] + 1, 'A'] = 7 + tm.assert_frame_equal(df, expected) + df = df_orig.copy() + df.at[dates[-1] + 1, 'A'] = 7 + tm.assert_frame_equal(df, expected) + + exp_other = DataFrame({0: 7}, index=[dates[-1] + 1]) + expected = pd.concat([df_orig, exp_other], axis=1) + + df = df_orig.copy() + df.loc[dates[-1] + 1, 0] = 7 + tm.assert_frame_equal(df, expected) + df = df_orig.copy() + df.at[dates[-1] + 1, 0] = 7 + tm.assert_frame_equal(df, expected) + + def test_partial_setting_mixed_dtype(self): + + # in a mixed dtype environment, try to preserve dtypes + # by appending + df = DataFrame([[True, 1], [False, 2]], columns=["female", "fitness"]) + + s = df.loc[1].copy() + s.name = 2 + expected = df.append(s) + + df.loc[2] = df.loc[1] + tm.assert_frame_equal(df, expected) + + # columns will align + df = DataFrame(columns=['A', 
'B'])
+        df.loc[0] = Series(1, index=range(4))
+        tm.assert_frame_equal(df, DataFrame(columns=['A', 'B'], index=[0]))
+
+        # columns will align
+        df = DataFrame(columns=['A', 'B'])
+        df.loc[0] = Series(1, index=['B'])
+
+        exp = DataFrame([[np.nan, 1]], columns=['A', 'B'],
+                        index=[0], dtype='float64')
+        tm.assert_frame_equal(df, exp)
+
+        # list-like must conform
+        df = DataFrame(columns=['A', 'B'])
+
+        def f():
+            df.loc[0] = [1, 2, 3]
+
+        pytest.raises(ValueError, f)
+
+        # TODO: #15657, these are left as object and not coerced
+        df = DataFrame(columns=['A', 'B'])
+        df.loc[3] = [6, 7]
+
+        exp = DataFrame([[6, 7]], index=[3], columns=['A', 'B'],
+                        dtype='object')
+        tm.assert_frame_equal(df, exp)
+
+    def test_series_partial_set(self):
+        # partial set with new index
+        # Regression from GH4825
+        ser = Series([0.1, 0.2], index=[1, 2])
+
+        # loc
+        expected = Series([np.nan, 0.2, np.nan], index=[3, 2, 3])
+        result = ser.loc[[3, 2, 3]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        expected = Series([np.nan, 0.2, np.nan, np.nan], index=[3, 2, 3, 'x'])
+        result = ser.loc[[3, 2, 3, 'x']]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        expected = Series([0.2, 0.2, 0.1], index=[2, 2, 1])
+        result = ser.loc[[2, 2, 1]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        expected = Series([0.2, 0.2, np.nan, 0.1], index=[2, 2, 'x', 1])
+        result = ser.loc[[2, 2, 'x', 1]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        # raises as nothing is in the index
+        pytest.raises(KeyError, lambda: ser.loc[[3, 3, 3]])
+
+        expected = Series([0.2, 0.2, np.nan], index=[2, 2, 3])
+        result = ser.loc[[2, 2, 3]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        expected = Series([0.3, np.nan, np.nan], index=[3, 4, 4])
+        result = Series([0.1, 0.2, 0.3], index=[1, 2, 3]).loc[[3, 4, 4]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        expected = Series([np.nan, 0.3, 0.3], index=[5, 3, 3])
+        result = Series([0.1, 0.2, 0.3, 0.4],
+                        index=[1, 2, 3, 4]).loc[[5, 3, 3]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        expected = Series([np.nan, 0.4, 0.4], index=[5, 4, 4])
+        result = Series([0.1, 0.2, 0.3, 0.4],
+                        index=[1, 2, 3, 4]).loc[[5, 4, 4]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        expected = Series([0.4, np.nan, np.nan], index=[7, 2, 2])
+        result = Series([0.1, 0.2, 0.3, 0.4],
+                        index=[4, 5, 6, 7]).loc[[7, 2, 2]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        expected = Series([0.4, np.nan, np.nan], index=[4, 5, 5])
+        result = Series([0.1, 0.2, 0.3, 0.4],
+                        index=[1, 2, 3, 4]).loc[[4, 5, 5]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        # iloc
+        expected = Series([0.2, 0.2, 0.1, 0.1], index=[2, 2, 1, 1])
+        result = ser.iloc[[1, 1, 0, 0]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+    def test_series_partial_set_with_name(self):
+        # GH 11497
+
+        idx = Index([1, 2], dtype='int64', name='idx')
+        ser = Series([0.1, 0.2], index=idx, name='s')
+
+        # loc
+        exp_idx = Index([3, 2, 3], dtype='int64', name='idx')
+        expected = Series([np.nan, 0.2, np.nan], index=exp_idx, name='s')
+        result = ser.loc[[3, 2, 3]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        exp_idx = Index([3, 2, 3, 'x'], dtype='object', name='idx')
+        expected = Series([np.nan, 0.2, np.nan, np.nan], index=exp_idx,
+                          name='s')
+        result = ser.loc[[3, 2, 3, 'x']]
+        tm.assert_series_equal(result, expected,
+                               check_index_type=True)
+
+        exp_idx = Index([2, 2, 1], dtype='int64', name='idx')
+        expected = Series([0.2, 0.2, 0.1], index=exp_idx, name='s')
+        result = ser.loc[[2, 2, 1]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        exp_idx = Index([2, 2, 'x', 1], dtype='object', name='idx')
+        expected = Series([0.2, 0.2, np.nan, 0.1], index=exp_idx, name='s')
+        result = ser.loc[[2, 2, 'x', 1]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        # raises as nothing is in the index
+        pytest.raises(KeyError, lambda: ser.loc[[3, 3, 3]])
+
+        exp_idx = Index([2, 2, 3], dtype='int64', name='idx')
+        expected = Series([0.2, 0.2, np.nan], index=exp_idx, name='s')
+        result = ser.loc[[2, 2, 3]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        exp_idx = Index([3, 4, 4], dtype='int64', name='idx')
+        expected = Series([0.3, np.nan, np.nan], index=exp_idx, name='s')
+        idx = Index([1, 2, 3], dtype='int64', name='idx')
+        result = Series([0.1, 0.2, 0.3], index=idx, name='s').loc[[3, 4, 4]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        exp_idx = Index([5, 3, 3], dtype='int64', name='idx')
+        expected = Series([np.nan, 0.3, 0.3], index=exp_idx, name='s')
+        idx = Index([1, 2, 3, 4], dtype='int64', name='idx')
+        result = Series([0.1, 0.2, 0.3, 0.4], index=idx,
+                        name='s').loc[[5, 3, 3]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        exp_idx = Index([5, 4, 4], dtype='int64', name='idx')
+        expected = Series([np.nan, 0.4, 0.4], index=exp_idx, name='s')
+        idx = Index([1, 2, 3, 4], dtype='int64', name='idx')
+        result = Series([0.1, 0.2, 0.3, 0.4], index=idx,
+                        name='s').loc[[5, 4, 4]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        exp_idx = Index([7, 2, 2], dtype='int64', name='idx')
+        expected = Series([0.4, np.nan, np.nan], index=exp_idx, name='s')
+        idx = Index([4, 5, 6, 7], dtype='int64', name='idx')
+        result = Series([0.1, 0.2, 0.3, 0.4], index=idx,
+                        name='s').loc[[7, 2, 2]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        exp_idx = Index([4, 5, 5], dtype='int64', name='idx')
+        expected = Series([0.4, np.nan, np.nan], index=exp_idx, name='s')
+        idx = Index([1, 2, 3, 4], dtype='int64', name='idx')
+        result = Series([0.1, 0.2, 0.3, 0.4], index=idx,
+                        name='s').loc[[4, 5, 5]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+        # iloc
+        exp_idx = Index([2, 2, 1, 1], dtype='int64', name='idx')
+        expected = Series([0.2, 0.2, 0.1, 0.1], index=exp_idx, name='s')
+        result = ser.iloc[[1, 1, 0, 0]]
+        tm.assert_series_equal(result, expected, check_index_type=True)
+
+    def test_partial_set_invalid(self):
+
+        # GH 4940
+        # allow only setting of 'valid' values
+
+        orig = tm.makeTimeDataFrame()
+        df = orig.copy()
+
+        # don't allow non-string inserts (numeric labels raise here)
+        def f():
+            with catch_warnings(record=True):
+                df.loc[100.0, :] = df.ix[0]
+
+        pytest.raises(TypeError, f)
+
+        def f():
+            with catch_warnings(record=True):
+                df.loc[100, :] = df.ix[0]
+
+        pytest.raises(TypeError, f)
+
+        def f():
+            with catch_warnings(record=True):
+                df.ix[100.0, :] = df.ix[0]
+
+        pytest.raises(TypeError, f)
+
+        def f():
+            with catch_warnings(record=True):
+                df.ix[100, :] = df.ix[0]
+
+        pytest.raises(ValueError, f)
+
+        # allow object conversion here
+        df = orig.copy()
+        with catch_warnings(record=True):
+            df.loc['a', :] = df.ix[0]
+            exp = orig.append(pd.Series(df.ix[0], name='a'))
+        tm.assert_frame_equal(df, exp)
+        tm.assert_index_equal(df.index,
+                              pd.Index(orig.index.tolist() + ['a']))
+        assert
df.index.dtype == 'object' + + def test_partial_set_empty_series(self): + + # GH5226 + + # partially set with an empty object series + s = Series() + s.loc[1] = 1 + tm.assert_series_equal(s, Series([1], index=[1])) + s.loc[3] = 3 + tm.assert_series_equal(s, Series([1, 3], index=[1, 3])) + + s = Series() + s.loc[1] = 1. + tm.assert_series_equal(s, Series([1.], index=[1])) + s.loc[3] = 3. + tm.assert_series_equal(s, Series([1., 3.], index=[1, 3])) + + s = Series() + s.loc['foo'] = 1 + tm.assert_series_equal(s, Series([1], index=['foo'])) + s.loc['bar'] = 3 + tm.assert_series_equal(s, Series([1, 3], index=['foo', 'bar'])) + s.loc[3] = 4 + tm.assert_series_equal(s, Series([1, 3, 4], index=['foo', 'bar', 3])) + + def test_partial_set_empty_frame(self): + + # partially set with an empty object + # frame + df = DataFrame() + + def f(): + df.loc[1] = 1 + + pytest.raises(ValueError, f) + + def f(): + df.loc[1] = Series([1], index=['foo']) + + pytest.raises(ValueError, f) + + def f(): + df.loc[:, 1] = 1 + + pytest.raises(ValueError, f) + + # these work as they don't really change + # anything but the index + # GH5632 + expected = DataFrame(columns=['foo'], index=pd.Index( + [], dtype='int64')) + + def f(): + df = DataFrame() + df['foo'] = Series([], dtype='object') + return df + + tm.assert_frame_equal(f(), expected) + + def f(): + df = DataFrame() + df['foo'] = Series(df.index) + return df + + tm.assert_frame_equal(f(), expected) + + def f(): + df = DataFrame() + df['foo'] = df.index + return df + + tm.assert_frame_equal(f(), expected) + + expected = DataFrame(columns=['foo'], + index=pd.Index([], dtype='int64')) + expected['foo'] = expected['foo'].astype('float64') + + def f(): + df = DataFrame() + df['foo'] = [] + return df + + tm.assert_frame_equal(f(), expected) + + def f(): + df = DataFrame() + df['foo'] = Series(range(len(df))) + return df + + tm.assert_frame_equal(f(), expected) + + def f(): + df = DataFrame() + tm.assert_index_equal(df.index, pd.Index([], dtype='object')) + df['foo'] = range(len(df)) + return df + + expected = DataFrame(columns=['foo'], + index=pd.Index([], dtype='int64')) + expected['foo'] = expected['foo'].astype('float64') + tm.assert_frame_equal(f(), expected) + + df = DataFrame() + tm.assert_index_equal(df.columns, pd.Index([], dtype=object)) + df2 = DataFrame() + df2[1] = Series([1], index=['foo']) + df.loc[:, 1] = Series([1], index=['foo']) + tm.assert_frame_equal(df, DataFrame([[1]], index=['foo'], columns=[1])) + tm.assert_frame_equal(df, df2) + + # no index to start + expected = DataFrame({0: Series(1, index=range(4))}, + columns=['A', 'B', 0]) + + df = DataFrame(columns=['A', 'B']) + df[0] = Series(1, index=range(4)) + df.dtypes + str(df) + tm.assert_frame_equal(df, expected) + + df = DataFrame(columns=['A', 'B']) + df.loc[:, 0] = Series(1, index=range(4)) + df.dtypes + str(df) + tm.assert_frame_equal(df, expected) + + def test_partial_set_empty_frame_row(self): + # GH5720, GH5744 + # don't create rows when empty + expected = DataFrame(columns=['A', 'B', 'New'], + index=pd.Index([], dtype='int64')) + expected['A'] = expected['A'].astype('int64') + expected['B'] = expected['B'].astype('float64') + expected['New'] = expected['New'].astype('float64') + + df = DataFrame({"A": [1, 2, 3], "B": [1.2, 4.2, 5.2]}) + y = df[df.A > 5] + y['New'] = np.nan + tm.assert_frame_equal(y, expected) + # tm.assert_frame_equal(y,expected) + + expected = DataFrame(columns=['a', 'b', 'c c', 'd']) + expected['d'] = expected['d'].astype('int64') + df = DataFrame(columns=['a', 'b', 'c c']) 
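+        # assigning a scalar to a new column on an empty frame adds the
+        # column but creates no rows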
+        df['d'] = 3
+        tm.assert_frame_equal(df, expected)
+        tm.assert_series_equal(df['c c'], Series(name='c c', dtype=object))
+
+        # reindex columns is ok
+        df = DataFrame({"A": [1, 2, 3], "B": [1.2, 4.2, 5.2]})
+        y = df[df.A > 5]
+        result = y.reindex(columns=['A', 'B', 'C'])
+        expected = DataFrame(columns=['A', 'B', 'C'],
+                             index=pd.Index([], dtype='int64'))
+        expected['A'] = expected['A'].astype('int64')
+        expected['B'] = expected['B'].astype('float64')
+        expected['C'] = expected['C'].astype('float64')
+        tm.assert_frame_equal(result, expected)
+
+    def test_partial_set_empty_frame_set_series(self):
+        # GH 5756
+        # setting with empty Series
+        df = DataFrame(Series())
+        tm.assert_frame_equal(df, DataFrame({0: Series()}))
+
+        df = DataFrame(Series(name='foo'))
+        tm.assert_frame_equal(df, DataFrame({'foo': Series()}))
+
+    def test_partial_set_empty_frame_empty_copy_assignment(self):
+        # GH 5932
+        # copy on empty with assignment fails
+        df = DataFrame(index=[0])
+        df = df.copy()
+        df['a'] = 0
+        expected = DataFrame(0, index=[0], columns=['a'])
+        tm.assert_frame_equal(df, expected)
+
+    def test_partial_set_empty_frame_empty_consistencies(self):
+        # GH 6171
+        # consistency on empty frames
+        df = DataFrame(columns=['x', 'y'])
+        df['x'] = [1, 2]
+        expected = DataFrame(dict(x=[1, 2], y=[np.nan, np.nan]))
+        tm.assert_frame_equal(df, expected, check_dtype=False)
+
+        df = DataFrame(columns=['x', 'y'])
+        df['x'] = ['1', '2']
+        expected = DataFrame(
+            dict(x=['1', '2'], y=[np.nan, np.nan]), dtype=object)
+        tm.assert_frame_equal(df, expected)
+
+        df = DataFrame(columns=['x', 'y'])
+        df.loc[0, 'x'] = 1
+        expected = DataFrame(dict(x=[1], y=[np.nan]))
+        tm.assert_frame_equal(df, expected, check_dtype=False)
diff --git a/pandas/tests/indexing/test_scalar.py b/pandas/tests/indexing/test_scalar.py
new file mode 100644
index 0000000000000..5dd1714b903eb
--- /dev/null
+++ b/pandas/tests/indexing/test_scalar.py
@@ -0,0 +1,173 @@
+""" test scalar indexing, including at and iat """
+
+import pytest
+
+import numpy as np
+
+from pandas import (Series, DataFrame, Timestamp,
+                    Timedelta, date_range)
+from pandas.util import testing as tm
+from pandas.tests.indexing.common import Base
+
+
+class TestScalar(Base):
+
+    def test_at_and_iat_get(self):
+        def _check(f, func, values=False):
+
+            if f is not None:
+                indices = self.generate_indices(f, values)
+                for i in indices:
+                    result = getattr(f, func)[i]
+                    expected = self.get_value(f, i, values)
+                    tm.assert_almost_equal(result, expected)
+
+        for o in self._objs:
+
+            d = getattr(self, o)
+
+            # iat
+            for f in [d['ints'], d['uints']]:
+                _check(f, 'iat', values=True)
+
+            for f in [d['labels'], d['ts'], d['floats']]:
+                if f is not None:
+                    pytest.raises(ValueError, self.check_values, f, 'iat')
+
+            # at
+            for f in [d['ints'], d['uints'], d['labels'],
+                      d['ts'], d['floats']]:
+                _check(f, 'at')
+
+    def test_at_and_iat_set(self):
+        def _check(f, func, values=False):
+
+            if f is not None:
+                indices = self.generate_indices(f, values)
+                for i in indices:
+                    getattr(f, func)[i] = 1
+                    expected = self.get_value(f, i, values)
+                    tm.assert_almost_equal(expected, 1)
+
+        for t in self._objs:
+
+            d = getattr(self, t)
+
+            # iat
+            for f in [d['ints'], d['uints']]:
+                _check(f, 'iat', values=True)
+
+            for f in [d['labels'], d['ts'], d['floats']]:
+                if f is not None:
+                    pytest.raises(ValueError, _check, f, 'iat')
+
+            # at
+            for f in [d['ints'], d['uints'], d['labels'],
+                      d['ts'], d['floats']]:
+                _check(f, 'at')
+
+    def test_at_iat_coercion(self):
+
+        # as timestamp is not a tuple!
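+        # (a Timestamp key is a single label, not a tuple of per-axis
+        # keys, so .at should not try to unpack it)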
+ dates = date_range('1/1/2000', periods=8) + df = DataFrame(np.random.randn(8, 4), + index=dates, + columns=['A', 'B', 'C', 'D']) + s = df['A'] + + result = s.at[dates[5]] + xp = s.values[5] + assert result == xp + + # GH 7729 + # make sure we are boxing the returns + s = Series(['2014-01-01', '2014-02-02'], dtype='datetime64[ns]') + expected = Timestamp('2014-02-02') + + for r in [lambda: s.iat[1], lambda: s.iloc[1]]: + result = r() + assert result == expected + + s = Series(['1 days', '2 days'], dtype='timedelta64[ns]') + expected = Timedelta('2 days') + + for r in [lambda: s.iat[1], lambda: s.iloc[1]]: + result = r() + assert result == expected + + def test_iat_invalid_args(self): + pass + + def test_imethods_with_dups(self): + + # GH6493 + # iat/iloc with dups + + s = Series(range(5), index=[1, 1, 2, 2, 3], dtype='int64') + result = s.iloc[2] + assert result == 2 + result = s.iat[2] + assert result == 2 + + pytest.raises(IndexError, lambda: s.iat[10]) + pytest.raises(IndexError, lambda: s.iat[-10]) + + result = s.iloc[[2, 3]] + expected = Series([2, 3], [2, 2], dtype='int64') + tm.assert_series_equal(result, expected) + + df = s.to_frame() + result = df.iloc[2] + expected = Series(2, index=[0], name=2) + tm.assert_series_equal(result, expected) + + result = df.iat[2, 0] + expected = 2 + assert result == 2 + + def test_at_to_fail(self): + # at should not fallback + # GH 7814 + s = Series([1, 2, 3], index=list('abc')) + result = s.at['a'] + assert result == 1 + pytest.raises(ValueError, lambda: s.at[0]) + + df = DataFrame({'A': [1, 2, 3]}, index=list('abc')) + result = df.at['a', 'A'] + assert result == 1 + pytest.raises(ValueError, lambda: df.at['a', 0]) + + s = Series([1, 2, 3], index=[3, 2, 1]) + result = s.at[1] + assert result == 3 + pytest.raises(ValueError, lambda: s.at['a']) + + df = DataFrame({0: [1, 2, 3]}, index=[3, 2, 1]) + result = df.at[1, 0] + assert result == 3 + pytest.raises(ValueError, lambda: df.at['a', 0]) + + # GH 13822, incorrect error string with non-unique columns when missing + # column is accessed + df = DataFrame({'x': [1.], 'y': [2.], 'z': [3.]}) + df.columns = ['x', 'x', 'z'] + + # Check that we get the correct value in the KeyError + tm.assert_raises_regex(KeyError, r"\['y'\] not in index", + lambda: df[['x', 'y', 'z']]) + + def test_at_with_tz(self): + # gh-15822 + df = DataFrame({'name': ['John', 'Anderson'], + 'date': [Timestamp(2017, 3, 13, 13, 32, 56), + Timestamp(2017, 2, 16, 12, 10, 3)]}) + df['date'] = df['date'].dt.tz_localize('Asia/Shanghai') + + expected = Timestamp('2017-03-13 13:32:56+0800', tz='Asia/Shanghai') + + result = df.loc[0, 'date'] + assert result == expected + + result = df.at[0, 'date'] + assert result == expected diff --git a/pandas/tests/indexing/test_timedelta.py b/pandas/tests/indexing/test_timedelta.py new file mode 100644 index 0000000000000..cf8cc6c2d345d --- /dev/null +++ b/pandas/tests/indexing/test_timedelta.py @@ -0,0 +1,20 @@ +import pandas as pd +from pandas.util import testing as tm + + +class TestTimedeltaIndexing(object): + + def test_boolean_indexing(self): + # GH 14946 + df = pd.DataFrame({'x': range(10)}) + df.index = pd.to_timedelta(range(10), unit='s') + conditions = [df['x'] > 3, df['x'] == 3, df['x'] < 3] + expected_data = [[0, 1, 2, 3, 10, 10, 10, 10, 10, 10], + [0, 1, 2, 10, 4, 5, 6, 7, 8, 9], + [10, 10, 10, 3, 4, 5, 6, 7, 8, 9]] + for cond, data in zip(conditions, expected_data): + result = df.assign(x=df.mask(cond, 10).astype('int64')) + expected = pd.DataFrame(data, + index=pd.to_timedelta(range(10), 
unit='s'), + columns=['x']) + tm.assert_frame_equal(expected, result) diff --git a/pandas/tests/io/__init__.py b/pandas/tests/io/__init__.py new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/pandas/io/tests/data/S4_EDUC1.dta b/pandas/tests/io/data/S4_EDUC1.dta similarity index 100% rename from pandas/io/tests/data/S4_EDUC1.dta rename to pandas/tests/io/data/S4_EDUC1.dta diff --git a/pandas/io/tests/data/banklist.csv b/pandas/tests/io/data/banklist.csv similarity index 100% rename from pandas/io/tests/data/banklist.csv rename to pandas/tests/io/data/banklist.csv diff --git a/pandas/io/tests/data/banklist.html b/pandas/tests/io/data/banklist.html similarity index 100% rename from pandas/io/tests/data/banklist.html rename to pandas/tests/io/data/banklist.html diff --git a/pandas/io/tests/data/blank.xls b/pandas/tests/io/data/blank.xls old mode 100755 new mode 100644 similarity index 100% rename from pandas/io/tests/data/blank.xls rename to pandas/tests/io/data/blank.xls diff --git a/pandas/io/tests/data/blank.xlsm b/pandas/tests/io/data/blank.xlsm old mode 100755 new mode 100644 similarity index 100% rename from pandas/io/tests/data/blank.xlsm rename to pandas/tests/io/data/blank.xlsm diff --git a/pandas/io/tests/data/blank.xlsx b/pandas/tests/io/data/blank.xlsx old mode 100755 new mode 100644 similarity index 100% rename from pandas/io/tests/data/blank.xlsx rename to pandas/tests/io/data/blank.xlsx diff --git a/pandas/io/tests/data/blank_with_header.xls b/pandas/tests/io/data/blank_with_header.xls old mode 100755 new mode 100644 similarity index 100% rename from pandas/io/tests/data/blank_with_header.xls rename to pandas/tests/io/data/blank_with_header.xls diff --git a/pandas/io/tests/data/blank_with_header.xlsm b/pandas/tests/io/data/blank_with_header.xlsm old mode 100755 new mode 100644 similarity index 100% rename from pandas/io/tests/data/blank_with_header.xlsm rename to pandas/tests/io/data/blank_with_header.xlsm diff --git a/pandas/io/tests/data/blank_with_header.xlsx b/pandas/tests/io/data/blank_with_header.xlsx old mode 100755 new mode 100644 similarity index 100% rename from pandas/io/tests/data/blank_with_header.xlsx rename to pandas/tests/io/data/blank_with_header.xlsx diff --git a/pandas/io/tests/data/categorical_0_14_1.pickle b/pandas/tests/io/data/categorical_0_14_1.pickle similarity index 100% rename from pandas/io/tests/data/categorical_0_14_1.pickle rename to pandas/tests/io/data/categorical_0_14_1.pickle diff --git a/pandas/io/tests/data/categorical_0_15_2.pickle b/pandas/tests/io/data/categorical_0_15_2.pickle similarity index 100% rename from pandas/io/tests/data/categorical_0_15_2.pickle rename to pandas/tests/io/data/categorical_0_15_2.pickle diff --git a/pandas/io/tests/data/computer_sales_page.html b/pandas/tests/io/data/computer_sales_page.html similarity index 100% rename from pandas/io/tests/data/computer_sales_page.html rename to pandas/tests/io/data/computer_sales_page.html diff --git a/pandas/io/tests/data/gbq_fake_job.txt b/pandas/tests/io/data/gbq_fake_job.txt similarity index 100% rename from pandas/io/tests/data/gbq_fake_job.txt rename to pandas/tests/io/data/gbq_fake_job.txt diff --git a/pandas/io/tests/data/html_encoding/chinese_utf-16.html b/pandas/tests/io/data/html_encoding/chinese_utf-16.html similarity index 100% rename from pandas/io/tests/data/html_encoding/chinese_utf-16.html rename to pandas/tests/io/data/html_encoding/chinese_utf-16.html diff --git a/pandas/io/tests/data/html_encoding/chinese_utf-32.html 
b/pandas/tests/io/data/html_encoding/chinese_utf-32.html similarity index 100% rename from pandas/io/tests/data/html_encoding/chinese_utf-32.html rename to pandas/tests/io/data/html_encoding/chinese_utf-32.html diff --git a/pandas/io/tests/data/html_encoding/chinese_utf-8.html b/pandas/tests/io/data/html_encoding/chinese_utf-8.html similarity index 100% rename from pandas/io/tests/data/html_encoding/chinese_utf-8.html rename to pandas/tests/io/data/html_encoding/chinese_utf-8.html diff --git a/pandas/io/tests/data/html_encoding/letz_latin1.html b/pandas/tests/io/data/html_encoding/letz_latin1.html similarity index 100% rename from pandas/io/tests/data/html_encoding/letz_latin1.html rename to pandas/tests/io/data/html_encoding/letz_latin1.html diff --git a/pandas/io/tests/data/iris.csv b/pandas/tests/io/data/iris.csv similarity index 100% rename from pandas/io/tests/data/iris.csv rename to pandas/tests/io/data/iris.csv diff --git a/pandas/io/tests/data/legacy_hdf/datetimetz_object.h5 b/pandas/tests/io/data/legacy_hdf/datetimetz_object.h5 similarity index 100% rename from pandas/io/tests/data/legacy_hdf/datetimetz_object.h5 rename to pandas/tests/io/data/legacy_hdf/datetimetz_object.h5 diff --git a/pandas/io/tests/data/legacy_hdf/legacy_0.10.h5 b/pandas/tests/io/data/legacy_hdf/legacy_0.10.h5 similarity index 100% rename from pandas/io/tests/data/legacy_hdf/legacy_0.10.h5 rename to pandas/tests/io/data/legacy_hdf/legacy_0.10.h5 diff --git a/pandas/io/tests/data/legacy_hdf/legacy_table.h5 b/pandas/tests/io/data/legacy_hdf/legacy_table.h5 similarity index 100% rename from pandas/io/tests/data/legacy_hdf/legacy_table.h5 rename to pandas/tests/io/data/legacy_hdf/legacy_table.h5 diff --git a/pandas/io/tests/data/legacy_hdf/legacy_table_0.11.h5 b/pandas/tests/io/data/legacy_hdf/legacy_table_0.11.h5 similarity index 100% rename from pandas/io/tests/data/legacy_hdf/legacy_table_0.11.h5 rename to pandas/tests/io/data/legacy_hdf/legacy_table_0.11.h5 diff --git a/pandas/io/tests/data/legacy_hdf/pytables_native.h5 b/pandas/tests/io/data/legacy_hdf/pytables_native.h5 similarity index 100% rename from pandas/io/tests/data/legacy_hdf/pytables_native.h5 rename to pandas/tests/io/data/legacy_hdf/pytables_native.h5 diff --git a/pandas/io/tests/data/legacy_hdf/pytables_native2.h5 b/pandas/tests/io/data/legacy_hdf/pytables_native2.h5 similarity index 100% rename from pandas/io/tests/data/legacy_hdf/pytables_native2.h5 rename to pandas/tests/io/data/legacy_hdf/pytables_native2.h5 diff --git a/pandas/io/tests/data/legacy_msgpack/0.16.0/0.16.0_x86_64_darwin_2.7.9.msgpack b/pandas/tests/io/data/legacy_msgpack/0.16.0/0.16.0_x86_64_darwin_2.7.9.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.16.0/0.16.0_x86_64_darwin_2.7.9.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.16.0/0.16.0_x86_64_darwin_2.7.9.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_AMD64_windows_2.7.10.msgpack b/pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_AMD64_windows_2.7.10.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_AMD64_windows_2.7.10.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_AMD64_windows_2.7.10.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_AMD64_windows_3.4.3.msgpack b/pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_AMD64_windows_3.4.3.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_AMD64_windows_3.4.3.msgpack rename to 
pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_AMD64_windows_3.4.3.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_x86_64_darwin_2.7.10.msgpack b/pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_x86_64_darwin_2.7.10.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_x86_64_darwin_2.7.10.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_x86_64_darwin_2.7.10.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_x86_64_darwin_2.7.9.msgpack b/pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_x86_64_darwin_2.7.9.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_x86_64_darwin_2.7.9.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_x86_64_darwin_2.7.9.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_x86_64_darwin_3.4.3.msgpack b/pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_x86_64_darwin_3.4.3.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_x86_64_darwin_3.4.3.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_x86_64_darwin_3.4.3.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_x86_64_linux_2.7.10.msgpack b/pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_x86_64_linux_2.7.10.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_x86_64_linux_2.7.10.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_x86_64_linux_2.7.10.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_x86_64_linux_3.4.3.msgpack b/pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_x86_64_linux_3.4.3.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.16.2/0.16.2_x86_64_linux_3.4.3.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.16.2/0.16.2_x86_64_linux_3.4.3.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.0_AMD64_windows_2.7.11.msgpack b/pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.0_AMD64_windows_2.7.11.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.0_AMD64_windows_2.7.11.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.0_AMD64_windows_2.7.11.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.0_AMD64_windows_3.4.4.msgpack b/pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.0_AMD64_windows_3.4.4.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.0_AMD64_windows_3.4.4.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.0_AMD64_windows_3.4.4.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.0_x86_64_darwin_2.7.11.msgpack b/pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.0_x86_64_darwin_2.7.11.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.0_x86_64_darwin_2.7.11.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.0_x86_64_darwin_2.7.11.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.0_x86_64_darwin_3.4.4.msgpack b/pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.0_x86_64_darwin_3.4.4.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.0_x86_64_darwin_3.4.4.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.0_x86_64_darwin_3.4.4.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.0_x86_64_linux_2.7.11.msgpack 
b/pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.0_x86_64_linux_2.7.11.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.0_x86_64_linux_2.7.11.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.0_x86_64_linux_2.7.11.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.0_x86_64_linux_3.4.4.msgpack b/pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.0_x86_64_linux_3.4.4.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.0_x86_64_linux_3.4.4.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.0_x86_64_linux_3.4.4.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.1_AMD64_windows_2.7.11.msgpack b/pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.1_AMD64_windows_2.7.11.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.1_AMD64_windows_2.7.11.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.1_AMD64_windows_2.7.11.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.1_AMD64_windows_3.5.1.msgpack b/pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.1_AMD64_windows_3.5.1.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.0/0.17.1_AMD64_windows_3.5.1.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.17.0/0.17.1_AMD64_windows_3.5.1.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.1/0.17.1_AMD64_windows_2.7.11.msgpack b/pandas/tests/io/data/legacy_msgpack/0.17.1/0.17.1_AMD64_windows_2.7.11.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.1/0.17.1_AMD64_windows_2.7.11.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.17.1/0.17.1_AMD64_windows_2.7.11.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.1/0.17.1_AMD64_windows_3.5.1.msgpack b/pandas/tests/io/data/legacy_msgpack/0.17.1/0.17.1_AMD64_windows_3.5.1.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.1/0.17.1_AMD64_windows_3.5.1.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.17.1/0.17.1_AMD64_windows_3.5.1.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.1/0.17.1_x86_64_darwin_2.7.11.msgpack b/pandas/tests/io/data/legacy_msgpack/0.17.1/0.17.1_x86_64_darwin_2.7.11.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.1/0.17.1_x86_64_darwin_2.7.11.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.17.1/0.17.1_x86_64_darwin_2.7.11.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.1/0.17.1_x86_64_darwin_3.5.1.msgpack b/pandas/tests/io/data/legacy_msgpack/0.17.1/0.17.1_x86_64_darwin_3.5.1.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.1/0.17.1_x86_64_darwin_3.5.1.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.17.1/0.17.1_x86_64_darwin_3.5.1.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.1/0.17.1_x86_64_linux_2.7.11.msgpack b/pandas/tests/io/data/legacy_msgpack/0.17.1/0.17.1_x86_64_linux_2.7.11.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.1/0.17.1_x86_64_linux_2.7.11.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.17.1/0.17.1_x86_64_linux_2.7.11.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.17.1/0.17.1_x86_64_linux_3.4.4.msgpack b/pandas/tests/io/data/legacy_msgpack/0.17.1/0.17.1_x86_64_linux_3.4.4.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.17.1/0.17.1_x86_64_linux_3.4.4.msgpack rename to 
pandas/tests/io/data/legacy_msgpack/0.17.1/0.17.1_x86_64_linux_3.4.4.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.18.0/0.18.0_AMD64_windows_2.7.11.msgpack b/pandas/tests/io/data/legacy_msgpack/0.18.0/0.18.0_AMD64_windows_2.7.11.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.18.0/0.18.0_AMD64_windows_2.7.11.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.18.0/0.18.0_AMD64_windows_2.7.11.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.18.0/0.18.0_AMD64_windows_3.5.1.msgpack b/pandas/tests/io/data/legacy_msgpack/0.18.0/0.18.0_AMD64_windows_3.5.1.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.18.0/0.18.0_AMD64_windows_3.5.1.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.18.0/0.18.0_AMD64_windows_3.5.1.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.18.0/0.18.0_x86_64_darwin_2.7.11.msgpack b/pandas/tests/io/data/legacy_msgpack/0.18.0/0.18.0_x86_64_darwin_2.7.11.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.18.0/0.18.0_x86_64_darwin_2.7.11.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.18.0/0.18.0_x86_64_darwin_2.7.11.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.18.0/0.18.0_x86_64_darwin_3.5.1.msgpack b/pandas/tests/io/data/legacy_msgpack/0.18.0/0.18.0_x86_64_darwin_3.5.1.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.18.0/0.18.0_x86_64_darwin_3.5.1.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.18.0/0.18.0_x86_64_darwin_3.5.1.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.18.1/0.18.1_x86_64_darwin_2.7.12.msgpack b/pandas/tests/io/data/legacy_msgpack/0.18.1/0.18.1_x86_64_darwin_2.7.12.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.18.1/0.18.1_x86_64_darwin_2.7.12.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.18.1/0.18.1_x86_64_darwin_2.7.12.msgpack diff --git a/pandas/io/tests/data/legacy_msgpack/0.18.1/0.18.1_x86_64_darwin_3.5.2.msgpack b/pandas/tests/io/data/legacy_msgpack/0.18.1/0.18.1_x86_64_darwin_3.5.2.msgpack similarity index 100% rename from pandas/io/tests/data/legacy_msgpack/0.18.1/0.18.1_x86_64_darwin_3.5.2.msgpack rename to pandas/tests/io/data/legacy_msgpack/0.18.1/0.18.1_x86_64_darwin_3.5.2.msgpack diff --git a/pandas/tests/io/data/legacy_msgpack/0.19.2/0.19.2_x86_64_darwin_2.7.12.msgpack b/pandas/tests/io/data/legacy_msgpack/0.19.2/0.19.2_x86_64_darwin_2.7.12.msgpack new file mode 100644 index 0000000000000..f2dc38766025e Binary files /dev/null and b/pandas/tests/io/data/legacy_msgpack/0.19.2/0.19.2_x86_64_darwin_2.7.12.msgpack differ diff --git a/pandas/tests/io/data/legacy_msgpack/0.19.2/0.19.2_x86_64_darwin_3.6.1.msgpack b/pandas/tests/io/data/legacy_msgpack/0.19.2/0.19.2_x86_64_darwin_3.6.1.msgpack new file mode 100644 index 0000000000000..4137629f53cf2 Binary files /dev/null and b/pandas/tests/io/data/legacy_msgpack/0.19.2/0.19.2_x86_64_darwin_3.6.1.msgpack differ diff --git a/pandas/io/tests/data/legacy_pickle/0.10.1/AMD64_windows_2.7.3.pickle b/pandas/tests/io/data/legacy_pickle/0.10.1/AMD64_windows_2.7.3.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.10.1/AMD64_windows_2.7.3.pickle rename to pandas/tests/io/data/legacy_pickle/0.10.1/AMD64_windows_2.7.3.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.10.1/x86_64_linux_2.7.3.pickle b/pandas/tests/io/data/legacy_pickle/0.10.1/x86_64_linux_2.7.3.pickle similarity index 100% rename from 
pandas/io/tests/data/legacy_pickle/0.10.1/x86_64_linux_2.7.3.pickle rename to pandas/tests/io/data/legacy_pickle/0.10.1/x86_64_linux_2.7.3.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.11.0/0.11.0_x86_64_linux_3.3.0.pickle b/pandas/tests/io/data/legacy_pickle/0.11.0/0.11.0_x86_64_linux_3.3.0.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.11.0/0.11.0_x86_64_linux_3.3.0.pickle rename to pandas/tests/io/data/legacy_pickle/0.11.0/0.11.0_x86_64_linux_3.3.0.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.11.0/x86_64_linux_2.7.3.pickle b/pandas/tests/io/data/legacy_pickle/0.11.0/x86_64_linux_2.7.3.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.11.0/x86_64_linux_2.7.3.pickle rename to pandas/tests/io/data/legacy_pickle/0.11.0/x86_64_linux_2.7.3.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.11.0/x86_64_linux_3.3.0.pickle b/pandas/tests/io/data/legacy_pickle/0.11.0/x86_64_linux_3.3.0.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.11.0/x86_64_linux_3.3.0.pickle rename to pandas/tests/io/data/legacy_pickle/0.11.0/x86_64_linux_3.3.0.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.12.0/0.12.0_AMD64_windows_2.7.3.pickle b/pandas/tests/io/data/legacy_pickle/0.12.0/0.12.0_AMD64_windows_2.7.3.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.12.0/0.12.0_AMD64_windows_2.7.3.pickle rename to pandas/tests/io/data/legacy_pickle/0.12.0/0.12.0_AMD64_windows_2.7.3.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.12.0/0.12.0_x86_64_linux_2.7.3.pickle b/pandas/tests/io/data/legacy_pickle/0.12.0/0.12.0_x86_64_linux_2.7.3.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.12.0/0.12.0_x86_64_linux_2.7.3.pickle rename to pandas/tests/io/data/legacy_pickle/0.12.0/0.12.0_x86_64_linux_2.7.3.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_AMD64_windows_2.7.3.pickle b/pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_AMD64_windows_2.7.3.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_AMD64_windows_2.7.3.pickle rename to pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_AMD64_windows_2.7.3.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_i686_linux_2.6.5.pickle b/pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_i686_linux_2.6.5.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_i686_linux_2.6.5.pickle rename to pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_i686_linux_2.6.5.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_i686_linux_2.7.3.pickle b/pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_i686_linux_2.7.3.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_i686_linux_2.7.3.pickle rename to pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_i686_linux_2.7.3.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_i686_linux_3.2.3.pickle b/pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_i686_linux_3.2.3.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_i686_linux_3.2.3.pickle rename to pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_i686_linux_3.2.3.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_x86_64_darwin_2.7.5.pickle b/pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_x86_64_darwin_2.7.5.pickle similarity index 100% rename from 
pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_x86_64_darwin_2.7.5.pickle rename to pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_x86_64_darwin_2.7.5.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_x86_64_darwin_2.7.6.pickle b/pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_x86_64_darwin_2.7.6.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_x86_64_darwin_2.7.6.pickle rename to pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_x86_64_darwin_2.7.6.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_x86_64_linux_2.7.3.pickle b/pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_x86_64_linux_2.7.3.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_x86_64_linux_2.7.3.pickle rename to pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_x86_64_linux_2.7.3.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_x86_64_linux_2.7.8.pickle b/pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_x86_64_linux_2.7.8.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_x86_64_linux_2.7.8.pickle rename to pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_x86_64_linux_2.7.8.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_x86_64_linux_3.3.0.pickle b/pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_x86_64_linux_3.3.0.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.13.0/0.13.0_x86_64_linux_3.3.0.pickle rename to pandas/tests/io/data/legacy_pickle/0.13.0/0.13.0_x86_64_linux_3.3.0.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.14.0/0.14.0_x86_64_darwin_2.7.6.pickle b/pandas/tests/io/data/legacy_pickle/0.14.0/0.14.0_x86_64_darwin_2.7.6.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.14.0/0.14.0_x86_64_darwin_2.7.6.pickle rename to pandas/tests/io/data/legacy_pickle/0.14.0/0.14.0_x86_64_darwin_2.7.6.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.14.0/0.14.0_x86_64_linux_2.7.8.pickle b/pandas/tests/io/data/legacy_pickle/0.14.0/0.14.0_x86_64_linux_2.7.8.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.14.0/0.14.0_x86_64_linux_2.7.8.pickle rename to pandas/tests/io/data/legacy_pickle/0.14.0/0.14.0_x86_64_linux_2.7.8.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.14.1/0.14.1_x86_64_darwin_2.7.12.pickle b/pandas/tests/io/data/legacy_pickle/0.14.1/0.14.1_x86_64_darwin_2.7.12.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.14.1/0.14.1_x86_64_darwin_2.7.12.pickle rename to pandas/tests/io/data/legacy_pickle/0.14.1/0.14.1_x86_64_darwin_2.7.12.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.14.1/0.14.1_x86_64_linux_2.7.8.pickle b/pandas/tests/io/data/legacy_pickle/0.14.1/0.14.1_x86_64_linux_2.7.8.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.14.1/0.14.1_x86_64_linux_2.7.8.pickle rename to pandas/tests/io/data/legacy_pickle/0.14.1/0.14.1_x86_64_linux_2.7.8.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.15.0/0.15.0_x86_64_darwin_2.7.12.pickle b/pandas/tests/io/data/legacy_pickle/0.15.0/0.15.0_x86_64_darwin_2.7.12.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.15.0/0.15.0_x86_64_darwin_2.7.12.pickle rename to pandas/tests/io/data/legacy_pickle/0.15.0/0.15.0_x86_64_darwin_2.7.12.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.15.0/0.15.0_x86_64_linux_2.7.8.pickle 
b/pandas/tests/io/data/legacy_pickle/0.15.0/0.15.0_x86_64_linux_2.7.8.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.15.0/0.15.0_x86_64_linux_2.7.8.pickle rename to pandas/tests/io/data/legacy_pickle/0.15.0/0.15.0_x86_64_linux_2.7.8.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.15.2/0.15.2_x86_64_darwin_2.7.9.pickle b/pandas/tests/io/data/legacy_pickle/0.15.2/0.15.2_x86_64_darwin_2.7.9.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.15.2/0.15.2_x86_64_darwin_2.7.9.pickle rename to pandas/tests/io/data/legacy_pickle/0.15.2/0.15.2_x86_64_darwin_2.7.9.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.16.0/0.16.0_x86_64_darwin_2.7.9.pickle b/pandas/tests/io/data/legacy_pickle/0.16.0/0.16.0_x86_64_darwin_2.7.9.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.16.0/0.16.0_x86_64_darwin_2.7.9.pickle rename to pandas/tests/io/data/legacy_pickle/0.16.0/0.16.0_x86_64_darwin_2.7.9.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_AMD64_windows_2.7.10.pickle b/pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_AMD64_windows_2.7.10.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_AMD64_windows_2.7.10.pickle rename to pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_AMD64_windows_2.7.10.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_AMD64_windows_3.4.3.pickle b/pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_AMD64_windows_3.4.3.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_AMD64_windows_3.4.3.pickle rename to pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_AMD64_windows_3.4.3.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_x86_64_darwin_2.7.10.pickle b/pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_x86_64_darwin_2.7.10.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_x86_64_darwin_2.7.10.pickle rename to pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_x86_64_darwin_2.7.10.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_x86_64_darwin_2.7.9.pickle b/pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_x86_64_darwin_2.7.9.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_x86_64_darwin_2.7.9.pickle rename to pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_x86_64_darwin_2.7.9.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_x86_64_darwin_3.4.3.pickle b/pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_x86_64_darwin_3.4.3.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_x86_64_darwin_3.4.3.pickle rename to pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_x86_64_darwin_3.4.3.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_x86_64_linux_2.7.10.pickle b/pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_x86_64_linux_2.7.10.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_x86_64_linux_2.7.10.pickle rename to pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_x86_64_linux_2.7.10.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_x86_64_linux_3.4.3.pickle b/pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_x86_64_linux_3.4.3.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.16.2/0.16.2_x86_64_linux_3.4.3.pickle rename to pandas/tests/io/data/legacy_pickle/0.16.2/0.16.2_x86_64_linux_3.4.3.pickle diff --git 
a/pandas/io/tests/data/legacy_pickle/0.17.0/0.17.0_AMD64_windows_2.7.11.pickle b/pandas/tests/io/data/legacy_pickle/0.17.0/0.17.0_AMD64_windows_2.7.11.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.17.0/0.17.0_AMD64_windows_2.7.11.pickle rename to pandas/tests/io/data/legacy_pickle/0.17.0/0.17.0_AMD64_windows_2.7.11.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.17.0/0.17.0_AMD64_windows_3.4.4.pickle b/pandas/tests/io/data/legacy_pickle/0.17.0/0.17.0_AMD64_windows_3.4.4.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.17.0/0.17.0_AMD64_windows_3.4.4.pickle rename to pandas/tests/io/data/legacy_pickle/0.17.0/0.17.0_AMD64_windows_3.4.4.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.17.0/0.17.0_x86_64_darwin_2.7.11.pickle b/pandas/tests/io/data/legacy_pickle/0.17.0/0.17.0_x86_64_darwin_2.7.11.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.17.0/0.17.0_x86_64_darwin_2.7.11.pickle rename to pandas/tests/io/data/legacy_pickle/0.17.0/0.17.0_x86_64_darwin_2.7.11.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.17.0/0.17.0_x86_64_darwin_3.4.4.pickle b/pandas/tests/io/data/legacy_pickle/0.17.0/0.17.0_x86_64_darwin_3.4.4.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.17.0/0.17.0_x86_64_darwin_3.4.4.pickle rename to pandas/tests/io/data/legacy_pickle/0.17.0/0.17.0_x86_64_darwin_3.4.4.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.17.0/0.17.0_x86_64_linux_2.7.11.pickle b/pandas/tests/io/data/legacy_pickle/0.17.0/0.17.0_x86_64_linux_2.7.11.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.17.0/0.17.0_x86_64_linux_2.7.11.pickle rename to pandas/tests/io/data/legacy_pickle/0.17.0/0.17.0_x86_64_linux_2.7.11.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.17.0/0.17.0_x86_64_linux_3.4.4.pickle b/pandas/tests/io/data/legacy_pickle/0.17.0/0.17.0_x86_64_linux_3.4.4.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.17.0/0.17.0_x86_64_linux_3.4.4.pickle rename to pandas/tests/io/data/legacy_pickle/0.17.0/0.17.0_x86_64_linux_3.4.4.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.17.0/0.17.1_AMD64_windows_2.7.11.pickle b/pandas/tests/io/data/legacy_pickle/0.17.0/0.17.1_AMD64_windows_2.7.11.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.17.0/0.17.1_AMD64_windows_2.7.11.pickle rename to pandas/tests/io/data/legacy_pickle/0.17.0/0.17.1_AMD64_windows_2.7.11.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.17.1/0.17.1_AMD64_windows_2.7.11.pickle b/pandas/tests/io/data/legacy_pickle/0.17.1/0.17.1_AMD64_windows_2.7.11.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.17.1/0.17.1_AMD64_windows_2.7.11.pickle rename to pandas/tests/io/data/legacy_pickle/0.17.1/0.17.1_AMD64_windows_2.7.11.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.17.1/0.17.1_x86_64_darwin_2.7.11.pickle b/pandas/tests/io/data/legacy_pickle/0.17.1/0.17.1_x86_64_darwin_2.7.11.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.17.1/0.17.1_x86_64_darwin_2.7.11.pickle rename to pandas/tests/io/data/legacy_pickle/0.17.1/0.17.1_x86_64_darwin_2.7.11.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.18.0/0.18.0_AMD64_windows_2.7.11.pickle b/pandas/tests/io/data/legacy_pickle/0.18.0/0.18.0_AMD64_windows_2.7.11.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.18.0/0.18.0_AMD64_windows_2.7.11.pickle 
rename to pandas/tests/io/data/legacy_pickle/0.18.0/0.18.0_AMD64_windows_2.7.11.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.18.0/0.18.0_AMD64_windows_3.5.1.pickle b/pandas/tests/io/data/legacy_pickle/0.18.0/0.18.0_AMD64_windows_3.5.1.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.18.0/0.18.0_AMD64_windows_3.5.1.pickle rename to pandas/tests/io/data/legacy_pickle/0.18.0/0.18.0_AMD64_windows_3.5.1.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.18.0/0.18.0_x86_64_darwin_2.7.11.pickle b/pandas/tests/io/data/legacy_pickle/0.18.0/0.18.0_x86_64_darwin_2.7.11.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.18.0/0.18.0_x86_64_darwin_2.7.11.pickle rename to pandas/tests/io/data/legacy_pickle/0.18.0/0.18.0_x86_64_darwin_2.7.11.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.18.0/0.18.0_x86_64_darwin_3.5.1.pickle b/pandas/tests/io/data/legacy_pickle/0.18.0/0.18.0_x86_64_darwin_3.5.1.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.18.0/0.18.0_x86_64_darwin_3.5.1.pickle rename to pandas/tests/io/data/legacy_pickle/0.18.0/0.18.0_x86_64_darwin_3.5.1.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.18.1/0.18.1_x86_64_darwin_2.7.12.pickle b/pandas/tests/io/data/legacy_pickle/0.18.1/0.18.1_x86_64_darwin_2.7.12.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.18.1/0.18.1_x86_64_darwin_2.7.12.pickle rename to pandas/tests/io/data/legacy_pickle/0.18.1/0.18.1_x86_64_darwin_2.7.12.pickle diff --git a/pandas/io/tests/data/legacy_pickle/0.18.1/0.18.1_x86_64_darwin_3.5.2.pickle b/pandas/tests/io/data/legacy_pickle/0.18.1/0.18.1_x86_64_darwin_3.5.2.pickle similarity index 100% rename from pandas/io/tests/data/legacy_pickle/0.18.1/0.18.1_x86_64_darwin_3.5.2.pickle rename to pandas/tests/io/data/legacy_pickle/0.18.1/0.18.1_x86_64_darwin_3.5.2.pickle diff --git a/pandas/tests/io/data/legacy_pickle/0.19.2/0.19.2_x86_64_darwin_2.7.12.pickle b/pandas/tests/io/data/legacy_pickle/0.19.2/0.19.2_x86_64_darwin_2.7.12.pickle new file mode 100644 index 0000000000000..d702ab444df62 Binary files /dev/null and b/pandas/tests/io/data/legacy_pickle/0.19.2/0.19.2_x86_64_darwin_2.7.12.pickle differ diff --git a/pandas/tests/io/data/legacy_pickle/0.19.2/0.19.2_x86_64_darwin_3.6.1.pickle b/pandas/tests/io/data/legacy_pickle/0.19.2/0.19.2_x86_64_darwin_3.6.1.pickle new file mode 100644 index 0000000000000..6bb02672a4151 Binary files /dev/null and b/pandas/tests/io/data/legacy_pickle/0.19.2/0.19.2_x86_64_darwin_3.6.1.pickle differ diff --git a/pandas/io/tests/data/macau.html b/pandas/tests/io/data/macau.html similarity index 100% rename from pandas/io/tests/data/macau.html rename to pandas/tests/io/data/macau.html diff --git a/pandas/io/tests/data/nyse_wsj.html b/pandas/tests/io/data/nyse_wsj.html similarity index 100% rename from pandas/io/tests/data/nyse_wsj.html rename to pandas/tests/io/data/nyse_wsj.html diff --git a/pandas/io/tests/data/spam.html b/pandas/tests/io/data/spam.html similarity index 100% rename from pandas/io/tests/data/spam.html rename to pandas/tests/io/data/spam.html diff --git a/pandas/io/tests/data/stata10_115.dta b/pandas/tests/io/data/stata10_115.dta old mode 100755 new mode 100644 similarity index 100% rename from pandas/io/tests/data/stata10_115.dta rename to pandas/tests/io/data/stata10_115.dta diff --git a/pandas/io/tests/data/stata10_117.dta b/pandas/tests/io/data/stata10_117.dta old mode 100755 new mode 100644 similarity index 100% rename from 
pandas/io/tests/data/stata10_117.dta rename to pandas/tests/io/data/stata10_117.dta diff --git a/pandas/io/tests/data/stata11_115.dta b/pandas/tests/io/data/stata11_115.dta old mode 100755 new mode 100644 similarity index 100% rename from pandas/io/tests/data/stata11_115.dta rename to pandas/tests/io/data/stata11_115.dta diff --git a/pandas/io/tests/data/stata11_117.dta b/pandas/tests/io/data/stata11_117.dta old mode 100755 new mode 100644 similarity index 100% rename from pandas/io/tests/data/stata11_117.dta rename to pandas/tests/io/data/stata11_117.dta diff --git a/pandas/io/tests/data/stata12_117.dta b/pandas/tests/io/data/stata12_117.dta similarity index 100% rename from pandas/io/tests/data/stata12_117.dta rename to pandas/tests/io/data/stata12_117.dta diff --git a/pandas/io/tests/data/stata14_118.dta b/pandas/tests/io/data/stata14_118.dta similarity index 100% rename from pandas/io/tests/data/stata14_118.dta rename to pandas/tests/io/data/stata14_118.dta diff --git a/pandas/io/tests/data/stata15.dta b/pandas/tests/io/data/stata15.dta similarity index 100% rename from pandas/io/tests/data/stata15.dta rename to pandas/tests/io/data/stata15.dta diff --git a/pandas/io/tests/data/stata1_114.dta b/pandas/tests/io/data/stata1_114.dta similarity index 100% rename from pandas/io/tests/data/stata1_114.dta rename to pandas/tests/io/data/stata1_114.dta diff --git a/pandas/io/tests/data/stata1_117.dta b/pandas/tests/io/data/stata1_117.dta similarity index 100% rename from pandas/io/tests/data/stata1_117.dta rename to pandas/tests/io/data/stata1_117.dta diff --git a/pandas/io/tests/data/stata1_encoding.dta b/pandas/tests/io/data/stata1_encoding.dta similarity index 100% rename from pandas/io/tests/data/stata1_encoding.dta rename to pandas/tests/io/data/stata1_encoding.dta diff --git a/pandas/io/tests/data/stata2_113.dta b/pandas/tests/io/data/stata2_113.dta similarity index 100% rename from pandas/io/tests/data/stata2_113.dta rename to pandas/tests/io/data/stata2_113.dta diff --git a/pandas/io/tests/data/stata2_114.dta b/pandas/tests/io/data/stata2_114.dta similarity index 100% rename from pandas/io/tests/data/stata2_114.dta rename to pandas/tests/io/data/stata2_114.dta diff --git a/pandas/io/tests/data/stata2_115.dta b/pandas/tests/io/data/stata2_115.dta similarity index 100% rename from pandas/io/tests/data/stata2_115.dta rename to pandas/tests/io/data/stata2_115.dta diff --git a/pandas/io/tests/data/stata2_117.dta b/pandas/tests/io/data/stata2_117.dta similarity index 100% rename from pandas/io/tests/data/stata2_117.dta rename to pandas/tests/io/data/stata2_117.dta diff --git a/pandas/io/tests/data/stata3.csv b/pandas/tests/io/data/stata3.csv similarity index 100% rename from pandas/io/tests/data/stata3.csv rename to pandas/tests/io/data/stata3.csv diff --git a/pandas/io/tests/data/stata3_113.dta b/pandas/tests/io/data/stata3_113.dta similarity index 100% rename from pandas/io/tests/data/stata3_113.dta rename to pandas/tests/io/data/stata3_113.dta diff --git a/pandas/io/tests/data/stata3_114.dta b/pandas/tests/io/data/stata3_114.dta similarity index 100% rename from pandas/io/tests/data/stata3_114.dta rename to pandas/tests/io/data/stata3_114.dta diff --git a/pandas/io/tests/data/stata3_115.dta b/pandas/tests/io/data/stata3_115.dta similarity index 100% rename from pandas/io/tests/data/stata3_115.dta rename to pandas/tests/io/data/stata3_115.dta diff --git a/pandas/io/tests/data/stata3_117.dta b/pandas/tests/io/data/stata3_117.dta similarity index 100% rename from 
pandas/io/tests/data/stata3_117.dta rename to pandas/tests/io/data/stata3_117.dta diff --git a/pandas/io/tests/data/stata4_113.dta b/pandas/tests/io/data/stata4_113.dta similarity index 100% rename from pandas/io/tests/data/stata4_113.dta rename to pandas/tests/io/data/stata4_113.dta diff --git a/pandas/io/tests/data/stata4_114.dta b/pandas/tests/io/data/stata4_114.dta similarity index 100% rename from pandas/io/tests/data/stata4_114.dta rename to pandas/tests/io/data/stata4_114.dta diff --git a/pandas/io/tests/data/stata4_115.dta b/pandas/tests/io/data/stata4_115.dta similarity index 100% rename from pandas/io/tests/data/stata4_115.dta rename to pandas/tests/io/data/stata4_115.dta diff --git a/pandas/io/tests/data/stata4_117.dta b/pandas/tests/io/data/stata4_117.dta similarity index 100% rename from pandas/io/tests/data/stata4_117.dta rename to pandas/tests/io/data/stata4_117.dta diff --git a/pandas/io/tests/data/stata5.csv b/pandas/tests/io/data/stata5.csv similarity index 100% rename from pandas/io/tests/data/stata5.csv rename to pandas/tests/io/data/stata5.csv diff --git a/pandas/io/tests/data/stata5_113.dta b/pandas/tests/io/data/stata5_113.dta similarity index 100% rename from pandas/io/tests/data/stata5_113.dta rename to pandas/tests/io/data/stata5_113.dta diff --git a/pandas/io/tests/data/stata5_114.dta b/pandas/tests/io/data/stata5_114.dta similarity index 100% rename from pandas/io/tests/data/stata5_114.dta rename to pandas/tests/io/data/stata5_114.dta diff --git a/pandas/io/tests/data/stata5_115.dta b/pandas/tests/io/data/stata5_115.dta similarity index 100% rename from pandas/io/tests/data/stata5_115.dta rename to pandas/tests/io/data/stata5_115.dta diff --git a/pandas/io/tests/data/stata5_117.dta b/pandas/tests/io/data/stata5_117.dta similarity index 100% rename from pandas/io/tests/data/stata5_117.dta rename to pandas/tests/io/data/stata5_117.dta diff --git a/pandas/io/tests/data/stata6.csv b/pandas/tests/io/data/stata6.csv similarity index 100% rename from pandas/io/tests/data/stata6.csv rename to pandas/tests/io/data/stata6.csv diff --git a/pandas/io/tests/data/stata6_113.dta b/pandas/tests/io/data/stata6_113.dta similarity index 100% rename from pandas/io/tests/data/stata6_113.dta rename to pandas/tests/io/data/stata6_113.dta diff --git a/pandas/io/tests/data/stata6_114.dta b/pandas/tests/io/data/stata6_114.dta similarity index 100% rename from pandas/io/tests/data/stata6_114.dta rename to pandas/tests/io/data/stata6_114.dta diff --git a/pandas/io/tests/data/stata6_115.dta b/pandas/tests/io/data/stata6_115.dta similarity index 100% rename from pandas/io/tests/data/stata6_115.dta rename to pandas/tests/io/data/stata6_115.dta diff --git a/pandas/io/tests/data/stata6_117.dta b/pandas/tests/io/data/stata6_117.dta similarity index 100% rename from pandas/io/tests/data/stata6_117.dta rename to pandas/tests/io/data/stata6_117.dta diff --git a/pandas/io/tests/data/stata7_111.dta b/pandas/tests/io/data/stata7_111.dta similarity index 100% rename from pandas/io/tests/data/stata7_111.dta rename to pandas/tests/io/data/stata7_111.dta diff --git a/pandas/io/tests/data/stata7_115.dta b/pandas/tests/io/data/stata7_115.dta similarity index 100% rename from pandas/io/tests/data/stata7_115.dta rename to pandas/tests/io/data/stata7_115.dta diff --git a/pandas/io/tests/data/stata7_117.dta b/pandas/tests/io/data/stata7_117.dta similarity index 100% rename from pandas/io/tests/data/stata7_117.dta rename to pandas/tests/io/data/stata7_117.dta diff --git a/pandas/io/tests/data/stata8_113.dta 
b/pandas/tests/io/data/stata8_113.dta similarity index 100% rename from pandas/io/tests/data/stata8_113.dta rename to pandas/tests/io/data/stata8_113.dta diff --git a/pandas/io/tests/data/stata8_115.dta b/pandas/tests/io/data/stata8_115.dta similarity index 100% rename from pandas/io/tests/data/stata8_115.dta rename to pandas/tests/io/data/stata8_115.dta diff --git a/pandas/io/tests/data/stata8_117.dta b/pandas/tests/io/data/stata8_117.dta similarity index 100% rename from pandas/io/tests/data/stata8_117.dta rename to pandas/tests/io/data/stata8_117.dta diff --git a/pandas/io/tests/data/stata9_115.dta b/pandas/tests/io/data/stata9_115.dta similarity index 100% rename from pandas/io/tests/data/stata9_115.dta rename to pandas/tests/io/data/stata9_115.dta diff --git a/pandas/io/tests/data/stata9_117.dta b/pandas/tests/io/data/stata9_117.dta similarity index 100% rename from pandas/io/tests/data/stata9_117.dta rename to pandas/tests/io/data/stata9_117.dta diff --git a/pandas/io/tests/data/test1.csv b/pandas/tests/io/data/test1.csv similarity index 100% rename from pandas/io/tests/data/test1.csv rename to pandas/tests/io/data/test1.csv diff --git a/pandas/io/tests/data/test1.xls b/pandas/tests/io/data/test1.xls similarity index 100% rename from pandas/io/tests/data/test1.xls rename to pandas/tests/io/data/test1.xls diff --git a/pandas/io/tests/data/test1.xlsm b/pandas/tests/io/data/test1.xlsm similarity index 100% rename from pandas/io/tests/data/test1.xlsm rename to pandas/tests/io/data/test1.xlsm diff --git a/pandas/io/tests/data/test1.xlsx b/pandas/tests/io/data/test1.xlsx similarity index 100% rename from pandas/io/tests/data/test1.xlsx rename to pandas/tests/io/data/test1.xlsx diff --git a/pandas/io/tests/data/test2.xls b/pandas/tests/io/data/test2.xls similarity index 100% rename from pandas/io/tests/data/test2.xls rename to pandas/tests/io/data/test2.xls diff --git a/pandas/io/tests/data/test2.xlsm b/pandas/tests/io/data/test2.xlsm similarity index 100% rename from pandas/io/tests/data/test2.xlsm rename to pandas/tests/io/data/test2.xlsm diff --git a/pandas/io/tests/data/test2.xlsx b/pandas/tests/io/data/test2.xlsx similarity index 100% rename from pandas/io/tests/data/test2.xlsx rename to pandas/tests/io/data/test2.xlsx diff --git a/pandas/io/tests/data/test3.xls b/pandas/tests/io/data/test3.xls similarity index 100% rename from pandas/io/tests/data/test3.xls rename to pandas/tests/io/data/test3.xls diff --git a/pandas/io/tests/data/test3.xlsm b/pandas/tests/io/data/test3.xlsm similarity index 100% rename from pandas/io/tests/data/test3.xlsm rename to pandas/tests/io/data/test3.xlsm diff --git a/pandas/io/tests/data/test3.xlsx b/pandas/tests/io/data/test3.xlsx similarity index 100% rename from pandas/io/tests/data/test3.xlsx rename to pandas/tests/io/data/test3.xlsx diff --git a/pandas/io/tests/data/test4.xls b/pandas/tests/io/data/test4.xls similarity index 100% rename from pandas/io/tests/data/test4.xls rename to pandas/tests/io/data/test4.xls diff --git a/pandas/io/tests/data/test4.xlsm b/pandas/tests/io/data/test4.xlsm similarity index 100% rename from pandas/io/tests/data/test4.xlsm rename to pandas/tests/io/data/test4.xlsm diff --git a/pandas/io/tests/data/test4.xlsx b/pandas/tests/io/data/test4.xlsx similarity index 100% rename from pandas/io/tests/data/test4.xlsx rename to pandas/tests/io/data/test4.xlsx diff --git a/pandas/io/tests/data/test5.xls b/pandas/tests/io/data/test5.xls similarity index 100% rename from pandas/io/tests/data/test5.xls rename to 
pandas/tests/io/data/test5.xls diff --git a/pandas/io/tests/data/test5.xlsm b/pandas/tests/io/data/test5.xlsm similarity index 100% rename from pandas/io/tests/data/test5.xlsm rename to pandas/tests/io/data/test5.xlsm diff --git a/pandas/io/tests/data/test5.xlsx b/pandas/tests/io/data/test5.xlsx similarity index 100% rename from pandas/io/tests/data/test5.xlsx rename to pandas/tests/io/data/test5.xlsx diff --git a/pandas/io/tests/data/test_converters.xls b/pandas/tests/io/data/test_converters.xls similarity index 100% rename from pandas/io/tests/data/test_converters.xls rename to pandas/tests/io/data/test_converters.xls diff --git a/pandas/io/tests/data/test_converters.xlsm b/pandas/tests/io/data/test_converters.xlsm similarity index 100% rename from pandas/io/tests/data/test_converters.xlsm rename to pandas/tests/io/data/test_converters.xlsm diff --git a/pandas/io/tests/data/test_converters.xlsx b/pandas/tests/io/data/test_converters.xlsx similarity index 100% rename from pandas/io/tests/data/test_converters.xlsx rename to pandas/tests/io/data/test_converters.xlsx diff --git a/pandas/io/tests/data/test_index_name_pre17.xls b/pandas/tests/io/data/test_index_name_pre17.xls similarity index 100% rename from pandas/io/tests/data/test_index_name_pre17.xls rename to pandas/tests/io/data/test_index_name_pre17.xls diff --git a/pandas/io/tests/data/test_index_name_pre17.xlsm b/pandas/tests/io/data/test_index_name_pre17.xlsm similarity index 100% rename from pandas/io/tests/data/test_index_name_pre17.xlsm rename to pandas/tests/io/data/test_index_name_pre17.xlsm diff --git a/pandas/io/tests/data/test_index_name_pre17.xlsx b/pandas/tests/io/data/test_index_name_pre17.xlsx similarity index 100% rename from pandas/io/tests/data/test_index_name_pre17.xlsx rename to pandas/tests/io/data/test_index_name_pre17.xlsx diff --git a/pandas/io/tests/data/test_mmap.csv b/pandas/tests/io/data/test_mmap.csv similarity index 100% rename from pandas/io/tests/data/test_mmap.csv rename to pandas/tests/io/data/test_mmap.csv diff --git a/pandas/tests/io/data/test_multisheet.xls b/pandas/tests/io/data/test_multisheet.xls new file mode 100644 index 0000000000000..7b4b9759a1a94 Binary files /dev/null and b/pandas/tests/io/data/test_multisheet.xls differ diff --git a/pandas/tests/io/data/test_multisheet.xlsm b/pandas/tests/io/data/test_multisheet.xlsm new file mode 100644 index 0000000000000..c6191bc61bc49 Binary files /dev/null and b/pandas/tests/io/data/test_multisheet.xlsm differ diff --git a/pandas/tests/io/data/test_multisheet.xlsx b/pandas/tests/io/data/test_multisheet.xlsx new file mode 100644 index 0000000000000..dc424a9963253 Binary files /dev/null and b/pandas/tests/io/data/test_multisheet.xlsx differ diff --git a/pandas/io/tests/data/test_squeeze.xls b/pandas/tests/io/data/test_squeeze.xls similarity index 100% rename from pandas/io/tests/data/test_squeeze.xls rename to pandas/tests/io/data/test_squeeze.xls diff --git a/pandas/io/tests/data/test_squeeze.xlsm b/pandas/tests/io/data/test_squeeze.xlsm similarity index 100% rename from pandas/io/tests/data/test_squeeze.xlsm rename to pandas/tests/io/data/test_squeeze.xlsm diff --git a/pandas/io/tests/data/test_squeeze.xlsx b/pandas/tests/io/data/test_squeeze.xlsx similarity index 100% rename from pandas/io/tests/data/test_squeeze.xlsx rename to pandas/tests/io/data/test_squeeze.xlsx diff --git a/pandas/io/tests/data/test_types.xls b/pandas/tests/io/data/test_types.xls similarity index 100% rename from pandas/io/tests/data/test_types.xls rename to 
pandas/tests/io/data/test_types.xls diff --git a/pandas/io/tests/data/test_types.xlsm b/pandas/tests/io/data/test_types.xlsm similarity index 100% rename from pandas/io/tests/data/test_types.xlsm rename to pandas/tests/io/data/test_types.xlsm diff --git a/pandas/io/tests/data/test_types.xlsx b/pandas/tests/io/data/test_types.xlsx similarity index 100% rename from pandas/io/tests/data/test_types.xlsx rename to pandas/tests/io/data/test_types.xlsx diff --git a/pandas/io/tests/data/testdateoverflow.xls b/pandas/tests/io/data/testdateoverflow.xls similarity index 100% rename from pandas/io/tests/data/testdateoverflow.xls rename to pandas/tests/io/data/testdateoverflow.xls diff --git a/pandas/io/tests/data/testdateoverflow.xlsm b/pandas/tests/io/data/testdateoverflow.xlsm similarity index 100% rename from pandas/io/tests/data/testdateoverflow.xlsm rename to pandas/tests/io/data/testdateoverflow.xlsm diff --git a/pandas/io/tests/data/testdateoverflow.xlsx b/pandas/tests/io/data/testdateoverflow.xlsx similarity index 100% rename from pandas/io/tests/data/testdateoverflow.xlsx rename to pandas/tests/io/data/testdateoverflow.xlsx diff --git a/pandas/tests/io/data/testdtype.xls b/pandas/tests/io/data/testdtype.xls new file mode 100644 index 0000000000000..f63357524324f Binary files /dev/null and b/pandas/tests/io/data/testdtype.xls differ diff --git a/pandas/tests/io/data/testdtype.xlsm b/pandas/tests/io/data/testdtype.xlsm new file mode 100644 index 0000000000000..20e658288d5ac Binary files /dev/null and b/pandas/tests/io/data/testdtype.xlsm differ diff --git a/pandas/tests/io/data/testdtype.xlsx b/pandas/tests/io/data/testdtype.xlsx new file mode 100644 index 0000000000000..7c65263c373a3 Binary files /dev/null and b/pandas/tests/io/data/testdtype.xlsx differ diff --git a/pandas/io/tests/data/testmultiindex.xls b/pandas/tests/io/data/testmultiindex.xls similarity index 100% rename from pandas/io/tests/data/testmultiindex.xls rename to pandas/tests/io/data/testmultiindex.xls diff --git a/pandas/io/tests/data/testmultiindex.xlsm b/pandas/tests/io/data/testmultiindex.xlsm similarity index 100% rename from pandas/io/tests/data/testmultiindex.xlsm rename to pandas/tests/io/data/testmultiindex.xlsm diff --git a/pandas/io/tests/data/testmultiindex.xlsx b/pandas/tests/io/data/testmultiindex.xlsx similarity index 100% rename from pandas/io/tests/data/testmultiindex.xlsx rename to pandas/tests/io/data/testmultiindex.xlsx diff --git a/pandas/io/tests/data/testskiprows.xls b/pandas/tests/io/data/testskiprows.xls similarity index 100% rename from pandas/io/tests/data/testskiprows.xls rename to pandas/tests/io/data/testskiprows.xls diff --git a/pandas/io/tests/data/testskiprows.xlsm b/pandas/tests/io/data/testskiprows.xlsm similarity index 100% rename from pandas/io/tests/data/testskiprows.xlsm rename to pandas/tests/io/data/testskiprows.xlsm diff --git a/pandas/io/tests/data/testskiprows.xlsx b/pandas/tests/io/data/testskiprows.xlsx similarity index 100% rename from pandas/io/tests/data/testskiprows.xlsx rename to pandas/tests/io/data/testskiprows.xlsx diff --git a/pandas/io/tests/data/times_1900.xls b/pandas/tests/io/data/times_1900.xls similarity index 100% rename from pandas/io/tests/data/times_1900.xls rename to pandas/tests/io/data/times_1900.xls diff --git a/pandas/io/tests/data/times_1900.xlsm b/pandas/tests/io/data/times_1900.xlsm similarity index 100% rename from pandas/io/tests/data/times_1900.xlsm rename to pandas/tests/io/data/times_1900.xlsm diff --git a/pandas/io/tests/data/times_1900.xlsx 
b/pandas/tests/io/data/times_1900.xlsx
similarity index 100%
rename from pandas/io/tests/data/times_1900.xlsx
rename to pandas/tests/io/data/times_1900.xlsx
diff --git a/pandas/io/tests/data/times_1904.xls b/pandas/tests/io/data/times_1904.xls
similarity index 100%
rename from pandas/io/tests/data/times_1904.xls
rename to pandas/tests/io/data/times_1904.xls
diff --git a/pandas/io/tests/data/times_1904.xlsm b/pandas/tests/io/data/times_1904.xlsm
similarity index 100%
rename from pandas/io/tests/data/times_1904.xlsm
rename to pandas/tests/io/data/times_1904.xlsm
diff --git a/pandas/io/tests/data/times_1904.xlsx b/pandas/tests/io/data/times_1904.xlsx
similarity index 100%
rename from pandas/io/tests/data/times_1904.xlsx
rename to pandas/tests/io/data/times_1904.xlsx
diff --git a/pandas/io/tests/data/tips.csv b/pandas/tests/io/data/tips.csv
similarity index 100%
rename from pandas/io/tests/data/tips.csv
rename to pandas/tests/io/data/tips.csv
diff --git a/pandas/io/tests/data/valid_markup.html b/pandas/tests/io/data/valid_markup.html
similarity index 100%
rename from pandas/io/tests/data/valid_markup.html
rename to pandas/tests/io/data/valid_markup.html
diff --git a/pandas/io/tests/data/wikipedia_states.html b/pandas/tests/io/data/wikipedia_states.html
similarity index 100%
rename from pandas/io/tests/data/wikipedia_states.html
rename to pandas/tests/io/data/wikipedia_states.html
diff --git a/pandas/tests/io/formats/__init__.py b/pandas/tests/io/formats/__init__.py
new file mode 100644
index 0000000000000..e69de29bb2d1d
diff --git a/pandas/io/tests/parser/data/unicode_series.csv b/pandas/tests/io/formats/data/unicode_series.csv
similarity index 100%
rename from pandas/io/tests/parser/data/unicode_series.csv
rename to pandas/tests/io/formats/data/unicode_series.csv
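The file moves above relocate the IO test-data fixtures; the new test modules that follow exercise the `pandas.io.formats` code. `test_css.py` below tests `CSSResolver`, which resolves a CSS declaration string into a dict of expanded properties. A minimal sketch of the call pattern the tests rely on (an illustration inferred from the tests themselves, not part of the diff):

```python
from pandas.io.formats.css import CSSResolver

resolve = CSSResolver()

# Side shorthands expand to the four per-side properties
# (see test_css_side_shorthands below).
resolve('margin: 1pt 4pt')
# -> {'margin-top': '1pt', 'margin-right': '4pt',
#     'margin-bottom': '1pt', 'margin-left': '4pt'}

# Relative font sizes resolve against an inherited context
# (see test_css_relative_font_size below).
resolve('font-size: 1.25em', inherited={'font-size': '16pt'})
# -> {'font-size': '20pt'}
```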
diff --git a/pandas/tests/io/formats/test_css.py b/pandas/tests/io/formats/test_css.py
new file mode 100644
index 0000000000000..44f95266b6c78
--- /dev/null
+++ b/pandas/tests/io/formats/test_css.py
@@ -0,0 +1,260 @@
+import pytest
+
+from pandas.util import testing as tm
+from pandas.io.formats.css import CSSResolver, CSSWarning
+
+
+def assert_resolves(css, props, inherited=None):
+    resolve = CSSResolver()
+    actual = resolve(css, inherited=inherited)
+    assert props == actual
+
+
+def assert_same_resolution(css1, css2, inherited=None):
+    resolve = CSSResolver()
+    resolved1 = resolve(css1, inherited=inherited)
+    resolved2 = resolve(css2, inherited=inherited)
+    assert resolved1 == resolved2
+
+
+@pytest.mark.parametrize('name,norm,abnorm', [
+    ('whitespace', 'hello: world; foo: bar',
+     ' \t hello \t :\n world \n ; \n foo: \tbar\n\n'),
+    ('case', 'hello: world; foo: bar', 'Hello: WORLD; foO: bar'),
+    ('empty-decl', 'hello: world; foo: bar',
+     '; hello: world;; foo: bar;\n; ;'),
+    ('empty-list', '', ';'),
+])
+def test_css_parse_normalisation(name, norm, abnorm):
+    assert_same_resolution(norm, abnorm)
+
+
+@pytest.mark.xfail(reason='CSS comments not yet stripped')
+def test_css_parse_comments():
+    assert_same_resolution('hello: world',
+                           'hello/* foo */:/* bar \n */ world /*;not:here*/')
+
+
+@pytest.mark.xfail(reason='''we don't need to handle specificity
+                             markers like !important, but we should
+                             ignore them in the future''')
+def test_css_parse_specificity():
+    assert_same_resolution('font-weight: bold',
+                           'font-weight: bold !important')
+
+
+@pytest.mark.xfail(reason='Splitting CSS declarations not yet sensitive to '
+                          '; in CSS strings')
+def test_css_parse_strings():
+    # semicolons in strings
+    with tm.assert_produces_warning(CSSWarning):
+        assert_resolves(
+            'background-image: url(\'http://blah.com/foo?a;b=c\')',
+            {'background-image': 'url(\'http://blah.com/foo?a;b=c\')'})
+        assert_resolves(
+            'background-image: url("http://blah.com/foo?a;b=c")',
+            {'background-image': 'url("http://blah.com/foo?a;b=c")'})
+
+
+@pytest.mark.parametrize(
+    'invalid_css,remainder', [
+        # No colon
+        ('hello-world', ''),
+        ('border-style: solid; hello-world', 'border-style: solid'),
+        ('border-style: solid; hello-world; font-weight: bold',
+         'border-style: solid; font-weight: bold'),
+        # Unclosed string
+        pytest.mark.xfail(('background-image: "abc', ''),
+                          reason='Unclosed CSS strings not detected'),
+        pytest.mark.xfail(('font-family: "abc', ''),
+                          reason='Unclosed CSS strings not detected'),
+        pytest.mark.xfail(('background-image: \'abc', ''),
+                          reason='Unclosed CSS strings not detected'),
+        pytest.mark.xfail(('font-family: \'abc', ''),
+                          reason='Unclosed CSS strings not detected'),
+        # Invalid size
+        ('font-size: blah', 'font-size: 1em'),
+        ('font-size: 1a2b', 'font-size: 1em'),
+        ('font-size: 1e5pt', 'font-size: 1em'),
+        ('font-size: 1+6pt', 'font-size: 1em'),
+        ('font-size: 1unknownunit', 'font-size: 1em'),
+        ('font-size: 10', 'font-size: 1em'),
+        ('font-size: 10 pt', 'font-size: 1em'),
+    ])
+def test_css_parse_invalid(invalid_css, remainder):
+    with tm.assert_produces_warning(CSSWarning):
+        assert_same_resolution(invalid_css, remainder)
+
+    # TODO: we should be checking that in other cases no warnings are raised
+
+
+@pytest.mark.parametrize(
+    'shorthand,expansions',
+    [('margin', ['margin-top', 'margin-right',
+                 'margin-bottom', 'margin-left']),
+     ('padding', ['padding-top', 'padding-right',
+                  'padding-bottom', 'padding-left']),
+     ('border-width', ['border-top-width', 'border-right-width',
+                       'border-bottom-width', 'border-left-width']),
+     ('border-color', ['border-top-color', 'border-right-color',
+                       'border-bottom-color', 'border-left-color']),
+     ('border-style', ['border-top-style', 'border-right-style',
+                       'border-bottom-style', 'border-left-style']),
+     ])
+def test_css_side_shorthands(shorthand, expansions):
+    top, right, bottom, left = expansions
+
+    assert_resolves('%s: 1pt' % shorthand,
+                    {top: '1pt', right: '1pt',
+                     bottom: '1pt', left: '1pt'})
+
+    assert_resolves('%s: 1pt 4pt' % shorthand,
+                    {top: '1pt', right: '4pt',
+                     bottom: '1pt', left: '4pt'})
+
+    assert_resolves('%s: 1pt 4pt 2pt' % shorthand,
+                    {top: '1pt', right: '4pt',
+                     bottom: '2pt', left: '4pt'})
+
+    assert_resolves('%s: 1pt 4pt 2pt 0pt' % shorthand,
+                    {top: '1pt', right: '4pt',
+                     bottom: '2pt', left: '0pt'})
+
+    with tm.assert_produces_warning(CSSWarning):
+        assert_resolves('%s: 1pt 1pt 1pt 1pt 1pt' % shorthand,
+                        {})
+
+
+@pytest.mark.xfail(reason='CSS font shorthand not yet handled')
+@pytest.mark.parametrize('css,props', [
+    ('font: italic bold 12pt helvetica,sans-serif',
+     {'font-family': 'helvetica,sans-serif',
+      'font-style': 'italic',
+      'font-weight': 'bold',
+      'font-size': '12pt'}),
+    ('font: bold italic 12pt helvetica,sans-serif',
+     {'font-family': 'helvetica,sans-serif',
+      'font-style': 'italic',
+      'font-weight': 'bold',
+      'font-size': '12pt'}),
+])
+def test_css_font_shorthand(css, props):
+    assert_resolves(css, props)
+
+
+@pytest.mark.xfail(reason='CSS background shorthand not yet handled')
+@pytest.mark.parametrize('css,props', [
+    ('background: blue', {'background-color': 'blue'}),
+    ('background: fixed blue',
+     {'background-color': 'blue', 'background-attachment': 'fixed'}),
+])
+def test_css_background_shorthand(css, props):
+    assert_resolves(css, props)
+
+
+@pytest.mark.xfail(reason='CSS border shorthand not yet handled')
+@pytest.mark.parametrize('style,equiv', [
+    ('border: 1px solid red',
+     'border-width: 1px; border-style: solid; border-color: red'),
+    ('border: solid red 1px',
+     'border-width: 1px; border-style: solid; border-color: red'),
+    ('border: red solid',
+     'border-style: solid; border-color: red'),
+])
+def test_css_border_shorthand(style, equiv):
+    assert_same_resolution(style, equiv)
+
+
+@pytest.mark.parametrize('style,inherited,equiv', [
+    ('margin: 1px; margin: 2px', '',
+     'margin: 2px'),
+    ('margin: 1px', 'margin: 2px',
+     'margin: 1px'),
+    ('margin: 1px; margin: inherit', 'margin: 2px',
+     'margin: 2px'),
+    ('margin: 1px; margin-top: 2px', '',
+     'margin-left: 1px; margin-right: 1px; ' +
+     'margin-bottom: 1px; margin-top: 2px'),
+    ('margin-top: 2px', 'margin: 1px',
+     'margin: 1px; margin-top: 2px'),
+    ('margin: 1px', 'margin-top: 2px',
+     'margin: 1px'),
+    ('margin: 1px; margin-top: inherit', 'margin: 2px',
+     'margin: 1px; margin-top: 2px'),
+])
+def test_css_precedence(style, inherited, equiv):
+    resolve = CSSResolver()
+    inherited_props = resolve(inherited)
+    style_props = resolve(style, inherited=inherited_props)
+    equiv_props = resolve(equiv)
+    assert style_props == equiv_props
+
+
+@pytest.mark.parametrize('style,equiv', [
+    ('margin: 1px; margin-top: inherit',
+     'margin-bottom: 1px; margin-right: 1px; margin-left: 1px'),
+    ('margin-top: inherit', ''),
+    ('margin-top: initial', ''),
+])
+def test_css_none_absent(style, equiv):
+    assert_same_resolution(style, equiv)
+
+
+@pytest.mark.parametrize('size,resolved', [
+    ('xx-small', '6pt'),
+    ('x-small', '%fpt' % 7.5),
+    ('small', '%fpt' % 9.6),
+    ('medium', '12pt'),
+    ('large', '%fpt' % 13.5),
+    ('x-large', '18pt'),
+    ('xx-large', '24pt'),
+
+    ('8px', '6pt'),
+    ('1.25pc', '15pt'),
+    ('.25in', '18pt'),
+    ('02.54cm', '72pt'),
+    ('25.4mm', '72pt'),
+    ('101.6q', '72pt'),
+    ('101.6q', '72pt'),
+])
+@pytest.mark.parametrize('relative_to',  # invariant to inherited size
+                         [None, '16pt'])
+def test_css_absolute_font_size(size, relative_to, resolved):
+    if relative_to is None:
+        inherited = None
+    else:
+        inherited = {'font-size': relative_to}
+    assert_resolves('font-size: %s' % size, {'font-size': resolved},
+                    inherited=inherited)
+
+
+@pytest.mark.parametrize('size,relative_to,resolved', [
+    ('1em', None, '12pt'),
+    ('1.0em', None, '12pt'),
+    ('1.25em', None, '15pt'),
+    ('1em', '16pt', '16pt'),
+    ('1.0em', '16pt', '16pt'),
+    ('1.25em', '16pt', '20pt'),
+    ('1rem', '16pt', '12pt'),
+    ('1.0rem', '16pt', '12pt'),
+    ('1.25rem', '16pt', '15pt'),
+    ('100%', None, '12pt'),
+    ('125%', None, '15pt'),
+    ('100%', '16pt', '16pt'),
+    ('125%', '16pt', '20pt'),
+    ('2ex', None, '12pt'),
+    ('2.0ex', None, '12pt'),
+    ('2.50ex', None, '15pt'),
+    ('inherit', '16pt', '16pt'),
+
+    ('smaller', None, '10pt'),
+    ('smaller', '18pt', '15pt'),
+    ('larger', None, '%fpt' % 14.4),
+    ('larger', '15pt', '18pt'),
+])
+def test_css_relative_font_size(size, relative_to, resolved):
+    if relative_to is None:
+        inherited = None
+    else:
+        inherited = {'font-size': relative_to}
+    assert_resolves('font-size: %s' % size, {'font-size': resolved},
+                    inherited=inherited)
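`test_eng_formatting.py` below covers `EngFormatter`, the engineering-notation float formatter behind `set_eng_float_format`. A minimal sketch of the behaviour these tests pin down (an illustration derived from the test expectations, not part of the diff):

```python
import numpy as np
import pandas.io.formats.format as fmt

# With engineering prefixes, one letter per power of 1000
# (see test_exponents_with_eng_prefix below).
formatter = fmt.EngFormatter(accuracy=3, use_eng_prefix=True)
formatter(1410000.)  # ' 1.410M'
formatter(0)         # ' 0.000'

# Without prefixes, exponents are kept at multiples of three
# (see test_exponents_without_eng_prefix below).
formatter = fmt.EngFormatter(accuracy=4, use_eng_prefix=False)
formatter(np.pi)     # ' 3.1416E+00'
```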
diff --git a/pandas/tests/io/formats/test_eng_formatting.py b/pandas/tests/io/formats/test_eng_formatting.py
new file mode 100644
index 0000000000000..9d5773283176c
--- /dev/null
+++ b/pandas/tests/io/formats/test_eng_formatting.py
@@ -0,0 +1,193 @@
+import numpy as np
+import pandas as pd
+from pandas import DataFrame
+from pandas.compat import u
+import pandas.io.formats.format as fmt
+from pandas.util import testing as tm
+
+
+class TestEngFormatter(object):
+
+    def test_eng_float_formatter(self):
+        df = DataFrame({'A': [1.41, 141., 14100, 1410000.]})
+
+        fmt.set_eng_float_format()
+        result = df.to_string()
+        expected = ('             A\n'
+                    '0    1.410E+00\n'
+                    '1  141.000E+00\n'
+                    '2   14.100E+03\n'
+                    '3    1.410E+06')
+        assert result == expected
+
+        fmt.set_eng_float_format(use_eng_prefix=True)
+        result = df.to_string()
+        expected = ('         A\n'
+                    '0    1.410\n'
+                    '1  141.000\n'
+                    '2  14.100k\n'
+                    '3   1.410M')
+        assert result == expected
+
+        fmt.set_eng_float_format(accuracy=0)
+        result = df.to_string()
+        expected = ('         A\n'
+                    '0    1E+00\n'
+                    '1  141E+00\n'
+                    '2   14E+03\n'
+                    '3    1E+06')
+        assert result == expected
+
+        tm.reset_display_options()
+
+    def compare(self, formatter, input, output):
+        formatted_input = formatter(input)
+        assert formatted_input == output
+
+    def compare_all(self, formatter, in_out):
+        """
+        Parameters
+        ----------
+        formatter : EngFormatter under test
+        in_out : list of (number, expected_formatting) tuples
+
+        Asserts that formatter(number) == expected_formatting.
+        Each *number* should be >= 0, because formatter(-number) is also
+        checked against the negation of *expected_formatting*.
+        """
+        for input, output in in_out:
+            self.compare(formatter, input, output)
+            self.compare(formatter, -input, "-" + output[1:])
+
+    def test_exponents_with_eng_prefix(self):
+        formatter = fmt.EngFormatter(accuracy=3, use_eng_prefix=True)
+        f = np.sqrt(2)
+        in_out = [
+            (f * 10 ** -24, " 1.414y"), (f * 10 ** -23, " 14.142y"),
+            (f * 10 ** -22, " 141.421y"), (f * 10 ** -21, " 1.414z"),
+            (f * 10 ** -20, " 14.142z"), (f * 10 ** -19, " 141.421z"),
+            (f * 10 ** -18, " 1.414a"), (f * 10 ** -17, " 14.142a"),
+            (f * 10 ** -16, " 141.421a"), (f * 10 ** -15, " 1.414f"),
+            (f * 10 ** -14, " 14.142f"), (f * 10 ** -13, " 141.421f"),
+            (f * 10 ** -12, " 1.414p"), (f * 10 ** -11, " 14.142p"),
+            (f * 10 ** -10, " 141.421p"), (f * 10 ** -9, " 1.414n"),
+            (f * 10 ** -8, " 14.142n"), (f * 10 ** -7, " 141.421n"),
+            (f * 10 ** -6, " 1.414u"), (f * 10 ** -5, " 14.142u"),
+            (f * 10 ** -4, " 141.421u"), (f * 10 ** -3, " 1.414m"),
+            (f * 10 ** -2, " 14.142m"), (f * 10 ** -1, " 141.421m"),
+            (f * 10 ** 0, " 1.414"), (f * 10 ** 1, " 14.142"),
+            (f * 10 ** 2, " 141.421"), (f * 10 ** 3, " 1.414k"),
+            (f * 10 ** 4, " 14.142k"), (f * 10 ** 5, " 141.421k"),
+            (f * 10 ** 6, " 1.414M"), (f * 10 ** 7, " 14.142M"),
+            (f * 10 ** 8, " 141.421M"), (f * 10 ** 9, " 1.414G"),
+            (f * 10 ** 10, " 14.142G"), (f * 10 ** 11, " 141.421G"),
+            (f * 10 ** 12, " 1.414T"), (f * 10 ** 13, " 14.142T"),
+            (f * 10 ** 14, " 141.421T"), (f * 10 ** 15, " 1.414P"),
+            (f * 10 ** 16, " 14.142P"), (f * 10 ** 17, " 141.421P"),
+            (f * 10 ** 18, " 1.414E"), (f * 10 ** 19, " 14.142E"),
+            (f * 10 ** 20, " 141.421E"), (f * 10 ** 21, " 1.414Z"),
+            (f * 10 ** 22, " 14.142Z"), (f * 10 ** 23, " 141.421Z"),
+            (f * 10 ** 24, " 1.414Y"), (f * 10 ** 25, " 14.142Y"),
+            (f * 10 ** 26, " 141.421Y")]
+        self.compare_all(formatter, in_out)
+
+    def test_exponents_without_eng_prefix(self):
+        formatter = fmt.EngFormatter(accuracy=4, use_eng_prefix=False)
+        f = np.pi
+        in_out = [
+            (f * 10 ** -24, " 3.1416E-24"),
+            (f * 10 ** -23, " 31.4159E-24"),
+            (f * 10 ** -22, " 314.1593E-24"),
+            (f * 10 ** -21, " 3.1416E-21"),
+            (f * 10 ** -20, " 31.4159E-21"),
+            (f * 10 ** -19, " 314.1593E-21"),
+            (f * 10 ** -18, " 3.1416E-18"),
+            (f * 10 ** -17, " 31.4159E-18"),
+            (f * 10 ** -16, " 314.1593E-18"),
+            (f * 10 ** -15, " 3.1416E-15"),
+            (f * 10 ** -14, " 31.4159E-15"),
+            (f * 10 ** -13, " 314.1593E-15"),
+            (f * 10 ** -12, " 3.1416E-12"),
+            (f * 10 ** -11, " 31.4159E-12"),
+            (f * 10 ** -10, " 314.1593E-12"),
+            (f * 10 ** -9, " 3.1416E-09"),
+            (f * 10 ** -8, " 31.4159E-09"),
+            (f * 10 ** -7, " 314.1593E-09"),
+            (f * 10 ** -6, " 3.1416E-06"),
+            (f * 10 ** -5, " 31.4159E-06"),
+            (f * 10 ** -4, " 314.1593E-06"),
+            (f * 10 ** -3, " 3.1416E-03"),
+            (f * 10 ** -2, " 31.4159E-03"),
+            (f * 10 ** -1, " 314.1593E-03"),
+            (f * 10 ** 0, " 3.1416E+00"),
+            (f * 10 ** 1, " 31.4159E+00"),
+            (f * 10 ** 2, " 314.1593E+00"),
+            (f * 10 ** 3, " 3.1416E+03"),
+            (f * 10 ** 4, " 31.4159E+03"),
+            (f * 10 ** 5, " 314.1593E+03"),
+            (f * 10 ** 6, " 3.1416E+06"),
+            (f * 10 ** 7, " 31.4159E+06"),
+            (f * 10 ** 8, " 314.1593E+06"),
+            (f * 10 ** 9, " 3.1416E+09"),
+            (f * 10 ** 10, " 31.4159E+09"),
+            (f * 10 ** 11, " 314.1593E+09"),
+            (f * 10 ** 12, " 3.1416E+12"),
+            (f * 10 ** 13, " 31.4159E+12"),
+            (f * 10 ** 14, " 314.1593E+12"),
+            (f * 10 ** 15, " 3.1416E+15"),
+            (f * 10 ** 16, " 31.4159E+15"),
+            (f * 10 ** 17, " 314.1593E+15"),
+            (f * 10 ** 18, " 3.1416E+18"),
+            (f * 10 ** 19, " 31.4159E+18"),
+            (f * 10 ** 20, " 314.1593E+18"),
+            (f * 10 ** 21, " 3.1416E+21"),
+            (f * 10 ** 22, " 31.4159E+21"),
+            (f * 10 ** 23, " 314.1593E+21"),
+            (f * 10 ** 24, " 3.1416E+24"),
+            (f * 10 ** 25, " 31.4159E+24"),
+            (f * 10 ** 26, " 314.1593E+24")]
+        self.compare_all(formatter, in_out)
+
+    def test_rounding(self):
+        formatter = fmt.EngFormatter(accuracy=3, use_eng_prefix=True)
+        in_out = [(5.55555, ' 5.556'), (55.5555, ' 55.556'),
+                  (555.555, ' 555.555'), (5555.55, ' 5.556k'),
+                  (55555.5, ' 55.556k'), (555555, ' 555.555k')]
+        self.compare_all(formatter, in_out)
+
+        formatter = fmt.EngFormatter(accuracy=1, use_eng_prefix=True)
+        in_out = [(5.55555, ' 5.6'), (55.5555, ' 55.6'), (555.555, ' 555.6'),
+                  (5555.55, ' 5.6k'), (55555.5, ' 55.6k'), (555555, ' 555.6k')]
+        self.compare_all(formatter, in_out)
+
+        formatter = fmt.EngFormatter(accuracy=0, use_eng_prefix=True)
+        in_out = [(5.55555, ' 6'), (55.5555, ' 56'), (555.555, ' 556'),
+                  (5555.55, ' 6k'), (55555.5, ' 56k'), (555555, ' 556k')]
+        self.compare_all(formatter, in_out)
+
+        formatter = fmt.EngFormatter(accuracy=3, use_eng_prefix=True)
+        result = formatter(0)
+        assert result == u(' 0.000')
+
+    def test_nan(self):
+        # Issue #11981
+
+        formatter = fmt.EngFormatter(accuracy=1, use_eng_prefix=True)
+        result = formatter(np.nan)
+        assert result == u('NaN')
+
+        df = pd.DataFrame({'a': [1.5, 10.3, 20.5],
+                           'b': [50.3, 60.67, 70.12],
+                           'c': [100.2, 101.33, 120.33]})
+        pt = df.pivot_table(values='a', index='b', columns='c')
+        fmt.set_eng_float_format(accuracy=1)
+        result = pt.to_string()
+        assert 'NaN' in result
+        tm.reset_display_options()
+
+    def test_inf(self):
+        # Issue #11981
+
+        formatter = fmt.EngFormatter(accuracy=1, use_eng_prefix=True)
+        result = formatter(np.inf)
+        assert result == u('inf')
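`test_format.py` below (only the first part of a 2555-line file is shown here) checks DataFrame `repr` output against the display options. A minimal sketch of the option-driven behaviour under test (an illustration based on the test cases, not part of the diff):

```python
import pandas as pd
from pandas import DataFrame

df = DataFrame([[0.1, 0.5], [0.5, -0.1]])

# Values whose magnitude falls below the threshold display as 0.0
# (see test_repr_chop_threshold below).
with pd.option_context('display.chop_threshold', 0.2):
    print(df)  # 0.1 and -0.1 render as 0.0

# Cells longer than max_colwidth are truncated with '...'
# (see test_repr_truncation below).
with pd.option_context('display.max_colwidth', 20):
    print(DataFrame({'B': ['x' * 30]}))  # value shown truncated with '...'
```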
+from pandas import (DataFrame, Series, Index, Timestamp, MultiIndex, + date_range, NaT, read_table) +from pandas.compat import (range, zip, lrange, StringIO, PY3, + u, lzip, is_platform_windows, + is_platform_32bit) +import pandas.compat as compat + +import pandas.io.formats.format as fmt +import pandas.io.formats.printing as printing + +import pandas.util.testing as tm +from pandas.io.formats.terminal import get_terminal_size +from pandas.core.config import (set_option, get_option, option_context, + reset_option) + +use_32bit_repr = is_platform_windows() or is_platform_32bit() + +_frame = DataFrame(tm.getSeriesData()) + + +def curpath(): + pth, _ = os.path.split(os.path.abspath(__file__)) + return pth + + +def has_info_repr(df): + r = repr(df) + c1 = r.split('\n')[0].startswith(" + # 2. Index + # 3. Columns + # 4. dtype + # 5. memory usage + # 6. trailing newline + nv = len(r.split('\n')) == 6 + return has_info and nv + + +def has_horizontally_truncated_repr(df): + try: # Check header row + fst_line = np.array(repr(df).splitlines()[0].split()) + cand_col = np.where(fst_line == '...')[0][0] + except: + return False + # Make sure each row has this ... in the same place + r = repr(df) + for ix, l in enumerate(r.splitlines()): + if not r.split()[cand_col] == '...': + return False + return True + + +def has_vertically_truncated_repr(df): + r = repr(df) + only_dot_row = False + for row in r.splitlines(): + if re.match(r'^[\.\ ]+$', row): + only_dot_row = True + return only_dot_row + + +def has_truncated_repr(df): + return has_horizontally_truncated_repr( + df) or has_vertically_truncated_repr(df) + + +def has_doubly_truncated_repr(df): + return has_horizontally_truncated_repr( + df) and has_vertically_truncated_repr(df) + + +def has_expanded_repr(df): + r = repr(df) + for line in r.split('\n'): + if line.endswith('\\'): + return True + return False + + +class TestDataFrameFormatting(object): + + def setup_method(self, method): + self.warn_filters = warnings.filters + warnings.filterwarnings('ignore', category=FutureWarning, + module=".*format") + + self.frame = _frame.copy() + + def teardown_method(self, method): + warnings.filters = self.warn_filters + + def test_repr_embedded_ndarray(self): + arr = np.empty(10, dtype=[('err', object)]) + for i in range(len(arr)): + arr['err'][i] = np.random.randn(i) + + df = DataFrame(arr) + repr(df['err']) + repr(df) + df.to_string() + + def test_eng_float_formatter(self): + self.frame.loc[5] = 0 + + fmt.set_eng_float_format() + repr(self.frame) + + fmt.set_eng_float_format(use_eng_prefix=True) + repr(self.frame) + + fmt.set_eng_float_format(accuracy=0) + repr(self.frame) + tm.reset_display_options() + + def test_show_null_counts(self): + + df = DataFrame(1, columns=range(10), index=range(10)) + df.iloc[1, 1] = np.nan + + def check(null_counts, result): + buf = StringIO() + df.info(buf=buf, null_counts=null_counts) + assert ('non-null' in buf.getvalue()) is result + + with option_context('display.max_info_rows', 20, + 'display.max_info_columns', 20): + check(None, True) + check(True, True) + check(False, False) + + with option_context('display.max_info_rows', 5, + 'display.max_info_columns', 5): + check(None, False) + check(True, False) + check(False, False) + + def test_repr_tuples(self): + buf = StringIO() + + df = DataFrame({'tups': lzip(range(10), range(10))}) + repr(df) + df.to_string(col_space=10, buf=buf) + + def test_repr_truncation(self): + max_len = 20 + with option_context("display.max_colwidth", max_len): + df = DataFrame({'A': 
np.random.randn(10),
+                             'B': [tm.rands(np.random.randint(
+                                 max_len - 1, max_len + 1)) for i in range(10)
+                             ]})
+            r = repr(df)
+            r = r[r.find('\n') + 1:]
+
+            adj = fmt._get_adjustment()
+
+            for line, value in lzip(r.split('\n'), df['B']):
+                if adj.len(value) + 1 > max_len:
+                    assert '...' in line
+                else:
+                    assert '...' not in line
+
+        with option_context("display.max_colwidth", 999999):
+            assert '...' not in repr(df)
+
+        with option_context("display.max_colwidth", max_len + 2):
+            assert '...' not in repr(df)
+
+    def test_repr_chop_threshold(self):
+        df = DataFrame([[0.1, 0.5], [0.5, -0.1]])
+        pd.reset_option("display.chop_threshold")  # default None
+        assert repr(df) == ' 0 1\n0 0.1 0.5\n1 0.5 -0.1'
+
+        with option_context("display.chop_threshold", 0.2):
+            assert repr(df) == ' 0 1\n0 0.0 0.5\n1 0.5 0.0'
+
+        with option_context("display.chop_threshold", 0.6):
+            assert repr(df) == ' 0 1\n0 0.0 0.0\n1 0.0 0.0'
+
+        with option_context("display.chop_threshold", None):
+            assert repr(df) == ' 0 1\n0 0.1 0.5\n1 0.5 -0.1'
+
+    def test_repr_obeys_max_seq_limit(self):
+        with option_context("display.max_seq_items", 2000):
+            assert len(printing.pprint_thing(lrange(1000))) > 1000
+
+        with option_context("display.max_seq_items", 5):
+            assert len(printing.pprint_thing(lrange(1000))) < 100
+
+    def test_repr_set(self):
+        assert printing.pprint_thing(set([1])) == '{1}'
+
+    def test_repr_is_valid_construction_code(self):
+        # for the case of Index, where the repr is traditional rather than
+        # stylized
+        idx = Index(['a', 'b'])
+        res = eval("pd." + repr(idx))
+        tm.assert_series_equal(Series(res), Series(idx))
+
+    def test_repr_should_return_str(self):
+        # http://docs.python.org/py3k/reference/datamodel.html#object.__repr__
+        # http://docs.python.org/reference/datamodel.html#object.__repr__
+        # "...The return value must be a string object."
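+        # A minimal sketch of the contract asserted below, using a
+        # hypothetical object (any __repr__ must return a string object):
+        #
+        #     class Dummy(object):
+        #         def __repr__(self):
+        #             return 'Dummy()'
+        #
+        #     assert isinstance(repr(Dummy()), str)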
+
+        # (str on py2.x, str (unicode) on py3)
+
+        data = [8, 5, 3, 5]
+        index1 = [u("\u03c3"), u("\u03c4"), u("\u03c5"), u("\u03c6")]
+        cols = [u("\u03c8")]
+        df = DataFrame(data, columns=cols, index=index1)
+        assert type(df.__repr__()) == str  # both py2 / 3
+
+    def test_repr_no_backslash(self):
+        with option_context('mode.sim_interactive', True):
+            df = DataFrame(np.random.randn(10, 4))
+            assert '\\' not in repr(df)
+
+    def test_expand_frame_repr(self):
+        df_small = DataFrame('hello', [0], [0])
+        df_wide = DataFrame('hello', [0], lrange(10))
+        df_tall = DataFrame('hello', lrange(30), lrange(5))
+
+        with option_context('mode.sim_interactive', True):
+            with option_context('display.max_columns', 10, 'display.width', 20,
+                                'display.max_rows', 20,
+                                'display.show_dimensions', True):
+                with option_context('display.expand_frame_repr', True):
+                    assert not has_truncated_repr(df_small)
+                    assert not has_expanded_repr(df_small)
+                    assert not has_truncated_repr(df_wide)
+                    assert has_expanded_repr(df_wide)
+                    assert has_vertically_truncated_repr(df_tall)
+                    assert has_expanded_repr(df_tall)
+
+                with option_context('display.expand_frame_repr', False):
+                    assert not has_truncated_repr(df_small)
+                    assert not has_expanded_repr(df_small)
+                    assert not has_horizontally_truncated_repr(df_wide)
+                    assert not has_expanded_repr(df_wide)
+                    assert has_vertically_truncated_repr(df_tall)
+                    assert not has_expanded_repr(df_tall)
+
+    def test_repr_non_interactive(self):
+        # in non-interactive mode, there can be no dependency on the
+        # result of terminal auto size detection
+        df = DataFrame('hello', lrange(1000), lrange(5))
+
+        with option_context('mode.sim_interactive', False, 'display.width', 0,
+                            'display.height', 0, 'display.max_rows', 5000):
+            assert not has_truncated_repr(df)
+            assert not has_expanded_repr(df)
+
+    def test_repr_max_columns_max_rows(self):
+        term_width, term_height = get_terminal_size()
+        if term_width < 10 or term_height < 10:
+            pytest.skip("terminal size too small, "
+                        "{0} x {1}".format(term_width, term_height))
+
+        def mkframe(n):
+            index = ['%05d' % i for i in range(n)]
+            return DataFrame(0, index, index)
+
+        df6 = mkframe(6)
+        df10 = mkframe(10)
+        with option_context('mode.sim_interactive', True):
+            with option_context('display.width', term_width * 2):
+                with option_context('display.max_rows', 5,
+                                    'display.max_columns', 5):
+                    assert not has_expanded_repr(mkframe(4))
+                    assert not has_expanded_repr(mkframe(5))
+                    assert not has_expanded_repr(df6)
+                    assert has_doubly_truncated_repr(df6)
+
+                with option_context('display.max_rows', 20,
+                                    'display.max_columns', 10):
+                    # past the max_columns boundary, but no expanding,
+                    # since the width is not exceeded
+                    assert not has_expanded_repr(df6)
+                    assert not has_truncated_repr(df6)
+
+                with option_context('display.max_rows', 9,
+                                    'display.max_columns', 10):
+                    # exceeding the vertical bounds cannot result in an
+                    # expanded repr
+                    assert not has_expanded_repr(df10)
+                    assert has_vertically_truncated_repr(df10)
+
+            # width=None in terminal, auto detection
+            with option_context('display.max_columns', 100, 'display.max_rows',
+                                term_width * 20, 'display.width', None):
+                df = mkframe((term_width // 7) - 2)
+                assert not has_expanded_repr(df)
+                df = mkframe((term_width // 7) + 2)
+                printing.pprint_thing(df._repr_fits_horizontal_())
+                assert has_expanded_repr(df)
+
+    def test_str_max_colwidth(self):
+        # GH 7856
+        df = pd.DataFrame([{'a': 'foo',
+                            'b': 'bar',
+                            'c': 'uncomfortably long line with lots of stuff',
+                            'd': 1}, {'a': 'foo',
+                                      'b': 'bar',
+                                      'c': 'stuff',
+                                      'd': 1}])
+        df.set_index(['a',
'b', 'c'])
+        assert str(df) == (
+            ' a b c d\n'
+            '0 foo bar uncomfortably long line with lots of stuff 1\n'
+            '1 foo bar stuff 1')
+        with option_context('max_colwidth', 20):
+            assert str(df) == (' a b c d\n'
+                               '0 foo bar uncomfortably lo... 1\n'
+                               '1 foo bar stuff 1')
+
+    def test_auto_detect(self):
+        term_width, term_height = get_terminal_size()
+        fac = 1.05  # Arbitrarily large factor to exceed term width
+        cols = range(int(term_width * fac))
+        index = range(10)
+        df = DataFrame(index=index, columns=cols)
+        with option_context('mode.sim_interactive', True):
+            with option_context('max_rows', None):
+                with option_context('max_columns', None):
+                    # Wrap around with None
+                    assert has_expanded_repr(df)
+            with option_context('max_rows', 0):
+                with option_context('max_columns', 0):
+                    # Truncate with auto detection.
+                    assert has_horizontally_truncated_repr(df)
+
+            index = range(int(term_height * fac))
+            df = DataFrame(index=index, columns=cols)
+            with option_context('max_rows', 0):
+                with option_context('max_columns', None):
+                    # Wrap around with None
+                    assert has_expanded_repr(df)
+                    # Truncate vertically
+                    assert has_vertically_truncated_repr(df)
+
+            with option_context('max_rows', None):
+                with option_context('max_columns', 0):
+                    assert has_horizontally_truncated_repr(df)
+
+    def test_to_string_repr_unicode(self):
+        buf = StringIO()
+
+        unicode_values = [u('\u03c3')] * 10
+        unicode_values = np.array(unicode_values, dtype=object)
+        df = DataFrame({'unicode': unicode_values})
+        df.to_string(col_space=10, buf=buf)
+
+        # it works!
+        repr(df)
+
+        idx = Index(['abc', u('\u03c3a'), 'aegdvg'])
+        ser = Series(np.random.randn(len(idx)), idx)
+        rs = repr(ser).split('\n')
+        line_len = len(rs[0])
+        for line in rs[1:]:
+            try:
+                line = line.decode(get_option("display.encoding"))
+            except:
+                pass
+            if not line.startswith('dtype:'):
+                assert len(line) == line_len
+
+        # it works even if sys.stdin is None
+        _stdin = sys.stdin
+        try:
+            sys.stdin = None
+            repr(df)
+        finally:
+            sys.stdin = _stdin
+
+    def test_to_string_unicode_columns(self):
+        df = DataFrame({u('\u03c3'): np.arange(10.)})
+
+        buf = StringIO()
+        df.to_string(buf=buf)
+        buf.getvalue()
+
+        buf = StringIO()
+        df.info(buf=buf)
+        buf.getvalue()
+
+        result = self.frame.to_string()
+        assert isinstance(result, compat.text_type)
+
+    def test_to_string_utf8_columns(self):
+        n = u("\u05d0").encode('utf-8')
+
+        with option_context('display.max_rows', 1):
+            df = DataFrame([1, 2], columns=[n])
+            repr(df)
+
+    def test_to_string_unicode_two(self):
+        dm = DataFrame({u('c/\u03c3'): []})
+        buf = StringIO()
+        dm.to_string(buf)
+
+    def test_to_string_unicode_three(self):
+        dm = DataFrame(['\xc2'])
+        buf = StringIO()
+        dm.to_string(buf)
+
+    def test_to_string_with_formatters(self):
+        df = DataFrame({'int': [1, 2, 3],
+                        'float': [1.0, 2.0, 3.0],
+                        'object': [(1, 2), True, False]},
+                       columns=['int', 'float', 'object'])
+
+        formatters = [('int', lambda x: '0x%x' % x),
+                      ('float', lambda x: '[% 4.1f]' % x),
+                      ('object', lambda x: '-%s-' % str(x))]
+        result = df.to_string(formatters=dict(formatters))
+        result2 = df.to_string(formatters=lzip(*formatters)[1])
+        assert result == (' int float object\n'
+                          '0 0x1 [ 1.0] -(1, 2)-\n'
+                          '1 0x2 [ 2.0] -True-\n'
+                          '2 0x3 [ 3.0] -False-')
+        assert result == result2
+
+    def test_to_string_with_datetime64_monthformatter(self):
+        months = [datetime(2016, 1, 1), datetime(2016, 2, 2)]
+        x = DataFrame({'months': months})
+
+        def format_func(x):
+            return x.strftime('%Y-%m')
+        result = x.to_string(formatters={'months': format_func})
+        expected =
'months\n0 2016-01\n1 2016-02' + assert result.strip() == expected + + def test_to_string_with_datetime64_hourformatter(self): + + x = DataFrame({'hod': pd.to_datetime(['10:10:10.100', '12:12:12.120'], + format='%H:%M:%S.%f')}) + + def format_func(x): + return x.strftime('%H:%M') + + result = x.to_string(formatters={'hod': format_func}) + expected = 'hod\n0 10:10\n1 12:12' + assert result.strip() == expected + + def test_to_string_with_formatters_unicode(self): + df = DataFrame({u('c/\u03c3'): [1, 2, 3]}) + result = df.to_string(formatters={u('c/\u03c3'): lambda x: '%s' % x}) + assert result == u(' c/\u03c3\n') + '0 1\n1 2\n2 3' + + def test_east_asian_unicode_frame(self): + if PY3: + _rep = repr + else: + _rep = unicode # noqa + + # not alighned properly because of east asian width + + # mid col + df = DataFrame({'a': [u'あ', u'いいい', u'う', u'ええええええ'], + 'b': [1, 222, 33333, 4]}, + index=['a', 'bb', 'c', 'ddd']) + expected = (u" a b\na あ 1\n" + u"bb いいい 222\nc う 33333\n" + u"ddd ええええええ 4") + assert _rep(df) == expected + + # last col + df = DataFrame({'a': [1, 222, 33333, 4], + 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, + index=['a', 'bb', 'c', 'ddd']) + expected = (u" a b\na 1 あ\n" + u"bb 222 いいい\nc 33333 う\n" + u"ddd 4 ええええええ") + assert _rep(df) == expected + + # all col + df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], + 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, + index=['a', 'bb', 'c', 'ddd']) + expected = (u" a b\na あああああ あ\n" + u"bb い いいい\nc う う\n" + u"ddd えええ ええええええ") + assert _rep(df) == expected + + # column name + df = DataFrame({u'あああああ': [1, 222, 33333, 4], + 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, + index=['a', 'bb', 'c', 'ddd']) + expected = (u" b あああああ\na あ 1\n" + u"bb いいい 222\nc う 33333\n" + u"ddd ええええええ 4") + assert _rep(df) == expected + + # index + df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], + 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, + index=[u'あああ', u'いいいいいい', u'うう', u'え']) + expected = (u" a b\nあああ あああああ あ\n" + u"いいいいいい い いいい\nうう う う\n" + u"え えええ ええええええ") + assert _rep(df) == expected + + # index name + df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], + 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, + index=pd.Index([u'あ', u'い', u'うう', u'え'], + name=u'おおおお')) + expected = (u" a b\n" + u"おおおお \n" + u"あ あああああ あ\n" + u"い い いいい\n" + u"うう う う\n" + u"え えええ ええええええ") + assert _rep(df) == expected + + # all + df = DataFrame({u'あああ': [u'あああ', u'い', u'う', u'えええええ'], + u'いいいいい': [u'あ', u'いいい', u'う', u'ええ']}, + index=pd.Index([u'あ', u'いいい', u'うう', u'え'], + name=u'お')) + expected = (u" あああ いいいいい\n" + u"お \n" + u"あ あああ あ\n" + u"いいい い いいい\n" + u"うう う う\n" + u"え えええええ ええ") + assert _rep(df) == expected + + # MultiIndex + idx = pd.MultiIndex.from_tuples([(u'あ', u'いい'), (u'う', u'え'), ( + u'おおお', u'かかかか'), (u'き', u'くく')]) + df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], + 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, + index=idx) + expected = (u" a b\n" + u"あ いい あああああ あ\n" + u"う え い いいい\n" + u"おおお かかかか う う\n" + u"き くく えええ ええええええ") + assert _rep(df) == expected + + # truncate + with option_context('display.max_rows', 3, 'display.max_columns', 3): + df = pd.DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], + 'b': [u'あ', u'いいい', u'う', u'ええええええ'], + 'c': [u'お', u'か', u'ききき', u'くくくくくく'], + u'ああああ': [u'さ', u'し', u'す', u'せ']}, + columns=['a', 'b', 'c', u'ああああ']) + + expected = (u" a ... ああああ\n0 あああああ ... さ\n" + u".. ... ... ...\n3 えええ ... せ\n" + u"\n[4 rows x 4 columns]") + assert _rep(df) == expected + + df.index = [u'あああ', u'いいいい', u'う', 'aaa'] + expected = (u" a ... ああああ\nあああ あああああ ... 
さ\n" + u".. ... ... ...\naaa えええ ... せ\n" + u"\n[4 rows x 4 columns]") + assert _rep(df) == expected + + # Emable Unicode option ----------------------------------------- + with option_context('display.unicode.east_asian_width', True): + + # mid col + df = DataFrame({'a': [u'あ', u'いいい', u'う', u'ええええええ'], + 'b': [1, 222, 33333, 4]}, + index=['a', 'bb', 'c', 'ddd']) + expected = (u" a b\na あ 1\n" + u"bb いいい 222\nc う 33333\n" + u"ddd ええええええ 4") + assert _rep(df) == expected + + # last col + df = DataFrame({'a': [1, 222, 33333, 4], + 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, + index=['a', 'bb', 'c', 'ddd']) + expected = (u" a b\na 1 あ\n" + u"bb 222 いいい\nc 33333 う\n" + u"ddd 4 ええええええ") + assert _rep(df) == expected + + # all col + df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], + 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, + index=['a', 'bb', 'c', 'ddd']) + expected = (u" a b\n" + u"a あああああ あ\n" + u"bb い いいい\n" + u"c う う\n" + u"ddd えええ ええええええ") + assert _rep(df) == expected + + # column name + df = DataFrame({u'あああああ': [1, 222, 33333, 4], + 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, + index=['a', 'bb', 'c', 'ddd']) + expected = (u" b あああああ\n" + u"a あ 1\n" + u"bb いいい 222\n" + u"c う 33333\n" + u"ddd ええええええ 4") + assert _rep(df) == expected + + # index + df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], + 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, + index=[u'あああ', u'いいいいいい', u'うう', u'え']) + expected = (u" a b\n" + u"あああ あああああ あ\n" + u"いいいいいい い いいい\n" + u"うう う う\n" + u"え えええ ええええええ") + assert _rep(df) == expected + + # index name + df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], + 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, + index=pd.Index([u'あ', u'い', u'うう', u'え'], + name=u'おおおお')) + expected = (u" a b\n" + u"おおおお \n" + u"あ あああああ あ\n" + u"い い いいい\n" + u"うう う う\n" + u"え えええ ええええええ") + assert _rep(df) == expected + + # all + df = DataFrame({u'あああ': [u'あああ', u'い', u'う', u'えええええ'], + u'いいいいい': [u'あ', u'いいい', u'う', u'ええ']}, + index=pd.Index([u'あ', u'いいい', u'うう', u'え'], + name=u'お')) + expected = (u" あああ いいいいい\n" + u"お \n" + u"あ あああ あ\n" + u"いいい い いいい\n" + u"うう う う\n" + u"え えええええ ええ") + assert _rep(df) == expected + + # MultiIndex + idx = pd.MultiIndex.from_tuples([(u'あ', u'いい'), (u'う', u'え'), ( + u'おおお', u'かかかか'), (u'き', u'くく')]) + df = DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], + 'b': [u'あ', u'いいい', u'う', u'ええええええ']}, + index=idx) + expected = (u" a b\n" + u"あ いい あああああ あ\n" + u"う え い いいい\n" + u"おおお かかかか う う\n" + u"き くく えええ ええええええ") + assert _rep(df) == expected + + # truncate + with option_context('display.max_rows', 3, 'display.max_columns', + 3): + + df = pd.DataFrame({'a': [u'あああああ', u'い', u'う', u'えええ'], + 'b': [u'あ', u'いいい', u'う', u'ええええええ'], + 'c': [u'お', u'か', u'ききき', u'くくくくくく'], + u'ああああ': [u'さ', u'し', u'す', u'せ']}, + columns=['a', 'b', 'c', u'ああああ']) + + expected = (u" a ... ああああ\n" + u"0 あああああ ... さ\n" + u".. ... ... ...\n" + u"3 えええ ... せ\n" + u"\n[4 rows x 4 columns]") + assert _rep(df) == expected + + df.index = [u'あああ', u'いいいい', u'う', 'aaa'] + expected = (u" a ... ああああ\n" + u"あああ あああああ ... さ\n" + u"... ... ... ...\n" + u"aaa えええ ... 
せ\n" + u"\n[4 rows x 4 columns]") + assert _rep(df) == expected + + # ambiguous unicode + df = DataFrame({u'あああああ': [1, 222, 33333, 4], + 'b': [u'あ', u'いいい', u'¡¡', u'ええええええ']}, + index=['a', 'bb', 'c', '¡¡¡']) + expected = (u" b あああああ\n" + u"a あ 1\n" + u"bb いいい 222\n" + u"c ¡¡ 33333\n" + u"¡¡¡ ええええええ 4") + assert _rep(df) == expected + + def test_to_string_buffer_all_unicode(self): + buf = StringIO() + + empty = DataFrame({u('c/\u03c3'): Series()}) + nonempty = DataFrame({u('c/\u03c3'): Series([1, 2, 3])}) + + print(empty, file=buf) + print(nonempty, file=buf) + + # this should work + buf.getvalue() + + def test_to_string_with_col_space(self): + df = DataFrame(np.random.random(size=(1, 3))) + c10 = len(df.to_string(col_space=10).split("\n")[1]) + c20 = len(df.to_string(col_space=20).split("\n")[1]) + c30 = len(df.to_string(col_space=30).split("\n")[1]) + assert c10 < c20 < c30 + + # GH 8230 + # col_space wasn't being applied with header=False + with_header = df.to_string(col_space=20) + with_header_row1 = with_header.splitlines()[1] + no_header = df.to_string(col_space=20, header=False) + assert len(with_header_row1) == len(no_header) + + def test_to_string_truncate_indices(self): + for index in [tm.makeStringIndex, tm.makeUnicodeIndex, tm.makeIntIndex, + tm.makeDateIndex, tm.makePeriodIndex]: + for column in [tm.makeStringIndex]: + for h in [10, 20]: + for w in [10, 20]: + with option_context("display.expand_frame_repr", + False): + df = DataFrame(index=index(h), columns=column(w)) + with option_context("display.max_rows", 15): + if h == 20: + assert has_vertically_truncated_repr(df) + else: + assert not has_vertically_truncated_repr( + df) + with option_context("display.max_columns", 15): + if w == 20: + assert has_horizontally_truncated_repr(df) + else: + assert not ( + has_horizontally_truncated_repr(df)) + with option_context("display.max_rows", 15, + "display.max_columns", 15): + if h == 20 and w == 20: + assert has_doubly_truncated_repr(df) + else: + assert not has_doubly_truncated_repr( + df) + + def test_to_string_truncate_multilevel(self): + arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], + ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] + df = DataFrame(index=arrays, columns=arrays) + with option_context("display.max_rows", 7, "display.max_columns", 7): + assert has_doubly_truncated_repr(df) + + def test_truncate_with_different_dtypes(self): + + # 11594, 12045 + # when truncated the dtypes of the splits can differ + + # 11594 + import datetime + s = Series([datetime.datetime(2012, 1, 1)] * 10 + + [datetime.datetime(1012, 1, 2)] + [ + datetime.datetime(2012, 1, 3)] * 10) + + with pd.option_context('display.max_rows', 8): + result = str(s) + assert 'object' in result + + # 12045 + df = DataFrame({'text': ['some words'] + [None] * 9}) + + with pd.option_context('display.max_rows', 8, + 'display.max_columns', 3): + result = str(df) + assert 'None' in result + assert 'NaN' not in result + + def test_datetimelike_frame(self): + + # GH 12211 + df = DataFrame( + {'date': [pd.Timestamp('20130101').tz_localize('UTC')] + + [pd.NaT] * 5}) + + with option_context("display.max_rows", 5): + result = str(df) + assert '2013-01-01 00:00:00+00:00' in result + assert 'NaT' in result + assert '...' 
in result + assert '[6 rows x 1 columns]' in result + + dts = [pd.Timestamp('2011-01-01', tz='US/Eastern')] * 5 + [pd.NaT] * 5 + df = pd.DataFrame({"dt": dts, + "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}) + with option_context('display.max_rows', 5): + expected = (' dt x\n' + '0 2011-01-01 00:00:00-05:00 1\n' + '1 2011-01-01 00:00:00-05:00 2\n' + '.. ... ..\n' + '8 NaT 9\n' + '9 NaT 10\n\n' + '[10 rows x 2 columns]') + assert repr(df) == expected + + dts = [pd.NaT] * 5 + [pd.Timestamp('2011-01-01', tz='US/Eastern')] * 5 + df = pd.DataFrame({"dt": dts, + "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}) + with option_context('display.max_rows', 5): + expected = (' dt x\n' + '0 NaT 1\n' + '1 NaT 2\n' + '.. ... ..\n' + '8 2011-01-01 00:00:00-05:00 9\n' + '9 2011-01-01 00:00:00-05:00 10\n\n' + '[10 rows x 2 columns]') + assert repr(df) == expected + + dts = ([pd.Timestamp('2011-01-01', tz='Asia/Tokyo')] * 5 + + [pd.Timestamp('2011-01-01', tz='US/Eastern')] * 5) + df = pd.DataFrame({"dt": dts, + "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}) + with option_context('display.max_rows', 5): + expected = (' dt x\n' + '0 2011-01-01 00:00:00+09:00 1\n' + '1 2011-01-01 00:00:00+09:00 2\n' + '.. ... ..\n' + '8 2011-01-01 00:00:00-05:00 9\n' + '9 2011-01-01 00:00:00-05:00 10\n\n' + '[10 rows x 2 columns]') + assert repr(df) == expected + + def test_nonunicode_nonascii_alignment(self): + df = DataFrame([["aa\xc3\xa4\xc3\xa4", 1], ["bbbb", 2]]) + rep_str = df.to_string() + lines = rep_str.split('\n') + assert len(lines[1]) == len(lines[2]) + + def test_unicode_problem_decoding_as_ascii(self): + dm = DataFrame({u('c/\u03c3'): Series({'test': np.nan})}) + compat.text_type(dm.to_string()) + + def test_string_repr_encoding(self): + filepath = tm.get_data_path('unicode_series.csv') + df = pd.read_csv(filepath, header=None, encoding='latin1') + repr(df) + repr(df[1]) + + def test_repr_corner(self): + # representing infs poses no problems + df = DataFrame({'foo': [-np.inf, np.inf]}) + repr(df) + + def test_frame_info_encoding(self): + index = ['\'Til There Was You (1997)', + 'ldum klaka (Cold Fever) (1994)'] + fmt.set_option('display.max_rows', 1) + df = DataFrame(columns=['a', 'b', 'c'], index=index) + repr(df) + repr(df.T) + fmt.set_option('display.max_rows', 200) + + def test_pprint_thing(self): + from pandas.io.formats.printing import pprint_thing as pp_t + + if PY3: + pytest.skip("doesn't work on Python 3") + + assert pp_t('a') == u('a') + assert pp_t(u('a')) == u('a') + assert pp_t(None) == 'None' + assert pp_t(u('\u05d0'), quote_strings=True) == u("u'\u05d0'") + assert pp_t(u('\u05d0'), quote_strings=False) == u('\u05d0') + assert (pp_t((u('\u05d0'), u('\u05d1')), quote_strings=True) == + u("(u'\u05d0', u'\u05d1')")) + assert (pp_t((u('\u05d0'), (u('\u05d1'), u('\u05d2'))), + quote_strings=True) == u("(u'\u05d0', " + "(u'\u05d1', u'\u05d2'))")) + assert (pp_t(('foo', u('\u05d0'), (u('\u05d0'), u('\u05d0'))), + quote_strings=True) == u("(u'foo', u'\u05d0', " + "(u'\u05d0', u'\u05d0'))")) + + # gh-2038: escape embedded tabs in string + assert "\t" not in pp_t("a\tb", escape_chars=("\t", )) + + def test_wide_repr(self): + with option_context('mode.sim_interactive', True, + 'display.show_dimensions', True): + max_cols = get_option('display.max_columns') + df = DataFrame(tm.rands_array(25, size=(10, max_cols - 1))) + set_option('display.expand_frame_repr', False) + rep_str = repr(df) + + assert "10 rows x %d columns" % (max_cols - 1) in rep_str + set_option('display.expand_frame_repr', True) + wide_repr = repr(df) + assert rep_str != 
wide_repr + + with option_context('display.width', 120): + wider_repr = repr(df) + assert len(wider_repr) < len(wide_repr) + + reset_option('display.expand_frame_repr') + + def test_wide_repr_wide_columns(self): + with option_context('mode.sim_interactive', True): + df = DataFrame(np.random.randn(5, 3), + columns=['a' * 90, 'b' * 90, 'c' * 90]) + rep_str = repr(df) + + assert len(rep_str.splitlines()) == 20 + + def test_wide_repr_named(self): + with option_context('mode.sim_interactive', True): + max_cols = get_option('display.max_columns') + df = DataFrame(tm.rands_array(25, size=(10, max_cols - 1))) + df.index.name = 'DataFrame Index' + set_option('display.expand_frame_repr', False) + + rep_str = repr(df) + set_option('display.expand_frame_repr', True) + wide_repr = repr(df) + assert rep_str != wide_repr + + with option_context('display.width', 150): + wider_repr = repr(df) + assert len(wider_repr) < len(wide_repr) + + for line in wide_repr.splitlines()[1::13]: + assert 'DataFrame Index' in line + + reset_option('display.expand_frame_repr') + + def test_wide_repr_multiindex(self): + with option_context('mode.sim_interactive', True): + midx = MultiIndex.from_arrays(tm.rands_array(5, size=(2, 10))) + max_cols = get_option('display.max_columns') + df = DataFrame(tm.rands_array(25, size=(10, max_cols - 1)), + index=midx) + df.index.names = ['Level 0', 'Level 1'] + set_option('display.expand_frame_repr', False) + rep_str = repr(df) + set_option('display.expand_frame_repr', True) + wide_repr = repr(df) + assert rep_str != wide_repr + + with option_context('display.width', 150): + wider_repr = repr(df) + assert len(wider_repr) < len(wide_repr) + + for line in wide_repr.splitlines()[1::13]: + assert 'Level 0 Level 1' in line + + reset_option('display.expand_frame_repr') + + def test_wide_repr_multiindex_cols(self): + with option_context('mode.sim_interactive', True): + max_cols = get_option('display.max_columns') + midx = MultiIndex.from_arrays(tm.rands_array(5, size=(2, 10))) + mcols = MultiIndex.from_arrays( + tm.rands_array(3, size=(2, max_cols - 1))) + df = DataFrame(tm.rands_array(25, (10, max_cols - 1)), + index=midx, columns=mcols) + df.index.names = ['Level 0', 'Level 1'] + set_option('display.expand_frame_repr', False) + rep_str = repr(df) + set_option('display.expand_frame_repr', True) + wide_repr = repr(df) + assert rep_str != wide_repr + + with option_context('display.width', 150): + wider_repr = repr(df) + assert len(wider_repr) < len(wide_repr) + + reset_option('display.expand_frame_repr') + + def test_wide_repr_unicode(self): + with option_context('mode.sim_interactive', True): + max_cols = get_option('display.max_columns') + df = DataFrame(tm.rands_array(25, size=(10, max_cols - 1))) + set_option('display.expand_frame_repr', False) + rep_str = repr(df) + set_option('display.expand_frame_repr', True) + wide_repr = repr(df) + assert rep_str != wide_repr + + with option_context('display.width', 150): + wider_repr = repr(df) + assert len(wider_repr) < len(wide_repr) + + reset_option('display.expand_frame_repr') + + def test_wide_repr_wide_long_columns(self): + with option_context('mode.sim_interactive', True): + df = DataFrame({'a': ['a' * 30, 'b' * 30], + 'b': ['c' * 70, 'd' * 80]}) + + result = repr(df) + assert 'ccccc' in result + assert 'ddddd' in result + + def test_long_series(self): + n = 1000 + s = Series( + np.random.randint(-50, 50, n), + index=['s%04d' % x for x in range(n)], dtype='int64') + + import re + str_rep = str(s) + nmatches = len(re.findall('dtype', str_rep)) + 
assert nmatches == 1 + + def test_index_with_nan(self): + # GH 2850 + df = DataFrame({'id1': {0: '1a3', + 1: '9h4'}, + 'id2': {0: np.nan, + 1: 'd67'}, + 'id3': {0: '78d', + 1: '79d'}, + 'value': {0: 123, + 1: 64}}) + + # multi-index + y = df.set_index(['id1', 'id2', 'id3']) + result = y.to_string() + expected = u( + ' value\nid1 id2 id3 \n' + '1a3 NaN 78d 123\n9h4 d67 79d 64') + assert result == expected + + # index + y = df.set_index('id2') + result = y.to_string() + expected = u( + ' id1 id3 value\nid2 \n' + 'NaN 1a3 78d 123\nd67 9h4 79d 64') + assert result == expected + + # with append (this failed in 0.12) + y = df.set_index(['id1', 'id2']).set_index('id3', append=True) + result = y.to_string() + expected = u( + ' value\nid1 id2 id3 \n' + '1a3 NaN 78d 123\n9h4 d67 79d 64') + assert result == expected + + # all-nan in mi + df2 = df.copy() + df2.loc[:, 'id2'] = np.nan + y = df2.set_index('id2') + result = y.to_string() + expected = u( + ' id1 id3 value\nid2 \n' + 'NaN 1a3 78d 123\nNaN 9h4 79d 64') + assert result == expected + + # partial nan in mi + df2 = df.copy() + df2.loc[:, 'id2'] = np.nan + y = df2.set_index(['id2', 'id3']) + result = y.to_string() + expected = u( + ' id1 value\nid2 id3 \n' + 'NaN 78d 1a3 123\n 79d 9h4 64') + assert result == expected + + df = DataFrame({'id1': {0: np.nan, + 1: '9h4'}, + 'id2': {0: np.nan, + 1: 'd67'}, + 'id3': {0: np.nan, + 1: '79d'}, + 'value': {0: 123, + 1: 64}}) + + y = df.set_index(['id1', 'id2', 'id3']) + result = y.to_string() + expected = u( + ' value\nid1 id2 id3 \n' + 'NaN NaN NaN 123\n9h4 d67 79d 64') + assert result == expected + + def test_to_string(self): + + # big mixed + biggie = DataFrame({'A': np.random.randn(200), + 'B': tm.makeStringIndex(200)}, + index=lrange(200)) + + biggie.loc[:20, 'A'] = np.nan + biggie.loc[:20, 'B'] = np.nan + s = biggie.to_string() + + buf = StringIO() + retval = biggie.to_string(buf=buf) + assert retval is None + assert buf.getvalue() == s + + assert isinstance(s, compat.string_types) + + # print in right order + result = biggie.to_string(columns=['B', 'A'], col_space=17, + float_format='%.5f'.__mod__) + lines = result.split('\n') + header = lines[0].strip().split() + joined = '\n'.join([re.sub(r'\s+', ' ', x).strip() for x in lines[1:]]) + recons = read_table(StringIO(joined), names=header, + header=None, sep=' ') + tm.assert_series_equal(recons['B'], biggie['B']) + assert recons['A'].count() == biggie['A'].count() + assert (np.abs(recons['A'].dropna() - + biggie['A'].dropna()) < 0.1).all() + + # expected = ['B', 'A'] + # assert header == expected + + result = biggie.to_string(columns=['A'], col_space=17) + header = result.split('\n')[0].strip().split() + expected = ['A'] + assert header == expected + + biggie.to_string(columns=['B', 'A'], + formatters={'A': lambda x: '%.1f' % x}) + + biggie.to_string(columns=['B', 'A'], float_format=str) + biggie.to_string(columns=['B', 'A'], col_space=12, float_format=str) + + frame = DataFrame(index=np.arange(200)) + frame.to_string() + + def test_to_string_no_header(self): + df = DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}) + + df_s = df.to_string(header=False) + expected = "0 1 4\n1 2 5\n2 3 6" + + assert df_s == expected + + def test_to_string_specified_header(self): + df = DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}) + + df_s = df.to_string(header=['X', 'Y']) + expected = ' X Y\n0 1 4\n1 2 5\n2 3 6' + + assert df_s == expected + + with pytest.raises(ValueError): + df.to_string(header=['X']) + + def test_to_string_no_index(self): + df = DataFrame({'x': [1, 2, 3], 
'y': [4, 5, 6]}) + + df_s = df.to_string(index=False) + expected = "x y\n1 4\n2 5\n3 6" + + assert df_s == expected + + def test_to_string_line_width_no_index(self): + df = DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}) + + df_s = df.to_string(line_width=1, index=False) + expected = "x \\\n1 \n2 \n3 \n\ny \n4 \n5 \n6" + + assert df_s == expected + + def test_to_string_float_formatting(self): + tm.reset_display_options() + fmt.set_option('display.precision', 5, 'display.column_space', 12, + 'display.notebook_repr_html', False) + + df = DataFrame({'x': [0, 0.25, 3456.000, 12e+45, 1.64e+6, 1.7e+8, + 1.253456, np.pi, -1e6]}) + + df_s = df.to_string() + + # Python 2.5 just wants me to be sad. And debian 32-bit + # sys.version_info[0] == 2 and sys.version_info[1] < 6: + if _three_digit_exp(): + expected = (' x\n0 0.00000e+000\n1 2.50000e-001\n' + '2 3.45600e+003\n3 1.20000e+046\n4 1.64000e+006\n' + '5 1.70000e+008\n6 1.25346e+000\n7 3.14159e+000\n' + '8 -1.00000e+006') + else: + expected = (' x\n0 0.00000e+00\n1 2.50000e-01\n' + '2 3.45600e+03\n3 1.20000e+46\n4 1.64000e+06\n' + '5 1.70000e+08\n6 1.25346e+00\n7 3.14159e+00\n' + '8 -1.00000e+06') + assert df_s == expected + + df = DataFrame({'x': [3234, 0.253]}) + df_s = df.to_string() + + expected = (' x\n' '0 3234.000\n' '1 0.253') + assert df_s == expected + + tm.reset_display_options() + assert get_option("display.precision") == 6 + + df = DataFrame({'x': [1e9, 0.2512]}) + df_s = df.to_string() + # Python 2.5 just wants me to be sad. And debian 32-bit + # sys.version_info[0] == 2 and sys.version_info[1] < 6: + if _three_digit_exp(): + expected = (' x\n' + '0 1.000000e+009\n' + '1 2.512000e-001') + else: + expected = (' x\n' + '0 1.000000e+09\n' + '1 2.512000e-01') + assert df_s == expected + + def test_to_string_small_float_values(self): + df = DataFrame({'a': [1.5, 1e-17, -5.5e-7]}) + + result = df.to_string() + # sadness per above + if '%.4g' % 1.7e8 == '1.7e+008': + expected = (' a\n' + '0 1.500000e+000\n' + '1 1.000000e-017\n' + '2 -5.500000e-007') + else: + expected = (' a\n' + '0 1.500000e+00\n' + '1 1.000000e-17\n' + '2 -5.500000e-07') + assert result == expected + + # but not all exactly zero + df = df * 0 + result = df.to_string() + expected = (' 0\n' '0 0\n' '1 0\n' '2 -0') + + def test_to_string_float_index(self): + index = Index([1.5, 2, 3, 4, 5]) + df = DataFrame(lrange(5), index=index) + + result = df.to_string() + expected = (' 0\n' + '1.5 0\n' + '2.0 1\n' + '3.0 2\n' + '4.0 3\n' + '5.0 4') + assert result == expected + + def test_to_string_ascii_error(self): + data = [('0 ', u(' .gitignore '), u(' 5 '), + ' \xe2\x80\xa2\xe2\x80\xa2\xe2\x80' + '\xa2\xe2\x80\xa2\xe2\x80\xa2')] + df = DataFrame(data) + + # it works! 
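+        # (smoke test only: repr() of a frame holding raw non-ASCII byte
+        # strings must not raise; there is no expected output to compare)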
+ repr(df) + + def test_to_string_int_formatting(self): + df = DataFrame({'x': [-15, 20, 25, -35]}) + assert issubclass(df['x'].dtype.type, np.integer) + + output = df.to_string() + expected = (' x\n' '0 -15\n' '1 20\n' '2 25\n' '3 -35') + assert output == expected + + def test_to_string_index_formatter(self): + df = DataFrame([lrange(5), lrange(5, 10), lrange(10, 15)]) + + rs = df.to_string(formatters={'__index__': lambda x: 'abc' [x]}) + + xp = """\ + 0 1 2 3 4 +a 0 1 2 3 4 +b 5 6 7 8 9 +c 10 11 12 13 14\ +""" + + assert rs == xp + + def test_to_string_left_justify_cols(self): + tm.reset_display_options() + df = DataFrame({'x': [3234, 0.253]}) + df_s = df.to_string(justify='left') + expected = (' x \n' '0 3234.000\n' '1 0.253') + assert df_s == expected + + def test_to_string_format_na(self): + tm.reset_display_options() + df = DataFrame({'A': [np.nan, -1, -2.1234, 3, 4], + 'B': [np.nan, 'foo', 'foooo', 'fooooo', 'bar']}) + result = df.to_string() + + expected = (' A B\n' + '0 NaN NaN\n' + '1 -1.0000 foo\n' + '2 -2.1234 foooo\n' + '3 3.0000 fooooo\n' + '4 4.0000 bar') + assert result == expected + + df = DataFrame({'A': [np.nan, -1., -2., 3., 4.], + 'B': [np.nan, 'foo', 'foooo', 'fooooo', 'bar']}) + result = df.to_string() + + expected = (' A B\n' + '0 NaN NaN\n' + '1 -1.0 foo\n' + '2 -2.0 foooo\n' + '3 3.0 fooooo\n' + '4 4.0 bar') + assert result == expected + + def test_to_string_line_width(self): + df = DataFrame(123, lrange(10, 15), lrange(30)) + s = df.to_string(line_width=80) + assert max(len(l) for l in s.split('\n')) == 80 + + def test_show_dimensions(self): + df = DataFrame(123, lrange(10, 15), lrange(30)) + + with option_context('display.max_rows', 10, 'display.max_columns', 40, + 'display.width', 500, 'display.expand_frame_repr', + 'info', 'display.show_dimensions', True): + assert '5 rows' in str(df) + assert '5 rows' in df._repr_html_() + with option_context('display.max_rows', 10, 'display.max_columns', 40, + 'display.width', 500, 'display.expand_frame_repr', + 'info', 'display.show_dimensions', False): + assert '5 rows' not in str(df) + assert '5 rows' not in df._repr_html_() + with option_context('display.max_rows', 2, 'display.max_columns', 2, + 'display.width', 500, 'display.expand_frame_repr', + 'info', 'display.show_dimensions', 'truncate'): + assert '5 rows' in str(df) + assert '5 rows' in df._repr_html_() + with option_context('display.max_rows', 10, 'display.max_columns', 40, + 'display.width', 500, 'display.expand_frame_repr', + 'info', 'display.show_dimensions', 'truncate'): + assert '5 rows' not in str(df) + assert '5 rows' not in df._repr_html_() + + def test_repr_html(self): + self.frame._repr_html_() + + fmt.set_option('display.max_rows', 1, 'display.max_columns', 1) + self.frame._repr_html_() + + fmt.set_option('display.notebook_repr_html', False) + self.frame._repr_html_() + + tm.reset_display_options() + + df = DataFrame([[1, 2], [3, 4]]) + fmt.set_option('display.show_dimensions', True) + assert '2 rows' in df._repr_html_() + fmt.set_option('display.show_dimensions', False) + assert '2 rows' not in df._repr_html_() + + tm.reset_display_options() + + def test_repr_html_wide(self): + max_cols = get_option('display.max_columns') + df = DataFrame(tm.rands_array(25, size=(10, max_cols - 1))) + reg_repr = df._repr_html_() + assert "..." not in reg_repr + + wide_df = DataFrame(tm.rands_array(25, size=(10, max_cols + 1))) + wide_repr = wide_df._repr_html_() + assert "..." 
in wide_repr + + def test_repr_html_wide_multiindex_cols(self): + max_cols = get_option('display.max_columns') + + mcols = MultiIndex.from_product([np.arange(max_cols // 2), + ['foo', 'bar']], + names=['first', 'second']) + df = DataFrame(tm.rands_array(25, size=(10, len(mcols))), + columns=mcols) + reg_repr = df._repr_html_() + assert '...' not in reg_repr + + mcols = MultiIndex.from_product((np.arange(1 + (max_cols // 2)), + ['foo', 'bar']), + names=['first', 'second']) + df = DataFrame(tm.rands_array(25, size=(10, len(mcols))), + columns=mcols) + wide_repr = df._repr_html_() + assert '...' in wide_repr + + def test_repr_html_long(self): + with option_context('display.max_rows', 60): + max_rows = get_option('display.max_rows') + h = max_rows - 1 + df = DataFrame({'A': np.arange(1, 1 + h), + 'B': np.arange(41, 41 + h)}) + reg_repr = df._repr_html_() + assert '..' not in reg_repr + assert str(41 + max_rows // 2) in reg_repr + + h = max_rows + 1 + df = DataFrame({'A': np.arange(1, 1 + h), + 'B': np.arange(41, 41 + h)}) + long_repr = df._repr_html_() + assert '..' in long_repr + assert str(41 + max_rows // 2) not in long_repr + assert u('%d rows ') % h in long_repr + assert u('2 columns') in long_repr + + def test_repr_html_float(self): + with option_context('display.max_rows', 60): + + max_rows = get_option('display.max_rows') + h = max_rows - 1 + df = DataFrame({'idx': np.linspace(-10, 10, h), + 'A': np.arange(1, 1 + h), + 'B': np.arange(41, 41 + h)}).set_index('idx') + reg_repr = df._repr_html_() + assert '..' not in reg_repr + assert str(40 + h) in reg_repr + + h = max_rows + 1 + df = DataFrame({'idx': np.linspace(-10, 10, h), + 'A': np.arange(1, 1 + h), + 'B': np.arange(41, 41 + h)}).set_index('idx') + long_repr = df._repr_html_() + assert '..' in long_repr + assert '31' not in long_repr + assert u('%d rows ') % h in long_repr + assert u('2 columns') in long_repr + + def test_repr_html_long_multiindex(self): + max_rows = get_option('display.max_rows') + max_L1 = max_rows // 2 + + tuples = list(itertools.product(np.arange(max_L1), ['foo', 'bar'])) + idx = MultiIndex.from_tuples(tuples, names=['first', 'second']) + df = DataFrame(np.random.randn(max_L1 * 2, 2), index=idx, + columns=['A', 'B']) + reg_repr = df._repr_html_() + assert '...' not in reg_repr + + tuples = list(itertools.product(np.arange(max_L1 + 1), ['foo', 'bar'])) + idx = MultiIndex.from_tuples(tuples, names=['first', 'second']) + df = DataFrame(np.random.randn((max_L1 + 1) * 2, 2), index=idx, + columns=['A', 'B']) + long_repr = df._repr_html_() + assert '...' in long_repr + + def test_repr_html_long_and_wide(self): + max_cols = get_option('display.max_columns') + max_rows = get_option('display.max_rows') + + h, w = max_rows - 1, max_cols - 1 + df = DataFrame(dict((k, np.arange(1, 1 + h)) for k in np.arange(w))) + assert '...' not in df._repr_html_() + + h, w = max_rows + 1, max_cols + 1 + df = DataFrame(dict((k, np.arange(1, 1 + h)) for k in np.arange(w))) + assert '...' 
in df._repr_html_() + + def test_info_repr(self): + max_rows = get_option('display.max_rows') + max_cols = get_option('display.max_columns') + # Long + h, w = max_rows + 1, max_cols - 1 + df = DataFrame(dict((k, np.arange(1, 1 + h)) for k in np.arange(w))) + assert has_vertically_truncated_repr(df) + with option_context('display.large_repr', 'info'): + assert has_info_repr(df) + + # Wide + h, w = max_rows - 1, max_cols + 1 + df = DataFrame(dict((k, np.arange(1, 1 + h)) for k in np.arange(w))) + assert has_horizontally_truncated_repr(df) + with option_context('display.large_repr', 'info'): + assert has_info_repr(df) + + def test_info_repr_max_cols(self): + # GH #6939 + df = DataFrame(np.random.randn(10, 5)) + with option_context('display.large_repr', 'info', + 'display.max_columns', 1, + 'display.max_info_columns', 4): + assert has_non_verbose_info_repr(df) + + with option_context('display.large_repr', 'info', + 'display.max_columns', 1, + 'display.max_info_columns', 5): + assert not has_non_verbose_info_repr(df) + + # test verbose overrides + # fmt.set_option('display.max_info_columns', 4) # exceeded + + def test_info_repr_html(self): + max_rows = get_option('display.max_rows') + max_cols = get_option('display.max_columns') + # Long + h, w = max_rows + 1, max_cols - 1 + df = DataFrame(dict((k, np.arange(1, 1 + h)) for k in np.arange(w))) + assert r'<class' not in df._repr_html_() + with option_context('display.large_repr', 'info'): + assert r'<class' in df._repr_html_() + + # Wide + h, w = max_rows - 1, max_cols + 1 + df = DataFrame(dict((k, np.arange(1, 1 + h)) for k in np.arange(w))) + assert ' 1e6 and something that normally formats to + # having length > display.precision + 6 + df = pd.DataFrame(dict(x=[12345.6789])) + assert str(df) == ' x\n0 12345.6789' + df = pd.DataFrame(dict(x=[2e6])) + assert str(df) == ' x\n0 2000000.0' + df = pd.DataFrame(dict(x=[12345.6789, 2e6])) + assert str(df) == ' x\n0 1.2346e+04\n1 2.0000e+06' + + +class TestRepr_timedelta64(object): + + def test_none(self): + delta_1d = pd.to_timedelta(1, unit='D') + delta_0d = pd.to_timedelta(0, unit='D') + delta_1s = pd.to_timedelta(1, unit='s') + delta_500ms = pd.to_timedelta(500, unit='ms') + + drepr = lambda x: x._repr_base() + assert drepr(delta_1d) == "1 days" + assert drepr(-delta_1d) == "-1 days" + assert drepr(delta_0d) == "0 days" + assert drepr(delta_1s) == "0 days 00:00:01" + assert drepr(delta_500ms) == "0 days 00:00:00.500000" + assert drepr(delta_1d + delta_1s) == "1 days 00:00:01" + assert drepr(delta_1d + delta_500ms) == "1 days 00:00:00.500000" + + def test_even_day(self): + delta_1d = pd.to_timedelta(1, unit='D') + delta_0d = pd.to_timedelta(0, unit='D') + delta_1s = pd.to_timedelta(1, unit='s') + delta_500ms = pd.to_timedelta(500, unit='ms') + + drepr = lambda x: x._repr_base(format='even_day') + assert drepr(delta_1d) == "1 days" + assert drepr(-delta_1d) == "-1 days" + assert drepr(delta_0d) == "0 days" + assert drepr(delta_1s) == "0 days 00:00:01" + assert drepr(delta_500ms) == "0 days 00:00:00.500000" + assert drepr(delta_1d + delta_1s) == "1 days 00:00:01" + assert drepr(delta_1d + delta_500ms) == "1 days 00:00:00.500000" + + def test_sub_day(self): + delta_1d = pd.to_timedelta(1, unit='D') + delta_0d = pd.to_timedelta(0, unit='D') + delta_1s = pd.to_timedelta(1, unit='s') + delta_500ms = pd.to_timedelta(500, unit='ms') + + drepr = lambda x: x._repr_base(format='sub_day') + assert drepr(delta_1d) == "1 days" + assert drepr(-delta_1d) == "-1 days" + assert drepr(delta_0d) == "00:00:00" + assert 
drepr(delta_1s) == "00:00:01" + assert drepr(delta_500ms) == "00:00:00.500000" + assert drepr(delta_1d + delta_1s) == "1 days 00:00:01" + assert drepr(delta_1d + delta_500ms) == "1 days 00:00:00.500000" + + def test_long(self): + delta_1d = pd.to_timedelta(1, unit='D') + delta_0d = pd.to_timedelta(0, unit='D') + delta_1s = pd.to_timedelta(1, unit='s') + delta_500ms = pd.to_timedelta(500, unit='ms') + + drepr = lambda x: x._repr_base(format='long') + assert drepr(delta_1d) == "1 days 00:00:00" + assert drepr(-delta_1d) == "-1 days +00:00:00" + assert drepr(delta_0d) == "0 days 00:00:00" + assert drepr(delta_1s) == "0 days 00:00:01" + assert drepr(delta_500ms) == "0 days 00:00:00.500000" + assert drepr(delta_1d + delta_1s) == "1 days 00:00:01" + assert drepr(delta_1d + delta_500ms) == "1 days 00:00:00.500000" + + def test_all(self): + delta_1d = pd.to_timedelta(1, unit='D') + delta_0d = pd.to_timedelta(0, unit='D') + delta_1ns = pd.to_timedelta(1, unit='ns') + + drepr = lambda x: x._repr_base(format='all') + assert drepr(delta_1d) == "1 days 00:00:00.000000000" + assert drepr(delta_0d) == "0 days 00:00:00.000000000" + assert drepr(delta_1ns) == "0 days 00:00:00.000000001" + + +class TestTimedelta64Formatter(object): + + def test_days(self): + x = pd.to_timedelta(list(range(5)) + [pd.NaT], unit='D') + result = fmt.Timedelta64Formatter(x, box=True).get_result() + assert result[0].strip() == "'0 days'" + assert result[1].strip() == "'1 days'" + + result = fmt.Timedelta64Formatter(x[1:2], box=True).get_result() + assert result[0].strip() == "'1 days'" + + result = fmt.Timedelta64Formatter(x, box=False).get_result() + assert result[0].strip() == "0 days" + assert result[1].strip() == "1 days" + + result = fmt.Timedelta64Formatter(x[1:2], box=False).get_result() + assert result[0].strip() == "1 days" + + def test_days_neg(self): + x = pd.to_timedelta(list(range(5)) + [pd.NaT], unit='D') + result = fmt.Timedelta64Formatter(-x, box=True).get_result() + assert result[0].strip() == "'0 days'" + assert result[1].strip() == "'-1 days'" + + def test_subdays(self): + y = pd.to_timedelta(list(range(5)) + [pd.NaT], unit='s') + result = fmt.Timedelta64Formatter(y, box=True).get_result() + assert result[0].strip() == "'00:00:00'" + assert result[1].strip() == "'00:00:01'" + + def test_subdays_neg(self): + y = pd.to_timedelta(list(range(5)) + [pd.NaT], unit='s') + result = fmt.Timedelta64Formatter(-y, box=True).get_result() + assert result[0].strip() == "'00:00:00'" + assert result[1].strip() == "'-1 days +23:59:59'" + + def test_zero(self): + x = pd.to_timedelta(list(range(1)) + [pd.NaT], unit='D') + result = fmt.Timedelta64Formatter(x, box=True).get_result() + assert result[0].strip() == "'0 days'" + + x = pd.to_timedelta(list(range(1)), unit='D') + result = fmt.Timedelta64Formatter(x, box=True).get_result() + assert result[0].strip() == "'0 days'" + + +class TestDatetime64Formatter(object): + + def test_mixed(self): + x = Series([datetime(2013, 1, 1), datetime(2013, 1, 1, 12), pd.NaT]) + result = fmt.Datetime64Formatter(x).get_result() + assert result[0].strip() == "2013-01-01 00:00:00" + assert result[1].strip() == "2013-01-01 12:00:00" + + def test_dates(self): + x = Series([datetime(2013, 1, 1), datetime(2013, 1, 2), pd.NaT]) + result = fmt.Datetime64Formatter(x).get_result() + assert result[0].strip() == "2013-01-01" + assert result[1].strip() == "2013-01-02" + + def test_date_nanos(self): + x = Series([Timestamp(200)]) + result = fmt.Datetime64Formatter(x).get_result() + assert result[0].strip() == 
"1970-01-01 00:00:00.000000200" + + def test_dates_display(self): + + # 10170 + # make sure that we are consistently display date formatting + x = Series(date_range('20130101 09:00:00', periods=5, freq='D')) + x.iloc[1] = np.nan + result = fmt.Datetime64Formatter(x).get_result() + assert result[0].strip() == "2013-01-01 09:00:00" + assert result[1].strip() == "NaT" + assert result[4].strip() == "2013-01-05 09:00:00" + + x = Series(date_range('20130101 09:00:00', periods=5, freq='s')) + x.iloc[1] = np.nan + result = fmt.Datetime64Formatter(x).get_result() + assert result[0].strip() == "2013-01-01 09:00:00" + assert result[1].strip() == "NaT" + assert result[4].strip() == "2013-01-01 09:00:04" + + x = Series(date_range('20130101 09:00:00', periods=5, freq='ms')) + x.iloc[1] = np.nan + result = fmt.Datetime64Formatter(x).get_result() + assert result[0].strip() == "2013-01-01 09:00:00.000" + assert result[1].strip() == "NaT" + assert result[4].strip() == "2013-01-01 09:00:00.004" + + x = Series(date_range('20130101 09:00:00', periods=5, freq='us')) + x.iloc[1] = np.nan + result = fmt.Datetime64Formatter(x).get_result() + assert result[0].strip() == "2013-01-01 09:00:00.000000" + assert result[1].strip() == "NaT" + assert result[4].strip() == "2013-01-01 09:00:00.000004" + + x = Series(date_range('20130101 09:00:00', periods=5, freq='N')) + x.iloc[1] = np.nan + result = fmt.Datetime64Formatter(x).get_result() + assert result[0].strip() == "2013-01-01 09:00:00.000000000" + assert result[1].strip() == "NaT" + assert result[4].strip() == "2013-01-01 09:00:00.000000004" + + def test_datetime64formatter_yearmonth(self): + x = Series([datetime(2016, 1, 1), datetime(2016, 2, 2)]) + + def format_func(x): + return x.strftime('%Y-%m') + + formatter = fmt.Datetime64Formatter(x, formatter=format_func) + result = formatter.get_result() + assert result == ['2016-01', '2016-02'] + + def test_datetime64formatter_hoursecond(self): + + x = Series(pd.to_datetime(['10:10:10.100', '12:12:12.120'], + format='%H:%M:%S.%f')) + + def format_func(x): + return x.strftime('%H:%M') + + formatter = fmt.Datetime64Formatter(x, formatter=format_func) + result = formatter.get_result() + assert result == ['10:10', '12:12'] + + +class TestNaTFormatting(object): + + def test_repr(self): + assert repr(pd.NaT) == "NaT" + + def test_str(self): + assert str(pd.NaT) == "NaT" + + +class TestDatetimeIndexFormat(object): + + def test_datetime(self): + formatted = pd.to_datetime([datetime(2003, 1, 1, 12), pd.NaT]).format() + assert formatted[0] == "2003-01-01 12:00:00" + assert formatted[1] == "NaT" + + def test_date(self): + formatted = pd.to_datetime([datetime(2003, 1, 1), pd.NaT]).format() + assert formatted[0] == "2003-01-01" + assert formatted[1] == "NaT" + + def test_date_tz(self): + formatted = pd.to_datetime([datetime(2013, 1, 1)], utc=True).format() + assert formatted[0] == "2013-01-01 00:00:00+00:00" + + formatted = pd.to_datetime( + [datetime(2013, 1, 1), pd.NaT], utc=True).format() + assert formatted[0] == "2013-01-01 00:00:00+00:00" + + def test_date_explict_date_format(self): + formatted = pd.to_datetime([datetime(2003, 2, 1), pd.NaT]).format( + date_format="%m-%d-%Y", na_rep="UT") + assert formatted[0] == "02-01-2003" + assert formatted[1] == "UT" + + +class TestDatetimeIndexUnicode(object): + + def test_dates(self): + text = str(pd.to_datetime([datetime(2013, 1, 1), datetime(2014, 1, 1) + ])) + assert "['2013-01-01'," in text + assert ", '2014-01-01']" in text + + def test_mixed(self): + text = 
str(pd.to_datetime([datetime(2013, 1, 1), datetime( + 2014, 1, 1, 12), datetime(2014, 1, 1)])) + assert "'2013-01-01 00:00:00'," in text + assert "'2014-01-01 00:00:00']" in text + + +class TestStringRepTimestamp(object): + + def test_no_tz(self): + dt_date = datetime(2013, 1, 2) + assert str(dt_date) == str(Timestamp(dt_date)) + + dt_datetime = datetime(2013, 1, 2, 12, 1, 3) + assert str(dt_datetime) == str(Timestamp(dt_datetime)) + + dt_datetime_us = datetime(2013, 1, 2, 12, 1, 3, 45) + assert str(dt_datetime_us) == str(Timestamp(dt_datetime_us)) + + ts_nanos_only = Timestamp(200) + assert str(ts_nanos_only) == "1970-01-01 00:00:00.000000200" + + ts_nanos_micros = Timestamp(1200) + assert str(ts_nanos_micros) == "1970-01-01 00:00:00.000001200" + + def test_tz_pytz(self): + tm._skip_if_no_pytz() + + import pytz + + dt_date = datetime(2013, 1, 2, tzinfo=pytz.utc) + assert str(dt_date) == str(Timestamp(dt_date)) + + dt_datetime = datetime(2013, 1, 2, 12, 1, 3, tzinfo=pytz.utc) + assert str(dt_datetime) == str(Timestamp(dt_datetime)) + + dt_datetime_us = datetime(2013, 1, 2, 12, 1, 3, 45, tzinfo=pytz.utc) + assert str(dt_datetime_us) == str(Timestamp(dt_datetime_us)) + + def test_tz_dateutil(self): + tm._skip_if_no_dateutil() + import dateutil + utc = dateutil.tz.tzutc() + + dt_date = datetime(2013, 1, 2, tzinfo=utc) + assert str(dt_date) == str(Timestamp(dt_date)) + + dt_datetime = datetime(2013, 1, 2, 12, 1, 3, tzinfo=utc) + assert str(dt_datetime) == str(Timestamp(dt_datetime)) + + dt_datetime_us = datetime(2013, 1, 2, 12, 1, 3, 45, tzinfo=utc) + assert str(dt_datetime_us) == str(Timestamp(dt_datetime_us)) + + def test_nat_representations(self): + for f in (str, repr, methodcaller('isoformat')): + assert f(pd.NaT) == 'NaT' + + +def test_format_percentiles(): + result = fmt.format_percentiles([0.01999, 0.02001, 0.5, 0.666666, 0.9999]) + expected = ['1.999%', '2.001%', '50%', '66.667%', '99.99%'] + assert result == expected + + result = fmt.format_percentiles([0, 0.5, 0.02001, 0.5, 0.666666, 0.9999]) + expected = ['0%', '50%', '2.0%', '50%', '66.67%', '99.99%'] + assert result == expected + + pytest.raises(ValueError, fmt.format_percentiles, [0.1, np.nan, 0.5]) + pytest.raises(ValueError, fmt.format_percentiles, [-0.001, 0.1, 0.5]) + pytest.raises(ValueError, fmt.format_percentiles, [2, 0.1, 0.5]) + pytest.raises(ValueError, fmt.format_percentiles, [0.1, 0.5, 'a']) diff --git a/pandas/tests/io/formats/test_printing.py b/pandas/tests/io/formats/test_printing.py new file mode 100644 index 0000000000000..aae3ba31648ff --- /dev/null +++ b/pandas/tests/io/formats/test_printing.py @@ -0,0 +1,227 @@ +# -*- coding: utf-8 -*- +import pytest + +import numpy as np +import pandas as pd + +from pandas import compat +import pandas.io.formats.printing as printing +import pandas.io.formats.format as fmt +import pandas.core.config as cf + + +def test_adjoin(): + data = [['a', 'b', 'c'], ['dd', 'ee', 'ff'], ['ggg', 'hhh', 'iii']] + expected = 'a dd ggg\nb ee hhh\nc ff iii' + + adjoined = printing.adjoin(2, *data) + + assert (adjoined == expected) + + +def test_repr_binary_type(): + import string + letters = string.ascii_letters + btype = compat.binary_type + try: + raw = btype(letters, encoding=cf.get_option('display.encoding')) + except TypeError: + raw = btype(letters) + b = compat.text_type(compat.bytes_to_str(raw)) + res = printing.pprint_thing(b, quote_strings=True) + assert res == repr(b) + res = printing.pprint_thing(b, quote_strings=False) + assert res == b + + +class TestFormattBase(object): + + def 
test_adjoin(self): + data = [['a', 'b', 'c'], ['dd', 'ee', 'ff'], ['ggg', 'hhh', 'iii']] + expected = 'a dd ggg\nb ee hhh\nc ff iii' + + adjoined = printing.adjoin(2, *data) + + assert adjoined == expected + + def test_adjoin_unicode(self): + data = [[u'あ', 'b', 'c'], ['dd', u'ええ', 'ff'], ['ggg', 'hhh', u'いいい']] + expected = u'あ dd ggg\nb ええ hhh\nc ff いいい' + adjoined = printing.adjoin(2, *data) + assert adjoined == expected + + adj = fmt.EastAsianTextAdjustment() + + expected = u"""あ dd ggg +b ええ hhh +c ff いいい""" + + adjoined = adj.adjoin(2, *data) + assert adjoined == expected + cols = adjoined.split('\n') + assert adj.len(cols[0]) == 13 + assert adj.len(cols[1]) == 13 + assert adj.len(cols[2]) == 16 + + expected = u"""あ dd ggg +b ええ hhh +c ff いいい""" + + adjoined = adj.adjoin(7, *data) + assert adjoined == expected + cols = adjoined.split('\n') + assert adj.len(cols[0]) == 23 + assert adj.len(cols[1]) == 23 + assert adj.len(cols[2]) == 26 + + def test_justify(self): + adj = fmt.EastAsianTextAdjustment() + + def just(x, *args, **kwargs): + # wrapper to test single str + return adj.justify([x], *args, **kwargs)[0] + + assert just('abc', 5, mode='left') == 'abc ' + assert just('abc', 5, mode='center') == ' abc ' + assert just('abc', 5, mode='right') == ' abc' + assert just(u'abc', 5, mode='left') == 'abc ' + assert just(u'abc', 5, mode='center') == ' abc ' + assert just(u'abc', 5, mode='right') == ' abc' + + assert just(u'パンダ', 5, mode='left') == u'パンダ' + assert just(u'パンダ', 5, mode='center') == u'パンダ' + assert just(u'パンダ', 5, mode='right') == u'パンダ' + + assert just(u'パンダ', 10, mode='left') == u'パンダ ' + assert just(u'パンダ', 10, mode='center') == u' パンダ ' + assert just(u'パンダ', 10, mode='right') == u' パンダ' + + def test_east_asian_len(self): + adj = fmt.EastAsianTextAdjustment() + + assert adj.len('abc') == 3 + assert adj.len(u'abc') == 3 + + assert adj.len(u'パンダ') == 6 + assert adj.len(u'パンダ') == 5 + assert adj.len(u'パンダpanda') == 11 + assert adj.len(u'パンダpanda') == 10 + + def test_ambiguous_width(self): + adj = fmt.EastAsianTextAdjustment() + assert adj.len(u'¡¡ab') == 4 + + with cf.option_context('display.unicode.ambiguous_as_wide', True): + adj = fmt.EastAsianTextAdjustment() + assert adj.len(u'¡¡ab') == 6 + + data = [[u'あ', 'b', 'c'], ['dd', u'ええ', 'ff'], + ['ggg', u'¡¡ab', u'いいい']] + expected = u'あ dd ggg \nb ええ ¡¡ab\nc ff いいい' + adjoined = adj.adjoin(2, *data) + assert adjoined == expected + + +class TestTableSchemaRepr(object): + + @classmethod + def setup_class(cls): + pytest.importorskip('IPython') + try: + import mock + except ImportError: + try: + from unittest import mock + except ImportError: + pytest.skip("Mock is not installed") + cls.mock = mock + from IPython.core.interactiveshell import InteractiveShell + cls.display_formatter = InteractiveShell.instance().display_formatter + + def test_publishes(self): + + df = pd.DataFrame({"A": [1, 2]}) + objects = [df['A'], df, df] # dataframe / series + expected_keys = [ + {'text/plain', 'application/vnd.dataresource+json'}, + {'text/plain', 'text/html', 'application/vnd.dataresource+json'}, + ] + + opt = pd.option_context('display.html.table_schema', True) + for obj, expected in zip(objects, expected_keys): + with opt: + formatted = self.display_formatter.format(obj) + assert set(formatted[0].keys()) == expected + + with_latex = pd.option_context('display.latex.repr', True) + + with opt, with_latex: + formatted = self.display_formatter.format(obj) + + expected = {'text/plain', 'text/html', 'text/latex', + 
'application/vnd.dataresource+json'} + assert set(formatted[0].keys()) == expected + + def test_publishes_not_implemented(self): + # column MultiIndex + # GH 15996 + midx = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b', 'c']]) + df = pd.DataFrame(np.random.randn(5, len(midx)), columns=midx) + + opt = pd.option_context('display.html.table_schema', True) + + with opt: + formatted = self.display_formatter.format(df) + + expected = {'text/plain', 'text/html'} + assert set(formatted[0].keys()) == expected + + def test_config_on(self): + df = pd.DataFrame({"A": [1, 2]}) + with pd.option_context("display.html.table_schema", True): + result = df._repr_data_resource_() + + assert result is not None + + def test_config_default_off(self): + df = pd.DataFrame({"A": [1, 2]}) + with pd.option_context("display.html.table_schema", False): + result = df._repr_data_resource_() + + assert result is None + + def test_enable_data_resource_formatter(self): + # GH 10491 + formatters = self.display_formatter.formatters + mimetype = 'application/vnd.dataresource+json' + + with pd.option_context('display.html.table_schema', True): + assert 'application/vnd.dataresource+json' in formatters + assert formatters[mimetype].enabled + + # still there, just disabled + assert 'application/vnd.dataresource+json' in formatters + assert not formatters[mimetype].enabled + + # able to re-set + with pd.option_context('display.html.table_schema', True): + assert 'application/vnd.dataresource+json' in formatters + assert formatters[mimetype].enabled + # smoke test that it works + self.display_formatter.format(cf) + + +# TODO: fix this broken test + +# def test_console_encode(): +# """ +# On Python 2, if sys.stdin.encoding is None (IPython with zmq frontend) +# common.console_encode should encode things as utf-8. +# """ +# if compat.PY3: +# pytest.skip + +# with tm.stdin_encoding(encoding=None): +# result = printing.console_encode(u"\u05d0") +# expected = u"\u05d0".encode('utf-8') +# assert (result == expected) diff --git a/pandas/tests/formats/test_style.py b/pandas/tests/io/formats/test_style.py similarity index 62% rename from pandas/tests/formats/test_style.py rename to pandas/tests/io/formats/test_style.py index 2fec04b9c1aa3..ee7356f12f498 100644 --- a/pandas/tests/formats/test_style.py +++ b/pandas/tests/io/formats/test_style.py @@ -1,33 +1,19 @@ -import os -from nose import SkipTest - import copy +import textwrap + +import pytest import numpy as np import pandas as pd from pandas import DataFrame -from pandas.util.testing import TestCase import pandas.util.testing as tm -# Getting failures on a python 2.7 build with -# whenever we try to import jinja, whether it's installed or not. -# so we're explicitly skipping that one *before* we try to import -# jinja. We still need to export the imports as globals, -# since importing Styler tries to import jinja2. 
-job_name = os.environ.get('JOB_NAME', None) -if job_name == '27_slow_nnet_LOCALE': - raise SkipTest("No jinja") -try: - # Do try except on just jinja, so the only reason - # We skip is if jinja can't import, not something else - import jinja2 # noqa -except ImportError: - raise SkipTest("No Jinja2") -from pandas.formats.style import Styler, _get_level_lengths # noqa - - -class TestStyler(TestCase): - - def setUp(self): +jinja2 = pytest.importorskip('jinja2') +from pandas.io.formats.style import Styler, _get_level_lengths # noqa + + +class TestStyler(object): + + def setup_method(self, method): np.random.seed(24) self.s = DataFrame({'A': np.random.permutation(range(6))}) self.df = DataFrame({'A': [0, 1], 'B': np.random.randn(2)}) @@ -47,12 +33,12 @@ def h(x, foo='bar'): ] def test_init_non_pandas(self): - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): Styler([1, 2, 3]) def test_init_series(self): result = Styler(pd.Series([1, 2])) - self.assertEqual(result.data.ndim, 2) + assert result.data.ndim == 2 def test_repr_html_ok(self): self.styler._repr_html_() @@ -61,7 +47,7 @@ def test_update_ctx(self): self.styler._update_ctx(self.attrs) expected = {(0, 0): ['color: red'], (1, 0): ['color: blue']} - self.assertEqual(self.styler.ctx, expected) + assert self.styler.ctx == expected def test_update_ctx_flatten_multi(self): attrs = DataFrame({"A": ['color: red; foo: bar', @@ -69,7 +55,7 @@ def test_update_ctx_flatten_multi(self): self.styler._update_ctx(attrs) expected = {(0, 0): ['color: red', ' foo: bar'], (1, 0): ['color: blue', ' foo: baz']} - self.assertEqual(self.styler.ctx, expected) + assert self.styler.ctx == expected def test_update_ctx_flatten_multi_traliing_semi(self): attrs = DataFrame({"A": ['color: red; foo: bar;', @@ -77,38 +63,38 @@ def test_update_ctx_flatten_multi_traliing_semi(self): self.styler._update_ctx(attrs) expected = {(0, 0): ['color: red', ' foo: bar'], (1, 0): ['color: blue', ' foo: baz']} - self.assertEqual(self.styler.ctx, expected) + assert self.styler.ctx == expected def test_copy(self): s2 = copy.copy(self.styler) - self.assertTrue(self.styler is not s2) - self.assertTrue(self.styler.ctx is s2.ctx) # shallow - self.assertTrue(self.styler._todo is s2._todo) + assert self.styler is not s2 + assert self.styler.ctx is s2.ctx # shallow + assert self.styler._todo is s2._todo self.styler._update_ctx(self.attrs) self.styler.highlight_max() - self.assertEqual(self.styler.ctx, s2.ctx) - self.assertEqual(self.styler._todo, s2._todo) + assert self.styler.ctx == s2.ctx + assert self.styler._todo == s2._todo def test_deepcopy(self): s2 = copy.deepcopy(self.styler) - self.assertTrue(self.styler is not s2) - self.assertTrue(self.styler.ctx is not s2.ctx) - self.assertTrue(self.styler._todo is not s2._todo) + assert self.styler is not s2 + assert self.styler.ctx is not s2.ctx + assert self.styler._todo is not s2._todo self.styler._update_ctx(self.attrs) self.styler.highlight_max() - self.assertNotEqual(self.styler.ctx, s2.ctx) - self.assertEqual(s2._todo, []) - self.assertNotEqual(self.styler._todo, s2._todo) + assert self.styler.ctx != s2.ctx + assert s2._todo == [] + assert self.styler._todo != s2._todo def test_clear(self): s = self.df.style.highlight_max()._compute() - self.assertTrue(len(s.ctx) > 0) - self.assertTrue(len(s._todo) > 0) + assert len(s.ctx) > 0 + assert len(s._todo) > 0 s.clear() - self.assertTrue(len(s.ctx) == 0) - self.assertTrue(len(s._todo) == 0) + assert len(s.ctx) == 0 + assert len(s._todo) == 0 def test_render(self): df = 
pd.DataFrame({"A": [0, 1]}) @@ -132,16 +118,16 @@ def test_set_properties(self): # order is deterministic v = ["color: white", "size: 10px"] expected = {(0, 0): v, (1, 0): v} - self.assertEqual(result.keys(), expected.keys()) + assert result.keys() == expected.keys() for v1, v2 in zip(result.values(), expected.values()): - self.assertEqual(sorted(v1), sorted(v2)) + assert sorted(v1) == sorted(v2) def test_set_properties_subset(self): df = pd.DataFrame({'A': [0, 1]}) result = df.style.set_properties(subset=pd.IndexSlice[0, 'A'], color='white')._compute().ctx expected = {(0, 0): ['color: white']} - self.assertEqual(result, expected) + assert result == expected def test_empty_index_name_doesnt_display(self): # https://github.com/pandas-dev/pandas/pull/12090#issuecomment-180695902 @@ -155,24 +141,21 @@ def test_empty_index_name_doesnt_display(self): 'type': 'th', 'value': 'A', 'is_visible': True, - 'attributes': ["colspan=1"], }, {'class': 'col_heading level0 col1', 'display_value': 'B', 'type': 'th', 'value': 'B', 'is_visible': True, - 'attributes': ["colspan=1"], }, {'class': 'col_heading level0 col2', 'display_value': 'C', 'type': 'th', 'value': 'C', 'is_visible': True, - 'attributes': ["colspan=1"], }]] - self.assertEqual(result['head'], expected) + assert result['head'] == expected def test_index_name(self): # https://github.com/pandas-dev/pandas/issues/11655 @@ -182,17 +165,15 @@ def test_index_name(self): expected = [[{'class': 'blank level0', 'type': 'th', 'value': '', 'display_value': '', 'is_visible': True}, {'class': 'col_heading level0 col0', 'type': 'th', - 'value': 'B', 'display_value': 'B', - 'is_visible': True, 'attributes': ['colspan=1']}, + 'value': 'B', 'display_value': 'B', 'is_visible': True}, {'class': 'col_heading level0 col1', 'type': 'th', - 'value': 'C', 'display_value': 'C', - 'is_visible': True, 'attributes': ['colspan=1']}], + 'value': 'C', 'display_value': 'C', 'is_visible': True}], [{'class': 'index_name level0', 'type': 'th', 'value': 'A'}, {'class': 'blank', 'type': 'th', 'value': ''}, {'class': 'blank', 'type': 'th', 'value': ''}]] - self.assertEqual(result['head'], expected) + assert result['head'] == expected def test_multiindex_name(self): # https://github.com/pandas-dev/pandas/issues/11655 @@ -205,16 +186,14 @@ def test_multiindex_name(self): {'class': 'blank level0', 'type': 'th', 'value': '', 'display_value': '', 'is_visible': True}, {'class': 'col_heading level0 col0', 'type': 'th', - 'value': 'C', 'display_value': 'C', - 'is_visible': True, 'attributes': ['colspan=1'], - }], + 'value': 'C', 'display_value': 'C', 'is_visible': True}], [{'class': 'index_name level0', 'type': 'th', 'value': 'A'}, {'class': 'index_name level1', 'type': 'th', 'value': 'B'}, {'class': 'blank', 'type': 'th', 'value': ''}]] - self.assertEqual(result['head'], expected) + assert result['head'] == expected def test_numeric_columns(self): # https://github.com/pandas-dev/pandas/issues/12125 @@ -226,21 +205,21 @@ def test_apply_axis(self): df = pd.DataFrame({'A': [0, 0], 'B': [1, 1]}) f = lambda x: ['val: %s' % x.max() for v in x] result = df.style.apply(f, axis=1) - self.assertEqual(len(result._todo), 1) - self.assertEqual(len(result.ctx), 0) + assert len(result._todo) == 1 + assert len(result.ctx) == 0 result._compute() expected = {(0, 0): ['val: 1'], (0, 1): ['val: 1'], (1, 0): ['val: 1'], (1, 1): ['val: 1']} - self.assertEqual(result.ctx, expected) + assert result.ctx == expected result = df.style.apply(f, axis=0) expected = {(0, 0): ['val: 0'], 
(0, 1): ['val: 1'], (1, 0): ['val: 0'], (1, 1): ['val: 1']} result._compute() - self.assertEqual(result.ctx, expected) + assert result.ctx == expected result = df.style.apply(f) # default result._compute() - self.assertEqual(result.ctx, expected) + assert result.ctx == expected def test_apply_subset(self): axes = [0, 1] @@ -256,7 +235,7 @@ def test_apply_subset(self): for c, col in enumerate(self.df.columns) if row in self.df.loc[slice_].index and col in self.df.loc[slice_].columns) - self.assertEqual(result, expected) + assert result == expected def test_applymap_subset(self): def f(x): @@ -273,7 +252,7 @@ def f(x): for c, col in enumerate(self.df.columns) if row in self.df.loc[slice_].index and col in self.df.loc[slice_].columns) - self.assertEqual(result, expected) + assert result == expected def test_empty(self): df = pd.DataFrame({'A': [1, 0]}) @@ -284,9 +263,9 @@ def test_empty(self): result = s._translate()['cellstyle'] expected = [{'props': [['color', ' red']], 'selector': 'row0_col0'}, {'props': [['', '']], 'selector': 'row1_col0'}] - self.assertEqual(result, expected) + assert result == expected - def test_bar(self): + def test_bar_align_left(self): df = pd.DataFrame({'A': [0, 1, 2]}) result = df.style.bar()._compute().ctx expected = { @@ -298,7 +277,7 @@ def test_bar(self): 'background: linear-gradient(' '90deg,#d65f5f 100.0%, transparent 0%)'] } - self.assertEqual(result, expected) + assert result == expected result = df.style.bar(color='red', width=50)._compute().ctx expected = { @@ -310,16 +289,16 @@ def test_bar(self): 'background: linear-gradient(' '90deg,red 50.0%, transparent 0%)'] } - self.assertEqual(result, expected) + assert result == expected df['C'] = ['a'] * len(df) result = df.style.bar(color='red', width=50)._compute().ctx - self.assertEqual(result, expected) + assert result == expected df['C'] = df['C'].astype('category') result = df.style.bar(color='red', width=50)._compute().ctx - self.assertEqual(result, expected) + assert result == expected - def test_bar_0points(self): + def test_bar_align_left_0points(self): df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) result = df.style.bar()._compute().ctx expected = {(0, 0): ['width: 10em', ' height: 80%'], @@ -343,7 +322,7 @@ def test_bar_0points(self): (2, 2): ['width: 10em', ' height: 80%', 'background: linear-gradient(90deg,#d65f5f 100.0%' ', transparent 0%)']} - self.assertEqual(result, expected) + assert result == expected result = df.style.bar(axis=1)._compute().ctx expected = {(0, 0): ['width: 10em', ' height: 80%'], @@ -367,73 +346,185 @@ def test_bar_0points(self): (2, 2): ['width: 10em', ' height: 80%', 'background: linear-gradient(90deg,#d65f5f 100.0%' ', transparent 0%)']} - self.assertEqual(result, expected) + assert result == expected + + def test_bar_align_mid_pos_and_neg(self): + df = pd.DataFrame({'A': [-10, 0, 20, 90]}) + + result = df.style.bar(align='mid', color=[ + '#d65f5f', '#5fba7d'])._compute().ctx + + expected = {(0, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 0.0%, #d65f5f 0.0%, ' + '#d65f5f 10.0%, transparent 10.0%)'], + (1, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 10.0%, ' + '#d65f5f 10.0%, #d65f5f 10.0%, ' + 'transparent 10.0%)'], + (2, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 10.0%, #5fba7d 10.0%' + ', #5fba7d 30.0%, transparent 30.0%)'], + (3, 0): ['width: 10em', ' height: 80%', + 
'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 10.0%, ' + '#5fba7d 10.0%, #5fba7d 100.0%, ' + 'transparent 100.0%)']} + + assert result == expected + + def test_bar_align_mid_all_pos(self): + df = pd.DataFrame({'A': [10, 20, 50, 100]}) + + result = df.style.bar(align='mid', color=[ + '#d65f5f', '#5fba7d'])._compute().ctx + + expected = {(0, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 0.0%, #5fba7d 0.0%, ' + '#5fba7d 10.0%, transparent 10.0%)'], + (1, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 0.0%, #5fba7d 0.0%, ' + '#5fba7d 20.0%, transparent 20.0%)'], + (2, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 0.0%, #5fba7d 0.0%, ' + '#5fba7d 50.0%, transparent 50.0%)'], + (3, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 0.0%, #5fba7d 0.0%, ' + '#5fba7d 100.0%, transparent 100.0%)']} + + assert result == expected + + def test_bar_align_mid_all_neg(self): + df = pd.DataFrame({'A': [-100, -60, -30, -20]}) + + result = df.style.bar(align='mid', color=[ + '#d65f5f', '#5fba7d'])._compute().ctx + + expected = {(0, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 0.0%, ' + '#d65f5f 0.0%, #d65f5f 100.0%, ' + 'transparent 100.0%)'], + (1, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 40.0%, ' + '#d65f5f 40.0%, #d65f5f 100.0%, ' + 'transparent 100.0%)'], + (2, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 70.0%, ' + '#d65f5f 70.0%, #d65f5f 100.0%, ' + 'transparent 100.0%)'], + (3, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 80.0%, ' + '#d65f5f 80.0%, #d65f5f 100.0%, ' + 'transparent 100.0%)']} + assert result == expected + + def test_bar_align_zero_pos_and_neg(self): + # See https://github.com/pandas-dev/pandas/pull/14757 + df = pd.DataFrame({'A': [-10, 0, 20, 90]}) + + result = df.style.bar(align='zero', color=[ + '#d65f5f', '#5fba7d'], width=90)._compute().ctx + + expected = {(0, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 45.0%, ' + '#d65f5f 45.0%, #d65f5f 50%, ' + 'transparent 50%)'], + (1, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 50%, ' + '#5fba7d 50%, #5fba7d 50.0%, ' + 'transparent 50.0%)'], + (2, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 50%, #5fba7d 50%, ' + '#5fba7d 60.0%, transparent 60.0%)'], + (3, 0): ['width: 10em', ' height: 80%', + 'background: linear-gradient(90deg, ' + 'transparent 0%, transparent 50%, #5fba7d 50%, ' + '#5fba7d 95.0%, transparent 95.0%)']} + assert result == expected + + def test_bar_bad_align_raises(self): + df = pd.DataFrame({'A': [-100, -60, -30, -20]}) + with pytest.raises(ValueError): + df.style.bar(align='poorly', color=['#d65f5f', '#5fba7d']) def test_highlight_null(self, null_color='red'): df = pd.DataFrame({'A': [0, np.nan]}) result = df.style.highlight_null()._compute().ctx expected = {(0, 0): [''], (1, 0): ['background-color: red']} - self.assertEqual(result, expected) + assert result == expected def test_nonunique_raises(self): df = pd.DataFrame([[1, 2]], 
columns=['A', 'A']) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.style - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): Styler(df) def test_caption(self): styler = Styler(self.df, caption='foo') result = styler.render() - self.assertTrue(all(['caption' in result, 'foo' in result])) + assert all(['caption' in result, 'foo' in result]) styler = self.df.style result = styler.set_caption('baz') - self.assertTrue(styler is result) - self.assertEqual(styler.caption, 'baz') + assert styler is result + assert styler.caption == 'baz' def test_uuid(self): styler = Styler(self.df, uuid='abc123') result = styler.render() - self.assertTrue('abc123' in result) + assert 'abc123' in result styler = self.df.style result = styler.set_uuid('aaa') - self.assertTrue(result is styler) - self.assertEqual(result.uuid, 'aaa') + assert result is styler + assert result.uuid == 'aaa' def test_table_styles(self): style = [{'selector': 'th', 'props': [('foo', 'bar')]}] styler = Styler(self.df, table_styles=style) result = ' '.join(styler.render().split()) - self.assertTrue('th { foo: bar; }' in result) + assert 'th { foo: bar; }' in result styler = self.df.style result = styler.set_table_styles(style) - self.assertTrue(styler is result) - self.assertEqual(styler.table_styles, style) + assert styler is result + assert styler.table_styles == style def test_table_attributes(self): attributes = 'class="foo" data-bar' styler = Styler(self.df, table_attributes=attributes) result = styler.render() - self.assertTrue('class="foo" data-bar' in result) + assert 'class="foo" data-bar' in result result = self.df.style.set_table_attributes(attributes).render() - self.assertTrue('class="foo" data-bar' in result) + assert 'class="foo" data-bar' in result def test_precision(self): with pd.option_context('display.precision', 10): s = Styler(self.df) - self.assertEqual(s.precision, 10) + assert s.precision == 10 s = Styler(self.df, precision=2) - self.assertEqual(s.precision, 2) + assert s.precision == 2 s2 = s.set_precision(4) - self.assertTrue(s is s2) - self.assertEqual(s.precision, 4) + assert s is s2 + assert s.precision == 4 def test_apply_none(self): def f(x): @@ -441,14 +532,14 @@ def f(x): index=x.index, columns=x.columns) result = (pd.DataFrame([[1, 2], [3, 4]]) .style.apply(f, axis=None)._compute().ctx) - self.assertEqual(result[(1, 1)], ['color: red']) + assert result[(1, 1)] == ['color: red'] def test_trim(self): result = self.df.style.render() # trim=True - self.assertEqual(result.count('#'), 0) + assert result.count('#') == 0 result = self.df.style.highlight_max().render() - self.assertEqual(result.count('#'), len(self.df.columns)) + assert result.count('#') == len(self.df.columns) def test_highlight_max(self): df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B']) @@ -460,25 +551,25 @@ def test_highlight_max(self): df = -df attr = 'highlight_min' result = getattr(df.style, attr)()._compute().ctx - self.assertEqual(result[(1, 1)], ['background-color: yellow']) + assert result[(1, 1)] == ['background-color: yellow'] result = getattr(df.style, attr)(color='green')._compute().ctx - self.assertEqual(result[(1, 1)], ['background-color: green']) + assert result[(1, 1)] == ['background-color: green'] result = getattr(df.style, attr)(subset='A')._compute().ctx - self.assertEqual(result[(1, 0)], ['background-color: yellow']) + assert result[(1, 0)] == ['background-color: yellow'] result = getattr(df.style, attr)(axis=0)._compute().ctx expected = {(1, 0): ['background-color: 
yellow'], (1, 1): ['background-color: yellow'], (0, 1): [''], (0, 0): ['']} - self.assertEqual(result, expected) + assert result == expected result = getattr(df.style, attr)(axis=1)._compute().ctx expected = {(0, 1): ['background-color: yellow'], (1, 1): ['background-color: yellow'], (0, 0): [''], (1, 0): ['']} - self.assertEqual(result, expected) + assert result == expected # separate since we cant negate the strs df['C'] = ['a', 'b'] @@ -498,25 +589,23 @@ def test_export(self): result = style1.export() style2 = self.df.style style2.use(result) - self.assertEqual(style1._todo, style2._todo) + assert style1._todo == style2._todo style2.render() def test_display_format(self): df = pd.DataFrame(np.random.random(size=(2, 2))) ctx = df.style.format("{:0.1f}")._translate() - self.assertTrue(all(['display_value' in c for c in row] - for row in ctx['body'])) - self.assertTrue(all([len(c['display_value']) <= 3 for c in row[1:]] - for row in ctx['body'])) - self.assertTrue( - len(ctx['body'][0][1]['display_value'].lstrip('-')) <= 3) + assert all(['display_value' in c for c in row] for row in ctx['body']) + assert (all([len(c['display_value']) <= 3 for c in row[1:]] + for row in ctx['body'])) + assert len(ctx['body'][0][1]['display_value'].lstrip('-')) <= 3 def test_display_format_raises(self): df = pd.DataFrame(np.random.randn(2, 2)) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): df.style.format(5) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): df.style.format(True) def test_display_subset(self): @@ -525,78 +614,78 @@ def test_display_subset(self): ctx = df.style.format({"a": "{:0.1f}", "b": "{0:.2%}"}, subset=pd.IndexSlice[0, :])._translate() expected = '0.1' - self.assertEqual(ctx['body'][0][1]['display_value'], expected) - self.assertEqual(ctx['body'][1][1]['display_value'], '1.1234') - self.assertEqual(ctx['body'][0][2]['display_value'], '12.34%') + assert ctx['body'][0][1]['display_value'] == expected + assert ctx['body'][1][1]['display_value'] == '1.1234' + assert ctx['body'][0][2]['display_value'] == '12.34%' raw_11 = '1.1234' ctx = df.style.format("{:0.1f}", subset=pd.IndexSlice[0, :])._translate() - self.assertEqual(ctx['body'][0][1]['display_value'], expected) - self.assertEqual(ctx['body'][1][1]['display_value'], raw_11) + assert ctx['body'][0][1]['display_value'] == expected + assert ctx['body'][1][1]['display_value'] == raw_11 ctx = df.style.format("{:0.1f}", subset=pd.IndexSlice[0, :])._translate() - self.assertEqual(ctx['body'][0][1]['display_value'], expected) - self.assertEqual(ctx['body'][1][1]['display_value'], raw_11) + assert ctx['body'][0][1]['display_value'] == expected + assert ctx['body'][1][1]['display_value'] == raw_11 ctx = df.style.format("{:0.1f}", subset=pd.IndexSlice['a'])._translate() - self.assertEqual(ctx['body'][0][1]['display_value'], expected) - self.assertEqual(ctx['body'][0][2]['display_value'], '0.1234') + assert ctx['body'][0][1]['display_value'] == expected + assert ctx['body'][0][2]['display_value'] == '0.1234' ctx = df.style.format("{:0.1f}", subset=pd.IndexSlice[0, 'a'])._translate() - self.assertEqual(ctx['body'][0][1]['display_value'], expected) - self.assertEqual(ctx['body'][1][1]['display_value'], raw_11) + assert ctx['body'][0][1]['display_value'] == expected + assert ctx['body'][1][1]['display_value'] == raw_11 ctx = df.style.format("{:0.1f}", subset=pd.IndexSlice[[0, 1], ['a']])._translate() - self.assertEqual(ctx['body'][0][1]['display_value'], expected) - 
self.assertEqual(ctx['body'][1][1]['display_value'], '1.1') - self.assertEqual(ctx['body'][0][2]['display_value'], '0.1234') - self.assertEqual(ctx['body'][1][2]['display_value'], '1.1234') + assert ctx['body'][0][1]['display_value'] == expected + assert ctx['body'][1][1]['display_value'] == '1.1' + assert ctx['body'][0][2]['display_value'] == '0.1234' + assert ctx['body'][1][2]['display_value'] == '1.1234' def test_display_dict(self): df = pd.DataFrame([[.1234, .1234], [1.1234, 1.1234]], columns=['a', 'b']) ctx = df.style.format({"a": "{:0.1f}", "b": "{0:.2%}"})._translate() - self.assertEqual(ctx['body'][0][1]['display_value'], '0.1') - self.assertEqual(ctx['body'][0][2]['display_value'], '12.34%') + assert ctx['body'][0][1]['display_value'] == '0.1' + assert ctx['body'][0][2]['display_value'] == '12.34%' df['c'] = ['aaa', 'bbb'] ctx = df.style.format({"a": "{:0.1f}", "c": str.upper})._translate() - self.assertEqual(ctx['body'][0][1]['display_value'], '0.1') - self.assertEqual(ctx['body'][0][3]['display_value'], 'AAA') + assert ctx['body'][0][1]['display_value'] == '0.1' + assert ctx['body'][0][3]['display_value'] == 'AAA' def test_bad_apply_shape(self): df = pd.DataFrame([[1, 2], [3, 4]]) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.style._apply(lambda x: 'x', subset=pd.IndexSlice[[0, 1], :]) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.style._apply(lambda x: [''], subset=pd.IndexSlice[[0, 1], :]) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.style._apply(lambda x: ['', '', '', '']) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.style._apply(lambda x: ['', '', ''], subset=1) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.style._apply(lambda x: ['', '', ''], axis=1) def test_apply_bad_return(self): def f(x): return '' df = pd.DataFrame([[1, 2], [3, 4]]) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): df.style._apply(f, axis=None) def test_apply_bad_labels(self): def f(x): return pd.DataFrame(index=[1, 2], columns=['a', 'b']) df = pd.DataFrame([[1, 2], [3, 4]]) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.style._apply(f, axis=None) def test_get_level_lengths(self): @@ -632,16 +721,14 @@ def test_mi_sparse(self): body_1 = result['body'][0][1] expected_1 = { "value": 0, "display_value": 0, "is_visible": True, - "type": "th", "attributes": ["rowspan=1"], - "class": "row_heading level1 row0", + "type": "th", "class": "row_heading level1 row0", } tm.assert_dict_equal(body_1, expected_1) body_10 = result['body'][1][0] expected_10 = { "value": 'a', "display_value": 'a', "is_visible": False, - "type": "th", "attributes": ["rowspan=1"], - "class": "row_heading level0 row1", + "type": "th", "class": "row_heading level0 row1", } tm.assert_dict_equal(body_10, expected_10) @@ -651,20 +738,19 @@ def test_mi_sparse(self): 'is_visible': True, "display_value": ''}, {'type': 'th', 'class': 'blank level0', 'value': '', 'is_visible': True, 'display_value': ''}, - {'attributes': ['colspan=1'], 'class': 'col_heading level0 col0', - 'is_visible': True, 'type': 'th', 'value': 'A', - 'display_value': 'A'}] - self.assertEqual(head, expected) + {'type': 'th', 'class': 'col_heading level0 col0', 'value': 'A', + 'is_visible': True, 'display_value': 'A'}] + assert head == expected def test_mi_sparse_disabled(self): with pd.option_context('display.multi_sparse', False): df = pd.DataFrame({'A': [1, 2]}, 
index=pd.MultiIndex.from_arrays([['a', 'a'], - [0, 1]])) + [0, 1]])) result = df.style._translate() body = result['body'] for row in body: - self.assertEqual(row[0]['attributes'], ['rowspan=1']) + assert 'attributes' not in row[0] def test_mi_sparse_index_names(self): df = pd.DataFrame({'A': [1, 2]}, index=pd.MultiIndex.from_arrays( @@ -680,7 +766,7 @@ def test_mi_sparse_index_names(self): 'type': 'th'}, {'class': 'blank', 'value': '', 'type': 'th'}] - self.assertEqual(head, expected) + assert head == expected def test_mi_sparse_column_names(self): df = pd.DataFrame( @@ -700,48 +786,81 @@ def test_mi_sparse_column_names(self): 'type': 'th', 'is_visible': True}, {'class': 'index_name level1', 'value': 'col_1', 'display_value': 'col_1', 'is_visible': True, 'type': 'th'}, - {'attributes': ['colspan=1'], - 'class': 'col_heading level1 col0', + {'class': 'col_heading level1 col0', 'display_value': 1, 'is_visible': True, 'type': 'th', 'value': 1}, - {'attributes': ['colspan=1'], - 'class': 'col_heading level1 col1', + {'class': 'col_heading level1 col1', 'display_value': 0, 'is_visible': True, 'type': 'th', 'value': 0}, - {'attributes': ['colspan=1'], - 'class': 'col_heading level1 col2', + {'class': 'col_heading level1 col2', 'display_value': 1, 'is_visible': True, 'type': 'th', 'value': 1}, - {'attributes': ['colspan=1'], - 'class': 'col_heading level1 col3', + {'class': 'col_heading level1 col3', 'display_value': 0, 'is_visible': True, 'type': 'th', 'value': 0}, ] - self.assertEqual(head, expected) + assert head == expected -@tm.mplskip -class TestStylerMatplotlibDep(TestCase): +class TestStylerMatplotlibDep(object): def test_background_gradient(self): + tm._skip_if_no_mpl() df = pd.DataFrame([[1, 2], [2, 4]], columns=['A', 'B']) - for axis in [0, 1, 'index', 'columns']: - for cmap in [None, 'YlOrRd']: - result = df.style.background_gradient(cmap=cmap)._compute().ctx - self.assertTrue(all("#" in x[0] for x in result.values())) - self.assertEqual(result[(0, 0)], result[(0, 1)]) - self.assertEqual(result[(1, 0)], result[(1, 1)]) - - result = (df.style.background_gradient(subset=pd.IndexSlice[1, 'A']) - ._compute().ctx) - self.assertEqual(result[(1, 0)], ['background-color: #fff7fb']) + + for c_map in [None, 'YlOrRd']: + result = df.style.background_gradient(cmap=c_map)._compute().ctx + assert all("#" in x[0] for x in result.values()) + assert result[(0, 0)] == result[(0, 1)] + assert result[(1, 0)] == result[(1, 1)] + + result = df.style.background_gradient( + subset=pd.IndexSlice[1, 'A'])._compute().ctx + assert result[(1, 0)] == ['background-color: #fff7fb'] + + +def test_block_names(): + # catch accidental removal of a block + expected = { + 'before_style', 'style', 'table_styles', 'before_cellstyle', + 'cellstyle', 'before_table', 'table', 'caption', 'thead', 'tbody', + 'after_table', 'before_head_rows', 'head_tr', 'after_head_rows', + 'before_rows', 'tr', 'after_rows', + } + result = set(Styler.template.blocks) + assert result == expected + + +def test_from_custom_template(tmpdir): + p = tmpdir.mkdir("templates").join("myhtml.tpl") + p.write(textwrap.dedent("""\ + {% extends "html.tpl" %} + {% block table %} +
        <h1>{{ table_title|default("My Table") }}</h1>
+ {{ super() }} + {% endblock table %}""")) + result = Styler.from_custom_template(str(tmpdir.join('templates')), + 'myhtml.tpl') + assert issubclass(result, Styler) + assert result.env is not Styler.env + assert result.template is not Styler.template + styler = result(pd.DataFrame({"A": [1, 2]})) + assert styler.render() + + +def test_shim(): + # https://github.com/pandas-dev/pandas/pull/16059 + # Remove in 0.21 + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + from pandas.formats.style import Styler as _styler # noqa diff --git a/pandas/tests/io/formats/test_to_csv.py b/pandas/tests/io/formats/test_to_csv.py new file mode 100644 index 0000000000000..1073fbcef5aec --- /dev/null +++ b/pandas/tests/io/formats/test_to_csv.py @@ -0,0 +1,208 @@ +from pandas import DataFrame +import numpy as np +import pandas as pd +from pandas.util import testing as tm + + +class TestToCSV(object): + + def test_to_csv_quotechar(self): + df = DataFrame({'col': [1, 2]}) + expected = """\ +"","col" +"0","1" +"1","2" +""" + + with tm.ensure_clean('test.csv') as path: + df.to_csv(path, quoting=1) # 1=QUOTE_ALL + with open(path, 'r') as f: + assert f.read() == expected + + expected = """\ +$$,$col$ +$0$,$1$ +$1$,$2$ +""" + + with tm.ensure_clean('test.csv') as path: + df.to_csv(path, quoting=1, quotechar="$") + with open(path, 'r') as f: + assert f.read() == expected + + with tm.ensure_clean('test.csv') as path: + with tm.assert_raises_regex(TypeError, 'quotechar'): + df.to_csv(path, quoting=1, quotechar=None) + + def test_to_csv_doublequote(self): + df = DataFrame({'col': ['a"a', '"bb"']}) + expected = '''\ +"","col" +"0","a""a" +"1","""bb""" +''' + + with tm.ensure_clean('test.csv') as path: + df.to_csv(path, quoting=1, doublequote=True) # QUOTE_ALL + with open(path, 'r') as f: + assert f.read() == expected + + from _csv import Error + with tm.ensure_clean('test.csv') as path: + with tm.assert_raises_regex(Error, 'escapechar'): + df.to_csv(path, doublequote=False) # no escapechar set + + def test_to_csv_escapechar(self): + df = DataFrame({'col': ['a"a', '"bb"']}) + expected = '''\ +"","col" +"0","a\\"a" +"1","\\"bb\\"" +''' + + with tm.ensure_clean('test.csv') as path: # QUOTE_ALL + df.to_csv(path, quoting=1, doublequote=False, escapechar='\\') + with open(path, 'r') as f: + assert f.read() == expected + + df = DataFrame({'col': ['a,a', ',bb,']}) + expected = """\ +,col +0,a\\,a +1,\\,bb\\, +""" + + with tm.ensure_clean('test.csv') as path: + df.to_csv(path, quoting=3, escapechar='\\') # QUOTE_NONE + with open(path, 'r') as f: + assert f.read() == expected + + def test_csv_to_string(self): + df = DataFrame({'col': [1, 2]}) + expected = ',col\n0,1\n1,2\n' + assert df.to_csv() == expected + + def test_to_csv_decimal(self): + # GH 781 + df = DataFrame({'col1': [1], 'col2': ['a'], 'col3': [10.1]}) + + expected_default = ',col1,col2,col3\n0,1,a,10.1\n' + assert df.to_csv() == expected_default + + expected_european_excel = ';col1;col2;col3\n0;1;a;10,1\n' + assert df.to_csv(decimal=',', sep=';') == expected_european_excel + + expected_float_format_default = ',col1,col2,col3\n0,1,a,10.10\n' + assert df.to_csv(float_format='%.2f') == expected_float_format_default + + expected_float_format = ';col1;col2;col3\n0;1;a;10,10\n' + assert df.to_csv(decimal=',', sep=';', + float_format='%.2f') == expected_float_format + + # GH 11553: testing if decimal is taken into account for '0.0' + df = pd.DataFrame({'a': [0, 1.1], 'b': [2.2, 3.3], 'c': 1}) + expected = 'a,b,c\n0^0,2^2,1\n1^1,3^3,1\n' + 
assert df.to_csv(index=False, decimal='^') == expected + + # same but for an index + assert df.set_index('a').to_csv(decimal='^') == expected + + # same for a multi-index + assert df.set_index(['a', 'b']).to_csv(decimal="^") == expected + + def test_to_csv_float_format(self): + # testing if float_format is taken into account for the index + # GH 11553 + df = pd.DataFrame({'a': [0, 1], 'b': [2.2, 3.3], 'c': 1}) + expected = 'a,b,c\n0,2.20,1\n1,3.30,1\n' + assert df.set_index('a').to_csv(float_format='%.2f') == expected + + # same for a multi-index + assert df.set_index(['a', 'b']).to_csv( + float_format='%.2f') == expected + + def test_to_csv_na_rep(self): + # testing if NaN values are correctly represented in the index + # GH 11553 + df = DataFrame({'a': [0, np.NaN], 'b': [0, 1], 'c': [2, 3]}) + expected = "a,b,c\n0.0,0,2\n_,1,3\n" + assert df.set_index('a').to_csv(na_rep='_') == expected + assert df.set_index(['a', 'b']).to_csv(na_rep='_') == expected + + # now with an index containing only NaNs + df = DataFrame({'a': np.NaN, 'b': [0, 1], 'c': [2, 3]}) + expected = "a,b,c\n_,0,2\n_,1,3\n" + assert df.set_index('a').to_csv(na_rep='_') == expected + assert df.set_index(['a', 'b']).to_csv(na_rep='_') == expected + + # check if na_rep parameter does not break anything when no NaN + df = DataFrame({'a': 0, 'b': [0, 1], 'c': [2, 3]}) + expected = "a,b,c\n0,0,2\n0,1,3\n" + assert df.set_index('a').to_csv(na_rep='_') == expected + assert df.set_index(['a', 'b']).to_csv(na_rep='_') == expected + + def test_to_csv_date_format(self): + # GH 10209 + df_sec = DataFrame({'A': pd.date_range('20130101', periods=5, freq='s') + }) + df_day = DataFrame({'A': pd.date_range('20130101', periods=5, freq='d') + }) + + expected_default_sec = (',A\n0,2013-01-01 00:00:00\n1,' + '2013-01-01 00:00:01\n2,2013-01-01 00:00:02' + '\n3,2013-01-01 00:00:03\n4,' + '2013-01-01 00:00:04\n') + assert df_sec.to_csv() == expected_default_sec + + expected_ymdhms_day = (',A\n0,2013-01-01 00:00:00\n1,' + '2013-01-02 00:00:00\n2,2013-01-03 00:00:00' + '\n3,2013-01-04 00:00:00\n4,' + '2013-01-05 00:00:00\n') + assert (df_day.to_csv(date_format='%Y-%m-%d %H:%M:%S') == + expected_ymdhms_day) + + expected_ymd_sec = (',A\n0,2013-01-01\n1,2013-01-01\n2,' + '2013-01-01\n3,2013-01-01\n4,2013-01-01\n') + assert df_sec.to_csv(date_format='%Y-%m-%d') == expected_ymd_sec + + expected_default_day = (',A\n0,2013-01-01\n1,2013-01-02\n2,' + '2013-01-03\n3,2013-01-04\n4,2013-01-05\n') + assert df_day.to_csv() == expected_default_day + assert df_day.to_csv(date_format='%Y-%m-%d') == expected_default_day + + # testing if date_format parameter is taken into account for + # multi-indexed dataframes (GH 7791) + df_sec['B'] = 0 + df_sec['C'] = 1 + expected_ymd_sec = 'A,B,C\n2013-01-01,0,1\n' + df_sec_grouped = df_sec.groupby([pd.Grouper(key='A', freq='1h'), 'B']) + assert (df_sec_grouped.mean().to_csv(date_format='%Y-%m-%d') == + expected_ymd_sec) + + def test_to_csv_multi_index(self): + # see gh-6618 + df = DataFrame([1], columns=pd.MultiIndex.from_arrays([[1], [2]])) + + exp = ",1\n,2\n0,1\n" + assert df.to_csv() == exp + + exp = "1\n2\n1\n" + assert df.to_csv(index=False) == exp + + df = DataFrame([1], columns=pd.MultiIndex.from_arrays([[1], [2]]), + index=pd.MultiIndex.from_arrays([[1], [2]])) + + exp = ",,1\n,,2\n1,2,1\n" + assert df.to_csv() == exp + + exp = "1\n2\n1\n" + assert df.to_csv(index=False) == exp + + df = DataFrame( + [1], columns=pd.MultiIndex.from_arrays([['foo'], ['bar']])) + + exp = ",foo\n,bar\n0,1\n" + assert df.to_csv() == exp + + 
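+        # Reading the expectation above: ",foo\n,bar\n0,1\n" is two stacked
+        # header rows (one per column level), each with a leading blank cell
+        # for the unnamed index, followed by the data row.  With index=False
+        # below, the blank index column drops out and only the stacked
+        # headers and the value remain.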
exp = "foo\nbar\n1\n" + assert df.to_csv(index=False) == exp diff --git a/pandas/tests/io/formats/test_to_excel.py b/pandas/tests/io/formats/test_to_excel.py new file mode 100644 index 0000000000000..fff5299921270 --- /dev/null +++ b/pandas/tests/io/formats/test_to_excel.py @@ -0,0 +1,219 @@ +"""Tests formatting as writer-agnostic ExcelCells + +ExcelFormatter is tested implicitly in pandas/tests/io/test_excel.py +""" + +import pytest + +from pandas.io.formats.excel import CSSToExcelConverter + + +@pytest.mark.parametrize('css,expected', [ + # FONT + # - name + ('font-family: foo,bar', {'font': {'name': 'foo'}}), + ('font-family: "foo bar",baz', {'font': {'name': 'foo bar'}}), + ('font-family: foo,\nbar', {'font': {'name': 'foo'}}), + ('font-family: foo, bar, baz', {'font': {'name': 'foo'}}), + ('font-family: bar, foo', {'font': {'name': 'bar'}}), + ('font-family: \'foo bar\', baz', {'font': {'name': 'foo bar'}}), + ('font-family: \'foo \\\'bar\', baz', {'font': {'name': 'foo \'bar'}}), + ('font-family: "foo \\"bar", baz', {'font': {'name': 'foo "bar'}}), + ('font-family: "foo ,bar", baz', {'font': {'name': 'foo ,bar'}}), + # - family + ('font-family: serif', {'font': {'name': 'serif', 'family': 1}}), + ('font-family: Serif', {'font': {'name': 'serif', 'family': 1}}), + ('font-family: roman, serif', {'font': {'name': 'roman', 'family': 1}}), + ('font-family: roman, sans-serif', {'font': {'name': 'roman', + 'family': 2}}), + ('font-family: roman, sans serif', {'font': {'name': 'roman'}}), + ('font-family: roman, sansserif', {'font': {'name': 'roman'}}), + ('font-family: roman, cursive', {'font': {'name': 'roman', 'family': 4}}), + ('font-family: roman, fantasy', {'font': {'name': 'roman', 'family': 5}}), + # - size + ('font-size: 1em', {'font': {'size': 12}}), + # - bold + ('font-weight: 100', {'font': {'bold': False}}), + ('font-weight: 200', {'font': {'bold': False}}), + ('font-weight: 300', {'font': {'bold': False}}), + ('font-weight: 400', {'font': {'bold': False}}), + ('font-weight: normal', {'font': {'bold': False}}), + ('font-weight: lighter', {'font': {'bold': False}}), + ('font-weight: bold', {'font': {'bold': True}}), + ('font-weight: bolder', {'font': {'bold': True}}), + ('font-weight: 700', {'font': {'bold': True}}), + ('font-weight: 800', {'font': {'bold': True}}), + ('font-weight: 900', {'font': {'bold': True}}), + # - italic + # - underline + ('text-decoration: underline', + {'font': {'underline': 'single'}}), + ('text-decoration: overline', + {}), + ('text-decoration: none', + {}), + # - strike + ('text-decoration: line-through', + {'font': {'strike': True}}), + ('text-decoration: underline line-through', + {'font': {'strike': True, 'underline': 'single'}}), + ('text-decoration: underline; text-decoration: line-through', + {'font': {'strike': True}}), + # - color + ('color: red', {'font': {'color': 'FF0000'}}), + ('color: #ff0000', {'font': {'color': 'FF0000'}}), + ('color: #f0a', {'font': {'color': 'FF00AA'}}), + # - shadow + ('text-shadow: none', {'font': {'shadow': False}}), + ('text-shadow: 0px -0em 0px #CCC', {'font': {'shadow': False}}), + ('text-shadow: 0px -0em 0px #999', {'font': {'shadow': False}}), + ('text-shadow: 0px -0em 0px', {'font': {'shadow': False}}), + ('text-shadow: 2px -0em 0px #CCC', {'font': {'shadow': True}}), + ('text-shadow: 0px -2em 0px #CCC', {'font': {'shadow': True}}), + ('text-shadow: 0px -0em 2px #CCC', {'font': {'shadow': True}}), + ('text-shadow: 0px -0em 2px', {'font': {'shadow': True}}), + ('text-shadow: 0px -2em', {'font': {'shadow': 
True}}), + pytest.mark.xfail(('text-shadow: #CCC 3px 3px 3px', + {'font': {'shadow': True}}), + reason='text-shadow with color preceding width not yet ' + 'identified as shadow'), + pytest.mark.xfail(('text-shadow: #999 0px 0px 0px', + {'font': {'shadow': False}}), + reason='text-shadow with color preceding zero width not ' + 'yet identified as non-shadow'), + # FILL + # - color, fillType + ('background-color: red', {'fill': {'fgColor': 'FF0000', + 'patternType': 'solid'}}), + ('background-color: #ff0000', {'fill': {'fgColor': 'FF0000', + 'patternType': 'solid'}}), + ('background-color: #f0a', {'fill': {'fgColor': 'FF00AA', + 'patternType': 'solid'}}), + # BORDER + # - style + ('border-style: solid', + {'border': {'top': {'style': 'medium'}, + 'bottom': {'style': 'medium'}, + 'left': {'style': 'medium'}, + 'right': {'style': 'medium'}}}), + ('border-style: solid; border-width: thin', + {'border': {'top': {'style': 'thin'}, + 'bottom': {'style': 'thin'}, + 'left': {'style': 'thin'}, + 'right': {'style': 'thin'}}}), + + ('border-top-style: solid; border-top-width: thin', + {'border': {'top': {'style': 'thin'}}}), + ('border-top-style: solid; border-top-width: 1pt', + {'border': {'top': {'style': 'thin'}}}), + ('border-top-style: solid', + {'border': {'top': {'style': 'medium'}}}), + ('border-top-style: solid; border-top-width: medium', + {'border': {'top': {'style': 'medium'}}}), + ('border-top-style: solid; border-top-width: 2pt', + {'border': {'top': {'style': 'medium'}}}), + ('border-top-style: solid; border-top-width: thick', + {'border': {'top': {'style': 'thick'}}}), + ('border-top-style: solid; border-top-width: 4pt', + {'border': {'top': {'style': 'thick'}}}), + + ('border-top-style: dotted', + {'border': {'top': {'style': 'mediumDashDotDot'}}}), + ('border-top-style: dotted; border-top-width: thin', + {'border': {'top': {'style': 'dotted'}}}), + ('border-top-style: dashed', + {'border': {'top': {'style': 'mediumDashed'}}}), + ('border-top-style: dashed; border-top-width: thin', + {'border': {'top': {'style': 'dashed'}}}), + ('border-top-style: double', + {'border': {'top': {'style': 'double'}}}), + # - color + ('border-style: solid; border-color: #0000ff', + {'border': {'top': {'style': 'medium', 'color': '0000FF'}, + 'right': {'style': 'medium', 'color': '0000FF'}, + 'bottom': {'style': 'medium', 'color': '0000FF'}, + 'left': {'style': 'medium', 'color': '0000FF'}}}), + ('border-top-style: double; border-top-color: blue', + {'border': {'top': {'style': 'double', 'color': '0000FF'}}}), + ('border-top-style: solid; border-top-color: #06c', + {'border': {'top': {'style': 'medium', 'color': '0066CC'}}}), + # ALIGNMENT + # - horizontal + ('text-align: center', + {'alignment': {'horizontal': 'center'}}), + ('text-align: left', + {'alignment': {'horizontal': 'left'}}), + ('text-align: right', + {'alignment': {'horizontal': 'right'}}), + ('text-align: justify', + {'alignment': {'horizontal': 'justify'}}), + # - vertical + ('vertical-align: top', + {'alignment': {'vertical': 'top'}}), + ('vertical-align: text-top', + {'alignment': {'vertical': 'top'}}), + ('vertical-align: middle', + {'alignment': {'vertical': 'center'}}), + ('vertical-align: bottom', + {'alignment': {'vertical': 'bottom'}}), + ('vertical-align: text-bottom', + {'alignment': {'vertical': 'bottom'}}), + # - wrap_text + ('white-space: nowrap', + {'alignment': {'wrap_text': False}}), + ('white-space: pre', + {'alignment': {'wrap_text': False}}), + ('white-space: pre-line', + {'alignment': {'wrap_text': False}}), + 
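+    # Of the white-space values exercised here, only 'normal' (wrapping on,
+    # the CSS default) maps to wrap_text=True in the case below; 'nowrap',
+    # 'pre' and 'pre-line' above all come out as unwrapped cells.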
('white-space: normal',
+     {'alignment': {'wrap_text': True}}),
+])
+def test_css_to_excel(css, expected):
+    convert = CSSToExcelConverter()
+    assert expected == convert(css)
+
+
+def test_css_to_excel_multiple():
+    convert = CSSToExcelConverter()
+    actual = convert('''
+        font-weight: bold;
+        text-decoration: underline;
+        color: red;
+        border-width: thin;
+        text-align: center;
+        vertical-align: top;
+        unused: something;
+    ''')
+    assert {"font": {"bold": True, "underline": "single", "color": "FF0000"},
+            "border": {"top": {"style": "thin"},
+                       "right": {"style": "thin"},
+                       "bottom": {"style": "thin"},
+                       "left": {"style": "thin"}},
+            "alignment": {"horizontal": "center",
+                          "vertical": "top"}} == actual
+
+
+@pytest.mark.parametrize('css,inherited,expected', [
+    ('font-weight: bold', '',
+     {'font': {'bold': True}}),
+    ('', 'font-weight: bold',
+     {'font': {'bold': True}}),
+    ('font-weight: bold', 'font-style: italic',
+     {'font': {'bold': True, 'italic': True}}),
+    ('font-style: normal', 'font-style: italic',
+     {'font': {'italic': False}}),
+    ('font-style: inherit', '', {}),
+    ('font-style: normal; font-style: inherit', 'font-style: italic',
+     {'font': {'italic': True}}),
+])
+def test_css_to_excel_inherited(css, inherited, expected):
+    convert = CSSToExcelConverter(inherited)
+    assert expected == convert(css)
+
+
+@pytest.mark.xfail(reason='We are not currently warning for all unconverted '
+                          'CSS, but possibly should')
+def test_css_to_excel_warns_when_not_supported():
+    convert = CSSToExcelConverter()
+    with pytest.warns(UserWarning):
+        convert('background: red')
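Taken together, the cases above pin down a small surface: `CSSToExcelConverter` is constructed (optionally with an inherited-CSS string) and then called on a declaration string, returning a nested dict of Excel style groups. A minimal usage sketch built only from values asserted above:

```python
from pandas.io.formats.excel import CSSToExcelConverter

convert = CSSToExcelConverter()

# Declarations merge into one nested style dict, exactly as
# test_css_to_excel_multiple asserts for a longer declaration block.
style = convert('font-weight: bold; color: #f0a')
assert style == {'font': {'bold': True, 'color': 'FF00AA'}}
```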
diff --git a/pandas/tests/io/formats/test_to_html.py b/pandas/tests/io/formats/test_to_html.py
new file mode 100644
index 0000000000000..cde920b1511d2
--- /dev/null
+++ b/pandas/tests/io/formats/test_to_html.py
@@ -0,0 +1,1871 @@
+# -*- coding: utf-8 -*-
+
+import re
+from textwrap import dedent
+from datetime import datetime
+from distutils.version import LooseVersion
+
+import pytest
+import numpy as np
+import pandas as pd
+from pandas import compat, DataFrame, MultiIndex, option_context, Index
+from pandas.compat import u, lrange, StringIO
+from pandas.util import testing as tm
+import pandas.io.formats.format as fmt
+
+div_style = ''
+try:
+    import IPython
+    if IPython.__version__ < LooseVersion('3.0.0'):
+        div_style = ' style="max-width:1500px;overflow:auto;"'
+except (ImportError, AttributeError):
+    pass
+
+
+class TestToHTML(object):
+
+    def test_to_html_with_col_space(self):
+        def check_with_width(df, col_space):
+            # check that col_space affects HTML generation
+            # and be very brittle about it.
+            html = df.to_html(col_space=col_space)
+            hdrs = [x for x in html.split(r"\n") if re.search(r"<th[>\s]", x)]
+            assert len(hdrs) > 0
+            for h in hdrs:
+                assert "min-width" in h
+                assert str(col_space) in h
+
+        df = DataFrame(np.random.random(size=(1, 3)))
+
+        check_with_width(df, 30)
+        check_with_width(df, 50)
+
+    def test_to_html_with_empty_string_label(self):
+        # GH3547, to_html regards empty string labels as repeated labels
+        data = {'c1': ['a', 'b'], 'c2': ['a', ''], 'data': [1, 2]}
+        df = DataFrame(data).set_index(['c1', 'c2'])
+        res = df.to_html()
+        assert "rowspan" not in res
+
+    def test_to_html_unicode(self):
+        df = DataFrame({u('\u03c3'): np.arange(10.)})
+        expected = u'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>\u03c3</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0.0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1.0</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2.0</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>3.0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>4.0</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>5.0</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>6.0</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>7.0</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>8.0</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>9.0</td>\n    </tr>\n  </tbody>\n</table>
'  # noqa
+        assert df.to_html() == expected
+        df = DataFrame({'A': [u('\u03c3')]})
+        expected = u'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>A</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>\u03c3</td>\n    </tr>\n  </tbody>\n</table>
'  # noqa
+        assert df.to_html() == expected
+
+    def test_to_html_decimal(self):
+        # GH 12031
+        df = DataFrame({'A': [6.0, 3.1, 2.2]})
+        result = df.to_html(decimal=',')
+        expected = ('<table border="1" class="dataframe">\n'
+                    '  <thead>\n'
+                    '    <tr style="text-align: right;">\n'
+                    '      <th></th>\n'
+                    '      <th>A</th>\n'
+                    '    </tr>\n'
+                    '  </thead>\n'
+                    '  <tbody>\n'
+                    '    <tr>\n'
+                    '      <th>0</th>\n'
+                    '      <td>6,0</td>\n'
+                    '    </tr>\n'
+                    '    <tr>\n'
+                    '      <th>1</th>\n'
+                    '      <td>3,1</td>\n'
+                    '    </tr>\n'
+                    '    <tr>\n'
+                    '      <th>2</th>\n'
+                    '      <td>2,2</td>\n'
+                    '    </tr>\n'
+                    '  </tbody>\n'
+                    '</table>')
+        assert result == expected
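The two escape tests that follow pin down `to_html`'s `escape` flag: cell text is HTML-escaped by default and passed through verbatim with `escape=False`. A minimal sketch with a hypothetical one-cell frame (not taken from the tests):

```python
import pandas as pd

df = pd.DataFrame({'col': ['<b>x</b>']})
df.to_html()               # default: cell rendered as &lt;b&gt;x&lt;/b&gt;
df.to_html(escape=False)   # markup passed through as <b>x</b>
```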
+
+    def test_to_html_escaped(self):
+        a = 'str<ing1 &amp;'
+        b = 'stri>ng2 &amp;'
+
+        test_dict = {'co<l1': {a: "<type 'str'>",
+                               b: "<type 'str'>"},
+                     'co>l2': {a: "<type 'str'>",
+                               b: "<type 'str'>"}}
+        rs = DataFrame(test_dict).to_html()
+        xp = """<table border="1" class="dataframe">
+  <thead>
+    <tr style="text-align: right;">
+      <th></th>
+      <th>co&lt;l1</th>
+      <th>co&gt;l2</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>str&lt;ing1 &amp;amp;</th>
+      <td>&lt;type 'str'&gt;</td>
+      <td>&lt;type 'str'&gt;</td>
+    </tr>
+    <tr>
+      <th>stri&gt;ng2 &amp;amp;</th>
+      <td>&lt;type 'str'&gt;</td>
+      <td>&lt;type 'str'&gt;</td>
+    </tr>
+  </tbody>
+</table>
""" + + assert xp == rs + + def test_to_html_escape_disabled(self): + a = 'strbold", + b: "bold"}, + 'co>l2': {a: "bold", + b: "bold"}} + rs = DataFrame(test_dict).to_html(escape=False) + xp = """ + + + + + + + + + + + + + + + + + +
co + co>l2
str + boldbold
stri>ng2 &boldbold
""" + + assert xp == rs + + def test_to_html_multiindex_index_false(self): + # issue 8452 + df = DataFrame({ + 'a': range(2), + 'b': range(3, 5), + 'c': range(5, 7), + 'd': range(3, 5) + }) + df.columns = MultiIndex.from_product([['a', 'b'], ['c', 'd']]) + result = df.to_html(index=False) + expected = """\ + + + + + + + + + + + + + + + + + + + + + + + + + + + +
[expected HTML table markup lost in extraction: two-level columns (a, b) over (c, d), no index column; data rows 0 3 5 3 and 1 4 6 4]
""" + + assert result == expected + + df.index = Index(df.index.values, name='idx') + result = df.to_html(index=False) + assert result == expected + + def test_to_html_multiindex_sparsify_false_multi_sparse(self): + with option_context('display.multi_sparse', False): + index = MultiIndex.from_arrays([[0, 0, 1, 1], [0, 1, 0, 1]], + names=['foo', None]) + + df = DataFrame([[0, 1], [2, 3], [4, 5], [6, 7]], index=index) + + result = df.to_html() + expected = """\ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
[expected HTML table markup lost in extraction: columns 0 and 1; two-level row index (foo, None) written on every row because sparsification is off; values 0-7]
""" + + assert result == expected + + df = DataFrame([[0, 1], [2, 3], [4, 5], [6, 7]], + columns=index[::2], index=index) + + result = df.to_html() + expected = """\ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
[expected HTML table markup lost in extraction: the same frame with the non-sparsified two-level index repeated on both axes; values 0-7]
""" + + assert result == expected + + def test_to_html_multiindex_sparsify(self): + index = MultiIndex.from_arrays([[0, 0, 1, 1], [0, 1, 0, 1]], + names=['foo', None]) + + df = DataFrame([[0, 1], [2, 3], [4, 5], [6, 7]], index=index) + + result = df.to_html() + expected = """ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
[expected HTML table markup lost in extraction: columns 0 and 1; sparsified two-level row index (foo, None); values 0-7]
""" + + assert result == expected + + df = DataFrame([[0, 1], [2, 3], [4, 5], [6, 7]], columns=index[::2], + index=index) + + result = df.to_html() + expected = """\ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
[expected HTML table markup lost in extraction: sparsified two-level index on both axes; values 0-7]
""" + + assert result == expected + + def test_to_html_multiindex_odd_even_truncate(self): + # GH 14882 - Issue on truncation with odd length DataFrame + mi = MultiIndex.from_product([[100, 200, 300], + [10, 20, 30], + [1, 2, 3, 4, 5, 6, 7]], + names=['a', 'b', 'c']) + df = DataFrame({'n': range(len(mi))}, index=mi) + result = df.to_html(max_rows=60) + expected = """\ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
[expected HTML table markup lost in extraction: three-level index (a: 100/200/300, b: 10/20/30, c: 1-7) against column n = 0-62, with a "..." truncation row in place of n = 30-32]
""" + assert result == expected + + # Test that ... appears in a middle level + result = df.to_html(max_rows=56) + expected = """\ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
[expected HTML table markup lost in extraction: same frame rendered with max_rows=56, the "..." truncation landing on the middle (b) level; n runs 0-27 and 35-62]
""" + assert result == expected + + def test_to_html_index_formatter(self): + df = DataFrame([[0, 1], [2, 3], [4, 5], [6, 7]], columns=['foo', None], + index=lrange(4)) + + f = lambda x: 'abcd' [x] + result = df.to_html(formatters={'__index__': f}) + expected = """\ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
[expected HTML table markup lost in extraction: index rendered as a-d by the formatter; columns foo and None; values 0-7]
""" + + assert result == expected + + def test_to_html_datetime64_monthformatter(self): + months = [datetime(2016, 1, 1), datetime(2016, 2, 2)] + x = DataFrame({'months': months}) + + def format_func(x): + return x.strftime('%Y-%m') + result = x.to_html(formatters={'months': format_func}) + expected = """\ + + + + + + + + + + + + + + + + + +
[expected HTML table markup lost in extraction: months column rendered as 2016-01 and 2016-02]
""" + assert result == expected + + def test_to_html_datetime64_hourformatter(self): + + x = DataFrame({'hod': pd.to_datetime(['10:10:10.100', '12:12:12.120'], + format='%H:%M:%S.%f')}) + + def format_func(x): + return x.strftime('%H:%M') + result = x.to_html(formatters={'hod': format_func}) + expected = """\ + + + + + + + + + + + + + + + + + +
[expected HTML table markup lost in extraction: hod column rendered as 10:10 and 12:12]
""" + assert result == expected + + def test_to_html_regression_GH6098(self): + df = DataFrame({ + u('clé1'): [u('a'), u('a'), u('b'), u('b'), u('a')], + u('clé2'): [u('1er'), u('2ème'), u('1er'), u('2ème'), u('1er')], + 'données1': np.random.randn(5), + 'données2': np.random.randn(5)}) + + # it works + df.pivot_table(index=[u('clé1')], columns=[u('clé2')])._repr_html_() + + def test_to_html_truncate(self): + pytest.skip("unreliable on travis") + index = pd.DatetimeIndex(start='20010101', freq='D', periods=20) + df = DataFrame(index=index, columns=range(20)) + fmt.set_option('display.max_rows', 8) + fmt.set_option('display.max_columns', 4) + result = df._repr_html_() + expected = '''\ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
[expected HTML markup lost in extraction: 20x20 all-NaN frame indexed 2001-01-01 to 2001-01-20, truncated to 8 rows and 4 columns with "..." markers; trailing caption "20 rows × 20 columns"]
+'''.format(div_style) + if compat.PY2: + expected = expected.decode('utf-8') + assert result == expected + + def test_to_html_truncate_multi_index(self): + pytest.skip("unreliable on travis") + arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], + ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] + df = DataFrame(index=arrays, columns=arrays) + fmt.set_option('display.max_rows', 7) + fmt.set_option('display.max_columns', 7) + result = df._repr_html_() + expected = '''\ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
[expected HTML markup lost in extraction: 8x8 all-NaN frame with sparsified MultiIndex (bar/baz/foo/qux × one/two) on both axes, truncated with "..." markers; trailing caption "8 rows × 8 columns"]
+'''.format(div_style) + if compat.PY2: + expected = expected.decode('utf-8') + assert result == expected + + def test_to_html_truncate_multi_index_sparse_off(self): + pytest.skip("unreliable on travis") + arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], + ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] + df = DataFrame(index=arrays, columns=arrays) + fmt.set_option('display.max_rows', 7) + fmt.set_option('display.max_columns', 7) + fmt.set_option('display.multi_sparse', False) + result = df._repr_html_() + expected = '''\ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
barbarbaz...fooquxqux
onetwoone...twoonetwo
baroneNaNNaNNaN...NaNNaNNaN
bartwoNaNNaNNaN...NaNNaNNaN
bazoneNaNNaNNaN...NaNNaNNaN
footwoNaNNaNNaN...NaNNaNNaN
quxoneNaNNaNNaN...NaNNaNNaN
quxtwoNaNNaNNaN...NaNNaNNaN
+

8 rows × 8 columns

+'''.format(div_style) + if compat.PY2: + expected = expected.decode('utf-8') + assert result == expected + + def test_to_html_border(self): + df = DataFrame({'A': [1, 2]}) + result = df.to_html() + assert 'border="1"' in result + + def test_to_html_border_option(self): + df = DataFrame({'A': [1, 2]}) + with pd.option_context('html.border', 0): + result = df.to_html() + assert 'border="0"' in result + assert 'border="0"' in df._repr_html_() + + def test_to_html_border_zero(self): + df = DataFrame({'A': [1, 2]}) + result = df.to_html(border=0) + assert 'border="0"' in result + + def test_to_html(self): + # big mixed + biggie = DataFrame({'A': np.random.randn(200), + 'B': tm.makeStringIndex(200)}, + index=lrange(200)) + + biggie.loc[:20, 'A'] = np.nan + biggie.loc[:20, 'B'] = np.nan + s = biggie.to_html() + + buf = StringIO() + retval = biggie.to_html(buf=buf) + assert retval is None + assert buf.getvalue() == s + + assert isinstance(s, compat.string_types) + + biggie.to_html(columns=['B', 'A'], col_space=17) + biggie.to_html(columns=['B', 'A'], + formatters={'A': lambda x: '%.1f' % x}) + + biggie.to_html(columns=['B', 'A'], float_format=str) + biggie.to_html(columns=['B', 'A'], col_space=12, float_format=str) + + frame = DataFrame(index=np.arange(200)) + frame.to_html() + + def test_to_html_filename(self): + biggie = DataFrame({'A': np.random.randn(200), + 'B': tm.makeStringIndex(200)}, + index=lrange(200)) + + biggie.loc[:20, 'A'] = np.nan + biggie.loc[:20, 'B'] = np.nan + with tm.ensure_clean('test.html') as path: + biggie.to_html(path) + with open(path, 'r') as f: + s = biggie.to_html() + s2 = f.read() + assert s == s2 + + frame = DataFrame(index=np.arange(200)) + with tm.ensure_clean('test.html') as path: + frame.to_html(path) + with open(path, 'r') as f: + assert frame.to_html() == f.read() + + def test_to_html_with_no_bold(self): + x = DataFrame({'x': np.random.randn(5)}) + ashtml = x.to_html(bold_rows=False) + assert '")] + + def test_to_html_columns_arg(self): + frame = DataFrame(tm.getSeriesData()) + result = frame.to_html(columns=['A']) + assert 'B' not in result + + def test_to_html_multiindex(self): + columns = MultiIndex.from_tuples(list(zip(np.arange(2).repeat(2), + np.mod(lrange(4), 2))), + names=['CL0', 'CL1']) + df = DataFrame([list('abcd'), list('efgh')], columns=columns) + result = df.to_html(justify='left') + expected = ('\n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + '
CL001
CL10101
0abcd
1efgh
') + + assert result == expected + + columns = MultiIndex.from_tuples(list(zip( + range(4), np.mod( + lrange(4), 2)))) + df = DataFrame([list('abcd'), list('efgh')], columns=columns) + + result = df.to_html(justify='right') + expected = ('\n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + '
0123
0101
0abcd
1efgh
') + + assert result == expected + + def test_to_html_justify(self): + df = DataFrame({'A': [6, 30000, 2], + 'B': [1, 2, 70000], + 'C': [223442, 0, 1]}, + columns=['A', 'B', 'C']) + result = df.to_html(justify='left') + expected = ('\n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + '
ABC
061223442
13000020
22700001
') + assert result == expected + + result = df.to_html(justify='right') + expected = ('\n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + '
ABC
061223442
13000020
22700001
') + assert result == expected + + def test_to_html_index(self): + index = ['foo', 'bar', 'baz'] + df = DataFrame({'A': [1, 2, 3], + 'B': [1.2, 3.4, 5.6], + 'C': ['one', 'two', np.nan]}, + columns=['A', 'B', 'C'], + index=index) + expected_with_index = ('\n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + '
ABC
foo11.2one
bar23.4two
baz35.6NaN
') + assert df.to_html() == expected_with_index + + expected_without_index = ('\n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + '
ABC
11.2one
23.4two
35.6NaN
') + result = df.to_html(index=False) + for i in index: + assert i not in result + assert result == expected_without_index + df.index = Index(['foo', 'bar', 'baz'], name='idx') + expected_with_index = ('\n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + '
ABC
idx
foo11.2one
bar23.4two
baz35.6NaN
') + assert df.to_html() == expected_with_index + assert df.to_html(index=False) == expected_without_index + + tuples = [('foo', 'car'), ('foo', 'bike'), ('bar', 'car')] + df.index = MultiIndex.from_tuples(tuples) + + expected_with_index = ('\n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + '
ABC
foocar11.2one
bike23.4two
barcar35.6NaN
') + assert df.to_html() == expected_with_index + + result = df.to_html(index=False) + for i in ['foo', 'bar', 'car', 'bike']: + assert i not in result + # must be the same result as normal index + assert result == expected_without_index + + df.index = MultiIndex.from_tuples(tuples, names=['idx1', 'idx2']) + expected_with_index = ('\n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + ' \n' + '
ABC
idx1idx2
foocar11.2one
bike23.4two
barcar35.6NaN
') + assert df.to_html() == expected_with_index + assert df.to_html(index=False) == expected_without_index + + def test_to_html_with_classes(self): + df = DataFrame() + result = df.to_html(classes="sortable draggable") + expected = dedent(""" + + + + + + + + + +
+ + """).strip() + assert result == expected + + result = df.to_html(classes=["sortable", "draggable"]) + assert result == expected + + def test_to_html_no_index_max_rows(self): + # GH https://github.com/pandas-dev/pandas/issues/14998 + df = DataFrame({"A": [1, 2, 3, 4]}) + result = df.to_html(index=False, max_rows=1) + expected = dedent("""\ + + + + + + + + + + + +
A
1
""") + assert result == expected + + def test_to_html_notebook_has_style(self): + df = pd.DataFrame({"A": [1, 2, 3]}) + result = df.to_html(notebook=True) + assert "thead tr:only-child" in result + + def test_to_html_notebook_has_no_style(self): + df = pd.DataFrame({"A": [1, 2, 3]}) + result = df.to_html() + assert "thead tr:only-child" not in result diff --git a/pandas/tests/io/formats/test_to_latex.py b/pandas/tests/io/formats/test_to_latex.py new file mode 100644 index 0000000000000..2542deb0cedf1 --- /dev/null +++ b/pandas/tests/io/formats/test_to_latex.py @@ -0,0 +1,493 @@ +from datetime import datetime + +import pytest + +import pandas as pd +from pandas import DataFrame, compat +from pandas.util import testing as tm +from pandas.compat import u +import codecs + + +@pytest.fixture +def frame(): + return DataFrame(tm.getSeriesData()) + + +class TestToLatex(object): + + def test_to_latex_filename(self, frame): + with tm.ensure_clean('test.tex') as path: + frame.to_latex(path) + + with open(path, 'r') as f: + assert frame.to_latex() == f.read() + + # test with utf-8 and encoding option (GH 7061) + df = DataFrame([[u'au\xdfgangen']]) + with tm.ensure_clean('test.tex') as path: + df.to_latex(path, encoding='utf-8') + with codecs.open(path, 'r', encoding='utf-8') as f: + assert df.to_latex() == f.read() + + # test with utf-8 without encoding option + if compat.PY3: # python3: pandas default encoding is utf-8 + with tm.ensure_clean('test.tex') as path: + df.to_latex(path) + with codecs.open(path, 'r', encoding='utf-8') as f: + assert df.to_latex() == f.read() + else: + # python2 default encoding is ascii, so an error should be raised + with tm.ensure_clean('test.tex') as path: + with pytest.raises(UnicodeEncodeError): + df.to_latex(path) + + def test_to_latex(self, frame): + # it works! 
+ frame.to_latex() + + df = DataFrame({'a': [1, 2], 'b': ['b1', 'b2']}) + withindex_result = df.to_latex() + withindex_expected = r"""\begin{tabular}{lrl} +\toprule +{} & a & b \\ +\midrule +0 & 1 & b1 \\ +1 & 2 & b2 \\ +\bottomrule +\end{tabular} +""" + + assert withindex_result == withindex_expected + + withoutindex_result = df.to_latex(index=False) + withoutindex_expected = r"""\begin{tabular}{rl} +\toprule + a & b \\ +\midrule + 1 & b1 \\ + 2 & b2 \\ +\bottomrule +\end{tabular} +""" + + assert withoutindex_result == withoutindex_expected + + def test_to_latex_format(self, frame): + # GH Bug #9402 + frame.to_latex(column_format='ccc') + + df = DataFrame({'a': [1, 2], 'b': ['b1', 'b2']}) + withindex_result = df.to_latex(column_format='ccc') + withindex_expected = r"""\begin{tabular}{ccc} +\toprule +{} & a & b \\ +\midrule +0 & 1 & b1 \\ +1 & 2 & b2 \\ +\bottomrule +\end{tabular} +""" + + assert withindex_result == withindex_expected + + def test_to_latex_with_formatters(self): + df = DataFrame({'int': [1, 2, 3], + 'float': [1.0, 2.0, 3.0], + 'object': [(1, 2), True, False], + 'datetime64': [datetime(2016, 1, 1), + datetime(2016, 2, 5), + datetime(2016, 3, 3)]}) + + formatters = {'int': lambda x: '0x%x' % x, + 'float': lambda x: '[% 4.1f]' % x, + 'object': lambda x: '-%s-' % str(x), + 'datetime64': lambda x: x.strftime('%Y-%m'), + '__index__': lambda x: 'index: %s' % x} + result = df.to_latex(formatters=dict(formatters)) + + expected = r"""\begin{tabular}{llrrl} +\toprule +{} & datetime64 & float & int & object \\ +\midrule +index: 0 & 2016-01 & [ 1.0] & 0x1 & -(1, 2)- \\ +index: 1 & 2016-02 & [ 2.0] & 0x2 & -True- \\ +index: 2 & 2016-03 & [ 3.0] & 0x3 & -False- \\ +\bottomrule +\end{tabular} +""" + assert result == expected + + def test_to_latex_multiindex(self): + df = DataFrame({('x', 'y'): ['a']}) + result = df.to_latex() + expected = r"""\begin{tabular}{ll} +\toprule +{} & x \\ +{} & y \\ +\midrule +0 & a \\ +\bottomrule +\end{tabular} +""" + + assert result == expected + + result = df.T.to_latex() + expected = r"""\begin{tabular}{lll} +\toprule + & & 0 \\ +\midrule +x & y & a \\ +\bottomrule +\end{tabular} +""" + + assert result == expected + + df = DataFrame.from_dict({ + ('c1', 0): pd.Series(dict((x, x) for x in range(4))), + ('c1', 1): pd.Series(dict((x, x + 4) for x in range(4))), + ('c2', 0): pd.Series(dict((x, x) for x in range(4))), + ('c2', 1): pd.Series(dict((x, x + 4) for x in range(4))), + ('c3', 0): pd.Series(dict((x, x) for x in range(4))), + }).T + result = df.to_latex() + expected = r"""\begin{tabular}{llrrrr} +\toprule + & & 0 & 1 & 2 & 3 \\ +\midrule +c1 & 0 & 0 & 1 & 2 & 3 \\ + & 1 & 4 & 5 & 6 & 7 \\ +c2 & 0 & 0 & 1 & 2 & 3 \\ + & 1 & 4 & 5 & 6 & 7 \\ +c3 & 0 & 0 & 1 & 2 & 3 \\ +\bottomrule +\end{tabular} +""" + + assert result == expected + + # GH 14184 + df = df.T + df.columns.names = ['a', 'b'] + result = df.to_latex() + expected = r"""\begin{tabular}{lrrrrr} +\toprule +a & \multicolumn{2}{l}{c1} & \multicolumn{2}{l}{c2} & c3 \\ +b & 0 & 1 & 0 & 1 & 0 \\ +\midrule +0 & 0 & 4 & 0 & 4 & 0 \\ +1 & 1 & 5 & 1 & 5 & 1 \\ +2 & 2 & 6 & 2 & 6 & 2 \\ +3 & 3 & 7 & 3 & 7 & 3 \\ +\bottomrule +\end{tabular} +""" + assert result == expected + + # GH 10660 + df = pd.DataFrame({'a': [0, 0, 1, 1], + 'b': list('abab'), + 'c': [1, 2, 3, 4]}) + result = df.set_index(['a', 'b']).to_latex() + expected = r"""\begin{tabular}{llr} +\toprule + & & c \\ +a & b & \\ +\midrule +0 & a & 1 \\ + & b & 2 \\ +1 & a & 3 \\ + & b & 4 \\ +\bottomrule +\end{tabular} +""" + + assert result == expected 
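+        # Illustrative sketch (hedged, not part of the original test): the
+        # blank-label rows in `expected` above come from to_latex()
+        # sparsifying a MultiIndex, i.e. printing a repeated outer label once
+        # and leaving it empty on the rows that follow:
+        #
+        #     >>> idx = pd.MultiIndex.from_product([[0, 1], ['a', 'b']],
+        #     ...                                  names=['a', 'b'])
+        #     >>> print(pd.DataFrame({'c': [1, 2, 3, 4]},
+        #     ...                    index=idx).to_latex())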
+ + result = df.groupby('a').describe().to_latex() + expected = r"""\begin{tabular}{lrrrrrrrr} +\toprule +{} & \multicolumn{8}{l}{c} \\ +{} & count & mean & std & min & 25\% & 50\% & 75\% & max \\ +a & & & & & & & & \\ +\midrule +0 & 2.0 & 1.5 & 0.707107 & 1.0 & 1.25 & 1.5 & 1.75 & 2.0 \\ +1 & 2.0 & 3.5 & 0.707107 & 3.0 & 3.25 & 3.5 & 3.75 & 4.0 \\ +\bottomrule +\end{tabular} +""" + + assert result == expected + + def test_to_latex_multicolumnrow(self): + df = pd.DataFrame({ + ('c1', 0): dict((x, x) for x in range(5)), + ('c1', 1): dict((x, x + 5) for x in range(5)), + ('c2', 0): dict((x, x) for x in range(5)), + ('c2', 1): dict((x, x + 5) for x in range(5)), + ('c3', 0): dict((x, x) for x in range(5)) + }) + result = df.to_latex() + expected = r"""\begin{tabular}{lrrrrr} +\toprule +{} & \multicolumn{2}{l}{c1} & \multicolumn{2}{l}{c2} & c3 \\ +{} & 0 & 1 & 0 & 1 & 0 \\ +\midrule +0 & 0 & 5 & 0 & 5 & 0 \\ +1 & 1 & 6 & 1 & 6 & 1 \\ +2 & 2 & 7 & 2 & 7 & 2 \\ +3 & 3 & 8 & 3 & 8 & 3 \\ +4 & 4 & 9 & 4 & 9 & 4 \\ +\bottomrule +\end{tabular} +""" + assert result == expected + + result = df.to_latex(multicolumn=False) + expected = r"""\begin{tabular}{lrrrrr} +\toprule +{} & c1 & & c2 & & c3 \\ +{} & 0 & 1 & 0 & 1 & 0 \\ +\midrule +0 & 0 & 5 & 0 & 5 & 0 \\ +1 & 1 & 6 & 1 & 6 & 1 \\ +2 & 2 & 7 & 2 & 7 & 2 \\ +3 & 3 & 8 & 3 & 8 & 3 \\ +4 & 4 & 9 & 4 & 9 & 4 \\ +\bottomrule +\end{tabular} +""" + assert result == expected + + result = df.T.to_latex(multirow=True) + expected = r"""\begin{tabular}{llrrrrr} +\toprule + & & 0 & 1 & 2 & 3 & 4 \\ +\midrule +\multirow{2}{*}{c1} & 0 & 0 & 1 & 2 & 3 & 4 \\ + & 1 & 5 & 6 & 7 & 8 & 9 \\ +\cline{1-7} +\multirow{2}{*}{c2} & 0 & 0 & 1 & 2 & 3 & 4 \\ + & 1 & 5 & 6 & 7 & 8 & 9 \\ +\cline{1-7} +c3 & 0 & 0 & 1 & 2 & 3 & 4 \\ +\bottomrule +\end{tabular} +""" + assert result == expected + + df.index = df.T.index + result = df.T.to_latex(multirow=True, multicolumn=True, + multicolumn_format='c') + expected = r"""\begin{tabular}{llrrrrr} +\toprule + & & \multicolumn{2}{c}{c1} & \multicolumn{2}{c}{c2} & c3 \\ + & & 0 & 1 & 0 & 1 & 0 \\ +\midrule +\multirow{2}{*}{c1} & 0 & 0 & 1 & 2 & 3 & 4 \\ + & 1 & 5 & 6 & 7 & 8 & 9 \\ +\cline{1-7} +\multirow{2}{*}{c2} & 0 & 0 & 1 & 2 & 3 & 4 \\ + & 1 & 5 & 6 & 7 & 8 & 9 \\ +\cline{1-7} +c3 & 0 & 0 & 1 & 2 & 3 & 4 \\ +\bottomrule +\end{tabular} +""" + assert result == expected + + def test_to_latex_escape(self): + a = 'a' + b = 'b' + + test_dict = {u('co^l1'): {a: "a", + b: "b"}, + u('co$e^x$'): {a: "a", + b: "b"}} + + unescaped_result = DataFrame(test_dict).to_latex(escape=False) + escaped_result = DataFrame(test_dict).to_latex( + ) # default: escape=True + + unescaped_expected = r'''\begin{tabular}{lll} +\toprule +{} & co$e^x$ & co^l1 \\ +\midrule +a & a & a \\ +b & b & b \\ +\bottomrule +\end{tabular} +''' + + escaped_expected = r'''\begin{tabular}{lll} +\toprule +{} & co\$e\textasciicircumx\$ & co\textasciicircuml1 \\ +\midrule +a & a & a \\ +b & b & b \\ +\bottomrule +\end{tabular} +''' + + assert unescaped_result == unescaped_expected + assert escaped_result == escaped_expected + + def test_to_latex_longtable(self, frame): + frame.to_latex(longtable=True) + + df = DataFrame({'a': [1, 2], 'b': ['b1', 'b2']}) + withindex_result = df.to_latex(longtable=True) + withindex_expected = r"""\begin{longtable}{lrl} +\toprule +{} & a & b \\ +\midrule +\endhead +\midrule +\multicolumn{3}{r}{{Continued on next page}} \\ +\midrule +\endfoot + +\bottomrule +\endlastfoot +0 & 1 & b1 \\ +1 & 2 & b2 \\ +\end{longtable} +""" + + assert withindex_result 
== withindex_expected + + withoutindex_result = df.to_latex(index=False, longtable=True) + withoutindex_expected = r"""\begin{longtable}{rl} +\toprule + a & b \\ +\midrule +\endhead +\midrule +\multicolumn{3}{r}{{Continued on next page}} \\ +\midrule +\endfoot + +\bottomrule +\endlastfoot + 1 & b1 \\ + 2 & b2 \\ +\end{longtable} +""" + + assert withoutindex_result == withoutindex_expected + + def test_to_latex_escape_special_chars(self): + special_characters = ['&', '%', '$', '#', '_', '{', '}', '~', '^', + '\\'] + df = DataFrame(data=special_characters) + observed = df.to_latex() + expected = r"""\begin{tabular}{ll} +\toprule +{} & 0 \\ +\midrule +0 & \& \\ +1 & \% \\ +2 & \$ \\ +3 & \# \\ +4 & \_ \\ +5 & \{ \\ +6 & \} \\ +7 & \textasciitilde \\ +8 & \textasciicircum \\ +9 & \textbackslash \\ +\bottomrule +\end{tabular} +""" + + assert observed == expected + + def test_to_latex_no_header(self): + # GH 7124 + df = DataFrame({'a': [1, 2], 'b': ['b1', 'b2']}) + withindex_result = df.to_latex(header=False) + withindex_expected = r"""\begin{tabular}{lrl} +\toprule +0 & 1 & b1 \\ +1 & 2 & b2 \\ +\bottomrule +\end{tabular} +""" + + assert withindex_result == withindex_expected + + withoutindex_result = df.to_latex(index=False, header=False) + withoutindex_expected = r"""\begin{tabular}{rl} +\toprule + 1 & b1 \\ + 2 & b2 \\ +\bottomrule +\end{tabular} +""" + + assert withoutindex_result == withoutindex_expected + + def test_to_latex_specified_header(self): + # GH 7124 + df = DataFrame({'a': [1, 2], 'b': ['b1', 'b2']}) + withindex_result = df.to_latex(header=['AA', 'BB']) + withindex_expected = r"""\begin{tabular}{lrl} +\toprule +{} & AA & BB \\ +\midrule +0 & 1 & b1 \\ +1 & 2 & b2 \\ +\bottomrule +\end{tabular} +""" + + assert withindex_result == withindex_expected + + withoutindex_result = df.to_latex(header=['AA', 'BB'], index=False) + withoutindex_expected = r"""\begin{tabular}{rl} +\toprule +AA & BB \\ +\midrule + 1 & b1 \\ + 2 & b2 \\ +\bottomrule +\end{tabular} +""" + + assert withoutindex_result == withoutindex_expected + + withoutescape_result = df.to_latex(header=['$A$', '$B$'], escape=False) + withoutescape_expected = r"""\begin{tabular}{lrl} +\toprule +{} & $A$ & $B$ \\ +\midrule +0 & 1 & b1 \\ +1 & 2 & b2 \\ +\bottomrule +\end{tabular} +""" + + assert withoutescape_result == withoutescape_expected + + with pytest.raises(ValueError): + df.to_latex(header=['A']) + + def test_to_latex_decimal(self, frame): + # GH 12031 + frame.to_latex() + + df = DataFrame({'a': [1.0, 2.1], 'b': ['b1', 'b2']}) + withindex_result = df.to_latex(decimal=',') + + withindex_expected = r"""\begin{tabular}{lrl} +\toprule +{} & a & b \\ +\midrule +0 & 1,0 & b1 \\ +1 & 2,1 & b2 \\ +\bottomrule +\end{tabular} +""" + + assert withindex_result == withindex_expected diff --git a/pandas/io/tests/generate_legacy_storage_files.py b/pandas/tests/io/generate_legacy_storage_files.py similarity index 94% rename from pandas/io/tests/generate_legacy_storage_files.py rename to pandas/tests/io/generate_legacy_storage_files.py index d0365cb2c30b3..22c62b738e6a2 100644 --- a/pandas/io/tests/generate_legacy_storage_files.py +++ b/pandas/tests/io/generate_legacy_storage_files.py @@ -1,5 +1,6 @@ """ self-contained to write legacy storage (pickle/msgpack) files """ from __future__ import print_function +from warnings import catch_warnings from distutils.version import LooseVersion from pandas import (Series, DataFrame, Panel, SparseSeries, SparseDataFrame, @@ -127,14 +128,16 @@ def create_data(): u'B': Timestamp('20130603', 
tz='CET')}, index=range(5))
                 )
-    mixed_dup_panel = Panel({u'ItemA': frame[u'float'],
-                             u'ItemB': frame[u'int']})
-    mixed_dup_panel.items = [u'ItemA', u'ItemA']
-    panel = dict(float=Panel({u'ItemA': frame[u'float'],
-                              u'ItemB': frame[u'float'] + 1}),
-                 dup=Panel(np.arange(30).reshape(3, 5, 2).astype(np.float64),
-                           items=[u'A', u'B', u'A']),
-                 mixed_dup=mixed_dup_panel)
+    with catch_warnings(record=True):
+        mixed_dup_panel = Panel({u'ItemA': frame[u'float'],
+                                 u'ItemB': frame[u'int']})
+        mixed_dup_panel.items = [u'ItemA', u'ItemA']
+        panel = dict(float=Panel({u'ItemA': frame[u'float'],
+                                  u'ItemB': frame[u'float'] + 1}),
+                     dup=Panel(
+                         np.arange(30).reshape(3, 5, 2).astype(np.float64),
+                         items=[u'A', u'B', u'A']),
+                     mixed_dup=mixed_dup_panel)
 
     cat = dict(int8=Categorical(list('abcdefg')),
                int16=Categorical(np.arange(1000)),
diff --git a/pandas/tests/io/json/__init__.py b/pandas/tests/io/json/__init__.py
new file mode 100644
index 0000000000000..e69de29bb2d1d
diff --git a/pandas/io/tests/json/data/tsframe_iso_v012.json b/pandas/tests/io/json/data/tsframe_iso_v012.json
similarity index 100%
rename from pandas/io/tests/json/data/tsframe_iso_v012.json
rename to pandas/tests/io/json/data/tsframe_iso_v012.json
diff --git a/pandas/io/tests/json/data/tsframe_v012.json b/pandas/tests/io/json/data/tsframe_v012.json
similarity index 100%
rename from pandas/io/tests/json/data/tsframe_v012.json
rename to pandas/tests/io/json/data/tsframe_v012.json
diff --git a/pandas/tests/io/json/test_json_table_schema.py b/pandas/tests/io/json/test_json_table_schema.py
new file mode 100644
index 0000000000000..e447a74b2b462
--- /dev/null
+++ b/pandas/tests/io/json/test_json_table_schema.py
@@ -0,0 +1,470 @@
+"""Tests for Table Schema integration."""
+import json
+from collections import OrderedDict
+
+import numpy as np
+import pandas as pd
+import pytest
+
+from pandas import DataFrame
+from pandas.core.dtypes.dtypes import (
+    PeriodDtype, CategoricalDtype, DatetimeTZDtype)
+from pandas.io.json.table_schema import (
+    as_json_table_type,
+    build_table_schema,
+    make_field,
+    set_default_names)
+
+
+class TestBuildSchema(object):
+
+    def setup_method(self, method):
+        self.df = DataFrame(
+            {'A': [1, 2, 3, 4],
+             'B': ['a', 'b', 'c', 'c'],
+             'C': pd.date_range('2016-01-01', freq='d', periods=4),
+             'D': pd.timedelta_range('1H', periods=4, freq='T'),
+             },
+            index=pd.Index(range(4), name='idx'))
+
+    def test_build_table_schema(self):
+        result = build_table_schema(self.df, version=False)
+        expected = {
+            'fields': [{'name': 'idx', 'type': 'integer'},
+                       {'name': 'A', 'type': 'integer'},
+                       {'name': 'B', 'type': 'string'},
+                       {'name': 'C', 'type': 'datetime'},
+                       {'name': 'D', 'type': 'duration'},
+                       ],
+            'primaryKey': ['idx']
+        }
+        assert result == expected
+        result = build_table_schema(self.df)
+        assert "pandas_version" in result
+
+    def test_series(self):
+        s = pd.Series([1, 2, 3], name='foo')
+        result = build_table_schema(s, version=False)
+        expected = {'fields': [{'name': 'index', 'type': 'integer'},
+                               {'name': 'foo', 'type': 'integer'}],
+                    'primaryKey': ['index']}
+        assert result == expected
+        result = build_table_schema(s)
+        assert 'pandas_version' in result
+
+    def test_series_unnamed(self):
+        result = build_table_schema(pd.Series([1, 2, 3]), version=False)
+        expected = {'fields': [{'name': 'index', 'type': 'integer'},
+                               {'name': 'values', 'type': 'integer'}],
+                    'primaryKey': ['index']}
+        assert result == expected
+
+    def test_multiindex(self):
+        df = self.df.copy()
+        idx = pd.MultiIndex.from_product([('a', 'b'), (1, 2)])
+        df.index = idx
+
+        result = build_table_schema(df, version=False)
+        expected = {
+            'fields': [{'name': 'level_0', 'type': 'string'},
+                       {'name': 'level_1', 'type': 'integer'},
+                       {'name': 'A', 'type': 'integer'},
+                       {'name': 'B', 'type': 'string'},
+                       {'name': 'C', 'type': 'datetime'},
+                       {'name': 'D', 'type': 'duration'},
+                       ],
+            'primaryKey': ['level_0', 'level_1']
+        }
+        assert result == expected
+
+        df.index.names = ['idx0', None]
+        expected['fields'][0]['name'] = 'idx0'
+        expected['primaryKey'] = ['idx0', 'level_1']
+        result = build_table_schema(df, version=False)
+        assert result == expected
+
+
+class TestTableSchemaType(object):
+
+    def test_as_json_table_type_int_data(self):
+        int_data = [1, 2, 3]
+        int_types = [np.int, np.int16, np.int32, np.int64]
+        for t in int_types:
+            assert as_json_table_type(np.array(
+                int_data, dtype=t)) == 'integer'
+
+    def test_as_json_table_type_float_data(self):
+        float_data = [1., 2., 3.]
+        float_types = [np.float, np.float16, np.float32, np.float64]
+        for t in float_types:
+            assert as_json_table_type(np.array(
+                float_data, dtype=t)) == 'number'
+
+    def test_as_json_table_type_bool_data(self):
+        bool_data = [True, False]
+        bool_types = [bool, np.bool]
+        for t in bool_types:
+            assert as_json_table_type(np.array(
+                bool_data, dtype=t)) == 'boolean'
+
+    def test_as_json_table_type_date_data(self):
+        date_data = [pd.to_datetime(['2016']),
+                     pd.to_datetime(['2016'], utc=True),
+                     pd.Series(pd.to_datetime(['2016'])),
+                     pd.Series(pd.to_datetime(['2016'], utc=True)),
+                     pd.period_range('2016', freq='A', periods=3)]
+        for arr in date_data:
+            assert as_json_table_type(arr) == 'datetime'
+
+    def test_as_json_table_type_string_data(self):
+        strings = [pd.Series(['a', 'b']), pd.Index(['a', 'b'])]
+        for t in strings:
+            assert as_json_table_type(t) == 'string'
+
+    def test_as_json_table_type_categorical_data(self):
+        assert as_json_table_type(pd.Categorical(['a'])) == 'any'
+        assert as_json_table_type(pd.Categorical([1])) == 'any'
+        assert as_json_table_type(pd.Series(pd.Categorical([1]))) == 'any'
+        assert as_json_table_type(pd.CategoricalIndex([1])) == 'any'
+        assert as_json_table_type(pd.Categorical([1])) == 'any'
+
+    # ------
+    # dtypes
+    # ------
+    def test_as_json_table_type_int_dtypes(self):
+        integers = [np.int, np.int16, np.int32, np.int64]
+        for t in integers:
+            assert as_json_table_type(t) == 'integer'
+
+    def test_as_json_table_type_float_dtypes(self):
+        floats = [np.float, np.float16, np.float32, np.float64]
+        for t in floats:
+            assert as_json_table_type(t) == 'number'
+
+    def test_as_json_table_type_bool_dtypes(self):
+        bools = [bool, np.bool]
+        for t in bools:
+            assert as_json_table_type(t) == 'boolean'
+
+    def test_as_json_table_type_date_dtypes(self):
+        # TODO: datetime.date? datetime.time?
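+        # Summary sketch of the mapping this class asserts (hedged; it
+        # mirrors the tests above rather than the implementation):
+        #
+        #     as_json_table_type(np.int64)              -> 'integer'
+        #     as_json_table_type(np.float32)            -> 'number'
+        #     as_json_table_type(np.bool)               -> 'boolean'
+        #     as_json_table_type(np.datetime64)         -> 'datetime'
+        #     as_json_table_type(pd.Categorical(['a'])) -> 'any'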
+ dates = [np.datetime64, np.dtype("buffer in the C parser can be freed twice leading @@ -96,55 +96,34 @@ def test_dtype_and_names_error(self): 3.0 3 """ # fallback casting, but not castable - with tm.assertRaisesRegexp(ValueError, 'cannot safely convert'): + with tm.assert_raises_regex(ValueError, 'cannot safely convert'): self.read_csv(StringIO(data), sep=r'\s+', header=None, names=['a', 'b'], dtype={'a': np.int32}) - def test_passing_dtype(self): - # see gh-6607 + def test_unsupported_dtype(self): df = DataFrame(np.random.rand(5, 2), columns=list( 'AB'), index=['1A', '1B', '1C', '1D', '1E']) - with tm.ensure_clean('__passing_str_as_dtype__.csv') as path: + with tm.ensure_clean('__unsupported_dtype__.csv') as path: df.to_csv(path) - # see gh-3795: passing 'str' as the dtype - result = self.read_csv(path, dtype=str, index_col=0) - tm.assert_series_equal(result.dtypes, Series( - {'A': 'object', 'B': 'object'})) - - # we expect all object columns, so need to - # convert to test for equivalence - result = result.astype(float) - tm.assert_frame_equal(result, df) - - # invalid dtype - self.assertRaises(TypeError, self.read_csv, path, - dtype={'A': 'foo', 'B': 'float64'}, - index_col=0) - # valid but we don't support it (date) - self.assertRaises(TypeError, self.read_csv, path, - dtype={'A': 'datetime64', 'B': 'float64'}, - index_col=0) - self.assertRaises(TypeError, self.read_csv, path, - dtype={'A': 'datetime64', 'B': 'float64'}, - index_col=0, parse_dates=['B']) + pytest.raises(TypeError, self.read_csv, path, + dtype={'A': 'datetime64', 'B': 'float64'}, + index_col=0) + pytest.raises(TypeError, self.read_csv, path, + dtype={'A': 'datetime64', 'B': 'float64'}, + index_col=0, parse_dates=['B']) # valid but we don't support it - self.assertRaises(TypeError, self.read_csv, path, - dtype={'A': 'timedelta64', 'B': 'float64'}, - index_col=0) + pytest.raises(TypeError, self.read_csv, path, + dtype={'A': 'timedelta64', 'B': 'float64'}, + index_col=0) # valid but unsupported - fixed width unicode string - self.assertRaises(TypeError, self.read_csv, path, - dtype={'A': 'U8'}, - index_col=0) - - # see gh-12048: empty frame - actual = self.read_csv(StringIO('A,B'), dtype=str) - expected = DataFrame({'A': [], 'B': []}, index=[], dtype=str) - tm.assert_frame_equal(actual, expected) + pytest.raises(TypeError, self.read_csv, path, + dtype={'A': 'U8'}, + index_col=0) def test_precise_conversion(self): # see gh-8002 @@ -173,112 +152,14 @@ def error(val): precise_errors.append(error(precise_val)) # round-trip should match float() - self.assertEqual(roundtrip_val, float(text[2:])) - - self.assertTrue(sum(precise_errors) <= sum(normal_errors)) - self.assertTrue(max(precise_errors) <= max(normal_errors)) - - def test_pass_dtype(self): - data = """\ -one,two -1,2.5 -2,3.5 -3,4.5 -4,5.5""" + assert roundtrip_val == float(text[2:]) - result = self.read_csv(StringIO(data), dtype={'one': 'u1', 1: 'S1'}) - self.assertEqual(result['one'].dtype, 'u1') - self.assertEqual(result['two'].dtype, 'object') - - def test_categorical_dtype(self): - # GH 10153 - data = """a,b,c -1,a,3.4 -1,a,3.4 -2,b,4.5""" - expected = pd.DataFrame({'a': Categorical(['1', '1', '2']), - 'b': Categorical(['a', 'a', 'b']), - 'c': Categorical(['3.4', '3.4', '4.5'])}) - actual = self.read_csv(StringIO(data), dtype='category') - tm.assert_frame_equal(actual, expected) - - actual = self.read_csv(StringIO(data), dtype=CategoricalDtype()) - tm.assert_frame_equal(actual, expected) - - actual = self.read_csv(StringIO(data), dtype={'a': 'category', - 'b': 
'category', - 'c': CategoricalDtype()}) - tm.assert_frame_equal(actual, expected) - - actual = self.read_csv(StringIO(data), dtype={'b': 'category'}) - expected = pd.DataFrame({'a': [1, 1, 2], - 'b': Categorical(['a', 'a', 'b']), - 'c': [3.4, 3.4, 4.5]}) - tm.assert_frame_equal(actual, expected) - - actual = self.read_csv(StringIO(data), dtype={1: 'category'}) - tm.assert_frame_equal(actual, expected) - - # unsorted - data = """a,b,c -1,b,3.4 -1,b,3.4 -2,a,4.5""" - expected = pd.DataFrame({'a': Categorical(['1', '1', '2']), - 'b': Categorical(['b', 'b', 'a']), - 'c': Categorical(['3.4', '3.4', '4.5'])}) - actual = self.read_csv(StringIO(data), dtype='category') - tm.assert_frame_equal(actual, expected) - - # missing - data = """a,b,c -1,b,3.4 -1,nan,3.4 -2,a,4.5""" - expected = pd.DataFrame({'a': Categorical(['1', '1', '2']), - 'b': Categorical(['b', np.nan, 'a']), - 'c': Categorical(['3.4', '3.4', '4.5'])}) - actual = self.read_csv(StringIO(data), dtype='category') - tm.assert_frame_equal(actual, expected) - - def test_categorical_dtype_encoding(self): - # GH 10153 - pth = tm.get_data_path('unicode_series.csv') - encoding = 'latin-1' - expected = self.read_csv(pth, header=None, encoding=encoding) - expected[1] = Categorical(expected[1]) - actual = self.read_csv(pth, header=None, encoding=encoding, - dtype={1: 'category'}) - tm.assert_frame_equal(actual, expected) - - pth = tm.get_data_path('utf16_ex.txt') - encoding = 'utf-16' - expected = self.read_table(pth, encoding=encoding) - expected = expected.apply(Categorical) - actual = self.read_table(pth, encoding=encoding, dtype='category') - tm.assert_frame_equal(actual, expected) - - def test_categorical_dtype_chunksize(self): - # GH 10153 - data = """a,b -1,a -1,b -1,b -2,c""" - expecteds = [pd.DataFrame({'a': [1, 1], - 'b': Categorical(['a', 'b'])}), - pd.DataFrame({'a': [1, 2], - 'b': Categorical(['b', 'c'])}, - index=[2, 3])] - actuals = self.read_csv(StringIO(data), dtype={'b': 'category'}, - chunksize=2) - - for actual, expected in zip(actuals, expecteds): - tm.assert_frame_equal(actual, expected) + assert sum(precise_errors) <= sum(normal_errors) + assert max(precise_errors) <= max(normal_errors) def test_pass_dtype_as_recarray(self): if compat.is_platform_windows() and self.low_memory: - raise nose.SkipTest( + pytest.skip( "segfaults on win-64, only when all tests are run") data = """\ @@ -292,68 +173,8 @@ def test_pass_dtype_as_recarray(self): FutureWarning, check_stacklevel=False): result = self.read_csv(StringIO(data), dtype={ 'one': 'u1', 1: 'S1'}, as_recarray=True) - self.assertEqual(result['one'].dtype, 'u1') - self.assertEqual(result['two'].dtype, 'S1') - - def test_empty_pass_dtype(self): - data = 'one,two' - result = self.read_csv(StringIO(data), dtype={'one': 'u1'}) - - expected = DataFrame({'one': np.empty(0, dtype='u1'), - 'two': np.empty(0, dtype=np.object)}) - tm.assert_frame_equal(result, expected, check_index_type=False) - - def test_empty_with_index_pass_dtype(self): - data = 'one,two' - result = self.read_csv(StringIO(data), index_col=['one'], - dtype={'one': 'u1', 1: 'f'}) - - expected = DataFrame({'two': np.empty(0, dtype='f')}, - index=Index([], dtype='u1', name='one')) - tm.assert_frame_equal(result, expected, check_index_type=False) - - def test_empty_with_multiindex_pass_dtype(self): - data = 'one,two,three' - result = self.read_csv(StringIO(data), index_col=['one', 'two'], - dtype={'one': 'u1', 1: 'f8'}) - - exp_idx = MultiIndex.from_arrays([np.empty(0, dtype='u1'), - np.empty(0, dtype='O')], - names=['one', 
'two']) - expected = DataFrame( - {'three': np.empty(0, dtype=np.object)}, index=exp_idx) - tm.assert_frame_equal(result, expected, check_index_type=False) - - def test_empty_with_mangled_column_pass_dtype_by_names(self): - data = 'one,one' - result = self.read_csv(StringIO(data), dtype={ - 'one': 'u1', 'one.1': 'f'}) - - expected = DataFrame( - {'one': np.empty(0, dtype='u1'), 'one.1': np.empty(0, dtype='f')}) - tm.assert_frame_equal(result, expected, check_index_type=False) - - def test_empty_with_mangled_column_pass_dtype_by_indexes(self): - data = 'one,one' - result = self.read_csv(StringIO(data), dtype={0: 'u1', 1: 'f'}) - - expected = DataFrame( - {'one': np.empty(0, dtype='u1'), 'one.1': np.empty(0, dtype='f')}) - tm.assert_frame_equal(result, expected, check_index_type=False) - - def test_empty_with_dup_column_pass_dtype_by_indexes(self): - # see gh-9424 - expected = pd.concat([Series([], name='one', dtype='u1'), - Series([], name='one.1', dtype='f')], axis=1) - - data = 'one,one' - result = self.read_csv(StringIO(data), dtype={0: 'u1', 1: 'f'}) - tm.assert_frame_equal(result, expected, check_index_type=False) - - data = '' - result = self.read_csv(StringIO(data), names=['one', 'one'], - dtype={0: 'u1', 1: 'f'}) - tm.assert_frame_equal(result, expected, check_index_type=False) + assert result['one'].dtype == 'u1' + assert result['two'].dtype == 'S1' def test_usecols_dtypes(self): data = """\ @@ -374,8 +195,8 @@ def test_usecols_dtypes(self): converters={'a': str}, dtype={'b': int, 'c': float}, ) - self.assertTrue((result.dtypes == [object, np.int, np.float]).all()) - self.assertTrue((result2.dtypes == [object, np.float]).all()) + assert (result.dtypes == [object, np.int, np.float]).all() + assert (result2.dtypes == [object, np.float]).all() def test_disable_bool_parsing(self): # #2090 @@ -387,10 +208,10 @@ def test_disable_bool_parsing(self): No,No,No""" result = self.read_csv(StringIO(data), dtype=object) - self.assertTrue((result.dtypes == object).all()) + assert (result.dtypes == object).all() result = self.read_csv(StringIO(data), dtype=object, na_filter=False) - self.assertEqual(result['B'][2], '') + assert result['B'][2] == '' def test_custom_lineterminator(self): data = 'a,b,c~1,2,3~4,5,6' @@ -400,16 +221,6 @@ def test_custom_lineterminator(self): tm.assert_frame_equal(result, expected) - def test_raise_on_passed_int_dtype_with_nas(self): - # see gh-2631 - data = """YEAR, DOY, a -2001,106380451,10 -2001,,11 -2001,106380451,67""" - self.assertRaises(ValueError, self.read_csv, StringIO(data), - sep=",", skipinitialspace=True, - dtype={'DOY': np.int64}) - def test_parse_ragged_csv(self): data = """1,2,3 1,2,3,4 @@ -562,65 +373,47 @@ def test_internal_null_byte(self): result = self.read_csv(StringIO(data), names=names) tm.assert_frame_equal(result, expected) - def test_empty_dtype(self): - # see gh-14712 - data = 'a,b' - - expected = pd.DataFrame(columns=['a', 'b'], dtype=np.float64) - result = self.read_csv(StringIO(data), header=0, dtype=np.float64) - tm.assert_frame_equal(result, expected) - - expected = pd.DataFrame({'a': pd.Categorical([]), - 'b': pd.Categorical([])}, - index=[]) - result = self.read_csv(StringIO(data), header=0, - dtype='category') - tm.assert_frame_equal(result, expected) - - expected = pd.DataFrame(columns=['a', 'b'], dtype='datetime64[ns]') - result = self.read_csv(StringIO(data), header=0, - dtype='datetime64[ns]') - tm.assert_frame_equal(result, expected) - - expected = pd.DataFrame({'a': pd.Series([], dtype='timedelta64[ns]'), - 'b': pd.Series([], 
dtype='timedelta64[ns]')}, - index=[]) - result = self.read_csv(StringIO(data), header=0, - dtype='timedelta64[ns]') - tm.assert_frame_equal(result, expected) - - expected = pd.DataFrame(columns=['a', 'b']) - expected['a'] = expected['a'].astype(np.float64) - result = self.read_csv(StringIO(data), header=0, - dtype={'a': np.float64}) - tm.assert_frame_equal(result, expected) - - expected = pd.DataFrame(columns=['a', 'b']) - expected['a'] = expected['a'].astype(np.float64) - result = self.read_csv(StringIO(data), header=0, - dtype={0: np.float64}) - tm.assert_frame_equal(result, expected) - - expected = pd.DataFrame(columns=['a', 'b']) - expected['a'] = expected['a'].astype(np.int32) - expected['b'] = expected['b'].astype(np.float64) - result = self.read_csv(StringIO(data), header=0, - dtype={'a': np.int32, 1: np.float64}) - tm.assert_frame_equal(result, expected) - def test_read_nrows_large(self): # gh-7626 - Read only nrows of data in for large inputs (>262144b) header_narrow = '\t'.join(['COL_HEADER_' + str(i) - for i in range(10)]) + '\n' + for i in range(10)]) + '\n' data_narrow = '\t'.join(['somedatasomedatasomedata1' - for i in range(10)]) + '\n' + for i in range(10)]) + '\n' header_wide = '\t'.join(['COL_HEADER_' + str(i) - for i in range(15)]) + '\n' + for i in range(15)]) + '\n' data_wide = '\t'.join(['somedatasomedatasomedata2' - for i in range(15)]) + '\n' + for i in range(15)]) + '\n' test_input = (header_narrow + data_narrow * 1050 + header_wide + data_wide * 2) df = self.read_csv(StringIO(test_input), sep='\t', nrows=1010) - self.assertTrue(df.size == 1010 * 10) + assert df.size == 1010 * 10 + + def test_float_precision_round_trip_with_text(self): + # gh-15140 - This should not segfault on Python 2.7+ + df = self.read_csv(StringIO('a'), + float_precision='round_trip', + header=None) + tm.assert_frame_equal(df, DataFrame({0: ['a']})) + + def test_large_difference_in_columns(self): + # gh-14125 + count = 10000 + large_row = ('X,' * count)[:-1] + '\n' + normal_row = 'XXXXXX XXXXXX,111111111111111\n' + test_input = (large_row + normal_row * 6)[:-1] + result = self.read_csv(StringIO(test_input), header=None, usecols=[0]) + rows = test_input.split('\n') + expected = DataFrame([row.split(',')[0] for row in rows]) + + tm.assert_frame_equal(result, expected) + + def test_data_after_quote(self): + # see gh-15910 + + data = 'a\n1\n"b"a' + result = self.read_csv(StringIO(data)) + expected = DataFrame({'a': ['1', 'ba']}) + + tm.assert_frame_equal(result, expected) diff --git a/pandas/io/tests/parser/comment.py b/pandas/tests/io/parser/comment.py similarity index 100% rename from pandas/io/tests/parser/comment.py rename to pandas/tests/io/parser/comment.py diff --git a/pandas/io/tests/parser/common.py b/pandas/tests/io/parser/common.py similarity index 81% rename from pandas/io/tests/parser/common.py rename to pandas/tests/io/parser/common.py index 39addbf46314b..bcce0c6d020ae 100644 --- a/pandas/io/tests/parser/common.py +++ b/pandas/tests/io/parser/common.py @@ -9,17 +9,18 @@ import sys from datetime import datetime -import nose +import pytest import numpy as np -from pandas.lib import Timestamp +from pandas._libs.lib import Timestamp import pandas as pd import pandas.util.testing as tm from pandas import DataFrame, Series, Index, MultiIndex from pandas import compat -from pandas.compat import(StringIO, BytesIO, PY3, - range, lrange, u) -from pandas.io.common import DtypeWarning, EmptyDataError, URLError +from pandas.compat import (StringIO, BytesIO, PY3, + range, lrange, u) +from 
pandas.errors import DtypeWarning, EmptyDataError, ParserError +from pandas.io.common import URLError from pandas.io.parsers import TextFileReader, TextParser @@ -43,14 +44,14 @@ def test_empty_decimal_marker(self): """ # Parsers support only length-1 decimals msg = 'Only length-1 decimal markers supported' - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): self.read_csv(StringIO(data), decimal='') def test_bad_stream_exception(self): # Issue 13652: # This test validates that both python engine # and C engine will raise UnicodeDecodeError instead of - # c engine raising CParserError and swallowing exception + # c engine raising ParserError and swallowing exception # that caused read to fail. handle = open(self.csv_shiftjs, "rb") codec = codecs.lookup("utf-8") @@ -63,7 +64,7 @@ def test_bad_stream_exception(self): msg = "'utf-8' codec can't decode byte" else: msg = "'utf8' codec can't decode byte" - with tm.assertRaisesRegexp(UnicodeDecodeError, msg): + with tm.assert_raises_regex(UnicodeDecodeError, msg): self.read_csv(stream) stream.close() @@ -77,41 +78,6 @@ def test_read_csv(self): fname = prefix + compat.text_type(self.csv1) self.read_csv(fname, index_col=0, parse_dates=True) - def test_dialect(self): - data = """\ -label1,label2,label3 -index1,"a,c,e -index2,b,d,f -""" - - dia = csv.excel() - dia.quoting = csv.QUOTE_NONE - df = self.read_csv(StringIO(data), dialect=dia) - - data = '''\ -label1,label2,label3 -index1,a,c,e -index2,b,d,f -''' - exp = self.read_csv(StringIO(data)) - exp.replace('a', '"a', inplace=True) - tm.assert_frame_equal(df, exp) - - def test_dialect_str(self): - data = """\ -fruit:vegetable -apple:brocolli -pear:tomato -""" - exp = DataFrame({ - 'fruit': ['apple', 'pear'], - 'vegetable': ['brocolli', 'tomato'] - }) - dia = csv.register_dialect('mydialect', delimiter=':') # noqa - df = self.read_csv(StringIO(data), dialect='mydialect') - tm.assert_frame_equal(df, exp) - csv.unregister_dialect('mydialect') - def test_1000_sep(self): data = """A|B|C 1|2,334|5 @@ -139,7 +105,7 @@ def test_squeeze(self): expected = Series([1, 2, 3], name=1, index=idx) result = self.read_table(StringIO(data), sep=',', index_col=0, header=None, squeeze=True) - tm.assertIsInstance(result, Series) + assert isinstance(result, Series) tm.assert_series_equal(result, expected) def test_squeeze_no_view(self): @@ -147,7 +113,7 @@ def test_squeeze_no_view(self): # Series should not be a view data = """time,data\n0,10\n1,11\n2,12\n4,14\n5,15\n3,13""" result = self.read_csv(StringIO(data), index_col='time', squeeze=True) - self.assertFalse(result._is_view) + assert not result._is_view def test_malformed(self): # see gh-6607 @@ -160,7 +126,7 @@ def test_malformed(self): 2,3,4 """ msg = 'Expected 3 fields in line 4, saw 5' - with tm.assertRaisesRegexp(Exception, msg): + with tm.assert_raises_regex(Exception, msg): self.read_table(StringIO(data), sep=',', header=1, comment='#') @@ -174,7 +140,7 @@ def test_malformed(self): 2,3,4 """ msg = 'Expected 3 fields in line 6, saw 5' - with tm.assertRaisesRegexp(Exception, msg): + with tm.assert_raises_regex(Exception, msg): it = self.read_table(StringIO(data), sep=',', header=1, comment='#', iterator=True, chunksize=1, @@ -191,7 +157,7 @@ def test_malformed(self): 2,3,4 """ msg = 'Expected 3 fields in line 6, saw 5' - with tm.assertRaisesRegexp(Exception, msg): + with tm.assert_raises_regex(Exception, msg): it = self.read_table(StringIO(data), sep=',', header=1, comment='#', iterator=True, chunksize=1, skiprows=[2]) 
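The `Expected 3 fields in line 4, saw 5` messages asserted throughout `test_malformed` come from the tokenizer's ragged-row check. A minimal standalone reproduction (a hedged sketch: the `pandas.errors.ParserError` import path follows this PR's relocation, and the exact message wording can vary between engines and versions):

```python
from io import StringIO

import pandas as pd
from pandas.errors import ParserError

# The second data row has five fields where the header promises three.
data = "a,b,c\n1,2,3\n4,5,6,7,8\n"

try:
    pd.read_csv(StringIO(data))
except ParserError as err:
    print(err)  # e.g. "... Expected 3 fields in line 3, saw 5"
```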
@@ -207,7 +173,7 @@ def test_malformed(self): 2,3,4 """ msg = 'Expected 3 fields in line 6, saw 5' - with tm.assertRaisesRegexp(Exception, msg): + with tm.assert_raises_regex(Exception, msg): it = self.read_table(StringIO(data), sep=',', header=1, comment='#', iterator=True, chunksize=1, skiprows=[2]) @@ -224,7 +190,7 @@ def test_malformed(self): footer """ msg = 'Expected 3 fields in line 4, saw 5' - with tm.assertRaisesRegexp(Exception, msg): + with tm.assert_raises_regex(Exception, msg): self.read_table(StringIO(data), sep=',', header=1, comment='#', skipfooter=1) @@ -236,12 +202,12 @@ def test_quoting(self): Klosterdruckerei\tKlosterdruckerei (1609-1805)\t"Furststiftische Hofdruckerei, (1609-1805)\tGaller, Alois Klosterdruckerei\tKlosterdruckerei (1609-1805)\tHochfurstliche Buchhandlung """ # noqa - self.assertRaises(Exception, self.read_table, StringIO(bad_line_small), - sep='\t') + pytest.raises(Exception, self.read_table, StringIO(bad_line_small), + sep='\t') good_line_small = bad_line_small + '"' df = self.read_table(StringIO(good_line_small), sep='\t') - self.assertEqual(len(df), 3) + assert len(df) == 3 def test_unnamed_columns(self): data = """A,B,C,, @@ -254,9 +220,9 @@ def test_unnamed_columns(self): [11, 12, 13, 14, 15]], dtype=np.int64) df = self.read_table(StringIO(data), sep=',') tm.assert_almost_equal(df.values, expected) - self.assert_index_equal(df.columns, - Index(['A', 'B', 'C', 'Unnamed: 3', - 'Unnamed: 4'])) + tm.assert_index_equal(df.columns, + Index(['A', 'B', 'C', 'Unnamed: 3', + 'Unnamed: 4'])) def test_duplicate_columns(self): # TODO: add test for condition 'mangle_dupe_cols=False' @@ -271,13 +237,11 @@ def test_duplicate_columns(self): # check default behavior df = getattr(self, method)(StringIO(data), sep=',') - self.assertEqual(list(df.columns), - ['A', 'A.1', 'B', 'B.1', 'B.2']) + assert list(df.columns) == ['A', 'A.1', 'B', 'B.1', 'B.2'] df = getattr(self, method)(StringIO(data), sep=',', mangle_dupe_cols=True) - self.assertEqual(list(df.columns), - ['A', 'A.1', 'B', 'B.1', 'B.2']) + assert list(df.columns) == ['A', 'A.1', 'B', 'B.1', 'B.2'] def test_csv_mixed_type(self): data = """A,B,C @@ -295,29 +259,27 @@ def test_read_csv_dataframe(self): df = self.read_csv(self.csv1, index_col=0, parse_dates=True) df2 = self.read_table(self.csv1, sep=',', index_col=0, parse_dates=True) - self.assert_index_equal(df.columns, pd.Index(['A', 'B', 'C', 'D'])) - self.assertEqual(df.index.name, 'index') - self.assertIsInstance( + tm.assert_index_equal(df.columns, pd.Index(['A', 'B', 'C', 'D'])) + assert df.index.name == 'index' + assert isinstance( df.index[0], (datetime, np.datetime64, Timestamp)) - self.assertEqual(df.values.dtype, np.float64) + assert df.values.dtype == np.float64 tm.assert_frame_equal(df, df2) def test_read_csv_no_index_name(self): df = self.read_csv(self.csv2, index_col=0, parse_dates=True) df2 = self.read_table(self.csv2, sep=',', index_col=0, parse_dates=True) - self.assert_index_equal(df.columns, - pd.Index(['A', 'B', 'C', 'D', 'E'])) - self.assertIsInstance(df.index[0], - (datetime, np.datetime64, Timestamp)) - self.assertEqual(df.ix[:, ['A', 'B', 'C', 'D']].values.dtype, - np.float64) + tm.assert_index_equal(df.columns, + pd.Index(['A', 'B', 'C', 'D', 'E'])) + assert isinstance(df.index[0], (datetime, np.datetime64, Timestamp)) + assert df.loc[:, ['A', 'B', 'C', 'D']].values.dtype == np.float64 tm.assert_frame_equal(df, df2) def test_read_table_unicode(self): fin = BytesIO(u('\u0141aski, Jan;1').encode('utf-8')) df1 = self.read_table(fin, sep=";", 
encoding="utf-8", header=None) - tm.assertIsInstance(df1[0].values[0], compat.text_type) + assert isinstance(df1[0].values[0], compat.text_type) def test_read_table_wrong_num_columns(self): # too few! @@ -326,7 +288,7 @@ def test_read_table_wrong_num_columns(self): 6,7,8,9,10,11,12 11,12,13,14,15,16 """ - self.assertRaises(ValueError, self.read_csv, StringIO(data)) + pytest.raises(ValueError, self.read_csv, StringIO(data)) def test_read_duplicate_index_explicit(self): data = """index,A,B,C,D @@ -369,7 +331,7 @@ def test_parse_bools(self): True,3 """ data = self.read_csv(StringIO(data)) - self.assertEqual(data['A'].dtype, np.bool_) + assert data['A'].dtype == np.bool_ data = """A,B YES,1 @@ -381,7 +343,7 @@ def test_parse_bools(self): data = self.read_csv(StringIO(data), true_values=['yes', 'Yes', 'YES'], false_values=['no', 'NO', 'No']) - self.assertEqual(data['A'].dtype, np.bool_) + assert data['A'].dtype == np.bool_ data = """A,B TRUE,1 @@ -389,7 +351,7 @@ def test_parse_bools(self): TRUE,3 """ data = self.read_csv(StringIO(data)) - self.assertEqual(data['A'].dtype, np.bool_) + assert data['A'].dtype == np.bool_ data = """A,B foo,bar @@ -406,8 +368,8 @@ def test_int_conversion(self): 3.0,3 """ data = self.read_csv(StringIO(data)) - self.assertEqual(data['A'].dtype, np.float64) - self.assertEqual(data['B'].dtype, np.int64) + assert data['A'].dtype == np.float64 + assert data['B'].dtype == np.int64 def test_read_nrows(self): expected = self.read_csv(StringIO(self.data1))[:3] @@ -419,14 +381,17 @@ def test_read_nrows(self): df = self.read_csv(StringIO(self.data1), nrows=3.0) tm.assert_frame_equal(df, expected) - msg = "must be an integer" + msg = r"'nrows' must be an integer >=0" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): self.read_csv(StringIO(self.data1), nrows=1.2) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): self.read_csv(StringIO(self.data1), nrows='foo') + with tm.assert_raises_regex(ValueError, msg): + self.read_csv(StringIO(self.data1), nrows=-1) + def test_read_chunksize(self): reader = self.read_csv(StringIO(self.data1), index_col=0, chunksize=2) df = self.read_csv(StringIO(self.data1), index_col=0) @@ -437,6 +402,45 @@ def test_read_chunksize(self): tm.assert_frame_equal(chunks[1], df[2:4]) tm.assert_frame_equal(chunks[2], df[4:]) + # with invalid chunksize value: + msg = r"'chunksize' must be an integer >=1" + + with tm.assert_raises_regex(ValueError, msg): + self.read_csv(StringIO(self.data1), chunksize=1.3) + + with tm.assert_raises_regex(ValueError, msg): + self.read_csv(StringIO(self.data1), chunksize='foo') + + with tm.assert_raises_regex(ValueError, msg): + self.read_csv(StringIO(self.data1), chunksize=0) + + def test_read_chunksize_and_nrows(self): + + # gh-15755 + # With nrows + reader = self.read_csv(StringIO(self.data1), index_col=0, + chunksize=2, nrows=5) + df = self.read_csv(StringIO(self.data1), index_col=0, nrows=5) + + tm.assert_frame_equal(pd.concat(reader), df) + + # chunksize > nrows + reader = self.read_csv(StringIO(self.data1), index_col=0, + chunksize=8, nrows=5) + df = self.read_csv(StringIO(self.data1), index_col=0, nrows=5) + + tm.assert_frame_equal(pd.concat(reader), df) + + # with changing "size": + reader = self.read_csv(StringIO(self.data1), index_col=0, + chunksize=8, nrows=5) + df = self.read_csv(StringIO(self.data1), index_col=0, nrows=5) + + tm.assert_frame_equal(reader.get_chunk(size=2), df.iloc[:2]) + 
tm.assert_frame_equal(reader.get_chunk(size=4), df.iloc[2:5]) + with pytest.raises(StopIteration): + reader.get_chunk(size=3) + def test_read_chunksize_named(self): reader = self.read_csv( StringIO(self.data1), index_col='index', chunksize=2) @@ -457,7 +461,7 @@ def test_get_chunk_passed_chunksize(self): result = self.read_csv(StringIO(data), chunksize=2) piece = result.get_chunk() - self.assertEqual(len(piece), 2) + assert len(piece) == 2 def test_read_chunksize_generated_index(self): # GH 12185 @@ -512,7 +516,7 @@ def test_iterator(self): treader = self.read_table(StringIO(self.data1), sep=',', index_col=0, iterator=True) - tm.assertIsInstance(treader, TextFileReader) + assert isinstance(treader, TextFileReader) # gh-3967: stopping iteration when chunksize is specified data = """A,B,C @@ -531,15 +535,15 @@ def test_iterator(self): result = list(reader) expected = DataFrame(dict(A=[1, 4, 7], B=[2, 5, 8], C=[ 3, 6, 9]), index=['foo', 'bar', 'baz']) - self.assertEqual(len(result), 3) + assert len(result) == 3 tm.assert_frame_equal(pd.concat(result), expected) # skipfooter is not supported with the C parser yet if self.engine == 'python': # test bad parameter (skipfooter) reader = self.read_csv(StringIO(self.data1), index_col=0, - iterator=True, skipfooter=True) - self.assertRaises(ValueError, reader.read, 3) + iterator=True, skipfooter=1) + pytest.raises(ValueError, reader.read, 3) def test_pass_names_with_index(self): lines = self.data1.split('\n') @@ -635,7 +639,7 @@ def test_no_unnamed_index(self): 2 2 2 e f """ df = self.read_table(StringIO(data), sep=' ') - self.assertIsNone(df.index.name) + assert df.index.name is None def test_read_csv_parse_simple_list(self): text = """foo @@ -652,7 +656,7 @@ def test_read_csv_parse_simple_list(self): def test_url(self): # HTTP(S) url = ('https://raw.github.com/pandas-dev/pandas/master/' - 'pandas/io/tests/parser/data/salaries.csv') + 'pandas/tests/io/parser/data/salaries.csv') url_table = self.read_table(url) dirpath = tm.get_data_path() localtable = os.path.join(dirpath, 'salaries.csv') @@ -670,8 +674,8 @@ def test_file(self): url_table = self.read_table('file://localhost/' + localtable) except URLError: # fails on some systems - raise nose.SkipTest("failing on %s" % - ' '.join(platform.uname()).strip()) + pytest.skip("failing on %s" % + ' '.join(platform.uname()).strip()) tm.assert_frame_equal(url_table, local_table) @@ -679,7 +683,7 @@ def test_nonexistent_path(self): # gh-2428: pls no segfault # gh-14086: raise more helpful FileNotFoundError path = '%s.csv' % tm.rands(10) - self.assertRaises(compat.FileNotFoundError, self.read_csv, path) + pytest.raises(compat.FileNotFoundError, self.read_csv, path) def test_missing_trailing_delimiters(self): data = """A,B,C,D @@ -687,7 +691,7 @@ def test_missing_trailing_delimiters(self): 1,3,3, 1,4,5""" result = self.read_csv(StringIO(data)) - self.assertTrue(result['D'].isnull()[1:].all()) + assert result['D'].isnull()[1:].all() def test_skipinitialspace(self): s = ('"09-Apr-2012", "01:10:18.300", 2456026.548822908, 12849, ' @@ -701,7 +705,7 @@ def test_skipinitialspace(self): # it's 33 columns result = self.read_csv(sfile, names=lrange(33), na_values=['-9999.0'], header=None, skipinitialspace=True) - self.assertTrue(pd.isnull(result.ix[0, 29])) + assert pd.isnull(result.iloc[0, 29]) def test_utf16_bom_skiprows(self): # #2298 @@ -745,12 +749,12 @@ def test_utf16_example(self): # it works! 
and is the right length result = self.read_table(path, encoding='utf-16') - self.assertEqual(len(result), 50) + assert len(result) == 50 if not compat.PY3: buf = BytesIO(open(path, 'rb').read()) result = self.read_table(buf, encoding='utf-16') - self.assertEqual(len(result), 50) + assert len(result) == 50 def test_unicode_encoding(self): pth = tm.get_data_path('unicode_series.csv') @@ -761,7 +765,7 @@ def test_unicode_encoding(self): got = result[1][1632] expected = u('\xc1 k\xf6ldum klaka (Cold Fever) (1994)') - self.assertEqual(got, expected) + assert got == expected def test_trailing_delimiters(self): # #2442. grumble grumble @@ -786,10 +790,10 @@ def test_escapechar(self): result = self.read_csv(StringIO(data), escapechar='\\', quotechar='"', encoding='utf-8') - self.assertEqual(result['SEARCH_TERM'][2], - 'SLAGBORD, "Bergslagen", IKEA:s 1700-tals serie') - self.assertTrue(np.array_equal(result.columns, - ['SEARCH_TERM', 'ACTUAL_URL'])) + assert result['SEARCH_TERM'][2] == ('SLAGBORD, "Bergslagen", ' + 'IKEA:s 1700-tals serie') + tm.assert_index_equal(result.columns, + Index(['SEARCH_TERM', 'ACTUAL_URL'])) def test_int64_min_issues(self): # #2599 @@ -825,7 +829,7 @@ def test_parse_integers_above_fp_precision(self): 17007000002000192, 17007000002000194]}) - self.assertTrue(np.array_equal(result['Numbers'], expected['Numbers'])) + assert np.array_equal(result['Numbers'], expected['Numbers']) def test_chunks_have_consistent_numerical_type(self): integers = [str(i) for i in range(499999)] @@ -834,8 +838,8 @@ def test_chunks_have_consistent_numerical_type(self): with tm.assert_produces_warning(False): df = self.read_csv(StringIO(data)) # Assert that types were coerced. - self.assertTrue(type(df.a[0]) is np.float64) - self.assertEqual(df.a.dtype, np.float) + assert type(df.a[0]) is np.float64 + assert df.a.dtype == np.float def test_warn_if_chunks_have_mismatched_type(self): warning_type = False @@ -849,17 +853,17 @@ def test_warn_if_chunks_have_mismatched_type(self): with tm.assert_produces_warning(warning_type): df = self.read_csv(StringIO(data)) - self.assertEqual(df.a.dtype, np.object) + assert df.a.dtype == np.object def test_integer_overflow_bug(self): # see gh-2601 data = "65248E10 11\n55555E55 22\n" result = self.read_csv(StringIO(data), header=None, sep=' ') - self.assertTrue(result[0].dtype == np.float64) + assert result[0].dtype == np.float64 result = self.read_csv(StringIO(data), header=None, sep=r'\s+') - self.assertTrue(result[0].dtype == np.float64) + assert result[0].dtype == np.float64 def test_catch_too_many_names(self): # see gh-5156 @@ -868,8 +872,8 @@ def test_catch_too_many_names(self): 4,,6 7,8,9 10,11,12\n""" - tm.assertRaises(ValueError, self.read_csv, StringIO(data), - header=0, names=['a', 'b', 'c', 'd']) + pytest.raises(ValueError, self.read_csv, StringIO(data), + header=0, names=['a', 'b', 'c', 'd']) def test_ignore_leading_whitespace(self): # see gh-3374, gh-6607 @@ -882,7 +886,7 @@ def test_chunk_begins_with_newline_whitespace(self): # see gh-10022 data = '\n hello\nworld\n' result = self.read_csv(StringIO(data), header=None) - self.assertEqual(len(result), 2) + assert len(result) == 2 # see gh-9735: this issue is C parser-specific (bug when # parsing whitespace and characters at chunk boundary) @@ -944,28 +948,51 @@ def test_int64_overflow(self): 00013007854817840017963235 00013007854817840018860166""" + # 13007854817840016671868 > UINT64_MAX, so this + # will overflow and return object as the dtype. 
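+        # Doctest-style illustration of the comment above (hedged sketch,
+        # not an assertion the suite runs): an integer larger than
+        # np.iinfo(np.uint64).max fits no integer dtype, so the column is
+        # left as Python objects:
+        #
+        #     >>> from pandas.compat import StringIO
+        #     >>> pd.read_csv(StringIO('ID\n00013007854817840016671868')).ID.dtype
+        #     dtype('O')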
result = self.read_csv(StringIO(data)) - self.assertTrue(result['ID'].dtype == object) - - self.assertRaises(OverflowError, self.read_csv, - StringIO(data), converters={'ID': np.int64}) - - # Just inside int64 range: parse as integer + assert result['ID'].dtype == object + + # 13007854817840016671868 > UINT64_MAX, so attempts + # to cast to either int64 or uint64 will result in + # an OverflowError being raised. + for conv in (np.int64, np.uint64): + pytest.raises(OverflowError, self.read_csv, + StringIO(data), converters={'ID': conv}) + + # These numbers fall right inside the int64-uint64 range, + # so they should be parsed as integers. + ui_max = np.iinfo(np.uint64).max i_max = np.iinfo(np.int64).max i_min = np.iinfo(np.int64).min - for x in [i_max, i_min]: + + for x in [i_max, i_min, ui_max]: result = self.read_csv(StringIO(str(x)), header=None) expected = DataFrame([x]) tm.assert_frame_equal(result, expected) - # Just outside int64 range: parse as string - too_big = i_max + 1 + # These numbers fall just outside the int64-uint64 range, + # so they should be parsed as string. + too_big = ui_max + 1 too_small = i_min - 1 + for x in [too_big, too_small]: result = self.read_csv(StringIO(str(x)), header=None) expected = DataFrame([str(x)]) tm.assert_frame_equal(result, expected) + # No numerical dtype can hold both negative and uint64 values, + # so they should be cast as string. + data = '-1\n' + str(2**63) + expected = DataFrame([str(-1), str(2**63)]) + result = self.read_csv(StringIO(data), header=None) + tm.assert_frame_equal(result, expected) + + data = str(2**63) + '\n-1' + expected = DataFrame([str(2**63), str(-1)]) + result = self.read_csv(StringIO(data), header=None) + tm.assert_frame_equal(result, expected) + def test_empty_with_nrows_chunksize(self): # see gh-9535 expected = DataFrame([], columns=['foo', 'bar']) @@ -1051,18 +1078,18 @@ def test_eof_states(self): # ESCAPED_CHAR data = "a,b,c\n4,5,6\n\\" - self.assertRaises(Exception, self.read_csv, - StringIO(data), escapechar='\\') + pytest.raises(Exception, self.read_csv, + StringIO(data), escapechar='\\') # ESCAPE_IN_QUOTED_FIELD data = 'a,b,c\n4,5,6\n"\\' - self.assertRaises(Exception, self.read_csv, - StringIO(data), escapechar='\\') + pytest.raises(Exception, self.read_csv, + StringIO(data), escapechar='\\') # IN_QUOTED_FIELD data = 'a,b,c\n4,5,6\n"' - self.assertRaises(Exception, self.read_csv, - StringIO(data), escapechar='\\') + pytest.raises(Exception, self.read_csv, + StringIO(data), escapechar='\\') def test_uneven_lines_with_usecols(self): # See gh-12203 @@ -1075,7 +1102,7 @@ def test_uneven_lines_with_usecols(self): # make sure that an error is still thrown # when the 'usecols' parameter is not provided msg = r"Expected \d+ fields in line \d+, saw \d+" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): df = self.read_csv(StringIO(csv)) expected = DataFrame({ @@ -1101,10 +1128,10 @@ def test_read_empty_with_usecols(self): # throws the correct error, with or without usecols errmsg = "No columns to parse from file" - with tm.assertRaisesRegexp(EmptyDataError, errmsg): + with tm.assert_raises_regex(EmptyDataError, errmsg): self.read_csv(StringIO('')) - with tm.assertRaisesRegexp(EmptyDataError, errmsg): + with tm.assert_raises_regex(EmptyDataError, errmsg): self.read_csv(StringIO(''), usecols=usecols) expected = DataFrame(columns=usecols, index=[0], dtype=np.float64) @@ -1143,7 +1170,8 @@ def test_trailing_spaces(self): def test_raise_on_sep_with_delim_whitespace(self): # see
gh-6607 data = 'a b c\n1 2 3' - with tm.assertRaisesRegexp(ValueError, 'you can only specify one'): + with tm.assert_raises_regex(ValueError, + 'you can only specify one'): self.read_table(StringIO(data), sep=r'\s', delim_whitespace=True) def test_single_char_leading_whitespace(self): @@ -1214,7 +1242,7 @@ def test_regex_separator(self): df = self.read_table(StringIO(data), sep=r'\s+') expected = self.read_csv(StringIO(re.sub('[ ]+', ',', data)), index_col=0) - self.assertIsNone(expected.index.name) + assert expected.index.name is None tm.assert_frame_equal(df, expected) data = ' a b c\n1 2 3 \n4 5 6\n 7 8 9' @@ -1223,6 +1251,7 @@ def test_regex_separator(self): columns=['a', 'b', 'c']) tm.assert_frame_equal(result, expected) + @tm.capture_stdout def test_verbose_import(self): text = """a,b,c,d one,1,2,3 @@ -1234,22 +1263,18 @@ def test_verbose_import(self): one,1,2,3 two,1,2,3""" - buf = StringIO() - sys.stdout = buf + # Engines are verbose in different ways. + self.read_csv(StringIO(text), verbose=True) + output = sys.stdout.getvalue() - try: # engines are verbose in different ways - self.read_csv(StringIO(text), verbose=True) - if self.engine == 'c': - self.assertIn('Tokenization took:', buf.getvalue()) - self.assertIn('Parser memory cleanup took:', buf.getvalue()) - else: # Python engine - self.assertEqual(buf.getvalue(), - 'Filled 3 NA values in column a\n') - finally: - sys.stdout = sys.__stdout__ + if self.engine == 'c': + assert 'Tokenization took:' in output + assert 'Parser memory cleanup took:' in output + else: # Python engine + assert output == 'Filled 3 NA values in column a\n' - buf = StringIO() - sys.stdout = buf + # Reset the stdout buffer. + sys.stdout = StringIO() text = """a,b,c,d one,1,2,3 @@ -1261,20 +1286,19 @@ def test_verbose_import(self): seven,1,2,3 eight,1,2,3""" - try: # engines are verbose in different ways - self.read_csv(StringIO(text), verbose=True, index_col=0) - if self.engine == 'c': - self.assertIn('Tokenization took:', buf.getvalue()) - self.assertIn('Parser memory cleanup took:', buf.getvalue()) - else: # Python engine - self.assertEqual(buf.getvalue(), - 'Filled 1 NA values in column a\n') - finally: - sys.stdout = sys.__stdout__ + self.read_csv(StringIO(text), verbose=True, index_col=0) + output = sys.stdout.getvalue() + + # Engines are verbose in different ways. + if self.engine == 'c': + assert 'Tokenization took:' in output + assert 'Parser memory cleanup took:' in output + else: # Python engine + assert output == 'Filled 1 NA values in column a\n' def test_iteration_open_handle(self): if PY3: - raise nose.SkipTest( + pytest.skip( "won't work in Python 3 {0}".format(sys.version_info)) with tm.ensure_clean() as path: @@ -1287,8 +1311,8 @@ def test_iteration_open_handle(self): break if self.engine == 'c': - tm.assertRaises(Exception, self.read_table, - f, squeeze=True, header=None) + pytest.raises(Exception, self.read_table, + f, squeeze=True, header=None) else: result = self.read_table(f, squeeze=True, header=None) expected = Series(['DDD', 'EEE', 'FFF', 'GGG'], name=0) @@ -1305,9 +1329,9 @@ def test_1000_sep_with_decimal(self): 'C': [5, 10.] 
}) - tm.assert_equal(expected.A.dtype, 'int64') - tm.assert_equal(expected.B.dtype, 'float') - tm.assert_equal(expected.C.dtype, 'float') + assert expected.A.dtype == 'int64' + assert expected.B.dtype == 'float' + assert expected.C.dtype == 'float' df = self.read_csv(StringIO(data), sep='|', thousands=',', decimal='.') tm.assert_frame_equal(df, expected) @@ -1335,9 +1359,9 @@ def test_euro_decimal_format(self): 3;878,158;108013,434;GHI;rez;2,735694704""" df2 = self.read_csv(StringIO(data), sep=';', decimal=',') - self.assertEqual(df2['Number1'].dtype, float) - self.assertEqual(df2['Number2'].dtype, float) - self.assertEqual(df2['Number3'].dtype, float) + assert df2['Number1'].dtype == float + assert df2['Number2'].dtype == float + assert df2['Number3'].dtype == float def test_read_duplicate_names(self): # See gh-7160 @@ -1378,11 +1402,11 @@ def test_inf_parsing(self): def test_raise_on_no_columns(self): # single newline data = "\n" - self.assertRaises(EmptyDataError, self.read_csv, StringIO(data)) + pytest.raises(EmptyDataError, self.read_csv, StringIO(data)) # test with more than a single newline data = "\n\n\n" - self.assertRaises(EmptyDataError, self.read_csv, StringIO(data)) + pytest.raises(EmptyDataError, self.read_csv, StringIO(data)) def test_compact_ints_use_unsigned(self): # see gh-13323 @@ -1437,7 +1461,7 @@ def test_compact_ints_as_recarray(self): result = self.read_csv(StringIO(data), delimiter=',', header=None, compact_ints=True, as_recarray=True) ex_dtype = np.dtype([(str(i), 'i1') for i in range(4)]) - self.assertEqual(result.dtype, ex_dtype) + assert result.dtype == ex_dtype with tm.assert_produces_warning( FutureWarning, check_stacklevel=False): @@ -1445,7 +1469,7 @@ as_recarray=True, compact_ints=True, use_unsigned=True) ex_dtype = np.dtype([(str(i), 'u1') for i in range(4)]) - self.assertEqual(result.dtype, ex_dtype) + assert result.dtype == ex_dtype def test_as_recarray(self): # basic test @@ -1453,7 +1477,7 @@ FutureWarning, check_stacklevel=False): data = 'a,b\n1,a\n2,b' expected = np.array([(1, 'a'), (2, 'b')], - dtype=[('a', '= LooseVersion('2.5.0'): - raise nose.SkipTest("testing yearfirst=True not-support" - "on datetutil < 2.5.0 this works but" - "is wrong") + pytest.skip("yearfirst=True is not supported" + " on dateutil < 2.5.0; this works but" + " is wrong") rs = self.read_csv(StringIO(data), index_col=0, parse_dates=[['date', 'time']]) @@ -317,13 +320,13 @@ def test_multi_index_parse_dates(self): 20090103,three,c,4,5 """ df = self.read_csv(StringIO(data), index_col=[0, 1], parse_dates=True) - self.assertIsInstance(df.index.levels[0][0], - (datetime, np.datetime64, Timestamp)) + assert isinstance(df.index.levels[0][0], - (datetime, np.datetime64, Timestamp)) + assert isinstance(df.index.levels[0][0], + (datetime, np.datetime64, Timestamp)) # specify columns out of order!
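# (with index_col=[1, 0] the date column becomes level 1 of the index)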
df2 = self.read_csv(StringIO(data), index_col=[1, 0], parse_dates=True) - self.assertIsInstance(df2.index.levels[1][0], - (datetime, np.datetime64, Timestamp)) + assert isinstance(df2.index.levels[1][0], + (datetime, np.datetime64, Timestamp)) def test_parse_dates_custom_euroformat(self): text = """foo,bar,baz @@ -344,11 +347,11 @@ def test_parse_dates_custom_euroformat(self): tm.assert_frame_equal(df, expected) parser = lambda d: parse_date(d, day_first=True) - self.assertRaises(TypeError, self.read_csv, - StringIO(text), skiprows=[0], - names=['time', 'Q', 'NTU'], index_col=0, - parse_dates=True, date_parser=parser, - na_values=['NA']) + pytest.raises(TypeError, self.read_csv, + StringIO(text), skiprows=[0], + names=['time', 'Q', 'NTU'], index_col=0, + parse_dates=True, date_parser=parser, + na_values=['NA']) def test_parse_tz_aware(self): # See gh-1693 @@ -358,15 +361,15 @@ def test_parse_tz_aware(self): # it works result = self.read_csv(data, index_col=0, parse_dates=True) stamp = result.index[0] - self.assertEqual(stamp.minute, 39) + assert stamp.minute == 39 try: - self.assertIs(result.index.tz, pytz.utc) + assert result.index.tz is pytz.utc except AssertionError: # hello Yaroslav arr = result.index.to_pydatetime() result = tools.to_datetime(arr, utc=True)[0] - self.assertEqual(stamp.minute, result.minute) - self.assertEqual(stamp.hour, result.hour) - self.assertEqual(stamp.day, result.day) + assert stamp.minute == result.minute + assert stamp.hour == result.hour + assert stamp.day == result.day def test_multiple_date_cols_index(self): data = """ @@ -399,7 +402,7 @@ def test_multiple_date_cols_chunked(self): chunks = list(reader) - self.assertNotIn('nominalTime', df) + assert 'nominalTime' not in df tm.assert_frame_equal(chunks[0], df[:2]) tm.assert_frame_equal(chunks[1], df[2:4]) @@ -432,11 +435,11 @@ def test_read_with_parse_dates_scalar_non_bool(self): data = """A,B,C 1,2,2003-11-1""" - tm.assertRaisesRegexp(TypeError, errmsg, self.read_csv, - StringIO(data), parse_dates="C") - tm.assertRaisesRegexp(TypeError, errmsg, self.read_csv, - StringIO(data), parse_dates="C", - index_col="C") + tm.assert_raises_regex(TypeError, errmsg, self.read_csv, + StringIO(data), parse_dates="C") + tm.assert_raises_regex(TypeError, errmsg, self.read_csv, + StringIO(data), parse_dates="C", + index_col="C") def test_read_with_parse_dates_invalid_type(self): errmsg = ("Only booleans, lists, and " @@ -445,19 +448,20 @@ def test_read_with_parse_dates_invalid_type(self): data = """A,B,C 1,2,2003-11-1""" - tm.assertRaisesRegexp(TypeError, errmsg, self.read_csv, - StringIO(data), parse_dates=(1,)) - tm.assertRaisesRegexp(TypeError, errmsg, self.read_csv, - StringIO(data), parse_dates=np.array([4, 5])) - tm.assertRaisesRegexp(TypeError, errmsg, self.read_csv, - StringIO(data), parse_dates=set([1, 3, 3])) + tm.assert_raises_regex(TypeError, errmsg, self.read_csv, + StringIO(data), parse_dates=(1,)) + tm.assert_raises_regex(TypeError, errmsg, + self.read_csv, StringIO(data), + parse_dates=np.array([4, 5])) + tm.assert_raises_regex(TypeError, errmsg, self.read_csv, + StringIO(data), parse_dates=set([1, 3, 3])) def test_parse_dates_empty_string(self): # see gh-2263 data = "Date, test\n2012-01-01, 1\n,2" result = self.read_csv(StringIO(data), parse_dates=["Date"], na_filter=False) - self.assertTrue(result['Date'].isnull()[1]) + assert result['Date'].isnull()[1] def test_parse_dates_noconvert_thousands(self): # see gh-14066 @@ -490,3 +494,164 @@ def test_parse_dates_noconvert_thousands(self): result = 
self.read_csv(StringIO(data), index_col=[0, 1], parse_dates=True, thousands='.') tm.assert_frame_equal(result, expected) + + def test_parse_date_time_multi_level_column_name(self): + data = """\ +D,T,A,B +date, time,a,b +2001-01-05, 09:00:00, 0.0, 10. +2001-01-06, 00:00:00, 1.0, 11. +""" + datecols = {'date_time': [0, 1]} + result = self.read_csv(StringIO(data), sep=',', header=[0, 1], + parse_dates=datecols, + date_parser=conv.parse_date_time) + + expected_data = [[datetime(2001, 1, 5, 9, 0, 0), 0., 10.], + [datetime(2001, 1, 6, 0, 0, 0), 1., 11.]] + expected = DataFrame(expected_data, + columns=['date_time', ('A', 'a'), ('B', 'b')]) + tm.assert_frame_equal(result, expected) + + def test_parse_date_time(self): + dates = np.array(['2007/1/3', '2008/2/4'], dtype=object) + times = np.array(['05:07:09', '06:08:00'], dtype=object) + expected = np.array([datetime(2007, 1, 3, 5, 7, 9), + datetime(2008, 2, 4, 6, 8, 0)]) + + result = conv.parse_date_time(dates, times) + assert (result == expected).all() + + data = """\ +date, time, a, b +2001-01-05, 10:00:00, 0.0, 10. +2001-01-05, 00:00:00, 1., 11. +""" + datecols = {'date_time': [0, 1]} + df = self.read_csv(StringIO(data), sep=',', header=0, + parse_dates=datecols, + date_parser=conv.parse_date_time) + assert 'date_time' in df + assert df.date_time.loc[0] == datetime(2001, 1, 5, 10, 0, 0) + + data = ("KORD,19990127, 19:00:00, 18:56:00, 0.8100\n" + "KORD,19990127, 20:00:00, 19:56:00, 0.0100\n" + "KORD,19990127, 21:00:00, 20:56:00, -0.5900\n" + "KORD,19990127, 21:00:00, 21:18:00, -0.9900\n" + "KORD,19990127, 22:00:00, 21:56:00, -0.5900\n" + "KORD,19990127, 23:00:00, 22:56:00, -0.5900") + + date_spec = {'nominal': [1, 2], 'actual': [1, 3]} + df = self.read_csv(StringIO(data), header=None, parse_dates=date_spec, + date_parser=conv.parse_date_time) + + def test_parse_date_fields(self): + years = np.array([2007, 2008]) + months = np.array([1, 2]) + days = np.array([3, 4]) + result = conv.parse_date_fields(years, months, days) + expected = np.array([datetime(2007, 1, 3), datetime(2008, 2, 4)]) + assert (result == expected).all() + + data = ("year, month, day, a\n 2001 , 01 , 10 , 10.\n" + "2001 , 02 , 1 , 11.") + datecols = {'ymd': [0, 1, 2]} + df = self.read_csv(StringIO(data), sep=',', header=0, + parse_dates=datecols, + date_parser=conv.parse_date_fields) + assert 'ymd' in df + assert df.ymd.loc[0] == datetime(2001, 1, 10) + + def test_datetime_six_col(self): + years = np.array([2007, 2008]) + months = np.array([1, 2]) + days = np.array([3, 4]) + hours = np.array([5, 6]) + minutes = np.array([7, 8]) + seconds = np.array([9, 0]) + expected = np.array([datetime(2007, 1, 3, 5, 7, 9), + datetime(2008, 2, 4, 6, 8, 0)]) + + result = conv.parse_all_fields(years, months, days, + hours, minutes, seconds) + + assert (result == expected).all() + + data = """\ +year, month, day, hour, minute, second, a, b +2001, 01, 05, 10, 00, 0, 0.0, 10. +2001, 01, 5, 10, 0, 00, 1., 11. +""" + datecols = {'ymdHMS': [0, 1, 2, 3, 4, 5]} + df = self.read_csv(StringIO(data), sep=',', header=0, + parse_dates=datecols, + date_parser=conv.parse_all_fields) + assert 'ymdHMS' in df + assert df.ymdHMS.loc[0] == datetime(2001, 1, 5, 10, 0, 0) + + def test_datetime_fractional_seconds(self): + data = """\ +year, month, day, hour, minute, second, a, b +2001, 01, 05, 10, 00, 0.123456, 0.0, 10. +2001, 01, 5, 10, 0, 0.500000, 1., 11. 
+""" + datecols = {'ymdHMS': [0, 1, 2, 3, 4, 5]} + df = self.read_csv(StringIO(data), sep=',', header=0, + parse_dates=datecols, + date_parser=conv.parse_all_fields) + assert 'ymdHMS' in df + assert df.ymdHMS.loc[0] == datetime(2001, 1, 5, 10, 0, 0, + microsecond=123456) + assert df.ymdHMS.loc[1] == datetime(2001, 1, 5, 10, 0, 0, + microsecond=500000) + + def test_generic(self): + data = "year, month, day, a\n 2001, 01, 10, 10.\n 2001, 02, 1, 11." + datecols = {'ym': [0, 1]} + dateconverter = lambda y, m: date(year=int(y), month=int(m), day=1) + df = self.read_csv(StringIO(data), sep=',', header=0, + parse_dates=datecols, + date_parser=dateconverter) + assert 'ym' in df + assert df.ym.loc[0] == date(2001, 1, 1) + + def test_dateparser_resolution_if_not_ns(self): + # GH 10245 + data = """\ +date,time,prn,rxstatus +2013-11-03,19:00:00,126,00E80000 +2013-11-03,19:00:00,23,00E80000 +2013-11-03,19:00:00,13,00E80000 +""" + + def date_parser(date, time): + datetime = np_array_datetime64_compat( + date + 'T' + time + 'Z', dtype='datetime64[s]') + return datetime + + df = self.read_csv(StringIO(data), date_parser=date_parser, + parse_dates={'datetime': ['date', 'time']}, + index_col=['datetime', 'prn']) + + datetimes = np_array_datetime64_compat(['2013-11-03T19:00:00Z'] * 3, + dtype='datetime64[s]') + df_correct = DataFrame(data={'rxstatus': ['00E80000'] * 3}, + index=MultiIndex.from_tuples( + [(datetimes[0], 126), + (datetimes[1], 23), + (datetimes[2], 13)], + names=['datetime', 'prn'])) + tm.assert_frame_equal(df, df_correct) + + def test_parse_date_column_with_empty_string(self): + # GH 6428 + data = """case,opdate + 7,10/18/2006 + 7,10/18/2008 + 621, """ + result = self.read_csv(StringIO(data), parse_dates=['opdate']) + expected_data = [[7, '10/18/2006'], + [7, '10/18/2008'], + [621, ' ']] + expected = DataFrame(expected_data, columns=['case', 'opdate']) + tm.assert_frame_equal(result, expected) diff --git a/pandas/io/tests/parser/python_parser_only.py b/pandas/tests/io/parser/python_parser_only.py similarity index 76% rename from pandas/io/tests/parser/python_parser_only.py rename to pandas/tests/io/parser/python_parser_only.py index ad62aaa275127..a0784d3aeae2d 100644 --- a/pandas/io/tests/parser/python_parser_only.py +++ b/pandas/tests/io/parser/python_parser_only.py @@ -8,30 +8,33 @@ """ import csv -import sys -import nose +import pytest import pandas.util.testing as tm from pandas import DataFrame, Index from pandas import compat +from pandas.errors import ParserError from pandas.compat import StringIO, BytesIO, u class PythonParserTests(object): - def test_negative_skipfooter_raises(self): - text = """#foo,a,b,c -#foo,a,b,c -#foo,a,b,c -#foo,a,b,c -#foo,a,b,c -#foo,a,b,c -1/1/2000,1.,2.,3. 
-1/2/2000,4,5,6 -1/3/2000,7,8,9 -""" - with tm.assertRaisesRegexp( - ValueError, 'skip footer cannot be negative'): + def test_invalid_skipfooter(self): + text = "a\n1\n2" + + # see gh-15925 (comment) + msg = "skipfooter must be an integer" + with tm.assert_raises_regex(ValueError, msg): + self.read_csv(StringIO(text), skipfooter="foo") + + with tm.assert_raises_regex(ValueError, msg): + self.read_csv(StringIO(text), skipfooter=1.5) + + with tm.assert_raises_regex(ValueError, msg): + self.read_csv(StringIO(text), skipfooter=True) + + msg = "skipfooter cannot be negative" + with tm.assert_raises_regex(ValueError, msg): self.read_csv(StringIO(text), skipfooter=-1) def test_sniff_delimiter(self): @@ -41,8 +44,8 @@ def test_sniff_delimiter(self): baz|7|8|9 """ data = self.read_csv(StringIO(text), index_col=0, sep=None) - self.assert_index_equal(data.index, - Index(['foo', 'bar', 'baz'], name='index')) + tm.assert_index_equal(data.index, + Index(['foo', 'bar', 'baz'], name='index')) data2 = self.read_csv(StringIO(text), index_col=0, delimiter='|') tm.assert_frame_equal(data, data2) @@ -78,7 +81,7 @@ def test_sniff_delimiter(self): def test_BytesIO_input(self): if not compat.PY3: - raise nose.SkipTest( + pytest.skip( "Bytes-related test - only needs to work on Python 3") data = BytesIO("שלום::1234\n562::123".encode('cp1255')) @@ -88,16 +91,9 @@ def test_BytesIO_input(self): def test_single_line(self): # see gh-6607: sniff separator - - buf = StringIO() - sys.stdout = buf - - try: - df = self.read_csv(StringIO('1,2'), names=['a', 'b'], - header=None, sep=None) - tm.assert_frame_equal(DataFrame({'a': [1], 'b': [2]}), df) - finally: - sys.stdout = sys.__stdout__ + df = self.read_csv(StringIO('1,2'), names=['a', 'b'], + header=None, sep=None) + tm.assert_frame_equal(DataFrame({'a': [1], 'b': [2]}), df) def test_skipfooter(self): # see gh-6607 @@ -129,7 +125,7 @@ def test_decompression_regex_sep(self): import gzip import bz2 except ImportError: - raise nose.SkipTest('need gzip and bz2 to run') + pytest.skip('need gzip and bz2 to run') with open(self.csv1, 'rb') as f: data = f.read() @@ -152,8 +148,8 @@ def test_decompression_regex_sep(self): result = self.read_csv(path, sep='::', compression='bz2') tm.assert_frame_equal(result, expected) - self.assertRaises(ValueError, self.read_csv, - path, compression='bz3') + pytest.raises(ValueError, self.read_csv, + path, compression='bz3') def test_read_table_buglet_4x_multiindex(self): # see gh-6607 @@ -164,7 +160,7 @@ def test_read_table_buglet_4x_multiindex(self): x q 30 3 -0.6662 -0.5243 -0.3580 0.89145 2.5838""" df = self.read_table(StringIO(text), sep=r'\s+') - self.assertEqual(df.index.names, ('one', 'two', 'three', 'four')) + assert df.index.names == ('one', 'two', 'three', 'four') # see gh-6893 data = ' A B C\na b c\n1 3 7 0 3 6\n3 1 4 1 5 9' @@ -212,27 +208,29 @@ def test_multi_char_sep_quotes(self): data = 'a,,b\n1,,a\n2,,"2,,b"' msg = 'ignored when a multi-char delimiter is used' - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ParserError, msg): self.read_csv(StringIO(data), sep=',,') # We expect no match, so there should be an assertion # error out of the inner context manager. 
- with tm.assertRaises(AssertionError): - with tm.assertRaisesRegexp(ValueError, msg): + with pytest.raises(AssertionError): + with tm.assert_raises_regex(ParserError, msg): self.read_csv(StringIO(data), sep=',,', quoting=csv.QUOTE_NONE) def test_skipfooter_bad_row(self): # see gh-13879 + # see gh-15910 - data = 'a,b,c\ncat,foo,bar\ndog,foo,"baz' msg = 'parsing errors in the skipped footer rows' - with tm.assertRaisesRegexp(csv.Error, msg): - self.read_csv(StringIO(data), skipfooter=1) + for data in ('a\n1\n"b"a', + 'a,b,c\ncat,foo,bar\ndog,foo,"baz'): + with tm.assert_raises_regex(ParserError, msg): + self.read_csv(StringIO(data), skipfooter=1) - # We expect no match, so there should be an assertion - # error out of the inner context manager. - with tm.assertRaises(AssertionError): - with tm.assertRaisesRegexp(csv.Error, msg): - self.read_csv(StringIO(data)) + # We expect no match, so there should be an assertion + # error out of the inner context manager. + with pytest.raises(AssertionError): + with tm.assert_raises_regex(ParserError, msg): + self.read_csv(StringIO(data)) diff --git a/pandas/io/tests/parser/quoting.py b/pandas/tests/io/parser/quoting.py similarity index 81% rename from pandas/io/tests/parser/quoting.py rename to pandas/tests/io/parser/quoting.py index 765cec8243a0a..15427aaf9825c 100644 --- a/pandas/io/tests/parser/quoting.py +++ b/pandas/tests/io/parser/quoting.py @@ -20,29 +20,29 @@ def test_bad_quote_char(self): # Python 2.x: "...must be an 1-character..." # Python 3.x: "...must be a 1-character..." msg = '"quotechar" must be a(n)? 1-character string' - tm.assertRaisesRegexp(TypeError, msg, self.read_csv, - StringIO(data), quotechar='foo') + tm.assert_raises_regex(TypeError, msg, self.read_csv, + StringIO(data), quotechar='foo') msg = 'quotechar must be set if quoting enabled' - tm.assertRaisesRegexp(TypeError, msg, self.read_csv, - StringIO(data), quotechar=None, - quoting=csv.QUOTE_MINIMAL) + tm.assert_raises_regex(TypeError, msg, self.read_csv, + StringIO(data), quotechar=None, + quoting=csv.QUOTE_MINIMAL) msg = '"quotechar" must be string, not int' - tm.assertRaisesRegexp(TypeError, msg, self.read_csv, - StringIO(data), quotechar=2) + tm.assert_raises_regex(TypeError, msg, self.read_csv, + StringIO(data), quotechar=2) def test_bad_quoting(self): data = '1,2,3' msg = '"quoting" must be an integer' - tm.assertRaisesRegexp(TypeError, msg, self.read_csv, - StringIO(data), quoting='foo') + tm.assert_raises_regex(TypeError, msg, self.read_csv, + StringIO(data), quoting='foo') # quoting must in the range [0, 3] msg = 'bad "quoting" value' - tm.assertRaisesRegexp(TypeError, msg, self.read_csv, - StringIO(data), quoting=5) + tm.assert_raises_regex(TypeError, msg, self.read_csv, + StringIO(data), quoting=5) def test_quote_char_basic(self): data = 'a,b,c\n1,2,"cat"' @@ -68,13 +68,13 @@ def test_null_quote_char(self): # sanity checks msg = 'quotechar must be set if quoting enabled' - tm.assertRaisesRegexp(TypeError, msg, self.read_csv, - StringIO(data), quotechar=None, - quoting=csv.QUOTE_MINIMAL) + tm.assert_raises_regex(TypeError, msg, self.read_csv, + StringIO(data), quotechar=None, + quoting=csv.QUOTE_MINIMAL) - tm.assertRaisesRegexp(TypeError, msg, self.read_csv, - StringIO(data), quotechar='', - quoting=csv.QUOTE_MINIMAL) + tm.assert_raises_regex(TypeError, msg, self.read_csv, + StringIO(data), quotechar='', + quoting=csv.QUOTE_MINIMAL) # no errors should be raised if quoting is None expected = DataFrame([[1, 2, 3]], @@ -149,5 +149,5 @@ def test_quotechar_unicode(self): 
# Compared to Python 3.x, Python 2.x does not handle unicode well. if PY3: - result = self.read_csv(StringIO(data), quotechar=u('\u0394')) + result = self.read_csv(StringIO(data), quotechar=u('\u0001')) tm.assert_frame_equal(result, expected) diff --git a/pandas/io/tests/parser/skiprows.py b/pandas/tests/io/parser/skiprows.py similarity index 88% rename from pandas/io/tests/parser/skiprows.py rename to pandas/tests/io/parser/skiprows.py index 9f01adb6fabcb..fb08ec0447267 100644 --- a/pandas/io/tests/parser/skiprows.py +++ b/pandas/tests/io/parser/skiprows.py @@ -12,6 +12,7 @@ import pandas.util.testing as tm from pandas import DataFrame +from pandas.errors import EmptyDataError from pandas.compat import StringIO, range, lrange @@ -198,3 +199,27 @@ def test_skiprows_infield_quote(self): df = self.read_csv(StringIO(data), skiprows=2) tm.assert_frame_equal(df, expected) + + def test_skiprows_callable(self): + data = 'a\n1\n2\n3\n4\n5' + + skiprows = lambda x: x % 2 == 0 + expected = DataFrame({'1': [3, 5]}) + df = self.read_csv(StringIO(data), skiprows=skiprows) + tm.assert_frame_equal(df, expected) + + expected = DataFrame({'foo': [3, 5]}) + df = self.read_csv(StringIO(data), skiprows=skiprows, + header=0, names=['foo']) + tm.assert_frame_equal(df, expected) + + skiprows = lambda x: True + msg = "No columns to parse from file" + with tm.assert_raises_regex(EmptyDataError, msg): + self.read_csv(StringIO(data), skiprows=skiprows) + + # This is a bad callable and should raise. + msg = "by zero" + skiprows = lambda x: 1 / 0 + with tm.assert_raises_regex(ZeroDivisionError, msg): + self.read_csv(StringIO(data), skiprows=skiprows) diff --git a/pandas/tests/io/parser/test_network.py b/pandas/tests/io/parser/test_network.py new file mode 100644 index 0000000000000..e12945a6a3102 --- /dev/null +++ b/pandas/tests/io/parser/test_network.py @@ -0,0 +1,197 @@ +# -*- coding: utf-8 -*- + +""" +Tests the parsers' ability to read and parse non-local files; +doing so requires a network connection.
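+ +These tests use the tm.network decorator, which skips them when no +network connection is available.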
+""" + +import os +import pytest + +import pandas.util.testing as tm +from pandas import DataFrame +from pandas.io.parsers import read_csv, read_table + + +@pytest.fixture(scope='module') +def salaries_table(): + path = os.path.join(tm.get_data_path(), 'salaries.csv') + return read_table(path) + + +@pytest.mark.parametrize( + "compression,extension", + [('gzip', '.gz'), ('bz2', '.bz2'), ('zip', '.zip'), + pytest.mark.skipif(not tm._check_if_lzma(), + reason='need backports.lzma to run')(('xz', '.xz'))]) +@pytest.mark.parametrize('mode', ['explicit', 'infer']) +@pytest.mark.parametrize('engine', ['python', 'c']) +def test_compressed_urls(salaries_table, compression, extension, mode, engine): + check_compressed_urls(salaries_table, compression, extension, mode, engine) + + +@tm.network +def check_compressed_urls(salaries_table, compression, extension, mode, + engine): + # test reading compressed urls with various engines and + # extension inference + base_url = ('https://github.com/pandas-dev/pandas/raw/master/' + 'pandas/tests/io/parser/data/salaries.csv') + + url = base_url + extension + + if mode != 'explicit': + compression = mode + + url_table = read_table(url, compression=compression, engine=engine) + tm.assert_frame_equal(url_table, salaries_table) + + +class TestS3(object): + + def setup_method(self, method): + try: + import s3fs # noqa + except ImportError: + pytest.skip("s3fs not installed") + + @tm.network + def test_parse_public_s3_bucket(self): + for ext, comp in [('', None), ('.gz', 'gzip'), ('.bz2', 'bz2')]: + df = read_csv('s3://pandas-test/tips.csv' + + ext, compression=comp) + assert isinstance(df, DataFrame) + assert not df.empty + tm.assert_frame_equal(read_csv( + tm.get_data_path('tips.csv')), df) + + # Read public file from bucket with not-public contents + df = read_csv('s3://cant_get_it/tips.csv') + assert isinstance(df, DataFrame) + assert not df.empty + tm.assert_frame_equal(read_csv(tm.get_data_path('tips.csv')), df) + + @tm.network + def test_parse_public_s3n_bucket(self): + # Read from AWS s3 as "s3n" URL + df = read_csv('s3n://pandas-test/tips.csv', nrows=10) + assert isinstance(df, DataFrame) + assert not df.empty + tm.assert_frame_equal(read_csv( + tm.get_data_path('tips.csv')).iloc[:10], df) + + @tm.network + def test_parse_public_s3a_bucket(self): + # Read from AWS s3 as "s3a" URL + df = read_csv('s3a://pandas-test/tips.csv', nrows=10) + assert isinstance(df, DataFrame) + assert not df.empty + tm.assert_frame_equal(read_csv( + tm.get_data_path('tips.csv')).iloc[:10], df) + + @tm.network + def test_parse_public_s3_bucket_nrows(self): + for ext, comp in [('', None), ('.gz', 'gzip'), ('.bz2', 'bz2')]: + df = read_csv('s3://pandas-test/tips.csv' + + ext, nrows=10, compression=comp) + assert isinstance(df, DataFrame) + assert not df.empty + tm.assert_frame_equal(read_csv( + tm.get_data_path('tips.csv')).iloc[:10], df) + + @tm.network + def test_parse_public_s3_bucket_chunked(self): + # Read with a chunksize + chunksize = 5 + local_tips = read_csv(tm.get_data_path('tips.csv')) + for ext, comp in [('', None), ('.gz', 'gzip'), ('.bz2', 'bz2')]: + df_reader = read_csv('s3://pandas-test/tips.csv' + ext, + chunksize=chunksize, compression=comp) + assert df_reader.chunksize == chunksize + for i_chunk in [0, 1, 2]: + # Read a couple of chunks and make sure we see them + # properly. 
+ df = df_reader.get_chunk() + assert isinstance(df, DataFrame) + assert not df.empty + true_df = local_tips.iloc[ + chunksize * i_chunk: chunksize * (i_chunk + 1)] + tm.assert_frame_equal(true_df, df) + + @tm.network + def test_parse_public_s3_bucket_chunked_python(self): + # Read with a chunksize using the Python parser + chunksize = 5 + local_tips = read_csv(tm.get_data_path('tips.csv')) + for ext, comp in [('', None), ('.gz', 'gzip'), ('.bz2', 'bz2')]: + df_reader = read_csv('s3://pandas-test/tips.csv' + ext, + chunksize=chunksize, compression=comp, + engine='python') + assert df_reader.chunksize == chunksize + for i_chunk in [0, 1, 2]: + # Read a couple of chunks and make sure we see them properly. + df = df_reader.get_chunk() + assert isinstance(df, DataFrame) + assert not df.empty + true_df = local_tips.iloc[ + chunksize * i_chunk: chunksize * (i_chunk + 1)] + tm.assert_frame_equal(true_df, df) + + @tm.network + def test_parse_public_s3_bucket_python(self): + for ext, comp in [('', None), ('.gz', 'gzip'), ('.bz2', 'bz2')]: + df = read_csv('s3://pandas-test/tips.csv' + ext, engine='python', + compression=comp) + assert isinstance(df, DataFrame) + assert not df.empty + tm.assert_frame_equal(read_csv( + tm.get_data_path('tips.csv')), df) + + @tm.network + def test_infer_s3_compression(self): + for ext in ['', '.gz', '.bz2']: + df = read_csv('s3://pandas-test/tips.csv' + ext, + engine='python', compression='infer') + assert isinstance(df, DataFrame) + assert not df.empty + tm.assert_frame_equal(read_csv( + tm.get_data_path('tips.csv')), df) + + @tm.network + def test_parse_public_s3_bucket_nrows_python(self): + for ext, comp in [('', None), ('.gz', 'gzip'), ('.bz2', 'bz2')]: + df = read_csv('s3://pandas-test/tips.csv' + ext, engine='python', + nrows=10, compression=comp) + assert isinstance(df, DataFrame) + assert not df.empty + tm.assert_frame_equal(read_csv( + tm.get_data_path('tips.csv')).iloc[:10], df) + + @tm.network + def test_s3_fails(self): + with pytest.raises(IOError): + read_csv('s3://nyqpug/asdf.csv') + + # Receive a permission error when trying to read a private bucket. + # It's irrelevant here that this isn't actually a table. 
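+ # Both the missing key above and the permission failure here + # should surface as IOError to the caller.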
+ with pytest.raises(IOError): + read_csv('s3://cant_get_it/') + + @tm.network + def test_boto3_client_s3(self): + # see gh-16135 + + # boto3 is a dependency of s3fs + import boto3 + client = boto3.client("s3") + + key = "/tips.csv" + bucket = "pandas-test" + s3_object = client.get_object(Bucket=bucket, Key=key) + + result = read_csv(s3_object["Body"]) + assert isinstance(result, DataFrame) + assert not result.empty + + expected = read_csv(tm.get_data_path('tips.csv')) + tm.assert_frame_equal(result, expected) diff --git a/pandas/io/tests/parser/test_parsers.py b/pandas/tests/io/parser/test_parsers.py similarity index 80% rename from pandas/io/tests/parser/test_parsers.py rename to pandas/tests/io/parser/test_parsers.py index 6001c85ae76b1..8d59e3acb3230 100644 --- a/pandas/io/tests/parser/test_parsers.py +++ b/pandas/tests/io/parser/test_parsers.py @@ -1,7 +1,6 @@ # -*- coding: utf-8 -*- import os -import nose import pandas.util.testing as tm @@ -11,6 +10,7 @@ from .common import ParserTests from .header import HeaderTests from .comment import CommentTests +from .dialect import DialectTests from .quoting import QuotingTests from .usecols import UsecolsTests from .skiprows import SkipRowsTests @@ -22,14 +22,17 @@ from .compression import CompressionTests from .multithread import MultithreadTests from .python_parser_only import PythonParserTests +from .dtypes import DtypeTests class BaseParser(CommentTests, CompressionTests, - ConverterTests, HeaderTests, - IndexColTests, MultithreadTests, - NAvaluesTests, ParseDatesTests, - ParserTests, SkipRowsTests, - UsecolsTests, QuotingTests): + ConverterTests, DialectTests, + HeaderTests, IndexColTests, + MultithreadTests, NAvaluesTests, + ParseDatesTests, ParserTests, + SkipRowsTests, UsecolsTests, + QuotingTests, DtypeTests): + def read_csv(self, *args, **kwargs): raise NotImplementedError @@ -39,7 +42,7 @@ def read_table(self, *args, **kwargs): def float_precision_choices(self): raise AbstractMethodError(self) - def setUp(self): + def setup_method(self, method): self.dirpath = tm.get_data_path() self.csv1 = os.path.join(self.dirpath, 'test1.csv') self.csv2 = os.path.join(self.dirpath, 'test2.csv') @@ -47,7 +50,7 @@ def setUp(self): self.csv_shiftjs = os.path.join(self.dirpath, 'sauron.SHIFT_JIS.csv') -class TestCParserHighMemory(BaseParser, CParserTests, tm.TestCase): +class TestCParserHighMemory(BaseParser, CParserTests): engine = 'c' low_memory = False float_precision_choices = [None, 'high', 'round_trip'] @@ -65,7 +68,7 @@ def read_table(self, *args, **kwds): return read_table(*args, **kwds) -class TestCParserLowMemory(BaseParser, CParserTests, tm.TestCase): +class TestCParserLowMemory(BaseParser, CParserTests): engine = 'c' low_memory = True float_precision_choices = [None, 'high', 'round_trip'] @@ -83,7 +86,7 @@ def read_table(self, *args, **kwds): return read_table(*args, **kwds) -class TestPythonParser(BaseParser, PythonParserTests, tm.TestCase): +class TestPythonParser(BaseParser, PythonParserTests): engine = 'python' float_precision_choices = [None] @@ -96,7 +99,3 @@ def read_table(self, *args, **kwds): kwds = kwds.copy() kwds['engine'] = self.engine return read_table(*args, **kwds) - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/io/tests/parser/test_read_fwf.py b/pandas/tests/io/parser/test_read_fwf.py similarity index 80% rename from pandas/io/tests/parser/test_read_fwf.py rename to pandas/tests/io/parser/test_read_fwf.py index
11b10211650d6..0bfeb5215f370 100644 --- a/pandas/io/tests/parser/test_read_fwf.py +++ b/pandas/tests/io/parser/test_read_fwf.py @@ -8,7 +8,7 @@ from datetime import datetime -import nose +import pytest import numpy as np import pandas as pd import pandas.util.testing as tm @@ -16,10 +16,10 @@ from pandas import DataFrame from pandas import compat from pandas.compat import StringIO, BytesIO -from pandas.io.parsers import read_csv, read_fwf +from pandas.io.parsers import read_csv, read_fwf, EmptyDataError -class TestFwfParsing(tm.TestCase): +class TestFwfParsing(object): def test_fwf(self): data_expected = """\ @@ -67,15 +67,16 @@ def test_fwf(self): StringIO(data3), colspecs=colspecs, delimiter='~', header=None) tm.assert_frame_equal(df, expected) - with tm.assertRaisesRegexp(ValueError, "must specify only one of"): + with tm.assert_raises_regex(ValueError, + "must specify only one of"): read_fwf(StringIO(data3), colspecs=colspecs, widths=[6, 10, 10, 7]) - with tm.assertRaisesRegexp(ValueError, "Must specify either"): + with tm.assert_raises_regex(ValueError, "Must specify either"): read_fwf(StringIO(data3), colspecs=None, widths=None) def test_BytesIO_input(self): if not compat.PY3: - raise nose.SkipTest( + pytest.skip( "Bytes-related test - only needs to work on Python 3") result = read_fwf(BytesIO("שלום\nשלום".encode('utf8')), widths=[ @@ -93,9 +94,9 @@ def test_fwf_colspecs_is_list_or_tuple(self): bar2,12,13,14,15 """ - with tm.assertRaisesRegexp(TypeError, - 'column specifications must be a list or ' - 'tuple.+'): + with tm.assert_raises_regex(TypeError, + 'column specifications must ' + 'be a list or tuple.+'): pd.io.parsers.FixedWidthReader(StringIO(data), {'a': 1}, ',', '#') @@ -109,8 +110,9 @@ def test_fwf_colspecs_is_list_or_tuple_of_two_element_tuples(self): bar2,12,13,14,15 """ - with tm.assertRaisesRegexp(TypeError, - 'Each column specification must be.+'): + with tm.assert_raises_regex(TypeError, + 'Each column specification ' + 'must be.+'): read_fwf(StringIO(data), [('a', 1)]) def test_fwf_colspecs_None(self): @@ -164,7 +166,7 @@ def test_fwf_regression(self): for c in df.columns: res = df.loc[:, c] - self.assertTrue(len(res)) + assert len(res) def test_fwf_for_uint8(self): data = """1421302965.213420 PRI=3 PGN=0xef00 DST=0x17 SRC=0x28 04 154 00 00 00 00 00 127 @@ -192,7 +194,7 @@ def test_fwf_compression(self): import gzip import bz2 except ImportError: - raise nose.SkipTest("Need gzip and bz2 to run this test") + pytest.skip("Need gzip and bz2 to run this test") data = """1111111111 2222222222 @@ -243,88 +245,88 @@ def test_bool_header_arg(self): a b""" for arg in [True, False]: - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): read_fwf(StringIO(data), header=arg) def test_full_file(self): # File with all values - test = '''index A B C + test = """index A B C 2000-01-03T00:00:00 0.980268513777 3 foo 2000-01-04T00:00:00 1.04791624281 -4 bar 2000-01-05T00:00:00 0.498580885705 73 baz 2000-01-06T00:00:00 1.12020151869 1 foo 2000-01-07T00:00:00 0.487094399463 0 bar 2000-01-10T00:00:00 0.836648671666 2 baz -2000-01-11T00:00:00 0.157160753327 34 foo''' +2000-01-11T00:00:00 0.157160753327 34 foo""" colspecs = ((0, 19), (21, 35), (38, 40), (42, 45)) expected = read_fwf(StringIO(test), colspecs=colspecs) tm.assert_frame_equal(expected, read_fwf(StringIO(test))) def test_full_file_with_missing(self): # File with missing values - test = '''index A B C + test = """index A B C 2000-01-03T00:00:00 0.980268513777 3 foo 2000-01-04T00:00:00 1.04791624281 -4 bar 
0.498580885705 73 baz 2000-01-06T00:00:00 1.12020151869 1 foo 2000-01-07T00:00:00 0 bar 2000-01-10T00:00:00 0.836648671666 2 baz - 34''' + 34""" colspecs = ((0, 19), (21, 35), (38, 40), (42, 45)) expected = read_fwf(StringIO(test), colspecs=colspecs) tm.assert_frame_equal(expected, read_fwf(StringIO(test))) def test_full_file_with_spaces(self): # File with spaces in columns - test = ''' + test = """ Account Name Balance CreditLimit AccountCreated 101 Keanu Reeves 9315.45 10000.00 1/17/1998 312 Gerard Butler 90.00 1000.00 8/6/2003 868 Jennifer Love Hewitt 0 17000.00 5/25/1985 761 Jada Pinkett-Smith 49654.87 100000.00 12/5/2006 317 Bill Murray 789.65 5000.00 2/5/2007 -'''.strip('\r\n') +""".strip('\r\n') colspecs = ((0, 7), (8, 28), (30, 38), (42, 53), (56, 70)) expected = read_fwf(StringIO(test), colspecs=colspecs) tm.assert_frame_equal(expected, read_fwf(StringIO(test))) def test_full_file_with_spaces_and_missing(self): # File with spaces and missing values in columns - test = ''' + test = """ Account Name Balance CreditLimit AccountCreated 101 10000.00 1/17/1998 312 Gerard Butler 90.00 1000.00 8/6/2003 868 5/25/1985 761 Jada Pinkett-Smith 49654.87 100000.00 12/5/2006 317 Bill Murray 789.65 -'''.strip('\r\n') +""".strip('\r\n') colspecs = ((0, 7), (8, 28), (30, 38), (42, 53), (56, 70)) expected = read_fwf(StringIO(test), colspecs=colspecs) tm.assert_frame_equal(expected, read_fwf(StringIO(test))) def test_messed_up_data(self): # Completely messed up file - test = ''' + test = """ Account Name Balance Credit Limit Account Created 101 10000.00 1/17/1998 312 Gerard Butler 90.00 1000.00 761 Jada Pinkett-Smith 49654.87 100000.00 12/5/2006 317 Bill Murray 789.65 -'''.strip('\r\n') +""".strip('\r\n') colspecs = ((2, 10), (15, 33), (37, 45), (49, 61), (64, 79)) expected = read_fwf(StringIO(test), colspecs=colspecs) tm.assert_frame_equal(expected, read_fwf(StringIO(test))) def test_multiple_delimiters(self): - test = r''' + test = r""" col1~~~~~col2 col3++++++++++++++++++col4 ~~22.....11.0+++foo~~~~~~~~~~Keanu Reeves 33+++122.33\\\bar.........Gerard Butler ++44~~~~12.01 baz~~Jennifer Love Hewitt ~~55 11+++foo++++Jada Pinkett-Smith ..66++++++.03~~~bar Bill Murray -'''.strip('\r\n') +""".strip('\r\n') colspecs = ((0, 4), (7, 13), (15, 19), (21, 41)) expected = read_fwf(StringIO(test), colspecs=colspecs, delimiter=' +~.\\') @@ -333,15 +335,73 @@ def test_multiple_delimiters(self): def test_variable_width_unicode(self): if not compat.PY3: - raise nose.SkipTest( + pytest.skip( 'Bytes-related test - only needs to work on Python 3') - test = ''' + test = """ שלום שלום ום שלל של ום -'''.strip('\r\n') +""".strip('\r\n') expected = read_fwf(BytesIO(test.encode('utf8')), colspecs=[(0, 4), (5, 9)], header=None, encoding='utf8') tm.assert_frame_equal(expected, read_fwf( BytesIO(test.encode('utf8')), header=None, encoding='utf8')) + + def test_dtype(self): + data = """ a b c +1 2 3.2 +3 4 5.2 +""" + colspecs = [(0, 5), (5, 10), (10, None)] + result = pd.read_fwf(StringIO(data), colspecs=colspecs) + expected = pd.DataFrame({ + 'a': [1, 3], + 'b': [2, 4], + 'c': [3.2, 5.2]}, columns=['a', 'b', 'c']) + tm.assert_frame_equal(result, expected) + + expected['a'] = expected['a'].astype('float64') + expected['b'] = expected['b'].astype(str) + expected['c'] = expected['c'].astype('int32') + result = pd.read_fwf(StringIO(data), colspecs=colspecs, + dtype={'a': 'float64', 'b': str, 'c': 'int32'}) + tm.assert_frame_equal(result, expected) + + def test_skiprows_inference(self): + # GH11256 + test = """ +Text contained in the
file header + +DataCol1 DataCol2 + 0.0 1.0 + 101.6 956.1 +""".strip() + expected = read_csv(StringIO(test), skiprows=2, + delim_whitespace=True) + tm.assert_frame_equal(expected, read_fwf( + StringIO(test), skiprows=2)) + + def test_skiprows_by_index_inference(self): + test = """ +To be skipped +Not To Be Skipped +Once more to be skipped +123 34 8 123 +456 78 9 456 +""".strip() + + expected = read_csv(StringIO(test), skiprows=[0, 2], + delim_whitespace=True) + tm.assert_frame_equal(expected, read_fwf( + StringIO(test), skiprows=[0, 2])) + + def test_skiprows_inference_empty(self): + test = """ +AA BBB C +12 345 6 +78 901 2 +""".strip() + + with pytest.raises(EmptyDataError): + read_fwf(StringIO(test), skiprows=3) diff --git a/pandas/io/tests/parser/test_textreader.py b/pandas/tests/io/parser/test_textreader.py similarity index 77% rename from pandas/io/tests/parser/test_textreader.py rename to pandas/tests/io/parser/test_textreader.py index 7dda9eb9d0af4..c9088d2ecc5e7 100644 --- a/pandas/io/tests/parser/test_textreader.py +++ b/pandas/tests/io/parser/test_textreader.py @@ -5,12 +5,13 @@ is integral to the C engine in parsers.py """ +import pytest + from pandas.compat import StringIO, BytesIO, map from pandas import compat import os import sys -import nose from numpy import nan import numpy as np @@ -21,13 +22,13 @@ import pandas.util.testing as tm -from pandas.parser import TextReader -import pandas.parser as parser +from pandas._libs.parsers import TextReader +import pandas._libs.parsers as parser -class TestTextReader(tm.TestCase): +class TestTextReader(object): - def setUp(self): + def setup_method(self, method): self.dirpath = tm.get_data_path() self.csv1 = os.path.join(self.dirpath, 'test1.csv') self.csv2 = os.path.join(self.dirpath, 'test2.csv') @@ -65,7 +66,7 @@ def test_string_factorize(self): data = 'a\nb\na\nb\na' reader = TextReader(StringIO(data), header=None) result = reader.read() - self.assertEqual(len(set(map(id, result[0]))), 2) + assert len(set(map(id, result[0]))) == 2 def test_skipinitialspace(self): data = ('a, b\n' @@ -77,12 +78,10 @@ def test_skipinitialspace(self): header=None) result = reader.read() - self.assert_numpy_array_equal(result[0], - np.array(['a', 'a', 'a', 'a'], - dtype=np.object_)) - self.assert_numpy_array_equal(result[1], - np.array(['b', 'b', 'b', 'b'], - dtype=np.object_)) + tm.assert_numpy_array_equal(result[0], np.array(['a', 'a', 'a', 'a'], + dtype=np.object_)) + tm.assert_numpy_array_equal(result[1], np.array(['b', 'b', 'b', 'b'], + dtype=np.object_)) def test_parse_booleans(self): data = 'True\nFalse\nTrue\nTrue' @@ -90,7 +89,7 @@ def test_parse_booleans(self): reader = TextReader(StringIO(data), header=None) result = reader.read() - self.assertEqual(result[0].dtype, np.bool_) + assert result[0].dtype == np.bool_ def test_delimit_whitespace(self): data = 'a b\na\t\t "b"\n"a"\t \t b' @@ -99,10 +98,10 @@ def test_delimit_whitespace(self): header=None) result = reader.read() - self.assert_numpy_array_equal(result[0], np.array(['a', 'a', 'a'], - dtype=np.object_)) - self.assert_numpy_array_equal(result[1], np.array(['b', 'b', 'b'], - dtype=np.object_)) + tm.assert_numpy_array_equal(result[0], np.array(['a', 'a', 'a'], + dtype=np.object_)) + tm.assert_numpy_array_equal(result[1], np.array(['b', 'b', 'b'], + dtype=np.object_)) def test_embedded_newline(self): data = 'a\n"hello\nthere"\nthis' @@ -111,7 +110,7 @@ def test_embedded_newline(self): result = reader.read() expected = np.array(['a', 'hello\nthere', 'this'], dtype=np.object_) - 
self.assert_numpy_array_equal(result[0], expected) + tm.assert_numpy_array_equal(result[0], expected) def test_euro_decimal(self): data = '12345,67\n345,678' @@ -143,6 +142,7 @@ def test_integer_thousands_alt(self): expected = DataFrame([123456, 12500]) tm.assert_frame_equal(result, expected) + @tm.capture_stderr def test_skip_bad_lines(self): # too many lines, see #2430 for why data = ('a:b:c\n' @@ -154,7 +154,7 @@ def test_skip_bad_lines(self): reader = TextReader(StringIO(data), delimiter=':', header=None) - self.assertRaises(parser.CParserError, reader.read) + pytest.raises(parser.ParserError, reader.read) reader = TextReader(StringIO(data), delimiter=':', header=None, @@ -166,19 +166,15 @@ def test_skip_bad_lines(self): 2: ['c', 'f', 'i', 'n']} assert_array_dicts_equal(result, expected) - stderr = sys.stderr - sys.stderr = StringIO() - try: - reader = TextReader(StringIO(data), delimiter=':', - header=None, - error_bad_lines=False, - warn_bad_lines=True) - reader.read() - val = sys.stderr.getvalue() - self.assertTrue('Skipping line 4' in val) - self.assertTrue('Skipping line 6' in val) - finally: - sys.stderr = stderr + reader = TextReader(StringIO(data), delimiter=':', + header=None, + error_bad_lines=False, + warn_bad_lines=True) + reader.read() + val = sys.stderr.getvalue() + + assert 'Skipping line 4' in val + assert 'Skipping line 6' in val def test_header_not_enough_lines(self): data = ('skip this\n' @@ -190,15 +186,15 @@ def test_header_not_enough_lines(self): reader = TextReader(StringIO(data), delimiter=',', header=2) header = reader.header expected = [['a', 'b', 'c']] - self.assertEqual(header, expected) + assert header == expected recs = reader.read() expected = {0: [1, 4], 1: [2, 5], 2: [3, 6]} assert_array_dicts_equal(expected, recs) # not enough rows - self.assertRaises(parser.CParserError, TextReader, StringIO(data), - delimiter=',', header=5, as_recarray=True) + pytest.raises(parser.ParserError, TextReader, StringIO(data), + delimiter=',', header=5, as_recarray=True) def test_header_not_enough_lines_as_recarray(self): data = ('skip this\n' @@ -211,15 +207,15 @@ def test_header_not_enough_lines_as_recarray(self): as_recarray=True) header = reader.header expected = [['a', 'b', 'c']] - self.assertEqual(header, expected) + assert header == expected recs = reader.read() expected = {'a': [1, 4], 'b': [2, 5], 'c': [3, 6]} assert_array_dicts_equal(expected, recs) # not enough rows - self.assertRaises(parser.CParserError, TextReader, StringIO(data), - delimiter=',', header=5, as_recarray=True) + pytest.raises(parser.ParserError, TextReader, StringIO(data), + delimiter=',', header=5, as_recarray=True) def test_escapechar(self): data = ('\\"hello world\"\n' @@ -254,18 +250,18 @@ def _make_reader(**kwds): reader = _make_reader(dtype='S5,i4') result = reader.read() - self.assertEqual(result[0].dtype, 'S5') + assert result[0].dtype == 'S5' ex_values = np.array(['a', 'aa', 'aaa', 'aaaa', 'aaaaa'], dtype='S5') - self.assertTrue((result[0] == ex_values).all()) - self.assertEqual(result[1].dtype, 'i4') + assert (result[0] == ex_values).all() + assert result[1].dtype == 'i4' reader = _make_reader(dtype='S4') result = reader.read() - self.assertEqual(result[0].dtype, 'S4') + assert result[0].dtype == 'S4' ex_values = np.array(['a', 'aa', 'aaa', 'aaaa', 'aaaa'], dtype='S4') - self.assertTrue((result[0] == ex_values).all()) - self.assertEqual(result[1].dtype, 'S4') + assert (result[0] == ex_values).all() + assert result[1].dtype == 'S4' def test_numpy_string_dtype_as_recarray(self): data = 
"""\ @@ -281,10 +277,10 @@ def _make_reader(**kwds): reader = _make_reader(dtype='S4', as_recarray=True) result = reader.read() - self.assertEqual(result['0'].dtype, 'S4') + assert result['0'].dtype == 'S4' ex_values = np.array(['a', 'aa', 'aaa', 'aaaa', 'aaaa'], dtype='S4') - self.assertTrue((result['0'] == ex_values).all()) - self.assertEqual(result['1'].dtype, 'S4') + assert (result['0'] == ex_values).all() + assert result['1'].dtype == 'S4' def test_pass_dtype(self): data = """\ @@ -299,19 +295,19 @@ def _make_reader(**kwds): reader = _make_reader(dtype={'one': 'u1', 1: 'S1'}) result = reader.read() - self.assertEqual(result[0].dtype, 'u1') - self.assertEqual(result[1].dtype, 'S1') + assert result[0].dtype == 'u1' + assert result[1].dtype == 'S1' reader = _make_reader(dtype={'one': np.uint8, 1: object}) result = reader.read() - self.assertEqual(result[0].dtype, 'u1') - self.assertEqual(result[1].dtype, 'O') + assert result[0].dtype == 'u1' + assert result[1].dtype == 'O' reader = _make_reader(dtype={'one': np.dtype('u1'), 1: np.dtype('O')}) result = reader.read() - self.assertEqual(result[0].dtype, 'u1') - self.assertEqual(result[1].dtype, 'O') + assert result[0].dtype == 'u1' + assert result[1].dtype == 'O' def test_usecols(self): data = """\ @@ -328,9 +324,9 @@ def _make_reader(**kwds): result = reader.read() exp = _make_reader().read() - self.assertEqual(len(result), 2) - self.assertTrue((result[1] == exp[1]).all()) - self.assertTrue((result[2] == exp[2]).all()) + assert len(result) == 2 + assert (result[1] == exp[1]).all() + assert (result[2] == exp[2]).all() def test_cr_delimited(self): def _test(text, **kwargs): @@ -392,11 +388,13 @@ def test_empty_field_eof(self): names=list('abcd'), engine='c') assert_frame_equal(df, c) + def test_empty_csv_input(self): + # GH14867 + df = read_csv(StringIO(), chunksize=20, header=None, + names=['a', 'b', 'c']) + assert isinstance(df, TextFileReader) + def assert_array_dicts_equal(left, right): for k, v in compat.iteritems(left): assert(np.array_equal(v, right[k])) - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/io/tests/parser/test_unsupported.py b/pandas/tests/io/parser/test_unsupported.py similarity index 67% rename from pandas/io/tests/parser/test_unsupported.py rename to pandas/tests/io/parser/test_unsupported.py index 2fc238acd54e3..3f62ff44531fb 100644 --- a/pandas/io/tests/parser/test_unsupported.py +++ b/pandas/tests/io/parser/test_unsupported.py @@ -9,63 +9,47 @@ test suite as new feature support is added to the parsers. 
""" -import nose - import pandas.io.parsers as parsers import pandas.util.testing as tm from pandas.compat import StringIO -from pandas.io.common import CParserError +from pandas.errors import ParserError from pandas.io.parsers import read_csv, read_table -class TestUnsupportedFeatures(tm.TestCase): +class TestUnsupportedFeatures(object): + def test_mangle_dupe_cols_false(self): # see gh-12935 data = 'a b c\n1 2 3' msg = 'is not supported' for engine in ('c', 'python'): - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): read_csv(StringIO(data), engine=engine, mangle_dupe_cols=False) - def test_nrows_and_chunksize(self): - data = 'a b c' - msg = "cannot be used together yet" - - for engine in ('c', 'python'): - with tm.assertRaisesRegexp(NotImplementedError, msg): - read_csv(StringIO(data), engine=engine, - nrows=10, chunksize=5) - def test_c_engine(self): # see gh-6607 data = 'a b c\n1 2 3' msg = 'does not support' - # specify C-unsupported options with python-unsupported option - # (options will be ignored on fallback, raise) - with tm.assertRaisesRegexp(ValueError, msg): - read_table(StringIO(data), sep=None, - delim_whitespace=False, dtype={'a': float}) - with tm.assertRaisesRegexp(ValueError, msg): - read_table(StringIO(data), sep=r'\s', dtype={'a': float}) - with tm.assertRaisesRegexp(ValueError, msg): - read_table(StringIO(data), skipfooter=1, dtype={'a': float}) - # specify C engine with unsupported options (raise) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): read_table(StringIO(data), engine='c', sep=None, delim_whitespace=False) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): read_table(StringIO(data), engine='c', sep=r'\s') - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): + read_table(StringIO(data), engine='c', quotechar=chr(128)) + with tm.assert_raises_regex(ValueError, msg): read_table(StringIO(data), engine='c', skipfooter=1) # specify C-unsupported options without python-unsupported options with tm.assert_produces_warning(parsers.ParserWarning): read_table(StringIO(data), sep=None, delim_whitespace=False) + with tm.assert_produces_warning(parsers.ParserWarning): + read_table(StringIO(data), quotechar=chr(128)) with tm.assert_produces_warning(parsers.ParserWarning): read_table(StringIO(data), sep=r'\s') with tm.assert_produces_warning(parsers.ParserWarning): @@ -78,24 +62,24 @@ def test_c_engine(self): x q 30 3 -0.6662 -0.5243 -0.3580 0.89145 2.5838""" msg = 'Error tokenizing data' - with tm.assertRaisesRegexp(CParserError, msg): - read_table(StringIO(text), sep=r'\s+') - with tm.assertRaisesRegexp(CParserError, msg): - read_table(StringIO(text), engine='c', sep=r'\s+') + with tm.assert_raises_regex(ParserError, msg): + read_table(StringIO(text), sep='\\s+') + with tm.assert_raises_regex(ParserError, msg): + read_table(StringIO(text), engine='c', sep='\\s+') msg = "Only length-1 thousands markers supported" data = """A|B|C 1|2,334|5 10|13|10. 
""" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): read_csv(StringIO(data), thousands=',,') - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): read_csv(StringIO(data), thousands='') msg = "Only length-1 line terminators supported" data = 'a,b,c~~1,2,3~~4,5,6' - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): read_csv(StringIO(data), lineterminator='~~') def test_python_engine(self): @@ -114,11 +98,12 @@ def test_python_engine(self): 'with the %r engine' % (default, engine)) kwargs = {default: object()} - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): read_csv(StringIO(data), engine=engine, **kwargs) -class TestDeprecatedFeatures(tm.TestCase): +class TestDeprecatedFeatures(object): + def test_deprecated_args(self): data = '1,2,3' @@ -127,8 +112,8 @@ def test_deprecated_args(self): 'as_recarray': True, 'buffer_lines': True, 'compact_ints': True, - 'skip_footer': True, 'use_unsigned': True, + 'skip_footer': 1, } engines = 'c', 'python' @@ -148,7 +133,3 @@ def test_deprecated_args(self): kwargs = {arg: non_default_val} read_csv(StringIO(data), engine=engine, **kwargs) - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/io/tests/parser/usecols.py b/pandas/tests/io/parser/usecols.py similarity index 70% rename from pandas/io/tests/parser/usecols.py rename to pandas/tests/io/parser/usecols.py index 5051171ccb8f0..8761d1ccd3da4 100644 --- a/pandas/io/tests/parser/usecols.py +++ b/pandas/tests/io/parser/usecols.py @@ -5,13 +5,13 @@ for all of the parsers defined in parsers.py """ -import nose +import pytest import numpy as np import pandas.util.testing as tm from pandas import DataFrame, Index -from pandas.lib import Timestamp +from pandas._libs.lib import Timestamp from pandas.compat import StringIO @@ -23,11 +23,12 @@ def test_raise_on_mixed_dtype_usecols(self): 1000,2000,3000 4000,5000,6000 """ - msg = ("The elements of 'usecols' must " - "either be all strings, all unicode, or all integers") + + msg = ("'usecols' must either be all strings, all unicode, " + "all integers or a callable") usecols = [0, 'b', 2] - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): self.read_csv(StringIO(data), usecols=usecols) def test_usecols(self): @@ -42,9 +43,9 @@ def test_usecols(self): result2 = self.read_csv(StringIO(data), usecols=('b', 'c')) exp = self.read_csv(StringIO(data)) - self.assertEqual(len(result.columns), 2) - self.assertTrue((result['b'] == exp['b']).all()) - self.assertTrue((result['c'] == exp['c']).all()) + assert len(result.columns) == 2 + assert (result['b'] == exp['b']).all() + assert (result['c'] == exp['c']).all() tm.assert_frame_equal(result, result2) @@ -81,8 +82,8 @@ def test_usecols(self): tm.assert_frame_equal(result, expected) # length conflict, passed names and usecols disagree - self.assertRaises(ValueError, self.read_csv, StringIO(data), - names=['a', 'b'], usecols=[1], header=None) + pytest.raises(ValueError, self.read_csv, StringIO(data), + names=['a', 'b'], usecols=[1], header=None) def test_usecols_index_col_False(self): # see gh-9082 @@ -199,6 +200,51 @@ def test_usecols_with_parse_dates(self): parse_dates=parse_dates) tm.assert_frame_equal(df, expected) + # See gh-13604 + s = """2008-02-07 09:40,1032.43 + 2008-02-07 09:50,1042.54 + 2008-02-07 10:00,1051.65 + """ + 
parse_dates = [0] + names = ['date', 'values'] + usecols = names[:] + + index = Index([Timestamp('2008-02-07 09:40'), + Timestamp('2008-02-07 09:50'), + Timestamp('2008-02-07 10:00')], + name='date') + cols = {'values': [1032.43, 1042.54, 1051.65]} + expected = DataFrame(cols, index=index) + + df = self.read_csv(StringIO(s), parse_dates=parse_dates, index_col=0, + usecols=usecols, header=None, names=names) + tm.assert_frame_equal(df, expected) + + # See gh-14792 + s = """a,b,c,d,e,f,g,h,i,j + 2016/09/21,1,1,2,3,4,5,6,7,8""" + parse_dates = [0] + usecols = list('abcdefghij') + cols = {'a': Timestamp('2016-09-21'), + 'b': [1], 'c': [1], 'd': [2], + 'e': [3], 'f': [4], 'g': [5], + 'h': [6], 'i': [7], 'j': [8]} + expected = DataFrame(cols, columns=usecols) + df = self.read_csv(StringIO(s), usecols=usecols, + parse_dates=parse_dates) + tm.assert_frame_equal(df, expected) + + s = """a,b,c,d,e,f,g,h,i,j\n2016/09/21,1,1,2,3,4,5,6,7,8""" + parse_dates = [[0, 1]] + usecols = list('abcdefghij') + cols = {'a_b': '2016/09/21 1', + 'c': [1], 'd': [2], 'e': [3], 'f': [4], + 'g': [5], 'h': [6], 'i': [7], 'j': [8]} + expected = DataFrame(cols, columns=['a_b'] + list('cdefghij')) + df = self.read_csv(StringIO(s), usecols=usecols, + parse_dates=parse_dates) + tm.assert_frame_equal(df, expected) + def test_usecols_with_parse_dates_and_full_names(self): # See gh-9755 s = """0,1,20140101,0900,4 @@ -302,13 +348,13 @@ def test_usecols_with_mixed_encoding_strings(self): 3.568935038,7,False,a ''' - msg = ("The elements of 'usecols' must " - "either be all strings, all unicode, or all integers") + msg = ("'usecols' must either be all strings, all unicode, " + "all integers or a callable") - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): self.read_csv(StringIO(s), usecols=[u'AAA', b'BBB']) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): self.read_csv(StringIO(s), usecols=[b'AAA', u'BBB']) def test_usecols_with_multibyte_characters(self): @@ -331,7 +377,7 @@ def test_usecols_with_multibyte_characters(self): tm.assert_frame_equal(df, expected) def test_usecols_with_multibyte_unicode_characters(self): - raise nose.SkipTest('TODO: see gh-13253') + pytest.skip('TODO: see gh-13253') s = '''あああ,いい,ううう,ええええ 0.056674973,8,True,a @@ -366,3 +412,66 @@ def test_np_array_usecols(self): expected = DataFrame([[1, 2]], columns=usecols) result = self.read_csv(StringIO(data), usecols=usecols) tm.assert_frame_equal(result, expected) + + def test_callable_usecols(self): + # See gh-14154 + s = '''AaA,bBb,CCC,ddd + 0.056674973,8,True,a + 2.613230982,2,False,b + 3.568935038,7,False,a + ''' + + data = { + 'AaA': { + 0: 0.056674972999999997, + 1: 2.6132309819999997, + 2: 3.5689350380000002 + }, + 'bBb': {0: 8, 1: 2, 2: 7}, + 'ddd': {0: 'a', 1: 'b', 2: 'a'} + } + expected = DataFrame(data) + df = self.read_csv(StringIO(s), usecols=lambda x: + x.upper() in ['AAA', 'BBB', 'DDD']) + tm.assert_frame_equal(df, expected) + + # Check that a callable returning only False returns + # an empty DataFrame + expected = DataFrame() + df = self.read_csv(StringIO(s), usecols=lambda x: False) + tm.assert_frame_equal(df, expected) + + def test_incomplete_first_row(self): + # see gh-6710 + data = '1,2\n1,2,3' + names = ['a', 'b', 'c'] + expected = DataFrame({'a': [1, 1], + 'c': [np.nan, 3]}) + + usecols = ['a', 'c'] + df = self.read_csv(StringIO(data), names=names, usecols=usecols) + tm.assert_frame_equal(df, expected) + + usecols = lambda x: x in ['a', 'c'] + df = 
self.read_csv(StringIO(data), names=names, usecols=usecols) + tm.assert_frame_equal(df, expected) + + def test_uneven_length_cols(self): + # see gh-8985 + usecols = [0, 1, 2] + data = '19,29,39\n' * 2 + '10,20,30,40' + expected = DataFrame([[19, 29, 39], + [19, 29, 39], + [10, 20, 30]]) + df = self.read_csv(StringIO(data), header=None, usecols=usecols) + tm.assert_frame_equal(df, expected) + + # see gh-9549 + usecols = ['A', 'B', 'C'] + data = ('A,B,C\n1,2,3\n3,4,5\n1,2,4,5,1,6\n' + '1,2,3,,,1,\n1,2,3\n5,6,7') + expected = DataFrame({'A': [1, 3, 1, 1, 1, 5], + 'B': [2, 4, 2, 2, 2, 6], + 'C': [3, 5, 4, 3, 3, 7]}) + df = self.read_csv(StringIO(data), usecols=usecols) + tm.assert_frame_equal(df, expected) diff --git a/pandas/tests/io/sas/__init__.py b/pandas/tests/io/sas/__init__.py new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/pandas/io/tests/sas/data/DEMO_G.csv b/pandas/tests/io/sas/data/DEMO_G.csv similarity index 100% rename from pandas/io/tests/sas/data/DEMO_G.csv rename to pandas/tests/io/sas/data/DEMO_G.csv diff --git a/pandas/io/tests/sas/data/DEMO_G.xpt b/pandas/tests/io/sas/data/DEMO_G.xpt similarity index 100% rename from pandas/io/tests/sas/data/DEMO_G.xpt rename to pandas/tests/io/sas/data/DEMO_G.xpt diff --git a/pandas/io/tests/sas/data/DRXFCD_G.csv b/pandas/tests/io/sas/data/DRXFCD_G.csv similarity index 100% rename from pandas/io/tests/sas/data/DRXFCD_G.csv rename to pandas/tests/io/sas/data/DRXFCD_G.csv diff --git a/pandas/io/tests/sas/data/DRXFCD_G.xpt b/pandas/tests/io/sas/data/DRXFCD_G.xpt similarity index 100% rename from pandas/io/tests/sas/data/DRXFCD_G.xpt rename to pandas/tests/io/sas/data/DRXFCD_G.xpt diff --git a/pandas/io/tests/sas/data/SSHSV1_A.csv b/pandas/tests/io/sas/data/SSHSV1_A.csv similarity index 100% rename from pandas/io/tests/sas/data/SSHSV1_A.csv rename to pandas/tests/io/sas/data/SSHSV1_A.csv diff --git a/pandas/io/tests/sas/data/SSHSV1_A.xpt b/pandas/tests/io/sas/data/SSHSV1_A.xpt similarity index 100% rename from pandas/io/tests/sas/data/SSHSV1_A.xpt rename to pandas/tests/io/sas/data/SSHSV1_A.xpt diff --git a/pandas/io/tests/sas/data/airline.csv b/pandas/tests/io/sas/data/airline.csv similarity index 100% rename from pandas/io/tests/sas/data/airline.csv rename to pandas/tests/io/sas/data/airline.csv diff --git a/pandas/io/tests/sas/data/airline.sas7bdat b/pandas/tests/io/sas/data/airline.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/airline.sas7bdat rename to pandas/tests/io/sas/data/airline.sas7bdat diff --git a/pandas/io/tests/sas/data/paxraw_d_short.csv b/pandas/tests/io/sas/data/paxraw_d_short.csv similarity index 100% rename from pandas/io/tests/sas/data/paxraw_d_short.csv rename to pandas/tests/io/sas/data/paxraw_d_short.csv diff --git a/pandas/io/tests/sas/data/paxraw_d_short.xpt b/pandas/tests/io/sas/data/paxraw_d_short.xpt similarity index 100% rename from pandas/io/tests/sas/data/paxraw_d_short.xpt rename to pandas/tests/io/sas/data/paxraw_d_short.xpt diff --git a/pandas/io/tests/sas/data/productsales.csv b/pandas/tests/io/sas/data/productsales.csv similarity index 100% rename from pandas/io/tests/sas/data/productsales.csv rename to pandas/tests/io/sas/data/productsales.csv diff --git a/pandas/io/tests/sas/data/productsales.sas7bdat b/pandas/tests/io/sas/data/productsales.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/productsales.sas7bdat rename to pandas/tests/io/sas/data/productsales.sas7bdat diff --git a/pandas/io/tests/sas/data/test1.sas7bdat 
b/pandas/tests/io/sas/data/test1.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test1.sas7bdat rename to pandas/tests/io/sas/data/test1.sas7bdat diff --git a/pandas/io/tests/sas/data/test10.sas7bdat b/pandas/tests/io/sas/data/test10.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test10.sas7bdat rename to pandas/tests/io/sas/data/test10.sas7bdat diff --git a/pandas/io/tests/sas/data/test11.sas7bdat b/pandas/tests/io/sas/data/test11.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test11.sas7bdat rename to pandas/tests/io/sas/data/test11.sas7bdat diff --git a/pandas/io/tests/sas/data/test12.sas7bdat b/pandas/tests/io/sas/data/test12.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test12.sas7bdat rename to pandas/tests/io/sas/data/test12.sas7bdat diff --git a/pandas/io/tests/sas/data/test13.sas7bdat b/pandas/tests/io/sas/data/test13.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test13.sas7bdat rename to pandas/tests/io/sas/data/test13.sas7bdat diff --git a/pandas/io/tests/sas/data/test14.sas7bdat b/pandas/tests/io/sas/data/test14.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test14.sas7bdat rename to pandas/tests/io/sas/data/test14.sas7bdat diff --git a/pandas/io/tests/sas/data/test15.sas7bdat b/pandas/tests/io/sas/data/test15.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test15.sas7bdat rename to pandas/tests/io/sas/data/test15.sas7bdat diff --git a/pandas/io/tests/sas/data/test16.sas7bdat b/pandas/tests/io/sas/data/test16.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test16.sas7bdat rename to pandas/tests/io/sas/data/test16.sas7bdat diff --git a/pandas/io/tests/sas/data/test2.sas7bdat b/pandas/tests/io/sas/data/test2.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test2.sas7bdat rename to pandas/tests/io/sas/data/test2.sas7bdat diff --git a/pandas/io/tests/sas/data/test3.sas7bdat b/pandas/tests/io/sas/data/test3.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test3.sas7bdat rename to pandas/tests/io/sas/data/test3.sas7bdat diff --git a/pandas/io/tests/sas/data/test4.sas7bdat b/pandas/tests/io/sas/data/test4.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test4.sas7bdat rename to pandas/tests/io/sas/data/test4.sas7bdat diff --git a/pandas/io/tests/sas/data/test5.sas7bdat b/pandas/tests/io/sas/data/test5.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test5.sas7bdat rename to pandas/tests/io/sas/data/test5.sas7bdat diff --git a/pandas/io/tests/sas/data/test6.sas7bdat b/pandas/tests/io/sas/data/test6.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test6.sas7bdat rename to pandas/tests/io/sas/data/test6.sas7bdat diff --git a/pandas/io/tests/sas/data/test7.sas7bdat b/pandas/tests/io/sas/data/test7.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test7.sas7bdat rename to pandas/tests/io/sas/data/test7.sas7bdat diff --git a/pandas/io/tests/sas/data/test8.sas7bdat b/pandas/tests/io/sas/data/test8.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test8.sas7bdat rename to pandas/tests/io/sas/data/test8.sas7bdat diff --git a/pandas/io/tests/sas/data/test9.sas7bdat b/pandas/tests/io/sas/data/test9.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test9.sas7bdat rename to pandas/tests/io/sas/data/test9.sas7bdat diff --git a/pandas/io/tests/sas/data/test_12659.csv 
b/pandas/tests/io/sas/data/test_12659.csv similarity index 100% rename from pandas/io/tests/sas/data/test_12659.csv rename to pandas/tests/io/sas/data/test_12659.csv diff --git a/pandas/io/tests/sas/data/test_12659.sas7bdat b/pandas/tests/io/sas/data/test_12659.sas7bdat similarity index 100% rename from pandas/io/tests/sas/data/test_12659.sas7bdat rename to pandas/tests/io/sas/data/test_12659.sas7bdat diff --git a/pandas/io/tests/sas/data/test_sas7bdat_1.csv b/pandas/tests/io/sas/data/test_sas7bdat_1.csv similarity index 100% rename from pandas/io/tests/sas/data/test_sas7bdat_1.csv rename to pandas/tests/io/sas/data/test_sas7bdat_1.csv diff --git a/pandas/io/tests/sas/data/test_sas7bdat_2.csv b/pandas/tests/io/sas/data/test_sas7bdat_2.csv similarity index 100% rename from pandas/io/tests/sas/data/test_sas7bdat_2.csv rename to pandas/tests/io/sas/data/test_sas7bdat_2.csv diff --git a/pandas/tests/io/sas/test_sas.py b/pandas/tests/io/sas/test_sas.py new file mode 100644 index 0000000000000..617df99b99f0b --- /dev/null +++ b/pandas/tests/io/sas/test_sas.py @@ -0,0 +1,14 @@ +import pytest + +from pandas.compat import StringIO +from pandas import read_sas + + +class TestSas(object): + + def test_sas_buffer_format(self): + + # GH14947 + b = StringIO("") + with pytest.raises(ValueError): + read_sas(b) diff --git a/pandas/io/tests/sas/test_sas7bdat.py b/pandas/tests/io/sas/test_sas7bdat.py similarity index 95% rename from pandas/io/tests/sas/test_sas7bdat.py rename to pandas/tests/io/sas/test_sas7bdat.py index e20ea48247119..a5157744038f4 100644 --- a/pandas/io/tests/sas/test_sas7bdat.py +++ b/pandas/tests/io/sas/test_sas7bdat.py @@ -6,9 +6,9 @@ import numpy as np -class TestSAS7BDAT(tm.TestCase): +class TestSAS7BDAT(object): - def setUp(self): + def setup_method(self, method): self.dirpath = tm.get_data_path() self.data = [] self.test_ix = [list(range(1, 16)), [16]] @@ -51,6 +51,7 @@ def test_from_buffer(self): iterator=True, encoding='utf-8') df = rdr.read() tm.assert_frame_equal(df, df0, check_exact=False) + rdr.close() def test_from_iterator(self): for j in 0, 1: @@ -62,6 +63,7 @@ def test_from_iterator(self): tm.assert_frame_equal(df, df0.iloc[0:2, :]) df = rdr.read(3) tm.assert_frame_equal(df, df0.iloc[2:5, :]) + rdr.close() def test_iterator_loop(self): # github #13654 @@ -73,7 +75,8 @@ def test_iterator_loop(self): y = 0 for x in rdr: y += x.shape[0] - self.assertTrue(y == rdr.row_count) + assert y == rdr.row_count + rdr.close() def test_iterator_read_too_much(self): # github #14734 @@ -82,9 +85,12 @@ def test_iterator_read_too_much(self): rdr = pd.read_sas(fname, format="sas7bdat", iterator=True, encoding='utf-8') d1 = rdr.read(rdr.row_count + 20) + rdr.close() + rdr = pd.read_sas(fname, iterator=True, encoding="utf-8") d2 = rdr.read(rdr.row_count + 20) tm.assert_frame_equal(d1, d2) + rdr.close() def test_encoding_options(): diff --git a/pandas/io/tests/sas/test_xport.py b/pandas/tests/io/sas/test_xport.py similarity index 97% rename from pandas/io/tests/sas/test_xport.py rename to pandas/tests/io/sas/test_xport.py index fe2f7cb4bf4be..de31c3e36a8d5 100644 --- a/pandas/io/tests/sas/test_xport.py +++ b/pandas/tests/io/sas/test_xport.py @@ -16,9 +16,9 @@ def numeric_as_float(data): data[v] = data[v].astype(np.float64) -class TestXport(tm.TestCase): +class TestXport(object): - def setUp(self): + def setup_method(self, method): self.dirpath = tm.get_data_path() self.file01 = os.path.join(self.dirpath, "DEMO_G.xpt") self.file02 = os.path.join(self.dirpath, "SSHSV1_A.xpt") @@ -40,7 +40,7 @@ def 
test1_basic(self): # Test reading beyond end of file reader = read_sas(self.file01, format="xport", iterator=True) data = reader.read(num_rows + 100) - self.assertTrue(data.shape[0] == num_rows) + assert data.shape[0] == num_rows reader.close() # Test incremental read with `read` method. @@ -61,7 +61,7 @@ def test1_basic(self): for x in reader: m += x.shape[0] reader.close() - self.assertTrue(m == num_rows) + assert m == num_rows # Read full file with `read_sas` method data = read_sas(self.file01) diff --git a/pandas/io/tests/test_clipboard.py b/pandas/tests/io/test_clipboard.py similarity index 85% rename from pandas/io/tests/test_clipboard.py rename to pandas/tests/io/test_clipboard.py index e1f1e5340251e..940a331a9de84 100644 --- a/pandas/io/tests/test_clipboard.py +++ b/pandas/tests/io/test_clipboard.py @@ -1,8 +1,9 @@ # -*- coding: utf-8 -*- import numpy as np from numpy.random import randint +from textwrap import dedent -import nose +import pytest import pandas as pd from pandas import DataFrame @@ -10,20 +11,24 @@ from pandas import get_option from pandas.util import testing as tm from pandas.util.testing import makeCustomDataframe as mkdf -from pandas.util.clipboard.exceptions import PyperclipException +from pandas.io.clipboard.exceptions import PyperclipException +from pandas.io.clipboard import clipboard_set try: DataFrame({'A': [1, 2]}).to_clipboard() + _DEPS_INSTALLED = 1 except PyperclipException: - raise nose.SkipTest("clipboard primitives not installed") + _DEPS_INSTALLED = 0 -class TestClipboard(tm.TestCase): +@pytest.mark.single +@pytest.mark.skipif(not _DEPS_INSTALLED, + reason="clipboard primitives not installed") +class TestClipboard(object): @classmethod - def setUpClass(cls): - super(TestClipboard, cls).setUpClass() + def setup_class(cls): cls.data = {} cls.data['string'] = mkdf(5, 3, c_idx_type='s', r_idx_type='i', c_idx_names=[None], r_idx_names=[None]) @@ -54,12 +59,11 @@ def setUpClass(cls): 'es': 'en español'.split()}) # unicode round trip test for GH 13747, GH 12529 cls.data['utf8'] = pd.DataFrame({'a': ['µasd', 'Ωœ∑´'], - 'b': ['øπ∆˚¬', 'œ∑´®']}) + 'b': ['øπ∆˚¬', 'œ∑´®']}) cls.data_types = list(cls.data.keys()) @classmethod - def tearDownClass(cls): - super(TestClipboard, cls).tearDownClass() + def teardown_class(cls): del cls.data_types, cls.data def check_round_trip_frame(self, data_type, excel=None, sep=None, @@ -75,6 +79,8 @@ def check_round_trip_frame(self, data_type, excel=None, sep=None, def test_round_trip_frame_sep(self): for dt in self.data_types: self.check_round_trip_frame(dt, sep=',') + self.check_round_trip_frame(dt, sep=r'\s+') + self.check_round_trip_frame(dt, sep='|') def test_round_trip_frame_string(self): for dt in self.data_types: @@ -85,8 +91,6 @@ def test_round_trip_frame(self): self.check_round_trip_frame(dt) def test_read_clipboard_infer_excel(self): - from textwrap import dedent - from pandas.util.clipboard import clipboard_set text = dedent(""" John James Charlie Mingus @@ -97,7 +101,7 @@ def test_read_clipboard_infer_excel(self): df = pd.read_clipboard() # excel data is parsed correctly - self.assertEqual(df.iloc[1][1], 'Harry Carney') + assert df.iloc[1][1] == 'Harry Carney' # having diff tab counts doesn't trigger it text = dedent(""" @@ -121,9 +125,9 @@ def test_read_clipboard_infer_excel(self): def test_invalid_encoding(self): # test case for testing invalid encoding data = self.data['string'] - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): data.to_clipboard(encoding='ascii') - with 
tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.read_clipboard(encoding='ascii') def test_round_trip_valid_encodings(self): diff --git a/pandas/io/tests/test_common.py b/pandas/tests/io/test_common.py similarity index 72% rename from pandas/io/tests/test_common.py rename to pandas/tests/io/test_common.py index c08d235b07c9e..a1a95e09915f1 100644 --- a/pandas/io/tests/test_common.py +++ b/pandas/tests/io/test_common.py @@ -2,6 +2,7 @@ Tests for the pandas.io.common functionalities """ import mmap +import pytest import os from os.path import isabs @@ -23,7 +24,7 @@ pass -class TestCommonIOCapabilities(tm.TestCase): +class TestCommonIOCapabilities(object): data1 = """index,A,B,C,D foo,2,3,4,5 bar,7,8,9,10 @@ -37,24 +38,24 @@ def test_expand_user(self): filename = '~/sometest' expanded_name = common._expand_user(filename) - self.assertNotEqual(expanded_name, filename) - self.assertTrue(isabs(expanded_name)) - self.assertEqual(os.path.expanduser(filename), expanded_name) + assert expanded_name != filename + assert isabs(expanded_name) + assert os.path.expanduser(filename) == expanded_name def test_expand_user_normal_path(self): filename = '/somefolder/sometest' expanded_name = common._expand_user(filename) - self.assertEqual(expanded_name, filename) - self.assertEqual(os.path.expanduser(filename), expanded_name) + assert expanded_name == filename + assert os.path.expanduser(filename) == expanded_name def test_stringify_path_pathlib(self): tm._skip_if_no_pathlib() rel_path = common._stringify_path(Path('.')) - self.assertEqual(rel_path, '.') + assert rel_path == '.' redundant_path = common._stringify_path(Path('foo//bar')) - self.assertEqual(redundant_path, os.path.join('foo', 'bar')) + assert redundant_path == os.path.join('foo', 'bar') def test_stringify_path_localpath(self): tm._skip_if_no_localpath() @@ -62,19 +63,19 @@ def test_stringify_path_localpath(self): path = os.path.join('foo', 'bar') abs_path = os.path.abspath(path) lpath = LocalPath(path) - self.assertEqual(common._stringify_path(lpath), abs_path) + assert common._stringify_path(lpath) == abs_path def test_get_filepath_or_buffer_with_path(self): filename = '~/sometest' filepath_or_buffer, _, _ = common.get_filepath_or_buffer(filename) - self.assertNotEqual(filepath_or_buffer, filename) - self.assertTrue(isabs(filepath_or_buffer)) - self.assertEqual(os.path.expanduser(filename), filepath_or_buffer) + assert filepath_or_buffer != filename + assert isabs(filepath_or_buffer) + assert os.path.expanduser(filename) == filepath_or_buffer def test_get_filepath_or_buffer_with_buffer(self): input_buffer = StringIO() filepath_or_buffer, _, _ = common.get_filepath_or_buffer(input_buffer) - self.assertEqual(filepath_or_buffer, input_buffer) + assert filepath_or_buffer == input_buffer def test_iterator(self): reader = read_csv(StringIO(self.data1), chunksize=1) @@ -89,9 +90,9 @@ def test_iterator(self): tm.assert_frame_equal(concat(it), expected.iloc[1:]) -class TestMMapWrapper(tm.TestCase): +class TestMMapWrapper(object): - def setUp(self): + def setup_method(self, method): self.mmap_file = os.path.join(tm.get_data_path(), 'test_mmap.csv') @@ -107,13 +108,14 @@ def test_constructor_bad_file(self): msg = "[Errno 22]" err = mmap.error - tm.assertRaisesRegexp(err, msg, common.MMapWrapper, non_file) + tm.assert_raises_regex(err, msg, common.MMapWrapper, non_file) target = open(self.mmap_file, 'r') target.close() msg = "I/O operation on closed file" - tm.assertRaisesRegexp(ValueError, msg, common.MMapWrapper, 
target) + tm.assert_raises_regex( + ValueError, msg, common.MMapWrapper, target) def test_get_attr(self): with open(self.mmap_file, 'r') as target: @@ -125,9 +127,9 @@ def test_get_attr(self): attrs.append('__next__') for attr in attrs: - self.assertTrue(hasattr(wrapper, attr)) + assert hasattr(wrapper, attr) - self.assertFalse(hasattr(wrapper, 'foo')) + assert not hasattr(wrapper, 'foo') def test_next(self): with open(self.mmap_file, 'r') as target: @@ -136,6 +138,6 @@ def test_next(self): for line in lines: next_line = next(wrapper) - self.assertEqual(next_line.strip(), line.strip()) + assert next_line.strip() == line.strip() - self.assertRaises(StopIteration, next, wrapper) + pytest.raises(StopIteration, next, wrapper) diff --git a/pandas/io/tests/test_excel.py b/pandas/tests/io/test_excel.py similarity index 82% rename from pandas/io/tests/test_excel.py rename to pandas/tests/io/test_excel.py index 056b7a5322e26..c70b5937fea3f 100644 --- a/pandas/io/tests/test_excel.py +++ b/pandas/tests/io/test_excel.py @@ -7,15 +7,17 @@ from distutils.version import LooseVersion import warnings +from warnings import catch_warnings import operator import functools -import nose +import pytest from numpy import nan import numpy as np import pandas as pd from pandas import DataFrame, Index, MultiIndex +from pandas.io.formats.excel import ExcelFormatter from pandas.io.parsers import read_csv from pandas.io.excel import ( ExcelFile, ExcelWriter, read_excel, _XlwtWriter, _Openpyxl1Writer, @@ -32,30 +34,30 @@ def _skip_if_no_xlrd(): import xlrd ver = tuple(map(int, xlrd.__VERSION__.split(".")[:2])) if ver < (0, 9): - raise nose.SkipTest('xlrd < 0.9, skipping') + pytest.skip('xlrd < 0.9, skipping') except ImportError: - raise nose.SkipTest('xlrd not installed, skipping') + pytest.skip('xlrd not installed, skipping') def _skip_if_no_xlwt(): try: import xlwt # NOQA except ImportError: - raise nose.SkipTest('xlwt not installed, skipping') + pytest.skip('xlwt not installed, skipping') def _skip_if_no_openpyxl(): try: import openpyxl # NOQA except ImportError: - raise nose.SkipTest('openpyxl not installed, skipping') + pytest.skip('openpyxl not installed, skipping') def _skip_if_no_xlsxwriter(): try: import xlsxwriter # NOQA except ImportError: - raise nose.SkipTest('xlsxwriter not installed, skipping') + pytest.skip('xlsxwriter not installed, skipping') def _skip_if_no_excelsuite(): @@ -64,11 +66,11 @@ def _skip_if_no_excelsuite(): _skip_if_no_openpyxl() -def _skip_if_no_boto(): +def _skip_if_no_s3fs(): try: - import boto # NOQA + import s3fs # noqa except ImportError: - raise nose.SkipTest('boto not installed, skipping') + pytest.skip('s3fs not installed, skipping') _seriesd = tm.getSeriesData() @@ -82,7 +84,7 @@ def _skip_if_no_boto(): class SharedItems(object): - def setUp(self): + def setup_method(self, method): self.dirpath = tm.get_data_path() self.frame = _frame.copy() self.frame2 = _frame2.copy() @@ -159,9 +161,9 @@ class ReadingTestsBase(SharedItems): # 3. Add a property engine_name, which is the name of the reader class. # For the reader this is not used for anything at the moment. 
- def setUp(self): + def setup_method(self, method): self.check_skip() - super(ReadingTestsBase, self).setUp() + super(ReadingTestsBase, self).setup_method(method) def test_parse_cols_int(self): @@ -276,16 +278,16 @@ def test_excel_table_sheet_by_index(self): df3 = read_excel(excel, 0, index_col=0, skipfooter=1) df4 = read_excel(excel, 0, index_col=0, skip_footer=1) - tm.assert_frame_equal(df3, df1.ix[:-1]) + tm.assert_frame_equal(df3, df1.iloc[:-1]) tm.assert_frame_equal(df3, df4) df3 = excel.parse(0, index_col=0, skipfooter=1) df4 = excel.parse(0, index_col=0, skip_footer=1) - tm.assert_frame_equal(df3, df1.ix[:-1]) + tm.assert_frame_equal(df3, df1.iloc[:-1]) tm.assert_frame_equal(df3, df4) import xlrd - with tm.assertRaises(xlrd.XLRDError): + with pytest.raises(xlrd.XLRDError): read_excel(excel, 'asdf') def test_excel_table(self): @@ -302,7 +304,7 @@ def test_excel_table(self): skipfooter=1) df4 = self.get_exceldf('test1', 'Sheet1', index_col=0, skip_footer=1) - tm.assert_frame_equal(df3, df1.ix[:-1]) + tm.assert_frame_equal(df3, df1.iloc[:-1]) tm.assert_frame_equal(df3, df4) def test_reader_special_dtypes(self): @@ -328,7 +330,7 @@ def test_reader_special_dtypes(self): # if not coercing number, then int comes in as float float_expected = expected.copy() float_expected["IntCol"] = float_expected["IntCol"].astype(float) - float_expected.loc[1, "Str2Col"] = 3.0 + float_expected.loc[float_expected.index[1], "Str2Col"] = 3.0 actual = self.get_exceldf(basename, 'Sheet1', convert_float=False) tm.assert_frame_equal(actual, float_expected) @@ -373,14 +375,45 @@ def test_reader_converters(self): actual = self.get_exceldf(basename, 'Sheet1', converters=converters) tm.assert_frame_equal(actual, expected) + def test_reader_dtype(self): + # GH 8212 + basename = 'testdtype' + actual = self.get_exceldf(basename) + + expected = DataFrame({ + 'a': [1, 2, 3, 4], + 'b': [2.5, 3.5, 4.5, 5.5], + 'c': [1, 2, 3, 4], + 'd': [1.0, 2.0, np.nan, 4.0]}).reindex( + columns=['a', 'b', 'c', 'd']) + + tm.assert_frame_equal(actual, expected) + + actual = self.get_exceldf(basename, + dtype={'a': 'float64', + 'b': 'float32', + 'c': str}) + + expected['a'] = expected['a'].astype('float64') + expected['b'] = expected['b'].astype('float32') + expected['c'] = ['001', '002', '003', '004'] + tm.assert_frame_equal(actual, expected) + + with pytest.raises(ValueError): + actual = self.get_exceldf(basename, dtype={'d': 'int64'}) + def test_reading_all_sheets(self): # Test reading all sheetnames by setting sheetname to None, # Ensure a dict is returned. 
# See PR #9450 basename = 'test_multisheet' dfs = self.get_exceldf(basename, sheetname=None) - expected_keys = ['Alpha', 'Beta', 'Charlie'] + # ensure this is not alphabetical to test order preservation + expected_keys = ['Charlie', 'Alpha', 'Beta'] tm.assert_contains_all(expected_keys, dfs.keys()) + # Issue 9930 + # Ensure sheet order is preserved + assert expected_keys == list(dfs.keys()) def test_reading_multiple_specific_sheets(self): # Test reading specific sheetnames by specifying a mixed list @@ -417,6 +450,9 @@ def test_read_excel_blank_with_header(self): # GH 12292 : error when read one empty column from excel file def test_read_one_empty_col_no_header(self): + _skip_if_no_xlwt() + _skip_if_no_openpyxl() + df = pd.DataFrame( [["", 1, 100], ["", 2, 200], @@ -473,6 +509,9 @@ def test_read_one_empty_col_with_header(self): tm.assert_frame_equal(actual_header_zero, expected_header_zero) def test_set_column_names_in_parameter(self): + _skip_if_no_xlwt() + _skip_if_no_openpyxl() + # GH 12870 : pass down column names associated with # keyword argument names refdf = pd.DataFrame([[1, 'foo'], [2, 'bar'], @@ -544,14 +583,14 @@ def test_read_xlrd_Book(self): @tm.network def test_read_from_http_url(self): url = ('https://raw.github.com/pandas-dev/pandas/master/' - 'pandas/io/tests/data/test1' + self.ext) + 'pandas/tests/io/data/test1' + self.ext) url_table = read_excel(url) local_table = self.get_exceldf('test1') tm.assert_frame_equal(url_table, local_table) @tm.network(check_before_test=True) def test_read_from_s3_url(self): - _skip_if_no_boto() + _skip_if_no_s3fs() url = ('s3://pandas-test/test1' + self.ext) url_table = read_excel(url) @@ -563,7 +602,7 @@ def test_read_from_file_url(self): # FILE if sys.version_info[:2] < (2, 6): - raise nose.SkipTest("file:// not supported with Python < 2.6") + pytest.skip("file:// not supported with Python < 2.6") localtable = os.path.join(self.dirpath, 'test1' + self.ext) local_table = read_excel(localtable) @@ -573,8 +612,8 @@ def test_read_from_file_url(self): except URLError: # fails on some systems import platform - raise nose.SkipTest("failing on %s" % - ' '.join(platform.uname()).strip()) + pytest.skip("failing on %s" % + ' '.join(platform.uname()).strip()) tm.assert_frame_equal(url_table, local_table) @@ -617,7 +656,7 @@ def test_reader_closes_file(self): # parses okay read_excel(xlsx, 'Sheet1', index_col=0) - self.assertTrue(f.closed) + assert f.closed def test_creating_and_reading_multiple_sheets(self): # Test reading multiple sheets, from a runtime created excel file @@ -876,28 +915,43 @@ def test_excel_oldindex_format(self): def test_read_excel_bool_header_arg(self): # GH 6114 for arg in [True, False]: - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): pd.read_excel(os.path.join(self.dirpath, 'test1' + self.ext), header=arg) def test_read_excel_chunksize(self): # GH 8011 - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): pd.read_excel(os.path.join(self.dirpath, 'test1' + self.ext), chunksize=100) def test_read_excel_parse_dates(self): - # GH 11544 - with tm.assertRaises(NotImplementedError): - pd.read_excel(os.path.join(self.dirpath, 'test1' + self.ext), - parse_dates=True) + # GH 11544, 12051 - def test_read_excel_date_parser(self): - # GH 11544 - with tm.assertRaises(NotImplementedError): - dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S') - pd.read_excel(os.path.join(self.dirpath, 'test1' + self.ext), - date_parser=dateparse) + df = DataFrame( + {'col': [1, 2, 3], + 
'date_strings': pd.date_range('2012-01-01', periods=3)}) + df2 = df.copy() + df2['date_strings'] = df2['date_strings'].dt.strftime('%m/%d/%Y') + + with ensure_clean(self.ext) as pth: + df2.to_excel(pth) + + res = read_excel(pth) + tm.assert_frame_equal(df2, res) + + # no index_col specified when parse_dates is True + with tm.assert_produces_warning(): + res = read_excel(pth, parse_dates=True) + tm.assert_frame_equal(df2, res) + + res = read_excel(pth, parse_dates=['date_strings'], index_col=0) + tm.assert_frame_equal(df, res) + + dateparser = lambda x: pd.datetime.strptime(x, '%m/%d/%Y') + res = read_excel(pth, parse_dates=['date_strings'], + date_parser=dateparser, index_col=0) + tm.assert_frame_equal(df, res) def test_read_excel_skiprows_list(self): # GH 4903 @@ -935,19 +989,19 @@ def test_read_excel_squeeze(self): tm.assert_series_equal(actual, expected) -class XlsReaderTests(XlrdTests, tm.TestCase): +class XlsReaderTests(XlrdTests): ext = '.xls' engine_name = 'xlrd' check_skip = staticmethod(_skip_if_no_xlrd) -class XlsxReaderTests(XlrdTests, tm.TestCase): +class XlsxReaderTests(XlrdTests): ext = '.xlsx' engine_name = 'xlrd' check_skip = staticmethod(_skip_if_no_xlrd) -class XlsmReaderTests(XlrdTests, tm.TestCase): +class XlsmReaderTests(XlrdTests): ext = '.xlsm' engine_name = 'xlrd' check_skip = staticmethod(_skip_if_no_xlrd) @@ -965,14 +1019,14 @@ class ExcelWriterBase(SharedItems): # Test with MultiIndex and Hierarchical Rows as merged cells. merge_cells = True - def setUp(self): + def setup_method(self, method): self.check_skip() - super(ExcelWriterBase, self).setUp() + super(ExcelWriterBase, self).setup_method(method) self.option_name = 'io.excel.%s.writer' % self.ext.strip('.') self.prev_engine = get_option(self.option_name) set_option(self.option_name, self.engine_name) - def tearDown(self): + def teardown_method(self, method): set_option(self.option_name, self.prev_engine) def test_excel_sheet_by_name_raise(self): @@ -986,7 +1040,7 @@ def test_excel_sheet_by_name_raise(self): df = read_excel(xl, 0) tm.assert_frame_equal(gt, df) - with tm.assertRaises(xlrd.XLRDError): + with pytest.raises(xlrd.XLRDError): read_excel(xl, '0') def test_excelwriter_contextmanager(self): @@ -1047,6 +1101,12 @@ def test_roundtrip(self): recons = read_excel(path, index_col=0) tm.assert_frame_equal(self.frame, recons) + # GH 8825 Pandas Series should provide to_excel method + s = self.frame["A"] + s.to_excel(path) + recons = read_excel(path, index_col=0) + tm.assert_frame_equal(s.to_frame(), recons) + def test_mixed(self): _skip_if_no_xlrd() @@ -1156,9 +1216,9 @@ def test_sheets(self): tm.assert_frame_equal(self.frame, recons) recons = read_excel(reader, 'test2', index_col=0) tm.assert_frame_equal(self.tsframe, recons) - self.assertEqual(2, len(reader.sheet_names)) - self.assertEqual('test1', reader.sheet_names[0]) - self.assertEqual('test2', reader.sheet_names[1]) + assert 2 == len(reader.sheet_names) + assert 'test1' == reader.sheet_names[0] + assert 'test2' == reader.sheet_names[1] def test_colaliases(self): _skip_if_no_xlrd() @@ -1202,7 +1262,7 @@ def test_roundtrip_indexlabels(self): index_col=0, ).astype(np.int64) frame.index.names = ['test'] - self.assertEqual(frame.index.names, recons.index.names) + assert frame.index.names == recons.index.names frame = (DataFrame(np.random.randn(10, 2)) >= 0) frame.to_excel(path, @@ -1214,7 +1274,7 @@ def test_roundtrip_indexlabels(self): index_col=0, ).astype(np.int64) frame.index.names = ['test'] - self.assertEqual(frame.index.names, recons.index.names) + 
assert frame.index.names == recons.index.names frame = (DataFrame(np.random.randn(10, 2)) >= 0) frame.to_excel(path, @@ -1256,7 +1316,7 @@ def test_excel_roundtrip_indexname(self): index_col=0) tm.assert_frame_equal(result, df) - self.assertEqual(result.index.name, 'foo') + assert result.index.name == 'foo' def test_excel_roundtrip_datetime(self): _skip_if_no_xlrd() @@ -1339,8 +1399,7 @@ def test_to_excel_multiindex(self): # round trip frame.to_excel(path, 'test1', merge_cells=self.merge_cells) reader = ExcelFile(path) - df = read_excel(reader, 'test1', index_col=[0, 1], - parse_dates=False) + df = read_excel(reader, 'test1', index_col=[0, 1]) tm.assert_frame_equal(frame, df) # GH13511 @@ -1381,8 +1440,7 @@ def test_to_excel_multiindex_cols(self): frame.to_excel(path, 'test1', merge_cells=self.merge_cells) reader = ExcelFile(path) df = read_excel(reader, 'test1', header=header, - index_col=[0, 1], - parse_dates=False) + index_col=[0, 1]) if not self.merge_cells: fm = frame.columns.format(sparsify=False, adjoin=False, names=False) @@ -1405,7 +1463,7 @@ def test_to_excel_multiindex_dates(self): index_col=[0, 1]) tm.assert_frame_equal(tsframe, recons) - self.assertEqual(recons.index.names, ('time', 'foo')) + assert recons.index.names == ('time', 'foo') def test_to_excel_multiindex_no_write_index(self): _skip_if_no_xlrd() @@ -1470,7 +1528,7 @@ def test_to_excel_unicode_filename(self): try: f = open(filename, 'wb') except UnicodeEncodeError: - raise nose.SkipTest('no unicode file names on this system') + pytest.skip('no unicode file names on this system') else: f.close() @@ -1512,28 +1570,27 @@ def test_to_excel_unicode_filename(self): # import xlwt # import xlrd # except ImportError: - # raise nose.SkipTest + # pytest.skip # filename = '__tmp_to_excel_header_styling_xls__.xls' # pdf.to_excel(filename, 'test1') # wbk = xlrd.open_workbook(filename, # formatting_info=True) - # self.assertEqual(["test1"], wbk.sheet_names()) + # assert ["test1"] == wbk.sheet_names() # ws = wbk.sheet_by_name('test1') - # self.assertEqual([(0, 1, 5, 7), (0, 1, 3, 5), (0, 1, 1, 3)], - # ws.merged_cells) + # assert [(0, 1, 5, 7), (0, 1, 3, 5), (0, 1, 1, 3)] == ws.merged_cells # for i in range(0, 2): # for j in range(0, 7): # xfx = ws.cell_xf_index(0, 0) # cell_xf = wbk.xf_list[xfx] # font = wbk.font_list - # self.assertEqual(1, font[cell_xf.font_index].bold) - # self.assertEqual(1, cell_xf.border.top_line_style) - # self.assertEqual(1, cell_xf.border.right_line_style) - # self.assertEqual(1, cell_xf.border.bottom_line_style) - # self.assertEqual(1, cell_xf.border.left_line_style) - # self.assertEqual(2, cell_xf.alignment.hor_align) + # assert 1 == font[cell_xf.font_index].bold + # assert 1 == cell_xf.border.top_line_style + # assert 1 == cell_xf.border.right_line_style + # assert 1 == cell_xf.border.bottom_line_style + # assert 1 == cell_xf.border.left_line_style + # assert 2 == cell_xf.alignment.hor_align # os.remove(filename) # def test_to_excel_header_styling_xlsx(self): # import StringIO @@ -1558,41 +1615,41 @@ def test_to_excel_unicode_filename(self): # import openpyxl # from openpyxl.cell import get_column_letter # except ImportError: - # raise nose.SkipTest + # pytest.skip # if openpyxl.__version__ < '1.6.1': - # raise nose.SkipTest + # pytest.skip # # test xlsx_styling # filename = '__tmp_to_excel_header_styling_xlsx__.xlsx' # pdf.to_excel(filename, 'test1') # wbk = openpyxl.load_workbook(filename) - # self.assertEqual(["test1"], wbk.get_sheet_names()) + # assert ["test1"] == wbk.get_sheet_names() # ws = 
wbk.get_sheet_by_name('test1') # xlsaddrs = ["%s2" % chr(i) for i in range(ord('A'), ord('H'))] # xlsaddrs += ["A%s" % i for i in range(1, 6)] # xlsaddrs += ["B1", "D1", "F1"] # for xlsaddr in xlsaddrs: # cell = ws.cell(xlsaddr) - # self.assertTrue(cell.style.font.bold) - # self.assertEqual(openpyxl.style.Border.BORDER_THIN, - # cell.style.borders.top.border_style) - # self.assertEqual(openpyxl.style.Border.BORDER_THIN, - # cell.style.borders.right.border_style) - # self.assertEqual(openpyxl.style.Border.BORDER_THIN, - # cell.style.borders.bottom.border_style) - # self.assertEqual(openpyxl.style.Border.BORDER_THIN, - # cell.style.borders.left.border_style) - # self.assertEqual(openpyxl.style.Alignment.HORIZONTAL_CENTER, - # cell.style.alignment.horizontal) + # assert cell.style.font.bold + # assert (openpyxl.style.Border.BORDER_THIN == + # cell.style.borders.top.border_style) + # assert (openpyxl.style.Border.BORDER_THIN == + # cell.style.borders.right.border_style) + # assert (openpyxl.style.Border.BORDER_THIN == + # cell.style.borders.bottom.border_style) + # assert (openpyxl.style.Border.BORDER_THIN == + # cell.style.borders.left.border_style) + # assert (openpyxl.style.Alignment.HORIZONTAL_CENTER == + # cell.style.alignment.horizontal) # mergedcells_addrs = ["C1", "E1", "G1"] # for maddr in mergedcells_addrs: - # self.assertTrue(ws.cell(maddr).merged) + # assert ws.cell(maddr).merged # os.remove(filename) def test_excel_010_hemstring(self): _skip_if_no_xlrd() if self.merge_cells: - raise nose.SkipTest('Skip tests for merged MI format.') + pytest.skip('Skip tests for merged MI format.') from pandas.util.testing import makeCustomDataframe as mkdf # ensure limited functionality in 0.10 @@ -1617,29 +1674,29 @@ def roundtrip(df, header=True, parser_hdr=0, index=True): # this if will be removed once multi column excel writing # is implemented for now fixing #9794 if j > 1: - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): res = roundtrip(df, use_headers, index=False) else: res = roundtrip(df, use_headers) if use_headers: - self.assertEqual(res.shape, (nrows, ncols + i)) + assert res.shape == (nrows, ncols + i) else: # first row taken as columns - self.assertEqual(res.shape, (nrows - 1, ncols + i)) + assert res.shape == (nrows - 1, ncols + i) # no nans for r in range(len(res.index)): for c in range(len(res.columns)): - self.assertTrue(res.ix[r, c] is not np.nan) + assert res.iloc[r, c] is not np.nan res = roundtrip(DataFrame([0])) - self.assertEqual(res.shape, (1, 1)) - self.assertTrue(res.ix[0, 0] is not np.nan) + assert res.shape == (1, 1) + assert res.iloc[0, 0] is not np.nan res = roundtrip(DataFrame([0]), False, None) - self.assertEqual(res.shape, (1, 2)) - self.assertTrue(res.ix[0, 0] is not np.nan) + assert res.shape == (1, 2) + assert res.iloc[0, 0] is not np.nan def test_excel_010_hemstring_raises_NotImplementedError(self): # This test was failing only for j>1 and header=False, @@ -1647,7 +1704,7 @@ def test_excel_010_hemstring_raises_NotImplementedError(self): _skip_if_no_xlrd() if self.merge_cells: - raise nose.SkipTest('Skip tests for merged MI format.') + pytest.skip('Skip tests for merged MI format.') from pandas.util.testing import makeCustomDataframe as mkdf # ensure limited functionality in 0.10 @@ -1667,7 +1724,7 @@ def roundtrip2(df, header=True, parser_hdr=0, index=True): j = 2 i = 1 df = mkdf(nrows, ncols, r_idx_nlevels=i, c_idx_nlevels=j) - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): 
roundtrip2(df, header=False, index=False) def test_duplicated_columns(self): @@ -1726,7 +1783,7 @@ def test_invalid_columns(self): read_frame = read_excel(path, 'test1') tm.assert_frame_equal(expected, read_frame) - with tm.assertRaises(KeyError): + with pytest.raises(KeyError): write_frame.to_excel(path, 'test1', columns=['C', 'D']) def test_datetimes(self): @@ -1759,7 +1816,8 @@ def test_bytes_io(self): bio = BytesIO() df = DataFrame(np.random.randn(10, 2)) - writer = ExcelWriter(bio) + # pass engine explicitly as there is no file path to infer from + writer = ExcelWriter(bio, engine=self.engine_name) df.to_excel(writer) writer.save() bio.seek(0) @@ -1792,6 +1850,14 @@ def test_true_and_false_value_options(self): false_values=['bar']) tm.assert_frame_equal(read_frame, expected) + def test_freeze_panes(self): + # GH15160 + expected = DataFrame([[1, 2], [3, 4]], columns=['col1', 'col2']) + with ensure_clean(self.ext) as path: + expected.to_excel(path, "Sheet1", freeze_panes=(1, 1)) + result = read_excel(path) + tm.assert_frame_equal(expected, result) + def raise_wrapper(major_ver): def versioned_raise_wrapper(orig_method): @@ -1803,7 +1869,7 @@ def wrapped(self, *args, **kwargs): else: msg = (r'Installed openpyxl is not supported at this ' r'time\. Use.+') - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): orig_method(self, *args, **kwargs) return wrapped return versioned_raise_wrapper @@ -1821,7 +1887,7 @@ def versioned_raise_on_incompat_version(cls): @raise_on_incompat_version(1) -class OpenpyxlTests(ExcelWriterBase, tm.TestCase): +class OpenpyxlTests(ExcelWriterBase): ext = '.xlsx' engine_name = 'openpyxl1' check_skip = staticmethod(lambda *args, **kwargs: None) @@ -1829,7 +1895,7 @@ class OpenpyxlTests(ExcelWriterBase, tm.TestCase): def test_to_excel_styleconverter(self): _skip_if_no_openpyxl() if not openpyxl_compat.is_compat(major_ver=1): - raise nose.SkipTest('incompatiable openpyxl version') + pytest.skip('incompatible openpyxl version') import openpyxl @@ -1841,40 +1907,40 @@ def test_to_excel_styleconverter(self): "alignment": {"horizontal": "center", "vertical": "top"}} xlsx_style = _Openpyxl1Writer._convert_to_style(hstyle) - self.assertTrue(xlsx_style.font.bold) - self.assertEqual(openpyxl.style.Border.BORDER_THIN, - xlsx_style.borders.top.border_style) - self.assertEqual(openpyxl.style.Border.BORDER_THIN, - xlsx_style.borders.right.border_style) - self.assertEqual(openpyxl.style.Border.BORDER_THIN, - xlsx_style.borders.bottom.border_style) - self.assertEqual(openpyxl.style.Border.BORDER_THIN, - xlsx_style.borders.left.border_style) - self.assertEqual(openpyxl.style.Alignment.HORIZONTAL_CENTER, - xlsx_style.alignment.horizontal) - self.assertEqual(openpyxl.style.Alignment.VERTICAL_TOP, - xlsx_style.alignment.vertical) + assert xlsx_style.font.bold + assert (openpyxl.style.Border.BORDER_THIN == + xlsx_style.borders.top.border_style) + assert (openpyxl.style.Border.BORDER_THIN == + xlsx_style.borders.right.border_style) + assert (openpyxl.style.Border.BORDER_THIN == + xlsx_style.borders.bottom.border_style) + assert (openpyxl.style.Border.BORDER_THIN == + xlsx_style.borders.left.border_style) + assert (openpyxl.style.Alignment.HORIZONTAL_CENTER == + xlsx_style.alignment.horizontal) + assert (openpyxl.style.Alignment.VERTICAL_TOP == + xlsx_style.alignment.vertical) def skip_openpyxl_gt21(cls): - """Skip a TestCase instance if openpyxl >= 2.2""" + """Skip test case if openpyxl >= 2.2""" @classmethod - def setUpClass(cls): + def 
setup_class(cls): _skip_if_no_openpyxl() import openpyxl ver = openpyxl.__version__ if (not (LooseVersion(ver) >= LooseVersion('2.0.0') and LooseVersion(ver) < LooseVersion('2.2.0'))): - raise nose.SkipTest("openpyxl %s >= 2.2" % str(ver)) + pytest.skip("openpyxl %s >= 2.2" % str(ver)) - cls.setUpClass = setUpClass + cls.setup_class = setup_class return cls @raise_on_incompat_version(2) @skip_openpyxl_gt21 -class Openpyxl20Tests(ExcelWriterBase, tm.TestCase): +class Openpyxl20Tests(ExcelWriterBase): ext = '.xlsx' engine_name = 'openpyxl20' check_skip = staticmethod(lambda *args, **kwargs: None) @@ -1932,15 +1998,15 @@ def test_to_excel_styleconverter(self): protection = styles.Protection(locked=True, hidden=False) kw = _Openpyxl20Writer._convert_to_style_kwargs(hstyle) - self.assertEqual(kw['font'], font) - self.assertEqual(kw['border'], border) - self.assertEqual(kw['alignment'], alignment) - self.assertEqual(kw['fill'], fill) - self.assertEqual(kw['number_format'], number_format) - self.assertEqual(kw['protection'], protection) + assert kw['font'] == font + assert kw['border'] == border + assert kw['alignment'] == alignment + assert kw['fill'] == fill + assert kw['number_format'] == number_format + assert kw['protection'] == protection def test_write_cells_merge_styled(self): - from pandas.formats.format import ExcelCell + from pandas.io.formats.excel import ExcelCell from openpyxl import styles sheet_name = 'merge_styled' @@ -1967,30 +2033,30 @@ def test_write_cells_merge_styled(self): writer.write_cells(merge_cells, sheet_name=sheet_name) wks = writer.sheets[sheet_name] - xcell_b1 = wks.cell('B1') - xcell_a2 = wks.cell('A2') - self.assertEqual(xcell_b1.style, openpyxl_sty_merged) - self.assertEqual(xcell_a2.style, openpyxl_sty_merged) + xcell_b1 = wks['B1'] + xcell_a2 = wks['A2'] + assert xcell_b1.style == openpyxl_sty_merged + assert xcell_a2.style == openpyxl_sty_merged def skip_openpyxl_lt22(cls): - """Skip a TestCase instance if openpyxl < 2.2""" + """Skip test case if openpyxl < 2.2""" @classmethod - def setUpClass(cls): + def setup_class(cls): _skip_if_no_openpyxl() import openpyxl ver = openpyxl.__version__ if LooseVersion(ver) < LooseVersion('2.2.0'): - raise nose.SkipTest("openpyxl %s < 2.2" % str(ver)) + pytest.skip("openpyxl %s < 2.2" % str(ver)) - cls.setUpClass = setUpClass + cls.setup_class = setup_class return cls @raise_on_incompat_version(2) @skip_openpyxl_lt22 -class Openpyxl22Tests(ExcelWriterBase, tm.TestCase): +class Openpyxl22Tests(ExcelWriterBase): ext = '.xlsx' engine_name = 'openpyxl22' check_skip = staticmethod(lambda *args, **kwargs: None) @@ -2042,18 +2108,18 @@ def test_to_excel_styleconverter(self): protection = styles.Protection(locked=True, hidden=False) kw = _Openpyxl22Writer._convert_to_style_kwargs(hstyle) - self.assertEqual(kw['font'], font) - self.assertEqual(kw['border'], border) - self.assertEqual(kw['alignment'], alignment) - self.assertEqual(kw['fill'], fill) - self.assertEqual(kw['number_format'], number_format) - self.assertEqual(kw['protection'], protection) + assert kw['font'] == font + assert kw['border'] == border + assert kw['alignment'] == alignment + assert kw['fill'] == fill + assert kw['number_format'] == number_format + assert kw['protection'] == protection def test_write_cells_merge_styled(self): if not openpyxl_compat.is_compat(major_ver=2): - raise nose.SkipTest('incompatiable openpyxl version') + pytest.skip('incompatible openpyxl version') - from pandas.formats.format import ExcelCell + from pandas.io.formats.excel import 
ExcelCell sheet_name = 'merge_styled' @@ -2079,13 +2145,13 @@ def test_write_cells_merge_styled(self): writer.write_cells(merge_cells, sheet_name=sheet_name) wks = writer.sheets[sheet_name] - xcell_b1 = wks.cell('B1') - xcell_a2 = wks.cell('A2') - self.assertEqual(xcell_b1.font, openpyxl_sty_merged) - self.assertEqual(xcell_a2.font, openpyxl_sty_merged) + xcell_b1 = wks['B1'] + xcell_a2 = wks['A2'] + assert xcell_b1.font == openpyxl_sty_merged + assert xcell_a2.font == openpyxl_sty_merged -class XlwtTests(ExcelWriterBase, tm.TestCase): +class XlwtTests(ExcelWriterBase): ext = '.xls' engine_name = 'xlwt' check_skip = staticmethod(_skip_if_no_xlwt) @@ -2097,7 +2163,7 @@ def test_excel_raise_error_on_multiindex_columns_and_no_index(self): ('2014', 'height'), ('2014', 'weight')]) df = DataFrame(np.random.randn(10, 3), columns=cols) - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): with ensure_clean(self.ext) as path: df.to_excel(path, index=False) @@ -2133,16 +2199,16 @@ def test_to_excel_styleconverter(self): "alignment": {"horizontal": "center", "vertical": "top"}} xls_style = _XlwtWriter._convert_to_style(hstyle) - self.assertTrue(xls_style.font.bold) - self.assertEqual(xlwt.Borders.THIN, xls_style.borders.top) - self.assertEqual(xlwt.Borders.THIN, xls_style.borders.right) - self.assertEqual(xlwt.Borders.THIN, xls_style.borders.bottom) - self.assertEqual(xlwt.Borders.THIN, xls_style.borders.left) - self.assertEqual(xlwt.Alignment.HORZ_CENTER, xls_style.alignment.horz) - self.assertEqual(xlwt.Alignment.VERT_TOP, xls_style.alignment.vert) + assert xls_style.font.bold + assert xlwt.Borders.THIN == xls_style.borders.top + assert xlwt.Borders.THIN == xls_style.borders.right + assert xlwt.Borders.THIN == xls_style.borders.bottom + assert xlwt.Borders.THIN == xls_style.borders.left + assert xlwt.Alignment.HORZ_CENTER == xls_style.alignment.horz + assert xlwt.Alignment.VERT_TOP == xls_style.alignment.vert -class XlsxWriterTests(ExcelWriterBase, tm.TestCase): +class XlsxWriterTests(ExcelWriterBase): ext = '.xlsx' engine_name = 'xlsxwriter' check_skip = staticmethod(_skip_if_no_xlsxwriter) @@ -2174,21 +2240,28 @@ def test_column_format(self): writer.save() read_workbook = openpyxl.load_workbook(path) - read_worksheet = read_workbook.get_sheet_by_name(name='Sheet1') + try: + read_worksheet = read_workbook['Sheet1'] + except TypeError: + # compat + read_worksheet = read_workbook.get_sheet_by_name(name='Sheet1') - # Get the number format from the cell. This method is backward - # compatible with older versions of openpyxl. - cell = read_worksheet.cell('B2') + # Get the number format from the cell. 
+ try: + cell = read_worksheet['B2'] + except TypeError: + # compat + cell = read_worksheet.cell('B2') try: read_num_format = cell.number_format except: read_num_format = cell.style.number_format._format_code - self.assertEqual(read_num_format, num_format) + assert read_num_format == num_format -class OpenpyxlTests_NoMerge(ExcelWriterBase, tm.TestCase): +class OpenpyxlTests_NoMerge(ExcelWriterBase): ext = '.xlsx' engine_name = 'openpyxl' check_skip = staticmethod(_skip_if_no_openpyxl) @@ -2197,7 +2270,7 @@ class OpenpyxlTests_NoMerge(ExcelWriterBase, tm.TestCase): merge_cells = False -class XlwtTests_NoMerge(ExcelWriterBase, tm.TestCase): +class XlwtTests_NoMerge(ExcelWriterBase): ext = '.xls' engine_name = 'xlwt' check_skip = staticmethod(_skip_if_no_xlwt) @@ -2206,7 +2279,7 @@ class XlwtTests_NoMerge(ExcelWriterBase, tm.TestCase): merge_cells = False -class XlsxWriterTests_NoMerge(ExcelWriterBase, tm.TestCase): +class XlsxWriterTests_NoMerge(ExcelWriterBase): ext = '.xlsx' engine_name = 'xlsxwriter' check_skip = staticmethod(_skip_if_no_xlsxwriter) @@ -2215,10 +2288,10 @@ class XlsxWriterTests_NoMerge(ExcelWriterBase, tm.TestCase): merge_cells = False -class ExcelWriterEngineTests(tm.TestCase): +class ExcelWriterEngineTests(object): def test_ExcelWriter_dispatch(self): - with tm.assertRaisesRegexp(ValueError, 'No engine'): + with tm.assert_raises_regex(ValueError, 'No engine'): ExcelWriter('nothing') try: @@ -2227,17 +2300,17 @@ def test_ExcelWriter_dispatch(self): except ImportError: _skip_if_no_openpyxl() if not openpyxl_compat.is_compat(major_ver=1): - raise nose.SkipTest('incompatible openpyxl version') + pytest.skip('incompatible openpyxl version') writer_klass = _Openpyxl1Writer with ensure_clean('.xlsx') as path: writer = ExcelWriter(path) - tm.assertIsInstance(writer, writer_klass) + assert isinstance(writer, writer_klass) _skip_if_no_xlwt() with ensure_clean('.xls') as path: writer = ExcelWriter(path) - tm.assertIsInstance(writer, _XlwtWriter) + assert isinstance(writer, _XlwtWriter) def test_register_writer(self): # some awkward mocking to test out dispatch and such actually works @@ -2258,24 +2331,161 @@ def write_cells(self, *args, **kwargs): def check_called(func): func() - self.assertTrue(len(called_save) >= 1) - self.assertTrue(len(called_write_cells) >= 1) + assert len(called_save) >= 1 + assert len(called_write_cells) >= 1 del called_save[:] del called_write_cells[:] with pd.option_context('io.excel.xlsx.writer', 'dummy'): register_writer(DummyClass) writer = ExcelWriter('something.test') - tm.assertIsInstance(writer, DummyClass) + assert isinstance(writer, DummyClass) df = tm.makeCustomDataframe(1, 1) - panel = tm.makePanel() - func = lambda: df.to_excel('something.test') - check_called(func) - check_called(lambda: panel.to_excel('something.test')) - check_called(lambda: df.to_excel('something.xlsx')) - check_called(lambda: df.to_excel('something.xls', engine='dummy')) + with catch_warnings(record=True): + panel = tm.makePanel() + func = lambda: df.to_excel('something.test') + check_called(func) + check_called(lambda: panel.to_excel('something.test')) + check_called(lambda: df.to_excel('something.xlsx')) + check_called( + lambda: df.to_excel( + 'something.xls', engine='dummy')) + + +@pytest.mark.parametrize('engine', [ + pytest.mark.xfail('xlwt', reason='xlwt does not support ' + 'openpyxl-compatible style dicts'), + 'xlsxwriter', + 'openpyxl', +]) +def test_styler_to_excel(engine): + def style(df): + # XXX: RGB colors not supported in xlwt + return 
DataFrame([['font-weight: bold', '', ''], + ['', 'color: blue', ''], + ['', '', 'text-decoration: underline'], + ['border-style: solid', '', ''], + ['', 'font-style: italic', ''], + ['', '', 'text-align: right'], + ['background-color: red', '', ''], + ['', '', ''], + ['', '', ''], + ['', '', '']], + index=df.index, columns=df.columns) + + def assert_equal_style(cell1, cell2): + # XXX: should find a better way to check equality + assert cell1.alignment.__dict__ == cell2.alignment.__dict__ + assert cell1.border.__dict__ == cell2.border.__dict__ + assert cell1.fill.__dict__ == cell2.fill.__dict__ + assert cell1.font.__dict__ == cell2.font.__dict__ + assert cell1.number_format == cell2.number_format + assert cell1.protection.__dict__ == cell2.protection.__dict__ + + def custom_converter(css): + # use bold iff there is custom style attached to the cell + if css.strip(' \n;'): + return {'font': {'bold': True}} + return {} + + pytest.importorskip('jinja2') + pytest.importorskip(engine) + + if engine == 'openpyxl' and openpyxl_compat.is_compat(major_ver=1): + pytest.xfail('openpyxl1 does not support some openpyxl2-compatible ' + 'style dicts') + + # Prepare spreadsheets + + df = DataFrame(np.random.randn(10, 3)) + with ensure_clean('.xlsx' if engine != 'xlwt' else '.xls') as path: + writer = ExcelWriter(path, engine=engine) + df.to_excel(writer, sheet_name='frame') + df.style.to_excel(writer, sheet_name='unstyled') + styled = df.style.apply(style, axis=None) + styled.to_excel(writer, sheet_name='styled') + ExcelFormatter(styled, style_converter=custom_converter).write( + writer, sheet_name='custom') + + # For engines other than openpyxl 2, we only smoke test + if engine != 'openpyxl': + return + if not openpyxl_compat.is_compat(major_ver=2): + pytest.skip('incompatible openpyxl version') + + # (1) compare DataFrame.to_excel and Styler.to_excel when unstyled + n_cells = 0 + for col1, col2 in zip(writer.sheets['frame'].columns, + writer.sheets['unstyled'].columns): + assert len(col1) == len(col2) + for cell1, cell2 in zip(col1, col2): + assert cell1.value == cell2.value + assert_equal_style(cell1, cell2) + n_cells += 1 + + # ensure iteration actually happened: + assert n_cells == (10 + 1) * (3 + 1) + + # (2) check styling with default converter + n_cells = 0 + for col1, col2 in zip(writer.sheets['frame'].columns, + writer.sheets['styled'].columns): + assert len(col1) == len(col2) + for cell1, cell2 in zip(col1, col2): + ref = '%s%d' % (cell2.column, cell2.row) + # XXX: this isn't as strong a test as ideal; we should + # differences are exclusive + if ref == 'B2': + assert not cell1.font.bold + assert cell2.font.bold + elif ref == 'C3': + assert cell1.font.color.rgb != cell2.font.color.rgb + assert cell2.font.color.rgb == '000000FF' + elif ref == 'D4': + assert cell1.font.underline != cell2.font.underline + assert cell2.font.underline == 'single' + elif ref == 'B5': + assert not cell1.border.left.style + assert (cell2.border.top.style == + cell2.border.right.style == + cell2.border.bottom.style == + cell2.border.left.style == + 'medium') + elif ref == 'C6': + assert not cell1.font.italic + assert cell2.font.italic + elif ref == 'D7': + assert (cell1.alignment.horizontal != + cell2.alignment.horizontal) + assert cell2.alignment.horizontal == 'right' + elif ref == 'B8': + assert cell1.fill.fgColor.rgb != cell2.fill.fgColor.rgb + assert cell1.fill.patternType != cell2.fill.patternType + assert cell2.fill.fgColor.rgb == '00FF0000' + assert cell2.fill.patternType == 'solid' + else: + 
assert_equal_style(cell1, cell2) + + assert cell1.value == cell2.value + n_cells += 1 + + assert n_cells == (10 + 1) * (3 + 1) + + # (3) check styling with custom converter + n_cells = 0 + for col1, col2 in zip(writer.sheets['frame'].columns, + writer.sheets['custom'].columns): + assert len(col1) == len(col2) + for cell1, cell2 in zip(col1, col2): + ref = '%s%d' % (cell2.column, cell2.row) + if ref in ('B2', 'C3', 'D4', 'B5', 'C6', 'D7', 'B8'): + assert not cell1.font.bold + assert cell2.font.bold + else: + assert_equal_style(cell1, cell2) + + assert cell1.value == cell2.value + n_cells += 1 -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + assert n_cells == (10 + 1) * (3 + 1) diff --git a/pandas/tests/io/test_feather.py b/pandas/tests/io/test_feather.py new file mode 100644 index 0000000000000..232bb126d9d67 --- /dev/null +++ b/pandas/tests/io/test_feather.py @@ -0,0 +1,116 @@ +""" test feather-format compat """ + +import pytest +feather = pytest.importorskip('feather') + +import numpy as np +import pandas as pd +from pandas.io.feather_format import to_feather, read_feather + +from feather import FeatherError +from pandas.util.testing import assert_frame_equal, ensure_clean + + +@pytest.mark.single +class TestFeather(object): + + def check_error_on_write(self, df, exc): + # check that we are raising the exception + # on writing + + with pytest.raises(exc): + with ensure_clean() as path: + to_feather(df, path) + + def check_round_trip(self, df): + + with ensure_clean() as path: + to_feather(df, path) + result = read_feather(path) + assert_frame_equal(result, df) + + def test_error(self): + + for obj in [pd.Series([1, 2, 3]), 1, 'foo', pd.Timestamp('20130101'), + np.array([1, 2, 3])]: + self.check_error_on_write(obj, ValueError) + + def test_basic(self): + + df = pd.DataFrame({'string': list('abc'), + 'int': list(range(1, 4)), + 'uint': np.arange(3, 6).astype('u1'), + 'float': np.arange(4.0, 7.0, dtype='float64'), + 'float_with_null': [1., np.nan, 3], + 'bool': [True, False, True], + 'bool_with_null': [True, np.nan, False], + 'cat': pd.Categorical(list('abc')), + 'dt': pd.date_range('20130101', periods=3), + 'dttz': pd.date_range('20130101', periods=3, + tz='US/Eastern'), + 'dt_with_null': [pd.Timestamp('20130101'), pd.NaT, + pd.Timestamp('20130103')], + 'dtns': pd.date_range('20130101', periods=3, + freq='ns')}) + + assert df.dttz.dtype.tz.zone == 'US/Eastern' + self.check_round_trip(df) + + def test_strided_data_issues(self): + + # strided data issuehttps://github.com/wesm/feather/issues/97 + df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('abc')) + self.check_error_on_write(df, FeatherError) + + def test_duplicate_columns(self): + + # https://github.com/wesm/feather/issues/53 + # not currently able to handle duplicate columns + df = pd.DataFrame(np.arange(12).reshape(4, 3), + columns=list('aaa')).copy() + self.check_error_on_write(df, ValueError) + + def test_stringify_columns(self): + + df = pd.DataFrame(np.arange(12).reshape(4, 3)).copy() + self.check_error_on_write(df, ValueError) + + def test_unsupported(self): + + # period + df = pd.DataFrame({'a': pd.period_range('2013', freq='M', periods=3)}) + self.check_error_on_write(df, ValueError) + + df = pd.DataFrame({'a': pd.timedelta_range('1 day', periods=3)}) + self.check_error_on_write(df, FeatherError) + + # non-strings + df = pd.DataFrame({'a': ['a', 1, 2.0]}) + self.check_error_on_write(df, ValueError) + + def test_write_with_index(self): 
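+        # a DataFrame with the default RangeIndex round-trips; any custom
+        # index, index metadata, or MultiIndex columns is expected to be
+        # rejected with ValueError below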
+
+        df = pd.DataFrame({'A': [1, 2, 3]})
+        self.check_round_trip(df)
+
+        # non-default index
+        for index in [[2, 3, 4],
+                      pd.date_range('20130101', periods=3),
+                      list('abc'),
+                      [1, 3, 4],
+                      pd.MultiIndex.from_tuples([('a', 1), ('a', 2),
+                                                 ('b', 1)]),
+                      ]:
+
+            df.index = index
+            self.check_error_on_write(df, ValueError)
+
+        # index with meta-data
+        df.index = [0, 1, 2]
+        df.index.name = 'foo'
+        self.check_error_on_write(df, ValueError)
+
+        # column multi-index
+        df.index = [0, 1, 2]
+        df.columns = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])
+        self.check_error_on_write(df, ValueError)
diff --git a/pandas/tests/io/test_gbq.py b/pandas/tests/io/test_gbq.py
new file mode 100644
index 0000000000000..58a84ad4d47f8
--- /dev/null
+++ b/pandas/tests/io/test_gbq.py
@@ -0,0 +1,135 @@
+import pytest
+from datetime import datetime
+import pytz
+import platform
+from time import sleep
+import os
+
+import numpy as np
+import pandas as pd
+from pandas import compat, DataFrame
+
+from pandas.compat import range
+
+pandas_gbq = pytest.importorskip('pandas_gbq')
+
+PROJECT_ID = None
+PRIVATE_KEY_JSON_PATH = None
+PRIVATE_KEY_JSON_CONTENTS = None
+
+if compat.PY3:
+    DATASET_ID = 'pydata_pandas_bq_testing_py3'
+else:
+    DATASET_ID = 'pydata_pandas_bq_testing_py2'
+
+TABLE_ID = 'new_test'
+DESTINATION_TABLE = "{0}.{1}".format(DATASET_ID + "1", TABLE_ID)
+
+VERSION = platform.python_version()
+
+
+def _skip_if_no_project_id():
+    if not _get_project_id():
+        pytest.skip(
+            "Cannot run integration tests without a project id")
+
+
+def _skip_if_no_private_key_path():
+    if not _get_private_key_path():
+        pytest.skip("Cannot run integration tests without a "
+                    "private key json file path")
+
+
+def _in_travis_environment():
+    return 'TRAVIS_BUILD_DIR' in os.environ and \
+           'GBQ_PROJECT_ID' in os.environ
+
+
+def _get_project_id():
+    if _in_travis_environment():
+        return os.environ.get('GBQ_PROJECT_ID')
+    else:
+        return PROJECT_ID
+
+
+def _get_private_key_path():
+    if _in_travis_environment():
+        return os.path.join(*[os.environ.get('TRAVIS_BUILD_DIR'), 'ci',
+                              'travis_gbq.json'])
+    else:
+        return PRIVATE_KEY_JSON_PATH
+
+
+def clean_gbq_environment(private_key=None):
+    dataset = pandas_gbq.gbq._Dataset(_get_project_id(),
+                                      private_key=private_key)
+
+    for i in range(1, 10):
+        if DATASET_ID + str(i) in dataset.datasets():
+            dataset_id = DATASET_ID + str(i)
+            table = pandas_gbq.gbq._Table(_get_project_id(), dataset_id,
+                                          private_key=private_key)
+            for j in range(1, 20):
+                if TABLE_ID + str(j) in dataset.tables(dataset_id):
+                    table.delete(TABLE_ID + str(j))
+
+            dataset.delete(dataset_id)
+
+
+def make_mixed_dataframe_v2(test_size):
+    # create df to test for all BQ datatypes except RECORD
+    bools = np.random.randint(2, size=(1, test_size)).astype(bool)
+    flts = np.random.randn(1, test_size)
+    ints = np.random.randint(1, 10, size=(1, test_size))
+    strs = np.random.randint(1, 10, size=(1, test_size)).astype(str)
+    times = [datetime.now(pytz.timezone('US/Arizona'))
+             for t in range(test_size)]
+    return DataFrame({'bools': bools[0],
+                      'flts': flts[0],
+                      'ints': ints[0],
+                      'strs': strs[0],
+                      'times': times[0]},
+                     index=range(test_size))
+
+
+@pytest.mark.single
+class TestToGBQIntegrationWithServiceAccountKeyPath(object):
+
+    @classmethod
+    def setup_class(cls):
+        # - GLOBAL CLASS FIXTURES -
+        # put here any instruction you want to execute only *ONCE* *BEFORE*
+        # executing *ALL* tests described below.
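+        # (these integration tests are gated: they run only when a project
+        # id and a service-account key file are available, either via the
+        # module-level constants above or, on Travis, via the
+        # GBQ_PROJECT_ID and TRAVIS_BUILD_DIR environment variables.)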
+ + _skip_if_no_project_id() + _skip_if_no_private_key_path() + + clean_gbq_environment(_get_private_key_path()) + pandas_gbq.gbq._Dataset(_get_project_id(), + private_key=_get_private_key_path() + ).create(DATASET_ID + "1") + + @classmethod + def teardown_class(cls): + # - GLOBAL CLASS FIXTURES - + # put here any instruction you want to execute only *ONCE* *AFTER* + # executing all tests. + + clean_gbq_environment(_get_private_key_path()) + + def test_roundtrip(self): + destination_table = DESTINATION_TABLE + "1" + + test_size = 20001 + df = make_mixed_dataframe_v2(test_size) + + df.to_gbq(destination_table, _get_project_id(), chunksize=10000, + private_key=_get_private_key_path()) + + sleep(30) # <- Curses Google!!! + + result = pd.read_gbq("SELECT COUNT(*) AS num_rows FROM {0}" + .format(destination_table), + project_id=_get_project_id(), + private_key=_get_private_key_path()) + assert result['num_rows'][0] == test_size diff --git a/pandas/io/tests/test_html.py b/pandas/tests/io/test_html.py similarity index 89% rename from pandas/io/tests/test_html.py rename to pandas/tests/io/test_html.py index c202c60f5213d..6da77bf423609 100644 --- a/pandas/io/tests/test_html.py +++ b/pandas/tests/io/test_html.py @@ -12,7 +12,7 @@ from distutils.version import LooseVersion -import nose +import pytest import numpy as np from numpy.random import rand @@ -23,7 +23,7 @@ is_platform_windows) from pandas.io.common import URLError, urlopen, file_path_to_url from pandas.io.html import read_html -from pandas.parser import CParserError +from pandas._libs.parsers import ParserError import pandas.util.testing as tm from pandas.util.testing import makeCustomDataframe as mkdf, network @@ -39,7 +39,7 @@ def _have_module(module_name): def _skip_if_no(module_name): if not _have_module(module_name): - raise nose.SkipTest("{0!r} not found".format(module_name)) + pytest.skip("{0!r} not found".format(module_name)) def _skip_if_none_of(module_names): @@ -48,16 +48,16 @@ def _skip_if_none_of(module_names): if module_names == 'bs4': import bs4 if bs4.__version__ == LooseVersion('4.2.0'): - raise nose.SkipTest("Bad version of bs4: 4.2.0") + pytest.skip("Bad version of bs4: 4.2.0") else: not_found = [module_name for module_name in module_names if not _have_module(module_name)] if set(not_found) & set(module_names): - raise nose.SkipTest("{0!r} not found".format(not_found)) + pytest.skip("{0!r} not found".format(not_found)) if 'bs4' in module_names: import bs4 if bs4.__version__ == LooseVersion('4.2.0'): - raise nose.SkipTest("Bad version of bs4: 4.2.0") + pytest.skip("Bad version of bs4: 4.2.0") DATA_PATH = tm.get_data_path() @@ -93,14 +93,13 @@ def read_html(self, *args, **kwargs): return read_html(*args, **kwargs) -class TestReadHtml(tm.TestCase, ReadHtmlMixin): +class TestReadHtml(ReadHtmlMixin): flavor = 'bs4' spam_data = os.path.join(DATA_PATH, 'spam.html') banklist_data = os.path.join(DATA_PATH, 'banklist.html') @classmethod - def setUpClass(cls): - super(TestReadHtml, cls).setUpClass() + def setup_class(cls): _skip_if_none_of(('bs4', 'html5lib')) def test_to_html_compat(self): @@ -144,31 +143,31 @@ def test_spam_no_types(self): df2 = self.read_html(self.spam_data, 'Unit') assert_framelist_equal(df1, df2) - self.assertEqual(df1[0].ix[0, 0], 'Proximates') - self.assertEqual(df1[0].columns[0], 'Nutrient') + assert df1[0].iloc[0, 0] == 'Proximates' + assert df1[0].columns[0] == 'Nutrient' def test_spam_with_types(self): df1 = self.read_html(self.spam_data, '.*Water.*') df2 = self.read_html(self.spam_data, 'Unit') 
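        # the same tables should be found no matter which regex matched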
assert_framelist_equal(df1, df2) - self.assertEqual(df1[0].ix[0, 0], 'Proximates') - self.assertEqual(df1[0].columns[0], 'Nutrient') + assert df1[0].iloc[0, 0] == 'Proximates' + assert df1[0].columns[0] == 'Nutrient' def test_spam_no_match(self): dfs = self.read_html(self.spam_data) for df in dfs: - tm.assertIsInstance(df, DataFrame) + assert isinstance(df, DataFrame) def test_banklist_no_match(self): dfs = self.read_html(self.banklist_data, attrs={'id': 'table'}) for df in dfs: - tm.assertIsInstance(df, DataFrame) + assert isinstance(df, DataFrame) def test_spam_header(self): df = self.read_html(self.spam_data, '.*Water.*', header=1)[0] - self.assertEqual(df.columns[0], 'Proximates') - self.assertFalse(df.empty) + assert df.columns[0] == 'Proximates' + assert not df.empty def test_skiprows_int(self): df1 = self.read_html(self.spam_data, '.*Water.*', skiprows=1) @@ -219,8 +218,8 @@ def test_skiprows_ndarray(self): assert_framelist_equal(df1, df2) def test_skiprows_invalid(self): - with tm.assertRaisesRegexp(TypeError, - 'is not a valid type for skipping rows'): + with tm.assert_raises_regex(TypeError, 'is not a valid type ' + 'for skipping rows'): self.read_html(self.spam_data, '.*Water.*', skiprows='asdf') def test_index(self): @@ -278,31 +277,31 @@ def test_file_like(self): @network def test_bad_url_protocol(self): - with tm.assertRaises(URLError): + with pytest.raises(URLError): self.read_html('git://github.com', match='.*Water.*') @network def test_invalid_url(self): try: - with tm.assertRaises(URLError): + with pytest.raises(URLError): self.read_html('http://www.a23950sdfa908sd.com', match='.*Water.*') except ValueError as e: - self.assertEqual(str(e), 'No tables found') + assert str(e) == 'No tables found' @tm.slow def test_file_url(self): url = self.banklist_data dfs = self.read_html(file_path_to_url(url), 'First', attrs={'id': 'table'}) - tm.assertIsInstance(dfs, list) + assert isinstance(dfs, list) for df in dfs: - tm.assertIsInstance(df, DataFrame) + assert isinstance(df, DataFrame) @tm.slow def test_invalid_table_attrs(self): url = self.banklist_data - with tm.assertRaisesRegexp(ValueError, 'No tables found'): + with tm.assert_raises_regex(ValueError, 'No tables found'): self.read_html(url, 'First Federal Bank of Florida', attrs={'id': 'tasdfable'}) @@ -313,34 +312,34 @@ def _bank_data(self, *args, **kwargs): @tm.slow def test_multiindex_header(self): df = self._bank_data(header=[0, 1])[0] - tm.assertIsInstance(df.columns, MultiIndex) + assert isinstance(df.columns, MultiIndex) @tm.slow def test_multiindex_index(self): df = self._bank_data(index_col=[0, 1])[0] - tm.assertIsInstance(df.index, MultiIndex) + assert isinstance(df.index, MultiIndex) @tm.slow def test_multiindex_header_index(self): df = self._bank_data(header=[0, 1], index_col=[0, 1])[0] - tm.assertIsInstance(df.columns, MultiIndex) - tm.assertIsInstance(df.index, MultiIndex) + assert isinstance(df.columns, MultiIndex) + assert isinstance(df.index, MultiIndex) @tm.slow def test_multiindex_header_skiprows_tuples(self): df = self._bank_data(header=[0, 1], skiprows=1, tupleize_cols=True)[0] - tm.assertIsInstance(df.columns, Index) + assert isinstance(df.columns, Index) @tm.slow def test_multiindex_header_skiprows(self): df = self._bank_data(header=[0, 1], skiprows=1)[0] - tm.assertIsInstance(df.columns, MultiIndex) + assert isinstance(df.columns, MultiIndex) @tm.slow def test_multiindex_header_index_skiprows(self): df = self._bank_data(header=[0, 1], index_col=[0, 1], skiprows=1)[0] - 
tm.assertIsInstance(df.index, MultiIndex) - tm.assertIsInstance(df.columns, MultiIndex) + assert isinstance(df.index, MultiIndex) + assert isinstance(df.columns, MultiIndex) @tm.slow def test_regex_idempotency(self): @@ -348,27 +347,27 @@ def test_regex_idempotency(self): dfs = self.read_html(file_path_to_url(url), match=re.compile(re.compile('Florida')), attrs={'id': 'table'}) - tm.assertIsInstance(dfs, list) + assert isinstance(dfs, list) for df in dfs: - tm.assertIsInstance(df, DataFrame) + assert isinstance(df, DataFrame) def test_negative_skiprows(self): - with tm.assertRaisesRegexp(ValueError, - r'\(you passed a negative value\)'): + with tm.assert_raises_regex(ValueError, + r'\(you passed a negative value\)'): self.read_html(self.spam_data, 'Water', skiprows=-1) @network def test_multiple_matches(self): url = 'https://docs.python.org/2/' dfs = self.read_html(url, match='Python') - self.assertTrue(len(dfs) > 1) + assert len(dfs) > 1 @network def test_python_docs_table(self): url = 'https://docs.python.org/2/' dfs = self.read_html(url, match='Python') zz = [df.iloc[0, 0][0:4] for df in dfs] - self.assertEqual(sorted(zz), sorted(['Repo', 'What'])) + assert sorted(zz) == sorted(['Repo', 'What']) @tm.slow def test_thousands_macau_stats(self): @@ -378,7 +377,7 @@ def test_thousands_macau_stats(self): attrs={'class': 'style1'}) df = dfs[all_non_nan_table_index] - self.assertFalse(any(s.isnull().any() for _, s in df.iteritems())) + assert not any(s.isnull().any() for _, s in df.iteritems()) @tm.slow def test_thousands_macau_index_col(self): @@ -387,7 +386,7 @@ def test_thousands_macau_index_col(self): dfs = self.read_html(macau_data, index_col=0, header=0) df = dfs[all_non_nan_table_index] - self.assertFalse(any(s.isnull().any() for _, s in df.iteritems())) + assert not any(s.isnull().any() for _, s in df.iteritems()) def test_empty_tables(self): """ @@ -518,8 +517,8 @@ def test_nyse_wsj_commas_table(self): columns = Index(['Issue(Roll over for charts and headlines)', 'Volume', 'Price', 'Chg', '% Chg']) nrows = 100 - self.assertEqual(df.shape[0], nrows) - self.assert_index_equal(df.columns, columns) + assert df.shape[0] == nrows + tm.assert_index_equal(df.columns, columns) @tm.slow def test_banklist_header(self): @@ -536,7 +535,7 @@ def try_remove_ws(x): ground_truth = read_csv(os.path.join(DATA_PATH, 'banklist.csv'), converters={'Updated Date': Timestamp, 'Closing Date': Timestamp}) - self.assertEqual(df.shape, ground_truth.shape) + assert df.shape == ground_truth.shape old = ['First Vietnamese American BankIn Vietnamese', 'Westernbank Puerto RicoEn Espanol', 'R-G Premier Bank of Puerto RicoEn Espanol', @@ -566,10 +565,10 @@ def test_gold_canyon(self): with open(self.banklist_data, 'r') as f: raw_text = f.read() - self.assertIn(gc, raw_text) + assert gc in raw_text df = self.read_html(self.banklist_data, 'Gold Canyon', attrs={'id': 'table'})[0] - self.assertIn(gc, df.to_string()) + assert gc in df.to_string() def test_different_number_of_rows(self): expected = """ @@ -652,9 +651,10 @@ def test_parse_dates_combine(self): def test_computer_sales_page(self): data = os.path.join(DATA_PATH, 'computer_sales_page.html') - with tm.assertRaisesRegexp(CParserError, r"Passed header=\[0,1\] are " - "too many rows for this multi_index " - "of columns"): + with tm.assert_raises_regex(ParserError, + r"Passed header=\[0,1\] are " + r"too many rows for this " + r"multi_index of columns"): self.read_html(data, header=[0, 1]) def test_wikipedia_states_table(self): @@ -662,7 +662,7 @@ def 
test_wikipedia_states_table(self): assert os.path.isfile(data), '%r is not a file' % data assert os.path.getsize(data), '%r is an empty file' % data result = self.read_html(data, 'Arizona', header=1)[0] - self.assertEqual(result['sq mi'].dtype, np.dtype('float64')) + assert result['sq mi'].dtype == np.dtype('float64') def test_decimal_rows(self): @@ -685,13 +685,13 @@ def test_decimal_rows(self): ''') expected = DataFrame(data={'Header': 1100.101}, index=[0]) result = self.read_html(data, decimal='#')[0] - nose.tools.assert_equal(result['Header'].dtype, np.dtype('float64')) + assert result['Header'].dtype == np.dtype('float64') tm.assert_frame_equal(result, expected) def test_bool_header_arg(self): # GH 6114 for arg in [True, False]: - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): read_html(self.spam_data, header=arg) def test_converters(self): @@ -760,18 +760,29 @@ def test_keep_default_na(self): html_df = read_html(html_data, keep_default_na=True)[0] tm.assert_frame_equal(expected_df, html_df) + def test_multiple_header_rows(self): + # Issue #13434 + expected_df = DataFrame(data=[("Hillary", 68, "D"), + ("Bernie", 74, "D"), + ("Donald", 69, "R")]) + expected_df.columns = [["Unnamed: 0_level_0", "Age", "Party"], + ["Name", "Unnamed: 1_level_1", + "Unnamed: 2_level_1"]] + html = expected_df.to_html(index=False) + html_df = read_html(html, )[0] + tm.assert_frame_equal(expected_df, html_df) + def _lang_enc(filename): return os.path.splitext(os.path.basename(filename))[0].split('_') -class TestReadHtmlEncoding(tm.TestCase): +class TestReadHtmlEncoding(object): files = glob.glob(os.path.join(DATA_PATH, 'html_encoding', '*.html')) flavor = 'bs4' @classmethod - def setUpClass(cls): - super(TestReadHtmlEncoding, cls).setUpClass() + def setup_class(cls): _skip_if_none_of((cls.flavor, 'html5lib')) def read_html(self, *args, **kwargs): @@ -812,17 +823,16 @@ class TestReadHtmlEncodingLxml(TestReadHtmlEncoding): flavor = 'lxml' @classmethod - def setUpClass(cls): - super(TestReadHtmlEncodingLxml, cls).setUpClass() + def setup_class(cls): + super(TestReadHtmlEncodingLxml, cls).setup_class() _skip_if_no(cls.flavor) -class TestReadHtmlLxml(tm.TestCase, ReadHtmlMixin): +class TestReadHtmlLxml(ReadHtmlMixin): flavor = 'lxml' @classmethod - def setUpClass(cls): - super(TestReadHtmlLxml, cls).setUpClass() + def setup_class(cls): _skip_if_no('lxml') def test_data_fail(self): @@ -830,17 +840,17 @@ def test_data_fail(self): spam_data = os.path.join(DATA_PATH, 'spam.html') banklist_data = os.path.join(DATA_PATH, 'banklist.html') - with tm.assertRaises(XMLSyntaxError): + with pytest.raises(XMLSyntaxError): self.read_html(spam_data) - with tm.assertRaises(XMLSyntaxError): + with pytest.raises(XMLSyntaxError): self.read_html(banklist_data) def test_works_on_valid_markup(self): filename = os.path.join(DATA_PATH, 'valid_markup.html') dfs = self.read_html(filename, index_col=0) - tm.assertIsInstance(dfs, list) - tm.assertIsInstance(dfs[0], DataFrame) + assert isinstance(dfs, list) + assert isinstance(dfs[0], DataFrame) @tm.slow def test_fallback_success(self): @@ -872,7 +882,7 @@ def test_computer_sales_page(self): def test_invalid_flavor(): url = 'google.com' - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): read_html(url, 'google', flavor='not a* valid**++ flaver') @@ -918,7 +928,3 @@ def test_same_ordering(): dfs_lxml = read_html(filename, index_col=0, flavor=['lxml']) dfs_bs4 = read_html(filename, index_col=0, flavor=['bs4']) assert_framelist_equal(dfs_lxml, dfs_bs4) - -if 
__name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/io/tests/test_packers.py b/pandas/tests/io/test_packers.py similarity index 80% rename from pandas/io/tests/test_packers.py rename to pandas/tests/io/test_packers.py index 91042775ba19d..4b1145129c364 100644 --- a/pandas/io/tests/test_packers.py +++ b/pandas/tests/io/test_packers.py @@ -1,5 +1,6 @@ -import nose +import pytest +from warnings import catch_warnings import os import datetime import numpy as np @@ -10,7 +11,7 @@ from pandas.compat import u, PY3 from pandas import (Series, DataFrame, Panel, MultiIndex, bdate_range, date_range, period_range, Index, Categorical) -from pandas.core.common import PerformanceWarning +from pandas.errors import PerformanceWarning from pandas.io.packers import to_msgpack, read_msgpack import pandas.util.testing as tm from pandas.util.testing import (ensure_clean, @@ -22,7 +23,8 @@ from pandas.tests.test_panel import assert_panel_equal import pandas -from pandas import Timestamp, NaT, tslib +from pandas import Timestamp, NaT +from pandas._libs.tslib import iNaT nan = np.nan @@ -40,7 +42,21 @@ else: _ZLIB_INSTALLED = True -_multiprocess_can_split_ = False + +@pytest.fixture(scope='module') +def current_packers_data(): + # our current version packers data + from pandas.tests.io.generate_legacy_storage_files import ( + create_msgpack_data) + return create_msgpack_data() + + +@pytest.fixture(scope='module') +def all_packers_data(): + # our all of our current version packers data + from pandas.tests.io.generate_legacy_storage_files import ( + create_data) + return create_data() def check_arbitrary(a, b): @@ -74,12 +90,12 @@ def check_arbitrary(a, b): assert(a == b) -class TestPackers(tm.TestCase): +class TestPackers(object): - def setUp(self): + def setup_method(self, method): self.path = '__%s__.msg' % tm.rands(10) - def tearDown(self): + def teardown_method(self, method): pass def encode_decode(self, x, compress=None, **kwargs): @@ -132,9 +148,9 @@ class A(object): def __init__(self): self.read = 0 - tm.assertRaises(ValueError, read_msgpack, path_or_buf=None) - tm.assertRaises(ValueError, read_msgpack, path_or_buf={}) - tm.assertRaises(ValueError, read_msgpack, path_or_buf=A()) + pytest.raises(ValueError, read_msgpack, path_or_buf=None) + pytest.raises(ValueError, read_msgpack, path_or_buf={}) + pytest.raises(ValueError, read_msgpack, path_or_buf=A()) class TestNumpy(TestPackers): @@ -147,7 +163,7 @@ def test_numpy_scalar_float(self): def test_numpy_scalar_complex(self): x = np.complex64(np.random.rand() + 1j * np.random.rand()) x_rec = self.encode_decode(x) - self.assertTrue(np.allclose(x, x_rec)) + assert np.allclose(x, x_rec) def test_scalar_float(self): x = np.random.rand() @@ -157,7 +173,7 @@ def test_scalar_float(self): def test_scalar_complex(self): x = np.random.rand() + 1j * np.random.rand() x_rec = self.encode_decode(x) - self.assertTrue(np.allclose(x, x_rec)) + assert np.allclose(x, x_rec) def test_list_numpy_float(self): x = [np.float32(np.random.rand()) for i in range(5)] @@ -170,13 +186,13 @@ def test_list_numpy_float(self): def test_list_numpy_float_complex(self): if not hasattr(np, 'complex128'): - raise nose.SkipTest('numpy cant handle complex128') + pytest.skip('numpy cant handle complex128') x = [np.float32(np.random.rand()) for i in range(5)] + \ [np.complex128(np.random.rand() + 1j * np.random.rand()) for i in range(5)] x_rec = self.encode_decode(x) - self.assertTrue(np.allclose(x, x_rec)) + assert 
np.allclose(x, x_rec) def test_list_float(self): x = [np.random.rand() for i in range(5)] @@ -191,7 +207,7 @@ def test_list_float_complex(self): x = [np.random.rand() for i in range(5)] + \ [(np.random.rand() + 1j * np.random.rand()) for i in range(5)] x_rec = self.encode_decode(x) - self.assertTrue(np.allclose(x, x_rec)) + assert np.allclose(x, x_rec) def test_dict_float(self): x = {'foo': 1.0, 'bar': 2.0} @@ -201,9 +217,10 @@ def test_dict_float(self): def test_dict_complex(self): x = {'foo': 1.0 + 1.0j, 'bar': 2.0 + 2.0j} x_rec = self.encode_decode(x) - self.assertEqual(x, x_rec) + tm.assert_dict_equal(x, x_rec) + for key in x: - self.assertEqual(type(x[key]), type(x_rec[key])) + tm.assert_class_equal(x[key], x_rec[key], obj="complex value") def test_dict_numpy_float(self): x = {'foo': np.float32(1.0), 'bar': np.float32(2.0)} @@ -214,9 +231,10 @@ def test_dict_numpy_complex(self): x = {'foo': np.complex128(1.0 + 1.0j), 'bar': np.complex128(2.0 + 2.0j)} x_rec = self.encode_decode(x) - self.assertEqual(x, x_rec) + tm.assert_dict_equal(x, x_rec) + for key in x: - self.assertEqual(type(x[key]), type(x_rec[key])) + tm.assert_class_equal(x[key], x_rec[key], obj="numpy complex128") def test_numpy_array_float(self): @@ -231,8 +249,8 @@ def test_numpy_array_float(self): def test_numpy_array_complex(self): x = (np.random.rand(5) + 1j * np.random.rand(5)).astype(np.complex128) x_rec = self.encode_decode(x) - self.assertTrue(all(map(lambda x, y: x == y, x, x_rec)) and - x.dtype == x_rec.dtype) + assert (all(map(lambda x, y: x == y, x, x_rec)) and + x.dtype == x_rec.dtype) def test_list_mixed(self): x = [1.0, np.float32(3.5), np.complex128(4.25), u('foo')] @@ -252,25 +270,25 @@ def test_timestamp(self): '20130101'), Timestamp('20130101', tz='US/Eastern'), Timestamp('201301010501')]: i_rec = self.encode_decode(i) - self.assertEqual(i, i_rec) + assert i == i_rec def test_nat(self): nat_rec = self.encode_decode(NaT) - self.assertIs(NaT, nat_rec) + assert NaT is nat_rec def test_datetimes(self): # fails under 2.6/win32 (np.datetime64 seems broken) if LooseVersion(sys.version) < '2.7': - raise nose.SkipTest('2.6 with np.datetime64 is broken') + pytest.skip('2.6 with np.datetime64 is broken') for i in [datetime.datetime(2013, 1, 1), datetime.datetime(2013, 1, 1, 5, 1), datetime.date(2013, 1, 1), np.datetime64(datetime.datetime(2013, 1, 5, 2, 15))]: i_rec = self.encode_decode(i) - self.assertEqual(i, i_rec) + assert i == i_rec def test_timedeltas(self): @@ -278,13 +296,13 @@ def test_timedeltas(self): datetime.timedelta(days=1, seconds=10), np.timedelta64(1000000)]: i_rec = self.encode_decode(i) - self.assertEqual(i, i_rec) + assert i == i_rec class TestIndex(TestPackers): - def setUp(self): - super(TestIndex, self).setUp() + def setup_method(self, method): + super(TestIndex, self).setup_method(method) self.d = { 'string': tm.makeStringIndex(100), @@ -297,6 +315,7 @@ def setUp(self): 'period': Index(period_range('2012-1-1', freq='M', periods=3)), 'date2': Index(date_range('2013-01-1', periods=10)), 'bdate': Index(bdate_range('2013-01-02', periods=10)), + 'cat': tm.makeCategoricalIndex(100) } self.mi = { @@ -310,36 +329,43 @@ def test_basic_index(self): for s, i in self.d.items(): i_rec = self.encode_decode(i) - self.assert_index_equal(i, i_rec) + tm.assert_index_equal(i, i_rec) # datetime with no freq (GH5506) i = Index([Timestamp('20130101'), Timestamp('20130103')]) i_rec = self.encode_decode(i) - self.assert_index_equal(i, i_rec) + tm.assert_index_equal(i, i_rec) # datetime with timezone i = 
Index([Timestamp('20130101 9:00:00'), Timestamp( '20130103 11:00:00')]).tz_localize('US/Eastern') i_rec = self.encode_decode(i) - self.assert_index_equal(i, i_rec) + tm.assert_index_equal(i, i_rec) def test_multi_index(self): for s, i in self.mi.items(): i_rec = self.encode_decode(i) - self.assert_index_equal(i, i_rec) + tm.assert_index_equal(i, i_rec) def test_unicode(self): i = tm.makeUnicodeIndex(100) i_rec = self.encode_decode(i) - self.assert_index_equal(i, i_rec) + tm.assert_index_equal(i, i_rec) + + def categorical_index(self): + # GH15487 + df = DataFrame(np.random.randn(10, 2)) + df = df.astype({0: 'category'}).set_index(0) + result = self.encode_decode(df) + tm.assert_frame_equal(result, df) class TestSeries(TestPackers): - def setUp(self): - super(TestSeries, self).setUp() + def setup_method(self, method): + super(TestSeries, self).setup_method(method) self.d = {} @@ -351,7 +377,7 @@ def setUp(self): s.name = 'object' self.d['object'] = s - s = Series(tslib.iNaT, dtype='M8[ns]', index=range(5)) + s = Series(iNaT, dtype='M8[ns]', index=range(5)) self.d['date'] = s data = { @@ -363,6 +389,8 @@ def setUp(self): 'F': [Timestamp('20130102', tz='US/Eastern')] * 2 + [Timestamp('20130603', tz='CET')] * 3, 'G': [Timestamp('20130102', tz='US/Eastern')] * 5, + 'H': Categorical([1, 2, 3, 4, 5]), + 'I': Categorical([1, 2, 3, 4, 5], ordered=True), } self.d['float'] = Series(data['A']) @@ -370,6 +398,8 @@ def setUp(self): self.d['mixed'] = Series(data['E']) self.d['dt_tz_mixed'] = Series(data['F']) self.d['dt_tz'] = Series(data['G']) + self.d['cat_ordered'] = Series(data['H']) + self.d['cat_unordered'] = Series(data['I']) def test_basic(self): @@ -382,8 +412,8 @@ def test_basic(self): class TestCategorical(TestPackers): - def setUp(self): - super(TestCategorical, self).setUp() + def setup_method(self, method): + super(TestCategorical, self).setup_method(method) self.d = {} @@ -405,8 +435,8 @@ def test_basic(self): class TestNDFrame(TestPackers): - def setUp(self): - super(TestNDFrame, self).setUp() + def setup_method(self, method): + super(TestNDFrame, self).setup_method(method) data = { 'A': [0., 1., 2., 3., np.nan], @@ -425,9 +455,10 @@ def setUp(self): 'int': DataFrame(dict(A=data['B'], B=Series(data['B']) + 1)), 'mixed': DataFrame(data)} - self.panel = { - 'float': Panel(dict(ItemA=self.frame['float'], - ItemB=self.frame['float'] + 1))} + with catch_warnings(record=True): + self.panel = { + 'float': Panel(dict(ItemA=self.frame['float'], + ItemB=self.frame['float'] + 1))} def test_basic_frame(self): @@ -437,9 +468,10 @@ def test_basic_frame(self): def test_basic_panel(self): - for s, i in self.panel.items(): - i_rec = self.encode_decode(i) - assert_panel_equal(i, i_rec) + with catch_warnings(record=True): + for s, i in self.panel.items(): + i_rec = self.encode_decode(i) + assert_panel_equal(i, i_rec) def test_multi(self): @@ -456,7 +488,7 @@ def test_multi(self): l = [self.frame['float'], self.frame['float'] .A, self.frame['float'].B, None] l_rec = self.encode_decode(l) - self.assertIsInstance(l_rec, tuple) + assert isinstance(l_rec, tuple) check_arbitrary(l, l_rec) def test_iterator(self): @@ -506,7 +538,7 @@ def _check_roundtrip(self, obj, comparator, **kwargs): # currently these are not implemetned # i_rec = self.encode_decode(obj) # comparator(obj, i_rec, **kwargs) - self.assertRaises(NotImplementedError, self.encode_decode, obj) + pytest.raises(NotImplementedError, self.encode_decode, obj) def test_sparse_series(self): @@ -527,8 +559,8 @@ def test_sparse_series(self): def 
test_sparse_frame(self): s = tm.makeDataFrame() - s.ix[3:5, 1:3] = np.nan - s.ix[8:10, -2] = np.nan + s.loc[3:5, 1:3] = np.nan + s.loc[8:10, -2] = np.nan ss = s.to_sparse() self._check_roundtrip(ss, tm.assert_frame_equal, @@ -547,7 +579,7 @@ class TestCompression(TestPackers): """See https://github.com/pandas-dev/pandas/pull/9783 """ - def setUp(self): + def setup_method(self, method): try: from sqlalchemy import create_engine self._create_sql_engine = create_engine @@ -556,7 +588,7 @@ def setUp(self): else: self._SQLALCHEMY_INSTALLED = True - super(TestCompression, self).setUp() + super(TestCompression, self).setup_method(method) data = { 'A': np.arange(1000, dtype=np.float64), 'B': np.arange(1000, dtype=np.int32), @@ -583,16 +615,16 @@ def _test_compression(self, compress): assert_frame_equal(value, expected) # make sure that we can write to the new frames for block in value._data.blocks: - self.assertTrue(block.values.flags.writeable) + assert block.values.flags.writeable def test_compression_zlib(self): if not _ZLIB_INSTALLED: - raise nose.SkipTest('no zlib') + pytest.skip('no zlib') self._test_compression('zlib') def test_compression_blosc(self): if not _BLOSC_INSTALLED: - raise nose.SkipTest('no blosc') + pytest.skip('no blosc') self._test_compression('blosc') def _test_compression_warns_when_decompress_caches(self, compress): @@ -632,31 +664,29 @@ def decompress(ob): # make sure that we can write to the new frames even though # we needed to copy the data for block in value._data.blocks: - self.assertTrue(block.values.flags.writeable) + assert block.values.flags.writeable # mutate the data in some way block.values[0] += rhs[block.dtype] for w in ws: # check the messages from our warnings - self.assertEqual( - str(w.message), - 'copying data after decompressing; this may mean that' - ' decompress is caching its result', - ) + assert str(w.message) == ('copying data after decompressing; ' + 'this may mean that decompress is ' + 'caching its result') for buf, control_buf in zip(not_garbage, control): # make sure none of our mutations above affected the # original buffers - self.assertEqual(buf, control_buf) + assert buf == control_buf def test_compression_warns_when_decompress_caches_zlib(self): if not _ZLIB_INSTALLED: - raise nose.SkipTest('no zlib') + pytest.skip('no zlib') self._test_compression_warns_when_decompress_caches('zlib') def test_compression_warns_when_decompress_caches_blosc(self): if not _BLOSC_INSTALLED: - raise nose.SkipTest('no blosc') + pytest.skip('no blosc') self._test_compression_warns_when_decompress_caches('blosc') def _test_small_strings_no_warn(self, compress): @@ -665,14 +695,14 @@ def _test_small_strings_no_warn(self, compress): empty_unpacked = self.encode_decode(empty, compress=compress) tm.assert_numpy_array_equal(empty_unpacked, empty) - self.assertTrue(empty_unpacked.flags.writeable) + assert empty_unpacked.flags.writeable char = np.array([ord(b'a')], dtype='uint8') with tm.assert_produces_warning(None): char_unpacked = self.encode_decode(char, compress=compress) tm.assert_numpy_array_equal(char_unpacked, char) - self.assertTrue(char_unpacked.flags.writeable) + assert char_unpacked.flags.writeable # if this test fails I am sorry because the interpreter is now in a # bad state where b'a' points to 98 == ord(b'b'). 
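        # the write below goes through the unpacked copy; a correct copy
        # leaves the interned b'a' singleton untouched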
char_unpacked[0] = ord(b'b') @@ -680,7 +710,7 @@ def _test_small_strings_no_warn(self, compress): # we compare the ord of bytes b'a' with unicode u'a' because the should # always be the same (unless we were able to mutate the shared # character singleton in which case ord(b'a') == ord(b'b'). - self.assertEqual(ord(b'a'), ord(u'a')) + assert ord(b'a') == ord(u'a') tm.assert_numpy_array_equal( char_unpacked, np.array([ord(b'b')], dtype='uint8'), @@ -688,36 +718,36 @@ def _test_small_strings_no_warn(self, compress): def test_small_strings_no_warn_zlib(self): if not _ZLIB_INSTALLED: - raise nose.SkipTest('no zlib') + pytest.skip('no zlib') self._test_small_strings_no_warn('zlib') def test_small_strings_no_warn_blosc(self): if not _BLOSC_INSTALLED: - raise nose.SkipTest('no blosc') + pytest.skip('no blosc') self._test_small_strings_no_warn('blosc') def test_readonly_axis_blosc(self): # GH11880 if not _BLOSC_INSTALLED: - raise nose.SkipTest('no blosc') + pytest.skip('no blosc') df1 = DataFrame({'A': list('abcd')}) df2 = DataFrame(df1, index=[1., 2., 3., 4.]) - self.assertTrue(1 in self.encode_decode(df1['A'], compress='blosc')) - self.assertTrue(1. in self.encode_decode(df2['A'], compress='blosc')) + assert 1 in self.encode_decode(df1['A'], compress='blosc') + assert 1. in self.encode_decode(df2['A'], compress='blosc') def test_readonly_axis_zlib(self): # GH11880 df1 = DataFrame({'A': list('abcd')}) df2 = DataFrame(df1, index=[1., 2., 3., 4.]) - self.assertTrue(1 in self.encode_decode(df1['A'], compress='zlib')) - self.assertTrue(1. in self.encode_decode(df2['A'], compress='zlib')) + assert 1 in self.encode_decode(df1['A'], compress='zlib') + assert 1. in self.encode_decode(df2['A'], compress='zlib') def test_readonly_axis_blosc_to_sql(self): # GH11880 if not _BLOSC_INSTALLED: - raise nose.SkipTest('no blosc') + pytest.skip('no blosc') if not self._SQLALCHEMY_INSTALLED: - raise nose.SkipTest('no sqlalchemy') + pytest.skip('no sqlalchemy') expected = DataFrame({'A': list('abcd')}) df = self.encode_decode(expected, compress='blosc') eng = self._create_sql_engine("sqlite:///:memory:") @@ -729,9 +759,9 @@ def test_readonly_axis_blosc_to_sql(self): def test_readonly_axis_zlib_to_sql(self): # GH11880 if not _ZLIB_INSTALLED: - raise nose.SkipTest('no zlib') + pytest.skip('no zlib') if not self._SQLALCHEMY_INSTALLED: - raise nose.SkipTest('no sqlalchemy') + pytest.skip('no sqlalchemy') expected = DataFrame({'A': list('abcd')}) df = self.encode_decode(expected, compress='zlib') eng = self._create_sql_engine("sqlite:///:memory:") @@ -743,8 +773,8 @@ def test_readonly_axis_zlib_to_sql(self): class TestEncoding(TestPackers): - def setUp(self): - super(TestEncoding, self).setUp() + def setup_method(self, method): + super(TestEncoding, self).setup_method(method) data = { 'A': [compat.u('\u2019')] * 1000, 'B': np.arange(1000, dtype=np.int32), @@ -771,12 +801,21 @@ def test_default_encoding(self): for frame in compat.itervalues(self.frame): result = frame.to_msgpack() expected = frame.to_msgpack(encoding='utf8') - self.assertEqual(result, expected) + assert result == expected result = self.encode_decode(frame) assert_frame_equal(result, frame) -class TestMsgpack(): +def legacy_packers_versions(): + # yield the packers versions + path = tm.get_data_path('legacy_msgpack') + for v in os.listdir(path): + p = os.path.join(path, v) + if os.path.isdir(p): + yield v + + +class TestMsgpack(object): """ How to add msgpack tests: @@ -786,47 +825,38 @@ class TestMsgpack(): $ python generate_legacy_storage_files.py msgpack 
3. Move the created pickle to "data/legacy_msgpack/" directory. - - NOTE: TestMsgpack can't be a subclass of tm.Testcase to use test generator. - http://stackoverflow.com/questions/6689537/nose-test-generators-inside-class """ - def setUp(self): - from pandas.io.tests.generate_legacy_storage_files import ( - create_msgpack_data, create_data) - self.data = create_msgpack_data() - self.all_data = create_data() - self.path = u('__%s__.msgpack' % tm.rands(10)) - self.minimum_structure = {'series': ['float', 'int', 'mixed', - 'ts', 'mi', 'dup'], - 'frame': ['float', 'int', 'mixed', 'mi'], - 'panel': ['float'], - 'index': ['int', 'date', 'period'], - 'mi': ['reg2']} - - def check_min_structure(self, data): + minimum_structure = {'series': ['float', 'int', 'mixed', + 'ts', 'mi', 'dup'], + 'frame': ['float', 'int', 'mixed', 'mi'], + 'panel': ['float'], + 'index': ['int', 'date', 'period'], + 'mi': ['reg2']} + + def check_min_structure(self, data, version): for typ, v in self.minimum_structure.items(): assert typ in data, '"{0}" not found in unpacked data'.format(typ) for kind in v: msg = '"{0}" not found in data["{1}"]'.format(kind, typ) assert kind in data[typ], msg - def compare(self, vf, version): + def compare(self, current_data, all_data, vf, version): # GH12277 encoding default used to be latin-1, now utf-8 if LooseVersion(version) < '0.18.0': data = read_msgpack(vf, encoding='latin-1') else: data = read_msgpack(vf) - self.check_min_structure(data) + self.check_min_structure(data, version) for typ, dv in data.items(): - assert typ in self.all_data, ('unpacked data contains ' - 'extra key "{0}"' - .format(typ)) + assert typ in all_data, ('unpacked data contains ' + 'extra key "{0}"' + .format(typ)) for dt, result in dv.items(): - assert dt in self.all_data[typ], ('data["{0}"] contains extra ' - 'key "{1}"'.format(typ, dt)) + assert dt in current_data[typ], ('data["{0}"] contains extra ' + 'key "{1}"'.format(typ, dt)) try: - expected = self.data[typ][dt] + expected = current_data[typ][dt] except KeyError: continue @@ -859,30 +889,24 @@ def compare_frame_dt_mixed_tzs(self, result, expected, typ, version): else: tm.assert_frame_equal(result, expected) - def read_msgpacks(self, version): + @pytest.mark.parametrize('version', legacy_packers_versions()) + def test_msgpacks_legacy(self, current_packers_data, all_packers_data, + version): - pth = tm.get_data_path('legacy_msgpack/{0}'.format(str(version))) + pth = tm.get_data_path('legacy_msgpack/{0}'.format(version)) n = 0 for f in os.listdir(pth): # GH12142 0.17 files packed in P2 can't be read in P3 if (compat.PY3 and version.startswith('0.17.') and - f.split('.')[-4][-1] == '2'): + f.split('.')[-4][-1] == '2'): continue vf = os.path.join(pth, f) try: - self.compare(vf, version) + with catch_warnings(record=True): + self.compare(current_packers_data, all_packers_data, + vf, version) except ImportError: # blosc not installed continue n += 1 assert n > 0, 'Msgpack files are not tested' - - def test_msgpack(self): - msgpack_path = tm.get_data_path('legacy_msgpack') - n = 0 - for v in os.listdir(msgpack_path): - pth = os.path.join(msgpack_path, v) - if os.path.isdir(pth): - yield self.read_msgpacks, v - n += 1 - assert n > 0, 'Msgpack files are not tested' diff --git a/pandas/tests/io/test_pickle.py b/pandas/tests/io/test_pickle.py new file mode 100644 index 0000000000000..875b5bd3055b9 --- /dev/null +++ b/pandas/tests/io/test_pickle.py @@ -0,0 +1,491 @@ +# pylint: disable=E1101,E1103,W0232 + +""" +manage legacy pickle tests + +How to add pickle tests: 
+ +1. Install pandas version intended to output the pickle. + +2. Execute "generate_legacy_storage_files.py" to create the pickle. +$ python generate_legacy_storage_files.py pickle + +3. Move the created pickle to "data/legacy_pickle/" directory. +""" + +import pytest +from warnings import catch_warnings + +import os +from distutils.version import LooseVersion +import pandas as pd +from pandas import Index +from pandas.compat import is_platform_little_endian +import pandas +import pandas.util.testing as tm +from pandas.tseries.offsets import Day, MonthEnd +import shutil + + +@pytest.fixture(scope='module') +def current_pickle_data(): + # our current version pickle data + from pandas.tests.io.generate_legacy_storage_files import ( + create_pickle_data) + return create_pickle_data() + + +# --------------------- +# comparision functions +# --------------------- +def compare_element(result, expected, typ, version=None): + if isinstance(expected, Index): + tm.assert_index_equal(expected, result) + return + + if typ.startswith('sp_'): + comparator = getattr(tm, "assert_%s_equal" % typ) + comparator(result, expected, exact_indices=False) + elif typ == 'timestamp': + if expected is pd.NaT: + assert result is pd.NaT + else: + assert result == expected + assert result.freq == expected.freq + else: + comparator = getattr(tm, "assert_%s_equal" % + typ, tm.assert_almost_equal) + comparator(result, expected) + + +def compare(data, vf, version): + + # py3 compat when reading py2 pickle + try: + data = pandas.read_pickle(vf) + except (ValueError) as e: + if 'unsupported pickle protocol:' in str(e): + # trying to read a py3 pickle in py2 + return + else: + raise + + m = globals() + for typ, dv in data.items(): + for dt, result in dv.items(): + try: + expected = data[typ][dt] + except (KeyError): + if version in ('0.10.1', '0.11.0') and dt == 'reg': + break + else: + raise + + # use a specific comparator + # if available + comparator = "compare_{typ}_{dt}".format(typ=typ, dt=dt) + + comparator = m.get(comparator, m['compare_element']) + comparator(result, expected, typ, version) + return data + + +def compare_sp_series_ts(res, exp, typ, version): + # SparseTimeSeries integrated into SparseSeries in 0.12.0 + # and deprecated in 0.17.0 + if version and LooseVersion(version) <= "0.12.0": + tm.assert_sp_series_equal(res, exp, check_series_type=False) + else: + tm.assert_sp_series_equal(res, exp) + + +def compare_series_ts(result, expected, typ, version): + # GH 7748 + tm.assert_series_equal(result, expected) + assert result.index.freq == expected.index.freq + assert not result.index.freq.normalize + tm.assert_series_equal(result > 0, expected > 0) + + # GH 9291 + freq = result.index.freq + assert freq + Day(1) == Day(2) + + res = freq + pandas.Timedelta(hours=1) + assert isinstance(res, pandas.Timedelta) + assert res == pandas.Timedelta(days=1, hours=1) + + res = freq + pandas.Timedelta(nanoseconds=1) + assert isinstance(res, pandas.Timedelta) + assert res == pandas.Timedelta(days=1, nanoseconds=1) + + +def compare_series_dt_tz(result, expected, typ, version): + # 8260 + # dtype is object < 0.17.0 + if LooseVersion(version) < '0.17.0': + expected = expected.astype(object) + tm.assert_series_equal(result, expected) + else: + tm.assert_series_equal(result, expected) + + +def compare_series_cat(result, expected, typ, version): + # Categorical dtype is added in 0.15.0 + # ordered is changed in 0.16.0 + if LooseVersion(version) < '0.15.0': + tm.assert_series_equal(result, expected, check_dtype=False, + 
check_categorical=False) + elif LooseVersion(version) < '0.16.0': + tm.assert_series_equal(result, expected, check_categorical=False) + else: + tm.assert_series_equal(result, expected) + + +def compare_frame_dt_mixed_tzs(result, expected, typ, version): + # 8260 + # dtype is object < 0.17.0 + if LooseVersion(version) < '0.17.0': + expected = expected.astype(object) + tm.assert_frame_equal(result, expected) + else: + tm.assert_frame_equal(result, expected) + + +def compare_frame_cat_onecol(result, expected, typ, version): + # Categorical dtype is added in 0.15.0 + # ordered is changed in 0.16.0 + if LooseVersion(version) < '0.15.0': + tm.assert_frame_equal(result, expected, check_dtype=False, + check_categorical=False) + elif LooseVersion(version) < '0.16.0': + tm.assert_frame_equal(result, expected, check_categorical=False) + else: + tm.assert_frame_equal(result, expected) + + +def compare_frame_cat_and_float(result, expected, typ, version): + compare_frame_cat_onecol(result, expected, typ, version) + + +def compare_index_period(result, expected, typ, version): + tm.assert_index_equal(result, expected) + assert isinstance(result.freq, MonthEnd) + assert result.freq == MonthEnd() + assert result.freqstr == 'M' + tm.assert_index_equal(result.shift(2), expected.shift(2)) + + +def compare_sp_frame_float(result, expected, typ, version): + if LooseVersion(version) <= '0.18.1': + tm.assert_sp_frame_equal(result, expected, exact_indices=False, + check_dtype=False) + else: + tm.assert_sp_frame_equal(result, expected) + + +# --------------------- +# tests +# --------------------- +def legacy_pickle_versions(): + # yield the pickle versions + path = tm.get_data_path('legacy_pickle') + for v in os.listdir(path): + p = os.path.join(path, v) + if os.path.isdir(p): + yield v + + +@pytest.mark.parametrize('version', legacy_pickle_versions()) +def test_pickles(current_pickle_data, version): + if not is_platform_little_endian(): + pytest.skip("known failure on non-little endian") + + pth = tm.get_data_path('legacy_pickle/{0}'.format(version)) + n = 0 + for f in os.listdir(pth): + vf = os.path.join(pth, f) + with catch_warnings(record=True): + data = compare(current_pickle_data, vf, version) + + if data is None: + continue + n += 1 + assert n > 0, ('Pickle files are not ' + 'tested: {version}'.format(version=version)) + + +def test_round_trip_current(current_pickle_data): + + try: + import cPickle as c_pickle + + def c_pickler(obj, path): + with open(path, 'wb') as fh: + c_pickle.dump(obj, fh, protocol=-1) + + def c_unpickler(path): + with open(path, 'rb') as fh: + fh.seek(0) + return c_pickle.load(fh) + except: + c_pickler = None + c_unpickler = None + + import pickle as python_pickle + + def python_pickler(obj, path): + with open(path, 'wb') as fh: + python_pickle.dump(obj, fh, protocol=-1) + + def python_unpickler(path): + with open(path, 'rb') as fh: + fh.seek(0) + return python_pickle.load(fh) + + data = current_pickle_data + for typ, dv in data.items(): + for dt, expected in dv.items(): + + for writer in [pd.to_pickle, c_pickler, python_pickler]: + if writer is None: + continue + + with tm.ensure_clean() as path: + + # test writing with each pickler + writer(expected, path) + + # test reading with each unpickler + result = pd.read_pickle(path) + compare_element(result, expected, typ) + + if c_unpickler is not None: + result = c_unpickler(path) + compare_element(result, expected, typ) + + result = python_unpickler(path) + compare_element(result, expected, typ) + + +def test_pickle_v0_14_1(): + + cat = 
pd.Categorical(values=['a', 'b', 'c'], ordered=False, + categories=['a', 'b', 'c', 'd']) + pickle_path = os.path.join(tm.get_data_path(), + 'categorical_0_14_1.pickle') + # This code was executed once on v0.14.1 to generate the pickle: + # + # cat = Categorical(labels=np.arange(3), levels=['a', 'b', 'c', 'd'], + # name='foobar') + # with open(pickle_path, 'wb') as f: pickle.dump(cat, f) + # + tm.assert_categorical_equal(cat, pd.read_pickle(pickle_path)) + + +def test_pickle_v0_15_2(): + # ordered -> _ordered + # GH 9347 + + cat = pd.Categorical(values=['a', 'b', 'c'], ordered=False, + categories=['a', 'b', 'c', 'd']) + pickle_path = os.path.join(tm.get_data_path(), + 'categorical_0_15_2.pickle') + # This code was executed once on v0.15.2 to generate the pickle: + # + # cat = Categorical(labels=np.arange(3), levels=['a', 'b', 'c', 'd'], + # name='foobar') + # with open(pickle_path, 'wb') as f: pickle.dump(cat, f) + # + tm.assert_categorical_equal(cat, pd.read_pickle(pickle_path)) + + +# --------------------- +# test pickle compression +# --------------------- + +@pytest.fixture +def get_random_path(): + return u'__%s__.pickle' % tm.rands(10) + + +class TestCompression(object): + + _compression_to_extension = { + None: ".none", + 'gzip': '.gz', + 'bz2': '.bz2', + 'zip': '.zip', + 'xz': '.xz', + } + + def compress_file(self, src_path, dest_path, compression): + if compression is None: + shutil.copyfile(src_path, dest_path) + return + + if compression == 'gzip': + import gzip + f = gzip.open(dest_path, "w") + elif compression == 'bz2': + import bz2 + f = bz2.BZ2File(dest_path, "w") + elif compression == 'zip': + import zipfile + zip_file = zipfile.ZipFile(dest_path, "w", + compression=zipfile.ZIP_DEFLATED) + zip_file.write(src_path, os.path.basename(src_path)) + elif compression == 'xz': + lzma = pandas.compat.import_lzma() + f = lzma.LZMAFile(dest_path, "w") + else: + msg = 'Unrecognized compression type: {}'.format(compression) + raise ValueError(msg) + + if compression != "zip": + with open(src_path, "rb") as fh: + f.write(fh.read()) + f.close() + + def decompress_file(self, src_path, dest_path, compression): + if compression is None: + shutil.copyfile(src_path, dest_path) + return + + if compression == 'gzip': + import gzip + f = gzip.open(src_path, "r") + elif compression == 'bz2': + import bz2 + f = bz2.BZ2File(src_path, "r") + elif compression == 'zip': + import zipfile + zip_file = zipfile.ZipFile(src_path) + zip_names = zip_file.namelist() + if len(zip_names) == 1: + f = zip_file.open(zip_names.pop()) + else: + raise ValueError('ZIP file {} error. Only one file per ZIP.' 
+ .format(src_path)) + elif compression == 'xz': + lzma = pandas.compat.import_lzma() + f = lzma.LZMAFile(src_path, "r") + else: + msg = 'Unrecognized compression type: {}'.format(compression) + raise ValueError(msg) + + with open(dest_path, "wb") as fh: + fh.write(f.read()) + f.close() + + @pytest.mark.parametrize('compression', [None, 'gzip', 'bz2', 'xz']) + def test_write_explicit(self, compression, get_random_path): + # issue 11666 + if compression == 'xz': + tm._skip_if_no_lzma() + + base = get_random_path + path1 = base + ".compressed" + path2 = base + ".raw" + + with tm.ensure_clean(path1) as p1, tm.ensure_clean(path2) as p2: + df = tm.makeDataFrame() + + # write to compressed file + df.to_pickle(p1, compression=compression) + + # decompress + self.decompress_file(p1, p2, compression=compression) + + # read decompressed file + df2 = pd.read_pickle(p2, compression=None) + + tm.assert_frame_equal(df, df2) + + @pytest.mark.parametrize('compression', ['', 'None', 'bad', '7z']) + def test_write_explicit_bad(self, compression, get_random_path): + with tm.assert_raises_regex(ValueError, + "Unrecognized compression type"): + with tm.ensure_clean(get_random_path) as path: + df = tm.makeDataFrame() + df.to_pickle(path, compression=compression) + + @pytest.mark.parametrize('ext', ['', '.gz', '.bz2', '.xz', '.no_compress']) + def test_write_infer(self, ext, get_random_path): + if ext == '.xz': + tm._skip_if_no_lzma() + + base = get_random_path + path1 = base + ext + path2 = base + ".raw" + compression = None + for c in self._compression_to_extension: + if self._compression_to_extension[c] == ext: + compression = c + break + + with tm.ensure_clean(path1) as p1, tm.ensure_clean(path2) as p2: + df = tm.makeDataFrame() + + # write to compressed file by inferred compression method + df.to_pickle(p1) + + # decompress + self.decompress_file(p1, p2, compression=compression) + + # read decompressed file + df2 = pd.read_pickle(p2, compression=None) + + tm.assert_frame_equal(df, df2) + + @pytest.mark.parametrize('compression', [None, 'gzip', 'bz2', 'xz', "zip"]) + def test_read_explicit(self, compression, get_random_path): + # issue 11666 + if compression == 'xz': + tm._skip_if_no_lzma() + + base = get_random_path + path1 = base + ".raw" + path2 = base + ".compressed" + + with tm.ensure_clean(path1) as p1, tm.ensure_clean(path2) as p2: + df = tm.makeDataFrame() + + # write to uncompressed file + df.to_pickle(p1, compression=None) + + # compress + self.compress_file(p1, p2, compression=compression) + + # read compressed file + df2 = pd.read_pickle(p2, compression=compression) + + tm.assert_frame_equal(df, df2) + + @pytest.mark.parametrize('ext', ['', '.gz', '.bz2', '.xz', '.zip', + '.no_compress']) + def test_read_infer(self, ext, get_random_path): + if ext == '.xz': + tm._skip_if_no_lzma() + + base = get_random_path + path1 = base + ".raw" + path2 = base + ext + compression = None + for c in self._compression_to_extension: + if self._compression_to_extension[c] == ext: + compression = c + break + + with tm.ensure_clean(path1) as p1, tm.ensure_clean(path2) as p2: + df = tm.makeDataFrame() + + # write to uncompressed file + df.to_pickle(p1, compression=None) + + # compress + self.compress_file(p1, p2, compression=compression) + + # read compressed file by inferred compression method + df2 = pd.read_pickle(p2) + + tm.assert_frame_equal(df, df2) diff --git a/pandas/io/tests/test_pytables.py b/pandas/tests/io/test_pytables.py similarity index 70% rename from pandas/io/tests/test_pytables.py rename to 
pandas/tests/io/test_pytables.py index 79b6dc51009cf..ee44fea55e51a 100644 --- a/pandas/io/tests/test_pytables.py +++ b/pandas/tests/io/test_pytables.py @@ -1,56 +1,44 @@ -import nose +import pytest import sys import os -import warnings import tempfile from contextlib import contextmanager +from warnings import catch_warnings import datetime +from datetime import timedelta import numpy as np import pandas import pandas as pd -from pandas import (Series, DataFrame, Panel, MultiIndex, Int64Index, +from pandas import (Series, DataFrame, Panel, Panel4D, MultiIndex, Int64Index, RangeIndex, Categorical, bdate_range, date_range, timedelta_range, Index, DatetimeIndex, isnull) from pandas.compat import is_platform_windows, PY3, PY35 -from pandas.formats.printing import pprint_thing -from pandas.io.pytables import _tables, TableIterator -try: - _tables() -except ImportError as e: - raise nose.SkipTest(e) - +from pandas.io.formats.printing import pprint_thing +tables = pytest.importorskip('tables') +from pandas.io.pytables import TableIterator from pandas.io.pytables import (HDFStore, get_store, Term, read_hdf, - IncompatibilityWarning, PerformanceWarning, - AttributeConflictWarning, DuplicateWarning, PossibleDataLossError, ClosedFileError) + from pandas.io import pytables as pytables import pandas.util.testing as tm from pandas.util.testing import (assert_panel4d_equal, assert_panel_equal, assert_frame_equal, assert_series_equal, - assert_produces_warning, set_timezone) from pandas import concat, Timestamp from pandas import compat from pandas.compat import range, lrange, u - -try: - import tables -except ImportError: - raise nose.SkipTest('no pytables') - from distutils.version import LooseVersion _default_compressor = ('blosc' if LooseVersion(tables.__version__) >= '2.2' else 'zlib') -_multiprocess_can_split_ = False # testing on windows/py3 seems to fault # for using compression @@ -133,61 +121,50 @@ def _maybe_remove(store, key): pass -@contextmanager -def compat_assert_produces_warning(w): - """ don't produce a warning under PY3 """ - if compat.PY3: - yield - else: - with tm.assert_produces_warning(expected_warning=w, - check_stacklevel=False): - yield - - -class Base(tm.TestCase): +class Base(object): @classmethod - def setUpClass(cls): - super(Base, cls).setUpClass() + def setup_class(cls): # Pytables 3.0.0 deprecates lots of things tm.reset_testing_mode() @classmethod - def tearDownClass(cls): - super(Base, cls).tearDownClass() + def teardown_class(cls): # Pytables 3.0.0 deprecates lots of things tm.set_testing_mode() - def setUp(self): - warnings.filterwarnings(action='ignore', category=FutureWarning) - + def setup_method(self, method): self.path = 'tmp.__%s__.h5' % tm.rands(10) - def tearDown(self): + def teardown_method(self, method): pass -class TestHDFStore(Base, tm.TestCase): +@pytest.mark.single +class TestHDFStore(Base): def test_factory_fun(self): path = create_tempfile(self.path) try: - with get_store(path) as tbl: - raise ValueError('blah') + with catch_warnings(record=True): + with get_store(path) as tbl: + raise ValueError('blah') except ValueError: pass finally: safe_remove(path) try: - with get_store(path) as tbl: - tbl['a'] = tm.makeDataFrame() - - with get_store(path) as tbl: - self.assertEqual(len(tbl), 1) - self.assertEqual(type(tbl['a']), DataFrame) + with catch_warnings(record=True): + with get_store(path) as tbl: + tbl['a'] = tm.makeDataFrame() + + with catch_warnings(record=True): + with get_store(path) as tbl: + assert len(tbl) == 1 + assert type(tbl['a']) == 
DataFrame finally: safe_remove(self.path) @@ -206,8 +183,8 @@ def test_context(self): tbl['a'] = tm.makeDataFrame() with HDFStore(path) as tbl: - self.assertEqual(len(tbl), 1) - self.assertEqual(type(tbl['a']), DataFrame) + assert len(tbl) == 1 + assert type(tbl['a']) == DataFrame finally: safe_remove(path) @@ -227,8 +204,10 @@ def roundtrip(key, obj, **kwargs): o = tm.makeDataFrame() assert_frame_equal(o, roundtrip('frame', o)) - o = tm.makePanel() - assert_panel_equal(o, roundtrip('panel', o)) + with catch_warnings(record=True): + + o = tm.makePanel() + assert_panel_equal(o, roundtrip('panel', o)) # table df = DataFrame(dict(A=lrange(5), B=lrange(5))) @@ -327,20 +306,20 @@ def test_api(self): # invalid df = tm.makeDataFrame() - self.assertRaises(ValueError, df.to_hdf, path, - 'df', append=True, format='f') - self.assertRaises(ValueError, df.to_hdf, path, - 'df', append=True, format='fixed') + pytest.raises(ValueError, df.to_hdf, path, + 'df', append=True, format='f') + pytest.raises(ValueError, df.to_hdf, path, + 'df', append=True, format='fixed') - self.assertRaises(TypeError, df.to_hdf, path, - 'df', append=True, format='foo') - self.assertRaises(TypeError, df.to_hdf, path, - 'df', append=False, format='bar') + pytest.raises(TypeError, df.to_hdf, path, + 'df', append=True, format='foo') + pytest.raises(TypeError, df.to_hdf, path, + 'df', append=False, format='bar') # File path doesn't exist path = "" - self.assertRaises(compat.FileNotFoundError, - read_hdf, path, 'df') + pytest.raises(compat.FileNotFoundError, + read_hdf, path, 'df') def test_api_default_format(self): @@ -351,16 +330,16 @@ def test_api_default_format(self): pandas.set_option('io.hdf.default_format', 'fixed') _maybe_remove(store, 'df') store.put('df', df) - self.assertFalse(store.get_storer('df').is_table) - self.assertRaises(ValueError, store.append, 'df2', df) + assert not store.get_storer('df').is_table + pytest.raises(ValueError, store.append, 'df2', df) pandas.set_option('io.hdf.default_format', 'table') _maybe_remove(store, 'df') store.put('df', df) - self.assertTrue(store.get_storer('df').is_table) + assert store.get_storer('df').is_table _maybe_remove(store, 'df2') store.append('df2', df) - self.assertTrue(store.get_storer('df').is_table) + assert store.get_storer('df').is_table pandas.set_option('io.hdf.default_format', None) @@ -370,17 +349,17 @@ def test_api_default_format(self): pandas.set_option('io.hdf.default_format', 'fixed') df.to_hdf(path, 'df') - with get_store(path) as store: - self.assertFalse(store.get_storer('df').is_table) - self.assertRaises(ValueError, df.to_hdf, path, 'df2', append=True) + with HDFStore(path) as store: + assert not store.get_storer('df').is_table + pytest.raises(ValueError, df.to_hdf, path, 'df2', append=True) pandas.set_option('io.hdf.default_format', 'table') df.to_hdf(path, 'df3') with HDFStore(path) as store: - self.assertTrue(store.get_storer('df3').is_table) + assert store.get_storer('df3').is_table df.to_hdf(path, 'df4', append=True) with HDFStore(path) as store: - self.assertTrue(store.get_storer('df4').is_table) + assert store.get_storer('df4').is_table pandas.set_option('io.hdf.default_format', None) @@ -390,18 +369,19 @@ def test_keys(self): store['a'] = tm.makeTimeSeries() store['b'] = tm.makeStringSeries() store['c'] = tm.makeDataFrame() - store['d'] = tm.makePanel() - store['foo/bar'] = tm.makePanel() - self.assertEqual(len(store), 5) + with catch_warnings(record=True): + store['d'] = tm.makePanel() + store['foo/bar'] = tm.makePanel() + assert len(store) == 5 
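            # (keys are reported as absolute node paths, hence the leading
            #  '/' below; 'foo/bar' and '/foo/bar' name the same entry)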
expected = set(['/a', '/b', '/c', '/d', '/foo/bar']) - self.assertTrue(set(store.keys()) == expected) - self.assertTrue(set(store) == expected) + assert set(store.keys()) == expected + assert set(store) == expected def test_iter_empty(self): with ensure_clean_store(self.path) as store: # GH 12221 - self.assertTrue(list(store) == []) + assert list(store) == [] def test_repr(self): @@ -410,9 +390,11 @@ def test_repr(self): store['a'] = tm.makeTimeSeries() store['b'] = tm.makeStringSeries() store['c'] = tm.makeDataFrame() - store['d'] = tm.makePanel() - store['foo/bar'] = tm.makePanel() - store.append('e', tm.makePanel()) + + with catch_warnings(record=True): + store['d'] = tm.makePanel() + store['foo/bar'] = tm.makePanel() + store.append('e', tm.makePanel()) df = tm.makeDataFrame() df['obj1'] = 'foo' @@ -426,12 +408,12 @@ def test_repr(self): df['timestamp2'] = Timestamp('20010103') df['datetime1'] = datetime.datetime(2001, 1, 2, 0, 0) df['datetime2'] = datetime.datetime(2001, 1, 3, 0, 0) - df.ix[3:6, ['obj1']] = np.nan - df = df.consolidate()._convert(datetime=True) + df.loc[3:6, ['obj1']] = np.nan + df = df._consolidate()._convert(datetime=True) - warnings.filterwarnings('ignore', category=PerformanceWarning) - store['df'] = df - warnings.filterwarnings('always', category=PerformanceWarning) + # PerformanceWarning + with catch_warnings(record=True): + store['df'] = df # make a random group in hdf space store._handle.create_group(store._handle.root, 'bah') @@ -455,19 +437,18 @@ def test_contains(self): store['a'] = tm.makeTimeSeries() store['b'] = tm.makeDataFrame() store['foo/bar'] = tm.makeDataFrame() - self.assertIn('a', store) - self.assertIn('b', store) - self.assertNotIn('c', store) - self.assertIn('foo/bar', store) - self.assertIn('/foo/bar', store) - self.assertNotIn('/foo/b', store) - self.assertNotIn('bar', store) - - # GH 2694 - warnings.filterwarnings( - 'ignore', category=tables.NaturalNameWarning) - store['node())'] = tm.makeDataFrame() - self.assertIn('node())', store) + assert 'a' in store + assert 'b' in store + assert 'c' not in store + assert 'foo/bar' in store + assert '/foo/bar' in store + assert '/foo/b' not in store + assert 'bar' not in store + + # gh-2694: tables.NaturalNameWarning + with catch_warnings(record=True): + store['node())'] = tm.makeDataFrame() + assert 'node())' in store def test_versioning(self): @@ -478,9 +459,9 @@ def test_versioning(self): _maybe_remove(store, 'df1') store.append('df1', df[:10]) store.append('df1', df[10:]) - self.assertEqual(store.root.a._v_attrs.pandas_version, '0.15.2') - self.assertEqual(store.root.b._v_attrs.pandas_version, '0.15.2') - self.assertEqual(store.root.df1._v_attrs.pandas_version, '0.15.2') + assert store.root.a._v_attrs.pandas_version == '0.15.2' + assert store.root.b._v_attrs.pandas_version == '0.15.2' + assert store.root.df1._v_attrs.pandas_version == '0.15.2' # write a file and wipe its versioning _maybe_remove(store, 'df2') @@ -489,7 +470,7 @@ def test_versioning(self): # this is an error because its table_type is appendable, but no # version info store.get_node('df2')._v_attrs.pandas_version = None - self.assertRaises(Exception, store.select, 'df2') + pytest.raises(Exception, store.select, 'df2') def test_mode(self): @@ -501,11 +482,11 @@ def check(mode): # constructor if mode in ['r', 'r+']: - self.assertRaises(IOError, HDFStore, path, mode=mode) + pytest.raises(IOError, HDFStore, path, mode=mode) else: store = HDFStore(path, mode=mode) - self.assertEqual(store._handle.mode, mode) + assert store._handle.mode 
== mode store.close() with ensure_clean_path(self.path) as path: @@ -515,25 +496,25 @@ def check(mode): def f(): with HDFStore(path, mode=mode) as store: # noqa pass - self.assertRaises(IOError, f) + pytest.raises(IOError, f) else: with HDFStore(path, mode=mode) as store: - self.assertEqual(store._handle.mode, mode) + assert store._handle.mode == mode with ensure_clean_path(self.path) as path: # conv write if mode in ['r', 'r+']: - self.assertRaises(IOError, df.to_hdf, - path, 'df', mode=mode) + pytest.raises(IOError, df.to_hdf, + path, 'df', mode=mode) df.to_hdf(path, 'df', mode='w') else: df.to_hdf(path, 'df', mode=mode) # conv read if mode in ['w']: - self.assertRaises(ValueError, read_hdf, - path, 'df', mode=mode) + pytest.raises(ValueError, read_hdf, + path, 'df', mode=mode) else: result = read_hdf(path, 'df', mode=mode) assert_frame_equal(result, df) @@ -560,43 +541,43 @@ def test_reopen_handle(self): store['a'] = tm.makeTimeSeries() # invalid mode change - self.assertRaises(PossibleDataLossError, store.open, 'w') + pytest.raises(PossibleDataLossError, store.open, 'w') store.close() - self.assertFalse(store.is_open) + assert not store.is_open # truncation ok here store.open('w') - self.assertTrue(store.is_open) - self.assertEqual(len(store), 0) + assert store.is_open + assert len(store) == 0 store.close() - self.assertFalse(store.is_open) + assert not store.is_open store = HDFStore(path, mode='a') store['a'] = tm.makeTimeSeries() # reopen as read store.open('r') - self.assertTrue(store.is_open) - self.assertEqual(len(store), 1) - self.assertEqual(store._mode, 'r') + assert store.is_open + assert len(store) == 1 + assert store._mode == 'r' store.close() - self.assertFalse(store.is_open) + assert not store.is_open # reopen as append store.open('a') - self.assertTrue(store.is_open) - self.assertEqual(len(store), 1) - self.assertEqual(store._mode, 'a') + assert store.is_open + assert len(store) == 1 + assert store._mode == 'a' store.close() - self.assertFalse(store.is_open) + assert not store.is_open # reopen as append (again) store.open('a') - self.assertTrue(store.is_open) - self.assertEqual(len(store), 1) - self.assertEqual(store._mode, 'a') + assert store.is_open + assert len(store) == 1 + assert store._mode == 'a' store.close() - self.assertFalse(store.is_open) + assert not store.is_open def test_open_args(self): @@ -616,7 +597,7 @@ def test_open_args(self): store.close() # the file should not have actually been written - self.assertFalse(os.path.exists(path)) + assert not os.path.exists(path) def test_flush(self): @@ -637,7 +618,7 @@ def test_get(self): right = store['/a'] tm.assert_series_equal(left, right) - self.assertRaises(KeyError, store.get, 'b') + pytest.raises(KeyError, store.get, 'b') def test_getattr(self): @@ -658,10 +639,10 @@ def test_getattr(self): tm.assert_frame_equal(result, df) # errors - self.assertRaises(AttributeError, getattr, store, 'd') + pytest.raises(AttributeError, getattr, store, 'd') for x in ['mode', 'path', 'handle', 'complib']: - self.assertRaises(AttributeError, getattr, store, x) + pytest.raises(AttributeError, getattr, store, x) # not stores for x in ['mode', 'path', 'handle', 'complib']: @@ -681,17 +662,17 @@ def test_put(self): store.put('c', df[:10], format='table') # not OK, not a table - self.assertRaises( + pytest.raises( ValueError, store.put, 'b', df[10:], append=True) # node does not currently exist, test _is_table_type returns False # in this case # _maybe_remove(store, 'f') - # self.assertRaises(ValueError, store.put, 'f', df[10:], + # 
pytest.raises(ValueError, store.put, 'f', df[10:], # append=True) # can't put to a table (use append instead) - self.assertRaises(ValueError, store.put, 'c', df[10:], append=True) + pytest.raises(ValueError, store.put, 'c', df[10:], append=True) # overwrite table store.put('c', df[:10], format='table', append=False) @@ -733,21 +714,22 @@ def test_put_compression(self): tm.assert_frame_equal(store['c'], df) # can't compress if format='fixed' - self.assertRaises(ValueError, store.put, 'b', df, - format='fixed', complib='zlib') + pytest.raises(ValueError, store.put, 'b', df, + format='fixed', complib='zlib') def test_put_compression_blosc(self): - tm.skip_if_no_package('tables', '2.2', app='blosc support') + tm.skip_if_no_package('tables', min_version='2.2', + app='blosc support') if skip_compression: - raise nose.SkipTest("skipping on windows/PY3") + pytest.skip("skipping on windows/PY3") df = tm.makeTimeDataFrame() with ensure_clean_store(self.path) as store: # can't compress if format='fixed' - self.assertRaises(ValueError, store.put, 'b', df, - format='fixed', complib='blosc') + pytest.raises(ValueError, store.put, 'b', df, + format='fixed', complib='blosc') store.put('c', df, format='table', complib='blosc') tm.assert_frame_equal(store['c'], df) @@ -770,17 +752,15 @@ def test_put_mixed_type(self): df['timestamp2'] = Timestamp('20010103') df['datetime1'] = datetime.datetime(2001, 1, 2, 0, 0) df['datetime2'] = datetime.datetime(2001, 1, 3, 0, 0) - df.ix[3:6, ['obj1']] = np.nan - df = df.consolidate()._convert(datetime=True) + df.loc[3:6, ['obj1']] = np.nan + df = df._consolidate()._convert(datetime=True) with ensure_clean_store(self.path) as store: _maybe_remove(store, 'df') - # cannot use assert_produces_warning here for some reason - # a PendingDeprecationWarning is also raised? 
-        warnings.filterwarnings('ignore', category=PerformanceWarning)
-        store.put('df', df)
-        warnings.filterwarnings('always', category=PerformanceWarning)
+        # PerformanceWarning
+        with catch_warnings(record=True):
+            store.put('df', df)

         expected = store.get('df')
         tm.assert_frame_equal(expected, df)

@@ -788,51 +768,53 @@ def test_append(self):

         with ensure_clean_store(self.path) as store:
-            df = tm.makeTimeDataFrame()
-            _maybe_remove(store, 'df1')
-            store.append('df1', df[:10])
-            store.append('df1', df[10:])
-            tm.assert_frame_equal(store['df1'], df)
-
-            _maybe_remove(store, 'df2')
-            store.put('df2', df[:10], format='table')
-            store.append('df2', df[10:])
-            tm.assert_frame_equal(store['df2'], df)
-
-            _maybe_remove(store, 'df3')
-            store.append('/df3', df[:10])
-            store.append('/df3', df[10:])
-            tm.assert_frame_equal(store['df3'], df)

             # this is allowed by almost always don't want to do it
-            with tm.assert_produces_warning(
-                    expected_warning=tables.NaturalNameWarning):
+            # tables.NaturalNameWarning):
+            with catch_warnings(record=True):
+
+                df = tm.makeTimeDataFrame()
+                _maybe_remove(store, 'df1')
+                store.append('df1', df[:10])
+                store.append('df1', df[10:])
+                tm.assert_frame_equal(store['df1'], df)
+
+                _maybe_remove(store, 'df2')
+                store.put('df2', df[:10], format='table')
+                store.append('df2', df[10:])
+                tm.assert_frame_equal(store['df2'], df)
+
+                _maybe_remove(store, 'df3')
+                store.append('/df3', df[:10])
+                store.append('/df3', df[10:])
+                tm.assert_frame_equal(store['df3'], df)
+
+                # this is allowed but you almost always don't want to do it
+                # tables.NaturalNameWarning
                 _maybe_remove(store, '/df3 foo')
                 store.append('/df3 foo', df[:10])
                 store.append('/df3 foo', df[10:])
                 tm.assert_frame_equal(store['df3 foo'], df)

-            # panel
-            wp = tm.makePanel()
-            _maybe_remove(store, 'wp1')
-            store.append('wp1', wp.ix[:, :10, :])
-            store.append('wp1', wp.ix[:, 10:, :])
-            assert_panel_equal(store['wp1'], wp)
+                # panel
+                wp = tm.makePanel()
+                _maybe_remove(store, 'wp1')
+                store.append('wp1', wp.iloc[:, :10, :])
+                store.append('wp1', wp.iloc[:, 10:, :])
+                assert_panel_equal(store['wp1'], wp)

-            # ndim
-            with tm.assert_produces_warning(FutureWarning,
-                                            check_stacklevel=False):
+                # ndim
                 p4d = tm.makePanel4D()
                 _maybe_remove(store, 'p4d')
-                store.append('p4d', p4d.ix[:, :, :10, :])
-                store.append('p4d', p4d.ix[:, :, 10:, :])
+                store.append('p4d', p4d.iloc[:, :, :10, :])
+                store.append('p4d', p4d.iloc[:, :, 10:, :])
                 assert_panel4d_equal(store['p4d'], p4d)

                 # test using axis labels
                 _maybe_remove(store, 'p4d')
-                store.append('p4d', p4d.ix[:, :, :10, :], axes=[
+                store.append('p4d', p4d.iloc[:, :, :10, :], axes=[
                     'items', 'major_axis', 'minor_axis'])
-                store.append('p4d', p4d.ix[:, :, 10:, :], axes=[
+                store.append('p4d', p4d.iloc[:, :, 10:, :], axes=[
                     'items', 'major_axis', 'minor_axis'])
                 assert_panel4d_equal(store['p4d'], p4d)

@@ -845,42 +827,42 @@ def test_append(self):
                     'p4d2', p4d2, axes=['items', 'major_axis', 'minor_axis'])
                 assert_panel4d_equal(store['p4d2'], p4d2)

-            # test using differt order of items on the non-index axes
-            _maybe_remove(store, 'wp1')
-            wp_append1 = wp.ix[:, :10, :]
-            store.append('wp1', wp_append1)
-            wp_append2 = wp.ix[:, 10:, :].reindex(items=wp.items[::-1])
-            store.append('wp1', wp_append2)
-            assert_panel_equal(store['wp1'], wp)
-
-            # dtype issues - mizxed type in a single object column
-            df = DataFrame(data=[[1, 2], [0, 1], [1, 2], [0, 0]])
-            df['mixed_column'] = 'testing'
-            df.ix[2, 'mixed_column'] = np.nan
-            _maybe_remove(store, 'df')
-            store.append('df', df)
-            tm.assert_frame_equal(store['df'], df)
-
-            # uints - test storage of uints
-            uint_data = DataFrame({
-                'u08': Series(np.random.randint(0, high=255, size=5),
-                              dtype=np.uint8),
-                'u16': Series(np.random.randint(0, high=65535, size=5),
-                              dtype=np.uint16),
-                'u32': Series(np.random.randint(0, high=2**30, size=5),
-                              dtype=np.uint32),
-                'u64': Series([2**58, 2**59, 2**60, 2**61, 2**62],
-                              dtype=np.uint64)}, index=np.arange(5))
-            _maybe_remove(store, 'uints')
-            store.append('uints', uint_data)
-            tm.assert_frame_equal(store['uints'], uint_data)
-
-            # uints - test storage of uints in indexable columns
-            _maybe_remove(store, 'uints')
-            # 64-bit indices not yet supported
-            store.append('uints', uint_data, data_columns=[
-                'u08', 'u16', 'u32'])
-            tm.assert_frame_equal(store['uints'], uint_data)
+                # test using different order of items on the non-index axes
+                _maybe_remove(store, 'wp1')
+                wp_append1 = wp.iloc[:, :10, :]
+                store.append('wp1', wp_append1)
+                wp_append2 = wp.iloc[:, 10:, :].reindex(items=wp.items[::-1])
+                store.append('wp1', wp_append2)
+                assert_panel_equal(store['wp1'], wp)
+
+                # dtype issues - mixed type in a single object column
+                df = DataFrame(data=[[1, 2], [0, 1], [1, 2], [0, 0]])
+                df['mixed_column'] = 'testing'
+                df.loc[2, 'mixed_column'] = np.nan
+                _maybe_remove(store, 'df')
+                store.append('df', df)
+                tm.assert_frame_equal(store['df'], df)
+
+                # uints - test storage of uints
+                uint_data = DataFrame({
+                    'u08': Series(np.random.randint(0, high=255, size=5),
+                                  dtype=np.uint8),
+                    'u16': Series(np.random.randint(0, high=65535, size=5),
+                                  dtype=np.uint16),
+                    'u32': Series(np.random.randint(0, high=2**30, size=5),
+                                  dtype=np.uint32),
+                    'u64': Series([2**58, 2**59, 2**60, 2**61, 2**62],
+                                  dtype=np.uint64)}, index=np.arange(5))
+                _maybe_remove(store, 'uints')
+                store.append('uints', uint_data)
+                tm.assert_frame_equal(store['uints'], uint_data)
+
+                # uints - test storage of uints in indexable columns
+                _maybe_remove(store, 'uints')
+                # 64-bit indices not yet supported
+                store.append('uints', uint_data, data_columns=[
+                    'u08', 'u16', 'u32'])
+                tm.assert_frame_equal(store['uints'], uint_data)

     def test_append_series(self):

@@ -894,27 +876,27 @@ def test_append_series(self):
             store.append('ss', ss)
             result = store['ss']
             tm.assert_series_equal(result, ss)
-            self.assertIsNone(result.name)
+            assert result.name is None

             store.append('ts', ts)
             result = store['ts']
             tm.assert_series_equal(result, ts)
-            self.assertIsNone(result.name)
+            assert result.name is None

             ns.name = 'foo'
             store.append('ns', ns)
             result = store['ns']
             tm.assert_series_equal(result, ns)
-            self.assertEqual(result.name, ns.name)
+            assert result.name == ns.name

             # select on the values
             expected = ns[ns > 60]
-            result = store.select('ns', Term('foo>60'))
+            result = store.select('ns', 'foo>60')
             tm.assert_series_equal(result, expected)

             # select on the index and values
             expected = ns[(ns > 70) & (ns.index < 90)]
-            result = store.select('ns', [Term('foo>70'), Term('index<90')])
+            result = store.select('ns', 'foo>70 and index<90')
             tm.assert_series_equal(result, expected)

             # multi-index
@@ -961,15 +943,16 @@ def check(format, index):
             else:

                 # only support for fixed types (and they have a perf warning)
-                self.assertRaises(TypeError, check, 'table', index)
-                with tm.assert_produces_warning(
-                        expected_warning=PerformanceWarning):
+                pytest.raises(TypeError, check, 'table', index)
+
+                # PerformanceWarning
+                with catch_warnings(record=True):
                     check('fixed', index)

     def test_encoding(self):

         if sys.byteorder != 'little':
-            raise nose.SkipTest('system byteorder 
is not little') + pytest.skip('system byteorder is not little') with ensure_clean_store(self.path) as store: df = DataFrame(dict(A='foo', B='bar'), index=range(5)) @@ -986,7 +969,7 @@ def test_encoding(self): def test_latin_encoding(self): if compat.PY2: - self.assertRaisesRegexp( + tm.assert_raises_regex( TypeError, r'\[unicode\] is not implemented as a table column') return @@ -1040,14 +1023,14 @@ def test_append_some_nans(self): index=np.arange(20)) # some nans _maybe_remove(store, 'df1') - df.ix[0:15, ['A1', 'B', 'D', 'E']] = np.nan + df.loc[0:15, ['A1', 'B', 'D', 'E']] = np.nan store.append('df1', df[:10]) store.append('df1', df[10:]) tm.assert_frame_equal(store['df1'], df) # first column df1 = df.copy() - df1.ix[:, 'A1'] = np.nan + df1.loc[:, 'A1'] = np.nan _maybe_remove(store, 'df1') store.append('df1', df1[:10]) store.append('df1', df1[10:]) @@ -1055,7 +1038,7 @@ def test_append_some_nans(self): # 2nd column df2 = df.copy() - df2.ix[:, 'A2'] = np.nan + df2.loc[:, 'A2'] = np.nan _maybe_remove(store, 'df2') store.append('df2', df2[:10]) store.append('df2', df2[10:]) @@ -1063,7 +1046,7 @@ def test_append_some_nans(self): # datetimes df3 = df.copy() - df3.ix[:, 'E'] = np.nan + df3.loc[:, 'E'] = np.nan _maybe_remove(store, 'df3') store.append('df3', df3[:10]) store.append('df3', df3[10:]) @@ -1076,7 +1059,7 @@ def test_append_all_nans(self): df = DataFrame({'A1': np.random.randn(20), 'A2': np.random.randn(20)}, index=np.arange(20)) - df.ix[0:15, :] = np.nan + df.loc[0:15, :] = np.nan # nan some entire rows (dropna=True) _maybe_remove(store, 'df') @@ -1109,7 +1092,7 @@ def test_append_all_nans(self): 'B': 'foo', 'C': 'bar'}, index=np.arange(20)) - df.ix[0:15, :] = np.nan + df.loc[0:15, :] = np.nan _maybe_remove(store, 'df') store.append('df', df[:10], dropna=True) @@ -1130,7 +1113,7 @@ def test_append_all_nans(self): 'E': datetime.datetime(2001, 1, 2, 0, 0)}, index=np.arange(20)) - df.ix[0:15, :] = np.nan + df.loc[0:15, :] = np.nan _maybe_remove(store, 'df') store.append('df', df[:10], dropna=True) @@ -1156,15 +1139,17 @@ def test_append_all_nans(self): [[np.nan, np.nan, np.nan], [np.nan, 5, 6]], [[np.nan, np.nan, np.nan], [np.nan, 3, np.nan]]] - panel_with_missing = Panel(matrix, items=['Item1', 'Item2', 'Item3'], - major_axis=[1, 2], - minor_axis=['A', 'B', 'C']) + with catch_warnings(record=True): + panel_with_missing = Panel(matrix, + items=['Item1', 'Item2', 'Item3'], + major_axis=[1, 2], + minor_axis=['A', 'B', 'C']) - with ensure_clean_path(self.path) as path: - panel_with_missing.to_hdf( - path, 'panel_with_missing', format='table') - reloaded_panel = read_hdf(path, 'panel_with_missing') - tm.assert_panel_equal(panel_with_missing, reloaded_panel) + with ensure_clean_path(self.path) as path: + panel_with_missing.to_hdf( + path, 'panel_with_missing', format='table') + reloaded_panel = read_hdf(path, 'panel_with_missing') + tm.assert_panel_equal(panel_with_missing, reloaded_panel) def test_append_frame_column_oriented(self): @@ -1173,8 +1158,8 @@ def test_append_frame_column_oriented(self): # column oriented df = tm.makeTimeDataFrame() _maybe_remove(store, 'df1') - store.append('df1', df.ix[:, :2], axes=['columns']) - store.append('df1', df.ix[:, 2:]) + store.append('df1', df.iloc[:, :2], axes=['columns']) + store.append('df1', df.iloc[:, 2:]) tm.assert_frame_equal(store['df1'], df) result = store.select('df1', 'columns=A') @@ -1183,13 +1168,14 @@ def test_append_frame_column_oriented(self): # selection on the non-indexable result = store.select( - 'df1', ('columns=A', 
Term('index=df.index[0:4]'))) + 'df1', ('columns=A', 'index=df.index[0:4]')) expected = df.reindex(columns=['A'], index=df.index[0:4]) tm.assert_frame_equal(expected, result) # this isn't supported - self.assertRaises(TypeError, store.select, 'df1', ( - 'columns=A', Term('index>df.index[4]'))) + with pytest.raises(TypeError): + store.select('df1', + 'columns=A and index>df.index[4]') def test_append_with_different_block_ordering(self): @@ -1227,16 +1213,16 @@ def test_append_with_different_block_ordering(self): # store additonal fields in different blocks df['int16_2'] = Series([1] * len(df), dtype='int16') - self.assertRaises(ValueError, store.append, 'df', df) + pytest.raises(ValueError, store.append, 'df', df) # store multile additonal fields in different blocks df['float_3'] = Series([1.] * len(df), dtype='float64') - self.assertRaises(ValueError, store.append, 'df', df) + pytest.raises(ValueError, store.append, 'df', df) def test_ndim_indexables(self): # test using ndim tables in new ways - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): with ensure_clean_store(self.path) as store: p4d = tm.makePanel4D() @@ -1244,43 +1230,43 @@ def test_ndim_indexables(self): def check_indexers(key, indexers): for i, idx in enumerate(indexers): descr = getattr(store.root, key).table.description - self.assertTrue(getattr(descr, idx)._v_pos == i) + assert getattr(descr, idx)._v_pos == i # append then change (will take existing schema) indexers = ['items', 'major_axis', 'minor_axis'] _maybe_remove(store, 'p4d') - store.append('p4d', p4d.ix[:, :, :10, :], axes=indexers) - store.append('p4d', p4d.ix[:, :, 10:, :]) + store.append('p4d', p4d.iloc[:, :, :10, :], axes=indexers) + store.append('p4d', p4d.iloc[:, :, 10:, :]) assert_panel4d_equal(store.select('p4d'), p4d) check_indexers('p4d', indexers) # same as above, but try to append with differnt axes _maybe_remove(store, 'p4d') - store.append('p4d', p4d.ix[:, :, :10, :], axes=indexers) - store.append('p4d', p4d.ix[:, :, 10:, :], axes=[ + store.append('p4d', p4d.iloc[:, :, :10, :], axes=indexers) + store.append('p4d', p4d.iloc[:, :, 10:, :], axes=[ 'labels', 'items', 'major_axis']) assert_panel4d_equal(store.select('p4d'), p4d) check_indexers('p4d', indexers) # pass incorrect number of axes _maybe_remove(store, 'p4d') - self.assertRaises(ValueError, store.append, 'p4d', p4d.ix[ + pytest.raises(ValueError, store.append, 'p4d', p4d.iloc[ :, :, :10, :], axes=['major_axis', 'minor_axis']) # different than default indexables #1 indexers = ['labels', 'major_axis', 'minor_axis'] _maybe_remove(store, 'p4d') - store.append('p4d', p4d.ix[:, :, :10, :], axes=indexers) - store.append('p4d', p4d.ix[:, :, 10:, :]) + store.append('p4d', p4d.iloc[:, :, :10, :], axes=indexers) + store.append('p4d', p4d.iloc[:, :, 10:, :]) assert_panel4d_equal(store['p4d'], p4d) check_indexers('p4d', indexers) # different than default indexables #2 indexers = ['major_axis', 'labels', 'minor_axis'] _maybe_remove(store, 'p4d') - store.append('p4d', p4d.ix[:, :, :10, :], axes=indexers) - store.append('p4d', p4d.ix[:, :, 10:, :]) + store.append('p4d', p4d.iloc[:, :, :10, :], axes=indexers) + store.append('p4d', p4d.iloc[:, :, 10:, :]) assert_panel4d_equal(store['p4d'], p4d) check_indexers('p4d', indexers) @@ -1290,15 +1276,15 @@ def check_indexers(key, indexers): assert_panel4d_equal(result, expected) # partial selection2 - result = store.select('p4d', [Term( - 'labels=l1'), Term('items=ItemA'), Term('minor_axis=B')]) + result = 
store.select( + 'p4d', "labels='l1' and items='ItemA' and minor_axis='B'") expected = p4d.reindex( labels=['l1'], items=['ItemA'], minor_axis=['B']) assert_panel4d_equal(result, expected) # non-existant partial selection - result = store.select('p4d', [Term( - 'labels=l1'), Term('items=Item1'), Term('minor_axis=B')]) + result = store.select( + 'p4d', "labels='l1' and items='Item1' and minor_axis='B'") expected = p4d.reindex(labels=['l1'], items=[], minor_axis=['B']) assert_panel4d_equal(result, expected) @@ -1306,106 +1292,109 @@ def check_indexers(key, indexers): def test_append_with_strings(self): with ensure_clean_store(self.path) as store: - wp = tm.makePanel() - wp2 = wp.rename_axis( - dict([(x, "%s_extra" % x) for x in wp.minor_axis]), axis=2) - - def check_col(key, name, size): - self.assertEqual(getattr(store.get_storer( - key).table.description, name).itemsize, size) - - store.append('s1', wp, min_itemsize=20) - store.append('s1', wp2) - expected = concat([wp, wp2], axis=2) - expected = expected.reindex(minor_axis=sorted(expected.minor_axis)) - assert_panel_equal(store['s1'], expected) - check_col('s1', 'minor_axis', 20) - - # test dict format - store.append('s2', wp, min_itemsize={'minor_axis': 20}) - store.append('s2', wp2) - expected = concat([wp, wp2], axis=2) - expected = expected.reindex(minor_axis=sorted(expected.minor_axis)) - assert_panel_equal(store['s2'], expected) - check_col('s2', 'minor_axis', 20) - - # apply the wrong field (similar to #1) - store.append('s3', wp, min_itemsize={'major_axis': 20}) - self.assertRaises(ValueError, store.append, 's3', wp2) - - # test truncation of bigger strings - store.append('s4', wp) - self.assertRaises(ValueError, store.append, 's4', wp2) - - # avoid truncation on elements - df = DataFrame([[123, 'asdqwerty'], [345, 'dggnhebbsdfbdfb']]) - store.append('df_big', df) - tm.assert_frame_equal(store.select('df_big'), df) - check_col('df_big', 'values_block_1', 15) - - # appending smaller string ok - df2 = DataFrame([[124, 'asdqy'], [346, 'dggnhefbdfb']]) - store.append('df_big', df2) - expected = concat([df, df2]) - tm.assert_frame_equal(store.select('df_big'), expected) - check_col('df_big', 'values_block_1', 15) - - # avoid truncation on elements - df = DataFrame([[123, 'asdqwerty'], [345, 'dggnhebbsdfbdfb']]) - store.append('df_big2', df, min_itemsize={'values': 50}) - tm.assert_frame_equal(store.select('df_big2'), df) - check_col('df_big2', 'values_block_1', 50) - - # bigger string on next append - store.append('df_new', df) - df_new = DataFrame( - [[124, 'abcdefqhij'], [346, 'abcdefghijklmnopqrtsuvwxyz']]) - self.assertRaises(ValueError, store.append, 'df_new', df_new) - - # min_itemsize on Series index (GH 11412) - df = tm.makeMixedDataFrame().set_index('C') - store.append('ss', df['B'], min_itemsize={'index': 4}) - tm.assert_series_equal(store.select('ss'), df['B']) - - # same as above, with data_columns=True - store.append('ss2', df['B'], data_columns=True, - min_itemsize={'index': 4}) - tm.assert_series_equal(store.select('ss2'), df['B']) - - # min_itemsize in index without appending (GH 10381) - store.put('ss3', df, format='table', - min_itemsize={'index': 6}) - # just make sure there is a longer string: - df2 = df.copy().reset_index().assign(C='longer').set_index('C') - store.append('ss3', df2) - tm.assert_frame_equal(store.select('ss3'), - pd.concat([df, df2])) - - # same as above, with a Series - store.put('ss4', df['B'], format='table', - min_itemsize={'index': 6}) - store.append('ss4', df2['B']) - 
tm.assert_series_equal(store.select('ss4'), - pd.concat([df['B'], df2['B']])) - - # with nans - _maybe_remove(store, 'df') - df = tm.makeTimeDataFrame() - df['string'] = 'foo' - df.ix[1:4, 'string'] = np.nan - df['string2'] = 'bar' - df.ix[4:8, 'string2'] = np.nan - df['string3'] = 'bah' - df.ix[1:, 'string3'] = np.nan - store.append('df', df) - result = store.select('df') - tm.assert_frame_equal(result, df) + with catch_warnings(record=True): + wp = tm.makePanel() + wp2 = wp.rename_axis( + dict([(x, "%s_extra" % x) for x in wp.minor_axis]), axis=2) + + def check_col(key, name, size): + assert getattr(store.get_storer(key) + .table.description, name).itemsize == size + + store.append('s1', wp, min_itemsize=20) + store.append('s1', wp2) + expected = concat([wp, wp2], axis=2) + expected = expected.reindex( + minor_axis=sorted(expected.minor_axis)) + assert_panel_equal(store['s1'], expected) + check_col('s1', 'minor_axis', 20) + + # test dict format + store.append('s2', wp, min_itemsize={'minor_axis': 20}) + store.append('s2', wp2) + expected = concat([wp, wp2], axis=2) + expected = expected.reindex( + minor_axis=sorted(expected.minor_axis)) + assert_panel_equal(store['s2'], expected) + check_col('s2', 'minor_axis', 20) + + # apply the wrong field (similar to #1) + store.append('s3', wp, min_itemsize={'major_axis': 20}) + pytest.raises(ValueError, store.append, 's3', wp2) + + # test truncation of bigger strings + store.append('s4', wp) + pytest.raises(ValueError, store.append, 's4', wp2) + + # avoid truncation on elements + df = DataFrame([[123, 'asdqwerty'], [345, 'dggnhebbsdfbdfb']]) + store.append('df_big', df) + tm.assert_frame_equal(store.select('df_big'), df) + check_col('df_big', 'values_block_1', 15) + + # appending smaller string ok + df2 = DataFrame([[124, 'asdqy'], [346, 'dggnhefbdfb']]) + store.append('df_big', df2) + expected = concat([df, df2]) + tm.assert_frame_equal(store.select('df_big'), expected) + check_col('df_big', 'values_block_1', 15) + + # avoid truncation on elements + df = DataFrame([[123, 'asdqwerty'], [345, 'dggnhebbsdfbdfb']]) + store.append('df_big2', df, min_itemsize={'values': 50}) + tm.assert_frame_equal(store.select('df_big2'), df) + check_col('df_big2', 'values_block_1', 50) + + # bigger string on next append + store.append('df_new', df) + df_new = DataFrame( + [[124, 'abcdefqhij'], [346, 'abcdefghijklmnopqrtsuvwxyz']]) + pytest.raises(ValueError, store.append, 'df_new', df_new) + + # min_itemsize on Series index (GH 11412) + df = tm.makeMixedDataFrame().set_index('C') + store.append('ss', df['B'], min_itemsize={'index': 4}) + tm.assert_series_equal(store.select('ss'), df['B']) + + # same as above, with data_columns=True + store.append('ss2', df['B'], data_columns=True, + min_itemsize={'index': 4}) + tm.assert_series_equal(store.select('ss2'), df['B']) + + # min_itemsize in index without appending (GH 10381) + store.put('ss3', df, format='table', + min_itemsize={'index': 6}) + # just make sure there is a longer string: + df2 = df.copy().reset_index().assign(C='longer').set_index('C') + store.append('ss3', df2) + tm.assert_frame_equal(store.select('ss3'), + pd.concat([df, df2])) + + # same as above, with a Series + store.put('ss4', df['B'], format='table', + min_itemsize={'index': 6}) + store.append('ss4', df2['B']) + tm.assert_series_equal(store.select('ss4'), + pd.concat([df['B'], df2['B']])) + + # with nans + _maybe_remove(store, 'df') + df = tm.makeTimeDataFrame() + df['string'] = 'foo' + df.loc[1:4, 'string'] = np.nan + df['string2'] = 'bar' + 
df.loc[4:8, 'string2'] = np.nan
+                df['string3'] = 'bah'
+                df.loc[1:, 'string3'] = np.nan
+                store.append('df', df)
+                result = store.select('df')
+                tm.assert_frame_equal(result, df)

         with ensure_clean_store(self.path) as store:

             def check_col(key, name, size):
-                self.assertEqual(getattr(store.get_storer(
-                    key).table.description, name).itemsize, size)
+                assert getattr(store.get_storer(key)
+                               .table.description, name).itemsize == size

             df = DataFrame(dict(A='foo', B='bar'), index=range(10))

@@ -1413,13 +1402,13 @@ def check_col(key, name, size):
             _maybe_remove(store, 'df')
             store.append('df', df, min_itemsize={'A': 200})
             check_col('df', 'A', 200)
-            self.assertEqual(store.get_storer('df').data_columns, ['A'])
+            assert store.get_storer('df').data_columns == ['A']

             # a min_itemsize that creates a data_column2
             _maybe_remove(store, 'df')
             store.append('df', df, data_columns=['B'],
                          min_itemsize={'A': 200})
             check_col('df', 'A', 200)
-            self.assertEqual(store.get_storer('df').data_columns, ['B', 'A'])
+            assert store.get_storer('df').data_columns == ['B', 'A']

             # a min_itemsize that creates a data_column2
             _maybe_remove(store, 'df')
@@ -1427,7 +1416,7 @@ def check_col(key, name, size):
                          'B'], min_itemsize={'values': 200})
             check_col('df', 'B', 200)
             check_col('df', 'values_block_0', 200)
-            self.assertEqual(store.get_storer('df').data_columns, ['B'])
+            assert store.get_storer('df').data_columns == ['B']

             # infer the .typ on subsequent appends
             _maybe_remove(store, 'df')

@@ -1439,8 +1428,8 @@ def check_col(key, name, size):
             df = DataFrame(['foo', 'foo', 'foo', 'barh',
                             'barh', 'barh'], columns=['A'])
             _maybe_remove(store, 'df')
-            self.assertRaises(ValueError, store.append, 'df',
-                              df, min_itemsize={'foo': 20, 'foobar': 20})
+            pytest.raises(ValueError, store.append, 'df',
+                          df, min_itemsize={'foo': 20, 'foobar': 20})

     def test_to_hdf_with_min_itemsize(self):

@@ -1466,7 +1455,7 @@ def test_append_with_data_columns(self):

         with ensure_clean_store(self.path) as store:
             df = tm.makeTimeDataFrame()
-            df.loc[:, 'B'].iloc[0] = 1.
+            df.iloc[0, df.columns.get_loc('B')] = 1.
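            # (single positional assignment; the removed chained form
            #  .loc[:, 'B'].iloc[0] may write into a temporary copy of
            #  the column rather than into df itself)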
_maybe_remove(store, 'df') store.append('df', df[:2], data_columns=['B']) store.append('df', df[2:]) @@ -1477,13 +1466,13 @@ def test_append_with_data_columns(self): assert(store._handle.root.df.table.cols.B.is_indexed is True) # data column searching - result = store.select('df', [Term('B>0')]) + result = store.select('df', 'B>0') expected = df[df.B > 0] tm.assert_frame_equal(result, expected) # data column searching (with an indexable and a data_columns) result = store.select( - 'df', [Term('B>0'), Term('index>df.index[3]')]) + 'df', 'B>0 and index>df.index[3]') df_new = df.reindex(index=df.index[4:]) expected = df_new[df_new.B > 0] tm.assert_frame_equal(result, expected) @@ -1495,14 +1484,14 @@ def test_append_with_data_columns(self): df_new.loc[5:6, 'string'] = 'bar' _maybe_remove(store, 'df') store.append('df', df_new, data_columns=['string']) - result = store.select('df', [Term('string=foo')]) + result = store.select('df', "string='foo'") expected = df_new[df_new.string == 'foo'] tm.assert_frame_equal(result, expected) # using min_itemsize and a data column def check_col(key, name, size): - self.assertEqual(getattr(store.get_storer( - key).table.description, name).itemsize, size) + assert getattr(store.get_storer(key) + .table.description, name).itemsize == size with ensure_clean_store(self.path) as store: _maybe_remove(store, 'df') @@ -1533,26 +1522,30 @@ def check_col(key, name, size): with ensure_clean_store(self.path) as store: # multiple data columns df_new = df.copy() - df_new.ix[0, 'A'] = 1. - df_new.ix[0, 'B'] = -1. + df_new.iloc[0, df_new.columns.get_loc('A')] = 1. + df_new.iloc[0, df_new.columns.get_loc('B')] = -1. df_new['string'] = 'foo' - df_new.loc[1:4, 'string'] = np.nan - df_new.loc[5:6, 'string'] = 'bar' + + sl = df_new.columns.get_loc('string') + df_new.iloc[1:4, sl] = np.nan + df_new.iloc[5:6, sl] = 'bar' + df_new['string2'] = 'foo' - df_new.loc[2:5, 'string2'] = np.nan - df_new.loc[7:8, 'string2'] = 'bar' + sl = df_new.columns.get_loc('string2') + df_new.iloc[2:5, sl] = np.nan + df_new.iloc[7:8, sl] = 'bar' _maybe_remove(store, 'df') store.append( 'df', df_new, data_columns=['A', 'B', 'string', 'string2']) - result = store.select('df', [Term('string=foo'), Term( - 'string2=foo'), Term('A>0'), Term('B<0')]) + result = store.select('df', + "string='foo' and string2='foo'" + " and A>0 and B<0") expected = df_new[(df_new.string == 'foo') & ( df_new.string2 == 'foo') & (df_new.A > 0) & (df_new.B < 0)] tm.assert_frame_equal(result, expected, check_index_type=False) # yield an empty frame - result = store.select('df', [Term('string=foo'), Term( - 'string2=cool')]) + result = store.select('df', "string='foo' and string2='cool'") expected = df_new[(df_new.string == 'foo') & ( df_new.string2 == 'cool')] tm.assert_frame_equal(result, expected, check_index_type=False) @@ -1561,18 +1554,18 @@ def check_col(key, name, size): # doc example df_dc = df.copy() df_dc['string'] = 'foo' - df_dc.ix[4:6, 'string'] = np.nan - df_dc.ix[7:9, 'string'] = 'bar' + df_dc.loc[4:6, 'string'] = np.nan + df_dc.loc[7:9, 'string'] = 'bar' df_dc['string2'] = 'cool' df_dc['datetime'] = Timestamp('20010102') df_dc = df_dc._convert(datetime=True) - df_dc.ix[3:5, ['A', 'B', 'datetime']] = np.nan + df_dc.loc[3:5, ['A', 'B', 'datetime']] = np.nan _maybe_remove(store, 'df_dc') store.append('df_dc', df_dc, data_columns=['B', 'C', 'string', 'string2', 'datetime']) - result = store.select('df_dc', [Term('B>0')]) + result = store.select('df_dc', 'B>0') expected = df_dc[df_dc.B > 0] tm.assert_frame_equal(result, 
expected, check_index_type=False) @@ -1590,16 +1583,16 @@ def check_col(key, name, size): df_dc = DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C']) df_dc['string'] = 'foo' - df_dc.ix[4:6, 'string'] = np.nan - df_dc.ix[7:9, 'string'] = 'bar' - df_dc.ix[:, ['B', 'C']] = df_dc.ix[:, ['B', 'C']].abs() + df_dc.loc[4:6, 'string'] = np.nan + df_dc.loc[7:9, 'string'] = 'bar' + df_dc.loc[:, ['B', 'C']] = df_dc.loc[:, ['B', 'C']].abs() df_dc['string2'] = 'cool' # on-disk operations store.append('df_dc', df_dc, data_columns=[ 'B', 'C', 'string', 'string2']) - result = store.select('df_dc', [Term('B>0')]) + result = store.select('df_dc', 'B>0') expected = df_dc[df_dc.B > 0] tm.assert_frame_equal(result, expected) @@ -1610,97 +1603,103 @@ def check_col(key, name, size): tm.assert_frame_equal(result, expected) with ensure_clean_store(self.path) as store: - # panel - # GH5717 not handling data_columns - np.random.seed(1234) - p = tm.makePanel() + with catch_warnings(record=True): + # panel + # GH5717 not handling data_columns + np.random.seed(1234) + p = tm.makePanel() - store.append('p1', p) - tm.assert_panel_equal(store.select('p1'), p) + store.append('p1', p) + tm.assert_panel_equal(store.select('p1'), p) - store.append('p2', p, data_columns=True) - tm.assert_panel_equal(store.select('p2'), p) + store.append('p2', p, data_columns=True) + tm.assert_panel_equal(store.select('p2'), p) - result = store.select('p2', where='ItemA>0') - expected = p.to_frame() - expected = expected[expected['ItemA'] > 0] - tm.assert_frame_equal(result.to_frame(), expected) + result = store.select('p2', where='ItemA>0') + expected = p.to_frame() + expected = expected[expected['ItemA'] > 0] + tm.assert_frame_equal(result.to_frame(), expected) - result = store.select('p2', where='ItemA>0 & minor_axis=["A","B"]') - expected = p.to_frame() - expected = expected[expected['ItemA'] > 0] - expected = expected[expected.reset_index( - level=['major']).index.isin(['A', 'B'])] - tm.assert_frame_equal(result.to_frame(), expected) + result = store.select( + 'p2', where='ItemA>0 & minor_axis=["A","B"]') + expected = p.to_frame() + expected = expected[expected['ItemA'] > 0] + expected = expected[expected.reset_index( + level=['major']).index.isin(['A', 'B'])] + tm.assert_frame_equal(result.to_frame(), expected) def test_create_table_index(self): with ensure_clean_store(self.path) as store: - def col(t, column): - return getattr(store.get_storer(t).table.cols, column) + with catch_warnings(record=True): + def col(t, column): + return getattr(store.get_storer(t).table.cols, column) - # index=False - wp = tm.makePanel() - store.append('p5', wp, index=False) - store.create_table_index('p5', columns=['major_axis']) - assert(col('p5', 'major_axis').is_indexed is True) - assert(col('p5', 'minor_axis').is_indexed is False) - - # index=True - store.append('p5i', wp, index=True) - assert(col('p5i', 'major_axis').is_indexed is True) - assert(col('p5i', 'minor_axis').is_indexed is True) - - # default optlevels - store.get_storer('p5').create_index() - assert(col('p5', 'major_axis').index.optlevel == 6) - assert(col('p5', 'minor_axis').index.kind == 'medium') - - # let's change the indexing scheme - store.create_table_index('p5') - assert(col('p5', 'major_axis').index.optlevel == 6) - assert(col('p5', 'minor_axis').index.kind == 'medium') - store.create_table_index('p5', optlevel=9) - assert(col('p5', 'major_axis').index.optlevel == 9) - assert(col('p5', 'minor_axis').index.kind == 'medium') - store.create_table_index('p5', 
kind='full') - assert(col('p5', 'major_axis').index.optlevel == 9) - assert(col('p5', 'minor_axis').index.kind == 'full') - store.create_table_index('p5', optlevel=1, kind='light') - assert(col('p5', 'major_axis').index.optlevel == 1) - assert(col('p5', 'minor_axis').index.kind == 'light') - - # data columns - df = tm.makeTimeDataFrame() - df['string'] = 'foo' - df['string2'] = 'bar' - store.append('f', df, data_columns=['string', 'string2']) - assert(col('f', 'index').is_indexed is True) - assert(col('f', 'string').is_indexed is True) - assert(col('f', 'string2').is_indexed is True) - - # specify index=columns - store.append( - 'f2', df, index=['string'], data_columns=['string', 'string2']) - assert(col('f2', 'index').is_indexed is False) - assert(col('f2', 'string').is_indexed is True) - assert(col('f2', 'string2').is_indexed is False) + # index=False + wp = tm.makePanel() + store.append('p5', wp, index=False) + store.create_table_index('p5', columns=['major_axis']) + assert(col('p5', 'major_axis').is_indexed is True) + assert(col('p5', 'minor_axis').is_indexed is False) + + # index=True + store.append('p5i', wp, index=True) + assert(col('p5i', 'major_axis').is_indexed is True) + assert(col('p5i', 'minor_axis').is_indexed is True) + + # default optlevels + store.get_storer('p5').create_index() + assert(col('p5', 'major_axis').index.optlevel == 6) + assert(col('p5', 'minor_axis').index.kind == 'medium') + + # let's change the indexing scheme + store.create_table_index('p5') + assert(col('p5', 'major_axis').index.optlevel == 6) + assert(col('p5', 'minor_axis').index.kind == 'medium') + store.create_table_index('p5', optlevel=9) + assert(col('p5', 'major_axis').index.optlevel == 9) + assert(col('p5', 'minor_axis').index.kind == 'medium') + store.create_table_index('p5', kind='full') + assert(col('p5', 'major_axis').index.optlevel == 9) + assert(col('p5', 'minor_axis').index.kind == 'full') + store.create_table_index('p5', optlevel=1, kind='light') + assert(col('p5', 'major_axis').index.optlevel == 1) + assert(col('p5', 'minor_axis').index.kind == 'light') + + # data columns + df = tm.makeTimeDataFrame() + df['string'] = 'foo' + df['string2'] = 'bar' + store.append('f', df, data_columns=['string', 'string2']) + assert(col('f', 'index').is_indexed is True) + assert(col('f', 'string').is_indexed is True) + assert(col('f', 'string2').is_indexed is True) + + # specify index=columns + store.append( + 'f2', df, index=['string'], + data_columns=['string', 'string2']) + assert(col('f2', 'index').is_indexed is False) + assert(col('f2', 'string').is_indexed is True) + assert(col('f2', 'string2').is_indexed is False) - # try to index a non-table - _maybe_remove(store, 'f2') - store.put('f2', df) - self.assertRaises(TypeError, store.create_table_index, 'f2') + # try to index a non-table + _maybe_remove(store, 'f2') + store.put('f2', df) + pytest.raises(TypeError, store.create_table_index, 'f2') def test_append_diff_item_order(self): - wp = tm.makePanel() - wp1 = wp.ix[:, :10, :] - wp2 = wp.ix[['ItemC', 'ItemB', 'ItemA'], 10:, :] + with catch_warnings(record=True): + wp = tm.makePanel() + wp1 = wp.iloc[:, :10, :] + wp2 = wp.iloc[wp.items.get_indexer(['ItemC', 'ItemB', 'ItemA']), + 10:, :] - with ensure_clean_store(self.path) as store: - store.put('panel', wp1, format='table') - self.assertRaises(ValueError, store.put, 'panel', wp2, + with ensure_clean_store(self.path) as store: + store.put('panel', wp1, format='table') + pytest.raises(ValueError, store.put, 'panel', wp2, append=True) def 
test_append_hierarchical(self): @@ -1752,10 +1751,10 @@ def test_column_multiindex(self): check_index_type=True, check_column_type=True) - self.assertRaises(ValueError, store.put, 'df2', df, - format='table', data_columns=['A']) - self.assertRaises(ValueError, store.put, 'df3', df, - format='table', data_columns=True) + pytest.raises(ValueError, store.put, 'df2', df, + format='table', data_columns=['A']) + pytest.raises(ValueError, store.put, 'df3', df, + format='table', data_columns=True) # appending multi-column on existing table (see GH 6167) with ensure_clean_store(self.path) as store: @@ -1818,13 +1817,13 @@ def make_index(names=None): _maybe_remove(store, 'df') df = DataFrame(np.zeros((12, 2)), columns=[ 'a', 'b'], index=make_index(['date', 'a', 't'])) - self.assertRaises(ValueError, store.append, 'df', df) + pytest.raises(ValueError, store.append, 'df', df) # dup within level _maybe_remove(store, 'df') df = DataFrame(np.zeros((12, 2)), columns=['a', 'b'], index=make_index(['date', 'date', 'date'])) - self.assertRaises(ValueError, store.append, 'df', df) + pytest.raises(ValueError, store.append, 'df', df) # fully names _maybe_remove(store, 'df') @@ -1883,26 +1882,25 @@ def test_pass_spec_to_storer(self): with ensure_clean_store(self.path) as store: store.put('df', df) - self.assertRaises(TypeError, store.select, 'df', columns=['A']) - self.assertRaises(TypeError, store.select, - 'df', where=[('columns=A')]) + pytest.raises(TypeError, store.select, 'df', columns=['A']) + pytest.raises(TypeError, store.select, + 'df', where=[('columns=A')]) def test_append_misc(self): with ensure_clean_store(self.path) as store: - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): + with catch_warnings(record=True): # unsuported data types for non-tables p4d = tm.makePanel4D() - self.assertRaises(TypeError, store.put, 'p4d', p4d) + pytest.raises(TypeError, store.put, 'p4d', p4d) # unsuported data types - self.assertRaises(TypeError, store.put, 'abc', None) - self.assertRaises(TypeError, store.put, 'abc', '123') - self.assertRaises(TypeError, store.put, 'abc', 123) - self.assertRaises(TypeError, store.put, 'abc', np.arange(5)) + pytest.raises(TypeError, store.put, 'abc', None) + pytest.raises(TypeError, store.put, 'abc', '123') + pytest.raises(TypeError, store.put, 'abc', 123) + pytest.raises(TypeError, store.put, 'abc', np.arange(5)) df = tm.makeDataFrame() store.append('df', df, chunksize=1) @@ -1930,10 +1928,11 @@ def check(obj, comparator): df['time2'] = Timestamp('20130102') check(df, tm.assert_frame_equal) - p = tm.makePanel() - check(p, assert_panel_equal) + with catch_warnings(record=True): + p = tm.makePanel() + check(p, assert_panel_equal) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): p4d = tm.makePanel4D() check(p4d, assert_panel4d_equal) @@ -1943,7 +1942,7 @@ def check(obj, comparator): # 0 len df_empty = DataFrame(columns=list('ABC')) store.append('df', df_empty) - self.assertRaises(KeyError, store.select, 'df') + pytest.raises(KeyError, store.select, 'df') # repeated append of 0/non-zero frames df = DataFrame(np.random.rand(10, 3), columns=list('ABC')) @@ -1957,21 +1956,23 @@ def check(obj, comparator): store.put('df2', df) assert_frame_equal(store.select('df2'), df) - # 0 len - p_empty = Panel(items=list('ABC')) - store.append('p', p_empty) - self.assertRaises(KeyError, store.select, 'p') + with catch_warnings(record=True): - # repeated append of 0/non-zero frames - p = Panel(np.random.randn(3, 4, 5), 
items=list('ABC'))
-            store.append('p', p)
-            assert_panel_equal(store.select('p'), p)
-            store.append('p', p_empty)
-            assert_panel_equal(store.select('p'), p)
+                # 0 len
+                p_empty = Panel(items=list('ABC'))
+                store.append('p', p_empty)
+                pytest.raises(KeyError, store.select, 'p')

-            # store
-            store.put('p2', p_empty)
-            assert_panel_equal(store.select('p2'), p_empty)
+                # repeated append of 0/non-zero frames
+                p = Panel(np.random.randn(3, 4, 5), items=list('ABC'))
+                store.append('p', p)
+                assert_panel_equal(store.select('p'), p)
+                store.append('p', p_empty)
+                assert_panel_equal(store.select('p'), p)
+
+                # store
+                store.put('p2', p_empty)
+                assert_panel_equal(store.select('p2'), p_empty)

     def test_append_raise(self):

@@ -1982,13 +1983,13 @@ def test_append_raise(self):

             # list in column
             df = tm.makeDataFrame()
             df['invalid'] = [['a']] * len(df)
-            self.assertEqual(df.dtypes['invalid'], np.object_)
-            self.assertRaises(TypeError, store.append, 'df', df)
+            assert df.dtypes['invalid'] == np.object_
+            pytest.raises(TypeError, store.append, 'df', df)

             # multiple invalid columns
             df['invalid2'] = [['a']] * len(df)
             df['invalid3'] = [['a']] * len(df)
-            self.assertRaises(TypeError, store.append, 'df', df)
+            pytest.raises(TypeError, store.append, 'df', df)

             # datetime with embedded nans as object
             df = tm.makeDataFrame()
@@ -1996,22 +1997,22 @@ def test_append_raise(self):
             s = s.astype(object)
             s[0:5] = np.nan
             df['invalid'] = s
-            self.assertEqual(df.dtypes['invalid'], np.object_)
-            self.assertRaises(TypeError, store.append, 'df', df)
+            assert df.dtypes['invalid'] == np.object_
+            pytest.raises(TypeError, store.append, 'df', df)

             # directy ndarray
-            self.assertRaises(TypeError, store.append, 'df', np.arange(10))
+            pytest.raises(TypeError, store.append, 'df', np.arange(10))

             # series directly
-            self.assertRaises(TypeError, store.append,
-                              'df', Series(np.arange(10)))
+            pytest.raises(TypeError, store.append,
+                          'df', Series(np.arange(10)))

-            # appending an incompatbile table
+            # appending an incompatible table
             df = tm.makeDataFrame()
             store.append('df', df)
             df['foo'] = 'foo'
-            self.assertRaises(ValueError, store.append, 'df', df)
+            pytest.raises(ValueError, store.append, 'df', df)

     def test_table_index_incompatible_dtypes(self):
         df1 = DataFrame({'a': [1, 2, 3]})
@@ -2020,8 +2021,8 @@ def test_table_index_incompatible_dtypes(self):

         with ensure_clean_store(self.path) as store:
             store.put('frame', df1, format='table')
-            self.assertRaises(TypeError, store.put, 'frame', df2,
-                              format='table', append=True)
+            pytest.raises(TypeError, store.put, 'frame', df2,
+                          format='table', append=True)

     def test_table_values_dtypes_roundtrip(self):

@@ -2035,7 +2036,7 @@ def test_table_values_dtypes_roundtrip(self):
             assert_series_equal(df2.dtypes, store['df_i8'].dtypes)

             # incompatible dtype
-            self.assertRaises(ValueError, store.append, 'df_i8', df1)
+            pytest.raises(ValueError, store.append, 'df_i8', df1)

             # check creation/storage/retrieval of float32 (a bit hacky to
             # actually create them thought)
@@ -2061,8 +2062,8 @@ def test_table_values_dtypes_roundtrip(self):
             expected = Series({'float32': 2, 'float64': 1, 'int32': 1,
                                'bool': 1, 'int16': 1, 'int8': 1,
                                'int64': 1, 'object': 1, 'datetime64[ns]': 2})
-            result.sort()
-            expected.sort()
+            result = result.sort_index()
+            expected = expected.sort_index()
             tm.assert_series_equal(result, expected)

     def test_table_mixed_dtypes(self):

@@ -2080,28 +2081,32 @@ def test_table_mixed_dtypes(self):
         df['timestamp2'] = Timestamp('20010103')
         df['datetime1'] = datetime.datetime(2001, 1, 2, 0, 0)
         df['datetime2'] = 
datetime.datetime(2001, 1, 3, 0, 0) - df.ix[3:6, ['obj1']] = np.nan - df = df.consolidate()._convert(datetime=True) + df.loc[3:6, ['obj1']] = np.nan + df = df._consolidate()._convert(datetime=True) with ensure_clean_store(self.path) as store: store.append('df1_mixed', df) tm.assert_frame_equal(store.select('df1_mixed'), df) - # panel - wp = tm.makePanel() - wp['obj1'] = 'foo' - wp['obj2'] = 'bar' - wp['bool1'] = wp['ItemA'] > 0 - wp['bool2'] = wp['ItemB'] > 0 - wp['int1'] = 1 - wp['int2'] = 2 - wp = wp.consolidate() + with catch_warnings(record=True): - with ensure_clean_store(self.path) as store: - store.append('p1_mixed', wp) - assert_panel_equal(store.select('p1_mixed'), wp) + # panel + wp = tm.makePanel() + wp['obj1'] = 'foo' + wp['obj2'] = 'bar' + wp['bool1'] = wp['ItemA'] > 0 + wp['bool2'] = wp['ItemB'] > 0 + wp['int1'] = 1 + wp['int2'] = 2 + wp = wp._consolidate() - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): + + with ensure_clean_store(self.path) as store: + store.append('p1_mixed', wp) + assert_panel_equal(store.select('p1_mixed'), wp) + + with catch_warnings(record=True): # ndim wp = tm.makePanel4D() wp['obj1'] = 'foo' @@ -2110,7 +2115,7 @@ def test_table_mixed_dtypes(self): wp['bool2'] = wp['l2'] > 0 wp['int1'] = 1 wp['int2'] = 2 - wp = wp.consolidate() + wp = wp._consolidate() with ensure_clean_store(self.path) as store: store.append('p4d_mixed', wp) @@ -2130,7 +2135,7 @@ def test_unimplemented_dtypes_table_columns(self): for n, f in l: df = tm.makeDataFrame() df[n] = f - self.assertRaises( + pytest.raises( TypeError, store.append, 'df1_%s' % n, df) # frame @@ -2138,11 +2143,11 @@ def test_unimplemented_dtypes_table_columns(self): df['obj1'] = 'foo' df['obj2'] = 'bar' df['datetime1'] = datetime.date(2001, 1, 2) - df = df.consolidate()._convert(datetime=True) + df = df._consolidate()._convert(datetime=True) with ensure_clean_store(self.path) as store: # this fails because we have a date in the object block...... 
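The assertion rewrites in these hunks are mechanical: `self.assertRaises(Exc, func, *args, **kwargs)` becomes `pytest.raises(Exc, func, *args, **kwargs)` with the same call signature, and `self.assertEqual(a, b)` / `self.assertTrue(x)` become plain `assert`s that pytest can introspect. A minimal sketch of the two spellings, assuming a hypothetical `store` fixture (not part of this diff):

```python
import numpy as np
import pytest


def test_put_rejects_unsupported_types(store):  # `store` is an assumed fixture
    # unittest-style spelling, as removed throughout this diff:
    #     self.assertRaises(TypeError, store.put, 'abc', np.arange(5))
    # pytest call form, a drop-in replacement:
    pytest.raises(TypeError, store.put, 'abc', np.arange(5))

    # the context-manager form is equivalent and generally preferred:
    with pytest.raises(TypeError):
        store.put('abc', np.arange(5))
```

Both forms fail the test if no `TypeError` is raised; the call form keeps the one-line shape of `assertRaises`, which is presumably why this conversion uses it.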
- self.assertRaises(TypeError, store.append, 'df_unimplemented', df) + pytest.raises(TypeError, store.append, 'df_unimplemented', df) def test_calendar_roundtrip_issue(self): @@ -2173,11 +2178,10 @@ def test_append_with_timedelta(self): # GH 3577 # append timedelta - from datetime import timedelta df = DataFrame(dict(A=Timestamp('20130101'), B=[Timestamp( '20130101') + timedelta(days=i, seconds=10) for i in range(10)])) df['C'] = df['A'] - df['B'] - df.ix[3:5, 'C'] = np.nan + df.loc[3:5, 'C'] = np.nan with ensure_clean_store(self.path) as store: @@ -2187,10 +2191,10 @@ def test_append_with_timedelta(self): result = store.select('df') assert_frame_equal(result, df) - result = store.select('df', Term("C<100000")) + result = store.select('df', where="C<100000") assert_frame_equal(result, df) - result = store.select('df', Term("C", "<", -3 * 86400)) + result = store.select('df', where="Cfoo') - self.assertRaises(KeyError, store.remove, 'a', [crit1]) + with catch_warnings(record=True): - # try to remove non-table (with crit) - # non-table ok (where = None) - wp = tm.makePanel(30) - store.put('wp', wp, format='table') - store.remove('wp', ["minor_axis=['A', 'D']"]) - rs = store.select('wp') - expected = wp.reindex(minor_axis=['B', 'C']) - assert_panel_equal(rs, expected) + # non-existance + crit1 = 'index>foo' + pytest.raises(KeyError, store.remove, 'a', [crit1]) - # empty where - _maybe_remove(store, 'wp') - store.put('wp', wp, format='table') + # try to remove non-table (with crit) + # non-table ok (where = None) + wp = tm.makePanel(30) + store.put('wp', wp, format='table') + store.remove('wp', ["minor_axis=['A', 'D']"]) + rs = store.select('wp') + expected = wp.reindex(minor_axis=['B', 'C']) + assert_panel_equal(rs, expected) - # deleted number (entire table) - n = store.remove('wp', []) - self.assertTrue(n == 120) + # empty where + _maybe_remove(store, 'wp') + store.put('wp', wp, format='table') - # non - empty where - _maybe_remove(store, 'wp') - store.put('wp', wp, format='table') - self.assertRaises(ValueError, store.remove, - 'wp', ['foo']) + # deleted number (entire table) + n = store.remove('wp', []) + assert n == 120 - # selectin non-table with a where - # store.put('wp2', wp, format='f') - # self.assertRaises(ValueError, store.remove, - # 'wp2', [('column', ['A', 'D'])]) + # non - empty where + _maybe_remove(store, 'wp') + store.put('wp', wp, format='table') + pytest.raises(ValueError, store.remove, + 'wp', ['foo']) def test_remove_startstop(self): # GH #4835 and #6177 with ensure_clean_store(self.path) as store: - wp = tm.makePanel(30) - - # start - _maybe_remove(store, 'wp1') - store.put('wp1', wp, format='t') - n = store.remove('wp1', start=32) - self.assertTrue(n == 120 - 32) - result = store.select('wp1') - expected = wp.reindex(major_axis=wp.major_axis[:32 // 4]) - assert_panel_equal(result, expected) - - _maybe_remove(store, 'wp2') - store.put('wp2', wp, format='t') - n = store.remove('wp2', start=-32) - self.assertTrue(n == 32) - result = store.select('wp2') - expected = wp.reindex(major_axis=wp.major_axis[:-32 // 4]) - assert_panel_equal(result, expected) - - # stop - _maybe_remove(store, 'wp3') - store.put('wp3', wp, format='t') - n = store.remove('wp3', stop=32) - self.assertTrue(n == 32) - result = store.select('wp3') - expected = wp.reindex(major_axis=wp.major_axis[32 // 4:]) - assert_panel_equal(result, expected) - - _maybe_remove(store, 'wp4') - store.put('wp4', wp, format='t') - n = store.remove('wp4', stop=-32) - self.assertTrue(n == 120 - 32) - result = 
store.select('wp4') - expected = wp.reindex(major_axis=wp.major_axis[-32 // 4:]) - assert_panel_equal(result, expected) - - # start n stop - _maybe_remove(store, 'wp5') - store.put('wp5', wp, format='t') - n = store.remove('wp5', start=16, stop=-16) - self.assertTrue(n == 120 - 32) - result = store.select('wp5') - expected = wp.reindex(major_axis=wp.major_axis[ - :16 // 4].union(wp.major_axis[-16 // 4:])) - assert_panel_equal(result, expected) - - _maybe_remove(store, 'wp6') - store.put('wp6', wp, format='t') - n = store.remove('wp6', start=16, stop=16) - self.assertTrue(n == 0) - result = store.select('wp6') - expected = wp.reindex(major_axis=wp.major_axis) - assert_panel_equal(result, expected) - - # with where - _maybe_remove(store, 'wp7') - - # TODO: unused? - date = wp.major_axis.take(np.arange(0, 30, 3)) # noqa - - crit = Term('major_axis=date') - store.put('wp7', wp, format='t') - n = store.remove('wp7', where=[crit], stop=80) - self.assertTrue(n == 28) - result = store.select('wp7') - expected = wp.reindex(major_axis=wp.major_axis.difference( - wp.major_axis[np.arange(0, 20, 3)])) - assert_panel_equal(result, expected) + with catch_warnings(record=True): + wp = tm.makePanel(30) + + # start + _maybe_remove(store, 'wp1') + store.put('wp1', wp, format='t') + n = store.remove('wp1', start=32) + assert n == 120 - 32 + result = store.select('wp1') + expected = wp.reindex(major_axis=wp.major_axis[:32 // 4]) + assert_panel_equal(result, expected) + + _maybe_remove(store, 'wp2') + store.put('wp2', wp, format='t') + n = store.remove('wp2', start=-32) + assert n == 32 + result = store.select('wp2') + expected = wp.reindex(major_axis=wp.major_axis[:-32 // 4]) + assert_panel_equal(result, expected) + + # stop + _maybe_remove(store, 'wp3') + store.put('wp3', wp, format='t') + n = store.remove('wp3', stop=32) + assert n == 32 + result = store.select('wp3') + expected = wp.reindex(major_axis=wp.major_axis[32 // 4:]) + assert_panel_equal(result, expected) + + _maybe_remove(store, 'wp4') + store.put('wp4', wp, format='t') + n = store.remove('wp4', stop=-32) + assert n == 120 - 32 + result = store.select('wp4') + expected = wp.reindex(major_axis=wp.major_axis[-32 // 4:]) + assert_panel_equal(result, expected) + + # start n stop + _maybe_remove(store, 'wp5') + store.put('wp5', wp, format='t') + n = store.remove('wp5', start=16, stop=-16) + assert n == 120 - 32 + result = store.select('wp5') + expected = wp.reindex( + major_axis=(wp.major_axis[:16 // 4] + .union(wp.major_axis[-16 // 4:]))) + assert_panel_equal(result, expected) + + _maybe_remove(store, 'wp6') + store.put('wp6', wp, format='t') + n = store.remove('wp6', start=16, stop=16) + assert n == 0 + result = store.select('wp6') + expected = wp.reindex(major_axis=wp.major_axis) + assert_panel_equal(result, expected) + + # with where + _maybe_remove(store, 'wp7') + + # TODO: unused? 
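The other recurring change here wraps every Panel-based block in `with catch_warnings(record=True):`, replacing `tm.assert_produces_warning(FutureWarning, ...)`. The semantics differ: rather than asserting that the deprecation warning fires, the rewritten tests capture and discard it so it cannot disturb the run. A minimal sketch of the pattern, assuming a pandas version that still ships `Panel` (pre-0.25):

```python
from warnings import catch_warnings

import numpy as np
import pandas as pd

with catch_warnings(record=True):
    # Panel is deprecated; with record=True the FutureWarning is routed
    # into a list instead of reaching the runner's warning filters.
    wp = pd.Panel(np.random.randn(2, 5, 4))
    assert wp.shape == (2, 5, 4)
```

Unlike `assert_produces_warning(FutureWarning)`, this makes no claim that the warning actually fired; the point is that these tests exercise Panel behavior, not its deprecation.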
+ date = wp.major_axis.take(np.arange(0, 30, 3)) # noqa + + crit = 'major_axis=date' + store.put('wp7', wp, format='t') + n = store.remove('wp7', where=[crit], stop=80) + assert n == 28 + result = store.select('wp7') + expected = wp.reindex(major_axis=wp.major_axis.difference( + wp.major_axis[np.arange(0, 20, 3)])) + assert_panel_equal(result, expected) def test_remove_crit(self): with ensure_clean_store(self.path) as store: - wp = tm.makePanel(30) - - # group row removal - _maybe_remove(store, 'wp3') - date4 = wp.major_axis.take([0, 1, 2, 4, 5, 6, 8, 9, 10]) - crit4 = Term('major_axis=date4') - store.put('wp3', wp, format='t') - n = store.remove('wp3', where=[crit4]) - self.assertTrue(n == 36) - - result = store.select('wp3') - expected = wp.reindex(major_axis=wp.major_axis.difference(date4)) - assert_panel_equal(result, expected) - - # upper half - _maybe_remove(store, 'wp') - store.put('wp', wp, format='table') - date = wp.major_axis[len(wp.major_axis) // 2] - - crit1 = Term('major_axis>date') - crit2 = Term("minor_axis=['A', 'D']") - n = store.remove('wp', where=[crit1]) - self.assertTrue(n == 56) - - n = store.remove('wp', where=[crit2]) - self.assertTrue(n == 32) - - result = store['wp'] - expected = wp.truncate(after=date).reindex(minor=['B', 'C']) - assert_panel_equal(result, expected) - - # individual row elements - _maybe_remove(store, 'wp2') - store.put('wp2', wp, format='table') - - date1 = wp.major_axis[1:3] - crit1 = Term('major_axis=date1') - store.remove('wp2', where=[crit1]) - result = store.select('wp2') - expected = wp.reindex(major_axis=wp.major_axis.difference(date1)) - assert_panel_equal(result, expected) - - date2 = wp.major_axis[5] - crit2 = Term('major_axis=date2') - store.remove('wp2', where=[crit2]) - result = store['wp2'] - expected = wp.reindex(major_axis=wp.major_axis.difference(date1) - .difference(Index([date2]))) - assert_panel_equal(result, expected) - - date3 = [wp.major_axis[7], wp.major_axis[9]] - crit3 = Term('major_axis=date3') - store.remove('wp2', where=[crit3]) - result = store['wp2'] - expected = wp.reindex(major_axis=wp.major_axis - .difference(date1) - .difference(Index([date2])) - .difference(Index(date3))) - assert_panel_equal(result, expected) - - # corners - _maybe_remove(store, 'wp4') - store.put('wp4', wp, format='table') - n = store.remove( - 'wp4', where=[Term('major_axis>wp.major_axis[-1]')]) - result = store.select('wp4') - assert_panel_equal(result, wp) + with catch_warnings(record=True): + wp = tm.makePanel(30) + + # group row removal + _maybe_remove(store, 'wp3') + date4 = wp.major_axis.take([0, 1, 2, 4, 5, 6, 8, 9, 10]) + crit4 = 'major_axis=date4' + store.put('wp3', wp, format='t') + n = store.remove('wp3', where=[crit4]) + assert n == 36 + + result = store.select('wp3') + expected = wp.reindex( + major_axis=wp.major_axis.difference(date4)) + assert_panel_equal(result, expected) + + # upper half + _maybe_remove(store, 'wp') + store.put('wp', wp, format='table') + date = wp.major_axis[len(wp.major_axis) // 2] + + crit1 = 'major_axis>date' + crit2 = "minor_axis=['A', 'D']" + n = store.remove('wp', where=[crit1]) + assert n == 56 + + n = store.remove('wp', where=[crit2]) + assert n == 32 + + result = store['wp'] + expected = wp.truncate(after=date).reindex(minor=['B', 'C']) + assert_panel_equal(result, expected) + + # individual row elements + _maybe_remove(store, 'wp2') + store.put('wp2', wp, format='table') + + date1 = wp.major_axis[1:3] + crit1 = 'major_axis=date1' + store.remove('wp2', where=[crit1]) + result = 
store.select('wp2') + expected = wp.reindex( + major_axis=wp.major_axis.difference(date1)) + assert_panel_equal(result, expected) + + date2 = wp.major_axis[5] + crit2 = 'major_axis=date2' + store.remove('wp2', where=[crit2]) + result = store['wp2'] + expected = wp.reindex( + major_axis=(wp.major_axis + .difference(date1) + .difference(Index([date2])) + )) + assert_panel_equal(result, expected) + + date3 = [wp.major_axis[7], wp.major_axis[9]] + crit3 = 'major_axis=date3' + store.remove('wp2', where=[crit3]) + result = store['wp2'] + expected = wp.reindex(major_axis=wp.major_axis + .difference(date1) + .difference(Index([date2])) + .difference(Index(date3))) + assert_panel_equal(result, expected) + + # corners + _maybe_remove(store, 'wp4') + store.put('wp4', wp, format='table') + n = store.remove( + 'wp4', where="major_axis>wp.major_axis[-1]") + result = store.select('wp4') + assert_panel_equal(result, wp) def test_invalid_terms(self): with ensure_clean_store(self.path) as store: - with compat_assert_produces_warning(FutureWarning): + with catch_warnings(record=True): df = tm.makeTimeDataFrame() df['string'] = 'foo' - df.ix[0:4, 'string'] = 'bar' + df.loc[0:4, 'string'] = 'bar' wp = tm.makePanel() p4d = tm.makePanel4D() @@ -2448,19 +2457,19 @@ def test_invalid_terms(self): store.put('p4d', p4d, format='table') # some invalid terms - self.assertRaises(ValueError, store.select, - 'wp', "minor=['A', 'B']") - self.assertRaises(ValueError, store.select, - 'wp', ["index=['20121114']"]) - self.assertRaises(ValueError, store.select, 'wp', [ + pytest.raises(ValueError, store.select, + 'wp', "minor=['A', 'B']") + pytest.raises(ValueError, store.select, + 'wp', ["index=['20121114']"]) + pytest.raises(ValueError, store.select, 'wp', [ "index=['20121114', '20121114']"]) - self.assertRaises(TypeError, Term) + pytest.raises(TypeError, Term) # more invalid - self.assertRaises( + pytest.raises( ValueError, store.select, 'df', 'df.index[3]') - self.assertRaises(SyntaxError, store.select, 'df', 'index>') - self.assertRaises( + pytest.raises(SyntaxError, store.select, 'df', 'index>') + pytest.raises( ValueError, store.select, 'wp', "major_axis<'20000108' & minor_axis['A', 'B']") @@ -2481,60 +2490,52 @@ def test_invalid_terms(self): 'ABCD'), index=date_range('20130101', periods=10)) dfq.to_hdf(path, 'dfq', format='table') - self.assertRaises(ValueError, read_hdf, path, - 'dfq', where="A>0 or C>0") + pytest.raises(ValueError, read_hdf, path, + 'dfq', where="A>0 or C>0") def test_terms(self): with ensure_clean_store(self.path) as store: - wp = tm.makePanel() - wpneg = Panel.fromDict({-1: tm.makeDataFrame(), - 0: tm.makeDataFrame(), - 1: tm.makeDataFrame()}) - - with compat_assert_produces_warning(FutureWarning): + with catch_warnings(record=True): + wp = tm.makePanel() + wpneg = Panel.fromDict({-1: tm.makeDataFrame(), + 0: tm.makeDataFrame(), + 1: tm.makeDataFrame()}) p4d = tm.makePanel4D() store.put('p4d', p4d, format='table') - - store.put('wp', wp, format='table') - store.put('wpneg', wpneg, format='table') - - # panel - result = store.select('wp', [Term( - 'major_axis<"20000108"'), Term("minor_axis=['A', 'B']")]) - expected = wp.truncate(after='20000108').reindex(minor=['A', 'B']) - assert_panel_equal(result, expected) - - # with deprecation - result = store.select('wp', [Term( - 'major_axis', '<', "20000108"), Term("minor_axis=['A', 'B']")]) - expected = wp.truncate(after='20000108').reindex(minor=['A', 'B']) - tm.assert_panel_equal(result, expected) + store.put('wp', wp, format='table') + store.put('wpneg', 
wpneg, format='table') + + # panel + result = store.select( + 'wp', + "major_axis<'20000108' and minor_axis=['A', 'B']") + expected = wp.truncate( + after='20000108').reindex(minor=['A', 'B']) + assert_panel_equal(result, expected) + + # with deprecation + result = store.select( + 'wp', where=("major_axis<'20000108' " + "and minor_axis=['A', 'B']")) + expected = wp.truncate( + after='20000108').reindex(minor=['A', 'B']) + tm.assert_panel_equal(result, expected) # p4d - with compat_assert_produces_warning(FutureWarning): + with catch_warnings(record=True): result = store.select('p4d', - [Term('major_axis<"20000108"'), - Term("minor_axis=['A', 'B']"), - Term("items=['ItemA', 'ItemB']")]) + ("major_axis<'20000108' and " + "minor_axis=['A', 'B'] and " + "items=['ItemA', 'ItemB']")) expected = p4d.truncate(after='20000108').reindex( minor=['A', 'B'], items=['ItemA', 'ItemB']) assert_panel4d_equal(result, expected) - # back compat invalid terms - terms = [dict(field='major_axis', op='>', value='20121114'), - [dict(field='major_axis', op='>', value='20121114')], - ["minor_axis=['A','B']", - dict(field='major_axis', op='>', value='20121114')]] - for t in terms: - with tm.assert_produces_warning(expected_warning=FutureWarning, - check_stacklevel=False): - Term(t) - - with compat_assert_produces_warning(FutureWarning): + with catch_warnings(record=True): # valid terms terms = [('major_axis=20121114'), @@ -2556,133 +2557,80 @@ def test_terms(self): store.select('p4d', t) # valid for p4d only - terms = [(("labels=['l1', 'l2']"),), - Term("labels=['l1', 'l2']"), - ] - + terms = ["labels=['l1', 'l2']"] for t in terms: store.select('p4d', t) - with tm.assertRaisesRegexp(TypeError, - 'Only named functions are supported'): - store.select('wp', Term( - 'major_axis == (lambda x: x)("20130101")')) + with tm.assert_raises_regex( + TypeError, 'Only named functions are supported'): + store.select( + 'wp', + 'major_axis == (lambda x: x)("20130101")') - # check USub node parsing - res = store.select('wpneg', Term('items == -1')) - expected = Panel({-1: wpneg[-1]}) - tm.assert_panel_equal(res, expected) + with catch_warnings(record=True): + # check USub node parsing + res = store.select('wpneg', 'items == -1') + expected = Panel({-1: wpneg[-1]}) + tm.assert_panel_equal(res, expected) - with tm.assertRaisesRegexp(NotImplementedError, - 'Unary addition not supported'): - store.select('wpneg', Term('items == +1')) + with tm.assert_raises_regex(NotImplementedError, + 'Unary addition ' + 'not supported'): + store.select('wpneg', 'items == +1') def test_term_compat(self): with ensure_clean_store(self.path) as store: - wp = Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'], - major_axis=date_range('1/1/2000', periods=5), - minor_axis=['A', 'B', 'C', 'D']) - store.append('wp', wp) - - result = store.select('wp', [Term('major_axis>20000102'), - Term('minor_axis', '=', ['A', 'B'])]) - expected = wp.loc[:, wp.major_axis > - Timestamp('20000102'), ['A', 'B']] - assert_panel_equal(result, expected) - - store.remove('wp', Term('major_axis>20000103')) - result = store.select('wp') - expected = wp.loc[:, wp.major_axis <= Timestamp('20000103'), :] - assert_panel_equal(result, expected) - - with ensure_clean_store(self.path) as store: - - wp = Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'], - major_axis=date_range('1/1/2000', periods=5), - minor_axis=['A', 'B', 'C', 'D']) - store.append('wp', wp) - - # stringified datetimes - result = store.select( - 'wp', [Term('major_axis', '>', datetime.datetime(2000, 1, 2))]) - 
expected = wp.loc[:, wp.major_axis > Timestamp('20000102')] - assert_panel_equal(result, expected) - - result = store.select( - 'wp', [Term('major_axis', '>', - datetime.datetime(2000, 1, 2, 0, 0))]) - expected = wp.loc[:, wp.major_axis > Timestamp('20000102')] - assert_panel_equal(result, expected) - - result = store.select( - 'wp', [Term('major_axis', '=', - [datetime.datetime(2000, 1, 2, 0, 0), - datetime.datetime(2000, 1, 3, 0, 0)])]) - expected = wp.loc[:, [Timestamp('20000102'), - Timestamp('20000103')]] - assert_panel_equal(result, expected) - - result = store.select('wp', [Term('minor_axis', '=', ['A', 'B'])]) - expected = wp.loc[:, :, ['A', 'B']] - assert_panel_equal(result, expected) - - def test_backwards_compat_without_term_object(self): - with ensure_clean_store(self.path) as store: - - wp = Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'], - major_axis=date_range('1/1/2000', periods=5), - minor_axis=['A', 'B', 'C', 'D']) - store.append('wp', wp) - with assert_produces_warning(expected_warning=FutureWarning, - check_stacklevel=False): - result = store.select('wp', [('major_axis>20000102'), - ('minor_axis', '=', ['A', 'B'])]) - expected = wp.loc[:, - wp.major_axis > Timestamp('20000102'), - ['A', 'B']] - assert_panel_equal(result, expected) - - store.remove('wp', ('major_axis>20000103')) - result = store.select('wp') - expected = wp.loc[:, wp.major_axis <= Timestamp('20000103'), :] - assert_panel_equal(result, expected) - - with ensure_clean_store(self.path) as store: - - wp = Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'], - major_axis=date_range('1/1/2000', periods=5), - minor_axis=['A', 'B', 'C', 'D']) - store.append('wp', wp) - - # stringified datetimes - with assert_produces_warning(expected_warning=FutureWarning, - check_stacklevel=False): - result = store.select('wp', - [('major_axis', - '>', - datetime.datetime(2000, 1, 2))]) - expected = wp.loc[:, wp.major_axis > Timestamp('20000102')] - assert_panel_equal(result, expected) - with assert_produces_warning(expected_warning=FutureWarning, - check_stacklevel=False): - result = store.select('wp', - [('major_axis', - '>', - datetime.datetime(2000, 1, 2, 0, 0))]) - expected = wp.loc[:, wp.major_axis > Timestamp('20000102')] - assert_panel_equal(result, expected) - with assert_produces_warning(expected_warning=FutureWarning, - check_stacklevel=False): - result = store.select('wp', - [('major_axis', - '=', - [datetime.datetime(2000, 1, 2, 0, 0), - datetime.datetime(2000, 1, 3, 0, 0)])] - ) - expected = wp.loc[:, [Timestamp('20000102'), - Timestamp('20000103')]] - assert_panel_equal(result, expected) + with catch_warnings(record=True): + wp = Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'], + major_axis=date_range('1/1/2000', periods=5), + minor_axis=['A', 'B', 'C', 'D']) + store.append('wp', wp) + + result = store.select( + 'wp', where=("major_axis>20000102 " + "and minor_axis=['A', 'B']")) + expected = wp.loc[:, wp.major_axis > + Timestamp('20000102'), ['A', 'B']] + assert_panel_equal(result, expected) + + store.remove('wp', 'major_axis>20000103') + result = store.select('wp') + expected = wp.loc[:, wp.major_axis <= Timestamp('20000103'), :] + assert_panel_equal(result, expected) + + with ensure_clean_store(self.path) as store: + + with catch_warnings(record=True): + wp = Panel(np.random.randn(2, 5, 4), + items=['Item1', 'Item2'], + major_axis=date_range('1/1/2000', periods=5), + minor_axis=['A', 'B', 'C', 'D']) + store.append('wp', wp) + + # stringified datetimes + result = store.select( + 'wp', 
'major_axis>datetime.datetime(2000, 1, 2)') + expected = wp.loc[:, wp.major_axis > Timestamp('20000102')] + assert_panel_equal(result, expected) + + result = store.select( + 'wp', 'major_axis>datetime.datetime(2000, 1, 2)') + expected = wp.loc[:, wp.major_axis > Timestamp('20000102')] + assert_panel_equal(result, expected) + + result = store.select( + 'wp', + "major_axis=[datetime.datetime(2000, 1, 2, 0, 0), " + "datetime.datetime(2000, 1, 3, 0, 0)]") + expected = wp.loc[:, [Timestamp('20000102'), + Timestamp('20000103')]] + assert_panel_equal(result, expected) + + result = store.select( + 'wp', "minor_axis=['A', 'B']") + expected = wp.loc[:, :, ['A', 'B']] + assert_panel_equal(result, expected) def test_same_name_scoping(self): @@ -2726,7 +2674,7 @@ def test_series(self): def test_sparse_series(self): s = tm.makeStringSeries() - s[3:5] = np.nan + s.iloc[3:5] = np.nan ss = s.to_sparse() self._check_roundtrip(ss, tm.assert_series_equal, check_series_type=True) @@ -2742,8 +2690,8 @@ def test_sparse_series(self): def test_sparse_frame(self): s = tm.makeDataFrame() - s.ix[3:5, 1:3] = np.nan - s.ix[8:10, -2] = np.nan + s.iloc[3:5, 1:3] = np.nan + s.iloc[8:10, -2] = np.nan ss = s.to_sparse() self._check_double_roundtrip(ss, tm.assert_frame_equal, @@ -2772,68 +2720,73 @@ def test_tuple_index(self): data = np.random.randn(30).reshape((3, 10)) DF = DataFrame(data, index=idx, columns=col) - expected_warning = Warning if PY35 else PerformanceWarning - with tm.assert_produces_warning(expected_warning=expected_warning, - check_stacklevel=False): + with catch_warnings(record=True): self._check_roundtrip(DF, tm.assert_frame_equal) def test_index_types(self): - values = np.random.randn(2) + with catch_warnings(record=True): + values = np.random.randn(2) - func = lambda l, r: tm.assert_series_equal(l, r, - check_dtype=True, - check_index_type=True, - check_series_type=True) + func = lambda l, r: tm.assert_series_equal(l, r, + check_dtype=True, + check_index_type=True, + check_series_type=True) + + with catch_warnings(record=True): + ser = Series(values, [0, 'y']) + self._check_roundtrip(ser, func) + + with catch_warnings(record=True): + ser = Series(values, [datetime.datetime.today(), 0]) + self._check_roundtrip(ser, func) + + with catch_warnings(record=True): + ser = Series(values, ['y', 0]) + self._check_roundtrip(ser, func) + + with catch_warnings(record=True): + ser = Series(values, [datetime.date.today(), 'a']) + self._check_roundtrip(ser, func) + + with catch_warnings(record=True): - # nose has a deprecation warning in 3.5 - expected_warning = Warning if PY35 else PerformanceWarning - with tm.assert_produces_warning(expected_warning=expected_warning, - check_stacklevel=False): ser = Series(values, [0, 'y']) self._check_roundtrip(ser, func) - with tm.assert_produces_warning(expected_warning=expected_warning, - check_stacklevel=False): ser = Series(values, [datetime.datetime.today(), 0]) self._check_roundtrip(ser, func) - with tm.assert_produces_warning(expected_warning=expected_warning, - check_stacklevel=False): ser = Series(values, ['y', 0]) self._check_roundtrip(ser, func) - with tm.assert_produces_warning(expected_warning=expected_warning, - check_stacklevel=False): ser = Series(values, [datetime.date.today(), 'a']) self._check_roundtrip(ser, func) - with tm.assert_produces_warning(expected_warning=expected_warning, - check_stacklevel=False): ser = Series(values, [1.23, 'b']) self._check_roundtrip(ser, func) - ser = Series(values, [1, 1.53]) - self._check_roundtrip(ser, func) + ser = 
Series(values, [1, 1.53]) + self._check_roundtrip(ser, func) - ser = Series(values, [1, 5]) - self._check_roundtrip(ser, func) + ser = Series(values, [1, 5]) + self._check_roundtrip(ser, func) - ser = Series(values, [datetime.datetime( - 2012, 1, 1), datetime.datetime(2012, 1, 2)]) - self._check_roundtrip(ser, func) + ser = Series(values, [datetime.datetime( + 2012, 1, 1), datetime.datetime(2012, 1, 2)]) + self._check_roundtrip(ser, func) def test_timeseries_preepoch(self): if sys.version_info[0] == 2 and sys.version_info[1] < 7: - raise nose.SkipTest("won't work on Python < 2.7") + pytest.skip("won't work on Python < 2.7") dr = bdate_range('1/1/1940', '1/1/1960') ts = Series(np.random.randn(len(dr)), index=dr) try: self._check_roundtrip(ts, tm.assert_series_equal) except OverflowError: - raise nose.SkipTest('known failer on some windows platforms') + pytest.skip('known failer on some windows platforms') def test_frame(self): @@ -2864,7 +2817,7 @@ def test_frame(self): df['foo'] = np.random.randn(len(df)) store['df'] = df recons = store['df'] - self.assertTrue(recons._data.is_consolidated()) + assert recons._data.is_consolidated() # empty self._check_roundtrip(df[:0], tm.assert_frame_equal) @@ -2953,7 +2906,7 @@ def _make_one(): df['bool2'] = df['B'] > 0 df['int1'] = 1 df['int2'] = 2 - return df.consolidate() + return df._consolidate() df1 = _make_one() df2 = _make_one() @@ -2984,13 +2937,9 @@ def _make_one(): def test_wide(self): - wp = tm.makePanel() - self._check_roundtrip(wp, assert_panel_equal) - - def test_wide_table(self): - - wp = tm.makePanel() - self._check_roundtrip_table(wp, assert_panel_equal) + with catch_warnings(record=True): + wp = tm.makePanel() + self._check_roundtrip(wp, assert_panel_equal) def test_select_with_dups(self): @@ -3052,25 +3001,24 @@ def test_select_with_dups(self): assert_frame_equal(result, expected, by_blocks=True) def test_wide_table_dups(self): - wp = tm.makePanel() with ensure_clean_store(self.path) as store: - store.put('panel', wp, format='table') - store.put('panel', wp, format='table', append=True) + with catch_warnings(record=True): + + wp = tm.makePanel() + store.put('panel', wp, format='table') + store.put('panel', wp, format='table', append=True) - with tm.assert_produces_warning(expected_warning=DuplicateWarning): recons = store['panel'] - assert_panel_equal(recons, wp) + assert_panel_equal(recons, wp) def test_long(self): def _check(left, right): assert_panel_equal(left.to_panel(), right.to_panel()) - wp = tm.makePanel() - self._check_roundtrip(wp.to_frame(), _check) - - # empty - # self._check_roundtrip(wp.to_frame()[:0], _check) + with catch_warnings(record=True): + wp = tm.makePanel() + self._check_roundtrip(wp.to_frame(), _check) def test_longpanel(self): pass @@ -3117,70 +3065,72 @@ def test_sparse_with_compression(self): check_frame_type=True) def test_select(self): - wp = tm.makePanel() with ensure_clean_store(self.path) as store: - # put/select ok - _maybe_remove(store, 'wp') - store.put('wp', wp, format='table') - store.select('wp') - - # non-table ok (where = None) - _maybe_remove(store, 'wp') - store.put('wp2', wp) - store.select('wp2') - - # selection on the non-indexable with a large number of columns - wp = Panel(np.random.randn(100, 100, 100), - items=['Item%03d' % i for i in range(100)], - major_axis=date_range('1/1/2000', periods=100), - minor_axis=['E%03d' % i for i in range(100)]) - - _maybe_remove(store, 'wp') - store.append('wp', wp) - items = ['Item%03d' % i for i in range(80)] - result = store.select('wp', 
Term('items=items')) - expected = wp.reindex(items=items) - assert_panel_equal(expected, result) - - # selectin non-table with a where - # self.assertRaises(ValueError, store.select, - # 'wp2', ('column', ['A', 'D'])) + with catch_warnings(record=True): + wp = tm.makePanel() - # select with columns= - df = tm.makeTimeDataFrame() - _maybe_remove(store, 'df') - store.append('df', df) - result = store.select('df', columns=['A', 'B']) - expected = df.reindex(columns=['A', 'B']) - tm.assert_frame_equal(expected, result) + # put/select ok + _maybe_remove(store, 'wp') + store.put('wp', wp, format='table') + store.select('wp') + + # non-table ok (where = None) + _maybe_remove(store, 'wp') + store.put('wp2', wp) + store.select('wp2') + + # selection on the non-indexable with a large number of columns + wp = Panel(np.random.randn(100, 100, 100), + items=['Item%03d' % i for i in range(100)], + major_axis=date_range('1/1/2000', periods=100), + minor_axis=['E%03d' % i for i in range(100)]) + + _maybe_remove(store, 'wp') + store.append('wp', wp) + items = ['Item%03d' % i for i in range(80)] + result = store.select('wp', 'items=items') + expected = wp.reindex(items=items) + assert_panel_equal(expected, result) + + # selecting non-table with a where + # pytest.raises(ValueError, store.select, + # 'wp2', ('column', ['A', 'D'])) + + # select with columns= + df = tm.makeTimeDataFrame() + _maybe_remove(store, 'df') + store.append('df', df) + result = store.select('df', columns=['A', 'B']) + expected = df.reindex(columns=['A', 'B']) + tm.assert_frame_equal(expected, result) - # equivalentsly - result = store.select('df', [("columns=['A', 'B']")]) - expected = df.reindex(columns=['A', 'B']) - tm.assert_frame_equal(expected, result) + # equivalently + result = store.select('df', [("columns=['A', 'B']")]) + expected = df.reindex(columns=['A', 'B']) + tm.assert_frame_equal(expected, result) - # with a data column - _maybe_remove(store, 'df') - store.append('df', df, data_columns=['A']) - result = store.select('df', ['A > 0'], columns=['A', 'B']) - expected = df[df.A > 0].reindex(columns=['A', 'B']) - tm.assert_frame_equal(expected, result) + # with a data column + _maybe_remove(store, 'df') + store.append('df', df, data_columns=['A']) + result = store.select('df', ['A > 0'], columns=['A', 'B']) + expected = df[df.A > 0].reindex(columns=['A', 'B']) + tm.assert_frame_equal(expected, result) - # all a data columns - _maybe_remove(store, 'df') - store.append('df', df, data_columns=True) - result = store.select('df', ['A > 0'], columns=['A', 'B']) - expected = df[df.A > 0].reindex(columns=['A', 'B']) - tm.assert_frame_equal(expected, result) + # all columns as data columns + _maybe_remove(store, 'df') + store.append('df', df, data_columns=True) + result = store.select('df', ['A > 0'], columns=['A', 'B']) + expected = df[df.A > 0].reindex(columns=['A', 'B']) + tm.assert_frame_equal(expected, result) - # with a data column, but different columns - _maybe_remove(store, 'df') - store.append('df', df, data_columns=['A']) - result = store.select('df', ['A > 0'], columns=['C', 'D']) - expected = df[df.A > 0].reindex(columns=['C', 'D']) - tm.assert_frame_equal(expected, result) + # with a data column, but different columns + _maybe_remove(store, 'df') + store.append('df', df, data_columns=['A']) + result = store.select('df', ['A > 0'], columns=['C', 'D']) + expected = df[df.A > 0].reindex(columns=['C', 'D']) + tm.assert_frame_equal(expected, result) def test_select_dtypes(self): @@ -3192,14 +3142,14 @@ 
_maybe_remove(store, 'df') store.append('df', df, data_columns=['ts', 'A']) - result = store.select('df', [Term("ts>=Timestamp('2012-02-01')")]) + result = store.select('df', "ts>=Timestamp('2012-02-01')") expected = df[df.ts >= Timestamp('2012-02-01')] tm.assert_frame_equal(expected, result) # bool columns (GH #2849) df = DataFrame(np.random.randn(5, 2), columns=['A', 'B']) df['object'] = 'foo' - df.ix[4:5, 'object'] = 'bar' + df.loc[4:5, 'object'] = 'bar' df['boolv'] = df['A'] > 0 _maybe_remove(store, 'df') store.append('df', df, data_columns=True) @@ -3207,15 +3157,15 @@ def test_select_dtypes(self): expected = (df[df.boolv == True] # noqa .reindex(columns=['A', 'boolv'])) for v in [True, 'true', 1]: - result = store.select('df', Term( - 'boolv == %s' % str(v)), columns=['A', 'boolv']) + result = store.select('df', 'boolv == %s' % str(v), + columns=['A', 'boolv']) tm.assert_frame_equal(expected, result) expected = (df[df.boolv == False] # noqa .reindex(columns=['A', 'boolv'])) for v in [False, 'false', 0]: - result = store.select('df', Term( - 'boolv == %s' % str(v)), columns=['A', 'boolv']) + result = store.select( + 'df', 'boolv == %s' % str(v), columns=['A', 'boolv']) tm.assert_frame_equal(expected, result) # integer index @@ -3223,7 +3173,7 @@ def test_select_dtypes(self): _maybe_remove(store, 'df_int') store.append('df_int', df) result = store.select( - 'df_int', [Term("index<10"), Term("columns=['A']")]) + 'df_int', "index<10 and columns=['A']") expected = df.reindex(index=list(df.index)[0:10], columns=['A']) tm.assert_frame_equal(expected, result) @@ -3233,7 +3183,7 @@ def test_select_dtypes(self): _maybe_remove(store, 'df_float') store.append('df_float', df) result = store.select( - 'df_float', [Term("index<10.0"), Term("columns=['A']")]) + 'df_float', "index<10.0 and columns=['A']") expected = df.reindex(index=list(df.index)[0:10], columns=['A']) tm.assert_frame_equal(expected, result) @@ -3304,14 +3254,14 @@ def test_select_with_many_inputs(self): store.append('df', df, data_columns=['ts', 'A', 'B', 'users']) # regular select - result = store.select('df', [Term("ts>=Timestamp('2012-02-01')")]) + result = store.select('df', "ts>=Timestamp('2012-02-01')") expected = df[df.ts >= Timestamp('2012-02-01')] tm.assert_frame_equal(expected, result) # small selector result = store.select( - 'df', [Term("ts>=Timestamp('2012-02-01') & " - "users=['a','b','c']")]) + 'df', + "ts>=Timestamp('2012-02-01') & users=['a','b','c']") expected = df[(df.ts >= Timestamp('2012-02-01')) & df.users.isin(['a', 'b', 'c'])] tm.assert_frame_equal(expected, result) @@ -3319,24 +3269,24 @@ def test_select_with_many_inputs(self): # big selector along the columns selector = ['a', 'b', 'c'] + ['a%03d' % i for i in range(60)] result = store.select( - 'df', [Term("ts>=Timestamp('2012-02-01')"), - Term('users=selector')]) + 'df', + "ts>=Timestamp('2012-02-01') and users=selector") expected = df[(df.ts >= Timestamp('2012-02-01')) & df.users.isin(selector)] tm.assert_frame_equal(expected, result) selector = range(100, 200) - result = store.select('df', [Term('B=selector')]) + result = store.select('df', 'B=selector') expected = df[df.B.isin(selector)] tm.assert_frame_equal(expected, result) - self.assertEqual(len(result), 100) + assert len(result) == 100 # big selector along the index selector = Index(df.ts[0:100].values) - result = store.select('df', [Term('ts=selector')]) + result = store.select('df', 'ts=selector') expected = df[df.ts.isin(selector.values)] tm.assert_frame_equal(expected, result) - 
self.assertEqual(len(result), 100) + assert len(result) == 100 def test_select_iterator(self): @@ -3354,7 +3304,7 @@ def test_select_iterator(self): tm.assert_frame_equal(expected, result) results = [s for s in store.select('df', chunksize=100)] - self.assertEqual(len(results), 5) + assert len(results) == 5 result = concat(results) tm.assert_frame_equal(expected, result) @@ -3366,10 +3316,10 @@ def test_select_iterator(self): df = tm.makeTimeDataFrame(500) df.to_hdf(path, 'df_non_table') - self.assertRaises(TypeError, read_hdf, path, - 'df_non_table', chunksize=100) - self.assertRaises(TypeError, read_hdf, path, - 'df_non_table', iterator=True) + pytest.raises(TypeError, read_hdf, path, + 'df_non_table', chunksize=100) + pytest.raises(TypeError, read_hdf, path, + 'df_non_table', iterator=True) with ensure_clean_path(self.path) as path: @@ -3379,7 +3329,7 @@ def test_select_iterator(self): results = [s for s in read_hdf(path, 'df', chunksize=100)] result = concat(results) - self.assertEqual(len(results), 5) + assert len(results) == 5 tm.assert_frame_equal(result, df) tm.assert_frame_equal(result, read_hdf(path, 'df')) @@ -3404,17 +3354,6 @@ def test_select_iterator(self): result = concat(results) tm.assert_frame_equal(expected, result) - # where selection - # expected = store.select_as_multiple( - # ['df1', 'df2'], where= Term('A>0'), selector='df1') - # results = [] - # for s in store.select_as_multiple( - # ['df1', 'df2'], where= Term('A>0'), selector='df1', - # chunksize=25): - # results.append(s) - # result = concat(results) - # tm.assert_frame_equal(expected, result) - def test_select_iterator_complete_8014(self): # GH 8014 @@ -3543,7 +3482,7 @@ def test_select_iterator_non_complete_8014(self): where = "index > '%s'" % end_dt results = [s for s in store.select( 'df', where=where, chunksize=chunksize)] - self.assertEqual(0, len(results)) + assert 0 == len(results) def test_select_iterator_many_empty_frames(self): @@ -3575,7 +3514,7 @@ def test_select_iterator_many_empty_frames(self): results = [s for s in store.select( 'df', where=where, chunksize=chunksize)] - tm.assert_equal(1, len(results)) + assert len(results) == 1 result = concat(results) rexpected = expected[expected.index <= end_dt] tm.assert_frame_equal(rexpected, result) @@ -3586,7 +3525,7 @@ def test_select_iterator_many_empty_frames(self): 'df', where=where, chunksize=chunksize)] # should be 1, is 10 - tm.assert_equal(1, len(results)) + assert len(results) == 1 result = concat(results) rexpected = expected[(expected.index >= beg_dt) & (expected.index <= end_dt)] @@ -3604,7 +3543,7 @@ def test_select_iterator_many_empty_frames(self): 'df', where=where, chunksize=chunksize)] # should be [] - tm.assert_equal(0, len(results)) + assert len(results) == 0 def test_retain_index_attributes(self): @@ -3622,19 +3561,18 @@ def test_retain_index_attributes(self): for attr in ['freq', 'tz', 'name']: for idx in ['index', 'columns']: - self.assertEqual(getattr(getattr(df, idx), attr, None), - getattr(getattr(result, idx), attr, None)) + assert (getattr(getattr(df, idx), attr, None) == + getattr(getattr(result, idx), attr, None)) # try to append a table with a different frequency - with tm.assert_produces_warning( - expected_warning=AttributeConflictWarning): + with catch_warnings(record=True): df2 = DataFrame(dict( A=Series(lrange(3), index=date_range('2002-1-1', periods=3, freq='D')))) store.append('data', df2) - self.assertIsNone(store.get_storer('data').info['index']['freq']) + assert store.get_storer('data').info['index']['freq'] is 
None # this is ok _maybe_remove(store, 'df2') @@ -3651,9 +3589,8 @@ def test_retain_index_attributes(self): def test_retain_index_attributes2(self): with ensure_clean_path(self.path) as path: - expected_warning = Warning if PY35 else AttributeConflictWarning - with tm.assert_produces_warning(expected_warning=expected_warning, - check_stacklevel=False): + + with catch_warnings(record=True): df = DataFrame(dict( A=Series(lrange(3), @@ -3671,37 +3608,41 @@ def test_retain_index_attributes2(self): df = DataFrame(dict(A=Series(lrange(3), index=idx))) df.to_hdf(path, 'data', mode='w', append=True) - self.assertEqual(read_hdf(path, 'data').index.name, 'foo') + assert read_hdf(path, 'data').index.name == 'foo' - with tm.assert_produces_warning(expected_warning=expected_warning, - check_stacklevel=False): + with catch_warnings(record=True): idx2 = date_range('2001-1-1', periods=3, freq='H') idx2.name = 'bar' df2 = DataFrame(dict(A=Series(lrange(3), index=idx2))) df2.to_hdf(path, 'data', append=True) - self.assertIsNone(read_hdf(path, 'data').index.name) + assert read_hdf(path, 'data').index.name is None def test_panel_select(self): - wp = tm.makePanel() - with ensure_clean_store(self.path) as store: - store.put('wp', wp, format='table') - date = wp.major_axis[len(wp.major_axis) // 2] - crit1 = ('major_axis>=date') - crit2 = ("minor_axis=['A', 'D']") + with catch_warnings(record=True): - result = store.select('wp', [crit1, crit2]) - expected = wp.truncate(before=date).reindex(minor=['A', 'D']) - assert_panel_equal(result, expected) + wp = tm.makePanel() - result = store.select( - 'wp', ['major_axis>="20000124"', ("minor_axis=['A', 'B']")]) - expected = wp.truncate(before='20000124').reindex(minor=['A', 'B']) - assert_panel_equal(result, expected) + store.put('wp', wp, format='table') + date = wp.major_axis[len(wp.major_axis) // 2] + + crit1 = ('major_axis>=date') + crit2 = ("minor_axis=['A', 'D']") + + result = store.select('wp', [crit1, crit2]) + expected = wp.truncate(before=date).reindex(minor=['A', 'D']) + assert_panel_equal(result, expected) + + result = store.select( + 'wp', ['major_axis>="20000124"', + ("minor_axis=['A', 'B']")]) + expected = wp.truncate( + before='20000124').reindex(minor=['A', 'B']) + assert_panel_equal(result, expected) def test_frame_select(self): @@ -3712,28 +3653,28 @@ def test_frame_select(self): date = df.index[len(df) // 2] crit1 = Term('index>=date') - self.assertEqual(crit1.env.scope['date'], date) + assert crit1.env.scope['date'] == date crit2 = ("columns=['A', 'D']") crit3 = ('columns=A') result = store.select('frame', [crit1, crit2]) - expected = df.ix[date:, ['A', 'D']] + expected = df.loc[date:, ['A', 'D']] tm.assert_frame_equal(result, expected) result = store.select('frame', [crit3]) - expected = df.ix[:, ['A']] + expected = df.loc[:, ['A']] tm.assert_frame_equal(result, expected) # invalid terms df = tm.makeTimeDataFrame() store.append('df_time', df) - self.assertRaises( - ValueError, store.select, 'df_time', [Term("index>0")]) + pytest.raises( + ValueError, store.select, 'df_time', "index>0") # can't select if not written as table # store['frame'] = df - # self.assertRaises(ValueError, store.select, + # pytest.raises(ValueError, store.select, # 'frame', [crit1, crit2]) def test_frame_select_complex(self): @@ -3772,8 +3713,8 @@ def test_frame_select_complex(self): tm.assert_frame_equal(result, expected) # invert not implemented in numexpr :( - self.assertRaises(NotImplementedError, - store.select, 'df', '~(string="bar")') + 
pytest.raises(NotImplementedError, + store.select, 'df', '~(string="bar")') # invert ok for filters result = store.select('df', "~(columns=['A','B'])") @@ -3808,15 +3749,10 @@ def test_frame_select_complex2(self): hist.to_hdf(hh, 'df', mode='w', format='table') - expected = read_hdf(hh, 'df', where=Term('l1', '=', [2, 3, 4])) - - # list like - result = read_hdf(hh, 'df', where=Term( - 'l1', '=', selection.index.tolist())) - assert_frame_equal(result, expected) - l = selection.index.tolist() # noqa + expected = read_hdf(hh, 'df', where='l1=[2, 3, 4]') # sccope with list like + l = selection.index.tolist() # noqa store = HDFStore(hh) result = store.select('df', where='l1=l') assert_frame_equal(result, expected) @@ -3866,12 +3802,12 @@ def test_invalid_filtering(self): store.put('df', df, format='table') # not implemented - self.assertRaises(NotImplementedError, store.select, - 'df', "columns=['A'] | columns=['B']") + pytest.raises(NotImplementedError, store.select, + 'df', "columns=['A'] | columns=['B']") # in theory we could deal with this - self.assertRaises(NotImplementedError, store.select, - 'df', "columns=['A','B'] & columns=['C']") + pytest.raises(NotImplementedError, store.select, + 'df', "columns=['A','B'] & columns=['C']") def test_string_select(self): # GH 2973 @@ -3881,16 +3817,16 @@ def test_string_select(self): # test string ==/!= df['x'] = 'none' - df.ix[2:7, 'x'] = '' + df.loc[2:7, 'x'] = '' store.append('df', df, data_columns=['x']) - result = store.select('df', Term('x=none')) + result = store.select('df', 'x=none') expected = df[df.x == 'none'] assert_frame_equal(result, expected) try: - result = store.select('df', Term('x!=none')) + result = store.select('df', 'x!=none') expected = df[df.x != 'none'] assert_frame_equal(result, expected) except Exception as detail: @@ -3902,21 +3838,21 @@ def test_string_select(self): df2.loc[df2.x == '', 'x'] = np.nan store.append('df2', df2, data_columns=['x']) - result = store.select('df2', Term('x!=none')) + result = store.select('df2', 'x!=none') expected = df2[isnull(df2.x)] assert_frame_equal(result, expected) # int ==/!= df['int'] = 1 - df.ix[2:7, 'int'] = 2 + df.loc[2:7, 'int'] = 2 store.append('df3', df, data_columns=['int']) - result = store.select('df3', Term('int=2')) + result = store.select('df3', 'int=2') expected = df[df.int == 2] assert_frame_equal(result, expected) - result = store.select('df3', Term('int!=2')) + result = store.select('df3', 'int!=2') expected = df[df.int != 2] assert_frame_equal(result, expected) @@ -3929,19 +3865,19 @@ def test_read_column(self): store.append('df', df) # error - self.assertRaises(KeyError, store.select_column, 'df', 'foo') + pytest.raises(KeyError, store.select_column, 'df', 'foo') def f(): store.select_column('df', 'index', where=['index>5']) - self.assertRaises(Exception, f) + pytest.raises(Exception, f) # valid result = store.select_column('df', 'index') tm.assert_almost_equal(result.values, Series(df.index).values) - self.assertIsInstance(result, Series) + assert isinstance(result, Series) # not a data indexable column - self.assertRaises( + pytest.raises( ValueError, store.select_column, 'df', 'values_block_0') # a data column @@ -3954,7 +3890,7 @@ def f(): # a data column with NaNs, result excludes the NaNs df3 = df.copy() df3['string'] = 'foo' - df3.ix[4:6, 'string'] = np.nan + df3.loc[4:6, 'string'] = np.nan store.append('df3', df3, data_columns=['string']) result = store.select_column('df3', 'string') tm.assert_almost_equal(result.values, df3['string'].values) @@ -4005,15 
+3941,15 @@ def test_coordinates(self): c = store.select_as_coordinates('df', ['index<3']) assert((c.values == np.arange(3)).all()) result = store.select('df', where=c) - expected = df.ix[0:2, :] + expected = df.loc[0:2, :] tm.assert_frame_equal(result, expected) c = store.select_as_coordinates('df', ['index>=3', 'index<=4']) assert((c.values == np.arange(2) + 3).all()) result = store.select('df', where=c) - expected = df.ix[3:4, :] + expected = df.loc[3:4, :] tm.assert_frame_equal(result, expected) - self.assertIsInstance(c, Index) + assert isinstance(c, Index) # multiple tables _maybe_remove(store, 'df1') @@ -4051,14 +3987,14 @@ def test_coordinates(self): tm.assert_frame_equal(result, expected) # invalid - self.assertRaises(ValueError, store.select, 'df', - where=np.arange(len(df), dtype='float64')) - self.assertRaises(ValueError, store.select, 'df', - where=np.arange(len(df) + 1)) - self.assertRaises(ValueError, store.select, 'df', - where=np.arange(len(df)), start=5) - self.assertRaises(ValueError, store.select, 'df', - where=np.arange(len(df)), start=5, stop=10) + pytest.raises(ValueError, store.select, 'df', + where=np.arange(len(df), dtype='float64')) + pytest.raises(ValueError, store.select, 'df', + where=np.arange(len(df) + 1)) + pytest.raises(ValueError, store.select, 'df', + where=np.arange(len(df)), start=5) + pytest.raises(ValueError, store.select, 'df', + where=np.arange(len(df)), start=5, stop=10) # selection with filter selection = date_range('20000101', periods=500) @@ -4094,12 +4030,12 @@ def test_append_to_multiple(self): with ensure_clean_store(self.path) as store: # exceptions - self.assertRaises(ValueError, store.append_to_multiple, - {'df1': ['A', 'B'], 'df2': None}, df, - selector='df3') - self.assertRaises(ValueError, store.append_to_multiple, - {'df1': None, 'df2': None}, df, selector='df3') - self.assertRaises( + pytest.raises(ValueError, store.append_to_multiple, + {'df1': ['A', 'B'], 'df2': None}, df, + selector='df3') + pytest.raises(ValueError, store.append_to_multiple, + {'df1': None, 'df2': None}, df, selector='df3') + pytest.raises( ValueError, store.append_to_multiple, 'df1', df, 'df1') # regular operation @@ -4113,10 +4049,11 @@ def test_append_to_multiple(self): def test_append_to_multiple_dropna(self): df1 = tm.makeTimeDataFrame() df2 = tm.makeTimeDataFrame().rename(columns=lambda x: "%s_2" % x) - df1.ix[1, ['A', 'B']] = np.nan + df1.iloc[1, df1.columns.get_indexer(['A', 'B'])] = np.nan df = concat([df1, df2], axis=1) with ensure_clean_store(self.path) as store: + # dropna=True should guarantee rows are synchronized store.append_to_multiple( {'df1': ['A', 'B'], 'df2': None}, df, selector='df1', @@ -4127,14 +4064,27 @@ def test_append_to_multiple_dropna(self): tm.assert_index_equal(store.select('df1').index, store.select('df2').index) + @pytest.mark.xfail(run=False, + reason="append_to_multiple_dropna_false " + "is not raising as failed") + def test_append_to_multiple_dropna_false(self): + df1 = tm.makeTimeDataFrame() + df2 = tm.makeTimeDataFrame().rename(columns=lambda x: "%s_2" % x) + df1.iloc[1, df1.columns.get_indexer(['A', 'B'])] = np.nan + df = concat([df1, df2], axis=1) + + with ensure_clean_store(self.path) as store: + # dropna=False shouldn't synchronize row indexes store.append_to_multiple( - {'df1': ['A', 'B'], 'df2': None}, df, selector='df1', + {'df1a': ['A', 'B'], 'df2a': None}, df, selector='df1a', dropna=False) - self.assertRaises( - ValueError, store.select_as_multiple, ['df1', 'df2']) - assert not store.select('df1').index.equals( - 
store.select('df2').index) + + with pytest.raises(ValueError): + store.select_as_multiple(['df1a', 'df2a']) + + assert not store.select('df1a').index.equals( + store.select('df2a').index) def test_select_as_multiple(self): @@ -4145,25 +4095,25 @@ def test_select_as_multiple(self): with ensure_clean_store(self.path) as store: # no tables stored - self.assertRaises(Exception, store.select_as_multiple, - None, where=['A>0', 'B>0'], selector='df1') + pytest.raises(Exception, store.select_as_multiple, + None, where=['A>0', 'B>0'], selector='df1') store.append('df1', df1, data_columns=['A', 'B']) store.append('df2', df2) # exceptions - self.assertRaises(Exception, store.select_as_multiple, - None, where=['A>0', 'B>0'], selector='df1') - self.assertRaises(Exception, store.select_as_multiple, - [None], where=['A>0', 'B>0'], selector='df1') - self.assertRaises(KeyError, store.select_as_multiple, - ['df1', 'df3'], where=['A>0', 'B>0'], - selector='df1') - self.assertRaises(KeyError, store.select_as_multiple, - ['df3'], where=['A>0', 'B>0'], selector='df1') - self.assertRaises(KeyError, store.select_as_multiple, - ['df1', 'df2'], where=['A>0', 'B>0'], - selector='df4') + pytest.raises(Exception, store.select_as_multiple, + None, where=['A>0', 'B>0'], selector='df1') + pytest.raises(Exception, store.select_as_multiple, + [None], where=['A>0', 'B>0'], selector='df1') + pytest.raises(KeyError, store.select_as_multiple, + ['df1', 'df3'], where=['A>0', 'B>0'], + selector='df1') + pytest.raises(KeyError, store.select_as_multiple, + ['df3'], where=['A>0', 'B>0'], selector='df1') + pytest.raises(KeyError, store.select_as_multiple, + ['df1', 'df2'], where=['A>0', 'B>0'], + selector='df4') # default select result = store.select('df1', ['A>0', 'B>0']) @@ -4182,24 +4132,24 @@ def test_select_as_multiple(self): tm.assert_frame_equal(result, expected) # multiple (diff selector) - result = store.select_as_multiple(['df1', 'df2'], where=[Term( - 'index>df2.index[4]')], selector='df2') + result = store.select_as_multiple( + ['df1', 'df2'], where='index>df2.index[4]', selector='df2') expected = concat([df1, df2], axis=1) expected = expected[5:] tm.assert_frame_equal(result, expected) # test excpection for diff rows store.append('df3', tm.makeTimeDataFrame(nper=50)) - self.assertRaises(ValueError, store.select_as_multiple, - ['df1', 'df3'], where=['A>0', 'B>0'], - selector='df1') + pytest.raises(ValueError, store.select_as_multiple, + ['df1', 'df3'], where=['A>0', 'B>0'], + selector='df1') def test_nan_selection_bug_4858(self): # GH 4858; nan selection bug, only works for pytables >= 3.1 if LooseVersion(tables.__version__) < '3.1.0': - raise nose.SkipTest('tables version does not support fix for nan ' - 'selection bug: GH 4858') + pytest.skip('tables version does not support fix for nan ' + 'selection bug: GH 4858') with ensure_clean_store(self.path) as store: @@ -4225,15 +4175,15 @@ def test_start_stop_table(self): store.append('df', df) result = store.select( - 'df', [Term("columns=['A']")], start=0, stop=5) - expected = df.ix[0:4, ['A']] + 'df', "columns=['A']", start=0, stop=5) + expected = df.loc[0:4, ['A']] tm.assert_frame_equal(result, expected) # out of range result = store.select( - 'df', [Term("columns=['A']")], start=30, stop=40) - self.assertTrue(len(result) == 0) - expected = df.ix[30:40, ['A']] + 'df', "columns=['A']", start=30, stop=40) + assert len(result) == 0 + expected = df.loc[30:40, ['A']] tm.assert_frame_equal(result, expected) def test_start_stop_fixed(self): @@ -4275,11 +4225,11 @@ def 
test_start_stop_fixed(self): # sparse; not implemented df = tm.makeDataFrame() - df.ix[3:5, 1:3] = np.nan - df.ix[8:10, -2] = np.nan + df.iloc[3:5, 1:3] = np.nan + df.iloc[8:10, -2] = np.nan dfs = df.to_sparse() store.put('dfs', dfs) - with self.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): store.select('dfs', start=0, stop=5) def test_select_filter_corner(self): @@ -4291,13 +4241,13 @@ def test_select_filter_corner(self): with ensure_clean_store(self.path) as store: store.put('frame', df, format='table') - crit = Term('columns=df.columns[:75]') + crit = 'columns=df.columns[:75]' result = store.select('frame', [crit]) - tm.assert_frame_equal(result, df.ix[:, df.columns[:75]]) + tm.assert_frame_equal(result, df.loc[:, df.columns[:75]]) - crit = Term('columns=df.columns[:75:2]') + crit = 'columns=df.columns[:75:2]' result = store.select('frame', [crit]) - tm.assert_frame_equal(result, df.ix[:, df.columns[:75:2]]) + tm.assert_frame_equal(result, df.loc[:, df.columns[:75:2]]) def _check_roundtrip(self, obj, comparator, compression=False, **kwargs): @@ -4332,11 +4282,11 @@ def _check_roundtrip_table(self, obj, comparator, compression=False): with ensure_clean_store(self.path, 'w', **options) as store: store.put('obj', obj, format='table') retrieved = store['obj'] - # sorted_obj = _test_sort(obj) + comparator(retrieved, obj) def test_multiple_open_close(self): - # GH 4409, open & close multiple times + # gh-4409: open & close multiple times with ensure_clean_path(self.path) as path: @@ -4345,11 +4295,12 @@ def test_multiple_open_close(self): # single store = HDFStore(path) - self.assertNotIn('CLOSED', str(store)) - self.assertTrue(store.is_open) + assert 'CLOSED' not in str(store) + assert store.is_open + store.close() - self.assertIn('CLOSED', str(store)) - self.assertFalse(store.is_open) + assert 'CLOSED' in str(store) + assert not store.is_open with ensure_clean_path(self.path) as path: @@ -4360,7 +4311,7 @@ def test_multiple_open_close(self): def f(): HDFStore(path) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) store1.close() else: @@ -4369,22 +4320,22 @@ def f(): store1 = HDFStore(path) store2 = HDFStore(path) - self.assertNotIn('CLOSED', str(store1)) - self.assertNotIn('CLOSED', str(store2)) - self.assertTrue(store1.is_open) - self.assertTrue(store2.is_open) + assert 'CLOSED' not in str(store1) + assert 'CLOSED' not in str(store2) + assert store1.is_open + assert store2.is_open store1.close() - self.assertIn('CLOSED', str(store1)) - self.assertFalse(store1.is_open) - self.assertNotIn('CLOSED', str(store2)) - self.assertTrue(store2.is_open) + assert 'CLOSED' in str(store1) + assert not store1.is_open + assert 'CLOSED' not in str(store2) + assert store2.is_open store2.close() - self.assertIn('CLOSED', str(store1)) - self.assertIn('CLOSED', str(store2)) - self.assertFalse(store1.is_open) - self.assertFalse(store2.is_open) + assert 'CLOSED' in str(store1) + assert 'CLOSED' in str(store2) + assert not store1.is_open + assert not store2.is_open # nested close store = HDFStore(path, mode='w') @@ -4393,12 +4344,12 @@ def f(): store2 = HDFStore(path) store2.append('df2', df) store2.close() - self.assertIn('CLOSED', str(store2)) - self.assertFalse(store2.is_open) + assert 'CLOSED' in str(store2) + assert not store2.is_open store.close() - self.assertIn('CLOSED', str(store)) - self.assertFalse(store.is_open) + assert 'CLOSED' in str(store) + assert not store.is_open # double closing store = HDFStore(path, mode='w') @@ -4406,12 +4357,12 @@ def f(): 
store2 = HDFStore(path) store.close() - self.assertIn('CLOSED', str(store)) - self.assertFalse(store.is_open) + assert 'CLOSED' in str(store) + assert not store.is_open store2.close() - self.assertIn('CLOSED', str(store2)) - self.assertFalse(store2.is_open) + assert 'CLOSED' in str(store2) + assert not store2.is_open # ops on a closed store with ensure_clean_path(self.path) as path: @@ -4422,21 +4373,21 @@ def f(): store = HDFStore(path) store.close() - self.assertRaises(ClosedFileError, store.keys) - self.assertRaises(ClosedFileError, lambda: 'df' in store) - self.assertRaises(ClosedFileError, lambda: len(store)) - self.assertRaises(ClosedFileError, lambda: store['df']) - self.assertRaises(ClosedFileError, lambda: store.df) - self.assertRaises(ClosedFileError, store.select, 'df') - self.assertRaises(ClosedFileError, store.get, 'df') - self.assertRaises(ClosedFileError, store.append, 'df2', df) - self.assertRaises(ClosedFileError, store.put, 'df3', df) - self.assertRaises(ClosedFileError, store.get_storer, 'df2') - self.assertRaises(ClosedFileError, store.remove, 'df2') + pytest.raises(ClosedFileError, store.keys) + pytest.raises(ClosedFileError, lambda: 'df' in store) + pytest.raises(ClosedFileError, lambda: len(store)) + pytest.raises(ClosedFileError, lambda: store['df']) + pytest.raises(ClosedFileError, lambda: store.df) + pytest.raises(ClosedFileError, store.select, 'df') + pytest.raises(ClosedFileError, store.get, 'df') + pytest.raises(ClosedFileError, store.append, 'df2', df) + pytest.raises(ClosedFileError, store.put, 'df3', df) + pytest.raises(ClosedFileError, store.get_storer, 'df2') + pytest.raises(ClosedFileError, store.remove, 'df2') def f(): store.select('df') - tm.assertRaisesRegexp(ClosedFileError, 'file is not open', f) + tm.assert_raises_regex(ClosedFileError, 'file is not open', f) def test_pytables_native_read(self): @@ -4444,55 +4395,46 @@ def test_pytables_native_read(self): tm.get_data_path('legacy_hdf/pytables_native.h5'), mode='r') as store: d2 = store['detector/readout'] - self.assertIsInstance(d2, DataFrame) + assert isinstance(d2, DataFrame) def test_pytables_native2_read(self): # fails on win/3.5 oddly if PY35 and is_platform_windows(): - raise nose.SkipTest("native2 read fails oddly on windows / 3.5") + pytest.skip("native2 read fails oddly on windows / 3.5") with ensure_clean_store( tm.get_data_path('legacy_hdf/pytables_native2.h5'), mode='r') as store: str(store) d1 = store['detector'] - self.assertIsInstance(d1, DataFrame) - - def test_legacy_read(self): - with ensure_clean_store( - tm.get_data_path('legacy_hdf/legacy.h5'), - mode='r') as store: - store['a'] - store['b'] - store['c'] - store['d'] + assert isinstance(d1, DataFrame) def test_legacy_table_read(self): # legacy table types with ensure_clean_store( tm.get_data_path('legacy_hdf/legacy_table.h5'), mode='r') as store: - store.select('df1') - store.select('df2') - store.select('wp1') - # force the frame - store.select('df2', typ='legacy_frame') + with catch_warnings(record=True): + store.select('df1') + store.select('df2') + store.select('wp1') + + # force the frame + store.select('df2', typ='legacy_frame') - # old version warning - with tm.assert_produces_warning( - expected_warning=IncompatibilityWarning): - self.assertRaises( - Exception, store.select, 'wp1', Term('minor_axis=B')) + # old version warning + pytest.raises( + Exception, store.select, 'wp1', 'minor_axis=B') df2 = store.select('df2') - result = store.select('df2', Term('index>df2.index[2]')) + result = store.select('df2', 
'index>df2.index[2]') expected = df2[df2.index > df2.index[2]] assert_frame_equal(expected, result) def test_legacy_0_10_read(self): # legacy from 0.10 - with compat_assert_produces_warning(FutureWarning): + with catch_warnings(record=True): path = tm.get_data_path('legacy_hdf/legacy_0.10.h5') with ensure_clean_store(path, mode='r') as store: str(store) @@ -4516,7 +4458,7 @@ def test_legacy_0_11_read(self): def test_copy(self): - with compat_assert_produces_warning(FutureWarning): + with catch_warnings(record=True): def do_copy(f=None, new_f=None, keys=None, propindexes=True, **kwargs): @@ -4537,7 +4479,7 @@ def do_copy(f=None, new_f=None, keys=None, # check keys if keys is None: keys = store.keys() - self.assertEqual(set(keys), set(tstore.keys())) + assert set(keys) == set(tstore.keys()) # check indices & nrows for k in tstore.keys(): @@ -4545,14 +4487,13 @@ def do_copy(f=None, new_f=None, keys=None, new_t = tstore.get_storer(k) orig_t = store.get_storer(k) - self.assertEqual(orig_t.nrows, new_t.nrows) + assert orig_t.nrows == new_t.nrows # check propindexes if propindexes: for a in orig_t.axes: if a.is_indexed: - self.assertTrue( - new_t[a.name].is_indexed) + assert new_t[a.name].is_indexed finally: safe_close(store) @@ -4581,13 +4522,14 @@ def do_copy(f=None, new_f=None, keys=None, safe_remove(path) def test_legacy_table_write(self): - raise nose.SkipTest("cannot write legacy tables") + pytest.skip("cannot write legacy tables") store = HDFStore(tm.get_data_path( 'legacy_hdf/legacy_table_%s.h5' % pandas.__version__), 'a') df = tm.makeDataFrame() - wp = tm.makePanel() + with catch_warnings(record=True): + wp = tm.makePanel() index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', 'three']], @@ -4610,7 +4552,7 @@ def test_store_datetime_fractional_secs(self): dt = datetime.datetime(2012, 1, 2, 3, 4, 5, 123456) series = Series([0], [dt]) store['a'] = series - self.assertEqual(store['a'].index[0], dt) + assert store['a'].index[0] == dt def test_tseries_indices_series(self): @@ -4620,18 +4562,18 @@ def test_tseries_indices_series(self): store['a'] = ser result = store['a'] - assert_series_equal(result, ser) - self.assertEqual(type(result.index), type(ser.index)) - self.assertEqual(result.index.freq, ser.index.freq) + tm.assert_series_equal(result, ser) + assert result.index.freq == ser.index.freq + tm.assert_class_equal(result.index, ser.index, obj="series index") idx = tm.makePeriodIndex(10) ser = Series(np.random.randn(len(idx)), idx) store['a'] = ser result = store['a'] - assert_series_equal(result, ser) - self.assertEqual(type(result.index), type(ser.index)) - self.assertEqual(result.index.freq, ser.index.freq) + tm.assert_series_equal(result, ser) + assert result.index.freq == ser.index.freq + tm.assert_class_equal(result.index, ser.index, obj="series index") def test_tseries_indices_frame(self): @@ -4642,8 +4584,9 @@ def test_tseries_indices_frame(self): result = store['a'] assert_frame_equal(result, df) - self.assertEqual(type(result.index), type(df.index)) - self.assertEqual(result.index.freq, df.index.freq) + assert result.index.freq == df.index.freq + tm.assert_class_equal(result.index, df.index, + obj="dataframe index") idx = tm.makePeriodIndex(10) df = DataFrame(np.random.randn(len(idx), 3), idx) @@ -4651,14 +4594,16 @@ def test_tseries_indices_frame(self): result = store['a'] assert_frame_equal(result, df) - self.assertEqual(type(result.index), type(df.index)) - self.assertEqual(result.index.freq, df.index.freq) + assert result.index.freq == df.index.freq +
tm.assert_class_equal(result.index, df.index, + obj="dataframe index") def test_unicode_index(self): unicode_values = [u('\u03c3'), u('\u03c3\u03c3')] - with compat_assert_produces_warning(PerformanceWarning): + # PerformanceWarning + with catch_warnings(record=True): s = Series(np.random.randn(len(unicode_values)), unicode_values) self._check_roundtrip(s, tm.assert_series_equal) @@ -4691,7 +4636,7 @@ def test_store_datetime_mixed(self): # index=[np.arange(5).repeat(2), # np.tile(np.arange(2), 5)]) - # self.assertRaises(Exception, store.put, 'foo', df, format='table') + # pytest.raises(Exception, store.put, 'foo', df, format='table') def test_append_with_diff_col_name_types_raises_value_error(self): df = DataFrame(np.random.randn(10, 1)) @@ -4705,7 +4650,7 @@ def test_append_with_diff_col_name_types_raises_value_error(self): store.append(name, df) for d in (df2, df3, df4, df5): - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): store.append(name, d) def test_query_with_nested_special_character(self): @@ -4722,7 +4667,7 @@ def test_categorical(self): with ensure_clean_store(self.path) as store: - # basic + # Basic _maybe_remove(store, 's') s = Series(Categorical(['a', 'b', 'b', 'a', 'a', 'c'], categories=[ 'a', 'b', 'c', 'd'], ordered=False)) @@ -4738,12 +4683,13 @@ def test_categorical(self): tm.assert_series_equal(s, result) _maybe_remove(store, 'df') + df = DataFrame({"s": s, "vals": [1, 2, 3, 4, 5, 6]}) store.append('df', df, format='table') result = store.select('df') tm.assert_frame_equal(result, df) - # dtypes + # Dtypes s = Series([1, 1, 2, 2, 3, 4, 5]).astype('category') store.append('si', s) result = store.select('si') @@ -4754,17 +4700,17 @@ def test_categorical(self): result = store.select('si2') tm.assert_series_equal(result, s) - # multiple + # Multiple df2 = df.copy() df2['s2'] = Series(list('abcdefg')).astype('category') store.append('df2', df2) result = store.select('df2') tm.assert_frame_equal(result, df2) - # make sure the metadata is ok - self.assertTrue('/df2 ' in str(store)) - self.assertTrue('/df2/meta/values_block_0/meta' in str(store)) - self.assertTrue('/df2/meta/values_block_1/meta' in str(store)) + # Make sure the metadata is OK + assert '/df2 ' in str(store) + assert '/df2/meta/values_block_0/meta' in str(store) + assert '/df2/meta/values_block_1/meta' in str(store) # unordered s = Series(Categorical(['a', 'b', 'b', 'a', 'a', 'c'], categories=[ @@ -4773,7 +4719,7 @@ def test_categorical(self): result = store.select('s2') tm.assert_series_equal(result, s) - # query + # Query store.append('df3', df, data_columns=['s']) expected = df[df.s.isin(['b', 'c'])] result = store.select('df3', where=['s in ["b","c"]']) @@ -4791,7 +4737,7 @@ def test_categorical(self): result = store.select('df3', where=['s in ["f"]']) tm.assert_frame_equal(result, expected) - # appending with same categories is ok + # Appending with same categories is ok store.append('df3', df) df = concat([df, df]) @@ -4799,20 +4745,21 @@ def test_categorical(self): result = store.select('df3', where=['s in ["b","c"]']) tm.assert_frame_equal(result, expected) - # appending must have the same categories + # Appending must have the same categories df3 = df.copy() df3['s'].cat.remove_unused_categories(inplace=True) - self.assertRaises(ValueError, lambda: store.append('df3', df3)) + with pytest.raises(ValueError): + store.append('df3', df3) - # remove - # make sure meta data is removed (its a recursive removal so should - # be) + # Remove, and make sure meta data is removed (its a recursive 
+ # removal so should be). result = store.select('df3/meta/s/meta') - self.assertIsNotNone(result) + assert result is not None store.remove('df3') - self.assertRaises( - KeyError, lambda: store.select('df3/meta/s/meta')) + + with pytest.raises(KeyError): + store.select('df3/meta/s/meta') def test_categorical_conversion(self): @@ -4848,15 +4795,15 @@ def test_duplicate_column_name(self): df = DataFrame(columns=["a", "a"], data=[[0, 0]]) with ensure_clean_path(self.path) as path: - self.assertRaises(ValueError, df.to_hdf, - path, 'df', format='fixed') + pytest.raises(ValueError, df.to_hdf, + path, 'df', format='fixed') df.to_hdf(path, 'df', format='table') other = read_hdf(path, 'df') tm.assert_frame_equal(df, other) - self.assertTrue(df.equals(other)) - self.assertTrue(other.equals(df)) + assert df.equals(other) + assert other.equals(df) def test_round_trip_equals(self): # GH 9330 @@ -4866,8 +4813,8 @@ def test_round_trip_equals(self): df.to_hdf(path, 'df', format='table') other = read_hdf(path, 'df') tm.assert_frame_equal(df, other) - self.assertTrue(df.equals(other)) - self.assertTrue(other.equals(df)) + assert df.equals(other) + assert other.equals(df) def test_preserve_timedeltaindex_type(self): # GH9635 @@ -4903,7 +4850,7 @@ def test_colums_multiindex_modified(self): cols2load = list('BCD') cols2load_original = list(cols2load) df_loaded = read_hdf(path, 'df', columns=cols2load) # noqa - self.assertTrue(cols2load_original == cols2load) + assert cols2load_original == cols2load def test_to_hdf_with_object_column_names(self): # GH9057 @@ -4923,18 +4870,21 @@ def test_to_hdf_with_object_column_names(self): for index in types_should_fail: df = DataFrame(np.random.randn(10, 2), columns=index(2)) with ensure_clean_path(self.path) as path: - with self.assertRaises( + with catch_warnings(record=True): + with pytest.raises( ValueError, msg=("cannot have non-object label " "DataIndexableCol")): - df.to_hdf(path, 'df', format='table', data_columns=True) + df.to_hdf(path, 'df', format='table', + data_columns=True) for index in types_should_run: df = DataFrame(np.random.randn(10, 2), columns=index(2)) with ensure_clean_path(self.path) as path: - df.to_hdf(path, 'df', format='table', data_columns=True) - result = pd.read_hdf( - path, 'df', where="index = [{0}]".format(df.index[0])) - assert(len(result)) + with catch_warnings(record=True): + df.to_hdf(path, 'df', format='table', data_columns=True) + result = pd.read_hdf( + path, 'df', where="index = [{0}]".format(df.index[0])) + assert(len(result)) def test_read_hdf_open_store(self): # GH10330 @@ -4951,7 +4901,7 @@ def test_read_hdf_open_store(self): store = HDFStore(path, mode='r') indirect = read_hdf(store, 'df') tm.assert_frame_equal(direct, indirect) - self.assertTrue(store.is_open) + assert store.is_open store.close() def test_read_hdf_iterator(self): @@ -4965,7 +4915,7 @@ def test_read_hdf_iterator(self): df.to_hdf(path, 'df', mode='w', format='t') direct = read_hdf(path, 'df') iterator = read_hdf(path, 'df', iterator=True) - self.assertTrue(isinstance(iterator, TableIterator)) + assert isinstance(iterator, TableIterator) indirect = next(iterator.__iter__()) tm.assert_frame_equal(direct, indirect) iterator.store.close() @@ -4976,21 +4926,21 @@ def test_read_hdf_errors(self): columns=list('ABCDE')) with ensure_clean_path(self.path) as path: - self.assertRaises(IOError, read_hdf, path, 'key') + pytest.raises(IOError, read_hdf, path, 'key') df.to_hdf(path, 'df') store = HDFStore(path, mode='r') store.close() - self.assertRaises(IOError, read_hdf, 
store, 'df') + pytest.raises(IOError, read_hdf, store, 'df') with open(path, mode='r') as store: - self.assertRaises(NotImplementedError, read_hdf, store, 'df') + pytest.raises(NotImplementedError, read_hdf, store, 'df') def test_invalid_complib(self): df = DataFrame(np.random.rand(4, 5), index=list('abcd'), columns=list('ABCDE')) with ensure_clean_path(self.path) as path: - self.assertRaises(ValueError, df.to_hdf, path, - 'df', complib='blosc:zlib') + pytest.raises(ValueError, df.to_hdf, path, + 'df', complib='blosc:zlib') # GH10443 def test_read_nokey(self): @@ -5005,7 +4955,7 @@ def test_read_nokey(self): reread = read_hdf(path) assert_frame_equal(df, reread) df.to_hdf(path, 'df2', mode='a') - self.assertRaises(ValueError, read_hdf, path) + pytest.raises(ValueError, read_hdf, path) def test_read_nokey_table(self): # GH13231 @@ -5017,13 +4967,13 @@ def test_read_nokey_table(self): reread = read_hdf(path) assert_frame_equal(df, reread) df.to_hdf(path, 'df2', mode='a', format='table') - self.assertRaises(ValueError, read_hdf, path) + pytest.raises(ValueError, read_hdf, path) def test_read_nokey_empty(self): with ensure_clean_path(self.path) as path: store = HDFStore(path) store.close() - self.assertRaises(ValueError, read_hdf, path) + pytest.raises(ValueError, read_hdf, path) def test_read_from_pathlib_path(self): @@ -5072,7 +5022,7 @@ def test_query_long_float_literal(self): cutoff = 1000000000.0006 result = store.select('test', "A < %.4f" % cutoff) - self.assertTrue(result.empty) + assert result.empty cutoff = 1000000000.0010 result = store.select('test', "A > %.4f" % cutoff) @@ -5084,6 +5034,50 @@ def test_query_long_float_literal(self): expected = df.loc[[1], :] tm.assert_frame_equal(expected, result) + def test_query_compare_column_type(self): + # GH 15492 + df = pd.DataFrame({'date': ['2014-01-01', '2014-01-02'], + 'real_date': date_range('2014-01-01', periods=2), + 'float': [1.1, 1.2], + 'int': [1, 2]}, + columns=['date', 'real_date', 'float', 'int']) + + with ensure_clean_store(self.path) as store: + store.append('test', df, format='table', data_columns=True) + + ts = pd.Timestamp('2014-01-01') # noqa + result = store.select('test', where='real_date > ts') + expected = df.loc[[1], :] + tm.assert_frame_equal(expected, result) + + for op in ['<', '>', '==']: + # non strings to string column always fail + for v in [2.1, True, pd.Timestamp('2014-01-01'), + pd.Timedelta(1, 's')]: + query = 'date {op} v'.format(op=op) + with pytest.raises(TypeError): + result = store.select('test', where=query) + + # strings to other columns must be convertible to type + v = 'a' + for col in ['int', 'float', 'real_date']: + query = '{col} {op} v'.format(op=op, col=col) + with pytest.raises(ValueError): + result = store.select('test', where=query) + + for v, col in zip(['1', '1.1', '2014-01-01'], + ['int', 'float', 'real_date']): + query = '{col} {op} v'.format(op=op, col=col) + result = store.select('test', where=query) + + if op == '==': + expected = df.loc[[0], :] + elif op == '>': + expected = df.loc[[1], :] + else: + expected = df.loc[[], :] + tm.assert_frame_equal(expected, result) + class TestHDFComplexValues(Base): # GH10447 @@ -5155,7 +5149,7 @@ def test_complex_mixed_table(self): with ensure_clean_store(self.path) as store: store.append('df', df, data_columns=['A', 'B']) - result = store.select('df', where=Term('A>2')) + result = store.select('df', where='A>2') assert_frame_equal(df.loc[df.A > 2], result) with ensure_clean_path(self.path) as path: @@ -5164,28 +5158,30 @@ def 
test_complex_mixed_table(self): assert_frame_equal(df, reread) def test_complex_across_dimensions_fixed(self): - complex128 = np.array([1.0 + 1.0j, 1.0 + 1.0j, 1.0 + 1.0j, 1.0 + 1.0j]) - s = Series(complex128, index=list('abcd')) - df = DataFrame({'A': s, 'B': s}) - p = Panel({'One': df, 'Two': df}) + with catch_warnings(record=True): + complex128 = np.array( + [1.0 + 1.0j, 1.0 + 1.0j, 1.0 + 1.0j, 1.0 + 1.0j]) + s = Series(complex128, index=list('abcd')) + df = DataFrame({'A': s, 'B': s}) + p = Panel({'One': df, 'Two': df}) - objs = [s, df, p] - comps = [tm.assert_series_equal, tm.assert_frame_equal, - tm.assert_panel_equal] - for obj, comp in zip(objs, comps): - with ensure_clean_path(self.path) as path: - obj.to_hdf(path, 'obj', format='fixed') - reread = read_hdf(path, 'obj') - comp(obj, reread) + objs = [s, df, p] + comps = [tm.assert_series_equal, tm.assert_frame_equal, + tm.assert_panel_equal] + for obj, comp in zip(objs, comps): + with ensure_clean_path(self.path) as path: + obj.to_hdf(path, 'obj', format='fixed') + reread = read_hdf(path, 'obj') + comp(obj, reread) def test_complex_across_dimensions(self): complex128 = np.array([1.0 + 1.0j, 1.0 + 1.0j, 1.0 + 1.0j, 1.0 + 1.0j]) s = Series(complex128, index=list('abcd')) df = DataFrame({'A': s, 'B': s}) - p = Panel({'One': df, 'Two': df}) - with compat_assert_produces_warning(FutureWarning): - p4d = pd.Panel4D({'i': p, 'ii': p}) + with catch_warnings(record=True): + p = Panel({'One': df, 'Two': df}) + p4d = Panel4D({'i': p, 'ii': p}) objs = [df, p, p4d] comps = [tm.assert_frame_equal, tm.assert_panel_equal, @@ -5204,15 +5200,15 @@ def test_complex_indexing_error(self): 'C': complex128}, index=list('abcd')) with ensure_clean_store(self.path) as store: - self.assertRaises(TypeError, store.append, - 'df', df, data_columns=['C']) + pytest.raises(TypeError, store.append, + 'df', df, data_columns=['C']) def test_complex_series_error(self): complex128 = np.array([1.0 + 1.0j, 1.0 + 1.0j, 1.0 + 1.0j, 1.0 + 1.0j]) s = Series(complex128, index=list('abcd')) with ensure_clean_path(self.path) as path: - self.assertRaises(TypeError, s.to_hdf, path, 'obj', format='t') + pytest.raises(TypeError, s.to_hdf, path, 'obj', format='t') with ensure_clean_path(self.path) as path: s.to_hdf(path, 'obj', format='t', index=False) @@ -5230,7 +5226,7 @@ def test_complex_append(self): assert_frame_equal(pd.concat([df, df], 0), result) -class TestTimezones(Base, tm.TestCase): +class TestTimezones(Base): def _compare_with_tz(self, a, b): tm.assert_frame_equal(a, b) @@ -5251,7 +5247,7 @@ def test_append_with_timezones_dateutil(self): # use maybe_get_tz instead of dateutil.tz.gettz to handle the windows # filename issues. - from pandas.tslib import maybe_get_tz + from pandas._libs.tslib import maybe_get_tz gettz = lambda x: maybe_get_tz('dateutil/' + x) # as columns @@ -5268,7 +5264,7 @@ def test_append_with_timezones_dateutil(self): # select with tz aware expected = df[df.A >= df.A[3]] - result = store.select('df_tz', where=Term('A>=df.A[3]')) + result = store.select('df_tz', where='A>=df.A[3]') self._compare_with_tz(result, expected) # ensure we include dates in DST and STD time here. 
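The hunks above repeatedly swap `Term(...)` query objects for plain string conditions in `HDFStore.select` and friends. A minimal sketch of that usage outside the patch, assuming a local PyTables install; the file name and store key are illustrative:

```python
# Minimal sketch of the Term -> string "where" migration exercised above.
# Requires PyTables; 'example.h5' and the key 'df' are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(5.0), 'B': np.arange(5.0) * 2})

with pd.HDFStore('example.h5', mode='w') as store:
    # data_columns=['A'] makes column A queryable in "where" expressions
    store.append('df', df, data_columns=['A'])

    # old: store.select('df', where=Term('A>2'))
    # new: the condition is passed directly as a string
    result = store.select('df', where='A>2')

assert (result['A'] > 2).all()
```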
@@ -5287,7 +5283,7 @@ def test_append_with_timezones_dateutil(self): tz=gettz('US/Eastern')), B=Timestamp('20130102', tz=gettz('EET'))), index=range(5)) - self.assertRaises(ValueError, store.append, 'df_tz', df) + pytest.raises(ValueError, store.append, 'df_tz', df) # this is ok _maybe_remove(store, 'df_tz') @@ -5301,7 +5297,7 @@ def test_append_with_timezones_dateutil(self): tz=gettz('US/Eastern')), B=Timestamp('20130102', tz=gettz('CET'))), index=range(5)) - self.assertRaises(ValueError, store.append, 'df_tz', df) + pytest.raises(ValueError, store.append, 'df_tz', df) # as index with ensure_clean_store(self.path) as store: @@ -5339,7 +5335,7 @@ def test_append_with_timezones_pytz(self): # select with tz aware self._compare_with_tz(store.select( - 'df_tz', where=Term('A>=df.A[3]')), df[df.A >= df.A[3]]) + 'df_tz', where='A>=df.A[3]'), df[df.A >= df.A[3]]) _maybe_remove(store, 'df_tz') # ensure we include dates in DST and STD time here. @@ -5354,7 +5350,7 @@ def test_append_with_timezones_pytz(self): df = DataFrame(dict(A=Timestamp('20130102', tz='US/Eastern'), B=Timestamp('20130102', tz='EET')), index=range(5)) - self.assertRaises(ValueError, store.append, 'df_tz', df) + pytest.raises(ValueError, store.append, 'df_tz', df) # this is ok _maybe_remove(store, 'df_tz') @@ -5367,7 +5363,7 @@ def test_append_with_timezones_pytz(self): df = DataFrame(dict(A=Timestamp('20130102', tz='US/Eastern'), B=Timestamp('20130102', tz='CET')), index=range(5)) - self.assertRaises(ValueError, store.append, 'df_tz', df) + pytest.raises(ValueError, store.append, 'df_tz', df) # as index with ensure_clean_store(self.path) as store: @@ -5398,7 +5394,7 @@ def test_tseries_select_index_column(self): with ensure_clean_store(self.path) as store: store.append('frame', frame) result = store.select_column('frame', 'index') - self.assertEqual(rng.tz, DatetimeIndex(result.values).tz) + assert rng.tz == DatetimeIndex(result.values).tz # check utc rng = date_range('1/1/2000', '1/30/2000', tz='UTC') @@ -5407,7 +5403,7 @@ def test_tseries_select_index_column(self): with ensure_clean_store(self.path) as store: store.append('frame', frame) result = store.select_column('frame', 'index') - self.assertEqual(rng.tz, result.dt.tz) + assert rng.tz == result.dt.tz # double check non-utc rng = date_range('1/1/2000', '1/30/2000', tz='US/Eastern') @@ -5416,7 +5412,7 @@ def test_tseries_select_index_column(self): with ensure_clean_store(self.path) as store: store.append('frame', frame) result = store.select_column('frame', 'index') - self.assertEqual(rng.tz, result.dt.tz) + assert rng.tz == result.dt.tz def test_timezones_fixed(self): with ensure_clean_store(self.path) as store: @@ -5446,8 +5442,8 @@ def test_fixed_offset_tz(self): with ensure_clean_store(self.path) as store: store['frame'] = frame recons = store['frame'] - self.assert_index_equal(recons.index, rng) - self.assertEqual(rng.tz, recons.index.tz) + tm.assert_index_equal(recons.index, rng) + assert rng.tz == recons.index.tz def test_store_timezone(self): # GH2852 @@ -5502,18 +5498,3 @@ def test_dst_transitions(self): store.append('df', df) result = store.select('df') assert_frame_equal(result, df) - - -def _test_sort(obj): - if isinstance(obj, DataFrame): - return obj.reindex(sorted(obj.index)) - elif isinstance(obj, Panel): - return obj.reindex(major=sorted(obj.major_axis)) - else: - raise ValueError('type not supported here') - - -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git 
a/pandas/tests/io/test_s3.py b/pandas/tests/io/test_s3.py new file mode 100644 index 0000000000000..8c2a32af33765 --- /dev/null +++ b/pandas/tests/io/test_s3.py @@ -0,0 +1,8 @@ +from pandas.io.common import _is_s3_url + + +class TestS3URL(object): + + def test_is_s3_url(self): + assert _is_s3_url("s3://pandas/somethingelse.com") + assert not _is_s3_url("s4://pandas/somethingelse.com") diff --git a/pandas/io/tests/test_sql.py b/pandas/tests/io/test_sql.py similarity index 81% rename from pandas/io/tests/test_sql.py rename to pandas/tests/io/test_sql.py index cb08944e8dc57..7b3717281bf89 100644 --- a/pandas/io/tests/test_sql.py +++ b/pandas/tests/io/test_sql.py @@ -18,26 +18,26 @@ """ from __future__ import print_function -import unittest +from warnings import catch_warnings +import pytest import sqlite3 import csv import os -import sys -import nose import warnings import numpy as np import pandas as pd from datetime import datetime, date, time -from pandas.types.common import (is_object_dtype, is_datetime64_dtype, - is_datetime64tz_dtype) +from pandas.core.dtypes.common import ( + is_object_dtype, is_datetime64_dtype, + is_datetime64tz_dtype) from pandas import DataFrame, Series, Index, MultiIndex, isnull, concat from pandas import date_range, to_datetime, to_timedelta, Timestamp import pandas.compat as compat -from pandas.compat import StringIO, range, lrange, string_types, PY36 -from pandas.tseries.tools import format as date_format +from pandas.compat import range, lrange, string_types, PY36 +from pandas.core.tools.datetimes import format as date_format import pandas.io.sql as sql from pandas.io.sql import read_sql_table, read_sql_query @@ -178,7 +178,7 @@ class MixInBase(object): - def tearDown(self): + def teardown_method(self, method): for tbl in self._get_all_tables(): self.drop_table(tbl) self._close_conn() @@ -236,7 +236,7 @@ def _close_conn(self): pass -class PandasSQLTest(unittest.TestCase): +class PandasSQLTest(object): """ Base class with common private methods for SQLAlchemy and fallback cases. 
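The conversions that follow replace `unittest.TestCase` subclasses with plain classes and pytest's xunit-style hooks. A minimal sketch of the naming scheme, with an illustrative sqlite resource standing in for the real fixtures:

```python
# Sketch of the unittest -> pytest class conversion applied in this file:
# plain classes, with setup_method/teardown_method/setup_class replacing
# setUp/tearDown/setUpClass. The managed resource here is illustrative.
import sqlite3


class TestExample(object):

    @classmethod
    def setup_class(cls):
        # runs once for the class (was: setUpClass)
        cls.db = ':memory:'

    def setup_method(self, method):
        # runs before each test (was: setUp)
        self.conn = sqlite3.connect(self.db)
        self.conn.execute('CREATE TABLE t (a INT)')

    def teardown_method(self, method):
        # runs after each test (was: tearDown)
        self.conn.close()

    def test_insert(self):
        self.conn.execute('INSERT INTO t VALUES (1)')
        n, = self.conn.execute('SELECT COUNT(*) FROM t').fetchone()
        assert n == 1
```

Because the classes no longer inherit from `unittest.TestCase`, the `self.assert*` helpers are unavailable, which is what drives the wholesale move to bare `assert` and `pytest.raises` throughout these hunks.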
@@ -271,8 +271,7 @@ def _check_iris_loaded_frame(self, iris_frame): pytype = iris_frame.dtypes[0].type row = iris_frame.iloc[0] - self.assertTrue( - issubclass(pytype, np.floating), 'Loaded frame has incorrect type') + assert issubclass(pytype, np.floating) tm.equalContents(row.values, [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']) def _load_test1_data(self): @@ -371,8 +370,7 @@ def _to_sql(self): self.drop_table('test_frame1') self.pandasSQL.to_sql(self.test_frame1, 'test_frame1') - self.assertTrue(self.pandasSQL.has_table( - 'test_frame1'), 'Table not written to DB') + assert self.pandasSQL.has_table('test_frame1') # Nuke table self.drop_table('test_frame1') @@ -386,11 +384,10 @@ def _to_sql_fail(self): self.pandasSQL.to_sql( self.test_frame1, 'test_frame1', if_exists='fail') - self.assertTrue(self.pandasSQL.has_table( - 'test_frame1'), 'Table not written to DB') + assert self.pandasSQL.has_table('test_frame1') - self.assertRaises(ValueError, self.pandasSQL.to_sql, - self.test_frame1, 'test_frame1', if_exists='fail') + pytest.raises(ValueError, self.pandasSQL.to_sql, + self.test_frame1, 'test_frame1', if_exists='fail') self.drop_table('test_frame1') @@ -402,15 +399,12 @@ def _to_sql_replace(self): # Add to table again self.pandasSQL.to_sql( self.test_frame1, 'test_frame1', if_exists='replace') - self.assertTrue(self.pandasSQL.has_table( - 'test_frame1'), 'Table not written to DB') + assert self.pandasSQL.has_table('test_frame1') num_entries = len(self.test_frame1) num_rows = self._count_rows('test_frame1') - self.assertEqual( - num_rows, num_entries, "not the same number of rows as entries") - + assert num_rows == num_entries self.drop_table('test_frame1') def _to_sql_append(self): @@ -423,15 +417,12 @@ def _to_sql_append(self): # Add to table again self.pandasSQL.to_sql( self.test_frame1, 'test_frame1', if_exists='append') - self.assertTrue(self.pandasSQL.has_table( - 'test_frame1'), 'Table not written to DB') + assert self.pandasSQL.has_table('test_frame1') num_entries = 2 * len(self.test_frame1) num_rows = self._count_rows('test_frame1') - self.assertEqual( - num_rows, num_entries, "not the same number of rows as entries") - + assert num_rows == num_entries self.drop_table('test_frame1') def _roundtrip(self): @@ -458,7 +449,7 @@ def _to_sql_save_index(self): columns=['A', 'B', 'C'], index=['A']) self.pandasSQL.to_sql(df, 'test_to_sql_saves_index') ix_cols = self._get_index_columns('test_to_sql_saves_index') - self.assertEqual(ix_cols, [['A', ], ]) + assert ix_cols == [['A', ], ] def _transaction_test(self): self.pandasSQL.execute("CREATE TABLE test_trans (A INT, B TEXT)") @@ -474,13 +465,13 @@ def _transaction_test(self): # ignore raised exception pass res = self.pandasSQL.read_query('SELECT * FROM test_trans') - self.assertEqual(len(res), 0) + assert len(res) == 0 # Make sure when transaction is committed, rows do get inserted with self.pandasSQL.run_transaction() as trans: trans.execute(ins_sql) res2 = self.pandasSQL.read_query('SELECT * FROM test_trans') - self.assertEqual(len(res2), 1) + assert len(res2) == 1 # ----------------------------------------------------------------------------- @@ -506,7 +497,7 @@ class _TestSQLApi(PandasSQLTest): flavor = 'sqlite' mode = None - def setUp(self): + def setup_method(self, method): self.conn = self.connect() self._load_iris_data() self._load_iris_view() @@ -527,19 +518,15 @@ def test_read_sql_view(self): def test_to_sql(self): sql.to_sql(self.test_frame1, 'test_frame1', self.conn) - self.assertTrue( - sql.has_table('test_frame1', self.conn), - 'Table 
not written to DB') + assert sql.has_table('test_frame1', self.conn) def test_to_sql_fail(self): sql.to_sql(self.test_frame1, 'test_frame2', self.conn, if_exists='fail') - self.assertTrue( - sql.has_table('test_frame2', self.conn), - 'Table not written to DB') + assert sql.has_table('test_frame2', self.conn) - self.assertRaises(ValueError, sql.to_sql, self.test_frame1, - 'test_frame2', self.conn, if_exists='fail') + pytest.raises(ValueError, sql.to_sql, self.test_frame1, + 'test_frame2', self.conn, if_exists='fail') def test_to_sql_replace(self): sql.to_sql(self.test_frame1, 'test_frame3', @@ -547,15 +534,12 @@ def test_to_sql_replace(self): # Add to table again sql.to_sql(self.test_frame1, 'test_frame3', self.conn, if_exists='replace') - self.assertTrue( - sql.has_table('test_frame3', self.conn), - 'Table not written to DB') + assert sql.has_table('test_frame3', self.conn) num_entries = len(self.test_frame1) num_rows = self._count_rows('test_frame3') - self.assertEqual( - num_rows, num_entries, "not the same number of rows as entries") + assert num_rows == num_entries def test_to_sql_append(self): sql.to_sql(self.test_frame1, 'test_frame4', @@ -564,15 +548,12 @@ def test_to_sql_append(self): # Add to table again sql.to_sql(self.test_frame1, 'test_frame4', self.conn, if_exists='append') - self.assertTrue( - sql.has_table('test_frame4', self.conn), - 'Table not written to DB') + assert sql.has_table('test_frame4', self.conn) num_entries = 2 * len(self.test_frame1) num_rows = self._count_rows('test_frame4') - self.assertEqual( - num_rows, num_entries, "not the same number of rows as entries") + assert num_rows == num_entries def test_to_sql_type_mapping(self): sql.to_sql(self.test_frame3, 'test_frame5', self.conn, index=False) @@ -587,8 +568,9 @@ def test_to_sql_series(self): tm.assert_frame_equal(s.to_frame(), s2) def test_to_sql_panel(self): - panel = tm.makePanel() - self.assertRaises(NotImplementedError, sql.to_sql, panel, + with catch_warnings(record=True): + panel = tm.makePanel() + pytest.raises(NotImplementedError, sql.to_sql, panel, 'test_panel', self.conn) def test_roundtrip(self): @@ -623,33 +605,25 @@ def test_date_parsing(self): # Test date parsing in read_sq # No Parsing df = sql.read_sql_query("SELECT * FROM types_test_data", self.conn) - self.assertFalse( - issubclass(df.DateCol.dtype.type, np.datetime64), - "DateCol loaded with incorrect type") + assert not issubclass(df.DateCol.dtype.type, np.datetime64) df = sql.read_sql_query("SELECT * FROM types_test_data", self.conn, parse_dates=['DateCol']) - self.assertTrue( - issubclass(df.DateCol.dtype.type, np.datetime64), - "DateCol loaded with incorrect type") + assert issubclass(df.DateCol.dtype.type, np.datetime64) df = sql.read_sql_query("SELECT * FROM types_test_data", self.conn, parse_dates={'DateCol': '%Y-%m-%d %H:%M:%S'}) - self.assertTrue( - issubclass(df.DateCol.dtype.type, np.datetime64), - "DateCol loaded with incorrect type") + assert issubclass(df.DateCol.dtype.type, np.datetime64) df = sql.read_sql_query("SELECT * FROM types_test_data", self.conn, parse_dates=['IntDateCol']) - self.assertTrue(issubclass(df.IntDateCol.dtype.type, np.datetime64), - "IntDateCol loaded with incorrect type") + assert issubclass(df.IntDateCol.dtype.type, np.datetime64) df = sql.read_sql_query("SELECT * FROM types_test_data", self.conn, parse_dates={'IntDateCol': 's'}) - self.assertTrue(issubclass(df.IntDateCol.dtype.type, np.datetime64), - "IntDateCol loaded with incorrect type") + assert issubclass(df.IntDateCol.dtype.type, 
np.datetime64) def test_date_and_index(self): # Test case where same column appears in parse_date and index_col @@ -658,11 +632,8 @@ def test_date_and_index(self): index_col='DateCol', parse_dates=['DateCol', 'IntDateCol']) - self.assertTrue(issubclass(df.index.dtype.type, np.datetime64), - "DateCol loaded with incorrect type") - - self.assertTrue(issubclass(df.IntDateCol.dtype.type, np.datetime64), - "IntDateCol loaded with incorrect type") + assert issubclass(df.index.dtype.type, np.datetime64) + assert issubclass(df.IntDateCol.dtype.type, np.datetime64) def test_timedelta(self): @@ -677,7 +648,7 @@ def test_timedelta(self): def test_complex(self): df = DataFrame({'a': [1 + 1j, 2j]}) # Complex data type should raise error - self.assertRaises(ValueError, df.to_sql, 'test_complex', self.conn) + pytest.raises(ValueError, df.to_sql, 'test_complex', self.conn) def test_to_sql_index_label(self): temp_frame = DataFrame({'col1': range(4)}) @@ -685,29 +656,39 @@ def test_to_sql_index_label(self): # no index name, defaults to 'index' sql.to_sql(temp_frame, 'test_index_label', self.conn) frame = sql.read_sql_query('SELECT * FROM test_index_label', self.conn) - self.assertEqual(frame.columns[0], 'index') + assert frame.columns[0] == 'index' # specifying index_label sql.to_sql(temp_frame, 'test_index_label', self.conn, if_exists='replace', index_label='other_label') frame = sql.read_sql_query('SELECT * FROM test_index_label', self.conn) - self.assertEqual(frame.columns[0], 'other_label', - "Specified index_label not written to database") + assert frame.columns[0] == "other_label" # using the index name temp_frame.index.name = 'index_name' sql.to_sql(temp_frame, 'test_index_label', self.conn, if_exists='replace') frame = sql.read_sql_query('SELECT * FROM test_index_label', self.conn) - self.assertEqual(frame.columns[0], 'index_name', - "Index name not written to database") + assert frame.columns[0] == "index_name" # has index name, but specifying index_label sql.to_sql(temp_frame, 'test_index_label', self.conn, if_exists='replace', index_label='other_label') frame = sql.read_sql_query('SELECT * FROM test_index_label', self.conn) - self.assertEqual(frame.columns[0], 'other_label', - "Specified index_label not written to database") + assert frame.columns[0] == "other_label" + + # index name is integer + temp_frame.index.name = 0 + sql.to_sql(temp_frame, 'test_index_label', self.conn, + if_exists='replace') + frame = sql.read_sql_query('SELECT * FROM test_index_label', self.conn) + assert frame.columns[0] == "0" + + temp_frame.index.name = None + sql.to_sql(temp_frame, 'test_index_label', self.conn, + if_exists='replace', index_label=0) + frame = sql.read_sql_query('SELECT * FROM test_index_label', self.conn) + assert frame.columns[0] == "0" def test_to_sql_index_label_multiindex(self): temp_frame = DataFrame({'col1': range(4)}, @@ -717,35 +698,32 @@ def test_to_sql_index_label_multiindex(self): # no index name, defaults to 'level_0' and 'level_1' sql.to_sql(temp_frame, 'test_index_label', self.conn) frame = sql.read_sql_query('SELECT * FROM test_index_label', self.conn) - self.assertEqual(frame.columns[0], 'level_0') - self.assertEqual(frame.columns[1], 'level_1') + assert frame.columns[0] == 'level_0' + assert frame.columns[1] == 'level_1' # specifying index_label sql.to_sql(temp_frame, 'test_index_label', self.conn, if_exists='replace', index_label=['A', 'B']) frame = sql.read_sql_query('SELECT * FROM test_index_label', self.conn) - self.assertEqual(frame.columns[:2].tolist(), ['A', 'B'], - "Specified 
index_labels not written to database") + assert frame.columns[:2].tolist() == ['A', 'B'] # using the index name temp_frame.index.names = ['A', 'B'] sql.to_sql(temp_frame, 'test_index_label', self.conn, if_exists='replace') frame = sql.read_sql_query('SELECT * FROM test_index_label', self.conn) - self.assertEqual(frame.columns[:2].tolist(), ['A', 'B'], - "Index names not written to database") + assert frame.columns[:2].tolist() == ['A', 'B'] # has index name, but specifying index_label sql.to_sql(temp_frame, 'test_index_label', self.conn, if_exists='replace', index_label=['C', 'D']) frame = sql.read_sql_query('SELECT * FROM test_index_label', self.conn) - self.assertEqual(frame.columns[:2].tolist(), ['C', 'D'], - "Specified index_labels not written to database") + assert frame.columns[:2].tolist() == ['C', 'D'] # wrong length of index_label - self.assertRaises(ValueError, sql.to_sql, temp_frame, - 'test_index_label', self.conn, if_exists='replace', - index_label='C') + pytest.raises(ValueError, sql.to_sql, temp_frame, + 'test_index_label', self.conn, if_exists='replace', + index_label='C') def test_multiindex_roundtrip(self): df = DataFrame.from_records([(1, 2.1, 'line1'), (2, 1.5, 'line2')], @@ -763,27 +741,27 @@ def test_integer_col_names(self): def test_get_schema(self): create_sql = sql.get_schema(self.test_frame1, 'test', con=self.conn) - self.assertTrue('CREATE' in create_sql) + assert 'CREATE' in create_sql def test_get_schema_dtypes(self): float_frame = DataFrame({'a': [1.1, 1.2], 'b': [2.1, 2.2]}) dtype = sqlalchemy.Integer if self.mode == 'sqlalchemy' else 'INTEGER' create_sql = sql.get_schema(float_frame, 'test', con=self.conn, dtype={'b': dtype}) - self.assertTrue('CREATE' in create_sql) - self.assertTrue('INTEGER' in create_sql) + assert 'CREATE' in create_sql + assert 'INTEGER' in create_sql def test_get_schema_keys(self): frame = DataFrame({'Col1': [1.1, 1.2], 'Col2': [2.1, 2.2]}) create_sql = sql.get_schema(frame, 'test', con=self.conn, keys='Col1') constraint_sentence = 'CONSTRAINT test_pk PRIMARY KEY ("Col1")' - self.assertTrue(constraint_sentence in create_sql) + assert constraint_sentence in create_sql # multiple columns as key (GH10385) create_sql = sql.get_schema(self.test_frame1, 'test', con=self.conn, keys=['A', 'B']) constraint_sentence = 'CONSTRAINT test_pk PRIMARY KEY ("A", "B")' - self.assertTrue(constraint_sentence in create_sql) + assert constraint_sentence in create_sql def test_chunksize_read(self): df = DataFrame(np.random.randn(22, 5), columns=list('abcde')) @@ -800,7 +778,7 @@ def test_chunksize_read(self): for chunk in sql.read_sql_query("select * from test_chunksize", self.conn, chunksize=5): res2 = concat([res2, chunk], ignore_index=True) - self.assertEqual(len(chunk), sizes[i]) + assert len(chunk) == sizes[i] i += 1 tm.assert_frame_equal(res1, res2) @@ -814,7 +792,7 @@ def test_chunksize_read(self): for chunk in sql.read_sql_table("test_chunksize", self.conn, chunksize=5): res3 = concat([res3, chunk], ignore_index=True) - self.assertEqual(len(chunk), sizes[i]) + assert len(chunk) == sizes[i] i += 1 tm.assert_frame_equal(res1, res3) @@ -839,6 +817,7 @@ def test_unicode_column_name(self): df.to_sql('test_unicode', self.conn, index=False) +@pytest.mark.single class TestSQLApi(SQLAlchemyMixIn, _TestSQLApi): """ Test the public API as it would be used directly @@ -854,7 +833,7 @@ def connect(self): if SQLALCHEMY_INSTALLED: return sqlalchemy.create_engine('sqlite:///:memory:') else: - raise nose.SkipTest('SQLAlchemy not installed') + pytest.skip('SQLAlchemy 
not installed') def test_read_table_columns(self): # test columns argument in read_table sql.to_sql(self.test_frame1, 'test_frame', self.conn) cols = ['A', 'B'] result = sql.read_sql_table('test_frame', self.conn, columns=cols) - self.assertEqual(result.columns.tolist(), cols, - "Columns not correctly selected") + assert result.columns.tolist() == cols def test_read_table_index_col(self): # test columns argument in read_table sql.to_sql(self.test_frame1, 'test_frame', self.conn) result = sql.read_sql_table('test_frame', self.conn, index_col="index") - self.assertEqual(result.index.names, ["index"], - "index_col not correctly set") + assert result.index.names == ["index"] result = sql.read_sql_table( 'test_frame', self.conn, index_col=["A", "B"]) - self.assertEqual(result.index.names, ["A", "B"], - "index_col not correctly set") + assert result.index.names == ["A", "B"] result = sql.read_sql_table('test_frame', self.conn, index_col=["A", "B"], columns=["C", "D"]) - self.assertEqual(result.index.names, ["A", "B"], - "index_col not correctly set") - self.assertEqual(result.columns.tolist(), ["C", "D"], - "columns not set correctly whith index_col") + assert result.index.names == ["A", "B"] + assert result.columns.tolist() == ["C", "D"] def test_read_sql_delegate(self): iris_frame1 = sql.read_sql_query( @@ -911,10 +885,11 @@ def test_not_reflect_all_tables(self): sql.read_sql_table('other_table', self.conn) sql.read_sql_query('SELECT * FROM other_table', self.conn) # Verify some things - self.assertEqual(len(w), 0, "Warning triggered for other table") + assert len(w) == 0 def test_warning_case_insensitive_table_name(self): - # see GH7815. + # see gh-7815 + # + # We can't test that this warning is triggered, as the database # configuration would have to be altered. But here we test that # the warning is certainly NOT triggered in a normal case.
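Several of these tests assert the absence of a warning by recording with `catch_warnings` and checking that the captured list is empty. A standalone sketch of the pattern, with an illustrative function:

```python
# Sketch of the "no warning raised" pattern used above: record all
# warnings, then assert none were captured. noisy() is illustrative.
import warnings


def noisy(x):
    if x < 0:
        warnings.warn('negative input', UserWarning)
    return abs(x)


def test_no_warning_for_valid_input():
    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter('always')
        noisy(3)
    assert len(w) == 0
```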
@@ -924,8 +899,7 @@ def test_warning_case_insensitive_table_name(self): # This should not trigger a Warning self.test_frame1.to_sql('CaseSensitive', self.conn) # Verify some things - self.assertEqual( - len(w), 0, "Warning triggered for writing a table") + assert len(w) == 0 def _get_index_columns(self, tbl_name): from sqlalchemy.engine import reflection @@ -941,8 +915,7 @@ def test_sqlalchemy_type_mapping(self): utc=True)}) db = sql.SQLDatabase(self.conn) table = sql.SQLTable("test_type", db, frame=df) - self.assertTrue(isinstance( - table.table.c['time'].type, sqltypes.DateTime)) + assert isinstance(table.table.c['time'].type, sqltypes.DateTime) def test_database_uri_string(self): @@ -966,7 +939,7 @@ def test_database_uri_string(self): # using driver that will not be installed on Travis to trigger error # in sqlalchemy.create_engine -> test passing of this error to user db_uri = "postgresql+pg8000://user:pass@host/dbname" - with tm.assertRaisesRegexp(ImportError, "pg8000"): + with tm.assert_raises_regex(ImportError, "pg8000"): sql.read_sql("select * from table", db_uri) def _make_iris_table_metadata(self): @@ -988,7 +961,7 @@ def test_query_by_text_obj(self): iris_df = sql.read_sql(name_text, self.conn, params={ 'name': 'Iris-versicolor'}) all_names = set(iris_df['Name']) - self.assertEqual(all_names, set(['Iris-versicolor'])) + assert all_names == set(['Iris-versicolor']) def test_query_by_select_obj(self): # WIP : GH10846 @@ -999,7 +972,7 @@ def test_query_by_select_obj(self): iris_df = sql.read_sql(name_select, self.conn, params={'name': 'Iris-setosa'}) all_names = set(iris_df['Name']) - self.assertEqual(all_names, set(['Iris-setosa'])) + assert all_names == set(['Iris-setosa']) class _EngineToConnMixin(object): @@ -1007,8 +980,8 @@ class _EngineToConnMixin(object): A mixin that causes setup_connect to create a conn rather than an engine. 
""" - def setUp(self): - super(_EngineToConnMixin, self).setUp() + def setup_method(self, method): + super(_EngineToConnMixin, self).setup_method(method) engine = self.conn conn = engine.connect() self.__tx = conn.begin() @@ -1016,18 +989,20 @@ def setUp(self): self.__engine = engine self.conn = conn - def tearDown(self): + def teardown_method(self, method): self.__tx.rollback() self.conn.close() self.conn = self.__engine self.pandasSQL = sql.SQLDatabase(self.__engine) - super(_EngineToConnMixin, self).tearDown() + super(_EngineToConnMixin, self).teardown_method(method) +@pytest.mark.single class TestSQLApiConn(_EngineToConnMixin, TestSQLApi): pass +@pytest.mark.single class TestSQLiteFallbackApi(SQLiteMixIn, _TestSQLApi): """ Test the public sqlite connection fallback API @@ -1060,17 +1035,17 @@ def test_sql_open_close(self): def test_con_string_import_error(self): if not SQLALCHEMY_INSTALLED: conn = 'mysql://root@localhost/pandas_nosetest' - self.assertRaises(ImportError, sql.read_sql, "SELECT * FROM iris", - conn) + pytest.raises(ImportError, sql.read_sql, "SELECT * FROM iris", + conn) else: - raise nose.SkipTest('SQLAlchemy is installed') + pytest.skip('SQLAlchemy is installed') def test_read_sql_delegate(self): iris_frame1 = sql.read_sql_query("SELECT * FROM iris", self.conn) iris_frame2 = sql.read_sql("SELECT * FROM iris", self.conn) tm.assert_frame_equal(iris_frame1, iris_frame2) - self.assertRaises(sql.DatabaseError, sql.read_sql, 'iris', self.conn) + pytest.raises(sql.DatabaseError, sql.read_sql, 'iris', self.conn) def test_safe_names_warning(self): # GH 6798 @@ -1082,7 +1057,7 @@ def test_safe_names_warning(self): def test_get_schema2(self): # without providing a connection object (available for backwards comp) create_sql = sql.get_schema(self.test_frame1, 'test') - self.assertTrue('CREATE' in create_sql) + assert 'CREATE' in create_sql def _get_sqlite_column_type(self, schema, column): @@ -1099,8 +1074,7 @@ def test_sqlite_type_mapping(self): db = sql.SQLiteDatabase(self.conn) table = sql.SQLiteTable("test_type", db, frame=df) schema = table.sql_schema() - self.assertEqual(self._get_sqlite_column_type(schema, 'time'), - "TIMESTAMP") + assert self._get_sqlite_column_type(schema, 'time') == "TIMESTAMP" # ----------------------------------------------------------------------------- @@ -1118,7 +1092,7 @@ class _TestSQLAlchemy(SQLAlchemyMixIn, PandasSQLTest): flavor = None @classmethod - def setUpClass(cls): + def setup_class(cls): cls.setup_import() cls.setup_driver() @@ -1128,9 +1102,9 @@ def setUpClass(cls): conn.connect() except sqlalchemy.exc.OperationalError: msg = "{0} - can't connect to {1} server".format(cls, cls.flavor) - raise nose.SkipTest(msg) + pytest.skip(msg) - def setUp(self): + def setup_method(self, method): self.setup_connect() self._load_iris_data() @@ -1141,7 +1115,7 @@ def setUp(self): def setup_import(cls): # Skip this test if SQLAlchemy not available if not SQLALCHEMY_INSTALLED: - raise nose.SkipTest('SQLAlchemy not installed') + pytest.skip('SQLAlchemy not installed') @classmethod def setup_driver(cls): @@ -1158,7 +1132,7 @@ def setup_connect(self): # to test if connection can be made: self.conn.connect() except sqlalchemy.exc.OperationalError: - raise nose.SkipTest( + pytest.skip( "Can't connect to {0} server".format(self.flavor)) def test_aread_sql(self): @@ -1193,8 +1167,7 @@ def test_create_table(self): pandasSQL = sql.SQLDatabase(temp_conn) pandasSQL.to_sql(temp_frame, 'temp_frame') - self.assertTrue( - temp_conn.has_table('temp_frame'), 'Table not 
written to DB') + assert temp_conn.has_table('temp_frame') def test_drop_table(self): temp_conn = self.connect() @@ -1205,13 +1178,11 @@ def test_drop_table(self): pandasSQL = sql.SQLDatabase(temp_conn) pandasSQL.to_sql(temp_frame, 'temp_frame') - self.assertTrue( - temp_conn.has_table('temp_frame'), 'Table not written to DB') + assert temp_conn.has_table('temp_frame') pandasSQL.drop_table('temp_frame') - self.assertFalse( - temp_conn.has_table('temp_frame'), 'Table not deleted from DB') + assert not temp_conn.has_table('temp_frame') def test_roundtrip(self): self._roundtrip() @@ -1230,25 +1201,20 @@ def test_read_table_columns(self): iris_frame.columns.values, ['SepalLength', 'SepalLength']) def test_read_table_absent(self): - self.assertRaises( + pytest.raises( ValueError, sql.read_sql_table, "this_doesnt_exist", con=self.conn) def test_default_type_conversion(self): df = sql.read_sql_table("types_test_data", self.conn) - self.assertTrue(issubclass(df.FloatCol.dtype.type, np.floating), - "FloatCol loaded with incorrect type") - self.assertTrue(issubclass(df.IntCol.dtype.type, np.integer), - "IntCol loaded with incorrect type") - self.assertTrue(issubclass(df.BoolCol.dtype.type, np.bool_), - "BoolCol loaded with incorrect type") + assert issubclass(df.FloatCol.dtype.type, np.floating) + assert issubclass(df.IntCol.dtype.type, np.integer) + assert issubclass(df.BoolCol.dtype.type, np.bool_) # Int column with NA values stays as float - self.assertTrue(issubclass(df.IntColWithNull.dtype.type, np.floating), - "IntColWithNull loaded with incorrect type") + assert issubclass(df.IntColWithNull.dtype.type, np.floating) # Bool column with NA values becomes object - self.assertTrue(issubclass(df.BoolColWithNull.dtype.type, np.object), - "BoolColWithNull loaded with incorrect type") + assert issubclass(df.BoolColWithNull.dtype.type, np.object) def test_bigint(self): # int64 should be converted to BigInteger, GH7433 @@ -1263,8 +1229,7 @@ def test_default_date_load(self): # IMPORTANT - sqlite has no native date type, so shouldn't parse, but # MySQL SHOULD be converted. 
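The dtype checks in these hunks collapse `self.assertTrue(issubclass(...), msg)` into bare asserts on `dtype.type`; pytest reconstructs the failing expression in its error output, so the message argument is dropped. A self-contained sketch, including the NA-driven promotion the comments describe:

```python
# Sketch of the dtype assertions above: bare asserts on dtype.type
# replace assertTrue-with-message. Data values are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({'FloatCol': [1.5, 2.5],
                   'IntCol': [1, 2],
                   'IntColWithNull': [1, None]})

assert issubclass(df.FloatCol.dtype.type, np.floating)
assert issubclass(df.IntCol.dtype.type, np.integer)

# an integer column holding NA values is promoted to float
assert issubclass(df.IntColWithNull.dtype.type, np.floating)
```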
- self.assertTrue(issubclass(df.DateCol.dtype.type, np.datetime64), - "DateCol loaded with incorrect type") + assert issubclass(df.DateCol.dtype.type, np.datetime64) def test_datetime_with_timezone(self): # edge case that converts postgresql datetime with time zone types @@ -1278,24 +1243,22 @@ def check(col): # "2000-01-01 00:00:00-08:00" should convert to # "2000-01-01 08:00:00" - self.assertEqual(col[0], Timestamp('2000-01-01 08:00:00')) + assert col[0] == Timestamp('2000-01-01 08:00:00') # "2000-06-01 00:00:00-07:00" should convert to # "2000-06-01 07:00:00" - self.assertEqual(col[1], Timestamp('2000-06-01 07:00:00')) + assert col[1] == Timestamp('2000-06-01 07:00:00') elif is_datetime64tz_dtype(col.dtype): - self.assertTrue(str(col.dt.tz) == 'UTC') + assert str(col.dt.tz) == 'UTC' # "2000-01-01 00:00:00-08:00" should convert to # "2000-01-01 08:00:00" - self.assertEqual(col[0], Timestamp( - '2000-01-01 08:00:00', tz='UTC')) + assert col[0] == Timestamp('2000-01-01 08:00:00', tz='UTC') # "2000-06-01 00:00:00-07:00" should convert to # "2000-06-01 07:00:00" - self.assertEqual(col[1], Timestamp( - '2000-06-01 07:00:00', tz='UTC')) + assert col[1] == Timestamp('2000-06-01 07:00:00', tz='UTC') else: raise AssertionError("DateCol loaded with incorrect type " @@ -1304,32 +1267,28 @@ def check(col): # GH11216 df = pd.read_sql_query("select * from types_test_data", self.conn) if not hasattr(df, 'DateColWithTz'): - raise nose.SkipTest("no column with datetime with time zone") + pytest.skip("no column with datetime with time zone") # this is parsed on Travis (linux), but not on macosx for some reason # even with the same versions of psycopg2 & sqlalchemy, possibly a # PostgreSQL server version difference col = df.DateColWithTz - self.assertTrue(is_object_dtype(col.dtype) or - is_datetime64_dtype(col.dtype) or - is_datetime64tz_dtype(col.dtype), - "DateCol loaded with incorrect type -> {0}" - .format(col.dtype)) + assert (is_object_dtype(col.dtype) or + is_datetime64_dtype(col.dtype) or + is_datetime64tz_dtype(col.dtype)) df = pd.read_sql_query("select * from types_test_data", self.conn, parse_dates=['DateColWithTz']) if not hasattr(df, 'DateColWithTz'): - raise nose.SkipTest("no column with datetime with time zone") + pytest.skip("no column with datetime with time zone") check(df.DateColWithTz) df = pd.concat(list(pd.read_sql_query("select * from types_test_data", self.conn, chunksize=1)), ignore_index=True) col = df.DateColWithTz - self.assertTrue(is_datetime64tz_dtype(col.dtype), - "DateCol loaded with incorrect type -> {0}" - .format(col.dtype)) + assert is_datetime64tz_dtype(col.dtype) + assert str(col.dt.tz) == 'UTC' expected = sql.read_sql_table("types_test_data", self.conn) tm.assert_series_equal(df.DateColWithTz, expected.DateColWithTz @@ -1346,33 +1305,27 @@ def test_date_parsing(self): df = sql.read_sql_table("types_test_data", self.conn, parse_dates=['DateCol']) - self.assertTrue(issubclass(df.DateCol.dtype.type, np.datetime64), - "DateCol loaded with incorrect type") + assert issubclass(df.DateCol.dtype.type, np.datetime64) df = sql.read_sql_table("types_test_data", self.conn, parse_dates={'DateCol': '%Y-%m-%d %H:%M:%S'}) - self.assertTrue(issubclass(df.DateCol.dtype.type, np.datetime64), - "DateCol loaded with incorrect type") + assert issubclass(df.DateCol.dtype.type, np.datetime64) df = sql.read_sql_table("types_test_data", self.conn, parse_dates={ 'DateCol': {'format': '%Y-%m-%d %H:%M:%S'}}) -
self.assertTrue(issubclass(df.DateCol.dtype.type, np.datetime64), - "IntDateCol loaded with incorrect type") + assert issubclass(df.DateCol.dtype.type, np.datetime64) df = sql.read_sql_table( "types_test_data", self.conn, parse_dates=['IntDateCol']) - self.assertTrue(issubclass(df.IntDateCol.dtype.type, np.datetime64), - "IntDateCol loaded with incorrect type") + assert issubclass(df.IntDateCol.dtype.type, np.datetime64) df = sql.read_sql_table( "types_test_data", self.conn, parse_dates={'IntDateCol': 's'}) - self.assertTrue(issubclass(df.IntDateCol.dtype.type, np.datetime64), - "IntDateCol loaded with incorrect type") + assert issubclass(df.IntDateCol.dtype.type, np.datetime64) df = sql.read_sql_table("types_test_data", self.conn, parse_dates={'IntDateCol': {'unit': 's'}}) - self.assertTrue(issubclass(df.IntDateCol.dtype.type, np.datetime64), - "IntDateCol loaded with incorrect type") + assert issubclass(df.IntDateCol.dtype.type, np.datetime64) def test_datetime(self): df = DataFrame({'A': date_range('2013-01-01 09:00:00', periods=3), @@ -1388,7 +1341,7 @@ def test_datetime(self): result = sql.read_sql_query('SELECT * FROM test_datetime', self.conn) result = result.drop('index', axis=1) if self.flavor == 'sqlite': - self.assertTrue(isinstance(result.loc[0, 'A'], string_types)) + assert isinstance(result.loc[0, 'A'], string_types) result['A'] = to_datetime(result['A']) tm.assert_frame_equal(result, df) else: @@ -1407,7 +1360,7 @@ def test_datetime_NaT(self): # with read_sql -> no type information -> sqlite has no native result = sql.read_sql_query('SELECT * FROM test_datetime', self.conn) if self.flavor == 'sqlite': - self.assertTrue(isinstance(result.loc[0, 'A'], string_types)) + assert isinstance(result.loc[0, 'A'], string_types) result['A'] = to_datetime(result['A'], errors='coerce') tm.assert_frame_equal(result, df) else: @@ -1540,16 +1493,16 @@ def test_dtype(self): meta = sqlalchemy.schema.MetaData(bind=self.conn) meta.reflect() sqltype = meta.tables['dtype_test2'].columns['B'].type - self.assertTrue(isinstance(sqltype, sqlalchemy.TEXT)) - self.assertRaises(ValueError, df.to_sql, - 'error', self.conn, dtype={'B': str}) + assert isinstance(sqltype, sqlalchemy.TEXT) + pytest.raises(ValueError, df.to_sql, + 'error', self.conn, dtype={'B': str}) # GH9083 df.to_sql('dtype_test3', self.conn, dtype={'B': sqlalchemy.String(10)}) meta.reflect() sqltype = meta.tables['dtype_test3'].columns['B'].type - self.assertTrue(isinstance(sqltype, sqlalchemy.String)) - self.assertEqual(sqltype.length, 10) + assert isinstance(sqltype, sqlalchemy.String) + assert sqltype.length == 10 # single dtype df.to_sql('single_dtype_test', self.conn, dtype=sqlalchemy.TEXT) @@ -1557,8 +1510,8 @@ def test_dtype(self): meta.reflect() sqltypea = meta.tables['single_dtype_test'].columns['A'].type sqltypeb = meta.tables['single_dtype_test'].columns['B'].type - self.assertTrue(isinstance(sqltypea, sqlalchemy.TEXT)) - self.assertTrue(isinstance(sqltypeb, sqlalchemy.TEXT)) + assert isinstance(sqltypea, sqlalchemy.TEXT) + assert isinstance(sqltypeb, sqlalchemy.TEXT) def test_notnull_dtype(self): cols = {'Bool': Series([True, None]), @@ -1580,10 +1533,10 @@ def test_notnull_dtype(self): col_dict = meta.tables[tbl].columns - self.assertTrue(isinstance(col_dict['Bool'].type, my_type)) - self.assertTrue(isinstance(col_dict['Date'].type, sqltypes.DateTime)) - self.assertTrue(isinstance(col_dict['Int'].type, sqltypes.Integer)) - self.assertTrue(isinstance(col_dict['Float'].type, sqltypes.Float)) + assert 
isinstance(col_dict['Bool'].type, my_type) + assert isinstance(col_dict['Date'].type, sqltypes.DateTime) + assert isinstance(col_dict['Int'].type, sqltypes.Integer) + assert isinstance(col_dict['Float'].type, sqltypes.Float) def test_double_precision(self): V = 1.23456789101112131415 @@ -1600,19 +1553,18 @@ def test_double_precision(self): res = sql.read_sql_table('test_dtypes', self.conn) # check precision of float64 - self.assertEqual(np.round(df['f64'].iloc[0], 14), - np.round(res['f64'].iloc[0], 14)) + assert (np.round(df['f64'].iloc[0], 14) == + np.round(res['f64'].iloc[0], 14)) # check sql types meta = sqlalchemy.schema.MetaData(bind=self.conn) meta.reflect() col_dict = meta.tables['test_dtypes'].columns - self.assertEqual(str(col_dict['f32'].type), - str(col_dict['f64_as_f32'].type)) - self.assertTrue(isinstance(col_dict['f32'].type, sqltypes.Float)) - self.assertTrue(isinstance(col_dict['f64'].type, sqltypes.Float)) - self.assertTrue(isinstance(col_dict['i32'].type, sqltypes.Integer)) - self.assertTrue(isinstance(col_dict['i64'].type, sqltypes.BigInteger)) + assert str(col_dict['f32'].type) == str(col_dict['f64_as_f32'].type) + assert isinstance(col_dict['f32'].type, sqltypes.Float) + assert isinstance(col_dict['f64'].type, sqltypes.Float) + assert isinstance(col_dict['i32'].type, sqltypes.Integer) + assert isinstance(col_dict['i64'].type, sqltypes.BigInteger) def test_connectable_issue_example(self): # This tests the example raised in issue @@ -1665,7 +1617,7 @@ class Temporary(Base): class _TestSQLAlchemyConn(_EngineToConnMixin, _TestSQLAlchemy): def test_transactions(self): - raise nose.SkipTest( + pytest.skip( "Nested transactions rollbacks don't work with Pandas") @@ -1688,27 +1640,23 @@ def setup_driver(cls): def test_default_type_conversion(self): df = sql.read_sql_table("types_test_data", self.conn) - self.assertTrue(issubclass(df.FloatCol.dtype.type, np.floating), - "FloatCol loaded with incorrect type") - self.assertTrue(issubclass(df.IntCol.dtype.type, np.integer), - "IntCol loaded with incorrect type") + assert issubclass(df.FloatCol.dtype.type, np.floating) + assert issubclass(df.IntCol.dtype.type, np.integer) + # sqlite has no boolean type, so integer type is returned - self.assertTrue(issubclass(df.BoolCol.dtype.type, np.integer), - "BoolCol loaded with incorrect type") + assert issubclass(df.BoolCol.dtype.type, np.integer) # Int column with NA values stays as float - self.assertTrue(issubclass(df.IntColWithNull.dtype.type, np.floating), - "IntColWithNull loaded with incorrect type") + assert issubclass(df.IntColWithNull.dtype.type, np.floating) + # Non-native Bool column with NA values stays as float - self.assertTrue(issubclass(df.BoolColWithNull.dtype.type, np.floating), - "BoolColWithNull loaded with incorrect type") + assert issubclass(df.BoolColWithNull.dtype.type, np.floating) def test_default_date_load(self): df = sql.read_sql_table("types_test_data", self.conn) # IMPORTANT - sqlite has no native date type, so shouldn't parse, but - self.assertFalse(issubclass(df.DateCol.dtype.type, np.datetime64), - "DateCol loaded with incorrect type") + assert not issubclass(df.DateCol.dtype.type, np.datetime64) def test_bigint_warning(self): # test no warning for BIGINT (to support int64) is raised (GH7433) @@ -1718,7 +1666,7 @@ def test_bigint_warning(self): with warnings.catch_warnings(record=True) as w: warnings.simplefilter("always") sql.read_sql_table('test_bigintwarning', self.conn) - self.assertEqual(len(w), 0, "Warning triggered for other table") + assert len(w) 
== 0 class _TestMySQLAlchemy(object): @@ -1739,25 +1687,22 @@ def setup_driver(cls): import pymysql # noqa cls.driver = 'pymysql' except ImportError: - raise nose.SkipTest('pymysql not installed') + pytest.skip('pymysql not installed') def test_default_type_conversion(self): df = sql.read_sql_table("types_test_data", self.conn) - self.assertTrue(issubclass(df.FloatCol.dtype.type, np.floating), - "FloatCol loaded with incorrect type") - self.assertTrue(issubclass(df.IntCol.dtype.type, np.integer), - "IntCol loaded with incorrect type") + assert issubclass(df.FloatCol.dtype.type, np.floating) + assert issubclass(df.IntCol.dtype.type, np.integer) + # MySQL has no real BOOL type (it's an alias for TINYINT) - self.assertTrue(issubclass(df.BoolCol.dtype.type, np.integer), - "BoolCol loaded with incorrect type") + assert issubclass(df.BoolCol.dtype.type, np.integer) # Int column with NA values stays as float - self.assertTrue(issubclass(df.IntColWithNull.dtype.type, np.floating), - "IntColWithNull loaded with incorrect type") + assert issubclass(df.IntColWithNull.dtype.type, np.floating) + # Bool column with NA = int column with NA values => becomes float - self.assertTrue(issubclass(df.BoolColWithNull.dtype.type, np.floating), - "BoolColWithNull loaded with incorrect type") + assert issubclass(df.BoolColWithNull.dtype.type, np.floating) def test_read_procedure(self): # see GH7324. Although it is more an api test, it is added to the @@ -1808,7 +1753,7 @@ def setup_driver(cls): import psycopg2 # noqa cls.driver = 'psycopg2' except ImportError: - raise nose.SkipTest('psycopg2 not installed') + pytest.skip('psycopg2 not installed') def test_schema_support(self): # only test this for postgresql (schema's not supported in @@ -1837,8 +1782,8 @@ def test_schema_support(self): res4 = sql.read_sql_table('test_schema_other', self.conn, schema='other') tm.assert_frame_equal(df, res4) - self.assertRaises(ValueError, sql.read_sql_table, 'test_schema_other', - self.conn, schema='public') + pytest.raises(ValueError, sql.read_sql_table, 'test_schema_other', + self.conn, schema='public') # different if_exists options @@ -1875,26 +1820,32 @@ def test_schema_support(self): tm.assert_frame_equal(res1, res2) +@pytest.mark.single class TestMySQLAlchemy(_TestMySQLAlchemy, _TestSQLAlchemy): pass +@pytest.mark.single class TestMySQLAlchemyConn(_TestMySQLAlchemy, _TestSQLAlchemyConn): pass +@pytest.mark.single class TestPostgreSQLAlchemy(_TestPostgreSQLAlchemy, _TestSQLAlchemy): pass +@pytest.mark.single class TestPostgreSQLAlchemyConn(_TestPostgreSQLAlchemy, _TestSQLAlchemyConn): pass +@pytest.mark.single class TestSQLiteAlchemy(_TestSQLiteAlchemy, _TestSQLAlchemy): pass +@pytest.mark.single class TestSQLiteAlchemyConn(_TestSQLiteAlchemy, _TestSQLAlchemyConn): pass @@ -1902,6 +1853,7 @@ class TestSQLiteAlchemyConn(_TestSQLiteAlchemy, _TestSQLAlchemyConn): # ----------------------------------------------------------------------------- # -- Test Sqlite / MySQL fallback +@pytest.mark.single class TestSQLiteFallback(SQLiteMixIn, PandasSQLTest): """ Test the fallback mode against an in-memory sqlite database. 
@@ -1913,7 +1865,7 @@ class TestSQLiteFallback(SQLiteMixIn, PandasSQLTest):
     def connect(cls):
         return sqlite3.connect(':memory:')

-    def setUp(self):
+    def setup_method(self, method):
         self.conn = self.connect()
         self.pandasSQL = sql.SQLiteDatabase(self.conn)

@@ -1951,13 +1903,11 @@ def test_create_and_drop_table(self):

         self.pandasSQL.to_sql(temp_frame, 'drop_test_frame')

-        self.assertTrue(self.pandasSQL.has_table('drop_test_frame'),
-                        'Table not written to DB')
+        assert self.pandasSQL.has_table('drop_test_frame')

         self.pandasSQL.drop_table('drop_test_frame')

-        self.assertFalse(self.pandasSQL.has_table('drop_test_frame'),
-                         'Table not deleted from DB')
+        assert not self.pandasSQL.has_table('drop_test_frame')

     def test_roundtrip(self):
         self._roundtrip()
@@ -2002,7 +1952,7 @@ def test_to_sql_save_index(self):

     def test_transactions(self):
         if PY36:
-            raise nose.SkipTest("not working on python > 3.5")
+            pytest.skip("not working on python > 3.5")
         self._transaction_test()

     def _get_sqlite_column_type(self, table, column):
@@ -2014,7 +1964,7 @@ def _get_sqlite_column_type(self, table, column):

     def test_dtype(self):
         if self.flavor == 'mysql':
-            raise nose.SkipTest('Not applicable to MySQL legacy')
+            pytest.skip('Not applicable to MySQL legacy')
         cols = ['A', 'B']
         data = [(0.8, True),
                 (0.9, None)]
@@ -2023,24 +1973,24 @@ def test_dtype(self):
         df.to_sql('dtype_test2', self.conn, dtype={'B': 'STRING'})

         # sqlite stores Boolean values as INTEGER
-        self.assertEqual(self._get_sqlite_column_type(
-            'dtype_test', 'B'), 'INTEGER')
+        assert self._get_sqlite_column_type(
+            'dtype_test', 'B') == 'INTEGER'

-        self.assertEqual(self._get_sqlite_column_type(
-            'dtype_test2', 'B'), 'STRING')
-        self.assertRaises(ValueError, df.to_sql,
-                          'error', self.conn, dtype={'B': bool})
+        assert self._get_sqlite_column_type(
+            'dtype_test2', 'B') == 'STRING'
+        pytest.raises(ValueError, df.to_sql,
+                      'error', self.conn, dtype={'B': bool})

         # single dtype
         df.to_sql('single_dtype_test', self.conn, dtype='STRING')
-        self.assertEqual(
-            self._get_sqlite_column_type('single_dtype_test', 'A'), 'STRING')
-        self.assertEqual(
-            self._get_sqlite_column_type('single_dtype_test', 'B'), 'STRING')
+        assert self._get_sqlite_column_type(
+            'single_dtype_test', 'A') == 'STRING'
+        assert self._get_sqlite_column_type(
+            'single_dtype_test', 'B') == 'STRING'

     def test_notnull_dtype(self):
         if self.flavor == 'mysql':
-            raise nose.SkipTest('Not applicable to MySQL legacy')
+            pytest.skip('Not applicable to MySQL legacy')

         cols = {'Bool': Series([True, None]),
                 'Date': Series([datetime(2012, 5, 1), None]),
@@ -2052,18 +2002,17 @@ def test_notnull_dtype(self):
         tbl = 'notnull_dtype_test'
         df.to_sql(tbl, self.conn)

-        self.assertEqual(self._get_sqlite_column_type(tbl, 'Bool'), 'INTEGER')
-        self.assertEqual(self._get_sqlite_column_type(
-            tbl, 'Date'), 'TIMESTAMP')
-        self.assertEqual(self._get_sqlite_column_type(tbl, 'Int'), 'INTEGER')
-        self.assertEqual(self._get_sqlite_column_type(tbl, 'Float'), 'REAL')
+        assert self._get_sqlite_column_type(tbl, 'Bool') == 'INTEGER'
+        assert self._get_sqlite_column_type(tbl, 'Date') == 'TIMESTAMP'
+        assert self._get_sqlite_column_type(tbl, 'Int') == 'INTEGER'
+        assert self._get_sqlite_column_type(tbl, 'Float') == 'REAL'

     def test_illegal_names(self):
         # For sqlite, these should work fine
         df = DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])

         # Raise error on blank
-        self.assertRaises(ValueError, df.to_sql, "", self.conn)
+        pytest.raises(ValueError, df.to_sql, "", self.conn)

         for ndx, weird_name in enumerate(
                 ['test_weird_name]', 'test_weird_name[',
@@ -2125,12 +2074,14 @@ def _skip_if_no_pymysql():
     try:
         import pymysql  # noqa
     except ImportError:
-        raise nose.SkipTest('pymysql not installed, skipping')
+        pytest.skip('pymysql not installed, skipping')


-class TestXSQLite(SQLiteMixIn, tm.TestCase):
+@pytest.mark.single
+class TestXSQLite(SQLiteMixIn):

-    def setUp(self):
+    def setup_method(self, method):
+        self.method = method
         self.conn = sqlite3.connect(':memory:')

     def test_basic(self):
@@ -2140,7 +2091,7 @@ def test_basic(self):

     def test_write_row_by_row(self):
         frame = tm.makeTimeDataFrame()
-        frame.ix[0, 0] = np.nan
+        frame.iloc[0, 0] = np.nan
         create_sql = sql.get_schema(frame, 'test')
         cur = self.conn.cursor()
         cur.execute(create_sql)
@@ -2165,7 +2116,7 @@ def test_execute(self):
         cur.execute(create_sql)

         ins = "INSERT INTO test VALUES (?, ?, ?, ?)"
-        row = frame.ix[0]
+        row = frame.iloc[0]

         sql.execute(ins, self.conn, params=tuple(row))
         self.conn.commit()
@@ -2180,15 +2131,16 @@ def test_schema(self):
         for l in lines:
             tokens = l.split(' ')
             if len(tokens) == 2 and tokens[0] == 'A':
-                self.assertTrue(tokens[1] == 'DATETIME')
+                assert tokens[1] == 'DATETIME'

         frame = tm.makeTimeDataFrame()
         create_sql = sql.get_schema(frame, 'test', keys=['A', 'B'])
         lines = create_sql.splitlines()
-        self.assertTrue('PRIMARY KEY ("A", "B")' in create_sql)
+        assert 'PRIMARY KEY ("A", "B")' in create_sql
         cur = self.conn.cursor()
         cur.execute(create_sql)

+    @tm.capture_stdout
     def test_execute_fail(self):
         create_sql = """
         CREATE TABLE test
@@ -2205,14 +2157,10 @@ def test_execute_fail(self):
         sql.execute('INSERT INTO test VALUES("foo", "bar", 1.234)', self.conn)
         sql.execute('INSERT INTO test VALUES("foo", "baz", 2.567)', self.conn)

-        try:
-            sys.stdout = StringIO()
-            self.assertRaises(Exception, sql.execute,
-                              'INSERT INTO test VALUES("foo", "bar", 7)',
-                              self.conn)
-        finally:
-            sys.stdout = sys.__stdout__
+        with pytest.raises(Exception):
+            sql.execute('INSERT INTO test VALUES("foo", "bar", 7)', self.conn)

+    @tm.capture_stdout
     def test_execute_closed_connection(self):
         create_sql = """
         CREATE TABLE test
@@ -2228,15 +2176,12 @@ def test_execute_closed_connection(self):
         sql.execute('INSERT INTO test VALUES("foo", "bar", 1.234)', self.conn)
         self.conn.close()

-        try:
-            sys.stdout = StringIO()
-            self.assertRaises(Exception, tquery, "select * from test",
-                              con=self.conn)
-        finally:
-            sys.stdout = sys.__stdout__
+
+        with pytest.raises(Exception):
+            tquery("select * from test", con=self.conn)

         # Initialize connection again (needed for tearDown)
-        self.setUp()
+        self.setup_method(self.method)

     def test_na_roundtrip(self):
         pass
@@ -2277,7 +2222,7 @@ def test_onecolumn_of_integer(self):
         the_sum = sum([my_c0[0]
                        for my_c0 in con_x.execute("select * from mono_df")])
         # it should not fail, and gives 3 ( Issue #3628 )
-        self.assertEqual(the_sum, 3)
+        assert the_sum == 3

         result = sql.read_sql("select * from mono_df", con_x)
         tm.assert_frame_equal(result, mono_df)
@@ -2297,48 +2242,47 @@ def clean_up(test_table_to_drop):
             self.drop_table(test_table_to_drop)

         # test if invalid value for if_exists raises appropriate error
-        self.assertRaises(ValueError,
-                          sql.to_sql,
-                          frame=df_if_exists_1,
-                          con=self.conn,
-                          name=table_name,
-                          if_exists='notvalidvalue')
+        pytest.raises(ValueError,
+                      sql.to_sql,
+                      frame=df_if_exists_1,
+                      con=self.conn,
+                      name=table_name,
+                      if_exists='notvalidvalue')
         clean_up(table_name)

         # test if_exists='fail'
         sql.to_sql(frame=df_if_exists_1, con=self.conn,
                    name=table_name, if_exists='fail')
-        self.assertRaises(ValueError,
-                          sql.to_sql,
-                          frame=df_if_exists_1,
-                          con=self.conn,
-                          name=table_name,
-                          if_exists='fail')
+        pytest.raises(ValueError,
+                      sql.to_sql,
+                      frame=df_if_exists_1,
+                      con=self.conn,
+                      name=table_name,
+                      if_exists='fail')

         # test if_exists='replace'
         sql.to_sql(frame=df_if_exists_1, con=self.conn, name=table_name,
                    if_exists='replace', index=False)
-        self.assertEqual(tquery(sql_select, con=self.conn),
-                         [(1, 'A'), (2, 'B')])
+        assert tquery(sql_select, con=self.conn) == [(1, 'A'), (2, 'B')]
         sql.to_sql(frame=df_if_exists_2, con=self.conn, name=table_name,
                    if_exists='replace', index=False)
-        self.assertEqual(tquery(sql_select, con=self.conn),
-                         [(3, 'C'), (4, 'D'), (5, 'E')])
+        assert (tquery(sql_select, con=self.conn) ==
+                [(3, 'C'), (4, 'D'), (5, 'E')])
         clean_up(table_name)

         # test if_exists='append'
         sql.to_sql(frame=df_if_exists_1, con=self.conn, name=table_name,
                    if_exists='fail', index=False)
-        self.assertEqual(tquery(sql_select, con=self.conn),
-                         [(1, 'A'), (2, 'B')])
+        assert tquery(sql_select, con=self.conn) == [(1, 'A'), (2, 'B')]
         sql.to_sql(frame=df_if_exists_2, con=self.conn, name=table_name,
                    if_exists='append', index=False)
-        self.assertEqual(tquery(sql_select, con=self.conn),
-                         [(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'E')])
+        assert (tquery(sql_select, con=self.conn) ==
+                [(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'E')])
         clean_up(table_name)


-class TestSQLFlavorDeprecation(tm.TestCase):
+@pytest.mark.single
+class TestSQLFlavorDeprecation(object):
     """
     gh-13611: test that the 'flavor' parameter
     is appropriately deprecated by checking the
@@ -2352,8 +2296,8 @@ def test_unsupported_flavor(self):
         msg = 'is not supported'

         for func in self.funcs:
-            tm.assertRaisesRegexp(ValueError, msg, getattr(sql, func),
-                                  self.con, flavor='mysql')
+            tm.assert_raises_regex(ValueError, msg, getattr(sql, func),
+                                   self.con, flavor='mysql')

     def test_deprecated_flavor(self):
         for func in self.funcs:
@@ -2362,12 +2306,13 @@ def test_deprecated_flavor(self):
                 getattr(sql, func)(self.con, flavor='sqlite')


-@unittest.skip("gh-13611: there is no support for MySQL "
-               "if SQLAlchemy is not installed")
-class TestXMySQL(MySQLMixIn, tm.TestCase):
+@pytest.mark.single
+@pytest.mark.skip(reason="gh-13611: there is no support for MySQL "
+                  "if SQLAlchemy is not installed")
+class TestXMySQL(MySQLMixIn):

     @classmethod
-    def setUpClass(cls):
+    def setup_class(cls):
         _skip_if_no_pymysql()

         # test connection
@@ -2384,18 +2329,18 @@ def setUpClass(cls):
         try:
             pymysql.connect(read_default_group='pandas')
         except pymysql.ProgrammingError:
-            raise nose.SkipTest(
+            pytest.skip(
                 "Create a group of connection parameters under the heading "
                 "[pandas] in your system's mysql default file, "
                 "typically located at ~/.my.cnf or /etc/.my.cnf. ")
         except pymysql.Error:
-            raise nose.SkipTest(
+            pytest.skip(
                 "Cannot connect to database. "
                 "Create a group of connection parameters under the heading "
                 "[pandas] in your system's mysql default file, "
                 "typically located at ~/.my.cnf or /etc/.my.cnf. ")

-    def setUp(self):
+    def setup_method(self, method):
         _skip_if_no_pymysql()
         import pymysql
         try:
@@ -2410,17 +2355,19 @@ def setUp(self):
         try:
             self.conn = pymysql.connect(read_default_group='pandas')
         except pymysql.ProgrammingError:
-            raise nose.SkipTest(
+            pytest.skip(
                 "Create a group of connection parameters under the heading "
                 "[pandas] in your system's mysql default file, "
                 "typically located at ~/.my.cnf or /etc/.my.cnf. ")
         except pymysql.Error:
-            raise nose.SkipTest(
+            pytest.skip(
                 "Cannot connect to database. "
                 "Create a group of connection parameters under the heading "
                 "[pandas] in your system's mysql default file, "
                 "typically located at ~/.my.cnf or /etc/.my.cnf. ")

+        self.method = method
+
     def test_basic(self):
         _skip_if_no_pymysql()
         frame = tm.makeTimeDataFrame()
@@ -2430,7 +2377,7 @@ def test_write_row_by_row(self):
         _skip_if_no_pymysql()
         frame = tm.makeTimeDataFrame()
-        frame.ix[0, 0] = np.nan
+        frame.iloc[0, 0] = np.nan
         drop_sql = "DROP TABLE IF EXISTS test"
         create_sql = sql.get_schema(frame, 'test')
         cur = self.conn.cursor()
@@ -2474,7 +2421,7 @@ def test_execute(self):
         cur.execute(create_sql)
         ins = "INSERT INTO test VALUES (%s, %s, %s, %s)"

-        row = frame.ix[0].values.tolist()
+        row = frame.iloc[0].values.tolist()
         sql.execute(ins, self.conn, params=tuple(row))
         self.conn.commit()
@@ -2490,17 +2437,18 @@ def test_schema(self):
         for l in lines:
             tokens = l.split(' ')
             if len(tokens) == 2 and tokens[0] == 'A':
-                self.assertTrue(tokens[1] == 'DATETIME')
+                assert tokens[1] == 'DATETIME'

         frame = tm.makeTimeDataFrame()
         drop_sql = "DROP TABLE IF EXISTS test"
         create_sql = sql.get_schema(frame, 'test', keys=['A', 'B'])
         lines = create_sql.splitlines()
-        self.assertTrue('PRIMARY KEY (`A`, `B`)' in create_sql)
+        assert 'PRIMARY KEY (`A`, `B`)' in create_sql
         cur = self.conn.cursor()
         cur.execute(drop_sql)
         cur.execute(create_sql)

+    @tm.capture_stdout
     def test_execute_fail(self):
         _skip_if_no_pymysql()
         drop_sql = "DROP TABLE IF EXISTS test"
@@ -2520,14 +2468,10 @@ def test_execute_fail(self):
         sql.execute('INSERT INTO test VALUES("foo", "bar", 1.234)', self.conn)
         sql.execute('INSERT INTO test VALUES("foo", "baz", 2.567)', self.conn)

-        try:
-            sys.stdout = StringIO()
-            self.assertRaises(Exception, sql.execute,
-                              'INSERT INTO test VALUES("foo", "bar", 7)',
-                              self.conn)
-        finally:
-            sys.stdout = sys.__stdout__
+        with pytest.raises(Exception):
+            sql.execute('INSERT INTO test VALUES("foo", "bar", 7)', self.conn)

+    @tm.capture_stdout
     def test_execute_closed_connection(self):
         _skip_if_no_pymysql()
         drop_sql = "DROP TABLE IF EXISTS test"
@@ -2546,15 +2490,12 @@ def test_execute_closed_connection(self):
         sql.execute('INSERT INTO test VALUES("foo", "bar", 1.234)', self.conn)
         self.conn.close()

-        try:
-            sys.stdout = StringIO()
-            self.assertRaises(Exception, tquery, "select * from test",
-                              con=self.conn)
-        finally:
-            sys.stdout = sys.__stdout__
+
+        with pytest.raises(Exception):
+            tquery("select * from test", con=self.conn)

         # Initialize connection again (needed for tearDown)
-        self.setUp()
+        self.setup_method(self.method)

     def test_na_roundtrip(self):
         _skip_if_no_pymysql()
@@ -2619,47 +2560,40 @@ def clean_up(test_table_to_drop):
             self.drop_table(test_table_to_drop)

         # test if invalid value for if_exists raises appropriate error
-        self.assertRaises(ValueError,
-                          sql.to_sql,
-                          frame=df_if_exists_1,
-                          con=self.conn,
-                          name=table_name,
-                          if_exists='notvalidvalue')
+        pytest.raises(ValueError,
+                      sql.to_sql,
+                      frame=df_if_exists_1,
+                      con=self.conn,
+                      name=table_name,
+                      if_exists='notvalidvalue')
         clean_up(table_name)

         # test if_exists='fail'
         sql.to_sql(frame=df_if_exists_1, con=self.conn, name=table_name,
                    if_exists='fail', index=False)
-        self.assertRaises(ValueError,
-                          sql.to_sql,
-                          frame=df_if_exists_1,
-                          con=self.conn,
-                          name=table_name,
-                          if_exists='fail')
+        pytest.raises(ValueError,
+                      sql.to_sql,
+                      frame=df_if_exists_1,
+                      con=self.conn,
+                      name=table_name,
+                      if_exists='fail')

         # test if_exists='replace'
         sql.to_sql(frame=df_if_exists_1, con=self.conn, name=table_name,
                    if_exists='replace', index=False)
-        self.assertEqual(tquery(sql_select, con=self.conn),
-                         [(1, 'A'), (2, 'B')])
+        assert tquery(sql_select, con=self.conn) == [(1, 'A'), (2, 'B')]
         sql.to_sql(frame=df_if_exists_2, con=self.conn, name=table_name,
                    if_exists='replace', index=False)
-        self.assertEqual(tquery(sql_select, con=self.conn),
-                         [(3, 'C'), (4, 'D'), (5, 'E')])
+        assert (tquery(sql_select, con=self.conn) ==
+                [(3, 'C'), (4, 'D'), (5, 'E')])
         clean_up(table_name)

         # test if_exists='append'
         sql.to_sql(frame=df_if_exists_1, con=self.conn, name=table_name,
                    if_exists='fail', index=False)
-        self.assertEqual(tquery(sql_select, con=self.conn),
-                         [(1, 'A'), (2, 'B')])
+        assert tquery(sql_select, con=self.conn) == [(1, 'A'), (2, 'B')]
         sql.to_sql(frame=df_if_exists_2, con=self.conn, name=table_name,
                    if_exists='append', index=False)
-        self.assertEqual(tquery(sql_select, con=self.conn),
-                         [(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'E')])
+        assert (tquery(sql_select, con=self.conn) ==
+                [(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'E')])
         clean_up(table_name)
-
-
-if __name__ == '__main__':
-    nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
-                   exit=False)
diff --git a/pandas/io/tests/test_stata.py b/pandas/tests/io/test_stata.py
similarity index 94%
rename from pandas/io/tests/test_stata.py
rename to pandas/tests/io/test_stata.py
index cd972868a6e32..4c92c19c51e7a 100644
--- a/pandas/io/tests/test_stata.py
+++ b/pandas/tests/io/test_stata.py
@@ -9,7 +9,7 @@
 from datetime import datetime
 from distutils.version import LooseVersion

-import nose
+import pytest
 import numpy as np
 import pandas as pd
 import pandas.util.testing as tm
@@ -19,13 +19,13 @@
 from pandas.io.parsers import read_csv
 from pandas.io.stata import (read_stata, StataReader, InvalidColumnName,
                              PossiblePrecisionLoss, StataMissingValue)
-from pandas.tslib import NaT
-from pandas.types.common import is_categorical_dtype
+from pandas._libs.tslib import NaT
+from pandas.core.dtypes.common import is_categorical_dtype


-class TestStata(tm.TestCase):
+class TestStata(object):

-    def setUp(self):
+    def setup_method(self, method):
         self.dirpath = tm.get_data_path()
         self.dta1_114 = os.path.join(self.dirpath, 'stata1_114.dta')
         self.dta1_117 = os.path.join(self.dirpath, 'stata1_117.dta')
@@ -128,7 +128,7 @@ def test_read_dta1(self):

     def test_read_dta2(self):
         if LooseVersion(sys.version) < '2.7':
-            raise nose.SkipTest('datetime interp under 2.6 is faulty')
+            pytest.skip('datetime interp under 2.6 is faulty')

         expected = DataFrame.from_records(
             [
@@ -181,7 +181,7 @@ def test_read_dta2(self):
             w = [x for x in w if x.category is UserWarning]

             # should get warning for each call to read_dta
-            self.assertEqual(len(w), 3)
+            assert len(w) == 3

         # buggy test because of the NaT comparison on certain platforms
         # Format 113 test fails since it does not support tc and tC formats
@@ -283,7 +283,7 @@ def test_read_dta18(self):
                            u'Floats': u'float data'}
             tm.assert_dict_equal(vl, vl_expected)

-            self.assertEqual(rdr.data_label, u'This is a Ünicode data label')
+            assert rdr.data_label == u'This is a Ünicode data label'

     def test_read_write_dta5(self):
         original = DataFrame([(np.nan, np.nan, np.nan, np.nan, np.nan)],
@@ -336,7 +336,7 @@ def test_write_preserves_original(self):
         # 9795
        np.random.seed(423)
         df = pd.DataFrame(np.random.randn(5, 4), columns=list('abcd'))
-        df.ix[2, 'a':'c'] = np.nan
+        df.loc[2, 'a':'c'] = np.nan
         df_copy = df.copy()
         with tm.ensure_clean() as path:
             df.to_stata(path, write_index=False)
@@ -351,12 +351,12 @@ def test_encoding(self):

         if compat.PY3:
             expected = raw.kreis1849[0]
-            self.assertEqual(result, expected)
-            self.assertIsInstance(result, compat.string_types)
+            assert result == expected
+            assert isinstance(result, compat.string_types)
         else:
             expected = raw.kreis1849.str.decode("latin-1")[0]
-            self.assertEqual(result, expected)
-            self.assertIsInstance(result, unicode)  # noqa
+            assert result == expected
+            assert isinstance(result, unicode)  # noqa

         with tm.ensure_clean() as path:
             encoded.to_stata(path, encoding='latin-1', write_index=False)
@@ -377,7 +377,7 @@ def test_read_write_dta11(self):
             with warnings.catch_warnings(record=True) as w:
                 original.to_stata(path, None)
                 # should get a warning for that format.
-                self.assertEqual(len(w), 1)
+                assert len(w) == 1

             written_and_read_again = self.read_dta(path)
             tm.assert_frame_equal(
@@ -405,7 +405,7 @@ def test_read_write_dta12(self):
             with warnings.catch_warnings(record=True) as w:
                 original.to_stata(path, None)
                 # should get a warning for that format.
-                self.assertEqual(len(w), 1)
+                assert len(w) == 1

             written_and_read_again = self.read_dta(path)
             tm.assert_frame_equal(
@@ -484,9 +484,7 @@ def test_timestamp_and_label(self):
                               data_label=data_label)

             with StataReader(path) as reader:
-                parsed_time_stamp = dt.datetime.strptime(
-                    reader.time_stamp, ('%d %b %Y %H:%M'))
-                assert parsed_time_stamp == time_stamp
+                assert reader.time_stamp == '29 Feb 2000 14:21'
                 assert reader.data_label == data_label

     def test_numeric_column_names(self):
@@ -525,7 +523,7 @@ def test_no_index(self):
         with tm.ensure_clean() as path:
             original.to_stata(path, write_index=False)
             written_and_read_again = self.read_dta(path)
-            tm.assertRaises(
+            pytest.raises(
                 KeyError, lambda: written_and_read_again['index_not_written'])

     def test_string_no_dates(self):
@@ -649,10 +647,10 @@ def test_variable_labels(self):
         keys = ('var1', 'var2', 'var3')
         labels = ('label1', 'label2', 'label3')
         for k, v in compat.iteritems(sr_115):
-            self.assertTrue(k in sr_117)
-            self.assertTrue(v == sr_117[k])
-            self.assertTrue(k in keys)
-            self.assertTrue(v in labels)
+            assert k in sr_117
+            assert v == sr_117[k]
+            assert k in keys
+            assert v in labels

     def test_minimal_size_col(self):
         str_lens = (1, 100, 244)
@@ -669,8 +667,8 @@ def test_minimal_size_col(self):
             variables = sr.varlist
             formats = sr.fmtlist
             for variable, fmt, typ in zip(variables, formats, typlist):
-                self.assertTrue(int(variable[1:]) == int(fmt[1:-1]))
-                self.assertTrue(int(variable[1:]) == typ)
+                assert int(variable[1:]) == int(fmt[1:-1])
+                assert int(variable[1:]) == typ

     def test_excessively_long_string(self):
         str_lens = (1, 244, 500)
@@ -679,7 +677,7 @@ def test_excessively_long_string(self):
             s['s' + str(str_len)] = Series(['a' * str_len,
                                             'b' * str_len, 'c' * str_len])
         original = DataFrame(s)
-        with tm.assertRaises(ValueError):
+        with pytest.raises(ValueError):
             with tm.ensure_clean() as path:
                 original.to_stata(path)

@@ -696,21 +694,21 @@ def test_missing_value_generator(self):
             offset = valid_range[t][1]
             for i in range(0, 27):
                 val = StataMissingValue(offset + 1 + i)
-                self.assertTrue(val.string == expected_values[i])
+                assert val.string == expected_values[i]

             # Test extremes for floats
             val = StataMissingValue(struct.unpack(' 0)
+        assert len(ax.get_children()) > 0

         if layout is not None:
-            result = self._get_axes_layout(plotting._flatten(axes))
-            self.assertEqual(result, layout)
+            result = self._get_axes_layout(_flatten(axes))
+            assert result == layout

-        self.assert_numpy_array_equal(
+        tm.assert_numpy_array_equal(
             visible_axes[0].figure.get_size_inches(),
             np.array(figsize, dtype=np.float64))

@@ -378,7 +381,7 @@ def _flatten_visible(self, axes):
         axes : matplotlib Axes object, or its list-like

         """
-        axes = plotting._flatten(axes)
+        axes = _flatten(axes)
         axes = [ax for ax in axes if ax.get_visible()]
         return axes

@@ -406,8 +409,8 @@ def _check_has_errorbars(self, axes, xerr=0, yerr=0):
                     xerr_count += 1
                 if has_yerr:
                     yerr_count += 1
-            self.assertEqual(xerr, xerr_count)
-            self.assertEqual(yerr, yerr_count)
+            assert xerr == xerr_count
+            assert yerr == yerr_count

     def _check_box_return_type(self, returned, return_type, expected_keys=None,
                                check_ax_title=True):
@@ -434,36 +437,36 @@ def _check_box_return_type(self, returned, return_type, expected_keys=None,
         if return_type is None:
             return_type = 'dict'

-        self.assertTrue(isinstance(returned, types[return_type]))
+        assert isinstance(returned, types[return_type])
         if return_type == 'both':
-            self.assertIsInstance(returned.ax, Axes)
-            self.assertIsInstance(returned.lines, dict)
+            assert isinstance(returned.ax, Axes)
+            assert isinstance(returned.lines, dict)
         else:
             # should be fixed when the returning default is changed
             if return_type is None:
                 for r in self._flatten_visible(returned):
-                    self.assertIsInstance(r, Axes)
+                    assert isinstance(r, Axes)
                 return

-            self.assertTrue(isinstance(returned, Series))
+            assert isinstance(returned, Series)

-            self.assertEqual(sorted(returned.keys()), sorted(expected_keys))
+            assert sorted(returned.keys()) == sorted(expected_keys)
             for key, value in iteritems(returned):
-                self.assertTrue(isinstance(value, types[return_type]))
+                assert isinstance(value, types[return_type])
                 # check returned dict has correct mapping
                 if return_type == 'axes':
                     if check_ax_title:
-                        self.assertEqual(value.get_title(), key)
+                        assert value.get_title() == key
                 elif return_type == 'both':
                     if check_ax_title:
-                        self.assertEqual(value.ax.get_title(), key)
-                    self.assertIsInstance(value.ax, Axes)
-                    self.assertIsInstance(value.lines, dict)
+                        assert value.ax.get_title() == key
+                    assert isinstance(value.ax, Axes)
+                    assert isinstance(value.lines, dict)
                 elif return_type == 'dict':
                     line = value['medians'][0]
                     axes = line.axes if self.mpl_ge_1_5_0 else line.get_axes()
                     if check_ax_title:
-                        self.assertEqual(axes.get_title(), key)
+                        assert axes.get_title() == key
                 else:
                     raise AssertionError

@@ -488,26 +491,26 @@ def is_grid_on():
             spndx += 1
             mpl.rc('axes', grid=False)
             obj.plot(kind=kind, **kws)
-            self.assertFalse(is_grid_on())
+            assert not is_grid_on()

             self.plt.subplot(1, 4 * len(kinds), spndx)
             spndx += 1
             mpl.rc('axes', grid=True)
             obj.plot(kind=kind, grid=False, **kws)
-            self.assertFalse(is_grid_on())
+            assert not is_grid_on()

             if kind != 'pie':
                 self.plt.subplot(1, 4 * len(kinds), spndx)
                 spndx += 1
                 mpl.rc('axes', grid=True)
                 obj.plot(kind=kind, **kws)
-                self.assertTrue(is_grid_on())
+                assert is_grid_on()

                 self.plt.subplot(1, 4 * len(kinds), spndx)
                 spndx += 1
                 mpl.rc('axes', grid=False)
                 obj.plot(kind=kind, grid=True, **kws)
-                self.assertTrue(is_grid_on())
+                assert is_grid_on()

     def _maybe_unpack_cycler(self, rcParams, field='color'):
         """
diff --git a/pandas/tests/plotting/test_boxplot_method.py b/pandas/tests/plotting/test_boxplot_method.py
index 0916693ade2ce..1e06c13980657 100644
--- a/pandas/tests/plotting/test_boxplot_method.py
+++ b/pandas/tests/plotting/test_boxplot_method.py
@@ -1,7 +1,6 @@
-#!/usr/bin/env python
 # coding: utf-8

-import nose
+import pytest
 import itertools
 import string
 from distutils.version import LooseVersion
@@ -15,13 +14,15 @@
 from numpy import random
 from numpy.random import randn

-import pandas.tools.plotting as plotting
+import pandas.plotting as plotting
 from pandas.tests.plotting.common import (TestPlotBase, _check_plot_works)

 """ Test cases for .boxplot method """

+tm._skip_module_if_no_mpl()
+

 def _skip_if_mpl_14_or_dev_boxplot():
     # GH 8382
@@ -29,10 +30,9 @@ def _skip_if_mpl_14_or_dev_boxplot():
     # Don't need try / except since that's done at class level
     import matplotlib
     if str(matplotlib.__version__) >= LooseVersion('1.4'):
-        raise nose.SkipTest("Matplotlib Regression in 1.4 and current dev.")
+        pytest.skip("Matplotlib Regression in 1.4 and current dev.")


-@tm.mplskip
 class TestDataFramePlots(TestPlotBase):

     @slow
@@ -55,7 +55,8 @@ def test_boxplot_legacy(self):
         _check_plot_works(df.boxplot, by='indic')
         with tm.assert_produces_warning(UserWarning):
             _check_plot_works(df.boxplot, by=['indic', 'indic2'])
-        _check_plot_works(plotting.boxplot, data=df['one'], return_type='dict')
+        _check_plot_works(plotting._core.boxplot, data=df['one'],
+                          return_type='dict')
         _check_plot_works(df.boxplot, notch=1, return_type='dict')
         with tm.assert_produces_warning(UserWarning):
             _check_plot_works(df.boxplot, by='indic', notch=1)
@@ -71,32 +72,32 @@ def test_boxplot_legacy(self):
         fig, ax = self.plt.subplots()
         axes = df.boxplot('Col1', by='X', ax=ax)
         ax_axes = ax.axes if self.mpl_ge_1_5_0 else ax.get_axes()
-        self.assertIs(ax_axes, axes)
+        assert ax_axes is axes

         fig, ax = self.plt.subplots()
         axes = df.groupby('Y').boxplot(ax=ax, return_type='axes')
         ax_axes = ax.axes if self.mpl_ge_1_5_0 else ax.get_axes()
-        self.assertIs(ax_axes, axes['A'])
+        assert ax_axes is axes['A']

         # Multiple columns with an ax argument should use same figure
         fig, ax = self.plt.subplots()
         with tm.assert_produces_warning(UserWarning):
             axes = df.boxplot(column=['Col1', 'Col2'],
                               by='X', ax=ax, return_type='axes')
-        self.assertIs(axes['Col1'].get_figure(), fig)
+        assert axes['Col1'].get_figure() is fig

         # When by is None, check that all relevant lines are present in the
         # dict
         fig, ax = self.plt.subplots()
         d = df.boxplot(ax=ax, return_type='dict')
         lines = list(itertools.chain.from_iterable(d.values()))
-        self.assertEqual(len(ax.get_lines()), len(lines))
+        assert len(ax.get_lines()) == len(lines)

     @slow
     def test_boxplot_return_type_none(self):
         # GH 12216; return_type=None & by=None -> axes
         result = self.hist_df.boxplot()
-        self.assertTrue(isinstance(result, self.plt.Axes))
+        assert isinstance(result, self.plt.Axes)

     @slow
     def test_boxplot_return_type_legacy(self):
@@ -106,7 +107,7 @@ def test_boxplot_return_type_legacy(self):
         df = DataFrame(randn(6, 4),
                        index=list(string.ascii_letters[:6]),
                        columns=['one', 'two', 'three', 'four'])
-        with tm.assertRaises(ValueError):
+        with pytest.raises(ValueError):
             df.boxplot(return_type='NOTATYPE')

         result = df.boxplot()
@@ -129,8 +130,8 @@ def test_boxplot_axis_limits(self):

         def _check_ax_limits(col, ax):
             y_min, y_max = ax.get_ylim()
-            self.assertTrue(y_min <= col.min())
-            self.assertTrue(y_max >= col.max())
+            assert y_min <= col.min()
+            assert y_max >= col.max()

         df = self.hist_df.copy()
         df['age'] = np.random.randint(1, 20, df.shape[0])
@@ -138,7 +139,7 @@ def _check_ax_limits(col, ax):
         height_ax, weight_ax = df.boxplot(['height', 'weight'], by='category')
         _check_ax_limits(df['height'], height_ax)
         _check_ax_limits(df['weight'], weight_ax)
-        self.assertEqual(weight_ax._sharey, height_ax)
+        assert weight_ax._sharey == height_ax

         # Two rows, one partial
         p = df.boxplot(['height', 'weight', 'age'], by='category')
@@ -148,9 +149,9 @@ def _check_ax_limits(col, ax):
         _check_ax_limits(df['height'], height_ax)
         _check_ax_limits(df['weight'], weight_ax)
         _check_ax_limits(df['age'], age_ax)
-        self.assertEqual(weight_ax._sharey, height_ax)
-        self.assertEqual(age_ax._sharey, height_ax)
-        self.assertIsNone(dummy_ax._sharey)
+        assert weight_ax._sharey == height_ax
+        assert age_ax._sharey == height_ax
+        assert dummy_ax._sharey is None

     @slow
     def test_boxplot_empty_column(self):
@@ -159,8 +160,12 @@ def test_boxplot_empty_column(self):
         df.loc[:, 0] = np.nan
         _check_plot_works(df.boxplot, return_type='axes')

+    def test_fontsize(self):
+        df = DataFrame({"a": [1, 2, 3, 4, 5, 6]})
+        self._check_ticks_props(df.boxplot("a", fontsize=16),
+                                xlabelsize=16, ylabelsize=16)
+

-@tm.mplskip
 class TestDataFrameGroupByPlots(TestPlotBase):

     @slow
@@ -204,13 +209,13 @@ def test_grouped_plot_fignums(self):
         gb = df.groupby('gender')

         res = gb.plot()
-        self.assertEqual(len(self.plt.get_fignums()), 2)
-        self.assertEqual(len(res), 2)
+        assert len(self.plt.get_fignums()) == 2
+        assert len(res) == 2
         tm.close()

         res = gb.boxplot(return_type='axes')
-        self.assertEqual(len(self.plt.get_fignums()), 1)
-        self.assertEqual(len(res), 2)
+        assert len(self.plt.get_fignums()) == 1
+        assert len(res) == 2
         tm.close()

         # now works with GH 5610 as gender is excluded
@@ -223,7 +228,7 @@ def test_grouped_box_return_type(self):

         # old style: return_type=None
         result = df.boxplot(by='gender')
-        self.assertIsInstance(result, np.ndarray)
+        assert isinstance(result, np.ndarray)
         self._check_box_return_type(
             result, None,
             expected_keys=['height', 'weight', 'category'])
@@ -258,13 +263,13 @@ def test_grouped_box_return_type(self):
     def test_grouped_box_layout(self):
         df = self.hist_df

-        self.assertRaises(ValueError, df.boxplot, column=['weight', 'height'],
-                          by=df.gender, layout=(1, 1))
-        self.assertRaises(ValueError, df.boxplot,
-                          column=['height', 'weight', 'category'],
-                          layout=(2, 1), return_type='dict')
-        self.assertRaises(ValueError, df.boxplot, column=['weight', 'height'],
-                          by=df.gender, layout=(-1, -1))
+        pytest.raises(ValueError, df.boxplot, column=['weight', 'height'],
+                      by=df.gender, layout=(1, 1))
+        pytest.raises(ValueError, df.boxplot,
+                      column=['height', 'weight', 'category'],
+                      layout=(2, 1), return_type='dict')
+        pytest.raises(ValueError, df.boxplot, column=['weight', 'height'],
+                      by=df.gender, layout=(-1, -1))

         # _check_plot_works adds an ax so catch warning. see GH #13188
         with tm.assert_produces_warning(UserWarning):
@@ -351,8 +356,8 @@ def test_grouped_box_multiple_axes(self):
                                           by='gender', return_type='axes',
                                           ax=axes[0])
         returned = np.array(list(returned.values))
         self._check_axes_shape(returned, axes_num=3, layout=(1, 3))
-        self.assert_numpy_array_equal(returned, axes[0])
-        self.assertIs(returned[0].figure, fig)
+        tm.assert_numpy_array_equal(returned, axes[0])
+        assert returned[0].figure is fig

         # draw on second row
         with tm.assert_produces_warning(UserWarning):
@@ -361,16 +366,16 @@ def test_grouped_box_multiple_axes(self):
                                               return_type='axes', ax=axes[1])
         returned = np.array(list(returned.values))
         self._check_axes_shape(returned, axes_num=3, layout=(1, 3))
-        self.assert_numpy_array_equal(returned, axes[1])
-        self.assertIs(returned[0].figure, fig)
+        tm.assert_numpy_array_equal(returned, axes[1])
+        assert returned[0].figure is fig

-        with tm.assertRaises(ValueError):
+        with pytest.raises(ValueError):
             fig, axes = self.plt.subplots(2, 3)
             # pass different number of axes from required
             with tm.assert_produces_warning(UserWarning):
                 axes = df.groupby('classroom').boxplot(ax=axes)

-
-if __name__ == '__main__':
-    nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
-                   exit=False)
+    def test_fontsize(self):
+        df = DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [0, 0, 0, 1, 1, 1]})
+        self._check_ticks_props(df.boxplot("a", by="b", fontsize=16),
+                                xlabelsize=16, ylabelsize=16)
diff --git a/pandas/tseries/tests/test_converter.py b/pandas/tests/plotting/test_converter.py
similarity index 72%
rename from pandas/tseries/tests/test_converter.py
rename to pandas/tests/plotting/test_converter.py
index 37d9c35639c32..e1f64bed5598d 100644
--- a/pandas/tseries/tests/test_converter.py
+++ b/pandas/tests/plotting/test_converter.py
@@ -1,26 +1,23 @@
+import pytest
 from datetime import datetime, date

-import nose
-
 import numpy as np
-from pandas import Timestamp, Period
+from pandas import Timestamp, Period, Index
 from pandas.compat import u
 import pandas.util.testing as tm
-from pandas.tseries.offsets import Second, Milli, Micro
+from pandas.tseries.offsets import Second, Milli, Micro, Day
 from pandas.compat.numpy import np_datetime64_compat

-try:
-    import pandas.tseries.converter as converter
-except ImportError:
-    raise nose.SkipTest("no pandas.tseries.converter, skipping")
+converter = pytest.importorskip('pandas.plotting._converter')


 def test_timtetonum_accepts_unicode():
     assert (converter.time2num("00:01") == converter.time2num(u("00:01")))


-class TestDateTimeConverter(tm.TestCase):
-    def setUp(self):
+class TestDateTimeConverter(object):
+
+    def setup_method(self, method):
         self.dtc = converter.DatetimeConverter()
         self.tc = converter.TimeFormatter(None)

@@ -32,35 +29,54 @@ def test_convert_accepts_unicode(self):
     def test_conversion(self):
         rs = self.dtc.convert(['2012-1-1'], None, None)[0]
         xp = datetime(2012, 1, 1).toordinal()
-        self.assertEqual(rs, xp)
+        assert rs == xp

         rs = self.dtc.convert('2012-1-1', None, None)
-        self.assertEqual(rs, xp)
+        assert rs == xp

         rs = self.dtc.convert(date(2012, 1, 1), None, None)
-        self.assertEqual(rs, xp)
+        assert rs == xp

         rs = self.dtc.convert(datetime(2012, 1, 1).toordinal(), None, None)
-        self.assertEqual(rs, xp)
+        assert rs == xp

         rs = self.dtc.convert('2012-1-1', None, None)
-        self.assertEqual(rs, xp)
+        assert rs == xp

         rs = self.dtc.convert(Timestamp('2012-1-1'), None, None)
-        self.assertEqual(rs, xp)
+        assert rs == xp

         # also testing datetime64 dtype (GH8614)
         rs = self.dtc.convert(np_datetime64_compat('2012-01-01'), None, None)
-        self.assertEqual(rs, xp)
+        assert rs == xp

         rs = self.dtc.convert(np_datetime64_compat(
             '2012-01-01 00:00:00+0000'), None, None)
-        self.assertEqual(rs, xp)
+        assert rs == xp

         rs = self.dtc.convert(np.array([
             np_datetime64_compat('2012-01-01 00:00:00+0000'),
             np_datetime64_compat('2012-01-02 00:00:00+0000')]), None, None)
-        self.assertEqual(rs[0], xp)
+        assert rs[0] == xp
+
+        # we have a tz-aware date (constructed so that when we turn to utc it
+        # is the same as our sample)
+        ts = (Timestamp('2012-01-01')
+              .tz_localize('UTC')
+              .tz_convert('US/Eastern')
+              )
+        rs = self.dtc.convert(ts, None, None)
+        assert rs == xp
+
+        rs = self.dtc.convert(ts.to_pydatetime(), None, None)
+        assert rs == xp
+
+        rs = self.dtc.convert(Index([ts - Day(1), ts]), None, None)
+        assert rs[1] == xp
+
+        rs = self.dtc.convert(Index([ts - Day(1), ts]).to_pydatetime(),
+                              None, None)
+        assert rs[1] == xp

     def test_conversion_float(self):
         decimals = 9
@@ -85,7 +101,7 @@ def test_conversion_outofbounds_datetime(self):
         tm.assert_numpy_array_equal(rs, xp)
         rs = self.dtc.convert(values[0], None, None)
         xp = converter.dates.date2num(values[0])
-        self.assertEqual(rs, xp)
+        assert rs == xp

         values = [datetime(1677, 1, 1, 12), datetime(1677, 1, 2, 12)]
         rs = self.dtc.convert(values, None, None)
@@ -93,7 +109,7 @@ def test_conversion_outofbounds_datetime(self):
         tm.assert_numpy_array_equal(rs, xp)
         rs = self.dtc.convert(values[0], None, None)
         xp = converter.dates.date2num(values[0])
-        self.assertEqual(rs, xp)
+        assert rs == xp

     def test_time_formatter(self):
         self.tc(90000)
@@ -122,9 +138,17 @@ def _assert_less(ts1, ts2):
         _assert_less(ts, ts + Milli())
         _assert_less(ts, ts + Micro(50))

+    def test_convert_nested(self):
+        inner = [Timestamp('2017-01-01'), Timestamp('2017-01-02')]
+        data = [inner, inner]
+        result = self.dtc.convert(data, None, None)
+        expected = [self.dtc.convert(x, None, None) for x in data]
+        assert result == expected
+

-class TestPeriodConverter(tm.TestCase):
-    def setUp(self):
+class TestPeriodConverter(object):
+
+    def setup_method(self, method):
         self.pc = converter.PeriodConverter()

         class Axis(object):
@@ -136,53 +160,52 @@ class Axis(object):
     def test_convert_accepts_unicode(self):
         r1 = self.pc.convert("2012-1-1", None, self.axis)
         r2 = self.pc.convert(u("2012-1-1"), None, self.axis)
-        self.assert_equal(r1, r2,
-                          "PeriodConverter.convert should accept unicode")
+        assert r1 == r2

     def test_conversion(self):
         rs = self.pc.convert(['2012-1-1'], None, self.axis)[0]
         xp = Period('2012-1-1').ordinal
-        self.assertEqual(rs, xp)
+        assert rs == xp

         rs = self.pc.convert('2012-1-1', None, self.axis)
-        self.assertEqual(rs, xp)
+        assert rs == xp

         rs = self.pc.convert([date(2012, 1, 1)], None, self.axis)[0]
-        self.assertEqual(rs, xp)
+        assert rs == xp

         rs = self.pc.convert(date(2012, 1, 1), None, self.axis)
-        self.assertEqual(rs, xp)
+        assert rs == xp

         rs = self.pc.convert([Timestamp('2012-1-1')], None, self.axis)[0]
-        self.assertEqual(rs, xp)
+        assert rs == xp

         rs = self.pc.convert(Timestamp('2012-1-1'), None, self.axis)
-        self.assertEqual(rs, xp)
+        assert rs == xp

         # FIXME
         # rs = self.pc.convert(
         #        np_datetime64_compat('2012-01-01'), None, self.axis)
-        # self.assertEqual(rs, xp)
+        # assert rs == xp
         #
         # rs = self.pc.convert(
         #        np_datetime64_compat('2012-01-01 00:00:00+0000'),
         #        None, self.axis)
-        # self.assertEqual(rs, xp)
+        # assert rs == xp
         #
         # rs = self.pc.convert(np.array([
         #     np_datetime64_compat('2012-01-01 00:00:00+0000'),
         #     np_datetime64_compat('2012-01-02 00:00:00+0000')]),
         #     None, self.axis)
-        # self.assertEqual(rs[0], xp)
+        # assert rs[0] == xp

     def test_integer_passthrough(self):
         # GH9012
         rs = self.pc.convert([0, 1], None, self.axis)
         xp = [0, 1]
-        self.assertEqual(rs, xp)
-
+        assert rs == xp

-if __name__ == '__main__':
-    import nose
-    nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
-                   exit=False)
+    def test_convert_nested(self):
+        data = ['2012-1-1', '2012-1-2']
+        r1 = self.pc.convert([data, data], None, self.axis)
+        r2 = [self.pc.convert(data, None, self.axis) for _ in range(2)]
+        assert r1 == r2
diff --git a/pandas/tests/plotting/test_datetimelike.py b/pandas/tests/plotting/test_datetimelike.py
index f07aadba175f2..ed198de11bac1 100644
--- a/pandas/tests/plotting/test_datetimelike.py
+++ b/pandas/tests/plotting/test_datetimelike.py
@@ -1,15 +1,18 @@
+""" Test cases for time series specific (freq conversion, etc) """
+
 from datetime import datetime, timedelta, date, time

-import nose
+import pytest
 from pandas.compat import lrange, zip

 import numpy as np
 from pandas import Index, Series, DataFrame
-
-from pandas.tseries.index import date_range, bdate_range
+from pandas.compat import is_platform_mac
+from pandas.core.indexes.datetimes import date_range, bdate_range
+from pandas.core.indexes.timedeltas import timedelta_range
 from pandas.tseries.offsets import DateOffset
-from pandas.tseries.period import period_range, Period, PeriodIndex
-from pandas.tseries.resample import DatetimeIndex
+from pandas.core.indexes.period import period_range, Period, PeriodIndex
+from pandas.core.resample import DatetimeIndex

 from pandas.util.testing import assert_series_equal, ensure_clean, slow
 import pandas.util.testing as tm
@@ -17,15 +20,13 @@
 from pandas.tests.plotting.common import (TestPlotBase,
                                           _skip_if_no_scipy_gaussian_kde)

-
-""" Test cases for time series specific (freq conversion, etc) """
+tm._skip_module_if_no_mpl()


-@tm.mplskip
 class TestTSPlot(TestPlotBase):

-    def setUp(self):
-        TestPlotBase.setUp(self)
+    def setup_method(self, method):
+        TestPlotBase.setup_method(self, method)

         freq = ['S', 'T', 'H', 'D', 'W', 'M', 'Q', 'A']
         idx = [period_range('12/31/1999', freq=x, periods=100) for x in freq]
@@ -41,7 +42,7 @@ def setUp(self):
                             columns=['A', 'B', 'C'])
                   for x in idx]

-    def tearDown(self):
+    def teardown_method(self, method):
         tm.close()

     @slow
@@ -58,7 +59,7 @@ def test_fontsize_set_correctly(self):
         df = DataFrame(np.random.randn(10, 9), index=range(10))
         ax = df.plot(fontsize=2)
         for label in (ax.get_xticklabels() + ax.get_yticklabels()):
-            self.assertEqual(label.get_fontsize(), 2)
+            assert label.get_fontsize() == 2

     @slow
     def test_frame_inferred(self):
@@ -95,10 +96,10 @@ def test_nonnumeric_exclude(self):
         df = DataFrame({'A': ["x", "y", "z"], 'B': [1, 2, 3]}, idx)

         ax = df.plot()  # it works
-        self.assertEqual(len(ax.get_lines()), 1)  # B was plotted
+        assert len(ax.get_lines()) == 1  # B was plotted
         plt.close(plt.gcf())

-        self.assertRaises(TypeError, df['A'].plot)
+        pytest.raises(TypeError, df['A'].plot)

     @slow
     def test_tsplot(self):
@@ -124,16 +125,16 @@ def test_tsplot(self):

         ax = ts.plot(style='k')
         color = (0., 0., 0., 1) if self.mpl_ge_2_0_0 else (0., 0., 0.)
- self.assertEqual(color, ax.get_lines()[0].get_color()) + assert color == ax.get_lines()[0].get_color() def test_both_style_and_color(self): import matplotlib.pyplot as plt # noqa ts = tm.makeTimeSeries() - self.assertRaises(ValueError, ts.plot, style='b-', color='#000099') + pytest.raises(ValueError, ts.plot, style='b-', color='#000099') s = ts.reset_index(drop=True) - self.assertRaises(ValueError, s.plot, style='b-', color='#000099') + pytest.raises(ValueError, s.plot, style='b-', color='#000099') @slow def test_high_freq(self): @@ -144,13 +145,13 @@ def test_high_freq(self): _check_plot_works(ser.plot) def test_get_datevalue(self): - from pandas.tseries.converter import get_datevalue - self.assertIsNone(get_datevalue(None, 'D')) - self.assertEqual(get_datevalue(1987, 'A'), 1987) - self.assertEqual(get_datevalue(Period(1987, 'A'), 'M'), - Period('1987-12', 'M').ordinal) - self.assertEqual(get_datevalue('1/1/1987', 'D'), - Period('1987-1-1', 'D').ordinal) + from pandas.plotting._converter import get_datevalue + assert get_datevalue(None, 'D') is None + assert get_datevalue(1987, 'A') == 1987 + assert (get_datevalue(Period(1987, 'A'), 'M') == + Period('1987-12', 'M').ordinal) + assert (get_datevalue('1/1/1987', 'D') == + Period('1987-1-1', 'D').ordinal) @slow def test_ts_plot_format_coord(self): @@ -159,11 +160,10 @@ def check_format_of_first_point(ax, expected_string): first_x = first_line.get_xdata()[0].ordinal first_y = first_line.get_ydata()[0] try: - self.assertEqual(expected_string, - ax.format_coord(first_x, first_y)) + assert expected_string == ax.format_coord(first_x, first_y) except (ValueError): - raise nose.SkipTest("skipping test because issue forming " - "test comparison GH7664") + pytest.skip("skipping test because issue forming " + "test comparison GH7664") annual = Series(1, index=date_range('2014-01-01', periods=3, freq='A-DEC')) @@ -223,7 +223,7 @@ def test_fake_inferred_business(self): ts = Series(lrange(len(rng)), rng) ts = ts[:3].append(ts[5:]) ax = ts.plot() - self.assertFalse(hasattr(ax, 'freq')) + assert not hasattr(ax, 'freq') @slow def test_plot_offset_freq(self): @@ -243,7 +243,7 @@ def test_plot_multiple_inferred_freq(self): @slow def test_uhf(self): - import pandas.tseries.converter as conv + import pandas.plotting._converter as conv import matplotlib.pyplot as plt fig = plt.gcf() plt.clf() @@ -261,7 +261,7 @@ def test_uhf(self): xp = conv._from_ordinal(loc).strftime('%H:%M:%S.%f') rs = str(label.get_text()) if len(rs): - self.assertEqual(xp, rs) + assert xp == rs @slow def test_irreg_hf(self): @@ -273,13 +273,12 @@ def test_irreg_hf(self): idx = date_range('2012-6-22 21:59:51', freq='S', periods=100) df = DataFrame(np.random.randn(len(idx), 2), idx) - irreg = df.ix[[0, 1, 3, 4]] + irreg = df.iloc[[0, 1, 3, 4]] ax = irreg.plot() diffs = Series(ax.get_lines()[0].get_xydata()[:, 0]).diff() sec = 1. 
/ 24 / 60 / 60 - self.assertTrue((np.fabs(diffs[1:] - [sec, sec * 2, sec]) < 1e-8).all( - )) + assert (np.fabs(diffs[1:] - [sec, sec * 2, sec]) < 1e-8).all() plt.clf() fig.add_subplot(111) @@ -287,7 +286,7 @@ def test_irreg_hf(self): df2.index = df.index.asobject ax = df2.plot() diffs = Series(ax.get_lines()[0].get_xydata()[:, 0]).diff() - self.assertTrue((np.fabs(diffs[1:] - sec) < 1e-8).all()) + assert (np.fabs(diffs[1:] - sec) < 1e-8).all() def test_irregular_datetime64_repr_bug(self): import matplotlib.pyplot as plt @@ -296,21 +295,22 @@ def test_irregular_datetime64_repr_bug(self): fig = plt.gcf() plt.clf() + ax = fig.add_subplot(211) + ret = ser.plot() - self.assertIsNotNone(ret) + assert ret is not None for rs, xp in zip(ax.get_lines()[0].get_xdata(), ser.index): - self.assertEqual(rs, xp) + assert rs == xp def test_business_freq(self): import matplotlib.pyplot as plt # noqa bts = tm.makePeriodSeries() ax = bts.plot() - self.assertEqual(ax.get_lines()[0].get_xydata()[0, 0], - bts.index[0].ordinal) + assert ax.get_lines()[0].get_xydata()[0, 0] == bts.index[0].ordinal idx = ax.get_lines()[0].get_xdata() - self.assertEqual(PeriodIndex(data=idx).freqstr, 'B') + assert PeriodIndex(data=idx).freqstr == 'B' @slow def test_business_freq_convert(self): @@ -320,10 +320,9 @@ def test_business_freq_convert(self): tm.N = n ts = bts.to_period('M') ax = bts.plot() - self.assertEqual(ax.get_lines()[0].get_xydata()[0, 0], - ts.index[0].ordinal) + assert ax.get_lines()[0].get_xydata()[0, 0] == ts.index[0].ordinal idx = ax.get_lines()[0].get_xdata() - self.assertEqual(PeriodIndex(data=idx).freqstr, 'M') + assert PeriodIndex(data=idx).freqstr == 'M' def test_nonzero_base(self): # GH2571 @@ -332,7 +331,7 @@ def test_nonzero_base(self): df = DataFrame(np.arange(24), index=idx) ax = df.plot() rs = ax.get_lines()[0].get_xdata() - self.assertFalse(Index(rs).is_normalized) + assert not Index(rs).is_normalized def test_dataframe(self): bts = DataFrame({'a': tm.makeTimeSeries()}) @@ -349,8 +348,8 @@ def _test(ax): ax.set_xlim(xlim[0] - 5, xlim[1] + 10) ax.get_figure().canvas.draw() result = ax.get_xlim() - self.assertEqual(result[0], xlim[0] - 5) - self.assertEqual(result[1], xlim[1] + 10) + assert result[0] == xlim[0] - 5 + assert result[1] == xlim[1] + 10 # string expected = (Period('1/1/2000', ax.freq), @@ -358,8 +357,8 @@ def _test(ax): ax.set_xlim('1/1/2000', '4/1/2000') ax.get_figure().canvas.draw() result = ax.get_xlim() - self.assertEqual(int(result[0]), expected[0].ordinal) - self.assertEqual(int(result[1]), expected[1].ordinal) + assert int(result[0]) == expected[0].ordinal + assert int(result[1]) == expected[1].ordinal # datetim expected = (Period('1/1/2000', ax.freq), @@ -367,8 +366,8 @@ def _test(ax): ax.set_xlim(datetime(2000, 1, 1), datetime(2000, 4, 1)) ax.get_figure().canvas.draw() result = ax.get_xlim() - self.assertEqual(int(result[0]), expected[0].ordinal) - self.assertEqual(int(result[1]), expected[1].ordinal) + assert int(result[0]) == expected[0].ordinal + assert int(result[1]) == expected[1].ordinal fig = ax.get_figure() plt.close(fig) @@ -387,14 +386,14 @@ def _test(ax): _test(ax) def test_get_finder(self): - import pandas.tseries.converter as conv + import pandas.plotting._converter as conv - self.assertEqual(conv.get_finder('B'), conv._daily_finder) - self.assertEqual(conv.get_finder('D'), conv._daily_finder) - self.assertEqual(conv.get_finder('M'), conv._monthly_finder) - self.assertEqual(conv.get_finder('Q'), conv._quarterly_finder) - self.assertEqual(conv.get_finder('A'), 
conv._annual_finder) - self.assertEqual(conv.get_finder('W'), conv._daily_finder) + assert conv.get_finder('B') == conv._daily_finder + assert conv.get_finder('D') == conv._daily_finder + assert conv.get_finder('M') == conv._monthly_finder + assert conv.get_finder('Q') == conv._quarterly_finder + assert conv.get_finder('A') == conv._annual_finder + assert conv.get_finder('W') == conv._daily_finder @slow def test_finder_daily(self): @@ -407,11 +406,11 @@ def test_finder_daily(self): ax = ser.plot() xaxis = ax.get_xaxis() rs = xaxis.get_majorticklocs()[0] - self.assertEqual(xp, rs) + assert xp == rs vmin, vmax = ax.get_xlim() ax.set_xlim(vmin + 0.9, vmax) rs = xaxis.get_majorticklocs()[0] - self.assertEqual(xp, rs) + assert xp == rs plt.close(ax.get_figure()) @slow @@ -425,11 +424,11 @@ def test_finder_quarterly(self): ax = ser.plot() xaxis = ax.get_xaxis() rs = xaxis.get_majorticklocs()[0] - self.assertEqual(rs, xp) + assert rs == xp (vmin, vmax) = ax.get_xlim() ax.set_xlim(vmin + 0.9, vmax) rs = xaxis.get_majorticklocs()[0] - self.assertEqual(xp, rs) + assert xp == rs plt.close(ax.get_figure()) @slow @@ -443,11 +442,11 @@ def test_finder_monthly(self): ax = ser.plot() xaxis = ax.get_xaxis() rs = xaxis.get_majorticklocs()[0] - self.assertEqual(rs, xp) + assert rs == xp vmin, vmax = ax.get_xlim() ax.set_xlim(vmin + 0.9, vmax) rs = xaxis.get_majorticklocs()[0] - self.assertEqual(xp, rs) + assert xp == rs plt.close(ax.get_figure()) def test_finder_monthly_long(self): @@ -457,7 +456,7 @@ def test_finder_monthly_long(self): xaxis = ax.get_xaxis() rs = xaxis.get_majorticklocs()[0] xp = Period('1989Q1', 'M').ordinal - self.assertEqual(rs, xp) + assert rs == xp @slow def test_finder_annual(self): @@ -469,7 +468,7 @@ def test_finder_annual(self): ax = ser.plot() xaxis = ax.get_xaxis() rs = xaxis.get_majorticklocs()[0] - self.assertEqual(rs, Period(xp[i], freq='A').ordinal) + assert rs == Period(xp[i], freq='A').ordinal plt.close(ax.get_figure()) @slow @@ -481,7 +480,7 @@ def test_finder_minutely(self): xaxis = ax.get_xaxis() rs = xaxis.get_majorticklocs()[0] xp = Period('1/1/1999', freq='Min').ordinal - self.assertEqual(rs, xp) + assert rs == xp def test_finder_hourly(self): nhours = 23 @@ -491,7 +490,7 @@ def test_finder_hourly(self): xaxis = ax.get_xaxis() rs = xaxis.get_majorticklocs()[0] xp = Period('1/1/1999', freq='H').ordinal - self.assertEqual(rs, xp) + assert rs == xp @slow def test_gaps(self): @@ -502,12 +501,12 @@ def test_gaps(self): ax = ts.plot() lines = ax.get_lines() tm._skip_if_mpl_1_5() - self.assertEqual(len(lines), 1) + assert len(lines) == 1 l = lines[0] data = l.get_xydata() - tm.assertIsInstance(data, np.ma.core.MaskedArray) + assert isinstance(data, np.ma.core.MaskedArray) mask = data.mask - self.assertTrue(mask[5:25, 1].all()) + assert mask[5:25, 1].all() plt.close(ax.get_figure()) # irregular @@ -516,12 +515,12 @@ def test_gaps(self): ts[2:5] = np.nan ax = ts.plot() lines = ax.get_lines() - self.assertEqual(len(lines), 1) + assert len(lines) == 1 l = lines[0] data = l.get_xydata() - tm.assertIsInstance(data, np.ma.core.MaskedArray) + assert isinstance(data, np.ma.core.MaskedArray) mask = data.mask - self.assertTrue(mask[2:5, 1].all()) + assert mask[2:5, 1].all() plt.close(ax.get_figure()) # non-ts @@ -530,12 +529,12 @@ def test_gaps(self): ser[2:5] = np.nan ax = ser.plot() lines = ax.get_lines() - self.assertEqual(len(lines), 1) + assert len(lines) == 1 l = lines[0] data = l.get_xydata() - tm.assertIsInstance(data, np.ma.core.MaskedArray) + assert isinstance(data, 
np.ma.core.MaskedArray) mask = data.mask - self.assertTrue(mask[2:5, 1].all()) + assert mask[2:5, 1].all() @slow def test_gap_upsample(self): @@ -547,16 +546,16 @@ def test_gap_upsample(self): s = Series(np.random.randn(len(idxh)), idxh) s.plot(secondary_y=True) lines = ax.get_lines() - self.assertEqual(len(lines), 1) - self.assertEqual(len(ax.right_ax.get_lines()), 1) + assert len(lines) == 1 + assert len(ax.right_ax.get_lines()) == 1 l = lines[0] data = l.get_xydata() tm._skip_if_mpl_1_5() - tm.assertIsInstance(data, np.ma.core.MaskedArray) + assert isinstance(data, np.ma.core.MaskedArray) mask = data.mask - self.assertTrue(mask[5:25, 1].all()) + assert mask[5:25, 1].all() @slow def test_secondary_y(self): @@ -565,29 +564,29 @@ def test_secondary_y(self): ser = Series(np.random.randn(10)) ser2 = Series(np.random.randn(10)) ax = ser.plot(secondary_y=True) - self.assertTrue(hasattr(ax, 'left_ax')) - self.assertFalse(hasattr(ax, 'right_ax')) + assert hasattr(ax, 'left_ax') + assert not hasattr(ax, 'right_ax') fig = ax.get_figure() axes = fig.get_axes() l = ax.get_lines()[0] xp = Series(l.get_ydata(), l.get_xdata()) assert_series_equal(ser, xp) - self.assertEqual(ax.get_yaxis().get_ticks_position(), 'right') - self.assertFalse(axes[0].get_yaxis().get_visible()) + assert ax.get_yaxis().get_ticks_position() == 'right' + assert not axes[0].get_yaxis().get_visible() plt.close(fig) ax2 = ser2.plot() - self.assertEqual(ax2.get_yaxis().get_ticks_position(), - self.default_tick_position) + assert (ax2.get_yaxis().get_ticks_position() == + self.default_tick_position) plt.close(ax2.get_figure()) ax = ser2.plot() ax2 = ser.plot(secondary_y=True) - self.assertTrue(ax.get_yaxis().get_visible()) - self.assertFalse(hasattr(ax, 'left_ax')) - self.assertTrue(hasattr(ax, 'right_ax')) - self.assertTrue(hasattr(ax2, 'left_ax')) - self.assertFalse(hasattr(ax2, 'right_ax')) + assert ax.get_yaxis().get_visible() + assert not hasattr(ax, 'left_ax') + assert hasattr(ax, 'right_ax') + assert hasattr(ax2, 'left_ax') + assert not hasattr(ax2, 'right_ax') @slow def test_secondary_y_ts(self): @@ -596,25 +595,25 @@ def test_secondary_y_ts(self): ser = Series(np.random.randn(10), idx) ser2 = Series(np.random.randn(10), idx) ax = ser.plot(secondary_y=True) - self.assertTrue(hasattr(ax, 'left_ax')) - self.assertFalse(hasattr(ax, 'right_ax')) + assert hasattr(ax, 'left_ax') + assert not hasattr(ax, 'right_ax') fig = ax.get_figure() axes = fig.get_axes() l = ax.get_lines()[0] xp = Series(l.get_ydata(), l.get_xdata()).to_timestamp() assert_series_equal(ser, xp) - self.assertEqual(ax.get_yaxis().get_ticks_position(), 'right') - self.assertFalse(axes[0].get_yaxis().get_visible()) + assert ax.get_yaxis().get_ticks_position() == 'right' + assert not axes[0].get_yaxis().get_visible() plt.close(fig) ax2 = ser2.plot() - self.assertEqual(ax2.get_yaxis().get_ticks_position(), - self.default_tick_position) + assert (ax2.get_yaxis().get_ticks_position() == + self.default_tick_position) plt.close(ax2.get_figure()) ax = ser2.plot() ax2 = ser.plot(secondary_y=True) - self.assertTrue(ax.get_yaxis().get_visible()) + assert ax.get_yaxis().get_visible() @slow def test_secondary_kde(self): @@ -624,11 +623,11 @@ def test_secondary_kde(self): import matplotlib.pyplot as plt # noqa ser = Series(np.random.randn(10)) ax = ser.plot(secondary_y=True, kind='density') - self.assertTrue(hasattr(ax, 'left_ax')) - self.assertFalse(hasattr(ax, 'right_ax')) + assert hasattr(ax, 'left_ax') + assert not hasattr(ax, 'right_ax') fig = ax.get_figure() axes = 
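Editor's note: the three `test_gaps` checks above all rely on the same behaviour: pandas hands matplotlib a masked array so NaN stretches render as gaps. A runnable sketch, assuming the Agg backend for a headless run; as the test's `tm._skip_if_mpl_1_5()` call hints, this detail is sensitive to the matplotlib version:

```python
import matplotlib
matplotlib.use('Agg')                 # assumption: headless test run
import numpy as np
import pandas.util.testing as tm

ts = tm.makeTimeSeries()
ts[5:25] = np.nan                     # carve a gap into the series
ax = ts.plot()
data = ax.get_lines()[0].get_xydata()
assert isinstance(data, np.ma.core.MaskedArray)
assert data.mask[5:25, 1].all()       # the NaN run is masked, not drawn
```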
fig.get_axes() - self.assertEqual(axes[1].get_yaxis().get_ticks_position(), 'right') + assert axes[1].get_yaxis().get_ticks_position() == 'right' @slow def test_secondary_bar(self): @@ -636,25 +635,25 @@ def test_secondary_bar(self): ax = ser.plot(secondary_y=True, kind='bar') fig = ax.get_figure() axes = fig.get_axes() - self.assertEqual(axes[1].get_yaxis().get_ticks_position(), 'right') + assert axes[1].get_yaxis().get_ticks_position() == 'right' @slow def test_secondary_frame(self): df = DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c']) axes = df.plot(secondary_y=['a', 'c'], subplots=True) - self.assertEqual(axes[0].get_yaxis().get_ticks_position(), 'right') - self.assertEqual(axes[1].get_yaxis().get_ticks_position(), - self.default_tick_position) - self.assertEqual(axes[2].get_yaxis().get_ticks_position(), 'right') + assert axes[0].get_yaxis().get_ticks_position() == 'right' + assert (axes[1].get_yaxis().get_ticks_position() == + self.default_tick_position) + assert axes[2].get_yaxis().get_ticks_position() == 'right' @slow def test_secondary_bar_frame(self): df = DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c']) axes = df.plot(kind='bar', secondary_y=['a', 'c'], subplots=True) - self.assertEqual(axes[0].get_yaxis().get_ticks_position(), 'right') - self.assertEqual(axes[1].get_yaxis().get_ticks_position(), - self.default_tick_position) - self.assertEqual(axes[2].get_yaxis().get_ticks_position(), 'right') + assert axes[0].get_yaxis().get_ticks_position() == 'right' + assert (axes[1].get_yaxis().get_ticks_position() == + self.default_tick_position) + assert axes[2].get_yaxis().get_ticks_position() == 'right' def test_mixed_freq_regular_first(self): import matplotlib.pyplot as plt # noqa @@ -668,12 +667,12 @@ def test_mixed_freq_regular_first(self): lines = ax2.get_lines() idx1 = PeriodIndex(lines[0].get_xdata()) idx2 = PeriodIndex(lines[1].get_xdata()) - self.assertTrue(idx1.equals(s1.index.to_period('B'))) - self.assertTrue(idx2.equals(s2.index.to_period('B'))) + assert idx1.equals(s1.index.to_period('B')) + assert idx2.equals(s2.index.to_period('B')) left, right = ax2.get_xlim() pidx = s1.index.to_period() - self.assertEqual(left, pidx[0].ordinal) - self.assertEqual(right, pidx[-1].ordinal) + assert left == pidx[0].ordinal + assert right == pidx[-1].ordinal @slow def test_mixed_freq_irregular_first(self): @@ -682,7 +681,7 @@ def test_mixed_freq_irregular_first(self): s2 = s1[[0, 5, 10, 11, 12, 13, 14, 15]] s2.plot(style='g') ax = s1.plot() - self.assertFalse(hasattr(ax, 'freq')) + assert not hasattr(ax, 'freq') lines = ax.get_lines() x1 = lines[0].get_xdata() tm.assert_numpy_array_equal(x1, s2.index.asobject.values) @@ -699,12 +698,12 @@ def test_mixed_freq_regular_first_df(self): lines = ax2.get_lines() idx1 = PeriodIndex(lines[0].get_xdata()) idx2 = PeriodIndex(lines[1].get_xdata()) - self.assertTrue(idx1.equals(s1.index.to_period('B'))) - self.assertTrue(idx2.equals(s2.index.to_period('B'))) + assert idx1.equals(s1.index.to_period('B')) + assert idx2.equals(s2.index.to_period('B')) left, right = ax2.get_xlim() pidx = s1.index.to_period() - self.assertEqual(left, pidx[0].ordinal) - self.assertEqual(right, pidx[-1].ordinal) + assert left == pidx[0].ordinal + assert right == pidx[-1].ordinal @slow def test_mixed_freq_irregular_first_df(self): @@ -714,7 +713,7 @@ def test_mixed_freq_irregular_first_df(self): s2 = s1.iloc[[0, 5, 10, 11, 12, 13, 14, 15], :] ax = s2.plot(style='g') ax = s1.plot(ax=ax) - self.assertFalse(hasattr(ax, 'freq')) + assert not hasattr(ax, 'freq') 
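Editor's note: the secondary-axis tests above keep asserting one contract: plotting with `secondary_y=True` returns the right-hand axes, which carries a `left_ax` back-reference but no `right_ax`, and puts its ticks on the right. A minimal sketch:

```python
import matplotlib
matplotlib.use('Agg')                 # assumption: headless test run
import numpy as np
from pandas import Series

ax = Series(np.random.randn(10)).plot(secondary_y=True)
assert hasattr(ax, 'left_ax')         # the returned axes *is* the secondary one
assert not hasattr(ax, 'right_ax')
assert ax.get_yaxis().get_ticks_position() == 'right'
```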
lines = ax.get_lines() x1 = lines[0].get_xdata() tm.assert_numpy_array_equal(x1, s2.index.asobject.values) @@ -729,7 +728,7 @@ def test_mixed_freq_hf_first(self): high.plot() ax = low.plot() for l in ax.get_lines(): - self.assertEqual(PeriodIndex(data=l.get_xdata()).freq, 'D') + assert PeriodIndex(data=l.get_xdata()).freq == 'D' @slow def test_mixed_freq_alignment(self): @@ -742,8 +741,7 @@ def test_mixed_freq_alignment(self): ax = ts.plot() ts2.plot(style='r') - self.assertEqual(ax.lines[0].get_xdata()[0], - ax.lines[1].get_xdata()[0]) + assert ax.lines[0].get_xdata()[0] == ax.lines[1].get_xdata()[0] @slow def test_mixed_freq_lf_first(self): @@ -756,9 +754,9 @@ def test_mixed_freq_lf_first(self): low.plot(legend=True) ax = high.plot(legend=True) for l in ax.get_lines(): - self.assertEqual(PeriodIndex(data=l.get_xdata()).freq, 'D') + assert PeriodIndex(data=l.get_xdata()).freq == 'D' leg = ax.get_legend() - self.assertEqual(len(leg.texts), 2) + assert len(leg.texts) == 2 plt.close(ax.get_figure()) idxh = date_range('1/1/1999', periods=240, freq='T') @@ -768,7 +766,7 @@ def test_mixed_freq_lf_first(self): low.plot() ax = high.plot() for l in ax.get_lines(): - self.assertEqual(PeriodIndex(data=l.get_xdata()).freq, 'T') + assert PeriodIndex(data=l.get_xdata()).freq == 'T' def test_mixed_freq_irreg_period(self): ts = tm.makeTimeSeries() @@ -790,10 +788,10 @@ def test_mixed_freq_shared_ax(self): s1.plot(ax=ax1) s2.plot(ax=ax2) - self.assertEqual(ax1.freq, 'M') - self.assertEqual(ax2.freq, 'M') - self.assertEqual(ax1.lines[0].get_xydata()[0, 0], - ax2.lines[0].get_xydata()[0, 0]) + assert ax1.freq == 'M' + assert ax2.freq == 'M' + assert (ax1.lines[0].get_xydata()[0, 0] == + ax2.lines[0].get_xydata()[0, 0]) # using twinx fig, ax1 = self.plt.subplots() @@ -801,8 +799,8 @@ def test_mixed_freq_shared_ax(self): s1.plot(ax=ax1) s2.plot(ax=ax2) - self.assertEqual(ax1.lines[0].get_xydata()[0, 0], - ax2.lines[0].get_xydata()[0, 0]) + assert (ax1.lines[0].get_xydata()[0, 0] == + ax2.lines[0].get_xydata()[0, 0]) # TODO (GH14330, GH14322) # plotting the irregular first does not yet work @@ -810,8 +808,8 @@ def test_mixed_freq_shared_ax(self): # ax2 = ax1.twinx() # s2.plot(ax=ax1) # s1.plot(ax=ax2) - # self.assertEqual(ax1.lines[0].get_xydata()[0, 0], - # ax2.lines[0].get_xydata()[0, 0]) + # assert (ax1.lines[0].get_xydata()[0, 0] == + # ax2.lines[0].get_xydata()[0, 0]) @slow def test_to_weekly_resampling(self): @@ -822,7 +820,7 @@ def test_to_weekly_resampling(self): high.plot() ax = low.plot() for l in ax.get_lines(): - self.assertEqual(PeriodIndex(data=l.get_xdata()).freq, idxh.freq) + assert PeriodIndex(data=l.get_xdata()).freq == idxh.freq # tsplot from pandas.tseries.plotting import tsplot @@ -831,7 +829,7 @@ def test_to_weekly_resampling(self): tsplot(high, plt.Axes.plot) lines = tsplot(low, plt.Axes.plot) for l in lines: - self.assertTrue(PeriodIndex(data=l.get_xdata()).freq, idxh.freq) + assert PeriodIndex(data=l.get_xdata()).freq, idxh.freq @slow def test_from_weekly_resampling(self): @@ -846,12 +844,12 @@ def test_from_weekly_resampling(self): expected_l = np.array([1514, 1519, 1523, 1527, 1531, 1536, 1540, 1544, 1549, 1553, 1558, 1562], dtype=np.float64) for l in ax.get_lines(): - self.assertTrue(PeriodIndex(data=l.get_xdata()).freq, idxh.freq) + assert PeriodIndex(data=l.get_xdata()).freq, idxh.freq xdata = l.get_xdata(orig=False) if len(xdata) == 12: # idxl lines - self.assert_numpy_array_equal(xdata, expected_l) + tm.assert_numpy_array_equal(xdata, expected_l) else: - 
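Editor's note: one conversion here deserves a flag rather than a rubber stamp. `assert PeriodIndex(data=l.get_xdata()).freq, idxh.freq` (in `test_to_weekly_resampling` and again in `test_from_weekly_resampling`) faithfully preserves an old `assertTrue(a, b)` typo: in a bare assert the second expression is only the failure message, so the statement can never fail on the frequency. A demonstration, with the comparison the test presumably intends:

```python
freq_actual, freq_expected = 'W-SUN', 'D'   # deliberately unequal
assert freq_actual, freq_expected           # always passes: 2nd operand is a message
assert not (freq_actual == freq_expected)   # an explicit '==' is what was meant
```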
self.assert_numpy_array_equal(xdata, expected_h) + tm.assert_numpy_array_equal(xdata, expected_h) tm.close() # tsplot @@ -861,12 +859,12 @@ def test_from_weekly_resampling(self): tsplot(low, plt.Axes.plot) lines = tsplot(high, plt.Axes.plot) for l in lines: - self.assertTrue(PeriodIndex(data=l.get_xdata()).freq, idxh.freq) + assert PeriodIndex(data=l.get_xdata()).freq, idxh.freq xdata = l.get_xdata(orig=False) if len(xdata) == 12: # idxl lines - self.assert_numpy_array_equal(xdata, expected_l) + tm.assert_numpy_array_equal(xdata, expected_l) else: - self.assert_numpy_array_equal(xdata, expected_h) + tm.assert_numpy_array_equal(xdata, expected_h) @slow def test_from_resampling_area_line_mixed(self): @@ -889,26 +887,25 @@ def test_from_resampling_area_line_mixed(self): expected_y = np.zeros(len(expected_x), dtype=np.float64) for i in range(3): l = ax.lines[i] - self.assertEqual(PeriodIndex(l.get_xdata()).freq, idxh.freq) - self.assert_numpy_array_equal(l.get_xdata(orig=False), - expected_x) + assert PeriodIndex(l.get_xdata()).freq == idxh.freq + tm.assert_numpy_array_equal(l.get_xdata(orig=False), + expected_x) # check stacked values are correct expected_y += low[i].values - self.assert_numpy_array_equal( - l.get_ydata(orig=False), expected_y) + tm.assert_numpy_array_equal(l.get_ydata(orig=False), + expected_y) # check high dataframe result expected_x = idxh.to_period().asi8.astype(np.float64) expected_y = np.zeros(len(expected_x), dtype=np.float64) for i in range(3): l = ax.lines[3 + i] - self.assertEqual(PeriodIndex(data=l.get_xdata()).freq, - idxh.freq) - self.assert_numpy_array_equal(l.get_xdata(orig=False), - expected_x) + assert PeriodIndex(data=l.get_xdata()).freq == idxh.freq + tm.assert_numpy_array_equal(l.get_xdata(orig=False), + expected_x) expected_y += high[i].values - self.assert_numpy_array_equal(l.get_ydata(orig=False), - expected_y) + tm.assert_numpy_array_equal(l.get_ydata(orig=False), + expected_y) # high to low for kind1, kind2 in [('line', 'area'), ('area', 'line')]: @@ -920,13 +917,12 @@ def test_from_resampling_area_line_mixed(self): expected_y = np.zeros(len(expected_x), dtype=np.float64) for i in range(3): l = ax.lines[i] - self.assertEqual(PeriodIndex(data=l.get_xdata()).freq, - idxh.freq) - self.assert_numpy_array_equal( - l.get_xdata(orig=False), expected_x) + assert PeriodIndex(data=l.get_xdata()).freq == idxh.freq + tm.assert_numpy_array_equal(l.get_xdata(orig=False), + expected_x) expected_y += high[i].values - self.assert_numpy_array_equal( - l.get_ydata(orig=False), expected_y) + tm.assert_numpy_array_equal(l.get_ydata(orig=False), + expected_y) # check low dataframe result expected_x = np.array([1514, 1519, 1523, 1527, 1531, 1536, 1540, @@ -935,13 +931,12 @@ def test_from_resampling_area_line_mixed(self): expected_y = np.zeros(len(expected_x), dtype=np.float64) for i in range(3): l = ax.lines[3 + i] - self.assertEqual(PeriodIndex(data=l.get_xdata()).freq, - idxh.freq) - self.assert_numpy_array_equal(l.get_xdata(orig=False), - expected_x) + assert PeriodIndex(data=l.get_xdata()).freq == idxh.freq + tm.assert_numpy_array_equal(l.get_xdata(orig=False), + expected_x) expected_y += low[i].values - self.assert_numpy_array_equal(l.get_ydata(orig=False), - expected_y) + tm.assert_numpy_array_equal(l.get_ydata(orig=False), + expected_y) @slow def test_mixed_freq_second_millisecond(self): @@ -953,17 +948,17 @@ def test_mixed_freq_second_millisecond(self): # high to low high.plot() ax = low.plot() - self.assertEqual(len(ax.get_lines()), 2) + assert len(ax.get_lines()) 
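Editor's note: `test_from_resampling_area_line_mixed` above verifies stacking by accumulating each column into a running total and comparing it against every line's y-data. The same idea in a few self-contained lines, using a plain random frame instead of the test's resampled inputs:

```python
import matplotlib
matplotlib.use('Agg')                 # assumption: headless test run
import numpy as np
import pandas.util.testing as tm
from pandas import DataFrame

df = DataFrame(np.random.rand(6, 3))
ax = df.plot.area()                   # area plots stack by default
expected = np.zeros(len(df), dtype=np.float64)
for i, line in enumerate(ax.lines):
    expected += df[i].values          # each line sits on the running sum
    tm.assert_numpy_array_equal(line.get_ydata(orig=False), expected)
```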
== 2 for l in ax.get_lines(): - self.assertEqual(PeriodIndex(data=l.get_xdata()).freq, 'L') + assert PeriodIndex(data=l.get_xdata()).freq == 'L' tm.close() # low to high low.plot() ax = high.plot() - self.assertEqual(len(ax.get_lines()), 2) + assert len(ax.get_lines()) == 2 for l in ax.get_lines(): - self.assertEqual(PeriodIndex(data=l.get_xdata()).freq, 'L') + assert PeriodIndex(data=l.get_xdata()).freq == 'L' @slow def test_irreg_dtypes(self): @@ -997,7 +992,7 @@ def test_time(self): xp = l.get_text() if len(xp) > 0: rs = time(h, m, s).strftime('%H:%M:%S') - self.assertEqual(xp, rs) + assert xp == rs # change xlim ax.set_xlim('1:30', '5:00') @@ -1011,7 +1006,7 @@ def test_time(self): xp = l.get_text() if len(xp) > 0: rs = time(h, m, s).strftime('%H:%M:%S') - self.assertEqual(xp, rs) + assert xp == rs @slow def test_time_musec(self): @@ -1037,7 +1032,7 @@ def test_time_musec(self): xp = l.get_text() if len(xp) > 0: rs = time(h, m, s).strftime('%H:%M:%S.%f') - self.assertEqual(xp, rs) + assert xp == rs @slow def test_secondary_upsample(self): @@ -1048,11 +1043,11 @@ def test_secondary_upsample(self): low.plot() ax = high.plot(secondary_y=True) for l in ax.get_lines(): - self.assertEqual(PeriodIndex(l.get_xdata()).freq, 'D') - self.assertTrue(hasattr(ax, 'left_ax')) - self.assertFalse(hasattr(ax, 'right_ax')) + assert PeriodIndex(l.get_xdata()).freq == 'D' + assert hasattr(ax, 'left_ax') + assert not hasattr(ax, 'right_ax') for l in ax.left_ax.get_lines(): - self.assertEqual(PeriodIndex(l.get_xdata()).freq, 'D') + assert PeriodIndex(l.get_xdata()).freq == 'D' @slow def test_secondary_legend(self): @@ -1065,54 +1060,54 @@ def test_secondary_legend(self): df = tm.makeTimeDataFrame() ax = df.plot(secondary_y=['A', 'B']) leg = ax.get_legend() - self.assertEqual(len(leg.get_lines()), 4) - self.assertEqual(leg.get_texts()[0].get_text(), 'A (right)') - self.assertEqual(leg.get_texts()[1].get_text(), 'B (right)') - self.assertEqual(leg.get_texts()[2].get_text(), 'C') - self.assertEqual(leg.get_texts()[3].get_text(), 'D') - self.assertIsNone(ax.right_ax.get_legend()) + assert len(leg.get_lines()) == 4 + assert leg.get_texts()[0].get_text() == 'A (right)' + assert leg.get_texts()[1].get_text() == 'B (right)' + assert leg.get_texts()[2].get_text() == 'C' + assert leg.get_texts()[3].get_text() == 'D' + assert ax.right_ax.get_legend() is None colors = set() for line in leg.get_lines(): colors.add(line.get_color()) # TODO: color cycle problems - self.assertEqual(len(colors), 4) + assert len(colors) == 4 plt.clf() ax = fig.add_subplot(211) ax = df.plot(secondary_y=['A', 'C'], mark_right=False) leg = ax.get_legend() - self.assertEqual(len(leg.get_lines()), 4) - self.assertEqual(leg.get_texts()[0].get_text(), 'A') - self.assertEqual(leg.get_texts()[1].get_text(), 'B') - self.assertEqual(leg.get_texts()[2].get_text(), 'C') - self.assertEqual(leg.get_texts()[3].get_text(), 'D') + assert len(leg.get_lines()) == 4 + assert leg.get_texts()[0].get_text() == 'A' + assert leg.get_texts()[1].get_text() == 'B' + assert leg.get_texts()[2].get_text() == 'C' + assert leg.get_texts()[3].get_text() == 'D' plt.clf() ax = df.plot(kind='bar', secondary_y=['A']) leg = ax.get_legend() - self.assertEqual(leg.get_texts()[0].get_text(), 'A (right)') - self.assertEqual(leg.get_texts()[1].get_text(), 'B') + assert leg.get_texts()[0].get_text() == 'A (right)' + assert leg.get_texts()[1].get_text() == 'B' plt.clf() ax = df.plot(kind='bar', secondary_y=['A'], mark_right=False) leg = ax.get_legend() - 
self.assertEqual(leg.get_texts()[0].get_text(), 'A') - self.assertEqual(leg.get_texts()[1].get_text(), 'B') + assert leg.get_texts()[0].get_text() == 'A' + assert leg.get_texts()[1].get_text() == 'B' plt.clf() ax = fig.add_subplot(211) df = tm.makeTimeDataFrame() ax = df.plot(secondary_y=['C', 'D']) leg = ax.get_legend() - self.assertEqual(len(leg.get_lines()), 4) - self.assertIsNone(ax.right_ax.get_legend()) + assert len(leg.get_lines()) == 4 + assert ax.right_ax.get_legend() is None colors = set() for line in leg.get_lines(): colors.add(line.get_color()) # TODO: color cycle problems - self.assertEqual(len(colors), 4) + assert len(colors) == 4 # non-ts df = tm.makeDataFrame() @@ -1120,27 +1115,27 @@ def test_secondary_legend(self): ax = fig.add_subplot(211) ax = df.plot(secondary_y=['A', 'B']) leg = ax.get_legend() - self.assertEqual(len(leg.get_lines()), 4) - self.assertIsNone(ax.right_ax.get_legend()) + assert len(leg.get_lines()) == 4 + assert ax.right_ax.get_legend() is None colors = set() for line in leg.get_lines(): colors.add(line.get_color()) # TODO: color cycle problems - self.assertEqual(len(colors), 4) + assert len(colors) == 4 plt.clf() ax = fig.add_subplot(211) ax = df.plot(secondary_y=['C', 'D']) leg = ax.get_legend() - self.assertEqual(len(leg.get_lines()), 4) - self.assertIsNone(ax.right_ax.get_legend()) + assert len(leg.get_lines()) == 4 + assert ax.right_ax.get_legend() is None colors = set() for line in leg.get_lines(): colors.add(line.get_color()) # TODO: color cycle problems - self.assertEqual(len(colors), 4) + assert len(colors) == 4 def test_format_date_axis(self): rng = date_range('1/1/2012', periods=12, freq='M') @@ -1149,7 +1144,7 @@ def test_format_date_axis(self): xaxis = ax.get_xaxis() for l in xaxis.get_ticklabels(): if len(l.get_text()) > 0: - self.assertEqual(l.get_rotation(), 30) + assert l.get_rotation() == 30 @slow def test_ax_plot(self): @@ -1197,8 +1192,8 @@ def test_irregular_ts_shared_ax_xlim(self): # check that axis limits are correct left, right = ax.get_xlim() - self.assertEqual(left, ts_irregular.index.min().toordinal()) - self.assertEqual(right, ts_irregular.index.max().toordinal()) + assert left == ts_irregular.index.min().toordinal() + assert right == ts_irregular.index.max().toordinal() @slow def test_secondary_y_non_ts_xlim(self): @@ -1213,8 +1208,8 @@ def test_secondary_y_non_ts_xlim(self): s2.plot(secondary_y=True, ax=ax) left_after, right_after = ax.get_xlim() - self.assertEqual(left_before, left_after) - self.assertTrue(right_before < right_after) + assert left_before == left_after + assert right_before < right_after @slow def test_secondary_y_regular_ts_xlim(self): @@ -1229,8 +1224,8 @@ def test_secondary_y_regular_ts_xlim(self): s2.plot(secondary_y=True, ax=ax) left_after, right_after = ax.get_xlim() - self.assertEqual(left_before, left_after) - self.assertTrue(right_before < right_after) + assert left_before == left_after + assert right_before < right_after @slow def test_secondary_y_mixed_freq_ts_xlim(self): @@ -1244,8 +1239,8 @@ def test_secondary_y_mixed_freq_ts_xlim(self): left_after, right_after = ax.get_xlim() # a downsample should not have changed either limit - self.assertEqual(left_before, left_after) - self.assertEqual(right_before, right_after) + assert left_before == left_after + assert right_before == right_after @slow def test_secondary_y_irregular_ts_xlim(self): @@ -1260,8 +1255,8 @@ def test_secondary_y_irregular_ts_xlim(self): ts_irregular[:5].plot(ax=ax) left, right = ax.get_xlim() - self.assertEqual(left, 
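Editor's note: the `test_secondary_legend` assertions above encode the `mark_right` behaviour: secondary columns get an ` (right)` suffix in the single merged legend, and the hidden right axes carries no legend of its own. A sketch:

```python
import matplotlib
matplotlib.use('Agg')                 # assumption: headless test run
import pandas.util.testing as tm

df = tm.makeTimeDataFrame()           # columns A..D
ax = df.plot(secondary_y=['A', 'B'])
texts = [t.get_text() for t in ax.get_legend().get_texts()]
assert texts == ['A (right)', 'B (right)', 'C', 'D']   # mark_right defaults on
assert ax.right_ax.get_legend() is None                # one merged legend only
```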
ts_irregular.index.min().toordinal()) - self.assertEqual(right, ts_irregular.index.max().toordinal()) + assert left == ts_irregular.index.min().toordinal() + assert right == ts_irregular.index.max().toordinal() def test_plot_outofbounds_datetime(self): # 2579 - checking this does not raise @@ -1271,6 +1266,75 @@ def test_plot_outofbounds_datetime(self): values = [datetime(1677, 1, 1, 12), datetime(1677, 1, 2, 12)] self.plt.plot(values) + def test_format_timedelta_ticks_narrow(self): + if is_platform_mac(): + pytest.skip("skip on mac for precision display issue on older mpl") + + expected_labels = [ + '00:00:00.00000000{:d}'.format(i) + for i in range(10)] + + rng = timedelta_range('0', periods=10, freq='ns') + df = DataFrame(np.random.randn(len(rng), 3), rng) + ax = df.plot(fontsize=2) + fig = ax.get_figure() + fig.canvas.draw() + labels = ax.get_xticklabels() + assert len(labels) == len(expected_labels) + for l, l_expected in zip(labels, expected_labels): + assert l.get_text() == l_expected + + def test_format_timedelta_ticks_wide(self): + if is_platform_mac(): + pytest.skip("skip on mac for precision display issue on older mpl") + + expected_labels = [ + '00:00:00', + '1 days 03:46:40', + '2 days 07:33:20', + '3 days 11:20:00', + '4 days 15:06:40', + '5 days 18:53:20', + '6 days 22:40:00', + '8 days 02:26:40', + '' + ] + + rng = timedelta_range('0', periods=10, freq='1 d') + df = DataFrame(np.random.randn(len(rng), 3), rng) + ax = df.plot(fontsize=2) + fig = ax.get_figure() + fig.canvas.draw() + labels = ax.get_xticklabels() + assert len(labels) == len(expected_labels) + for l, l_expected in zip(labels, expected_labels): + assert l.get_text() == l_expected + + def test_timedelta_plot(self): + # test issue #8711 + s = Series(range(5), timedelta_range('1day', periods=5)) + _check_plot_works(s.plot) + + # test long period + index = timedelta_range('1 day 2 hr 30 min 10 s', + periods=10, freq='1 d') + s = Series(np.random.randn(len(index)), index) + _check_plot_works(s.plot) + + # test short period + index = timedelta_range('1 day 2 hr 30 min 10 s', + periods=10, freq='1 ns') + s = Series(np.random.randn(len(index)), index) + _check_plot_works(s.plot) + + def test_hist(self): + # https://github.com/matplotlib/matplotlib/issues/8459 + rng = date_range('1/1/2011', periods=10, freq='H') + x = rng + w1 = np.arange(0, 1, .1) + w2 = np.arange(0, 1, .1)[::-1] + self.plt.hist([x, x], weights=[w1, w2]) + def _check_plot_works(f, freq=None, series=None, *args, **kwargs): import matplotlib.pyplot as plt @@ -1309,8 +1373,3 @@ def _check_plot_works(f, freq=None, series=None, *args, **kwargs): plt.savefig(path) finally: plt.close(fig) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/plotting/test_deprecated.py b/pandas/tests/plotting/test_deprecated.py new file mode 100644 index 0000000000000..48030df48deca --- /dev/null +++ b/pandas/tests/plotting/test_deprecated.py @@ -0,0 +1,59 @@ +# coding: utf-8 + +import string + +import pandas as pd +import pandas.util.testing as tm +from pandas.util.testing import slow + +from numpy.random import randn + +import pandas.tools.plotting as plotting + +from pandas.tests.plotting.common import TestPlotBase + + +""" +Test cases for plot functions imported from deprecated +pandas.tools.plotting +""" + +tm._skip_module_if_no_mpl() + + +class TestDeprecatedNameSpace(TestPlotBase): + + @slow + def test_scatter_plot_legacy(self): + tm._skip_if_no_scipy() + + df = 
pd.DataFrame(randn(100, 2)) + + with tm.assert_produces_warning(FutureWarning): + plotting.scatter_matrix(df) + + with tm.assert_produces_warning(FutureWarning): + pd.scatter_matrix(df) + + @slow + def test_boxplot_deprecated(self): + df = pd.DataFrame(randn(6, 4), + index=list(string.ascii_letters[:6]), + columns=['one', 'two', 'three', 'four']) + df['indic'] = ['foo', 'bar'] * 3 + + with tm.assert_produces_warning(FutureWarning): + plotting.boxplot(df, column=['one', 'two'], + by='indic') + + @slow + def test_radviz_deprecated(self): + df = self.iris + with tm.assert_produces_warning(FutureWarning): + plotting.radviz(frame=df, class_column='Name') + + @slow + def test_plot_params(self): + + with tm.assert_produces_warning(FutureWarning): + pd.plot_params['xaxis.compat'] = True diff --git a/pandas/tests/plotting/test_frame.py b/pandas/tests/plotting/test_frame.py index 87cf89ebf0a9d..4a4a71d7ea639 100644 --- a/pandas/tests/plotting/test_frame.py +++ b/pandas/tests/plotting/test_frame.py @@ -1,7 +1,8 @@ -#!/usr/bin/env python # coding: utf-8 -import nose +""" Test cases for DataFrame.plot """ + +import pytest import string import warnings @@ -10,9 +11,9 @@ import pandas as pd from pandas import (Series, DataFrame, MultiIndex, PeriodIndex, date_range, bdate_range) -from pandas.types.api import is_list_like -from pandas.compat import (range, lrange, StringIO, lmap, lzip, u, zip, PY3) -from pandas.formats.printing import pprint_thing +from pandas.core.dtypes.api import is_list_like +from pandas.compat import range, lrange, lmap, lzip, u, zip, PY3 +from pandas.io.formats.printing import pprint_thing import pandas.util.testing as tm from pandas.util.testing import slow @@ -21,20 +22,18 @@ import numpy as np from numpy.random import rand, randn -import pandas.tools.plotting as plotting +import pandas.plotting as plotting from pandas.tests.plotting.common import (TestPlotBase, _check_plot_works, _skip_if_no_scipy_gaussian_kde, _ok_for_gaussian_kde) +tm._skip_module_if_no_mpl() -""" Test cases for DataFrame.plot """ - -@tm.mplskip class TestDataFramePlots(TestPlotBase): - def setUp(self): - TestPlotBase.setUp(self) + def setup_method(self, method): + TestPlotBase.setup_method(self, method) import matplotlib as mpl mpl.rcdefaults() @@ -66,7 +65,7 @@ def test_plot(self): df = DataFrame({'x': [1, 2], 'y': [3, 4]}) # mpl >= 1.5.2 (or slightly below) throw AttributError - with tm.assertRaises((TypeError, AttributeError)): + with pytest.raises((TypeError, AttributeError)): df.plot.line(blarg=True) df = DataFrame(np.random.rand(10, 3), @@ -136,12 +135,28 @@ def test_plot(self): # passed ax should be used: fig, ax = self.plt.subplots() axes = df.plot.bar(subplots=True, ax=ax) - self.assertEqual(len(axes), 1) + assert len(axes) == 1 if self.mpl_ge_1_5_0: result = ax.axes else: result = ax.get_axes() # deprecated - self.assertIs(result, axes[0]) + assert result is axes[0] + + # GH 15516 + def test_mpl2_color_cycle_str(self): + # test CN mpl 2.0 color cycle + if self.mpl_ge_2_0_0: + colors = ['C' + str(x) for x in range(10)] + df = DataFrame(randn(10, 3), columns=['a', 'b', 'c']) + for c in colors: + _check_plot_works(df.plot, color=c) + else: + pytest.skip("not supported in matplotlib < 2.0.0") + + def test_color_empty_string(self): + df = DataFrame(randn(10, 2)) + with pytest.raises(ValueError): + df.plot(color='') def test_color_and_style_arguments(self): df = DataFrame({'x': [1, 2], 'y': [3, 4]}) @@ -150,19 +165,19 @@ def test_color_and_style_arguments(self): ax = df.plot(color=['red', 'black'], 
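Editor's note: the new `pandas/tests/plotting/test_deprecated.py` file above checks that the old `pandas.tools.plotting` entry points still work but warn. A sketch of the pattern it exercises:

```python
import matplotlib
matplotlib.use('Agg')                       # assumption: headless test run
import numpy as np
import pandas as pd
import pandas.util.testing as tm
import pandas.tools.plotting as plotting    # old location, now a deprecation shim

df = pd.DataFrame(np.random.randn(100, 2))
with tm.assert_produces_warning(FutureWarning):
    plotting.scatter_matrix(df)             # still works in this release, but warns
with tm.assert_produces_warning(FutureWarning):
    pd.scatter_matrix(df)                   # same for the top-level alias
```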
style=['-', '--']) # check that the linestyles are correctly set: linestyle = [line.get_linestyle() for line in ax.lines] - self.assertEqual(linestyle, ['-', '--']) + assert linestyle == ['-', '--'] # check that the colors are correctly set: color = [line.get_color() for line in ax.lines] - self.assertEqual(color, ['red', 'black']) + assert color == ['red', 'black'] # passing both 'color' and 'style' arguments should not be allowed # if there is a color symbol in the style strings: - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.plot(color=['red', 'black'], style=['k-', 'r--']) def test_nonnumeric_exclude(self): df = DataFrame({'A': ["x", "y", "z"], 'B': [1, 2, 3]}) ax = df.plot() - self.assertEqual(len(ax.get_lines()), 1) # B was plotted + assert len(ax.get_lines()) == 1 # B was plotted @slow def test_implicit_label(self): @@ -176,7 +191,7 @@ def test_donot_overwrite_index_name(self): df = DataFrame(randn(2, 2), columns=['a', 'b']) df.index.name = 'NAME' df.plot(y='b', label='LABEL') - self.assertEqual(df.index.name, 'NAME') + assert df.index.name == 'NAME' @slow def test_plot_xy(self): @@ -223,33 +238,34 @@ def test_xcompat(self): df = self.tdf ax = df.plot(x_compat=True) lines = ax.get_lines() - self.assertNotIsInstance(lines[0].get_xdata(), PeriodIndex) + assert not isinstance(lines[0].get_xdata(), PeriodIndex) tm.close() - pd.plot_params['xaxis.compat'] = True + pd.plotting.plot_params['xaxis.compat'] = True ax = df.plot() lines = ax.get_lines() - self.assertNotIsInstance(lines[0].get_xdata(), PeriodIndex) + assert not isinstance(lines[0].get_xdata(), PeriodIndex) tm.close() - pd.plot_params['x_compat'] = False + pd.plotting.plot_params['x_compat'] = False + ax = df.plot() lines = ax.get_lines() - self.assertNotIsInstance(lines[0].get_xdata(), PeriodIndex) - self.assertIsInstance(PeriodIndex(lines[0].get_xdata()), PeriodIndex) + assert not isinstance(lines[0].get_xdata(), PeriodIndex) + assert isinstance(PeriodIndex(lines[0].get_xdata()), PeriodIndex) tm.close() # useful if you're plotting a bunch together - with pd.plot_params.use('x_compat', True): + with pd.plotting.plot_params.use('x_compat', True): ax = df.plot() lines = ax.get_lines() - self.assertNotIsInstance(lines[0].get_xdata(), PeriodIndex) + assert not isinstance(lines[0].get_xdata(), PeriodIndex) tm.close() ax = df.plot() lines = ax.get_lines() - self.assertNotIsInstance(lines[0].get_xdata(), PeriodIndex) - self.assertIsInstance(PeriodIndex(lines[0].get_xdata()), PeriodIndex) + assert not isinstance(lines[0].get_xdata(), PeriodIndex) + assert isinstance(PeriodIndex(lines[0].get_xdata()), PeriodIndex) def test_period_compat(self): # GH 9012 @@ -288,7 +304,7 @@ def test_subplots(self): for kind in ['bar', 'barh', 'line', 'area']: axes = df.plot(kind=kind, subplots=True, sharex=True, legend=True) self._check_axes_shape(axes, axes_num=3, layout=(3, 1)) - self.assertEqual(axes.shape, (3, )) + assert axes.shape == (3, ) for ax, column in zip(axes, df.columns): self._check_legend_labels(ax, @@ -318,7 +334,7 @@ def test_subplots(self): axes = df.plot(kind=kind, subplots=True, legend=False) for ax in axes: - self.assertTrue(ax.get_legend() is None) + assert ax.get_legend() is None @slow def test_subplots_timeseries(self): @@ -364,43 +380,43 @@ def test_subplots_layout(self): axes = df.plot(subplots=True, layout=(2, 2)) self._check_axes_shape(axes, axes_num=3, layout=(2, 2)) - self.assertEqual(axes.shape, (2, 2)) + assert axes.shape == (2, 2) axes = df.plot(subplots=True, layout=(-1, 2)) 
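Editor's note: `test_xcompat` above tracks the relocation of `plot_params` from the top-level `pd` namespace to `pd.plotting`. A sketch of the scoped option it tests; under `x_compat` the x-data stays as plain dates rather than a `PeriodIndex`:

```python
import matplotlib
matplotlib.use('Agg')                 # assumption: headless test run
import pandas as pd
import pandas.util.testing as tm
from pandas import PeriodIndex

df = tm.makeTimeDataFrame()
# the context manager applies the option only inside the block
with pd.plotting.plot_params.use('x_compat', True):
    ax = df.plot()
    assert not isinstance(ax.get_lines()[0].get_xdata(), PeriodIndex)
```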
self._check_axes_shape(axes, axes_num=3, layout=(2, 2)) - self.assertEqual(axes.shape, (2, 2)) + assert axes.shape == (2, 2) axes = df.plot(subplots=True, layout=(2, -1)) self._check_axes_shape(axes, axes_num=3, layout=(2, 2)) - self.assertEqual(axes.shape, (2, 2)) + assert axes.shape == (2, 2) axes = df.plot(subplots=True, layout=(1, 4)) self._check_axes_shape(axes, axes_num=3, layout=(1, 4)) - self.assertEqual(axes.shape, (1, 4)) + assert axes.shape == (1, 4) axes = df.plot(subplots=True, layout=(-1, 4)) self._check_axes_shape(axes, axes_num=3, layout=(1, 4)) - self.assertEqual(axes.shape, (1, 4)) + assert axes.shape == (1, 4) axes = df.plot(subplots=True, layout=(4, -1)) self._check_axes_shape(axes, axes_num=3, layout=(4, 1)) - self.assertEqual(axes.shape, (4, 1)) + assert axes.shape == (4, 1) - with tm.assertRaises(ValueError): - axes = df.plot(subplots=True, layout=(1, 1)) - with tm.assertRaises(ValueError): - axes = df.plot(subplots=True, layout=(-1, -1)) + with pytest.raises(ValueError): + df.plot(subplots=True, layout=(1, 1)) + with pytest.raises(ValueError): + df.plot(subplots=True, layout=(-1, -1)) # single column df = DataFrame(np.random.rand(10, 1), index=list(string.ascii_letters[:10])) axes = df.plot(subplots=True) self._check_axes_shape(axes, axes_num=1, layout=(1, 1)) - self.assertEqual(axes.shape, (1, )) + assert axes.shape == (1, ) axes = df.plot(subplots=True, layout=(3, 3)) self._check_axes_shape(axes, axes_num=1, layout=(3, 3)) - self.assertEqual(axes.shape, (3, 3)) + assert axes.shape == (3, 3) @slow def test_subplots_warnings(self): @@ -427,18 +443,18 @@ def test_subplots_multiple_axes(self): returned = df.plot(subplots=True, ax=axes[0], sharex=False, sharey=False) self._check_axes_shape(returned, axes_num=3, layout=(1, 3)) - self.assertEqual(returned.shape, (3, )) - self.assertIs(returned[0].figure, fig) + assert returned.shape == (3, ) + assert returned[0].figure is fig # draw on second row returned = df.plot(subplots=True, ax=axes[1], sharex=False, sharey=False) self._check_axes_shape(returned, axes_num=3, layout=(1, 3)) - self.assertEqual(returned.shape, (3, )) - self.assertIs(returned[0].figure, fig) + assert returned.shape == (3, ) + assert returned[0].figure is fig self._check_axes_shape(axes, axes_num=6, layout=(2, 3)) tm.close() - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): fig, axes = self.plt.subplots(2, 3) # pass different number of axes from required df.plot(subplots=True, ax=axes) @@ -456,17 +472,17 @@ def test_subplots_multiple_axes(self): returned = df.plot(subplots=True, ax=axes, layout=(2, 1), sharex=False, sharey=False) self._check_axes_shape(returned, axes_num=4, layout=(2, 2)) - self.assertEqual(returned.shape, (4, )) + assert returned.shape == (4, ) returned = df.plot(subplots=True, ax=axes, layout=(2, -1), sharex=False, sharey=False) self._check_axes_shape(returned, axes_num=4, layout=(2, 2)) - self.assertEqual(returned.shape, (4, )) + assert returned.shape == (4, ) returned = df.plot(subplots=True, ax=axes, layout=(-1, 2), sharex=False, sharey=False) self._check_axes_shape(returned, axes_num=4, layout=(2, 2)) - self.assertEqual(returned.shape, (4, )) + assert returned.shape == (4, ) # single column fig, axes = self.plt.subplots(1, 1) @@ -475,7 +491,7 @@ def test_subplots_multiple_axes(self): axes = df.plot(subplots=True, ax=[axes], sharex=False, sharey=False) self._check_axes_shape(axes, axes_num=1, layout=(1, 1)) - self.assertEqual(axes.shape, (1, )) + assert axes.shape == (1, ) def test_subplots_ts_share_axes(self): # 
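Editor's note: `test_subplots_layout` above covers both the `-1` "infer this dimension" convention and the `ValueError` cases that `pytest.raises` now guards. A condensed sketch:

```python
import matplotlib
matplotlib.use('Agg')                 # assumption: headless test run
import numpy as np
import pytest
from pandas import DataFrame

df = DataFrame(np.random.rand(10, 3))
axes = df.plot(subplots=True, layout=(-1, 2))   # -1: infer the row count
assert axes.shape == (2, 2)                     # 3 plots laid out on a 2x2 grid
with pytest.raises(ValueError):
    df.plot(subplots=True, layout=(1, 1))       # too small for 3 columns
with pytest.raises(ValueError):
    df.plot(subplots=True, layout=(-1, -1))     # at most one dim may be inferred
```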
GH 3964 @@ -525,29 +541,29 @@ def test_subplots_dup_columns(self): axes = df.plot(subplots=True) for ax in axes: self._check_legend_labels(ax, labels=['a']) - self.assertEqual(len(ax.lines), 1) + assert len(ax.lines) == 1 tm.close() axes = df.plot(subplots=True, secondary_y='a') for ax in axes: # (right) is only attached when subplots=False self._check_legend_labels(ax, labels=['a']) - self.assertEqual(len(ax.lines), 1) + assert len(ax.lines) == 1 tm.close() ax = df.plot(secondary_y='a') self._check_legend_labels(ax, labels=['a (right)'] * 5) - self.assertEqual(len(ax.lines), 0) - self.assertEqual(len(ax.right_ax.lines), 5) + assert len(ax.lines) == 0 + assert len(ax.right_ax.lines) == 5 def test_negative_log(self): df = - DataFrame(rand(6, 4), index=list(string.ascii_letters[:6]), columns=['x', 'y', 'z', 'four']) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.plot.area(logy=True) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.plot.area(loglog=True) def _compare_stacked_y_cood(self, normal_lines, stacked_lines): @@ -555,7 +571,7 @@ def _compare_stacked_y_cood(self, normal_lines, stacked_lines): for nl, sl in zip(normal_lines, stacked_lines): base += nl.get_data()[1] # get y coodinates sy = sl.get_data()[1] - self.assert_numpy_array_equal(base, sy) + tm.assert_numpy_array_equal(base, sy) def test_line_area_stacked(self): with tm.RNGContext(42): @@ -586,7 +602,7 @@ def test_line_area_stacked(self): self._compare_stacked_y_cood(ax1.lines[2:], ax2.lines[2:]) _check_plot_works(mixed_df.plot, stacked=False) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): mixed_df.plot(stacked=True) _check_plot_works(df.plot, kind=kind, logx=True, stacked=True) @@ -605,55 +621,55 @@ def test_line_area_nan_df(self): # remove nan for comparison purpose exp = np.array([1, 2, 3], dtype=np.float64) - self.assert_numpy_array_equal(np.delete(masked1.data, 2), exp) + tm.assert_numpy_array_equal(np.delete(masked1.data, 2), exp) exp = np.array([3, 2, 1], dtype=np.float64) - self.assert_numpy_array_equal(np.delete(masked2.data, 1), exp) - self.assert_numpy_array_equal( + tm.assert_numpy_array_equal(np.delete(masked2.data, 1), exp) + tm.assert_numpy_array_equal( masked1.mask, np.array([False, False, True, False])) - self.assert_numpy_array_equal( + tm.assert_numpy_array_equal( masked2.mask, np.array([False, True, False, False])) expected1 = np.array([1, 2, 0, 3], dtype=np.float64) expected2 = np.array([3, 0, 2, 1], dtype=np.float64) ax = _check_plot_works(d.plot, stacked=True) - self.assert_numpy_array_equal(ax.lines[0].get_ydata(), expected1) - self.assert_numpy_array_equal(ax.lines[1].get_ydata(), - expected1 + expected2) + tm.assert_numpy_array_equal(ax.lines[0].get_ydata(), expected1) + tm.assert_numpy_array_equal(ax.lines[1].get_ydata(), + expected1 + expected2) ax = _check_plot_works(d.plot.area) - self.assert_numpy_array_equal(ax.lines[0].get_ydata(), expected1) - self.assert_numpy_array_equal(ax.lines[1].get_ydata(), - expected1 + expected2) + tm.assert_numpy_array_equal(ax.lines[0].get_ydata(), expected1) + tm.assert_numpy_array_equal(ax.lines[1].get_ydata(), + expected1 + expected2) ax = _check_plot_works(d.plot.area, stacked=False) - self.assert_numpy_array_equal(ax.lines[0].get_ydata(), expected1) - self.assert_numpy_array_equal(ax.lines[1].get_ydata(), expected2) + tm.assert_numpy_array_equal(ax.lines[0].get_ydata(), expected1) + tm.assert_numpy_array_equal(ax.lines[1].get_ydata(), expected2) def test_line_lim(self): df = DataFrame(rand(6, 
3), columns=['x', 'y', 'z']) ax = df.plot() xmin, xmax = ax.get_xlim() lines = ax.get_lines() - self.assertEqual(xmin, lines[0].get_data()[0][0]) - self.assertEqual(xmax, lines[0].get_data()[0][-1]) + assert xmin == lines[0].get_data()[0][0] + assert xmax == lines[0].get_data()[0][-1] ax = df.plot(secondary_y=True) xmin, xmax = ax.get_xlim() lines = ax.get_lines() - self.assertEqual(xmin, lines[0].get_data()[0][0]) - self.assertEqual(xmax, lines[0].get_data()[0][-1]) + assert xmin == lines[0].get_data()[0][0] + assert xmax == lines[0].get_data()[0][-1] axes = df.plot(secondary_y=True, subplots=True) self._check_axes_shape(axes, axes_num=3, layout=(3, 1)) for ax in axes: - self.assertTrue(hasattr(ax, 'left_ax')) - self.assertFalse(hasattr(ax, 'right_ax')) + assert hasattr(ax, 'left_ax') + assert not hasattr(ax, 'right_ax') xmin, xmax = ax.get_xlim() lines = ax.get_lines() - self.assertEqual(xmin, lines[0].get_data()[0][0]) - self.assertEqual(xmax, lines[0].get_data()[0][-1]) + assert xmin == lines[0].get_data()[0][0] + assert xmax == lines[0].get_data()[0][-1] def test_area_lim(self): df = DataFrame(rand(6, 4), columns=['x', 'y', 'z', 'four']) @@ -664,13 +680,13 @@ def test_area_lim(self): xmin, xmax = ax.get_xlim() ymin, ymax = ax.get_ylim() lines = ax.get_lines() - self.assertEqual(xmin, lines[0].get_data()[0][0]) - self.assertEqual(xmax, lines[0].get_data()[0][-1]) - self.assertEqual(ymin, 0) + assert xmin == lines[0].get_data()[0][0] + assert xmax == lines[0].get_data()[0][-1] + assert ymin == 0 ax = _check_plot_works(neg_df.plot.area, stacked=stacked) ymin, ymax = ax.get_ylim() - self.assertEqual(ymax, 0) + assert ymax == 0 @slow def test_bar_colors(self): @@ -700,7 +716,7 @@ def test_bar_colors(self): self._check_colors(ax.patches[::5], facecolors=rgba_colors) tm.close() - ax = df.ix[:, [0]].plot.bar(color='DodgerBlue') + ax = df.loc[:, [0]].plot.bar(color='DodgerBlue') self._check_colors([ax.patches[0]], facecolors=['DodgerBlue']) tm.close() @@ -715,19 +731,19 @@ def test_bar_linewidth(self): # regular ax = df.plot.bar(linewidth=2) for r in ax.patches: - self.assertEqual(r.get_linewidth(), 2) + assert r.get_linewidth() == 2 # stacked ax = df.plot.bar(stacked=True, linewidth=2) for r in ax.patches: - self.assertEqual(r.get_linewidth(), 2) + assert r.get_linewidth() == 2 # subplots axes = df.plot.bar(linewidth=2, subplots=True) self._check_axes_shape(axes, axes_num=5, layout=(5, 1)) for ax in axes: for r in ax.patches: - self.assertEqual(r.get_linewidth(), 2) + assert r.get_linewidth() == 2 @slow def test_bar_barwidth(self): @@ -738,34 +754,34 @@ def test_bar_barwidth(self): # regular ax = df.plot.bar(width=width) for r in ax.patches: - self.assertEqual(r.get_width(), width / len(df.columns)) + assert r.get_width() == width / len(df.columns) # stacked ax = df.plot.bar(stacked=True, width=width) for r in ax.patches: - self.assertEqual(r.get_width(), width) + assert r.get_width() == width # horizontal regular ax = df.plot.barh(width=width) for r in ax.patches: - self.assertEqual(r.get_height(), width / len(df.columns)) + assert r.get_height() == width / len(df.columns) # horizontal stacked ax = df.plot.barh(stacked=True, width=width) for r in ax.patches: - self.assertEqual(r.get_height(), width) + assert r.get_height() == width # subplots axes = df.plot.bar(width=width, subplots=True) for ax in axes: for r in ax.patches: - self.assertEqual(r.get_width(), width) + assert r.get_width() == width # horizontal subplots axes = df.plot.barh(width=width, subplots=True) for ax in axes: for r in 
ax.patches: - self.assertEqual(r.get_height(), width) + assert r.get_height() == width @slow def test_bar_barwidth_position(self): @@ -792,10 +808,10 @@ def test_bar_barwidth_position_int(self): ax = df.plot.bar(stacked=True, width=w) ticks = ax.xaxis.get_ticklocs() tm.assert_numpy_array_equal(ticks, np.array([0, 1, 2, 3, 4])) - self.assertEqual(ax.get_xlim(), (-0.75, 4.75)) + assert ax.get_xlim() == (-0.75, 4.75) # check left-edge of bars - self.assertEqual(ax.patches[0].get_x(), -0.5) - self.assertEqual(ax.patches[-1].get_x(), 3.5) + assert ax.patches[0].get_x() == -0.5 + assert ax.patches[-1].get_x() == 3.5 self._check_bar_alignment(df, kind='bar', stacked=True, width=1) self._check_bar_alignment(df, kind='barh', stacked=False, width=1) @@ -808,29 +824,29 @@ def test_bar_bottom_left(self): df = DataFrame(rand(5, 5)) ax = df.plot.bar(stacked=False, bottom=1) result = [p.get_y() for p in ax.patches] - self.assertEqual(result, [1] * 25) + assert result == [1] * 25 ax = df.plot.bar(stacked=True, bottom=[-1, -2, -3, -4, -5]) result = [p.get_y() for p in ax.patches[:5]] - self.assertEqual(result, [-1, -2, -3, -4, -5]) + assert result == [-1, -2, -3, -4, -5] ax = df.plot.barh(stacked=False, left=np.array([1, 1, 1, 1, 1])) result = [p.get_x() for p in ax.patches] - self.assertEqual(result, [1] * 25) + assert result == [1] * 25 ax = df.plot.barh(stacked=True, left=[1, 2, 3, 4, 5]) result = [p.get_x() for p in ax.patches[:5]] - self.assertEqual(result, [1, 2, 3, 4, 5]) + assert result == [1, 2, 3, 4, 5] axes = df.plot.bar(subplots=True, bottom=-1) for ax in axes: result = [p.get_y() for p in ax.patches] - self.assertEqual(result, [-1] * 5) + assert result == [-1] * 5 axes = df.plot.barh(subplots=True, left=np.array([1, 1, 1, 1, 1])) for ax in axes: result = [p.get_x() for p in ax.patches] - self.assertEqual(result, [1] * 5) + assert result == [1] * 5 @slow def test_bar_nan(self): @@ -840,15 +856,15 @@ def test_bar_nan(self): ax = df.plot.bar() expected = [10, 0, 20, 5, 10, 20, 1, 2, 3] result = [p.get_height() for p in ax.patches] - self.assertEqual(result, expected) + assert result == expected ax = df.plot.bar(stacked=True) result = [p.get_height() for p in ax.patches] - self.assertEqual(result, expected) + assert result == expected result = [p.get_y() for p in ax.patches] expected = [0.0, 0.0, 0.0, 10.0, 0.0, 20.0, 15.0, 10.0, 40.0] - self.assertEqual(result, expected) + assert result == expected @slow def test_bar_categorical(self): @@ -865,16 +881,16 @@ def test_bar_categorical(self): ax = df.plot.bar() ticks = ax.xaxis.get_ticklocs() tm.assert_numpy_array_equal(ticks, np.array([0, 1, 2, 3, 4, 5])) - self.assertEqual(ax.get_xlim(), (-0.5, 5.5)) + assert ax.get_xlim() == (-0.5, 5.5) # check left-edge of bars - self.assertEqual(ax.patches[0].get_x(), -0.25) - self.assertEqual(ax.patches[-1].get_x(), 5.15) + assert ax.patches[0].get_x() == -0.25 + assert ax.patches[-1].get_x() == 5.15 ax = df.plot.bar(stacked=True) tm.assert_numpy_array_equal(ticks, np.array([0, 1, 2, 3, 4, 5])) - self.assertEqual(ax.get_xlim(), (-0.5, 5.5)) - self.assertEqual(ax.patches[0].get_x(), -0.25) - self.assertEqual(ax.patches[-1].get_x(), 4.75) + assert ax.get_xlim() == (-0.5, 5.5) + assert ax.patches[0].get_x() == -0.25 + assert ax.patches[-1].get_x() == 4.75 @slow def test_plot_scatter(self): @@ -885,9 +901,9 @@ def test_plot_scatter(self): _check_plot_works(df.plot.scatter, x='x', y='y') _check_plot_works(df.plot.scatter, x=1, y=2) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): 
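Editor's note: `test_bar_barwidth` above distinguishes grouped from stacked geometry: grouped bars split the requested width across columns, stacked bars each keep the full width. A sketch with the test's own `width = 0.9`:

```python
import matplotlib
matplotlib.use('Agg')                 # assumption: headless test run
from numpy.random import rand
from pandas import DataFrame

df = DataFrame(rand(5, 5))
width = 0.9
ax = df.plot.bar(width=width)                   # grouped: columns share the slot
assert all(r.get_width() == width / len(df.columns) for r in ax.patches)
ax = df.plot.bar(stacked=True, width=width)     # stacked: full width per bar
assert all(r.get_width() == width for r in ax.patches)
```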
df.plot.scatter(x='x') - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): df.plot.scatter(y='y') # GH 6951 @@ -904,25 +920,25 @@ def test_plot_scatter_with_c(self): df.plot.scatter(x=0, y=1, c=2)] for ax in axes: # default to Greys - self.assertEqual(ax.collections[0].cmap.name, 'Greys') + assert ax.collections[0].cmap.name == 'Greys' if self.mpl_ge_1_3_1: # n.b. there appears to be no public method to get the colorbar # label - self.assertEqual(ax.collections[0].colorbar._label, 'z') + assert ax.collections[0].colorbar._label == 'z' cm = 'cubehelix' ax = df.plot.scatter(x='x', y='y', c='z', colormap=cm) - self.assertEqual(ax.collections[0].cmap.name, cm) + assert ax.collections[0].cmap.name == cm # verify turning off colorbar works ax = df.plot.scatter(x='x', y='y', c='z', colorbar=False) - self.assertIs(ax.collections[0].colorbar, None) + assert ax.collections[0].colorbar is None # verify that we can still plot a solid color ax = df.plot.scatter(x=0, y=1, c='red') - self.assertIs(ax.collections[0].colorbar, None) + assert ax.collections[0].colorbar is None self._check_colors(ax.collections, facecolors=['r']) # Ensure that we can pass an np.array straight through to matplotlib, @@ -940,8 +956,8 @@ def test_plot_scatter_with_c(self): # identical to the values we supplied, normally we'd be on shaky ground # comparing floats for equality but here we expect them to be # identical. - self.assertTrue(np.array_equal(ax.collections[0].get_facecolor(), - rgba_array)) + tm.assert_numpy_array_equal(ax.collections[0] + .get_facecolor(), rgba_array) # we don't test the colors of the faces in this next plot because they # are dependent on the spring colormap, which may change its colors # later. @@ -950,7 +966,7 @@ def test_plot_scatter_with_c(self): def test_scatter_colors(self): df = DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3], 'c': [1, 2, 3]}) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): df.plot.scatter(x='a', y='b', c='c', color='green') default_colors = self._maybe_unpack_cycler(self.plt.rcParams) @@ -1021,8 +1037,8 @@ def _check_bar_alignment(self, df, kind='bar', stacked=False, # GH 7498 # compare margins between lim and bar edges - self.assertAlmostEqual(ax_min, min_edge - 0.25) - self.assertAlmostEqual(ax_max, max_edge + 0.25) + tm.assert_almost_equal(ax_min, min_edge - 0.25) + tm.assert_almost_equal(ax_max, max_edge + 0.25) p = ax.patches[0] if kind == 'bar' and (stacked is True or subplots is True): @@ -1042,14 +1058,14 @@ def _check_bar_alignment(self, df, kind='bar', stacked=False, raise ValueError # Check the ticks locates on integer - self.assertTrue((axis.get_ticklocs() == np.arange(len(df))).all()) + assert (axis.get_ticklocs() == np.arange(len(df))).all() if align == 'center': # Check whether the bar locates on center - self.assertAlmostEqual(axis.get_ticklocs()[0], center) + tm.assert_almost_equal(axis.get_ticklocs()[0], center) elif align == 'edge': # Check whether the bar's edge starts from the tick - self.assertAlmostEqual(axis.get_ticklocs()[0], edge) + tm.assert_almost_equal(axis.get_ticklocs()[0], edge) else: raise ValueError @@ -1152,7 +1168,7 @@ def test_boxplot(self): self._check_text_labels(ax.get_xticklabels(), labels) tm.assert_numpy_array_equal(ax.xaxis.get_ticklocs(), np.arange(1, len(numeric_cols) + 1)) - self.assertEqual(len(ax.lines), self.bp_n_objects * len(numeric_cols)) + assert len(ax.lines) == self.bp_n_objects * len(numeric_cols) # different warning on py3 if not PY3: @@ -1163,7 +1179,7 @@ def test_boxplot(self): 
self._check_ax_scales(axes, yaxis='log') for ax, label in zip(axes, labels): self._check_text_labels(ax.get_xticklabels(), [label]) - self.assertEqual(len(ax.lines), self.bp_n_objects) + assert len(ax.lines) == self.bp_n_objects axes = series.plot.box(rot=40) self._check_ticks_props(axes, xrot=40, yrot=0) @@ -1177,7 +1193,7 @@ def test_boxplot(self): labels = [pprint_thing(c) for c in numeric_cols] self._check_text_labels(ax.get_xticklabels(), labels) tm.assert_numpy_array_equal(ax.xaxis.get_ticklocs(), positions) - self.assertEqual(len(ax.lines), self.bp_n_objects * len(numeric_cols)) + assert len(ax.lines) == self.bp_n_objects * len(numeric_cols) @slow def test_boxplot_vertical(self): @@ -1189,7 +1205,7 @@ def test_boxplot_vertical(self): ax = df.plot.box(rot=50, fontsize=8, vert=False) self._check_ticks_props(ax, xrot=0, yrot=50, ylabelsize=8) self._check_text_labels(ax.get_yticklabels(), labels) - self.assertEqual(len(ax.lines), self.bp_n_objects * len(numeric_cols)) + assert len(ax.lines) == self.bp_n_objects * len(numeric_cols) # _check_plot_works adds an ax so catch warning. see GH #13188 with tm.assert_produces_warning(UserWarning): @@ -1199,20 +1215,20 @@ def test_boxplot_vertical(self): self._check_ax_scales(axes, xaxis='log') for ax, label in zip(axes, labels): self._check_text_labels(ax.get_yticklabels(), [label]) - self.assertEqual(len(ax.lines), self.bp_n_objects) + assert len(ax.lines) == self.bp_n_objects positions = np.array([3, 2, 8]) ax = df.plot.box(positions=positions, vert=False) self._check_text_labels(ax.get_yticklabels(), labels) tm.assert_numpy_array_equal(ax.yaxis.get_ticklocs(), positions) - self.assertEqual(len(ax.lines), self.bp_n_objects * len(numeric_cols)) + assert len(ax.lines) == self.bp_n_objects * len(numeric_cols) @slow def test_boxplot_return_type(self): df = DataFrame(randn(6, 4), index=list(string.ascii_letters[:6]), columns=['one', 'two', 'three', 'four']) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.plot.box(return_type='NOTATYPE') result = df.plot.box(return_type='dict') @@ -1233,7 +1249,7 @@ def test_boxplot_subplots_return_type(self): # normal style: return_type=None result = df.plot.box(subplots=True) - self.assertIsInstance(result, Series) + assert isinstance(result, Series) self._check_box_return_type(result, None, expected_keys=[ 'height', 'weight', 'category']) @@ -1277,7 +1293,7 @@ def test_kde_missing_vals(self): def test_hist_df(self): from matplotlib.patches import Rectangle if self.mpl_le_1_2_1: - raise nose.SkipTest("not supported in matplotlib <= 1.2.x") + pytest.skip("not supported in matplotlib <= 1.2.x") df = DataFrame(randn(100, 4)) series = df[0] @@ -1299,13 +1315,13 @@ def test_hist_df(self): ax = series.plot.hist(normed=True, cumulative=True, bins=4) # height of last bin (index 5) must be 1.0 rects = [x for x in ax.get_children() if isinstance(x, Rectangle)] - self.assertAlmostEqual(rects[-1].get_height(), 1.0) + tm.assert_almost_equal(rects[-1].get_height(), 1.0) tm.close() ax = series.plot.hist(cumulative=True, bins=4) rects = [x for x in ax.get_children() if isinstance(x, Rectangle)] - self.assertAlmostEqual(rects[-2].get_height(), 100.0) + tm.assert_almost_equal(rects[-2].get_height(), 100.0) tm.close() # if horizontal, yticklabels are rotated @@ -1321,17 +1337,17 @@ def _check_box_coord(self, patches, expected_y=None, expected_h=None, # dtype is depending on above values, no need to check if expected_y is not None: - self.assert_numpy_array_equal(result_y, expected_y, - check_dtype=False) + 
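Editor's note: the boxplot hunks above pin down the `return_type` contract: an unknown value raises, and the default with `subplots=True` is a `Series` of axes keyed by column. A sketch:

```python
import matplotlib
matplotlib.use('Agg')                 # assumption: headless test run
import string
import pytest
from numpy.random import randn
from pandas import DataFrame, Series

df = DataFrame(randn(6, 4), index=list(string.ascii_letters[:6]),
               columns=['one', 'two', 'three', 'four'])
with pytest.raises(ValueError):
    df.plot.box(return_type='NOTATYPE')   # only 'axes', 'dict' or 'both' are valid
result = df.plot.box(subplots=True)       # default return_type: a Series of axes
assert isinstance(result, Series)
```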
tm.assert_numpy_array_equal(result_y, expected_y, + check_dtype=False) if expected_h is not None: - self.assert_numpy_array_equal(result_height, expected_h, - check_dtype=False) + tm.assert_numpy_array_equal(result_height, expected_h, + check_dtype=False) if expected_x is not None: - self.assert_numpy_array_equal(result_x, expected_x, - check_dtype=False) + tm.assert_numpy_array_equal(result_x, expected_x, + check_dtype=False) if expected_w is not None: - self.assert_numpy_array_equal(result_width, expected_w, - check_dtype=False) + tm.assert_numpy_array_equal(result_width, expected_w, + check_dtype=False) @slow def test_hist_df_coord(self): @@ -1496,7 +1512,7 @@ def test_df_legend_labels(self): self._check_text_labels(ax.xaxis.get_label(), 'a') ax = df5.plot(y='c', label='LABEL_c', ax=ax) self._check_legend_labels(ax, labels=['LABEL_b', 'LABEL_c']) - self.assertTrue(df5.columns.tolist() == ['b', 'c']) + assert df5.columns.tolist() == ['b', 'c'] def test_legend_name(self): multi = DataFrame(randn(4, 4), @@ -1548,20 +1564,20 @@ def test_style_by_column(self): fig.add_subplot(111) ax = df.plot(style=markers) for i, l in enumerate(ax.get_lines()[:len(markers)]): - self.assertEqual(l.get_marker(), markers[i]) + assert l.get_marker() == markers[i] @slow def test_line_label_none(self): s = Series([1, 2]) ax = s.plot() - self.assertEqual(ax.get_legend(), None) + assert ax.get_legend() is None ax = s.plot(legend=True) - self.assertEqual(ax.get_legend().get_texts()[0].get_text(), 'None') + assert ax.get_legend().get_texts()[0].get_text() == 'None' @slow + @tm.capture_stdout def test_line_colors(self): - import sys from matplotlib import cm custom_colors = 'rgcby' @@ -1570,16 +1586,13 @@ def test_line_colors(self): ax = df.plot(color=custom_colors) self._check_colors(ax.get_lines(), linecolors=custom_colors) - tmp = sys.stderr - sys.stderr = StringIO() - try: - tm.close() - ax2 = df.plot(colors=custom_colors) - lines2 = ax2.get_lines() - for l1, l2 in zip(ax.get_lines(), lines2): - self.assertEqual(l1.get_color(), l2.get_color()) - finally: - sys.stderr = tmp + tm.close() + + ax2 = df.plot(colors=custom_colors) + lines2 = ax2.get_lines() + + for l1, l2 in zip(ax.get_lines(), lines2): + assert l1.get_color() == l2.get_color() tm.close() @@ -1595,7 +1608,7 @@ def test_line_colors(self): # make color a list if plotting one column frame # handles cases like df.plot(color='DodgerBlue') - ax = df.ix[:, [0]].plot(color='DodgerBlue') + ax = df.loc[:, [0]].plot(color='DodgerBlue') self._check_colors(ax.lines, linecolors=['DodgerBlue']) ax = df.plot(color='red') @@ -1608,7 +1621,7 @@ def test_line_colors(self): self._check_colors(ax.get_lines(), linecolors=custom_colors) tm.close() - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): # Color contains shorthand hex value results in ValueError custom_colors = ['#F00', '#00F', '#FF0', '#000', '#FFF'] # Forced show plot @@ -1618,7 +1631,7 @@ def test_line_colors(self): def test_dont_modify_colors(self): colors = ['r', 'g', 'b'] pd.DataFrame(np.random.rand(10, 2)).plot(color=colors) - self.assertEqual(len(colors), 3) + assert len(colors) == 3 @slow def test_line_colors_and_styles_subplots(self): @@ -1665,7 +1678,7 @@ def test_line_colors_and_styles_subplots(self): self._check_colors(ax.get_lines(), linecolors=[c]) tm.close() - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): # Color contains shorthand hex value results in ValueError custom_colors = ['#F00', '#00F', '#FF0', '#000', '#FFF'] # Forced show plot @@ -1682,7 +1695,7 @@ 
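Editor's note: the hand-rolled `sys.stderr`/`StringIO` juggling deleted above is superseded by a testing-module decorator. A sketch of the pattern, assuming `tm.capture_stdout` wraps the call and swallows its output, as its use on `test_line_colors` implies:

```python
import pandas.util.testing as tm

@tm.capture_stdout
def silenced():
    # chatter such as the deprecated 'colors' kwarg warning lands here
    print('noisy output the test wants suppressed')

silenced()
```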
def test_line_colors_and_styles_subplots(self): # make color a list if plotting one column frame # handles cases like df.plot(color='DodgerBlue') - axes = df.ix[:, [0]].plot(color='DodgerBlue', subplots=True) + axes = df.loc[:, [0]].plot(color='DodgerBlue', subplots=True) self._check_colors(axes[0].lines, linecolors=['DodgerBlue']) # single character style @@ -1721,7 +1734,7 @@ def test_area_colors(self): self._check_colors(linehandles, linecolors=custom_colors) for h in handles: - self.assertTrue(h.get_alpha() is None) + assert h.get_alpha() is None tm.close() ax = df.plot.area(colormap='jet') @@ -1738,7 +1751,7 @@ def test_area_colors(self): if not isinstance(x, PolyCollection)] self._check_colors(linehandles, linecolors=jet_colors) for h in handles: - self.assertTrue(h.get_alpha() is None) + assert h.get_alpha() is None tm.close() # When stacked=False, alpha is set to 0.5 @@ -1756,7 +1769,7 @@ def test_area_colors(self): linecolors = jet_colors self._check_colors(handles[:len(jet_colors)], linecolors=linecolors) for h in handles: - self.assertEqual(h.get_alpha(), 0.5) + assert h.get_alpha() == 0.5 @slow def test_hist_colors(self): @@ -1785,7 +1798,7 @@ def test_hist_colors(self): self._check_colors(ax.patches[::10], facecolors=rgba_colors) tm.close() - ax = df.ix[:, [0]].plot.hist(color='DodgerBlue') + ax = df.loc[:, [0]].plot.hist(color='DodgerBlue') self._check_colors([ax.patches[0]], facecolors=['DodgerBlue']) ax = df.plot(kind='hist', color='green') @@ -1857,8 +1870,8 @@ def test_kde_colors_and_styles_subplots(self): # make color a list if plotting one column frame # handles cases like df.plot(color='DodgerBlue') - axes = df.ix[:, [0]].plot(kind='kde', color='DodgerBlue', - subplots=True) + axes = df.loc[:, [0]].plot(kind='kde', color='DodgerBlue', + subplots=True) self._check_colors(axes[0].lines, linecolors=['DodgerBlue']) # single character style @@ -1935,7 +1948,7 @@ def _check_colors(bp, box_c, whiskers_c, medians_c, caps_c='k', _check_colors(bp, (0, 1, 0), (0, 1, 0), (0, 1, 0), (0, 1, 0), '#123456') - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): # Color contains invalid key results in ValueError df.plot.box(color=dict(boxes='red', xxxx='blue')) @@ -1962,13 +1975,13 @@ def test_unordered_ts(self): columns=['test']) ax = df.plot() xticks = ax.lines[0].get_xdata() - self.assertTrue(xticks[0] < xticks[1]) + assert xticks[0] < xticks[1] ydata = ax.lines[0].get_ydata() tm.assert_numpy_array_equal(ydata, np.array([1.0, 2.0, 3.0])) def test_kind_both_ways(self): df = DataFrame({'x': [1, 2, 3]}) - for kind in plotting._common_kinds: + for kind in plotting._core._common_kinds: if not _ok_for_gaussian_kde(kind): continue df.plot(kind=kind) @@ -1979,10 +1992,10 @@ def test_kind_both_ways(self): def test_all_invalid_plot_data(self): df = DataFrame(list('abcd')) - for kind in plotting._common_kinds: + for kind in plotting._core._common_kinds: if not _ok_for_gaussian_kde(kind): continue - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): df.plot(kind=kind) @slow @@ -1990,10 +2003,10 @@ def test_partially_invalid_plot_data(self): with tm.RNGContext(42): df = DataFrame(randn(10, 2), dtype=object) df[np.random.rand(df.shape[0]) > 0.5] = 'a' - for kind in plotting._common_kinds: + for kind in plotting._core._common_kinds: if not _ok_for_gaussian_kde(kind): continue - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): df.plot(kind=kind) with tm.RNGContext(42): @@ -2002,12 +2015,12 @@ def test_partially_invalid_plot_data(self): df = 
DataFrame(rand(10, 2), dtype=object) df[np.random.rand(df.shape[0]) > 0.5] = 'a' for kind in kinds: - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): df.plot(kind=kind) def test_invalid_kind(self): df = DataFrame(randn(10, 2)) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.plot(kind='aasdf') @slow @@ -2016,13 +2029,13 @@ def test_hexbin_basic(self): ax = df.plot.hexbin(x='A', y='B', gridsize=10) # TODO: need better way to test. This just does existence. - self.assertEqual(len(ax.collections), 1) + assert len(ax.collections) == 1 # GH 6951 axes = df.plot.hexbin(x='A', y='B', subplots=True) # hexbin should have 2 axes in the figure, 1 for plotting and another # is colorbar - self.assertEqual(len(axes[0].figure.axes), 2) + assert len(axes[0].figure.axes) == 2 # return value is single axes self._check_axes_shape(axes, axes_num=1, layout=(1, 1)) @@ -2031,10 +2044,10 @@ def test_hexbin_with_c(self): df = self.hexbin_df ax = df.plot.hexbin(x='A', y='B', C='C') - self.assertEqual(len(ax.collections), 1) + assert len(ax.collections) == 1 ax = df.plot.hexbin(x='A', y='B', C='C', reduce_C_function=np.std) - self.assertEqual(len(ax.collections), 1) + assert len(ax.collections) == 1 @slow def test_hexbin_cmap(self): @@ -2042,34 +2055,34 @@ def test_hexbin_cmap(self): # Default to BuGn ax = df.plot.hexbin(x='A', y='B') - self.assertEqual(ax.collections[0].cmap.name, 'BuGn') + assert ax.collections[0].cmap.name == 'BuGn' cm = 'cubehelix' ax = df.plot.hexbin(x='A', y='B', colormap=cm) - self.assertEqual(ax.collections[0].cmap.name, cm) + assert ax.collections[0].cmap.name == cm @slow def test_no_color_bar(self): df = self.hexbin_df ax = df.plot.hexbin(x='A', y='B', colorbar=None) - self.assertIs(ax.collections[0].colorbar, None) + assert ax.collections[0].colorbar is None @slow def test_allow_cmap(self): df = self.hexbin_df ax = df.plot.hexbin(x='A', y='B', cmap='YlGn') - self.assertEqual(ax.collections[0].cmap.name, 'YlGn') + assert ax.collections[0].cmap.name == 'YlGn' - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): df.plot.hexbin(x='A', y='B', cmap='YlGn', colormap='BuGn') @slow def test_pie_df(self): df = DataFrame(np.random.rand(5, 3), columns=['X', 'Y', 'Z'], index=['a', 'b', 'c', 'd', 'e']) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.plot.pie() ax = _check_plot_works(df.plot.pie, y='Y') @@ -2082,11 +2095,11 @@ def test_pie_df(self): with tm.assert_produces_warning(UserWarning): axes = _check_plot_works(df.plot.pie, subplots=True) - self.assertEqual(len(axes), len(df.columns)) + assert len(axes) == len(df.columns) for ax in axes: self._check_text_labels(ax.texts, df.index) for ax, ylabel in zip(axes, df.columns): - self.assertEqual(ax.get_ylabel(), ylabel) + assert ax.get_ylabel() == ylabel labels = ['A', 'B', 'C', 'D', 'E'] color_args = ['r', 'g', 'b', 'c', 'm'] @@ -2094,7 +2107,7 @@ def test_pie_df(self): axes = _check_plot_works(df.plot.pie, subplots=True, labels=labels, colors=color_args) - self.assertEqual(len(axes), len(df.columns)) + assert len(axes) == len(df.columns) for ax in axes: self._check_text_labels(ax.texts, labels) @@ -2112,13 +2125,12 @@ def test_pie_df_nan(self): expected = list(base_expected) # force copy expected[i] = '' result = [x.get_text() for x in ax.texts] - self.assertEqual(result, expected) + assert result == expected # legend labels # NaN's not included in legend with subplots # see https://github.com/pandas-dev/pandas/issues/8390 - self.assertEqual([x.get_text() for x 
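The `tm.assertRaises` to `pytest.raises` rewrites above all use the same context-manager idiom. A self-contained sketch (the `divide` helper is illustrative), including the `match` argument, pytest's regex check that plays the role `tm.assert_raises_regex` plays in later hunks:

```python
import pytest

def divide(a, b):
    return a / b

def test_divide_by_zero():
    # The with-block passes only if the named exception is raised.
    with pytest.raises(ZeroDivisionError):
        divide(1, 0)

def test_divide_type_error_message():
    # match= checks the exception message against a regex.
    with pytest.raises(TypeError, match="unsupported operand"):
        divide(1, "x")
```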
in - ax.get_legend().get_texts()], - base_expected[:i] + base_expected[i + 1:]) + assert ([x.get_text() for x in ax.get_legend().get_texts()] == + base_expected[:i] + base_expected[i + 1:]) @slow def test_errorbar_plot(self): @@ -2181,11 +2193,11 @@ def test_errorbar_plot(self): ax = _check_plot_works(s_df.plot, y='y', x='x', yerr=yerr) self._check_has_errorbars(ax, xerr=0, yerr=1) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.plot(yerr=np.random.randn(11)) df_err = DataFrame({'x': ['zzz'] * 12, 'y': ['zzz'] * 12}) - with tm.assertRaises((ValueError, TypeError)): + with pytest.raises((ValueError, TypeError)): df.plot(yerr=df_err) @slow @@ -2268,15 +2280,12 @@ def test_errorbar_asymmetrical(self): expected_0_0 = err[0, :, 0] * np.array([-1, 1]) tm.assert_almost_equal(yerr_0_0, expected_0_0) else: - self.assertEqual(ax.lines[7].get_ydata()[0], - data[0, 1] - err[1, 0, 0]) - self.assertEqual(ax.lines[8].get_ydata()[0], - data[0, 1] + err[1, 1, 0]) - - self.assertEqual(ax.lines[5].get_xdata()[0], -err[1, 0, 0] / 2) - self.assertEqual(ax.lines[6].get_xdata()[0], err[1, 1, 0] / 2) + assert ax.lines[7].get_ydata()[0] == data[0, 1] - err[1, 0, 0] + assert ax.lines[8].get_ydata()[0] == data[0, 1] + err[1, 1, 0] + assert ax.lines[5].get_xdata()[0] == -err[1, 0, 0] / 2 + assert ax.lines[6].get_xdata()[0] == err[1, 1, 0] / 2 - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.plot(yerr=err.T) tm.close() @@ -2288,9 +2297,9 @@ def test_table(self): _check_plot_works(df.plot, table=df) ax = df.plot() - self.assertTrue(len(ax.tables) == 0) + assert len(ax.tables) == 0 plotting.table(ax, df.T) - self.assertTrue(len(ax.tables) == 1) + assert len(ax.tables) == 1 def test_errorbar_scatter(self): df = DataFrame( @@ -2350,7 +2359,7 @@ def test_sharex_and_ax(self): def _check(axes): for ax in axes: - self.assertEqual(len(ax.lines), 1) + assert len(ax.lines) == 1 self._check_visible(ax.get_yticklabels(), visible=True) for ax in [axes[0], axes[2]]: self._check_visible(ax.get_xticklabels(), visible=False) @@ -2380,7 +2389,7 @@ def _check(axes): gs.tight_layout(plt.gcf()) for ax in axes: - self.assertEqual(len(ax.lines), 1) + assert len(ax.lines) == 1 self._check_visible(ax.get_yticklabels(), visible=True) self._check_visible(ax.get_xticklabels(), visible=True) self._check_visible(ax.get_xticklabels(minor=True), visible=True) @@ -2402,7 +2411,7 @@ def test_sharey_and_ax(self): def _check(axes): for ax in axes: - self.assertEqual(len(ax.lines), 1) + assert len(ax.lines) == 1 self._check_visible(ax.get_xticklabels(), visible=True) self._check_visible( ax.get_xticklabels(minor=True), visible=True) @@ -2432,7 +2441,7 @@ def _check(axes): gs.tight_layout(plt.gcf()) for ax in axes: - self.assertEqual(len(ax.lines), 1) + assert len(ax.lines) == 1 self._check_visible(ax.get_yticklabels(), visible=True) self._check_visible(ax.get_xticklabels(), visible=True) self._check_visible(ax.get_xticklabels(minor=True), visible=True) @@ -2443,7 +2452,7 @@ def test_memory_leak(self): import gc results = {} - for kind in plotting._plot_klass.keys(): + for kind in plotting._core._plot_klass.keys(): if not _ok_for_gaussian_kde(kind): continue args = {} @@ -2465,7 +2474,7 @@ def test_memory_leak(self): gc.collect() for key in results: # check that every plot was collected - with tm.assertRaises(ReferenceError): + with pytest.raises(ReferenceError): # need to actually access something to get an error results[key].lines @@ -2482,7 +2491,7 @@ def test_df_subplots_patterns_minorticks(self): fig, 
axes = plt.subplots(2, 1, sharex=True) axes = df.plot(subplots=True, ax=axes) for ax in axes: - self.assertEqual(len(ax.lines), 1) + assert len(ax.lines) == 1 self._check_visible(ax.get_yticklabels(), visible=True) # xaxis of 1st ax must be hidden self._check_visible(axes[0].get_xticklabels(), visible=False) @@ -2495,7 +2504,7 @@ def test_df_subplots_patterns_minorticks(self): with tm.assert_produces_warning(UserWarning): axes = df.plot(subplots=True, ax=axes, sharex=True) for ax in axes: - self.assertEqual(len(ax.lines), 1) + assert len(ax.lines) == 1 self._check_visible(ax.get_yticklabels(), visible=True) # xaxis of 1st ax must be hidden self._check_visible(axes[0].get_xticklabels(), visible=False) @@ -2508,7 +2517,7 @@ def test_df_subplots_patterns_minorticks(self): fig, axes = plt.subplots(2, 1) axes = df.plot(subplots=True, ax=axes) for ax in axes: - self.assertEqual(len(ax.lines), 1) + assert len(ax.lines) == 1 self._check_visible(ax.get_yticklabels(), visible=True) self._check_visible(ax.get_xticklabels(), visible=True) self._check_visible(ax.get_xticklabels(minor=True), visible=True) @@ -2542,9 +2551,9 @@ def _get_horizontal_grid(): for ax1, ax2 in [_get_vertical_grid(), _get_horizontal_grid()]: ax1 = ts.plot(ax=ax1) - self.assertEqual(len(ax1.lines), 1) + assert len(ax1.lines) == 1 ax2 = df.plot(ax=ax2) - self.assertEqual(len(ax2.lines), 2) + assert len(ax2.lines) == 2 for ax in [ax1, ax2]: self._check_visible(ax.get_yticklabels(), visible=True) self._check_visible(ax.get_xticklabels(), visible=True) @@ -2555,8 +2564,8 @@ def _get_horizontal_grid(): # subplots=True for ax1, ax2 in [_get_vertical_grid(), _get_horizontal_grid()]: axes = df.plot(subplots=True, ax=[ax1, ax2]) - self.assertEqual(len(ax1.lines), 1) - self.assertEqual(len(ax2.lines), 1) + assert len(ax1.lines) == 1 + assert len(ax2.lines) == 1 for ax in axes: self._check_visible(ax.get_yticklabels(), visible=True) self._check_visible(ax.get_xticklabels(), visible=True) @@ -2569,8 +2578,8 @@ def _get_horizontal_grid(): with tm.assert_produces_warning(UserWarning): axes = df.plot(subplots=True, ax=[ax1, ax2], sharex=True, sharey=True) - self.assertEqual(len(axes[0].lines), 1) - self.assertEqual(len(axes[1].lines), 1) + assert len(axes[0].lines) == 1 + assert len(axes[1].lines) == 1 for ax in [ax1, ax2]: # yaxis are visible because there is only one column self._check_visible(ax.get_yticklabels(), visible=True) @@ -2586,8 +2595,8 @@ def _get_horizontal_grid(): with tm.assert_produces_warning(UserWarning): axes = df.plot(subplots=True, ax=[ax1, ax2], sharex=True, sharey=True) - self.assertEqual(len(axes[0].lines), 1) - self.assertEqual(len(axes[1].lines), 1) + assert len(axes[0].lines) == 1 + assert len(axes[1].lines) == 1 self._check_visible(axes[0].get_yticklabels(), visible=True) # yaxis of axes1 (right) are hidden self._check_visible(axes[1].get_yticklabels(), visible=False) @@ -2612,7 +2621,7 @@ def _get_boxed_grid(): index=ts.index, columns=list('ABCD')) axes = df.plot(subplots=True, ax=axes) for ax in axes: - self.assertEqual(len(ax.lines), 1) + assert len(ax.lines) == 1 # axis are visible because these are not shared self._check_visible(ax.get_yticklabels(), visible=True) self._check_visible(ax.get_xticklabels(), visible=True) @@ -2624,7 +2633,7 @@ def _get_boxed_grid(): with tm.assert_produces_warning(UserWarning): axes = df.plot(subplots=True, ax=axes, sharex=True, sharey=True) for ax in axes: - self.assertEqual(len(ax.lines), 1) + assert len(ax.lines) == 1 for ax in [axes[0], axes[2]]: # left column 
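`test_memory_leak` above stores weak references to the plotted objects and expects a `ReferenceError` once everything has been garbage-collected. The same pattern in miniature, with a hypothetical `Plot` class standing in for the real plotting results:

```python
import gc
import weakref

import pytest

class Plot:  # hypothetical stand-in for a plotting result
    def __init__(self):
        self.lines = []

def test_plot_is_collected():
    plot = Plot()
    proxy = weakref.proxy(plot)  # does not keep the object alive
    del plot
    gc.collect()
    # Touching a dead proxy raises ReferenceError, proving the
    # object was collected and nothing leaked a reference to it.
    with pytest.raises(ReferenceError):
        proxy.lines
```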
self._check_visible(ax.get_yticklabels(), visible=True) for ax in [axes[1], axes[3]]: # right column @@ -2642,7 +2651,7 @@ def test_df_grid_settings(self): # Make sure plot defaults to rcParams['axes.grid'] setting, GH 9792 self._check_grid_settings( DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4]}), - plotting._dataframe_kinds, kws={'x': 'a', 'y': 'b'}) + plotting._core._dataframe_kinds, kws={'x': 'a', 'y': 'b'}) def test_option_mpl_style(self): with tm.assert_produces_warning(FutureWarning, @@ -2655,13 +2664,13 @@ def test_option_mpl_style(self): check_stacklevel=False): set_option('display.mpl_style', False) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): set_option('display.mpl_style', 'default2') def test_invalid_colormap(self): df = DataFrame(randn(3, 2), columns=['A', 'B']) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.plot(colormap='invalid_colormap') def test_plain_axes(self): @@ -2698,8 +2707,7 @@ def test_passed_bar_colors(self): color_tuples = [(0.9, 0, 0, 1), (0, 0.9, 0, 1), (0, 0, 0.9, 1)] colormap = mpl.colors.ListedColormap(color_tuples) barplot = pd.DataFrame([[1, 2, 3]]).plot(kind="bar", cmap=colormap) - self.assertEqual(color_tuples, [c.get_facecolor() - for c in barplot.patches]) + assert color_tuples == [c.get_facecolor() for c in barplot.patches] def test_rcParams_bar_colors(self): import matplotlib as mpl @@ -2711,8 +2719,7 @@ def test_rcParams_bar_colors(self): except (AttributeError, KeyError): # mpl 1.4 with mpl.rc_context(rc={'axes.color_cycle': color_tuples}): barplot = pd.DataFrame([[1, 2, 3]]).plot(kind="bar") - self.assertEqual(color_tuples, [c.get_facecolor() - for c in barplot.patches]) + assert color_tuples == [c.get_facecolor() for c in barplot.patches] def _generate_4_axes_via_gridspec(): @@ -2727,8 +2734,3 @@ def _generate_4_axes_via_gridspec(): ax_lr = plt.subplot(gs[1, 1]) return gs, [ax_tl, ax_ll, ax_tr, ax_lr] - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/plotting/test_groupby.py b/pandas/tests/plotting/test_groupby.py index 101a6556c61bf..8dcf73bce03c0 100644 --- a/pandas/tests/plotting/test_groupby.py +++ b/pandas/tests/plotting/test_groupby.py @@ -1,7 +1,7 @@ -#!/usr/bin/env python # coding: utf-8 -import nose +""" Test cases for GroupBy.plot """ + from pandas import Series, DataFrame import pandas.util.testing as tm @@ -10,11 +10,9 @@ from pandas.tests.plotting.common import TestPlotBase - -""" Test cases for GroupBy.plot """ +tm._skip_module_if_no_mpl() -@tm.mplskip class TestDataFrameGroupByPlots(TestPlotBase): def test_series_groupby_plotting_nominally_works(self): @@ -71,12 +69,7 @@ def test_plot_kwargs(self): res = df.groupby('z').plot(kind='scatter', x='x', y='y') # check that a scatter plot is effectively plotted: the axes should # contain a PathCollection from the scatter plot (GH11805) - self.assertEqual(len(res['a'].collections), 1) + assert len(res['a'].collections) == 1 res = df.groupby('z').plot.scatter(x='x', y='y') - self.assertEqual(len(res['a'].collections), 1) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + assert len(res['a'].collections) == 1 diff --git a/pandas/tests/plotting/test_hist_method.py b/pandas/tests/plotting/test_hist_method.py index c7bff5a31fc02..c3e32f52e0474 100644 --- a/pandas/tests/plotting/test_hist_method.py +++ b/pandas/tests/plotting/test_hist_method.py @@ -1,7 +1,8 @@ -#!/usr/bin/env python 
# coding: utf-8 -import nose +""" Test cases for .hist method """ + +import pytest from pandas import Series, DataFrame import pandas.util.testing as tm @@ -10,18 +11,17 @@ import numpy as np from numpy.random import randn -import pandas.tools.plotting as plotting +from pandas.plotting._core import grouped_hist from pandas.tests.plotting.common import (TestPlotBase, _check_plot_works) -""" Test cases for .hist method """ +tm._skip_module_if_no_mpl() -@tm.mplskip class TestSeriesPlots(TestPlotBase): - def setUp(self): - TestPlotBase.setUp(self) + def setup_method(self, method): + TestPlotBase.setup_method(self, method) import matplotlib as mpl mpl.rcdefaults() @@ -49,22 +49,22 @@ def test_hist_legacy(self): _check_plot_works(self.ts.hist, figure=fig, ax=ax1) _check_plot_works(self.ts.hist, figure=fig, ax=ax2) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): self.ts.hist(by=self.ts.index, figure=fig) @slow def test_hist_bins_legacy(self): df = DataFrame(np.random.randn(10, 2)) ax = df.hist(bins=2)[0][0] - self.assertEqual(len(ax.patches), 2) + assert len(ax.patches) == 2 @slow def test_hist_layout(self): df = self.hist_df - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.height.hist(layout=(1, 1)) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.height.hist(layout=[1, 1]) @slow @@ -124,13 +124,13 @@ def test_hist_no_overlap(self): y.hist() fig = gcf() axes = fig.axes if self.mpl_ge_1_5_0 else fig.get_axes() - self.assertEqual(len(axes), 2) + assert len(axes) == 2 @slow def test_hist_by_no_extra_plots(self): df = self.hist_df axes = df.height.hist(by=df.gender) # noqa - self.assertEqual(len(self.plt.get_fignums()), 1) + assert len(self.plt.get_fignums()) == 1 @slow def test_plot_fails_when_ax_differs_from_figure(self): @@ -138,11 +138,10 @@ def test_plot_fails_when_ax_differs_from_figure(self): fig1 = figure() fig2 = figure() ax1 = fig1.add_subplot(111) - with tm.assertRaises(AssertionError): + with pytest.raises(AssertionError): self.ts.hist(ax=ax1, figure=fig2) -@tm.mplskip class TestDataFramePlots(TestPlotBase): @slow @@ -156,7 +155,7 @@ def test_hist_df_legacy(self): with tm.assert_produces_warning(UserWarning): axes = _check_plot_works(df.hist, grid=False) self._check_axes_shape(axes, axes_num=3, layout=(2, 2)) - self.assertFalse(axes[1, 1].get_visible()) + assert not axes[1, 1].get_visible() df = DataFrame(randn(100, 1)) _check_plot_works(df.hist) @@ -198,7 +197,7 @@ def test_hist_df_legacy(self): ax = ser.hist(normed=True, cumulative=True, bins=4) # height of last bin (index 5) must be 1.0 rects = [x for x in ax.get_children() if isinstance(x, Rectangle)] - self.assertAlmostEqual(rects[-1].get_height(), 1.0) + tm.assert_almost_equal(rects[-1].get_height(), 1.0) tm.close() ax = ser.hist(log=True) @@ -208,7 +207,7 @@ def test_hist_df_legacy(self): tm.close() # propagate attr exception from matplotlib.Axes.hist - with tm.assertRaises(AttributeError): + with pytest.raises(AttributeError): ser.hist(foo='bar') @slow @@ -233,17 +232,26 @@ def test_hist_layout(self): self._check_axes_shape(axes, axes_num=3, layout=expected) # layout too small for all 4 plots - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.hist(layout=(1, 1)) # invalid format for layout - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.hist(layout=(1,)) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.hist(layout=(-1, -1)) + @slow + # GH 9351 + def test_tight_layout(self): + if 
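Two pytest idioms recur in the rewritten modules above: the `@tm.mplskip` class decorator becomes a module-level `tm._skip_module_if_no_mpl()` call, and unittest's `setUp` becomes `setup_method`. A sketch of the equivalent layout in plain pytest, using the standard `pytest.importorskip` helper in place of the pandas-internal skip:

```python
import pytest

# Skip every test in this module when matplotlib is unavailable,
# analogous to the module-level tm._skip_module_if_no_mpl() call.
matplotlib = pytest.importorskip("matplotlib")

class TestExample:
    # pytest invokes setup_method before each test; the extra
    # `method` argument is the test function about to run.
    def setup_method(self, method):
        matplotlib.rcdefaults()

    def test_backend_configured(self):
        assert matplotlib.get_backend()
```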
self.mpl_ge_2_0_1: + df = DataFrame(randn(100, 3)) + _check_plot_works(df.hist) + self.plt.tight_layout() + + tm.close() + -@tm.mplskip class TestDataFrameGroupByPlots(TestPlotBase): @slow @@ -254,7 +262,7 @@ def test_grouped_hist_legacy(self): df['C'] = np.random.randint(0, 4, 500) df['D'] = ['X'] * 500 - axes = plotting.grouped_hist(df.A, by=df.C) + axes = grouped_hist(df.A, by=df.C) self._check_axes_shape(axes, axes_num=4, layout=(2, 2)) tm.close() @@ -271,27 +279,26 @@ def test_grouped_hist_legacy(self): # make sure kwargs to hist are handled xf, yf = 20, 18 xrot, yrot = 30, 40 - axes = plotting.grouped_hist(df.A, by=df.C, normed=True, - cumulative=True, bins=4, - xlabelsize=xf, xrot=xrot, - ylabelsize=yf, yrot=yrot) + axes = grouped_hist(df.A, by=df.C, normed=True, cumulative=True, + bins=4, xlabelsize=xf, xrot=xrot, + ylabelsize=yf, yrot=yrot) # height of last bin (index 5) must be 1.0 for ax in axes.ravel(): rects = [x for x in ax.get_children() if isinstance(x, Rectangle)] height = rects[-1].get_height() - self.assertAlmostEqual(height, 1.0) + tm.assert_almost_equal(height, 1.0) self._check_ticks_props(axes, xlabelsize=xf, xrot=xrot, ylabelsize=yf, yrot=yrot) tm.close() - axes = plotting.grouped_hist(df.A, by=df.C, log=True) + axes = grouped_hist(df.A, by=df.C, log=True) # scale of y must be 'log' self._check_ax_scales(axes, yaxis='log') tm.close() # propagate attr exception from matplotlib.Axes.hist - with tm.assertRaises(AttributeError): - plotting.grouped_hist(df.A, by=df.C, foo='bar') + with pytest.raises(AttributeError): + grouped_hist(df.A, by=df.C, foo='bar') with tm.assert_produces_warning(FutureWarning): df.hist(by='C', figsize='default') @@ -307,19 +314,19 @@ def test_grouped_hist_legacy2(self): 'gender': gender_int}) gb = df_int.groupby('gender') axes = gb.hist() - self.assertEqual(len(axes), 2) - self.assertEqual(len(self.plt.get_fignums()), 2) + assert len(axes) == 2 + assert len(self.plt.get_fignums()) == 2 tm.close() @slow def test_grouped_hist_layout(self): df = self.hist_df - self.assertRaises(ValueError, df.hist, column='weight', by=df.gender, - layout=(1, 1)) - self.assertRaises(ValueError, df.hist, column='height', by=df.category, - layout=(1, 3)) - self.assertRaises(ValueError, df.hist, column='height', by=df.category, - layout=(-1, -1)) + pytest.raises(ValueError, df.hist, column='weight', by=df.gender, + layout=(1, 1)) + pytest.raises(ValueError, df.hist, column='height', by=df.category, + layout=(1, 3)) + pytest.raises(ValueError, df.hist, column='height', by=df.category, + layout=(-1, -1)) with tm.assert_produces_warning(UserWarning): axes = _check_plot_works(df.hist, column='height', by=df.gender, @@ -368,14 +375,14 @@ def test_grouped_hist_multiple_axes(self): fig, axes = self.plt.subplots(2, 3) returned = df.hist(column=['height', 'weight', 'category'], ax=axes[0]) self._check_axes_shape(returned, axes_num=3, layout=(1, 3)) - self.assert_numpy_array_equal(returned, axes[0]) - self.assertIs(returned[0].figure, fig) + tm.assert_numpy_array_equal(returned, axes[0]) + assert returned[0].figure is fig returned = df.hist(by='classroom', ax=axes[1]) self._check_axes_shape(returned, axes_num=3, layout=(1, 3)) - self.assert_numpy_array_equal(returned, axes[1]) - self.assertIs(returned[0].figure, fig) + tm.assert_numpy_array_equal(returned, axes[1]) + assert returned[0].figure is fig - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): fig, axes = self.plt.subplots(2, 3) # pass different number of axes from required axes = df.hist(column='height', 
ax=axes) @@ -387,12 +394,12 @@ def test_axis_share_x(self): ax1, ax2 = df.hist(column='height', by=df.gender, sharex=True) # share x - self.assertTrue(ax1._shared_x_axes.joined(ax1, ax2)) - self.assertTrue(ax2._shared_x_axes.joined(ax1, ax2)) + assert ax1._shared_x_axes.joined(ax1, ax2) + assert ax2._shared_x_axes.joined(ax1, ax2) # don't share y - self.assertFalse(ax1._shared_y_axes.joined(ax1, ax2)) - self.assertFalse(ax2._shared_y_axes.joined(ax1, ax2)) + assert not ax1._shared_y_axes.joined(ax1, ax2) + assert not ax2._shared_y_axes.joined(ax1, ax2) @slow def test_axis_share_y(self): @@ -400,12 +407,12 @@ def test_axis_share_y(self): ax1, ax2 = df.hist(column='height', by=df.gender, sharey=True) # share y - self.assertTrue(ax1._shared_y_axes.joined(ax1, ax2)) - self.assertTrue(ax2._shared_y_axes.joined(ax1, ax2)) + assert ax1._shared_y_axes.joined(ax1, ax2) + assert ax2._shared_y_axes.joined(ax1, ax2) # don't share x - self.assertFalse(ax1._shared_x_axes.joined(ax1, ax2)) - self.assertFalse(ax2._shared_x_axes.joined(ax1, ax2)) + assert not ax1._shared_x_axes.joined(ax1, ax2) + assert not ax2._shared_x_axes.joined(ax1, ax2) @slow def test_axis_share_xy(self): @@ -414,13 +421,8 @@ def test_axis_share_xy(self): sharey=True) # share both x and y - self.assertTrue(ax1._shared_x_axes.joined(ax1, ax2)) - self.assertTrue(ax2._shared_x_axes.joined(ax1, ax2)) - - self.assertTrue(ax1._shared_y_axes.joined(ax1, ax2)) - self.assertTrue(ax2._shared_y_axes.joined(ax1, ax2)) - + assert ax1._shared_x_axes.joined(ax1, ax2) + assert ax2._shared_x_axes.joined(ax1, ax2) -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + assert ax1._shared_y_axes.joined(ax1, ax2) + assert ax2._shared_y_axes.joined(ax1, ax2) diff --git a/pandas/tests/plotting/test_misc.py b/pandas/tests/plotting/test_misc.py index a484217da5969..9eace32aa19a3 100644 --- a/pandas/tests/plotting/test_misc.py +++ b/pandas/tests/plotting/test_misc.py @@ -1,7 +1,8 @@ -#!/usr/bin/env python # coding: utf-8 -import nose +""" Test cases for misc plot functions """ + +import pytest from pandas import Series, DataFrame from pandas.compat import lmap @@ -12,19 +13,17 @@ from numpy import random from numpy.random import randn -import pandas.tools.plotting as plotting +import pandas.plotting as plotting from pandas.tests.plotting.common import (TestPlotBase, _check_plot_works, _ok_for_gaussian_kde) +tm._skip_module_if_no_mpl() -""" Test cases for misc plot functions """ - -@tm.mplskip class TestSeriesPlots(TestPlotBase): - def setUp(self): - TestPlotBase.setUp(self) + def setup_method(self, method): + TestPlotBase.setup_method(self, method) import matplotlib as mpl mpl.rcdefaults() @@ -33,7 +32,7 @@ def setUp(self): @slow def test_autocorrelation_plot(self): - from pandas.tools.plotting import autocorrelation_plot + from pandas.plotting import autocorrelation_plot _check_plot_works(autocorrelation_plot, series=self.ts) _check_plot_works(autocorrelation_plot, series=self.ts.values) @@ -42,17 +41,16 @@ def test_autocorrelation_plot(self): @slow def test_lag_plot(self): - from pandas.tools.plotting import lag_plot + from pandas.plotting import lag_plot _check_plot_works(lag_plot, series=self.ts) _check_plot_works(lag_plot, series=self.ts, lag=5) @slow def test_bootstrap_plot(self): - from pandas.tools.plotting import bootstrap_plot + from pandas.plotting import bootstrap_plot _check_plot_works(bootstrap_plot, series=self.ts, size=10) -@tm.mplskip class TestDataFramePlots(TestPlotBase): @slow 
@@ -80,9 +78,15 @@ def scat(**kwds): _check_plot_works(scat, diagonal='hist') with tm.assert_produces_warning(UserWarning): _check_plot_works(scat, range_padding=.1) + with tm.assert_produces_warning(UserWarning): + _check_plot_works(scat, color='rgb') + with tm.assert_produces_warning(UserWarning): + _check_plot_works(scat, c='rgb') + with tm.assert_produces_warning(UserWarning): + _check_plot_works(scat, facecolor='rgb') def scat2(x, y, by=None, ax=None, figsize=None): - return plotting.scatter_plot(df, x, y, by, ax, figsize=None) + return plotting._core.scatter_plot(df, x, y, by, ax, figsize=None) _check_plot_works(scat2, x=0, y=1) grouper = Series(np.repeat([1, 2, 3, 4, 5], 20), df.index) @@ -128,7 +132,7 @@ def test_scatter_matrix_axis(self): @slow def test_andrews_curves(self): - from pandas.tools.plotting import andrews_curves + from pandas.plotting import andrews_curves from matplotlib import cm df = self.iris @@ -193,7 +197,7 @@ def test_andrews_curves(self): @slow def test_parallel_coordinates(self): - from pandas.tools.plotting import parallel_coordinates + from pandas.plotting import parallel_coordinates from matplotlib import cm df = self.iris @@ -239,9 +243,29 @@ def test_parallel_coordinates(self): with tm.assert_produces_warning(FutureWarning): parallel_coordinates(df, 'Name', colors=colors) + def test_parallel_coordinates_with_sorted_labels(self): + """ For #15908 """ + from pandas.plotting import parallel_coordinates + + df = DataFrame({"feat": [i for i in range(30)], + "class": [2 for _ in range(10)] + + [3 for _ in range(10)] + + [1 for _ in range(10)]}) + ax = parallel_coordinates(df, 'class', sort_labels=True) + polylines, labels = ax.get_legend_handles_labels() + color_label_tuples = \ + zip([polyline.get_color() for polyline in polylines], labels) + ordered_color_label_tuples = sorted(color_label_tuples, + key=lambda x: x[1]) + prev_next_tuples = zip([i for i in ordered_color_label_tuples[0:-1]], + [i for i in ordered_color_label_tuples[1:]]) + for prev, nxt in prev_next_tuples: + # labels and colors are ordered strictly increasing + assert prev[1] < nxt[1] and prev[0] < nxt[0] + @slow def test_radviz(self): - from pandas.tools.plotting import radviz + from pandas.plotting import radviz from matplotlib import cm df = self.iris @@ -277,7 +301,28 @@ def test_radviz(self): handles, labels = ax.get_legend_handles_labels() self._check_colors(handles, facecolors=colors) - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + @slow + def test_subplot_titles(self): + df = self.iris.drop('Name', axis=1).head() + # Use the column names as the subplot titles + title = list(df.columns) + + # Case len(title) == len(df) + plot = df.plot(subplots=True, title=title) + assert [p.get_title() for p in plot] == title + + # Case len(title) > len(df) + pytest.raises(ValueError, df.plot, subplots=True, + title=title + ["kittens > puppies"]) + + # Case len(title) < len(df) + pytest.raises(ValueError, df.plot, subplots=True, title=title[:2]) + + # Case subplots=False and title is of type list + pytest.raises(ValueError, df.plot, subplots=False, title=title) + + # Case df with 3 numeric columns but layout of (2,2) + plot = df.drop('SepalWidth', axis=1).plot(subplots=True, layout=(2, 2), + title=title[:-1]) + title_list = [ax.get_title() for sublist in plot for ax in sublist] + assert title_list == title[:3] + [''] diff --git a/pandas/tests/plotting/test_series.py b/pandas/tests/plotting/test_series.py index
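The new `test_parallel_coordinates_with_sorted_labels` checks consecutive `(color, label)` pairs for strictly increasing order. The same pairwise walk can be written more directly by zipping the sorted list against itself shifted by one (the color and label values below are illustrative):

```python
pairs = sorted([("#2ca02c", "b"), ("#1f77b4", "a"), ("#d62728", "c")],
               key=lambda pair: pair[1])  # sort by label

# zip(pairs, pairs[1:]) yields each element next to its successor.
for (prev_color, prev_label), (next_color, next_label) in zip(pairs, pairs[1:]):
    assert prev_label < next_label  # labels strictly increasing
    assert prev_color < next_color  # colors strictly increasing
```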
6878ca0e1bc06..448661c7af0e9 100644 --- a/pandas/tests/plotting/test_series.py +++ b/pandas/tests/plotting/test_series.py @@ -1,8 +1,10 @@ -#!/usr/bin/env python # coding: utf-8 -import nose +""" Test cases for Series.plot """ + + import itertools +import pytest from datetime import datetime @@ -15,20 +17,18 @@ import numpy as np from numpy.random import randn -import pandas.tools.plotting as plotting +import pandas.plotting as plotting from pandas.tests.plotting.common import (TestPlotBase, _check_plot_works, _skip_if_no_scipy_gaussian_kde, _ok_for_gaussian_kde) - -""" Test cases for Series.plot """ +tm._skip_module_if_no_mpl() -@tm.mplskip class TestSeriesPlots(TestPlotBase): - def setUp(self): - TestPlotBase.setUp(self) + def setup_method(self, method): + TestPlotBase.setup_method(self, method) import matplotlib as mpl mpl.rcdefaults() @@ -94,36 +94,36 @@ def test_dont_modify_rcParams(self): key = 'axes.color_cycle' colors = self.plt.rcParams[key] Series([1, 2, 3]).plot() - self.assertEqual(colors, self.plt.rcParams[key]) + assert colors == self.plt.rcParams[key] def test_ts_line_lim(self): ax = self.ts.plot() xmin, xmax = ax.get_xlim() lines = ax.get_lines() - self.assertEqual(xmin, lines[0].get_data(orig=False)[0][0]) - self.assertEqual(xmax, lines[0].get_data(orig=False)[0][-1]) + assert xmin == lines[0].get_data(orig=False)[0][0] + assert xmax == lines[0].get_data(orig=False)[0][-1] tm.close() ax = self.ts.plot(secondary_y=True) xmin, xmax = ax.get_xlim() lines = ax.get_lines() - self.assertEqual(xmin, lines[0].get_data(orig=False)[0][0]) - self.assertEqual(xmax, lines[0].get_data(orig=False)[0][-1]) + assert xmin == lines[0].get_data(orig=False)[0][0] + assert xmax == lines[0].get_data(orig=False)[0][-1] def test_ts_area_lim(self): ax = self.ts.plot.area(stacked=False) xmin, xmax = ax.get_xlim() line = ax.get_lines()[0].get_data(orig=False)[0] - self.assertEqual(xmin, line[0]) - self.assertEqual(xmax, line[-1]) + assert xmin == line[0] + assert xmax == line[-1] tm.close() # GH 7471 ax = self.ts.plot.area(stacked=False, x_compat=True) xmin, xmax = ax.get_xlim() line = ax.get_lines()[0].get_data(orig=False)[0] - self.assertEqual(xmin, line[0]) - self.assertEqual(xmax, line[-1]) + assert xmin == line[0] + assert xmax == line[-1] tm.close() tz_ts = self.ts.copy() @@ -131,15 +131,15 @@ def test_ts_area_lim(self): ax = tz_ts.plot.area(stacked=False, x_compat=True) xmin, xmax = ax.get_xlim() line = ax.get_lines()[0].get_data(orig=False)[0] - self.assertEqual(xmin, line[0]) - self.assertEqual(xmax, line[-1]) + assert xmin == line[0] + assert xmax == line[-1] tm.close() ax = tz_ts.plot.area(stacked=False, secondary_y=True) xmin, xmax = ax.get_xlim() line = ax.get_lines()[0].get_data(orig=False)[0] - self.assertEqual(xmin, line[0]) - self.assertEqual(xmax, line[-1]) + assert xmin == line[0] + assert xmax == line[-1] def test_label(self): s = Series([1, 2]) @@ -160,7 +160,7 @@ def test_label(self): self.plt.close() # Add lebel info, but don't draw ax = s.plot(legend=False, label='LABEL') - self.assertEqual(ax.get_legend(), None) # Hasn't been drawn + assert ax.get_legend() is None # Hasn't been drawn ax.legend() # draw it self._check_legend_labels(ax, labels=['LABEL']) @@ -174,27 +174,27 @@ def test_line_area_nan_series(self): masked = ax.lines[0].get_ydata() # remove nan for comparison purpose exp = np.array([1, 2, 3], dtype=np.float64) - self.assert_numpy_array_equal(np.delete(masked.data, 2), exp) - self.assert_numpy_array_equal( + tm.assert_numpy_array_equal(np.delete(masked.data, 2), 
exp) + tm.assert_numpy_array_equal( masked.mask, np.array([False, False, True, False])) expected = np.array([1, 2, 0, 3], dtype=np.float64) ax = _check_plot_works(d.plot, stacked=True) - self.assert_numpy_array_equal(ax.lines[0].get_ydata(), expected) + tm.assert_numpy_array_equal(ax.lines[0].get_ydata(), expected) ax = _check_plot_works(d.plot.area) - self.assert_numpy_array_equal(ax.lines[0].get_ydata(), expected) + tm.assert_numpy_array_equal(ax.lines[0].get_ydata(), expected) ax = _check_plot_works(d.plot.area, stacked=False) - self.assert_numpy_array_equal(ax.lines[0].get_ydata(), expected) + tm.assert_numpy_array_equal(ax.lines[0].get_ydata(), expected) def test_line_use_index_false(self): s = Series([1, 2, 3], index=['a', 'b', 'c']) s.index.name = 'The Index' ax = s.plot(use_index=False) label = ax.get_xlabel() - self.assertEqual(label, '') + assert label == '' ax2 = s.plot.bar(use_index=False) label2 = ax2.get_xlabel() - self.assertEqual(label2, '') + assert label2 == '' @slow def test_bar_log(self): @@ -223,15 +223,15 @@ def test_bar_log(self): ymin = 0.0007943282347242822 if self.mpl_ge_2_0_0 else 0.001 ymax = 0.12589254117941673 if self.mpl_ge_2_0_0 else .10000000000000001 res = ax.get_ylim() - self.assertAlmostEqual(res[0], ymin) - self.assertAlmostEqual(res[1], ymax) + tm.assert_almost_equal(res[0], ymin) + tm.assert_almost_equal(res[1], ymax) tm.assert_numpy_array_equal(ax.yaxis.get_ticklocs(), expected) tm.close() ax = Series([0.1, 0.01, 0.001]).plot(log=True, kind='barh') res = ax.get_xlim() - self.assertAlmostEqual(res[0], ymin) - self.assertAlmostEqual(res[1], ymax) + tm.assert_almost_equal(res[0], ymin) + tm.assert_almost_equal(res[1], ymax) tm.assert_numpy_array_equal(ax.xaxis.get_ticklocs(), expected) @slow @@ -256,7 +256,7 @@ def test_irregular_datetime(self): ax = ser.plot() xp = datetime(1999, 1, 1).toordinal() ax.set_xlim('1/1/1999', '1/1/2001') - self.assertEqual(xp, ax.get_xlim()[0]) + assert xp == ax.get_xlim()[0] @slow def test_pie_series(self): @@ -266,7 +266,7 @@ def test_pie_series(self): index=['a', 'b', 'c', 'd', 'e'], name='YLABEL') ax = _check_plot_works(series.plot.pie) self._check_text_labels(ax.texts, series.index) - self.assertEqual(ax.get_ylabel(), 'YLABEL') + assert ax.get_ylabel() == 'YLABEL' # without wedge labels ax = _check_plot_works(series.plot.pie, labels=None) @@ -296,10 +296,10 @@ def test_pie_series(self): expected_texts = list(next(it) for it in itertools.cycle(iters)) self._check_text_labels(ax.texts, expected_texts) for t in ax.texts: - self.assertEqual(t.get_fontsize(), 7) + assert t.get_fontsize() == 7 # includes negative value - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): series = Series([1, 2, 0, 4, -1], index=['a', 'b', 'c', 'd', 'e']) series.plot.pie() @@ -314,13 +314,13 @@ def test_pie_nan(self): ax = s.plot.pie(legend=True) expected = ['0', '', '2', '3'] result = [x.get_text() for x in ax.texts] - self.assertEqual(result, expected) + assert result == expected @slow def test_hist_df_kwargs(self): df = DataFrame(np.random.randn(10, 2)) ax = df.plot.hist(bins=5) - self.assertEqual(len(ax.patches), 10) + assert len(ax.patches) == 10 @slow def test_hist_df_with_nonnumerics(self): @@ -330,10 +330,10 @@ def test_hist_df_with_nonnumerics(self): np.random.randn(10, 4), columns=['A', 'B', 'C', 'D']) df['E'] = ['x', 'y'] * 5 ax = df.plot.hist(bins=5) - self.assertEqual(len(ax.patches), 20) + assert len(ax.patches) == 20 ax = df.plot.hist() # bins=10 - self.assertEqual(len(ax.patches), 40) + assert len(ax.patches) 
== 40 @slow def test_hist_legacy(self): @@ -358,22 +358,22 @@ def test_hist_legacy(self): _check_plot_works(self.ts.hist, figure=fig, ax=ax1) _check_plot_works(self.ts.hist, figure=fig, ax=ax2) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): self.ts.hist(by=self.ts.index, figure=fig) @slow def test_hist_bins_legacy(self): df = DataFrame(np.random.randn(10, 2)) ax = df.hist(bins=2)[0][0] - self.assertEqual(len(ax.patches), 2) + assert len(ax.patches) == 2 @slow def test_hist_layout(self): df = self.hist_df - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.height.hist(layout=(1, 1)) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.height.hist(layout=[1, 1]) @slow @@ -431,7 +431,7 @@ def test_hist_no_overlap(self): y.hist() fig = gcf() axes = fig.axes if self.mpl_ge_1_5_0 else fig.get_axes() - self.assertEqual(len(axes), 2) + assert len(axes) == 2 @slow def test_hist_secondary_legend(self): @@ -444,8 +444,8 @@ def test_hist_secondary_legend(self): # both legends are dran on left ax # left and right axis must be visible self._check_legend_labels(ax, labels=['a', 'b (right)']) - self.assertTrue(ax.get_yaxis().get_visible()) - self.assertTrue(ax.right_ax.get_yaxis().get_visible()) + assert ax.get_yaxis().get_visible() + assert ax.right_ax.get_yaxis().get_visible() tm.close() # secondary -> secondary @@ -455,8 +455,8 @@ def test_hist_secondary_legend(self): # left axis must be invisible, right axis must be visible self._check_legend_labels(ax.left_ax, labels=['a (right)', 'b (right)']) - self.assertFalse(ax.left_ax.get_yaxis().get_visible()) - self.assertTrue(ax.get_yaxis().get_visible()) + assert not ax.left_ax.get_yaxis().get_visible() + assert ax.get_yaxis().get_visible() tm.close() # secondary -> primary @@ -466,8 +466,8 @@ def test_hist_secondary_legend(self): # both legends are draw on left ax # left and right axis must be visible self._check_legend_labels(ax.left_ax, labels=['a (right)', 'b']) - self.assertTrue(ax.left_ax.get_yaxis().get_visible()) - self.assertTrue(ax.get_yaxis().get_visible()) + assert ax.left_ax.get_yaxis().get_visible() + assert ax.get_yaxis().get_visible() tm.close() @slow @@ -482,8 +482,8 @@ def test_df_series_secondary_legend(self): # both legends are dran on left ax # left and right axis must be visible self._check_legend_labels(ax, labels=['a', 'b', 'c', 'x (right)']) - self.assertTrue(ax.get_yaxis().get_visible()) - self.assertTrue(ax.right_ax.get_yaxis().get_visible()) + assert ax.get_yaxis().get_visible() + assert ax.right_ax.get_yaxis().get_visible() tm.close() # primary -> secondary (with passing ax) @@ -492,8 +492,8 @@ def test_df_series_secondary_legend(self): # both legends are dran on left ax # left and right axis must be visible self._check_legend_labels(ax, labels=['a', 'b', 'c', 'x (right)']) - self.assertTrue(ax.get_yaxis().get_visible()) - self.assertTrue(ax.right_ax.get_yaxis().get_visible()) + assert ax.get_yaxis().get_visible() + assert ax.right_ax.get_yaxis().get_visible() tm.close() # seconcary -> secondary (without passing ax) @@ -503,8 +503,8 @@ def test_df_series_secondary_legend(self): # left axis must be invisible and right axis must be visible expected = ['a (right)', 'b (right)', 'c (right)', 'x (right)'] self._check_legend_labels(ax.left_ax, labels=expected) - self.assertFalse(ax.left_ax.get_yaxis().get_visible()) - self.assertTrue(ax.get_yaxis().get_visible()) + assert not ax.left_ax.get_yaxis().get_visible() + assert ax.get_yaxis().get_visible() tm.close() # secondary 
-> secondary (with passing ax) @@ -514,8 +514,8 @@ def test_df_series_secondary_legend(self): # left axis must be invisible and right axis must be visible expected = ['a (right)', 'b (right)', 'c (right)', 'x (right)'] self._check_legend_labels(ax.left_ax, expected) - self.assertFalse(ax.left_ax.get_yaxis().get_visible()) - self.assertTrue(ax.get_yaxis().get_visible()) + assert not ax.left_ax.get_yaxis().get_visible() + assert ax.get_yaxis().get_visible() tm.close() # secondary -> secondary (with passing ax) @@ -525,14 +525,14 @@ def test_df_series_secondary_legend(self): # left axis must be invisible and right axis must be visible expected = ['a', 'b', 'c', 'x (right)'] self._check_legend_labels(ax.left_ax, expected) - self.assertFalse(ax.left_ax.get_yaxis().get_visible()) - self.assertTrue(ax.get_yaxis().get_visible()) + assert not ax.left_ax.get_yaxis().get_visible() + assert ax.get_yaxis().get_visible() tm.close() @slow def test_plot_fails_with_dupe_color_and_style(self): x = Series(randn(2)) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): x.plot(style='k--', color='k') @slow @@ -577,15 +577,14 @@ def test_kde_missing_vals(self): s = Series(np.random.uniform(size=50)) s[0] = np.nan axes = _check_plot_works(s.plot.kde) - # check if the values have any missing values - # GH14821 - self.assertTrue(any(~np.isnan(axes.lines[0].get_xdata())), - msg='Missing Values not dropped') + + # gh-14821: check if the values have any missing values + assert any(~np.isnan(axes.lines[0].get_xdata())) @slow def test_hist_kwargs(self): ax = self.ts.plot.hist(bins=5) - self.assertEqual(len(ax.patches), 5) + assert len(ax.patches) == 5 self._check_text_labels(ax.yaxis.get_label(), 'Frequency') tm.close() @@ -601,7 +600,7 @@ def test_hist_kwargs(self): def test_hist_kde_color(self): ax = self.ts.plot.hist(logy=True, bins=10, color='b') self._check_ax_scales(ax, yaxis='log') - self.assertEqual(len(ax.patches), 10) + assert len(ax.patches) == 10 self._check_colors(ax.patches, facecolors=['b'] * 10) tm._skip_if_no_scipy() @@ -609,7 +608,7 @@ def test_hist_kde_color(self): ax = self.ts.plot.kde(logy=True, color='r') self._check_ax_scales(ax, yaxis='log') lines = ax.get_lines() - self.assertEqual(len(lines), 1) + assert len(lines) == 1 self._check_colors(lines, ['r']) @slow @@ -624,7 +623,9 @@ def test_boxplot_series(self): @slow def test_kind_both_ways(self): s = Series(range(3)) - for kind in plotting._common_kinds + plotting._series_kinds: + kinds = (plotting._core._common_kinds + + plotting._core._series_kinds) + for kind in kinds: if not _ok_for_gaussian_kde(kind): continue s.plot(kind=kind) @@ -633,31 +634,31 @@ def test_kind_both_ways(self): @slow def test_invalid_plot_data(self): s = Series(list('abcd')) - for kind in plotting._common_kinds: + for kind in plotting._core._common_kinds: if not _ok_for_gaussian_kde(kind): continue - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): s.plot(kind=kind) @slow def test_valid_object_plot(self): s = Series(lrange(10), dtype=object) - for kind in plotting._common_kinds: + for kind in plotting._core._common_kinds: if not _ok_for_gaussian_kde(kind): continue _check_plot_works(s.plot, kind=kind) def test_partially_invalid_plot_data(self): s = Series(['a', 'b', 1.0, 2]) - for kind in plotting._common_kinds: + for kind in plotting._core._common_kinds: if not _ok_for_gaussian_kde(kind): continue - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): s.plot(kind=kind) def test_invalid_kind(self): s = Series([1, 2]) - with 
tm.assertRaises(ValueError): + with pytest.raises(ValueError): s.plot(kind='aasdf') @slow @@ -704,12 +705,12 @@ def test_errorbar_plot(self): self._check_has_errorbars(ax, xerr=0, yerr=1) # check incorrect lengths and types - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): s.plot(yerr=np.arange(11)) s_err = ['zzz'] * 10 # in mpl 1.5+ this is a TypeError - with tm.assertRaises((ValueError, TypeError)): + with pytest.raises((ValueError, TypeError)): s.plot(yerr=s_err) def test_table(self): @@ -720,55 +721,58 @@ def test_table(self): def test_series_grid_settings(self): # Make sure plot defaults to rcParams['axes.grid'] setting, GH 9792 self._check_grid_settings(Series([1, 2, 3]), - plotting._series_kinds + - plotting._common_kinds) + plotting._core._series_kinds + + plotting._core._common_kinds) @slow def test_standard_colors(self): + from pandas.plotting._style import _get_standard_colors + for c in ['r', 'red', 'green', '#FF0000']: - result = plotting._get_standard_colors(1, color=c) - self.assertEqual(result, [c]) + result = _get_standard_colors(1, color=c) + assert result == [c] - result = plotting._get_standard_colors(1, color=[c]) - self.assertEqual(result, [c]) + result = _get_standard_colors(1, color=[c]) + assert result == [c] - result = plotting._get_standard_colors(3, color=c) - self.assertEqual(result, [c] * 3) + result = _get_standard_colors(3, color=c) + assert result == [c] * 3 - result = plotting._get_standard_colors(3, color=[c]) - self.assertEqual(result, [c] * 3) + result = _get_standard_colors(3, color=[c]) + assert result == [c] * 3 @slow def test_standard_colors_all(self): import matplotlib.colors as colors + from pandas.plotting._style import _get_standard_colors # multiple colors like mediumaquamarine for c in colors.cnames: - result = plotting._get_standard_colors(num_colors=1, color=c) - self.assertEqual(result, [c]) + result = _get_standard_colors(num_colors=1, color=c) + assert result == [c] - result = plotting._get_standard_colors(num_colors=1, color=[c]) - self.assertEqual(result, [c]) + result = _get_standard_colors(num_colors=1, color=[c]) + assert result == [c] - result = plotting._get_standard_colors(num_colors=3, color=c) - self.assertEqual(result, [c] * 3) + result = _get_standard_colors(num_colors=3, color=c) + assert result == [c] * 3 - result = plotting._get_standard_colors(num_colors=3, color=[c]) - self.assertEqual(result, [c] * 3) + result = _get_standard_colors(num_colors=3, color=[c]) + assert result == [c] * 3 # single letter colors like k for c in colors.ColorConverter.colors: - result = plotting._get_standard_colors(num_colors=1, color=c) - self.assertEqual(result, [c]) + result = _get_standard_colors(num_colors=1, color=c) + assert result == [c] - result = plotting._get_standard_colors(num_colors=1, color=[c]) - self.assertEqual(result, [c]) + result = _get_standard_colors(num_colors=1, color=[c]) + assert result == [c] - result = plotting._get_standard_colors(num_colors=3, color=c) - self.assertEqual(result, [c] * 3) + result = _get_standard_colors(num_colors=3, color=c) + assert result == [c] * 3 - result = plotting._get_standard_colors(num_colors=3, color=[c]) - self.assertEqual(result, [c] * 3) + result = _get_standard_colors(num_colors=3, color=[c]) + assert result == [c] * 3 def test_series_plot_color_kwargs(self): # GH1890 @@ -812,8 +816,3 @@ def test_custom_business_day_freq(self): freq=CustomBusinessDay(holidays=['2014-05-26']))) _check_plot_works(s.plot) - - -if __name__ == '__main__': - 
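`test_standard_colors` and `test_standard_colors_all` exercise a helper that broadcasts a single color spec to the number of series being drawn. A minimal sketch of that broadcast behaviour, assuming a simple cycle-and-truncate rule consistent with the assertions (illustrative, not the `pandas.plotting._style` implementation):

```python
def broadcast_colors(num_colors, color):
    """Expand a color or list of colors to ``num_colors`` entries."""
    colors = [color] if isinstance(color, str) else list(color)
    return (colors * num_colors)[:num_colors]  # cycle, then truncate

assert broadcast_colors(1, "r") == ["r"]
assert broadcast_colors(3, "r") == ["r", "r", "r"]
assert broadcast_colors(3, ["r"]) == ["r", "r", "r"]
assert broadcast_colors(3, ["r", "g"]) == ["r", "g", "r"]
```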
nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/reshape/__init__.py b/pandas/tests/reshape/__init__.py new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/pandas/tools/tests/data/allow_exact_matches.csv b/pandas/tests/reshape/data/allow_exact_matches.csv similarity index 100% rename from pandas/tools/tests/data/allow_exact_matches.csv rename to pandas/tests/reshape/data/allow_exact_matches.csv diff --git a/pandas/tools/tests/data/allow_exact_matches_and_tolerance.csv b/pandas/tests/reshape/data/allow_exact_matches_and_tolerance.csv similarity index 100% rename from pandas/tools/tests/data/allow_exact_matches_and_tolerance.csv rename to pandas/tests/reshape/data/allow_exact_matches_and_tolerance.csv diff --git a/pandas/tools/tests/data/asof.csv b/pandas/tests/reshape/data/asof.csv similarity index 100% rename from pandas/tools/tests/data/asof.csv rename to pandas/tests/reshape/data/asof.csv diff --git a/pandas/tools/tests/data/asof2.csv b/pandas/tests/reshape/data/asof2.csv similarity index 100% rename from pandas/tools/tests/data/asof2.csv rename to pandas/tests/reshape/data/asof2.csv diff --git a/pandas/tools/tests/data/cut_data.csv b/pandas/tests/reshape/data/cut_data.csv similarity index 100% rename from pandas/tools/tests/data/cut_data.csv rename to pandas/tests/reshape/data/cut_data.csv diff --git a/pandas/tools/tests/data/quotes.csv b/pandas/tests/reshape/data/quotes.csv similarity index 100% rename from pandas/tools/tests/data/quotes.csv rename to pandas/tests/reshape/data/quotes.csv diff --git a/pandas/tools/tests/data/quotes2.csv b/pandas/tests/reshape/data/quotes2.csv similarity index 100% rename from pandas/tools/tests/data/quotes2.csv rename to pandas/tests/reshape/data/quotes2.csv diff --git a/pandas/tools/tests/data/tolerance.csv b/pandas/tests/reshape/data/tolerance.csv similarity index 100% rename from pandas/tools/tests/data/tolerance.csv rename to pandas/tests/reshape/data/tolerance.csv diff --git a/pandas/tools/tests/data/trades.csv b/pandas/tests/reshape/data/trades.csv similarity index 100% rename from pandas/tools/tests/data/trades.csv rename to pandas/tests/reshape/data/trades.csv diff --git a/pandas/tools/tests/data/trades2.csv b/pandas/tests/reshape/data/trades2.csv similarity index 100% rename from pandas/tools/tests/data/trades2.csv rename to pandas/tests/reshape/data/trades2.csv diff --git a/pandas/tools/tests/test_concat.py b/pandas/tests/reshape/test_concat.py similarity index 76% rename from pandas/tools/tests/test_concat.py rename to pandas/tests/reshape/test_concat.py index f541de316661a..4dfa2904313ce 100644 --- a/pandas/tools/tests/test_concat.py +++ b/pandas/tests/reshape/test_concat.py @@ -1,4 +1,4 @@ -import nose +from warnings import catch_warnings import numpy as np from numpy.random import randn @@ -9,19 +9,17 @@ from pandas import (DataFrame, concat, read_csv, isnull, Series, date_range, Index, Panel, MultiIndex, Timestamp, - DatetimeIndex, Categorical, CategoricalIndex) -from pandas.types.concat import union_categoricals + DatetimeIndex) from pandas.util import testing as tm from pandas.util.testing import (assert_frame_equal, - makeCustomDataframe as mkdf, - assert_almost_equal) + makeCustomDataframe as mkdf) +import pytest -class ConcatenateBase(tm.TestCase): - _multiprocess_can_split_ = True +class ConcatenateBase(object): - def setUp(self): + def setup_method(self, method): self.frame = DataFrame(tm.getSeriesData()) self.mixed_frame = self.frame.copy() 
self.mixed_frame['foo'] = 'bar' @@ -33,7 +31,7 @@ class TestConcatAppendCommon(ConcatenateBase): Test common dtype coercion rules between concat and append. """ - def setUp(self): + def setup_method(self, method): dt_data = [pd.Timestamp('2011-01-01'), pd.Timestamp('2011-01-02'), @@ -67,14 +65,14 @@ def _check_expected_dtype(self, obj, label): """ if isinstance(obj, pd.Index): if label == 'bool': - self.assertEqual(obj.dtype, 'object') + assert obj.dtype == 'object' else: - self.assertEqual(obj.dtype, label) + assert obj.dtype == label elif isinstance(obj, pd.Series): if label.startswith('period'): - self.assertEqual(obj.dtype, 'object') + assert obj.dtype == 'object' else: - self.assertEqual(obj.dtype, label) + assert obj.dtype == label else: raise ValueError @@ -126,10 +124,12 @@ def test_concatlike_same_dtypes(self): tm.assert_index_equal(res, exp) # cannot append non-index - with tm.assertRaisesRegexp(TypeError, 'all inputs must be Index'): + with tm.assert_raises_regex(TypeError, + 'all inputs must be Index'): pd.Index(vals1).append(vals2) - with tm.assertRaisesRegexp(TypeError, 'all inputs must be Index'): + with tm.assert_raises_regex(TypeError, + 'all inputs must be Index'): pd.Index(vals1).append([pd.Index(vals2), vals3]) # ----- Series ----- # @@ -177,16 +177,16 @@ def test_concatlike_same_dtypes(self): # cannot append non-index msg = "cannot concatenate a non-NDFrame object" - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): pd.Series(vals1).append(vals2) - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): pd.Series(vals1).append([pd.Series(vals2), vals3]) - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): pd.concat([pd.Series(vals1), vals2]) - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): pd.concat([pd.Series(vals1), pd.Series(vals2), vals3]) def test_concatlike_dtypes_coercion(self): @@ -274,20 +274,20 @@ def test_concatlike_common_coerce_to_pandas_object(self): res = dti.append(tdi) tm.assert_index_equal(res, exp) - tm.assertIsInstance(res[0], pd.Timestamp) - tm.assertIsInstance(res[-1], pd.Timedelta) + assert isinstance(res[0], pd.Timestamp) + assert isinstance(res[-1], pd.Timedelta) dts = pd.Series(dti) tds = pd.Series(tdi) res = dts.append(tds) tm.assert_series_equal(res, pd.Series(exp, index=[0, 1, 0, 1])) - tm.assertIsInstance(res.iloc[0], pd.Timestamp) - tm.assertIsInstance(res.iloc[-1], pd.Timedelta) + assert isinstance(res.iloc[0], pd.Timestamp) + assert isinstance(res.iloc[-1], pd.Timedelta) res = pd.concat([dts, tds]) tm.assert_series_equal(res, pd.Series(exp, index=[0, 1, 0, 1])) - tm.assertIsInstance(res.iloc[0], pd.Timestamp) - tm.assertIsInstance(res.iloc[-1], pd.Timedelta) + assert isinstance(res.iloc[0], pd.Timestamp) + assert isinstance(res.iloc[-1], pd.Timedelta) def test_concatlike_datetimetz(self): # GH 7795 @@ -709,25 +709,25 @@ def test_append(self): end_frame = self.frame.reindex(end_index) appended = begin_frame.append(end_frame) - assert_almost_equal(appended['A'], self.frame['A']) + tm.assert_almost_equal(appended['A'], self.frame['A']) del end_frame['A'] partial_appended = begin_frame.append(end_frame) - self.assertIn('A', partial_appended) + assert 'A' in partial_appended partial_appended = end_frame.append(begin_frame) - self.assertIn('A', partial_appended) + assert 'A' in partial_appended # mixed type handling appended = self.mixed_frame[:5].append(self.mixed_frame[5:]) - 
assert_frame_equal(appended, self.mixed_frame) + tm.assert_frame_equal(appended, self.mixed_frame) # what to test here mixed_appended = self.mixed_frame[:5].append(self.frame[5:]) mixed_appended2 = self.frame[:5].append(self.mixed_frame[5:]) # all equal except 'foo' column - assert_frame_equal( + tm.assert_frame_equal( mixed_appended.reindex(columns=['A', 'B', 'C', 'D']), mixed_appended2.reindex(columns=['A', 'B', 'C', 'D'])) @@ -735,25 +735,24 @@ def test_append(self): empty = DataFrame({}) appended = self.frame.append(empty) - assert_frame_equal(self.frame, appended) - self.assertIsNot(appended, self.frame) + tm.assert_frame_equal(self.frame, appended) + assert appended is not self.frame appended = empty.append(self.frame) - assert_frame_equal(self.frame, appended) - self.assertIsNot(appended, self.frame) + tm.assert_frame_equal(self.frame, appended) + assert appended is not self.frame - # overlap - self.assertRaises(ValueError, self.frame.append, self.frame, - verify_integrity=True) + # Overlap + with pytest.raises(ValueError): + self.frame.append(self.frame, verify_integrity=True) - # new columns - # GH 6129 + # see gh-6129: new columns df = DataFrame({'a': {'x': 1, 'y': 2}, 'b': {'x': 3, 'y': 4}}) row = Series([5, 6, 7], index=['a', 'b', 'c'], name='z') expected = DataFrame({'a': {'x': 1, 'y': 2, 'z': 5}, 'b': { 'x': 3, 'y': 4, 'z': 6}, 'c': {'z': 7}}) result = df.append(row) - assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) def test_append_length0_frame(self): df = DataFrame(columns=['A', 'B', 'C']) @@ -785,12 +784,12 @@ def test_append_different_columns(self): 'floats': np.random.randn(10), 'strings': ['foo', 'bar'] * 5}) - a = df[:5].ix[:, ['bools', 'ints', 'floats']] - b = df[5:].ix[:, ['strings', 'ints', 'floats']] + a = df[:5].loc[:, ['bools', 'ints', 'floats']] + b = df[5:].loc[:, ['strings', 'ints', 'floats']] appended = a.append(b) - self.assertTrue(isnull(appended['strings'][0:4]).all()) - self.assertTrue(isnull(appended['bools'][5:]).all()) + assert isnull(appended['strings'][0:4]).all() + assert isnull(appended['bools'][5:]).all() def test_append_many(self): chunks = [self.frame[:5], self.frame[5:10], @@ -802,9 +801,9 @@ def test_append_many(self): chunks[-1] = chunks[-1].copy() chunks[-1]['foo'] = 'bar' result = chunks[0].append(chunks[1:]) - tm.assert_frame_equal(result.ix[:, self.frame.columns], self.frame) - self.assertTrue((result['foo'][15:] == 'bar').all()) - self.assertTrue(result['foo'][:15].isnull().all()) + tm.assert_frame_equal(result.loc[:, self.frame.columns], self.frame) + assert (result['foo'][15:] == 'bar').all() + assert result['foo'][:15].isnull().all() def test_append_preserve_index_name(self): # #980 @@ -815,7 +814,7 @@ def test_append_preserve_index_name(self): df2 = df2.set_index(['A']) result = df1.append(df2) - self.assertEqual(result.index.name, 'A') + assert result.index.name == 'A' def test_append_dtype_coerce(self): @@ -850,46 +849,44 @@ def test_append_missing_column_proper_upcast(self): dtype=bool)}) appended = df1.append(df2, ignore_index=True) - self.assertEqual(appended['A'].dtype, 'f8') - self.assertEqual(appended['B'].dtype, 'O') + assert appended['A'].dtype == 'f8' + assert appended['B'].dtype == 'O' class TestConcatenate(ConcatenateBase): def test_concat_copy(self): - df = DataFrame(np.random.randn(4, 3)) df2 = DataFrame(np.random.randint(0, 10, size=4).reshape(4, 1)) df3 = DataFrame({5: 'foo'}, index=range(4)) - # these are actual copies + # These are actual copies. 
result = concat([df, df2, df3], axis=1, copy=True) + for b in result._data.blocks: - self.assertIsNone(b.values.base) + assert b.values.base is None - # these are the same + # These are the same. result = concat([df, df2, df3], axis=1, copy=False) + for b in result._data.blocks: if b.is_float: - self.assertTrue( - b.values.base is df._data.blocks[0].values.base) + assert b.values.base is df._data.blocks[0].values.base elif b.is_integer: - self.assertTrue( - b.values.base is df2._data.blocks[0].values.base) + assert b.values.base is df2._data.blocks[0].values.base elif b.is_object: - self.assertIsNotNone(b.values.base) + assert b.values.base is not None - # float block was consolidated + # Float block was consolidated. df4 = DataFrame(np.random.randn(4, 1)) result = concat([df, df2, df3, df4], axis=1, copy=False) for b in result._data.blocks: if b.is_float: - self.assertIsNone(b.values.base) + assert b.values.base is None elif b.is_integer: - self.assertTrue( - b.values.base is df2._data.blocks[0].values.base) + assert b.values.base is df2._data.blocks[0].values.base elif b.is_object: - self.assertIsNotNone(b.values.base) + assert b.values.base is not None def test_concat_with_group_keys(self): df = DataFrame(np.random.randn(4, 3)) @@ -929,15 +926,15 @@ def test_concat_with_group_keys(self): def test_concat_keys_specific_levels(self): df = DataFrame(np.random.randn(10, 4)) - pieces = [df.ix[:, [0, 1]], df.ix[:, [2]], df.ix[:, [3]]] + pieces = [df.iloc[:, [0, 1]], df.iloc[:, [2]], df.iloc[:, [3]]] level = ['three', 'two', 'one', 'zero'] result = concat(pieces, axis=1, keys=['one', 'two', 'three'], levels=[level], names=['group_key']) - self.assert_index_equal(result.columns.levels[0], - Index(level, name='group_key')) - self.assertEqual(result.columns.names[0], 'group_key') + tm.assert_index_equal(result.columns.levels[0], + Index(level, name='group_key')) + assert result.columns.names[0] == 'group_key' def test_concat_dataframe_keys_bug(self): t1 = DataFrame({ @@ -948,8 +945,7 @@ def test_concat_dataframe_keys_bug(self): # it works result = concat([t1, t2], axis=1, keys=['t1', 't2']) - self.assertEqual(list(result.columns), [('t1', 'value'), - ('t2', 'value')]) + assert list(result.columns) == [('t1', 'value'), ('t2', 'value')] def test_concat_series_partial_columns_names(self): # GH10698 @@ -1023,10 +1019,10 @@ def test_concat_multiindex_with_keys(self): columns=Index(['A', 'B', 'C'], name='exp')) result = concat([frame, frame], keys=[0, 1], names=['iteration']) - self.assertEqual(result.index.names, ('iteration',) + index.names) - tm.assert_frame_equal(result.ix[0], frame) - tm.assert_frame_equal(result.ix[1], frame) - self.assertEqual(result.index.nlevels, 3) + assert result.index.names == ('iteration',) + index.names + tm.assert_frame_equal(result.loc[0], frame) + tm.assert_frame_equal(result.loc[1], frame) + assert result.index.nlevels == 3 def test_concat_multiindex_with_tz(self): # GH 6606 @@ -1049,6 +1045,30 @@ def test_concat_multiindex_with_tz(self): result = concat([df, df]) tm.assert_frame_equal(result, expected) + def test_concat_multiindex_with_none_in_index_names(self): + # GH 15787 + index = pd.MultiIndex.from_product([[1], range(5)], + names=['level1', None]) + df = pd.DataFrame({'col': range(5)}, index=index, dtype=np.int32) + + result = concat([df, df], keys=[1, 2], names=['level2']) + index = pd.MultiIndex.from_product([[1, 2], [1], range(5)], + names=['level2', 'level1', None]) + expected = pd.DataFrame({'col': list(range(5)) * 2}, + index=index, dtype=np.int32) + 
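The `copy=True`/`copy=False` assertions in `test_concat_copy` rest on NumPy's `.base` attribute: a view records the array whose memory it shares, while an owning copy reports `None`. In isolation:

```python
import numpy as np

a = np.array([[0, 1, 2], [3, 4, 5]])  # owns its memory

view = a[:, :2]         # basic slicing returns a view
copy = a[:, :2].copy()  # .copy() allocates fresh memory

assert view.base is a     # a view remembers the owning array
assert copy.base is None  # an owning array has no base
```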
assert_frame_equal(result, expected) + + result = concat([df, df[:2]], keys=[1, 2], names=['level2']) + level2 = [1] * 5 + [2] * 2 + level1 = [1] * 7 + no_name = list(range(5)) + list(range(2)) + tuples = list(zip(level2, level1, no_name)) + index = pd.MultiIndex.from_tuples(tuples, + names=['level2', 'level1', None]) + expected = pd.DataFrame({'col': no_name}, index=index, + dtype=np.int32) + assert_frame_equal(result, expected) + def test_concat_keys_and_levels(self): df = DataFrame(np.random.randn(1, 3)) df2 = DataFrame(np.random.randn(1, 4)) @@ -1067,35 +1087,34 @@ def test_concat_keys_and_levels(self): names=names + [None]) expected.index = exp_index - assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) # no names - result = concat([df, df2, df, df2], keys=[('foo', 'one'), ('foo', 'two'), ('baz', 'one'), ('baz', 'two')], levels=levels) - self.assertEqual(result.index.names, (None,) * 3) + assert result.index.names == (None,) * 3 # no levels result = concat([df, df2, df, df2], keys=[('foo', 'one'), ('foo', 'two'), ('baz', 'one'), ('baz', 'two')], names=['first', 'second']) - self.assertEqual(result.index.names, ('first', 'second') + (None,)) - self.assert_index_equal(result.index.levels[0], - Index(['baz', 'foo'], name='first')) + assert result.index.names == ('first', 'second') + (None,) + tm.assert_index_equal(result.index.levels[0], + Index(['baz', 'foo'], name='first')) def test_concat_keys_levels_no_overlap(self): # GH #1406 df = DataFrame(np.random.randn(1, 3), index=['a']) df2 = DataFrame(np.random.randn(1, 4), index=['b']) - self.assertRaises(ValueError, concat, [df, df], - keys=['one', 'two'], levels=[['foo', 'bar', 'baz']]) + pytest.raises(ValueError, concat, [df, df], + keys=['one', 'two'], levels=[['foo', 'bar', 'baz']]) - self.assertRaises(ValueError, concat, [df, df2], - keys=['one', 'two'], levels=[['foo', 'bar', 'baz']]) + pytest.raises(ValueError, concat, [df, df2], + keys=['one', 'two'], levels=[['foo', 'bar', 'baz']]) def test_concat_rename_index(self): a = DataFrame(np.random.rand(3, 3), @@ -1114,7 +1133,7 @@ def test_concat_rename_index(self): exp.index.set_names(names, inplace=True) tm.assert_frame_equal(result, exp) - self.assertEqual(result.index.names, exp.index.names) + assert result.index.names == exp.index.names def test_crossed_dtypes_weird_corner(self): columns = ['A', 'B', 'C', 'D'] @@ -1139,7 +1158,7 @@ def test_crossed_dtypes_weird_corner(self): df2 = DataFrame(np.random.randn(1, 4), index=['b']) result = concat( [df, df2], keys=['one', 'two'], names=['first', 'second']) - self.assertEqual(result.index.names, ('first', 'second')) + assert result.index.names == ('first', 'second') def test_dups_index(self): # GH 4771 @@ -1202,7 +1221,7 @@ def test_handle_empty_objects(self): frames = [baz, empty, empty, df[5:]] concatted = concat(frames, axis=0) - expected = df.ix[:, ['a', 'b', 'c', 'd', 'foo']] + expected = df.loc[:, ['a', 'b', 'c', 'd', 'foo']] expected['foo'] = expected['foo'].astype('O') expected.loc[0:4, 'foo'] = 'bar' @@ -1285,8 +1304,9 @@ def test_concat_mixed_objs(self): assert_frame_equal(result, expected) # invalid concatenation of mixed dims - panel = tm.makePanel() - self.assertRaises(ValueError, lambda: concat([panel, s1], axis=1)) + with catch_warnings(record=True): + panel = tm.makePanel() + pytest.raises(ValueError, lambda: concat([panel, s1], axis=1)) def test_empty_dtype_coerce(self): @@ -1324,87 +1344,90 @@ def test_dtype_coerceion(self): tm.assert_series_equal(result.dtypes, df.dtypes) def
test_panel_concat_other_axes(self): - panel = tm.makePanel() + with catch_warnings(record=True): + panel = tm.makePanel() - p1 = panel.ix[:, :5, :] - p2 = panel.ix[:, 5:, :] + p1 = panel.iloc[:, :5, :] + p2 = panel.iloc[:, 5:, :] - result = concat([p1, p2], axis=1) - tm.assert_panel_equal(result, panel) + result = concat([p1, p2], axis=1) + tm.assert_panel_equal(result, panel) - p1 = panel.ix[:, :, :2] - p2 = panel.ix[:, :, 2:] + p1 = panel.iloc[:, :, :2] + p2 = panel.iloc[:, :, 2:] - result = concat([p1, p2], axis=2) - tm.assert_panel_equal(result, panel) + result = concat([p1, p2], axis=2) + tm.assert_panel_equal(result, panel) - # if things are a bit misbehaved - p1 = panel.ix[:2, :, :2] - p2 = panel.ix[:, :, 2:] - p1['ItemC'] = 'baz' + # if things are a bit misbehaved + p1 = panel.iloc[:2, :, :2] + p2 = panel.iloc[:, :, 2:] + p1['ItemC'] = 'baz' - result = concat([p1, p2], axis=2) + result = concat([p1, p2], axis=2) - expected = panel.copy() - expected['ItemC'] = expected['ItemC'].astype('O') - expected.ix['ItemC', :, :2] = 'baz' - tm.assert_panel_equal(result, expected) + expected = panel.copy() + expected['ItemC'] = expected['ItemC'].astype('O') + expected.loc['ItemC', :, :2] = 'baz' + tm.assert_panel_equal(result, expected) def test_panel_concat_buglet(self): - # #2257 - def make_panel(): - index = 5 - cols = 3 + with catch_warnings(record=True): + # #2257 + def make_panel(): + index = 5 + cols = 3 - def df(): - return DataFrame(np.random.randn(index, cols), - index=["I%s" % i for i in range(index)], - columns=["C%s" % i for i in range(cols)]) - return Panel(dict([("Item%s" % x, df()) for x in ['A', 'B', 'C']])) + def df(): + return DataFrame(np.random.randn(index, cols), + index=["I%s" % i for i in range(index)], + columns=["C%s" % i for i in range(cols)]) + return Panel(dict([("Item%s" % x, df()) + for x in ['A', 'B', 'C']])) - panel1 = make_panel() - panel2 = make_panel() + panel1 = make_panel() + panel2 = make_panel() - panel2 = panel2.rename_axis(dict([(x, "%s_1" % x) - for x in panel2.major_axis]), - axis=1) + panel2 = panel2.rename_axis(dict([(x, "%s_1" % x) + for x in panel2.major_axis]), + axis=1) - panel3 = panel2.rename_axis(lambda x: '%s_1' % x, axis=1) - panel3 = panel3.rename_axis(lambda x: '%s_1' % x, axis=2) + panel3 = panel2.rename_axis(lambda x: '%s_1' % x, axis=1) + panel3 = panel3.rename_axis(lambda x: '%s_1' % x, axis=2) - # it works! - concat([panel1, panel3], axis=1, verify_integrity=True) + # it works! 
+ concat([panel1, panel3], axis=1, verify_integrity=True) def test_panel4d_concat(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): p4d = tm.makePanel4D() - p1 = p4d.ix[:, :, :5, :] - p2 = p4d.ix[:, :, 5:, :] + p1 = p4d.iloc[:, :, :5, :] + p2 = p4d.iloc[:, :, 5:, :] result = concat([p1, p2], axis=2) tm.assert_panel4d_equal(result, p4d) - p1 = p4d.ix[:, :, :, :2] - p2 = p4d.ix[:, :, :, 2:] + p1 = p4d.iloc[:, :, :, :2] + p2 = p4d.iloc[:, :, :, 2:] result = concat([p1, p2], axis=3) tm.assert_panel4d_equal(result, p4d) def test_panel4d_concat_mixed_type(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): p4d = tm.makePanel4D() # if things are a bit misbehaved - p1 = p4d.ix[:, :2, :, :2] - p2 = p4d.ix[:, :, :, 2:] + p1 = p4d.iloc[:, :2, :, :2] + p2 = p4d.iloc[:, :, :, 2:] p1['L5'] = 'baz' result = concat([p1, p2], axis=3) p2['L5'] = np.nan expected = concat([p1, p2], axis=3) - expected = expected.ix[result.labels] + expected = expected.loc[result.labels] tm.assert_panel4d_equal(result, expected) @@ -1417,7 +1440,7 @@ def test_concat_series(self): result = concat(pieces) tm.assert_series_equal(result, ts) - self.assertEqual(result.name, ts.name) + assert result.name == ts.name result = concat(pieces, keys=[0, 1, 2]) expected = ts.copy() @@ -1454,8 +1477,8 @@ def test_concat_series_axis1(self): s2.name = None result = concat([s, s2], axis=1) - self.assertTrue(np.array_equal( - result.columns, Index(['A', 0], dtype='object'))) + tm.assert_index_equal(result.columns, + Index(['A', 0], dtype='object')) # must reindex, #2603 s = Series(randn(3), index=['c', 'a', 'b'], name='A') @@ -1477,18 +1500,18 @@ def test_concat_exclude_none(self): pieces = [df[:5], None, None, df[5:]] result = concat(pieces) tm.assert_frame_equal(result, df) - self.assertRaises(ValueError, concat, [None, None]) + pytest.raises(ValueError, concat, [None, None]) def test_concat_datetime64_block(self): - from pandas.tseries.index import date_range + from pandas.core.indexes.datetimes import date_range rng = date_range('1/1/2000', periods=10) df = DataFrame({'time': rng}) result = concat([df, df]) - self.assertTrue((result.iloc[:10]['time'] == rng).all()) - self.assertTrue((result.iloc[10:]['time'] == rng).all()) + assert (result.iloc[:10]['time'] == rng).all() + assert (result.iloc[10:]['time'] == rng).all() def test_concat_timedelta64_block(self): from pandas import to_timedelta @@ -1498,8 +1521,8 @@ def test_concat_timedelta64_block(self): df = DataFrame({'time': rng}) result = concat([df, df]) - self.assertTrue((result.iloc[:10]['time'] == rng).all()) - self.assertTrue((result.iloc[10:]['time'] == rng).all()) + assert (result.iloc[:10]['time'] == rng).all() + assert (result.iloc[10:]['time'] == rng).all() def test_concat_keys_with_none(self): # #1649 @@ -1515,283 +1538,6 @@ def test_concat_keys_with_none(self): keys=['b', 'c', 'd', 'e']) tm.assert_frame_equal(result, expected) - def test_union_categorical(self): - # GH 13361 - data = [ - (list('abc'), list('abd'), list('abcabd')), - ([0, 1, 2], [2, 3, 4], [0, 1, 2, 2, 3, 4]), - ([0, 1.2, 2], [2, 3.4, 4], [0, 1.2, 2, 2, 3.4, 4]), - - (['b', 'b', np.nan, 'a'], ['a', np.nan, 'c'], - ['b', 'b', np.nan, 'a', 'a', np.nan, 'c']), - - (pd.date_range('2014-01-01', '2014-01-05'), - pd.date_range('2014-01-06', '2014-01-07'), - pd.date_range('2014-01-01', '2014-01-07')), - - (pd.date_range('2014-01-01', '2014-01-05', tz='US/Central'), - 
pd.date_range('2014-01-06', '2014-01-07', tz='US/Central'), - pd.date_range('2014-01-01', '2014-01-07', tz='US/Central')), - - (pd.period_range('2014-01-01', '2014-01-05'), - pd.period_range('2014-01-06', '2014-01-07'), - pd.period_range('2014-01-01', '2014-01-07')), - ] - - for a, b, combined in data: - for box in [Categorical, CategoricalIndex, Series]: - result = union_categoricals([box(Categorical(a)), - box(Categorical(b))]) - expected = Categorical(combined) - tm.assert_categorical_equal(result, expected, - check_category_order=True) - - # new categories ordered by appearance - s = Categorical(['x', 'y', 'z']) - s2 = Categorical(['a', 'b', 'c']) - result = union_categoricals([s, s2]) - expected = Categorical(['x', 'y', 'z', 'a', 'b', 'c'], - categories=['x', 'y', 'z', 'a', 'b', 'c']) - tm.assert_categorical_equal(result, expected) - - s = Categorical([0, 1.2, 2], ordered=True) - s2 = Categorical([0, 1.2, 2], ordered=True) - result = union_categoricals([s, s2]) - expected = Categorical([0, 1.2, 2, 0, 1.2, 2], ordered=True) - tm.assert_categorical_equal(result, expected) - - # must exactly match types - s = Categorical([0, 1.2, 2]) - s2 = Categorical([2, 3, 4]) - msg = 'dtype of categories must be the same' - with tm.assertRaisesRegexp(TypeError, msg): - union_categoricals([s, s2]) - - msg = 'No Categoricals to union' - with tm.assertRaisesRegexp(ValueError, msg): - union_categoricals([]) - - def test_union_categoricals_nan(self): - # GH 13759 - res = union_categoricals([pd.Categorical([1, 2, np.nan]), - pd.Categorical([3, 2, np.nan])]) - exp = Categorical([1, 2, np.nan, 3, 2, np.nan]) - tm.assert_categorical_equal(res, exp) - - res = union_categoricals([pd.Categorical(['A', 'B']), - pd.Categorical(['B', 'B', np.nan])]) - exp = Categorical(['A', 'B', 'B', 'B', np.nan]) - tm.assert_categorical_equal(res, exp) - - val1 = [pd.Timestamp('2011-01-01'), pd.Timestamp('2011-03-01'), - pd.NaT] - val2 = [pd.NaT, pd.Timestamp('2011-01-01'), - pd.Timestamp('2011-02-01')] - - res = union_categoricals([pd.Categorical(val1), pd.Categorical(val2)]) - exp = Categorical(val1 + val2, - categories=[pd.Timestamp('2011-01-01'), - pd.Timestamp('2011-03-01'), - pd.Timestamp('2011-02-01')]) - tm.assert_categorical_equal(res, exp) - - # all NaN - res = union_categoricals([pd.Categorical([np.nan, np.nan]), - pd.Categorical(['X'])]) - exp = Categorical([np.nan, np.nan, 'X']) - tm.assert_categorical_equal(res, exp) - - res = union_categoricals([pd.Categorical([np.nan, np.nan]), - pd.Categorical([np.nan, np.nan])]) - exp = Categorical([np.nan, np.nan, np.nan, np.nan]) - tm.assert_categorical_equal(res, exp) - - def test_union_categoricals_empty(self): - # GH 13759 - res = union_categoricals([pd.Categorical([]), - pd.Categorical([])]) - exp = Categorical([]) - tm.assert_categorical_equal(res, exp) - - res = union_categoricals([pd.Categorical([]), - pd.Categorical([1.0])]) - exp = Categorical([1.0]) - tm.assert_categorical_equal(res, exp) - - # to make dtype equal - nanc = pd.Categorical(np.array([np.nan], dtype=np.float64)) - res = union_categoricals([nanc, - pd.Categorical([])]) - tm.assert_categorical_equal(res, nanc) - - def test_union_categorical_same_category(self): - # check fastpath - c1 = Categorical([1, 2, 3, 4], categories=[1, 2, 3, 4]) - c2 = Categorical([3, 2, 1, np.nan], categories=[1, 2, 3, 4]) - res = union_categoricals([c1, c2]) - exp = Categorical([1, 2, 3, 4, 3, 2, 1, np.nan], - categories=[1, 2, 3, 4]) - tm.assert_categorical_equal(res, exp) - - c1 = Categorical(['z', 'z', 'z'], categories=['x', 
'y', 'z']) - c2 = Categorical(['x', 'x', 'x'], categories=['x', 'y', 'z']) - res = union_categoricals([c1, c2]) - exp = Categorical(['z', 'z', 'z', 'x', 'x', 'x'], - categories=['x', 'y', 'z']) - tm.assert_categorical_equal(res, exp) - - def test_union_categoricals_ordered(self): - c1 = Categorical([1, 2, 3], ordered=True) - c2 = Categorical([1, 2, 3], ordered=False) - - msg = 'Categorical.ordered must be the same' - with tm.assertRaisesRegexp(TypeError, msg): - union_categoricals([c1, c2]) - - res = union_categoricals([c1, c1]) - exp = Categorical([1, 2, 3, 1, 2, 3], ordered=True) - tm.assert_categorical_equal(res, exp) - - c1 = Categorical([1, 2, 3, np.nan], ordered=True) - c2 = Categorical([3, 2], categories=[1, 2, 3], ordered=True) - - res = union_categoricals([c1, c2]) - exp = Categorical([1, 2, 3, np.nan, 3, 2], ordered=True) - tm.assert_categorical_equal(res, exp) - - c1 = Categorical([1, 2, 3], ordered=True) - c2 = Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True) - - msg = "to union ordered Categoricals, all categories must be the same" - with tm.assertRaisesRegexp(TypeError, msg): - union_categoricals([c1, c2]) - - def test_union_categoricals_sort(self): - # GH 13846 - c1 = Categorical(['x', 'y', 'z']) - c2 = Categorical(['a', 'b', 'c']) - result = union_categoricals([c1, c2], sort_categories=True) - expected = Categorical(['x', 'y', 'z', 'a', 'b', 'c'], - categories=['a', 'b', 'c', 'x', 'y', 'z']) - tm.assert_categorical_equal(result, expected) - - # fastpath - c1 = Categorical(['a', 'b'], categories=['b', 'a', 'c']) - c2 = Categorical(['b', 'c'], categories=['b', 'a', 'c']) - result = union_categoricals([c1, c2], sort_categories=True) - expected = Categorical(['a', 'b', 'b', 'c'], - categories=['a', 'b', 'c']) - tm.assert_categorical_equal(result, expected) - - c1 = Categorical(['a', 'b'], categories=['c', 'a', 'b']) - c2 = Categorical(['b', 'c'], categories=['c', 'a', 'b']) - result = union_categoricals([c1, c2], sort_categories=True) - expected = Categorical(['a', 'b', 'b', 'c'], - categories=['a', 'b', 'c']) - tm.assert_categorical_equal(result, expected) - - # fastpath - skip resort - c1 = Categorical(['a', 'b'], categories=['a', 'b', 'c']) - c2 = Categorical(['b', 'c'], categories=['a', 'b', 'c']) - result = union_categoricals([c1, c2], sort_categories=True) - expected = Categorical(['a', 'b', 'b', 'c'], - categories=['a', 'b', 'c']) - tm.assert_categorical_equal(result, expected) - - c1 = Categorical(['x', np.nan]) - c2 = Categorical([np.nan, 'b']) - result = union_categoricals([c1, c2], sort_categories=True) - expected = Categorical(['x', np.nan, np.nan, 'b'], - categories=['b', 'x']) - tm.assert_categorical_equal(result, expected) - - c1 = Categorical([np.nan]) - c2 = Categorical([np.nan]) - result = union_categoricals([c1, c2], sort_categories=True) - expected = Categorical([np.nan, np.nan], categories=[]) - tm.assert_categorical_equal(result, expected) - - c1 = Categorical([]) - c2 = Categorical([]) - result = union_categoricals([c1, c2], sort_categories=True) - expected = Categorical([]) - tm.assert_categorical_equal(result, expected) - - c1 = Categorical(['b', 'a'], categories=['b', 'a', 'c'], ordered=True) - c2 = Categorical(['a', 'c'], categories=['b', 'a', 'c'], ordered=True) - with tm.assertRaises(TypeError): - union_categoricals([c1, c2], sort_categories=True) - - def test_union_categoricals_sort_false(self): - # GH 13846 - c1 = Categorical(['x', 'y', 'z']) - c2 = Categorical(['a', 'b', 'c']) - result = union_categoricals([c1, c2], 
sort_categories=False) - expected = Categorical(['x', 'y', 'z', 'a', 'b', 'c'], - categories=['x', 'y', 'z', 'a', 'b', 'c']) - tm.assert_categorical_equal(result, expected) - - # fastpath - c1 = Categorical(['a', 'b'], categories=['b', 'a', 'c']) - c2 = Categorical(['b', 'c'], categories=['b', 'a', 'c']) - result = union_categoricals([c1, c2], sort_categories=False) - expected = Categorical(['a', 'b', 'b', 'c'], - categories=['b', 'a', 'c']) - tm.assert_categorical_equal(result, expected) - - # fastpath - skip resort - c1 = Categorical(['a', 'b'], categories=['a', 'b', 'c']) - c2 = Categorical(['b', 'c'], categories=['a', 'b', 'c']) - result = union_categoricals([c1, c2], sort_categories=False) - expected = Categorical(['a', 'b', 'b', 'c'], - categories=['a', 'b', 'c']) - tm.assert_categorical_equal(result, expected) - - c1 = Categorical(['x', np.nan]) - c2 = Categorical([np.nan, 'b']) - result = union_categoricals([c1, c2], sort_categories=False) - expected = Categorical(['x', np.nan, np.nan, 'b'], - categories=['x', 'b']) - tm.assert_categorical_equal(result, expected) - - c1 = Categorical([np.nan]) - c2 = Categorical([np.nan]) - result = union_categoricals([c1, c2], sort_categories=False) - expected = Categorical([np.nan, np.nan], categories=[]) - tm.assert_categorical_equal(result, expected) - - c1 = Categorical([]) - c2 = Categorical([]) - result = union_categoricals([c1, c2], sort_categories=False) - expected = Categorical([]) - tm.assert_categorical_equal(result, expected) - - c1 = Categorical(['b', 'a'], categories=['b', 'a', 'c'], ordered=True) - c2 = Categorical(['a', 'c'], categories=['b', 'a', 'c'], ordered=True) - result = union_categoricals([c1, c2], sort_categories=False) - expected = Categorical(['b', 'a', 'a', 'c'], - categories=['b', 'a', 'c'], ordered=True) - tm.assert_categorical_equal(result, expected) - - def test_union_categorical_unwrap(self): - # GH 14173 - c1 = Categorical(['a', 'b']) - c2 = pd.Series(['b', 'c'], dtype='category') - result = union_categoricals([c1, c2]) - expected = Categorical(['a', 'b', 'b', 'c']) - tm.assert_categorical_equal(result, expected) - - c2 = CategoricalIndex(c2) - result = union_categoricals([c1, c2]) - tm.assert_categorical_equal(result, expected) - - c1 = Series(c1) - result = union_categoricals([c1, c2]) - tm.assert_categorical_equal(result, expected) - - with tm.assertRaises(TypeError): - union_categoricals([c1, ['a', 'b', 'c']]) - def test_concat_bug_1719(self): ts1 = tm.makeTimeSeries() ts2 = tm.makeTimeSeries()[::2] @@ -1801,7 +1547,7 @@ def test_concat_bug_1719(self): left = concat([ts1, ts2], join='outer', axis=1) right = concat([ts2, ts1], join='outer', axis=1) - self.assertEqual(len(left), len(right)) + assert len(left) == len(right) def test_concat_bug_2972(self): ts0 = Series(np.zeros(5)) @@ -1829,13 +1575,23 @@ def test_concat_bug_3602(self): result = concat([df1, df2], axis=1) assert_frame_equal(result, expected) + def test_concat_inner_join_empty(self): + # GH 15328 + df_empty = pd.DataFrame() + df_a = pd.DataFrame({'a': [1, 2]}, index=[0, 1], dtype='int64') + df_expected = pd.DataFrame({'a': []}, index=[], dtype='int64') + + for how, expected in [('inner', df_expected), ('outer', df_a)]: + result = pd.concat([df_a, df_empty], axis=1, join=how) + assert_frame_equal(result, expected) + def test_concat_series_axis1_same_names_ignore_index(self): dates = date_range('01-Jan-2013', '01-Jan-2014', freq='MS')[0:-1] s1 = Series(randn(len(dates)), index=dates, name='value') s2 = Series(randn(len(dates)), index=dates, 
name='value') result = concat([s1, s2], axis=1, ignore_index=True) - self.assertTrue(np.array_equal(result.columns, [0, 1])) + assert np.array_equal(result.columns, [0, 1]) def test_concat_iterables(self): from collections import deque, Iterable @@ -1878,12 +1634,12 @@ def test_concat_invalid(self): # trying to concat a ndframe with a non-ndframe df1 = mkdf(10, 2) for obj in [1, dict(), [1, 2], (1, 2)]: - self.assertRaises(TypeError, lambda x: concat([df1, obj])) + pytest.raises(TypeError, lambda x: concat([df1, obj])) def test_concat_invalid_first_argument(self): df1 = mkdf(10, 2) df2 = mkdf(10, 2) - self.assertRaises(TypeError, concat, df1, df2) + pytest.raises(TypeError, concat, df1, df2) # generator ok though concat(DataFrame(np.random.rand(5, 5)) for _ in range(3)) @@ -1948,8 +1704,7 @@ def test_concat_tz_frame(self): assert_frame_equal(df2, df3) def test_concat_tz_series(self): - # GH 11755 - # tz and no tz + # gh-11755: tz and no tz x = Series(date_range('20151124 08:00', '20151124 09:00', freq='1h', tz='UTC')) @@ -1959,8 +1714,7 @@ def test_concat_tz_series(self): result = concat([x, y], ignore_index=True) tm.assert_series_equal(result, expected) - # GH 11887 - # concat tz and object + # gh-11887: concat tz and object x = Series(date_range('20151124 08:00', '20151124 09:00', freq='1h', tz='UTC')) @@ -1970,10 +1724,8 @@ def test_concat_tz_series(self): result = concat([x, y], ignore_index=True) tm.assert_series_equal(result, expected) - # 12217 - # 12306 fixed I think - - # Concat'ing two UTC times + # see gh-12217 and gh-12306 + # Concatenating two UTC times first = pd.DataFrame([[datetime(2016, 1, 1)]]) first[0] = first[0].dt.tz_localize('UTC') @@ -1981,9 +1733,9 @@ def test_concat_tz_series(self): second[0] = second[0].dt.tz_localize('UTC') result = pd.concat([first, second]) - self.assertEqual(result[0].dtype, 'datetime64[ns, UTC]') + assert result[0].dtype == 'datetime64[ns, UTC]' - # Concat'ing two London times + # Concatenating two London times first = pd.DataFrame([[datetime(2016, 1, 1)]]) first[0] = first[0].dt.tz_localize('Europe/London') @@ -1991,9 +1743,9 @@ def test_concat_tz_series(self): second[0] = second[0].dt.tz_localize('Europe/London') result = pd.concat([first, second]) - self.assertEqual(result[0].dtype, 'datetime64[ns, Europe/London]') + assert result[0].dtype == 'datetime64[ns, Europe/London]' - # Concat'ing 2+1 London times + # Concatenating 2+1 London times first = pd.DataFrame([[datetime(2016, 1, 1)], [datetime(2016, 1, 2)]]) first[0] = first[0].dt.tz_localize('Europe/London') @@ -2001,7 +1753,7 @@ def test_concat_tz_series(self): second[0] = second[0].dt.tz_localize('Europe/London') result = pd.concat([first, second]) - self.assertEqual(result[0].dtype, 'datetime64[ns, Europe/London]') + assert result[0].dtype == 'datetime64[ns, Europe/London]' # Concat'ing 1+2 London times first = pd.DataFrame([[datetime(2016, 1, 1)]]) @@ -2011,11 +1763,10 @@ def test_concat_tz_series(self): second[0] = second[0].dt.tz_localize('Europe/London') result = pd.concat([first, second]) - self.assertEqual(result[0].dtype, 'datetime64[ns, Europe/London]') + assert result[0].dtype == 'datetime64[ns, Europe/London]' def test_concat_tz_series_with_datetimelike(self): - # GH 12620 - # tz and timedelta + # see gh-12620: tz and timedelta x = [pd.Timestamp('2011-01-01', tz='US/Eastern'), pd.Timestamp('2011-02-01', tz='US/Eastern')] y = [pd.Timedelta('1 day'), pd.Timedelta('2 day')] @@ -2028,16 +1779,18 @@ def test_concat_tz_series_with_datetimelike(self): tm.assert_series_equal(result, 
pd.Series(x + y, dtype='object')) def test_concat_tz_series_tzlocal(self): - # GH 13583 + # see gh-13583 tm._skip_if_no_dateutil() import dateutil + x = [pd.Timestamp('2011-01-01', tz=dateutil.tz.tzlocal()), pd.Timestamp('2011-02-01', tz=dateutil.tz.tzlocal())] y = [pd.Timestamp('2012-01-01', tz=dateutil.tz.tzlocal()), pd.Timestamp('2012-02-01', tz=dateutil.tz.tzlocal())] + result = concat([pd.Series(x), pd.Series(y)], ignore_index=True) tm.assert_series_equal(result, pd.Series(x + y)) - self.assertEqual(result.dtype, 'datetime64[ns, tzlocal()]') + assert result.dtype == 'datetime64[ns, tzlocal()]' def test_concat_period_series(self): x = Series(pd.PeriodIndex(['2015-11-01', '2015-12-01'], freq='D')) @@ -2045,7 +1798,7 @@ def test_concat_period_series(self): expected = Series([x[0], x[1], y[0], y[1]], dtype='object') result = concat([x, y], ignore_index=True) tm.assert_series_equal(result, expected) - self.assertEqual(result.dtype, 'object') + assert result.dtype == 'object' # different freq x = Series(pd.PeriodIndex(['2015-11-01', '2015-12-01'], freq='D')) @@ -2053,14 +1806,14 @@ def test_concat_period_series(self): expected = Series([x[0], x[1], y[0], y[1]], dtype='object') result = concat([x, y], ignore_index=True) tm.assert_series_equal(result, expected) - self.assertEqual(result.dtype, 'object') + assert result.dtype == 'object' x = Series(pd.PeriodIndex(['2015-11-01', '2015-12-01'], freq='D')) y = Series(pd.PeriodIndex(['2015-11-01', '2015-12-01'], freq='M')) expected = Series([x[0], x[1], y[0], y[1]], dtype='object') result = concat([x, y], ignore_index=True) tm.assert_series_equal(result, expected) - self.assertEqual(result.dtype, 'object') + assert result.dtype == 'object' # non-period x = Series(pd.PeriodIndex(['2015-11-01', '2015-12-01'], freq='D')) @@ -2068,14 +1821,14 @@ def test_concat_period_series(self): expected = Series([x[0], x[1], y[0], y[1]], dtype='object') result = concat([x, y], ignore_index=True) tm.assert_series_equal(result, expected) - self.assertEqual(result.dtype, 'object') + assert result.dtype == 'object' x = Series(pd.PeriodIndex(['2015-11-01', '2015-12-01'], freq='D')) y = Series(['A', 'B']) expected = Series([x[0], x[1], y[0], y[1]], dtype='object') result = concat([x, y], ignore_index=True) tm.assert_series_equal(result, expected) - self.assertEqual(result.dtype, 'object') + assert result.dtype == 'object' def test_concat_empty_series(self): # GH 11082 @@ -2105,7 +1858,7 @@ def test_default_index(self): s1 = pd.Series([1, 2, 3], name='x') s2 = pd.Series([4, 5, 6], name='y') res = pd.concat([s1, s2], axis=1, ignore_index=True) - self.assertIsInstance(res.columns, pd.RangeIndex) + assert isinstance(res.columns, pd.RangeIndex) exp = pd.DataFrame([[1, 4], [2, 5], [3, 6]]) # use check_index_type=True to check the result have # RangeIndex (default index) @@ -2116,7 +1869,7 @@ def test_default_index(self): s1 = pd.Series([1, 2, 3]) s2 = pd.Series([4, 5, 6]) res = pd.concat([s1, s2], axis=1, ignore_index=False) - self.assertIsInstance(res.columns, pd.RangeIndex) + assert isinstance(res.columns, pd.RangeIndex) exp = pd.DataFrame([[1, 4], [2, 5], [3, 6]]) exp.columns = pd.RangeIndex(2) tm.assert_frame_equal(res, exp, check_index_type=True, @@ -2151,7 +1904,69 @@ def test_concat_multiindex_rangeindex(self): exp = df.iloc[[2, 3, 4, 5], :] tm.assert_frame_equal(res, exp) + def test_concat_multiindex_dfs_with_deepcopy(self): + # GH 9967 + from copy import deepcopy + example_multiindex1 = pd.MultiIndex.from_product([['a'], ['b']]) + example_dataframe1 = 
pd.DataFrame([0], index=example_multiindex1) + + example_multiindex2 = pd.MultiIndex.from_product([['a'], ['c']]) + example_dataframe2 = pd.DataFrame([1], index=example_multiindex2) + + example_dict = {'s1': example_dataframe1, 's2': example_dataframe2} + expected_index = pd.MultiIndex(levels=[['s1', 's2'], + ['a'], + ['b', 'c']], + labels=[[0, 1], [0, 0], [0, 1]], + names=['testname', None, None]) + expected = pd.DataFrame([[0], [1]], index=expected_index) + result_copy = pd.concat(deepcopy(example_dict), names=['testname']) + tm.assert_frame_equal(result_copy, expected) + result_no_copy = pd.concat(example_dict, names=['testname']) + tm.assert_frame_equal(result_no_copy, expected) + + def test_concat_categoricalindex(self): + # GH 16111, categories that aren't lexsorted + categories = [9, 0, 1, 2, 3] + + a = pd.Series(1, index=pd.CategoricalIndex([9, 0], + categories=categories)) + b = pd.Series(2, index=pd.CategoricalIndex([0, 1], + categories=categories)) + c = pd.Series(3, index=pd.CategoricalIndex([1, 2], + categories=categories)) + + result = pd.concat([a, b, c], axis=1) + + exp_idx = pd.CategoricalIndex([0, 1, 2, 9]) + exp = pd.DataFrame({0: [1, np.nan, np.nan, 1], + 1: [2, 2, np.nan, np.nan], + 2: [np.nan, 3, 3, np.nan]}, + columns=[0, 1, 2], + index=exp_idx) + tm.assert_frame_equal(result, exp) + -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) +@pytest.mark.parametrize('pdt', [pd.Series, pd.DataFrame, pd.Panel]) +@pytest.mark.parametrize('dt', np.sctypes['float']) +def test_concat_no_unnecessary_upcast(dt, pdt): + with catch_warnings(record=True): + # GH 13247 + dims = pdt().ndim + dfs = [pdt(np.array([1], dtype=dt, ndmin=dims)), + pdt(np.array([np.nan], dtype=dt, ndmin=dims)), + pdt(np.array([5], dtype=dt, ndmin=dims))] + x = pd.concat(dfs) + assert x.values.dtype == dt + + +@pytest.mark.parametrize('pdt', [pd.Series, pd.DataFrame, pd.Panel]) +@pytest.mark.parametrize('dt', np.sctypes['int']) +def test_concat_will_upcast(dt, pdt): + with catch_warnings(record=True): + dims = pdt().ndim + dfs = [pdt(np.array([1], dtype=dt, ndmin=dims)), + pdt(np.array([np.nan], ndmin=dims)), + pdt(np.array([5], dtype=dt, ndmin=dims))] + x = pd.concat(dfs) + assert x.values.dtype == 'float64' diff --git a/pandas/tools/tests/test_join.py b/pandas/tests/reshape/test_join.py similarity index 78% rename from pandas/tools/tests/test_join.py rename to pandas/tests/reshape/test_join.py index f33d5f16cd439..e25661fb65271 100644 --- a/pandas/tools/tests/test_join.py +++ b/pandas/tests/reshape/test_join.py @@ -1,30 +1,27 @@ # pylint: disable=E1103 -import nose - +from warnings import catch_warnings from numpy.random import randn import numpy as np +import pytest import pandas as pd from pandas.compat import lrange import pandas.compat as compat -from pandas.tools.merge import merge, concat from pandas.util.testing import assert_frame_equal -from pandas import DataFrame, MultiIndex, Series +from pandas import DataFrame, MultiIndex, Series, Index, merge, concat -import pandas._join as _join +from pandas._libs import join as libjoin import pandas.util.testing as tm -from pandas.tools.tests.test_merge import get_test_data, N, NGROUPS +from pandas.tests.reshape.test_merge import get_test_data, N, NGROUPS a_ = np.array -class TestJoin(tm.TestCase): - - _multiprocess_can_split_ = True +class TestJoin(object): - def setUp(self): + def setup_method(self, method): # aggregate multiple columns self.df = DataFrame({'key1': get_test_data(), 'key2': 
get_test_data(), @@ -51,7 +48,7 @@ def test_cython_left_outer_join(self): right = a_([1, 1, 0, 4, 2, 2, 1], dtype=np.int64) max_group = 5 - ls, rs = _join.left_outer_join(left, right, max_group) + ls, rs = libjoin.left_outer_join(left, right, max_group) exp_ls = left.argsort(kind='mergesort') exp_rs = right.argsort(kind='mergesort') @@ -67,15 +64,15 @@ def test_cython_left_outer_join(self): exp_rs = exp_rs.take(exp_ri) exp_rs[exp_ri == -1] = -1 - self.assert_numpy_array_equal(ls, exp_ls, check_dtype=False) - self.assert_numpy_array_equal(rs, exp_rs, check_dtype=False) + tm.assert_numpy_array_equal(ls, exp_ls, check_dtype=False) + tm.assert_numpy_array_equal(rs, exp_rs, check_dtype=False) def test_cython_right_outer_join(self): left = a_([0, 1, 2, 1, 2, 0, 0, 1, 2, 3, 3], dtype=np.int64) right = a_([1, 1, 0, 4, 2, 2, 1], dtype=np.int64) max_group = 5 - rs, ls = _join.left_outer_join(right, left, max_group) + rs, ls = libjoin.left_outer_join(right, left, max_group) exp_ls = left.argsort(kind='mergesort') exp_rs = right.argsort(kind='mergesort') @@ -93,15 +90,15 @@ def test_cython_right_outer_join(self): exp_rs = exp_rs.take(exp_ri) exp_rs[exp_ri == -1] = -1 - self.assert_numpy_array_equal(ls, exp_ls, check_dtype=False) - self.assert_numpy_array_equal(rs, exp_rs, check_dtype=False) + tm.assert_numpy_array_equal(ls, exp_ls, check_dtype=False) + tm.assert_numpy_array_equal(rs, exp_rs, check_dtype=False) def test_cython_inner_join(self): left = a_([0, 1, 2, 1, 2, 0, 0, 1, 2, 3, 3], dtype=np.int64) right = a_([1, 1, 0, 4, 2, 2, 1, 4], dtype=np.int64) max_group = 5 - ls, rs = _join.inner_join(left, right, max_group) + ls, rs = libjoin.inner_join(left, right, max_group) exp_ls = left.argsort(kind='mergesort') exp_rs = right.argsort(kind='mergesort') @@ -117,8 +114,8 @@ def test_cython_inner_join(self): exp_rs = exp_rs.take(exp_ri) exp_rs[exp_ri == -1] = -1 - self.assert_numpy_array_equal(ls, exp_ls, check_dtype=False) - self.assert_numpy_array_equal(rs, exp_rs, check_dtype=False) + tm.assert_numpy_array_equal(ls, exp_ls, check_dtype=False) + tm.assert_numpy_array_equal(rs, exp_rs, check_dtype=False) def test_left_outer_join(self): joined_key2 = merge(self.df, self.df2, on='key2') @@ -156,25 +153,25 @@ def test_handle_overlap(self): joined = merge(self.df, self.df2, on='key2', suffixes=['.foo', '.bar']) - self.assertIn('key1.foo', joined) - self.assertIn('key1.bar', joined) + assert 'key1.foo' in joined + assert 'key1.bar' in joined def test_handle_overlap_arbitrary_key(self): joined = merge(self.df, self.df2, left_on='key2', right_on='key1', suffixes=['.foo', '.bar']) - self.assertIn('key1.foo', joined) - self.assertIn('key2.bar', joined) + assert 'key1.foo' in joined + assert 'key2.bar' in joined def test_join_on(self): target = self.target source = self.source merged = target.join(source, on='C') - self.assert_series_equal(merged['MergedA'], target['A'], - check_names=False) - self.assert_series_equal(merged['MergedD'], target['D'], - check_names=False) + tm.assert_series_equal(merged['MergedA'], target['A'], + check_names=False) + tm.assert_series_equal(merged['MergedD'], target['D'], + check_names=False) # join with duplicates (fix regression from DataFrame/Matrix merge) df = DataFrame({'key': ['a', 'a', 'b', 'b', 'c']}) @@ -193,19 +190,19 @@ def test_join_on(self): columns=['three']) joined = df_a.join(df_b, on='one') joined = joined.join(df_c, on='one') - self.assertTrue(np.isnan(joined['two']['c'])) - self.assertTrue(np.isnan(joined['three']['c'])) + assert np.isnan(joined['two']['c']) + 
assert np.isnan(joined['three']['c']) # merge column not present - self.assertRaises(KeyError, target.join, source, on='E') + pytest.raises(KeyError, target.join, source, on='E') # overlap source_copy = source.copy() source_copy['A'] = 0 - self.assertRaises(ValueError, target.join, source_copy, on='A') + pytest.raises(ValueError, target.join, source_copy, on='A') def test_join_on_fails_with_different_right_index(self): - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df = DataFrame({'a': np.random.choice(['m', 'f'], size=3), 'b': np.random.randn(3)}) df2 = DataFrame({'a': np.random.choice(['m', 'f'], size=10), @@ -214,7 +211,7 @@ def test_join_on_fails_with_different_right_index(self): merge(df, df2, left_on='a', right_index=True) def test_join_on_fails_with_different_left_index(self): - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df = DataFrame({'a': np.random.choice(['m', 'f'], size=3), 'b': np.random.randn(3)}, index=tm.makeCustomIndex(10, 2)) @@ -223,7 +220,7 @@ def test_join_on_fails_with_different_left_index(self): merge(df, df2, right_on='b', left_index=True) def test_join_on_fails_with_different_column_counts(self): - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df = DataFrame({'a': np.random.choice(['m', 'f'], size=3), 'b': np.random.randn(3)}) df2 = DataFrame({'a': np.random.choice(['m', 'f'], size=10), @@ -237,9 +234,9 @@ def test_join_on_fails_with_wrong_object_type(self): df = DataFrame({'a': [1, 1]}) for obj in wrongly_typed: - with tm.assertRaisesRegexp(ValueError, str(type(obj))): + with tm.assert_raises_regex(ValueError, str(type(obj))): merge(obj, df, left_on='a', right_on='a') - with tm.assertRaisesRegexp(ValueError, str(type(obj))): + with tm.assert_raises_regex(ValueError, str(type(obj))): merge(df, obj, left_on='a', right_on='a') def test_join_on_pass_vector(self): @@ -254,13 +251,13 @@ def test_join_with_len0(self): # nothing to merge merged = self.target.join(self.source.reindex([]), on='C') for col in self.source: - self.assertIn(col, merged) - self.assertTrue(merged[col].isnull().all()) + assert col in merged + assert merged[col].isnull().all() merged2 = self.target.join(self.source.reindex([]), on='C', how='inner') - self.assert_index_equal(merged2.columns, merged.columns) - self.assertEqual(len(merged2), 0) + tm.assert_index_equal(merged2.columns, merged.columns) + assert len(merged2) == 0 def test_join_on_inner(self): df = DataFrame({'key': ['a', 'a', 'd', 'b', 'b', 'c']}) @@ -270,11 +267,11 @@ def test_join_on_inner(self): expected = df.join(df2, on='key') expected = expected[expected['value'].notnull()] - self.assert_series_equal(joined['key'], expected['key'], - check_dtype=False) - self.assert_series_equal(joined['value'], expected['value'], - check_dtype=False) - self.assert_index_equal(joined.index, expected.index) + tm.assert_series_equal(joined['key'], expected['key'], + check_dtype=False) + tm.assert_series_equal(joined['value'], expected['value'], + check_dtype=False) + tm.assert_index_equal(joined.index, expected.index) def test_join_on_singlekey_list(self): df = DataFrame({'key': ['a', 'a', 'b', 'b', 'c']}) @@ -304,8 +301,8 @@ def test_join_index_mixed(self): df1 = DataFrame({'A': 1., 'B': 2, 'C': 'foo', 'D': True}, index=np.arange(10), columns=['A', 'B', 'C', 'D']) - self.assertEqual(df1['B'].dtype, np.int64) - self.assertEqual(df1['D'].dtype, np.bool_) + assert df1['B'].dtype == np.int64 + assert df1['D'].dtype == np.bool_ df2 = DataFrame({'A': 1., 'B': 2, 'C': 'foo', 'D': True},
index=np.arange(0, 10, 2), @@ -369,26 +366,26 @@ def test_join_multiindex(self): df2 = DataFrame(data=np.random.randn(6), index=index2, columns=['var Y']) - df1 = df1.sortlevel(0) - df2 = df2.sortlevel(0) + df1 = df1.sort_index(level=0) + df2 = df2.sort_index(level=0) joined = df1.join(df2, how='outer') - ex_index = index1._tuple_index.union(index2._tuple_index) + ex_index = Index(index1.values).union(Index(index2.values)) expected = df1.reindex(ex_index).join(df2.reindex(ex_index)) expected.index.names = index1.names assert_frame_equal(joined, expected) - self.assertEqual(joined.index.names, index1.names) + assert joined.index.names == index1.names - df1 = df1.sortlevel(1) - df2 = df2.sortlevel(1) + df1 = df1.sort_index(level=1) + df2 = df2.sort_index(level=1) - joined = df1.join(df2, how='outer').sortlevel(0) - ex_index = index1._tuple_index.union(index2._tuple_index) + joined = df1.join(df2, how='outer').sort_index(level=0) + ex_index = Index(index1.values).union(Index(index2.values)) expected = df1.reindex(ex_index).join(df2.reindex(ex_index)) expected.index.names = index1.names assert_frame_equal(joined, expected) - self.assertEqual(joined.index.names, index1.names) + assert joined.index.names == index1.names def test_join_inner_multiindex(self): key1 = ['bar', 'bar', 'bar', 'foo', 'foo', 'baz', 'baz', 'qux', @@ -425,10 +422,10 @@ def test_join_inner_multiindex(self): expected = expected.drop(['first', 'second'], axis=1) expected.index = joined.index - self.assertTrue(joined.index.is_monotonic) + assert joined.index.is_monotonic assert_frame_equal(joined, expected) - # _assert_same_contents(expected, expected2.ix[:, expected.columns]) + # _assert_same_contents(expected, expected2.loc[:, expected.columns]) def test_join_hierarchical_mixed(self): # GH 2024 @@ -440,17 +437,17 @@ def test_join_hierarchical_mixed(self): # GH 9455, 12219 with tm.assert_produces_warning(UserWarning): result = merge(new_df, other_df, left_index=True, right_index=True) - self.assertTrue(('b', 'mean') in result) - self.assertTrue('b' in result) + assert ('b', 'mean') in result + assert 'b' in result def test_join_float64_float32(self): a = DataFrame(randn(10, 2), columns=['a', 'b'], dtype=np.float64) b = DataFrame(randn(10, 1), columns=['c'], dtype=np.float32) joined = a.join(b) - self.assertEqual(joined.dtypes['a'], 'float64') - self.assertEqual(joined.dtypes['b'], 'float64') - self.assertEqual(joined.dtypes['c'], 'float32') + assert joined.dtypes['a'] == 'float64' + assert joined.dtypes['b'] == 'float64' + assert joined.dtypes['c'] == 'float32' a = np.random.randint(0, 5, 100).astype('int64') b = np.random.random(100).astype('float64') @@ -459,10 +456,10 @@ def test_join_float64_float32(self): xpdf = DataFrame({'a': a, 'b': b, 'c': c}) s = DataFrame(np.random.random(5).astype('float32'), columns=['md']) rs = df.merge(s, left_on='a', right_index=True) - self.assertEqual(rs.dtypes['a'], 'int64') - self.assertEqual(rs.dtypes['b'], 'float64') - self.assertEqual(rs.dtypes['c'], 'float32') - self.assertEqual(rs.dtypes['md'], 'float32') + assert rs.dtypes['a'] == 'int64' + assert rs.dtypes['b'] == 'float64' + assert rs.dtypes['c'] == 'float32' + assert rs.dtypes['md'] == 'float32' xp = xpdf.merge(s, left_on='a', right_index=True) assert_frame_equal(rs, xp) @@ -500,7 +497,7 @@ def test_join_many_non_unique_index(self): result = result.reset_index() - assert_frame_equal(result, expected.ix[:, result.columns]) + assert_frame_equal(result, expected.loc[:, result.columns]) # GH 11519 df = DataFrame({'A': ['foo', 'bar', 
'foo', 'bar', @@ -534,7 +531,7 @@ def test_join_sort(self): # smoke test joined = left.join(right, on='key', sort=False) - self.assert_index_equal(joined.index, pd.Index(lrange(4))) + tm.assert_index_equal(joined.index, pd.Index(lrange(4))) def test_join_mixed_non_unique_index(self): # GH 12814, unorderable types in py3 with a non-unique index @@ -592,14 +589,14 @@ def _check_diff_index(df_list, result, exp_index): joined = df_list[0].join(df_list[1:], how='inner') _check_diff_index(df_list, joined, df.index[2:8]) - self.assertRaises(ValueError, df_list[0].join, df_list[1:], on='a') + pytest.raises(ValueError, df_list[0].join, df_list[1:], on='a') def test_join_many_mixed(self): df = DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D']) df['key'] = ['foo', 'bar'] * 4 - df1 = df.ix[:, ['A', 'B']] - df2 = df.ix[:, ['C', 'D']] - df3 = df.ix[:, ['key']] + df1 = df.loc[:, ['A', 'B']] + df2 = df.loc[:, ['C', 'D']] + df3 = df.loc[:, ['key']] result = df1.join([df2, df3]) assert_frame_equal(result, df) @@ -634,84 +631,89 @@ def test_join_dups(self): assert_frame_equal(dta, expected) def test_panel_join(self): - panel = tm.makePanel() - tm.add_nans(panel) - - p1 = panel.ix[:2, :10, :3] - p2 = panel.ix[2:, 5:, 2:] - - # left join - result = p1.join(p2) - expected = p1.copy() - expected['ItemC'] = p2['ItemC'] - tm.assert_panel_equal(result, expected) - - # right join - result = p1.join(p2, how='right') - expected = p2.copy() - expected['ItemA'] = p1['ItemA'] - expected['ItemB'] = p1['ItemB'] - expected = expected.reindex(items=['ItemA', 'ItemB', 'ItemC']) - tm.assert_panel_equal(result, expected) - - # inner join - result = p1.join(p2, how='inner') - expected = panel.ix[:, 5:10, 2:3] - tm.assert_panel_equal(result, expected) - - # outer join - result = p1.join(p2, how='outer') - expected = p1.reindex(major=panel.major_axis, - minor=panel.minor_axis) - expected = expected.join(p2.reindex(major=panel.major_axis, - minor=panel.minor_axis)) - tm.assert_panel_equal(result, expected) + with catch_warnings(record=True): + panel = tm.makePanel() + tm.add_nans(panel) + + p1 = panel.iloc[:2, :10, :3] + p2 = panel.iloc[2:, 5:, 2:] + + # left join + result = p1.join(p2) + expected = p1.copy() + expected['ItemC'] = p2['ItemC'] + tm.assert_panel_equal(result, expected) + + # right join + result = p1.join(p2, how='right') + expected = p2.copy() + expected['ItemA'] = p1['ItemA'] + expected['ItemB'] = p1['ItemB'] + expected = expected.reindex(items=['ItemA', 'ItemB', 'ItemC']) + tm.assert_panel_equal(result, expected) + + # inner join + result = p1.join(p2, how='inner') + expected = panel.iloc[:, 5:10, 2:3] + tm.assert_panel_equal(result, expected) + + # outer join + result = p1.join(p2, how='outer') + expected = p1.reindex(major=panel.major_axis, + minor=panel.minor_axis) + expected = expected.join(p2.reindex(major=panel.major_axis, + minor=panel.minor_axis)) + tm.assert_panel_equal(result, expected) def test_panel_join_overlap(self): - panel = tm.makePanel() - tm.add_nans(panel) - - p1 = panel.ix[['ItemA', 'ItemB', 'ItemC']] - p2 = panel.ix[['ItemB', 'ItemC']] - - # Expected index is - # - # ItemA, ItemB_p1, ItemC_p1, ItemB_p2, ItemC_p2 - joined = p1.join(p2, lsuffix='_p1', rsuffix='_p2') - p1_suf = p1.ix[['ItemB', 'ItemC']].add_suffix('_p1') - p2_suf = p2.ix[['ItemB', 'ItemC']].add_suffix('_p2') - no_overlap = panel.ix[['ItemA']] - expected = no_overlap.join(p1_suf.join(p2_suf)) - tm.assert_panel_equal(joined, expected) + with catch_warnings(record=True): + panel = tm.makePanel() + tm.add_nans(panel) + + 
p1 = panel.loc[['ItemA', 'ItemB', 'ItemC']] + p2 = panel.loc[['ItemB', 'ItemC']] + + # Expected index is + # + # ItemA, ItemB_p1, ItemC_p1, ItemB_p2, ItemC_p2 + joined = p1.join(p2, lsuffix='_p1', rsuffix='_p2') + p1_suf = p1.loc[['ItemB', 'ItemC']].add_suffix('_p1') + p2_suf = p2.loc[['ItemB', 'ItemC']].add_suffix('_p2') + no_overlap = panel.loc[['ItemA']] + expected = no_overlap.join(p1_suf.join(p2_suf)) + tm.assert_panel_equal(joined, expected) def test_panel_join_many(self): - tm.K = 10 - panel = tm.makePanel() - tm.K = 4 + with catch_warnings(record=True): + tm.K = 10 + panel = tm.makePanel() + tm.K = 4 - panels = [panel.ix[:2], panel.ix[2:6], panel.ix[6:]] + panels = [panel.iloc[:2], panel.iloc[2:6], panel.iloc[6:]] - joined = panels[0].join(panels[1:]) - tm.assert_panel_equal(joined, panel) + joined = panels[0].join(panels[1:]) + tm.assert_panel_equal(joined, panel) - panels = [panel.ix[:2, :-5], panel.ix[2:6, 2:], panel.ix[6:, 5:-7]] + panels = [panel.iloc[:2, :-5], + panel.iloc[2:6, 2:], + panel.iloc[6:, 5:-7]] - data_dict = {} - for p in panels: - data_dict.update(p.iteritems()) + data_dict = {} + for p in panels: + data_dict.update(p.iteritems()) - joined = panels[0].join(panels[1:], how='inner') - expected = pd.Panel.from_dict(data_dict, intersect=True) - tm.assert_panel_equal(joined, expected) + joined = panels[0].join(panels[1:], how='inner') + expected = pd.Panel.from_dict(data_dict, intersect=True) + tm.assert_panel_equal(joined, expected) - joined = panels[0].join(panels[1:], how='outer') - expected = pd.Panel.from_dict(data_dict, intersect=False) - tm.assert_panel_equal(joined, expected) + joined = panels[0].join(panels[1:], how='outer') + expected = pd.Panel.from_dict(data_dict, intersect=False) + tm.assert_panel_equal(joined, expected) - # edge cases - self.assertRaises(ValueError, panels[0].join, panels[1:], + # edge cases + pytest.raises(ValueError, panels[0].join, panels[1:], how='outer', lsuffix='foo', rsuffix='bar') - self.assertRaises(ValueError, panels[0].join, panels[1:], + pytest.raises(ValueError, panels[0].join, panels[1:], how='right') @@ -757,13 +759,13 @@ def _restrict_to_columns(group, columns, suffix): if c in columns or c.replace(suffix, '') in columns] # filter - group = group.ix[:, found] + group = group.loc[:, found] # get rid of suffixes, if any group = group.rename(columns=lambda x: x.replace(suffix, '')) # put in the right order... 
- group = group.ix[:, columns] + group = group.loc[:, columns] return group @@ -797,8 +799,3 @@ def _join_by_hand(a, b, how='left'): for col, s in compat.iteritems(b_re): a_re[col] = s return a_re.reindex(columns=result_columns) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tools/tests/test_merge.py b/pandas/tests/reshape/test_merge.py similarity index 79% rename from pandas/tools/tests/test_merge.py rename to pandas/tests/reshape/test_merge.py index f078959608f91..d3257243d7a2c 100644 --- a/pandas/tools/tests/test_merge.py +++ b/pandas/tests/reshape/test_merge.py @@ -1,7 +1,6 @@ # pylint: disable=E1103 -import nose - +import pytest from datetime import datetime from numpy.random import randn from numpy import nan @@ -10,10 +9,11 @@ import pandas as pd from pandas.compat import lrange, lzip -from pandas.tools.merge import merge, concat, MergeError -from pandas.util.testing import (assert_frame_equal, - assert_series_equal, - slow) +from pandas.core.reshape.concat import concat +from pandas.core.reshape.merge import merge, MergeError +from pandas.util.testing import assert_frame_equal, assert_series_equal +from pandas.core.dtypes.dtypes import CategoricalDtype +from pandas.core.dtypes.common import is_categorical_dtype, is_object_dtype from pandas import DataFrame, Index, MultiIndex, Series, Categorical import pandas.util.testing as tm @@ -33,11 +33,9 @@ def get_test_data(ngroups=NGROUPS, n=N): return arr -class TestMerge(tm.TestCase): - - _multiprocess_can_split_ = True +class TestMerge(object): - def setUp(self): + def setup_method(self, method): # aggregate multiple columns self.df = DataFrame({'key1': get_test_data(), 'key2': get_test_data(), @@ -57,6 +55,14 @@ def setUp(self): self.right = DataFrame({'v2': np.random.randn(4)}, index=['d', 'b', 'c', 'a']) + def test_merge_inner_join_empty(self): + # GH 15328 + df_empty = pd.DataFrame() + df_a = pd.DataFrame({'a': [1, 2]}, index=[0, 1], dtype='int64') + result = pd.merge(df_empty, df_a, left_index=True, right_index=True) + expected = pd.DataFrame({'a': []}, index=[], dtype='int64') + assert_frame_equal(result, expected) + def test_merge_common(self): joined = merge(self.df, self.df2) exp = merge(self.df, self.df2, on=['key1', 'key2']) @@ -72,13 +78,13 @@ def test_merge_index_singlekey_right_vs_left(self): right_index=True, how='left', sort=False) merged2 = merge(right, left, right_on='key', left_index=True, how='right', sort=False) - assert_frame_equal(merged1, merged2.ix[:, merged1.columns]) + assert_frame_equal(merged1, merged2.loc[:, merged1.columns]) merged1 = merge(left, right, left_on='key', right_index=True, how='left', sort=True) merged2 = merge(right, left, right_on='key', left_index=True, how='right', sort=True) - assert_frame_equal(merged1, merged2.ix[:, merged1.columns]) + assert_frame_equal(merged1, merged2.loc[:, merged1.columns]) def test_merge_index_singlekey_inner(self): left = DataFrame({'key': ['a', 'b', 'c', 'd', 'e', 'e', 'a'], @@ -89,41 +95,41 @@ def test_merge_index_singlekey_inner(self): # inner join result = merge(left, right, left_on='key', right_index=True, how='inner') - expected = left.join(right, on='key').ix[result.index] + expected = left.join(right, on='key').loc[result.index] assert_frame_equal(result, expected) result = merge(right, left, right_on='key', left_index=True, how='inner') - expected = left.join(right, on='key').ix[result.index] - assert_frame_equal(result, expected.ix[:, result.columns]) + expected 
= left.join(right, on='key').loc[result.index] + assert_frame_equal(result, expected.loc[:, result.columns]) def test_merge_misspecified(self): - self.assertRaises(ValueError, merge, self.left, self.right, - left_index=True) - self.assertRaises(ValueError, merge, self.left, self.right, - right_index=True) + pytest.raises(ValueError, merge, self.left, self.right, + left_index=True) + pytest.raises(ValueError, merge, self.left, self.right, + right_index=True) - self.assertRaises(ValueError, merge, self.left, self.left, - left_on='key', on='key') + pytest.raises(ValueError, merge, self.left, self.left, + left_on='key', on='key') - self.assertRaises(ValueError, merge, self.df, self.df2, - left_on=['key1'], right_on=['key1', 'key2']) + pytest.raises(ValueError, merge, self.df, self.df2, + left_on=['key1'], right_on=['key1', 'key2']) def test_index_and_on_parameters_confusion(self): - self.assertRaises(ValueError, merge, self.df, self.df2, how='left', - left_index=False, right_index=['key1', 'key2']) - self.assertRaises(ValueError, merge, self.df, self.df2, how='left', - left_index=['key1', 'key2'], right_index=False) - self.assertRaises(ValueError, merge, self.df, self.df2, how='left', - left_index=['key1', 'key2'], - right_index=['key1', 'key2']) + pytest.raises(ValueError, merge, self.df, self.df2, how='left', + left_index=False, right_index=['key1', 'key2']) + pytest.raises(ValueError, merge, self.df, self.df2, how='left', + left_index=['key1', 'key2'], right_index=False) + pytest.raises(ValueError, merge, self.df, self.df2, how='left', + left_index=['key1', 'key2'], + right_index=['key1', 'key2']) def test_merge_overlap(self): merged = merge(self.left, self.left, on='key') exp_len = (self.left['key'].value_counts() ** 2).sum() - self.assertEqual(len(merged), exp_len) - self.assertIn('v1_x', merged) - self.assertIn('v1_y', merged) + assert len(merged) == exp_len + assert 'v1_x' in merged + assert 'v1_y' in merged def test_merge_different_column_key_names(self): left = DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], @@ -156,10 +162,10 @@ def test_merge_copy(self): right_index=True, copy=True) merged['a'] = 6 - self.assertTrue((left['a'] == 0).all()) + assert (left['a'] == 0).all() merged['d'] = 'peekaboo' - self.assertTrue((right['d'] == 'bar').all()) + assert (right['d'] == 'bar').all() def test_merge_nocopy(self): left = DataFrame({'a': 0, 'b': 1}, index=lrange(10)) @@ -169,10 +175,10 @@ def test_merge_nocopy(self): right_index=True, copy=False) merged['a'] = 6 - self.assertTrue((left['a'] == 6).all()) + assert (left['a'] == 6).all() merged['d'] = 'peekaboo' - self.assertTrue((right['d'] == 'peekaboo').all()) + assert (right['d'] == 'peekaboo').all() def test_intelligently_handle_join_key(self): # #733, be a bit more 1337 about not returning unconsolidated DataFrame @@ -196,7 +202,7 @@ def test_merge_join_key_dtype_cast(self): df1 = DataFrame({'key': [1], 'v1': [10]}) df2 = DataFrame({'key': [2], 'v1': [20]}) df = merge(df1, df2, how='outer') - self.assertEqual(df['key'].dtype, 'int64') + assert df['key'].dtype == 'int64' df1 = DataFrame({'key': [True], 'v1': [1]}) df2 = DataFrame({'key': [False], 'v1': [0]}) @@ -204,14 +210,14 @@ def test_merge_join_key_dtype_cast(self): # GH13169 # this really should be bool - self.assertEqual(df['key'].dtype, 'object') + assert df['key'].dtype == 'object' df1 = DataFrame({'val': [1]}) df2 = DataFrame({'val': [2]}) lkey = np.array([1]) rkey = np.array([2]) df = merge(df1, df2, left_on=lkey, right_on=rkey, how='outer') - 
self.assertEqual(df['key_0'].dtype, 'int64') + assert df['key_0'].dtype == 'int64' def test_handle_join_key_pass_array(self): left = DataFrame({'key': [1, 1, 2, 2, 3], @@ -223,8 +229,8 @@ def test_handle_join_key_pass_array(self): merged2 = merge(right, left, left_on=key, right_on='key', how='outer') assert_series_equal(merged['key'], merged2['key']) - self.assertTrue(merged['key'].notnull().all()) - self.assertTrue(merged2['key'].notnull().all()) + assert merged['key'].notnull().all() + assert merged2['key'].notnull().all() left = DataFrame({'value': lrange(5)}, columns=['value']) right = DataFrame({'rvalue': lrange(6)}) @@ -232,23 +238,23 @@ def test_handle_join_key_pass_array(self): rkey = np.array([1, 1, 2, 3, 4, 5]) merged = merge(left, right, left_on=lkey, right_on=rkey, how='outer') - self.assert_series_equal(merged['key_0'], - Series([1, 1, 1, 1, 2, 2, 3, 4, 5], - name='key_0')) + tm.assert_series_equal(merged['key_0'], Series([1, 1, 1, 1, 2, + 2, 3, 4, 5], + name='key_0')) left = DataFrame({'value': lrange(3)}) right = DataFrame({'rvalue': lrange(6)}) key = np.array([0, 1, 1, 2, 2, 3], dtype=np.int64) merged = merge(left, right, left_index=True, right_on=key, how='outer') - self.assert_series_equal(merged['key_0'], Series(key, name='key_0')) + tm.assert_series_equal(merged['key_0'], Series(key, name='key_0')) def test_no_overlap_more_informative_error(self): dt = datetime.now() df1 = DataFrame({'x': ['a']}, index=[dt]) df2 = DataFrame({'y': ['b', 'c']}, index=[dt, dt]) - self.assertRaises(MergeError, merge, df1, df2) + pytest.raises(MergeError, merge, df1, df2) def test_merge_non_unique_indexes(self): @@ -419,7 +425,7 @@ def test_merge_nosort(self): exp = merge(df, new, on='var3', sort=False) assert_frame_equal(result, exp) - self.assertTrue((df.var3.unique() == result.var3.unique()).all()) + assert (df.var3.unique() == result.var3.unique()).all() def test_merge_nan_right(self): df1 = DataFrame({"i1": [0, 1], "i2": [0, 1]}) @@ -453,7 +459,7 @@ def _constructor(self): nad = NotADataFrame(self.df) result = nad.merge(self.df2, on='key1') - tm.assertIsInstance(result, NotADataFrame) + assert isinstance(result, NotADataFrame) def test_join_append_timedeltas(self): @@ -493,7 +499,7 @@ def test_other_datetime_unit(self): df2 = s.astype(dtype).to_frame('days') # coerces to datetime64[ns], thus should not be affected - self.assertEqual(df2['days'].dtype, 'datetime64[ns]') + assert df2['days'].dtype == 'datetime64[ns]' result = df1.merge(df2, left_on='entity_id', right_index=True) @@ -513,7 +519,7 @@ def test_other_timedelta_unit(self): 'timedelta64[ns]']: df2 = s.astype(dtype).to_frame('days') - self.assertEqual(df2['days'].dtype, dtype) + assert df2['days'].dtype == dtype result = df1.merge(df2, left_on='entity_id', right_index=True) @@ -543,7 +549,7 @@ def test_overlapping_columns_error_message(self): # #2649, #10639 df2.columns = ['key1', 'foo', 'foo'] - self.assertRaises(ValueError, merge, df, df2) + pytest.raises(ValueError, merge, df, df2) def test_merge_on_datetime64tz(self): @@ -576,8 +582,8 @@ def test_merge_on_datetime64tz(self): 'key': [1, 2, 3]}) result = pd.merge(left, right, on='key', how='outer') assert_frame_equal(result, expected) - self.assertEqual(result['value_x'].dtype, 'datetime64[ns, US/Eastern]') - self.assertEqual(result['value_y'].dtype, 'datetime64[ns, US/Eastern]') + assert result['value_x'].dtype == 'datetime64[ns, US/Eastern]' + assert result['value_y'].dtype == 'datetime64[ns, US/Eastern]' def test_merge_on_periods(self): left = pd.DataFrame({'key':
pd.period_range('20151010', periods=2, @@ -608,8 +614,8 @@ def test_merge_on_periods(self): 'key': [1, 2, 3]}) result = pd.merge(left, right, on='key', how='outer') assert_frame_equal(result, expected) - self.assertEqual(result['value_x'].dtype, 'object') - self.assertEqual(result['value_y'].dtype, 'object') + assert result['value_x'].dtype == 'object' + assert result['value_y'].dtype == 'object' def test_indicator(self): # PR #10054. xref #7412 and closes #8790. @@ -657,46 +663,46 @@ def test_indicator(self): assert_frame_equal(test_custom_name, df_result_custom_name) # Check only accepts strings and booleans - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): merge(df1, df2, on='col1', how='outer', indicator=5) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df1.merge(df2, on='col1', how='outer', indicator=5) # Check result integrity test2 = merge(df1, df2, on='col1', how='left', indicator=True) - self.assertTrue((test2._merge != 'right_only').all()) + assert (test2._merge != 'right_only').all() test2 = df1.merge(df2, on='col1', how='left', indicator=True) - self.assertTrue((test2._merge != 'right_only').all()) + assert (test2._merge != 'right_only').all() test3 = merge(df1, df2, on='col1', how='right', indicator=True) - self.assertTrue((test3._merge != 'left_only').all()) + assert (test3._merge != 'left_only').all() test3 = df1.merge(df2, on='col1', how='right', indicator=True) - self.assertTrue((test3._merge != 'left_only').all()) + assert (test3._merge != 'left_only').all() test4 = merge(df1, df2, on='col1', how='inner', indicator=True) - self.assertTrue((test4._merge == 'both').all()) + assert (test4._merge == 'both').all() test4 = df1.merge(df2, on='col1', how='inner', indicator=True) - self.assertTrue((test4._merge == 'both').all()) + assert (test4._merge == 'both').all() # Check if working name in df for i in ['_right_indicator', '_left_indicator', '_merge']: df_badcolumn = DataFrame({'col1': [1, 2], i: [2, 2]}) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): merge(df1, df_badcolumn, on='col1', how='outer', indicator=True) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df1.merge(df_badcolumn, on='col1', how='outer', indicator=True) # Check for name conflict with custom name df_badcolumn = DataFrame( {'col1': [1, 2], 'custom_column_name': [2, 2]}) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): merge(df1, df_badcolumn, on='col1', how='outer', indicator='custom_column_name') - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df1.merge(df_badcolumn, on='col1', how='outer', indicator='custom_column_name') @@ -731,9 +737,9 @@ def _check_merge(x, y): assert_frame_equal(result, expected, check_names=False) -class TestMergeMulti(tm.TestCase): +class TestMergeMulti(object): - def setUp(self): + def setup_method(self, method): self.index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', 'three']], labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], @@ -763,11 +769,11 @@ def test_merge_on_multikey(self): columns=self.to_join.columns)) # TODO: columns aren't in the same order yet - assert_frame_equal(joined, expected.ix[:, joined.columns]) + assert_frame_equal(joined, expected.loc[:, joined.columns]) left = self.data.join(self.to_join, on=['key1', 'key2'], sort=True) - right = expected.ix[:, joined.columns].sort_values(['key1', 'key2'], - kind='mergesort') + right = expected.loc[:, joined.columns].sort_values(['key1', 'key2'], + kind='mergesort') 
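The conversions in the hunks above follow one mechanical pattern: unittest-style helpers (`self.assertEqual`, `self.assertTrue`, `self.assertRaises`, `tm.assertRaises`) become bare `assert` statements and `pytest.raises`. A minimal sketch of the target idiom, using throwaway frames rather than the suite's own fixtures:

```python
import pytest
from pandas import DataFrame, merge

left = DataFrame({'key': ['a', 'b'], 'v1': [1, 2]})
right = DataFrame({'key': ['a', 'b'], 'v2': [3, 4]})

# Bare asserts replace assertEqual/assertTrue/assertIn; pytest rewrites
# them so failures report the compared values.
merged = merge(left, right, on='key')
assert len(merged) == 2
assert 'v1' in merged

# Exception checks replace self.assertRaises.  Both the call form and
# the context-manager form appear in this diff.
pytest.raises(ValueError, merge, left, right, left_index=True)
with pytest.raises(ValueError):
    merge(left, right, left_index=True)
```

The same hunks also swap the deprecated `.ix` indexer for the purely label-based `.loc`, as in `expected.loc[:, joined.columns]`.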
assert_frame_equal(left, right) def test_left_join_multi_index(self): @@ -783,15 +789,15 @@ def run_asserts(left, right): for sort in [False, True]: res = left.join(right, on=icols, how='left', sort=sort) - self.assertTrue(len(left) < len(res) + 1) - self.assertFalse(res['4th'].isnull().any()) - self.assertFalse(res['5th'].isnull().any()) + assert len(left) < len(res) + 1 + assert not res['4th'].isnull().any() + assert not res['5th'].isnull().any() tm.assert_series_equal( res['4th'], - res['5th'], check_names=False) result = bind_cols(res.iloc[:, :-2]) tm.assert_series_equal(res['4th'], result, check_names=False) - self.assertTrue(result.name is None) + assert result.name is None if sort: tm.assert_frame_equal( @@ -840,7 +846,7 @@ def test_merge_right_vs_left(self): left_index=True, how='right', sort=sort) - merged2 = merged2.ix[:, merged1.columns] + merged2 = merged2.loc[:, merged1.columns] assert_frame_equal(merged1, merged2) def test_compress_group_combinations(self): @@ -904,7 +910,7 @@ def test_left_join_index_preserve_order(self): # do a right join for an extra test joined = merge(right, left, left_index=True, right_on=['k1', 'k2'], how='right') - tm.assert_frame_equal(joined.ix[:, expected.columns], expected) + tm.assert_frame_equal(joined.loc[:, expected.columns], expected) def test_left_join_index_multi_match_multiindex(self): left = DataFrame([ @@ -1021,38 +1027,6 @@ def test_left_join_index_multi_match(self): expected.index = np.arange(len(expected)) tm.assert_frame_equal(result, expected) - def test_join_multi_dtypes(self): - - # test with multi dtypes in the join index - def _test(dtype1, dtype2): - left = DataFrame({'k1': np.array([0, 1, 2] * 8, dtype=dtype1), - 'k2': ['foo', 'bar'] * 12, - 'v': np.array(np.arange(24), dtype=np.int64)}) - - index = MultiIndex.from_tuples([(2, 'bar'), (1, 'foo')]) - right = DataFrame( - {'v2': np.array([5, 7], dtype=dtype2)}, index=index) - - result = left.join(right, on=['k1', 'k2']) - - expected = left.copy() - - if dtype2.kind == 'i': - dtype2 = np.dtype('float64') - expected['v2'] = np.array(np.nan, dtype=dtype2) - expected.loc[(expected.k1 == 2) & (expected.k2 == 'bar'), 'v2'] = 5 - expected.loc[(expected.k1 == 1) & (expected.k2 == 'foo'), 'v2'] = 7 - - tm.assert_frame_equal(result, expected) - - result = left.join(right, on=['k1', 'k2'], sort=True) - expected.sort_values(['k1', 'k2'], kind='mergesort', inplace=True) - tm.assert_frame_equal(result, expected) - - for d1 in [np.int64, np.int32, np.int16, np.int8, np.uint8]: - for d2 in [np.int64, np.float64, np.float32, np.float16]: - _test(np.dtype(d1), np.dtype(d2)) - def test_left_merge_na_buglet(self): left = DataFrame({'id': list('abcde'), 'v1': randn(5), 'v2': randn(5), 'dummy': list('abcde'), @@ -1095,137 +1069,6 @@ def test_merge_na_keys(self): tm.assert_frame_equal(result, expected) - @slow - def test_int64_overflow_issues(self): - from itertools import product - from collections import defaultdict - from pandas.core.groupby import _int64_overflow_possible - - # #2690, combinatorial explosion - df1 = DataFrame(np.random.randn(1000, 7), - columns=list('ABCDEF') + ['G1']) - df2 = DataFrame(np.random.randn(1000, 7), - columns=list('ABCDEF') + ['G2']) - - # it works! 
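Alongside the assertion rewrite, test classes drop the `tm.TestCase` base (a `unittest.TestCase` subclass) in favor of plain classes with pytest's xunit-style `setup_method` hook, and loop-driven tests such as the deleted `test_join_multi_dtypes` above reappear later as fixtures plus `@pytest.mark.parametrize`. A compact sketch of both conventions, with illustrative names:

```python
import numpy as np
import pytest
from pandas import DataFrame


class TestExample(object):

    def setup_method(self, method):
        # Runs before each test; `method` is the test about to execute.
        self.df = DataFrame({'key': ['a', 'a', 'b'],
                             'value': np.arange(3)})

    def test_groupby_sum(self):
        result = self.df.groupby('key')['value'].sum()
        assert result['a'] == 1


# Shared inputs can instead become fixtures, and hand-rolled dtype loops
# become parametrized tests, reported as one test per combination.
@pytest.fixture
def frame():
    return DataFrame({'value': [1, 2, 3]})


@pytest.mark.parametrize('scale', [1, 10])
def test_scaling(frame, scale):
    result = frame['value'] * scale
    assert result.iloc[-1] == 3 * scale
```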
- result = merge(df1, df2, how='outer') - self.assertTrue(len(result) == 2000) - - low, high, n = -1 << 10, 1 << 10, 1 << 20 - left = DataFrame(np.random.randint(low, high, (n, 7)), - columns=list('ABCDEFG')) - left['left'] = left.sum(axis=1) - - # one-2-one match - i = np.random.permutation(len(left)) - right = left.iloc[i].copy() - right.columns = right.columns[:-1].tolist() + ['right'] - right.index = np.arange(len(right)) - right['right'] *= -1 - - out = merge(left, right, how='outer') - self.assertEqual(len(out), len(left)) - assert_series_equal(out['left'], - out['right'], check_names=False) - result = out.iloc[:, :-2].sum(axis=1) - assert_series_equal(out['left'], result, check_names=False) - self.assertTrue(result.name is None) - - out.sort_values(out.columns.tolist(), inplace=True) - out.index = np.arange(len(out)) - for how in ['left', 'right', 'outer', 'inner']: - assert_frame_equal(out, merge(left, right, how=how, sort=True)) - - # check that left merge w/ sort=False maintains left frame order - out = merge(left, right, how='left', sort=False) - assert_frame_equal(left, out[left.columns.tolist()]) - - out = merge(right, left, how='left', sort=False) - assert_frame_equal(right, out[right.columns.tolist()]) - - # one-2-many/none match - n = 1 << 11 - left = DataFrame(np.random.randint(low, high, (n, 7)).astype('int64'), - columns=list('ABCDEFG')) - - # confirm that this is checking what it is supposed to check - shape = left.apply(Series.nunique).values - self.assertTrue(_int64_overflow_possible(shape)) - - # add duplicates to left frame - left = concat([left, left], ignore_index=True) - - right = DataFrame(np.random.randint(low, high, (n // 2, 7)) - .astype('int64'), - columns=list('ABCDEFG')) - - # add duplicates & overlap with left to the right frame - i = np.random.choice(len(left), n) - right = concat([right, right, left.iloc[i]], ignore_index=True) - - left['left'] = np.random.randn(len(left)) - right['right'] = np.random.randn(len(right)) - - # shuffle left & right frames - i = np.random.permutation(len(left)) - left = left.iloc[i].copy() - left.index = np.arange(len(left)) - - i = np.random.permutation(len(right)) - right = right.iloc[i].copy() - right.index = np.arange(len(right)) - - # manually compute outer merge - ldict, rdict = defaultdict(list), defaultdict(list) - - for idx, row in left.set_index(list('ABCDEFG')).iterrows(): - ldict[idx].append(row['left']) - - for idx, row in right.set_index(list('ABCDEFG')).iterrows(): - rdict[idx].append(row['right']) - - vals = [] - for k, lval in ldict.items(): - rval = rdict.get(k, [np.nan]) - for lv, rv in product(lval, rval): - vals.append(k + tuple([lv, rv])) - - for k, rval in rdict.items(): - if k not in ldict: - for rv in rval: - vals.append(k + tuple([np.nan, rv])) - - def align(df): - df = df.sort_values(df.columns.tolist()) - df.index = np.arange(len(df)) - return df - - def verify_order(df): - kcols = list('ABCDEFG') - assert_frame_equal(df[kcols].copy(), - df[kcols].sort_values(kcols, kind='mergesort')) - - out = DataFrame(vals, columns=list('ABCDEFG') + ['left', 'right']) - out = align(out) - - jmask = {'left': out['left'].notnull(), - 'right': out['right'].notnull(), - 'inner': out['left'].notnull() & out['right'].notnull(), - 'outer': np.ones(len(out), dtype='bool')} - - for how in 'left', 'right', 'outer', 'inner': - mask = jmask[how] - frame = align(out[mask].copy()) - self.assertTrue(mask.all() ^ mask.any() or how == 'outer') - - for sort in [False, True]: - res = merge(left, right, how=how, sort=sort) - if 
sort: - verify_order(res) - - # as in GH9092 dtypes break with outer/right join - assert_frame_equal(frame, align(res), - check_dtype=how not in ('right', 'outer')) - def test_join_multi_levels(self): # GH 3662 @@ -1293,14 +1136,14 @@ def test_join_multi_levels(self): def f(): household.join(portfolio, how='inner') - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) portfolio2 = portfolio.copy() portfolio2.index.set_names(['household_id', 'foo']) def f(): portfolio2.join(portfolio, how='inner') - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) def test_join_multi_levels2(self): @@ -1341,7 +1184,7 @@ def test_join_multi_levels2(self): def f(): household.join(log_return, how='inner') - self.assertRaises(NotImplementedError, f) + pytest.raises(NotImplementedError, f) # this is the equivalency result = (merge(household.reset_index(), log_return.reset_index(), @@ -1369,9 +1212,194 @@ def f(): def f(): household.join(log_return, how='outer') - self.assertRaises(NotImplementedError, f) + pytest.raises(NotImplementedError, f) + + +@pytest.fixture +def df(): + return DataFrame( + {'A': ['foo', 'bar'], + 'B': Series(['foo', 'bar']).astype('category'), + 'C': [1, 2], + 'D': [1.0, 2.0], + 'E': Series([1, 2], dtype='uint64'), + 'F': Series([1, 2], dtype='int32')}) + + +class TestMergeDtypes(object): + + def test_different(self, df): + + # we expect differences by kind + # to be ok, while other differences should return object + + left = df + for col in df.columns: + right = DataFrame({'A': df[col]}) + result = pd.merge(left, right, on='A') + assert is_object_dtype(result.A.dtype) + + @pytest.mark.parametrize('d1', [np.int64, np.int32, + np.int16, np.int8, np.uint8]) + @pytest.mark.parametrize('d2', [np.int64, np.float64, + np.float32, np.float16]) + def test_join_multi_dtypes(self, d1, d2): + dtype1 = np.dtype(d1) + dtype2 = np.dtype(d2) -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + left = DataFrame({'k1': np.array([0, 1, 2] * 8, dtype=dtype1), + 'k2': ['foo', 'bar'] * 12, + 'v': np.array(np.arange(24), dtype=np.int64)}) + + index = MultiIndex.from_tuples([(2, 'bar'), (1, 'foo')]) + right = DataFrame({'v2': np.array([5, 7], dtype=dtype2)}, index=index) + + result = left.join(right, on=['k1', 'k2']) + + expected = left.copy() + + if dtype2.kind == 'i': + dtype2 = np.dtype('float64') + expected['v2'] = np.array(np.nan, dtype=dtype2) + expected.loc[(expected.k1 == 2) & (expected.k2 == 'bar'), 'v2'] = 5 + expected.loc[(expected.k1 == 1) & (expected.k2 == 'foo'), 'v2'] = 7 + + tm.assert_frame_equal(result, expected) + + result = left.join(right, on=['k1', 'k2'], sort=True) + expected.sort_values(['k1', 'k2'], kind='mergesort', inplace=True) + tm.assert_frame_equal(result, expected) + + +@pytest.fixture +def left(): + np.random.seed(1234) + return DataFrame( + {'X': Series(np.random.choice( + ['foo', 'bar'], + size=(10,))).astype('category', categories=['foo', 'bar']), + 'Y': np.random.choice(['one', 'two', 'three'], size=(10,))}) + + +@pytest.fixture +def right(): + np.random.seed(1234) + return DataFrame( + {'X': Series(['foo', 'bar']).astype('category', + categories=['foo', 'bar']), + 'Z': [1, 2]}) + + +class TestMergeCategorical(object): + + def test_identical(self, left): + # merging on the same, should preserve dtypes + merged = pd.merge(left, left, on='X') + result = merged.dtypes.sort_index() + expected = Series([CategoricalDtype(), + np.dtype('O'), + np.dtype('O')], + index=['X', 'Y_x', 'Y_y']) + 
assert_series_equal(result, expected) + + def test_basic(self, left, right): + # we have matching Categorical dtypes in X + # so should preserve the merged column + merged = pd.merge(left, right, on='X') + result = merged.dtypes.sort_index() + expected = Series([CategoricalDtype(), + np.dtype('O'), + np.dtype('int64')], + index=['X', 'Y', 'Z']) + assert_series_equal(result, expected) + + def test_other_columns(self, left, right): + # non-merge columns should preserve if possible + right = right.assign(Z=right.Z.astype('category')) + + merged = pd.merge(left, right, on='X') + result = merged.dtypes.sort_index() + expected = Series([CategoricalDtype(), + np.dtype('O'), + CategoricalDtype()], + index=['X', 'Y', 'Z']) + assert_series_equal(result, expected) + + # categories are preserved + assert left.X.values.is_dtype_equal(merged.X.values) + assert right.Z.values.is_dtype_equal(merged.Z.values) + + @pytest.mark.parametrize( + 'change', [lambda x: x, + lambda x: x.astype('category', + categories=['bar', 'foo']), + lambda x: x.astype('category', + categories=['foo', 'bar', 'bah']), + lambda x: x.astype('category', ordered=True)]) + @pytest.mark.parametrize('how', ['inner', 'outer', 'left', 'right']) + def test_dtype_on_merged_different(self, change, how, left, right): + # our merging columns, X now has 2 different dtypes + # so we must be object as a result + + X = change(right.X.astype('object')) + right = right.assign(X=X) + assert is_categorical_dtype(left.X.values) + assert not left.X.values.is_dtype_equal(right.X.values) + + merged = pd.merge(left, right, on='X', how=how) + + result = merged.dtypes.sort_index() + expected = Series([np.dtype('O'), + np.dtype('O'), + np.dtype('int64')], + index=['X', 'Y', 'Z']) + assert_series_equal(result, expected) + + +@pytest.fixture +def left_df(): + return DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0]) + + +@pytest.fixture +def right_df(): + return DataFrame({'b': [300, 100, 200]}, index=[3, 1, 2]) + + +class TestMergeOnIndexes(object): + + @pytest.mark.parametrize( + "how, sort, expected", + [('inner', False, DataFrame({'a': [20, 10], + 'b': [200, 100]}, + index=[2, 1])), + ('inner', True, DataFrame({'a': [10, 20], + 'b': [100, 200]}, + index=[1, 2])), + ('left', False, DataFrame({'a': [20, 10, 0], + 'b': [200, 100, np.nan]}, + index=[2, 1, 0])), + ('left', True, DataFrame({'a': [0, 10, 20], + 'b': [np.nan, 100, 200]}, + index=[0, 1, 2])), + ('right', False, DataFrame({'a': [np.nan, 10, 20], + 'b': [300, 100, 200]}, + index=[3, 1, 2])), + ('right', True, DataFrame({'a': [10, 20, np.nan], + 'b': [100, 200, 300]}, + index=[1, 2, 3])), + ('outer', False, DataFrame({'a': [0, 10, 20, np.nan], + 'b': [np.nan, 100, 200, 300]}, + index=[0, 1, 2, 3])), + ('outer', True, DataFrame({'a': [0, 10, 20, np.nan], + 'b': [np.nan, 100, 200, 300]}, + index=[0, 1, 2, 3]))]) + def test_merge_on_indexes(self, left_df, right_df, how, sort, expected): + + result = pd.merge(left_df, right_df, + left_index=True, + right_index=True, + how=how, + sort=sort) + tm.assert_frame_equal(result, expected) diff --git a/pandas/tools/tests/test_merge_asof.py b/pandas/tests/reshape/test_merge_asof.py similarity index 73% rename from pandas/tools/tests/test_merge_asof.py rename to pandas/tests/reshape/test_merge_asof.py index bbbf1a3bdfff9..78bfa2ff8597c 100644 --- a/pandas/tools/tests/test_merge_asof.py +++ b/pandas/tests/reshape/test_merge_asof.py @@ -1,18 +1,17 @@ -import nose import os +import pytest import pytz import numpy as np import pandas as pd from pandas import (merge_asof, 
read_csv, to_datetime, Timedelta) -from pandas.tools.merge import MergeError +from pandas.core.reshape.merge import MergeError from pandas.util import testing as tm from pandas.util.testing import assert_frame_equal -class TestAsOfMerge(tm.TestCase): - _multiprocess_can_split_ = True +class TestAsOfMerge(object): def read_data(self, name, dedupe=False): path = os.path.join(tm.get_data_path(), name) @@ -24,7 +23,7 @@ def read_data(self, name, dedupe=False): x.time = to_datetime(x.time) return x - def setUp(self): + def setup_method(self, method): self.trades = self.read_data('trades.csv') self.quotes = self.read_data('quotes.csv', dedupe=True) @@ -42,7 +41,12 @@ def test_examples1(self): right = pd.DataFrame({'a': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]}) - pd.merge_asof(left, right, on='a') + expected = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c'], + 'right_val': [1, 3, 7]}) + + result = pd.merge_asof(left, right, on='a') + assert_frame_equal(result, expected) def test_examples2(self): """ doc-string examples """ @@ -94,6 +98,38 @@ def test_examples2(self): tolerance=pd.Timedelta('10ms'), allow_exact_matches=False) + def test_examples3(self): + """ doc-string examples """ + # GH14887 + + left = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c']}) + right = pd.DataFrame({'a': [1, 2, 3, 6, 7], + 'right_val': [1, 2, 3, 6, 7]}) + + expected = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c'], + 'right_val': [1, 6, np.nan]}) + + result = pd.merge_asof(left, right, on='a', direction='forward') + assert_frame_equal(result, expected) + + def test_examples4(self): + """ doc-string examples """ + # GH14887 + + left = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c']}) + right = pd.DataFrame({'a': [1, 2, 3, 6, 7], + 'right_val': [1, 2, 3, 6, 7]}) + + expected = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c'], + 'right_val': [1, 6, 7]}) + + result = pd.merge_asof(left, right, on='a', direction='nearest') + assert_frame_equal(result, expected) + def test_basic(self): expected = self.asof @@ -112,6 +148,7 @@ def test_basic_categorical(self): trades.ticker = trades.ticker.astype('category') quotes = self.quotes.copy() quotes.ticker = quotes.ticker.astype('category') + expected.ticker = expected.ticker.astype('category') result = merge_asof(trades, quotes, on='time', @@ -164,14 +201,14 @@ def test_multi_index(self): # MultiIndex is prohibited trades = self.trades.set_index(['time', 'price']) quotes = self.quotes.set_index('time') - with self.assertRaises(MergeError): + with pytest.raises(MergeError): merge_asof(trades, quotes, left_index=True, right_index=True) trades = self.trades.set_index('time') quotes = self.quotes.set_index(['time', 'bid']) - with self.assertRaises(MergeError): + with pytest.raises(MergeError): merge_asof(trades, quotes, left_index=True, right_index=True) @@ -181,7 +218,7 @@ def test_on_and_index(self): # 'on' parameter and index together is prohibited trades = self.trades.set_index('time') quotes = self.quotes.set_index('time') - with self.assertRaises(MergeError): + with pytest.raises(MergeError): merge_asof(trades, quotes, left_on='price', left_index=True, @@ -189,7 +226,7 @@ def test_on_and_index(self): trades = self.trades.set_index('time') quotes = self.quotes.set_index('time') - with self.assertRaises(MergeError): + with pytest.raises(MergeError): merge_asof(trades, quotes, right_on='bid', left_index=True, @@ -332,6 +369,41 @@ def test_multiby_heterogeneous_types(self): by=['ticker', 'exch']) 
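The new `test_examples3`/`test_examples4` cases above, and the `direction=` variants further down, exercise the `merge_asof` search direction added for GH14887. A small self-contained illustration built from the same frames the doc-string tests use:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'a': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]})

# 'backward' (the default): match the last right key <= the left key
pd.merge_asof(left, right, on='a')                       # right_val: 1, 3, 7
# 'forward': match the first right key >= the left key
pd.merge_asof(left, right, on='a', direction='forward')  # right_val: 1, 6, NaN
# 'nearest': match whichever right key is closest in absolute distance
pd.merge_asof(left, right, on='a', direction='nearest')  # right_val: 1, 6, 7
```

`tolerance` and `allow_exact_matches=False` compose with all three directions, which is what the `test_allow_exact_matches_and_tolerance_*` cases below pin down.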
assert_frame_equal(result, expected) + def test_multiby_indexed(self): + # GH15676 + left = pd.DataFrame([ + [pd.to_datetime('20160602'), 1, 'a'], + [pd.to_datetime('20160602'), 2, 'a'], + [pd.to_datetime('20160603'), 1, 'b'], + [pd.to_datetime('20160603'), 2, 'b']], + columns=['time', 'k1', 'k2']).set_index('time') + + right = pd.DataFrame([ + [pd.to_datetime('20160502'), 1, 'a', 1.0], + [pd.to_datetime('20160502'), 2, 'a', 2.0], + [pd.to_datetime('20160503'), 1, 'b', 3.0], + [pd.to_datetime('20160503'), 2, 'b', 4.0]], + columns=['time', 'k1', 'k2', 'value']).set_index('time') + + expected = pd.DataFrame([ + [pd.to_datetime('20160602'), 1, 'a', 1.0], + [pd.to_datetime('20160602'), 2, 'a', 2.0], + [pd.to_datetime('20160603'), 1, 'b', 3.0], + [pd.to_datetime('20160603'), 2, 'b', 4.0]], + columns=['time', 'k1', 'k2', 'value']).set_index('time') + + result = pd.merge_asof(left, + right, + left_index=True, + right_index=True, + by=['k1', 'k2']) + + assert_frame_equal(expected, result) + + with pytest.raises(MergeError): + pd.merge_asof(left, right, left_index=True, right_index=True, + left_by=['k1', 'k2'], right_by=['k1']) + def test_basic2(self): expected = self.read_data('asof2.csv') @@ -361,18 +433,18 @@ def test_valid_join_keys(self): trades = self.trades quotes = self.quotes - with self.assertRaises(MergeError): + with pytest.raises(MergeError): merge_asof(trades, quotes, left_on='time', right_on='bid', by='ticker') - with self.assertRaises(MergeError): + with pytest.raises(MergeError): merge_asof(trades, quotes, on=['time', 'ticker'], by='ticker') - with self.assertRaises(MergeError): + with pytest.raises(MergeError): merge_asof(trades, quotes, by='ticker') @@ -403,7 +475,7 @@ def test_valid_allow_exact_matches(self): trades = self.trades quotes = self.quotes - with self.assertRaises(MergeError): + with pytest.raises(MergeError): merge_asof(trades, quotes, on='time', by='ticker', @@ -427,27 +499,27 @@ def test_valid_tolerance(self): tolerance=1) # incompat - with self.assertRaises(MergeError): + with pytest.raises(MergeError): merge_asof(trades, quotes, on='time', by='ticker', tolerance=1) # invalid - with self.assertRaises(MergeError): + with pytest.raises(MergeError): merge_asof(trades.reset_index(), quotes.reset_index(), on='index', by='ticker', tolerance=1.0) # invalid negative - with self.assertRaises(MergeError): + with pytest.raises(MergeError): merge_asof(trades, quotes, on='time', by='ticker', tolerance=-Timedelta('1s')) - with self.assertRaises(MergeError): + with pytest.raises(MergeError): merge_asof(trades.reset_index(), quotes.reset_index(), on='index', by='ticker', @@ -459,24 +531,24 @@ def test_non_sorted(self): quotes = self.quotes.sort_values('time', ascending=False) # we require that we are already sorted on time & quotes - self.assertFalse(trades.time.is_monotonic) - self.assertFalse(quotes.time.is_monotonic) - with self.assertRaises(ValueError): + assert not trades.time.is_monotonic + assert not quotes.time.is_monotonic + with pytest.raises(ValueError): merge_asof(trades, quotes, on='time', by='ticker') trades = self.trades.sort_values('time') - self.assertTrue(trades.time.is_monotonic) - self.assertFalse(quotes.time.is_monotonic) - with self.assertRaises(ValueError): + assert trades.time.is_monotonic + assert not quotes.time.is_monotonic + with pytest.raises(ValueError): merge_asof(trades, quotes, on='time', by='ticker') quotes = self.quotes.sort_values('time') - self.assertTrue(trades.time.is_monotonic) - self.assertTrue(quotes.time.is_monotonic) + assert 
trades.time.is_monotonic + assert quotes.time.is_monotonic # ok, though has dupes merge_asof(trades, self.quotes, @@ -495,6 +567,38 @@ def test_tolerance(self): expected = self.tolerance assert_frame_equal(result, expected) + def test_tolerance_forward(self): + # GH14887 + + left = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c']}) + right = pd.DataFrame({'a': [1, 2, 3, 7, 11], + 'right_val': [1, 2, 3, 7, 11]}) + + expected = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c'], + 'right_val': [1, np.nan, 11]}) + + result = pd.merge_asof(left, right, on='a', direction='forward', + tolerance=1) + assert_frame_equal(result, expected) + + def test_tolerance_nearest(self): + # GH14887 + + left = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c']}) + right = pd.DataFrame({'a': [1, 2, 3, 7, 11], + 'right_val': [1, 2, 3, 7, 11]}) + + expected = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c'], + 'right_val': [1, np.nan, 11]}) + + result = pd.merge_asof(left, right, on='a', direction='nearest', + tolerance=1) + assert_frame_equal(result, expected) + def test_tolerance_tz(self): # GH 14844 left = pd.DataFrame( @@ -518,6 +622,19 @@ def test_tolerance_tz(self): 'value2': list("BCDEE")}) assert_frame_equal(result, expected) + def test_index_tolerance(self): + # GH 15135 + expected = self.tolerance.set_index('time') + trades = self.trades.set_index('time') + quotes = self.quotes.set_index('time') + + result = pd.merge_asof(trades, quotes, + left_index=True, + right_index=True, + by='ticker', + tolerance=pd.Timedelta('1day')) + assert_frame_equal(result, expected) + def test_allow_exact_matches(self): result = merge_asof(self.trades, self.quotes, @@ -527,6 +644,38 @@ def test_allow_exact_matches(self): expected = self.allow_exact_matches assert_frame_equal(result, expected) + def test_allow_exact_matches_forward(self): + # GH14887 + + left = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c']}) + right = pd.DataFrame({'a': [1, 2, 3, 7, 11], + 'right_val': [1, 2, 3, 7, 11]}) + + expected = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c'], + 'right_val': [2, 7, 11]}) + + result = pd.merge_asof(left, right, on='a', direction='forward', + allow_exact_matches=False) + assert_frame_equal(result, expected) + + def test_allow_exact_matches_nearest(self): + # GH14887 + + left = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c']}) + right = pd.DataFrame({'a': [1, 2, 3, 7, 11], + 'right_val': [1, 2, 3, 7, 11]}) + + expected = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c'], + 'right_val': [2, 3, 11]}) + + result = pd.merge_asof(left, right, on='a', direction='nearest', + allow_exact_matches=False) + assert_frame_equal(result, expected) + def test_allow_exact_matches_and_tolerance(self): result = merge_asof(self.trades, self.quotes, @@ -573,7 +722,7 @@ def test_allow_exact_matches_and_tolerance3(self): # GH 13709 df1 = pd.DataFrame({ 'time': pd.to_datetime(['2016-07-15 13:30:00.030', - '2016-07-15 13:30:00.030']), + '2016-07-15 13:30:00.030']), 'username': ['bob', 'charlie']}) df2 = pd.DataFrame({ 'time': pd.to_datetime(['2016-07-15 13:30:00.000', @@ -589,6 +738,76 @@ def test_allow_exact_matches_and_tolerance3(self): 'version': [np.nan, np.nan]}) assert_frame_equal(result, expected) + def test_allow_exact_matches_and_tolerance_forward(self): + # GH14887 + + left = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c']}) + right = pd.DataFrame({'a': [1, 3, 4, 6, 11], + 'right_val': [1, 3, 4, 6, 11]}) + + expected = 
pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c'], + 'right_val': [np.nan, 6, 11]}) + + result = pd.merge_asof(left, right, on='a', direction='forward', + allow_exact_matches=False, tolerance=1) + assert_frame_equal(result, expected) + + def test_allow_exact_matches_and_tolerance_nearest(self): + # GH14887 + + left = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c']}) + right = pd.DataFrame({'a': [1, 3, 4, 6, 11], + 'right_val': [1, 3, 4, 7, 11]}) + + expected = pd.DataFrame({'a': [1, 5, 10], + 'left_val': ['a', 'b', 'c'], + 'right_val': [np.nan, 4, 11]}) + + result = pd.merge_asof(left, right, on='a', direction='nearest', + allow_exact_matches=False, tolerance=1) + assert_frame_equal(result, expected) + + def test_forward_by(self): + # GH14887 + + left = pd.DataFrame({'a': [1, 5, 10, 12, 15], + 'b': ['X', 'X', 'Y', 'Z', 'Y'], + 'left_val': ['a', 'b', 'c', 'd', 'e']}) + right = pd.DataFrame({'a': [1, 6, 11, 15, 16], + 'b': ['X', 'Z', 'Y', 'Z', 'Y'], + 'right_val': [1, 6, 11, 15, 16]}) + + expected = pd.DataFrame({'a': [1, 5, 10, 12, 15], + 'b': ['X', 'X', 'Y', 'Z', 'Y'], + 'left_val': ['a', 'b', 'c', 'd', 'e'], + 'right_val': [1, np.nan, 11, 15, 16]}) + + result = pd.merge_asof(left, right, on='a', by='b', + direction='forward') + assert_frame_equal(result, expected) + + def test_nearest_by(self): + # GH14887 + + left = pd.DataFrame({'a': [1, 5, 10, 12, 15], + 'b': ['X', 'X', 'Z', 'Z', 'Y'], + 'left_val': ['a', 'b', 'c', 'd', 'e']}) + right = pd.DataFrame({'a': [1, 6, 11, 15, 16], + 'b': ['X', 'Z', 'Z', 'Z', 'Y'], + 'right_val': [1, 6, 11, 15, 16]}) + + expected = pd.DataFrame({'a': [1, 5, 10, 12, 15], + 'b': ['X', 'X', 'Z', 'Z', 'Y'], + 'left_val': ['a', 'b', 'c', 'd', 'e'], + 'right_val': [1, 1, 11, 11, 16]}) + + result = pd.merge_asof(left, right, on='a', by='b', + direction='nearest') + assert_frame_equal(result, expected) + def test_by_int(self): # we specialize by type, so test that this is correct df1 = pd.DataFrame({ @@ -673,7 +892,7 @@ def test_on_specialized_type(self): df1 = df1.sort_values('value').reset_index(drop=True) if dtype == np.float16: - with self.assertRaises(MergeError): + with pytest.raises(MergeError): pd.merge_asof(df1, df2, on='value') continue @@ -710,7 +929,7 @@ def test_on_specialized_type_by_int(self): df1 = df1.sort_values('value').reset_index(drop=True) if dtype == np.float16: - with self.assertRaises(MergeError): + with pytest.raises(MergeError): pd.merge_asof(df1, df2, on='value', by='key') else: result = pd.merge_asof(df1, df2, on='value', by='key') @@ -754,8 +973,3 @@ def test_on_float_by_int(self): columns=['symbol', 'exch', 'price', 'mpv']) assert_frame_equal(result, expected) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tools/tests/test_merge_ordered.py b/pandas/tests/reshape/test_merge_ordered.py similarity index 81% rename from pandas/tools/tests/test_merge_ordered.py rename to pandas/tests/reshape/test_merge_ordered.py index 0511a0ca6d1cf..9469e98f336fd 100644 --- a/pandas/tools/tests/test_merge_ordered.py +++ b/pandas/tests/reshape/test_merge_ordered.py @@ -1,5 +1,3 @@ -import nose - import pandas as pd from pandas import DataFrame, merge_ordered from pandas.util import testing as tm @@ -8,9 +6,9 @@ from numpy import nan -class TestOrderedMerge(tm.TestCase): +class TestOrderedMerge(object): - def setUp(self): + def setup_method(self, method): self.left = DataFrame({'key': ['a', 'c', 'e'], 'lvalue': [1, 2., 3]}) @@ -42,10 +40,8 @@ 
def test_ffill(self): def test_multigroup(self): left = pd.concat([self.left, self.left], ignore_index=True) - # right = concat([self.right, self.right], ignore_index=True) left['group'] = ['a'] * 3 + ['b'] * 3 - # right['group'] = ['a'] * 4 + ['b'] * 4 result = merge_ordered(left, self.right, on='key', left_by='group', fill_method='ffill') @@ -54,14 +50,14 @@ def test_multigroup(self): 'rvalue': [nan, 1, 2, 3, 3, 4] * 2}) expected['group'] = ['a'] * 6 + ['b'] * 6 - assert_frame_equal(result, expected.ix[:, result.columns]) + assert_frame_equal(result, expected.loc[:, result.columns]) result2 = merge_ordered(self.right, left, on='key', right_by='group', fill_method='ffill') - assert_frame_equal(result, result2.ix[:, result.columns]) + assert_frame_equal(result, result2.loc[:, result.columns]) result = merge_ordered(left, self.right, on='key', left_by='group') - self.assertTrue(result['group'].notnull().all()) + assert result['group'].notnull().all() def test_merge_type(self): class NotADataFrame(DataFrame): @@ -73,7 +69,7 @@ def _constructor(self): nad = NotADataFrame(self.left) result = nad.merge(self.right, on='key') - tm.assertIsInstance(result, NotADataFrame) + assert isinstance(result, NotADataFrame) def test_empty_sequence_concat(self): # GH 9157 @@ -87,12 +83,8 @@ def test_empty_sequence_concat(self): ([None, None], none_pat) ] for df_seq, pattern in test_cases: - tm.assertRaisesRegexp(ValueError, pattern, pd.concat, df_seq) + tm.assert_raises_regex(ValueError, pattern, pd.concat, df_seq) pd.concat([pd.DataFrame()]) pd.concat([None, pd.DataFrame()]) pd.concat([pd.DataFrame(), None]) - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tools/tests/test_pivot.py b/pandas/tests/reshape/test_pivot.py similarity index 76% rename from pandas/tools/tests/test_pivot.py rename to pandas/tests/reshape/test_pivot.py index 5944fa1b34611..270a93e4ae382 100644 --- a/pandas/tools/tests/test_pivot.py +++ b/pandas/tests/reshape/test_pivot.py @@ -1,20 +1,23 @@ from datetime import datetime, date, timedelta +import pytest + + import numpy as np +from collections import OrderedDict import pandas as pd -from pandas import DataFrame, Series, Index, MultiIndex, Grouper -from pandas.tools.merge import concat -from pandas.tools.pivot import pivot_table, crosstab +from pandas import (DataFrame, Series, Index, MultiIndex, + Grouper, date_range, concat) +from pandas.core.reshape.pivot import pivot_table, crosstab from pandas.compat import range, product import pandas.util.testing as tm +from pandas.tseries.util import pivot_annual, isleapyear -class TestPivotTable(tm.TestCase): - - _multiprocess_can_split_ = True +class TestPivotTable(object): - def setUp(self): + def setup_method(self, method): self.data = DataFrame({'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'foo'], @@ -42,14 +45,14 @@ def test_pivot_table(self): pivot_table(self.data, values='D', index=index) if len(index) > 1: - self.assertEqual(table.index.names, tuple(index)) + assert table.index.names == tuple(index) else: - self.assertEqual(table.index.name, index[0]) + assert table.index.name == index[0] if len(columns) > 1: - self.assertEqual(table.columns.names, columns) + assert table.columns.names == columns else: - self.assertEqual(table.columns.name, columns[0]) + assert table.columns.name == columns[0] expected = self.data.groupby( index + [columns])['D'].agg(np.mean).unstack() @@ -87,6 +90,39 @@ def 
test_pivot_table_dropna(self): tm.assert_index_equal(pv_col.columns, m) tm.assert_index_equal(pv_ind.index, m) + def test_pivot_table_dropna_categoricals(self): + # GH 15193 + categories = ['a', 'b', 'c', 'd'] + + df = DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], + 'B': [1, 2, 3, 1, 2, 3, 1, 2, 3], + 'C': range(0, 9)}) + + df['A'] = df['A'].astype('category', ordered=False, + categories=categories) + result_true = df.pivot_table(index='B', columns='A', values='C', + dropna=True) + expected_columns = Series(['a', 'b', 'c'], name='A') + expected_columns = expected_columns.astype('category', ordered=False, + categories=categories) + expected_index = Series([1, 2, 3], name='B') + expected_true = DataFrame([[0.0, 3.0, 6.0], + [1.0, 4.0, 7.0], + [2.0, 5.0, 8.0]], + index=expected_index, + columns=expected_columns,) + tm.assert_frame_equal(expected_true, result_true) + + result_false = df.pivot_table(index='B', columns='A', values='C', + dropna=False) + expected_columns = Series(['a', 'b', 'c', 'd'], name='A') + expected_false = DataFrame([[0.0, 3.0, 6.0, np.NaN], + [1.0, 4.0, 7.0, np.NaN], + [2.0, 5.0, 8.0, np.NaN]], + index=expected_index, + columns=expected_columns,) + tm.assert_frame_equal(expected_false, result_false) + def test_pass_array(self): result = self.data.pivot_table( 'D', index=self.data.A, columns=self.data.C) @@ -112,7 +148,7 @@ def test_pivot_dtypes(self): # can convert dtypes f = DataFrame({'a': ['cat', 'bat', 'cat', 'bat'], 'v': [ 1, 2, 3, 4], 'i': ['a', 'b', 'a', 'b']}) - self.assertEqual(f.dtypes['v'], 'int64') + assert f.dtypes['v'] == 'int64' z = pivot_table(f, values='v', index=['a'], columns=[ 'i'], fill_value=0, aggfunc=np.sum) @@ -123,7 +159,7 @@ def test_pivot_dtypes(self): # cannot convert dtypes f = DataFrame({'a': ['cat', 'bat', 'cat', 'bat'], 'v': [ 1.5, 2.5, 3.5, 4.5], 'i': ['a', 'b', 'a', 'b']}) - self.assertEqual(f.dtypes['v'], 'float64') + assert f.dtypes['v'] == 'float64' z = pivot_table(f, values='v', index=['a'], columns=[ 'i'], fill_value=0, aggfunc=np.mean) @@ -213,10 +249,10 @@ def test_pivot_index_with_nan(self): df.loc[1, 'b'] = df.loc[4, 'b'] = nan pv = df.pivot('a', 'b', 'c') - self.assertEqual(pv.notnull().values.sum(), len(df)) + assert pv.notnull().values.sum() == len(df) for _, row in df.iterrows(): - self.assertEqual(pv.loc[row['a'], row['b']], row['c']) + assert pv.loc[row['a'], row['b']] == row['c'] tm.assert_frame_equal(df.pivot('b', 'a', 'c'), pv.T) @@ -301,22 +337,23 @@ def test_margins(self): def _check_output(result, values_col, index=['A', 'B'], columns=['C'], margins_col='All'): - col_margins = result.ix[:-1, margins_col] + col_margins = result.loc[result.index[:-1], margins_col] expected_col_margins = self.data.groupby(index)[values_col].mean() tm.assert_series_equal(col_margins, expected_col_margins, check_names=False) - self.assertEqual(col_margins.name, margins_col) + assert col_margins.name == margins_col + + result = result.sort_index() + index_margins = result.loc[(margins_col, '')].iloc[:-1] - result = result.sortlevel() - index_margins = result.ix[(margins_col, '')].iloc[:-1] expected_ix_margins = self.data.groupby(columns)[values_col].mean() tm.assert_series_equal(index_margins, expected_ix_margins, check_names=False) - self.assertEqual(index_margins.name, (margins_col, '')) + assert index_margins.name == (margins_col, '') grand_total_margins = result.loc[(margins_col, ''), margins_col] expected_total_margins = self.data[values_col].mean() - self.assertEqual(grand_total_margins, expected_total_margins) + 
assert grand_total_margins == expected_total_margins # column specified result = self.data.pivot_table(values='D', index=['A', 'B'], @@ -345,18 +382,18 @@ def _check_output(result, values_col, index=['A', 'B'], aggfunc=np.mean) for value_col in table.columns: totals = table.loc[('All', ''), value_col] - self.assertEqual(totals, self.data[value_col].mean()) + assert totals == self.data[value_col].mean() # no rows rtable = self.data.pivot_table(columns=['AA', 'BB'], margins=True, aggfunc=np.mean) - tm.assertIsInstance(rtable, Series) + assert isinstance(rtable, Series) table = self.data.pivot_table(index=['AA', 'BB'], margins=True, aggfunc='mean') for item in ['DD', 'EE', 'FF']: totals = table.loc[('All', ''), item] - self.assertEqual(totals, self.data[item].mean()) + assert totals == self.data[item].mean() # issue number #8349: pivot_table with margins and dictionary aggfunc data = [ @@ -381,21 +418,26 @@ def _check_output(result, values_col, index=['A', 'B'], df = df.set_index(['JOB', 'NAME', 'YEAR', 'MONTH'], drop=False, append=False) - result = df.pivot_table(index=['JOB', 'NAME'], - columns=['YEAR', 'MONTH'], - values=['DAYS', 'SALARY'], - aggfunc={'DAYS': 'mean', 'SALARY': 'sum'}, - margins=True) + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + result = df.pivot_table(index=['JOB', 'NAME'], + columns=['YEAR', 'MONTH'], + values=['DAYS', 'SALARY'], + aggfunc={'DAYS': 'mean', 'SALARY': 'sum'}, + margins=True) - expected = df.pivot_table(index=['JOB', 'NAME'], - columns=['YEAR', 'MONTH'], values=['DAYS'], - aggfunc='mean', margins=True) + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + expected = df.pivot_table(index=['JOB', 'NAME'], + columns=['YEAR', 'MONTH'], + values=['DAYS'], + aggfunc='mean', margins=True) tm.assert_frame_equal(result['DAYS'], expected['DAYS']) - expected = df.pivot_table(index=['JOB', 'NAME'], - columns=['YEAR', 'MONTH'], values=['SALARY'], - aggfunc='sum', margins=True) + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + expected = df.pivot_table(index=['JOB', 'NAME'], + columns=['YEAR', 'MONTH'], + values=['SALARY'], + aggfunc='sum', margins=True) tm.assert_frame_equal(result['SALARY'], expected['SALARY']) @@ -472,10 +514,10 @@ def test_pivot_columns_lexsorted(self): columns=['Index', 'Symbol', 'Year'], aggfunc='mean') - self.assertTrue(pivoted.columns.is_monotonic) + assert pivoted.columns.is_monotonic def test_pivot_complex_aggfunc(self): - f = {'D': ['std'], 'E': ['sum']} + f = OrderedDict([('D', ['std']), ('E', ['sum'])]) expected = self.data.groupby(['A', 'B']).agg(f).unstack('B') result = self.data.pivot_table(index='A', columns='B', aggfunc=f) @@ -486,21 +528,21 @@ def test_margins_no_values_no_cols(self): result = self.data[['A', 'B']].pivot_table( index=['A', 'B'], aggfunc=len, margins=True) result_list = result.tolist() - self.assertEqual(sum(result_list[:-1]), result_list[-1]) + assert sum(result_list[:-1]) == result_list[-1] def test_margins_no_values_two_rows(self): # Regression test on pivot table: no values passed but rows are a # multi-index result = self.data[['A', 'B', 'C']].pivot_table( index=['A', 'B'], columns='C', aggfunc=len, margins=True) - self.assertEqual(result.All.tolist(), [3.0, 1.0, 4.0, 3.0, 11.0]) + assert result.All.tolist() == [3.0, 1.0, 4.0, 3.0, 11.0] def test_margins_no_values_one_row_one_col(self): # Regression test on pivot table: no values passed but row and col # defined result = self.data[['A', 'B']].pivot_table( index='A', columns='B', 
aggfunc=len, margins=True) - self.assertEqual(result.All.tolist(), [4.0, 7.0, 11.0]) + assert result.All.tolist() == [4.0, 7.0, 11.0] def test_margins_no_values_two_row_two_cols(self): # Regression test on pivot table: no values passed but rows and cols @@ -509,22 +551,22 @@ def test_margins_no_values_two_row_two_cols(self): 'e', 'f', 'g', 'h', 'i', 'j', 'k'] result = self.data[['A', 'B', 'C', 'D']].pivot_table( index=['A', 'B'], columns=['C', 'D'], aggfunc=len, margins=True) - self.assertEqual(result.All.tolist(), [3.0, 1.0, 4.0, 3.0, 11.0]) + assert result.All.tolist() == [3.0, 1.0, 4.0, 3.0, 11.0] def test_pivot_table_with_margins_set_margin_name(self): - # GH 3335 + # see gh-3335 for margin_name in ['foo', 'one', 666, None, ['a', 'b']]: - with self.assertRaises(ValueError): + with pytest.raises(ValueError): # multi-index index pivot_table(self.data, values='D', index=['A', 'B'], columns=['C'], margins=True, margins_name=margin_name) - with self.assertRaises(ValueError): + with pytest.raises(ValueError): # multi-index column pivot_table(self.data, values='D', index=['C'], columns=['A', 'B'], margins=True, margins_name=margin_name) - with self.assertRaises(ValueError): + with pytest.raises(ValueError): # non-multi-index index/column pivot_table(self.data, values='D', index=['A'], columns=['B'], margins=True, @@ -587,10 +629,10 @@ def test_pivot_timegrouper(self): values='Quantity', aggfunc=np.sum) tm.assert_frame_equal(result, expected.T) - self.assertRaises(KeyError, lambda: pivot_table( + pytest.raises(KeyError, lambda: pivot_table( df, index=Grouper(freq='6MS', key='foo'), columns='Buyer', values='Quantity', aggfunc=np.sum)) - self.assertRaises(KeyError, lambda: pivot_table( + pytest.raises(KeyError, lambda: pivot_table( df, index='Buyer', columns=Grouper(freq='6MS', key='foo'), values='Quantity', aggfunc=np.sum)) @@ -607,10 +649,10 @@ def test_pivot_timegrouper(self): values='Quantity', aggfunc=np.sum) tm.assert_frame_equal(result, expected.T) - self.assertRaises(ValueError, lambda: pivot_table( + pytest.raises(ValueError, lambda: pivot_table( df, index=Grouper(freq='6MS', level='foo'), columns='Buyer', values='Quantity', aggfunc=np.sum)) - self.assertRaises(ValueError, lambda: pivot_table( + pytest.raises(ValueError, lambda: pivot_table( df, index='Buyer', columns=Grouper(freq='6MS', level='foo'), values='Quantity', aggfunc=np.sum)) @@ -854,10 +896,95 @@ def test_categorical_margins(self): table = data.pivot_table('x', 'y', 'z', margins=True) tm.assert_frame_equal(table, expected) + def test_categorical_aggfunc(self): + # GH 9534 + df = pd.DataFrame({"C1": ["A", "B", "C", "C"], + "C2": ["a", "a", "b", "b"], + "V": [1, 2, 3, 4]}) + df["C1"] = df["C1"].astype("category") + result = df.pivot_table("V", index="C1", columns="C2", aggfunc="count") + + expected_index = pd.CategoricalIndex(['A', 'B', 'C'], + categories=['A', 'B', 'C'], + ordered=False, + name='C1') + expected_columns = pd.Index(['a', 'b'], name='C2') + expected_data = np.array([[1., np.nan], + [1., np.nan], + [np.nan, 2.]]) + expected = pd.DataFrame(expected_data, + index=expected_index, + columns=expected_columns) + tm.assert_frame_equal(result, expected) + + def test_categorical_pivot_index_ordering(self): + # GH 8731 + df = pd.DataFrame({'Sales': [100, 120, 220], + 'Month': ['January', 'January', 'January'], + 'Year': [2013, 2014, 2013]}) + months = ['January', 'February', 'March', 'April', 'May', 'June', + 'July', 'August', 'September', 'October', 'November', + 'December'] + df['Month'] = 
df['Month'].astype('category').cat.set_categories(months) + result = df.pivot_table(values='Sales', + index='Month', + columns='Year', + aggfunc='sum') + expected_columns = pd.Int64Index([2013, 2014], name='Year') + expected_index = pd.CategoricalIndex(months, + categories=months, + ordered=False, + name='Month') + expected_data = np.empty((12, 2)) + expected_data.fill(np.nan) + expected_data[0, :] = [320., 120.] + expected = pd.DataFrame(expected_data, + index=expected_index, + columns=expected_columns) + tm.assert_frame_equal(result, expected) + + def test_pivot_table_not_series(self): + # GH 4386 + # pivot_table always returns a DataFrame + # when values is not list like and columns is None + # and aggfunc is not instance of list + df = DataFrame({'col1': [3, 4, 5], + 'col2': ['C', 'D', 'E'], + 'col3': [1, 3, 9]}) + + result = df.pivot_table('col1', index=['col3', 'col2'], aggfunc=np.sum) + m = MultiIndex.from_arrays([[1, 3, 9], + ['C', 'D', 'E']], + names=['col3', 'col2']) + expected = DataFrame([3, 4, 5], + index=m, columns=['col1']) + + tm.assert_frame_equal(result, expected) + + result = df.pivot_table( + 'col1', index='col3', columns='col2', aggfunc=np.sum + ) + expected = DataFrame([[3, np.NaN, np.NaN], + [np.NaN, 4, np.NaN], + [np.NaN, np.NaN, 5]], + index=Index([1, 3, 9], name='col3'), + columns=Index(['C', 'D', 'E'], name='col2')) + + tm.assert_frame_equal(result, expected) + + result = df.pivot_table('col1', index='col3', aggfunc=[np.sum]) + m = MultiIndex.from_arrays([['sum'], + ['col1']]) + expected = DataFrame([3, 4, 5], + index=Index([1, 3, 9], name='col3'), + columns=m) + + tm.assert_frame_equal(result, expected) + -class TestCrosstab(tm.TestCase): +class TestCrosstab(object): - def setUp(self): + def setup_method(self, method): df = DataFrame({'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'foo'], @@ -910,8 +1037,8 @@ def test_crosstab_ndarray(self): # assign arbitrary names result = crosstab(self.df['A'].values, self.df['C'].values) - self.assertEqual(result.index.name, 'row_0') - self.assertEqual(result.columns.name, 'col_0') + assert result.index.name == 'row_0' + assert result.columns.name == 'col_0' def test_crosstab_margins(self): a = np.random.randint(0, 7, size=100) @@ -923,8 +1050,8 @@ def test_crosstab_margins(self): result = crosstab(a, [b, c], rownames=['a'], colnames=('b', 'c'), margins=True) - self.assertEqual(result.index.names, ('a',)) - self.assertEqual(result.columns.names, ['b', 'c']) + assert result.index.names == ('a',) + assert result.columns.names == ['b', 'c'] all_cols = result['All', ''] exp_cols = df.groupby(['a']).size().astype('i8') @@ -935,7 +1062,7 @@ def test_crosstab_margins(self): tm.assert_series_equal(all_cols, exp_cols) - all_rows = result.ix['All'] + all_rows = result.loc['All'] exp_rows = df.groupby(['b', 'c']).size().astype('i8') exp_rows = exp_rows.append(Series([len(df)], index=[('All', '')])) exp_rows.name = 'All' @@ -1099,8 +1226,8 @@ def test_crosstab_normalize(self): pd.crosstab(df.a, df.b, normalize='index')) row_normal_margins = pd.DataFrame([[1.0, 0], - [0.25, 0.75], - [0.4, 0.6]], + [0.25, 0.75], + [0.4, 0.6]], index=pd.Index([1, 2, 'All'], name='a', dtype='object'), @@ -1112,8 +1239,8 @@ def test_crosstab_normalize(self): name='b')) all_normal_margins = pd.DataFrame([[0.2, 0, 0.2], - [0.2, 0.6, 0.8], - [0.4, 0.6, 1]], + [0.2, 0.6, 0.8], + [0.4, 0.6, 1]], index=pd.Index([1, 2, 'All'], name='a', dtype='object'), @@ -1194,26 +1321,180 @@ def test_crosstab_errors(self): 'c': [1, 1, np.nan, 1, 1]}) 
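The crosstab tests above cover the `normalize` argument together with margins. A minimal sketch with made-up data (the suite's own frames are larger):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 2],
                   'b': ['x', 'x', 'x', 'y', 'y']})

pd.crosstab(df.a, df.b)                      # raw counts
pd.crosstab(df.a, df.b, normalize='index')   # each row sums to 1
pd.crosstab(df.a, df.b, normalize='all',     # grand total is 1, with an
            margins=True)                    # 'All' row and column added
```

Invalid `normalize` values, and `values=`/`aggfunc=` passed without each other, raise `ValueError`, which `test_crosstab_errors` asserts below via `tm.assert_raises_regex`.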
error = 'values cannot be used without an aggfunc.' - with tm.assertRaisesRegexp(ValueError, error): + with tm.assert_raises_regex(ValueError, error): pd.crosstab(df.a, df.b, values=df.c) error = 'aggfunc cannot be used without values' - with tm.assertRaisesRegexp(ValueError, error): + with tm.assert_raises_regex(ValueError, error): pd.crosstab(df.a, df.b, aggfunc=np.mean) error = 'Not a valid normalize argument' - with tm.assertRaisesRegexp(ValueError, error): + with tm.assert_raises_regex(ValueError, error): pd.crosstab(df.a, df.b, normalize='42') - with tm.assertRaisesRegexp(ValueError, error): + with tm.assert_raises_regex(ValueError, error): pd.crosstab(df.a, df.b, normalize=42) error = 'Not a valid margins argument' - with tm.assertRaisesRegexp(ValueError, error): + with tm.assert_raises_regex(ValueError, error): pd.crosstab(df.a, df.b, normalize='all', margins=42) + def test_crosstab_with_categorial_columns(self): + # GH 8860 + df = pd.DataFrame({'MAKE': ['Honda', 'Acura', 'Tesla', + 'Honda', 'Honda', 'Acura'], + 'MODEL': ['Sedan', 'Sedan', 'Electric', + 'Pickup', 'Sedan', 'Sedan']}) + categories = ['Sedan', 'Electric', 'Pickup'] + df['MODEL'] = (df['MODEL'].astype('category') + .cat.set_categories(categories)) + result = pd.crosstab(df['MAKE'], df['MODEL']) + + expected_index = pd.Index(['Acura', 'Honda', 'Tesla'], name='MAKE') + expected_columns = pd.CategoricalIndex(categories, + categories=categories, + ordered=False, + name='MODEL') + expected_data = [[2, 0, 0], [2, 0, 1], [0, 1, 0]] + expected = pd.DataFrame(expected_data, + index=expected_index, + columns=expected_columns) + tm.assert_frame_equal(result, expected) + + def test_crosstab_with_numpy_size(self): + # GH 4003 + df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6, + 'B': ['A', 'B', 'C'] * 8, + 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4, + 'D': np.random.randn(24), + 'E': np.random.randn(24)}) + result = pd.crosstab(index=[df['A'], df['B']], + columns=[df['C']], + margins=True, + aggfunc=np.size, + values=df['D']) + expected_index = pd.MultiIndex(levels=[['All', 'one', 'three', 'two'], + ['', 'A', 'B', 'C']], + labels=[[1, 1, 1, 2, 2, 2, 3, 3, 3, 0], + [1, 2, 3, 1, 2, 3, 1, 2, 3, 0]], + names=['A', 'B']) + expected_column = pd.Index(['bar', 'foo', 'All'], + dtype='object', + name='C') + expected_data = np.array([[2., 2., 4.], + [2., 2., 4.], + [2., 2., 4.], + [2., np.nan, 2.], + [np.nan, 2., 2.], + [2., np.nan, 2.], + [np.nan, 2., 2.], + [2., np.nan, 2.], + [np.nan, 2., 2.], + [12., 12., 24.]]) + expected = pd.DataFrame(expected_data, + index=expected_index, + columns=expected_column) + tm.assert_frame_equal(result, expected) + + +class TestPivotAnnual(object): + """ + New pandas of scikits.timeseries pivot_annual + """ + + def test_daily(self): + rng = date_range('1/1/2000', '12/31/2004', freq='D') + ts = Series(np.random.randn(len(rng)), index=rng) + + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + annual = pivot_annual(ts, 'D') + + doy = np.asarray(ts.index.dayofyear) + + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + doy[(~isleapyear(ts.index.year)) & (doy >= 60)] += 1 + + for i in range(1, 367): + subset = ts[doy == i] + subset.index = [x.year for x in subset.index] + + result = annual[i].dropna() + tm.assert_series_equal(result, subset, check_names=False) + assert result.name == i + + # check leap days + leaps = ts[(ts.index.month == 2) & (ts.index.day == 29)] + day = leaps.index.dayofyear[0] + leaps.index = leaps.index.year + leaps.name = 60 
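`pivot_annual` and `isleapyear` are deprecated, so every call in the ported tests above is wrapped in `tm.assert_produces_warning`, which fails the test unless the expected warning is emitted inside the block. The idiom, sketched against a hypothetical deprecated helper rather than the real functions:

```python
import warnings
import pandas.util.testing as tm


def old_helper():
    # stand-in for a deprecated API such as pivot_annual
    warnings.warn("old_helper is deprecated", FutureWarning)
    return 42


# Passes only if a FutureWarning is raised inside the block;
# check_stacklevel=False skips validating where the warning points.
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
    assert old_helper() == 42
```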
+ tm.assert_series_equal(annual[day].dropna(), leaps) + + def test_hourly(self): + rng_hourly = date_range('1/1/1994', periods=(18 * 8760 + 4 * 24), + freq='H') + data_hourly = np.random.randint(100, 350, rng_hourly.size) + ts_hourly = Series(data_hourly, index=rng_hourly) + + grouped = ts_hourly.groupby(ts_hourly.index.year) + hoy = grouped.apply(lambda x: x.reset_index(drop=True)) + hoy = hoy.index.droplevel(0).values + + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + hoy[~isleapyear(ts_hourly.index.year) & (hoy >= 1416)] += 24 + hoy += 1 + + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + annual = pivot_annual(ts_hourly) + + ts_hourly = ts_hourly.astype(float) + for i in [1, 1416, 1417, 1418, 1439, 1440, 1441, 8784]: + subset = ts_hourly[hoy == i] + subset.index = [x.year for x in subset.index] + + result = annual[i].dropna() + tm.assert_series_equal(result, subset, check_names=False) + assert result.name == i + + leaps = ts_hourly[(ts_hourly.index.month == 2) & ( + ts_hourly.index.day == 29) & (ts_hourly.index.hour == 0)] + hour = leaps.index.dayofyear[0] * 24 - 23 + leaps.index = leaps.index.year + leaps.name = 1417 + tm.assert_series_equal(annual[hour].dropna(), leaps) + + def test_weekly(self): + pass + + def test_monthly(self): + rng = date_range('1/1/2000', '12/31/2004', freq='M') + ts = Series(np.random.randn(len(rng)), index=rng) + + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + annual = pivot_annual(ts, 'M') + + month = ts.index.month + for i in range(1, 13): + subset = ts[month == i] + subset.index = [x.year for x in subset.index] + result = annual[i].dropna() + tm.assert_series_equal(result, subset, check_names=False) + assert result.name == i + + def test_period_monthly(self): + pass + + def test_period_daily(self): + pass + + def test_period_weekly(self): + pass + + def test_isleapyear_deprecate(self): + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + assert isleapyear(2000) + + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + assert not isleapyear(2001) -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + assert isleapyear(2004) diff --git a/pandas/tests/test_reshape.py b/pandas/tests/reshape/test_reshape.py similarity index 59% rename from pandas/tests/test_reshape.py rename to pandas/tests/reshape/test_reshape.py index 80d1f5f76e5a9..79626d89026a7 100644 --- a/pandas/tests/test_reshape.py +++ b/pandas/tests/reshape/test_reshape.py @@ -1,9 +1,9 @@ # -*- coding: utf-8 -*- # pylint: disable-msg=W0612,E1101 -import nose + +import pytest from pandas import DataFrame, Series -from pandas.core.sparse import SparseDataFrame import pandas as pd from numpy import nan @@ -11,16 +11,15 @@ from pandas.util.testing import assert_frame_equal -from pandas.core.reshape import (melt, lreshape, get_dummies, wide_to_long) +from pandas.core.reshape.reshape import ( + melt, lreshape, get_dummies, wide_to_long) import pandas.util.testing as tm from pandas.compat import range, u -_multiprocess_can_split_ = True - -class TestMelt(tm.TestCase): +class TestMelt(object): - def setUp(self): + def setup_method(self, method): self.df = tm.makeTimeDataFrame()[:10] self.df['id1'] = (self.df['A'] > 0).astype(np.int64) self.df['id2'] = (self.df['B'] > 0).astype(np.int64) @@ -34,23 +33,44 @@ def setUp(self): self.df1.columns = 
[list('ABC'), list('abc')] self.df1.columns.names = ['CAP', 'low'] - def test_default_col_names(self): + def test_top_level_method(self): result = melt(self.df) - self.assertEqual(result.columns.tolist(), ['variable', 'value']) + assert result.columns.tolist() == ['variable', 'value'] - result1 = melt(self.df, id_vars=['id1']) - self.assertEqual(result1.columns.tolist(), ['id1', 'variable', 'value' - ]) + def test_method_signatures(self): + tm.assert_frame_equal(self.df.melt(), + melt(self.df)) - result2 = melt(self.df, id_vars=['id1', 'id2']) - self.assertEqual(result2.columns.tolist(), ['id1', 'id2', 'variable', - 'value']) + tm.assert_frame_equal(self.df.melt(id_vars=['id1', 'id2'], + value_vars=['A', 'B']), + melt(self.df, + id_vars=['id1', 'id2'], + value_vars=['A', 'B'])) + + tm.assert_frame_equal(self.df.melt(var_name=self.var_name, + value_name=self.value_name), + melt(self.df, + var_name=self.var_name, + value_name=self.value_name)) + + tm.assert_frame_equal(self.df1.melt(col_level=0), + melt(self.df1, col_level=0)) + + def test_default_col_names(self): + result = self.df.melt() + assert result.columns.tolist() == ['variable', 'value'] + + result1 = self.df.melt(id_vars=['id1']) + assert result1.columns.tolist() == ['id1', 'variable', 'value'] + + result2 = self.df.melt(id_vars=['id1', 'id2']) + assert result2.columns.tolist() == ['id1', 'id2', 'variable', 'value'] def test_value_vars(self): - result3 = melt(self.df, id_vars=['id1', 'id2'], value_vars='A') - self.assertEqual(len(result3), 10) + result3 = self.df.melt(id_vars=['id1', 'id2'], value_vars='A') + assert len(result3) == 10 - result4 = melt(self.df, id_vars=['id1', 'id2'], value_vars=['A', 'B']) + result4 = self.df.melt(id_vars=['id1', 'id2'], value_vars=['A', 'B']) expected4 = DataFrame({'id1': self.df['id1'].tolist() * 2, 'id2': self.df['id2'].tolist() * 2, 'variable': ['A'] * 10 + ['B'] * 10, @@ -59,24 +79,61 @@ def test_value_vars(self): columns=['id1', 'id2', 'variable', 'value']) tm.assert_frame_equal(result4, expected4) + def test_value_vars_types(self): + # GH 15348 + expected = DataFrame({'id1': self.df['id1'].tolist() * 2, + 'id2': self.df['id2'].tolist() * 2, + 'variable': ['A'] * 10 + ['B'] * 10, + 'value': (self.df['A'].tolist() + + self.df['B'].tolist())}, + columns=['id1', 'id2', 'variable', 'value']) + + for type_ in (tuple, list, np.array): + result = self.df.melt(id_vars=['id1', 'id2'], + value_vars=type_(('A', 'B'))) + tm.assert_frame_equal(result, expected) + + def test_vars_work_with_multiindex(self): + expected = DataFrame({ + ('A', 'a'): self.df1[('A', 'a')], + 'CAP': ['B'] * len(self.df1), + 'low': ['b'] * len(self.df1), + 'value': self.df1[('B', 'b')], + }, columns=[('A', 'a'), 'CAP', 'low', 'value']) + + result = self.df1.melt(id_vars=[('A', 'a')], value_vars=[('B', 'b')]) + tm.assert_frame_equal(result, expected) + + def test_tuple_vars_fail_with_multiindex(self): + # melt should fail with an informative error message if + # the columns have a MultiIndex and a tuple is passed + # for id_vars or value_vars. 
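`test_method_signatures` above pins down that the module-level `melt` and the new `DataFrame.melt` method accept the same arguments and return identical frames. For reference, the equivalence on a toy frame:

```python
import pandas as pd
import pandas.util.testing as tm

df = pd.DataFrame({'id1': [0, 1], 'A': [1.0, 2.0], 'B': [3.0, 4.0]})

by_function = pd.melt(df, id_vars=['id1'], value_vars=['A', 'B'])
by_method = df.melt(id_vars=['id1'], value_vars=['A', 'B'])

tm.assert_frame_equal(by_function, by_method)
```

With a MultiIndex on the columns, list-of-tuples `id_vars`/`value_vars` work while bare tuples raise `ValueError`, as the `test_tuple_vars_fail_with_multiindex` case continuing just below checks.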
+ tuple_a = ('A', 'a') + list_a = [tuple_a] + tuple_b = ('B', 'b') + list_b = [tuple_b] + + for id_vars, value_vars in ((tuple_a, list_b), (list_a, tuple_b), + (tuple_a, tuple_b)): + with tm.assert_raises_regex(ValueError, r'MultiIndex'): + self.df1.melt(id_vars=id_vars, value_vars=value_vars) + def test_custom_var_name(self): - result5 = melt(self.df, var_name=self.var_name) - self.assertEqual(result5.columns.tolist(), ['var', 'value']) + result5 = self.df.melt(var_name=self.var_name) + assert result5.columns.tolist() == ['var', 'value'] - result6 = melt(self.df, id_vars=['id1'], var_name=self.var_name) - self.assertEqual(result6.columns.tolist(), ['id1', 'var', 'value']) + result6 = self.df.melt(id_vars=['id1'], var_name=self.var_name) + assert result6.columns.tolist() == ['id1', 'var', 'value'] - result7 = melt(self.df, id_vars=['id1', 'id2'], var_name=self.var_name) - self.assertEqual(result7.columns.tolist(), ['id1', 'id2', 'var', - 'value']) + result7 = self.df.melt(id_vars=['id1', 'id2'], var_name=self.var_name) + assert result7.columns.tolist() == ['id1', 'id2', 'var', 'value'] - result8 = melt(self.df, id_vars=['id1', 'id2'], value_vars='A', - var_name=self.var_name) - self.assertEqual(result8.columns.tolist(), ['id1', 'id2', 'var', - 'value']) + result8 = self.df.melt(id_vars=['id1', 'id2'], value_vars='A', + var_name=self.var_name) + assert result8.columns.tolist() == ['id1', 'id2', 'var', 'value'] - result9 = melt(self.df, id_vars=['id1', 'id2'], value_vars=['A', 'B'], - var_name=self.var_name) + result9 = self.df.melt(id_vars=['id1', 'id2'], value_vars=['A', 'B'], + var_name=self.var_name) expected9 = DataFrame({'id1': self.df['id1'].tolist() * 2, 'id2': self.df['id2'].tolist() * 2, self.var_name: ['A'] * 10 + ['B'] * 10, @@ -86,24 +143,22 @@ def test_custom_var_name(self): tm.assert_frame_equal(result9, expected9) def test_custom_value_name(self): - result10 = melt(self.df, value_name=self.value_name) - self.assertEqual(result10.columns.tolist(), ['variable', 'val']) + result10 = self.df.melt(value_name=self.value_name) + assert result10.columns.tolist() == ['variable', 'val'] - result11 = melt(self.df, id_vars=['id1'], value_name=self.value_name) - self.assertEqual(result11.columns.tolist(), ['id1', 'variable', 'val']) + result11 = self.df.melt(id_vars=['id1'], value_name=self.value_name) + assert result11.columns.tolist() == ['id1', 'variable', 'val'] - result12 = melt(self.df, id_vars=['id1', 'id2'], - value_name=self.value_name) - self.assertEqual(result12.columns.tolist(), ['id1', 'id2', 'variable', - 'val']) + result12 = self.df.melt(id_vars=['id1', 'id2'], + value_name=self.value_name) + assert result12.columns.tolist() == ['id1', 'id2', 'variable', 'val'] - result13 = melt(self.df, id_vars=['id1', 'id2'], value_vars='A', - value_name=self.value_name) - self.assertEqual(result13.columns.tolist(), ['id1', 'id2', 'variable', - 'val']) + result13 = self.df.melt(id_vars=['id1', 'id2'], value_vars='A', + value_name=self.value_name) + assert result13.columns.tolist() == ['id1', 'id2', 'variable', 'val'] - result14 = melt(self.df, id_vars=['id1', 'id2'], value_vars=['A', 'B'], - value_name=self.value_name) + result14 = self.df.melt(id_vars=['id1', 'id2'], value_vars=['A', 'B'], + value_name=self.value_name) expected14 = DataFrame({'id1': self.df['id1'].tolist() * 2, 'id2': self.df['id2'].tolist() * 2, 'variable': ['A'] * 10 + ['B'] * 10, @@ -115,26 +170,27 @@ def test_custom_value_name(self): def test_custom_var_and_value_name(self): - result15 = melt(self.df, 
var_name=self.var_name, - value_name=self.value_name) - self.assertEqual(result15.columns.tolist(), ['var', 'val']) + result15 = self.df.melt(var_name=self.var_name, + value_name=self.value_name) + assert result15.columns.tolist() == ['var', 'val'] - result16 = melt(self.df, id_vars=['id1'], var_name=self.var_name, - value_name=self.value_name) - self.assertEqual(result16.columns.tolist(), ['id1', 'var', 'val']) + result16 = self.df.melt(id_vars=['id1'], var_name=self.var_name, + value_name=self.value_name) + assert result16.columns.tolist() == ['id1', 'var', 'val'] - result17 = melt(self.df, id_vars=['id1', 'id2'], - var_name=self.var_name, value_name=self.value_name) - self.assertEqual(result17.columns.tolist(), ['id1', 'id2', 'var', 'val' - ]) + result17 = self.df.melt(id_vars=['id1', 'id2'], + var_name=self.var_name, + value_name=self.value_name) + assert result17.columns.tolist() == ['id1', 'id2', 'var', 'val'] - result18 = melt(self.df, id_vars=['id1', 'id2'], value_vars='A', - var_name=self.var_name, value_name=self.value_name) - self.assertEqual(result18.columns.tolist(), ['id1', 'id2', 'var', 'val' - ]) + result18 = self.df.melt(id_vars=['id1', 'id2'], value_vars='A', + var_name=self.var_name, + value_name=self.value_name) + assert result18.columns.tolist() == ['id1', 'id2', 'var', 'val'] - result19 = melt(self.df, id_vars=['id1', 'id2'], value_vars=['A', 'B'], - var_name=self.var_name, value_name=self.value_name) + result19 = self.df.melt(id_vars=['id1', 'id2'], value_vars=['A', 'B'], + var_name=self.var_name, + value_name=self.value_name) expected19 = DataFrame({'id1': self.df['id1'].tolist() * 2, 'id2': self.df['id2'].tolist() * 2, self.var_name: ['A'] * 10 + ['B'] * 10, @@ -146,25 +202,25 @@ def test_custom_var_and_value_name(self): df20 = self.df.copy() df20.columns.name = 'foo' - result20 = melt(df20) - self.assertEqual(result20.columns.tolist(), ['foo', 'value']) + result20 = df20.melt() + assert result20.columns.tolist() == ['foo', 'value'] def test_col_level(self): - res1 = melt(self.df1, col_level=0) - res2 = melt(self.df1, col_level='CAP') - self.assertEqual(res1.columns.tolist(), ['CAP', 'value']) - self.assertEqual(res2.columns.tolist(), ['CAP', 'value']) + res1 = self.df1.melt(col_level=0) + res2 = self.df1.melt(col_level='CAP') + assert res1.columns.tolist() == ['CAP', 'value'] + assert res2.columns.tolist() == ['CAP', 'value'] def test_multiindex(self): - res = pd.melt(self.df1) - self.assertEqual(res.columns.tolist(), ['CAP', 'low', 'value']) + res = self.df1.melt() + assert res.columns.tolist() == ['CAP', 'low', 'value'] -class TestGetDummies(tm.TestCase): +class TestGetDummies(object): sparse = False - def setUp(self): + def setup_method(self, method): self.df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'b', 'c'], 'C': [1, 2, 3]}) @@ -198,25 +254,31 @@ def test_basic_types(self): 'b': ['A', 'A', 'B', 'C', 'C'], 'c': [2, 3, 3, 3, 2]}) + expected = DataFrame({'a': [1, 0, 0], + 'b': [0, 1, 0], + 'c': [0, 0, 1]}, + dtype='uint8', + columns=list('abc')) if not self.sparse: - exp_df_type = DataFrame - exp_blk_type = pd.core.internals.IntBlock + compare = tm.assert_frame_equal else: - exp_df_type = SparseDataFrame - exp_blk_type = pd.core.internals.SparseBlock + expected = expected.to_sparse(fill_value=0, kind='integer') + compare = tm.assert_sp_frame_equal + + result = get_dummies(s_list, sparse=self.sparse) + compare(result, expected) - self.assertEqual( - type(get_dummies(s_list, sparse=self.sparse)), exp_df_type) - self.assertEqual( - type(get_dummies(s_series, 
sparse=self.sparse)), exp_df_type) + result = get_dummies(s_series, sparse=self.sparse) + compare(result, expected) - r = get_dummies(s_df, sparse=self.sparse, columns=s_df.columns) - self.assertEqual(type(r), exp_df_type) + result = get_dummies(s_df, sparse=self.sparse, columns=s_df.columns) + tm.assert_series_equal(result.get_dtype_counts(), + Series({'uint8': 8})) - r = get_dummies(s_df, sparse=self.sparse, columns=['a']) - self.assertEqual(type(r[['a_0']]._data.blocks[0]), exp_blk_type) - self.assertEqual(type(r[['a_1']]._data.blocks[0]), exp_blk_type) - self.assertEqual(type(r[['a_2']]._data.blocks[0]), exp_blk_type) + result = get_dummies(s_df, sparse=self.sparse, columns=['a']) + expected = Series({'uint8': 3, 'int64': 1, 'object': 1}).sort_values() + tm.assert_series_equal(result.get_dtype_counts().sort_values(), + expected) def test_just_na(self): just_na_list = [np.nan] @@ -228,13 +290,13 @@ def test_just_na(self): res_series_index = get_dummies(just_na_series_index, sparse=self.sparse) - self.assertEqual(res_list.empty, True) - self.assertEqual(res_series.empty, True) - self.assertEqual(res_series_index.empty, True) + assert res_list.empty + assert res_series.empty + assert res_series_index.empty - self.assertEqual(res_list.index.tolist(), [0]) - self.assertEqual(res_series.index.tolist(), [0]) - self.assertEqual(res_series_index.index.tolist(), ['A']) + assert res_list.index.tolist() == [0] + assert res_series.index.tolist() == [0] + assert res_series_index.index.tolist() == ['A'] def test_include_na(self): s = ['a', 'b', np.nan] @@ -360,11 +422,11 @@ def test_dataframe_dummies_prefix_sep(self): assert_frame_equal(result, expected) def test_dataframe_dummies_prefix_bad_length(self): - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): get_dummies(self.df, prefix=['too few'], sparse=self.sparse) def test_dataframe_dummies_prefix_sep_bad_length(self): - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): get_dummies(self.df, prefix_sep=['bad'], sparse=self.sparse) def test_dataframe_dummies_prefix_dict(self): @@ -420,8 +482,8 @@ def test_dataframe_dummies_with_categorical(self): 'cat_x', 'cat_y']] assert_frame_equal(result, expected) - # GH12402 Add a new parameter `drop_first` to avoid collinearity def test_basic_drop_first(self): + # GH12402 Add a new parameter `drop_first` to avoid collinearity # Basic case s_list = list('abc') s_series = Series(s_list) @@ -582,7 +644,7 @@ class TestGetDummiesSparse(TestGetDummies): sparse = True -class TestMakeAxisDummies(tm.TestCase): +class TestMakeAxisDummies(object): def test_preserve_categorical_dtype(self): # GH13854 @@ -595,7 +657,7 @@ def test_preserve_categorical_dtype(self): expected = DataFrame([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], index=midx, columns=cidx) - from pandas.core.reshape import make_axis_dummies + from pandas.core.reshape.reshape import make_axis_dummies result = make_axis_dummies(df) tm.assert_frame_equal(result, expected) @@ -603,7 +665,7 @@ def test_preserve_categorical_dtype(self): tm.assert_frame_equal(result, expected) -class TestLreshape(tm.TestCase): +class TestLreshape(object): def test_pairs(self): data = {'birthdt': ['08jan2009', '20dec2008', '30dec2008', '21dec2008', @@ -672,10 +734,10 @@ def test_pairs(self): spec = {'visitdt': ['visitdt%d' % i for i in range(1, 3)], 'wt': ['wt%d' % i for i in range(1, 4)]} - self.assertRaises(ValueError, lreshape, df, spec) + pytest.raises(ValueError, lreshape, df, spec) -class TestWideToLong(tm.TestCase): +class TestWideToLong(object): def 
test_simple(self): + + np.random.seed(123) @@ -698,7 +760,7 @@ def test_simple(self): exp_data = {"X": x.tolist() + x.tolist(), "A": ['a', 'b', 'c', 'd', 'e', 'f'], "B": [2.5, 1.2, 0.7, 3.2, 1.3, 0.1], - "year": [1970, 1970, 1970, 1980, 1980, 1980], + "year": ['1970', '1970', '1970', '1980', '1980', '1980'], "id": [0, 1, 2, 0, 1, 2]} exp_frame = DataFrame(exp_data) exp_frame = exp_frame.set_index(['id', 'year'])[["X", "A", "B"]] @@ -714,9 +776,203 @@ def test_stubs(self): # TODO: unused? df_long = pd.wide_to_long(df, stubs, i='id', j='age') # noqa - self.assertEqual(stubs, ['inc', 'edu']) + assert stubs == ['inc', 'edu'] + def test_separating_character(self): + # GH14779 + np.random.seed(123) + x = np.random.randn(3) + df = pd.DataFrame({"A.1970": {0: "a", + 1: "b", + 2: "c"}, + "A.1980": {0: "d", + 1: "e", + 2: "f"}, + "B.1970": {0: 2.5, + 1: 1.2, + 2: .7}, + "B.1980": {0: 3.2, + 1: 1.3, + 2: .1}, + "X": dict(zip( + range(3), x))}) + df["id"] = df.index + exp_data = {"X": x.tolist() + x.tolist(), + "A": ['a', 'b', 'c', 'd', 'e', 'f'], + "B": [2.5, 1.2, 0.7, 3.2, 1.3, 0.1], + "year": ['1970', '1970', '1970', '1980', '1980', '1980'], + "id": [0, 1, 2, 0, 1, 2]} + exp_frame = DataFrame(exp_data) + exp_frame = exp_frame.set_index(['id', 'year'])[["X", "A", "B"]] + long_frame = wide_to_long(df, ["A", "B"], i="id", j="year", sep=".") + tm.assert_frame_equal(long_frame, exp_frame) -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + def test_escapable_characters(self): + np.random.seed(123) + x = np.random.randn(3) + df = pd.DataFrame({"A(quarterly)1970": {0: "a", + 1: "b", + 2: "c"}, + "A(quarterly)1980": {0: "d", + 1: "e", + 2: "f"}, + "B(quarterly)1970": {0: 2.5, + 1: 1.2, + 2: .7}, + "B(quarterly)1980": {0: 3.2, + 1: 1.3, + 2: .1}, + "X": dict(zip( + range(3), x))}) + df["id"] = df.index + exp_data = {"X": x.tolist() + x.tolist(), + "A(quarterly)": ['a', 'b', 'c', 'd', 'e', 'f'], + "B(quarterly)": [2.5, 1.2, 0.7, 3.2, 1.3, 0.1], + "year": ['1970', '1970', '1970', '1980', '1980', '1980'], + "id": [0, 1, 2, 0, 1, 2]} + exp_frame = DataFrame(exp_data) + exp_frame = exp_frame.set_index( + ['id', 'year'])[["X", "A(quarterly)", "B(quarterly)"]] + long_frame = wide_to_long(df, ["A(quarterly)", "B(quarterly)"], + i="id", j="year") + tm.assert_frame_equal(long_frame, exp_frame) + + def test_unbalanced(self): + # test that we can have a varying number of time variables + df = pd.DataFrame({'A2010': [1.0, 2.0], + 'A2011': [3.0, 4.0], + 'B2010': [5.0, 6.0], + 'X': ['X1', 'X2']}) + df['id'] = df.index + exp_data = {'X': ['X1', 'X1', 'X2', 'X2'], + 'A': [1.0, 3.0, 2.0, 4.0], + 'B': [5.0, np.nan, 6.0, np.nan], + 'id': [0, 0, 1, 1], + 'year': ['2010', '2011', '2010', '2011']} + exp_frame = pd.DataFrame(exp_data) + exp_frame = exp_frame.set_index(['id', 'year'])[["X", "A", "B"]] + long_frame = wide_to_long(df, ['A', 'B'], i='id', j='year') + tm.assert_frame_equal(long_frame, exp_frame) + + def test_character_overlap(self): + # Test that we handle overlapping characters in both id_vars and value_vars + df = pd.DataFrame({ + 'A11': ['a11', 'a22', 'a33'], + 'A12': ['a21', 'a22', 'a23'], + 'B11': ['b11', 'b12', 'b13'], + 'B12': ['b21', 'b22', 'b23'], + 'BB11': [1, 2, 3], + 'BB12': [4, 5, 6], + 'BBBX': [91, 92, 93], + 'BBBZ': [91, 92, 93] + }) + df['id'] = df.index + exp_frame = pd.DataFrame({ + 'BBBX': [91, 92, 93, 91, 92, 93], + 'BBBZ': [91, 92, 93, 91, 92, 93], + 'A': ['a11', 'a22', 'a33', 'a21', 'a22', 'a23'], + 'B': ['b11', 'b12', 'b13', 'b21', 'b22',
'b23'], + 'BB': [1, 2, 3, 4, 5, 6], + 'id': [0, 1, 2, 0, 1, 2], + 'year': ['11', '11', '11', '12', '12', '12']}) + exp_frame = exp_frame.set_index(['id', 'year'])[ + ['BBBX', 'BBBZ', 'A', 'B', 'BB']] + long_frame = wide_to_long(df, ['A', 'B', 'BB'], i='id', j='year') + tm.assert_frame_equal(long_frame.sort_index(axis=1), + exp_frame.sort_index(axis=1)) + + def test_invalid_separator(self): + # if an invalid separator is supplied, an empty data frame is returned + sep = 'nope!' + df = pd.DataFrame({'A2010': [1.0, 2.0], + 'A2011': [3.0, 4.0], + 'B2010': [5.0, 6.0], + 'X': ['X1', 'X2']}) + df['id'] = df.index + exp_data = {'X': '', + 'A2010': [], + 'A2011': [], + 'B2010': [], + 'id': [], + 'year': [], + 'A': [], + 'B': []} + exp_frame = pd.DataFrame(exp_data) + exp_frame = exp_frame.set_index(['id', 'year'])[[ + 'X', 'A2010', 'A2011', 'B2010', 'A', 'B']] + exp_frame.index.set_levels([[0, 1], []], inplace=True) + long_frame = wide_to_long(df, ['A', 'B'], i='id', j='year', sep=sep) + tm.assert_frame_equal(long_frame.sort_index(axis=1), + exp_frame.sort_index(axis=1)) + + def test_num_string_disambiguation(self): + # Test that we can disambiguate number value_vars from + # string value_vars + df = pd.DataFrame({ + 'A11': ['a11', 'a22', 'a33'], + 'A12': ['a21', 'a22', 'a23'], + 'B11': ['b11', 'b12', 'b13'], + 'B12': ['b21', 'b22', 'b23'], + 'BB11': [1, 2, 3], + 'BB12': [4, 5, 6], + 'Arating': [91, 92, 93], + 'Arating_old': [91, 92, 93] + }) + df['id'] = df.index + exp_frame = pd.DataFrame({ + 'Arating': [91, 92, 93, 91, 92, 93], + 'Arating_old': [91, 92, 93, 91, 92, 93], + 'A': ['a11', 'a22', 'a33', 'a21', 'a22', 'a23'], + 'B': ['b11', 'b12', 'b13', 'b21', 'b22', 'b23'], + 'BB': [1, 2, 3, 4, 5, 6], + 'id': [0, 1, 2, 0, 1, 2], + 'year': ['11', '11', '11', '12', '12', '12']}) + exp_frame = exp_frame.set_index(['id', 'year'])[ + ['Arating', 'Arating_old', 'A', 'B', 'BB']] + long_frame = wide_to_long(df, ['A', 'B', 'BB'], i='id', j='year') + tm.assert_frame_equal(long_frame.sort_index(axis=1), + exp_frame.sort_index(axis=1)) + + def test_invalid_suffixtype(self): + # If all stub names end with a string, but a numeric suffix is + # assumed, an empty data frame is returned + df = pd.DataFrame({'Aone': [1.0, 2.0], + 'Atwo': [3.0, 4.0], + 'Bone': [5.0, 6.0], + 'X': ['X1', 'X2']}) + df['id'] = df.index + exp_data = {'X': '', + 'Aone': [], + 'Atwo': [], + 'Bone': [], + 'id': [], + 'year': [], + 'A': [], + 'B': []} + exp_frame = pd.DataFrame(exp_data) + exp_frame = exp_frame.set_index(['id', 'year'])[[ + 'X', 'Aone', 'Atwo', 'Bone', 'A', 'B']] + exp_frame.index.set_levels([[0, 1], []], inplace=True) + long_frame = wide_to_long(df, ['A', 'B'], i='id', j='year') + tm.assert_frame_equal(long_frame.sort_index(axis=1), + exp_frame.sort_index(axis=1)) + + def test_multiple_id_columns(self): + # Taken from http://www.ats.ucla.edu/stat/stata/modules/reshapel.htm + df = pd.DataFrame({ + 'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3], + 'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3], + 'ht1': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1], + 'ht2': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9] + }) + exp_frame = pd.DataFrame({ + 'ht': [2.8, 3.4, 2.9, 3.8, 2.2, 2.9, 2.0, 3.2, 1.8, + 2.8, 1.9, 2.4, 2.2, 3.3, 2.3, 3.4, 2.1, 2.9], + 'famid': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], + 'birth': [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3], + 'age': ['1', '2', '1', '2', '1', '2', '1', '2', '1', + '2', '1', '2', '1', '2', '1', '2', '1', '2'] + }) + exp_frame = exp_frame.set_index(['famid', 'birth', 'age'])[['ht']] +
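+ # with i=['famid', 'birth'], each (famid, birth) pair stays an id key + # and the stub suffix ('1'/'2') becomes the new 'age' index level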
long_frame = wide_to_long(df, 'ht', i=['famid', 'birth'], j='age') + tm.assert_frame_equal(long_frame, exp_frame) diff --git a/pandas/tests/reshape/test_tile.py b/pandas/tests/reshape/test_tile.py new file mode 100644 index 0000000000000..8602b33856fea --- /dev/null +++ b/pandas/tests/reshape/test_tile.py @@ -0,0 +1,506 @@ +import os +import pytest + +import numpy as np +from pandas.compat import zip + +from pandas import (Series, Index, isnull, + to_datetime, DatetimeIndex, Timestamp, + Interval, IntervalIndex, Categorical, + cut, qcut, date_range) +import pandas.util.testing as tm + +from pandas.core.algorithms import quantile +import pandas.core.reshape.tile as tmod + + +class TestCut(object): + + def test_simple(self): + data = np.ones(5, dtype='int64') + result = cut(data, 4, labels=False) + expected = np.array([1, 1, 1, 1, 1]) + tm.assert_numpy_array_equal(result, expected, + check_dtype=False) + + def test_bins(self): + data = np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]) + result, bins = cut(data, 3, retbins=True) + + intervals = IntervalIndex.from_breaks(bins.round(3)) + expected = intervals.take([0, 0, 0, 1, 2, 0]).astype('category') + tm.assert_categorical_equal(result, expected) + tm.assert_almost_equal(bins, np.array([0.1905, 3.36666667, + 6.53333333, 9.7])) + + def test_right(self): + data = np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1, 2.575]) + result, bins = cut(data, 4, right=True, retbins=True) + intervals = IntervalIndex.from_breaks(bins.round(3)) + expected = intervals.astype('category').take([0, 0, 0, 2, 3, 0, 0]) + tm.assert_categorical_equal(result, expected) + tm.assert_almost_equal(bins, np.array([0.1905, 2.575, 4.95, + 7.325, 9.7])) + + def test_noright(self): + data = np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1, 2.575]) + result, bins = cut(data, 4, right=False, retbins=True) + intervals = IntervalIndex.from_breaks(bins.round(3), closed='left') + expected = intervals.take([0, 0, 0, 2, 3, 0, 1]).astype('category') + tm.assert_categorical_equal(result, expected) + tm.assert_almost_equal(bins, np.array([0.2, 2.575, 4.95, + 7.325, 9.7095])) + + def test_arraylike(self): + data = [.2, 1.4, 2.5, 6.2, 9.7, 2.1] + result, bins = cut(data, 3, retbins=True) + intervals = IntervalIndex.from_breaks(bins.round(3)) + expected = intervals.take([0, 0, 0, 1, 2, 0]).astype('category') + tm.assert_categorical_equal(result, expected) + tm.assert_almost_equal(bins, np.array([0.1905, 3.36666667, + 6.53333333, 9.7])) + + def test_bins_from_intervalindex(self): + c = cut(range(5), 3) + expected = c + result = cut(range(5), bins=expected.categories) + tm.assert_categorical_equal(result, expected) + + expected = Categorical.from_codes(np.append(c.codes, -1), + categories=c.categories, + ordered=True) + result = cut(range(6), bins=expected.categories) + tm.assert_categorical_equal(result, expected) + + # doc example + # make sure we preserve the bins + ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60]) + c = cut(ages, bins=[0, 18, 35, 70]) + expected = IntervalIndex.from_tuples([(0, 18), (18, 35), (35, 70)]) + tm.assert_index_equal(c.categories, expected) + + result = cut([25, 20, 50], bins=c.categories) + tm.assert_index_equal(result.categories, expected) + tm.assert_numpy_array_equal(result.codes, + np.array([1, 1, 2], dtype='int8')) + + def test_bins_not_monotonic(self): + data = [.2, 1.4, 2.5, 6.2, 9.7, 2.1] + pytest.raises(ValueError, cut, data, [0.1, 1.5, 1, 10]) + + def test_wrong_num_labels(self): + data = [.2, 1.4, 2.5, 6.2, 9.7, 2.1] + pytest.raises(ValueError, cut, data, [0, 1, 10], + 
labels=['foo', 'bar', 'baz']) + + def test_cut_corner(self): + # cut rejects empty input and a non-integer number of bins + pytest.raises(ValueError, cut, [], 2) + + pytest.raises(ValueError, cut, [1, 2, 3], 0.5) + + def test_cut_out_of_range_more(self): + # #1511 + s = Series([0, -1, 0, 1, -3], name='x') + ind = cut(s, [0, 1], labels=False) + exp = Series([np.nan, np.nan, np.nan, 0, np.nan], name='x') + tm.assert_series_equal(ind, exp) + + def test_labels(self): + arr = np.tile(np.arange(0, 1.01, 0.1), 4) + + result, bins = cut(arr, 4, retbins=True) + ex_levels = IntervalIndex.from_breaks([-1e-3, 0.25, 0.5, 0.75, 1]) + tm.assert_index_equal(result.categories, ex_levels) + + result, bins = cut(arr, 4, retbins=True, right=False) + ex_levels = IntervalIndex.from_breaks([0, 0.25, 0.5, 0.75, 1 + 1e-3], + closed='left') + tm.assert_index_equal(result.categories, ex_levels) + + def test_cut_pass_series_name_to_factor(self): + s = Series(np.random.randn(100), name='foo') + + factor = cut(s, 4) + assert factor.name == 'foo' + + def test_label_precision(self): + arr = np.arange(0, 0.73, 0.01) + + result = cut(arr, 4, precision=2) + ex_levels = IntervalIndex.from_breaks([-0.00072, 0.18, 0.36, + 0.54, 0.72]) + tm.assert_index_equal(result.categories, ex_levels) + + def test_na_handling(self): + arr = np.arange(0, 0.75, 0.01) + arr[::3] = np.nan + + result = cut(arr, 4) + + result_arr = np.asarray(result) + + ex_arr = np.where(isnull(arr), np.nan, result_arr) + + tm.assert_almost_equal(result_arr, ex_arr) + + result = cut(arr, 4, labels=False) + ex_result = np.where(isnull(arr), np.nan, result) + tm.assert_almost_equal(result, ex_result) + + def test_inf_handling(self): + data = np.arange(6) + data_ser = Series(data, dtype='int64') + + bins = [-np.inf, 2, 4, np.inf] + result = cut(data, bins) + result_ser = cut(data_ser, bins) + + ex_uniques = IntervalIndex.from_breaks(bins) + tm.assert_index_equal(result.categories, ex_uniques) + assert result[5] == Interval(4, np.inf) + assert result[0] == Interval(-np.inf, 2) + assert result_ser[5] == Interval(4, np.inf) + assert result_ser[0] == Interval(-np.inf, 2) + + def test_qcut(self): + arr = np.random.randn(1000) + + # We store the bins as Index that have been rounded, + # so comparisons are a bit tricky.
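+ # (hence the np.allclose checks with atol=1e-2 below, rather than + # exact equality on the edges)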
+ labels, bins = qcut(arr, 4, retbins=True) + ex_bins = quantile(arr, [0, .25, .5, .75, 1.]) + result = labels.categories.left.values + assert np.allclose(result, ex_bins[:-1], atol=1e-2) + result = labels.categories.right.values + assert np.allclose(result, ex_bins[1:], atol=1e-2) + + ex_levels = cut(arr, ex_bins, include_lowest=True) + tm.assert_categorical_equal(labels, ex_levels) + + def test_qcut_bounds(self): + arr = np.random.randn(1000) + + factor = qcut(arr, 10, labels=False) + assert len(np.unique(factor)) == 10 + + def test_qcut_specify_quantiles(self): + arr = np.random.randn(100) + + factor = qcut(arr, [0, .25, .5, .75, 1.]) + expected = qcut(arr, 4) + tm.assert_categorical_equal(factor, expected) + + def test_qcut_all_bins_same(self): + tm.assert_raises_regex(ValueError, "edges.*unique", qcut, + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 3) + + def test_cut_out_of_bounds(self): + arr = np.random.randn(100) + + result = cut(arr, [-1, 0, 1]) + + mask = isnull(result) + ex_mask = (arr < -1) | (arr > 1) + tm.assert_numpy_array_equal(mask, ex_mask) + + def test_cut_pass_labels(self): + arr = [50, 5, 10, 15, 20, 30, 70] + bins = [0, 25, 50, 100] + labels = ['Small', 'Medium', 'Large'] + + result = cut(arr, bins, labels=labels) + exp = Categorical(['Medium'] + 4 * ['Small'] + ['Medium', 'Large'], + ordered=True) + tm.assert_categorical_equal(result, exp) + + result = cut(arr, bins, labels=Categorical.from_codes([0, 1, 2], + labels)) + exp = Categorical.from_codes([1] + 4 * [0] + [1, 2], labels) + tm.assert_categorical_equal(result, exp) + + def test_qcut_include_lowest(self): + values = np.arange(10) + + ii = qcut(values, 4) + + ex_levels = IntervalIndex.from_intervals( + [Interval(-0.001, 2.25), + Interval(2.25, 4.5), + Interval(4.5, 6.75), + Interval(6.75, 9)]) + tm.assert_index_equal(ii.categories, ex_levels) + + def test_qcut_nas(self): + arr = np.random.randn(100) + arr[:20] = np.nan + + result = qcut(arr, 4) + assert isnull(result[:20]).all() + + def test_qcut_index(self): + result = qcut([0, 2], 2) + expected = Index([Interval(-0.001, 1), Interval(1, 2)]).astype( + 'category') + tm.assert_categorical_equal(result, expected) + + def test_round_frac(self): + # it works + result = cut(np.arange(11.), 2) + + result = cut(np.arange(11.) 
/ 1e10, 2) + + # #1979, negative numbers + + result = tmod._round_frac(-117.9998, precision=3) + assert result == -118 + result = tmod._round_frac(117.9998, precision=3) + assert result == 118 + + result = tmod._round_frac(117.9998, precision=2) + assert result == 118 + result = tmod._round_frac(0.000123456, precision=2) + assert result == 0.00012 + + def test_qcut_binning_issues(self): + # #1978, 1979 + path = os.path.join(tm.get_data_path(), 'cut_data.csv') + arr = np.loadtxt(path) + + result = qcut(arr, 20) + + starts = [] + ends = [] + for lev in np.unique(result): + s = lev.left + e = lev.right + assert s != e + + starts.append(float(s)) + ends.append(float(e)) + + for (sp, sn), (ep, en) in zip(zip(starts[:-1], starts[1:]), + zip(ends[:-1], ends[1:])): + assert sp < sn + assert ep < en + assert ep <= sn + + def test_cut_return_intervals(self): + s = Series([0, 1, 2, 3, 4, 5, 6, 7, 8]) + res = cut(s, 3) + exp_bins = np.linspace(0, 8, num=4).round(3) + exp_bins[0] -= 0.008 + exp = Series(IntervalIndex.from_breaks(exp_bins, closed='right').take( + [0, 0, 0, 1, 1, 1, 2, 2, 2])).astype('category', ordered=True) + tm.assert_series_equal(res, exp) + + def test_qcut_return_intervals(self): + s = Series([0, 1, 2, 3, 4, 5, 6, 7, 8]) + res = qcut(s, [0, 0.333, 0.666, 1]) + exp_levels = np.array([Interval(-0.001, 2.664), + Interval(2.664, 5.328), Interval(5.328, 8)]) + exp = Series(exp_levels.take([0, 0, 0, 1, 1, 1, 2, 2, 2])).astype( + 'category', ordered=True) + tm.assert_series_equal(res, exp) + + def test_series_retbins(self): + # GH 8589 + s = Series(np.arange(4)) + result, bins = cut(s, 2, retbins=True) + expected = Series(IntervalIndex.from_breaks( + [-0.003, 1.5, 3], closed='right').repeat(2)).astype('category', + ordered=True) + tm.assert_series_equal(result, expected) + + result, bins = qcut(s, 2, retbins=True) + expected = Series(IntervalIndex.from_breaks( + [-0.001, 1.5, 3], closed='right').repeat(2)).astype('category', + ordered=True) + tm.assert_series_equal(result, expected) + + def test_qcut_duplicates_bin(self): + # GH 7751 + values = [0, 0, 0, 0, 1, 2, 3] + expected = IntervalIndex.from_intervals([Interval(-0.001, 1), + Interval(1, 3)]) + + result = qcut(values, 3, duplicates='drop') + tm.assert_index_equal(result.categories, expected) + + pytest.raises(ValueError, qcut, values, 3) + pytest.raises(ValueError, qcut, values, 3, duplicates='raise') + + # invalid + pytest.raises(ValueError, qcut, values, 3, duplicates='foo') + + def test_single_quantile(self): + # issue 15431 + expected = Series([0, 0]) + + s = Series([9., 9.]) + result = qcut(s, 1, labels=False) + tm.assert_series_equal(result, expected) + result = qcut(s, 1) + intervals = IntervalIndex([Interval(8.999, 9.0), + Interval(8.999, 9.0)], closed='right') + expected = Series(intervals).astype('category', ordered=True) + tm.assert_series_equal(result, expected) + + s = Series([-9., -9.]) + expected = Series([0, 0]) + result = qcut(s, 1, labels=False) + tm.assert_series_equal(result, expected) + result = qcut(s, 1) + intervals = IntervalIndex([Interval(-9.001, -9.0), + Interval(-9.001, -9.0)], closed='right') + expected = Series(intervals).astype('category', ordered=True) + tm.assert_series_equal(result, expected) + + s = Series([0., 0.]) + expected = Series([0, 0]) + result = qcut(s, 1, labels=False) + tm.assert_series_equal(result, expected) + result = qcut(s, 1) + intervals = IntervalIndex([Interval(-0.001, 0.0), + Interval(-0.001, 0.0)], closed='right') + expected = Series(intervals).astype('category', ordered=True) + 
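+ # the left edge is nudged just below the data (0.0 -> -0.001) so the + # single value still falls inside the right-closed interval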
tm.assert_series_equal(result, expected) + + s = Series([9]) + expected = Series([0]) + result = qcut(s, 1, labels=False) + tm.assert_series_equal(result, expected) + result = qcut(s, 1) + intervals = IntervalIndex([Interval(8.999, 9.0)], closed='right') + expected = Series(intervals).astype('category', ordered=True) + tm.assert_series_equal(result, expected) + + s = Series([-9]) + expected = Series([0]) + result = qcut(s, 1, labels=False) + tm.assert_series_equal(result, expected) + result = qcut(s, 1) + intervals = IntervalIndex([Interval(-9.001, -9.0)], closed='right') + expected = Series(intervals).astype('category', ordered=True) + tm.assert_series_equal(result, expected) + + s = Series([0]) + expected = Series([0]) + result = qcut(s, 1, labels=False) + tm.assert_series_equal(result, expected) + result = qcut(s, 1) + intervals = IntervalIndex([Interval(-0.001, 0.0)], closed='right') + expected = Series(intervals).astype('category', ordered=True) + tm.assert_series_equal(result, expected) + + def test_single_bin(self): + # issue 14652 + expected = Series([0, 0]) + + s = Series([9., 9.]) + result = cut(s, 1, labels=False) + tm.assert_series_equal(result, expected) + + s = Series([-9., -9.]) + result = cut(s, 1, labels=False) + tm.assert_series_equal(result, expected) + + expected = Series([0]) + + s = Series([9]) + result = cut(s, 1, labels=False) + tm.assert_series_equal(result, expected) + + s = Series([-9]) + result = cut(s, 1, labels=False) + tm.assert_series_equal(result, expected) + + # issue 15428 + expected = Series([0, 0]) + + s = Series([0., 0.]) + result = cut(s, 1, labels=False) + tm.assert_series_equal(result, expected) + + expected = Series([0]) + + s = Series([0]) + result = cut(s, 1, labels=False) + tm.assert_series_equal(result, expected) + + def test_datetime_cut(self): + # GH 14714 + # testing for time data to be present as series + data = to_datetime(Series(['2013-01-01', '2013-01-02', '2013-01-03'])) + + result, bins = cut(data, 3, retbins=True) + expected = ( + Series(IntervalIndex.from_intervals([ + Interval(Timestamp('2012-12-31 23:57:07.200000'), + Timestamp('2013-01-01 16:00:00')), + Interval(Timestamp('2013-01-01 16:00:00'), + Timestamp('2013-01-02 08:00:00')), + Interval(Timestamp('2013-01-02 08:00:00'), + Timestamp('2013-01-03 00:00:00'))])) + .astype('category', ordered=True)) + + tm.assert_series_equal(result, expected) + + # testing for time data to be present as list + data = [np.datetime64('2013-01-01'), np.datetime64('2013-01-02'), + np.datetime64('2013-01-03')] + result, bins = cut(data, 3, retbins=True) + tm.assert_series_equal(Series(result), expected) + + # testing for time data to be present as ndarray + data = np.array([np.datetime64('2013-01-01'), + np.datetime64('2013-01-02'), + np.datetime64('2013-01-03')]) + result, bins = cut(data, 3, retbins=True) + tm.assert_series_equal(Series(result), expected) + + # testing for time data to be present as datetime index + data = DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03']) + result, bins = cut(data, 3, retbins=True) + tm.assert_series_equal(Series(result), expected) + + def test_datetime_bin(self): + data = [np.datetime64('2012-12-13'), np.datetime64('2012-12-15')] + bin_data = ['2012-12-12', '2012-12-14', '2012-12-16'] + expected = ( + Series(IntervalIndex.from_intervals([ + Interval(Timestamp(bin_data[0]), Timestamp(bin_data[1])), + Interval(Timestamp(bin_data[1]), Timestamp(bin_data[2]))])) + .astype('category', ordered=True)) + + for conv in [Timestamp, Timestamp, np.datetime64]: + bins 
= [conv(v) for v in bin_data] + result = cut(data, bins=bins) + tm.assert_series_equal(Series(result), expected) + + bin_pydatetime = [Timestamp(v).to_pydatetime() for v in bin_data] + result = cut(data, bins=bin_pydatetime) + tm.assert_series_equal(Series(result), expected) + + bins = to_datetime(bin_data) + result = cut(data, bins=bins) + tm.assert_series_equal(Series(result), expected) + + def test_datetime_nan(self): + + def f(): + cut(date_range('20130101', periods=3), bins=[0, 2, 4]) + pytest.raises(ValueError, f) + + result = cut(date_range('20130102', periods=5), + bins=date_range('20130101', periods=2)) + mask = result.categories.isnull() + tm.assert_numpy_array_equal(mask, np.array([False])) + mask = result.isnull() + tm.assert_numpy_array_equal( + mask, np.array([False, True, True, True, True])) + + +def curpath(): + pth, _ = os.path.split(os.path.abspath(__file__)) + return pth diff --git a/pandas/tests/reshape/test_union_categoricals.py b/pandas/tests/reshape/test_union_categoricals.py new file mode 100644 index 0000000000000..fe8d54005ba9b --- /dev/null +++ b/pandas/tests/reshape/test_union_categoricals.py @@ -0,0 +1,341 @@ +import pytest + +import numpy as np +import pandas as pd +from pandas import Categorical, Series, CategoricalIndex +from pandas.core.dtypes.concat import union_categoricals +from pandas.util import testing as tm + + +class TestUnionCategoricals(object): + + def test_union_categorical(self): + # GH 13361 + data = [ + (list('abc'), list('abd'), list('abcabd')), + ([0, 1, 2], [2, 3, 4], [0, 1, 2, 2, 3, 4]), + ([0, 1.2, 2], [2, 3.4, 4], [0, 1.2, 2, 2, 3.4, 4]), + + (['b', 'b', np.nan, 'a'], ['a', np.nan, 'c'], + ['b', 'b', np.nan, 'a', 'a', np.nan, 'c']), + + (pd.date_range('2014-01-01', '2014-01-05'), + pd.date_range('2014-01-06', '2014-01-07'), + pd.date_range('2014-01-01', '2014-01-07')), + + (pd.date_range('2014-01-01', '2014-01-05', tz='US/Central'), + pd.date_range('2014-01-06', '2014-01-07', tz='US/Central'), + pd.date_range('2014-01-01', '2014-01-07', tz='US/Central')), + + (pd.period_range('2014-01-01', '2014-01-05'), + pd.period_range('2014-01-06', '2014-01-07'), + pd.period_range('2014-01-01', '2014-01-07')), + ] + + for a, b, combined in data: + for box in [Categorical, CategoricalIndex, Series]: + result = union_categoricals([box(Categorical(a)), + box(Categorical(b))]) + expected = Categorical(combined) + tm.assert_categorical_equal(result, expected, + check_category_order=True) + + # new categories ordered by appearance + s = Categorical(['x', 'y', 'z']) + s2 = Categorical(['a', 'b', 'c']) + result = union_categoricals([s, s2]) + expected = Categorical(['x', 'y', 'z', 'a', 'b', 'c'], + categories=['x', 'y', 'z', 'a', 'b', 'c']) + tm.assert_categorical_equal(result, expected) + + s = Categorical([0, 1.2, 2], ordered=True) + s2 = Categorical([0, 1.2, 2], ordered=True) + result = union_categoricals([s, s2]) + expected = Categorical([0, 1.2, 2, 0, 1.2, 2], ordered=True) + tm.assert_categorical_equal(result, expected) + + # must exactly match types + s = Categorical([0, 1.2, 2]) + s2 = Categorical([2, 3, 4]) + msg = 'dtype of categories must be the same' + with tm.assert_raises_regex(TypeError, msg): + union_categoricals([s, s2]) + + msg = 'No Categoricals to union' + with tm.assert_raises_regex(ValueError, msg): + union_categoricals([]) + + def test_union_categoricals_nan(self): + # GH 13759 + res = union_categoricals([pd.Categorical([1, 2, np.nan]), + pd.Categorical([3, 2, np.nan])]) + exp = Categorical([1, 2, np.nan, 3, 2, np.nan]) +
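+ # NaN never becomes a category; it is carried in the codes as -1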
tm.assert_categorical_equal(res, exp) + + res = union_categoricals([pd.Categorical(['A', 'B']), + pd.Categorical(['B', 'B', np.nan])]) + exp = Categorical(['A', 'B', 'B', 'B', np.nan]) + tm.assert_categorical_equal(res, exp) + + val1 = [pd.Timestamp('2011-01-01'), pd.Timestamp('2011-03-01'), + pd.NaT] + val2 = [pd.NaT, pd.Timestamp('2011-01-01'), + pd.Timestamp('2011-02-01')] + + res = union_categoricals([pd.Categorical(val1), pd.Categorical(val2)]) + exp = Categorical(val1 + val2, + categories=[pd.Timestamp('2011-01-01'), + pd.Timestamp('2011-03-01'), + pd.Timestamp('2011-02-01')]) + tm.assert_categorical_equal(res, exp) + + # all NaN + res = union_categoricals([pd.Categorical([np.nan, np.nan]), + pd.Categorical(['X'])]) + exp = Categorical([np.nan, np.nan, 'X']) + tm.assert_categorical_equal(res, exp) + + res = union_categoricals([pd.Categorical([np.nan, np.nan]), + pd.Categorical([np.nan, np.nan])]) + exp = Categorical([np.nan, np.nan, np.nan, np.nan]) + tm.assert_categorical_equal(res, exp) + + def test_union_categoricals_empty(self): + # GH 13759 + res = union_categoricals([pd.Categorical([]), + pd.Categorical([])]) + exp = Categorical([]) + tm.assert_categorical_equal(res, exp) + + res = union_categoricals([pd.Categorical([]), + pd.Categorical([1.0])]) + exp = Categorical([1.0]) + tm.assert_categorical_equal(res, exp) + + # to make dtype equal + nanc = pd.Categorical(np.array([np.nan], dtype=np.float64)) + res = union_categoricals([nanc, + pd.Categorical([])]) + tm.assert_categorical_equal(res, nanc) + + def test_union_categorical_same_category(self): + # check fastpath + c1 = Categorical([1, 2, 3, 4], categories=[1, 2, 3, 4]) + c2 = Categorical([3, 2, 1, np.nan], categories=[1, 2, 3, 4]) + res = union_categoricals([c1, c2]) + exp = Categorical([1, 2, 3, 4, 3, 2, 1, np.nan], + categories=[1, 2, 3, 4]) + tm.assert_categorical_equal(res, exp) + + c1 = Categorical(['z', 'z', 'z'], categories=['x', 'y', 'z']) + c2 = Categorical(['x', 'x', 'x'], categories=['x', 'y', 'z']) + res = union_categoricals([c1, c2]) + exp = Categorical(['z', 'z', 'z', 'x', 'x', 'x'], + categories=['x', 'y', 'z']) + tm.assert_categorical_equal(res, exp) + + def test_union_categoricals_ordered(self): + c1 = Categorical([1, 2, 3], ordered=True) + c2 = Categorical([1, 2, 3], ordered=False) + + msg = 'Categorical.ordered must be the same' + with tm.assert_raises_regex(TypeError, msg): + union_categoricals([c1, c2]) + + res = union_categoricals([c1, c1]) + exp = Categorical([1, 2, 3, 1, 2, 3], ordered=True) + tm.assert_categorical_equal(res, exp) + + c1 = Categorical([1, 2, 3, np.nan], ordered=True) + c2 = Categorical([3, 2], categories=[1, 2, 3], ordered=True) + + res = union_categoricals([c1, c2]) + exp = Categorical([1, 2, 3, np.nan, 3, 2], ordered=True) + tm.assert_categorical_equal(res, exp) + + c1 = Categorical([1, 2, 3], ordered=True) + c2 = Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True) + + msg = "to union ordered Categoricals, all categories must be the same" + with tm.assert_raises_regex(TypeError, msg): + union_categoricals([c1, c2]) + + def test_union_categoricals_ignore_order(self): + # GH 15219 + c1 = Categorical([1, 2, 3], ordered=True) + c2 = Categorical([1, 2, 3], ordered=False) + + res = union_categoricals([c1, c2], ignore_order=True) + exp = Categorical([1, 2, 3, 1, 2, 3]) + tm.assert_categorical_equal(res, exp) + + msg = 'Categorical.ordered must be the same' + with tm.assert_raises_regex(TypeError, msg): + union_categoricals([c1, c2], ignore_order=False) + + res = 
union_categoricals([c1, c1], ignore_order=True) + exp = Categorical([1, 2, 3, 1, 2, 3]) + tm.assert_categorical_equal(res, exp) + + res = union_categoricals([c1, c1], ignore_order=False) + exp = Categorical([1, 2, 3, 1, 2, 3], + categories=[1, 2, 3], ordered=True) + tm.assert_categorical_equal(res, exp) + + c1 = Categorical([1, 2, 3, np.nan], ordered=True) + c2 = Categorical([3, 2], categories=[1, 2, 3], ordered=True) + + res = union_categoricals([c1, c2], ignore_order=True) + exp = Categorical([1, 2, 3, np.nan, 3, 2]) + tm.assert_categorical_equal(res, exp) + + c1 = Categorical([1, 2, 3], ordered=True) + c2 = Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True) + + res = union_categoricals([c1, c2], ignore_order=True) + exp = Categorical([1, 2, 3, 1, 2, 3]) + tm.assert_categorical_equal(res, exp) + + res = union_categoricals([c2, c1], ignore_order=True, + sort_categories=True) + exp = Categorical([1, 2, 3, 1, 2, 3], categories=[1, 2, 3]) + tm.assert_categorical_equal(res, exp) + + c1 = Categorical([1, 2, 3], ordered=True) + c2 = Categorical([4, 5, 6], ordered=True) + result = union_categoricals([c1, c2], ignore_order=True) + expected = Categorical([1, 2, 3, 4, 5, 6]) + tm.assert_categorical_equal(result, expected) + + msg = "to union ordered Categoricals, all categories must be the same" + with tm.assert_raises_regex(TypeError, msg): + union_categoricals([c1, c2], ignore_order=False) + + with tm.assert_raises_regex(TypeError, msg): + union_categoricals([c1, c2]) + + def test_union_categoricals_sort(self): + # GH 13846 + c1 = Categorical(['x', 'y', 'z']) + c2 = Categorical(['a', 'b', 'c']) + result = union_categoricals([c1, c2], sort_categories=True) + expected = Categorical(['x', 'y', 'z', 'a', 'b', 'c'], + categories=['a', 'b', 'c', 'x', 'y', 'z']) + tm.assert_categorical_equal(result, expected) + + # fastpath + c1 = Categorical(['a', 'b'], categories=['b', 'a', 'c']) + c2 = Categorical(['b', 'c'], categories=['b', 'a', 'c']) + result = union_categoricals([c1, c2], sort_categories=True) + expected = Categorical(['a', 'b', 'b', 'c'], + categories=['a', 'b', 'c']) + tm.assert_categorical_equal(result, expected) + + c1 = Categorical(['a', 'b'], categories=['c', 'a', 'b']) + c2 = Categorical(['b', 'c'], categories=['c', 'a', 'b']) + result = union_categoricals([c1, c2], sort_categories=True) + expected = Categorical(['a', 'b', 'b', 'c'], + categories=['a', 'b', 'c']) + tm.assert_categorical_equal(result, expected) + + # fastpath - skip resort + c1 = Categorical(['a', 'b'], categories=['a', 'b', 'c']) + c2 = Categorical(['b', 'c'], categories=['a', 'b', 'c']) + result = union_categoricals([c1, c2], sort_categories=True) + expected = Categorical(['a', 'b', 'b', 'c'], + categories=['a', 'b', 'c']) + tm.assert_categorical_equal(result, expected) + + c1 = Categorical(['x', np.nan]) + c2 = Categorical([np.nan, 'b']) + result = union_categoricals([c1, c2], sort_categories=True) + expected = Categorical(['x', np.nan, np.nan, 'b'], + categories=['b', 'x']) + tm.assert_categorical_equal(result, expected) + + c1 = Categorical([np.nan]) + c2 = Categorical([np.nan]) + result = union_categoricals([c1, c2], sort_categories=True) + expected = Categorical([np.nan, np.nan], categories=[]) + tm.assert_categorical_equal(result, expected) + + c1 = Categorical([]) + c2 = Categorical([]) + result = union_categoricals([c1, c2], sort_categories=True) + expected = Categorical([]) + tm.assert_categorical_equal(result, expected) + + c1 = Categorical(['b', 'a'], categories=['b', 'a', 'c'], ordered=True) + c2 = 
Categorical(['a', 'c'], categories=['b', 'a', 'c'], ordered=True) + with pytest.raises(TypeError): + union_categoricals([c1, c2], sort_categories=True) + + def test_union_categoricals_sort_false(self): + # GH 13846 + c1 = Categorical(['x', 'y', 'z']) + c2 = Categorical(['a', 'b', 'c']) + result = union_categoricals([c1, c2], sort_categories=False) + expected = Categorical(['x', 'y', 'z', 'a', 'b', 'c'], + categories=['x', 'y', 'z', 'a', 'b', 'c']) + tm.assert_categorical_equal(result, expected) + + # fastpath + c1 = Categorical(['a', 'b'], categories=['b', 'a', 'c']) + c2 = Categorical(['b', 'c'], categories=['b', 'a', 'c']) + result = union_categoricals([c1, c2], sort_categories=False) + expected = Categorical(['a', 'b', 'b', 'c'], + categories=['b', 'a', 'c']) + tm.assert_categorical_equal(result, expected) + + # fastpath - skip resort + c1 = Categorical(['a', 'b'], categories=['a', 'b', 'c']) + c2 = Categorical(['b', 'c'], categories=['a', 'b', 'c']) + result = union_categoricals([c1, c2], sort_categories=False) + expected = Categorical(['a', 'b', 'b', 'c'], + categories=['a', 'b', 'c']) + tm.assert_categorical_equal(result, expected) + + c1 = Categorical(['x', np.nan]) + c2 = Categorical([np.nan, 'b']) + result = union_categoricals([c1, c2], sort_categories=False) + expected = Categorical(['x', np.nan, np.nan, 'b'], + categories=['x', 'b']) + tm.assert_categorical_equal(result, expected) + + c1 = Categorical([np.nan]) + c2 = Categorical([np.nan]) + result = union_categoricals([c1, c2], sort_categories=False) + expected = Categorical([np.nan, np.nan], categories=[]) + tm.assert_categorical_equal(result, expected) + + c1 = Categorical([]) + c2 = Categorical([]) + result = union_categoricals([c1, c2], sort_categories=False) + expected = Categorical([]) + tm.assert_categorical_equal(result, expected) + + c1 = Categorical(['b', 'a'], categories=['b', 'a', 'c'], ordered=True) + c2 = Categorical(['a', 'c'], categories=['b', 'a', 'c'], ordered=True) + result = union_categoricals([c1, c2], sort_categories=False) + expected = Categorical(['b', 'a', 'a', 'c'], + categories=['b', 'a', 'c'], ordered=True) + tm.assert_categorical_equal(result, expected) + + def test_union_categorical_unwrap(self): + # GH 14173 + c1 = Categorical(['a', 'b']) + c2 = pd.Series(['b', 'c'], dtype='category') + result = union_categoricals([c1, c2]) + expected = Categorical(['a', 'b', 'b', 'c']) + tm.assert_categorical_equal(result, expected) + + c2 = CategoricalIndex(c2) + result = union_categoricals([c1, c2]) + tm.assert_categorical_equal(result, expected) + + c1 = Series(c1) + result = union_categoricals([c1, c2]) + tm.assert_categorical_equal(result, expected) + + with pytest.raises(TypeError): + union_categoricals([c1, ['a', 'b', 'c']]) diff --git a/pandas/tests/reshape/test_util.py b/pandas/tests/reshape/test_util.py new file mode 100644 index 0000000000000..e4a9591b95c26 --- /dev/null +++ b/pandas/tests/reshape/test_util.py @@ -0,0 +1,49 @@ + +import numpy as np +from pandas import date_range, Index +import pandas.util.testing as tm +from pandas.core.reshape.util import cartesian_product + + +class TestCartesianProduct(object): + + def test_simple(self): + x, y = list('ABC'), [1, 22] + result1, result2 = cartesian_product([x, y]) + expected1 = np.array(['A', 'A', 'B', 'B', 'C', 'C']) + expected2 = np.array([1, 22, 1, 22, 1, 22]) + tm.assert_numpy_array_equal(result1, expected1) + tm.assert_numpy_array_equal(result2, expected2) + + def test_datetimeindex(self): + # regression test for GitHub issue #6439 + # make sure 
that the ordering on datetimeindex is consistent + x = date_range('2000-01-01', periods=2) + result1, result2 = [Index(y).day for y in cartesian_product([x, x])] + expected1 = Index([1, 1, 2, 2]) + expected2 = Index([1, 2, 1, 2]) + tm.assert_index_equal(result1, expected1) + tm.assert_index_equal(result2, expected2) + + def test_empty(self): + # product of empty factors + X = [[], [0, 1], []] + Y = [[], [], ['a', 'b', 'c']] + for x, y in zip(X, Y): + expected1 = np.array([], dtype=np.asarray(x).dtype) + expected2 = np.array([], dtype=np.asarray(y).dtype) + result1, result2 = cartesian_product([x, y]) + tm.assert_numpy_array_equal(result1, expected1) + tm.assert_numpy_array_equal(result2, expected2) + + # empty product (empty input): + result = cartesian_product([]) + expected = [] + assert result == expected + + def test_invalid_input(self): + invalid_inputs = [1, [1], [1, 2], [[1], 2], + 'a', ['a'], ['a', 'b'], [['a'], 'b']] + msg = "Input must be a list-like of list-likes" + for X in invalid_inputs: + tm.assert_raises_regex(TypeError, msg, cartesian_product, X=X) diff --git a/pandas/tests/scalar/__init__.py b/pandas/tests/scalar/__init__.py new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/pandas/tests/scalar/test_interval.py b/pandas/tests/scalar/test_interval.py new file mode 100644 index 0000000000000..e06f7cb34eb52 --- /dev/null +++ b/pandas/tests/scalar/test_interval.py @@ -0,0 +1,127 @@ +from __future__ import division + +import pytest +from pandas import Interval +import pandas.util.testing as tm + + +class TestInterval(object): + def setup_method(self, method): + self.interval = Interval(0, 1) + + def test_properties(self): + assert self.interval.closed == 'right' + assert self.interval.left == 0 + assert self.interval.right == 1 + assert self.interval.mid == 0.5 + + def test_repr(self): + assert repr(self.interval) == "Interval(0, 1, closed='right')" + assert str(self.interval) == "(0, 1]" + + interval_left = Interval(0, 1, closed='left') + assert repr(interval_left) == "Interval(0, 1, closed='left')" + assert str(interval_left) == "[0, 1)" + + def test_contains(self): + assert 0.5 in self.interval + assert 1 in self.interval + assert 0 not in self.interval + pytest.raises(TypeError, lambda: self.interval in self.interval) + + interval = Interval(0, 1, closed='both') + assert 0 in interval + assert 1 in interval + + interval = Interval(0, 1, closed='neither') + assert 0 not in interval + assert 0.5 in interval + assert 1 not in interval + + def test_equal(self): + assert Interval(0, 1) == Interval(0, 1, closed='right') + assert Interval(0, 1) != Interval(0, 1, closed='left') + assert Interval(0, 1) != 0 + + def test_comparison(self): + with tm.assert_raises_regex(TypeError, 'unorderable types'): + Interval(0, 1) < 2 + + assert Interval(0, 1) < Interval(1, 2) + assert Interval(0, 1) < Interval(0, 2) + assert Interval(0, 1) < Interval(0.5, 1.5) + assert Interval(0, 1) <= Interval(0, 1) + assert Interval(0, 1) > Interval(-1, 2) + assert Interval(0, 1) >= Interval(0, 1) + + def test_hash(self): + # should not raise + hash(self.interval) + + def test_math_add(self): + expected = Interval(1, 2) + actual = self.interval + 1 + assert expected == actual + + expected = Interval(1, 2) + actual = 1 + self.interval + assert expected == actual + + actual = self.interval + actual += 1 + assert expected == actual + + with pytest.raises(TypeError): + self.interval + Interval(1, 2) + + with pytest.raises(TypeError): + self.interval + 'foo' + + def test_math_sub(self): + 
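+ # subtracting a scalar shifts both endpoints; Interval - Interval and + # Interval - str are undefined and must raise TypeError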
expected = Interval(-1, 0) + actual = self.interval - 1 + assert expected == actual + + actual = self.interval + actual -= 1 + assert expected == actual + + with pytest.raises(TypeError): + self.interval - Interval(1, 2) + + with pytest.raises(TypeError): + self.interval - 'foo' + + def test_math_mult(self): + expected = Interval(0, 2) + actual = self.interval * 2 + assert expected == actual + + expected = Interval(0, 2) + actual = 2 * self.interval + assert expected == actual + + actual = self.interval + actual *= 2 + assert expected == actual + + with pytest.raises(TypeError): + self.interval * Interval(1, 2) + + with pytest.raises(TypeError): + self.interval * 'foo' + + def test_math_div(self): + expected = Interval(0, 0.5) + actual = self.interval / 2.0 + assert expected == actual + + actual = self.interval + actual /= 2.0 + assert expected == actual + + with pytest.raises(TypeError): + self.interval / Interval(1, 2) + + with pytest.raises(TypeError): + self.interval / 'foo' diff --git a/pandas/tests/scalar/test_nat.py b/pandas/tests/scalar/test_nat.py new file mode 100644 index 0000000000000..0695fe2243947 --- /dev/null +++ b/pandas/tests/scalar/test_nat.py @@ -0,0 +1,249 @@ +import pytest + +from datetime import datetime, timedelta +import pytz + +import numpy as np +from pandas import (NaT, Index, Timestamp, Timedelta, Period, + DatetimeIndex, PeriodIndex, + TimedeltaIndex, Series, isnull) +from pandas.util import testing as tm +from pandas._libs.tslib import iNaT + + +@pytest.mark.parametrize('nat, idx', [(Timestamp('NaT'), DatetimeIndex), + (Timedelta('NaT'), TimedeltaIndex), + (Period('NaT', freq='M'), PeriodIndex)]) +def test_nat_fields(nat, idx): + + for field in idx._field_ops: + + # weekday is a property of DTI, but a method + # on NaT/Timestamp for compat with datetime + if field == 'weekday': + continue + + result = getattr(NaT, field) + assert np.isnan(result) + + result = getattr(nat, field) + assert np.isnan(result) + + for field in idx._bool_ops: + + result = getattr(NaT, field) + assert result is False + + result = getattr(nat, field) + assert result is False + + +def test_nat_vector_field_access(): + idx = DatetimeIndex(['1/1/2000', None, None, '1/4/2000']) + + for field in DatetimeIndex._field_ops: + # weekday is a property of DTI, but a method + # on NaT/Timestamp for compat with datetime + if field == 'weekday': + continue + + result = getattr(idx, field) + expected = Index([getattr(x, field) for x in idx]) + tm.assert_index_equal(result, expected) + + s = Series(idx) + + for field in DatetimeIndex._field_ops: + + # weekday is a property of DTI, but a method + # on NaT/Timestamp for compat with datetime + if field == 'weekday': + continue + + result = getattr(s.dt, field) + expected = [getattr(x, field) for x in idx] + tm.assert_series_equal(result, Series(expected)) + + for field in DatetimeIndex._bool_ops: + result = getattr(s.dt, field) + expected = [getattr(x, field) for x in idx] + tm.assert_series_equal(result, Series(expected)) + + +@pytest.mark.parametrize('klass', [Timestamp, Timedelta, Period]) +def test_identity(klass): + assert klass(None) is NaT + + result = klass(np.nan) + assert result is NaT + + result = klass(None) + assert result is NaT + + result = klass(iNaT) + assert result is NaT + + result = klass(np.nan) + assert result is NaT + + result = klass(float('nan')) + assert result is NaT + + result = klass(NaT) + assert result is NaT + + result = klass('NaT') + assert result is NaT + + assert isnull(klass('nat')) + + 
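The `is NaT` identity checks above, and the equality cases below, hinge on the same fact: `NaT` is a singleton that, like IEEE NaN, compares unequal to everything including itself, so `is` rather than `==` is the reliable test. A minimal interpreter sketch of that behavior:

```python
>>> import pandas as pd
>>> pd.Timestamp(None) is pd.NaT   # every NaT-like input maps to one object
True
>>> pd.NaT == pd.NaT               # equality follows NaN semantics
False
>>> pd.NaT is pd.NaT               # identity is what the tests rely on
True
>>> pd.isnull(pd.NaT)
True
```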
+@pytest.mark.parametrize('klass', [Timestamp, Timedelta, Period]) +def test_equality(klass): + + # nat + if klass is not Period: + assert klass('').value == iNaT + assert klass('nat').value == iNaT + assert klass('NAT').value == iNaT + assert klass(None).value == iNaT + assert klass(np.nan).value == iNaT + assert isnull(klass('nat')) + + +@pytest.mark.parametrize('klass', [Timestamp, Timedelta]) +def test_round_nat(klass): + # GH14940 + ts = klass('nat') + for method in ["round", "floor", "ceil"]: + round_method = getattr(ts, method) + for freq in ["s", "5s", "min", "5min", "h", "5h"]: + assert round_method(freq) is ts + + +def test_NaT_methods(): + # GH 9513 + raise_methods = ['astimezone', 'combine', 'ctime', 'dst', + 'fromordinal', 'fromtimestamp', 'isocalendar', + 'strftime', 'strptime', 'time', 'timestamp', + 'timetuple', 'timetz', 'toordinal', 'tzname', + 'utcfromtimestamp', 'utcnow', 'utcoffset', + 'utctimetuple'] + nat_methods = ['date', 'now', 'replace', 'to_datetime', 'today', + 'tz_convert', 'tz_localize'] + nan_methods = ['weekday', 'isoweekday'] + + for method in raise_methods: + if hasattr(NaT, method): + with pytest.raises(ValueError): + getattr(NaT, method)() + + for method in nan_methods: + if hasattr(NaT, method): + assert np.isnan(getattr(NaT, method)()) + + for method in nat_methods: + if hasattr(NaT, method): + # see gh-8254 + exp_warning = None + if method == 'to_datetime': + exp_warning = FutureWarning + with tm.assert_produces_warning( + exp_warning, check_stacklevel=False): + assert getattr(NaT, method)() is NaT + + # GH 12300 + assert NaT.isoformat() == 'NaT' + + +@pytest.mark.parametrize('klass', [Timestamp, Timedelta]) +def test_isoformat(klass): + + result = klass('NaT').isoformat() + expected = 'NaT' + assert result == expected + + +def test_nat_arithmetic(): + # GH 6873 + i = 2 + f = 1.5 + + for (left, right) in [(NaT, i), (NaT, f), (NaT, np.nan)]: + assert left / right is NaT + assert left * right is NaT + assert right * left is NaT + with pytest.raises(TypeError): + right / left + + # Timestamp / datetime + t = Timestamp('2014-01-01') + dt = datetime(2014, 1, 1) + for (left, right) in [(NaT, NaT), (NaT, t), (NaT, dt)]: + # NaT __add__ or __sub__ Timestamp-like (or inverse) returns NaT + assert right + left is NaT + assert left + right is NaT + assert left - right is NaT + assert right - left is NaT + + # timedelta-like + # offsets are tested in test_offsets.py + + delta = timedelta(3600) + td = Timedelta('5s') + + for (left, right) in [(NaT, delta), (NaT, td)]: + # NaT + timedelta-like returns NaT + assert right + left is NaT + assert left + right is NaT + assert right - left is NaT + assert left - right is NaT + + # GH 11718 + t_utc = Timestamp('2014-01-01', tz='UTC') + t_tz = Timestamp('2014-01-01', tz='US/Eastern') + dt_tz = pytz.timezone('Asia/Tokyo').localize(dt) + + for (left, right) in [(NaT, t_utc), (NaT, t_tz), + (NaT, dt_tz)]: + # NaT __add__ or __sub__ Timestamp-like (or inverse) returns NaT + assert right + left is NaT + assert left + right is NaT + assert left - right is NaT + assert right - left is NaT + + # int addition / subtraction + for (left, right) in [(NaT, 2), (NaT, 0), (NaT, -3)]: + assert right + left is NaT + assert left + right is NaT + assert left - right is NaT + assert right - left is NaT + + +def test_nat_arithmetic_index(): + # GH 11718 + + dti = DatetimeIndex(['2011-01-01', '2011-01-02'], name='x') + exp = DatetimeIndex([NaT, NaT], name='x') + tm.assert_index_equal(dti + NaT, exp) + tm.assert_index_equal(NaT + dti, exp) + + dti_tz =
DatetimeIndex(['2011-01-01', '2011-01-02'], + tz='US/Eastern', name='x') + exp = DatetimeIndex([NaT, NaT], name='x', tz='US/Eastern') + tm.assert_index_equal(dti_tz + NaT, exp) + tm.assert_index_equal(NaT + dti_tz, exp) + + exp = TimedeltaIndex([NaT, NaT], name='x') + for (left, right) in [(NaT, dti), (NaT, dti_tz)]: + tm.assert_index_equal(left - right, exp) + tm.assert_index_equal(right - left, exp) + + # timedelta + tdi = TimedeltaIndex(['1 day', '2 day'], name='x') + exp = DatetimeIndex([NaT, NaT], name='x') + for (left, right) in [(NaT, tdi)]: + tm.assert_index_equal(left + right, exp) + tm.assert_index_equal(right + left, exp) + tm.assert_index_equal(left - right, exp) + tm.assert_index_equal(right - left, exp) diff --git a/pandas/tests/scalar/test_period.py b/pandas/tests/scalar/test_period.py new file mode 100644 index 0000000000000..54366dc9b1c3f --- /dev/null +++ b/pandas/tests/scalar/test_period.py @@ -0,0 +1,1409 @@ +import pytest + +import numpy as np +from datetime import datetime, date, timedelta + +import pandas as pd +import pandas.util.testing as tm +import pandas.core.indexes.period as period +from pandas.compat import text_type, iteritems +from pandas.compat.numpy import np_datetime64_compat + +from pandas._libs import tslib, period as libperiod +from pandas import Period, Timestamp, offsets +from pandas.tseries.frequencies import DAYS, MONTHS + + +class TestPeriodProperties(object): + "Test properties such as year, month, weekday, etc...." + + def test_is_leap_year(self): + # GH 13727 + for freq in ['A', 'M', 'D', 'H']: + p = Period('2000-01-01 00:00:00', freq=freq) + assert p.is_leap_year + assert isinstance(p.is_leap_year, bool) + + p = Period('1999-01-01 00:00:00', freq=freq) + assert not p.is_leap_year + + p = Period('2004-01-01 00:00:00', freq=freq) + assert p.is_leap_year + + p = Period('2100-01-01 00:00:00', freq=freq) + assert not p.is_leap_year + + def test_quarterly_negative_ordinals(self): + p = Period(ordinal=-1, freq='Q-DEC') + assert p.year == 1969 + assert p.quarter == 4 + assert isinstance(p, Period) + + p = Period(ordinal=-2, freq='Q-DEC') + assert p.year == 1969 + assert p.quarter == 3 + assert isinstance(p, Period) + + p = Period(ordinal=-2, freq='M') + assert p.year == 1969 + assert p.month == 11 + assert isinstance(p, Period) + + def test_period_cons_quarterly(self): + # bugs in scikits.timeseries + for month in MONTHS: + freq = 'Q-%s' % month + exp = Period('1989Q3', freq=freq) + assert '1989Q3' in str(exp) + stamp = exp.to_timestamp('D', how='end') + p = Period(stamp, freq=freq) + assert p == exp + + stamp = exp.to_timestamp('3D', how='end') + p = Period(stamp, freq=freq) + assert p == exp + + def test_period_cons_annual(self): + # bugs in scikits.timeseries + for month in MONTHS: + freq = 'A-%s' % month + exp = Period('1989', freq=freq) + stamp = exp.to_timestamp('D', how='end') + timedelta(days=30) + p = Period(stamp, freq=freq) + assert p == exp + 1 + assert isinstance(p, Period) + + def test_period_cons_weekly(self): + for num in range(10, 17): + daystr = '2011-02-%d' % num + for day in DAYS: + freq = 'W-%s' % day + + result = Period(daystr, freq=freq) + expected = Period(daystr, freq='D').asfreq(freq) + assert result == expected + assert isinstance(result, Period) + + def test_period_from_ordinal(self): + p = pd.Period('2011-01', freq='M') + res = pd.Period._from_ordinal(p.ordinal, freq='M') + assert p == res + assert isinstance(res, Period) + + def test_period_cons_nat(self): + p = Period('NaT', freq='M') + assert p is pd.NaT + + p = 
Period('nat', freq='W-SUN') + assert p is pd.NaT + + p = Period(tslib.iNaT, freq='D') + assert p is pd.NaT + + p = Period(tslib.iNaT, freq='3D') + assert p is pd.NaT + + p = Period(tslib.iNaT, freq='1D1H') + assert p is pd.NaT + + p = Period('NaT') + assert p is pd.NaT + + p = Period(tslib.iNaT) + assert p is pd.NaT + + def test_period_cons_mult(self): + p1 = Period('2011-01', freq='3M') + p2 = Period('2011-01', freq='M') + assert p1.ordinal == p2.ordinal + + assert p1.freq == offsets.MonthEnd(3) + assert p1.freqstr == '3M' + + assert p2.freq == offsets.MonthEnd() + assert p2.freqstr == 'M' + + result = p1 + 1 + assert result.ordinal == (p2 + 3).ordinal + assert result.freq == p1.freq + assert result.freqstr == '3M' + + result = p1 - 1 + assert result.ordinal == (p2 - 3).ordinal + assert result.freq == p1.freq + assert result.freqstr == '3M' + + msg = ('Frequency must be positive, because it' + ' represents span: -3M') + with tm.assert_raises_regex(ValueError, msg): + Period('2011-01', freq='-3M') + + msg = ('Frequency must be positive, because it' ' represents span: 0M') + with tm.assert_raises_regex(ValueError, msg): + Period('2011-01', freq='0M') + + def test_period_cons_combined(self): + p = [(Period('2011-01', freq='1D1H'), + Period('2011-01', freq='1H1D'), + Period('2011-01', freq='H')), + (Period(ordinal=1, freq='1D1H'), + Period(ordinal=1, freq='1H1D'), + Period(ordinal=1, freq='H'))] + + for p1, p2, p3 in p: + assert p1.ordinal == p3.ordinal + assert p2.ordinal == p3.ordinal + + assert p1.freq == offsets.Hour(25) + assert p1.freqstr == '25H' + + assert p2.freq == offsets.Hour(25) + assert p2.freqstr == '25H' + + assert p3.freq == offsets.Hour() + assert p3.freqstr == 'H' + + result = p1 + 1 + assert result.ordinal == (p3 + 25).ordinal + assert result.freq == p1.freq + assert result.freqstr == '25H' + + result = p2 + 1 + assert result.ordinal == (p3 + 25).ordinal + assert result.freq == p2.freq + assert result.freqstr == '25H' + + result = p1 - 1 + assert result.ordinal == (p3 - 25).ordinal + assert result.freq == p1.freq + assert result.freqstr == '25H' + + result = p2 - 1 + assert result.ordinal == (p3 - 25).ordinal + assert result.freq == p2.freq + assert result.freqstr == '25H' + + msg = ('Frequency must be positive, because it' + ' represents span: -25H') + with tm.assert_raises_regex(ValueError, msg): + Period('2011-01', freq='-1D1H') + with tm.assert_raises_regex(ValueError, msg): + Period('2011-01', freq='-1H1D') + with tm.assert_raises_regex(ValueError, msg): + Period(ordinal=1, freq='-1D1H') + with tm.assert_raises_regex(ValueError, msg): + Period(ordinal=1, freq='-1H1D') + + msg = ('Frequency must be positive, because it' + ' represents span: 0D') + with tm.assert_raises_regex(ValueError, msg): + Period('2011-01', freq='0D0H') + with tm.assert_raises_regex(ValueError, msg): + Period(ordinal=1, freq='0D0H') + + # You can only combine together day and intraday offsets + msg = ('Invalid frequency: 1W1D') + with tm.assert_raises_regex(ValueError, msg): + Period('2011-01', freq='1W1D') + msg = ('Invalid frequency: 1D1W') + with tm.assert_raises_regex(ValueError, msg): + Period('2011-01', freq='1D1W') + + def test_timestamp_tz_arg(self): + tm._skip_if_no_pytz() + import pytz + for case in ['Europe/Brussels', 'Asia/Tokyo', 'US/Pacific']: + p = Period('1/1/2005', freq='M').to_timestamp(tz=case) + exp = Timestamp('1/1/2005', tz='UTC').tz_convert(case) + exp_zone = pytz.timezone(case).normalize(p) + + assert p == exp + assert p.tz == exp_zone.tzinfo + assert p.tz == exp.tz + + p = 
Period('1/1/2005', freq='3H').to_timestamp(tz=case) + exp = Timestamp('1/1/2005', tz='UTC').tz_convert(case) + exp_zone = pytz.timezone(case).normalize(p) + + assert p == exp + assert p.tz == exp_zone.tzinfo + assert p.tz == exp.tz + + p = Period('1/1/2005', freq='A').to_timestamp(freq='A', tz=case) + exp = Timestamp('31/12/2005', tz='UTC').tz_convert(case) + exp_zone = pytz.timezone(case).normalize(p) + + assert p == exp + assert p.tz == exp_zone.tzinfo + assert p.tz == exp.tz + + p = Period('1/1/2005', freq='A').to_timestamp(freq='3H', tz=case) + exp = Timestamp('1/1/2005', tz='UTC').tz_convert(case) + exp_zone = pytz.timezone(case).normalize(p) + + assert p == exp + assert p.tz == exp_zone.tzinfo + assert p.tz == exp.tz + + def test_timestamp_tz_arg_dateutil(self): + from pandas._libs.tslib import _dateutil_gettz as gettz + from pandas._libs.tslib import maybe_get_tz + for case in ['dateutil/Europe/Brussels', 'dateutil/Asia/Tokyo', + 'dateutil/US/Pacific']: + p = Period('1/1/2005', freq='M').to_timestamp( + tz=maybe_get_tz(case)) + exp = Timestamp('1/1/2005', tz='UTC').tz_convert(case) + assert p == exp + assert p.tz == gettz(case.split('/', 1)[1]) + assert p.tz == exp.tz + + p = Period('1/1/2005', + freq='M').to_timestamp(freq='3H', tz=maybe_get_tz(case)) + exp = Timestamp('1/1/2005', tz='UTC').tz_convert(case) + assert p == exp + assert p.tz == gettz(case.split('/', 1)[1]) + assert p.tz == exp.tz + + def test_timestamp_tz_arg_dateutil_from_string(self): + from pandas._libs.tslib import _dateutil_gettz as gettz + p = Period('1/1/2005', + freq='M').to_timestamp(tz='dateutil/Europe/Brussels') + assert p.tz == gettz('Europe/Brussels') + + def test_timestamp_mult(self): + p = pd.Period('2011-01', freq='M') + assert p.to_timestamp(how='S') == pd.Timestamp('2011-01-01') + assert p.to_timestamp(how='E') == pd.Timestamp('2011-01-31') + + p = pd.Period('2011-01', freq='3M') + assert p.to_timestamp(how='S') == pd.Timestamp('2011-01-01') + assert p.to_timestamp(how='E') == pd.Timestamp('2011-03-31') + + def test_construction(self): + i1 = Period('1/1/2005', freq='M') + i2 = Period('Jan 2005') + + assert i1 == i2 + + i1 = Period('2005', freq='A') + i2 = Period('2005') + i3 = Period('2005', freq='a') + + assert i1 == i2 + assert i1 == i3 + + i4 = Period('2005', freq='M') + i5 = Period('2005', freq='m') + + pytest.raises(ValueError, i1.__ne__, i4) + assert i4 == i5 + + i1 = Period.now('Q') + i2 = Period(datetime.now(), freq='Q') + i3 = Period.now('q') + + assert i1 == i2 + assert i1 == i3 + + i1 = Period('1982', freq='min') + i2 = Period('1982', freq='MIN') + assert i1 == i2 + i2 = Period('1982', freq=('Min', 1)) + assert i1 == i2 + + i1 = Period(year=2005, month=3, day=1, freq='D') + i2 = Period('3/1/2005', freq='D') + assert i1 == i2 + + i3 = Period(year=2005, month=3, day=1, freq='d') + assert i1 == i3 + + i1 = Period('2007-01-01 09:00:00.001') + expected = Period(datetime(2007, 1, 1, 9, 0, 0, 1000), freq='L') + assert i1 == expected + + expected = Period(np_datetime64_compat( + '2007-01-01 09:00:00.001Z'), freq='L') + assert i1 == expected + + i1 = Period('2007-01-01 09:00:00.00101') + expected = Period(datetime(2007, 1, 1, 9, 0, 0, 1010), freq='U') + assert i1 == expected + + expected = Period(np_datetime64_compat('2007-01-01 09:00:00.00101Z'), + freq='U') + assert i1 == expected + + pytest.raises(ValueError, Period, ordinal=200701) + + pytest.raises(ValueError, Period, '2007-1-1', freq='X') + + def test_construction_bday(self): + + # Biz day construction, roll forward if non-weekday + i1 = 
Period('3/10/12', freq='B') + i2 = Period('3/10/12', freq='D') + assert i1 == i2.asfreq('B') + i2 = Period('3/11/12', freq='D') + assert i1 == i2.asfreq('B') + i2 = Period('3/12/12', freq='D') + assert i1 == i2.asfreq('B') + + i3 = Period('3/10/12', freq='b') + assert i1 == i3 + + i1 = Period(year=2012, month=3, day=10, freq='B') + i2 = Period('3/12/12', freq='B') + assert i1 == i2 + + def test_construction_quarter(self): + + i1 = Period(year=2005, quarter=1, freq='Q') + i2 = Period('1/1/2005', freq='Q') + assert i1 == i2 + + i1 = Period(year=2005, quarter=3, freq='Q') + i2 = Period('9/1/2005', freq='Q') + assert i1 == i2 + + i1 = Period('2005Q1') + i2 = Period(year=2005, quarter=1, freq='Q') + i3 = Period('2005q1') + assert i1 == i2 + assert i1 == i3 + + i1 = Period('05Q1') + assert i1 == i2 + lower = Period('05q1') + assert i1 == lower + + i1 = Period('1Q2005') + assert i1 == i2 + lower = Period('1q2005') + assert i1 == lower + + i1 = Period('1Q05') + assert i1 == i2 + lower = Period('1q05') + assert i1 == lower + + i1 = Period('4Q1984') + assert i1.year == 1984 + lower = Period('4q1984') + assert i1 == lower + + def test_construction_month(self): + + expected = Period('2007-01', freq='M') + i1 = Period('200701', freq='M') + assert i1 == expected + + i1 = Period('200701', freq='M') + assert i1 == expected + + i1 = Period(200701, freq='M') + assert i1 == expected + + i1 = Period(ordinal=200701, freq='M') + assert i1.year == 18695 + + i1 = Period(datetime(2007, 1, 1), freq='M') + i2 = Period('200701', freq='M') + assert i1 == i2 + + i1 = Period(date(2007, 1, 1), freq='M') + i2 = Period(datetime(2007, 1, 1), freq='M') + i3 = Period(np.datetime64('2007-01-01'), freq='M') + i4 = Period(np_datetime64_compat('2007-01-01 00:00:00Z'), freq='M') + i5 = Period(np_datetime64_compat('2007-01-01 00:00:00.000Z'), freq='M') + assert i1 == i2 + assert i1 == i3 + assert i1 == i4 + assert i1 == i5 + + def test_period_constructor_offsets(self): + assert (Period('1/1/2005', freq=offsets.MonthEnd()) == + Period('1/1/2005', freq='M')) + assert (Period('2005', freq=offsets.YearEnd()) == + Period('2005', freq='A')) + assert (Period('2005', freq=offsets.MonthEnd()) == + Period('2005', freq='M')) + assert (Period('3/10/12', freq=offsets.BusinessDay()) == + Period('3/10/12', freq='B')) + assert (Period('3/10/12', freq=offsets.Day()) == + Period('3/10/12', freq='D')) + + assert (Period(year=2005, quarter=1, + freq=offsets.QuarterEnd(startingMonth=12)) == + Period(year=2005, quarter=1, freq='Q')) + assert (Period(year=2005, quarter=2, + freq=offsets.QuarterEnd(startingMonth=12)) == + Period(year=2005, quarter=2, freq='Q')) + + assert (Period(year=2005, month=3, day=1, freq=offsets.Day()) == + Period(year=2005, month=3, day=1, freq='D')) + assert (Period(year=2012, month=3, day=10, freq=offsets.BDay()) == + Period(year=2012, month=3, day=10, freq='B')) + + expected = Period('2005-03-01', freq='3D') + assert (Period(year=2005, month=3, day=1, + freq=offsets.Day(3)) == expected) + assert Period(year=2005, month=3, day=1, freq='3D') == expected + + assert (Period(year=2012, month=3, day=10, + freq=offsets.BDay(3)) == + Period(year=2012, month=3, day=10, freq='3B')) + + assert (Period(200701, freq=offsets.MonthEnd()) == + Period(200701, freq='M')) + + i1 = Period(ordinal=200701, freq=offsets.MonthEnd()) + i2 = Period(ordinal=200701, freq='M') + assert i1 == i2 + assert i1.year == 18695 + assert i2.year == 18695 + + i1 = Period(datetime(2007, 1, 1), freq='M') + i2 = Period('200701', freq='M') + assert i1 == i2 + + i1 = 
Period(date(2007, 1, 1), freq='M') + i2 = Period(datetime(2007, 1, 1), freq='M') + i3 = Period(np.datetime64('2007-01-01'), freq='M') + i4 = Period(np_datetime64_compat('2007-01-01 00:00:00Z'), freq='M') + i5 = Period(np_datetime64_compat('2007-01-01 00:00:00.000Z'), freq='M') + assert i1 == i2 + assert i1 == i3 + assert i1 == i4 + assert i1 == i5 + + i1 = Period('2007-01-01 09:00:00.001') + expected = Period(datetime(2007, 1, 1, 9, 0, 0, 1000), freq='L') + assert i1 == expected + + expected = Period(np_datetime64_compat( + '2007-01-01 09:00:00.001Z'), freq='L') + assert i1 == expected + + i1 = Period('2007-01-01 09:00:00.00101') + expected = Period(datetime(2007, 1, 1, 9, 0, 0, 1010), freq='U') + assert i1 == expected + + expected = Period(np_datetime64_compat('2007-01-01 09:00:00.00101Z'), + freq='U') + assert i1 == expected + + pytest.raises(ValueError, Period, ordinal=200701) + + pytest.raises(ValueError, Period, '2007-1-1', freq='X') + + def test_freq_str(self): + i1 = Period('1982', freq='Min') + assert i1.freq == offsets.Minute() + assert i1.freqstr == 'T' + + def test_period_deprecated_freq(self): + cases = {"M": ["MTH", "MONTH", "MONTHLY", "Mth", "month", "monthly"], + "B": ["BUS", "BUSINESS", "BUSINESSLY", "WEEKDAY", "bus"], + "D": ["DAY", "DLY", "DAILY", "Day", "Dly", "Daily"], + "H": ["HR", "HOUR", "HRLY", "HOURLY", "hr", "Hour", "HRly"], + "T": ["minute", "MINUTE", "MINUTELY", "minutely"], + "S": ["sec", "SEC", "SECOND", "SECONDLY", "second"], + "L": ["MILLISECOND", "MILLISECONDLY", "millisecond"], + "U": ["MICROSECOND", "MICROSECONDLY", "microsecond"], + "N": ["NANOSECOND", "NANOSECONDLY", "nanosecond"]} + + msg = pd.tseries.frequencies._INVALID_FREQ_ERROR + for exp, freqs in iteritems(cases): + for freq in freqs: + with tm.assert_raises_regex(ValueError, msg): + Period('2016-03-01 09:00', freq=freq) + with tm.assert_raises_regex(ValueError, msg): + Period(ordinal=1, freq=freq) + + # check supported freq-aliases still works + p1 = Period('2016-03-01 09:00', freq=exp) + p2 = Period(ordinal=1, freq=exp) + assert isinstance(p1, Period) + assert isinstance(p2, Period) + + def test_hash(self): + assert (hash(Period('2011-01', freq='M')) == + hash(Period('2011-01', freq='M'))) + + assert (hash(Period('2011-01-01', freq='D')) != + hash(Period('2011-01', freq='M'))) + + assert (hash(Period('2011-01', freq='3M')) != + hash(Period('2011-01', freq='2M'))) + + assert (hash(Period('2011-01', freq='M')) != + hash(Period('2011-02', freq='M'))) + + def test_repr(self): + p = Period('Jan-2000') + assert '2000-01' in repr(p) + + p = Period('2000-12-15') + assert '2000-12-15' in repr(p) + + def test_repr_nat(self): + p = Period('nat', freq='M') + assert repr(tslib.NaT) in repr(p) + + def test_millisecond_repr(self): + p = Period('2000-01-01 12:15:02.123') + + assert repr(p) == "Period('2000-01-01 12:15:02.123', 'L')" + + def test_microsecond_repr(self): + p = Period('2000-01-01 12:15:02.123567') + + assert repr(p) == "Period('2000-01-01 12:15:02.123567', 'U')" + + def test_strftime(self): + p = Period('2000-1-1 12:34:12', freq='S') + res = p.strftime('%Y-%m-%d %H:%M:%S') + assert res == '2000-01-01 12:34:12' + assert isinstance(res, text_type) # GH3363 + + def test_sub_delta(self): + left, right = Period('2011', freq='A'), Period('2007', freq='A') + result = left - right + assert result == 4 + + with pytest.raises(period.IncompatibleFrequency): + left - Period('2007-01', freq='M') + + def test_to_timestamp(self): + p = Period('1982', freq='A') + start_ts = p.to_timestamp(how='S') + aliases = 
['s', 'StarT', 'BEGIn']
+        for a in aliases:
+            assert start_ts == p.to_timestamp('D', how=a)
+            # freq with mult should not affect the result
+            assert start_ts == p.to_timestamp('3D', how=a)
+
+        end_ts = p.to_timestamp(how='E')
+        aliases = ['e', 'end', 'FINIsH']
+        for a in aliases:
+            assert end_ts == p.to_timestamp('D', how=a)
+            assert end_ts == p.to_timestamp('3D', how=a)
+
+        from_lst = ['A', 'Q', 'M', 'W', 'B', 'D', 'H', 'Min', 'S']
+
+        def _ex(p):
+            return Timestamp((p + 1).start_time.value - 1)
+
+        for i, fcode in enumerate(from_lst):
+            p = Period('1982', freq=fcode)
+            result = p.to_timestamp().to_period(fcode)
+            assert result == p
+
+            assert p.start_time == p.to_timestamp(how='S')
+
+            assert p.end_time == _ex(p)
+
+        # Frequency other than daily
+
+        p = Period('1985', freq='A')
+
+        result = p.to_timestamp('H', how='end')
+        expected = datetime(1985, 12, 31, 23)
+        assert result == expected
+        result = p.to_timestamp('3H', how='end')
+        assert result == expected
+
+        result = p.to_timestamp('T', how='end')
+        expected = datetime(1985, 12, 31, 23, 59)
+        assert result == expected
+        result = p.to_timestamp('2T', how='end')
+        assert result == expected
+
+        result = p.to_timestamp(how='end')
+        expected = datetime(1985, 12, 31)
+        assert result == expected
+
+        expected = datetime(1985, 1, 1)
+        result = p.to_timestamp('H', how='start')
+        assert result == expected
+        result = p.to_timestamp('T', how='start')
+        assert result == expected
+        result = p.to_timestamp('S', how='start')
+        assert result == expected
+        result = p.to_timestamp('3H', how='start')
+        assert result == expected
+        result = p.to_timestamp('5S', how='start')
+        assert result == expected
+
+    def test_start_time(self):
+        freq_lst = ['A', 'Q', 'M', 'D', 'H', 'T', 'S']
+        xp = datetime(2012, 1, 1)
+        for f in freq_lst:
+            p = Period('2012', freq=f)
+            assert p.start_time == xp
+        assert Period('2012', freq='B').start_time == datetime(2012, 1, 2)
+        assert Period('2012', freq='W').start_time == datetime(2011, 12, 26)
+
+    def test_end_time(self):
+        p = Period('2012', freq='A')
+
+        def _ex(*args):
+            return Timestamp(Timestamp(datetime(*args)).value - 1)
+
+        xp = _ex(2013, 1, 1)
+        assert xp == p.end_time
+
+        p = Period('2012', freq='Q')
+        xp = _ex(2012, 4, 1)
+        assert xp == p.end_time
+
+        p = Period('2012', freq='M')
+        xp = _ex(2012, 2, 1)
+        assert xp == p.end_time
+
+        p = Period('2012', freq='D')
+        xp = _ex(2012, 1, 2)
+        assert xp == p.end_time
+
+        p = Period('2012', freq='H')
+        xp = _ex(2012, 1, 1, 1)
+        assert xp == p.end_time
+
+        p = Period('2012', freq='B')
+        xp = _ex(2012, 1, 3)
+        assert xp == p.end_time
+
+        p = Period('2012', freq='W')
+        xp = _ex(2012, 1, 2)
+        assert xp == p.end_time
+
+        # Test for GH 11738
+        p = Period('2012', freq='15D')
+        xp = _ex(2012, 1, 16)
+        assert xp == p.end_time
+
+        p = Period('2012', freq='1D1H')
+        xp = _ex(2012, 1, 2, 1)
+        assert xp == p.end_time
+
+        p = Period('2012', freq='1H1D')
+        xp = _ex(2012, 1, 2, 1)
+        assert xp == p.end_time
+
+    def test_anchor_week_end_time(self):
+        def _ex(*args):
+            return Timestamp(Timestamp(datetime(*args)).value - 1)
+
+        p = Period('2013-1-1', 'W-SAT')
+        xp = _ex(2013, 1, 6)
+        assert p.end_time == xp
+
+    def test_properties_annually(self):
+        # Test properties on Periods with annual frequency.
+        a_date = Period(freq='A', year=2007)
+        assert a_date.year == 2007
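The `_ex` helpers used in the tests above encode the convention being asserted: a period's `end_time` is the last nanosecond the period covers, i.e. one nanosecond before the next period's `start_time`. A standalone sketch of that invariant (illustrative only, not part of the diff):

```python
import pandas as pd

p = pd.Period('2012', freq='A')
# end_time falls exactly one nanosecond before the next period begins.
assert p.end_time == (p + 1).start_time - pd.Timedelta(1, 'ns')
```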
+ qedec_date = Period(freq="Q-DEC", year=2007, quarter=1) + qejan_date = Period(freq="Q-JAN", year=2007, quarter=1) + qejun_date = Period(freq="Q-JUN", year=2007, quarter=1) + # + for x in range(3): + for qd in (qedec_date, qejan_date, qejun_date): + assert (qd + x).qyear == 2007 + assert (qd + x).quarter == x + 1 + + def test_properties_monthly(self): + # Test properties on Periods with daily frequency. + m_date = Period(freq='M', year=2007, month=1) + for x in range(11): + m_ival_x = m_date + x + assert m_ival_x.year == 2007 + if 1 <= x + 1 <= 3: + assert m_ival_x.quarter == 1 + elif 4 <= x + 1 <= 6: + assert m_ival_x.quarter == 2 + elif 7 <= x + 1 <= 9: + assert m_ival_x.quarter == 3 + elif 10 <= x + 1 <= 12: + assert m_ival_x.quarter == 4 + assert m_ival_x.month == x + 1 + + def test_properties_weekly(self): + # Test properties on Periods with daily frequency. + w_date = Period(freq='W', year=2007, month=1, day=7) + # + assert w_date.year == 2007 + assert w_date.quarter == 1 + assert w_date.month == 1 + assert w_date.week == 1 + assert (w_date - 1).week == 52 + assert w_date.days_in_month == 31 + assert Period(freq='W', year=2012, + month=2, day=1).days_in_month == 29 + + def test_properties_weekly_legacy(self): + # Test properties on Periods with daily frequency. + w_date = Period(freq='W', year=2007, month=1, day=7) + assert w_date.year == 2007 + assert w_date.quarter == 1 + assert w_date.month == 1 + assert w_date.week == 1 + assert (w_date - 1).week == 52 + assert w_date.days_in_month == 31 + + exp = Period(freq='W', year=2012, month=2, day=1) + assert exp.days_in_month == 29 + + msg = pd.tseries.frequencies._INVALID_FREQ_ERROR + with tm.assert_raises_regex(ValueError, msg): + Period(freq='WK', year=2007, month=1, day=7) + + def test_properties_daily(self): + # Test properties on Periods with daily frequency. + b_date = Period(freq='B', year=2007, month=1, day=1) + # + assert b_date.year == 2007 + assert b_date.quarter == 1 + assert b_date.month == 1 + assert b_date.day == 1 + assert b_date.weekday == 0 + assert b_date.dayofyear == 1 + assert b_date.days_in_month == 31 + assert Period(freq='B', year=2012, + month=2, day=1).days_in_month == 29 + + d_date = Period(freq='D', year=2007, month=1, day=1) + + assert d_date.year == 2007 + assert d_date.quarter == 1 + assert d_date.month == 1 + assert d_date.day == 1 + assert d_date.weekday == 0 + assert d_date.dayofyear == 1 + assert d_date.days_in_month == 31 + assert Period(freq='D', year=2012, month=2, + day=1).days_in_month == 29 + + def test_properties_hourly(self): + # Test properties on Periods with hourly frequency. + h_date1 = Period(freq='H', year=2007, month=1, day=1, hour=0) + h_date2 = Period(freq='2H', year=2007, month=1, day=1, hour=0) + + for h_date in [h_date1, h_date2]: + assert h_date.year == 2007 + assert h_date.quarter == 1 + assert h_date.month == 1 + assert h_date.day == 1 + assert h_date.weekday == 0 + assert h_date.dayofyear == 1 + assert h_date.hour == 0 + assert h_date.days_in_month == 31 + assert Period(freq='H', year=2012, month=2, day=1, + hour=0).days_in_month == 29 + + def test_properties_minutely(self): + # Test properties on Periods with minutely frequency. 
+    def test_properties_minutely(self):
+        # Test properties on Periods with minutely frequency.
+        t_date = Period(freq='Min', year=2007, month=1, day=1, hour=0,
+                        minute=0)
+        #
+        assert t_date.quarter == 1
+        assert t_date.month == 1
+        assert t_date.day == 1
+        assert t_date.weekday == 0
+        assert t_date.dayofyear == 1
+        assert t_date.hour == 0
+        assert t_date.minute == 0
+        assert t_date.days_in_month == 31
+        assert Period(freq='Min', year=2012, month=2, day=1, hour=0,
+                      minute=0).days_in_month == 29
+
+    def test_properties_secondly(self):
+        # Test properties on Periods with secondly frequency.
+        s_date = Period(freq='S', year=2007, month=1, day=1, hour=0,
+                        minute=0, second=0)
+        #
+        assert s_date.year == 2007
+        assert s_date.quarter == 1
+        assert s_date.month == 1
+        assert s_date.day == 1
+        assert s_date.weekday == 0
+        assert s_date.dayofyear == 1
+        assert s_date.hour == 0
+        assert s_date.minute == 0
+        assert s_date.second == 0
+        assert s_date.days_in_month == 31
+        assert Period(freq='S', year=2012, month=2, day=1, hour=0,
+                      minute=0, second=0).days_in_month == 29
+
+    def test_pnow(self):
+
+        # deprecation, xref #13790
+        with tm.assert_produces_warning(FutureWarning,
+                                        check_stacklevel=False):
+            period.pnow('D')
+
+    def test_constructor_corner(self):
+        expected = Period('2007-01', freq='2M')
+        assert Period(year=2007, month=1, freq='2M') == expected
+
+        pytest.raises(ValueError, Period, datetime.now())
+        pytest.raises(ValueError, Period, datetime.now().date())
+        pytest.raises(ValueError, Period, 1.6, freq='D')
+        pytest.raises(ValueError, Period, ordinal=1.6, freq='D')
+        pytest.raises(ValueError, Period, ordinal=2, value=1, freq='D')
+        assert Period(None) is pd.NaT
+        pytest.raises(ValueError, Period, month=1)
+
+        p = Period('2007-01-01', freq='D')
+
+        result = Period(p, freq='A')
+        exp = Period('2007', freq='A')
+        assert result == exp
+
+    def test_constructor_infer_freq(self):
+        p = Period('2007-01-01')
+        assert p.freq == 'D'
+
+        p = Period('2007-01-01 07')
+        assert p.freq == 'H'
+
+        p = Period('2007-01-01 07:10')
+        assert p.freq == 'T'
+
+        p = Period('2007-01-01 07:10:15')
+        assert p.freq == 'S'
+
+        p = Period('2007-01-01 07:10:15.123')
+        assert p.freq == 'L'
+
+        p = Period('2007-01-01 07:10:15.123000')
+        assert p.freq == 'L'
+
+        p = Period('2007-01-01 07:10:15.123400')
+        assert p.freq == 'U'
+
+    def test_badinput(self):
+        pytest.raises(ValueError, Period, '-2000', 'A')
+        pytest.raises(tslib.DateParseError, Period, '0', 'A')
+        pytest.raises(tslib.DateParseError, Period, '1/1/-2000', 'A')
+
+    def test_multiples(self):
+        result1 = Period('1989', freq='2A')
+        result2 = Period('1989', freq='A')
+        assert result1.ordinal == result2.ordinal
+        assert result1.freqstr == '2A-DEC'
+        assert result2.freqstr == 'A-DEC'
+        assert result1.freq == offsets.YearEnd(2)
+        assert result2.freq == offsets.YearEnd()
+
+        assert (result1 + 1).ordinal == result1.ordinal + 2
+        assert (1 + result1).ordinal == result1.ordinal + 2
+        assert (result1 - 1).ordinal == result2.ordinal - 2
+        assert (-1 + result1).ordinal == result2.ordinal - 2
+
+    def test_round_trip(self):
+
+        p = Period('2000Q1')
+        new_p = tm.round_trip_pickle(p)
+        assert new_p == p
+
+
+class TestPeriodField(object):
+
+    def test_get_period_field_raises_on_out_of_range(self):
+        pytest.raises(ValueError, libperiod.get_period_field, -1, 0, 0)
+
+    def test_get_period_field_array_raises_on_out_of_range(self):
+        pytest.raises(ValueError, libperiod.get_period_field_arr, -1,
+                      np.empty(1), 0)
+
+
+class TestComparisons(object):
+
+    def setup_method(self, method):
+        self.january1 = Period('2000-01', 'M')
+        self.january2 = Period('2000-01', 'M')
+        self.february 
= Period('2000-02', 'M') + self.march = Period('2000-03', 'M') + self.day = Period('2012-01-01', 'D') + + def test_equal(self): + assert self.january1 == self.january2 + + def test_equal_Raises_Value(self): + with pytest.raises(period.IncompatibleFrequency): + self.january1 == self.day + + def test_notEqual(self): + assert self.january1 != 1 + assert self.january1 != self.february + + def test_greater(self): + assert self.february > self.january1 + + def test_greater_Raises_Value(self): + with pytest.raises(period.IncompatibleFrequency): + self.january1 > self.day + + def test_greater_Raises_Type(self): + with pytest.raises(TypeError): + self.january1 > 1 + + def test_greaterEqual(self): + assert self.january1 >= self.january2 + + def test_greaterEqual_Raises_Value(self): + with pytest.raises(period.IncompatibleFrequency): + self.january1 >= self.day + + with pytest.raises(TypeError): + print(self.january1 >= 1) + + def test_smallerEqual(self): + assert self.january1 <= self.january2 + + def test_smallerEqual_Raises_Value(self): + with pytest.raises(period.IncompatibleFrequency): + self.january1 <= self.day + + def test_smallerEqual_Raises_Type(self): + with pytest.raises(TypeError): + self.january1 <= 1 + + def test_smaller(self): + assert self.january1 < self.february + + def test_smaller_Raises_Value(self): + with pytest.raises(period.IncompatibleFrequency): + self.january1 < self.day + + def test_smaller_Raises_Type(self): + with pytest.raises(TypeError): + self.january1 < 1 + + def test_sort(self): + periods = [self.march, self.january1, self.february] + correctPeriods = [self.january1, self.february, self.march] + assert sorted(periods) == correctPeriods + + def test_period_nat_comp(self): + p_nat = Period('NaT', freq='D') + p = Period('2011-01-01', freq='D') + + nat = pd.Timestamp('NaT') + t = pd.Timestamp('2011-01-01') + # confirm Period('NaT') work identical with Timestamp('NaT') + for left, right in [(p_nat, p), (p, p_nat), (p_nat, p_nat), (nat, t), + (t, nat), (nat, nat)]: + assert not left < right + assert not left > right + assert not left == right + assert left != right + assert not left <= right + assert not left >= right + + +class TestMethods(object): + + def test_add(self): + dt1 = Period(freq='D', year=2008, month=1, day=1) + dt2 = Period(freq='D', year=2008, month=1, day=2) + assert dt1 + 1 == dt2 + assert 1 + dt1 == dt2 + + def test_add_pdnat(self): + p = pd.Period('2011-01', freq='M') + assert p + pd.NaT is pd.NaT + assert pd.NaT + p is pd.NaT + + p = pd.Period('NaT', freq='M') + assert p + pd.NaT is pd.NaT + assert pd.NaT + p is pd.NaT + + def test_add_raises(self): + # GH 4731 + dt1 = Period(freq='D', year=2008, month=1, day=1) + dt2 = Period(freq='D', year=2008, month=1, day=2) + msg = r"unsupported operand type\(s\)" + with tm.assert_raises_regex(TypeError, msg): + dt1 + "str" + + msg = r"unsupported operand type\(s\)" + with tm.assert_raises_regex(TypeError, msg): + "str" + dt1 + + with tm.assert_raises_regex(TypeError, msg): + dt1 + dt2 + + def test_sub(self): + dt1 = Period('2011-01-01', freq='D') + dt2 = Period('2011-01-15', freq='D') + + assert dt1 - dt2 == -14 + assert dt2 - dt1 == 14 + + msg = r"Input has different freq=M from Period\(freq=D\)" + with tm.assert_raises_regex(period.IncompatibleFrequency, msg): + dt1 - pd.Period('2011-02', freq='M') + + def test_add_offset(self): + # freq is DateOffset + for freq in ['A', '2A', '3A']: + p = Period('2011', freq=freq) + exp = Period('2013', freq=freq) + assert p + offsets.YearEnd(2) == exp + assert 
offsets.YearEnd(2) + p == exp + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(365, 'D'), + timedelta(365)]: + with pytest.raises(period.IncompatibleFrequency): + p + o + + if isinstance(o, np.timedelta64): + with pytest.raises(TypeError): + o + p + else: + with pytest.raises(period.IncompatibleFrequency): + o + p + + for freq in ['M', '2M', '3M']: + p = Period('2011-03', freq=freq) + exp = Period('2011-05', freq=freq) + assert p + offsets.MonthEnd(2) == exp + assert offsets.MonthEnd(2) + p == exp + + exp = Period('2012-03', freq=freq) + assert p + offsets.MonthEnd(12) == exp + assert offsets.MonthEnd(12) + p == exp + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(365, 'D'), + timedelta(365)]: + with pytest.raises(period.IncompatibleFrequency): + p + o + + if isinstance(o, np.timedelta64): + with pytest.raises(TypeError): + o + p + else: + with pytest.raises(period.IncompatibleFrequency): + o + p + + # freq is Tick + for freq in ['D', '2D', '3D']: + p = Period('2011-04-01', freq=freq) + + exp = Period('2011-04-06', freq=freq) + assert p + offsets.Day(5) == exp + assert offsets.Day(5) + p == exp + + exp = Period('2011-04-02', freq=freq) + assert p + offsets.Hour(24) == exp + assert offsets.Hour(24) + p == exp + + exp = Period('2011-04-03', freq=freq) + assert p + np.timedelta64(2, 'D') == exp + with pytest.raises(TypeError): + np.timedelta64(2, 'D') + p + + exp = Period('2011-04-02', freq=freq) + assert p + np.timedelta64(3600 * 24, 's') == exp + with pytest.raises(TypeError): + np.timedelta64(3600 * 24, 's') + p + + exp = Period('2011-03-30', freq=freq) + assert p + timedelta(-2) == exp + assert timedelta(-2) + p == exp + + exp = Period('2011-04-03', freq=freq) + assert p + timedelta(hours=48) == exp + assert timedelta(hours=48) + p == exp + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(4, 'h'), + timedelta(hours=23)]: + with pytest.raises(period.IncompatibleFrequency): + p + o + + if isinstance(o, np.timedelta64): + with pytest.raises(TypeError): + o + p + else: + with pytest.raises(period.IncompatibleFrequency): + o + p + + for freq in ['H', '2H', '3H']: + p = Period('2011-04-01 09:00', freq=freq) + + exp = Period('2011-04-03 09:00', freq=freq) + assert p + offsets.Day(2) == exp + assert offsets.Day(2) + p == exp + + exp = Period('2011-04-01 12:00', freq=freq) + assert p + offsets.Hour(3) == exp + assert offsets.Hour(3) + p == exp + + exp = Period('2011-04-01 12:00', freq=freq) + assert p + np.timedelta64(3, 'h') == exp + with pytest.raises(TypeError): + np.timedelta64(3, 'h') + p + + exp = Period('2011-04-01 10:00', freq=freq) + assert p + np.timedelta64(3600, 's') == exp + with pytest.raises(TypeError): + np.timedelta64(3600, 's') + p + + exp = Period('2011-04-01 11:00', freq=freq) + assert p + timedelta(minutes=120) == exp + assert timedelta(minutes=120) + p == exp + + exp = Period('2011-04-05 12:00', freq=freq) + assert p + timedelta(days=4, minutes=180) == exp + assert timedelta(days=4, minutes=180) + p == exp + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(3200, 's'), + timedelta(hours=23, minutes=30)]: + with pytest.raises(period.IncompatibleFrequency): + p + o + + if isinstance(o, np.timedelta64): + with pytest.raises(TypeError): + o + p + else: + with pytest.raises(period.IncompatibleFrequency): + o + p + + def test_add_offset_nat(self): + # freq is DateOffset + for freq in ['A', '2A', '3A']: + p = 
Period('NaT', freq=freq) + for o in [offsets.YearEnd(2)]: + assert p + o is tslib.NaT + assert o + p is tslib.NaT + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(365, 'D'), + timedelta(365)]: + assert p + o is tslib.NaT + + if isinstance(o, np.timedelta64): + with pytest.raises(TypeError): + o + p + else: + assert o + p is tslib.NaT + + for freq in ['M', '2M', '3M']: + p = Period('NaT', freq=freq) + for o in [offsets.MonthEnd(2), offsets.MonthEnd(12)]: + assert p + o is tslib.NaT + + if isinstance(o, np.timedelta64): + with pytest.raises(TypeError): + o + p + else: + assert o + p is tslib.NaT + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(365, 'D'), + timedelta(365)]: + assert p + o is tslib.NaT + + if isinstance(o, np.timedelta64): + with pytest.raises(TypeError): + o + p + else: + assert o + p is tslib.NaT + + # freq is Tick + for freq in ['D', '2D', '3D']: + p = Period('NaT', freq=freq) + for o in [offsets.Day(5), offsets.Hour(24), np.timedelta64(2, 'D'), + np.timedelta64(3600 * 24, 's'), timedelta(-2), + timedelta(hours=48)]: + assert p + o is tslib.NaT + + if isinstance(o, np.timedelta64): + with pytest.raises(TypeError): + o + p + else: + assert o + p is tslib.NaT + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(4, 'h'), + timedelta(hours=23)]: + assert p + o is tslib.NaT + + if isinstance(o, np.timedelta64): + with pytest.raises(TypeError): + o + p + else: + assert o + p is tslib.NaT + + for freq in ['H', '2H', '3H']: + p = Period('NaT', freq=freq) + for o in [offsets.Day(2), offsets.Hour(3), np.timedelta64(3, 'h'), + np.timedelta64(3600, 's'), timedelta(minutes=120), + timedelta(days=4, minutes=180)]: + assert p + o is tslib.NaT + + if not isinstance(o, np.timedelta64): + assert o + p is tslib.NaT + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(3200, 's'), + timedelta(hours=23, minutes=30)]: + assert p + o is tslib.NaT + + if isinstance(o, np.timedelta64): + with pytest.raises(TypeError): + o + p + else: + assert o + p is tslib.NaT + + def test_sub_pdnat(self): + # GH 13071 + p = pd.Period('2011-01', freq='M') + assert p - pd.NaT is pd.NaT + assert pd.NaT - p is pd.NaT + + p = pd.Period('NaT', freq='M') + assert p - pd.NaT is pd.NaT + assert pd.NaT - p is pd.NaT + + def test_sub_offset(self): + # freq is DateOffset + for freq in ['A', '2A', '3A']: + p = Period('2011', freq=freq) + assert p - offsets.YearEnd(2) == Period('2009', freq=freq) + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(365, 'D'), + timedelta(365)]: + with pytest.raises(period.IncompatibleFrequency): + p - o + + for freq in ['M', '2M', '3M']: + p = Period('2011-03', freq=freq) + assert p - offsets.MonthEnd(2) == Period('2011-01', freq=freq) + assert p - offsets.MonthEnd(12) == Period('2010-03', freq=freq) + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(365, 'D'), + timedelta(365)]: + with pytest.raises(period.IncompatibleFrequency): + p - o + + # freq is Tick + for freq in ['D', '2D', '3D']: + p = Period('2011-04-01', freq=freq) + assert p - offsets.Day(5) == Period('2011-03-27', freq=freq) + assert p - offsets.Hour(24) == Period('2011-03-31', freq=freq) + assert p - np.timedelta64(2, 'D') == Period( + '2011-03-30', freq=freq) + assert p - np.timedelta64(3600 * 24, 's') == Period( + '2011-03-31', freq=freq) + assert p - timedelta(-2) == 
Period('2011-04-03', freq=freq) + assert p - timedelta(hours=48) == Period('2011-03-30', freq=freq) + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(4, 'h'), + timedelta(hours=23)]: + with pytest.raises(period.IncompatibleFrequency): + p - o + + for freq in ['H', '2H', '3H']: + p = Period('2011-04-01 09:00', freq=freq) + assert p - offsets.Day(2) == Period('2011-03-30 09:00', freq=freq) + assert p - offsets.Hour(3) == Period('2011-04-01 06:00', freq=freq) + assert p - np.timedelta64(3, 'h') == Period( + '2011-04-01 06:00', freq=freq) + assert p - np.timedelta64(3600, 's') == Period( + '2011-04-01 08:00', freq=freq) + assert p - timedelta(minutes=120) == Period( + '2011-04-01 07:00', freq=freq) + assert p - timedelta(days=4, minutes=180) == Period( + '2011-03-28 06:00', freq=freq) + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(3200, 's'), + timedelta(hours=23, minutes=30)]: + with pytest.raises(period.IncompatibleFrequency): + p - o + + def test_sub_offset_nat(self): + # freq is DateOffset + for freq in ['A', '2A', '3A']: + p = Period('NaT', freq=freq) + for o in [offsets.YearEnd(2)]: + assert p - o is tslib.NaT + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(365, 'D'), + timedelta(365)]: + assert p - o is tslib.NaT + + for freq in ['M', '2M', '3M']: + p = Period('NaT', freq=freq) + for o in [offsets.MonthEnd(2), offsets.MonthEnd(12)]: + assert p - o is tslib.NaT + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(365, 'D'), + timedelta(365)]: + assert p - o is tslib.NaT + + # freq is Tick + for freq in ['D', '2D', '3D']: + p = Period('NaT', freq=freq) + for o in [offsets.Day(5), offsets.Hour(24), np.timedelta64(2, 'D'), + np.timedelta64(3600 * 24, 's'), timedelta(-2), + timedelta(hours=48)]: + assert p - o is tslib.NaT + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(4, 'h'), + timedelta(hours=23)]: + assert p - o is tslib.NaT + + for freq in ['H', '2H', '3H']: + p = Period('NaT', freq=freq) + for o in [offsets.Day(2), offsets.Hour(3), np.timedelta64(3, 'h'), + np.timedelta64(3600, 's'), timedelta(minutes=120), + timedelta(days=4, minutes=180)]: + assert p - o is tslib.NaT + + for o in [offsets.YearBegin(2), offsets.MonthBegin(1), + offsets.Minute(), np.timedelta64(3200, 's'), + timedelta(hours=23, minutes=30)]: + assert p - o is tslib.NaT + + def test_nat_ops(self): + for freq in ['M', '2M', '3M']: + p = Period('NaT', freq=freq) + assert p + 1 is tslib.NaT + assert 1 + p is tslib.NaT + assert p - 1 is tslib.NaT + assert p - Period('2011-01', freq=freq) is tslib.NaT + assert Period('2011-01', freq=freq) - p is tslib.NaT + + def test_period_ops_offset(self): + p = Period('2011-04-01', freq='D') + result = p + offsets.Day() + exp = pd.Period('2011-04-02', freq='D') + assert result == exp + + result = p - offsets.Day(2) + exp = pd.Period('2011-03-30', freq='D') + assert result == exp + + msg = r"Input cannot be converted to Period\(freq=D\)" + with tm.assert_raises_regex(period.IncompatibleFrequency, msg): + p + offsets.Hour(2) + + with tm.assert_raises_regex(period.IncompatibleFrequency, msg): + p - offsets.Hour(2) diff --git a/pandas/tests/scalar/test_period_asfreq.py b/pandas/tests/scalar/test_period_asfreq.py new file mode 100644 index 0000000000000..32cea60c333b7 --- /dev/null +++ b/pandas/tests/scalar/test_period_asfreq.py @@ -0,0 +1,716 @@ +import pandas as pd 
+from pandas import Period, offsets +from pandas.util import testing as tm +from pandas.tseries.frequencies import _period_code_map + + +class TestFreqConversion(object): + """Test frequency conversion of date objects""" + + def test_asfreq_corner(self): + val = Period(freq='A', year=2007) + result1 = val.asfreq('5t') + result2 = val.asfreq('t') + expected = Period('2007-12-31 23:59', freq='t') + assert result1.ordinal == expected.ordinal + assert result1.freqstr == '5T' + assert result2.ordinal == expected.ordinal + assert result2.freqstr == 'T' + + def test_conv_annual(self): + # frequency conversion tests: from Annual Frequency + + ival_A = Period(freq='A', year=2007) + + ival_AJAN = Period(freq="A-JAN", year=2007) + ival_AJUN = Period(freq="A-JUN", year=2007) + ival_ANOV = Period(freq="A-NOV", year=2007) + + ival_A_to_Q_start = Period(freq='Q', year=2007, quarter=1) + ival_A_to_Q_end = Period(freq='Q', year=2007, quarter=4) + ival_A_to_M_start = Period(freq='M', year=2007, month=1) + ival_A_to_M_end = Period(freq='M', year=2007, month=12) + ival_A_to_W_start = Period(freq='W', year=2007, month=1, day=1) + ival_A_to_W_end = Period(freq='W', year=2007, month=12, day=31) + ival_A_to_B_start = Period(freq='B', year=2007, month=1, day=1) + ival_A_to_B_end = Period(freq='B', year=2007, month=12, day=31) + ival_A_to_D_start = Period(freq='D', year=2007, month=1, day=1) + ival_A_to_D_end = Period(freq='D', year=2007, month=12, day=31) + ival_A_to_H_start = Period(freq='H', year=2007, month=1, day=1, hour=0) + ival_A_to_H_end = Period(freq='H', year=2007, month=12, day=31, + hour=23) + ival_A_to_T_start = Period(freq='Min', year=2007, month=1, day=1, + hour=0, minute=0) + ival_A_to_T_end = Period(freq='Min', year=2007, month=12, day=31, + hour=23, minute=59) + ival_A_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0, + minute=0, second=0) + ival_A_to_S_end = Period(freq='S', year=2007, month=12, day=31, + hour=23, minute=59, second=59) + + ival_AJAN_to_D_end = Period(freq='D', year=2007, month=1, day=31) + ival_AJAN_to_D_start = Period(freq='D', year=2006, month=2, day=1) + ival_AJUN_to_D_end = Period(freq='D', year=2007, month=6, day=30) + ival_AJUN_to_D_start = Period(freq='D', year=2006, month=7, day=1) + ival_ANOV_to_D_end = Period(freq='D', year=2007, month=11, day=30) + ival_ANOV_to_D_start = Period(freq='D', year=2006, month=12, day=1) + + assert ival_A.asfreq('Q', 'S') == ival_A_to_Q_start + assert ival_A.asfreq('Q', 'e') == ival_A_to_Q_end + assert ival_A.asfreq('M', 's') == ival_A_to_M_start + assert ival_A.asfreq('M', 'E') == ival_A_to_M_end + assert ival_A.asfreq('W', 'S') == ival_A_to_W_start + assert ival_A.asfreq('W', 'E') == ival_A_to_W_end + assert ival_A.asfreq('B', 'S') == ival_A_to_B_start + assert ival_A.asfreq('B', 'E') == ival_A_to_B_end + assert ival_A.asfreq('D', 'S') == ival_A_to_D_start + assert ival_A.asfreq('D', 'E') == ival_A_to_D_end + assert ival_A.asfreq('H', 'S') == ival_A_to_H_start + assert ival_A.asfreq('H', 'E') == ival_A_to_H_end + assert ival_A.asfreq('min', 'S') == ival_A_to_T_start + assert ival_A.asfreq('min', 'E') == ival_A_to_T_end + assert ival_A.asfreq('T', 'S') == ival_A_to_T_start + assert ival_A.asfreq('T', 'E') == ival_A_to_T_end + assert ival_A.asfreq('S', 'S') == ival_A_to_S_start + assert ival_A.asfreq('S', 'E') == ival_A_to_S_end + + assert ival_AJAN.asfreq('D', 'S') == ival_AJAN_to_D_start + assert ival_AJAN.asfreq('D', 'E') == ival_AJAN_to_D_end + + assert ival_AJUN.asfreq('D', 'S') == ival_AJUN_to_D_start + assert 
ival_AJUN.asfreq('D', 'E') == ival_AJUN_to_D_end + + assert ival_ANOV.asfreq('D', 'S') == ival_ANOV_to_D_start + assert ival_ANOV.asfreq('D', 'E') == ival_ANOV_to_D_end + + assert ival_A.asfreq('A') == ival_A + + def test_conv_quarterly(self): + # frequency conversion tests: from Quarterly Frequency + + ival_Q = Period(freq='Q', year=2007, quarter=1) + ival_Q_end_of_year = Period(freq='Q', year=2007, quarter=4) + + ival_QEJAN = Period(freq="Q-JAN", year=2007, quarter=1) + ival_QEJUN = Period(freq="Q-JUN", year=2007, quarter=1) + + ival_Q_to_A = Period(freq='A', year=2007) + ival_Q_to_M_start = Period(freq='M', year=2007, month=1) + ival_Q_to_M_end = Period(freq='M', year=2007, month=3) + ival_Q_to_W_start = Period(freq='W', year=2007, month=1, day=1) + ival_Q_to_W_end = Period(freq='W', year=2007, month=3, day=31) + ival_Q_to_B_start = Period(freq='B', year=2007, month=1, day=1) + ival_Q_to_B_end = Period(freq='B', year=2007, month=3, day=30) + ival_Q_to_D_start = Period(freq='D', year=2007, month=1, day=1) + ival_Q_to_D_end = Period(freq='D', year=2007, month=3, day=31) + ival_Q_to_H_start = Period(freq='H', year=2007, month=1, day=1, hour=0) + ival_Q_to_H_end = Period(freq='H', year=2007, month=3, day=31, hour=23) + ival_Q_to_T_start = Period(freq='Min', year=2007, month=1, day=1, + hour=0, minute=0) + ival_Q_to_T_end = Period(freq='Min', year=2007, month=3, day=31, + hour=23, minute=59) + ival_Q_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0, + minute=0, second=0) + ival_Q_to_S_end = Period(freq='S', year=2007, month=3, day=31, hour=23, + minute=59, second=59) + + ival_QEJAN_to_D_start = Period(freq='D', year=2006, month=2, day=1) + ival_QEJAN_to_D_end = Period(freq='D', year=2006, month=4, day=30) + + ival_QEJUN_to_D_start = Period(freq='D', year=2006, month=7, day=1) + ival_QEJUN_to_D_end = Period(freq='D', year=2006, month=9, day=30) + + assert ival_Q.asfreq('A') == ival_Q_to_A + assert ival_Q_end_of_year.asfreq('A') == ival_Q_to_A + + assert ival_Q.asfreq('M', 'S') == ival_Q_to_M_start + assert ival_Q.asfreq('M', 'E') == ival_Q_to_M_end + assert ival_Q.asfreq('W', 'S') == ival_Q_to_W_start + assert ival_Q.asfreq('W', 'E') == ival_Q_to_W_end + assert ival_Q.asfreq('B', 'S') == ival_Q_to_B_start + assert ival_Q.asfreq('B', 'E') == ival_Q_to_B_end + assert ival_Q.asfreq('D', 'S') == ival_Q_to_D_start + assert ival_Q.asfreq('D', 'E') == ival_Q_to_D_end + assert ival_Q.asfreq('H', 'S') == ival_Q_to_H_start + assert ival_Q.asfreq('H', 'E') == ival_Q_to_H_end + assert ival_Q.asfreq('Min', 'S') == ival_Q_to_T_start + assert ival_Q.asfreq('Min', 'E') == ival_Q_to_T_end + assert ival_Q.asfreq('S', 'S') == ival_Q_to_S_start + assert ival_Q.asfreq('S', 'E') == ival_Q_to_S_end + + assert ival_QEJAN.asfreq('D', 'S') == ival_QEJAN_to_D_start + assert ival_QEJAN.asfreq('D', 'E') == ival_QEJAN_to_D_end + assert ival_QEJUN.asfreq('D', 'S') == ival_QEJUN_to_D_start + assert ival_QEJUN.asfreq('D', 'E') == ival_QEJUN_to_D_end + + assert ival_Q.asfreq('Q') == ival_Q + + def test_conv_monthly(self): + # frequency conversion tests: from Monthly Frequency + + ival_M = Period(freq='M', year=2007, month=1) + ival_M_end_of_year = Period(freq='M', year=2007, month=12) + ival_M_end_of_quarter = Period(freq='M', year=2007, month=3) + ival_M_to_A = Period(freq='A', year=2007) + ival_M_to_Q = Period(freq='Q', year=2007, quarter=1) + ival_M_to_W_start = Period(freq='W', year=2007, month=1, day=1) + ival_M_to_W_end = Period(freq='W', year=2007, month=1, day=31) + ival_M_to_B_start = Period(freq='B', 
year=2007, month=1, day=1) + ival_M_to_B_end = Period(freq='B', year=2007, month=1, day=31) + ival_M_to_D_start = Period(freq='D', year=2007, month=1, day=1) + ival_M_to_D_end = Period(freq='D', year=2007, month=1, day=31) + ival_M_to_H_start = Period(freq='H', year=2007, month=1, day=1, hour=0) + ival_M_to_H_end = Period(freq='H', year=2007, month=1, day=31, hour=23) + ival_M_to_T_start = Period(freq='Min', year=2007, month=1, day=1, + hour=0, minute=0) + ival_M_to_T_end = Period(freq='Min', year=2007, month=1, day=31, + hour=23, minute=59) + ival_M_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0, + minute=0, second=0) + ival_M_to_S_end = Period(freq='S', year=2007, month=1, day=31, hour=23, + minute=59, second=59) + + assert ival_M.asfreq('A') == ival_M_to_A + assert ival_M_end_of_year.asfreq('A') == ival_M_to_A + assert ival_M.asfreq('Q') == ival_M_to_Q + assert ival_M_end_of_quarter.asfreq('Q') == ival_M_to_Q + + assert ival_M.asfreq('W', 'S') == ival_M_to_W_start + assert ival_M.asfreq('W', 'E') == ival_M_to_W_end + assert ival_M.asfreq('B', 'S') == ival_M_to_B_start + assert ival_M.asfreq('B', 'E') == ival_M_to_B_end + assert ival_M.asfreq('D', 'S') == ival_M_to_D_start + assert ival_M.asfreq('D', 'E') == ival_M_to_D_end + assert ival_M.asfreq('H', 'S') == ival_M_to_H_start + assert ival_M.asfreq('H', 'E') == ival_M_to_H_end + assert ival_M.asfreq('Min', 'S') == ival_M_to_T_start + assert ival_M.asfreq('Min', 'E') == ival_M_to_T_end + assert ival_M.asfreq('S', 'S') == ival_M_to_S_start + assert ival_M.asfreq('S', 'E') == ival_M_to_S_end + + assert ival_M.asfreq('M') == ival_M + + def test_conv_weekly(self): + # frequency conversion tests: from Weekly Frequency + ival_W = Period(freq='W', year=2007, month=1, day=1) + + ival_WSUN = Period(freq='W', year=2007, month=1, day=7) + ival_WSAT = Period(freq='W-SAT', year=2007, month=1, day=6) + ival_WFRI = Period(freq='W-FRI', year=2007, month=1, day=5) + ival_WTHU = Period(freq='W-THU', year=2007, month=1, day=4) + ival_WWED = Period(freq='W-WED', year=2007, month=1, day=3) + ival_WTUE = Period(freq='W-TUE', year=2007, month=1, day=2) + ival_WMON = Period(freq='W-MON', year=2007, month=1, day=1) + + ival_WSUN_to_D_start = Period(freq='D', year=2007, month=1, day=1) + ival_WSUN_to_D_end = Period(freq='D', year=2007, month=1, day=7) + ival_WSAT_to_D_start = Period(freq='D', year=2006, month=12, day=31) + ival_WSAT_to_D_end = Period(freq='D', year=2007, month=1, day=6) + ival_WFRI_to_D_start = Period(freq='D', year=2006, month=12, day=30) + ival_WFRI_to_D_end = Period(freq='D', year=2007, month=1, day=5) + ival_WTHU_to_D_start = Period(freq='D', year=2006, month=12, day=29) + ival_WTHU_to_D_end = Period(freq='D', year=2007, month=1, day=4) + ival_WWED_to_D_start = Period(freq='D', year=2006, month=12, day=28) + ival_WWED_to_D_end = Period(freq='D', year=2007, month=1, day=3) + ival_WTUE_to_D_start = Period(freq='D', year=2006, month=12, day=27) + ival_WTUE_to_D_end = Period(freq='D', year=2007, month=1, day=2) + ival_WMON_to_D_start = Period(freq='D', year=2006, month=12, day=26) + ival_WMON_to_D_end = Period(freq='D', year=2007, month=1, day=1) + + ival_W_end_of_year = Period(freq='W', year=2007, month=12, day=31) + ival_W_end_of_quarter = Period(freq='W', year=2007, month=3, day=31) + ival_W_end_of_month = Period(freq='W', year=2007, month=1, day=31) + ival_W_to_A = Period(freq='A', year=2007) + ival_W_to_Q = Period(freq='Q', year=2007, quarter=1) + ival_W_to_M = Period(freq='M', year=2007, month=1) + + if Period(freq='D', 
year=2007, month=12, day=31).weekday == 6: + ival_W_to_A_end_of_year = Period(freq='A', year=2007) + else: + ival_W_to_A_end_of_year = Period(freq='A', year=2008) + + if Period(freq='D', year=2007, month=3, day=31).weekday == 6: + ival_W_to_Q_end_of_quarter = Period(freq='Q', year=2007, quarter=1) + else: + ival_W_to_Q_end_of_quarter = Period(freq='Q', year=2007, quarter=2) + + if Period(freq='D', year=2007, month=1, day=31).weekday == 6: + ival_W_to_M_end_of_month = Period(freq='M', year=2007, month=1) + else: + ival_W_to_M_end_of_month = Period(freq='M', year=2007, month=2) + + ival_W_to_B_start = Period(freq='B', year=2007, month=1, day=1) + ival_W_to_B_end = Period(freq='B', year=2007, month=1, day=5) + ival_W_to_D_start = Period(freq='D', year=2007, month=1, day=1) + ival_W_to_D_end = Period(freq='D', year=2007, month=1, day=7) + ival_W_to_H_start = Period(freq='H', year=2007, month=1, day=1, hour=0) + ival_W_to_H_end = Period(freq='H', year=2007, month=1, day=7, hour=23) + ival_W_to_T_start = Period(freq='Min', year=2007, month=1, day=1, + hour=0, minute=0) + ival_W_to_T_end = Period(freq='Min', year=2007, month=1, day=7, + hour=23, minute=59) + ival_W_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0, + minute=0, second=0) + ival_W_to_S_end = Period(freq='S', year=2007, month=1, day=7, hour=23, + minute=59, second=59) + + assert ival_W.asfreq('A') == ival_W_to_A + assert ival_W_end_of_year.asfreq('A') == ival_W_to_A_end_of_year + + assert ival_W.asfreq('Q') == ival_W_to_Q + assert ival_W_end_of_quarter.asfreq('Q') == ival_W_to_Q_end_of_quarter + + assert ival_W.asfreq('M') == ival_W_to_M + assert ival_W_end_of_month.asfreq('M') == ival_W_to_M_end_of_month + + assert ival_W.asfreq('B', 'S') == ival_W_to_B_start + assert ival_W.asfreq('B', 'E') == ival_W_to_B_end + + assert ival_W.asfreq('D', 'S') == ival_W_to_D_start + assert ival_W.asfreq('D', 'E') == ival_W_to_D_end + + assert ival_WSUN.asfreq('D', 'S') == ival_WSUN_to_D_start + assert ival_WSUN.asfreq('D', 'E') == ival_WSUN_to_D_end + assert ival_WSAT.asfreq('D', 'S') == ival_WSAT_to_D_start + assert ival_WSAT.asfreq('D', 'E') == ival_WSAT_to_D_end + assert ival_WFRI.asfreq('D', 'S') == ival_WFRI_to_D_start + assert ival_WFRI.asfreq('D', 'E') == ival_WFRI_to_D_end + assert ival_WTHU.asfreq('D', 'S') == ival_WTHU_to_D_start + assert ival_WTHU.asfreq('D', 'E') == ival_WTHU_to_D_end + assert ival_WWED.asfreq('D', 'S') == ival_WWED_to_D_start + assert ival_WWED.asfreq('D', 'E') == ival_WWED_to_D_end + assert ival_WTUE.asfreq('D', 'S') == ival_WTUE_to_D_start + assert ival_WTUE.asfreq('D', 'E') == ival_WTUE_to_D_end + assert ival_WMON.asfreq('D', 'S') == ival_WMON_to_D_start + assert ival_WMON.asfreq('D', 'E') == ival_WMON_to_D_end + + assert ival_W.asfreq('H', 'S') == ival_W_to_H_start + assert ival_W.asfreq('H', 'E') == ival_W_to_H_end + assert ival_W.asfreq('Min', 'S') == ival_W_to_T_start + assert ival_W.asfreq('Min', 'E') == ival_W_to_T_end + assert ival_W.asfreq('S', 'S') == ival_W_to_S_start + assert ival_W.asfreq('S', 'E') == ival_W_to_S_end + + assert ival_W.asfreq('W') == ival_W + + msg = pd.tseries.frequencies._INVALID_FREQ_ERROR + with tm.assert_raises_regex(ValueError, msg): + ival_W.asfreq('WK') + + def test_conv_weekly_legacy(self): + # frequency conversion tests: from Weekly Frequency + msg = pd.tseries.frequencies._INVALID_FREQ_ERROR + with tm.assert_raises_regex(ValueError, msg): + Period(freq='WK', year=2007, month=1, day=1) + + with tm.assert_raises_regex(ValueError, msg): + Period(freq='WK-SAT', 
year=2007, month=1, day=6)
+        with tm.assert_raises_regex(ValueError, msg):
+            Period(freq='WK-FRI', year=2007, month=1, day=5)
+        with tm.assert_raises_regex(ValueError, msg):
+            Period(freq='WK-THU', year=2007, month=1, day=4)
+        with tm.assert_raises_regex(ValueError, msg):
+            Period(freq='WK-WED', year=2007, month=1, day=3)
+        with tm.assert_raises_regex(ValueError, msg):
+            Period(freq='WK-TUE', year=2007, month=1, day=2)
+        with tm.assert_raises_regex(ValueError, msg):
+            Period(freq='WK-MON', year=2007, month=1, day=1)
+
+    def test_conv_business(self):
+        # frequency conversion tests: from Business Frequency
+
+        ival_B = Period(freq='B', year=2007, month=1, day=1)
+        ival_B_end_of_year = Period(freq='B', year=2007, month=12, day=31)
+        ival_B_end_of_quarter = Period(freq='B', year=2007, month=3, day=30)
+        ival_B_end_of_month = Period(freq='B', year=2007, month=1, day=31)
+        ival_B_end_of_week = Period(freq='B', year=2007, month=1, day=5)
+
+        ival_B_to_A = Period(freq='A', year=2007)
+        ival_B_to_Q = Period(freq='Q', year=2007, quarter=1)
+        ival_B_to_M = Period(freq='M', year=2007, month=1)
+        ival_B_to_W = Period(freq='W', year=2007, month=1, day=7)
+        ival_B_to_D = Period(freq='D', year=2007, month=1, day=1)
+        ival_B_to_H_start = Period(freq='H', year=2007, month=1, day=1, hour=0)
+        ival_B_to_H_end = Period(freq='H', year=2007, month=1, day=1, hour=23)
+        ival_B_to_T_start = Period(freq='Min', year=2007, month=1, day=1,
+                                   hour=0, minute=0)
+        ival_B_to_T_end = Period(freq='Min', year=2007, month=1, day=1,
+                                 hour=23, minute=59)
+        ival_B_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0,
+                                   minute=0, second=0)
+        ival_B_to_S_end = Period(freq='S', year=2007, month=1, day=1, hour=23,
+                                 minute=59, second=59)
+
+        assert ival_B.asfreq('A') == ival_B_to_A
+        assert ival_B_end_of_year.asfreq('A') == ival_B_to_A
+        assert ival_B.asfreq('Q') == ival_B_to_Q
+        assert ival_B_end_of_quarter.asfreq('Q') == ival_B_to_Q
+        assert ival_B.asfreq('M') == ival_B_to_M
+        assert ival_B_end_of_month.asfreq('M') == ival_B_to_M
+        assert ival_B.asfreq('W') == ival_B_to_W
+        assert ival_B_end_of_week.asfreq('W') == ival_B_to_W
+
+        assert ival_B.asfreq('D') == ival_B_to_D
+
+        assert ival_B.asfreq('H', 'S') == ival_B_to_H_start
+        assert ival_B.asfreq('H', 'E') == ival_B_to_H_end
+        assert ival_B.asfreq('Min', 'S') == ival_B_to_T_start
+        assert ival_B.asfreq('Min', 'E') == ival_B_to_T_end
+        assert ival_B.asfreq('S', 'S') == ival_B_to_S_start
+        assert ival_B.asfreq('S', 'E') == ival_B_to_S_end
+
+        assert ival_B.asfreq('B') == ival_B
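The conversions in this class all go through `Period.asfreq(freq, how)`: converting to a finer frequency needs an anchor, with `'S'` selecting the first sub-period and `'E'` (the default) the last, while converting to a coarser frequency needs none. A compact standalone sketch (illustrative only):

```python
import pandas as pd

p = pd.Period('2007', freq='A')
# To a finer frequency: pick the start or the end of the annual span.
assert p.asfreq('M', how='S') == pd.Period('2007-01', freq='M')
assert p.asfreq('M', how='E') == pd.Period('2007-12', freq='M')
# To a coarser frequency: the containing period is unambiguous.
assert pd.Period('2007-06', freq='M').asfreq('A') == p
```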
+ # ival_D_monday = Period(freq='D', year=2007, month=1, day=8) + + ival_B_friday = Period(freq='B', year=2007, month=1, day=5) + ival_B_monday = Period(freq='B', year=2007, month=1, day=8) + + ival_D_to_A = Period(freq='A', year=2007) + + ival_Deoq_to_AJAN = Period(freq='A-JAN', year=2008) + ival_Deoq_to_AJUN = Period(freq='A-JUN', year=2007) + ival_Deoq_to_ADEC = Period(freq='A-DEC', year=2007) + + ival_D_to_QEJAN = Period(freq="Q-JAN", year=2007, quarter=4) + ival_D_to_QEJUN = Period(freq="Q-JUN", year=2007, quarter=3) + ival_D_to_QEDEC = Period(freq="Q-DEC", year=2007, quarter=1) + + ival_D_to_M = Period(freq='M', year=2007, month=1) + ival_D_to_W = Period(freq='W', year=2007, month=1, day=7) + + ival_D_to_H_start = Period(freq='H', year=2007, month=1, day=1, hour=0) + ival_D_to_H_end = Period(freq='H', year=2007, month=1, day=1, hour=23) + ival_D_to_T_start = Period(freq='Min', year=2007, month=1, day=1, + hour=0, minute=0) + ival_D_to_T_end = Period(freq='Min', year=2007, month=1, day=1, + hour=23, minute=59) + ival_D_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0, + minute=0, second=0) + ival_D_to_S_end = Period(freq='S', year=2007, month=1, day=1, hour=23, + minute=59, second=59) + + assert ival_D.asfreq('A') == ival_D_to_A + + assert ival_D_end_of_quarter.asfreq('A-JAN') == ival_Deoq_to_AJAN + assert ival_D_end_of_quarter.asfreq('A-JUN') == ival_Deoq_to_AJUN + assert ival_D_end_of_quarter.asfreq('A-DEC') == ival_Deoq_to_ADEC + + assert ival_D_end_of_year.asfreq('A') == ival_D_to_A + assert ival_D_end_of_quarter.asfreq('Q') == ival_D_to_QEDEC + assert ival_D.asfreq("Q-JAN") == ival_D_to_QEJAN + assert ival_D.asfreq("Q-JUN") == ival_D_to_QEJUN + assert ival_D.asfreq("Q-DEC") == ival_D_to_QEDEC + assert ival_D.asfreq('M') == ival_D_to_M + assert ival_D_end_of_month.asfreq('M') == ival_D_to_M + assert ival_D.asfreq('W') == ival_D_to_W + assert ival_D_end_of_week.asfreq('W') == ival_D_to_W + + assert ival_D_friday.asfreq('B') == ival_B_friday + assert ival_D_saturday.asfreq('B', 'S') == ival_B_friday + assert ival_D_saturday.asfreq('B', 'E') == ival_B_monday + assert ival_D_sunday.asfreq('B', 'S') == ival_B_friday + assert ival_D_sunday.asfreq('B', 'E') == ival_B_monday + + assert ival_D.asfreq('H', 'S') == ival_D_to_H_start + assert ival_D.asfreq('H', 'E') == ival_D_to_H_end + assert ival_D.asfreq('Min', 'S') == ival_D_to_T_start + assert ival_D.asfreq('Min', 'E') == ival_D_to_T_end + assert ival_D.asfreq('S', 'S') == ival_D_to_S_start + assert ival_D.asfreq('S', 'E') == ival_D_to_S_end + + assert ival_D.asfreq('D') == ival_D + + def test_conv_hourly(self): + # frequency conversion tests: from Hourly Frequency + + ival_H = Period(freq='H', year=2007, month=1, day=1, hour=0) + ival_H_end_of_year = Period(freq='H', year=2007, month=12, day=31, + hour=23) + ival_H_end_of_quarter = Period(freq='H', year=2007, month=3, day=31, + hour=23) + ival_H_end_of_month = Period(freq='H', year=2007, month=1, day=31, + hour=23) + ival_H_end_of_week = Period(freq='H', year=2007, month=1, day=7, + hour=23) + ival_H_end_of_day = Period(freq='H', year=2007, month=1, day=1, + hour=23) + ival_H_end_of_bus = Period(freq='H', year=2007, month=1, day=1, + hour=23) + + ival_H_to_A = Period(freq='A', year=2007) + ival_H_to_Q = Period(freq='Q', year=2007, quarter=1) + ival_H_to_M = Period(freq='M', year=2007, month=1) + ival_H_to_W = Period(freq='W', year=2007, month=1, day=7) + ival_H_to_D = Period(freq='D', year=2007, month=1, day=1) + ival_H_to_B = Period(freq='B', year=2007, month=1, day=1)
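+ # 'S' and 'E' pick the first and last sub-period when converting to a + # finer frequency, hence the minute/second 0 start targets and the + # 59 end targets below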
+ + ival_H_to_T_start = Period(freq='Min', year=2007, month=1, day=1, + hour=0, minute=0) + ival_H_to_T_end = Period(freq='Min', year=2007, month=1, day=1, hour=0, + minute=59) + ival_H_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0, + minute=0, second=0) + ival_H_to_S_end = Period(freq='S', year=2007, month=1, day=1, hour=0, + minute=59, second=59) + + assert ival_H.asfreq('A') == ival_H_to_A + assert ival_H_end_of_year.asfreq('A') == ival_H_to_A + assert ival_H.asfreq('Q') == ival_H_to_Q + assert ival_H_end_of_quarter.asfreq('Q') == ival_H_to_Q + assert ival_H.asfreq('M') == ival_H_to_M + assert ival_H_end_of_month.asfreq('M') == ival_H_to_M + assert ival_H.asfreq('W') == ival_H_to_W + assert ival_H_end_of_week.asfreq('W') == ival_H_to_W + assert ival_H.asfreq('D') == ival_H_to_D + assert ival_H_end_of_day.asfreq('D') == ival_H_to_D + assert ival_H.asfreq('B') == ival_H_to_B + assert ival_H_end_of_bus.asfreq('B') == ival_H_to_B + + assert ival_H.asfreq('Min', 'S') == ival_H_to_T_start + assert ival_H.asfreq('Min', 'E') == ival_H_to_T_end + assert ival_H.asfreq('S', 'S') == ival_H_to_S_start + assert ival_H.asfreq('S', 'E') == ival_H_to_S_end + + assert ival_H.asfreq('H') == ival_H + + def test_conv_minutely(self): + # frequency conversion tests: from Minutely Frequency + + ival_T = Period(freq='Min', year=2007, month=1, day=1, hour=0, + minute=0) + ival_T_end_of_year = Period(freq='Min', year=2007, month=12, day=31, + hour=23, minute=59) + ival_T_end_of_quarter = Period(freq='Min', year=2007, month=3, day=31, + hour=23, minute=59) + ival_T_end_of_month = Period(freq='Min', year=2007, month=1, day=31, + hour=23, minute=59) + ival_T_end_of_week = Period(freq='Min', year=2007, month=1, day=7, + hour=23, minute=59) + ival_T_end_of_day = Period(freq='Min', year=2007, month=1, day=1, + hour=23, minute=59) + ival_T_end_of_bus = Period(freq='Min', year=2007, month=1, day=1, + hour=23, minute=59) + ival_T_end_of_hour = Period(freq='Min', year=2007, month=1, day=1, + hour=0, minute=59) + + ival_T_to_A = Period(freq='A', year=2007) + ival_T_to_Q = Period(freq='Q', year=2007, quarter=1) + ival_T_to_M = Period(freq='M', year=2007, month=1) + ival_T_to_W = Period(freq='W', year=2007, month=1, day=7) + ival_T_to_D = Period(freq='D', year=2007, month=1, day=1) + ival_T_to_B = Period(freq='B', year=2007, month=1, day=1) + ival_T_to_H = Period(freq='H', year=2007, month=1, day=1, hour=0) + + ival_T_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0, + minute=0, second=0) + ival_T_to_S_end = Period(freq='S', year=2007, month=1, day=1, hour=0, + minute=0, second=59) + + assert ival_T.asfreq('A') == ival_T_to_A + assert ival_T_end_of_year.asfreq('A') == ival_T_to_A + assert ival_T.asfreq('Q') == ival_T_to_Q + assert ival_T_end_of_quarter.asfreq('Q') == ival_T_to_Q + assert ival_T.asfreq('M') == ival_T_to_M + assert ival_T_end_of_month.asfreq('M') == ival_T_to_M + assert ival_T.asfreq('W') == ival_T_to_W + assert ival_T_end_of_week.asfreq('W') == ival_T_to_W + assert ival_T.asfreq('D') == ival_T_to_D + assert ival_T_end_of_day.asfreq('D') == ival_T_to_D + assert ival_T.asfreq('B') == ival_T_to_B + assert ival_T_end_of_bus.asfreq('B') == ival_T_to_B + assert ival_T.asfreq('H') == ival_T_to_H + assert ival_T_end_of_hour.asfreq('H') == ival_T_to_H + + assert ival_T.asfreq('S', 'S') == ival_T_to_S_start + assert ival_T.asfreq('S', 'E') == ival_T_to_S_end + + assert ival_T.asfreq('Min') == ival_T + + def test_conv_secondly(self): + # frequency conversion tests: from Secondly
Frequency" + + ival_S = Period(freq='S', year=2007, month=1, day=1, hour=0, minute=0, + second=0) + ival_S_end_of_year = Period(freq='S', year=2007, month=12, day=31, + hour=23, minute=59, second=59) + ival_S_end_of_quarter = Period(freq='S', year=2007, month=3, day=31, + hour=23, minute=59, second=59) + ival_S_end_of_month = Period(freq='S', year=2007, month=1, day=31, + hour=23, minute=59, second=59) + ival_S_end_of_week = Period(freq='S', year=2007, month=1, day=7, + hour=23, minute=59, second=59) + ival_S_end_of_day = Period(freq='S', year=2007, month=1, day=1, + hour=23, minute=59, second=59) + ival_S_end_of_bus = Period(freq='S', year=2007, month=1, day=1, + hour=23, minute=59, second=59) + ival_S_end_of_hour = Period(freq='S', year=2007, month=1, day=1, + hour=0, minute=59, second=59) + ival_S_end_of_minute = Period(freq='S', year=2007, month=1, day=1, + hour=0, minute=0, second=59) + + ival_S_to_A = Period(freq='A', year=2007) + ival_S_to_Q = Period(freq='Q', year=2007, quarter=1) + ival_S_to_M = Period(freq='M', year=2007, month=1) + ival_S_to_W = Period(freq='W', year=2007, month=1, day=7) + ival_S_to_D = Period(freq='D', year=2007, month=1, day=1) + ival_S_to_B = Period(freq='B', year=2007, month=1, day=1) + ival_S_to_H = Period(freq='H', year=2007, month=1, day=1, hour=0) + ival_S_to_T = Period(freq='Min', year=2007, month=1, day=1, hour=0, + minute=0) + + assert ival_S.asfreq('A') == ival_S_to_A + assert ival_S_end_of_year.asfreq('A') == ival_S_to_A + assert ival_S.asfreq('Q') == ival_S_to_Q + assert ival_S_end_of_quarter.asfreq('Q') == ival_S_to_Q + assert ival_S.asfreq('M') == ival_S_to_M + assert ival_S_end_of_month.asfreq('M') == ival_S_to_M + assert ival_S.asfreq('W') == ival_S_to_W + assert ival_S_end_of_week.asfreq('W') == ival_S_to_W + assert ival_S.asfreq('D') == ival_S_to_D + assert ival_S_end_of_day.asfreq('D') == ival_S_to_D + assert ival_S.asfreq('B') == ival_S_to_B + assert ival_S_end_of_bus.asfreq('B') == ival_S_to_B + assert ival_S.asfreq('H') == ival_S_to_H + assert ival_S_end_of_hour.asfreq('H') == ival_S_to_H + assert ival_S.asfreq('Min') == ival_S_to_T + assert ival_S_end_of_minute.asfreq('Min') == ival_S_to_T + + assert ival_S.asfreq('S') == ival_S + + def test_asfreq_mult(self): + # normal freq to mult freq + p = Period(freq='A', year=2007) + # ordinal will not change + for freq in ['3A', offsets.YearEnd(3)]: + result = p.asfreq(freq) + expected = Period('2007', freq='3A') + + assert result == expected + assert result.ordinal == expected.ordinal + assert result.freq == expected.freq + # ordinal will not change + for freq in ['3A', offsets.YearEnd(3)]: + result = p.asfreq(freq, how='S') + expected = Period('2007', freq='3A') + + assert result == expected + assert result.ordinal == expected.ordinal + assert result.freq == expected.freq + + # mult freq to normal freq + p = Period(freq='3A', year=2007) + # ordinal will change because how=E is the default + for freq in ['A', offsets.YearEnd()]: + result = p.asfreq(freq) + expected = Period('2009', freq='A') + + assert result == expected + assert result.ordinal == expected.ordinal + assert result.freq == expected.freq + # ordinal will not change + for freq in ['A', offsets.YearEnd()]: + result = p.asfreq(freq, how='S') + expected = Period('2007', freq='A') + + assert result == expected + assert result.ordinal == expected.ordinal + assert result.freq == expected.freq + + p = Period(freq='A', year=2007) + for freq in ['2M', offsets.MonthEnd(2)]: + result = p.asfreq(freq) + expected = Period('2007-12', freq='2M') 
+ + assert result == expected + assert result.ordinal == expected.ordinal + assert result.freq == expected.freq + for freq in ['2M', offsets.MonthEnd(2)]: + result = p.asfreq(freq, how='S') + expected = Period('2007-01', freq='2M') + + assert result == expected + assert result.ordinal == expected.ordinal + assert result.freq == expected.freq + + p = Period(freq='3A', year=2007) + for freq in ['2M', offsets.MonthEnd(2)]: + result = p.asfreq(freq) + expected = Period('2009-12', freq='2M') + + assert result == expected + assert result.ordinal == expected.ordinal + assert result.freq == expected.freq + for freq in ['2M', offsets.MonthEnd(2)]: + result = p.asfreq(freq, how='S') + expected = Period('2007-01', freq='2M') + + assert result == expected + assert result.ordinal == expected.ordinal + assert result.freq == expected.freq + + def test_asfreq_combined(self): + # normal freq to combined freq + p = Period('2007', freq='H') + + # ordinal will not change + expected = Period('2007', freq='25H') + for freq, how in zip(['1D1H', '1H1D'], ['E', 'S']): + result = p.asfreq(freq, how=how) + assert result == expected + assert result.ordinal == expected.ordinal + assert result.freq == expected.freq + + # combined freq to normal freq + p1 = Period(freq='1D1H', year=2007) + p2 = Period(freq='1H1D', year=2007) + + # ordinal will change because how=E is the default + result1 = p1.asfreq('H') + result2 = p2.asfreq('H') + expected = Period('2007-01-02', freq='H') + assert result1 == expected + assert result1.ordinal == expected.ordinal + assert result1.freq == expected.freq + assert result2 == expected + assert result2.ordinal == expected.ordinal + assert result2.freq == expected.freq + + # ordinal will not change + result1 = p1.asfreq('H', how='S') + result2 = p2.asfreq('H', how='S') + expected = Period('2007-01-01', freq='H') + assert result1 == expected + assert result1.ordinal == expected.ordinal + assert result1.freq == expected.freq + assert result2 == expected + assert result2.ordinal == expected.ordinal + assert result2.freq == expected.freq + + def test_asfreq_MS(self): + initial = Period("2013") + + assert initial.asfreq(freq="M", how="S") == Period('2013-01', 'M') + + msg = pd.tseries.frequencies._INVALID_FREQ_ERROR + with tm.assert_raises_regex(ValueError, msg): + initial.asfreq(freq="MS", how="S") + + with tm.assert_raises_regex(ValueError, msg): + pd.Period('2013-01', 'MS') + + assert _period_code_map.get("MS") is None diff --git a/pandas/tests/scalar/test_timedelta.py b/pandas/tests/scalar/test_timedelta.py new file mode 100644 index 0000000000000..ecc44204924d3 --- /dev/null +++ b/pandas/tests/scalar/test_timedelta.py @@ -0,0 +1,691 @@ +""" test the scalar Timedelta """ +import pytest + +import numpy as np +from datetime import timedelta + +import pandas as pd +import pandas.util.testing as tm +from pandas.core.tools.timedeltas import _coerce_scalar_to_timedelta_type as ct +from pandas import (Timedelta, TimedeltaIndex, timedelta_range, Series, + to_timedelta, compat) +from pandas._libs.tslib import iNaT, NaTType + + +class TestTimedeltas(object): + _multiprocess_can_split_ = True + + def setup_method(self, method): + pass + + def test_construction(self): + + expected = np.timedelta64(10, 'D').astype('m8[ns]').view('i8') + assert Timedelta(10, unit='d').value == expected + assert Timedelta(10.0, unit='d').value == expected + assert Timedelta('10 days').value == expected + assert Timedelta(days=10).value == expected + assert Timedelta(days=10.0).value == expected + + expected += 
np.timedelta64(10, 's').astype('m8[ns]').view('i8') + assert Timedelta('10 days 00:00:10').value == expected + assert Timedelta(days=10, seconds=10).value == expected + assert Timedelta(days=10, milliseconds=10 * 1000).value == expected + assert (Timedelta(days=10, microseconds=10 * 1000 * 1000) + .value == expected) + + # gh-8757: test construction with np dtypes + timedelta_kwargs = {'days': 'D', + 'seconds': 's', + 'microseconds': 'us', + 'milliseconds': 'ms', + 'minutes': 'm', + 'hours': 'h', + 'weeks': 'W'} + npdtypes = [np.int64, np.int32, np.int16, np.float64, np.float32, + np.float16] + for npdtype in npdtypes: + for pykwarg, npkwarg in timedelta_kwargs.items(): + expected = np.timedelta64(1, npkwarg).astype( + 'm8[ns]').view('i8') + assert Timedelta(**{pykwarg: npdtype(1)}).value == expected + + # rounding cases + assert Timedelta(82739999850000).value == 82739999850000 + assert ('0 days 22:58:59.999850' in str(Timedelta(82739999850000))) + assert Timedelta(123072001000000).value == 123072001000000 + assert ('1 days 10:11:12.001' in str(Timedelta(123072001000000))) + + # string conversion with/without leading zero + # GH 9570 + assert Timedelta('0:00:00') == timedelta(hours=0) + assert Timedelta('00:00:00') == timedelta(hours=0) + assert Timedelta('-1:00:00') == -timedelta(hours=1) + assert Timedelta('-01:00:00') == -timedelta(hours=1) + + # more strings & abbrevs + # GH 8190 + assert Timedelta('1 h') == timedelta(hours=1) + assert Timedelta('1 hour') == timedelta(hours=1) + assert Timedelta('1 hr') == timedelta(hours=1) + assert Timedelta('1 hours') == timedelta(hours=1) + assert Timedelta('-1 hours') == -timedelta(hours=1) + assert Timedelta('1 m') == timedelta(minutes=1) + assert Timedelta('1.5 m') == timedelta(seconds=90) + assert Timedelta('1 minute') == timedelta(minutes=1) + assert Timedelta('1 minutes') == timedelta(minutes=1) + assert Timedelta('1 s') == timedelta(seconds=1) + assert Timedelta('1 second') == timedelta(seconds=1) + assert Timedelta('1 seconds') == timedelta(seconds=1) + assert Timedelta('1 ms') == timedelta(milliseconds=1) + assert Timedelta('1 milli') == timedelta(milliseconds=1) + assert Timedelta('1 millisecond') == timedelta(milliseconds=1) + assert Timedelta('1 us') == timedelta(microseconds=1) + assert Timedelta('1 micros') == timedelta(microseconds=1) + assert Timedelta('1 microsecond') == timedelta(microseconds=1) + assert Timedelta('1.5 microsecond') == Timedelta('00:00:00.000001500') + assert Timedelta('1 ns') == Timedelta('00:00:00.000000001') + assert Timedelta('1 nano') == Timedelta('00:00:00.000000001') + assert Timedelta('1 nanosecond') == Timedelta('00:00:00.000000001') + + # combos + assert Timedelta('10 days 1 hour') == timedelta(days=10, hours=1) + assert Timedelta('10 days 1 h') == timedelta(days=10, hours=1) + assert Timedelta('10 days 1 h 1m 1s') == timedelta( + days=10, hours=1, minutes=1, seconds=1) + assert Timedelta('-10 days 1 h 1m 1s') == -timedelta( + days=10, hours=1, minutes=1, seconds=1) + assert Timedelta('-10 days 1 h 1m 1s 3us') == -timedelta( + days=10, hours=1, minutes=1, seconds=1, microseconds=3) + assert Timedelta('-10 days 1 h 1.5m 1s 3us') == -timedelta( + days=10, hours=1, minutes=1, seconds=31, microseconds=3) + + # Currently invalid as it has a - on the hh:mm:ss part + # (only allowed on the days) + pytest.raises(ValueError, + lambda: Timedelta('-10 days -1 h 1.5m 1s 3us')) + + # only leading neg signs are allowed +
pytest.raises(ValueError, + lambda: Timedelta('10 days -1 h 1.5m 1s 3us')) + + # no units specified + pytest.raises(ValueError, lambda: Timedelta('3.1415')) + + # invalid construction + tm.assert_raises_regex(ValueError, "cannot construct a Timedelta", + lambda: Timedelta()) + tm.assert_raises_regex(ValueError, + "unit abbreviation w/o a number", + lambda: Timedelta('foo')) + tm.assert_raises_regex(ValueError, + "cannot construct a Timedelta from the " + "passed arguments, allowed keywords are ", + lambda: Timedelta(day=10)) + + # round-trip both for string and value + for v in ['1s', '-1s', '1us', '-1us', '1 day', '-1 day', + '-23:59:59.999999', '-1 days +23:59:59.999999', '-1ns', + '1ns', '-23:59:59.999999999']: + + td = Timedelta(v) + assert Timedelta(td.value) == td + + # str does not normally display nanos + if not td.nanoseconds: + assert Timedelta(str(td)) == td + assert Timedelta(td._repr_base(format='all')) == td + + # floats + expected = np.timedelta64( + 10, 's').astype('m8[ns]').view('i8') + np.timedelta64( + 500, 'ms').astype('m8[ns]').view('i8') + assert Timedelta(10.5, unit='s').value == expected + + # offset + assert (to_timedelta(pd.offsets.Hour(2)) == + Timedelta('0 days, 02:00:00')) + assert (Timedelta(pd.offsets.Hour(2)) == + Timedelta('0 days, 02:00:00')) + assert (Timedelta(pd.offsets.Second(2)) == + Timedelta('0 days, 00:00:02')) + + # gh-11995: unicode + expected = Timedelta('1H') + result = pd.Timedelta(u'1H') + assert result == expected + assert (to_timedelta(pd.offsets.Hour(2)) == + Timedelta(u'0 days, 02:00:00')) + + pytest.raises(ValueError, lambda: Timedelta(u'foo bar')) + + def test_overflow_on_construction(self): + # xref https://github.com/statsmodels/statsmodels/issues/3374 + value = pd.Timedelta('1day').value * 20169940 + pytest.raises(OverflowError, pd.Timedelta, value) + + def test_total_seconds_scalar(self): + # see gh-10939 + rng = Timedelta('1 days, 10:11:12.100123456') + expt = 1 * 86400 + 10 * 3600 + 11 * 60 + 12 + 100123456. 
/ 1e9 + tm.assert_almost_equal(rng.total_seconds(), expt) + + rng = Timedelta(np.nan) + assert np.isnan(rng.total_seconds()) + + def test_repr(self): + + assert (repr(Timedelta(10, unit='d')) == + "Timedelta('10 days 00:00:00')") + assert (repr(Timedelta(10, unit='s')) == + "Timedelta('0 days 00:00:10')") + assert (repr(Timedelta(10, unit='ms')) == + "Timedelta('0 days 00:00:00.010000')") + assert (repr(Timedelta(-10, unit='ms')) == + "Timedelta('-1 days +23:59:59.990000')") + + def test_conversion(self): + + for td in [Timedelta(10, unit='d'), + Timedelta('1 days, 10:11:12.012345')]: + pydt = td.to_pytimedelta() + assert td == Timedelta(pydt) + assert td == pydt + assert (isinstance(pydt, timedelta) and not isinstance( + pydt, Timedelta)) + + assert td == np.timedelta64(td.value, 'ns') + td64 = td.to_timedelta64() + + assert td64 == np.timedelta64(td.value, 'ns') + assert td == td64 + + assert isinstance(td64, np.timedelta64) + + # this is NOT equal and cannot be roundtripped (because of the nanos) + td = Timedelta('1 days, 10:11:12.012345678') + assert td != td.to_pytimedelta() + + def test_freq_conversion(self): + + # truediv + td = Timedelta('1 days 2 hours 3 ns') + result = td / np.timedelta64(1, 'D') + assert result == td.value / float(86400 * 1e9) + result = td / np.timedelta64(1, 's') + assert result == td.value / float(1e9) + result = td / np.timedelta64(1, 'ns') + assert result == td.value + + # floordiv + td = Timedelta('1 days 2 hours 3 ns') + result = td // np.timedelta64(1, 'D') + assert result == 1 + result = td // np.timedelta64(1, 's') + assert result == 93600 + result = td // np.timedelta64(1, 'ns') + assert result == td.value + + def test_fields(self): + def check(value): + # that we are int/long like + assert isinstance(value, (int, compat.long)) + + # compat to datetime.timedelta + rng = to_timedelta('1 days, 10:11:12') + assert rng.days == 1 + assert rng.seconds == 10 * 3600 + 11 * 60 + 12 + assert rng.microseconds == 0 + assert rng.nanoseconds == 0 + + pytest.raises(AttributeError, lambda: rng.hours) + pytest.raises(AttributeError, lambda: rng.minutes) + pytest.raises(AttributeError, lambda: rng.milliseconds) + + # GH 10050 + check(rng.days) + check(rng.seconds) + check(rng.microseconds) + check(rng.nanoseconds) + + td = Timedelta('-1 days, 10:11:12') + assert abs(td) == Timedelta('13:48:48') + assert str(td) == "-1 days +10:11:12" + assert -td == Timedelta('0 days 13:48:48') + assert -Timedelta('-1 days, 10:11:12').value == 49728000000000 + assert Timedelta('-1 days, 10:11:12').value == -49728000000000 + + rng = to_timedelta('-1 days, 10:11:12.100123456') + assert rng.days == -1 + assert rng.seconds == 10 * 3600 + 11 * 60 + 12 + assert rng.microseconds == 100 * 1000 + 123 + assert rng.nanoseconds == 456 + pytest.raises(AttributeError, lambda: rng.hours) + pytest.raises(AttributeError, lambda: rng.minutes) + pytest.raises(AttributeError, lambda: rng.milliseconds) + + # components (negative Timedeltas normalize so only days carries + # the sign) + tup = pd.to_timedelta(-1, 'us').components + assert tup.days == -1 + assert tup.hours == 23 + assert tup.minutes == 59 + assert tup.seconds == 59 + assert tup.milliseconds == 999 + assert tup.microseconds == 999 + assert tup.nanoseconds == 0 + + # GH 10050 + check(tup.days) + check(tup.hours) + check(tup.minutes) + check(tup.seconds) + check(tup.milliseconds) + check(tup.microseconds) + check(tup.nanoseconds) + + tup = Timedelta('-1 days 1 us').components + assert tup.days == -2 + assert tup.hours == 23 + assert tup.minutes == 59 + assert tup.seconds == 59 + assert tup.milliseconds == 999 +
assert tup.microseconds == 999 + assert tup.nanoseconds == 0 + + def test_nat_converters(self): + assert to_timedelta('nat', box=False).astype('int64') == iNaT + assert to_timedelta('nan', box=False).astype('int64') == iNaT + + def testit(unit, transform): + + # array + result = to_timedelta(np.arange(5), unit=unit) + expected = TimedeltaIndex([np.timedelta64(i, transform(unit)) + for i in np.arange(5).tolist()]) + tm.assert_index_equal(result, expected) + + # scalar + result = to_timedelta(2, unit=unit) + expected = Timedelta(np.timedelta64(2, transform(unit)).astype( + 'timedelta64[ns]')) + assert result == expected + + # validate all units + # GH 6855 + for unit in ['Y', 'M', 'W', 'D', 'y', 'w', 'd']: + testit(unit, lambda x: x.upper()) + for unit in ['days', 'day', 'Day', 'Days']: + testit(unit, lambda x: 'D') + for unit in ['h', 'm', 's', 'ms', 'us', 'ns', 'H', 'S', 'MS', 'US', + 'NS']: + testit(unit, lambda x: x.lower()) + + # offsets + + # m + testit('T', lambda x: 'm') + + # ms + testit('L', lambda x: 'ms') + + def test_numeric_conversions(self): + assert ct(0) == np.timedelta64(0, 'ns') + assert ct(10) == np.timedelta64(10, 'ns') + assert ct(10, unit='ns') == np.timedelta64(10, 'ns').astype('m8[ns]') + + assert ct(10, unit='us') == np.timedelta64(10, 'us').astype('m8[ns]') + assert ct(10, unit='ms') == np.timedelta64(10, 'ms').astype('m8[ns]') + assert ct(10, unit='s') == np.timedelta64(10, 's').astype('m8[ns]') + assert ct(10, unit='d') == np.timedelta64(10, 'D').astype('m8[ns]') + + def test_timedelta_conversions(self): + assert (ct(timedelta(seconds=1)) == + np.timedelta64(1, 's').astype('m8[ns]')) + assert (ct(timedelta(microseconds=1)) == + np.timedelta64(1, 'us').astype('m8[ns]')) + assert (ct(timedelta(days=1)) == + np.timedelta64(1, 'D').astype('m8[ns]')) + + def test_round(self): + + t1 = Timedelta('1 days 02:34:56.789123456') + t2 = Timedelta('-1 days 02:34:56.789123456') + + for (freq, s1, s2) in [('N', t1, t2), + ('U', Timedelta('1 days 02:34:56.789123000'), + Timedelta('-1 days 02:34:56.789123000')), + ('L', Timedelta('1 days 02:34:56.789000000'), + Timedelta('-1 days 02:34:56.789000000')), + ('S', Timedelta('1 days 02:34:57'), + Timedelta('-1 days 02:34:57')), + ('2S', Timedelta('1 days 02:34:56'), + Timedelta('-1 days 02:34:56')), + ('5S', Timedelta('1 days 02:34:55'), + Timedelta('-1 days 02:34:55')), + ('T', Timedelta('1 days 02:35:00'), + Timedelta('-1 days 02:35:00')), + ('12T', Timedelta('1 days 02:36:00'), + Timedelta('-1 days 02:36:00')), + ('H', Timedelta('1 days 03:00:00'), + Timedelta('-1 days 03:00:00')), + ('d', Timedelta('1 days'), + Timedelta('-1 days'))]: + r1 = t1.round(freq) + assert r1 == s1 + r2 = t2.round(freq) + assert r2 == s2 + + # invalid + for freq in ['Y', 'M', 'foobar']: + pytest.raises(ValueError, lambda: t1.round(freq)) + + t1 = timedelta_range('1 days', periods=3, freq='1 min 2 s 3 us') + t2 = -1 * t1 + t1a = timedelta_range('1 days', periods=3, freq='1 min 2 s') + t1c = pd.TimedeltaIndex([1, 1, 1], unit='D') + + # note that negative times round DOWN! 
so don't give whole numbers + for (freq, s1, s2) in [('N', t1, t2), + ('U', t1, t2), + ('L', t1a, + TimedeltaIndex(['-1 days +00:00:00', + '-2 days +23:58:58', + '-2 days +23:57:56'], + dtype='timedelta64[ns]', + freq=None) + ), + ('S', t1a, + TimedeltaIndex(['-1 days +00:00:00', + '-2 days +23:58:58', + '-2 days +23:57:56'], + dtype='timedelta64[ns]', + freq=None) + ), + ('12T', t1c, + TimedeltaIndex(['-1 days', + '-1 days', + '-1 days'], + dtype='timedelta64[ns]', + freq=None) + ), + ('H', t1c, + TimedeltaIndex(['-1 days', + '-1 days', + '-1 days'], + dtype='timedelta64[ns]', + freq=None) + ), + ('d', t1c, + pd.TimedeltaIndex([-1, -1, -1], unit='D') + )]: + + r1 = t1.round(freq) + tm.assert_index_equal(r1, s1) + r2 = t2.round(freq) + tm.assert_index_equal(r2, s2) + + # invalid + for freq in ['Y', 'M', 'foobar']: + pytest.raises(ValueError, lambda: t1.round(freq)) + + def test_contains(self): + # Checking for any NaT-like objects + # GH 13603 + td = to_timedelta(range(5), unit='d') + pd.offsets.Hour(1) + for v in [pd.NaT, None, float('nan'), np.nan]: + assert not (v in td) + + td = to_timedelta([pd.NaT]) + for v in [pd.NaT, None, float('nan'), np.nan]: + assert (v in td) + + def test_identity(self): + + td = Timedelta(10, unit='d') + assert isinstance(td, Timedelta) + assert isinstance(td, timedelta) + + def test_short_format_converters(self): + def conv(v): + return v.astype('m8[ns]') + + assert ct('10') == np.timedelta64(10, 'ns') + assert ct('10ns') == np.timedelta64(10, 'ns') + assert ct('100') == np.timedelta64(100, 'ns') + assert ct('100ns') == np.timedelta64(100, 'ns') + + assert ct('1000') == np.timedelta64(1000, 'ns') + assert ct('1000ns') == np.timedelta64(1000, 'ns') + assert ct('1000NS') == np.timedelta64(1000, 'ns') + + assert ct('10us') == np.timedelta64(10000, 'ns') + assert ct('100us') == np.timedelta64(100000, 'ns') + assert ct('1000us') == np.timedelta64(1000000, 'ns') + assert ct('1000Us') == np.timedelta64(1000000, 'ns') + assert ct('1000uS') == np.timedelta64(1000000, 'ns') + + assert ct('1ms') == np.timedelta64(1000000, 'ns') + assert ct('10ms') == np.timedelta64(10000000, 'ns') + assert ct('100ms') == np.timedelta64(100000000, 'ns') + assert ct('1000ms') == np.timedelta64(1000000000, 'ns') + + assert ct('-1s') == -np.timedelta64(1000000000, 'ns') + assert ct('1s') == np.timedelta64(1000000000, 'ns') + assert ct('10s') == np.timedelta64(10000000000, 'ns') + assert ct('100s') == np.timedelta64(100000000000, 'ns') + assert ct('1000s') == np.timedelta64(1000000000000, 'ns') + + assert ct('1d') == conv(np.timedelta64(1, 'D')) + assert ct('-1d') == -conv(np.timedelta64(1, 'D')) + assert ct('1D') == conv(np.timedelta64(1, 'D')) + assert ct('10D') == conv(np.timedelta64(10, 'D')) + assert ct('100D') == conv(np.timedelta64(100, 'D')) + assert ct('1000D') == conv(np.timedelta64(1000, 'D')) + assert ct('10000D') == conv(np.timedelta64(10000, 'D')) + + # space + assert ct(' 10000D ') == conv(np.timedelta64(10000, 'D')) + assert ct(' - 10000D ') == -conv(np.timedelta64(10000, 'D')) + + # invalid + pytest.raises(ValueError, ct, '1foo') + pytest.raises(ValueError, ct, 'foo') + + def test_full_format_converters(self): + def conv(v): + return v.astype('m8[ns]') + + d1 = np.timedelta64(1, 'D') + + assert ct('1days') == conv(d1) + assert ct('1days,') == conv(d1) + assert ct('- 1days,') == -conv(d1) + + assert ct('00:00:01') == conv(np.timedelta64(1, 's')) + assert ct('06:00:01') == conv(np.timedelta64(6 * 3600 + 1, 's')) + assert ct('06:00:01.0') == conv(np.timedelta64(6 * 3600 + 1, 
's')) + assert ct('06:00:01.01') == conv(np.timedelta64( + 1000 * (6 * 3600 + 1) + 10, 'ms')) + + assert (ct('- 1days, 00:00:01') == + conv(-d1 + np.timedelta64(1, 's'))) + assert (ct('1days, 06:00:01') == + conv(d1 + np.timedelta64(6 * 3600 + 1, 's'))) + assert (ct('1days, 06:00:01.01') == + conv(d1 + np.timedelta64(1000 * (6 * 3600 + 1) + 10, 'ms'))) + + # invalid + pytest.raises(ValueError, ct, '- 1days, 00') + + def test_overflow(self): + # GH 9442 + s = Series(pd.date_range('20130101', periods=100000, freq='H')) + s[0] += pd.Timedelta('1s 1ms') + + # mean + result = (s - s.min()).mean() + expected = pd.Timedelta((pd.DatetimeIndex((s - s.min())).asi8 / len(s) + ).sum()) + + # the computation is converted to float so + # might be some loss of precision + assert np.allclose(result.value / 1000, expected.value / 1000) + + # sum + pytest.raises(ValueError, lambda: (s - s.min()).sum()) + s1 = s[0:10000] + pytest.raises(ValueError, lambda: (s1 - s1.min()).sum()) + s2 = s[0:1000] + result = (s2 - s2.min()).sum() + + def test_pickle(self): + + v = Timedelta('1 days 10:11:12.0123456') + v_p = tm.round_trip_pickle(v) + assert v == v_p + + def test_timedelta_hash_equality(self): + # GH 11129 + v = Timedelta(1, 'D') + td = timedelta(days=1) + assert hash(v) == hash(td) + + d = {td: 2} + assert d[v] == 2 + + tds = timedelta_range('1 second', periods=20) + assert all(hash(td) == hash(td.to_pytimedelta()) for td in tds) + + # python timedeltas drop ns resolution + ns_td = Timedelta(1, 'ns') + assert hash(ns_td) != hash(ns_td.to_pytimedelta()) + + def test_implementation_limits(self): + min_td = Timedelta(Timedelta.min) + max_td = Timedelta(Timedelta.max) + + # GH 12727 + # timedelta limits correspond to int64 boundaries + assert min_td.value == np.iinfo(np.int64).min + 1 + assert max_td.value == np.iinfo(np.int64).max + + # Beyond lower limit, a NAT before the Overflow + assert isinstance(min_td - Timedelta(1, 'ns'), NaTType) + + with pytest.raises(OverflowError): + min_td - Timedelta(2, 'ns') + + with pytest.raises(OverflowError): + max_td + Timedelta(1, 'ns') + + # Same tests using the internal nanosecond values + td = Timedelta(min_td.value - 1, 'ns') + assert isinstance(td, NaTType) + + with pytest.raises(OverflowError): + Timedelta(min_td.value - 2, 'ns') + + with pytest.raises(OverflowError): + Timedelta(max_td.value + 1, 'ns') + + def test_timedelta_arithmetic(self): + data = pd.Series(['nat', '32 days'], dtype='timedelta64[ns]') + deltas = [timedelta(days=1), Timedelta(1, unit='D')] + for delta in deltas: + result_method = data.add(delta) + result_operator = data + delta + expected = pd.Series(['nat', '33 days'], dtype='timedelta64[ns]') + tm.assert_series_equal(result_operator, expected) + tm.assert_series_equal(result_method, expected) + + result_method = data.sub(delta) + result_operator = data - delta + expected = pd.Series(['nat', '31 days'], dtype='timedelta64[ns]') + tm.assert_series_equal(result_operator, expected) + tm.assert_series_equal(result_method, expected) + # GH 9396 + result_method = data.div(delta) + result_operator = data / delta + expected = pd.Series([np.nan, 32.], dtype='float64') + tm.assert_series_equal(result_operator, expected) + tm.assert_series_equal(result_method, expected) + + def test_apply_to_timedelta(self): + timedelta_NaT = pd.to_timedelta('NaT') + + list_of_valid_strings = ['00:00:01', '00:00:02'] + a = pd.to_timedelta(list_of_valid_strings) + b = Series(list_of_valid_strings).apply(pd.to_timedelta) + # Can't compare until apply on a Series gives the 
correct dtype + # assert_series_equal(a, b) + + list_of_strings = ['00:00:01', np.nan, pd.NaT, timedelta_NaT] + + # TODO: unused? + a = pd.to_timedelta(list_of_strings) # noqa + b = Series(list_of_strings).apply(pd.to_timedelta) # noqa + # Can't compare until apply on a Series gives the correct dtype + # assert_series_equal(a, b) + + def test_components(self): + rng = timedelta_range('1 days, 10:11:12', periods=2, freq='s') + rng.components + + # with nat + s = Series(rng) + s[1] = np.nan + + result = s.dt.components + assert not result.iloc[0].isnull().all() + assert result.iloc[1].isnull().all() + + def test_isoformat(self): + td = Timedelta(days=6, minutes=50, seconds=3, + milliseconds=10, microseconds=10, nanoseconds=12) + expected = 'P6DT0H50M3.010010012S' + result = td.isoformat() + assert result == expected + + td = Timedelta(days=4, hours=12, minutes=30, seconds=5) + result = td.isoformat() + expected = 'P4DT12H30M5S' + assert result == expected + + td = Timedelta(nanoseconds=123) + result = td.isoformat() + expected = 'P0DT0H0M0.000000123S' + assert result == expected + + # trim nano + td = Timedelta(microseconds=10) + result = td.isoformat() + expected = 'P0DT0H0M0.00001S' + assert result == expected + + # trim micro + td = Timedelta(milliseconds=1) + result = td.isoformat() + expected = 'P0DT0H0M0.001S' + assert result == expected + + # don't strip every 0 + result = Timedelta(minutes=1).isoformat() + expected = 'P0DT0H1M0S' + assert result == expected + + def test_ops_error_str(self): + # GH 13624 + td = Timedelta('1 day') + + for l, r in [(td, 'a'), ('a', td)]: + + with pytest.raises(TypeError): + l + r + + with pytest.raises(TypeError): + l > r + + assert not l == r + assert l != r diff --git a/pandas/tests/scalar/test_timestamp.py b/pandas/tests/scalar/test_timestamp.py new file mode 100644 index 0000000000000..5caa0252b69b8 --- /dev/null +++ b/pandas/tests/scalar/test_timestamp.py @@ -0,0 +1,1514 @@ +""" test the scalar Timestamp """ + +import sys +import pytest +import operator +import calendar +import numpy as np +from datetime import datetime, timedelta +from distutils.version import LooseVersion + +import pandas.util.testing as tm +from pandas.tseries import offsets, frequencies +from pandas._libs import tslib, period +from pandas._libs.tslib import get_timezone + +from pandas.compat import lrange, long +from pandas.util.testing import assert_series_equal +from pandas.compat.numpy import np_datetime64_compat +from pandas import (Timestamp, date_range, Period, Timedelta, compat, + Series, NaT, DataFrame, DatetimeIndex) +from pandas.tseries.frequencies import (RESO_DAY, RESO_HR, RESO_MIN, RESO_US, + RESO_MS, RESO_SEC) + + +class TestTimestamp(object): + + def test_constructor(self): + base_str = '2014-07-01 09:00' + base_dt = datetime(2014, 7, 1, 9) + base_expected = 1404205200000000000 + + # confirm base representation is correct + import calendar + assert (calendar.timegm(base_dt.timetuple()) * 1000000000 == + base_expected) + + tests = [(base_str, base_dt, base_expected), + ('2014-07-01 10:00', datetime(2014, 7, 1, 10), + base_expected + 3600 * 1000000000), + ('2014-07-01 09:00:00.000008000', + datetime(2014, 7, 1, 9, 0, 0, 8), + base_expected + 8000), + ('2014-07-01 09:00:00.000000005', + Timestamp('2014-07-01 09:00:00.000000005'), + base_expected + 5)] + + tm._skip_if_no_pytz() + tm._skip_if_no_dateutil() + import pytz + import dateutil + timezones = [(None, 0), ('UTC', 0), (pytz.utc, 0), ('Asia/Tokyo', 9), + ('US/Eastern', -4), ('dateutil/US/Pacific', -7), + 
(pytz.FixedOffset(-180), -3), + (dateutil.tz.tzoffset(None, 18000), 5)] + + for date_str, date, expected in tests: + for result in [Timestamp(date_str), Timestamp(date)]: + # only with timestring + assert result.value == expected + assert tslib.pydt_to_i8(result) == expected + + # re-creation shouldn't affect to internal value + result = Timestamp(result) + assert result.value == expected + assert tslib.pydt_to_i8(result) == expected + + # with timezone + for tz, offset in timezones: + for result in [Timestamp(date_str, tz=tz), Timestamp(date, + tz=tz)]: + expected_tz = expected - offset * 3600 * 1000000000 + assert result.value == expected_tz + assert tslib.pydt_to_i8(result) == expected_tz + + # should preserve tz + result = Timestamp(result) + assert result.value == expected_tz + assert tslib.pydt_to_i8(result) == expected_tz + + # should convert to UTC + result = Timestamp(result, tz='UTC') + expected_utc = expected - offset * 3600 * 1000000000 + assert result.value == expected_utc + assert tslib.pydt_to_i8(result) == expected_utc + + def test_constructor_with_stringoffset(self): + # GH 7833 + base_str = '2014-07-01 11:00:00+02:00' + base_dt = datetime(2014, 7, 1, 9) + base_expected = 1404205200000000000 + + # confirm base representation is correct + import calendar + assert (calendar.timegm(base_dt.timetuple()) * 1000000000 == + base_expected) + + tests = [(base_str, base_expected), + ('2014-07-01 12:00:00+02:00', + base_expected + 3600 * 1000000000), + ('2014-07-01 11:00:00.000008000+02:00', base_expected + 8000), + ('2014-07-01 11:00:00.000000005+02:00', base_expected + 5)] + + tm._skip_if_no_pytz() + tm._skip_if_no_dateutil() + import pytz + import dateutil + timezones = [(None, 0), ('UTC', 0), (pytz.utc, 0), ('Asia/Tokyo', 9), + ('US/Eastern', -4), ('dateutil/US/Pacific', -7), + (pytz.FixedOffset(-180), -3), + (dateutil.tz.tzoffset(None, 18000), 5)] + + for date_str, expected in tests: + for result in [Timestamp(date_str)]: + # only with timestring + assert result.value == expected + assert tslib.pydt_to_i8(result) == expected + + # re-creation shouldn't affect to internal value + result = Timestamp(result) + assert result.value == expected + assert tslib.pydt_to_i8(result) == expected + + # with timezone + for tz, offset in timezones: + result = Timestamp(date_str, tz=tz) + expected_tz = expected + assert result.value == expected_tz + assert tslib.pydt_to_i8(result) == expected_tz + + # should preserve tz + result = Timestamp(result) + assert result.value == expected_tz + assert tslib.pydt_to_i8(result) == expected_tz + + # should convert to UTC + result = Timestamp(result, tz='UTC') + expected_utc = expected + assert result.value == expected_utc + assert tslib.pydt_to_i8(result) == expected_utc + + # This should be 2013-11-01 05:00 in UTC + # converted to Chicago tz + result = Timestamp('2013-11-01 00:00:00-0500', tz='America/Chicago') + assert result.value == Timestamp('2013-11-01 05:00').value + expected = "Timestamp('2013-11-01 00:00:00-0500', tz='America/Chicago')" # noqa + assert repr(result) == expected + assert result == eval(repr(result)) + + # This should be 2013-11-01 05:00 in UTC + # converted to Tokyo tz (+09:00) + result = Timestamp('2013-11-01 00:00:00-0500', tz='Asia/Tokyo') + assert result.value == Timestamp('2013-11-01 05:00').value + expected = "Timestamp('2013-11-01 14:00:00+0900', tz='Asia/Tokyo')" + assert repr(result) == expected + assert result == eval(repr(result)) + + # GH11708 + # This should be 2015-11-18 10:00 in UTC + # converted to Asia/Katmandu + 
result = Timestamp("2015-11-18 15:45:00+05:45", tz="Asia/Katmandu") + assert result.value == Timestamp("2015-11-18 10:00").value + expected = "Timestamp('2015-11-18 15:45:00+0545', tz='Asia/Katmandu')" + assert repr(result) == expected + assert result == eval(repr(result)) + + # This should be 2015-11-18 10:00 in UTC + # converted to Asia/Kolkata + result = Timestamp("2015-11-18 15:30:00+05:30", tz="Asia/Kolkata") + assert result.value == Timestamp("2015-11-18 10:00").value + expected = "Timestamp('2015-11-18 15:30:00+0530', tz='Asia/Kolkata')" + assert repr(result) == expected + assert result == eval(repr(result)) + + def test_constructor_invalid(self): + with tm.assert_raises_regex(TypeError, 'Cannot convert input'): + Timestamp(slice(2)) + with tm.assert_raises_regex(ValueError, 'Cannot convert Period'): + Timestamp(Period('1000-01-01')) + + def test_constructor_positional(self): + # see gh-10758 + with pytest.raises(TypeError): + Timestamp(2000, 1) + with pytest.raises(ValueError): + Timestamp(2000, 0, 1) + with pytest.raises(ValueError): + Timestamp(2000, 13, 1) + with pytest.raises(ValueError): + Timestamp(2000, 1, 0) + with pytest.raises(ValueError): + Timestamp(2000, 1, 32) + + # see gh-11630 + assert (repr(Timestamp(2015, 11, 12)) == + repr(Timestamp('20151112'))) + assert (repr(Timestamp(2015, 11, 12, 1, 2, 3, 999999)) == + repr(Timestamp('2015-11-12 01:02:03.999999'))) + + def test_constructor_keyword(self): + # GH 10758 + with pytest.raises(TypeError): + Timestamp(year=2000, month=1) + with pytest.raises(ValueError): + Timestamp(year=2000, month=0, day=1) + with pytest.raises(ValueError): + Timestamp(year=2000, month=13, day=1) + with pytest.raises(ValueError): + Timestamp(year=2000, month=1, day=0) + with pytest.raises(ValueError): + Timestamp(year=2000, month=1, day=32) + + assert (repr(Timestamp(year=2015, month=11, day=12)) == + repr(Timestamp('20151112'))) + + assert (repr(Timestamp(year=2015, month=11, day=12, hour=1, minute=2, + second=3, microsecond=999999)) == + repr(Timestamp('2015-11-12 01:02:03.999999'))) + + def test_constructor_fromordinal(self): + base = datetime(2000, 1, 1) + + ts = Timestamp.fromordinal(base.toordinal(), freq='D') + assert base == ts + assert ts.freq == 'D' + assert base.toordinal() == ts.toordinal() + + ts = Timestamp.fromordinal(base.toordinal(), tz='US/Eastern') + assert Timestamp('2000-01-01', tz='US/Eastern') == ts + assert base.toordinal() == ts.toordinal() + + def test_constructor_offset_depr(self): + # see gh-12160 + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + ts = Timestamp('2011-01-01', offset='D') + assert ts.freq == 'D' + + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + assert ts.offset == 'D' + + msg = "Can only specify freq or offset, not both" + with tm.assert_raises_regex(TypeError, msg): + Timestamp('2011-01-01', offset='D', freq='D') + + def test_constructor_offset_depr_fromordinal(self): + # GH 12160 + base = datetime(2000, 1, 1) + + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + ts = Timestamp.fromordinal(base.toordinal(), offset='D') + assert Timestamp('2000-01-01') == ts + assert ts.freq == 'D' + assert base.toordinal() == ts.toordinal() + + msg = "Can only specify freq or offset, not both" + with tm.assert_raises_regex(TypeError, msg): + Timestamp.fromordinal(base.toordinal(), offset='D', freq='D') + + def test_conversion(self): + # GH 9255 + ts = Timestamp('2000-01-01') + + result = ts.to_pydatetime() + expected = datetime(2000, 1, 
1) + assert result == expected + assert type(result) == type(expected) + + result = ts.to_datetime64() + expected = np.datetime64(ts.value, 'ns') + assert result == expected + assert type(result) == type(expected) + assert result.dtype == expected.dtype + + def test_repr(self): + tm._skip_if_no_pytz() + tm._skip_if_no_dateutil() + + dates = ['2014-03-07', '2014-01-01 09:00', + '2014-01-01 00:00:00.000000001'] + + # dateutil zone change (only matters for repr) + import dateutil + if (dateutil.__version__ >= LooseVersion('2.3') and + (dateutil.__version__ <= LooseVersion('2.4.0') or + dateutil.__version__ >= LooseVersion('2.6.0'))): + timezones = ['UTC', 'Asia/Tokyo', 'US/Eastern', + 'dateutil/US/Pacific'] + else: + timezones = ['UTC', 'Asia/Tokyo', 'US/Eastern', + 'dateutil/America/Los_Angeles'] + + freqs = ['D', 'M', 'S', 'N'] + + for date in dates: + for tz in timezones: + for freq in freqs: + + # avoid to match with timezone name + freq_repr = "'{0}'".format(freq) + if tz.startswith('dateutil'): + tz_repr = tz.replace('dateutil', '') + else: + tz_repr = tz + + date_only = Timestamp(date) + assert date in repr(date_only) + assert tz_repr not in repr(date_only) + assert freq_repr not in repr(date_only) + assert date_only == eval(repr(date_only)) + + date_tz = Timestamp(date, tz=tz) + assert date in repr(date_tz) + assert tz_repr in repr(date_tz) + assert freq_repr not in repr(date_tz) + assert date_tz == eval(repr(date_tz)) + + date_freq = Timestamp(date, freq=freq) + assert date in repr(date_freq) + assert tz_repr not in repr(date_freq) + assert freq_repr in repr(date_freq) + assert date_freq == eval(repr(date_freq)) + + date_tz_freq = Timestamp(date, tz=tz, freq=freq) + assert date in repr(date_tz_freq) + assert tz_repr in repr(date_tz_freq) + assert freq_repr in repr(date_tz_freq) + assert date_tz_freq == eval(repr(date_tz_freq)) + + # This can cause the tz field to be populated, but it's redundant to + # include this information in the date-string. 
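+ # (the parsed offset is kept as a pytz.FixedOffset and shows up in + # the repr, as the assertions below check)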
+ tm._skip_if_no_pytz() + import pytz # noqa + date_with_utc_offset = Timestamp('2014-03-13 00:00:00-0400', tz=None) + assert '2014-03-13 00:00:00-0400' in repr(date_with_utc_offset) + assert 'tzoffset' not in repr(date_with_utc_offset) + assert 'pytz.FixedOffset(-240)' in repr(date_with_utc_offset) + expr = repr(date_with_utc_offset).replace("'pytz.FixedOffset(-240)'", + 'pytz.FixedOffset(-240)') + assert date_with_utc_offset == eval(expr) + + def test_bounds_with_different_units(self): + out_of_bounds_dates = ('1677-09-21', '2262-04-12', ) + + time_units = ('D', 'h', 'm', 's', 'ms', 'us') + + for date_string in out_of_bounds_dates: + for unit in time_units: + pytest.raises(ValueError, Timestamp, np.datetime64( + date_string, dtype='M8[%s]' % unit)) + + in_bounds_dates = ('1677-09-23', '2262-04-11', ) + + for date_string in in_bounds_dates: + for unit in time_units: + Timestamp(np.datetime64(date_string, dtype='M8[%s]' % unit)) + + def test_tz(self): + t = '2014-02-01 09:00' + ts = Timestamp(t) + local = ts.tz_localize('Asia/Tokyo') + assert local.hour == 9 + assert local == Timestamp(t, tz='Asia/Tokyo') + conv = local.tz_convert('US/Eastern') + assert conv == Timestamp('2014-01-31 19:00', tz='US/Eastern') + assert conv.hour == 19 + + # preserves nanosecond + ts = Timestamp(t) + offsets.Nano(5) + local = ts.tz_localize('Asia/Tokyo') + assert local.hour == 9 + assert local.nanosecond == 5 + conv = local.tz_convert('US/Eastern') + assert conv.nanosecond == 5 + assert conv.hour == 19 + + def test_tz_localize_ambiguous(self): + + ts = Timestamp('2014-11-02 01:00') + ts_dst = ts.tz_localize('US/Eastern', ambiguous=True) + ts_no_dst = ts.tz_localize('US/Eastern', ambiguous=False) + + rng = date_range('2014-11-02', periods=3, freq='H', tz='US/Eastern') + assert rng[1] == ts_dst + assert rng[2] == ts_no_dst + pytest.raises(ValueError, ts.tz_localize, 'US/Eastern', + ambiguous='infer') + + # GH 8025 + with tm.assert_raises_regex(TypeError, + 'Cannot localize tz-aware Timestamp, ' + 'use tz_convert for conversions'): + Timestamp('2011-01-01', tz='US/Eastern').tz_localize('Asia/Tokyo') + + with tm.assert_raises_regex(TypeError, + 'Cannot convert tz-naive Timestamp, ' + 'use tz_localize to localize'): + Timestamp('2011-01-01').tz_convert('Asia/Tokyo') + + def test_tz_localize_nonexistent(self): + # See issue 13057 + from pytz.exceptions import NonExistentTimeError + times = ['2015-03-08 02:00', '2015-03-08 02:30', + '2015-03-29 02:00', '2015-03-29 02:30'] + timezones = ['US/Eastern', 'US/Pacific', + 'Europe/Paris', 'Europe/Belgrade'] + for t, tz in zip(times, timezones): + ts = Timestamp(t) + pytest.raises(NonExistentTimeError, ts.tz_localize, + tz) + pytest.raises(NonExistentTimeError, ts.tz_localize, + tz, errors='raise') + assert ts.tz_localize(tz, errors='coerce') is NaT + + def test_tz_localize_errors_ambiguous(self): + # See issue 13057 + from pytz.exceptions import AmbiguousTimeError + ts = Timestamp('2015-11-1 01:00') + pytest.raises(AmbiguousTimeError, + ts.tz_localize, 'US/Pacific', errors='coerce') + + def test_tz_localize_roundtrip(self): + for tz in ['UTC', 'Asia/Tokyo', 'US/Eastern', 'dateutil/US/Pacific']: + for t in ['2014-02-01 09:00', '2014-07-08 09:00', + '2014-11-01 17:00', '2014-11-05 00:00']: + ts = Timestamp(t) + localized = ts.tz_localize(tz) + assert localized == Timestamp(t, tz=tz) + + with pytest.raises(TypeError): + localized.tz_localize(tz) + + reset = localized.tz_localize(None) + assert reset == ts + assert reset.tzinfo is None + + def test_tz_convert_roundtrip(self): + 
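# tz_convert(None) converts back to UTC and drops the tz, so the + # round trip should recover the original naive Timestamp +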
for tz in ['UTC', 'Asia/Tokyo', 'US/Eastern', 'dateutil/US/Pacific']: + for t in ['2014-02-01 09:00', '2014-07-08 09:00', + '2014-11-01 17:00', '2014-11-05 00:00']: + ts = Timestamp(t, tz='UTC') + converted = ts.tz_convert(tz) + + reset = converted.tz_convert(None) + assert reset == Timestamp(t) + assert reset.tzinfo is None + assert reset == converted.tz_convert('UTC').tz_localize(None) + + def test_barely_oob_dts(self): + one_us = np.timedelta64(1).astype('timedelta64[us]') + + # By definition we can't go out of bounds in [ns], so we + # convert the datetime64s to [us] so we can go out of bounds + min_ts_us = np.datetime64(Timestamp.min).astype('M8[us]') + max_ts_us = np.datetime64(Timestamp.max).astype('M8[us]') + + # No error for the min/max datetimes + Timestamp(min_ts_us) + Timestamp(max_ts_us) + + # One us less than the minimum is an error + pytest.raises(ValueError, Timestamp, min_ts_us - one_us) + + # One us more than the maximum is an error + pytest.raises(ValueError, Timestamp, max_ts_us + one_us) + + def test_utc_z_designator(self): + assert get_timezone(Timestamp('2014-11-02 01:00Z').tzinfo) == 'UTC' + + def test_now(self): + # #9000 + ts_from_string = Timestamp('now') + ts_from_method = Timestamp.now() + ts_datetime = datetime.now() + + ts_from_string_tz = Timestamp('now', tz='US/Eastern') + ts_from_method_tz = Timestamp.now(tz='US/Eastern') + + # Check that the delta between the times is less than 1s (arbitrarily + # small) + delta = Timedelta(seconds=1) + assert abs(ts_from_method - ts_from_string) < delta + assert abs(ts_datetime - ts_from_method) < delta + assert abs(ts_from_method_tz - ts_from_string_tz) < delta + assert (abs(ts_from_string_tz.tz_localize(None) - + ts_from_method_tz.tz_localize(None)) < delta) + + def test_today(self): + + ts_from_string = Timestamp('today') + ts_from_method = Timestamp.today() + ts_datetime = datetime.today() + + ts_from_string_tz = Timestamp('today', tz='US/Eastern') + ts_from_method_tz = Timestamp.today(tz='US/Eastern') + + # Check that the delta between the times is less than 1s (arbitrarily + # small) + delta = Timedelta(seconds=1) + assert abs(ts_from_method - ts_from_string) < delta + assert abs(ts_datetime - ts_from_method) < delta + assert abs(ts_from_method_tz - ts_from_string_tz) < delta + assert (abs(ts_from_string_tz.tz_localize(None) - + ts_from_method_tz.tz_localize(None)) < delta) + + def test_asm8(self): + np.random.seed(7960929) + ns = [Timestamp.min.value, Timestamp.max.value, 1000] + + for n in ns: + assert (Timestamp(n).asm8.view('i8') == + np.datetime64(n, 'ns').view('i8') == n) + + assert (Timestamp('nat').asm8.view('i8') == + np.datetime64('nat', 'ns').view('i8')) + + def test_fields(self): + def check(value, equal): + # that we are int/long like + assert isinstance(value, (int, compat.long)) + assert value == equal + + # GH 10050 + ts = Timestamp('2015-05-10 09:06:03.000100001') + check(ts.year, 2015) + check(ts.month, 5) + check(ts.day, 10) + check(ts.hour, 9) + check(ts.minute, 6) + check(ts.second, 3) + pytest.raises(AttributeError, lambda: ts.millisecond) + check(ts.microsecond, 100) + check(ts.nanosecond, 1) + check(ts.dayofweek, 6) + check(ts.quarter, 2) + check(ts.dayofyear, 130) + check(ts.week, 19) + check(ts.daysinmonth, 31) + check(ts.daysinmonth, 31) + + # GH 13303 + ts = Timestamp('2014-12-31 23:59:00-05:00', tz='US/Eastern') + check(ts.year, 2014) + check(ts.month, 12) + check(ts.day, 31) + check(ts.hour, 23) + check(ts.minute, 59) + check(ts.second, 0) + pytest.raises(AttributeError, lambda: 
ts.millisecond) + check(ts.microsecond, 0) + check(ts.nanosecond, 0) + check(ts.dayofweek, 2) + check(ts.quarter, 4) + check(ts.dayofyear, 365) + check(ts.week, 1) + check(ts.daysinmonth, 31) + + ts = Timestamp('2014-01-01 00:00:00+01:00') + starts = ['is_month_start', 'is_quarter_start', 'is_year_start'] + for start in starts: + assert getattr(ts, start) + ts = Timestamp('2014-12-31 23:59:59+01:00') + ends = ['is_month_end', 'is_year_end', 'is_quarter_end'] + for end in ends: + assert getattr(ts, end) + + def test_pprint(self): + # GH12622 + import pprint + nested_obj = {'foo': 1, + 'bar': [{'w': {'a': Timestamp('2011-01-01')}}] * 10} + result = pprint.pformat(nested_obj, width=50) + expected = r"""{'bar': [{'w': {'a': Timestamp('2011-01-01 00:00:00')}}, + {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, + {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, + {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, + {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, + {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, + {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, + {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, + {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, + {'w': {'a': Timestamp('2011-01-01 00:00:00')}}], + 'foo': 1}""" + assert result == expected + + def to_datetime_depr(self): + # see gh-8254 + ts = Timestamp('2011-01-01') + + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + expected = datetime(2011, 1, 1) + result = ts.to_datetime() + assert result == expected + + def to_pydatetime_nonzero_nano(self): + ts = Timestamp('2011-01-01 9:00:00.123456789') + + # Warn the user of data loss (nanoseconds). + with tm.assert_produces_warning(UserWarning, + check_stacklevel=False): + expected = datetime(2011, 1, 1, 9, 0, 0, 123456) + result = ts.to_pydatetime() + assert result == expected + + def test_round(self): + + # round + dt = Timestamp('20130101 09:10:11') + result = dt.round('D') + expected = Timestamp('20130101') + assert result == expected + + dt = Timestamp('20130101 19:10:11') + result = dt.round('D') + expected = Timestamp('20130102') + assert result == expected + + dt = Timestamp('20130201 12:00:00') + result = dt.round('D') + expected = Timestamp('20130202') + assert result == expected + + dt = Timestamp('20130104 12:00:00') + result = dt.round('D') + expected = Timestamp('20130105') + assert result == expected + + dt = Timestamp('20130104 12:32:00') + result = dt.round('30Min') + expected = Timestamp('20130104 12:30:00') + assert result == expected + + dti = date_range('20130101 09:10:11', periods=5) + result = dti.round('D') + expected = date_range('20130101', periods=5) + tm.assert_index_equal(result, expected) + + # floor + dt = Timestamp('20130101 09:10:11') + result = dt.floor('D') + expected = Timestamp('20130101') + assert result == expected + + # ceil + dt = Timestamp('20130101 09:10:11') + result = dt.ceil('D') + expected = Timestamp('20130102') + assert result == expected + + # round with tz + dt = Timestamp('20130101 09:10:11', tz='US/Eastern') + result = dt.round('D') + expected = Timestamp('20130101', tz='US/Eastern') + assert result == expected + + dt = Timestamp('20130101 09:10:11', tz='US/Eastern') + result = dt.round('s') + assert result == dt + + dti = date_range('20130101 09:10:11', + periods=5).tz_localize('UTC').tz_convert('US/Eastern') + result = dti.round('D') + expected = date_range('20130101', periods=5).tz_localize('US/Eastern') + tm.assert_index_equal(result, expected) + + result = dti.round('s') + tm.assert_index_equal(result, dti) + + # 
invalid + for freq in ['Y', 'M', 'foobar']: + pytest.raises(ValueError, lambda: dti.round(freq)) + + # GH 14440 & 15578 + result = Timestamp('2016-10-17 12:00:00.0015').round('ms') + expected = Timestamp('2016-10-17 12:00:00.002000') + assert result == expected + + result = Timestamp('2016-10-17 12:00:00.00149').round('ms') + expected = Timestamp('2016-10-17 12:00:00.001000') + assert result == expected + + ts = Timestamp('2016-10-17 12:00:00.0015') + for freq in ['us', 'ns']: + assert ts == ts.round(freq) + + result = Timestamp('2016-10-17 12:00:00.001501031').round('10ns') + expected = Timestamp('2016-10-17 12:00:00.001501030') + assert result == expected + + with tm.assert_produces_warning(): + Timestamp('2016-10-17 12:00:00.001501031').round('1010ns') + + def test_round_misc(self): + stamp = Timestamp('2000-01-05 05:09:15.13') + + def _check_round(freq, expected): + result = stamp.round(freq=freq) + assert result == expected + + for freq, expected in [('D', Timestamp('2000-01-05 00:00:00')), + ('H', Timestamp('2000-01-05 05:00:00')), + ('S', Timestamp('2000-01-05 05:09:15'))]: + _check_round(freq, expected) + + msg = frequencies._INVALID_FREQ_ERROR + with tm.assert_raises_regex(ValueError, msg): + stamp.round('foo') + + def test_class_ops_pytz(self): + tm._skip_if_no_pytz() + from pytz import timezone + + def compare(x, y): + assert (int(Timestamp(x).value / 1e9) == + int(Timestamp(y).value / 1e9)) + + compare(Timestamp.now(), datetime.now()) + compare(Timestamp.now('UTC'), datetime.now(timezone('UTC'))) + compare(Timestamp.utcnow(), datetime.utcnow()) + compare(Timestamp.today(), datetime.today()) + current_time = calendar.timegm(datetime.now().utctimetuple()) + compare(Timestamp.utcfromtimestamp(current_time), + datetime.utcfromtimestamp(current_time)) + compare(Timestamp.fromtimestamp(current_time), + datetime.fromtimestamp(current_time)) + + date_component = datetime.utcnow() + time_component = (date_component + timedelta(minutes=10)).time() + compare(Timestamp.combine(date_component, time_component), + datetime.combine(date_component, time_component)) + + def test_class_ops_dateutil(self): + tm._skip_if_no_dateutil() + from dateutil.tz import tzutc + + def compare(x, y): + assert (int(np.round(Timestamp(x).value / 1e9)) == + int(np.round(Timestamp(y).value / 1e9))) + + compare(Timestamp.now(), datetime.now()) + compare(Timestamp.now('UTC'), datetime.now(tzutc())) + compare(Timestamp.utcnow(), datetime.utcnow()) + compare(Timestamp.today(), datetime.today()) + current_time = calendar.timegm(datetime.now().utctimetuple()) + compare(Timestamp.utcfromtimestamp(current_time), + datetime.utcfromtimestamp(current_time)) + compare(Timestamp.fromtimestamp(current_time), + datetime.fromtimestamp(current_time)) + + date_component = datetime.utcnow() + time_component = (date_component + timedelta(minutes=10)).time() + compare(Timestamp.combine(date_component, time_component), + datetime.combine(date_component, time_component)) + + def test_basics_nanos(self): + val = np.int64(946684800000000000).view('M8[ns]') + stamp = Timestamp(val.view('i8') + 500) + assert stamp.year == 2000 + assert stamp.month == 1 + assert stamp.microsecond == 0 + assert stamp.nanosecond == 500 + + # GH 14415 + val = np.iinfo(np.int64).min + 80000000000000 + stamp = Timestamp(val) + assert stamp.year == 1677 + assert stamp.month == 9 + assert stamp.day == 21 + assert stamp.microsecond == 145224 + assert stamp.nanosecond == 192 + + def test_unit(self): + + def check(val, unit=None, h=1, s=1, us=0): + stamp = 
+            assert stamp.year == 2000
+            assert stamp.month == 1
+            assert stamp.day == 1
+            assert stamp.hour == h
+            if unit != 'D':
+                assert stamp.minute == 1
+                assert stamp.second == s
+                assert stamp.microsecond == us
+            else:
+                assert stamp.minute == 0
+                assert stamp.second == 0
+                assert stamp.microsecond == 0
+            assert stamp.nanosecond == 0
+
+        ts = Timestamp('20000101 01:01:01')
+        val = ts.value
+        days = (ts - Timestamp('1970-01-01')).days
+
+        check(val)
+        check(val / long(1000), unit='us')
+        check(val / long(1000000), unit='ms')
+        check(val / long(1000000000), unit='s')
+        check(days, unit='D', h=0)
+
+        # using truediv, so these are like floats
+        if compat.PY3:
+            check((val + 500000) / long(1000000000), unit='s', us=500)
+            check((val + 500000000) / long(1000000000), unit='s', us=500000)
+            check((val + 500000) / long(1000000), unit='ms', us=500)
+
+        # get chopped in py2
+        else:
+            check((val + 500000) / long(1000000000), unit='s')
+            check((val + 500000000) / long(1000000000), unit='s')
+            check((val + 500000) / long(1000000), unit='ms')
+
+        # ok
+        check((val + 500000) / long(1000), unit='us', us=500)
+        check((val + 500000000) / long(1000000), unit='ms', us=500000)
+
+        # floats
+        check(val / 1000.0 + 5, unit='us', us=5)
+        check(val / 1000.0 + 5000, unit='us', us=5000)
+        check(val / 1000000.0 + 0.5, unit='ms', us=500)
+        check(val / 1000000.0 + 0.005, unit='ms', us=5)
+        check(val / 1000000000.0 + 0.5, unit='s', us=500000)
+        check(days + 0.5, unit='D', h=12)
+
+    def test_roundtrip(self):
+
+        # test value to string and back conversions
+        # further test accessors
+        base = Timestamp('20140101 00:00:00')
+
+        result = Timestamp(base.value + Timedelta('5ms').value)
+        assert result == Timestamp(str(base) + ".005000")
+        assert result.microsecond == 5000
+
+        result = Timestamp(base.value + Timedelta('5us').value)
+        assert result == Timestamp(str(base) + ".000005")
+        assert result.microsecond == 5
+
+        result = Timestamp(base.value + Timedelta('5ns').value)
+        assert result == Timestamp(str(base) + ".000000005")
+        assert result.nanosecond == 5
+        assert result.microsecond == 0
+
+        result = Timestamp(base.value + Timedelta('6ms 5us').value)
+        assert result == Timestamp(str(base) + ".006005")
+        assert result.microsecond == 5 + 6 * 1000
+
+        result = Timestamp(base.value + Timedelta('200ms 5us').value)
+        assert result == Timestamp(str(base) + ".200005")
+        assert result.microsecond == 5 + 200 * 1000
+
+    def test_comparison(self):
+        # 5-18-2012 00:00:00.000
+        stamp = long(1337299200000000000)
+
+        val = Timestamp(stamp)
+
+        assert val == val
+        assert not val != val
+        assert not val < val
+        assert val <= val
+        assert not val > val
+        assert val >= val
+
+        other = datetime(2012, 5, 18)
+        assert val == other
+        assert not val != other
+        assert not val < other
+        assert val <= other
+        assert not val > other
+        assert val >= other
+
+        other = Timestamp(stamp + 100)
+
+        assert val != other
+        assert val != other
+        assert val < other
+        assert val <= other
+        assert other > val
+        assert other >= val
+
+    def test_compare_invalid(self):
+
+        # GH 8058
+        val = Timestamp('20130101 12:01:02')
+        assert not val == 'foo'
+        assert not val == 10.0
+        assert not val == 1
+        assert not val == long(1)
+        assert not val == []
+        assert not val == {'foo': 1}
+        assert not val == np.float64(1)
+        assert not val == np.int64(1)
+
+        assert val != 'foo'
+        assert val != 10.0
+        assert val != 1
+        assert val != long(1)
+        assert val != []
+        assert val != {'foo': 1}
+        assert val != np.float64(1)
+        assert val != np.int64(1)
+
+        # ops testing
+        df = DataFrame(np.random.randn(5, 2))
+        a = df[0]
+        b = Series(np.random.randn(5))
+        b.name = Timestamp('2000-01-01')
+        tm.assert_series_equal(a / b, 1 / (b / a))
+
+    def test_cant_compare_tz_naive_w_aware(self):
+        tm._skip_if_no_pytz()
+        # #1404
+        a = Timestamp('3/12/2012')
+        b = Timestamp('3/12/2012', tz='utc')
+
+        pytest.raises(Exception, a.__eq__, b)
+        pytest.raises(Exception, a.__ne__, b)
+        pytest.raises(Exception, a.__lt__, b)
+        pytest.raises(Exception, a.__gt__, b)
+        pytest.raises(Exception, b.__eq__, a)
+        pytest.raises(Exception, b.__ne__, a)
+        pytest.raises(Exception, b.__lt__, a)
+        pytest.raises(Exception, b.__gt__, a)
+
+        if sys.version_info < (3, 3):
+            pytest.raises(Exception, a.__eq__, b.to_pydatetime())
+            pytest.raises(Exception, a.to_pydatetime().__eq__, b)
+        else:
+            assert not a == b.to_pydatetime()
+            assert not a.to_pydatetime() == b
+
+    def test_cant_compare_tz_naive_w_aware_explicit_pytz(self):
+        tm._skip_if_no_pytz()
+        from pytz import utc
+        # #1404
+        a = Timestamp('3/12/2012')
+        b = Timestamp('3/12/2012', tz=utc)
+
+        pytest.raises(Exception, a.__eq__, b)
+        pytest.raises(Exception, a.__ne__, b)
+        pytest.raises(Exception, a.__lt__, b)
+        pytest.raises(Exception, a.__gt__, b)
+        pytest.raises(Exception, b.__eq__, a)
+        pytest.raises(Exception, b.__ne__, a)
+        pytest.raises(Exception, b.__lt__, a)
+        pytest.raises(Exception, b.__gt__, a)
+
+        if sys.version_info < (3, 3):
+            pytest.raises(Exception, a.__eq__, b.to_pydatetime())
+            pytest.raises(Exception, a.to_pydatetime().__eq__, b)
+        else:
+            assert not a == b.to_pydatetime()
+            assert not a.to_pydatetime() == b
+
+    def test_cant_compare_tz_naive_w_aware_dateutil(self):
+        tm._skip_if_no_dateutil()
+        from dateutil.tz import tzutc
+        utc = tzutc()
+        # #1404
+        a = Timestamp('3/12/2012')
+        b = Timestamp('3/12/2012', tz=utc)
+
+        pytest.raises(Exception, a.__eq__, b)
+        pytest.raises(Exception, a.__ne__, b)
+        pytest.raises(Exception, a.__lt__, b)
+        pytest.raises(Exception, a.__gt__, b)
+        pytest.raises(Exception, b.__eq__, a)
+        pytest.raises(Exception, b.__ne__, a)
+        pytest.raises(Exception, b.__lt__, a)
+        pytest.raises(Exception, b.__gt__, a)
+
+        if sys.version_info < (3, 3):
+            pytest.raises(Exception, a.__eq__, b.to_pydatetime())
+            pytest.raises(Exception, a.to_pydatetime().__eq__, b)
+        else:
+            assert not a == b.to_pydatetime()
+            assert not a.to_pydatetime() == b
+
+    def test_delta_preserve_nanos(self):
+        val = Timestamp(long(1337299200000000123))
+        result = val + timedelta(1)
+        assert result.nanosecond == val.nanosecond
+
+    def test_frequency_misc(self):
+        assert (frequencies.get_freq_group('T') ==
+                frequencies.FreqGroup.FR_MIN)
+
+        code, stride = frequencies.get_freq_code(offsets.Hour())
+        assert code == frequencies.FreqGroup.FR_HR
+
+        code, stride = frequencies.get_freq_code((5, 'T'))
+        assert code == frequencies.FreqGroup.FR_MIN
+        assert stride == 5
+
+        offset = offsets.Hour()
+        result = frequencies.to_offset(offset)
+        assert result == offset
+
+        result = frequencies.to_offset((5, 'T'))
+        expected = offsets.Minute(5)
+        assert result == expected
+
+        pytest.raises(ValueError, frequencies.get_freq_code, (5, 'baz'))
+
+        pytest.raises(ValueError, frequencies.to_offset, '100foo')
+
+        pytest.raises(ValueError, frequencies.to_offset, ('', ''))
+
+        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+            result = frequencies.get_standard_freq(offsets.Hour())
+        assert result == 'H'
+
+    def test_hash_equivalent(self):
+        d = {datetime(2011, 1, 1): 5}
+        stamp = Timestamp(datetime(2011, 1, 1))
+        assert d[stamp] == 5
+
+    def test_timestamp_compare_scalars(self):
+        # case where ndim == 0
+        lhs = np.datetime64(datetime(2013, 12, 6))
+        rhs = Timestamp('now')
+        nat = Timestamp('nat')
+
+        ops = {'gt': 'lt',
+               'lt': 'gt',
+               'ge': 'le',
+               'le': 'ge',
+               'eq': 'eq',
+               'ne': 'ne'}
+
+        for left, right in ops.items():
+            left_f = getattr(operator, left)
+            right_f = getattr(operator, right)
+            expected = left_f(lhs, rhs)
+
+            result = right_f(rhs, lhs)
+            assert result == expected
+
+            expected = left_f(rhs, nat)
+            result = right_f(nat, rhs)
+            assert result == expected
+
+    def test_timestamp_compare_series(self):
+        # make sure we can compare Timestamps on the right AND left hand side
+        # GH4982
+        s = Series(date_range('20010101', periods=10), name='dates')
+        s_nat = s.copy(deep=True)
+
+        s[0] = Timestamp('nat')
+        s[3] = Timestamp('nat')
+
+        ops = {'lt': 'gt', 'le': 'ge', 'eq': 'eq', 'ne': 'ne'}
+
+        for left, right in ops.items():
+            left_f = getattr(operator, left)
+            right_f = getattr(operator, right)
+
+            # no nats
+            expected = left_f(s, Timestamp('20010109'))
+            result = right_f(Timestamp('20010109'), s)
+            tm.assert_series_equal(result, expected)
+
+            # nats
+            expected = left_f(s, Timestamp('nat'))
+            result = right_f(Timestamp('nat'), s)
+            tm.assert_series_equal(result, expected)
+
+            # compare to timestamp with series containing nats
+            expected = left_f(s_nat, Timestamp('20010109'))
+            result = right_f(Timestamp('20010109'), s_nat)
+            tm.assert_series_equal(result, expected)
+
+            # compare to nat with series containing nats
+            expected = left_f(s_nat, Timestamp('nat'))
+            result = right_f(Timestamp('nat'), s_nat)
+            tm.assert_series_equal(result, expected)
+
+    def test_is_leap_year(self):
+        # GH 13727
+        for tz in [None, 'UTC', 'US/Eastern', 'Asia/Tokyo']:
+            dt = Timestamp('2000-01-01 00:00:00', tz=tz)
+            assert dt.is_leap_year
+            assert isinstance(dt.is_leap_year, bool)
+
+            dt = Timestamp('1999-01-01 00:00:00', tz=tz)
+            assert not dt.is_leap_year
+
+            dt = Timestamp('2004-01-01 00:00:00', tz=tz)
+            assert dt.is_leap_year
+
+            dt = Timestamp('2100-01-01 00:00:00', tz=tz)
+            assert not dt.is_leap_year
+
+
+class TestTimestampNsOperations(object):
+
+    def setup_method(self, method):
+        self.timestamp = Timestamp(datetime.utcnow())
+
+    def assert_ns_timedelta(self, modified_timestamp, expected_value):
+        value = self.timestamp.value
+        modified_value = modified_timestamp.value
+
+        assert modified_value - value == expected_value
+
+    def test_timedelta_ns_arithmetic(self):
+        self.assert_ns_timedelta(self.timestamp + np.timedelta64(-123, 'ns'),
+                                 -123)
+
+    def test_timedelta_ns_based_arithmetic(self):
+        self.assert_ns_timedelta(self.timestamp + np.timedelta64(
+            1234567898, 'ns'), 1234567898)
+
+    def test_timedelta_us_arithmetic(self):
+        self.assert_ns_timedelta(self.timestamp + np.timedelta64(-123, 'us'),
+                                 -123000)
+
+    def test_timedelta_ms_arithmetic(self):
+        time = self.timestamp + np.timedelta64(-123, 'ms')
+        self.assert_ns_timedelta(time, -123000000)
+
+    def test_nanosecond_string_parsing(self):
+        ts = Timestamp('2013-05-01 07:15:45.123456789')
+        # GH 7878
+        expected_repr = '2013-05-01 07:15:45.123456789'
+        expected_value = 1367392545123456789
+        assert ts.value == expected_value
+        assert expected_repr in repr(ts)
+
+        ts = Timestamp('2013-05-01 07:15:45.123456789+09:00', tz='Asia/Tokyo')
+        assert ts.value == expected_value - 9 * 3600 * 1000000000
+        assert expected_repr in repr(ts)
+
+        ts = Timestamp('2013-05-01 07:15:45.123456789', tz='UTC')
+        assert ts.value == expected_value
+        assert expected_repr in repr(ts)
+
+        ts = Timestamp('2013-05-01 07:15:45.123456789', tz='US/Eastern')
+        assert ts.value == expected_value + 4 * 3600 * 1000000000
+        assert expected_repr in repr(ts)
+
+        # GH 10041
+        ts = Timestamp('20130501T071545.123456789')
+        assert ts.value == expected_value
+        assert expected_repr in repr(ts)
+
+    def test_nanosecond_timestamp(self):
+        # GH 7610
+        expected = 1293840000000000005
+        t = Timestamp('2011-01-01') + offsets.Nano(5)
+        assert repr(t) == "Timestamp('2011-01-01 00:00:00.000000005')"
+        assert t.value == expected
+        assert t.nanosecond == 5
+
+        t = Timestamp(t)
+        assert repr(t) == "Timestamp('2011-01-01 00:00:00.000000005')"
+        assert t.value == expected
+        assert t.nanosecond == 5
+
+        t = Timestamp(np_datetime64_compat('2011-01-01 00:00:00.000000005Z'))
+        assert repr(t) == "Timestamp('2011-01-01 00:00:00.000000005')"
+        assert t.value == expected
+        assert t.nanosecond == 5
+
+        expected = 1293840000000000010
+        t = t + offsets.Nano(5)
+        assert repr(t) == "Timestamp('2011-01-01 00:00:00.000000010')"
+        assert t.value == expected
+        assert t.nanosecond == 10
+
+        t = Timestamp(t)
+        assert repr(t) == "Timestamp('2011-01-01 00:00:00.000000010')"
+        assert t.value == expected
+        assert t.nanosecond == 10
+
+        t = Timestamp(np_datetime64_compat('2011-01-01 00:00:00.000000010Z'))
+        assert repr(t) == "Timestamp('2011-01-01 00:00:00.000000010')"
+        assert t.value == expected
+        assert t.nanosecond == 10
+
+
+class TestTimestampOps(object):
+
+    def test_timestamp_and_datetime(self):
+        assert ((Timestamp(datetime(2013, 10, 13)) -
+                 datetime(2013, 10, 12)).days == 1)
+        assert ((datetime(2013, 10, 12) -
+                 Timestamp(datetime(2013, 10, 13))).days == -1)
+
+    def test_timestamp_and_series(self):
+        timestamp_series = Series(date_range('2014-03-17', periods=2, freq='D',
+                                             tz='US/Eastern'))
+        first_timestamp = timestamp_series[0]
+
+        delta_series = Series([np.timedelta64(0, 'D'), np.timedelta64(1, 'D')])
+        assert_series_equal(timestamp_series - first_timestamp, delta_series)
+        assert_series_equal(first_timestamp - timestamp_series, -delta_series)
+
+    def test_addition_subtraction_types(self):
+        # Assert on the types resulting from Timestamp +/- various date/time
+        # objects
+        datetime_instance = datetime(2014, 3, 4)
+        timedelta_instance = timedelta(seconds=1)
+        # build a timestamp with a frequency, since then it supports
+        # addition/subtraction of integers
+        timestamp_instance = date_range(datetime_instance, periods=1,
+                                        freq='D')[0]
+
+        assert type(timestamp_instance + 1) == Timestamp
+        assert type(timestamp_instance - 1) == Timestamp
+
+        # Timestamp + datetime not supported, though subtraction is supported
+        # and yields timedelta more tests in tseries/base/tests/test_base.py
+        assert type(timestamp_instance - datetime_instance) == Timedelta
+        assert type(timestamp_instance + timedelta_instance) == Timestamp
+        assert type(timestamp_instance - timedelta_instance) == Timestamp
+
+        # Timestamp +/- datetime64 not supported, so not tested (could possibly
+        # assert error raised?)
+        timedelta64_instance = np.timedelta64(1, 'D')
+        assert type(timestamp_instance + timedelta64_instance) == Timestamp
+        assert type(timestamp_instance - timedelta64_instance) == Timestamp
+
+    def test_addition_subtraction_preserve_frequency(self):
+        timestamp_instance = date_range('2014-03-05', periods=1, freq='D')[0]
+        timedelta_instance = timedelta(days=1)
+        original_freq = timestamp_instance.freq
+
+        assert (timestamp_instance + 1).freq == original_freq
+        assert (timestamp_instance - 1).freq == original_freq
+        assert (timestamp_instance + timedelta_instance).freq == original_freq
+        assert (timestamp_instance - timedelta_instance).freq == original_freq
+
+        timedelta64_instance = np.timedelta64(1, 'D')
+        assert (timestamp_instance +
+                timedelta64_instance).freq == original_freq
+        assert (timestamp_instance -
+                timedelta64_instance).freq == original_freq
+
+    def test_resolution(self):
+
+        for freq, expected in zip(['A', 'Q', 'M', 'D', 'H', 'T',
+                                   'S', 'L', 'U'],
+                                  [RESO_DAY, RESO_DAY,
+                                   RESO_DAY, RESO_DAY,
+                                   RESO_HR, RESO_MIN,
+                                   RESO_SEC, RESO_MS,
+                                   RESO_US]):
+            for tz in [None, 'Asia/Tokyo', 'US/Eastern',
+                       'dateutil/US/Eastern']:
+                idx = date_range(start='2013-04-01', periods=30, freq=freq,
+                                 tz=tz)
+                result = period.resolution(idx.asi8, idx.tz)
+                assert result == expected
+
+
+class TestTimestampToJulianDate(object):
+
+    def test_compare_1700(self):
+        r = Timestamp('1700-06-23').to_julian_date()
+        assert r == 2342145.5
+
+    def test_compare_2000(self):
+        r = Timestamp('2000-04-12').to_julian_date()
+        assert r == 2451646.5
+
+    def test_compare_2100(self):
+        r = Timestamp('2100-08-12').to_julian_date()
+        assert r == 2488292.5
+
+    def test_compare_hour01(self):
+        r = Timestamp('2000-08-12T01:00:00').to_julian_date()
+        assert r == 2451768.5416666666666666
+
+    def test_compare_hour13(self):
+        r = Timestamp('2000-08-12T13:00:00').to_julian_date()
+        assert r == 2451769.0416666666666666
+
+
+class TestTimeSeries(object):
+
+    def test_timestamp_to_datetime(self):
+        tm._skip_if_no_pytz()
+        rng = date_range('20090415', '20090519', tz='US/Eastern')
+
+        stamp = rng[0]
+        dtval = stamp.to_pydatetime()
+        assert stamp == dtval
+        assert stamp.tzinfo == dtval.tzinfo
+
+    def test_timestamp_to_datetime_dateutil(self):
+        tm._skip_if_no_pytz()
+        rng = date_range('20090415', '20090519', tz='dateutil/US/Eastern')
+
+        stamp = rng[0]
+        dtval = stamp.to_pydatetime()
+        assert stamp == dtval
+        assert stamp.tzinfo == dtval.tzinfo
+
+    def test_timestamp_to_datetime_explicit_pytz(self):
+        tm._skip_if_no_pytz()
+        import pytz
+        rng = date_range('20090415', '20090519',
+                         tz=pytz.timezone('US/Eastern'))
+
+        stamp = rng[0]
+        dtval = stamp.to_pydatetime()
+        assert stamp == dtval
+        assert stamp.tzinfo == dtval.tzinfo
+
+    def test_timestamp_to_datetime_explicit_dateutil(self):
+        tm._skip_if_windows_python_3()
+        tm._skip_if_no_dateutil()
+        from pandas._libs.tslib import _dateutil_gettz as gettz
+        rng = date_range('20090415', '20090519', tz=gettz('US/Eastern'))
+
+        stamp = rng[0]
+        dtval = stamp.to_pydatetime()
+        assert stamp == dtval
+        assert stamp.tzinfo == dtval.tzinfo
+
+    def test_timestamp_fields(self):
+        # extra fields from DatetimeIndex like quarter and week
+        idx = tm.makeDateIndex(100)
+
+        fields = ['dayofweek', 'dayofyear', 'week', 'weekofyear', 'quarter',
+                  'days_in_month', 'is_month_start', 'is_month_end',
+                  'is_quarter_start', 'is_quarter_end', 'is_year_start',
+                  'is_year_end', 'weekday_name']
+        for f in fields:
+            expected = getattr(idx, f)[-1]
+            result = getattr(Timestamp(idx[-1]), f)
+            assert result == expected
+
+        assert idx.freq == Timestamp(idx[-1], idx.freq).freq
+        assert idx.freqstr == Timestamp(idx[-1], idx.freq).freqstr
+
+    def test_timestamp_date_out_of_range(self):
+        pytest.raises(ValueError, Timestamp, '1676-01-01')
+        pytest.raises(ValueError, Timestamp, '2263-01-01')
+
+        # see gh-1475
+        pytest.raises(ValueError, DatetimeIndex, ['1400-01-01'])
+        pytest.raises(ValueError, DatetimeIndex, [datetime(1400, 1, 1)])
+
+    def test_timestamp_repr(self):
+        # pre-1900
+        stamp = Timestamp('1850-01-01', tz='US/Eastern')
+        repr(stamp)
+
+        iso8601 = '1850-01-01 01:23:45.012345'
+        stamp = Timestamp(iso8601, tz='US/Eastern')
+        result = repr(stamp)
+        assert iso8601 in result
+
+    def test_timestamp_from_ordinal(self):
+
+        # GH 3042
+        dt = datetime(2011, 4, 16, 0, 0)
+        ts = Timestamp.fromordinal(dt.toordinal())
+        assert ts.to_pydatetime() == dt
+
+        # with a tzinfo
+        stamp = Timestamp('2011-4-16', tz='US/Eastern')
+        dt_tz = stamp.to_pydatetime()
+        ts = Timestamp.fromordinal(dt_tz.toordinal(), tz='US/Eastern')
+        assert ts.to_pydatetime() == dt_tz
+
+    def test_timestamp_compare_with_early_datetime(self):
+        # e.g. datetime.min
+        stamp = Timestamp('2012-01-01')
+
+        assert not stamp == datetime.min
+        assert not stamp == datetime(1600, 1, 1)
+        assert not stamp == datetime(2700, 1, 1)
+        assert stamp != datetime.min
+        assert stamp != datetime(1600, 1, 1)
+        assert stamp != datetime(2700, 1, 1)
+        assert stamp > datetime(1600, 1, 1)
+        assert stamp >= datetime(1600, 1, 1)
+        assert stamp < datetime(2700, 1, 1)
+        assert stamp <= datetime(2700, 1, 1)
+
+    def test_timestamp_equality(self):
+
+        # GH 11034
+        s = Series([Timestamp('2000-01-29 01:59:00'), 'NaT'])
+        result = s != s
+        assert_series_equal(result, Series([False, True]))
+        result = s != s[0]
+        assert_series_equal(result, Series([False, True]))
+        result = s != s[1]
+        assert_series_equal(result, Series([True, True]))
+
+        result = s == s
+        assert_series_equal(result, Series([True, False]))
+        result = s == s[0]
+        assert_series_equal(result, Series([True, False]))
+        result = s == s[1]
+        assert_series_equal(result, Series([False, False]))
+
+    def test_series_box_timestamp(self):
+        rng = date_range('20090415', '20090519', freq='B')
+        s = Series(rng)
+
+        assert isinstance(s[5], Timestamp)
+
+        rng = date_range('20090415', '20090519', freq='B')
+        s = Series(rng, index=rng)
+        assert isinstance(s[5], Timestamp)
+
+        assert isinstance(s.iat[5], Timestamp)
+
+    def test_frame_setitem_timestamp(self):
+        # 2155
+        columns = DatetimeIndex(start='1/1/2012', end='2/1/2012',
+                                freq=offsets.BDay())
+        index = lrange(10)
+        data = DataFrame(columns=columns, index=index)
+        t = datetime(2012, 11, 1)
+        ts = Timestamp(t)
+        data[ts] = np.nan  # works
+
+    def test_to_html_timestamp(self):
+        rng = date_range('2000-01-01', periods=10)
+        df = DataFrame(np.random.randn(10, 4), index=rng)
+
+        result = df.to_html()
+        assert '2000-01-01' in result
+
+    def test_series_map_box_timestamps(self):
+        # #2689, #2627
+        s = Series(date_range('1/1/2000', periods=10))
+
+        def f(x):
+            return (x.hour, x.day, x.month)
+
+        # it works!
+        s.map(f)
+        s.apply(f)
+        DataFrame(s).applymap(f)
+
+    def test_dti_slicing(self):
+        dti = DatetimeIndex(start='1/1/2005', end='12/1/2005', freq='M')
+        dti2 = dti[[1, 3, 5]]
+
+        v1 = dti2[0]
+        v2 = dti2[1]
+        v3 = dti2[2]
+
+        assert v1 == Timestamp('2/28/2005')
+        assert v2 == Timestamp('4/30/2005')
+        assert v3 == Timestamp('6/30/2005')
+
+        # don't carry freq through irregular slicing
+        assert dti2.freq is None
+
+    def test_woy_boundary(self):
+        # make sure weeks at year boundaries are correct
+        d = datetime(2013, 12, 31)
+        result = Timestamp(d).week
+        expected = 1  # ISO standard
+        assert result == expected
+
+        d = datetime(2008, 12, 28)
+        result = Timestamp(d).week
+        expected = 52  # ISO standard
+        assert result == expected
+
+        d = datetime(2009, 12, 31)
+        result = Timestamp(d).week
+        expected = 53  # ISO standard
+        assert result == expected
+
+        d = datetime(2010, 1, 1)
+        result = Timestamp(d).week
+        expected = 53  # ISO standard
+        assert result == expected
+
+        d = datetime(2010, 1, 3)
+        result = Timestamp(d).week
+        expected = 53  # ISO standard
+        assert result == expected
+
+        result = np.array([Timestamp(datetime(*args)).week
+                           for args in [(2000, 1, 1), (2000, 1, 2), (
+                               2005, 1, 1), (2005, 1, 2)]])
+        assert (result == [52, 52, 53, 53]).all()
+
+
+class TestTsUtil(object):
+
+    def test_min_valid(self):
+        # Ensure that Timestamp.min is a valid Timestamp
+        Timestamp(Timestamp.min)
+
+    def test_max_valid(self):
+        # Ensure that Timestamp.max is a valid Timestamp
+        Timestamp(Timestamp.max)
+
+    def test_to_datetime_bijective(self):
+        # Ensure that converting to datetime and back only loses precision
+        # by going from nanoseconds to microseconds.
+        exp_warning = None if Timestamp.max.nanosecond == 0 else UserWarning
+        with tm.assert_produces_warning(exp_warning, check_stacklevel=False):
+            assert (Timestamp(Timestamp.max.to_pydatetime()).value / 1000 ==
+                    Timestamp.max.value / 1000)
+
+        exp_warning = None if Timestamp.min.nanosecond == 0 else UserWarning
+        with tm.assert_produces_warning(exp_warning, check_stacklevel=False):
+            assert (Timestamp(Timestamp.min.to_pydatetime()).value / 1000 ==
+                    Timestamp.min.value / 1000)
diff --git a/pandas/tests/series/common.py b/pandas/tests/series/common.py
index 613961e1c670f..0c25dcb29c3b2 100644
--- a/pandas/tests/series/common.py
+++ b/pandas/tests/series/common.py
@@ -1,4 +1,4 @@
-from pandas.util.decorators import cache_readonly
+from pandas.util._decorators import cache_readonly
 import pandas.util.testing as tm
 import pandas as pd
 
diff --git a/pandas/tests/series/test_alter_axes.py b/pandas/tests/series/test_alter_axes.py
index 2ddfa27eea377..150767ee9e2b2 100644
--- a/pandas/tests/series/test_alter_axes.py
+++ b/pandas/tests/series/test_alter_axes.py
@@ -1,6 +1,8 @@
 # coding=utf-8
 # pylint: disable-msg=E1101,W0612
 
+import pytest
+
 from datetime import datetime
 
 import numpy as np
@@ -16,29 +18,27 @@
 from .common import TestData
 
 
-class TestSeriesAlterAxes(TestData, tm.TestCase):
-
-    _multiprocess_can_split_ = True
+class TestSeriesAlterAxes(TestData):
 
     def test_setindex(self):
         # wrong type
         series = self.series.copy()
-        self.assertRaises(TypeError, setattr, series, 'index', None)
+        pytest.raises(TypeError, setattr, series, 'index', None)
 
         # wrong length
         series = self.series.copy()
-        self.assertRaises(Exception, setattr, series, 'index',
-                          np.arange(len(series) - 1))
+        pytest.raises(Exception, setattr, series, 'index',
+                      np.arange(len(series) - 1))
 
         # works
         series = self.series.copy()
         series.index = np.arange(len(series))
-        tm.assertIsInstance(series.index, Index)
+        assert isinstance(series.index, Index)
 
     def test_rename(self):
         renamer = lambda x: x.strftime('%Y%m%d')
         renamed = self.ts.rename(renamer)
-        self.assertEqual(renamed.index[0], renamer(self.ts.index[0]))
+        assert renamed.index[0] == renamer(self.ts.index[0])
 
         # dict
         rename_dict = dict(zip(self.ts.index, renamed.index))
@@ -48,14 +48,14 @@ def test_rename(self):
         # partial dict
         s = Series(np.arange(4), index=['a', 'b', 'c', 'd'], dtype='int64')
         renamed = s.rename({'b': 'foo', 'd': 'bar'})
-        self.assert_index_equal(renamed.index, Index(['a', 'foo', 'c', 'bar']))
+        tm.assert_index_equal(renamed.index, Index(['a', 'foo', 'c', 'bar']))
 
         # index with name
         renamer = Series(np.arange(4),
                          index=Index(['a', 'b', 'c', 'd'], name='name'),
                          dtype='int64')
         renamed = renamer.rename({})
-        self.assertEqual(renamed.index.name, renamer.index.name)
+        assert renamed.index.name == renamer.index.name
 
     def test_rename_by_series(self):
         s = Series(range(5), name='foo')
@@ -68,51 +68,48 @@ def test_rename_set_name(self):
         s = Series(range(4), index=list('abcd'))
         for name in ['foo', 123, 123., datetime(2001, 11, 11), ('foo',)]:
             result = s.rename(name)
-            self.assertEqual(result.name, name)
-            self.assert_numpy_array_equal(result.index.values, s.index.values)
-            self.assertTrue(s.name is None)
+            assert result.name == name
+            tm.assert_numpy_array_equal(result.index.values, s.index.values)
+            assert s.name is None
 
     def test_rename_set_name_inplace(self):
         s = Series(range(3), index=list('abc'))
         for name in ['foo', 123, 123., datetime(2001, 11, 11), ('foo',)]:
             s.rename(name, inplace=True)
-            self.assertEqual(s.name, name)
+            assert s.name == name
 
             exp = np.array(['a', 'b', 'c'], dtype=np.object_)
-            self.assert_numpy_array_equal(s.index.values, exp)
+            tm.assert_numpy_array_equal(s.index.values, exp)
 
     def test_set_name_attribute(self):
         s = Series([1, 2, 3])
         s2 = Series([1, 2, 3], name='bar')
         for name in [7, 7., 'name', datetime(2001, 1, 1), (1,), u"\u05D0"]:
             s.name = name
-            self.assertEqual(s.name, name)
+            assert s.name == name
             s2.name = name
-            self.assertEqual(s2.name, name)
+            assert s2.name == name
 
     def test_set_name(self):
         s = Series([1, 2, 3])
         s2 = s._set_name('foo')
-        self.assertEqual(s2.name, 'foo')
-        self.assertTrue(s.name is None)
-        self.assertTrue(s is not s2)
+        assert s2.name == 'foo'
+        assert s.name is None
+        assert s is not s2
 
     def test_rename_inplace(self):
         renamer = lambda x: x.strftime('%Y%m%d')
         expected = renamer(self.ts.index[0])
 
         self.ts.rename(renamer, inplace=True)
-        self.assertEqual(self.ts.index[0], expected)
+        assert self.ts.index[0] == expected
 
     def test_set_index_makes_timeseries(self):
         idx = tm.makeDateIndex(10)
 
         s = Series(lrange(10))
         s.index = idx
-
-        with tm.assert_produces_warning(FutureWarning):
-            self.assertTrue(s.is_time_series)
-        self.assertTrue(s.index.is_all_dates)
+        assert s.index.is_all_dates
 
     def test_reset_index(self):
         df = tm.makeDataFrame()[:5]
@@ -121,10 +118,10 @@ def test_reset_index(self):
         ser.name = 'value'
         df = ser.reset_index()
-        self.assertIn('value', df)
+        assert 'value' in df
 
         df = ser.reset_index(name='value2')
-        self.assertIn('value2', df)
+        assert 'value2' in df
 
         # check inplace
         s = ser.reset_index(drop=True)
@@ -138,17 +135,17 @@ def test_reset_index(self):
                            [0, 1, 0, 1, 0, 1]])
         s = Series(np.random.randn(6), index=index)
         rs = s.reset_index(level=1)
-        self.assertEqual(len(rs.columns), 2)
+        assert len(rs.columns) == 2
 
         rs = s.reset_index(level=[0, 2], drop=True)
-        self.assert_index_equal(rs.index, Index(index.get_level_values(1)))
-        tm.assertIsInstance(rs, Series)
+        tm.assert_index_equal(rs.index, Index(index.get_level_values(1)))
+        assert isinstance(rs, Series)
 
     def test_reset_index_range(self):
         # GH 12071
         s = pd.Series(range(2), name='A', dtype='int64')
 
         series_result = s.reset_index()
-        tm.assertIsInstance(series_result.index, RangeIndex)
+        assert isinstance(series_result.index, RangeIndex)
         series_expected = pd.DataFrame([[0, 0], [1, 1]],
                                        columns=['index', 'A'],
                                        index=RangeIndex(stop=2))
diff --git a/pandas/tests/series/test_analytics.py b/pandas/tests/series/test_analytics.py
index 26c2220c4811a..ec6a118ec3639 100644
--- a/pandas/tests/series/test_analytics.py
+++ b/pandas/tests/series/test_analytics.py
@@ -4,22 +4,22 @@
 from itertools import product
 from distutils.version import LooseVersion
 
-import nose
+import pytest
 
 from numpy import nan
 import numpy as np
 import pandas as pd
 
-from pandas import (Series, DataFrame, isnull, notnull, bdate_range,
-                    date_range, _np_version_under1p10)
+from pandas import (Series, Categorical, DataFrame, isnull, notnull,
+                    bdate_range, date_range, _np_version_under1p10)
 from pandas.core.index import MultiIndex
-from pandas.tseries.index import Timestamp
-from pandas.tseries.tdi import Timedelta
+from pandas.core.indexes.datetimes import Timestamp
+from pandas.core.indexes.timedeltas import Timedelta
 import pandas.core.config as cf
 import pandas.core.nanops as nanops
 
-from pandas.compat import lrange, range
+from pandas.compat import lrange, range, is_platform_windows
 from pandas import compat
 from pandas.util.testing import (assert_series_equal, assert_almost_equal,
                                  assert_frame_equal, assert_index_equal)
@@ -28,23 +28,25 @@
 from .common import TestData
 
 
-class TestSeriesAnalytics(TestData, tm.TestCase):
+skip_if_bottleneck_on_windows = (is_platform_windows() and
+                                 nanops._USE_BOTTLENECK)
 
-    _multiprocess_can_split_ = True
+
+class TestSeriesAnalytics(TestData):
 
     def test_sum_zero(self):
         arr = np.array([])
-        self.assertEqual(nanops.nansum(arr), 0)
+        assert nanops.nansum(arr) == 0
 
         arr = np.empty((10, 0))
-        self.assertTrue((nanops.nansum(arr, axis=1) == 0).all())
+        assert (nanops.nansum(arr, axis=1) == 0).all()
 
         # GH #844
         s = Series([], index=[])
-        self.assertEqual(s.sum(), 0)
+        assert s.sum() == 0
 
         df = DataFrame(np.empty((10, 0)))
-        self.assertTrue((df.sum(1) == 0).all())
+        assert (df.sum(1) == 0).all()
 
     def test_nansum_buglet(self):
         s = Series([1.0, np.nan], index=[0, 1])
@@ -60,19 +62,11 @@ def test_overflow(self):
 
             # no bottleneck
             result = s.sum(skipna=False)
-            self.assertEqual(int(result), v.sum(dtype='int64'))
+            assert int(result) == v.sum(dtype='int64')
             result = s.min(skipna=False)
-            self.assertEqual(int(result), 0)
+            assert int(result) == 0
             result = s.max(skipna=False)
-            self.assertEqual(int(result), v[-1])
-
-            # use bottleneck if available
-            result = s.sum()
-            self.assertEqual(int(result), v.sum(dtype='int64'))
-            result = s.min()
-            self.assertEqual(int(result), 0)
-            result = s.max()
-            self.assertEqual(int(result), v[-1])
+            assert int(result) == v[-1]
 
         for dtype in ['float32', 'float64']:
             v = np.arange(5000000, dtype=dtype)
@@ -80,20 +74,45 @@ def test_overflow(self):
 
             # no bottleneck
             result = s.sum(skipna=False)
-            self.assertEqual(result, v.sum(dtype=dtype))
+            assert result == v.sum(dtype=dtype)
             result = s.min(skipna=False)
-            self.assertTrue(np.allclose(float(result), 0.0))
+            assert np.allclose(float(result), 0.0)
             result = s.max(skipna=False)
-            self.assertTrue(np.allclose(float(result), v[-1]))
+            assert np.allclose(float(result), v[-1])
+
+    @pytest.mark.xfail(
+        skip_if_bottleneck_on_windows,
+        reason="buggy bottleneck with sum overflow on windows")
+    def test_overflow_with_bottleneck(self):
+        # GH 6915
+        # overflowing on the smaller int dtypes
+        for dtype in ['int32', 'int64']:
+            v = np.arange(5000000, dtype=dtype)
+            s = Series(v)
 
             # use bottleneck if available
             result = s.sum()
-            self.assertEqual(result, v.sum(dtype=dtype))
+            assert int(result) == v.sum(dtype='int64')
             result = s.min()
-            self.assertTrue(np.allclose(float(result), 0.0))
+            assert int(result) == 0
             result = s.max()
-            self.assertTrue(np.allclose(float(result), v[-1]))
+            assert int(result) == v[-1]
 
+        for dtype in ['float32', 'float64']:
+            v = np.arange(5000000, dtype=dtype)
+            s = Series(v)
+
+            # use bottleneck if available
+            result = s.sum()
+            assert result == v.sum(dtype=dtype)
+            result = s.min()
+            assert np.allclose(float(result), 0.0)
+            result = s.max()
+            assert np.allclose(float(result), v[-1])
+
+    @pytest.mark.xfail(
+        skip_if_bottleneck_on_windows,
+        reason="buggy bottleneck with sum overflow on windows")
     def test_sum(self):
         self._check_stat_op('sum', np.sum, check_allna=True)
 
@@ -106,7 +125,7 @@ def test_sum_inf(self):
         s[5:8] = np.inf
         s2[5:8] = np.nan
 
-        self.assertTrue(np.isinf(s.sum()))
+        assert np.isinf(s.sum())
 
         arr = np.random.randn(100, 100).astype('f4')
         arr[:, 2] = np.inf
@@ -115,7 +134,7 @@ def test_sum_inf(self):
         assert_almost_equal(s.sum(), s2.sum())
 
         res = nanops.nansum(arr, axis=1)
-        self.assertTrue(np.isinf(res).all())
+        assert np.isinf(res).all()
 
     def test_mean(self):
         self._check_stat_op('mean', np.mean)
@@ -125,48 +144,103 @@ def test_median(self):
 
         # test with integers, test failure
         int_ts = Series(np.ones(10, dtype=int), index=lrange(10))
-        self.assertAlmostEqual(np.median(int_ts), int_ts.median())
+        tm.assert_almost_equal(np.median(int_ts), int_ts.median())
 
     def test_mode(self):
-        s = Series([12, 12, 11, 10, 19, 11])
-        exp = Series([11, 12])
-        assert_series_equal(s.mode(), exp)
-
-        assert_series_equal(
-            Series([1, 2, 3]).mode(), Series(
-                [], dtype='int64'))
-
-        lst = [5] * 20 + [1] * 10 + [6] * 25
-        np.random.shuffle(lst)
-        s = Series(lst)
-        assert_series_equal(s.mode(), Series([6]))
-
-        s = Series([5] * 10)
-        assert_series_equal(s.mode(), Series([5]))
-
-        s = Series(lst)
-        s[0] = np.nan
-        assert_series_equal(s.mode(), Series([6.]))
-
-        s = Series(list('adfasbasfwewefwefweeeeasdfasnbam'))
-        assert_series_equal(s.mode(), Series(['e']))
-
-        s = Series(['2011-01-03', '2013-01-02', '1900-05-03'], dtype='M8[ns]')
-        assert_series_equal(s.mode(), Series([], dtype="M8[ns]"))
-
-        s = Series(['2011-01-03', '2013-01-02', '1900-05-03', '2011-01-03',
-                    '2013-01-02'], dtype='M8[ns]')
-        assert_series_equal(s.mode(), Series(['2011-01-03', '2013-01-02'],
-                                             dtype='M8[ns]'))
-
-        # GH 5986
-        s = Series(['1 days', '-1 days', '0 days'], dtype='timedelta64[ns]')
-        assert_series_equal(s.mode(), Series([], dtype='timedelta64[ns]'))
+        # No mode should be found.
+        exp = Series([], dtype=np.float64)
+        tm.assert_series_equal(Series([]).mode(), exp)
+
+        exp = Series([1], dtype=np.int64)
+        tm.assert_series_equal(Series([1]).mode(), exp)
+
+        exp = Series(['a', 'b', 'c'], dtype=np.object)
+        tm.assert_series_equal(Series(['a', 'b', 'c']).mode(), exp)
+
+        # Test numerical data types.
+        exp_single = [1]
+        data_single = [1] * 5 + [2] * 3
+
+        exp_multi = [1, 3]
+        data_multi = [1] * 5 + [2] * 3 + [3] * 5
+
+        for dt in np.typecodes['AllInteger'] + np.typecodes['Float']:
+            s = Series(data_single, dtype=dt)
+            exp = Series(exp_single, dtype=dt)
+            tm.assert_series_equal(s.mode(), exp)
+
+            s = Series(data_multi, dtype=dt)
+            exp = Series(exp_multi, dtype=dt)
+            tm.assert_series_equal(s.mode(), exp)
+
+        # Test string and object types.
+        exp = ['b']
+        data = ['a'] * 2 + ['b'] * 3
+
+        s = Series(data, dtype='c')
+        exp = Series(exp, dtype='c')
+        tm.assert_series_equal(s.mode(), exp)
+
+        exp = ['bar']
+        data = ['foo'] * 2 + ['bar'] * 3
+
+        for dt in [str, object]:
+            s = Series(data, dtype=dt)
+            exp = Series(exp, dtype=dt)
+            tm.assert_series_equal(s.mode(), exp)
+
+        # Test datetime types.
+        exp = Series(['1900-05-03', '2011-01-03',
+                      '2013-01-02'], dtype='M8[ns]')
+        s = Series(['2011-01-03', '2013-01-02',
+                    '1900-05-03'], dtype='M8[ns]')
+        tm.assert_series_equal(s.mode(), exp)
+
+        exp = Series(['2011-01-03', '2013-01-02'], dtype='M8[ns]')
+        s = Series(['2011-01-03', '2013-01-02', '1900-05-03',
+                    '2011-01-03', '2013-01-02'], dtype='M8[ns]')
+        tm.assert_series_equal(s.mode(), exp)
+
+        # gh-5986: Test timedelta types.
+        exp = Series(['-1 days', '0 days', '1 days'], dtype='timedelta64[ns]')
+        s = Series(['1 days', '-1 days', '0 days'],
+                   dtype='timedelta64[ns]')
+        tm.assert_series_equal(s.mode(), exp)
 
+        exp = Series(['2 min', '1 day'], dtype='timedelta64[ns]')
         s = Series(['1 day', '1 day', '-1 day', '-1 day 2 min',
-                    '2 min', '2 min'],
-                   dtype='timedelta64[ns]')
-        assert_series_equal(s.mode(), Series(['2 min', '1 day'],
-                                             dtype='timedelta64[ns]'))
+                    '2 min', '2 min'], dtype='timedelta64[ns]')
+        tm.assert_series_equal(s.mode(), exp)
+
+        # Test mixed dtype.
+        exp = Series(['foo'])
+        s = Series([1, 'foo', 'foo'])
+        tm.assert_series_equal(s.mode(), exp)
+
+        # Test for uint64 overflow.
+        exp = Series([2**63], dtype=np.uint64)
+        s = Series([1, 2**63, 2**63], dtype=np.uint64)
+        tm.assert_series_equal(s.mode(), exp)
+
+        exp = Series([1, 2**63], dtype=np.uint64)
+        s = Series([1, 2**63], dtype=np.uint64)
+        tm.assert_series_equal(s.mode(), exp)
+
+        # Test category dtype.
+        c = Categorical([1, 2])
+        exp = Categorical([1, 2], categories=[1, 2])
+        exp = Series(exp, dtype='category')
+        tm.assert_series_equal(Series(c).mode(), exp)
+
+        c = Categorical([1, 'a', 'a'])
+        exp = Categorical(['a'], categories=[1, 'a'])
+        exp = Series(exp, dtype='category')
+        tm.assert_series_equal(Series(c).mode(), exp)
+
+        c = Categorical([1, 1, 2, 3, 3])
+        exp = Categorical([1, 3], categories=[1, 2, 3])
+        exp = Series(exp, dtype='category')
+        tm.assert_series_equal(Series(c).mode(), exp)
 
     def test_prod(self):
         self._check_stat_op('prod', np.prod)
@@ -195,10 +269,10 @@ def test_var_std(self):
         # 1 - element series with ddof=1
         s = self.ts.iloc[[0]]
         result = s.var(ddof=1)
-        self.assertTrue(isnull(result))
+        assert isnull(result)
 
         result = s.std(ddof=1)
-        self.assertTrue(isnull(result))
+        assert isnull(result)
 
     def test_sem(self):
         alt = lambda x: np.std(x, ddof=1) / np.sqrt(len(x))
@@ -212,7 +286,7 @@ def test_sem(self):
         # 1 - element series with ddof=1
         s = self.ts.iloc[[0]]
         result = s.sem(ddof=1)
-        self.assertTrue(isnull(result))
+        assert isnull(result)
 
     def test_skew(self):
         tm._skip_if_no_scipy()
@@ -228,11 +302,11 @@ def test_skew(self):
             s = Series(np.ones(i))
             df = DataFrame(np.ones((i, i)))
             if i < min_N:
-                self.assertTrue(np.isnan(s.skew()))
-                self.assertTrue(np.isnan(df.skew()).all())
+                assert np.isnan(s.skew())
+                assert np.isnan(df.skew()).all()
             else:
-                self.assertEqual(0, s.skew())
-                self.assertTrue((df.skew() == 0).all())
+                assert 0 == s.skew()
+                assert (df.skew() == 0).all()
 
     def test_kurt(self):
         tm._skip_if_no_scipy()
@@ -245,7 +319,7 @@ def test_kurt(self):
                            labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2],
                                    [0, 1, 0, 1, 0, 1]])
         s = Series(np.random.randn(6), index=index)
-        self.assertAlmostEqual(s.kurt(), s.kurt(level=0)['bar'])
+        tm.assert_almost_equal(s.kurt(), s.kurt(level=0)['bar'])
 
         # test corner cases, kurt() returns NaN unless there's at least 4
         # values
@@ -254,11 +328,11 @@ def test_kurt(self):
             s = Series(np.ones(i))
             df = DataFrame(np.ones((i, i)))
             if i < min_N:
-                self.assertTrue(np.isnan(s.kurt()))
-                self.assertTrue(np.isnan(df.kurt()).all())
+                assert np.isnan(s.kurt())
+                assert np.isnan(df.kurt()).all()
             else:
-                self.assertEqual(0, s.kurt())
-                self.assertTrue((df.kurt() == 0).all())
+                assert 0 == s.kurt()
+                assert (df.kurt() == 0).all()
 
     def test_describe(self):
         s = Series([0, 1, 2, 3, 4], name='int_data')
@@ -267,31 +341,31 @@ def test_describe(self):
                           name='int_data',
                           index=['count', 'mean', 'std', 'min', '25%',
                                  '50%', '75%', 'max'])
-        self.assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)
 
         s = Series([True, True, False, False, False], name='bool_data')
         result = s.describe()
         expected = Series([5, 2, False, 3], name='bool_data',
                           index=['count', 'unique', 'top', 'freq'])
-        self.assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)
 
         s = Series(['a', 'a', 'b', 'c', 'd'], name='str_data')
         result = s.describe()
         expected = Series([5, 4, 'a', 2], name='str_data',
                           index=['count', 'unique', 'top', 'freq'])
-        self.assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)
 
     def test_argsort(self):
         self._check_accum_op('argsort', check_dtype=False)
         argsorted = self.ts.argsort()
-        self.assertTrue(issubclass(argsorted.dtype.type, np.integer))
+        assert issubclass(argsorted.dtype.type, np.integer)
 
         # GH 2967 (introduced bug in 0.11-dev I think)
         s = Series([Timestamp('201301%02d' % (i + 1)) for i in range(5)])
-        self.assertEqual(s.dtype, 'datetime64[ns]')
+        assert s.dtype == 'datetime64[ns]'
         shifted = s.shift(-1)
-        self.assertEqual(shifted.dtype, 'datetime64[ns]')
-        self.assertTrue(isnull(shifted[4]))
+        assert shifted.dtype == 'datetime64[ns]'
+        assert isnull(shifted[4])
 
         result = s.argsort()
         expected = Series(lrange(5), dtype='int64')
@@ -309,11 +383,12 @@ def test_argsort_stable(self):
         mexpected = np.argsort(s.values, kind='mergesort')
         qexpected = np.argsort(s.values, kind='quicksort')
 
-        self.assert_series_equal(mindexer, Series(mexpected),
-                                 check_dtype=False)
-        self.assert_series_equal(qindexer, Series(qexpected),
-                                 check_dtype=False)
-        self.assertFalse(np.array_equal(qindexer, mindexer))
+        tm.assert_series_equal(mindexer, Series(mexpected),
+                               check_dtype=False)
+        tm.assert_series_equal(qindexer, Series(qexpected),
+                               check_dtype=False)
+        pytest.raises(AssertionError, tm.assert_numpy_array_equal,
+                      qindexer, mindexer)
 
     def test_cumsum(self):
         self._check_accum_op('cumsum')
@@ -322,24 +397,24 @@ def test_cumprod(self):
         self._check_accum_op('cumprod')
 
     def test_cummin(self):
-        self.assert_numpy_array_equal(self.ts.cummin().values,
-                                      np.minimum.accumulate(np.array(self.ts)))
+        tm.assert_numpy_array_equal(self.ts.cummin().values,
+                                    np.minimum.accumulate(np.array(self.ts)))
         ts = self.ts.copy()
         ts[::2] = np.NaN
         result = ts.cummin()[1::2]
         expected = np.minimum.accumulate(ts.valid())
 
-        self.assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)
 
     def test_cummax(self):
-        self.assert_numpy_array_equal(self.ts.cummax().values,
-                                      np.maximum.accumulate(np.array(self.ts)))
+        tm.assert_numpy_array_equal(self.ts.cummax().values,
+                                    np.maximum.accumulate(np.array(self.ts)))
         ts = self.ts.copy()
        ts[::2] = np.NaN
         result = ts.cummax()[1::2]
         expected = np.maximum.accumulate(ts.valid())
 
-        self.assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)
 
     def test_cummin_datetime64(self):
         s = pd.Series(pd.to_datetime(['NaT', '2000-1-2', 'NaT', '2000-1-1',
@@ -348,13 +423,13 @@ def test_cummin_datetime64(self):
         expected = pd.Series(pd.to_datetime(['NaT', '2000-1-2', 'NaT',
                                              '2000-1-1', 'NaT', '2000-1-1']))
         result = s.cummin(skipna=True)
-        self.assert_series_equal(expected, result)
+        tm.assert_series_equal(expected, result)
 
         expected = pd.Series(pd.to_datetime(
             ['NaT', '2000-1-2', '2000-1-2', '2000-1-1', '2000-1-1', '2000-1-1'
             ]))
         result = s.cummin(skipna=False)
-        self.assert_series_equal(expected, result)
+        tm.assert_series_equal(expected, result)
 
     def test_cummax_datetime64(self):
         s = pd.Series(pd.to_datetime(['NaT', '2000-1-2', 'NaT', '2000-1-1',
@@ -363,13 +438,13 @@ def test_cummax_datetime64(self):
         expected = pd.Series(pd.to_datetime(['NaT', '2000-1-2', 'NaT',
                                              '2000-1-2', 'NaT', '2000-1-3']))
         result = s.cummax(skipna=True)
-        self.assert_series_equal(expected, result)
+        tm.assert_series_equal(expected, result)
 
         expected = pd.Series(pd.to_datetime(
             ['NaT', '2000-1-2', '2000-1-2', '2000-1-2', '2000-1-2', '2000-1-3'
            ]))
         result = s.cummax(skipna=False)
-        self.assert_series_equal(expected, result)
+        tm.assert_series_equal(expected, result)
 
     def test_cummin_timedelta64(self):
         s = pd.Series(pd.to_timedelta(['NaT',
@@ -386,7 +461,7 @@ def test_cummin_timedelta64(self):
                                               'NaT',
                                               '1 min', ]))
         result = s.cummin(skipna=True)
-        self.assert_series_equal(expected, result)
+        tm.assert_series_equal(expected, result)
 
         expected = pd.Series(pd.to_timedelta(['NaT',
                                               '2 min',
@@ -395,7 +470,7 @@ def test_cummin_timedelta64(self):
                                               '1 min',
                                               '1 min', ]))
         result = s.cummin(skipna=False)
-        self.assert_series_equal(expected, result)
+        tm.assert_series_equal(expected, result)
 
     def test_cummax_timedelta64(self):
         s = pd.Series(pd.to_timedelta(['NaT',
@@ -412,7 +487,7 @@ def test_cummax_timedelta64(self):
                                       'NaT',
                                       '3 min', ]))
         result = s.cummax(skipna=True)
-        self.assert_series_equal(expected, result)
+        tm.assert_series_equal(expected, result)
 
         expected = pd.Series(pd.to_timedelta(['NaT',
                                               '2 min',
@@ -421,11 +496,11 @@ def test_cummax_timedelta64(self):
                                               '2 min',
                                               '3 min', ]))
         result = s.cummax(skipna=False)
-        self.assert_series_equal(expected, result)
+        tm.assert_series_equal(expected, result)
 
     def test_npdiff(self):
-        raise nose.SkipTest("skipping due to Series no longer being an "
-                            "ndarray")
+        pytest.skip("skipping due to Series no longer being an "
+                    "ndarray")
 
         # no longer works as the return type of np.diff is now nd.array
         s = Series(np.arange(5))
@@ -446,11 +521,11 @@ def testit():
             # idxmax, idxmin, min, and max are valid for dates
             if name not in ['max', 'min']:
                 ds = Series(date_range('1/1/2001', periods=10))
-                self.assertRaises(TypeError, f, ds)
+                pytest.raises(TypeError, f, ds)
 
             # skipna or no
-            self.assertTrue(notnull(f(self.series)))
-            self.assertTrue(isnull(f(self.series, skipna=False)))
+            assert notnull(f(self.series))
+            assert isnull(f(self.series, skipna=False))
 
             # check the result is correct
             nona = self.series.dropna()
@@ -463,12 +538,12 @@ def testit():
                 # xref 9422
                 # bottleneck >= 1.0 give 0.0 for an allna Series sum
                 try:
-                    self.assertTrue(nanops._USE_BOTTLENECK)
+                    assert nanops._USE_BOTTLENECK
                     import bottleneck as bn  # noqa
-                    self.assertTrue(bn.__version__ >= LooseVersion('1.0'))
-                    self.assertEqual(f(allna), 0.0)
+                    assert bn.__version__ >= LooseVersion('1.0')
+                    assert f(allna) == 0.0
                 except:
-                    self.assertTrue(np.isnan(f(allna)))
+                    assert np.isnan(f(allna))
 
             # dtype=object with None, it works!
             s = Series([1, 2, 3, None, 5])
@@ -485,19 +560,19 @@ def testit():
 
             s = Series(bdate_range('1/1/2000', periods=10))
             res = f(s)
             exp = alternate(s)
-            self.assertEqual(res, exp)
+            assert res == exp
 
             # check on string data
             if name not in ['sum', 'min', 'max']:
-                self.assertRaises(TypeError, f, Series(list('abc')))
+                pytest.raises(TypeError, f, Series(list('abc')))
 
             # Invalid axis.
-            self.assertRaises(ValueError, f, self.series, axis=1)
+            pytest.raises(ValueError, f, self.series, axis=1)
 
             # Unimplemented numeric_only parameter.
             if 'numeric_only' in compat.signature(f).args:
-                self.assertRaisesRegexp(NotImplementedError, name, f,
-                                        self.series, numeric_only=True)
+                tm.assert_raises_regex(NotImplementedError, name, f,
+                                       self.series, numeric_only=True)
 
         testit()
 
@@ -511,9 +586,9 @@ def testit():
 
     def _check_accum_op(self, name, check_dtype=True):
         func = getattr(np, name)
-        self.assert_numpy_array_equal(func(self.ts).values,
-                                      func(np.array(self.ts)),
-                                      check_dtype=check_dtype)
+        tm.assert_numpy_array_equal(func(self.ts).values,
+                                    func(np.array(self.ts)),
+                                    check_dtype=check_dtype)
 
         # with missing values
         ts = self.ts.copy()
@@ -522,8 +597,8 @@ def _check_accum_op(self, name, check_dtype=True):
 
         result = func(ts)[1::2]
         expected = func(np.array(ts.valid()))
-        self.assert_numpy_array_equal(result.values, expected,
-                                      check_dtype=False)
+        tm.assert_numpy_array_equal(result.values, expected,
+                                    check_dtype=False)
 
     def test_compress(self):
         cond = [True, False, True, False, False]
@@ -542,12 +617,12 @@ def test_numpy_compress(self):
         tm.assert_series_equal(np.compress(cond, s), expected)
 
         msg = "the 'axis' parameter is not supported"
-        tm.assertRaisesRegexp(ValueError, msg, np.compress,
-                              cond, s, axis=1)
+        tm.assert_raises_regex(ValueError, msg, np.compress,
+                               cond, s, axis=1)
 
         msg = "the 'out' parameter is not supported"
-        tm.assertRaisesRegexp(ValueError, msg, np.compress,
-                              cond, s, out=s)
+        tm.assert_raises_regex(ValueError, msg, np.compress,
+                               cond, s, out=s)
 
     def test_round(self):
         self.ts.index.name = "index_name"
@@ -555,7 +630,7 @@ def test_round(self):
         expected = Series(np.round(self.ts.values, 2),
                           index=self.ts.index, name='ts')
         assert_series_equal(result, expected)
-        self.assertEqual(result.name, self.ts.name)
+        assert result.name == self.ts.name
 
     def test_numpy_round(self):
         # See gh-12600
@@ -565,47 +640,48 @@ def test_numpy_round(self):
         assert_series_equal(out, expected)
 
         msg = "the 'out' parameter is not supported"
-        with tm.assertRaisesRegexp(ValueError, msg):
+        with tm.assert_raises_regex(ValueError, msg):
             np.round(s, decimals=0, out=s)
 
     def test_built_in_round(self):
         if not compat.PY3:
-            raise nose.SkipTest(
+            pytest.skip(
                 'build in round cannot be overriden prior to Python 3')
 
         s = Series([1.123, 2.123, 3.123], index=lrange(3))
         result = round(s)
         expected_rounded0 = Series([1., 2., 3.], index=lrange(3))
-        self.assert_series_equal(result, expected_rounded0)
+        tm.assert_series_equal(result, expected_rounded0)
 
         decimals = 2
         expected_rounded = Series([1.12, 2.12, 3.12], index=lrange(3))
         result = round(s, decimals)
-        self.assert_series_equal(result, expected_rounded)
+        tm.assert_series_equal(result, expected_rounded)
 
     def test_prod_numpy16_bug(self):
         s = Series([1., 1., 1.], index=lrange(3))
         result = s.prod()
-        self.assertNotIsInstance(result, Series)
+
+        assert not isinstance(result, Series)
 
     def test_all_any(self):
         ts = tm.makeTimeSeries()
         bool_series = ts > 0
-        self.assertFalse(bool_series.all())
-        self.assertTrue(bool_series.any())
+        assert not bool_series.all()
+        assert bool_series.any()
 
         # Alternative types, with implicit 'object' dtype.
         s = Series(['abc', True])
-        self.assertEqual('abc', s.any())  # 'abc' || True => 'abc'
+        assert 'abc' == s.any()  # 'abc' || True => 'abc'
 
     def test_all_any_params(self):
         # Check skipna, with implicit 'object' dtype.
         s1 = Series([np.nan, True])
         s2 = Series([np.nan, False])
-        self.assertTrue(s1.all(skipna=False))  # nan && True => True
-        self.assertTrue(s1.all(skipna=True))
-        self.assertTrue(np.isnan(s2.any(skipna=False)))  # nan || False => nan
-        self.assertFalse(s2.any(skipna=True))
+        assert s1.all(skipna=False)  # nan && True => True
+        assert s1.all(skipna=True)
+        assert np.isnan(s2.any(skipna=False))  # nan || False => nan
+        assert not s2.any(skipna=True)
 
         # Check level.
         s = pd.Series([False, False, True, True, False, True],
@@ -614,12 +690,12 @@ def test_all_any_params(self):
         assert_series_equal(s.any(level=0), Series([False, True, True]))
 
         # bool_only is not implemented with level option.
-        self.assertRaises(NotImplementedError, s.any, bool_only=True, level=0)
-        self.assertRaises(NotImplementedError, s.all, bool_only=True, level=0)
+        pytest.raises(NotImplementedError, s.any, bool_only=True, level=0)
+        pytest.raises(NotImplementedError, s.all, bool_only=True, level=0)
 
         # bool_only is not implemented alone.
-        self.assertRaises(NotImplementedError, s.any, bool_only=True)
-        self.assertRaises(NotImplementedError, s.all, bool_only=True)
+        pytest.raises(NotImplementedError, s.any, bool_only=True)
+        pytest.raises(NotImplementedError, s.all, bool_only=True)
 
     def test_modulo(self):
         with np.errstate(all='ignore'):
@@ -644,7 +720,7 @@ def test_modulo(self):
             p = p.astype('float64')
             result = p['first'] % p['second']
             result2 = p['second'] % p['first']
-            self.assertFalse(np.array_equal(result, result2))
+            assert not np.array_equal(result, result2)
 
             # GH 9144
             s = Series([0, 1])
@@ -664,23 +740,23 @@ def test_ops_consistency_on_empty(self):
 
         # float
         result = Series(dtype=float).sum()
-        self.assertEqual(result, 0)
+        assert result == 0
 
         result = Series(dtype=float).mean()
-        self.assertTrue(isnull(result))
+        assert isnull(result)
 
         result = Series(dtype=float).median()
-        self.assertTrue(isnull(result))
+        assert isnull(result)
 
         # timedelta64[ns]
         result = Series(dtype='m8[ns]').sum()
-        self.assertEqual(result, Timedelta(0))
+        assert result == Timedelta(0)
 
         result = Series(dtype='m8[ns]').mean()
-        self.assertTrue(result is pd.NaT)
+        assert result is pd.NaT
 
         result = Series(dtype='m8[ns]').median()
-        self.assertTrue(result is pd.NaT)
+        assert result is pd.NaT
 
     def test_corr(self):
         tm._skip_if_no_scipy()
@@ -688,30 +764,30 @@ def test_corr(self):
         import scipy.stats as stats
 
         # full overlap
-        self.assertAlmostEqual(self.ts.corr(self.ts), 1)
+        tm.assert_almost_equal(self.ts.corr(self.ts), 1)
 
         # partial overlap
-        self.assertAlmostEqual(self.ts[:15].corr(self.ts[5:]), 1)
+        tm.assert_almost_equal(self.ts[:15].corr(self.ts[5:]), 1)
 
-        self.assertTrue(isnull(self.ts[:15].corr(self.ts[5:], min_periods=12)))
+        assert isnull(self.ts[:15].corr(self.ts[5:], min_periods=12))
 
         ts1 = self.ts[:15].reindex(self.ts.index)
         ts2 = self.ts[5:].reindex(self.ts.index)
-        self.assertTrue(isnull(ts1.corr(ts2, min_periods=12)))
+        assert isnull(ts1.corr(ts2, min_periods=12))
 
         # No overlap
-        self.assertTrue(np.isnan(self.ts[::2].corr(self.ts[1::2])))
+        assert np.isnan(self.ts[::2].corr(self.ts[1::2]))
 
         # all NA
         cp = self.ts[:10].copy()
         cp[:] = np.nan
-        self.assertTrue(isnull(cp.corr(cp)))
+        assert isnull(cp.corr(cp))
 
         A = tm.makeTimeSeries()
         B = tm.makeTimeSeries()
         result = A.corr(B)
         expected, _ = stats.pearsonr(A, B)
-        self.assertAlmostEqual(result, expected)
+        tm.assert_almost_equal(result, expected)
 
     def test_corr_rank(self):
         tm._skip_if_no_scipy()
@@ -725,16 +801,16 @@ def test_corr_rank(self):
         A[-5:] = A[:5]
         result = A.corr(B, method='kendall')
         expected = stats.kendalltau(A, B)[0]
-        self.assertAlmostEqual(result, expected)
+        tm.assert_almost_equal(result, expected)
 
         result = A.corr(B, method='spearman')
         expected = stats.spearmanr(A, B)[0]
-        self.assertAlmostEqual(result, expected)
+        tm.assert_almost_equal(result, expected)
 
         # these methods got rewritten in 0.8
         if scipy.__version__ < LooseVersion('0.9'):
-            raise nose.SkipTest("skipping corr rank because of scipy version "
-                                "{0}".format(scipy.__version__))
+            pytest.skip("skipping corr rank because of scipy version "
+                        "{0}".format(scipy.__version__))
 
         # results from R
         A = Series(
@@ -745,38 +821,38 @@ def test_corr_rank(self):
             1.17258718, -1.06009347, -0.10222060, -0.89076239, 0.89372375])
         kexp = 0.4319297
         sexp = 0.5853767
-        self.assertAlmostEqual(A.corr(B, method='kendall'), kexp)
-        self.assertAlmostEqual(A.corr(B, method='spearman'), sexp)
+        tm.assert_almost_equal(A.corr(B, method='kendall'), kexp)
+        tm.assert_almost_equal(A.corr(B, method='spearman'), sexp)
 
     def test_cov(self):
         # full overlap
-        self.assertAlmostEqual(self.ts.cov(self.ts), self.ts.std() ** 2)
+        tm.assert_almost_equal(self.ts.cov(self.ts), self.ts.std() ** 2)
 
         # partial overlap
-        self.assertAlmostEqual(self.ts[:15].cov(self.ts[5:]),
-                               self.ts[5:15].std() ** 2)
+        tm.assert_almost_equal(self.ts[:15].cov(self.ts[5:]),
+                               self.ts[5:15].std() ** 2)
 
         # No overlap
-        self.assertTrue(np.isnan(self.ts[::2].cov(self.ts[1::2])))
+        assert np.isnan(self.ts[::2].cov(self.ts[1::2]))
 
         # all NA
         cp = self.ts[:10].copy()
         cp[:] = np.nan
-        self.assertTrue(isnull(cp.cov(cp)))
+        assert isnull(cp.cov(cp))
 
         # min_periods
-        self.assertTrue(isnull(self.ts[:15].cov(self.ts[5:], min_periods=12)))
+        assert isnull(self.ts[:15].cov(self.ts[5:], min_periods=12))
 
         ts1 = self.ts[:15].reindex(self.ts.index)
         ts2 = self.ts[5:].reindex(self.ts.index)
-        self.assertTrue(isnull(ts1.cov(ts2, min_periods=12)))
+        assert isnull(ts1.cov(ts2, min_periods=12))
 
     def test_count(self):
-        self.assertEqual(self.ts.count(), len(self.ts))
+        assert self.ts.count() == len(self.ts)
 
         self.ts[::2] = np.NaN
 
-        self.assertEqual(self.ts.count(), np.isfinite(self.ts).sum())
+        assert self.ts.count() == np.isfinite(self.ts).sum()
 
         mi = MultiIndex.from_arrays([list('aabbcc'), [1, 2, 2, nan, 1, 2]])
         ts = Series(np.arange(len(mi)), index=mi)
@@ -804,15 +880,15 @@ def test_dot(self):
 
         # Check ndarray argument
         result = a.dot(b.values)
-        self.assertTrue(np.all(result == expected.values))
+        assert np.all(result == expected.values)
         assert_almost_equal(a.dot(b['2'].values), expected['2'])
 
         # Check series argument
         assert_almost_equal(a.dot(b['1']), expected['1'])
         assert_almost_equal(a.dot(b2['1']), expected['1'])
 
-        self.assertRaises(Exception, a.dot, a.values[:3])
-        self.assertRaises(ValueError, a.dot, b.T)
+        pytest.raises(Exception, a.dot, a.values[:3])
+        pytest.raises(ValueError, a.dot, b.T)
 
     def test_value_counts_nunique(self):
 
@@ -821,7 +897,7 @@ def test_value_counts_nunique(self):
         series[20:500] = np.nan
         series[10:20] = 5000
         result = series.nunique()
-        self.assertEqual(result, 11)
+        assert result == 11
 
     def test_unique(self):
 
@@ -829,24 +905,24 @@ def test_unique(self):
         s = Series([1.2345] * 100)
         s[::2] = np.nan
         result = s.unique()
-        self.assertEqual(len(result), 2)
+        assert len(result) == 2
 
         s = Series([1.2345] * 100, dtype='f4')
         s[::2] = np.nan
         result = s.unique()
-        self.assertEqual(len(result), 2)
+        assert len(result) == 2
 
         # NAs in object arrays #714
         s = Series(['foo'] * 100, dtype='O')
         s[::2] = np.nan
         result = s.unique()
-        self.assertEqual(len(result), 2)
+        assert len(result) == 2
 
         # decision about None
         s = Series([1, 2, 3, None, None, None], dtype=object)
         result = s.unique()
         expected = np.array([1, 2, 3, None], dtype=object)
-        self.assert_numpy_array_equal(result, expected)
+        tm.assert_numpy_array_equal(result, expected)
 
     def test_drop_duplicates(self):
         # check both int and object
@@ -865,17 +941,6 @@ def test_drop_duplicates(self):
         sc.drop_duplicates(keep='last', inplace=True)
         assert_series_equal(sc, s[~expected])
 
-        # deprecate take_last
-        with tm.assert_produces_warning(FutureWarning):
-            assert_series_equal(s.duplicated(take_last=True), expected)
-        with tm.assert_produces_warning(FutureWarning):
-            assert_series_equal(
-                s.drop_duplicates(take_last=True), s[~expected])
-        sc = s.copy()
-        with tm.assert_produces_warning(FutureWarning):
-            sc.drop_duplicates(take_last=True, inplace=True)
-        assert_series_equal(sc, s[~expected])
-
         expected = Series([False, False, True, True])
         assert_series_equal(s.duplicated(keep=False), expected)
         assert_series_equal(s.drop_duplicates(keep=False), s[~expected])
@@ -899,17 +964,6 @@ def test_drop_duplicates(self):
         sc.drop_duplicates(keep='last', inplace=True)
         assert_series_equal(sc, s[~expected])
 
-        # deprecate take_last
-        with tm.assert_produces_warning(FutureWarning):
-            assert_series_equal(s.duplicated(take_last=True), expected)
-        with tm.assert_produces_warning(FutureWarning):
-            assert_series_equal(
-                s.drop_duplicates(take_last=True), s[~expected])
-        sc = s.copy()
-        with tm.assert_produces_warning(FutureWarning):
-            sc.drop_duplicates(take_last=True, inplace=True)
-        assert_series_equal(sc, s[~expected])
-
         expected = Series([False, True, True, False, True, True, False])
         assert_series_equal(s.duplicated(keep=False), expected)
         assert_series_equal(s.drop_duplicates(keep=False), s[~expected])
@@ -917,125 +971,19 @@ def test_drop_duplicates(self):
         sc.drop_duplicates(keep=False, inplace=True)
         assert_series_equal(sc, s[~expected])
 
-    def test_rank(self):
-        tm._skip_if_no_scipy()
-        from scipy.stats import rankdata
-
-        self.ts[::2] = np.nan
-        self.ts[:10][::3] = 4.
- - ranks = self.ts.rank() - oranks = self.ts.astype('O').rank() - - assert_series_equal(ranks, oranks) - - mask = np.isnan(self.ts) - filled = self.ts.fillna(np.inf) - - # rankdata returns a ndarray - exp = Series(rankdata(filled), index=filled.index, name='ts') - exp[mask] = np.nan - - tm.assert_series_equal(ranks, exp) - - iseries = Series(np.arange(5).repeat(2)) - - iranks = iseries.rank() - exp = iseries.astype(float).rank() - assert_series_equal(iranks, exp) - iseries = Series(np.arange(5)) + 1.0 - exp = iseries / 5.0 - iranks = iseries.rank(pct=True) - - assert_series_equal(iranks, exp) - - iseries = Series(np.repeat(1, 100)) - exp = Series(np.repeat(0.505, 100)) - iranks = iseries.rank(pct=True) - assert_series_equal(iranks, exp) - - iseries[1] = np.nan - exp = Series(np.repeat(50.0 / 99.0, 100)) - exp[1] = np.nan - iranks = iseries.rank(pct=True) - assert_series_equal(iranks, exp) - - iseries = Series(np.arange(5)) + 1.0 - iseries[4] = np.nan - exp = iseries / 4.0 - iranks = iseries.rank(pct=True) - assert_series_equal(iranks, exp) - - iseries = Series(np.repeat(np.nan, 100)) - exp = iseries.copy() - iranks = iseries.rank(pct=True) - assert_series_equal(iranks, exp) - - iseries = Series(np.arange(5)) + 1 - iseries[4] = np.nan - exp = iseries / 4.0 - iranks = iseries.rank(pct=True) - assert_series_equal(iranks, exp) - - rng = date_range('1/1/1990', periods=5) - iseries = Series(np.arange(5), rng) + 1 - iseries.ix[4] = np.nan - exp = iseries / 4.0 - iranks = iseries.rank(pct=True) - assert_series_equal(iranks, exp) - - iseries = Series([1e-50, 1e-100, 1e-20, 1e-2, 1e-20 + 1e-30, 1e-1]) - exp = Series([2, 1, 3, 5, 4, 6.0]) - iranks = iseries.rank() - assert_series_equal(iranks, exp) - - # GH 5968 - iseries = Series(['3 day', '1 day 10m', '-2 day', pd.NaT], - dtype='m8[ns]') - exp = Series([3, 2, 1, np.nan]) - iranks = iseries.rank() - assert_series_equal(iranks, exp) - - values = np.array( - [-50, -1, -1e-20, -1e-25, -1e-50, 0, 1e-40, 1e-20, 1e-10, 2, 40 - ], dtype='float64') - random_order = np.random.permutation(len(values)) - iseries = Series(values[random_order]) - exp = Series(random_order + 1.0, dtype='float64') - iranks = iseries.rank() - assert_series_equal(iranks, exp) - - def test_rank_signature(self): - s = Series([0, 1]) - s.rank(method='average') - self.assertRaises(ValueError, s.rank, 'average') - - def test_rank_inf(self): - raise nose.SkipTest('DataFrame.rank does not currently rank ' - 'np.inf and -np.inf properly') - - values = np.array( - [-np.inf, -50, -1, -1e-20, -1e-25, -1e-50, 0, 1e-40, 1e-20, 1e-10, - 2, 40, np.inf], dtype='float64') - random_order = np.random.permutation(len(values)) - iseries = Series(values[random_order]) - exp = Series(random_order + 1.0, dtype='float64') - iranks = iseries.rank() - assert_series_equal(iranks, exp) - def test_clip(self): val = self.ts.median() - self.assertEqual(self.ts.clip_lower(val).min(), val) - self.assertEqual(self.ts.clip_upper(val).max(), val) + assert self.ts.clip_lower(val).min() == val + assert self.ts.clip_upper(val).max() == val - self.assertEqual(self.ts.clip(lower=val).min(), val) - self.assertEqual(self.ts.clip(upper=val).max(), val) + assert self.ts.clip(lower=val).min() == val + assert self.ts.clip(upper=val).max() == val result = self.ts.clip(-0.5, 0.5) expected = np.clip(self.ts, -0.5, 0.5) assert_series_equal(result, expected) - tm.assertIsInstance(expected, Series) + assert isinstance(expected, Series) def test_clip_types_and_nulls(self): @@ -1047,10 +995,10 @@ def test_clip_types_and_nulls(self): 
thresh = s[2] l = s.clip_lower(thresh) u = s.clip_upper(thresh) - self.assertEqual(l[notnull(l)].min(), thresh) - self.assertEqual(u[notnull(u)].max(), thresh) - self.assertEqual(list(isnull(s)), list(isnull(l))) - self.assertEqual(list(isnull(s)), list(isnull(u))) + assert l[notnull(l)].min() == thresh + assert u[notnull(u)].max() == thresh + assert list(isnull(s)) == list(isnull(l)) + assert list(isnull(s)) == list(isnull(u)) def test_clip_against_series(self): # GH #6966 @@ -1134,10 +1082,10 @@ def test_isin(self): def test_isin_with_string_scalar(self): # GH4763 s = Series(['A', 'B', 'C', 'a', 'B', 'B', 'A', 'C']) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): s.isin('a') - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): s = Series(['aaa', 'b', 'c']) s.isin('aaa') @@ -1182,20 +1130,20 @@ def test_timedelta64_analytics(self): Timestamp('20120101') result = td.idxmin() - self.assertEqual(result, 0) + assert result == 0 result = td.idxmax() - self.assertEqual(result, 2) + assert result == 2 # GH 2982 # with NaT td[0] = np.nan result = td.idxmin() - self.assertEqual(result, 1) + assert result == 1 result = td.idxmax() - self.assertEqual(result, 2) + assert result == 2 # abs s1 = Series(date_range('20120101', periods=3)) @@ -1212,11 +1160,11 @@ def test_timedelta64_analytics(self): # max/min result = td.max() expected = Timedelta('2 days') - self.assertEqual(result, expected) + assert result == expected result = td.min() expected = Timedelta('1 days') - self.assertEqual(result, expected) + assert result == expected def test_idxmin(self): # test idxmin @@ -1226,39 +1174,39 @@ def test_idxmin(self): self.series[5:15] = np.NaN # skipna or no - self.assertEqual(self.series[self.series.idxmin()], self.series.min()) - self.assertTrue(isnull(self.series.idxmin(skipna=False))) + assert self.series[self.series.idxmin()] == self.series.min() + assert isnull(self.series.idxmin(skipna=False)) # no NaNs nona = self.series.dropna() - self.assertEqual(nona[nona.idxmin()], nona.min()) - self.assertEqual(nona.index.values.tolist().index(nona.idxmin()), - nona.values.argmin()) + assert nona[nona.idxmin()] == nona.min() + assert (nona.index.values.tolist().index(nona.idxmin()) == + nona.values.argmin()) # all NaNs allna = self.series * nan - self.assertTrue(isnull(allna.idxmin())) + assert isnull(allna.idxmin()) # datetime64[ns] from pandas import date_range s = Series(date_range('20130102', periods=6)) result = s.idxmin() - self.assertEqual(result, 0) + assert result == 0 s[0] = np.nan result = s.idxmin() - self.assertEqual(result, 1) + assert result == 1 def test_numpy_argmin(self): # argmin is aliased to idxmin data = np.random.randint(0, 11, size=10) result = np.argmin(Series(data)) - self.assertEqual(result, np.argmin(data)) + assert result == np.argmin(data) if not _np_version_under1p10: msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.argmin, - Series(data), out=data) + tm.assert_raises_regex(ValueError, msg, np.argmin, + Series(data), out=data) def test_idxmax(self): # test idxmax @@ -1268,89 +1216,89 @@ def test_idxmax(self): self.series[5:15] = np.NaN # skipna or no - self.assertEqual(self.series[self.series.idxmax()], self.series.max()) - self.assertTrue(isnull(self.series.idxmax(skipna=False))) + assert self.series[self.series.idxmax()] == self.series.max() + assert isnull(self.series.idxmax(skipna=False)) # no NaNs nona = self.series.dropna() - self.assertEqual(nona[nona.idxmax()], nona.max()) - 
self.assertEqual(nona.index.values.tolist().index(nona.idxmax()), - nona.values.argmax()) + assert nona[nona.idxmax()] == nona.max() + assert (nona.index.values.tolist().index(nona.idxmax()) == + nona.values.argmax()) # all NaNs allna = self.series * nan - self.assertTrue(isnull(allna.idxmax())) + assert isnull(allna.idxmax()) from pandas import date_range s = Series(date_range('20130102', periods=6)) result = s.idxmax() - self.assertEqual(result, 5) + assert result == 5 s[5] = np.nan result = s.idxmax() - self.assertEqual(result, 4) + assert result == 4 # Float64Index # GH 5914 s = pd.Series([1, 2, 3], [1.1, 2.1, 3.1]) result = s.idxmax() - self.assertEqual(result, 3.1) + assert result == 3.1 result = s.idxmin() - self.assertEqual(result, 1.1) + assert result == 1.1 s = pd.Series(s.index, s.index) result = s.idxmax() - self.assertEqual(result, 3.1) + assert result == 3.1 result = s.idxmin() - self.assertEqual(result, 1.1) + assert result == 1.1 def test_numpy_argmax(self): # argmax is aliased to idxmax data = np.random.randint(0, 11, size=10) result = np.argmax(Series(data)) - self.assertEqual(result, np.argmax(data)) + assert result == np.argmax(data) if not _np_version_under1p10: msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.argmax, - Series(data), out=data) + tm.assert_raises_regex(ValueError, msg, np.argmax, + Series(data), out=data) def test_ptp(self): N = 1000 arr = np.random.randn(N) ser = Series(arr) - self.assertEqual(np.ptp(ser), np.ptp(arr)) + assert np.ptp(ser) == np.ptp(arr) # GH11163 s = Series([3, 5, np.nan, -3, 10]) - self.assertEqual(s.ptp(), 13) - self.assertTrue(pd.isnull(s.ptp(skipna=False))) + assert s.ptp() == 13 + assert pd.isnull(s.ptp(skipna=False)) mi = pd.MultiIndex.from_product([['a', 'b'], [1, 2, 3]]) s = pd.Series([1, np.nan, 7, 3, 5, np.nan], index=mi) expected = pd.Series([6, 2], index=['a', 'b'], dtype=np.float64) - self.assert_series_equal(s.ptp(level=0), expected) + tm.assert_series_equal(s.ptp(level=0), expected) expected = pd.Series([np.nan, np.nan], index=['a', 'b']) - self.assert_series_equal(s.ptp(level=0, skipna=False), expected) + tm.assert_series_equal(s.ptp(level=0, skipna=False), expected) - with self.assertRaises(ValueError): + with pytest.raises(ValueError): s.ptp(axis=1) s = pd.Series(['a', 'b', 'c', 'd', 'e']) - with self.assertRaises(TypeError): + with pytest.raises(TypeError): s.ptp() - with self.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): s.ptp(numeric_only=True) def test_empty_timeseries_reductions_return_nat(self): # covers #11245 for dtype in ('m8[ns]', 'm8[ns]', 'M8[ns]', 'M8[ns, UTC]'): - self.assertIs(Series([], dtype=dtype).min(), pd.NaT) - self.assertIs(Series([], dtype=dtype).max(), pd.NaT) + assert Series([], dtype=dtype).min() is pd.NaT + assert Series([], dtype=dtype).max() is pd.NaT def test_unique_data_ownership(self): # it works!
#1807 @@ -1363,6 +1311,10 @@ def test_repeat(self): exp = Series(s.values.repeat(5), index=s.index.values.repeat(5)) assert_series_equal(reps, exp) + with tm.assert_produces_warning(FutureWarning): + result = s.repeat(reps=5) + assert_series_equal(result, exp) + to_rep = [2, 3, 4] reps = s.repeat(to_rep) exp = Series(s.values.repeat(to_rep), @@ -1376,13 +1328,26 @@ def test_numpy_repeat(self): assert_series_equal(np.repeat(s, 2), expected) msg = "the 'axis' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.repeat, s, 2, axis=0) + tm.assert_raises_regex(ValueError, msg, np.repeat, s, 2, axis=0) + + def test_searchsorted(self): + s = Series([1, 2, 3]) + + idx = s.searchsorted(1, side='left') + tm.assert_numpy_array_equal(idx, np.array([0], dtype=np.intp)) + + idx = s.searchsorted(1, side='right') + tm.assert_numpy_array_equal(idx, np.array([1], dtype=np.intp)) + + with tm.assert_produces_warning(FutureWarning): + idx = s.searchsorted(v=1, side='left') + tm.assert_numpy_array_equal(idx, np.array([0], dtype=np.intp)) def test_searchsorted_numeric_dtypes_scalar(self): s = Series([1, 2, 90, 1000, 3e9]) r = s.searchsorted(30) e = 2 - self.assertEqual(r, e) + assert r == e r = s.searchsorted([30]) e = np.array([2], dtype=np.intp) @@ -1399,7 +1364,7 @@ def test_search_sorted_datetime64_scalar(self): v = pd.Timestamp('20120102') r = s.searchsorted(v) e = 1 - self.assertEqual(r, e) + assert r == e def test_search_sorted_datetime64_list(self): s = Series(pd.date_range('20120101', periods=10, freq='2D')) @@ -1418,127 +1383,42 @@ def test_searchsorted_sorter(self): def test_is_unique(self): # GH11946 s = Series(np.random.randint(0, 10, size=1000)) - self.assertFalse(s.is_unique) + assert not s.is_unique s = Series(np.arange(1000)) - self.assertTrue(s.is_unique) + assert s.is_unique def test_is_monotonic(self): s = Series(np.random.randint(0, 10, size=1000)) - self.assertFalse(s.is_monotonic) + assert not s.is_monotonic s = Series(np.arange(1000)) - self.assertTrue(s.is_monotonic) - self.assertTrue(s.is_monotonic_increasing) + assert s.is_monotonic + assert s.is_monotonic_increasing s = Series(np.arange(1000, 0, -1)) - self.assertTrue(s.is_monotonic_decreasing) + assert s.is_monotonic_decreasing s = Series(pd.date_range('20130101', periods=10)) - self.assertTrue(s.is_monotonic) - self.assertTrue(s.is_monotonic_increasing) + assert s.is_monotonic + assert s.is_monotonic_increasing s = Series(list(reversed(s.tolist()))) - self.assertFalse(s.is_monotonic) - self.assertTrue(s.is_monotonic_decreasing) - - def test_nsmallest_nlargest(self): - # float, int, datetime64 (use i8), timedelts64 (same), - # object that are numbers, object that are strings - - base = [3, 2, 1, 2, 5] - - s_list = [ - Series(base, dtype='int8'), - Series(base, dtype='int16'), - Series(base, dtype='int32'), - Series(base, dtype='int64'), - Series(base, dtype='float32'), - Series(base, dtype='float64'), - Series(base, dtype='uint8'), - Series(base, dtype='uint16'), - Series(base, dtype='uint32'), - Series(base, dtype='uint64'), - Series(base).astype('timedelta64[ns]'), - Series(pd.to_datetime(['2003', '2002', '2001', '2002', '2005'])), - ] - - raising = [ - Series([3., 2, 1, 2, '5'], dtype='object'), - Series([3., 2, 1, 2, 5], dtype='object'), - # not supported on some archs - # Series([3., 2, 1, 2, 5], dtype='complex256'), - Series([3., 2, 1, 2, 5], dtype='complex128'), - ] - - for r in raising: - dt = r.dtype - msg = "Cannot use method 'n(larg|small)est' with dtype %s" % dt - args = 2, len(r), 0, -1 - methods = 
r.nlargest, r.nsmallest - for method, arg in product(methods, args): - with tm.assertRaisesRegexp(TypeError, msg): - method(arg) - - for s in s_list: - - assert_series_equal(s.nsmallest(2), s.iloc[[2, 1]]) - - assert_series_equal(s.nsmallest(2, keep='last'), s.iloc[[2, 3]]) - with tm.assert_produces_warning(FutureWarning): - assert_series_equal( - s.nsmallest(2, take_last=True), s.iloc[[2, 3]]) - - assert_series_equal(s.nlargest(3), s.iloc[[4, 0, 1]]) - - assert_series_equal(s.nlargest(3, keep='last'), s.iloc[[4, 0, 3]]) - with tm.assert_produces_warning(FutureWarning): - assert_series_equal( - s.nlargest(3, take_last=True), s.iloc[[4, 0, 3]]) - - empty = s.iloc[0:0] - assert_series_equal(s.nsmallest(0), empty) - assert_series_equal(s.nsmallest(-1), empty) - assert_series_equal(s.nlargest(0), empty) - assert_series_equal(s.nlargest(-1), empty) - - assert_series_equal(s.nsmallest(len(s)), s.sort_values()) - assert_series_equal(s.nsmallest(len(s) + 1), s.sort_values()) - assert_series_equal(s.nlargest(len(s)), s.iloc[[4, 0, 1, 3, 2]]) - assert_series_equal(s.nlargest(len(s) + 1), - s.iloc[[4, 0, 1, 3, 2]]) - - s = Series([3., np.nan, 1, 2, 5]) - assert_series_equal(s.nlargest(), s.iloc[[4, 0, 3, 2]]) - assert_series_equal(s.nsmallest(), s.iloc[[2, 3, 0, 4]]) - - msg = 'keep must be either "first", "last"' - with tm.assertRaisesRegexp(ValueError, msg): - s.nsmallest(keep='invalid') - with tm.assertRaisesRegexp(ValueError, msg): - s.nlargest(keep='invalid') - - # GH 13412 - s = Series([1, 4, 3, 2], index=[0, 0, 1, 1]) - result = s.nlargest(3) - expected = s.sort_values(ascending=False).head(3) - assert_series_equal(result, expected) - result = s.nsmallest(3) - expected = s.sort_values().head(3) - assert_series_equal(result, expected) + assert not s.is_monotonic + assert s.is_monotonic_decreasing - def test_sortlevel(self): + def test_sort_index_level(self): mi = MultiIndex.from_tuples([[1, 1, 3], [1, 1, 1]], names=list('ABC')) s = Series([1, 2], mi) backwards = s.iloc[[1, 0]] - res = s.sortlevel('A') + res = s.sort_index(level='A') assert_series_equal(backwards, res) - res = s.sortlevel(['A', 'B']) + res = s.sort_index(level=['A', 'B']) assert_series_equal(backwards, res) - res = s.sortlevel('A', sort_remaining=False) + res = s.sort_index(level='A', sort_remaining=False) assert_series_equal(s, res) - res = s.sortlevel(['A', 'B'], sort_remaining=False) + res = s.sort_index(level=['A', 'B'], sort_remaining=False) assert_series_equal(s, res) def test_apply_categorical(self): @@ -1558,7 +1438,7 @@ def test_apply_categorical(self): result = s.apply(lambda x: 'A') exp = pd.Series(['A'] * 7, name='XX', index=list('abcdefg')) tm.assert_series_equal(result, exp) - self.assertEqual(result.dtype, np.object) + assert result.dtype == np.object def test_shift_int(self): ts = self.ts.astype(int) @@ -1574,13 +1454,13 @@ def test_shift_categorical(self): sp1 = s.shift(1) assert_index_equal(s.index, sp1.index) - self.assertTrue(np.all(sp1.values.codes[:1] == -1)) - self.assertTrue(np.all(s.values.codes[:-1] == sp1.values.codes[1:])) + assert np.all(sp1.values.codes[:1] == -1) + assert np.all(s.values.codes[:-1] == sp1.values.codes[1:]) sn2 = s.shift(-2) assert_index_equal(s.index, sn2.index) - self.assertTrue(np.all(sn2.values.codes[-2:] == -1)) - self.assertTrue(np.all(s.values.codes[2:] == sn2.values.codes[:-2])) + assert np.all(sn2.values.codes[-2:] == -1) + assert np.all(s.values.codes[2:] == sn2.values.codes[:-2]) assert_index_equal(s.values.categories, sp1.values.categories) 
assert_index_equal(s.values.categories, sn2.values.categories) @@ -1593,7 +1473,7 @@ def test_reshape_non_2d(self): # see gh-4554 with tm.assert_produces_warning(FutureWarning): x = Series(np.random.random(201), name='x') - self.assertTrue(x.reshape(x.shape, ) is x) + assert x.reshape(x.shape, ) is x # see gh-2719 with tm.assert_produces_warning(FutureWarning): @@ -1601,34 +1481,36 @@ def test_reshape_non_2d(self): result = a.reshape(2, 2) expected = a.values.reshape(2, 2) tm.assert_numpy_array_equal(result, expected) - self.assertIsInstance(result, type(expected)) + assert isinstance(result, type(expected)) def test_reshape_2d_return_array(self): x = Series(np.random.random(201), name='x') with tm.assert_produces_warning(FutureWarning): result = x.reshape((-1, 1)) - self.assertNotIsInstance(result, Series) + assert not isinstance(result, Series) with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): result2 = np.reshape(x, (-1, 1)) - self.assertNotIsInstance(result2, Series) + assert not isinstance(result2, Series) with tm.assert_produces_warning(FutureWarning): result = x[:, None] expected = x.reshape((-1, 1)) - assert_almost_equal(result, expected) + tm.assert_almost_equal(result, expected) def test_reshape_bad_kwarg(self): a = Series([1, 2, 3, 4]) with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): msg = "'foo' is an invalid keyword argument for this function" - tm.assertRaisesRegexp(TypeError, msg, a.reshape, (2, 2), foo=2) + tm.assert_raises_regex( + TypeError, msg, a.reshape, (2, 2), foo=2) with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): msg = r"reshape\(\) got an unexpected keyword argument 'foo'" - tm.assertRaisesRegexp(TypeError, msg, a.reshape, a.shape, foo=2) + tm.assert_raises_regex( + TypeError, msg, a.reshape, a.shape, foo=2) def test_numpy_reshape(self): a = Series([1, 2, 3, 4]) @@ -1637,7 +1519,7 @@ def test_numpy_reshape(self): result = np.reshape(a, (2, 2)) expected = a.values.reshape(2, 2) tm.assert_numpy_array_equal(result, expected) - self.assertIsInstance(result, type(expected)) + assert isinstance(result, type(expected)) with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): result = np.reshape(a, a.shape) @@ -1667,8 +1549,9 @@ def test_unstack(self): s = Series(np.random.randn(6), index=index) exp_index = MultiIndex(levels=[['one', 'two', 'three'], [0, 1]], labels=[[0, 1, 2, 0, 1, 2], [0, 1, 0, 1, 0, 1]]) - expected = DataFrame({'bar': s.values}, index=exp_index).sortlevel(0) - unstacked = s.unstack(0) + expected = DataFrame({'bar': s.values}, + index=exp_index).sort_index(level=0) + unstacked = s.unstack(0).sort_index() assert_frame_equal(unstacked, expected) # GH5873 @@ -1797,3 +1680,109 @@ def test_value_counts_categorical_not_ordered(self): index=exp_idx, name='xxx') tm.assert_series_equal(s.value_counts(normalize=True), exp) tm.assert_series_equal(idx.value_counts(normalize=True), exp) + + +@pytest.fixture +def s_main_dtypes(): + df = pd.DataFrame( + {'datetime': pd.to_datetime(['2003', '2002', + '2001', '2002', + '2005']), + 'datetimetz': pd.to_datetime( + ['2003', '2002', + '2001', '2002', + '2005']).tz_localize('US/Eastern'), + 'timedelta': pd.to_timedelta(['3d', '2d', '1d', + '2d', '5d'])}) + + for dtype in ['int8', 'int16', 'int32', 'int64', + 'float32', 'float64', + 'uint8', 'uint16', 'uint32', 'uint64']: + df[dtype] = Series([3, 2, 1, 2, 5], dtype=dtype) + + return df + + +class TestNLargestNSmallest(object): + + @pytest.mark.parametrize( + "r", [Series([3., 2, 1, 2, '5'], 
dtype='object'), + Series([3., 2, 1, 2, 5], dtype='object'), + # not supported on some archs + # Series([3., 2, 1, 2, 5], dtype='complex256'), + Series([3., 2, 1, 2, 5], dtype='complex128'), + Series(list('abcde'), dtype='category'), + Series(list('abcde'))]) + def test_error(self, r): + dt = r.dtype + msg = ("Cannot use method 'n(larg|small)est' with " + "dtype {dt}".format(dt=dt)) + args = 2, len(r), 0, -1 + methods = r.nlargest, r.nsmallest + for method, arg in product(methods, args): + with tm.assert_raises_regex(TypeError, msg): + method(arg) + + @pytest.mark.parametrize( + "s", + [v for k, v in s_main_dtypes().iteritems()]) + def test_nsmallest_nlargest(self, s): + # float, int, datetime64 (use i8), timedelta64 (same), + # object that are numbers, object that are strings + + assert_series_equal(s.nsmallest(2), s.iloc[[2, 1]]) + assert_series_equal(s.nsmallest(2, keep='last'), s.iloc[[2, 3]]) + + empty = s.iloc[0:0] + assert_series_equal(s.nsmallest(0), empty) + assert_series_equal(s.nsmallest(-1), empty) + assert_series_equal(s.nlargest(0), empty) + assert_series_equal(s.nlargest(-1), empty) + + assert_series_equal(s.nsmallest(len(s)), s.sort_values()) + assert_series_equal(s.nsmallest(len(s) + 1), s.sort_values()) + assert_series_equal(s.nlargest(len(s)), s.iloc[[4, 0, 1, 3, 2]]) + assert_series_equal(s.nlargest(len(s) + 1), + s.iloc[[4, 0, 1, 3, 2]]) + + def test_misc(self): + + s = Series([3., np.nan, 1, 2, 5]) + assert_series_equal(s.nlargest(), s.iloc[[4, 0, 3, 2]]) + assert_series_equal(s.nsmallest(), s.iloc[[2, 3, 0, 4]]) + + msg = 'keep must be either "first", "last"' + with tm.assert_raises_regex(ValueError, msg): + s.nsmallest(keep='invalid') + with tm.assert_raises_regex(ValueError, msg): + s.nlargest(keep='invalid') + + # GH 15297 + s = Series([1] * 5, index=[1, 2, 3, 4, 5]) + expected_first = Series([1] * 3, index=[1, 2, 3]) + expected_last = Series([1] * 3, index=[5, 4, 3]) + + result = s.nsmallest(3) + assert_series_equal(result, expected_first) + + result = s.nsmallest(3, keep='last') + assert_series_equal(result, expected_last) + + result = s.nlargest(3) + assert_series_equal(result, expected_first) + + result = s.nlargest(3, keep='last') + assert_series_equal(result, expected_last) + + @pytest.mark.parametrize('n', range(1, 5)) + def test_n(self, n): + + # GH 13412 + s = Series([1, 4, 3, 2], index=[0, 0, 1, 1]) + result = s.nlargest(n) + expected = s.sort_values(ascending=False).head(n) + assert_series_equal(result, expected) + + result = s.nsmallest(n) + expected = s.sort_values().head(n) + assert_series_equal(result, expected) diff --git a/pandas/tests/series/test_misc_api.py b/pandas/tests/series/test_api.py similarity index 66% rename from pandas/tests/series/test_misc_api.py rename to pandas/tests/series/test_api.py index 61bdc59cd500d..1eb2b98a7d7cc 100644 --- a/pandas/tests/series/test_misc_api.py +++ b/pandas/tests/series/test_api.py @@ -1,15 +1,17 @@ # coding=utf-8 # pylint: disable-msg=E1101,W0612 +import pytest + import numpy as np import pandas as pd from pandas import Index, Series, DataFrame, date_range -from pandas.tseries.index import Timestamp +from pandas.core.indexes.datetimes import Timestamp from pandas.compat import range from pandas import compat -import pandas.formats.printing as printing +import pandas.io.formats.printing as printing from pandas.util.testing import (assert_series_equal, ensure_clean) import pandas.util.testing as tm @@ -21,46 +23,46 @@ class SharedWithSparse(object): def test_scalarop_preserve_name(self): result = self.ts *
2 - self.assertEqual(result.name, self.ts.name) + assert result.name == self.ts.name def test_copy_name(self): result = self.ts.copy() - self.assertEqual(result.name, self.ts.name) + assert result.name == self.ts.name def test_copy_index_name_checking(self): # don't want to be able to modify the index stored elsewhere after # making a copy self.ts.index.name = None - self.assertIsNone(self.ts.index.name) - self.assertIs(self.ts, self.ts) + assert self.ts.index.name is None + assert self.ts is self.ts cp = self.ts.copy() cp.index.name = 'foo' printing.pprint_thing(self.ts.index.name) - self.assertIsNone(self.ts.index.name) + assert self.ts.index.name is None def test_append_preserve_name(self): result = self.ts[:5].append(self.ts[5:]) - self.assertEqual(result.name, self.ts.name) + assert result.name == self.ts.name def test_binop_maybe_preserve_name(self): # names match, preserve result = self.ts * self.ts - self.assertEqual(result.name, self.ts.name) + assert result.name == self.ts.name result = self.ts.mul(self.ts) - self.assertEqual(result.name, self.ts.name) + assert result.name == self.ts.name result = self.ts * self.ts[:-2] - self.assertEqual(result.name, self.ts.name) + assert result.name == self.ts.name # names don't match, don't preserve cp = self.ts.copy() cp.name = 'something else' result = self.ts + cp - self.assertIsNone(result.name) + assert result.name is None result = self.ts.add(cp) - self.assertIsNone(result.name) + assert result.name is None ops = ['add', 'sub', 'mul', 'div', 'truediv', 'floordiv', 'mod', 'pow'] ops = ops + ['r' + op for op in ops] @@ -68,27 +70,27 @@ def test_binop_maybe_preserve_name(self): # names match, preserve s = self.ts.copy() result = getattr(s, op)(s) - self.assertEqual(result.name, self.ts.name) + assert result.name == self.ts.name # names don't match, don't preserve cp = self.ts.copy() cp.name = 'changed' result = getattr(s, op)(cp) - self.assertIsNone(result.name) + assert result.name is None def test_combine_first_name(self): result = self.ts.combine_first(self.ts[:5]) - self.assertEqual(result.name, self.ts.name) + assert result.name == self.ts.name def test_getitem_preserve_name(self): result = self.ts[self.ts > 0] - self.assertEqual(result.name, self.ts.name) + assert result.name == self.ts.name result = self.ts[[0, 2, 4]] - self.assertEqual(result.name, self.ts.name) + assert result.name == self.ts.name result = self.ts[5:10] - self.assertEqual(result.name, self.ts.name) + assert result.name == self.ts.name def test_pickle(self): unp_series = self._pickle_roundtrip(self.series) @@ -105,122 +107,121 @@ def _pickle_roundtrip(self, obj): def test_argsort_preserve_name(self): result = self.ts.argsort() - self.assertEqual(result.name, self.ts.name) + assert result.name == self.ts.name def test_sort_index_name(self): result = self.ts.sort_index(ascending=False) - self.assertEqual(result.name, self.ts.name) + assert result.name == self.ts.name def test_to_sparse_pass_name(self): result = self.ts.to_sparse() - self.assertEqual(result.name, self.ts.name) - + assert result.name == self.ts.name -class TestSeriesMisc(TestData, SharedWithSparse, tm.TestCase): - _multiprocess_can_split_ = True +class TestSeriesMisc(TestData, SharedWithSparse): def test_tab_completion(self): # GH 9910 s = Series(list('abcd')) # Series of str values should have .str but not .dt/.cat in __dir__ - self.assertTrue('str' in dir(s)) - self.assertTrue('dt' not in dir(s)) - self.assertTrue('cat' not in dir(s)) + assert 'str' in dir(s) + assert 'dt' not in dir(s) + assert 
'cat' not in dir(s) # similarly for .dt s = Series(date_range('1/1/2015', periods=5)) - self.assertTrue('dt' in dir(s)) - self.assertTrue('str' not in dir(s)) - self.assertTrue('cat' not in dir(s)) + assert 'dt' in dir(s) + assert 'str' not in dir(s) + assert 'cat' not in dir(s) - # similiarly for .cat, but with the twist that str and dt should be - # there if the categories are of that type first cat and str + # Similarly for .cat, but with the twist that str and dt should be + # there if the categories are of that type: first cat and str. s = Series(list('abbcd'), dtype="category") - self.assertTrue('cat' in dir(s)) - self.assertTrue('str' in dir(s)) # as it is a string categorical - self.assertTrue('dt' not in dir(s)) + assert 'cat' in dir(s) + assert 'str' in dir(s) # as it is a string categorical + assert 'dt' not in dir(s) # similar to cat and str s = Series(date_range('1/1/2015', periods=5)).astype("category") - self.assertTrue('cat' in dir(s)) - self.assertTrue('str' not in dir(s)) - self.assertTrue('dt' in dir(s)) # as it is a datetime categorical + assert 'cat' in dir(s) + assert 'str' not in dir(s) + assert 'dt' in dir(s) # as it is a datetime categorical def test_not_hashable(self): s_empty = Series() s = Series([1]) - self.assertRaises(TypeError, hash, s_empty) - self.assertRaises(TypeError, hash, s) + pytest.raises(TypeError, hash, s_empty) + pytest.raises(TypeError, hash, s) def test_contains(self): tm.assert_contains_all(self.ts.index, self.ts) def test_iter(self): for i, val in enumerate(self.series): - self.assertEqual(val, self.series[i]) + assert val == self.series[i] for i, val in enumerate(self.ts): - self.assertEqual(val, self.ts[i]) + assert val == self.ts[i] def test_iter_box(self): vals = [pd.Timestamp('2011-01-01'), pd.Timestamp('2011-01-02')] s = pd.Series(vals) - self.assertEqual(s.dtype, 'datetime64[ns]') + assert s.dtype == 'datetime64[ns]' for res, exp in zip(s, vals): - self.assertIsInstance(res, pd.Timestamp) - self.assertEqual(res, exp) - self.assertIsNone(res.tz) + assert isinstance(res, pd.Timestamp) + assert res.tz is None + assert res == exp vals = [pd.Timestamp('2011-01-01', tz='US/Eastern'), pd.Timestamp('2011-01-02', tz='US/Eastern')] s = pd.Series(vals) - self.assertEqual(s.dtype, 'datetime64[ns, US/Eastern]') + + assert s.dtype == 'datetime64[ns, US/Eastern]' for res, exp in zip(s, vals): - self.assertIsInstance(res, pd.Timestamp) - self.assertEqual(res, exp) - self.assertEqual(res.tz, exp.tz) + assert isinstance(res, pd.Timestamp) + assert res.tz == exp.tz + assert res == exp # timedelta vals = [pd.Timedelta('1 days'), pd.Timedelta('2 days')] s = pd.Series(vals) - self.assertEqual(s.dtype, 'timedelta64[ns]') + assert s.dtype == 'timedelta64[ns]' for res, exp in zip(s, vals): - self.assertIsInstance(res, pd.Timedelta) - self.assertEqual(res, exp) + assert isinstance(res, pd.Timedelta) + assert res == exp # period (object dtype, not boxed) vals = [pd.Period('2011-01-01', freq='M'), pd.Period('2011-01-02', freq='M')] s = pd.Series(vals) - self.assertEqual(s.dtype, 'object') + assert s.dtype == 'object' for res, exp in zip(s, vals): - self.assertIsInstance(res, pd.Period) - self.assertEqual(res, exp) - self.assertEqual(res.freq, 'M') + assert isinstance(res, pd.Period) + assert res.freq == 'M' + assert res == exp def test_keys(self): # HACK: By doing this in two stages, we avoid 2to3 wrapping the call # to .keys() in a list() getkeys = self.ts.keys - self.assertIs(getkeys(), self.ts.index) + assert getkeys() is self.ts.index def test_values(self): -
self.assert_almost_equal(self.ts.values, self.ts, check_dtype=False) + tm.assert_almost_equal(self.ts.values, self.ts, check_dtype=False) def test_iteritems(self): for idx, val in compat.iteritems(self.series): - self.assertEqual(val, self.series[idx]) + assert val == self.series[idx] for idx, val in compat.iteritems(self.ts): - self.assertEqual(val, self.ts[idx]) + assert val == self.ts[idx] # assert is lazy (generators don't define reverse, lists do) - self.assertFalse(hasattr(self.series.iteritems(), 'reverse')) + assert not hasattr(self.series.iteritems(), 'reverse') def test_raise_on_info(self): s = Series(np.random.randn(10)) - with tm.assertRaises(AttributeError): + with pytest.raises(AttributeError): s.info() def test_copy(self): @@ -238,12 +239,12 @@ def test_copy(self): if deep is None or deep is True: # Did not modify original Series - self.assertTrue(np.isnan(s2[0])) - self.assertFalse(np.isnan(s[0])) + assert np.isnan(s2[0]) + assert not np.isnan(s[0]) else: # we DID modify the original Series - self.assertTrue(np.isnan(s2[0])) - self.assertTrue(np.isnan(s[0])) + assert np.isnan(s2[0]) + assert np.isnan(s[0]) # GH 11794 # copy of tz-aware @@ -274,9 +275,9 @@ def test_copy(self): def test_axis_alias(self): s = Series([1, 2, np.nan]) assert_series_equal(s.dropna(axis='rows'), s.dropna(axis='index')) - self.assertEqual(s.dropna().sum('rows'), 3) - self.assertEqual(s._get_axis_number('rows'), 0) - self.assertEqual(s._get_axis_name('rows'), 'index') + assert s.dropna().sum('rows') == 3 + assert s._get_axis_number('rows') == 0 + assert s._get_axis_name('rows') == 'index' def test_numpy_unique(self): # it works! @@ -293,19 +294,19 @@ def f(x): result = tsdf.apply(f) expected = tsdf.max() - assert_series_equal(result, expected) + tm.assert_series_equal(result, expected) # .item() s = Series([1]) result = s.item() - self.assertEqual(result, 1) - self.assertEqual(s.item(), s.iloc[0]) + assert result == 1 + assert s.item() == s.iloc[0] # using an ndarray-like function s = Series(np.random.randn(10)) - result = np.ones_like(s) + result = Series(np.ones_like(s)) expected = Series(1, index=range(10), dtype='float64') - # assert_series_equal(result,expected) + tm.assert_series_equal(result, expected) # ravel s = Series(np.random.randn(10)) @@ -315,21 +316,21 @@ def f(x): # GH 6658 s = Series([0, 1., -1], index=list('abc')) result = np.compress(s > 0, s) - assert_series_equal(result, Series([1.], index=['b'])) + tm.assert_series_equal(result, Series([1.], index=['b'])) result = np.compress(s < -1, s) # result is an empty Series with Index(dtype=object), same as the original exp = Series([], dtype='float64', index=Index([], dtype='object')) - assert_series_equal(result, exp) + tm.assert_series_equal(result, exp) s = Series([0, 1., -1], index=[.1, .2, .3]) result = np.compress(s > 0, s) - assert_series_equal(result, Series([1.], index=[.2])) + tm.assert_series_equal(result, Series([1.], index=[.2])) result = np.compress(s < -1, s) # result is an empty Series with Float64Index, same as the original exp = Series([], dtype='float64', index=Index([], dtype='float64')) - assert_series_equal(result, exp) + tm.assert_series_equal(result, exp) def test_str_attribute(self): # GH9068 @@ -341,5 +342,13 @@ def test_str_attribute(self): # str accessor only valid with string values s = Series(range(5)) - with self.assertRaisesRegexp(AttributeError, 'only use .str accessor'): + with tm.assert_raises_regex(AttributeError, + 'only use .str accessor'): s.str.repeat(2) + + def test_empty_method(self): + s_empty = pd.Series() + assert
s_empty.empty + + for full_series in [pd.Series([1]), pd.Series(index=[1])]: + assert not full_series.empty diff --git a/pandas/tests/series/test_apply.py b/pandas/tests/series/test_apply.py index 8d7676bef4d72..c273d3161fff5 100644 --- a/pandas/tests/series/test_apply.py +++ b/pandas/tests/series/test_apply.py @@ -1,45 +1,42 @@ # coding=utf-8 # pylint: disable-msg=E1101,W0612 +import pytest + +from collections import Counter, defaultdict, OrderedDict + import numpy as np import pandas as pd from pandas import (Index, Series, DataFrame, isnull) from pandas.compat import lrange from pandas import compat -from pandas.util.testing import assert_series_equal +from pandas.util.testing import assert_series_equal, assert_frame_equal import pandas.util.testing as tm from .common import TestData -class TestSeriesApply(TestData, tm.TestCase): - - _multiprocess_can_split_ = True +class TestSeriesApply(TestData): def test_apply(self): with np.errstate(all='ignore'): - assert_series_equal(self.ts.apply(np.sqrt), np.sqrt(self.ts)) + tm.assert_series_equal(self.ts.apply(np.sqrt), np.sqrt(self.ts)) - # elementwise-apply + # element-wise apply import math - assert_series_equal(self.ts.apply(math.exp), np.exp(self.ts)) - - # how to handle Series result, #2316 - result = self.ts.apply(lambda x: Series( - [x, x ** 2], index=['x', 'x^2'])) - expected = DataFrame({'x': self.ts, 'x^2': self.ts ** 2}) - tm.assert_frame_equal(result, expected) + tm.assert_series_equal(self.ts.apply(math.exp), np.exp(self.ts)) # empty series s = Series(dtype=object, name='foo', index=pd.Index([], name='bar')) rs = s.apply(lambda x: x) tm.assert_series_equal(s, rs) + # check all metadata (GH 9322) - self.assertIsNot(s, rs) - self.assertIs(s.index, rs.index) - self.assertEqual(s.dtype, rs.dtype) - self.assertEqual(s.name, rs.name) + assert s is not rs + assert s.index is rs.index + assert s.dtype == rs.dtype + assert s.name == rs.name # index but no data s = Series(index=[1, 2, 3]) @@ -64,20 +61,27 @@ def test_apply_dont_convert_dtype(self): f = lambda x: x if x > 0 else np.nan result = s.apply(f, convert_dtype=False) - self.assertEqual(result.dtype, object) + assert result.dtype == object + + def test_with_string_args(self): + + for arg in ['sum', 'mean', 'min', 'max', 'std']: + result = self.ts.apply(arg) + expected = getattr(self.ts, arg)() + assert result == expected def test_apply_args(self): s = Series(['foo,bar']) result = s.apply(str.split, args=(',', )) - self.assertEqual(result[0], ['foo', 'bar']) - tm.assertIsInstance(result[0], list) + assert result[0] == ['foo', 'bar'] + assert isinstance(result[0], list) def test_apply_box(self): # ufunc will not be boxed. 
Same test cases as the test_map_box vals = [pd.Timestamp('2011-01-01'), pd.Timestamp('2011-01-02')] s = pd.Series(vals) - self.assertEqual(s.dtype, 'datetime64[ns]') + assert s.dtype == 'datetime64[ns]' # boxed value must be Timestamp instance res = s.apply(lambda x: '{0}_{1}_{2}'.format(x.__class__.__name__, x.day, x.tz)) @@ -87,7 +91,7 @@ def test_apply_box(self): vals = [pd.Timestamp('2011-01-01', tz='US/Eastern'), pd.Timestamp('2011-01-02', tz='US/Eastern')] s = pd.Series(vals) - self.assertEqual(s.dtype, 'datetime64[ns, US/Eastern]') + assert s.dtype == 'datetime64[ns, US/Eastern]' res = s.apply(lambda x: '{0}_{1}_{2}'.format(x.__class__.__name__, x.day, x.tz)) exp = pd.Series(['Timestamp_1_US/Eastern', 'Timestamp_2_US/Eastern']) @@ -96,7 +100,7 @@ def test_apply_box(self): # timedelta vals = [pd.Timedelta('1 days'), pd.Timedelta('2 days')] s = pd.Series(vals) - self.assertEqual(s.dtype, 'timedelta64[ns]') + assert s.dtype == 'timedelta64[ns]' res = s.apply(lambda x: '{0}_{1}'.format(x.__class__.__name__, x.days)) exp = pd.Series(['Timedelta_1', 'Timedelta_2']) tm.assert_series_equal(res, exp) @@ -105,7 +109,7 @@ def test_apply_box(self): vals = [pd.Period('2011-01-01', freq='M'), pd.Period('2011-01-02', freq='M')] s = pd.Series(vals) - self.assertEqual(s.dtype, 'object') + assert s.dtype == 'object' res = s.apply(lambda x: '{0}_{1}'.format(x.__class__.__name__, x.freqstr)) exp = pd.Series(['Period_M', 'Period_M']) @@ -123,8 +127,9 @@ def test_apply_datetimetz(self): tm.assert_series_equal(result, exp) # change dtype + # GH 14506 : Returned dtype changed from int32 to int64 result = s.apply(lambda x: x.hour) - exp = pd.Series(list(range(24)) + [0], name='XX', dtype=np.int32) + exp = pd.Series(list(range(24)) + [0], name='XX', dtype=np.int64) tm.assert_series_equal(result, exp) # not vectorized @@ -137,11 +142,173 @@ def f(x): exp = pd.Series(['Asia/Tokyo'] * 25, name='XX') tm.assert_series_equal(result, exp) + def test_apply_dict_depr(self): -class TestSeriesMap(TestData, tm.TestCase): + tsdf = pd.DataFrame(np.random.randn(10, 3), + columns=['A', 'B', 'C'], + index=pd.date_range('1/1/2000', periods=10)) + with tm.assert_produces_warning(FutureWarning): + tsdf.A.agg({'foo': ['sum', 'mean']}) + + +class TestSeriesAggregate(TestData): _multiprocess_can_split_ = True + def test_transform(self): + # transforming functions + + with np.errstate(all='ignore'): + + f_sqrt = np.sqrt(self.series) + f_abs = np.abs(self.series) + + # ufunc + result = self.series.transform(np.sqrt) + expected = f_sqrt.copy() + assert_series_equal(result, expected) + + result = self.series.apply(np.sqrt) + assert_series_equal(result, expected) + + # list-like + result = self.series.transform([np.sqrt]) + expected = f_sqrt.to_frame().copy() + expected.columns = ['sqrt'] + assert_frame_equal(result, expected) + + result = self.series.transform([np.sqrt]) + assert_frame_equal(result, expected) + + result = self.series.transform(['sqrt']) + assert_frame_equal(result, expected) + + # multiple items in list + # these are in the order as if we are applying both functions per + # series and then concatting + expected = pd.concat([f_sqrt, f_abs], axis=1) + expected.columns = ['sqrt', 'absolute'] + result = self.series.apply([np.sqrt, np.abs]) + assert_frame_equal(result, expected) + + result = self.series.transform(['sqrt', 'abs']) + expected.columns = ['sqrt', 'abs'] + assert_frame_equal(result, expected) + + # dict, provide renaming + expected = pd.concat([f_sqrt, f_abs], axis=1) + expected.columns = ['foo', 'bar'] + 
expected = expected.unstack().rename('series') + + result = self.series.apply({'foo': np.sqrt, 'bar': np.abs}) + assert_series_equal(result.reindex_like(expected), expected) + + def test_transform_and_agg_error(self): + # we are trying to transform with an aggregator + def f(): + self.series.transform(['min', 'max']) + pytest.raises(ValueError, f) + + def f(): + with np.errstate(all='ignore'): + self.series.agg(['sqrt', 'max']) + pytest.raises(ValueError, f) + + def f(): + with np.errstate(all='ignore'): + self.series.transform(['sqrt', 'max']) + pytest.raises(ValueError, f) + + def f(): + with np.errstate(all='ignore'): + self.series.agg({'foo': np.sqrt, 'bar': 'sum'}) + pytest.raises(ValueError, f) + + def test_demo(self): + # demonstration tests + s = Series(range(6), dtype='int64', name='series') + + result = s.agg(['min', 'max']) + expected = Series([0, 5], index=['min', 'max'], name='series') + tm.assert_series_equal(result, expected) + + result = s.agg({'foo': 'min'}) + expected = Series([0], index=['foo'], name='series') + tm.assert_series_equal(result, expected) + + # nested renaming + with tm.assert_produces_warning(FutureWarning): + result = s.agg({'foo': ['min', 'max']}) + + expected = DataFrame( + {'foo': [0, 5]}, + index=['min', 'max']).unstack().rename('series') + tm.assert_series_equal(result, expected) + + def test_multiple_aggregators_with_dict_api(self): + + s = Series(range(6), dtype='int64', name='series') + # nested renaming + with tm.assert_produces_warning(FutureWarning): + result = s.agg({'foo': ['min', 'max'], 'bar': ['sum', 'mean']}) + + expected = DataFrame( + {'foo': [5.0, np.nan, 0.0, np.nan], + 'bar': [np.nan, 2.5, np.nan, 15.0]}, + columns=['foo', 'bar'], + index=['max', 'mean', + 'min', 'sum']).unstack().rename('series') + tm.assert_series_equal(result.reindex_like(expected), expected) + + def test_agg_apply_evaluate_lambdas_the_same(self): + # test that we are evaluating row-by-row first + # before vectorized evaluation + result = self.series.apply(lambda x: str(x)) + expected = self.series.agg(lambda x: str(x)) + tm.assert_series_equal(result, expected) + + result = self.series.apply(str) + expected = self.series.agg(str) + tm.assert_series_equal(result, expected) + + def test_with_nested_series(self): + # GH 2316 + # .agg with a reducer and a transform, what to do + result = self.ts.apply(lambda x: Series( + [x, x ** 2], index=['x', 'x^2'])) + expected = DataFrame({'x': self.ts, 'x^2': self.ts ** 2}) + tm.assert_frame_equal(result, expected) + + result = self.ts.agg(lambda x: Series( + [x, x ** 2], index=['x', 'x^2'])) + tm.assert_frame_equal(result, expected) + + def test_replicate_describe(self): + # this also tests a result set that is all scalars + expected = self.series.describe() + result = self.series.apply(OrderedDict( + [('count', 'count'), + ('mean', 'mean'), + ('std', 'std'), + ('min', 'min'), + ('25%', lambda x: x.quantile(0.25)), + ('50%', 'median'), + ('75%', lambda x: x.quantile(0.75)), + ('max', 'max')])) + assert_series_equal(result, expected) + + def test_reduce(self): + # reductions with named functions + result = self.series.agg(['sum', 'mean']) + expected = Series([self.series.sum(), + self.series.mean()], + ['sum', 'mean'], + name=self.series.name) + assert_series_equal(result, expected) + + +class TestSeriesMap(TestData): + def test_map(self): index, data = tm.getMixedTypeDict() @@ -151,17 +318,17 @@ def test_map(self): merged = target.map(source) for k, v in compat.iteritems(merged): - self.assertEqual(v, source[target[k]]) + 
assert v == source[target[k]] # input could be a dict merged = target.map(source.to_dict()) for k, v in compat.iteritems(merged): - self.assertEqual(v, source[target[k]]) + assert v == source[target[k]] # function result = self.ts.map(lambda x: x * 2) - self.assert_series_equal(result, self.ts * 2) + tm.assert_series_equal(result, self.ts * 2) # GH 10324 a = Series([1, 2, 3, 4]) @@ -169,9 +336,9 @@ def test_map(self): c = Series(["even", "odd", "even", "odd"]) exp = Series(["odd", "even", "odd", np.nan], dtype="category") - self.assert_series_equal(a.map(b), exp) + tm.assert_series_equal(a.map(b), exp) exp = Series(["odd", "even", "odd", np.nan]) - self.assert_series_equal(a.map(c), exp) + tm.assert_series_equal(a.map(c), exp) a = Series(['a', 'b', 'c', 'd']) b = Series([1, 2, 3, 4], @@ -179,9 +346,9 @@ def test_map(self): c = Series([1, 2, 3, 4], index=Index(['b', 'c', 'd', 'e'])) exp = Series([np.nan, 1, 2, 3]) - self.assert_series_equal(a.map(b), exp) + tm.assert_series_equal(a.map(b), exp) exp = Series([np.nan, 1, 2, 3]) - self.assert_series_equal(a.map(c), exp) + tm.assert_series_equal(a.map(c), exp) a = Series(['a', 'b', 'c', 'd']) b = Series(['B', 'C', 'D', 'E'], dtype='category', @@ -190,9 +357,9 @@ def test_map(self): exp = Series(pd.Categorical([np.nan, 'B', 'C', 'D'], categories=['B', 'C', 'D', 'E'])) - self.assert_series_equal(a.map(b), exp) + tm.assert_series_equal(a.map(b), exp) exp = Series([np.nan, 'B', 'C', 'D']) - self.assert_series_equal(a.map(c), exp) + tm.assert_series_equal(a.map(c), exp) def test_map_compat(self): # related GH 8024 @@ -205,25 +372,25 @@ def test_map_int(self): left = Series({'a': 1., 'b': 2., 'c': 3., 'd': 4}) right = Series({1: 11, 2: 22, 3: 33}) - self.assertEqual(left.dtype, np.float_) - self.assertTrue(issubclass(right.dtype.type, np.integer)) + assert left.dtype == np.float_ + assert issubclass(right.dtype.type, np.integer) merged = left.map(right) - self.assertEqual(merged.dtype, np.float_) - self.assertTrue(isnull(merged['d'])) - self.assertTrue(not isnull(merged['c'])) + assert merged.dtype == np.float_ + assert isnull(merged['d']) + assert not isnull(merged['c']) def test_map_type_inference(self): s = Series(lrange(3)) s2 = s.map(lambda x: np.where(x == 0, 0, 1)) - self.assertTrue(issubclass(s2.dtype.type, np.integer)) + assert issubclass(s2.dtype.type, np.integer) def test_map_decimal(self): from decimal import Decimal result = self.series.map(lambda x: Decimal(str(x))) - self.assertEqual(result.dtype, np.object_) - tm.assertIsInstance(result[0], Decimal) + assert result.dtype == np.object_ + assert isinstance(result[0], Decimal) def test_map_na_exclusion(self): s = Series([1.5, np.nan, 3, np.nan, 5]) @@ -247,10 +414,50 @@ def test_map_dict_with_tuple_keys(self): tm.assert_series_equal(df['labels'], df['expected_labels'], check_names=False) + def test_map_counter(self): + s = Series(['a', 'b', 'c'], index=[1, 2, 3]) + counter = Counter() + counter['b'] = 5 + counter['c'] += 1 + result = s.map(counter) + expected = Series([0, 5, 1], index=[1, 2, 3]) + assert_series_equal(result, expected) + + def test_map_defaultdict(self): + s = Series([1, 2, 3], index=['a', 'b', 'c']) + default_dict = defaultdict(lambda: 'blank') + default_dict[1] = 'stuff' + result = s.map(default_dict) + expected = Series(['stuff', 'blank', 'blank'], index=['a', 'b', 'c']) + assert_series_equal(result, expected) + + def test_map_dict_subclass_with_missing(self): + """ + Test Series.map with a dictionary subclass that defines __missing__, + i.e. 
sets a default value (GH #15999). + """ + class DictWithMissing(dict): + def __missing__(self, key): + return 'missing' + s = Series([1, 2, 3]) + dictionary = DictWithMissing({3: 'three'}) + result = s.map(dictionary) + expected = Series(['missing', 'missing', 'three']) + assert_series_equal(result, expected) + + def test_map_dict_subclass_without_missing(self): + class DictWithoutMissing(dict): + pass + s = Series([1, 2, 3]) + dictionary = DictWithoutMissing({3: 'three'}) + result = s.map(dictionary) + expected = Series([np.nan, np.nan, 'three']) + assert_series_equal(result, expected) + def test_map_box(self): vals = [pd.Timestamp('2011-01-01'), pd.Timestamp('2011-01-02')] s = pd.Series(vals) - self.assertEqual(s.dtype, 'datetime64[ns]') + assert s.dtype == 'datetime64[ns]' # boxed value must be Timestamp instance res = s.map(lambda x: '{0}_{1}_{2}'.format(x.__class__.__name__, x.day, x.tz)) @@ -260,7 +467,7 @@ def test_map_box(self): vals = [pd.Timestamp('2011-01-01', tz='US/Eastern'), pd.Timestamp('2011-01-02', tz='US/Eastern')] s = pd.Series(vals) - self.assertEqual(s.dtype, 'datetime64[ns, US/Eastern]') + assert s.dtype == 'datetime64[ns, US/Eastern]' res = s.map(lambda x: '{0}_{1}_{2}'.format(x.__class__.__name__, x.day, x.tz)) exp = pd.Series(['Timestamp_1_US/Eastern', 'Timestamp_2_US/Eastern']) @@ -269,7 +476,7 @@ def test_map_box(self): # timedelta vals = [pd.Timedelta('1 days'), pd.Timedelta('2 days')] s = pd.Series(vals) - self.assertEqual(s.dtype, 'timedelta64[ns]') + assert s.dtype == 'timedelta64[ns]' res = s.map(lambda x: '{0}_{1}'.format(x.__class__.__name__, x.days)) exp = pd.Series(['Timedelta_1', 'Timedelta_2']) tm.assert_series_equal(res, exp) @@ -278,7 +485,7 @@ def test_map_box(self): vals = [pd.Period('2011-01-01', freq='M'), pd.Period('2011-01-02', freq='M')] s = pd.Series(vals) - self.assertEqual(s.dtype, 'object') + assert s.dtype == 'object' res = s.map(lambda x: '{0}_{1}'.format(x.__class__.__name__, x.freqstr)) exp = pd.Series(['Period_M', 'Period_M']) @@ -299,9 +506,9 @@ def test_map_categorical(self): result = s.map(lambda x: 'A') exp = pd.Series(['A'] * 7, name='XX', index=list('abcdefg')) tm.assert_series_equal(result, exp) - self.assertEqual(result.dtype, np.object) + assert result.dtype == np.object - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): s.map(lambda x: x, na_action='ignore') def test_map_datetimetz(self): @@ -317,11 +524,12 @@ def test_map_datetimetz(self): tm.assert_series_equal(result, exp) # change dtype + # GH 14506 : Returned dtype changed from int32 to int64 result = s.map(lambda x: x.hour) - exp = pd.Series(list(range(24)) + [0], name='XX', dtype=np.int32) + exp = pd.Series(list(range(24)) + [0], name='XX', dtype=np.int64) tm.assert_series_equal(result, exp) - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): s.map(lambda x: x, na_action='ignore') # not vectorized diff --git a/pandas/tests/series/test_asof.py b/pandas/tests/series/test_asof.py index e2092feab9004..1f62d618b20e1 100644 --- a/pandas/tests/series/test_asof.py +++ b/pandas/tests/series/test_asof.py @@ -1,9 +1,8 @@ # coding=utf-8 -import nose +import pytest import numpy as np - from pandas import (offsets, Series, notnull, isnull, date_range, Timestamp) @@ -12,8 +11,7 @@ from .common import TestData -class TestSeriesAsof(TestData, tm.TestCase): - _multiprocess_can_split_ = True +class TestSeriesAsof(TestData): def test_basic(self): @@ -25,21 +23,21 @@ def test_basic(self): dates = 
date_range('1/1/1990', periods=N * 3, freq='25s') result = ts.asof(dates) - self.assertTrue(notnull(result).all()) + assert notnull(result).all() lb = ts.index[14] ub = ts.index[30] result = ts.asof(list(dates)) - self.assertTrue(notnull(result).all()) + assert notnull(result).all() lb = ts.index[14] ub = ts.index[30] mask = (result.index >= lb) & (result.index < ub) rs = result[mask] - self.assertTrue((rs == ts[lb]).all()) + assert (rs == ts[lb]).all() val = result[result.index[result.index >= ub][0]] - self.assertEqual(ts[ub], val) + assert ts[ub] == val def test_scalar(self): @@ -52,20 +50,20 @@ def test_scalar(self): val1 = ts.asof(ts.index[7]) val2 = ts.asof(ts.index[19]) - self.assertEqual(val1, ts[4]) - self.assertEqual(val2, ts[14]) + assert val1 == ts[4] + assert val2 == ts[14] # accepts strings val1 = ts.asof(str(ts.index[7])) - self.assertEqual(val1, ts[4]) + assert val1 == ts[4] # in there result = ts.asof(ts.index[3]) - self.assertEqual(result, ts[3]) + assert result == ts[3] # no as of value d = ts.index[0] - offsets.BDay() - self.assertTrue(np.isnan(ts.asof(d))) + assert np.isnan(ts.asof(d)) def test_with_nan(self): # basic asof test @@ -100,19 +98,19 @@ def test_periodindex(self): dates = date_range('1/1/1990', periods=N * 3, freq='37min') result = ts.asof(dates) - self.assertTrue(notnull(result).all()) + assert notnull(result).all() lb = ts.index[14] ub = ts.index[30] result = ts.asof(list(dates)) - self.assertTrue(notnull(result).all()) + assert notnull(result).all() lb = ts.index[14] ub = ts.index[30] pix = PeriodIndex(result.index.values, freq='H') mask = (pix >= lb) & (pix < ub) rs = result[mask] - self.assertTrue((rs == ts[lb]).all()) + assert (rs == ts[lb]).all() ts[5:10] = np.nan ts[15:20] = np.nan @@ -120,19 +118,19 @@ def test_periodindex(self): val1 = ts.asof(ts.index[7]) val2 = ts.asof(ts.index[19]) - self.assertEqual(val1, ts[4]) - self.assertEqual(val2, ts[14]) + assert val1 == ts[4] + assert val2 == ts[14] # accepts strings val1 = ts.asof(str(ts.index[7])) - self.assertEqual(val1, ts[4]) + assert val1 == ts[4] # in there - self.assertEqual(ts.asof(ts.index[3]), ts[3]) + assert ts.asof(ts.index[3]) == ts[3] # no as of value d = ts.index[0].to_timestamp() - offsets.BDay() - self.assertTrue(isnull(ts.asof(d))) + assert isnull(ts.asof(d)) def test_errors(self): @@ -142,17 +140,39 @@ def test_errors(self): Timestamp('20130102')]) # non-monotonic - self.assertFalse(s.index.is_monotonic) - with self.assertRaises(ValueError): + assert not s.index.is_monotonic + with pytest.raises(ValueError): s.asof(s.index[0]) # subset with Series N = 10 rng = date_range('1/1/1990', periods=N, freq='53s') s = Series(np.random.randn(N), index=rng) - with self.assertRaises(ValueError): + with pytest.raises(ValueError): s.asof(s.index[0], subset='foo') -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + def test_all_nans(self): + # GH 15713 + # series is all nans + result = Series([np.nan]).asof([0]) + expected = Series([np.nan]) + tm.assert_series_equal(result, expected) + + # testing non-default indexes + N = 50 + rng = date_range('1/1/1990', periods=N, freq='53s') + + dates = date_range('1/1/1990', periods=N * 3, freq='25s') + result = Series(np.nan, index=rng).asof(dates) + expected = Series(np.nan, index=dates) + tm.assert_series_equal(result, expected) + + # testing scalar input + date = date_range('1/1/1990', periods=N * 3, freq='25s')[0] + result = Series(np.nan, index=rng).asof(date) + assert isnull(result) + + 
# test name is propagated + result = Series(np.nan, index=[1, 2, 3, 4], name='test').asof([4, 5]) + expected = Series(np.nan, index=[4, 5], name='test') + tm.assert_series_equal(result, expected) diff --git a/pandas/tests/series/test_combine_concat.py b/pandas/tests/series/test_combine_concat.py index 23261c2ef79e2..bb998b7fa55dd 100644 --- a/pandas/tests/series/test_combine_concat.py +++ b/pandas/tests/series/test_combine_concat.py @@ -1,13 +1,15 @@ # coding=utf-8 # pylint: disable-msg=E1101,W0612 +import pytest + from datetime import datetime from numpy import nan import numpy as np import pandas as pd -from pandas import Series, DataFrame +from pandas import Series, DataFrame, date_range, DatetimeIndex from pandas import compat from pandas.util.testing import assert_series_equal @@ -16,22 +18,20 @@ from .common import TestData -class TestSeriesCombine(TestData, tm.TestCase): - - _multiprocess_can_split_ = True +class TestSeriesCombine(TestData): def test_append(self): appendedSeries = self.series.append(self.objSeries) for idx, value in compat.iteritems(appendedSeries): if idx in self.series.index: - self.assertEqual(value, self.series[idx]) + assert value == self.series[idx] elif idx in self.objSeries.index: - self.assertEqual(value, self.objSeries[idx]) + assert value == self.objSeries[idx] else: self.fail("orphaned index!") - self.assertRaises(ValueError, self.ts.append, self.ts, - verify_integrity=True) + pytest.raises(ValueError, self.ts.append, self.ts, + verify_integrity=True) def test_append_many(self): pieces = [self.ts[:5], self.ts[5:10], self.ts[10:]] @@ -55,9 +55,9 @@ def test_append_duplicates(self): exp, check_index_type=True) msg = 'Indexes have overlapping values:' - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): s1.append(s2, verify_integrity=True) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): pd.concat([s1, s2], verify_integrity=True) def test_combine_first(self): @@ -70,14 +70,14 @@ def test_combine_first(self): # nothing used from the input combined = series.combine_first(series_copy) - self.assert_series_equal(combined, series) + tm.assert_series_equal(combined, series) # Holes filled from input combined = series_copy.combine_first(series) - self.assertTrue(np.isfinite(combined).all()) + assert np.isfinite(combined).all() - self.assert_series_equal(combined[::2], series[::2]) - self.assert_series_equal(combined[1::2], series_copy[1::2]) + tm.assert_series_equal(combined[::2], series[::2]) + tm.assert_series_equal(combined[1::2], series_copy[1::2]) # mixed types index = tm.makeStringIndex(20) @@ -117,9 +117,9 @@ def test_concat_empty_series_dtypes_roundtrips(self): 'M8[ns]']) for dtype in dtypes: - self.assertEqual(pd.concat([Series(dtype=dtype)]).dtype, dtype) - self.assertEqual(pd.concat([Series(dtype=dtype), - Series(dtype=dtype)]).dtype, dtype) + assert pd.concat([Series(dtype=dtype)]).dtype == dtype + assert pd.concat([Series(dtype=dtype), + Series(dtype=dtype)]).dtype == dtype def int_result_type(dtype, dtype2): typs = set([dtype.kind, dtype2.kind]) @@ -155,58 +155,55 @@ def get_result_type(dtype, dtype2): expected = get_result_type(dtype, dtype2) result = pd.concat([Series(dtype=dtype), Series(dtype=dtype2) ]).dtype - self.assertEqual(result.kind, expected) + assert result.kind == expected def test_concat_empty_series_dtypes(self): - # bools - self.assertEqual(pd.concat([Series(dtype=np.bool_), - Series(dtype=np.int32)]).dtype, np.int32) - 
self.assertEqual(pd.concat([Series(dtype=np.bool_), - Series(dtype=np.float32)]).dtype, - np.object_) - - # datetimelike - self.assertEqual(pd.concat([Series(dtype='m8[ns]'), - Series(dtype=np.bool)]).dtype, np.object_) - self.assertEqual(pd.concat([Series(dtype='m8[ns]'), - Series(dtype=np.int64)]).dtype, np.object_) - self.assertEqual(pd.concat([Series(dtype='M8[ns]'), - Series(dtype=np.bool)]).dtype, np.object_) - self.assertEqual(pd.concat([Series(dtype='M8[ns]'), - Series(dtype=np.int64)]).dtype, np.object_) - self.assertEqual(pd.concat([Series(dtype='M8[ns]'), - Series(dtype=np.bool_), - Series(dtype=np.int64)]).dtype, np.object_) + # booleans + assert pd.concat([Series(dtype=np.bool_), + Series(dtype=np.int32)]).dtype == np.int32 + assert pd.concat([Series(dtype=np.bool_), + Series(dtype=np.float32)]).dtype == np.object_ + + # datetime-like + assert pd.concat([Series(dtype='m8[ns]'), + Series(dtype=np.bool)]).dtype == np.object_ + assert pd.concat([Series(dtype='m8[ns]'), + Series(dtype=np.int64)]).dtype == np.object_ + assert pd.concat([Series(dtype='M8[ns]'), + Series(dtype=np.bool)]).dtype == np.object_ + assert pd.concat([Series(dtype='M8[ns]'), + Series(dtype=np.int64)]).dtype == np.object_ + assert pd.concat([Series(dtype='M8[ns]'), + Series(dtype=np.bool_), + Series(dtype=np.int64)]).dtype == np.object_ # categorical - self.assertEqual(pd.concat([Series(dtype='category'), - Series(dtype='category')]).dtype, - 'category') - self.assertEqual(pd.concat([Series(dtype='category'), - Series(dtype='float64')]).dtype, - 'float64') - self.assertEqual(pd.concat([Series(dtype='category'), - Series(dtype='object')]).dtype, 'object') + assert pd.concat([Series(dtype='category'), + Series(dtype='category')]).dtype == 'category' + assert pd.concat([Series(dtype='category'), + Series(dtype='float64')]).dtype == 'float64' + assert pd.concat([Series(dtype='category'), + Series(dtype='object')]).dtype == 'object' # sparse result = pd.concat([Series(dtype='float64').to_sparse(), Series( dtype='float64').to_sparse()]) - self.assertEqual(result.dtype, np.float64) - self.assertEqual(result.ftype, 'float64:sparse') + assert result.dtype == np.float64 + assert result.ftype == 'float64:sparse' result = pd.concat([Series(dtype='float64').to_sparse(), Series( dtype='float64')]) - self.assertEqual(result.dtype, np.float64) - self.assertEqual(result.ftype, 'float64:sparse') + assert result.dtype == np.float64 + assert result.ftype == 'float64:sparse' result = pd.concat([Series(dtype='float64').to_sparse(), Series( dtype='object')]) - self.assertEqual(result.dtype, np.object_) - self.assertEqual(result.ftype, 'object:dense') + assert result.dtype == np.object_ + assert result.ftype == 'object:dense' def test_combine_first_dt64(self): - from pandas.tseries.tools import to_datetime + from pandas.core.tools.datetimes import to_datetime s0 = to_datetime(Series(["2010", np.NaN])) s1 = to_datetime(Series([np.NaN, "2011"])) rs = s0.combine_first(s1) @@ -218,3 +215,101 @@ def test_combine_first_dt64(self): rs = s0.combine_first(s1) xp = Series([datetime(2010, 1, 1), '2011']) assert_series_equal(rs, xp) + + +class TestTimeseries(object): + + def test_append_concat(self): + rng = date_range('5/8/2012 1:45', periods=10, freq='5T') + ts = Series(np.random.randn(len(rng)), rng) + df = DataFrame(np.random.randn(len(rng), 4), index=rng) + + result = ts.append(ts) + result_df = df.append(df) + ex_index = DatetimeIndex(np.tile(rng.values, 2)) + tm.assert_index_equal(result.index, ex_index) + 
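The append expectations being set up here follow from how a plain append concatenates row labels: the result's index is the input indexes laid end to end, duplicates preserved. A sketch under the same-era API (`Series.append` was later deprecated in favour of `pd.concat`):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('5/8/2012 1:45', periods=3, freq='5T')
ts = pd.Series(np.random.randn(len(rng)), rng)

# appending a series to itself simply repeats the index, duplicates and all
result = ts.append(ts)
expected_index = pd.DatetimeIndex(np.tile(rng.values, 2))
assert result.index.equals(expected_index)
```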
tm.assert_index_equal(result_df.index, ex_index) + + appended = rng.append(rng) + tm.assert_index_equal(appended, ex_index) + + appended = rng.append([rng, rng]) + ex_index = DatetimeIndex(np.tile(rng.values, 3)) + tm.assert_index_equal(appended, ex_index) + + # different index names + rng1 = rng.copy() + rng2 = rng.copy() + rng1.name = 'foo' + rng2.name = 'bar' + assert rng1.append(rng1).name == 'foo' + assert rng1.append(rng2).name is None + + def test_append_concat_tz(self): + # GH 2938 + tm._skip_if_no_pytz() + + rng = date_range('5/8/2012 1:45', periods=10, freq='5T', + tz='US/Eastern') + rng2 = date_range('5/8/2012 2:35', periods=10, freq='5T', + tz='US/Eastern') + rng3 = date_range('5/8/2012 1:45', periods=20, freq='5T', + tz='US/Eastern') + ts = Series(np.random.randn(len(rng)), rng) + df = DataFrame(np.random.randn(len(rng), 4), index=rng) + ts2 = Series(np.random.randn(len(rng2)), rng2) + df2 = DataFrame(np.random.randn(len(rng2), 4), index=rng2) + + result = ts.append(ts2) + result_df = df.append(df2) + tm.assert_index_equal(result.index, rng3) + tm.assert_index_equal(result_df.index, rng3) + + appended = rng.append(rng2) + tm.assert_index_equal(appended, rng3) + + def test_append_concat_tz_explicit_pytz(self): + # GH 2938 + tm._skip_if_no_pytz() + from pytz import timezone as timezone + + rng = date_range('5/8/2012 1:45', periods=10, freq='5T', + tz=timezone('US/Eastern')) + rng2 = date_range('5/8/2012 2:35', periods=10, freq='5T', + tz=timezone('US/Eastern')) + rng3 = date_range('5/8/2012 1:45', periods=20, freq='5T', + tz=timezone('US/Eastern')) + ts = Series(np.random.randn(len(rng)), rng) + df = DataFrame(np.random.randn(len(rng), 4), index=rng) + ts2 = Series(np.random.randn(len(rng2)), rng2) + df2 = DataFrame(np.random.randn(len(rng2), 4), index=rng2) + + result = ts.append(ts2) + result_df = df.append(df2) + tm.assert_index_equal(result.index, rng3) + tm.assert_index_equal(result_df.index, rng3) + + appended = rng.append(rng2) + tm.assert_index_equal(appended, rng3) + + def test_append_concat_tz_dateutil(self): + # GH 2938 + tm._skip_if_no_dateutil() + rng = date_range('5/8/2012 1:45', periods=10, freq='5T', + tz='dateutil/US/Eastern') + rng2 = date_range('5/8/2012 2:35', periods=10, freq='5T', + tz='dateutil/US/Eastern') + rng3 = date_range('5/8/2012 1:45', periods=20, freq='5T', + tz='dateutil/US/Eastern') + ts = Series(np.random.randn(len(rng)), rng) + df = DataFrame(np.random.randn(len(rng), 4), index=rng) + ts2 = Series(np.random.randn(len(rng2)), rng2) + df2 = DataFrame(np.random.randn(len(rng2), 4), index=rng2) + + result = ts.append(ts2) + result_df = df.append(df2) + tm.assert_index_equal(result.index, rng3) + tm.assert_index_equal(result_df.index, rng3) + + appended = rng.append(rng2) + tm.assert_index_equal(appended, rng3) diff --git a/pandas/tests/series/test_constructors.py b/pandas/tests/series/test_constructors.py index ed7b0fda19cb7..d591aa4f567a9 100644 --- a/pandas/tests/series/test_constructors.py +++ b/pandas/tests/series/test_constructors.py @@ -1,6 +1,8 @@ # coding=utf-8 # pylint: disable-msg=E1101,W0612 +import pytest + from datetime import datetime, timedelta from numpy import nan @@ -8,12 +10,15 @@ import numpy.ma as ma import pandas as pd -from pandas.types.common import is_categorical_dtype, is_datetime64tz_dtype -from pandas import Index, Series, isnull, date_range, period_range -from pandas.core.index import MultiIndex -from pandas.tseries.index import Timestamp, DatetimeIndex +from pandas.core.dtypes.common import ( + is_categorical_dtype, + 
is_datetime64tz_dtype) +from pandas import (Index, Series, isnull, date_range, + NaT, period_range, MultiIndex, IntervalIndex) +from pandas.core.indexes.datetimes import Timestamp, DatetimeIndex -import pandas.lib as lib +from pandas._libs import lib +from pandas._libs.tslib import iNaT from pandas.compat import lrange, range, zip, OrderedDict, long from pandas import compat @@ -23,65 +28,56 @@ from .common import TestData -class TestSeriesConstructors(TestData, tm.TestCase): +class TestSeriesConstructors(TestData): - _multiprocess_can_split_ = True + def test_invalid_dtype(self): + # GH15520 + msg = 'not understood' + invalid_list = [pd.Timestamp, 'pd.Timestamp', list] + for dtype in invalid_list: + with tm.assert_raises_regex(TypeError, msg): + Series([], name='time', dtype=dtype) def test_scalar_conversion(self): # Pass in scalar is disabled scalar = Series(0.5) - self.assertNotIsInstance(scalar, float) - - # coercion - self.assertEqual(float(Series([1.])), 1.0) - self.assertEqual(int(Series([1.])), 1) - self.assertEqual(long(Series([1.])), 1) + assert not isinstance(scalar, float) - def test_TimeSeries_deprecation(self): - - # deprecation TimeSeries, #10890 - with tm.assert_produces_warning(FutureWarning): - pd.TimeSeries(1, index=date_range('20130101', periods=3)) + # Coercion + assert float(Series([1.])) == 1.0 + assert int(Series([1.])) == 1 + assert long(Series([1.])) == 1 def test_constructor(self): - # Recognize TimeSeries - with tm.assert_produces_warning(FutureWarning): - self.assertTrue(self.ts.is_time_series) - self.assertTrue(self.ts.index.is_all_dates) + assert self.ts.index.is_all_dates # Pass in Series derived = Series(self.ts) - with tm.assert_produces_warning(FutureWarning): - self.assertTrue(derived.is_time_series) - self.assertTrue(derived.index.is_all_dates) + assert derived.index.is_all_dates - self.assertTrue(tm.equalContents(derived.index, self.ts.index)) + assert tm.equalContents(derived.index, self.ts.index) # Ensure new index is not created - self.assertEqual(id(self.ts.index), id(derived.index)) + assert id(self.ts.index) == id(derived.index) # Mixed type Series mixed = Series(['hello', np.NaN], index=[0, 1]) - self.assertEqual(mixed.dtype, np.object_) - self.assertIs(mixed[1], np.NaN) + assert mixed.dtype == np.object_ + assert mixed[1] is np.NaN - with tm.assert_produces_warning(FutureWarning): - self.assertFalse(self.empty.is_time_series) - self.assertFalse(self.empty.index.is_all_dates) - with tm.assert_produces_warning(FutureWarning): - self.assertFalse(Series({}).is_time_series) - self.assertFalse(Series({}).index.is_all_dates) - self.assertRaises(Exception, Series, np.random.randn(3, 3), - index=np.arange(3)) + assert not self.empty.index.is_all_dates + assert not Series({}).index.is_all_dates + pytest.raises(Exception, Series, np.random.randn(3, 3), + index=np.arange(3)) mixed.name = 'Series' rs = Series(mixed).name xp = 'Series' - self.assertEqual(rs, xp) + assert rs == xp # raise on MultiIndex GH4187 m = MultiIndex.from_arrays([[1, 2], [3, 4]]) - self.assertRaises(NotImplementedError, Series, m) + pytest.raises(NotImplementedError, Series, m) def test_constructor_empty(self): empty = Series() @@ -152,15 +148,15 @@ def test_constructor_categorical(self): tm.assert_categorical_equal(res.values, cat) # GH12574 - self.assertRaises( + pytest.raises( ValueError, lambda: Series(pd.Categorical([1, 2, 3]), dtype='int64')) cat = Series(pd.Categorical([1, 2, 3]), dtype='category') - self.assertTrue(is_categorical_dtype(cat)) - 
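The GH12574 check above pins down a constructor rule worth seeing in isolation: a `Categorical` refuses a conflicting dtype instead of silently casting. A standalone sketch:

```python
import pandas as pd
import pytest

# asking for int64 on categorical data raises rather than casting (GH12574)
with pytest.raises(ValueError):
    pd.Series(pd.Categorical([1, 2, 3]), dtype='int64')

# a categorical (or omitted) dtype is the only accepted spelling
s = pd.Series(pd.Categorical([1, 2, 3]), dtype='category')
assert s.dtype == 'category'
```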
self.assertTrue(is_categorical_dtype(cat.dtype))
+ assert is_categorical_dtype(cat)
+ assert is_categorical_dtype(cat.dtype)
s = Series([1, 2, 3], dtype='category')
- self.assertTrue(is_categorical_dtype(s))
- self.assertTrue(is_categorical_dtype(s.dtype))
+ assert is_categorical_dtype(s)
+ assert is_categorical_dtype(s.dtype)
def test_constructor_maskedarray(self):
data = ma.masked_all((3, ), dtype=float)
@@ -214,17 +210,16 @@ def test_constructor_maskedarray(self):
expected = Series([True, True, False], index=index, dtype=bool)
assert_series_equal(result, expected)
- from pandas import tslib
data = ma.masked_all((3, ), dtype='M8[ns]')
result = Series(data)
- expected = Series([tslib.iNaT, tslib.iNaT, tslib.iNaT], dtype='M8[ns]')
+ expected = Series([iNaT, iNaT, iNaT], dtype='M8[ns]')
assert_series_equal(result, expected)
data[0] = datetime(2001, 1, 1)
data[2] = datetime(2001, 1, 3)
index = ['a', 'b', 'c']
result = Series(data, index=index)
- expected = Series([datetime(2001, 1, 1), tslib.iNaT,
+ expected = Series([datetime(2001, 1, 1), iNaT,
datetime(2001, 1, 3)], index=index, dtype='M8[ns]')
assert_series_equal(result, expected)
@@ -234,6 +229,13 @@ def test_constructor_maskedarray(self):
datetime(2001, 1, 3)], index=index, dtype='M8[ns]')
assert_series_equal(result, expected)
+ def test_series_ctor_plus_datetimeindex(self):
+ rng = date_range('20090415', '20090519', freq='B')
+ data = dict((k, 1) for k in rng)
+
+ result = Series(data, index=rng)
+ assert result.index is rng
+
def test_constructor_default_index(self):
s = Series([0, 1, 2])
tm.assert_index_equal(s.index, pd.Index(np.arange(3)))
@@ -242,21 +244,37 @@ def test_constructor_corner(self):
df = tm.makeTimeDataFrame()
objs = [df, df]
s = Series(objs, index=[0, 1])
- tm.assertIsInstance(s, Series)
+ assert isinstance(s, Series)
def test_constructor_sanitize(self):
s = Series(np.array([1., 1., 8.]), dtype='i8')
- self.assertEqual(s.dtype, np.dtype('i8'))
+ assert s.dtype == np.dtype('i8')
s = Series(np.array([1., 1., np.nan]), copy=True, dtype='i8')
- self.assertEqual(s.dtype, np.dtype('f8'))
+ assert s.dtype == np.dtype('f8')
+
+ def test_constructor_copy(self):
+ # GH15125
+ # test that the dtype parameter has no side effects on copy=True
+ for data in [[1.], np.array([1.])]:
+ x = Series(data)
+ y = pd.Series(x, copy=True, dtype=float)
+
+ # copy=True maintains the original data in the Series
+ tm.assert_series_equal(x, y)
+
+ # changes to the original do not affect the copy
+ x[0] = 2.
+ assert not x.equals(y)
+ assert x[0] == 2.
+ assert y[0] == 1.
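Distilled from `test_constructor_copy` above, the `copy=True` contract on its own (a free-standing snippet, not part of the suite):

```python
import pandas as pd

x = pd.Series([1.])
y = pd.Series(x, copy=True, dtype=float)

# with copy=True the new series owns its data, so mutating the
# original leaves the copy untouched
x[0] = 2.
assert x[0] == 2.
assert y[0] == 1.
```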
def test_constructor_pass_none(self):
s = Series(None, index=lrange(5))
- self.assertEqual(s.dtype, np.float64)
+ assert s.dtype == np.float64
s = Series(None, index=lrange(5), dtype=object)
- self.assertEqual(s.dtype, np.object_)
+ assert s.dtype == np.object_
# GH 7431
# inference on the index
@@ -267,12 +285,12 @@ def test_constructor_pass_nan_nat(self):
# GH 13467
exp = Series([np.nan, np.nan], dtype=np.float64)
- self.assertEqual(exp.dtype, np.float64)
+ assert exp.dtype == np.float64
tm.assert_series_equal(Series([np.nan, np.nan]), exp)
tm.assert_series_equal(Series(np.array([np.nan, np.nan])), exp)
exp = Series([pd.NaT, pd.NaT])
- self.assertEqual(exp.dtype, 'datetime64[ns]')
+ assert exp.dtype == 'datetime64[ns]'
tm.assert_series_equal(Series([pd.NaT, pd.NaT]), exp)
tm.assert_series_equal(Series(np.array([pd.NaT, pd.NaT])), exp)
@@ -283,7 +301,7 @@ def test_constructor_pass_nan_nat(self):
tm.assert_series_equal(Series(np.array([np.nan, pd.NaT])), exp)
def test_constructor_cast(self):
- self.assertRaises(ValueError, Series, ['a', 'b', 'c'], dtype=float)
+ pytest.raises(ValueError, Series, ['a', 'b', 'c'], dtype=float)
def test_constructor_dtype_nocast(self):
# 1572
@@ -292,7 +310,7 @@ def test_constructor_dtype_nocast(self):
s2 = Series(s, dtype=np.int64)
s2[1] = 5
- self.assertEqual(s[1], 5)
+ assert s[1] == 5
def test_constructor_datelike_coercion(self):
@@ -300,9 +318,9 @@ def test_constructor_datelike_coercion(self):
# incorrectly inferring as datetimelike-looking when object dtype is
# specified
s = Series([Timestamp('20130101'), 'NOV'], dtype=object)
- self.assertEqual(s.iloc[0], Timestamp('20130101'))
- self.assertEqual(s.iloc[1], 'NOV')
- self.assertTrue(s.dtype == object)
+ assert s.iloc[0] == Timestamp('20130101')
+ assert s.iloc[1] == 'NOV'
+ assert s.dtype == object
# the dtype was being reset on the slicing and re-inferred to datetime
# even though the blocks are mixed
@@ -316,31 +334,38 @@ def test_constructor_datelike_coercion(self):
'mat': mat}, index=belly)
result = df.loc['3T19']
- self.assertTrue(result.dtype == object)
+ assert result.dtype == object
result = df.loc['216']
- self.assertTrue(result.dtype == object)
+ assert result.dtype == object
+
+ def test_constructor_datetimes_with_nulls(self):
+ # gh-15869
+ for arr in [np.array([None, None, None, None,
+ datetime.now(), None]),
+ np.array([None, None, datetime.now(), None])]:
+ result = Series(arr)
+ assert result.dtype == 'M8[ns]'
def test_constructor_dtype_datetime64(self):
- import pandas.tslib as tslib
- s = Series(tslib.iNaT, dtype='M8[ns]', index=lrange(5))
- self.assertTrue(isnull(s).all())
+ s = Series(iNaT, dtype='M8[ns]', index=lrange(5))
+ assert isnull(s).all()
# in theory this should be all nulls, but since
# we are not specifying a dtype it is ambiguous
- s = Series(tslib.iNaT, index=lrange(5))
- self.assertFalse(isnull(s).all())
+ s = Series(iNaT, index=lrange(5))
+ assert not isnull(s).all()
s = Series(nan, dtype='M8[ns]', index=lrange(5))
- self.assertTrue(isnull(s).all())
+ assert isnull(s).all()
- s = Series([datetime(2001, 1, 2, 0, 0), tslib.iNaT], dtype='M8[ns]')
- self.assertTrue(isnull(s[1]))
- self.assertEqual(s.dtype, 'M8[ns]')
+ s = Series([datetime(2001, 1, 2, 0, 0), iNaT], dtype='M8[ns]')
+ assert isnull(s[1])
+ assert s.dtype == 'M8[ns]'
s = Series([datetime(2001, 1, 2, 0, 0), nan], dtype='M8[ns]')
- self.assertTrue(isnull(s[1]))
- self.assertEqual(s.dtype, 'M8[ns]')
+ assert isnull(s[1])
+ assert s.dtype == 'M8[ns]'
# GH3416
dates = [
@@ -350,32
+375,32 @@ def test_constructor_dtype_datetime64(self):
]
s = Series(dates)
- self.assertEqual(s.dtype, 'M8[ns]')
+ assert s.dtype == 'M8[ns]'
- s.ix[0] = np.nan
- self.assertEqual(s.dtype, 'M8[ns]')
+ s.iloc[0] = np.nan
+ assert s.dtype == 'M8[ns]'
# invalid astypes
for t in ['s', 'D', 'us', 'ms']:
- self.assertRaises(TypeError, s.astype, 'M8[%s]' % t)
+ pytest.raises(TypeError, s.astype, 'M8[%s]' % t)
# GH3414 related
- self.assertRaises(TypeError, lambda x: Series(
+ pytest.raises(TypeError, lambda x: Series(
Series(dates).astype('int') / 1000000, dtype='M8[ms]'))
- self.assertRaises(TypeError,
- lambda x: Series(dates, dtype='datetime64'))
+ pytest.raises(TypeError,
+ lambda x: Series(dates, dtype='datetime64'))
# invalid dates can be held as object
result = Series([datetime(2, 1, 1)])
- self.assertEqual(result[0], datetime(2, 1, 1, 0, 0))
+ assert result[0] == datetime(2, 1, 1, 0, 0)
result = Series([datetime(3000, 1, 1)])
- self.assertEqual(result[0], datetime(3000, 1, 1, 0, 0))
+ assert result[0] == datetime(3000, 1, 1, 0, 0)
# don't mix types
result = Series([Timestamp('20130101'), 1], index=['a', 'b'])
- self.assertEqual(result['a'], Timestamp('20130101'))
- self.assertEqual(result['b'], 1)
+ assert result['a'] == Timestamp('20130101')
+ assert result['b'] == 1
# GH6529
# coerce datetime64 non-ns properly
@@ -400,45 +425,45 @@ def test_constructor_dtype_datetime64(self):
dates2 = np.array([d.date() for d in dates.to_pydatetime()],
dtype=object)
series1 = Series(dates2, dates)
- self.assert_numpy_array_equal(series1.values, dates2)
- self.assertEqual(series1.dtype, object)
+ tm.assert_numpy_array_equal(series1.values, dates2)
+ assert series1.dtype == object
# these will correctly infer a datetime
s = Series([None, pd.NaT, '2013-08-05 15:30:00.000001'])
- self.assertEqual(s.dtype, 'datetime64[ns]')
+ assert s.dtype == 'datetime64[ns]'
s = Series([np.nan, pd.NaT, '2013-08-05 15:30:00.000001'])
- self.assertEqual(s.dtype, 'datetime64[ns]')
+ assert s.dtype == 'datetime64[ns]'
s = Series([pd.NaT, None, '2013-08-05 15:30:00.000001'])
- self.assertEqual(s.dtype, 'datetime64[ns]')
+ assert s.dtype == 'datetime64[ns]'
s = Series([pd.NaT, np.nan, '2013-08-05 15:30:00.000001'])
- self.assertEqual(s.dtype, 'datetime64[ns]')
+ assert s.dtype == 'datetime64[ns]'
# tz-aware (UTC and other tz's)
# GH 8411
dr = date_range('20130101', periods=3)
- self.assertTrue(Series(dr).iloc[0].tz is None)
+ assert Series(dr).iloc[0].tz is None
dr = date_range('20130101', periods=3, tz='UTC')
- self.assertTrue(str(Series(dr).iloc[0].tz) == 'UTC')
+ assert str(Series(dr).iloc[0].tz) == 'UTC'
dr = date_range('20130101', periods=3, tz='US/Eastern')
- self.assertTrue(str(Series(dr).iloc[0].tz) == 'US/Eastern')
+ assert str(Series(dr).iloc[0].tz) == 'US/Eastern'
# non-convertible
s = Series([1479596223000, -1479590, pd.NaT])
- self.assertTrue(s.dtype == 'object')
- self.assertTrue(s[2] is pd.NaT)
- self.assertTrue('NaT' in str(s))
+ assert s.dtype == 'object'
+ assert s[2] is pd.NaT
+ assert 'NaT' in str(s)
# if we passed a NaT it remains
s = Series([datetime(2010, 1, 1), datetime(2, 1, 1), pd.NaT])
- self.assertTrue(s.dtype == 'object')
- self.assertTrue(s[2] is pd.NaT)
- self.assertTrue('NaT' in str(s))
+ assert s.dtype == 'object'
+ assert s[2] is pd.NaT
+ assert 'NaT' in str(s)
# if we passed a nan it remains
s = Series([datetime(2010, 1, 1), datetime(2, 1, 1), np.nan])
- self.assertTrue(s.dtype == 'object')
- self.assertTrue(s[2] is np.nan)
- self.assertTrue('NaN' in str(s))
+ assert s.dtype == 'object'
+
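The object-dtype fallback asserted at the end of this hunk is easy to check on its own: when the values cannot all be coerced to `datetime64[ns]`, pandas keeps object dtype and an explicit NaT survives untouched. An illustrative sketch:

```python
import pandas as pd

# the large ints are not convertible alongside NaT, so dtype stays object
s = pd.Series([1479596223000, -1479590, pd.NaT])
assert s.dtype == 'object'
assert s[2] is pd.NaT
```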
assert s[2] is np.nan + assert 'NaN' in str(s) def test_constructor_with_datetime_tz(self): @@ -447,27 +472,27 @@ def test_constructor_with_datetime_tz(self): dr = date_range('20130101', periods=3, tz='US/Eastern') s = Series(dr) - self.assertTrue(s.dtype.name == 'datetime64[ns, US/Eastern]') - self.assertTrue(s.dtype == 'datetime64[ns, US/Eastern]') - self.assertTrue(is_datetime64tz_dtype(s.dtype)) - self.assertTrue('datetime64[ns, US/Eastern]' in str(s)) + assert s.dtype.name == 'datetime64[ns, US/Eastern]' + assert s.dtype == 'datetime64[ns, US/Eastern]' + assert is_datetime64tz_dtype(s.dtype) + assert 'datetime64[ns, US/Eastern]' in str(s) # export result = s.values - self.assertIsInstance(result, np.ndarray) - self.assertTrue(result.dtype == 'datetime64[ns]') + assert isinstance(result, np.ndarray) + assert result.dtype == 'datetime64[ns]' exp = pd.DatetimeIndex(result) exp = exp.tz_localize('UTC').tz_convert(tz=s.dt.tz) - self.assert_index_equal(dr, exp) + tm.assert_index_equal(dr, exp) # indexing result = s.iloc[0] - self.assertEqual(result, Timestamp('2013-01-01 00:00:00-0500', - tz='US/Eastern', freq='D')) + assert result == Timestamp('2013-01-01 00:00:00-0500', + tz='US/Eastern', freq='D') result = s[0] - self.assertEqual(result, Timestamp('2013-01-01 00:00:00-0500', - tz='US/Eastern', freq='D')) + assert result == Timestamp('2013-01-01 00:00:00-0500', + tz='US/Eastern', freq='D') result = s[Series([True, True, False], index=s.index)] assert_series_equal(result, s[0:2]) @@ -499,16 +524,16 @@ def test_constructor_with_datetime_tz(self): assert_series_equal(result, expected) # short str - self.assertTrue('datetime64[ns, US/Eastern]' in str(s)) + assert 'datetime64[ns, US/Eastern]' in str(s) # formatting with NaT result = s.shift() - self.assertTrue('datetime64[ns, US/Eastern]' in str(result)) - self.assertTrue('NaT' in str(result)) + assert 'datetime64[ns, US/Eastern]' in str(result) + assert 'NaT' in str(result) # long str t = Series(date_range('20130101', periods=1000, tz='US/Eastern')) - self.assertTrue('datetime64[ns, US/Eastern]' in str(t)) + assert 'datetime64[ns, US/Eastern]' in str(t) result = pd.DatetimeIndex(s, freq='infer') tm.assert_index_equal(result, dr) @@ -516,19 +541,45 @@ def test_constructor_with_datetime_tz(self): # inference s = Series([pd.Timestamp('2013-01-01 13:00:00-0800', tz='US/Pacific'), pd.Timestamp('2013-01-02 14:00:00-0800', tz='US/Pacific')]) - self.assertTrue(s.dtype == 'datetime64[ns, US/Pacific]') - self.assertTrue(lib.infer_dtype(s) == 'datetime64') + assert s.dtype == 'datetime64[ns, US/Pacific]' + assert lib.infer_dtype(s) == 'datetime64' s = Series([pd.Timestamp('2013-01-01 13:00:00-0800', tz='US/Pacific'), pd.Timestamp('2013-01-02 14:00:00-0800', tz='US/Eastern')]) - self.assertTrue(s.dtype == 'object') - self.assertTrue(lib.infer_dtype(s) == 'datetime') + assert s.dtype == 'object' + assert lib.infer_dtype(s) == 'datetime' # with all NaT s = Series(pd.NaT, index=[0, 1], dtype='datetime64[ns, US/Eastern]') expected = Series(pd.DatetimeIndex(['NaT', 'NaT'], tz='US/Eastern')) assert_series_equal(s, expected) + def test_construction_interval(self): + # construction from interval & array of intervals + index = IntervalIndex.from_breaks(np.arange(3), closed='right') + result = Series(index) + repr(result) + str(result) + tm.assert_index_equal(Index(result.values), index) + + result = Series(index.values) + tm.assert_index_equal(Index(result.values), index) + + def test_construction_consistency(self): + + # make sure that we are not re-localizing 
upon construction + # GH 14928 + s = Series(pd.date_range('20130101', periods=3, tz='US/Eastern')) + + result = Series(s, dtype=s.dtype) + tm.assert_series_equal(result, s) + + result = Series(s.dt.tz_convert('UTC'), dtype=s.dtype) + tm.assert_series_equal(result, s) + + result = Series(s.values, dtype=s.dtype) + tm.assert_series_equal(result, s) + def test_constructor_periodindex(self): # GH7932 # converting a PeriodIndex when put in a Series @@ -538,7 +589,7 @@ def test_constructor_periodindex(self): expected = Series(pi.asobject) assert_series_equal(s, expected) - self.assertEqual(s.dtype, 'object') + assert s.dtype == 'object' def test_constructor_dict(self): d = {'a': 0., 'b': 1., 'c': 2.} @@ -550,8 +601,8 @@ def test_constructor_dict(self): d = {pidx[0]: 0, pidx[1]: 1} result = Series(d, index=pidx) expected = Series(np.nan, pidx) - expected.ix[0] = 0 - expected.ix[1] = 1 + expected.iloc[0] = 0 + expected.iloc[1] = 1 assert_series_equal(result, expected) def test_constructor_dict_multiindex(self): @@ -625,7 +676,7 @@ def test_orderedDict_ctor(self): import random data = OrderedDict([('col%s' % i, random.random()) for i in range(12)]) s = pandas.Series(data) - self.assertTrue(all(s.values == list(data.values()))) + assert all(s.values == list(data.values())) def test_orderedDict_subclass_ctor(self): # GH3283 @@ -637,145 +688,203 @@ class A(OrderedDict): data = A([('col%s' % i, random.random()) for i in range(12)]) s = pandas.Series(data) - self.assertTrue(all(s.values == list(data.values()))) + assert all(s.values == list(data.values())) def test_constructor_list_of_tuples(self): data = [(1, 1), (2, 2), (2, 3)] s = Series(data) - self.assertEqual(list(s), data) + assert list(s) == data def test_constructor_tuple_of_tuples(self): data = ((1, 1), (2, 2), (2, 3)) s = Series(data) - self.assertEqual(tuple(s), data) + assert tuple(s) == data def test_constructor_set(self): values = set([1, 2, 3, 4, 5]) - self.assertRaises(TypeError, Series, values) + pytest.raises(TypeError, Series, values) values = frozenset(values) - self.assertRaises(TypeError, Series, values) + pytest.raises(TypeError, Series, values) def test_fromDict(self): data = {'a': 0, 'b': 1, 'c': 2, 'd': 3} series = Series(data) - self.assertTrue(tm.is_sorted(series.index)) + assert tm.is_sorted(series.index) data = {'a': 0, 'b': '1', 'c': '2', 'd': datetime.now()} series = Series(data) - self.assertEqual(series.dtype, np.object_) + assert series.dtype == np.object_ data = {'a': 0, 'b': '1', 'c': '2', 'd': '3'} series = Series(data) - self.assertEqual(series.dtype, np.object_) + assert series.dtype == np.object_ data = {'a': '0', 'b': '1'} series = Series(data, dtype=float) - self.assertEqual(series.dtype, np.float64) + assert series.dtype == np.float64 def test_fromValue(self): nans = Series(np.NaN, index=self.ts.index) - self.assertEqual(nans.dtype, np.float_) - self.assertEqual(len(nans), len(self.ts)) + assert nans.dtype == np.float_ + assert len(nans) == len(self.ts) strings = Series('foo', index=self.ts.index) - self.assertEqual(strings.dtype, np.object_) - self.assertEqual(len(strings), len(self.ts)) + assert strings.dtype == np.object_ + assert len(strings) == len(self.ts) d = datetime.now() dates = Series(d, index=self.ts.index) - self.assertEqual(dates.dtype, 'M8[ns]') - self.assertEqual(len(dates), len(self.ts)) + assert dates.dtype == 'M8[ns]' + assert len(dates) == len(self.ts) # GH12336 # Test construction of categorical series from value categorical = Series(0, index=self.ts.index, dtype="category") expected = 
Series(0, index=self.ts.index).astype("category") - self.assertEqual(categorical.dtype, 'category') - self.assertEqual(len(categorical), len(self.ts)) + assert categorical.dtype == 'category' + assert len(categorical) == len(self.ts) tm.assert_series_equal(categorical, expected) def test_constructor_dtype_timedelta64(self): # basic td = Series([timedelta(days=i) for i in range(3)]) - self.assertEqual(td.dtype, 'timedelta64[ns]') + assert td.dtype == 'timedelta64[ns]' td = Series([timedelta(days=1)]) - self.assertEqual(td.dtype, 'timedelta64[ns]') + assert td.dtype == 'timedelta64[ns]' td = Series([timedelta(days=1), timedelta(days=2), np.timedelta64( 1, 's')]) - self.assertEqual(td.dtype, 'timedelta64[ns]') + assert td.dtype == 'timedelta64[ns]' # mixed with NaT - from pandas import tslib - td = Series([timedelta(days=1), tslib.NaT], dtype='m8[ns]') - self.assertEqual(td.dtype, 'timedelta64[ns]') + td = Series([timedelta(days=1), NaT], dtype='m8[ns]') + assert td.dtype == 'timedelta64[ns]' td = Series([timedelta(days=1), np.nan], dtype='m8[ns]') - self.assertEqual(td.dtype, 'timedelta64[ns]') + assert td.dtype == 'timedelta64[ns]' td = Series([np.timedelta64(300000000), pd.NaT], dtype='m8[ns]') - self.assertEqual(td.dtype, 'timedelta64[ns]') + assert td.dtype == 'timedelta64[ns]' # improved inference # GH5689 - td = Series([np.timedelta64(300000000), pd.NaT]) - self.assertEqual(td.dtype, 'timedelta64[ns]') + td = Series([np.timedelta64(300000000), NaT]) + assert td.dtype == 'timedelta64[ns]' # because iNaT is int, not coerced to timedelta - td = Series([np.timedelta64(300000000), tslib.iNaT]) - self.assertEqual(td.dtype, 'object') + td = Series([np.timedelta64(300000000), iNaT]) + assert td.dtype == 'object' td = Series([np.timedelta64(300000000), np.nan]) - self.assertEqual(td.dtype, 'timedelta64[ns]') + assert td.dtype == 'timedelta64[ns]' td = Series([pd.NaT, np.timedelta64(300000000)]) - self.assertEqual(td.dtype, 'timedelta64[ns]') + assert td.dtype == 'timedelta64[ns]' td = Series([np.timedelta64(1, 's')]) - self.assertEqual(td.dtype, 'timedelta64[ns]') + assert td.dtype == 'timedelta64[ns]' # these are frequency conversion astypes # for t in ['s', 'D', 'us', 'ms']: - # self.assertRaises(TypeError, td.astype, 'm8[%s]' % t) + # pytest.raises(TypeError, td.astype, 'm8[%s]' % t) # valid astype td.astype('int64') # invalid casting - self.assertRaises(TypeError, td.astype, 'int32') + pytest.raises(TypeError, td.astype, 'int32') # this is an invalid casting def f(): Series([timedelta(days=1), 'foo'], dtype='m8[ns]') - self.assertRaises(Exception, f) + pytest.raises(Exception, f) # leave as object here td = Series([timedelta(days=i) for i in range(3)] + ['foo']) - self.assertEqual(td.dtype, 'object') + assert td.dtype == 'object' # these will correctly infer a timedelta s = Series([None, pd.NaT, '1 Day']) - self.assertEqual(s.dtype, 'timedelta64[ns]') + assert s.dtype == 'timedelta64[ns]' s = Series([np.nan, pd.NaT, '1 Day']) - self.assertEqual(s.dtype, 'timedelta64[ns]') + assert s.dtype == 'timedelta64[ns]' s = Series([pd.NaT, None, '1 Day']) - self.assertEqual(s.dtype, 'timedelta64[ns]') + assert s.dtype == 'timedelta64[ns]' s = Series([pd.NaT, np.nan, '1 Day']) - self.assertEqual(s.dtype, 'timedelta64[ns]') + assert s.dtype == 'timedelta64[ns]' + + def test_NaT_scalar(self): + series = Series([0, 1000, 2000, iNaT], dtype='M8[ns]') + + val = series[3] + assert isnull(val) + + series[2] = val + assert isnull(series[2]) + + def test_NaT_cast(self): + # GH10747 + result = 
Series([np.nan]).astype('M8[ns]') + expected = Series([NaT]) + assert_series_equal(result, expected) def test_constructor_name_hashable(self): for n in [777, 777., 'name', datetime(2001, 11, 11), (1, ), u"\u05D0"]: for data in [[1, 2, 3], np.ones(3), {'a': 0, 'b': 1}]: s = Series(data, name=n) - self.assertEqual(s.name, n) + assert s.name == n def test_constructor_name_unhashable(self): for n in [['name_list'], np.ones(2), {1: 2}]: for data in [['name_list'], np.ones(2), {1: 2}]: - self.assertRaises(TypeError, Series, data, name=n) + pytest.raises(TypeError, Series, data, name=n) + + def test_auto_conversion(self): + series = Series(list(date_range('1/1/2000', periods=10))) + assert series.dtype == 'M8[ns]' + + def test_constructor_cant_cast_datetime64(self): + msg = "Cannot cast datetime64 to " + with tm.assert_raises_regex(TypeError, msg): + Series(date_range('1/1/2000', periods=10), dtype=float) + + with tm.assert_raises_regex(TypeError, msg): + Series(date_range('1/1/2000', periods=10), dtype=int) + + def test_constructor_cast_object(self): + s = Series(date_range('1/1/2000', periods=10), dtype=object) + exp = Series(date_range('1/1/2000', periods=10)) + tm.assert_series_equal(s, exp) + + def test_constructor_generic_timestamp_deprecated(self): + # see gh-15524 + + with tm.assert_produces_warning(FutureWarning): + dtype = np.timedelta64 + s = Series([], dtype=dtype) + + assert s.empty + assert s.dtype == 'm8[ns]' + + with tm.assert_produces_warning(FutureWarning): + dtype = np.datetime64 + s = Series([], dtype=dtype) + + assert s.empty + assert s.dtype == 'M8[ns]' + + # These timestamps have the wrong frequencies, + # so an Exception should be raised now. + msg = "cannot convert timedeltalike" + with tm.assert_raises_regex(TypeError, msg): + Series([], dtype='m8[ps]') + + msg = "cannot convert datetimelike" + with tm.assert_raises_regex(TypeError, msg): + Series([], dtype='M8[ps]') diff --git a/pandas/tests/series/test_datetime_values.py b/pandas/tests/series/test_datetime_values.py index ed441f2f85572..e1fc9af0cca89 100644 --- a/pandas/tests/series/test_datetime_values.py +++ b/pandas/tests/series/test_datetime_values.py @@ -1,17 +1,17 @@ # coding=utf-8 # pylint: disable-msg=E1101,W0612 +import pytest + from datetime import datetime, date import numpy as np import pandas as pd -from pandas.types.common import is_integer_dtype, is_list_like +from pandas.core.dtypes.common import is_integer_dtype, is_list_like from pandas import (Index, Series, DataFrame, bdate_range, - date_range, period_range, timedelta_range) -from pandas.tseries.period import PeriodIndex -from pandas.tseries.index import Timestamp, DatetimeIndex -from pandas.tseries.tdi import TimedeltaIndex + date_range, period_range, timedelta_range, + PeriodIndex, Timestamp, DatetimeIndex, TimedeltaIndex) import pandas.core.common as com from pandas.util.testing import assert_series_equal @@ -20,30 +20,20 @@ from .common import TestData -class TestSeriesDatetimeValues(TestData, tm.TestCase): - - _multiprocess_can_split_ = True +class TestSeriesDatetimeValues(TestData): def test_dt_namespace_accessor(self): # GH 7207, 11128 # test .dt namespace accessor - ok_for_base = ['year', 'month', 'day', 'hour', 'minute', 'second', - 'weekofyear', 'week', 'dayofweek', 'weekday', - 'dayofyear', 'quarter', 'freq', 'days_in_month', - 'daysinmonth', 'is_leap_year'] - ok_for_period = ok_for_base + ['qyear', 'start_time', 'end_time'] + ok_for_period = PeriodIndex._datetimelike_ops ok_for_period_methods = ['strftime', 'to_timestamp', 'asfreq'] - 
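Replacing the hand-maintained attribute lists with `PeriodIndex._datetimelike_ops` and friends keeps the `.dt` namespace tests in sync with the index classes the accessor mirrors. A quick sketch of that mirroring, using only public attribute names (the `_datetimelike_ops` internals are assumed, not shown):

```python
import pandas as pd

s = pd.Series(pd.date_range('20130101', periods=3, freq='D'))

# .dt exposes the same datetime-like ops a DatetimeIndex offers
assert list(s.dt.day) == [1, 2, 3]
assert s.dt.strftime('%Y/%m/%d')[0] == '2013/01/01'
```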
ok_for_dt = ok_for_base + ['date', 'time', 'microsecond', 'nanosecond', - 'is_month_start', 'is_month_end', - 'is_quarter_start', 'is_quarter_end', - 'is_year_start', 'is_year_end', 'tz', - 'weekday_name'] + ok_for_dt = DatetimeIndex._datetimelike_ops ok_for_dt_methods = ['to_period', 'to_pydatetime', 'tz_localize', 'tz_convert', 'normalize', 'strftime', 'round', 'floor', 'ceil', 'weekday_name'] - ok_for_td = ['days', 'seconds', 'microseconds', 'nanoseconds'] + ok_for_td = TimedeltaIndex._datetimelike_ops ok_for_td_methods = ['components', 'to_pytimedelta', 'total_seconds', 'round', 'floor', 'ceil'] @@ -60,7 +50,7 @@ def compare(s, name): a = getattr(s.dt, prop) b = get_expected(s, prop) if not (is_list_like(a) and is_list_like(b)): - self.assertEqual(a, b) + assert a == b else: tm.assert_series_equal(a, b) @@ -80,8 +70,8 @@ def compare(s, name): getattr(s.dt, prop) result = s.dt.to_pydatetime() - self.assertIsInstance(result, np.ndarray) - self.assertTrue(result.dtype == object) + assert isinstance(result, np.ndarray) + assert result.dtype == object result = s.dt.tz_localize('US/Eastern') exp_values = DatetimeIndex(s.values).tz_localize('US/Eastern') @@ -89,10 +79,9 @@ def compare(s, name): tm.assert_series_equal(result, expected) tz_result = result.dt.tz - self.assertEqual(str(tz_result), 'US/Eastern') + assert str(tz_result) == 'US/Eastern' freq_result = s.dt.freq - self.assertEqual(freq_result, DatetimeIndex(s.values, - freq='infer').freq) + assert freq_result == DatetimeIndex(s.values, freq='infer').freq # let's localize, then convert result = s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern') @@ -150,8 +139,8 @@ def compare(s, name): getattr(s.dt, prop) result = s.dt.to_pydatetime() - self.assertIsInstance(result, np.ndarray) - self.assertTrue(result.dtype == object) + assert isinstance(result, np.ndarray) + assert result.dtype == object result = s.dt.tz_convert('CET') expected = Series(s._values.tz_convert('CET'), @@ -159,18 +148,17 @@ def compare(s, name): tm.assert_series_equal(result, expected) tz_result = result.dt.tz - self.assertEqual(str(tz_result), 'CET') + assert str(tz_result) == 'CET' freq_result = s.dt.freq - self.assertEqual(freq_result, DatetimeIndex(s.values, - freq='infer').freq) + assert freq_result == DatetimeIndex(s.values, freq='infer').freq - # timedeltaindex + # timedelta index cases = [Series(timedelta_range('1 day', periods=5), index=list('abcde'), name='xxx'), Series(timedelta_range('1 day 01:23:45', periods=5, - freq='s'), name='xxx'), + freq='s'), name='xxx'), Series(timedelta_range('2 days 01:23:45.012345', periods=5, - freq='ms'), name='xxx')] + freq='ms'), name='xxx')] for s in cases: for prop in ok_for_td: # we test freq below @@ -181,20 +169,19 @@ def compare(s, name): getattr(s.dt, prop) result = s.dt.components - self.assertIsInstance(result, DataFrame) + assert isinstance(result, DataFrame) tm.assert_index_equal(result.index, s.index) result = s.dt.to_pytimedelta() - self.assertIsInstance(result, np.ndarray) - self.assertTrue(result.dtype == object) + assert isinstance(result, np.ndarray) + assert result.dtype == object result = s.dt.total_seconds() - self.assertIsInstance(result, pd.Series) - self.assertTrue(result.dtype == 'float64') + assert isinstance(result, pd.Series) + assert result.dtype == 'float64' freq_result = s.dt.freq - self.assertEqual(freq_result, TimedeltaIndex(s.values, - freq='infer').freq) + assert freq_result == TimedeltaIndex(s.values, freq='infer').freq # both index = date_range('20130101', periods=3, freq='D') @@ -228,7 
+215,7 @@ def compare(s, name): getattr(s.dt, prop) freq_result = s.dt.freq - self.assertEqual(freq_result, PeriodIndex(s.values).freq) + assert freq_result == PeriodIndex(s.values).freq # test limited display api def get_dir(s): @@ -261,7 +248,7 @@ def get_dir(s): # no setting allowed s = Series(date_range('20130101', periods=5, freq='D'), name='xxx') - with tm.assertRaisesRegexp(ValueError, "modifications"): + with tm.assert_raises_regex(ValueError, "modifications"): s.dt.hour = 5 # trying to set a copy @@ -270,13 +257,13 @@ def get_dir(s): def f(): s.dt.hour[0] = 5 - self.assertRaises(com.SettingWithCopyError, f) + pytest.raises(com.SettingWithCopyError, f) def test_dt_accessor_no_new_attributes(self): # https://github.com/pandas-dev/pandas/issues/10673 s = Series(date_range('20130101', periods=5, freq='D')) - with tm.assertRaisesRegexp(AttributeError, - "You cannot add any new attribute"): + with tm.assert_raises_regex(AttributeError, + "You cannot add any new attribute"): s.dt.xlabel = "a" def test_strftime(self): @@ -321,13 +308,13 @@ def test_strftime(self): expected = np.array(['2015/03/01', '2015/03/02', '2015/03/03', '2015/03/04', '2015/03/05'], dtype=np.object_) # dtype may be S10 or U10 depending on python version - self.assert_numpy_array_equal(result, expected, check_dtype=False) + tm.assert_numpy_array_equal(result, expected, check_dtype=False) period_index = period_range('20150301', periods=5) result = period_index.strftime("%Y/%m/%d") expected = np.array(['2015/03/01', '2015/03/02', '2015/03/03', - '2015/03/04', '2015/03/05'], dtype=' 0 for x in self.series) @@ -316,8 +307,8 @@ def test_getitem_boolean_object(self): # nans raise exception omask[5:10] = np.nan - self.assertRaises(Exception, s.__getitem__, omask) - self.assertRaises(Exception, s.__setitem__, omask, 5) + pytest.raises(Exception, s.__getitem__, omask) + pytest.raises(Exception, s.__setitem__, omask, 5) def test_getitem_setitem_boolean_corner(self): ts = self.ts @@ -325,15 +316,15 @@ def test_getitem_setitem_boolean_corner(self): # these used to raise...?? 
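`test_getitem_boolean_object` above hinges on one rule: NaN in a boolean mask is ambiguous, so both getting and setting with such a mask raise. A standalone sketch (the raised type is kept loose, as in the test, since it has varied across versions):

```python
import numpy as np
import pandas as pd
import pytest

s = pd.Series(np.random.randn(10))
omask = (s > 0).astype(object)  # object-dtype boolean mask
omask[5:8] = np.nan

# a NaN in the mask is neither True nor False, so indexing raises
with pytest.raises(Exception):
    s[omask]
```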
- self.assertRaises(Exception, ts.__getitem__, mask_shifted) - self.assertRaises(Exception, ts.__setitem__, mask_shifted, 1) + pytest.raises(Exception, ts.__getitem__, mask_shifted) + pytest.raises(Exception, ts.__setitem__, mask_shifted, 1) # ts[mask_shifted] # ts[mask_shifted] = 1 - self.assertRaises(Exception, ts.ix.__getitem__, mask_shifted) - self.assertRaises(Exception, ts.ix.__setitem__, mask_shifted, 1) - # ts.ix[mask_shifted] - # ts.ix[mask_shifted] = 2 + pytest.raises(Exception, ts.loc.__getitem__, mask_shifted) + pytest.raises(Exception, ts.loc.__setitem__, mask_shifted, 1) + # ts.loc[mask_shifted] + # ts.loc[mask_shifted] = 2 def test_getitem_setitem_slice_integers(self): s = Series(np.random.randn(8), index=[2, 4, 6, 8, 10, 12, 14, 16]) @@ -343,45 +334,252 @@ def test_getitem_setitem_slice_integers(self): assert_series_equal(result, expected) s[:4] = 0 - self.assertTrue((s[:4] == 0).all()) - self.assertTrue(not (s[4:] == 0).any()) + assert (s[:4] == 0).all() + assert not (s[4:] == 0).any() + + def test_getitem_setitem_datetime_tz_pytz(self): + tm._skip_if_no_pytz() + from pytz import timezone as tz + + from pandas import date_range + + N = 50 + # testing with timezone, GH #2785 + rng = date_range('1/1/1990', periods=N, freq='H', tz='US/Eastern') + ts = Series(np.random.randn(N), index=rng) + + # also test Timestamp tz handling, GH #2789 + result = ts.copy() + result["1990-01-01 09:00:00+00:00"] = 0 + result["1990-01-01 09:00:00+00:00"] = ts[4] + assert_series_equal(result, ts) + + result = ts.copy() + result["1990-01-01 03:00:00-06:00"] = 0 + result["1990-01-01 03:00:00-06:00"] = ts[4] + assert_series_equal(result, ts) + + # repeat with datetimes + result = ts.copy() + result[datetime(1990, 1, 1, 9, tzinfo=tz('UTC'))] = 0 + result[datetime(1990, 1, 1, 9, tzinfo=tz('UTC'))] = ts[4] + assert_series_equal(result, ts) + + result = ts.copy() + + # comparison dates with datetime MUST be localized! 
+ date = tz('US/Central').localize(datetime(1990, 1, 1, 3)) + result[date] = 0 + result[date] = ts[4] + assert_series_equal(result, ts) + + def test_getitem_setitem_datetime_tz_dateutil(self): + tm._skip_if_no_dateutil() + from dateutil.tz import tzutc + from pandas._libs.tslib import _dateutil_gettz as gettz + + tz = lambda x: tzutc() if x == 'UTC' else gettz( + x) # handle special case for utc in dateutil + + from pandas import date_range + + N = 50 + + # testing with timezone, GH #2785 + rng = date_range('1/1/1990', periods=N, freq='H', + tz='America/New_York') + ts = Series(np.random.randn(N), index=rng) + + # also test Timestamp tz handling, GH #2789 + result = ts.copy() + result["1990-01-01 09:00:00+00:00"] = 0 + result["1990-01-01 09:00:00+00:00"] = ts[4] + assert_series_equal(result, ts) + + result = ts.copy() + result["1990-01-01 03:00:00-06:00"] = 0 + result["1990-01-01 03:00:00-06:00"] = ts[4] + assert_series_equal(result, ts) + + # repeat with datetimes + result = ts.copy() + result[datetime(1990, 1, 1, 9, tzinfo=tz('UTC'))] = 0 + result[datetime(1990, 1, 1, 9, tzinfo=tz('UTC'))] = ts[4] + assert_series_equal(result, ts) + + result = ts.copy() + result[datetime(1990, 1, 1, 3, tzinfo=tz('America/Chicago'))] = 0 + result[datetime(1990, 1, 1, 3, tzinfo=tz('America/Chicago'))] = ts[4] + assert_series_equal(result, ts) + + def test_getitem_setitem_datetimeindex(self): + N = 50 + # testing with timezone, GH #2785 + rng = date_range('1/1/1990', periods=N, freq='H', tz='US/Eastern') + ts = Series(np.random.randn(N), index=rng) + + result = ts["1990-01-01 04:00:00"] + expected = ts[4] + assert result == expected + + result = ts.copy() + result["1990-01-01 04:00:00"] = 0 + result["1990-01-01 04:00:00"] = ts[4] + assert_series_equal(result, ts) + + result = ts["1990-01-01 04:00:00":"1990-01-01 07:00:00"] + expected = ts[4:8] + assert_series_equal(result, expected) + + result = ts.copy() + result["1990-01-01 04:00:00":"1990-01-01 07:00:00"] = 0 + result["1990-01-01 04:00:00":"1990-01-01 07:00:00"] = ts[4:8] + assert_series_equal(result, ts) + + lb = "1990-01-01 04:00:00" + rb = "1990-01-01 07:00:00" + result = ts[(ts.index >= lb) & (ts.index <= rb)] + expected = ts[4:8] + assert_series_equal(result, expected) + + # repeat all the above with naive datetimes + result = ts[datetime(1990, 1, 1, 4)] + expected = ts[4] + assert result == expected + + result = ts.copy() + result[datetime(1990, 1, 1, 4)] = 0 + result[datetime(1990, 1, 1, 4)] = ts[4] + assert_series_equal(result, ts) + + result = ts[datetime(1990, 1, 1, 4):datetime(1990, 1, 1, 7)] + expected = ts[4:8] + assert_series_equal(result, expected) + + result = ts.copy() + result[datetime(1990, 1, 1, 4):datetime(1990, 1, 1, 7)] = 0 + result[datetime(1990, 1, 1, 4):datetime(1990, 1, 1, 7)] = ts[4:8] + assert_series_equal(result, ts) + + lb = datetime(1990, 1, 1, 4) + rb = datetime(1990, 1, 1, 7) + result = ts[(ts.index >= lb) & (ts.index <= rb)] + expected = ts[4:8] + assert_series_equal(result, expected) + + result = ts[ts.index[4]] + expected = ts[4] + assert result == expected + + result = ts[ts.index[4:8]] + expected = ts[4:8] + assert_series_equal(result, expected) + + result = ts.copy() + result[ts.index[4:8]] = 0 + result[4:8] = ts[4:8] + assert_series_equal(result, ts) + + # also test partial date slicing + result = ts["1990-01-02"] + expected = ts[24:48] + assert_series_equal(result, expected) + + result = ts.copy() + result["1990-01-02"] = 0 + result["1990-01-02"] = ts[24:48] + assert_series_equal(result, ts) + + def 
test_getitem_setitem_periodindex(self): + from pandas import period_range + + N = 50 + rng = period_range('1/1/1990', periods=N, freq='H') + ts = Series(np.random.randn(N), index=rng) + + result = ts["1990-01-01 04"] + expected = ts[4] + assert result == expected + + result = ts.copy() + result["1990-01-01 04"] = 0 + result["1990-01-01 04"] = ts[4] + assert_series_equal(result, ts) + + result = ts["1990-01-01 04":"1990-01-01 07"] + expected = ts[4:8] + assert_series_equal(result, expected) + + result = ts.copy() + result["1990-01-01 04":"1990-01-01 07"] = 0 + result["1990-01-01 04":"1990-01-01 07"] = ts[4:8] + assert_series_equal(result, ts) + + lb = "1990-01-01 04" + rb = "1990-01-01 07" + result = ts[(ts.index >= lb) & (ts.index <= rb)] + expected = ts[4:8] + assert_series_equal(result, expected) + + # GH 2782 + result = ts[ts.index[4]] + expected = ts[4] + assert result == expected + + result = ts[ts.index[4:8]] + expected = ts[4:8] + assert_series_equal(result, expected) + + result = ts.copy() + result[ts.index[4:8]] = 0 + result[4:8] = ts[4:8] + assert_series_equal(result, ts) + + def test_getitem_median_slice_bug(self): + index = date_range('20090415', '20090519', freq='2B') + s = Series(np.random.randn(13), index=index) + + indexer = [slice(6, 7, None)] + result = s[indexer] + expected = s[indexer[0]] + assert_series_equal(result, expected) def test_getitem_out_of_bounds(self): # don't segfault, GH #495 - self.assertRaises(IndexError, self.ts.__getitem__, len(self.ts)) + pytest.raises(IndexError, self.ts.__getitem__, len(self.ts)) # GH #917 s = Series([]) - self.assertRaises(IndexError, s.__getitem__, -1) + pytest.raises(IndexError, s.__getitem__, -1) def test_getitem_setitem_integers(self): # caused bug without test s = Series([1, 2, 3], ['a', 'b', 'c']) - self.assertEqual(s.ix[0], s['a']) - s.ix[0] = 5 - self.assertAlmostEqual(s['a'], 5) + assert s.iloc[0] == s['a'] + s.iloc[0] = 5 + tm.assert_almost_equal(s['a'], 5) def test_getitem_box_float64(self): value = self.ts[5] - tm.assertIsInstance(value, np.float64) + assert isinstance(value, np.float64) def test_getitem_ambiguous_keyerror(self): s = Series(lrange(10), index=lrange(0, 20, 2)) - self.assertRaises(KeyError, s.__getitem__, 1) - self.assertRaises(KeyError, s.ix.__getitem__, 1) + pytest.raises(KeyError, s.__getitem__, 1) + pytest.raises(KeyError, s.loc.__getitem__, 1) def test_getitem_unordered_dup(self): obj = Series(lrange(5), index=['c', 'a', 'a', 'b', 'b']) - self.assertTrue(is_scalar(obj['c'])) - self.assertEqual(obj['c'], 0) + assert is_scalar(obj['c']) + assert obj['c'] == 0 def test_getitem_dups_with_missing(self): - # breaks reindex, so need to use .ix internally + # breaks reindex, so need to use .loc internally # GH 4246 s = Series([1, 2, 3, 4], ['foo', 'bar', 'foo', 'bah']) - expected = s.ix[['foo', 'bar', 'bah', 'bam']] + expected = s.loc[['foo', 'bar', 'bah', 'bam']] result = s[['foo', 'bar', 'bah', 'bam']] assert_series_equal(result, expected) @@ -395,13 +593,13 @@ def test_getitem_dataframe(self): rng = list(range(10)) s = pd.Series(10, index=rng) df = pd.DataFrame(rng, index=rng) - self.assertRaises(TypeError, s.__getitem__, df > 5) + pytest.raises(TypeError, s.__getitem__, df > 5) def test_getitem_callable(self): # GH 12533 s = pd.Series(4, index=list('ABCD')) result = s[lambda x: 'A'] - self.assertEqual(result, s.loc['A']) + assert result == s.loc['A'] result = s[lambda x: ['A', 'B']] tm.assert_series_equal(result, s.loc[['A', 'B']]) @@ -419,7 +617,7 @@ def test_setitem_ambiguous_keyerror(self): 
assert_series_equal(s2, expected) s2 = s.copy() - s2.ix[1] = 5 + s2.loc[1] = 5 expected = s.append(Series([5], index=[1])) assert_series_equal(s2, expected) @@ -428,7 +626,7 @@ def test_setitem_float_labels(self): s = Series(['a', 'b', 'c'], index=[0, 0.5, 1]) tmp = s.copy() - s.ix[1] = 'zoo' + s.loc[1] = 'zoo' tmp.iloc[2] = 'zoo' assert_series_equal(s, tmp) @@ -454,22 +652,20 @@ def test_slice(self): numSliceEnd = self.series[-10:] objSlice = self.objSeries[10:20] - self.assertNotIn(self.series.index[9], numSlice.index) - self.assertNotIn(self.objSeries.index[9], objSlice.index) + assert self.series.index[9] not in numSlice.index + assert self.objSeries.index[9] not in objSlice.index - self.assertEqual(len(numSlice), len(numSlice.index)) - self.assertEqual(self.series[numSlice.index[0]], - numSlice[numSlice.index[0]]) + assert len(numSlice) == len(numSlice.index) + assert self.series[numSlice.index[0]] == numSlice[numSlice.index[0]] - self.assertEqual(numSlice.index[1], self.series.index[11]) + assert numSlice.index[1] == self.series.index[11] + assert tm.equalContents(numSliceEnd, np.array(self.series)[-10:]) - self.assertTrue(tm.equalContents(numSliceEnd, np.array(self.series)[ - -10:])) - - # test return view + # Test return view. sl = self.series[10:20] sl[:] = 0 - self.assertTrue((self.series[10:20] == 0).all()) + + assert (self.series[10:20] == 0).all() def test_slice_can_reorder_not_uniquely_indexed(self): s = Series(1, index=['a', 'a', 'b', 'b', 'c']) @@ -477,27 +673,27 @@ def test_slice_can_reorder_not_uniquely_indexed(self): def test_slice_float_get_set(self): - self.assertRaises(TypeError, lambda: self.ts[4.0:10.0]) + pytest.raises(TypeError, lambda: self.ts[4.0:10.0]) def f(): self.ts[4.0:10.0] = 0 - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) - self.assertRaises(TypeError, self.ts.__getitem__, slice(4.5, 10.0)) - self.assertRaises(TypeError, self.ts.__setitem__, slice(4.5, 10.0), 0) + pytest.raises(TypeError, self.ts.__getitem__, slice(4.5, 10.0)) + pytest.raises(TypeError, self.ts.__setitem__, slice(4.5, 10.0), 0) def test_slice_floats2(self): s = Series(np.random.rand(10), index=np.arange(10, 20, dtype=float)) - self.assertEqual(len(s.ix[12.0:]), 8) - self.assertEqual(len(s.ix[12.5:]), 7) + assert len(s.loc[12.0:]) == 8 + assert len(s.loc[12.5:]) == 7 i = np.arange(10, 20, dtype=float) i[2] = 12.2 s.index = i - self.assertEqual(len(s.ix[12.0:]), 8) - self.assertEqual(len(s.ix[12.5:]), 7) + assert len(s.loc[12.0:]) == 8 + assert len(s.loc[12.5:]) == 7 def test_slice_float64(self): @@ -528,17 +724,17 @@ def test_setitem(self): self.ts[self.ts.index[5]] = np.NaN self.ts[[1, 2, 17]] = np.NaN self.ts[6] = np.NaN - self.assertTrue(np.isnan(self.ts[6])) - self.assertTrue(np.isnan(self.ts[2])) + assert np.isnan(self.ts[6]) + assert np.isnan(self.ts[2]) self.ts[np.isnan(self.ts)] = 5 - self.assertFalse(np.isnan(self.ts[2])) + assert not np.isnan(self.ts[2]) # caught this bug when writing tests series = Series(tm.makeIntIndex(20).astype(float), index=tm.makeIntIndex(20)) series[::2] = 0 - self.assertTrue((series[::2] == 0).all()) + assert (series[::2] == 0).all() # set item that's not contained s = self.series.copy() @@ -589,31 +785,31 @@ def test_setitem_dtypes(self): def test_set_value(self): idx = self.ts.index[10] res = self.ts.set_value(idx, 0) - self.assertIs(res, self.ts) - self.assertEqual(self.ts[idx], 0) + assert res is self.ts + assert self.ts[idx] == 0 # equiv s = self.series.copy() res = s.set_value('foobar', 0) - self.assertIs(res, s) - 
self.assertEqual(res.index[-1], 'foobar') - self.assertEqual(res['foobar'], 0) + assert res is s + assert res.index[-1] == 'foobar' + assert res['foobar'] == 0 s = self.series.copy() s.loc['foobar'] = 0 - self.assertEqual(s.index[-1], 'foobar') - self.assertEqual(s['foobar'], 0) + assert s.index[-1] == 'foobar' + assert s['foobar'] == 0 def test_setslice(self): sl = self.ts[5:20] - self.assertEqual(len(sl), len(sl.index)) - self.assertTrue(sl.index.is_unique) + assert len(sl) == len(sl.index) + assert sl.index.is_unique def test_basic_getitem_setitem_corner(self): # invalid tuples, e.g. self.ts[:, None] vs. self.ts[:, 2] - with tm.assertRaisesRegexp(ValueError, 'tuple-index'): + with tm.assert_raises_regex(ValueError, 'tuple-index'): self.ts[:, 2] - with tm.assertRaisesRegexp(ValueError, 'tuple-index'): + with tm.assert_raises_regex(ValueError, 'tuple-index'): self.ts[:, 2] = 2 # weird lists. [slice(0, 5)] will work but not two slices @@ -622,10 +818,10 @@ def test_basic_getitem_setitem_corner(self): assert_series_equal(result, expected) # OK - self.assertRaises(Exception, self.ts.__getitem__, - [5, slice(None, None)]) - self.assertRaises(Exception, self.ts.__setitem__, - [5, slice(None, None)], 2) + pytest.raises(Exception, self.ts.__getitem__, + [5, slice(None, None)]) + pytest.raises(Exception, self.ts.__setitem__, + [5, slice(None, None)], 2) def test_basic_getitem_with_labels(self): indices = self.ts.index[[5, 10, 15]] @@ -635,7 +831,7 @@ def test_basic_getitem_with_labels(self): assert_series_equal(result, expected) result = self.ts[indices[0]:indices[2]] - expected = self.ts.ix[indices[0]:indices[2]] + expected = self.ts.loc[indices[0]:indices[2]] assert_series_equal(result, expected) # integer indexes, be careful @@ -656,11 +852,11 @@ def test_basic_getitem_with_labels(self): index=['a', 'b', 'c']) expected = Timestamp('2011-01-01', tz='US/Eastern') result = s.loc['a'] - self.assertEqual(result, expected) + assert result == expected result = s.iloc[0] - self.assertEqual(result, expected) + assert result == expected result = s['a'] - self.assertEqual(result, expected) + assert result == expected def test_basic_setitem_with_labels(self): indices = self.ts.index[[5, 10, 15]] @@ -668,13 +864,13 @@ def test_basic_setitem_with_labels(self): cp = self.ts.copy() exp = self.ts.copy() cp[indices] = 0 - exp.ix[indices] = 0 + exp.loc[indices] = 0 assert_series_equal(cp, exp) cp = self.ts.copy() exp = self.ts.copy() cp[indices[0]:indices[2]] = 0 - exp.ix[indices[0]:indices[2]] = 0 + exp.loc[indices[0]:indices[2]] = 0 assert_series_equal(cp, exp) # integer indexes, be careful @@ -685,19 +881,19 @@ def test_basic_setitem_with_labels(self): cp = s.copy() exp = s.copy() s[inds] = 0 - s.ix[inds] = 0 + s.loc[inds] = 0 assert_series_equal(cp, exp) cp = s.copy() exp = s.copy() s[arr_inds] = 0 - s.ix[arr_inds] = 0 + s.loc[arr_inds] = 0 assert_series_equal(cp, exp) inds_notfound = [0, 4, 5, 6] arr_inds_notfound = np.array([0, 4, 5, 6]) - self.assertRaises(Exception, s.__setitem__, inds_notfound, 0) - self.assertRaises(Exception, s.__setitem__, arr_inds_notfound, 0) + pytest.raises(Exception, s.__setitem__, inds_notfound, 0) + pytest.raises(Exception, s.__setitem__, arr_inds_notfound, 0) # GH12089 # with tz for values @@ -707,60 +903,60 @@ def test_basic_setitem_with_labels(self): expected = Timestamp('2011-01-03', tz='US/Eastern') s2.loc['a'] = expected result = s2.loc['a'] - self.assertEqual(result, expected) + assert result == expected s2 = s.copy() s2.iloc[0] = expected result = s2.iloc[0] - 
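Among the setitem checks in this hunk is label-driven enlargement: `set_value` and a `.loc` assignment to a missing label are asserted to be equivalent, both appending the new label in place. Sketch with a throwaway series:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# assigning through .loc to a label the index lacks enlarges the series
s.loc['foobar'] = 0
assert s.index[-1] == 'foobar'
assert s['foobar'] == 0
```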
self.assertEqual(result, expected) + assert result == expected s2 = s.copy() s2['a'] = expected result = s2['a'] - self.assertEqual(result, expected) + assert result == expected - def test_ix_getitem(self): + def test_loc_getitem(self): inds = self.series.index[[3, 4, 7]] - assert_series_equal(self.series.ix[inds], self.series.reindex(inds)) - assert_series_equal(self.series.ix[5::2], self.series[5::2]) + assert_series_equal(self.series.loc[inds], self.series.reindex(inds)) + assert_series_equal(self.series.iloc[5::2], self.series[5::2]) # slice with indices d1, d2 = self.ts.index[[5, 15]] - result = self.ts.ix[d1:d2] + result = self.ts.loc[d1:d2] expected = self.ts.truncate(d1, d2) assert_series_equal(result, expected) # boolean mask = self.series > self.series.median() - assert_series_equal(self.series.ix[mask], self.series[mask]) + assert_series_equal(self.series.loc[mask], self.series[mask]) # ask for index value - self.assertEqual(self.ts.ix[d1], self.ts[d1]) - self.assertEqual(self.ts.ix[d2], self.ts[d2]) + assert self.ts.loc[d1] == self.ts[d1] + assert self.ts.loc[d2] == self.ts[d2] - def test_ix_getitem_not_monotonic(self): + def test_loc_getitem_not_monotonic(self): d1, d2 = self.ts.index[[5, 15]] ts2 = self.ts[::2][[1, 2, 0]] - self.assertRaises(KeyError, ts2.ix.__getitem__, slice(d1, d2)) - self.assertRaises(KeyError, ts2.ix.__setitem__, slice(d1, d2), 0) + pytest.raises(KeyError, ts2.loc.__getitem__, slice(d1, d2)) + pytest.raises(KeyError, ts2.loc.__setitem__, slice(d1, d2), 0) - def test_ix_getitem_setitem_integer_slice_keyerrors(self): + def test_loc_getitem_setitem_integer_slice_keyerrors(self): s = Series(np.random.randn(10), index=lrange(0, 20, 2)) # this is OK cp = s.copy() - cp.ix[4:10] = 0 - self.assertTrue((cp.ix[4:10] == 0).all()) + cp.iloc[4:10] = 0 + assert (cp.iloc[4:10] == 0).all() # so is this cp = s.copy() - cp.ix[3:11] = 0 - self.assertTrue((cp.ix[3:11] == 0).values.all()) + cp.iloc[3:11] = 0 + assert (cp.iloc[3:11] == 0).values.all() - result = s.ix[4:10] - result2 = s.ix[3:11] + result = s.iloc[2:6] + result2 = s.loc[3:11] expected = s.reindex([4, 6, 8, 10]) assert_series_equal(result, expected) @@ -768,19 +964,19 @@ def test_ix_getitem_setitem_integer_slice_keyerrors(self): # non-monotonic, raise KeyError s2 = s.iloc[lrange(5) + lrange(5, 10)[::-1]] - self.assertRaises(KeyError, s2.ix.__getitem__, slice(3, 11)) - self.assertRaises(KeyError, s2.ix.__setitem__, slice(3, 11), 0) + pytest.raises(KeyError, s2.loc.__getitem__, slice(3, 11)) + pytest.raises(KeyError, s2.loc.__setitem__, slice(3, 11), 0) - def test_ix_getitem_iterator(self): + def test_loc_getitem_iterator(self): idx = iter(self.series.index[:10]) - result = self.series.ix[idx] + result = self.series.loc[idx] assert_series_equal(result, self.series[:10]) def test_setitem_with_tz(self): for tz in ['US/Eastern', 'UTC', 'Asia/Tokyo']: orig = pd.Series(pd.date_range('2016-01-01', freq='H', periods=3, tz=tz)) - self.assertEqual(orig.dtype, 'datetime64[ns, {0}]'.format(tz)) + assert orig.dtype == 'datetime64[ns, {0}]'.format(tz) # scalar s = orig.copy() @@ -801,7 +997,7 @@ def test_setitem_with_tz(self): # vector vals = pd.Series([pd.Timestamp('2011-01-01', tz=tz), pd.Timestamp('2012-01-01', tz=tz)], index=[1, 2]) - self.assertEqual(vals.dtype, 'datetime64[ns, {0}]'.format(tz)) + assert vals.dtype == 'datetime64[ns, {0}]'.format(tz) s[[1, 2]] = vals exp = pd.Series([pd.Timestamp('2016-01-01 00:00', tz=tz), @@ -822,14 +1018,14 @@ def test_setitem_with_tz_dst(self): tz = 'US/Eastern' orig = 
pd.Series(pd.date_range('2016-11-06', freq='H', periods=3, tz=tz)) - self.assertEqual(orig.dtype, 'datetime64[ns, {0}]'.format(tz)) + assert orig.dtype == 'datetime64[ns, {0}]'.format(tz) # scalar s = orig.copy() s[1] = pd.Timestamp('2011-01-01', tz=tz) - exp = pd.Series([pd.Timestamp('2016-11-06 00:00', tz=tz), - pd.Timestamp('2011-01-01 00:00', tz=tz), - pd.Timestamp('2016-11-06 02:00', tz=tz)]) + exp = pd.Series([pd.Timestamp('2016-11-06 00:00-04:00', tz=tz), + pd.Timestamp('2011-01-01 00:00-05:00', tz=tz), + pd.Timestamp('2016-11-06 01:00-05:00', tz=tz)]) tm.assert_series_equal(s, exp) s = orig.copy() @@ -843,7 +1039,7 @@ def test_setitem_with_tz_dst(self): # vector vals = pd.Series([pd.Timestamp('2011-01-01', tz=tz), pd.Timestamp('2012-01-01', tz=tz)], index=[1, 2]) - self.assertEqual(vals.dtype, 'datetime64[ns, {0}]'.format(tz)) + assert vals.dtype == 'datetime64[ns, {0}]'.format(tz) s[[1, 2]] = vals exp = pd.Series([pd.Timestamp('2016-11-06 00:00', tz=tz), @@ -883,12 +1079,12 @@ def test_where(self): assert_series_equal(rs, expected) expected = s2.abs() - expected.ix[0] = s2[0] + expected.iloc[0] = s2[0] rs = s2.where(cond[:3], -s2) assert_series_equal(rs, expected) - self.assertRaises(ValueError, s.where, 1) - self.assertRaises(ValueError, s.where, cond[:3].values, -s) + pytest.raises(ValueError, s.where, 1) + pytest.raises(ValueError, s.where, cond[:3].values, -s) # GH 2745 s = Series([1, 2]) @@ -897,10 +1093,10 @@ def test_where(self): assert_series_equal(s, expected) # failures - self.assertRaises(ValueError, s.__setitem__, tuple([[[True, False]]]), - [0, 2, 3]) - self.assertRaises(ValueError, s.__setitem__, tuple([[[True, False]]]), - []) + pytest.raises(ValueError, s.__setitem__, tuple([[[True, False]]]), + [0, 2, 3]) + pytest.raises(ValueError, s.__setitem__, tuple([[[True, False]]]), + []) # unsafe dtype changes for dtype in [np.int8, np.int16, np.int32, np.int64, np.float16, @@ -910,7 +1106,7 @@ def test_where(self): s[mask] = lrange(2, 7) expected = Series(lrange(2, 7) + lrange(5, 10), dtype=dtype) assert_series_equal(s, expected) - self.assertEqual(s.dtype, expected.dtype) + assert s.dtype == expected.dtype # these are allowed operations, but are upcasted for dtype in [np.int64, np.float64]: @@ -920,7 +1116,7 @@ def test_where(self): s[mask] = values expected = Series(values + lrange(5, 10), dtype='float64') assert_series_equal(s, expected) - self.assertEqual(s.dtype, expected.dtype) + assert s.dtype == expected.dtype # GH 9731 s = Series(np.arange(10), dtype='int64') @@ -936,7 +1132,7 @@ def test_where(self): s = Series(np.arange(10), dtype=dtype) mask = s < 5 values = [2.5, 3.5, 4.5, 5.5, 6.5] - self.assertRaises(Exception, s.__setitem__, tuple(mask), values) + pytest.raises(Exception, s.__setitem__, tuple(mask), values) # GH3235 s = Series(np.arange(10), dtype='int64') @@ -944,7 +1140,7 @@ def test_where(self): s[mask] = lrange(2, 7) expected = Series(lrange(2, 7) + lrange(5, 10), dtype='int64') assert_series_equal(s, expected) - self.assertEqual(s.dtype, expected.dtype) + assert s.dtype == expected.dtype s = Series(np.arange(10), dtype='int64') mask = s > 5 @@ -958,12 +1154,12 @@ def test_where(self): def f(): s[mask] = [5, 4, 3, 2, 1] - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) def f(): s[mask] = [0] * 5 - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # dtype changes s = Series([1, 2, 3, 4]) @@ -976,7 +1172,7 @@ def f(): s = Series(range(10)).astype(float) s[8] = None result = s[8] - self.assertTrue(isnull(result)) + assert 
isnull(result) s = Series(range(10)).astype(float) s[s > 8] = None @@ -984,6 +1180,60 @@ def f(): expected = Series(np.nan, index=[9]) assert_series_equal(result, expected) + def test_where_array_like(self): + # see gh-15414 + s = Series([1, 2, 3]) + cond = [False, True, True] + expected = Series([np.nan, 2, 3]) + klasses = [list, tuple, np.array, Series] + + for klass in klasses: + result = s.where(klass(cond)) + assert_series_equal(result, expected) + + def test_where_invalid_input(self): + # see gh-15414: only boolean arrays accepted + s = Series([1, 2, 3]) + msg = "Boolean array expected for the condition" + + conds = [ + [1, 0, 1], + Series([2, 5, 7]), + ["True", "False", "True"], + [Timestamp("2017-01-01"), + pd.NaT, Timestamp("2017-01-02")] + ] + + for cond in conds: + with tm.assert_raises_regex(ValueError, msg): + s.where(cond) + + msg = "Array conditional must be same shape as self" + with tm.assert_raises_regex(ValueError, msg): + s.where([True]) + + def test_where_ndframe_align(self): + msg = "Array conditional must be same shape as self" + s = Series([1, 2, 3]) + + cond = [True] + with tm.assert_raises_regex(ValueError, msg): + s.where(cond) + + expected = Series([1, np.nan, np.nan]) + + out = s.where(Series(cond)) + tm.assert_series_equal(out, expected) + + cond = np.array([False, True, False, True]) + with tm.assert_raises_regex(ValueError, msg): + s.where(cond) + + expected = Series([np.nan, 2, np.nan]) + + out = s.where(Series(cond)) + tm.assert_series_equal(out, expected) + def test_where_setitem_invalid(self): # GH 2702 @@ -995,7 +1245,7 @@ def test_where_setitem_invalid(self): def f(): s[0:3] = list(range(27)) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) s[0:3] = list(range(3)) expected = Series([0, 1, 2]) @@ -1007,7 +1257,7 @@ def f(): def f(): s[0:4:2] = list(range(27)) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) s = Series(list('abcdef')) s[0:4:2] = list(range(2)) @@ -1020,7 +1270,7 @@ def f(): def f(): s[:-1] = list(range(27)) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) s[-3:-1] = list(range(2)) expected = Series(['a', 'b', 'c', 0, 1, 'f']) @@ -1032,14 +1282,14 @@ def f(): def f(): s[[0, 1, 2]] = list(range(27)) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) s = Series(list('abc')) def f(): s[[0, 1, 2]] = list(range(2)) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # scalar s = Series(list('abc')) @@ -1134,6 +1384,14 @@ def test_where_datetime(self): expected = Series([10, None], dtype='datetime64[ns]') assert_series_equal(rs, expected) + # GH 15701 + timestamps = ['2016-12-31 12:00:04+00:00', + '2016-12-31 12:00:04.010000+00:00'] + s = Series([pd.Timestamp(t) for t in timestamps]) + rs = s.where(Series([False, True])) + expected = Series([pd.NaT, s[1]]) + assert_series_equal(rs, expected) + def test_where_timedelta(self): s = Series([1, 2], dtype='timedelta64[ns]') expected = Series([10, 10], dtype='timedelta64[ns]') @@ -1181,8 +1439,8 @@ def test_mask(self): rs2 = s2.mask(cond[:3], -s2) assert_series_equal(rs, rs2) - self.assertRaises(ValueError, s.mask, 1) - self.assertRaises(ValueError, s.mask, cond[:3].values, -s) + pytest.raises(ValueError, s.mask, 1) + pytest.raises(ValueError, s.mask, cond[:3].values, -s) # dtype changes s = Series([1, 2, 3, 4]) @@ -1228,52 +1486,52 @@ def test_ix_setitem(self): inds = self.series.index[[3, 4, 7]] result = self.series.copy() - result.ix[inds] = 5 + result.loc[inds] = 5 expected = self.series.copy() expected[[3, 4, 7]] = 5 
assert_series_equal(result, expected) - result.ix[5:10] = 10 + result.iloc[5:10] = 10 expected[5:10] = 10 assert_series_equal(result, expected) # set slice with indices d1, d2 = self.series.index[[5, 15]] - result.ix[d1:d2] = 6 + result.loc[d1:d2] = 6 expected[5:16] = 6 # because it's inclusive assert_series_equal(result, expected) # set index value - self.series.ix[d1] = 4 - self.series.ix[d2] = 6 - self.assertEqual(self.series[d1], 4) - self.assertEqual(self.series[d2], 6) + self.series.loc[d1] = 4 + self.series.loc[d2] = 6 + assert self.series[d1] == 4 + assert self.series[d2] == 6 def test_where_numeric_with_string(self): # GH 9280 s = pd.Series([1, 2, 3]) w = s.where(s > 1, 'X') - self.assertFalse(is_integer(w[0])) - self.assertTrue(is_integer(w[1])) - self.assertTrue(is_integer(w[2])) - self.assertTrue(isinstance(w[0], str)) - self.assertTrue(w.dtype == 'object') + assert not is_integer(w[0]) + assert is_integer(w[1]) + assert is_integer(w[2]) + assert isinstance(w[0], str) + assert w.dtype == 'object' w = s.where(s > 1, ['X', 'Y', 'Z']) - self.assertFalse(is_integer(w[0])) - self.assertTrue(is_integer(w[1])) - self.assertTrue(is_integer(w[2])) - self.assertTrue(isinstance(w[0], str)) - self.assertTrue(w.dtype == 'object') + assert not is_integer(w[0]) + assert is_integer(w[1]) + assert is_integer(w[2]) + assert isinstance(w[0], str) + assert w.dtype == 'object' w = s.where(s > 1, np.array(['X', 'Y', 'Z'])) - self.assertFalse(is_integer(w[0])) - self.assertTrue(is_integer(w[1])) - self.assertTrue(is_integer(w[2])) - self.assertTrue(isinstance(w[0], str)) - self.assertTrue(w.dtype == 'object') + assert not is_integer(w[0]) + assert is_integer(w[1]) + assert is_integer(w[2]) + assert isinstance(w[0], str) + assert w.dtype == 'object' def test_setitem_boolean(self): mask = self.series > self.series.median() @@ -1295,16 +1553,16 @@ def test_ix_setitem_boolean(self): mask = self.series > self.series.median() result = self.series.copy() - result.ix[mask] = 0 + result.loc[mask] = 0 expected = self.series expected[mask] = 0 assert_series_equal(result, expected) def test_ix_setitem_corner(self): inds = list(self.series.index[[5, 8, 12]]) - self.series.ix[inds] = 5 - self.assertRaises(Exception, self.series.ix.__setitem__, - inds + ['foo'], 5) + self.series.loc[inds] = 5 + pytest.raises(Exception, self.series.loc.__setitem__, + inds + ['foo'], 5) def test_get_set_boolean_different_order(self): ordered = self.series.sort_values() @@ -1345,29 +1603,29 @@ def test_setitem_na(self): def test_basic_indexing(self): s = Series(np.random.randn(5), index=['a', 'b', 'a', 'a', 'b']) - self.assertRaises(IndexError, s.__getitem__, 5) - self.assertRaises(IndexError, s.__setitem__, 5, 0) + pytest.raises(IndexError, s.__getitem__, 5) + pytest.raises(IndexError, s.__setitem__, 5, 0) - self.assertRaises(KeyError, s.__getitem__, 'c') + pytest.raises(KeyError, s.__getitem__, 'c') s = s.sort_index() - self.assertRaises(IndexError, s.__getitem__, 5) - self.assertRaises(IndexError, s.__setitem__, 5, 0) + pytest.raises(IndexError, s.__getitem__, 5) + pytest.raises(IndexError, s.__setitem__, 5, 0) def test_int_indexing(self): s = Series(np.random.randn(6), index=[0, 0, 1, 1, 2, 2]) - self.assertRaises(KeyError, s.__getitem__, 5) + pytest.raises(KeyError, s.__getitem__, 5) - self.assertRaises(KeyError, s.__getitem__, 'c') + pytest.raises(KeyError, s.__getitem__, 'c') # not monotonic s = Series(np.random.randn(6), index=[2, 2, 0, 0, 1, 1]) - self.assertRaises(KeyError, s.__getitem__, 5) + pytest.raises(KeyError, 
s.__getitem__, 5) - self.assertRaises(KeyError, s.__getitem__, 'c') + pytest.raises(KeyError, s.__getitem__, 'c') def test_datetime_indexing(self): from pandas import date_range @@ -1378,17 +1636,17 @@ def test_datetime_indexing(self): s = Series(len(index), index=index) stamp = Timestamp('1/8/2000') - self.assertRaises(KeyError, s.__getitem__, stamp) + pytest.raises(KeyError, s.__getitem__, stamp) s[stamp] = 0 - self.assertEqual(s[stamp], 0) + assert s[stamp] == 0 # not monotonic s = Series(len(index), index=index) s = s[::-1] - self.assertRaises(KeyError, s.__getitem__, stamp) + pytest.raises(KeyError, s.__getitem__, stamp) s[stamp] = 0 - self.assertEqual(s[stamp], 0) + assert s[stamp] == 0 def test_timedelta_assignment(self): # GH 8209 @@ -1443,7 +1701,7 @@ def test_underlying_data_conversion(self): df_tmp = df.iloc[ck] # noqa df["bb"].iloc[0] = .15 - self.assertEqual(df['bb'].iloc[0], 0.15) + assert df['bb'].iloc[0] == 0.15 pd.set_option('chained_assignment', 'raise') # GH 3217 @@ -1457,7 +1715,7 @@ def test_underlying_data_conversion(self): def test_preserveRefs(self): seq = self.ts[[5, 10, 15]] seq[1] = np.NaN - self.assertFalse(np.isnan(self.ts[10])) + assert not np.isnan(self.ts[10]) def test_drop(self): @@ -1486,23 +1744,23 @@ def test_drop(self): # single string/tuple-like s = Series(range(3), index=list('abc')) - self.assertRaises(ValueError, s.drop, 'bc') - self.assertRaises(ValueError, s.drop, ('a', )) + pytest.raises(ValueError, s.drop, 'bc') + pytest.raises(ValueError, s.drop, ('a', )) # errors='ignore' s = Series(range(3), index=list('abc')) result = s.drop('bc', errors='ignore') assert_series_equal(result, s) result = s.drop(['a', 'd'], errors='ignore') - expected = s.ix[1:] + expected = s.iloc[1:] assert_series_equal(result, expected) # bad axis - self.assertRaises(ValueError, s.drop, 'one', axis='columns') + pytest.raises(ValueError, s.drop, 'one', axis='columns') # GH 8522 s = Series([2, 3], index=[True, False]) - self.assertTrue(s.index.is_object()) + assert s.index.is_object() result = s.drop(True) expected = Series([3], index=[False]) assert_series_equal(result, expected) @@ -1516,9 +1774,9 @@ def _check_align(a, b, how='left', fill=None): diff_a = aa.index.difference(join_index) diff_b = ab.index.difference(join_index) if len(diff_a) > 0: - self.assertTrue((aa.reindex(diff_a) == fill).all()) + assert (aa.reindex(diff_a) == fill).all() if len(diff_b) > 0: - self.assertTrue((ab.reindex(diff_b) == fill).all()) + assert (ab.reindex(diff_b) == fill).all() ea = a.reindex(join_index) eb = b.reindex(join_index) @@ -1529,10 +1787,10 @@ def _check_align(a, b, how='left', fill=None): assert_series_equal(aa, ea) assert_series_equal(ab, eb) - self.assertEqual(aa.name, 'ts') - self.assertEqual(ea.name, 'ts') - self.assertEqual(ab.name, 'ts') - self.assertEqual(eb.name, 'ts') + assert aa.name == 'ts' + assert ea.name == 'ts' + assert ab.name == 'ts' + assert eb.name == 'ts' for kind in JOIN_TYPES: _check_align(self.ts[2:], self.ts[:-5], how=kind) @@ -1592,36 +1850,36 @@ def test_align_nocopy(self): a = self.ts.copy() ra, _ = a.align(b, join='left') ra[:5] = 5 - self.assertFalse((a[:5] == 5).any()) + assert not (a[:5] == 5).any() # do not copy a = self.ts.copy() ra, _ = a.align(b, join='left', copy=False) ra[:5] = 5 - self.assertTrue((a[:5] == 5).all()) + assert (a[:5] == 5).all() # do copy a = self.ts.copy() b = self.ts[:5].copy() _, rb = a.align(b, join='right') rb[:3] = 5 - self.assertFalse((b[:3] == 5).any()) + assert not (b[:3] == 5).any() # do not copy a = self.ts.copy() b = 
self.ts[:5].copy() _, rb = a.align(b, join='right', copy=False) rb[:2] = 5 - self.assertTrue((b[:2] == 5).all()) + assert (b[:2] == 5).all() - def test_align_sameindex(self): + def test_align_same_index(self): a, b = self.ts.align(self.ts, copy=False) - self.assertIs(a.index, self.ts.index) - self.assertIs(b.index, self.ts.index) + assert a.index is self.ts.index + assert b.index is self.ts.index - # a, b = self.ts.align(self.ts, copy=True) - # self.assertIsNot(a.index, self.ts.index) - # self.assertIsNot(b.index, self.ts.index) + a, b = self.ts.align(self.ts, copy=True) + assert a.index is not self.ts.index + assert b.index is not self.ts.index def test_align_multiindex(self): # GH 10665 @@ -1662,38 +1920,37 @@ def test_reindex(self): # __array_interface__ is not defined for older numpies # and on some pythons try: - self.assertTrue(np.may_share_memory(self.series.index, - identity.index)) - except (AttributeError): + assert np.may_share_memory(self.series.index, identity.index) + except AttributeError: pass - self.assertTrue(identity.index.is_(self.series.index)) - self.assertTrue(identity.index.identical(self.series.index)) + assert identity.index.is_(self.series.index) + assert identity.index.identical(self.series.index) subIndex = self.series.index[10:20] subSeries = self.series.reindex(subIndex) for idx, val in compat.iteritems(subSeries): - self.assertEqual(val, self.series[idx]) + assert val == self.series[idx] subIndex2 = self.ts.index[10:20] subTS = self.ts.reindex(subIndex2) for idx, val in compat.iteritems(subTS): - self.assertEqual(val, self.ts[idx]) + assert val == self.ts[idx] stuffSeries = self.ts.reindex(subIndex) - self.assertTrue(np.isnan(stuffSeries).all()) + assert np.isnan(stuffSeries).all() # This is extremely important for the Cython code to not screw up nonContigIndex = self.ts.index[::2] subNonContig = self.ts.reindex(nonContigIndex) for idx, val in compat.iteritems(subNonContig): - self.assertEqual(val, self.ts[idx]) + assert val == self.ts[idx] # return a copy the same index here result = self.ts.reindex() - self.assertFalse((result is self.ts)) + assert not (result is self.ts) def test_reindex_nan(self): ts = Series([2, 3, 5, 7], index=[1, 4, nan, 8]) @@ -1706,6 +1963,28 @@ def test_reindex_nan(self): # reindex coerces index.dtype to float, loc/iloc doesn't assert_series_equal(ts.reindex(i), ts.iloc[j], check_index_type=False) + def test_reindex_series_add_nat(self): + rng = date_range('1/1/2000 00:00:00', periods=10, freq='10s') + series = Series(rng) + + result = series.reindex(lrange(15)) + assert np.issubdtype(result.dtype, np.dtype('M8[ns]')) + + mask = result.isnull() + assert mask[-5:].all() + assert not mask[:-5].any() + + def test_reindex_with_datetimes(self): + rng = date_range('1/1/2000', periods=20) + ts = Series(np.random.randn(20), index=rng) + + result = ts.reindex(list(ts.index[5:10])) + expected = ts[5:10] + tm.assert_series_equal(result, expected) + + result = ts[list(ts.index[5:10])] + tm.assert_series_equal(result, expected) + def test_reindex_corner(self): # (don't forget to fix this) I think it's fixed self.empty.reindex(self.ts.index, method='pad') # it works @@ -1719,7 +1998,7 @@ def test_reindex_corner(self): # bad fill method ts = self.ts[::2] - self.assertRaises(Exception, ts.reindex, self.ts.index, method='foo') + pytest.raises(Exception, ts.reindex, self.ts.index, method='foo') def test_reindex_pad(self): @@ -1790,11 +2069,11 @@ def test_reindex_int(self): reindexed_int = int_ts.reindex(self.ts.index) # if NaNs introduced - 
self.assertEqual(reindexed_int.dtype, np.float_) + assert reindexed_int.dtype == np.float_ # NO NaNs introduced reindexed_int = int_ts.reindex(int_ts.index[::2]) - self.assertEqual(reindexed_int.dtype, np.int_) + assert reindexed_int.dtype == np.int_ def test_reindex_bool(self): @@ -1806,18 +2085,18 @@ def test_reindex_bool(self): reindexed_bool = bool_ts.reindex(self.ts.index) # if NaNs introduced - self.assertEqual(reindexed_bool.dtype, np.object_) + assert reindexed_bool.dtype == np.object_ # NO NaNs introduced reindexed_bool = bool_ts.reindex(bool_ts.index[::2]) - self.assertEqual(reindexed_bool.dtype, np.bool_) + assert reindexed_bool.dtype == np.bool_ def test_reindex_bool_pad(self): # fail ts = self.ts[5:] bool_ts = Series(np.zeros(len(ts), dtype=bool), index=ts.index) filled_bool = bool_ts.reindex(self.ts.index, method='pad') - self.assertTrue(isnull(filled_bool[:5]).all()) + assert isnull(filled_bool[:5]).all() def test_reindex_like(self): other = self.ts[::2] @@ -1859,7 +2138,7 @@ def test_reindex_fill_value(self): # don't upcast result = ints.reindex([1, 2, 3], fill_value=0) expected = Series([2, 3, 0], index=[1, 2, 3]) - self.assertTrue(issubclass(result.dtype.type, np.integer)) + assert issubclass(result.dtype.type, np.integer) assert_series_equal(result, expected) # ----------------------------------------------------------- @@ -1943,9 +2222,9 @@ def test_multilevel_preserve_name(self): s = Series(np.random.randn(len(index)), index=index, name='sth') result = s['foo'] - result2 = s.ix['foo'] - self.assertEqual(result.name, s.name) - self.assertEqual(result2.name, s.name) + result2 = s.loc['foo'] + assert result.name == s.name + assert result2.name == s.name def test_setitem_scalar_into_readonly_backing_data(self): # GH14359: test that you cannot mutate a read only buffer @@ -1955,15 +2234,10 @@ def test_setitem_scalar_into_readonly_backing_data(self): series = Series(array) for n in range(len(series)): - with self.assertRaises(ValueError): + with pytest.raises(ValueError): series[n] = 1 - self.assertEqual( - array[n], - 0, - msg='even though the ValueError was raised, the underlying' - ' array was still mutated!', - ) + assert array[n] == 0 def test_setitem_slice_into_readonly_backing_data(self): # GH14359: test that you cannot mutate a read only buffer @@ -1972,16 +2246,432 @@ def test_setitem_slice_into_readonly_backing_data(self): array.flags.writeable = False # make the array immutable series = Series(array) - with self.assertRaises(ValueError): + with pytest.raises(ValueError): series[1:3] = 1 - self.assertTrue( - not array.any(), - msg='even though the ValueError was raised, the underlying' - ' array was still mutated!', - ) + assert not array.any() + + +class TestTimeSeriesDuplicates(object): + + def setup_method(self, method): + dates = [datetime(2000, 1, 2), datetime(2000, 1, 2), + datetime(2000, 1, 2), datetime(2000, 1, 3), + datetime(2000, 1, 3), datetime(2000, 1, 3), + datetime(2000, 1, 4), datetime(2000, 1, 4), + datetime(2000, 1, 4), datetime(2000, 1, 5)] + + self.dups = Series(np.random.randn(len(dates)), index=dates) + + def test_constructor(self): + assert isinstance(self.dups, Series) + assert isinstance(self.dups.index, DatetimeIndex) + + def test_is_unique_monotonic(self): + assert not self.dups.index.is_unique + + def test_index_unique(self): + uniques = self.dups.index.unique() + expected = DatetimeIndex([datetime(2000, 1, 2), datetime(2000, 1, 3), + datetime(2000, 1, 4), datetime(2000, 1, 5)]) + assert uniques.dtype == 'M8[ns]' # sanity + 
tm.assert_index_equal(uniques, expected) + assert self.dups.index.nunique() == 4 + + # #2563 + assert isinstance(uniques, DatetimeIndex) + + dups_local = self.dups.index.tz_localize('US/Eastern') + dups_local.name = 'foo' + result = dups_local.unique() + expected = DatetimeIndex(expected, name='foo') + expected = expected.tz_localize('US/Eastern') + assert result.tz is not None + assert result.name == 'foo' + tm.assert_index_equal(result, expected) + + # NaT, note this is excluded + arr = [1370745748 + t for t in range(20)] + [tslib.iNaT] + idx = DatetimeIndex(arr * 3) + tm.assert_index_equal(idx.unique(), DatetimeIndex(arr)) + assert idx.nunique() == 20 + assert idx.nunique(dropna=False) == 21 + + arr = [Timestamp('2013-06-09 02:42:28') + timedelta(seconds=t) + for t in range(20)] + [NaT] + idx = DatetimeIndex(arr * 3) + tm.assert_index_equal(idx.unique(), DatetimeIndex(arr)) + assert idx.nunique() == 20 + assert idx.nunique(dropna=False) == 21 + + def test_index_dupes_contains(self): + d = datetime(2011, 12, 5, 20, 30) + ix = DatetimeIndex([d, d]) + assert d in ix + + def test_duplicate_dates_indexing(self): + ts = self.dups + + uniques = ts.index.unique() + for date in uniques: + result = ts[date] + + mask = ts.index == date + total = (ts.index == date).sum() + expected = ts[mask] + if total > 1: + assert_series_equal(result, expected) + else: + assert_almost_equal(result, expected[0]) + + cp = ts.copy() + cp[date] = 0 + expected = Series(np.where(mask, 0, ts), index=ts.index) + assert_series_equal(cp, expected) + + pytest.raises(KeyError, ts.__getitem__, datetime(2000, 1, 6)) + + # new index + ts[datetime(2000, 1, 6)] = 0 + assert ts[datetime(2000, 1, 6)] == 0 + + def test_range_slice(self): + idx = DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/3/2000', + '1/4/2000']) + + ts = Series(np.random.randn(len(idx)), index=idx) + + result = ts['1/2/2000':] + expected = ts[1:] + assert_series_equal(result, expected) + + result = ts['1/2/2000':'1/3/2000'] + expected = ts[1:4] + assert_series_equal(result, expected) + + def test_groupby_average_dup_values(self): + result = self.dups.groupby(level=0).mean() + expected = self.dups.groupby(self.dups.index).mean() + assert_series_equal(result, expected) + + def test_indexing_over_size_cutoff(self): + import datetime + # #1821 + + old_cutoff = _index._SIZE_CUTOFF + try: + _index._SIZE_CUTOFF = 1000 + + # create large list of non periodic datetime + dates = [] + sec = datetime.timedelta(seconds=1) + half_sec = datetime.timedelta(microseconds=500000) + d = datetime.datetime(2011, 12, 5, 20, 30) + n = 1100 + for i in range(n): + dates.append(d) + dates.append(d + sec) + dates.append(d + sec + half_sec) + dates.append(d + sec + sec + half_sec) + d += 3 * sec + + # duplicate some values in the list + duplicate_positions = np.random.randint(0, len(dates) - 1, 20) + for p in duplicate_positions: + dates[p + 1] = dates[p] + + df = DataFrame(np.random.randn(len(dates), 4), + index=dates, + columns=list('ABCD')) + + pos = n * 3 + timestamp = df.index[pos] + assert timestamp in df.index + + # it works! + df.loc[timestamp] + assert len(df.loc[[timestamp]]) > 0 + finally: + _index._SIZE_CUTOFF = old_cutoff + + def test_indexing_unordered(self): + # GH 2437 + rng = date_range(start='2011-01-01', end='2011-01-15') + ts = Series(np.random.rand(len(rng)), index=rng) + ts2 = pd.concat([ts[0:4], ts[-4:], ts[4:-4]]) + + for t in ts.index: + # TODO: unused? 
+ s = str(t) # noqa + + expected = ts[t] + result = ts2[t] + assert expected == result + + # GH 3448 (ranges) + def compare(slobj): + result = ts2[slobj].copy() + result = result.sort_index() + expected = ts[slobj] + assert_series_equal(result, expected) + + compare(slice('2011-01-01', '2011-01-15')) + compare(slice('2010-12-30', '2011-01-15')) + compare(slice('2011-01-01', '2011-01-16')) + + # partial ranges + compare(slice('2011-01-01', '2011-01-6')) + compare(slice('2011-01-06', '2011-01-8')) + compare(slice('2011-01-06', '2011-01-12')) + + # single values + result = ts2['2011'].sort_index() + expected = ts['2011'] + assert_series_equal(result, expected) + + # diff freq + rng = date_range(datetime(2005, 1, 1), periods=20, freq='M') + ts = Series(np.arange(len(rng)), index=rng) + ts = ts.take(np.random.permutation(20)) + + result = ts['2005'] + for t in result.index: + assert t.year == 2005 + + def test_indexing(self): + + idx = date_range("2001-1-1", periods=20, freq='M') + ts = Series(np.random.rand(len(idx)), index=idx) -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + # getting + + # GH 3070, make sure semantics work on Series/Frame + expected = ts['2001'] + expected.name = 'A' + + df = DataFrame(dict(A=ts)) + result = df['2001']['A'] + assert_series_equal(expected, result) + + # setting + ts['2001'] = 1 + expected = ts['2001'] + expected.name = 'A' + + df.loc['2001', 'A'] = 1 + + result = df['2001']['A'] + assert_series_equal(expected, result) + + # GH3546 (not including times on the last day) + idx = date_range(start='2013-05-31 00:00', end='2013-05-31 23:00', + freq='H') + ts = Series(lrange(len(idx)), index=idx) + expected = ts['2013-05'] + assert_series_equal(expected, ts) + + idx = date_range(start='2013-05-31 00:00', end='2013-05-31 23:59', + freq='S') + ts = Series(lrange(len(idx)), index=idx) + expected = ts['2013-05'] + assert_series_equal(expected, ts) + + idx = [Timestamp('2013-05-31 00:00'), + Timestamp(datetime(2013, 5, 31, 23, 59, 59, 999999))] + ts = Series(lrange(len(idx)), index=idx) + expected = ts['2013'] + assert_series_equal(expected, ts) + + # GH14826, indexing with a seconds resolution string / datetime object + df = DataFrame(np.random.rand(5, 5), + columns=['open', 'high', 'low', 'close', 'volume'], + index=date_range('2012-01-02 18:01:00', + periods=5, tz='US/Central', freq='s')) + expected = df.loc[[df.index[2]]] + + # this is a single date, so will raise + pytest.raises(KeyError, df.__getitem__, '2012-01-02 18:01:02', ) + pytest.raises(KeyError, df.__getitem__, df.index[2], ) + + +class TestDatetimeIndexing(object): + """ + Also test support for datetime64[ns] in Series / DataFrame + """ + + def setup_method(self, method): + dti = DatetimeIndex(start=datetime(2005, 1, 1), + end=datetime(2005, 1, 10), freq='Min') + self.series = Series(np.random.rand(len(dti)), dti) + + def test_fancy_getitem(self): + dti = DatetimeIndex(freq='WOM-1FRI', start=datetime(2005, 1, 1), + end=datetime(2010, 1, 1)) + + s = Series(np.arange(len(dti)), index=dti) + + assert s[48] == 48 + assert s['1/2/2009'] == 48 + assert s['2009-1-2'] == 48 + assert s[datetime(2009, 1, 2)] == 48 + assert s[lib.Timestamp(datetime(2009, 1, 2))] == 48 + pytest.raises(KeyError, s.__getitem__, '2009-1-3') + + assert_series_equal(s['3/6/2009':'2009-06-05'], + s[datetime(2009, 3, 6):datetime(2009, 6, 5)]) + + def test_fancy_setitem(self): + dti = DatetimeIndex(freq='WOM-1FRI', start=datetime(2005, 1, 1), + 
end=datetime(2010, 1, 1)) + + s = Series(np.arange(len(dti)), index=dti) + s[48] = -1 + assert s[48] == -1 + s['1/2/2009'] = -2 + assert s[48] == -2 + s['1/2/2009':'2009-06-05'] = -3 + assert (s[48:54] == -3).all() + + def test_dti_snap(self): + dti = DatetimeIndex(['1/1/2002', '1/2/2002', '1/3/2002', '1/4/2002', + '1/5/2002', '1/6/2002', '1/7/2002'], freq='D') + + res = dti.snap(freq='W-MON') + exp = date_range('12/31/2001', '1/7/2002', freq='w-mon') + exp = exp.repeat([3, 4]) + assert (res == exp).all() + + res = dti.snap(freq='B') + + exp = date_range('1/1/2002', '1/7/2002', freq='b') + exp = exp.repeat([1, 1, 1, 2, 2]) + assert (res == exp).all() + + def test_dti_reset_index_round_trip(self): + dti = DatetimeIndex(start='1/1/2001', end='6/1/2001', freq='D') + d1 = DataFrame({'v': np.random.rand(len(dti))}, index=dti) + d2 = d1.reset_index() + assert d2.dtypes[0] == np.dtype('M8[ns]') + d3 = d2.set_index('index') + assert_frame_equal(d1, d3, check_names=False) + + # #2329 + stamp = datetime(2012, 11, 22) + df = DataFrame([[stamp, 12.1]], columns=['Date', 'Value']) + df = df.set_index('Date') + + assert df.index[0] == stamp + assert df.reset_index()['Date'][0] == stamp + + def test_series_set_value(self): + # #1561 + + dates = [datetime(2001, 1, 1), datetime(2001, 1, 2)] + index = DatetimeIndex(dates) + + s = Series().set_value(dates[0], 1.) + s2 = s.set_value(dates[1], np.nan) + + exp = Series([1., np.nan], index=index) + + assert_series_equal(s2, exp) + + # s = Series(index[:1], index[:1]) + # s2 = s.set_value(dates[1], index[1]) + # assert s2.values.dtype == 'M8[ns]' + + @slow + def test_slice_locs_indexerror(self): + times = [datetime(2000, 1, 1) + timedelta(minutes=i * 10) + for i in range(100000)] + s = Series(lrange(100000), times) + s.loc[datetime(1900, 1, 1):datetime(2100, 1, 1)] + + def test_slicing_datetimes(self): + + # GH 7523 + + # unique + df = DataFrame(np.arange(4., dtype='float64'), + index=[datetime(2001, 1, i, 10, 00) + for i in [1, 2, 3, 4]]) + result = df.loc[datetime(2001, 1, 1, 10):] + assert_frame_equal(result, df) + result = df.loc[:datetime(2001, 1, 4, 10)] + assert_frame_equal(result, df) + result = df.loc[datetime(2001, 1, 1, 10):datetime(2001, 1, 4, 10)] + assert_frame_equal(result, df) + + result = df.loc[datetime(2001, 1, 1, 11):] + expected = df.iloc[1:] + assert_frame_equal(result, expected) + result = df.loc['20010101 11':] + assert_frame_equal(result, expected) + + # duplicates + df = pd.DataFrame(np.arange(5., dtype='float64'), + index=[datetime(2001, 1, i, 10, 00) + for i in [1, 2, 2, 3, 4]]) + + result = df.loc[datetime(2001, 1, 1, 10):] + assert_frame_equal(result, df) + result = df.loc[:datetime(2001, 1, 4, 10)] + assert_frame_equal(result, df) + result = df.loc[datetime(2001, 1, 1, 10):datetime(2001, 1, 4, 10)] + assert_frame_equal(result, df) + + result = df.loc[datetime(2001, 1, 1, 11):] + expected = df.iloc[1:] + assert_frame_equal(result, expected) + result = df.loc['20010101 11':] + assert_frame_equal(result, expected) + + def test_frame_datetime64_duplicated(self): + dates = date_range('2010-07-01', end='2010-08-05') + + tst = DataFrame({'symbol': 'AAA', 'date': dates}) + result = tst.duplicated(['date', 'symbol']) + assert (-result).all() + + tst = DataFrame({'date': dates}) + result = tst.duplicated() + assert (-result).all() + + +class TestNatIndexing(object): + + def setup_method(self, method): + self.series = Series(date_range('1/1/2000', periods=10)) + + # --------------------------------------------------------------------- + # NaT 
support + + def test_set_none_nan(self): + self.series[3] = None + assert self.series[3] is NaT + + self.series[3:5] = None + assert self.series[4] is NaT + + self.series[5] = np.nan + assert self.series[5] is NaT + + self.series[5:7] = np.nan + assert self.series[6] is NaT + + def test_nat_operations(self): + # GH 8617 + s = Series([0, pd.NaT], dtype='m8[ns]') + exp = s[0] + assert s.median() == exp + assert s.min() == exp + assert s.max() == exp + + def test_round_nat(self): + # GH14940 + s = Series([pd.NaT]) + expected = Series(pd.NaT) + for method in ["round", "floor", "ceil"]: + round_method = getattr(s.dt, method) + for freq in ["s", "5s", "min", "5min", "h", "5h"]: + assert_series_equal(round_method(freq), expected) diff --git a/pandas/tests/series/test_internals.py b/pandas/tests/series/test_internals.py index e3a0e056f4da1..79e23459ac992 100644 --- a/pandas/tests/series/test_internals.py +++ b/pandas/tests/series/test_internals.py @@ -1,22 +1,22 @@ # coding=utf-8 # pylint: disable-msg=E1101,W0612 +import pytest + from datetime import datetime from numpy import nan import numpy as np from pandas import Series -from pandas.tseries.index import Timestamp -import pandas.lib as lib +from pandas.core.indexes.datetimes import Timestamp +import pandas._libs.lib as lib from pandas.util.testing import assert_series_equal import pandas.util.testing as tm -class TestSeriesInternals(tm.TestCase): - - _multiprocess_can_split_ = True +class TestSeriesInternals(object): def test_convert_objects(self): @@ -116,7 +116,7 @@ def test_convert_objects(self): # r = s.copy() # r[0] = np.nan # result = r.convert_objects(convert_dates=True,convert_numeric=False) - # self.assertEqual(result.dtype, 'M8[ns]') + # assert result.dtype == 'M8[ns]' # dateutil parses some single letters into today's value as a date for x in 'abcdefghijklmnopqrstuvwxyz': @@ -282,7 +282,7 @@ def test_convert(self): # r = s.copy() # r[0] = np.nan # result = r._convert(convert_dates=True,convert_numeric=False) - # self.assertEqual(result.dtype, 'M8[ns]') + # assert result.dtype == 'M8[ns]' # dateutil parses some single letters into today's value as a date expected = Series([lib.NaT]) @@ -296,7 +296,7 @@ def test_convert(self): def test_convert_no_arg_error(self): s = Series(['1.0', '2']) - self.assertRaises(ValueError, s._convert) + pytest.raises(ValueError, s._convert) def test_convert_preserve_bool(self): s = Series([1, True, 3, 5], dtype=object) diff --git a/pandas/tests/series/test_io.py b/pandas/tests/series/test_io.py index 48528dc54adbd..d1c9e5a6d16cf 100644 --- a/pandas/tests/series/test_io.py +++ b/pandas/tests/series/test_io.py @@ -16,9 +16,7 @@ from .common import TestData -class TestSeriesToCSV(TestData, tm.TestCase): - - _multiprocess_can_split_ = True +class TestSeriesToCSV(TestData): def test_from_csv(self): @@ -26,25 +24,25 @@ def test_from_csv(self): self.ts.to_csv(path) ts = Series.from_csv(path) assert_series_equal(self.ts, ts, check_names=False) - self.assertTrue(ts.name is None) - self.assertTrue(ts.index.name is None) + assert ts.name is None + assert ts.index.name is None # GH10483 self.ts.to_csv(path, header=True) ts_h = Series.from_csv(path, header=0) - self.assertTrue(ts_h.name == 'ts') + assert ts_h.name == 'ts' self.series.to_csv(path) series = Series.from_csv(path) - self.assertIsNone(series.name) - self.assertIsNone(series.index.name) + assert series.name is None + assert series.index.name is None assert_series_equal(self.series, series, check_names=False) - self.assertTrue(series.name is None) - 
self.assertTrue(series.index.name is None) + assert series.name is None + assert series.index.name is None self.series.to_csv(path, header=True) series_h = Series.from_csv(path, header=0) - self.assertTrue(series_h.name == 'series') + assert series_h.name == 'series' outfile = open(path, 'w') outfile.write('1998-01-01|1.0\n1999-01-01|2.0') @@ -107,12 +105,10 @@ def test_to_csv_path_is_none(self): # DataFrame.to_csv() which returned string s = Series([1, 2, 3]) csv_str = s.to_csv(path=None) - self.assertIsInstance(csv_str, str) - + assert isinstance(csv_str, str) -class TestSeriesIO(TestData, tm.TestCase): - _multiprocess_can_split_ = True +class TestSeriesIO(TestData): def test_to_frame(self): self.ts.name = None @@ -131,20 +127,20 @@ def test_to_frame(self): assert_frame_equal(rs, xp) def test_to_dict(self): - self.assert_series_equal(Series(self.ts.to_dict(), name='ts'), self.ts) + tm.assert_series_equal(Series(self.ts.to_dict(), name='ts'), self.ts) def test_timeseries_periodindex(self): # GH2891 from pandas import period_range prng = period_range('1/1/2011', '1/1/2012', freq='M') ts = Series(np.random.randn(len(prng)), prng) - new_ts = self.round_trip_pickle(ts) - self.assertEqual(new_ts.index.freq, 'M') + new_ts = tm.round_trip_pickle(ts) + assert new_ts.index.freq == 'M' def test_pickle_preserve_name(self): for n in [777, 777., 'name', datetime(2001, 11, 11), (1, 2)]: unpickled = self._pickle_roundtrip_name(tm.makeTimeSeries(name=n)) - self.assertEqual(unpickled.name, n) + assert unpickled.name == n def _pickle_roundtrip_name(self, obj): @@ -167,14 +163,12 @@ class SubclassedFrame(DataFrame): s = SubclassedSeries([1, 2, 3], name='X') result = s.to_frame() - self.assertTrue(isinstance(result, SubclassedFrame)) + assert isinstance(result, SubclassedFrame) expected = SubclassedFrame({'X': [1, 2, 3]}) assert_frame_equal(result, expected) -class TestSeriesToList(TestData, tm.TestCase): - - _multiprocess_can_split_ = True +class TestSeriesToList(TestData): def test_tolist(self): rs = self.ts.tolist() @@ -184,25 +178,25 @@ def test_tolist(self): # datetime64 s = Series(self.ts.index) rs = s.tolist() - self.assertEqual(self.ts.index[0], rs[0]) + assert self.ts.index[0] == rs[0] def test_tolist_np_int(self): # GH10904 for t in ['int8', 'int16', 'int32', 'int64']: s = pd.Series([1], dtype=t) - self.assertIsInstance(s.tolist()[0], (int, long)) + assert isinstance(s.tolist()[0], (int, long)) def test_tolist_np_uint(self): # GH10904 for t in ['uint8', 'uint16']: s = pd.Series([1], dtype=t) - self.assertIsInstance(s.tolist()[0], int) + assert isinstance(s.tolist()[0], int) for t in ['uint32', 'uint64']: s = pd.Series([1], dtype=t) - self.assertIsInstance(s.tolist()[0], long) + assert isinstance(s.tolist()[0], long) def test_tolist_np_float(self): # GH10904 for t in ['float16', 'float32', 'float64']: s = pd.Series([1], dtype=t) - self.assertIsInstance(s.tolist()[0], float) + assert isinstance(s.tolist()[0], float) diff --git a/pandas/tests/series/test_missing.py b/pandas/tests/series/test_missing.py index 27bc04ac043be..c52c41877d5c0 100644 --- a/pandas/tests/series/test_missing.py +++ b/pandas/tests/series/test_missing.py @@ -2,41 +2,53 @@ # pylint: disable-msg=E1101,W0612 import pytz +import pytest + from datetime import timedelta, datetime +from distutils.version import LooseVersion from numpy import nan import numpy as np import pandas as pd -from pandas import (Series, isnull, date_range, - MultiIndex, Index) -from pandas.tseries.index import Timestamp +from pandas import (Series, DataFrame, 
isnull, date_range, + MultiIndex, Index, Timestamp, NaT, IntervalIndex) from pandas.compat import range -from pandas.util.testing import assert_series_equal +from pandas._libs.tslib import iNaT +from pandas.util.testing import assert_series_equal, assert_frame_equal import pandas.util.testing as tm from .common import TestData +try: + import scipy + _is_scipy_ge_0190 = scipy.__version__ >= LooseVersion('0.19.0') +except: + _is_scipy_ge_0190 = False + def _skip_if_no_pchip(): try: from scipy.interpolate import pchip_interpolate # noqa except ImportError: - import nose - raise nose.SkipTest('scipy.interpolate.pchip missing') + import pytest + pytest.skip('scipy.interpolate.pchip missing') def _skip_if_no_akima(): try: from scipy.interpolate import Akima1DInterpolator # noqa except ImportError: - import nose - raise nose.SkipTest('scipy.interpolate.Akima1DInterpolator missing') + import pytest + pytest.skip('scipy.interpolate.Akima1DInterpolator missing') -class TestSeriesMissingData(TestData, tm.TestCase): +def _simple_ts(start, end, freq='D'): + rng = date_range(start, end, freq=freq) + return Series(np.random.randn(len(rng)), index=rng) - _multiprocess_can_split_ = True + +class TestSeriesMissingData(TestData): def test_timedelta_fillna(self): # GH 3371 @@ -66,9 +78,8 @@ def test_timedelta_fillna(self): timedelta(days=1, seconds=9 * 3600 + 60 + 1)]) assert_series_equal(result, expected) - from pandas import tslib - result = td.fillna(tslib.NaT) - expected = Series([tslib.NaT, timedelta(0), timedelta(1), + result = td.fillna(NaT) + expected = Series([NaT, timedelta(0), timedelta(1), timedelta(days=1, seconds=9 * 3600 + 60 + 1)], dtype='m8[ns]') assert_series_equal(result, expected) @@ -99,8 +110,7 @@ def test_datetime64_fillna(self): '20130101'), Timestamp('20130104'), Timestamp('20130103 9:01:01')]) assert_series_equal(result, expected) - from pandas import tslib - result = s.fillna(tslib.NaT) + result = s.fillna(NaT) expected = s assert_series_equal(result, expected) @@ -139,24 +149,24 @@ def test_datetime64_tz_fillna(self): Timestamp('2011-01-02 10:00'), Timestamp('2011-01-03 10:00'), Timestamp('2011-01-02 10:00')]) - self.assert_series_equal(expected, result) + tm.assert_series_equal(expected, result) # check s is not changed - self.assert_series_equal(pd.isnull(s), null_loc) + tm.assert_series_equal(pd.isnull(s), null_loc) result = s.fillna(pd.Timestamp('2011-01-02 10:00', tz=tz)) expected = Series([Timestamp('2011-01-01 10:00'), Timestamp('2011-01-02 10:00', tz=tz), Timestamp('2011-01-03 10:00'), Timestamp('2011-01-02 10:00', tz=tz)]) - self.assert_series_equal(expected, result) - self.assert_series_equal(pd.isnull(s), null_loc) + tm.assert_series_equal(expected, result) + tm.assert_series_equal(pd.isnull(s), null_loc) result = s.fillna('AAA') expected = Series([Timestamp('2011-01-01 10:00'), 'AAA', Timestamp('2011-01-03 10:00'), 'AAA'], dtype=object) - self.assert_series_equal(expected, result) - self.assert_series_equal(pd.isnull(s), null_loc) + tm.assert_series_equal(expected, result) + tm.assert_series_equal(pd.isnull(s), null_loc) result = s.fillna({1: pd.Timestamp('2011-01-02 10:00', tz=tz), 3: pd.Timestamp('2011-01-04 10:00')}) @@ -164,8 +174,8 @@ def test_datetime64_tz_fillna(self): Timestamp('2011-01-02 10:00', tz=tz), Timestamp('2011-01-03 10:00'), Timestamp('2011-01-04 10:00')]) - self.assert_series_equal(expected, result) - self.assert_series_equal(pd.isnull(s), null_loc) + tm.assert_series_equal(expected, result) + tm.assert_series_equal(pd.isnull(s), null_loc) result = 
s.fillna({1: pd.Timestamp('2011-01-02 10:00'), 3: pd.Timestamp('2011-01-04 10:00')}) @@ -173,31 +183,31 @@ def test_datetime64_tz_fillna(self): Timestamp('2011-01-02 10:00'), Timestamp('2011-01-03 10:00'), Timestamp('2011-01-04 10:00')]) - self.assert_series_equal(expected, result) - self.assert_series_equal(pd.isnull(s), null_loc) + tm.assert_series_equal(expected, result) + tm.assert_series_equal(pd.isnull(s), null_loc) # DatetimeBlockTZ idx = pd.DatetimeIndex(['2011-01-01 10:00', pd.NaT, '2011-01-03 10:00', pd.NaT], tz=tz) s = pd.Series(idx) - self.assertEqual(s.dtype, 'datetime64[ns, {0}]'.format(tz)) - self.assert_series_equal(pd.isnull(s), null_loc) + assert s.dtype == 'datetime64[ns, {0}]'.format(tz) + tm.assert_series_equal(pd.isnull(s), null_loc) result = s.fillna(pd.Timestamp('2011-01-02 10:00')) expected = Series([Timestamp('2011-01-01 10:00', tz=tz), Timestamp('2011-01-02 10:00'), Timestamp('2011-01-03 10:00', tz=tz), Timestamp('2011-01-02 10:00')]) - self.assert_series_equal(expected, result) - self.assert_series_equal(pd.isnull(s), null_loc) + tm.assert_series_equal(expected, result) + tm.assert_series_equal(pd.isnull(s), null_loc) result = s.fillna(pd.Timestamp('2011-01-02 10:00', tz=tz)) idx = pd.DatetimeIndex(['2011-01-01 10:00', '2011-01-02 10:00', '2011-01-03 10:00', '2011-01-02 10:00'], tz=tz) expected = Series(idx) - self.assert_series_equal(expected, result) - self.assert_series_equal(pd.isnull(s), null_loc) + tm.assert_series_equal(expected, result) + tm.assert_series_equal(pd.isnull(s), null_loc) result = s.fillna(pd.Timestamp('2011-01-02 10:00', tz=tz).to_pydatetime()) @@ -205,15 +215,15 @@ def test_datetime64_tz_fillna(self): '2011-01-03 10:00', '2011-01-02 10:00'], tz=tz) expected = Series(idx) - self.assert_series_equal(expected, result) - self.assert_series_equal(pd.isnull(s), null_loc) + tm.assert_series_equal(expected, result) + tm.assert_series_equal(pd.isnull(s), null_loc) result = s.fillna('AAA') expected = Series([Timestamp('2011-01-01 10:00', tz=tz), 'AAA', Timestamp('2011-01-03 10:00', tz=tz), 'AAA'], dtype=object) - self.assert_series_equal(expected, result) - self.assert_series_equal(pd.isnull(s), null_loc) + tm.assert_series_equal(expected, result) + tm.assert_series_equal(pd.isnull(s), null_loc) result = s.fillna({1: pd.Timestamp('2011-01-02 10:00', tz=tz), 3: pd.Timestamp('2011-01-04 10:00')}) @@ -221,8 +231,8 @@ def test_datetime64_tz_fillna(self): Timestamp('2011-01-02 10:00', tz=tz), Timestamp('2011-01-03 10:00', tz=tz), Timestamp('2011-01-04 10:00')]) - self.assert_series_equal(expected, result) - self.assert_series_equal(pd.isnull(s), null_loc) + tm.assert_series_equal(expected, result) + tm.assert_series_equal(pd.isnull(s), null_loc) result = s.fillna({1: pd.Timestamp('2011-01-02 10:00', tz=tz), 3: pd.Timestamp('2011-01-04 10:00', tz=tz)}) @@ -230,8 +240,8 @@ def test_datetime64_tz_fillna(self): Timestamp('2011-01-02 10:00', tz=tz), Timestamp('2011-01-03 10:00', tz=tz), Timestamp('2011-01-04 10:00', tz=tz)]) - self.assert_series_equal(expected, result) - self.assert_series_equal(pd.isnull(s), null_loc) + tm.assert_series_equal(expected, result) + tm.assert_series_equal(pd.isnull(s), null_loc) # filling with a naive/other zone, coerce to object result = s.fillna(Timestamp('20130101')) @@ -239,16 +249,28 @@ def test_datetime64_tz_fillna(self): Timestamp('2013-01-01'), Timestamp('2011-01-03 10:00', tz=tz), Timestamp('2013-01-01')]) - self.assert_series_equal(expected, result) - self.assert_series_equal(pd.isnull(s), null_loc) + 
tm.assert_series_equal(expected, result) + tm.assert_series_equal(pd.isnull(s), null_loc) result = s.fillna(Timestamp('20130101', tz='US/Pacific')) expected = Series([Timestamp('2011-01-01 10:00', tz=tz), Timestamp('2013-01-01', tz='US/Pacific'), Timestamp('2011-01-03 10:00', tz=tz), Timestamp('2013-01-01', tz='US/Pacific')]) - self.assert_series_equal(expected, result) - self.assert_series_equal(pd.isnull(s), null_loc) + tm.assert_series_equal(expected, result) + tm.assert_series_equal(pd.isnull(s), null_loc) + + # with timezone + # GH 15855 + df = pd.Series([pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]) + exp = pd.Series([pd.Timestamp('2012-11-11 00:00:00+01:00'), + pd.Timestamp('2012-11-11 00:00:00+01:00')]) + assert_series_equal(df.fillna(method='pad'), exp) + + df = pd.Series([pd.NaT, pd.Timestamp('2012-11-11 00:00:00+01:00')]) + exp = pd.Series([pd.Timestamp('2012-11-11 00:00:00+01:00'), + pd.Timestamp('2012-11-11 00:00:00+01:00')]) + assert_series_equal(df.fillna(method='bfill'), exp) def test_datetime64tz_fillna_round_issue(self): # GH 14872 @@ -268,6 +290,20 @@ def test_datetime64tz_fillna_round_issue(self): assert_series_equal(filled, expected) + def test_fillna_downcast(self): + # GH 15277 + # infer int64 from float64 + s = pd.Series([1., np.nan]) + result = s.fillna(0, downcast='infer') + expected = pd.Series([1, 0]) + assert_series_equal(result, expected) + + # infer int64 from float64 when fillna value is a dict + s = pd.Series([1., np.nan]) + result = s.fillna({1: 0}, downcast='infer') + expected = pd.Series([1, 0]) + assert_series_equal(result, expected) + def test_fillna_int(self): s = Series(np.random.randint(-100, 100, 50)) s.fillna(method='ffill', inplace=True) @@ -275,8 +311,52 @@ def test_fillna_int(self): def test_fillna_raise(self): s = Series(np.random.randint(-100, 100, 50)) - self.assertRaises(TypeError, s.fillna, [1, 2]) - self.assertRaises(TypeError, s.fillna, (1, 2)) + pytest.raises(TypeError, s.fillna, [1, 2]) + pytest.raises(TypeError, s.fillna, (1, 2)) + + # related GH 9217, make sure limit is an int and greater than 0 + s = Series([1, 2, 3, None]) + for limit in [-1, 0, 1., 2.]: + for method in ['backfill', 'bfill', 'pad', 'ffill', None]: + with pytest.raises(ValueError): + s.fillna(1, limit=limit, method=method) + + def test_fillna_nat(self): + series = Series([0, 1, 2, iNaT], dtype='M8[ns]') + + filled = series.fillna(method='pad') + filled2 = series.fillna(value=series.values[2]) + + expected = series.copy() + expected.values[3] = expected.values[2] + + assert_series_equal(filled, expected) + assert_series_equal(filled2, expected) + + df = DataFrame({'A': series}) + filled = df.fillna(method='pad') + filled2 = df.fillna(value=series.values[2]) + expected = DataFrame({'A': expected}) + assert_frame_equal(filled, expected) + assert_frame_equal(filled2, expected) + + series = Series([iNaT, 0, 1, 2], dtype='M8[ns]') + + filled = series.fillna(method='bfill') + filled2 = series.fillna(value=series[1]) + + expected = series.copy() + expected[0] = expected[1] + + assert_series_equal(filled, expected) + assert_series_equal(filled2, expected) + + df = DataFrame({'A': series}) + filled = df.fillna(method='bfill') + filled2 = df.fillna(value=series[1]) + expected = DataFrame({'A': expected}) + assert_frame_equal(filled, expected) + assert_frame_equal(filled2, expected) def test_isnull_for_inf(self): s = Series(['a', np.inf, np.nan, 1.0]) @@ -291,21 +371,21 @@ def test_isnull_for_inf(self): def test_fillna(self): ts = Series([0., 1., 2., 3., 4.], 
index=tm.makeDateIndex(5)) - self.assert_series_equal(ts, ts.fillna(method='ffill')) + tm.assert_series_equal(ts, ts.fillna(method='ffill')) ts[2] = np.NaN exp = Series([0., 1., 1., 3., 4.], index=ts.index) - self.assert_series_equal(ts.fillna(method='ffill'), exp) + tm.assert_series_equal(ts.fillna(method='ffill'), exp) exp = Series([0., 1., 3., 3., 4.], index=ts.index) - self.assert_series_equal(ts.fillna(method='backfill'), exp) + tm.assert_series_equal(ts.fillna(method='backfill'), exp) exp = Series([0., 1., 5., 3., 4.], index=ts.index) - self.assert_series_equal(ts.fillna(value=5), exp) + tm.assert_series_equal(ts.fillna(value=5), exp) - self.assertRaises(ValueError, ts.fillna) - self.assertRaises(ValueError, self.ts.fillna, value=0, method='ffill') + pytest.raises(ValueError, ts.fillna) + pytest.raises(ValueError, self.ts.fillna, value=0, method='ffill') # GH 5703 s1 = Series([np.nan]) @@ -379,13 +459,19 @@ def test_fillna_invalid_method(self): try: self.ts.fillna(method='ffil') except ValueError as inst: - self.assertIn('ffil', str(inst)) + assert 'ffil' in str(inst) def test_ffill(self): ts = Series([0., 1., 2., 3., 4.], index=tm.makeDateIndex(5)) ts[2] = np.NaN assert_series_equal(ts.ffill(), ts.fillna(method='ffill')) + def test_ffill_mixed_dtypes_without_missing_data(self): + # GH14956 + series = pd.Series([datetime(2015, 1, 1, tzinfo=pytz.utc), 1]) + result = series.ffill() + assert_series_equal(series, result) + def test_bfill(self): ts = Series([0., 1., 2., 3., 4.], index=tm.makeDateIndex(5)) ts[2] = np.NaN @@ -393,34 +479,33 @@ def test_bfill(self): def test_timedelta64_nan(self): - from pandas import tslib td = Series([timedelta(days=i) for i in range(10)]) # nan ops on timedeltas td1 = td.copy() td1[0] = np.nan - self.assertTrue(isnull(td1[0])) - self.assertEqual(td1[0].value, tslib.iNaT) + assert isnull(td1[0]) + assert td1[0].value == iNaT td1[0] = td[0] - self.assertFalse(isnull(td1[0])) + assert not isnull(td1[0]) - td1[1] = tslib.iNaT - self.assertTrue(isnull(td1[1])) - self.assertEqual(td1[1].value, tslib.iNaT) + td1[1] = iNaT + assert isnull(td1[1]) + assert td1[1].value == iNaT td1[1] = td[1] - self.assertFalse(isnull(td1[1])) + assert not isnull(td1[1]) - td1[2] = tslib.NaT - self.assertTrue(isnull(td1[2])) - self.assertEqual(td1[2].value, tslib.iNaT) + td1[2] = NaT + assert isnull(td1[2]) + assert td1[2].value == iNaT td1[2] = td[2] - self.assertFalse(isnull(td1[2])) + assert not isnull(td1[2]) # boolean setting # this doesn't work, not sure numpy even supports it # result = td[(td>np.timedelta64(timedelta(days=3))) & # td val + expected = Series([x > val for x in series]) + tm.assert_series_equal(result, expected) - _multiprocess_can_split_ = True + val = series[5] + result = series > val + expected = Series([x > val for x in series]) + tm.assert_series_equal(result, expected) def test_comparisons(self): left = np.random.randn(10) @@ -108,8 +121,8 @@ def test_div(self): result = p['first'] / p['second'] assert_series_equal(result, p['first'].astype('float64'), check_names=False) - self.assertTrue(result.name is None) - self.assertFalse(np.array_equal(result, p['second'] / p['first'])) + assert result.name is None + assert not np.array_equal(result, p['second'] / p['first']) # inf signing s = Series([np.nan, 1., -1.]) @@ -143,6 +156,19 @@ def test_div(self): expected = Series([-inf, nan, inf]) assert_series_equal(result, expected) + # GH 8674 + zero_array = np.array([0] * 5) + data = np.random.randn(5) + expected = pd.Series([0.] 
* 5) + result = zero_array / pd.Series(data) + assert_series_equal(result, expected) + + result = pd.Series(zero_array) / data + assert_series_equal(result, expected) + + result = pd.Series(zero_array) / pd.Series(data) + assert_series_equal(result, expected) + def test_operators(self): def _check_op(series, other, op, pos_only=False, check_dtype=True): @@ -206,7 +232,7 @@ def check(series, other): # check the values, name, and index separatly assert_almost_equal(np.asarray(result), expected) - self.assertEqual(result.name, series.name) + assert result.name == series.name assert_index_equal(result.index, series.index) check(self.ts, self.ts * 2) @@ -222,12 +248,12 @@ def test_operators_empty_int_corner(self): def test_operators_timedelta64(self): # invalid ops - self.assertRaises(Exception, self.objSeries.__add__, 1) - self.assertRaises(Exception, self.objSeries.__add__, - np.array(1, dtype=np.int64)) - self.assertRaises(Exception, self.objSeries.__sub__, 1) - self.assertRaises(Exception, self.objSeries.__sub__, - np.array(1, dtype=np.int64)) + pytest.raises(Exception, self.objSeries.__add__, 1) + pytest.raises(Exception, self.objSeries.__add__, + np.array(1, dtype=np.int64)) + pytest.raises(Exception, self.objSeries.__sub__, 1) + pytest.raises(Exception, self.objSeries.__sub__, + np.array(1, dtype=np.int64)) # seriese ops v1 = date_range('2012-1-1', periods=3, freq='D') @@ -236,25 +262,25 @@ def test_operators_timedelta64(self): xp = Series(1e9 * 3600 * 24, rs.index).astype('int64').astype('timedelta64[ns]') assert_series_equal(rs, xp) - self.assertEqual(rs.dtype, 'timedelta64[ns]') + assert rs.dtype == 'timedelta64[ns]' df = DataFrame(dict(A=v1)) td = Series([timedelta(days=i) for i in range(3)]) - self.assertEqual(td.dtype, 'timedelta64[ns]') + assert td.dtype == 'timedelta64[ns]' # series on the rhs result = df['A'] - df['A'].shift() - self.assertEqual(result.dtype, 'timedelta64[ns]') + assert result.dtype == 'timedelta64[ns]' result = df['A'] + td - self.assertEqual(result.dtype, 'M8[ns]') + assert result.dtype == 'M8[ns]' # scalar Timestamp on rhs maxa = df['A'].max() - tm.assertIsInstance(maxa, Timestamp) + assert isinstance(maxa, Timestamp) resultb = df['A'] - df['A'].max() - self.assertEqual(resultb.dtype, 'timedelta64[ns]') + assert resultb.dtype == 'timedelta64[ns]' # timestamp on lhs result = resultb + df['A'] @@ -268,11 +294,11 @@ def test_operators_timedelta64(self): expected = Series( [timedelta(days=4017 + i) for i in range(3)], name='A') assert_series_equal(result, expected) - self.assertEqual(result.dtype, 'm8[ns]') + assert result.dtype == 'm8[ns]' d = datetime(2001, 1, 1, 3, 4) resulta = df['A'] - d - self.assertEqual(resulta.dtype, 'm8[ns]') + assert resulta.dtype == 'm8[ns]' # roundtrip resultb = resulta + d @@ -283,31 +309,31 @@ def test_operators_timedelta64(self): resulta = df['A'] + td resultb = resulta - td assert_series_equal(resultb, df['A']) - self.assertEqual(resultb.dtype, 'M8[ns]') + assert resultb.dtype == 'M8[ns]' # roundtrip td = timedelta(minutes=5, seconds=3) resulta = df['A'] + td resultb = resulta - td assert_series_equal(df['A'], resultb) - self.assertEqual(resultb.dtype, 'M8[ns]') + assert resultb.dtype == 'M8[ns]' # inplace value = rs[2] + np.timedelta64(timedelta(minutes=5, seconds=1)) rs[2] += np.timedelta64(timedelta(minutes=5, seconds=1)) - self.assertEqual(rs[2], value) + assert rs[2] == value def test_operator_series_comparison_zerorank(self): # GH 13006 result = np.float64(0) > pd.Series([1, 2, 3]) expected = 0.0 > pd.Series([1, 2, 3]) - 
self.assert_series_equal(result, expected) + tm.assert_series_equal(result, expected) result = pd.Series([1, 2, 3]) < np.float64(0) expected = pd.Series([1, 2, 3]) < 0.0 - self.assert_series_equal(result, expected) + tm.assert_series_equal(result, expected) result = np.array([0, 1, 2])[0] > pd.Series([0, 1, 2]) expected = 0.0 > pd.Series([1, 2, 3]) - self.assert_series_equal(result, expected) + tm.assert_series_equal(result, expected) def test_timedeltas_with_DateOffset(self): @@ -414,7 +440,7 @@ def test_timedelta64_operations_with_timedeltas(self): result = td1 - td2 expected = Series([timedelta(seconds=0)] * 3) - Series([timedelta( seconds=1)] * 3) - self.assertEqual(result.dtype, 'm8[ns]') + assert result.dtype == 'm8[ns]' assert_series_equal(result, expected) result2 = td2 - td1 @@ -432,7 +458,7 @@ def test_timedelta64_operations_with_timedeltas(self): result = td1 - td2 expected = Series([timedelta(seconds=0)] * 3) - Series([timedelta( seconds=1)] * 3) - self.assertEqual(result.dtype, 'm8[ns]') + assert result.dtype == 'm8[ns]' assert_series_equal(result, expected) result2 = td2 - td1 @@ -506,8 +532,8 @@ def test_timedelta64_operations_with_integers(self): for op in ['__add__', '__sub__']: sop = getattr(s1, op, None) if sop is not None: - self.assertRaises(TypeError, sop, 1) - self.assertRaises(TypeError, sop, s2.values) + pytest.raises(TypeError, sop, 1) + pytest.raises(TypeError, sop, s2.values) def test_timedelta64_conversions(self): startdate = Series(date_range('2013-01-01', '2013-01-03')) @@ -538,12 +564,12 @@ def test_timedelta64_conversions(self): # astype s = Series(date_range('20130101', periods=3)) result = s.astype(object) - self.assertIsInstance(result.iloc[0], datetime) - self.assertTrue(result.dtype == np.object_) + assert isinstance(result.iloc[0], datetime) + assert result.dtype == np.object_ result = s1.astype(object) - self.assertIsInstance(result.iloc[0], timedelta) - self.assertTrue(result.dtype == np.object_) + assert isinstance(result.iloc[0], timedelta) + assert result.dtype == np.object_ def test_timedelta64_equal_timedelta_supported_ops(self): ser = Series([Timestamp('20130301'), Timestamp('20130228 23:00:00'), @@ -586,7 +612,7 @@ def run_ops(ops, get_ser, test_ser): # defined for op_str in ops: op = getattr(get_ser, op_str, None) - with tm.assertRaisesRegexp(TypeError, 'operate'): + with tm.assert_raises_regex(TypeError, 'operate'): op(test_ser) # ## timedelta64 ### @@ -672,12 +698,12 @@ def run_ops(ops, get_ser, test_ser): result = dt1 - td1[0] exp = (dt1.dt.tz_localize(None) - td1[0]).dt.tz_localize(tz) assert_series_equal(result, exp) - self.assertRaises(TypeError, lambda: td1[0] - dt1) + pytest.raises(TypeError, lambda: td1[0] - dt1) result = dt2 - td2[0] exp = (dt2.dt.tz_localize(None) - td2[0]).dt.tz_localize(tz) assert_series_equal(result, exp) - self.assertRaises(TypeError, lambda: td2[0] - dt2) + pytest.raises(TypeError, lambda: td2[0] - dt2) result = dt1 + td1 exp = (dt1.dt.tz_localize(None) + td1).dt.tz_localize(tz) @@ -695,8 +721,8 @@ def run_ops(ops, get_ser, test_ser): exp = (dt2.dt.tz_localize(None) - td2).dt.tz_localize(tz) assert_series_equal(result, exp) - self.assertRaises(TypeError, lambda: td1 - dt1) - self.assertRaises(TypeError, lambda: td2 - dt2) + pytest.raises(TypeError, lambda: td1 - dt1) + pytest.raises(TypeError, lambda: td2 - dt2) def test_sub_datetime_compat(self): # GH 14088 @@ -744,7 +770,7 @@ def test_ops_nat(self): assert_series_equal(datetime_series - single_nat_dtype_datetime, nat_series_dtype_timedelta) - with 
tm.assertRaises(TypeError): + with pytest.raises(TypeError): -single_nat_dtype_datetime + datetime_series assert_series_equal(datetime_series - single_nat_dtype_timedelta, @@ -763,7 +789,7 @@ def test_ops_nat(self): assert_series_equal(nat_series_dtype_timestamp - single_nat_dtype_datetime, nat_series_dtype_timedelta) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): -single_nat_dtype_datetime + nat_series_dtype_timestamp assert_series_equal(nat_series_dtype_timestamp - @@ -773,7 +799,7 @@ def test_ops_nat(self): nat_series_dtype_timestamp, nat_series_dtype_timestamp) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): timedelta_series - single_nat_dtype_datetime # addition @@ -857,13 +883,13 @@ def test_ops_nat(self): assert_series_equal(timedelta_series * nan, nat_series_dtype_timedelta) assert_series_equal(nan * timedelta_series, nat_series_dtype_timedelta) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): datetime_series * 1 - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): nat_series_dtype_timestamp * 1 - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): datetime_series * 1.0 - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): nat_series_dtype_timestamp * 1.0 # division @@ -872,9 +898,9 @@ def test_ops_nat(self): assert_series_equal(timedelta_series / 2.0, Series([NaT, Timedelta('0.5s')])) assert_series_equal(timedelta_series / nan, nat_series_dtype_timedelta) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): nat_series_dtype_timestamp / 1.0 - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): nat_series_dtype_timestamp / 1 def test_ops_datetimelike_align(self): @@ -1006,12 +1032,12 @@ def test_comparison_invalid(self): s2 = Series(date_range('20010101', periods=5)) for (x, y) in [(s, s2), (s2, s)]: - self.assertRaises(TypeError, lambda: x == y) - self.assertRaises(TypeError, lambda: x != y) - self.assertRaises(TypeError, lambda: x >= y) - self.assertRaises(TypeError, lambda: x > y) - self.assertRaises(TypeError, lambda: x < y) - self.assertRaises(TypeError, lambda: x <= y) + pytest.raises(TypeError, lambda: x == y) + pytest.raises(TypeError, lambda: x != y) + pytest.raises(TypeError, lambda: x >= y) + pytest.raises(TypeError, lambda: x > y) + pytest.raises(TypeError, lambda: x < y) + pytest.raises(TypeError, lambda: x <= y) def test_more_na_comparisons(self): for dtype in [None, object]: @@ -1109,11 +1135,11 @@ def test_nat_comparisons_scalar(self): def test_comparison_different_length(self): a = Series(['a', 'b', 'c']) b = Series(['b', 'a']) - self.assertRaises(ValueError, a.__lt__, b) + pytest.raises(ValueError, a.__lt__, b) a = Series([1, 2]) b = Series([2, 3, 4]) - self.assertRaises(ValueError, a.__eq__, b) + pytest.raises(ValueError, a.__eq__, b) def test_comparison_label_based(self): @@ -1192,7 +1218,7 @@ def test_comparison_label_based(self): assert_series_equal(result, expected) for v in [np.nan, 'foo']: - self.assertRaises(TypeError, lambda: t | v) + pytest.raises(TypeError, lambda: t | v) for v in [False, 0]: result = Series([True, False, True], index=index) | v @@ -1209,7 +1235,7 @@ def test_comparison_label_based(self): expected = Series([False, False, False], index=index) assert_series_equal(result, expected) for v in [np.nan]: - self.assertRaises(TypeError, lambda: t & v) + pytest.raises(TypeError, lambda: t & v) def test_comparison_flex_basic(self): left = pd.Series(np.random.randn(10)) @@ -1234,7 +1260,7 @@ def 
test_comparison_flex_basic(self): # msg = 'No axis named 1 for object type' for op in ['eq', 'ne', 'le', 'lt', 'gt', 'ge']: - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): getattr(left, op)(right, axis=1) def test_comparison_flex_alignment(self): @@ -1281,6 +1307,20 @@ def test_comparison_flex_alignment_fill(self): exp = pd.Series([True, True, False, False], index=list('abcd')) assert_series_equal(left.gt(right, fill_value=0), exp) + def test_return_dtypes_bool_op_constant(self): + # GH 15115 + s = pd.Series([1, 3, 2], index=range(3)) + const = 2 + for op in ['eq', 'ne', 'gt', 'lt', 'ge', 'le']: + result = getattr(s, op)(const).get_dtype_counts() + tm.assert_series_equal(result, Series([1], ['bool'])) + + # empty Series + empty = s.iloc[:0] + for op in ['eq', 'ne', 'gt', 'lt', 'ge', 'le']: + result = getattr(empty, op)(const).get_dtype_counts() + tm.assert_series_equal(result, Series([1], ['bool'])) + def test_operators_bitwise(self): # GH 9016: support bitwise op for integer types index = list('bca') @@ -1350,11 +1390,11 @@ def test_operators_bitwise(self): expected = Series([1, 1, 3, 3], dtype='int32') assert_series_equal(res, expected) - self.assertRaises(TypeError, lambda: s_1111 & 'a') - self.assertRaises(TypeError, lambda: s_1111 & ['a', 'b', 'c', 'd']) - self.assertRaises(TypeError, lambda: s_0123 & np.NaN) - self.assertRaises(TypeError, lambda: s_0123 & 3.14) - self.assertRaises(TypeError, lambda: s_0123 & [0.1, 4, 3.14, 2]) + pytest.raises(TypeError, lambda: s_1111 & 'a') + pytest.raises(TypeError, lambda: s_1111 & ['a', 'b', 'c', 'd']) + pytest.raises(TypeError, lambda: s_0123 & np.NaN) + pytest.raises(TypeError, lambda: s_0123 & 3.14) + pytest.raises(TypeError, lambda: s_0123 & [0.1, 4, 3.14, 2]) # s_0123 will be all false now because of reindexing like s_tft if compat.PY3: @@ -1397,7 +1437,7 @@ def test_scalar_na_cmp_corners(self): def tester(a, b): return a & b - self.assertRaises(TypeError, tester, s, datetime(2005, 1, 1)) + pytest.raises(TypeError, tester, s, datetime(2005, 1, 1)) s = Series([2, 3, 4, 5, 6, 7, 8, 9, datetime(2005, 1, 1)]) s[::2] = np.nan @@ -1414,8 +1454,8 @@ def tester(a, b): # this is an alignment issue; these are equivalent # https://github.com/pandas-dev/pandas/issues/5284 - self.assertRaises(ValueError, lambda: d.__and__(s, axis='columns')) - self.assertRaises(ValueError, tester, s, d) + pytest.raises(ValueError, lambda: d.__and__(s, axis='columns')) + pytest.raises(ValueError, tester, s, d) # this is wrong as it's not a boolean result # result = d.__and__(s,axis='index') @@ -1426,10 +1466,10 @@ def test_operators_corner(self): empty = Series([], index=Index([])) result = series + empty - self.assertTrue(np.isnan(result).all()) + assert np.isnan(result).all() result = empty + Series([], index=Index([])) - self.assertEqual(len(result), 0) + assert len(result) == 0 # TODO: this returned NotImplemented earlier, what to do?
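The `test_return_dtypes_bool_op_constant` addition earlier in this hunk pins down that the flex comparison methods return `bool` dtype even for an empty Series (GH 15115). A minimal standalone sketch of the behavior under test (the values are illustrative, not taken from the patch):

```python
import pandas as pd

s = pd.Series([1, 3, 2])

# Flex comparisons are element-wise and always produce a boolean Series.
assert s.eq(2).dtype == bool            # same result as s == 2
assert s.gt(2).tolist() == [False, True, False]

# The dtype stays bool even when there are no elements to compare.
empty = s.iloc[:0]
assert empty.le(2).dtype == bool
```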
# deltas = Series([timedelta(1)] * 5, index=np.arange(5)) @@ -1442,7 +1482,7 @@ def test_operators_corner(self): added = self.ts + int_ts expected = Series(self.ts.values[:-5] + int_ts.values, index=self.ts.index[:-5], name='ts') - self.assert_series_equal(added[:-5], expected) + tm.assert_series_equal(added[:-5], expected) def test_operators_reverse_object(self): # GH 56 @@ -1499,23 +1539,23 @@ def test_comp_ops_df_compat(self): for l, r in [(s1, s2), (s2, s1), (s3, s4), (s4, s3)]: msg = "Can only compare identically-labeled Series objects" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): l == r - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): l != r - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): l < r msg = "Can only compare identically-labeled DataFrame objects" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): l.to_frame() == r.to_frame() - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): l.to_frame() != r.to_frame() - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): l.to_frame() < r.to_frame() def test_bool_ops_df_compat(self): @@ -1589,10 +1629,10 @@ def test_series_frame_radd_bug(self): assert_frame_equal(result, expected) # really raise this time - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): datetime.now() + self.ts - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): self.ts + datetime.now() def test_series_radd_more(self): @@ -1605,7 +1645,7 @@ def test_series_radd_more(self): for d in data: for dtype in [None, object]: s = Series(d, dtype=dtype) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): 'foo_' + s for dtype in [None, object]: @@ -1642,7 +1682,7 @@ def test_frame_radd_more(self): for d in data: for dtype in [None, object]: s = DataFrame(d, dtype=dtype) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): 'foo_' + s for dtype in [None, object]: @@ -1737,8 +1777,8 @@ def _check_fill(meth, op, a, b, fill_value=0): def test_ne(self): ts = Series([3, 4, 5, 6, 7], [3, 4, 5, 6, 7], dtype=float) expected = [True, True, False, True, True] - self.assertTrue(tm.equalContents(ts.index != 5, expected)) - self.assertTrue(tm.equalContents(~(ts.index == 5), expected)) + assert tm.equalContents(ts.index != 5, expected) + assert tm.equalContents(~(ts.index == 5), expected) def test_operators_na_handling(self): from decimal import Decimal @@ -1748,8 +1788,8 @@ def test_operators_na_handling(self): result = s + s.shift(1) result2 = s.shift(1) + s - self.assertTrue(isnull(result[0])) - self.assertTrue(isnull(result2[0])) + assert isnull(result[0]) + assert isnull(result2[0]) s = Series(['foo', 'bar', 'baz', np.nan]) result = 'prefix_' + s diff --git a/pandas/tests/series/test_period.py b/pandas/tests/series/test_period.py new file mode 100644 index 0000000000000..6e8ee38d366e2 --- /dev/null +++ b/pandas/tests/series/test_period.py @@ -0,0 +1,251 @@ +import numpy as np + +import pandas as pd +import pandas.util.testing as tm +import pandas.core.indexes.period as period +from pandas import Series, period_range, DataFrame, Period + + +def _permute(obj): + return obj.take(np.random.permutation(len(obj))) + + +class TestSeriesPeriod(object): + + def setup_method(self, method): + self.series = Series(period_range('2000-01-01', periods=10, freq='D')) + + def 
test_auto_conversion(self): + series = Series(list(period_range('2000-01-01', periods=10, freq='D'))) + assert series.dtype == 'object' + + series = pd.Series([pd.Period('2011-01-01', freq='D'), + pd.Period('2011-02-01', freq='D')]) + assert series.dtype == 'object' + + def test_getitem(self): + assert self.series[1] == pd.Period('2000-01-02', freq='D') + + result = self.series[[2, 4]] + exp = pd.Series([pd.Period('2000-01-03', freq='D'), + pd.Period('2000-01-05', freq='D')], + index=[2, 4]) + tm.assert_series_equal(result, exp) + assert result.dtype == 'object' + + def test_isnull(self): + # GH 13737 + s = Series([pd.Period('2011-01', freq='M'), + pd.Period('NaT', freq='M')]) + tm.assert_series_equal(s.isnull(), Series([False, True])) + tm.assert_series_equal(s.notnull(), Series([True, False])) + + def test_fillna(self): + # GH 13737 + s = Series([pd.Period('2011-01', freq='M'), + pd.Period('NaT', freq='M')]) + + res = s.fillna(pd.Period('2012-01', freq='M')) + exp = Series([pd.Period('2011-01', freq='M'), + pd.Period('2012-01', freq='M')]) + tm.assert_series_equal(res, exp) + assert res.dtype == 'object' + + res = s.fillna('XXX') + exp = Series([pd.Period('2011-01', freq='M'), 'XXX']) + tm.assert_series_equal(res, exp) + assert res.dtype == 'object' + + def test_dropna(self): + # GH 13737 + s = Series([pd.Period('2011-01', freq='M'), + pd.Period('NaT', freq='M')]) + tm.assert_series_equal(s.dropna(), + Series([pd.Period('2011-01', freq='M')])) + + def test_series_comparison_scalars(self): + val = pd.Period('2000-01-04', freq='D') + result = self.series > val + expected = pd.Series([x > val for x in self.series]) + tm.assert_series_equal(result, expected) + + val = self.series[5] + result = self.series > val + expected = pd.Series([x > val for x in self.series]) + tm.assert_series_equal(result, expected) + + def test_between(self): + left, right = self.series[[2, 7]] + result = self.series.between(left, right) + expected = (self.series >= left) & (self.series <= right) + tm.assert_series_equal(result, expected) + + # --------------------------------------------------------------------- + # NaT support + + """ + # ToDo: enable when period dtype is supported + def test_NaT_scalar(self): + series = Series([0, 1000, 2000, iNaT], dtype='period[D]') + + val = series[3] + assert isnull(val) + + series[2] = val + assert isnull(series[2]) + + def test_NaT_cast(self): + result = Series([np.nan]).astype('period[D]') + expected = Series([NaT]) + tm.assert_series_equal(result, expected) + """ + + def test_set_none_nan(self): + # currently Period is stored as object dtype, not as NaT + self.series[3] = None + assert self.series[3] is None + + self.series[3:5] = None + assert self.series[4] is None + + self.series[5] = np.nan + assert np.isnan(self.series[5]) + + self.series[5:7] = np.nan + assert np.isnan(self.series[6]) + + def test_intercept_astype_object(self): + expected = self.series.astype('object') + + df = DataFrame({'a': self.series, + 'b': np.random.randn(len(self.series))}) + + result = df.values.squeeze() + assert (result[:, 0] == expected.values).all() + + df = DataFrame({'a': self.series, 'b': ['foo'] * len(self.series)}) + + result = df.values.squeeze() + assert (result[:, 0] == expected.values).all() + + def test_comp_series_period_scalar(self): + # GH 13200 + for freq in ['M', '2M', '3M']: + base = Series([Period(x, freq=freq) for x in + ['2011-01', '2011-02', '2011-03', '2011-04']]) + p = Period('2011-02', freq=freq) + + exp = pd.Series([False, True, False, False]) +
tm.assert_series_equal(base == p, exp) + tm.assert_series_equal(p == base, exp) + + exp = pd.Series([True, False, True, True]) + tm.assert_series_equal(base != p, exp) + tm.assert_series_equal(p != base, exp) + + exp = pd.Series([False, False, True, True]) + tm.assert_series_equal(base > p, exp) + tm.assert_series_equal(p < base, exp) + + exp = pd.Series([True, False, False, False]) + tm.assert_series_equal(base < p, exp) + tm.assert_series_equal(p > base, exp) + + exp = pd.Series([False, True, True, True]) + tm.assert_series_equal(base >= p, exp) + tm.assert_series_equal(p <= base, exp) + + exp = pd.Series([True, True, False, False]) + tm.assert_series_equal(base <= p, exp) + tm.assert_series_equal(p >= base, exp) + + # different base freq + msg = "Input has different freq=A-DEC from Period" + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + base <= Period('2011', freq='A') + + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + Period('2011', freq='A') >= base + + def test_comp_series_period_series(self): + # GH 13200 + for freq in ['M', '2M', '3M']: + base = Series([Period(x, freq=freq) for x in + ['2011-01', '2011-02', '2011-03', '2011-04']]) + + s = Series([Period(x, freq=freq) for x in + ['2011-02', '2011-01', '2011-03', '2011-05']]) + + exp = Series([False, False, True, False]) + tm.assert_series_equal(base == s, exp) + + exp = Series([True, True, False, True]) + tm.assert_series_equal(base != s, exp) + + exp = Series([False, True, False, False]) + tm.assert_series_equal(base > s, exp) + + exp = Series([True, False, False, True]) + tm.assert_series_equal(base < s, exp) + + exp = Series([False, True, True, False]) + tm.assert_series_equal(base >= s, exp) + + exp = Series([True, False, True, True]) + tm.assert_series_equal(base <= s, exp) + + s2 = Series([Period(x, freq='A') for x in + ['2011', '2011', '2011', '2011']]) + + # different base freq + msg = "Input has different freq=A-DEC from Period" + with tm.assert_raises_regex( + period.IncompatibleFrequency, msg): + base <= s2 + + def test_comp_series_period_object(self): + # GH 13200 + base = Series([Period('2011', freq='A'), Period('2011-02', freq='M'), + Period('2013', freq='A'), Period('2011-04', freq='M')]) + + s = Series([Period('2012', freq='A'), Period('2011-01', freq='M'), + Period('2013', freq='A'), Period('2011-05', freq='M')]) + + exp = Series([False, False, True, False]) + tm.assert_series_equal(base == s, exp) + + exp = Series([True, True, False, True]) + tm.assert_series_equal(base != s, exp) + + exp = Series([False, True, False, False]) + tm.assert_series_equal(base > s, exp) + + exp = Series([True, False, False, True]) + tm.assert_series_equal(base < s, exp) + + exp = Series([False, True, True, False]) + tm.assert_series_equal(base >= s, exp) + + exp = Series([True, False, True, True]) + tm.assert_series_equal(base <= s, exp) + + def test_align_series(self): + rng = period_range('1/1/2000', '1/1/2010', freq='A') + ts = Series(np.random.randn(len(rng)), index=rng) + + result = ts + ts[::2] + expected = ts + ts + expected[1::2] = np.nan + tm.assert_series_equal(result, expected) + + result = ts + _permute(ts[::2]) + tm.assert_series_equal(result, expected) + + # it works! 
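The period comparison tests above all rest on the same rule: Periods compare element-wise only when their frequencies match, and a mismatch raises `IncompatibleFrequency` rather than silently returning `False`. A rough sketch of that contract, using the same exception import this test file uses (the concrete values are illustrative):

```python
import pandas as pd
from pandas import Period, Series
import pandas.core.indexes.period as period

left = Series([Period('2011-01', freq='M'), Period('2011-02', freq='M')])
right = Series([Period('2011-02', freq='M'), Period('2011-01', freq='M')])

# Matching frequency: an ordinary element-wise boolean result.
assert (left < right).tolist() == [True, False]

# Mismatched frequency: the comparison refuses to guess and raises.
try:
    left < Period('2011', freq='A')
except period.IncompatibleFrequency as err:
    print(err)  # e.g. "Input has different freq=A-DEC from Period..."
```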
+ for kind in ['inner', 'outer', 'left', 'right']: + ts.align(ts[::2], join=kind) + msg = "Input has different freq=D from PeriodIndex\\(freq=A-DEC\\)" + with tm.assert_raises_regex(period.IncompatibleFrequency, msg): + ts + ts.asfreq('D', how="end") diff --git a/pandas/tests/series/test_quantile.py b/pandas/tests/series/test_quantile.py index 76db6c90a685f..2d02260ac7303 100644 --- a/pandas/tests/series/test_quantile.py +++ b/pandas/tests/series/test_quantile.py @@ -1,59 +1,57 @@ # coding=utf-8 # pylint: disable-msg=E1101,W0612 -import nose +import pytest import numpy as np import pandas as pd from pandas import (Index, Series, _np_version_under1p9) -from pandas.tseries.index import Timestamp -from pandas.types.common import is_integer +from pandas.core.indexes.datetimes import Timestamp +from pandas.core.dtypes.common import is_integer import pandas.util.testing as tm from .common import TestData -class TestSeriesQuantile(TestData, tm.TestCase): +class TestSeriesQuantile(TestData): def test_quantile(self): - from numpy import percentile q = self.ts.quantile(0.1) - self.assertEqual(q, percentile(self.ts.valid(), 10)) + assert q == np.percentile(self.ts.valid(), 10) q = self.ts.quantile(0.9) - self.assertEqual(q, percentile(self.ts.valid(), 90)) + assert q == np.percentile(self.ts.valid(), 90) # object dtype q = Series(self.ts, dtype=object).quantile(0.9) - self.assertEqual(q, percentile(self.ts.valid(), 90)) + assert q == np.percentile(self.ts.valid(), 90) # datetime64[ns] dtype dts = self.ts.index.to_series() q = dts.quantile(.2) - self.assertEqual(q, Timestamp('2000-01-10 19:12:00')) + assert q == Timestamp('2000-01-10 19:12:00') # timedelta64[ns] dtype tds = dts.diff() q = tds.quantile(.25) - self.assertEqual(q, pd.to_timedelta('24:00:00')) + assert q == pd.to_timedelta('24:00:00') # GH7661 result = Series([np.timedelta64('NaT')]).sum() - self.assertTrue(result is pd.NaT) + assert result is pd.NaT msg = 'percentiles should all be in the interval \\[0, 1\\]' for invalid in [-1, 2, [0.5, -1], [0.5, 2]]: - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): self.ts.quantile(invalid) def test_quantile_multi(self): - from numpy import percentile qs = [.1, .9] result = self.ts.quantile(qs) - expected = pd.Series([percentile(self.ts.valid(), 10), - percentile(self.ts.valid(), 90)], + expected = pd.Series([np.percentile(self.ts.valid(), 10), + np.percentile(self.ts.valid(), 90)], index=qs, name=self.ts.name) tm.assert_series_equal(result, expected) @@ -70,60 +68,53 @@ def test_quantile_multi(self): [], dtype=float)) tm.assert_series_equal(result, expected) + @pytest.mark.skipif(_np_version_under1p9, + reason="Numpy version is under 1.9") def test_quantile_interpolation(self): - # GH #10174 - if _np_version_under1p9: - raise nose.SkipTest("Numpy version is under 1.9") - - from numpy import percentile + # see gh-10174 # interpolation = linear (default case) q = self.ts.quantile(0.1, interpolation='linear') - self.assertEqual(q, percentile(self.ts.valid(), 10)) + assert q == np.percentile(self.ts.valid(), 10) q1 = self.ts.quantile(0.1) - self.assertEqual(q1, percentile(self.ts.valid(), 10)) + assert q1 == np.percentile(self.ts.valid(), 10) # test with and without interpolation keyword - self.assertEqual(q, q1) + assert q == q1 + @pytest.mark.skipif(_np_version_under1p9, + reason="Numpy version is under 1.9") def test_quantile_interpolation_dtype(self): # GH #10174 - if _np_version_under1p9: - raise nose.SkipTest("Numpy version is under 1.9") - - from numpy 
import percentile # interpolation = linear (default case) q = pd.Series([1, 3, 4]).quantile(0.5, interpolation='lower') - self.assertEqual(q, percentile(np.array([1, 3, 4]), 50)) - self.assertTrue(is_integer(q)) + assert q == np.percentile(np.array([1, 3, 4]), 50) + assert is_integer(q) q = pd.Series([1, 3, 4]).quantile(0.5, interpolation='higher') - self.assertEqual(q, percentile(np.array([1, 3, 4]), 50)) - self.assertTrue(is_integer(q)) + assert q == np.percentile(np.array([1, 3, 4]), 50) + assert is_integer(q) + @pytest.mark.skipif(not _np_version_under1p9, + reason="Numpy version is greater than 1.9") def test_quantile_interpolation_np_lt_1p9(self): # GH #10174 - if not _np_version_under1p9: - raise nose.SkipTest("Numpy version is greater than 1.9") - - from numpy import percentile # interpolation = linear (default case) q = self.ts.quantile(0.1, interpolation='linear') - self.assertEqual(q, percentile(self.ts.valid(), 10)) + assert q == np.percentile(self.ts.valid(), 10) q1 = self.ts.quantile(0.1) - self.assertEqual(q1, percentile(self.ts.valid(), 10)) + assert q1 == np.percentile(self.ts.valid(), 10) # interpolation other than linear - expErrMsg = "Interpolation methods other than " - with tm.assertRaisesRegexp(ValueError, expErrMsg): + msg = "Interpolation methods other than " + with tm.assert_raises_regex(ValueError, msg): self.ts.quantile(0.9, interpolation='nearest') # object dtype - with tm.assertRaisesRegexp(ValueError, expErrMsg): - q = Series(self.ts, dtype=object).quantile(0.7, - interpolation='higher') + with tm.assert_raises_regex(ValueError, msg): + Series(self.ts, dtype=object).quantile(0.7, interpolation='higher') def test_quantile_nan(self): @@ -131,14 +122,14 @@ def test_quantile_nan(self): s = pd.Series([1, 2, 3, 4, np.nan]) result = s.quantile(0.5) expected = 2.5 - self.assertEqual(result, expected) + assert result == expected # all nan/empty cases = [Series([]), Series([np.nan, np.nan])] for s in cases: res = s.quantile(0.5) - self.assertTrue(np.isnan(res)) + assert np.isnan(res) res = s.quantile([0.5]) tm.assert_series_equal(res, pd.Series([np.nan], index=[0.5])) @@ -167,7 +158,7 @@ def test_quantile_box(self): for case in cases: s = pd.Series(case, name='XXX') res = s.quantile(0.5) - self.assertEqual(res, case[1]) + assert res == case[1] res = s.quantile([0.5]) exp = pd.Series([case[1]], index=[0.5], name='XXX') @@ -175,12 +166,12 @@ def test_datetime_timedelta_quantiles(self): # covers #9694 - self.assertTrue(pd.isnull(Series([], dtype='M8[ns]').quantile(.5))) - self.assertTrue(pd.isnull(Series([], dtype='m8[ns]').quantile(.5))) + assert pd.isnull(Series([], dtype='M8[ns]').quantile(.5)) + assert pd.isnull(Series([], dtype='m8[ns]').quantile(.5)) def test_quantile_nat(self): res = Series([pd.NaT, pd.NaT]).quantile(0.5) - self.assertTrue(res is pd.NaT) + assert res is pd.NaT res = Series([pd.NaT, pd.NaT]).quantile([0.5]) tm.assert_series_equal(res, pd.Series([pd.NaT], index=[0.5])) @@ -191,7 +182,7 @@ def test_quantile_empty(self): s = Series([], dtype='float64') res = s.quantile(0.5) - self.assertTrue(np.isnan(res)) + assert np.isnan(res) res = s.quantile([0.5]) exp = Series([np.nan], index=[0.5]) @@ -201,7 +192,7 @@ def test_quantile_empty(self): s = Series([], dtype='int64') res = s.quantile(0.5) - self.assertTrue(np.isnan(res)) + assert np.isnan(res) res = s.quantile([0.5]) exp = Series([np.nan], index=[0.5]) @@ -211,7 +202,7 @@ def test_quantile_empty(self): s = Series([], dtype='datetime64[ns]') res = s.quantile(0.5) -
self.assertTrue(res is pd.NaT) + assert res is pd.NaT res = s.quantile([0.5]) exp = Series([pd.NaT], index=[0.5]) diff --git a/pandas/tests/series/test_rank.py b/pandas/tests/series/test_rank.py new file mode 100644 index 0000000000000..ff489eb7f15b1 --- /dev/null +++ b/pandas/tests/series/test_rank.py @@ -0,0 +1,324 @@ +# -*- coding: utf-8 -*- +from pandas import compat + +import pytest + +from distutils.version import LooseVersion +from numpy import nan +import numpy as np + +from pandas import (Series, date_range, NaT) + +from pandas.compat import product +from pandas.util.testing import assert_series_equal +import pandas.util.testing as tm +from pandas.tests.series.common import TestData + + +class TestSeriesRank(TestData): + s = Series([1, 3, 4, 2, nan, 2, 1, 5, nan, 3]) + + results = { + 'average': np.array([1.5, 5.5, 7.0, 3.5, nan, + 3.5, 1.5, 8.0, nan, 5.5]), + 'min': np.array([1, 5, 7, 3, nan, 3, 1, 8, nan, 5]), + 'max': np.array([2, 6, 7, 4, nan, 4, 2, 8, nan, 6]), + 'first': np.array([1, 5, 7, 3, nan, 4, 2, 8, nan, 6]), + 'dense': np.array([1, 3, 4, 2, nan, 2, 1, 5, nan, 3]), + } + + def test_rank(self): + tm._skip_if_no_scipy() + from scipy.stats import rankdata + + self.ts[::2] = np.nan + self.ts[:10][::3] = 4. + + ranks = self.ts.rank() + oranks = self.ts.astype('O').rank() + + assert_series_equal(ranks, oranks) + + mask = np.isnan(self.ts) + filled = self.ts.fillna(np.inf) + + # rankdata returns a ndarray + exp = Series(rankdata(filled), index=filled.index, name='ts') + exp[mask] = np.nan + + tm.assert_series_equal(ranks, exp) + + iseries = Series(np.arange(5).repeat(2)) + + iranks = iseries.rank() + exp = iseries.astype(float).rank() + assert_series_equal(iranks, exp) + iseries = Series(np.arange(5)) + 1.0 + exp = iseries / 5.0 + iranks = iseries.rank(pct=True) + + assert_series_equal(iranks, exp) + + iseries = Series(np.repeat(1, 100)) + exp = Series(np.repeat(0.505, 100)) + iranks = iseries.rank(pct=True) + assert_series_equal(iranks, exp) + + iseries[1] = np.nan + exp = Series(np.repeat(50.0 / 99.0, 100)) + exp[1] = np.nan + iranks = iseries.rank(pct=True) + assert_series_equal(iranks, exp) + + iseries = Series(np.arange(5)) + 1.0 + iseries[4] = np.nan + exp = iseries / 4.0 + iranks = iseries.rank(pct=True) + assert_series_equal(iranks, exp) + + iseries = Series(np.repeat(np.nan, 100)) + exp = iseries.copy() + iranks = iseries.rank(pct=True) + assert_series_equal(iranks, exp) + + iseries = Series(np.arange(5)) + 1 + iseries[4] = np.nan + exp = iseries / 4.0 + iranks = iseries.rank(pct=True) + assert_series_equal(iranks, exp) + + rng = date_range('1/1/1990', periods=5) + iseries = Series(np.arange(5), rng) + 1 + iseries.iloc[4] = np.nan + exp = iseries / 4.0 + iranks = iseries.rank(pct=True) + assert_series_equal(iranks, exp) + + iseries = Series([1e-50, 1e-100, 1e-20, 1e-2, 1e-20 + 1e-30, 1e-1]) + exp = Series([2, 1, 3, 5, 4, 6.0]) + iranks = iseries.rank() + assert_series_equal(iranks, exp) + + # GH 5968 + iseries = Series(['3 day', '1 day 10m', '-2 day', NaT], + dtype='m8[ns]') + exp = Series([3, 2, 1, np.nan]) + iranks = iseries.rank() + assert_series_equal(iranks, exp) + + values = np.array( + [-50, -1, -1e-20, -1e-25, -1e-50, 0, 1e-40, 1e-20, 1e-10, 2, 40 + ], dtype='float64') + random_order = np.random.permutation(len(values)) + iseries = Series(values[random_order]) + exp = Series(random_order + 1.0, dtype='float64') + iranks = iseries.rank() + assert_series_equal(iranks, exp) + + def test_rank_categorical(self): + # GH issue #15420 rank incorrectly orders ordered 
categories + + # Test ascending/descending ranking for ordered categoricals + exp = Series([1., 2., 3., 4., 5., 6.]) + exp_desc = Series([6., 5., 4., 3., 2., 1.]) + ordered = Series( + ['first', 'second', 'third', 'fourth', 'fifth', 'sixth'] + ).astype( + 'category', + categories=['first', 'second', 'third', + 'fourth', 'fifth', 'sixth'], + ordered=True + ) + assert_series_equal(ordered.rank(), exp) + assert_series_equal(ordered.rank(ascending=False), exp_desc) + + # Unordered categoricals should be ranked as objects + unordered = Series( + ['first', 'second', 'third', 'fourth', 'fifth', 'sixth'], + ).astype( + 'category', + categories=['first', 'second', 'third', + 'fourth', 'fifth', 'sixth'], + ordered=False + ) + exp_unordered = Series([2., 4., 6., 3., 1., 5.]) + res = unordered.rank() + assert_series_equal(res, exp_unordered) + + unordered1 = Series( + [1, 2, 3, 4, 5, 6], + ).astype( + 'category', + categories=[1, 2, 3, 4, 5, 6], + ordered=False + ) + exp_unordered1 = Series([1., 2., 3., 4., 5., 6.]) + res1 = unordered1.rank() + assert_series_equal(res1, exp_unordered1) + + # Test na_option for rank data + na_ser = Series( + ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', np.NaN] + ).astype( + 'category', + categories=[ + 'first', 'second', 'third', 'fourth', + 'fifth', 'sixth', 'seventh' + ], + ordered=True + ) + + exp_top = Series([2., 3., 4., 5., 6., 7., 1.]) + exp_bot = Series([1., 2., 3., 4., 5., 6., 7.]) + exp_keep = Series([1., 2., 3., 4., 5., 6., np.NaN]) + + assert_series_equal(na_ser.rank(na_option='top'), exp_top) + assert_series_equal(na_ser.rank(na_option='bottom'), exp_bot) + assert_series_equal(na_ser.rank(na_option='keep'), exp_keep) + + # Test na_option for rank data with ascending False + exp_top = Series([7., 6., 5., 4., 3., 2., 1.]) + exp_bot = Series([6., 5., 4., 3., 2., 1., 7.]) + exp_keep = Series([6., 5., 4., 3., 2., 1., np.NaN]) + + assert_series_equal( + na_ser.rank(na_option='top', ascending=False), + exp_top + ) + assert_series_equal( + na_ser.rank(na_option='bottom', ascending=False), + exp_bot + ) + assert_series_equal( + na_ser.rank(na_option='keep', ascending=False), + exp_keep + ) + + # Test with pct=True + na_ser = Series( + ['first', 'second', 'third', 'fourth', np.NaN], + ).astype( + 'category', + categories=['first', 'second', 'third', 'fourth'], + ordered=True + ) + exp_top = Series([0.4, 0.6, 0.8, 1., 0.2]) + exp_bot = Series([0.2, 0.4, 0.6, 0.8, 1.]) + exp_keep = Series([0.25, 0.5, 0.75, 1., np.NaN]) + + assert_series_equal(na_ser.rank(na_option='top', pct=True), exp_top) + assert_series_equal(na_ser.rank(na_option='bottom', pct=True), exp_bot) + assert_series_equal(na_ser.rank(na_option='keep', pct=True), exp_keep) + + def test_rank_signature(self): + s = Series([0, 1]) + s.rank(method='average') + pytest.raises(ValueError, s.rank, 'average') + + def test_rank_inf(self): + pytest.skip('DataFrame.rank does not currently rank ' + 'np.inf and -np.inf properly') + + values = np.array( + [-np.inf, -50, -1, -1e-20, -1e-25, -1e-50, 0, 1e-40, 1e-20, 1e-10, + 2, 40, np.inf], dtype='float64') + random_order = np.random.permutation(len(values)) + iseries = Series(values[random_order]) + exp = Series(random_order + 1.0, dtype='float64') + iranks = iseries.rank() + assert_series_equal(iranks, exp) + + def test_rank_tie_methods(self): + s = self.s + + def _check(s, expected, method='average'): + result = s.rank(method=method) + tm.assert_series_equal(result, Series(expected)) + + dtypes = [None, object] + disabled = set([(object, 'first')]) + results 
= self.results + + for method, dtype in product(results, dtypes): + if (dtype, method) in disabled: + continue + series = s if dtype is None else s.astype(dtype) + _check(series, results[method], method=method) + + def test_rank_methods_series(self): + tm.skip_if_no_package('scipy', min_version='0.13', + app='scipy.stats.rankdata') + import scipy + from scipy.stats import rankdata + + xs = np.random.randn(9) + xs = np.concatenate([xs[i:] for i in range(0, 9, 2)]) # add duplicates + np.random.shuffle(xs) + + index = [chr(ord('a') + i) for i in range(len(xs))] + + for vals in [xs, xs + 1e6, xs * 1e-6]: + ts = Series(vals, index=index) + + for m in ['average', 'min', 'max', 'first', 'dense']: + result = ts.rank(method=m) + sprank = rankdata(vals, m if m != 'first' else 'ordinal') + expected = Series(sprank, index=index) + + if LooseVersion(scipy.__version__) >= '0.17.0': + expected = expected.astype('float64') + tm.assert_series_equal(result, expected) + + def test_rank_dense_method(self): + dtypes = ['O', 'f8', 'i8'] + in_out = [([1], [1]), + ([2], [1]), + ([0], [1]), + ([2, 2], [1, 1]), + ([1, 2, 3], [1, 2, 3]), + ([4, 2, 1], [3, 2, 1],), + ([1, 1, 5, 5, 3], [1, 1, 3, 3, 2]), + ([-5, -4, -3, -2, -1], [1, 2, 3, 4, 5])] + + for ser, exp in in_out: + for dtype in dtypes: + s = Series(ser).astype(dtype) + result = s.rank(method='dense') + expected = Series(exp).astype(result.dtype) + assert_series_equal(result, expected) + + def test_rank_descending(self): + dtypes = ['O', 'f8', 'i8'] + + for dtype, method in product(dtypes, self.results): + if 'i' in dtype: + s = self.s.dropna() + else: + s = self.s.astype(dtype) + + res = s.rank(ascending=False) + expected = (s.max() - s).rank() + assert_series_equal(res, expected) + + if method == 'first' and dtype == 'O': + continue + + expected = (s.max() - s).rank(method=method) + res2 = s.rank(method=method, ascending=False) + assert_series_equal(res2, expected) + + def test_rank_int(self): + s = self.s.dropna().astype('i8') + + for method, res in compat.iteritems(self.results): + result = s.rank(method=method) + expected = Series(res).dropna() + expected.index = result.index + assert_series_equal(result, expected) + + def test_rank_object_bug(self): + # GH 13445 + + # smoke tests + Series([np.nan] * 32).astype(object).rank(ascending=True) + Series([np.nan] * 32).astype(object).rank(ascending=False) diff --git a/pandas/tests/series/test_replace.py b/pandas/tests/series/test_replace.py index d80328ea3863a..35d13a62ca083 100644 --- a/pandas/tests/series/test_replace.py +++ b/pandas/tests/series/test_replace.py @@ -1,18 +1,17 @@ # coding=utf-8 # pylint: disable-msg=E1101,W0612 +import pytest + import numpy as np import pandas as pd -import pandas.lib as lib +import pandas._libs.lib as lib import pandas.util.testing as tm from .common import TestData -class TestSeriesReplace(TestData, tm.TestCase): - - _multiprocess_can_split_ = True - +class TestSeriesReplace(TestData): def test_replace(self): N = 100 ser = pd.Series(np.random.randn(N)) @@ -38,18 +37,18 @@ def test_replace(self): # replace list with a single value rs = ser.replace([np.nan, 'foo', 'bar'], -1) - self.assertTrue((rs[:5] == -1).all()) - self.assertTrue((rs[6:10] == -1).all()) - self.assertTrue((rs[20:30] == -1).all()) - self.assertTrue((pd.isnull(ser[:5])).all()) + assert (rs[:5] == -1).all() + assert (rs[6:10] == -1).all() + assert (rs[20:30] == -1).all() + assert (pd.isnull(ser[:5])).all() # replace with different values rs = ser.replace({np.nan: -1, 'foo': -2, 'bar': -3}) - 
self.assertTrue((rs[:5] == -1).all()) - self.assertTrue((rs[6:10] == -2).all()) - self.assertTrue((rs[20:30] == -3).all()) - self.assertTrue((pd.isnull(ser[:5])).all()) + assert (rs[:5] == -1).all() + assert (rs[6:10] == -2).all() + assert (rs[20:30] == -3).all() + assert (pd.isnull(ser[:5])).all() # replace with different values with 2 lists rs2 = ser.replace([np.nan, 'foo', 'bar'], [-1, -2, -3]) @@ -58,9 +57,9 @@ def test_replace(self): # replace inplace ser.replace([np.nan, 'foo', 'bar'], -1, inplace=True) - self.assertTrue((ser[:5] == -1).all()) - self.assertTrue((ser[6:10] == -1).all()) - self.assertTrue((ser[20:30] == -1).all()) + assert (ser[:5] == -1).all() + assert (ser[6:10] == -1).all() + assert (ser[20:30] == -1).all() ser = pd.Series([np.nan, 0, np.inf]) tm.assert_series_equal(ser.replace(np.nan, 0), ser.fillna(0)) @@ -75,11 +74,11 @@ def test_replace(self): tm.assert_series_equal(ser.replace(np.nan, 0), ser.fillna(0)) # malformed - self.assertRaises(ValueError, ser.replace, [1, 2, 3], [np.nan, 0]) + pytest.raises(ValueError, ser.replace, [1, 2, 3], [np.nan, 0]) # make sure that we aren't just masking a TypeError because bools don't # implement indexing - with tm.assertRaisesRegexp(TypeError, 'Cannot compare types .+'): + with tm.assert_raises_regex(TypeError, 'Cannot compare types .+'): ser.replace([1, 2], [np.nan, 0]) ser = pd.Series([0, 1, 2, 3, 4]) @@ -120,7 +119,7 @@ def test_replace_with_single_list(self): # make sure things don't get corrupted when fillna call fails s = ser.copy() - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): s.replace([1, 2, 3], inplace=True, method='crash_cymbal') tm.assert_series_equal(s, ser) @@ -134,8 +133,8 @@ def check_replace(to_rep, val, expected): tm.assert_series_equal(expected, r) tm.assert_series_equal(expected, sc) - # should NOT upcast to float - e = pd.Series([0, 1, 2, 3, 4]) + # MUST upcast to float + e = pd.Series([0., 1., 2., 3., 4.]) tr, v = [3], [3.0] check_replace(tr, v, e) @@ -154,8 +153,8 @@ def check_replace(to_rep, val, expected): tr, v = [3, 4], [3.5, pd.Timestamp('20130101')] check_replace(tr, v, e) - # casts to float - e = pd.Series([0, 1, 2, 3.5, 1]) + # casts to object + e = pd.Series([0, 1, 2, 3.5, True], dtype='object') tr, v = [3, 4], [3.5, True] check_replace(tr, v, e) @@ -187,7 +186,7 @@ def test_replace_bool_with_bool(self): def test_replace_with_dict_with_bool_keys(self): s = pd.Series([True, False, True]) - with tm.assertRaisesRegexp(TypeError, 'Cannot compare types .+'): + with tm.assert_raises_regex(TypeError, 'Cannot compare types .+'): s.replace({'asdf': 'asdb', True: 'yes'}) def test_replace2(self): @@ -201,18 +200,18 @@ def test_replace2(self): # replace list with a single value rs = ser.replace([np.nan, 'foo', 'bar'], -1) - self.assertTrue((rs[:5] == -1).all()) - self.assertTrue((rs[6:10] == -1).all()) - self.assertTrue((rs[20:30] == -1).all()) - self.assertTrue((pd.isnull(ser[:5])).all()) + assert (rs[:5] == -1).all() + assert (rs[6:10] == -1).all() + assert (rs[20:30] == -1).all() + assert (pd.isnull(ser[:5])).all() # replace with different values rs = ser.replace({np.nan: -1, 'foo': -2, 'bar': -3}) - self.assertTrue((rs[:5] == -1).all()) - self.assertTrue((rs[6:10] == -2).all()) - self.assertTrue((rs[20:30] == -3).all()) - self.assertTrue((pd.isnull(ser[:5])).all()) + assert (rs[:5] == -1).all() + assert (rs[6:10] == -2).all() + assert (rs[20:30] == -3).all() + assert (pd.isnull(ser[:5])).all() # replace with different values with 2 lists rs2 = ser.replace([np.nan, 'foo', 'bar'], 
[-1, -2, -3]) @@ -220,6 +219,33 @@ def test_replace2(self): # replace inplace ser.replace([np.nan, 'foo', 'bar'], -1, inplace=True) - self.assertTrue((ser[:5] == -1).all()) - self.assertTrue((ser[6:10] == -1).all()) - self.assertTrue((ser[20:30] == -1).all()) + assert (ser[:5] == -1).all() + assert (ser[6:10] == -1).all() + assert (ser[20:30] == -1).all() + + def test_replace_with_empty_dictlike(self): + # GH 15289 + s = pd.Series(list('abcd')) + tm.assert_series_equal(s, s.replace(dict())) + tm.assert_series_equal(s, s.replace(pd.Series([]))) + + def test_replace_string_with_number(self): + # GH 15743 + s = pd.Series([1, 2, 3]) + result = s.replace('2', np.nan) + expected = pd.Series([1, 2, 3]) + tm.assert_series_equal(expected, result) + + def test_replace_unicode_with_number(self): + # GH 15743 + s = pd.Series([1, 2, 3]) + result = s.replace(u'2', np.nan) + expected = pd.Series([1, 2, 3]) + tm.assert_series_equal(expected, result) + + def test_replace_mixed_types_with_string(self): + # Testing mixed + s = pd.Series([1, 2, 3, '4', 4, 5]) + result = s.replace([2, '4'], np.nan) + expected = pd.Series([1, np.nan, 3, np.nan, 4, 5]) + tm.assert_series_equal(expected, result) diff --git a/pandas/tests/series/test_repr.py b/pandas/tests/series/test_repr.py index 68ad29f1418c3..3af61b0a902d3 100644 --- a/pandas/tests/series/test_repr.py +++ b/pandas/tests/series/test_repr.py @@ -3,22 +3,22 @@ from datetime import datetime, timedelta +import sys + import numpy as np import pandas as pd from pandas import (Index, Series, DataFrame, date_range) from pandas.core.index import MultiIndex -from pandas.compat import StringIO, lrange, range, u +from pandas.compat import lrange, range, u from pandas import compat import pandas.util.testing as tm from .common import TestData -class TestSeriesRepr(TestData, tm.TestCase): - - _multiprocess_can_split_ = True +class TestSeriesRepr(TestData): def test_multilevel_name_print(self): index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', @@ -34,24 +34,29 @@ def test_multilevel_name_print(self): "qux one 7", " two 8", " three 9", "Name: sth, dtype: int64"] expected = "\n".join(expected) - self.assertEqual(repr(s), expected) + assert repr(s) == expected def test_name_printing(self): - # test small series + # Test small Series. s = Series([0, 1, 2]) + s.name = "test" - self.assertIn("Name: test", repr(s)) + assert "Name: test" in repr(s) + s.name = None - self.assertNotIn("Name:", repr(s)) - # test big series (diff code path) + assert "Name:" not in repr(s) + + # Test big Series (diff code path). 
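The `replace` tests added earlier in this hunk (GH 15743 and the mixed-type case) document a subtlety worth spelling out: `replace` matches by type as well as value, so the string `'2'` never matches the integer `2`, while an object Series matches each element against the key of its own type. A small sketch with illustrative values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
# The string '2' matches nothing in an integer Series: a silent no-op.
assert s.replace('2', np.nan).tolist() == [1, 2, 3]

mixed = pd.Series([1, 2, 3, '4', 4, 5])
# In an object Series, the int key 2 hits the int element and the
# string key '4' hits the string element; the int 4 is left alone.
result = mixed.replace([2, '4'], np.nan)
print(result.tolist())  # [1, nan, 3, nan, 4, 5]
```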
s = Series(lrange(0, 1000)) + s.name = "test" - self.assertIn("Name: test", repr(s)) + assert "Name: test" in repr(s) + s.name = None - self.assertNotIn("Name:", repr(s)) + assert "Name:" not in repr(s) s = Series(index=date_range('20010101', '20020101'), name='test') - self.assertIn("Name: test", repr(s)) + assert "Name: test" in repr(s) def test_repr(self): str(self.ts) @@ -90,44 +95,39 @@ def test_repr(self): # 0 as name ser = Series(np.random.randn(100), name=0) rep_str = repr(ser) - self.assertIn("Name: 0", rep_str) + assert "Name: 0" in rep_str # tidy repr ser = Series(np.random.randn(1001), name=0) rep_str = repr(ser) - self.assertIn("Name: 0", rep_str) + assert "Name: 0" in rep_str ser = Series(["a\n\r\tb"], name="a\n\r\td", index=["a\n\r\tf"]) - self.assertFalse("\t" in repr(ser)) - self.assertFalse("\r" in repr(ser)) - self.assertFalse("a\n" in repr(ser)) + assert "\t" not in repr(ser) + assert "\r" not in repr(ser) + assert "a\n" not in repr(ser) # with empty series (#4651) s = Series([], dtype=np.int64, name='foo') - self.assertEqual(repr(s), 'Series([], Name: foo, dtype: int64)') + assert repr(s) == 'Series([], Name: foo, dtype: int64)' s = Series([], dtype=np.int64, name=None) - self.assertEqual(repr(s), 'Series([], dtype: int64)') + assert repr(s) == 'Series([], dtype: int64)' def test_tidy_repr(self): a = Series([u("\u05d0")] * 1000) a.name = 'title1' repr(a) # should not raise exception + @tm.capture_stderr def test_repr_bool_fails(self): s = Series([DataFrame(np.random.randn(2, 2)) for i in range(5)]) - import sys + # It works (with no Cython exception barf)! + repr(s) - buf = StringIO() - tmp = sys.stderr - sys.stderr = buf - try: - # it works (with no Cython exception barf)! - repr(s) - finally: - sys.stderr = tmp - self.assertEqual(buf.getvalue(), '') + output = sys.stderr.getvalue() + assert output == '' def test_repr_name_iterable_indexable(self): s = Series([1, 2, 3], name=np.int64(3)) @@ -148,7 +148,7 @@ def test_repr_should_return_str(self): data = [8, 5, 3, 5] index1 = [u("\u03c3"), u("\u03c4"), u("\u03c5"), u("\u03c6")] df = Series(data, index=index1) - self.assertTrue(type(df.__repr__() == str)) # both py2 / 3 + assert type(df.__repr__() == str) # both py2 / 3 def test_repr_max_rows(self): # GH 6863 @@ -176,7 +176,7 @@ def test_timeseries_repr_object_dtype(self): repr(ts) ts = tm.makeTimeSeries(1000) - self.assertTrue(repr(ts).splitlines()[-1].startswith('Freq:')) + assert repr(ts).splitlines()[-1].startswith('Freq:') - ts2 = ts.ix[np.random.randint(0, len(ts) - 1, 400)] + ts2 = ts.iloc[np.random.randint(0, len(ts) - 1, 400)] repr(ts2).splitlines()[-1] diff --git a/pandas/tests/series/test_sorting.py b/pandas/tests/series/test_sorting.py index 826201adbdb50..40b0280de3719 100644 --- a/pandas/tests/series/test_sorting.py +++ b/pandas/tests/series/test_sorting.py @@ -1,35 +1,26 @@ # coding=utf-8 +import pytest + import numpy as np import random -from pandas import (DataFrame, Series, MultiIndex) +from pandas import DataFrame, Series, MultiIndex, IntervalIndex -from pandas.util.testing import (assert_series_equal, assert_almost_equal) +from pandas.util.testing import assert_series_equal, assert_almost_equal import pandas.util.testing as tm from .common import TestData -class TestSeriesSorting(TestData, tm.TestCase): - - _multiprocess_can_split_ = True - - def test_sort(self): +class TestSeriesSorting(TestData): + def test_sortlevel_deprecated(self): ts = self.ts.copy() - # 9816 deprecated - with tm.assert_produces_warning(FutureWarning): - ts.sort() # sorts 
inplace - self.assert_series_equal(ts, self.ts.sort_values()) - - def test_order(self): - - # 9816 deprecated + # see gh-9816 with tm.assert_produces_warning(FutureWarning): - result = self.ts.order() - self.assert_series_equal(result, self.ts.sort_values()) + ts.sortlevel() def test_sort_values(self): @@ -37,20 +28,20 @@ def test_sort_values(self): ser = Series([3, 2, 4, 1], ['A', 'B', 'C', 'D']) expected = Series([1, 2, 3, 4], ['D', 'B', 'A', 'C']) result = ser.sort_values() - self.assert_series_equal(expected, result) + tm.assert_series_equal(expected, result) ts = self.ts.copy() ts[:5] = np.NaN vals = ts.values result = ts.sort_values() - self.assertTrue(np.isnan(result[-5:]).all()) - self.assert_numpy_array_equal(result[:-5].values, np.sort(vals[5:])) + assert np.isnan(result[-5:]).all() + tm.assert_numpy_array_equal(result[:-5].values, np.sort(vals[5:])) # na_position result = ts.sort_values(na_position='first') - self.assertTrue(np.isnan(result[:5]).all()) - self.assert_numpy_array_equal(result[5:].values, np.sort(vals[5:])) + assert np.isnan(result[:5]).all() + tm.assert_numpy_array_equal(result[5:].values, np.sort(vals[5:])) # something object-type ser = Series(['A', 'B'], [1, 2]) @@ -64,12 +55,31 @@ def test_sort_values(self): ordered = ts.sort_values(ascending=False, na_position='first') assert_almost_equal(expected, ordered.valid().values) + # ascending=[False] should behave the same as ascending=False + ordered = ts.sort_values(ascending=[False]) + expected = ts.sort_values(ascending=False) + assert_series_equal(expected, ordered) + ordered = ts.sort_values(ascending=[False], na_position='first') + expected = ts.sort_values(ascending=False, na_position='first') + assert_series_equal(expected, ordered) + + pytest.raises(ValueError, + lambda: ts.sort_values(ascending=None)) + pytest.raises(ValueError, + lambda: ts.sort_values(ascending=[])) + pytest.raises(ValueError, + lambda: ts.sort_values(ascending=[1, 2, 3])) + pytest.raises(ValueError, + lambda: ts.sort_values(ascending=[False, False])) + pytest.raises(ValueError, + lambda: ts.sort_values(ascending='foobar')) + # inplace=True ts = self.ts.copy() ts.sort_values(ascending=False, inplace=True) - self.assert_series_equal(ts, self.ts.sort_values(ascending=False)) - self.assert_index_equal(ts.index, - self.ts.sort_values(ascending=False).index) + tm.assert_series_equal(ts, self.ts.sort_values(ascending=False)) + tm.assert_index_equal(ts.index, + self.ts.sort_values(ascending=False).index) # GH 5856/5853 # Series.sort_values operating on a view @@ -79,7 +89,7 @@ def test_sort_values(self): def f(): s.sort_values(inplace=True) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) def test_sort_index(self): rindex = list(self.ts.index) @@ -102,13 +112,13 @@ def test_sort_index(self): sorted_series = random_order.sort_index(axis=0) assert_series_equal(sorted_series, self.ts) - self.assertRaises(ValueError, lambda: random_order.sort_values(axis=1)) + pytest.raises(ValueError, lambda: random_order.sort_values(axis=1)) sorted_series = random_order.sort_index(level=0, axis=0) assert_series_equal(sorted_series, self.ts) - self.assertRaises(ValueError, - lambda: random_order.sort_index(level=0, axis=1)) + pytest.raises(ValueError, + lambda: random_order.sort_index(level=0, axis=1)) def test_sort_index_inplace(self): @@ -119,16 +129,17 @@ def test_sort_index_inplace(self): # descending random_order = self.ts.reindex(rindex) result = random_order.sort_index(ascending=False, inplace=True) - self.assertIs(result, None, - 
msg='sort_index() inplace should return None') - assert_series_equal(random_order, self.ts.reindex(self.ts.index[::-1])) + + assert result is None + tm.assert_series_equal(random_order, self.ts.reindex( + self.ts.index[::-1])) # ascending random_order = self.ts.reindex(rindex) result = random_order.sort_index(ascending=True, inplace=True) - self.assertIs(result, None, - msg='sort_index() inplace should return None') - assert_series_equal(random_order, self.ts) + + assert result is None + tm.assert_series_equal(random_order, self.ts) def test_sort_index_multiindex(self): @@ -144,3 +155,43 @@ def test_sort_index_multiindex(self): # rows share same level='A': sort has no effect without remaining lvls res = s.sort_index(level='A', sort_remaining=False) assert_series_equal(s, res) + + def test_sort_index_kind(self): + # GH #14444 & #13589: Add support for sort algo choosing + series = Series(index=[3, 2, 1, 4, 3]) + expected_series = Series(index=[1, 2, 3, 3, 4]) + + index_sorted_series = series.sort_index(kind='mergesort') + assert_series_equal(expected_series, index_sorted_series) + + index_sorted_series = series.sort_index(kind='quicksort') + assert_series_equal(expected_series, index_sorted_series) + + index_sorted_series = series.sort_index(kind='heapsort') + assert_series_equal(expected_series, index_sorted_series) + + def test_sort_index_na_position(self): + series = Series(index=[3, 2, 1, 4, 3, np.nan]) + + expected_series_first = Series(index=[np.nan, 1, 2, 3, 3, 4]) + index_sorted_series = series.sort_index(na_position='first') + assert_series_equal(expected_series_first, index_sorted_series) + + expected_series_last = Series(index=[1, 2, 3, 3, 4, np.nan]) + index_sorted_series = series.sort_index(na_position='last') + assert_series_equal(expected_series_last, index_sorted_series) + + def test_sort_index_intervals(self): + s = Series([np.nan, 1, 2, 3], IntervalIndex.from_arrays( + [0, 1, 2, 3], + [1, 2, 3, 4])) + + result = s.sort_index() + expected = s + assert_series_equal(result, expected) + + result = s.sort_index(ascending=False) + expected = Series([3, 2, 1, np.nan], IntervalIndex.from_arrays( + [3, 2, 1, 0], + [4, 3, 2, 1])) + assert_series_equal(result, expected) diff --git a/pandas/tests/series/test_subclass.py b/pandas/tests/series/test_subclass.py index cc07c7d9dd59b..37c8d7343f7f1 100644 --- a/pandas/tests/series/test_subclass.py +++ b/pandas/tests/series/test_subclass.py @@ -6,67 +6,63 @@ import pandas.util.testing as tm -class TestSeriesSubclassing(tm.TestCase): - - _multiprocess_can_split_ = True +class TestSeriesSubclassing(object): def test_indexing_sliced(self): s = tm.SubclassedSeries([1, 2, 3, 4], index=list('abcd')) res = s.loc[['a', 'b']] exp = tm.SubclassedSeries([1, 2], index=list('ab')) tm.assert_series_equal(res, exp) - tm.assertIsInstance(res, tm.SubclassedSeries) + assert isinstance(res, tm.SubclassedSeries) res = s.iloc[[2, 3]] exp = tm.SubclassedSeries([3, 4], index=list('cd')) tm.assert_series_equal(res, exp) - tm.assertIsInstance(res, tm.SubclassedSeries) + assert isinstance(res, tm.SubclassedSeries) - res = s.ix[['a', 'b']] + res = s.loc[['a', 'b']] exp = tm.SubclassedSeries([1, 2], index=list('ab')) tm.assert_series_equal(res, exp) - tm.assertIsInstance(res, tm.SubclassedSeries) + assert isinstance(res, tm.SubclassedSeries) def test_to_frame(self): s = tm.SubclassedSeries([1, 2, 3, 4], index=list('abcd'), name='xxx') res = s.to_frame() exp = tm.SubclassedDataFrame({'xxx': [1, 2, 3, 4]}, index=list('abcd')) tm.assert_frame_equal(res, exp) - 
tm.assertIsInstance(res, tm.SubclassedDataFrame) - + assert isinstance(res, tm.SubclassedDataFrame) -class TestSparseSeriesSubclassing(tm.TestCase): - _multiprocess_can_split_ = True +class TestSparseSeriesSubclassing(object): def test_subclass_sparse_slice(self): # int64 s = tm.SubclassedSparseSeries([1, 2, 3, 4, 5]) exp = tm.SubclassedSparseSeries([2, 3, 4], index=[1, 2, 3]) tm.assert_sp_series_equal(s.loc[1:3], exp) - self.assertEqual(s.loc[1:3].dtype, np.int64) + assert s.loc[1:3].dtype == np.int64 exp = tm.SubclassedSparseSeries([2, 3], index=[1, 2]) tm.assert_sp_series_equal(s.iloc[1:3], exp) - self.assertEqual(s.iloc[1:3].dtype, np.int64) + assert s.iloc[1:3].dtype == np.int64 exp = tm.SubclassedSparseSeries([2, 3], index=[1, 2]) tm.assert_sp_series_equal(s[1:3], exp) - self.assertEqual(s[1:3].dtype, np.int64) + assert s[1:3].dtype == np.int64 # float64 s = tm.SubclassedSparseSeries([1., 2., 3., 4., 5.]) exp = tm.SubclassedSparseSeries([2., 3., 4.], index=[1, 2, 3]) tm.assert_sp_series_equal(s.loc[1:3], exp) - self.assertEqual(s.loc[1:3].dtype, np.float64) + assert s.loc[1:3].dtype == np.float64 exp = tm.SubclassedSparseSeries([2., 3.], index=[1, 2]) tm.assert_sp_series_equal(s.iloc[1:3], exp) - self.assertEqual(s.iloc[1:3].dtype, np.float64) + assert s.iloc[1:3].dtype == np.float64 exp = tm.SubclassedSparseSeries([2., 3.], index=[1, 2]) tm.assert_sp_series_equal(s[1:3], exp) - self.assertEqual(s[1:3].dtype, np.float64) + assert s[1:3].dtype == np.float64 def test_subclass_sparse_addition(self): s1 = tm.SubclassedSparseSeries([1, 3, 5]) diff --git a/pandas/tests/series/test_timeseries.py b/pandas/tests/series/test_timeseries.py index 6e3d52366a4ec..d5517bdcceac7 100644 --- a/pandas/tests/series/test_timeseries.py +++ b/pandas/tests/series/test_timeseries.py @@ -1,23 +1,39 @@ # coding=utf-8 # pylint: disable-msg=E1101,W0612 -from datetime import datetime +import pytest import numpy as np +from datetime import datetime, timedelta, time -from pandas import Index, Series, date_range, NaT -from pandas.tseries.index import DatetimeIndex -from pandas.tseries.offsets import BDay, BMonthEnd -from pandas.tseries.tdi import TimedeltaIndex - -from pandas.util.testing import assert_series_equal, assert_almost_equal +import pandas as pd import pandas.util.testing as tm +from pandas._libs.tslib import iNaT +from pandas.compat import lrange, StringIO, product +from pandas.core.indexes.timedeltas import TimedeltaIndex +from pandas.core.indexes.datetimes import DatetimeIndex +from pandas.tseries.offsets import BDay, BMonthEnd +from pandas import (Index, Series, date_range, NaT, concat, DataFrame, + Timestamp, to_datetime, offsets, + timedelta_range) +from pandas.util.testing import (assert_series_equal, assert_almost_equal, + assert_frame_equal, _skip_if_has_locale) from pandas.tests.series.common import TestData -class TestSeriesTimeSeries(TestData, tm.TestCase): - _multiprocess_can_split_ = True +def _simple_ts(start, end, freq='D'): + rng = date_range(start, end, freq=freq) + return Series(np.random.randn(len(rng)), index=rng) + + +def assert_range_equal(left, right): + assert (left.equals(right)) + assert (left.freq == right.freq) + assert (left.tz == right.tz) + + +class TestTimeSeries(TestData): def test_shift(self): shifted = self.ts.shift(1) @@ -59,7 +75,7 @@ def test_shift(self): assert_series_equal(shifted2, shifted3) assert_series_equal(ps, shifted2.shift(-1, 'B')) - self.assertRaises(ValueError, ps.shift, freq='D') + pytest.raises(ValueError, ps.shift, freq='D') # legacy support shifted4 
= ps.shift(1, freq='B') @@ -90,7 +106,23 @@ def test_shift(self): # incompat tz s2 = Series(date_range('2000-01-01 09:00:00', periods=5, tz='CET'), name='foo') - self.assertRaises(ValueError, lambda: s - s2) + pytest.raises(ValueError, lambda: s - s2) + + def test_shift2(self): + ts = Series(np.random.randn(5), + index=date_range('1/1/2000', periods=5, freq='H')) + + result = ts.shift(1, freq='5T') + exp_index = ts.index.shift(1, freq='5T') + tm.assert_index_equal(result.index, exp_index) + + # GH #1063, multiple of same base + result = ts.shift(1, freq='4H') + exp_index = ts.index + offsets.Hour(4) + tm.assert_index_equal(result.index, exp_index) + + idx = DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-04']) + pytest.raises(ValueError, idx.shift, 1) def test_shift_dst(self): # GH 13926 @@ -99,25 +131,25 @@ def test_shift_dst(self): res = s.shift(0) tm.assert_series_equal(res, s) - self.assertEqual(res.dtype, 'datetime64[ns, US/Eastern]') + assert res.dtype == 'datetime64[ns, US/Eastern]' res = s.shift(1) exp_vals = [NaT] + dates.asobject.values.tolist()[:9] exp = Series(exp_vals) tm.assert_series_equal(res, exp) - self.assertEqual(res.dtype, 'datetime64[ns, US/Eastern]') + assert res.dtype == 'datetime64[ns, US/Eastern]' res = s.shift(-2) exp_vals = dates.asobject.values.tolist()[2:] + [NaT, NaT] exp = Series(exp_vals) tm.assert_series_equal(res, exp) - self.assertEqual(res.dtype, 'datetime64[ns, US/Eastern]') + assert res.dtype == 'datetime64[ns, US/Eastern]' for ex in [10, -10, 20, -20]: res = s.shift(ex) exp = Series([NaT] * 10, dtype='datetime64[ns, US/Eastern]') tm.assert_series_equal(res, exp) - self.assertEqual(res.dtype, 'datetime64[ns, US/Eastern]') + assert res.dtype == 'datetime64[ns, US/Eastern]' def test_tshift(self): # PeriodIndex @@ -133,7 +165,7 @@ def test_tshift(self): shifted3 = ps.tshift(freq=BDay()) assert_series_equal(shifted, shifted3) - self.assertRaises(ValueError, ps.tshift, freq='M') + pytest.raises(ValueError, ps.tshift, freq='M') # DatetimeIndex shifted = self.ts.tshift(1) @@ -152,7 +184,7 @@ def test_tshift(self): assert_series_equal(unshifted, inferred_ts) no_freq = self.ts[[0, 5, 7]] - self.assertRaises(ValueError, no_freq.tshift) + pytest.raises(ValueError, no_freq.tshift) def test_truncate(self): offset = BDay() @@ -200,209 +232,9 @@ def test_truncate(self): truncated = ts.truncate(before=self.ts.index[-1] + offset) assert (len(truncated) == 0) - self.assertRaises(ValueError, ts.truncate, - before=self.ts.index[-1] + offset, - after=self.ts.index[0] - offset) - - def test_getitem_setitem_datetimeindex(self): - from pandas import date_range - - N = 50 - # testing with timezone, GH #2785 - rng = date_range('1/1/1990', periods=N, freq='H', tz='US/Eastern') - ts = Series(np.random.randn(N), index=rng) - - result = ts["1990-01-01 04:00:00"] - expected = ts[4] - self.assertEqual(result, expected) - - result = ts.copy() - result["1990-01-01 04:00:00"] = 0 - result["1990-01-01 04:00:00"] = ts[4] - assert_series_equal(result, ts) - - result = ts["1990-01-01 04:00:00":"1990-01-01 07:00:00"] - expected = ts[4:8] - assert_series_equal(result, expected) - - result = ts.copy() - result["1990-01-01 04:00:00":"1990-01-01 07:00:00"] = 0 - result["1990-01-01 04:00:00":"1990-01-01 07:00:00"] = ts[4:8] - assert_series_equal(result, ts) - - lb = "1990-01-01 04:00:00" - rb = "1990-01-01 07:00:00" - result = ts[(ts.index >= lb) & (ts.index <= rb)] - expected = ts[4:8] - assert_series_equal(result, expected) - - # repeat all the above with naive datetimes - result = 
ts[datetime(1990, 1, 1, 4)] - expected = ts[4] - self.assertEqual(result, expected) - - result = ts.copy() - result[datetime(1990, 1, 1, 4)] = 0 - result[datetime(1990, 1, 1, 4)] = ts[4] - assert_series_equal(result, ts) - - result = ts[datetime(1990, 1, 1, 4):datetime(1990, 1, 1, 7)] - expected = ts[4:8] - assert_series_equal(result, expected) - - result = ts.copy() - result[datetime(1990, 1, 1, 4):datetime(1990, 1, 1, 7)] = 0 - result[datetime(1990, 1, 1, 4):datetime(1990, 1, 1, 7)] = ts[4:8] - assert_series_equal(result, ts) - - lb = datetime(1990, 1, 1, 4) - rb = datetime(1990, 1, 1, 7) - result = ts[(ts.index >= lb) & (ts.index <= rb)] - expected = ts[4:8] - assert_series_equal(result, expected) - - result = ts[ts.index[4]] - expected = ts[4] - self.assertEqual(result, expected) - - result = ts[ts.index[4:8]] - expected = ts[4:8] - assert_series_equal(result, expected) - - result = ts.copy() - result[ts.index[4:8]] = 0 - result[4:8] = ts[4:8] - assert_series_equal(result, ts) - - # also test partial date slicing - result = ts["1990-01-02"] - expected = ts[24:48] - assert_series_equal(result, expected) - - result = ts.copy() - result["1990-01-02"] = 0 - result["1990-01-02"] = ts[24:48] - assert_series_equal(result, ts) - - def test_getitem_setitem_datetime_tz_pytz(self): - tm._skip_if_no_pytz() - from pytz import timezone as tz - - from pandas import date_range - - N = 50 - # testing with timezone, GH #2785 - rng = date_range('1/1/1990', periods=N, freq='H', tz='US/Eastern') - ts = Series(np.random.randn(N), index=rng) - - # also test Timestamp tz handling, GH #2789 - result = ts.copy() - result["1990-01-01 09:00:00+00:00"] = 0 - result["1990-01-01 09:00:00+00:00"] = ts[4] - assert_series_equal(result, ts) - - result = ts.copy() - result["1990-01-01 03:00:00-06:00"] = 0 - result["1990-01-01 03:00:00-06:00"] = ts[4] - assert_series_equal(result, ts) - - # repeat with datetimes - result = ts.copy() - result[datetime(1990, 1, 1, 9, tzinfo=tz('UTC'))] = 0 - result[datetime(1990, 1, 1, 9, tzinfo=tz('UTC'))] = ts[4] - assert_series_equal(result, ts) - - result = ts.copy() - - # comparison dates with datetime MUST be localized! 
- date = tz('US/Central').localize(datetime(1990, 1, 1, 3)) - result[date] = 0 - result[date] = ts[4] - assert_series_equal(result, ts) - - def test_getitem_setitem_datetime_tz_dateutil(self): - tm._skip_if_no_dateutil() - from dateutil.tz import tzutc - from pandas.tslib import _dateutil_gettz as gettz - - tz = lambda x: tzutc() if x == 'UTC' else gettz( - x) # handle special case for utc in dateutil - - from pandas import date_range - - N = 50 - - # testing with timezone, GH #2785 - rng = date_range('1/1/1990', periods=N, freq='H', - tz='America/New_York') - ts = Series(np.random.randn(N), index=rng) - - # also test Timestamp tz handling, GH #2789 - result = ts.copy() - result["1990-01-01 09:00:00+00:00"] = 0 - result["1990-01-01 09:00:00+00:00"] = ts[4] - assert_series_equal(result, ts) - - result = ts.copy() - result["1990-01-01 03:00:00-06:00"] = 0 - result["1990-01-01 03:00:00-06:00"] = ts[4] - assert_series_equal(result, ts) - - # repeat with datetimes - result = ts.copy() - result[datetime(1990, 1, 1, 9, tzinfo=tz('UTC'))] = 0 - result[datetime(1990, 1, 1, 9, tzinfo=tz('UTC'))] = ts[4] - assert_series_equal(result, ts) - - result = ts.copy() - result[datetime(1990, 1, 1, 3, tzinfo=tz('America/Chicago'))] = 0 - result[datetime(1990, 1, 1, 3, tzinfo=tz('America/Chicago'))] = ts[4] - assert_series_equal(result, ts) - - def test_getitem_setitem_periodindex(self): - from pandas import period_range - - N = 50 - rng = period_range('1/1/1990', periods=N, freq='H') - ts = Series(np.random.randn(N), index=rng) - - result = ts["1990-01-01 04"] - expected = ts[4] - self.assertEqual(result, expected) - - result = ts.copy() - result["1990-01-01 04"] = 0 - result["1990-01-01 04"] = ts[4] - assert_series_equal(result, ts) - - result = ts["1990-01-01 04":"1990-01-01 07"] - expected = ts[4:8] - assert_series_equal(result, expected) - - result = ts.copy() - result["1990-01-01 04":"1990-01-01 07"] = 0 - result["1990-01-01 04":"1990-01-01 07"] = ts[4:8] - assert_series_equal(result, ts) - - lb = "1990-01-01 04" - rb = "1990-01-01 07" - result = ts[(ts.index >= lb) & (ts.index <= rb)] - expected = ts[4:8] - assert_series_equal(result, expected) - - # GH 2782 - result = ts[ts.index[4]] - expected = ts[4] - self.assertEqual(result, expected) - - result = ts[ts.index[4:8]] - expected = ts[4:8] - assert_series_equal(result, expected) - - result = ts.copy() - result[ts.index[4:8]] = 0 - result[4:8] = ts[4:8] - assert_series_equal(result, ts) + pytest.raises(ValueError, ts.truncate, + before=self.ts.index[-1] + offset, + after=self.ts.index[0] - offset) def test_asfreq(self): ts = Series([0., 1., 2.], index=[datetime(2009, 10, 30), datetime( @@ -410,19 +242,33 @@ def test_asfreq(self): daily_ts = ts.asfreq('B') monthly_ts = daily_ts.asfreq('BM') - self.assert_series_equal(monthly_ts, ts) + tm.assert_series_equal(monthly_ts, ts) daily_ts = ts.asfreq('B', method='pad') monthly_ts = daily_ts.asfreq('BM') - self.assert_series_equal(monthly_ts, ts) + tm.assert_series_equal(monthly_ts, ts) daily_ts = ts.asfreq(BDay()) monthly_ts = daily_ts.asfreq(BMonthEnd()) - self.assert_series_equal(monthly_ts, ts) + tm.assert_series_equal(monthly_ts, ts) result = ts[:0].asfreq('M') - self.assertEqual(len(result), 0) - self.assertIsNot(result, ts) + assert len(result) == 0 + assert result is not ts + + daily_ts = ts.asfreq('D', fill_value=-1) + result = daily_ts.value_counts().sort_index() + expected = Series([60, 1, 1, 1], + index=[-1.0, 2.0, 1.0, 0.0]).sort_index() + tm.assert_series_equal(result, expected) + + def 
test_asfreq_datetimeindex_empty_series(self): + # GH 14320 + expected = Series(index=pd.DatetimeIndex( + ["2016-09-29 11:00"])).asfreq('H') + result = Series(index=pd.DatetimeIndex(["2016-09-29 11:00"]), + data=[3]).asfreq('H') + tm.assert_index_equal(expected.index, result.index) def test_diff(self): # Just run the function @@ -434,7 +280,7 @@ def test_diff(self): s = Series([a, b]) rs = s.diff() - self.assertEqual(rs[1], 1) + assert rs[1] == 1 # neg n rs = self.ts.diff(-1) @@ -497,10 +343,10 @@ def test_autocorr(self): # corr() with lag needs Series of at least length 2 if len(self.ts) <= 2: - self.assertTrue(np.isnan(corr1)) - self.assertTrue(np.isnan(corr2)) + assert np.isnan(corr1) + assert np.isnan(corr2) else: - self.assertEqual(corr1, corr2) + assert corr1 == corr2 # Choose a random lag between 1 and length of Series - 2 # and compare the result with the Series corr() function @@ -510,34 +356,34 @@ def test_autocorr(self): # corr() with lag needs Series of at least length 2 if len(self.ts) <= 2: - self.assertTrue(np.isnan(corr1)) - self.assertTrue(np.isnan(corr2)) + assert np.isnan(corr1) + assert np.isnan(corr2) else: - self.assertEqual(corr1, corr2) + assert corr1 == corr2 def test_first_last_valid(self): ts = self.ts.copy() ts[:5] = np.NaN index = ts.first_valid_index() - self.assertEqual(index, ts.index[5]) + assert index == ts.index[5] ts[-5:] = np.NaN index = ts.last_valid_index() - self.assertEqual(index, ts.index[-6]) + assert index == ts.index[-6] ts[:] = np.nan - self.assertIsNone(ts.last_valid_index()) - self.assertIsNone(ts.first_valid_index()) + assert ts.last_valid_index() is None + assert ts.first_valid_index() is None ser = Series([], index=[]) - self.assertIsNone(ser.last_valid_index()) - self.assertIsNone(ser.first_valid_index()) + assert ser.last_valid_index() is None + assert ser.first_valid_index() is None # GH12800 empty = Series() - self.assertIsNone(empty.last_valid_index()) - self.assertIsNone(empty.first_valid_index()) + assert empty.last_valid_index() is None + assert empty.first_valid_index() is None def test_mpl_compat_hack(self): result = self.ts[:, np.newaxis] @@ -547,10 +393,8 @@ def test_mpl_compat_hack(self): def test_timeseries_coercion(self): idx = tm.makeDateIndex(10000) ser = Series(np.random.randn(len(idx)), idx.astype(object)) - with tm.assert_produces_warning(FutureWarning): - self.assertTrue(ser.is_time_series) - self.assertTrue(ser.index.is_all_dates) - self.assertIsInstance(ser.index, DatetimeIndex) + assert ser.index.is_all_dates + assert isinstance(ser.index, DatetimeIndex) def test_empty_series_ops(self): # see issue #13844 @@ -559,10 +403,532 @@ def test_empty_series_ops(self): assert_series_equal(a, a + b) assert_series_equal(a, a - b) assert_series_equal(a, b + a) - self.assertRaises(TypeError, lambda x, y: x - y, b, a) + pytest.raises(TypeError, lambda x, y: x - y, b, a) + + def test_contiguous_boolean_preserve_freq(self): + rng = date_range('1/1/2000', '3/1/2000', freq='B') + + mask = np.zeros(len(rng), dtype=bool) + mask[10:20] = True + + masked = rng[mask] + expected = rng[10:20] + assert expected.freq is not None + assert_range_equal(masked, expected) + + mask[22] = True + masked = rng[mask] + assert masked.freq is None + + def test_to_datetime_unit(self): + + epoch = 1370745748 + s = Series([epoch + t for t in range(20)]) + result = to_datetime(s, unit='s') + expected = Series([Timestamp('2013-06-09 02:42:28') + timedelta( + seconds=t) for t in range(20)]) + assert_series_equal(result, expected) + + s = Series([epoch + t for t 
in range(20)]).astype(float) + result = to_datetime(s, unit='s') + expected = Series([Timestamp('2013-06-09 02:42:28') + timedelta( + seconds=t) for t in range(20)]) + assert_series_equal(result, expected) + + s = Series([epoch + t for t in range(20)] + [iNaT]) + result = to_datetime(s, unit='s') + expected = Series([Timestamp('2013-06-09 02:42:28') + timedelta( + seconds=t) for t in range(20)] + [NaT]) + assert_series_equal(result, expected) + + s = Series([epoch + t for t in range(20)] + [iNaT]).astype(float) + result = to_datetime(s, unit='s') + expected = Series([Timestamp('2013-06-09 02:42:28') + timedelta( + seconds=t) for t in range(20)] + [NaT]) + assert_series_equal(result, expected) + + # GH13834 + s = Series([epoch + t for t in np.arange(0, 2, .25)] + + [iNaT]).astype(float) + result = to_datetime(s, unit='s') + expected = Series([Timestamp('2013-06-09 02:42:28') + timedelta( + seconds=t) for t in np.arange(0, 2, .25)] + [NaT]) + assert_series_equal(result, expected) + + s = concat([Series([epoch + t for t in range(20)] + ).astype(float), Series([np.nan])], + ignore_index=True) + result = to_datetime(s, unit='s') + expected = Series([Timestamp('2013-06-09 02:42:28') + timedelta( + seconds=t) for t in range(20)] + [NaT]) + assert_series_equal(result, expected) + + result = to_datetime([1, 2, 'NaT', pd.NaT, np.nan], unit='D') + expected = DatetimeIndex([Timestamp('1970-01-02'), + Timestamp('1970-01-03')] + ['NaT'] * 3) + tm.assert_index_equal(result, expected) + + with pytest.raises(ValueError): + to_datetime([1, 2, 'foo'], unit='D') + with pytest.raises(ValueError): + to_datetime([1, 2, 111111111], unit='D') + + # with errors='coerce' we can process the valid entries; invalid ones become NaT + expected = DatetimeIndex([Timestamp('1970-01-02'), + Timestamp('1970-01-03')] + ['NaT'] * 1) + result = to_datetime([1, 2, 'foo'], unit='D', errors='coerce') + tm.assert_index_equal(result, expected) + + result = to_datetime([1, 2, 111111111], unit='D', errors='coerce') + tm.assert_index_equal(result, expected) + + def test_series_ctor_datetime64(self): + rng = date_range('1/1/2000 00:00:00', '1/1/2000 1:59:50', freq='10s') + dates = np.asarray(rng) + + series = Series(dates) + assert np.issubdtype(series.dtype, np.dtype('M8[ns]')) + + def test_series_repr_nat(self): + series = Series([0, 1000, 2000, iNaT], dtype='M8[ns]') + + result = repr(series) + expected = ('0 1970-01-01 00:00:00.000000\n' + '1 1970-01-01 00:00:00.000001\n' + '2 1970-01-01 00:00:00.000002\n' + '3 NaT\n' + 'dtype: datetime64[ns]') + assert result == expected + + def test_asfreq_keep_index_name(self): + # GH #9854 + index_name = 'bar' + index = pd.date_range('20130101', periods=20, name=index_name) + df = pd.DataFrame([x for x in range(20)], columns=['foo'], index=index) + + assert index_name == df.index.name + assert index_name == df.asfreq('10D').index.name + + def test_promote_datetime_date(self): + rng = date_range('1/1/2000', periods=20) + ts = Series(np.random.randn(20), index=rng) + + ts_slice = ts[5:] + ts2 = ts_slice.copy() + ts2.index = [x.date() for x in ts2.index] + + result = ts + ts2 + result2 = ts2 + ts + expected = ts + ts[5:] + assert_series_equal(result, expected) + assert_series_equal(result2, expected) + + # test asfreq + result = ts2.asfreq('4H', method='ffill') + expected = ts[5:].asfreq('4H', method='ffill') + assert_series_equal(result, expected) + + result = rng.get_indexer(ts2.index) + expected = rng.get_indexer(ts_slice.index) + tm.assert_numpy_array_equal(result, expected) + + def test_asfreq_normalize(self): + rng = date_range('1/1/2000 09:30',
periods=20) + norm = date_range('1/1/2000', periods=20) + vals = np.random.randn(20) + ts = Series(vals, index=rng) + + result = ts.asfreq('D', normalize=True) + norm = date_range('1/1/2000', periods=20) + expected = Series(vals, index=norm) + + assert_series_equal(result, expected) + vals = np.random.randn(20, 3) + ts = DataFrame(vals, index=rng) + + result = ts.asfreq('D', normalize=True) + expected = DataFrame(vals, index=norm) + + assert_frame_equal(result, expected) + + def test_first_subset(self): + ts = _simple_ts('1/1/2000', '1/1/2010', freq='12h') + result = ts.first('10d') + assert len(result) == 20 + + ts = _simple_ts('1/1/2000', '1/1/2010') + result = ts.first('10d') + assert len(result) == 10 + + result = ts.first('3M') + expected = ts[:'3/31/2000'] + assert_series_equal(result, expected) + + result = ts.first('21D') + expected = ts[:21] + assert_series_equal(result, expected) + + result = ts[:0].first('3M') + assert_series_equal(result, ts[:0]) + + def test_last_subset(self): + ts = _simple_ts('1/1/2000', '1/1/2010', freq='12h') + result = ts.last('10d') + assert len(result) == 20 + + ts = _simple_ts('1/1/2000', '1/1/2010') + result = ts.last('10d') + assert len(result) == 10 + + result = ts.last('21D') + expected = ts['12/12/2009':] + assert_series_equal(result, expected) + + result = ts.last('21D') + expected = ts[-21:] + assert_series_equal(result, expected) + + result = ts[:0].last('3M') + assert_series_equal(result, ts[:0]) + + def test_format_pre_1900_dates(self): + rng = date_range('1/1/1850', '1/1/1950', freq='A-DEC') + rng.format() + ts = Series(1, index=rng) + repr(ts) + + def test_at_time(self): + rng = date_range('1/1/2000', '1/5/2000', freq='5min') + ts = Series(np.random.randn(len(rng)), index=rng) + rs = ts.at_time(rng[1]) + assert (rs.index.hour == rng[1].hour).all() + assert (rs.index.minute == rng[1].minute).all() + assert (rs.index.second == rng[1].second).all() + + result = ts.at_time('9:30') + expected = ts.at_time(time(9, 30)) + assert_series_equal(result, expected) + + df = DataFrame(np.random.randn(len(rng), 3), index=rng) + + result = ts[time(9, 30)] + result_df = df.loc[time(9, 30)] + expected = ts[(rng.hour == 9) & (rng.minute == 30)] + exp_df = df[(rng.hour == 9) & (rng.minute == 30)] + + # expected.index = date_range('1/1/2000', '1/4/2000') + + assert_series_equal(result, expected) + tm.assert_frame_equal(result_df, exp_df) + + chunk = df.loc['1/4/2000':] + result = chunk.loc[time(9, 30)] + expected = result_df[-1:] + tm.assert_frame_equal(result, expected) + + # midnight, everything + rng = date_range('1/1/2000', '1/31/2000') + ts = Series(np.random.randn(len(rng)), index=rng) + + result = ts.at_time(time(0, 0)) + assert_series_equal(result, ts) + + # time doesn't exist + rng = date_range('1/1/2012', freq='23Min', periods=384) + ts = Series(np.random.randn(len(rng)), rng) + rs = ts.at_time('16:00') + assert len(rs) == 0 + + def test_between(self): + series = Series(date_range('1/1/2000', periods=10)) + left, right = series[[2, 7]] + + result = series.between(left, right) + expected = (series >= left) & (series <= right) + assert_series_equal(result, expected) + + def test_between_time(self): + rng = date_range('1/1/2000', '1/5/2000', freq='5min') + ts = Series(np.random.randn(len(rng)), index=rng) + stime = time(0, 0) + etime = time(1, 0) + + close_open = product([True, False], [True, False]) + for inc_start, inc_end in close_open: + filtered = ts.between_time(stime, etime, inc_start, inc_end) + exp_len = 13 * 4 + 1 + if not inc_start: + exp_len 
-= 5 + if not inc_end: + exp_len -= 4 + + assert len(filtered) == exp_len + for rs in filtered.index: + t = rs.time() + if inc_start: + assert t >= stime + else: + assert t > stime + + if inc_end: + assert t <= etime + else: + assert t < etime + + result = ts.between_time('00:00', '01:00') + expected = ts.between_time(stime, etime) + assert_series_equal(result, expected) + + # across midnight + rng = date_range('1/1/2000', '1/5/2000', freq='5min') + ts = Series(np.random.randn(len(rng)), index=rng) + stime = time(22, 0) + etime = time(9, 0) + + close_open = product([True, False], [True, False]) + for inc_start, inc_end in close_open: + filtered = ts.between_time(stime, etime, inc_start, inc_end) + exp_len = (12 * 11 + 1) * 4 + 1 + if not inc_start: + exp_len -= 4 + if not inc_end: + exp_len -= 4 + + assert len(filtered) == exp_len + for rs in filtered.index: + t = rs.time() + if inc_start: + assert (t >= stime) or (t <= etime) + else: + assert (t > stime) or (t <= etime) + + if inc_end: + assert (t <= etime) or (t >= stime) + else: + assert (t < etime) or (t >= stime) + + def test_between_time_types(self): + # GH11818 + rng = date_range('1/1/2000', '1/5/2000', freq='5min') + pytest.raises(ValueError, rng.indexer_between_time, + datetime(2010, 1, 2, 1), datetime(2010, 1, 2, 5)) + + frame = DataFrame({'A': 0}, index=rng) + pytest.raises(ValueError, frame.between_time, + datetime(2010, 1, 2, 1), datetime(2010, 1, 2, 5)) + + series = Series(0, index=rng) + pytest.raises(ValueError, series.between_time, + datetime(2010, 1, 2, 1), datetime(2010, 1, 2, 5)) + + def test_between_time_formats(self): + # GH11818 + _skip_if_has_locale() + + rng = date_range('1/1/2000', '1/5/2000', freq='5min') + ts = DataFrame(np.random.randn(len(rng), 2), index=rng) + + strings = [("2:00", "2:30"), ("0200", "0230"), ("2:00am", "2:30am"), + ("0200am", "0230am"), ("2:00:00", "2:30:00"), + ("020000", "023000"), ("2:00:00am", "2:30:00am"), + ("020000am", "023000am")] + expected_length = 28 + + for time_string in strings: + assert len(ts.between_time(*time_string)) == expected_length + + def test_to_period(self): + from pandas.core.indexes.period import period_range + + ts = _simple_ts('1/1/2000', '1/1/2001') + + pts = ts.to_period() + exp = ts.copy() + exp.index = period_range('1/1/2000', '1/1/2001') + assert_series_equal(pts, exp) + + pts = ts.to_period('M') + exp.index = exp.index.asfreq('M') + tm.assert_index_equal(pts.index, exp.index.asfreq('M')) + assert_series_equal(pts, exp) + + # GH 7606 without freq + idx = DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', + '2011-01-04']) + exp_idx = pd.PeriodIndex(['2011-01-01', '2011-01-02', '2011-01-03', + '2011-01-04'], freq='D') + + s = Series(np.random.randn(4), index=idx) + expected = s.copy() + expected.index = exp_idx + assert_series_equal(s.to_period(), expected) + + df = DataFrame(np.random.randn(4, 4), index=idx, columns=idx) + expected = df.copy() + expected.index = exp_idx + assert_frame_equal(df.to_period(), expected) + + expected = df.copy() + expected.columns = exp_idx + assert_frame_equal(df.to_period(axis=1), expected) + + def test_groupby_count_dateparseerror(self): + dr = date_range(start='1/1/2012', freq='5min', periods=10) + + # BAD Example, datetimes first + s = Series(np.arange(10), index=[dr, lrange(10)]) + grouped = s.groupby(lambda x: x[1] % 2 == 0) + result = grouped.count() + + s = Series(np.arange(10), index=[lrange(10), dr]) + grouped = s.groupby(lambda x: x[0] % 2 == 0) + expected = grouped.count() + + assert_series_equal(result, 
expected) -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + def test_to_csv_numpy_16_bug(self): + frame = DataFrame({'a': date_range('1/1/2000', periods=10)}) + + buf = StringIO() + frame.to_csv(buf) + + result = buf.getvalue() + assert '2000-01-01' in result + + def test_series_map_box_timedelta(self): + # GH 11349 + s = Series(timedelta_range('1 day 1 s', periods=5, freq='h')) + + def f(x): + return x.total_seconds() + + s.map(f) + s.apply(f) + DataFrame(s).applymap(f) + + def test_asfreq_resample_set_correct_freq(self): + # GH5613 + # we test if .asfreq() and .resample() set the correct value for .freq + df = pd.DataFrame({'date': ["2012-01-01", "2012-01-02", "2012-01-03"], + 'col': [1, 2, 3]}) + df = df.set_index(pd.to_datetime(df.date)) + + # testing the settings before calling .asfreq() and .resample() + assert df.index.freq is None + assert df.index.inferred_freq == 'D' + + # does .asfreq() set .freq correctly? + assert df.asfreq('D').index.freq == 'D' + + # does .resample() set .freq correctly? + assert df.resample('D').asfreq().index.freq == 'D' + + def test_pickle(self): + + # GH4606 + p = tm.round_trip_pickle(NaT) + assert p is NaT + + idx = pd.to_datetime(['2013-01-01', NaT, '2014-01-06']) + idx_p = tm.round_trip_pickle(idx) + assert idx_p[0] == idx[0] + assert idx_p[1] is NaT + assert idx_p[2] == idx[2] + + # GH11002 + # don't infer freq + idx = date_range('1750-1-1', '2050-1-1', freq='7D') + idx_p = tm.round_trip_pickle(idx) + tm.assert_index_equal(idx, idx_p) + + def test_setops_preserve_freq(self): + for tz in [None, 'Asia/Tokyo', 'US/Eastern']: + rng = date_range('1/1/2000', '1/1/2002', name='idx', tz=tz) + + result = rng[:50].union(rng[50:100]) + assert result.name == rng.name + assert result.freq == rng.freq + assert result.tz == rng.tz + + result = rng[:50].union(rng[30:100]) + assert result.name == rng.name + assert result.freq == rng.freq + assert result.tz == rng.tz + + result = rng[:50].union(rng[60:100]) + assert result.name == rng.name + assert result.freq is None + assert result.tz == rng.tz + + result = rng[:50].intersection(rng[25:75]) + assert result.name == rng.name + assert result.freqstr == 'D' + assert result.tz == rng.tz + + nofreq = DatetimeIndex(list(rng[25:75]), name='other') + result = rng[:50].union(nofreq) + assert result.name is None + assert result.freq == rng.freq + assert result.tz == rng.tz + + result = rng[:50].intersection(nofreq) + assert result.name is None + assert result.freq == rng.freq + assert result.tz == rng.tz + + def test_min_max(self): + rng = date_range('1/1/2000', '12/31/2000') + rng2 = rng.take(np.random.permutation(len(rng))) + + the_min = rng2.min() + the_max = rng2.max() + assert isinstance(the_min, Timestamp) + assert isinstance(the_max, Timestamp) + assert the_min == rng[0] + assert the_max == rng[-1] + + assert rng.min() == rng[0] + assert rng.max() == rng[-1] + + def test_min_max_series(self): + rng = date_range('1/1/2000', periods=10, freq='4h') + lvls = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'] + df = DataFrame({'TS': rng, 'V': np.random.randn(len(rng)), 'L': lvls}) + + result = df.TS.max() + exp = Timestamp(df.TS.iat[-1]) + assert isinstance(result, Timestamp) + assert result == exp + + result = df.TS.min() + exp = Timestamp(df.TS.iat[0]) + assert isinstance(result, Timestamp) + assert result == exp + + def test_from_M8_structured(self): + dates = [(datetime(2012, 9, 9, 0, 0), datetime(2012, 9, 8, 15, 10))] + arr = 
np.array(dates, + dtype=[('Date', 'M8[us]'), ('Forecasting', 'M8[us]')]) + df = DataFrame(arr) + + assert df['Date'][0] == dates[0][0] + assert df['Forecasting'][0] == dates[0][1] + + s = Series(arr['Date']) + assert isinstance(s[0], Timestamp) + assert s[0] == dates[0][0] + + s = Series.from_array(arr['Date'], Index([0])) + assert s[0] == dates[0][0] + + def test_get_level_values_box(self): + from pandas import MultiIndex + + dates = date_range('1/1/2000', periods=4) + levels = [dates, [0, 1]] + labels = [[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]] + + index = MultiIndex(levels=levels, labels=labels) + + assert isinstance(index.get_level_values(0)[0], Timestamp) diff --git a/pandas/tests/series/test_validate.py b/pandas/tests/series/test_validate.py new file mode 100644 index 0000000000000..6327e265d8c1e --- /dev/null +++ b/pandas/tests/series/test_validate.py @@ -0,0 +1,30 @@ +import pytest +from pandas.core.series import Series + + +class TestSeriesValidate(object): + """Tests for error handling related to data types of method arguments.""" + s = Series([1, 2, 3, 4, 5]) + + def test_validate_bool_args(self): + # Tests for error handling related to boolean arguments. + invalid_values = [1, "True", [1, 2, 3], 5.0] + + for value in invalid_values: + with pytest.raises(ValueError): + self.s.reset_index(inplace=value) + + with pytest.raises(ValueError): + self.s._set_name(name='hello', inplace=value) + + with pytest.raises(ValueError): + self.s.sort_values(inplace=value) + + with pytest.raises(ValueError): + self.s.sort_index(inplace=value) + + with pytest.raises(ValueError): + self.s.rename(inplace=value) + + with pytest.raises(ValueError): + self.s.dropna(inplace=value) diff --git a/pandas/tests/sparse/__init__.py b/pandas/tests/sparse/__init__.py new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/pandas/tests/sparse/common.py b/pandas/tests/sparse/common.py new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/pandas/sparse/tests/test_arithmetics.py b/pandas/tests/sparse/test_arithmetics.py similarity index 94% rename from pandas/sparse/tests/test_arithmetics.py rename to pandas/tests/sparse/test_arithmetics.py index f24244b38c42b..f023cd0003910 100644 --- a/pandas/sparse/tests/test_arithmetics.py +++ b/pandas/tests/sparse/test_arithmetics.py @@ -3,9 +3,7 @@ import pandas.util.testing as tm -class TestSparseArrayArithmetics(tm.TestCase): - - _multiprocess_can_split_ = True +class TestSparseArrayArithmetics(object): _base = np.array _klass = pd.SparseArray @@ -69,9 +67,9 @@ def _check_numeric_ops(self, a, b, a_dense, b_dense): self._assert((b_dense ** a).to_dense(), b_dense ** a_dense) def _check_bool_result(self, res): - tm.assertIsInstance(res, self._klass) - self.assertEqual(res.dtype, np.bool) - self.assertIsInstance(res.fill_value, bool) + assert isinstance(res, self._klass) + assert res.dtype == np.bool + assert isinstance(res.fill_value, bool) def _check_comparison_ops(self, a, b, a_dense, b_dense): with np.errstate(invalid='ignore'): @@ -276,30 +274,30 @@ def test_int_array(self): for kind in ['integer', 'block']: a = self._klass(values, dtype=dtype, kind=kind) - self.assertEqual(a.dtype, dtype) + assert a.dtype == dtype b = self._klass(rvalues, dtype=dtype, kind=kind) - self.assertEqual(b.dtype, dtype) + assert b.dtype == dtype self._check_numeric_ops(a, b, values, rvalues) self._check_numeric_ops(a, b * 0, values, rvalues * 0) a = self._klass(values, fill_value=0, dtype=dtype,
kind=kind) - self.assertEqual(a.dtype, dtype) + assert a.dtype == dtype b = self._klass(rvalues, dtype=dtype, kind=kind) - self.assertEqual(b.dtype, dtype) + assert b.dtype == dtype self._check_numeric_ops(a, b, values, rvalues) a = self._klass(values, fill_value=0, dtype=dtype, kind=kind) - self.assertEqual(a.dtype, dtype) + assert a.dtype == dtype b = self._klass(rvalues, fill_value=0, dtype=dtype, kind=kind) - self.assertEqual(b.dtype, dtype) + assert b.dtype == dtype self._check_numeric_ops(a, b, values, rvalues) a = self._klass(values, fill_value=1, dtype=dtype, kind=kind) - self.assertEqual(a.dtype, dtype) + assert a.dtype == dtype b = self._klass(rvalues, fill_value=2, dtype=dtype, kind=kind) - self.assertEqual(b.dtype, dtype) + assert b.dtype == dtype self._check_numeric_ops(a, b, values, rvalues) def test_int_array_comparison(self): @@ -366,24 +364,24 @@ def test_mixed_array_float_int(self): for kind in ['integer', 'block']: a = self._klass(values, kind=kind) b = self._klass(rvalues, kind=kind) - self.assertEqual(b.dtype, rdtype) + assert b.dtype == rdtype self._check_numeric_ops(a, b, values, rvalues) self._check_numeric_ops(a, b * 0, values, rvalues * 0) a = self._klass(values, kind=kind, fill_value=0) b = self._klass(rvalues, kind=kind) - self.assertEqual(b.dtype, rdtype) + assert b.dtype == rdtype self._check_numeric_ops(a, b, values, rvalues) a = self._klass(values, kind=kind, fill_value=0) b = self._klass(rvalues, kind=kind, fill_value=0) - self.assertEqual(b.dtype, rdtype) + assert b.dtype == rdtype self._check_numeric_ops(a, b, values, rvalues) a = self._klass(values, kind=kind, fill_value=1) b = self._klass(rvalues, kind=kind, fill_value=2) - self.assertEqual(b.dtype, rdtype) + assert b.dtype == rdtype self._check_numeric_ops(a, b, values, rvalues) def test_mixed_array_comparison(self): @@ -396,24 +394,24 @@ def test_mixed_array_comparison(self): for kind in ['integer', 'block']: a = self._klass(values, kind=kind) b = self._klass(rvalues, kind=kind) - self.assertEqual(b.dtype, rdtype) + assert b.dtype == rdtype self._check_comparison_ops(a, b, values, rvalues) self._check_comparison_ops(a, b * 0, values, rvalues * 0) a = self._klass(values, kind=kind, fill_value=0) b = self._klass(rvalues, kind=kind) - self.assertEqual(b.dtype, rdtype) + assert b.dtype == rdtype self._check_comparison_ops(a, b, values, rvalues) a = self._klass(values, kind=kind, fill_value=0) b = self._klass(rvalues, kind=kind, fill_value=0) - self.assertEqual(b.dtype, rdtype) + assert b.dtype == rdtype self._check_comparison_ops(a, b, values, rvalues) a = self._klass(values, kind=kind, fill_value=1) b = self._klass(rvalues, kind=kind, fill_value=2) - self.assertEqual(b.dtype, rdtype) + assert b.dtype == rdtype self._check_comparison_ops(a, b, values, rvalues) diff --git a/pandas/sparse/tests/test_array.py b/pandas/tests/sparse/test_array.py similarity index 72% rename from pandas/sparse/tests/test_array.py rename to pandas/tests/sparse/test_array.py index 2b284ac631d3f..4ce03f72dbba6 100644 --- a/pandas/sparse/tests/test_array.py +++ b/pandas/tests/sparse/test_array.py @@ -1,108 +1,109 @@ from pandas.compat import range + import re import operator +import pytest import warnings from numpy import nan import numpy as np from pandas import _np_version_under1p8 -from pandas.sparse.api import SparseArray, SparseSeries -from pandas._sparse import IntIndex -from pandas.util.testing import assert_almost_equal, assertRaisesRegexp +from pandas.core.sparse.api import SparseArray, SparseSeries +from 
pandas._libs.sparse import IntIndex +from pandas.util.testing import assert_almost_equal import pandas.util.testing as tm -class TestSparseArray(tm.TestCase): - _multiprocess_can_split_ = True +class TestSparseArray(object): - def setUp(self): + def setup_method(self, method): self.arr_data = np.array([nan, nan, 1, 2, 3, nan, 4, 5, nan, 6]) self.arr = SparseArray(self.arr_data) self.zarr = SparseArray([0, 0, 1, 2, 3, 0, 4, 5, 0, 6], fill_value=0) def test_constructor_dtype(self): arr = SparseArray([np.nan, 1, 2, np.nan]) - self.assertEqual(arr.dtype, np.float64) - self.assertTrue(np.isnan(arr.fill_value)) + assert arr.dtype == np.float64 + assert np.isnan(arr.fill_value) arr = SparseArray([np.nan, 1, 2, np.nan], fill_value=0) - self.assertEqual(arr.dtype, np.float64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.float64 + assert arr.fill_value == 0 arr = SparseArray([0, 1, 2, 4], dtype=np.float64) - self.assertEqual(arr.dtype, np.float64) - self.assertTrue(np.isnan(arr.fill_value)) + assert arr.dtype == np.float64 + assert np.isnan(arr.fill_value) arr = SparseArray([0, 1, 2, 4], dtype=np.int64) - self.assertEqual(arr.dtype, np.int64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.int64 + assert arr.fill_value == 0 arr = SparseArray([0, 1, 2, 4], fill_value=0, dtype=np.int64) - self.assertEqual(arr.dtype, np.int64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.int64 + assert arr.fill_value == 0 arr = SparseArray([0, 1, 2, 4], dtype=None) - self.assertEqual(arr.dtype, np.int64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.int64 + assert arr.fill_value == 0 arr = SparseArray([0, 1, 2, 4], fill_value=0, dtype=None) - self.assertEqual(arr.dtype, np.int64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.int64 + assert arr.fill_value == 0 def test_constructor_object_dtype(self): # GH 11856 arr = SparseArray(['A', 'A', np.nan, 'B'], dtype=np.object) - self.assertEqual(arr.dtype, np.object) - self.assertTrue(np.isnan(arr.fill_value)) + assert arr.dtype == np.object + assert np.isnan(arr.fill_value) arr = SparseArray(['A', 'A', np.nan, 'B'], dtype=np.object, fill_value='A') - self.assertEqual(arr.dtype, np.object) - self.assertEqual(arr.fill_value, 'A') + assert arr.dtype == np.object + assert arr.fill_value == 'A' def test_constructor_spindex_dtype(self): arr = SparseArray(data=[1, 2], sparse_index=IntIndex(4, [1, 2])) tm.assert_sp_array_equal(arr, SparseArray([np.nan, 1, 2, np.nan])) - self.assertEqual(arr.dtype, np.float64) - self.assertTrue(np.isnan(arr.fill_value)) + assert arr.dtype == np.float64 + assert np.isnan(arr.fill_value) arr = SparseArray(data=[1, 2, 3], sparse_index=IntIndex(4, [1, 2, 3]), dtype=np.int64, fill_value=0) exp = SparseArray([0, 1, 2, 3], dtype=np.int64, fill_value=0) tm.assert_sp_array_equal(arr, exp) - self.assertEqual(arr.dtype, np.int64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.int64 + assert arr.fill_value == 0 arr = SparseArray(data=[1, 2], sparse_index=IntIndex(4, [1, 2]), fill_value=0, dtype=np.int64) exp = SparseArray([0, 1, 2, 0], fill_value=0, dtype=np.int64) tm.assert_sp_array_equal(arr, exp) - self.assertEqual(arr.dtype, np.int64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.int64 + assert arr.fill_value == 0 arr = SparseArray(data=[1, 2, 3], sparse_index=IntIndex(4, [1, 2, 3]), dtype=None, fill_value=0) exp = SparseArray([0, 1, 2, 3], dtype=None) tm.assert_sp_array_equal(arr, exp) - self.assertEqual(arr.dtype, np.int64) - 
self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.int64 + assert arr.fill_value == 0 # scalar input arr = SparseArray(data=1, sparse_index=IntIndex(1, [0]), dtype=None) exp = SparseArray([1], dtype=None) tm.assert_sp_array_equal(arr, exp) - self.assertEqual(arr.dtype, np.int64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.int64 + assert arr.fill_value == 0 arr = SparseArray(data=[1, 2], sparse_index=IntIndex(4, [1, 2]), fill_value=0, dtype=None) exp = SparseArray([0, 1, 2, 0], fill_value=0, dtype=None) tm.assert_sp_array_equal(arr, exp) - self.assertEqual(arr.dtype, np.int64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.int64 + assert arr.fill_value == 0 def test_sparseseries_roundtrip(self): # GH 13999 @@ -132,27 +133,27 @@ def test_sparseseries_roundtrip(self): def test_get_item(self): - self.assertTrue(np.isnan(self.arr[1])) - self.assertEqual(self.arr[2], 1) - self.assertEqual(self.arr[7], 5) + assert np.isnan(self.arr[1]) + assert self.arr[2] == 1 + assert self.arr[7] == 5 - self.assertEqual(self.zarr[0], 0) - self.assertEqual(self.zarr[2], 1) - self.assertEqual(self.zarr[7], 5) + assert self.zarr[0] == 0 + assert self.zarr[2] == 1 + assert self.zarr[7] == 5 errmsg = re.compile("bounds") - assertRaisesRegexp(IndexError, errmsg, lambda: self.arr[11]) - assertRaisesRegexp(IndexError, errmsg, lambda: self.arr[-11]) - self.assertEqual(self.arr[-1], self.arr[len(self.arr) - 1]) + tm.assert_raises_regex(IndexError, errmsg, lambda: self.arr[11]) + tm.assert_raises_regex(IndexError, errmsg, lambda: self.arr[-11]) + assert self.arr[-1] == self.arr[len(self.arr) - 1] def test_take(self): - self.assertTrue(np.isnan(self.arr.take(0))) - self.assertTrue(np.isscalar(self.arr.take(2))) + assert np.isnan(self.arr.take(0)) + assert np.isscalar(self.arr.take(2)) # np.take in < 1.8 doesn't support scalar indexing if not _np_version_under1p8: - self.assertEqual(self.arr.take(2), np.take(self.arr_data, 2)) - self.assertEqual(self.arr.take(6), np.take(self.arr_data, 6)) + assert self.arr.take(2) == np.take(self.arr_data, 2) + assert self.arr.take(6) == np.take(self.arr_data, 6) exp = SparseArray(np.take(self.arr_data, [2, 3])) tm.assert_sp_array_equal(self.arr.take([2, 3]), exp) @@ -178,21 +179,22 @@ def test_take_negative(self): tm.assert_sp_array_equal(self.arr.take([-4, -3, -2]), exp) def test_bad_take(self): - assertRaisesRegexp(IndexError, "bounds", lambda: self.arr.take(11)) - self.assertRaises(IndexError, lambda: self.arr.take(-11)) + tm.assert_raises_regex( + IndexError, "bounds", lambda: self.arr.take(11)) + pytest.raises(IndexError, lambda: self.arr.take(-11)) def test_take_invalid_kwargs(self): msg = r"take\(\) got an unexpected keyword argument 'foo'" - tm.assertRaisesRegexp(TypeError, msg, self.arr.take, - [2, 3], foo=2) + tm.assert_raises_regex(TypeError, msg, self.arr.take, + [2, 3], foo=2) msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, self.arr.take, - [2, 3], out=self.arr) + tm.assert_raises_regex(ValueError, msg, self.arr.take, + [2, 3], out=self.arr) msg = "the 'mode' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, self.arr.take, - [2, 3], mode='clip') + tm.assert_raises_regex(ValueError, msg, self.arr.take, + [2, 3], mode='clip') def test_take_filling(self): # similar tests as GH 12631 @@ -214,16 +216,16 @@ def test_take_filling(self): msg = ('When allow_fill=True and fill_value is not None, ' 'all indices must be >= -1') - with tm.assertRaisesRegexp(ValueError, msg): + with 
tm.assert_raises_regex(ValueError, msg): sparse.take(np.array([1, 0, -2]), fill_value=True) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): sparse.take(np.array([1, 0, -5]), fill_value=True) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): sparse.take(np.array([1, -6])) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): sparse.take(np.array([1, 5])) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): sparse.take(np.array([1, 5]), fill_value=True) def test_take_filling_fill_value(self): @@ -246,16 +248,16 @@ def test_take_filling_fill_value(self): msg = ('When allow_fill=True and fill_value is not None, ' 'all indices must be >= -1') - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): sparse.take(np.array([1, 0, -2]), fill_value=True) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): sparse.take(np.array([1, 0, -5]), fill_value=True) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): sparse.take(np.array([1, -6])) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): sparse.take(np.array([1, 5])) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): sparse.take(np.array([1, 5]), fill_value=True) def test_take_filling_all_nan(self): @@ -268,11 +270,11 @@ def test_take_filling_all_nan(self): expected = SparseArray([np.nan, np.nan, np.nan]) tm.assert_sp_array_equal(result, expected) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): sparse.take(np.array([1, -6])) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): sparse.take(np.array([1, 5])) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): sparse.take(np.array([1, 5]), fill_value=True) def test_set_item(self): @@ -282,61 +284,61 @@ def setitem(): def setslice(): self.arr[1:5] = 2 - assertRaisesRegexp(TypeError, "item assignment", setitem) - assertRaisesRegexp(TypeError, "item assignment", setslice) + tm.assert_raises_regex(TypeError, "item assignment", setitem) + tm.assert_raises_regex(TypeError, "item assignment", setslice) def test_constructor_from_too_large_array(self): - assertRaisesRegexp(TypeError, "expected dimension <= 1 data", - SparseArray, np.arange(10).reshape((2, 5))) + tm.assert_raises_regex(TypeError, "expected dimension <= 1 data", + SparseArray, np.arange(10).reshape((2, 5))) def test_constructor_from_sparse(self): res = SparseArray(self.zarr) - self.assertEqual(res.fill_value, 0) + assert res.fill_value == 0 assert_almost_equal(res.sp_values, self.zarr.sp_values) def test_constructor_copy(self): cp = SparseArray(self.arr, copy=True) cp.sp_values[:3] = 0 - self.assertFalse((self.arr.sp_values[:3] == 0).any()) + assert not (self.arr.sp_values[:3] == 0).any() not_copy = SparseArray(self.arr) not_copy.sp_values[:3] = 0 - self.assertTrue((self.arr.sp_values[:3] == 0).all()) + assert (self.arr.sp_values[:3] == 0).all() def test_constructor_bool(self): # GH 10648 data = np.array([False, False, True, True, False, False]) arr = SparseArray(data, fill_value=False, dtype=bool) - self.assertEqual(arr.dtype, bool) + assert arr.dtype == bool tm.assert_numpy_array_equal(arr.sp_values, np.array([True, True])) tm.assert_numpy_array_equal(arr.sp_values, np.asarray(arr)) tm.assert_numpy_array_equal(arr.sp_index.indices, np.array([2, 3], np.int32)) for dense in [arr.to_dense(), arr.values]: - self.assertEqual(dense.dtype, bool) + assert 
dense.dtype == bool tm.assert_numpy_array_equal(dense, data) def test_constructor_bool_fill_value(self): arr = SparseArray([True, False, True], dtype=None) - self.assertEqual(arr.dtype, np.bool) - self.assertFalse(arr.fill_value) + assert arr.dtype == np.bool + assert not arr.fill_value arr = SparseArray([True, False, True], dtype=np.bool) - self.assertEqual(arr.dtype, np.bool) - self.assertFalse(arr.fill_value) + assert arr.dtype == np.bool + assert not arr.fill_value arr = SparseArray([True, False, True], dtype=np.bool, fill_value=True) - self.assertEqual(arr.dtype, np.bool) - self.assertTrue(arr.fill_value) + assert arr.dtype == np.bool + assert arr.fill_value def test_constructor_float32(self): # GH 10648 data = np.array([1., np.nan, 3], dtype=np.float32) arr = SparseArray(data, dtype=np.float32) - self.assertEqual(arr.dtype, np.float32) + assert arr.dtype == np.float32 tm.assert_numpy_array_equal(arr.sp_values, np.array([1, 3], dtype=np.float32)) tm.assert_numpy_array_equal(arr.sp_values, np.asarray(arr)) @@ -344,25 +346,25 @@ def test_constructor_float32(self): np.array([0, 2], dtype=np.int32)) for dense in [arr.to_dense(), arr.values]: - self.assertEqual(dense.dtype, np.float32) - self.assert_numpy_array_equal(dense, data) + assert dense.dtype == np.float32 + tm.assert_numpy_array_equal(dense, data) def test_astype(self): res = self.arr.astype('f8') res.sp_values[:3] = 27 - self.assertFalse((self.arr.sp_values[:3] == 27).any()) + assert not (self.arr.sp_values[:3] == 27).any() msg = "unable to coerce current fill_value nan to int64 dtype" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): self.arr.astype('i8') arr = SparseArray([0, np.nan, 0, 1]) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): arr.astype('i8') arr = SparseArray([0, np.nan, 0, 1], fill_value=0) - msg = "Cannot convert NA to integer" - with tm.assertRaisesRegexp(ValueError, msg): + msg = 'Cannot convert non-finite values \\(NA or inf\\) to integer' + with tm.assert_raises_regex(ValueError, msg): arr.astype('i8') def test_astype_all(self): @@ -373,46 +375,46 @@ def test_astype_all(self): np.int32, np.int16, np.int8] for typ in types: res = arr.astype(typ) - self.assertEqual(res.dtype, typ) - self.assertEqual(res.sp_values.dtype, typ) + assert res.dtype == typ + assert res.sp_values.dtype == typ tm.assert_numpy_array_equal(res.values, vals.astype(typ)) def test_set_fill_value(self): arr = SparseArray([1., np.nan, 2.], fill_value=np.nan) arr.fill_value = 2 - self.assertEqual(arr.fill_value, 2) + assert arr.fill_value == 2 arr = SparseArray([1, 0, 2], fill_value=0, dtype=np.int64) arr.fill_value = 2 - self.assertEqual(arr.fill_value, 2) + assert arr.fill_value == 2 # coerces to int msg = "unable to set fill_value 3\\.1 to int64 dtype" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): arr.fill_value = 3.1 msg = "unable to set fill_value nan to int64 dtype" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): arr.fill_value = np.nan arr = SparseArray([True, False, True], fill_value=False, dtype=np.bool) arr.fill_value = True - self.assertTrue(arr.fill_value) + assert arr.fill_value # coerces to bool msg = "unable to set fill_value 0 to bool dtype" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): arr.fill_value = 0 msg = "unable to set fill_value nan to bool dtype" - with tm.assertRaisesRegexp(ValueError, 
msg): + with tm.assert_raises_regex(ValueError, msg): arr.fill_value = np.nan # invalid msg = "fill_value must be a scalar" for val in [[1, 2, 3], np.array([1, 2]), (1, 2, 3)]: - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): arr.fill_value = val def test_copy_shallow(self): @@ -453,6 +455,11 @@ def test_to_dense(self): res = SparseArray(vals, fill_value=0).to_dense() tm.assert_numpy_array_equal(res, vals) + # see gh-14647 + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + SparseArray(vals).to_dense(fill=2) + def test_getitem(self): def _checkit(i): assert_almost_equal(self.arr[i], self.arr.values[i]) @@ -492,10 +499,10 @@ def test_getslice_tuple(self): exp = SparseArray(dense[4:, ], fill_value=0) tm.assert_sp_array_equal(res, exp) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): sparse[4:, :] - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): # check numpy compat dense[4:, :] @@ -517,19 +524,19 @@ def _check_op(op, first, second): res = op(first, second) exp = SparseArray(op(first.values, second.values), fill_value=first.fill_value) - tm.assertIsInstance(res, SparseArray) + assert isinstance(res, SparseArray) assert_almost_equal(res.values, exp.values) res2 = op(first, second.values) - tm.assertIsInstance(res2, SparseArray) + assert isinstance(res2, SparseArray) tm.assert_sp_array_equal(res, res2) res3 = op(first.values, second) - tm.assertIsInstance(res3, SparseArray) + assert isinstance(res3, SparseArray) tm.assert_sp_array_equal(res, res3) res4 = op(first, 4) - tm.assertIsInstance(res4, SparseArray) + assert isinstance(res4, SparseArray) # ignore this if the actual op raises (e.g. pow) try: @@ -542,7 +549,7 @@ def _check_op(op, first, second): def _check_inplace_op(op): tmp = arr1.copy() - self.assertRaises(NotImplementedError, op, tmp, arr2) + pytest.raises(NotImplementedError, op, tmp, arr2) with np.errstate(all='ignore'): bin_ops = [operator.add, operator.sub, operator.mul, @@ -558,7 +565,7 @@ def _check_inplace_op(op): def test_pickle(self): def _check_roundtrip(obj): - unpickled = self.round_trip_pickle(obj) + unpickled = tm.round_trip_pickle(obj) tm.assert_sp_array_equal(unpickled, obj) _check_roundtrip(self.arr) @@ -614,14 +621,14 @@ def test_fillna(self): # int dtype shouldn't have missing. No changes. s = SparseArray([0, 0, 0, 0]) - self.assertEqual(s.dtype, np.int64) - self.assertEqual(s.fill_value, 0) + assert s.dtype == np.int64 + assert s.fill_value == 0 res = s.fillna(-1) tm.assert_sp_array_equal(res, s) s = SparseArray([0, 0, 0, 0], fill_value=0) - self.assertEqual(s.dtype, np.int64) - self.assertEqual(s.fill_value, 0) + assert s.dtype == np.int64 + assert s.fill_value == 0 res = s.fillna(-1) exp = SparseArray([0, 0, 0, 0], fill_value=0) tm.assert_sp_array_equal(res, exp) @@ -629,8 +636,8 @@ def test_fillna(self): # fill_value can be nan if there is no missing hole. 
# only fill_value will be changed s = SparseArray([0, 0, 0, 0], fill_value=np.nan) - self.assertEqual(s.dtype, np.int64) - self.assertTrue(np.isnan(s.fill_value)) + assert s.dtype == np.int64 + assert np.isnan(s.fill_value) res = s.fillna(-1) exp = SparseArray([0, 0, 0, 0], fill_value=-1) tm.assert_sp_array_equal(res, exp) @@ -649,106 +656,118 @@ def test_fillna_overlap(self): tm.assert_sp_array_equal(res, exp) -class TestSparseArrayAnalytics(tm.TestCase): +class TestSparseArrayAnalytics(object): + def test_sum(self): data = np.arange(10).astype(float) out = SparseArray(data).sum() - self.assertEqual(out, 45.0) + assert out == 45.0 data[5] = np.nan out = SparseArray(data, fill_value=2).sum() - self.assertEqual(out, 40.0) + assert out == 40.0 out = SparseArray(data, fill_value=np.nan).sum() - self.assertEqual(out, 40.0) + assert out == 40.0 def test_numpy_sum(self): data = np.arange(10).astype(float) out = np.sum(SparseArray(data)) - self.assertEqual(out, 45.0) + assert out == 45.0 data[5] = np.nan out = np.sum(SparseArray(data, fill_value=2)) - self.assertEqual(out, 40.0) + assert out == 40.0 out = np.sum(SparseArray(data, fill_value=np.nan)) - self.assertEqual(out, 40.0) + assert out == 40.0 msg = "the 'dtype' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.sum, - SparseArray(data), dtype=np.int64) + tm.assert_raises_regex(ValueError, msg, np.sum, + SparseArray(data), dtype=np.int64) msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.sum, - SparseArray(data), out=out) + tm.assert_raises_regex(ValueError, msg, np.sum, + SparseArray(data), out=out) def test_cumsum(self): - data = np.arange(10).astype(float) - out = SparseArray(data).cumsum() - expected = SparseArray(data.cumsum()) - tm.assert_sp_array_equal(out, expected) + non_null_data = np.array([1, 2, 3, 4, 5], dtype=float) + non_null_expected = SparseArray(non_null_data.cumsum()) - # TODO: gh-12855 - return a SparseArray here - data[5] = np.nan - out = SparseArray(data, fill_value=2).cumsum() - self.assertNotIsInstance(out, SparseArray) - tm.assert_numpy_array_equal(out, data.cumsum()) + null_data = np.array([1, 2, np.nan, 4, 5], dtype=float) + null_expected = SparseArray(np.array([1.0, 3.0, np.nan, 7.0, 12.0])) + + for data, expected in [ + (null_data, null_expected), + (non_null_data, non_null_expected) + ]: + out = SparseArray(data).cumsum() + tm.assert_sp_array_equal(out, expected) + + out = SparseArray(data, fill_value=np.nan).cumsum() + tm.assert_sp_array_equal(out, expected) + + out = SparseArray(data, fill_value=2).cumsum() + tm.assert_sp_array_equal(out, expected) - out = SparseArray(data, fill_value=np.nan).cumsum() - expected = SparseArray(np.array([ - 0, 1, 3, 6, 10, np.nan, 16, 23, 31, 40])) - tm.assert_sp_array_equal(out, expected) + axis = 1 # SparseArray currently 1-D, so only axis = 0 is valid. 
+ msg = "axis\\(={axis}\\) out of bounds".format(axis=axis) + with tm.assert_raises_regex(ValueError, msg): + SparseArray(data).cumsum(axis=axis) def test_numpy_cumsum(self): - data = np.arange(10).astype(float) - out = np.cumsum(SparseArray(data)) - expected = SparseArray(data.cumsum()) - tm.assert_sp_array_equal(out, expected) + non_null_data = np.array([1, 2, 3, 4, 5], dtype=float) + non_null_expected = SparseArray(non_null_data.cumsum()) - # TODO: gh-12855 - return a SparseArray here - data[5] = np.nan - out = np.cumsum(SparseArray(data, fill_value=2)) - self.assertNotIsInstance(out, SparseArray) - tm.assert_numpy_array_equal(out, data.cumsum()) + null_data = np.array([1, 2, np.nan, 4, 5], dtype=float) + null_expected = SparseArray(np.array([1.0, 3.0, np.nan, 7.0, 12.0])) - out = np.cumsum(SparseArray(data, fill_value=np.nan)) - expected = SparseArray(np.array([ - 0, 1, 3, 6, 10, np.nan, 16, 23, 31, 40])) - tm.assert_sp_array_equal(out, expected) + for data, expected in [ + (null_data, null_expected), + (non_null_data, non_null_expected) + ]: + out = np.cumsum(SparseArray(data)) + tm.assert_sp_array_equal(out, expected) - msg = "the 'dtype' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.cumsum, - SparseArray(data), dtype=np.int64) + out = np.cumsum(SparseArray(data, fill_value=np.nan)) + tm.assert_sp_array_equal(out, expected) - msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.cumsum, - SparseArray(data), out=out) + out = np.cumsum(SparseArray(data, fill_value=2)) + tm.assert_sp_array_equal(out, expected) + + msg = "the 'dtype' parameter is not supported" + tm.assert_raises_regex(ValueError, msg, np.cumsum, + SparseArray(data), dtype=np.int64) + + msg = "the 'out' parameter is not supported" + tm.assert_raises_regex(ValueError, msg, np.cumsum, + SparseArray(data), out=out) def test_mean(self): data = np.arange(10).astype(float) out = SparseArray(data).mean() - self.assertEqual(out, 4.5) + assert out == 4.5 data[5] = np.nan out = SparseArray(data).mean() - self.assertEqual(out, 40.0 / 9) + assert out == 40.0 / 9 def test_numpy_mean(self): data = np.arange(10).astype(float) out = np.mean(SparseArray(data)) - self.assertEqual(out, 4.5) + assert out == 4.5 data[5] = np.nan out = np.mean(SparseArray(data)) - self.assertEqual(out, 40.0 / 9) + assert out == 40.0 / 9 msg = "the 'dtype' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.mean, - SparseArray(data), dtype=np.int64) + tm.assert_raises_regex(ValueError, msg, np.mean, + SparseArray(data), dtype=np.int64) msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.mean, - SparseArray(data), out=out) + tm.assert_raises_regex(ValueError, msg, np.mean, + SparseArray(data), out=out) def test_ufunc(self): # GH 13853 make sure ufunc is applied to fill_value @@ -794,9 +813,3 @@ def test_ufunc_args(self): sparse = SparseArray([1, -1, 0, -2], fill_value=0) result = SparseArray([2, 0, 1, -1], fill_value=1) tm.assert_sp_array_equal(np.add(sparse, 1), result) - - -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/sparse/tests/test_combine_concat.py b/pandas/tests/sparse/test_combine_concat.py similarity index 93% rename from pandas/sparse/tests/test_combine_concat.py rename to pandas/tests/sparse/test_combine_concat.py index fcdc6d9580dd5..15639fbe156c6 100644 --- a/pandas/sparse/tests/test_combine_concat.py +++ 
b/pandas/tests/sparse/test_combine_concat.py @@ -1,14 +1,11 @@ # pylint: disable-msg=E1101,W0612 -import nose # noqa import numpy as np import pandas as pd import pandas.util.testing as tm -class TestSparseSeriesConcat(tm.TestCase): - - _multiprocess_can_split_ = True +class TestSparseSeriesConcat(object): def test_concat(self): val1 = np.array([1, 2, np.nan, np.nan, 0, np.nan]) @@ -72,7 +69,7 @@ def test_concat_axis1_different_fill(self): res = pd.concat([sparse1, sparse2], axis=1) exp = pd.concat([pd.Series(val1, name='x'), pd.Series(val2, name='y')], axis=1) - self.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res.to_dense(), exp) def test_concat_different_kind(self): @@ -125,11 +122,9 @@ def test_concat_sparse_dense(self): tm.assert_sp_series_equal(res, exp) -class TestSparseDataFrameConcat(tm.TestCase): - - _multiprocess_can_split_ = True +class TestSparseDataFrameConcat(object): - def setUp(self): + def setup_method(self, method): self.dense1 = pd.DataFrame({'A': [0., 1., 2., np.nan], 'B': [0., 0., 0., 0.], @@ -239,12 +234,12 @@ def test_concat_different_columns(self): # each column keeps its fill_value, thus compare in dense res = pd.concat([sparse, sparse3]) exp = pd.concat([self.dense1, self.dense3]) - self.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res.to_dense(), exp) res = pd.concat([sparse3, sparse]) exp = pd.concat([self.dense3, self.dense1]) - self.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res.to_dense(), exp) def test_concat_series(self): @@ -314,12 +309,12 @@ def test_concat_axis1(self): # each column keeps its fill_value, thus compare in dense res = pd.concat([sparse, sparse3], axis=1) exp = pd.concat([self.dense1, self.dense3], axis=1) - self.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res.to_dense(), exp) res = pd.concat([sparse3, sparse], axis=1) exp = pd.concat([self.dense3, self.dense1], axis=1) - self.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res.to_dense(), exp) def test_concat_sparse_dense(self): @@ -327,38 +322,32 @@ def test_concat_sparse_dense(self): res = pd.concat([sparse, self.dense2]) exp = pd.concat([self.dense1, self.dense2]) - self.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res.to_dense(), exp) res = pd.concat([self.dense2, sparse]) exp = pd.concat([self.dense2, self.dense1]) - self.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res.to_dense(), exp) sparse = self.dense1.to_sparse(fill_value=0) res = pd.concat([sparse, self.dense2]) exp = pd.concat([self.dense1, self.dense2]) - self.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res.to_dense(), exp) res = pd.concat([self.dense2, sparse]) exp = pd.concat([self.dense2, self.dense1]) - self.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res.to_dense(), exp) res = pd.concat([self.dense3, sparse], axis=1) exp = pd.concat([self.dense3, self.dense1], axis=1) - self.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res, exp) res = pd.concat([sparse, self.dense3], axis=1) exp =
pd.concat([self.dense1, self.dense3], axis=1) - self.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res, exp) - - -if __name__ == '__main__': - import nose # noqa - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/sparse/tests/test_format.py b/pandas/tests/sparse/test_format.py similarity index 77% rename from pandas/sparse/tests/test_format.py rename to pandas/tests/sparse/test_format.py index 377eaa20565a2..d983bd209085a 100644 --- a/pandas/sparse/tests/test_format.py +++ b/pandas/tests/sparse/test_format.py @@ -13,9 +13,7 @@ use_32bit_repr = is_platform_windows() or is_platform_32bit() -class TestSparseSeriesFormatting(tm.TestCase): - - _multiprocess_can_split_ = True +class TestSparseSeriesFormatting(object): @property def dtype_format_for_platform(self): @@ -29,16 +27,16 @@ def test_sparse_max_row(self): "4 NaN\ndtype: float64\nBlockIndex\n" "Block locations: array([0, 3]{0})\n" "Block lengths: array([1, 1]{0})".format(dfm)) - self.assertEqual(result, exp) + assert result == exp with option_context("display.max_rows", 3): # GH 10560 result = repr(s) exp = ("0 1.0\n ... \n4 NaN\n" - "dtype: float64\nBlockIndex\n" + "Length: 5, dtype: float64\nBlockIndex\n" "Block locations: array([0, 3]{0})\n" "Block lengths: array([1, 1]{0})".format(dfm)) - self.assertEqual(result, exp) + assert result == exp def test_sparse_mi_max_row(self): idx = pd.MultiIndex.from_tuples([('A', 0), ('A', 1), ('B', 0), @@ -52,16 +50,17 @@ def test_sparse_mi_max_row(self): "dtype: float64\nBlockIndex\n" "Block locations: array([0, 3]{0})\n" "Block lengths: array([1, 1]{0})".format(dfm)) - self.assertEqual(result, exp) + assert result == exp - with option_context("display.max_rows", 3): + with option_context("display.max_rows", 3, + "display.show_dimensions", False): # GH 13144 result = repr(s) exp = ("A 0 1.0\n ... \nC 2 NaN\n" "dtype: float64\nBlockIndex\n" "Block locations: array([0, 3]{0})\n" "Block lengths: array([1, 1]{0})".format(dfm)) - self.assertEqual(result, exp) + assert result == exp def test_sparse_bool(self): # GH 13110 @@ -74,15 +73,15 @@ def test_sparse_bool(self): "dtype: bool\nBlockIndex\n" "Block locations: array([0, 3]{0})\n" "Block lengths: array([1, 1]{0})".format(dtype)) - self.assertEqual(result, exp) + assert result == exp with option_context("display.max_rows", 3): result = repr(s) exp = ("0 True\n ... 
\n5 False\n" - "dtype: bool\nBlockIndex\n" + "Length: 6, dtype: bool\nBlockIndex\n" "Block locations: array([0, 3]{0})\n" "Block lengths: array([1, 1]{0})".format(dtype)) - self.assertEqual(result, exp) + assert result == exp def test_sparse_int(self): # GH 13110 @@ -94,18 +93,19 @@ def test_sparse_int(self): "5 0\ndtype: int64\nBlockIndex\n" "Block locations: array([1, 4]{0})\n" "Block lengths: array([1, 1]{0})".format(dtype)) - self.assertEqual(result, exp) + assert result == exp - with option_context("display.max_rows", 3): + with option_context("display.max_rows", 3, + "display.show_dimensions", False): result = repr(s) exp = ("0 0\n ..\n5 0\n" "dtype: int64\nBlockIndex\n" "Block locations: array([1, 4]{0})\n" "Block lengths: array([1, 1]{0})".format(dtype)) - self.assertEqual(result, exp) + assert result == exp -class TestSparseDataFrameFormatting(tm.TestCase): +class TestSparseDataFrameFormatting(object): def test_sparse_frame(self): # GH 13110 @@ -114,7 +114,19 @@ def test_sparse_frame(self): 'C': [0, 0, 3, 0, 5], 'D': [np.nan, np.nan, np.nan, 1, 2]}) sparse = df.to_sparse() - self.assertEqual(repr(sparse), repr(df)) + assert repr(sparse) == repr(df) with option_context("display.max_rows", 3): - self.assertEqual(repr(sparse), repr(df)) + assert repr(sparse) == repr(df) + + def test_sparse_repr_after_set(self): + # GH 15488 + sdf = pd.SparseDataFrame([[np.nan, 1], [2, np.nan]]) + res = sdf.copy() + + # Ignore the warning + with pd.option_context('mode.chained_assignment', None): + sdf[0][1] = 2 # This line triggers the bug + + repr(sdf) + tm.assert_sp_frame_equal(sdf, res) diff --git a/pandas/sparse/tests/test_frame.py b/pandas/tests/sparse/test_frame.py similarity index 63% rename from pandas/sparse/tests/test_frame.py rename to pandas/tests/sparse/test_frame.py index 5cc765a2c1cf3..0312b76ec30a5 100644 --- a/pandas/sparse/tests/test_frame.py +++ b/pandas/tests/sparse/test_frame.py @@ -2,29 +2,34 @@ import operator +import pytest +from warnings import catch_warnings from numpy import nan import numpy as np import pandas as pd from pandas import Series, DataFrame, bdate_range, Panel -from pandas.tseries.index import DatetimeIndex +from pandas.core.dtypes.common import ( + is_bool_dtype, + is_float_dtype, + is_object_dtype, + is_float) +from pandas.core.indexes.datetimes import DatetimeIndex from pandas.tseries.offsets import BDay -import pandas.util.testing as tm +from pandas.util import testing as tm from pandas.compat import lrange from pandas import compat -import pandas.sparse.frame as spf +from pandas.core.sparse import frame as spf -from pandas._sparse import BlockIndex, IntIndex -from pandas.sparse.api import SparseSeries, SparseDataFrame, SparseArray -from pandas.tests.frame.test_misc_api import SharedWithSparse +from pandas._libs.sparse import BlockIndex, IntIndex +from pandas.core.sparse.api import SparseSeries, SparseDataFrame, SparseArray +from pandas.tests.frame.test_api import SharedWithSparse -class TestSparseDataFrame(tm.TestCase, SharedWithSparse): - +class TestSparseDataFrame(SharedWithSparse): klass = SparseDataFrame - _multiprocess_can_split_ = True - def setUp(self): + def setup_method(self, method): self.data = {'A': [nan, nan, nan, 0, 1, 2, 3, 4, 5, 6], 'B': [0, 1, 2, nan, nan, nan, 3, 4, 5, 6], 'C': np.arange(10, dtype=np.float64), @@ -69,33 +74,33 @@ def test_fill_value_when_combine_const(self): def test_as_matrix(self): empty = self.empty.as_matrix() - self.assertEqual(empty.shape, (0, 0)) + assert empty.shape == (0, 0) no_cols = 
SparseDataFrame(index=np.arange(10)) mat = no_cols.as_matrix() - self.assertEqual(mat.shape, (10, 0)) + assert mat.shape == (10, 0) no_index = SparseDataFrame(columns=np.arange(10)) mat = no_index.as_matrix() - self.assertEqual(mat.shape, (0, 10)) + assert mat.shape == (0, 10) def test_copy(self): cp = self.frame.copy() - tm.assertIsInstance(cp, SparseDataFrame) + assert isinstance(cp, SparseDataFrame) tm.assert_sp_frame_equal(cp, self.frame) # as of v0.15.0 # this is now identical (but not is_a ) - self.assertTrue(cp.index.identical(self.frame.index)) + assert cp.index.identical(self.frame.index) def test_constructor(self): for col, series in compat.iteritems(self.frame): - tm.assertIsInstance(series, SparseSeries) + assert isinstance(series, SparseSeries) - tm.assertIsInstance(self.iframe['A'].sp_index, IntIndex) + assert isinstance(self.iframe['A'].sp_index, IntIndex) # constructed zframe from matrix above - self.assertEqual(self.zframe['A'].fill_value, 0) + assert self.zframe['A'].fill_value == 0 tm.assert_numpy_array_equal(pd.SparseArray([1., 2., 3., 4., 5., 6.]), self.zframe['A'].values) tm.assert_numpy_array_equal(np.array([0., 0., 0., 0., 1., 2., @@ -105,7 +110,7 @@ def test_constructor(self): # construct no data sdf = SparseDataFrame(columns=np.arange(10), index=np.arange(10)) for col, series in compat.iteritems(sdf): - tm.assertIsInstance(series, SparseSeries) + assert isinstance(series, SparseSeries) # construct from nested dict data = {} @@ -128,7 +133,7 @@ def test_constructor(self): tm.assert_sp_frame_equal(cons, reindexed, exact_indices=False) # assert level parameter breaks reindex - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): self.frame.reindex(idx, level=0) repr(self.frame) @@ -142,21 +147,21 @@ def test_constructor_ndarray(self): tm.assert_sp_frame_equal(sp, self.frame.reindex(columns=['A'])) # raise on level argument - self.assertRaises(TypeError, self.frame.reindex, columns=['A'], - level=1) + pytest.raises(TypeError, self.frame.reindex, columns=['A'], + level=1) # wrong length index / columns - with tm.assertRaisesRegexp(ValueError, "^Index length"): + with tm.assert_raises_regex(ValueError, "^Index length"): SparseDataFrame(self.frame.values, index=self.frame.index[:-1]) - with tm.assertRaisesRegexp(ValueError, "^Column length"): + with tm.assert_raises_regex(ValueError, "^Column length"): SparseDataFrame(self.frame.values, columns=self.frame.columns[:-1]) # GH 9272 def test_constructor_empty(self): sp = SparseDataFrame() - self.assertEqual(len(sp.index), 0) - self.assertEqual(len(sp.columns), 0) + assert len(sp.index) == 0 + assert len(sp.columns) == 0 def test_constructor_dataframe(self): dense = self.frame.to_dense() @@ -166,28 +171,28 @@ def test_constructor_dataframe(self): def test_constructor_convert_index_once(self): arr = np.array([1.5, 2.5, 3.5]) sdf = SparseDataFrame(columns=lrange(4), index=arr) - self.assertTrue(sdf[0].index is sdf[1].index) + assert sdf[0].index is sdf[1].index def test_constructor_from_series(self): # GH 2873 x = Series(np.random.randn(10000), name='a') x = x.to_sparse(fill_value=0) - tm.assertIsInstance(x, SparseSeries) + assert isinstance(x, SparseSeries) df = SparseDataFrame(x) - tm.assertIsInstance(df, SparseDataFrame) + assert isinstance(df, SparseDataFrame) x = Series(np.random.randn(10000), name='a') y = Series(np.random.randn(10000), name='b') x2 = x.astype(float) - x2.ix[:9998] = np.NaN + x2.loc[:9998] = np.NaN # TODO: x_sparse is unused...fix x_sparse = x2.to_sparse(fill_value=np.NaN) # noqa # Currently 
fails too with weird ufunc error # df1 = SparseDataFrame([x_sparse, y]) - y.ix[:9998] = 0 + y.loc[:9998] = 0 # TODO: y_sparse is unsused...fix y_sparse = y.to_sparse(fill_value=0) # noqa # without sparse value raises error @@ -196,28 +201,55 @@ def test_constructor_from_series(self): def test_constructor_preserve_attr(self): # GH 13866 arr = pd.SparseArray([1, 0, 3, 0], dtype=np.int64, fill_value=0) - self.assertEqual(arr.dtype, np.int64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.int64 + assert arr.fill_value == 0 df = pd.SparseDataFrame({'x': arr}) - self.assertEqual(df['x'].dtype, np.int64) - self.assertEqual(df['x'].fill_value, 0) + assert df['x'].dtype == np.int64 + assert df['x'].fill_value == 0 s = pd.SparseSeries(arr, name='x') - self.assertEqual(s.dtype, np.int64) - self.assertEqual(s.fill_value, 0) + assert s.dtype == np.int64 + assert s.fill_value == 0 df = pd.SparseDataFrame(s) - self.assertEqual(df['x'].dtype, np.int64) - self.assertEqual(df['x'].fill_value, 0) + assert df['x'].dtype == np.int64 + assert df['x'].fill_value == 0 df = pd.SparseDataFrame({'x': s}) - self.assertEqual(df['x'].dtype, np.int64) - self.assertEqual(df['x'].fill_value, 0) + assert df['x'].dtype == np.int64 + assert df['x'].fill_value == 0 + + def test_constructor_nan_dataframe(self): + # GH 10079 + trains = np.arange(100) + tresholds = [10, 20, 30, 40, 50, 60] + tuples = [(i, j) for i in trains for j in tresholds] + index = pd.MultiIndex.from_tuples(tuples, + names=['trains', 'tresholds']) + matrix = np.empty((len(index), len(trains))) + matrix.fill(np.nan) + df = pd.DataFrame(matrix, index=index, columns=trains, dtype=float) + result = df.to_sparse() + expected = pd.SparseDataFrame(matrix, index=index, columns=trains, + dtype=float) + tm.assert_sp_frame_equal(result, expected) + + def test_type_coercion_at_construction(self): + # GH 15682 + result = pd.SparseDataFrame( + {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [0, 0, 1]}, dtype='uint8', + default_fill_value=0) + expected = pd.SparseDataFrame( + {'a': pd.SparseSeries([1, 0, 0], dtype='uint8'), + 'b': pd.SparseSeries([0, 1, 0], dtype='uint8'), + 'c': pd.SparseSeries([0, 0, 1], dtype='uint8')}, + default_fill_value=0) + tm.assert_sp_frame_equal(result, expected) def test_dtypes(self): df = DataFrame(np.random.randn(10000, 4)) - df.ix[:9998] = np.nan + df.loc[:9998] = np.nan sdf = df.to_sparse() result = sdf.get_dtype_counts() @@ -225,15 +257,15 @@ def test_dtypes(self): tm.assert_series_equal(result, expected) def test_shape(self): - # GH 10452 - self.assertEqual(self.frame.shape, (10, 4)) - self.assertEqual(self.iframe.shape, (10, 4)) - self.assertEqual(self.zframe.shape, (10, 4)) - self.assertEqual(self.fill_frame.shape, (10, 4)) + # see gh-10452 + assert self.frame.shape == (10, 4) + assert self.iframe.shape == (10, 4) + assert self.zframe.shape == (10, 4) + assert self.fill_frame.shape == (10, 4) def test_str(self): df = DataFrame(np.random.randn(10000, 4)) - df.ix[:9998] = np.nan + df.loc[:9998] = np.nan sdf = df.to_sparse() str(sdf) @@ -246,7 +278,7 @@ def test_array_interface(self): def test_pickle(self): def _test_roundtrip(frame, orig): - result = self.round_trip_pickle(frame) + result = tm.round_trip_pickle(frame) tm.assert_sp_frame_equal(frame, result) tm.assert_frame_equal(result.to_dense(), orig, check_dtype=False) @@ -257,30 +289,30 @@ def test_dense_to_sparse(self): df = DataFrame({'A': [nan, nan, nan, 1, 2], 'B': [1, 2, nan, nan, nan]}) sdf = df.to_sparse() - tm.assertIsInstance(sdf, SparseDataFrame) - 
self.assertTrue(np.isnan(sdf.default_fill_value)) - tm.assertIsInstance(sdf['A'].sp_index, BlockIndex) + assert isinstance(sdf, SparseDataFrame) + assert np.isnan(sdf.default_fill_value) + assert isinstance(sdf['A'].sp_index, BlockIndex) tm.assert_frame_equal(sdf.to_dense(), df) sdf = df.to_sparse(kind='integer') - tm.assertIsInstance(sdf['A'].sp_index, IntIndex) + assert isinstance(sdf['A'].sp_index, IntIndex) df = DataFrame({'A': [0, 0, 0, 1, 2], 'B': [1, 2, 0, 0, 0]}, dtype=float) sdf = df.to_sparse(fill_value=0) - self.assertEqual(sdf.default_fill_value, 0) + assert sdf.default_fill_value == 0 tm.assert_frame_equal(sdf.to_dense(), df) def test_density(self): df = SparseSeries([nan, nan, nan, 0, 1, 2, 3, 4, 5, 6]) - self.assertEqual(df.density, 0.7) + assert df.density == 0.7 df = SparseDataFrame({'A': [nan, nan, nan, 0, 1, 2, 3, 4, 5, 6], 'B': [0, 1, 2, nan, nan, nan, 3, 4, 5, 6], 'C': np.arange(10), 'D': [0, 1, 2, 3, 4, 5, nan, nan, nan, nan]}) - self.assertEqual(df.density, 0.75) + assert df.density == 0.75 def test_sparse_to_dense(self): pass @@ -310,7 +342,7 @@ def _compare_to_dense(a, b, da, db, op): if isinstance(a, DataFrame) and isinstance(db, DataFrame): mixed_result = op(a, db) - tm.assertIsInstance(mixed_result, SparseDataFrame) + assert isinstance(mixed_result, SparseDataFrame) tm.assert_sp_frame_equal(mixed_result, sparse_result, exact_indices=False) @@ -349,14 +381,14 @@ def _compare_to_dense(a, b, da, db, op): _compare_to_dense(s, frame, s, frame.to_dense(), op) # it works! - result = self.frame + self.frame.ix[:, ['A', 'B']] # noqa + result = self.frame + self.frame.loc[:, ['A', 'B']] # noqa def test_op_corners(self): empty = self.empty + self.empty - self.assertTrue(empty.empty) + assert empty.empty foo = self.frame + self.empty - tm.assertIsInstance(foo.index, DatetimeIndex) + assert isinstance(foo.index, DatetimeIndex) tm.assert_frame_equal(foo, self.frame * np.nan) foo = self.empty + self.frame @@ -373,51 +405,50 @@ def test_getitem(self): exp = sdf.reindex(columns=['a', 'b']) tm.assert_sp_frame_equal(result, exp) - self.assertRaises(Exception, sdf.__getitem__, ['a', 'd']) + pytest.raises(Exception, sdf.__getitem__, ['a', 'd']) - def test_icol(self): - # 10711 deprecated + def test_iloc(self): # 2227 result = self.frame.iloc[:, 0] - self.assertTrue(isinstance(result, SparseSeries)) + assert isinstance(result, SparseSeries) tm.assert_sp_series_equal(result, self.frame['A']) # preserve sparse index type. 
#2251 data = {'A': [0, 1]} iframe = SparseDataFrame(data, default_kind='integer') - self.assertEqual(type(iframe['A'].sp_index), - type(iframe.iloc[:, 0].sp_index)) + tm.assert_class_equal(iframe['A'].sp_index, + iframe.iloc[:, 0].sp_index) def test_set_value(self): - # ok as the index gets conver to object + # ok, as the index gets converted to object frame = self.frame.copy() res = frame.set_value('foobar', 'B', 1.5) - self.assertEqual(res.index.dtype, 'object') + assert res.index.dtype == 'object' res = self.frame res.index = res.index.astype(object) res = self.frame.set_value('foobar', 'B', 1.5) - self.assertIsNot(res, self.frame) - self.assertEqual(res.index[-1], 'foobar') - self.assertEqual(res.get_value('foobar', 'B'), 1.5) + assert res is not self.frame + assert res.index[-1] == 'foobar' + assert res.get_value('foobar', 'B') == 1.5 res2 = res.set_value('foobar', 'qux', 1.5) - self.assertIsNot(res2, res) - self.assert_index_equal(res2.columns, - pd.Index(list(self.frame.columns) + ['qux'])) - self.assertEqual(res2.get_value('foobar', 'qux'), 1.5) + assert res2 is not res + tm.assert_index_equal(res2.columns, + pd.Index(list(self.frame.columns) + ['qux'])) + assert res2.get_value('foobar', 'qux') == 1.5 def test_fancy_index_misc(self): # axis = 0 - sliced = self.frame.ix[-2:, :] + sliced = self.frame.iloc[-2:, :] expected = self.frame.reindex(index=self.frame.index[-2:]) tm.assert_sp_frame_equal(sliced, expected) # axis = 1 - sliced = self.frame.ix[:, -2:] + sliced = self.frame.iloc[:, -2:] expected = self.frame.reindex(columns=self.frame.columns[-2:]) tm.assert_sp_frame_equal(sliced, expected) @@ -433,8 +464,8 @@ def test_getitem_overload(self): subindex = self.frame.index[indexer] subframe = self.frame[indexer] - self.assert_index_equal(subindex, subframe.index) - self.assertRaises(Exception, self.frame.__getitem__, indexer[:-1]) + tm.assert_index_equal(subindex, subframe.index) + pytest.raises(Exception, self.frame.__getitem__, indexer[:-1]) def test_setitem(self): @@ -443,7 +474,7 @@ def _check_frame(frame, orig): # insert SparseSeries frame['E'] = frame['A'] - tm.assertIsInstance(frame['E'], SparseSeries) + assert isinstance(frame['E'], SparseSeries) tm.assert_sp_series_equal(frame['E'], frame['A'], check_names=False) @@ -453,11 +484,11 @@ def _check_frame(frame, orig): expected = to_insert.to_dense().reindex(frame.index) result = frame['E'].to_dense() tm.assert_series_equal(result, expected, check_names=False) - self.assertEqual(result.name, 'E') + assert result.name == 'E' # insert Series frame['F'] = frame['A'].to_dense() - tm.assertIsInstance(frame['F'], SparseSeries) + assert isinstance(frame['F'], SparseSeries) tm.assert_sp_series_equal(frame['F'], frame['A'], check_names=False) @@ -470,24 +501,24 @@ def _check_frame(frame, orig): # insert ndarray frame['H'] = np.random.randn(N) - tm.assertIsInstance(frame['H'], SparseSeries) + assert isinstance(frame['H'], SparseSeries) to_sparsify = np.random.randn(N) to_sparsify[N // 2:] = frame.default_fill_value frame['I'] = to_sparsify - self.assertEqual(len(frame['I'].sp_values), N // 2) + assert len(frame['I'].sp_values) == N // 2 # insert ndarray wrong size - self.assertRaises(Exception, frame.__setitem__, 'foo', - np.random.randn(N - 1)) + pytest.raises(Exception, frame.__setitem__, 'foo', + np.random.randn(N - 1)) # scalar value frame['J'] = 5 - self.assertEqual(len(frame['J'].sp_values), N) - self.assertTrue((frame['J'].sp_values == 5).all()) + assert len(frame['J'].sp_values) == N + assert (frame['J'].sp_values == 5).all() 
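The run of hunks above applies one mechanical rule: unittest-style assertion helpers become plain pytest idioms (self.assertRaises becomes pytest.raises, tm.assertRaisesRegexp becomes tm.assert_raises_regex, and self.assertEqual / self.assertTrue become bare assert statements). A minimal self-contained sketch of the target style, assuming a pandas 0.20-era checkout; the ten-row frame is a hypothetical stand-in, not a fixture from this file:

    import numpy as np
    import pytest
    from pandas import SparseDataFrame
    from pandas.util import testing as tm

    def test_exception_assertions_sketch():
        # any SparseDataFrame with ten rows serves the purpose here
        frame = SparseDataFrame({'A': np.arange(10.0)})

        # was: self.assertRaises(Exception, frame.__setitem__, 'foo', ...)
        # setting a column from a length-9 array must raise
        pytest.raises(Exception, frame.__setitem__, 'foo',
                      np.random.randn(9))

        # was: tm.assertRaisesRegexp(ValueError, "^Index length", ...)
        # constructing from values with a too-short index must raise
        with tm.assert_raises_regex(ValueError, "^Index length"):
            SparseDataFrame(frame.values, index=frame.index[:-1])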
frame['K'] = frame.default_fill_value - self.assertEqual(len(frame['K'].sp_values), 0) + assert len(frame['K'].sp_values) == 0 self._check_all(_check_frame) @@ -514,25 +545,25 @@ def test_delitem(self): C = self.frame['C'] del self.frame['B'] - self.assertNotIn('B', self.frame) + assert 'B' not in self.frame tm.assert_sp_series_equal(self.frame['A'], A) tm.assert_sp_series_equal(self.frame['C'], C) del self.frame['D'] - self.assertNotIn('D', self.frame) + assert 'D' not in self.frame del self.frame['A'] - self.assertNotIn('A', self.frame) + assert 'A' not in self.frame def test_set_columns(self): self.frame.columns = self.frame.columns - self.assertRaises(Exception, setattr, self.frame, 'columns', - self.frame.columns[:-1]) + pytest.raises(Exception, setattr, self.frame, 'columns', + self.frame.columns[:-1]) def test_set_index(self): self.frame.index = self.frame.index - self.assertRaises(Exception, setattr, self.frame, 'index', - self.frame.index[:-1]) + pytest.raises(Exception, setattr, self.frame, 'index', + self.frame.index[:-1]) def test_append(self): a = self.frame[:5] @@ -541,28 +572,28 @@ def test_append(self): appended = a.append(b) tm.assert_sp_frame_equal(appended, self.frame, exact_indices=False) - a = self.frame.ix[:5, :3] - b = self.frame.ix[5:] + a = self.frame.iloc[:5, :3] + b = self.frame.iloc[5:] appended = a.append(b) - tm.assert_sp_frame_equal(appended.ix[:, :3], self.frame.ix[:, :3], + tm.assert_sp_frame_equal(appended.iloc[:, :3], self.frame.iloc[:, :3], exact_indices=False) def test_apply(self): applied = self.frame.apply(np.sqrt) - tm.assertIsInstance(applied, SparseDataFrame) + assert isinstance(applied, SparseDataFrame) tm.assert_almost_equal(applied.values, np.sqrt(self.frame.values)) applied = self.fill_frame.apply(np.sqrt) - self.assertEqual(applied['A'].fill_value, np.sqrt(2)) + assert applied['A'].fill_value == np.sqrt(2) # agg / broadcast broadcasted = self.frame.apply(np.sum, broadcast=True) - tm.assertIsInstance(broadcasted, SparseDataFrame) + assert isinstance(broadcasted, SparseDataFrame) exp = self.frame.to_dense().apply(np.sum, broadcast=True) tm.assert_frame_equal(broadcasted.to_dense(), exp) - self.assertIs(self.empty.apply(np.sqrt), self.empty) + assert self.empty.apply(np.sqrt) is self.empty from pandas.core import nanops applied = self.frame.apply(np.sum) @@ -576,9 +607,9 @@ def test_apply_nonuq(self): res = sparse.apply(lambda s: s[0], axis=1) exp = orig.apply(lambda s: s[0], axis=1) # dtype must be kept - self.assertEqual(res.dtype, np.int64) + assert res.dtype == np.int64 # ToDo: apply must return subclassed dtype - self.assertIsInstance(res, pd.Series) + assert isinstance(res, pd.Series) tm.assert_series_equal(res.to_dense(), exp) # df.T breaks @@ -591,15 +622,15 @@ def test_apply_nonuq(self): def test_applymap(self): # just test that it works result = self.frame.applymap(lambda x: x * 2) - tm.assertIsInstance(result, SparseDataFrame) + assert isinstance(result, SparseDataFrame) def test_astype(self): sparse = pd.SparseDataFrame({'A': SparseArray([1, 2, 3, 4], dtype=np.int64), 'B': SparseArray([4, 5, 6, 7], dtype=np.int64)}) - self.assertEqual(sparse['A'].dtype, np.int64) - self.assertEqual(sparse['B'].dtype, np.int64) + assert sparse['A'].dtype == np.int64 + assert sparse['B'].dtype == np.int64 res = sparse.astype(np.float64) exp = pd.SparseDataFrame({'A': SparseArray([1., 2., 3., 4.], @@ -608,16 +639,16 @@ def test_astype(self): fill_value=0.)}, default_fill_value=np.nan) tm.assert_sp_frame_equal(res, exp) - self.assertEqual(res['A'].dtype, 
np.float64) - self.assertEqual(res['B'].dtype, np.float64) + assert res['A'].dtype == np.float64 + assert res['B'].dtype == np.float64 sparse = pd.SparseDataFrame({'A': SparseArray([0, 2, 0, 4], dtype=np.int64), 'B': SparseArray([0, 5, 0, 7], dtype=np.int64)}, default_fill_value=0) - self.assertEqual(sparse['A'].dtype, np.int64) - self.assertEqual(sparse['B'].dtype, np.int64) + assert sparse['A'].dtype == np.int64 + assert sparse['B'].dtype == np.int64 res = sparse.astype(np.float64) exp = pd.SparseDataFrame({'A': SparseArray([0., 2., 0., 4.], @@ -626,8 +657,8 @@ def test_astype(self): fill_value=0.)}, default_fill_value=0.) tm.assert_sp_frame_equal(res, exp) - self.assertEqual(res['A'].dtype, np.float64) - self.assertEqual(res['B'].dtype, np.float64) + assert res['A'].dtype == np.float64 + assert res['B'].dtype == np.float64 def test_astype_bool(self): sparse = pd.SparseDataFrame({'A': SparseArray([0, 2, 0, 4], @@ -637,8 +668,8 @@ def test_astype_bool(self): fill_value=0, dtype=np.int64)}, default_fill_value=0) - self.assertEqual(sparse['A'].dtype, np.int64) - self.assertEqual(sparse['B'].dtype, np.int64) + assert sparse['A'].dtype == np.int64 + assert sparse['B'].dtype == np.int64 res = sparse.astype(bool) exp = pd.SparseDataFrame({'A': SparseArray([False, True, False, True], @@ -649,8 +680,8 @@ def test_astype_bool(self): fill_value=False)}, default_fill_value=False) tm.assert_sp_frame_equal(res, exp) - self.assertEqual(res['A'].dtype, np.bool) - self.assertEqual(res['B'].dtype, np.bool) + assert res['A'].dtype == np.bool + assert res['B'].dtype == np.bool def test_fillna(self): df = self.zframe.reindex(lrange(5)) @@ -690,10 +721,63 @@ def test_fillna_fill_value(self): tm.assert_frame_equal(sparse.fillna(-1).to_dense(), df.fillna(-1), check_dtype=False) + def test_sparse_frame_pad_backfill_limit(self): + index = np.arange(10) + df = DataFrame(np.random.randn(10, 4), index=index) + sdf = df.to_sparse() + + result = sdf[:2].reindex(index, method='pad', limit=5) + + expected = sdf[:2].reindex(index).fillna(method='pad') + expected = expected.to_dense() + expected.values[-3:] = np.nan + expected = expected.to_sparse() + tm.assert_frame_equal(result, expected) + + result = sdf[-2:].reindex(index, method='backfill', limit=5) + + expected = sdf[-2:].reindex(index).fillna(method='backfill') + expected = expected.to_dense() + expected.values[:3] = np.nan + expected = expected.to_sparse() + tm.assert_frame_equal(result, expected) + + def test_sparse_frame_fillna_limit(self): + index = np.arange(10) + df = DataFrame(np.random.randn(10, 4), index=index) + sdf = df.to_sparse() + + result = sdf[:2].reindex(index) + result = result.fillna(method='pad', limit=5) + + expected = sdf[:2].reindex(index).fillna(method='pad') + expected = expected.to_dense() + expected.values[-3:] = np.nan + expected = expected.to_sparse() + tm.assert_frame_equal(result, expected) + + result = sdf[-2:].reindex(index) + result = result.fillna(method='backfill', limit=5) + + expected = sdf[-2:].reindex(index).fillna(method='backfill') + expected = expected.to_dense() + expected.values[:3] = np.nan + expected = expected.to_sparse() + tm.assert_frame_equal(result, expected) + def test_rename(self): - # just check this works - renamed = self.frame.rename(index=str) # noqa - renamed = self.frame.rename(columns=lambda x: '%s%d' % (x, len(x))) # noqa + result = self.frame.rename(index=str) + expected = SparseDataFrame(self.data, index=self.dates.strftime( + "%Y-%m-%d %H:%M:%S")) + tm.assert_sp_frame_equal(result, expected) + + 
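# Note on test_rename: the old version only smoke-tested the calls; the
# rewrite above pins the expected output. rename(index=str) stringifies each
# Timestamp of the DatetimeIndex, which is why the expected frame is rebuilt
# with self.dates.strftime("%Y-%m-%d %H:%M:%S") as its index. The columns
# half below works the same way: lambda x: '%s%d' % (x, len(x)) maps the
# single-letter columns 'A'..'D' to 'A1'..'D1'.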
result = self.frame.rename(columns=lambda x: '%s%d' % (x, len(x))) + data = {'A1': [nan, nan, nan, 0, 1, 2, 3, 4, 5, 6], + 'B1': [0, 1, 2, nan, nan, nan, 3, 4, 5, 6], + 'C1': np.arange(10, dtype=np.float64), + 'D1': [0, 1, 2, 3, 4, 5, nan, nan, nan, nan]} + expected = SparseDataFrame(data, index=self.dates) + tm.assert_sp_frame_equal(result, expected) def test_corr(self): res = self.frame.corr() @@ -706,16 +790,16 @@ def test_describe(self): desc = self.frame.describe() # noqa def test_join(self): - left = self.frame.ix[:, ['A', 'B']] - right = self.frame.ix[:, ['C', 'D']] + left = self.frame.loc[:, ['A', 'B']] + right = self.frame.loc[:, ['C', 'D']] joined = left.join(right) tm.assert_sp_frame_equal(joined, self.frame, exact_indices=False) - right = self.frame.ix[:, ['B', 'D']] - self.assertRaises(Exception, left.join, right) + right = self.frame.loc[:, ['B', 'D']] + pytest.raises(Exception, left.join, right) - with tm.assertRaisesRegexp(ValueError, - 'Other Series must have a name'): + with tm.assert_raises_regex(ValueError, + 'Other Series must have a name'): self.frame.join(Series( np.random.randn(len(self.frame)), index=self.frame.index)) @@ -745,22 +829,22 @@ def _check_frame(frame): # length zero length_zero = frame.reindex([]) - self.assertEqual(len(length_zero), 0) - self.assertEqual(len(length_zero.columns), len(frame.columns)) - self.assertEqual(len(length_zero['A']), 0) + assert len(length_zero) == 0 + assert len(length_zero.columns) == len(frame.columns) + assert len(length_zero['A']) == 0 # frame being reindexed has length zero length_n = length_zero.reindex(index) - self.assertEqual(len(length_n), len(frame)) - self.assertEqual(len(length_n.columns), len(frame.columns)) - self.assertEqual(len(length_n['A']), len(frame)) + assert len(length_n) == len(frame) + assert len(length_n.columns) == len(frame.columns) + assert len(length_n['A']) == len(frame) # reindex columns reindexed = frame.reindex(columns=['A', 'B', 'Z']) - self.assertEqual(len(reindexed.columns), 3) + assert len(reindexed.columns) == 3 tm.assert_almost_equal(reindexed['Z'].fill_value, frame.default_fill_value) - self.assertTrue(np.isnan(reindexed['Z'].sp_values).all()) + assert np.isnan(reindexed['Z'].sp_values).all() _check_frame(self.frame) _check_frame(self.iframe) @@ -770,11 +854,11 @@ def _check_frame(frame): # with copy=False reindexed = self.frame.reindex(self.frame.index, copy=False) reindexed['F'] = reindexed['A'] - self.assertIn('F', self.frame) + assert 'F' in self.frame reindexed = self.frame.reindex(self.frame.index) reindexed['G'] = reindexed['A'] - self.assertNotIn('G', self.frame) + assert 'G' not in self.frame def test_reindex_fill_value(self): rng = bdate_range('20110110', periods=20) @@ -784,6 +868,76 @@ def test_reindex_fill_value(self): exp = exp.to_sparse(self.zframe.default_fill_value) tm.assert_sp_frame_equal(result, exp) + def test_reindex_method(self): + + sparse = SparseDataFrame(data=[[11., 12., 14.], + [21., 22., 24.], + [41., 42., 44.]], + index=[1, 2, 4], + columns=[1, 2, 4], + dtype=float) + + # Over indices + + # default method + result = sparse.reindex(index=range(6)) + expected = SparseDataFrame(data=[[nan, nan, nan], + [11., 12., 14.], + [21., 22., 24.], + [nan, nan, nan], + [41., 42., 44.], + [nan, nan, nan]], + index=range(6), + columns=[1, 2, 4], + dtype=float) + tm.assert_sp_frame_equal(result, expected) + + # method='bfill' + result = sparse.reindex(index=range(6), method='bfill') + expected = SparseDataFrame(data=[[11., 12., 14.], + [11., 12., 14.], + [21., 22., 24.], + 
[41., 42., 44.], + [41., 42., 44.], + [nan, nan, nan]], + index=range(6), + columns=[1, 2, 4], + dtype=float) + tm.assert_sp_frame_equal(result, expected) + + # method='ffill' + result = sparse.reindex(index=range(6), method='ffill') + expected = SparseDataFrame(data=[[nan, nan, nan], + [11., 12., 14.], + [21., 22., 24.], + [21., 22., 24.], + [41., 42., 44.], + [41., 42., 44.]], + index=range(6), + columns=[1, 2, 4], + dtype=float) + tm.assert_sp_frame_equal(result, expected) + + # Over columns + + # default method + result = sparse.reindex(columns=range(6)) + expected = SparseDataFrame(data=[[nan, 11., 12., nan, 14., nan], + [nan, 21., 22., nan, 24., nan], + [nan, 41., 42., nan, 44., nan]], + index=[1, 2, 4], + columns=range(6), + dtype=float) + tm.assert_sp_frame_equal(result, expected) + + # method='bfill' + with pytest.raises(NotImplementedError): + sparse.reindex(columns=range(6), method='bfill') + + # method='ffill' + with pytest.raises(NotImplementedError): + sparse.reindex(columns=range(6), method='ffill') + def test_take(self): result = self.frame.take([1, 0, 2], axis=1) expected = self.frame.reindex(columns=['B', 'A', 'C']) @@ -798,23 +952,25 @@ def _check(frame, orig): self._check_all(_check) def test_stack_sparse_frame(self): - def _check(frame): - dense_frame = frame.to_dense() # noqa + with catch_warnings(record=True): + + def _check(frame): + dense_frame = frame.to_dense() # noqa - wp = Panel.from_dict({'foo': frame}) - from_dense_lp = wp.to_frame() + wp = Panel.from_dict({'foo': frame}) + from_dense_lp = wp.to_frame() - from_sparse_lp = spf.stack_sparse_frame(frame) + from_sparse_lp = spf.stack_sparse_frame(frame) - self.assert_numpy_array_equal(from_dense_lp.values, - from_sparse_lp.values) + tm.assert_numpy_array_equal(from_dense_lp.values, + from_sparse_lp.values) - _check(self.frame) - _check(self.iframe) + _check(self.frame) + _check(self.iframe) - # for now - self.assertRaises(Exception, _check, self.zframe) - self.assertRaises(Exception, _check, self.fill_frame) + # for now + pytest.raises(Exception, _check, self.zframe) + pytest.raises(Exception, _check, self.fill_frame) def test_transpose(self): @@ -832,7 +988,6 @@ def _check(frame, orig): def test_shift(self): def _check(frame, orig): - shifted = frame.shift(0) exp = orig.shift(0) tm.assert_frame_equal(shifted.to_dense(), exp) @@ -887,7 +1042,7 @@ def test_numpy_transpose(self): tm.assert_sp_frame_equal(result, sdf) msg = "the 'axes' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.transpose, sdf, axes=1) + tm.assert_raises_regex(ValueError, msg, np.transpose, sdf, axes=1) def test_combine_first(self): df = self.frame @@ -925,26 +1080,26 @@ def test_sparse_pow_issue(self): df = SparseDataFrame({'A': [nan, 0, 1]}) # note that 2 ** df works fine, also df ** 1 - result = 1**df + result = 1 ** df r1 = result.take([0], 1)['A'] r2 = result['A'] - self.assertEqual(len(r2.sp_values), len(r1.sp_values)) + assert len(r2.sp_values) == len(r1.sp_values) def test_as_blocks(self): df = SparseDataFrame({'A': [1.1, 3.3], 'B': [nan, -3.9]}, dtype='float64') df_blocks = df.blocks - self.assertEqual(list(df_blocks.keys()), ['float64']) + assert list(df_blocks.keys()) == ['float64'] tm.assert_frame_equal(df_blocks['float64'], df) def test_nan_columnname(self): # GH 8822 nan_colname = DataFrame(Series(1.0, index=[0]), columns=[nan]) nan_colname_sparse = nan_colname.to_sparse() - self.assertTrue(np.isnan(nan_colname_sparse.columns[0])) + assert np.isnan(nan_colname_sparse.columns[0]) def test_isnull(self): # GH 
8276 @@ -963,7 +1118,7 @@ def test_isnull(self): 'B': [0, np.nan, 0, 2, np.nan]}, default_fill_value=0.) res = df.isnull() - tm.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) exp = pd.DataFrame({'A': [False, False, False, False, True], 'B': [False, True, False, False, True]}) tm.assert_frame_equal(res.to_dense(), exp) @@ -985,13 +1140,112 @@ def test_isnotnull(self): 'B': [0, np.nan, 0, 2, np.nan]}, default_fill_value=0.) res = df.isnotnull() - tm.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) exp = pd.DataFrame({'A': [True, True, True, True, False], 'B': [True, False, True, True, False]}) tm.assert_frame_equal(res.to_dense(), exp) -class TestSparseDataFrameArithmetic(tm.TestCase): +@pytest.mark.parametrize('index', [None, list('ab')]) # noqa: F811 +@pytest.mark.parametrize('columns', [None, list('cd')]) +@pytest.mark.parametrize('fill_value', [None, 0, np.nan]) +@pytest.mark.parametrize('dtype', [bool, int, float, np.uint16]) +def test_from_to_scipy(spmatrix, index, columns, fill_value, dtype): + # GH 4343 + tm.skip_if_no_package('scipy') + + # Make one ndarray and from it one sparse matrix, both to be used for + # constructing frames and comparing results + arr = np.eye(2, dtype=dtype) + try: + spm = spmatrix(arr) + assert spm.dtype == arr.dtype + except (TypeError, AssertionError): + # If conversion to sparse fails for this spmatrix type and arr.dtype, + # then the combination is not currently supported in NumPy, so we + # can just skip testing it thoroughly + return + + sdf = pd.SparseDataFrame(spm, index=index, columns=columns, + default_fill_value=fill_value) + + # Expected result construction is kind of tricky for all + # dtype-fill_value combinations; easiest to cast to something generic + # and except later on + rarr = arr.astype(object) + rarr[arr == 0] = np.nan + expected = pd.SparseDataFrame(rarr, index=index, columns=columns).fillna( + fill_value if fill_value is not None else np.nan) + + # Assert frame is as expected + sdf_obj = sdf.astype(object) + tm.assert_sp_frame_equal(sdf_obj, expected) + tm.assert_frame_equal(sdf_obj.to_dense(), expected.to_dense()) + + # Assert spmatrices equal + assert dict(sdf.to_coo().todok()) == dict(spm.todok()) + + # Ensure dtype is preserved if possible + was_upcast = ((fill_value is None or is_float(fill_value)) and + not is_object_dtype(dtype) and + not is_float_dtype(dtype)) + res_dtype = (bool if is_bool_dtype(dtype) else + float if was_upcast else + dtype) + tm.assert_contains_all(sdf.dtypes, {np.dtype(res_dtype)}) + assert sdf.to_coo().dtype == res_dtype + + # However, adding a str column results in an upcast to object + sdf['strings'] = np.arange(len(sdf)).astype(str) + assert sdf.to_coo().dtype == np.object_ + + +@pytest.mark.parametrize('fill_value', [None, 0, np.nan]) # noqa: F811 +def test_from_to_scipy_object(spmatrix, fill_value): + # GH 4343 + dtype = object + columns = list('cd') + index = list('ab') + tm.skip_if_no_package('scipy', max_version='0.19.0') + + # Make one ndarray and from it one sparse matrix, both to be used for + # constructing frames and comparing results + arr = np.eye(2, dtype=dtype) + try: + spm = spmatrix(arr) + assert spm.dtype == arr.dtype + except (TypeError, AssertionError): + # If conversion to sparse fails for this spmatrix type and arr.dtype, + # then the combination is not currently supported in NumPy, so we + # can just skip testing it thoroughly + return + + sdf = pd.SparseDataFrame(spm, index=index, columns=columns, + 
default_fill_value=fill_value) + + # Expected result construction is kind of tricky for all + # dtype-fill_value combinations; easiest to cast to something generic + # and except later on + rarr = arr.astype(object) + rarr[arr == 0] = np.nan + expected = pd.SparseDataFrame(rarr, index=index, columns=columns).fillna( + fill_value if fill_value is not None else np.nan) + + # Assert frame is as expected + sdf_obj = sdf.astype(object) + tm.assert_sp_frame_equal(sdf_obj, expected) + tm.assert_frame_equal(sdf_obj.to_dense(), expected.to_dense()) + + # Assert spmatrices equal + assert dict(sdf.to_coo().todok()) == dict(spm.todok()) + + # Ensure dtype is preserved if possible + res_dtype = object + tm.assert_contains_all(sdf.dtypes, {np.dtype(res_dtype)}) + assert sdf.to_coo().dtype == res_dtype + + +class TestSparseDataFrameArithmetic(object): def test_numeric_op_scalar(self): df = pd.DataFrame({'A': [nan, nan, 0, 1, ], @@ -1012,16 +1266,16 @@ def test_comparison_op_scalar(self): # comparison changes internal repr, compare with dense res = sparse > 1 - tm.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res.to_dense(), df > 1) res = sparse != 0 - tm.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res.to_dense(), df != 0) -class TestSparseDataFrameAnalytics(tm.TestCase): - def setUp(self): +class TestSparseDataFrameAnalytics(object): + def setup_method(self, method): self.data = {'A': [nan, nan, nan, 0, 1, 2, 3, 4, 5, 6], 'B': [0, 1, 2, nan, nan, nan, 3, 4, 5, 6], 'C': np.arange(10, dtype=float), @@ -1049,12 +1303,12 @@ def test_numpy_cumsum(self): tm.assert_sp_frame_equal(result, expected) msg = "the 'dtype' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.cumsum, - self.frame, dtype=np.int64) + tm.assert_raises_regex(ValueError, msg, np.cumsum, + self.frame, dtype=np.int64) msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.cumsum, - self.frame, out=result) + tm.assert_raises_regex(ValueError, msg, np.cumsum, + self.frame, out=result) def test_numpy_func_call(self): # no exception should be raised even though @@ -1064,8 +1318,3 @@ def test_numpy_func_call(self): 'std', 'min', 'max'] for func in funcs: getattr(np, func)(self.frame) - -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/sparse/tests/test_groupby.py b/pandas/tests/sparse/test_groupby.py similarity index 94% rename from pandas/sparse/tests/test_groupby.py rename to pandas/tests/sparse/test_groupby.py index 0cb33f4ea0a56..c9049ed9743dd 100644 --- a/pandas/sparse/tests/test_groupby.py +++ b/pandas/tests/sparse/test_groupby.py @@ -4,11 +4,9 @@ import pandas.util.testing as tm -class TestSparseGroupBy(tm.TestCase): +class TestSparseGroupBy(object): - _multiprocess_can_split_ = True - - def setUp(self): + def setup_method(self, method): self.dense = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B': ['one', 'one', 'two', 'three', diff --git a/pandas/sparse/tests/test_indexing.py b/pandas/tests/sparse/test_indexing.py similarity index 82% rename from pandas/sparse/tests/test_indexing.py rename to pandas/tests/sparse/test_indexing.py index c0d4b70c41dc4..382cff4b9d0ac 100644 --- a/pandas/sparse/tests/test_indexing.py +++ b/pandas/tests/sparse/test_indexing.py @@ -1,16 +1,14 @@ # pylint: disable-msg=E1101,W0612 -import nose # noqa 
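# The nose removal above repeats across every renamed module in this patch:
# the nose import goes, pytest comes in, and the per-file runner block that
# used to close each module,
#
#     if __name__ == '__main__':
#         import nose
#         nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb',
#                              '--pdb-failure'], exit=False)
#
# is deleted outright. pytest needs no in-module hook; it collects tests by
# naming convention, so the equivalent invocation from the shell is, for
# example:
#
#     pytest pandas/tests/sparse/test_indexing.py -v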
+import pytest import numpy as np import pandas as pd import pandas.util.testing as tm -class TestSparseSeriesIndexing(tm.TestCase): +class TestSparseSeriesIndexing(object): - _multiprocess_can_split_ = True - - def setUp(self): + def setup_method(self, method): self.orig = pd.Series([1, np.nan, np.nan, 3, np.nan]) self.sparse = self.orig.to_sparse() @@ -18,9 +16,9 @@ def test_getitem(self): orig = self.orig sparse = self.sparse - self.assertEqual(sparse[0], 1) - self.assertTrue(np.isnan(sparse[1])) - self.assertEqual(sparse[3], 3) + assert sparse[0] == 1 + assert np.isnan(sparse[1]) + assert sparse[3] == 3 result = sparse[[1, 3, 4]] exp = orig[[1, 3, 4]].to_sparse() @@ -55,23 +53,23 @@ def test_getitem_int_dtype(self): res = s[::2] exp = pd.SparseSeries([0, 2, 4, 6], index=[0, 2, 4, 6], name='xxx') tm.assert_sp_series_equal(res, exp) - self.assertEqual(res.dtype, np.int64) + assert res.dtype == np.int64 s = pd.SparseSeries([0, 1, 2, 3, 4, 5, 6], fill_value=0, name='xxx') res = s[::2] exp = pd.SparseSeries([0, 2, 4, 6], index=[0, 2, 4, 6], fill_value=0, name='xxx') tm.assert_sp_series_equal(res, exp) - self.assertEqual(res.dtype, np.int64) + assert res.dtype == np.int64 def test_getitem_fill_value(self): orig = pd.Series([1, np.nan, 0, 3, 0]) sparse = orig.to_sparse(fill_value=0) - self.assertEqual(sparse[0], 1) - self.assertTrue(np.isnan(sparse[1])) - self.assertEqual(sparse[2], 0) - self.assertEqual(sparse[3], 3) + assert sparse[0] == 1 + assert np.isnan(sparse[1]) + assert sparse[2] == 0 + assert sparse[3] == 3 result = sparse[[1, 3, 4]] exp = orig[[1, 3, 4]].to_sparse(fill_value=0) @@ -115,8 +113,8 @@ def test_loc(self): orig = self.orig sparse = self.sparse - self.assertEqual(sparse.loc[0], 1) - self.assertTrue(np.isnan(sparse.loc[1])) + assert sparse.loc[0] == 1 + assert np.isnan(sparse.loc[1]) result = sparse.loc[[1, 3, 4]] exp = orig.loc[[1, 3, 4]].to_sparse() @@ -127,7 +125,7 @@ def test_loc(self): exp = orig.loc[[1, 3, 4, 5]].to_sparse() tm.assert_sp_series_equal(result, exp) # padded with NaN - self.assertTrue(np.isnan(result[-1])) + assert np.isnan(result[-1]) # dense array result = sparse.loc[orig % 2 == 1] @@ -147,8 +145,8 @@ def test_loc_index(self): orig = pd.Series([1, np.nan, np.nan, 3, np.nan], index=list('ABCDE')) sparse = orig.to_sparse() - self.assertEqual(sparse.loc['A'], 1) - self.assertTrue(np.isnan(sparse.loc['B'])) + assert sparse.loc['A'] == 1 + assert np.isnan(sparse.loc['B']) result = sparse.loc[['A', 'C', 'D']] exp = orig.loc[['A', 'C', 'D']].to_sparse() @@ -172,8 +170,8 @@ def test_loc_index_fill_value(self): orig = pd.Series([1, np.nan, 0, 3, 0], index=list('ABCDE')) sparse = orig.to_sparse(fill_value=0) - self.assertEqual(sparse.loc['A'], 1) - self.assertTrue(np.isnan(sparse.loc['B'])) + assert sparse.loc['A'] == 1 + assert np.isnan(sparse.loc['B']) result = sparse.loc[['A', 'C', 'D']] exp = orig.loc[['A', 'C', 'D']].to_sparse(fill_value=0) @@ -211,8 +209,8 @@ def test_iloc(self): orig = self.orig sparse = self.sparse - self.assertEqual(sparse.iloc[3], 3) - self.assertTrue(np.isnan(sparse.iloc[2])) + assert sparse.iloc[3] == 3 + assert np.isnan(sparse.iloc[2]) result = sparse.iloc[[1, 3, 4]] exp = orig.iloc[[1, 3, 4]].to_sparse() @@ -222,16 +220,16 @@ def test_iloc(self): exp = orig.iloc[[1, -2, -4]].to_sparse() tm.assert_sp_series_equal(result, exp) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): sparse.iloc[[1, 3, 5]] def test_iloc_fill_value(self): orig = pd.Series([1, np.nan, 0, 3, 0]) sparse = orig.to_sparse(fill_value=0) - 
self.assertEqual(sparse.iloc[3], 3) - self.assertTrue(np.isnan(sparse.iloc[1])) - self.assertEqual(sparse.iloc[4], 0) + assert sparse.iloc[3] == 3 + assert np.isnan(sparse.iloc[1]) + assert sparse.iloc[4] == 0 result = sparse.iloc[[1, 3, 4]] exp = orig.iloc[[1, 3, 4]].to_sparse(fill_value=0) @@ -251,74 +249,74 @@ def test_iloc_slice_fill_value(self): def test_at(self): orig = pd.Series([1, np.nan, np.nan, 3, np.nan]) sparse = orig.to_sparse() - self.assertEqual(sparse.at[0], orig.at[0]) - self.assertTrue(np.isnan(sparse.at[1])) - self.assertTrue(np.isnan(sparse.at[2])) - self.assertEqual(sparse.at[3], orig.at[3]) - self.assertTrue(np.isnan(sparse.at[4])) + assert sparse.at[0] == orig.at[0] + assert np.isnan(sparse.at[1]) + assert np.isnan(sparse.at[2]) + assert sparse.at[3] == orig.at[3] + assert np.isnan(sparse.at[4]) orig = pd.Series([1, np.nan, np.nan, 3, np.nan], index=list('abcde')) sparse = orig.to_sparse() - self.assertEqual(sparse.at['a'], orig.at['a']) - self.assertTrue(np.isnan(sparse.at['b'])) - self.assertTrue(np.isnan(sparse.at['c'])) - self.assertEqual(sparse.at['d'], orig.at['d']) - self.assertTrue(np.isnan(sparse.at['e'])) + assert sparse.at['a'] == orig.at['a'] + assert np.isnan(sparse.at['b']) + assert np.isnan(sparse.at['c']) + assert sparse.at['d'] == orig.at['d'] + assert np.isnan(sparse.at['e']) def test_at_fill_value(self): orig = pd.Series([1, np.nan, 0, 3, 0], index=list('abcde')) sparse = orig.to_sparse(fill_value=0) - self.assertEqual(sparse.at['a'], orig.at['a']) - self.assertTrue(np.isnan(sparse.at['b'])) - self.assertEqual(sparse.at['c'], orig.at['c']) - self.assertEqual(sparse.at['d'], orig.at['d']) - self.assertEqual(sparse.at['e'], orig.at['e']) + assert sparse.at['a'] == orig.at['a'] + assert np.isnan(sparse.at['b']) + assert sparse.at['c'] == orig.at['c'] + assert sparse.at['d'] == orig.at['d'] + assert sparse.at['e'] == orig.at['e'] def test_iat(self): orig = self.orig sparse = self.sparse - self.assertEqual(sparse.iat[0], orig.iat[0]) - self.assertTrue(np.isnan(sparse.iat[1])) - self.assertTrue(np.isnan(sparse.iat[2])) - self.assertEqual(sparse.iat[3], orig.iat[3]) - self.assertTrue(np.isnan(sparse.iat[4])) + assert sparse.iat[0] == orig.iat[0] + assert np.isnan(sparse.iat[1]) + assert np.isnan(sparse.iat[2]) + assert sparse.iat[3] == orig.iat[3] + assert np.isnan(sparse.iat[4]) - self.assertTrue(np.isnan(sparse.iat[-1])) - self.assertEqual(sparse.iat[-5], orig.iat[-5]) + assert np.isnan(sparse.iat[-1]) + assert sparse.iat[-5] == orig.iat[-5] def test_iat_fill_value(self): orig = pd.Series([1, np.nan, 0, 3, 0]) sparse = orig.to_sparse() - self.assertEqual(sparse.iat[0], orig.iat[0]) - self.assertTrue(np.isnan(sparse.iat[1])) - self.assertEqual(sparse.iat[2], orig.iat[2]) - self.assertEqual(sparse.iat[3], orig.iat[3]) - self.assertEqual(sparse.iat[4], orig.iat[4]) + assert sparse.iat[0] == orig.iat[0] + assert np.isnan(sparse.iat[1]) + assert sparse.iat[2] == orig.iat[2] + assert sparse.iat[3] == orig.iat[3] + assert sparse.iat[4] == orig.iat[4] - self.assertEqual(sparse.iat[-1], orig.iat[-1]) - self.assertEqual(sparse.iat[-5], orig.iat[-5]) + assert sparse.iat[-1] == orig.iat[-1] + assert sparse.iat[-5] == orig.iat[-5] def test_get(self): s = pd.SparseSeries([1, np.nan, np.nan, 3, np.nan]) - self.assertEqual(s.get(0), 1) - self.assertTrue(np.isnan(s.get(1))) - self.assertIsNone(s.get(5)) + assert s.get(0) == 1 + assert np.isnan(s.get(1)) + assert s.get(5) is None s = pd.SparseSeries([1, np.nan, 0, 3, 0], index=list('ABCDE')) - 
self.assertEqual(s.get('A'), 1) - self.assertTrue(np.isnan(s.get('B'))) - self.assertEqual(s.get('C'), 0) - self.assertIsNone(s.get('XX')) + assert s.get('A') == 1 + assert np.isnan(s.get('B')) + assert s.get('C') == 0 + assert s.get('XX') is None s = pd.SparseSeries([1, np.nan, 0, 3, 0], index=list('ABCDE'), fill_value=0) - self.assertEqual(s.get('A'), 1) - self.assertTrue(np.isnan(s.get('B'))) - self.assertEqual(s.get('C'), 0) - self.assertIsNone(s.get('XX')) + assert s.get('A') == 1 + assert np.isnan(s.get('B')) + assert s.get('C') == 0 + assert s.get('XX') is None def test_take(self): orig = pd.Series([1, np.nan, np.nan, 3, np.nan], @@ -368,7 +366,7 @@ def test_reindex(self): exp = orig.reindex(['A', 'E', 'C', 'D']).to_sparse() tm.assert_sp_series_equal(res, exp) - def test_reindex_fill_value(self): + def test_fill_value_reindex(self): orig = pd.Series([1, np.nan, 0, 3, 0], index=list('ABCDE')) sparse = orig.to_sparse(fill_value=0) @@ -399,6 +397,23 @@ def test_reindex_fill_value(self): exp = orig.reindex(['A', 'E', 'C', 'D']).to_sparse(fill_value=0) tm.assert_sp_series_equal(res, exp) + def test_reindex_fill_value(self): + floats = pd.Series([1., 2., 3.]).to_sparse() + result = floats.reindex([1, 2, 3], fill_value=0) + expected = pd.Series([2., 3., 0], index=[1, 2, 3]).to_sparse() + tm.assert_sp_series_equal(result, expected) + + def test_reindex_nearest(self): + s = pd.Series(np.arange(10, dtype='float64')).to_sparse() + target = [0.1, 0.9, 1.5, 2.0] + actual = s.reindex(target, method='nearest') + expected = pd.Series(np.around(target), target).to_sparse() + tm.assert_sp_series_equal(expected, actual) + + actual = s.reindex(target, method='nearest', tolerance=0.2) + expected = pd.Series([0, 1, np.nan, 2], target).to_sparse() + tm.assert_sp_series_equal(expected, actual) + def tests_indexing_with_sparse(self): # GH 13985 @@ -425,15 +440,13 @@ def tests_indexing_with_sparse(self): msg = ("iLocation based boolean indexing cannot use an " "indexable as a mask") - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): s.iloc[indexer] class TestSparseSeriesMultiIndexing(TestSparseSeriesIndexing): - _multiprocess_can_split_ = True - - def setUp(self): + def setup_method(self, method): # Mi with duplicated values idx = pd.MultiIndex.from_tuples([('A', 0), ('A', 1), ('B', 0), ('C', 0), ('C', 1)]) @@ -444,9 +457,9 @@ def test_getitem_multi(self): orig = self.orig sparse = self.sparse - self.assertEqual(sparse[0], orig[0]) - self.assertTrue(np.isnan(sparse[1])) - self.assertEqual(sparse[3], orig[3]) + assert sparse[0] == orig[0] + assert np.isnan(sparse[1]) + assert sparse[3] == orig[3] tm.assert_sp_series_equal(sparse['A'], orig['A'].to_sparse()) tm.assert_sp_series_equal(sparse['B'], orig['B'].to_sparse()) @@ -473,9 +486,9 @@ def test_getitem_multi_tuple(self): orig = self.orig sparse = self.sparse - self.assertEqual(sparse['C', 0], orig['C', 0]) - self.assertTrue(np.isnan(sparse['A', 1])) - self.assertTrue(np.isnan(sparse['B', 0])) + assert sparse['C', 0] == orig['C', 0] + assert np.isnan(sparse['A', 1]) + assert np.isnan(sparse['B', 0]) def test_getitems_slice_multi(self): orig = self.orig @@ -508,6 +521,11 @@ def test_loc(self): exp = orig.loc[[1, 3, 4, 5]].to_sparse() tm.assert_sp_series_equal(result, exp) + # single element list (GH 15447) + result = sparse.loc[['A']] + exp = orig.loc[['A']].to_sparse() + tm.assert_sp_series_equal(result, exp) + # dense array result = sparse.loc[orig % 2 == 1] exp = orig.loc[orig % 2 == 1].to_sparse() @@ -526,9 
+544,9 @@ def test_loc_multi_tuple(self): orig = self.orig sparse = self.sparse - self.assertEqual(sparse.loc['C', 0], orig.loc['C', 0]) - self.assertTrue(np.isnan(sparse.loc['A', 1])) - self.assertTrue(np.isnan(sparse.loc['B', 0])) + assert sparse.loc['C', 0] == orig.loc['C', 0] + assert np.isnan(sparse.loc['A', 1]) + assert np.isnan(sparse.loc['B', 0]) def test_loc_slice(self): orig = self.orig @@ -541,10 +559,37 @@ def test_loc_slice(self): orig.loc['A':'B'].to_sparse()) tm.assert_sp_series_equal(sparse.loc[:'B'], orig.loc[:'B'].to_sparse()) + def test_reindex(self): + # GH 15447 + orig = self.orig + sparse = self.sparse + + res = sparse.reindex([('A', 0), ('C', 1)]) + exp = orig.reindex([('A', 0), ('C', 1)]).to_sparse() + tm.assert_sp_series_equal(res, exp) + + # On specific level: + res = sparse.reindex(['A', 'C', 'B'], level=0) + exp = orig.reindex(['A', 'C', 'B'], level=0).to_sparse() + tm.assert_sp_series_equal(res, exp) + + # single element list (GH 15447) + res = sparse.reindex(['A'], level=0) + exp = orig.reindex(['A'], level=0).to_sparse() + tm.assert_sp_series_equal(res, exp) + + with pytest.raises(TypeError): + # Incomplete keys are not accepted for reindexing: + sparse.reindex(['A', 'C']) -class TestSparseDataFrameIndexing(tm.TestCase): + # "copy" argument: + res = sparse.reindex(sparse.index, copy=True) + exp = orig.reindex(orig.index, copy=True).to_sparse() + tm.assert_sp_series_equal(res, exp) + assert sparse is not res - _multiprocess_can_split_ = True + +class TestSparseDataFrameIndexing(object): def test_getitem(self): orig = pd.DataFrame([[1, np.nan, np.nan], @@ -562,8 +607,8 @@ def test_getitem(self): tm.assert_sp_frame_equal(sparse[[True, False, True, True]], orig[[True, False, True, True]].to_sparse()) - tm.assert_sp_frame_equal(sparse[[1, 2]], - orig[[1, 2]].to_sparse()) + tm.assert_sp_frame_equal(sparse.iloc[[1, 2]], + orig.iloc[[1, 2]].to_sparse()) def test_getitem_fill_value(self): orig = pd.DataFrame([[1, np.nan, 0], @@ -589,9 +634,9 @@ def test_getitem_fill_value(self): exp._default_fill_value = np.nan tm.assert_sp_frame_equal(sparse[indexer], exp) - exp = orig[[1, 2]].to_sparse(fill_value=0) + exp = orig.iloc[[1, 2]].to_sparse(fill_value=0) exp._default_fill_value = np.nan - tm.assert_sp_frame_equal(sparse[[1, 2]], exp) + tm.assert_sp_frame_equal(sparse.iloc[[1, 2]], exp) def test_loc(self): orig = pd.DataFrame([[1, np.nan, np.nan], @@ -600,9 +645,9 @@ def test_loc(self): columns=list('xyz')) sparse = orig.to_sparse() - self.assertEqual(sparse.loc[0, 'x'], 1) - self.assertTrue(np.isnan(sparse.loc[1, 'z'])) - self.assertEqual(sparse.loc[2, 'z'], 4) + assert sparse.loc[0, 'x'] == 1 + assert np.isnan(sparse.loc[1, 'z']) + assert sparse.loc[2, 'z'] == 4 tm.assert_sp_series_equal(sparse.loc[0], orig.loc[0].to_sparse()) tm.assert_sp_series_equal(sparse.loc[1], orig.loc[1].to_sparse()) @@ -657,9 +702,9 @@ def test_loc_index(self): index=list('abc'), columns=list('xyz')) sparse = orig.to_sparse() - self.assertEqual(sparse.loc['a', 'x'], 1) - self.assertTrue(np.isnan(sparse.loc['b', 'z'])) - self.assertEqual(sparse.loc['c', 'z'], 4) + assert sparse.loc['a', 'x'] == 1 + assert np.isnan(sparse.loc['b', 'z']) + assert sparse.loc['c', 'z'] == 4 tm.assert_sp_series_equal(sparse.loc['a'], orig.loc['a'].to_sparse()) tm.assert_sp_series_equal(sparse.loc['b'], orig.loc['b'].to_sparse()) @@ -717,8 +762,8 @@ def test_iloc(self): [np.nan, np.nan, 4]]) sparse = orig.to_sparse() - self.assertEqual(sparse.iloc[1, 1], 3) - self.assertTrue(np.isnan(sparse.iloc[2, 0])) + assert 
sparse.iloc[1, 1] == 3 + assert np.isnan(sparse.iloc[2, 0]) tm.assert_sp_series_equal(sparse.iloc[0], orig.loc[0].to_sparse()) tm.assert_sp_series_equal(sparse.iloc[1], orig.loc[1].to_sparse()) @@ -747,7 +792,7 @@ def test_iloc(self): exp = orig.iloc[[2], [1, 0]].to_sparse() tm.assert_sp_frame_equal(result, exp) - with tm.assertRaises(IndexError): + with pytest.raises(IndexError): sparse.iloc[[1, 3, 5]] def test_iloc_slice(self): @@ -765,10 +810,10 @@ def test_at(self): [0, np.nan, 5]], index=list('ABCD'), columns=list('xyz')) sparse = orig.to_sparse() - self.assertEqual(sparse.at['A', 'x'], orig.at['A', 'x']) - self.assertTrue(np.isnan(sparse.at['B', 'z'])) - self.assertTrue(np.isnan(sparse.at['C', 'y'])) - self.assertEqual(sparse.at['D', 'x'], orig.at['D', 'x']) + assert sparse.at['A', 'x'] == orig.at['A', 'x'] + assert np.isnan(sparse.at['B', 'z']) + assert np.isnan(sparse.at['C', 'y']) + assert sparse.at['D', 'x'] == orig.at['D', 'x'] def test_at_fill_value(self): orig = pd.DataFrame([[1, np.nan, 0], @@ -777,10 +822,10 @@ def test_at_fill_value(self): [0, np.nan, 5]], index=list('ABCD'), columns=list('xyz')) sparse = orig.to_sparse(fill_value=0) - self.assertEqual(sparse.at['A', 'x'], orig.at['A', 'x']) - self.assertTrue(np.isnan(sparse.at['B', 'z'])) - self.assertTrue(np.isnan(sparse.at['C', 'y'])) - self.assertEqual(sparse.at['D', 'x'], orig.at['D', 'x']) + assert sparse.at['A', 'x'] == orig.at['A', 'x'] + assert np.isnan(sparse.at['B', 'z']) + assert np.isnan(sparse.at['C', 'y']) + assert sparse.at['D', 'x'] == orig.at['D', 'x'] def test_iat(self): orig = pd.DataFrame([[1, np.nan, 0], @@ -789,13 +834,13 @@ def test_iat(self): [0, np.nan, 5]], index=list('ABCD'), columns=list('xyz')) sparse = orig.to_sparse() - self.assertEqual(sparse.iat[0, 0], orig.iat[0, 0]) - self.assertTrue(np.isnan(sparse.iat[1, 2])) - self.assertTrue(np.isnan(sparse.iat[2, 1])) - self.assertEqual(sparse.iat[2, 0], orig.iat[2, 0]) + assert sparse.iat[0, 0] == orig.iat[0, 0] + assert np.isnan(sparse.iat[1, 2]) + assert np.isnan(sparse.iat[2, 1]) + assert sparse.iat[2, 0] == orig.iat[2, 0] - self.assertTrue(np.isnan(sparse.iat[-1, -2])) - self.assertEqual(sparse.iat[-1, -1], orig.iat[-1, -1]) + assert np.isnan(sparse.iat[-1, -2]) + assert sparse.iat[-1, -1] == orig.iat[-1, -1] def test_iat_fill_value(self): orig = pd.DataFrame([[1, np.nan, 0], @@ -804,13 +849,13 @@ def test_iat_fill_value(self): [0, np.nan, 5]], index=list('ABCD'), columns=list('xyz')) sparse = orig.to_sparse(fill_value=0) - self.assertEqual(sparse.iat[0, 0], orig.iat[0, 0]) - self.assertTrue(np.isnan(sparse.iat[1, 2])) - self.assertTrue(np.isnan(sparse.iat[2, 1])) - self.assertEqual(sparse.iat[2, 0], orig.iat[2, 0]) + assert sparse.iat[0, 0] == orig.iat[0, 0] + assert np.isnan(sparse.iat[1, 2]) + assert np.isnan(sparse.iat[2, 1]) + assert sparse.iat[2, 0] == orig.iat[2, 0] - self.assertTrue(np.isnan(sparse.iat[-1, -2])) - self.assertEqual(sparse.iat[-1, -1], orig.iat[-1, -1]) + assert np.isnan(sparse.iat[-1, -2]) + assert sparse.iat[-1, -1] == orig.iat[-1, -1] def test_take(self): orig = pd.DataFrame([[1, np.nan, 0], @@ -907,8 +952,9 @@ def test_reindex_fill_value(self): tm.assert_sp_frame_equal(res, exp) -class TestMultitype(tm.TestCase): - def setUp(self): +class TestMultitype(object): + + def setup_method(self, method): self.cols = ['string', 'int', 'float', 'object'] self.string_series = pd.SparseSeries(['a', 'b', 'c']) @@ -926,7 +972,7 @@ def setUp(self): def test_frame_basic_dtypes(self): for _, row in self.sdf.iterrows(): - 
self.assertEqual(row.dtype, object) + assert row.dtype == object tm.assert_sp_series_equal(self.sdf['string'], self.string_series, check_names=False) tm.assert_sp_series_equal(self.sdf['int'], self.int_series, @@ -968,13 +1014,14 @@ def test_frame_indexing_multiple(self): def test_series_indexing_single(self): for i, idx in enumerate(self.cols): - self.assertEqual(self.ss.iloc[i], self.ss[idx]) - self.assertEqual(type(self.ss.iloc[i]), - type(self.ss[idx])) - self.assertEqual(self.ss['string'], 'a') - self.assertEqual(self.ss['int'], 1) - self.assertEqual(self.ss['float'], 1.1) - self.assertEqual(self.ss['object'], []) + assert self.ss.iloc[i] == self.ss[idx] + tm.assert_class_equal(self.ss.iloc[i], self.ss[idx], + obj="series index") + + assert self.ss['string'] == 'a' + assert self.ss['int'] == 1 + assert self.ss['float'] == 1.1 + assert self.ss['object'] == [] def test_series_indexing_multiple(self): tm.assert_sp_series_equal(self.ss.loc[['string', 'int']], diff --git a/pandas/sparse/tests/test_libsparse.py b/pandas/tests/sparse/test_libsparse.py similarity index 77% rename from pandas/sparse/tests/test_libsparse.py rename to pandas/tests/sparse/test_libsparse.py index c289b4a1b204f..4842ebdd103c4 100644 --- a/pandas/sparse/tests/test_libsparse.py +++ b/pandas/tests/sparse/test_libsparse.py @@ -1,14 +1,14 @@ from pandas import Series -import nose # noqa +import pytest import numpy as np import operator import pandas.util.testing as tm from pandas import compat -from pandas.sparse.array import IntIndex, BlockIndex, _make_index -import pandas._sparse as splib +from pandas.core.sparse.array import IntIndex, BlockIndex, _make_index +import pandas._libs.sparse as splib TEST_LENGTH = 20 @@ -42,7 +42,7 @@ def _check_case_dict(case): _check_case([], [], [], [], [], []) -class TestSparseIndexUnion(tm.TestCase): +class TestSparseIndexUnion(object): def test_index_make_union(self): def _check_case(xloc, xlen, yloc, ylen, eloc, elen): @@ -162,33 +162,33 @@ def test_intindex_make_union(self): b = IntIndex(5, np.array([0, 2], dtype=np.int32)) res = a.make_union(b) exp = IntIndex(5, np.array([0, 2, 3, 4], np.int32)) - self.assertTrue(res.equals(exp)) + assert res.equals(exp) a = IntIndex(5, np.array([], dtype=np.int32)) b = IntIndex(5, np.array([0, 2], dtype=np.int32)) res = a.make_union(b) exp = IntIndex(5, np.array([0, 2], np.int32)) - self.assertTrue(res.equals(exp)) + assert res.equals(exp) a = IntIndex(5, np.array([], dtype=np.int32)) b = IntIndex(5, np.array([], dtype=np.int32)) res = a.make_union(b) exp = IntIndex(5, np.array([], np.int32)) - self.assertTrue(res.equals(exp)) + assert res.equals(exp) a = IntIndex(5, np.array([0, 1, 2, 3, 4], dtype=np.int32)) b = IntIndex(5, np.array([0, 1, 2, 3, 4], dtype=np.int32)) res = a.make_union(b) exp = IntIndex(5, np.array([0, 1, 2, 3, 4], np.int32)) - self.assertTrue(res.equals(exp)) + assert res.equals(exp) a = IntIndex(5, np.array([0, 1], dtype=np.int32)) b = IntIndex(4, np.array([0, 1], dtype=np.int32)) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): a.make_union(b) -class TestSparseIndexIntersect(tm.TestCase): +class TestSparseIndexIntersect(object): def test_intersect(self): def _check_correct(a, b, expected): @@ -196,7 +196,7 @@ def _check_correct(a, b, expected): assert (result.equals(expected)) def _check_length_exc(a, longer): - nose.tools.assert_raises(Exception, a.intersect, longer) + pytest.raises(Exception, a.intersect, longer) def _check_case(xloc, xlen, yloc, ylen, eloc, elen): xindex = BlockIndex(TEST_LENGTH, xloc, 
xlen) @@ -213,19 +213,19 @@ def _check_case(xloc, xlen, yloc, ylen, eloc, elen): longer_index.to_int_index()) if compat.is_platform_windows(): - raise nose.SkipTest("segfaults on win-64 when all tests are run") + pytest.skip("segfaults on win-64 when all tests are run") check_cases(_check_case) def test_intersect_empty(self): xindex = IntIndex(4, np.array([], dtype=np.int32)) yindex = IntIndex(4, np.array([2, 3], dtype=np.int32)) - self.assertTrue(xindex.intersect(yindex).equals(xindex)) - self.assertTrue(yindex.intersect(xindex).equals(xindex)) + assert xindex.intersect(yindex).equals(xindex) + assert yindex.intersect(xindex).equals(xindex) xindex = xindex.to_block_index() yindex = yindex.to_block_index() - self.assertTrue(xindex.intersect(yindex).equals(xindex)) - self.assertTrue(yindex.intersect(xindex).equals(xindex)) + assert xindex.intersect(yindex).equals(xindex) + assert yindex.intersect(xindex).equals(xindex) def test_intersect_identical(self): cases = [IntIndex(5, np.array([1, 2], dtype=np.int32)), @@ -234,47 +234,45 @@ def test_intersect_identical(self): IntIndex(5, np.array([], dtype=np.int32))] for case in cases: - self.assertTrue(case.intersect(case).equals(case)) + assert case.intersect(case).equals(case) case = case.to_block_index() - self.assertTrue(case.intersect(case).equals(case)) + assert case.intersect(case).equals(case) -class TestSparseIndexCommon(tm.TestCase): - - _multiprocess_can_split_ = True +class TestSparseIndexCommon(object): def test_int_internal(self): idx = _make_index(4, np.array([2, 3], dtype=np.int32), kind='integer') - self.assertIsInstance(idx, IntIndex) - self.assertEqual(idx.npoints, 2) + assert isinstance(idx, IntIndex) + assert idx.npoints == 2 tm.assert_numpy_array_equal(idx.indices, np.array([2, 3], dtype=np.int32)) idx = _make_index(4, np.array([], dtype=np.int32), kind='integer') - self.assertIsInstance(idx, IntIndex) - self.assertEqual(idx.npoints, 0) + assert isinstance(idx, IntIndex) + assert idx.npoints == 0 tm.assert_numpy_array_equal(idx.indices, np.array([], dtype=np.int32)) idx = _make_index(4, np.array([0, 1, 2, 3], dtype=np.int32), kind='integer') - self.assertIsInstance(idx, IntIndex) - self.assertEqual(idx.npoints, 4) + assert isinstance(idx, IntIndex) + assert idx.npoints == 4 tm.assert_numpy_array_equal(idx.indices, np.array([0, 1, 2, 3], dtype=np.int32)) def test_block_internal(self): idx = _make_index(4, np.array([2, 3], dtype=np.int32), kind='block') - self.assertIsInstance(idx, BlockIndex) - self.assertEqual(idx.npoints, 2) + assert isinstance(idx, BlockIndex) + assert idx.npoints == 2 tm.assert_numpy_array_equal(idx.blocs, np.array([2], dtype=np.int32)) tm.assert_numpy_array_equal(idx.blengths, np.array([2], dtype=np.int32)) idx = _make_index(4, np.array([], dtype=np.int32), kind='block') - self.assertIsInstance(idx, BlockIndex) - self.assertEqual(idx.npoints, 0) + assert isinstance(idx, BlockIndex) + assert idx.npoints == 0 tm.assert_numpy_array_equal(idx.blocs, np.array([], dtype=np.int32)) tm.assert_numpy_array_equal(idx.blengths, @@ -282,8 +280,8 @@ def test_block_internal(self): idx = _make_index(4, np.array([0, 1, 2, 3], dtype=np.int32), kind='block') - self.assertIsInstance(idx, BlockIndex) - self.assertEqual(idx.npoints, 4) + assert isinstance(idx, BlockIndex) + assert idx.npoints == 4 tm.assert_numpy_array_equal(idx.blocs, np.array([0], dtype=np.int32)) tm.assert_numpy_array_equal(idx.blengths, @@ -291,8 +289,8 @@ def test_block_internal(self): idx = _make_index(4, np.array([0, 2, 3], dtype=np.int32), kind='block') 
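Aside for readers following the index tests: a minimal sketch of the two sparse index kinds being exercised here, using the import path this patch introduces. The values mirror the `_make_index` cases above, and the assertions hold per `test_equals` and the conversion round-trip tests.

```python
import numpy as np

from pandas.core.sparse.array import BlockIndex, IntIndex

# The same 4-position mask stored two ways: IntIndex keeps explicit
# positions, BlockIndex keeps run starts (blocs) plus run lengths
# (blengths).
int_idx = IntIndex(4, np.array([2, 3], dtype=np.int32))
blk_idx = BlockIndex(4, np.array([2], dtype=np.int32),
                     np.array([2], dtype=np.int32))

assert int_idx.npoints == blk_idx.npoints == 2
assert blk_idx.to_int_index().equals(int_idx)
assert int_idx.to_block_index().equals(blk_idx)
```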
- self.assertIsInstance(idx, BlockIndex) - self.assertEqual(idx.npoints, 3) + assert isinstance(idx, BlockIndex) + assert idx.npoints == 3 tm.assert_numpy_array_equal(idx.blocs, np.array([0, 2], dtype=np.int32)) tm.assert_numpy_array_equal(idx.blengths, @@ -301,35 +299,35 @@ def test_block_internal(self): def test_lookup(self): for kind in ['integer', 'block']: idx = _make_index(4, np.array([2, 3], dtype=np.int32), kind=kind) - self.assertEqual(idx.lookup(-1), -1) - self.assertEqual(idx.lookup(0), -1) - self.assertEqual(idx.lookup(1), -1) - self.assertEqual(idx.lookup(2), 0) - self.assertEqual(idx.lookup(3), 1) - self.assertEqual(idx.lookup(4), -1) + assert idx.lookup(-1) == -1 + assert idx.lookup(0) == -1 + assert idx.lookup(1) == -1 + assert idx.lookup(2) == 0 + assert idx.lookup(3) == 1 + assert idx.lookup(4) == -1 idx = _make_index(4, np.array([], dtype=np.int32), kind=kind) for i in range(-1, 5): - self.assertEqual(idx.lookup(i), -1) + assert idx.lookup(i) == -1 idx = _make_index(4, np.array([0, 1, 2, 3], dtype=np.int32), kind=kind) - self.assertEqual(idx.lookup(-1), -1) - self.assertEqual(idx.lookup(0), 0) - self.assertEqual(idx.lookup(1), 1) - self.assertEqual(idx.lookup(2), 2) - self.assertEqual(idx.lookup(3), 3) - self.assertEqual(idx.lookup(4), -1) + assert idx.lookup(-1) == -1 + assert idx.lookup(0) == 0 + assert idx.lookup(1) == 1 + assert idx.lookup(2) == 2 + assert idx.lookup(3) == 3 + assert idx.lookup(4) == -1 idx = _make_index(4, np.array([0, 2, 3], dtype=np.int32), kind=kind) - self.assertEqual(idx.lookup(-1), -1) - self.assertEqual(idx.lookup(0), 0) - self.assertEqual(idx.lookup(1), -1) - self.assertEqual(idx.lookup(2), 1) - self.assertEqual(idx.lookup(3), 2) - self.assertEqual(idx.lookup(4), -1) + assert idx.lookup(-1) == -1 + assert idx.lookup(0) == 0 + assert idx.lookup(1) == -1 + assert idx.lookup(2) == 1 + assert idx.lookup(3) == 2 + assert idx.lookup(4) == -1 def test_lookup_array(self): for kind in ['integer', 'block']: @@ -337,11 +335,11 @@ def test_lookup_array(self): res = idx.lookup_array(np.array([-1, 0, 2], dtype=np.int32)) exp = np.array([-1, -1, 0], dtype=np.int32) - self.assert_numpy_array_equal(res, exp) + tm.assert_numpy_array_equal(res, exp) res = idx.lookup_array(np.array([4, 2, 1, 3], dtype=np.int32)) exp = np.array([-1, 0, -1, 1], dtype=np.int32) - self.assert_numpy_array_equal(res, exp) + tm.assert_numpy_array_equal(res, exp) idx = _make_index(4, np.array([], dtype=np.int32), kind=kind) res = idx.lookup_array(np.array([-1, 0, 2, 4], dtype=np.int32)) @@ -351,21 +349,21 @@ def test_lookup_array(self): kind=kind) res = idx.lookup_array(np.array([-1, 0, 2], dtype=np.int32)) exp = np.array([-1, 0, 2], dtype=np.int32) - self.assert_numpy_array_equal(res, exp) + tm.assert_numpy_array_equal(res, exp) res = idx.lookup_array(np.array([4, 2, 1, 3], dtype=np.int32)) exp = np.array([-1, 2, 1, 3], dtype=np.int32) - self.assert_numpy_array_equal(res, exp) + tm.assert_numpy_array_equal(res, exp) idx = _make_index(4, np.array([0, 2, 3], dtype=np.int32), kind=kind) res = idx.lookup_array(np.array([2, 1, 3, 0], dtype=np.int32)) exp = np.array([1, -1, 2, 0], dtype=np.int32) - self.assert_numpy_array_equal(res, exp) + tm.assert_numpy_array_equal(res, exp) res = idx.lookup_array(np.array([1, 4, 2, 5], dtype=np.int32)) exp = np.array([-1, -1, 1, -1], dtype=np.int32) - self.assert_numpy_array_equal(res, exp) + tm.assert_numpy_array_equal(res, exp) def test_lookup_basics(self): def _check(index): @@ -389,22 +387,20 @@ def _check(index): # corner cases -class 
TestBlockIndex(tm.TestCase): - - _multiprocess_can_split_ = True +class TestBlockIndex(object): def test_block_internal(self): idx = _make_index(4, np.array([2, 3], dtype=np.int32), kind='block') - self.assertIsInstance(idx, BlockIndex) - self.assertEqual(idx.npoints, 2) + assert isinstance(idx, BlockIndex) + assert idx.npoints == 2 tm.assert_numpy_array_equal(idx.blocs, np.array([2], dtype=np.int32)) tm.assert_numpy_array_equal(idx.blengths, np.array([2], dtype=np.int32)) idx = _make_index(4, np.array([], dtype=np.int32), kind='block') - self.assertIsInstance(idx, BlockIndex) - self.assertEqual(idx.npoints, 0) + assert isinstance(idx, BlockIndex) + assert idx.npoints == 0 tm.assert_numpy_array_equal(idx.blocs, np.array([], dtype=np.int32)) tm.assert_numpy_array_equal(idx.blengths, @@ -412,16 +408,16 @@ def test_block_internal(self): idx = _make_index(4, np.array([0, 1, 2, 3], dtype=np.int32), kind='block') - self.assertIsInstance(idx, BlockIndex) - self.assertEqual(idx.npoints, 4) + assert isinstance(idx, BlockIndex) + assert idx.npoints == 4 tm.assert_numpy_array_equal(idx.blocs, np.array([0], dtype=np.int32)) tm.assert_numpy_array_equal(idx.blengths, np.array([4], dtype=np.int32)) idx = _make_index(4, np.array([0, 2, 3], dtype=np.int32), kind='block') - self.assertIsInstance(idx, BlockIndex) - self.assertEqual(idx.npoints, 3) + assert isinstance(idx, BlockIndex) + assert idx.npoints == 3 tm.assert_numpy_array_equal(idx.blocs, np.array([0, 2], dtype=np.int32)) tm.assert_numpy_array_equal(idx.blengths, @@ -440,8 +436,8 @@ def test_make_block_boundary(self): def test_equals(self): index = BlockIndex(10, [0, 4], [2, 5]) - self.assertTrue(index.equals(index)) - self.assertFalse(index.equals(BlockIndex(10, [0, 4], [2, 6]))) + assert index.equals(index) + assert not index.equals(BlockIndex(10, [0, 4], [2, 6])) def test_check_integrity(self): locs = [] @@ -455,10 +451,10 @@ def test_check_integrity(self): index = BlockIndex(1, locs, lengths) # noqa # block extend beyond end - self.assertRaises(Exception, BlockIndex, 10, [5], [10]) + pytest.raises(Exception, BlockIndex, 10, [5], [10]) # block overlap - self.assertRaises(Exception, BlockIndex, 10, [2, 5], [5, 3]) + pytest.raises(Exception, BlockIndex, 10, [2, 5], [5, 3]) def test_to_int_index(self): locs = [0, 10] @@ -473,37 +469,73 @@ def test_to_int_index(self): def test_to_block_index(self): index = BlockIndex(10, [0, 5], [4, 5]) - self.assertIs(index.to_block_index(), index) + assert index.to_block_index() is index + + +class TestIntIndex(object): + + def test_check_integrity(self): + + # Too many indices than specified in self.length + msg = "Too many indices" + + with tm.assert_raises_regex(ValueError, msg): + IntIndex(length=1, indices=[1, 2, 3]) + + # No index can be negative. + msg = "No index can be less than zero" + + with tm.assert_raises_regex(ValueError, msg): + IntIndex(length=5, indices=[1, -2, 3]) + # No index can be negative. + msg = "No index can be less than zero" -class TestIntIndex(tm.TestCase): + with tm.assert_raises_regex(ValueError, msg): + IntIndex(length=5, indices=[1, -2, 3]) - _multiprocess_can_split_ = True + # All indices must be less than the length. + msg = "All indices must be less than the length" + + with tm.assert_raises_regex(ValueError, msg): + IntIndex(length=5, indices=[1, 2, 5]) + + with tm.assert_raises_regex(ValueError, msg): + IntIndex(length=5, indices=[1, 2, 6]) + + # Indices must be strictly ascending. 
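Aside: the constructor guards this new `test_check_integrity` exercises, collected into one self-contained sketch (helper and error messages exactly as in the patch); the strictly-increasing checks then continue below.

```python
import pandas.util.testing as tm
from pandas.core.sparse.array import IntIndex

# One failing construction per validation message checked in the test.
with tm.assert_raises_regex(ValueError, "Too many indices"):
    IntIndex(length=1, indices=[1, 2, 3])

with tm.assert_raises_regex(ValueError, "No index can be less than zero"):
    IntIndex(length=5, indices=[1, -2, 3])

with tm.assert_raises_regex(ValueError,
                            "All indices must be less than the length"):
    IntIndex(length=5, indices=[1, 2, 5])

with tm.assert_raises_regex(ValueError,
                            "Indices must be strictly increasing"):
    IntIndex(length=5, indices=[1, 3, 2])
```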
+ msg = "Indices must be strictly increasing" + + with tm.assert_raises_regex(ValueError, msg): + IntIndex(length=5, indices=[1, 3, 2]) + + with tm.assert_raises_regex(ValueError, msg): + IntIndex(length=5, indices=[1, 3, 3]) def test_int_internal(self): idx = _make_index(4, np.array([2, 3], dtype=np.int32), kind='integer') - self.assertIsInstance(idx, IntIndex) - self.assertEqual(idx.npoints, 2) + assert isinstance(idx, IntIndex) + assert idx.npoints == 2 tm.assert_numpy_array_equal(idx.indices, np.array([2, 3], dtype=np.int32)) idx = _make_index(4, np.array([], dtype=np.int32), kind='integer') - self.assertIsInstance(idx, IntIndex) - self.assertEqual(idx.npoints, 0) + assert isinstance(idx, IntIndex) + assert idx.npoints == 0 tm.assert_numpy_array_equal(idx.indices, np.array([], dtype=np.int32)) idx = _make_index(4, np.array([0, 1, 2, 3], dtype=np.int32), kind='integer') - self.assertIsInstance(idx, IntIndex) - self.assertEqual(idx.npoints, 4) + assert isinstance(idx, IntIndex) + assert idx.npoints == 4 tm.assert_numpy_array_equal(idx.indices, np.array([0, 1, 2, 3], dtype=np.int32)) def test_equals(self): index = IntIndex(10, [0, 1, 2, 3, 4]) - self.assertTrue(index.equals(index)) - self.assertFalse(index.equals(IntIndex(10, [0, 1, 2, 3]))) + assert index.equals(index) + assert not index.equals(IntIndex(10, [0, 1, 2, 3])) def test_to_block_index(self): @@ -514,18 +546,18 @@ def _check_case(xloc, xlen, yloc, ylen, eloc, elen): # see if survive the round trip xbindex = xindex.to_int_index().to_block_index() ybindex = yindex.to_int_index().to_block_index() - tm.assertIsInstance(xbindex, BlockIndex) - self.assertTrue(xbindex.equals(xindex)) - self.assertTrue(ybindex.equals(yindex)) + assert isinstance(xbindex, BlockIndex) + assert xbindex.equals(xindex) + assert ybindex.equals(yindex) check_cases(_check_case) def test_to_int_index(self): index = IntIndex(10, [2, 3, 4, 5, 6]) - self.assertIs(index.to_int_index(), index) + assert index.to_int_index() is index -class TestSparseOperators(tm.TestCase): +class TestSparseOperators(object): def _op_tests(self, sparse_op, python_op): def _check_case(xloc, xlen, yloc, ylen, eloc, elen): @@ -546,9 +578,9 @@ def _check_case(xloc, xlen, yloc, ylen, eloc, elen): result_int_vals, ri_index, ifill = sparse_op(x, xdindex, xfill, y, ydindex, yfill) - self.assertTrue(rb_index.to_int_index().equals(ri_index)) + assert rb_index.to_int_index().equals(ri_index) tm.assert_numpy_array_equal(result_block_vals, result_int_vals) - self.assertEqual(bfill, ifill) + assert bfill == ifill # check versus Series... xseries = Series(x, xdindex.indices) @@ -566,8 +598,8 @@ def _check_case(xloc, xlen, yloc, ylen, eloc, elen): check_cases(_check_case) -# too cute? oh but how I abhor code duplication +# too cute? 
oh but how I abhor code duplication check_ops = ['add', 'sub', 'mul', 'truediv', 'floordiv'] @@ -585,9 +617,3 @@ def f(self): g = make_optestf(op) setattr(TestSparseOperators, g.__name__, g) del g - - -if __name__ == '__main__': - import nose # noqa - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/sparse/tests/test_list.py b/pandas/tests/sparse/test_list.py similarity index 83% rename from pandas/sparse/tests/test_list.py rename to pandas/tests/sparse/test_list.py index b117685b6e968..6c721ca813a21 100644 --- a/pandas/sparse/tests/test_list.py +++ b/pandas/tests/sparse/test_list.py @@ -1,18 +1,15 @@ from pandas.compat import range -import unittest from numpy import nan import numpy as np -from pandas.sparse.api import SparseList, SparseArray +from pandas.core.sparse.api import SparseList, SparseArray import pandas.util.testing as tm -class TestSparseList(unittest.TestCase): +class TestSparseList(object): - _multiprocess_can_split_ = True - - def setUp(self): + def setup_method(self, method): self.na_data = np.array([nan, nan, 1, 2, 3, nan, 4, 5, nan, 6]) self.zero_data = np.array([0, 0, 1, 2, 3, 0, 4, 5, 0, 6]) @@ -35,11 +32,11 @@ def test_len(self): arr = self.na_data splist = SparseList() splist.append(arr[:5]) - self.assertEqual(len(splist), 5) + assert len(splist) == 5 splist.append(arr[5]) - self.assertEqual(len(splist), 6) + assert len(splist) == 6 splist.append(arr[6:]) - self.assertEqual(len(splist), 10) + assert len(splist) == 10 def test_append_na(self): with tm.assert_produces_warning(FutureWarning): @@ -78,12 +75,12 @@ def test_consolidate(self): splist.append(arr[6:]) consol = splist.consolidate(inplace=False) - self.assertEqual(consol.nchunks, 1) - self.assertEqual(splist.nchunks, 3) + assert consol.nchunks == 1 + assert splist.nchunks == 3 tm.assert_sp_array_equal(consol.to_array(), exp_sparr) splist.consolidate() - self.assertEqual(splist.nchunks, 1) + assert splist.nchunks == 1 tm.assert_sp_array_equal(splist.to_array(), exp_sparr) def test_copy(self): @@ -98,7 +95,7 @@ def test_copy(self): cp = splist.copy() cp.append(arr[6:]) - self.assertEqual(splist.nchunks, 2) + assert splist.nchunks == 2 tm.assert_sp_array_equal(cp.to_array(), exp_sparr) def test_getitem(self): @@ -112,9 +109,3 @@ def test_getitem(self): for i in range(len(arr)): tm.assert_almost_equal(splist[i], arr[i]) tm.assert_almost_equal(splist[-i], arr[-i]) - - -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/sparse/tests/test_pivot.py b/pandas/tests/sparse/test_pivot.py similarity index 96% rename from pandas/sparse/tests/test_pivot.py rename to pandas/tests/sparse/test_pivot.py index 482a99a96194f..e7eba63e4e0b3 100644 --- a/pandas/sparse/tests/test_pivot.py +++ b/pandas/tests/sparse/test_pivot.py @@ -3,11 +3,9 @@ import pandas.util.testing as tm -class TestPivotTable(tm.TestCase): +class TestPivotTable(object): - _multiprocess_can_split_ = True - - def setUp(self): + def setup_method(self, method): self.dense = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B': ['one', 'one', 'two', 'three', diff --git a/pandas/sparse/tests/test_series.py b/pandas/tests/sparse/test_series.py similarity index 85% rename from pandas/sparse/tests/test_series.py rename to pandas/tests/sparse/test_series.py index de8c63df9c9e6..b524d6bfab418 100644 --- a/pandas/sparse/tests/test_series.py +++ b/pandas/tests/sparse/test_series.py @@ -1,6 +1,7 
@@ # pylint: disable-msg=E1101,W0612 import operator +import pytest from numpy import nan import numpy as np @@ -12,13 +13,13 @@ import pandas.util.testing as tm from pandas.compat import range from pandas import compat -from pandas.tools.util import cartesian_product +from pandas.core.reshape.util import cartesian_product -import pandas.sparse.frame as spf +import pandas.core.sparse.frame as spf -from pandas._sparse import BlockIndex, IntIndex -from pandas.sparse.api import SparseSeries -from pandas.tests.series.test_misc_api import SharedWithSparse +from pandas._libs.sparse import BlockIndex, IntIndex +from pandas.core.sparse.api import SparseSeries +from pandas.tests.series.test_api import SharedWithSparse def _test_data1(): @@ -55,10 +56,9 @@ def _test_data2_zero(): return arr, index -class TestSparseSeries(tm.TestCase, SharedWithSparse): - _multiprocess_can_split_ = True +class TestSparseSeries(SharedWithSparse): - def setUp(self): + def setup_method(self, method): arr, index = _test_data1() date_index = bdate_range('1/1/2011', periods=len(index)) @@ -90,35 +90,29 @@ def setUp(self): def test_constructor_dtype(self): arr = SparseSeries([np.nan, 1, 2, np.nan]) - self.assertEqual(arr.dtype, np.float64) - self.assertTrue(np.isnan(arr.fill_value)) + assert arr.dtype == np.float64 + assert np.isnan(arr.fill_value) arr = SparseSeries([np.nan, 1, 2, np.nan], fill_value=0) - self.assertEqual(arr.dtype, np.float64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.float64 + assert arr.fill_value == 0 arr = SparseSeries([0, 1, 2, 4], dtype=np.int64, fill_value=np.nan) - self.assertEqual(arr.dtype, np.int64) - self.assertTrue(np.isnan(arr.fill_value)) + assert arr.dtype == np.int64 + assert np.isnan(arr.fill_value) arr = SparseSeries([0, 1, 2, 4], dtype=np.int64) - self.assertEqual(arr.dtype, np.int64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.int64 + assert arr.fill_value == 0 arr = SparseSeries([0, 1, 2, 4], fill_value=0, dtype=np.int64) - self.assertEqual(arr.dtype, np.int64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.int64 + assert arr.fill_value == 0 def test_iteration_and_str(self): [x for x in self.bseries] str(self.bseries) - def test_TimeSeries_deprecation(self): - - # deprecation TimeSeries, #10890 - with tm.assert_produces_warning(FutureWarning): - pd.SparseTimeSeries(1, index=pd.date_range('20130101', periods=3)) - def test_construct_DataFrame_with_sp_series(self): # it works! 
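Aside: the dtype and fill-value defaults pinned by `test_constructor_dtype` above, restated as a standalone snippet (values copied from the test); the DataFrame-construction test resumes below.

```python
import numpy as np
import pandas as pd

# Float data defaults to a NaN fill value; requesting an integer dtype
# switches the default fill value to 0.
arr = pd.SparseSeries([np.nan, 1, 2, np.nan])
assert arr.dtype == np.float64
assert np.isnan(arr.fill_value)

arr = pd.SparseSeries([0, 1, 2, 4], dtype=np.int64)
assert arr.dtype == np.int64
assert arr.fill_value == 0
```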
df = DataFrame({'col': self.bseries}) @@ -141,12 +135,12 @@ def test_construct_DataFrame_with_sp_series(self): def test_constructor_preserve_attr(self): arr = pd.SparseArray([1, 0, 3, 0], dtype=np.int64, fill_value=0) - self.assertEqual(arr.dtype, np.int64) - self.assertEqual(arr.fill_value, 0) + assert arr.dtype == np.int64 + assert arr.fill_value == 0 s = pd.SparseSeries(arr, name='x') - self.assertEqual(s.dtype, np.int64) - self.assertEqual(s.fill_value, 0) + assert s.dtype == np.int64 + assert s.fill_value == 0 def test_series_density(self): # GH2803 @@ -154,14 +148,17 @@ def test_series_density(self): ts[2:-2] = nan sts = ts.to_sparse() density = sts.density # don't die - self.assertEqual(density, 4 / 10.0) + assert density == 4 / 10.0 def test_sparse_to_dense(self): arr, index = _test_data1() series = self.bseries.to_dense() tm.assert_series_equal(series, Series(arr, name='bseries')) - series = self.bseries.to_dense(sparse_only=True) + # see gh-14647 + with tm.assert_produces_warning(FutureWarning, + check_stacklevel=False): + series = self.bseries.to_dense(sparse_only=True) indexer = np.isfinite(arr) exp = Series(arr[indexer], index=index[indexer], name='bseries') @@ -206,12 +203,12 @@ def test_dense_to_sparse(self): iseries = series.to_sparse(kind='integer') tm.assert_sp_series_equal(bseries, self.bseries) tm.assert_sp_series_equal(iseries, self.iseries, check_names=False) - self.assertEqual(iseries.name, self.bseries.name) + assert iseries.name == self.bseries.name - self.assertEqual(len(series), len(bseries)) - self.assertEqual(len(series), len(iseries)) - self.assertEqual(series.shape, bseries.shape) - self.assertEqual(series.shape, iseries.shape) + assert len(series) == len(bseries) + assert len(series) == len(iseries) + assert series.shape == bseries.shape + assert series.shape == iseries.shape # non-NaN fill value series = self.zbseries.to_dense() @@ -219,26 +216,26 @@ def test_dense_to_sparse(self): ziseries = series.to_sparse(kind='integer', fill_value=0) tm.assert_sp_series_equal(zbseries, self.zbseries) tm.assert_sp_series_equal(ziseries, self.ziseries, check_names=False) - self.assertEqual(ziseries.name, self.zbseries.name) + assert ziseries.name == self.zbseries.name - self.assertEqual(len(series), len(zbseries)) - self.assertEqual(len(series), len(ziseries)) - self.assertEqual(series.shape, zbseries.shape) - self.assertEqual(series.shape, ziseries.shape) + assert len(series) == len(zbseries) + assert len(series) == len(ziseries) + assert series.shape == zbseries.shape + assert series.shape == ziseries.shape def test_to_dense_preserve_name(self): assert (self.bseries.name is not None) result = self.bseries.to_dense() - self.assertEqual(result.name, self.bseries.name) + assert result.name == self.bseries.name def test_constructor(self): # test setup guys - self.assertTrue(np.isnan(self.bseries.fill_value)) - tm.assertIsInstance(self.bseries.sp_index, BlockIndex) - self.assertTrue(np.isnan(self.iseries.fill_value)) - tm.assertIsInstance(self.iseries.sp_index, IntIndex) + assert np.isnan(self.bseries.fill_value) + assert isinstance(self.bseries.sp_index, BlockIndex) + assert np.isnan(self.iseries.fill_value) + assert isinstance(self.iseries.sp_index, IntIndex) - self.assertEqual(self.zbseries.fill_value, 0) + assert self.zbseries.fill_value == 0 tm.assert_numpy_array_equal(self.zbseries.values.values, self.bseries.to_dense().fillna(0).values) @@ -247,13 +244,13 @@ def _check_const(sparse, name): # use passed series name result = SparseSeries(sparse) 
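Aside: `test_sparse_to_dense` above now wraps the deprecated `sparse_only=True` call in `tm.assert_produces_warning`. A minimal sketch of that idiom follows; the warning-emitting function here is a hypothetical stand-in, not pandas API. `_check_const` resumes below.

```python
import warnings

import pandas.util.testing as tm


def deprecated_call():
    # Hypothetical stand-in for an API that now warns, such as
    # to_dense(sparse_only=True) after gh-14647.
    warnings.warn("sparse_only is deprecated", FutureWarning)
    return 42


# The block must emit exactly the expected warning class, otherwise the
# context manager raises.
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
    assert deprecated_call() == 42
```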
tm.assert_sp_series_equal(result, sparse) - self.assertEqual(sparse.name, name) - self.assertEqual(result.name, name) + assert sparse.name == name + assert result.name == name # use passed name result = SparseSeries(sparse, name='x') tm.assert_sp_series_equal(result, sparse, check_names=False) - self.assertEqual(result.name, 'x') + assert result.name == 'x' _check_const(self.bseries, 'bseries') _check_const(self.iseries, 'iseries') @@ -262,7 +259,7 @@ def _check_const(sparse, name): # Sparse time series works date_index = bdate_range('1/1/2000', periods=len(self.bseries)) s5 = SparseSeries(self.bseries, index=date_index) - tm.assertIsInstance(s5, SparseSeries) + assert isinstance(s5, SparseSeries) # pass Series bseries2 = SparseSeries(self.bseries.to_dense()) @@ -274,31 +271,31 @@ def _check_const(sparse, name): values = np.ones(self.bseries.npoints) sp = SparseSeries(values, sparse_index=self.bseries.sp_index) sp.sp_values[:5] = 97 - self.assertEqual(values[0], 97) + assert values[0] == 97 - self.assertEqual(len(sp), 20) - self.assertEqual(sp.shape, (20, )) + assert len(sp) == 20 + assert sp.shape == (20, ) # but can make it copy! sp = SparseSeries(values, sparse_index=self.bseries.sp_index, copy=True) sp.sp_values[:5] = 100 - self.assertEqual(values[0], 97) + assert values[0] == 97 - self.assertEqual(len(sp), 20) - self.assertEqual(sp.shape, (20, )) + assert len(sp) == 20 + assert sp.shape == (20, ) def test_constructor_scalar(self): data = 5 sp = SparseSeries(data, np.arange(100)) sp = sp.reindex(np.arange(200)) - self.assertTrue((sp.ix[:99] == data).all()) - self.assertTrue(isnull(sp.ix[100:]).all()) + assert (sp.loc[:99] == data).all() + assert isnull(sp.loc[100:]).all() data = np.nan sp = SparseSeries(data, np.arange(100)) - self.assertEqual(len(sp), 100) - self.assertEqual(sp.shape, (100, )) + assert len(sp) == 100 + assert sp.shape == (100, ) def test_constructor_ndarray(self): pass @@ -307,20 +304,20 @@ def test_constructor_nonnan(self): arr = [0, 0, 0, nan, nan] sp_series = SparseSeries(arr, fill_value=0) tm.assert_numpy_array_equal(sp_series.values.values, np.array(arr)) - self.assertEqual(len(sp_series), 5) - self.assertEqual(sp_series.shape, (5, )) + assert len(sp_series) == 5 + assert sp_series.shape == (5, ) - # GH 9272 def test_constructor_empty(self): + # see gh-9272 sp = SparseSeries() - self.assertEqual(len(sp.index), 0) - self.assertEqual(sp.shape, (0, )) + assert len(sp.index) == 0 + assert sp.shape == (0, ) def test_copy_astype(self): cop = self.bseries.astype(np.float64) - self.assertIsNot(cop, self.bseries) - self.assertIs(cop.sp_index, self.bseries.sp_index) - self.assertEqual(cop.dtype, np.float64) + assert cop is not self.bseries + assert cop.sp_index is self.bseries.sp_index + assert cop.dtype == np.float64 cop2 = self.iseries.copy() @@ -329,8 +326,8 @@ def test_copy_astype(self): # test that data is copied cop[:5] = 97 - self.assertEqual(cop.sp_values[0], 97) - self.assertNotEqual(self.bseries.sp_values[0], 97) + assert cop.sp_values[0] == 97 + assert self.bseries.sp_values[0] != 97 # correct fill value zbcop = self.zbseries.copy() @@ -342,22 +339,22 @@ def test_copy_astype(self): # no deep copy view = self.bseries.copy(deep=False) view.sp_values[:5] = 5 - self.assertTrue((self.bseries.sp_values[:5] == 5).all()) + assert (self.bseries.sp_values[:5] == 5).all() def test_shape(self): - # GH 10452 - self.assertEqual(self.bseries.shape, (20, )) - self.assertEqual(self.btseries.shape, (20, )) - self.assertEqual(self.iseries.shape, (20, )) + # see gh-10452 + assert 
self.bseries.shape == (20, ) + assert self.btseries.shape == (20, ) + assert self.iseries.shape == (20, ) - self.assertEqual(self.bseries2.shape, (15, )) - self.assertEqual(self.iseries2.shape, (15, )) + assert self.bseries2.shape == (15, ) + assert self.iseries2.shape == (15, ) - self.assertEqual(self.zbseries2.shape, (15, )) - self.assertEqual(self.ziseries2.shape, (15, )) + assert self.zbseries2.shape == (15, ) + assert self.ziseries2.shape == (15, ) def test_astype(self): - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): self.bseries.astype(np.int64) def test_astype_all(self): @@ -368,12 +365,12 @@ def test_astype_all(self): np.int32, np.int16, np.int8] for typ in types: res = s.astype(typ) - self.assertEqual(res.dtype, typ) + assert res.dtype == typ tm.assert_series_equal(res.to_dense(), orig.astype(typ)) def test_kind(self): - self.assertEqual(self.bseries.kind, 'block') - self.assertEqual(self.iseries.kind, 'integer') + assert self.bseries.kind == 'block' + assert self.iseries.kind == 'integer' def test_to_frame(self): # GH 9850 @@ -394,7 +391,7 @@ def test_to_frame(self): def test_pickle(self): def _test_roundtrip(series): - unpickled = self.round_trip_pickle(series) + unpickled = tm.round_trip_pickle(series) tm.assert_sp_series_equal(series, unpickled) tm.assert_series_equal(series.to_dense(), unpickled.to_dense()) @@ -429,16 +426,16 @@ def _check_getitem(sp, dense): _check_getitem(self.ziseries, self.ziseries.to_dense()) # exception handling - self.assertRaises(Exception, self.bseries.__getitem__, - len(self.bseries) + 1) + pytest.raises(Exception, self.bseries.__getitem__, + len(self.bseries) + 1) # index not contained - self.assertRaises(Exception, self.btseries.__getitem__, - self.btseries.index[-1] + BDay()) + pytest.raises(Exception, self.btseries.__getitem__, + self.btseries.index[-1] + BDay()) def test_get_get_value(self): tm.assert_almost_equal(self.bseries.get(10), self.bseries[10]) - self.assertIsNone(self.bseries.get(len(self.bseries) + 1)) + assert self.bseries.get(len(self.bseries) + 1) is None dt = self.btseries.index[10] result = self.btseries.get(dt) @@ -451,22 +448,22 @@ def test_set_value(self): idx = self.btseries.index[7] self.btseries.set_value(idx, 0) - self.assertEqual(self.btseries[idx], 0) + assert self.btseries[idx] == 0 self.iseries.set_value('foobar', 0) - self.assertEqual(self.iseries.index[-1], 'foobar') - self.assertEqual(self.iseries['foobar'], 0) + assert self.iseries.index[-1] == 'foobar' + assert self.iseries['foobar'] == 0 def test_getitem_slice(self): idx = self.bseries.index res = self.bseries[::2] - tm.assertIsInstance(res, SparseSeries) + assert isinstance(res, SparseSeries) expected = self.bseries.reindex(idx[::2]) tm.assert_sp_series_equal(res, expected) res = self.bseries[:5] - tm.assertIsInstance(res, SparseSeries) + assert isinstance(res, SparseSeries) tm.assert_sp_series_equal(res, self.bseries.reindex(idx[:5])) res = self.bseries[5:] @@ -483,7 +480,7 @@ def _compare_with_dense(sp): def _compare(idx): dense_result = dense.take(idx).values sparse_result = sp.take(idx) - self.assertIsInstance(sparse_result, SparseSeries) + assert isinstance(sparse_result, SparseSeries) tm.assert_almost_equal(dense_result, sparse_result.values.values) @@ -493,8 +490,8 @@ def _compare(idx): self._check_all(_compare_with_dense) - self.assertRaises(Exception, self.bseries.take, - [0, len(self.bseries) + 1]) + pytest.raises(Exception, self.bseries.take, + [0, len(self.bseries) + 1]) # Corner case sp = SparseSeries(np.ones(10) * nan) @@ 
-509,16 +506,16 @@ def test_numpy_take(self): np.take(sp.to_dense(), indices, axis=0)) msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.take, - sp, indices, out=np.empty(sp.shape)) + tm.assert_raises_regex(ValueError, msg, np.take, + sp, indices, out=np.empty(sp.shape)) msg = "the 'mode' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.take, - sp, indices, mode='clip') + tm.assert_raises_regex(ValueError, msg, np.take, + sp, indices, mode='clip') def test_setitem(self): self.bseries[5] = 7. - self.assertEqual(self.bseries[5], 7.) + assert self.bseries[5] == 7. def test_setslice(self): self.bseries[5:10] = 7. @@ -575,8 +572,8 @@ def check(a, b): def test_binary_operators(self): # skipping for now ##### - import nose - raise nose.SkipTest("skipping sparse binary operators test") + import pytest + pytest.skip("skipping sparse binary operators test") def _check_inplace_op(iop, op): tmp = self.bseries.copy() @@ -595,30 +592,30 @@ def test_abs(self): expected = SparseSeries([1, 2, 3], name='x') result = s.abs() tm.assert_sp_series_equal(result, expected) - self.assertEqual(result.name, 'x') + assert result.name == 'x' result = abs(s) tm.assert_sp_series_equal(result, expected) - self.assertEqual(result.name, 'x') + assert result.name == 'x' result = np.abs(s) tm.assert_sp_series_equal(result, expected) - self.assertEqual(result.name, 'x') + assert result.name == 'x' s = SparseSeries([1, -2, 2, -3], fill_value=-2, name='x') expected = SparseSeries([1, 2, 3], sparse_index=s.sp_index, fill_value=2, name='x') result = s.abs() tm.assert_sp_series_equal(result, expected) - self.assertEqual(result.name, 'x') + assert result.name == 'x' result = abs(s) tm.assert_sp_series_equal(result, expected) - self.assertEqual(result.name, 'x') + assert result.name == 'x' result = np.abs(s) tm.assert_sp_series_equal(result, expected) - self.assertEqual(result.name, 'x') + assert result.name == 'x' def test_reindex(self): def _compare_with_series(sps, new_index): @@ -643,7 +640,7 @@ def _compare_with_series(sps, new_index): # special cases same_index = self.bseries.reindex(self.bseries.index) tm.assert_sp_series_equal(self.bseries, same_index) - self.assertIsNot(same_index, self.bseries) + assert same_index is not self.bseries # corner cases sp = SparseSeries([], index=[]) @@ -654,7 +651,7 @@ def _compare_with_series(sps, new_index): # with copy=False reindexed = self.bseries.reindex(self.bseries.index, copy=True) reindexed.sp_values[:] = 1. - self.assertTrue((self.bseries.sp_values != 1.).all()) + assert (self.bseries.sp_values != 1.).all() reindexed = self.bseries.reindex(self.bseries.index, copy=False) reindexed.sp_values[:] = 1. 
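Aside: the `copy=True` contract asserted just above, distilled to a self-contained snippet (illustrative values; the test itself uses `self.bseries`).

```python
import numpy as np
import pandas as pd

# Reindexing onto the same index with copy=True detaches sp_values, so
# mutating the result leaves the original series untouched.
s = pd.SparseSeries([np.nan, 1.0, 2.0, 3.0])
detached = s.reindex(s.index, copy=True)
detached.sp_values[:] = -1.0
assert (s.sp_values != -1.0).all()
```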
@@ -667,7 +664,7 @@ def _check(values, index1, index2, fill_value): first_series = SparseSeries(values, sparse_index=index1, fill_value=fill_value) reindexed = first_series.sparse_reindex(index2) - self.assertIs(reindexed.sp_index, index2) + assert reindexed.sp_index is index2 int_indices1 = index1.to_int_index().indices int_indices2 = index2.to_int_index().indices @@ -706,8 +703,8 @@ def _check_all(values, first, second): first_series = SparseSeries(values1, sparse_index=IntIndex(length, index1), fill_value=nan) - with tm.assertRaisesRegexp(TypeError, - 'new index must be a SparseIndex'): + with tm.assert_raises_regex(TypeError, + 'new index must be a SparseIndex'): reindexed = first_series.sparse_reindex(0) # noqa def test_repr(self): @@ -732,7 +729,7 @@ def _compare_with_dense(obj, op): sparse_result = getattr(obj, op)() series = obj.to_dense() dense_result = getattr(series, op)() - self.assertEqual(sparse_result, dense_result) + assert sparse_result == dense_result to_compare = ['count', 'sum', 'mean', 'std', 'var', 'skew'] @@ -768,12 +765,12 @@ def test_dropna(self): expected = expected[expected != 0] exp_arr = pd.SparseArray(expected.values, fill_value=0, kind='block') tm.assert_sp_array_equal(sp_valid.values, exp_arr) - self.assert_index_equal(sp_valid.index, expected.index) - self.assertEqual(len(sp_valid.sp_values), 2) + tm.assert_index_equal(sp_valid.index, expected.index) + assert len(sp_valid.sp_values) == 2 result = self.bseries.dropna() expected = self.bseries.to_dense().dropna() - self.assertNotIsInstance(result, SparseSeries) + assert not isinstance(result, SparseSeries) tm.assert_series_equal(result, expected) def test_homogenize(self): @@ -800,7 +797,7 @@ def _check_matches(indices, expected): # must have NaN fill value data = {'a': SparseSeries(np.arange(7), sparse_index=expected2, fill_value=0)} - with tm.assertRaisesRegexp(TypeError, "NaN fill value"): + with tm.assert_raises_regex(TypeError, "NaN fill value"): spf.homogenize(data) def test_fill_value_corner(self): @@ -808,13 +805,13 @@ def test_fill_value_corner(self): cop.fill_value = 0 result = self.bseries / cop - self.assertTrue(np.isnan(result.fill_value)) + assert np.isnan(result.fill_value) cop2 = self.zbseries.copy() cop2.fill_value = 1 result = cop2 / cop # 1 / 0 is inf - self.assertTrue(np.isinf(result.fill_value)) + assert np.isinf(result.fill_value) def test_fill_value_when_combine_const(self): # GH12723 @@ -822,13 +819,13 @@ def test_fill_value_when_combine_const(self): exp = s.fillna(0).add(2) res = s.add(2, fill_value=0) - self.assert_series_equal(res, exp) + tm.assert_series_equal(res, exp) def test_shift(self): series = SparseSeries([nan, 1., 2., 3., nan, nan], index=np.arange(6)) shifted = series.shift(0) - self.assertIsNot(shifted, series) + assert shifted is not series tm.assert_sp_series_equal(shifted, series) f = lambda s: s.shift(1) @@ -937,14 +934,15 @@ def test_combine_first(self): tm.assert_sp_series_equal(result, expected) -class TestSparseHandlingMultiIndexes(tm.TestCase): - def setUp(self): +class TestSparseHandlingMultiIndexes(object): + + def setup_method(self, method): miindex = pd.MultiIndex.from_product( [["x", "y"], ["10", "20"]], names=['row-foo', 'row-bar']) micol = pd.MultiIndex.from_product( [['a', 'b', 'c'], ["1", "2"]], names=['col-foo', 'col-bar']) dense_multiindex_frame = pd.DataFrame( - index=miindex, columns=micol).sortlevel().sortlevel(axis=1) + index=miindex, columns=micol).sort_index().sort_index(axis=1) self.dense_multiindex_frame = 
dense_multiindex_frame.fillna(value=3.14) def test_to_sparse_preserve_multiindex_names_columns(self): @@ -962,10 +960,10 @@ def test_round_trip_preserve_multiindex_names(self): check_names=True) -class TestSparseSeriesScipyInteraction(tm.TestCase): +class TestSparseSeriesScipyInteraction(object): # Issue 8048: add SparseSeries coo methods - def setUp(self): + def setup_method(self, method): tm._skip_if_no_scipy() import scipy.sparse # SparseSeries inputs used in tests, the tests rely on the order @@ -1039,25 +1037,25 @@ def test_to_coo_text_names_text_row_levels_nosort(self): def test_to_coo_bad_partition_nonnull_intersection(self): ss = self.sparse_series[0] - self.assertRaises(ValueError, ss.to_coo, ['A', 'B', 'C'], ['C', 'D']) + pytest.raises(ValueError, ss.to_coo, ['A', 'B', 'C'], ['C', 'D']) def test_to_coo_bad_partition_small_union(self): ss = self.sparse_series[0] - self.assertRaises(ValueError, ss.to_coo, ['A'], ['C', 'D']) + pytest.raises(ValueError, ss.to_coo, ['A'], ['C', 'D']) def test_to_coo_nlevels_less_than_two(self): ss = self.sparse_series[0] ss.index = np.arange(len(ss.index)) - self.assertRaises(ValueError, ss.to_coo) + pytest.raises(ValueError, ss.to_coo) def test_to_coo_bad_ilevel(self): ss = self.sparse_series[0] - self.assertRaises(KeyError, ss.to_coo, ['A', 'B'], ['C', 'D', 'E']) + pytest.raises(KeyError, ss.to_coo, ['A', 'B'], ['C', 'D', 'E']) def test_to_coo_duplicate_index_entries(self): ss = pd.concat([self.sparse_series[0], self.sparse_series[0]]).to_sparse() - self.assertRaises(ValueError, ss.to_coo, ['A', 'B'], ['C', 'D']) + pytest.raises(ValueError, ss.to_coo, ['A', 'B'], ['C', 'D']) def test_from_coo_dense_index(self): ss = SparseSeries.from_coo(self.coo_matrices[0], dense_index=True) @@ -1099,8 +1097,8 @@ def _check_results_to_coo(self, results, check): # or compare directly as difference of sparse # assert(abs(A - A_result).max() < 1e-12) # max is failing in python # 2.6 - self.assertEqual(il, il_result) - self.assertEqual(jl, jl_result) + assert il == il_result + assert jl == jl_result def test_concat(self): val1 = np.array([1, 2, np.nan, np.nan, 0, np.nan]) @@ -1164,7 +1162,7 @@ def test_concat_axis1_different_fill(self): res = pd.concat([sparse1, sparse2], axis=1) exp = pd.concat([pd.Series(val1, name='x'), pd.Series(val2, name='y')], axis=1) - self.assertIsInstance(res, pd.SparseDataFrame) + assert isinstance(res, pd.SparseDataFrame) tm.assert_frame_equal(res.to_dense(), exp) def test_concat_different_kind(self): @@ -1283,7 +1281,7 @@ def test_isnull(self): s = pd.SparseSeries([np.nan, 0., 1., 2., 0.], name='xxx', fill_value=0.) res = s.isnull() - tm.assertIsInstance(res, pd.SparseSeries) + assert isinstance(res, pd.SparseSeries) exp = pd.Series([True, False, False, False, False], name='xxx') tm.assert_series_equal(res.to_dense(), exp) @@ -1300,7 +1298,7 @@ def test_isnotnull(self): s = pd.SparseSeries([np.nan, 0., 1., 2., 0.], name='xxx', fill_value=0.) 
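Aside: the `isnull` contract from the hunk above, runnable on its own (series, name, and expected mask copied from the test); `test_isnotnull` resumes below.

```python
import numpy as np
import pandas as pd
import pandas.util.testing as tm

# The null mask of a SparseSeries comes back sparse and densifies to the
# expected boolean Series.
s = pd.SparseSeries([np.nan, 0., 1., 2., 0.], name='xxx', fill_value=0.)
res = s.isnull()
assert isinstance(res, pd.SparseSeries)

exp = pd.Series([True, False, False, False, False], name='xxx')
tm.assert_series_equal(res.to_dense(), exp)
```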
res = s.isnotnull() - tm.assertIsInstance(res, pd.SparseSeries) + assert isinstance(res, pd.SparseSeries) exp = pd.Series([False, True, True, True, True], name='xxx') tm.assert_series_equal(res.to_dense(), exp) @@ -1312,9 +1310,9 @@ def _dense_series_compare(s, f): tm.assert_series_equal(result.to_dense(), dense_result) -class TestSparseSeriesAnalytics(tm.TestCase): +class TestSparseSeriesAnalytics(object): - def setUp(self): + def setup_method(self, method): arr, index = _test_data1() self.bseries = SparseSeries(arr, index=index, kind='block', name='bseries') @@ -1328,30 +1326,31 @@ def test_cumsum(self): expected = SparseSeries(self.bseries.to_dense().cumsum()) tm.assert_sp_series_equal(result, expected) - # TODO: gh-12855 - return a SparseSeries here result = self.zbseries.cumsum() expected = self.zbseries.to_dense().cumsum() - self.assertNotIsInstance(result, SparseSeries) tm.assert_series_equal(result, expected) + axis = 1 # Series is 1-D, so only axis = 0 is valid. + msg = "No axis named {axis}".format(axis=axis) + with tm.assert_raises_regex(ValueError, msg): + self.bseries.cumsum(axis=axis) + def test_numpy_cumsum(self): result = np.cumsum(self.bseries) expected = SparseSeries(self.bseries.to_dense().cumsum()) tm.assert_sp_series_equal(result, expected) - # TODO: gh-12855 - return a SparseSeries here result = np.cumsum(self.zbseries) expected = self.zbseries.to_dense().cumsum() - self.assertNotIsInstance(result, SparseSeries) tm.assert_series_equal(result, expected) msg = "the 'dtype' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.cumsum, - self.bseries, dtype=np.int64) + tm.assert_raises_regex(ValueError, msg, np.cumsum, + self.bseries, dtype=np.int64) msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.cumsum, - self.zbseries, out=result) + tm.assert_raises_regex(ValueError, msg, np.cumsum, + self.zbseries, out=result) def test_numpy_func_call(self): # no exception should be raised even though @@ -1362,9 +1361,3 @@ def test_numpy_func_call(self): for func in funcs: for series in ('bseries', 'zbseries'): getattr(np, func)(getattr(self, series)) - - -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/test_algos.py b/pandas/tests/test_algos.py index f89f41abd0d35..093730fb2478b 100644 --- a/pandas/tests/test_algos.py +++ b/pandas/tests/test_algos.py @@ -1,26 +1,29 @@ # -*- coding: utf-8 -*- -from pandas.compat import range import numpy as np +import pytest + from numpy.random import RandomState from numpy import nan from datetime import datetime from itertools import permutations -from pandas import Series, Categorical, CategoricalIndex, Index +from pandas import (Series, Categorical, CategoricalIndex, + Timestamp, DatetimeIndex, + Index, IntervalIndex) import pandas as pd from pandas import compat -import pandas.algos as _algos -from pandas.compat import lrange +from pandas._libs import (groupby as libgroupby, algos as libalgos, + hashtable) +from pandas._libs.hashtable import unique_label_indices +from pandas.compat import lrange, range import pandas.core.algorithms as algos import pandas.util.testing as tm -import pandas.hashtable as hashtable from pandas.compat.numpy import np_array_datetime64_compat from pandas.util.testing import assert_almost_equal -class TestMatch(tm.TestCase): - _multiprocess_can_split_ = True +class TestMatch(object): def test_ints(self): values = np.array([0, 2, 1]) @@ -28,16 +31,16 @@ def 
test_ints(self): result = algos.match(to_match, values) expected = np.array([0, 2, 1, 1, 0, 2, -1, 0], dtype=np.int64) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) result = Series(algos.match(to_match, values, np.nan)) expected = Series(np.array([0, 2, 1, 1, 0, 2, np.nan, 0])) tm.assert_series_equal(result, expected) - s = pd.Series(np.arange(5), dtype=np.float32) + s = Series(np.arange(5), dtype=np.float32) result = algos.match(s, [2, 4]) expected = np.array([-1, -1, 0, -1, 1], dtype=np.int64) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) result = Series(algos.match(s, [2, 4], np.nan)) expected = Series(np.array([np.nan, np.nan, 0, np.nan, 1])) @@ -49,15 +52,14 @@ def test_strings(self): result = algos.match(to_match, values) expected = np.array([1, 0, -1, 0, 1, 2, -1], dtype=np.int64) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) result = Series(algos.match(to_match, values, np.nan)) expected = Series(np.array([1, 0, np.nan, 0, 1, 2, np.nan])) tm.assert_series_equal(result, expected) -class TestSafeSort(tm.TestCase): - _multiprocess_can_split_ = True +class TestSafeSort(object): def test_basic_sort(self): values = [3, 1, 2, 0, 4] @@ -126,65 +128,65 @@ def test_unsortable(self): if compat.PY2 and not pd._np_version_under1p10: # RuntimeWarning: tp_compare didn't return -1 or -2 for exception with tm.assert_produces_warning(RuntimeWarning): - tm.assertRaises(TypeError, algos.safe_sort, arr) + pytest.raises(TypeError, algos.safe_sort, arr) else: - tm.assertRaises(TypeError, algos.safe_sort, arr) + pytest.raises(TypeError, algos.safe_sort, arr) def test_exceptions(self): - with tm.assertRaisesRegexp(TypeError, - "Only list-like objects are allowed"): + with tm.assert_raises_regex(TypeError, + "Only list-like objects are allowed"): algos.safe_sort(values=1) - with tm.assertRaisesRegexp(TypeError, - "Only list-like objects or None"): + with tm.assert_raises_regex(TypeError, + "Only list-like objects or None"): algos.safe_sort(values=[0, 1, 2], labels=1) - with tm.assertRaisesRegexp(ValueError, "values should be unique"): + with tm.assert_raises_regex(ValueError, + "values should be unique"): algos.safe_sort(values=[0, 1, 2, 1], labels=[0, 1]) -class TestFactorize(tm.TestCase): - _multiprocess_can_split_ = True +class TestFactorize(object): def test_basic(self): labels, uniques = algos.factorize(['a', 'b', 'b', 'a', 'a', 'c', 'c', 'c']) - self.assert_numpy_array_equal( + tm.assert_numpy_array_equal( uniques, np.array(['a', 'b', 'c'], dtype=object)) labels, uniques = algos.factorize(['a', 'b', 'b', 'a', 'a', 'c', 'c', 'c'], sort=True) exp = np.array([0, 1, 1, 0, 0, 2, 2, 2], dtype=np.intp) - self.assert_numpy_array_equal(labels, exp) + tm.assert_numpy_array_equal(labels, exp) exp = np.array(['a', 'b', 'c'], dtype=object) - self.assert_numpy_array_equal(uniques, exp) + tm.assert_numpy_array_equal(uniques, exp) labels, uniques = algos.factorize(list(reversed(range(5)))) exp = np.array([0, 1, 2, 3, 4], dtype=np.intp) - self.assert_numpy_array_equal(labels, exp) + tm.assert_numpy_array_equal(labels, exp) exp = np.array([4, 3, 2, 1, 0], dtype=np.int64) - self.assert_numpy_array_equal(uniques, exp) + tm.assert_numpy_array_equal(uniques, exp) labels, uniques = algos.factorize(list(reversed(range(5))), sort=True) exp = np.array([4, 3, 2, 1, 0], dtype=np.intp) - self.assert_numpy_array_equal(labels, exp) + tm.assert_numpy_array_equal(labels, exp) exp 
= np.array([0, 1, 2, 3, 4], dtype=np.int64) - self.assert_numpy_array_equal(uniques, exp) + tm.assert_numpy_array_equal(uniques, exp) labels, uniques = algos.factorize(list(reversed(np.arange(5.)))) exp = np.array([0, 1, 2, 3, 4], dtype=np.intp) - self.assert_numpy_array_equal(labels, exp) + tm.assert_numpy_array_equal(labels, exp) exp = np.array([4., 3., 2., 1., 0.], dtype=np.float64) - self.assert_numpy_array_equal(uniques, exp) + tm.assert_numpy_array_equal(uniques, exp) labels, uniques = algos.factorize(list(reversed(np.arange(5.))), sort=True) exp = np.array([4, 3, 2, 1, 0], dtype=np.intp) - self.assert_numpy_array_equal(labels, exp) + tm.assert_numpy_array_equal(labels, exp) exp = np.array([0., 1., 2., 3., 4.], dtype=np.float64) - self.assert_numpy_array_equal(uniques, exp) + tm.assert_numpy_array_equal(uniques, exp) def test_mixed(self): @@ -193,34 +195,34 @@ def test_mixed(self): labels, uniques = algos.factorize(x) exp = np.array([0, 0, -1, 1, 2, 3], dtype=np.intp) - self.assert_numpy_array_equal(labels, exp) + tm.assert_numpy_array_equal(labels, exp) exp = pd.Index(['A', 'B', 3.14, np.inf]) tm.assert_index_equal(uniques, exp) labels, uniques = algos.factorize(x, sort=True) exp = np.array([2, 2, -1, 3, 0, 1], dtype=np.intp) - self.assert_numpy_array_equal(labels, exp) + tm.assert_numpy_array_equal(labels, exp) exp = pd.Index([3.14, np.inf, 'A', 'B']) tm.assert_index_equal(uniques, exp) def test_datelike(self): # M8 - v1 = pd.Timestamp('20130101 09:00:00.00004') - v2 = pd.Timestamp('20130101') + v1 = Timestamp('20130101 09:00:00.00004') + v2 = Timestamp('20130101') x = Series([v1, v1, v1, v2, v2, v1]) labels, uniques = algos.factorize(x) exp = np.array([0, 0, 0, 1, 1, 0], dtype=np.intp) - self.assert_numpy_array_equal(labels, exp) - exp = pd.DatetimeIndex([v1, v2]) - self.assert_index_equal(uniques, exp) + tm.assert_numpy_array_equal(labels, exp) + exp = DatetimeIndex([v1, v2]) + tm.assert_index_equal(uniques, exp) labels, uniques = algos.factorize(x, sort=True) exp = np.array([1, 1, 1, 0, 0, 1], dtype=np.intp) - self.assert_numpy_array_equal(labels, exp) - exp = pd.DatetimeIndex([v2, v1]) - self.assert_index_equal(uniques, exp) + tm.assert_numpy_array_equal(labels, exp) + exp = DatetimeIndex([v2, v1]) + tm.assert_index_equal(uniques, exp) # period v1 = pd.Period('201302', freq='M') @@ -230,13 +232,13 @@ def test_datelike(self): # periods are not 'sorted' as they are converted back into an index labels, uniques = algos.factorize(x) exp = np.array([0, 0, 0, 1, 1, 0], dtype=np.intp) - self.assert_numpy_array_equal(labels, exp) - self.assert_index_equal(uniques, pd.PeriodIndex([v1, v2])) + tm.assert_numpy_array_equal(labels, exp) + tm.assert_index_equal(uniques, pd.PeriodIndex([v1, v2])) labels, uniques = algos.factorize(x, sort=True) exp = np.array([0, 0, 0, 1, 1, 0], dtype=np.intp) - self.assert_numpy_array_equal(labels, exp) - self.assert_index_equal(uniques, pd.PeriodIndex([v1, v2])) + tm.assert_numpy_array_equal(labels, exp) + tm.assert_index_equal(uniques, pd.PeriodIndex([v1, v2])) # GH 5986 v1 = pd.to_timedelta('1 day 1 min') @@ -244,13 +246,13 @@ def test_datelike(self): x = Series([v1, v2, v1, v1, v2, v2, v1]) labels, uniques = algos.factorize(x) exp = np.array([0, 1, 0, 0, 1, 1, 0], dtype=np.intp) - self.assert_numpy_array_equal(labels, exp) - self.assert_index_equal(uniques, pd.to_timedelta([v1, v2])) + tm.assert_numpy_array_equal(labels, exp) + tm.assert_index_equal(uniques, pd.to_timedelta([v1, v2])) labels, uniques = algos.factorize(x, sort=True) exp = np.array([1, 0, 1, 
1, 0, 0, 1], dtype=np.intp) - self.assert_numpy_array_equal(labels, exp) - self.assert_index_equal(uniques, pd.to_timedelta([v2, v1])) + tm.assert_numpy_array_equal(labels, exp) + tm.assert_index_equal(uniques, pd.to_timedelta([v2, v1])) def test_factorize_nan(self): # nan should map to na_sentinel, not reverse_indexer[na_sentinel] @@ -261,9 +263,9 @@ def test_factorize_nan(self): for na_sentinel in (-1, 20): ids = rizer.factorize(key, sort=True, na_sentinel=na_sentinel) expected = np.array([0, 1, 0, na_sentinel], dtype='int32') - self.assertEqual(len(set(key)), len(set(expected))) - self.assertTrue(np.array_equal( - pd.isnull(key), expected == na_sentinel)) + assert len(set(key)) == len(set(expected)) + tm.assert_numpy_array_equal(pd.isnull(key), + expected == na_sentinel) # nan still maps to na_sentinel when sort=False key = np.array([0, np.nan, 1], dtype='O') @@ -273,57 +275,50 @@ def test_factorize_nan(self): ids = rizer.factorize(key, sort=False, na_sentinel=na_sentinel) # noqa expected = np.array([2, -1, 0], dtype='int32') - self.assertEqual(len(set(key)), len(set(expected))) - self.assertTrue( - np.array_equal(pd.isnull(key), expected == na_sentinel)) - - def test_vector_resize(self): - # Test for memory errors after internal vector - # reallocations (pull request #7157) - - def _test_vector_resize(htable, uniques, dtype, nvals): - vals = np.array(np.random.randn(1000), dtype=dtype) - # get_labels appends to the vector - htable.get_labels(vals[:nvals], uniques, 0, -1) - # to_array resizes the vector - uniques.to_array() - htable.get_labels(vals, uniques, 0, -1) - - test_cases = [ - (hashtable.PyObjectHashTable, hashtable.ObjectVector, 'object'), - (hashtable.Float64HashTable, hashtable.Float64Vector, 'float64'), - (hashtable.Int64HashTable, hashtable.Int64Vector, 'int64')] - - for (tbl, vect, dtype) in test_cases: - # resizing to empty is a special case - _test_vector_resize(tbl(), vect(), dtype, 0) - _test_vector_resize(tbl(), vect(), dtype, 10) + assert len(set(key)) == len(set(expected)) + tm.assert_numpy_array_equal(pd.isnull(key), expected == na_sentinel) def test_complex_sorting(self): # gh 12666 - check no segfault # Test not valid numpy versions older than 1.11 if pd._np_version_under1p11: - self.skipTest("Test valid only for numpy 1.11+") + pytest.skip("Test valid only for numpy 1.11+") x17 = np.array([complex(i) for i in range(17)], dtype=object) - self.assertRaises(TypeError, algos.factorize, x17[::-1], sort=True) + pytest.raises(TypeError, algos.factorize, x17[::-1], sort=True) + + def test_uint64_factorize(self): + data = np.array([2**63, 1, 2**63], dtype=np.uint64) + exp_labels = np.array([0, 1, 0], dtype=np.intp) + exp_uniques = np.array([2**63, 1], dtype=np.uint64) + + labels, uniques = algos.factorize(data) + tm.assert_numpy_array_equal(labels, exp_labels) + tm.assert_numpy_array_equal(uniques, exp_uniques) + + data = np.array([2**63, -1, 2**63], dtype=object) + exp_labels = np.array([0, 1, 0], dtype=np.intp) + exp_uniques = np.array([2**63, -1], dtype=object) + labels, uniques = algos.factorize(data) + tm.assert_numpy_array_equal(labels, exp_labels) + tm.assert_numpy_array_equal(uniques, exp_uniques) -class TestUnique(tm.TestCase): - _multiprocess_can_split_ = True + +class TestUnique(object): def test_ints(self): arr = np.random.randint(0, 100, size=50) result = algos.unique(arr) - tm.assertIsInstance(result, np.ndarray) + assert isinstance(result, np.ndarray) def test_objects(self): arr = np.random.randint(0, 100, size=50).astype('O') result = algos.unique(arr) - 
tm.assertIsInstance(result, np.ndarray) + assert isinstance(result, np.ndarray) def test_object_refcount_bug(self): lst = ['A', 'B', 'C', 'D', 'E'] @@ -356,17 +351,17 @@ def test_datetime64_dtype_array_returned(self): '2015-01-01T00:00:00.000000000+0000']) result = algos.unique(dt_index) tm.assert_numpy_array_equal(result, expected) - self.assertEqual(result.dtype, expected.dtype) + assert result.dtype == expected.dtype - s = pd.Series(dt_index) + s = Series(dt_index) result = algos.unique(s) tm.assert_numpy_array_equal(result, expected) - self.assertEqual(result.dtype, expected.dtype) + assert result.dtype == expected.dtype arr = s.values result = algos.unique(arr) tm.assert_numpy_array_equal(result, expected) - self.assertEqual(result.dtype, expected.dtype) + assert result.dtype == expected.dtype def test_timedelta64_dtype_array_returned(self): # GH 9431 @@ -375,27 +370,146 @@ def test_timedelta64_dtype_array_returned(self): td_index = pd.to_timedelta([31200, 45678, 31200, 10000, 45678]) result = algos.unique(td_index) tm.assert_numpy_array_equal(result, expected) - self.assertEqual(result.dtype, expected.dtype) + assert result.dtype == expected.dtype - s = pd.Series(td_index) + s = Series(td_index) result = algos.unique(s) tm.assert_numpy_array_equal(result, expected) - self.assertEqual(result.dtype, expected.dtype) + assert result.dtype == expected.dtype arr = s.values result = algos.unique(arr) tm.assert_numpy_array_equal(result, expected) - self.assertEqual(result.dtype, expected.dtype) + assert result.dtype == expected.dtype + + def test_uint64_overflow(self): + s = Series([1, 2, 2**63, 2**63], dtype=np.uint64) + exp = np.array([1, 2, 2**63], dtype=np.uint64) + tm.assert_numpy_array_equal(algos.unique(s), exp) + + def test_nan_in_object_array(self): + l = ['a', np.nan, 'c', 'c'] + result = pd.unique(l) + expected = np.array(['a', np.nan, 'c'], dtype=object) + tm.assert_numpy_array_equal(result, expected) + + def test_categorical(self): + + # we are expecting to return in the order + # of appearance + expected = pd.Categorical(list('bac'), + categories=list('bac')) + + # we are expecting to return in the order + # of the categories + expected_o = pd.Categorical(list('bac'), + categories=list('abc'), + ordered=True) + + # GH 15939 + c = pd.Categorical(list('baabc')) + result = c.unique() + tm.assert_categorical_equal(result, expected) + + result = algos.unique(c) + tm.assert_categorical_equal(result, expected) + + c = pd.Categorical(list('baabc'), ordered=True) + result = c.unique() + tm.assert_categorical_equal(result, expected_o) + + result = algos.unique(c) + tm.assert_categorical_equal(result, expected_o) + + # Series of categorical dtype + s = Series(pd.Categorical(list('baabc')), name='foo') + result = s.unique() + tm.assert_categorical_equal(result, expected) + + result = pd.unique(s) + tm.assert_categorical_equal(result, expected) + + # CI -> return CI + ci = pd.CategoricalIndex(pd.Categorical(list('baabc'), + categories=list('bac'))) + expected = pd.CategoricalIndex(expected) + result = ci.unique() + tm.assert_index_equal(result, expected) + + result = pd.unique(ci) + tm.assert_index_equal(result, expected) + + def test_datetime64tz_aware(self): + # GH 15939 + + result = Series( + pd.Index([Timestamp('20160101', tz='US/Eastern'), + Timestamp('20160101', tz='US/Eastern')])).unique() + expected = np.array([Timestamp('2016-01-01 00:00:00-0500', + tz='US/Eastern')], dtype=object) + tm.assert_numpy_array_equal(result, expected) + + result = pd.Index([Timestamp('20160101', 
tz='US/Eastern'), + Timestamp('20160101', tz='US/Eastern')]).unique() + expected = DatetimeIndex(['2016-01-01 00:00:00'], + dtype='datetime64[ns, US/Eastern]', freq=None) + tm.assert_index_equal(result, expected) + + result = pd.unique( + Series(pd.Index([Timestamp('20160101', tz='US/Eastern'), + Timestamp('20160101', tz='US/Eastern')]))) + expected = np.array([Timestamp('2016-01-01 00:00:00-0500', + tz='US/Eastern')], dtype=object) + tm.assert_numpy_array_equal(result, expected) + + result = pd.unique(pd.Index([Timestamp('20160101', tz='US/Eastern'), + Timestamp('20160101', tz='US/Eastern')])) + expected = DatetimeIndex(['2016-01-01 00:00:00'], + dtype='datetime64[ns, US/Eastern]', freq=None) + tm.assert_index_equal(result, expected) + + def test_order_of_appearance(self): + # 9346 + # light testing of guarantee of order of appearance + # these also are the doc-examples + result = pd.unique(Series([2, 1, 3, 3])) + tm.assert_numpy_array_equal(result, + np.array([2, 1, 3], dtype='int64')) + + result = pd.unique(Series([2] + [1] * 5)) + tm.assert_numpy_array_equal(result, + np.array([2, 1], dtype='int64')) + + result = pd.unique(Series([Timestamp('20160101'), + Timestamp('20160101')])) + expected = np.array(['2016-01-01T00:00:00.000000000'], + dtype='datetime64[ns]') + tm.assert_numpy_array_equal(result, expected) + result = pd.unique(pd.Index( + [Timestamp('20160101', tz='US/Eastern'), + Timestamp('20160101', tz='US/Eastern')])) + expected = DatetimeIndex(['2016-01-01 00:00:00'], + dtype='datetime64[ns, US/Eastern]', + freq=None) + tm.assert_index_equal(result, expected) + + result = pd.unique(list('aabc')) + expected = np.array(['a', 'b', 'c'], dtype=object) + tm.assert_numpy_array_equal(result, expected) + + result = pd.unique(Series(pd.Categorical(list('aabc')))) + expected = pd.Categorical(list('abc')) + tm.assert_categorical_equal(result, expected) -class TestIsin(tm.TestCase): - _multiprocess_can_split_ = True + +class TestIsin(object): def test_invalid(self): - self.assertRaises(TypeError, lambda: algos.isin(1, 1)) - self.assertRaises(TypeError, lambda: algos.isin(1, [1])) - self.assertRaises(TypeError, lambda: algos.isin([1], 1)) + pytest.raises(TypeError, lambda: algos.isin(1, 1)) + pytest.raises(TypeError, lambda: algos.isin(1, [1])) + pytest.raises(TypeError, lambda: algos.isin([1], 1)) def test_basic(self): @@ -407,15 +521,15 @@ def test_basic(self): expected = np.array([True, False]) tm.assert_numpy_array_equal(result, expected) - result = algos.isin(pd.Series([1, 2]), [1]) + result = algos.isin(Series([1, 2]), [1]) expected = np.array([True, False]) tm.assert_numpy_array_equal(result, expected) - result = algos.isin(pd.Series([1, 2]), pd.Series([1])) + result = algos.isin(Series([1, 2]), Series([1])) expected = np.array([True, False]) tm.assert_numpy_array_equal(result, expected) - result = algos.isin(pd.Series([1, 2]), set([1])) + result = algos.isin(Series([1, 2]), set([1])) expected = np.array([True, False]) tm.assert_numpy_array_equal(result, expected) @@ -423,11 +537,11 @@ def test_basic(self): expected = np.array([True, False]) tm.assert_numpy_array_equal(result, expected) - result = algos.isin(pd.Series(['a', 'b']), pd.Series(['a'])) + result = algos.isin(Series(['a', 'b']), Series(['a'])) expected = np.array([True, False]) tm.assert_numpy_array_equal(result, expected) - result = algos.isin(pd.Series(['a', 'b']), set(['a'])) + result = algos.isin(Series(['a', 'b']), set(['a'])) expected = np.array([True, False]) tm.assert_numpy_array_equal(result, expected) @@ -435,6 
+549,8 @@ def test_basic(self): expected = np.array([False, False]) tm.assert_numpy_array_equal(result, expected) + def test_i8(self): + arr = pd.date_range('20130101', periods=3).values result = algos.isin(arr, [arr[0]]) expected = np.array([True, False, False]) @@ -471,47 +587,49 @@ def test_large(self): tm.assert_numpy_array_equal(result, expected) -class TestValueCounts(tm.TestCase): - _multiprocess_can_split_ = True +class TestValueCounts(object): def test_value_counts(self): np.random.seed(1234) - from pandas.tools.tile import cut + from pandas.core.reshape.tile import cut arr = np.random.randn(4) factor = cut(arr, 4) - tm.assertIsInstance(factor, Categorical) + assert isinstance(factor, Categorical) result = algos.value_counts(factor) - cats = ['(-1.194, -0.535]', '(-0.535, 0.121]', '(0.121, 0.777]', - '(0.777, 1.433]'] - expected_index = CategoricalIndex(cats, cats, ordered=True) - expected = Series([1, 1, 1, 1], index=expected_index) + breaks = [-1.194, -0.535, 0.121, 0.777, 1.433] + expected_index = pd.IntervalIndex.from_breaks( + breaks).astype('category') + expected = Series([1, 1, 1, 1], + index=expected_index) tm.assert_series_equal(result.sort_index(), expected.sort_index()) def test_value_counts_bins(self): s = [1, 2, 3, 4] result = algos.value_counts(s, bins=1) - self.assertEqual(result.tolist(), [4]) - self.assertEqual(result.index[0], 0.997) + expected = Series([4], + index=IntervalIndex.from_tuples([(0.996, 4.0)])) + tm.assert_series_equal(result, expected) result = algos.value_counts(s, bins=2, sort=False) - self.assertEqual(result.tolist(), [2, 2]) - self.assertEqual(result.index[0], 0.997) - self.assertEqual(result.index[1], 2.5) + expected = Series([2, 2], + index=IntervalIndex.from_tuples([(0.996, 2.5), + (2.5, 4.0)])) + tm.assert_series_equal(result, expected) def test_value_counts_dtypes(self): result = algos.value_counts([1, 1.]) - self.assertEqual(len(result), 1) + assert len(result) == 1 result = algos.value_counts([1, 1.], bins=1) - self.assertEqual(len(result), 1) + assert len(result) == 1 result = algos.value_counts(Series([1, 1., '1'])) # object - self.assertEqual(len(result), 2) + assert len(result) == 2 - self.assertRaises(TypeError, lambda s: algos.value_counts(s, bins=1), - ['1', 1]) + pytest.raises(TypeError, lambda s: algos.value_counts(s, bins=1), + ['1', 1]) def test_value_counts_nat(self): td = Series([np.timedelta64(10000), pd.NaT], dtype='timedelta64[ns]') @@ -520,36 +638,37 @@ def test_value_counts_nat(self): for s in [td, dt]: vc = algos.value_counts(s) vc_with_na = algos.value_counts(s, dropna=False) - self.assertEqual(len(vc), 1) - self.assertEqual(len(vc_with_na), 2) + assert len(vc) == 1 + assert len(vc_with_na) == 2 - exp_dt = pd.Series({pd.Timestamp('2014-01-01 00:00:00'): 1}) + exp_dt = Series({Timestamp('2014-01-01 00:00:00'): 1}) tm.assert_series_equal(algos.value_counts(dt), exp_dt) # TODO same for (timedelta) def test_value_counts_datetime_outofbounds(self): # GH 13663 - s = pd.Series([datetime(3000, 1, 1), datetime(5000, 1, 1), - datetime(5000, 1, 1), datetime(6000, 1, 1), - datetime(3000, 1, 1), datetime(3000, 1, 1)]) + s = Series([datetime(3000, 1, 1), datetime(5000, 1, 1), + datetime(5000, 1, 1), datetime(6000, 1, 1), + datetime(3000, 1, 1), datetime(3000, 1, 1)]) res = s.value_counts() exp_index = pd.Index([datetime(3000, 1, 1), datetime(5000, 1, 1), datetime(6000, 1, 1)], dtype=object) - exp = pd.Series([3, 2, 1], index=exp_index) + exp = Series([3, 2, 1], index=exp_index) tm.assert_series_equal(res, exp) # GH 12424 - res =
pd.to_datetime(pd.Series(['2362-01-01', np.nan]), + res = pd.to_datetime(Series(['2362-01-01', np.nan]), errors='ignore') - exp = pd.Series(['2362-01-01', np.nan], dtype=object) + exp = Series(['2362-01-01', np.nan], dtype=object) tm.assert_series_equal(res, exp) def test_categorical(self): s = Series(pd.Categorical(list('aaabbc'))) result = s.value_counts() - expected = pd.Series([3, 2, 1], - index=pd.CategoricalIndex(['a', 'b', 'c'])) + expected = Series([3, 2, 1], + index=pd.CategoricalIndex(['a', 'b', 'c'])) + tm.assert_series_equal(result, expected, check_index_type=True) # preserve order? @@ -562,13 +681,14 @@ def test_categorical_nans(self): s = Series(pd.Categorical(list('aaaaabbbcc'))) # 4,3,2,1 (nan) s.iloc[1] = np.nan result = s.value_counts() - expected = pd.Series([4, 3, 2], index=pd.CategoricalIndex( + expected = Series([4, 3, 2], index=pd.CategoricalIndex( + ['a', 'b', 'c'], categories=['a', 'b', 'c'])) tm.assert_series_equal(result, expected, check_index_type=True) result = s.value_counts(dropna=False) - expected = pd.Series([ + expected = Series([ 4, 3, 2, 1 - ], index=pd.CategoricalIndex(['a', 'b', 'c', np.nan])) + ], index=CategoricalIndex(['a', 'b', 'c', np.nan])) tm.assert_series_equal(result, expected, check_index_type=True) # out of order @@ -576,12 +696,12 @@ def test_categorical_nans(self): list('aaaaabbbcc'), ordered=True, categories=['b', 'a', 'c'])) s.iloc[1] = np.nan result = s.value_counts() - expected = pd.Series([4, 3, 2], index=pd.CategoricalIndex( + expected = Series([4, 3, 2], index=pd.CategoricalIndex( ['a', 'b', 'c'], categories=['b', 'a', 'c'], ordered=True)) tm.assert_series_equal(result, expected, check_index_type=True) result = s.value_counts(dropna=False) - expected = pd.Series([4, 3, 2, 1], index=pd.CategoricalIndex( + expected = Series([4, 3, 2, 1], index=pd.CategoricalIndex( ['a', 'b', 'c', np.nan], categories=['b', 'a', 'c'], ordered=True)) tm.assert_series_equal(result, expected, check_index_type=True) @@ -598,34 +718,34 @@ def test_dropna(self): # https://github.com/pandas-dev/pandas/issues/9443#issuecomment-73719328 tm.assert_series_equal( - pd.Series([True, True, False]).value_counts(dropna=True), - pd.Series([2, 1], index=[True, False])) + Series([True, True, False]).value_counts(dropna=True), + Series([2, 1], index=[True, False])) tm.assert_series_equal( - pd.Series([True, True, False]).value_counts(dropna=False), - pd.Series([2, 1], index=[True, False])) + Series([True, True, False]).value_counts(dropna=False), + Series([2, 1], index=[True, False])) tm.assert_series_equal( - pd.Series([True, True, False, None]).value_counts(dropna=True), - pd.Series([2, 1], index=[True, False])) + Series([True, True, False, None]).value_counts(dropna=True), + Series([2, 1], index=[True, False])) tm.assert_series_equal( - pd.Series([True, True, False, None]).value_counts(dropna=False), - pd.Series([2, 1, 1], index=[True, False, np.nan])) + Series([True, True, False, None]).value_counts(dropna=False), + Series([2, 1, 1], index=[True, False, np.nan])) tm.assert_series_equal( - pd.Series([10.3, 5., 5.]).value_counts(dropna=True), - pd.Series([2, 1], index=[5., 10.3])) + Series([10.3, 5., 5.]).value_counts(dropna=True), + Series([2, 1], index=[5., 10.3])) tm.assert_series_equal( - pd.Series([10.3, 5., 5.]).value_counts(dropna=False), - pd.Series([2, 1], index=[5., 10.3])) + Series([10.3, 5., 5.]).value_counts(dropna=False), + Series([2, 1], index=[5., 10.3])) tm.assert_series_equal( - pd.Series([10.3, 5., 5., None]).value_counts(dropna=True), - 
pd.Series([2, 1], index=[5., 10.3])) + Series([10.3, 5., 5., None]).value_counts(dropna=True), + Series([2, 1], index=[5., 10.3])) # 32-bit linux has a different ordering if not compat.is_platform_32bit(): - tm.assert_series_equal( - pd.Series([10.3, 5., 5., None]).value_counts(dropna=False), - pd.Series([2, 1, 1], index=[5., 10.3, np.nan])) + result = Series([10.3, 5., 5., None]).value_counts(dropna=False) + expected = Series([2, 1, 1], index=[5., 10.3, np.nan]) + tm.assert_series_equal(result, expected) def test_value_counts_normalized(self): # GH12558 @@ -643,10 +763,23 @@ def test_value_counts_normalized(self): index=Series([2.0, 1.0], dtype=t)) tm.assert_series_equal(result, expected) + def test_value_counts_uint64(self): + arr = np.array([2**63], dtype=np.uint64) + expected = Series([1], index=[2**63]) + result = algos.value_counts(arr) + + tm.assert_series_equal(result, expected) + + arr = np.array([-1, 2**63], dtype=object) + expected = Series([1, 1], index=[-1, 2**63]) + result = algos.value_counts(arr) + + # 32-bit linux has a different ordering + if not compat.is_platform_32bit(): + tm.assert_series_equal(result, expected) -class TestDuplicated(tm.TestCase): - _multiprocess_can_split_ = True +class TestDuplicated(object): def test_duplicated_with_nas(self): keys = np.array([0, 1, np.nan, 0, 2, np.nan], dtype=object) @@ -694,7 +827,9 @@ def test_numeric_object_likes(self): np.array([1 + 1j, 2 + 2j, 1 + 1j, 5 + 5j, 3 + 3j, 2 + 2j, 4 + 4j, 1 + 1j, 5 + 5j, 6 + 6j]), np.array(['a', 'b', 'a', 'e', 'c', - 'b', 'd', 'a', 'e', 'f'], dtype=object)] + 'b', 'd', 'a', 'e', 'f'], dtype=object), + np.array([1, 2**63, 1, 3**5, 10, + 2**63, 39, 1, 3**5, 7], dtype=np.uint64)] exp_first = np.array([False, False, True, False, False, True, False, True, True, False]) @@ -724,15 +859,15 @@ def test_numeric_object_likes(self): tm.assert_numpy_array_equal(res_false, exp_false) # series - for s in [pd.Series(case), pd.Series(case, dtype='category')]: + for s in [Series(case), Series(case, dtype='category')]: res_first = s.duplicated(keep='first') - tm.assert_series_equal(res_first, pd.Series(exp_first)) + tm.assert_series_equal(res_first, Series(exp_first)) res_last = s.duplicated(keep='last') - tm.assert_series_equal(res_last, pd.Series(exp_last)) + tm.assert_series_equal(res_last, Series(exp_last)) res_false = s.duplicated(keep=False) - tm.assert_series_equal(res_false, pd.Series(exp_false)) + tm.assert_series_equal(res_false, Series(exp_false)) def test_datetime_likes(self): @@ -741,8 +876,8 @@ def test_datetime_likes(self): td = ['1 days', '2 days', '1 days', 'NaT', '3 days', '2 days', '4 days', '1 days', 'NaT', '6 days'] - cases = [np.array([pd.Timestamp(d) for d in dt]), - np.array([pd.Timestamp(d, tz='US/Eastern') for d in dt]), + cases = [np.array([Timestamp(d) for d in dt]), + np.array([Timestamp(d, tz='US/Eastern') for d in dt]), np.array([pd.Period(d, freq='D') for d in dt]), np.array([np.datetime64(d) for d in dt]), np.array([pd.Timedelta(d) for d in td])] @@ -776,21 +911,21 @@ def test_datetime_likes(self): tm.assert_numpy_array_equal(res_false, exp_false) # series - for s in [pd.Series(case), pd.Series(case, dtype='category'), - pd.Series(case, dtype=object)]: + for s in [Series(case), Series(case, dtype='category'), + Series(case, dtype=object)]: res_first = s.duplicated(keep='first') - tm.assert_series_equal(res_first, pd.Series(exp_first)) + tm.assert_series_equal(res_first, Series(exp_first)) res_last = s.duplicated(keep='last') - tm.assert_series_equal(res_last, pd.Series(exp_last)) + 
tm.assert_series_equal(res_last, Series(exp_last)) res_false = s.duplicated(keep=False) - tm.assert_series_equal(res_false, pd.Series(exp_false)) + tm.assert_series_equal(res_false, Series(exp_false)) def test_unique_index(self): cases = [pd.Index([1, 2, 3]), pd.RangeIndex(0, 3)] for case in cases: - self.assertTrue(case.is_unique) + assert case.is_unique tm.assert_numpy_array_equal(case.duplicated(), np.array([False, False, False])) @@ -811,7 +946,7 @@ def test_group_var_generic_1d(self): expected_counts = counts + 3 self.algo(out, counts, values, labels) - self.assertTrue(np.allclose(out, expected_out, self.rtol)) + assert np.allclose(out, expected_out, self.rtol) tm.assert_numpy_array_equal(counts, expected_counts) def test_group_var_generic_1d_flat_labels(self): @@ -827,7 +962,7 @@ def test_group_var_generic_1d_flat_labels(self): self.algo(out, counts, values, labels) - self.assertTrue(np.allclose(out, expected_out, self.rtol)) + assert np.allclose(out, expected_out, self.rtol) tm.assert_numpy_array_equal(counts, expected_counts) def test_group_var_generic_2d_all_finite(self): @@ -842,7 +977,7 @@ def test_group_var_generic_2d_all_finite(self): expected_counts = counts + 2 self.algo(out, counts, values, labels) - self.assertTrue(np.allclose(out, expected_out, self.rtol)) + assert np.allclose(out, expected_out, self.rtol) tm.assert_numpy_array_equal(counts, expected_counts) def test_group_var_generic_2d_some_nan(self): @@ -874,16 +1009,15 @@ def test_group_var_constant(self): self.algo(out, counts, values, labels) - self.assertEqual(counts[0], 3) - self.assertTrue(out[0, 0] >= 0) + assert counts[0] == 3 + assert out[0, 0] >= 0 tm.assert_almost_equal(out[0, 0], 0.0) -class TestGroupVarFloat64(tm.TestCase, GroupVarTestMixin): +class TestGroupVarFloat64(GroupVarTestMixin): __test__ = True - _multiprocess_can_split_ = True - algo = algos.algos.group_var_float64 + algo = libgroupby.group_var_float64 dtype = np.float64 rtol = 1e-5 @@ -899,19 +1033,64 @@ def test_group_var_large_inputs(self): self.algo(out, counts, values, labels) - self.assertEqual(counts[0], 10 ** 6) + assert counts[0] == 10 ** 6 tm.assert_almost_equal(out[0, 0], 1.0 / 12, check_less_precise=True) -class TestGroupVarFloat32(tm.TestCase, GroupVarTestMixin): +class TestGroupVarFloat32(GroupVarTestMixin): __test__ = True - _multiprocess_can_split_ = True - algo = algos.algos.group_var_float32 + algo = libgroupby.group_var_float32 dtype = np.float32 rtol = 1e-2 +class TestHashTable(object): + + def test_lookup_nan(self): + xs = np.array([2.718, 3.14, np.nan, -7, 5, 2, 3]) + m = hashtable.Float64HashTable() + m.map_locations(xs) + tm.assert_numpy_array_equal(m.lookup(xs), np.arange(len(xs), + dtype=np.int64)) + + def test_lookup_overflow(self): + xs = np.array([1, 2, 2**63], dtype=np.uint64) + m = hashtable.UInt64HashTable() + m.map_locations(xs) + tm.assert_numpy_array_equal(m.lookup(xs), np.arange(len(xs), + dtype=np.int64)) + + def test_get_unique(self): + s = Series([1, 2, 2**63, 2**63], dtype=np.uint64) + exp = np.array([1, 2, 2**63], dtype=np.uint64) + tm.assert_numpy_array_equal(s.unique(), exp) + + def test_vector_resize(self): + # Test for memory errors after internal vector + # reallocations (pull request #7157) + + def _test_vector_resize(htable, uniques, dtype, nvals): + vals = np.array(np.random.randn(1000), dtype=dtype) + # get_labels appends to the vector + htable.get_labels(vals[:nvals], uniques, 0, -1) + # to_array resizes the vector + uniques.to_array() + htable.get_labels(vals, uniques, 0, -1) + + test_cases = [ 
+ (hashtable.PyObjectHashTable, hashtable.ObjectVector, 'object'), + (hashtable.StringHashTable, hashtable.ObjectVector, 'object'), + (hashtable.Float64HashTable, hashtable.Float64Vector, 'float64'), + (hashtable.Int64HashTable, hashtable.Int64Vector, 'int64'), + (hashtable.UInt64HashTable, hashtable.UInt64Vector, 'uint64')] + + for (tbl, vect, dtype) in test_cases: + # resizing to empty is a special case + _test_vector_resize(tbl(), vect(), dtype, 0) + _test_vector_resize(tbl(), vect(), dtype, 10) + + def test_quantile(): s = Series(np.random.randn(100)) @@ -921,7 +1100,6 @@ def test_quantile(): def test_unique_label_indices(): - from pandas.hashtable import unique_label_indices a = np.random.randint(1, 1 << 10, 1 << 15).astype('i8') @@ -938,21 +1116,44 @@ def test_unique_label_indices(): check_dtype=False) -def test_rank(): - tm._skip_if_no_scipy() - from scipy.stats import rankdata +class TestRank(object): + + def test_scipy_compat(self): + tm._skip_if_no_scipy() + from scipy.stats import rankdata + + def _check(arr): + mask = ~np.isfinite(arr) + arr = arr.copy() + result = libalgos.rank_1d_float64(arr) + arr[mask] = np.inf + exp = rankdata(arr) + exp[mask] = nan + assert_almost_equal(result, exp) + + _check(np.array([nan, nan, 5., 5., 5., nan, 1, 2, 3, nan])) + _check(np.array([4., nan, 5., 5., 5., nan, 1, 2, 4., nan])) + + def test_basic(self): + exp = np.array([1, 2], dtype=np.float64) - def _check(arr): - mask = ~np.isfinite(arr) - arr = arr.copy() - result = _algos.rank_1d_float64(arr) - arr[mask] = np.inf - exp = rankdata(arr) - exp[mask] = nan - assert_almost_equal(result, exp) + for dtype in np.typecodes['AllInteger']: + s = Series([1, 100], dtype=dtype) + tm.assert_numpy_array_equal(algos.rank(s), exp) - _check(np.array([nan, nan, 5., 5., 5., nan, 1, 2, 3, nan])) - _check(np.array([4., nan, 5., 5., 5., nan, 1, 2, 4., nan])) + def test_uint64_overflow(self): + exp = np.array([1, 2], dtype=np.float64) + + for dtype in [np.float64, np.uint64]: + s = Series([1, 2**63], dtype=dtype) + tm.assert_numpy_array_equal(algos.rank(s), exp) + + def test_too_many_ndims(self): + arr = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]) + msg = "Array with ndim > 2 are not supported" + + with tm.assert_raises_regex(TypeError, msg): + algos.rank(arr) def test_pad_backfill_object_segfault(): @@ -960,31 +1161,30 @@ def test_pad_backfill_object_segfault(): old = np.array([], dtype='O') new = np.array([datetime(2010, 12, 31)], dtype='O') - result = _algos.pad_object(old, new) + result = libalgos.pad_object(old, new) expected = np.array([-1], dtype=np.int64) assert (np.array_equal(result, expected)) - result = _algos.pad_object(new, old) + result = libalgos.pad_object(new, old) expected = np.array([], dtype=np.int64) assert (np.array_equal(result, expected)) - result = _algos.backfill_object(old, new) + result = libalgos.backfill_object(old, new) expected = np.array([-1], dtype=np.int64) assert (np.array_equal(result, expected)) - result = _algos.backfill_object(new, old) + result = libalgos.backfill_object(new, old) expected = np.array([], dtype=np.int64) assert (np.array_equal(result, expected)) def test_arrmap(): values = np.array(['foo', 'foo', 'bar', 'bar', 'baz', 'qux'], dtype='O') - result = _algos.arrmap_object(values, lambda x: x in ['foo', 'bar']) + result = libalgos.arrmap_object(values, lambda x: x in ['foo', 'bar']) assert (result.dtype == np.bool_) -class TestTseriesUtil(tm.TestCase): - _multiprocess_can_split_ = True +class TestTseriesUtil(object): def test_combineFunc(self): pass @@ 
-1005,36 +1205,36 @@ def test_backfill(self): old = Index([1, 5, 10]) new = Index(lrange(12)) - filler = _algos.backfill_int64(old.values, new.values) + filler = libalgos.backfill_int64(old.values, new.values) expect_filler = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, -1], dtype=np.int64) - self.assert_numpy_array_equal(filler, expect_filler) + tm.assert_numpy_array_equal(filler, expect_filler) # corner case old = Index([1, 4]) new = Index(lrange(5, 10)) - filler = _algos.backfill_int64(old.values, new.values) + filler = libalgos.backfill_int64(old.values, new.values) expect_filler = np.array([-1, -1, -1, -1, -1], dtype=np.int64) - self.assert_numpy_array_equal(filler, expect_filler) + tm.assert_numpy_array_equal(filler, expect_filler) def test_pad(self): old = Index([1, 5, 10]) new = Index(lrange(12)) - filler = _algos.pad_int64(old.values, new.values) + filler = libalgos.pad_int64(old.values, new.values) expect_filler = np.array([-1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2], dtype=np.int64) - self.assert_numpy_array_equal(filler, expect_filler) + tm.assert_numpy_array_equal(filler, expect_filler) # corner case old = Index([5, 10]) new = Index(lrange(5)) - filler = _algos.pad_int64(old.values, new.values) + filler = libalgos.pad_int64(old.values, new.values) expect_filler = np.array([-1, -1, -1, -1, -1], dtype=np.int64) - self.assert_numpy_array_equal(filler, expect_filler) + tm.assert_numpy_array_equal(filler, expect_filler) def test_is_lexsorted(): @@ -1064,7 +1264,7 @@ def test_is_lexsorted(): 6, 5, 4, 3, 2, 1, 0])] - assert (not _algos.is_lexsorted(failure)) + assert (not libalgos.is_lexsorted(failure)) # def test_get_group_index(): # a = np.array([0, 1, 2, 0, 2, 1, 0, 0], dtype=np.int64) @@ -1080,7 +1280,7 @@ def test_groupsort_indexer(): a = np.random.randint(0, 1000, 100).astype(np.int64) b = np.random.randint(0, 1000, 100).astype(np.int64) - result = _algos.groupsort_indexer(a, 1000)[0] + result = libalgos.groupsort_indexer(a, 1000)[0] # need to use a stable sort expected = np.argsort(a, kind='mergesort') @@ -1088,7 +1288,7 @@ def test_groupsort_indexer(): # compare with lexsort key = a * 1000 + b - result = _algos.groupsort_indexer(key, 1000000)[0] + result = libalgos.groupsort_indexer(key, 1000000)[0] expected = np.lexsort((b, a)) assert (np.array_equal(result, expected)) @@ -1099,8 +1299,8 @@ def test_infinity_sort(): # itself. Instead, let's give our infinities a self-consistent # ordering, but outside the float extended real line. 
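The ordering contract described in the comment above is easiest to see in isolation: `Infinity()` compares above every float (even `float("inf")`) and `NegInfinity()` below every float, which is what makes `sorted` well-defined in the test that follows. A minimal sketch of that property, assuming `libalgos` resolves to the `pandas._libs.algos` extension module implied by the `_algos` -> `libalgos` rename applied throughout this patch:

```python
# Minimal sketch, not part of the patch; the import path is an assumption
# based on the _algos -> libalgos rename applied throughout this file.
from pandas._libs import algos as libalgos

Inf, NegInf = libalgos.Infinity(), libalgos.NegInfinity()

# NegInfinity() sorts below float("-inf") and Infinity() above float("inf"),
# so every permutation of these values sorts back to the same reference list.
ref = [NegInf, float("-inf"), 0.0, float("inf"), Inf]
assert sorted([Inf, float("inf"), 0.0, float("-inf"), NegInf]) == ref
```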
- Inf = _algos.Infinity() - NegInf = _algos.NegInfinity() + Inf = libalgos.Infinity() + NegInf = libalgos.NegInfinity() ref_nums = [NegInf, float("-inf"), -1e100, 0, 1e100, float("inf"), Inf] @@ -1118,18 +1318,195 @@ def test_infinity_sort(): assert sorted(perm) == ref_nums # smoke tests - np.array([_algos.Infinity()] * 32).argsort() - np.array([_algos.NegInfinity()] * 32).argsort() + np.array([libalgos.Infinity()] * 32).argsort() + np.array([libalgos.NegInfinity()] * 32).argsort() def test_ensure_platform_int(): arr = np.arange(100, dtype=np.intp) - result = _algos.ensure_platform_int(arr) + result = libalgos.ensure_platform_int(arr) assert (result is arr) -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) +def test_int64_add_overflow(): + # see gh-14068 + msg = "Overflow in int64 addition" + m = np.iinfo(np.int64).max + n = np.iinfo(np.int64).min + + with tm.assert_raises_regex(OverflowError, msg): + algos.checked_add_with_arr(np.array([m, m]), m) + with tm.assert_raises_regex(OverflowError, msg): + algos.checked_add_with_arr(np.array([m, m]), np.array([m, m])) + with tm.assert_raises_regex(OverflowError, msg): + algos.checked_add_with_arr(np.array([n, n]), n) + with tm.assert_raises_regex(OverflowError, msg): + algos.checked_add_with_arr(np.array([n, n]), np.array([n, n])) + with tm.assert_raises_regex(OverflowError, msg): + algos.checked_add_with_arr(np.array([m, n]), np.array([n, n])) + with tm.assert_raises_regex(OverflowError, msg): + algos.checked_add_with_arr(np.array([m, m]), np.array([m, m]), + arr_mask=np.array([False, True])) + with tm.assert_raises_regex(OverflowError, msg): + algos.checked_add_with_arr(np.array([m, m]), np.array([m, m]), + b_mask=np.array([False, True])) + with tm.assert_raises_regex(OverflowError, msg): + algos.checked_add_with_arr(np.array([m, m]), np.array([m, m]), + arr_mask=np.array([False, True]), + b_mask=np.array([False, True])) + with tm.assert_raises_regex(OverflowError, msg): + with tm.assert_produces_warning(RuntimeWarning): + algos.checked_add_with_arr(np.array([m, m]), + np.array([np.nan, m])) + + # Check that the nan boolean arrays override whether or not + # the addition overflows. We don't check the result but just + # the fact that an OverflowError is not raised. 
+ with pytest.raises(AssertionError): + with tm.assert_raises_regex(OverflowError, msg): + algos.checked_add_with_arr(np.array([m, m]), np.array([m, m]), + arr_mask=np.array([True, True])) + with pytest.raises(AssertionError): + with tm.assert_raises_regex(OverflowError, msg): + algos.checked_add_with_arr(np.array([m, m]), np.array([m, m]), + b_mask=np.array([True, True])) + with pytest.raises(AssertionError): + with tm.assert_raises_regex(OverflowError, msg): + algos.checked_add_with_arr(np.array([m, m]), np.array([m, m]), + arr_mask=np.array([True, False]), + b_mask=np.array([False, True])) + + +class TestMode(object): + + def test_no_mode(self): + exp = Series([], dtype=np.float64) + tm.assert_series_equal(algos.mode([]), exp) + + def test_mode_single(self): + # GH 15714 + exp_single = [1] + data_single = [1] + + exp_multi = [1] + data_multi = [1, 1] + + for dt in np.typecodes['AllInteger'] + np.typecodes['Float']: + s = Series(data_single, dtype=dt) + exp = Series(exp_single, dtype=dt) + tm.assert_series_equal(algos.mode(s), exp) + + s = Series(data_multi, dtype=dt) + exp = Series(exp_multi, dtype=dt) + tm.assert_series_equal(algos.mode(s), exp) + + exp = Series([1], dtype=np.int) + tm.assert_series_equal(algos.mode([1]), exp) + + exp = Series(['a', 'b', 'c'], dtype=np.object) + tm.assert_series_equal(algos.mode(['a', 'b', 'c']), exp) + + def test_number_mode(self): + exp_single = [1] + data_single = [1] * 5 + [2] * 3 + + exp_multi = [1, 3] + data_multi = [1] * 5 + [2] * 3 + [3] * 5 + + for dt in np.typecodes['AllInteger'] + np.typecodes['Float']: + s = Series(data_single, dtype=dt) + exp = Series(exp_single, dtype=dt) + tm.assert_series_equal(algos.mode(s), exp) + + s = Series(data_multi, dtype=dt) + exp = Series(exp_multi, dtype=dt) + tm.assert_series_equal(algos.mode(s), exp) + + def test_strobj_mode(self): + exp = ['b'] + data = ['a'] * 2 + ['b'] * 3 + + s = Series(data, dtype='c') + exp = Series(exp, dtype='c') + tm.assert_series_equal(algos.mode(s), exp) + + exp = ['bar'] + data = ['foo'] * 2 + ['bar'] * 3 + + for dt in [str, object]: + s = Series(data, dtype=dt) + exp = Series(exp, dtype=dt) + tm.assert_series_equal(algos.mode(s), exp) + + def test_datelike_mode(self): + exp = Series(['1900-05-03', '2011-01-03', + '2013-01-02'], dtype="M8[ns]") + s = Series(['2011-01-03', '2013-01-02', + '1900-05-03'], dtype='M8[ns]') + tm.assert_series_equal(algos.mode(s), exp) + + exp = Series(['2011-01-03', '2013-01-02'], dtype='M8[ns]') + s = Series(['2011-01-03', '2013-01-02', '1900-05-03', + '2011-01-03', '2013-01-02'], dtype='M8[ns]') + tm.assert_series_equal(algos.mode(s), exp) + + def test_timedelta_mode(self): + exp = Series(['-1 days', '0 days', '1 days'], + dtype='timedelta64[ns]') + s = Series(['1 days', '-1 days', '0 days'], + dtype='timedelta64[ns]') + tm.assert_series_equal(algos.mode(s), exp) + + exp = Series(['2 min', '1 day'], dtype='timedelta64[ns]') + s = Series(['1 day', '1 day', '-1 day', '-1 day 2 min', + '2 min', '2 min'], dtype='timedelta64[ns]') + tm.assert_series_equal(algos.mode(s), exp) + + def test_mixed_dtype(self): + exp = Series(['foo']) + s = Series([1, 'foo', 'foo']) + tm.assert_series_equal(algos.mode(s), exp) + + def test_uint64_overflow(self): + exp = Series([2**63], dtype=np.uint64) + s = Series([1, 2**63, 2**63], dtype=np.uint64) + tm.assert_series_equal(algos.mode(s), exp) + + exp = Series([1, 2**63], dtype=np.uint64) + s = Series([1, 2**63], dtype=np.uint64) + tm.assert_series_equal(algos.mode(s), exp) + + def test_categorical(self): + c = 
Categorical([1, 2]) + exp = c + tm.assert_categorical_equal(algos.mode(c), exp) + tm.assert_categorical_equal(c.mode(), exp) + + c = Categorical([1, 'a', 'a']) + exp = Categorical(['a'], categories=[1, 'a']) + tm.assert_categorical_equal(algos.mode(c), exp) + tm.assert_categorical_equal(c.mode(), exp) + + c = Categorical([1, 1, 2, 3, 3]) + exp = Categorical([1, 3], categories=[1, 2, 3]) + tm.assert_categorical_equal(algos.mode(c), exp) + tm.assert_categorical_equal(c.mode(), exp) + + def test_index(self): + idx = Index([1, 2, 3]) + exp = Series([1, 2, 3], dtype=np.int64) + tm.assert_series_equal(algos.mode(idx), exp) + + idx = Index([1, 'a', 'a']) + exp = Series(['a'], dtype=object) + tm.assert_series_equal(algos.mode(idx), exp) + + idx = Index([1, 1, 2, 3, 3]) + exp = Series([1, 3], dtype=np.int64) + tm.assert_series_equal(algos.mode(idx), exp) + + exp = Series(['2 min', '1 day'], dtype='timedelta64[ns]') + idx = Index(['1 day', '1 day', '-1 day', '-1 day 2 min', + '2 min', '2 min'], dtype='timedelta64[ns]') + tm.assert_series_equal(algos.mode(idx), exp) diff --git a/pandas/tests/test_base.py b/pandas/tests/test_base.py index da8cf120b8ed4..85976b9fabd66 100644 --- a/pandas/tests/test_base.py +++ b/pandas/tests/test_base.py @@ -4,21 +4,22 @@ import re import sys from datetime import datetime, timedelta - +import pytest import numpy as np import pandas as pd import pandas.compat as compat -from pandas.types.common import (is_object_dtype, is_datetimetz, - needs_i8_conversion) +from pandas.core.dtypes.common import ( + is_object_dtype, is_datetimetz, + needs_i8_conversion) import pandas.util.testing as tm from pandas import (Series, Index, DatetimeIndex, TimedeltaIndex, PeriodIndex, - Timedelta) -from pandas.compat import u, StringIO + Timedelta, IntervalIndex, Interval) +from pandas.compat import StringIO from pandas.compat.numpy import np_array_datetime64_compat -from pandas.core.base import (FrozenList, FrozenNDArray, PandasDelegate, - NoNewAttributesMixin) -from pandas.tseries.base import DatetimeIndexOpsMixin +from pandas.core.base import PandasDelegate, NoNewAttributesMixin +from pandas.core.indexes.datetimelike import DatetimeIndexOpsMixin +from pandas._libs.tslib import iNaT class CheckStringMixin(object): @@ -32,7 +33,7 @@ def test_string_methods_dont_fail(self): def test_tricky_container(self): if not hasattr(self, 'unicode_container'): - raise nose.SkipTest('Need unicode_container to test with this') + pytest.skip('Need unicode_container to test with this') repr(self.unicode_container) str(self.unicode_container) bytes(self.unicode_container) @@ -44,9 +45,10 @@ class CheckImmutable(object): mutable_regex = re.compile('does not support mutable operations') def check_mutable_error(self, *args, **kwargs): - # pass whatever functions you normally would to assertRaises (after the - # Exception kind) - tm.assertRaisesRegexp(TypeError, self.mutable_regex, *args, **kwargs) + # Pass whatever function you normally would to assert_raises_regex + # (after the Exception kind). 
+ tm.assert_raises_regex( + TypeError, self.mutable_regex, *args, **kwargs) def test_no_mutable_funcs(self): def setitem(): @@ -69,6 +71,7 @@ def delslice(): self.check_mutable_error(delslice) mutable_methods = getattr(self, "mutable_methods", []) + for meth in mutable_methods: self.check_mutable_error(getattr(self.container, meth)) @@ -79,74 +82,11 @@ def test_slicing_maintains_type(self): def check_result(self, result, expected, klass=None): klass = klass or self.klass - self.assertIsInstance(result, klass) - self.assertEqual(result, expected) - - -class TestFrozenList(CheckImmutable, CheckStringMixin, tm.TestCase): - mutable_methods = ('extend', 'pop', 'remove', 'insert') - unicode_container = FrozenList([u("\u05d0"), u("\u05d1"), "c"]) - - def setUp(self): - self.lst = [1, 2, 3, 4, 5] - self.container = FrozenList(self.lst) - self.klass = FrozenList - - def test_add(self): - result = self.container + (1, 2, 3) - expected = FrozenList(self.lst + [1, 2, 3]) - self.check_result(result, expected) + assert isinstance(result, klass) + assert result == expected - result = (1, 2, 3) + self.container - expected = FrozenList([1, 2, 3] + self.lst) - self.check_result(result, expected) - def test_inplace(self): - q = r = self.container - q += [5] - self.check_result(q, self.lst + [5]) - # other shouldn't be mutated - self.check_result(r, self.lst) - - -class TestFrozenNDArray(CheckImmutable, CheckStringMixin, tm.TestCase): - mutable_methods = ('put', 'itemset', 'fill') - unicode_container = FrozenNDArray([u("\u05d0"), u("\u05d1"), "c"]) - - def setUp(self): - self.lst = [3, 5, 7, -2] - self.container = FrozenNDArray(self.lst) - self.klass = FrozenNDArray - - def test_shallow_copying(self): - original = self.container.copy() - self.assertIsInstance(self.container.view(), FrozenNDArray) - self.assertFalse(isinstance( - self.container.view(np.ndarray), FrozenNDArray)) - self.assertIsNot(self.container.view(), self.container) - self.assert_numpy_array_equal(self.container, original) - # shallow copy should be the same too - self.assertIsInstance(self.container._shallow_copy(), FrozenNDArray) - - # setting should not be allowed - def testit(container): - container[0] = 16 - - self.check_mutable_error(testit, self.container) - - def test_values(self): - original = self.container.view(np.ndarray).copy() - n = original[0] + 15 - vals = self.container.values() - self.assert_numpy_array_equal(original, vals) - self.assertIsNot(original, vals) - vals[0] = n - self.assertIsInstance(self.container, pd.core.base.FrozenNDArray) - self.assert_numpy_array_equal(self.container.values(), original) - self.assertEqual(vals[0], n) - - -class TestPandasDelegate(tm.TestCase): +class TestPandasDelegate(object): class Delegator(object): _properties = ['foo'] @@ -169,7 +109,7 @@ class Delegate(PandasDelegate): def __init__(self, obj): self.obj = obj - def setUp(self): + def setup_method(self, method): pass def test_invalida_delgation(self): @@ -192,17 +132,17 @@ def test_invalida_delgation(self): def f(): delegate.foo - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) def f(): delegate.foo = 5 - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) def f(): delegate.foo() - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) def test_memory_usage(self): # Delegate does not implement memory_usage. 
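The hunks above all apply the same mechanical conversion that runs through this patch: drop the `tm.TestCase` base class, rename unittest's `setUp` to pytest's `setup_method(self, method)` hook, and replace the `self.assert*` helpers with plain `assert` statements or `pytest.raises`. A minimal before/after sketch, using a hypothetical test class rather than one from the patch:

```python
import pytest


class TestExample(object):              # was: class TestExample(tm.TestCase)

    def setup_method(self, method):     # was: def setUp(self)
        self.x = 1

    def test_example(self):
        assert self.x == 1              # was: self.assertEqual(self.x, 1)
        with pytest.raises(TypeError):  # was: self.assertRaises(TypeError, len, 1)
            len(1)                      # len() of an int raises TypeError
```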
@@ -212,7 +152,7 @@ def test_memory_usage(self): sys.getsizeof(delegate) -class Ops(tm.TestCase): +class Ops(object): def _allow_na_ops(self, obj): """Whether to skip test cases including NaN""" @@ -222,7 +162,7 @@ def _allow_na_ops(self, obj): return False return True - def setUp(self): + def setup_method(self, method): self.bool_index = tm.makeBoolIndex(10, name='a') self.int_index = tm.makeIntIndex(10, name='a') self.float_index = tm.makeFloatIndex(10, name='a') @@ -277,22 +217,22 @@ def check_ops_properties(self, props, filter=None, ignore_failures=False): tm.assert_index_equal(result, expected) elif isinstance(result, np.ndarray) and isinstance(expected, np.ndarray): - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) else: - self.assertEqual(result, expected) + assert result == expected # freq raises AttributeError on an Int64Index because its not - # defined we mostly care about Series hwere anyhow + # defined we mostly care about Series here anyhow if not ignore_failures: for o in self.not_valid_objs: # an object that is datetimelike will raise a TypeError, # otherwise an AttributeError if issubclass(type(o), DatetimeIndexOpsMixin): - self.assertRaises(TypeError, lambda: getattr(o, op)) + pytest.raises(TypeError, lambda: getattr(o, op)) else: - self.assertRaises(AttributeError, - lambda: getattr(o, op)) + pytest.raises(AttributeError, + lambda: getattr(o, op)) def test_binary_ops_docs(self): from pandas import DataFrame, Panel @@ -310,19 +250,17 @@ def test_binary_ops_docs(self): operand2 = 'other' op = op_map[op_name] expected_str = ' '.join([operand1, op, operand2]) - self.assertTrue(expected_str in getattr(klass, - op_name).__doc__) + assert expected_str in getattr(klass, op_name).__doc__ # reverse version of the binary ops expected_str = ' '.join([operand2, op, operand1]) - self.assertTrue(expected_str in getattr(klass, 'r' + - op_name).__doc__) + assert expected_str in getattr(klass, 'r' + op_name).__doc__ class TestIndexOps(Ops): - def setUp(self): - super(TestIndexOps, self).setUp() + def setup_method(self, method): + super(TestIndexOps, self).setup_method(method) self.is_valid_objs = [o for o in self.objs if o._allow_index_ops] self.not_valid_objs = [o for o in self.objs if not o._allow_index_ops] @@ -337,55 +275,57 @@ def test_none_comparison(self): # noinspection PyComparisonWithNone result = o == None # noqa - self.assertFalse(result.iat[0]) - self.assertFalse(result.iat[1]) + assert not result.iat[0] + assert not result.iat[1] # noinspection PyComparisonWithNone result = o != None # noqa - self.assertTrue(result.iat[0]) - self.assertTrue(result.iat[1]) + assert result.iat[0] + assert result.iat[1] result = None == o # noqa - self.assertFalse(result.iat[0]) - self.assertFalse(result.iat[1]) + assert not result.iat[0] + assert not result.iat[1] # this fails for numpy < 1.9 # and oddly for *some* platforms # result = None != o # noqa - # self.assertTrue(result.iat[0]) - # self.assertTrue(result.iat[1]) + # assert result.iat[0] + # assert result.iat[1] result = None > o - self.assertFalse(result.iat[0]) - self.assertFalse(result.iat[1]) + assert not result.iat[0] + assert not result.iat[1] result = o < None - self.assertFalse(result.iat[0]) - self.assertFalse(result.iat[1]) + assert not result.iat[0] + assert not result.iat[1] def test_ndarray_compat_properties(self): for o in self.objs: + # Check that we work. 
+ for p in ['shape', 'dtype', 'flags', 'T', + 'strides', 'itemsize', 'nbytes']: + assert getattr(o, p, None) is not None - # check that we work - for p in ['shape', 'dtype', 'flags', 'T', 'strides', 'itemsize', - 'nbytes']: - self.assertIsNotNone(getattr(o, p, None)) - self.assertTrue(hasattr(o, 'base')) + assert hasattr(o, 'base') - # if we have a datetimelike dtype then needs a view to work + # If we have a datetime-like dtype then needs a view to work # but the user is responsible for that try: - self.assertIsNotNone(o.data) + assert o.data is not None except ValueError: pass - self.assertRaises(ValueError, o.item) # len > 1 - self.assertEqual(o.ndim, 1) - self.assertEqual(o.size, len(o)) + with pytest.raises(ValueError): + o.item() # len > 1 + + assert o.ndim == 1 + assert o.size == len(o) - self.assertEqual(Index([1]).item(), 1) - self.assertEqual(Series([1]).item(), 1) + assert Index([1]).item() == 1 + assert Series([1]).item() == 1 def test_ops(self): for op in ['max', 'min']: @@ -397,12 +337,12 @@ def test_ops(self): expected = pd.Period(ordinal=getattr(o._values, op)(), freq=o.freq) try: - self.assertEqual(result, expected) + assert result == expected except TypeError: # comparing tz-aware series with np.array results in # TypeError expected = expected.astype('M8[ns]').astype('int64') - self.assertEqual(result.value, expected) + assert result.value == expected def test_nanops(self): # GH 7261 @@ -410,43 +350,43 @@ def test_nanops(self): for klass in [Index, Series]: obj = klass([np.nan, 2.0]) - self.assertEqual(getattr(obj, op)(), 2.0) + assert getattr(obj, op)() == 2.0 obj = klass([np.nan]) - self.assertTrue(pd.isnull(getattr(obj, op)())) + assert pd.isnull(getattr(obj, op)()) obj = klass([]) - self.assertTrue(pd.isnull(getattr(obj, op)())) + assert pd.isnull(getattr(obj, op)()) obj = klass([pd.NaT, datetime(2011, 11, 1)]) # check DatetimeIndex monotonic path - self.assertEqual(getattr(obj, op)(), datetime(2011, 11, 1)) + assert getattr(obj, op)() == datetime(2011, 11, 1) obj = klass([pd.NaT, datetime(2011, 11, 1), pd.NaT]) # check DatetimeIndex non-monotonic path - self.assertEqual(getattr(obj, op)(), datetime(2011, 11, 1)) + assert getattr(obj, op)() == datetime(2011, 11, 1) # argmin/max obj = Index(np.arange(5, dtype='int64')) - self.assertEqual(obj.argmin(), 0) - self.assertEqual(obj.argmax(), 4) + assert obj.argmin() == 0 + assert obj.argmax() == 4 obj = Index([np.nan, 1, np.nan, 2]) - self.assertEqual(obj.argmin(), 1) - self.assertEqual(obj.argmax(), 3) + assert obj.argmin() == 1 + assert obj.argmax() == 3 obj = Index([np.nan]) - self.assertEqual(obj.argmin(), -1) - self.assertEqual(obj.argmax(), -1) + assert obj.argmin() == -1 + assert obj.argmax() == -1 obj = Index([pd.NaT, datetime(2011, 11, 1), datetime(2011, 11, 2), pd.NaT]) - self.assertEqual(obj.argmin(), 1) - self.assertEqual(obj.argmax(), 2) + assert obj.argmin() == 1 + assert obj.argmax() == 2 obj = Index([pd.NaT]) - self.assertEqual(obj.argmin(), -1) - self.assertEqual(obj.argmax(), -1) + assert obj.argmin() == -1 + assert obj.argmax() == -1 def test_value_counts_unique_nunique(self): for orig in self.objs: @@ -474,31 +414,31 @@ def test_value_counts_unique_nunique(self): o = klass(rep, index=idx, name='a') # check values has the same dtype as the original - self.assertEqual(o.dtype, orig.dtype) + assert o.dtype == orig.dtype expected_s = Series(range(10, 0, -1), index=expected_index, dtype='int64', name='a') result = o.value_counts() tm.assert_series_equal(result, expected_s) - self.assertTrue(result.index.name is
None) - self.assertEqual(result.name, 'a') + assert result.index.name is None + assert result.name == 'a' result = o.unique() if isinstance(o, Index): - self.assertTrue(isinstance(result, o.__class__)) - self.assert_index_equal(result, orig) + assert isinstance(result, o.__class__) + tm.assert_index_equal(result, orig) elif is_datetimetz(o): # datetimetz Series returns array of Timestamp - self.assertEqual(result[0], orig[0]) + assert result[0] == orig[0] for r in result: - self.assertIsInstance(r, pd.Timestamp) + assert isinstance(r, pd.Timestamp) tm.assert_numpy_array_equal(result, orig._values.asobject.values) else: tm.assert_numpy_array_equal(result, orig.values) - self.assertEqual(o.nunique(), len(np.unique(o.values))) + assert o.nunique() == len(np.unique(o.values)) def test_value_counts_unique_nunique_null(self): @@ -515,21 +455,21 @@ def test_value_counts_unique_nunique_null(self): if is_datetimetz(o): if isinstance(o, DatetimeIndex): v = o.asi8 - v[0:2] = pd.tslib.iNaT + v[0:2] = iNaT values = o._shallow_copy(v) else: o = o.copy() - o[0:2] = pd.tslib.iNaT + o[0:2] = iNaT values = o._values elif needs_i8_conversion(o): - values[0:2] = pd.tslib.iNaT + values[0:2] = iNaT values = o._shallow_copy(values) else: values[0:2] = null_obj # check values has the same dtype as the original - self.assertEqual(values.dtype, o.dtype) + assert values.dtype == o.dtype # create repeated values, 'n'th element is repeated by n+1 # times @@ -550,15 +490,15 @@ def test_value_counts_unique_nunique_null(self): o.name = 'a' # check values has the same dtype as the original - self.assertEqual(o.dtype, orig.dtype) + assert o.dtype == orig.dtype # check values correctly have NaN nanloc = np.zeros(len(o), dtype=np.bool) nanloc[:3] = True if isinstance(o, Index): - self.assert_numpy_array_equal(pd.isnull(o), nanloc) + tm.assert_numpy_array_equal(pd.isnull(o), nanloc) else: exp = pd.Series(nanloc, o.index, name='a') - self.assert_series_equal(pd.isnull(o), exp) + tm.assert_series_equal(pd.isnull(o), exp) expected_s_na = Series(list(range(10, 2, -1)) + [3], index=expected_index[9:0:-1], @@ -569,12 +509,12 @@ def test_value_counts_unique_nunique_null(self): result_s_na = o.value_counts(dropna=False) tm.assert_series_equal(result_s_na, expected_s_na) - self.assertTrue(result_s_na.index.name is None) - self.assertEqual(result_s_na.name, 'a') + assert result_s_na.index.name is None + assert result_s_na.name == 'a' result_s = o.value_counts() tm.assert_series_equal(o.value_counts(), expected_s) - self.assertTrue(result_s.index.name is None) - self.assertEqual(result_s.name, 'a') + assert result_s.index.name is None + assert result_s.name == 'a' result = o.unique() if isinstance(o, Index): @@ -584,15 +524,15 @@ def test_value_counts_unique_nunique_null(self): # unable to compare NaT / nan tm.assert_numpy_array_equal(result[1:], values[2:].asobject.values) - self.assertIs(result[0], pd.NaT) + assert result[0] is pd.NaT else: tm.assert_numpy_array_equal(result[1:], values[2:]) - self.assertTrue(pd.isnull(result[0])) - self.assertEqual(result.dtype, orig.dtype) + assert pd.isnull(result[0]) + assert result.dtype == orig.dtype - self.assertEqual(o.nunique(), 8) - self.assertEqual(o.nunique(dropna=False), 9) + assert o.nunique() == 8 + assert o.nunique(dropna=False) == 9 def test_value_counts_inferred(self): klasses = [Index, Series] @@ -609,7 +549,7 @@ def test_value_counts_inferred(self): exp = np.unique(np.array(s_values, dtype=np.object_)) tm.assert_numpy_array_equal(s.unique(), exp) - self.assertEqual(s.nunique(), 4) 
+        assert s.nunique() == 4
         # don't sort, have to sort after the fact as not sorting is
         # platform-dep
         hist = s.value_counts(sort=False).sort_values()
@@ -633,15 +573,14 @@ def test_value_counts_bins(self):
             s = klass(s_values)
             # bins
-            self.assertRaises(TypeError,
-                              lambda bins: s.value_counts(bins=bins), 1)
+            pytest.raises(TypeError, lambda bins: s.value_counts(bins=bins), 1)
             s1 = Series([1, 1, 2, 3])
             res1 = s1.value_counts(bins=1)
-            exp1 = Series({0.998: 4})
+            exp1 = Series({Interval(0.997, 3.0): 4})
             tm.assert_series_equal(res1, exp1)
             res1n = s1.value_counts(bins=1, normalize=True)
-            exp1n = Series({0.998: 1.0})
+            exp1n = Series({Interval(0.997, 3.0): 1.0})
             tm.assert_series_equal(res1n, exp1n)
             if isinstance(s1, Index):
@@ -650,20 +589,22 @@ def test_value_counts_bins(self):
                 exp = np.array([1, 2, 3], dtype=np.int64)
                 tm.assert_numpy_array_equal(s1.unique(), exp)
-            self.assertEqual(s1.nunique(), 3)
+            assert s1.nunique() == 3
-            res4 = s1.value_counts(bins=4)
-            exp4 = Series({0.998: 2,
-                           1.5: 1,
-                           2.0: 0,
-                           2.5: 1}, index=[0.998, 2.5, 1.5, 2.0])
+            # these return the same
+            res4 = s1.value_counts(bins=4, dropna=True)
+            intervals = IntervalIndex.from_breaks([0.997, 1.5, 2.0, 2.5, 3.0])
+            exp4 = Series([2, 1, 1, 0], index=intervals.take([0, 3, 1, 2]))
             tm.assert_series_equal(res4, exp4)
+
+            res4 = s1.value_counts(bins=4, dropna=False)
+            intervals = IntervalIndex.from_breaks([0.997, 1.5, 2.0, 2.5, 3.0])
+            exp4 = Series([2, 1, 1, 0], index=intervals.take([0, 3, 1, 2]))
+            tm.assert_series_equal(res4, exp4)
+
             res4n = s1.value_counts(bins=4, normalize=True)
-            exp4n = Series(
-                {0.998: 0.5,
-                 1.5: 0.25,
-                 2.0: 0.0,
-                 2.5: 0.25}, index=[0.998, 2.5, 1.5, 2.0])
+            exp4n = Series([0.5, 0.25, 0.25, 0],
+                           index=intervals.take([0, 3, 1, 2]))
             tm.assert_series_equal(res4n, exp4n)
             # handle NA's properly
@@ -679,7 +620,7 @@ def test_value_counts_bins(self):
             else:
                 exp = np.array(['a', 'b', np.nan, 'd'], dtype=object)
                 tm.assert_numpy_array_equal(s.unique(), exp)
-            self.assertEqual(s.nunique(), 3)
+            assert s.nunique() == 3
             s = klass({})
             expected = Series([], dtype=np.int64)
@@ -687,13 +628,12 @@ def test_value_counts_bins(self):
                                    check_index_type=False)
             # returned dtype differs depending on original
             if isinstance(s, Index):
-                self.assert_index_equal(s.unique(), Index([]),
-                                        exact=False)
+                tm.assert_index_equal(s.unique(), Index([]), exact=False)
             else:
-                self.assert_numpy_array_equal(s.unique(), np.array([]),
-                                              check_dtype=False)
+                tm.assert_numpy_array_equal(s.unique(), np.array([]),
+                                            check_dtype=False)
-            self.assertEqual(s.nunique(), 0)
+            assert s.nunique() == 0
     def test_value_counts_datetime64(self):
         klasses = [Index, Series]
@@ -726,14 +666,14 @@ def test_value_counts_datetime64(self):
             else:
                 tm.assert_numpy_array_equal(s.unique(), expected)
-            self.assertEqual(s.nunique(), 3)
+            assert s.nunique() == 3
             # with NaT
             s = df['dt'].copy()
             s = klass([v for v in s.values] + [pd.NaT])
             result = s.value_counts()
-            self.assertEqual(result.index.dtype, 'datetime64[ns]')
+            assert result.index.dtype == 'datetime64[ns]'
             tm.assert_series_equal(result, expected_s)
             result = s.value_counts(dropna=False)
@@ -741,7 +681,7 @@ def test_value_counts_datetime64(self):
             tm.assert_series_equal(result, expected_s)
             unique = s.unique()
-            self.assertEqual(unique.dtype, 'datetime64[ns]')
+            assert unique.dtype == 'datetime64[ns]'
             # numpy_array_equal cannot compare pd.NaT
             if isinstance(s, Index):
@@ -749,10 +689,10 @@ def test_value_counts_datetime64(self):
                 tm.assert_index_equal(unique, exp_idx)
             else:
                 tm.assert_numpy_array_equal(unique[:3], expected)
-                self.assertTrue(pd.isnull(unique[3]))
+                assert pd.isnull(unique[3])
-            self.assertEqual(s.nunique(), 3)
-            self.assertEqual(s.nunique(dropna=False), 4)
+            assert s.nunique() == 3
+            assert s.nunique(dropna=False) == 4
             # timedelta64[ns]
             td = df.dt - df.dt + timedelta(1)
@@ -786,14 +726,14 @@ def test_factorize(self):
                 exp_uniques = o
             labels, uniques = o.factorize()
-            self.assert_numpy_array_equal(labels, exp_arr)
+            tm.assert_numpy_array_equal(labels, exp_arr)
             if isinstance(o, Series):
-                self.assert_index_equal(uniques, Index(orig),
-                                        check_names=False)
+                tm.assert_index_equal(uniques, Index(orig),
+                                      check_names=False)
             else:
                 # factorize explicitly resets name
-                self.assert_index_equal(uniques, exp_uniques,
-                                        check_names=False)
+                tm.assert_index_equal(uniques, exp_uniques,
+                                      check_names=False)
     def test_factorize_repeated(self):
         for orig in self.objs:
@@ -816,24 +756,24 @@ def test_factorize_repeated(self):
                                dtype=np.intp)
             labels, uniques = n.factorize(sort=True)
-            self.assert_numpy_array_equal(labels, exp_arr)
+            tm.assert_numpy_array_equal(labels, exp_arr)
             if isinstance(o, Series):
-                self.assert_index_equal(uniques, Index(orig).sort_values(),
-                                        check_names=False)
+                tm.assert_index_equal(uniques, Index(orig).sort_values(),
+                                      check_names=False)
             else:
-                self.assert_index_equal(uniques, o, check_names=False)
+                tm.assert_index_equal(uniques, o, check_names=False)
             exp_arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4],
                                np.intp)
             labels, uniques = n.factorize(sort=False)
-            self.assert_numpy_array_equal(labels, exp_arr)
+            tm.assert_numpy_array_equal(labels, exp_arr)
             if isinstance(o, Series):
                 expected = Index(o.iloc[5:10].append(o.iloc[:5]))
-                self.assert_index_equal(uniques, expected, check_names=False)
+                tm.assert_index_equal(uniques, expected, check_names=False)
             else:
                 expected = o[5:10].append(o[:5])
-                self.assert_index_equal(uniques, expected, check_names=False)
+                tm.assert_index_equal(uniques, expected, check_names=False)
     def test_duplicated_drop_duplicates_index(self):
         # GH 4060
@@ -851,13 +791,13 @@ def test_duplicated_drop_duplicates_index(self):
                 expected = np.array([False] * len(original), dtype=bool)
                 duplicated = original.duplicated()
                 tm.assert_numpy_array_equal(duplicated, expected)
-                self.assertTrue(duplicated.dtype == bool)
+                assert duplicated.dtype == bool
                 result = original.drop_duplicates()
                 tm.assert_index_equal(result, original)
-                self.assertFalse(result is original)
+                assert result is not original
                 # has_duplicates
-                self.assertFalse(original.has_duplicates)
+                assert not original.has_duplicates
                 # create repeated values, 3rd and 5th values are duplicated
                 idx = original[list(range(len(original))) + [5, 3]]
@@ -865,7 +805,7 @@ def test_duplicated_drop_duplicates_index(self):
                                     dtype=bool)
                 duplicated = idx.duplicated()
                 tm.assert_numpy_array_equal(duplicated, expected)
-                self.assertTrue(duplicated.dtype == bool)
+                assert duplicated.dtype == bool
                 tm.assert_index_equal(idx.drop_duplicates(), original)
                 base = [False] * len(idx)
@@ -875,19 +815,10 @@ def test_duplicated_drop_duplicates_index(self):
                 duplicated = idx.duplicated(keep='last')
                 tm.assert_numpy_array_equal(duplicated, expected)
-                self.assertTrue(duplicated.dtype == bool)
+                assert duplicated.dtype == bool
                 result = idx.drop_duplicates(keep='last')
                 tm.assert_index_equal(result, idx[~expected])
-                # deprecate take_last
-                with tm.assert_produces_warning(FutureWarning):
-                    duplicated = idx.duplicated(take_last=True)
-                tm.assert_numpy_array_equal(duplicated, expected)
-                self.assertTrue(duplicated.dtype == bool)
-                with tm.assert_produces_warning(FutureWarning):
-                    result = idx.drop_duplicates(take_last=True)
-                tm.assert_index_equal(result, idx[~expected])
-
                 base = [False] * len(original) + [True, True]
                 base[3] = True
                 base[5] = True
@@ -895,11 +826,11 @@ def test_duplicated_drop_duplicates_index(self):
                 duplicated = idx.duplicated(keep=False)
                 tm.assert_numpy_array_equal(duplicated, expected)
-                self.assertTrue(duplicated.dtype == bool)
+                assert duplicated.dtype == bool
                 result = idx.drop_duplicates(keep=False)
                 tm.assert_index_equal(result, idx[~expected])
-                with tm.assertRaisesRegexp(
+                with tm.assert_raises_regex(
                         TypeError, r"drop_duplicates\(\) got an unexpected "
                         "keyword argument"):
                     idx.drop_duplicates(inplace=True)
@@ -910,7 +841,7 @@ def test_duplicated_drop_duplicates_index(self):
                 tm.assert_series_equal(original.duplicated(), expected)
                 result = original.drop_duplicates()
                 tm.assert_series_equal(result, original)
-                self.assertFalse(result is original)
+                assert result is not original
                 idx = original.index[list(range(len(original))) + [5, 3]]
                 values = original._values[list(range(len(original))) + [5, 3]]
@@ -930,13 +861,6 @@ def test_duplicated_drop_duplicates_index(self):
                 tm.assert_series_equal(s.drop_duplicates(keep='last'),
                                        s[~np.array(base)])
-                # deprecate take_last
-                with tm.assert_produces_warning(FutureWarning):
-                    tm.assert_series_equal(
-                        s.duplicated(take_last=True), expected)
-                with tm.assert_produces_warning(FutureWarning):
-                    tm.assert_series_equal(s.drop_duplicates(take_last=True),
-                                           s[~np.array(base)])
                 base = [False] * len(original) + [True, True]
                 base[3] = True
                 base[5] = True
@@ -949,6 +873,21 @@ def test_duplicated_drop_duplicates_index(self):
                     s.drop_duplicates(inplace=True)
                 tm.assert_series_equal(s, original)
+    def test_drop_duplicates_series_vs_dataframe(self):
+        # GH 14192
+        df = pd.DataFrame({'a': [1, 1, 1, 'one', 'one'],
+                           'b': [2, 2, np.nan, np.nan, np.nan],
+                           'c': [3, 3, np.nan, np.nan, 'three'],
+                           'd': [1, 2, 3, 4, 4],
+                           'e': [datetime(2015, 1, 1), datetime(2015, 1, 1),
+                                 datetime(2015, 2, 1), pd.NaT, pd.NaT]
+                           })
+        for column in df.columns:
+            for keep in ['first', 'last', False]:
+                dropped_frame = df[[column]].drop_duplicates(keep=keep)
+                dropped_series = df[column].drop_duplicates(keep=keep)
+                tm.assert_frame_equal(dropped_frame, dropped_series.to_frame())
+
     def test_fillna(self):
         # # GH 11343
         # though Index.fillna and Series.fillna has separate impl,
@@ -962,11 +901,11 @@ def test_fillna(self):
            # values will not be changed
            result = o.fillna(o.astype(object).values[0])
            if isinstance(o, Index):
-                self.assert_index_equal(o, result)
+                tm.assert_index_equal(o, result)
            else:
-                self.assert_series_equal(o, result)
+                tm.assert_series_equal(o, result)
            # check shallow_copied
-            self.assertFalse(o is result)
+            assert o is not result
        for null_obj in [np.nan, None]:
            for orig in self.objs:
@@ -992,15 +931,15 @@ def test_fillna(self):
                o = klass(values)
                # check values has the same dtype as the original
-                self.assertEqual(o.dtype, orig.dtype)
+                assert o.dtype == orig.dtype
                result = o.fillna(fill_value)
                if isinstance(o, Index):
-                    self.assert_index_equal(result, expected)
+                    tm.assert_index_equal(result, expected)
                else:
-                    self.assert_series_equal(result, expected)
+                    tm.assert_series_equal(result, expected)
                # check shallow_copied
-                self.assertFalse(o is result)
+                assert o is not result
    def test_memory_usage(self):
        for o in self.objs:
@@ -1010,41 +949,35 @@ def test_memory_usage(self):
            if (is_object_dtype(o) or (isinstance(o, Series) and
                                       is_object_dtype(o.index))):
                # if there are objects, only deep will pick them up
-                self.assertTrue(res_deep > res)
+                assert res_deep > res
            else:
-                self.assertEqual(res, res_deep)
+                assert res == res_deep
            if isinstance(o, Series):
-                self.assertEqual(
-                    (o.memory_usage(index=False) +
-                     o.index.memory_usage()),
-                    o.memory_usage(index=True)
-                )
+                assert ((o.memory_usage(index=False) +
+                         o.index.memory_usage()) ==
+                        o.memory_usage(index=True))
            # sys.getsizeof will call the .memory_usage with
            # deep=True, and add on some GC overhead
            diff = res_deep - sys.getsizeof(o)
-            self.assertTrue(abs(diff) < 100)
+            assert abs(diff) < 100
    def test_searchsorted(self):
        # See gh-12238
        for o in self.objs:
            index = np.searchsorted(o, max(o))
-            self.assertTrue(0 <= index <= len(o))
+            assert 0 <= index <= len(o)
            index = np.searchsorted(o, max(o), sorter=range(len(o)))
-            self.assertTrue(0 <= index <= len(o))
+            assert 0 <= index <= len(o)
+    def test_validate_bool_args(self):
+        invalid_values = [1, "True", [1, 2, 3], 5.0]
-class TestFloat64HashTable(tm.TestCase):
-
-    def test_lookup_nan(self):
-        from pandas.hashtable import Float64HashTable
-        xs = np.array([2.718, 3.14, np.nan, -7, 5, 2, 3])
-        m = Float64HashTable()
-        m.map_locations(xs)
-        self.assert_numpy_array_equal(m.lookup(xs),
-                                      np.arange(len(xs), dtype=np.int64))
+        for value in invalid_values:
+            with pytest.raises(ValueError):
+                self.int_series.drop_duplicates(inplace=value)
 class TestTranspose(Ops):
@@ -1059,10 +992,10 @@ def test_transpose(self):
    def test_transpose_non_default_axes(self):
        for obj in self.objs:
-            tm.assertRaisesRegexp(ValueError, self.errmsg,
-                                  obj.transpose, 1)
-            tm.assertRaisesRegexp(ValueError, self.errmsg,
-                                  obj.transpose, axes=1)
+            tm.assert_raises_regex(ValueError, self.errmsg,
+                                   obj.transpose, 1)
+            tm.assert_raises_regex(ValueError, self.errmsg,
+                                   obj.transpose, axes=1)
    def test_numpy_transpose(self):
        for obj in self.objs:
@@ -1071,34 +1004,28 @@ def test_numpy_transpose(self):
            else:
                tm.assert_series_equal(np.transpose(obj), obj)
-            tm.assertRaisesRegexp(ValueError, self.errmsg,
-                                  np.transpose, obj, axes=1)
+            tm.assert_raises_regex(ValueError, self.errmsg,
+                                   np.transpose, obj, axes=1)
-class TestNoNewAttributesMixin(tm.TestCase):
+class TestNoNewAttributesMixin(object):
    def test_mixin(self):
        class T(NoNewAttributesMixin):
            pass
        t = T()
-        self.assertFalse(hasattr(t, "__frozen"))
+        assert not hasattr(t, "__frozen")
+
        t.a = "test"
-        self.assertEqual(t.a, "test")
+        assert t.a == "test"
+
        t._freeze()
-        # self.assertTrue("__frozen" not in dir(t))
-        self.assertIs(getattr(t, "__frozen"), True)
+        assert "__frozen" in dir(t)
+        assert getattr(t, "__frozen")
        def f():
            t.b = "test"
-        self.assertRaises(AttributeError, f)
-        self.assertFalse(hasattr(t, "b"))
-
-
-if __name__ == '__main__':
-    import nose
-
-    nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
-                   # '--with-coverage', '--cover-package=pandas.core'],
-                   exit=False)
+        pytest.raises(AttributeError, f)
+        assert not hasattr(t, "b")
diff --git a/pandas/tests/test_categorical.py b/pandas/tests/test_categorical.py
index f01fff035a3c5..03adf17f50300 100644
--- a/pandas/tests/test_categorical.py
+++ b/pandas/tests/test_categorical.py
@@ -1,40 +1,42 @@
 # -*- coding: utf-8 -*-
 # pylint: disable=E1101,E1103,W0232
+from warnings import catch_warnings
+import pytest
 import sys
 from datetime import datetime
 from distutils.version import LooseVersion
 import numpy as np
-from pandas.types.dtypes import CategoricalDtype
-from pandas.types.common import (is_categorical_dtype,
-                                 is_object_dtype,
-                                 is_float_dtype,
-                                 is_integer_dtype)
+from pandas.core.dtypes.dtypes import CategoricalDtype
+from pandas.core.dtypes.common import (
+    is_categorical_dtype,
+    is_float_dtype,
+    is_integer_dtype)
 import pandas as pd
 import pandas.compat as compat
 import pandas.util.testing as tm
-from pandas import (Categorical, Index, Series, DataFrame, PeriodIndex,
-                    Timestamp, CategoricalIndex, isnull)
+from pandas import (Categorical, Index, Series, DataFrame,
+                    Timestamp, CategoricalIndex, isnull,
+                    date_range, DatetimeIndex,
+                    period_range, PeriodIndex,
+                    timedelta_range, TimedeltaIndex, NaT,
+                    Interval, IntervalIndex)
 from pandas.compat import range, lrange, u, PY3
 from pandas.core.config import option_context
-# GH 12066
-# flake8: noqa
+class TestCategorical(object):
-class TestCategorical(tm.TestCase):
-    _multiprocess_can_split_ = True
-
-    def setUp(self):
+    def setup_method(self, method):
        self.factor = Categorical(['a', 'b', 'b', 'a', 'a', 'c', 'c', 'c'],
                                  ordered=True)
    def test_getitem(self):
-        self.assertEqual(self.factor[0], 'a')
-        self.assertEqual(self.factor[-1], 'c')
+        assert self.factor[0] == 'a'
+        assert self.factor[-1] == 'c'
        subf = self.factor[[0, 1, 2]]
        tm.assert_numpy_array_equal(subf._codes,
@@ -52,16 +54,37 @@ def test_getitem_listlike(self):
        c = Categorical(np.random.randint(0, 5, size=150000).astype(np.int8))
        result = c.codes[np.array([100000]).astype(np.int64)]
        expected = c[np.array([100000]).astype(np.int64)].codes
-        self.assert_numpy_array_equal(result, expected)
+        tm.assert_numpy_array_equal(result, expected)
+
+    def test_getitem_category_type(self):
+        # GH 14580
+        # test iloc() on Series with Categorical data
+
+        s = pd.Series([1, 2, 3]).astype('category')
+
+        # get slice
+        result = s.iloc[0:2]
+        expected = pd.Series([1, 2]).astype('category', categories=[1, 2, 3])
+        tm.assert_series_equal(result, expected)
+
+        # get list of indexes
+        result = s.iloc[[0, 1]]
+        expected = pd.Series([1, 2]).astype('category', categories=[1, 2, 3])
+        tm.assert_series_equal(result, expected)
+
+        # get boolean array
+        result = s.iloc[[True, False, False]]
+        expected = pd.Series([1]).astype('category', categories=[1, 2, 3])
+        tm.assert_series_equal(result, expected)
    def test_setitem(self):
        # int/positional
        c = self.factor.copy()
        c[0] = 'b'
-        self.assertEqual(c[0], 'b')
+        assert c[0] == 'b'
        c[-1] = 'a'
-        self.assertEqual(c[-1], 'a')
+        assert c[-1] == 'a'
        # boolean
        c = self.factor.copy()
@@ -72,7 +95,7 @@ def test_setitem(self):
        expected = Categorical(['c', 'b', 'b', 'a', 'a', 'c', 'c', 'c'],
                               ordered=True)
-        self.assert_categorical_equal(c, expected)
+        tm.assert_categorical_equal(c, expected)
    def test_setitem_listlike(self):
@@ -87,19 +110,29 @@ def test_setitem_listlike(self):
        # we are asserting the code result here
        # which maps to the -1000 category
        result = c.codes[np.array([100000]).astype(np.int64)]
-        self.assertEqual(result, np.array([5], dtype='int8'))
+        tm.assert_numpy_array_equal(result, np.array([5], dtype='int8'))
    def test_constructor_unsortable(self):
        # it works!
        arr = np.array([1, 2, 3, datetime.now()], dtype='O')
        factor = Categorical(arr, ordered=False)
-        self.assertFalse(factor.ordered)
+        assert not factor.ordered
        # this however will raise as cannot be sorted
-        self.assertRaises(
+        pytest.raises(
            TypeError, lambda: Categorical(arr, ordered=True))
+    def test_constructor_interval(self):
+        result = Categorical([Interval(1, 2), Interval(2, 3), Interval(3, 6)],
+                             ordered=True)
+        ii = IntervalIndex.from_intervals([Interval(1, 2),
+                                           Interval(2, 3),
+                                           Interval(3, 6)])
+        exp = Categorical(ii, ordered=True)
+        tm.assert_categorical_equal(result, exp)
+        tm.assert_index_equal(result.categories, ii)
+
    def test_is_equal_dtype(self):
        # test dtype comparisons between cats
@@ -107,48 +140,42 @@ def test_is_equal_dtype(self):
        c1 = Categorical(list('aabca'), categories=list('abc'), ordered=False)
        c2 = Categorical(list('aabca'), categories=list('cab'), ordered=False)
        c3 = Categorical(list('aabca'), categories=list('cab'), ordered=True)
-        self.assertTrue(c1.is_dtype_equal(c1))
-        self.assertTrue(c2.is_dtype_equal(c2))
-        self.assertTrue(c3.is_dtype_equal(c3))
-        self.assertFalse(c1.is_dtype_equal(c2))
-        self.assertFalse(c1.is_dtype_equal(c3))
-        self.assertFalse(c1.is_dtype_equal(Index(list('aabca'))))
-        self.assertFalse(c1.is_dtype_equal(c1.astype(object)))
-        self.assertTrue(c1.is_dtype_equal(CategoricalIndex(c1)))
-        self.assertFalse(c1.is_dtype_equal(
+        assert c1.is_dtype_equal(c1)
+        assert c2.is_dtype_equal(c2)
+        assert c3.is_dtype_equal(c3)
+        assert not c1.is_dtype_equal(c2)
+        assert not c1.is_dtype_equal(c3)
+        assert not c1.is_dtype_equal(Index(list('aabca')))
+        assert not c1.is_dtype_equal(c1.astype(object))
+        assert c1.is_dtype_equal(CategoricalIndex(c1))
+        assert not (c1.is_dtype_equal(
            CategoricalIndex(c1, categories=list('cab'))))
-        self.assertFalse(c1.is_dtype_equal(CategoricalIndex(c1, ordered=True)))
+        assert not c1.is_dtype_equal(CategoricalIndex(c1, ordered=True))
    def test_constructor(self):
        exp_arr = np.array(["a", "b", "c", "a", "b", "c"], dtype=np.object_)
        c1 = Categorical(exp_arr)
-        self.assert_numpy_array_equal(c1.__array__(), exp_arr)
+        tm.assert_numpy_array_equal(c1.__array__(), exp_arr)
        c2 = Categorical(exp_arr, categories=["a", "b", "c"])
-        self.assert_numpy_array_equal(c2.__array__(), exp_arr)
+        tm.assert_numpy_array_equal(c2.__array__(), exp_arr)
        c2 = Categorical(exp_arr, categories=["c", "b", "a"])
-        self.assert_numpy_array_equal(c2.__array__(), exp_arr)
+        tm.assert_numpy_array_equal(c2.__array__(), exp_arr)
        # categories must be unique
        def f():
            Categorical([1, 2], [1, 2, 2])
-        self.assertRaises(ValueError, f)
+        pytest.raises(ValueError, f)
        def f():
            Categorical(["a", "b"], ["a", "b", "b"])
-        self.assertRaises(ValueError, f)
-
-        def f():
-            with tm.assert_produces_warning(FutureWarning):
-                Categorical([1, 2], [1, 2, np.nan, np.nan])
-
-        self.assertRaises(ValueError, f)
+        pytest.raises(ValueError, f)
        # The default should be unordered
        c1 = Categorical(["a", "b", "c", "a"])
-        self.assertFalse(c1.ordered)
+        assert not c1.ordered
        # Categorical as input
        c1 = Categorical(["a", "b", "c", "a"])
@@ -165,8 +192,8 @@ def f():
        c1 = Categorical(["a", "b", "c", "a"], categories=["a", "c", "b"])
        c2 = Categorical(c1, categories=["a", "b", "c"])
-        self.assert_numpy_array_equal(c1.__array__(), c2.__array__())
-        self.assert_index_equal(c2.categories, Index(["a", "b", "c"]))
+        tm.assert_numpy_array_equal(c1.__array__(), c2.__array__())
+        tm.assert_index_equal(c2.categories, Index(["a", "b", "c"]))
        # Series of dtype category
        c1 = Categorical(["a", "b", "c", "a"],
                         categories=["a", "b", "c", "d"])
categories=["a", "b", "c", "d"]) @@ -189,68 +216,51 @@ def f(): # This should result in integer categories, not float! cat = pd.Categorical([1, 2, 3, np.nan], categories=[1, 2, 3]) - self.assertTrue(is_integer_dtype(cat.categories)) + assert is_integer_dtype(cat.categories) # https://github.com/pandas-dev/pandas/issues/3678 cat = pd.Categorical([np.nan, 1, 2, 3]) - self.assertTrue(is_integer_dtype(cat.categories)) + assert is_integer_dtype(cat.categories) # this should result in floats cat = pd.Categorical([np.nan, 1, 2., 3]) - self.assertTrue(is_float_dtype(cat.categories)) + assert is_float_dtype(cat.categories) cat = pd.Categorical([np.nan, 1., 2., 3.]) - self.assertTrue(is_float_dtype(cat.categories)) - - # Deprecating NaNs in categoires (GH #10748) - # preserve int as far as possible by converting to object if NaN is in - # categories - with tm.assert_produces_warning(FutureWarning): - cat = pd.Categorical([np.nan, 1, 2, 3], - categories=[np.nan, 1, 2, 3]) - self.assertTrue(is_object_dtype(cat.categories)) + assert is_float_dtype(cat.categories) # This doesn't work -> this would probably need some kind of "remember # the original type" feature to try to cast the array interface result # to... # vals = np.asarray(cat[cat.notnull()]) - # self.assertTrue(is_integer_dtype(vals)) - with tm.assert_produces_warning(FutureWarning): - cat = pd.Categorical([np.nan, "a", "b", "c"], - categories=[np.nan, "a", "b", "c"]) - self.assertTrue(is_object_dtype(cat.categories)) - # but don't do it for floats - with tm.assert_produces_warning(FutureWarning): - cat = pd.Categorical([np.nan, 1., 2., 3.], - categories=[np.nan, 1., 2., 3.]) - self.assertTrue(is_float_dtype(cat.categories)) + # assert is_integer_dtype(vals) # corner cases cat = pd.Categorical([1]) - self.assertTrue(len(cat.categories) == 1) - self.assertTrue(cat.categories[0] == 1) - self.assertTrue(len(cat.codes) == 1) - self.assertTrue(cat.codes[0] == 0) + assert len(cat.categories) == 1 + assert cat.categories[0] == 1 + assert len(cat.codes) == 1 + assert cat.codes[0] == 0 cat = pd.Categorical(["a"]) - self.assertTrue(len(cat.categories) == 1) - self.assertTrue(cat.categories[0] == "a") - self.assertTrue(len(cat.codes) == 1) - self.assertTrue(cat.codes[0] == 0) + assert len(cat.categories) == 1 + assert cat.categories[0] == "a" + assert len(cat.codes) == 1 + assert cat.codes[0] == 0 # Scalars should be converted to lists cat = pd.Categorical(1) - self.assertTrue(len(cat.categories) == 1) - self.assertTrue(cat.categories[0] == 1) - self.assertTrue(len(cat.codes) == 1) - self.assertTrue(cat.codes[0] == 0) + assert len(cat.categories) == 1 + assert cat.categories[0] == 1 + assert len(cat.codes) == 1 + assert cat.codes[0] == 0 cat = pd.Categorical([1], categories=1) - self.assertTrue(len(cat.categories) == 1) - self.assertTrue(cat.categories[0] == 1) - self.assertTrue(len(cat.codes) == 1) - self.assertTrue(cat.codes[0] == 0) + assert len(cat.categories) == 1 + assert cat.categories[0] == 1 + assert len(cat.codes) == 1 + assert cat.codes[0] == 0 # Catch old style constructor useage: two arrays, codes + categories # We can only catch two cases: @@ -275,6 +285,21 @@ def f(): c = Categorical(np.array([], dtype='int64'), # noqa categories=[3, 2, 1], ordered=True) + def test_constructor_with_null(self): + + # Cannot have NaN in categories + with pytest.raises(ValueError): + pd.Categorical([np.nan, "a", "b", "c"], + categories=[np.nan, "a", "b", "c"]) + + with pytest.raises(ValueError): + pd.Categorical([None, "a", "b", "c"], + 
categories=[None, "a", "b", "c"]) + + with pytest.raises(ValueError): + pd.Categorical(DatetimeIndex(['nat', '20160101']), + categories=[NaT, Timestamp('20160101')]) + def test_constructor_with_index(self): ci = CategoricalIndex(list('aabbca'), categories=list('cab')) tm.assert_categorical_equal(ci.values, Categorical(ci)) @@ -321,7 +346,7 @@ def test_constructor_with_datetimelike(self): expected = type(dtl)(s) expected.freq = None tm.assert_index_equal(c.categories, expected) - self.assert_numpy_array_equal(c.codes, np.arange(5, dtype='int8')) + tm.assert_numpy_array_equal(c.codes, np.arange(5, dtype='int8')) # with NaT s2 = s.copy() @@ -332,10 +357,10 @@ def test_constructor_with_datetimelike(self): tm.assert_index_equal(c.categories, expected) exp = np.array([0, 1, 2, 3, -1], dtype=np.int8) - self.assert_numpy_array_equal(c.codes, exp) + tm.assert_numpy_array_equal(c.codes, exp) result = repr(c) - self.assertTrue('NaT' in result) + assert 'NaT' in result def test_constructor_from_index_series_datetimetz(self): idx = pd.date_range('2015-01-01 10:00', freq='D', periods=3, @@ -384,25 +409,31 @@ def test_from_codes(self): def f(): Categorical.from_codes([1, 2], [1, 2]) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # no int codes def f(): Categorical.from_codes(["a"], [1, 2]) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # no unique categories def f(): Categorical.from_codes([0, 1, 2], ["a", "a", "b"]) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) + + # NaN categories included + def f(): + Categorical.from_codes([0, 1, 2], ["a", "b", np.nan]) + + pytest.raises(ValueError, f) # too negative def f(): Categorical.from_codes([-2, 1, 2], ["a", "b", "c"]) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) exp = Categorical(["a", "b", "c"], ordered=False) res = Categorical.from_codes([0, 1, 2], ["a", "b", "c"]) @@ -421,10 +452,10 @@ def test_validate_ordered(self): # This should be a boolean. 
ordered = np.array([0, 1, 2]) - with tm.assertRaisesRegexp(exp_err, exp_msg): + with tm.assert_raises_regex(exp_err, exp_msg): Categorical([1, 2, 3], ordered=ordered) - with tm.assertRaisesRegexp(exp_err, exp_msg): + with tm.assert_raises_regex(exp_err, exp_msg): Categorical.from_codes([0, 0, 1], categories=['a', 'b', 'c'], ordered=ordered) @@ -459,11 +490,11 @@ def test_comparisons(self): other = self.factor[np.random.permutation(n)] result = self.factor == other expected = np.asarray(self.factor) == np.asarray(other) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) result = self.factor == 'd' expected = np.repeat(False, len(self.factor)) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) # comparisons with categoricals cat_rev = pd.Categorical(["a", "b", "c"], categories=["c", "b", "a"], @@ -477,21 +508,21 @@ def test_comparisons(self): # comparisons need to take categories ordering into account res_rev = cat_rev > cat_rev_base exp_rev = np.array([True, False, False]) - self.assert_numpy_array_equal(res_rev, exp_rev) + tm.assert_numpy_array_equal(res_rev, exp_rev) res_rev = cat_rev < cat_rev_base exp_rev = np.array([False, False, True]) - self.assert_numpy_array_equal(res_rev, exp_rev) + tm.assert_numpy_array_equal(res_rev, exp_rev) res = cat > cat_base exp = np.array([False, False, True]) - self.assert_numpy_array_equal(res, exp) + tm.assert_numpy_array_equal(res, exp) # Only categories with same categories can be compared def f(): cat > cat_rev - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) cat_rev_base2 = pd.Categorical( ["b", "b", "b"], categories=["c", "b", "a", "d"]) @@ -499,35 +530,35 @@ def f(): def f(): cat_rev > cat_rev_base2 - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) # Only categories with same ordering information can be compared cat_unorderd = cat.set_ordered(False) - self.assertFalse((cat > cat).any()) + assert not (cat > cat).any() def f(): cat > cat_unorderd - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) # comparison (in both directions) with Series will raise s = Series(["b", "b", "b"]) - self.assertRaises(TypeError, lambda: cat > s) - self.assertRaises(TypeError, lambda: cat_rev > s) - self.assertRaises(TypeError, lambda: s < cat) - self.assertRaises(TypeError, lambda: s < cat_rev) + pytest.raises(TypeError, lambda: cat > s) + pytest.raises(TypeError, lambda: cat_rev > s) + pytest.raises(TypeError, lambda: s < cat) + pytest.raises(TypeError, lambda: s < cat_rev) # comparison with numpy.array will raise in both direction, but only on # newer numpy versions a = np.array(["b", "b", "b"]) - self.assertRaises(TypeError, lambda: cat > a) - self.assertRaises(TypeError, lambda: cat_rev > a) + pytest.raises(TypeError, lambda: cat > a) + pytest.raises(TypeError, lambda: cat_rev > a) # The following work via '__array_priority__ = 1000' # works only on numpy >= 1.7.1 if LooseVersion(np.__version__) > "1.7.1": - self.assertRaises(TypeError, lambda: a < cat) - self.assertRaises(TypeError, lambda: a < cat_rev) + pytest.raises(TypeError, lambda: a < cat) + pytest.raises(TypeError, lambda: a < cat_rev) # Make sure that unequal comparison take the categories order in # account @@ -535,7 +566,7 @@ def f(): list("abc"), categories=list("cba"), ordered=True) exp = np.array([True, False, False]) res = cat_rev > "b" - self.assert_numpy_array_equal(res, exp) + tm.assert_numpy_array_equal(res, exp) def test_argsort(self): c = Categorical([5, 3, 
@@ -556,16 +587,16 @@ def test_numpy_argsort(self):
                                    check_dtype=False)
        msg = "the 'kind' parameter is not supported"
-        tm.assertRaisesRegexp(ValueError, msg, np.argsort,
-                              c, kind='mergesort')
+        tm.assert_raises_regex(ValueError, msg, np.argsort,
+                               c, kind='mergesort')
        msg = "the 'axis' parameter is not supported"
-        tm.assertRaisesRegexp(ValueError, msg, np.argsort,
-                              c, axis=0)
+        tm.assert_raises_regex(ValueError, msg, np.argsort,
+                               c, axis=0)
        msg = "the 'order' parameter is not supported"
-        tm.assertRaisesRegexp(ValueError, msg, np.argsort,
-                              c, order='C')
+        tm.assert_raises_regex(ValueError, msg, np.argsort,
+                               c, order='C')
    def test_na_flags_int_categories(self):
        # #1457
@@ -577,7 +608,7 @@ def test_na_flags_int_categories(self):
        cat = Categorical(labels, categories, fastpath=True)
        repr(cat)
-        self.assert_numpy_array_equal(isnull(cat), labels == -1)
+        tm.assert_numpy_array_equal(isnull(cat), labels == -1)
    def test_categories_none(self):
        factor = Categorical(['a', 'b', 'b', 'a',
@@ -587,7 +618,7 @@ def test_describe(self):
        # string type
        desc = self.factor.describe()
-        self.assertTrue(self.factor.ordered)
+        assert self.factor.ordered
        exp_index = pd.CategoricalIndex(['a', 'b', 'c'], name='categories',
                                        ordered=self.factor.ordered)
        expected = DataFrame({'counts': [3, 2, 3],
@@ -629,64 +660,40 @@ def test_describe(self):
                                                  name='categories'))
        tm.assert_frame_equal(desc, expected)
-        # NA as a category
-        with tm.assert_produces_warning(FutureWarning):
-            cat = pd.Categorical(["a", "c", "c", np.nan],
-                                 categories=["b", "a", "c", np.nan])
-            result = cat.describe()
-
-        expected = DataFrame([[0, 0], [1, 0.25], [2, 0.5], [1, 0.25]],
-                             columns=['counts', 'freqs'],
-                             index=pd.CategoricalIndex(['b', 'a', 'c', np.nan],
-                                                       name='categories'))
-        tm.assert_frame_equal(result, expected, check_categorical=False)
-
-        # NA as an unused category
-        with tm.assert_produces_warning(FutureWarning):
-            cat = pd.Categorical(["a", "c", "c"],
-                                 categories=["b", "a", "c", np.nan])
-            result = cat.describe()
-
-        exp_idx = pd.CategoricalIndex(
-            ['b', 'a', 'c', np.nan], name='categories')
-        expected = DataFrame([[0, 0], [1, 1 / 3.], [2, 2 / 3.], [0, 0]],
-                             columns=['counts', 'freqs'], index=exp_idx)
-        tm.assert_frame_equal(result, expected, check_categorical=False)
-
    def test_print(self):
        expected = ["[a, b, b, a, a, c, c, c]",
                    "Categories (3, object): [a < b < c]"]
        expected = "\n".join(expected)
        actual = repr(self.factor)
-        self.assertEqual(actual, expected)
+        assert actual == expected
    def test_big_print(self):
        factor = Categorical([0, 1, 2, 0, 1, 2] * 100, ['a', 'b', 'c'],
-                             name='cat', fastpath=True)
+                             fastpath=True)
        expected = ["[a, b, c, a, b, ..., b, c, a, b, c]", "Length: 600",
                    "Categories (3, object): [a, b, c]"]
        expected = "\n".join(expected)
        actual = repr(factor)
-        self.assertEqual(actual, expected)
+        assert actual == expected
    def test_empty_print(self):
        factor = Categorical([], ["a", "b", "c"])
        expected = ("[], Categories (3, object): [a, b, c]")
        # hack because array_repr changed in numpy > 1.6.x
        actual = repr(factor)
-        self.assertEqual(actual, expected)
+        assert actual == expected
-        self.assertEqual(expected, actual)
+        assert expected == actual
        factor = Categorical([], ["a", "b", "c"], ordered=True)
        expected = ("[], Categories (3, object): [a < b < c]")
        actual = repr(factor)
-        self.assertEqual(expected, actual)
+        assert expected == actual
        factor = Categorical([], [])
        expected = ("[], Categories (0, object): []")
-        self.assertEqual(expected, repr(factor))
+        assert expected == repr(factor)
    def test_print_none_width(self):
        # GH10087
@@ -695,7 +702,7 @@ def test_print_none_width(self):
               "dtype: category\nCategories (4, int64): [1, 2, 3, 4]")
        with option_context("display.width", None):
-            self.assertEqual(exp, repr(a))
+            assert exp == repr(a)
    def test_unicode_print(self):
        if PY3:
@@ -709,28 +716,26 @@ def test_unicode_print(self):
 Length: 60
 Categories (3, object): [aaaaa, bb, cccc]"""
-        self.assertEqual(_rep(c), expected)
+        assert _rep(c) == expected
-        c = pd.Categorical([u'ああああ', u'いいいいい', u'ううううううう']
-                           * 20)
+        c = pd.Categorical([u'ああああ', u'いいいいい', u'ううううううう'] * 20)
        expected = u"""\
 [ああああ, いいいいい, ううううううう, ああああ, いいいいい, ..., いいいいい, ううううううう, ああああ, いいいいい, ううううううう]
 Length: 60
 Categories (3, object): [ああああ, いいいいい, ううううううう]"""  # noqa
-        self.assertEqual(_rep(c), expected)
+        assert _rep(c) == expected
        # unicode option should not affect to Categorical, as it doesn't care
        # the repr width
        with option_context('display.unicode.east_asian_width', True):
-            c = pd.Categorical([u'ああああ', u'いいいいい', u'ううううううう']
-                               * 20)
+            c = pd.Categorical([u'ああああ', u'いいいいい', u'ううううううう'] * 20)
            expected = u"""[ああああ, いいいいい, ううううううう, ああああ, いいいいい, ..., いいいいい, ううううううう, ああああ, いいいいい, ううううううう]
Length: 60
Categories (3, object): [ああああ, いいいいい, ううううううう]"""  # noqa
-            self.assertEqual(_rep(c), expected)
+            assert _rep(c) == expected
    def test_periodindex(self):
        idx1 = PeriodIndex(['2014-01', '2014-01', '2014-02', '2014-02',
@@ -740,8 +745,8 @@ def test_periodindex(self):
        str(cat1)
        exp_arr = np.array([0, 0, 1, 1, 2, 2], dtype=np.int8)
        exp_idx = PeriodIndex(['2014-01', '2014-02', '2014-03'], freq='M')
-        self.assert_numpy_array_equal(cat1._codes, exp_arr)
-        self.assert_index_equal(cat1.categories, exp_idx)
+        tm.assert_numpy_array_equal(cat1._codes, exp_arr)
+        tm.assert_index_equal(cat1.categories, exp_idx)
        idx2 = PeriodIndex(['2014-03', '2014-03', '2014-02', '2014-01',
                            '2014-03', '2014-01'], freq='M')
@@ -749,8 +754,8 @@ def test_periodindex(self):
        str(cat2)
        exp_arr = np.array([2, 2, 1, 0, 2, 0], dtype=np.int8)
        exp_idx2 = PeriodIndex(['2014-01', '2014-02', '2014-03'], freq='M')
-        self.assert_numpy_array_equal(cat2._codes, exp_arr)
-        self.assert_index_equal(cat2.categories, exp_idx2)
+        tm.assert_numpy_array_equal(cat2._codes, exp_arr)
+        tm.assert_index_equal(cat2.categories, exp_idx2)
        idx3 = PeriodIndex(['2013-12', '2013-11', '2013-10', '2013-09',
                            '2013-08', '2013-07', '2013-05'], freq='M')
@@ -758,81 +763,81 @@ def test_periodindex(self):
        exp_arr = np.array([6, 5, 4, 3, 2, 1, 0], dtype=np.int8)
        exp_idx = PeriodIndex(['2013-05', '2013-07', '2013-08', '2013-09',
                               '2013-10', '2013-11', '2013-12'], freq='M')
-        self.assert_numpy_array_equal(cat3._codes, exp_arr)
-        self.assert_index_equal(cat3.categories, exp_idx)
+        tm.assert_numpy_array_equal(cat3._codes, exp_arr)
+        tm.assert_index_equal(cat3.categories, exp_idx)
    def test_categories_assigments(self):
        s = pd.Categorical(["a", "b", "c", "a"])
        exp = np.array([1, 2, 3, 1], dtype=np.int64)
        s.categories = [1, 2, 3]
-        self.assert_numpy_array_equal(s.__array__(), exp)
-        self.assert_index_equal(s.categories, Index([1, 2, 3]))
+        tm.assert_numpy_array_equal(s.__array__(), exp)
+        tm.assert_index_equal(s.categories, Index([1, 2, 3]))
        # lengthen
        def f():
            s.categories = [1, 2, 3, 4]
-        self.assertRaises(ValueError, f)
+        pytest.raises(ValueError, f)
        # shorten
        def f():
            s.categories = [1, 2]
-        self.assertRaises(ValueError, f)
+        pytest.raises(ValueError, f)
    def test_construction_with_ordered(self):
        # GH 9347, 9190
        cat = Categorical([0, 1, 2])
-        self.assertFalse(cat.ordered)
+        assert not cat.ordered
        cat = Categorical([0, 1, 2], ordered=False)
-        self.assertFalse(cat.ordered)
+        assert not cat.ordered
        cat = Categorical([0, 1, 2], ordered=True)
-        self.assertTrue(cat.ordered)
+        assert cat.ordered
    def test_ordered_api(self):
        # GH 9347
        cat1 = pd.Categorical(["a", "c", "b"], ordered=False)
-        self.assert_index_equal(cat1.categories, Index(['a', 'b', 'c']))
-        self.assertFalse(cat1.ordered)
+        tm.assert_index_equal(cat1.categories, Index(['a', 'b', 'c']))
+        assert not cat1.ordered
        cat2 = pd.Categorical(["a", "c", "b"], categories=['b', 'c', 'a'],
                              ordered=False)
-        self.assert_index_equal(cat2.categories, Index(['b', 'c', 'a']))
-        self.assertFalse(cat2.ordered)
+        tm.assert_index_equal(cat2.categories, Index(['b', 'c', 'a']))
+        assert not cat2.ordered
        cat3 = pd.Categorical(["a", "c", "b"], ordered=True)
-        self.assert_index_equal(cat3.categories, Index(['a', 'b', 'c']))
-        self.assertTrue(cat3.ordered)
+        tm.assert_index_equal(cat3.categories, Index(['a', 'b', 'c']))
+        assert cat3.ordered
        cat4 = pd.Categorical(["a", "c", "b"], categories=['b', 'c', 'a'],
                              ordered=True)
-        self.assert_index_equal(cat4.categories, Index(['b', 'c', 'a']))
-        self.assertTrue(cat4.ordered)
+        tm.assert_index_equal(cat4.categories, Index(['b', 'c', 'a']))
+        assert cat4.ordered
    def test_set_ordered(self):
        cat = Categorical(["a", "b", "c", "a"], ordered=True)
        cat2 = cat.as_unordered()
-        self.assertFalse(cat2.ordered)
+        assert not cat2.ordered
        cat2 = cat.as_ordered()
-        self.assertTrue(cat2.ordered)
+        assert cat2.ordered
        cat2.as_unordered(inplace=True)
-        self.assertFalse(cat2.ordered)
+        assert not cat2.ordered
        cat2.as_ordered(inplace=True)
-        self.assertTrue(cat2.ordered)
+        assert cat2.ordered
-        self.assertTrue(cat2.set_ordered(True).ordered)
-        self.assertFalse(cat2.set_ordered(False).ordered)
+        assert cat2.set_ordered(True).ordered
+        assert not cat2.set_ordered(False).ordered
        cat2.set_ordered(True, inplace=True)
-        self.assertTrue(cat2.ordered)
+        assert cat2.ordered
        cat2.set_ordered(False, inplace=True)
-        self.assertFalse(cat2.ordered)
+        assert not cat2.ordered
        # removed in 0.19.0
        msg = "can\'t set attribute"
-        with tm.assertRaisesRegexp(AttributeError, msg):
+        with tm.assert_raises_regex(AttributeError, msg):
            cat.ordered = True
-        with tm.assertRaisesRegexp(AttributeError, msg):
+        with tm.assert_raises_regex(AttributeError, msg):
            cat.ordered = False
    def test_set_categories(self):
@@ -841,106 +846,104 @@ def test_set_categories(self):
        exp_values = np.array(["a", "b", "c", "a"], dtype=np.object_)
        res = cat.set_categories(["c", "b", "a"], inplace=True)
-        self.assert_index_equal(cat.categories, exp_categories)
-        self.assert_numpy_array_equal(cat.__array__(), exp_values)
-        self.assertIsNone(res)
+        tm.assert_index_equal(cat.categories, exp_categories)
+        tm.assert_numpy_array_equal(cat.__array__(), exp_values)
+        assert res is None
        res = cat.set_categories(["a", "b", "c"])
        # cat must be the same as before
-        self.assert_index_equal(cat.categories, exp_categories)
-        self.assert_numpy_array_equal(cat.__array__(), exp_values)
+        tm.assert_index_equal(cat.categories, exp_categories)
+        tm.assert_numpy_array_equal(cat.__array__(), exp_values)
        # only res is changed
        exp_categories_back = Index(["a", "b", "c"])
-        self.assert_index_equal(res.categories, exp_categories_back)
-        self.assert_numpy_array_equal(res.__array__(), exp_values)
+        tm.assert_index_equal(res.categories, exp_categories_back)
+        tm.assert_numpy_array_equal(res.__array__(), exp_values)
        # not all "old" included in "new" -> all not included ones are now
        # np.nan
        cat = Categorical(["a", "b", "c", "a"], ordered=True)
= Categorical(["a", "b", "c", "a"], ordered=True) res = cat.set_categories(["a"]) - self.assert_numpy_array_equal(res.codes, - np.array([0, -1, -1, 0], dtype=np.int8)) + tm.assert_numpy_array_equal(res.codes, np.array([0, -1, -1, 0], + dtype=np.int8)) # still not all "old" in "new" res = cat.set_categories(["a", "b", "d"]) - self.assert_numpy_array_equal(res.codes, - np.array([0, 1, -1, 0], dtype=np.int8)) - self.assert_index_equal(res.categories, Index(["a", "b", "d"])) + tm.assert_numpy_array_equal(res.codes, np.array([0, 1, -1, 0], + dtype=np.int8)) + tm.assert_index_equal(res.categories, Index(["a", "b", "d"])) # all "old" included in "new" cat = cat.set_categories(["a", "b", "c", "d"]) exp_categories = Index(["a", "b", "c", "d"]) - self.assert_index_equal(cat.categories, exp_categories) + tm.assert_index_equal(cat.categories, exp_categories) # internals... c = Categorical([1, 2, 3, 4, 1], categories=[1, 2, 3, 4], ordered=True) - self.assert_numpy_array_equal(c._codes, - np.array([0, 1, 2, 3, 0], dtype=np.int8)) - self.assert_index_equal(c.categories, Index([1, 2, 3, 4])) + tm.assert_numpy_array_equal(c._codes, np.array([0, 1, 2, 3, 0], + dtype=np.int8)) + tm.assert_index_equal(c.categories, Index([1, 2, 3, 4])) exp = np.array([1, 2, 3, 4, 1], dtype=np.int64) - self.assert_numpy_array_equal(c.get_values(), exp) + tm.assert_numpy_array_equal(c.get_values(), exp) # all "pointers" to '4' must be changed from 3 to 0,... c = c.set_categories([4, 3, 2, 1]) # positions are changed - self.assert_numpy_array_equal(c._codes, - np.array([3, 2, 1, 0, 3], dtype=np.int8)) + tm.assert_numpy_array_equal(c._codes, np.array([3, 2, 1, 0, 3], + dtype=np.int8)) # categories are now in new order - self.assert_index_equal(c.categories, Index([4, 3, 2, 1])) + tm.assert_index_equal(c.categories, Index([4, 3, 2, 1])) # output is the same exp = np.array([1, 2, 3, 4, 1], dtype=np.int64) - self.assert_numpy_array_equal(c.get_values(), exp) - self.assertTrue(c.min(), 4) - self.assertTrue(c.max(), 1) + tm.assert_numpy_array_equal(c.get_values(), exp) + assert c.min() == 4 + assert c.max() == 1 # set_categories should set the ordering if specified c2 = c.set_categories([4, 3, 2, 1], ordered=False) - self.assertFalse(c2.ordered) - self.assert_numpy_array_equal(c.get_values(), c2.get_values()) + assert not c2.ordered + + tm.assert_numpy_array_equal(c.get_values(), c2.get_values()) # set_categories should pass thru the ordering c2 = c.set_ordered(False).set_categories([4, 3, 2, 1]) - self.assertFalse(c2.ordered) - self.assert_numpy_array_equal(c.get_values(), c2.get_values()) + assert not c2.ordered + + tm.assert_numpy_array_equal(c.get_values(), c2.get_values()) def test_rename_categories(self): cat = pd.Categorical(["a", "b", "c", "a"]) # inplace=False: the old one must not be changed res = cat.rename_categories([1, 2, 3]) - self.assert_numpy_array_equal(res.__array__(), - np.array([1, 2, 3, 1], dtype=np.int64)) - self.assert_index_equal(res.categories, Index([1, 2, 3])) + tm.assert_numpy_array_equal(res.__array__(), np.array([1, 2, 3, 1], + dtype=np.int64)) + tm.assert_index_equal(res.categories, Index([1, 2, 3])) exp_cat = np.array(["a", "b", "c", "a"], dtype=np.object_) - self.assert_numpy_array_equal(cat.__array__(), exp_cat) + tm.assert_numpy_array_equal(cat.__array__(), exp_cat) exp_cat = Index(["a", "b", "c"]) - self.assert_index_equal(cat.categories, exp_cat) + tm.assert_index_equal(cat.categories, exp_cat) res = cat.rename_categories([1, 2, 3], inplace=True) # and now inplace - self.assertIsNone(res) - 
self.assert_numpy_array_equal(cat.__array__(), - np.array([1, 2, 3, 1], dtype=np.int64)) - self.assert_index_equal(cat.categories, Index([1, 2, 3])) + assert res is None + tm.assert_numpy_array_equal(cat.__array__(), np.array([1, 2, 3, 1], + dtype=np.int64)) + tm.assert_index_equal(cat.categories, Index([1, 2, 3])) - # lengthen - def f(): + # Lengthen + with pytest.raises(ValueError): cat.rename_categories([1, 2, 3, 4]) - self.assertRaises(ValueError, f) - - # shorten - def f(): + # Shorten + with pytest.raises(ValueError): cat.rename_categories([1, 2]) - self.assertRaises(ValueError, f) - def test_reorder_categories(self): cat = Categorical(["a", "b", "c", "a"], ordered=True) old = cat.copy() @@ -950,14 +953,14 @@ def test_reorder_categories(self): # first inplace == False res = cat.reorder_categories(["c", "b", "a"]) # cat must be the same as before - self.assert_categorical_equal(cat, old) + tm.assert_categorical_equal(cat, old) # only res is changed - self.assert_categorical_equal(res, new) + tm.assert_categorical_equal(res, new) # inplace == True res = cat.reorder_categories(["c", "b", "a"], inplace=True) - self.assertIsNone(res) - self.assert_categorical_equal(cat, new) + assert res is None + tm.assert_categorical_equal(cat, new) # not all "old" included in "new" cat = Categorical(["a", "b", "c", "a"], ordered=True) @@ -965,19 +968,19 @@ def test_reorder_categories(self): def f(): cat.reorder_categories(["a"]) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # still not all "old" in "new" def f(): cat.reorder_categories(["a", "b", "d"]) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # all "old" included in "new", but too long def f(): cat.reorder_categories(["a", "b", "c", "d"]) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) def test_add_categories(self): cat = Categorical(["a", "b", "c", "a"], ordered=True) @@ -987,23 +990,23 @@ def test_add_categories(self): # first inplace == False res = cat.add_categories("d") - self.assert_categorical_equal(cat, old) - self.assert_categorical_equal(res, new) + tm.assert_categorical_equal(cat, old) + tm.assert_categorical_equal(res, new) res = cat.add_categories(["d"]) - self.assert_categorical_equal(cat, old) - self.assert_categorical_equal(res, new) + tm.assert_categorical_equal(cat, old) + tm.assert_categorical_equal(res, new) # inplace == True res = cat.add_categories("d", inplace=True) - self.assert_categorical_equal(cat, new) - self.assertIsNone(res) + tm.assert_categorical_equal(cat, new) + assert res is None # new is in old categories def f(): cat.add_categories(["d"]) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # GH 9927 cat = Categorical(list("abc"), ordered=True) @@ -1011,13 +1014,13 @@ def f(): list("abc"), categories=list("abcde"), ordered=True) # test with Series, np.array, index, list res = cat.add_categories(Series(["d", "e"])) - self.assert_categorical_equal(res, expected) + tm.assert_categorical_equal(res, expected) res = cat.add_categories(np.array(["d", "e"])) - self.assert_categorical_equal(res, expected) + tm.assert_categorical_equal(res, expected) res = cat.add_categories(Index(["d", "e"])) - self.assert_categorical_equal(res, expected) + tm.assert_categorical_equal(res, expected) res = cat.add_categories(["d", "e"]) - self.assert_categorical_equal(res, expected) + tm.assert_categorical_equal(res, expected) def test_remove_categories(self): cat = Categorical(["a", "b", "c", "a"], ordered=True) @@ -1027,23 +1030,23 @@ def 
test_remove_categories(self): # first inplace == False res = cat.remove_categories("c") - self.assert_categorical_equal(cat, old) - self.assert_categorical_equal(res, new) + tm.assert_categorical_equal(cat, old) + tm.assert_categorical_equal(res, new) res = cat.remove_categories(["c"]) - self.assert_categorical_equal(cat, old) - self.assert_categorical_equal(res, new) + tm.assert_categorical_equal(cat, old) + tm.assert_categorical_equal(res, new) # inplace == True res = cat.remove_categories("c", inplace=True) - self.assert_categorical_equal(cat, new) - self.assertIsNone(res) + tm.assert_categorical_equal(cat, new) + assert res is None # removal is not in categories def f(): cat.remove_categories(["c"]) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) def test_remove_unused_categories(self): c = Categorical(["a", "b", "c", "d", "a"], @@ -1051,33 +1054,33 @@ def test_remove_unused_categories(self): exp_categories_all = Index(["a", "b", "c", "d", "e"]) exp_categories_dropped = Index(["a", "b", "c", "d"]) - self.assert_index_equal(c.categories, exp_categories_all) + tm.assert_index_equal(c.categories, exp_categories_all) res = c.remove_unused_categories() - self.assert_index_equal(res.categories, exp_categories_dropped) - self.assert_index_equal(c.categories, exp_categories_all) + tm.assert_index_equal(res.categories, exp_categories_dropped) + tm.assert_index_equal(c.categories, exp_categories_all) res = c.remove_unused_categories(inplace=True) - self.assert_index_equal(c.categories, exp_categories_dropped) - self.assertIsNone(res) + tm.assert_index_equal(c.categories, exp_categories_dropped) + assert res is None # with NaN values (GH11599) c = Categorical(["a", "b", "c", np.nan], categories=["a", "b", "c", "d", "e"]) res = c.remove_unused_categories() - self.assert_index_equal(res.categories, - Index(np.array(["a", "b", "c"]))) + tm.assert_index_equal(res.categories, + Index(np.array(["a", "b", "c"]))) exp_codes = np.array([0, 1, 2, -1], dtype=np.int8) - self.assert_numpy_array_equal(res.codes, exp_codes) - self.assert_index_equal(c.categories, exp_categories_all) + tm.assert_numpy_array_equal(res.codes, exp_codes) + tm.assert_index_equal(c.categories, exp_categories_all) val = ['F', np.nan, 'D', 'B', 'D', 'F', np.nan] cat = pd.Categorical(values=val, categories=list('ABCDEFG')) out = cat.remove_unused_categories() - self.assert_index_equal(out.categories, Index(['B', 'D', 'F'])) + tm.assert_index_equal(out.categories, Index(['B', 'D', 'F'])) exp_codes = np.array([2, -1, 1, 0, 1, 2, -1], dtype=np.int8) - self.assert_numpy_array_equal(out.codes, exp_codes) - self.assertEqual(out.get_values().tolist(), val) + tm.assert_numpy_array_equal(out.codes, exp_codes) + assert out.get_values().tolist() == val alpha = list('abcdefghijklmnopqrstuvwxyz') val = np.random.choice(alpha[::2], 10000).astype('object') @@ -1085,118 +1088,46 @@ def test_remove_unused_categories(self): cat = pd.Categorical(values=val, categories=alpha) out = cat.remove_unused_categories() - self.assertEqual(out.get_values().tolist(), val.tolist()) + assert out.get_values().tolist() == val.tolist() def test_nan_handling(self): # Nans are represented as -1 in codes c = Categorical(["a", "b", np.nan, "a"]) - self.assert_index_equal(c.categories, Index(["a", "b"])) - self.assert_numpy_array_equal(c._codes, - np.array([0, 1, -1, 0], dtype=np.int8)) + tm.assert_index_equal(c.categories, Index(["a", "b"])) + tm.assert_numpy_array_equal(c._codes, np.array([0, 1, -1, 0], + dtype=np.int8)) c[1] = np.nan - 
-        self.assert_numpy_array_equal(c._codes,
-                                      np.array([0, -1, -1, 0], dtype=np.int8))
-
-        # If categories have nan included, the code should point to that
-        # instead
-        with tm.assert_produces_warning(FutureWarning):
-            c = Categorical(["a", "b", np.nan, "a"],
-                            categories=["a", "b", np.nan])
-        self.assert_index_equal(c.categories, Index(["a", "b", np.nan]))
-        self.assert_numpy_array_equal(c._codes,
-                                      np.array([0, 1, 2, 0], dtype=np.int8))
-        c[1] = np.nan
-        self.assert_index_equal(c.categories, Index(["a", "b", np.nan]))
-        self.assert_numpy_array_equal(c._codes,
-                                      np.array([0, 2, 2, 0], dtype=np.int8))
-
-        # Changing categories should also make the replaced category np.nan
-        c = Categorical(["a", "b", "c", "a"])
-        with tm.assert_produces_warning(FutureWarning):
-            c.categories = ["a", "b", np.nan]  # noqa
-
-        self.assert_index_equal(c.categories, Index(["a", "b", np.nan]))
-        self.assert_numpy_array_equal(c._codes,
-                                      np.array([0, 1, 2, 0], dtype=np.int8))
+        tm.assert_index_equal(c.categories, Index(["a", "b"]))
+        tm.assert_numpy_array_equal(c._codes, np.array([0, -1, -1, 0],
+                                                       dtype=np.int8))
        # Adding nan to categories should make assigned nan point to the
        # category!
        c = Categorical(["a", "b", np.nan, "a"])
-        self.assert_index_equal(c.categories, Index(["a", "b"]))
-        self.assert_numpy_array_equal(c._codes,
-                                      np.array([0, 1, -1, 0], dtype=np.int8))
-        with tm.assert_produces_warning(FutureWarning):
-            c.set_categories(["a", "b", np.nan], rename=True, inplace=True)
-
-        self.assert_index_equal(c.categories, Index(["a", "b", np.nan]))
-        self.assert_numpy_array_equal(c._codes,
-                                      np.array([0, 1, -1, 0], dtype=np.int8))
-        c[1] = np.nan
-        self.assert_index_equal(c.categories, Index(["a", "b", np.nan]))
-        self.assert_numpy_array_equal(c._codes,
-                                      np.array([0, 2, -1, 0], dtype=np.int8))
-
-        # Remove null categories (GH 10156)
-        cases = [([1.0, 2.0, np.nan], [1.0, 2.0]),
-                 (['a', 'b', None], ['a', 'b']),
-                 ([pd.Timestamp('2012-05-01'), pd.NaT],
-                  [pd.Timestamp('2012-05-01')])]
-
-        null_values = [np.nan, None, pd.NaT]
-
-        for with_null, without in cases:
-            with tm.assert_produces_warning(FutureWarning):
-                base = Categorical([], with_null)
-            expected = Categorical([], without)
-
-            for nullval in null_values:
-                result = base.remove_categories(nullval)
-                self.assert_categorical_equal(result, expected)
-
-        # Different null values are indistinguishable
-        for i, j in [(0, 1), (0, 2), (1, 2)]:
-            nulls = [null_values[i], null_values[j]]
-
-            def f():
-                with tm.assert_produces_warning(FutureWarning):
-                    Categorical([], categories=nulls)
-
-            self.assertRaises(ValueError, f)
+        tm.assert_index_equal(c.categories, Index(["a", "b"]))
+        tm.assert_numpy_array_equal(c._codes, np.array([0, 1, -1, 0],
+                                                       dtype=np.int8))
    def test_isnull(self):
        exp = np.array([False, False, True])
        c = Categorical(["a", "b", np.nan])
        res = c.isnull()
-        self.assert_numpy_array_equal(res, exp)
-        with tm.assert_produces_warning(FutureWarning):
-            c = Categorical(["a", "b", np.nan], categories=["a", "b", np.nan])
-            res = c.isnull()
-        self.assert_numpy_array_equal(res, exp)
-
-        # test both nan in categories and as -1
-        exp = np.array([True, False, True])
-        c = Categorical(["a", "b", np.nan])
-        with tm.assert_produces_warning(FutureWarning):
-            c.set_categories(["a", "b", np.nan], rename=True, inplace=True)
-        c[0] = np.nan
-        res = c.isnull()
-        self.assert_numpy_array_equal(res, exp)
+        tm.assert_numpy_array_equal(res, exp)
    def test_codes_immutable(self):
        # Codes should be read only
        c = Categorical(["a", "b", "c", "a", np.nan])
        exp = np.array([0, 1, 2, 0, -1], dtype='int8')
-        self.assert_numpy_array_equal(c.codes, exp)
+        tm.assert_numpy_array_equal(c.codes, exp)
        # Assignments to codes should raise
        def f():
            c.codes = np.array([0, 1, 2, 0, 1], dtype='int8')
-        self.assertRaises(ValueError, f)
+        pytest.raises(ValueError, f)
        # changes in the codes array should raise
        # np 1.6.1 raises RuntimeError rather than ValueError
@@ -1205,76 +1136,76 @@ def f():
        def f():
            codes[4] = 1
-        self.assertRaises(ValueError, f)
+        pytest.raises(ValueError, f)
        # But even after getting the codes, the original array should still be
        # writeable!
        c[4] = "a"
        exp = np.array([0, 1, 2, 0, 0], dtype='int8')
-        self.assert_numpy_array_equal(c.codes, exp)
+        tm.assert_numpy_array_equal(c.codes, exp)
        c._codes[4] = 2
        exp = np.array([0, 1, 2, 0, 2], dtype='int8')
-        self.assert_numpy_array_equal(c.codes, exp)
+        tm.assert_numpy_array_equal(c.codes, exp)
    def test_min_max(self):
        # unordered cats have no min/max
        cat = Categorical(["a", "b", "c", "d"], ordered=False)
-        self.assertRaises(TypeError, lambda: cat.min())
-        self.assertRaises(TypeError, lambda: cat.max())
+        pytest.raises(TypeError, lambda: cat.min())
+        pytest.raises(TypeError, lambda: cat.max())
        cat = Categorical(["a", "b", "c", "d"], ordered=True)
        _min = cat.min()
        _max = cat.max()
-        self.assertEqual(_min, "a")
-        self.assertEqual(_max, "d")
+        assert _min == "a"
+        assert _max == "d"
        cat = Categorical(["a", "b", "c", "d"],
                          categories=['d', 'c', 'b', 'a'], ordered=True)
        _min = cat.min()
        _max = cat.max()
-        self.assertEqual(_min, "d")
-        self.assertEqual(_max, "a")
+        assert _min == "d"
+        assert _max == "a"
        cat = Categorical([np.nan, "b", "c", np.nan],
                          categories=['d', 'c', 'b', 'a'], ordered=True)
        _min = cat.min()
        _max = cat.max()
-        self.assertTrue(np.isnan(_min))
-        self.assertEqual(_max, "b")
+        assert np.isnan(_min)
+        assert _max == "b"
        _min = cat.min(numeric_only=True)
-        self.assertEqual(_min, "c")
+        assert _min == "c"
        _max = cat.max(numeric_only=True)
-        self.assertEqual(_max, "b")
+        assert _max == "b"
        cat = Categorical([np.nan, 1, 2, np.nan], categories=[5, 4, 3, 2, 1],
                          ordered=True)
        _min = cat.min()
        _max = cat.max()
-        self.assertTrue(np.isnan(_min))
-        self.assertEqual(_max, 1)
+        assert np.isnan(_min)
+        assert _max == 1
        _min = cat.min(numeric_only=True)
-        self.assertEqual(_min, 2)
+        assert _min == 2
        _max = cat.max(numeric_only=True)
-        self.assertEqual(_max, 1)
+        assert _max == 1
    def test_unique(self):
        # categories are reordered based on value when ordered=False
        cat = Categorical(["a", "b"])
        exp = Index(["a", "b"])
        res = cat.unique()
-        self.assert_index_equal(res.categories, exp)
-        self.assert_categorical_equal(res, cat)
+        tm.assert_index_equal(res.categories, exp)
+        tm.assert_categorical_equal(res, cat)
        cat = Categorical(["a", "b", "a", "a"], categories=["a", "b", "c"])
        res = cat.unique()
-        self.assert_index_equal(res.categories, exp)
+        tm.assert_index_equal(res.categories, exp)
        tm.assert_categorical_equal(res, Categorical(exp))
        cat = Categorical(["c", "a", "b", "a", "a"],
                          categories=["a", "b", "c"])
        exp = Index(["c", "a", "b"])
        res = cat.unique()
-        self.assert_index_equal(res.categories, exp)
+        tm.assert_index_equal(res.categories, exp)
        exp_cat = Categorical(exp, categories=['c', 'a', 'b'])
        tm.assert_categorical_equal(res, exp_cat)
@@ -1283,7 +1214,7 @@ def test_unique(self):
                          categories=["a", "b", "c"])
        res = cat.unique()
        exp = Index(["b", "a"])
-        self.assert_index_equal(res.categories, exp)
+        tm.assert_index_equal(res.categories, exp)
        exp_cat = Categorical(["b", np.nan, "a"], categories=["b", "a"])
        tm.assert_categorical_equal(res, exp_cat)
@@ -1352,13 +1283,14 @@ def test_mode(self):
        s = Categorical([1, 2, 3, 4, 5], categories=[5, 4, 3, 2, 1],
                        ordered=True)
        res = s.mode()
-        exp = Categorical([], categories=[5, 4, 3, 2, 1], ordered=True)
+        exp = Categorical([5, 4, 3, 2, 1],
+                          categories=[5, 4, 3, 2, 1], ordered=True)
        tm.assert_categorical_equal(res, exp)
        # NaN should not become the mode!
        s = Categorical([np.nan, np.nan, np.nan, 4, 5],
                        categories=[5, 4, 3, 2, 1], ordered=True)
        res = s.mode()
-        exp = Categorical([], categories=[5, 4, 3, 2, 1], ordered=True)
+        exp = Categorical([5, 4], categories=[5, 4, 3, 2, 1], ordered=True)
        tm.assert_categorical_equal(res, exp)
        s = Categorical([np.nan, np.nan, np.nan, 4, 5, 4],
                        categories=[5, 4, 3, 2, 1], ordered=True)
@@ -1382,35 +1314,35 @@ def test_sort_values(self):
        # sort_values
        res = cat.sort_values()
        exp = np.array(["a", "b", "c", "d"], dtype=object)
-        self.assert_numpy_array_equal(res.__array__(), exp)
-        self.assert_index_equal(res.categories, cat.categories)
+        tm.assert_numpy_array_equal(res.__array__(), exp)
+        tm.assert_index_equal(res.categories, cat.categories)
        cat = Categorical(["a", "c", "b", "d"],
                          categories=["a", "b", "c", "d"], ordered=True)
        res = cat.sort_values()
        exp = np.array(["a", "b", "c", "d"], dtype=object)
-        self.assert_numpy_array_equal(res.__array__(), exp)
-        self.assert_index_equal(res.categories, cat.categories)
+        tm.assert_numpy_array_equal(res.__array__(), exp)
+        tm.assert_index_equal(res.categories, cat.categories)
        res = cat.sort_values(ascending=False)
        exp = np.array(["d", "c", "b", "a"], dtype=object)
-        self.assert_numpy_array_equal(res.__array__(), exp)
-        self.assert_index_equal(res.categories, cat.categories)
+        tm.assert_numpy_array_equal(res.__array__(), exp)
+        tm.assert_index_equal(res.categories, cat.categories)
        # sort (inplace order)
        cat1 = cat.copy()
        cat1.sort_values(inplace=True)
        exp = np.array(["a", "b", "c", "d"], dtype=object)
-        self.assert_numpy_array_equal(cat1.__array__(), exp)
-        self.assert_index_equal(res.categories, cat.categories)
+        tm.assert_numpy_array_equal(cat1.__array__(), exp)
+        tm.assert_index_equal(res.categories, cat.categories)
        # reverse
        cat = Categorical(["a", "c", "c", "b", "d"], ordered=True)
        res = cat.sort_values(ascending=False)
        exp_val = np.array(["d", "c", "c", "b", "a"], dtype=object)
        exp_categories = Index(["a", "b", "c", "d"])
-        self.assert_numpy_array_equal(res.__array__(), exp_val)
-        self.assert_index_equal(res.categories, exp_categories)
+        tm.assert_numpy_array_equal(res.__array__(), exp_val)
+        tm.assert_index_equal(res.categories, exp_categories)
    def test_sort_values_na_position(self):
        # see gh-12882
@@ -1419,93 +1351,58 @@ def test_sort_values_na_position(self):
        exp = np.array([2.0, 2.0, 5.0, np.nan, np.nan])
        res = cat.sort_values()  # default arguments
-        self.assert_numpy_array_equal(res.__array__(), exp)
-        self.assert_index_equal(res.categories, exp_categories)
+        tm.assert_numpy_array_equal(res.__array__(), exp)
+        tm.assert_index_equal(res.categories, exp_categories)
        exp = np.array([np.nan, np.nan, 2.0, 2.0, 5.0])
        res = cat.sort_values(ascending=True, na_position='first')
-        self.assert_numpy_array_equal(res.__array__(), exp)
-        self.assert_index_equal(res.categories, exp_categories)
+        tm.assert_numpy_array_equal(res.__array__(), exp)
+        tm.assert_index_equal(res.categories, exp_categories)
        exp = np.array([np.nan, np.nan, 5.0, 2.0, 2.0])
        res = cat.sort_values(ascending=False, na_position='first')
-        self.assert_numpy_array_equal(res.__array__(), exp)
-        self.assert_index_equal(res.categories, exp_categories)
+        tm.assert_numpy_array_equal(res.__array__(), exp)
+        tm.assert_index_equal(res.categories, exp_categories)
        exp = np.array([2.0, 2.0, 5.0, np.nan, np.nan])
        res = cat.sort_values(ascending=True, na_position='last')
-        self.assert_numpy_array_equal(res.__array__(), exp)
-        self.assert_index_equal(res.categories, exp_categories)
+        tm.assert_numpy_array_equal(res.__array__(), exp)
+        tm.assert_index_equal(res.categories, exp_categories)
        exp = np.array([5.0, 2.0, 2.0, np.nan, np.nan])
        res = cat.sort_values(ascending=False, na_position='last')
-        self.assert_numpy_array_equal(res.__array__(), exp)
-        self.assert_index_equal(res.categories, exp_categories)
+        tm.assert_numpy_array_equal(res.__array__(), exp)
+        tm.assert_index_equal(res.categories, exp_categories)
        cat = Categorical(["a", "c", "b", "d", np.nan], ordered=True)
        res = cat.sort_values(ascending=False, na_position='last')
        exp_val = np.array(["d", "c", "b", "a", np.nan], dtype=object)
        exp_categories = Index(["a", "b", "c", "d"])
-        self.assert_numpy_array_equal(res.__array__(), exp_val)
-        self.assert_index_equal(res.categories, exp_categories)
+        tm.assert_numpy_array_equal(res.__array__(), exp_val)
+        tm.assert_index_equal(res.categories, exp_categories)
        cat = Categorical(["a", "c", "b", "d", np.nan], ordered=True)
        res = cat.sort_values(ascending=False, na_position='first')
        exp_val = np.array([np.nan, "d", "c", "b", "a"], dtype=object)
        exp_categories = Index(["a", "b", "c", "d"])
-        self.assert_numpy_array_equal(res.__array__(), exp_val)
-        self.assert_index_equal(res.categories, exp_categories)
+        tm.assert_numpy_array_equal(res.__array__(), exp_val)
+        tm.assert_index_equal(res.categories, exp_categories)
    def test_slicing_directly(self):
        cat = Categorical(["a", "b", "c", "d", "a", "b", "c"])
        sliced = cat[3]
-        self.assertEqual(sliced, "d")
+        assert sliced == "d"
        sliced = cat[3:5]
        expected = Categorical(["d", "a"], categories=['a', 'b', 'c', 'd'])
-        self.assert_numpy_array_equal(sliced._codes, expected._codes)
+        tm.assert_numpy_array_equal(sliced._codes, expected._codes)
        tm.assert_index_equal(sliced.categories, expected.categories)
    def test_set_item_nan(self):
        cat = pd.Categorical([1, 2, 3])
-        exp = pd.Categorical([1, np.nan, 3], categories=[1, 2, 3])
        cat[1] = np.nan
-        tm.assert_categorical_equal(cat, exp)
-
-        # if nan in categories, the proper code should be set!
-        cat = pd.Categorical([1, 2, 3, np.nan], categories=[1, 2, 3])
-        with tm.assert_produces_warning(FutureWarning):
-            cat.set_categories([1, 2, 3, np.nan], rename=True, inplace=True)
-        cat[1] = np.nan
-        exp = np.array([0, 3, 2, -1], dtype=np.int8)
-        self.assert_numpy_array_equal(cat.codes, exp)
-
-        cat = pd.Categorical([1, 2, 3, np.nan], categories=[1, 2, 3])
-        with tm.assert_produces_warning(FutureWarning):
-            cat.set_categories([1, 2, 3, np.nan], rename=True, inplace=True)
-        cat[1:3] = np.nan
-        exp = np.array([0, 3, 3, -1], dtype=np.int8)
-        self.assert_numpy_array_equal(cat.codes, exp)
-
-        cat = pd.Categorical([1, 2, 3, np.nan], categories=[1, 2, 3])
-        with tm.assert_produces_warning(FutureWarning):
-            cat.set_categories([1, 2, 3, np.nan], rename=True, inplace=True)
-        cat[1:3] = [np.nan, 1]
-        exp = np.array([0, 3, 0, -1], dtype=np.int8)
-        self.assert_numpy_array_equal(cat.codes, exp)
-
-        cat = pd.Categorical([1, 2, 3, np.nan], categories=[1, 2, 3])
-        with tm.assert_produces_warning(FutureWarning):
-            cat.set_categories([1, 2, 3, np.nan], rename=True, inplace=True)
-        cat[1:3] = [np.nan, np.nan]
-        exp = np.array([0, 3, 3, -1], dtype=np.int8)
-        self.assert_numpy_array_equal(cat.codes, exp)
-
-        cat = pd.Categorical([1, 2, np.nan, 3], categories=[1, 2, 3])
-        with tm.assert_produces_warning(FutureWarning):
-            cat.set_categories([1, 2, 3, np.nan], rename=True, inplace=True)
-        cat[pd.isnull(cat)] = np.nan
-        exp = np.array([0, 1, 3, 2], dtype=np.int8)
-        self.assert_numpy_array_equal(cat.codes, exp)
+        exp = pd.Categorical([1, np.nan, 3], categories=[1, 2, 3])
+        tm.assert_categorical_equal(cat, exp)
    def test_shift(self):
        # GH 9416
@@ -1514,84 +1411,91 @@ def test_shift(self):
        # shift forward
        sp1 = cat.shift(1)
        xp1 = pd.Categorical([np.nan, 'a', 'b', 'c', 'd'])
-        self.assert_categorical_equal(sp1, xp1)
-        self.assert_categorical_equal(cat[:-1], sp1[1:])
+        tm.assert_categorical_equal(sp1, xp1)
+        tm.assert_categorical_equal(cat[:-1], sp1[1:])
        # shift back
        sn2 = cat.shift(-2)
        xp2 = pd.Categorical(['c', 'd', 'a', np.nan, np.nan],
                             categories=['a', 'b', 'c', 'd'])
-        self.assert_categorical_equal(sn2, xp2)
-        self.assert_categorical_equal(cat[2:], sn2[:-2])
+        tm.assert_categorical_equal(sn2, xp2)
+        tm.assert_categorical_equal(cat[2:], sn2[:-2])
        # shift by zero
-        self.assert_categorical_equal(cat, cat.shift(0))
+        tm.assert_categorical_equal(cat, cat.shift(0))
    def test_nbytes(self):
        cat = pd.Categorical([1, 2, 3])
        exp = cat._codes.nbytes + cat._categories.values.nbytes
-        self.assertEqual(cat.nbytes, exp)
+        assert cat.nbytes == exp
    def test_memory_usage(self):
        cat = pd.Categorical([1, 2, 3])
-        self.assertEqual(cat.nbytes, cat.memory_usage())
-        self.assertEqual(cat.nbytes, cat.memory_usage(deep=True))
+
+        # .categories is an index, so we include the hashtable
+        assert 0 < cat.nbytes <= cat.memory_usage()
+        assert 0 < cat.nbytes <= cat.memory_usage(deep=True)
        cat = pd.Categorical(['foo', 'foo', 'bar'])
-        self.assertEqual(cat.nbytes, cat.memory_usage())
-        self.assertTrue(cat.memory_usage(deep=True) > cat.nbytes)
+        assert cat.memory_usage(deep=True) > cat.nbytes
        # sys.getsizeof will call the .memory_usage with
        # deep=True, and add on some GC overhead
        diff = cat.memory_usage(deep=True) - sys.getsizeof(cat)
-        self.assertTrue(abs(diff) < 100)
+        assert abs(diff) < 100
    def test_searchsorted(self):
        # https://github.com/pandas-dev/pandas/issues/8420
-        s1 = pd.Series(['apple', 'bread', 'bread', 'cheese', 'milk'])
-        s2 = pd.Series(['apple', 'bread', 'bread', 'cheese', 'milk', 'donuts'])
-        c1 = pd.Categorical(s1, ordered=True)
-        c2 = pd.Categorical(s2, ordered=True)
-
-        # Single item array
-        res = c1.searchsorted(['bread'])
-        chk = s1.searchsorted(['bread'])
-        exp = np.array([1], dtype=np.intp)
-        self.assert_numpy_array_equal(res, exp)
-        self.assert_numpy_array_equal(res, chk)
-
-        # Scalar version of single item array
-        # Categorical return np.array like pd.Series, but different from
-        # np.array.searchsorted()
-        res = c1.searchsorted('bread')
-        chk = s1.searchsorted('bread')
-        exp = np.array([1], dtype=np.intp)
-        self.assert_numpy_array_equal(res, exp)
-        self.assert_numpy_array_equal(res, chk)
-
-        # Searching for a value that is not present in the Categorical
-        res = c1.searchsorted(['bread', 'eggs'])
-        chk = s1.searchsorted(['bread', 'eggs'])
-        exp = np.array([1, 4], dtype=np.intp)
-        self.assert_numpy_array_equal(res, exp)
-        self.assert_numpy_array_equal(res, chk)
-
-        # Searching for a value that is not present, to the right
-        res = c1.searchsorted(['bread', 'eggs'], side='right')
-        chk = s1.searchsorted(['bread', 'eggs'], side='right')
-        exp = np.array([3, 4], dtype=np.intp)  # eggs before milk
-        self.assert_numpy_array_equal(res, exp)
-        self.assert_numpy_array_equal(res, chk)
-
-        # As above, but with a sorter array to reorder an unsorted array
-        res = c2.searchsorted(['bread', 'eggs'], side='right',
-                              sorter=[0, 1, 2, 3, 5, 4])
-        chk = s2.searchsorted(['bread', 'eggs'], side='right',
-                              sorter=[0, 1, 2, 3, 5, 4])
-        # eggs after donuts, after switching milk and donuts
+        # https://github.com/pandas-dev/pandas/issues/14522
+
+        c1 = pd.Categorical(['cheese', 'milk', 'apple', 'bread', 'bread'],
+                            categories=['cheese', 'milk', 'apple', 'bread'],
+                            ordered=True)
+        s1 = pd.Series(c1)
+        c2 = pd.Categorical(['cheese', 'milk', 'apple', 'bread', 'bread'],
+                            categories=['cheese', 'milk', 'apple', 'bread'],
+                            ordered=False)
+        s2 = pd.Series(c2)
+
+        # Searching for single item argument, side='left' (default)
+        res_cat = c1.searchsorted('apple')
+        res_ser = s1.searchsorted('apple')
+        exp = np.array([2], dtype=np.intp)
+        tm.assert_numpy_array_equal(res_cat, exp)
+        tm.assert_numpy_array_equal(res_ser, exp)
+
+        # Searching for single item array, side='left' (default)
+        res_cat = c1.searchsorted(['bread'])
+        res_ser = s1.searchsorted(['bread'])
+        exp = np.array([3], dtype=np.intp)
+        tm.assert_numpy_array_equal(res_cat, exp)
+        tm.assert_numpy_array_equal(res_ser, exp)
+
+        # Searching for several items array, side='right'
+        res_cat = c1.searchsorted(['apple', 'bread'], side='right')
+        res_ser = s1.searchsorted(['apple', 'bread'], side='right')
        exp = np.array([3, 5], dtype=np.intp)
-        self.assert_numpy_array_equal(res, exp)
-        self.assert_numpy_array_equal(res, chk)
+        tm.assert_numpy_array_equal(res_cat, exp)
+        tm.assert_numpy_array_equal(res_ser, exp)
+
+        # Searching for a single value that is not from the Categorical
+        pytest.raises(ValueError, lambda: c1.searchsorted('cucumber'))
+        pytest.raises(ValueError, lambda: s1.searchsorted('cucumber'))
+
+        # Searching for multiple values one of each is not from the Categorical
+        pytest.raises(ValueError,
+                      lambda: c1.searchsorted(['bread', 'cucumber']))
+        pytest.raises(ValueError,
+                      lambda: s1.searchsorted(['bread', 'cucumber']))
+
+        # searchsorted call for unordered Categorical
+        pytest.raises(ValueError, lambda: c2.searchsorted('apple'))
+        pytest.raises(ValueError, lambda: s2.searchsorted('apple'))
+
+        with tm.assert_produces_warning(FutureWarning):
+            res = c1.searchsorted(v=['bread'])
+            exp = np.array([3], dtype=np.intp)
+            tm.assert_numpy_array_equal(res, exp)
    def
test_deprecated_labels(self): # TODO: labels is deprecated and should be removed in 0.18 or 2017, @@ -1600,37 +1504,28 @@ def test_deprecated_labels(self): exp = cat.codes with tm.assert_produces_warning(FutureWarning): res = cat.labels - self.assert_numpy_array_equal(res, exp) + tm.assert_numpy_array_equal(res, exp) def test_deprecated_from_array(self): # GH13854, `.from_array` is deprecated with tm.assert_produces_warning(FutureWarning): Categorical.from_array([0, 1]) - def test_removed_names_produces_warning(self): - - # 10482 - with tm.assert_produces_warning(UserWarning): - Categorical([0, 1], name="a") - - with tm.assert_produces_warning(UserWarning): - Categorical.from_codes([1, 2], ["a", "b", "c"], name="a") - def test_datetime_categorical_comparison(self): dt_cat = pd.Categorical( pd.date_range('2014-01-01', periods=3), ordered=True) - self.assert_numpy_array_equal(dt_cat > dt_cat[0], - np.array([False, True, True])) - self.assert_numpy_array_equal(dt_cat[0] < dt_cat, - np.array([False, True, True])) + tm.assert_numpy_array_equal(dt_cat > dt_cat[0], + np.array([False, True, True])) + tm.assert_numpy_array_equal(dt_cat[0] < dt_cat, + np.array([False, True, True])) def test_reflected_comparison_with_scalars(self): # GH8658 cat = pd.Categorical([1, 2, 3], ordered=True) - self.assert_numpy_array_equal(cat > cat[0], - np.array([False, True, True])) - self.assert_numpy_array_equal(cat[0] < cat, - np.array([False, True, True])) + tm.assert_numpy_array_equal(cat > cat[0], + np.array([False, True, True])) + tm.assert_numpy_array_equal(cat[0] < cat, + np.array([False, True, True])) def test_comparison_with_unknown_scalars(self): # https://github.com/pandas-dev/pandas/issues/9836#issuecomment-92123057 @@ -1638,15 +1533,15 @@ def test_comparison_with_unknown_scalars(self): # for unequal comps, but not for equal/not equal cat = pd.Categorical([1, 2, 3], ordered=True) - self.assertRaises(TypeError, lambda: cat < 4) - self.assertRaises(TypeError, lambda: cat > 4) - self.assertRaises(TypeError, lambda: 4 < cat) - self.assertRaises(TypeError, lambda: 4 > cat) + pytest.raises(TypeError, lambda: cat < 4) + pytest.raises(TypeError, lambda: cat > 4) + pytest.raises(TypeError, lambda: 4 < cat) + pytest.raises(TypeError, lambda: 4 > cat) - self.assert_numpy_array_equal(cat == 4, - np.array([False, False, False])) - self.assert_numpy_array_equal(cat != 4, - np.array([True, True, True])) + tm.assert_numpy_array_equal(cat == 4, + np.array([False, False, False])) + tm.assert_numpy_array_equal(cat != 4, + np.array([True, True, True])) def test_map(self): c = pd.Categorical(list('ABABC'), categories=list('CBA'), @@ -1664,21 +1559,59 @@ def test_map(self): tm.assert_categorical_equal(result, exp) result = c.map(lambda x: 1) - tm.assert_numpy_array_equal(result, np.array([1] * 5, dtype=np.int64)) + # GH 12766: Return an index not an array + tm.assert_index_equal(result, Index(np.array([1] * 5, dtype=np.int64))) + def test_validate_inplace(self): + cat = Categorical(['A', 'B', 'B', 'C', 'A']) + invalid_values = [1, "True", [1, 2, 3], 5.0] -class TestCategoricalAsBlock(tm.TestCase): - _multiprocess_can_split_ = True + for value in invalid_values: + with pytest.raises(ValueError): + cat.set_ordered(value=True, inplace=value) - def setUp(self): + with pytest.raises(ValueError): + cat.as_ordered(inplace=value) + + with pytest.raises(ValueError): + cat.as_unordered(inplace=value) + + with pytest.raises(ValueError): + cat.set_categories(['X', 'Y', 'Z'], rename=True, inplace=value) + + with 
pytest.raises(ValueError): + cat.rename_categories(['X', 'Y', 'Z'], inplace=value) + + with pytest.raises(ValueError): + cat.reorder_categories( + ['X', 'Y', 'Z'], ordered=True, inplace=value) + + with pytest.raises(ValueError): + cat.add_categories( + new_categories=['D', 'E', 'F'], inplace=value) + + with pytest.raises(ValueError): + cat.remove_categories(removals=['D', 'E', 'F'], inplace=value) + + with pytest.raises(ValueError): + cat.remove_unused_categories(inplace=value) + + with pytest.raises(ValueError): + cat.sort_values(inplace=value) + + +class TestCategoricalAsBlock(object): + + def setup_method(self, method): self.factor = Categorical(['a', 'b', 'b', 'a', 'a', 'c', 'c', 'c']) df = DataFrame({'value': np.random.randint(0, 10000, 100)}) labels = ["{0} - {1}".format(i, i + 499) for i in range(0, 10000, 500)] + cat_labels = Categorical(labels, labels) df = df.sort_values(by=['value'], ascending=True) - df['value_group'] = pd.cut(df.value, range(0, 10500, 500), right=False, - labels=labels) + df['value_group'] = pd.cut(df.value, range(0, 10500, 500), + right=False, labels=cat_labels) self.cat = df def test_dtypes(self): @@ -1706,30 +1639,30 @@ def test_codes_dtypes(self): # GH 8453 result = Categorical(['foo', 'bar', 'baz']) - self.assertTrue(result.codes.dtype == 'int8') + assert result.codes.dtype == 'int8' result = Categorical(['foo%05d' % i for i in range(400)]) - self.assertTrue(result.codes.dtype == 'int16') + assert result.codes.dtype == 'int16' result = Categorical(['foo%05d' % i for i in range(40000)]) - self.assertTrue(result.codes.dtype == 'int32') + assert result.codes.dtype == 'int32' # adding cats result = Categorical(['foo', 'bar', 'baz']) - self.assertTrue(result.codes.dtype == 'int8') + assert result.codes.dtype == 'int8' result = result.add_categories(['foo%05d' % i for i in range(400)]) - self.assertTrue(result.codes.dtype == 'int16') + assert result.codes.dtype == 'int16' # removing cats result = result.remove_categories(['foo%05d' % i for i in range(300)]) - self.assertTrue(result.codes.dtype == 'int8') + assert result.codes.dtype == 'int8' def test_basic(self): # test basic creation / coercion of categoricals s = Series(self.factor, name='A') - self.assertEqual(s.dtype, 'category') - self.assertEqual(len(s), len(self.factor)) + assert s.dtype == 'category' + assert len(s) == len(self.factor) str(s.values) str(s) @@ -1739,14 +1672,14 @@ def test_basic(self): tm.assert_series_equal(result, s) result = df.iloc[:, 0] tm.assert_series_equal(result, s) - self.assertEqual(len(df), len(self.factor)) + assert len(df) == len(self.factor) str(df.values) str(df) df = DataFrame({'A': s}) result = df['A'] tm.assert_series_equal(result, s) - self.assertEqual(len(df), len(self.factor)) + assert len(df) == len(self.factor) str(df.values) str(df) @@ -1756,8 +1689,8 @@ def test_basic(self): result2 = df['B'] tm.assert_series_equal(result1, s) tm.assert_series_equal(result2, s, check_names=False) - self.assertEqual(result2.name, 'B') - self.assertEqual(len(df), len(self.factor)) + assert result2.name == 'B' + assert len(df) == len(self.factor) str(df.values) str(df) @@ -1770,13 +1703,13 @@ def test_basic(self): expected = x.iloc[0].person_name result = x.person_name.iloc[0] - self.assertEqual(result, expected) + assert result == expected result = x.person_name[0] - self.assertEqual(result, expected) + assert result == expected result = x.person_name.loc[0] - self.assertEqual(result, expected) + assert result == expected def test_creation_astype(self): l = ["a", "b", "c", "a"] @@ 
-1883,20 +1816,22 @@ def test_construction_frame(self): tm.assert_frame_equal(df, expected) # invalid (shape) - self.assertRaises( + pytest.raises( ValueError, lambda: DataFrame([pd.Categorical(list('abc')), pd.Categorical(list('abdefg'))])) # ndim > 1 - self.assertRaises(NotImplementedError, - lambda: pd.Categorical(np.array([list('abcd')]))) + pytest.raises(NotImplementedError, + lambda: pd.Categorical(np.array([list('abcd')]))) def test_reshaping(self): - p = tm.makePanel() - p['str'] = 'foo' - df = p.to_frame() + with catch_warnings(record=True): + p = tm.makePanel() + p['str'] = 'foo' + df = p.to_frame() + df['category'] = df['str'].astype('category') result = df['category'].unstack() @@ -1940,67 +1875,47 @@ def test_sideeffects_free(self): # other one, IF you specify copy! cat = Categorical(["a", "b", "c", "a"]) s = pd.Series(cat, copy=True) - self.assertFalse(s.cat is cat) + assert s.cat is not cat s.cat.categories = [1, 2, 3] exp_s = np.array([1, 2, 3, 1], dtype=np.int64) exp_cat = np.array(["a", "b", "c", "a"], dtype=np.object_) - self.assert_numpy_array_equal(s.__array__(), exp_s) - self.assert_numpy_array_equal(cat.__array__(), exp_cat) + tm.assert_numpy_array_equal(s.__array__(), exp_s) + tm.assert_numpy_array_equal(cat.__array__(), exp_cat) # setting s[0] = 2 exp_s2 = np.array([2, 2, 3, 1], dtype=np.int64) - self.assert_numpy_array_equal(s.__array__(), exp_s2) - self.assert_numpy_array_equal(cat.__array__(), exp_cat) + tm.assert_numpy_array_equal(s.__array__(), exp_s2) + tm.assert_numpy_array_equal(cat.__array__(), exp_cat) # however, copy is False by default # so this WILL change values cat = Categorical(["a", "b", "c", "a"]) s = pd.Series(cat) - self.assertTrue(s.values is cat) + assert s.values is cat s.cat.categories = [1, 2, 3] exp_s = np.array([1, 2, 3, 1], dtype=np.int64) - self.assert_numpy_array_equal(s.__array__(), exp_s) - self.assert_numpy_array_equal(cat.__array__(), exp_s) + tm.assert_numpy_array_equal(s.__array__(), exp_s) + tm.assert_numpy_array_equal(cat.__array__(), exp_s) s[0] = 2 exp_s2 = np.array([2, 2, 3, 1], dtype=np.int64) - self.assert_numpy_array_equal(s.__array__(), exp_s2) - self.assert_numpy_array_equal(cat.__array__(), exp_s2) + tm.assert_numpy_array_equal(s.__array__(), exp_s2) + tm.assert_numpy_array_equal(cat.__array__(), exp_s2) def test_nan_handling(self): - # Nans are represented as -1 in labels + # NaNs are represented as -1 in labels s = Series(Categorical(["a", "b", np.nan, "a"])) - self.assert_index_equal(s.cat.categories, Index(["a", "b"])) - self.assert_numpy_array_equal(s.values.codes, - np.array([0, 1, -1, 0], dtype=np.int8)) - - # If categories have nan included, the label should point to that - # instead - with tm.assert_produces_warning(FutureWarning): - s2 = Series(Categorical(["a", "b", np.nan, "a"], - categories=["a", "b", np.nan])) - - exp_cat = Index(["a", "b", np.nan]) - self.assert_index_equal(s2.cat.categories, exp_cat) - self.assert_numpy_array_equal(s2.values.codes, - np.array([0, 1, 2, 0], dtype=np.int8)) - - # Changing categories should also make the replaced category np.nan - s3 = Series(Categorical(["a", "b", "c", "a"])) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - s3.cat.categories = ["a", "b", np.nan] - - exp_cat = Index(["a", "b", np.nan]) - self.assert_index_equal(s3.cat.categories, exp_cat) - self.assert_numpy_array_equal(s3.values.codes, - np.array([0, 1, 2, 0], dtype=np.int8)) + tm.assert_index_equal(s.cat.categories, Index(["a", "b"])) + 
tm.assert_numpy_array_equal(s.values.codes, + np.array([0, 1, -1, 0], dtype=np.int8)) def test_cat_accessor(self): s = Series(Categorical(["a", "b", np.nan, "a"])) - self.assert_index_equal(s.cat.categories, Index(["a", "b"])) - self.assertEqual(s.cat.ordered, False) + tm.assert_index_equal(s.cat.categories, Index(["a", "b"])) + assert not s.cat.ordered, False + exp = Categorical(["a", "b", np.nan, "a"], categories=["b", "a"]) s.cat.set_categories(["b", "a"], inplace=True) tm.assert_categorical_equal(s.values, exp) @@ -2008,10 +1923,9 @@ def test_cat_accessor(self): res = s.cat.set_categories(["b", "a"]) tm.assert_categorical_equal(res.values, exp) - exp = Categorical(["a", "b", np.nan, "a"], categories=["b", "a"]) s[:] = "a" s = s.cat.remove_unused_categories() - self.assert_index_equal(s.cat.categories, Index(["a"])) + tm.assert_index_equal(s.cat.categories, Index(["a"])) def test_sequence_like(self): @@ -2039,15 +1953,15 @@ def test_sequence_like(self): def test_series_delegations(self): # invalid accessor - self.assertRaises(AttributeError, lambda: Series([1, 2, 3]).cat) - tm.assertRaisesRegexp( + pytest.raises(AttributeError, lambda: Series([1, 2, 3]).cat) + tm.assert_raises_regex( AttributeError, r"Can only use .cat accessor with a 'category' dtype", lambda: Series([1, 2, 3]).cat) - self.assertRaises(AttributeError, lambda: Series(['a', 'b', 'c']).cat) - self.assertRaises(AttributeError, lambda: Series(np.arange(5.)).cat) - self.assertRaises(AttributeError, - lambda: Series([Timestamp('20130101')]).cat) + pytest.raises(AttributeError, lambda: Series(['a', 'b', 'c']).cat) + pytest.raises(AttributeError, lambda: Series(np.arange(5.)).cat) + pytest.raises(AttributeError, + lambda: Series([Timestamp('20130101')]).cat) # Series should delegate calls to '.categories', '.codes', '.ordered' # and the methods '.set_categories()' 'drop_unused_categories()' to the @@ -2057,16 +1971,16 @@ def test_series_delegations(self): tm.assert_index_equal(s.cat.categories, exp_categories) s.cat.categories = [1, 2, 3] exp_categories = Index([1, 2, 3]) - self.assert_index_equal(s.cat.categories, exp_categories) + tm.assert_index_equal(s.cat.categories, exp_categories) exp_codes = Series([0, 1, 2, 0], dtype='int8') tm.assert_series_equal(s.cat.codes, exp_codes) - self.assertEqual(s.cat.ordered, True) + assert s.cat.ordered s = s.cat.as_unordered() - self.assertEqual(s.cat.ordered, False) + assert not s.cat.ordered s.cat.as_ordered(inplace=True) - self.assertEqual(s.cat.ordered, True) + assert s.cat.ordered # reorder s = Series(Categorical(["a", "b", "c", "a"], ordered=True)) @@ -2074,8 +1988,8 @@ def test_series_delegations(self): exp_values = np.array(["a", "b", "c", "a"], dtype=np.object_) s = s.cat.set_categories(["c", "b", "a"]) tm.assert_index_equal(s.cat.categories, exp_categories) - self.assert_numpy_array_equal(s.values.__array__(), exp_values) - self.assert_numpy_array_equal(s.__array__(), exp_values) + tm.assert_numpy_array_equal(s.values.__array__(), exp_values) + tm.assert_numpy_array_equal(s.__array__(), exp_values) # remove unused categories s = Series(Categorical(["a", "b", "b", "a"], categories=["a", "b", "c" @@ -2083,16 +1997,16 @@ def test_series_delegations(self): exp_categories = Index(["a", "b"]) exp_values = np.array(["a", "b", "b", "a"], dtype=np.object_) s = s.cat.remove_unused_categories() - self.assert_index_equal(s.cat.categories, exp_categories) - self.assert_numpy_array_equal(s.values.__array__(), exp_values) - self.assert_numpy_array_equal(s.__array__(), exp_values) + 
tm.assert_index_equal(s.cat.categories, exp_categories) + tm.assert_numpy_array_equal(s.values.__array__(), exp_values) + tm.assert_numpy_array_equal(s.__array__(), exp_values) # This method is likely to be confused, so test that it raises an error # on wrong inputs: def f(): s.set_categories([4, 3, 2, 1]) - self.assertRaises(Exception, f) + pytest.raises(Exception, f) # right: s.cat.set_categories([4,3,2,1]) def test_series_functions_no_warnings(self): @@ -2104,9 +2018,10 @@ def test_series_functions_no_warnings(self): def test_assignment_to_dataframe(self): # assignment - df = DataFrame({'value': np.array(np.random.randint(0, 10000, 100), - dtype='int32')}) - labels = ["{0} - {1}".format(i, i + 499) for i in range(0, 10000, 500)] + df = DataFrame({'value': np.array( + np.random.randint(0, 10000, 100), dtype='int32')}) + labels = Categorical(["{0} - {1}".format(i, i + 499) + for i in range(0, 10000, 500)]) df = df.sort_values(by=['value'], ascending=True) s = pd.cut(df.value, range(0, 10500, 500), right=False, labels=labels) @@ -2130,11 +2045,11 @@ def test_assignment_to_dataframe(self): result1 = df['D'] result2 = df['E'] - self.assert_categorical_equal(result1._data._block.values, d) + tm.assert_categorical_equal(result1._data._block.values, d) # sorting s.name = 'E' - self.assert_series_equal(result2.sort_index(), s.sort_index()) + tm.assert_series_equal(result2.sort_index(), s.sort_index()) cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10]) df = pd.DataFrame(pd.Series(cat)) @@ -2143,7 +2058,7 @@ def test_describe(self): # Categoricals should not show up together with numerical columns result = self.cat.describe() - self.assertEqual(len(result.columns), 1) + assert len(result.columns) == 1 # In a frame, describe() for the cat should be the same as for string # arrays (count, unique, top, freq) @@ -2159,82 +2074,82 @@ def test_describe(self): cat = pd.Series(pd.Categorical(["a", "b", "c", "c"])) df3 = pd.DataFrame({"cat": cat, "s": ["a", "b", "c", "c"]}) res = df3.describe() - self.assert_numpy_array_equal(res["cat"].values, res["s"].values) + tm.assert_numpy_array_equal(res["cat"].values, res["s"].values) def test_repr(self): a = pd.Series(pd.Categorical([1, 2, 3, 4])) exp = u("0 1\n1 2\n2 3\n3 4\n" + "dtype: category\nCategories (4, int64): [1, 2, 3, 4]") - self.assertEqual(exp, a.__unicode__()) + assert exp == a.__unicode__() a = pd.Series(pd.Categorical(["a", "b"] * 25)) exp = u("0 a\n1 b\n" + " ..\n" + "48 a\n49 b\n" + - "dtype: category\nCategories (2, object): [a, b]") + "Length: 50, dtype: category\nCategories (2, object): [a, b]") with option_context("display.max_rows", 5): - self.assertEqual(exp, repr(a)) + assert exp == repr(a) levs = list("abcdefghijklmnopqrstuvwxyz") a = pd.Series(pd.Categorical( ["a", "b"], categories=levs, ordered=True)) exp = u("0 a\n1 b\n" + "dtype: category\n" "Categories (26, object): [a < b < c < d ... 
w < x < y < z]") - self.assertEqual(exp, a.__unicode__()) + assert exp == a.__unicode__() def test_categorical_repr(self): c = pd.Categorical([1, 2, 3]) exp = """[1, 2, 3] Categories (3, int64): [1, 2, 3]""" - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical([1, 2, 3, 1, 2, 3], categories=[1, 2, 3]) exp = """[1, 2, 3, 1, 2, 3] Categories (3, int64): [1, 2, 3]""" - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical([1, 2, 3, 4, 5] * 10) exp = """[1, 2, 3, 4, 5, ..., 1, 2, 3, 4, 5] Length: 50 Categories (5, int64): [1, 2, 3, 4, 5]""" - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(np.arange(20)) exp = """[0, 1, 2, 3, 4, ..., 15, 16, 17, 18, 19] Length: 20 Categories (20, int64): [0, 1, 2, 3, ..., 16, 17, 18, 19]""" - self.assertEqual(repr(c), exp) + assert repr(c) == exp def test_categorical_repr_ordered(self): c = pd.Categorical([1, 2, 3], ordered=True) exp = """[1, 2, 3] Categories (3, int64): [1 < 2 < 3]""" - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical([1, 2, 3, 1, 2, 3], categories=[1, 2, 3], ordered=True) exp = """[1, 2, 3, 1, 2, 3] Categories (3, int64): [1 < 2 < 3]""" - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical([1, 2, 3, 4, 5] * 10, ordered=True) exp = """[1, 2, 3, 4, 5, ..., 1, 2, 3, 4, 5] Length: 50 Categories (5, int64): [1 < 2 < 3 < 4 < 5]""" - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(np.arange(20), ordered=True) exp = """[0, 1, 2, 3, 4, ..., 15, 16, 17, 18, 19] Length: 20 Categories (20, int64): [0 < 1 < 2 < 3 ... 16 < 17 < 18 < 19]""" - self.assertEqual(repr(c), exp) + assert repr(c) == exp def test_categorical_repr_datetime(self): idx = pd.date_range('2011-01-01 09:00', freq='H', periods=5) @@ -2249,7 +2164,7 @@ def test_categorical_repr_datetime(self): "2011-01-01 10:00:00, 2011-01-01 11:00:00,\n" " 2011-01-01 12:00:00, " "2011-01-01 13:00:00]""") - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(idx.append(idx), categories=idx) exp = ( @@ -2262,7 +2177,7 @@ def test_categorical_repr_datetime(self): " 2011-01-01 12:00:00, " "2011-01-01 13:00:00]") - self.assertEqual(repr(c), exp) + assert repr(c) == exp idx = pd.date_range('2011-01-01 09:00', freq='H', periods=5, tz='US/Eastern') @@ -2278,7 +2193,7 @@ def test_categorical_repr_datetime(self): " " "2011-01-01 13:00:00-05:00]") - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(idx.append(idx), categories=idx) exp = ( @@ -2294,7 +2209,7 @@ def test_categorical_repr_datetime(self): " " "2011-01-01 13:00:00-05:00]") - self.assertEqual(repr(c), exp) + assert repr(c) == exp def test_categorical_repr_datetime_ordered(self): idx = pd.date_range('2011-01-01 09:00', freq='H', periods=5) @@ -2303,14 +2218,14 @@ def test_categorical_repr_datetime_ordered(self): Categories (5, datetime64[ns]): [2011-01-01 09:00:00 < 2011-01-01 10:00:00 < 2011-01-01 11:00:00 < 2011-01-01 12:00:00 < 2011-01-01 13:00:00]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(idx.append(idx), categories=idx, ordered=True) exp = """[2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00, 2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00] Categories (5, datetime64[ns]): [2011-01-01 09:00:00 < 2011-01-01 10:00:00 < 2011-01-01 11:00:00 < 2011-01-01 12:00:00 < 2011-01-01 13:00:00]""" # noqa - 
self.assertEqual(repr(c), exp) + assert repr(c) == exp idx = pd.date_range('2011-01-01 09:00', freq='H', periods=5, tz='US/Eastern') @@ -2320,73 +2235,73 @@ def test_categorical_repr_datetime_ordered(self): 2011-01-01 11:00:00-05:00 < 2011-01-01 12:00:00-05:00 < 2011-01-01 13:00:00-05:00]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(idx.append(idx), categories=idx, ordered=True) exp = """[2011-01-01 09:00:00-05:00, 2011-01-01 10:00:00-05:00, 2011-01-01 11:00:00-05:00, 2011-01-01 12:00:00-05:00, 2011-01-01 13:00:00-05:00, 2011-01-01 09:00:00-05:00, 2011-01-01 10:00:00-05:00, 2011-01-01 11:00:00-05:00, 2011-01-01 12:00:00-05:00, 2011-01-01 13:00:00-05:00] Categories (5, datetime64[ns, US/Eastern]): [2011-01-01 09:00:00-05:00 < 2011-01-01 10:00:00-05:00 < 2011-01-01 11:00:00-05:00 < 2011-01-01 12:00:00-05:00 < - 2011-01-01 13:00:00-05:00]""" + 2011-01-01 13:00:00-05:00]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp def test_categorical_repr_period(self): idx = pd.period_range('2011-01-01 09:00', freq='H', periods=5) c = pd.Categorical(idx) exp = """[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00] Categories (5, period[H]): [2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, - 2011-01-01 13:00]""" + 2011-01-01 13:00]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(idx.append(idx), categories=idx) exp = """[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00, 2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00] Categories (5, period[H]): [2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, - 2011-01-01 13:00]""" + 2011-01-01 13:00]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp idx = pd.period_range('2011-01', freq='M', periods=5) c = pd.Categorical(idx) exp = """[2011-01, 2011-02, 2011-03, 2011-04, 2011-05] Categories (5, period[M]): [2011-01, 2011-02, 2011-03, 2011-04, 2011-05]""" - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(idx.append(idx), categories=idx) exp = """[2011-01, 2011-02, 2011-03, 2011-04, 2011-05, 2011-01, 2011-02, 2011-03, 2011-04, 2011-05] -Categories (5, period[M]): [2011-01, 2011-02, 2011-03, 2011-04, 2011-05]""" +Categories (5, period[M]): [2011-01, 2011-02, 2011-03, 2011-04, 2011-05]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp def test_categorical_repr_period_ordered(self): idx = pd.period_range('2011-01-01 09:00', freq='H', periods=5) c = pd.Categorical(idx, ordered=True) exp = """[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00] Categories (5, period[H]): [2011-01-01 09:00 < 2011-01-01 10:00 < 2011-01-01 11:00 < 2011-01-01 12:00 < - 2011-01-01 13:00]""" + 2011-01-01 13:00]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(idx.append(idx), categories=idx, ordered=True) exp = """[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00, 2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00] Categories (5, period[H]): [2011-01-01 09:00 < 2011-01-01 10:00 < 2011-01-01 11:00 < 2011-01-01 12:00 < - 2011-01-01 13:00]""" + 2011-01-01 13:00]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp idx = pd.period_range('2011-01', freq='M', periods=5) c = pd.Categorical(idx, ordered=True) exp = 
"""[2011-01, 2011-02, 2011-03, 2011-04, 2011-05] Categories (5, period[M]): [2011-01 < 2011-02 < 2011-03 < 2011-04 < 2011-05]""" - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(idx.append(idx), categories=idx, ordered=True) exp = """[2011-01, 2011-02, 2011-03, 2011-04, 2011-05, 2011-01, 2011-02, 2011-03, 2011-04, 2011-05] -Categories (5, period[M]): [2011-01 < 2011-02 < 2011-03 < 2011-04 < 2011-05]""" +Categories (5, period[M]): [2011-01 < 2011-02 < 2011-03 < 2011-04 < 2011-05]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp def test_categorical_repr_timedelta(self): idx = pd.timedelta_range('1 days', periods=5) @@ -2394,13 +2309,13 @@ def test_categorical_repr_timedelta(self): exp = """[1 days, 2 days, 3 days, 4 days, 5 days] Categories (5, timedelta64[ns]): [1 days, 2 days, 3 days, 4 days, 5 days]""" - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(idx.append(idx), categories=idx) exp = """[1 days, 2 days, 3 days, 4 days, 5 days, 1 days, 2 days, 3 days, 4 days, 5 days] -Categories (5, timedelta64[ns]): [1 days, 2 days, 3 days, 4 days, 5 days]""" +Categories (5, timedelta64[ns]): [1 days, 2 days, 3 days, 4 days, 5 days]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp idx = pd.timedelta_range('1 hours', periods=20) c = pd.Categorical(idx) @@ -2408,32 +2323,32 @@ def test_categorical_repr_timedelta(self): Length: 20 Categories (20, timedelta64[ns]): [0 days 01:00:00, 1 days 01:00:00, 2 days 01:00:00, 3 days 01:00:00, ..., 16 days 01:00:00, 17 days 01:00:00, - 18 days 01:00:00, 19 days 01:00:00]""" + 18 days 01:00:00, 19 days 01:00:00]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(idx.append(idx), categories=idx) exp = """[0 days 01:00:00, 1 days 01:00:00, 2 days 01:00:00, 3 days 01:00:00, 4 days 01:00:00, ..., 15 days 01:00:00, 16 days 01:00:00, 17 days 01:00:00, 18 days 01:00:00, 19 days 01:00:00] Length: 40 Categories (20, timedelta64[ns]): [0 days 01:00:00, 1 days 01:00:00, 2 days 01:00:00, 3 days 01:00:00, ..., 16 days 01:00:00, 17 days 01:00:00, - 18 days 01:00:00, 19 days 01:00:00]""" + 18 days 01:00:00, 19 days 01:00:00]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp def test_categorical_repr_timedelta_ordered(self): idx = pd.timedelta_range('1 days', periods=5) c = pd.Categorical(idx, ordered=True) exp = """[1 days, 2 days, 3 days, 4 days, 5 days] -Categories (5, timedelta64[ns]): [1 days < 2 days < 3 days < 4 days < 5 days]""" +Categories (5, timedelta64[ns]): [1 days < 2 days < 3 days < 4 days < 5 days]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(idx.append(idx), categories=idx, ordered=True) exp = """[1 days, 2 days, 3 days, 4 days, 5 days, 1 days, 2 days, 3 days, 4 days, 5 days] -Categories (5, timedelta64[ns]): [1 days < 2 days < 3 days < 4 days < 5 days]""" +Categories (5, timedelta64[ns]): [1 days < 2 days < 3 days < 4 days < 5 days]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp idx = pd.timedelta_range('1 hours', periods=20) c = pd.Categorical(idx, ordered=True) @@ -2441,18 +2356,18 @@ def test_categorical_repr_timedelta_ordered(self): Length: 20 Categories (20, timedelta64[ns]): [0 days 01:00:00 < 1 days 01:00:00 < 2 days 01:00:00 < 3 days 01:00:00 ... 
16 days 01:00:00 < 17 days 01:00:00 < - 18 days 01:00:00 < 19 days 01:00:00]""" + 18 days 01:00:00 < 19 days 01:00:00]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp c = pd.Categorical(idx.append(idx), categories=idx, ordered=True) exp = """[0 days 01:00:00, 1 days 01:00:00, 2 days 01:00:00, 3 days 01:00:00, 4 days 01:00:00, ..., 15 days 01:00:00, 16 days 01:00:00, 17 days 01:00:00, 18 days 01:00:00, 19 days 01:00:00] Length: 40 Categories (20, timedelta64[ns]): [0 days 01:00:00 < 1 days 01:00:00 < 2 days 01:00:00 < 3 days 01:00:00 ... 16 days 01:00:00 < 17 days 01:00:00 < - 18 days 01:00:00 < 19 days 01:00:00]""" + 18 days 01:00:00 < 19 days 01:00:00]""" # noqa - self.assertEqual(repr(c), exp) + assert repr(c) == exp def test_categorical_series_repr(self): s = pd.Series(pd.Categorical([1, 2, 3])) @@ -2462,7 +2377,7 @@ def test_categorical_series_repr(self): dtype: category Categories (3, int64): [1, 2, 3]""" - self.assertEqual(repr(s), exp) + assert repr(s) == exp s = pd.Series(pd.Categorical(np.arange(10))) exp = """0 0 @@ -2478,7 +2393,7 @@ def test_categorical_series_repr(self): dtype: category Categories (10, int64): [0, 1, 2, 3, ..., 6, 7, 8, 9]""" - self.assertEqual(repr(s), exp) + assert repr(s) == exp def test_categorical_series_repr_ordered(self): s = pd.Series(pd.Categorical([1, 2, 3], ordered=True)) @@ -2488,7 +2403,7 @@ def test_categorical_series_repr_ordered(self): dtype: category Categories (3, int64): [1 < 2 < 3]""" - self.assertEqual(repr(s), exp) + assert repr(s) == exp s = pd.Series(pd.Categorical(np.arange(10), ordered=True)) exp = """0 0 @@ -2504,7 +2419,7 @@ def test_categorical_series_repr_ordered(self): dtype: category Categories (10, int64): [0 < 1 < 2 < 3 ... 6 < 7 < 8 < 9]""" - self.assertEqual(repr(s), exp) + assert repr(s) == exp def test_categorical_series_repr_datetime(self): idx = pd.date_range('2011-01-01 09:00', freq='H', periods=5) @@ -2516,9 +2431,9 @@ def test_categorical_series_repr_datetime(self): 4 2011-01-01 13:00:00 dtype: category Categories (5, datetime64[ns]): [2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, - 2011-01-01 12:00:00, 2011-01-01 13:00:00]""" + 2011-01-01 12:00:00, 2011-01-01 13:00:00]""" # noqa - self.assertEqual(repr(s), exp) + assert repr(s) == exp idx = pd.date_range('2011-01-01 09:00', freq='H', periods=5, tz='US/Eastern') @@ -2531,9 +2446,9 @@ def test_categorical_series_repr_datetime(self): dtype: category Categories (5, datetime64[ns, US/Eastern]): [2011-01-01 09:00:00-05:00, 2011-01-01 10:00:00-05:00, 2011-01-01 11:00:00-05:00, 2011-01-01 12:00:00-05:00, - 2011-01-01 13:00:00-05:00]""" + 2011-01-01 13:00:00-05:00]""" # noqa - self.assertEqual(repr(s), exp) + assert repr(s) == exp def test_categorical_series_repr_datetime_ordered(self): idx = pd.date_range('2011-01-01 09:00', freq='H', periods=5) @@ -2545,9 +2460,9 @@ def test_categorical_series_repr_datetime_ordered(self): 4 2011-01-01 13:00:00 dtype: category Categories (5, datetime64[ns]): [2011-01-01 09:00:00 < 2011-01-01 10:00:00 < 2011-01-01 11:00:00 < - 2011-01-01 12:00:00 < 2011-01-01 13:00:00]""" + 2011-01-01 12:00:00 < 2011-01-01 13:00:00]""" # noqa - self.assertEqual(repr(s), exp) + assert repr(s) == exp idx = pd.date_range('2011-01-01 09:00', freq='H', periods=5, tz='US/Eastern') @@ -2560,9 +2475,9 @@ def test_categorical_series_repr_datetime_ordered(self): dtype: category Categories (5, datetime64[ns, US/Eastern]): [2011-01-01 09:00:00-05:00 < 2011-01-01 10:00:00-05:00 < 2011-01-01 11:00:00-05:00 < 2011-01-01 12:00:00-05:00 < - 
2011-01-01 13:00:00-05:00]""" + 2011-01-01 13:00:00-05:00]""" # noqa - self.assertEqual(repr(s), exp) + assert repr(s) == exp def test_categorical_series_repr_period(self): idx = pd.period_range('2011-01-01 09:00', freq='H', periods=5) @@ -2574,9 +2489,9 @@ def test_categorical_series_repr_period(self): 4 2011-01-01 13:00 dtype: category Categories (5, period[H]): [2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, - 2011-01-01 13:00]""" + 2011-01-01 13:00]""" # noqa - self.assertEqual(repr(s), exp) + assert repr(s) == exp idx = pd.period_range('2011-01', freq='M', periods=5) s = pd.Series(pd.Categorical(idx)) @@ -2588,7 +2503,7 @@ def test_categorical_series_repr_period(self): dtype: category Categories (5, period[M]): [2011-01, 2011-02, 2011-03, 2011-04, 2011-05]""" - self.assertEqual(repr(s), exp) + assert repr(s) == exp def test_categorical_series_repr_period_ordered(self): idx = pd.period_range('2011-01-01 09:00', freq='H', periods=5) @@ -2600,9 +2515,9 @@ def test_categorical_series_repr_period_ordered(self): 4 2011-01-01 13:00 dtype: category Categories (5, period[H]): [2011-01-01 09:00 < 2011-01-01 10:00 < 2011-01-01 11:00 < 2011-01-01 12:00 < - 2011-01-01 13:00]""" + 2011-01-01 13:00]""" # noqa - self.assertEqual(repr(s), exp) + assert repr(s) == exp idx = pd.period_range('2011-01', freq='M', periods=5) s = pd.Series(pd.Categorical(idx, ordered=True)) @@ -2614,7 +2529,7 @@ def test_categorical_series_repr_period_ordered(self): dtype: category Categories (5, period[M]): [2011-01 < 2011-02 < 2011-03 < 2011-04 < 2011-05]""" - self.assertEqual(repr(s), exp) + assert repr(s) == exp def test_categorical_series_repr_timedelta(self): idx = pd.timedelta_range('1 days', periods=5) @@ -2627,7 +2542,7 @@ def test_categorical_series_repr_timedelta(self): dtype: category Categories (5, timedelta64[ns]): [1 days, 2 days, 3 days, 4 days, 5 days]""" - self.assertEqual(repr(s), exp) + assert repr(s) == exp idx = pd.timedelta_range('1 hours', periods=10) s = pd.Series(pd.Categorical(idx)) @@ -2644,9 +2559,9 @@ def test_categorical_series_repr_timedelta(self): dtype: category Categories (10, timedelta64[ns]): [0 days 01:00:00, 1 days 01:00:00, 2 days 01:00:00, 3 days 01:00:00, ..., 6 days 01:00:00, 7 days 01:00:00, - 8 days 01:00:00, 9 days 01:00:00]""" + 8 days 01:00:00, 9 days 01:00:00]""" # noqa - self.assertEqual(repr(s), exp) + assert repr(s) == exp def test_categorical_series_repr_timedelta_ordered(self): idx = pd.timedelta_range('1 days', periods=5) @@ -2657,9 +2572,9 @@ def test_categorical_series_repr_timedelta_ordered(self): 3 4 days 4 5 days dtype: category -Categories (5, timedelta64[ns]): [1 days < 2 days < 3 days < 4 days < 5 days]""" +Categories (5, timedelta64[ns]): [1 days < 2 days < 3 days < 4 days < 5 days]""" # noqa - self.assertEqual(repr(s), exp) + assert repr(s) == exp idx = pd.timedelta_range('1 hours', periods=10) s = pd.Series(pd.Categorical(idx, ordered=True)) @@ -2676,27 +2591,27 @@ def test_categorical_series_repr_timedelta_ordered(self): dtype: category Categories (10, timedelta64[ns]): [0 days 01:00:00 < 1 days 01:00:00 < 2 days 01:00:00 < 3 days 01:00:00 ... 
6 days 01:00:00 < 7 days 01:00:00 < - 8 days 01:00:00 < 9 days 01:00:00]""" + 8 days 01:00:00 < 9 days 01:00:00]""" # noqa - self.assertEqual(repr(s), exp) + assert repr(s) == exp def test_categorical_index_repr(self): idx = pd.CategoricalIndex(pd.Categorical([1, 2, 3])) - exp = """CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=False, dtype='category')""" - self.assertEqual(repr(idx), exp) + exp = """CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=False, dtype='category')""" # noqa + assert repr(idx) == exp i = pd.CategoricalIndex(pd.Categorical(np.arange(10))) - exp = """CategoricalIndex([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], categories=[0, 1, 2, 3, 4, 5, 6, 7, ...], ordered=False, dtype='category')""" - self.assertEqual(repr(i), exp) + exp = """CategoricalIndex([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], categories=[0, 1, 2, 3, 4, 5, 6, 7, ...], ordered=False, dtype='category')""" # noqa + assert repr(i) == exp def test_categorical_index_repr_ordered(self): i = pd.CategoricalIndex(pd.Categorical([1, 2, 3], ordered=True)) - exp = """CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=True, dtype='category')""" - self.assertEqual(repr(i), exp) + exp = """CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=True, dtype='category')""" # noqa + assert repr(i) == exp i = pd.CategoricalIndex(pd.Categorical(np.arange(10), ordered=True)) - exp = """CategoricalIndex([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], categories=[0, 1, 2, 3, 4, 5, 6, 7, ...], ordered=True, dtype='category')""" - self.assertEqual(repr(i), exp) + exp = """CategoricalIndex([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], categories=[0, 1, 2, 3, 4, 5, 6, 7, ...], ordered=True, dtype='category')""" # noqa + assert repr(i) == exp def test_categorical_index_repr_datetime(self): idx = pd.date_range('2011-01-01 09:00', freq='H', periods=5) @@ -2704,9 +2619,9 @@ def test_categorical_index_repr_datetime(self): exp = """CategoricalIndex(['2011-01-01 09:00:00', '2011-01-01 10:00:00', '2011-01-01 11:00:00', '2011-01-01 12:00:00', '2011-01-01 13:00:00'], - categories=[2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00], ordered=False, dtype='category')""" + categories=[2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00], ordered=False, dtype='category')""" # noqa - self.assertEqual(repr(i), exp) + assert repr(i) == exp idx = pd.date_range('2011-01-01 09:00', freq='H', periods=5, tz='US/Eastern') @@ -2714,9 +2629,9 @@ def test_categorical_index_repr_datetime(self): exp = """CategoricalIndex(['2011-01-01 09:00:00-05:00', '2011-01-01 10:00:00-05:00', '2011-01-01 11:00:00-05:00', '2011-01-01 12:00:00-05:00', '2011-01-01 13:00:00-05:00'], - categories=[2011-01-01 09:00:00-05:00, 2011-01-01 10:00:00-05:00, 2011-01-01 11:00:00-05:00, 2011-01-01 12:00:00-05:00, 2011-01-01 13:00:00-05:00], ordered=False, dtype='category')""" + categories=[2011-01-01 09:00:00-05:00, 2011-01-01 10:00:00-05:00, 2011-01-01 11:00:00-05:00, 2011-01-01 12:00:00-05:00, 2011-01-01 13:00:00-05:00], ordered=False, dtype='category')""" # noqa - self.assertEqual(repr(i), exp) + assert repr(i) == exp def test_categorical_index_repr_datetime_ordered(self): idx = pd.date_range('2011-01-01 09:00', freq='H', periods=5) @@ -2724,9 +2639,9 @@ def test_categorical_index_repr_datetime_ordered(self): exp = """CategoricalIndex(['2011-01-01 09:00:00', '2011-01-01 10:00:00', '2011-01-01 11:00:00', '2011-01-01 12:00:00', '2011-01-01 13:00:00'], - categories=[2011-01-01 09:00:00, 2011-01-01 10:00:00, 
2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00], ordered=True, dtype='category')""" + categories=[2011-01-01 09:00:00, 2011-01-01 10:00:00, 2011-01-01 11:00:00, 2011-01-01 12:00:00, 2011-01-01 13:00:00], ordered=True, dtype='category')""" # noqa - self.assertEqual(repr(i), exp) + assert repr(i) == exp idx = pd.date_range('2011-01-01 09:00', freq='H', periods=5, tz='US/Eastern') @@ -2734,9 +2649,9 @@ def test_categorical_index_repr_datetime_ordered(self): exp = """CategoricalIndex(['2011-01-01 09:00:00-05:00', '2011-01-01 10:00:00-05:00', '2011-01-01 11:00:00-05:00', '2011-01-01 12:00:00-05:00', '2011-01-01 13:00:00-05:00'], - categories=[2011-01-01 09:00:00-05:00, 2011-01-01 10:00:00-05:00, 2011-01-01 11:00:00-05:00, 2011-01-01 12:00:00-05:00, 2011-01-01 13:00:00-05:00], ordered=True, dtype='category')""" + categories=[2011-01-01 09:00:00-05:00, 2011-01-01 10:00:00-05:00, 2011-01-01 11:00:00-05:00, 2011-01-01 12:00:00-05:00, 2011-01-01 13:00:00-05:00], ordered=True, dtype='category')""" # noqa - self.assertEqual(repr(i), exp) + assert repr(i) == exp i = pd.CategoricalIndex(pd.Categorical(idx.append(idx), ordered=True)) exp = """CategoricalIndex(['2011-01-01 09:00:00-05:00', '2011-01-01 10:00:00-05:00', @@ -2744,68 +2659,68 @@ def test_categorical_index_repr_datetime_ordered(self): '2011-01-01 13:00:00-05:00', '2011-01-01 09:00:00-05:00', '2011-01-01 10:00:00-05:00', '2011-01-01 11:00:00-05:00', '2011-01-01 12:00:00-05:00', '2011-01-01 13:00:00-05:00'], - categories=[2011-01-01 09:00:00-05:00, 2011-01-01 10:00:00-05:00, 2011-01-01 11:00:00-05:00, 2011-01-01 12:00:00-05:00, 2011-01-01 13:00:00-05:00], ordered=True, dtype='category')""" + categories=[2011-01-01 09:00:00-05:00, 2011-01-01 10:00:00-05:00, 2011-01-01 11:00:00-05:00, 2011-01-01 12:00:00-05:00, 2011-01-01 13:00:00-05:00], ordered=True, dtype='category')""" # noqa - self.assertEqual(repr(i), exp) + assert repr(i) == exp def test_categorical_index_repr_period(self): # test all length idx = pd.period_range('2011-01-01 09:00', freq='H', periods=1) i = pd.CategoricalIndex(pd.Categorical(idx)) - exp = """CategoricalIndex(['2011-01-01 09:00'], categories=[2011-01-01 09:00], ordered=False, dtype='category')""" - self.assertEqual(repr(i), exp) + exp = """CategoricalIndex(['2011-01-01 09:00'], categories=[2011-01-01 09:00], ordered=False, dtype='category')""" # noqa + assert repr(i) == exp idx = pd.period_range('2011-01-01 09:00', freq='H', periods=2) i = pd.CategoricalIndex(pd.Categorical(idx)) - exp = """CategoricalIndex(['2011-01-01 09:00', '2011-01-01 10:00'], categories=[2011-01-01 09:00, 2011-01-01 10:00], ordered=False, dtype='category')""" - self.assertEqual(repr(i), exp) + exp = """CategoricalIndex(['2011-01-01 09:00', '2011-01-01 10:00'], categories=[2011-01-01 09:00, 2011-01-01 10:00], ordered=False, dtype='category')""" # noqa + assert repr(i) == exp idx = pd.period_range('2011-01-01 09:00', freq='H', periods=3) i = pd.CategoricalIndex(pd.Categorical(idx)) - exp = """CategoricalIndex(['2011-01-01 09:00', '2011-01-01 10:00', '2011-01-01 11:00'], categories=[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00], ordered=False, dtype='category')""" - self.assertEqual(repr(i), exp) + exp = """CategoricalIndex(['2011-01-01 09:00', '2011-01-01 10:00', '2011-01-01 11:00'], categories=[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00], ordered=False, dtype='category')""" # noqa + assert repr(i) == exp idx = pd.period_range('2011-01-01 09:00', freq='H', periods=5) i = pd.CategoricalIndex(pd.Categorical(idx)) exp = 
"""CategoricalIndex(['2011-01-01 09:00', '2011-01-01 10:00', '2011-01-01 11:00', '2011-01-01 12:00', '2011-01-01 13:00'], - categories=[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00], ordered=False, dtype='category')""" + categories=[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00], ordered=False, dtype='category')""" # noqa - self.assertEqual(repr(i), exp) + assert repr(i) == exp i = pd.CategoricalIndex(pd.Categorical(idx.append(idx))) exp = """CategoricalIndex(['2011-01-01 09:00', '2011-01-01 10:00', '2011-01-01 11:00', '2011-01-01 12:00', '2011-01-01 13:00', '2011-01-01 09:00', '2011-01-01 10:00', '2011-01-01 11:00', '2011-01-01 12:00', '2011-01-01 13:00'], - categories=[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00], ordered=False, dtype='category')""" + categories=[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00], ordered=False, dtype='category')""" # noqa - self.assertEqual(repr(i), exp) + assert repr(i) == exp idx = pd.period_range('2011-01', freq='M', periods=5) i = pd.CategoricalIndex(pd.Categorical(idx)) - exp = """CategoricalIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05'], categories=[2011-01, 2011-02, 2011-03, 2011-04, 2011-05], ordered=False, dtype='category')""" - self.assertEqual(repr(i), exp) + exp = """CategoricalIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05'], categories=[2011-01, 2011-02, 2011-03, 2011-04, 2011-05], ordered=False, dtype='category')""" # noqa + assert repr(i) == exp def test_categorical_index_repr_period_ordered(self): idx = pd.period_range('2011-01-01 09:00', freq='H', periods=5) i = pd.CategoricalIndex(pd.Categorical(idx, ordered=True)) exp = """CategoricalIndex(['2011-01-01 09:00', '2011-01-01 10:00', '2011-01-01 11:00', '2011-01-01 12:00', '2011-01-01 13:00'], - categories=[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00], ordered=True, dtype='category')""" + categories=[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00], ordered=True, dtype='category')""" # noqa - self.assertEqual(repr(i), exp) + assert repr(i) == exp idx = pd.period_range('2011-01', freq='M', periods=5) i = pd.CategoricalIndex(pd.Categorical(idx, ordered=True)) - exp = """CategoricalIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05'], categories=[2011-01, 2011-02, 2011-03, 2011-04, 2011-05], ordered=True, dtype='category')""" - self.assertEqual(repr(i), exp) + exp = """CategoricalIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05'], categories=[2011-01, 2011-02, 2011-03, 2011-04, 2011-05], ordered=True, dtype='category')""" # noqa + assert repr(i) == exp def test_categorical_index_repr_timedelta(self): idx = pd.timedelta_range('1 days', periods=5) i = pd.CategoricalIndex(pd.Categorical(idx)) - exp = """CategoricalIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], categories=[1 days 00:00:00, 2 days 00:00:00, 3 days 00:00:00, 4 days 00:00:00, 5 days 00:00:00], ordered=False, dtype='category')""" - self.assertEqual(repr(i), exp) + exp = """CategoricalIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], categories=[1 days 00:00:00, 2 days 00:00:00, 3 days 00:00:00, 4 days 00:00:00, 5 days 00:00:00], ordered=False, dtype='category')""" # noqa + assert repr(i) == exp idx = pd.timedelta_range('1 hours', periods=10) i = pd.CategoricalIndex(pd.Categorical(idx)) @@ -2813,15 +2728,15 @@ def 
test_categorical_index_repr_timedelta(self): '3 days 01:00:00', '4 days 01:00:00', '5 days 01:00:00', '6 days 01:00:00', '7 days 01:00:00', '8 days 01:00:00', '9 days 01:00:00'], - categories=[0 days 01:00:00, 1 days 01:00:00, 2 days 01:00:00, 3 days 01:00:00, 4 days 01:00:00, 5 days 01:00:00, 6 days 01:00:00, 7 days 01:00:00, ...], ordered=False, dtype='category')""" + categories=[0 days 01:00:00, 1 days 01:00:00, 2 days 01:00:00, 3 days 01:00:00, 4 days 01:00:00, 5 days 01:00:00, 6 days 01:00:00, 7 days 01:00:00, ...], ordered=False, dtype='category')""" # noqa - self.assertEqual(repr(i), exp) + assert repr(i) == exp def test_categorical_index_repr_timedelta_ordered(self): idx = pd.timedelta_range('1 days', periods=5) i = pd.CategoricalIndex(pd.Categorical(idx, ordered=True)) - exp = """CategoricalIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], categories=[1 days 00:00:00, 2 days 00:00:00, 3 days 00:00:00, 4 days 00:00:00, 5 days 00:00:00], ordered=True, dtype='category')""" - self.assertEqual(repr(i), exp) + exp = """CategoricalIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], categories=[1 days 00:00:00, 2 days 00:00:00, 3 days 00:00:00, 4 days 00:00:00, 5 days 00:00:00], ordered=True, dtype='category')""" # noqa + assert repr(i) == exp idx = pd.timedelta_range('1 hours', periods=10) i = pd.CategoricalIndex(pd.Categorical(idx, ordered=True)) @@ -2829,9 +2744,9 @@ def test_categorical_index_repr_timedelta_ordered(self): '3 days 01:00:00', '4 days 01:00:00', '5 days 01:00:00', '6 days 01:00:00', '7 days 01:00:00', '8 days 01:00:00', '9 days 01:00:00'], - categories=[0 days 01:00:00, 1 days 01:00:00, 2 days 01:00:00, 3 days 01:00:00, 4 days 01:00:00, 5 days 01:00:00, 6 days 01:00:00, 7 days 01:00:00, ...], ordered=True, dtype='category')""" + categories=[0 days 01:00:00, 1 days 01:00:00, 2 days 01:00:00, 3 days 01:00:00, 4 days 01:00:00, 5 days 01:00:00, 6 days 01:00:00, 7 days 01:00:00, ...], ordered=True, dtype='category')""" # noqa - self.assertEqual(repr(i), exp) + assert repr(i) == exp def test_categorical_frame(self): # normal DataFrame @@ -2847,7 +2762,7 @@ def test_categorical_frame(self): 4 2011-01-01 13:00:00-05:00 2011-05""" df = pd.DataFrame({'dt': pd.Categorical(dt), 'p': pd.Categorical(p)}) - self.assertEqual(repr(df), exp) + assert repr(df) == exp def test_info(self): @@ -2857,10 +2772,12 @@ def test_info(self): df['category'] = Series(np.array(list('abcdefghij')).take( np.random.randint(0, 10, size=n))).astype('category') df.isnull() - df.info() + buf = compat.StringIO() + df.info(buf=buf) df2 = df[df['category'] == 'd'] - df2.info() + buf = compat.StringIO() + df2.info(buf=buf) def test_groupby_sort(self): @@ -2877,36 +2794,36 @@ def test_groupby_sort(self): def test_min_max(self): # unordered cats have no min/max cat = Series(Categorical(["a", "b", "c", "d"], ordered=False)) - self.assertRaises(TypeError, lambda: cat.min()) - self.assertRaises(TypeError, lambda: cat.max()) + pytest.raises(TypeError, lambda: cat.min()) + pytest.raises(TypeError, lambda: cat.max()) cat = Series(Categorical(["a", "b", "c", "d"], ordered=True)) _min = cat.min() _max = cat.max() - self.assertEqual(_min, "a") - self.assertEqual(_max, "d") + assert _min == "a" + assert _max == "d" cat = Series(Categorical(["a", "b", "c", "d"], categories=[ 'd', 'c', 'b', 'a'], ordered=True)) _min = cat.min() _max = cat.max() - self.assertEqual(_min, "d") - self.assertEqual(_max, "a") + assert _min == "d" + assert _max == "a" cat = Series(Categorical( [np.nan, "b", "c", np.nan], 
categories=['d', 'c', 'b', 'a' ], ordered=True)) _min = cat.min() _max = cat.max() - self.assertTrue(np.isnan(_min)) - self.assertEqual(_max, "b") + assert np.isnan(_min) + assert _max == "b" cat = Series(Categorical( [np.nan, 1, 2, np.nan], categories=[5, 4, 3, 2, 1], ordered=True)) _min = cat.min() _max = cat.max() - self.assertTrue(np.isnan(_min)) - self.assertEqual(_max, 1) + assert np.isnan(_min) + assert _max == 1 def test_mode(self): s = Series(Categorical([1, 1, 2, 4, 5, 5, 5], @@ -2924,7 +2841,8 @@ def test_mode(self): s = Series(Categorical([1, 2, 3, 4, 5], categories=[5, 4, 3, 2, 1], ordered=True)) res = s.mode() - exp = Series(Categorical([], categories=[5, 4, 3, 2, 1], ordered=True)) + exp = Series(Categorical([5, 4, 3, 2, 1], categories=[5, 4, 3, 2, 1], + ordered=True)) tm.assert_series_equal(res, exp) def test_value_counts(self): @@ -2980,13 +2898,15 @@ def test_value_counts_with_nan(self): tm.assert_series_equal(res, exp) # we don't exclude the count of None and sort by counts - exp = pd.Series([3, 2, 1], index=pd.CategoricalIndex([np.nan, "a", "b"])) + exp = pd.Series( + [3, 2, 1], index=pd.CategoricalIndex([np.nan, "a", "b"])) res = s.value_counts(dropna=False) tm.assert_series_equal(res, exp) # When we aren't sorting by counts, and np.nan isn't a # category, it should be last. - exp = pd.Series([2, 1, 3], index=pd.CategoricalIndex(["a", "b", np.nan])) + exp = pd.Series( + [2, 1, 3], index=pd.CategoricalIndex(["a", "b", np.nan])) res = s.value_counts(dropna=False, sort=False) tm.assert_series_equal(res, exp) @@ -3037,7 +2957,7 @@ def test_groupby(self): ['foo', 'bar']], names=['A', 'B', 'C']) expected = DataFrame({'values': Series( - np.nan, index=exp_index)}).sortlevel() + np.nan, index=exp_index)}).sort_index() expected.iloc[[1, 2, 7, 8], 0] = [1, 2, 3, 4] result = gb.sum() tm.assert_frame_equal(result, expected) @@ -3098,7 +3018,7 @@ def f(x): # GH 9603 df = pd.DataFrame({'a': [1, 0, 0, 0]}) - c = pd.cut(df.a, [0, 1, 2, 3, 4]) + c = pd.cut(df.a, [0, 1, 2, 3, 4], labels=pd.Categorical(list('abcd'))) result = df.groupby(c).apply(len) exp_index = pd.CategoricalIndex(c.values.categories, @@ -3120,16 +3040,17 @@ def test_pivot_table(self): [Categorical(["a", "b", "z"], ordered=True), Categorical(["c", "d", "y"], ordered=True)], names=['A', 'B']) - expected = Series([1, 2, np.nan, 3, 4, np.nan, np.nan, np.nan, np.nan], - index=exp_index, name='values') - tm.assert_series_equal(result, expected) + expected = DataFrame( + {'values': [1, 2, np.nan, 3, 4, np.nan, np.nan, np.nan, np.nan]}, + index=exp_index) + tm.assert_frame_equal(result, expected) def test_count(self): s = Series(Categorical([np.nan, 1, 2, np.nan], categories=[5, 4, 3, 2, 1], ordered=True)) result = s.count() - self.assertEqual(result, 2) + assert result == 2 def test_sort_values(self): @@ -3152,17 +3073,17 @@ def test_sort_values(self): cat = Series(Categorical(["a", "c", "b", "d"], ordered=True)) res = cat.sort_values() exp = np.array(["a", "b", "c", "d"], dtype=np.object_) - self.assert_numpy_array_equal(res.__array__(), exp) + tm.assert_numpy_array_equal(res.__array__(), exp) cat = Series(Categorical(["a", "c", "b", "d"], categories=[ "a", "b", "c", "d"], ordered=True)) res = cat.sort_values() exp = np.array(["a", "b", "c", "d"], dtype=np.object_) - self.assert_numpy_array_equal(res.__array__(), exp) + tm.assert_numpy_array_equal(res.__array__(), exp) res = cat.sort_values(ascending=False) exp = np.array(["d", "c", "b", "a"], dtype=np.object_) - self.assert_numpy_array_equal(res.__array__(), exp) + 
tm.assert_numpy_array_equal(res.__array__(), exp) raw_cat1 = Categorical(["a", "b", "c", "d"], categories=["a", "b", "c", "d"], ordered=False) @@ -3177,14 +3098,14 @@ def test_sort_values(self): # Cats must be sorted in a dataframe res = df.sort_values(by=["string"], ascending=False) exp = np.array(["d", "c", "b", "a"], dtype=np.object_) - self.assert_numpy_array_equal(res["sort"].values.__array__(), exp) - self.assertEqual(res["sort"].dtype, "category") + tm.assert_numpy_array_equal(res["sort"].values.__array__(), exp) + assert res["sort"].dtype == "category" res = df.sort_values(by=["sort"], ascending=False) exp = df.sort_values(by=["string"], ascending=True) - self.assert_series_equal(res["values"], exp["values"]) - self.assertEqual(res["sort"].dtype, "category") - self.assertEqual(res["unsort"].dtype, "category") + tm.assert_series_equal(res["values"], exp["values"]) + assert res["sort"].dtype == "category" + assert res["unsort"].dtype == "category" # unordered cat, but we allow this df.sort_values(by=["unsort"], ascending=False) @@ -3210,12 +3131,12 @@ def test_slicing(self): cat = Series(Categorical([1, 2, 3, 4])) reversed = cat[::-1] exp = np.array([4, 3, 2, 1], dtype=np.int64) - self.assert_numpy_array_equal(reversed.__array__(), exp) + tm.assert_numpy_array_equal(reversed.__array__(), exp) df = DataFrame({'value': (np.arange(100) + 1).astype('int64')}) df['D'] = pd.cut(df.value, bins=[0, 25, 50, 75, 100]) - expected = Series([11, '(0, 25]'], index=['value', 'D'], name=10) + expected = Series([11, Interval(0, 25)], index=['value', 'D'], name=10) result = df.iloc[10] tm.assert_series_equal(result, expected) @@ -3225,7 +3146,7 @@ def test_slicing(self): result = df.iloc[10:20] tm.assert_frame_equal(result, expected) - expected = Series([9, '(0, 25]'], index=['value', 'D'], name=8) + expected = Series([9, Interval(0, 25)], index=['value', 'D'], name=8) result = df.loc[8] tm.assert_series_equal(result, expected) @@ -3266,70 +3187,70 @@ def test_slicing_and_getting_ops(self): # frame res_df = df.iloc[2:4, :] tm.assert_frame_equal(res_df, exp_df) - self.assertTrue(is_categorical_dtype(res_df["cats"])) + assert is_categorical_dtype(res_df["cats"]) # row res_row = df.iloc[2, :] tm.assert_series_equal(res_row, exp_row) - tm.assertIsInstance(res_row["cats"], compat.string_types) + assert isinstance(res_row["cats"], compat.string_types) # col res_col = df.iloc[:, 0] tm.assert_series_equal(res_col, exp_col) - self.assertTrue(is_categorical_dtype(res_col)) + assert is_categorical_dtype(res_col) # single value res_val = df.iloc[2, 0] - self.assertEqual(res_val, exp_val) + assert res_val == exp_val # loc # frame res_df = df.loc["j":"k", :] tm.assert_frame_equal(res_df, exp_df) - self.assertTrue(is_categorical_dtype(res_df["cats"])) + assert is_categorical_dtype(res_df["cats"]) # row res_row = df.loc["j", :] tm.assert_series_equal(res_row, exp_row) - tm.assertIsInstance(res_row["cats"], compat.string_types) + assert isinstance(res_row["cats"], compat.string_types) # col res_col = df.loc[:, "cats"] tm.assert_series_equal(res_col, exp_col) - self.assertTrue(is_categorical_dtype(res_col)) + assert is_categorical_dtype(res_col) # single value res_val = df.loc["j", "cats"] - self.assertEqual(res_val, exp_val) + assert res_val == exp_val # ix # frame - # res_df = df.ix["j":"k",[0,1]] # doesn't work? - res_df = df.ix["j":"k", :] + # res_df = df.loc["j":"k",[0,1]] # doesn't work? 
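The hunks here and below migrate the deprecated `.ix` indexer to `.loc`. As a point of reference, a minimal standalone sketch of the idiom (the frame, index labels, and column order are illustrative, not the fixture these tests use):

```python
import pandas as pd

df = pd.DataFrame({"cats": pd.Categorical(list("aabbcc")),
                   "values": range(6)},
                  index=list("hijklm"))

# .loc is purely label-based: rows "j" through "k", inclusive.
res = df.loc["j":"k", "cats"]

# .ix silently mixed labels and positions, which is why the migrated
# code spells a positional column lookup as df.columns[0] under .loc ...
res = df.loc["j":"k", df.columns[0]]

# ... while a fully positional lookup goes through .iloc instead.
res = df.iloc[2:4, 0]
```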
+ res_df = df.loc["j":"k", :] tm.assert_frame_equal(res_df, exp_df) - self.assertTrue(is_categorical_dtype(res_df["cats"])) + assert is_categorical_dtype(res_df["cats"]) # row - res_row = df.ix["j", :] + res_row = df.loc["j", :] tm.assert_series_equal(res_row, exp_row) - tm.assertIsInstance(res_row["cats"], compat.string_types) + assert isinstance(res_row["cats"], compat.string_types) # col - res_col = df.ix[:, "cats"] + res_col = df.loc[:, "cats"] tm.assert_series_equal(res_col, exp_col) - self.assertTrue(is_categorical_dtype(res_col)) + assert is_categorical_dtype(res_col) # single value - res_val = df.ix["j", 0] - self.assertEqual(res_val, exp_val) + res_val = df.loc["j", df.columns[0]] + assert res_val == exp_val # iat res_val = df.iat[2, 0] - self.assertEqual(res_val, exp_val) + assert res_val == exp_val # at res_val = df.at["j", "cats"] - self.assertEqual(res_val, exp_val) + assert res_val == exp_val # fancy indexing exp_fancy = df.iloc[[2]] @@ -3341,32 +3262,32 @@ def test_slicing_and_getting_ops(self): # get_value res_val = df.get_value("j", "cats") - self.assertEqual(res_val, exp_val) + assert res_val == exp_val # i : int, slice, or sequence of integers res_row = df.iloc[2] tm.assert_series_equal(res_row, exp_row) - tm.assertIsInstance(res_row["cats"], compat.string_types) + assert isinstance(res_row["cats"], compat.string_types) res_df = df.iloc[slice(2, 4)] tm.assert_frame_equal(res_df, exp_df) - self.assertTrue(is_categorical_dtype(res_df["cats"])) + assert is_categorical_dtype(res_df["cats"]) res_df = df.iloc[[2, 3]] tm.assert_frame_equal(res_df, exp_df) - self.assertTrue(is_categorical_dtype(res_df["cats"])) + assert is_categorical_dtype(res_df["cats"]) res_col = df.iloc[:, 0] tm.assert_series_equal(res_col, exp_col) - self.assertTrue(is_categorical_dtype(res_col)) + assert is_categorical_dtype(res_col) res_df = df.iloc[:, slice(0, 2)] tm.assert_frame_equal(res_df, df) - self.assertTrue(is_categorical_dtype(res_df["cats"])) + assert is_categorical_dtype(res_df["cats"]) res_df = df.iloc[:, [0, 1]] tm.assert_frame_equal(res_df, df) - self.assertTrue(is_categorical_dtype(res_df["cats"])) + assert is_categorical_dtype(res_df["cats"]) def test_slicing_doc_examples(self): @@ -3393,7 +3314,7 @@ def test_slicing_doc_examples(self): index=['h', 'i', 'j'], name='cats') tm.assert_series_equal(result, expected) - result = df.ix["h":"j", 0:1] + result = df.loc["h":"j", df.columns[0:1]] expected = DataFrame({'cats': Categorical(['a', 'b', 'b'], categories=['a', 'b', 'c'])}, index=['h', 'i', 'j']) @@ -3473,7 +3394,7 @@ def f(): df = orig.copy() df.iloc[2, 0] = "c" - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # - assign a complete row (mixed values) -> exp_single_row df = orig.copy() @@ -3485,7 +3406,7 @@ def f(): df = orig.copy() df.iloc[2, :] = ["c", 2] - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # - assign multiple rows (mixed values) -> exp_multi_row df = orig.copy() @@ -3496,7 +3417,7 @@ def f(): df = orig.copy() df.iloc[2:4, :] = [["c", 2], ["c", 2]] - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # assign a part of a column with dtype == categorical -> # exp_parts_cats_col @@ -3504,13 +3425,13 @@ def f(): df.iloc[2:4, 0] = pd.Categorical(["b", "b"], categories=["a", "b"]) tm.assert_frame_equal(df, exp_parts_cats_col) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): # different categories -> not sure if this should fail or pass df = orig.copy() df.iloc[2:4, 0] = pd.Categorical( ["b", "b"], 
categories=["a", "b", "c"]) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): # different values df = orig.copy() df.iloc[2:4, 0] = pd.Categorical( @@ -3522,7 +3443,7 @@ def f(): df.iloc[2:4, 0] = ["b", "b"] tm.assert_frame_equal(df, exp_parts_cats_col) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.iloc[2:4, 0] = ["c", "c"] # loc @@ -3541,7 +3462,7 @@ def f(): df = orig.copy() df.loc["j", "cats"] = "c" - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # - assign a complete row (mixed values) -> exp_single_row df = orig.copy() @@ -3553,7 +3474,7 @@ def f(): df = orig.copy() df.loc["j", :] = ["c", 2] - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # - assign multiple rows (mixed values) -> exp_multi_row df = orig.copy() @@ -3564,7 +3485,7 @@ def f(): df = orig.copy() df.loc["j":"k", :] = [["c", 2], ["c", 2]] - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # assign a part of a column with dtype == categorical -> # exp_parts_cats_col @@ -3573,13 +3494,13 @@ def f(): ["b", "b"], categories=["a", "b"]) tm.assert_frame_equal(df, exp_parts_cats_col) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): # different categories -> not sure if this should fail or pass df = orig.copy() df.loc["j":"k", "cats"] = pd.Categorical( ["b", "b"], categories=["a", "b", "c"]) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): # different values df = orig.copy() df.loc["j":"k", "cats"] = pd.Categorical( @@ -3591,76 +3512,77 @@ def f(): df.loc["j":"k", "cats"] = ["b", "b"] tm.assert_frame_equal(df, exp_parts_cats_col) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): df.loc["j":"k", "cats"] = ["c", "c"] - # ix + # loc # ############## # - assign a single value -> exp_single_cats_value df = orig.copy() - df.ix["j", 0] = "b" + df.loc["j", df.columns[0]] = "b" tm.assert_frame_equal(df, exp_single_cats_value) df = orig.copy() - df.ix[df.index == "j", 0] = "b" + df.loc[df.index == "j", df.columns[0]] = "b" tm.assert_frame_equal(df, exp_single_cats_value) # - assign a single value not in the current categories set def f(): df = orig.copy() - df.ix["j", 0] = "c" + df.loc["j", df.columns[0]] = "c" - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # - assign a complete row (mixed values) -> exp_single_row df = orig.copy() - df.ix["j", :] = ["b", 2] + df.loc["j", :] = ["b", 2] tm.assert_frame_equal(df, exp_single_row) # - assign a complete row (mixed values) not in categories set def f(): df = orig.copy() - df.ix["j", :] = ["c", 2] + df.loc["j", :] = ["c", 2] - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # - assign multiple rows (mixed values) -> exp_multi_row df = orig.copy() - df.ix["j":"k", :] = [["b", 2], ["b", 2]] + df.loc["j":"k", :] = [["b", 2], ["b", 2]] tm.assert_frame_equal(df, exp_multi_row) def f(): df = orig.copy() - df.ix["j":"k", :] = [["c", 2], ["c", 2]] + df.loc["j":"k", :] = [["c", 2], ["c", 2]] - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # assign a part of a column with dtype == categorical -> # exp_parts_cats_col df = orig.copy() - df.ix["j":"k", 0] = pd.Categorical(["b", "b"], categories=["a", "b"]) + df.loc["j":"k", df.columns[0]] = pd.Categorical( + ["b", "b"], categories=["a", "b"]) tm.assert_frame_equal(df, exp_parts_cats_col) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): # different categories -> not sure if this should fail or pass df = orig.copy() - df.ix["j":"k", 
0] = pd.Categorical( + df.loc["j":"k", df.columns[0]] = pd.Categorical( ["b", "b"], categories=["a", "b", "c"]) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): # different values df = orig.copy() - df.ix["j":"k", 0] = pd.Categorical(["c", "c"], - categories=["a", "b", "c"]) + df.loc["j":"k", df.columns[0]] = pd.Categorical( + ["c", "c"], categories=["a", "b", "c"]) # assign a part of a column with dtype != categorical -> # exp_parts_cats_col df = orig.copy() - df.ix["j":"k", 0] = ["b", "b"] + df.loc["j":"k", df.columns[0]] = ["b", "b"] tm.assert_frame_equal(df, exp_parts_cats_col) - with tm.assertRaises(ValueError): - df.ix["j":"k", 0] = ["c", "c"] + with pytest.raises(ValueError): + df.loc["j":"k", df.columns[0]] = ["c", "c"] # iat df = orig.copy() @@ -3672,7 +3594,7 @@ def f(): df = orig.copy() df.iat[2, 0] = "c" - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # at # - assign a single value -> exp_single_cats_value @@ -3685,7 +3607,7 @@ def f(): df = orig.copy() df.at["j", "cats"] = "c" - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # fancy indexing catsf = pd.Categorical(["a", "a", "c", "c", "a", "a", "a"], @@ -3710,7 +3632,7 @@ def f(): df = orig.copy() df.set_value("j", "cats", "c") - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # Assigning a Category to parts of an int/... column uses the values of # the Categorical @@ -3800,20 +3722,20 @@ def test_comparisons(self): def f(): cat > cat_rev - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) # categorical cannot be compared to Series or numpy array, and also # not the other way around - self.assertRaises(TypeError, lambda: cat > s) - self.assertRaises(TypeError, lambda: cat_rev > s) - self.assertRaises(TypeError, lambda: cat > a) - self.assertRaises(TypeError, lambda: cat_rev > a) + pytest.raises(TypeError, lambda: cat > s) + pytest.raises(TypeError, lambda: cat_rev > s) + pytest.raises(TypeError, lambda: cat > a) + pytest.raises(TypeError, lambda: cat_rev > a) - self.assertRaises(TypeError, lambda: s < cat) - self.assertRaises(TypeError, lambda: s < cat_rev) + pytest.raises(TypeError, lambda: s < cat) + pytest.raises(TypeError, lambda: s < cat_rev) - self.assertRaises(TypeError, lambda: a < cat) - self.assertRaises(TypeError, lambda: a < cat_rev) + pytest.raises(TypeError, lambda: a < cat) + pytest.raises(TypeError, lambda: a < cat_rev) # unequal comparison should raise for unordered cats cat = Series(Categorical(list("abc"))) @@ -3821,26 +3743,26 @@ def f(): def f(): cat > "b" - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) cat = Series(Categorical(list("abc"), ordered=False)) def f(): cat > "b" - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) # https://github.com/pandas-dev/pandas/issues/9836#issuecomment-92123057 # and following comparisons with scalars not in categories should raise # for unequal comps, but not for equal/not equal cat = Series(Categorical(list("abc"), ordered=True)) - self.assertRaises(TypeError, lambda: cat < "d") - self.assertRaises(TypeError, lambda: cat > "d") - self.assertRaises(TypeError, lambda: "d" < cat) - self.assertRaises(TypeError, lambda: "d" > cat) + pytest.raises(TypeError, lambda: cat < "d") + pytest.raises(TypeError, lambda: cat > "d") + pytest.raises(TypeError, lambda: "d" < cat) + pytest.raises(TypeError, lambda: "d" > cat) - self.assert_series_equal(cat == "d", Series([False, False, False])) - self.assert_series_equal(cat != "d", Series([True, True, True])) + 
tm.assert_series_equal(cat == "d", Series([False, False, False])) + tm.assert_series_equal(cat != "d", Series([True, True, True])) # And test NaN handling... cat = Series(Categorical(["a", "b", "c", np.nan])) @@ -3860,45 +3782,45 @@ def test_cat_equality(self): f = Categorical(list('acb')) # vs scalar - self.assertFalse((a == 'a').all()) - self.assertTrue(((a != 'a') == ~(a == 'a')).all()) + assert not (a == 'a').all() + assert ((a != 'a') == ~(a == 'a')).all() - self.assertFalse(('a' == a).all()) - self.assertTrue((a == 'a')[0]) - self.assertTrue(('a' == a)[0]) - self.assertFalse(('a' != a)[0]) + assert not ('a' == a).all() + assert (a == 'a')[0] + assert ('a' == a)[0] + assert not ('a' != a)[0] # vs list-like - self.assertTrue((a == a).all()) - self.assertFalse((a != a).all()) + assert (a == a).all() + assert not (a != a).all() - self.assertTrue((a == list(a)).all()) - self.assertTrue((a == b).all()) - self.assertTrue((b == a).all()) - self.assertTrue(((~(a == b)) == (a != b)).all()) - self.assertTrue(((~(b == a)) == (b != a)).all()) + assert (a == list(a)).all() + assert (a == b).all() + assert (b == a).all() + assert ((~(a == b)) == (a != b)).all() + assert ((~(b == a)) == (b != a)).all() - self.assertFalse((a == c).all()) - self.assertFalse((c == a).all()) - self.assertFalse((a == d).all()) - self.assertFalse((d == a).all()) + assert not (a == c).all() + assert not (c == a).all() + assert not (a == d).all() + assert not (d == a).all() # vs a cat-like - self.assertTrue((a == e).all()) - self.assertTrue((e == a).all()) - self.assertFalse((a == f).all()) - self.assertFalse((f == a).all()) + assert (a == e).all() + assert (e == a).all() + assert not (a == f).all() + assert not (f == a).all() - self.assertTrue(((~(a == e) == (a != e)).all())) - self.assertTrue(((~(e == a) == (e != a)).all())) - self.assertTrue(((~(a == f) == (a != f)).all())) - self.assertTrue(((~(f == a) == (f != a)).all())) + assert ((~(a == e) == (a != e)).all()) + assert ((~(e == a) == (e != a)).all()) + assert ((~(a == f) == (a != f)).all()) + assert ((~(f == a) == (f != a)).all()) # non-equality is not comparable - self.assertRaises(TypeError, lambda: a < b) - self.assertRaises(TypeError, lambda: b < a) - self.assertRaises(TypeError, lambda: a > b) - self.assertRaises(TypeError, lambda: b > a) + pytest.raises(TypeError, lambda: a < b) + pytest.raises(TypeError, lambda: b < a) + pytest.raises(TypeError, lambda: a > b) + pytest.raises(TypeError, lambda: b > a) def test_concat_append(self): cat = pd.Categorical(["a", "b"], categories=["a", "b"]) @@ -3935,19 +3857,18 @@ def test_concat_append_gh7864(self): df1 = df[0:3] df2 = df[3:] - self.assert_index_equal(df['grade'].cat.categories, - df1['grade'].cat.categories) - self.assert_index_equal(df['grade'].cat.categories, - df2['grade'].cat.categories) + tm.assert_index_equal(df['grade'].cat.categories, + df1['grade'].cat.categories) + tm.assert_index_equal(df['grade'].cat.categories, + df2['grade'].cat.categories) dfx = pd.concat([df1, df2]) - self.assert_index_equal(df['grade'].cat.categories, - dfx['grade'].cat.categories) + tm.assert_index_equal(df['grade'].cat.categories, + dfx['grade'].cat.categories) dfa = df1.append(df2) - self.assert_index_equal(df['grade'].cat.categories, - dfa['grade'].cat.categories) - + tm.assert_index_equal(df['grade'].cat.categories, + dfa['grade'].cat.categories) def test_concat_preserve(self): @@ -3977,7 +3898,7 @@ def test_concat_preserve(self): res = pd.concat([df2, df2]) exp = DataFrame({'A': pd.concat([a, a]), 'B': pd.concat([b, 
b]).astype( - 'category', categories=list('cab'))}) + 'category', categories=list('cab'))}) tm.assert_frame_equal(res, exp) def test_categorical_index_preserver(self): @@ -3987,19 +3908,19 @@ def test_categorical_index_preserver(self): df2 = DataFrame({'A': a, 'B': b.astype('category', categories=list('cab')) - }).set_index('B') + }).set_index('B') result = pd.concat([df2, df2]) expected = DataFrame({'A': pd.concat([a, a]), 'B': pd.concat([b, b]).astype( 'category', categories=list('cab')) - }).set_index('B') + }).set_index('B') tm.assert_frame_equal(result, expected) # wrong categories df3 = DataFrame({'A': a, 'B': pd.Categorical(b, categories=list('abc')) - }).set_index('B') - self.assertRaises(TypeError, lambda: pd.concat([df2, df3])) + }).set_index('B') + pytest.raises(TypeError, lambda: pd.concat([df2, df3])) def test_merge(self): # GH 9426 @@ -4030,9 +3951,12 @@ def test_merge(self): expected = df.copy() # object-cat + # note that we propagate the category + # because we don't have any matching rows cright = right.copy() cright['d'] = cright['d'].astype('category') result = pd.merge(left, cright, how='left', left_on='b', right_on='c') + expected['d'] = expected['d'].astype('category', categories=['null']) tm.assert_frame_equal(result, expected) # cat-object @@ -4054,15 +3978,15 @@ def test_repeat(self): cat = pd.Categorical(["a", "b"], categories=["a", "b"]) exp = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b"]) res = cat.repeat(2) - self.assert_categorical_equal(res, exp) + tm.assert_categorical_equal(res, exp) def test_numpy_repeat(self): cat = pd.Categorical(["a", "b"], categories=["a", "b"]) exp = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b"]) - self.assert_categorical_equal(np.repeat(cat, 2), exp) + tm.assert_categorical_equal(np.repeat(cat, 2), exp) msg = "the 'axis' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.repeat, cat, 2, axis=1) + tm.assert_raises_regex(ValueError, msg, np.repeat, cat, 2, axis=1) def test_reshape(self): cat = pd.Categorical([], categories=["a", "b"]) @@ -4070,34 +3994,34 @@ def test_reshape(self): with tm.assert_produces_warning(FutureWarning): cat = pd.Categorical([], categories=["a", "b"]) - self.assert_categorical_equal(cat.reshape(0), cat) + tm.assert_categorical_equal(cat.reshape(0), cat) with tm.assert_produces_warning(FutureWarning): cat = pd.Categorical([], categories=["a", "b"]) - self.assert_categorical_equal(cat.reshape((5, -1)), cat) + tm.assert_categorical_equal(cat.reshape((5, -1)), cat) with tm.assert_produces_warning(FutureWarning): cat = pd.Categorical(["a", "b"], categories=["a", "b"]) - self.assert_categorical_equal(cat.reshape(cat.shape), cat) + tm.assert_categorical_equal(cat.reshape(cat.shape), cat) with tm.assert_produces_warning(FutureWarning): cat = pd.Categorical(["a", "b"], categories=["a", "b"]) - self.assert_categorical_equal(cat.reshape(cat.size), cat) + tm.assert_categorical_equal(cat.reshape(cat.size), cat) with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): msg = "can only specify one unknown dimension" cat = pd.Categorical(["a", "b"], categories=["a", "b"]) - tm.assertRaisesRegexp(ValueError, msg, cat.reshape, (-2, -1)) + tm.assert_raises_regex(ValueError, msg, cat.reshape, (-2, -1)) def test_numpy_reshape(self): with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): cat = pd.Categorical(["a", "b"], categories=["a", "b"]) - self.assert_categorical_equal(np.reshape(cat, cat.shape), cat) + tm.assert_categorical_equal(np.reshape(cat, 
cat.shape), cat) with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): msg = "the 'order' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.reshape, - cat, cat.shape, order='F') + tm.assert_raises_regex(ValueError, msg, np.reshape, + cat, cat.shape, order='F') def test_na_actions(self): @@ -4121,7 +4045,7 @@ def test_na_actions(self): def f(): df.fillna(value={"cats": 4, "vals": "c"}) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) res = df.fillna(method='pad') tm.assert_frame_equal(res, df_exp_fill) @@ -4179,7 +4103,7 @@ def test_astype_to_other(self): expected = s tm.assert_series_equal(s.astype('category'), expected) tm.assert_series_equal(s.astype(CategoricalDtype()), expected) - self.assertRaises(ValueError, lambda: s.astype('float64')) + pytest.raises(ValueError, lambda: s.astype('float64')) cat = Series(Categorical(['a', 'b', 'b', 'a', 'a', 'c', 'c', 'c'])) exp = Series(['a', 'b', 'b', 'a', 'a', 'c', 'c', 'c']) @@ -4217,7 +4141,7 @@ def cmp(a, b): # invalid conversion (these are NOT a dtype) for invalid in [lambda x: x.astype(pd.Categorical), lambda x: x.astype('object').astype(pd.Categorical)]: - self.assertRaises(TypeError, lambda: invalid(s)) + pytest.raises(TypeError, lambda: invalid(s)) def test_astype_categorical(self): @@ -4225,7 +4149,7 @@ def test_astype_categorical(self): tm.assert_categorical_equal(cat, cat.astype('category')) tm.assert_almost_equal(np.array(cat), cat.astype('object')) - self.assertRaises(ValueError, lambda: cat.astype(float)) + pytest.raises(ValueError, lambda: cat.astype(float)) def test_to_records(self): @@ -4252,28 +4176,28 @@ def test_numeric_like_ops(self): # numeric ops should not succeed for op in ['__add__', '__sub__', '__mul__', '__truediv__']: - self.assertRaises(TypeError, - lambda: getattr(self.cat, op)(self.cat)) + pytest.raises(TypeError, + lambda: getattr(self.cat, op)(self.cat)) # reduction ops should not succeed (unless specifically defined, e.g. 
# min/max) s = self.cat['value_group'] for op in ['kurt', 'skew', 'var', 'std', 'mean', 'sum', 'median']: - self.assertRaises(TypeError, - lambda: getattr(s, op)(numeric_only=False)) + pytest.raises(TypeError, + lambda: getattr(s, op)(numeric_only=False)) # mad technically works because it takes always the numeric data # numpy ops s = pd.Series(pd.Categorical([1, 2, 3, 4])) - self.assertRaises(TypeError, lambda: np.sum(s)) + pytest.raises(TypeError, lambda: np.sum(s)) # numeric ops on a Series for op in ['__add__', '__sub__', '__mul__', '__truediv__']: - self.assertRaises(TypeError, lambda: getattr(s, op)(2)) + pytest.raises(TypeError, lambda: getattr(s, op)(2)) # invalid ufunc - self.assertRaises(TypeError, lambda: np.log(s)) + pytest.raises(TypeError, lambda: np.log(s)) def test_cat_tab_completition(self): # test the tab completion display @@ -4293,20 +4217,21 @@ def get_dir(s): def test_cat_accessor_api(self): # GH 9322 from pandas.core.categorical import CategoricalAccessor - self.assertIs(Series.cat, CategoricalAccessor) + assert Series.cat is CategoricalAccessor s = Series(list('aabbcde')).astype('category') - self.assertIsInstance(s.cat, CategoricalAccessor) + assert isinstance(s.cat, CategoricalAccessor) invalid = Series([1]) - with tm.assertRaisesRegexp(AttributeError, "only use .cat accessor"): + with tm.assert_raises_regex(AttributeError, + "only use .cat accessor"): invalid.cat - self.assertFalse(hasattr(invalid, 'cat')) + assert not hasattr(invalid, 'cat') def test_cat_accessor_no_new_attributes(self): # https://github.com/pandas-dev/pandas/issues/10673 c = Series(list('aabbcde')).astype('category') - with tm.assertRaisesRegexp(AttributeError, - "You cannot add any new attribute"): + with tm.assert_raises_regex(AttributeError, + "You cannot add any new attribute"): c.cat.xlabel = "a" def test_str_accessor_api_for_categorical(self): @@ -4315,7 +4240,7 @@ def test_str_accessor_api_for_categorical(self): s = Series(list('aabb')) s = s + " " + s c = s.astype('category') - self.assertIsInstance(c.str, StringMethods) + assert isinstance(c.str, StringMethods) # str functions, which need special arguments special_func_defs = [ @@ -4326,8 +4251,8 @@ def test_str_accessor_api_for_categorical(self): ('decode', ("UTF-8",), {}), ('encode', ("UTF-8",), {}), ('endswith', ("a",), {}), - ('extract', ("([a-z]*) ",), {"expand":False}), - ('extract', ("([a-z]*) ",), {"expand":True}), + ('extract', ("([a-z]*) ",), {"expand": False}), + ('extract', ("([a-z]*) ",), {"expand": True}), ('extractall', ("([a-z]*) ",), {}), ('find', ("a",), {}), ('findall', ("a",), {}), @@ -4361,10 +4286,10 @@ def test_str_accessor_api_for_categorical(self): # * `translate` has different interfaces for py2 vs. 
py3 _ignore_names = ["get", "join", "translate"] - str_func_names = [f - for f in dir(s.str) - if not (f.startswith("_") or f in _special_func_names - or f in _ignore_names)] + str_func_names = [f for f in dir(s.str) if not ( + f.startswith("_") or + f in _special_func_names or + f in _ignore_names)] func_defs = [(f, (), {}) for f in str_func_names] func_defs.extend(special_func_defs) @@ -4379,17 +4304,15 @@ def test_str_accessor_api_for_categorical(self): tm.assert_series_equal(res, exp) invalid = Series([1, 2, 3]).astype('category') - with tm.assertRaisesRegexp(AttributeError, - "Can only use .str accessor with string"): + with tm.assert_raises_regex(AttributeError, + "Can only use .str " + "accessor with string"): invalid.str - self.assertFalse(hasattr(invalid, 'str')) + assert not hasattr(invalid, 'str') def test_dt_accessor_api_for_categorical(self): # https://github.com/pandas-dev/pandas/issues/10661 - from pandas.tseries.common import Properties - from pandas.tseries.index import date_range, DatetimeIndex - from pandas.tseries.period import period_range, PeriodIndex - from pandas.tseries.tdi import timedelta_range, TimedeltaIndex + from pandas.core.indexes.accessors import Properties s_dr = Series(date_range('1/1/2015', periods=5, tz="MET")) c_dr = s_dr.astype("category") @@ -4400,12 +4323,16 @@ def test_dt_accessor_api_for_categorical(self): s_tdr = Series(timedelta_range('1 days', '10 days')) c_tdr = s_tdr.astype("category") + # only testing field (like .day) + # and bool (is_month_start) + get_ops = lambda x: x._datetimelike_ops + test_data = [ - ("Datetime", DatetimeIndex._datetimelike_ops, s_dr, c_dr), - ("Period", PeriodIndex._datetimelike_ops, s_pr, c_pr), - ("Timedelta", TimedeltaIndex._datetimelike_ops, s_tdr, c_tdr)] + ("Datetime", get_ops(DatetimeIndex), s_dr, c_dr), + ("Period", get_ops(PeriodIndex), s_pr, c_pr), + ("Timedelta", get_ops(TimedeltaIndex), s_tdr, c_tdr)] - self.assertIsInstance(c_dr.dt, Properties) + assert isinstance(c_dr.dt, Properties) special_func_defs = [ ('strftime', ("%Y-%m-%d",), {}), @@ -4413,12 +4340,13 @@ def test_dt_accessor_api_for_categorical(self): ('round', ("D",), {}), ('floor', ("D",), {}), ('ceil', ("D",), {}), + ('asfreq', ("D",), {}), # ('tz_localize', ("UTC",), {}), ] _special_func_names = [f[0] for f in special_func_defs] # the series is already localized - _ignore_names = ['tz_localize'] + _ignore_names = ['tz_localize', 'components'] for name, attr_names, s, c in test_data: func_names = [f @@ -4440,7 +4368,7 @@ def test_dt_accessor_api_for_categorical(self): elif isinstance(res, pd.Series): tm.assert_series_equal(res, exp) else: - tm.assert_numpy_array_equal(res, exp) + tm.assert_almost_equal(res, exp) for attr in attr_names: try: @@ -4455,13 +4383,13 @@ def test_dt_accessor_api_for_categorical(self): elif isinstance(res, pd.Series): tm.assert_series_equal(res, exp) else: - tm.assert_numpy_array_equal(res, exp) + tm.assert_almost_equal(res, exp) invalid = Series([1, 2, 3]).astype('category') - with tm.assertRaisesRegexp( + with tm.assert_raises_regex( AttributeError, "Can only use .dt accessor with datetimelike"): invalid.dt - self.assertFalse(hasattr(invalid, 'str')) + assert not hasattr(invalid, 'str') def test_concat_categorical(self): # See GH 10177 @@ -4483,38 +4411,22 @@ def test_concat_categorical(self): tm.assert_frame_equal(res, exp) -class TestCategoricalSubclassing(tm.TestCase): - - _multiprocess_can_split_ = True +class TestCategoricalSubclassing(object): def test_constructor(self): sc = 
tm.SubclassedCategorical(['a', 'b', 'c']) - self.assertIsInstance(sc, tm.SubclassedCategorical) + assert isinstance(sc, tm.SubclassedCategorical) tm.assert_categorical_equal(sc, Categorical(['a', 'b', 'c'])) def test_from_array(self): sc = tm.SubclassedCategorical.from_codes([1, 0, 2], ['a', 'b', 'c']) - self.assertIsInstance(sc, tm.SubclassedCategorical) + assert isinstance(sc, tm.SubclassedCategorical) exp = Categorical.from_codes([1, 0, 2], ['a', 'b', 'c']) tm.assert_categorical_equal(sc, exp) def test_map(self): sc = tm.SubclassedCategorical(['a', 'b', 'c']) res = sc.map(lambda x: x.upper()) - self.assertIsInstance(res, tm.SubclassedCategorical) + assert isinstance(res, tm.SubclassedCategorical) exp = Categorical(['A', 'B', 'C']) tm.assert_categorical_equal(res, exp) - - def test_map(self): - sc = tm.SubclassedCategorical(['a', 'b', 'c']) - res = sc.map(lambda x: x.upper()) - self.assertIsInstance(res, tm.SubclassedCategorical) - exp = Categorical(['A', 'B', 'C']) - tm.assert_categorical_equal(res, exp) - - -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - # '--with-coverage', '--cover-package=pandas.core'] - exit=False) diff --git a/pandas/tests/test_common.py b/pandas/tests/test_common.py index 09dd3f7ab517c..d7dbaccb87ee8 100644 --- a/pandas/tests/test_common.py +++ b/pandas/tests/test_common.py @@ -1,6 +1,7 @@ # -*- coding: utf-8 -*- -import nose +import pytest + import numpy as np from pandas import Series, Timestamp @@ -8,12 +9,10 @@ import pandas.core.common as com import pandas.util.testing as tm -_multiprocess_can_split_ = True - def test_mut_exclusive(): msg = "mutually exclusive arguments: '[ab]' and '[ab]'" - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): com._mut_exclusive(a=1, b=2) assert com._mut_exclusive(a=1, b=None) == 1 assert com._mut_exclusive(major=None, major_axis=None) is None @@ -145,21 +144,21 @@ def test_random_state(): import numpy.random as npr # Check with seed state = com._random_state(5) - tm.assert_equal(state.uniform(), npr.RandomState(5).uniform()) + assert state.uniform() == npr.RandomState(5).uniform() # Check with random state object state2 = npr.RandomState(10) - tm.assert_equal( - com._random_state(state2).uniform(), npr.RandomState(10).uniform()) + assert (com._random_state(state2).uniform() == + npr.RandomState(10).uniform()) # check with no arg random state assert com._random_state() is np.random # Error for floats or strings - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): com._random_state('test') - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): com._random_state(5.5) @@ -196,8 +195,3 @@ def test_dict_compat(): assert (com._dict_compat(data_datetime64) == expected) assert (com._dict_compat(expected) == expected) assert (com._dict_compat(data_unchanged) == data_unchanged) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/test_compat.py b/pandas/tests/test_compat.py index 68c0b81eb18ce..ff9d09c033164 100644 --- a/pandas/tests/test_compat.py +++ b/pandas/tests/test_compat.py @@ -6,21 +6,23 @@ from pandas.compat import (range, zip, map, filter, lrange, lzip, lmap, lfilter, builtins, iterkeys, itervalues, iteritems, next) -import pandas.util.testing as tm -class TestBuiltinIterators(tm.TestCase): +class TestBuiltinIterators(object): - def check_result(self, actual, expected, lengths): + @classmethod + 
def check_result(cls, actual, expected, lengths): for (iter_res, list_res), exp, length in zip(actual, expected, lengths): - self.assertNotIsInstance(iter_res, list) - tm.assertIsInstance(list_res, list) + assert not isinstance(iter_res, list) + assert isinstance(list_res, list) + iter_res = list(iter_res) - self.assertEqual(len(list_res), length) - self.assertEqual(len(iter_res), length) - self.assertEqual(iter_res, exp) - self.assertEqual(list_res, exp) + + assert len(list_res) == length + assert len(iter_res) == length + assert iter_res == exp + assert list_res == exp def test_range(self): actual1 = range(10) @@ -64,6 +66,6 @@ def test_zip(self): self.check_result(actual, expected, lengths) def test_dict_iterators(self): - self.assertEqual(next(itervalues({1: 2})), 2) - self.assertEqual(next(iterkeys({1: 2})), 1) - self.assertEqual(next(iteritems({1: 2})), (1, 2)) + assert next(itervalues({1: 2})) == 2 + assert next(iterkeys({1: 2})) == 1 + assert next(iteritems({1: 2})) == (1, 2) diff --git a/pandas/tests/test_config.py b/pandas/tests/test_config.py index ea226851c9101..8d6f36ac6a798 100644 --- a/pandas/tests/test_config.py +++ b/pandas/tests/test_config.py @@ -1,30 +1,36 @@ -#!/usr/bin/python # -*- coding: utf-8 -*- +import pytest + import pandas as pd -import unittest -import warnings +import warnings -class TestConfig(unittest.TestCase): - _multiprocess_can_split_ = True - def __init__(self, *args): - super(TestConfig, self).__init__(*args) +class TestConfig(object): + @classmethod + def setup_class(cls): from copy import deepcopy - self.cf = pd.core.config - self.gc = deepcopy(getattr(self.cf, '_global_config')) - self.do = deepcopy(getattr(self.cf, '_deprecated_options')) - self.ro = deepcopy(getattr(self.cf, '_registered_options')) - def setUp(self): + cls.cf = pd.core.config + cls.gc = deepcopy(getattr(cls.cf, '_global_config')) + cls.do = deepcopy(getattr(cls.cf, '_deprecated_options')) + cls.ro = deepcopy(getattr(cls.cf, '_registered_options')) + + def setup_method(self, method): setattr(self.cf, '_global_config', {}) - setattr( - self.cf, 'options', self.cf.DictWrapper(self.cf._global_config)) + setattr(self.cf, 'options', self.cf.DictWrapper( + self.cf._global_config)) setattr(self.cf, '_deprecated_options', {}) setattr(self.cf, '_registered_options', {}) - def tearDown(self): + # Our test fixture in conftest.py sets "chained_assignment" + # to "raise" only after all test methods have been set up. + # However, after this setup, there is no longer any + # "chained_assignment" option, so re-register it. 
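As a reference for the unittest-to-pytest conversion in this hunk (`__init__`/`setUp`/`tearDown` become `setup_class`/`setup_method`/`teardown_method`), a minimal sketch of the xunit-style hooks pytest collects on a plain class; the class body and option name are illustrative only, not pandas code:

```python
import pandas as pd


class TestOptionSketch(object):

    @classmethod
    def setup_class(cls):
        # runs once for the whole class (setUpClass analogue)
        cls.saved = pd.get_option('display.max_rows')

    def setup_method(self, method):
        # runs before every test method (setUp analogue)
        pd.set_option('display.max_rows', 10)

    def teardown_method(self, method):
        # runs after every test method (tearDown analogue)
        pd.set_option('display.max_rows', self.saved)

    def test_option_roundtrip(self):
        assert pd.get_option('display.max_rows') == 10
```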
+ self.cf.register_option('chained_assignment', 'raise') + + def teardown_method(self, method): setattr(self.cf, '_global_config', self.gc) setattr(self.cf, '_deprecated_options', self.do) setattr(self.cf, '_registered_options', self.ro) @@ -32,36 +38,36 @@ def tearDown(self): def test_api(self): # the pandas object exposes the user API - self.assertTrue(hasattr(pd, 'get_option')) - self.assertTrue(hasattr(pd, 'set_option')) - self.assertTrue(hasattr(pd, 'reset_option')) - self.assertTrue(hasattr(pd, 'describe_option')) + assert hasattr(pd, 'get_option') + assert hasattr(pd, 'set_option') + assert hasattr(pd, 'reset_option') + assert hasattr(pd, 'describe_option') def test_is_one_of_factory(self): v = self.cf.is_one_of_factory([None, 12]) v(12) v(None) - self.assertRaises(ValueError, v, 1.1) + pytest.raises(ValueError, v, 1.1) def test_register_option(self): self.cf.register_option('a', 1, 'doc') # can't register an already registered option - self.assertRaises(KeyError, self.cf.register_option, 'a', 1, 'doc') + pytest.raises(KeyError, self.cf.register_option, 'a', 1, 'doc') # can't register an already registered option - self.assertRaises(KeyError, self.cf.register_option, 'a.b.c.d1', 1, - 'doc') - self.assertRaises(KeyError, self.cf.register_option, 'a.b.c.d2', 1, - 'doc') + pytest.raises(KeyError, self.cf.register_option, 'a.b.c.d1', 1, + 'doc') + pytest.raises(KeyError, self.cf.register_option, 'a.b.c.d2', 1, + 'doc') # no python keywords - self.assertRaises(ValueError, self.cf.register_option, 'for', 0) - self.assertRaises(ValueError, self.cf.register_option, 'a.for.b', 0) + pytest.raises(ValueError, self.cf.register_option, 'for', 0) + pytest.raises(ValueError, self.cf.register_option, 'a.for.b', 0) # must be valid identifier (ensure attribute access works) - self.assertRaises(ValueError, self.cf.register_option, - 'Oh my Goddess!', 0) + pytest.raises(ValueError, self.cf.register_option, + 'Oh my Goddess!', 0) # we can register options several levels deep # without predefining the intermediate steps @@ -84,56 +90,42 @@ def test_describe_option(self): self.cf.register_option('l', "foo") # non-existent keys raise KeyError - self.assertRaises(KeyError, self.cf.describe_option, 'no.such.key') + pytest.raises(KeyError, self.cf.describe_option, 'no.such.key') # we can get the description for any key we registered - self.assertTrue( - 'doc' in self.cf.describe_option('a', _print_desc=False)) - self.assertTrue( - 'doc2' in self.cf.describe_option('b', _print_desc=False)) - self.assertTrue( - 'precated' in self.cf.describe_option('b', _print_desc=False)) - - self.assertTrue( - 'doc3' in self.cf.describe_option('c.d.e1', _print_desc=False)) - self.assertTrue( - 'doc4' in self.cf.describe_option('c.d.e2', _print_desc=False)) + assert 'doc' in self.cf.describe_option('a', _print_desc=False) + assert 'doc2' in self.cf.describe_option('b', _print_desc=False) + assert 'precated' in self.cf.describe_option('b', _print_desc=False) + assert 'doc3' in self.cf.describe_option('c.d.e1', _print_desc=False) + assert 'doc4' in self.cf.describe_option('c.d.e2', _print_desc=False) # if no doc is specified we get a default message # saying "description not available" - self.assertTrue( - 'vailable' in self.cf.describe_option('f', _print_desc=False)) - self.assertTrue( - 'vailable' in self.cf.describe_option('g.h', _print_desc=False)) - self.assertTrue( - 'precated' in self.cf.describe_option('g.h', _print_desc=False)) - self.assertTrue( - 'k' in self.cf.describe_option('g.h', _print_desc=False)) + assert 
'vailable' in self.cf.describe_option('f', _print_desc=False) + assert 'vailable' in self.cf.describe_option('g.h', _print_desc=False) + assert 'precated' in self.cf.describe_option('g.h', _print_desc=False) + assert 'k' in self.cf.describe_option('g.h', _print_desc=False) # default is reported - self.assertTrue( - 'foo' in self.cf.describe_option('l', _print_desc=False)) + assert 'foo' in self.cf.describe_option('l', _print_desc=False) # current value is reported - self.assertFalse( - 'bar' in self.cf.describe_option('l', _print_desc=False)) + assert 'bar' not in self.cf.describe_option('l', _print_desc=False) self.cf.set_option("l", "bar") - self.assertTrue( - 'bar' in self.cf.describe_option('l', _print_desc=False)) + assert 'bar' in self.cf.describe_option('l', _print_desc=False) def test_case_insensitive(self): self.cf.register_option('KanBAN', 1, 'doc') - self.assertTrue( - 'doc' in self.cf.describe_option('kanbaN', _print_desc=False)) - self.assertEqual(self.cf.get_option('kanBaN'), 1) + assert 'doc' in self.cf.describe_option('kanbaN', _print_desc=False) + assert self.cf.get_option('kanBaN') == 1 self.cf.set_option('KanBan', 2) - self.assertEqual(self.cf.get_option('kAnBaN'), 2) + assert self.cf.get_option('kAnBaN') == 2 # gets of non-existent keys fail - self.assertRaises(KeyError, self.cf.get_option, 'no_such_option') + pytest.raises(KeyError, self.cf.get_option, 'no_such_option') self.cf.deprecate_option('KanBan') - self.assertTrue(self.cf._is_deprecated('kAnBaN')) + assert self.cf._is_deprecated('kAnBaN') def test_get_option(self): self.cf.register_option('a', 1, 'doc') @@ -141,118 +133,118 @@ def test_get_option(self): self.cf.register_option('b.b', None, 'doc2') # gets of existing keys succeed - self.assertEqual(self.cf.get_option('a'), 1) - self.assertEqual(self.cf.get_option('b.c'), 'hullo') - self.assertTrue(self.cf.get_option('b.b') is None) + assert self.cf.get_option('a') == 1 + assert self.cf.get_option('b.c') == 'hullo' + assert self.cf.get_option('b.b') is None # gets of non-existent keys fail - self.assertRaises(KeyError, self.cf.get_option, 'no_such_option') + pytest.raises(KeyError, self.cf.get_option, 'no_such_option') def test_set_option(self): self.cf.register_option('a', 1, 'doc') self.cf.register_option('b.c', 'hullo', 'doc2') self.cf.register_option('b.b', None, 'doc2') - self.assertEqual(self.cf.get_option('a'), 1) - self.assertEqual(self.cf.get_option('b.c'), 'hullo') - self.assertTrue(self.cf.get_option('b.b') is None) + assert self.cf.get_option('a') == 1 + assert self.cf.get_option('b.c') == 'hullo' + assert self.cf.get_option('b.b') is None self.cf.set_option('a', 2) self.cf.set_option('b.c', 'wurld') self.cf.set_option('b.b', 1.1) - self.assertEqual(self.cf.get_option('a'), 2) - self.assertEqual(self.cf.get_option('b.c'), 'wurld') - self.assertEqual(self.cf.get_option('b.b'), 1.1) + assert self.cf.get_option('a') == 2 + assert self.cf.get_option('b.c') == 'wurld' + assert self.cf.get_option('b.b') == 1.1 - self.assertRaises(KeyError, self.cf.set_option, 'no.such.key', None) + pytest.raises(KeyError, self.cf.set_option, 'no.such.key', None) def test_set_option_empty_args(self): - self.assertRaises(ValueError, self.cf.set_option) + pytest.raises(ValueError, self.cf.set_option) def test_set_option_uneven_args(self): - self.assertRaises(ValueError, self.cf.set_option, 'a.b', 2, 'b.c') + pytest.raises(ValueError, self.cf.set_option, 'a.b', 2, 'b.c') def test_set_option_invalid_single_argument_type(self): - self.assertRaises(ValueError, self.cf.set_option, 
2) + pytest.raises(ValueError, self.cf.set_option, 2) def test_set_option_multiple(self): self.cf.register_option('a', 1, 'doc') self.cf.register_option('b.c', 'hullo', 'doc2') self.cf.register_option('b.b', None, 'doc2') - self.assertEqual(self.cf.get_option('a'), 1) - self.assertEqual(self.cf.get_option('b.c'), 'hullo') - self.assertTrue(self.cf.get_option('b.b') is None) + assert self.cf.get_option('a') == 1 + assert self.cf.get_option('b.c') == 'hullo' + assert self.cf.get_option('b.b') is None self.cf.set_option('a', '2', 'b.c', None, 'b.b', 10.0) - self.assertEqual(self.cf.get_option('a'), '2') - self.assertTrue(self.cf.get_option('b.c') is None) - self.assertEqual(self.cf.get_option('b.b'), 10.0) + assert self.cf.get_option('a') == '2' + assert self.cf.get_option('b.c') is None + assert self.cf.get_option('b.b') == 10.0 def test_validation(self): self.cf.register_option('a', 1, 'doc', validator=self.cf.is_int) self.cf.register_option('b.c', 'hullo', 'doc2', validator=self.cf.is_text) - self.assertRaises(ValueError, self.cf.register_option, 'a.b.c.d2', - 'NO', 'doc', validator=self.cf.is_int) + pytest.raises(ValueError, self.cf.register_option, 'a.b.c.d2', + 'NO', 'doc', validator=self.cf.is_int) self.cf.set_option('a', 2) # int is_int self.cf.set_option('b.c', 'wurld') # str is_str - self.assertRaises( + pytest.raises( ValueError, self.cf.set_option, 'a', None) # None not is_int - self.assertRaises(ValueError, self.cf.set_option, 'a', 'ab') - self.assertRaises(ValueError, self.cf.set_option, 'b.c', 1) + pytest.raises(ValueError, self.cf.set_option, 'a', 'ab') + pytest.raises(ValueError, self.cf.set_option, 'b.c', 1) validator = self.cf.is_one_of_factory([None, self.cf.is_callable]) self.cf.register_option('b', lambda: None, 'doc', validator=validator) self.cf.set_option('b', '%.1f'.format) # Formatter is callable self.cf.set_option('b', None) # Formatter is none (default) - self.assertRaises(ValueError, self.cf.set_option, 'b', '%.1f') + pytest.raises(ValueError, self.cf.set_option, 'b', '%.1f') def test_reset_option(self): self.cf.register_option('a', 1, 'doc', validator=self.cf.is_int) self.cf.register_option('b.c', 'hullo', 'doc2', validator=self.cf.is_str) - self.assertEqual(self.cf.get_option('a'), 1) - self.assertEqual(self.cf.get_option('b.c'), 'hullo') + assert self.cf.get_option('a') == 1 + assert self.cf.get_option('b.c') == 'hullo' self.cf.set_option('a', 2) self.cf.set_option('b.c', 'wurld') - self.assertEqual(self.cf.get_option('a'), 2) - self.assertEqual(self.cf.get_option('b.c'), 'wurld') + assert self.cf.get_option('a') == 2 + assert self.cf.get_option('b.c') == 'wurld' self.cf.reset_option('a') - self.assertEqual(self.cf.get_option('a'), 1) - self.assertEqual(self.cf.get_option('b.c'), 'wurld') + assert self.cf.get_option('a') == 1 + assert self.cf.get_option('b.c') == 'wurld' self.cf.reset_option('b.c') - self.assertEqual(self.cf.get_option('a'), 1) - self.assertEqual(self.cf.get_option('b.c'), 'hullo') + assert self.cf.get_option('a') == 1 + assert self.cf.get_option('b.c') == 'hullo' def test_reset_option_all(self): self.cf.register_option('a', 1, 'doc', validator=self.cf.is_int) self.cf.register_option('b.c', 'hullo', 'doc2', validator=self.cf.is_str) - self.assertEqual(self.cf.get_option('a'), 1) - self.assertEqual(self.cf.get_option('b.c'), 'hullo') + assert self.cf.get_option('a') == 1 + assert self.cf.get_option('b.c') == 'hullo' self.cf.set_option('a', 2) self.cf.set_option('b.c', 'wurld') - self.assertEqual(self.cf.get_option('a'), 2) - 
self.assertEqual(self.cf.get_option('b.c'), 'wurld') + assert self.cf.get_option('a') == 2 + assert self.cf.get_option('b.c') == 'wurld' self.cf.reset_option("all") - self.assertEqual(self.cf.get_option('a'), 1) - self.assertEqual(self.cf.get_option('b.c'), 'hullo') + assert self.cf.get_option('a') == 1 + assert self.cf.get_option('b.c') == 'hullo' def test_deprecate_option(self): # we can deprecate non-existent options self.cf.deprecate_option('foo') - self.assertTrue(self.cf._is_deprecated('foo')) + assert self.cf._is_deprecated('foo') with warnings.catch_warnings(record=True) as w: warnings.simplefilter('always') try: @@ -262,9 +254,8 @@ def test_deprecate_option(self): else: self.fail("Nonexistent option didn't raise KeyError") - self.assertEqual(len(w), 1) # should have raised one warning - self.assertTrue( - 'deprecated' in str(w[-1])) # we get the default message + assert len(w) == 1 # should have raised one warning + assert 'deprecated' in str(w[-1]) # we get the default message self.cf.register_option('a', 1, 'doc', validator=self.cf.is_int) self.cf.register_option('b.c', 'hullo', 'doc2') @@ -275,13 +266,11 @@ def test_deprecate_option(self): warnings.simplefilter('always') self.cf.get_option('a') - self.assertEqual(len(w), 1) # should have raised one warning - self.assertTrue( - 'eprecated' in str(w[-1])) # we get the default message - self.assertTrue( - 'nifty_ver' in str(w[-1])) # with the removal_ver quoted + assert len(w) == 1 # should have raised one warning + assert 'eprecated' in str(w[-1]) # we get the default message + assert 'nifty_ver' in str(w[-1]) # with the removal_ver quoted - self.assertRaises( + pytest.raises( KeyError, self.cf.deprecate_option, 'a') # can't depr. twice self.cf.deprecate_option('b.c', 'zounds!') @@ -289,66 +278,60 @@ def test_deprecate_option(self): warnings.simplefilter('always') self.cf.get_option('b.c') - self.assertEqual(len(w), 1) # should have raised one warning - self.assertTrue( - 'zounds!' in str(w[-1])) # we get the custom message + assert len(w) == 1 # should have raised one warning + assert 'zounds!' 
in str(w[-1]) # we get the custom message # test rerouting keys self.cf.register_option('d.a', 'foo', 'doc2') self.cf.register_option('d.dep', 'bar', 'doc2') - self.assertEqual(self.cf.get_option('d.a'), 'foo') - self.assertEqual(self.cf.get_option('d.dep'), 'bar') + assert self.cf.get_option('d.a') == 'foo' + assert self.cf.get_option('d.dep') == 'bar' self.cf.deprecate_option('d.dep', rkey='d.a') # reroute d.dep to d.a with warnings.catch_warnings(record=True) as w: warnings.simplefilter('always') - self.assertEqual(self.cf.get_option('d.dep'), 'foo') + assert self.cf.get_option('d.dep') == 'foo' - self.assertEqual(len(w), 1) # should have raised one warning - self.assertTrue( - 'eprecated' in str(w[-1])) # we get the custom message + assert len(w) == 1 # should have raised one warning + assert 'eprecated' in str(w[-1]) # we get the custom message with warnings.catch_warnings(record=True) as w: warnings.simplefilter('always') self.cf.set_option('d.dep', 'baz') # should overwrite "d.a" - self.assertEqual(len(w), 1) # should have raised one warning - self.assertTrue( - 'eprecated' in str(w[-1])) # we get the custom message + assert len(w) == 1 # should have raised one warning + assert 'eprecated' in str(w[-1]) # we get the custom message with warnings.catch_warnings(record=True) as w: warnings.simplefilter('always') - self.assertEqual(self.cf.get_option('d.dep'), 'baz') + assert self.cf.get_option('d.dep') == 'baz' - self.assertEqual(len(w), 1) # should have raised one warning - self.assertTrue( - 'eprecated' in str(w[-1])) # we get the custom message + assert len(w) == 1 # should have raised one warning + assert 'eprecated' in str(w[-1]) # we get the custom message def test_config_prefix(self): with self.cf.config_prefix("base"): self.cf.register_option('a', 1, "doc1") self.cf.register_option('b', 2, "doc2") - self.assertEqual(self.cf.get_option('a'), 1) - self.assertEqual(self.cf.get_option('b'), 2) + assert self.cf.get_option('a') == 1 + assert self.cf.get_option('b') == 2 self.cf.set_option('a', 3) self.cf.set_option('b', 4) - self.assertEqual(self.cf.get_option('a'), 3) - self.assertEqual(self.cf.get_option('b'), 4) + assert self.cf.get_option('a') == 3 + assert self.cf.get_option('b') == 4 - self.assertEqual(self.cf.get_option('base.a'), 3) - self.assertEqual(self.cf.get_option('base.b'), 4) - self.assertTrue( - 'doc1' in self.cf.describe_option('base.a', _print_desc=False)) - self.assertTrue( - 'doc2' in self.cf.describe_option('base.b', _print_desc=False)) + assert self.cf.get_option('base.a') == 3 + assert self.cf.get_option('base.b') == 4 + assert 'doc1' in self.cf.describe_option('base.a', _print_desc=False) + assert 'doc2' in self.cf.describe_option('base.b', _print_desc=False) self.cf.reset_option('base.a') self.cf.reset_option('base.b') with self.cf.config_prefix("base"): - self.assertEqual(self.cf.get_option('a'), 1) - self.assertEqual(self.cf.get_option('b'), 2) + assert self.cf.get_option('a') == 1 + assert self.cf.get_option('b') == 2 def test_callback(self): k = [None] @@ -363,21 +346,21 @@ def callback(key): del k[-1], v[-1] self.cf.set_option("d.a", "fooz") - self.assertEqual(k[-1], "d.a") - self.assertEqual(v[-1], "fooz") + assert k[-1] == "d.a" + assert v[-1] == "fooz" del k[-1], v[-1] self.cf.set_option("d.b", "boo") - self.assertEqual(k[-1], "d.b") - self.assertEqual(v[-1], "boo") + assert k[-1] == "d.b" + assert v[-1] == "boo" del k[-1], v[-1] self.cf.reset_option("d.b") - self.assertEqual(k[-1], "d.b") + assert k[-1] == "d.b" def test_set_ContextManager(self): 
def eq(val): - self.assertEqual(self.cf.get_option("a"), val) + assert self.cf.get_option("a") == val self.cf.register_option('a', 0) eq(0) @@ -407,22 +390,22 @@ def f3(key): self.cf.register_option('c', 0, cb=f3) options = self.cf.options - self.assertEqual(options.a, 0) + assert options.a == 0 with self.cf.option_context("a", 15): - self.assertEqual(options.a, 15) + assert options.a == 15 options.a = 500 - self.assertEqual(self.cf.get_option("a"), 500) + assert self.cf.get_option("a") == 500 self.cf.reset_option("a") - self.assertEqual(options.a, self.cf.get_option("a", 0)) + assert options.a == self.cf.get_option("a", 0) - self.assertRaises(KeyError, f) - self.assertRaises(KeyError, f2) + pytest.raises(KeyError, f) + pytest.raises(KeyError, f2) # make sure callback kicks when using this form of setting options.c = 1 - self.assertEqual(len(holder), 1) + assert len(holder) == 1 def test_option_context_scope(self): # Ensure that creating a context does not affect the existing @@ -437,11 +420,11 @@ def test_option_context_scope(self): # Ensure creating contexts didn't affect the current context. ctx = self.cf.option_context(option_name, context_value) - self.assertEqual(self.cf.get_option(option_name), original_value) + assert self.cf.get_option(option_name) == original_value # Ensure the correct value is available inside the context. with ctx: - self.assertEqual(self.cf.get_option(option_name), context_value) + assert self.cf.get_option(option_name) == context_value # Ensure the current context is reset - self.assertEqual(self.cf.get_option(option_name), original_value) + assert self.cf.get_option(option_name) == original_value diff --git a/pandas/tests/test_errors.py b/pandas/tests/test_errors.py new file mode 100644 index 0000000000000..4a0850734e134 --- /dev/null +++ b/pandas/tests/test_errors.py @@ -0,0 +1,52 @@ +# -*- coding: utf-8 -*- + +import pytest +from warnings import catch_warnings +import pandas # noqa +import pandas as pd + + +@pytest.mark.parametrize( + "exc", ['UnsupportedFunctionCall', 'UnsortedIndexError', + 'OutOfBoundsDatetime', + 'ParserError', 'PerformanceWarning', 'DtypeWarning', + 'EmptyDataError', 'ParserWarning']) +def test_exception_importable(exc): + from pandas import errors + e = getattr(errors, exc) + assert e is not None + + # check that we can raise on them + with pytest.raises(e): + raise e() + + +def test_catch_oob(): + from pandas import errors + + try: + pd.Timestamp('15000101') + except errors.OutOfBoundsDatetime: + pass + + +def test_error_rename(): + # see gh-12665 + from pandas.errors import ParserError + from pandas.io.common import CParserError + + try: + raise CParserError() + except ParserError: + pass + + try: + raise ParserError() + except CParserError: + pass + + with catch_warnings(record=True): + try: + raise ParserError() + except pd.parser.CParserError: + pass diff --git a/pandas/tests/test_expressions.py b/pandas/tests/test_expressions.py index c037f02f20609..fae7bfa513dcd 100644 --- a/pandas/tests/test_expressions.py +++ b/pandas/tests/test_expressions.py @@ -2,33 +2,25 @@ from __future__ import print_function # pylint: disable-msg=W0612,E1101 -import nose +from warnings import catch_warnings import re +import operator +import pytest from numpy.random import randn -import operator import numpy as np from pandas.core.api import DataFrame, Panel -from pandas.computation import expressions as expr -from pandas import compat +from pandas.core.computation import expressions as expr +from pandas import compat, _np_version_under1p11 from 
pandas.util.testing import (assert_almost_equal, assert_series_equal, assert_frame_equal, assert_panel_equal, assert_panel4d_equal, slow) -from pandas.formats.printing import pprint_thing +from pandas.io.formats.printing import pprint_thing import pandas.util.testing as tm -if not expr._USE_NUMEXPR: - try: - import numexpr # noqa - except ImportError: - msg = "don't have" - else: - msg = "not using" - raise nose.SkipTest("{0} numexpr".format(msg)) - _frame = DataFrame(randn(10000, 4), columns=list('ABCD'), dtype='float64') _frame2 = DataFrame(randn(100, 4), columns=list('ABCD'), dtype='float64') _mixed = DataFrame({'A': _frame['A'].copy(), @@ -41,26 +33,32 @@ 'D': _frame2['D'].astype('int32')}) _integer = DataFrame( np.random.randint(1, 100, - size=(10001, 4)), columns=list('ABCD'), dtype='int64') + size=(10001, 4)), + columns=list('ABCD'), dtype='int64') _integer2 = DataFrame(np.random.randint(1, 100, size=(101, 4)), columns=list('ABCD'), dtype='int64') -_frame_panel = Panel(dict(ItemA=_frame.copy(), ItemB=( - _frame.copy() + 3), ItemC=_frame.copy(), ItemD=_frame.copy())) -_frame2_panel = Panel(dict(ItemA=_frame2.copy(), ItemB=(_frame2.copy() + 3), - ItemC=_frame2.copy(), ItemD=_frame2.copy())) -_integer_panel = Panel(dict(ItemA=_integer, ItemB=(_integer + 34).astype( - 'int64'))) -_integer2_panel = Panel(dict(ItemA=_integer2, ItemB=(_integer2 + 34).astype( - 'int64'))) -_mixed_panel = Panel(dict(ItemA=_mixed, ItemB=(_mixed + 3))) -_mixed2_panel = Panel(dict(ItemA=_mixed2, ItemB=(_mixed2 + 3))) +with catch_warnings(record=True): + _frame_panel = Panel(dict(ItemA=_frame.copy(), + ItemB=(_frame.copy() + 3), + ItemC=_frame.copy(), + ItemD=_frame.copy())) + _frame2_panel = Panel(dict(ItemA=_frame2.copy(), + ItemB=(_frame2.copy() + 3), + ItemC=_frame2.copy(), + ItemD=_frame2.copy())) + _integer_panel = Panel(dict(ItemA=_integer, + ItemB=(_integer + 34).astype('int64'))) + _integer2_panel = Panel(dict(ItemA=_integer2, + ItemB=(_integer2 + 34).astype('int64'))) + _mixed_panel = Panel(dict(ItemA=_mixed, ItemB=(_mixed + 3))) + _mixed2_panel = Panel(dict(ItemA=_mixed2, ItemB=(_mixed2 + 3))) -class TestExpressions(tm.TestCase): - _multiprocess_can_split_ = False +@pytest.mark.skipif(not expr._USE_NUMEXPR, reason='not using numexpr') +class TestExpressions(object): - def setUp(self): + def setup_method(self, method): self.frame = _frame.copy() self.frame2 = _frame2.copy() @@ -69,17 +67,23 @@ def setUp(self): self.integer = _integer.copy() self._MIN_ELEMENTS = expr._MIN_ELEMENTS - def tearDown(self): + def teardown_method(self, method): expr._MIN_ELEMENTS = self._MIN_ELEMENTS - @nose.tools.nottest - def run_arithmetic_test(self, df, other, assert_func, check_dtype=False, - test_flex=True): + def run_arithmetic(self, df, other, assert_func, check_dtype=False, + test_flex=True): expr._MIN_ELEMENTS = 0 operations = ['add', 'sub', 'mul', 'mod', 'truediv', 'floordiv', 'pow'] if not compat.PY3: operations.append('div') for arith in operations: + + # numpy >= 1.11 doesn't handle integers + # raised to integer powers + # https://github.com/pandas-dev/pandas/issues/15363 + if arith == 'pow' and not _np_version_under1p11: + continue + operator_name = arith if arith == 'div': operator_name = 'truediv' @@ -92,6 +96,7 @@ def run_arithmetic_test(self, df, other, assert_func, check_dtype=False, expr.set_use_numexpr(False) expected = op(df, other) expr.set_use_numexpr(True) + result = op(df, other) try: if check_dtype: @@ -103,15 +108,14 @@ def run_arithmetic_test(self, df, other, assert_func, 
check_dtype=False, raise def test_integer_arithmetic(self): - self.run_arithmetic_test(self.integer, self.integer, - assert_frame_equal) - self.run_arithmetic_test(self.integer.iloc[:, 0], - self.integer.iloc[:, 0], assert_series_equal, - check_dtype=True) - - @nose.tools.nottest - def run_binary_test(self, df, other, assert_func, test_flex=False, - numexpr_ops=set(['gt', 'lt', 'ge', 'le', 'eq', 'ne'])): + self.run_arithmetic(self.integer, self.integer, + assert_frame_equal) + self.run_arithmetic(self.integer.iloc[:, 0], + self.integer.iloc[:, 0], assert_series_equal, + check_dtype=True) + + def run_binary(self, df, other, assert_func, test_flex=False, + numexpr_ops=set(['gt', 'lt', 'ge', 'le', 'eq', 'ne'])): """ tests solely that the result is the same whether or not numexpr is enabled. Need to test whether the function does the correct thing @@ -145,46 +149,46 @@ def run_binary_test(self, df, other, assert_func, test_flex=False, def run_frame(self, df, other, binary_comp=None, run_binary=True, **kwargs): - self.run_arithmetic_test(df, other, assert_frame_equal, - test_flex=False, **kwargs) - self.run_arithmetic_test(df, other, assert_frame_equal, test_flex=True, - **kwargs) + self.run_arithmetic(df, other, assert_frame_equal, + test_flex=False, **kwargs) + self.run_arithmetic(df, other, assert_frame_equal, test_flex=True, + **kwargs) if run_binary: if binary_comp is None: expr.set_use_numexpr(False) binary_comp = other + 1 expr.set_use_numexpr(True) - self.run_binary_test(df, binary_comp, assert_frame_equal, - test_flex=False, **kwargs) - self.run_binary_test(df, binary_comp, assert_frame_equal, - test_flex=True, **kwargs) + self.run_binary(df, binary_comp, assert_frame_equal, + test_flex=False, **kwargs) + self.run_binary(df, binary_comp, assert_frame_equal, + test_flex=True, **kwargs) def run_series(self, ser, other, binary_comp=None, **kwargs): - self.run_arithmetic_test(ser, other, assert_series_equal, - test_flex=False, **kwargs) - self.run_arithmetic_test(ser, other, assert_almost_equal, - test_flex=True, **kwargs) + self.run_arithmetic(ser, other, assert_series_equal, + test_flex=False, **kwargs) + self.run_arithmetic(ser, other, assert_almost_equal, + test_flex=True, **kwargs) # series doesn't uses vec_compare instead of numexpr... 
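(i.e. Series comparisons go through pandas' own vectorized comparison path rather than numexpr, which is why the run_binary calls below stay commented out.) The helpers that do run all share the check spelled out in the run_binary docstring: evaluate once with numexpr disabled to get the plain numpy result, re-enable numexpr, evaluate again, and assert the two agree. A minimal standalone sketch of that pattern, assuming only names this module already defines (expr, _frame, assert_frame_equal):

    saved = expr._MIN_ELEMENTS
    expr._MIN_ELEMENTS = 0            # let numexpr engage even on small frames
    expr.set_use_numexpr(False)
    expected = _frame + _frame        # plain numpy path
    expr.set_use_numexpr(True)
    result = _frame + _frame          # numexpr-accelerated path
    assert_frame_equal(result, expected)
    expr._MIN_ELEMENTS = saved        # restore, as teardown_method does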
# if binary_comp is None: # binary_comp = other + 1 - # self.run_binary_test(ser, binary_comp, assert_frame_equal, + # self.run_binary(ser, binary_comp, assert_frame_equal, # test_flex=False, **kwargs) - # self.run_binary_test(ser, binary_comp, assert_frame_equal, + # self.run_binary(ser, binary_comp, assert_frame_equal, # test_flex=True, **kwargs) def run_panel(self, panel, other, binary_comp=None, run_binary=True, assert_func=assert_panel_equal, **kwargs): - self.run_arithmetic_test(panel, other, assert_func, test_flex=False, - **kwargs) - self.run_arithmetic_test(panel, other, assert_func, test_flex=True, - **kwargs) + self.run_arithmetic(panel, other, assert_func, test_flex=False, + **kwargs) + self.run_arithmetic(panel, other, assert_func, test_flex=True, + **kwargs) if run_binary: if binary_comp is None: binary_comp = other + 1 - self.run_binary_test(panel, binary_comp, assert_func, - test_flex=False, **kwargs) - self.run_binary_test(panel, binary_comp, assert_func, - test_flex=True, **kwargs) + self.run_binary(panel, binary_comp, assert_func, + test_flex=False, **kwargs) + self.run_binary(panel, binary_comp, assert_func, + test_flex=True, **kwargs) def test_integer_arithmetic_frame(self): self.run_frame(self.integer, self.integer) @@ -208,7 +212,7 @@ def test_float_panel(self): @slow def test_panel4d(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): self.run_panel(tm.makePanel4D(), np.random.randn() + 0.5, assert_func=assert_panel4d_equal, binary_comp=3) @@ -228,44 +232,44 @@ def test_mixed_panel(self): binary_comp=-2) def test_float_arithemtic(self): - self.run_arithmetic_test(self.frame, self.frame, assert_frame_equal) - self.run_arithmetic_test(self.frame.iloc[:, 0], self.frame.iloc[:, 0], - assert_series_equal, check_dtype=True) + self.run_arithmetic(self.frame, self.frame, assert_frame_equal) + self.run_arithmetic(self.frame.iloc[:, 0], self.frame.iloc[:, 0], + assert_series_equal, check_dtype=True) def test_mixed_arithmetic(self): - self.run_arithmetic_test(self.mixed, self.mixed, assert_frame_equal) + self.run_arithmetic(self.mixed, self.mixed, assert_frame_equal) for col in self.mixed.columns: - self.run_arithmetic_test(self.mixed[col], self.mixed[col], - assert_series_equal) + self.run_arithmetic(self.mixed[col], self.mixed[col], + assert_series_equal) def test_integer_with_zeros(self): self.integer *= np.random.randint(0, 2, size=np.shape(self.integer)) - self.run_arithmetic_test(self.integer, self.integer, - assert_frame_equal) - self.run_arithmetic_test(self.integer.iloc[:, 0], - self.integer.iloc[:, 0], assert_series_equal) + self.run_arithmetic(self.integer, self.integer, + assert_frame_equal) + self.run_arithmetic(self.integer.iloc[:, 0], + self.integer.iloc[:, 0], assert_series_equal) def test_invalid(self): # no op result = expr._can_use_numexpr(operator.add, None, self.frame, self.frame, 'evaluate') - self.assertFalse(result) + assert not result # mixed result = expr._can_use_numexpr(operator.add, '+', self.mixed, self.frame, 'evaluate') - self.assertFalse(result) + assert not result # min elements result = expr._can_use_numexpr(operator.add, '+', self.frame2, self.frame2, 'evaluate') - self.assertFalse(result) + assert not result # ok, we only check on first part of expression result = expr._can_use_numexpr(operator.add, '+', self.frame, self.frame2, 'evaluate') - self.assertTrue(result) + assert result def test_binary_ops(self): def testit(): @@ -275,6 +279,13 @@ def testit(): for op, op_str in 
[('add', '+'), ('sub', '-'), ('mul', '*'), ('div', '/'), ('pow', '**')]: + + # numpy >= 1.11 doesn't handle integers + # raised to integer powers + # https://github.com/pandas-dev/pandas/issues/15363 + if op == 'pow' and not _np_version_under1p11: + continue + if op == 'div': op = getattr(operator, 'truediv', None) else: @@ -282,7 +293,7 @@ def testit(): if op is not None: result = expr._can_use_numexpr(op, op_str, f, f, 'evaluate') - self.assertNotEqual(result, f._is_mixed_type) + assert result != f._is_mixed_type result = expr.evaluate(op, op_str, f, f, use_numexpr=True) @@ -297,7 +308,7 @@ def testit(): result = expr._can_use_numexpr(op, op_str, f2, f2, 'evaluate') - self.assertFalse(result) + assert not result expr.set_use_numexpr(False) testit() @@ -325,7 +336,7 @@ def testit(): result = expr._can_use_numexpr(op, op_str, f11, f12, 'evaluate') - self.assertNotEqual(result, f11._is_mixed_type) + assert result != f11._is_mixed_type result = expr.evaluate(op, op_str, f11, f12, use_numexpr=True) @@ -338,7 +349,7 @@ def testit(): result = expr._can_use_numexpr(op, op_str, f21, f22, 'evaluate') - self.assertFalse(result) + assert not result expr.set_use_numexpr(False) testit() @@ -379,22 +390,22 @@ def test_bool_ops_raise_on_arithmetic(self): f = getattr(operator, name) err_msg = re.escape(msg % op) - with tm.assertRaisesRegexp(NotImplementedError, err_msg): + with tm.assert_raises_regex(NotImplementedError, err_msg): f(df, df) - with tm.assertRaisesRegexp(NotImplementedError, err_msg): + with tm.assert_raises_regex(NotImplementedError, err_msg): f(df.a, df.b) - with tm.assertRaisesRegexp(NotImplementedError, err_msg): + with tm.assert_raises_regex(NotImplementedError, err_msg): f(df.a, True) - with tm.assertRaisesRegexp(NotImplementedError, err_msg): + with tm.assert_raises_regex(NotImplementedError, err_msg): f(False, df.a) - with tm.assertRaisesRegexp(TypeError, err_msg): + with tm.assert_raises_regex(TypeError, err_msg): f(False, df) - with tm.assertRaisesRegexp(TypeError, err_msg): + with tm.assert_raises_regex(TypeError, err_msg): f(df, True) def test_bool_ops_warn_on_arithmetic(self): @@ -439,9 +450,3 @@ def test_bool_ops_warn_on_arithmetic(self): r = f(df, True) e = fe(df, True) tm.assert_frame_equal(r, e) - - -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/test_generic.py b/pandas/tests/test_generic.py deleted file mode 100644 index 84df82db69f77..0000000000000 --- a/pandas/tests/test_generic.py +++ /dev/null @@ -1,1984 +0,0 @@ -# -*- coding: utf-8 -*- -# pylint: disable-msg=E1101,W0612 - -from operator import methodcaller -import nose -import numpy as np -from numpy import nan -import pandas as pd - -from pandas.types.common import is_scalar -from pandas import (Index, Series, DataFrame, Panel, isnull, - date_range, period_range, Panel4D) -from pandas.core.index import MultiIndex - -import pandas.formats.printing as printing - -from pandas.compat import range, zip, PY3 -from pandas import compat -from pandas.util.testing import (assertRaisesRegexp, - assert_series_equal, - assert_frame_equal, - assert_panel_equal, - assert_panel4d_equal, - assert_almost_equal) - -import pandas.util.testing as tm - - -# ---------------------------------------------------------------------- -# Generic types test cases - - -class Generic(object): - - _multiprocess_can_split_ = True - - def setUp(self): - pass - - @property - def _ndim(self): - return self._typ._AXIS_LEN - - def 
_axes(self): - """ return the axes for my object typ """ - return self._typ._AXIS_ORDERS - - def _construct(self, shape, value=None, dtype=None, **kwargs): - """ construct an object for the given shape - if value is specified use that if its a scalar - if value is an array, repeat it as needed """ - - if isinstance(shape, int): - shape = tuple([shape] * self._ndim) - if value is not None: - if is_scalar(value): - if value == 'empty': - arr = None - - # remove the info axis - kwargs.pop(self._typ._info_axis_name, None) - else: - arr = np.empty(shape, dtype=dtype) - arr.fill(value) - else: - fshape = np.prod(shape) - arr = value.ravel() - new_shape = fshape / arr.shape[0] - if fshape % arr.shape[0] != 0: - raise Exception("invalid value passed in _construct") - - arr = np.repeat(arr, new_shape).reshape(shape) - else: - arr = np.random.randn(*shape) - return self._typ(arr, dtype=dtype, **kwargs) - - def _compare(self, result, expected): - self._comparator(result, expected) - - def test_rename(self): - - # single axis - idx = list('ABCD') - # relabeling values passed into self.rename - args = [ - str.lower, - {x: x.lower() for x in idx}, - Series({x: x.lower() for x in idx}), - ] - - for axis in self._axes(): - kwargs = {axis: idx} - obj = self._construct(4, **kwargs) - - for arg in args: - # rename a single axis - result = obj.rename(**{axis: arg}) - expected = obj.copy() - setattr(expected, axis, list('abcd')) - self._compare(result, expected) - - # multiple axes at once - - def test_rename_axis(self): - idx = list('ABCD') - # relabeling values passed into self.rename - args = [ - str.lower, - {x: x.lower() for x in idx}, - Series({x: x.lower() for x in idx}), - ] - - for axis in self._axes(): - kwargs = {axis: idx} - obj = self._construct(4, **kwargs) - - for arg in args: - # rename a single axis - result = obj.rename_axis(arg, axis=axis) - expected = obj.copy() - setattr(expected, axis, list('abcd')) - self._compare(result, expected) - # scalar values - for arg in ['foo', None]: - result = obj.rename_axis(arg, axis=axis) - expected = obj.copy() - getattr(expected, axis).name = arg - self._compare(result, expected) - - def test_get_numeric_data(self): - - n = 4 - kwargs = {} - for i in range(self._ndim): - kwargs[self._typ._AXIS_NAMES[i]] = list(range(n)) - - # get the numeric data - o = self._construct(n, **kwargs) - result = o._get_numeric_data() - self._compare(result, o) - - # non-inclusion - result = o._get_bool_data() - expected = self._construct(n, value='empty', **kwargs) - self._compare(result, expected) - - # get the bool data - arr = np.array([True, True, False, True]) - o = self._construct(n, value=arr, **kwargs) - result = o._get_numeric_data() - self._compare(result, o) - - # _get_numeric_data is includes _get_bool_data, so can't test for - # non-inclusion - - def test_get_default(self): - - # GH 7725 - d0 = "a", "b", "c", "d" - d1 = np.arange(4, dtype='int64') - others = "e", 10 - - for data, index in ((d0, d1), (d1, d0)): - s = Series(data, index=index) - for i, d in zip(index, data): - self.assertEqual(s.get(i), d) - self.assertEqual(s.get(i, d), d) - self.assertEqual(s.get(i, "z"), d) - for other in others: - self.assertEqual(s.get(other, "z"), "z") - self.assertEqual(s.get(other, other), other) - - def test_nonzero(self): - - # GH 4633 - # look at the boolean/nonzero behavior for objects - obj = self._construct(shape=4) - self.assertRaises(ValueError, lambda: bool(obj == 0)) - self.assertRaises(ValueError, lambda: bool(obj == 1)) - self.assertRaises(ValueError, lambda: 
bool(obj)) - - obj = self._construct(shape=4, value=1) - self.assertRaises(ValueError, lambda: bool(obj == 0)) - self.assertRaises(ValueError, lambda: bool(obj == 1)) - self.assertRaises(ValueError, lambda: bool(obj)) - - obj = self._construct(shape=4, value=np.nan) - self.assertRaises(ValueError, lambda: bool(obj == 0)) - self.assertRaises(ValueError, lambda: bool(obj == 1)) - self.assertRaises(ValueError, lambda: bool(obj)) - - # empty - obj = self._construct(shape=0) - self.assertRaises(ValueError, lambda: bool(obj)) - - # invalid behaviors - - obj1 = self._construct(shape=4, value=1) - obj2 = self._construct(shape=4, value=1) - - def f(): - if obj1: - printing.pprint_thing("this works and shouldn't") - - self.assertRaises(ValueError, f) - self.assertRaises(ValueError, lambda: obj1 and obj2) - self.assertRaises(ValueError, lambda: obj1 or obj2) - self.assertRaises(ValueError, lambda: not obj1) - - def test_numpy_1_7_compat_numeric_methods(self): - # GH 4435 - # numpy in 1.7 tries to pass addtional arguments to pandas functions - - o = self._construct(shape=4) - for op in ['min', 'max', 'max', 'var', 'std', 'prod', 'sum', 'cumsum', - 'cumprod', 'median', 'skew', 'kurt', 'compound', 'cummax', - 'cummin', 'all', 'any']: - f = getattr(np, op, None) - if f is not None: - f(o) - - def test_downcast(self): - # test close downcasting - - o = self._construct(shape=4, value=9, dtype=np.int64) - result = o.copy() - result._data = o._data.downcast(dtypes='infer') - self._compare(result, o) - - o = self._construct(shape=4, value=9.) - expected = o.astype(np.int64) - result = o.copy() - result._data = o._data.downcast(dtypes='infer') - self._compare(result, expected) - - o = self._construct(shape=4, value=9.5) - result = o.copy() - result._data = o._data.downcast(dtypes='infer') - self._compare(result, o) - - # are close - o = self._construct(shape=4, value=9.000000000005) - result = o.copy() - result._data = o._data.downcast(dtypes='infer') - expected = o.astype(np.int64) - self._compare(result, expected) - - def test_constructor_compound_dtypes(self): - # GH 5191 - # compound dtypes should raise not-implementederror - - def f(dtype): - return self._construct(shape=3, dtype=dtype) - - self.assertRaises(NotImplementedError, f, [("A", "datetime64[h]"), - ("B", "str"), - ("C", "int32")]) - - # these work (though results may be unexpected) - f('int64') - f('float64') - f('M8[ns]') - - def check_metadata(self, x, y=None): - for m in x._metadata: - v = getattr(x, m, None) - if y is None: - self.assertIsNone(v) - else: - self.assertEqual(v, getattr(y, m, None)) - - def test_metadata_propagation(self): - # check that the metadata matches up on the resulting ops - - o = self._construct(shape=3) - o.name = 'foo' - o2 = self._construct(shape=3) - o2.name = 'bar' - - # TODO - # Once panel can do non-trivial combine operations - # (currently there is an a raise in the Panel arith_ops to prevent - # this, though it actually does work) - # can remove all of these try: except: blocks on the actual operations - - # ---------- - # preserving - # ---------- - - # simple ops with scalars - for op in ['__add__', '__sub__', '__truediv__', '__mul__']: - result = getattr(o, op)(1) - self.check_metadata(o, result) - - # ops with like - for op in ['__add__', '__sub__', '__truediv__', '__mul__']: - try: - result = getattr(o, op)(o) - self.check_metadata(o, result) - except (ValueError, AttributeError): - pass - - # simple boolean - for op in ['__eq__', '__le__', '__ge__']: - v1 = getattr(o, op)(o) - self.check_metadata(o, v1) 
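The propagation these checks assert is implemented by NDFrame.__finalize__, which copies every attribute named in _metadata from the source object onto the result; the finalize functions monkeypatched further down in this file override exactly that hook. A minimal sketch with a hypothetical subclass (TaggedSeries and its tag attribute are illustrative only, not part of this test suite):

    import pandas as pd

    class TaggedSeries(pd.Series):
        _metadata = ['tag']  # attributes __finalize__ carries over

        @property
        def _constructor(self):
            # keep operations returning TaggedSeries, not plain Series
            return TaggedSeries

    s = TaggedSeries([1, 2, 3])
    s.tag = 'foo'
    result = (s + 1).__finalize__(s)  # copies 'tag' from s onto the result
    assert result.tag == 'foo'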
- - try: - self.check_metadata(o, v1 & v1) - except (ValueError): - pass - - try: - self.check_metadata(o, v1 | v1) - except (ValueError): - pass - - # combine_first - try: - result = o.combine_first(o2) - self.check_metadata(o, result) - except (AttributeError): - pass - - # --------------------------- - # non-preserving (by default) - # --------------------------- - - # add non-like - try: - result = o + o2 - self.check_metadata(result) - except (ValueError, AttributeError): - pass - - # simple boolean - for op in ['__eq__', '__le__', '__ge__']: - - # this is a name matching op - v1 = getattr(o, op)(o) - - v2 = getattr(o, op)(o2) - self.check_metadata(v2) - - try: - self.check_metadata(v1 & v2) - except (ValueError): - pass - - try: - self.check_metadata(v1 | v2) - except (ValueError): - pass - - def test_head_tail(self): - # GH5370 - - o = self._construct(shape=10) - - # check all index types - for index in [tm.makeFloatIndex, tm.makeIntIndex, tm.makeStringIndex, - tm.makeUnicodeIndex, tm.makeDateIndex, - tm.makePeriodIndex]: - axis = o._get_axis_name(0) - setattr(o, axis, index(len(getattr(o, axis)))) - - # Panel + dims - try: - o.head() - except (NotImplementedError): - raise nose.SkipTest('not implemented on {0}'.format( - o.__class__.__name__)) - - self._compare(o.head(), o.iloc[:5]) - self._compare(o.tail(), o.iloc[-5:]) - - # 0-len - self._compare(o.head(0), o.iloc[0:0]) - self._compare(o.tail(0), o.iloc[0:0]) - - # bounded - self._compare(o.head(len(o) + 1), o) - self._compare(o.tail(len(o) + 1), o) - - # neg index - self._compare(o.head(-3), o.head(7)) - self._compare(o.tail(-3), o.tail(7)) - - def test_sample(self): - # Fixes issue: 2419 - - o = self._construct(shape=10) - - ### - # Check behavior of random_state argument - ### - - # Check for stability when receives seed or random state -- run 10 - # times. - for test in range(10): - seed = np.random.randint(0, 100) - self._compare( - o.sample(n=4, random_state=seed), o.sample(n=4, - random_state=seed)) - self._compare( - o.sample(frac=0.7, random_state=seed), o.sample( - frac=0.7, random_state=seed)) - - self._compare( - o.sample(n=4, random_state=np.random.RandomState(test)), - o.sample(n=4, random_state=np.random.RandomState(test))) - - self._compare( - o.sample(frac=0.7, random_state=np.random.RandomState(test)), - o.sample(frac=0.7, random_state=np.random.RandomState(test))) - - os1, os2 = [], [] - for _ in range(2): - np.random.seed(test) - os1.append(o.sample(n=4)) - os2.append(o.sample(frac=0.7)) - self._compare(*os1) - self._compare(*os2) - - # Check for error when random_state argument invalid. 
- with tm.assertRaises(ValueError): - o.sample(random_state='astring!') - - ### - # Check behavior of `frac` and `N` - ### - - # Giving both frac and N throws error - with tm.assertRaises(ValueError): - o.sample(n=3, frac=0.3) - - # Check that raises right error for negative lengths - with tm.assertRaises(ValueError): - o.sample(n=-3) - with tm.assertRaises(ValueError): - o.sample(frac=-0.3) - - # Make sure float values of `n` give error - with tm.assertRaises(ValueError): - o.sample(n=3.2) - - # Check lengths are right - self.assertTrue(len(o.sample(n=4) == 4)) - self.assertTrue(len(o.sample(frac=0.34) == 3)) - self.assertTrue(len(o.sample(frac=0.36) == 4)) - - ### - # Check weights - ### - - # Weight length must be right - with tm.assertRaises(ValueError): - o.sample(n=3, weights=[0, 1]) - - with tm.assertRaises(ValueError): - bad_weights = [0.5] * 11 - o.sample(n=3, weights=bad_weights) - - with tm.assertRaises(ValueError): - bad_weight_series = Series([0, 0, 0.2]) - o.sample(n=4, weights=bad_weight_series) - - # Check won't accept negative weights - with tm.assertRaises(ValueError): - bad_weights = [-0.1] * 10 - o.sample(n=3, weights=bad_weights) - - # Check inf and -inf throw errors: - with tm.assertRaises(ValueError): - weights_with_inf = [0.1] * 10 - weights_with_inf[0] = np.inf - o.sample(n=3, weights=weights_with_inf) - - with tm.assertRaises(ValueError): - weights_with_ninf = [0.1] * 10 - weights_with_ninf[0] = -np.inf - o.sample(n=3, weights=weights_with_ninf) - - # All zeros raises errors - zero_weights = [0] * 10 - with tm.assertRaises(ValueError): - o.sample(n=3, weights=zero_weights) - - # All missing weights - nan_weights = [np.nan] * 10 - with tm.assertRaises(ValueError): - o.sample(n=3, weights=nan_weights) - - # Check np.nan are replaced by zeros. - weights_with_nan = [np.nan] * 10 - weights_with_nan[5] = 0.5 - self._compare( - o.sample(n=1, axis=0, weights=weights_with_nan), o.iloc[5:6]) - - # Check None are also replaced by zeros. 
- weights_with_None = [None] * 10 - weights_with_None[5] = 0.5 - self._compare( - o.sample(n=1, axis=0, weights=weights_with_None), o.iloc[5:6]) - - def test_size_compat(self): - # GH8846 - # size property should be defined - - o = self._construct(shape=10) - self.assertTrue(o.size == np.prod(o.shape)) - self.assertTrue(o.size == 10 ** len(o.axes)) - - def test_split_compat(self): - # xref GH8846 - o = self._construct(shape=10) - self.assertTrue(len(np.array_split(o, 5)) == 5) - self.assertTrue(len(np.array_split(o, 2)) == 2) - - def test_unexpected_keyword(self): # GH8597 - df = DataFrame(np.random.randn(5, 2), columns=['jim', 'joe']) - ca = pd.Categorical([0, 0, 2, 2, 3, np.nan]) - ts = df['joe'].copy() - ts[2] = np.nan - - with assertRaisesRegexp(TypeError, 'unexpected keyword'): - df.drop('joe', axis=1, in_place=True) - - with assertRaisesRegexp(TypeError, 'unexpected keyword'): - df.reindex([1, 0], inplace=True) - - with assertRaisesRegexp(TypeError, 'unexpected keyword'): - ca.fillna(0, inplace=True) - - with assertRaisesRegexp(TypeError, 'unexpected keyword'): - ts.fillna(0, in_place=True) - - # See gh-12301 - def test_stat_unexpected_keyword(self): - obj = self._construct(5) - starwars = 'Star Wars' - errmsg = 'unexpected keyword' - - with assertRaisesRegexp(TypeError, errmsg): - obj.max(epic=starwars) # stat_function - with assertRaisesRegexp(TypeError, errmsg): - obj.var(epic=starwars) # stat_function_ddof - with assertRaisesRegexp(TypeError, errmsg): - obj.sum(epic=starwars) # cum_function - with assertRaisesRegexp(TypeError, errmsg): - obj.any(epic=starwars) # logical_function - - def test_api_compat(self): - - # GH 12021 - # compat for __name__, __qualname__ - - obj = self._construct(5) - for func in ['sum', 'cumsum', 'any', 'var']: - f = getattr(obj, func) - self.assertEqual(f.__name__, func) - if PY3: - self.assertTrue(f.__qualname__.endswith(func)) - - def test_stat_non_defaults_args(self): - obj = self._construct(5) - out = np.array([0]) - errmsg = "the 'out' parameter is not supported" - - with assertRaisesRegexp(ValueError, errmsg): - obj.max(out=out) # stat_function - with assertRaisesRegexp(ValueError, errmsg): - obj.var(out=out) # stat_function_ddof - with assertRaisesRegexp(ValueError, errmsg): - obj.sum(out=out) # cum_function - with assertRaisesRegexp(ValueError, errmsg): - obj.any(out=out) # logical_function - - def test_clip(self): - lower = 1 - upper = 3 - col = np.arange(5) - - obj = self._construct(len(col), value=col) - - if isinstance(obj, Panel): - msg = "clip is not supported yet for panels" - tm.assertRaisesRegexp(NotImplementedError, msg, - obj.clip, lower=lower, - upper=upper) - - else: - out = obj.clip(lower=lower, upper=upper) - expected = self._construct(len(col), value=col - .clip(lower, upper)) - self._compare(out, expected) - - bad_axis = 'foo' - msg = ('No axis named {axis} ' - 'for object').format(axis=bad_axis) - assertRaisesRegexp(ValueError, msg, obj.clip, - lower=lower, upper=upper, - axis=bad_axis) - - def test_truncate_out_of_bounds(self): - # GH11382 - - # small - shape = [int(2e3)] + ([1] * (self._ndim - 1)) - small = self._construct(shape, dtype='int8') - self._compare(small.truncate(), small) - self._compare(small.truncate(before=0, after=3e3), small) - self._compare(small.truncate(before=-1, after=2e3), small) - - # big - shape = [int(2e6)] + ([1] * (self._ndim - 1)) - big = self._construct(shape, dtype='int8') - self._compare(big.truncate(), big) - self._compare(big.truncate(before=0, after=3e6), big) - 
self._compare(big.truncate(before=-1, after=2e6), big) - - def test_numpy_clip(self): - lower = 1 - upper = 3 - col = np.arange(5) - - obj = self._construct(len(col), value=col) - - if isinstance(obj, Panel): - msg = "clip is not supported yet for panels" - tm.assertRaisesRegexp(NotImplementedError, msg, - np.clip, obj, - lower, upper) - else: - out = np.clip(obj, lower, upper) - expected = self._construct(len(col), value=col - .clip(lower, upper)) - self._compare(out, expected) - - msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, - np.clip, obj, - lower, upper, out=col) - - -class TestSeries(tm.TestCase, Generic): - _typ = Series - _comparator = lambda self, x, y: assert_series_equal(x, y) - - def setUp(self): - self.ts = tm.makeTimeSeries() # Was at top level in test_series - self.ts.name = 'ts' - - self.series = tm.makeStringSeries() - self.series.name = 'series' - - def test_rename_mi(self): - s = Series([11, 21, 31], - index=MultiIndex.from_tuples( - [("A", x) for x in ["a", "B", "c"]])) - s.rename(str.lower) - - def test_set_axis_name(self): - s = Series([1, 2, 3], index=['a', 'b', 'c']) - funcs = ['rename_axis', '_set_axis_name'] - name = 'foo' - for func in funcs: - result = methodcaller(func, name)(s) - self.assertTrue(s.index.name is None) - self.assertEqual(result.index.name, name) - - def test_set_axis_name_mi(self): - s = Series([11, 21, 31], index=MultiIndex.from_tuples( - [("A", x) for x in ["a", "B", "c"]], - names=['l1', 'l2']) - ) - funcs = ['rename_axis', '_set_axis_name'] - for func in funcs: - result = methodcaller(func, ['L1', 'L2'])(s) - self.assertTrue(s.index.name is None) - self.assertEqual(s.index.names, ['l1', 'l2']) - self.assertTrue(result.index.name is None) - self.assertTrue(result.index.names, ['L1', 'L2']) - - def test_set_axis_name_raises(self): - s = pd.Series([1]) - with tm.assertRaises(ValueError): - s._set_axis_name(name='a', axis=1) - - def test_get_numeric_data_preserve_dtype(self): - - # get the numeric data - o = Series([1, 2, 3]) - result = o._get_numeric_data() - self._compare(result, o) - - o = Series([1, '2', 3.]) - result = o._get_numeric_data() - expected = Series([], dtype=object, index=pd.Index([], dtype=object)) - self._compare(result, expected) - - o = Series([True, False, True]) - result = o._get_numeric_data() - self._compare(result, o) - - o = Series([True, False, True]) - result = o._get_bool_data() - self._compare(result, o) - - o = Series(date_range('20130101', periods=3)) - result = o._get_numeric_data() - expected = Series([], dtype='M8[ns]', index=pd.Index([], dtype=object)) - self._compare(result, expected) - - def test_nonzero_single_element(self): - - # allow single item via bool method - s = Series([True]) - self.assertTrue(s.bool()) - - s = Series([False]) - self.assertFalse(s.bool()) - - # single item nan to raise - for s in [Series([np.nan]), Series([pd.NaT]), Series([True]), - Series([False])]: - self.assertRaises(ValueError, lambda: bool(s)) - - for s in [Series([np.nan]), Series([pd.NaT])]: - self.assertRaises(ValueError, lambda: s.bool()) - - # multiple bool are still an error - for s in [Series([True, True]), Series([False, False])]: - self.assertRaises(ValueError, lambda: bool(s)) - self.assertRaises(ValueError, lambda: s.bool()) - - # single non-bool are an error - for s in [Series([1]), Series([0]), Series(['a']), Series([0.0])]: - self.assertRaises(ValueError, lambda: bool(s)) - self.assertRaises(ValueError, lambda: s.bool()) - - def test_metadata_propagation_indiv(self): - # 
check that the metadata matches up on the resulting ops - - o = Series(range(3), range(3)) - o.name = 'foo' - o2 = Series(range(3), range(3)) - o2.name = 'bar' - - result = o.T - self.check_metadata(o, result) - - # resample - ts = Series(np.random.rand(1000), - index=date_range('20130101', periods=1000, freq='s'), - name='foo') - result = ts.resample('1T').mean() - self.check_metadata(ts, result) - - result = ts.resample('1T').min() - self.check_metadata(ts, result) - - result = ts.resample('1T').apply(lambda x: x.sum()) - self.check_metadata(ts, result) - - _metadata = Series._metadata - _finalize = Series.__finalize__ - Series._metadata = ['name', 'filename'] - o.filename = 'foo' - o2.filename = 'bar' - - def finalize(self, other, method=None, **kwargs): - for name in self._metadata: - if method == 'concat' and name == 'filename': - value = '+'.join([getattr( - o, name) for o in other.objs if getattr(o, name, None) - ]) - object.__setattr__(self, name, value) - else: - object.__setattr__(self, name, getattr(other, name, None)) - - return self - - Series.__finalize__ = finalize - - result = pd.concat([o, o2]) - self.assertEqual(result.filename, 'foo+bar') - self.assertIsNone(result.name) - - # reset - Series._metadata = _metadata - Series.__finalize__ = _finalize - - def test_describe(self): - self.series.describe() - self.ts.describe() - - def test_describe_objects(self): - s = Series(['a', 'b', 'b', np.nan, np.nan, np.nan, 'c', 'd', 'a', 'a']) - result = s.describe() - expected = Series({'count': 7, 'unique': 4, - 'top': 'a', 'freq': 3, 'second': 'b', - 'second_freq': 2}, index=result.index) - assert_series_equal(result, expected) - - dt = list(self.ts.index) - dt.append(dt[0]) - ser = Series(dt) - rs = ser.describe() - min_date = min(dt) - max_date = max(dt) - xp = Series({'count': len(dt), - 'unique': len(self.ts.index), - 'first': min_date, 'last': max_date, 'freq': 2, - 'top': min_date}, index=rs.index) - assert_series_equal(rs, xp) - - def test_describe_empty(self): - result = pd.Series().describe() - - self.assertEqual(result['count'], 0) - self.assertTrue(result.drop('count').isnull().all()) - - nanSeries = Series([np.nan]) - nanSeries.name = 'NaN' - result = nanSeries.describe() - self.assertEqual(result['count'], 0) - self.assertTrue(result.drop('count').isnull().all()) - - def test_describe_none(self): - noneSeries = Series([None]) - noneSeries.name = 'None' - expected = Series([0, 0], index=['count', 'unique'], name='None') - assert_series_equal(noneSeries.describe(), expected) - - def test_to_xarray(self): - - tm._skip_if_no_xarray() - from xarray import DataArray - - s = Series([]) - s.index.name = 'foo' - result = s.to_xarray() - self.assertEqual(len(result), 0) - self.assertEqual(len(result.coords), 1) - assert_almost_equal(list(result.coords.keys()), ['foo']) - self.assertIsInstance(result, DataArray) - - def testit(index, check_index_type=True, check_categorical=True): - s = Series(range(6), index=index(6)) - s.index.name = 'foo' - result = s.to_xarray() - repr(result) - self.assertEqual(len(result), 6) - self.assertEqual(len(result.coords), 1) - assert_almost_equal(list(result.coords.keys()), ['foo']) - self.assertIsInstance(result, DataArray) - - # idempotency - assert_series_equal(result.to_series(), s, - check_index_type=check_index_type, - check_categorical=check_categorical) - - for index in [tm.makeFloatIndex, tm.makeIntIndex, - tm.makeStringIndex, tm.makeUnicodeIndex, - tm.makeDateIndex, tm.makePeriodIndex, - tm.makeTimedeltaIndex]: - testit(index) - - # not 
idempotent - testit(tm.makeCategoricalIndex, check_index_type=False, - check_categorical=False) - - s = Series(range(6)) - s.index.name = 'foo' - s.index = pd.MultiIndex.from_product([['a', 'b'], range(3)], - names=['one', 'two']) - result = s.to_xarray() - self.assertEqual(len(result), 2) - assert_almost_equal(list(result.coords.keys()), ['one', 'two']) - self.assertIsInstance(result, DataArray) - assert_series_equal(result.to_series(), s) - - -class TestDataFrame(tm.TestCase, Generic): - _typ = DataFrame - _comparator = lambda self, x, y: assert_frame_equal(x, y) - - def test_rename_mi(self): - df = DataFrame([ - 11, 21, 31 - ], index=MultiIndex.from_tuples([("A", x) for x in ["a", "B", "c"]])) - df.rename(str.lower) - - def test_set_axis_name(self): - df = pd.DataFrame([[1, 2], [3, 4]]) - funcs = ['_set_axis_name', 'rename_axis'] - for func in funcs: - result = methodcaller(func, 'foo')(df) - self.assertTrue(df.index.name is None) - self.assertEqual(result.index.name, 'foo') - - result = methodcaller(func, 'cols', axis=1)(df) - self.assertTrue(df.columns.name is None) - self.assertEqual(result.columns.name, 'cols') - - def test_set_axis_name_mi(self): - df = DataFrame( - np.empty((3, 3)), - index=MultiIndex.from_tuples([("A", x) for x in list('aBc')]), - columns=MultiIndex.from_tuples([('C', x) for x in list('xyz')]) - ) - - level_names = ['L1', 'L2'] - funcs = ['_set_axis_name', 'rename_axis'] - for func in funcs: - result = methodcaller(func, level_names)(df) - self.assertEqual(result.index.names, level_names) - self.assertEqual(result.columns.names, [None, None]) - - result = methodcaller(func, level_names, axis=1)(df) - self.assertEqual(result.columns.names, ["L1", "L2"]) - self.assertEqual(result.index.names, [None, None]) - - def test_nonzero_single_element(self): - - # allow single item via bool method - df = DataFrame([[True]]) - self.assertTrue(df.bool()) - - df = DataFrame([[False]]) - self.assertFalse(df.bool()) - - df = DataFrame([[False, False]]) - self.assertRaises(ValueError, lambda: df.bool()) - self.assertRaises(ValueError, lambda: bool(df)) - - def test_get_numeric_data_preserve_dtype(self): - - # get the numeric data - o = DataFrame({'A': [1, '2', 3.]}) - result = o._get_numeric_data() - expected = DataFrame(index=[0, 1, 2], dtype=object) - self._compare(result, expected) - - def test_describe(self): - tm.makeDataFrame().describe() - tm.makeMixedDataFrame().describe() - tm.makeTimeDataFrame().describe() - - def test_describe_percentiles_percent_or_raw(self): - msg = 'percentiles should all be in the interval \\[0, 1\\]' - - df = tm.makeDataFrame() - with tm.assertRaisesRegexp(ValueError, msg): - df.describe(percentiles=[10, 50, 100]) - - with tm.assertRaisesRegexp(ValueError, msg): - df.describe(percentiles=[2]) - - with tm.assertRaisesRegexp(ValueError, msg): - df.describe(percentiles=[-2]) - - def test_describe_percentiles_equivalence(self): - df = tm.makeDataFrame() - d1 = df.describe() - d2 = df.describe(percentiles=[.25, .75]) - assert_frame_equal(d1, d2) - - def test_describe_percentiles_insert_median(self): - df = tm.makeDataFrame() - d1 = df.describe(percentiles=[.25, .75]) - d2 = df.describe(percentiles=[.25, .5, .75]) - assert_frame_equal(d1, d2) - self.assertTrue('25%' in d1.index) - self.assertTrue('75%' in d2.index) - - # none above - d1 = df.describe(percentiles=[.25, .45]) - d2 = df.describe(percentiles=[.25, .45, .5]) - assert_frame_equal(d1, d2) - self.assertTrue('25%' in d1.index) - self.assertTrue('45%' in d2.index) - - # none below - d1 = 
df.describe(percentiles=[.75, 1]) - d2 = df.describe(percentiles=[.5, .75, 1]) - assert_frame_equal(d1, d2) - self.assertTrue('75%' in d1.index) - self.assertTrue('100%' in d2.index) - - # edge - d1 = df.describe(percentiles=[0, 1]) - d2 = df.describe(percentiles=[0, .5, 1]) - assert_frame_equal(d1, d2) - self.assertTrue('0%' in d1.index) - self.assertTrue('100%' in d2.index) - - def test_describe_percentiles_unique(self): - # GH13104 - df = tm.makeDataFrame() - with self.assertRaises(ValueError): - df.describe(percentiles=[0.1, 0.2, 0.4, 0.5, 0.2, 0.6]) - with self.assertRaises(ValueError): - df.describe(percentiles=[0.1, 0.2, 0.4, 0.2, 0.6]) - - def test_describe_percentiles_formatting(self): - # GH13104 - df = tm.makeDataFrame() - - # default - result = df.describe().index - expected = Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', - 'max'], - dtype='object') - tm.assert_index_equal(result, expected) - - result = df.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, - 0.9995, 0.9999]).index - expected = Index(['count', 'mean', 'std', 'min', '0.01%', '0.05%', - '0.1%', '50%', '99.9%', '99.95%', '99.99%', 'max'], - dtype='object') - tm.assert_index_equal(result, expected) - - result = df.describe(percentiles=[0.00499, 0.005, 0.25, 0.50, - 0.75]).index - expected = Index(['count', 'mean', 'std', 'min', '0.499%', '0.5%', - '25%', '50%', '75%', 'max'], - dtype='object') - tm.assert_index_equal(result, expected) - - result = df.describe(percentiles=[0.00499, 0.01001, 0.25, 0.50, - 0.75]).index - expected = Index(['count', 'mean', 'std', 'min', '0.5%', '1.0%', - '25%', '50%', '75%', 'max'], - dtype='object') - tm.assert_index_equal(result, expected) - - def test_describe_column_index_type(self): - # GH13288 - df = pd.DataFrame([1, 2, 3, 4]) - df.columns = pd.Index([0], dtype=object) - result = df.describe().columns - expected = Index([0], dtype=object) - tm.assert_index_equal(result, expected) - - df = pd.DataFrame({'A': list("BCDE"), 0: [1, 2, 3, 4]}) - result = df.describe().columns - expected = Index([0], dtype=object) - tm.assert_index_equal(result, expected) - - def test_describe_no_numeric(self): - df = DataFrame({'A': ['foo', 'foo', 'bar'] * 8, - 'B': ['a', 'b', 'c', 'd'] * 6}) - desc = df.describe() - expected = DataFrame(dict((k, v.describe()) - for k, v in compat.iteritems(df)), - columns=df.columns) - assert_frame_equal(desc, expected) - - ts = tm.makeTimeSeries() - df = DataFrame({'time': ts.index}) - desc = df.describe() - self.assertEqual(desc.time['first'], min(ts.index)) - - def test_describe_empty(self): - df = DataFrame() - tm.assertRaisesRegexp(ValueError, 'DataFrame without columns', - df.describe) - - df = DataFrame(columns=['A', 'B']) - result = df.describe() - expected = DataFrame(0, columns=['A', 'B'], index=['count', 'unique']) - tm.assert_frame_equal(result, expected) - - def test_describe_empty_int_columns(self): - df = DataFrame([[0, 1], [1, 2]]) - desc = df[df[0] < 0].describe() # works - assert_series_equal(desc.xs('count'), - Series([0, 0], dtype=float, name='count')) - self.assertTrue(isnull(desc.ix[1:]).all().all()) - - def test_describe_objects(self): - df = DataFrame({"C1": ['a', 'a', 'c'], "C2": ['d', 'd', 'f']}) - result = df.describe() - expected = DataFrame({"C1": [3, 2, 'a', 2], "C2": [3, 2, 'd', 2]}, - index=['count', 'unique', 'top', 'freq']) - assert_frame_equal(result, expected) - - df = DataFrame({"C1": pd.date_range('2010-01-01', periods=4, freq='D') - }) - df.loc[4] = pd.Timestamp('2010-01-04') - result = df.describe() - expected = 
DataFrame({"C1": [5, 4, pd.Timestamp('2010-01-04'), 2, - pd.Timestamp('2010-01-01'), - pd.Timestamp('2010-01-04')]}, - index=['count', 'unique', 'top', 'freq', - 'first', 'last']) - assert_frame_equal(result, expected) - - # mix time and str - df['C2'] = ['a', 'a', 'b', 'c', 'a'] - result = df.describe() - expected['C2'] = [5, 3, 'a', 3, np.nan, np.nan] - assert_frame_equal(result, expected) - - # just str - expected = DataFrame({'C2': [5, 3, 'a', 4]}, - index=['count', 'unique', 'top', 'freq']) - result = df[['C2']].describe() - - # mix of time, str, numeric - df['C3'] = [2, 4, 6, 8, 2] - result = df.describe() - expected = DataFrame({"C3": [5., 4.4, 2.607681, 2., 2., 4., 6., 8.]}, - index=['count', 'mean', 'std', 'min', '25%', - '50%', '75%', 'max']) - assert_frame_equal(result, expected) - assert_frame_equal(df.describe(), df[['C3']].describe()) - - assert_frame_equal(df[['C1', 'C3']].describe(), df[['C3']].describe()) - assert_frame_equal(df[['C2', 'C3']].describe(), df[['C3']].describe()) - - def test_describe_typefiltering(self): - df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8, - 'catB': ['a', 'b', 'c', 'd'] * 6, - 'numC': np.arange(24, dtype='int64'), - 'numD': np.arange(24.) + .5, - 'ts': tm.makeTimeSeries()[:24].index}) - - descN = df.describe() - expected_cols = ['numC', 'numD', ] - expected = DataFrame(dict((k, df[k].describe()) - for k in expected_cols), - columns=expected_cols) - assert_frame_equal(descN, expected) - - desc = df.describe(include=['number']) - assert_frame_equal(desc, descN) - desc = df.describe(exclude=['object', 'datetime']) - assert_frame_equal(desc, descN) - desc = df.describe(include=['float']) - assert_frame_equal(desc, descN.drop('numC', 1)) - - descC = df.describe(include=['O']) - expected_cols = ['catA', 'catB'] - expected = DataFrame(dict((k, df[k].describe()) - for k in expected_cols), - columns=expected_cols) - assert_frame_equal(descC, expected) - - descD = df.describe(include=['datetime']) - assert_series_equal(descD.ts, df.ts.describe()) - - desc = df.describe(include=['object', 'number', 'datetime']) - assert_frame_equal(desc.loc[:, ["numC", "numD"]].dropna(), descN) - assert_frame_equal(desc.loc[:, ["catA", "catB"]].dropna(), descC) - descDs = descD.sort_index() # the index order change for mixed-types - assert_frame_equal(desc.loc[:, "ts":].dropna().sort_index(), descDs) - - desc = df.loc[:, 'catA':'catB'].describe(include='all') - assert_frame_equal(desc, descC) - desc = df.loc[:, 'numC':'numD'].describe(include='all') - assert_frame_equal(desc, descN) - - desc = df.describe(percentiles=[], include='all') - cnt = Series(data=[4, 4, 6, 6, 6], - index=['catA', 'catB', 'numC', 'numD', 'ts']) - assert_series_equal(desc.count(), cnt) - self.assertTrue('count' in desc.index) - self.assertTrue('unique' in desc.index) - self.assertTrue('50%' in desc.index) - self.assertTrue('first' in desc.index) - - desc = df.drop("ts", 1).describe(percentiles=[], include='all') - assert_series_equal(desc.count(), cnt.drop("ts")) - self.assertTrue('first' not in desc.index) - desc = df.drop(["numC", "numD"], 1).describe(percentiles=[], - include='all') - assert_series_equal(desc.count(), cnt.drop(["numC", "numD"])) - self.assertTrue('50%' not in desc.index) - - def test_describe_typefiltering_category_bool(self): - df = DataFrame({'A_cat': pd.Categorical(['foo', 'foo', 'bar'] * 8), - 'B_str': ['a', 'b', 'c', 'd'] * 6, - 'C_bool': [True] * 12 + [False] * 12, - 'D_num': np.arange(24.) 
+ .5, - 'E_ts': tm.makeTimeSeries()[:24].index}) - - desc = df.describe() - expected_cols = ['D_num'] - expected = DataFrame(dict((k, df[k].describe()) - for k in expected_cols), - columns=expected_cols) - assert_frame_equal(desc, expected) - - desc = df.describe(include=["category"]) - self.assertTrue(desc.columns.tolist() == ["A_cat"]) - - # 'all' includes numpy-dtypes + category - desc1 = df.describe(include="all") - desc2 = df.describe(include=[np.generic, "category"]) - assert_frame_equal(desc1, desc2) - - def test_describe_timedelta(self): - df = DataFrame({"td": pd.to_timedelta(np.arange(24) % 20, "D")}) - self.assertTrue(df.describe().loc["mean"][0] == pd.to_timedelta( - "8d4h")) - - def test_describe_typefiltering_dupcol(self): - df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8, - 'catB': ['a', 'b', 'c', 'd'] * 6, - 'numC': np.arange(24), - 'numD': np.arange(24.) + .5, - 'ts': tm.makeTimeSeries()[:24].index}) - s = df.describe(include='all').shape[1] - df = pd.concat([df, df], axis=1) - s2 = df.describe(include='all').shape[1] - self.assertTrue(s2 == 2 * s) - - def test_describe_typefiltering_groupby(self): - df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8, - 'catB': ['a', 'b', 'c', 'd'] * 6, - 'numC': np.arange(24), - 'numD': np.arange(24.) + .5, - 'ts': tm.makeTimeSeries()[:24].index}) - G = df.groupby('catA') - self.assertTrue(G.describe(include=['number']).shape == (16, 2)) - self.assertTrue(G.describe(include=['number', 'object']).shape == (22, - 3)) - self.assertTrue(G.describe(include='all').shape == (26, 4)) - - def test_describe_multi_index_df_column_names(self): - """ Test that column names persist after the describe operation.""" - - df = pd.DataFrame( - {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], - 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], - 'C': np.random.randn(8), - 'D': np.random.randn(8)}) - - # GH 11517 - # test for hierarchical index - hierarchical_index_df = df.groupby(['A', 'B']).mean().T - self.assertTrue(hierarchical_index_df.columns.names == ['A', 'B']) - self.assertTrue(hierarchical_index_df.describe().columns.names == - ['A', 'B']) - - # test for non-hierarchical index - non_hierarchical_index_df = df.groupby(['A']).mean().T - self.assertTrue(non_hierarchical_index_df.columns.names == ['A']) - self.assertTrue(non_hierarchical_index_df.describe().columns.names == - ['A']) - - def test_metadata_propagation_indiv(self): - - # groupby - df = DataFrame( - {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], - 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], - 'C': np.random.randn(8), - 'D': np.random.randn(8)}) - result = df.groupby('A').sum() - self.check_metadata(df, result) - - # resample - df = DataFrame(np.random.randn(1000, 2), - index=date_range('20130101', periods=1000, freq='s')) - result = df.resample('1T') - self.check_metadata(df, result) - - # merging with override - # GH 6923 - _metadata = DataFrame._metadata - _finalize = DataFrame.__finalize__ - - np.random.seed(10) - df1 = DataFrame(np.random.randint(0, 4, (3, 2)), columns=['a', 'b']) - df2 = DataFrame(np.random.randint(0, 4, (3, 2)), columns=['c', 'd']) - DataFrame._metadata = ['filename'] - df1.filename = 'fname1.csv' - df2.filename = 'fname2.csv' - - def finalize(self, other, method=None, **kwargs): - - for name in self._metadata: - if method == 'merge': - left, right = other.left, other.right - value = getattr(left, name, '') + '|' + getattr(right, - name, '') - object.__setattr__(self, name, value) - else: - 
object.__setattr__(self, name, getattr(other, name, '')) - - return self - - DataFrame.__finalize__ = finalize - result = df1.merge(df2, left_on=['a'], right_on=['c'], how='inner') - self.assertEqual(result.filename, 'fname1.csv|fname2.csv') - - # concat - # GH 6927 - DataFrame._metadata = ['filename'] - df1 = DataFrame(np.random.randint(0, 4, (3, 2)), columns=list('ab')) - df1.filename = 'foo' - - def finalize(self, other, method=None, **kwargs): - for name in self._metadata: - if method == 'concat': - value = '+'.join([getattr( - o, name) for o in other.objs if getattr(o, name, None) - ]) - object.__setattr__(self, name, value) - else: - object.__setattr__(self, name, getattr(other, name, None)) - - return self - - DataFrame.__finalize__ = finalize - - result = pd.concat([df1, df1]) - self.assertEqual(result.filename, 'foo+foo') - - # reset - DataFrame._metadata = _metadata - DataFrame.__finalize__ = _finalize - - def test_tz_convert_and_localize(self): - l0 = date_range('20140701', periods=5, freq='D') - - # TODO: l1 should be a PeriodIndex for testing - # after GH2106 is addressed - with tm.assertRaises(NotImplementedError): - period_range('20140701', periods=1).tz_convert('UTC') - with tm.assertRaises(NotImplementedError): - period_range('20140701', periods=1).tz_localize('UTC') - # l1 = period_range('20140701', periods=5, freq='D') - l1 = date_range('20140701', periods=5, freq='D') - - int_idx = Index(range(5)) - - for fn in ['tz_localize', 'tz_convert']: - - if fn == 'tz_convert': - l0 = l0.tz_localize('UTC') - l1 = l1.tz_localize('UTC') - - for idx in [l0, l1]: - - l0_expected = getattr(idx, fn)('US/Pacific') - l1_expected = getattr(idx, fn)('US/Pacific') - - df1 = DataFrame(np.ones(5), index=l0) - df1 = getattr(df1, fn)('US/Pacific') - self.assert_index_equal(df1.index, l0_expected) - - # MultiIndex - # GH7846 - df2 = DataFrame(np.ones(5), MultiIndex.from_arrays([l0, l1])) - - df3 = getattr(df2, fn)('US/Pacific', level=0) - self.assertFalse(df3.index.levels[0].equals(l0)) - self.assert_index_equal(df3.index.levels[0], l0_expected) - self.assert_index_equal(df3.index.levels[1], l1) - self.assertFalse(df3.index.levels[1].equals(l1_expected)) - - df3 = getattr(df2, fn)('US/Pacific', level=1) - self.assert_index_equal(df3.index.levels[0], l0) - self.assertFalse(df3.index.levels[0].equals(l0_expected)) - self.assert_index_equal(df3.index.levels[1], l1_expected) - self.assertFalse(df3.index.levels[1].equals(l1)) - - df4 = DataFrame(np.ones(5), - MultiIndex.from_arrays([int_idx, l0])) - - # TODO: untested - df5 = getattr(df4, fn)('US/Pacific', level=1) # noqa - - self.assert_index_equal(df3.index.levels[0], l0) - self.assertFalse(df3.index.levels[0].equals(l0_expected)) - self.assert_index_equal(df3.index.levels[1], l1_expected) - self.assertFalse(df3.index.levels[1].equals(l1)) - - # Bad Inputs - for fn in ['tz_localize', 'tz_convert']: - # Not DatetimeIndex / PeriodIndex - with tm.assertRaisesRegexp(TypeError, 'DatetimeIndex'): - df = DataFrame(index=int_idx) - df = getattr(df, fn)('US/Pacific') - - # Not DatetimeIndex / PeriodIndex - with tm.assertRaisesRegexp(TypeError, 'DatetimeIndex'): - df = DataFrame(np.ones(5), - MultiIndex.from_arrays([int_idx, l0])) - df = getattr(df, fn)('US/Pacific', level=0) - - # Invalid level - with tm.assertRaisesRegexp(ValueError, 'not valid'): - df = DataFrame(index=l0) - df = getattr(df, fn)('US/Pacific', level=1) - - def test_set_attribute(self): - # Test for consistent setattr behavior when an attribute and a column - # have the same name (Issue 
#8994) - df = DataFrame({'x': [1, 2, 3]}) - - df.y = 2 - df['y'] = [2, 4, 6] - df.y = 5 - - self.assertEqual(df.y, 5) - assert_series_equal(df['y'], Series([2, 4, 6], name='y')) - - def test_pct_change(self): - # GH 11150 - pnl = DataFrame([np.arange(0, 40, 10), np.arange(0, 40, 10), np.arange( - 0, 40, 10)]).astype(np.float64) - pnl.iat[1, 0] = np.nan - pnl.iat[1, 1] = np.nan - pnl.iat[2, 3] = 60 - - mask = pnl.isnull() - - for axis in range(2): - expected = pnl.ffill(axis=axis) / pnl.ffill(axis=axis).shift( - axis=axis) - 1 - expected[mask] = np.nan - result = pnl.pct_change(axis=axis, fill_method='pad') - - self.assert_frame_equal(result, expected) - - def test_to_xarray(self): - - tm._skip_if_no_xarray() - from xarray import Dataset - - df = DataFrame({'a': list('abc'), - 'b': list(range(1, 4)), - 'c': np.arange(3, 6).astype('u1'), - 'd': np.arange(4.0, 7.0, dtype='float64'), - 'e': [True, False, True], - 'f': pd.Categorical(list('abc')), - 'g': pd.date_range('20130101', periods=3), - 'h': pd.date_range('20130101', - periods=3, - tz='US/Eastern')} - ) - - df.index.name = 'foo' - result = df[0:0].to_xarray() - self.assertEqual(result.dims['foo'], 0) - self.assertIsInstance(result, Dataset) - - for index in [tm.makeFloatIndex, tm.makeIntIndex, - tm.makeStringIndex, tm.makeUnicodeIndex, - tm.makeDateIndex, tm.makePeriodIndex, - tm.makeCategoricalIndex, tm.makeTimedeltaIndex]: - df.index = index(3) - df.index.name = 'foo' - df.columns.name = 'bar' - result = df.to_xarray() - self.assertEqual(result.dims['foo'], 3) - self.assertEqual(len(result.coords), 1) - self.assertEqual(len(result.data_vars), 8) - assert_almost_equal(list(result.coords.keys()), ['foo']) - self.assertIsInstance(result, Dataset) - - # idempotency - # categoricals are not preserved - # datetimes w/tz are not preserved - # column names are lost - expected = df.copy() - expected['f'] = expected['f'].astype(object) - expected['h'] = expected['h'].astype('datetime64[ns]') - expected.columns.name = None - assert_frame_equal(result.to_dataframe(), expected, - check_index_type=False, check_categorical=False) - - # available in 0.7.1 - # MultiIndex - df.index = pd.MultiIndex.from_product([['a'], range(3)], - names=['one', 'two']) - result = df.to_xarray() - self.assertEqual(result.dims['one'], 1) - self.assertEqual(result.dims['two'], 3) - self.assertEqual(len(result.coords), 2) - self.assertEqual(len(result.data_vars), 8) - assert_almost_equal(list(result.coords.keys()), ['one', 'two']) - self.assertIsInstance(result, Dataset) - - result = result.to_dataframe() - expected = df.copy() - expected['f'] = expected['f'].astype(object) - expected['h'] = expected['h'].astype('datetime64[ns]') - expected.columns.name = None - assert_frame_equal(result, - expected, - check_index_type=False) - - -class TestPanel(tm.TestCase, Generic): - _typ = Panel - _comparator = lambda self, x, y: assert_panel_equal(x, y, by_blocks=True) - - def test_to_xarray(self): - - tm._skip_if_no_xarray() - from xarray import DataArray - - p = tm.makePanel() - - result = p.to_xarray() - self.assertIsInstance(result, DataArray) - self.assertEqual(len(result.coords), 3) - assert_almost_equal(list(result.coords.keys()), - ['items', 'major_axis', 'minor_axis']) - self.assertEqual(len(result.dims), 3) - - # idempotency - assert_panel_equal(result.to_pandas(), p) - - -class TestPanel4D(tm.TestCase, Generic): - _typ = Panel4D - _comparator = lambda self, x, y: assert_panel4d_equal(x, y, by_blocks=True) - - def test_sample(self): - raise nose.SkipTest("sample on 
Panel4D") - - def test_to_xarray(self): - - tm._skip_if_no_xarray() - from xarray import DataArray - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - p = tm.makePanel4D() - - result = p.to_xarray() - self.assertIsInstance(result, DataArray) - self.assertEqual(len(result.coords), 4) - assert_almost_equal(list(result.coords.keys()), - ['labels', 'items', 'major_axis', - 'minor_axis']) - self.assertEqual(len(result.dims), 4) - - # non-convertible - self.assertRaises(ValueError, lambda: result.to_pandas()) - -# run all the tests, but wrap each in a warning catcher -for t in ['test_rename', 'test_rename_axis', 'test_get_numeric_data', - 'test_get_default', 'test_nonzero', - 'test_numpy_1_7_compat_numeric_methods', - 'test_downcast', 'test_constructor_compound_dtypes', - 'test_head_tail', - 'test_size_compat', 'test_split_compat', - 'test_unexpected_keyword', - 'test_stat_unexpected_keyword', 'test_api_compat', - 'test_stat_non_defaults_args', - 'test_clip', 'test_truncate_out_of_bounds', 'test_numpy_clip', - 'test_metadata_propagation']: - - def f(): - def tester(self): - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): - return getattr(super(TestPanel4D, self), t)() - return tester - - setattr(TestPanel4D, t, f()) - - -class TestNDFrame(tm.TestCase): - # tests that don't fit elsewhere - - def test_sample(sel): - # Fixes issue: 2419 - # additional specific object based tests - - # A few dataframe test with degenerate weights. - easy_weight_list = [0] * 10 - easy_weight_list[5] = 1 - - df = pd.DataFrame({'col1': range(10, 20), - 'col2': range(20, 30), - 'colString': ['a'] * 10, - 'easyweights': easy_weight_list}) - sample1 = df.sample(n=1, weights='easyweights') - assert_frame_equal(sample1, df.iloc[5:6]) - - # Ensure proper error if string given as weight for Series, panel, or - # DataFrame with axis = 1. - s = Series(range(10)) - with tm.assertRaises(ValueError): - s.sample(n=3, weights='weight_column') - - panel = pd.Panel(items=[0, 1, 2], major_axis=[2, 3, 4], - minor_axis=[3, 4, 5]) - with tm.assertRaises(ValueError): - panel.sample(n=1, weights='weight_column') - - with tm.assertRaises(ValueError): - df.sample(n=1, weights='weight_column', axis=1) - - # Check weighting key error - with tm.assertRaises(KeyError): - df.sample(n=3, weights='not_a_real_column_name') - - # Check that re-normalizes weights that don't sum to one. 
- weights_less_than_1 = [0] * 10 - weights_less_than_1[0] = 0.5 - tm.assert_frame_equal( - df.sample(n=1, weights=weights_less_than_1), df.iloc[:1]) - - ### - # Test axis argument - ### - - # Test axis argument - df = pd.DataFrame({'col1': range(10), 'col2': ['a'] * 10}) - second_column_weight = [0, 1] - assert_frame_equal( - df.sample(n=1, axis=1, weights=second_column_weight), df[['col2']]) - - # Different axis arg types - assert_frame_equal(df.sample(n=1, axis='columns', - weights=second_column_weight), - df[['col2']]) - - weight = [0] * 10 - weight[5] = 0.5 - assert_frame_equal(df.sample(n=1, axis='rows', weights=weight), - df.iloc[5:6]) - assert_frame_equal(df.sample(n=1, axis='index', weights=weight), - df.iloc[5:6]) - - # Check out of range axis values - with tm.assertRaises(ValueError): - df.sample(n=1, axis=2) - - with tm.assertRaises(ValueError): - df.sample(n=1, axis='not_a_name') - - with tm.assertRaises(ValueError): - s = pd.Series(range(10)) - s.sample(n=1, axis=1) - - # Test weight length compared to correct axis - with tm.assertRaises(ValueError): - df.sample(n=1, axis=1, weights=[0.5] * 10) - - # Check weights with axis = 1 - easy_weight_list = [0] * 3 - easy_weight_list[2] = 1 - - df = pd.DataFrame({'col1': range(10, 20), - 'col2': range(20, 30), - 'colString': ['a'] * 10}) - sample1 = df.sample(n=1, axis=1, weights=easy_weight_list) - assert_frame_equal(sample1, df[['colString']]) - - # Test default axes - p = pd.Panel(items=['a', 'b', 'c'], major_axis=[2, 4, 6], - minor_axis=[1, 3, 5]) - assert_panel_equal( - p.sample(n=3, random_state=42), p.sample(n=3, axis=1, - random_state=42)) - assert_frame_equal( - df.sample(n=3, random_state=42), df.sample(n=3, axis=0, - random_state=42)) - - # Test that function aligns weights with frame - df = DataFrame( - {'col1': [5, 6, 7], - 'col2': ['a', 'b', 'c'], }, index=[9, 5, 3]) - s = Series([1, 0, 0], index=[3, 5, 9]) - assert_frame_equal(df.loc[[3]], df.sample(1, weights=s)) - - # Weights have index values to be dropped because not in - # sampled DataFrame - s2 = Series([0.001, 0, 10000], index=[3, 5, 10]) - assert_frame_equal(df.loc[[3]], df.sample(1, weights=s2)) - - # Weights have empty values to be filed with zeros - s3 = Series([0.01, 0], index=[3, 5]) - assert_frame_equal(df.loc[[3]], df.sample(1, weights=s3)) - - # No overlap in weight and sampled DataFrame indices - s4 = Series([1, 0], index=[1, 2]) - with tm.assertRaises(ValueError): - df.sample(1, weights=s4) - - def test_squeeze(self): - # noop - for s in [tm.makeFloatSeries(), tm.makeStringSeries(), - tm.makeObjectSeries()]: - tm.assert_series_equal(s.squeeze(), s) - for df in [tm.makeTimeDataFrame()]: - tm.assert_frame_equal(df.squeeze(), df) - for p in [tm.makePanel()]: - tm.assert_panel_equal(p.squeeze(), p) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - for p4d in [tm.makePanel4D()]: - tm.assert_panel4d_equal(p4d.squeeze(), p4d) - - # squeezing - df = tm.makeTimeDataFrame().reindex(columns=['A']) - tm.assert_series_equal(df.squeeze(), df['A']) - - p = tm.makePanel().reindex(items=['ItemA']) - tm.assert_frame_equal(p.squeeze(), p['ItemA']) - - p = tm.makePanel().reindex(items=['ItemA'], minor_axis=['A']) - tm.assert_series_equal(p.squeeze(), p.ix['ItemA', :, 'A']) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - p4d = tm.makePanel4D().reindex(labels=['label1']) - tm.assert_panel_equal(p4d.squeeze(), p4d['label1']) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - p4d = 
tm.makePanel4D().reindex(labels=['label1'], items=['ItemA']) - tm.assert_frame_equal(p4d.squeeze(), p4d.ix['label1', 'ItemA']) - - # don't fail with 0 length dimensions GH11229 & GH8999 - empty_series = pd.Series([], name='five') - empty_frame = pd.DataFrame([empty_series]) - empty_panel = pd.Panel({'six': empty_frame}) - - [tm.assert_series_equal(empty_series, higher_dim.squeeze()) - for higher_dim in [empty_series, empty_frame, empty_panel]] - - def test_numpy_squeeze(self): - s = tm.makeFloatSeries() - tm.assert_series_equal(np.squeeze(s), s) - - df = tm.makeTimeDataFrame().reindex(columns=['A']) - tm.assert_series_equal(np.squeeze(df), df['A']) - - msg = "the 'axis' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, - np.squeeze, s, axis=0) - - def test_transpose(self): - msg = (r"transpose\(\) got multiple values for " - r"keyword argument 'axes'") - for s in [tm.makeFloatSeries(), tm.makeStringSeries(), - tm.makeObjectSeries()]: - # calls implementation in pandas/core/base.py - tm.assert_series_equal(s.transpose(), s) - for df in [tm.makeTimeDataFrame()]: - tm.assert_frame_equal(df.transpose().transpose(), df) - for p in [tm.makePanel()]: - tm.assert_panel_equal(p.transpose(2, 0, 1) - .transpose(1, 2, 0), p) - tm.assertRaisesRegexp(TypeError, msg, p.transpose, - 2, 0, 1, axes=(2, 0, 1)) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - for p4d in [tm.makePanel4D()]: - tm.assert_panel4d_equal(p4d.transpose(2, 0, 3, 1) - .transpose(1, 3, 0, 2), p4d) - tm.assertRaisesRegexp(TypeError, msg, p4d.transpose, - 2, 0, 3, 1, axes=(2, 0, 3, 1)) - - def test_numpy_transpose(self): - msg = "the 'axes' parameter is not supported" - - s = tm.makeFloatSeries() - tm.assert_series_equal( - np.transpose(s), s) - tm.assertRaisesRegexp(ValueError, msg, - np.transpose, s, axes=1) - - df = tm.makeTimeDataFrame() - tm.assert_frame_equal(np.transpose( - np.transpose(df)), df) - tm.assertRaisesRegexp(ValueError, msg, - np.transpose, df, axes=1) - - p = tm.makePanel() - tm.assert_panel_equal(np.transpose( - np.transpose(p, axes=(2, 0, 1)), - axes=(1, 2, 0)), p) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - p4d = tm.makePanel4D() - tm.assert_panel4d_equal(np.transpose( - np.transpose(p4d, axes=(2, 0, 3, 1)), - axes=(1, 3, 0, 2)), p4d) - - def test_take(self): - indices = [1, 5, -2, 6, 3, -1] - for s in [tm.makeFloatSeries(), tm.makeStringSeries(), - tm.makeObjectSeries()]: - out = s.take(indices) - expected = Series(data=s.values.take(indices), - index=s.index.take(indices)) - tm.assert_series_equal(out, expected) - for df in [tm.makeTimeDataFrame()]: - out = df.take(indices) - expected = DataFrame(data=df.values.take(indices, axis=0), - index=df.index.take(indices), - columns=df.columns) - tm.assert_frame_equal(out, expected) - - indices = [-3, 2, 0, 1] - for p in [tm.makePanel()]: - out = p.take(indices) - expected = Panel(data=p.values.take(indices, axis=0), - items=p.items.take(indices), - major_axis=p.major_axis, - minor_axis=p.minor_axis) - tm.assert_panel_equal(out, expected) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - for p4d in [tm.makePanel4D()]: - out = p4d.take(indices) - expected = Panel4D(data=p4d.values.take(indices, axis=0), - labels=p4d.labels.take(indices), - major_axis=p4d.major_axis, - minor_axis=p4d.minor_axis, - items=p4d.items) - tm.assert_panel4d_equal(out, expected) - - def test_take_invalid_kwargs(self): - indices = [-3, 2, 0, 1] - s = tm.makeFloatSeries() - df = 
tm.makeTimeDataFrame() - p = tm.makePanel() - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - p4d = tm.makePanel4D() - - for obj in (s, df, p, p4d): - msg = r"take\(\) got an unexpected keyword argument 'foo'" - tm.assertRaisesRegexp(TypeError, msg, obj.take, - indices, foo=2) - - msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, obj.take, - indices, out=indices) - - msg = "the 'mode' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, obj.take, - indices, mode='clip') - - def test_equals(self): - s1 = pd.Series([1, 2, 3], index=[0, 2, 1]) - s2 = s1.copy() - self.assertTrue(s1.equals(s2)) - - s1[1] = 99 - self.assertFalse(s1.equals(s2)) - - # NaNs compare as equal - s1 = pd.Series([1, np.nan, 3, np.nan], index=[0, 2, 1, 3]) - s2 = s1.copy() - self.assertTrue(s1.equals(s2)) - - s2[0] = 9.9 - self.assertFalse(s1.equals(s2)) - - idx = MultiIndex.from_tuples([(0, 'a'), (1, 'b'), (2, 'c')]) - s1 = Series([1, 2, np.nan], index=idx) - s2 = s1.copy() - self.assertTrue(s1.equals(s2)) - - # Add object dtype column with nans - index = np.random.random(10) - df1 = DataFrame( - np.random.random(10, ), index=index, columns=['floats']) - df1['text'] = 'the sky is so blue. we could use more chocolate.'.split( - ) - df1['start'] = date_range('2000-1-1', periods=10, freq='T') - df1['end'] = date_range('2000-1-1', periods=10, freq='D') - df1['diff'] = df1['end'] - df1['start'] - df1['bool'] = (np.arange(10) % 3 == 0) - df1.ix[::2] = nan - df2 = df1.copy() - self.assertTrue(df1['text'].equals(df2['text'])) - self.assertTrue(df1['start'].equals(df2['start'])) - self.assertTrue(df1['end'].equals(df2['end'])) - self.assertTrue(df1['diff'].equals(df2['diff'])) - self.assertTrue(df1['bool'].equals(df2['bool'])) - self.assertTrue(df1.equals(df2)) - self.assertFalse(df1.equals(object)) - - # different dtype - different = df1.copy() - different['floats'] = different['floats'].astype('float32') - self.assertFalse(df1.equals(different)) - - # different index - different_index = -index - different = df2.set_index(different_index) - self.assertFalse(df1.equals(different)) - - # different columns - different = df2.copy() - different.columns = df2.columns[::-1] - self.assertFalse(df1.equals(different)) - - # DatetimeIndex - index = pd.date_range('2000-1-1', periods=10, freq='T') - df1 = df1.set_index(index) - df2 = df1.copy() - self.assertTrue(df1.equals(df2)) - - # MultiIndex - df3 = df1.set_index(['text'], append=True) - df2 = df1.set_index(['text'], append=True) - self.assertTrue(df3.equals(df2)) - - df2 = df1.set_index(['floats'], append=True) - self.assertFalse(df3.equals(df2)) - - # NaN in index - df3 = df1.set_index(['floats'], append=True) - df2 = df1.set_index(['floats'], append=True) - self.assertTrue(df3.equals(df2)) - - # GH 8437 - a = pd.Series([False, np.nan]) - b = pd.Series([False, np.nan]) - c = pd.Series(index=range(2)) - d = pd.Series(index=range(2)) - e = pd.Series(index=range(2)) - f = pd.Series(index=range(2)) - c[:-1] = d[:-1] = e[0] = f[0] = False - self.assertTrue(a.equals(a)) - self.assertTrue(a.equals(b)) - self.assertTrue(a.equals(c)) - self.assertTrue(a.equals(d)) - self.assertFalse(a.equals(e)) - self.assertTrue(e.equals(f)) - - def test_describe_raises(self): - with tm.assertRaises(NotImplementedError): - tm.makePanel().describe() - - def test_pipe(self): - df = DataFrame({'A': [1, 2, 3]}) - f = lambda x, y: x ** y - result = df.pipe(f, 2) - expected = DataFrame({'A': [1, 4, 9]}) - self.assert_frame_equal(result, 
expected) - - result = df.A.pipe(f, 2) - self.assert_series_equal(result, expected.A) - - def test_pipe_tuple(self): - df = DataFrame({'A': [1, 2, 3]}) - f = lambda x, y: y - result = df.pipe((f, 'y'), 0) - self.assert_frame_equal(result, df) - - result = df.A.pipe((f, 'y'), 0) - self.assert_series_equal(result, df.A) - - def test_pipe_tuple_error(self): - df = DataFrame({"A": [1, 2, 3]}) - f = lambda x, y: y - with tm.assertRaises(ValueError): - df.pipe((f, 'y'), x=1, y=0) - - with tm.assertRaises(ValueError): - df.A.pipe((f, 'y'), x=1, y=0) - - def test_pipe_panel(self): - wp = Panel({'r1': DataFrame({"A": [1, 2, 3]})}) - f = lambda x, y: x + y - result = wp.pipe(f, 2) - expected = wp + 2 - assert_panel_equal(result, expected) - - result = wp.pipe((f, 'y'), x=1) - expected = wp + 1 - assert_panel_equal(result, expected) - - with tm.assertRaises(ValueError): - result = wp.pipe((f, 'y'), x=1, y=1) - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/test_groupby.py b/pandas/tests/test_groupby.py deleted file mode 100644 index 6d23d4f152d3e..0000000000000 --- a/pandas/tests/test_groupby.py +++ /dev/null @@ -1,6878 +0,0 @@ -# -*- coding: utf-8 -*- -from __future__ import print_function -import nose - -from datetime import datetime -from numpy import nan - -from pandas.types.common import _ensure_platform_int -from pandas import date_range, bdate_range, Timestamp, isnull -from pandas.core.index import Index, MultiIndex, CategoricalIndex -from pandas.core.api import Categorical, DataFrame -from pandas.core.common import UnsupportedFunctionCall -from pandas.core.groupby import (SpecificationError, DataError, _nargsort, - _lexsort_indexer) -from pandas.core.series import Series -from pandas.core.config import option_context -from pandas.formats.printing import pprint_thing -from pandas.util.testing import (assert_panel_equal, assert_frame_equal, - assert_series_equal, assert_almost_equal, - assert_index_equal, assertRaisesRegexp) -from pandas.compat import (range, long, lrange, StringIO, lmap, lzip, map, zip, - builtins, OrderedDict, product as cart_product) -from pandas import compat -from pandas.core.panel import Panel -from pandas.tools.merge import concat -from collections import defaultdict -from functools import partial -import pandas.core.common as com -import numpy as np - -import pandas.core.nanops as nanops - -import pandas.util.testing as tm -import pandas as pd - - -class TestGroupBy(tm.TestCase): - - _multiprocess_can_split_ = True - - def setUp(self): - self.ts = tm.makeTimeSeries() - - self.seriesd = tm.getSeriesData() - self.tsd = tm.getTimeSeriesData() - self.frame = DataFrame(self.seriesd) - self.tsframe = DataFrame(self.tsd) - - self.df = DataFrame( - {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], - 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], - 'C': np.random.randn(8), - 'D': np.random.randn(8)}) - - self.df_mixed_floats = DataFrame( - {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], - 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], - 'C': np.random.randn(8), - 'D': np.array( - np.random.randn(8), dtype='float32')}) - - index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', - 'three']], - labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], - [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], - names=['first', 'second']) - self.mframe = DataFrame(np.random.randn(10, 3), index=index, - columns=['A', 'B', 'C']) - - 
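- # an 11-row frame with string key columns A/B/C and numeric columns - # D/E/F, shared by the multi-key groupby tests below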
self.three_group = DataFrame( - {'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', - 'foo', 'foo', 'foo'], - 'B': ['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', - 'two', 'two', 'one'], - 'C': ['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', - 'dull', 'shiny', 'shiny', 'shiny'], - 'D': np.random.randn(11), - 'E': np.random.randn(11), - 'F': np.random.randn(11)}) - - def test_basic(self): - def checkit(dtype): - data = Series(np.arange(9) // 3, index=np.arange(9), dtype=dtype) - - index = np.arange(9) - np.random.shuffle(index) - data = data.reindex(index) - - grouped = data.groupby(lambda x: x // 3) - - for k, v in grouped: - self.assertEqual(len(v), 3) - - agged = grouped.aggregate(np.mean) - self.assertEqual(agged[1], 1) - - assert_series_equal(agged, grouped.agg(np.mean)) # shorthand - assert_series_equal(agged, grouped.mean()) - assert_series_equal(grouped.agg(np.sum), grouped.sum()) - - expected = grouped.apply(lambda x: x * x.sum()) - transformed = grouped.transform(lambda x: x * x.sum()) - self.assertEqual(transformed[7], 12) - assert_series_equal(transformed, expected) - - value_grouped = data.groupby(data) - assert_series_equal(value_grouped.aggregate(np.mean), agged, - check_index_type=False) - - # complex agg - agged = grouped.aggregate([np.mean, np.std]) - agged = grouped.aggregate({'one': np.mean, 'two': np.std}) - - group_constants = {0: 10, 1: 20, 2: 30} - agged = grouped.agg(lambda x: group_constants[x.name] + x.mean()) - self.assertEqual(agged[1], 21) - - # corner cases - self.assertRaises(Exception, grouped.aggregate, lambda x: x * 2) - - for dtype in ['int64', 'int32', 'float64', 'float32']: - checkit(dtype) - - def test_select_bad_cols(self): - df = DataFrame([[1, 2]], columns=['A', 'B']) - g = df.groupby('A') - self.assertRaises(KeyError, g.__getitem__, ['C']) # g[['C']] - - self.assertRaises(KeyError, g.__getitem__, ['A', 'C']) # g[['A', 'C']] - with assertRaisesRegexp(KeyError, '^[^A]+$'): - # A should not be referenced as a bad column... - # will have to rethink regex if you change message! - g[['A', 'C']] - - def test_first_last_nth(self): - # tests for first / last / nth - grouped = self.df.groupby('A') - first = grouped.first() - expected = self.df.ix[[1, 0], ['B', 'C', 'D']] - expected.index = Index(['bar', 'foo'], name='A') - expected = expected.sort_index() - assert_frame_equal(first, expected) - - nth = grouped.nth(0) - assert_frame_equal(nth, expected) - - last = grouped.last() - expected = self.df.ix[[5, 7], ['B', 'C', 'D']] - expected.index = Index(['bar', 'foo'], name='A') - assert_frame_equal(last, expected) - - nth = grouped.nth(-1) - assert_frame_equal(nth, expected) - - nth = grouped.nth(1) - expected = self.df.ix[[2, 3], ['B', 'C', 'D']].copy() - expected.index = Index(['foo', 'bar'], name='A') - expected = expected.sort_index() - assert_frame_equal(nth, expected) - - # it works! 
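- # (smoke test only: selecting a column and calling first/last/nth - # should not raise)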
- grouped['B'].first() - grouped['B'].last() - grouped['B'].nth(0) - - self.df.loc[self.df['A'] == 'foo', 'B'] = np.nan - self.assertTrue(isnull(grouped['B'].first()['foo'])) - self.assertTrue(isnull(grouped['B'].last()['foo'])) - self.assertTrue(isnull(grouped['B'].nth(0)['foo'])) - - # v0.14.0 whatsnew - df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B']) - g = df.groupby('A') - result = g.first() - expected = df.iloc[[1, 2]].set_index('A') - assert_frame_equal(result, expected) - - expected = df.iloc[[1, 2]].set_index('A') - result = g.nth(0, dropna='any') - assert_frame_equal(result, expected) - - def test_first_last_nth_dtypes(self): - - df = self.df_mixed_floats.copy() - df['E'] = True - df['F'] = 1 - - # tests for first / last / nth - grouped = df.groupby('A') - first = grouped.first() - expected = df.ix[[1, 0], ['B', 'C', 'D', 'E', 'F']] - expected.index = Index(['bar', 'foo'], name='A') - expected = expected.sort_index() - assert_frame_equal(first, expected) - - last = grouped.last() - expected = df.ix[[5, 7], ['B', 'C', 'D', 'E', 'F']] - expected.index = Index(['bar', 'foo'], name='A') - expected = expected.sort_index() - assert_frame_equal(last, expected) - - nth = grouped.nth(1) - expected = df.ix[[3, 2], ['B', 'C', 'D', 'E', 'F']] - expected.index = Index(['bar', 'foo'], name='A') - expected = expected.sort_index() - assert_frame_equal(nth, expected) - - # GH 2763, first/last shifting dtypes - idx = lrange(10) - idx.append(9) - s = Series(data=lrange(11), index=idx, name='IntCol') - self.assertEqual(s.dtype, 'int64') - f = s.groupby(level=0).first() - self.assertEqual(f.dtype, 'int64') - - def test_nth(self): - df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B']) - g = df.groupby('A') - - assert_frame_equal(g.nth(0), df.iloc[[0, 2]].set_index('A')) - assert_frame_equal(g.nth(1), df.iloc[[1]].set_index('A')) - assert_frame_equal(g.nth(2), df.loc[[]].set_index('A')) - assert_frame_equal(g.nth(-1), df.iloc[[1, 2]].set_index('A')) - assert_frame_equal(g.nth(-2), df.iloc[[0]].set_index('A')) - assert_frame_equal(g.nth(-3), df.loc[[]].set_index('A')) - assert_series_equal(g.B.nth(0), df.set_index('A').B.iloc[[0, 2]]) - assert_series_equal(g.B.nth(1), df.set_index('A').B.iloc[[1]]) - assert_frame_equal(g[['B']].nth(0), - df.ix[[0, 2], ['A', 'B']].set_index('A')) - - exp = df.set_index('A') - assert_frame_equal(g.nth(0, dropna='any'), exp.iloc[[1, 2]]) - assert_frame_equal(g.nth(-1, dropna='any'), exp.iloc[[1, 2]]) - - exp['B'] = np.nan - assert_frame_equal(g.nth(7, dropna='any'), exp.iloc[[1, 2]]) - assert_frame_equal(g.nth(2, dropna='any'), exp.iloc[[1, 2]]) - - # out of bounds, regression from 0.13.1 - # GH 6621 - df = DataFrame({'color': {0: 'green', - 1: 'green', - 2: 'red', - 3: 'red', - 4: 'red'}, - 'food': {0: 'ham', - 1: 'eggs', - 2: 'eggs', - 3: 'ham', - 4: 'pork'}, - 'two': {0: 1.5456590000000001, - 1: -0.070345000000000005, - 2: -2.4004539999999999, - 3: 0.46206000000000003, - 4: 0.52350799999999997}, - 'one': {0: 0.56573799999999996, - 1: -0.9742360000000001, - 2: 1.033801, - 3: -0.78543499999999999, - 4: 0.70422799999999997}}).set_index(['color', - 'food']) - - result = df.groupby(level=0, as_index=False).nth(2) - expected = df.iloc[[-1]] - assert_frame_equal(result, expected) - - result = df.groupby(level=0, as_index=False).nth(3) - expected = df.loc[[]] - assert_frame_equal(result, expected) - - # GH 7559 - # from the vbench - df = DataFrame(np.random.randint(1, 10, (100, 2)), dtype='int64') - s = df[1] - g = df[0] - expected = 
s.groupby(g).first() - expected2 = s.groupby(g).apply(lambda x: x.iloc[0]) - assert_series_equal(expected2, expected, check_names=False) - # the grouped column's name is preserved - self.assertEqual(expected.name, 1) - - # validate first - v = s[g == 1].iloc[0] - self.assertEqual(expected.iloc[0], v) - self.assertEqual(expected2.iloc[0], v) - - # this is NOT the same as .first (as sort=True is the default!) - # as it keeps the order in the series (and not the group order) - # related GH 7287 - expected = s.groupby(g, sort=False).first() - result = s.groupby(g, sort=False).nth(0, dropna='all') - assert_series_equal(result, expected) - - # doc example - df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B']) - g = df.groupby('A') - result = g.B.nth(0, dropna=True) - expected = g.B.first() - assert_series_equal(result, expected) - - # test multiple nth values - df = DataFrame([[1, np.nan], [1, 3], [1, 4], [5, 6], [5, 7]], - columns=['A', 'B']) - g = df.groupby('A') - - assert_frame_equal(g.nth(0), df.iloc[[0, 3]].set_index('A')) - assert_frame_equal(g.nth([0]), df.iloc[[0, 3]].set_index('A')) - assert_frame_equal(g.nth([0, 1]), df.iloc[[0, 1, 3, 4]].set_index('A')) - assert_frame_equal( - g.nth([0, -1]), df.iloc[[0, 2, 3, 4]].set_index('A')) - assert_frame_equal( - g.nth([0, 1, 2]), df.iloc[[0, 1, 2, 3, 4]].set_index('A')) - assert_frame_equal( - g.nth([0, 1, -1]), df.iloc[[0, 1, 2, 3, 4]].set_index('A')) - assert_frame_equal(g.nth([2]), df.iloc[[2]].set_index('A')) - assert_frame_equal(g.nth([3, 4]), df.loc[[]].set_index('A')) - - business_dates = pd.date_range(start='4/1/2014', end='6/30/2014', - freq='B') - df = DataFrame(1, index=business_dates, columns=['a', 'b']) - # get the first, fourth and last two business days for each month - key = (df.index.year, df.index.month) - result = df.groupby(key, as_index=False).nth([0, 3, -2, -1]) - expected_dates = pd.to_datetime( - ['2014/4/1', '2014/4/4', '2014/4/29', '2014/4/30', '2014/5/1', - '2014/5/6', '2014/5/29', '2014/5/30', '2014/6/2', '2014/6/5', - '2014/6/27', '2014/6/30']) - expected = DataFrame(1, columns=['a', 'b'], index=expected_dates) - assert_frame_equal(result, expected) - - def test_nth_multi_index(self): - # PR 9090, related to issue 8979 - # test nth on MultiIndex, should match .first() - grouped = self.three_group.groupby(['A', 'B']) - result = grouped.nth(0) - expected = grouped.first() - assert_frame_equal(result, expected) - - def test_nth_multi_index_as_expected(self): - # PR 9090, related to issue 8979 - # test nth on MultiIndex - three_group = DataFrame( - {'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', - 'foo', 'foo', 'foo'], - 'B': ['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', - 'two', 'two', 'one'], - 'C': ['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', - 'dull', 'shiny', 'shiny', 'shiny']}) - grouped = three_group.groupby(['A', 'B']) - result = grouped.nth(0) - expected = DataFrame( - {'C': ['dull', 'dull', 'dull', 'dull']}, - index=MultiIndex.from_arrays([['bar', 'bar', 'foo', 'foo'], - ['one', 'two', 'one', 'two']], - names=['A', 'B'])) - assert_frame_equal(result, expected) - - def test_group_selection_cache(self): - # GH 12839 nth, head, and tail should return same result consistently - df = DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B']) - expected = df.iloc[[0, 2]].set_index('A') - - g = df.groupby('A') - result1 = g.head(n=2) - result2 = g.nth(0) - assert_frame_equal(result1, df) - assert_frame_equal(result2, expected) - - g = df.groupby('A') - result1 = g.tail(n=2) - 
result2 = g.nth(0) - assert_frame_equal(result1, df) - assert_frame_equal(result2, expected) - - g = df.groupby('A') - result1 = g.nth(0) - result2 = g.head(n=2) - assert_frame_equal(result1, expected) - assert_frame_equal(result2, df) - - g = df.groupby('A') - result1 = g.nth(0) - result2 = g.tail(n=2) - assert_frame_equal(result1, expected) - assert_frame_equal(result2, df) - - def test_grouper_index_types(self): - # related GH5375 - # groupby misbehaving when using a Floatlike index - df = DataFrame(np.arange(10).reshape(5, 2), columns=list('AB')) - for index in [tm.makeFloatIndex, tm.makeStringIndex, - tm.makeUnicodeIndex, tm.makeIntIndex, tm.makeDateIndex, - tm.makePeriodIndex]: - - df.index = index(len(df)) - df.groupby(list('abcde')).apply(lambda x: x) - - df.index = list(reversed(df.index.tolist())) - df.groupby(list('abcde')).apply(lambda x: x) - - def test_grouper_multilevel_freq(self): - - # GH 7885 - # with level and freq specified in a pd.Grouper - from datetime import date, timedelta - d0 = date.today() - timedelta(days=14) - dates = date_range(d0, date.today()) - date_index = pd.MultiIndex.from_product( - [dates, dates], names=['foo', 'bar']) - df = pd.DataFrame(np.random.randint(0, 100, 225), index=date_index) - - # Check string level - expected = df.reset_index().groupby([pd.Grouper( - key='foo', freq='W'), pd.Grouper(key='bar', freq='W')]).sum() - # reset index changes columns dtype to object - expected.columns = pd.Index([0], dtype='int64') - - result = df.groupby([pd.Grouper(level='foo', freq='W'), pd.Grouper( - level='bar', freq='W')]).sum() - assert_frame_equal(result, expected) - - # Check integer level - result = df.groupby([pd.Grouper(level=0, freq='W'), pd.Grouper( - level=1, freq='W')]).sum() - assert_frame_equal(result, expected) - - def test_grouper_creation_bug(self): - - # GH 8795 - df = DataFrame({'A': [0, 0, 1, 1, 2, 2], 'B': [1, 2, 3, 4, 5, 6]}) - g = df.groupby('A') - expected = g.sum() - - g = df.groupby(pd.Grouper(key='A')) - result = g.sum() - assert_frame_equal(result, expected) - - result = g.apply(lambda x: x.sum()) - assert_frame_equal(result, expected) - - g = df.groupby(pd.Grouper(key='A', axis=0)) - result = g.sum() - assert_frame_equal(result, expected) - - # GH14334 - # pd.Grouper(key=...) 
may be passed in a list - df = DataFrame({'A': [0, 0, 0, 1, 1, 1], - 'B': [1, 1, 2, 2, 3, 3], - 'C': [1, 2, 3, 4, 5, 6]}) - # Group by single column - expected = df.groupby('A').sum() - g = df.groupby([pd.Grouper(key='A')]) - result = g.sum() - assert_frame_equal(result, expected) - - # Group by two columns - # using a combination of strings and Grouper objects - expected = df.groupby(['A', 'B']).sum() - - # Group with two Grouper objects - g = df.groupby([pd.Grouper(key='A'), pd.Grouper(key='B')]) - result = g.sum() - assert_frame_equal(result, expected) - - # Group with a string and a Grouper object - g = df.groupby(['A', pd.Grouper(key='B')]) - result = g.sum() - assert_frame_equal(result, expected) - - # Group with a Grouper object and a string - g = df.groupby([pd.Grouper(key='A'), 'B']) - result = g.sum() - assert_frame_equal(result, expected) - - # GH8866 - s = Series(np.arange(8, dtype='int64'), - index=pd.MultiIndex.from_product( - [list('ab'), range(2), - date_range('20130101', periods=2)], - names=['one', 'two', 'three'])) - result = s.groupby(pd.Grouper(level='three', freq='M')).sum() - expected = Series([28], index=Index( - [Timestamp('2013-01-31')], freq='M', name='three')) - assert_series_equal(result, expected) - - # just specifying a level breaks - result = s.groupby(pd.Grouper(level='one')).sum() - expected = s.groupby(level='one').sum() - assert_series_equal(result, expected) - - def test_grouper_column_and_index(self): - # GH 14327 - - # Grouping a multi-index frame by a column and an index level should - # be equivalent to resetting the index and grouping by two columns - idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 3), - ('b', 1), ('b', 2), ('b', 3)]) - idx.names = ['outer', 'inner'] - df_multi = pd.DataFrame({"A": np.arange(6), - 'B': ['one', 'one', 'two', - 'two', 'one', 'one']}, - index=idx) - result = df_multi.groupby(['B', pd.Grouper(level='inner')]).mean() - expected = df_multi.reset_index().groupby(['B', 'inner']).mean() - assert_frame_equal(result, expected) - - # Test the reverse grouping order - result = df_multi.groupby([pd.Grouper(level='inner'), 'B']).mean() - expected = df_multi.reset_index().groupby(['inner', 'B']).mean() - assert_frame_equal(result, expected) - - # Grouping a single-index frame by a column and the index should - # be equivalent to resetting the index and grouping by two columns - df_single = df_multi.reset_index('outer') - result = df_single.groupby(['B', pd.Grouper(level='inner')]).mean() - expected = df_single.reset_index().groupby(['B', 'inner']).mean() - assert_frame_equal(result, expected) - - # Test the reverse grouping order - result = df_single.groupby([pd.Grouper(level='inner'), 'B']).mean() - expected = df_single.reset_index().groupby(['inner', 'B']).mean() - assert_frame_equal(result, expected) - - def test_grouper_getting_correct_binner(self): - - # GH 10063 - # using a non-time-based grouper and a time-based grouper - # and specifying levels - df = DataFrame({'A': 1}, index=pd.MultiIndex.from_product( - [list('ab'), date_range('20130101', periods=80)], names=['one', - 'two'])) - result = df.groupby([pd.Grouper(level='one'), pd.Grouper( - level='two', freq='M')]).sum() - expected = DataFrame({'A': [31, 28, 21, 31, 28, 21]}, - index=MultiIndex.from_product( - [list('ab'), - date_range('20130101', freq='M', periods=3)], - names=['one', 'two'])) - assert_frame_equal(result, expected) - - def test_grouper_iter(self): - self.assertEqual(sorted(self.df.groupby('A').grouper), ['bar', 'foo']) - - def 
test_empty_groups(self): - # GH 1048 - self.assertRaises(ValueError, self.df.groupby, []) - - def test_groupby_grouper(self): - grouped = self.df.groupby('A') - - result = self.df.groupby(grouped.grouper).mean() - expected = grouped.mean() - assert_frame_equal(result, expected) - - def test_groupby_duplicated_column_errormsg(self): - # GH7511 - df = DataFrame(columns=['A', 'B', 'A', 'C'], - data=[range(4), range(2, 6), range(0, 8, 2)]) - - self.assertRaises(ValueError, df.groupby, 'A') - self.assertRaises(ValueError, df.groupby, ['A', 'B']) - - grouped = df.groupby('B') - c = grouped.count() - self.assertTrue(c.columns.nlevels == 1) - self.assertTrue(c.columns.size == 3) - - def test_groupby_dict_mapping(self): - # GH #679 - from pandas import Series - s = Series({'T1': 5}) - result = s.groupby({'T1': 'T2'}).agg(sum) - expected = s.groupby(['T2']).agg(sum) - assert_series_equal(result, expected) - - s = Series([1., 2., 3., 4.], index=list('abcd')) - mapping = {'a': 0, 'b': 0, 'c': 1, 'd': 1} - - result = s.groupby(mapping).mean() - result2 = s.groupby(mapping).agg(np.mean) - expected = s.groupby([0, 0, 1, 1]).mean() - expected2 = s.groupby([0, 0, 1, 1]).mean() - assert_series_equal(result, expected) - assert_series_equal(result, result2) - assert_series_equal(result, expected2) - - def test_groupby_grouper_f_sanity_checked(self): - dates = date_range('01-Jan-2013', periods=12, freq='MS') - ts = Series(np.random.randn(12), index=dates) - - # GH3035 - # index.map is used to apply the grouper to the index; - # if it fails on the elements, map tries it on the entire index as - # a sequence. That can yield invalid results that cause trouble - # down the line. - # the surprise comes from using key[0:6] rather than str(key)[0:6] - # when the elements are Timestamps: - # the result is Index[0:6], which is very confusing. 
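- # the groupby call below must detect this and raise, rather than - # silently grouping by the first six index values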
- - self.assertRaises(AssertionError, ts.groupby, lambda key: key[0:6]) - - def test_groupby_nonobject_dtype(self): - key = self.mframe.index.labels[0] - grouped = self.mframe.groupby(key) - result = grouped.sum() - - expected = self.mframe.groupby(key.astype('O')).sum() - assert_frame_equal(result, expected) - - # GH 3911, mixed frame non-conversion - df = self.df_mixed_floats.copy() - df['value'] = lrange(len(df)) - - def max_value(group): - return group.ix[group['value'].idxmax()] - - applied = df.groupby('A').apply(max_value) - result = applied.get_dtype_counts().sort_values() - expected = Series({'object': 2, - 'float64': 2, - 'int64': 1}).sort_values() - assert_series_equal(result, expected) - - def test_groupby_return_type(self): - - # GH2893, return a reduced type - df1 = DataFrame([{"val1": 1, - "val2": 20}, {"val1": 1, - "val2": 19}, {"val1": 2, - "val2": 27}, {"val1": 2, - "val2": 12} - ]) - - def func(dataf): - return dataf["val2"] - dataf["val2"].mean() - - result = df1.groupby("val1", squeeze=True).apply(func) - tm.assertIsInstance(result, Series) - - df2 = DataFrame([{"val1": 1, - "val2": 20}, {"val1": 1, - "val2": 19}, {"val1": 1, - "val2": 27}, {"val1": 1, - "val2": 12} - ]) - - def func(dataf): - return dataf["val2"] - dataf["val2"].mean() - - result = df2.groupby("val1", squeeze=True).apply(func) - tm.assertIsInstance(result, Series) - - # GH3596, return a consistent type (regression in 0.11 from 0.10.1) - df = DataFrame([[1, 1], [1, 1]], columns=['X', 'Y']) - result = df.groupby('X', squeeze=False).count() - tm.assertIsInstance(result, DataFrame) - - # GH5592 - # inconsistent return type - df = DataFrame(dict(A=['Tiger', 'Tiger', 'Tiger', 'Lamb', 'Lamb', - 'Pony', 'Pony'], B=Series( - np.arange(7), dtype='int64'), C=date_range( - '20130101', periods=7))) - - def f(grp): - return grp.iloc[0] - - expected = df.groupby('A').first()[['B']] - result = df.groupby('A').apply(f)[['B']] - assert_frame_equal(result, expected) - - def f(grp): - if grp.name == 'Tiger': - return None - return grp.iloc[0] - - result = df.groupby('A').apply(f)[['B']] - e = expected.copy() - e.loc['Tiger'] = np.nan - assert_frame_equal(result, e) - - def f(grp): - if grp.name == 'Pony': - return None - return grp.iloc[0] - - result = df.groupby('A').apply(f)[['B']] - e = expected.copy() - e.loc['Pony'] = np.nan - assert_frame_equal(result, e) - - # 5592 revisited, with datetimes - def f(grp): - if grp.name == 'Pony': - return None - return grp.iloc[0] - - result = df.groupby('A').apply(f)[['C']] - e = df.groupby('A').first()[['C']] - e.loc['Pony'] = pd.NaT - assert_frame_equal(result, e) - - # scalar outputs - def f(grp): - if grp.name == 'Pony': - return None - return grp.iloc[0].loc['C'] - - result = df.groupby('A').apply(f) - e = df.groupby('A').first()['C'].copy() - e.loc['Pony'] = np.nan - e.name = None - assert_series_equal(result, e) - - def test_agg_api(self): - - # GH 6337 - # http://stackoverflow.com/questions/21706030/pandas-groupby-agg-function-column-dtype-error - # a different API path for agg when passed a custom function with a mixed frame - - df = DataFrame({'data1': np.random.randn(5), - 'data2': np.random.randn(5), - 'key1': ['a', 'a', 'b', 'b', 'a'], - 'key2': ['one', 'two', 'one', 'two', 'one']}) - grouped = df.groupby('key1') - - def peak_to_peak(arr): - return arr.max() - arr.min() - - expected = grouped.agg([peak_to_peak]) - expected.columns = ['data1', 'data2'] - result = grouped.agg(peak_to_peak) - assert_frame_equal(result, expected) - - def test_agg_regression1(self): - grouped = 
self.tsframe.groupby([lambda x: x.year, lambda x: x.month]) - result = grouped.agg(np.mean) - expected = grouped.mean() - assert_frame_equal(result, expected) - - def test_agg_datetimes_mixed(self): - data = [[1, '2012-01-01', 1.0], [2, '2012-01-02', 2.0], [3, None, 3.0]] - - df1 = DataFrame({'key': [x[0] for x in data], - 'date': [x[1] for x in data], - 'value': [x[2] for x in data]}) - - data = [[row[0], datetime.strptime(row[1], '%Y-%m-%d').date() if row[1] - else None, row[2]] for row in data] - - df2 = DataFrame({'key': [x[0] for x in data], - 'date': [x[1] for x in data], - 'value': [x[2] for x in data]}) - - df1['weights'] = df1['value'] / df1['value'].sum() - gb1 = df1.groupby('date').aggregate(np.sum) - - df2['weights'] = df1['value'] / df1['value'].sum() - gb2 = df2.groupby('date').aggregate(np.sum) - - assert (len(gb1) == len(gb2)) - - def test_agg_period_index(self): - from pandas import period_range, PeriodIndex - prng = period_range('2012-1-1', freq='M', periods=3) - df = DataFrame(np.random.randn(3, 2), index=prng) - rs = df.groupby(level=0).sum() - tm.assertIsInstance(rs.index, PeriodIndex) - - # GH 3579 - index = period_range(start='1999-01', periods=5, freq='M') - s1 = Series(np.random.rand(len(index)), index=index) - s2 = Series(np.random.rand(len(index)), index=index) - series = [('s1', s1), ('s2', s2)] - df = DataFrame.from_items(series) - grouped = df.groupby(df.index.month) - list(grouped) - - def test_agg_dict_parameter_cast_result_dtypes(self): - # GH 12821 - - df = DataFrame( - {'class': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], - 'time': date_range('1/1/2011', periods=8, freq='H')}) - df.loc[[0, 1, 2, 5], 'time'] = None - - # test for `first` function - exp = df.loc[[0, 3, 4, 6]].set_index('class') - grouped = df.groupby('class') - assert_frame_equal(grouped.first(), exp) - assert_frame_equal(grouped.agg('first'), exp) - assert_frame_equal(grouped.agg({'time': 'first'}), exp) - assert_series_equal(grouped.time.first(), exp['time']) - assert_series_equal(grouped.time.agg('first'), exp['time']) - - # test for `last` function - exp = df.loc[[0, 3, 4, 7]].set_index('class') - grouped = df.groupby('class') - assert_frame_equal(grouped.last(), exp) - assert_frame_equal(grouped.agg('last'), exp) - assert_frame_equal(grouped.agg({'time': 'last'}), exp) - assert_series_equal(grouped.time.last(), exp['time']) - assert_series_equal(grouped.time.agg('last'), exp['time']) - - def test_agg_must_agg(self): - grouped = self.df.groupby('A')['C'] - self.assertRaises(Exception, grouped.agg, lambda x: x.describe()) - self.assertRaises(Exception, grouped.agg, lambda x: x.index[:2]) - - def test_agg_ser_multi_key(self): - # TODO(wesm): unused - ser = self.df.C # noqa - - f = lambda x: x.sum() - results = self.df.C.groupby([self.df.A, self.df.B]).aggregate(f) - expected = self.df.groupby(['A', 'B']).sum()['C'] - assert_series_equal(results, expected) - - def test_get_group(self): - wp = tm.makePanel() - grouped = wp.groupby(lambda x: x.month, axis='major') - - gp = grouped.get_group(1) - expected = wp.reindex(major=[x for x in wp.major_axis if x.month == 1]) - assert_panel_equal(gp, expected) - - # GH 5267 - # be datelike friendly - df = DataFrame({'DATE': pd.to_datetime( - ['10-Oct-2013', '10-Oct-2013', '10-Oct-2013', '11-Oct-2013', - '11-Oct-2013', '11-Oct-2013']), - 'label': ['foo', 'foo', 'bar', 'foo', 'foo', 'bar'], - 'VAL': [1, 2, 3, 4, 5, 6]}) - - g = df.groupby('DATE') - key = list(g.groups)[0] - result1 = g.get_group(key) - result2 = g.get_group(Timestamp(key).to_pydatetime()) 
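- # the string form of the Timestamp key should resolve to the same group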
- result3 = g.get_group(str(Timestamp(key))) - assert_frame_equal(result1, result2) - assert_frame_equal(result1, result3) - - g = df.groupby(['DATE', 'label']) - - key = list(g.groups)[0] - result1 = g.get_group(key) - result2 = g.get_group((Timestamp(key[0]).to_pydatetime(), key[1])) - result3 = g.get_group((str(Timestamp(key[0])), key[1])) - assert_frame_equal(result1, result2) - assert_frame_equal(result1, result3) - - # must pass a same-length tuple with multiple keys - self.assertRaises(ValueError, lambda: g.get_group('foo')) - self.assertRaises(ValueError, lambda: g.get_group(('foo'))) - self.assertRaises(ValueError, - lambda: g.get_group(('foo', 'bar', 'baz'))) - - def test_get_group_empty_bins(self): - d = pd.DataFrame([3, 1, 7, 6]) - bins = [0, 5, 10, 15] - g = d.groupby(pd.cut(d[0], bins)) - - result = g.get_group('(0, 5]') - expected = DataFrame([3, 1], index=[0, 1]) - assert_frame_equal(result, expected) - - self.assertRaises(KeyError, lambda: g.get_group('(10, 15]')) - - def test_get_group_grouped_by_tuple(self): - # GH 8121 - df = DataFrame([[(1, ), (1, 2), (1, ), (1, 2)]], index=['ids']).T - gr = df.groupby('ids') - expected = DataFrame({'ids': [(1, ), (1, )]}, index=[0, 2]) - result = gr.get_group((1, )) - assert_frame_equal(result, expected) - - dt = pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-01', - '2010-01-02']) - df = DataFrame({'ids': [(x, ) for x in dt]}) - gr = df.groupby('ids') - result = gr.get_group(('2010-01-01', )) - expected = DataFrame({'ids': [(dt[0], ), (dt[0], )]}, index=[0, 2]) - assert_frame_equal(result, expected) - - def test_agg_apply_corner(self): - # nothing to group, all NA - grouped = self.ts.groupby(self.ts * np.nan) - self.assertEqual(self.ts.dtype, np.float64) - - # groupby float64 values results in Float64Index - exp = Series([], dtype=np.float64, index=pd.Index( - [], dtype=np.float64)) - assert_series_equal(grouped.sum(), exp) - assert_series_equal(grouped.agg(np.sum), exp) - assert_series_equal(grouped.apply(np.sum), exp, check_index_type=False) - - # DataFrame - grouped = self.tsframe.groupby(self.tsframe['A'] * np.nan) - exp_df = DataFrame(columns=self.tsframe.columns, dtype=float, - index=pd.Index([], dtype=np.float64)) - assert_frame_equal(grouped.sum(), exp_df, check_names=False) - assert_frame_equal(grouped.agg(np.sum), exp_df, check_names=False) - assert_frame_equal(grouped.apply(np.sum), exp_df.iloc[:, :0], - check_names=False) - - def test_agg_grouping_is_list_tuple(self): - from pandas.core.groupby import Grouping - - df = tm.makeTimeDataFrame() - - grouped = df.groupby(lambda x: x.year) - grouper = grouped.grouper.groupings[0].grouper - grouped.grouper.groupings[0] = Grouping(self.ts.index, list(grouper)) - - result = grouped.agg(np.mean) - expected = grouped.mean() - tm.assert_frame_equal(result, expected) - - grouped.grouper.groupings[0] = Grouping(self.ts.index, tuple(grouper)) - - result = grouped.agg(np.mean) - expected = grouped.mean() - tm.assert_frame_equal(result, expected) - - def test_grouping_error_on_multidim_input(self): - from pandas.core.groupby import Grouping - self.assertRaises(ValueError, - Grouping, self.df.index, self.df[['A', 'A']]) - - def test_agg_python_multiindex(self): - grouped = self.mframe.groupby(['A', 'B']) - - result = grouped.agg(np.mean) - expected = grouped.mean() - tm.assert_frame_equal(result, expected) - - def test_apply_describe_bug(self): - grouped = self.mframe.groupby(level='first') - grouped.describe() # it works! 
- - def test_apply_issues(self): - # GH 5788 - - s = """2011.05.16,00:00,1.40893 -2011.05.16,01:00,1.40760 -2011.05.16,02:00,1.40750 -2011.05.16,03:00,1.40649 -2011.05.17,02:00,1.40893 -2011.05.17,03:00,1.40760 -2011.05.17,04:00,1.40750 -2011.05.17,05:00,1.40649 -2011.05.18,02:00,1.40893 -2011.05.18,03:00,1.40760 -2011.05.18,04:00,1.40750 -2011.05.18,05:00,1.40649""" - - df = pd.read_csv( - StringIO(s), header=None, names=['date', 'time', 'value'], - parse_dates=[['date', 'time']]) - df = df.set_index('date_time') - - expected = df.groupby(df.index.date).idxmax() - result = df.groupby(df.index.date).apply(lambda x: x.idxmax()) - assert_frame_equal(result, expected) - - # GH 5789 - # don't auto coerce dates - df = pd.read_csv( - StringIO(s), header=None, names=['date', 'time', 'value']) - exp_idx = pd.Index( - ['2011.05.16', '2011.05.17', '2011.05.18' - ], dtype=object, name='date') - expected = Series(['00:00', '02:00', '02:00'], index=exp_idx) - result = df.groupby('date').apply( - lambda x: x['time'][x['value'].idxmax()]) - assert_series_equal(result, expected) - - def test_time_field_bug(self): - # Test a fix for the error described in GH issue 11324: when - # non-key fields in a grouped DataFrame contained time-based fields - # that were not returned by the apply function, an exception would be - # raised. - - df = pd.DataFrame({'a': 1, 'b': [datetime.now() for nn in range(10)]}) - - def func_with_no_date(batch): - return pd.Series({'c': 2}) - - def func_with_date(batch): - return pd.Series({'c': 2, 'b': datetime(2015, 1, 1)}) - - dfg_no_conversion = df.groupby(by=['a']).apply(func_with_no_date) - dfg_no_conversion_expected = pd.DataFrame({'c': 2}, index=[1]) - dfg_no_conversion_expected.index.name = 'a' - - dfg_conversion = df.groupby(by=['a']).apply(func_with_date) - dfg_conversion_expected = pd.DataFrame( - {'b': datetime(2015, 1, 1), - 'c': 2}, index=[1]) - dfg_conversion_expected.index.name = 'a' - - self.assert_frame_equal(dfg_no_conversion, dfg_no_conversion_expected) - self.assert_frame_equal(dfg_conversion, dfg_conversion_expected) - - def test_len(self): - df = tm.makeTimeDataFrame() - grouped = df.groupby([lambda x: x.year, lambda x: x.month, - lambda x: x.day]) - self.assertEqual(len(grouped), len(df)) - - grouped = df.groupby([lambda x: x.year, lambda x: x.month]) - expected = len(set([(x.year, x.month) for x in df.index])) - self.assertEqual(len(grouped), expected) - - # issue 11016 - df = pd.DataFrame(dict(a=[np.nan] * 3, b=[1, 2, 3])) - self.assertEqual(len(df.groupby(('a'))), 0) - self.assertEqual(len(df.groupby(('b'))), 3) - self.assertEqual(len(df.groupby(('a', 'b'))), 3) - - def test_groups(self): - grouped = self.df.groupby(['A']) - groups = grouped.groups - self.assertIs(groups, grouped.groups) # caching works - - for k, v in compat.iteritems(grouped.groups): - self.assertTrue((self.df.ix[v]['A'] == k).all()) - - grouped = self.df.groupby(['A', 'B']) - groups = grouped.groups - self.assertIs(groups, grouped.groups) # caching works - for k, v in compat.iteritems(grouped.groups): - self.assertTrue((self.df.ix[v]['A'] == k[0]).all()) - self.assertTrue((self.df.ix[v]['B'] == k[1]).all()) - - def test_aggregate_str_func(self): - def _check_results(grouped): - # single series - result = grouped['A'].agg('std') - expected = grouped['A'].std() - assert_series_equal(result, expected) - - # group frame by function name - result = grouped.aggregate('var') - expected = grouped.var() - assert_frame_equal(result, expected) - - # group frame by function dict - result = 
grouped.agg(OrderedDict([['A', 'var'], ['B', 'std'], - ['C', 'mean'], ['D', 'sem']])) - expected = DataFrame(OrderedDict([['A', grouped['A'].var( - )], ['B', grouped['B'].std()], ['C', grouped['C'].mean()], - ['D', grouped['D'].sem()]])) - assert_frame_equal(result, expected) - - by_weekday = self.tsframe.groupby(lambda x: x.weekday()) - _check_results(by_weekday) - - by_mwkday = self.tsframe.groupby([lambda x: x.month, - lambda x: x.weekday()]) - _check_results(by_mwkday) - - def test_aggregate_item_by_item(self): - - df = self.df.copy() - df['E'] = ['a'] * len(self.df) - grouped = self.df.groupby('A') - - # API change in 0.11 - # def aggfun(ser): - # return len(ser + 'a') - # result = grouped.agg(aggfun) - # self.assertEqual(len(result.columns), 1) - - aggfun = lambda ser: ser.size - result = grouped.agg(aggfun) - foo = (self.df.A == 'foo').sum() - bar = (self.df.A == 'bar').sum() - K = len(result.columns) - - # GH5782 - # odd comparisons can result here, so cast to make easy - exp = pd.Series(np.array([foo] * K), index=list('BCD'), - dtype=np.float64, name='foo') - tm.assert_series_equal(result.xs('foo'), exp) - - exp = pd.Series(np.array([bar] * K), index=list('BCD'), - dtype=np.float64, name='bar') - tm.assert_almost_equal(result.xs('bar'), exp) - - def aggfun(ser): - return ser.size - - result = DataFrame().groupby(self.df.A).agg(aggfun) - tm.assertIsInstance(result, DataFrame) - self.assertEqual(len(result), 0) - - def test_agg_item_by_item_raise_typeerror(self): - from numpy.random import randint - - df = DataFrame(randint(10, size=(20, 10))) - - def raiseException(df): - pprint_thing('----------------------------------------') - pprint_thing(df.to_string()) - raise TypeError - - self.assertRaises(TypeError, df.groupby(0).agg, raiseException) - - def test_basic_regression(self): - # regression - T = [1.0 * x for x in lrange(1, 10) * 10][:1095] - result = Series(T, lrange(0, len(T))) - - groupings = np.random.random((1100, )) - groupings = Series(groupings, lrange(0, len(groupings))) * 10. 
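- # the grouper is deliberately longer than the series being grouped; - # forming the groups and aggregating should still work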
- - grouped = result.groupby(groupings) - grouped.mean() - - def test_transform(self): - data = Series(np.arange(9) // 3, index=np.arange(9)) - - index = np.arange(9) - np.random.shuffle(index) - data = data.reindex(index) - - grouped = data.groupby(lambda x: x // 3) - - transformed = grouped.transform(lambda x: x * x.sum()) - self.assertEqual(transformed[7], 12) - - # GH 8046 - # make sure that we preserve the input order - - df = DataFrame( - np.arange(6, dtype='int64').reshape( - 3, 2), columns=["a", "b"], index=[0, 2, 1]) - key = [0, 0, 1] - expected = df.sort_index().groupby(key).transform( - lambda x: x - x.mean()).groupby(key).mean() - result = df.groupby(key).transform(lambda x: x - x.mean()).groupby( - key).mean() - assert_frame_equal(result, expected) - - def demean(arr): - return arr - arr.mean() - - people = DataFrame(np.random.randn(5, 5), - columns=['a', 'b', 'c', 'd', 'e'], - index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis']) - key = ['one', 'two', 'one', 'two', 'one'] - result = people.groupby(key).transform(demean).groupby(key).mean() - expected = people.groupby(key).apply(demean).groupby(key).mean() - assert_frame_equal(result, expected) - - # GH 8430 - df = tm.makeTimeDataFrame() - g = df.groupby(pd.TimeGrouper('M')) - g.transform(lambda x: x - 1) - - # GH 9700 - df = DataFrame({'a': range(5, 10), 'b': range(5)}) - result = df.groupby('a').transform(max) - expected = DataFrame({'b': range(5)}) - tm.assert_frame_equal(result, expected) - - def test_transform_fast(self): - - df = DataFrame({'id': np.arange(100000) / 3, - 'val': np.random.randn(100000)}) - - grp = df.groupby('id')['val'] - - values = np.repeat(grp.mean().values, - _ensure_platform_int(grp.count().values)) - expected = pd.Series(values, index=df.index, name='val') - - result = grp.transform(np.mean) - assert_series_equal(result, expected) - - result = grp.transform('mean') - assert_series_equal(result, expected) - - # GH 12737 - df = pd.DataFrame({'grouping': [0, 1, 1, 3], 'f': [1.1, 2.1, 3.1, 4.5], - 'd': pd.date_range('2014-1-1', '2014-1-4'), - 'i': [1, 2, 3, 4]}, - columns=['grouping', 'f', 'i', 'd']) - result = df.groupby('grouping').transform('first') - - dates = [pd.Timestamp('2014-1-1'), pd.Timestamp('2014-1-2'), - pd.Timestamp('2014-1-2'), pd.Timestamp('2014-1-4')] - expected = pd.DataFrame({'f': [1.1, 2.1, 2.1, 4.5], - 'd': dates, - 'i': [1, 2, 2, 4]}, - columns=['f', 'i', 'd']) - assert_frame_equal(result, expected) - - # selection - result = df.groupby('grouping')[['f', 'i']].transform('first') - expected = expected[['f', 'i']] - assert_frame_equal(result, expected) - - # dup columns - df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['g', 'a', 'a']) - result = df.groupby('g').transform('first') - expected = df.drop('g', axis=1) - assert_frame_equal(result, expected) - - def test_transform_broadcast(self): - grouped = self.ts.groupby(lambda x: x.month) - result = grouped.transform(np.mean) - - self.assert_index_equal(result.index, self.ts.index) - for _, gp in grouped: - assert_fp_equal(result.reindex(gp.index), gp.mean()) - - grouped = self.tsframe.groupby(lambda x: x.month) - result = grouped.transform(np.mean) - self.assert_index_equal(result.index, self.tsframe.index) - for _, gp in grouped: - agged = gp.mean() - res = result.reindex(gp.index) - for col in self.tsframe: - assert_fp_equal(res[col], agged[col]) - - # group columns - grouped = self.tsframe.groupby({'A': 0, 'B': 0, 'C': 1, 'D': 1}, - axis=1) - result = grouped.transform(np.mean) - self.assert_index_equal(result.index, 
self.tsframe.index) - self.assert_index_equal(result.columns, self.tsframe.columns) - for _, gp in grouped: - agged = gp.mean(1) - res = result.reindex(columns=gp.columns) - for idx in gp.index: - assert_fp_equal(res.xs(idx), agged[idx]) - - def test_transform_axis(self): - - # make sure that we are setting the axes - # correctly when on axis=0 or 1 - # in the presence of a non-monotonic indexer - # GH12713 - - base = self.tsframe.iloc[0:5] - r = len(base.index) - c = len(base.columns) - tso = DataFrame(np.random.randn(r, c), - index=base.index, - columns=base.columns, - dtype='float64') - # monotonic - ts = tso - grouped = ts.groupby(lambda x: x.weekday()) - result = ts - grouped.transform('mean') - expected = grouped.apply(lambda x: x - x.mean()) - assert_frame_equal(result, expected) - - ts = ts.T - grouped = ts.groupby(lambda x: x.weekday(), axis=1) - result = ts - grouped.transform('mean') - expected = grouped.apply(lambda x: (x.T - x.mean(1)).T) - assert_frame_equal(result, expected) - - # non-monotonic - ts = tso.iloc[[1, 0] + list(range(2, len(base)))] - grouped = ts.groupby(lambda x: x.weekday()) - result = ts - grouped.transform('mean') - expected = grouped.apply(lambda x: x - x.mean()) - assert_frame_equal(result, expected) - - ts = ts.T - grouped = ts.groupby(lambda x: x.weekday(), axis=1) - result = ts - grouped.transform('mean') - expected = grouped.apply(lambda x: (x.T - x.mean(1)).T) - assert_frame_equal(result, expected) - - def test_transform_dtype(self): - # GH 9807 - # Check transform dtype output is preserved - df = DataFrame([[1, 3], [2, 3]]) - result = df.groupby(1).transform('mean') - expected = DataFrame([[1.5], [1.5]]) - assert_frame_equal(result, expected) - - def test_transform_bug(self): - # GH 5712 - # transforming on a datetime column - df = DataFrame(dict(A=Timestamp('20130101'), B=np.arange(5))) - result = df.groupby('A')['B'].transform( - lambda x: x.rank(ascending=False)) - expected = Series(np.arange(5, 0, step=-1), name='B') - assert_series_equal(result, expected) - - def test_transform_multiple(self): - grouped = self.ts.groupby([lambda x: x.year, lambda x: x.month]) - - grouped.transform(lambda x: x * 2) - grouped.transform(np.mean) - - def test_dispatch_transform(self): - df = self.tsframe[::5].reindex(self.tsframe.index) - - grouped = df.groupby(lambda x: x.month) - - filled = grouped.fillna(method='pad') - fillit = lambda x: x.fillna(method='pad') - expected = df.groupby(lambda x: x.month).transform(fillit) - assert_frame_equal(filled, expected) - - def test_transform_select_columns(self): - f = lambda x: x.mean() - result = self.df.groupby('A')['C', 'D'].transform(f) - - selection = self.df[['C', 'D']] - expected = selection.groupby(self.df['A']).transform(f) - - assert_frame_equal(result, expected) - - def test_transform_exclude_nuisance(self): - - # this also tests orderings in transform between - # series/frame to make sure it's consistent - expected = {} - grouped = self.df.groupby('A') - expected['C'] = grouped['C'].transform(np.mean) - expected['D'] = grouped['D'].transform(np.mean) - expected = DataFrame(expected) - result = self.df.groupby('A').transform(np.mean) - - assert_frame_equal(result, expected) - - def test_transform_function_aliases(self): - result = self.df.groupby('A').transform('mean') - expected = self.df.groupby('A').transform(np.mean) - assert_frame_equal(result, expected) - - result = self.df.groupby('A')['C'].transform('mean') - expected = self.df.groupby('A')['C'].transform(np.mean) - assert_series_equal(result, 
expected) - - def test_series_fast_transform_date(self): - # GH 13191 - df = pd.DataFrame({'grouping': [np.nan, 1, 1, 3], - 'd': pd.date_range('2014-1-1', '2014-1-4')}) - result = df.groupby('grouping')['d'].transform('first') - dates = [pd.NaT, pd.Timestamp('2014-1-2'), pd.Timestamp('2014-1-2'), - pd.Timestamp('2014-1-4')] - expected = pd.Series(dates, name='d') - assert_series_equal(result, expected) - - def test_transform_length(self): - # GH 9697 - df = pd.DataFrame({'col1': [1, 1, 2, 2], 'col2': [1, 2, 3, np.nan]}) - expected = pd.Series([3.0] * 4) - - def nsum(x): - return np.nansum(x) - - results = [df.groupby('col1').transform(sum)['col2'], - df.groupby('col1')['col2'].transform(sum), - df.groupby('col1').transform(nsum)['col2'], - df.groupby('col1')['col2'].transform(nsum)] - for result in results: - assert_series_equal(result, expected, check_names=False) - - def test_transform_coercion(self): - - # GH 14457 - # when transforming, be sure not to coerce via assignment - df = pd.DataFrame(dict(A=['a', 'a'], B=[0, 1])) - g = df.groupby('A') - - expected = g.transform(np.mean) - result = g.transform(lambda x: np.mean(x)) - assert_frame_equal(result, expected) - - def test_with_na(self): - index = Index(np.arange(10)) - - for dtype in ['float64', 'float32', 'int64', 'int32', 'int16', 'int8']: - values = Series(np.ones(10), index, dtype=dtype) - labels = Series([nan, 'foo', 'bar', 'bar', nan, nan, 'bar', - 'bar', nan, 'foo'], index=index) - - # this SHOULD be an int - grouped = values.groupby(labels) - agged = grouped.agg(len) - expected = Series([4, 2], index=['bar', 'foo']) - - assert_series_equal(agged, expected, check_dtype=False) - - # self.assertTrue(issubclass(agged.dtype.type, np.integer)) - - # explicitly return a float from my function - def f(x): - return float(len(x)) - - agged = grouped.agg(f) - expected = Series([4, 2], index=['bar', 'foo']) - - assert_series_equal(agged, expected, check_dtype=False) - self.assertTrue(issubclass(agged.dtype.type, np.dtype(dtype).type)) - - def test_groupby_transform_with_int(self): - - # GH 3740, make sure that we upcast where needed on item-by-item transform - - # floats - df = DataFrame(dict(A=[1, 1, 1, 2, 2, 2], B=Series(1, dtype='float64'), - C=Series( - [1, 2, 3, 1, 2, 3], dtype='float64'), D='foo')) - with np.errstate(all='ignore'): - result = df.groupby('A').transform( - lambda x: (x - x.mean()) / x.std()) - expected = DataFrame(dict(B=np.nan, C=Series( - [-1, 0, 1, -1, 0, 1], dtype='float64'))) - assert_frame_equal(result, expected) - - # int case - df = DataFrame(dict(A=[1, 1, 1, 2, 2, 2], B=1, - C=[1, 2, 3, 1, 2, 3], D='foo')) - with np.errstate(all='ignore'): - result = df.groupby('A').transform( - lambda x: (x - x.mean()) / x.std()) - expected = DataFrame(dict(B=np.nan, C=[-1, 0, 1, -1, 0, 1])) - assert_frame_equal(result, expected) - - # int that needs float conversion - s = Series([2, 3, 4, 10, 5, -1]) - df = DataFrame(dict(A=[1, 1, 1, 2, 2, 2], B=1, C=s, D='foo')) - with np.errstate(all='ignore'): - result = df.groupby('A').transform( - lambda x: (x - x.mean()) / x.std()) - - s1 = s.iloc[0:3] - s1 = (s1 - s1.mean()) / s1.std() - s2 = s.iloc[3:6] - s2 = (s2 - s2.mean()) / s2.std() - expected = DataFrame(dict(B=np.nan, C=concat([s1, s2]))) - assert_frame_equal(result, expected) - - # int downcasting - result = df.groupby('A').transform(lambda x: x * 2 / 2) - expected = DataFrame(dict(B=1, C=[2, 3, 4, 10, 5, -1])) - assert_frame_equal(result, expected) - - def test_indices_concatenation_order(self): - - # GH 2808 - - def f1(x): 
- y = x[(x.b % 2) == 1] ** 2 - if y.empty: - multiindex = MultiIndex(levels=[[]] * 2, labels=[[]] * 2, - names=['b', 'c']) - res = DataFrame(None, columns=['a'], index=multiindex) - return res - else: - y = y.set_index(['b', 'c']) - return y - - def f2(x): - y = x[(x.b % 2) == 1] ** 2 - if y.empty: - return DataFrame() - else: - y = y.set_index(['b', 'c']) - return y - - def f3(x): - y = x[(x.b % 2) == 1] ** 2 - if y.empty: - multiindex = MultiIndex(levels=[[]] * 2, labels=[[]] * 2, - names=['foo', 'bar']) - res = DataFrame(None, columns=['a', 'b'], index=multiindex) - return res - else: - return y - - df = DataFrame({'a': [1, 2, 2, 2], 'b': lrange(4), 'c': lrange(5, 9)}) - - df2 = DataFrame({'a': [3, 2, 2, 2], 'b': lrange(4), 'c': lrange(5, 9)}) - - # correct result - result1 = df.groupby('a').apply(f1) - result2 = df2.groupby('a').apply(f1) - assert_frame_equal(result1, result2) - - # should fail (not the same number of levels) - self.assertRaises(AssertionError, df.groupby('a').apply, f2) - self.assertRaises(AssertionError, df2.groupby('a').apply, f2) - - # should fail (incorrect shape) - self.assertRaises(AssertionError, df.groupby('a').apply, f3) - self.assertRaises(AssertionError, df2.groupby('a').apply, f3) - - def test_attr_wrapper(self): - grouped = self.ts.groupby(lambda x: x.weekday()) - - result = grouped.std() - expected = grouped.agg(lambda x: np.std(x, ddof=1)) - assert_series_equal(result, expected) - - # this is pretty cool - result = grouped.describe() - expected = {} - for name, gp in grouped: - expected[name] = gp.describe() - expected = DataFrame(expected).T - assert_frame_equal(result.unstack(), expected) - - # get attribute - result = grouped.dtype - expected = grouped.agg(lambda x: x.dtype) - - # make sure raises error - self.assertRaises(AttributeError, getattr, grouped, 'foo') - - def test_series_describe_multikey(self): - ts = tm.makeTimeSeries() - grouped = ts.groupby([lambda x: x.year, lambda x: x.month]) - result = grouped.describe().unstack() - assert_series_equal(result['mean'], grouped.mean(), check_names=False) - assert_series_equal(result['std'], grouped.std(), check_names=False) - assert_series_equal(result['min'], grouped.min(), check_names=False) - - def test_series_describe_single(self): - ts = tm.makeTimeSeries() - grouped = ts.groupby(lambda x: x.month) - result = grouped.apply(lambda x: x.describe()) - expected = grouped.describe() - assert_series_equal(result, expected) - - def test_series_agg_multikey(self): - ts = tm.makeTimeSeries() - grouped = ts.groupby([lambda x: x.year, lambda x: x.month]) - - result = grouped.agg(np.sum) - expected = grouped.sum() - assert_series_equal(result, expected) - - def test_series_agg_multi_pure_python(self): - data = DataFrame( - {'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', - 'foo', 'foo', 'foo'], - 'B': ['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', - 'two', 'two', 'one'], - 'C': ['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', - 'dull', 'shiny', 'shiny', 'shiny'], - 'D': np.random.randn(11), - 'E': np.random.randn(11), - 'F': np.random.randn(11)}) - - def bad(x): - assert (len(x.base) > 0) - return 'foo' - - result = data.groupby(['A', 'B']).agg(bad) - expected = data.groupby(['A', 'B']).agg(lambda x: 'foo') - assert_frame_equal(result, expected) - - def test_series_index_name(self): - grouped = self.df.ix[:, ['C']].groupby(self.df['A']) - result = grouped.agg(lambda x: x.mean()) - self.assertEqual(result.index.name, 'A') - - def test_frame_describe_multikey(self): - 
grouped = self.tsframe.groupby([lambda x: x.year, lambda x: x.month]) - result = grouped.describe() - - for col in self.tsframe: - expected = grouped[col].describe() - assert_series_equal(result[col], expected, check_names=False) - - groupedT = self.tsframe.groupby({'A': 0, 'B': 0, - 'C': 1, 'D': 1}, axis=1) - result = groupedT.describe() - - for name, group in groupedT: - assert_frame_equal(result[name], group.describe()) - - def test_frame_groupby(self): - grouped = self.tsframe.groupby(lambda x: x.weekday()) - - # aggregate - aggregated = grouped.aggregate(np.mean) - self.assertEqual(len(aggregated), 5) - self.assertEqual(len(aggregated.columns), 4) - - # by string - tscopy = self.tsframe.copy() - tscopy['weekday'] = [x.weekday() for x in tscopy.index] - stragged = tscopy.groupby('weekday').aggregate(np.mean) - assert_frame_equal(stragged, aggregated, check_names=False) - - # transform - grouped = self.tsframe.head(30).groupby(lambda x: x.weekday()) - transformed = grouped.transform(lambda x: x - x.mean()) - self.assertEqual(len(transformed), 30) - self.assertEqual(len(transformed.columns), 4) - - # transform propagate - transformed = grouped.transform(lambda x: x.mean()) - for name, group in grouped: - mean = group.mean() - for idx in group.index: - tm.assert_series_equal(transformed.xs(idx), mean, - check_names=False) - - # iterate - for weekday, group in grouped: - self.assertEqual(group.index[0].weekday(), weekday) - - # groups / group_indices - groups = grouped.groups - indices = grouped.indices - - for k, v in compat.iteritems(groups): - samething = self.tsframe.index.take(indices[k]) - self.assertTrue((samething == v).all()) - - def test_grouping_is_iterable(self): - # this code path isn't used anywhere else - # not sure it's useful - grouped = self.tsframe.groupby([lambda x: x.weekday(), lambda x: x.year - ]) - - # test it works - for g in grouped.grouper.groupings[0]: - pass - - def test_frame_groupby_columns(self): - mapping = {'A': 0, 'B': 0, 'C': 1, 'D': 1} - grouped = self.tsframe.groupby(mapping, axis=1) - - # aggregate - aggregated = grouped.aggregate(np.mean) - self.assertEqual(len(aggregated), len(self.tsframe)) - self.assertEqual(len(aggregated.columns), 2) - - # transform - tf = lambda x: x - x.mean() - groupedT = self.tsframe.T.groupby(mapping, axis=0) - assert_frame_equal(groupedT.transform(tf).T, grouped.transform(tf)) - - # iterate - for k, v in grouped: - self.assertEqual(len(v.columns), 2) - - def test_frame_set_name_single(self): - grouped = self.df.groupby('A') - - result = grouped.mean() - self.assertEqual(result.index.name, 'A') - - result = self.df.groupby('A', as_index=False).mean() - self.assertNotEqual(result.index.name, 'A') - - result = grouped.agg(np.mean) - self.assertEqual(result.index.name, 'A') - - result = grouped.agg({'C': np.mean, 'D': np.std}) - self.assertEqual(result.index.name, 'A') - - result = grouped['C'].mean() - self.assertEqual(result.index.name, 'A') - result = grouped['C'].agg(np.mean) - self.assertEqual(result.index.name, 'A') - result = grouped['C'].agg([np.mean, np.std]) - self.assertEqual(result.index.name, 'A') - - result = grouped['C'].agg({'foo': np.mean, 'bar': np.std}) - self.assertEqual(result.index.name, 'A') - - def test_aggregate_api_consistency(self): - # GH 9052 - # make sure that the aggregates via dict - # are consistent - - df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar', - 'foo', 'bar', 'foo', 'foo'], - 'B': ['one', 'one', 'two', 'two', - 'two', 'two', 'one', 'two'], - 'C': np.random.randn(8) + 1.0, - 'D': 
np.arange(8)})
-
-        grouped = df.groupby(['A', 'B'])
-        c_mean = grouped['C'].mean()
-        c_sum = grouped['C'].sum()
-        d_mean = grouped['D'].mean()
-        d_sum = grouped['D'].sum()
-
-        result = grouped['D'].agg(['sum', 'mean'])
-        expected = pd.concat([d_sum, d_mean],
-                             axis=1)
-        expected.columns = ['sum', 'mean']
-        assert_frame_equal(result, expected, check_like=True)
-
-        result = grouped.agg([np.sum, np.mean])
-        expected = pd.concat([c_sum,
-                              c_mean,
-                              d_sum,
-                              d_mean],
-                             axis=1)
-        expected.columns = MultiIndex.from_product([['C', 'D'],
-                                                    ['sum', 'mean']])
-        assert_frame_equal(result, expected, check_like=True)
-
-        result = grouped[['D', 'C']].agg([np.sum, np.mean])
-        expected = pd.concat([d_sum,
-                              d_mean,
-                              c_sum,
-                              c_mean],
-                             axis=1)
-        expected.columns = MultiIndex.from_product([['D', 'C'],
-                                                    ['sum', 'mean']])
-        assert_frame_equal(result, expected, check_like=True)
-
-        result = grouped.agg({'C': 'mean', 'D': 'sum'})
-        expected = pd.concat([d_sum,
-                              c_mean],
-                             axis=1)
-        assert_frame_equal(result, expected, check_like=True)
-
-        result = grouped.agg({'C': ['mean', 'sum'],
-                              'D': ['mean', 'sum']})
-        expected = pd.concat([c_mean,
-                              c_sum,
-                              d_mean,
-                              d_sum],
-                             axis=1)
-        expected.columns = MultiIndex.from_product([['C', 'D'],
-                                                    ['mean', 'sum']])
-        assert_frame_equal(result, expected, check_like=True)
-
-        result = grouped[['D', 'C']].agg({'r': np.sum,
-                                          'r2': np.mean})
-        expected = pd.concat([d_sum,
-                              c_sum,
-                              d_mean,
-                              c_mean],
-                             axis=1)
-        expected.columns = MultiIndex.from_product([['r', 'r2'],
-                                                    ['D', 'C']])
-        assert_frame_equal(result, expected, check_like=True)
-
-    def test_agg_compat(self):
-
-        # GH 12334
-
-        df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
-                              'foo', 'bar', 'foo', 'foo'],
-                        'B': ['one', 'one', 'two', 'two',
-                              'two', 'two', 'one', 'two'],
-                        'C': np.random.randn(8) + 1.0,
-                        'D': np.arange(8)})
-
-        g = df.groupby(['A', 'B'])
-
-        expected = pd.concat([g['D'].sum(),
-                              g['D'].std()],
-                             axis=1)
-        expected.columns = MultiIndex.from_tuples([('C', 'sum'),
-                                                   ('C', 'std')])
-        result = g['D'].agg({'C': ['sum', 'std']})
-        assert_frame_equal(result, expected, check_like=True)
-
-        expected = pd.concat([g['D'].sum(),
-                              g['D'].std()],
-                             axis=1)
-        expected.columns = ['C', 'D']
-        result = g['D'].agg({'C': 'sum', 'D': 'std'})
-        assert_frame_equal(result, expected, check_like=True)
-
-    def test_agg_nested_dicts(self):
-
-        # API change for disallowing these types of nested dicts
-        df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
-                              'foo', 'bar', 'foo', 'foo'],
-                        'B': ['one', 'one', 'two', 'two',
-                              'two', 'two', 'one', 'two'],
-                        'C': np.random.randn(8) + 1.0,
-                        'D': np.arange(8)})
-
-        g = df.groupby(['A', 'B'])
-
-        def f():
-            g.aggregate({'r1': {'C': ['mean', 'sum']},
-                         'r2': {'D': ['mean', 'sum']}})
-
-        self.assertRaises(SpecificationError, f)
-
-        result = g.agg({'C': {'ra': ['mean', 'std']},
-                        'D': {'rb': ['mean', 'std']}})
-        expected = pd.concat([g['C'].mean(), g['C'].std(), g['D'].mean(),
-                              g['D'].std()], axis=1)
-        expected.columns = pd.MultiIndex.from_tuples([('ra', 'mean'), (
-            'ra', 'std'), ('rb', 'mean'), ('rb', 'std')])
-        assert_frame_equal(result, expected, check_like=True)
-
-        # same name as the original column
-        # GH9052
-        expected = g['D'].agg({'result1': np.sum, 'result2': np.mean})
-        expected = expected.rename(columns={'result1': 'D'})
-        result = g['D'].agg({'D': np.sum, 'result2': np.mean})
-        assert_frame_equal(result, expected, check_like=True)
-
-    def test_multi_iter(self):
-        s = Series(np.arange(6))
-        k1 = np.array(['a', 'a', 'a', 'b', 'b', 'b'])
-        k2 = np.array(['1', '2', '1', '2', '1', '2'])
-
-        grouped = s.groupby([k1, k2])
-
-        iterated
= list(grouped) - expected = [('a', '1', s[[0, 2]]), ('a', '2', s[[1]]), - ('b', '1', s[[4]]), ('b', '2', s[[3, 5]])] - for i, ((one, two), three) in enumerate(iterated): - e1, e2, e3 = expected[i] - self.assertEqual(e1, one) - self.assertEqual(e2, two) - assert_series_equal(three, e3) - - def test_multi_iter_frame(self): - k1 = np.array(['b', 'b', 'b', 'a', 'a', 'a']) - k2 = np.array(['1', '2', '1', '2', '1', '2']) - df = DataFrame({'v1': np.random.randn(6), - 'v2': np.random.randn(6), - 'k1': k1, 'k2': k2}, - index=['one', 'two', 'three', 'four', 'five', 'six']) - - grouped = df.groupby(['k1', 'k2']) - - # things get sorted! - iterated = list(grouped) - idx = df.index - expected = [('a', '1', df.ix[idx[[4]]]), - ('a', '2', df.ix[idx[[3, 5]]]), - ('b', '1', df.ix[idx[[0, 2]]]), - ('b', '2', df.ix[idx[[1]]])] - for i, ((one, two), three) in enumerate(iterated): - e1, e2, e3 = expected[i] - self.assertEqual(e1, one) - self.assertEqual(e2, two) - assert_frame_equal(three, e3) - - # don't iterate through groups with no data - df['k1'] = np.array(['b', 'b', 'b', 'a', 'a', 'a']) - df['k2'] = np.array(['1', '1', '1', '2', '2', '2']) - grouped = df.groupby(['k1', 'k2']) - groups = {} - for key, gp in grouped: - groups[key] = gp - self.assertEqual(len(groups), 2) - - # axis = 1 - three_levels = self.three_group.groupby(['A', 'B', 'C']).mean() - grouped = three_levels.T.groupby(axis=1, level=(1, 2)) - for key, group in grouped: - pass - - def test_multi_iter_panel(self): - wp = tm.makePanel() - grouped = wp.groupby([lambda x: x.month, lambda x: x.weekday()], - axis=1) - - for (month, wd), group in grouped: - exp_axis = [x - for x in wp.major_axis - if x.month == month and x.weekday() == wd] - expected = wp.reindex(major=exp_axis) - assert_panel_equal(group, expected) - - def test_multi_func(self): - col1 = self.df['A'] - col2 = self.df['B'] - - grouped = self.df.groupby([col1.get, col2.get]) - agged = grouped.mean() - expected = self.df.groupby(['A', 'B']).mean() - assert_frame_equal(agged.ix[:, ['C', 'D']], expected.ix[:, ['C', 'D']], - check_names=False) # TODO groupby get drops names - - # some "groups" with no data - df = DataFrame({'v1': np.random.randn(6), - 'v2': np.random.randn(6), - 'k1': np.array(['b', 'b', 'b', 'a', 'a', 'a']), - 'k2': np.array(['1', '1', '1', '2', '2', '2'])}, - index=['one', 'two', 'three', 'four', 'five', 'six']) - # only verify that it works for now - grouped = df.groupby(['k1', 'k2']) - grouped.agg(np.sum) - - def test_multi_key_multiple_functions(self): - grouped = self.df.groupby(['A', 'B'])['C'] - - agged = grouped.agg([np.mean, np.std]) - expected = DataFrame({'mean': grouped.agg(np.mean), - 'std': grouped.agg(np.std)}) - assert_frame_equal(agged, expected) - - def test_frame_multi_key_function_list(self): - data = DataFrame( - {'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', - 'foo', 'foo', 'foo'], - 'B': ['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', - 'two', 'two', 'one'], - 'C': ['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', - 'dull', 'shiny', 'shiny', 'shiny'], - 'D': np.random.randn(11), - 'E': np.random.randn(11), - 'F': np.random.randn(11)}) - - grouped = data.groupby(['A', 'B']) - funcs = [np.mean, np.std] - agged = grouped.agg(funcs) - expected = concat([grouped['D'].agg(funcs), grouped['E'].agg(funcs), - grouped['F'].agg(funcs)], - keys=['D', 'E', 'F'], axis=1) - assert (isinstance(agged.index, MultiIndex)) - assert (isinstance(expected.index, MultiIndex)) - assert_frame_equal(agged, expected) - - def 
test_groupby_multiple_columns(self): - data = self.df - grouped = data.groupby(['A', 'B']) - - def _check_op(op): - - result1 = op(grouped) - - expected = defaultdict(dict) - for n1, gp1 in data.groupby('A'): - for n2, gp2 in gp1.groupby('B'): - expected[n1][n2] = op(gp2.ix[:, ['C', 'D']]) - expected = dict((k, DataFrame(v)) - for k, v in compat.iteritems(expected)) - expected = Panel.fromDict(expected).swapaxes(0, 1) - expected.major_axis.name, expected.minor_axis.name = 'A', 'B' - - # a little bit crude - for col in ['C', 'D']: - result_col = op(grouped[col]) - exp = expected[col] - pivoted = result1[col].unstack() - pivoted2 = result_col.unstack() - assert_frame_equal(pivoted.reindex_like(exp), exp) - assert_frame_equal(pivoted2.reindex_like(exp), exp) - - _check_op(lambda x: x.sum()) - _check_op(lambda x: x.mean()) - - # test single series works the same - result = data['C'].groupby([data['A'], data['B']]).mean() - expected = data.groupby(['A', 'B']).mean()['C'] - - assert_series_equal(result, expected) - - def test_groupby_as_index_agg(self): - grouped = self.df.groupby('A', as_index=False) - - # single-key - - result = grouped.agg(np.mean) - expected = grouped.mean() - assert_frame_equal(result, expected) - - result2 = grouped.agg(OrderedDict([['C', np.mean], ['D', np.sum]])) - expected2 = grouped.mean() - expected2['D'] = grouped.sum()['D'] - assert_frame_equal(result2, expected2) - - grouped = self.df.groupby('A', as_index=True) - expected3 = grouped['C'].sum() - expected3 = DataFrame(expected3).rename(columns={'C': 'Q'}) - result3 = grouped['C'].agg({'Q': np.sum}) - assert_frame_equal(result3, expected3) - - # multi-key - - grouped = self.df.groupby(['A', 'B'], as_index=False) - - result = grouped.agg(np.mean) - expected = grouped.mean() - assert_frame_equal(result, expected) - - result2 = grouped.agg(OrderedDict([['C', np.mean], ['D', np.sum]])) - expected2 = grouped.mean() - expected2['D'] = grouped.sum()['D'] - assert_frame_equal(result2, expected2) - - expected3 = grouped['C'].sum() - expected3 = DataFrame(expected3).rename(columns={'C': 'Q'}) - result3 = grouped['C'].agg({'Q': np.sum}) - assert_frame_equal(result3, expected3) - - # GH7115 & GH8112 & GH8582 - df = DataFrame(np.random.randint(0, 100, (50, 3)), - columns=['jim', 'joe', 'jolie']) - ts = Series(np.random.randint(5, 10, 50), name='jim') - - gr = df.groupby(ts) - gr.nth(0) # invokes set_selection_from_grouper internally - assert_frame_equal(gr.apply(sum), df.groupby(ts).apply(sum)) - - for attr in ['mean', 'max', 'count', 'idxmax', 'cumsum', 'all']: - gr = df.groupby(ts, as_index=False) - left = getattr(gr, attr)() - - gr = df.groupby(ts.values, as_index=True) - right = getattr(gr, attr)().reset_index(drop=True) - - assert_frame_equal(left, right) - - def test_series_groupby_nunique(self): - from itertools import product - from string import ascii_lowercase - - def check_nunique(df, keys): - for sort, dropna in product((False, True), repeat=2): - gr = df.groupby(keys, sort=sort) - left = gr['julie'].nunique(dropna=dropna) - - gr = df.groupby(keys, sort=sort) - right = gr['julie'].apply(Series.nunique, dropna=dropna) - - assert_series_equal(left, right) - - days = date_range('2015-08-23', periods=10) - - for n, m in product(10 ** np.arange(2, 6), (10, 100, 1000)): - frame = DataFrame({ - 'jim': np.random.choice( - list(ascii_lowercase), n), - 'joe': np.random.choice(days, n), - 'julie': np.random.randint(0, m, n) - }) - - check_nunique(frame, ['jim']) - check_nunique(frame, ['jim', 'joe']) - - frame.loc[1::17, 'jim'] 
= None - frame.loc[3::37, 'joe'] = None - frame.loc[7::19, 'julie'] = None - frame.loc[8::19, 'julie'] = None - frame.loc[9::19, 'julie'] = None - - check_nunique(frame, ['jim']) - check_nunique(frame, ['jim', 'joe']) - - def test_series_groupby_value_counts(self): - from itertools import product - - def rebuild_index(df): - arr = list(map(df.index.get_level_values, range(df.index.nlevels))) - df.index = MultiIndex.from_arrays(arr, names=df.index.names) - return df - - def check_value_counts(df, keys, bins): - for isort, normalize, sort, ascending, dropna \ - in product((False, True), repeat=5): - - kwargs = dict(normalize=normalize, sort=sort, - ascending=ascending, dropna=dropna, bins=bins) - - gr = df.groupby(keys, sort=isort) - left = gr['3rd'].value_counts(**kwargs) - - gr = df.groupby(keys, sort=isort) - right = gr['3rd'].apply(Series.value_counts, **kwargs) - right.index.names = right.index.names[:-1] + ['3rd'] - - # have to sort on index because of unstable sort on values - left, right = map(rebuild_index, (left, right)) # xref GH9212 - assert_series_equal(left.sort_index(), right.sort_index()) - - def loop(df): - bins = None, np.arange(0, max(5, df['3rd'].max()) + 1, 2) - keys = '1st', '2nd', ('1st', '2nd') - for k, b in product(keys, bins): - check_value_counts(df, k, b) - - days = date_range('2015-08-24', periods=10) - - for n, m in product((100, 1000), (5, 20)): - frame = DataFrame({ - '1st': np.random.choice( - list('abcd'), n), - '2nd': np.random.choice(days, n), - '3rd': np.random.randint(1, m + 1, n) - }) - - loop(frame) - - frame.loc[1::11, '1st'] = nan - frame.loc[3::17, '2nd'] = nan - frame.loc[7::19, '3rd'] = nan - frame.loc[8::19, '3rd'] = nan - frame.loc[9::19, '3rd'] = nan - - loop(frame) - - def test_multiindex_passthru(self): - - # GH 7997 - # regression from 0.14.1 - df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) - df.columns = pd.MultiIndex.from_tuples([(0, 1), (1, 1), (2, 1)]) - - result = df.groupby(axis=1, level=[0, 1]).first() - assert_frame_equal(result, df) - - def test_multiindex_negative_level(self): - # GH 13901 - result = self.mframe.groupby(level=-1).sum() - expected = self.mframe.groupby(level='second').sum() - assert_frame_equal(result, expected) - - result = self.mframe.groupby(level=-2).sum() - expected = self.mframe.groupby(level='first').sum() - assert_frame_equal(result, expected) - - result = self.mframe.groupby(level=[-2, -1]).sum() - expected = self.mframe - assert_frame_equal(result, expected) - - result = self.mframe.groupby(level=[-1, 'first']).sum() - expected = self.mframe.groupby(level=['second', 'first']).sum() - assert_frame_equal(result, expected) - - def test_multifunc_select_col_integer_cols(self): - df = self.df - df.columns = np.arange(len(df.columns)) - - # it works! 
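-        # grouping by the integer column label 1 and selecting the integer
-        # column 2 should behave the same way string labels would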
- df.groupby(1, as_index=False)[2].agg({'Q': np.mean}) - - def test_as_index_series_return_frame(self): - grouped = self.df.groupby('A', as_index=False) - grouped2 = self.df.groupby(['A', 'B'], as_index=False) - - result = grouped['C'].agg(np.sum) - expected = grouped.agg(np.sum).ix[:, ['A', 'C']] - tm.assertIsInstance(result, DataFrame) - assert_frame_equal(result, expected) - - result2 = grouped2['C'].agg(np.sum) - expected2 = grouped2.agg(np.sum).ix[:, ['A', 'B', 'C']] - tm.assertIsInstance(result2, DataFrame) - assert_frame_equal(result2, expected2) - - result = grouped['C'].sum() - expected = grouped.sum().ix[:, ['A', 'C']] - tm.assertIsInstance(result, DataFrame) - assert_frame_equal(result, expected) - - result2 = grouped2['C'].sum() - expected2 = grouped2.sum().ix[:, ['A', 'B', 'C']] - tm.assertIsInstance(result2, DataFrame) - assert_frame_equal(result2, expected2) - - # corner case - self.assertRaises(Exception, grouped['C'].__getitem__, 'D') - - def test_groupby_as_index_cython(self): - data = self.df - - # single-key - grouped = data.groupby('A', as_index=False) - result = grouped.mean() - expected = data.groupby(['A']).mean() - expected.insert(0, 'A', expected.index) - expected.index = np.arange(len(expected)) - assert_frame_equal(result, expected) - - # multi-key - grouped = data.groupby(['A', 'B'], as_index=False) - result = grouped.mean() - expected = data.groupby(['A', 'B']).mean() - - arrays = lzip(*expected.index._tuple_index) - expected.insert(0, 'A', arrays[0]) - expected.insert(1, 'B', arrays[1]) - expected.index = np.arange(len(expected)) - assert_frame_equal(result, expected) - - def test_groupby_as_index_series_scalar(self): - grouped = self.df.groupby(['A', 'B'], as_index=False) - - # GH #421 - - result = grouped['C'].agg(len) - expected = grouped.agg(len).ix[:, ['A', 'B', 'C']] - assert_frame_equal(result, expected) - - def test_groupby_as_index_corner(self): - self.assertRaises(TypeError, self.ts.groupby, lambda x: x.weekday(), - as_index=False) - - self.assertRaises(ValueError, self.df.groupby, lambda x: x.lower(), - as_index=False, axis=1) - - def test_groupby_as_index_apply(self): - # GH #4648 and #3417 - df = DataFrame({'item_id': ['b', 'b', 'a', 'c', 'a', 'b'], - 'user_id': [1, 2, 1, 1, 3, 1], - 'time': range(6)}) - - g_as = df.groupby('user_id', as_index=True) - g_not_as = df.groupby('user_id', as_index=False) - - res_as = g_as.head(2).index - res_not_as = g_not_as.head(2).index - exp = Index([0, 1, 2, 4]) - assert_index_equal(res_as, exp) - assert_index_equal(res_not_as, exp) - - res_as_apply = g_as.apply(lambda x: x.head(2)).index - res_not_as_apply = g_not_as.apply(lambda x: x.head(2)).index - - # apply doesn't maintain the original ordering - # changed in GH5610 as the as_index=False returns a MI here - exp_not_as_apply = MultiIndex.from_tuples([(0, 0), (0, 2), (1, 1), ( - 2, 4)]) - tp = [(1, 0), (1, 2), (2, 1), (3, 4)] - exp_as_apply = MultiIndex.from_tuples(tp, names=['user_id', None]) - - assert_index_equal(res_as_apply, exp_as_apply) - assert_index_equal(res_not_as_apply, exp_not_as_apply) - - ind = Index(list('abcde')) - df = DataFrame([[1, 2], [2, 3], [1, 4], [1, 5], [2, 6]], index=ind) - res = df.groupby(0, as_index=False).apply(lambda x: x).index - assert_index_equal(res, ind) - - def test_groupby_head_tail(self): - df = DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B']) - g_as = df.groupby('A', as_index=True) - g_not_as = df.groupby('A', as_index=False) - - # as_index= False, much easier - assert_frame_equal(df.loc[[0, 2]], 
g_not_as.head(1)) - assert_frame_equal(df.loc[[1, 2]], g_not_as.tail(1)) - - empty_not_as = DataFrame(columns=df.columns, - index=pd.Index([], dtype=df.index.dtype)) - empty_not_as['A'] = empty_not_as['A'].astype(df.A.dtype) - empty_not_as['B'] = empty_not_as['B'].astype(df.B.dtype) - assert_frame_equal(empty_not_as, g_not_as.head(0)) - assert_frame_equal(empty_not_as, g_not_as.tail(0)) - assert_frame_equal(empty_not_as, g_not_as.head(-1)) - assert_frame_equal(empty_not_as, g_not_as.tail(-1)) - - assert_frame_equal(df, g_not_as.head(7)) # contains all - assert_frame_equal(df, g_not_as.tail(7)) - - # as_index=True, (used to be different) - df_as = df - - assert_frame_equal(df_as.loc[[0, 2]], g_as.head(1)) - assert_frame_equal(df_as.loc[[1, 2]], g_as.tail(1)) - - empty_as = DataFrame(index=df_as.index[:0], columns=df.columns) - empty_as['A'] = empty_not_as['A'].astype(df.A.dtype) - empty_as['B'] = empty_not_as['B'].astype(df.B.dtype) - assert_frame_equal(empty_as, g_as.head(0)) - assert_frame_equal(empty_as, g_as.tail(0)) - assert_frame_equal(empty_as, g_as.head(-1)) - assert_frame_equal(empty_as, g_as.tail(-1)) - - assert_frame_equal(df_as, g_as.head(7)) # contains all - assert_frame_equal(df_as, g_as.tail(7)) - - # test with selection - assert_frame_equal(g_as[[]].head(1), df_as.loc[[0, 2], []]) - assert_frame_equal(g_as[['A']].head(1), df_as.loc[[0, 2], ['A']]) - assert_frame_equal(g_as[['B']].head(1), df_as.loc[[0, 2], ['B']]) - assert_frame_equal(g_as[['A', 'B']].head(1), df_as.loc[[0, 2]]) - - assert_frame_equal(g_not_as[[]].head(1), df_as.loc[[0, 2], []]) - assert_frame_equal(g_not_as[['A']].head(1), df_as.loc[[0, 2], ['A']]) - assert_frame_equal(g_not_as[['B']].head(1), df_as.loc[[0, 2], ['B']]) - assert_frame_equal(g_not_as[['A', 'B']].head(1), df_as.loc[[0, 2]]) - - def test_groupby_multiple_key(self): - df = tm.makeTimeDataFrame() - grouped = df.groupby([lambda x: x.year, lambda x: x.month, - lambda x: x.day]) - agged = grouped.sum() - assert_almost_equal(df.values, agged.values) - - grouped = df.T.groupby([lambda x: x.year, - lambda x: x.month, - lambda x: x.day], axis=1) - - agged = grouped.agg(lambda x: x.sum()) - self.assert_index_equal(agged.index, df.columns) - assert_almost_equal(df.T.values, agged.values) - - agged = grouped.agg(lambda x: x.sum()) - assert_almost_equal(df.T.values, agged.values) - - def test_groupby_multi_corner(self): - # test that having an all-NA column doesn't mess you up - df = self.df.copy() - df['bad'] = np.nan - agged = df.groupby(['A', 'B']).mean() - - expected = self.df.groupby(['A', 'B']).mean() - expected['bad'] = np.nan - - assert_frame_equal(agged, expected) - - def test_omit_nuisance(self): - grouped = self.df.groupby('A') - - result = grouped.mean() - expected = self.df.ix[:, ['A', 'C', 'D']].groupby('A').mean() - assert_frame_equal(result, expected) - - agged = grouped.agg(np.mean) - exp = grouped.mean() - assert_frame_equal(agged, exp) - - df = self.df.ix[:, ['A', 'C', 'D']] - df['E'] = datetime.now() - grouped = df.groupby('A') - result = grouped.agg(np.sum) - expected = grouped.sum() - assert_frame_equal(result, expected) - - # won't work with axis = 1 - grouped = df.groupby({'A': 0, 'C': 0, 'D': 1, 'E': 1}, axis=1) - result = self.assertRaises(TypeError, grouped.agg, - lambda x: x.sum(0, numeric_only=False)) - - def test_omit_nuisance_python_multiple(self): - grouped = self.three_group.groupby(['A', 'B']) - - agged = grouped.agg(np.mean) - exp = grouped.mean() - assert_frame_equal(agged, exp) - - def test_empty_groups_corner(self): - # 
handle empty groups - df = DataFrame({'k1': np.array(['b', 'b', 'b', 'a', 'a', 'a']), - 'k2': np.array(['1', '1', '1', '2', '2', '2']), - 'k3': ['foo', 'bar'] * 3, - 'v1': np.random.randn(6), - 'v2': np.random.randn(6)}) - - grouped = df.groupby(['k1', 'k2']) - result = grouped.agg(np.mean) - expected = grouped.mean() - assert_frame_equal(result, expected) - - grouped = self.mframe[3:5].groupby(level=0) - agged = grouped.apply(lambda x: x.mean()) - agged_A = grouped['A'].apply(np.mean) - assert_series_equal(agged['A'], agged_A) - self.assertEqual(agged.index.name, 'first') - - def test_apply_concat_preserve_names(self): - grouped = self.three_group.groupby(['A', 'B']) - - def desc(group): - result = group.describe() - result.index.name = 'stat' - return result - - def desc2(group): - result = group.describe() - result.index.name = 'stat' - result = result[:len(group)] - # weirdo - return result - - def desc3(group): - result = group.describe() - - # names are different - result.index.name = 'stat_%d' % len(group) - - result = result[:len(group)] - # weirdo - return result - - result = grouped.apply(desc) - self.assertEqual(result.index.names, ('A', 'B', 'stat')) - - result2 = grouped.apply(desc2) - self.assertEqual(result2.index.names, ('A', 'B', 'stat')) - - result3 = grouped.apply(desc3) - self.assertEqual(result3.index.names, ('A', 'B', None)) - - def test_nonsense_func(self): - df = DataFrame([0]) - self.assertRaises(Exception, df.groupby, lambda x: x + 'foo') - - def test_builtins_apply(self): # GH8155 - df = pd.DataFrame(np.random.randint(1, 50, (1000, 2)), - columns=['jim', 'joe']) - df['jolie'] = np.random.randn(1000) - - for keys in ['jim', ['jim', 'joe']]: # single key & multi-key - if keys == 'jim': - continue - for f in [max, min, sum]: - fname = f.__name__ - result = df.groupby(keys).apply(f) - result.shape - ngroups = len(df.drop_duplicates(subset=keys)) - assert result.shape == (ngroups, 3), 'invalid frame shape: '\ - '{} (expected ({}, 3))'.format(result.shape, ngroups) - - assert_frame_equal(result, # numpy's equivalent function - df.groupby(keys).apply(getattr(np, fname))) - - if f != sum: - expected = df.groupby(keys).agg(fname).reset_index() - expected.set_index(keys, inplace=True, drop=False) - assert_frame_equal(result, expected, check_dtype=False) - - assert_series_equal(getattr(result, fname)(), - getattr(df, fname)()) - - def test_cythonized_aggers(self): - data = {'A': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1., nan, nan], - 'B': ['A', 'B'] * 6, - 'C': np.random.randn(12)} - df = DataFrame(data) - df.loc[2:10:2, 'C'] = nan - - def _testit(name): - - op = lambda x: getattr(x, name)() - - # single column - grouped = df.drop(['B'], axis=1).groupby('A') - exp = {} - for cat, group in grouped: - exp[cat] = op(group['C']) - exp = DataFrame({'C': exp}) - exp.index.name = 'A' - result = op(grouped) - assert_frame_equal(result, exp) - - # multiple columns - grouped = df.groupby(['A', 'B']) - expd = {} - for (cat1, cat2), group in grouped: - expd.setdefault(cat1, {})[cat2] = op(group['C']) - exp = DataFrame(expd).T.stack(dropna=False) - exp.index.names = ['A', 'B'] - exp.name = 'C' - - result = op(grouped)['C'] - if not tm._incompat_bottleneck_version(name): - assert_series_equal(result, exp) - - _testit('count') - _testit('sum') - _testit('std') - _testit('var') - _testit('sem') - _testit('mean') - _testit('median') - _testit('prod') - _testit('min') - _testit('max') - - def test_max_min_non_numeric(self): - # #2700 - aa = DataFrame({'nn': [11, 11, 22, 22], - 'ii': [1, 2, 3, 4], - 
'ss': 4 * ['mama']}) - - result = aa.groupby('nn').max() - self.assertTrue('ss' in result) - - result = aa.groupby('nn').min() - self.assertTrue('ss' in result) - - def test_cython_agg_boolean(self): - frame = DataFrame({'a': np.random.randint(0, 5, 50), - 'b': np.random.randint(0, 2, 50).astype('bool')}) - result = frame.groupby('a')['b'].mean() - expected = frame.groupby('a')['b'].agg(np.mean) - - assert_series_equal(result, expected) - - def test_cython_agg_nothing_to_agg(self): - frame = DataFrame({'a': np.random.randint(0, 5, 50), - 'b': ['foo', 'bar'] * 25}) - self.assertRaises(DataError, frame.groupby('a')['b'].mean) - - frame = DataFrame({'a': np.random.randint(0, 5, 50), - 'b': ['foo', 'bar'] * 25}) - self.assertRaises(DataError, frame[['b']].groupby(frame['a']).mean) - - def test_cython_agg_nothing_to_agg_with_dates(self): - frame = DataFrame({'a': np.random.randint(0, 5, 50), - 'b': ['foo', 'bar'] * 25, - 'dates': pd.date_range('now', periods=50, - freq='T')}) - with tm.assertRaisesRegexp(DataError, "No numeric types to aggregate"): - frame.groupby('b').dates.mean() - - def test_groupby_timedelta_cython_count(self): - df = DataFrame({'g': list('ab' * 2), - 'delt': np.arange(4).astype('timedelta64[ns]')}) - expected = Series([ - 2, 2 - ], index=pd.Index(['a', 'b'], name='g'), name='delt') - result = df.groupby('g').delt.count() - tm.assert_series_equal(expected, result) - - def test_cython_agg_frame_columns(self): - # #2113 - df = DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]}) - - df.groupby(level=0, axis='columns').mean() - df.groupby(level=0, axis='columns').mean() - df.groupby(level=0, axis='columns').mean() - df.groupby(level=0, axis='columns').mean() - - def test_wrap_aggregated_output_multindex(self): - df = self.mframe.T - df['baz', 'two'] = 'peekaboo' - - keys = [np.array([0, 0, 1]), np.array([0, 0, 1])] - agged = df.groupby(keys).agg(np.mean) - tm.assertIsInstance(agged.columns, MultiIndex) - - def aggfun(ser): - if ser.name == ('foo', 'one'): - raise TypeError - else: - return ser.sum() - - agged2 = df.groupby(keys).aggregate(aggfun) - self.assertEqual(len(agged2.columns) + 1, len(df.columns)) - - def test_groupby_level(self): - frame = self.mframe - deleveled = frame.reset_index() - - result0 = frame.groupby(level=0).sum() - result1 = frame.groupby(level=1).sum() - - expected0 = frame.groupby(deleveled['first'].values).sum() - expected1 = frame.groupby(deleveled['second'].values).sum() - - expected0 = expected0.reindex(frame.index.levels[0]) - expected1 = expected1.reindex(frame.index.levels[1]) - - self.assertEqual(result0.index.name, 'first') - self.assertEqual(result1.index.name, 'second') - - assert_frame_equal(result0, expected0) - assert_frame_equal(result1, expected1) - self.assertEqual(result0.index.name, frame.index.names[0]) - self.assertEqual(result1.index.name, frame.index.names[1]) - - # groupby level name - result0 = frame.groupby(level='first').sum() - result1 = frame.groupby(level='second').sum() - assert_frame_equal(result0, expected0) - assert_frame_equal(result1, expected1) - - # axis=1 - - result0 = frame.T.groupby(level=0, axis=1).sum() - result1 = frame.T.groupby(level=1, axis=1).sum() - assert_frame_equal(result0, expected0.T) - assert_frame_equal(result1, expected1.T) - - # raise exception for non-MultiIndex - self.assertRaises(ValueError, self.df.groupby, level=1) - - def test_groupby_level_index_names(self): - # GH4014 this used to raise ValueError since 'exp'>1 (in py2) - df = DataFrame({'exp': ['A'] * 3 + ['B'] * 3, - 'var1': lrange(6), 
}).set_index('exp') - df.groupby(level='exp') - self.assertRaises(ValueError, df.groupby, level='foo') - - def test_groupby_level_with_nas(self): - index = MultiIndex(levels=[[1, 0], [0, 1, 2, 3]], - labels=[[1, 1, 1, 1, 0, 0, 0, 0], [0, 1, 2, 3, 0, 1, - 2, 3]]) - - # factorizing doesn't confuse things - s = Series(np.arange(8.), index=index) - result = s.groupby(level=0).sum() - expected = Series([22., 6.], index=[1, 0]) - assert_series_equal(result, expected) - - index = MultiIndex(levels=[[1, 0], [0, 1, 2, 3]], - labels=[[1, 1, 1, 1, -1, 0, 0, 0], [0, 1, 2, 3, 0, - 1, 2, 3]]) - - # factorizing doesn't confuse things - s = Series(np.arange(8.), index=index) - result = s.groupby(level=0).sum() - expected = Series([18., 6.], index=[1, 0]) - assert_series_equal(result, expected) - - def test_groupby_level_apply(self): - frame = self.mframe - - result = frame.groupby(level=0).count() - self.assertEqual(result.index.name, 'first') - result = frame.groupby(level=1).count() - self.assertEqual(result.index.name, 'second') - - result = frame['A'].groupby(level=0).count() - self.assertEqual(result.index.name, 'first') - - def test_groupby_args(self): - # PR8618 and issue 8015 - frame = self.mframe - - def j(): - frame.groupby() - - self.assertRaisesRegexp(TypeError, - "You have to supply one of 'by' and 'level'", - j) - - def k(): - frame.groupby(by=None, level=None) - - self.assertRaisesRegexp(TypeError, - "You have to supply one of 'by' and 'level'", - k) - - def test_groupby_level_mapper(self): - frame = self.mframe - deleveled = frame.reset_index() - - mapper0 = {'foo': 0, 'bar': 0, 'baz': 1, 'qux': 1} - mapper1 = {'one': 0, 'two': 0, 'three': 1} - - result0 = frame.groupby(mapper0, level=0).sum() - result1 = frame.groupby(mapper1, level=1).sum() - - mapped_level0 = np.array([mapper0.get(x) for x in deleveled['first']]) - mapped_level1 = np.array([mapper1.get(x) for x in deleveled['second']]) - expected0 = frame.groupby(mapped_level0).sum() - expected1 = frame.groupby(mapped_level1).sum() - expected0.index.name, expected1.index.name = 'first', 'second' - - assert_frame_equal(result0, expected0) - assert_frame_equal(result1, expected1) - - def test_groupby_level_nonmulti(self): - # GH 1313, GH 13901 - s = Series([1, 2, 3, 10, 4, 5, 20, 6], - Index([1, 2, 3, 1, 4, 5, 2, 6], name='foo')) - expected = Series([11, 22, 3, 4, 5, 6], - Index(range(1, 7), name='foo')) - - result = s.groupby(level=0).sum() - self.assert_series_equal(result, expected) - result = s.groupby(level=[0]).sum() - self.assert_series_equal(result, expected) - result = s.groupby(level=-1).sum() - self.assert_series_equal(result, expected) - result = s.groupby(level=[-1]).sum() - self.assert_series_equal(result, expected) - - tm.assertRaises(ValueError, s.groupby, level=1) - tm.assertRaises(ValueError, s.groupby, level=-2) - tm.assertRaises(ValueError, s.groupby, level=[]) - tm.assertRaises(ValueError, s.groupby, level=[0, 0]) - tm.assertRaises(ValueError, s.groupby, level=[0, 1]) - tm.assertRaises(ValueError, s.groupby, level=[1]) - - def test_groupby_complex(self): - # GH 12902 - a = Series(data=np.arange(4) * (1 + 2j), index=[0, 0, 1, 1]) - expected = Series((1 + 2j, 5 + 10j)) - - result = a.groupby(level=0).sum() - assert_series_equal(result, expected) - - result = a.sum(level=0) - assert_series_equal(result, expected) - - def test_level_preserve_order(self): - grouped = self.mframe.groupby(level=0) - exp_labels = np.array([0, 0, 0, 1, 1, 2, 2, 3, 3, 3], np.intp) - assert_almost_equal(grouped.grouper.labels[0], exp_labels) - - 
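-    # NOTE (editor's sketch, not part of the original tests): a minimal
-    # illustration of the level/labels relationship exercised above,
-    # assuming a two-level MultiIndex like self.mframe's with names
-    # 'first' and 'second':
-    #
-    #   >>> import numpy as np
-    #   >>> import pandas as pd
-    #   >>> idx = pd.MultiIndex.from_arrays([list('aabb'), [1, 2, 1, 2]],
-    #   ...                                 names=['first', 'second'])
-    #   >>> s = pd.Series(np.arange(4.), index=idx)
-    #   >>> s.groupby(level=0).sum()
-    #   first
-    #   a    1.0
-    #   b    5.0
-    #   dtype: float64
-    #
-    # grouper.labels[0] holds the integer code of each row's level-0 group,
-    # which is what the two label assertions above and below check.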
def test_grouping_labels(self):
-        grouped = self.mframe.groupby(self.mframe.index.get_level_values(0))
-        exp_labels = np.array([2, 2, 2, 0, 0, 1, 1, 3, 3, 3], dtype=np.intp)
-        assert_almost_equal(grouped.grouper.labels[0], exp_labels)
-
-    def test_cython_fail_agg(self):
-        dr = bdate_range('1/1/2000', periods=50)
-        ts = Series(['A', 'B', 'C', 'D', 'E'] * 10, index=dr)
-
-        grouped = ts.groupby(lambda x: x.month)
-        summed = grouped.sum()
-        expected = grouped.agg(np.sum)
-        assert_series_equal(summed, expected)
-
-    def test_apply_series_to_frame(self):
-        def f(piece):
-            with np.errstate(invalid='ignore'):
-                logged = np.log(piece)
-            return DataFrame({'value': piece,
-                              'demeaned': piece - piece.mean(),
-                              'logged': logged})
-
-        dr = bdate_range('1/1/2000', periods=100)
-        ts = Series(np.random.randn(100), index=dr)
-
-        grouped = ts.groupby(lambda x: x.month)
-        result = grouped.apply(f)
-
-        tm.assertIsInstance(result, DataFrame)
-        self.assert_index_equal(result.index, ts.index)
-
-    def test_apply_series_yield_constant(self):
-        result = self.df.groupby(['A', 'B'])['C'].apply(len)
-        self.assertEqual(result.index.names[:2], ('A', 'B'))
-
-    def test_apply_frame_yield_constant(self):
-        # GH13568
-        result = self.df.groupby(['A', 'B']).apply(len)
-        self.assertTrue(isinstance(result, Series))
-        self.assertIsNone(result.name)
-
-        result = self.df.groupby(['A', 'B'])[['C', 'D']].apply(len)
-        self.assertTrue(isinstance(result, Series))
-        self.assertIsNone(result.name)
-
-    def test_apply_frame_to_series(self):
-        grouped = self.df.groupby(['A', 'B'])
-        result = grouped.apply(len)
-        expected = grouped.count()['C']
-        self.assert_index_equal(result.index, expected.index)
-        self.assert_numpy_array_equal(result.values, expected.values)
-
-    def test_apply_frame_concat_series(self):
-        def trans(group):
-            return group.groupby('B')['C'].sum().sort_values()[:2]
-
-        def trans2(group):
-            grouped = group.groupby(df.reindex(group.index)['B'])
-            return grouped.sum().sort_values()[:2]
-
-        df = DataFrame({'A': np.random.randint(0, 5, 1000),
-                        'B': np.random.randint(0, 5, 1000),
-                        'C': np.random.randn(1000)})
-
-        result = df.groupby('A').apply(trans)
-        exp = df.groupby('A')['C'].apply(trans2)
-        assert_series_equal(result, exp, check_names=False)
-        self.assertEqual(result.name, 'C')
-
-    def test_apply_transform(self):
-        grouped = self.ts.groupby(lambda x: x.month)
-        result = grouped.apply(lambda x: x * 2)
-        expected = grouped.transform(lambda x: x * 2)
-        assert_series_equal(result, expected)
-
-    def test_apply_multikey_corner(self):
-        grouped = self.tsframe.groupby([lambda x: x.year, lambda x: x.month])
-
-        def f(group):
-            return group.sort_values('A')[-5:]
-
-        result = grouped.apply(f)
-        for key, group in grouped:
-            assert_frame_equal(result.ix[key], f(group))
-
-    def test_mutate_groups(self):
-
-        # GH3380
-
-        mydf = DataFrame({
-            'cat1': ['a'] * 8 + ['b'] * 6,
-            'cat2': ['c'] * 2 + ['d'] * 2 + ['e'] * 2 + ['f'] * 2 + ['c'] * 2 +
-            ['d'] * 2 + ['e'] * 2,
-            'cat3': lmap(lambda x: 'g%s' % x, lrange(1, 15)),
-            'val': np.random.randint(100, size=14),
-        })
-
-        def f_copy(x):
-            x = x.copy()
-            x['rank'] = x.val.rank(method='min')
-            return x.groupby('cat2')['rank'].min()
-
-        def f_no_copy(x):
-            x['rank'] = x.val.rank(method='min')
-            return x.groupby('cat2')['rank'].min()
-
-        grpby_copy = mydf.groupby('cat1').apply(f_copy)
-        grpby_no_copy = mydf.groupby('cat1').apply(f_no_copy)
-        assert_series_equal(grpby_copy, grpby_no_copy)
-
-    def test_no_mutate_but_looks_like(self):
-
-        # GH 8467
-        # the first shows the mutation indicator
-        # the second does
not, but should yield the same results - df = DataFrame({'key': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'value': range(9)}) - - result1 = df.groupby('key', group_keys=True).apply(lambda x: x[:].key) - result2 = df.groupby('key', group_keys=True).apply(lambda x: x.key) - assert_series_equal(result1, result2) - - def test_apply_chunk_view(self): - # Low level tinkering could be unsafe, make sure not - df = DataFrame({'key': [1, 1, 1, 2, 2, 2, 3, 3, 3], - 'value': lrange(9)}) - - # return view - f = lambda x: x[:2] - - result = df.groupby('key', group_keys=False).apply(f) - expected = df.take([0, 1, 3, 4, 6, 7]) - assert_frame_equal(result, expected) - - def test_apply_no_name_column_conflict(self): - df = DataFrame({'name': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2], - 'name2': [0, 0, 0, 1, 1, 1, 0, 0, 1, 1], - 'value': lrange(10)[::-1]}) - - # it works! #2605 - grouped = df.groupby(['name', 'name2']) - grouped.apply(lambda x: x.sort_values('value', inplace=True)) - - def test_groupby_series_indexed_differently(self): - s1 = Series([5.0, -9.0, 4.0, 100., -5., 55., 6.7], - index=Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'])) - s2 = Series([1.0, 1.0, 4.0, 5.0, 5.0, 7.0], - index=Index(['a', 'b', 'd', 'f', 'g', 'h'])) - - grouped = s1.groupby(s2) - agged = grouped.mean() - exp = s1.groupby(s2.reindex(s1.index).get).mean() - assert_series_equal(agged, exp) - - def test_groupby_with_hier_columns(self): - tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', - 'qux'], ['one', 'two', 'one', 'two', 'one', 'two', - 'one', 'two']])) - index = MultiIndex.from_tuples(tuples) - columns = MultiIndex.from_tuples([('A', 'cat'), ('B', 'dog'), ( - 'B', 'cat'), ('A', 'dog')]) - df = DataFrame(np.random.randn(8, 4), index=index, columns=columns) - - result = df.groupby(level=0).mean() - self.assert_index_equal(result.columns, columns) - - result = df.groupby(level=0, axis=1).mean() - self.assert_index_equal(result.index, df.index) - - result = df.groupby(level=0).agg(np.mean) - self.assert_index_equal(result.columns, columns) - - result = df.groupby(level=0).apply(lambda x: x.mean()) - self.assert_index_equal(result.columns, columns) - - result = df.groupby(level=0, axis=1).agg(lambda x: x.mean(1)) - self.assert_index_equal(result.columns, Index(['A', 'B'])) - self.assert_index_equal(result.index, df.index) - - # add a nuisance column - sorted_columns, _ = columns.sortlevel(0) - df['A', 'foo'] = 'bar' - result = df.groupby(level=0).mean() - self.assert_index_equal(result.columns, df.columns[:-1]) - - def test_pass_args_kwargs(self): - from numpy import percentile - - def f(x, q=None, axis=0): - return percentile(x, q, axis=axis) - - g = lambda x: percentile(x, 80, axis=0) - - # Series - ts_grouped = self.ts.groupby(lambda x: x.month) - agg_result = ts_grouped.agg(percentile, 80, axis=0) - apply_result = ts_grouped.apply(percentile, 80, axis=0) - trans_result = ts_grouped.transform(percentile, 80, axis=0) - - agg_expected = ts_grouped.quantile(.8) - trans_expected = ts_grouped.transform(g) - - assert_series_equal(apply_result, agg_expected) - assert_series_equal(agg_result, agg_expected, check_names=False) - assert_series_equal(trans_result, trans_expected) - - agg_result = ts_grouped.agg(f, q=80) - apply_result = ts_grouped.apply(f, q=80) - trans_result = ts_grouped.transform(f, q=80) - assert_series_equal(agg_result, agg_expected) - assert_series_equal(apply_result, agg_expected) - assert_series_equal(trans_result, trans_expected) - - # DataFrame - df_grouped = self.tsframe.groupby(lambda x: x.month) - agg_result = 
df_grouped.agg(percentile, 80, axis=0) - apply_result = df_grouped.apply(DataFrame.quantile, .8) - expected = df_grouped.quantile(.8) - assert_frame_equal(apply_result, expected) - assert_frame_equal(agg_result, expected, check_names=False) - - agg_result = df_grouped.agg(f, q=80) - apply_result = df_grouped.apply(DataFrame.quantile, q=.8) - assert_frame_equal(agg_result, expected, check_names=False) - assert_frame_equal(apply_result, expected) - - def test_size(self): - grouped = self.df.groupby(['A', 'B']) - result = grouped.size() - for key, group in grouped: - self.assertEqual(result[key], len(group)) - - grouped = self.df.groupby('A') - result = grouped.size() - for key, group in grouped: - self.assertEqual(result[key], len(group)) - - grouped = self.df.groupby('B') - result = grouped.size() - for key, group in grouped: - self.assertEqual(result[key], len(group)) - - df = DataFrame(np.random.choice(20, (1000, 3)), columns=list('abc')) - for sort, key in cart_product((False, True), ('a', 'b', ['a', 'b'])): - left = df.groupby(key, sort=sort).size() - right = df.groupby(key, sort=sort)['c'].apply(lambda a: a.shape[0]) - assert_series_equal(left, right, check_names=False) - - # GH11699 - df = DataFrame([], columns=['A', 'B']) - out = Series([], dtype='int64', index=Index([], name='A')) - assert_series_equal(df.groupby('A').size(), out) - - def test_count(self): - from string import ascii_lowercase - n = 1 << 15 - dr = date_range('2015-08-30', periods=n // 10, freq='T') - - df = DataFrame({ - '1st': np.random.choice( - list(ascii_lowercase), n), - '2nd': np.random.randint(0, 5, n), - '3rd': np.random.randn(n).round(3), - '4th': np.random.randint(-10, 10, n), - '5th': np.random.choice(dr, n), - '6th': np.random.randn(n).round(3), - '7th': np.random.randn(n).round(3), - '8th': np.random.choice(dr, n) - np.random.choice(dr, 1), - '9th': np.random.choice( - list(ascii_lowercase), n) - }) - - for col in df.columns.drop(['1st', '2nd', '4th']): - df.loc[np.random.choice(n, n // 10), col] = np.nan - - df['9th'] = df['9th'].astype('category') - - for key in '1st', '2nd', ['1st', '2nd']: - left = df.groupby(key).count() - right = df.groupby(key).apply(DataFrame.count).drop(key, axis=1) - assert_frame_equal(left, right) - - # GH5610 - # count counts non-nulls - df = pd.DataFrame([[1, 2, 'foo'], [1, nan, 'bar'], [3, nan, nan]], - columns=['A', 'B', 'C']) - - count_as = df.groupby('A').count() - count_not_as = df.groupby('A', as_index=False).count() - - expected = DataFrame([[1, 2], [0, 0]], columns=['B', 'C'], - index=[1, 3]) - expected.index.name = 'A' - assert_frame_equal(count_not_as, expected.reset_index()) - assert_frame_equal(count_as, expected) - - count_B = df.groupby('A')['B'].count() - assert_series_equal(count_B, expected['B']) - - def test_count_object(self): - df = pd.DataFrame({'a': ['a'] * 3 + ['b'] * 3, 'c': [2] * 3 + [3] * 3}) - result = df.groupby('c').a.count() - expected = pd.Series([ - 3, 3 - ], index=pd.Index([2, 3], name='c'), name='a') - tm.assert_series_equal(result, expected) - - df = pd.DataFrame({'a': ['a', np.nan, np.nan] + ['b'] * 3, - 'c': [2] * 3 + [3] * 3}) - result = df.groupby('c').a.count() - expected = pd.Series([ - 1, 3 - ], index=pd.Index([2, 3], name='c'), name='a') - tm.assert_series_equal(result, expected) - - def test_count_cross_type(self): # GH8169 - vals = np.hstack((np.random.randint(0, 5, (100, 2)), np.random.randint( - 0, 2, (100, 2)))) - - df = pd.DataFrame(vals, columns=['a', 'b', 'c', 'd']) - df[df == 2] = np.nan - expected = df.groupby(['c', 
'd']).count() - - for t in ['float32', 'object']: - df['a'] = df['a'].astype(t) - df['b'] = df['b'].astype(t) - result = df.groupby(['c', 'd']).count() - tm.assert_frame_equal(result, expected) - - def test_non_cython_api(self): - - # GH5610 - # non-cython calls should not include the grouper - - df = DataFrame( - [[1, 2, 'foo'], [1, - nan, - 'bar', ], [3, nan, 'baz'] - ], columns=['A', 'B', 'C']) - g = df.groupby('A') - gni = df.groupby('A', as_index=False) - - # mad - expected = DataFrame([[0], [nan]], columns=['B'], index=[1, 3]) - expected.index.name = 'A' - result = g.mad() - assert_frame_equal(result, expected) - - expected = DataFrame([[0., 0.], [0, nan]], columns=['A', 'B'], - index=[0, 1]) - result = gni.mad() - assert_frame_equal(result, expected) - - # describe - expected = DataFrame(dict(B=concat( - [df.loc[[0, 1], 'B'].describe(), df.loc[[2], 'B'].describe()], - keys=[1, 3]))) - expected.index.names = ['A', None] - result = g.describe() - assert_frame_equal(result, expected) - - expected = concat( - [df.loc[[0, 1], ['A', 'B']].describe(), - df.loc[[2], ['A', 'B']].describe()], keys=[0, 1]) - result = gni.describe() - assert_frame_equal(result, expected) - - # any - expected = DataFrame([[True, True], [False, True]], columns=['B', 'C'], - index=[1, 3]) - expected.index.name = 'A' - result = g.any() - assert_frame_equal(result, expected) - - # idxmax - expected = DataFrame([[0], [nan]], columns=['B'], index=[1, 3]) - expected.index.name = 'A' - result = g.idxmax() - assert_frame_equal(result, expected) - - def test_cython_api2(self): - - # this takes the fast apply path - - # cumsum (GH5614) - df = DataFrame( - [[1, 2, np.nan], [1, np.nan, 9], [3, 4, 9] - ], columns=['A', 'B', 'C']) - expected = DataFrame( - [[2, np.nan], [np.nan, 9], [4, 9]], columns=['B', 'C']) - result = df.groupby('A').cumsum() - assert_frame_equal(result, expected) - - # GH 5755 - cumsum is a transformer and should ignore as_index - result = df.groupby('A', as_index=False).cumsum() - assert_frame_equal(result, expected) - - # GH 13994 - result = df.groupby('A').cumsum(axis=1) - expected = df.cumsum(axis=1) - assert_frame_equal(result, expected) - result = df.groupby('A').cumprod(axis=1) - expected = df.cumprod(axis=1) - assert_frame_equal(result, expected) - - def test_grouping_ndarray(self): - grouped = self.df.groupby(self.df['A'].values) - - result = grouped.sum() - expected = self.df.groupby('A').sum() - assert_frame_equal(result, expected, check_names=False - ) # Note: no names when grouping by value - - def test_agg_consistency(self): - # agg with ([]) and () not consistent - # GH 6715 - - def P1(a): - try: - return np.percentile(a.dropna(), q=1) - except: - return np.nan - - import datetime as dt - df = DataFrame({'col1': [1, 2, 3, 4], - 'col2': [10, 25, 26, 31], - 'date': [dt.date(2013, 2, 10), dt.date(2013, 2, 10), - dt.date(2013, 2, 11), dt.date(2013, 2, 11)]}) - - g = df.groupby('date') - - expected = g.agg([P1]) - expected.columns = expected.columns.levels[0] - - result = g.agg(P1) - assert_frame_equal(result, expected) - - def test_apply_typecast_fail(self): - df = DataFrame({'d': [1., 1., 1., 2., 2., 2.], - 'c': np.tile( - ['a', 'b', 'c'], 2), - 'v': np.arange(1., 7.)}) - - def f(group): - v = group['v'] - group['v2'] = (v - v.min()) / (v.max() - v.min()) - return group - - result = df.groupby('d').apply(f) - - expected = df.copy() - expected['v2'] = np.tile([0., 0.5, 1], 2) - - assert_frame_equal(result, expected) - - def test_apply_multiindex_fail(self): - index = MultiIndex.from_arrays([[0, 
0, 0, 1, 1, 1], [1, 2, 3, 1, 2, 3] - ]) - df = DataFrame({'d': [1., 1., 1., 2., 2., 2.], - 'c': np.tile(['a', 'b', 'c'], 2), - 'v': np.arange(1., 7.)}, index=index) - - def f(group): - v = group['v'] - group['v2'] = (v - v.min()) / (v.max() - v.min()) - return group - - result = df.groupby('d').apply(f) - - expected = df.copy() - expected['v2'] = np.tile([0., 0.5, 1], 2) - - assert_frame_equal(result, expected) - - def test_apply_corner(self): - result = self.tsframe.groupby(lambda x: x.year).apply(lambda x: x * 2) - expected = self.tsframe * 2 - assert_frame_equal(result, expected) - - def test_apply_without_copy(self): - # GH 5545 - # returning a non-copy in an applied function fails - - data = DataFrame({'id_field': [100, 100, 200, 300], - 'category': ['a', 'b', 'c', 'c'], - 'value': [1, 2, 3, 4]}) - - def filt1(x): - if x.shape[0] == 1: - return x.copy() - else: - return x[x.category == 'c'] - - def filt2(x): - if x.shape[0] == 1: - return x - else: - return x[x.category == 'c'] - - expected = data.groupby('id_field').apply(filt1) - result = data.groupby('id_field').apply(filt2) - assert_frame_equal(result, expected) - - def test_apply_use_categorical_name(self): - from pandas import qcut - cats = qcut(self.df.C, 4) - - def get_stats(group): - return {'min': group.min(), - 'max': group.max(), - 'count': group.count(), - 'mean': group.mean()} - - result = self.df.groupby(cats).D.apply(get_stats) - self.assertEqual(result.index.names[0], 'C') - - def test_apply_categorical_data(self): - # GH 10138 - for ordered in [True, False]: - dense = Categorical(list('abc'), ordered=ordered) - # 'b' is in the categories but not in the list - missing = Categorical( - list('aaa'), categories=['a', 'b'], ordered=ordered) - values = np.arange(len(dense)) - df = DataFrame({'missing': missing, - 'dense': dense, - 'values': values}) - grouped = df.groupby(['missing', 'dense']) - - # missing category 'b' should still exist in the output index - idx = MultiIndex.from_product( - [Categorical(['a', 'b'], ordered=ordered), - Categorical(['a', 'b', 'c'], ordered=ordered)], - names=['missing', 'dense']) - expected = DataFrame([0, 1, 2, np.nan, np.nan, np.nan], - index=idx, - columns=['values']) - - assert_frame_equal(grouped.apply(lambda x: np.mean(x)), expected) - assert_frame_equal(grouped.mean(), expected) - assert_frame_equal(grouped.agg(np.mean), expected) - - # but for transform we should still get back the original index - idx = MultiIndex.from_product([['a'], ['a', 'b', 'c']], - names=['missing', 'dense']) - expected = Series(1, index=idx) - assert_series_equal(grouped.apply(lambda x: 1), expected) - - def test_apply_corner_cases(self): - # #535, can't use sliding iterator - - N = 1000 - labels = np.random.randint(0, 100, size=N) - df = DataFrame({'key': labels, - 'value1': np.random.randn(N), - 'value2': ['foo', 'bar', 'baz', 'qux'] * (N // 4)}) - - grouped = df.groupby('key') - - def f(g): - g['value3'] = g['value1'] * 2 - return g - - result = grouped.apply(f) - self.assertTrue('value3' in result) - - def test_transform_mixed_type(self): - index = MultiIndex.from_arrays([[0, 0, 0, 1, 1, 1], [1, 2, 3, 1, 2, 3] - ]) - df = DataFrame({'d': [1., 1., 1., 2., 2., 2.], - 'c': np.tile(['a', 'b', 'c'], 2), - 'v': np.arange(1., 7.)}, index=index) - - def f(group): - group['g'] = group['d'] * 2 - return group[:1] - - grouped = df.groupby('c') - result = grouped.apply(f) - - self.assertEqual(result['d'].dtype, np.float64) - - # this is by definition a mutating operation! 
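-        # disabling the chained-assignment check below silences the
-        # SettingWithCopy warning that f's in-place ``group['g'] = ...``
-        # assignment would otherwise trigger on these group slices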
- with option_context('mode.chained_assignment', None): - for key, group in grouped: - res = f(group) - assert_frame_equal(res, result.ix[key]) - - def test_groupby_wrong_multi_labels(self): - from pandas import read_csv - data = """index,foo,bar,baz,spam,data -0,foo1,bar1,baz1,spam2,20 -1,foo1,bar2,baz1,spam3,30 -2,foo2,bar2,baz1,spam2,40 -3,foo1,bar1,baz2,spam1,50 -4,foo3,bar1,baz2,spam1,60""" - - data = read_csv(StringIO(data), index_col=0) - - grouped = data.groupby(['foo', 'bar', 'baz', 'spam']) - - result = grouped.agg(np.mean) - expected = grouped.mean() - assert_frame_equal(result, expected) - - def test_groupby_series_with_name(self): - result = self.df.groupby(self.df['A']).mean() - result2 = self.df.groupby(self.df['A'], as_index=False).mean() - self.assertEqual(result.index.name, 'A') - self.assertIn('A', result2) - - result = self.df.groupby([self.df['A'], self.df['B']]).mean() - result2 = self.df.groupby([self.df['A'], self.df['B']], - as_index=False).mean() - self.assertEqual(result.index.names, ('A', 'B')) - self.assertIn('A', result2) - self.assertIn('B', result2) - - def test_seriesgroupby_name_attr(self): - # GH 6265 - result = self.df.groupby('A')['C'] - self.assertEqual(result.count().name, 'C') - self.assertEqual(result.mean().name, 'C') - - testFunc = lambda x: np.sum(x) * 2 - self.assertEqual(result.agg(testFunc).name, 'C') - - def test_consistency_name(self): - # GH 12363 - - df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar', - 'foo', 'bar', 'foo', 'foo'], - 'B': ['one', 'one', 'two', 'two', - 'two', 'two', 'one', 'two'], - 'C': np.random.randn(8) + 1.0, - 'D': np.arange(8)}) - - expected = df.groupby(['A']).B.count() - result = df.B.groupby(df.A).count() - assert_series_equal(result, expected) - - def test_groupby_name_propagation(self): - # GH 6124 - def summarize(df, name=None): - return Series({'count': 1, 'mean': 2, 'omissions': 3, }, name=name) - - def summarize_random_name(df): - # Provide a different name for each Series. In this case, groupby - # should not attempt to propagate the Series name since they are - # inconsistent. 
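-            # (each call names its Series after the group's first 'A' value,
-            # so the names differ from group to group)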
-            return Series({
-                'count': 1,
-                'mean': 2,
-                'omissions': 3,
-            }, name=df.iloc[0]['A'])
-
-        metrics = self.df.groupby('A').apply(summarize)
-        self.assertEqual(metrics.columns.name, None)
-        metrics = self.df.groupby('A').apply(summarize, 'metrics')
-        self.assertEqual(metrics.columns.name, 'metrics')
-        metrics = self.df.groupby('A').apply(summarize_random_name)
-        self.assertEqual(metrics.columns.name, None)
-
-    def test_groupby_nonstring_columns(self):
-        df = DataFrame([np.arange(10) for x in range(10)])
-        grouped = df.groupby(0)
-        result = grouped.mean()
-        expected = df.groupby(df[0]).mean()
-        assert_frame_equal(result, expected)
-
-    def test_groupby_mixed_type_columns(self):
-        # GH 13432, unorderable types in py3
-        df = DataFrame([[0, 1, 2]], columns=['A', 'B', 0])
-        expected = DataFrame([[1, 2]], columns=['B', 0],
-                             index=Index([0], name='A'))
-
-        result = df.groupby('A').first()
-        tm.assert_frame_equal(result, expected)
-
-        result = df.groupby('A').sum()
-        tm.assert_frame_equal(result, expected)
-
-    def test_cython_grouper_series_bug_noncontig(self):
-        arr = np.empty((100, 100))
-        arr.fill(np.nan)
-        obj = Series(arr[:, 0], index=lrange(100))
-        inds = np.tile(lrange(10), 10)
-
-        result = obj.groupby(inds).agg(Series.median)
-        self.assertTrue(result.isnull().all())
-
-    def test_series_grouper_noncontig_index(self):
-        index = Index(tm.rands_array(10, 100))
-
-        values = Series(np.random.randn(50), index=index[::2])
-        labels = np.random.randint(0, 5, 50)
-
-        # it works!
-        grouped = values.groupby(labels)
-
-        # accessing the index elements used to cause a segfault
-        f = lambda x: len(set(map(id, x.index)))
-        grouped.agg(f)
-
-    def test_convert_objects_leave_decimal_alone(self):
-
-        from decimal import Decimal
-
-        s = Series(lrange(5))
-        labels = np.array(['a', 'b', 'c', 'd', 'e'], dtype='O')
-
-        def convert_fast(x):
-            return Decimal(str(x.mean()))
-
-        def convert_force_pure(x):
-            # base will be length > 0 (x should be a view here)
-            assert (len(x.base) > 0)
-            return Decimal(str(x.mean()))
-
-        grouped = s.groupby(labels)
-
-        result = grouped.agg(convert_fast)
-        self.assertEqual(result.dtype, np.object_)
-        tm.assertIsInstance(result[0], Decimal)
-
-        result = grouped.agg(convert_force_pure)
-        self.assertEqual(result.dtype, np.object_)
-        tm.assertIsInstance(result[0], Decimal)
-
-    def test_fast_apply(self):
-        # make sure that fast apply is correctly called
-        # rather than raising any kind of error
-        # otherwise the python path will be called,
-        # which slows things down
-        N = 1000
-        labels = np.random.randint(0, 2000, size=N)
-        labels2 = np.random.randint(0, 3, size=N)
-        df = DataFrame({'key': labels,
-                        'key2': labels2,
-                        'value1': np.random.randn(N),
-                        'value2': ['foo', 'bar', 'baz', 'qux'] * (N // 4)})
-
-        def f(g):
-            return 1
-
-        g = df.groupby(['key', 'key2'])
-
-        grouper = g.grouper
-
-        splitter = grouper._get_splitter(g._selected_obj, axis=g.axis)
-        group_keys = grouper._get_group_keys()
-
-        values, mutated = splitter.fast_apply(f, group_keys)
-        self.assertFalse(mutated)
-
-    def test_apply_with_mixed_dtype(self):
-        # GH3480, apply with mixed dtype on axis=1 breaks in 0.11
-        df = DataFrame({'foo1': ['one', 'two', 'two', 'three', 'one', 'two'],
-                        'foo2': np.random.randn(6)})
-        result = df.apply(lambda x: x, axis=1)
-        assert_series_equal(df.get_dtype_counts(), result.get_dtype_counts())
-
-        # GH 3610 incorrect dtype conversion with as_index=False
-        df = DataFrame({"c1": [1, 2, 6, 6, 8]})
-        df["c2"] = df.c1 / 2.0
-        result1 = df.groupby("c2").mean().reset_index().c2
-        result2 = df.groupby("c2", as_index=False).mean().c2
assert_series_equal(result1, result2) - - def test_groupby_aggregation_mixed_dtype(self): - - # GH 6212 - expected = DataFrame({ - 'v1': [5, 5, 7, np.nan, 3, 3, 4, 1], - 'v2': [55, 55, 77, np.nan, 33, 33, 44, 11]}, - index=MultiIndex.from_tuples([(1, 95), (1, 99), (2, 95), (2, 99), - ('big', 'damp'), - ('blue', 'dry'), - ('red', 'red'), ('red', 'wet')], - names=['by1', 'by2'])) - - df = DataFrame({ - 'v1': [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9], - 'v2': [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99], - 'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, - 12], - 'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, - np.nan, np.nan] - }) - - g = df.groupby(['by1', 'by2']) - result = g[['v1', 'v2']].mean() - assert_frame_equal(result, expected) - - def test_groupby_dtype_inference_empty(self): - # GH 6733 - df = DataFrame({'x': [], 'range': np.arange(0, dtype='int64')}) - self.assertEqual(df['x'].dtype, np.float64) - - result = df.groupby('x').first() - exp_index = Index([], name='x', dtype=np.float64) - expected = DataFrame({'range': Series( - [], index=exp_index, dtype='int64')}) - assert_frame_equal(result, expected, by_blocks=True) - - def test_groupby_list_infer_array_like(self): - result = self.df.groupby(list(self.df['A'])).mean() - expected = self.df.groupby(self.df['A']).mean() - assert_frame_equal(result, expected, check_names=False) - - self.assertRaises(Exception, self.df.groupby, list(self.df['A'][:-1])) - - # pathological case of ambiguity - df = DataFrame({'foo': [0, 1], - 'bar': [3, 4], - 'val': np.random.randn(2)}) - - result = df.groupby(['foo', 'bar']).mean() - expected = df.groupby([df['foo'], df['bar']]).mean()[['val']] - - def test_groupby_keys_same_size_as_index(self): - # GH 11185 - freq = 's' - index = pd.date_range(start=pd.Timestamp('2015-09-29T11:34:44-0700'), - periods=2, freq=freq) - df = pd.DataFrame([['A', 10], ['B', 15]], columns=[ - 'metric', 'values' - ], index=index) - result = df.groupby([pd.Grouper(level=0, freq=freq), 'metric']).mean() - expected = df.set_index([df.index, 'metric']) - - assert_frame_equal(result, expected) - - def test_groupby_one_row(self): - # GH 11741 - df1 = pd.DataFrame(np.random.randn(1, 4), columns=list('ABCD')) - self.assertRaises(KeyError, df1.groupby, 'Z') - df2 = pd.DataFrame(np.random.randn(2, 4), columns=list('ABCD')) - self.assertRaises(KeyError, df2.groupby, 'Z') - - def test_groupby_nat_exclude(self): - # GH 6992 - df = pd.DataFrame( - {'values': np.random.randn(8), - 'dt': [np.nan, pd.Timestamp('2013-01-01'), np.nan, pd.Timestamp( - '2013-02-01'), np.nan, pd.Timestamp('2013-02-01'), np.nan, - pd.Timestamp('2013-01-01')], - 'str': [np.nan, 'a', np.nan, 'a', np.nan, 'a', np.nan, 'b']}) - grouped = df.groupby('dt') - - expected = [pd.Index([1, 7]), pd.Index([3, 5])] - keys = sorted(grouped.groups.keys()) - self.assertEqual(len(keys), 2) - for k, e in zip(keys, expected): - # grouped.groups keys are np.datetime64 with system tz - # not to be affected by tz, only compare values - tm.assert_index_equal(grouped.groups[k], e) - - # confirm obj is not filtered - tm.assert_frame_equal(grouped.grouper.groupings[0].obj, df) - self.assertEqual(grouped.ngroups, 2) - - expected = { - Timestamp('2013-01-01 00:00:00'): np.array([1, 7], dtype=np.int64), - Timestamp('2013-02-01 00:00:00'): np.array([3, 5], dtype=np.int64) - } - - for k in grouped.indices: - self.assert_numpy_array_equal(grouped.indices[k], expected[k]) - - tm.assert_frame_equal( - grouped.get_group(Timestamp('2013-01-01')), df.iloc[[1, 
7]]) - tm.assert_frame_equal( - grouped.get_group(Timestamp('2013-02-01')), df.iloc[[3, 5]]) - - self.assertRaises(KeyError, grouped.get_group, pd.NaT) - - nan_df = DataFrame({'nan': [np.nan, np.nan, np.nan], - 'nat': [pd.NaT, pd.NaT, pd.NaT]}) - self.assertEqual(nan_df['nan'].dtype, 'float64') - self.assertEqual(nan_df['nat'].dtype, 'datetime64[ns]') - - for key in ['nan', 'nat']: - grouped = nan_df.groupby(key) - self.assertEqual(grouped.groups, {}) - self.assertEqual(grouped.ngroups, 0) - self.assertEqual(grouped.indices, {}) - self.assertRaises(KeyError, grouped.get_group, np.nan) - self.assertRaises(KeyError, grouped.get_group, pd.NaT) - - def test_dictify(self): - dict(iter(self.df.groupby('A'))) - dict(iter(self.df.groupby(['A', 'B']))) - dict(iter(self.df['C'].groupby(self.df['A']))) - dict(iter(self.df['C'].groupby([self.df['A'], self.df['B']]))) - dict(iter(self.df.groupby('A')['C'])) - dict(iter(self.df.groupby(['A', 'B'])['C'])) - - def test_sparse_friendly(self): - sdf = self.df[['C', 'D']].to_sparse() - panel = tm.makePanel() - tm.add_nans(panel) - - def _check_work(gp): - gp.mean() - gp.agg(np.mean) - dict(iter(gp)) - - # it works! - _check_work(sdf.groupby(lambda x: x // 2)) - _check_work(sdf['C'].groupby(lambda x: x // 2)) - _check_work(sdf.groupby(self.df['A'])) - - # do this someday - # _check_work(panel.groupby(lambda x: x.month, axis=1)) - - def test_panel_groupby(self): - self.panel = tm.makePanel() - tm.add_nans(self.panel) - grouped = self.panel.groupby({'ItemA': 0, 'ItemB': 0, 'ItemC': 1}, - axis='items') - agged = grouped.mean() - agged2 = grouped.agg(lambda x: x.mean('items')) - - tm.assert_panel_equal(agged, agged2) - - self.assert_index_equal(agged.items, Index([0, 1])) - - grouped = self.panel.groupby(lambda x: x.month, axis='major') - agged = grouped.mean() - - exp = Index(sorted(list(set(self.panel.major_axis.month)))) - self.assert_index_equal(agged.major_axis, exp) - - grouped = self.panel.groupby({'A': 0, 'B': 0, 'C': 1, 'D': 1}, - axis='minor') - agged = grouped.mean() - self.assert_index_equal(agged.minor_axis, Index([0, 1])) - - def test_numpy_groupby(self): - from pandas.core.groupby import numpy_groupby - - data = np.random.randn(100, 100) - labels = np.random.randint(0, 10, size=100) - - df = DataFrame(data) - - result = df.groupby(labels).sum().values - expected = numpy_groupby(data, labels) - assert_almost_equal(result, expected) - - result = df.groupby(labels, axis=1).sum().values - expected = numpy_groupby(data, labels, axis=1) - assert_almost_equal(result, expected) - - def test_groupby_2d_malformed(self): - d = DataFrame(index=lrange(2)) - d['group'] = ['g1', 'g2'] - d['zeros'] = [0, 0] - d['ones'] = [1, 1] - d['label'] = ['l1', 'l2'] - tmp = d.groupby(['group']).mean() - res_values = np.array([[0, 1], [0, 1]], dtype=np.int64) - self.assert_index_equal(tmp.columns, Index(['zeros', 'ones'])) - self.assert_numpy_array_equal(tmp.values, res_values) - - def test_int32_overflow(self): - B = np.concatenate((np.arange(10000), np.arange(10000), np.arange(5000) - )) - A = np.arange(25000) - df = DataFrame({'A': A, - 'B': B, - 'C': A, - 'D': B, - 'E': np.random.randn(25000)}) - - left = df.groupby(['A', 'B', 'C', 'D']).sum() - right = df.groupby(['D', 'C', 'B', 'A']).sum() - self.assertEqual(len(left), len(right)) - - def test_int64_overflow(self): - from pandas.core.groupby import _int64_overflow_possible - - B = np.concatenate((np.arange(1000), np.arange(1000), np.arange(500))) - A = np.arange(2500) - df = DataFrame({'A': A, - 'B': B, - 'C': A, - 'D': 
B,
- 'E': A,
- 'F': B,
- 'G': A,
- 'H': B,
- 'values': np.random.randn(2500)})
-
- lg = df.groupby(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])
- rg = df.groupby(['H', 'G', 'F', 'E', 'D', 'C', 'B', 'A'])
-
- left = lg.sum()['values']
- right = rg.sum()['values']
-
- exp_index, _ = left.index.sortlevel(0)
- self.assert_index_equal(left.index, exp_index)
-
- exp_index, _ = right.index.sortlevel(0)
- self.assert_index_equal(right.index, exp_index)
-
- tups = list(map(tuple, df[['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'
- ]].values))
- tups = com._asarray_tuplesafe(tups)
- expected = df.groupby(tups).sum()['values']
-
- for k, v in compat.iteritems(expected):
- self.assertEqual(left[k], right[k[::-1]])
- self.assertEqual(left[k], v)
- self.assertEqual(len(left), len(right))
-
- # GH9096
- values = range(55109)
- data = pd.DataFrame.from_dict({'a': values,
- 'b': values,
- 'c': values,
- 'd': values})
- grouped = data.groupby(['a', 'b', 'c', 'd'])
- self.assertEqual(len(grouped), len(values))
-
- arr = np.random.randint(-1 << 12, 1 << 12, (1 << 15, 5))
- i = np.random.choice(len(arr), len(arr) * 4)
- arr = np.vstack((arr, arr[i]))  # add some duplicate rows
-
- i = np.random.permutation(len(arr))
- arr = arr[i]  # shuffle rows
-
- df = DataFrame(arr, columns=list('abcde'))
- df['jim'], df['joe'] = np.random.randn(2, len(df)) * 10
- gr = df.groupby(list('abcde'))
-
- # verify this is testing what it is supposed to test!
- self.assertTrue(_int64_overflow_possible(gr.grouper.shape))
-
- # manually compute groupings
- jim, joe = defaultdict(list), defaultdict(list)
- for key, a, b in zip(map(tuple, arr), df['jim'], df['joe']):
- jim[key].append(a)
- joe[key].append(b)
-
- self.assertEqual(len(gr), len(jim))
- mi = MultiIndex.from_tuples(jim.keys(), names=list('abcde'))
-
- def aggr(func):
- f = lambda a: np.fromiter(map(func, a), dtype='f8')
- arr = np.vstack((f(jim.values()), f(joe.values()))).T
- res = DataFrame(arr, columns=['jim', 'joe'], index=mi)
- return res.sort_index()
-
- assert_frame_equal(gr.mean(), aggr(np.mean))
- assert_frame_equal(gr.median(), aggr(np.median))
-
- def test_groupby_sort_multi(self):
- df = DataFrame({'a': ['foo', 'bar', 'baz'],
- 'b': [3, 2, 1],
- 'c': [0, 1, 2],
- 'd': np.random.randn(3)})
-
- tups = lmap(tuple, df[['a', 'b', 'c']].values)
- tups = com._asarray_tuplesafe(tups)
- result = df.groupby(['a', 'b', 'c'], sort=True).sum()
- self.assert_numpy_array_equal(result.index.values, tups[[1, 2, 0]])
-
- tups = lmap(tuple, df[['c', 'a', 'b']].values)
- tups = com._asarray_tuplesafe(tups)
- result = df.groupby(['c', 'a', 'b'], sort=True).sum()
- self.assert_numpy_array_equal(result.index.values, tups)
-
- tups = lmap(tuple, df[['b', 'c', 'a']].values)
- tups = com._asarray_tuplesafe(tups)
- result = df.groupby(['b', 'c', 'a'], sort=True).sum()
- self.assert_numpy_array_equal(result.index.values, tups[[2, 1, 0]])
-
- df = DataFrame({'a': [0, 1, 2, 0, 1, 2],
- 'b': [0, 0, 0, 1, 1, 1],
- 'd': np.random.randn(6)})
- grouped = df.groupby(['a', 'b'])['d']
- result = grouped.sum()
- _check_groupby(df, result, ['a', 'b'], 'd')
-
- def test_intercept_builtin_sum(self):
- s = Series([1., 2., np.nan, 3.])
- grouped = s.groupby([0, 1, 2, 2])
-
- result = grouped.agg(builtins.sum)
- result2 = grouped.apply(builtins.sum)
- expected = grouped.sum()
- assert_series_equal(result, expected)
- assert_series_equal(result2, expected)
-
- def test_column_select_via_attr(self):
- result = self.df.groupby('A').C.sum()
- expected = self.df.groupby('A')['C'].sum()
- assert_series_equal(result,
expected) - - self.df['mean'] = 1.5 - result = self.df.groupby('A').mean() - expected = self.df.groupby('A').agg(np.mean) - assert_frame_equal(result, expected) - - def test_rank_apply(self): - lev1 = tm.rands_array(10, 100) - lev2 = tm.rands_array(10, 130) - lab1 = np.random.randint(0, 100, size=500) - lab2 = np.random.randint(0, 130, size=500) - - df = DataFrame({'value': np.random.randn(500), - 'key1': lev1.take(lab1), - 'key2': lev2.take(lab2)}) - - result = df.groupby(['key1', 'key2']).value.rank() - - expected = [] - for key, piece in df.groupby(['key1', 'key2']): - expected.append(piece.value.rank()) - expected = concat(expected, axis=0) - expected = expected.reindex(result.index) - assert_series_equal(result, expected) - - result = df.groupby(['key1', 'key2']).value.rank(pct=True) - - expected = [] - for key, piece in df.groupby(['key1', 'key2']): - expected.append(piece.value.rank(pct=True)) - expected = concat(expected, axis=0) - expected = expected.reindex(result.index) - assert_series_equal(result, expected) - - def test_dont_clobber_name_column(self): - df = DataFrame({'key': ['a', 'a', 'a', 'b', 'b', 'b'], - 'name': ['foo', 'bar', 'baz'] * 2}) - - result = df.groupby('key').apply(lambda x: x) - assert_frame_equal(result, df) - - def test_skip_group_keys(self): - from pandas import concat - - tsf = tm.makeTimeDataFrame() - - grouped = tsf.groupby(lambda x: x.month, group_keys=False) - result = grouped.apply(lambda x: x.sort_values(by='A')[:3]) - - pieces = [] - for key, group in grouped: - pieces.append(group.sort_values(by='A')[:3]) - - expected = concat(pieces) - assert_frame_equal(result, expected) - - grouped = tsf['A'].groupby(lambda x: x.month, group_keys=False) - result = grouped.apply(lambda x: x.sort_values()[:3]) - - pieces = [] - for key, group in grouped: - pieces.append(group.sort_values()[:3]) - - expected = concat(pieces) - assert_series_equal(result, expected) - - def test_no_nonsense_name(self): - # GH #995 - s = self.frame['C'].copy() - s.name = None - - result = s.groupby(self.frame['A']).agg(np.sum) - self.assertIsNone(result.name) - - def test_wrap_agg_out(self): - grouped = self.three_group.groupby(['A', 'B']) - - def func(ser): - if ser.dtype == np.object: - raise TypeError - else: - return ser.sum() - - result = grouped.aggregate(func) - exp_grouped = self.three_group.ix[:, self.three_group.columns != 'C'] - expected = exp_grouped.groupby(['A', 'B']).aggregate(func) - assert_frame_equal(result, expected) - - def test_multifunc_sum_bug(self): - # GH #1065 - x = DataFrame(np.arange(9).reshape(3, 3)) - x['test'] = 0 - x['fl'] = [1.3, 1.5, 1.6] - - grouped = x.groupby('test') - result = grouped.agg({'fl': 'sum', 2: 'size'}) - self.assertEqual(result['fl'].dtype, np.float64) - - def test_handle_dict_return_value(self): - def f(group): - return {'min': group.min(), 'max': group.max()} - - def g(group): - return Series({'min': group.min(), 'max': group.max()}) - - result = self.df.groupby('A')['C'].apply(f) - expected = self.df.groupby('A')['C'].apply(g) - - tm.assertIsInstance(result, Series) - assert_series_equal(result, expected) - - def test_getitem_list_of_columns(self): - df = DataFrame( - {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], - 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], - 'C': np.random.randn(8), - 'D': np.random.randn(8), - 'E': np.random.randn(8)}) - - result = df.groupby('A')[['C', 'D']].mean() - result2 = df.groupby('A')['C', 'D'].mean() - result3 = df.groupby('A')[df.columns[2:4]].mean() - - 
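- # illustrative: the three spellings above pick the same two columns
- # (a list of labels, a bare 'C', 'D' tuple, and an Index slice via
- # df.columns[2:4]), so all three should equal the expected frame below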
expected = df.ix[:, ['A', 'C', 'D']].groupby('A').mean() - - assert_frame_equal(result, expected) - assert_frame_equal(result2, expected) - assert_frame_equal(result3, expected) - - def test_getitem_numeric_column_names(self): - # GH #13731 - df = DataFrame({0: list('abcd') * 2, - 2: np.random.randn(8), - 4: np.random.randn(8), - 6: np.random.randn(8)}) - result = df.groupby(0)[df.columns[1:3]].mean() - result2 = df.groupby(0)[2, 4].mean() - result3 = df.groupby(0)[[2, 4]].mean() - - expected = df.ix[:, [0, 2, 4]].groupby(0).mean() - - assert_frame_equal(result, expected) - assert_frame_equal(result2, expected) - assert_frame_equal(result3, expected) - - def test_agg_multiple_functions_maintain_order(self): - # GH #610 - funcs = [('mean', np.mean), ('max', np.max), ('min', np.min)] - result = self.df.groupby('A')['C'].agg(funcs) - exp_cols = Index(['mean', 'max', 'min']) - - self.assert_index_equal(result.columns, exp_cols) - - def test_multiple_functions_tuples_and_non_tuples(self): - # #1359 - - funcs = [('foo', 'mean'), 'std'] - ex_funcs = [('foo', 'mean'), ('std', 'std')] - - result = self.df.groupby('A')['C'].agg(funcs) - expected = self.df.groupby('A')['C'].agg(ex_funcs) - assert_frame_equal(result, expected) - - result = self.df.groupby('A').agg(funcs) - expected = self.df.groupby('A').agg(ex_funcs) - assert_frame_equal(result, expected) - - def test_agg_multiple_functions_too_many_lambdas(self): - grouped = self.df.groupby('A') - funcs = ['mean', lambda x: x.mean(), lambda x: x.std()] - - self.assertRaises(SpecificationError, grouped.agg, funcs) - - def test_more_flexible_frame_multi_function(self): - from pandas import concat - - grouped = self.df.groupby('A') - - exmean = grouped.agg(OrderedDict([['C', np.mean], ['D', np.mean]])) - exstd = grouped.agg(OrderedDict([['C', np.std], ['D', np.std]])) - - expected = concat([exmean, exstd], keys=['mean', 'std'], axis=1) - expected = expected.swaplevel(0, 1, axis=1).sortlevel(0, axis=1) - - d = OrderedDict([['C', [np.mean, np.std]], ['D', [np.mean, np.std]]]) - result = grouped.aggregate(d) - - assert_frame_equal(result, expected) - - # be careful - result = grouped.aggregate(OrderedDict([['C', np.mean], - ['D', [np.mean, np.std]]])) - expected = grouped.aggregate(OrderedDict([['C', np.mean], - ['D', [np.mean, np.std]]])) - assert_frame_equal(result, expected) - - def foo(x): - return np.mean(x) - - def bar(x): - return np.std(x, ddof=1) - - d = OrderedDict([['C', np.mean], ['D', OrderedDict( - [['foo', np.mean], ['bar', np.std]])]]) - result = grouped.aggregate(d) - - d = OrderedDict([['C', [np.mean]], ['D', [foo, bar]]]) - expected = grouped.aggregate(d) - - assert_frame_equal(result, expected) - - def test_multi_function_flexible_mix(self): - # GH #1268 - grouped = self.df.groupby('A') - - d = OrderedDict([['C', OrderedDict([['foo', 'mean'], [ - 'bar', 'std' - ]])], ['D', 'sum']]) - result = grouped.aggregate(d) - d2 = OrderedDict([['C', OrderedDict([['foo', 'mean'], [ - 'bar', 'std' - ]])], ['D', ['sum']]]) - result2 = grouped.aggregate(d2) - - d3 = OrderedDict([['C', OrderedDict([['foo', 'mean'], [ - 'bar', 'std' - ]])], ['D', {'sum': 'sum'}]]) - expected = grouped.aggregate(d3) - - assert_frame_equal(result, expected) - assert_frame_equal(result2, expected) - - def test_agg_callables(self): - # GH 7929 - df = DataFrame({'foo': [1, 2], 'bar': [3, 4]}).astype(np.int64) - - class fn_class(object): - - def __call__(self, x): - return sum(x) - - equiv_callables = [sum, np.sum, lambda x: sum(x), lambda x: x.sum(), - partial(sum), 
fn_class()]
-
- expected = df.groupby("foo").agg(sum)
- for ecall in equiv_callables:
- result = df.groupby('foo').agg(ecall)
- assert_frame_equal(result, expected)
-
- def test_set_group_name(self):
- def f(group):
- assert group.name is not None
- return group
-
- def freduce(group):
- assert group.name is not None
- return group.sum()
-
- def foo(x):
- return freduce(x)
-
- def _check_all(grouped):
- # make sure all these work
- grouped.apply(f)
- grouped.aggregate(freduce)
- grouped.aggregate({'C': freduce, 'D': freduce})
- grouped.transform(f)
-
- grouped['C'].apply(f)
- grouped['C'].aggregate(freduce)
- grouped['C'].aggregate([freduce, foo])
- grouped['C'].transform(f)
-
- _check_all(self.df.groupby('A'))
- _check_all(self.df.groupby(['A', 'B']))
-
- def test_no_dummy_key_names(self):
- # GH #1291
-
- result = self.df.groupby(self.df['A'].values).sum()
- self.assertIsNone(result.index.name)
-
- result = self.df.groupby([self.df['A'].values, self.df['B'].values
- ]).sum()
- self.assertEqual(result.index.names, (None, None))
-
- def test_groupby_sort_categorical(self):
- # dataframe groupby sort was being ignored # GH 8868
- df = DataFrame([['(7.5, 10]', 10, 10],
- ['(7.5, 10]', 8, 20],
- ['(2.5, 5]', 5, 30],
- ['(5, 7.5]', 6, 40],
- ['(2.5, 5]', 4, 50],
- ['(0, 2.5]', 1, 60],
- ['(5, 7.5]', 7, 70]], columns=['range', 'foo', 'bar'])
- df['range'] = Categorical(df['range'], ordered=True)
- index = CategoricalIndex(['(0, 2.5]', '(2.5, 5]', '(5, 7.5]',
- '(7.5, 10]'], name='range', ordered=True)
- result_sort = DataFrame([[1, 60], [5, 30], [6, 40], [10, 10]],
- columns=['foo', 'bar'], index=index)
-
- col = 'range'
- assert_frame_equal(result_sort, df.groupby(col, sort=True).first())
- # when categories is ordered, group is ordered by category's order
- assert_frame_equal(result_sort, df.groupby(col, sort=False).first())
-
- df['range'] = Categorical(df['range'], ordered=False)
- index = CategoricalIndex(['(0, 2.5]', '(2.5, 5]', '(5, 7.5]',
- '(7.5, 10]'], name='range')
- result_sort = DataFrame([[1, 60], [5, 30], [6, 40], [10, 10]],
- columns=['foo', 'bar'], index=index)
-
- index = CategoricalIndex(['(7.5, 10]', '(2.5, 5]', '(5, 7.5]',
- '(0, 2.5]'],
- categories=['(7.5, 10]', '(2.5, 5]',
- '(5, 7.5]', '(0, 2.5]'],
- name='range')
- result_nosort = DataFrame([[10, 10], [5, 30], [6, 40], [1, 60]],
- index=index, columns=['foo', 'bar'])
-
- col = 'range'
- # this is an unordered categorical, but we allow this
- assert_frame_equal(result_sort, df.groupby(col, sort=True).first())
- assert_frame_equal(result_nosort, df.groupby(col, sort=False).first())
-
- def test_groupby_sort_categorical_datetimelike(self):
- # GH10505
-
- # use the same data as test_groupby_sort_categorical, whose
- # categories correspond to the month of each datetime
- df = DataFrame({'dt': [datetime(2011, 7, 1), datetime(2011, 7, 1),
- datetime(2011, 2, 1), datetime(2011, 5, 1),
- datetime(2011, 2, 1), datetime(2011, 1, 1),
- datetime(2011, 5, 1)],
- 'foo': [10, 8, 5, 6, 4, 1, 7],
- 'bar': [10, 20, 30, 40, 50, 60, 70]},
- columns=['dt', 'foo', 'bar'])
-
- # ordered=True
- df['dt'] = Categorical(df['dt'], ordered=True)
- index = [datetime(2011, 1, 1), datetime(2011, 2, 1),
- datetime(2011, 5, 1), datetime(2011, 7, 1)]
- result_sort = DataFrame(
- [[1, 60], [5, 30], [6, 40], [10, 10]], columns=['foo', 'bar'])
- result_sort.index = CategoricalIndex(index, name='dt', ordered=True)
-
- index = [datetime(2011, 7, 1), datetime(2011, 2, 1),
- datetime(2011, 5, 1), datetime(2011, 1, 1)]
- result_nosort = DataFrame([[10, 10], [5, 30],
[6, 40], [1, 60]], - columns=['foo', 'bar']) - result_nosort.index = CategoricalIndex(index, categories=index, - name='dt', ordered=True) - - col = 'dt' - assert_frame_equal(result_sort, df.groupby(col, sort=True).first()) - # when categories is ordered, group is ordered by category's order - assert_frame_equal(result_sort, df.groupby(col, sort=False).first()) - - # ordered = False - df['dt'] = Categorical(df['dt'], ordered=False) - index = [datetime(2011, 1, 1), datetime(2011, 2, 1), - datetime(2011, 5, 1), datetime(2011, 7, 1)] - result_sort = DataFrame( - [[1, 60], [5, 30], [6, 40], [10, 10]], columns=['foo', 'bar']) - result_sort.index = CategoricalIndex(index, name='dt') - - index = [datetime(2011, 7, 1), datetime(2011, 2, 1), - datetime(2011, 5, 1), datetime(2011, 1, 1)] - result_nosort = DataFrame([[10, 10], [5, 30], [6, 40], [1, 60]], - columns=['foo', 'bar']) - result_nosort.index = CategoricalIndex(index, categories=index, - name='dt') - - col = 'dt' - assert_frame_equal(result_sort, df.groupby(col, sort=True).first()) - assert_frame_equal(result_nosort, df.groupby(col, sort=False).first()) - - def test_groupby_sort_multiindex_series(self): - # series multiindex groupby sort argument was not being passed through - # _compress_group_index - # GH 9444 - index = MultiIndex(levels=[[1, 2], [1, 2]], - labels=[[0, 0, 0, 0, 1, 1], [1, 1, 0, 0, 0, 0]], - names=['a', 'b']) - mseries = Series([0, 1, 2, 3, 4, 5], index=index) - index = MultiIndex(levels=[[1, 2], [1, 2]], - labels=[[0, 0, 1], [1, 0, 0]], names=['a', 'b']) - mseries_result = Series([0, 2, 4], index=index) - - result = mseries.groupby(level=['a', 'b'], sort=False).first() - assert_series_equal(result, mseries_result) - result = mseries.groupby(level=['a', 'b'], sort=True).first() - assert_series_equal(result, mseries_result.sort_index()) - - def test_groupby_categorical(self): - levels = ['foo', 'bar', 'baz', 'qux'] - codes = np.random.randint(0, 4, size=100) - - cats = Categorical.from_codes(codes, levels, ordered=True) - - data = DataFrame(np.random.randn(100, 4)) - - result = data.groupby(cats).mean() - - expected = data.groupby(np.asarray(cats)).mean() - exp_idx = CategoricalIndex(levels, categories=cats.categories, - ordered=True) - expected = expected.reindex(exp_idx) - - assert_frame_equal(result, expected) - - grouped = data.groupby(cats) - desc_result = grouped.describe() - - idx = cats.codes.argsort() - ord_labels = np.asarray(cats).take(idx) - ord_data = data.take(idx) - - exp_cats = Categorical(ord_labels, ordered=True, - categories=['foo', 'bar', 'baz', 'qux']) - expected = ord_data.groupby(exp_cats, sort=False).describe() - expected.index.names = [None, None] - assert_frame_equal(desc_result, expected) - - # GH 10460 - expc = Categorical.from_codes(np.arange(4).repeat(8), - levels, ordered=True) - exp = CategoricalIndex(expc) - self.assert_index_equal(desc_result.index.get_level_values(0), exp) - exp = Index(['count', 'mean', 'std', 'min', '25%', '50%', - '75%', 'max'] * 4) - self.assert_index_equal(desc_result.index.get_level_values(1), exp) - - def test_groupby_datetime_categorical(self): - # GH9049: ensure backward compatibility - levels = pd.date_range('2014-01-01', periods=4) - codes = np.random.randint(0, 4, size=100) - - cats = Categorical.from_codes(codes, levels, ordered=True) - - data = DataFrame(np.random.randn(100, 4)) - result = data.groupby(cats).mean() - - expected = data.groupby(np.asarray(cats)).mean() - expected = expected.reindex(levels) - expected.index = CategoricalIndex(expected.index, - 
categories=expected.index, - ordered=True) - - assert_frame_equal(result, expected) - - grouped = data.groupby(cats) - desc_result = grouped.describe() - - idx = cats.codes.argsort() - ord_labels = cats.take_nd(idx) - ord_data = data.take(idx) - expected = ord_data.groupby(ord_labels).describe() - expected.index.names = [None, None] - assert_frame_equal(desc_result, expected) - tm.assert_index_equal(desc_result.index, expected.index) - tm.assert_index_equal( - desc_result.index.get_level_values(0), - expected.index.get_level_values(0)) - - # GH 10460 - expc = Categorical.from_codes( - np.arange(4).repeat(8), levels, ordered=True) - exp = CategoricalIndex(expc) - self.assert_index_equal(desc_result.index.get_level_values(0), exp) - exp = Index(['count', 'mean', 'std', 'min', '25%', '50%', - '75%', 'max'] * 4) - self.assert_index_equal(desc_result.index.get_level_values(1), exp) - - def test_groupby_categorical_index(self): - - levels = ['foo', 'bar', 'baz', 'qux'] - codes = np.random.randint(0, 4, size=20) - cats = Categorical.from_codes(codes, levels, ordered=True) - df = DataFrame( - np.repeat( - np.arange(20), 4).reshape(-1, 4), columns=list('abcd')) - df['cats'] = cats - - # with a cat index - result = df.set_index('cats').groupby(level=0).sum() - expected = df[list('abcd')].groupby(cats.codes).sum() - expected.index = CategoricalIndex( - Categorical.from_codes( - [0, 1, 2, 3], levels, ordered=True), name='cats') - assert_frame_equal(result, expected) - - # with a cat column, should produce a cat index - result = df.groupby('cats').sum() - expected = df[list('abcd')].groupby(cats.codes).sum() - expected.index = CategoricalIndex( - Categorical.from_codes( - [0, 1, 2, 3], levels, ordered=True), name='cats') - assert_frame_equal(result, expected) - - def test_groupby_describe_categorical_columns(self): - # GH 11558 - cats = pd.CategoricalIndex(['qux', 'foo', 'baz', 'bar'], - categories=['foo', 'bar', 'baz', 'qux'], - ordered=True) - df = DataFrame(np.random.randn(20, 4), columns=cats) - result = df.groupby([1, 2, 3, 4] * 5).describe() - - tm.assert_index_equal(result.columns, cats) - tm.assert_categorical_equal(result.columns.values, cats.values) - - def test_groupby_unstack_categorical(self): - # GH11558 (example is taken from the original issue) - df = pd.DataFrame({'a': range(10), - 'medium': ['A', 'B'] * 5, - 'artist': list('XYXXY') * 2}) - df['medium'] = df['medium'].astype('category') - - gcat = df.groupby(['artist', 'medium'])['a'].count().unstack() - result = gcat.describe() - - exp_columns = pd.CategoricalIndex(['A', 'B'], ordered=False, - name='medium') - tm.assert_index_equal(result.columns, exp_columns) - tm.assert_categorical_equal(result.columns.values, exp_columns.values) - - result = gcat['A'] + gcat['B'] - expected = pd.Series([6, 4], index=pd.Index(['X', 'Y'], name='artist')) - tm.assert_series_equal(result, expected) - - def test_groupby_groups_datetimeindex(self): - # #1430 - from pandas.tseries.api import DatetimeIndex - periods = 1000 - ind = DatetimeIndex(start='2012/1/1', freq='5min', periods=periods) - df = DataFrame({'high': np.arange(periods), - 'low': np.arange(periods)}, index=ind) - grouped = df.groupby(lambda x: datetime(x.year, x.month, x.day)) - - # it works! 
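- # (illustrative) the lambda truncates each Timestamp in the index to
- # its day, e.g. a 2012-01-01 00:05 stamp maps to datetime(2012, 1, 1),
- # so every row from one calendar day lands in the same group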
- groups = grouped.groups - tm.assertIsInstance(list(groups.keys())[0], datetime) - - def test_groupby_groups_datetimeindex_tz(self): - # GH 3950 - dates = ['2011-07-19 07:00:00', '2011-07-19 08:00:00', - '2011-07-19 09:00:00', '2011-07-19 07:00:00', - '2011-07-19 08:00:00', '2011-07-19 09:00:00'] - df = DataFrame({'label': ['a', 'a', 'a', 'b', 'b', 'b'], - 'datetime': dates, - 'value1': np.arange(6, dtype='int64'), - 'value2': [1, 2] * 3}) - df['datetime'] = df['datetime'].apply( - lambda d: Timestamp(d, tz='US/Pacific')) - - exp_idx1 = pd.DatetimeIndex(['2011-07-19 07:00:00', - '2011-07-19 07:00:00', - '2011-07-19 08:00:00', - '2011-07-19 08:00:00', - '2011-07-19 09:00:00', - '2011-07-19 09:00:00'], - tz='US/Pacific', name='datetime') - exp_idx2 = Index(['a', 'b'] * 3, name='label') - exp_idx = MultiIndex.from_arrays([exp_idx1, exp_idx2]) - expected = DataFrame({'value1': [0, 3, 1, 4, 2, 5], - 'value2': [1, 2, 2, 1, 1, 2]}, - index=exp_idx, columns=['value1', 'value2']) - - result = df.groupby(['datetime', 'label']).sum() - assert_frame_equal(result, expected) - - # by level - didx = pd.DatetimeIndex(dates, tz='Asia/Tokyo') - df = DataFrame({'value1': np.arange(6, dtype='int64'), - 'value2': [1, 2, 3, 1, 2, 3]}, - index=didx) - - exp_idx = pd.DatetimeIndex(['2011-07-19 07:00:00', - '2011-07-19 08:00:00', - '2011-07-19 09:00:00'], tz='Asia/Tokyo') - expected = DataFrame({'value1': [3, 5, 7], 'value2': [2, 4, 6]}, - index=exp_idx, columns=['value1', 'value2']) - - result = df.groupby(level=0).sum() - assert_frame_equal(result, expected) - - def test_groupby_multi_timezone(self): - - # combining multiple / different timezones yields UTC - - data = """0,2000-01-28 16:47:00,America/Chicago -1,2000-01-29 16:48:00,America/Chicago -2,2000-01-30 16:49:00,America/Los_Angeles -3,2000-01-31 16:50:00,America/Chicago -4,2000-01-01 16:50:00,America/New_York""" - - df = pd.read_csv(StringIO(data), header=None, - names=['value', 'date', 'tz']) - result = df.groupby('tz').date.apply( - lambda x: pd.to_datetime(x).dt.tz_localize(x.name)) - - expected = Series([Timestamp('2000-01-28 16:47:00-0600', - tz='America/Chicago'), - Timestamp('2000-01-29 16:48:00-0600', - tz='America/Chicago'), - Timestamp('2000-01-30 16:49:00-0800', - tz='America/Los_Angeles'), - Timestamp('2000-01-31 16:50:00-0600', - tz='America/Chicago'), - Timestamp('2000-01-01 16:50:00-0500', - tz='America/New_York')], - name='date', - dtype=object) - assert_series_equal(result, expected) - - tz = 'America/Chicago' - res_values = df.groupby('tz').date.get_group(tz) - result = pd.to_datetime(res_values).dt.tz_localize(tz) - exp_values = Series(['2000-01-28 16:47:00', '2000-01-29 16:48:00', - '2000-01-31 16:50:00'], - index=[0, 1, 3], name='date') - expected = pd.to_datetime(exp_values).dt.tz_localize(tz) - assert_series_equal(result, expected) - - def test_groupby_groups_periods(self): - dates = ['2011-07-19 07:00:00', '2011-07-19 08:00:00', - '2011-07-19 09:00:00', '2011-07-19 07:00:00', - '2011-07-19 08:00:00', '2011-07-19 09:00:00'] - df = DataFrame({'label': ['a', 'a', 'a', 'b', 'b', 'b'], - 'period': [pd.Period(d, freq='H') for d in dates], - 'value1': np.arange(6, dtype='int64'), - 'value2': [1, 2] * 3}) - - exp_idx1 = pd.PeriodIndex(['2011-07-19 07:00:00', - '2011-07-19 07:00:00', - '2011-07-19 08:00:00', - '2011-07-19 08:00:00', - '2011-07-19 09:00:00', - '2011-07-19 09:00:00'], - freq='H', name='period') - exp_idx2 = Index(['a', 'b'] * 3, name='label') - exp_idx = MultiIndex.from_arrays([exp_idx1, exp_idx2]) - expected = 
DataFrame({'value1': [0, 3, 1, 4, 2, 5], - 'value2': [1, 2, 2, 1, 1, 2]}, - index=exp_idx, columns=['value1', 'value2']) - - result = df.groupby(['period', 'label']).sum() - assert_frame_equal(result, expected) - - # by level - didx = pd.PeriodIndex(dates, freq='H') - df = DataFrame({'value1': np.arange(6, dtype='int64'), - 'value2': [1, 2, 3, 1, 2, 3]}, - index=didx) - - exp_idx = pd.PeriodIndex(['2011-07-19 07:00:00', - '2011-07-19 08:00:00', - '2011-07-19 09:00:00'], freq='H') - expected = DataFrame({'value1': [3, 5, 7], 'value2': [2, 4, 6]}, - index=exp_idx, columns=['value1', 'value2']) - - result = df.groupby(level=0).sum() - assert_frame_equal(result, expected) - - def test_groupby_reindex_inside_function(self): - from pandas.tseries.api import DatetimeIndex - - periods = 1000 - ind = DatetimeIndex(start='2012/1/1', freq='5min', periods=periods) - df = DataFrame({'high': np.arange( - periods), 'low': np.arange(periods)}, index=ind) - - def agg_before(hour, func, fix=False): - """ - Run an aggregate func on the subset of data. - """ - - def _func(data): - d = data.select(lambda x: x.hour < 11).dropna() - if fix: - data[data.index[0]] - if len(d) == 0: - return None - return func(d) - - return _func - - def afunc(data): - d = data.select(lambda x: x.hour < 11).dropna() - return np.max(d) - - grouped = df.groupby(lambda x: datetime(x.year, x.month, x.day)) - closure_bad = grouped.agg({'high': agg_before(11, np.max)}) - closure_good = grouped.agg({'high': agg_before(11, np.max, True)}) - - assert_frame_equal(closure_bad, closure_good) - - def test_multiindex_columns_empty_level(self): - l = [['count', 'values'], ['to filter', '']] - midx = MultiIndex.from_tuples(l) - - df = DataFrame([[long(1), 'A']], columns=midx) - - grouped = df.groupby('to filter').groups - self.assertEqual(grouped['A'], [0]) - - grouped = df.groupby([('to filter', '')]).groups - self.assertEqual(grouped['A'], [0]) - - df = DataFrame([[long(1), 'A'], [long(2), 'B']], columns=midx) - - expected = df.groupby('to filter').groups - result = df.groupby([('to filter', '')]).groups - self.assertEqual(result, expected) - - df = DataFrame([[long(1), 'A'], [long(2), 'A']], columns=midx) - - expected = df.groupby('to filter').groups - result = df.groupby([('to filter', '')]).groups - tm.assert_dict_equal(result, expected) - - def test_cython_median(self): - df = DataFrame(np.random.randn(1000)) - df.values[::2] = np.nan - - labels = np.random.randint(0, 50, size=1000).astype(float) - labels[::17] = np.nan - - result = df.groupby(labels).median() - exp = df.groupby(labels).agg(nanops.nanmedian) - assert_frame_equal(result, exp) - - df = DataFrame(np.random.randn(1000, 5)) - rs = df.groupby(labels).agg(np.median) - xp = df.groupby(labels).median() - assert_frame_equal(rs, xp) - - def test_median_empty_bins(self): - df = pd.DataFrame(np.random.randint(0, 44, 500)) - - grps = range(0, 55, 5) - bins = pd.cut(df[0], grps) - - result = df.groupby(bins).median() - expected = df.groupby(bins).agg(lambda x: x.median()) - assert_frame_equal(result, expected) - - def test_groupby_categorical_no_compress(self): - data = Series(np.random.randn(9)) - - codes = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]) - cats = Categorical.from_codes(codes, [0, 1, 2], ordered=True) - - result = data.groupby(cats).mean() - exp = data.groupby(codes).mean() - - exp.index = CategoricalIndex(exp.index, categories=cats.categories, - ordered=cats.ordered) - assert_series_equal(result, exp) - - codes = np.array([0, 0, 0, 1, 1, 1, 3, 3, 3]) - cats = 
Categorical.from_codes(codes, [0, 1, 2, 3], ordered=True) - - result = data.groupby(cats).mean() - exp = data.groupby(codes).mean().reindex(cats.categories) - exp.index = CategoricalIndex(exp.index, categories=cats.categories, - ordered=cats.ordered) - assert_series_equal(result, exp) - - cats = Categorical(["a", "a", "a", "b", "b", "b", "c", "c", "c"], - categories=["a", "b", "c", "d"], ordered=True) - data = DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 4, 5], "b": cats}) - - result = data.groupby("b").mean() - result = result["a"].values - exp = np.array([1, 2, 4, np.nan]) - self.assert_numpy_array_equal(result, exp) - - def test_groupby_non_arithmetic_agg_types(self): - # GH9311, GH6620 - df = pd.DataFrame([{'a': 1, - 'b': 1}, {'a': 1, - 'b': 2}, {'a': 2, - 'b': 3}, {'a': 2, - 'b': 4}]) - - dtypes = ['int8', 'int16', 'int32', 'int64', 'float32', 'float64'] - - grp_exp = {'first': {'df': [{'a': 1, - 'b': 1}, {'a': 2, - 'b': 3}]}, - 'last': {'df': [{'a': 1, - 'b': 2}, {'a': 2, - 'b': 4}]}, - 'min': {'df': [{'a': 1, - 'b': 1}, {'a': 2, - 'b': 3}]}, - 'max': {'df': [{'a': 1, - 'b': 2}, {'a': 2, - 'b': 4}]}, - 'nth': {'df': [{'a': 1, - 'b': 2}, {'a': 2, - 'b': 4}], - 'args': [1]}, - 'count': {'df': [{'a': 1, - 'b': 2}, {'a': 2, - 'b': 2}], - 'out_type': 'int64'}} - - for dtype in dtypes: - df_in = df.copy() - df_in['b'] = df_in.b.astype(dtype) - - for method, data in compat.iteritems(grp_exp): - if 'args' not in data: - data['args'] = [] - - if 'out_type' in data: - out_type = data['out_type'] - else: - out_type = dtype - - exp = data['df'] - df_out = pd.DataFrame(exp) - - df_out['b'] = df_out.b.astype(out_type) - df_out.set_index('a', inplace=True) - - grpd = df_in.groupby('a') - t = getattr(grpd, method)(*data['args']) - assert_frame_equal(t, df_out) - - def test_groupby_non_arithmetic_agg_intlike_precision(self): - # GH9311, GH6620 - c = 24650000000000000 - - inputs = ((Timestamp('2011-01-15 12:50:28.502376'), - Timestamp('2011-01-20 12:50:28.593448')), (1 + c, 2 + c)) - - for i in inputs: - df = pd.DataFrame([{'a': 1, 'b': i[0]}, {'a': 1, 'b': i[1]}]) - - grp_exp = {'first': {'expected': i[0]}, - 'last': {'expected': i[1]}, - 'min': {'expected': i[0]}, - 'max': {'expected': i[1]}, - 'nth': {'expected': i[1], - 'args': [1]}, - 'count': {'expected': 2}} - - for method, data in compat.iteritems(grp_exp): - if 'args' not in data: - data['args'] = [] - - grpd = df.groupby('a') - res = getattr(grpd, method)(*data['args']) - self.assertEqual(res.iloc[0].b, data['expected']) - - def test_groupby_first_datetime64(self): - df = DataFrame([(1, 1351036800000000000), (2, 1351036800000000000)]) - df[1] = df[1].view('M8[ns]') - - self.assertTrue(issubclass(df[1].dtype.type, np.datetime64)) - - result = df.groupby(level=0).first() - got_dt = result[1].dtype - self.assertTrue(issubclass(got_dt.type, np.datetime64)) - - result = df[1].groupby(level=0).first() - got_dt = result.dtype - self.assertTrue(issubclass(got_dt.type, np.datetime64)) - - def test_groupby_max_datetime64(self): - # GH 5869 - # datetimelike dtype conversion from int - df = DataFrame(dict(A=Timestamp('20130101'), B=np.arange(5))) - expected = df.groupby('A')['A'].apply(lambda x: x.max()) - result = df.groupby('A')['A'].max() - assert_series_equal(result, expected) - - def test_groupby_datetime64_32_bit(self): - # GH 6410 / numpy 4328 - # 32-bit under 1.9-dev indexing issue - - df = DataFrame({"A": range(2), "B": [pd.Timestamp('2000-01-1')] * 2}) - result = df.groupby("A")["B"].transform(min) - expected = Series([pd.Timestamp('2000-01-1')] * 2, 
name='B')
- assert_series_equal(result, expected)
-
- def test_groupby_categorical_unequal_len(self):
- # GH3011
- series = Series([np.nan, np.nan, 1, 1, 2, 2, 3, 3, 4, 4])
- # the error is only raised with a Categorical grouper, not with a
- # Series of dtype 'category'
- bins = pd.cut(series.dropna().values, 4)
-
- # len(bins) != len(series) here
- self.assertRaises(ValueError, lambda: series.groupby(bins).mean())
-
- def test_groupby_multiindex_missing_pair(self):
- # GH9049
- df = DataFrame({'group1': ['a', 'a', 'a', 'b'],
- 'group2': ['c', 'c', 'd', 'c'],
- 'value': [1, 1, 1, 5]})
- df = df.set_index(['group1', 'group2'])
- df_grouped = df.groupby(level=['group1', 'group2'], sort=True)
-
- res = df_grouped.agg('sum')
- idx = MultiIndex.from_tuples(
- [('a', 'c'), ('a', 'd'), ('b', 'c')], names=['group1', 'group2'])
- exp = DataFrame([[2], [1], [5]], index=idx, columns=['value'])
-
- tm.assert_frame_equal(res, exp)
-
- def test_groupby_multiindex_not_lexsorted(self):
- # GH 11640
-
- # define the lexsorted version
- lexsorted_mi = MultiIndex.from_tuples(
- [('a', ''), ('b1', 'c1'), ('b2', 'c2')], names=['b', 'c'])
- lexsorted_df = DataFrame([[1, 3, 4]], columns=lexsorted_mi)
- self.assertTrue(lexsorted_df.columns.is_lexsorted())
-
- # define the non-lexsorted version
- not_lexsorted_df = DataFrame(columns=['a', 'b', 'c', 'd'],
- data=[[1, 'b1', 'c1', 3],
- [1, 'b2', 'c2', 4]])
- not_lexsorted_df = not_lexsorted_df.pivot_table(
- index='a', columns=['b', 'c'], values='d')
- not_lexsorted_df = not_lexsorted_df.reset_index()
- self.assertFalse(not_lexsorted_df.columns.is_lexsorted())
-
- # compare the results
- tm.assert_frame_equal(lexsorted_df, not_lexsorted_df)
-
- expected = lexsorted_df.groupby('a').mean()
- with tm.assert_produces_warning(com.PerformanceWarning):
- result = not_lexsorted_df.groupby('a').mean()
- tm.assert_frame_equal(expected, result)
-
- # a transforming function should work regardless of sort
- # GH 14776
- df = DataFrame({'x': ['a', 'a', 'b', 'a'],
- 'y': [1, 1, 2, 2],
- 'z': [1, 2, 3, 4]}).set_index(['x', 'y'])
- self.assertFalse(df.index.is_lexsorted())
-
- for level in [0, 1, [0, 1]]:
- for sort in [False, True]:
- result = df.groupby(level=level, sort=sort).apply(
- DataFrame.drop_duplicates)
- expected = df
- tm.assert_frame_equal(expected, result)
-
- result = df.sort_index().groupby(level=level, sort=sort).apply(
- DataFrame.drop_duplicates)
- expected = df.sort_index()
- tm.assert_frame_equal(expected, result)
-
- def test_groupby_levels_and_columns(self):
- # GH9344, GH9049
- idx_names = ['x', 'y']
- idx = pd.MultiIndex.from_tuples(
- [(1, 1), (1, 2), (3, 4), (5, 6)], names=idx_names)
- df = pd.DataFrame(np.arange(12).reshape(-1, 3), index=idx)
-
- by_levels = df.groupby(level=idx_names).mean()
- # reset_index changes columns dtype to object
- by_columns = df.reset_index().groupby(idx_names).mean()
-
- tm.assert_frame_equal(by_levels, by_columns, check_column_type=False)
-
- by_columns.columns = pd.Index(by_columns.columns, dtype=np.int64)
- tm.assert_frame_equal(by_levels, by_columns)
-
- def test_gb_apply_list_of_unequal_len_arrays(self):
-
- # GH1738
- df = DataFrame({'group1': ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a',
- 'b', 'b', 'b'],
- 'group2': ['c', 'c', 'd', 'd', 'd', 'e', 'c', 'c', 'd',
- 'd', 'd', 'e'],
- 'weight': [1.1, 2, 3, 4, 5, 6, 2, 4, 6, 8, 1, 2],
- 'value': [7.1, 8, 9, 10, 11, 12, 8, 7, 6, 5, 4, 3]})
- df = df.set_index(['group1', 'group2'])
- df_grouped = df.groupby(level=['group1', 'group2'], sort=True)
-
- def noddy(value, weight):
- out = np.array(value * weight).repeat(3)
- return out
-
- # the kernel function returns arrays of unequal length; pandas
- # sniffs the first one, sees it's an array and not a list, and
- # assumes the rest are of equal length, and so tries a vstack
-
- # don't die
- df_grouped.apply(lambda x: noddy(x.value, x.weight))
-
- def test_groupby_with_empty(self):
- index = pd.DatetimeIndex(())
- data = ()
- series = pd.Series(data, index)
- grouper = pd.tseries.resample.TimeGrouper('D')
- grouped = series.groupby(grouper)
- assert next(iter(grouped), None) is None
-
- def test_groupby_with_single_column(self):
- df = pd.DataFrame({'a': list('abssbab')})
- tm.assert_frame_equal(df.groupby('a').get_group('a'), df.iloc[[0, 5]])
- # GH 13530
- exp = pd.DataFrame([], index=pd.Index(['a', 'b', 's'], name='a'))
- tm.assert_frame_equal(df.groupby('a').count(), exp)
- tm.assert_frame_equal(df.groupby('a').sum(), exp)
- tm.assert_frame_equal(df.groupby('a').nth(1), exp)
-
- def test_groupby_with_small_elem(self):
- # GH 8542
- # length=2
- df = pd.DataFrame({'event': ['start', 'start'],
- 'change': [1234, 5678]},
- index=pd.DatetimeIndex(['2014-09-10', '2013-10-10']))
- grouped = df.groupby([pd.TimeGrouper(freq='M'), 'event'])
- self.assertEqual(len(grouped.groups), 2)
- self.assertEqual(grouped.ngroups, 2)
- self.assertIn((pd.Timestamp('2014-09-30'), 'start'), grouped.groups)
- self.assertIn((pd.Timestamp('2013-10-31'), 'start'), grouped.groups)
-
- res = grouped.get_group((pd.Timestamp('2014-09-30'), 'start'))
- tm.assert_frame_equal(res, df.iloc[[0], :])
- res = grouped.get_group((pd.Timestamp('2013-10-31'), 'start'))
- tm.assert_frame_equal(res, df.iloc[[1], :])
-
- df = pd.DataFrame({'event': ['start', 'start', 'start'],
- 'change': [1234, 5678, 9123]},
- index=pd.DatetimeIndex(['2014-09-10', '2013-10-10',
- '2014-09-15']))
- grouped = df.groupby([pd.TimeGrouper(freq='M'), 'event'])
- self.assertEqual(len(grouped.groups), 2)
- self.assertEqual(grouped.ngroups, 2)
- self.assertIn((pd.Timestamp('2014-09-30'), 'start'), grouped.groups)
- self.assertIn((pd.Timestamp('2013-10-31'), 'start'), grouped.groups)
-
- res = grouped.get_group((pd.Timestamp('2014-09-30'), 'start'))
- tm.assert_frame_equal(res, df.iloc[[0, 2], :])
- res = grouped.get_group((pd.Timestamp('2013-10-31'), 'start'))
- tm.assert_frame_equal(res, df.iloc[[1], :])
-
- # length=3
- df = pd.DataFrame({'event': ['start', 'start', 'start'],
- 'change': [1234, 5678, 9123]},
- index=pd.DatetimeIndex(['2014-09-10', '2013-10-10',
- '2014-08-05']))
- grouped = df.groupby([pd.TimeGrouper(freq='M'), 'event'])
- self.assertEqual(len(grouped.groups), 3)
- self.assertEqual(grouped.ngroups, 3)
- self.assertIn((pd.Timestamp('2014-09-30'), 'start'), grouped.groups)
- self.assertIn((pd.Timestamp('2013-10-31'), 'start'), grouped.groups)
- self.assertIn((pd.Timestamp('2014-08-31'), 'start'), grouped.groups)
-
- res = grouped.get_group((pd.Timestamp('2014-09-30'), 'start'))
- tm.assert_frame_equal(res, df.iloc[[0], :])
- res = grouped.get_group((pd.Timestamp('2013-10-31'), 'start'))
- tm.assert_frame_equal(res, df.iloc[[1], :])
- res = grouped.get_group((pd.Timestamp('2014-08-31'), 'start'))
- tm.assert_frame_equal(res, df.iloc[[2], :])
-
- def test_groupby_with_timezone_selection(self):
- # GH 11616
- # Test that column selection returns output in correct timezone.
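- # both spellings below should be equivalent: aggregate-then-select
- # (groupby(...).max()['time']) and select-then-aggregate
- # (groupby(...)['time'].max()), and either way the UTC tz of 'time'
- # should survive the aggregation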
- np.random.seed(42)
- df = pd.DataFrame({
- 'factor': np.random.randint(0, 3, size=60),
- 'time': pd.date_range('01/01/2000 00:00', periods=60,
- freq='s', tz='UTC')
- })
- df1 = df.groupby('factor').max()['time']
- df2 = df.groupby('factor')['time'].max()
- tm.assert_series_equal(df1, df2)
-
- def test_timezone_info(self):
- # GH 11682
- # Timezone info lost when broadcasting scalar datetime to DataFrame
- tm._skip_if_no_pytz()
- import pytz
-
- df = pd.DataFrame({'a': [1], 'b': [datetime.now(pytz.utc)]})
- self.assertEqual(df['b'][0].tzinfo, pytz.utc)
- df = pd.DataFrame({'a': [1, 2, 3]})
- df['b'] = datetime.now(pytz.utc)
- self.assertEqual(df['b'][0].tzinfo, pytz.utc)
-
- def test_groupby_with_timegrouper(self):
- # GH 4161
- # TimeGrouper requires a sorted index
- # also verifies that the resultant index has the correct name
- import datetime as DT
- df_original = DataFrame({
- 'Buyer': 'Carl Carl Carl Carl Joe Carl'.split(),
- 'Quantity': [18, 3, 5, 1, 9, 3],
- 'Date': [
- DT.datetime(2013, 9, 1, 13, 0),
- DT.datetime(2013, 9, 1, 13, 5),
- DT.datetime(2013, 10, 1, 20, 0),
- DT.datetime(2013, 10, 3, 10, 0),
- DT.datetime(2013, 12, 2, 12, 0),
- DT.datetime(2013, 9, 2, 14, 0),
- ]
- })
-
- # GH 6908 change target column's order
- df_reordered = df_original.sort_values(by='Quantity')
-
- for df in [df_original, df_reordered]:
- df = df.set_index(['Date'])
-
- expected = DataFrame(
- {'Quantity': np.nan},
- index=date_range('20130901 13:00:00',
- '20131205 13:00:00', freq='5D',
- name='Date', closed='left'))
- expected.iloc[[0, 6, 18], 0] = np.array(
- [24., 6., 9.], dtype='float64')
-
- result1 = df.resample('5D').sum()
- assert_frame_equal(result1, expected)
-
- df_sorted = df.sort_index()
- result2 = df_sorted.groupby(pd.TimeGrouper(freq='5D')).sum()
- assert_frame_equal(result2, expected)
-
- result3 = df.groupby(pd.TimeGrouper(freq='5D')).sum()
- assert_frame_equal(result3, expected)
-
- def test_groupby_with_timegrouper_methods(self):
- # GH 3881
- # make sure API of timegrouper conforms
-
- import datetime as DT
- df_original = pd.DataFrame({
- 'Branch': 'A A A A A B'.split(),
- 'Buyer': 'Carl Mark Carl Joe Joe Carl'.split(),
- 'Quantity': [1, 3, 5, 8, 9, 3],
- 'Date': [
- DT.datetime(2013, 1, 1, 13, 0),
- DT.datetime(2013, 1, 1, 13, 5),
- DT.datetime(2013, 10, 1, 20, 0),
- DT.datetime(2013, 10, 2, 10, 0),
- DT.datetime(2013, 12, 2, 12, 0),
- DT.datetime(2013, 12, 2, 14, 0),
- ]
- })
-
- df_sorted = df_original.sort_values(by='Quantity', ascending=False)
-
- for df in [df_original, df_sorted]:
- df = df.set_index('Date', drop=False)
- g = df.groupby(pd.TimeGrouper('6M'))
- self.assertTrue(g.group_keys)
- self.assertTrue(isinstance(g.grouper, pd.core.groupby.BinGrouper))
- groups = g.groups
- self.assertTrue(isinstance(groups, dict))
- self.assertTrue(len(groups) == 3)
-
- def test_timegrouper_with_reg_groups(self):
-
- # GH 3794
- # allow combination of timegrouper/reg groups
-
- import datetime as DT
-
- df_original = DataFrame({
- 'Branch': 'A A A A A A A B'.split(),
- 'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
- 'Quantity': [1, 3, 5, 1, 8, 1, 9, 3],
- 'Date': [
- DT.datetime(2013, 1, 1, 13, 0),
- DT.datetime(2013, 1, 1, 13, 5),
- DT.datetime(2013, 10, 1, 20, 0),
- DT.datetime(2013, 10, 2, 10, 0),
- DT.datetime(2013, 10, 1, 20, 0),
- DT.datetime(2013, 10, 2, 10, 0),
- DT.datetime(2013, 12, 2, 12, 0),
- DT.datetime(2013, 12, 2, 14, 0),
- ]
- }).set_index('Date')
-
- df_sorted = df_original.sort_values(by='Quantity', ascending=False)
-
- for df in [df_original,
df_sorted]: - expected = DataFrame({ - 'Buyer': 'Carl Joe Mark'.split(), - 'Quantity': [10, 18, 3], - 'Date': [ - DT.datetime(2013, 12, 31, 0, 0), - DT.datetime(2013, 12, 31, 0, 0), - DT.datetime(2013, 12, 31, 0, 0), - ] - }).set_index(['Date', 'Buyer']) - - result = df.groupby([pd.Grouper(freq='A'), 'Buyer']).sum() - assert_frame_equal(result, expected) - - expected = DataFrame({ - 'Buyer': 'Carl Mark Carl Joe'.split(), - 'Quantity': [1, 3, 9, 18], - 'Date': [ - DT.datetime(2013, 1, 1, 0, 0), - DT.datetime(2013, 1, 1, 0, 0), - DT.datetime(2013, 7, 1, 0, 0), - DT.datetime(2013, 7, 1, 0, 0), - ] - }).set_index(['Date', 'Buyer']) - result = df.groupby([pd.Grouper(freq='6MS'), 'Buyer']).sum() - assert_frame_equal(result, expected) - - df_original = DataFrame({ - 'Branch': 'A A A A A A A B'.split(), - 'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(), - 'Quantity': [1, 3, 5, 1, 8, 1, 9, 3], - 'Date': [ - DT.datetime(2013, 10, 1, 13, 0), - DT.datetime(2013, 10, 1, 13, 5), - DT.datetime(2013, 10, 1, 20, 0), - DT.datetime(2013, 10, 2, 10, 0), - DT.datetime(2013, 10, 1, 20, 0), - DT.datetime(2013, 10, 2, 10, 0), - DT.datetime(2013, 10, 2, 12, 0), - DT.datetime(2013, 10, 2, 14, 0), - ] - }).set_index('Date') - - df_sorted = df_original.sort_values(by='Quantity', ascending=False) - for df in [df_original, df_sorted]: - - expected = DataFrame({ - 'Buyer': 'Carl Joe Mark Carl Joe'.split(), - 'Quantity': [6, 8, 3, 4, 10], - 'Date': [ - DT.datetime(2013, 10, 1, 0, 0), - DT.datetime(2013, 10, 1, 0, 0), - DT.datetime(2013, 10, 1, 0, 0), - DT.datetime(2013, 10, 2, 0, 0), - DT.datetime(2013, 10, 2, 0, 0), - ] - }).set_index(['Date', 'Buyer']) - - result = df.groupby([pd.Grouper(freq='1D'), 'Buyer']).sum() - assert_frame_equal(result, expected) - - result = df.groupby([pd.Grouper(freq='1M'), 'Buyer']).sum() - expected = DataFrame({ - 'Buyer': 'Carl Joe Mark'.split(), - 'Quantity': [10, 18, 3], - 'Date': [ - DT.datetime(2013, 10, 31, 0, 0), - DT.datetime(2013, 10, 31, 0, 0), - DT.datetime(2013, 10, 31, 0, 0), - ] - }).set_index(['Date', 'Buyer']) - assert_frame_equal(result, expected) - - # passing the name - df = df.reset_index() - result = df.groupby([pd.Grouper(freq='1M', key='Date'), 'Buyer' - ]).sum() - assert_frame_equal(result, expected) - - with self.assertRaises(KeyError): - df.groupby([pd.Grouper(freq='1M', key='foo'), 'Buyer']).sum() - - # passing the level - df = df.set_index('Date') - result = df.groupby([pd.Grouper(freq='1M', level='Date'), 'Buyer' - ]).sum() - assert_frame_equal(result, expected) - result = df.groupby([pd.Grouper(freq='1M', level=0), 'Buyer']).sum( - ) - assert_frame_equal(result, expected) - - with self.assertRaises(ValueError): - df.groupby([pd.Grouper(freq='1M', level='foo'), - 'Buyer']).sum() - - # multi names - df = df.copy() - df['Date'] = df.index + pd.offsets.MonthEnd(2) - result = df.groupby([pd.Grouper(freq='1M', key='Date'), 'Buyer' - ]).sum() - expected = DataFrame({ - 'Buyer': 'Carl Joe Mark'.split(), - 'Quantity': [10, 18, 3], - 'Date': [ - DT.datetime(2013, 11, 30, 0, 0), - DT.datetime(2013, 11, 30, 0, 0), - DT.datetime(2013, 11, 30, 0, 0), - ] - }).set_index(['Date', 'Buyer']) - assert_frame_equal(result, expected) - - # error as we have both a level and a name! 
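- # (sketch) pd.Grouper accepts either key= (a column name) or level=
- # (an index level), never both, so the call below should raise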
- with self.assertRaises(ValueError): - df.groupby([pd.Grouper(freq='1M', key='Date', - level='Date'), 'Buyer']).sum() - - # single groupers - expected = DataFrame({'Quantity': [31], - 'Date': [DT.datetime(2013, 10, 31, 0, 0) - ]}).set_index('Date') - result = df.groupby(pd.Grouper(freq='1M')).sum() - assert_frame_equal(result, expected) - - result = df.groupby([pd.Grouper(freq='1M')]).sum() - assert_frame_equal(result, expected) - - expected = DataFrame({'Quantity': [31], - 'Date': [DT.datetime(2013, 11, 30, 0, 0) - ]}).set_index('Date') - result = df.groupby(pd.Grouper(freq='1M', key='Date')).sum() - assert_frame_equal(result, expected) - - result = df.groupby([pd.Grouper(freq='1M', key='Date')]).sum() - assert_frame_equal(result, expected) - - # GH 6764 multiple grouping with/without sort - df = DataFrame({ - 'date': pd.to_datetime([ - '20121002', '20121007', '20130130', '20130202', '20130305', - '20121002', '20121207', '20130130', '20130202', '20130305', - '20130202', '20130305' - ]), - 'user_id': [1, 1, 1, 1, 1, 3, 3, 3, 5, 5, 5, 5], - 'whole_cost': [1790, 364, 280, 259, 201, 623, 90, 312, 359, 301, - 359, 801], - 'cost1': [12, 15, 10, 24, 39, 1, 0, 90, 45, 34, 1, 12] - }).set_index('date') - - for freq in ['D', 'M', 'A', 'Q-APR']: - expected = df.groupby('user_id')[ - 'whole_cost'].resample( - freq).sum().dropna().reorder_levels( - ['date', 'user_id']).sortlevel().astype('int64') - expected.name = 'whole_cost' - - result1 = df.sort_index().groupby([pd.TimeGrouper(freq=freq), - 'user_id'])['whole_cost'].sum() - assert_series_equal(result1, expected) - - result2 = df.groupby([pd.TimeGrouper(freq=freq), 'user_id'])[ - 'whole_cost'].sum() - assert_series_equal(result2, expected) - - def test_timegrouper_get_group(self): - # GH 6914 - - df_original = DataFrame({ - 'Buyer': 'Carl Joe Joe Carl Joe Carl'.split(), - 'Quantity': [18, 3, 5, 1, 9, 3], - 'Date': [datetime(2013, 9, 1, 13, 0), - datetime(2013, 9, 1, 13, 5), - datetime(2013, 10, 1, 20, 0), - datetime(2013, 10, 3, 10, 0), - datetime(2013, 12, 2, 12, 0), - datetime(2013, 9, 2, 14, 0), ] - }) - df_reordered = df_original.sort_values(by='Quantity') - - # single grouping - expected_list = [df_original.iloc[[0, 1, 5]], df_original.iloc[[2, 3]], - df_original.iloc[[4]]] - dt_list = ['2013-09-30', '2013-10-31', '2013-12-31'] - - for df in [df_original, df_reordered]: - grouped = df.groupby(pd.Grouper(freq='M', key='Date')) - for t, expected in zip(dt_list, expected_list): - dt = pd.Timestamp(t) - result = grouped.get_group(dt) - assert_frame_equal(result, expected) - - # multiple grouping - expected_list = [df_original.iloc[[1]], df_original.iloc[[3]], - df_original.iloc[[4]]] - g_list = [('Joe', '2013-09-30'), ('Carl', '2013-10-31'), - ('Joe', '2013-12-31')] - - for df in [df_original, df_reordered]: - grouped = df.groupby(['Buyer', pd.Grouper(freq='M', key='Date')]) - for (b, t), expected in zip(g_list, expected_list): - dt = pd.Timestamp(t) - result = grouped.get_group((b, dt)) - assert_frame_equal(result, expected) - - # with index - df_original = df_original.set_index('Date') - df_reordered = df_original.sort_values(by='Quantity') - - expected_list = [df_original.iloc[[0, 1, 5]], df_original.iloc[[2, 3]], - df_original.iloc[[4]]] - - for df in [df_original, df_reordered]: - grouped = df.groupby(pd.Grouper(freq='M')) - for t, expected in zip(dt_list, expected_list): - dt = pd.Timestamp(t) - result = grouped.get_group(dt) - assert_frame_equal(result, expected) - - def test_timegrouper_apply_return_type_series(self): - # Using `apply` 
with the `TimeGrouper` should give the - # same return type as an `apply` with a `Grouper`. - # Issue #11742 - df = pd.DataFrame({'date': ['10/10/2000', '11/10/2000'], - 'value': [10, 13]}) - df_dt = df.copy() - df_dt['date'] = pd.to_datetime(df_dt['date']) - - def sumfunc_series(x): - return pd.Series([x['value'].sum()], ('sum',)) - - expected = df.groupby(pd.Grouper(key='date')).apply(sumfunc_series) - result = (df_dt.groupby(pd.TimeGrouper(freq='M', key='date')) - .apply(sumfunc_series)) - assert_frame_equal(result.reset_index(drop=True), - expected.reset_index(drop=True)) - - def test_timegrouper_apply_return_type_value(self): - # Using `apply` with the `TimeGrouper` should give the - # same return type as an `apply` with a `Grouper`. - # Issue #11742 - df = pd.DataFrame({'date': ['10/10/2000', '11/10/2000'], - 'value': [10, 13]}) - df_dt = df.copy() - df_dt['date'] = pd.to_datetime(df_dt['date']) - - def sumfunc_value(x): - return x.value.sum() - - expected = df.groupby(pd.Grouper(key='date')).apply(sumfunc_value) - result = (df_dt.groupby(pd.TimeGrouper(freq='M', key='date')) - .apply(sumfunc_value)) - assert_series_equal(result.reset_index(drop=True), - expected.reset_index(drop=True)) - - def test_cumcount(self): - df = DataFrame([['a'], ['a'], ['a'], ['b'], ['a']], columns=['A']) - g = df.groupby('A') - sg = g.A - - expected = Series([0, 1, 2, 0, 3]) - - assert_series_equal(expected, g.cumcount()) - assert_series_equal(expected, sg.cumcount()) - - def test_cumcount_empty(self): - ge = DataFrame().groupby(level=0) - se = Series().groupby(level=0) - - # edge case, as this is usually considered float - e = Series(dtype='int64') - - assert_series_equal(e, ge.cumcount()) - assert_series_equal(e, se.cumcount()) - - def test_cumcount_dupe_index(self): - df = DataFrame([['a'], ['a'], ['a'], ['b'], ['a']], columns=['A'], - index=[0] * 5) - g = df.groupby('A') - sg = g.A - - expected = Series([0, 1, 2, 0, 3], index=[0] * 5) - - assert_series_equal(expected, g.cumcount()) - assert_series_equal(expected, sg.cumcount()) - - def test_cumcount_mi(self): - mi = MultiIndex.from_tuples([[0, 1], [1, 2], [2, 2], [2, 2], [1, 0]]) - df = DataFrame([['a'], ['a'], ['a'], ['b'], ['a']], columns=['A'], - index=mi) - g = df.groupby('A') - sg = g.A - - expected = Series([0, 1, 2, 0, 3], index=mi) - - assert_series_equal(expected, g.cumcount()) - assert_series_equal(expected, sg.cumcount()) - - def test_cumcount_groupby_not_col(self): - df = DataFrame([['a'], ['a'], ['a'], ['b'], ['a']], columns=['A'], - index=[0] * 5) - g = df.groupby([0, 0, 0, 1, 0]) - sg = g.A - - expected = Series([0, 1, 2, 0, 3], index=[0] * 5) - - assert_series_equal(expected, g.cumcount()) - assert_series_equal(expected, sg.cumcount()) - - def test_filter_series(self): - s = pd.Series([1, 3, 20, 5, 22, 24, 7]) - expected_odd = pd.Series([1, 3, 5, 7], index=[0, 1, 3, 6]) - expected_even = pd.Series([20, 22, 24], index=[2, 4, 5]) - grouper = s.apply(lambda x: x % 2) - grouped = s.groupby(grouper) - assert_series_equal( - grouped.filter(lambda x: x.mean() < 10), expected_odd) - assert_series_equal( - grouped.filter(lambda x: x.mean() > 10), expected_even) - # Test dropna=False. 
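- # (illustrative) with dropna=False the rows of filtered-out groups
- # are kept as NaN, so the result aligns with s.index instead of being
- # dropped, hence the reindex on the expected values below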
- assert_series_equal( - grouped.filter(lambda x: x.mean() < 10, dropna=False), - expected_odd.reindex(s.index)) - assert_series_equal( - grouped.filter(lambda x: x.mean() > 10, dropna=False), - expected_even.reindex(s.index)) - - def test_filter_single_column_df(self): - df = pd.DataFrame([1, 3, 20, 5, 22, 24, 7]) - expected_odd = pd.DataFrame([1, 3, 5, 7], index=[0, 1, 3, 6]) - expected_even = pd.DataFrame([20, 22, 24], index=[2, 4, 5]) - grouper = df[0].apply(lambda x: x % 2) - grouped = df.groupby(grouper) - assert_frame_equal( - grouped.filter(lambda x: x.mean() < 10), expected_odd) - assert_frame_equal( - grouped.filter(lambda x: x.mean() > 10), expected_even) - # Test dropna=False. - assert_frame_equal( - grouped.filter(lambda x: x.mean() < 10, dropna=False), - expected_odd.reindex(df.index)) - assert_frame_equal( - grouped.filter(lambda x: x.mean() > 10, dropna=False), - expected_even.reindex(df.index)) - - def test_filter_multi_column_df(self): - df = pd.DataFrame({'A': [1, 12, 12, 1], 'B': [1, 1, 1, 1]}) - grouper = df['A'].apply(lambda x: x % 2) - grouped = df.groupby(grouper) - expected = pd.DataFrame({'A': [12, 12], 'B': [1, 1]}, index=[1, 2]) - assert_frame_equal( - grouped.filter(lambda x: x['A'].sum() - x['B'].sum() > 10), - expected) - - def test_filter_mixed_df(self): - df = pd.DataFrame({'A': [1, 12, 12, 1], 'B': 'a b c d'.split()}) - grouper = df['A'].apply(lambda x: x % 2) - grouped = df.groupby(grouper) - expected = pd.DataFrame({'A': [12, 12], 'B': ['b', 'c']}, index=[1, 2]) - assert_frame_equal( - grouped.filter(lambda x: x['A'].sum() > 10), expected) - - def test_filter_out_all_groups(self): - s = pd.Series([1, 3, 20, 5, 22, 24, 7]) - grouper = s.apply(lambda x: x % 2) - grouped = s.groupby(grouper) - assert_series_equal(grouped.filter(lambda x: x.mean() > 1000), s[[]]) - df = pd.DataFrame({'A': [1, 12, 12, 1], 'B': 'a b c d'.split()}) - grouper = df['A'].apply(lambda x: x % 2) - grouped = df.groupby(grouper) - assert_frame_equal( - grouped.filter(lambda x: x['A'].sum() > 1000), df.ix[[]]) - - def test_filter_out_no_groups(self): - s = pd.Series([1, 3, 20, 5, 22, 24, 7]) - grouper = s.apply(lambda x: x % 2) - grouped = s.groupby(grouper) - filtered = grouped.filter(lambda x: x.mean() > 0) - assert_series_equal(filtered, s) - df = pd.DataFrame({'A': [1, 12, 12, 1], 'B': 'a b c d'.split()}) - grouper = df['A'].apply(lambda x: x % 2) - grouped = df.groupby(grouper) - filtered = grouped.filter(lambda x: x['A'].mean() > 0) - assert_frame_equal(filtered, df) - - def test_filter_out_all_groups_in_df(self): - # GH12768 - df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 0]}) - res = df.groupby('a') - res = res.filter(lambda x: x['b'].sum() > 5, dropna=False) - expected = pd.DataFrame({'a': [nan] * 3, 'b': [nan] * 3}) - assert_frame_equal(expected, res) - - df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 0]}) - res = df.groupby('a') - res = res.filter(lambda x: x['b'].sum() > 5, dropna=True) - expected = pd.DataFrame({'a': [], 'b': []}, dtype="int64") - assert_frame_equal(expected, res) - - def test_filter_condition_raises(self): - def raise_if_sum_is_zero(x): - if x.sum() == 0: - raise ValueError - else: - return x.sum() > 0 - - s = pd.Series([-1, 0, 1, 2]) - grouper = s.apply(lambda x: x % 2) - grouped = s.groupby(grouper) - self.assertRaises(TypeError, - lambda: grouped.filter(raise_if_sum_is_zero)) - - def test_filter_with_axis_in_groupby(self): - # issue 11041 - index = pd.MultiIndex.from_product([range(10), [0, 1]]) - data = pd.DataFrame( - np.arange(100).reshape(-1, 20), 
columns=index, dtype='int64') - result = data.groupby(level=0, - axis=1).filter(lambda x: x.iloc[0, 0] > 10) - expected = data.iloc[:, 12:20] - assert_frame_equal(result, expected) - - def test_filter_bad_shapes(self): - df = DataFrame({'A': np.arange(8), - 'B': list('aabbbbcc'), - 'C': np.arange(8)}) - s = df['B'] - g_df = df.groupby('B') - g_s = s.groupby(s) - - f = lambda x: x - self.assertRaises(TypeError, lambda: g_df.filter(f)) - self.assertRaises(TypeError, lambda: g_s.filter(f)) - - f = lambda x: x == 1 - self.assertRaises(TypeError, lambda: g_df.filter(f)) - self.assertRaises(TypeError, lambda: g_s.filter(f)) - - f = lambda x: np.outer(x, x) - self.assertRaises(TypeError, lambda: g_df.filter(f)) - self.assertRaises(TypeError, lambda: g_s.filter(f)) - - def test_filter_nan_is_false(self): - df = DataFrame({'A': np.arange(8), - 'B': list('aabbbbcc'), - 'C': np.arange(8)}) - s = df['B'] - g_df = df.groupby(df['B']) - g_s = s.groupby(s) - - f = lambda x: np.nan - assert_frame_equal(g_df.filter(f), df.loc[[]]) - assert_series_equal(g_s.filter(f), s[[]]) - - def test_filter_against_workaround(self): - np.random.seed(0) - # Series of ints - s = Series(np.random.randint(0, 100, 1000)) - grouper = s.apply(lambda x: np.round(x, -1)) - grouped = s.groupby(grouper) - f = lambda x: x.mean() > 10 - old_way = s[grouped.transform(f).astype('bool')] - new_way = grouped.filter(f) - assert_series_equal(new_way.sort_values(), old_way.sort_values()) - - # Series of floats - s = 100 * Series(np.random.random(1000)) - grouper = s.apply(lambda x: np.round(x, -1)) - grouped = s.groupby(grouper) - f = lambda x: x.mean() > 10 - old_way = s[grouped.transform(f).astype('bool')] - new_way = grouped.filter(f) - assert_series_equal(new_way.sort_values(), old_way.sort_values()) - - # Set up DataFrame of ints, floats, strings. - from string import ascii_lowercase - letters = np.array(list(ascii_lowercase)) - N = 1000 - random_letters = letters.take(np.random.randint(0, 26, N)) - df = DataFrame({'ints': Series(np.random.randint(0, 100, N)), - 'floats': N / 10 * Series(np.random.random(N)), - 'letters': Series(random_letters)}) - - # Group by ints; filter on floats. - grouped = df.groupby('ints') - old_way = df[grouped.floats. - transform(lambda x: x.mean() > N / 20).astype('bool')] - new_way = grouped.filter(lambda x: x['floats'].mean() > N / 20) - assert_frame_equal(new_way, old_way) - - # Group by floats (rounded); filter on strings. - grouper = df.floats.apply(lambda x: np.round(x, -1)) - grouped = df.groupby(grouper) - old_way = df[grouped.letters. - transform(lambda x: len(x) < N / 10).astype('bool')] - new_way = grouped.filter(lambda x: len(x.letters) < N / 10) - assert_frame_equal(new_way, old_way) - - # Group by strings; filter on ints. - grouped = df.groupby('letters') - old_way = df[grouped.ints. - transform(lambda x: x.mean() > N / 20).astype('bool')] - new_way = grouped.filter(lambda x: x['ints'].mean() > N / 20) - assert_frame_equal(new_way, old_way) - - def test_filter_using_len(self): - # BUG GH4447 - df = DataFrame({'A': np.arange(8), - 'B': list('aabbbbcc'), - 'C': np.arange(8)}) - grouped = df.groupby('B') - actual = grouped.filter(lambda x: len(x) > 2) - expected = DataFrame( - {'A': np.arange(2, 6), - 'B': list('bbbb'), - 'C': np.arange(2, 6)}, index=np.arange(2, 6)) - assert_frame_equal(actual, expected) - - actual = grouped.filter(lambda x: len(x) > 4) - expected = df.ix[[]] - assert_frame_equal(actual, expected) - - # Series have always worked properly, but we'll test anyway. 
- s = df['B'] - grouped = s.groupby(s) - actual = grouped.filter(lambda x: len(x) > 2) - expected = Series(4 * ['b'], index=np.arange(2, 6), name='B') - assert_series_equal(actual, expected) - - actual = grouped.filter(lambda x: len(x) > 4) - expected = s[[]] - assert_series_equal(actual, expected) - - def test_filter_maintains_ordering(self): - # Simple case: index is sequential. #4621 - df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3], - 'tag': [23, 45, 62, 24, 45, 34, 25, 62]}) - s = df['pid'] - grouped = df.groupby('tag') - actual = grouped.filter(lambda x: len(x) > 1) - expected = df.iloc[[1, 2, 4, 7]] - assert_frame_equal(actual, expected) - - grouped = s.groupby(df['tag']) - actual = grouped.filter(lambda x: len(x) > 1) - expected = s.iloc[[1, 2, 4, 7]] - assert_series_equal(actual, expected) - - # Now index is sequentially decreasing. - df.index = np.arange(len(df) - 1, -1, -1) - s = df['pid'] - grouped = df.groupby('tag') - actual = grouped.filter(lambda x: len(x) > 1) - expected = df.iloc[[1, 2, 4, 7]] - assert_frame_equal(actual, expected) - - grouped = s.groupby(df['tag']) - actual = grouped.filter(lambda x: len(x) > 1) - expected = s.iloc[[1, 2, 4, 7]] - assert_series_equal(actual, expected) - - # Index is shuffled. - SHUFFLED = [4, 6, 7, 2, 1, 0, 5, 3] - df.index = df.index[SHUFFLED] - s = df['pid'] - grouped = df.groupby('tag') - actual = grouped.filter(lambda x: len(x) > 1) - expected = df.iloc[[1, 2, 4, 7]] - assert_frame_equal(actual, expected) - - grouped = s.groupby(df['tag']) - actual = grouped.filter(lambda x: len(x) > 1) - expected = s.iloc[[1, 2, 4, 7]] - assert_series_equal(actual, expected) - - def test_filter_multiple_timestamp(self): - # GH 10114 - df = DataFrame({'A': np.arange(5, dtype='int64'), - 'B': ['foo', 'bar', 'foo', 'bar', 'bar'], - 'C': Timestamp('20130101')}) - - grouped = df.groupby(['B', 'C']) - - result = grouped['A'].filter(lambda x: True) - assert_series_equal(df['A'], result) - - result = grouped['A'].transform(len) - expected = Series([2, 3, 2, 3, 3], name='A') - assert_series_equal(result, expected) - - result = grouped.filter(lambda x: True) - assert_frame_equal(df, result) - - result = grouped.transform('sum') - expected = DataFrame({'A': [2, 8, 2, 8, 8]}) - assert_frame_equal(result, expected) - - result = grouped.transform(len) - expected = DataFrame({'A': [2, 3, 2, 3, 3]}) - assert_frame_equal(result, expected) - - def test_filter_and_transform_with_non_unique_int_index(self): - # GH4620 - index = [1, 1, 1, 2, 1, 1, 0, 1] - df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3], - 'tag': [23, 45, 62, 24, 45, 34, 25, 62]}, index=index) - grouped_df = df.groupby('tag') - ser = df['pid'] - grouped_ser = ser.groupby(df['tag']) - expected_indexes = [1, 2, 4, 7] - - # Filter DataFrame - actual = grouped_df.filter(lambda x: len(x) > 1) - expected = df.iloc[expected_indexes] - assert_frame_equal(actual, expected) - - actual = grouped_df.filter(lambda x: len(x) > 1, dropna=False) - expected = df.copy() - expected.iloc[[0, 3, 5, 6]] = np.nan - assert_frame_equal(actual, expected) - - # Filter Series - actual = grouped_ser.filter(lambda x: len(x) > 1) - expected = ser.take(expected_indexes) - assert_series_equal(actual, expected) - - actual = grouped_ser.filter(lambda x: len(x) > 1, dropna=False) - NA = np.nan - expected = Series([NA, 1, 1, NA, 2, NA, NA, 3], index, name='pid') - # ^ made manually because this can get confusing! 
- assert_series_equal(actual, expected) - - # Transform Series - actual = grouped_ser.transform(len) - expected = Series([1, 2, 2, 1, 2, 1, 1, 2], index, name='pid') - assert_series_equal(actual, expected) - - # Transform (a column from) DataFrameGroupBy - actual = grouped_df.pid.transform(len) - assert_series_equal(actual, expected) - - def test_filter_and_transform_with_multiple_non_unique_int_index(self): - # GH4620 - index = [1, 1, 1, 2, 0, 0, 0, 1] - df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3], - 'tag': [23, 45, 62, 24, 45, 34, 25, 62]}, index=index) - grouped_df = df.groupby('tag') - ser = df['pid'] - grouped_ser = ser.groupby(df['tag']) - expected_indexes = [1, 2, 4, 7] - - # Filter DataFrame - actual = grouped_df.filter(lambda x: len(x) > 1) - expected = df.iloc[expected_indexes] - assert_frame_equal(actual, expected) - - actual = grouped_df.filter(lambda x: len(x) > 1, dropna=False) - expected = df.copy() - expected.iloc[[0, 3, 5, 6]] = np.nan - assert_frame_equal(actual, expected) - - # Filter Series - actual = grouped_ser.filter(lambda x: len(x) > 1) - expected = ser.take(expected_indexes) - assert_series_equal(actual, expected) - - actual = grouped_ser.filter(lambda x: len(x) > 1, dropna=False) - NA = np.nan - expected = Series([NA, 1, 1, NA, 2, NA, NA, 3], index, name='pid') - # ^ made manually because this can get confusing! - assert_series_equal(actual, expected) - - # Transform Series - actual = grouped_ser.transform(len) - expected = Series([1, 2, 2, 1, 2, 1, 1, 2], index, name='pid') - assert_series_equal(actual, expected) - - # Transform (a column from) DataFrameGroupBy - actual = grouped_df.pid.transform(len) - assert_series_equal(actual, expected) - - def test_filter_and_transform_with_non_unique_float_index(self): - # GH4620 - index = np.array([1, 1, 1, 2, 1, 1, 0, 1], dtype=float) - df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3], - 'tag': [23, 45, 62, 24, 45, 34, 25, 62]}, index=index) - grouped_df = df.groupby('tag') - ser = df['pid'] - grouped_ser = ser.groupby(df['tag']) - expected_indexes = [1, 2, 4, 7] - - # Filter DataFrame - actual = grouped_df.filter(lambda x: len(x) > 1) - expected = df.iloc[expected_indexes] - assert_frame_equal(actual, expected) - - actual = grouped_df.filter(lambda x: len(x) > 1, dropna=False) - expected = df.copy() - expected.iloc[[0, 3, 5, 6]] = np.nan - assert_frame_equal(actual, expected) - - # Filter Series - actual = grouped_ser.filter(lambda x: len(x) > 1) - expected = ser.take(expected_indexes) - assert_series_equal(actual, expected) - - actual = grouped_ser.filter(lambda x: len(x) > 1, dropna=False) - NA = np.nan - expected = Series([NA, 1, 1, NA, 2, NA, NA, 3], index, name='pid') - # ^ made manually because this can get confusing! 
- assert_series_equal(actual, expected) - - # Transform Series - actual = grouped_ser.transform(len) - expected = Series([1, 2, 2, 1, 2, 1, 1, 2], index, name='pid') - assert_series_equal(actual, expected) - - # Transform (a column from) DataFrameGroupBy - actual = grouped_df.pid.transform(len) - assert_series_equal(actual, expected) - - def test_filter_and_transform_with_non_unique_timestamp_index(self): - # GH4620 - t0 = Timestamp('2013-09-30 00:05:00') - t1 = Timestamp('2013-10-30 00:05:00') - t2 = Timestamp('2013-11-30 00:05:00') - index = [t1, t1, t1, t2, t1, t1, t0, t1] - df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3], - 'tag': [23, 45, 62, 24, 45, 34, 25, 62]}, index=index) - grouped_df = df.groupby('tag') - ser = df['pid'] - grouped_ser = ser.groupby(df['tag']) - expected_indexes = [1, 2, 4, 7] - - # Filter DataFrame - actual = grouped_df.filter(lambda x: len(x) > 1) - expected = df.iloc[expected_indexes] - assert_frame_equal(actual, expected) - - actual = grouped_df.filter(lambda x: len(x) > 1, dropna=False) - expected = df.copy() - expected.iloc[[0, 3, 5, 6]] = np.nan - assert_frame_equal(actual, expected) - - # Filter Series - actual = grouped_ser.filter(lambda x: len(x) > 1) - expected = ser.take(expected_indexes) - assert_series_equal(actual, expected) - - actual = grouped_ser.filter(lambda x: len(x) > 1, dropna=False) - NA = np.nan - expected = Series([NA, 1, 1, NA, 2, NA, NA, 3], index, name='pid') - # ^ made manually because this can get confusing! - assert_series_equal(actual, expected) - - # Transform Series - actual = grouped_ser.transform(len) - expected = Series([1, 2, 2, 1, 2, 1, 1, 2], index, name='pid') - assert_series_equal(actual, expected) - - # Transform (a column from) DataFrameGroupBy - actual = grouped_df.pid.transform(len) - assert_series_equal(actual, expected) - - def test_filter_and_transform_with_non_unique_string_index(self): - # GH4620 - index = list('bbbcbbab') - df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3], - 'tag': [23, 45, 62, 24, 45, 34, 25, 62]}, index=index) - grouped_df = df.groupby('tag') - ser = df['pid'] - grouped_ser = ser.groupby(df['tag']) - expected_indexes = [1, 2, 4, 7] - - # Filter DataFrame - actual = grouped_df.filter(lambda x: len(x) > 1) - expected = df.iloc[expected_indexes] - assert_frame_equal(actual, expected) - - actual = grouped_df.filter(lambda x: len(x) > 1, dropna=False) - expected = df.copy() - expected.iloc[[0, 3, 5, 6]] = np.nan - assert_frame_equal(actual, expected) - - # Filter Series - actual = grouped_ser.filter(lambda x: len(x) > 1) - expected = ser.take(expected_indexes) - assert_series_equal(actual, expected) - - actual = grouped_ser.filter(lambda x: len(x) > 1, dropna=False) - NA = np.nan - expected = Series([NA, 1, 1, NA, 2, NA, NA, 3], index, name='pid') - # ^ made manually because this can get confusing! - assert_series_equal(actual, expected) - - # Transform Series - actual = grouped_ser.transform(len) - expected = Series([1, 2, 2, 1, 2, 1, 1, 2], index, name='pid') - assert_series_equal(actual, expected) - - # Transform (a column from) DataFrameGroupBy - actual = grouped_df.pid.transform(len) - assert_series_equal(actual, expected) - - def test_filter_has_access_to_grouped_cols(self): - df = DataFrame([[1, 2], [1, 3], [5, 6]], columns=['A', 'B']) - g = df.groupby('A') - # previously didn't have access to col A #???? 
-        filt = g.filter(lambda x: x['A'].sum() == 2)
-        assert_frame_equal(filt, df.iloc[[0, 1]])
-
-    def test_filter_enforces_scalarness(self):
-        df = pd.DataFrame([
-            ['best', 'a', 'x'],
-            ['worst', 'b', 'y'],
-            ['best', 'c', 'x'],
-            ['best', 'd', 'y'],
-            ['worst', 'd', 'y'],
-            ['worst', 'd', 'y'],
-            ['best', 'd', 'z'],
-        ], columns=['a', 'b', 'c'])
-        with tm.assertRaisesRegexp(TypeError, 'filter function returned a.*'):
-            df.groupby('c').filter(lambda g: g['a'] == 'best')
-
-    def test_filter_non_bool_raises(self):
-        df = pd.DataFrame([
-            ['best', 'a', 1],
-            ['worst', 'b', 1],
-            ['best', 'c', 1],
-            ['best', 'd', 1],
-            ['worst', 'd', 1],
-            ['worst', 'd', 1],
-            ['best', 'd', 1],
-        ], columns=['a', 'b', 'c'])
-        with tm.assertRaisesRegexp(TypeError, 'filter function returned a.*'):
-            df.groupby('a').filter(lambda g: g.c.mean())
-
-    def test_fill_consistency(self):
-
-        # GH9221
-        # keyword arguments passed through to the generated wrapper
-        # are only set if the passed kw is None
-        df = DataFrame(index=pd.MultiIndex.from_product(
-            [['value1', 'value2'], date_range('2014-01-01', '2014-01-06')]),
-            columns=Index(
-                ['1', '2'], name='id'))
-        df['1'] = [np.nan, 1, np.nan, np.nan, 11, np.nan, np.nan, 2, np.nan,
-                   np.nan, 22, np.nan]
-        df['2'] = [np.nan, 3, np.nan, np.nan, 33, np.nan, np.nan, 4, np.nan,
-                   np.nan, 44, np.nan]
-
-        expected = df.groupby(level=0, axis=0).fillna(method='ffill')
-        result = df.T.groupby(level=0, axis=1).fillna(method='ffill').T
-        assert_frame_equal(result, expected)
-
-    def test_index_label_overlaps_location(self):
-        # checking we don't have any label/location confusion in
-        # the wake of GH5375
-        df = DataFrame(list('ABCDE'), index=[2, 0, 2, 1, 1])
-        g = df.groupby(list('ababb'))
-        actual = g.filter(lambda x: len(x) > 2)
-        expected = df.iloc[[1, 3, 4]]
-        assert_frame_equal(actual, expected)
-
-        ser = df[0]
-        g = ser.groupby(list('ababb'))
-        actual = g.filter(lambda x: len(x) > 2)
-        expected = ser.take([1, 3, 4])
-        assert_series_equal(actual, expected)
-
-        # ... and again, with a generic Index of floats
-        df.index = df.index.astype(float)
-        g = df.groupby(list('ababb'))
-        actual = g.filter(lambda x: len(x) > 2)
-        expected = df.iloc[[1, 3, 4]]
-        assert_frame_equal(actual, expected)
-
-        ser = df[0]
-        g = ser.groupby(list('ababb'))
-        actual = g.filter(lambda x: len(x) > 2)
-        expected = ser.take([1, 3, 4])
-        assert_series_equal(actual, expected)
-
-    def test_groupby_selection_with_methods(self):
-        # some methods which require DatetimeIndex
-        rng = pd.date_range('2014', periods=len(self.df))
-        self.df.index = rng
-
-        g = self.df.groupby(['A'])[['C']]
-        g_exp = self.df[['C']].groupby(self.df['A'])
-        # TODO check groupby with > 1 col ?
-
-        # methods which are called as .foo()
-        methods = ['count',
-                   'corr',
-                   'cummax',
-                   'cummin',
-                   'cumprod',
-                   'describe',
-                   'rank',
-                   'quantile',
-                   'diff',
-                   'shift',
-                   'all',
-                   'any',
-                   'idxmin',
-                   'idxmax',
-                   'ffill',
-                   'bfill',
-                   'pct_change',
-                   'tshift']
-
-        for m in methods:
-            res = getattr(g, m)()
-            exp = getattr(g_exp, m)()
-            assert_frame_equal(res, exp)  # should always be frames!
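-        # illustrative note: `g` and `g_exp` above are two spellings of
-        # the same selection, so for any whitelisted method, e.g.
-        #     g.describe()
-        # and
-        #     self.df[['C']].groupby(self.df['A']).describe()
-        # should match exactly; the loop above checks this per method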
-
-        # methods which aren't just .foo()
-        assert_frame_equal(g.fillna(0), g_exp.fillna(0))
-        assert_frame_equal(g.dtypes, g_exp.dtypes)
-        assert_frame_equal(g.apply(lambda x: x.sum()),
-                           g_exp.apply(lambda x: x.sum()))
-
-        assert_frame_equal(g.resample('D').mean(), g_exp.resample('D').mean())
-        assert_frame_equal(g.resample('D').ohlc(),
-                           g_exp.resample('D').ohlc())
-
-        assert_frame_equal(g.filter(lambda x: len(x) == 3),
-                           g_exp.filter(lambda x: len(x) == 3))
-
-    def test_groupby_whitelist(self):
-        from string import ascii_lowercase
-        letters = np.array(list(ascii_lowercase))
-        N = 10
-        random_letters = letters.take(np.random.randint(0, 26, N))
-        df = DataFrame({'floats': N / 10 * Series(np.random.random(N)),
-                        'letters': Series(random_letters)})
-        s = df.floats
-
-        df_whitelist = frozenset([
-            'last',
-            'first',
-            'mean',
-            'sum',
-            'min',
-            'max',
-            'head',
-            'tail',
-            'cumsum',
-            'cumprod',
-            'cummin',
-            'cummax',
-            'cumcount',
-            'resample',
-            'describe',
-            'rank',
-            'quantile',
-            'fillna',
-            'mad',
-            'any',
-            'all',
-            'take',
-            'idxmax',
-            'idxmin',
-            'shift',
-            'tshift',
-            'ffill',
-            'bfill',
-            'pct_change',
-            'skew',
-            'plot',
-            'boxplot',
-            'hist',
-            'median',
-            'dtypes',
-            'corrwith',
-            'corr',
-            'cov',
-            'diff',
-        ])
-        s_whitelist = frozenset([
-            'last',
-            'first',
-            'mean',
-            'sum',
-            'min',
-            'max',
-            'head',
-            'tail',
-            'cumsum',
-            'cumprod',
-            'cummin',
-            'cummax',
-            'cumcount',
-            'resample',
-            'describe',
-            'rank',
-            'quantile',
-            'fillna',
-            'mad',
-            'any',
-            'all',
-            'take',
-            'idxmax',
-            'idxmin',
-            'shift',
-            'tshift',
-            'ffill',
-            'bfill',
-            'pct_change',
-            'skew',
-            'plot',
-            'hist',
-            'median',
-            'dtype',
-            'corr',
-            'cov',
-            'diff',
-            'unique',
-            # 'nlargest', 'nsmallest',
-        ])
-
-        for obj, whitelist in zip((df, s), (df_whitelist, s_whitelist)):
-            gb = obj.groupby(df.letters)
-            self.assertEqual(whitelist, gb._apply_whitelist)
-            for m in whitelist:
-                getattr(type(gb), m)
-
-    AGG_FUNCTIONS = ['sum', 'prod', 'min', 'max', 'median', 'mean', 'skew',
-                     'mad', 'std', 'var', 'sem']
-    AGG_FUNCTIONS_WITH_SKIPNA = ['skew', 'mad']
-
-    def test_groupby_whitelist_deprecations(self):
-        from string import ascii_lowercase
-        letters = np.array(list(ascii_lowercase))
-        N = 10
-        random_letters = letters.take(np.random.randint(0, 26, N))
-        df = DataFrame({'floats': N / 10 * Series(np.random.random(N)),
-                        'letters': Series(random_letters)})
-
-        # 10711 deprecated
-        with tm.assert_produces_warning(FutureWarning):
-            df.groupby('letters').irow(0)
-        with tm.assert_produces_warning(FutureWarning):
-            df.groupby('letters').floats.irow(0)
-
-    def test_regression_whitelist_methods(self):
-
-        # GH6944
-        # explicitly test the whitelist methods
-        index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two',
-                                                                  'three']],
-                           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
-                                   [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
-                           names=['first', 'second'])
-        raw_frame = DataFrame(np.random.randn(10, 3), index=index,
-                              columns=Index(['A', 'B', 'C'], name='exp'))
-        raw_frame.ix[1, [1, 2]] = np.nan
-        raw_frame.ix[7, [0, 1]] = np.nan
-
-        for op, level, axis, skipna in cart_product(self.AGG_FUNCTIONS,
-                                                    lrange(2), lrange(2),
-                                                    [True, False]):
-
-            if axis == 0:
-                frame = raw_frame
-            else:
-                frame = raw_frame.T
-
-            if op in self.AGG_FUNCTIONS_WITH_SKIPNA:
-                grouped = frame.groupby(level=level, axis=axis)
-                result = getattr(grouped, op)(skipna=skipna)
-                expected = getattr(frame, op)(level=level, axis=axis,
-                                              skipna=skipna)
-                assert_frame_equal(result, expected)
-            else:
-                grouped = frame.groupby(level=level, axis=axis)
-                result =
getattr(grouped, op)() - expected = getattr(frame, op)(level=level, axis=axis) - assert_frame_equal(result, expected) - - def test_groupby_blacklist(self): - from string import ascii_lowercase - letters = np.array(list(ascii_lowercase)) - N = 10 - random_letters = letters.take(np.random.randint(0, 26, N)) - df = DataFrame({'floats': N / 10 * Series(np.random.random(N)), - 'letters': Series(random_letters)}) - s = df.floats - - blacklist = [ - 'eval', 'query', 'abs', 'where', - 'mask', 'align', 'groupby', 'clip', 'astype', - 'at', 'combine', 'consolidate', 'convert_objects', - ] - to_methods = [method for method in dir(df) if method.startswith('to_')] - - blacklist.extend(to_methods) - - # e.g., to_csv - defined_but_not_allowed = ("(?:^Cannot.+{0!r}.+{1!r}.+try using the " - "'apply' method$)") - - # e.g., query, eval - not_defined = "(?:^{1!r} object has no attribute {0!r}$)" - fmt = defined_but_not_allowed + '|' + not_defined - for bl in blacklist: - for obj in (df, s): - gb = obj.groupby(df.letters) - msg = fmt.format(bl, type(gb).__name__) - with tm.assertRaisesRegexp(AttributeError, msg): - getattr(gb, bl) - - def test_tab_completion(self): - grp = self.mframe.groupby(level='second') - results = set([v for v in dir(grp) if not v.startswith('_')]) - expected = set( - ['A', 'B', 'C', 'agg', 'aggregate', 'apply', 'boxplot', 'filter', - 'first', 'get_group', 'groups', 'hist', 'indices', 'last', 'max', - 'mean', 'median', 'min', 'name', 'ngroups', 'nth', 'ohlc', 'plot', - 'prod', 'size', 'std', 'sum', 'transform', 'var', 'sem', 'count', - 'head', 'irow', 'describe', 'cummax', 'quantile', 'rank', - 'cumprod', 'tail', 'resample', 'cummin', 'fillna', 'cumsum', - 'cumcount', 'all', 'shift', 'skew', 'bfill', 'ffill', 'take', - 'tshift', 'pct_change', 'any', 'mad', 'corr', 'corrwith', 'cov', - 'dtypes', 'ndim', 'diff', 'idxmax', 'idxmin', - 'ffill', 'bfill', 'pad', 'backfill', 'rolling', 'expanding']) - self.assertEqual(results, expected) - - def test_lexsort_indexer(self): - keys = [[nan] * 5 + list(range(100)) + [nan] * 5] - # orders=True, na_position='last' - result = _lexsort_indexer(keys, orders=True, na_position='last') - exp = list(range(5, 105)) + list(range(5)) + list(range(105, 110)) - tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp)) - - # orders=True, na_position='first' - result = _lexsort_indexer(keys, orders=True, na_position='first') - exp = list(range(5)) + list(range(105, 110)) + list(range(5, 105)) - tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp)) - - # orders=False, na_position='last' - result = _lexsort_indexer(keys, orders=False, na_position='last') - exp = list(range(104, 4, -1)) + list(range(5)) + list(range(105, 110)) - tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp)) - - # orders=False, na_position='first' - result = _lexsort_indexer(keys, orders=False, na_position='first') - exp = list(range(5)) + list(range(105, 110)) + list(range(104, 4, -1)) - tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp)) - - def test_nargsort(self): - # np.argsort(items) places NaNs last - items = [nan] * 5 + list(range(100)) + [nan] * 5 - # np.argsort(items2) may not place NaNs first - items2 = np.array(items, dtype='O') - - try: - # GH 2785; due to a regression in NumPy1.6.2 - np.argsort(np.array([[1, 2], [1, 3], [1, 2]], dtype='i')) - np.argsort(items2, kind='mergesort') - except TypeError: - raise nose.SkipTest('requested sort not available for type') - - # mergesort is the most difficult to get right because we want it to be 
- # stable. - - # According to numpy/core/tests/test_multiarray, """The number of - # sorted items must be greater than ~50 to check the actual algorithm - # because quick and merge sort fall over to insertion sort for small - # arrays.""" - - # mergesort, ascending=True, na_position='last' - result = _nargsort(items, kind='mergesort', ascending=True, - na_position='last') - exp = list(range(5, 105)) + list(range(5)) + list(range(105, 110)) - tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) - - # mergesort, ascending=True, na_position='first' - result = _nargsort(items, kind='mergesort', ascending=True, - na_position='first') - exp = list(range(5)) + list(range(105, 110)) + list(range(5, 105)) - tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) - - # mergesort, ascending=False, na_position='last' - result = _nargsort(items, kind='mergesort', ascending=False, - na_position='last') - exp = list(range(104, 4, -1)) + list(range(5)) + list(range(105, 110)) - tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) - - # mergesort, ascending=False, na_position='first' - result = _nargsort(items, kind='mergesort', ascending=False, - na_position='first') - exp = list(range(5)) + list(range(105, 110)) + list(range(104, 4, -1)) - tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) - - # mergesort, ascending=True, na_position='last' - result = _nargsort(items2, kind='mergesort', ascending=True, - na_position='last') - exp = list(range(5, 105)) + list(range(5)) + list(range(105, 110)) - tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) - - # mergesort, ascending=True, na_position='first' - result = _nargsort(items2, kind='mergesort', ascending=True, - na_position='first') - exp = list(range(5)) + list(range(105, 110)) + list(range(5, 105)) - tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) - - # mergesort, ascending=False, na_position='last' - result = _nargsort(items2, kind='mergesort', ascending=False, - na_position='last') - exp = list(range(104, 4, -1)) + list(range(5)) + list(range(105, 110)) - tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) - - # mergesort, ascending=False, na_position='first' - result = _nargsort(items2, kind='mergesort', ascending=False, - na_position='first') - exp = list(range(5)) + list(range(105, 110)) + list(range(104, 4, -1)) - tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) - - def test_datetime_count(self): - df = DataFrame({'a': [1, 2, 3] * 2, - 'dates': pd.date_range('now', periods=6, freq='T')}) - result = df.groupby('a').dates.count() - expected = Series([ - 2, 2, 2 - ], index=Index([1, 2, 3], name='a'), name='dates') - tm.assert_series_equal(result, expected) - - def test_lower_int_prec_count(self): - df = DataFrame({'a': np.array( - [0, 1, 2, 100], np.int8), - 'b': np.array( - [1, 2, 3, 6], np.uint32), - 'c': np.array( - [4, 5, 6, 8], np.int16), - 'grp': list('ab' * 2)}) - result = df.groupby('grp').count() - expected = DataFrame({'a': [2, 2], - 'b': [2, 2], - 'c': [2, 2]}, index=pd.Index(list('ab'), - name='grp')) - tm.assert_frame_equal(result, expected) - - def test_count_uses_size_on_exception(self): - class RaisingObjectException(Exception): - pass - - class RaisingObject(object): - - def __init__(self, msg='I will raise inside Cython'): - super(RaisingObject, self).__init__() - self.msg = msg - - def __eq__(self, other): - # gets called in Cython to check that raising calls the method - raise 
RaisingObjectException(self.msg) - - df = DataFrame({'a': [RaisingObject() for _ in range(4)], - 'grp': list('ab' * 2)}) - result = df.groupby('grp').count() - expected = DataFrame({'a': [2, 2]}, index=pd.Index( - list('ab'), name='grp')) - tm.assert_frame_equal(result, expected) - - def test__cython_agg_general(self): - ops = [('mean', np.mean), - ('median', np.median), - ('var', np.var), - ('add', np.sum), - ('prod', np.prod), - ('min', np.min), - ('max', np.max), - ('first', lambda x: x.iloc[0]), - ('last', lambda x: x.iloc[-1]), ] - df = DataFrame(np.random.randn(1000)) - labels = np.random.randint(0, 50, size=1000).astype(float) - - for op, targop in ops: - result = df.groupby(labels)._cython_agg_general(op) - expected = df.groupby(labels).agg(targop) - try: - tm.assert_frame_equal(result, expected) - except BaseException as exc: - exc.args += ('operation: %s' % op, ) - raise - - def test_cython_agg_empty_buckets(self): - ops = [('mean', np.mean), - ('median', lambda x: np.median(x) if len(x) > 0 else np.nan), - ('var', lambda x: np.var(x, ddof=1)), - ('add', lambda x: np.sum(x) if len(x) > 0 else np.nan), - ('prod', np.prod), - ('min', np.min), - ('max', np.max), ] - - df = pd.DataFrame([11, 12, 13]) - grps = range(0, 55, 5) - - for op, targop in ops: - result = df.groupby(pd.cut(df[0], grps))._cython_agg_general(op) - expected = df.groupby(pd.cut(df[0], grps)).agg(lambda x: targop(x)) - try: - tm.assert_frame_equal(result, expected) - except BaseException as exc: - exc.args += ('operation: %s' % op,) - raise - - def test_cython_group_transform_algos(self): - # GH 4095 - dtypes = [np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint32, - np.uint64, np.float32, np.float64] - - ops = [(pd.algos.group_cumprod_float64, np.cumproduct, [np.float64]), - (pd.algos.group_cumsum, np.cumsum, dtypes)] - - for pd_op, np_op, dtypes in ops: - for dtype in dtypes: - data = np.array([[1], [2], [3], [4]], dtype=dtype) - ans = np.zeros_like(data) - accum = np.array([[0]], dtype=dtype) - labels = np.array([0, 0, 0, 0], dtype=np.int64) - pd_op(ans, data, labels, accum) - self.assert_numpy_array_equal(np_op(data), ans[:, 0], - check_dtype=False) - - # with nans - labels = np.array([0, 0, 0, 0, 0], dtype=np.int64) - - data = np.array([[1], [2], [3], [np.nan], [4]], dtype='float64') - accum = np.array([[0.0]]) - actual = np.zeros_like(data) - actual.fill(np.nan) - pd.algos.group_cumprod_float64(actual, data, labels, accum) - expected = np.array([1, 2, 6, np.nan, 24], dtype='float64') - self.assert_numpy_array_equal(actual[:, 0], expected) - - accum = np.array([[0.0]]) - actual = np.zeros_like(data) - actual.fill(np.nan) - pd.algos.group_cumsum(actual, data, labels, accum) - expected = np.array([1, 3, 6, np.nan, 10], dtype='float64') - self.assert_numpy_array_equal(actual[:, 0], expected) - - # timedelta - data = np.array([np.timedelta64(1, 'ns')] * 5, dtype='m8[ns]')[:, None] - accum = np.array([[0]], dtype='int64') - actual = np.zeros_like(data, dtype='int64') - pd.algos.group_cumsum(actual, data.view('int64'), labels, accum) - expected = np.array([np.timedelta64(1, 'ns'), np.timedelta64( - 2, 'ns'), np.timedelta64(3, 'ns'), np.timedelta64(4, 'ns'), - np.timedelta64(5, 'ns')]) - self.assert_numpy_array_equal(actual[:, 0].view('m8[ns]'), expected) - - def test_cython_transform(self): - # GH 4095 - ops = [(('cumprod', - ()), lambda x: x.cumprod()), (('cumsum', ()), - lambda x: x.cumsum()), - (('shift', (-1, )), - lambda x: x.shift(-1)), (('shift', - (1, )), lambda x: x.shift())] - - s = 
Series(np.random.randn(1000))
-        s_missing = s.copy()
-        s_missing.iloc[2:10] = np.nan
-        labels = np.random.randint(0, 50, size=1000).astype(float)
-
-        # series
-        for (op, args), targop in ops:
-            for data in [s, s_missing]:
-                # print(data.head())
-                expected = data.groupby(labels).transform(targop)
-
-                tm.assert_series_equal(expected,
-                                       data.groupby(labels).transform(op,
-                                                                      *args))
-                tm.assert_series_equal(expected, getattr(
-                    data.groupby(labels), op)(*args))
-
-        strings = list('qwertyuiopasdfghjklz')
-        strings_missing = strings[:]
-        strings_missing[5] = np.nan
-        df = DataFrame({'float': s,
-                        'float_missing': s_missing,
-                        'int': [1, 1, 1, 1, 2] * 200,
-                        'datetime': pd.date_range('1990-1-1', periods=1000),
-                        'timedelta': pd.timedelta_range(1, freq='s',
-                                                        periods=1000),
-                        'string': strings * 50,
-                        'string_missing': strings_missing * 50})
-        df['cat'] = df['string'].astype('category')
-
-        df2 = df.copy()
-        df2.index = pd.MultiIndex.from_product([range(100), range(10)])
-
-        # DataFrame - Single and MultiIndex,
-        # group by values, index level, columns
-        for df in [df, df2]:
-            for gb_target in [dict(by=labels), dict(level=0), dict(by='string')
-                              ]:  # dict(by='string_missing')]:
-                # dict(by=['int','string'])]:
-
-                gb = df.groupby(**gb_target)
-                # whitelisted methods set the selection before applying
-                # bit of a hack to make sure the cythonized shift
-                # is equivalent to pre 0.17.1 behavior
-                if op == 'shift':
-                    gb._set_group_selection()
-
-                for (op, args), targop in ops:
-                    if op != 'shift' and 'int' not in gb_target:
-                        # numeric apply fastpath promotes dtype so have
-                        # to apply separately and concat
-                        i = gb[['int']].apply(targop)
-                        f = gb[['float', 'float_missing']].apply(targop)
-                        expected = pd.concat([f, i], axis=1)
-                    else:
-                        expected = gb.apply(targop)
-
-                    expected = expected.sort_index(axis=1)
-                    tm.assert_frame_equal(expected,
-                                          gb.transform(op, *args).sort_index(
-                                              axis=1))
-                    tm.assert_frame_equal(expected, getattr(gb, op)(*args))
-                    # individual columns
-                    for c in df:
-                        if c not in ['float', 'int', 'float_missing'
-                                     ] and op != 'shift':
-                            self.assertRaises(DataError, gb[c].transform, op)
-                            self.assertRaises(DataError, getattr(gb[c], op))
-                        else:
-                            expected = gb[c].apply(targop)
-                            expected.name = c
-                            tm.assert_series_equal(expected,
-                                                   gb[c].transform(op, *args))
-                            tm.assert_series_equal(expected,
-                                                   getattr(gb[c], op)(*args))
-
-    def test_groupby_cumprod(self):
-        # GH 4095
-        df = pd.DataFrame({'key': ['b'] * 10, 'value': 2})
-
-        actual = df.groupby('key')['value'].cumprod()
-        expected = df.groupby('key')['value'].apply(lambda x: x.cumprod())
-        expected.name = 'value'
-        tm.assert_series_equal(actual, expected)
-
-        df = pd.DataFrame({'key': ['b'] * 100, 'value': 2})
-        actual = df.groupby('key')['value'].cumprod()
-        # if overflows, groupby product casts to float
-        # while numpy passes back invalid values
-        df['value'] = df['value'].astype(float)
-        expected = df.groupby('key')['value'].apply(lambda x: x.cumprod())
-        expected.name = 'value'
-        tm.assert_series_equal(actual, expected)
-
-    def test_ops_general(self):
-        ops = [('mean', np.mean),
-               ('median', np.median),
-               ('std', np.std),
-               ('var', np.var),
-               ('sum', np.sum),
-               ('prod', np.prod),
-               ('min', np.min),
-               ('max', np.max),
-               ('first', lambda x: x.iloc[0]),
-               ('last', lambda x: x.iloc[-1]),
-               ('count', np.size), ]
-        try:
-            from scipy.stats import sem
-        except ImportError:
-            pass
-        else:
-            ops.append(('sem', sem))
-        df = DataFrame(np.random.randn(1000))
-        labels = np.random.randint(0, 50, size=1000).astype(float)
-
-        for op, targop in ops:
-
result = getattr(df.groupby(labels), op)().astype(float) - expected = df.groupby(labels).agg(targop) - try: - tm.assert_frame_equal(result, expected) - except BaseException as exc: - exc.args += ('operation: %s' % op, ) - raise - - def test_max_nan_bug(self): - raw = """,Date,app,File -2013-04-23,2013-04-23 00:00:00,,log080001.log -2013-05-06,2013-05-06 00:00:00,,log.log -2013-05-07,2013-05-07 00:00:00,OE,xlsx""" - - df = pd.read_csv(StringIO(raw), parse_dates=[0]) - gb = df.groupby('Date') - r = gb[['File']].max() - e = gb['File'].max().to_frame() - tm.assert_frame_equal(r, e) - self.assertFalse(r['File'].isnull().any()) - - def test_nlargest(self): - a = Series([1, 3, 5, 7, 2, 9, 0, 4, 6, 10]) - b = Series(list('a' * 5 + 'b' * 5)) - gb = a.groupby(b) - r = gb.nlargest(3) - e = Series([ - 7, 5, 3, 10, 9, 6 - ], index=MultiIndex.from_arrays([list('aaabbb'), [3, 2, 1, 9, 5, 8]])) - tm.assert_series_equal(r, e) - - a = Series([1, 1, 3, 2, 0, 3, 3, 2, 1, 0]) - gb = a.groupby(b) - e = Series([ - 3, 2, 1, 3, 3, 2 - ], index=MultiIndex.from_arrays([list('aaabbb'), [2, 3, 1, 6, 5, 7]])) - assert_series_equal(gb.nlargest(3, keep='last'), e) - with tm.assert_produces_warning(FutureWarning): - assert_series_equal(gb.nlargest(3, take_last=True), e) - - def test_nsmallest(self): - a = Series([1, 3, 5, 7, 2, 9, 0, 4, 6, 10]) - b = Series(list('a' * 5 + 'b' * 5)) - gb = a.groupby(b) - r = gb.nsmallest(3) - e = Series([ - 1, 2, 3, 0, 4, 6 - ], index=MultiIndex.from_arrays([list('aaabbb'), [0, 4, 1, 6, 7, 8]])) - tm.assert_series_equal(r, e) - - a = Series([1, 1, 3, 2, 0, 3, 3, 2, 1, 0]) - gb = a.groupby(b) - e = Series([ - 0, 1, 1, 0, 1, 2 - ], index=MultiIndex.from_arrays([list('aaabbb'), [4, 1, 0, 9, 8, 7]])) - assert_series_equal(gb.nsmallest(3, keep='last'), e) - with tm.assert_produces_warning(FutureWarning): - assert_series_equal(gb.nsmallest(3, take_last=True), e) - - def test_transform_doesnt_clobber_ints(self): - # GH 7972 - n = 6 - x = np.arange(n) - df = DataFrame({'a': x // 2, 'b': 2.0 * x, 'c': 3.0 * x}) - df2 = DataFrame({'a': x // 2 * 1.0, 'b': 2.0 * x, 'c': 3.0 * x}) - - gb = df.groupby('a') - result = gb.transform('mean') - - gb2 = df2.groupby('a') - expected = gb2.transform('mean') - tm.assert_frame_equal(result, expected) - - def test_groupby_categorical_two_columns(self): - - # https://github.com/pandas-dev/pandas/issues/8138 - d = {'cat': - pd.Categorical(["a", "b", "a", "b"], categories=["a", "b", "c"], - ordered=True), - 'ints': [1, 1, 2, 2], - 'val': [10, 20, 30, 40]} - test = pd.DataFrame(d) - - # Grouping on a single column - groups_single_key = test.groupby("cat") - res = groups_single_key.agg('mean') - - exp_index = pd.CategoricalIndex(["a", "b", "c"], name="cat", - ordered=True) - exp = DataFrame({"ints": [1.5, 1.5, np.nan], "val": [20, 30, np.nan]}, - index=exp_index) - tm.assert_frame_equal(res, exp) - - # Grouping on two columns - groups_double_key = test.groupby(["cat", "ints"]) - res = groups_double_key.agg('mean') - exp = DataFrame({"val": [10, 30, 20, 40, np.nan, np.nan], - "cat": pd.Categorical(["a", "a", "b", "b", "c", "c"], - ordered=True), - "ints": [1, 2, 1, 2, 1, 2]}).set_index(["cat", "ints" - ]) - tm.assert_frame_equal(res, exp) - - # GH 10132 - for key in [('a', 1), ('b', 2), ('b', 1), ('a', 2)]: - c, i = key - result = groups_double_key.get_group(key) - expected = test[(test.cat == c) & (test.ints == i)] - assert_frame_equal(result, expected) - - d = {'C1': [3, 3, 4, 5], 'C2': [1, 2, 3, 4], 'C3': [10, 100, 200, 34]} - test = pd.DataFrame(d) - values = 
pd.cut(test['C1'], [1, 2, 3, 6]) - values.name = "cat" - groups_double_key = test.groupby([values, 'C2']) - - res = groups_double_key.agg('mean') - nan = np.nan - idx = MultiIndex.from_product( - [Categorical(["(1, 2]", "(2, 3]", "(3, 6]"], ordered=True), - [1, 2, 3, 4]], - names=["cat", "C2"]) - exp = DataFrame({"C1": [nan, nan, nan, nan, 3, 3, - nan, nan, nan, nan, 4, 5], - "C3": [nan, nan, nan, nan, 10, 100, - nan, nan, nan, nan, 200, 34]}, index=idx) - tm.assert_frame_equal(res, exp) - - def test_groupby_multi_categorical_as_index(self): - # GH13204 - df = DataFrame({'cat': Categorical([1, 2, 2], [1, 2, 3]), - 'A': [10, 11, 11], - 'B': [101, 102, 103]}) - result = df.groupby(['cat', 'A'], as_index=False).sum() - expected = DataFrame({'cat': Categorical([1, 1, 2, 2, 3, 3]), - 'A': [10, 11, 10, 11, 10, 11], - 'B': [101.0, nan, nan, 205.0, nan, nan]}, - columns=['cat', 'A', 'B']) - tm.assert_frame_equal(result, expected) - - # function grouper - f = lambda r: df.loc[r, 'A'] - result = df.groupby(['cat', f], as_index=False).sum() - expected = DataFrame({'cat': Categorical([1, 1, 2, 2, 3, 3]), - 'A': [10.0, nan, nan, 22.0, nan, nan], - 'B': [101.0, nan, nan, 205.0, nan, nan]}, - columns=['cat', 'A', 'B']) - tm.assert_frame_equal(result, expected) - - # another not in-axis grouper (conflicting names in index) - s = Series(['a', 'b', 'b'], name='cat') - result = df.groupby(['cat', s], as_index=False).sum() - expected = DataFrame({'cat': Categorical([1, 1, 2, 2, 3, 3]), - 'A': [10.0, nan, nan, 22.0, nan, nan], - 'B': [101.0, nan, nan, 205.0, nan, nan]}, - columns=['cat', 'A', 'B']) - tm.assert_frame_equal(result, expected) - - # is original index dropped? - expected = DataFrame({'cat': Categorical([1, 1, 2, 2, 3, 3]), - 'A': [10, 11, 10, 11, 10, 11], - 'B': [101.0, nan, nan, 205.0, nan, nan]}, - columns=['cat', 'A', 'B']) - - for name in [None, 'X', 'B', 'cat']: - df.index = Index(list("abc"), name=name) - result = df.groupby(['cat', 'A'], as_index=False).sum() - tm.assert_frame_equal(result, expected, check_index_type=True) - - def test_groupby_preserve_categorical_dtype(self): - # GH13743, GH13854 - df = DataFrame({'A': [1, 2, 1, 1, 2], - 'B': [10, 16, 22, 28, 34], - 'C1': Categorical(list("abaab"), - categories=list("bac"), - ordered=False), - 'C2': Categorical(list("abaab"), - categories=list("bac"), - ordered=True)}) - # single grouper - exp_full = DataFrame({'A': [2.0, 1.0, np.nan], - 'B': [25.0, 20.0, np.nan], - 'C1': Categorical(list("bac"), - categories=list("bac"), - ordered=False), - 'C2': Categorical(list("bac"), - categories=list("bac"), - ordered=True)}) - for col in ['C1', 'C2']: - result1 = df.groupby(by=col, as_index=False).mean() - result2 = df.groupby(by=col, as_index=True).mean().reset_index() - expected = exp_full.reindex(columns=result1.columns) - tm.assert_frame_equal(result1, expected) - tm.assert_frame_equal(result2, expected) - - # multiple grouper - exp_full = DataFrame({'A': [1, 1, 1, 2, 2, 2], - 'B': [np.nan, 20.0, np.nan, 25.0, np.nan, - np.nan], - 'C1': Categorical(list("bacbac"), - categories=list("bac"), - ordered=False), - 'C2': Categorical(list("bacbac"), - categories=list("bac"), - ordered=True)}) - for cols in [['A', 'C1'], ['A', 'C2']]: - result1 = df.groupby(by=cols, as_index=False).mean() - result2 = df.groupby(by=cols, as_index=True).mean().reset_index() - expected = exp_full.reindex(columns=result1.columns) - tm.assert_frame_equal(result1, expected) - tm.assert_frame_equal(result2, expected) - - def test_groupby_apply_all_none(self): - # Tests to make 
sure no errors occur if the apply function returns all None
-        # values. Issue 9684.
-        test_df = DataFrame({'groups': [0, 0, 1, 1],
-                             'random_vars': [8, 7, 4, 5]})
-
-        def test_func(x):
-            pass
-
-        result = test_df.groupby('groups').apply(test_func)
-        expected = DataFrame()
-        tm.assert_frame_equal(result, expected)
-
-    def test_groupby_apply_none_first(self):
-        # GH 12824. Tests if apply returns None first.
-        test_df1 = DataFrame({'groups': [1, 1, 1, 2], 'vars': [0, 1, 2, 3]})
-        test_df2 = DataFrame({'groups': [1, 2, 2, 2], 'vars': [0, 1, 2, 3]})
-
-        def test_func(x):
-            if x.shape[0] < 2:
-                return None
-            return x.iloc[[0, -1]]
-
-        result1 = test_df1.groupby('groups').apply(test_func)
-        result2 = test_df2.groupby('groups').apply(test_func)
-        index1 = MultiIndex.from_arrays([[1, 1], [0, 2]],
-                                        names=['groups', None])
-        index2 = MultiIndex.from_arrays([[2, 2], [1, 3]],
-                                        names=['groups', None])
-        expected1 = DataFrame({'groups': [1, 1], 'vars': [0, 2]},
-                              index=index1)
-        expected2 = DataFrame({'groups': [2, 2], 'vars': [1, 3]},
-                              index=index2)
-        tm.assert_frame_equal(result1, expected1)
-        tm.assert_frame_equal(result2, expected2)
-
-    def test_first_last_max_min_on_time_data(self):
-        # GH 10295
-        # Verify that NaT is not in the result of max, min, first and last on
-        # DataFrame with datetime or timedelta values.
-        from datetime import timedelta as td
-        df_test = DataFrame(
-            {'dt': [nan, '2015-07-24 10:10', '2015-07-25 11:11',
-                    '2015-07-23 12:12', nan],
-             'td': [nan, td(days=1), td(days=2), td(days=3), nan]})
-        df_test.dt = pd.to_datetime(df_test.dt)
-        df_test['group'] = 'A'
-        df_ref = df_test[df_test.dt.notnull()]
-
-        grouped_test = df_test.groupby('group')
-        grouped_ref = df_ref.groupby('group')
-
-        assert_frame_equal(grouped_ref.max(), grouped_test.max())
-        assert_frame_equal(grouped_ref.min(), grouped_test.min())
-        assert_frame_equal(grouped_ref.first(), grouped_test.first())
-        assert_frame_equal(grouped_ref.last(), grouped_test.last())
-
-    def test_groupby_preserves_sort(self):
-        # Test to ensure that groupby always preserves sort order of original
-        # object.
Issue #8588 and #9651
-
-        df = DataFrame(
-            {'int_groups': [3, 1, 0, 1, 0, 3, 3, 3],
-             'string_groups': ['z', 'a', 'z', 'a', 'a', 'g', 'g', 'g'],
-             'ints': [8, 7, 4, 5, 2, 9, 1, 1],
-             'floats': [2.3, 5.3, 6.2, -2.4, 2.2, 1.1, 1.1, 5],
-             'strings': ['z', 'd', 'a', 'e', 'word', 'word2', '42', '47']})
-
-        # Try sorting on different types and with different group types
-        for sort_column in ['ints', 'floats', 'strings', ['ints', 'floats'],
-                            ['ints', 'strings']]:
-            for group_column in ['int_groups', 'string_groups',
-                                 ['int_groups', 'string_groups']]:
-
-                df = df.sort_values(by=sort_column)
-
-                g = df.groupby(group_column)
-
-                def test_sort(x):
-                    assert_frame_equal(x, x.sort_values(by=sort_column))
-
-                g.apply(test_sort)
-
-    def test_nunique_with_object(self):
-        # GH 11077
-        data = pd.DataFrame(
-            [[100, 1, 'Alice'],
-             [200, 2, 'Bob'],
-             [300, 3, 'Charlie'],
-             [-400, 4, 'Dan'],
-             [500, 5, 'Edith']],
-            columns=['amount', 'id', 'name']
-        )
-
-        result = data.groupby(['id', 'amount'])['name'].nunique()
-        index = MultiIndex.from_arrays([data.id, data.amount])
-        expected = pd.Series([1] * 5, name='name', index=index)
-        tm.assert_series_equal(result, expected)
-
-    def test_nunique_with_empty_series(self):
-        # GH 12553
-        data = pd.Series(name='name')
-        result = data.groupby(level=0).nunique()
-        expected = pd.Series(name='name', dtype='int64')
-        tm.assert_series_equal(result, expected)
-
-    def test_transform_with_non_scalar_group(self):
-        # GH 10165
-        cols = pd.MultiIndex.from_tuples([
-            ('syn', 'A'), ('mis', 'A'), ('non', 'A'),
-            ('syn', 'C'), ('mis', 'C'), ('non', 'C'),
-            ('syn', 'T'), ('mis', 'T'), ('non', 'T'),
-            ('syn', 'G'), ('mis', 'G'), ('non', 'G')])
-        df = pd.DataFrame(np.random.randint(1, 10, (4, 12)),
-                          columns=cols,
-                          index=['A', 'C', 'G', 'T'])
-        self.assertRaisesRegexp(ValueError, 'transform must return a scalar '
-                                'value for each group.*', df.groupby
-                                (axis=1, level=1).transform,
-                                lambda z: z.div(z.sum(axis=1), axis=0))
-
-    def test_numpy_compat(self):
-        # see gh-12811
-        df = pd.DataFrame({'A': [1, 2, 1], 'B': [1, 2, 3]})
-        g = df.groupby('A')
-
-        msg = "numpy operations are not valid with groupby"
-
-        for func in ('mean', 'var', 'std', 'cumprod', 'cumsum'):
-            tm.assertRaisesRegexp(UnsupportedFunctionCall, msg,
-                                  getattr(g, func), 1, 2, 3)
-            tm.assertRaisesRegexp(UnsupportedFunctionCall, msg,
-                                  getattr(g, func), foo=1)
-
-    def test_grouping_string_repr(self):
-        # GH 13394
-        mi = MultiIndex.from_arrays([list("AAB"), list("aba")])
-        df = DataFrame([[1, 2, 3]], columns=mi)
-        gr = df.groupby(df[('A', 'a')])
-
-        result = gr.grouper.groupings[0].__repr__()
-        expected = "Grouping(('A', 'a'))"
-        tm.assert_equal(result, expected)
-
-    def test_group_shift_with_null_key(self):
-        # This test is designed to replicate the segfault in issue #13813.
-        n_rows = 1200
-
-        # Generate a moderately large dataframe with occasional missing
-        # values in column `B`, and then group by [`A`, `B`]. This should
-        # force `-1` in the `labels` array of `g.grouper.group_info` exactly
-        # at those places where the group-by key is partially missing.
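-        # For illustration (assuming factorize's default NaN handling),
-        # the -1 labels come from e.g.:
-        #     pd.factorize([1.0, np.nan, 2.0])[0]  # -> array([ 0, -1,  1])
-        # so every row below where `B` is NaN maps to the sentinel label -1,
-        # which is the condition the shift must survive without segfaulting.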
- df = DataFrame([(i % 12, i % 3 if i % 3 else np.nan, i) - for i in range(n_rows)], dtype=float, - columns=["A", "B", "Z"], index=None) - g = df.groupby(["A", "B"]) - - expected = DataFrame([(i + 12 if i % 3 and i < n_rows - 12 - else np.nan) - for i in range(n_rows)], dtype=float, - columns=["Z"], index=None) - result = g.shift(-1) - - assert_frame_equal(result, expected) - - -def assert_fp_equal(a, b): - assert (np.abs(a - b) < 1e-12).all() - - -def _check_groupby(df, result, keys, field, f=lambda x: x.sum()): - tups = lmap(tuple, df[keys].values) - tups = com._asarray_tuplesafe(tups) - expected = f(df.groupby(tups)[field]) - for k, v in compat.iteritems(expected): - assert (result[k] == v) - - -def test_decons(): - from pandas.core.groupby import decons_group_index, get_group_index - - def testit(label_list, shape): - group_index = get_group_index(label_list, shape, sort=True, xnull=True) - label_list2 = decons_group_index(group_index, shape) - - for a, b in zip(label_list, label_list2): - assert (np.array_equal(a, b)) - - shape = (4, 5, 6) - label_list = [np.tile([0, 1, 2, 3, 0, 1, 2, 3], 100), np.tile( - [0, 2, 4, 3, 0, 1, 2, 3], 100), np.tile( - [5, 1, 0, 2, 3, 0, 5, 4], 100)] - testit(label_list, shape) - - shape = (10000, 10000) - label_list = [np.tile(np.arange(10000), 5), np.tile(np.arange(10000), 5)] - testit(label_list, shape) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure', '-s' - ], exit=False) diff --git a/pandas/tests/test_internals.py b/pandas/tests/test_internals.py index db1c8da4cae73..0900d21b250ed 100644 --- a/pandas/tests/test_internals.py +++ b/pandas/tests/test_internals.py @@ -2,32 +2,44 @@ # pylint: disable=W0102 from datetime import datetime, date - -import nose +import sys +import pytest import numpy as np import re +from distutils.version import LooseVersion import itertools from pandas import (Index, MultiIndex, DataFrame, DatetimeIndex, Series, Categorical) from pandas.compat import OrderedDict, lrange -from pandas.sparse.array import SparseArray +from pandas.core.sparse.array import SparseArray from pandas.core.internals import (BlockPlacement, SingleBlockManager, make_block, BlockManager) import pandas.core.algorithms as algos import pandas.util.testing as tm import pandas as pd -from pandas import lib +from pandas._libs import lib from pandas.util.testing import (assert_almost_equal, assert_frame_equal, randn, assert_series_equal) from pandas.compat import zip, u +# in 3.6.1 a c-api slicing function changed, see src/compat_helper.h +PY361 = sys.version >= LooseVersion('3.6.1') + + +@pytest.fixture +def mgr(): + return create_mgr( + 'a: f8; b: object; c: f8; d: object; e: f8;' + 'f: bool; g: i8; h: complex; i: datetime-1; j: datetime-2;' + 'k: M8[ns, US/Eastern]; l: M8[ns, CET];') + def assert_block_equal(left, right): tm.assert_numpy_array_equal(left.values, right.values) - assert (left.dtype == right.dtype) - tm.assertIsInstance(left.mgr_locs, lib.BlockPlacement) - tm.assertIsInstance(right.mgr_locs, lib.BlockPlacement) + assert left.dtype == right.dtype + assert isinstance(left.mgr_locs, lib.BlockPlacement) + assert isinstance(right.mgr_locs, lib.BlockPlacement) tm.assert_numpy_array_equal(left.mgr_locs.as_array, right.mgr_locs.as_array) @@ -180,11 +192,9 @@ def create_mgr(descr, item_shape=None): [mgr_items] + [np.arange(n) for n in item_shape]) -class TestBlock(tm.TestCase): - - _multiprocess_can_split_ = True +class TestBlock(object): - def setUp(self): + def setup_method(self, method): # 
self.fblock = get_float_ex() # a,c,e # self.cblock = get_complex_ex() # # self.oblock = get_obj_ex() @@ -199,11 +209,11 @@ def setUp(self): def test_constructor(self): int32block = create_block('i4', [0]) - self.assertEqual(int32block.dtype, np.int32) + assert int32block.dtype == np.int32 def test_pickle(self): def _check(blk): - assert_block_equal(self.round_trip_pickle(blk), blk) + assert_block_equal(tm.round_trip_pickle(blk), blk) _check(self.fblock) _check(self.cblock) @@ -211,14 +221,14 @@ def _check(blk): _check(self.bool_block) def test_mgr_locs(self): - tm.assertIsInstance(self.fblock.mgr_locs, lib.BlockPlacement) + assert isinstance(self.fblock.mgr_locs, lib.BlockPlacement) tm.assert_numpy_array_equal(self.fblock.mgr_locs.as_array, np.array([0, 2, 4], dtype=np.int64)) def test_attrs(self): - self.assertEqual(self.fblock.shape, self.fblock.values.shape) - self.assertEqual(self.fblock.dtype, self.fblock.values.dtype) - self.assertEqual(len(self.fblock), len(self.fblock.values)) + assert self.fblock.shape == self.fblock.values.shape + assert self.fblock.dtype == self.fblock.values.dtype + assert len(self.fblock) == len(self.fblock.values) def test_merge(self): avals = randn(2, 10) @@ -238,7 +248,7 @@ def test_merge(self): def test_copy(self): cop = self.fblock.copy() - self.assertIsNot(cop, self.fblock) + assert cop is not self.fblock assert_block_equal(self.fblock, cop) def test_reindex_index(self): @@ -253,104 +263,97 @@ def test_insert(self): def test_delete(self): newb = self.fblock.copy() newb.delete(0) - tm.assertIsInstance(newb.mgr_locs, lib.BlockPlacement) + assert isinstance(newb.mgr_locs, lib.BlockPlacement) tm.assert_numpy_array_equal(newb.mgr_locs.as_array, np.array([2, 4], dtype=np.int64)) - self.assertTrue((newb.values[0] == 1).all()) + assert (newb.values[0] == 1).all() newb = self.fblock.copy() newb.delete(1) - tm.assertIsInstance(newb.mgr_locs, lib.BlockPlacement) + assert isinstance(newb.mgr_locs, lib.BlockPlacement) tm.assert_numpy_array_equal(newb.mgr_locs.as_array, np.array([0, 4], dtype=np.int64)) - self.assertTrue((newb.values[1] == 2).all()) + assert (newb.values[1] == 2).all() newb = self.fblock.copy() newb.delete(2) tm.assert_numpy_array_equal(newb.mgr_locs.as_array, np.array([0, 2], dtype=np.int64)) - self.assertTrue((newb.values[1] == 1).all()) + assert (newb.values[1] == 1).all() newb = self.fblock.copy() - self.assertRaises(Exception, newb.delete, 3) + with pytest.raises(Exception): + newb.delete(3) def test_split_block_at(self): # with dup column support this method was taken out # GH3679 - raise nose.SkipTest("skipping for now") + pytest.skip("skipping for now") bs = list(self.fblock.split_block_at('a')) - self.assertEqual(len(bs), 1) - self.assertTrue(np.array_equal(bs[0].items, ['c', 'e'])) + assert len(bs) == 1 + assert np.array_equal(bs[0].items, ['c', 'e']) bs = list(self.fblock.split_block_at('c')) - self.assertEqual(len(bs), 2) - self.assertTrue(np.array_equal(bs[0].items, ['a'])) - self.assertTrue(np.array_equal(bs[1].items, ['e'])) + assert len(bs) == 2 + assert np.array_equal(bs[0].items, ['a']) + assert np.array_equal(bs[1].items, ['e']) bs = list(self.fblock.split_block_at('e')) - self.assertEqual(len(bs), 1) - self.assertTrue(np.array_equal(bs[0].items, ['a', 'c'])) + assert len(bs) == 1 + assert np.array_equal(bs[0].items, ['a', 'c']) # bblock = get_bool_ex(['f']) # bs = list(bblock.split_block_at('f')) - # self.assertEqual(len(bs), 0) + # assert len(bs), 0) -class TestDatetimeBlock(tm.TestCase): - _multiprocess_can_split_ = True +class 
TestDatetimeBlock(object):
 
     def test_try_coerce_arg(self):
         block = create_block('datetime', [0])
 
         # coerce None
         none_coerced = block._try_coerce_args(block.values, None)[2]
-        self.assertTrue(pd.Timestamp(none_coerced) is pd.NaT)
+        assert pd.Timestamp(none_coerced) is pd.NaT
 
         # coerce different types of date objects
         vals = (np.datetime64('2010-10-10'), datetime(2010, 10, 10),
                 date(2010, 10, 10))
         for val in vals:
             coerced = block._try_coerce_args(block.values, val)[2]
-            self.assertEqual(np.int64, type(coerced))
-            self.assertEqual(pd.Timestamp('2010-10-10'), pd.Timestamp(coerced))
+            assert np.int64 == type(coerced)
+            assert pd.Timestamp('2010-10-10') == pd.Timestamp(coerced)
 
 
-class TestBlockManager(tm.TestCase):
-    _multiprocess_can_split_ = True
-
-    def setUp(self):
-        self.mgr = create_mgr(
-            'a: f8; b: object; c: f8; d: object; e: f8;'
-            'f: bool; g: i8; h: complex; i: datetime-1; j: datetime-2;'
-            'k: M8[ns, US/Eastern]; l: M8[ns, CET];')
+class TestBlockManager(object):
 
     def test_constructor_corner(self):
         pass
 
     def test_attrs(self):
         mgr = create_mgr('a,b,c: f8-1; d,e,f: f8-2')
-        self.assertEqual(mgr.nblocks, 2)
-        self.assertEqual(len(mgr), 6)
+        assert mgr.nblocks == 2
+        assert len(mgr) == 6
 
     def test_is_mixed_dtype(self):
-        self.assertFalse(create_mgr('a,b:f8').is_mixed_type)
-        self.assertFalse(create_mgr('a:f8-1; b:f8-2').is_mixed_type)
+        assert not create_mgr('a,b:f8').is_mixed_type
+        assert not create_mgr('a:f8-1; b:f8-2').is_mixed_type
 
-        self.assertTrue(create_mgr('a,b:f8; c,d: f4').is_mixed_type)
-        self.assertTrue(create_mgr('a,b:f8; c,d: object').is_mixed_type)
+        assert create_mgr('a,b:f8; c,d: f4').is_mixed_type
+        assert create_mgr('a,b:f8; c,d: object').is_mixed_type
 
     def test_is_indexed_like(self):
         mgr1 = create_mgr('a,b: f8')
         mgr2 = create_mgr('a:i8; b:bool')
         mgr3 = create_mgr('a,b,c: f8')
-        self.assertTrue(mgr1._is_indexed_like(mgr1))
-        self.assertTrue(mgr1._is_indexed_like(mgr2))
-        self.assertTrue(mgr1._is_indexed_like(mgr3))
+        assert mgr1._is_indexed_like(mgr1)
+        assert mgr1._is_indexed_like(mgr2)
+        assert mgr1._is_indexed_like(mgr3)
 
-        self.assertFalse(mgr1._is_indexed_like(mgr1.get_slice(
-            slice(-1), axis=1)))
+        assert not mgr1._is_indexed_like(mgr1.get_slice(
+            slice(-1), axis=1))
 
     def test_duplicate_ref_loc_failure(self):
         tmp_mgr = create_mgr('a:bool; a: f8')
@@ -359,61 +362,63 @@ def test_duplicate_ref_loc_failure(self):
         blocks[0].mgr_locs = np.array([0])
         blocks[1].mgr_locs = np.array([0])
+
         # test trying to create block manager with overlapping ref locs
-        self.assertRaises(AssertionError, BlockManager, blocks, axes)
+        with pytest.raises(AssertionError):
+            BlockManager(blocks, axes)
 
         blocks[0].mgr_locs = np.array([0])
         blocks[1].mgr_locs = np.array([1])
         mgr = BlockManager(blocks, axes)
         mgr.iget(1)
 
-    def test_contains(self):
-        self.assertIn('a', self.mgr)
-        self.assertNotIn('baz', self.mgr)
+    def test_contains(self, mgr):
+        assert 'a' in mgr
+        assert 'baz' not in mgr
 
-    def test_pickle(self):
+    def test_pickle(self, mgr):
 
-        mgr2 = self.round_trip_pickle(self.mgr)
-        assert_frame_equal(DataFrame(self.mgr), DataFrame(mgr2))
+        mgr2 = tm.round_trip_pickle(mgr)
+        assert_frame_equal(DataFrame(mgr), DataFrame(mgr2))
 
         # share ref_items
-        # self.assertIs(mgr2.blocks[0].ref_items, mgr2.blocks[1].ref_items)
+        # assert mgr2.blocks[0].ref_items is mgr2.blocks[1].ref_items
 
         # GH2431
-        self.assertTrue(hasattr(mgr2, "_is_consolidated"))
-        self.assertTrue(hasattr(mgr2, "_known_consolidated"))
+        assert hasattr(mgr2, "_is_consolidated")
+        assert hasattr(mgr2, "_known_consolidated")
 
         # reset to False on load
- self.assertFalse(mgr2._is_consolidated) - self.assertFalse(mgr2._known_consolidated) + assert not mgr2._is_consolidated + assert not mgr2._known_consolidated def test_non_unique_pickle(self): mgr = create_mgr('a,a,a:f8') - mgr2 = self.round_trip_pickle(mgr) + mgr2 = tm.round_trip_pickle(mgr) assert_frame_equal(DataFrame(mgr), DataFrame(mgr2)) mgr = create_mgr('a: f8; a: i8') - mgr2 = self.round_trip_pickle(mgr) + mgr2 = tm.round_trip_pickle(mgr) assert_frame_equal(DataFrame(mgr), DataFrame(mgr2)) def test_categorical_block_pickle(self): mgr = create_mgr('a: category') - mgr2 = self.round_trip_pickle(mgr) + mgr2 = tm.round_trip_pickle(mgr) assert_frame_equal(DataFrame(mgr), DataFrame(mgr2)) smgr = create_single_mgr('category') - smgr2 = self.round_trip_pickle(smgr) + smgr2 = tm.round_trip_pickle(smgr) assert_series_equal(Series(smgr), Series(smgr2)) - def test_get_scalar(self): - for item in self.mgr.items: - for i, index in enumerate(self.mgr.axes[1]): - res = self.mgr.get_scalar((item, index)) - exp = self.mgr.get(item, fastpath=False)[i] - self.assertEqual(res, exp) - exp = self.mgr.get(item).internal_values()[i] - self.assertEqual(res, exp) + def test_get_scalar(self, mgr): + for item in mgr.items: + for i, index in enumerate(mgr.axes[1]): + res = mgr.get_scalar((item, index)) + exp = mgr.get(item, fastpath=False)[i] + assert res == exp + exp = mgr.get(item).internal_values()[i] + assert res == exp def test_get(self): cols = Index(list('abc')) @@ -442,30 +447,21 @@ def test_set(self): tm.assert_numpy_array_equal(mgr.get('d').internal_values(), np.array(['foo'] * 3, dtype=np.object_)) - def test_insert(self): - self.mgr.insert(0, 'inserted', np.arange(N)) + def test_set_change_dtype(self, mgr): + mgr.set('baz', np.zeros(N, dtype=bool)) - self.assertEqual(self.mgr.items[0], 'inserted') - assert_almost_equal(self.mgr.get('inserted'), np.arange(N)) + mgr.set('baz', np.repeat('foo', N)) + assert mgr.get('baz').dtype == np.object_ - for blk in self.mgr.blocks: - yield self.assertIs, self.mgr.items, blk.ref_items - - def test_set_change_dtype(self): - self.mgr.set('baz', np.zeros(N, dtype=bool)) - - self.mgr.set('baz', np.repeat('foo', N)) - self.assertEqual(self.mgr.get('baz').dtype, np.object_) - - mgr2 = self.mgr.consolidate() + mgr2 = mgr.consolidate() mgr2.set('baz', np.repeat('foo', N)) - self.assertEqual(mgr2.get('baz').dtype, np.object_) + assert mgr2.get('baz').dtype == np.object_ mgr2.set('quux', randn(N).astype(int)) - self.assertEqual(mgr2.get('quux').dtype, np.int_) + assert mgr2.get('quux').dtype == np.int_ mgr2.set('quux', randn(N)) - self.assertEqual(mgr2.get('quux').dtype, np.float_) + assert mgr2.get('quux').dtype == np.float_ def test_set_change_dtype_slice(self): # GH8850 cols = MultiIndex.from_tuples([('1st', 'a'), ('2nd', 'b'), ('3rd', 'c') @@ -473,70 +469,69 @@ def test_set_change_dtype_slice(self): # GH8850 df = DataFrame([[1.0, 2, 3], [4.0, 5, 6]], columns=cols) df['2nd'] = df['2nd'] * 2.0 - self.assertEqual(sorted(df.blocks.keys()), ['float64', 'int64']) + assert sorted(df.blocks.keys()) == ['float64', 'int64'] assert_frame_equal(df.blocks['float64'], DataFrame( [[1.0, 4.0], [4.0, 10.0]], columns=cols[:2])) assert_frame_equal(df.blocks['int64'], DataFrame( [[3], [6]], columns=cols[2:])) - def test_copy(self): - cp = self.mgr.copy(deep=False) - for blk, cp_blk in zip(self.mgr.blocks, cp.blocks): + def test_copy(self, mgr): + cp = mgr.copy(deep=False) + for blk, cp_blk in zip(mgr.blocks, cp.blocks): # view assertion - self.assertTrue(cp_blk.equals(blk)) - 
self.assertTrue(cp_blk.values.base is blk.values.base) + assert cp_blk.equals(blk) + assert cp_blk.values.base is blk.values.base - cp = self.mgr.copy(deep=True) - for blk, cp_blk in zip(self.mgr.blocks, cp.blocks): + cp = mgr.copy(deep=True) + for blk, cp_blk in zip(mgr.blocks, cp.blocks): # copy assertion we either have a None for a base or in case of # some blocks it is an array (e.g. datetimetz), but was copied - self.assertTrue(cp_blk.equals(blk)) + assert cp_blk.equals(blk) if cp_blk.values.base is not None and blk.values.base is not None: - self.assertFalse(cp_blk.values.base is blk.values.base) + assert cp_blk.values.base is not blk.values.base else: - self.assertTrue(cp_blk.values.base is None and blk.values.base - is None) + assert cp_blk.values.base is None and blk.values.base is None def test_sparse(self): mgr = create_mgr('a: sparse-1; b: sparse-2') # what to test here? - self.assertEqual(mgr.as_matrix().dtype, np.float64) + assert mgr.as_matrix().dtype == np.float64 def test_sparse_mixed(self): mgr = create_mgr('a: sparse-1; b: sparse-2; c: f8') - self.assertEqual(len(mgr.blocks), 3) - self.assertIsInstance(mgr, BlockManager) + assert len(mgr.blocks) == 3 + assert isinstance(mgr, BlockManager) # what to test here? def test_as_matrix_float(self): mgr = create_mgr('c: f4; d: f2; e: f8') - self.assertEqual(mgr.as_matrix().dtype, np.float64) + assert mgr.as_matrix().dtype == np.float64 mgr = create_mgr('c: f4; d: f2') - self.assertEqual(mgr.as_matrix().dtype, np.float32) + assert mgr.as_matrix().dtype == np.float32 def test_as_matrix_int_bool(self): mgr = create_mgr('a: bool-1; b: bool-2') - self.assertEqual(mgr.as_matrix().dtype, np.bool_) + assert mgr.as_matrix().dtype == np.bool_ mgr = create_mgr('a: i8-1; b: i8-2; c: i4; d: i2; e: u1') - self.assertEqual(mgr.as_matrix().dtype, np.int64) + assert mgr.as_matrix().dtype == np.int64 mgr = create_mgr('c: i4; d: i2; e: u1') - self.assertEqual(mgr.as_matrix().dtype, np.int32) + assert mgr.as_matrix().dtype == np.int32 def test_as_matrix_datetime(self): mgr = create_mgr('h: datetime-1; g: datetime-2') - self.assertEqual(mgr.as_matrix().dtype, 'M8[ns]') + assert mgr.as_matrix().dtype == 'M8[ns]' def test_as_matrix_datetime_tz(self): mgr = create_mgr('h: M8[ns, US/Eastern]; g: M8[ns, CET]') - self.assertEqual(mgr.get('h').dtype, 'datetime64[ns, US/Eastern]') - self.assertEqual(mgr.get('g').dtype, 'datetime64[ns, CET]') - self.assertEqual(mgr.as_matrix().dtype, 'object') + assert mgr.get('h').dtype == 'datetime64[ns, US/Eastern]' + assert mgr.get('g').dtype == 'datetime64[ns, CET]' + assert mgr.as_matrix().dtype == 'object' def test_astype(self): # coerce all @@ -544,34 +539,34 @@ def test_astype(self): for t in ['float16', 'float32', 'float64', 'int32', 'int64']: t = np.dtype(t) tmgr = mgr.astype(t) - self.assertEqual(tmgr.get('c').dtype.type, t) - self.assertEqual(tmgr.get('d').dtype.type, t) - self.assertEqual(tmgr.get('e').dtype.type, t) + assert tmgr.get('c').dtype.type == t + assert tmgr.get('d').dtype.type == t + assert tmgr.get('e').dtype.type == t # mixed mgr = create_mgr('a,b: object; c: bool; d: datetime;' 'e: f4; f: f2; g: f8') for t in ['float16', 'float32', 'float64', 'int32', 'int64']: t = np.dtype(t) - tmgr = mgr.astype(t, raise_on_error=False) - self.assertEqual(tmgr.get('c').dtype.type, t) - self.assertEqual(tmgr.get('e').dtype.type, t) - self.assertEqual(tmgr.get('f').dtype.type, t) - self.assertEqual(tmgr.get('g').dtype.type, t) - - self.assertEqual(tmgr.get('a').dtype.type, np.object_) - 
self.assertEqual(tmgr.get('b').dtype.type, np.object_)
+            tmgr = mgr.astype(t, errors='ignore')
+            assert tmgr.get('c').dtype.type == t
+            assert tmgr.get('e').dtype.type == t
+            assert tmgr.get('f').dtype.type == t
+            assert tmgr.get('g').dtype.type == t
+
+            assert tmgr.get('a').dtype.type == np.object_
+            assert tmgr.get('b').dtype.type == np.object_
            if t != np.int64:
-                self.assertEqual(tmgr.get('d').dtype.type, np.datetime64)
+                assert tmgr.get('d').dtype.type == np.datetime64
            else:
-                self.assertEqual(tmgr.get('d').dtype.type, t)
+                assert tmgr.get('d').dtype.type == t

    def test_convert(self):
        def _compare(old_mgr, new_mgr):
            """ compare the blocks, numeric compare ==, object don't """
            old_blocks = set(old_mgr.blocks)
            new_blocks = set(new_mgr.blocks)
-            self.assertEqual(len(old_blocks), len(new_blocks))
+            assert len(old_blocks) == len(new_blocks)

            # compare non-numeric
            for b in old_blocks:
@@ -580,7 +575,7 @@ def _compare(old_mgr, new_mgr):
                    if (b.values == nb.values).all():
                        found = True
                        break
-                self.assertTrue(found)
+                assert found

            for b in new_blocks:
                found = False
@@ -588,7 +583,7 @@ def _compare(old_mgr, new_mgr):
                    if (b.values == ob.values).all():
                        found = True
                        break
-                self.assertTrue(found)
+                assert found

        # noops
        mgr = create_mgr('f: i8; g: f8')
@@ -605,11 +600,11 @@ def _compare(old_mgr, new_mgr):
        mgr.set('b', np.array(['2.'] * N, dtype=np.object_))
        mgr.set('foo', np.array(['foo.'] * N, dtype=np.object_))
        new_mgr = mgr.convert(numeric=True)
-        self.assertEqual(new_mgr.get('a').dtype, np.int64)
-        self.assertEqual(new_mgr.get('b').dtype, np.float64)
-        self.assertEqual(new_mgr.get('foo').dtype, np.object_)
-        self.assertEqual(new_mgr.get('f').dtype, np.int64)
-        self.assertEqual(new_mgr.get('g').dtype, np.float64)
+        assert new_mgr.get('a').dtype == np.int64
+        assert new_mgr.get('b').dtype == np.float64
+        assert new_mgr.get('foo').dtype == np.object_
+        assert new_mgr.get('f').dtype == np.int64
+        assert new_mgr.get('g').dtype == np.float64

        mgr = create_mgr('a,b,foo: object; f: i4; bool: bool; dt: datetime;'
                         'i: i8; g: f8; h: f2')
@@ -617,15 +612,15 @@ def _compare(old_mgr, new_mgr):
        mgr.set('b', np.array(['2.'] * N, dtype=np.object_))
        mgr.set('foo', np.array(['foo.'] * N, dtype=np.object_))
        new_mgr = mgr.convert(numeric=True)
-        self.assertEqual(new_mgr.get('a').dtype, np.int64)
-        self.assertEqual(new_mgr.get('b').dtype, np.float64)
-        self.assertEqual(new_mgr.get('foo').dtype, np.object_)
-        self.assertEqual(new_mgr.get('f').dtype, np.int32)
-        self.assertEqual(new_mgr.get('bool').dtype, np.bool_)
-        self.assertEqual(new_mgr.get('dt').dtype.type, np.datetime64)
-        self.assertEqual(new_mgr.get('i').dtype, np.int64)
-        self.assertEqual(new_mgr.get('g').dtype, np.float64)
-        self.assertEqual(new_mgr.get('h').dtype, np.float16)
+        assert new_mgr.get('a').dtype == np.int64
+        assert new_mgr.get('b').dtype == np.float64
+        assert new_mgr.get('foo').dtype == np.object_
+        assert new_mgr.get('f').dtype == np.int32
+        assert new_mgr.get('bool').dtype == np.bool_
+        assert new_mgr.get('dt').dtype.type == np.datetime64
+        assert new_mgr.get('i').dtype == np.int64
+        assert new_mgr.get('g').dtype == np.float64
+        assert new_mgr.get('h').dtype == np.float16

    def test_interleave(self):
@@ -633,49 +628,49 @@ def test_interleave(self):
        for dtype in ['f8', 'i8', 'object', 'bool', 'complex', 'M8[ns]',
                      'm8[ns]']:
            mgr = create_mgr('a: {0}'.format(dtype))
-            self.assertEqual(mgr.as_matrix().dtype, dtype)
+            assert mgr.as_matrix().dtype == dtype
            mgr = create_mgr('a: {0}; b: {0}'.format(dtype))
-            self.assertEqual(mgr.as_matrix().dtype, dtype)
+
assert mgr.as_matrix().dtype == dtype # will be converted according the actual dtype of the underlying mgr = create_mgr('a: category') - self.assertEqual(mgr.as_matrix().dtype, 'i8') + assert mgr.as_matrix().dtype == 'i8' mgr = create_mgr('a: category; b: category') - self.assertEqual(mgr.as_matrix().dtype, 'i8'), + assert mgr.as_matrix().dtype == 'i8' mgr = create_mgr('a: category; b: category2') - self.assertEqual(mgr.as_matrix().dtype, 'object') + assert mgr.as_matrix().dtype == 'object' mgr = create_mgr('a: category2') - self.assertEqual(mgr.as_matrix().dtype, 'object') + assert mgr.as_matrix().dtype == 'object' mgr = create_mgr('a: category2; b: category2') - self.assertEqual(mgr.as_matrix().dtype, 'object') + assert mgr.as_matrix().dtype == 'object' # combinations mgr = create_mgr('a: f8') - self.assertEqual(mgr.as_matrix().dtype, 'f8') + assert mgr.as_matrix().dtype == 'f8' mgr = create_mgr('a: f8; b: i8') - self.assertEqual(mgr.as_matrix().dtype, 'f8') + assert mgr.as_matrix().dtype == 'f8' mgr = create_mgr('a: f4; b: i8') - self.assertEqual(mgr.as_matrix().dtype, 'f4') + assert mgr.as_matrix().dtype == 'f8' mgr = create_mgr('a: f4; b: i8; d: object') - self.assertEqual(mgr.as_matrix().dtype, 'object') + assert mgr.as_matrix().dtype == 'object' mgr = create_mgr('a: bool; b: i8') - self.assertEqual(mgr.as_matrix().dtype, 'object') + assert mgr.as_matrix().dtype == 'object' mgr = create_mgr('a: complex') - self.assertEqual(mgr.as_matrix().dtype, 'complex') + assert mgr.as_matrix().dtype == 'complex' mgr = create_mgr('a: f8; b: category') - self.assertEqual(mgr.as_matrix().dtype, 'object') + assert mgr.as_matrix().dtype == 'object' mgr = create_mgr('a: M8[ns]; b: category') - self.assertEqual(mgr.as_matrix().dtype, 'object') + assert mgr.as_matrix().dtype == 'object' mgr = create_mgr('a: M8[ns]; b: bool') - self.assertEqual(mgr.as_matrix().dtype, 'object') + assert mgr.as_matrix().dtype == 'object' mgr = create_mgr('a: M8[ns]; b: i8') - self.assertEqual(mgr.as_matrix().dtype, 'object') + assert mgr.as_matrix().dtype == 'object' mgr = create_mgr('a: m8[ns]; b: bool') - self.assertEqual(mgr.as_matrix().dtype, 'object') + assert mgr.as_matrix().dtype == 'object' mgr = create_mgr('a: m8[ns]; b: i8') - self.assertEqual(mgr.as_matrix().dtype, 'object') + assert mgr.as_matrix().dtype == 'object' mgr = create_mgr('a: M8[ns]; b: m8[ns]') - self.assertEqual(mgr.as_matrix().dtype, 'object') + assert mgr.as_matrix().dtype == 'object' def test_interleave_non_unique_cols(self): df = DataFrame([ @@ -686,26 +681,26 @@ def test_interleave_non_unique_cols(self): df_unique = df.copy() df_unique.columns = ['x', 'y'] - self.assertEqual(df_unique.values.shape, df.values.shape) + assert df_unique.values.shape == df.values.shape tm.assert_numpy_array_equal(df_unique.values[0], df.values[0]) tm.assert_numpy_array_equal(df_unique.values[1], df.values[1]) def test_consolidate(self): pass - def test_consolidate_ordering_issues(self): - self.mgr.set('f', randn(N)) - self.mgr.set('d', randn(N)) - self.mgr.set('b', randn(N)) - self.mgr.set('g', randn(N)) - self.mgr.set('h', randn(N)) - - # we have datetime/tz blocks in self.mgr - cons = self.mgr.consolidate() - self.assertEqual(cons.nblocks, 4) - cons = self.mgr.consolidate().get_numeric_data() - self.assertEqual(cons.nblocks, 1) - tm.assertIsInstance(cons.blocks[0].mgr_locs, lib.BlockPlacement) + def test_consolidate_ordering_issues(self, mgr): + mgr.set('f', randn(N)) + mgr.set('d', randn(N)) + mgr.set('b', randn(N)) + mgr.set('g', randn(N)) + mgr.set('h', 
randn(N)) + + # we have datetime/tz blocks in mgr + cons = mgr.consolidate() + assert cons.nblocks == 4 + cons = mgr.consolidate().get_numeric_data() + assert cons.nblocks == 1 + assert isinstance(cons.blocks[0].mgr_locs, lib.BlockPlacement) tm.assert_numpy_array_equal(cons.blocks[0].mgr_locs.as_array, np.arange(len(cons.items), dtype=np.int64)) @@ -718,7 +713,7 @@ def test_reindex_items(self): 'f: bool; g: f8-2') reindexed = mgr.reindex_axis(['g', 'c', 'a', 'd'], axis=0) - self.assertEqual(reindexed.nblocks, 2) + assert reindexed.nblocks == 2 tm.assert_index_equal(reindexed.items, pd.Index(['g', 'c', 'a', 'd'])) assert_almost_equal( mgr.get('g', fastpath=False), reindexed.get('g', fastpath=False)) @@ -752,9 +747,9 @@ def test_multiindex_xs(self): mgr.set_axis(1, index) result = mgr.xs('bar', axis=1) - self.assertEqual(result.shape, (6, 2)) - self.assertEqual(result.axes[1][0], ('bar', 'one')) - self.assertEqual(result.axes[1][1], ('bar', 'two')) + assert result.shape == (6, 2) + assert result.axes[1][0] == ('bar', 'one') + assert result.axes[1][1] == ('bar', 'two') def test_get_numeric_data(self): mgr = create_mgr('int: int; float: float; complex: complex;' @@ -822,7 +817,7 @@ def test_unicode_repr_doesnt_raise(self): def test_missing_unicode_key(self): df = DataFrame({"a": [1]}) try: - df.ix[:, u("\u05d0")] # should not raise UnicodeEncodeError + df.loc[:, u("\u05d0")] # should not raise UnicodeEncodeError except KeyError: pass # this is the expected exception @@ -830,11 +825,11 @@ def test_equals(self): # unique items bm1 = create_mgr('a,b,c: i8-1; d,e,f: i8-2') bm2 = BlockManager(bm1.blocks[::-1], bm1.axes) - self.assertTrue(bm1.equals(bm2)) + assert bm1.equals(bm2) bm1 = create_mgr('a,a,a: i8-1; b,b,b: i8-2') bm2 = BlockManager(bm1.blocks[::-1], bm1.axes) - self.assertTrue(bm1.equals(bm2)) + assert bm1.equals(bm2) def test_equals_block_order_different_dtypes(self): # GH 9330 @@ -852,12 +847,20 @@ def test_equals_block_order_different_dtypes(self): block_perms = itertools.permutations(bm.blocks) for bm_perm in block_perms: bm_this = BlockManager(bm_perm, bm.axes) - self.assertTrue(bm.equals(bm_this)) - self.assertTrue(bm_this.equals(bm)) + assert bm.equals(bm_this) + assert bm_this.equals(bm) def test_single_mgr_ctor(self): mgr = create_single_mgr('f8', num_rows=5) - self.assertEqual(mgr.as_matrix().tolist(), [0., 1., 2., 3., 4.]) + assert mgr.as_matrix().tolist() == [0., 1., 2., 3., 4.] 
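The hunks above replace `tm.TestCase` machinery with plain pytest idioms: per-class `setUp` state becomes an injected fixture argument (`def test_contains(self, mgr)`), and `self.assertRaises` becomes the `pytest.raises` context manager. As a rough sketch of how the injected `mgr` argument would be provided — the spec string is taken from the deleted `setUp` above, but the fixture itself is illustrative, not copied from this patch:

```python
import pytest


@pytest.fixture
def mgr():
    # recreates the mixed-dtype BlockManager the old setUp stored on self;
    # pytest passes it to any test function that declares an `mgr` argument
    return create_mgr(
        'a: f8; b: object; c: f8; d: object; e: f8;'
        'f: bool; g: i8; h: complex; i: datetime-1; j: datetime-2;'
        'k: M8[ns, US/Eastern]; l: M8[ns, CET];')
```

With the fixture in place each test receives a fresh manager, which is what lets the class drop `tm.TestCase` and its shared `setUp` entirely.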
+ + def test_validate_bool_args(self): + invalid_values = [1, "True", [1, 2, 3], 5.0] + bm1 = create_mgr('a,b,c: i8-1; d,e,f: i8-2') + + for value in invalid_values: + with pytest.raises(ValueError): + bm1.replace_list([1], [2], inplace=value) class TestIndexing(object): @@ -914,32 +917,37 @@ def assert_slice_ok(mgr, axis, slobj): for mgr in self.MANAGERS: for ax in range(mgr.ndim): # slice - yield assert_slice_ok, mgr, ax, slice(None) - yield assert_slice_ok, mgr, ax, slice(3) - yield assert_slice_ok, mgr, ax, slice(100) - yield assert_slice_ok, mgr, ax, slice(1, 4) - yield assert_slice_ok, mgr, ax, slice(3, 0, -2) + assert_slice_ok(mgr, ax, slice(None)) + assert_slice_ok(mgr, ax, slice(3)) + assert_slice_ok(mgr, ax, slice(100)) + assert_slice_ok(mgr, ax, slice(1, 4)) + assert_slice_ok(mgr, ax, slice(3, 0, -2)) # boolean mask - yield assert_slice_ok, mgr, ax, np.array([], dtype=np.bool_) - yield (assert_slice_ok, mgr, ax, - np.ones(mgr.shape[ax], dtype=np.bool_)) - yield (assert_slice_ok, mgr, ax, - np.zeros(mgr.shape[ax], dtype=np.bool_)) + assert_slice_ok( + mgr, ax, np.array([], dtype=np.bool_)) + assert_slice_ok( + mgr, ax, + np.ones(mgr.shape[ax], dtype=np.bool_)) + assert_slice_ok( + mgr, ax, + np.zeros(mgr.shape[ax], dtype=np.bool_)) if mgr.shape[ax] >= 3: - yield (assert_slice_ok, mgr, ax, - np.arange(mgr.shape[ax]) % 3 == 0) - yield (assert_slice_ok, mgr, ax, np.array( - [True, True, False], dtype=np.bool_)) + assert_slice_ok( + mgr, ax, + np.arange(mgr.shape[ax]) % 3 == 0) + assert_slice_ok( + mgr, ax, np.array( + [True, True, False], dtype=np.bool_)) # fancy indexer - yield assert_slice_ok, mgr, ax, [] - yield assert_slice_ok, mgr, ax, lrange(mgr.shape[ax]) + assert_slice_ok(mgr, ax, []) + assert_slice_ok(mgr, ax, lrange(mgr.shape[ax])) if mgr.shape[ax] >= 3: - yield assert_slice_ok, mgr, ax, [0, 1, 2] - yield assert_slice_ok, mgr, ax, [-1, -2, -3] + assert_slice_ok(mgr, ax, [0, 1, 2]) + assert_slice_ok(mgr, ax, [-1, -2, -3]) def test_take(self): def assert_take_ok(mgr, axis, indexer): @@ -953,13 +961,13 @@ def assert_take_ok(mgr, axis, indexer): for mgr in self.MANAGERS: for ax in range(mgr.ndim): # take/fancy indexer - yield assert_take_ok, mgr, ax, [] - yield assert_take_ok, mgr, ax, [0, 0, 0] - yield assert_take_ok, mgr, ax, lrange(mgr.shape[ax]) + assert_take_ok(mgr, ax, []) + assert_take_ok(mgr, ax, [0, 0, 0]) + assert_take_ok(mgr, ax, lrange(mgr.shape[ax])) if mgr.shape[ax] >= 3: - yield assert_take_ok, mgr, ax, [0, 1, 2] - yield assert_take_ok, mgr, ax, [-1, -2, -3] + assert_take_ok(mgr, ax, [0, 1, 2]) + assert_take_ok(mgr, ax, [-1, -2, -3]) def test_reindex_axis(self): def assert_reindex_axis_is_ok(mgr, axis, new_labels, fill_value): @@ -977,25 +985,33 @@ def assert_reindex_axis_is_ok(mgr, axis, new_labels, fill_value): for mgr in self.MANAGERS: for ax in range(mgr.ndim): for fill_value in (None, np.nan, 100.): - yield (assert_reindex_axis_is_ok, mgr, ax, - pd.Index([]), fill_value) - yield (assert_reindex_axis_is_ok, mgr, ax, mgr.axes[ax], - fill_value) - yield (assert_reindex_axis_is_ok, mgr, ax, - mgr.axes[ax][[0, 0, 0]], fill_value) - yield (assert_reindex_axis_is_ok, mgr, ax, - pd.Index(['foo', 'bar', 'baz']), fill_value) - yield (assert_reindex_axis_is_ok, mgr, ax, - pd.Index(['foo', mgr.axes[ax][0], 'baz']), - fill_value) + assert_reindex_axis_is_ok( + mgr, ax, + pd.Index([]), fill_value) + assert_reindex_axis_is_ok( + mgr, ax, mgr.axes[ax], + fill_value) + assert_reindex_axis_is_ok( + mgr, ax, + mgr.axes[ax][[0, 0, 0]], fill_value) + assert_reindex_axis_is_ok( 
+ mgr, ax, + pd.Index(['foo', 'bar', 'baz']), fill_value) + assert_reindex_axis_is_ok( + mgr, ax, + pd.Index(['foo', mgr.axes[ax][0], 'baz']), + fill_value) if mgr.shape[ax] >= 3: - yield (assert_reindex_axis_is_ok, mgr, ax, - mgr.axes[ax][:-3], fill_value) - yield (assert_reindex_axis_is_ok, mgr, ax, - mgr.axes[ax][-3::-1], fill_value) - yield (assert_reindex_axis_is_ok, mgr, ax, - mgr.axes[ax][[0, 1, 2, 0, 1, 2]], fill_value) + assert_reindex_axis_is_ok( + mgr, ax, + mgr.axes[ax][:-3], fill_value) + assert_reindex_axis_is_ok( + mgr, ax, + mgr.axes[ax][-3::-1], fill_value) + assert_reindex_axis_is_ok( + mgr, ax, + mgr.axes[ax][[0, 1, 2, 0, 1, 2]], fill_value) def test_reindex_indexer(self): @@ -1014,33 +1030,41 @@ def assert_reindex_indexer_is_ok(mgr, axis, new_labels, indexer, for mgr in self.MANAGERS: for ax in range(mgr.ndim): for fill_value in (None, np.nan, 100.): - yield (assert_reindex_indexer_is_ok, mgr, ax, - pd.Index([]), [], fill_value) - yield (assert_reindex_indexer_is_ok, mgr, ax, - mgr.axes[ax], np.arange(mgr.shape[ax]), fill_value) - yield (assert_reindex_indexer_is_ok, mgr, ax, - pd.Index(['foo'] * mgr.shape[ax]), - np.arange(mgr.shape[ax]), fill_value) - - yield (assert_reindex_indexer_is_ok, mgr, ax, - mgr.axes[ax][::-1], np.arange(mgr.shape[ax]), - fill_value) - yield (assert_reindex_indexer_is_ok, mgr, ax, mgr.axes[ax], - np.arange(mgr.shape[ax])[::-1], fill_value) - yield (assert_reindex_indexer_is_ok, mgr, ax, - pd.Index(['foo', 'bar', 'baz']), - [0, 0, 0], fill_value) - yield (assert_reindex_indexer_is_ok, mgr, ax, - pd.Index(['foo', 'bar', 'baz']), - [-1, 0, -1], fill_value) - yield (assert_reindex_indexer_is_ok, mgr, ax, - pd.Index(['foo', mgr.axes[ax][0], 'baz']), - [-1, -1, -1], fill_value) + assert_reindex_indexer_is_ok( + mgr, ax, + pd.Index([]), [], fill_value) + assert_reindex_indexer_is_ok( + mgr, ax, + mgr.axes[ax], np.arange(mgr.shape[ax]), fill_value) + assert_reindex_indexer_is_ok( + mgr, ax, + pd.Index(['foo'] * mgr.shape[ax]), + np.arange(mgr.shape[ax]), fill_value) + assert_reindex_indexer_is_ok( + mgr, ax, + mgr.axes[ax][::-1], np.arange(mgr.shape[ax]), + fill_value) + assert_reindex_indexer_is_ok( + mgr, ax, mgr.axes[ax], + np.arange(mgr.shape[ax])[::-1], fill_value) + assert_reindex_indexer_is_ok( + mgr, ax, + pd.Index(['foo', 'bar', 'baz']), + [0, 0, 0], fill_value) + assert_reindex_indexer_is_ok( + mgr, ax, + pd.Index(['foo', 'bar', 'baz']), + [-1, 0, -1], fill_value) + assert_reindex_indexer_is_ok( + mgr, ax, + pd.Index(['foo', mgr.axes[ax][0], 'baz']), + [-1, -1, -1], fill_value) if mgr.shape[ax] >= 3: - yield (assert_reindex_indexer_is_ok, mgr, ax, - pd.Index(['foo', 'bar', 'baz']), - [0, 1, 2], fill_value) + assert_reindex_indexer_is_ok( + mgr, ax, + pd.Index(['foo', 'bar', 'baz']), + [0, 1, 2], fill_value) # test_get_slice(slice_like, axis) # take(indexer, axis) @@ -1048,25 +1072,26 @@ def assert_reindex_indexer_is_ok(mgr, axis, new_labels, indexer, # reindex_indexer(new_labels, indexer, axis) -class TestBlockPlacement(tm.TestCase): - _multiprocess_can_split_ = True +class TestBlockPlacement(object): def test_slice_len(self): - self.assertEqual(len(BlockPlacement(slice(0, 4))), 4) - self.assertEqual(len(BlockPlacement(slice(0, 4, 2))), 2) - self.assertEqual(len(BlockPlacement(slice(0, 3, 2))), 2) + assert len(BlockPlacement(slice(0, 4))) == 4 + assert len(BlockPlacement(slice(0, 4, 2))) == 2 + assert len(BlockPlacement(slice(0, 3, 2))) == 2 - self.assertEqual(len(BlockPlacement(slice(0, 1, 2))), 1) - 
self.assertEqual(len(BlockPlacement(slice(1, 0, -1))), 1) + assert len(BlockPlacement(slice(0, 1, 2))) == 1 + assert len(BlockPlacement(slice(1, 0, -1))) == 1 def test_zero_step_raises(self): - self.assertRaises(ValueError, BlockPlacement, slice(1, 1, 0)) - self.assertRaises(ValueError, BlockPlacement, slice(1, 2, 0)) + with pytest.raises(ValueError): + BlockPlacement(slice(1, 1, 0)) + with pytest.raises(ValueError): + BlockPlacement(slice(1, 2, 0)) def test_unbounded_slice_raises(self): def assert_unbounded_slice_error(slc): - self.assertRaisesRegexp(ValueError, "unbounded slice", - lambda: BlockPlacement(slc)) + tm.assert_raises_regex(ValueError, "unbounded slice", + lambda: BlockPlacement(slc)) assert_unbounded_slice_error(slice(None, None)) assert_unbounded_slice_error(slice(10, None)) @@ -1084,7 +1109,7 @@ def assert_unbounded_slice_error(slc): def test_not_slice_like_slices(self): def assert_not_slice_like(slc): - self.assertTrue(not BlockPlacement(slc).is_slice_like) + assert not BlockPlacement(slc).is_slice_like assert_not_slice_like(slice(0, 0)) assert_not_slice_like(slice(100, 0)) @@ -1092,12 +1117,12 @@ def assert_not_slice_like(slc): assert_not_slice_like(slice(100, 100, -1)) assert_not_slice_like(slice(0, 100, -1)) - self.assertTrue(not BlockPlacement(slice(0, 0)).is_slice_like) - self.assertTrue(not BlockPlacement(slice(100, 100)).is_slice_like) + assert not BlockPlacement(slice(0, 0)).is_slice_like + assert not BlockPlacement(slice(100, 100)).is_slice_like def test_array_to_slice_conversion(self): def assert_as_slice_equals(arr, slc): - self.assertEqual(BlockPlacement(arr).as_slice, slc) + assert BlockPlacement(arr).as_slice == slc assert_as_slice_equals([0], slice(0, 1, 1)) assert_as_slice_equals([100], slice(100, 101, 1)) @@ -1107,12 +1132,14 @@ def assert_as_slice_equals(arr, slc): assert_as_slice_equals([0, 100], slice(0, 200, 100)) assert_as_slice_equals([2, 1], slice(2, 0, -1)) - assert_as_slice_equals([2, 1, 0], slice(2, None, -1)) - assert_as_slice_equals([100, 0], slice(100, None, -100)) + + if not PY361: + assert_as_slice_equals([2, 1, 0], slice(2, None, -1)) + assert_as_slice_equals([100, 0], slice(100, None, -100)) def test_not_slice_like_arrays(self): def assert_not_slice_like(arr): - self.assertTrue(not BlockPlacement(arr).is_slice_like) + assert not BlockPlacement(arr).is_slice_like assert_not_slice_like([]) assert_not_slice_like([-1]) @@ -1125,13 +1152,13 @@ def assert_not_slice_like(arr): assert_not_slice_like([1, 1, 1]) def test_slice_iter(self): - self.assertEqual(list(BlockPlacement(slice(0, 3))), [0, 1, 2]) - self.assertEqual(list(BlockPlacement(slice(0, 0))), []) - self.assertEqual(list(BlockPlacement(slice(3, 0))), []) + assert list(BlockPlacement(slice(0, 3))) == [0, 1, 2] + assert list(BlockPlacement(slice(0, 0))) == [] + assert list(BlockPlacement(slice(3, 0))) == [] - self.assertEqual(list(BlockPlacement(slice(3, 0, -1))), [3, 2, 1]) - self.assertEqual(list(BlockPlacement(slice(3, None, -1))), - [3, 2, 1, 0]) + if not PY361: + assert list(BlockPlacement(slice(3, 0, -1))) == [3, 2, 1] + assert list(BlockPlacement(slice(3, None, -1))) == [3, 2, 1, 0] def test_slice_to_array_conversion(self): def assert_as_array_equals(slc, asarray): @@ -1144,44 +1171,44 @@ def assert_as_array_equals(slc, asarray): assert_as_array_equals(slice(3, 0), []) assert_as_array_equals(slice(3, 0, -1), [3, 2, 1]) - assert_as_array_equals(slice(3, None, -1), [3, 2, 1, 0]) - assert_as_array_equals(slice(31, None, -10), [31, 21, 11, 1]) + + if not PY361: + 
assert_as_array_equals(slice(3, None, -1), [3, 2, 1, 0]) + assert_as_array_equals(slice(31, None, -10), [31, 21, 11, 1]) def test_blockplacement_add(self): bpl = BlockPlacement(slice(0, 5)) - self.assertEqual(bpl.add(1).as_slice, slice(1, 6, 1)) - self.assertEqual(bpl.add(np.arange(5)).as_slice, slice(0, 10, 2)) - self.assertEqual(list(bpl.add(np.arange(5, 0, -1))), [5, 5, 5, 5, 5]) + assert bpl.add(1).as_slice == slice(1, 6, 1) + assert bpl.add(np.arange(5)).as_slice == slice(0, 10, 2) + assert list(bpl.add(np.arange(5, 0, -1))) == [5, 5, 5, 5, 5] def test_blockplacement_add_int(self): def assert_add_equals(val, inc, result): - self.assertEqual(list(BlockPlacement(val).add(inc)), result) + assert list(BlockPlacement(val).add(inc)) == result assert_add_equals(slice(0, 0), 0, []) assert_add_equals(slice(1, 4), 0, [1, 2, 3]) assert_add_equals(slice(3, 0, -1), 0, [3, 2, 1]) - assert_add_equals(slice(2, None, -1), 0, [2, 1, 0]) assert_add_equals([1, 2, 4], 0, [1, 2, 4]) assert_add_equals(slice(0, 0), 10, []) assert_add_equals(slice(1, 4), 10, [11, 12, 13]) assert_add_equals(slice(3, 0, -1), 10, [13, 12, 11]) - assert_add_equals(slice(2, None, -1), 10, [12, 11, 10]) assert_add_equals([1, 2, 4], 10, [11, 12, 14]) assert_add_equals(slice(0, 0), -1, []) assert_add_equals(slice(1, 4), -1, [0, 1, 2]) - assert_add_equals(slice(3, 0, -1), -1, [2, 1, 0]) assert_add_equals([1, 2, 4], -1, [0, 1, 3]) - self.assertRaises(ValueError, - lambda: BlockPlacement(slice(1, 4)).add(-10)) - self.assertRaises(ValueError, - lambda: BlockPlacement([1, 2, 4]).add(-10)) - self.assertRaises(ValueError, - lambda: BlockPlacement(slice(2, None, -1)).add(-1)) + with pytest.raises(ValueError): + BlockPlacement(slice(1, 4)).add(-10) + with pytest.raises(ValueError): + BlockPlacement([1, 2, 4]).add(-10) + if not PY361: + assert_add_equals(slice(3, 0, -1), -1, [2, 1, 0]) + assert_add_equals(slice(2, None, -1), 0, [2, 1, 0]) + assert_add_equals(slice(2, None, -1), 10, [12, 11, 10]) -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + with pytest.raises(ValueError): + BlockPlacement(slice(2, None, -1)).add(-1) diff --git a/pandas/tests/test_join.py b/pandas/tests/test_join.py index bfdb77f3fb350..3fc13d23b53f7 100644 --- a/pandas/tests/test_join.py +++ b/pandas/tests/test_join.py @@ -3,13 +3,12 @@ import numpy as np from pandas import Index -import pandas._join as _join +from pandas._libs import join as _join import pandas.util.testing as tm from pandas.util.testing import assert_almost_equal -class TestIndexer(tm.TestCase): - _multiprocess_can_split_ = True +class TestIndexer(object): def test_outer_join_indexer(self): typemap = [('int32', _join.outer_join_indexer_int32), @@ -24,9 +23,9 @@ def test_outer_join_indexer(self): empty = np.array([], dtype=dtype) result, lindexer, rindexer = indexer(left, right) - tm.assertIsInstance(result, np.ndarray) - tm.assertIsInstance(lindexer, np.ndarray) - tm.assertIsInstance(rindexer, np.ndarray) + assert isinstance(result, np.ndarray) + assert isinstance(lindexer, np.ndarray) + assert isinstance(rindexer, np.ndarray) tm.assert_numpy_array_equal(result, np.arange(5, dtype=dtype)) exp = np.array([0, 1, 2, -1, -1], dtype=np.int64) tm.assert_numpy_array_equal(lindexer, exp) @@ -193,9 +192,3 @@ def test_inner_join_indexer2(): exp_ridx = np.array([0, 1, 2, 3], dtype=np.int64) assert_almost_equal(ridx, exp_ridx) - - -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - 
exit=False)
diff --git a/pandas/tests/test_lib.py b/pandas/tests/test_lib.py
index 945f8004687cd..df97095035952 100644
--- a/pandas/tests/test_lib.py
+++ b/pandas/tests/test_lib.py
@@ -1,29 +1,31 @@
 # -*- coding: utf-8 -*-
+import pytest
+
 import numpy as np

 import pandas as pd
-import pandas.lib as lib
+import pandas._libs.lib as lib
 import pandas.util.testing as tm


-class TestMisc(tm.TestCase):
+class TestMisc(object):

    def test_max_len_string_array(self):

        arr = a = np.array(['foo', 'b', np.nan], dtype='object')
-        self.assertTrue(lib.max_len_string_array(arr), 3)
+        assert lib.max_len_string_array(arr) == 3

        # unicode
        arr = a.astype('U').astype(object)
-        self.assertTrue(lib.max_len_string_array(arr), 3)
+        assert lib.max_len_string_array(arr) == 3

        # bytes for python3
        arr = a.astype('S').astype(object)
-        self.assertTrue(lib.max_len_string_array(arr), 3)
+        assert lib.max_len_string_array(arr) == 3

        # raises
-        tm.assertRaises(TypeError,
-                        lambda: lib.max_len_string_array(arr.astype('U')))
+        pytest.raises(TypeError,
+                      lambda: lib.max_len_string_array(arr.astype('U')))

    def test_fast_unique_multiple_list_gen_sort(self):
        keys = [['p', 'a'], ['n', 'd'], ['a', 's']]
@@ -39,7 +41,7 @@ def test_fast_unique_multiple_list_gen_sort(self):
        tm.assert_numpy_array_equal(np.array(out), expected)


-class TestIndexing(tm.TestCase):
+class TestIndexing(object):

    def test_maybe_indices_to_slice_left_edge(self):
        target = np.arange(100)
@@ -47,32 +49,36 @@ def test_maybe_indices_to_slice_left_edge(self):
        # slice
        indices = np.array([], dtype=np.int64)
        maybe_slice = lib.maybe_indices_to_slice(indices, len(target))
-        self.assertTrue(isinstance(maybe_slice, slice))
-        self.assert_numpy_array_equal(target[indices], target[maybe_slice])
+
+        assert isinstance(maybe_slice, slice)
+        tm.assert_numpy_array_equal(target[indices], target[maybe_slice])

        for end in [1, 2, 5, 20, 99]:
            for step in [1, 2, 4]:
                indices = np.arange(0, end, step, dtype=np.int64)
                maybe_slice = lib.maybe_indices_to_slice(indices, len(target))
-                self.assertTrue(isinstance(maybe_slice, slice))
-                self.assert_numpy_array_equal(target[indices],
-                                              target[maybe_slice])
+
+                assert isinstance(maybe_slice, slice)
+                tm.assert_numpy_array_equal(target[indices],
+                                            target[maybe_slice])

                # reverse
                indices = indices[::-1]
                maybe_slice = lib.maybe_indices_to_slice(indices, len(target))
-                self.assertTrue(isinstance(maybe_slice, slice))
-                self.assert_numpy_array_equal(target[indices],
-                                              target[maybe_slice])
+
+                assert isinstance(maybe_slice, slice)
+                tm.assert_numpy_array_equal(target[indices],
+                                            target[maybe_slice])

        # not slice
        for case in [[2, 1, 2, 0], [2, 2, 1, 0], [0, 1, 2, 1], [-2, 0, 2],
                     [2, 0, -2]]:
            indices = np.array(case, dtype=np.int64)
            maybe_slice = lib.maybe_indices_to_slice(indices, len(target))
-            self.assertFalse(isinstance(maybe_slice, slice))
-            self.assert_numpy_array_equal(maybe_slice, indices)
-            self.assert_numpy_array_equal(target[indices], target[maybe_slice])
+
+            assert not isinstance(maybe_slice, slice)
+            tm.assert_numpy_array_equal(maybe_slice, indices)
+            tm.assert_numpy_array_equal(target[indices], target[maybe_slice])

    def test_maybe_indices_to_slice_right_edge(self):
        target = np.arange(100)
@@ -82,42 +88,49 @@ def test_maybe_indices_to_slice_right_edge(self):
            for step in [1, 2, 4]:
                indices = np.arange(start, 99, step, dtype=np.int64)
                maybe_slice = lib.maybe_indices_to_slice(indices, len(target))
-                self.assertTrue(isinstance(maybe_slice, slice))
-                self.assert_numpy_array_equal(target[indices],
-                                              target[maybe_slice])
+
+                assert isinstance(maybe_slice, slice)
+
tm.assert_numpy_array_equal(target[indices], + target[maybe_slice]) # reverse indices = indices[::-1] maybe_slice = lib.maybe_indices_to_slice(indices, len(target)) - self.assertTrue(isinstance(maybe_slice, slice)) - self.assert_numpy_array_equal(target[indices], - target[maybe_slice]) + + assert isinstance(maybe_slice, slice) + tm.assert_numpy_array_equal(target[indices], + target[maybe_slice]) # not slice indices = np.array([97, 98, 99, 100], dtype=np.int64) maybe_slice = lib.maybe_indices_to_slice(indices, len(target)) - self.assertFalse(isinstance(maybe_slice, slice)) - self.assert_numpy_array_equal(maybe_slice, indices) - with self.assertRaises(IndexError): + + assert not isinstance(maybe_slice, slice) + tm.assert_numpy_array_equal(maybe_slice, indices) + + with pytest.raises(IndexError): target[indices] - with self.assertRaises(IndexError): + with pytest.raises(IndexError): target[maybe_slice] indices = np.array([100, 99, 98, 97], dtype=np.int64) maybe_slice = lib.maybe_indices_to_slice(indices, len(target)) - self.assertFalse(isinstance(maybe_slice, slice)) - self.assert_numpy_array_equal(maybe_slice, indices) - with self.assertRaises(IndexError): + + assert not isinstance(maybe_slice, slice) + tm.assert_numpy_array_equal(maybe_slice, indices) + + with pytest.raises(IndexError): target[indices] - with self.assertRaises(IndexError): + with pytest.raises(IndexError): target[maybe_slice] for case in [[99, 97, 99, 96], [99, 99, 98, 97], [98, 98, 97, 96]]: indices = np.array(case, dtype=np.int64) maybe_slice = lib.maybe_indices_to_slice(indices, len(target)) - self.assertFalse(isinstance(maybe_slice, slice)) - self.assert_numpy_array_equal(maybe_slice, indices) - self.assert_numpy_array_equal(target[indices], target[maybe_slice]) + + assert not isinstance(maybe_slice, slice) + tm.assert_numpy_array_equal(maybe_slice, indices) + tm.assert_numpy_array_equal(target[indices], target[maybe_slice]) def test_maybe_indices_to_slice_both_edges(self): target = np.arange(10) @@ -126,22 +139,22 @@ def test_maybe_indices_to_slice_both_edges(self): for step in [1, 2, 4, 5, 8, 9]: indices = np.arange(0, 9, step, dtype=np.int64) maybe_slice = lib.maybe_indices_to_slice(indices, len(target)) - self.assertTrue(isinstance(maybe_slice, slice)) - self.assert_numpy_array_equal(target[indices], target[maybe_slice]) + assert isinstance(maybe_slice, slice) + tm.assert_numpy_array_equal(target[indices], target[maybe_slice]) # reverse indices = indices[::-1] maybe_slice = lib.maybe_indices_to_slice(indices, len(target)) - self.assertTrue(isinstance(maybe_slice, slice)) - self.assert_numpy_array_equal(target[indices], target[maybe_slice]) + assert isinstance(maybe_slice, slice) + tm.assert_numpy_array_equal(target[indices], target[maybe_slice]) # not slice for case in [[4, 2, 0, -2], [2, 2, 1, 0], [0, 1, 2, 1]]: indices = np.array(case, dtype=np.int64) maybe_slice = lib.maybe_indices_to_slice(indices, len(target)) - self.assertFalse(isinstance(maybe_slice, slice)) - self.assert_numpy_array_equal(maybe_slice, indices) - self.assert_numpy_array_equal(target[indices], target[maybe_slice]) + assert not isinstance(maybe_slice, slice) + tm.assert_numpy_array_equal(maybe_slice, indices) + tm.assert_numpy_array_equal(target[indices], target[maybe_slice]) def test_maybe_indices_to_slice_middle(self): target = np.arange(100) @@ -151,41 +164,44 @@ def test_maybe_indices_to_slice_middle(self): for step in [1, 2, 4, 20]: indices = np.arange(start, end, step, dtype=np.int64) maybe_slice = lib.maybe_indices_to_slice(indices, 
len(target)) - self.assertTrue(isinstance(maybe_slice, slice)) - self.assert_numpy_array_equal(target[indices], - target[maybe_slice]) + + assert isinstance(maybe_slice, slice) + tm.assert_numpy_array_equal(target[indices], + target[maybe_slice]) # reverse indices = indices[::-1] maybe_slice = lib.maybe_indices_to_slice(indices, len(target)) - self.assertTrue(isinstance(maybe_slice, slice)) - self.assert_numpy_array_equal(target[indices], - target[maybe_slice]) + + assert isinstance(maybe_slice, slice) + tm.assert_numpy_array_equal(target[indices], + target[maybe_slice]) # not slice for case in [[14, 12, 10, 12], [12, 12, 11, 10], [10, 11, 12, 11]]: indices = np.array(case, dtype=np.int64) maybe_slice = lib.maybe_indices_to_slice(indices, len(target)) - self.assertFalse(isinstance(maybe_slice, slice)) - self.assert_numpy_array_equal(maybe_slice, indices) - self.assert_numpy_array_equal(target[indices], target[maybe_slice]) + + assert not isinstance(maybe_slice, slice) + tm.assert_numpy_array_equal(maybe_slice, indices) + tm.assert_numpy_array_equal(target[indices], target[maybe_slice]) def test_maybe_booleans_to_slice(self): arr = np.array([0, 0, 1, 1, 1, 0, 1], dtype=np.uint8) result = lib.maybe_booleans_to_slice(arr) - self.assertTrue(result.dtype == np.bool_) + assert result.dtype == np.bool_ result = lib.maybe_booleans_to_slice(arr[:0]) - self.assertTrue(result == slice(0, 0)) + assert result == slice(0, 0) def test_get_reverse_indexer(self): indexer = np.array([-1, -1, 1, 2, 0, -1, 3, 4], dtype=np.int64) result = lib.get_reverse_indexer(indexer, 5) expected = np.array([4, 2, 3, 6, 7], dtype=np.int64) - self.assertTrue(np.array_equal(result, expected)) + assert np.array_equal(result, expected) -class TestNullObj(tm.TestCase): +class TestNullObj(object): _1d_methods = ['isnullobj', 'isnullobj_old'] _2d_methods = ['isnullobj2d', 'isnullobj2d_old'] @@ -232,10 +248,3 @@ def test_empty_like(self): expected = np.array([True]) self._check_behavior(arr, expected) - - -if __name__ == '__main__': - import nose - - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/test_msgpack/test_except.py b/pandas/tests/test_msgpack/test_except.py deleted file mode 100644 index 79290ebb891fd..0000000000000 --- a/pandas/tests/test_msgpack/test_except.py +++ /dev/null @@ -1,34 +0,0 @@ -#!/usr/bin/env python -# coding: utf-8 - -import unittest -from pandas.msgpack import packb, unpackb - - -class DummyException(Exception): - pass - - -class TestExceptions(unittest.TestCase): - - def test_raise_on_find_unsupported_value(self): - import datetime - self.assertRaises(TypeError, packb, datetime.datetime.now()) - - def test_raise_from_object_hook(self): - def hook(obj): - raise DummyException - - self.assertRaises(DummyException, unpackb, packb({}), object_hook=hook) - self.assertRaises(DummyException, unpackb, packb({'fizz': 'buzz'}), - object_hook=hook) - self.assertRaises(DummyException, unpackb, packb({'fizz': 'buzz'}), - object_pairs_hook=hook) - self.assertRaises(DummyException, unpackb, - packb({'fizz': {'buzz': 'spam'}}), object_hook=hook) - self.assertRaises(DummyException, unpackb, - packb({'fizz': {'buzz': 'spam'}}), - object_pairs_hook=hook) - - def test_invalidvalue(self): - self.assertRaises(ValueError, unpackb, b'\xd9\x97#DL_') diff --git a/pandas/tests/test_multilevel.py b/pandas/tests/test_multilevel.py old mode 100755 new mode 100644 index 4e7ace4173227..ab28b8b43f359 --- a/pandas/tests/test_multilevel.py +++ 
b/pandas/tests/test_multilevel.py @@ -1,8 +1,9 @@ # -*- coding: utf-8 -*- # pylint: disable-msg=W0612,E1101,W0141 +from warnings import catch_warnings import datetime import itertools -import nose +import pytest from numpy.random import randn import numpy as np @@ -10,23 +11,18 @@ from pandas.core.index import Index, MultiIndex from pandas import Panel, DataFrame, Series, notnull, isnull, Timestamp -from pandas.types.common import is_float_dtype, is_integer_dtype -from pandas.util.testing import (assert_almost_equal, assert_series_equal, - assert_frame_equal, assertRaisesRegexp) +from pandas.core.dtypes.common import is_float_dtype, is_integer_dtype import pandas.core.common as com import pandas.util.testing as tm from pandas.compat import (range, lrange, StringIO, lzip, u, product as cart_product, zip) import pandas as pd +import pandas._libs.index as _index -import pandas.index as _index +class Base(object): -class TestMultiLevel(tm.TestCase): - - _multiprocess_can_split_ = True - - def setUp(self): + def setup_method(self, method): index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', 'three']], @@ -59,6 +55,9 @@ def setUp(self): inplace=True) self.ymd.index.set_names(['year', 'month', 'day'], inplace=True) + +class TestMultiLevel(Base): + def test_append(self): a, b = self.frame[:5], self.frame[5:] @@ -84,55 +83,55 @@ def test_append_index(self): # GH 7112 import pytz tz = pytz.timezone('Asia/Tokyo') - expected_tuples = [(1.1, datetime.datetime(2011, 1, 1, tzinfo=tz)), - (1.2, datetime.datetime(2011, 1, 2, tzinfo=tz)), - (1.3, datetime.datetime(2011, 1, 3, tzinfo=tz))] + expected_tuples = [(1.1, tz.localize(datetime.datetime(2011, 1, 1))), + (1.2, tz.localize(datetime.datetime(2011, 1, 2))), + (1.3, tz.localize(datetime.datetime(2011, 1, 3)))] expected = Index([1.1, 1.2, 1.3] + expected_tuples) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) result = midx_lv2.append(idx1) expected = Index(expected_tuples + [1.1, 1.2, 1.3]) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) result = midx_lv2.append(midx_lv2) expected = MultiIndex.from_arrays([idx1.append(idx1), idx2.append(idx2)]) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) result = midx_lv2.append(midx_lv3) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) result = midx_lv3.append(midx_lv2) expected = Index._simple_new( - np.array([(1.1, datetime.datetime(2011, 1, 1, tzinfo=tz), 'A'), - (1.2, datetime.datetime(2011, 1, 2, tzinfo=tz), 'B'), - (1.3, datetime.datetime(2011, 1, 3, tzinfo=tz), 'C')] + + np.array([(1.1, tz.localize(datetime.datetime(2011, 1, 1)), 'A'), + (1.2, tz.localize(datetime.datetime(2011, 1, 2)), 'B'), + (1.3, tz.localize(datetime.datetime(2011, 1, 3)), 'C')] + expected_tuples), None) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) def test_dataframe_constructor(self): multi = DataFrame(np.random.randn(4, 4), index=[np.array(['a', 'a', 'b', 'b']), np.array(['x', 'y', 'x', 'y'])]) - tm.assertIsInstance(multi.index, MultiIndex) - self.assertNotIsInstance(multi.columns, MultiIndex) + assert isinstance(multi.index, MultiIndex) + assert not isinstance(multi.columns, MultiIndex) multi = DataFrame(np.random.randn(4, 4), columns=[['a', 'a', 'b', 'b'], ['x', 'y', 'x', 'y']]) - tm.assertIsInstance(multi.columns, MultiIndex) + assert isinstance(multi.columns, MultiIndex) def test_series_constructor(self): multi = Series(1., 
index=[np.array(['a', 'a', 'b', 'b']), np.array( ['x', 'y', 'x', 'y'])]) - tm.assertIsInstance(multi.index, MultiIndex) + assert isinstance(multi.index, MultiIndex) multi = Series(1., index=[['a', 'a', 'b', 'b'], ['x', 'y', 'x', 'y']]) - tm.assertIsInstance(multi.index, MultiIndex) + assert isinstance(multi.index, MultiIndex) multi = Series(lrange(4), index=[['a', 'a', 'b', 'b'], ['x', 'y', 'x', 'y']]) - tm.assertIsInstance(multi.index, MultiIndex) + assert isinstance(multi.index, MultiIndex) def test_reindex_level(self): # axis=0 @@ -140,18 +139,18 @@ def test_reindex_level(self): result = month_sums.reindex(self.ymd.index, level=1) expected = self.ymd.groupby(level='month').transform(np.sum) - assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) # Series result = month_sums['A'].reindex(self.ymd.index, level=1) expected = self.ymd['A'].groupby(level='month').transform(np.sum) - assert_series_equal(result, expected, check_names=False) + tm.assert_series_equal(result, expected, check_names=False) # axis=1 month_sums = self.ymd.T.sum(axis=1, level='month') result = month_sums.reindex(columns=self.ymd.index, level=1) expected = self.ymd.groupby(level='month').transform(np.sum).T - assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) def test_binops_level(self): def _check_op(opname): @@ -161,7 +160,7 @@ def _check_op(opname): broadcasted = self.ymd.groupby(level='month').transform(np.sum) expected = op(self.ymd, broadcasted) - assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) # Series op = getattr(Series, opname) @@ -170,7 +169,7 @@ def _check_op(opname): np.sum) expected = op(self.ymd['A'], broadcasted) expected.name = 'A' - assert_series_equal(result, expected) + tm.assert_series_equal(result, expected) _check_op('sub') _check_op('add') @@ -179,8 +178,8 @@ def _check_op(opname): def test_pickle(self): def _test_roundtrip(frame): - unpickled = self.round_trip_pickle(frame) - assert_frame_equal(frame, unpickled) + unpickled = tm.round_trip_pickle(frame) + tm.assert_frame_equal(frame, unpickled) _test_roundtrip(self.frame) _test_roundtrip(self.frame.T) @@ -188,68 +187,32 @@ def _test_roundtrip(frame): _test_roundtrip(self.ymd.T) def test_reindex(self): - reindexed = self.frame.ix[[('foo', 'one'), ('bar', 'one')]] - expected = self.frame.ix[[0, 3]] - assert_frame_equal(reindexed, expected) + expected = self.frame.iloc[[0, 3]] + reindexed = self.frame.loc[[('foo', 'one'), ('bar', 'one')]] + tm.assert_frame_equal(reindexed, expected) + + with catch_warnings(record=True): + reindexed = self.frame.ix[[('foo', 'one'), ('bar', 'one')]] + tm.assert_frame_equal(reindexed, expected) def test_reindex_preserve_levels(self): new_index = self.ymd.index[::10] chunk = self.ymd.reindex(new_index) - self.assertIs(chunk.index, new_index) + assert chunk.index is new_index - chunk = self.ymd.ix[new_index] - self.assertIs(chunk.index, new_index) + chunk = self.ymd.loc[new_index] + assert chunk.index is new_index + + with catch_warnings(record=True): + chunk = self.ymd.ix[new_index] + assert chunk.index is new_index ymdT = self.ymd.T chunk = ymdT.reindex(columns=new_index) - self.assertIs(chunk.columns, new_index) - - chunk = ymdT.ix[:, new_index] - self.assertIs(chunk.columns, new_index) - - def test_sort_index_preserve_levels(self): - result = self.frame.sort_index() - self.assertEqual(result.index.names, self.frame.index.names) - - def test_sorting_repr_8017(self): + assert chunk.columns is new_index - np.random.seed(0) - data = 
np.random.randn(3, 4) - - for gen, extra in [([1., 3., 2., 5.], 4.), ([1, 3, 2, 5], 4), - ([Timestamp('20130101'), Timestamp('20130103'), - Timestamp('20130102'), Timestamp('20130105')], - Timestamp('20130104')), - (['1one', '3one', '2one', '5one'], '4one')]: - columns = MultiIndex.from_tuples([('red', i) for i in gen]) - df = DataFrame(data, index=list('def'), columns=columns) - df2 = pd.concat([df, - DataFrame('world', index=list('def'), - columns=MultiIndex.from_tuples( - [('red', extra)]))], axis=1) - - # check that the repr is good - # make sure that we have a correct sparsified repr - # e.g. only 1 header of read - self.assertEqual(str(df2).splitlines()[0].split(), ['red']) - - # GH 8017 - # sorting fails after columns added - - # construct single-dtype then sort - result = df.copy().sort_index(axis=1) - expected = df.iloc[:, [0, 2, 1, 3]] - assert_frame_equal(result, expected) - - result = df2.sort_index(axis=1) - expected = df2.iloc[:, [0, 2, 1, 4, 3]] - assert_frame_equal(result, expected) - - # setitem then sort - result = df.copy() - result[('red', extra)] = 'world' - result = result.sort_index(axis=1) - assert_frame_equal(result, expected) + chunk = ymdT.loc[:, new_index] + assert chunk.columns is new_index def test_repr_to_string(self): repr(self.frame) @@ -270,15 +233,17 @@ def test_repr_name_coincide(self): df = DataFrame({'value': [0, 1]}, index=index) lines = repr(df).split('\n') - self.assertTrue(lines[2].startswith('a 0 foo')) + assert lines[2].startswith('a 0 foo') def test_getitem_simple(self): df = self.frame.T col = df['foo', 'one'] - assert_almost_equal(col.values, df.values[:, 0]) - self.assertRaises(KeyError, df.__getitem__, ('foo', 'four')) - self.assertRaises(KeyError, df.__getitem__, 'foobar') + tm.assert_almost_equal(col.values, df.values[:, 0]) + with pytest.raises(KeyError): + df[('foo', 'four')] + with pytest.raises(KeyError): + df['foobar'] def test_series_getitem(self): s = self.ymd['A'] @@ -286,46 +251,50 @@ def test_series_getitem(self): result = s[2000, 3] # TODO(wesm): unused? 
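From here on, the `test_multilevel.py` hunks follow one repeated pattern: the deprecated `.ix` indexer is swapped for `.loc` (label-based) or `.iloc` (position-based), and tests that deliberately keep exercising the legacy `.ix` path wrap it in `catch_warnings(record=True)` so the deprecation warning is captured rather than surfaced. A condensed sketch of the pattern, with `frame` standing in for the multi-indexed fixture DataFrame (illustrative, not part of the patch):

```python
from warnings import catch_warnings

# label-based lookups move from .ix to .loc
result = frame.loc[[('foo', 'one'), ('bar', 'one')]]

# purely positional lookups move to .iloc
head = frame.iloc[:4]

# where the old .ix behaviour itself is still under test, record the
# deprecation warning instead of letting it bubble up to the test runner
with catch_warnings(record=True):
    legacy = frame.ix[[('foo', 'one'), ('bar', 'one')]]
```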
- # result2 = s.ix[2000, 3] + # result2 = s.loc[2000, 3] expected = s.reindex(s.index[42:65]) expected.index = expected.index.droplevel(0).droplevel(0) - assert_series_equal(result, expected) + tm.assert_series_equal(result, expected) result = s[2000, 3, 10] expected = s[49] - self.assertEqual(result, expected) + assert result == expected # fancy - result = s.ix[[(2000, 3, 10), (2000, 3, 13)]] expected = s.reindex(s.index[49:51]) - assert_series_equal(result, expected) + result = s.loc[[(2000, 3, 10), (2000, 3, 13)]] + tm.assert_series_equal(result, expected) + + with catch_warnings(record=True): + result = s.ix[[(2000, 3, 10), (2000, 3, 13)]] + tm.assert_series_equal(result, expected) # key error - self.assertRaises(KeyError, s.__getitem__, (2000, 3, 4)) + pytest.raises(KeyError, s.__getitem__, (2000, 3, 4)) def test_series_getitem_corner(self): s = self.ymd['A'] # don't segfault, GH #495 # out of bounds access - self.assertRaises(IndexError, s.__getitem__, len(self.ymd)) + pytest.raises(IndexError, s.__getitem__, len(self.ymd)) # generator result = s[(x > 0 for x in s)] expected = s[s > 0] - assert_series_equal(result, expected) + tm.assert_series_equal(result, expected) def test_series_setitem(self): s = self.ymd['A'] s[2000, 3] = np.nan - self.assertTrue(isnull(s.values[42:65]).all()) - self.assertTrue(notnull(s.values[:42]).all()) - self.assertTrue(notnull(s.values[65:]).all()) + assert isnull(s.values[42:65]).all() + assert notnull(s.values[:42]).all() + assert notnull(s.values[65:]).all() s[2000, 3, 10] = np.nan - self.assertTrue(isnull(s[49])) + assert isnull(s[49]) def test_series_slice_partial(self): pass @@ -336,36 +305,36 @@ def test_frame_getitem_setitem_boolean(self): result = df[df > 0] expected = df.where(df > 0) - assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) df[df > 0] = 5 values[values > 0] = 5 - assert_almost_equal(df.values, values) + tm.assert_almost_equal(df.values, values) df[df == 5] = 0 values[values == 5] = 0 - assert_almost_equal(df.values, values) + tm.assert_almost_equal(df.values, values) # a df that needs alignment first df[df[:-1] < 0] = 2 np.putmask(values[:-1], values[:-1] < 0, 2) - assert_almost_equal(df.values, values) + tm.assert_almost_equal(df.values, values) - with assertRaisesRegexp(TypeError, 'boolean values only'): + with tm.assert_raises_regex(TypeError, 'boolean values only'): df[df * 0] = 2 def test_frame_getitem_setitem_slice(self): # getitem - result = self.frame.ix[:4] + result = self.frame.iloc[:4] expected = self.frame[:4] - assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) # setitem cp = self.frame.copy() - cp.ix[:4] = 0 + cp.iloc[:4] = 0 - self.assertTrue((cp.values[:4] == 0).all()) - self.assertTrue((cp.values[4:] != 0).all()) + assert (cp.values[:4] == 0).all() + assert (cp.values[4:] != 0).all() def test_frame_getitem_setitem_multislice(self): levels = [['t1', 't2'], ['a', 'b', 'c']] @@ -373,22 +342,26 @@ def test_frame_getitem_setitem_multislice(self): midx = MultiIndex(labels=labels, levels=levels, names=[None, 'id']) df = DataFrame({'value': [1, 2, 3, 7, 8]}, index=midx) - result = df.ix[:, 'value'] - assert_series_equal(df['value'], result) + result = df.loc[:, 'value'] + tm.assert_series_equal(df['value'], result) - result = df.ix[1:3, 'value'] - assert_series_equal(df['value'][1:3], result) + with catch_warnings(record=True): + result = df.ix[:, 'value'] + tm.assert_series_equal(df['value'], result) - result = df.ix[:, :] - assert_frame_equal(df, result) + result = 
df.loc[df.index[1:3], 'value'] + tm.assert_series_equal(df['value'][1:3], result) + + result = df.loc[:, :] + tm.assert_frame_equal(df, result) result = df - df.ix[:, 'value'] = 10 + df.loc[:, 'value'] = 10 result['value'] = 10 - assert_frame_equal(df, result) + tm.assert_frame_equal(df, result) - df.ix[:, :] = 10 - assert_frame_equal(df, result) + df.loc[:, :] = 10 + tm.assert_frame_equal(df, result) def test_frame_getitem_multicolumn_empty_level(self): f = DataFrame({'a': ['1', '2', '3'], 'b': ['2', '3', '4']}) @@ -398,7 +371,7 @@ def test_frame_getitem_multicolumn_empty_level(self): result = f['level1 item1'] expected = DataFrame([['1'], ['2'], ['3']], index=f.index, columns=['level3 item1']) - assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected) def test_frame_setitem_multi_column(self): df = DataFrame(randn(10, 4), columns=[['a', 'a', 'b', 'b'], @@ -406,12 +379,12 @@ def test_frame_setitem_multi_column(self): cp = df.copy() cp['a'] = cp['b'] - assert_frame_equal(cp['a'], cp['b']) + tm.assert_frame_equal(cp['a'], cp['b']) # set with ndarray cp = df.copy() cp['a'] = cp['b'].values - assert_frame_equal(cp['a'], cp['b']) + tm.assert_frame_equal(cp['a'], cp['b']) # --------------------------------------- # #1803 @@ -420,7 +393,7 @@ def test_frame_setitem_multi_column(self): # Works, but adds a column instead of updating the two existing ones df['A'] = 0.0 # Doesn't work - self.assertTrue((df['A'].values == 0).all()) + assert (df['A'].values == 0).all() # it broadcasts df['B', '1'] = [1, 2, 3] @@ -429,11 +402,11 @@ def test_frame_setitem_multi_column(self): sliced_a1 = df['A', '1'] sliced_a2 = df['A', '2'] sliced_b1 = df['B', '1'] - assert_series_equal(sliced_a1, sliced_b1, check_names=False) - assert_series_equal(sliced_a2, sliced_b1, check_names=False) - self.assertEqual(sliced_a1.name, ('A', '1')) - self.assertEqual(sliced_a2.name, ('A', '2')) - self.assertEqual(sliced_b1.name, ('B', '1')) + tm.assert_series_equal(sliced_a1, sliced_b1, check_names=False) + tm.assert_series_equal(sliced_a2, sliced_b1, check_names=False) + assert sliced_a1.name == ('A', '1') + assert sliced_a2.name == ('A', '2') + assert sliced_b1.name == ('B', '1') def test_getitem_tuple_plus_slice(self): # GH #671 @@ -444,40 +417,31 @@ def test_getitem_tuple_plus_slice(self): idf = df.set_index(['a', 'b']) - result = idf.ix[(0, 0), :] - expected = idf.ix[0, 0] + result = idf.loc[(0, 0), :] + expected = idf.loc[0, 0] expected2 = idf.xs((0, 0)) + with catch_warnings(record=True): + expected3 = idf.ix[0, 0] - assert_series_equal(result, expected) - assert_series_equal(result, expected2) + tm.assert_series_equal(result, expected) + tm.assert_series_equal(result, expected2) + tm.assert_series_equal(result, expected3) def test_getitem_setitem_tuple_plus_columns(self): # GH #1013 df = self.ymd[:5] - result = df.ix[(2000, 1, 6), ['A', 'B', 'C']] - expected = df.ix[2000, 1, 6][['A', 'B', 'C']] - assert_series_equal(result, expected) - - def test_getitem_multilevel_index_tuple_unsorted(self): - index_columns = list("abc") - df = DataFrame([[0, 1, 0, "x"], [0, 0, 1, "y"]], - columns=index_columns + ["data"]) - df = df.set_index(index_columns) - query_index = df.index[:1] - rs = df.ix[query_index, "data"] - - xp_idx = MultiIndex.from_tuples([(0, 1, 0)], names=['a', 'b', 'c']) - xp = Series(['x'], index=xp_idx, name='data') - assert_series_equal(rs, xp) + result = df.loc[(2000, 1, 6), ['A', 'B', 'C']] + expected = df.loc[2000, 1, 6][['A', 'B', 'C']] + tm.assert_series_equal(result, expected) def 
     def test_xs(self):
         xs = self.frame.xs(('bar', 'two'))
-        xs2 = self.frame.ix[('bar', 'two')]
+        xs2 = self.frame.loc[('bar', 'two')]

-        assert_series_equal(xs, xs2)
-        assert_almost_equal(xs.values, self.frame.values[4])
+        tm.assert_series_equal(xs, xs2)
+        tm.assert_almost_equal(xs.values, self.frame.values[4])

         # GH 6574
         # missing values in returned index should be preserved
@@ -496,18 +460,18 @@ def test_xs(self):
                                      ['xbcde', np.nan, 'zbcde', 'ybcde'],
                                      name='a2'))

         result = df.xs('z', level='a1')
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

     def test_xs_partial(self):
         result = self.frame.xs('foo')
-        result2 = self.frame.ix['foo']
+        result2 = self.frame.loc['foo']
         expected = self.frame.T['foo'].T
-        assert_frame_equal(result, expected)
-        assert_frame_equal(result, result2)
+        tm.assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, result2)

         result = self.ymd.xs((2000, 4))
-        expected = self.ymd.ix[2000, 4]
-        assert_frame_equal(result, expected)
+        expected = self.ymd.loc[2000, 4]
+        tm.assert_frame_equal(result, expected)

         # ex from #1796
         index = MultiIndex(levels=[['foo', 'bar'], ['one', 'two'], [-1, 1]],
@@ -518,15 +482,15 @@ def test_xs_partial(self):
                        columns=list('abcd'))

         result = df.xs(['foo', 'one'])
-        expected = df.ix['foo', 'one']
-        assert_frame_equal(result, expected)
+        expected = df.loc['foo', 'one']
+        tm.assert_frame_equal(result, expected)

     def test_xs_level(self):
         result = self.frame.xs('two', level='second')
         expected = self.frame[self.frame.index.get_level_values(1) == 'two']
         expected.index = expected.index.droplevel(1)

-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         index = MultiIndex.from_tuples([('x', 'y', 'z'), ('a', 'b', 'c'), (
             'p', 'q', 'r')])
@@ -534,7 +498,7 @@ def test_xs_level(self):
         result = df.xs('c', level=2)
         expected = df[1:2]
         expected.index = expected.index.droplevel(2)
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         # this is a copy in 0.14
         result = self.frame.xs('two', level='second')
@@ -544,7 +508,7 @@ def f(x):
             x[:] = 10

-        self.assertRaises(com.SettingWithCopyError, f, result)
+        pytest.raises(com.SettingWithCopyError, f, result)

     def test_xs_level_multiple(self):
         from pandas import read_table
@@ -558,7 +522,7 @@ def test_xs_level_multiple(self):

         result = df.xs(('a', 4), level=['one', 'four'])
         expected = df.xs('a').xs(4, level='four')
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         # this is a copy in 0.14
         result = df.xs(('a', 4), level=['one', 'four'])
@@ -568,7 +532,7 @@ def f(x):
             x[:] = 10

-        self.assertRaises(com.SettingWithCopyError, f, result)
+        pytest.raises(com.SettingWithCopyError, f, result)

         # GH2107
         dates = lrange(20111201, 20111205)
@@ -576,9 +540,10 @@ def f(x):
         idx = MultiIndex.from_tuples([x for x in cart_product(dates, ids)])
         idx.names = ['date', 'secid']
         df = DataFrame(np.random.randn(len(idx), 3), idx, ['X', 'Y', 'Z'])
+
         rs = df.xs(20111201, level='date')
-        xp = df.ix[20111201, :]
-        assert_frame_equal(rs, xp)
+        xp = df.loc[20111201, :]
+        tm.assert_frame_equal(rs, xp)

     def test_xs_level0(self):
         from pandas import read_table
@@ -592,29 +557,29 @@ def test_xs_level0(self):

         result = df.xs('a', level=0)
         expected = df.xs('a')
-        self.assertEqual(len(result), 2)
-        assert_frame_equal(result, expected)
+        assert len(result) == 2
+        tm.assert_frame_equal(result, expected)
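`self.assertRaises` has no equivalent once the tests stop inheriting from `unittest.TestCase`, so the patch swaps in `pytest.raises`, which accepts the same `(exception, callable, *args)` form and also works as a context manager. A self-contained sketch:

```python
import pytest

def divide(a, b):
    return a / b

# before: self.assertRaises(ZeroDivisionError, divide, 1, 0)
pytest.raises(ZeroDivisionError, divide, 1, 0)

# the context-manager form reads better when the failing expression is inline
with pytest.raises(ZeroDivisionError):
    divide(1, 0)
```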
     def test_xs_level_series(self):
         s = self.frame['A']
         result = s[:, 'two']
         expected = self.frame.xs('two', level=1)['A']
-        assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)

         s = self.ymd['A']
         result = s[2000, 5]
-        expected = self.ymd.ix[2000, 5]['A']
-        assert_series_equal(result, expected)
+        expected = self.ymd.loc[2000, 5]['A']
+        tm.assert_series_equal(result, expected)

         # not implementing this for now
-        self.assertRaises(TypeError, s.__getitem__, (2000, slice(3, 4)))
+        pytest.raises(TypeError, s.__getitem__, (2000, slice(3, 4)))

         # result = s[2000, 3:4]
         # lv =s.index.get_level_values(1)
         # expected = s[(lv == 3) | (lv == 4)]
         # expected.index = expected.index.droplevel(0)
-        # assert_series_equal(result, expected)
+        # tm.assert_series_equal(result, expected)

         # can do this though

@@ -630,15 +595,15 @@ def test_getitem_toplevel(self):
         result = df['foo']
         expected = df.reindex(columns=df.columns[:3])
         expected.columns = expected.columns.droplevel(0)
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         result = df['bar']
-        result2 = df.ix[:, 'bar']
+        result2 = df.loc[:, 'bar']
         expected = df.reindex(columns=df.columns[3:5])
         expected.columns = expected.columns.droplevel(0)
-        assert_frame_equal(result, expected)
-        assert_frame_equal(result, result2)
+        tm.assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, result2)

     def test_getitem_setitem_slice_integers(self):
         index = MultiIndex(levels=[[0, 1, 2], [0, 2]],
@@ -646,21 +611,21 @@ def test_getitem_setitem_slice_integers(self):
         frame = DataFrame(np.random.randn(len(index), 4), index=index,
                           columns=['a', 'b', 'c', 'd'])

-        res = frame.ix[1:2]
+        res = frame.loc[1:2]
         exp = frame.reindex(frame.index[2:])
-        assert_frame_equal(res, exp)
+        tm.assert_frame_equal(res, exp)

-        frame.ix[1:2] = 7
-        self.assertTrue((frame.ix[1:2] == 7).values.all())
+        frame.loc[1:2] = 7
+        assert (frame.loc[1:2] == 7).values.all()

         series = Series(np.random.randn(len(index)), index=index)

-        res = series.ix[1:2]
+        res = series.loc[1:2]
         exp = series.reindex(series.index[2:])
-        assert_series_equal(res, exp)
+        tm.assert_series_equal(res, exp)

-        series.ix[1:2] = 7
-        self.assertTrue((series.ix[1:2] == 7).values.all())
+        series.loc[1:2] = 7
+        assert (series.loc[1:2] == 7).values.all()

     def test_getitem_int(self):
         levels = [[0, 1], [0, 1, 2]]
@@ -669,18 +634,18 @@ def test_getitem_int(self):
         frame = DataFrame(np.random.randn(6, 2), index=index)

-        result = frame.ix[1]
+        result = frame.loc[1]
         expected = frame[-3:]
         expected.index = expected.index.droplevel(0)
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         # raises exception
-        self.assertRaises(KeyError, frame.ix.__getitem__, 3)
+        pytest.raises(KeyError, frame.loc.__getitem__, 3)

         # however this will work
-        result = self.frame.ix[2]
+        result = self.frame.iloc[2]
         expected = self.frame.xs(self.frame.index[2])
-        assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)

     def test_getitem_partial(self):
         ymd = self.ymd.T
@@ -688,97 +653,63 @@ def test_getitem_partial(self):
         expected = ymd.reindex(columns=ymd.columns[ymd.columns.labels[1] == 1])
         expected.columns = expected.columns.droplevel(0).droplevel(0)
-        assert_frame_equal(result, expected)
-
-    def test_getitem_slice_not_sorted(self):
-        df = self.frame.sortlevel(1).T
-
-        # buglet with int typechecking
-        result = df.ix[:, :np.int32(3)]
-        expected = df.reindex(columns=df.columns[:3])
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)
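Note the semantics being preserved in `test_getitem_setitem_slice_integers`: on an integer-labeled MultiIndex, `.ix[1:2]` already behaved label-wise, so `.loc[1:2]` is a faithful replacement, while plainly positional uses elsewhere become `.iloc` or go through `df.index[...]`. A short sketch of the distinction on a toy Series with integer labels:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=[1, 2, 3, 4, 5])

s.loc[1:2]           # label-based: the rows labeled 1 through 2, inclusive
s.iloc[1:2]          # position-based: the second row only
s.loc[s.index[1:3]]  # positional intent, expressed through explicit labels
```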
     def test_setitem_change_dtype(self):
         dft = self.frame.T
         s = dft['foo', 'two']
         dft['foo', 'two'] = s > s.median()
-        assert_series_equal(dft['foo', 'two'], s > s.median())
-        # tm.assertIsInstance(dft._data.blocks[1].items, MultiIndex)
+        tm.assert_series_equal(dft['foo', 'two'], s > s.median())
+        # assert isinstance(dft._data.blocks[1].items, MultiIndex)

         reindexed = dft.reindex(columns=[('foo', 'two')])
-        assert_series_equal(reindexed['foo', 'two'], s > s.median())
+        tm.assert_series_equal(reindexed['foo', 'two'], s > s.median())

     def test_frame_setitem_ix(self):
-        self.frame.ix[('bar', 'two'), 'B'] = 5
-        self.assertEqual(self.frame.ix[('bar', 'two'), 'B'], 5)
+        self.frame.loc[('bar', 'two'), 'B'] = 5
+        assert self.frame.loc[('bar', 'two'), 'B'] == 5

         # with integer labels
         df = self.frame.copy()
         df.columns = lrange(3)
-        df.ix[('bar', 'two'), 1] = 7
-        self.assertEqual(df.ix[('bar', 'two'), 1], 7)
+        df.loc[('bar', 'two'), 1] = 7
+        assert df.loc[('bar', 'two'), 1] == 7
+
+        with catch_warnings(record=True):
+            df = self.frame.copy()
+            df.columns = lrange(3)
+            df.ix[('bar', 'two'), 1] = 7
+            assert df.loc[('bar', 'two'), 1] == 7

     def test_fancy_slice_partial(self):
-        result = self.frame.ix['bar':'baz']
+        result = self.frame.loc['bar':'baz']
         expected = self.frame[3:7]
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

-        result = self.ymd.ix[(2000, 2):(2000, 4)]
+        result = self.ymd.loc[(2000, 2):(2000, 4)]
         lev = self.ymd.index.labels[1]
         expected = self.ymd[(lev >= 1) & (lev <= 3)]
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

     def test_getitem_partial_column_select(self):
         idx = MultiIndex(labels=[[0, 0, 0], [0, 1, 1], [1, 0, 1]],
                          levels=[['a', 'b'], ['x', 'y'], ['p', 'q']])
         df = DataFrame(np.random.rand(3, 2), index=idx)

-        result = df.ix[('a', 'y'), :]
-        expected = df.ix[('a', 'y')]
-        assert_frame_equal(result, expected)
-
-        result = df.ix[('a', 'y'), [1, 0]]
-        expected = df.ix[('a', 'y')][[1, 0]]
-        assert_frame_equal(result, expected)
-
-        self.assertRaises(KeyError, df.ix.__getitem__,
-                          (('a', 'foo'), slice(None, None)))
+        result = df.loc[('a', 'y'), :]
+        expected = df.loc[('a', 'y')]
+        tm.assert_frame_equal(result, expected)

-    def test_sortlevel(self):
-        df = self.frame.copy()
-        df.index = np.arange(len(df))
+        result = df.loc[('a', 'y'), [1, 0]]
+        expected = df.loc[('a', 'y')][[1, 0]]
+        tm.assert_frame_equal(result, expected)

-        # axis=1
+        with catch_warnings(record=True):
+            result = df.ix[('a', 'y'), [1, 0]]
+            tm.assert_frame_equal(result, expected)

-        # series
-        a_sorted = self.frame['A'].sortlevel(0)
-
-        # preserve names
-        self.assertEqual(a_sorted.index.names, self.frame.index.names)
-
-        # inplace
-        rs = self.frame.copy()
-        rs.sortlevel(0, inplace=True)
-        assert_frame_equal(rs, self.frame.sortlevel(0))
-
-    def test_sortlevel_large_cardinality(self):
-
-        # #2684 (int64)
-        index = MultiIndex.from_arrays([np.arange(4000)] * 3)
-        df = DataFrame(np.random.randn(4000), index=index, dtype=np.int64)
-
-        # it works!
-        result = df.sortlevel(0)
-        self.assertTrue(result.index.lexsort_depth == 3)
-
-        # #2684 (int32)
-        index = MultiIndex.from_arrays([np.arange(4000)] * 3)
-        df = DataFrame(np.random.randn(4000), index=index, dtype=np.int32)
-
-        # it works!
-        result = df.sortlevel(0)
-        self.assertTrue((result.dtypes.values == df.dtypes.values).all())
-        self.assertTrue(result.index.lexsort_depth == 3)
+        pytest.raises(KeyError, df.loc.__getitem__,
+                      (('a', 'foo'), slice(None, None)))

     def test_delevel_infer_dtype(self):
         tuples = [tuple
@@ -788,42 +719,19 @@ def test_delevel_infer_dtype(self):
         df = DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'],
                        index=index)
         deleveled = df.reset_index()
-        self.assertTrue(is_integer_dtype(deleveled['prm1']))
-        self.assertTrue(is_float_dtype(deleveled['prm2']))
+        assert is_integer_dtype(deleveled['prm1'])
+        assert is_float_dtype(deleveled['prm2'])

     def test_reset_index_with_drop(self):
         deleveled = self.ymd.reset_index(drop=True)
-        self.assertEqual(len(deleveled.columns), len(self.ymd.columns))
+        assert len(deleveled.columns) == len(self.ymd.columns)

         deleveled = self.series.reset_index()
-        tm.assertIsInstance(deleveled, DataFrame)
-        self.assertEqual(len(deleveled.columns),
-                         len(self.series.index.levels) + 1)
+        assert isinstance(deleveled, DataFrame)
+        assert len(deleveled.columns) == len(self.series.index.levels) + 1

         deleveled = self.series.reset_index(drop=True)
-        tm.assertIsInstance(deleveled, Series)
-
-    def test_sortlevel_by_name(self):
-        self.frame.index.names = ['first', 'second']
-        result = self.frame.sortlevel(level='second')
-        expected = self.frame.sortlevel(level=1)
-        assert_frame_equal(result, expected)
-
-    def test_sortlevel_mixed(self):
-        sorted_before = self.frame.sortlevel(1)
-
-        df = self.frame.copy()
-        df['foo'] = 'bar'
-        sorted_after = df.sortlevel(1)
-        assert_frame_equal(sorted_before, sorted_after.drop(['foo'], axis=1))
-
-        dft = self.frame.T
-        sorted_before = dft.sortlevel(1, axis=1)
-        dft['foo', 'three'] = 'bar'
-
-        sorted_after = dft.sortlevel(1, axis=1)
-        assert_frame_equal(sorted_before.drop([('foo', 'three')], axis=1),
-                           sorted_after.drop([('foo', 'three')], axis=1))
+        assert isinstance(deleveled, Series)

     def test_count_level(self):
         def _check_counts(frame, axis=0):
@@ -832,12 +740,12 @@ def _check_counts(frame, axis=0):
                 result = frame.count(axis=axis, level=i)
                 expected = frame.groupby(axis=axis, level=i).count()
                 expected = expected.reindex_like(result).astype('i8')
-                assert_frame_equal(result, expected)
+                tm.assert_frame_equal(result, expected)

-        self.frame.ix[1, [1, 2]] = np.nan
-        self.frame.ix[7, [0, 1]] = np.nan
-        self.ymd.ix[1, [1, 2]] = np.nan
-        self.ymd.ix[7, [0, 1]] = np.nan
+        self.frame.iloc[1, [1, 2]] = np.nan
+        self.frame.iloc[7, [0, 1]] = np.nan
+        self.ymd.iloc[1, [1, 2]] = np.nan
+        self.ymd.iloc[7, [0, 1]] = np.nan

         _check_counts(self.frame)
         _check_counts(self.ymd)
@@ -846,7 +754,8 @@ def _check_counts(frame, axis=0):

         # can't call with level on regular DataFrame
         df = tm.makeTimeDataFrame()
-        assertRaisesRegexp(TypeError, 'hierarchical', df.count, level=0)
+        tm.assert_raises_regex(
+            TypeError, 'hierarchical', df.count, level=0)

         self.frame['D'] = 'foo'
         result = self.frame.count(level=0, numeric_only=True)
@@ -862,30 +771,31 @@ def test_count_level_series(self):
         result = s.count(level=0)
         expected = s.groupby(level=0).count()
-        assert_series_equal(result.astype('f8'),
-                            expected.reindex(result.index).fillna(0))
+        tm.assert_series_equal(
+            result.astype('f8'), expected.reindex(result.index).fillna(0))

         result = s.count(level=1)
         expected = s.groupby(level=1).count()
-        assert_series_equal(result.astype('f8'),
-                            expected.reindex(result.index).fillna(0))
+        tm.assert_series_equal(
+            result.astype('f8'), expected.reindex(result.index).fillna(0))
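The bare `assertRaisesRegexp` (a unittest alias) is replaced by `tm.assert_raises_regex` throughout; like its predecessor it checks both the exception type and a regex match on the message, and this diff uses it in both call form and context-manager form. A sketch with a hypothetical failing helper:

```python
import pandas.util.testing as tm

def fail(msg):
    raise TypeError(msg)

# before: assertRaisesRegexp(TypeError, 'hierarchical', fail, 'not hierarchical')
tm.assert_raises_regex(TypeError, 'hierarchical', fail, 'not hierarchical')

# or as a context manager
with tm.assert_raises_regex(TypeError, 'hierarchical'):
    fail('not a hierarchical index')
```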
     def test_count_level_corner(self):
         s = self.frame['A'][:0]
         result = s.count(level=0)
         expected = Series(0, index=s.index.levels[0], name='A')
-        assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)

         df = self.frame[:0]
         result = df.count(level=0)
         expected = DataFrame({}, index=s.index.levels[0],
                              columns=df.columns).fillna(0).astype(np.int64)
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

     def test_get_level_number_out_of_bounds(self):
-        with assertRaisesRegexp(IndexError, "Too many levels"):
+        with tm.assert_raises_regex(IndexError, "Too many levels"):
             self.frame.index._get_level_number(2)
-        with assertRaisesRegexp(IndexError, "not a valid level number"):
+        with tm.assert_raises_regex(IndexError,
+                                    "not a valid level number"):
             self.frame.index._get_level_number(-3)

     def test_unstack(self):
@@ -907,56 +817,56 @@ def test_unstack_multiple_no_empty_columns(self):
         unstacked = s.unstack([1, 2])
         expected = unstacked.dropna(axis=1, how='all')
-        assert_frame_equal(unstacked, expected)
+        tm.assert_frame_equal(unstacked, expected)

     def test_stack(self):
         # regular roundtrip
         unstacked = self.ymd.unstack()
         restacked = unstacked.stack()
-        assert_frame_equal(restacked, self.ymd)
+        tm.assert_frame_equal(restacked, self.ymd)

-        unlexsorted = self.ymd.sortlevel(2)
+        unlexsorted = self.ymd.sort_index(level=2)

         unstacked = unlexsorted.unstack(2)
         restacked = unstacked.stack()
-        assert_frame_equal(restacked.sortlevel(0), self.ymd)
+        tm.assert_frame_equal(restacked.sort_index(level=0), self.ymd)

         unlexsorted = unlexsorted[::-1]

         unstacked = unlexsorted.unstack(1)
         restacked = unstacked.stack().swaplevel(1, 2)
-        assert_frame_equal(restacked.sortlevel(0), self.ymd)
+        tm.assert_frame_equal(restacked.sort_index(level=0), self.ymd)

         unlexsorted = unlexsorted.swaplevel(0, 1)

         unstacked = unlexsorted.unstack(0).swaplevel(0, 1, axis=1)
         restacked = unstacked.stack(0).swaplevel(1, 2)
-        assert_frame_equal(restacked.sortlevel(0), self.ymd)
+        tm.assert_frame_equal(restacked.sort_index(level=0), self.ymd)

         # columns unsorted
         unstacked = self.ymd.unstack()
         unstacked = unstacked.sort_index(axis=1, ascending=False)
         restacked = unstacked.stack()
-        assert_frame_equal(restacked, self.ymd)
+        tm.assert_frame_equal(restacked, self.ymd)

         # more than 2 levels in the columns
         unstacked = self.ymd.unstack(1).unstack(1)

         result = unstacked.stack(1)
         expected = self.ymd.unstack()
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         result = unstacked.stack(2)
         expected = self.ymd.unstack(1)
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         result = unstacked.stack(0)
         expected = self.ymd.stack().unstack(1).unstack(1)
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         # not all levels present in each echelon
-        unstacked = self.ymd.unstack(2).ix[:, ::3]
+        unstacked = self.ymd.unstack(2).loc[:, ::3]
         stacked = unstacked.stack().stack()
         ymd_stacked = self.ymd.stack()
-        assert_series_equal(stacked, ymd_stacked.reindex(stacked.index))
+        tm.assert_series_equal(stacked, ymd_stacked.reindex(stacked.index))

         # stack with negative number
         result = self.ymd.unstack(0).stack(-2)
@@ -964,8 +874,8 @@ def check(left, right):

         # GH10417
         def check(left, right):
-            assert_series_equal(left, right)
-            self.assertFalse(left.index.is_unique)
+            tm.assert_series_equal(left, right)
+            assert not left.index.is_unique

             li, ri = left.index, right.index
             tm.assert_index_equal(li, ri)

@@ -1020,18 +930,18 @@ def test_unstack_odd_failure(self):
         result = df.unstack(2)

         recons = result.stack()
-        assert_frame_equal(recons, df)
+        tm.assert_frame_equal(recons, df)

     def test_stack_mixed_dtype(self):
         df = self.frame.T
         df['foo', 'four'] = 'foo'
-        df = df.sortlevel(1, axis=1)
+        df = df.sort_index(level=1, axis=1)

         stacked = df.stack()
-        result = df['foo'].stack()
-        assert_series_equal(stacked['foo'], result, check_names=False)
-        self.assertIs(result.name, None)
-        self.assertEqual(stacked['bar'].dtype, np.float_)
+        result = df['foo'].stack().sort_index()
+        tm.assert_series_equal(stacked['foo'], result, check_names=False)
+        assert result.name is None
+        assert stacked['bar'].dtype == np.float_

     def test_unstack_bug(self):
         df = DataFrame({'state': ['naive', 'naive', 'naive', 'activ', 'activ',
@@ -1045,73 +955,74 @@ def test_unstack_bug(self):

         unstacked = result.unstack()
         restacked = unstacked.stack()
-        assert_series_equal(restacked,
-                            result.reindex(restacked.index).astype(float))
+        tm.assert_series_equal(
+            restacked, result.reindex(restacked.index).astype(float))

     def test_stack_unstack_preserve_names(self):
         unstacked = self.frame.unstack()
-        self.assertEqual(unstacked.index.name, 'first')
-        self.assertEqual(unstacked.columns.names, ['exp', 'second'])
+        assert unstacked.index.name == 'first'
+        assert unstacked.columns.names == ['exp', 'second']

         restacked = unstacked.stack()
-        self.assertEqual(restacked.index.names, self.frame.index.names)
+        assert restacked.index.names == self.frame.index.names

     def test_unstack_level_name(self):
         result = self.frame.unstack('second')
         expected = self.frame.unstack(level=1)
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

     def test_stack_level_name(self):
         unstacked = self.frame.unstack('second')
         result = unstacked.stack('exp')
         expected = self.frame.unstack().stack(0)
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         result = self.frame.stack('exp')
         expected = self.frame.stack()
-        assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)

     def test_stack_unstack_multiple(self):
         unstacked = self.ymd.unstack(['year', 'month'])
         expected = self.ymd.unstack('year').unstack('month')
-        assert_frame_equal(unstacked, expected)
-        self.assertEqual(unstacked.columns.names, expected.columns.names)
+        tm.assert_frame_equal(unstacked, expected)
+        assert unstacked.columns.names == expected.columns.names

         # series
         s = self.ymd['A']
         s_unstacked = s.unstack(['year', 'month'])
-        assert_frame_equal(s_unstacked, expected['A'])
+        tm.assert_frame_equal(s_unstacked, expected['A'])

         restacked = unstacked.stack(['year', 'month'])
         restacked = restacked.swaplevel(0, 1).swaplevel(1, 2)
-        restacked = restacked.sortlevel(0)
+        restacked = restacked.sort_index(level=0)

-        assert_frame_equal(restacked, self.ymd)
-        self.assertEqual(restacked.index.names, self.ymd.index.names)
+        tm.assert_frame_equal(restacked, self.ymd)
+        assert restacked.index.names == self.ymd.index.names

         # GH #451
         unstacked = self.ymd.unstack([1, 2])
         expected = self.ymd.unstack(1).unstack(1).dropna(axis=1, how='all')
-        assert_frame_equal(unstacked, expected)
+        tm.assert_frame_equal(unstacked, expected)

         unstacked = self.ymd.unstack([2, 1])
         expected = self.ymd.unstack(2).unstack(1).dropna(axis=1, how='all')
-        assert_frame_equal(unstacked, expected.ix[:, unstacked.columns])
+        tm.assert_frame_equal(unstacked, expected.loc[:, unstacked.columns])
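`sortlevel` was deprecated in favor of `sort_index(level=...)`, which is the rename these stacking tests apply; the `level` argument accepts a position or a level name. A minimal sketch on a small MultiIndexed frame:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[2000, 2001], [2, 1]],
                                 names=['year', 'month'])
df = pd.DataFrame({'A': range(4)}, index=idx)

# before: df.sortlevel(1)
result = df.sort_index(level=1)        # sort by the second index level
result = df.sort_index(level='month')  # equivalently, by level name
```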
contain"): + with tm.assert_raises_regex(ValueError, "level should contain"): unstacked.stack([0, 'month']) def test_stack_multiple_out_of_bounds(self): # nlevels == 3 unstacked = self.ymd.unstack(['year', 'month']) - with assertRaisesRegexp(IndexError, "Too many levels"): + with tm.assert_raises_regex(IndexError, "Too many levels"): unstacked.stack([2, 3]) - with assertRaisesRegexp(IndexError, "not a valid level number"): + with tm.assert_raises_regex(IndexError, + "not a valid level number"): unstacked.stack([-4, -3]) def test_unstack_period_series(self): @@ -1134,9 +1045,9 @@ def test_unstack_period_series(self): columns=['A', 'B']) expected.columns.name = 'str' - assert_frame_equal(result1, expected) - assert_frame_equal(result2, expected) - assert_frame_equal(result3, expected.T) + tm.assert_frame_equal(result1, expected) + tm.assert_frame_equal(result2, expected) + tm.assert_frame_equal(result3, expected.T) idx1 = pd.PeriodIndex(['2013-01', '2013-01', '2013-02', '2013-02', '2013-03', '2013-03'], freq='M', name='period1') @@ -1160,9 +1071,9 @@ def test_unstack_period_series(self): [6, 5, np.nan, np.nan, np.nan, np.nan]], index=e_idx, columns=e_cols) - assert_frame_equal(result1, expected) - assert_frame_equal(result2, expected) - assert_frame_equal(result3, expected.T) + tm.assert_frame_equal(result1, expected) + tm.assert_frame_equal(result2, expected) + tm.assert_frame_equal(result3, expected.T) def test_unstack_period_frame(self): # GH 4342 @@ -1187,8 +1098,8 @@ def test_unstack_period_frame(self): expected = DataFrame([[5, 1, 6, 2, 6, 1], [4, 2, 3, 3, 5, 4]], index=e_1, columns=e_cols) - assert_frame_equal(result1, expected) - assert_frame_equal(result2, expected) + tm.assert_frame_equal(result1, expected) + tm.assert_frame_equal(result2, expected) e_1 = pd.PeriodIndex(['2014-01', '2014-02', '2014-01', '2014-02'], freq='M', name='period1') @@ -1198,7 +1109,7 @@ def test_unstack_period_frame(self): expected = DataFrame([[5, 4, 2, 3], [1, 2, 6, 5], [6, 3, 1, 4]], index=e_2, columns=e_cols) - assert_frame_equal(result3, expected) + tm.assert_frame_equal(result3, expected) def test_stack_multiple_bug(self): """ bug when some uniques are not present in the data #3170""" @@ -1214,9 +1125,9 @@ def test_stack_multiple_bug(self): down = unst.resample('W-THU').mean() rs = down.stack('ID') - xp = unst.ix[:, ['VAR1']].resample('W-THU').mean().stack('ID') + xp = unst.loc[:, ['VAR1']].resample('W-THU').mean().stack('ID') xp.columns.name = 'Params' - assert_frame_equal(rs, xp) + tm.assert_frame_equal(rs, xp) def test_stack_dropna(self): # GH #3997 @@ -1224,10 +1135,10 @@ def test_stack_dropna(self): df = df.set_index(['A', 'B']) stacked = df.unstack().stack(dropna=False) - self.assertTrue(len(stacked) > len(stacked.dropna())) + assert len(stacked) > len(stacked.dropna()) stacked = df.unstack().stack(dropna=True) - assert_frame_equal(stacked, stacked.dropna()) + tm.assert_frame_equal(stacked, stacked.dropna()) def test_unstack_multiple_hierarchical(self): df = DataFrame(index=[[0, 0, 0, 0, 1, 1, 1, 1], @@ -1250,7 +1161,7 @@ def test_groupby_transform(self): applied = grouped.apply(lambda x: x * 2) expected = grouped.transform(lambda x: x * 2) result = applied.reindex(expected.index) - assert_series_equal(result, expected, check_names=False) + tm.assert_series_equal(result, expected, check_names=False) def test_unstack_sparse_keyspace(self): # memory problems with naive impl #2278 @@ -1279,10 +1190,10 @@ def test_unstack_unobserved_keys(self): df = DataFrame(np.random.randn(4, 2), index=index) 
         result = df.unstack()
-        self.assertEqual(len(result.columns), 4)
+        assert len(result.columns) == 4

         recons = result.stack()
-        assert_frame_equal(recons, df)
+        tm.assert_frame_equal(recons, df)

     def test_groupby_corner(self):
         midx = MultiIndex(levels=[['foo'], ['bar'], ['baz']],
@@ -1303,79 +1214,80 @@ def test_groupby_level_no_obs(self):
         grouped = df1.groupby(axis=1, level=0)
         result = grouped.sum()
-        self.assertTrue((result.columns == ['f2', 'f3']).all())
+        assert (result.columns == ['f2', 'f3']).all()

     def test_join(self):
-        a = self.frame.ix[:5, ['A']]
-        b = self.frame.ix[2:, ['B', 'C']]
+        a = self.frame.loc[self.frame.index[:5], ['A']]
+        b = self.frame.loc[self.frame.index[2:], ['B', 'C']]

         joined = a.join(b, how='outer').reindex(self.frame.index)
         expected = self.frame.copy()
         expected.values[np.isnan(joined.values)] = np.nan

-        self.assertFalse(np.isnan(joined.values).all())
+        assert not np.isnan(joined.values).all()

-        assert_frame_equal(joined, expected, check_names=False
-                           )  # TODO what should join do with names ?
+        # TODO what should join do with names ?
+        tm.assert_frame_equal(joined, expected, check_names=False)

     def test_swaplevel(self):
         swapped = self.frame['A'].swaplevel()
         swapped2 = self.frame['A'].swaplevel(0)
         swapped3 = self.frame['A'].swaplevel(0, 1)
         swapped4 = self.frame['A'].swaplevel('first', 'second')
-        self.assertFalse(swapped.index.equals(self.frame.index))
-        assert_series_equal(swapped, swapped2)
-        assert_series_equal(swapped, swapped3)
-        assert_series_equal(swapped, swapped4)
+        assert not swapped.index.equals(self.frame.index)
+        tm.assert_series_equal(swapped, swapped2)
+        tm.assert_series_equal(swapped, swapped3)
+        tm.assert_series_equal(swapped, swapped4)

         back = swapped.swaplevel()
         back2 = swapped.swaplevel(0)
         back3 = swapped.swaplevel(0, 1)
         back4 = swapped.swaplevel('second', 'first')
-        self.assertTrue(back.index.equals(self.frame.index))
-        assert_series_equal(back, back2)
-        assert_series_equal(back, back3)
-        assert_series_equal(back, back4)
+        assert back.index.equals(self.frame.index)
+        tm.assert_series_equal(back, back2)
+        tm.assert_series_equal(back, back3)
+        tm.assert_series_equal(back, back4)

         ft = self.frame.T
         swapped = ft.swaplevel('first', 'second', axis=1)
         exp = self.frame.swaplevel('first', 'second').T
-        assert_frame_equal(swapped, exp)
+        tm.assert_frame_equal(swapped, exp)

     def test_swaplevel_panel(self):
-        panel = Panel({'ItemA': self.frame, 'ItemB': self.frame * 2})
-        expected = panel.copy()
-        expected.major_axis = expected.major_axis.swaplevel(0, 1)
+        with catch_warnings(record=True):
+            panel = Panel({'ItemA': self.frame, 'ItemB': self.frame * 2})
+            expected = panel.copy()
+            expected.major_axis = expected.major_axis.swaplevel(0, 1)

-        for result in (panel.swaplevel(axis='major'),
-                       panel.swaplevel(0, axis='major'),
-                       panel.swaplevel(0, 1, axis='major')):
-            tm.assert_panel_equal(result, expected)
+            for result in (panel.swaplevel(axis='major'),
+                           panel.swaplevel(0, axis='major'),
+                           panel.swaplevel(0, 1, axis='major')):
+                tm.assert_panel_equal(result, expected)
     def test_reorder_levels(self):
         result = self.ymd.reorder_levels(['month', 'day', 'year'])
         expected = self.ymd.swaplevel(0, 1).swaplevel(1, 2)
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         result = self.ymd['A'].reorder_levels(['month', 'day', 'year'])
         expected = self.ymd['A'].swaplevel(0, 1).swaplevel(1, 2)
-        assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)

         result = self.ymd.T.reorder_levels(['month', 'day', 'year'], axis=1)
         expected = self.ymd.T.swaplevel(0, 1, axis=1).swaplevel(1, 2, axis=1)
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

-        with assertRaisesRegexp(TypeError, 'hierarchical axis'):
+        with tm.assert_raises_regex(TypeError, 'hierarchical axis'):
             self.ymd.reorder_levels([1, 2], axis=1)

-        with assertRaisesRegexp(IndexError, 'Too many levels'):
+        with tm.assert_raises_regex(IndexError, 'Too many levels'):
             self.ymd.index.reorder_levels([1, 2, 3])

     def test_insert_index(self):
         df = self.ymd[:5].T
         df[2000, 1, 10] = df[2000, 1, 7]
-        tm.assertIsInstance(df.columns, MultiIndex)
-        self.assertTrue((df[2000, 1, 10] == df[2000, 1, 7]).all())
+        assert isinstance(df.columns, MultiIndex)
+        assert (df[2000, 1, 10] == df[2000, 1, 7]).all()

     def test_alignment(self):
         x = Series(data=[1, 2, 3], index=MultiIndex.from_tuples([("A", 1), (
@@ -1387,29 +1299,13 @@ def test_alignment(self):
         res = x - y
         exp_index = x.index.union(y.index)
         exp = x.reindex(exp_index) - y.reindex(exp_index)
-        assert_series_equal(res, exp)
+        tm.assert_series_equal(res, exp)

         # hit non-monotonic code path
         res = x[::-1] - y[::-1]
         exp_index = x.index.union(y.index)
         exp = x.reindex(exp_index) - y.reindex(exp_index)
-        assert_series_equal(res, exp)
-
-    def test_is_lexsorted(self):
-        levels = [[0, 1], [0, 1, 2]]
-
-        index = MultiIndex(levels=levels,
-                           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
-        self.assertTrue(index.is_lexsorted())
-
-        index = MultiIndex(levels=levels,
-                           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 2, 1]])
-        self.assertFalse(index.is_lexsorted())
-
-        index = MultiIndex(levels=levels,
-                           labels=[[0, 0, 1, 0, 1, 1], [0, 1, 0, 2, 2, 1]])
-        self.assertFalse(index.is_lexsorted())
-        self.assertEqual(index.lexsort_depth, 0)
+        tm.assert_series_equal(res, exp)

     def test_frame_getitem_view(self):
         df = self.frame.T.copy()
@@ -1417,61 +1313,24 @@ def test_frame_getitem_view(self):
         # this works because we are modifying the underlying array
         # really a no-no
         df['foo'].values[:] = 0
-        self.assertTrue((df['foo'].values == 0).all())
+        assert (df['foo'].values == 0).all()

         # but not if it's mixed-type
         df['foo', 'four'] = 'foo'
-        df = df.sortlevel(0, axis=1)
+        df = df.sort_index(level=0, axis=1)

         # this will work, but will raise/warn as its chained assignment
         def f():
             df['foo']['one'] = 2
             return df

-        self.assertRaises(com.SettingWithCopyError, f)
+        pytest.raises(com.SettingWithCopyError, f)

         try:
             df = f()
         except:
             pass

-        self.assertTrue((df['foo', 'one'] == 0).all())
-
-    def test_frame_getitem_not_sorted(self):
-        df = self.frame.T
-        df['foo', 'four'] = 'foo'
-
-        arrays = [np.array(x) for x in zip(*df.columns._tuple_index)]
-
-        result = df['foo']
-        result2 = df.ix[:, 'foo']
-        expected = df.reindex(columns=df.columns[arrays[0] == 'foo'])
-        expected.columns = expected.columns.droplevel(0)
-        assert_frame_equal(result, expected)
-        assert_frame_equal(result2, expected)
-
-        df = df.T
-        result = df.xs('foo')
-        result2 = df.ix['foo']
-        expected = df.reindex(df.index[arrays[0] == 'foo'])
-        expected.index = expected.index.droplevel(0)
-        assert_frame_equal(result, expected)
-        assert_frame_equal(result2, expected)
-
-    def test_series_getitem_not_sorted(self):
-        arrays = [['bar', 'bar', 'baz', 'baz', 'qux', 'qux', 'foo', 'foo'],
-                  ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
-        tuples = lzip(*arrays)
-        index = MultiIndex.from_tuples(tuples)
-        s = Series(randn(8), index=index)
-
-        arrays = [np.array(x) for x in zip(*index._tuple_index)]
-
-        result = s['qux']
-        result2 = s.ix['qux']
-        expected = s[arrays[0] == 'qux']
-        expected.index = expected.index.droplevel(0)
-        assert_series_equal(result, expected)
-        assert_series_equal(result2, expected)
+        assert (df['foo', 'one'] == 0).all()

     def test_count(self):
         frame = self.frame.copy()
@@ -1479,27 +1338,27 @@ def test_count(self):

         result = frame.count(level='b')
         expect = self.frame.count(level=1)
-        assert_frame_equal(result, expect, check_names=False)
+        tm.assert_frame_equal(result, expect, check_names=False)

         result = frame.count(level='a')
         expect = self.frame.count(level=0)
-        assert_frame_equal(result, expect, check_names=False)
+        tm.assert_frame_equal(result, expect, check_names=False)

         series = self.series.copy()
         series.index.names = ['a', 'b']

         result = series.count(level='b')
         expect = self.series.count(level=1)
-        assert_series_equal(result, expect, check_names=False)
-        self.assertEqual(result.index.name, 'b')
+        tm.assert_series_equal(result, expect, check_names=False)
+        assert result.index.name == 'b'

         result = series.count(level='a')
         expect = self.series.count(level=0)
-        assert_series_equal(result, expect, check_names=False)
-        self.assertEqual(result.index.name, 'a')
+        tm.assert_series_equal(result, expect, check_names=False)
+        assert result.index.name == 'a'

-        self.assertRaises(KeyError, series.count, 'x')
-        self.assertRaises(KeyError, frame.count, level='x')
+        pytest.raises(KeyError, series.count, 'x')
+        pytest.raises(KeyError, frame.count, level='x')

     AGG_FUNCTIONS = ['sum', 'prod', 'min', 'max', 'median', 'mean', 'skew',
                      'mad', 'std', 'var', 'sem']
@@ -1512,15 +1371,16 @@ def test_series_group_min_max(self):
             # skipna=True
             leftside = grouped.agg(aggf)
             rightside = getattr(self.series, op)(level=level, skipna=skipna)
-            assert_series_equal(leftside, rightside)
+            tm.assert_series_equal(leftside, rightside)

     def test_frame_group_ops(self):
-        self.frame.ix[1, [1, 2]] = np.nan
-        self.frame.ix[7, [0, 1]] = np.nan
+        self.frame.iloc[1, [1, 2]] = np.nan
+        self.frame.iloc[7, [0, 1]] = np.nan

         for op, level, axis, skipna in cart_product(self.AGG_FUNCTIONS,
                                                     lrange(2), lrange(2),
                                                     [False, True]):
+
             if axis == 0:
                 frame = self.frame
             else:
@@ -1541,17 +1401,17 @@ def aggf(x):

             # for good measure, groupby detail
             level_index = frame._get_axis(axis).levels[level]

-            self.assert_index_equal(leftside._get_axis(axis), level_index)
-            self.assert_index_equal(rightside._get_axis(axis), level_index)
+            tm.assert_index_equal(leftside._get_axis(axis), level_index)
+            tm.assert_index_equal(rightside._get_axis(axis), level_index)

-            assert_frame_equal(leftside, rightside)
+            tm.assert_frame_equal(leftside, rightside)

     def test_stat_op_corner(self):
         obj = Series([10.0], index=MultiIndex.from_tuples([(2, 3)]))

         result = obj.sum(level=0)
         expected = Series([10.0], index=[2])
-        assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)

     def test_frame_any_all_group(self):
         df = DataFrame(
@@ -1562,11 +1422,11 @@ def test_frame_any_all_group(self):

         result = df.any(level=0)
         ex = DataFrame({'data': [False, True]}, index=['one', 'two'])
-        assert_frame_equal(result, ex)
+        tm.assert_frame_equal(result, ex)

         result = df.all(level=0)
         ex = DataFrame({'data': [False, False]}, index=['one', 'two'])
-        assert_frame_equal(result, ex)
+        tm.assert_frame_equal(result, ex)

     def test_std_var_pass_ddof(self):
         index = MultiIndex.from_arrays([np.arange(5).repeat(10), np.tile(
@@ -1579,20 +1439,20 @@ def test_std_var_pass_ddof(self):
             result = getattr(df[0], meth)(level=0, ddof=ddof)
             expected = df[0].groupby(level=0).agg(alt)
-            assert_series_equal(result, expected)
+            tm.assert_series_equal(result, expected)

             result = getattr(df, meth)(level=0, ddof=ddof)
             expected = df.groupby(level=0).agg(alt)
-            assert_frame_equal(result, expected)
+            tm.assert_frame_equal(result, expected)
     def test_frame_series_agg_multiple_levels(self):
         result = self.ymd.sum(level=['year', 'month'])
         expected = self.ymd.groupby(level=['year', 'month']).sum()
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

         result = self.ymd['A'].sum(level=['year', 'month'])
         expected = self.ymd['A'].groupby(level=['year', 'month']).sum()
-        assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)

     def test_groupby_multilevel(self):
         result = self.ymd.groupby(level=[0, 1]).mean()
@@ -1602,12 +1462,12 @@ def test_groupby_multilevel(self):

         expected = self.ymd.groupby([k1, k2]).mean()

-        assert_frame_equal(result, expected, check_names=False
-                           )  # TODO groupby with level_values drops names
-        self.assertEqual(result.index.names, self.ymd.index.names[:2])
+        # TODO groupby with level_values drops names
+        tm.assert_frame_equal(result, expected, check_names=False)
+        assert result.index.names == self.ymd.index.names[:2]

         result2 = self.ymd.groupby(level=self.ymd.index.names[:2]).mean()
-        assert_frame_equal(result, result2)
+        tm.assert_frame_equal(result, result2)

     def test_groupby_multilevel_with_transform(self):
         pass
@@ -1617,38 +1477,38 @@ def test_multilevel_consolidate(self):
                                           'bar', 'one'), ('bar', 'two')])
         df = DataFrame(np.random.randn(4, 4), index=index, columns=index)
         df['Totals', ''] = df.sum(1)
-        df = df.consolidate()
+        df = df._consolidate()

     def test_ix_preserve_names(self):
-        result = self.ymd.ix[2000]
-        result2 = self.ymd['A'].ix[2000]
-        self.assertEqual(result.index.names, self.ymd.index.names[1:])
-        self.assertEqual(result2.index.names, self.ymd.index.names[1:])
+        result = self.ymd.loc[2000]
+        result2 = self.ymd['A'].loc[2000]
+        assert result.index.names == self.ymd.index.names[1:]
+        assert result2.index.names == self.ymd.index.names[1:]

-        result = self.ymd.ix[2000, 2]
-        result2 = self.ymd['A'].ix[2000, 2]
-        self.assertEqual(result.index.name, self.ymd.index.names[2])
-        self.assertEqual(result2.index.name, self.ymd.index.names[2])
+        result = self.ymd.loc[2000, 2]
+        result2 = self.ymd['A'].loc[2000, 2]
+        assert result.index.name == self.ymd.index.names[2]
+        assert result2.index.name == self.ymd.index.names[2]

     def test_partial_set(self):
         # GH #397
         df = self.ymd.copy()
         exp = self.ymd.copy()
-        df.ix[2000, 4] = 0
-        exp.ix[2000, 4].values[:] = 0
-        assert_frame_equal(df, exp)
+        df.loc[2000, 4] = 0
+        exp.loc[2000, 4].values[:] = 0
+        tm.assert_frame_equal(df, exp)

-        df['A'].ix[2000, 4] = 1
-        exp['A'].ix[2000, 4].values[:] = 1
-        assert_frame_equal(df, exp)
+        df['A'].loc[2000, 4] = 1
+        exp['A'].loc[2000, 4].values[:] = 1
+        tm.assert_frame_equal(df, exp)

-        df.ix[2000] = 5
-        exp.ix[2000].values[:] = 5
-        assert_frame_equal(df, exp)
+        df.loc[2000] = 5
+        exp.loc[2000].values[:] = 5
+        tm.assert_frame_equal(df, exp)

         # this works...for now
-        df['A'].ix[14] = 5
-        self.assertEqual(df['A'][14], 5)
+        df['A'].iloc[14] = 5
+        assert df['A'][14] == 5

     def test_unstack_preserve_types(self):
         # GH #403
@@ -1656,9 +1516,9 @@ def test_unstack_preserve_types(self):
         self.ymd['F'] = 2

         unstacked = self.ymd.unstack('month')
-        self.assertEqual(unstacked['A', 1].dtype, np.float64)
-        self.assertEqual(unstacked['E', 1].dtype, np.object_)
-        self.assertEqual(unstacked['F', 1].dtype, np.float64)
+        assert unstacked['A', 1].dtype == np.float64
+        assert unstacked['E', 1].dtype == np.object_
+        assert unstacked['F', 1].dtype == np.float64
     def test_unstack_group_index_overflow(self):
         labels = np.tile(np.arange(500), 2)
@@ -1669,11 +1529,11 @@ def test_unstack_group_index_overflow(self):
         s = Series(np.arange(1000), index=index)
         result = s.unstack()
-        self.assertEqual(result.shape, (500, 2))
+        assert result.shape == (500, 2)

         # test roundtrip
         stacked = result.stack()
-        assert_series_equal(s, stacked.reindex(s.index))
+        tm.assert_series_equal(s, stacked.reindex(s.index))

         # put it at beginning
         index = MultiIndex(levels=[[0, 1]] + [level] * 8,
@@ -1681,7 +1541,7 @@ def test_unstack_group_index_overflow(self):
         s = Series(np.arange(1000), index=index)
         result = s.unstack(0)
-        self.assertEqual(result.shape, (500, 2))
+        assert result.shape == (500, 2)

         # put it in middle
         index = MultiIndex(levels=[level] * 4 + [[0, 1]] + [level] * 4,
@@ -1690,34 +1550,34 @@ def test_unstack_group_index_overflow(self):
         s = Series(np.arange(1000), index=index)
         result = s.unstack(4)
-        self.assertEqual(result.shape, (500, 2))
+        assert result.shape == (500, 2)

     def test_getitem_lowerdim_corner(self):
-        self.assertRaises(KeyError, self.frame.ix.__getitem__,
-                          (('bar', 'three'), 'B'))
+        pytest.raises(KeyError, self.frame.loc.__getitem__,
+                      (('bar', 'three'), 'B'))

         # in theory should be inserting in a sorted space????
-        self.frame.ix[('bar', 'three'), 'B'] = 0
-        self.assertEqual(self.frame.sortlevel().ix[('bar', 'three'), 'B'], 0)
+        self.frame.loc[('bar', 'three'), 'B'] = 0
+        assert self.frame.sort_index().loc[('bar', 'three'), 'B'] == 0

     # ---------------------------------------------------------------------
     # AMBIGUOUS CASES!

     def test_partial_ix_missing(self):
-        raise nose.SkipTest("skipping for now")
+        pytest.skip("skipping for now")

-        result = self.ymd.ix[2000, 0]
-        expected = self.ymd.ix[2000]['A']
-        assert_series_equal(result, expected)
+        result = self.ymd.loc[2000, 0]
+        expected = self.ymd.loc[2000]['A']
+        tm.assert_series_equal(result, expected)

         # need to put in some work here
-        # self.ymd.ix[2000, 0] = 0
-        # self.assertTrue((self.ymd.ix[2000]['A'] == 0).all())
+        # self.ymd.loc[2000, 0] = 0
+        # assert (self.ymd.loc[2000]['A'] == 0).all()

         # Pretty sure the second (and maybe even the first) is already wrong.
-        self.assertRaises(Exception, self.ymd.ix.__getitem__, (2000, 6))
-        self.assertRaises(Exception, self.ymd.ix.__getitem__, (2000, 6), 0)
+        pytest.raises(Exception, self.ymd.loc.__getitem__, (2000, 6))
+        pytest.raises(Exception, self.ymd.loc.__getitem__, (2000, 6), 0)
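With nose gone, `raise nose.SkipTest(...)` becomes `pytest.skip(...)`, which aborts the test at the point of the call and reports it as skipped rather than failed. A sketch:

```python
import pytest

def test_not_ready():
    pytest.skip("skipping for now")  # everything below is never reached
    assert False
```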
     # ---------------------------------------------------------------------

@@ -1735,20 +1595,20 @@ def test_level_with_tuples(self):
         frame = DataFrame(np.random.randn(6, 4), index=index)

         result = series[('foo', 'bar', 0)]
-        result2 = series.ix[('foo', 'bar', 0)]
+        result2 = series.loc[('foo', 'bar', 0)]
         expected = series[:2]
         expected.index = expected.index.droplevel(0)
-        assert_series_equal(result, expected)
-        assert_series_equal(result2, expected)
+        tm.assert_series_equal(result, expected)
+        tm.assert_series_equal(result2, expected)

-        self.assertRaises(KeyError, series.__getitem__, (('foo', 'bar', 0), 2))
+        pytest.raises(KeyError, series.__getitem__, (('foo', 'bar', 0), 2))

-        result = frame.ix[('foo', 'bar', 0)]
+        result = frame.loc[('foo', 'bar', 0)]
         result2 = frame.xs(('foo', 'bar', 0))
         expected = frame[:2]
         expected.index = expected.index.droplevel(0)
-        assert_frame_equal(result, expected)
-        assert_frame_equal(result2, expected)
+        tm.assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result2, expected)

         index = MultiIndex(levels=[[('foo', 'bar'), ('foo', 'baz'), (
             'foo', 'qux')], [0, 1]],
@@ -1758,33 +1618,33 @@ def test_level_with_tuples(self):
         frame = DataFrame(np.random.randn(6, 4), index=index)

         result = series[('foo', 'bar')]
-        result2 = series.ix[('foo', 'bar')]
+        result2 = series.loc[('foo', 'bar')]
         expected = series[:2]
         expected.index = expected.index.droplevel(0)
-        assert_series_equal(result, expected)
-        assert_series_equal(result2, expected)
+        tm.assert_series_equal(result, expected)
+        tm.assert_series_equal(result2, expected)

-        result = frame.ix[('foo', 'bar')]
+        result = frame.loc[('foo', 'bar')]
         result2 = frame.xs(('foo', 'bar'))
         expected = frame[:2]
         expected.index = expected.index.droplevel(0)
-        assert_frame_equal(result, expected)
-        assert_frame_equal(result2, expected)
+        tm.assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result2, expected)

     def test_int_series_slicing(self):
         s = self.ymd['A']

         result = s[5:]
         expected = s.reindex(s.index[5:])
-        assert_series_equal(result, expected)
+        tm.assert_series_equal(result, expected)

         exp = self.ymd['A'].copy()
         s[5:] = 0
         exp.values[5:] = 0
-        self.assert_numpy_array_equal(s.values, exp.values)
+        tm.assert_numpy_array_equal(s.values, exp.values)

         result = self.ymd[5:]
         expected = self.ymd.reindex(s.index[5:])
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

     def test_mixed_depth_get(self):
         arrays = [['a', 'top', 'top', 'routine1', 'routine1', 'routine2'],
@@ -1797,13 +1657,13 @@ def test_mixed_depth_get(self):

         result = df['a']
         expected = df['a', '', '']
-        assert_series_equal(result, expected, check_names=False)
-        self.assertEqual(result.name, 'a')
+        tm.assert_series_equal(result, expected, check_names=False)
+        assert result.name == 'a'

         result = df['routine1', 'result1']
         expected = df['routine1', 'result1', '']
-        assert_series_equal(result, expected, check_names=False)
-        self.assertEqual(result.name, ('routine1', 'result1'))
+        tm.assert_series_equal(result, expected, check_names=False)
+        assert result.name == ('routine1', 'result1')

     def test_mixed_depth_insert(self):
         arrays = [['a', 'top', 'top', 'routine1', 'routine1', 'routine2'],
@@ -1818,7 +1678,7 @@ def test_mixed_depth_insert(self):
         expected = df.copy()
         result['b'] = [1, 2, 3, 4]
         expected['b', '', ''] = [1, 2, 3, 4]
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)
     def test_mixed_depth_drop(self):
         arrays = [['a', 'top', 'top', 'routine1', 'routine1', 'routine2'],
@@ -1831,16 +1691,16 @@ def test_mixed_depth_drop(self):

         result = df.drop('a', axis=1)
         expected = df.drop([('a', '', '')], axis=1)
-        assert_frame_equal(expected, result)
+        tm.assert_frame_equal(expected, result)

         result = df.drop(['top'], axis=1)
         expected = df.drop([('top', 'OD', 'wx')], axis=1)
         expected = expected.drop([('top', 'OD', 'wy')], axis=1)
-        assert_frame_equal(expected, result)
+        tm.assert_frame_equal(expected, result)

         result = df.drop(('top', 'OD', 'wx'), axis=1)
         expected = df.drop([('top', 'OD', 'wx')], axis=1)
-        assert_frame_equal(expected, result)
+        tm.assert_frame_equal(expected, result)

         expected = df.drop([('top', 'OD', 'wy')], axis=1)
         expected = df.drop('top', axis=1)

         result = df.drop('result1', level=1, axis=1)
         expected = df.drop([('routine1', 'result1', ''),
                             ('routine2', 'result1', '')], axis=1)
-        assert_frame_equal(expected, result)
+        tm.assert_frame_equal(expected, result)

     def test_drop_nonunique(self):
         df = DataFrame([["x-a", "x", "a", 1.5], ["x-a", "x", "a", 1.2],
@@ -1859,7 +1719,7 @@ def test_drop_nonunique(self):
                        columns=["var1", "var2", "var3", "var4"])

         grp_size = df.groupby("var1").size()
-        drop_idx = grp_size.ix[grp_size == 1]
+        drop_idx = grp_size.loc[grp_size == 1]

         idf = df.set_index(["var1", "var2", "var3"])

@@ -1869,7 +1729,7 @@ def test_drop_nonunique(self):

         result.index = expected.index

-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

     def test_mixed_depth_pop(self):
         arrays = [['a', 'top', 'top', 'routine1', 'routine1', 'routine2'],
@@ -1884,78 +1744,78 @@ def test_mixed_depth_pop(self):
         df2 = df.copy()
         result = df1.pop('a')
         expected = df2.pop(('a', '', ''))
-        assert_series_equal(expected, result, check_names=False)
-        assert_frame_equal(df1, df2)
-        self.assertEqual(result.name, 'a')
+        tm.assert_series_equal(expected, result, check_names=False)
+        tm.assert_frame_equal(df1, df2)
+        assert result.name == 'a'

         expected = df1['top']
         df1 = df1.drop(['top'], axis=1)
         result = df2.pop('top')
-        assert_frame_equal(expected, result)
-        assert_frame_equal(df1, df2)
+        tm.assert_frame_equal(expected, result)
+        tm.assert_frame_equal(df1, df2)

     def test_reindex_level_partial_selection(self):
         result = self.frame.reindex(['foo', 'qux'], level=0)
-        expected = self.frame.ix[[0, 1, 2, 7, 8, 9]]
-        assert_frame_equal(result, expected)
+        expected = self.frame.iloc[[0, 1, 2, 7, 8, 9]]
+        tm.assert_frame_equal(result, expected)

         result = self.frame.T.reindex_axis(['foo', 'qux'], axis=1, level=0)
-        assert_frame_equal(result, expected.T)
+        tm.assert_frame_equal(result, expected.T)

-        result = self.frame.ix[['foo', 'qux']]
-        assert_frame_equal(result, expected)
+        result = self.frame.loc[['foo', 'qux']]
+        tm.assert_frame_equal(result, expected)

-        result = self.frame['A'].ix[['foo', 'qux']]
-        assert_series_equal(result, expected['A'])
+        result = self.frame['A'].loc[['foo', 'qux']]
+        tm.assert_series_equal(result, expected['A'])

-        result = self.frame.T.ix[:, ['foo', 'qux']]
-        assert_frame_equal(result, expected.T)
+        result = self.frame.T.loc[:, ['foo', 'qux']]
+        tm.assert_frame_equal(result, expected.T)

     def test_setitem_multiple_partial(self):
         expected = self.frame.copy()
         result = self.frame.copy()
-        result.ix[['foo', 'bar']] = 0
-        expected.ix['foo'] = 0
-        expected.ix['bar'] = 0
-        assert_frame_equal(result, expected)
+        result.loc[['foo', 'bar']] = 0
+        expected.loc['foo'] = 0
+        expected.loc['bar'] = 0
+        tm.assert_frame_equal(result, expected)
         expected = self.frame.copy()
         result = self.frame.copy()
-        result.ix['foo':'bar'] = 0
-        expected.ix['foo'] = 0
-        expected.ix['bar'] = 0
-        assert_frame_equal(result, expected)
+        result.loc['foo':'bar'] = 0
+        expected.loc['foo'] = 0
+        expected.loc['bar'] = 0
+        tm.assert_frame_equal(result, expected)

         expected = self.frame['A'].copy()
         result = self.frame['A'].copy()
-        result.ix[['foo', 'bar']] = 0
-        expected.ix['foo'] = 0
-        expected.ix['bar'] = 0
-        assert_series_equal(result, expected)
+        result.loc[['foo', 'bar']] = 0
+        expected.loc['foo'] = 0
+        expected.loc['bar'] = 0
+        tm.assert_series_equal(result, expected)

         expected = self.frame['A'].copy()
         result = self.frame['A'].copy()
-        result.ix['foo':'bar'] = 0
-        expected.ix['foo'] = 0
-        expected.ix['bar'] = 0
-        assert_series_equal(result, expected)
+        result.loc['foo':'bar'] = 0
+        expected.loc['foo'] = 0
+        expected.loc['bar'] = 0
+        tm.assert_series_equal(result, expected)

     def test_drop_level(self):
         result = self.frame.drop(['bar', 'qux'], level='first')
-        expected = self.frame.ix[[0, 1, 2, 5, 6]]
-        assert_frame_equal(result, expected)
+        expected = self.frame.iloc[[0, 1, 2, 5, 6]]
+        tm.assert_frame_equal(result, expected)

         result = self.frame.drop(['two'], level='second')
-        expected = self.frame.ix[[0, 2, 3, 6, 7, 9]]
-        assert_frame_equal(result, expected)
+        expected = self.frame.iloc[[0, 2, 3, 6, 7, 9]]
+        tm.assert_frame_equal(result, expected)

         result = self.frame.T.drop(['bar', 'qux'], axis=1, level='first')
-        expected = self.frame.ix[[0, 1, 2, 5, 6]].T
-        assert_frame_equal(result, expected)
+        expected = self.frame.iloc[[0, 1, 2, 5, 6]].T
+        tm.assert_frame_equal(result, expected)

         result = self.frame.T.drop(['two'], axis=1, level='second')
-        expected = self.frame.ix[[0, 2, 3, 6, 7, 9]].T
-        assert_frame_equal(result, expected)
+        expected = self.frame.iloc[[0, 2, 3, 6, 7, 9]].T
+        tm.assert_frame_equal(result, expected)

     def test_drop_level_nonunique_datetime(self):
         # GH 12701
@@ -1970,11 +1830,11 @@ def test_drop_level_nonunique_datetime(self):
         df['tstamp'] = idxdt
         df = df.set_index('tstamp', append=True)
         ts = pd.Timestamp('201603231600')
-        self.assertFalse(df.index.is_unique)
+        assert not df.index.is_unique

         result = df.drop(ts, level='tstamp')
         expected = df.loc[idx != 4]
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

     def test_drop_preserve_names(self):
         index = MultiIndex.from_arrays([[0, 0, 0, 1, 1, 1],
@@ -1984,7 +1844,7 @@ def test_drop_preserve_names(self):
         df = DataFrame(np.random.randn(6, 3), index=index)

         result = df.drop([(0, 2)])
-        self.assertEqual(result.index.names, ('one', 'two'))
+        assert result.index.names == ('one', 'two')

     def test_unicode_repr_issues(self):
         levels = [Index([u('a/\u03c3'), u('b/\u03c3'), u('c/\u03c3')]),
@@ -2013,7 +1873,7 @@ def test_dataframe_insert_column_all_na(self):
         df = DataFrame([[1, 2], [3, 4], [5, 6]], index=mix)
         s = Series({(1, 1): 1, (1, 2): 2})
         df['new'] = s
-        self.assertTrue(df['new'].isnull().all())
+        assert df['new'].isnull().all()

     def test_join_segfault(self):
         # 1532
@@ -2028,12 +1888,12 @@ def test_join_segfault(self):
     def test_set_column_scalar_with_ix(self):
         subset = self.frame.index[[1, 4, 5]]

-        self.frame.ix[subset] = 99
-        self.assertTrue((self.frame.ix[subset].values == 99).all())
+        self.frame.loc[subset] = 99
+        assert (self.frame.loc[subset].values == 99).all()

         col = self.frame['B']
         col[subset] = 97
-        self.assertTrue((self.frame.ix[subset, 'B'] == 97).all())
+        assert (self.frame.loc[subset, 'B'] == 97).all()
     def test_frame_dict_constructor_empty_series(self):
         s1 = Series([
@@ -2057,10 +1917,10 @@ def test_indexing_ambiguity_bug_1678(self):
         frame = DataFrame(np.arange(12).reshape((4, 3)), index=index,
                           columns=columns)

-        result = frame.ix[:, 1]
+        result = frame.iloc[:, 1]
         exp = frame.loc[:, ('Ohio', 'Red')]
-        tm.assertIsInstance(result, Series)
-        assert_series_equal(result, exp)
+        assert isinstance(result, Series)
+        tm.assert_series_equal(result, exp)

     def test_nonunique_assignment_1750(self):
         df = DataFrame([[1, 1, "x", "X"], [1, 1, "y", "Y"], [1, 2, "z", "Z"]],
@@ -2069,9 +1929,9 @@ def test_nonunique_assignment_1750(self):

         df = df.set_index(['A', 'B'])
         ix = MultiIndex.from_tuples([(1, 1)])

-        df.ix[ix, "C"] = '_'
+        df.loc[ix, "C"] = '_'

-        self.assertTrue((df.xs((1, 1))['C'] == '_').all())
+        assert (df.xs((1, 1))['C'] == '_').all()

     def test_indexing_over_hashtable_size_cutoff(self):
         n = 10000
@@ -2083,9 +1943,9 @@ def test_indexing_over_hashtable_size_cutoff(self):
                    MultiIndex.from_arrays((["a"] * n, np.arange(n))))

         # hai it works!
-        self.assertEqual(s[("a", 5)], 5)
-        self.assertEqual(s[("a", 6)], 6)
-        self.assertEqual(s[("a", 7)], 7)
+        assert s[("a", 5)] == 5
+        assert s[("a", 6)] == 6
+        assert s[("a", 7)] == 7

         _index._SIZE_CUTOFF = old_cutoff

@@ -2125,8 +1985,8 @@ def test_tuples_have_na(self):
                            labels=[[1, 1, 1, 1, -1, 0, 0, 0],
                                    [0, 1, 2, 3, 0, 1, 2, 3]])

-        self.assertTrue(isnull(index[4][0]))
-        self.assertTrue(isnull(index.values[4][0]))
+        assert isnull(index[4][0])
+        assert isnull(index.values[4][0])

     def test_duplicate_groupby_issues(self):
         idx_tp = [('600809', '20061231'), ('600809', '20070331'),
@@ -2137,7 +1997,7 @@ def test_duplicate_groupby_issues(self):

         s = Series(dt, index=idx)
         result = s.groupby(s.index).first()
-        self.assertEqual(len(result), 3)
+        assert len(result) == 3

     def test_duplicate_mi(self):
         # GH 4516
@@ -2147,12 +2007,12 @@ def test_duplicate_mi(self):
                         ['bah', 'bam', 6.0, 6]],
                        columns=list('ABCD'))
         df = df.set_index(['A', 'B'])
-        df = df.sortlevel(0)
+        df = df.sort_index(level=0)
         expected = DataFrame([['foo', 'bar', 1.0, 1], ['foo', 'bar', 2.0, 2],
                               ['foo', 'bar', 5.0, 5]],
                              columns=list('ABCD')).set_index(['A', 'B'])
         result = df.loc[('foo', 'bar')]
-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)

     def test_duplicated_drop_duplicates(self):
         # GH 4060
@@ -2162,35 +2022,24 @@ def test_duplicated_drop_duplicates(self):
         expected = np.array(
             [False, False, False, True, False, False], dtype=bool)
         duplicated = idx.duplicated()
         tm.assert_numpy_array_equal(duplicated, expected)
-        self.assertTrue(duplicated.dtype == bool)
+        assert duplicated.dtype == bool
         expected = MultiIndex.from_arrays(([1, 2, 3, 2, 3], [1, 1, 1, 2, 2]))
         tm.assert_index_equal(idx.drop_duplicates(), expected)

         expected = np.array([True, False, False, False, False, False])
         duplicated = idx.duplicated(keep='last')
         tm.assert_numpy_array_equal(duplicated, expected)
-        self.assertTrue(duplicated.dtype == bool)
+        assert duplicated.dtype == bool
         expected = MultiIndex.from_arrays(([2, 3, 1, 2, 3], [1, 1, 1, 2, 2]))
         tm.assert_index_equal(idx.drop_duplicates(keep='last'), expected)

         expected = np.array([True, False, False, True, False, False])
         duplicated = idx.duplicated(keep=False)
         tm.assert_numpy_array_equal(duplicated, expected)
-        self.assertTrue(duplicated.dtype == bool)
+        assert duplicated.dtype == bool
         expected = MultiIndex.from_arrays(([2, 3, 2, 3], [1, 1, 2, 2]))
         tm.assert_index_equal(idx.drop_duplicates(keep=False), expected)
-        # deprecate take_last
-        expected = np.array([True, False, False, False, False, False])
-        with tm.assert_produces_warning(FutureWarning):
-            duplicated = idx.duplicated(take_last=True)
-        tm.assert_numpy_array_equal(duplicated, expected)
-        self.assertTrue(duplicated.dtype == bool)
-        expected = MultiIndex.from_arrays(([2, 3, 1, 2, 3], [1, 1, 1, 2, 2]))
-        with tm.assert_produces_warning(FutureWarning):
-            tm.assert_index_equal(
-                idx.drop_duplicates(take_last=True), expected)
-
     def test_multiindex_set_index(self):
         # segfault in #3308
         d = {'t1': [2, 2.5, 3], 't2': [4, 5, 6]}
@@ -2213,8 +2062,8 @@ def test_datetimeindex(self):
         expected1 = pd.DatetimeIndex(['2013-04-01 9:00', '2013-04-02 9:00',
                                       '2013-04-03 9:00'], tz='Asia/Tokyo')

-        self.assert_index_equal(idx.levels[0], expected1)
-        self.assert_index_equal(idx.levels[1], idx2)
+        tm.assert_index_equal(idx.levels[0], expected1)
+        tm.assert_index_equal(idx.levels[1], idx2)

         # from datetime combos
         # GH 7888
@@ -2225,8 +2074,8 @@ def test_datetimeindex(self):
         for d1, d2 in itertools.product(
                 [date1, date2, date3], [date1, date2, date3]):
             index = pd.MultiIndex.from_product([[d1], [d2]])
-            self.assertIsInstance(index.levels[0], pd.DatetimeIndex)
-            self.assertIsInstance(index.levels[1], pd.DatetimeIndex)
+            assert isinstance(index.levels[0], pd.DatetimeIndex)
+            assert isinstance(index.levels[1], pd.DatetimeIndex)

     def test_constructor_with_tz(self):

@@ -2260,14 +2109,14 @@ def test_set_index_datetime(self):
         expected = expected.tz_localize('UTC').tz_convert('US/Pacific')

         df = df.set_index('label', append=True)
-        self.assert_index_equal(df.index.levels[0], expected)
-        self.assert_index_equal(df.index.levels[1],
-                                pd.Index(['a', 'b'], name='label'))
+        tm.assert_index_equal(df.index.levels[0], expected)
+        tm.assert_index_equal(df.index.levels[1],
+                              pd.Index(['a', 'b'], name='label'))

         df = df.swaplevel(0, 1)
-        self.assert_index_equal(df.index.levels[0],
-                                pd.Index(['a', 'b'], name='label'))
-        self.assert_index_equal(df.index.levels[1], expected)
+        tm.assert_index_equal(df.index.levels[0],
+                              pd.Index(['a', 'b'], name='label'))
+        tm.assert_index_equal(df.index.levels[1], expected)

         df = DataFrame(np.random.random(6))
         idx1 = pd.DatetimeIndex(['2011-07-19 07:00:00', '2011-07-19 08:00:00',
@@ -2290,14 +2139,14 @@ def test_set_index_datetime(self):
         expected2 = pd.DatetimeIndex(['2012-04-01 09:00', '2012-04-02 09:00'],
                                      tz='US/Eastern')

-        self.assert_index_equal(df.index.levels[0], expected1)
-        self.assert_index_equal(df.index.levels[1], expected2)
-        self.assert_index_equal(df.index.levels[2], idx3)
+        tm.assert_index_equal(df.index.levels[0], expected1)
+        tm.assert_index_equal(df.index.levels[1], expected2)
+        tm.assert_index_equal(df.index.levels[2], idx3)

         # GH 7092
-        self.assert_index_equal(df.index.get_level_values(0), idx1)
-        self.assert_index_equal(df.index.get_level_values(1), idx2)
-        self.assert_index_equal(df.index.get_level_values(2), idx3)
+        tm.assert_index_equal(df.index.get_level_values(0), idx1)
+        tm.assert_index_equal(df.index.get_level_values(1), idx2)
+        tm.assert_index_equal(df.index.get_level_values(2), idx3)

     def test_reset_index_datetime(self):
         # GH 3950
@@ -2322,7 +2171,7 @@ def test_reset_index_datetime(self):
             expected['idx1'] = expected['idx1'].apply(
                 lambda d: pd.Timestamp(d, tz=tz))

-            assert_frame_equal(df.reset_index(), expected)
+            tm.assert_frame_equal(df.reset_index(), expected)

             idx3 = pd.date_range('1/1/2012', periods=5, freq='MS',
                                  tz='Europe/Paris', name='idx3')
@@ -2349,7 +2198,7 @@ def test_reset_index_datetime(self):
+2198,7 @@ def test_reset_index_datetime(self): lambda d: pd.Timestamp(d, tz=tz)) expected['idx3'] = expected['idx3'].apply( lambda d: pd.Timestamp(d, tz='Europe/Paris')) - assert_frame_equal(df.reset_index(), expected) + tm.assert_frame_equal(df.reset_index(), expected) # GH 7793 idx = pd.MultiIndex.from_product([['a', 'b'], pd.date_range( @@ -2367,7 +2216,7 @@ def test_reset_index_datetime(self): columns=['level_0', 'level_1', 'a']) expected['level_1'] = expected['level_1'].apply( lambda d: pd.Timestamp(d, freq='D', tz=tz)) - assert_frame_equal(df.reset_index(), expected) + tm.assert_frame_equal(df.reset_index(), expected) def test_reset_index_period(self): # GH 7746 @@ -2386,7 +2235,61 @@ def test_reset_index_period(self): 'feature': ['a', 'b', 'c'] * 3, 'a': np.arange(9, dtype='int64') }, columns=['month', 'feature', 'a']) - assert_frame_equal(df.reset_index(), expected) + tm.assert_frame_equal(df.reset_index(), expected) + + def test_reset_index_multiindex_columns(self): + levels = [['A', ''], ['B', 'b']] + df = pd.DataFrame([[0, 2], [1, 3]], + columns=pd.MultiIndex.from_tuples(levels)) + result = df[['B']].rename_axis('A').reset_index() + tm.assert_frame_equal(result, df) + + # gh-16120: already existing column + with tm.assert_raises_regex(ValueError, + ("cannot insert \('A', ''\), " + "already exists")): + df.rename_axis('A').reset_index() + + # gh-16164: multiindex (tuple) full key + result = df.set_index([('A', '')]).reset_index() + tm.assert_frame_equal(result, df) + + # with additional (unnamed) index level + idx_col = pd.DataFrame([[0], [1]], + columns=pd.MultiIndex.from_tuples([('level_0', + '')])) + expected = pd.concat([idx_col, df[[('B', 'b'), ('A', '')]]], axis=1) + result = df.set_index([('B', 'b')], append=True).reset_index() + tm.assert_frame_equal(result, expected) + + # with index name which is a too long tuple... + with tm.assert_raises_regex(ValueError, + ("Item must have length equal to number " + "of levels.")): + df.rename_axis([('C', 'c', 'i')]).reset_index() + + # or too short... + levels = [['A', 'a', ''], ['B', 'b', 'i']] + df2 = pd.DataFrame([[0, 2], [1, 3]], + columns=pd.MultiIndex.from_tuples(levels)) + idx_col = pd.DataFrame([[0], [1]], + columns=pd.MultiIndex.from_tuples([('C', + 'c', + 'ii')])) + expected = pd.concat([idx_col, df2], axis=1) + result = df2.rename_axis([('C', 'c')]).reset_index(col_fill='ii') + tm.assert_frame_equal(result, expected) + + # ... 
+        with tm.assert_raises_regex(ValueError,
+                                    ("col_fill=None is incompatible with "
+                                     "incomplete column name \('C', 'c'\)")):
+            df2.rename_axis([('C', 'c')]).reset_index(col_fill=None)
+
+        # with col_level != 0
+        result = df2.rename_axis([('c', 'ii')]).reset_index(col_level=1,
+                                                            col_fill='C')
+        tm.assert_frame_equal(result, expected)

     def test_set_index_period(self):
         # GH 6631
@@ -2404,13 +2307,13 @@ def test_set_index_period(self):
         expected1 = pd.period_range('2011-01-01', periods=3, freq='M')
         expected2 = pd.period_range('2013-01-01 09:00', periods=2, freq='H')

-        self.assert_index_equal(df.index.levels[0], expected1)
-        self.assert_index_equal(df.index.levels[1], expected2)
-        self.assert_index_equal(df.index.levels[2], idx3)
+        tm.assert_index_equal(df.index.levels[0], expected1)
+        tm.assert_index_equal(df.index.levels[1], expected2)
+        tm.assert_index_equal(df.index.levels[2], idx3)

-        self.assert_index_equal(df.index.get_level_values(0), idx1)
-        self.assert_index_equal(df.index.get_level_values(1), idx2)
-        self.assert_index_equal(df.index.get_level_values(2), idx3)
+        tm.assert_index_equal(df.index.get_level_values(0), idx1)
+        tm.assert_index_equal(df.index.get_level_values(1), idx2)
+        tm.assert_index_equal(df.index.get_level_values(2), idx3)

     def test_repeat(self):
         # GH 9361
@@ -2446,9 +2349,406 @@ def test_iloc_mi(self):
         result = pd.DataFrame([[df_mi.iloc[r, c] for c in range(2)]
                                for r in range(5)])

-        assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result, expected)
+
+
+class TestSorted(Base):
+    """ everything you wanted to test about sorting """
+
+    def test_sort_index_preserve_levels(self):
+        result = self.frame.sort_index()
+        assert result.index.names == self.frame.index.names
+
+    def test_sorting_repr_8017(self):
+
+        np.random.seed(0)
+        data = np.random.randn(3, 4)
+
+        for gen, extra in [([1., 3., 2., 5.], 4.), ([1, 3, 2, 5], 4),
+                           ([Timestamp('20130101'), Timestamp('20130103'),
+                             Timestamp('20130102'), Timestamp('20130105')],
+                            Timestamp('20130104')),
+                           (['1one', '3one', '2one', '5one'], '4one')]:
+            columns = MultiIndex.from_tuples([('red', i) for i in gen])
+            df = DataFrame(data, index=list('def'), columns=columns)
+            df2 = pd.concat([df,
+                             DataFrame('world', index=list('def'),
+                                       columns=MultiIndex.from_tuples(
+                                           [('red', extra)]))], axis=1)
+
+            # check that the repr is good
+            # make sure that we have a correct sparsified repr
+            # e.g. only 1 header of 'red'
+            assert str(df2).splitlines()[0].split() == ['red']
+
+            # GH 8017
+            # sorting fails after columns added
+
+            # construct single-dtype then sort
+            result = df.copy().sort_index(axis=1)
+            expected = df.iloc[:, [0, 2, 1, 3]]
+            tm.assert_frame_equal(result, expected)
+
+            result = df2.sort_index(axis=1)
+            expected = df2.iloc[:, [0, 2, 1, 4, 3]]
+            tm.assert_frame_equal(result, expected)
+
+            # setitem then sort
+            result = df.copy()
+            result[('red', extra)] = 'world'
+
+            result = result.sort_index(axis=1)
+            tm.assert_frame_equal(result, expected)
+
+    def test_sort_index_level(self):
+        df = self.frame.copy()
+        df.index = np.arange(len(df))
+
+        # axis=1
+
+        # series
+        a_sorted = self.frame['A'].sort_index(level=0)
+
+        # preserve names
+        assert a_sorted.index.names == self.frame.index.names
+
+        # inplace
+        rs = self.frame.copy()
+        rs.sort_index(level=0, inplace=True)
+        tm.assert_frame_equal(rs, self.frame.sort_index(level=0))
+
+    def test_sort_index_level_large_cardinality(self):
+
+        # #2684 (int64)
+        index = MultiIndex.from_arrays([np.arange(4000)] * 3)
+        df = DataFrame(np.random.randn(4000), index=index, dtype=np.int64)
+
+        # it works!
+        result = df.sort_index(level=0)
+        assert result.index.lexsort_depth == 3
+
+        # #2684 (int32)
+        index = MultiIndex.from_arrays([np.arange(4000)] * 3)
+        df = DataFrame(np.random.randn(4000), index=index, dtype=np.int32)
+
+        # it works!
+        result = df.sort_index(level=0)
+        assert (result.dtypes.values == df.dtypes.values).all()
+        assert result.index.lexsort_depth == 3
+
+    def test_sort_index_level_by_name(self):
+        self.frame.index.names = ['first', 'second']
+        result = self.frame.sort_index(level='second')
+        expected = self.frame.sort_index(level=1)
+        tm.assert_frame_equal(result, expected)
+
+    def test_sort_index_level_mixed(self):
+        sorted_before = self.frame.sort_index(level=1)
+
+        df = self.frame.copy()
+        df['foo'] = 'bar'
+        sorted_after = df.sort_index(level=1)
+        tm.assert_frame_equal(sorted_before,
+                              sorted_after.drop(['foo'], axis=1))
+
+        dft = self.frame.T
+        sorted_before = dft.sort_index(level=1, axis=1)
+        dft['foo', 'three'] = 'bar'
+
+        sorted_after = dft.sort_index(level=1, axis=1)
+        tm.assert_frame_equal(sorted_before.drop([('foo', 'three')], axis=1),
+                              sorted_after.drop([('foo', 'three')], axis=1))
+
+    def test_is_lexsorted(self):
+        levels = [[0, 1], [0, 1, 2]]
+
+        index = MultiIndex(levels=levels,
+                           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
+        assert index.is_lexsorted()
+
+        index = MultiIndex(levels=levels,
+                           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 2, 1]])
+        assert not index.is_lexsorted()
+
+        index = MultiIndex(levels=levels,
+                           labels=[[0, 0, 1, 0, 1, 1], [0, 1, 0, 2, 2, 1]])
+        assert not index.is_lexsorted()
+        assert index.lexsort_depth == 0
+
+    def test_getitem_multilevel_index_tuple_not_sorted(self):
+        index_columns = list("abc")
+        df = DataFrame([[0, 1, 0, "x"], [0, 0, 1, "y"]],
+                       columns=index_columns + ["data"])
+        df = df.set_index(index_columns)
+        query_index = df.index[:1]
+        rs = df.loc[query_index, "data"]
+
+        xp_idx = MultiIndex.from_tuples([(0, 1, 0)], names=['a', 'b', 'c'])
+        xp = Series(['x'], index=xp_idx, name='data')
+        tm.assert_series_equal(rs, xp)
+
+    def test_getitem_slice_not_sorted(self):
+        df = self.frame.sort_index(level=1).T
+
+        # buglet with int typechecking
+        result = df.iloc[:, :np.int32(3)]
+        expected = df.reindex(columns=df.columns[:3])
+        tm.assert_frame_equal(result, expected)
+
+    def test_frame_getitem_not_sorted2(self):
+        # 13431
+        df = DataFrame({'col1': ['b', 'd', 'b', 'a'],
+                        'col2': [3, 1, 1, 2],
+                        'data': ['one', 'two', 'three', 'four']})
+
+        df2 = df.set_index(['col1', 'col2'])
+        df2_original = df2.copy()
+
+        df2.index.set_levels(['b', 'd', 'a'], level='col1', inplace=True)
+        df2.index.set_labels([0, 1, 0, 2], level='col1', inplace=True)
+        assert not df2.index.is_lexsorted()
+        assert not df2.index.is_monotonic
+
+        assert df2_original.index.equals(df2.index)
+        expected = df2.sort_index()
+        assert expected.index.is_lexsorted()
+        assert expected.index.is_monotonic
+
+        result = df2.sort_index(level=0)
+        assert result.index.is_lexsorted()
+        assert result.index.is_monotonic
+        tm.assert_frame_equal(result, expected)
+
+    def test_frame_getitem_not_sorted(self):
+        df = self.frame.T
+        df['foo', 'four'] = 'foo'
+
+        arrays = [np.array(x) for x in zip(*df.columns.values)]
+
+        result = df['foo']
+        result2 = df.loc[:, 'foo']
+        expected = df.reindex(columns=df.columns[arrays[0] == 'foo'])
+        expected.columns = expected.columns.droplevel(0)
+        tm.assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result2, expected)
+
+        df = df.T
+        result = df.xs('foo')
+        result2 = df.loc['foo']
+        expected = df.reindex(df.index[arrays[0] == 'foo'])
+        expected.index = expected.index.droplevel(0)
+        tm.assert_frame_equal(result, expected)
+        tm.assert_frame_equal(result2, expected)

-if __name__ == '__main__':
-    nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
-                   exit=False)
+    def test_series_getitem_not_sorted(self):
+        arrays = [['bar', 'bar', 'baz', 'baz', 'qux', 'qux', 'foo', 'foo'],
+                  ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
+        tuples = lzip(*arrays)
+        index = MultiIndex.from_tuples(tuples)
+        s = Series(randn(8), index=index)
+
+        arrays = [np.array(x) for x in zip(*index.values)]
+
+        result = s['qux']
+        result2 = s.loc['qux']
+        expected = s[arrays[0] == 'qux']
+        expected.index = expected.index.droplevel(0)
+        tm.assert_series_equal(result, expected)
+        tm.assert_series_equal(result2, expected)
+
+    def test_sort_index_and_reconstruction(self):
+
+        # 15622
+        # lexsortedness should be identical
+        # across MultiIndex construction methods
+
+        df = DataFrame([[1, 1], [2, 2]], index=list('ab'))
+        expected = DataFrame([[1, 1], [2, 2], [1, 1], [2, 2]],
+                             index=MultiIndex.from_tuples([(0.5, 'a'),
+                                                           (0.5, 'b'),
+                                                           (0.8, 'a'),
+                                                           (0.8, 'b')]))
+        assert expected.index.is_lexsorted()
+
+        result = DataFrame(
+            [[1, 1], [2, 2], [1, 1], [2, 2]],
+            index=MultiIndex.from_product([[0.5, 0.8], list('ab')]))
+        result = result.sort_index()
+        assert result.index.is_lexsorted()
+        assert result.index.is_monotonic
+
+        tm.assert_frame_equal(result, expected)
+
+        result = DataFrame(
+            [[1, 1], [2, 2], [1, 1], [2, 2]],
+            index=MultiIndex(levels=[[0.5, 0.8], ['a', 'b']],
+                             labels=[[0, 0, 1, 1], [0, 1, 0, 1]]))
+        result = result.sort_index()
+        assert result.index.is_lexsorted()
+
+        tm.assert_frame_equal(result, expected)
+
+        concatted = pd.concat([df, df], keys=[0.8, 0.5])
+        result = concatted.sort_index()
+
+        assert result.index.is_lexsorted()
+        assert result.index.is_monotonic
+
+        tm.assert_frame_equal(result, expected)
+
+        # 14015
+        df = DataFrame([[1, 2], [6, 7]],
+                       columns=MultiIndex.from_tuples(
+                           [(0, '20160811 12:00:00'),
+                            (0, '20160809 12:00:00')],
+                           names=['l1', 'Date']))
+
+        df.columns.set_levels(pd.to_datetime(df.columns.levels[1]),
+                              level=1,
+                              inplace=True)
+        assert not df.columns.is_lexsorted()
+        assert not df.columns.is_monotonic
+        result = df.sort_index(axis=1)
+        assert result.columns.is_lexsorted()
+        assert result.columns.is_monotonic
+        result = df.sort_index(axis=1, level=1)
+        assert result.columns.is_lexsorted()
+        assert result.columns.is_monotonic
+
+    def test_sort_index_and_reconstruction_doc_example(self):
+        # doc example
+        df = DataFrame({'value': [1, 2, 3, 4]},
+                       index=MultiIndex(
+                           levels=[['a', 'b'], ['bb', 'aa']],
+                           labels=[[0, 0, 1, 1], [0, 1, 0, 1]]))
+        assert df.index.is_lexsorted()
+        assert not df.index.is_monotonic
+
+        # sort it
+        expected = DataFrame({'value': [2, 1, 4, 3]},
+                             index=MultiIndex(
+                                 levels=[['a', 'b'], ['aa', 'bb']],
+                                 labels=[[0, 0, 1, 1], [0, 1, 0, 1]]))
+        result = df.sort_index()
+        assert result.index.is_lexsorted()
+        assert result.index.is_monotonic
+
+        tm.assert_frame_equal(result, expected)
+
+        # reconstruct
+        result = df.sort_index().copy()
+        result.index = result.index._sort_levels_monotonic()
+        assert result.index.is_lexsorted()
+        assert result.index.is_monotonic
+
+        tm.assert_frame_equal(result, expected)
+
+    def test_sort_index_reorder_on_ops(self):
+        # 15687
+        df = pd.DataFrame(
+            np.random.randn(8, 2),
+            index=MultiIndex.from_product(
+                [['a', 'b'],
+                 ['big', 'small'],
+                 ['red', 'blu']],
+                names=['letter', 'size', 'color']),
+            columns=['near', 'far'])
+        df = df.sort_index()
+
+        def my_func(group):
+            group.index = ['newz', 'newa']
+            return group
+
+        result = df.groupby(level=['letter', 'size']).apply(
+            my_func).sort_index()
+        expected = MultiIndex.from_product(
+            [['a', 'b'],
+             ['big', 'small'],
+             ['newa', 'newz']],
+            names=['letter', 'size', None])
+
+        tm.assert_index_equal(result.index, expected)
+
+    def test_sort_non_lexsorted(self):
+        # degenerate case where we sort but don't
+        # have a satisfying result :<
+        # GH 15797
+        idx = MultiIndex([['A', 'B', 'C'],
+                          ['c', 'b', 'a']],
+                         [[0, 1, 2, 0, 1, 2],
+                          [0, 2, 1, 1, 0, 2]])
+
+        df = DataFrame({'col': range(len(idx))},
+                       index=idx,
+                       dtype='int64')
+        assert df.index.is_lexsorted() is False
+        assert df.index.is_monotonic is False
+
+        sorted = df.sort_index()
+        assert sorted.index.is_lexsorted() is True
+        assert sorted.index.is_monotonic is True
+
+        expected = DataFrame(
+            {'col': [1, 4, 5, 2]},
+            index=MultiIndex.from_tuples([('B', 'a'), ('B', 'c'),
+                                          ('C', 'a'), ('C', 'b')]),
+            dtype='int64')
+        result = sorted.loc[pd.IndexSlice['B':'C', 'a':'c'], :]
+        tm.assert_frame_equal(result, expected)
+
+    def test_sort_index_nan(self):
+        # GH 14784
+        # incorrect sorting w.r.t. nans
+        tuples = [[12, 13], [np.nan, np.nan], [np.nan, 3], [1, 2]]
+        mi = MultiIndex.from_tuples(tuples)
+
+        df = DataFrame(np.arange(16).reshape(4, 4),
+                       index=mi, columns=list('ABCD'))
+        s = Series(np.arange(4), index=mi)
+
+        df2 = DataFrame({
+            'date': pd.to_datetime([
+                '20121002', '20121007', '20130130', '20130202', '20130305',
+                '20121002', '20121207', '20130130', '20130202', '20130305',
+                '20130202', '20130305'
+            ]),
+            'user_id': [1, 1, 1, 1, 1, 3, 3, 3, 5, 5, 5, 5],
+            'whole_cost': [1790, np.nan, 280, 259, np.nan, 623, 90, 312,
+                           np.nan, 301, 359, 801],
+            'cost': [12, 15, 10, 24, 39, 1, 0, np.nan, 45, 34, 1, 12]
+        }).set_index(['date', 'user_id'])
+
+        # sorting frame, default nan position is last
+        result = df.sort_index()
+        expected = df.iloc[[3, 0, 2, 1], :]
+        tm.assert_frame_equal(result, expected)
+
+        # sorting frame, nan position last
+        result = df.sort_index(na_position='last')
+        expected = df.iloc[[3, 0, 2, 1], :]
+        tm.assert_frame_equal(result, expected)
+
+        # sorting frame, nan position first
+        result = df.sort_index(na_position='first')
+        expected = df.iloc[[1, 2, 3, 0], :]
+        tm.assert_frame_equal(result, expected)
+
+        # sorting frame with removed rows
+        result = df2.dropna().sort_index()
+        expected = df2.sort_index().dropna()
+        tm.assert_frame_equal(result, expected)
+
+        # sorting series, default nan position is last
+        result = s.sort_index()
+        expected = s.iloc[[3, 0, 2, 1]]
+        tm.assert_series_equal(result, expected)
+
+        # sorting series, nan position last
+        result = s.sort_index(na_position='last')
+        expected = s.iloc[[3, 0, 2, 1]]
+        tm.assert_series_equal(result, expected)
+
+        # sorting series, nan position first
+        result = s.sort_index(na_position='first')
+        expected = s.iloc[[1, 2, 3, 0]]
+        tm.assert_series_equal(result, expected)
diff --git a/pandas/tests/test_nanops.py b/pandas/tests/test_nanops.py
index be634228b1b6e..6798e64b01d7e 100644
--- a/pandas/tests/test_nanops.py
+++ b/pandas/tests/test_nanops.py
@@ -3,19 +3,22 @@
 from functools import partial

+import pytest
 import warnings
 import numpy as np
-from pandas import Series, isnull
-from pandas.types.common import is_integer_dtype
+
+import pandas as pd
+from pandas import Series, isnull, _np_version_under1p9
+from pandas.core.dtypes.common import is_integer_dtype
 import pandas.core.nanops as nanops
 import pandas.util.testing as tm

 use_bn = nanops._USE_BOTTLENECK

-class TestnanopsDataFrame(tm.TestCase):
+class TestnanopsDataFrame(object):

-    def setUp(self):
+    def setup_method(self, method):
         np.random.seed(11235)
         nanops._USE_BOTTLENECK = False

@@ -115,7 +118,7 @@ def setUp(self):
         self.arr_float_nan_inf_1d = self.arr_float_nan_inf[:, 0, 0]
         self.arr_nan_nan_inf_1d = self.arr_nan_nan_inf[:, 0, 0]

-    def tearDown(self):
+    def teardown_method(self, method):
         nanops._USE_BOTTLENECK = use_bn

     def check_results(self, targ, res, axis, check_dtype=True):
@@ -338,15 +341,14 @@ def test_nanmean_overflow(self):
         # is now consistent with numpy
         # numpy < 1.9.0 is not computing this correctly

-        from distutils.version import LooseVersion
-        if LooseVersion(np.__version__) >= '1.9.0':
+        if not _np_version_under1p9:
             for a in [2 ** 55, -2 ** 55, 20150515061816532]:
                 s = Series(a, index=range(500), dtype=np.int64)
                 result = s.mean()
                 np_result = s.values.mean()
-                self.assertEqual(result, a)
-                self.assertEqual(result, np_result)
-                self.assertTrue(result.dtype == np.float64)
+                assert result == a
+                assert result == np_result
+                assert result.dtype == np.float64

     def test_returned_dtype(self):
@@ -361,15 +363,9 @@ def test_returned_dtype(self):
         for method in group_a + group_b:
             result = getattr(s, method)()
             if is_integer_dtype(dtype) and method in group_a:
-                self.assertTrue(
-                    result.dtype == np.float64,
-                    "return dtype expected from %s is np.float64, "
-                    "got %s instead" % (method, result.dtype))
+                assert result.dtype == np.float64
             else:
-                self.assertTrue(
-                    result.dtype == dtype,
-                    "return dtype expected from %s is %s, "
-                    "got %s instead" % (method, dtype, result.dtype))
+                assert result.dtype == dtype

     def test_nanmedian(self):
         with warnings.catch_warnings(record=True):
@@ -388,12 +384,12 @@ def test_nanstd(self):
                              allow_tdelta=True, allow_obj='convert')

     def test_nansem(self):
-        tm.skip_if_no_package('scipy.stats')
-        tm._skip_if_scipy_0_17()
+        tm.skip_if_no_package('scipy', min_version='0.17.0')
         from scipy.stats import sem
-        self.check_funs_ddof(nanops.nansem, sem, allow_complex=False,
-                             allow_str=False, allow_date=False,
-                             allow_tdelta=True, allow_obj='convert')
+        with np.errstate(invalid='ignore'):
+            self.check_funs_ddof(nanops.nansem, sem, allow_complex=False,
+                                 allow_str=False, allow_date=False,
+                                 allow_tdelta=False, allow_obj='convert')

     def _minmax_wrap(self, value, axis=None, func=None):
         res = func(value, axis)
@@ -448,21 +444,23 @@ def _skew_kurt_wrap(self, values, axis=None, func=None):
         return result

     def test_nanskew(self):
-        tm.skip_if_no_package('scipy.stats')
-        tm._skip_if_scipy_0_17()
+        tm.skip_if_no_package('scipy', min_version='0.17.0')
         from scipy.stats import skew
         func = partial(self._skew_kurt_wrap, func=skew)
-        self.check_funs(nanops.nanskew, func, allow_complex=False,
-                        allow_str=False, allow_date=False, allow_tdelta=False)
+        with np.errstate(invalid='ignore'):
+            self.check_funs(nanops.nanskew, func, allow_complex=False,
+                            allow_str=False, allow_date=False,
+                            allow_tdelta=False)

     def test_nankurt(self):
-        tm.skip_if_no_package('scipy.stats')
-        tm._skip_if_scipy_0_17()
+        tm.skip_if_no_package('scipy', min_version='0.17.0')
         from scipy.stats import kurtosis
         func1 = partial(kurtosis, fisher=True)
         func = partial(self._skew_kurt_wrap, func=func1)
-        self.check_funs(nanops.nankurt, func, allow_complex=False,
-                        allow_str=False, allow_date=False, allow_tdelta=False)
+        with np.errstate(invalid='ignore'):
+            self.check_funs(nanops.nankurt, func, allow_complex=False,
+                            allow_str=False, allow_date=False,
+                            allow_tdelta=False)

     def test_nanprod(self):
         self.check_funs(nanops.nanprod, np.prod, allow_str=False,
@@ -654,9 +652,9 @@ def check_bool(self, func, value, correct, *args, **kwargs):
         try:
             res0 = func(value, *args, **kwargs)
             if correct:
-                self.assertTrue(res0)
+                assert res0
             else:
-                self.assertFalse(res0)
+                assert not res0
         except BaseException as exc:
             exc.args += ('dim: %s' % getattr(value, 'ndim', value), )
             raise
@@ -733,67 +731,62 @@ def test__isfinite(self):
                 raise

     def test__bn_ok_dtype(self):
-        self.assertTrue(nanops._bn_ok_dtype(self.arr_float.dtype, 'test'))
-        self.assertTrue(nanops._bn_ok_dtype(self.arr_complex.dtype, 'test'))
-        self.assertTrue(nanops._bn_ok_dtype(self.arr_int.dtype, 'test'))
-        self.assertTrue(nanops._bn_ok_dtype(self.arr_bool.dtype, 'test'))
-        self.assertTrue(nanops._bn_ok_dtype(self.arr_str.dtype, 'test'))
-        self.assertTrue(nanops._bn_ok_dtype(self.arr_utf.dtype, 'test'))
-        self.assertFalse(nanops._bn_ok_dtype(self.arr_date.dtype, 'test'))
-        self.assertFalse(nanops._bn_ok_dtype(self.arr_tdelta.dtype, 'test'))
-        self.assertFalse(nanops._bn_ok_dtype(self.arr_obj.dtype, 'test'))
+        assert nanops._bn_ok_dtype(self.arr_float.dtype, 'test')
+        assert nanops._bn_ok_dtype(self.arr_complex.dtype, 'test')
+        assert nanops._bn_ok_dtype(self.arr_int.dtype, 'test')
+        assert nanops._bn_ok_dtype(self.arr_bool.dtype, 'test')
+        assert nanops._bn_ok_dtype(self.arr_str.dtype, 'test')
+        assert nanops._bn_ok_dtype(self.arr_utf.dtype, 'test')
+        assert not nanops._bn_ok_dtype(self.arr_date.dtype, 'test')
+        assert not nanops._bn_ok_dtype(self.arr_tdelta.dtype, 'test')
+        assert not nanops._bn_ok_dtype(self.arr_obj.dtype, 'test')

-class TestEnsureNumeric(tm.TestCase):
+class TestEnsureNumeric(object):

     def test_numeric_values(self):
         # Test integer
-        self.assertEqual(nanops._ensure_numeric(1), 1, 'Failed for int')
+        assert nanops._ensure_numeric(1) == 1
+
         # Test float
-        self.assertEqual(nanops._ensure_numeric(1.1), 1.1, 'Failed for float')
+        assert nanops._ensure_numeric(1.1) == 1.1
+
         # Test complex
-        self.assertEqual(nanops._ensure_numeric(1 + 2j), 1 + 2j,
-                         'Failed for complex')
+        assert nanops._ensure_numeric(1 + 2j) == 1 + 2j

     def test_ndarray(self):
         # Test numeric ndarray
         values = np.array([1, 2, 3])
-        self.assertTrue(np.allclose(nanops._ensure_numeric(values), values),
-                        'Failed for numeric ndarray')
+        assert np.allclose(nanops._ensure_numeric(values), values)

         # Test object ndarray
         o_values = values.astype(object)
-        self.assertTrue(np.allclose(nanops._ensure_numeric(o_values), values),
-                        'Failed for object ndarray')
+        assert np.allclose(nanops._ensure_numeric(o_values), values)

         # Test convertible string ndarray
         s_values = np.array(['1', '2', '3'], dtype=object)
-        self.assertTrue(np.allclose(nanops._ensure_numeric(s_values), values),
-                        'Failed for convertible string ndarray')
+        assert np.allclose(nanops._ensure_numeric(s_values), values)

         # Test non-convertible string ndarray
         s_values = np.array(['foo', 'bar', 'baz'], dtype=object)
-        self.assertRaises(ValueError, lambda: nanops._ensure_numeric(s_values))
+        pytest.raises(ValueError, lambda: nanops._ensure_numeric(s_values))

     def test_convertable_values(self):
-        self.assertTrue(np.allclose(nanops._ensure_numeric('1'), 1.0),
-                        'Failed for convertible integer string')
-        self.assertTrue(np.allclose(nanops._ensure_numeric('1.1'), 1.1),
-                        'Failed for convertible float string')
-        self.assertTrue(np.allclose(nanops._ensure_numeric('1+1j'), 1 + 1j),
-                        'Failed for convertible complex string')
+        assert np.allclose(nanops._ensure_numeric('1'), 1.0)
+        assert np.allclose(nanops._ensure_numeric('1.1'), 1.1)
+        assert np.allclose(nanops._ensure_numeric('1+1j'), 1 + 1j)

     def test_non_convertable_values(self):
-        self.assertRaises(TypeError, lambda: nanops._ensure_numeric('foo'))
-        self.assertRaises(TypeError, lambda: nanops._ensure_numeric({}))
-        self.assertRaises(TypeError, lambda: nanops._ensure_numeric([]))
+        pytest.raises(TypeError, lambda: nanops._ensure_numeric('foo'))
+        pytest.raises(TypeError, lambda: nanops._ensure_numeric({}))
+        pytest.raises(TypeError, lambda: nanops._ensure_numeric([]))

-class TestNanvarFixedValues(tm.TestCase):
+class TestNanvarFixedValues(object):

     # xref GH10242

-    def setUp(self):
+    def setup_method(self, method):
         # Samples from a normal distribution.
         self.variance = variance = 3.0
         self.samples = self.prng.normal(scale=variance ** 0.5, size=100000)

@@ -880,14 +873,14 @@ def test_ground_truth(self):
             for ddof in range(3):
                 var = nanops.nanvar(samples, skipna=True, axis=axis, ddof=ddof)
                 tm.assert_almost_equal(var[:3], variance[axis, ddof])
-                self.assertTrue(np.isnan(var[3]))
+                assert np.isnan(var[3])

         # Test nanstd.
         for axis in range(2):
             for ddof in range(3):
                 std = nanops.nanstd(samples, skipna=True, axis=axis,
                                     ddof=ddof)
                 tm.assert_almost_equal(std[:3], variance[axis, ddof] ** 0.5)
-                self.assertTrue(np.isnan(std[3]))
+                assert np.isnan(std[3])

     def test_nanstd_roundoff(self):
         # Regression test for GH 10242 (test data taken from GH 10489). Ensure
@@ -895,18 +888,18 @@ def test_nanstd_roundoff(self):
         data = Series(766897346 * np.ones(10))
         for ddof in range(3):
             result = data.std(ddof=ddof)
-            self.assertEqual(result, 0.0)
+            assert result == 0.0

     @property
     def prng(self):
         return np.random.RandomState(1234)

-class TestNanskewFixedValues(tm.TestCase):
+class TestNanskewFixedValues(object):

     # xref GH 11974

-    def setUp(self):
+    def setup_method(self, method):
         # Test data + skewness value (computed with scipy.stats.skew)
         self.samples = np.sin(np.linspace(0, 1, 200))
         self.actual_skew = -0.1875895205961754

@@ -916,20 +909,20 @@ def test_constant_series(self):
         for val in [3075.2, 3075.3, 3075.5]:
             data = val * np.ones(300)
             skew = nanops.nanskew(data)
-            self.assertEqual(skew, 0.0)
+            assert skew == 0.0

     def test_all_finite(self):
         alpha, beta = 0.3, 0.1
         left_tailed = self.prng.beta(alpha, beta, size=100)
-        self.assertLess(nanops.nanskew(left_tailed), 0)
+        assert nanops.nanskew(left_tailed) < 0

         alpha, beta = 0.1, 0.3
         right_tailed = self.prng.beta(alpha, beta, size=100)
-        self.assertGreater(nanops.nanskew(right_tailed), 0)
+        assert nanops.nanskew(right_tailed) > 0

     def test_ground_truth(self):
         skew = nanops.nanskew(self.samples)
-        self.assertAlmostEqual(skew, self.actual_skew)
+        tm.assert_almost_equal(skew, self.actual_skew)

     def test_axis(self):
         samples = np.vstack([self.samples,
@@ -940,7 +933,7 @@ def test_axis(self):
     def test_nans(self):
         samples = np.hstack([self.samples, np.nan])
         skew = nanops.nanskew(samples, skipna=False)
-        self.assertTrue(np.isnan(skew))
+        assert np.isnan(skew)

     def test_nans_skipna(self):
         samples = np.hstack([self.samples, np.nan])
@@ -952,11 +945,11 @@ def prng(self):
         return np.random.RandomState(1234)

-class TestNankurtFixedValues(tm.TestCase):
+class TestNankurtFixedValues(object):

     # xref GH 11974

-    def setUp(self):
+    def setup_method(self, method):
         # Test data + kurtosis value (computed with scipy.stats.kurtosis)
         self.samples = np.sin(np.linspace(0, 1, 200))
         self.actual_kurt = -1.2058303433799713

@@ -966,20 +959,20 @@ def test_constant_series(self):
         for val in [3075.2, 3075.3, 3075.5]:
             data = val * np.ones(300)
             kurt = nanops.nankurt(data)
-            self.assertEqual(kurt, 0.0)
+            assert kurt == 0.0

     def test_all_finite(self):
         alpha, beta = 0.3, 0.1
         left_tailed = self.prng.beta(alpha, beta, size=100)
-        self.assertLess(nanops.nankurt(left_tailed), 0)
+        assert nanops.nankurt(left_tailed) < 0

         alpha, beta = 0.1, 0.3
         right_tailed = self.prng.beta(alpha, beta, size=100)
-        self.assertGreater(nanops.nankurt(right_tailed), 0)
+        assert nanops.nankurt(right_tailed) > 0

     def test_ground_truth(self):
         kurt = nanops.nankurt(self.samples)
-        self.assertAlmostEqual(kurt, self.actual_kurt)
+        tm.assert_almost_equal(kurt, self.actual_kurt)

     def test_axis(self):
         samples = np.vstack([self.samples,
@@ -990,7 +983,7 @@ def test_axis(self):
     def test_nans(self):
         samples = np.hstack([self.samples, np.nan])
         kurt = nanops.nankurt(samples, skipna=False)
-        self.assertTrue(np.isnan(kurt))
+        assert np.isnan(kurt)

     def test_nans_skipna(self):
         samples = np.hstack([self.samples, np.nan])
@@ -1002,29 +995,14 @@ def prng(self):
         return np.random.RandomState(1234)

-def test_int64_add_overflow():
-    # see gh-14068
-    msg = "Overflow in int64 addition"
-    m = np.iinfo(np.int64).max
-    n = np.iinfo(np.int64).min
-
-    with tm.assertRaisesRegexp(OverflowError, msg):
-        nanops._checked_add_with_arr(np.array([m, m]), m)
-    with tm.assertRaisesRegexp(OverflowError, msg):
-        nanops._checked_add_with_arr(np.array([m, m]), np.array([m, m]))
-    with tm.assertRaisesRegexp(OverflowError, msg):
-        nanops._checked_add_with_arr(np.array([n, n]), n)
-    with tm.assertRaisesRegexp(OverflowError, msg):
-        nanops._checked_add_with_arr(np.array([n, n]), np.array([n, n]))
-    with tm.assertRaisesRegexp(OverflowError, msg):
-        nanops._checked_add_with_arr(np.array([m, n]), np.array([n, n]))
-    with tm.assertRaisesRegexp(OverflowError, msg):
-        with tm.assert_produces_warning(RuntimeWarning):
-            nanops._checked_add_with_arr(np.array([m, m]),
-                                         np.array([np.nan, m]))
-
-
-if __name__ == '__main__':
-    import nose
-    nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure', '-s'
-                         ], exit=False)
+def test_use_bottleneck():
+
+    if nanops._BOTTLENECK_INSTALLED:
+
+        pd.set_option('use_bottleneck', True)
+        assert pd.get_option('use_bottleneck')
+
+        pd.set_option('use_bottleneck', False)
+        assert not pd.get_option('use_bottleneck')
+
+        pd.set_option('use_bottleneck', use_bn)
diff --git a/pandas/tests/test_panel.py b/pandas/tests/test_panel.py
index 9cb2dd5a40ac4..3243b69a25acd 100644
--- a/pandas/tests/test_panel.py
+++ b/pandas/tests/test_panel.py
@@ -1,68 +1,79 @@
 # -*- coding: utf-8 -*-
 # pylint: disable=W0612,E1101

+from warnings import catch_warnings
 from datetime import datetime
-
 import operator
-import nose
+import pytest

 import numpy as np
 import pandas as pd

-from pandas.types.common import is_float_dtype
+from pandas.core.dtypes.common import is_float_dtype
 from pandas import (Series, DataFrame, Index, date_range, isnull, notnull,
                     pivot, MultiIndex)
 from pandas.core.nanops import nanall, nanany
 from pandas.core.panel import Panel
 from pandas.core.series import remove_na
-from pandas.formats.printing import pprint_thing
+from pandas.io.formats.printing import pprint_thing
 from pandas import compat
 from pandas.compat import range, lrange, StringIO, OrderedDict, signature
 from pandas.tseries.offsets import BDay, MonthEnd
 from pandas.util.testing import (assert_panel_equal, assert_frame_equal,
                                  assert_series_equal, assert_almost_equal,
-                                 ensure_clean, assertRaisesRegexp,
-                                 makeCustomDataframe as mkdf,
-                                 makeMixedDataFrame)
+                                 ensure_clean, makeMixedDataFrame,
+                                 makeCustomDataframe as mkdf)
 import pandas.core.panel as panelm
 import pandas.util.testing as tm

+def make_test_panel():
+    with catch_warnings(record=True):
+        _panel = tm.makePanel()
+        tm.add_nans(_panel)
+        _panel = _panel.copy()
+    return _panel
+
+
 class PanelTests(object):
     panel = None

     def test_pickle(self):
-        unpickled = self.round_trip_pickle(self.panel)
-        assert_frame_equal(unpickled['ItemA'], self.panel['ItemA'])
+        with catch_warnings(record=True):
+            unpickled = tm.round_trip_pickle(self.panel)
+            assert_frame_equal(unpickled['ItemA'], self.panel['ItemA'])

     def test_rank(self):
-        self.assertRaises(NotImplementedError, lambda: self.panel.rank())
+        with catch_warnings(record=True):
+            pytest.raises(NotImplementedError, lambda: self.panel.rank())

     def test_cumsum(self):
-        cumsum = self.panel.cumsum()
-        assert_frame_equal(cumsum['ItemA'], self.panel['ItemA'].cumsum())
+        with catch_warnings(record=True):
+            cumsum = self.panel.cumsum()
+            assert_frame_equal(cumsum['ItemA'], self.panel['ItemA'].cumsum())

     def not_hashable(self):
-        c_empty = Panel()
-        c = Panel(Panel([[[1]]]))
-        self.assertRaises(TypeError, hash, c_empty)
-        self.assertRaises(TypeError, hash, c)
+        with catch_warnings(record=True):
+            c_empty = Panel()
+            c = Panel(Panel([[[1]]]))
+            pytest.raises(TypeError, hash, c_empty)
+            pytest.raises(TypeError, hash, c)

 class SafeForLongAndSparse(object):
-    _multiprocess_can_split_ = True

     def test_repr(self):
         repr(self.panel)

     def test_copy_names(self):
-        for attr in ('major_axis', 'minor_axis'):
-            getattr(self.panel, attr).name = None
-            cp = self.panel.copy()
-            getattr(cp, attr).name = 'foo'
-            self.assertIsNone(getattr(self.panel, attr).name)
+        with catch_warnings(record=True):
+            for attr in ('major_axis', 'minor_axis'):
+                getattr(self.panel, attr).name = None
+                cp = self.panel.copy()
+                getattr(cp, attr).name = 'foo'
+                assert getattr(self.panel, attr).name is None

     def test_iter(self):
         tm.equalContents(list(self.panel), self.panel.items)
@@ -98,7 +109,7 @@ def test_skew(self):
         try:
             from scipy.stats import skew
         except ImportError:
-            raise nose.SkipTest("no scipy.stats.skew")
+            pytest.skip("no scipy.stats.skew")

         def this_skew(x):
             if len(x) < 3:
@@ -107,10 +118,6 @@ def this_skew(x):

         self._check_stat_op('skew', this_skew)

-    # def test_mad(self):
-    #     f = lambda x: np.abs(x - x.mean()).mean()
-    #     self._check_stat_op('mad', f)
-
     def test_var(self):
         def alt(x):
             if len(x) < 2:
@@ -140,8 +147,8 @@ def _check_stat_op(self, name, alternative, obj=None, has_skipna=True):
             obj = self.panel

         # # set some NAs
-        # obj.ix[5:10] = np.nan
-        # obj.ix[15:20, -2:] = np.nan
+        # obj.loc[5:10] = np.nan
+        # obj.loc[15:20, -2:] = np.nan

         f = getattr(obj, name)
@@ -168,16 +175,15 @@ def wrapper(x):
                 if not tm._incompat_bottleneck_version(name):
                     assert_frame_equal(result, obj.apply(skipna_wrapper, axis=i))

-        self.assertRaises(Exception, f, axis=obj.ndim)
+        pytest.raises(Exception, f, axis=obj.ndim)

         # Unimplemented numeric_only parameter.
         if 'numeric_only' in signature(f).args:
-            self.assertRaisesRegexp(NotImplementedError, name, f,
-                                    numeric_only=True)
+            tm.assert_raises_regex(NotImplementedError, name, f,
+                                   numeric_only=True)

 class SafeForSparse(object):
-    _multiprocess_can_split_ = True

     @classmethod
     def assert_panel_equal(cls, x, y):
@@ -198,38 +204,38 @@ def test_set_axis(self):
         self.panel.items = new_items

         if hasattr(self.panel, '_item_cache'):
-            self.assertNotIn('ItemA', self.panel._item_cache)
-        self.assertIs(self.panel.items, new_items)
+            assert 'ItemA' not in self.panel._item_cache
+        assert self.panel.items is new_items

         # TODO: unused?
         item = self.panel[0]  # noqa

         self.panel.major_axis = new_major
-        self.assertIs(self.panel[0].index, new_major)
-        self.assertIs(self.panel.major_axis, new_major)
+        assert self.panel[0].index is new_major
+        assert self.panel.major_axis is new_major

         # TODO: unused?
         item = self.panel[0]  # noqa

         self.panel.minor_axis = new_minor
-        self.assertIs(self.panel[0].columns, new_minor)
-        self.assertIs(self.panel.minor_axis, new_minor)
+        assert self.panel[0].columns is new_minor
+        assert self.panel.minor_axis is new_minor

     def test_get_axis_number(self):
-        self.assertEqual(self.panel._get_axis_number('items'), 0)
-        self.assertEqual(self.panel._get_axis_number('major'), 1)
-        self.assertEqual(self.panel._get_axis_number('minor'), 2)
+        assert self.panel._get_axis_number('items') == 0
+        assert self.panel._get_axis_number('major') == 1
+        assert self.panel._get_axis_number('minor') == 2

-        with tm.assertRaisesRegexp(ValueError, "No axis named foo"):
+        with tm.assert_raises_regex(ValueError, "No axis named foo"):
             self.panel._get_axis_number('foo')

-        with tm.assertRaisesRegexp(ValueError, "No axis named foo"):
+        with tm.assert_raises_regex(ValueError, "No axis named foo"):
             self.panel.__ge__(self.panel, axis='foo')

     def test_get_axis_name(self):
-        self.assertEqual(self.panel._get_axis_name(0), 'items')
-        self.assertEqual(self.panel._get_axis_name(1), 'major_axis')
-        self.assertEqual(self.panel._get_axis_name(2), 'minor_axis')
+        assert self.panel._get_axis_name(0) == 'items'
+        assert self.panel._get_axis_name(1) == 'major_axis'
+        assert self.panel._get_axis_name(2) == 'minor_axis'

     def test_get_plane_axes(self):
         # what to do here?
@@ -240,47 +246,48 @@ def test_get_plane_axes(self):
         index, columns = self.panel._get_plane_axes(0)

     def test_truncate(self):
-        dates = self.panel.major_axis
-        start, end = dates[1], dates[5]
-
-        trunced = self.panel.truncate(start, end, axis='major')
-        expected = self.panel['ItemA'].truncate(start, end)
+        with catch_warnings(record=True):
+            dates = self.panel.major_axis
+            start, end = dates[1], dates[5]

-        assert_frame_equal(trunced['ItemA'], expected)
+            trunced = self.panel.truncate(start, end, axis='major')
+            expected = self.panel['ItemA'].truncate(start, end)

-        trunced = self.panel.truncate(before=start, axis='major')
-        expected = self.panel['ItemA'].truncate(before=start)
+            assert_frame_equal(trunced['ItemA'], expected)

-        assert_frame_equal(trunced['ItemA'], expected)
+            trunced = self.panel.truncate(before=start, axis='major')
+            expected = self.panel['ItemA'].truncate(before=start)

-        trunced = self.panel.truncate(after=end, axis='major')
-        expected = self.panel['ItemA'].truncate(after=end)
+            assert_frame_equal(trunced['ItemA'], expected)

-        assert_frame_equal(trunced['ItemA'], expected)
+            trunced = self.panel.truncate(after=end, axis='major')
+            expected = self.panel['ItemA'].truncate(after=end)

-        # XXX test other axes
+            assert_frame_equal(trunced['ItemA'], expected)

     def test_arith(self):
-        self._test_op(self.panel, operator.add)
-        self._test_op(self.panel, operator.sub)
-        self._test_op(self.panel, operator.mul)
-        self._test_op(self.panel, operator.truediv)
-        self._test_op(self.panel, operator.floordiv)
-        self._test_op(self.panel, operator.pow)
-
-        self._test_op(self.panel, lambda x, y: y + x)
-        self._test_op(self.panel, lambda x, y: y - x)
-        self._test_op(self.panel, lambda x, y: y * x)
-        self._test_op(self.panel, lambda x, y: y / x)
-        self._test_op(self.panel, lambda x, y: y ** x)
-
-        self._test_op(self.panel, lambda x, y: x + y)  # panel + 1
-        self._test_op(self.panel, lambda x, y: x - y)  # panel - 1
-        self._test_op(self.panel, lambda x, y: x * y)  # panel * 1
-        self._test_op(self.panel, lambda x, y: x / y)  # panel / 1
-        self._test_op(self.panel, lambda x, y: x ** y)  # panel ** 1
-
-        self.assertRaises(Exception, self.panel.__add__,
-                          self.panel['ItemA'])
+        with catch_warnings(record=True):
+            self._test_op(self.panel, operator.add)
+            self._test_op(self.panel, operator.sub)
+            self._test_op(self.panel, operator.mul)
+            self._test_op(self.panel, operator.truediv)
+            self._test_op(self.panel, operator.floordiv)
+            self._test_op(self.panel, operator.pow)
+
+            self._test_op(self.panel, lambda x, y: y + x)
+            self._test_op(self.panel, lambda x, y: y - x)
+            self._test_op(self.panel, lambda x, y: y * x)
+            self._test_op(self.panel, lambda x, y: y / x)
+            self._test_op(self.panel, lambda x, y: y ** x)
+
+            self._test_op(self.panel, lambda x, y: x + y)  # panel + 1
+            self._test_op(self.panel, lambda x, y: x - y)  # panel - 1
+            self._test_op(self.panel, lambda x, y: x * y)  # panel * 1
+            self._test_op(self.panel, lambda x, y: x / y)  # panel / 1
+            self._test_op(self.panel, lambda x, y: x ** y)  # panel ** 1
+
+            pytest.raises(Exception, self.panel.__add__,
+                          self.panel['ItemA'])

     @staticmethod
     def _test_op(panel, op):
@@ -296,96 +303,103 @@ def test_iteritems(self):
         for k, v in self.panel.iteritems():
             pass

-        self.assertEqual(len(list(self.panel.iteritems())),
-                         len(self.panel.items))
+        assert len(list(self.panel.iteritems())) == len(self.panel.items)

     def test_combineFrame(self):
-        def check_op(op, name):
-            # items
-            df = self.panel['ItemA']
+        with catch_warnings(record=True):
+            def check_op(op, name):
+                # items
+                df = self.panel['ItemA']

-            func = getattr(self.panel, name)
+                func = getattr(self.panel, name)

-            result = func(df, axis='items')
+                result = func(df, axis='items')

-            assert_frame_equal(result['ItemB'], op(self.panel['ItemB'], df))
+                assert_frame_equal(
+                    result['ItemB'], op(self.panel['ItemB'], df))

-            # major
-            xs = self.panel.major_xs(self.panel.major_axis[0])
-            result = func(xs, axis='major')
+                # major
+                xs = self.panel.major_xs(self.panel.major_axis[0])
+                result = func(xs, axis='major')

-            idx = self.panel.major_axis[1]
+                idx = self.panel.major_axis[1]

-            assert_frame_equal(result.major_xs(idx),
-                               op(self.panel.major_xs(idx), xs))
+                assert_frame_equal(result.major_xs(idx),
+                                   op(self.panel.major_xs(idx), xs))

-            # minor
-            xs = self.panel.minor_xs(self.panel.minor_axis[0])
-            result = func(xs, axis='minor')
+                # minor
+                xs = self.panel.minor_xs(self.panel.minor_axis[0])
+                result = func(xs, axis='minor')

-            idx = self.panel.minor_axis[1]
+                idx = self.panel.minor_axis[1]

-            assert_frame_equal(result.minor_xs(idx),
-                               op(self.panel.minor_xs(idx), xs))
+                assert_frame_equal(result.minor_xs(idx),
+                                   op(self.panel.minor_xs(idx), xs))

-        ops = ['add', 'sub', 'mul', 'truediv', 'floordiv', 'pow', 'mod']
-        if not compat.PY3:
-            ops.append('div')
+            ops = ['add', 'sub', 'mul', 'truediv', 'floordiv', 'pow', 'mod']
+            if not compat.PY3:
+                ops.append('div')

-        for op in ops:
-            try:
-                check_op(getattr(operator, op), op)
-            except:
-                pprint_thing("Failing operation: %r" % op)
-                raise
-        if compat.PY3:
-            try:
-                check_op(operator.truediv, 'div')
-            except:
-                pprint_thing("Failing operation: %r" % 'div')
-                raise
+            for op in ops:
+                try:
+                    check_op(getattr(operator, op), op)
+                except:
+                    pprint_thing("Failing operation: %r" % op)
+                    raise
+            if compat.PY3:
+                try:
+                    check_op(operator.truediv, 'div')
+                except:
+                    pprint_thing("Failing operation: %r" % 'div')
+                    raise

     def test_combinePanel(self):
-        result = self.panel.add(self.panel)
-        self.assert_panel_equal(result, self.panel * 2)
+        with catch_warnings(record=True):
+            result = self.panel.add(self.panel)
+            assert_panel_equal(result, self.panel * 2)

     def test_neg(self):
-        self.assert_panel_equal(-self.panel, self.panel * -1)
+        with catch_warnings(record=True):
+            assert_panel_equal(-self.panel, self.panel * -1)

     # issue 7692
     def test_raise_when_not_implemented(self):
-        p = Panel(np.arange(3 * 4 * 5).reshape(3, 4, 5),
-                  items=['ItemA', 'ItemB', 'ItemC'],
-                  major_axis=pd.date_range('20130101', periods=4),
-                  minor_axis=list('ABCDE'))
-        d = p.sum(axis=1).ix[0]
-        ops = ['add', 'sub', 'mul', 'truediv', 'floordiv', 'div', 'mod', 'pow']
-        for op in ops:
-            with self.assertRaises(NotImplementedError):
-                getattr(p, op)(d, axis=0)
+        with catch_warnings(record=True):
+            p = Panel(np.arange(3 * 4 * 5).reshape(3, 4, 5),
+                      items=['ItemA', 'ItemB', 'ItemC'],
+                      major_axis=pd.date_range('20130101', periods=4),
+                      minor_axis=list('ABCDE'))
+            d = p.sum(axis=1).iloc[0]
+            ops = ['add', 'sub', 'mul', 'truediv',
+                   'floordiv', 'div', 'mod', 'pow']
+            for op in ops:
+                with pytest.raises(NotImplementedError):
+                    getattr(p, op)(d, axis=0)

     def test_select(self):
-        p = self.panel
+        with catch_warnings(record=True):
+            p = self.panel

-        # select items
-        result = p.select(lambda x: x in ('ItemA', 'ItemC'), axis='items')
-        expected = p.reindex(items=['ItemA', 'ItemC'])
-        self.assert_panel_equal(result, expected)
+            # select items
+            result = p.select(lambda x: x in ('ItemA', 'ItemC'), axis='items')
+            expected = p.reindex(items=['ItemA', 'ItemC'])
+            assert_panel_equal(result, expected)

-        # select major_axis
-        result = p.select(lambda x: x >= datetime(2000, 1, 15), axis='major')
-        new_major = p.major_axis[p.major_axis >= datetime(2000, 1, 15)]
-        expected = p.reindex(major=new_major)
-        self.assert_panel_equal(result, expected)
+            # select major_axis
+            result = p.select(lambda x: x >= datetime(
+                2000, 1, 15), axis='major')
+            new_major = p.major_axis[p.major_axis >= datetime(2000, 1, 15)]
+            expected = p.reindex(major=new_major)
+            assert_panel_equal(result, expected)

-        # select minor_axis
-        result = p.select(lambda x: x in ('D', 'A'), axis=2)
-        expected = p.reindex(minor=['A', 'D'])
-        self.assert_panel_equal(result, expected)
+            # select minor_axis
+            result = p.select(lambda x: x in ('D', 'A'), axis=2)
+            expected = p.reindex(minor=['A', 'D'])
+            assert_panel_equal(result, expected)

-        # corner case, empty thing
-        result = p.select(lambda x: x in ('foo', ), axis='items')
-        self.assert_panel_equal(result, p.reindex(items=[]))
+            # corner case, empty thing
+            result = p.select(lambda x: x in ('foo', ), axis='items')
+            assert_panel_equal(result, p.reindex(items=[]))

     def test_get_value(self):
         for item in self.panel.items:
@@ -397,219 +411,232 @@ def test_get_value(self):

     def test_abs(self):

-        result = self.panel.abs()
-        result2 = abs(self.panel)
-        expected = np.abs(self.panel)
-        self.assert_panel_equal(result, expected)
-        self.assert_panel_equal(result2, expected)
-
-        df = self.panel['ItemA']
-        result = df.abs()
-        result2 = abs(df)
-        expected = np.abs(df)
-        assert_frame_equal(result, expected)
-        assert_frame_equal(result2, expected)
+        with catch_warnings(record=True):
+            result = self.panel.abs()
+            result2 = abs(self.panel)
+            expected = np.abs(self.panel)
+            assert_panel_equal(result, expected)
+            assert_panel_equal(result2, expected)

-        s = df['A']
-        result = s.abs()
-        result2 = abs(s)
-        expected = np.abs(s)
-        assert_series_equal(result, expected)
-        assert_series_equal(result2, expected)
-        self.assertEqual(result.name, 'A')
-        self.assertEqual(result2.name, 'A')
+            df = self.panel['ItemA']
+            result = df.abs()
+            result2 = abs(df)
+            expected = np.abs(df)
+            assert_frame_equal(result, expected)
+            assert_frame_equal(result2, expected)
+
+            s = df['A']
+            result = s.abs()
+            result2 = abs(s)
+            expected = np.abs(s)
+            assert_series_equal(result, expected)
+            assert_series_equal(result2, expected)
+            assert result.name == 'A'
+            assert result2.name == 'A'

 class CheckIndexing(object):
-    _multiprocess_can_split_ = True

     def test_getitem(self):
-        self.assertRaises(Exception, self.panel.__getitem__, 'ItemQ')
+        pytest.raises(Exception, self.panel.__getitem__, 'ItemQ')

     def test_delitem_and_pop(self):
-        expected = self.panel['ItemA']
-        result = self.panel.pop('ItemA')
-        assert_frame_equal(expected, result)
-        self.assertNotIn('ItemA', self.panel.items)
+        with catch_warnings(record=True):
+            expected = self.panel['ItemA']
+            result = self.panel.pop('ItemA')
+            assert_frame_equal(expected, result)
+            assert 'ItemA' not in self.panel.items

-        del self.panel['ItemB']
-        self.assertNotIn('ItemB', self.panel.items)
-        self.assertRaises(Exception, self.panel.__delitem__, 'ItemB')
+            del self.panel['ItemB']
+            assert 'ItemB' not in self.panel.items
+            pytest.raises(Exception, self.panel.__delitem__, 'ItemB')

-        values = np.empty((3, 3, 3))
-        values[0] = 0
-        values[1] = 1
-        values[2] = 2
+            values = np.empty((3, 3, 3))
+            values[0] = 0
+            values[1] = 1
+            values[2] = 2

-        panel = Panel(values, lrange(3), lrange(3), lrange(3))
+            panel = Panel(values, lrange(3), lrange(3), lrange(3))

-        # did we delete the right row?
+            # did we delete the right row?

-        panelc = panel.copy()
-        del panelc[0]
-        assert_frame_equal(panelc[1], panel[1])
-        assert_frame_equal(panelc[2], panel[2])
+            panelc = panel.copy()
+            del panelc[0]
+            tm.assert_frame_equal(panelc[1], panel[1])
+            tm.assert_frame_equal(panelc[2], panel[2])

-        panelc = panel.copy()
-        del panelc[1]
-        assert_frame_equal(panelc[0], panel[0])
-        assert_frame_equal(panelc[2], panel[2])
+            panelc = panel.copy()
+            del panelc[1]
+            tm.assert_frame_equal(panelc[0], panel[0])
+            tm.assert_frame_equal(panelc[2], panel[2])

-        panelc = panel.copy()
-        del panelc[2]
-        assert_frame_equal(panelc[1], panel[1])
-        assert_frame_equal(panelc[0], panel[0])
+            panelc = panel.copy()
+            del panelc[2]
+            tm.assert_frame_equal(panelc[1], panel[1])
+            tm.assert_frame_equal(panelc[0], panel[0])

     def test_setitem(self):
-        # LongPanel with one item
-        lp = self.panel.filter(['ItemA', 'ItemB']).to_frame()
-        with tm.assertRaises(ValueError):
-            self.panel['ItemE'] = lp
+        with catch_warnings(record=True):
+
+            # LongPanel with one item
+            lp = self.panel.filter(['ItemA', 'ItemB']).to_frame()
+            with pytest.raises(ValueError):
+                self.panel['ItemE'] = lp

-        # DataFrame
-        df = self.panel['ItemA'][2:].filter(items=['A', 'B'])
-        self.panel['ItemF'] = df
-        self.panel['ItemE'] = df
+            # DataFrame
+            df = self.panel['ItemA'][2:].filter(items=['A', 'B'])
+            self.panel['ItemF'] = df
+            self.panel['ItemE'] = df

-        df2 = self.panel['ItemF']
+            df2 = self.panel['ItemF']

-        assert_frame_equal(df, df2.reindex(index=df.index, columns=df.columns))
+            assert_frame_equal(df, df2.reindex(
+                index=df.index, columns=df.columns))

-        # scalar
-        self.panel['ItemG'] = 1
-        self.panel['ItemE'] = True
-        self.assertEqual(self.panel['ItemG'].values.dtype, np.int64)
-        self.assertEqual(self.panel['ItemE'].values.dtype, np.bool_)
+            # scalar
+            self.panel['ItemG'] = 1
+            self.panel['ItemE'] = True
+            assert self.panel['ItemG'].values.dtype == np.int64
+            assert self.panel['ItemE'].values.dtype == np.bool_

-        # object dtype
-        self.panel['ItemQ'] = 'foo'
-        self.assertEqual(self.panel['ItemQ'].values.dtype, np.object_)
+            # object dtype
+            self.panel['ItemQ'] = 'foo'
+            assert self.panel['ItemQ'].values.dtype == np.object_

-        # boolean dtype
-        self.panel['ItemP'] = self.panel['ItemA'] > 0
-        self.assertEqual(self.panel['ItemP'].values.dtype, np.bool_)
+            # boolean dtype
+            self.panel['ItemP'] = self.panel['ItemA'] > 0
+            assert self.panel['ItemP'].values.dtype == np.bool_

-        self.assertRaises(TypeError, self.panel.__setitem__, 'foo',
-                          self.panel.ix[['ItemP']])
+            pytest.raises(TypeError, self.panel.__setitem__, 'foo',
+                          self.panel.loc[['ItemP']])

-        # bad shape
-        p = Panel(np.random.randn(4, 3, 2))
-        with tm.assertRaisesRegexp(ValueError,
-                                   r"shape of value must be \(3, 2\), "
-                                   r"shape of given object was \(4, 2\)"):
-            p[0] = np.random.randn(4, 2)
+            # bad shape
+            p = Panel(np.random.randn(4, 3, 2))
+            with tm.assert_raises_regex(ValueError,
+                                        r"shape of value must be "
+                                        r"\(3, 2\), shape of given "
+                                        r"object was \(4, 2\)"):
+                p[0] = np.random.randn(4, 2)

     def test_setitem_ndarray(self):
-        timeidx = date_range(start=datetime(2009, 1, 1),
-                             end=datetime(2009, 12, 31),
-                             freq=MonthEnd())
-        lons_coarse = np.linspace(-177.5, 177.5, 72)
-        lats_coarse = np.linspace(-87.5, 87.5, 36)
-        P = Panel(items=timeidx, major_axis=lons_coarse,
-                  minor_axis=lats_coarse)
-        data = np.random.randn(72 * 36).reshape((72, 36))
-        key = datetime(2009, 2, 28)
-        P[key] = data
-
-        assert_almost_equal(P[key].values, data)
+        with catch_warnings(record=True):
+            timeidx = date_range(start=datetime(2009, 1, 1),
+                                 end=datetime(2009, 12, 31),
+                                 freq=MonthEnd())
+            lons_coarse = np.linspace(-177.5, 177.5, 72)
+            lats_coarse = np.linspace(-87.5, 87.5, 36)
+            P = Panel(items=timeidx, major_axis=lons_coarse,
+                      minor_axis=lats_coarse)
+            data = np.random.randn(72 * 36).reshape((72, 36))
+            key = datetime(2009, 2, 28)
+            P[key] = data
+
+            assert_almost_equal(P[key].values, data)

     def test_set_minor_major(self):
-        # GH 11014
-        df1 = DataFrame(['a', 'a', 'a', np.nan, 'a', np.nan])
-        df2 = DataFrame([1.0, np.nan, 1.0, np.nan, 1.0, 1.0])
-        panel = Panel({'Item1': df1, 'Item2': df2})
-
-        newminor = notnull(panel.iloc[:, :, 0])
-        panel.loc[:, :, 'NewMinor'] = newminor
-        assert_frame_equal(panel.loc[:, :, 'NewMinor'],
-                           newminor.astype(object))
-
-        newmajor = notnull(panel.iloc[:, 0, :])
-        panel.loc[:, 'NewMajor', :] = newmajor
-        assert_frame_equal(panel.loc[:, 'NewMajor', :],
-                           newmajor.astype(object))
+        with catch_warnings(record=True):
+            # GH 11014
+            df1 = DataFrame(['a', 'a', 'a', np.nan, 'a', np.nan])
+            df2 = DataFrame([1.0, np.nan, 1.0, np.nan, 1.0, 1.0])
+            panel = Panel({'Item1': df1, 'Item2': df2})
+
+            newminor = notnull(panel.iloc[:, :, 0])
+            panel.loc[:, :, 'NewMinor'] = newminor
+            assert_frame_equal(panel.loc[:, :, 'NewMinor'],
+                               newminor.astype(object))
+
+            newmajor = notnull(panel.iloc[:, 0, :])
+            panel.loc[:, 'NewMajor', :] = newmajor
+            assert_frame_equal(panel.loc[:, 'NewMajor', :],
+                               newmajor.astype(object))

     def test_major_xs(self):
-        ref = self.panel['ItemA']
+        with catch_warnings(record=True):
+            ref = self.panel['ItemA']

-        idx = self.panel.major_axis[5]
-        xs = self.panel.major_xs(idx)
+            idx = self.panel.major_axis[5]
+            xs = self.panel.major_xs(idx)

-        result = xs['ItemA']
-        assert_series_equal(result, ref.xs(idx), check_names=False)
-        self.assertEqual(result.name, 'ItemA')
+            result = xs['ItemA']
+            assert_series_equal(result, ref.xs(idx), check_names=False)
+            assert result.name == 'ItemA'

-        # not contained
-        idx = self.panel.major_axis[0] - BDay()
-        self.assertRaises(Exception, self.panel.major_xs, idx)
+            # not contained
+            idx = self.panel.major_axis[0] - BDay()
+            pytest.raises(Exception, self.panel.major_xs, idx)

     def test_major_xs_mixed(self):
-        self.panel['ItemD'] = 'foo'
self.panel.major_xs(self.panel.major_axis[0]) - self.assertEqual(xs['ItemA'].dtype, np.float64) - self.assertEqual(xs['ItemD'].dtype, np.object_) + with catch_warnings(record=True): + self.panel['ItemD'] = 'foo' + xs = self.panel.major_xs(self.panel.major_axis[0]) + assert xs['ItemA'].dtype == np.float64 + assert xs['ItemD'].dtype == np.object_ def test_minor_xs(self): - ref = self.panel['ItemA'] + with catch_warnings(record=True): + ref = self.panel['ItemA'] - idx = self.panel.minor_axis[1] - xs = self.panel.minor_xs(idx) + idx = self.panel.minor_axis[1] + xs = self.panel.minor_xs(idx) - assert_series_equal(xs['ItemA'], ref[idx], check_names=False) + assert_series_equal(xs['ItemA'], ref[idx], check_names=False) - # not contained - self.assertRaises(Exception, self.panel.minor_xs, 'E') + # not contained + pytest.raises(Exception, self.panel.minor_xs, 'E') def test_minor_xs_mixed(self): - self.panel['ItemD'] = 'foo' + with catch_warnings(record=True): + self.panel['ItemD'] = 'foo' - xs = self.panel.minor_xs('D') - self.assertEqual(xs['ItemA'].dtype, np.float64) - self.assertEqual(xs['ItemD'].dtype, np.object_) + xs = self.panel.minor_xs('D') + assert xs['ItemA'].dtype == np.float64 + assert xs['ItemD'].dtype == np.object_ def test_xs(self): - itemA = self.panel.xs('ItemA', axis=0) - expected = self.panel['ItemA'] - assert_frame_equal(itemA, expected) + with catch_warnings(record=True): + itemA = self.panel.xs('ItemA', axis=0) + expected = self.panel['ItemA'] + tm.assert_frame_equal(itemA, expected) - # get a view by default - itemA_view = self.panel.xs('ItemA', axis=0) - itemA_view.values[:] = np.nan - self.assertTrue(np.isnan(self.panel['ItemA'].values).all()) + # Get a view by default. + itemA_view = self.panel.xs('ItemA', axis=0) + itemA_view.values[:] = np.nan - # mixed-type yields a copy - self.panel['strings'] = 'foo' - result = self.panel.xs('D', axis=2) - self.assertIsNotNone(result.is_copy) + assert np.isnan(self.panel['ItemA'].values).all() + + # Mixed-type yields a copy. 
+ self.panel['strings'] = 'foo' + result = self.panel.xs('D', axis=2) + assert result.is_copy is not None def test_getitem_fancy_labels(self): - p = self.panel + with catch_warnings(record=True): + p = self.panel - items = p.items[[1, 0]] - dates = p.major_axis[::2] - cols = ['D', 'C', 'F'] + items = p.items[[1, 0]] + dates = p.major_axis[::2] + cols = ['D', 'C', 'F'] - # all 3 specified - assert_panel_equal(p.ix[items, dates, cols], - p.reindex(items=items, major=dates, minor=cols)) + # all 3 specified + assert_panel_equal(p.loc[items, dates, cols], + p.reindex(items=items, major=dates, minor=cols)) - # 2 specified - assert_panel_equal(p.ix[:, dates, cols], - p.reindex(major=dates, minor=cols)) + # 2 specified + assert_panel_equal(p.loc[:, dates, cols], + p.reindex(major=dates, minor=cols)) - assert_panel_equal(p.ix[items, :, cols], - p.reindex(items=items, minor=cols)) + assert_panel_equal(p.loc[items, :, cols], + p.reindex(items=items, minor=cols)) - assert_panel_equal(p.ix[items, dates, :], - p.reindex(items=items, major=dates)) + assert_panel_equal(p.loc[items, dates, :], + p.reindex(items=items, major=dates)) - # only 1 - assert_panel_equal(p.ix[items, :, :], p.reindex(items=items)) + # only 1 + assert_panel_equal(p.loc[items, :, :], p.reindex(items=items)) - assert_panel_equal(p.ix[:, dates, :], p.reindex(major=dates)) + assert_panel_equal(p.loc[:, dates, :], p.reindex(major=dates)) - assert_panel_equal(p.ix[:, :, cols], p.reindex(minor=cols)) + assert_panel_equal(p.loc[:, :, cols], p.reindex(minor=cols)) def test_getitem_fancy_slice(self): pass @@ -618,8 +645,8 @@ def test_getitem_fancy_ints(self): p = self.panel # #1603 - result = p.ix[:, -1, :] - expected = p.ix[:, p.major_axis[-1], :] + result = p.iloc[:, -1, :] + expected = p.loc[:, p.major_axis[-1], :] assert_frame_equal(result, expected) def test_getitem_fancy_xs(self): @@ -631,205 +658,213 @@ def test_getitem_fancy_xs(self): # get DataFrame # item - assert_frame_equal(p.ix[item], p[item]) - assert_frame_equal(p.ix[item, :], p[item]) - assert_frame_equal(p.ix[item, :, :], p[item]) + assert_frame_equal(p.loc[item], p[item]) + assert_frame_equal(p.loc[item, :], p[item]) + assert_frame_equal(p.loc[item, :, :], p[item]) # major axis, axis=1 - assert_frame_equal(p.ix[:, date], p.major_xs(date)) - assert_frame_equal(p.ix[:, date, :], p.major_xs(date)) + assert_frame_equal(p.loc[:, date], p.major_xs(date)) + assert_frame_equal(p.loc[:, date, :], p.major_xs(date)) # minor axis, axis=2 - assert_frame_equal(p.ix[:, :, 'C'], p.minor_xs('C')) + assert_frame_equal(p.loc[:, :, 'C'], p.minor_xs('C')) # get Series - assert_series_equal(p.ix[item, date], p[item].ix[date]) - assert_series_equal(p.ix[item, date, :], p[item].ix[date]) - assert_series_equal(p.ix[item, :, col], p[item][col]) - assert_series_equal(p.ix[:, date, col], p.major_xs(date).ix[col]) + assert_series_equal(p.loc[item, date], p[item].loc[date]) + assert_series_equal(p.loc[item, date, :], p[item].loc[date]) + assert_series_equal(p.loc[item, :, col], p[item][col]) + assert_series_equal(p.loc[:, date, col], p.major_xs(date).loc[col]) def test_getitem_fancy_xs_check_view(self): - item = 'ItemB' - date = self.panel.major_axis[5] - - # make sure it's always a view - NS = slice(None, None) - - # DataFrames - comp = assert_frame_equal - self._check_view(item, comp) - self._check_view((item, NS), comp) - self._check_view((item, NS, NS), comp) - self._check_view((NS, date), comp) - self._check_view((NS, date, NS), comp) - self._check_view((NS, NS, 'C'), comp) - - # Series - comp = 
assert_series_equal
-        self._check_view((item, date), comp)
-        self._check_view((item, date, NS), comp)
-        self._check_view((item, NS, 'C'), comp)
-        self._check_view((NS, date, 'C'), comp)
+        with catch_warnings(record=True):
+            item = 'ItemB'
+            date = self.panel.major_axis[5]
+
+            # make sure it's always a view
+            NS = slice(None, None)
+
+            # DataFrames
+            comp = assert_frame_equal
+            self._check_view(item, comp)
+            self._check_view((item, NS), comp)
+            self._check_view((item, NS, NS), comp)
+            self._check_view((NS, date), comp)
+            self._check_view((NS, date, NS), comp)
+            self._check_view((NS, NS, 'C'), comp)
+
+            # Series
+            comp = assert_series_equal
+            self._check_view((item, date), comp)
+            self._check_view((item, date, NS), comp)
+            self._check_view((item, NS, 'C'), comp)
+            self._check_view((NS, date, 'C'), comp)

     def test_getitem_callable(self):
-        p = self.panel
-        # GH 12533
+        with catch_warnings(record=True):
+            p = self.panel
+            # GH 12533

-        assert_frame_equal(p[lambda x: 'ItemB'], p.loc['ItemB'])
-        assert_panel_equal(p[lambda x: ['ItemB', 'ItemC']],
-                           p.loc[['ItemB', 'ItemC']])
+            assert_frame_equal(p[lambda x: 'ItemB'], p.loc['ItemB'])
+            assert_panel_equal(p[lambda x: ['ItemB', 'ItemC']],
+                               p.loc[['ItemB', 'ItemC']])

     def test_ix_setitem_slice_dataframe(self):
-        a = Panel(items=[1, 2, 3], major_axis=[11, 22, 33],
-                  minor_axis=[111, 222, 333])
-        b = DataFrame(np.random.randn(2, 3), index=[111, 333],
-                      columns=[1, 2, 3])
+        with catch_warnings(record=True):
+            a = Panel(items=[1, 2, 3], major_axis=[11, 22, 33],
+                      minor_axis=[111, 222, 333])
+            b = DataFrame(np.random.randn(2, 3), index=[111, 333],
+                          columns=[1, 2, 3])

-        a.ix[:, 22, [111, 333]] = b
+            a.loc[:, 22, [111, 333]] = b

-        assert_frame_equal(a.ix[:, 22, [111, 333]], b)
+            assert_frame_equal(a.loc[:, 22, [111, 333]], b)

     def test_ix_align(self):
-        from pandas import Series
-        b = Series(np.random.randn(10), name=0)
-        b.sort()
-        df_orig = Panel(np.random.randn(3, 10, 2))
-        df = df_orig.copy()
+        with catch_warnings(record=True):
+            from pandas import Series
+            b = Series(np.random.randn(10), name=0)
+            # sort_values() returns a copy (unlike the in-place sort()),
+            # so the result must be reassigned
+            b = b.sort_values()
+            df_orig = Panel(np.random.randn(3, 10, 2))
+            df = df_orig.copy()

-        df.ix[0, :, 0] = b
-        assert_series_equal(df.ix[0, :, 0].reindex(b.index), b)
+            df.loc[0, :, 0] = b
+            assert_series_equal(df.loc[0, :, 0].reindex(b.index), b)

-        df = df_orig.swapaxes(0, 1)
-        df.ix[:, 0, 0] = b
-        assert_series_equal(df.ix[:, 0, 0].reindex(b.index), b)
+            df = df_orig.swapaxes(0, 1)
+            df.loc[:, 0, 0] = b
+            assert_series_equal(df.loc[:, 0, 0].reindex(b.index), b)

-        df = df_orig.swapaxes(1, 2)
-        df.ix[0, 0, :] = b
-        assert_series_equal(df.ix[0, 0, :].reindex(b.index), b)
+            df = df_orig.swapaxes(1, 2)
+            df.loc[0, 0, :] = b
+            assert_series_equal(df.loc[0, 0, :].reindex(b.index), b)

     def test_ix_frame_align(self):
-        p_orig = tm.makePanel()
-        df = p_orig.ix[0].copy()
-        assert_frame_equal(p_orig['ItemA'], df)
-
-        p = p_orig.copy()
-        p.ix[0, :, :] = df
-        assert_panel_equal(p, p_orig)
-
-        p = p_orig.copy()
-        p.ix[0] = df
-        assert_panel_equal(p, p_orig)
-
-        p = p_orig.copy()
-        p.iloc[0, :, :] = df
-        assert_panel_equal(p, p_orig)
-
-        p = p_orig.copy()
-        p.iloc[0] = df
-        assert_panel_equal(p, p_orig)
-
-        p = p_orig.copy()
-        p.loc['ItemA'] = df
-        assert_panel_equal(p, p_orig)
-
-        p = p_orig.copy()
-        p.loc['ItemA', :, :] = df
-        assert_panel_equal(p, p_orig)
-
-        p = p_orig.copy()
-        p['ItemA'] = df
-        assert_panel_equal(p, p_orig)
-
-        p = p_orig.copy()
-        p.ix[0, [0, 1, 3, 5], -2:] = df
-        out = p.ix[0, [0, 1, 3, 5], -2:]
-        assert_frame_equal(out, df.iloc[[0, 1, 3, 5], [2, 3]])
-
-        # GH3830, panel assignent by values/frame
-        for dtype in ['float64', 'int64']:
-
-            panel = Panel(np.arange(40).reshape((2, 4, 5)),
-                          items=['a1', 'a2'], dtype=dtype)
-            df1 = panel.iloc[0]
-            df2 = panel.iloc[1]
-
-            tm.assert_frame_equal(panel.loc['a1'], df1)
-            tm.assert_frame_equal(panel.loc['a2'], df2)
-
-            # Assignment by Value Passes for 'a2'
-            panel.loc['a2'] = df1.values
-            tm.assert_frame_equal(panel.loc['a1'], df1)
-            tm.assert_frame_equal(panel.loc['a2'], df1)
-
-            # Assignment by DataFrame Ok w/o loc 'a2'
-            panel['a2'] = df2
-            tm.assert_frame_equal(panel.loc['a1'], df1)
-            tm.assert_frame_equal(panel.loc['a2'], df2)
-
-            # Assignment by DataFrame Fails for 'a2'
-            panel.loc['a2'] = df2
-            tm.assert_frame_equal(panel.loc['a1'], df1)
-            tm.assert_frame_equal(panel.loc['a2'], df2)
+        with catch_warnings(record=True):
+            p_orig = tm.makePanel()
+            df = p_orig.iloc[0].copy()
+            assert_frame_equal(p_orig['ItemA'], df)
+
+            p = p_orig.copy()
+            p.iloc[0, :, :] = df
+            assert_panel_equal(p, p_orig)
+
+            p = p_orig.copy()
+            p.iloc[0] = df
+            assert_panel_equal(p, p_orig)
+
+            p = p_orig.copy()
+            p.iloc[0, :, :] = df
+            assert_panel_equal(p, p_orig)
+
+            p = p_orig.copy()
+            p.iloc[0] = df
+            assert_panel_equal(p, p_orig)
+
+            p = p_orig.copy()
+            p.loc['ItemA'] = df
+            assert_panel_equal(p, p_orig)
+
+            p = p_orig.copy()
+            p.loc['ItemA', :, :] = df
+            assert_panel_equal(p, p_orig)
+
+            p = p_orig.copy()
+            p['ItemA'] = df
+            assert_panel_equal(p, p_orig)
+
+            p = p_orig.copy()
+            p.iloc[0, [0, 1, 3, 5], -2:] = df
+            out = p.iloc[0, [0, 1, 3, 5], -2:]
+            assert_frame_equal(out, df.iloc[[0, 1, 3, 5], [2, 3]])
+
+            # GH3830, panel assignment by values/frame
+            for dtype in ['float64', 'int64']:
+
+                panel = Panel(np.arange(40).reshape((2, 4, 5)),
+                              items=['a1', 'a2'], dtype=dtype)
+                df1 = panel.iloc[0]
+                df2 = panel.iloc[1]
+
+                tm.assert_frame_equal(panel.loc['a1'], df1)
+                tm.assert_frame_equal(panel.loc['a2'], df2)
+
+                # Assignment by Value Passes for 'a2'
+                panel.loc['a2'] = df1.values
+                tm.assert_frame_equal(panel.loc['a1'], df1)
+                tm.assert_frame_equal(panel.loc['a2'], df1)
+
+                # Assignment by DataFrame Ok w/o loc 'a2'
+                panel['a2'] = df2
+                tm.assert_frame_equal(panel.loc['a1'], df1)
+                tm.assert_frame_equal(panel.loc['a2'], df2)
+
+                # Assignment by DataFrame Fails for 'a2'
+                panel.loc['a2'] = df2
+                tm.assert_frame_equal(panel.loc['a1'], df1)
+                tm.assert_frame_equal(panel.loc['a2'], df2)

     def _check_view(self, indexer, comp):
         cp = self.panel.copy()
-        obj = cp.ix[indexer]
+        obj = cp.loc[indexer]
         obj.values[:] = 0
-        self.assertTrue((obj.values == 0).all())
-        comp(cp.ix[indexer].reindex_like(obj), obj)
+        assert (obj.values == 0).all()
+        comp(cp.loc[indexer].reindex_like(obj), obj)

     def test_logical_with_nas(self):
-        d = Panel({'ItemA': {'a': [np.nan, False]},
-                   'ItemB': {'a': [True, True]}})
+        with catch_warnings(record=True):
+            d = Panel({'ItemA': {'a': [np.nan, False]},
+                       'ItemB': {'a': [True, True]}})

-        result = d['ItemA'] | d['ItemB']
-        expected = DataFrame({'a': [np.nan, True]})
-        assert_frame_equal(result, expected)
+            result = d['ItemA'] | d['ItemB']
+            expected = DataFrame({'a': [np.nan, True]})
+            assert_frame_equal(result, expected)

-        # this is autodowncasted here
-        result = d['ItemA'].fillna(False) | d['ItemB']
-        expected = DataFrame({'a': [True, True]})
-        assert_frame_equal(result, expected)
+            # this is autodowncasted here
+            result = d['ItemA'].fillna(False) | d['ItemB']
+            expected = DataFrame({'a': [True, True]})
+            assert_frame_equal(result, expected)

     def test_neg(self):
-        # what to do?
- assert_panel_equal(-self.panel, -1 * self.panel) + with catch_warnings(record=True): + assert_panel_equal(-self.panel, -1 * self.panel) def test_invert(self): - assert_panel_equal(-(self.panel < 0), ~(self.panel < 0)) + with catch_warnings(record=True): + assert_panel_equal(-(self.panel < 0), ~(self.panel < 0)) def test_comparisons(self): - p1 = tm.makePanel() - p2 = tm.makePanel() + with catch_warnings(record=True): + p1 = tm.makePanel() + p2 = tm.makePanel() - tp = p1.reindex(items=p1.items + ['foo']) - df = p1[p1.items[0]] + tp = p1.reindex(items=p1.items + ['foo']) + df = p1[p1.items[0]] - def test_comp(func): + def test_comp(func): - # versus same index - result = func(p1, p2) - self.assert_numpy_array_equal(result.values, - func(p1.values, p2.values)) + # versus same index + result = func(p1, p2) + tm.assert_numpy_array_equal(result.values, + func(p1.values, p2.values)) - # versus non-indexed same objs - self.assertRaises(Exception, func, p1, tp) + # versus non-indexed same objs + pytest.raises(Exception, func, p1, tp) - # versus different objs - self.assertRaises(Exception, func, p1, df) + # versus different objs + pytest.raises(Exception, func, p1, df) - # versus scalar - result3 = func(self.panel, 0) - self.assert_numpy_array_equal(result3.values, - func(self.panel.values, 0)) + # versus scalar + result3 = func(self.panel, 0) + tm.assert_numpy_array_equal(result3.values, + func(self.panel.values, 0)) - with np.errstate(invalid='ignore'): - test_comp(operator.eq) - test_comp(operator.ne) - test_comp(operator.lt) - test_comp(operator.gt) - test_comp(operator.ge) - test_comp(operator.le) + with np.errstate(invalid='ignore'): + test_comp(operator.eq) + test_comp(operator.ne) + test_comp(operator.lt) + test_comp(operator.gt) + test_comp(operator.ge) + test_comp(operator.le) def test_get_value(self): for item in self.panel.items: @@ -838,330 +873,356 @@ def test_get_value(self): result = self.panel.get_value(item, mjr, mnr) expected = self.panel[item][mnr][mjr] assert_almost_equal(result, expected) - with tm.assertRaisesRegexp(TypeError, - "There must be an argument for each axis"): + with tm.assert_raises_regex(TypeError, + "There must be an argument " + "for each axis"): self.panel.get_value('a') def test_set_value(self): - for item in self.panel.items: - for mjr in self.panel.major_axis[::2]: - for mnr in self.panel.minor_axis: - self.panel.set_value(item, mjr, mnr, 1.) - assert_almost_equal(self.panel[item][mnr][mjr], 1.) + with catch_warnings(record=True): + for item in self.panel.items: + for mjr in self.panel.major_axis[::2]: + for mnr in self.panel.minor_axis: + self.panel.set_value(item, mjr, mnr, 1.) + tm.assert_almost_equal(self.panel[item][mnr][mjr], 1.) 
-        # resize
-        res = self.panel.set_value('ItemE', 'foo', 'bar', 1.5)
-        tm.assertIsInstance(res, Panel)
-        self.assertIsNot(res, self.panel)
-        self.assertEqual(res.get_value('ItemE', 'foo', 'bar'), 1.5)
+            # resize
+            res = self.panel.set_value('ItemE', 'foo', 'bar', 1.5)
+            assert isinstance(res, Panel)
+            assert res is not self.panel
+            assert res.get_value('ItemE', 'foo', 'bar') == 1.5

-        res3 = self.panel.set_value('ItemE', 'foobar', 'baz', 5)
-        self.assertTrue(is_float_dtype(res3['ItemE'].values))
-        with tm.assertRaisesRegexp(TypeError,
-                                   "There must be an argument for each axis"
-                                   " plus the value provided"):
-            self.panel.set_value('a')
+            res3 = self.panel.set_value('ItemE', 'foobar', 'baz', 5)
+            assert is_float_dtype(res3['ItemE'].values)
+            msg = ("There must be an argument for each "
+                   "axis plus the value provided")
+            with tm.assert_raises_regex(TypeError, msg):
+                self.panel.set_value('a')

-_panel = tm.makePanel()
-tm.add_nans(_panel)
-
-class TestPanel(tm.TestCase, PanelTests, CheckIndexing, SafeForLongAndSparse,
+class TestPanel(PanelTests, CheckIndexing, SafeForLongAndSparse,
                 SafeForSparse):
-    _multiprocess_can_split_ = True

     @classmethod
     def assert_panel_equal(cls, x, y):
         assert_panel_equal(x, y)

-    def setUp(self):
-        self.panel = _panel.copy()
+    def setup_method(self, method):
+        self.panel = make_test_panel()
         self.panel.major_axis.name = None
         self.panel.minor_axis.name = None
         self.panel.items.name = None

     def test_constructor(self):
-        # with BlockManager
-        wp = Panel(self.panel._data)
-        self.assertIs(wp._data, self.panel._data)
-
-        wp = Panel(self.panel._data, copy=True)
-        self.assertIsNot(wp._data, self.panel._data)
-        assert_panel_equal(wp, self.panel)
-
-        # strings handled prop
-        wp = Panel([[['foo', 'foo', 'foo', ], ['foo', 'foo', 'foo']]])
-        self.assertEqual(wp.values.dtype, np.object_)
-
-        vals = self.panel.values
-
-        # no copy
-        wp = Panel(vals)
-        self.assertIs(wp.values, vals)
-
-        # copy
-        wp = Panel(vals, copy=True)
-        self.assertIsNot(wp.values, vals)
-
-        # GH #8285, test when scalar data is used to construct a Panel
-        # if dtype is not passed, it should be inferred
-        value_and_dtype = [(1, 'int64'), (3.14, 'float64'),
-                           ('foo', np.object_)]
-        for (val, dtype) in value_and_dtype:
-            wp = Panel(val, items=range(2), major_axis=range(3),
-                       minor_axis=range(4))
-            vals = np.empty((2, 3, 4), dtype=dtype)
-            vals.fill(val)
-            assert_panel_equal(wp, Panel(vals, dtype=dtype))
-
-        # test the case when dtype is passed
-        wp = Panel(1, items=range(2), major_axis=range(3), minor_axis=range(4),
-                   dtype='float32')
-        vals = np.empty((2, 3, 4), dtype='float32')
-        vals.fill(1)
-        assert_panel_equal(wp, Panel(vals, dtype='float32'))
+        with catch_warnings(record=True):
+            # with BlockManager
+            wp = Panel(self.panel._data)
+            assert wp._data is self.panel._data
+
+            wp = Panel(self.panel._data, copy=True)
+            assert wp._data is not self.panel._data
+            tm.assert_panel_equal(wp, self.panel)
+
+            # strings handled properly
+            wp = Panel([[['foo', 'foo', 'foo', ], ['foo', 'foo', 'foo']]])
+            assert wp.values.dtype == np.object_
+
+            vals = self.panel.values
+
+            # no copy
+            wp = Panel(vals)
+            assert wp.values is vals
+
+            # copy
+            wp = Panel(vals, copy=True)
+            assert wp.values is not vals
+
+            # GH #8285, test when scalar data is used to construct a Panel
+            # if dtype is not passed, it should be inferred
+            value_and_dtype = [(1, 'int64'), (3.14, 'float64'),
+                               ('foo', np.object_)]
+            for (val, dtype) in value_and_dtype:
+                wp = Panel(val, items=range(2), major_axis=range(3),
+                           minor_axis=range(4))
+                vals = np.empty((2, 3, 4),
dtype=dtype) + vals.fill(val) + + tm.assert_panel_equal(wp, Panel(vals, dtype=dtype)) + + # test the case when dtype is passed + wp = Panel(1, items=range(2), major_axis=range(3), + minor_axis=range(4), + dtype='float32') + vals = np.empty((2, 3, 4), dtype='float32') + vals.fill(1) + + tm.assert_panel_equal(wp, Panel(vals, dtype='float32')) def test_constructor_cast(self): - zero_filled = self.panel.fillna(0) + with catch_warnings(record=True): + zero_filled = self.panel.fillna(0) - casted = Panel(zero_filled._data, dtype=int) - casted2 = Panel(zero_filled.values, dtype=int) + casted = Panel(zero_filled._data, dtype=int) + casted2 = Panel(zero_filled.values, dtype=int) - exp_values = zero_filled.values.astype(int) - assert_almost_equal(casted.values, exp_values) - assert_almost_equal(casted2.values, exp_values) + exp_values = zero_filled.values.astype(int) + assert_almost_equal(casted.values, exp_values) + assert_almost_equal(casted2.values, exp_values) - casted = Panel(zero_filled._data, dtype=np.int32) - casted2 = Panel(zero_filled.values, dtype=np.int32) + casted = Panel(zero_filled._data, dtype=np.int32) + casted2 = Panel(zero_filled.values, dtype=np.int32) - exp_values = zero_filled.values.astype(np.int32) - assert_almost_equal(casted.values, exp_values) - assert_almost_equal(casted2.values, exp_values) + exp_values = zero_filled.values.astype(np.int32) + assert_almost_equal(casted.values, exp_values) + assert_almost_equal(casted2.values, exp_values) - # can't cast - data = [[['foo', 'bar', 'baz']]] - self.assertRaises(ValueError, Panel, data, dtype=float) + # can't cast + data = [[['foo', 'bar', 'baz']]] + pytest.raises(ValueError, Panel, data, dtype=float) def test_constructor_empty_panel(self): - empty = Panel() - self.assertEqual(len(empty.items), 0) - self.assertEqual(len(empty.major_axis), 0) - self.assertEqual(len(empty.minor_axis), 0) + with catch_warnings(record=True): + empty = Panel() + assert len(empty.items) == 0 + assert len(empty.major_axis) == 0 + assert len(empty.minor_axis) == 0 def test_constructor_observe_dtype(self): - # GH #411 - panel = Panel(items=lrange(3), major_axis=lrange(3), - minor_axis=lrange(3), dtype='O') - self.assertEqual(panel.values.dtype, np.object_) + with catch_warnings(record=True): + # GH #411 + panel = Panel(items=lrange(3), major_axis=lrange(3), + minor_axis=lrange(3), dtype='O') + assert panel.values.dtype == np.object_ def test_constructor_dtypes(self): - # GH #797 - - def _check_dtype(panel, dtype): - for i in panel.items: - self.assertEqual(panel[i].values.dtype.name, dtype) - - # only nan holding types allowed here - for dtype in ['float64', 'float32', 'object']: - panel = Panel(items=lrange(2), major_axis=lrange(10), - minor_axis=lrange(5), dtype=dtype) - _check_dtype(panel, dtype) - - for dtype in ['float64', 'float32', 'int64', 'int32', 'object']: - panel = Panel(np.array(np.random.randn(2, 10, 5), dtype=dtype), - items=lrange(2), - major_axis=lrange(10), - minor_axis=lrange(5), dtype=dtype) - _check_dtype(panel, dtype) - - for dtype in ['float64', 'float32', 'int64', 'int32', 'object']: - panel = Panel(np.array(np.random.randn(2, 10, 5), dtype='O'), - items=lrange(2), - major_axis=lrange(10), - minor_axis=lrange(5), dtype=dtype) - _check_dtype(panel, dtype) - - for dtype in ['float64', 'float32', 'int64', 'int32', 'object']: - panel = Panel(np.random.randn(2, 10, 5), items=lrange( - 2), major_axis=lrange(10), minor_axis=lrange(5), dtype=dtype) - _check_dtype(panel, dtype) - - for dtype in ['float64', 'float32', 'int64', 'int32', 
'object']: - df1 = DataFrame(np.random.randn(2, 5), - index=lrange(2), columns=lrange(5)) - df2 = DataFrame(np.random.randn(2, 5), - index=lrange(2), columns=lrange(5)) - panel = Panel.from_dict({'a': df1, 'b': df2}, dtype=dtype) - _check_dtype(panel, dtype) + with catch_warnings(record=True): + # GH #797 + + def _check_dtype(panel, dtype): + for i in panel.items: + assert panel[i].values.dtype.name == dtype + + # only nan holding types allowed here + for dtype in ['float64', 'float32', 'object']: + panel = Panel(items=lrange(2), major_axis=lrange(10), + minor_axis=lrange(5), dtype=dtype) + _check_dtype(panel, dtype) + + for dtype in ['float64', 'float32', 'int64', 'int32', 'object']: + panel = Panel(np.array(np.random.randn(2, 10, 5), dtype=dtype), + items=lrange(2), + major_axis=lrange(10), + minor_axis=lrange(5), dtype=dtype) + _check_dtype(panel, dtype) + + for dtype in ['float64', 'float32', 'int64', 'int32', 'object']: + panel = Panel(np.array(np.random.randn(2, 10, 5), dtype='O'), + items=lrange(2), + major_axis=lrange(10), + minor_axis=lrange(5), dtype=dtype) + _check_dtype(panel, dtype) + + for dtype in ['float64', 'float32', 'int64', 'int32', 'object']: + panel = Panel( + np.random.randn(2, 10, 5), + items=lrange(2), major_axis=lrange(10), + minor_axis=lrange(5), + dtype=dtype) + _check_dtype(panel, dtype) + + for dtype in ['float64', 'float32', 'int64', 'int32', 'object']: + df1 = DataFrame(np.random.randn(2, 5), + index=lrange(2), columns=lrange(5)) + df2 = DataFrame(np.random.randn(2, 5), + index=lrange(2), columns=lrange(5)) + panel = Panel.from_dict({'a': df1, 'b': df2}, dtype=dtype) + _check_dtype(panel, dtype) def test_constructor_fails_with_not_3d_input(self): - with tm.assertRaisesRegexp(ValueError, - "The number of dimensions required is 3"): - Panel(np.random.randn(10, 2)) + with catch_warnings(record=True): + with tm.assert_raises_regex(ValueError, "The number of dimensions required is 3"): # noqa + Panel(np.random.randn(10, 2)) def test_consolidate(self): - self.assertTrue(self.panel._data.is_consolidated()) + with catch_warnings(record=True): + assert self.panel._data.is_consolidated() - self.panel['foo'] = 1. - self.assertFalse(self.panel._data.is_consolidated()) + self.panel['foo'] = 1. + assert not self.panel._data.is_consolidated() - panel = self.panel.consolidate() - self.assertTrue(panel._data.is_consolidated()) + panel = self.panel._consolidate() + assert panel._data.is_consolidated() def test_ctor_dict(self): - itema = self.panel['ItemA'] - itemb = self.panel['ItemB'] + with catch_warnings(record=True): + itema = self.panel['ItemA'] + itemb = self.panel['ItemB'] - d = {'A': itema, 'B': itemb[5:]} - d2 = {'A': itema._series, 'B': itemb[5:]._series} - d3 = {'A': None, - 'B': DataFrame(itemb[5:]._series), - 'C': DataFrame(itema._series)} + d = {'A': itema, 'B': itemb[5:]} + d2 = {'A': itema._series, 'B': itemb[5:]._series} + d3 = {'A': None, + 'B': DataFrame(itemb[5:]._series), + 'C': DataFrame(itema._series)} - wp = Panel.from_dict(d) - wp2 = Panel.from_dict(d2) # nested Dict + wp = Panel.from_dict(d) + wp2 = Panel.from_dict(d2) # nested Dict - # TODO: unused? - wp3 = Panel.from_dict(d3) # noqa + # TODO: unused? 
+ wp3 = Panel.from_dict(d3) # noqa - self.assert_index_equal(wp.major_axis, self.panel.major_axis) - assert_panel_equal(wp, wp2) + tm.assert_index_equal(wp.major_axis, self.panel.major_axis) + assert_panel_equal(wp, wp2) - # intersect - wp = Panel.from_dict(d, intersect=True) - self.assert_index_equal(wp.major_axis, itemb.index[5:]) + # intersect + wp = Panel.from_dict(d, intersect=True) + tm.assert_index_equal(wp.major_axis, itemb.index[5:]) - # use constructor - assert_panel_equal(Panel(d), Panel.from_dict(d)) - assert_panel_equal(Panel(d2), Panel.from_dict(d2)) - assert_panel_equal(Panel(d3), Panel.from_dict(d3)) + # use constructor + assert_panel_equal(Panel(d), Panel.from_dict(d)) + assert_panel_equal(Panel(d2), Panel.from_dict(d2)) + assert_panel_equal(Panel(d3), Panel.from_dict(d3)) - # a pathological case - d4 = {'A': None, 'B': None} + # a pathological case + d4 = {'A': None, 'B': None} - # TODO: unused? - wp4 = Panel.from_dict(d4) # noqa + # TODO: unused? + wp4 = Panel.from_dict(d4) # noqa - assert_panel_equal(Panel(d4), Panel(items=['A', 'B'])) + assert_panel_equal(Panel(d4), Panel(items=['A', 'B'])) - # cast - dcasted = dict((k, v.reindex(wp.major_axis).fillna(0)) - for k, v in compat.iteritems(d)) - result = Panel(dcasted, dtype=int) - expected = Panel(dict((k, v.astype(int)) - for k, v in compat.iteritems(dcasted))) - assert_panel_equal(result, expected) + # cast + dcasted = dict((k, v.reindex(wp.major_axis).fillna(0)) + for k, v in compat.iteritems(d)) + result = Panel(dcasted, dtype=int) + expected = Panel(dict((k, v.astype(int)) + for k, v in compat.iteritems(dcasted))) + assert_panel_equal(result, expected) - result = Panel(dcasted, dtype=np.int32) - expected = Panel(dict((k, v.astype(np.int32)) - for k, v in compat.iteritems(dcasted))) - assert_panel_equal(result, expected) + result = Panel(dcasted, dtype=np.int32) + expected = Panel(dict((k, v.astype(np.int32)) + for k, v in compat.iteritems(dcasted))) + assert_panel_equal(result, expected) def test_constructor_dict_mixed(self): - data = dict((k, v.values) for k, v in self.panel.iteritems()) - result = Panel(data) - exp_major = Index(np.arange(len(self.panel.major_axis))) - self.assert_index_equal(result.major_axis, exp_major) + with catch_warnings(record=True): + data = dict((k, v.values) for k, v in self.panel.iteritems()) + result = Panel(data) + exp_major = Index(np.arange(len(self.panel.major_axis))) + tm.assert_index_equal(result.major_axis, exp_major) - result = Panel(data, items=self.panel.items, - major_axis=self.panel.major_axis, - minor_axis=self.panel.minor_axis) - assert_panel_equal(result, self.panel) + result = Panel(data, items=self.panel.items, + major_axis=self.panel.major_axis, + minor_axis=self.panel.minor_axis) + assert_panel_equal(result, self.panel) - data['ItemC'] = self.panel['ItemC'] - result = Panel(data) - assert_panel_equal(result, self.panel) + data['ItemC'] = self.panel['ItemC'] + result = Panel(data) + assert_panel_equal(result, self.panel) - # corner, blow up - data['ItemB'] = data['ItemB'][:-1] - self.assertRaises(Exception, Panel, data) + # corner, blow up + data['ItemB'] = data['ItemB'][:-1] + pytest.raises(Exception, Panel, data) - data['ItemB'] = self.panel['ItemB'].values[:, :-1] - self.assertRaises(Exception, Panel, data) + data['ItemB'] = self.panel['ItemB'].values[:, :-1] + pytest.raises(Exception, Panel, data) def test_ctor_orderedDict(self): - keys = list(set(np.random.randint(0, 5000, 100)))[ - :50] # unique random int keys - d = OrderedDict([(k, mkdf(10, 5)) for k in keys]) 
- p = Panel(d) - self.assertTrue(list(p.items) == keys) + with catch_warnings(record=True): + keys = list(set(np.random.randint(0, 5000, 100)))[ + :50] # unique random int keys + d = OrderedDict([(k, mkdf(10, 5)) for k in keys]) + p = Panel(d) + assert list(p.items) == keys - p = Panel.from_dict(d) - self.assertTrue(list(p.items) == keys) + p = Panel.from_dict(d) + assert list(p.items) == keys def test_constructor_resize(self): - data = self.panel._data - items = self.panel.items[:-1] - major = self.panel.major_axis[:-1] - minor = self.panel.minor_axis[:-1] - - result = Panel(data, items=items, major_axis=major, minor_axis=minor) - expected = self.panel.reindex(items=items, major=major, minor=minor) - assert_panel_equal(result, expected) + with catch_warnings(record=True): + data = self.panel._data + items = self.panel.items[:-1] + major = self.panel.major_axis[:-1] + minor = self.panel.minor_axis[:-1] + + result = Panel(data, items=items, + major_axis=major, minor_axis=minor) + expected = self.panel.reindex( + items=items, major=major, minor=minor) + assert_panel_equal(result, expected) - result = Panel(data, items=items, major_axis=major) - expected = self.panel.reindex(items=items, major=major) - assert_panel_equal(result, expected) + result = Panel(data, items=items, major_axis=major) + expected = self.panel.reindex(items=items, major=major) + assert_panel_equal(result, expected) - result = Panel(data, items=items) - expected = self.panel.reindex(items=items) - assert_panel_equal(result, expected) + result = Panel(data, items=items) + expected = self.panel.reindex(items=items) + assert_panel_equal(result, expected) - result = Panel(data, minor_axis=minor) - expected = self.panel.reindex(minor=minor) - assert_panel_equal(result, expected) + result = Panel(data, minor_axis=minor) + expected = self.panel.reindex(minor=minor) + assert_panel_equal(result, expected) def test_from_dict_mixed_orient(self): - df = tm.makeDataFrame() - df['foo'] = 'bar' + with catch_warnings(record=True): + df = tm.makeDataFrame() + df['foo'] = 'bar' - data = {'k1': df, 'k2': df} + data = {'k1': df, 'k2': df} - panel = Panel.from_dict(data, orient='minor') + panel = Panel.from_dict(data, orient='minor') - self.assertEqual(panel['foo'].values.dtype, np.object_) - self.assertEqual(panel['A'].values.dtype, np.float64) + assert panel['foo'].values.dtype == np.object_ + assert panel['A'].values.dtype == np.float64 def test_constructor_error_msgs(self): - def testit(): - Panel(np.random.randn(3, 4, 5), lrange(4), lrange(5), lrange(5)) - - assertRaisesRegexp(ValueError, - r"Shape of passed values is \(3, 4, 5\), " - r"indices imply \(4, 5, 5\)", - testit) - - def testit(): - Panel(np.random.randn(3, 4, 5), lrange(5), lrange(4), lrange(5)) - - assertRaisesRegexp(ValueError, - r"Shape of passed values is \(3, 4, 5\), " - r"indices imply \(5, 4, 5\)", - testit) - - def testit(): - Panel(np.random.randn(3, 4, 5), lrange(5), lrange(5), lrange(4)) - - assertRaisesRegexp(ValueError, - r"Shape of passed values is \(3, 4, 5\), " - r"indices imply \(5, 5, 4\)", - testit) + with catch_warnings(record=True): + def testit(): + Panel(np.random.randn(3, 4, 5), + lrange(4), lrange(5), lrange(5)) + + tm.assert_raises_regex(ValueError, + r"Shape of passed values is " + r"\(3, 4, 5\), indices imply " + r"\(4, 5, 5\)", + testit) + + def testit(): + Panel(np.random.randn(3, 4, 5), + lrange(5), lrange(4), lrange(5)) + + tm.assert_raises_regex(ValueError, + r"Shape of passed values is " + r"\(3, 4, 5\), indices imply " + r"\(5, 4, 5\)", + 
testit) + + def testit(): + Panel(np.random.randn(3, 4, 5), + lrange(5), lrange(5), lrange(4)) + + tm.assert_raises_regex(ValueError, + r"Shape of passed values is " + r"\(3, 4, 5\), indices imply " + r"\(5, 5, 4\)", + testit) def test_conform(self): - df = self.panel['ItemA'][:-5].filter(items=['A', 'B']) - conformed = self.panel.conform(df) + with catch_warnings(record=True): + df = self.panel['ItemA'][:-5].filter(items=['A', 'B']) + conformed = self.panel.conform(df) - tm.assert_index_equal(conformed.index, self.panel.major_axis) - tm.assert_index_equal(conformed.columns, self.panel.minor_axis) + tm.assert_index_equal(conformed.index, self.panel.major_axis) + tm.assert_index_equal(conformed.columns, self.panel.minor_axis) def test_convert_objects(self): + with catch_warnings(record=True): - # GH 4937 - p = Panel(dict(A=dict(a=['1', '1.0']))) - expected = Panel(dict(A=dict(a=[1, 1.0]))) - result = p._convert(numeric=True, coerce=True) - assert_panel_equal(result, expected) + # GH 4937 + p = Panel(dict(A=dict(a=['1', '1.0']))) + expected = Panel(dict(A=dict(a=[1, 1.0]))) + result = p._convert(numeric=True, coerce=True) + assert_panel_equal(result, expected) def test_dtypes(self): @@ -1170,875 +1231,940 @@ def test_dtypes(self): assert_series_equal(result, expected) def test_astype(self): - # GH7271 - data = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]) - panel = Panel(data, ['a', 'b'], ['c', 'd'], ['e', 'f']) + with catch_warnings(record=True): + # GH7271 + data = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]) + panel = Panel(data, ['a', 'b'], ['c', 'd'], ['e', 'f']) - str_data = np.array([[['1', '2'], ['3', '4']], - [['5', '6'], ['7', '8']]]) - expected = Panel(str_data, ['a', 'b'], ['c', 'd'], ['e', 'f']) - assert_panel_equal(panel.astype(str), expected) + str_data = np.array([[['1', '2'], ['3', '4']], + [['5', '6'], ['7', '8']]]) + expected = Panel(str_data, ['a', 'b'], ['c', 'd'], ['e', 'f']) + assert_panel_equal(panel.astype(str), expected) - self.assertRaises(NotImplementedError, panel.astype, {0: str}) + pytest.raises(NotImplementedError, panel.astype, {0: str}) def test_apply(self): - # GH1148 - - # ufunc - applied = self.panel.apply(np.sqrt) - with np.errstate(invalid='ignore'): - expected = np.sqrt(self.panel.values) - assert_almost_equal(applied.values, expected) - - # ufunc same shape - result = self.panel.apply(lambda x: x * 2, axis='items') - expected = self.panel * 2 - assert_panel_equal(result, expected) - result = self.panel.apply(lambda x: x * 2, axis='major_axis') - expected = self.panel * 2 - assert_panel_equal(result, expected) - result = self.panel.apply(lambda x: x * 2, axis='minor_axis') - expected = self.panel * 2 - assert_panel_equal(result, expected) - - # reduction to DataFrame - result = self.panel.apply(lambda x: x.dtype, axis='items') - expected = DataFrame(np.dtype('float64'), index=self.panel.major_axis, - columns=self.panel.minor_axis) - assert_frame_equal(result, expected) - result = self.panel.apply(lambda x: x.dtype, axis='major_axis') - expected = DataFrame(np.dtype('float64'), index=self.panel.minor_axis, - columns=self.panel.items) - assert_frame_equal(result, expected) - result = self.panel.apply(lambda x: x.dtype, axis='minor_axis') - expected = DataFrame(np.dtype('float64'), index=self.panel.major_axis, - columns=self.panel.items) - assert_frame_equal(result, expected) - - # reductions via other dims - expected = self.panel.sum(0) - result = self.panel.apply(lambda x: x.sum(), axis='items') - assert_frame_equal(result, expected) - expected = 
self.panel.sum(1) - result = self.panel.apply(lambda x: x.sum(), axis='major_axis') - assert_frame_equal(result, expected) - expected = self.panel.sum(2) - result = self.panel.apply(lambda x: x.sum(), axis='minor_axis') - assert_frame_equal(result, expected) + with catch_warnings(record=True): + # GH1148 + + # ufunc + applied = self.panel.apply(np.sqrt) + with np.errstate(invalid='ignore'): + expected = np.sqrt(self.panel.values) + assert_almost_equal(applied.values, expected) + + # ufunc same shape + result = self.panel.apply(lambda x: x * 2, axis='items') + expected = self.panel * 2 + assert_panel_equal(result, expected) + result = self.panel.apply(lambda x: x * 2, axis='major_axis') + expected = self.panel * 2 + assert_panel_equal(result, expected) + result = self.panel.apply(lambda x: x * 2, axis='minor_axis') + expected = self.panel * 2 + assert_panel_equal(result, expected) - # pass kwargs - result = self.panel.apply(lambda x, y: x.sum() + y, axis='items', y=5) - expected = self.panel.sum(0) + 5 - assert_frame_equal(result, expected) + # reduction to DataFrame + result = self.panel.apply(lambda x: x.dtype, axis='items') + expected = DataFrame(np.dtype('float64'), + index=self.panel.major_axis, + columns=self.panel.minor_axis) + assert_frame_equal(result, expected) + result = self.panel.apply(lambda x: x.dtype, axis='major_axis') + expected = DataFrame(np.dtype('float64'), + index=self.panel.minor_axis, + columns=self.panel.items) + assert_frame_equal(result, expected) + result = self.panel.apply(lambda x: x.dtype, axis='minor_axis') + expected = DataFrame(np.dtype('float64'), + index=self.panel.major_axis, + columns=self.panel.items) + assert_frame_equal(result, expected) + + # reductions via other dims + expected = self.panel.sum(0) + result = self.panel.apply(lambda x: x.sum(), axis='items') + assert_frame_equal(result, expected) + expected = self.panel.sum(1) + result = self.panel.apply(lambda x: x.sum(), axis='major_axis') + assert_frame_equal(result, expected) + expected = self.panel.sum(2) + result = self.panel.apply(lambda x: x.sum(), axis='minor_axis') + assert_frame_equal(result, expected) + + # pass kwargs + result = self.panel.apply( + lambda x, y: x.sum() + y, axis='items', y=5) + expected = self.panel.sum(0) + 5 + assert_frame_equal(result, expected) def test_apply_slabs(self): + with catch_warnings(record=True): - # same shape as original - result = self.panel.apply(lambda x: x * 2, - axis=['items', 'major_axis']) - expected = (self.panel * 2).transpose('minor_axis', 'major_axis', - 'items') - assert_panel_equal(result, expected) - result = self.panel.apply(lambda x: x * 2, - axis=['major_axis', 'items']) - assert_panel_equal(result, expected) - - result = self.panel.apply(lambda x: x * 2, - axis=['items', 'minor_axis']) - expected = (self.panel * 2).transpose('major_axis', 'minor_axis', - 'items') - assert_panel_equal(result, expected) - result = self.panel.apply(lambda x: x * 2, - axis=['minor_axis', 'items']) - assert_panel_equal(result, expected) - - result = self.panel.apply(lambda x: x * 2, - axis=['major_axis', 'minor_axis']) - expected = self.panel * 2 - assert_panel_equal(result, expected) - result = self.panel.apply(lambda x: x * 2, - axis=['minor_axis', 'major_axis']) - assert_panel_equal(result, expected) - - # reductions - result = self.panel.apply(lambda x: x.sum(0), axis=[ - 'items', 'major_axis' - ]) - expected = self.panel.sum(1).T - assert_frame_equal(result, expected) + # same shape as original + result = self.panel.apply(lambda x: x * 2, + 
axis=['items', 'major_axis']) + expected = (self.panel * 2).transpose('minor_axis', 'major_axis', + 'items') + assert_panel_equal(result, expected) + result = self.panel.apply(lambda x: x * 2, + axis=['major_axis', 'items']) + assert_panel_equal(result, expected) - result = self.panel.apply(lambda x: x.sum(1), axis=[ - 'items', 'major_axis' - ]) - expected = self.panel.sum(0) - assert_frame_equal(result, expected) + result = self.panel.apply(lambda x: x * 2, + axis=['items', 'minor_axis']) + expected = (self.panel * 2).transpose('major_axis', 'minor_axis', + 'items') + assert_panel_equal(result, expected) + result = self.panel.apply(lambda x: x * 2, + axis=['minor_axis', 'items']) + assert_panel_equal(result, expected) + + result = self.panel.apply(lambda x: x * 2, + axis=['major_axis', 'minor_axis']) + expected = self.panel * 2 + assert_panel_equal(result, expected) + result = self.panel.apply(lambda x: x * 2, + axis=['minor_axis', 'major_axis']) + assert_panel_equal(result, expected) - # transforms - f = lambda x: ((x.T - x.mean(1)) / x.std(1)).T + # reductions + result = self.panel.apply(lambda x: x.sum(0), axis=[ + 'items', 'major_axis' + ]) + expected = self.panel.sum(1).T + assert_frame_equal(result, expected) # make sure that we don't trigger any warnings - with tm.assert_produces_warning(False): + with catch_warnings(record=True): + result = self.panel.apply(lambda x: x.sum(1), axis=[ + 'items', 'major_axis' + ]) + expected = self.panel.sum(0) + assert_frame_equal(result, expected) + + # transforms + f = lambda x: ((x.T - x.mean(1)) / x.std(1)).T + + # make sure that we don't trigger any warnings result = self.panel.apply(f, axis=['items', 'major_axis']) expected = Panel(dict([(ax, f(self.panel.loc[:, :, ax])) for ax in self.panel.minor_axis])) assert_panel_equal(result, expected) - result = self.panel.apply(f, axis=['major_axis', 'minor_axis']) - expected = Panel(dict([(ax, f(self.panel.loc[ax])) - for ax in self.panel.items])) - assert_panel_equal(result, expected) - - result = self.panel.apply(f, axis=['minor_axis', 'items']) - expected = Panel(dict([(ax, f(self.panel.loc[:, ax])) - for ax in self.panel.major_axis])) - assert_panel_equal(result, expected) - - # with multi-indexes - # GH7469 - index = MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ( - 'two', 'a'), ('two', 'b')]) - dfa = DataFrame(np.array(np.arange(12, dtype='int64')).reshape( - 4, 3), columns=list("ABC"), index=index) - dfb = DataFrame(np.array(np.arange(10, 22, dtype='int64')).reshape( - 4, 3), columns=list("ABC"), index=index) - p = Panel({'f': dfa, 'g': dfb}) - result = p.apply(lambda x: x.sum(), axis=0) - - # on windows this will be in32 - result = result.astype('int64') - expected = p.sum(0) - assert_frame_equal(result, expected) + result = self.panel.apply(f, axis=['major_axis', 'minor_axis']) + expected = Panel(dict([(ax, f(self.panel.loc[ax])) + for ax in self.panel.items])) + assert_panel_equal(result, expected) + + result = self.panel.apply(f, axis=['minor_axis', 'items']) + expected = Panel(dict([(ax, f(self.panel.loc[:, ax])) + for ax in self.panel.major_axis])) + assert_panel_equal(result, expected) + + # with multi-indexes + # GH7469 + index = MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ( + 'two', 'a'), ('two', 'b')]) + dfa = DataFrame(np.array(np.arange(12, dtype='int64')).reshape( + 4, 3), columns=list("ABC"), index=index) + dfb = DataFrame(np.array(np.arange(10, 22, dtype='int64')).reshape( + 4, 3), columns=list("ABC"), index=index) + p = Panel({'f': dfa, 'g': dfb}) + result = 
p.apply(lambda x: x.sum(), axis=0)
+
+            # on Windows this will be int32
+            result = result.astype('int64')
+            expected = p.sum(0)
+            assert_frame_equal(result, expected)

     def test_apply_no_or_zero_ndim(self):
-        # GH10332
-        self.panel = Panel(np.random.rand(5, 5, 5))
+        with catch_warnings(record=True):
+            # GH10332
+            self.panel = Panel(np.random.rand(5, 5, 5))

-        result_int = self.panel.apply(lambda df: 0, axis=[1, 2])
-        result_float = self.panel.apply(lambda df: 0.0, axis=[1, 2])
-        result_int64 = self.panel.apply(lambda df: np.int64(0), axis=[1, 2])
-        result_float64 = self.panel.apply(lambda df: np.float64(0.0),
-                                          axis=[1, 2])
+            result_int = self.panel.apply(lambda df: 0, axis=[1, 2])
+            result_float = self.panel.apply(lambda df: 0.0, axis=[1, 2])
+            result_int64 = self.panel.apply(
+                lambda df: np.int64(0), axis=[1, 2])
+            result_float64 = self.panel.apply(lambda df: np.float64(0.0),
+                                              axis=[1, 2])

-        expected_int = expected_int64 = Series([0] * 5)
-        expected_float = expected_float64 = Series([0.0] * 5)
+            expected_int = expected_int64 = Series([0] * 5)
+            expected_float = expected_float64 = Series([0.0] * 5)

-        assert_series_equal(result_int, expected_int)
-        assert_series_equal(result_int64, expected_int64)
-        assert_series_equal(result_float, expected_float)
-        assert_series_equal(result_float64, expected_float64)
+            assert_series_equal(result_int, expected_int)
+            assert_series_equal(result_int64, expected_int64)
+            assert_series_equal(result_float, expected_float)
+            assert_series_equal(result_float64, expected_float64)

     def test_reindex(self):
-        ref = self.panel['ItemB']
+        with catch_warnings(record=True):
+            ref = self.panel['ItemB']

-        # items
-        result = self.panel.reindex(items=['ItemA', 'ItemB'])
-        assert_frame_equal(result['ItemB'], ref)
+            # items
+            result = self.panel.reindex(items=['ItemA', 'ItemB'])
+            assert_frame_equal(result['ItemB'], ref)

-        # major
-        new_major = list(self.panel.major_axis[:10])
-        result = self.panel.reindex(major=new_major)
-        assert_frame_equal(result['ItemB'], ref.reindex(index=new_major))
+            # major
+            new_major = list(self.panel.major_axis[:10])
+            result = self.panel.reindex(major=new_major)
+            assert_frame_equal(result['ItemB'], ref.reindex(index=new_major))

-        # raise exception put both major and major_axis
-        self.assertRaises(Exception, self.panel.reindex, major_axis=new_major,
+            # raise exception when both major and major_axis are passed
+            pytest.raises(Exception, self.panel.reindex,
+                          major_axis=new_major,
                           major=new_major)

-        # minor
-        new_minor = list(self.panel.minor_axis[:2])
-        result = self.panel.reindex(minor=new_minor)
-        assert_frame_equal(result['ItemB'], ref.reindex(columns=new_minor))
+            # minor
+            new_minor = list(self.panel.minor_axis[:2])
+            result = self.panel.reindex(minor=new_minor)
+            assert_frame_equal(result['ItemB'], ref.reindex(columns=new_minor))

-        # this ok
-        result = self.panel.reindex()
-        assert_panel_equal(result, self.panel)
-        self.assertFalse(result is self.panel)
+            # this ok
+            result = self.panel.reindex()
+            assert_panel_equal(result, self.panel)
+            assert result is not self.panel

-        # with filling
-        smaller_major = self.panel.major_axis[::5]
-        smaller = self.panel.reindex(major=smaller_major)
+            # with filling
+            smaller_major = self.panel.major_axis[::5]
+            smaller = self.panel.reindex(major=smaller_major)

-        larger = smaller.reindex(major=self.panel.major_axis, method='pad')
+            larger = smaller.reindex(major=self.panel.major_axis,
+                                     method='pad')

-        assert_frame_equal(larger.major_xs(self.panel.major_axis[1]),
-                           smaller.major_xs(smaller_major[0]))
+
assert_frame_equal(larger.major_xs(self.panel.major_axis[1]), + smaller.major_xs(smaller_major[0])) - # don't necessarily copy - result = self.panel.reindex(major=self.panel.major_axis, copy=False) - assert_panel_equal(result, self.panel) - self.assertTrue(result is self.panel) + # don't necessarily copy + result = self.panel.reindex( + major=self.panel.major_axis, copy=False) + assert_panel_equal(result, self.panel) + assert result is self.panel def test_reindex_multi(self): - - # with and without copy full reindexing - result = self.panel.reindex(items=self.panel.items, - major=self.panel.major_axis, - minor=self.panel.minor_axis, copy=False) - - self.assertIs(result.items, self.panel.items) - self.assertIs(result.major_axis, self.panel.major_axis) - self.assertIs(result.minor_axis, self.panel.minor_axis) - - result = self.panel.reindex(items=self.panel.items, - major=self.panel.major_axis, - minor=self.panel.minor_axis, copy=False) - assert_panel_equal(result, self.panel) - - # multi-axis indexing consistency - # GH 5900 - df = DataFrame(np.random.randn(4, 3)) - p = Panel({'Item1': df}) - expected = Panel({'Item1': df}) - expected['Item2'] = np.nan - - items = ['Item1', 'Item2'] - major_axis = np.arange(4) - minor_axis = np.arange(3) - - results = [] - results.append(p.reindex(items=items, major_axis=major_axis, - copy=True)) - results.append(p.reindex(items=items, major_axis=major_axis, - copy=False)) - results.append(p.reindex(items=items, minor_axis=minor_axis, - copy=True)) - results.append(p.reindex(items=items, minor_axis=minor_axis, - copy=False)) - results.append(p.reindex(items=items, major_axis=major_axis, - minor_axis=minor_axis, copy=True)) - results.append(p.reindex(items=items, major_axis=major_axis, - minor_axis=minor_axis, copy=False)) - - for i, r in enumerate(results): - assert_panel_equal(expected, r) + with catch_warnings(record=True): + + # with and without copy full reindexing + result = self.panel.reindex( + items=self.panel.items, + major=self.panel.major_axis, + minor=self.panel.minor_axis, copy=False) + + assert result.items is self.panel.items + assert result.major_axis is self.panel.major_axis + assert result.minor_axis is self.panel.minor_axis + + result = self.panel.reindex( + items=self.panel.items, + major=self.panel.major_axis, + minor=self.panel.minor_axis, copy=False) + assert_panel_equal(result, self.panel) + + # multi-axis indexing consistency + # GH 5900 + df = DataFrame(np.random.randn(4, 3)) + p = Panel({'Item1': df}) + expected = Panel({'Item1': df}) + expected['Item2'] = np.nan + + items = ['Item1', 'Item2'] + major_axis = np.arange(4) + minor_axis = np.arange(3) + + results = [] + results.append(p.reindex(items=items, major_axis=major_axis, + copy=True)) + results.append(p.reindex(items=items, major_axis=major_axis, + copy=False)) + results.append(p.reindex(items=items, minor_axis=minor_axis, + copy=True)) + results.append(p.reindex(items=items, minor_axis=minor_axis, + copy=False)) + results.append(p.reindex(items=items, major_axis=major_axis, + minor_axis=minor_axis, copy=True)) + results.append(p.reindex(items=items, major_axis=major_axis, + minor_axis=minor_axis, copy=False)) + + for i, r in enumerate(results): + assert_panel_equal(expected, r) def test_reindex_like(self): - # reindex_like - smaller = self.panel.reindex(items=self.panel.items[:-1], - major=self.panel.major_axis[:-1], - minor=self.panel.minor_axis[:-1]) - smaller_like = self.panel.reindex_like(smaller) - assert_panel_equal(smaller, smaller_like) + with 
catch_warnings(record=True):
+            # reindex_like
+            smaller = self.panel.reindex(items=self.panel.items[:-1],
+                                         major=self.panel.major_axis[:-1],
+                                         minor=self.panel.minor_axis[:-1])
+            smaller_like = self.panel.reindex_like(smaller)
+            assert_panel_equal(smaller, smaller_like)

     def test_take(self):
-        # axis == 0
-        result = self.panel.take([2, 0, 1], axis=0)
-        expected = self.panel.reindex(items=['ItemC', 'ItemA', 'ItemB'])
-        assert_panel_equal(result, expected)
+        with catch_warnings(record=True):
+            # axis == 0
+            result = self.panel.take([2, 0, 1], axis=0)
+            expected = self.panel.reindex(items=['ItemC', 'ItemA', 'ItemB'])
+            assert_panel_equal(result, expected)

-        # axis >= 1
-        result = self.panel.take([3, 0, 1, 2], axis=2)
-        expected = self.panel.reindex(minor=['D', 'A', 'B', 'C'])
-        assert_panel_equal(result, expected)
+            # axis >= 1
+            result = self.panel.take([3, 0, 1, 2], axis=2)
+            expected = self.panel.reindex(minor=['D', 'A', 'B', 'C'])
+            assert_panel_equal(result, expected)

-        # neg indicies ok
-        expected = self.panel.reindex(minor=['D', 'D', 'B', 'C'])
-        result = self.panel.take([3, -1, 1, 2], axis=2)
-        assert_panel_equal(result, expected)
+            # neg indices ok
+            expected = self.panel.reindex(minor=['D', 'D', 'B', 'C'])
+            result = self.panel.take([3, -1, 1, 2], axis=2)
+            assert_panel_equal(result, expected)

-        self.assertRaises(Exception, self.panel.take, [4, 0, 1, 2], axis=2)
+            pytest.raises(Exception, self.panel.take, [4, 0, 1, 2], axis=2)

     def test_sort_index(self):
-        import random
-
-        ritems = list(self.panel.items)
-        rmajor = list(self.panel.major_axis)
-        rminor = list(self.panel.minor_axis)
-        random.shuffle(ritems)
-        random.shuffle(rmajor)
-        random.shuffle(rminor)
-
-        random_order = self.panel.reindex(items=ritems)
-        sorted_panel = random_order.sort_index(axis=0)
-        assert_panel_equal(sorted_panel, self.panel)
-
-        # descending
-        random_order = self.panel.reindex(items=ritems)
-        sorted_panel = random_order.sort_index(axis=0, ascending=False)
-        assert_panel_equal(sorted_panel,
-                           self.panel.reindex(items=self.panel.items[::-1]))
-
-        random_order = self.panel.reindex(major=rmajor)
-        sorted_panel = random_order.sort_index(axis=1)
-        assert_panel_equal(sorted_panel, self.panel)
-
-        random_order = self.panel.reindex(minor=rminor)
-        sorted_panel = random_order.sort_index(axis=2)
-        assert_panel_equal(sorted_panel, self.panel)
+        with catch_warnings(record=True):
+            import random
+
+            ritems = list(self.panel.items)
+            rmajor = list(self.panel.major_axis)
+            rminor = list(self.panel.minor_axis)
+            random.shuffle(ritems)
+            random.shuffle(rmajor)
+            random.shuffle(rminor)
+
+            random_order = self.panel.reindex(items=ritems)
+            sorted_panel = random_order.sort_index(axis=0)
+            assert_panel_equal(sorted_panel, self.panel)
+
+            # descending
+            random_order = self.panel.reindex(items=ritems)
+            sorted_panel = random_order.sort_index(axis=0, ascending=False)
+            assert_panel_equal(
+                sorted_panel,
+                self.panel.reindex(items=self.panel.items[::-1]))
+
+            random_order = self.panel.reindex(major=rmajor)
+            sorted_panel = random_order.sort_index(axis=1)
+            assert_panel_equal(sorted_panel, self.panel)
+
+            random_order = self.panel.reindex(minor=rminor)
+            sorted_panel = random_order.sort_index(axis=2)
+            assert_panel_equal(sorted_panel, self.panel)

     def test_fillna(self):
-        filled = self.panel.fillna(0)
-        self.assertTrue(np.isfinite(filled.values).all())
-
-        filled = self.panel.fillna(method='backfill')
-        assert_frame_equal(filled['ItemA'],
-                           self.panel['ItemA'].fillna(method='backfill'))
-
-        panel = self.panel.copy()
-        panel['str'] =
'foo' - - filled = panel.fillna(method='backfill') - assert_frame_equal(filled['ItemA'], - panel['ItemA'].fillna(method='backfill')) - - empty = self.panel.reindex(items=[]) - filled = empty.fillna(0) - assert_panel_equal(filled, empty) - - self.assertRaises(ValueError, self.panel.fillna) - self.assertRaises(ValueError, self.panel.fillna, 5, method='ffill') - - self.assertRaises(TypeError, self.panel.fillna, [1, 2]) - self.assertRaises(TypeError, self.panel.fillna, (1, 2)) - - # limit not implemented when only value is specified - p = Panel(np.random.randn(3, 4, 5)) - p.iloc[0:2, 0:2, 0:2] = np.nan - self.assertRaises(NotImplementedError, lambda: p.fillna(999, limit=1)) - - # Test in place fillNA - # Expected result - expected = Panel([[[0, 1], [2, 1]], [[10, 11], [12, 11]]], - items=['a', 'b'], minor_axis=['x', 'y'], - dtype=np.float64) - # method='ffill' - p1 = Panel([[[0, 1], [2, np.nan]], [[10, 11], [12, np.nan]]], - items=['a', 'b'], minor_axis=['x', 'y'], - dtype=np.float64) - p1.fillna(method='ffill', inplace=True) - assert_panel_equal(p1, expected) - - # method='bfill' - p2 = Panel([[[0, np.nan], [2, 1]], [[10, np.nan], [12, 11]]], - items=['a', 'b'], minor_axis=['x', 'y'], dtype=np.float64) - p2.fillna(method='bfill', inplace=True) - assert_panel_equal(p2, expected) + with catch_warnings(record=True): + filled = self.panel.fillna(0) + assert np.isfinite(filled.values).all() + + filled = self.panel.fillna(method='backfill') + assert_frame_equal(filled['ItemA'], + self.panel['ItemA'].fillna(method='backfill')) + + panel = self.panel.copy() + panel['str'] = 'foo' + + filled = panel.fillna(method='backfill') + assert_frame_equal(filled['ItemA'], + panel['ItemA'].fillna(method='backfill')) + + empty = self.panel.reindex(items=[]) + filled = empty.fillna(0) + assert_panel_equal(filled, empty) + + pytest.raises(ValueError, self.panel.fillna) + pytest.raises(ValueError, self.panel.fillna, 5, method='ffill') + + pytest.raises(TypeError, self.panel.fillna, [1, 2]) + pytest.raises(TypeError, self.panel.fillna, (1, 2)) + + # limit not implemented when only value is specified + p = Panel(np.random.randn(3, 4, 5)) + p.iloc[0:2, 0:2, 0:2] = np.nan + pytest.raises(NotImplementedError, + lambda: p.fillna(999, limit=1)) + + # Test in place fillNA + # Expected result + expected = Panel([[[0, 1], [2, 1]], [[10, 11], [12, 11]]], + items=['a', 'b'], minor_axis=['x', 'y'], + dtype=np.float64) + # method='ffill' + p1 = Panel([[[0, 1], [2, np.nan]], [[10, 11], [12, np.nan]]], + items=['a', 'b'], minor_axis=['x', 'y'], + dtype=np.float64) + p1.fillna(method='ffill', inplace=True) + assert_panel_equal(p1, expected) + + # method='bfill' + p2 = Panel([[[0, np.nan], [2, 1]], [[10, np.nan], [12, 11]]], + items=['a', 'b'], minor_axis=['x', 'y'], + dtype=np.float64) + p2.fillna(method='bfill', inplace=True) + assert_panel_equal(p2, expected) def test_ffill_bfill(self): - assert_panel_equal(self.panel.ffill(), - self.panel.fillna(method='ffill')) - assert_panel_equal(self.panel.bfill(), - self.panel.fillna(method='bfill')) + with catch_warnings(record=True): + assert_panel_equal(self.panel.ffill(), + self.panel.fillna(method='ffill')) + assert_panel_equal(self.panel.bfill(), + self.panel.fillna(method='bfill')) def test_truncate_fillna_bug(self): - # #1823 - result = self.panel.truncate(before=None, after=None, axis='items') + with catch_warnings(record=True): + # #1823 + result = self.panel.truncate(before=None, after=None, axis='items') - # it works! - result.fillna(value=0.0) + # it works! 
+            result.fillna(value=0.0)

     def test_swapaxes(self):
-        result = self.panel.swapaxes('items', 'minor')
-        self.assertIs(result.items, self.panel.minor_axis)
+        with catch_warnings(record=True):
+            result = self.panel.swapaxes('items', 'minor')
+            assert result.items is self.panel.minor_axis

-        result = self.panel.swapaxes('items', 'major')
-        self.assertIs(result.items, self.panel.major_axis)
+            result = self.panel.swapaxes('items', 'major')
+            assert result.items is self.panel.major_axis

-        result = self.panel.swapaxes('major', 'minor')
-        self.assertIs(result.major_axis, self.panel.minor_axis)
+            result = self.panel.swapaxes('major', 'minor')
+            assert result.major_axis is self.panel.minor_axis

-        panel = self.panel.copy()
-        result = panel.swapaxes('major', 'minor')
-        panel.values[0, 0, 1] = np.nan
-        expected = panel.swapaxes('major', 'minor')
-        assert_panel_equal(result, expected)
+            panel = self.panel.copy()
+            result = panel.swapaxes('major', 'minor')
+            panel.values[0, 0, 1] = np.nan
+            expected = panel.swapaxes('major', 'minor')
+            assert_panel_equal(result, expected)

-        # this should also work
-        result = self.panel.swapaxes(0, 1)
-        self.assertIs(result.items, self.panel.major_axis)
+            # this should also work
+            result = self.panel.swapaxes(0, 1)
+            assert result.items is self.panel.major_axis

-        # this works, but return a copy
-        result = self.panel.swapaxes('items', 'items')
-        assert_panel_equal(self.panel, result)
-        self.assertNotEqual(id(self.panel), id(result))
+            # this works, but returns a copy
+            result = self.panel.swapaxes('items', 'items')
+            assert_panel_equal(self.panel, result)
+            assert id(self.panel) != id(result)

     def test_transpose(self):
-        result = self.panel.transpose('minor', 'major', 'items')
-        expected = self.panel.swapaxes('items', 'minor')
-        assert_panel_equal(result, expected)
+        with catch_warnings(record=True):
+            result = self.panel.transpose('minor', 'major', 'items')
+            expected = self.panel.swapaxes('items', 'minor')
+            assert_panel_equal(result, expected)

-        # test kwargs
-        result = self.panel.transpose(items='minor', major='major',
-                                      minor='items')
-        expected = self.panel.swapaxes('items', 'minor')
-        assert_panel_equal(result, expected)
+            # test kwargs
+            result = self.panel.transpose(items='minor', major='major',
+                                          minor='items')
+            expected = self.panel.swapaxes('items', 'minor')
+            assert_panel_equal(result, expected)

-        # text mixture of args
-        result = self.panel.transpose('minor', major='major', minor='items')
-        expected = self.panel.swapaxes('items', 'minor')
-        assert_panel_equal(result, expected)
+            # test mixture of args
+            result = self.panel.transpose(
+                'minor', major='major', minor='items')
+            expected = self.panel.swapaxes('items', 'minor')
+            assert_panel_equal(result, expected)

-        result = self.panel.transpose('minor', 'major', minor='items')
-        expected = self.panel.swapaxes('items', 'minor')
-        assert_panel_equal(result, expected)
+            result = self.panel.transpose('minor',
+                                          'major',
+                                          minor='items')
+            expected = self.panel.swapaxes('items', 'minor')
+            assert_panel_equal(result, expected)

-        # duplicate axes
-        with tm.assertRaisesRegexp(TypeError,
-                                   'not enough/duplicate arguments'):
-            self.panel.transpose('minor', maj='major', minor='items')
+            # duplicate axes
+            with tm.assert_raises_regex(TypeError,
+                                        'not enough/duplicate arguments'):
+                self.panel.transpose('minor', maj='major', minor='items')

-        with tm.assertRaisesRegexp(ValueError, 'repeated axis in transpose'):
-            self.panel.transpose('minor', 'major', major='minor',
-                                 minor='items')
+            with tm.assert_raises_regex(ValueError,
+
'repeated axis in transpose'): + self.panel.transpose('minor', 'major', major='minor', + minor='items') - result = self.panel.transpose(2, 1, 0) - assert_panel_equal(result, expected) + result = self.panel.transpose(2, 1, 0) + assert_panel_equal(result, expected) - result = self.panel.transpose('minor', 'items', 'major') - expected = self.panel.swapaxes('items', 'minor') - expected = expected.swapaxes('major', 'minor') - assert_panel_equal(result, expected) + result = self.panel.transpose('minor', 'items', 'major') + expected = self.panel.swapaxes('items', 'minor') + expected = expected.swapaxes('major', 'minor') + assert_panel_equal(result, expected) - result = self.panel.transpose(2, 0, 1) - assert_panel_equal(result, expected) + result = self.panel.transpose(2, 0, 1) + assert_panel_equal(result, expected) - self.assertRaises(ValueError, self.panel.transpose, 0, 0, 1) + pytest.raises(ValueError, self.panel.transpose, 0, 0, 1) def test_transpose_copy(self): - panel = self.panel.copy() - result = panel.transpose(2, 0, 1, copy=True) - expected = panel.swapaxes('items', 'minor') - expected = expected.swapaxes('major', 'minor') - assert_panel_equal(result, expected) + with catch_warnings(record=True): + panel = self.panel.copy() + result = panel.transpose(2, 0, 1, copy=True) + expected = panel.swapaxes('items', 'minor') + expected = expected.swapaxes('major', 'minor') + assert_panel_equal(result, expected) - panel.values[0, 1, 1] = np.nan - self.assertTrue(notnull(result.values[1, 0, 1])) + panel.values[0, 1, 1] = np.nan + assert notnull(result.values[1, 0, 1]) def test_to_frame(self): - # filtered - filtered = self.panel.to_frame() - expected = self.panel.to_frame().dropna(how='any') - assert_frame_equal(filtered, expected) - - # unfiltered - unfiltered = self.panel.to_frame(filter_observations=False) - assert_panel_equal(unfiltered.to_panel(), self.panel) - - # names - self.assertEqual(unfiltered.index.names, ('major', 'minor')) - - # unsorted, round trip - df = self.panel.to_frame(filter_observations=False) - unsorted = df.take(np.random.permutation(len(df))) - pan = unsorted.to_panel() - assert_panel_equal(pan, self.panel) - - # preserve original index names - df = DataFrame(np.random.randn(6, 2), - index=[['a', 'a', 'b', 'b', 'c', 'c'], - [0, 1, 0, 1, 0, 1]], - columns=['one', 'two']) - df.index.names = ['foo', 'bar'] - df.columns.name = 'baz' - - rdf = df.to_panel().to_frame() - self.assertEqual(rdf.index.names, df.index.names) - self.assertEqual(rdf.columns.names, df.columns.names) + with catch_warnings(record=True): + # filtered + filtered = self.panel.to_frame() + expected = self.panel.to_frame().dropna(how='any') + assert_frame_equal(filtered, expected) + + # unfiltered + unfiltered = self.panel.to_frame(filter_observations=False) + assert_panel_equal(unfiltered.to_panel(), self.panel) + + # names + assert unfiltered.index.names == ('major', 'minor') + + # unsorted, round trip + df = self.panel.to_frame(filter_observations=False) + unsorted = df.take(np.random.permutation(len(df))) + pan = unsorted.to_panel() + assert_panel_equal(pan, self.panel) + + # preserve original index names + df = DataFrame(np.random.randn(6, 2), + index=[['a', 'a', 'b', 'b', 'c', 'c'], + [0, 1, 0, 1, 0, 1]], + columns=['one', 'two']) + df.index.names = ['foo', 'bar'] + df.columns.name = 'baz' + + rdf = df.to_panel().to_frame() + assert rdf.index.names == df.index.names + assert rdf.columns.names == df.columns.names def test_to_frame_mixed(self): - panel = self.panel.fillna(0) - panel['str'] = 'foo' - 
panel['bool'] = panel['ItemA'] > 0 - - lp = panel.to_frame() - wp = lp.to_panel() - self.assertEqual(wp['bool'].values.dtype, np.bool_) - # Previously, this was mutating the underlying index and changing its - # name - assert_frame_equal(wp['bool'], panel['bool'], check_names=False) - - # GH 8704 - # with categorical - df = panel.to_frame() - df['category'] = df['str'].astype('category') - - # to_panel - # TODO: this converts back to object - p = df.to_panel() - expected = panel.copy() - expected['category'] = 'foo' - assert_panel_equal(p, expected) + with catch_warnings(record=True): + panel = self.panel.fillna(0) + panel['str'] = 'foo' + panel['bool'] = panel['ItemA'] > 0 + + lp = panel.to_frame() + wp = lp.to_panel() + assert wp['bool'].values.dtype == np.bool_ + # Previously, this was mutating the underlying + # index and changing its name + assert_frame_equal(wp['bool'], panel['bool'], check_names=False) + + # GH 8704 + # with categorical + df = panel.to_frame() + df['category'] = df['str'].astype('category') + + # to_panel + # TODO: this converts back to object + p = df.to_panel() + expected = panel.copy() + expected['category'] = 'foo' + assert_panel_equal(p, expected) def test_to_frame_multi_major(self): - idx = MultiIndex.from_tuples([(1, 'one'), (1, 'two'), (2, 'one'), ( - 2, 'two')]) - df = DataFrame([[1, 'a', 1], [2, 'b', 1], [3, 'c', 1], [4, 'd', 1]], - columns=['A', 'B', 'C'], index=idx) - wp = Panel({'i1': df, 'i2': df}) - expected_idx = MultiIndex.from_tuples( - [ - (1, 'one', 'A'), (1, 'one', 'B'), - (1, 'one', 'C'), (1, 'two', 'A'), - (1, 'two', 'B'), (1, 'two', 'C'), - (2, 'one', 'A'), (2, 'one', 'B'), - (2, 'one', 'C'), (2, 'two', 'A'), - (2, 'two', 'B'), (2, 'two', 'C') - ], - names=[None, None, 'minor']) - expected = DataFrame({'i1': [1, 'a', 1, 2, 'b', 1, 3, - 'c', 1, 4, 'd', 1], - 'i2': [1, 'a', 1, 2, 'b', - 1, 3, 'c', 1, 4, 'd', 1]}, - index=expected_idx) - result = wp.to_frame() - assert_frame_equal(result, expected) - - wp.iloc[0, 0].iloc[0] = np.nan # BUG on setting. 
GH #5773 - result = wp.to_frame() - assert_frame_equal(result, expected[1:]) - - idx = MultiIndex.from_tuples([(1, 'two'), (1, 'one'), (2, 'one'), ( - np.nan, 'two')]) - df = DataFrame([[1, 'a', 1], [2, 'b', 1], [3, 'c', 1], [4, 'd', 1]], - columns=['A', 'B', 'C'], index=idx) - wp = Panel({'i1': df, 'i2': df}) - ex_idx = MultiIndex.from_tuples([(1, 'two', 'A'), (1, 'two', 'B'), - (1, 'two', 'C'), - (1, 'one', 'A'), - (1, 'one', 'B'), - (1, 'one', 'C'), - (2, 'one', 'A'), - (2, 'one', 'B'), - (2, 'one', 'C'), - (np.nan, 'two', 'A'), - (np.nan, 'two', 'B'), - (np.nan, 'two', 'C')], - names=[None, None, 'minor']) - expected.index = ex_idx - result = wp.to_frame() - assert_frame_equal(result, expected) + with catch_warnings(record=True): + idx = MultiIndex.from_tuples( + [(1, 'one'), (1, 'two'), (2, 'one'), (2, 'two')]) + df = DataFrame([[1, 'a', 1], [2, 'b', 1], + [3, 'c', 1], [4, 'd', 1]], + columns=['A', 'B', 'C'], index=idx) + wp = Panel({'i1': df, 'i2': df}) + expected_idx = MultiIndex.from_tuples( + [ + (1, 'one', 'A'), (1, 'one', 'B'), + (1, 'one', 'C'), (1, 'two', 'A'), + (1, 'two', 'B'), (1, 'two', 'C'), + (2, 'one', 'A'), (2, 'one', 'B'), + (2, 'one', 'C'), (2, 'two', 'A'), + (2, 'two', 'B'), (2, 'two', 'C') + ], + names=[None, None, 'minor']) + expected = DataFrame({'i1': [1, 'a', 1, 2, 'b', 1, 3, + 'c', 1, 4, 'd', 1], + 'i2': [1, 'a', 1, 2, 'b', + 1, 3, 'c', 1, 4, 'd', 1]}, + index=expected_idx) + result = wp.to_frame() + assert_frame_equal(result, expected) + + wp.iloc[0, 0].iloc[0] = np.nan # BUG on setting. GH #5773 + result = wp.to_frame() + assert_frame_equal(result, expected[1:]) + + idx = MultiIndex.from_tuples( + [(1, 'two'), (1, 'one'), (2, 'one'), (np.nan, 'two')]) + df = DataFrame([[1, 'a', 1], [2, 'b', 1], + [3, 'c', 1], [4, 'd', 1]], + columns=['A', 'B', 'C'], index=idx) + wp = Panel({'i1': df, 'i2': df}) + ex_idx = MultiIndex.from_tuples([(1, 'two', 'A'), (1, 'two', 'B'), + (1, 'two', 'C'), + (1, 'one', 'A'), + (1, 'one', 'B'), + (1, 'one', 'C'), + (2, 'one', 'A'), + (2, 'one', 'B'), + (2, 'one', 'C'), + (np.nan, 'two', 'A'), + (np.nan, 'two', 'B'), + (np.nan, 'two', 'C')], + names=[None, None, 'minor']) + expected.index = ex_idx + result = wp.to_frame() + assert_frame_equal(result, expected) def test_to_frame_multi_major_minor(self): - cols = MultiIndex(levels=[['C_A', 'C_B'], ['C_1', 'C_2']], - labels=[[0, 0, 1, 1], [0, 1, 0, 1]]) - idx = MultiIndex.from_tuples([(1, 'one'), (1, 'two'), (2, 'one'), ( - 2, 'two'), (3, 'three'), (4, 'four')]) - df = DataFrame([[1, 2, 11, 12], [3, 4, 13, 14], - ['a', 'b', 'w', 'x'], - ['c', 'd', 'y', 'z'], [-1, -2, -3, -4], - [-5, -6, -7, -8]], columns=cols, index=idx) - wp = Panel({'i1': df, 'i2': df}) - - exp_idx = MultiIndex.from_tuples( - [(1, 'one', 'C_A', 'C_1'), (1, 'one', 'C_A', 'C_2'), - (1, 'one', 'C_B', 'C_1'), (1, 'one', 'C_B', 'C_2'), - (1, 'two', 'C_A', 'C_1'), (1, 'two', 'C_A', 'C_2'), - (1, 'two', 'C_B', 'C_1'), (1, 'two', 'C_B', 'C_2'), - (2, 'one', 'C_A', 'C_1'), (2, 'one', 'C_A', 'C_2'), - (2, 'one', 'C_B', 'C_1'), (2, 'one', 'C_B', 'C_2'), - (2, 'two', 'C_A', 'C_1'), (2, 'two', 'C_A', 'C_2'), - (2, 'two', 'C_B', 'C_1'), (2, 'two', 'C_B', 'C_2'), - (3, 'three', 'C_A', 'C_1'), (3, 'three', 'C_A', 'C_2'), - (3, 'three', 'C_B', 'C_1'), (3, 'three', 'C_B', 'C_2'), - (4, 'four', 'C_A', 'C_1'), (4, 'four', 'C_A', 'C_2'), - (4, 'four', 'C_B', 'C_1'), (4, 'four', 'C_B', 'C_2')], - names=[None, None, None, None]) - exp_val = [[1, 1], [2, 2], [11, 11], [12, 12], [3, 3], [4, 4], - [13, 13], [14, 14], ['a', 'a'], ['b', 'b'], 
['w', 'w'], - ['x', 'x'], ['c', 'c'], ['d', 'd'], ['y', 'y'], ['z', 'z'], - [-1, -1], [-2, -2], [-3, -3], [-4, -4], [-5, -5], [-6, -6], - [-7, -7], [-8, -8]] - result = wp.to_frame() - expected = DataFrame(exp_val, columns=['i1', 'i2'], index=exp_idx) - assert_frame_equal(result, expected) + with catch_warnings(record=True): + cols = MultiIndex(levels=[['C_A', 'C_B'], ['C_1', 'C_2']], + labels=[[0, 0, 1, 1], [0, 1, 0, 1]]) + idx = MultiIndex.from_tuples([(1, 'one'), (1, 'two'), (2, 'one'), ( + 2, 'two'), (3, 'three'), (4, 'four')]) + df = DataFrame([[1, 2, 11, 12], [3, 4, 13, 14], + ['a', 'b', 'w', 'x'], + ['c', 'd', 'y', 'z'], [-1, -2, -3, -4], + [-5, -6, -7, -8]], columns=cols, index=idx) + wp = Panel({'i1': df, 'i2': df}) + + exp_idx = MultiIndex.from_tuples( + [(1, 'one', 'C_A', 'C_1'), (1, 'one', 'C_A', 'C_2'), + (1, 'one', 'C_B', 'C_1'), (1, 'one', 'C_B', 'C_2'), + (1, 'two', 'C_A', 'C_1'), (1, 'two', 'C_A', 'C_2'), + (1, 'two', 'C_B', 'C_1'), (1, 'two', 'C_B', 'C_2'), + (2, 'one', 'C_A', 'C_1'), (2, 'one', 'C_A', 'C_2'), + (2, 'one', 'C_B', 'C_1'), (2, 'one', 'C_B', 'C_2'), + (2, 'two', 'C_A', 'C_1'), (2, 'two', 'C_A', 'C_2'), + (2, 'two', 'C_B', 'C_1'), (2, 'two', 'C_B', 'C_2'), + (3, 'three', 'C_A', 'C_1'), (3, 'three', 'C_A', 'C_2'), + (3, 'three', 'C_B', 'C_1'), (3, 'three', 'C_B', 'C_2'), + (4, 'four', 'C_A', 'C_1'), (4, 'four', 'C_A', 'C_2'), + (4, 'four', 'C_B', 'C_1'), (4, 'four', 'C_B', 'C_2')], + names=[None, None, None, None]) + exp_val = [[1, 1], [2, 2], [11, 11], [12, 12], + [3, 3], [4, 4], + [13, 13], [14, 14], ['a', 'a'], + ['b', 'b'], ['w', 'w'], + ['x', 'x'], ['c', 'c'], ['d', 'd'], [ + 'y', 'y'], ['z', 'z'], + [-1, -1], [-2, -2], [-3, -3], [-4, -4], + [-5, -5], [-6, -6], + [-7, -7], [-8, -8]] + result = wp.to_frame() + expected = DataFrame(exp_val, columns=['i1', 'i2'], index=exp_idx) + assert_frame_equal(result, expected) def test_to_frame_multi_drop_level(self): - idx = MultiIndex.from_tuples([(1, 'one'), (2, 'one'), (2, 'two')]) - df = DataFrame({'A': [np.nan, 1, 2]}, index=idx) - wp = Panel({'i1': df, 'i2': df}) - result = wp.to_frame() - exp_idx = MultiIndex.from_tuples([(2, 'one', 'A'), (2, 'two', 'A')], - names=[None, None, 'minor']) - expected = DataFrame({'i1': [1., 2], 'i2': [1., 2]}, index=exp_idx) - assert_frame_equal(result, expected) + with catch_warnings(record=True): + idx = MultiIndex.from_tuples([(1, 'one'), (2, 'one'), (2, 'two')]) + df = DataFrame({'A': [np.nan, 1, 2]}, index=idx) + wp = Panel({'i1': df, 'i2': df}) + result = wp.to_frame() + exp_idx = MultiIndex.from_tuples( + [(2, 'one', 'A'), (2, 'two', 'A')], + names=[None, None, 'minor']) + expected = DataFrame({'i1': [1., 2], 'i2': [1., 2]}, index=exp_idx) + assert_frame_equal(result, expected) def test_to_panel_na_handling(self): - df = DataFrame(np.random.randint(0, 10, size=20).reshape((10, 2)), - index=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1], - [0, 1, 2, 3, 4, 5, 2, 3, 4, 5]]) + with catch_warnings(record=True): + df = DataFrame(np.random.randint(0, 10, size=20).reshape((10, 2)), + index=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1], + [0, 1, 2, 3, 4, 5, 2, 3, 4, 5]]) - panel = df.to_panel() - self.assertTrue(isnull(panel[0].ix[1, [0, 1]]).all()) + panel = df.to_panel() + assert isnull(panel[0].loc[1, [0, 1]]).all() def test_to_panel_duplicates(self): # #2441 - df = DataFrame({'a': [0, 0, 1], 'b': [1, 1, 1], 'c': [1, 2, 3]}) - idf = df.set_index(['a', 'b']) - assertRaisesRegexp(ValueError, 'non-uniquely indexed', idf.to_panel) + with catch_warnings(record=True): + df = DataFrame({'a': [0, 0, 1], 'b': [1, 1, 
1], 'c': [1, 2, 3]}) + idf = df.set_index(['a', 'b']) + tm.assert_raises_regex( + ValueError, 'non-uniquely indexed', idf.to_panel) def test_panel_dups(self): + with catch_warnings(record=True): - # GH 4960 - # duplicates in an index + # GH 4960 + # duplicates in an index - # items - data = np.random.randn(5, 100, 5) - no_dup_panel = Panel(data, items=list("ABCDE")) - panel = Panel(data, items=list("AACDE")) + # items + data = np.random.randn(5, 100, 5) + no_dup_panel = Panel(data, items=list("ABCDE")) + panel = Panel(data, items=list("AACDE")) - expected = no_dup_panel['A'] - result = panel.iloc[0] - assert_frame_equal(result, expected) + expected = no_dup_panel['A'] + result = panel.iloc[0] + assert_frame_equal(result, expected) - expected = no_dup_panel['E'] - result = panel.loc['E'] - assert_frame_equal(result, expected) + expected = no_dup_panel['E'] + result = panel.loc['E'] + assert_frame_equal(result, expected) - expected = no_dup_panel.loc[['A', 'B']] - expected.items = ['A', 'A'] - result = panel.loc['A'] - assert_panel_equal(result, expected) + expected = no_dup_panel.loc[['A', 'B']] + expected.items = ['A', 'A'] + result = panel.loc['A'] + assert_panel_equal(result, expected) - # major - data = np.random.randn(5, 5, 5) - no_dup_panel = Panel(data, major_axis=list("ABCDE")) - panel = Panel(data, major_axis=list("AACDE")) + # major + data = np.random.randn(5, 5, 5) + no_dup_panel = Panel(data, major_axis=list("ABCDE")) + panel = Panel(data, major_axis=list("AACDE")) - expected = no_dup_panel.loc[:, 'A'] - result = panel.iloc[:, 0] - assert_frame_equal(result, expected) + expected = no_dup_panel.loc[:, 'A'] + result = panel.iloc[:, 0] + assert_frame_equal(result, expected) - expected = no_dup_panel.loc[:, 'E'] - result = panel.loc[:, 'E'] - assert_frame_equal(result, expected) + expected = no_dup_panel.loc[:, 'E'] + result = panel.loc[:, 'E'] + assert_frame_equal(result, expected) - expected = no_dup_panel.loc[:, ['A', 'B']] - expected.major_axis = ['A', 'A'] - result = panel.loc[:, 'A'] - assert_panel_equal(result, expected) + expected = no_dup_panel.loc[:, ['A', 'B']] + expected.major_axis = ['A', 'A'] + result = panel.loc[:, 'A'] + assert_panel_equal(result, expected) - # minor - data = np.random.randn(5, 100, 5) - no_dup_panel = Panel(data, minor_axis=list("ABCDE")) - panel = Panel(data, minor_axis=list("AACDE")) + # minor + data = np.random.randn(5, 100, 5) + no_dup_panel = Panel(data, minor_axis=list("ABCDE")) + panel = Panel(data, minor_axis=list("AACDE")) - expected = no_dup_panel.loc[:, :, 'A'] - result = panel.iloc[:, :, 0] - assert_frame_equal(result, expected) + expected = no_dup_panel.loc[:, :, 'A'] + result = panel.iloc[:, :, 0] + assert_frame_equal(result, expected) - expected = no_dup_panel.loc[:, :, 'E'] - result = panel.loc[:, :, 'E'] - assert_frame_equal(result, expected) + expected = no_dup_panel.loc[:, :, 'E'] + result = panel.loc[:, :, 'E'] + assert_frame_equal(result, expected) - expected = no_dup_panel.loc[:, :, ['A', 'B']] - expected.minor_axis = ['A', 'A'] - result = panel.loc[:, :, 'A'] - assert_panel_equal(result, expected) + expected = no_dup_panel.loc[:, :, ['A', 'B']] + expected.minor_axis = ['A', 'A'] + result = panel.loc[:, :, 'A'] + assert_panel_equal(result, expected) def test_filter(self): pass def test_compound(self): - compounded = self.panel.compound() + with catch_warnings(record=True): + compounded = self.panel.compound() - assert_series_equal(compounded['ItemA'], - (1 + self.panel['ItemA']).product(0) - 1, - check_names=False) + 
assert_series_equal(compounded['ItemA'], + (1 + self.panel['ItemA']).product(0) - 1, + check_names=False) def test_shift(self): - # major - idx = self.panel.major_axis[0] - idx_lag = self.panel.major_axis[1] - shifted = self.panel.shift(1) - assert_frame_equal(self.panel.major_xs(idx), shifted.major_xs(idx_lag)) - - # minor - idx = self.panel.minor_axis[0] - idx_lag = self.panel.minor_axis[1] - shifted = self.panel.shift(1, axis='minor') - assert_frame_equal(self.panel.minor_xs(idx), shifted.minor_xs(idx_lag)) - - # items - idx = self.panel.items[0] - idx_lag = self.panel.items[1] - shifted = self.panel.shift(1, axis='items') - assert_frame_equal(self.panel[idx], shifted[idx_lag]) - - # negative numbers, #2164 - result = self.panel.shift(-1) - expected = Panel(dict((i, f.shift(-1)[:-1]) - for i, f in self.panel.iteritems())) - assert_panel_equal(result, expected) - - # mixed dtypes #6959 - data = [('item ' + ch, makeMixedDataFrame()) for ch in list('abcde')] - data = dict(data) - mixed_panel = Panel.from_dict(data, orient='minor') - shifted = mixed_panel.shift(1) - assert_series_equal(mixed_panel.dtypes, shifted.dtypes) + with catch_warnings(record=True): + # major + idx = self.panel.major_axis[0] + idx_lag = self.panel.major_axis[1] + shifted = self.panel.shift(1) + assert_frame_equal(self.panel.major_xs(idx), + shifted.major_xs(idx_lag)) + + # minor + idx = self.panel.minor_axis[0] + idx_lag = self.panel.minor_axis[1] + shifted = self.panel.shift(1, axis='minor') + assert_frame_equal(self.panel.minor_xs(idx), + shifted.minor_xs(idx_lag)) + + # items + idx = self.panel.items[0] + idx_lag = self.panel.items[1] + shifted = self.panel.shift(1, axis='items') + assert_frame_equal(self.panel[idx], shifted[idx_lag]) + + # negative numbers, #2164 + result = self.panel.shift(-1) + expected = Panel(dict((i, f.shift(-1)[:-1]) + for i, f in self.panel.iteritems())) + assert_panel_equal(result, expected) + + # mixed dtypes #6959 + data = [('item ' + ch, makeMixedDataFrame()) + for ch in list('abcde')] + data = dict(data) + mixed_panel = Panel.from_dict(data, orient='minor') + shifted = mixed_panel.shift(1) + assert_series_equal(mixed_panel.dtypes, shifted.dtypes) def test_tshift(self): # PeriodIndex - ps = tm.makePeriodPanel() - shifted = ps.tshift(1) - unshifted = shifted.tshift(-1) + with catch_warnings(record=True): + ps = tm.makePeriodPanel() + shifted = ps.tshift(1) + unshifted = shifted.tshift(-1) - assert_panel_equal(unshifted, ps) + assert_panel_equal(unshifted, ps) - shifted2 = ps.tshift(freq='B') - assert_panel_equal(shifted, shifted2) + shifted2 = ps.tshift(freq='B') + assert_panel_equal(shifted, shifted2) - shifted3 = ps.tshift(freq=BDay()) - assert_panel_equal(shifted, shifted3) + shifted3 = ps.tshift(freq=BDay()) + assert_panel_equal(shifted, shifted3) - assertRaisesRegexp(ValueError, 'does not match', ps.tshift, freq='M') + tm.assert_raises_regex(ValueError, 'does not match', + ps.tshift, freq='M') - # DatetimeIndex - panel = _panel - shifted = panel.tshift(1) - unshifted = shifted.tshift(-1) + # DatetimeIndex + panel = make_test_panel() + shifted = panel.tshift(1) + unshifted = shifted.tshift(-1) - assert_panel_equal(panel, unshifted) + assert_panel_equal(panel, unshifted) - shifted2 = panel.tshift(freq=panel.major_axis.freq) - assert_panel_equal(shifted, shifted2) + shifted2 = panel.tshift(freq=panel.major_axis.freq) + assert_panel_equal(shifted, shifted2) - inferred_ts = Panel(panel.values, items=panel.items, - major_axis=Index(np.asarray(panel.major_axis)), - 
minor_axis=panel.minor_axis) - shifted = inferred_ts.tshift(1) - unshifted = shifted.tshift(-1) - assert_panel_equal(shifted, panel.tshift(1)) - assert_panel_equal(unshifted, inferred_ts) + inferred_ts = Panel(panel.values, items=panel.items, + major_axis=Index(np.asarray(panel.major_axis)), + minor_axis=panel.minor_axis) + shifted = inferred_ts.tshift(1) + unshifted = shifted.tshift(-1) + assert_panel_equal(shifted, panel.tshift(1)) + assert_panel_equal(unshifted, inferred_ts) - no_freq = panel.ix[:, [0, 5, 7], :] - self.assertRaises(ValueError, no_freq.tshift) + no_freq = panel.iloc[:, [0, 5, 7], :] + pytest.raises(ValueError, no_freq.tshift) def test_pct_change(self): - df1 = DataFrame({'c1': [1, 2, 5], 'c2': [3, 4, 6]}) - df2 = df1 + 1 - df3 = DataFrame({'c1': [3, 4, 7], 'c2': [5, 6, 8]}) - wp = Panel({'i1': df1, 'i2': df2, 'i3': df3}) - # major, 1 - result = wp.pct_change() # axis='major' - expected = Panel({'i1': df1.pct_change(), - 'i2': df2.pct_change(), - 'i3': df3.pct_change()}) - assert_panel_equal(result, expected) - result = wp.pct_change(axis=1) - assert_panel_equal(result, expected) - # major, 2 - result = wp.pct_change(periods=2) - expected = Panel({'i1': df1.pct_change(2), - 'i2': df2.pct_change(2), - 'i3': df3.pct_change(2)}) - assert_panel_equal(result, expected) - # minor, 1 - result = wp.pct_change(axis='minor') - expected = Panel({'i1': df1.pct_change(axis=1), - 'i2': df2.pct_change(axis=1), - 'i3': df3.pct_change(axis=1)}) - assert_panel_equal(result, expected) - result = wp.pct_change(axis=2) - assert_panel_equal(result, expected) - # minor, 2 - result = wp.pct_change(periods=2, axis='minor') - expected = Panel({'i1': df1.pct_change(periods=2, axis=1), - 'i2': df2.pct_change(periods=2, axis=1), - 'i3': df3.pct_change(periods=2, axis=1)}) - assert_panel_equal(result, expected) - # items, 1 - result = wp.pct_change(axis='items') - expected = Panel({'i1': DataFrame({'c1': [np.nan, np.nan, np.nan], - 'c2': [np.nan, np.nan, np.nan]}), - 'i2': DataFrame({'c1': [1, 0.5, .2], - 'c2': [1. / 3, 0.25, 1. / 6]}), - 'i3': DataFrame({'c1': [.5, 1. / 3, 1. / 6], - 'c2': [.25, .2, 1. / 7]})}) - assert_panel_equal(result, expected) - result = wp.pct_change(axis=0) - assert_panel_equal(result, expected) - # items, 2 - result = wp.pct_change(periods=2, axis='items') - expected = Panel({'i1': DataFrame({'c1': [np.nan, np.nan, np.nan], - 'c2': [np.nan, np.nan, np.nan]}), - 'i2': DataFrame({'c1': [np.nan, np.nan, np.nan], - 'c2': [np.nan, np.nan, np.nan]}), - 'i3': DataFrame({'c1': [2, 1, .4], - 'c2': [2. / 3, .5, 1. 
/ 3]})}) - assert_panel_equal(result, expected) + with catch_warnings(record=True): + df1 = DataFrame({'c1': [1, 2, 5], 'c2': [3, 4, 6]}) + df2 = df1 + 1 + df3 = DataFrame({'c1': [3, 4, 7], 'c2': [5, 6, 8]}) + wp = Panel({'i1': df1, 'i2': df2, 'i3': df3}) + # major, 1 + result = wp.pct_change() # axis='major' + expected = Panel({'i1': df1.pct_change(), + 'i2': df2.pct_change(), + 'i3': df3.pct_change()}) + assert_panel_equal(result, expected) + result = wp.pct_change(axis=1) + assert_panel_equal(result, expected) + # major, 2 + result = wp.pct_change(periods=2) + expected = Panel({'i1': df1.pct_change(2), + 'i2': df2.pct_change(2), + 'i3': df3.pct_change(2)}) + assert_panel_equal(result, expected) + # minor, 1 + result = wp.pct_change(axis='minor') + expected = Panel({'i1': df1.pct_change(axis=1), + 'i2': df2.pct_change(axis=1), + 'i3': df3.pct_change(axis=1)}) + assert_panel_equal(result, expected) + result = wp.pct_change(axis=2) + assert_panel_equal(result, expected) + # minor, 2 + result = wp.pct_change(periods=2, axis='minor') + expected = Panel({'i1': df1.pct_change(periods=2, axis=1), + 'i2': df2.pct_change(periods=2, axis=1), + 'i3': df3.pct_change(periods=2, axis=1)}) + assert_panel_equal(result, expected) + # items, 1 + result = wp.pct_change(axis='items') + expected = Panel( + {'i1': DataFrame({'c1': [np.nan, np.nan, np.nan], + 'c2': [np.nan, np.nan, np.nan]}), + 'i2': DataFrame({'c1': [1, 0.5, .2], + 'c2': [1. / 3, 0.25, 1. / 6]}), + 'i3': DataFrame({'c1': [.5, 1. / 3, 1. / 6], + 'c2': [.25, .2, 1. / 7]})}) + assert_panel_equal(result, expected) + result = wp.pct_change(axis=0) + assert_panel_equal(result, expected) + # items, 2 + result = wp.pct_change(periods=2, axis='items') + expected = Panel( + {'i1': DataFrame({'c1': [np.nan, np.nan, np.nan], + 'c2': [np.nan, np.nan, np.nan]}), + 'i2': DataFrame({'c1': [np.nan, np.nan, np.nan], + 'c2': [np.nan, np.nan, np.nan]}), + 'i3': DataFrame({'c1': [2, 1, .4], + 'c2': [2. / 3, .5, 1. 
/ 3]})}) + assert_panel_equal(result, expected) def test_round(self): - values = [[[-3.2, 2.2], [0, -4.8213], [3.123, 123.12], - [-1566.213, 88.88], [-12, 94.5]], - [[-5.82, 3.5], [6.21, -73.272], [-9.087, 23.12], - [272.212, -99.99], [23, -76.5]]] - evalues = [[[float(np.around(i)) for i in j] for j in k] - for k in values] - p = Panel(values, items=['Item1', 'Item2'], - major_axis=pd.date_range('1/1/2000', periods=5), - minor_axis=['A', 'B']) - expected = Panel(evalues, items=['Item1', 'Item2'], - major_axis=pd.date_range('1/1/2000', periods=5), - minor_axis=['A', 'B']) - result = p.round() - self.assert_panel_equal(expected, result) + with catch_warnings(record=True): + values = [[[-3.2, 2.2], [0, -4.8213], [3.123, 123.12], + [-1566.213, 88.88], [-12, 94.5]], + [[-5.82, 3.5], [6.21, -73.272], [-9.087, 23.12], + [272.212, -99.99], [23, -76.5]]] + evalues = [[[float(np.around(i)) for i in j] for j in k] + for k in values] + p = Panel(values, items=['Item1', 'Item2'], + major_axis=pd.date_range('1/1/2000', periods=5), + minor_axis=['A', 'B']) + expected = Panel(evalues, items=['Item1', 'Item2'], + major_axis=pd.date_range('1/1/2000', periods=5), + minor_axis=['A', 'B']) + result = p.round() + assert_panel_equal(expected, result) def test_numpy_round(self): - values = [[[-3.2, 2.2], [0, -4.8213], [3.123, 123.12], - [-1566.213, 88.88], [-12, 94.5]], - [[-5.82, 3.5], [6.21, -73.272], [-9.087, 23.12], - [272.212, -99.99], [23, -76.5]]] - evalues = [[[float(np.around(i)) for i in j] for j in k] - for k in values] - p = Panel(values, items=['Item1', 'Item2'], - major_axis=pd.date_range('1/1/2000', periods=5), - minor_axis=['A', 'B']) - expected = Panel(evalues, items=['Item1', 'Item2'], - major_axis=pd.date_range('1/1/2000', periods=5), - minor_axis=['A', 'B']) - result = np.round(p) - self.assert_panel_equal(expected, result) - - msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.round, p, out=p) + with catch_warnings(record=True): + values = [[[-3.2, 2.2], [0, -4.8213], [3.123, 123.12], + [-1566.213, 88.88], [-12, 94.5]], + [[-5.82, 3.5], [6.21, -73.272], [-9.087, 23.12], + [272.212, -99.99], [23, -76.5]]] + evalues = [[[float(np.around(i)) for i in j] for j in k] + for k in values] + p = Panel(values, items=['Item1', 'Item2'], + major_axis=pd.date_range('1/1/2000', periods=5), + minor_axis=['A', 'B']) + expected = Panel(evalues, items=['Item1', 'Item2'], + major_axis=pd.date_range('1/1/2000', periods=5), + minor_axis=['A', 'B']) + result = np.round(p) + assert_panel_equal(expected, result) + + msg = "the 'out' parameter is not supported" + tm.assert_raises_regex(ValueError, msg, np.round, p, out=p) def test_multiindex_get(self): - ind = MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)], - names=['first', 'second']) - wp = Panel(np.random.random((4, 5, 5)), - items=ind, - major_axis=np.arange(5), - minor_axis=np.arange(5)) - f1 = wp['a'] - f2 = wp.ix['a'] - assert_panel_equal(f1, f2) - - self.assertTrue((f1.items == [1, 2]).all()) - self.assertTrue((f2.items == [1, 2]).all()) - - ind = MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)], - names=['first', 'second']) + with catch_warnings(record=True): + ind = MultiIndex.from_tuples( + [('a', 1), ('a', 2), ('b', 1), ('b', 2)], + names=['first', 'second']) + wp = Panel(np.random.random((4, 5, 5)), + items=ind, + major_axis=np.arange(5), + minor_axis=np.arange(5)) + f1 = wp['a'] + f2 = wp.loc['a'] + assert_panel_equal(f1, f2) + + assert (f1.items == [1, 2]).all() + assert (f2.items == [1, 
2]).all() + + ind = MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)], + names=['first', 'second']) def test_multiindex_blocks(self): - ind = MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)], - names=['first', 'second']) - wp = Panel(self.panel._data) - wp.items = ind - f1 = wp['a'] - self.assertTrue((f1.items == [1, 2]).all()) + with catch_warnings(record=True): + ind = MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)], + names=['first', 'second']) + wp = Panel(self.panel._data) + wp.items = ind + f1 = wp['a'] + assert (f1.items == [1, 2]).all() - f1 = wp[('b', 1)] - self.assertTrue((f1.columns == ['A', 'B', 'C', 'D']).all()) + f1 = wp[('b', 1)] + assert (f1.columns == ['A', 'B', 'C', 'D']).all() def test_repr_empty(self): - empty = Panel() - repr(empty) + with catch_warnings(record=True): + empty = Panel() + repr(empty) def test_rename(self): - mapper = {'ItemA': 'foo', 'ItemB': 'bar', 'ItemC': 'baz'} + with catch_warnings(record=True): + mapper = {'ItemA': 'foo', 'ItemB': 'bar', 'ItemC': 'baz'} - renamed = self.panel.rename_axis(mapper, axis=0) - exp = Index(['foo', 'bar', 'baz']) - self.assert_index_equal(renamed.items, exp) + renamed = self.panel.rename_axis(mapper, axis=0) + exp = Index(['foo', 'bar', 'baz']) + tm.assert_index_equal(renamed.items, exp) - renamed = self.panel.rename_axis(str.lower, axis=2) - exp = Index(['a', 'b', 'c', 'd']) - self.assert_index_equal(renamed.minor_axis, exp) + renamed = self.panel.rename_axis(str.lower, axis=2) + exp = Index(['a', 'b', 'c', 'd']) + tm.assert_index_equal(renamed.minor_axis, exp) - # don't copy - renamed_nocopy = self.panel.rename_axis(mapper, axis=0, copy=False) - renamed_nocopy['foo'] = 3. - self.assertTrue((self.panel['ItemA'].values == 3).all()) + # don't copy + renamed_nocopy = self.panel.rename_axis(mapper, axis=0, copy=False) + renamed_nocopy['foo'] = 3. + assert (self.panel['ItemA'].values == 3).all() def test_get_attr(self): assert_frame_equal(self.panel['ItemA'], self.panel.ItemA) @@ -2050,12 +2176,13 @@ def test_get_attr(self): assert_frame_equal(self.panel['i'], self.panel.i) def test_from_frame_level1_unsorted(self): - tuples = [('MSFT', 3), ('MSFT', 2), ('AAPL', 2), ('AAPL', 1), - ('MSFT', 1)] - midx = MultiIndex.from_tuples(tuples) - df = DataFrame(np.random.rand(5, 4), index=midx) - p = df.to_panel() - assert_frame_equal(p.minor_xs(2), df.xs(2, level=1).sort_index()) + with catch_warnings(record=True): + tuples = [('MSFT', 3), ('MSFT', 2), ('AAPL', 2), ('AAPL', 1), + ('MSFT', 1)] + midx = MultiIndex.from_tuples(tuples) + df = DataFrame(np.random.rand(5, 4), index=midx) + p = df.to_panel() + assert_frame_equal(p.minor_xs(2), df.xs(2, level=1).sort_index()) def test_to_excel(self): try: @@ -2064,7 +2191,7 @@ def test_to_excel(self): import openpyxl # noqa from pandas.io.excel import ExcelFile except ImportError: - raise nose.SkipTest("need xlwt xlrd openpyxl") + pytest.skip("need xlwt xlrd openpyxl") for ext in ['xls', 'xlsx']: with ensure_clean('__tmp__.' + ext) as path: @@ -2072,7 +2199,7 @@ def test_to_excel(self): try: reader = ExcelFile(path) except ImportError: - raise nose.SkipTest("need xlwt xlrd openpyxl") + pytest.skip("need xlwt xlrd openpyxl") for item, df in self.panel.iteritems(): recdf = reader.parse(str(item), index_col=0) @@ -2084,297 +2211,330 @@ def test_to_excel_xlsxwriter(self): import xlsxwriter # noqa from pandas.io.excel import ExcelFile except ImportError: - raise nose.SkipTest("Requires xlrd and xlsxwriter. Skipping test.") + pytest.skip("Requires xlrd and xlsxwriter. 
Skipping test.") with ensure_clean('__tmp__.xlsx') as path: self.panel.to_excel(path, engine='xlsxwriter') try: reader = ExcelFile(path) except ImportError as e: - raise nose.SkipTest("cannot write excel file: %s" % e) + pytest.skip("cannot write excel file: %s" % e) for item, df in self.panel.iteritems(): recdf = reader.parse(str(item), index_col=0) assert_frame_equal(df, recdf) def test_dropna(self): - p = Panel(np.random.randn(4, 5, 6), major_axis=list('abcde')) - p.ix[:, ['b', 'd'], 0] = np.nan + with catch_warnings(record=True): + p = Panel(np.random.randn(4, 5, 6), major_axis=list('abcde')) + p.loc[:, ['b', 'd'], 0] = np.nan - result = p.dropna(axis=1) - exp = p.ix[:, ['a', 'c', 'e'], :] - assert_panel_equal(result, exp) - inp = p.copy() - inp.dropna(axis=1, inplace=True) - assert_panel_equal(inp, exp) + result = p.dropna(axis=1) + exp = p.loc[:, ['a', 'c', 'e'], :] + assert_panel_equal(result, exp) + inp = p.copy() + inp.dropna(axis=1, inplace=True) + assert_panel_equal(inp, exp) - result = p.dropna(axis=1, how='all') - assert_panel_equal(result, p) + result = p.dropna(axis=1, how='all') + assert_panel_equal(result, p) - p.ix[:, ['b', 'd'], :] = np.nan - result = p.dropna(axis=1, how='all') - exp = p.ix[:, ['a', 'c', 'e'], :] - assert_panel_equal(result, exp) + p.loc[:, ['b', 'd'], :] = np.nan + result = p.dropna(axis=1, how='all') + exp = p.loc[:, ['a', 'c', 'e'], :] + assert_panel_equal(result, exp) - p = Panel(np.random.randn(4, 5, 6), items=list('abcd')) - p.ix[['b'], :, 0] = np.nan + p = Panel(np.random.randn(4, 5, 6), items=list('abcd')) + p.loc[['b'], :, 0] = np.nan - result = p.dropna() - exp = p.ix[['a', 'c', 'd']] - assert_panel_equal(result, exp) + result = p.dropna() + exp = p.loc[['a', 'c', 'd']] + assert_panel_equal(result, exp) - result = p.dropna(how='all') - assert_panel_equal(result, p) + result = p.dropna(how='all') + assert_panel_equal(result, p) - p.ix['b'] = np.nan - result = p.dropna(how='all') - exp = p.ix[['a', 'c', 'd']] - assert_panel_equal(result, exp) + p.loc['b'] = np.nan + result = p.dropna(how='all') + exp = p.loc[['a', 'c', 'd']] + assert_panel_equal(result, exp) def test_drop(self): - df = DataFrame({"A": [1, 2], "B": [3, 4]}) - panel = Panel({"One": df, "Two": df}) + with catch_warnings(record=True): + df = DataFrame({"A": [1, 2], "B": [3, 4]}) + panel = Panel({"One": df, "Two": df}) - def check_drop(drop_val, axis_number, aliases, expected): - try: - actual = panel.drop(drop_val, axis=axis_number) - assert_panel_equal(actual, expected) - for alias in aliases: - actual = panel.drop(drop_val, axis=alias) + def check_drop(drop_val, axis_number, aliases, expected): + try: + actual = panel.drop(drop_val, axis=axis_number) assert_panel_equal(actual, expected) - except AssertionError: - pprint_thing("Failed with axis_number %d and aliases: %s" % - (axis_number, aliases)) - raise - # Items - expected = Panel({"One": df}) - check_drop('Two', 0, ['items'], expected) - - self.assertRaises(ValueError, panel.drop, 'Three') - - # errors = 'ignore' - dropped = panel.drop('Three', errors='ignore') - assert_panel_equal(dropped, panel) - dropped = panel.drop(['Two', 'Three'], errors='ignore') - expected = Panel({"One": df}) - assert_panel_equal(dropped, expected) - - # Major - exp_df = DataFrame({"A": [2], "B": [4]}, index=[1]) - expected = Panel({"One": exp_df, "Two": exp_df}) - check_drop(0, 1, ['major_axis', 'major'], expected) - - exp_df = DataFrame({"A": [1], "B": [3]}, index=[0]) - expected = Panel({"One": exp_df, "Two": exp_df}) - check_drop([1], 1, 
['major_axis', 'major'], expected) - - # Minor - exp_df = df[['B']] - expected = Panel({"One": exp_df, "Two": exp_df}) - check_drop(["A"], 2, ['minor_axis', 'minor'], expected) - - exp_df = df[['A']] - expected = Panel({"One": exp_df, "Two": exp_df}) - check_drop("B", 2, ['minor_axis', 'minor'], expected) + for alias in aliases: + actual = panel.drop(drop_val, axis=alias) + assert_panel_equal(actual, expected) + except AssertionError: + pprint_thing("Failed with axis_number %d and aliases: %s" % + (axis_number, aliases)) + raise + # Items + expected = Panel({"One": df}) + check_drop('Two', 0, ['items'], expected) + + pytest.raises(ValueError, panel.drop, 'Three') + + # errors = 'ignore' + dropped = panel.drop('Three', errors='ignore') + assert_panel_equal(dropped, panel) + dropped = panel.drop(['Two', 'Three'], errors='ignore') + expected = Panel({"One": df}) + assert_panel_equal(dropped, expected) + + # Major + exp_df = DataFrame({"A": [2], "B": [4]}, index=[1]) + expected = Panel({"One": exp_df, "Two": exp_df}) + check_drop(0, 1, ['major_axis', 'major'], expected) + + exp_df = DataFrame({"A": [1], "B": [3]}, index=[0]) + expected = Panel({"One": exp_df, "Two": exp_df}) + check_drop([1], 1, ['major_axis', 'major'], expected) + + # Minor + exp_df = df[['B']] + expected = Panel({"One": exp_df, "Two": exp_df}) + check_drop(["A"], 2, ['minor_axis', 'minor'], expected) + + exp_df = df[['A']] + expected = Panel({"One": exp_df, "Two": exp_df}) + check_drop("B", 2, ['minor_axis', 'minor'], expected) def test_update(self): - pan = Panel([[[1.5, np.nan, 3.], [1.5, np.nan, 3.], [1.5, np.nan, 3.], - [1.5, np.nan, 3.]], - [[1.5, np.nan, 3.], [1.5, np.nan, 3.], [1.5, np.nan, 3.], - [1.5, np.nan, 3.]]]) + with catch_warnings(record=True): + pan = Panel([[[1.5, np.nan, 3.], [1.5, np.nan, 3.], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]], + [[1.5, np.nan, 3.], [1.5, np.nan, 3.], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]]]) - other = Panel([[[3.6, 2., np.nan], [np.nan, np.nan, 7]]], items=[1]) + other = Panel( + [[[3.6, 2., np.nan], [np.nan, np.nan, 7]]], items=[1]) - pan.update(other) + pan.update(other) - expected = Panel([[[1.5, np.nan, 3.], [1.5, np.nan, 3.], - [1.5, np.nan, 3.], [1.5, np.nan, 3.]], - [[3.6, 2., 3], [1.5, np.nan, 7], [1.5, np.nan, 3.], - [1.5, np.nan, 3.]]]) + expected = Panel([[[1.5, np.nan, 3.], [1.5, np.nan, 3.], + [1.5, np.nan, 3.], [1.5, np.nan, 3.]], + [[3.6, 2., 3], [1.5, np.nan, 7], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]]]) - assert_panel_equal(pan, expected) + assert_panel_equal(pan, expected) def test_update_from_dict(self): - pan = Panel({'one': DataFrame([[1.5, np.nan, 3], [1.5, np.nan, 3], - [1.5, np.nan, 3.], [1.5, np.nan, 3.]]), - 'two': DataFrame([[1.5, np.nan, 3.], [1.5, np.nan, 3.], - [1.5, np.nan, 3.], [1.5, np.nan, 3.]])}) - - other = {'two': DataFrame([[3.6, 2., np.nan], [np.nan, np.nan, 7]])} - - pan.update(other) - - expected = Panel( - {'two': DataFrame([[3.6, 2., 3], [1.5, np.nan, 7], - [1.5, np.nan, 3.], [1.5, np.nan, 3.]]), - 'one': DataFrame([[1.5, np.nan, 3.], [1.5, np.nan, 3.], - [1.5, np.nan, 3.], [1.5, np.nan, 3.]])}) - - assert_panel_equal(pan, expected) + with catch_warnings(record=True): + pan = Panel({'one': DataFrame([[1.5, np.nan, 3], + [1.5, np.nan, 3], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]]), + 'two': DataFrame([[1.5, np.nan, 3.], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]])}) + + other = {'two': DataFrame( + [[3.6, 2., np.nan], [np.nan, np.nan, 7]])} + + pan.update(other) + + expected = Panel( + {'two': DataFrame([[3.6, 2., 
3], + [1.5, np.nan, 7], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]]), + 'one': DataFrame([[1.5, np.nan, 3.], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]])}) + + assert_panel_equal(pan, expected) def test_update_nooverwrite(self): - pan = Panel([[[1.5, np.nan, 3.], [1.5, np.nan, 3.], [1.5, np.nan, 3.], - [1.5, np.nan, 3.]], - [[1.5, np.nan, 3.], [1.5, np.nan, 3.], [1.5, np.nan, 3.], - [1.5, np.nan, 3.]]]) + with catch_warnings(record=True): + pan = Panel([[[1.5, np.nan, 3.], [1.5, np.nan, 3.], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]], + [[1.5, np.nan, 3.], [1.5, np.nan, 3.], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]]]) - other = Panel([[[3.6, 2., np.nan], [np.nan, np.nan, 7]]], items=[1]) + other = Panel( + [[[3.6, 2., np.nan], [np.nan, np.nan, 7]]], items=[1]) - pan.update(other, overwrite=False) + pan.update(other, overwrite=False) - expected = Panel([[[1.5, np.nan, 3], [1.5, np.nan, 3], - [1.5, np.nan, 3.], [1.5, np.nan, 3.]], - [[1.5, 2., 3.], [1.5, np.nan, 3.], [1.5, np.nan, 3.], - [1.5, np.nan, 3.]]]) + expected = Panel([[[1.5, np.nan, 3], [1.5, np.nan, 3], + [1.5, np.nan, 3.], [1.5, np.nan, 3.]], + [[1.5, 2., 3.], [1.5, np.nan, 3.], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]]]) - assert_panel_equal(pan, expected) + assert_panel_equal(pan, expected) def test_update_filtered(self): - pan = Panel([[[1.5, np.nan, 3.], [1.5, np.nan, 3.], [1.5, np.nan, 3.], - [1.5, np.nan, 3.]], - [[1.5, np.nan, 3.], [1.5, np.nan, 3.], [1.5, np.nan, 3.], - [1.5, np.nan, 3.]]]) + with catch_warnings(record=True): + pan = Panel([[[1.5, np.nan, 3.], [1.5, np.nan, 3.], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]], + [[1.5, np.nan, 3.], [1.5, np.nan, 3.], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]]]) - other = Panel([[[3.6, 2., np.nan], [np.nan, np.nan, 7]]], items=[1]) + other = Panel( + [[[3.6, 2., np.nan], [np.nan, np.nan, 7]]], items=[1]) - pan.update(other, filter_func=lambda x: x > 2) + pan.update(other, filter_func=lambda x: x > 2) - expected = Panel([[[1.5, np.nan, 3.], [1.5, np.nan, 3.], - [1.5, np.nan, 3.], [1.5, np.nan, 3.]], - [[1.5, np.nan, 3], [1.5, np.nan, 7], - [1.5, np.nan, 3.], [1.5, np.nan, 3.]]]) + expected = Panel([[[1.5, np.nan, 3.], [1.5, np.nan, 3.], + [1.5, np.nan, 3.], [1.5, np.nan, 3.]], + [[1.5, np.nan, 3], [1.5, np.nan, 7], + [1.5, np.nan, 3.], [1.5, np.nan, 3.]]]) - assert_panel_equal(pan, expected) + assert_panel_equal(pan, expected) def test_update_raise(self): - pan = Panel([[[1.5, np.nan, 3.], [1.5, np.nan, 3.], [1.5, np.nan, 3.], - [1.5, np.nan, 3.]], - [[1.5, np.nan, 3.], [1.5, np.nan, 3.], [1.5, np.nan, 3.], - [1.5, np.nan, 3.]]]) - - self.assertRaises(Exception, pan.update, *(pan, ), + with catch_warnings(record=True): + pan = Panel([[[1.5, np.nan, 3.], [1.5, np.nan, 3.], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]], + [[1.5, np.nan, 3.], [1.5, np.nan, 3.], + [1.5, np.nan, 3.], + [1.5, np.nan, 3.]]]) + + pytest.raises(Exception, pan.update, *(pan, ), **{'raise_conflict': True}) def test_all_any(self): - self.assertTrue((self.panel.all(axis=0).values == nanall( - self.panel, axis=0)).all()) - self.assertTrue((self.panel.all(axis=1).values == nanall( - self.panel, axis=1).T).all()) - self.assertTrue((self.panel.all(axis=2).values == nanall( - self.panel, axis=2).T).all()) - self.assertTrue((self.panel.any(axis=0).values == nanany( - self.panel, axis=0)).all()) - self.assertTrue((self.panel.any(axis=1).values == nanany( - self.panel, axis=1).T).all()) - self.assertTrue((self.panel.any(axis=2).values == nanany( - self.panel, axis=2).T).all()) + assert 
(self.panel.all(axis=0).values == nanall( + self.panel, axis=0)).all() + assert (self.panel.all(axis=1).values == nanall( + self.panel, axis=1).T).all() + assert (self.panel.all(axis=2).values == nanall( + self.panel, axis=2).T).all() + assert (self.panel.any(axis=0).values == nanany( + self.panel, axis=0)).all() + assert (self.panel.any(axis=1).values == nanany( + self.panel, axis=1).T).all() + assert (self.panel.any(axis=2).values == nanany( + self.panel, axis=2).T).all() def test_all_any_unhandled(self): - self.assertRaises(NotImplementedError, self.panel.all, bool_only=True) - self.assertRaises(NotImplementedError, self.panel.any, bool_only=True) + pytest.raises(NotImplementedError, self.panel.all, bool_only=True) + pytest.raises(NotImplementedError, self.panel.any, bool_only=True) -class TestLongPanel(tm.TestCase): +class TestLongPanel(object): """ LongPanel no longer exists, but... """ - _multiprocess_can_split_ = True - - def setUp(self): - import warnings - warnings.filterwarnings(action='ignore', category=FutureWarning) - - panel = tm.makePanel() - tm.add_nans(panel) + def setup_method(self, method): + panel = make_test_panel() self.panel = panel.to_frame() self.unfiltered_panel = panel.to_frame(filter_observations=False) def test_ops_differently_indexed(self): - # trying to set non-identically indexed panel - wp = self.panel.to_panel() - wp2 = wp.reindex(major=wp.major_axis[:-1]) - lp2 = wp2.to_frame() + with catch_warnings(record=True): + # trying to set non-identically indexed panel + wp = self.panel.to_panel() + wp2 = wp.reindex(major=wp.major_axis[:-1]) + lp2 = wp2.to_frame() - result = self.panel + lp2 - assert_frame_equal(result.reindex(lp2.index), lp2 * 2) + result = self.panel + lp2 + assert_frame_equal(result.reindex(lp2.index), lp2 * 2) - # careful, mutation - self.panel['foo'] = lp2['ItemA'] - assert_series_equal(self.panel['foo'].reindex(lp2.index), lp2['ItemA'], - check_names=False) + # careful, mutation + self.panel['foo'] = lp2['ItemA'] + assert_series_equal(self.panel['foo'].reindex(lp2.index), + lp2['ItemA'], + check_names=False) def test_ops_scalar(self): - result = self.panel.mul(2) - expected = DataFrame.__mul__(self.panel, 2) - assert_frame_equal(result, expected) + with catch_warnings(record=True): + result = self.panel.mul(2) + expected = DataFrame.__mul__(self.panel, 2) + assert_frame_equal(result, expected) def test_combineFrame(self): - wp = self.panel.to_panel() - result = self.panel.add(wp['ItemA'].stack(), axis=0) - assert_frame_equal(result.to_panel()['ItemA'], wp['ItemA'] * 2) + with catch_warnings(record=True): + wp = self.panel.to_panel() + result = self.panel.add(wp['ItemA'].stack(), axis=0) + assert_frame_equal(result.to_panel()['ItemA'], wp['ItemA'] * 2) def test_combinePanel(self): - wp = self.panel.to_panel() - result = self.panel.add(self.panel) - wide_result = result.to_panel() - assert_frame_equal(wp['ItemA'] * 2, wide_result['ItemA']) + with catch_warnings(record=True): + wp = self.panel.to_panel() + result = self.panel.add(self.panel) + wide_result = result.to_panel() + assert_frame_equal(wp['ItemA'] * 2, wide_result['ItemA']) - # one item - result = self.panel.add(self.panel.filter(['ItemA'])) + # one item + result = self.panel.add(self.panel.filter(['ItemA'])) def test_combine_scalar(self): - result = self.panel.mul(2) - expected = DataFrame(self.panel._data) * 2 - assert_frame_equal(result, expected) + with catch_warnings(record=True): + result = self.panel.mul(2) + expected = DataFrame(self.panel._data) * 2 + 
assert_frame_equal(result, expected) def test_combine_series(self): - s = self.panel['ItemA'][:10] - result = self.panel.add(s, axis=0) - expected = DataFrame.add(self.panel, s, axis=0) - assert_frame_equal(result, expected) + with catch_warnings(record=True): + s = self.panel['ItemA'][:10] + result = self.panel.add(s, axis=0) + expected = DataFrame.add(self.panel, s, axis=0) + assert_frame_equal(result, expected) - s = self.panel.ix[5] - result = self.panel + s - expected = DataFrame.add(self.panel, s, axis=1) - assert_frame_equal(result, expected) + s = self.panel.iloc[5] + result = self.panel + s + expected = DataFrame.add(self.panel, s, axis=1) + assert_frame_equal(result, expected) def test_operators(self): - wp = self.panel.to_panel() - result = (self.panel + 1).to_panel() - assert_frame_equal(wp['ItemA'] + 1, result['ItemA']) + with catch_warnings(record=True): + wp = self.panel.to_panel() + result = (self.panel + 1).to_panel() + assert_frame_equal(wp['ItemA'] + 1, result['ItemA']) def test_arith_flex_panel(self): - ops = ['add', 'sub', 'mul', 'div', 'truediv', 'pow', 'floordiv', 'mod'] - if not compat.PY3: - aliases = {} - else: - aliases = {'div': 'truediv'} - self.panel = self.panel.to_panel() - - for n in [np.random.randint(-50, -1), np.random.randint(1, 50), 0]: - for op in ops: - alias = aliases.get(op, op) - f = getattr(operator, alias) - exp = f(self.panel, n) - result = getattr(self.panel, op)(n) - assert_panel_equal(result, exp, check_panel_type=True) - - # rops - r_f = lambda x, y: f(y, x) - exp = r_f(self.panel, n) - result = getattr(self.panel, 'r' + op)(n) - assert_panel_equal(result, exp) + with catch_warnings(record=True): + ops = ['add', 'sub', 'mul', 'div', + 'truediv', 'pow', 'floordiv', 'mod'] + if not compat.PY3: + aliases = {} + else: + aliases = {'div': 'truediv'} + self.panel = self.panel.to_panel() + + for n in [np.random.randint(-50, -1), np.random.randint(1, 50), 0]: + for op in ops: + alias = aliases.get(op, op) + f = getattr(operator, alias) + exp = f(self.panel, n) + result = getattr(self.panel, op)(n) + assert_panel_equal(result, exp, check_panel_type=True) + + # rops + r_f = lambda x, y: f(y, x) + exp = r_f(self.panel, n) + result = getattr(self.panel, 'r' + op)(n) + assert_panel_equal(result, exp) def test_sort(self): def is_sorted(arr): return (arr[1:] > arr[:-1]).any() - sorted_minor = self.panel.sortlevel(level=1) - self.assertTrue(is_sorted(sorted_minor.index.labels[1])) + sorted_minor = self.panel.sort_index(level=1) + assert is_sorted(sorted_minor.index.labels[1]) - sorted_major = sorted_minor.sortlevel(level=0) - self.assertTrue(is_sorted(sorted_major.index.labels[0])) + sorted_major = sorted_minor.sort_index(level=0) + assert is_sorted(sorted_major.index.labels[0]) def test_to_string(self): buf = StringIO() @@ -2383,153 +2543,140 @@ def test_to_string(self): def test_to_sparse(self): if isinstance(self.panel, Panel): msg = 'sparsifying is not supported' - tm.assertRaisesRegexp(NotImplementedError, msg, - self.panel.to_sparse) + tm.assert_raises_regex(NotImplementedError, msg, + self.panel.to_sparse) def test_truncate(self): - dates = self.panel.index.levels[0] - start, end = dates[1], dates[5] + with catch_warnings(record=True): + dates = self.panel.index.levels[0] + start, end = dates[1], dates[5] - trunced = self.panel.truncate(start, end).to_panel() - expected = self.panel.to_panel()['ItemA'].truncate(start, end) + trunced = self.panel.truncate(start, end).to_panel() + expected = self.panel.to_panel()['ItemA'].truncate(start, end) - # TODO 
truncate drops index.names
-        assert_frame_equal(trunced['ItemA'], expected, check_names=False)
+            # TODO truncate drops index.names
+            assert_frame_equal(trunced['ItemA'], expected, check_names=False)

-        trunced = self.panel.truncate(before=start).to_panel()
-        expected = self.panel.to_panel()['ItemA'].truncate(before=start)
+            trunced = self.panel.truncate(before=start).to_panel()
+            expected = self.panel.to_panel()['ItemA'].truncate(before=start)

-        # TODO truncate drops index.names
-        assert_frame_equal(trunced['ItemA'], expected, check_names=False)
+            # TODO truncate drops index.names
+            assert_frame_equal(trunced['ItemA'], expected, check_names=False)

-        trunced = self.panel.truncate(after=end).to_panel()
-        expected = self.panel.to_panel()['ItemA'].truncate(after=end)
+            trunced = self.panel.truncate(after=end).to_panel()
+            expected = self.panel.to_panel()['ItemA'].truncate(after=end)

-        # TODO truncate drops index.names
-        assert_frame_equal(trunced['ItemA'], expected, check_names=False)
+            # TODO truncate drops index.names
+            assert_frame_equal(trunced['ItemA'], expected, check_names=False)

-        # truncate on dates that aren't in there
-        wp = self.panel.to_panel()
-        new_index = wp.major_axis[::5]
+            # truncate on dates that aren't in there
+            wp = self.panel.to_panel()
+            new_index = wp.major_axis[::5]

-        wp2 = wp.reindex(major=new_index)
+            wp2 = wp.reindex(major=new_index)

-        lp2 = wp2.to_frame()
-        lp_trunc = lp2.truncate(wp.major_axis[2], wp.major_axis[-2])
+            lp2 = wp2.to_frame()
+            lp_trunc = lp2.truncate(wp.major_axis[2], wp.major_axis[-2])

-        wp_trunc = wp2.truncate(wp.major_axis[2], wp.major_axis[-2])
+            wp_trunc = wp2.truncate(wp.major_axis[2], wp.major_axis[-2])

-        assert_panel_equal(wp_trunc, lp_trunc.to_panel())
+            assert_panel_equal(wp_trunc, lp_trunc.to_panel())

-        # throw proper exception
-        self.assertRaises(Exception, lp2.truncate, wp.major_axis[-2],
+            # throw proper exception
+            pytest.raises(Exception, lp2.truncate, wp.major_axis[-2],
                           wp.major_axis[2])

     def test_axis_dummies(self):
-        from pandas.core.reshape import make_axis_dummies
+        from pandas.core.reshape.reshape import make_axis_dummies

         minor_dummies = make_axis_dummies(self.panel, 'minor').astype(np.uint8)
-        self.assertEqual(len(minor_dummies.columns),
-                         len(self.panel.index.levels[1]))
+        assert len(minor_dummies.columns) == len(self.panel.index.levels[1])

         major_dummies = make_axis_dummies(self.panel, 'major').astype(np.uint8)
-        self.assertEqual(len(major_dummies.columns),
-                         len(self.panel.index.levels[0]))
+        assert len(major_dummies.columns) == len(self.panel.index.levels[0])

         mapping = {'A': 'one', 'B': 'one', 'C': 'two', 'D': 'two'}

         transformed = make_axis_dummies(self.panel, 'minor',
                                         transform=mapping.get).astype(np.uint8)
-        self.assertEqual(len(transformed.columns), 2)
-        self.assert_index_equal(transformed.columns, Index(['one', 'two']))
+        assert len(transformed.columns) == 2
+        tm.assert_index_equal(transformed.columns, Index(['one', 'two']))

         # TODO: test correctness

     def test_get_dummies(self):
-        from pandas.core.reshape import get_dummies, make_axis_dummies
+        from pandas.core.reshape.reshape import get_dummies, make_axis_dummies

         self.panel['Label'] = self.panel.index.labels[1]
         minor_dummies = make_axis_dummies(self.panel, 'minor').astype(np.uint8)
         dummies = get_dummies(self.panel['Label'])
-        self.assert_numpy_array_equal(dummies.values, minor_dummies.values)
+        tm.assert_numpy_array_equal(dummies.values, minor_dummies.values)

     def test_mean(self):
-        means = self.panel.mean(level='minor')
+        with catch_warnings(record=True):
+            means =
self.panel.mean(level='minor') - # test versus Panel version - wide_means = self.panel.to_panel().mean('major') - assert_frame_equal(means, wide_means) + # test versus Panel version + wide_means = self.panel.to_panel().mean('major') + assert_frame_equal(means, wide_means) def test_sum(self): - sums = self.panel.sum(level='minor') + with catch_warnings(record=True): + sums = self.panel.sum(level='minor') - # test versus Panel version - wide_sums = self.panel.to_panel().sum('major') - assert_frame_equal(sums, wide_sums) + # test versus Panel version + wide_sums = self.panel.to_panel().sum('major') + assert_frame_equal(sums, wide_sums) def test_count(self): - index = self.panel.index + with catch_warnings(record=True): + index = self.panel.index - major_count = self.panel.count(level=0)['ItemA'] - labels = index.labels[0] - for i, idx in enumerate(index.levels[0]): - self.assertEqual(major_count[i], (labels == i).sum()) + major_count = self.panel.count(level=0)['ItemA'] + labels = index.labels[0] + for i, idx in enumerate(index.levels[0]): + assert major_count[i] == (labels == i).sum() - minor_count = self.panel.count(level=1)['ItemA'] - labels = index.labels[1] - for i, idx in enumerate(index.levels[1]): - self.assertEqual(minor_count[i], (labels == i).sum()) + minor_count = self.panel.count(level=1)['ItemA'] + labels = index.labels[1] + for i, idx in enumerate(index.levels[1]): + assert minor_count[i] == (labels == i).sum() def test_join(self): - lp1 = self.panel.filter(['ItemA', 'ItemB']) - lp2 = self.panel.filter(['ItemC']) + with catch_warnings(record=True): + lp1 = self.panel.filter(['ItemA', 'ItemB']) + lp2 = self.panel.filter(['ItemC']) - joined = lp1.join(lp2) + joined = lp1.join(lp2) - self.assertEqual(len(joined.columns), 3) + assert len(joined.columns) == 3 - self.assertRaises(Exception, lp1.join, + pytest.raises(Exception, lp1.join, self.panel.filter(['ItemB', 'ItemC'])) def test_pivot(self): - from pandas.core.reshape import _slow_pivot - - one, two, three = (np.array([1, 2, 3, 4, 5]), - np.array(['a', 'b', 'c', 'd', 'e']), - np.array([1, 2, 3, 5, 4.])) - df = pivot(one, two, three) - self.assertEqual(df['a'][1], 1) - self.assertEqual(df['b'][2], 2) - self.assertEqual(df['c'][3], 3) - self.assertEqual(df['d'][4], 5) - self.assertEqual(df['e'][5], 4) - assert_frame_equal(df, _slow_pivot(one, two, three)) - - # weird overlap, TODO: test? - a, b, c = (np.array([1, 2, 3, 4, 4]), - np.array(['a', 'a', 'a', 'a', 'a']), - np.array([1., 2., 3., 4., 5.])) - self.assertRaises(Exception, pivot, a, b, c) - - # corner case, empty - df = pivot(np.array([]), np.array([]), np.array([])) - - -def test_monotonic(): - pos = np.array([1, 2, 3, 5]) - - def _monotonic(arr): - return not (arr[1:] < arr[:-1]).any() - - assert _monotonic(pos) - - neg = np.array([1, 2, 3, 4, 3]) - - assert not _monotonic(neg) - - neg2 = np.array([5, 1, 2, 3, 4, 5]) - - assert not _monotonic(neg2) + with catch_warnings(record=True): + from pandas.core.reshape.reshape import _slow_pivot + + one, two, three = (np.array([1, 2, 3, 4, 5]), + np.array(['a', 'b', 'c', 'd', 'e']), + np.array([1, 2, 3, 5, 4.])) + df = pivot(one, two, three) + assert df['a'][1] == 1 + assert df['b'][2] == 2 + assert df['c'][3] == 3 + assert df['d'][4] == 5 + assert df['e'][5] == 4 + assert_frame_equal(df, _slow_pivot(one, two, three)) + + # weird overlap, TODO: test? 
+        a, b, c = (np.array([1, 2, 3, 4, 4]),
+                   np.array(['a', 'a', 'a', 'a', 'a']),
+                   np.array([1., 2., 3., 4., 5.]))
+        pytest.raises(Exception, pivot, a, b, c)
+
+        # corner case, empty
+        df = pivot(np.array([]), np.array([]), np.array([]))


 def test_panel_index():
@@ -2538,8 +2685,3 @@ def test_panel_index():
                                       np.repeat([1, 2, 3], 4)],
                                      names=['time', 'panel'])
     tm.assert_index_equal(index, expected)
-
-
-if __name__ == '__main__':
-    nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
-                   exit=False)
diff --git a/pandas/tests/test_panel4d.py b/pandas/tests/test_panel4d.py
index 1b5a7b6ee1e83..96f02d63712fc 100644
--- a/pandas/tests/test_panel4d.py
+++ b/pandas/tests/test_panel4d.py
@@ -2,21 +2,18 @@
 from datetime import datetime
 from pandas.compat import range, lrange
 import operator
-import nose
-
+import pytest
+from warnings import catch_warnings
 import numpy as np

-from pandas.types.common import is_float_dtype
+from pandas.core.dtypes.common import is_float_dtype
 from pandas import Series, Index, isnull, notnull
 from pandas.core.panel import Panel
 from pandas.core.panel4d import Panel4D
 from pandas.core.series import remove_na
 from pandas.tseries.offsets import BDay
-from pandas.util.testing import (assert_panel_equal,
-                                 assert_panel4d_equal,
-                                 assert_frame_equal,
-                                 assert_series_equal,
+from pandas.util.testing import (assert_frame_equal, assert_series_equal,
                                  assert_almost_equal)
 import pandas.util.testing as tm

@@ -29,8 +26,6 @@ def add_nans(panel4d):

 class SafeForLongAndSparse(object):

-    _multiprocess_can_split_ = True
-
     def test_repr(self):
         repr(self.panel4d)

@@ -68,7 +63,7 @@ def test_skew(self):
         try:
             from scipy.stats import skew
         except ImportError:
-            raise nose.SkipTest("no scipy.stats.skew")
+            pytest.skip("no scipy.stats.skew")

         def this_skew(x):
             if len(x) < 3:
@@ -116,8 +111,8 @@ def _check_stat_op(self, name, alternative, obj=None, has_skipna=True):
             obj = self.panel4d

             # # set some NAs
-            # obj.ix[5:10] = np.nan
-            # obj.ix[15:20, -2:] = np.nan
+            # obj.loc[5:10] = np.nan
+            # obj.loc[15:20, -2:] = np.nan

         f = getattr(obj, name)

@@ -131,81 +126,76 @@ def skipna_wrapper(x):
             def wrapper(x):
                 return alternative(np.asarray(x))

-            for i in range(obj.ndim):
-                result = f(axis=i, skipna=False)
-                assert_panel_equal(result, obj.apply(wrapper, axis=i))
+            with catch_warnings(record=True):
+                for i in range(obj.ndim):
+                    result = f(axis=i, skipna=False)
+                    expected = obj.apply(wrapper, axis=i)
+                    tm.assert_panel_equal(result, expected)
         else:
             skipna_wrapper = alternative
             wrapper = alternative

-        for i in range(obj.ndim):
-            result = f(axis=i)
-            if not tm._incompat_bottleneck_version(name):
-                assert_panel_equal(result, obj.apply(skipna_wrapper, axis=i))
+        with catch_warnings(record=True):
+            for i in range(obj.ndim):
+                result = f(axis=i)
+                if not tm._incompat_bottleneck_version(name):
+                    expected = obj.apply(skipna_wrapper, axis=i)
+                    tm.assert_panel_equal(result, expected)

-        self.assertRaises(Exception, f, axis=obj.ndim)
+        pytest.raises(Exception, f, axis=obj.ndim)


 class SafeForSparse(object):

-    _multiprocess_can_split_ = True
-
-    @classmethod
-    def assert_panel_equal(cls, x, y):
-        assert_panel_equal(x, y)
-
-    @classmethod
-    def assert_panel4d_equal(cls, x, y):
-        assert_panel4d_equal(x, y)
-
     def test_get_axis(self):
-        assert(self.panel4d._get_axis(0) is self.panel4d.labels)
-        assert(self.panel4d._get_axis(1) is self.panel4d.items)
-        assert(self.panel4d._get_axis(2) is self.panel4d.major_axis)
-        assert(self.panel4d._get_axis(3) is self.panel4d.minor_axis)
+        assert self.panel4d._get_axis(0) is self.panel4d.labels
+        assert self.panel4d._get_axis(1) is self.panel4d.items
+        assert self.panel4d._get_axis(2) is self.panel4d.major_axis
+        assert self.panel4d._get_axis(3) is self.panel4d.minor_axis

     def test_set_axis(self):
-        new_labels = Index(np.arange(len(self.panel4d.labels)))
+        with catch_warnings(record=True):
+            new_labels = Index(np.arange(len(self.panel4d.labels)))

-        # TODO: unused?
-        # new_items = Index(np.arange(len(self.panel4d.items)))
+            # TODO: unused?
+            # new_items = Index(np.arange(len(self.panel4d.items)))

-        new_major = Index(np.arange(len(self.panel4d.major_axis)))
-        new_minor = Index(np.arange(len(self.panel4d.minor_axis)))
+            new_major = Index(np.arange(len(self.panel4d.major_axis)))
+            new_minor = Index(np.arange(len(self.panel4d.minor_axis)))

-        # ensure propagate to potentially prior-cached items too
+            # ensure propagate to potentially prior-cached items too

-        # TODO: unused?
-        # label = self.panel4d['l1']
+            # TODO: unused?
+            # label = self.panel4d['l1']

-        self.panel4d.labels = new_labels
+            self.panel4d.labels = new_labels

-        if hasattr(self.panel4d, '_item_cache'):
-            self.assertNotIn('l1', self.panel4d._item_cache)
-        self.assertIs(self.panel4d.labels, new_labels)
+            if hasattr(self.panel4d, '_item_cache'):
+                assert 'l1' not in self.panel4d._item_cache
+            assert self.panel4d.labels is new_labels

-        self.panel4d.major_axis = new_major
-        self.assertIs(self.panel4d[0].major_axis, new_major)
-        self.assertIs(self.panel4d.major_axis, new_major)
+            self.panel4d.major_axis = new_major
+            assert self.panel4d[0].major_axis is new_major
+            assert self.panel4d.major_axis is new_major

-        self.panel4d.minor_axis = new_minor
-        self.assertIs(self.panel4d[0].minor_axis, new_minor)
-        self.assertIs(self.panel4d.minor_axis, new_minor)
+            self.panel4d.minor_axis = new_minor
+            assert self.panel4d[0].minor_axis is new_minor
+            assert self.panel4d.minor_axis is new_minor

     def test_get_axis_number(self):
-        self.assertEqual(self.panel4d._get_axis_number('labels'), 0)
-        self.assertEqual(self.panel4d._get_axis_number('items'), 1)
-        self.assertEqual(self.panel4d._get_axis_number('major'), 2)
-        self.assertEqual(self.panel4d._get_axis_number('minor'), 3)
+        assert self.panel4d._get_axis_number('labels') == 0
+        assert self.panel4d._get_axis_number('items') == 1
+        assert self.panel4d._get_axis_number('major') == 2
+        assert self.panel4d._get_axis_number('minor') == 3

     def test_get_axis_name(self):
-        self.assertEqual(self.panel4d._get_axis_name(0), 'labels')
-        self.assertEqual(self.panel4d._get_axis_name(1), 'items')
-        self.assertEqual(self.panel4d._get_axis_name(2), 'major_axis')
-        self.assertEqual(self.panel4d._get_axis_name(3), 'minor_axis')
+        assert self.panel4d._get_axis_name(0) == 'labels'
+        assert self.panel4d._get_axis_name(1) == 'items'
+        assert self.panel4d._get_axis_name(2) == 'major_axis'
+        assert self.panel4d._get_axis_name(3) == 'minor_axis'

     def test_arith(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             self._test_op(self.panel4d, operator.add)
             self._test_op(self.panel4d, operator.sub)
             self._test_op(self.panel4d, operator.mul)
@@ -219,13 +209,13 @@ def test_arith(self):
             self._test_op(self.panel4d, lambda x, y: y / x)
             self._test_op(self.panel4d, lambda x, y: y ** x)

-            self.assertRaises(Exception, self.panel4d.__add__,
-                              self.panel4d['l1'])
+            pytest.raises(Exception, self.panel4d.__add__,
+                          self.panel4d['l1'])

     @staticmethod
     def _test_op(panel4d, op):
         result = op(panel4d, 1)
-        assert_panel_equal(result['l1'], op(panel4d['l1'], 1))
+        tm.assert_panel_equal(result['l1'], op(panel4d['l1'], 1))

     def test_keys(self):
         tm.equalContents(list(self.panel4d.keys()), self.panel4d.labels)
@@ -233,48 +223,48 @@ def test_keys(self):
     def test_iteritems(self):
         """Test panel4d.iteritems()"""

-        self.assertEqual(len(list(self.panel4d.iteritems())),
-                         len(self.panel4d.labels))
+        assert (len(list(self.panel4d.iteritems())) ==
+                len(self.panel4d.labels))

     def test_combinePanel4d(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             result = self.panel4d.add(self.panel4d)
-            self.assert_panel4d_equal(result, self.panel4d * 2)
+            tm.assert_panel4d_equal(result, self.panel4d * 2)

     def test_neg(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
-            self.assert_panel4d_equal(-self.panel4d, self.panel4d * -1)
+        with catch_warnings(record=True):
+            tm.assert_panel4d_equal(-self.panel4d, self.panel4d * -1)

     def test_select(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):

             p = self.panel4d

             # select labels
             result = p.select(lambda x: x in ('l1', 'l3'), axis='labels')
             expected = p.reindex(labels=['l1', 'l3'])
-            self.assert_panel4d_equal(result, expected)
+            tm.assert_panel4d_equal(result, expected)

             # select items
             result = p.select(lambda x: x in ('ItemA', 'ItemC'), axis='items')
             expected = p.reindex(items=['ItemA', 'ItemC'])
-            self.assert_panel4d_equal(result, expected)
+            tm.assert_panel4d_equal(result, expected)

             # select major_axis
             result = p.select(lambda x: x >= datetime(2000, 1, 15),
                               axis='major')
             new_major = p.major_axis[p.major_axis >= datetime(2000, 1, 15)]
             expected = p.reindex(major=new_major)
-            self.assert_panel4d_equal(result, expected)
+            tm.assert_panel4d_equal(result, expected)

             # select minor_axis
             result = p.select(lambda x: x in ('D', 'A'), axis=3)
             expected = p.reindex(minor=['A', 'D'])
-            self.assert_panel4d_equal(result, expected)
+            tm.assert_panel4d_equal(result, expected)

             # corner case, empty thing
             result = p.select(lambda x: x in ('foo',), axis='items')
-            self.assert_panel4d_equal(result, p.reindex(items=[]))
+            tm.assert_panel4d_equal(result, p.reindex(items=[]))

     def test_get_value(self):

@@ -287,15 +277,15 @@ def test_abs(self):

-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             result = self.panel4d.abs()
             expected = np.abs(self.panel4d)
-            self.assert_panel4d_equal(result, expected)
+            tm.assert_panel4d_equal(result, expected)

             p = self.panel4d['l1']
             result = p.abs()
             expected = np.abs(p)
-            assert_panel_equal(result, expected)
+            tm.assert_panel_equal(result, expected)

             df = p['ItemA']
             result = df.abs()
@@ -305,22 +295,20 @@ def test_abs(self):

 class CheckIndexing(object):

-    _multiprocess_can_split_ = True
-
     def test_getitem(self):
-        self.assertRaises(Exception, self.panel4d.__getitem__, 'ItemQ')
+        pytest.raises(Exception, self.panel4d.__getitem__, 'ItemQ')

     def test_delitem_and_pop(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):

             expected = self.panel4d['l2']
             result = self.panel4d.pop('l2')
-            assert_panel_equal(expected, result)
-            self.assertNotIn('l2', self.panel4d.labels)
+            tm.assert_panel_equal(expected, result)
+            assert 'l2' not in self.panel4d.labels

             del self.panel4d['l3']
-            self.assertNotIn('l3', self.panel4d.labels)
-            self.assertRaises(Exception, self.panel4d.__delitem__, 'l3')
+            assert 'l3' not in self.panel4d.labels
+            pytest.raises(Exception, self.panel4d.__delitem__, 'l3')

             values = np.empty((4, 4, 4, 4))
             values[0] = 0
@@ -334,63 +322,61 @@ def test_delitem_and_pop(self):
             # did we delete the right row?
             panel4dc = panel4d.copy()
             del panel4dc[0]
-            assert_panel_equal(panel4dc[1], panel4d[1])
-            assert_panel_equal(panel4dc[2], panel4d[2])
-            assert_panel_equal(panel4dc[3], panel4d[3])
+            tm.assert_panel_equal(panel4dc[1], panel4d[1])
+            tm.assert_panel_equal(panel4dc[2], panel4d[2])
+            tm.assert_panel_equal(panel4dc[3], panel4d[3])

             panel4dc = panel4d.copy()
             del panel4dc[1]
-            assert_panel_equal(panel4dc[0], panel4d[0])
-            assert_panel_equal(panel4dc[2], panel4d[2])
-            assert_panel_equal(panel4dc[3], panel4d[3])
+            tm.assert_panel_equal(panel4dc[0], panel4d[0])
+            tm.assert_panel_equal(panel4dc[2], panel4d[2])
+            tm.assert_panel_equal(panel4dc[3], panel4d[3])

             panel4dc = panel4d.copy()
             del panel4dc[2]
-            assert_panel_equal(panel4dc[1], panel4d[1])
-            assert_panel_equal(panel4dc[0], panel4d[0])
-            assert_panel_equal(panel4dc[3], panel4d[3])
+            tm.assert_panel_equal(panel4dc[1], panel4d[1])
+            tm.assert_panel_equal(panel4dc[0], panel4d[0])
+            tm.assert_panel_equal(panel4dc[3], panel4d[3])

             panel4dc = panel4d.copy()
             del panel4dc[3]
-            assert_panel_equal(panel4dc[1], panel4d[1])
-            assert_panel_equal(panel4dc[2], panel4d[2])
-            assert_panel_equal(panel4dc[0], panel4d[0])
+            tm.assert_panel_equal(panel4dc[1], panel4d[1])
+            tm.assert_panel_equal(panel4dc[2], panel4d[2])
+            tm.assert_panel_equal(panel4dc[0], panel4d[0])

     def test_setitem(self):
-        # LongPanel with one item
-        # lp = self.panel.filter(['ItemA', 'ItemB']).to_frame()
-        # self.assertRaises(Exception, self.panel.__setitem__,
-        #                   'ItemE', lp)
+        with catch_warnings(record=True):

-        # Panel
-        p = Panel(dict(
-            ItemA=self.panel4d['l1']['ItemA'][2:].filter(items=['A', 'B'])))
-        self.panel4d['l4'] = p
-        self.panel4d['l5'] = p
+            # Panel
+            p = Panel(dict(
+                ItemA=self.panel4d['l1']['ItemA'][2:].filter(
+                    items=['A', 'B'])))
+            self.panel4d['l4'] = p
+            self.panel4d['l5'] = p

-        p2 = self.panel4d['l4']
+            p2 = self.panel4d['l4']

-        assert_panel_equal(p, p2.reindex(items=p.items,
-                                         major_axis=p.major_axis,
-                                         minor_axis=p.minor_axis))
+            tm.assert_panel_equal(p, p2.reindex(items=p.items,
+                                                major_axis=p.major_axis,
+                                                minor_axis=p.minor_axis))

-        # scalar
-        self.panel4d['lG'] = 1
-        self.panel4d['lE'] = True
-        self.assertEqual(self.panel4d['lG'].values.dtype, np.int64)
-        self.assertEqual(self.panel4d['lE'].values.dtype, np.bool_)
+            # scalar
+            self.panel4d['lG'] = 1
+            self.panel4d['lE'] = True
+            assert self.panel4d['lG'].values.dtype == np.int64
+            assert self.panel4d['lE'].values.dtype == np.bool_

-        # object dtype
-        self.panel4d['lQ'] = 'foo'
-        self.assertEqual(self.panel4d['lQ'].values.dtype, np.object_)
+            # object dtype
+            self.panel4d['lQ'] = 'foo'
+            assert self.panel4d['lQ'].values.dtype == np.object_

-        # boolean dtype
-        self.panel4d['lP'] = self.panel4d['l1'] > 0
-        self.assertEqual(self.panel4d['lP'].values.dtype, np.bool_)
+            # boolean dtype
+            self.panel4d['lP'] = self.panel4d['l1'] > 0
+            assert self.panel4d['lP'].values.dtype == np.bool_

     def test_setitem_by_indexer(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):

             # Panel
             panel4dc = self.panel4d.copy()
@@ -398,34 +384,34 @@ def test_setitem_by_indexer(self):

             def func():
                 self.panel4d.iloc[0] = p
-            self.assertRaises(NotImplementedError, func)
+            pytest.raises(NotImplementedError, func)

             # DataFrame
             panel4dc = self.panel4d.copy()
             df = panel4dc.iloc[0, 0]
             df.iloc[:] = 1
             panel4dc.iloc[0, 0] = df
-            self.assertTrue((panel4dc.iloc[0, 0].values == 1).all())
+            assert (panel4dc.iloc[0, 0].values == 1).all()

             # Series
             panel4dc = self.panel4d.copy()
             s = panel4dc.iloc[0, 0, :, 0]
             s.iloc[:] = 1
             panel4dc.iloc[0, 0, :, 0] = s
-            self.assertTrue((panel4dc.iloc[0, 0, :, 0].values == 1).all())
+            assert (panel4dc.iloc[0, 0, :, 0].values == 1).all()

             # scalar
             panel4dc = self.panel4d.copy()
             panel4dc.iloc[0] = 1
             panel4dc.iloc[1] = True
             panel4dc.iloc[2] = 'foo'
-            self.assertTrue((panel4dc.iloc[0].values == 1).all())
-            self.assertTrue(panel4dc.iloc[1].values.all())
-            self.assertTrue((panel4dc.iloc[2].values == 'foo').all())
+            assert (panel4dc.iloc[0].values == 1).all()
+            assert panel4dc.iloc[1].values.all()
+            assert (panel4dc.iloc[2].values == 'foo').all()

     def test_setitem_by_indexer_mixed_type(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):

             # GH 8702
             self.panel4d['foo'] = 'bar'
@@ -434,12 +420,12 @@ def test_setitem_by_indexer_mixed_type(self):
             panel4dc.iloc[0] = 1
             panel4dc.iloc[1] = True
             panel4dc.iloc[2] = 'foo'
-            self.assertTrue((panel4dc.iloc[0].values == 1).all())
-            self.assertTrue(panel4dc.iloc[1].values.all())
-            self.assertTrue((panel4dc.iloc[2].values == 'foo').all())
+            assert (panel4dc.iloc[0].values == 1).all()
+            assert panel4dc.iloc[1].values.all()
+            assert (panel4dc.iloc[2].values == 'foo').all()

     def test_comparisons(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             p1 = tm.makePanel4D()
             p2 = tm.makePanel4D()

@@ -448,18 +434,18 @@ def test_comparisons(self):

             def test_comp(func):
                 result = func(p1, p2)
-                self.assert_numpy_array_equal(result.values,
-                                              func(p1.values, p2.values))
+                tm.assert_numpy_array_equal(result.values,
+                                            func(p1.values, p2.values))

                 # versus non-indexed same objs
-                self.assertRaises(Exception, func, p1, tp)
+                pytest.raises(Exception, func, p1, tp)

                 # versus different objs
-                self.assertRaises(Exception, func, p1, p)
+                pytest.raises(Exception, func, p1, p)

                 result3 = func(self.panel4d, 0)
-                self.assert_numpy_array_equal(result3.values,
-                                              func(self.panel4d.values, 0))
+                tm.assert_numpy_array_equal(result3.values,
+                                            func(self.panel4d.values, 0))

             with np.errstate(invalid='ignore'):
                 test_comp(operator.eq)
@@ -473,56 +459,62 @@ def test_major_xs(self):
         ref = self.panel4d['l1']['ItemA']

         idx = self.panel4d.major_axis[5]
-        xs = self.panel4d.major_xs(idx)
+        with catch_warnings(record=True):
+            xs = self.panel4d.major_xs(idx)

         assert_series_equal(xs['l1'].T['ItemA'], ref.xs(idx),
                             check_names=False)

         # not contained
         idx = self.panel4d.major_axis[0] - BDay()
-        self.assertRaises(Exception, self.panel4d.major_xs, idx)
+        pytest.raises(Exception, self.panel4d.major_xs, idx)

     def test_major_xs_mixed(self):
         self.panel4d['l4'] = 'foo'
-        xs = self.panel4d.major_xs(self.panel4d.major_axis[0])
-        self.assertEqual(xs['l1']['A'].dtype, np.float64)
-        self.assertEqual(xs['l4']['A'].dtype, np.object_)
+        with catch_warnings(record=True):
+            xs = self.panel4d.major_xs(self.panel4d.major_axis[0])
+            assert xs['l1']['A'].dtype == np.float64
+            assert xs['l4']['A'].dtype == np.object_

     def test_minor_xs(self):
         ref = self.panel4d['l1']['ItemA']

-        idx = self.panel4d.minor_axis[1]
-        xs = self.panel4d.minor_xs(idx)
+        with catch_warnings(record=True):
+            idx = self.panel4d.minor_axis[1]
+            xs = self.panel4d.minor_xs(idx)

         assert_series_equal(xs['l1'].T['ItemA'], ref[idx], check_names=False)

         # not contained
-        self.assertRaises(Exception, self.panel4d.minor_xs, 'E')
+        pytest.raises(Exception, self.panel4d.minor_xs, 'E')

     def test_minor_xs_mixed(self):
         self.panel4d['l4'] = 'foo'

-        xs = self.panel4d.minor_xs('D')
-        self.assertEqual(xs['l1'].T['ItemA'].dtype, np.float64)
-        self.assertEqual(xs['l4'].T['ItemA'].dtype, np.object_)
+        with catch_warnings(record=True):
+            xs = self.panel4d.minor_xs('D')
+            assert xs['l1'].T['ItemA'].dtype == np.float64
+            assert xs['l4'].T['ItemA'].dtype == np.object_

     def test_xs(self):
         l1 = self.panel4d.xs('l1', axis=0)
         expected = self.panel4d['l1']
-        assert_panel_equal(l1, expected)
+        tm.assert_panel_equal(l1, expected)

-        # view if possible
+        # View if possible
         l1_view = self.panel4d.xs('l1', axis=0)
         l1_view.values[:] = np.nan
-        self.assertTrue(np.isnan(self.panel4d['l1'].values).all())
+        assert np.isnan(self.panel4d['l1'].values).all()

-        # mixed-type
+        # Mixed-type
         self.panel4d['strings'] = 'foo'
-        result = self.panel4d.xs('D', axis=3)
-        self.assertIsNotNone(result.is_copy)
+        with catch_warnings(record=True):
+            result = self.panel4d.xs('D', axis=3)
+
+        assert result.is_copy is not None

     def test_getitem_fancy_labels(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             panel4d = self.panel4d

             labels = panel4d.labels[[1, 0]]
@@ -531,34 +523,34 @@ def test_getitem_fancy_labels(self):
             cols = ['D', 'C', 'F']

             # all 4 specified
-            assert_panel4d_equal(panel4d.ix[labels, items, dates, cols],
-                                 panel4d.reindex(labels=labels, items=items,
-                                                 major=dates, minor=cols))
+            tm.assert_panel4d_equal(panel4d.loc[labels, items, dates, cols],
+                                    panel4d.reindex(labels=labels, items=items,
+                                                    major=dates, minor=cols))

             # 3 specified
-            assert_panel4d_equal(panel4d.ix[:, items, dates, cols],
-                                 panel4d.reindex(items=items, major=dates,
-                                                 minor=cols))
+            tm.assert_panel4d_equal(panel4d.loc[:, items, dates, cols],
+                                    panel4d.reindex(items=items, major=dates,
+                                                    minor=cols))

             # 2 specified
-            assert_panel4d_equal(panel4d.ix[:, :, dates, cols],
-                                 panel4d.reindex(major=dates, minor=cols))
+            tm.assert_panel4d_equal(panel4d.loc[:, :, dates, cols],
+                                    panel4d.reindex(major=dates, minor=cols))

-            assert_panel4d_equal(panel4d.ix[:, items, :, cols],
-                                 panel4d.reindex(items=items, minor=cols))
+            tm.assert_panel4d_equal(panel4d.loc[:, items, :, cols],
+                                    panel4d.reindex(items=items, minor=cols))

-            assert_panel4d_equal(panel4d.ix[:, items, dates, :],
-                                 panel4d.reindex(items=items, major=dates))
+            tm.assert_panel4d_equal(panel4d.loc[:, items, dates, :],
+                                    panel4d.reindex(items=items, major=dates))

             # only 1
-            assert_panel4d_equal(panel4d.ix[:, items, :, :],
-                                 panel4d.reindex(items=items))
+            tm.assert_panel4d_equal(panel4d.loc[:, items, :, :],
+                                    panel4d.reindex(items=items))

-            assert_panel4d_equal(panel4d.ix[:, :, dates, :],
-                                 panel4d.reindex(major=dates))
+            tm.assert_panel4d_equal(panel4d.loc[:, :, dates, :],
+                                    panel4d.reindex(major=dates))

-            assert_panel4d_equal(panel4d.ix[:, :, :, cols],
-                                 panel4d.reindex(minor=cols))
+            tm.assert_panel4d_equal(panel4d.loc[:, :, :, cols],
+                                    panel4d.reindex(minor=cols))

     def test_getitem_fancy_slice(self):
         pass
@@ -578,62 +570,56 @@ def test_get_value(self):

     def test_set_value(self):

-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):

             for label in self.panel4d.labels:
                 for item in self.panel4d.items:
                     for mjr in self.panel4d.major_axis[::2]:
                         for mnr in self.panel4d.minor_axis:
                             self.panel4d.set_value(label, item, mjr, mnr, 1.)
-                            assert_almost_equal(
+                            tm.assert_almost_equal(
                                 self.panel4d[label][item][mnr][mjr], 1.)
             res3 = self.panel4d.set_value('l4', 'ItemE', 'foobar', 'baz', 5)
-            self.assertTrue(is_float_dtype(res3['l4'].values))
+            assert is_float_dtype(res3['l4'].values)

             # resize
             res = self.panel4d.set_value('l4', 'ItemE', 'foo', 'bar', 1.5)
-            tm.assertIsInstance(res, Panel4D)
-            self.assertIsNot(res, self.panel4d)
-            self.assertEqual(res.get_value('l4', 'ItemE', 'foo', 'bar'), 1.5)
+            assert isinstance(res, Panel4D)
+            assert res is not self.panel4d
+            assert res.get_value('l4', 'ItemE', 'foo', 'bar') == 1.5

             res3 = self.panel4d.set_value('l4', 'ItemE', 'foobar', 'baz', 5)
-            self.assertTrue(is_float_dtype(res3['l4'].values))
+            assert is_float_dtype(res3['l4'].values)


-class TestPanel4d(tm.TestCase, CheckIndexing, SafeForSparse,
+class TestPanel4d(CheckIndexing, SafeForSparse,
                   SafeForLongAndSparse):

-    _multiprocess_can_split_ = True
-
-    @classmethod
-    def assert_panel4d_equal(cls, x, y):
-        assert_panel4d_equal(x, y)
-
-    def setUp(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+    def setup_method(self, method):
+        with catch_warnings(record=True):
             self.panel4d = tm.makePanel4D(nper=8)
             add_nans(self.panel4d)

     def test_constructor(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):

             panel4d = Panel4D(self.panel4d._data)
-            self.assertIs(panel4d._data, self.panel4d._data)
+            assert panel4d._data is self.panel4d._data

             panel4d = Panel4D(self.panel4d._data, copy=True)
-            self.assertIsNot(panel4d._data, self.panel4d._data)
-            assert_panel4d_equal(panel4d, self.panel4d)
+            assert panel4d._data is not self.panel4d._data
+            tm.assert_panel4d_equal(panel4d, self.panel4d)

             vals = self.panel4d.values

             # no copy
             panel4d = Panel4D(vals)
-            self.assertIs(panel4d.values, vals)
+            assert panel4d.values is vals

             # copy
             panel4d = Panel4D(vals, copy=True)
-            self.assertIsNot(panel4d.values, vals)
+            assert panel4d.values is not vals

             # GH #8285, test when scalar data is used to construct a Panel4D
             # if dtype is not passed, it should be inferred
@@ -645,7 +631,7 @@ def test_constructor(self):
                 vals = np.empty((2, 3, 4, 5), dtype=dtype)
                 vals.fill(val)
                 expected = Panel4D(vals, dtype=dtype)
-                assert_panel4d_equal(panel4d, expected)
+                tm.assert_panel4d_equal(panel4d, expected)

             # test the case when dtype is passed
             panel4d = Panel4D(1, labels=range(2), items=range(
@@ -654,10 +640,10 @@ def test_constructor(self):
             vals.fill(1)

             expected = Panel4D(vals, dtype='float32')
-            assert_panel4d_equal(panel4d, expected)
+            tm.assert_panel4d_equal(panel4d, expected)

     def test_constructor_cast(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             zero_filled = self.panel4d.fillna(0)

             casted = Panel4D(zero_filled._data, dtype=int)
@@ -676,59 +662,60 @@ def test_constructor_cast(self):

             # can't cast
             data = [[['foo', 'bar', 'baz']]]
-            self.assertRaises(ValueError, Panel, data, dtype=float)
+            pytest.raises(ValueError, Panel, data, dtype=float)

     def test_consolidate(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
-            self.assertTrue(self.panel4d._data.is_consolidated())
+        with catch_warnings(record=True):
+            assert self.panel4d._data.is_consolidated()

             self.panel4d['foo'] = 1.
-            self.assertFalse(self.panel4d._data.is_consolidated())
+            assert not self.panel4d._data.is_consolidated()

-            panel4d = self.panel4d.consolidate()
-            self.assertTrue(panel4d._data.is_consolidated())
+            panel4d = self.panel4d._consolidate()
+            assert panel4d._data.is_consolidated()

     def test_ctor_dict(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
            l1 = self.panel4d['l1']
            l2 = self.panel4d['l2']

-            d = {'A': l1, 'B': l2.ix[['ItemB'], :, :]}
+            d = {'A': l1, 'B': l2.loc[['ItemB'], :, :]}

             panel4d = Panel4D(d)

-            assert_panel_equal(panel4d['A'], self.panel4d['l1'])
-            assert_frame_equal(panel4d.ix['B', 'ItemB', :, :],
-                               self.panel4d.ix['l2', ['ItemB'], :, :]['ItemB'])
+            tm.assert_panel_equal(panel4d['A'], self.panel4d['l1'])
+            tm.assert_frame_equal(panel4d.loc['B', 'ItemB', :, :],
+                                  self.panel4d.loc['l2', ['ItemB'],
+                                                   :, :]['ItemB'])

     def test_constructor_dict_mixed(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             data = dict((k, v.values) for k, v in self.panel4d.iteritems())
             result = Panel4D(data)

             exp_major = Index(np.arange(len(self.panel4d.major_axis)))
-            self.assert_index_equal(result.major_axis, exp_major)
+            tm.assert_index_equal(result.major_axis, exp_major)

             result = Panel4D(data,
                              labels=self.panel4d.labels,
                              items=self.panel4d.items,
                              major_axis=self.panel4d.major_axis,
                              minor_axis=self.panel4d.minor_axis)
-            assert_panel4d_equal(result, self.panel4d)
+            tm.assert_panel4d_equal(result, self.panel4d)

             data['l2'] = self.panel4d['l2']

             result = Panel4D(data)
-            assert_panel4d_equal(result, self.panel4d)
+            tm.assert_panel4d_equal(result, self.panel4d)

             # corner, blow up
             data['l2'] = data['l2']['ItemB']
-            self.assertRaises(Exception, Panel4D, data)
+            pytest.raises(Exception, Panel4D, data)

             data['l2'] = self.panel4d['l2'].values[:, :, :-1]
-            self.assertRaises(Exception, Panel4D, data)
+            pytest.raises(Exception, Panel4D, data)

     def test_constructor_resize(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             data = self.panel4d._data
             labels = self.panel4d.labels[:-1]
             items = self.panel4d.items[:-1]
@@ -739,36 +726,39 @@ def test_constructor_resize(self):
                               major_axis=major, minor_axis=minor)
             expected = self.panel4d.reindex(
                 labels=labels, items=items, major=major, minor=minor)
-            assert_panel4d_equal(result, expected)
+            tm.assert_panel4d_equal(result, expected)

             result = Panel4D(data, items=items, major_axis=major)
             expected = self.panel4d.reindex(items=items, major=major)
-            assert_panel4d_equal(result, expected)
+            tm.assert_panel4d_equal(result, expected)

             result = Panel4D(data, items=items)
             expected = self.panel4d.reindex(items=items)
-            assert_panel4d_equal(result, expected)
+            tm.assert_panel4d_equal(result, expected)

             result = Panel4D(data, minor_axis=minor)
             expected = self.panel4d.reindex(minor=minor)
-            assert_panel4d_equal(result, expected)
+            tm.assert_panel4d_equal(result, expected)

     def test_conform(self):
+        with catch_warnings(record=True):

-        p = self.panel4d['l1'].filter(items=['ItemA', 'ItemB'])
-        conformed = self.panel4d.conform(p)
+            p = self.panel4d['l1'].filter(items=['ItemA', 'ItemB'])
+            conformed = self.panel4d.conform(p)

-        tm.assert_index_equal(conformed.items, self.panel4d.labels)
-        tm.assert_index_equal(conformed.major_axis, self.panel4d.major_axis)
-        tm.assert_index_equal(conformed.minor_axis, self.panel4d.minor_axis)
+            tm.assert_index_equal(conformed.items, self.panel4d.labels)
+            tm.assert_index_equal(conformed.major_axis,
+                                  self.panel4d.major_axis)
+            tm.assert_index_equal(conformed.minor_axis,
+                                  self.panel4d.minor_axis)

     def test_reindex(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             ref = self.panel4d['l2']

             # labels
             result = self.panel4d.reindex(labels=['l1', 'l2'])
-            assert_panel_equal(result['l2'], ref)
+            tm.assert_panel_equal(result['l2'], ref)

             # items
             result = self.panel4d.reindex(items=['ItemA', 'ItemB'])
@@ -781,8 +771,8 @@ def test_reindex(self):
                 result['l2']['ItemB'], ref['ItemB'].reindex(index=new_major))

             # raise exception put both major and major_axis
-            self.assertRaises(Exception, self.panel4d.reindex,
-                              major_axis=new_major, major=new_major)
+            pytest.raises(Exception, self.panel4d.reindex,
+                          major_axis=new_major, major=new_major)

             # minor
             new_minor = list(self.panel4d.minor_axis[:2])
@@ -797,8 +787,8 @@ def test_reindex(self):

             # don't necessarily copy
             result = self.panel4d.reindex()
-            assert_panel4d_equal(result, self.panel4d)
-            self.assertFalse(result is self.panel4d)
+            tm.assert_panel4d_equal(result, self.panel4d)
+            assert result is not self.panel4d

             # with filling
             smaller_major = self.panel4d.major_axis[::5]
@@ -807,33 +797,34 @@ def test_reindex(self):
             larger = smaller.reindex(major=self.panel4d.major_axis,
                                      method='pad')

-            assert_panel_equal(larger.ix[:, :, self.panel4d.major_axis[1], :],
-                               smaller.ix[:, :, smaller_major[0], :])
+            tm.assert_panel_equal(larger.loc[:, :,
+                                             self.panel4d.major_axis[1], :],
+                                  smaller.loc[:, :, smaller_major[0], :])

             # don't necessarily copy
             result = self.panel4d.reindex(
                 major=self.panel4d.major_axis, copy=False)
-            assert_panel4d_equal(result, self.panel4d)
-            self.assertTrue(result is self.panel4d)
+            tm.assert_panel4d_equal(result, self.panel4d)
+            assert result is self.panel4d

     def test_not_hashable(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             p4D_empty = Panel4D()
-            self.assertRaises(TypeError, hash, p4D_empty)
-            self.assertRaises(TypeError, hash, self.panel4d)
+            pytest.raises(TypeError, hash, p4D_empty)
+            pytest.raises(TypeError, hash, self.panel4d)

     def test_reindex_like(self):
         # reindex_like
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             smaller = self.panel4d.reindex(labels=self.panel4d.labels[:-1],
                                            items=self.panel4d.items[:-1],
                                            major=self.panel4d.major_axis[:-1],
                                            minor=self.panel4d.minor_axis[:-1])
             smaller_like = self.panel4d.reindex_like(smaller)
-            assert_panel4d_equal(smaller, smaller_like)
+            tm.assert_panel4d_equal(smaller, smaller_like)

     def test_sort_index(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             import random

             rlabels = list(self.panel4d.labels)
@@ -847,47 +838,47 @@ def test_sort_index(self):
             random_order = self.panel4d.reindex(labels=rlabels)
             sorted_panel4d = random_order.sort_index(axis=0)
-            assert_panel4d_equal(sorted_panel4d, self.panel4d)
+            tm.assert_panel4d_equal(sorted_panel4d, self.panel4d)

     def test_fillna(self):

-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
-            self.assertFalse(np.isfinite(self.panel4d.values).all())
+        with catch_warnings(record=True):
+            assert not np.isfinite(self.panel4d.values).all()
             filled = self.panel4d.fillna(0)
-            self.assertTrue(np.isfinite(filled.values).all())
+            assert np.isfinite(filled.values).all()

-            self.assertRaises(NotImplementedError,
-                              self.panel4d.fillna, method='pad')
+            pytest.raises(NotImplementedError,
+                          self.panel4d.fillna, method='pad')

     def test_swapaxes(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             result = self.panel4d.swapaxes('labels', 'items')
-            self.assertIs(result.items, self.panel4d.labels)
+            assert result.items is self.panel4d.labels

             result = self.panel4d.swapaxes('labels', 'minor')
-            self.assertIs(result.labels, self.panel4d.minor_axis)
+            assert result.labels is self.panel4d.minor_axis

             result = self.panel4d.swapaxes('items', 'minor')
-            self.assertIs(result.items, self.panel4d.minor_axis)
+            assert result.items is self.panel4d.minor_axis

             result = self.panel4d.swapaxes('items', 'major')
-            self.assertIs(result.items, self.panel4d.major_axis)
+            assert result.items is self.panel4d.major_axis

             result = self.panel4d.swapaxes('major', 'minor')
-            self.assertIs(result.major_axis, self.panel4d.minor_axis)
+            assert result.major_axis is self.panel4d.minor_axis

             # this should also work
             result = self.panel4d.swapaxes(0, 1)
-            self.assertIs(result.labels, self.panel4d.items)
+            assert result.labels is self.panel4d.items

             # this works, but return a copy
             result = self.panel4d.swapaxes('items', 'items')
-            assert_panel4d_equal(self.panel4d, result)
-            self.assertNotEqual(id(self.panel4d), id(result))
+            tm.assert_panel4d_equal(self.panel4d, result)
+            assert id(self.panel4d) != id(result)

     def test_update(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             p4d = Panel4D([[[[1.5, np.nan, 3.],
                              [1.5, np.nan, 3.],
                              [1.5, np.nan, 3.],
@@ -911,7 +902,7 @@ def test_update(self):
                                   [1.5, np.nan, 3.],
                                   [1.5, np.nan, 3.]]]])

-            assert_panel4d_equal(p4d, expected)
+            tm.assert_panel4d_equal(p4d, expected)

     def test_dtypes(self):

@@ -920,12 +911,12 @@ def test_dtypes(self):
         assert_series_equal(result, expected)

     def test_repr_empty(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             empty = Panel4D()
             repr(empty)

     def test_rename(self):
-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):
             mapper = {'l1': 'foo',
                       'l2': 'bar',
@@ -933,23 +924,18 @@ def test_rename(self):
                       'l3': 'baz'}

             renamed = self.panel4d.rename_axis(mapper, axis=0)
             exp = Index(['foo', 'bar', 'baz'])
-            self.assert_index_equal(renamed.labels, exp)
+            tm.assert_index_equal(renamed.labels, exp)

             renamed = self.panel4d.rename_axis(str.lower, axis=3)
             exp = Index(['a', 'b', 'c', 'd'])
-            self.assert_index_equal(renamed.minor_axis, exp)
+            tm.assert_index_equal(renamed.minor_axis, exp)

             # don't copy
             renamed_nocopy = self.panel4d.rename_axis(mapper,
                                                       axis=0,
                                                       copy=False)
             renamed_nocopy['foo'] = 3.
-            self.assertTrue((self.panel4d['l1'].values == 3).all())
+            assert (self.panel4d['l1'].values == 3).all()

     def test_get_attr(self):
-        assert_panel_equal(self.panel4d['l1'], self.panel4d.l1)
-
-
-if __name__ == '__main__':
-    nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
-                   exit=False)
+        tm.assert_panel_equal(self.panel4d['l1'], self.panel4d.l1)
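Note: nearly every change in `test_panel4d.py` above follows one pattern: the old `tm.assert_produces_warning(FutureWarning, ...)` context manager is replaced by a plain `catch_warnings(record=True)` block, which silently collects the `FutureWarning` that the deprecated Panel4D machinery emits rather than asserting on it. A minimal sketch of the idiom, assuming the pandas version this patch targets (where `tm.makePanel4D` still exists; it has since been removed):

```python
# Sketch only: catch_warnings(record=True) collects warnings into a list
# instead of letting them escape, so calls to deprecated APIs do not fail
# a warnings-as-errors test run, and nothing is asserted about the warning.
from warnings import catch_warnings

import pandas.util.testing as tm

with catch_warnings(record=True) as seen:
    panel4d = tm.makePanel4D(nper=8)   # deprecated; emits FutureWarning
    result = panel4d.abs()

print([str(w.message) for w in seen])  # recorded warnings stay inspectable
```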
diff --git a/pandas/tests/test_panelnd.py b/pandas/tests/test_panelnd.py
index 2a69a65e8d55e..c473e3c09cc74 100644
--- a/pandas/tests/test_panelnd.py
+++ b/pandas/tests/test_panelnd.py
@@ -1,6 +1,7 @@
 # -*- coding: utf-8 -*-
-import nose
+import pytest
+from warnings import catch_warnings

 from pandas.core import panelnd
 from pandas.core.panel import Panel
@@ -8,14 +9,14 @@
 import pandas.util.testing as tm


-class TestPanelnd(tm.TestCase):
+class TestPanelnd(object):

-    def setUp(self):
+    def setup_method(self, method):
         pass

     def test_4d_construction(self):

-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):

             # create a 4D
             Panel4D = panelnd.create_nd_panel_factory(
@@ -31,7 +32,7 @@ def test_4d_construction(self):

     def test_4d_construction_alt(self):

-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):

             # create a 4D
             Panel4D = panelnd.create_nd_panel_factory(
@@ -48,22 +49,22 @@ def test_4d_construction_alt(self):
     def test_4d_construction_error(self):

         # create a 4D
-        self.assertRaises(Exception,
-                          panelnd.create_nd_panel_factory,
-                          klass_name='Panel4D',
-                          orders=['labels', 'items', 'major_axis',
-                                  'minor_axis'],
-                          slices={'items': 'items',
-                                  'major_axis': 'major_axis',
-                                  'minor_axis': 'minor_axis'},
-                          slicer='foo',
-                          aliases={'major': 'major_axis',
-                                   'minor': 'minor_axis'},
-                          stat_axis=2)
+        pytest.raises(Exception,
+                      panelnd.create_nd_panel_factory,
+                      klass_name='Panel4D',
+                      orders=['labels', 'items', 'major_axis',
+                              'minor_axis'],
+                      slices={'items': 'items',
+                              'major_axis': 'major_axis',
+                              'minor_axis': 'minor_axis'},
+                      slicer='foo',
+                      aliases={'major': 'major_axis',
+                               'minor': 'minor_axis'},
+                      stat_axis=2)

     def test_5d_construction(self):

-        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+        with catch_warnings(record=True):

             # create a 4D
             Panel4D = panelnd.create_nd_panel_factory(
@@ -94,14 +95,10 @@ def test_5d_construction(self):
             p5d = Panel5D(dict(C1=p4d))

             # slice back to 4d
-            results = p5d.ix['C1', :, :, 0:3, :]
-            expected = p4d.ix[:, :, 0:3, :]
+            results = p5d.iloc[p5d.cool1.get_loc('C1'), :, :, 0:3, :]
+            expected = p4d.iloc[:, :, 0:3, :]
             assert_panel_equal(results['L1'], expected['L1'])

             # test a transpose
             # results = p5d.transpose(1,2,3,4,0)
             # expected =
-
-
-if __name__ == '__main__':
-    nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
-                   exit=False)
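Note: the `test_panelnd.py` hunks above repeat the same mechanical nose-to-pytest conversion used throughout this patch: `tm.TestCase` subclasses become plain classes, `setUp` becomes pytest's `setup_method`, `self.assertRaises` becomes `pytest.raises`, and the `nose.runmodule` footer is dropped. A condensed before/after sketch with hypothetical names (`TestExample` and `self.data` are illustrative, not from the patch):

```python
import pytest


class TestExample(object):              # was: class TestExample(tm.TestCase)

    def setup_method(self, method):     # was: def setUp(self)
        self.data = [1, 2, 3]

    def test_out_of_bounds(self):
        # was: self.assertRaises(IndexError, self.data.__getitem__, 10)
        pytest.raises(IndexError, self.data.__getitem__, 10)

    def test_length(self):
        # was: self.assertEqual(len(self.data), 3)
        assert len(self.data) == 3
```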
diff --git a/pandas/tseries/tests/test_resample.py b/pandas/tests/test_resample.py
old mode 100755
new mode 100644
similarity index 85%
rename from pandas/tseries/tests/test_resample.py
rename to pandas/tests/test_resample.py
index b8c060c024867..9734431c8b012
--- a/pandas/tseries/tests/test_resample.py
+++ b/pandas/tests/test_resample.py
@@ -1,9 +1,10 @@
 # pylint: disable=E1101

+from warnings import catch_warnings
 from datetime import datetime, timedelta
 from functools import partial

-import nose
+import pytest
 import numpy as np

 import pandas as pd
@@ -12,22 +13,22 @@
 from pandas import (Series, DataFrame, Panel, Index, isnull,
                     notnull, Timestamp)

-from pandas.types.generic import ABCSeries, ABCDataFrame
+from pandas.core.dtypes.generic import ABCSeries, ABCDataFrame
 from pandas.compat import range, lrange, zip, product, OrderedDict
 from pandas.core.base import SpecificationError
-from pandas.core.common import UnsupportedFunctionCall
+from pandas.errors import UnsupportedFunctionCall
 from pandas.core.groupby import DataError
 from pandas.tseries.frequencies import MONTHS, DAYS
 from pandas.tseries.frequencies import to_offset
-from pandas.tseries.index import date_range
+from pandas.core.indexes.datetimes import date_range
 from pandas.tseries.offsets import Minute, BDay
-from pandas.tseries.period import period_range, PeriodIndex, Period
-from pandas.tseries.resample import (DatetimeIndex, TimeGrouper,
-                                     DatetimeIndexResampler)
-from pandas.tseries.tdi import timedelta_range
+from pandas.core.indexes.period import period_range, PeriodIndex, Period
+from pandas.core.resample import (DatetimeIndex, TimeGrouper,
+                                  DatetimeIndexResampler)
+from pandas.core.indexes.timedeltas import timedelta_range, TimedeltaIndex
 from pandas.util.testing import (assert_series_equal, assert_almost_equal,
                                  assert_frame_equal, assert_index_equal)
-from pandas._period import IncompatibleFrequency
+from pandas._libs.period import IncompatibleFrequency

 bday = BDay()

@@ -49,10 +50,9 @@ def _simple_pts(start, end, freq='D'):
     return Series(np.random.randn(len(rng)), index=rng)


-class TestResampleAPI(tm.TestCase):
-    _multiprocess_can_split_ = True
+class TestResampleAPI(object):

-    def setUp(self):
+    def setup_method(self, method):
         dti = DatetimeIndex(start=datetime(2005, 1, 1),
                             end=datetime(2005, 1, 10), freq='Min')

@@ -63,21 +63,20 @@ def setUp(self):
     def test_str(self):

         r = self.series.resample('H')
-        self.assertTrue(
-            'DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, '
-            'label=left, convention=start, base=0]' in str(r))
+        assert ('DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, '
+                'label=left, convention=start, base=0]' in str(r))

     def test_api(self):

         r = self.series.resample('H')
         result = r.mean()
-        self.assertIsInstance(result, Series)
-        self.assertEqual(len(result), 217)
+        assert isinstance(result, Series)
+        assert len(result) == 217

         r = self.series.to_frame().resample('H')
         result = r.mean()
-        self.assertIsInstance(result, DataFrame)
-        self.assertEqual(len(result), 217)
+        assert isinstance(result, DataFrame)
+        assert len(result) == 217

     def test_api_changes_v018(self):

@@ -85,7 +84,7 @@ def test_api_changes_v018(self):
         # to .resample(......).how()

         r = self.series.resample('H')
-        self.assertIsInstance(r, DatetimeIndexResampler)
+        assert isinstance(r, DatetimeIndexResampler)

         for how in ['sum', 'mean', 'prod', 'min', 'max', 'var', 'std']:
             with tm.assert_produces_warning(FutureWarning,
@@ -108,18 +107,18 @@ def test_api_changes_v018(self):

         # invalids as these can be setting operations
         r = self.series.resample('H')
-        self.assertRaises(ValueError, lambda: r.iloc[0])
-        self.assertRaises(ValueError, lambda: r.iat[0])
-        self.assertRaises(ValueError, lambda: r.ix[0])
-        self.assertRaises(ValueError, lambda: r.loc[
+        pytest.raises(ValueError, lambda: r.iloc[0])
+        pytest.raises(ValueError, lambda: r.iat[0])
+        pytest.raises(ValueError, lambda: r.loc[0])
+        pytest.raises(ValueError, lambda: r.loc[
             Timestamp('2013-01-01 00:00:00', offset='H')])
-        self.assertRaises(ValueError, lambda: r.at[
+        pytest.raises(ValueError, lambda: r.at[
             Timestamp('2013-01-01 00:00:00', offset='H')])

         def f():
             r[0] = 5
-        self.assertRaises(ValueError, f)
+        pytest.raises(ValueError, f)

         # str/repr
         r = self.series.resample('H')
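Note: the hunk above uses the callable form of `pytest.raises`, passing the exception type, a callable, and its arguments. pytest also offers an equivalent context-manager form, which later hunks in this file (for example the `test_selection_api_validation` changes) use. A small self-contained sketch, with a hypothetical `set_forbidden` helper:

```python
import pytest


def set_forbidden():
    raise ValueError("setting is not allowed")


# callable form, as in the hunk above
pytest.raises(ValueError, set_forbidden)

# equivalent context-manager form, as used elsewhere in this patch
with pytest.raises(ValueError):
    set_forbidden()
```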
@@ -133,10 +132,10 @@ def f():
         tm.assert_numpy_array_equal(np.array(r), np.array(r.mean()))

         # masquerade as Series/DataFrame as needed for API compat
-        self.assertTrue(isinstance(self.series.resample('H'), ABCSeries))
-        self.assertFalse(isinstance(self.frame.resample('H'), ABCSeries))
-        self.assertFalse(isinstance(self.series.resample('H'), ABCDataFrame))
-        self.assertTrue(isinstance(self.frame.resample('H'), ABCDataFrame))
+        assert isinstance(self.series.resample('H'), ABCSeries)
+        assert not isinstance(self.frame.resample('H'), ABCSeries)
+        assert not isinstance(self.series.resample('H'), ABCDataFrame)
+        assert isinstance(self.frame.resample('H'), ABCDataFrame)

         # bin numeric ops
         for op in ['__add__', '__mul__', '__truediv__', '__div__', '__sub__']:
@@ -147,7 +146,7 @@ def f():

             with tm.assert_produces_warning(FutureWarning,
                                             check_stacklevel=False):
-                self.assertIsInstance(getattr(r, op)(2), pd.Series)
+                assert isinstance(getattr(r, op)(2), pd.Series)

         # unary numeric ops
         for op in ['__pos__', '__neg__', '__abs__', '__inv__']:
@@ -158,7 +157,7 @@ def f():

             with tm.assert_produces_warning(FutureWarning,
                                             check_stacklevel=False):
-                self.assertIsInstance(getattr(r, op)(), pd.Series)
+                assert isinstance(getattr(r, op)(), pd.Series)

         # comparison ops
         for op in ['__lt__', '__le__', '__gt__', '__ge__', '__eq__', '__ne__']:
@@ -166,7 +165,7 @@ def f():

             with tm.assert_produces_warning(FutureWarning,
                                             check_stacklevel=False):
-                self.assertIsInstance(getattr(r, op)(2), pd.Series)
+                assert isinstance(getattr(r, op)(2), pd.Series)

         # IPython introspection shouldn't trigger warning GH 13618
         for op in ['_repr_json', '_repr_latex',
@@ -179,7 +178,7 @@ def f():
         df = self.series.to_frame('foo')

         # same as prior versions for DataFrame
-        self.assertRaises(KeyError, lambda: df.resample('H')[0])
+        pytest.raises(KeyError, lambda: df.resample('H')[0])

         # compat for Series
         # but we cannot be sure that we need a warning here
@@ -187,13 +186,13 @@ def f():
                                         check_stacklevel=False):
             result = self.series.resample('H')[0]
             expected = self.series.resample('H').mean()[0]
-            self.assertEqual(result, expected)
+            assert result == expected

         with tm.assert_produces_warning(FutureWarning,
                                         check_stacklevel=False):
             result = self.series.resample('H')['2005-01-09 23:00:00']
             expected = self.series.resample('H').mean()['2005-01-09 23:00:00']
-            self.assertEqual(result, expected)
+            assert result == expected

     def test_groupby_resample_api(self):

@@ -217,6 +216,20 @@ def test_groupby_resample_api(self):
                                lambda x: x.resample('1D').ffill())[['val']]
         assert_frame_equal(result, expected)

+    def test_groupby_resample_on_api(self):
+
+        # GH 15021
+        # .groupby(...).resample(on=...) results in an unexpected
+        # keyword warning.
+        df = pd.DataFrame({'key': ['A', 'B'] * 5,
+                           'dates': pd.date_range('2016-01-01', periods=10),
+                           'values': np.random.randn(10)})
+
+        expected = df.set_index('dates').groupby('key').resample('D').mean()
+
+        result = df.groupby('key').resample('D', on='dates').mean()
+        assert_frame_equal(result, expected)
+
     def test_plot_api(self):
         tm._skip_if_no_mpl()

@@ -241,7 +254,7 @@ def test_getitem(self):
         tm.assert_index_equal(r._selected_obj.columns, self.frame.columns)

         r = self.frame.resample('H')['B']
-        self.assertEqual(r._selected_obj.name, self.frame.columns[1])
+        assert r._selected_obj.name == self.frame.columns[1]

         # technically this is allowed
         r = self.frame.resample('H')['A', 'B']
@@ -255,10 +268,10 @@ def test_getitem(self):
     def test_select_bad_cols(self):

         g = self.frame.resample('H')
-        self.assertRaises(KeyError, g.__getitem__, ['D'])
+        pytest.raises(KeyError, g.__getitem__, ['D'])

-        self.assertRaises(KeyError, g.__getitem__, ['A', 'D'])
-        with tm.assertRaisesRegexp(KeyError, '^[^A]+$'):
+        pytest.raises(KeyError, g.__getitem__, ['A', 'D'])
+        with tm.assert_raises_regex(KeyError, '^[^A]+$'):
             # A should not be referenced as a bad column...
             # will have to rethink regex if you change message!
             g[['A', 'D']]
@@ -270,13 +283,13 @@ def test_attribute_access(self):

         # getting
         with tm.assert_produces_warning(FutureWarning,
                                         check_stacklevel=False):
-            self.assertRaises(AttributeError, lambda: r.F)
+            pytest.raises(AttributeError, lambda: r.F)

         # setting
         def f():
             r.F = 'bah'

-        self.assertRaises(ValueError, f)
+        pytest.raises(ValueError, f)

     def test_api_compat_before_use(self):

@@ -358,7 +371,7 @@ def test_fillna(self):
         result = r.fillna(method='bfill')
         assert_series_equal(result, expected)

-        with self.assertRaises(ValueError):
+        with pytest.raises(ValueError):
             r.fillna(0)

     def test_apply_without_aggregation(self):
@@ -381,8 +394,10 @@ def test_agg_consistency(self):

         r = df.resample('3T')

-        expected = r[['A', 'B', 'C']].agg({'r1': 'mean', 'r2': 'sum'})
-        result = r.agg({'r1': 'mean', 'r2': 'sum'})
+        with tm.assert_produces_warning(FutureWarning,
+                                        check_stacklevel=False):
+            expected = r[['A', 'B', 'C']].agg({'r1': 'mean', 'r2': 'sum'})
+            result = r.agg({'r1': 'mean', 'r2': 'sum'})
         assert_frame_equal(result, expected)

     # TODO: once GH 14008 is fixed, move these tests into
@@ -446,7 +461,9 @@ def test_agg(self):
         expected.columns = pd.MultiIndex.from_tuples([('A', 'mean'),
                                                       ('A', 'sum')])
         for t in cases:
-            result = t.aggregate({'A': {'mean': 'mean', 'sum': 'sum'}})
+            with tm.assert_produces_warning(FutureWarning,
+                                            check_stacklevel=False):
+                result = t.aggregate({'A': {'mean': 'mean', 'sum': 'sum'}})
             assert_frame_equal(result, expected, check_like=True)

         expected = pd.concat([a_mean, a_sum, b_mean, b_sum], axis=1)
@@ -455,8 +472,10 @@ def test_agg(self):
                                                       ('B', 'mean2'),
                                                       ('B', 'sum2')])
         for t in cases:
-            result = t.aggregate({'A': {'mean': 'mean', 'sum': 'sum'},
-                                  'B': {'mean2': 'mean', 'sum2': 'sum'}})
+            with tm.assert_produces_warning(FutureWarning,
+                                            check_stacklevel=False):
+                result = t.aggregate({'A': {'mean': 'mean', 'sum': 'sum'},
+                                      'B': {'mean2': 'mean', 'sum2': 'sum'}})
             assert_frame_equal(result, expected, check_like=True)

         expected = pd.concat([a_mean, a_std, b_mean, b_std], axis=1)
@@ -516,9 +535,12 @@ def test_agg_misc(self):
                                                       ('result1', 'B'),
                                                       ('result2', 'A'),
                                                       ('result2', 'B')])
+
         for t in cases:
-            result = t[['A', 'B']].agg(OrderedDict([('result1', np.sum),
-                                                    ('result2', np.mean)]))
+            with tm.assert_produces_warning(FutureWarning,
+                                            check_stacklevel=False):
+                result = t[['A', 'B']].agg(OrderedDict([('result1', np.sum),
+                                                        ('result2', np.mean)]))
             assert_frame_equal(result, expected, check_like=True)

         # agg with different hows
@@ -544,7 +566,9 @@ def test_agg_misc(self):

         # series like aggs
         for t in cases:
-            result = t['A'].agg({'A': ['sum', 'std']})
+            with tm.assert_produces_warning(FutureWarning,
+                                            check_stacklevel=False):
+                result = t['A'].agg({'A': ['sum', 'std']})
             expected = pd.concat([t['A'].sum(),
                                   t['A'].std()],
                                  axis=1)
@@ -559,17 +583,22 @@ def test_agg_misc(self):
                                                       ('A', 'std'),
                                                       ('B', 'mean'),
                                                       ('B', 'std')])
-            result = t['A'].agg({'A': ['sum', 'std'], 'B': ['mean', 'std']})
+            with tm.assert_produces_warning(FutureWarning,
+                                            check_stacklevel=False):
+                result = t['A'].agg({'A': ['sum', 'std'],
+                                     'B': ['mean', 'std']})
             assert_frame_equal(result, expected, check_like=True)

         # errors
         # invalid names in the agg specification
         for t in cases:
             def f():
-                t[['A']].agg({'A': ['sum', 'std'],
-                              'B': ['mean', 'std']})
+                with tm.assert_produces_warning(FutureWarning,
+                                                check_stacklevel=False):
+                    t[['A']].agg({'A': ['sum', 'std'],
+                                  'B': ['mean', 'std']})

-            self.assertRaises(SpecificationError, f)
+            pytest.raises(SpecificationError, f)

     def test_agg_nested_dicts(self):

@@ -596,7 +625,7 @@ def test_agg_nested_dicts(self):
             def f():
                 t.aggregate({'r1': {'A': ['mean', 'sum']},
                              'r2': {'B': ['mean', 'sum']}})
-            self.assertRaises(ValueError, f)
+            pytest.raises(ValueError, f)

         for t in cases:
             expected = pd.concat([t['A'].mean(), t['A'].std(), t['B'].mean(),
@@ -604,12 +633,16 @@ def f():
             expected.columns = pd.MultiIndex.from_tuples([('ra', 'mean'), (
                 'ra', 'std'), ('rb', 'mean'), ('rb', 'std')])

-            result = t[['A', 'B']].agg({'A': {'ra': ['mean', 'std']},
-                                        'B': {'rb': ['mean', 'std']}})
+            with tm.assert_produces_warning(FutureWarning,
+                                            check_stacklevel=False):
+                result = t[['A', 'B']].agg({'A': {'ra': ['mean', 'std']},
+                                            'B': {'rb': ['mean', 'std']}})
             assert_frame_equal(result, expected, check_like=True)

-            result = t.agg({'A': {'ra': ['mean', 'std']},
-                            'B': {'rb': ['mean', 'std']}})
+            with tm.assert_produces_warning(FutureWarning,
+                                            check_stacklevel=False):
+                result = t.agg({'A': {'ra': ['mean', 'std']},
+                                'B': {'rb': ['mean', 'std']}})
             assert_frame_equal(result, expected, check_like=True)

     def test_selection_api_validation(self):
@@ -625,23 +658,23 @@ def test_selection_api_validation(self):
                            index=index)

         # non DatetimeIndex
-        with tm.assertRaises(TypeError):
+        with pytest.raises(TypeError):
             df.resample('2D', level='v')

-        with tm.assertRaises(ValueError):
+        with pytest.raises(ValueError):
             df.resample('2D', on='date', level='d')

-        with tm.assertRaises(TypeError):
+        with pytest.raises(TypeError):
             df.resample('2D', on=['a', 'date'])

-        with tm.assertRaises(KeyError):
+        with pytest.raises(KeyError):
             df.resample('2D', level=['a', 'date'])

         # upsampling not allowed
-        with tm.assertRaises(ValueError):
+        with pytest.raises(ValueError):
             df.resample('2D', level='d').asfreq()

-        with tm.assertRaises(ValueError):
+        with pytest.raises(ValueError):
             df.resample('2D', on='date').asfreq()

         exp = df_exp.resample('2D').sum()
@@ -693,6 +726,24 @@ def test_asfreq_upsample(self):
         expected = frame.reindex(new_index)
         assert_frame_equal(result, expected)

+    def test_asfreq_fill_value(self):
+        # test for fill value during resampling, issue 3715
+
+        s = self.create_series()
+
+        result = s.resample('1H').asfreq()
+        new_index = self.create_index(s.index[0], s.index[-1], freq='1H')
+        expected = s.reindex(new_index)
+        assert_series_equal(result, expected)
+
+        frame = s.to_frame('value')
+        frame.iloc[1] = None
+        result = frame.resample('1H').asfreq(fill_value=4.0)
+        new_index = self.create_index(frame.index[0],
+                                      frame.index[-1], freq='1H')
+        expected = frame.reindex(new_index, fill_value=4.0)
+        assert_frame_equal(result, expected)
+
     def test_resample_interpolate(self):
         # # 12925
         df = self.create_series().to_frame('value')
@@ -703,7 +754,7 @@ def test_resample_interpolate(self):
     def test_raises_on_non_datetimelike_index(self):
         # this is a non datetimelike index
         xp = DataFrame()
-        self.assertRaises(TypeError, lambda: xp.resample('A').mean())
+        pytest.raises(TypeError, lambda: xp.resample('A').mean())

     def test_resample_empty_series(self):
         # GH12771 & GH12868
@@ -720,19 +771,8 @@ def test_resample_empty_series(self):
             expected = s.copy()
             expected.index = s.index._shallow_copy(freq=freq)
             assert_index_equal(result.index, expected.index)
-            self.assertEqual(result.index.freq, expected.index.freq)
-
-            if (method == 'size' and
-                    isinstance(result.index, PeriodIndex) and
-                    freq in ['M', 'D']):
-                # GH12871 - TODO: name should propagate, but currently
-                # doesn't on lower / same frequency with PeriodIndex
-                assert_series_equal(result, expected, check_dtype=False,
-                                    check_names=False)
-                # this assert will break when fixed
-                self.assertTrue(result.name is None)
-            else:
-                assert_series_equal(result, expected, check_dtype=False)
+            assert result.index.freq == expected.index.freq
+            assert_series_equal(result, expected, check_dtype=False)

     def test_resample_empty_dataframe(self):
         # GH13212
@@ -748,7 +788,7 @@ def test_resample_empty_dataframe(self):
             expected = f.copy()
             expected.index = f.index._shallow_copy(freq=freq)
             assert_index_equal(result.index, expected.index)
-            self.assertEqual(result.index.freq, expected.index.freq)
+            assert result.index.freq == expected.index.freq
             assert_frame_equal(result, expected, check_dtype=False)

             # test size for GH13212 (currently stays as df)
@@ -769,12 +809,48 @@ def test_resample_empty_dtypes(self):
                 # (ex: doing mean with dtype of np.object)
                 pass

+    def test_resample_loffset_arg_type(self):
+        # GH 13218, 15002
+        df = self.create_series().to_frame('value')
+        expected_means = [df.values[i:i + 2].mean()
+                          for i in range(0, len(df.values), 2)]
+        expected_index = self.create_index(df.index[0],
+                                           periods=len(df.index) / 2,
+                                           freq='2D')
+
+        # loffset coerces PeriodIndex to DateTimeIndex
+        if isinstance(expected_index, PeriodIndex):
+            expected_index = expected_index.to_timestamp()
+
+        expected_index += timedelta(hours=2)
+        expected = DataFrame({'value': expected_means}, index=expected_index)
+
+        for arg in ['mean', {'value': 'mean'}, ['mean']]:
+
+            result_agg = df.resample('2D', loffset='2H').agg(arg)

-class TestDatetimeIndex(Base, tm.TestCase):
-    _multiprocess_can_split_ = True
+            with tm.assert_produces_warning(FutureWarning,
+                                            check_stacklevel=False):
+                result_how = df.resample('2D', how=arg, loffset='2H')
+
+            if isinstance(arg, list):
+                expected.columns = pd.MultiIndex.from_tuples([('value',
+                                                               'mean')])
+
+            # GH 13022, 7687 - TODO: fix resample w/ TimedeltaIndex
+            if isinstance(expected.index, TimedeltaIndex):
+                with pytest.raises(AssertionError):
+                    assert_frame_equal(result_agg, expected)
+                    assert_frame_equal(result_how, expected)
+            else:
+                assert_frame_equal(result_agg, expected)
+                assert_frame_equal(result_how, expected)
+
+
+class TestDatetimeIndex(Base):
     _index_factory = lambda x: date_range

-    def setUp(self):
+    def setup_method(self, method):
         dti = DatetimeIndex(start=datetime(2005, 1, 1),
                             end=datetime(2005, 1, 10), freq='Min')

@@ -808,8 +884,8 @@ def test_custom_grouper(self):
         for f in funcs:
             g._cython_agg_general(f)

-        self.assertEqual(g.ngroups, 2593)
-        self.assertTrue(notnull(g.mean()).all())
+        assert g.ngroups == 2593
+        assert notnull(g.mean()).all()

         # construct expected val
         arr = [1] + [5] * 2592
@@ -825,8 +901,8 @@ def test_custom_grouper(self):
                        index=dti, dtype='float64')
         r = df.groupby(b).agg(np.sum)

-        self.assertEqual(len(r.columns), 10)
-        self.assertEqual(len(r.index), 2593)
+        assert len(r.columns) == 10
+        assert len(r.index) == 2593

     def test_resample_basic(self):
         rng = date_range('1/1/2000 00:00:00', '1/1/2000 00:13:00', freq='min',
@@ -838,7 +914,7 @@ def test_resample_basic(self):
         expected = Series([s[0], s[1:6].mean(), s[6:11].mean(),
                           s[11:].mean()], index=exp_idx)
         assert_series_equal(result, expected)
-        self.assertEqual(result.index.name, 'index')
+        assert result.index.name == 'index'

         result = s.resample('5min', closed='left', label='right').mean()

@@ -882,7 +958,7 @@ def _ohlc(group):
                 '5min', closed='right', label='right'), arg)()

             expected = s.groupby(grouplist).agg(func)
-            self.assertEqual(result.index.name, 'index')
+            assert result.index.name == 'index'
             if arg == 'ohlc':
                 expected = DataFrame(expected.values.tolist())
                 expected.columns = ['open', 'high', 'low', 'close']
@@ -906,11 +982,11 @@ def test_numpy_compat(self):

         for func in ('min', 'max', 'sum', 'prod',
                      'mean', 'var', 'std'):
-            tm.assertRaisesRegexp(UnsupportedFunctionCall, msg,
-                                  getattr(r, func),
-                                  func, 1, 2, 3)
-            tm.assertRaisesRegexp(UnsupportedFunctionCall, msg,
-                                  getattr(r, func), axis=1)
+            tm.assert_raises_regex(UnsupportedFunctionCall, msg,
+                                   getattr(r, func),
+                                   func, 1, 2, 3)
+            tm.assert_raises_regex(UnsupportedFunctionCall, msg,
+                                   getattr(r, func), axis=1)

     def test_resample_how_callables(self):
         # GH 7929
@@ -922,6 +998,7 @@ def fn(x, a=1):
             return str(type(x))

         class fn_class:
+
             def __call__(self, x):
                 return str(type(x))

@@ -1039,52 +1116,51 @@ def test_resample_basic_from_daily(self):

         # to weekly
         result = s.resample('w-sun').last()

-        self.assertEqual(len(result), 3)
-        self.assertTrue((result.index.dayofweek == [6, 6, 6]).all())
-        self.assertEqual(result.iloc[0], s['1/2/2005'])
-        self.assertEqual(result.iloc[1], s['1/9/2005'])
-        self.assertEqual(result.iloc[2], s.iloc[-1])
+        assert len(result) == 3
+        assert (result.index.dayofweek == [6, 6, 6]).all()
+        assert result.iloc[0] == s['1/2/2005']
+        assert result.iloc[1] == s['1/9/2005']
+        assert result.iloc[2] == s.iloc[-1]

         result = s.resample('W-MON').last()
-        self.assertEqual(len(result), 2)
-        self.assertTrue((result.index.dayofweek == [0, 0]).all())
-        self.assertEqual(result.iloc[0], s['1/3/2005'])
-        self.assertEqual(result.iloc[1], s['1/10/2005'])
+        assert len(result) == 2
+        assert (result.index.dayofweek == [0, 0]).all()
+        assert result.iloc[0] == s['1/3/2005']
+        assert result.iloc[1] == s['1/10/2005']

         result = s.resample('W-TUE').last()
-        self.assertEqual(len(result), 2)
-        self.assertTrue((result.index.dayofweek == [1, 1]).all())
-        self.assertEqual(result.iloc[0], s['1/4/2005'])
-        self.assertEqual(result.iloc[1], s['1/10/2005'])
+        assert len(result) == 2
+        assert (result.index.dayofweek == [1, 1]).all()
+        assert result.iloc[0] == s['1/4/2005']
+        assert result.iloc[1] == s['1/10/2005']

         result = s.resample('W-WED').last()
-        self.assertEqual(len(result), 2)
-        self.assertTrue((result.index.dayofweek == [2, 2]).all())
-        self.assertEqual(result.iloc[0], s['1/5/2005'])
-        self.assertEqual(result.iloc[1], s['1/10/2005'])
+        assert len(result) == 2
+        assert (result.index.dayofweek == [2, 2]).all()
+        assert result.iloc[0] == s['1/5/2005']
+        assert result.iloc[1] == s['1/10/2005']

         result = s.resample('W-THU').last()
-        self.assertEqual(len(result), 2)
-        self.assertTrue((result.index.dayofweek == [3, 3]).all())
-        self.assertEqual(result.iloc[0], s['1/6/2005'])
-        self.assertEqual(result.iloc[1], s['1/10/2005'])
+        assert len(result) == 2
+        assert (result.index.dayofweek == [3, 3]).all()
+        assert result.iloc[0] == s['1/6/2005']
+        assert result.iloc[1] == s['1/10/2005']

         result = s.resample('W-FRI').last()
-        self.assertEqual(len(result), 2)
-        self.assertTrue((result.index.dayofweek == [4, 4]).all())
-        self.assertEqual(result.iloc[0], s['1/7/2005'])
-        self.assertEqual(result.iloc[1], s['1/10/2005'])
+        assert len(result) == 2
+        assert (result.index.dayofweek == [4, 4]).all()
+        assert result.iloc[0] == s['1/7/2005']
+        assert result.iloc[1] == s['1/10/2005']

         # to biz day
         result = s.resample('B').last()
-        self.assertEqual(len(result), 7)
-        self.assertTrue((result.index.dayofweek == [
-            4, 0, 1, 2, 3, 4, 0
-        ]).all())
-        self.assertEqual(result.iloc[0], s['1/2/2005'])
-        self.assertEqual(result.iloc[1], s['1/3/2005'])
-        self.assertEqual(result.iloc[5], s['1/9/2005'])
-        self.assertEqual(result.index.name, 'index')
+        assert len(result) == 7
+        assert (result.index.dayofweek == [4, 0, 1, 2, 3, 4, 0]).all()
+
+        assert result.iloc[0] == s['1/2/2005']
+        assert result.iloc[1] == s['1/3/2005']
+        assert result.iloc[5] == s['1/9/2005']
+        assert result.index.name == 'index'

     def test_resample_upsampling_picked_but_not_correct(self):

@@ -1093,7 +1169,7 @@ def test_resample_upsampling_picked_but_not_correct(self):
         series = Series(1, index=dates)

         result = series.resample('D').mean()
-        self.assertEqual(result.index[0], dates[0])
+        assert result.index[0] == dates[0]

         # GH 5955
         # incorrect deciding to upsample when the axis frequency matches the
@@ -1154,7 +1230,7 @@ def test_resample_loffset(self):
                             loffset=Minute(1)).mean()
         assert_series_equal(result, expected)

-        self.assertEqual(result.index.freq, Minute(5))
+        assert result.index.freq == Minute(5)

         # from daily
         dti = DatetimeIndex(start=datetime(2005, 1, 1),
@@ -1164,7 +1240,7 @@ def test_resample_loffset(self):
         # to weekly
         result = ser.resample('w-sun').last()
         expected = ser.resample('w-sun', loffset=-bday).last()
-        self.assertEqual(result.index[0] - bday, expected.index[0])
+        assert result.index[0] - bday == expected.index[0]

     def test_resample_loffset_count(self):
         # GH 12725
@@ -1197,11 +1273,11 @@ def test_resample_upsample(self):

         # to minutely, by padding
         result = s.resample('Min').pad()
-        self.assertEqual(len(result), 12961)
-        self.assertEqual(result[0], s[0])
-        self.assertEqual(result[-1], s[-1])
+        assert len(result) == 12961
+        assert result[0] == s[0]
+        assert result[-1] == s[-1]

-        self.assertEqual(result.index.name, 'index')
+        assert result.index.name == 'index'

     def test_resample_how_method(self):
         # GH9915
@@ -1244,20 +1320,20 @@ def test_resample_ohlc(self):
         expect = s.groupby(grouper).agg(lambda x: x[-1])
         result = s.resample('5Min').ohlc()

-        self.assertEqual(len(result), len(expect))
-        self.assertEqual(len(result.columns), 4)
+        assert len(result) == len(expect)
+        assert len(result.columns) == 4

         xs = result.iloc[-2]
-        self.assertEqual(xs['open'], s[-6])
-        self.assertEqual(xs['high'], s[-6:-1].max())
-        self.assertEqual(xs['low'], s[-6:-1].min())
-        self.assertEqual(xs['close'], s[-2])
+        assert xs['open'] == s[-6]
+        assert xs['high'] == s[-6:-1].max()
+        assert xs['low'] == s[-6:-1].min()
+        assert xs['close'] == s[-2]

         xs = result.iloc[0]
-        self.assertEqual(xs['open'], s[0])
-        self.assertEqual(xs['high'], s[:5].max())
-        self.assertEqual(xs['low'], s[:5].min())
-        self.assertEqual(xs['close'], s[4])
+        assert xs['open'] == s[0]
+        assert xs['high'] == s[:5].max()
+        assert xs['low'] == s[:5].min()
+        assert xs['close'] == s[4]

     def test_resample_ohlc_result(self):

@@ -1267,10 +1343,10 @@ def test_resample_ohlc_result(self):
         s = Series(range(len(index)), index=index)

         a = s.loc[:'4-15-2000'].resample('30T').ohlc()
-        self.assertIsInstance(a, DataFrame)
+        assert isinstance(a, DataFrame)

         b = s.loc[:'4-14-2000'].resample('30T').ohlc()
-        self.assertIsInstance(b, DataFrame)
+        assert isinstance(b, DataFrame)

         # GH12348
         # raising on odd period
@@ -1334,9 +1410,9 @@ def test_resample_reresample(self):
         s = Series(np.random.rand(len(dti)), dti)
         bs = s.resample('B', closed='right', label='right').mean()
         result = bs.resample('8H').mean()
-        self.assertEqual(len(result), 22)
-        tm.assertIsInstance(result.index.freq, offsets.DateOffset)
-        self.assertEqual(result.index.freq, offsets.Hour(8))
+        assert len(result) == 22
+        assert isinstance(result.index.freq, offsets.DateOffset)
+        assert result.index.freq == offsets.Hour(8)

     def test_resample_timestamp_to_period(self):
         ts = _simple_ts('1/1/1990', '1/1/2000')
@@ -1373,13 +1449,13 @@ def _ohlc(group):

         resampled = ts.resample('5min', closed='right',
                                 label='right').ohlc()

-        self.assertTrue((resampled.ix['1/1/2000 00:00'] == ts[0]).all())
+        assert (resampled.loc['1/1/2000 00:00'] == ts[0]).all()

         exp = _ohlc(ts[1:31])
-        self.assertTrue((resampled.ix['1/1/2000 00:05'] == exp).all())
+        assert (resampled.loc['1/1/2000 00:05'] == exp).all()

         exp = _ohlc(ts['1/1/2000 5:55:01':])
-        self.assertTrue((resampled.ix['1/1/2000 6:00:00'] == exp).all())
+        assert (resampled.loc['1/1/2000 6:00:00'] == exp).all()

     def test_downsample_non_unique(self):
         rng = date_range('1/1/2000', '2/29/2000')
@@ -1389,7 +1465,7 @@ def test_downsample_non_unique(self):
         result = ts.resample('M').mean()

         expected = ts.groupby(lambda x: x.month).mean()
-        self.assertEqual(len(result), 2)
+        assert len(result) == 2
         assert_almost_equal(result[0], expected[1])
         assert_almost_equal(result[1], expected[2])

@@ -1399,7 +1475,7 @@ def test_asfreq_non_unique(self):
         rng2 = rng.repeat(2).values
         ts = Series(np.random.randn(len(rng2)), index=rng2)

-        self.assertRaises(Exception, ts.asfreq, 'B')
+        pytest.raises(Exception, ts.asfreq, 'B')

     def test_resample_axis1(self):
         rng = date_range('1/1/2000', '2/29/2000')
@@ -1414,44 +1490,47 @@ def test_resample_panel(self):
         rng = date_range('1/1/2000', '6/30/2000')
         n = len(rng)

-        panel = Panel(np.random.randn(3, n, 5),
-                      items=['one', 'two', 'three'],
-                      major_axis=rng,
-                      minor_axis=['a', 'b', 'c', 'd', 'e'])
+        with catch_warnings(record=True):
+            panel = Panel(np.random.randn(3, n, 5),
+                          items=['one', 'two', 'three'],
+                          major_axis=rng,
+                          minor_axis=['a', 'b', 'c', 'd', 'e'])

-        result = panel.resample('M', axis=1).mean()
+            result = panel.resample('M', axis=1).mean()

-        def p_apply(panel, f):
-            result = {}
-            for item in panel.items:
-                result[item] = f(panel[item])
-            return Panel(result, items=panel.items)
+            def p_apply(panel, f):
+                result = {}
+                for item in panel.items:
+                    result[item] = f(panel[item])
+                return Panel(result, items=panel.items)

-        expected = p_apply(panel, lambda x: x.resample('M').mean())
-        tm.assert_panel_equal(result, expected)
+            expected = p_apply(panel, lambda x: x.resample('M').mean())
+            tm.assert_panel_equal(result, expected)

-        panel2 = panel.swapaxes(1, 2)
-        result = panel2.resample('M', axis=2).mean()
-        expected = p_apply(panel2, lambda x: x.resample('M', axis=1).mean())
-        tm.assert_panel_equal(result, expected)
+            panel2 = panel.swapaxes(1, 2)
+            result = panel2.resample('M', axis=2).mean()
+            expected = p_apply(panel2,
+                               lambda x: x.resample('M', axis=1).mean())
+            tm.assert_panel_equal(result, expected)

     def test_resample_panel_numpy(self):
         rng = date_range('1/1/2000', '6/30/2000')
         n = len(rng)

-        panel = Panel(np.random.randn(3, n, 5),
-                      items=['one', 'two', 'three'],
-                      major_axis=rng,
-                      minor_axis=['a', 'b', 'c', 'd', 'e'])
+        with catch_warnings(record=True):
+            panel = Panel(np.random.randn(3, n, 5),
+                          items=['one', 'two', 'three'],
+                          major_axis=rng,
+                          minor_axis=['a', 'b', 'c', 'd', 'e'])

-        result = panel.resample('M', axis=1).apply(lambda x: x.mean(1))
-        expected = panel.resample('M', axis=1).mean()
-        tm.assert_panel_equal(result, expected)
+            result = panel.resample('M', axis=1).apply(lambda x: x.mean(1))
+            expected = panel.resample('M', axis=1).mean()
+            tm.assert_panel_equal(result, expected)

-        panel = panel.swapaxes(1, 2)
-        result = panel.resample('M', axis=2).apply(lambda x: x.mean(2))
-        expected = panel.resample('M', axis=2).mean()
-        tm.assert_panel_equal(result, expected)
+            panel = panel.swapaxes(1, 2)
+            result = panel.resample('M', axis=2).apply(lambda x: x.mean(2))
+            expected = panel.resample('M', axis=2).mean()
+            tm.assert_panel_equal(result, expected)

     def test_resample_anchored_ticks(self):
         # If a fixed delta (5 minute, 4 hour) evenly divides a day, we should
@@ -1496,7 +1575,7 @@ def test_resample_base(self):
         resampled = ts.resample('5min', base=2).mean()
         exp_rng = date_range('12/31/1999 23:57:00', '1/1/2000 01:57',
                              freq='5min')
-        self.assert_index_equal(resampled.index, exp_rng)
+        tm.assert_index_equal(resampled.index, exp_rng)

     def test_resample_base_with_timedeltaindex(self):

@@ -1510,8 +1589,8 @@ def test_resample_base_with_timedeltaindex(self):
         exp_without_base = timedelta_range(start='0s', end='25s', freq='2s')
         exp_with_base = timedelta_range(start='5s', end='29s', freq='2s')

-        self.assert_index_equal(without_base.index, exp_without_base)
-        self.assert_index_equal(with_base.index, exp_with_base)
+        tm.assert_index_equal(without_base.index, exp_without_base)
+        tm.assert_index_equal(with_base.index, exp_with_base)

     def test_resample_categorical_data_with_timedeltaindex(self):
         # GH #12169
@@ -1542,7 +1621,7 @@ def test_resample_to_period_monthly_buglet(self):
         result = ts.resample('M', kind='period').mean()

         exp_index = period_range('Jan-2000', 'Dec-2000', freq='M')
-        self.assert_index_equal(result.index, exp_index)
+        tm.assert_index_equal(result.index, exp_index)

     def test_period_with_agg(self):

@@ -1586,10 +1665,10 @@ def test_resample_dtype_preservation(self):
                        ).set_index('date')

         result = df.resample('1D').ffill()
-        self.assertEqual(result.val.dtype, np.int32)
+        assert result.val.dtype == np.int32

         result = df.groupby('group').resample('1D').ffill()
-        self.assertEqual(result.val.dtype, np.int32)
+        assert result.val.dtype == np.int32

     def test_weekly_resample_buglet(self):
         # #1327
@@ -1663,7 +1742,7 @@ def test_resample_anchored_intraday(self):

         ts = _simple_ts('2012-04-29 23:00', '2012-04-30 5:00', freq='h')
         resampled = ts.resample('M').mean()
-        self.assertEqual(len(resampled), 1)
+        assert len(resampled) == 1

     def test_resample_anchored_monthstart(self):
         ts = _simple_ts('1/1/2000', '12/31/2002')
@@ -1689,13 +1768,11 @@ def test_resample_anchored_multiday(self):

         # Ensure left closing works
         result = s.resample('2200L').mean()
-        self.assertEqual(result.index[-1],
-                         pd.Timestamp('2014-10-15 23:00:02.000'))
+        assert result.index[-1] == pd.Timestamp('2014-10-15 23:00:02.000')

         # Ensure right closing works
         result = s.resample('2200L', label='right').mean()
-        self.assertEqual(result.index[-1],
-                         pd.Timestamp('2014-10-15 23:00:04.200'))
+        assert result.index[-1] == pd.Timestamp('2014-10-15 23:00:04.200')

     def test_corner_cases(self):
         # miscellaneous test coverage
@@ -1705,18 +1782,18 @@ def test_corner_cases(self):

         result = ts.resample('5t', closed='right', label='left').mean()
         ex_index = date_range('1999-12-31 23:55', periods=4, freq='5t')
-        self.assert_index_equal(result.index, ex_index)
+        tm.assert_index_equal(result.index, ex_index)

         len0pts = _simple_pts('2007-01', '2010-05', freq='M')[:0]
         # it works
         result = len0pts.resample('A-DEC').mean()
-        self.assertEqual(len(result), 0)
+        assert len(result) == 0

         # resample to periods
         ts = _simple_ts('2000-04-28', '2000-04-30 11:00', freq='h')
         result = ts.resample('M', kind='period').mean()
-        self.assertEqual(len(result), 1)
-        self.assertEqual(result.index[0], Period('2000-04', freq='M'))
+        assert len(result) == 1
+        assert result.index[0] == Period('2000-04', freq='M')

     def test_anchored_lowercase_buglet(self):
         dates = date_range('4/16/2012 20:00', periods=50000, freq='s')
@@ -1731,7 +1808,7 @@ def test_upsample_apply_functions(self):
         ts = Series(np.random.randn(len(rng)), index=rng)

         result = ts.resample('20min').aggregate(['mean', 'sum'])
-        tm.assertIsInstance(result, DataFrame)
+        assert isinstance(result, DataFrame)

     def test_resample_not_monotonic(self):
         rng = pd.date_range('2012-06-12', periods=200, freq='h')
@@ -1777,10 +1854,12 @@ def test_how_lambda_functions(self):
         tm.assert_series_equal(result['foo'], foo_exp)
         tm.assert_series_equal(result['bar'], bar_exp)

+        # this is a MI Series, so comparing the names of the results
+        # doesn't make sense
         result = ts.resample('M').aggregate({'foo': lambda x: x.mean(),
                                              'bar': lambda x: x.std(ddof=1)})
-        tm.assert_series_equal(result['foo'], foo_exp)
-        tm.assert_series_equal(result['bar'], bar_exp)
+        tm.assert_series_equal(result['foo'], foo_exp, check_names=False)
+        tm.assert_series_equal(result['bar'], bar_exp, check_names=False)

     def test_resample_unequal_times(self):
         # #1772
@@ -1860,7 +1939,7 @@ def test_resample_nunique(self):
         g = df.groupby(pd.Grouper(freq='D'))
         expected = df.groupby(pd.TimeGrouper('D')).ID.apply(lambda x:
                                                             x.nunique())
-        self.assertEqual(expected.name, 'ID')
+        assert expected.name == 'ID'

         for t in [r, g]:
             result = r.ID.nunique()
@@ -1872,6 +1951,26 @@ def test_resample_nunique(self):
         result = df.ID.groupby(pd.Grouper(freq='D')).nunique()
         assert_series_equal(result, expected)

+    def test_resample_nunique_with_date_gap(self):
+        # GH 13453
+        index = pd.date_range('1-1-2000', '2-15-2000', freq='h')
+        index2 = pd.date_range('4-15-2000', '5-15-2000', freq='h')
+        index3 = index.append(index2)
+        s = pd.Series(range(len(index3)), index=index3, dtype='int64')
+        r = s.resample('M')
+
+        # Since all elements are unique, these should all be the same
+        results = [
+            r.count(),
+            r.nunique(),
+            r.agg(pd.Series.nunique),
+            r.agg('nunique')
+        ]
+
+        assert_series_equal(results[0], results[1])
+        assert_series_equal(results[0], results[2])
+        assert_series_equal(results[0], results[3])
+
     def test_resample_group_info(self):  # GH10914
         for n, k in product((10000, 100000), (10, 100, 1000)):
             dr = date_range(start='2015-08-27', periods=n // 10, freq='T')
@@ -2066,8 +2165,7 @@ def test_resample_datetime_values(self):

         tm.assert_series_equal(res, exp)


-class TestPeriodIndex(Base, tm.TestCase):
-    _multiprocess_can_split_ = True
+class 
TestPeriodIndex(Base): _index_factory = lambda x: period_range def create_series(self): @@ -2122,6 +2220,25 @@ def test_asfreq_upsample(self): result = frame.resample('1H').asfreq() assert_frame_equal(result, expected) + def test_asfreq_fill_value(self): + # test for fill value during resampling, issue 3715 + + s = self.create_series() + new_index = date_range(s.index[0].to_timestamp(how='start'), + (s.index[-1]).to_timestamp(how='start'), + freq='1H') + expected = s.to_timestamp().reindex(new_index, fill_value=4.0) + result = s.resample('1H', kind='timestamp').asfreq(fill_value=4.0) + assert_series_equal(result, expected) + + frame = s.to_frame('value') + new_index = date_range(frame.index[0].to_timestamp(how='start'), + (frame.index[-1]).to_timestamp(how='start'), + freq='1H') + expected = frame.to_timestamp().reindex(new_index, fill_value=3.0) + result = frame.resample('1H', kind='timestamp').asfreq(fill_value=3.0) + assert_frame_equal(result, expected) + def test_selection(self): index = self.create_series().index # This is a bug, these should be implemented @@ -2132,10 +2249,10 @@ def test_selection(self): np.arange(len(index), dtype=np.int64), index], names=['v', 'd'])) - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): df.resample('2D', on='date') - with tm.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): df.resample('2D', level='d') def test_annual_upsample_D_s_f(self): @@ -2198,10 +2315,10 @@ def test_basic_downsample(self): def test_not_subperiod(self): # These are incompatible period rules for resampling ts = _simple_pts('1/1/1990', '6/30/1995', freq='w-wed') - self.assertRaises(ValueError, lambda: ts.resample('a-dec').mean()) - self.assertRaises(ValueError, lambda: ts.resample('q-mar').mean()) - self.assertRaises(ValueError, lambda: ts.resample('M').mean()) - self.assertRaises(ValueError, lambda: ts.resample('w-thu').mean()) + pytest.raises(ValueError, lambda: ts.resample('a-dec').mean()) + pytest.raises(ValueError, lambda: ts.resample('q-mar').mean()) + pytest.raises(ValueError, lambda: ts.resample('M').mean()) + pytest.raises(ValueError, lambda: ts.resample('w-thu').mean()) def test_basic_upsample(self): ts = _simple_pts('1/1/1990', '6/30/1995', freq='M') @@ -2302,7 +2419,7 @@ def test_resample_same_freq(self): def test_resample_incompat_freq(self): - with self.assertRaises(IncompatibleFrequency): + with pytest.raises(IncompatibleFrequency): pd.Series(range(3), index=pd.period_range( start='2000', periods=3, freq='M')).resample('W').mean() @@ -2428,7 +2545,7 @@ def test_resample_fill_missing(self): def test_cant_fill_missing_dups(self): rng = PeriodIndex([2000, 2005, 2005, 2007, 2007], freq='A') s = Series(np.random.randn(5), index=rng) - self.assertRaises(Exception, lambda: s.resample('A').ffill()) + pytest.raises(Exception, lambda: s.resample('A').ffill()) def test_resample_5minute(self): rng = period_range('1/1/2000', '1/5/2000', freq='T') @@ -2458,7 +2575,7 @@ def test_resample_irregular_sparse(self): subset = s[:'2012-01-04 06:55'] result = subset.resample('10min').apply(len) - expected = s.resample('10min').apply(len).ix[result.index] + expected = s.resample('10min').apply(len).loc[result.index] assert_series_equal(result, expected) def test_resample_weekly_all_na(self): @@ -2467,7 +2584,7 @@ def test_resample_weekly_all_na(self): result = ts.resample('W-THU').asfreq() - self.assertTrue(result.isnull().all()) + assert result.isnull().all() result = ts.resample('W-THU').asfreq().ffill()[:-1] expected = 
ts.asfreq('W-THU').ffill() @@ -2545,7 +2662,7 @@ def test_closed_left_corner(self): ex_index = date_range(start='1/1/2012 9:30', freq='10min', periods=3) - self.assert_index_equal(result.index, ex_index) + tm.assert_index_equal(result.index, ex_index) assert_series_equal(result, exp) def test_quarterly_resampling(self): @@ -2572,8 +2689,8 @@ def test_resample_bms_2752(self): foo = pd.Series(index=pd.bdate_range('20000101', '20000201')) res1 = foo.resample("BMS").mean() res2 = foo.resample("BMS").mean().resample("B").mean() - self.assertEqual(res1.index[0], Timestamp('20000103')) - self.assertEqual(res1.index[0], res2.index[0]) + assert res1.index[0] == Timestamp('20000103') + assert res1.index[0] == res2.index[0] # def test_monthly_convention_span(self): # rng = period_range('2000-01', periods=3, freq='M') @@ -2656,8 +2773,7 @@ def test_evenly_divisible_with_no_extra_bins(self): assert_frame_equal(result, expected) -class TestTimedeltaIndex(Base, tm.TestCase): - _multiprocess_can_split_ = True +class TestTimedeltaIndex(Base): _index_factory = lambda x: timedelta_range def create_series(self): @@ -2678,8 +2794,9 @@ def test_asfreq_bug(self): assert_frame_equal(result, expected) -class TestResamplerGrouper(tm.TestCase): - def setUp(self): +class TestResamplerGrouper(object): + + def setup_method(self, method): self.frame = DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8, 'B': np.arange(40)}, index=date_range('1/1/2000', @@ -2850,16 +2967,31 @@ def test_consistency_with_window(self): df = self.frame expected = pd.Int64Index([1, 2, 3], name='A') result = df.groupby('A').resample('2s').mean() - self.assertEqual(result.index.nlevels, 2) + assert result.index.nlevels == 2 tm.assert_index_equal(result.index.levels[0], expected) result = df.groupby('A').rolling(20).mean() - self.assertEqual(result.index.nlevels, 2) + assert result.index.nlevels == 2 tm.assert_index_equal(result.index.levels[0], expected) + def test_median_duplicate_columns(self): + # GH 14233 + + df = pd.DataFrame(np.random.randn(20, 3), + columns=list('aaa'), + index=pd.date_range('2012-01-01', + periods=20, freq='s')) + df2 = df.copy() + df2.columns = ['a', 'b', 'c'] + expected = df2.resample('5s').median() + result = df.resample('5s').median() + expected.columns = result.columns + assert_frame_equal(result, expected) + + +class TestTimeGrouper(object): -class TestTimeGrouper(tm.TestCase): - def setUp(self): + def setup_method(self, method): self.ts = Series(np.random.randn(1000), index=date_range('1/1/2000', periods=1000)) @@ -2914,25 +3046,27 @@ def test_apply_iteration(self): # it works! 
result = grouped.apply(f) - self.assert_index_equal(result.index, df.index) + tm.assert_index_equal(result.index, df.index) def test_panel_aggregation(self): ind = pd.date_range('1/1/2000', periods=100) data = np.random.randn(2, len(ind), 4) - wp = pd.Panel(data, items=['Item1', 'Item2'], major_axis=ind, - minor_axis=['A', 'B', 'C', 'D']) - tg = TimeGrouper('M', axis=1) - _, grouper, _ = tg._get_grouper(wp) - bingrouped = wp.groupby(grouper) - binagg = bingrouped.mean() + with catch_warnings(record=True): + wp = Panel(data, items=['Item1', 'Item2'], major_axis=ind, + minor_axis=['A', 'B', 'C', 'D']) - def f(x): - assert (isinstance(x, Panel)) - return x.mean(1) + tg = TimeGrouper('M', axis=1) + _, grouper, _ = tg._get_grouper(wp) + bingrouped = wp.groupby(grouper) + binagg = bingrouped.mean() + + def f(x): + assert (isinstance(x, Panel)) + return x.mean(1) - result = bingrouped.agg(f) - tm.assert_panel_equal(result, binagg) + result = bingrouped.agg(f) + tm.assert_panel_equal(result, binagg) def test_fails_on_no_datetime_index(self): index_names = ('Int64Index', 'Index', 'Float64Index', 'MultiIndex') @@ -2943,17 +3077,18 @@ def test_fails_on_no_datetime_index(self): for name, func in zip(index_names, index_funcs): index = func(n) df = DataFrame({'a': np.random.randn(n)}, index=index) - with tm.assertRaisesRegexp(TypeError, - "Only valid with DatetimeIndex, " - "TimedeltaIndex or PeriodIndex, " - "but got an instance of %r" % name): + with tm.assert_raises_regex(TypeError, + "Only valid with " + "DatetimeIndex, TimedeltaIndex " + "or PeriodIndex, but got an " + "instance of %r" % name): df.groupby(TimeGrouper('D')) # PeriodIndex gives a specific error message df = DataFrame({'a': np.random.randn(n)}, index=tm.makePeriodIndex(n)) - with tm.assertRaisesRegexp(TypeError, - "axis must be a DatetimeIndex, but " - "got an instance of 'PeriodIndex'"): + with tm.assert_raises_regex(TypeError, + "axis must be a DatetimeIndex, but " + "got an instance of 'PeriodIndex'"): df.groupby(TimeGrouper('D')) def test_aaa_group_order(self): @@ -3082,12 +3217,7 @@ def test_aggregate_with_nat(self): dt_result = getattr(dt_grouped, func)() assert_series_equal(expected, dt_result) # GH 9925 - self.assertEqual(dt_result.index.name, 'key') + assert dt_result.index.name == 'key' # if NaT is included, 'var', 'std', 'mean', 'first','last' # and 'nth' doesn't work yet - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/test_sorting.py b/pandas/tests/test_sorting.py new file mode 100644 index 0000000000000..e09270bcadf27 --- /dev/null +++ b/pandas/tests/test_sorting.py @@ -0,0 +1,342 @@ +import pytest +from itertools import product +from collections import defaultdict + +import numpy as np +from numpy import nan +import pandas as pd +from pandas.core import common as com +from pandas import DataFrame, MultiIndex, merge, concat, Series, compat +from pandas.util import testing as tm +from pandas.util.testing import assert_frame_equal, assert_series_equal +from pandas.core.sorting import (is_int64_overflow_possible, + decons_group_index, + get_group_index, + nargsort, + lexsort_indexer) + + +class TestSorting(object): + + @pytest.mark.slow + def test_int64_overflow(self): + + B = np.concatenate((np.arange(1000), np.arange(1000), np.arange(500))) + A = np.arange(2500) + df = DataFrame({'A': A, + 'B': B, + 'C': A, + 'D': B, + 'E': A, + 'F': B, + 'G': A, + 'H': B, + 'values': np.random.randn(2500)}) + + lg = df.groupby(['A', 'B', 'C', 
'D', 'E', 'F', 'G', 'H']) + rg = df.groupby(['H', 'G', 'F', 'E', 'D', 'C', 'B', 'A']) + + left = lg.sum()['values'] + right = rg.sum()['values'] + + exp_index, _ = left.index.sortlevel() + tm.assert_index_equal(left.index, exp_index) + + exp_index, _ = right.index.sortlevel(0) + tm.assert_index_equal(right.index, exp_index) + + tups = list(map(tuple, df[['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H' + ]].values)) + tups = com._asarray_tuplesafe(tups) + + expected = df.groupby(tups).sum()['values'] + + for k, v in compat.iteritems(expected): + assert left[k] == right[k[::-1]] + assert left[k] == v + assert len(left) == len(right) + + def test_int64_overflow_moar(self): + + # GH9096 + values = range(55109) + data = pd.DataFrame.from_dict({'a': values, + 'b': values, + 'c': values, + 'd': values}) + grouped = data.groupby(['a', 'b', 'c', 'd']) + assert len(grouped) == len(values) + + arr = np.random.randint(-1 << 12, 1 << 12, (1 << 15, 5)) + i = np.random.choice(len(arr), len(arr) * 4) + arr = np.vstack((arr, arr[i])) # add some duplicate rows + + i = np.random.permutation(len(arr)) + arr = arr[i] # shuffle rows + + df = DataFrame(arr, columns=list('abcde')) + df['jim'], df['joe'] = np.random.randn(2, len(df)) * 10 + gr = df.groupby(list('abcde')) + + # verify this is testing what it is supposed to test! + assert is_int64_overflow_possible(gr.grouper.shape) + + # manually compute groupings + jim, joe = defaultdict(list), defaultdict(list) + for key, a, b in zip(map(tuple, arr), df['jim'], df['joe']): + jim[key].append(a) + joe[key].append(b) + + assert len(gr) == len(jim) + mi = MultiIndex.from_tuples(jim.keys(), names=list('abcde')) + + def aggr(func): + f = lambda a: np.fromiter(map(func, a), dtype='f8') + arr = np.vstack((f(jim.values()), f(joe.values()))).T + res = DataFrame(arr, columns=['jim', 'joe'], index=mi) + return res.sort_index() + + assert_frame_equal(gr.mean(), aggr(np.mean)) + assert_frame_equal(gr.median(), aggr(np.median)) + + def test_lexsort_indexer(self): + keys = [[nan] * 5 + list(range(100)) + [nan] * 5] + # orders=True, na_position='last' + result = lexsort_indexer(keys, orders=True, na_position='last') + exp = list(range(5, 105)) + list(range(5)) + list(range(105, 110)) + tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp)) + + # orders=True, na_position='first' + result = lexsort_indexer(keys, orders=True, na_position='first') + exp = list(range(5)) + list(range(105, 110)) + list(range(5, 105)) + tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp)) + + # orders=False, na_position='last' + result = lexsort_indexer(keys, orders=False, na_position='last') + exp = list(range(104, 4, -1)) + list(range(5)) + list(range(105, 110)) + tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp)) + + # orders=False, na_position='first' + result = lexsort_indexer(keys, orders=False, na_position='first') + exp = list(range(5)) + list(range(105, 110)) + list(range(104, 4, -1)) + tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp)) + + def test_nargsort(self): + # np.argsort(items) places NaNs last + items = [nan] * 5 + list(range(100)) + [nan] * 5 + # np.argsort(items2) may not place NaNs first + items2 = np.array(items, dtype='O') + + try: + # GH 2785; due to a regression in NumPy 1.6.2 + np.argsort(np.array([[1, 2], [1, 3], [1, 2]], dtype='i')) + np.argsort(items2, kind='mergesort') + except TypeError: + pytest.skip('requested sort not available for type') + + # mergesort is the most difficult to get right because we want it to be + # stable.
+ + # According to numpy/core/tests/test_multiarray, """The number of + # sorted items must be greater than ~50 to check the actual algorithm + # because quick and merge sort fall over to insertion sort for small + # arrays.""" + + # mergesort, ascending=True, na_position='last' + result = nargsort(items, kind='mergesort', ascending=True, + na_position='last') + exp = list(range(5, 105)) + list(range(5)) + list(range(105, 110)) + tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) + + # mergesort, ascending=True, na_position='first' + result = nargsort(items, kind='mergesort', ascending=True, + na_position='first') + exp = list(range(5)) + list(range(105, 110)) + list(range(5, 105)) + tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) + + # mergesort, ascending=False, na_position='last' + result = nargsort(items, kind='mergesort', ascending=False, + na_position='last') + exp = list(range(104, 4, -1)) + list(range(5)) + list(range(105, 110)) + tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) + + # mergesort, ascending=False, na_position='first' + result = nargsort(items, kind='mergesort', ascending=False, + na_position='first') + exp = list(range(5)) + list(range(105, 110)) + list(range(104, 4, -1)) + tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) + + # mergesort, ascending=True, na_position='last' + result = nargsort(items2, kind='mergesort', ascending=True, + na_position='last') + exp = list(range(5, 105)) + list(range(5)) + list(range(105, 110)) + tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) + + # mergesort, ascending=True, na_position='first' + result = nargsort(items2, kind='mergesort', ascending=True, + na_position='first') + exp = list(range(5)) + list(range(105, 110)) + list(range(5, 105)) + tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) + + # mergesort, ascending=False, na_position='last' + result = nargsort(items2, kind='mergesort', ascending=False, + na_position='last') + exp = list(range(104, 4, -1)) + list(range(5)) + list(range(105, 110)) + tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) + + # mergesort, ascending=False, na_position='first' + result = nargsort(items2, kind='mergesort', ascending=False, + na_position='first') + exp = list(range(5)) + list(range(105, 110)) + list(range(104, 4, -1)) + tm.assert_numpy_array_equal(result, np.array(exp), check_dtype=False) + + +class TestMerge(object): + + @pytest.mark.slow + def test_int64_overflow_issues(self): + + # #2690, combinatorial explosion + df1 = DataFrame(np.random.randn(1000, 7), + columns=list('ABCDEF') + ['G1']) + df2 = DataFrame(np.random.randn(1000, 7), + columns=list('ABCDEF') + ['G2']) + + # it works! 
+ result = merge(df1, df2, how='outer') + assert len(result) == 2000 + + low, high, n = -1 << 10, 1 << 10, 1 << 20 + left = DataFrame(np.random.randint(low, high, (n, 7)), + columns=list('ABCDEFG')) + left['left'] = left.sum(axis=1) + + # one-2-one match + i = np.random.permutation(len(left)) + right = left.iloc[i].copy() + right.columns = right.columns[:-1].tolist() + ['right'] + right.index = np.arange(len(right)) + right['right'] *= -1 + + out = merge(left, right, how='outer') + assert len(out) == len(left) + assert_series_equal(out['left'], - out['right'], check_names=False) + result = out.iloc[:, :-2].sum(axis=1) + assert_series_equal(out['left'], result, check_names=False) + assert result.name is None + + out.sort_values(out.columns.tolist(), inplace=True) + out.index = np.arange(len(out)) + for how in ['left', 'right', 'outer', 'inner']: + assert_frame_equal(out, merge(left, right, how=how, sort=True)) + + # check that left merge w/ sort=False maintains left frame order + out = merge(left, right, how='left', sort=False) + assert_frame_equal(left, out[left.columns.tolist()]) + + out = merge(right, left, how='left', sort=False) + assert_frame_equal(right, out[right.columns.tolist()]) + + # one-2-many/none match + n = 1 << 11 + left = DataFrame(np.random.randint(low, high, (n, 7)).astype('int64'), + columns=list('ABCDEFG')) + + # confirm that this is checking what it is supposed to check + shape = left.apply(Series.nunique).values + assert is_int64_overflow_possible(shape) + + # add duplicates to left frame + left = concat([left, left], ignore_index=True) + + right = DataFrame(np.random.randint(low, high, (n // 2, 7)) + .astype('int64'), + columns=list('ABCDEFG')) + + # add duplicates & overlap with left to the right frame + i = np.random.choice(len(left), n) + right = concat([right, right, left.iloc[i]], ignore_index=True) + + left['left'] = np.random.randn(len(left)) + right['right'] = np.random.randn(len(right)) + + # shuffle left & right frames + i = np.random.permutation(len(left)) + left = left.iloc[i].copy() + left.index = np.arange(len(left)) + + i = np.random.permutation(len(right)) + right = right.iloc[i].copy() + right.index = np.arange(len(right)) + + # manually compute outer merge + ldict, rdict = defaultdict(list), defaultdict(list) + + for idx, row in left.set_index(list('ABCDEFG')).iterrows(): + ldict[idx].append(row['left']) + + for idx, row in right.set_index(list('ABCDEFG')).iterrows(): + rdict[idx].append(row['right']) + + vals = [] + for k, lval in ldict.items(): + rval = rdict.get(k, [np.nan]) + for lv, rv in product(lval, rval): + vals.append(k + tuple([lv, rv])) + + for k, rval in rdict.items(): + if k not in ldict: + for rv in rval: + vals.append(k + tuple([np.nan, rv])) + + def align(df): + df = df.sort_values(df.columns.tolist()) + df.index = np.arange(len(df)) + return df + + def verify_order(df): + kcols = list('ABCDEFG') + assert_frame_equal(df[kcols].copy(), + df[kcols].sort_values(kcols, kind='mergesort')) + + out = DataFrame(vals, columns=list('ABCDEFG') + ['left', 'right']) + out = align(out) + + jmask = {'left': out['left'].notnull(), + 'right': out['right'].notnull(), + 'inner': out['left'].notnull() & out['right'].notnull(), + 'outer': np.ones(len(out), dtype='bool')} + + for how in 'left', 'right', 'outer', 'inner': + mask = jmask[how] + frame = align(out[mask].copy()) + assert mask.all() ^ mask.any() or how == 'outer' + + for sort in [False, True]: + res = merge(left, right, how=how, sort=sort) + if sort: + verify_order(res) + + # as in GH9092 
dtypes break with outer/right join + assert_frame_equal(frame, align(res), + check_dtype=how not in ('right', 'outer')) + + +def test_decons(): + + def testit(label_list, shape): + group_index = get_group_index(label_list, shape, sort=True, xnull=True) + label_list2 = decons_group_index(group_index, shape) + + for a, b in zip(label_list, label_list2): + assert (np.array_equal(a, b)) + + shape = (4, 5, 6) + label_list = [np.tile([0, 1, 2, 3, 0, 1, 2, 3], 100), np.tile( + [0, 2, 4, 3, 0, 1, 2, 3], 100), np.tile( + [5, 1, 0, 2, 3, 0, 5, 4], 100)] + testit(label_list, shape) + + shape = (10000, 10000) + label_list = [np.tile(np.arange(10000), 5), np.tile(np.arange(10000), 5)] + testit(label_list, shape) diff --git a/pandas/tests/test_stats.py b/pandas/tests/test_stats.py deleted file mode 100644 index 41d25b9662b5b..0000000000000 --- a/pandas/tests/test_stats.py +++ /dev/null @@ -1,192 +0,0 @@ -# -*- coding: utf-8 -*- -from pandas import compat -import nose - -from distutils.version import LooseVersion -from numpy import nan -import numpy as np - -from pandas import Series, DataFrame - -from pandas.compat import product -from pandas.util.testing import (assert_frame_equal, assert_series_equal) -import pandas.util.testing as tm - - -class TestRank(tm.TestCase): - _multiprocess_can_split_ = True - s = Series([1, 3, 4, 2, nan, 2, 1, 5, nan, 3]) - df = DataFrame({'A': s, 'B': s}) - - results = { - 'average': np.array([1.5, 5.5, 7.0, 3.5, nan, - 3.5, 1.5, 8.0, nan, 5.5]), - 'min': np.array([1, 5, 7, 3, nan, 3, 1, 8, nan, 5]), - 'max': np.array([2, 6, 7, 4, nan, 4, 2, 8, nan, 6]), - 'first': np.array([1, 5, 7, 3, nan, 4, 2, 8, nan, 6]), - 'dense': np.array([1, 3, 4, 2, nan, 2, 1, 5, nan, 3]), - } - - def test_rank_tie_methods(self): - s = self.s - - def _check(s, expected, method='average'): - result = s.rank(method=method) - tm.assert_series_equal(result, Series(expected)) - - dtypes = [None, object] - disabled = set([(object, 'first')]) - results = self.results - - for method, dtype in product(results, dtypes): - if (dtype, method) in disabled: - continue - series = s if dtype is None else s.astype(dtype) - _check(series, results[method], method=method) - - def test_rank_methods_series(self): - tm.skip_if_no_package('scipy', '0.13', 'scipy.stats.rankdata') - import scipy - from scipy.stats import rankdata - - xs = np.random.randn(9) - xs = np.concatenate([xs[i:] for i in range(0, 9, 2)]) # add duplicates - np.random.shuffle(xs) - - index = [chr(ord('a') + i) for i in range(len(xs))] - - for vals in [xs, xs + 1e6, xs * 1e-6]: - ts = Series(vals, index=index) - - for m in ['average', 'min', 'max', 'first', 'dense']: - result = ts.rank(method=m) - sprank = rankdata(vals, m if m != 'first' else 'ordinal') - expected = Series(sprank, index=index) - - if LooseVersion(scipy.__version__) >= '0.17.0': - expected = expected.astype('float64') - tm.assert_series_equal(result, expected) - - def test_rank_methods_frame(self): - tm.skip_if_no_package('scipy', '0.13', 'scipy.stats.rankdata') - import scipy - from scipy.stats import rankdata - - xs = np.random.randint(0, 21, (100, 26)) - xs = (xs - 10.0) / 10.0 - cols = [chr(ord('z') - i) for i in range(xs.shape[1])] - - for vals in [xs, xs + 1e6, xs * 1e-6]: - df = DataFrame(vals, columns=cols) - - for ax in [0, 1]: - for m in ['average', 'min', 'max', 'first', 'dense']: - result = df.rank(axis=ax, method=m) - sprank = np.apply_along_axis( - rankdata, ax, vals, - m if m != 'first' else 'ordinal') - sprank = sprank.astype(np.float64) - expected = DataFrame(sprank, 
columns=cols) - - if LooseVersion(scipy.__version__) >= '0.17.0': - expected = expected.astype('float64') - tm.assert_frame_equal(result, expected) - - def test_rank_dense_method(self): - dtypes = ['O', 'f8', 'i8'] - in_out = [([1], [1]), - ([2], [1]), - ([0], [1]), - ([2, 2], [1, 1]), - ([1, 2, 3], [1, 2, 3]), - ([4, 2, 1], [3, 2, 1],), - ([1, 1, 5, 5, 3], [1, 1, 3, 3, 2]), - ([-5, -4, -3, -2, -1], [1, 2, 3, 4, 5])] - - for ser, exp in in_out: - for dtype in dtypes: - s = Series(ser).astype(dtype) - result = s.rank(method='dense') - expected = Series(exp).astype(result.dtype) - assert_series_equal(result, expected) - - def test_rank_descending(self): - dtypes = ['O', 'f8', 'i8'] - - for dtype, method in product(dtypes, self.results): - if 'i' in dtype: - s = self.s.dropna() - df = self.df.dropna() - else: - s = self.s.astype(dtype) - df = self.df.astype(dtype) - - res = s.rank(ascending=False) - expected = (s.max() - s).rank() - assert_series_equal(res, expected) - - res = df.rank(ascending=False) - expected = (df.max() - df).rank() - assert_frame_equal(res, expected) - - if method == 'first' and dtype == 'O': - continue - - expected = (s.max() - s).rank(method=method) - res2 = s.rank(method=method, ascending=False) - assert_series_equal(res2, expected) - - expected = (df.max() - df).rank(method=method) - - if dtype != 'O': - res2 = df.rank(method=method, ascending=False, - numeric_only=True) - assert_frame_equal(res2, expected) - - res3 = df.rank(method=method, ascending=False, - numeric_only=False) - assert_frame_equal(res3, expected) - - def test_rank_2d_tie_methods(self): - df = self.df - - def _check2d(df, expected, method='average', axis=0): - exp_df = DataFrame({'A': expected, 'B': expected}) - - if axis == 1: - df = df.T - exp_df = exp_df.T - - result = df.rank(method=method, axis=axis) - assert_frame_equal(result, exp_df) - - dtypes = [None, object] - disabled = set([(object, 'first')]) - results = self.results - - for method, axis, dtype in product(results, [0, 1], dtypes): - if (dtype, method) in disabled: - continue - frame = df if dtype is None else df.astype(dtype) - _check2d(frame, results[method], method=method, axis=axis) - - def test_rank_int(self): - s = self.s.dropna().astype('i8') - - for method, res in compat.iteritems(self.results): - result = s.rank(method=method) - expected = Series(res).dropna() - expected.index = result.index - assert_series_equal(result, expected) - - def test_rank_object_bug(self): - # GH 13445 - - # smoke tests - Series([np.nan] * 32).astype(object).rank(ascending=True) - Series([np.nan] * 32).astype(object).rank(ascending=False) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/test_strings.py b/pandas/tests/test_strings.py index bbcd856250c51..f28a5926087ac 100644 --- a/pandas/tests/test_strings.py +++ b/pandas/tests/test_strings.py @@ -2,10 +2,9 @@ # pylint: disable-msg=E1101,W0612 from datetime import datetime, timedelta +import pytest import re -import nose - from numpy import nan as NA import numpy as np from numpy.random import randint @@ -20,21 +19,20 @@ import pandas.core.strings as strings -class TestStringMethods(tm.TestCase): - - _multiprocess_can_split_ = True +class TestStringMethods(object): def test_api(self): # GH 6106, GH 9322 - self.assertIs(Series.str, strings.StringMethods) - self.assertIsInstance(Series(['']).str, strings.StringMethods) + assert Series.str is strings.StringMethods + assert isinstance(Series(['']).str, 
strings.StringMethods) # GH 9184 invalid = Series([1]) - with tm.assertRaisesRegexp(AttributeError, "only use .str accessor"): + with tm.assert_raises_regex(AttributeError, + "only use .str accessor"): invalid.str - self.assertFalse(hasattr(invalid, 'str')) + assert not hasattr(invalid, 'str') def test_iter(self): # GH3638 @@ -43,7 +41,7 @@ def test_iter(self): for s in ds.str: # iter must yield a Series - tm.assertIsInstance(s, Series) + assert isinstance(s, Series) # indices of each yielded Series should be equal to the index of # the original Series @@ -51,13 +49,12 @@ def test_iter(self): for el in s: # each element of the series is either a basestring/str or nan - self.assertTrue(isinstance(el, compat.string_types) or - isnull(el)) + assert isinstance(el, compat.string_types) or isnull(el) # desired behavior is to iterate until everything would be nan on the # next iter so make sure the last element of the iterator was 'l' in # this case since 'wikitravel' is the longest string - self.assertEqual(s.dropna().values.item(), 'l') + assert s.dropna().values.item() == 'l' def test_iter_empty(self): ds = Series([], dtype=object) @@ -69,8 +66,8 @@ def test_iter_empty(self): # nothing to iterate over so nothing defined values should remain # unchanged - self.assertEqual(i, 100) - self.assertEqual(s, 1) + assert i == 100 + assert s == 1 def test_iter_single_element(self): ds = Series(['a']) @@ -78,7 +75,7 @@ def test_iter_single_element(self): for i, s in enumerate(ds.str): pass - self.assertFalse(i) + assert not i assert_series_equal(ds, s) def test_iter_object_try_string(self): @@ -90,8 +87,8 @@ def test_iter_object_try_string(self): for i, s in enumerate(ds.str): pass - self.assertEqual(i, 100) - self.assertEqual(s, 'h') + assert i == 100 + assert s == 'h' def test_cat(self): one = np.array(['a', 'a', 'b', 'b', 'c', NA], dtype=np.object_) @@ -100,29 +97,29 @@ def test_cat(self): # single array result = strings.str_cat(one) exp = 'aabbc' - self.assertEqual(result, exp) + assert result == exp result = strings.str_cat(one, na_rep='NA') exp = 'aabbcNA' - self.assertEqual(result, exp) + assert result == exp result = strings.str_cat(one, na_rep='-') exp = 'aabbc-' - self.assertEqual(result, exp) + assert result == exp result = strings.str_cat(one, sep='_', na_rep='NA') exp = 'a_a_b_b_c_NA' - self.assertEqual(result, exp) + assert result == exp result = strings.str_cat(two, sep='-') exp = 'a-b-d-foo' - self.assertEqual(result, exp) + assert result == exp # Multiple arrays result = strings.str_cat(one, [two], na_rep='NA') exp = np.array(['aa', 'aNA', 'bb', 'bd', 'cfoo', 'NANA'], dtype=np.object_) - self.assert_numpy_array_equal(result, exp) + tm.assert_numpy_array_equal(result, exp) result = strings.str_cat(one, two) exp = np.array(['aa', NA, 'bb', 'bd', 'cfoo', NA], dtype=np.object_) @@ -138,7 +135,7 @@ def test_count(self): result = Series(values).str.count('f[o]+') exp = Series([1, 2, NA, 4]) - tm.assertIsInstance(result, Series) + assert isinstance(result, Series) tm.assert_series_equal(result, exp) # mixed @@ -149,7 +146,7 @@ def test_count(self): rs = Series(mixed).str.count('a') xp = Series([1, NA, 0, NA, NA, 0, NA, NA, NA]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_series_equal(rs, xp) # unicode @@ -161,7 +158,7 @@ def test_count(self): result = Series(values).str.count('f[o]+') exp = Series([1, 2, NA, 4]) - tm.assertIsInstance(result, Series) + assert isinstance(result, Series) tm.assert_series_equal(result, exp) def test_contains(self): @@ -180,7 +177,7 
@@ def test_contains(self): values = ['foo', 'xyz', 'fooommm__foo', 'mmm_'] result = strings.str_contains(values, pat) expected = np.array([False, False, True, True]) - self.assertEqual(result.dtype, np.bool_) + assert result.dtype == np.bool_ tm.assert_numpy_array_equal(result, expected) # case insensitive using regex @@ -203,7 +200,7 @@ def test_contains(self): rs = Series(mixed).str.contains('o') xp = Series([False, NA, False, NA, NA, True, NA, NA, NA]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_series_equal(rs, xp) # unicode @@ -223,13 +220,13 @@ def test_contains(self): dtype=np.object_) result = strings.str_contains(values, pat) expected = np.array([False, False, True, True]) - self.assertEqual(result.dtype, np.bool_) + assert result.dtype == np.bool_ tm.assert_numpy_array_equal(result, expected) # na values = Series(['om', 'foo', np.nan]) res = values.str.contains('foo', na="foo") - self.assertEqual(res.ix[2], "foo") + assert res.loc[2] == "foo" def test_startswith(self): values = Series(['om', NA, 'foo_nom', 'nom', 'bar_foo', NA, 'foo']) @@ -247,7 +244,7 @@ def test_startswith(self): tm.assert_numpy_array_equal(rs, xp) rs = Series(mixed).str.startswith('f') - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) xp = Series([False, NA, False, NA, NA, True, NA, NA, NA]) tm.assert_series_equal(rs, xp) @@ -278,7 +275,7 @@ def test_endswith(self): rs = Series(mixed).str.endswith('f') xp = Series([False, NA, False, NA, NA, False, NA, NA, NA]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_series_equal(rs, xp) # unicode @@ -330,7 +327,7 @@ def test_lower_upper(self): mixed = mixed.str.upper() rs = Series(mixed).str.lower() xp = Series(['a', NA, 'b', NA, NA, 'foo', NA, NA, NA]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_series_equal(rs, xp) # unicode @@ -384,13 +381,11 @@ def test_swapcase(self): def test_casemethods(self): values = ['aaa', 'bbb', 'CCC', 'Dddd', 'eEEE'] s = Series(values) - self.assertEqual(s.str.lower().tolist(), [v.lower() for v in values]) - self.assertEqual(s.str.upper().tolist(), [v.upper() for v in values]) - self.assertEqual(s.str.title().tolist(), [v.title() for v in values]) - self.assertEqual(s.str.capitalize().tolist(), [ - v.capitalize() for v in values]) - self.assertEqual(s.str.swapcase().tolist(), [ - v.swapcase() for v in values]) + assert s.str.lower().tolist() == [v.lower() for v in values] + assert s.str.upper().tolist() == [v.upper() for v in values] + assert s.str.title().tolist() == [v.title() for v in values] + assert s.str.capitalize().tolist() == [v.capitalize() for v in values] + assert s.str.swapcase().tolist() == [v.swapcase() for v in values] def test_replace(self): values = Series(['fooBAD__barBAD', NA]) @@ -409,7 +404,7 @@ def test_replace(self): rs = Series(mixed).str.replace('BAD[_]*', '') xp = Series(['a', NA, 'b', NA, NA, 'foo', NA, NA, NA]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_almost_equal(rs, xp) # unicode @@ -434,7 +429,106 @@ def test_replace(self): for repl in (None, 3, {'a': 'b'}): for data in (['a', 'b', None], ['a', 'b', 'c', 'ad']): values = klass(data) - self.assertRaises(TypeError, values.str.replace, 'a', repl) + pytest.raises(TypeError, values.str.replace, 'a', repl) + + def test_replace_callable(self): + # GH 15055 + values = Series(['fooBAD__barBAD', NA]) + + # test with callable + repl = lambda m: m.group(0).swapcase() + result = values.str.replace('[a-z][A-Z]{2}', repl, n=2) + 
exp = Series(['foObaD__baRbaD', NA]) + tm.assert_series_equal(result, exp) + + # test with wrong number of arguments, raising an error + if compat.PY2: + p_err = r'takes (no|(exactly|at (least|most)) ?\d+) arguments?' + else: + p_err = (r'((takes)|(missing)) (?(2)from \d+ to )?\d+ ' + r'(?(3)required )positional arguments?') + + repl = lambda: None + with tm.assert_raises_regex(TypeError, p_err): + values.str.replace('a', repl) + + repl = lambda m, x: None + with tm.assert_raises_regex(TypeError, p_err): + values.str.replace('a', repl) + + repl = lambda m, x, y=None: None + with tm.assert_raises_regex(TypeError, p_err): + values.str.replace('a', repl) + + # test regex named groups + values = Series(['Foo Bar Baz', NA]) + pat = r"(?P<first>\w+) (?P<middle>\w+) (?P<last>\w+)" + repl = lambda m: m.group('middle').swapcase() + result = values.str.replace(pat, repl) + exp = Series(['bAR', NA]) + tm.assert_series_equal(result, exp) + + def test_replace_compiled_regex(self): + # GH 15446 + values = Series(['fooBAD__barBAD', NA]) + + # test with compiled regex + pat = re.compile(r'BAD[_]*') + result = values.str.replace(pat, '') + exp = Series(['foobar', NA]) + tm.assert_series_equal(result, exp) + + # mixed + mixed = Series(['aBAD', NA, 'bBAD', True, datetime.today(), 'fooBAD', + None, 1, 2.]) + + rs = Series(mixed).str.replace(pat, '') + xp = Series(['a', NA, 'b', NA, NA, 'foo', NA, NA, NA]) + assert isinstance(rs, Series) + tm.assert_almost_equal(rs, xp) + + # unicode + values = Series([u('fooBAD__barBAD'), NA]) + + result = values.str.replace(pat, '') + exp = Series([u('foobar'), NA]) + tm.assert_series_equal(result, exp) + + result = values.str.replace(pat, '', n=1) + exp = Series([u('foobarBAD'), NA]) + tm.assert_series_equal(result, exp) + + # flags + unicode + values = Series([b"abcd,\xc3\xa0".decode("utf-8")]) + exp = Series([b"abcd, \xc3\xa0".decode("utf-8")]) + pat = re.compile(r"(?<=\w),(?=\w)", flags=re.UNICODE) + result = values.str.replace(pat, ", ") + tm.assert_series_equal(result, exp) + + # case and flags provided to str.replace will have no effect + # and will produce warnings + values = Series(['fooBAD__barBAD__bad', NA]) + pat = re.compile(r'BAD[_]*') + + with tm.assert_raises_regex(ValueError, + "case and flags cannot be"): + result = values.str.replace(pat, '', flags=re.IGNORECASE) + + with tm.assert_raises_regex(ValueError, + "case and flags cannot be"): + result = values.str.replace(pat, '', case=False) + + with tm.assert_raises_regex(ValueError, + "case and flags cannot be"): + result = values.str.replace(pat, '', case=True) + + # test with callable + values = Series(['fooBAD__barBAD', NA]) + repl = lambda m: m.group(0).swapcase() + pat = re.compile('[a-z][A-Z]{2}') + result = values.str.replace(pat, repl, n=2) + exp = Series(['foObaD__baRbaD', NA]) + tm.assert_series_equal(result, exp) def test_repeat(self): values = Series(['a', 'b', NA, 'c', NA, 'd']) @@ -453,7 +547,7 @@ def test_repeat(self): rs = Series(mixed).str.repeat(3) xp = Series(['aaa', NA, 'bbb', NA, NA, 'foofoofoo', NA, NA, NA]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_series_equal(rs, xp) # unicode @@ -467,64 +561,44 @@ def test_repeat(self): exp = Series([u('a'), u('bb'), NA, u('cccc'), NA, u('dddddd')]) tm.assert_series_equal(result, exp) - def test_deprecated_match(self): - # Old match behavior, deprecated (but still default) in 0.13 + def test_match(self): + # New match behavior introduced in 0.13 values = Series(['fooBAD__barBAD', NA, 'foo']) - - with tm.assert_produces_warning(): -
result = values.str.match('.*(BAD[_]+).*(BAD)') - exp = Series([('BAD__', 'BAD'), NA, []]) - tm.assert_series_equal(result, exp) - - # mixed - mixed = Series(['aBAD_BAD', NA, 'BAD_b_BAD', True, datetime.today(), - 'foo', None, 1, 2.]) - - with tm.assert_produces_warning(): - rs = Series(mixed).str.match('.*(BAD[_]+).*(BAD)') - xp = Series([('BAD_', 'BAD'), NA, ('BAD_', 'BAD'), - NA, NA, [], NA, NA, NA]) - tm.assertIsInstance(rs, Series) - tm.assert_series_equal(rs, xp) - - # unicode - values = Series([u('fooBAD__barBAD'), NA, u('foo')]) - - with tm.assert_produces_warning(): - result = values.str.match('.*(BAD[_]+).*(BAD)') - exp = Series([(u('BAD__'), u('BAD')), NA, []]) + result = values.str.match('.*(BAD[_]+).*(BAD)') + exp = Series([True, NA, False]) tm.assert_series_equal(result, exp) - def test_match(self): - # New match behavior introduced in 0.13 values = Series(['fooBAD__barBAD', NA, 'foo']) - with tm.assert_produces_warning(): - result = values.str.match('.*(BAD[_]+).*(BAD)', as_indexer=True) + result = values.str.match('.*BAD[_]+.*BAD') exp = Series([True, NA, False]) tm.assert_series_equal(result, exp) - # If no groups, use new behavior even when as_indexer is False. - # (Old behavior is pretty much useless in this case.) + # test passing as_indexer still works but is ignored values = Series(['fooBAD__barBAD', NA, 'foo']) - result = values.str.match('.*BAD[_]+.*BAD', as_indexer=False) exp = Series([True, NA, False]) + with tm.assert_produces_warning(FutureWarning): + result = values.str.match('.*BAD[_]+.*BAD', as_indexer=True) tm.assert_series_equal(result, exp) + with tm.assert_produces_warning(FutureWarning): + result = values.str.match('.*BAD[_]+.*BAD', as_indexer=False) + tm.assert_series_equal(result, exp) + with tm.assert_produces_warning(FutureWarning): + result = values.str.match('.*(BAD[_]+).*(BAD)', as_indexer=True) + tm.assert_series_equal(result, exp) + pytest.raises(ValueError, values.str.match, '.*(BAD[_]+).*(BAD)', + as_indexer=False) # mixed mixed = Series(['aBAD_BAD', NA, 'BAD_b_BAD', True, datetime.today(), 'foo', None, 1, 2.]) - - with tm.assert_produces_warning(): - rs = Series(mixed).str.match('.*(BAD[_]+).*(BAD)', as_indexer=True) + rs = Series(mixed).str.match('.*(BAD[_]+).*(BAD)') xp = Series([True, NA, True, NA, NA, False, NA, NA, NA]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_series_equal(rs, xp) # unicode values = Series([u('fooBAD__barBAD'), NA, u('foo')]) - - with tm.assert_produces_warning(): - result = values.str.match('.*(BAD[_]+).*(BAD)', as_indexer=True) + result = values.str.match('.*(BAD[_]+).*(BAD)') exp = Series([True, NA, False]) tm.assert_series_equal(result, exp) @@ -575,7 +649,7 @@ def test_extract_expand_False(self): # Index only works with one regex group since # multi-group would expand to a frame idx = Index(['A1', 'A2', 'A3', 'A4', 'B5']) - with tm.assertRaisesRegexp(ValueError, "supported"): + with tm.assert_raises_regex(ValueError, "supported"): idx.str.extract('([AB])([123])', expand=False) # these should work for both Series and Index @@ -583,16 +657,16 @@ def test_extract_expand_False(self): # no groups s_or_idx = klass(['A1', 'B2', 'C3']) f = lambda: s_or_idx.str.extract('[ABC][123]', expand=False) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # only non-capturing groups f = lambda: s_or_idx.str.extract('(?:[AB]).*', expand=False) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # single group renames series/index properly s_or_idx = klass(['A1', 'A2']) 
result = s_or_idx.str.extract(r'(?P<uno>A)\d', expand=False) - self.assertEqual(result.name, 'uno') + assert result.name == 'uno' exp = klass(['A', 'A'], name='uno') if klass == Series: @@ -696,7 +770,7 @@ def check_index(index): r = s.str.extract(r'(?P<sue>[a-z])', expand=False) e = Series(['a', 'b', 'c'], name='sue') tm.assert_series_equal(r, e) - self.assertEqual(r.name, e.name) + assert r.name == e.name def test_extract_expand_True(self): # Contains tests like those in test_match and some others. @@ -728,16 +802,16 @@ def test_extract_expand_True(self): # no groups s_or_idx = klass(['A1', 'B2', 'C3']) f = lambda: s_or_idx.str.extract('[ABC][123]', expand=True) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # only non-capturing groups f = lambda: s_or_idx.str.extract('(?:[AB]).*', expand=True) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # single group renames series/index properly s_or_idx = klass(['A1', 'A2']) result_df = s_or_idx.str.extract(r'(?P<uno>A)\d', expand=True) - tm.assertIsInstance(result_df, DataFrame) + assert isinstance(result_df, DataFrame) result_series = result_df['uno'] assert_series_equal(result_series, Series(['A', 'A'], name='uno')) @@ -1051,7 +1125,7 @@ def test_extractall_errors(self): # no capture groups. (it returns DataFrame with one column for # each capture group) s = Series(['a3', 'b3', 'd4c2'], name='series_name') - with tm.assertRaisesRegexp(ValueError, "no capture groups"): + with tm.assert_raises_regex(ValueError, "no capture groups"): s.str.extractall(r'[a-z]') def test_extract_index_one_two_groups(self): @@ -1135,17 +1209,16 @@ def test_extractall_same_as_extract_subject_index(self): tm.assert_frame_equal(extract_one_noname, no_match_index) def test_empty_str_methods(self): - empty_str = empty = Series(dtype=str) + empty_str = empty = Series(dtype=object) empty_int = Series(dtype=int) empty_bool = Series(dtype=bool) - empty_list = Series(dtype=list) empty_bytes = Series(dtype=object) # GH7241 # (extract) on empty series tm.assert_series_equal(empty_str, empty.str.cat(empty)) - self.assertEqual('', empty.str.cat()) + assert '' == empty.str.cat() tm.assert_series_equal(empty_str, empty.str.title()) tm.assert_series_equal(empty_int, empty.str.count('a')) tm.assert_series_equal(empty_bool, empty.str.contains('a')) @@ -1169,25 +1242,24 @@ def test_empty_str_methods(self): DataFrame(columns=[0, 1], dtype=str), empty.str.extract('()()', expand=False)) tm.assert_frame_equal(DataFrame(dtype=str), empty.str.get_dummies()) - tm.assert_series_equal(empty_str, empty_list.str.join('')) + tm.assert_series_equal(empty_str, empty_str.str.join('')) tm.assert_series_equal(empty_int, empty.str.len()) - tm.assert_series_equal(empty_list, empty_list.str.findall('a')) + tm.assert_series_equal(empty_str, empty_str.str.findall('a')) tm.assert_series_equal(empty_int, empty.str.find('a')) tm.assert_series_equal(empty_int, empty.str.rfind('a')) tm.assert_series_equal(empty_str, empty.str.pad(42)) tm.assert_series_equal(empty_str, empty.str.center(42)) - tm.assert_series_equal(empty_list, empty.str.split('a')) - tm.assert_series_equal(empty_list, empty.str.rsplit('a')) - tm.assert_series_equal(empty_list, + tm.assert_series_equal(empty_str, empty.str.split('a')) + tm.assert_series_equal(empty_str, empty.str.rsplit('a')) + tm.assert_series_equal(empty_str, empty.str.partition('a', expand=False)) - tm.assert_series_equal(empty_list, + tm.assert_series_equal(empty_str, empty.str.rpartition('a', expand=False)) tm.assert_series_equal(empty_str,
empty.str.slice(stop=1)) tm.assert_series_equal(empty_str, empty.str.slice(step=1)) tm.assert_series_equal(empty_str, empty.str.strip()) tm.assert_series_equal(empty_str, empty.str.lstrip()) tm.assert_series_equal(empty_str, empty.str.rstrip()) - tm.assert_series_equal(empty_str, empty.str.rstrip()) tm.assert_series_equal(empty_str, empty.str.wrap(42)) tm.assert_series_equal(empty_str, empty.str.get(0)) tm.assert_series_equal(empty_str, empty_bytes.str.decode('ascii')) @@ -1248,20 +1320,13 @@ def test_ismethods(self): tm.assert_series_equal(str_s.str.isupper(), Series(upper_e)) tm.assert_series_equal(str_s.str.istitle(), Series(title_e)) - self.assertEqual(str_s.str.isalnum().tolist(), [v.isalnum() - for v in values]) - self.assertEqual(str_s.str.isalpha().tolist(), [v.isalpha() - for v in values]) - self.assertEqual(str_s.str.isdigit().tolist(), [v.isdigit() - for v in values]) - self.assertEqual(str_s.str.isspace().tolist(), [v.isspace() - for v in values]) - self.assertEqual(str_s.str.islower().tolist(), [v.islower() - for v in values]) - self.assertEqual(str_s.str.isupper().tolist(), [v.isupper() - for v in values]) - self.assertEqual(str_s.str.istitle().tolist(), [v.istitle() - for v in values]) + assert str_s.str.isalnum().tolist() == [v.isalnum() for v in values] + assert str_s.str.isalpha().tolist() == [v.isalpha() for v in values] + assert str_s.str.isdigit().tolist() == [v.isdigit() for v in values] + assert str_s.str.isspace().tolist() == [v.isspace() for v in values] + assert str_s.str.islower().tolist() == [v.islower() for v in values] + assert str_s.str.isupper().tolist() == [v.isupper() for v in values] + assert str_s.str.istitle().tolist() == [v.istitle() for v in values] def test_isnumeric(self): # 0x00bc: ¼ VULGAR FRACTION ONE QUARTER @@ -1276,10 +1341,8 @@ def test_isnumeric(self): tm.assert_series_equal(s.str.isdecimal(), Series(decimal_e)) unicodes = [u'A', u'3', u'¼', u'★', u'፸', u'3', u'four'] - self.assertEqual(s.str.isnumeric().tolist(), [ - v.isnumeric() for v in unicodes]) - self.assertEqual(s.str.isdecimal().tolist(), [ - v.isdecimal() for v in unicodes]) + assert s.str.isnumeric().tolist() == [v.isnumeric() for v in unicodes] + assert s.str.isdecimal().tolist() == [v.isdecimal() for v in unicodes] values = ['A', np.nan, u'¼', u'★', np.nan, u'3', 'four'] s = Series(values) @@ -1338,7 +1401,7 @@ def test_join(self): rs = Series(mixed).str.split('_').str.join('_') xp = Series(['a_b', NA, 'asdf_cas_asdf', NA, NA, 'foo', NA, NA, NA]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_almost_equal(rs, xp) # unicode @@ -1360,7 +1423,7 @@ def test_len(self): rs = Series(mixed).str.len() xp = Series([3, NA, 13, NA, NA, 3, NA, NA, NA]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_almost_equal(rs, xp) # unicode @@ -1385,7 +1448,7 @@ def test_findall(self): rs = Series(mixed).str.findall('BAD[_]*') xp = Series([['BAD__', 'BAD'], NA, [], NA, NA, ['BAD'], NA, NA, NA]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_almost_equal(rs, xp) # unicode @@ -1433,12 +1496,12 @@ def test_find(self): dtype=np.int64) tm.assert_numpy_array_equal(result.values, expected) - with tm.assertRaisesRegexp(TypeError, - "expected a string object, not int"): + with tm.assert_raises_regex(TypeError, + "expected a string object, not int"): result = values.str.find(0) - with tm.assertRaisesRegexp(TypeError, - "expected a string object, not int"): + with tm.assert_raises_regex(TypeError, + "expected a string object, 
not int"): result = values.str.rfind(0) def test_find_nan(self): @@ -1508,11 +1571,13 @@ def _check(result, expected): dtype=np.int64) tm.assert_numpy_array_equal(result.values, expected) - with tm.assertRaisesRegexp(ValueError, "substring not found"): + with tm.assert_raises_regex(ValueError, + "substring not found"): result = s.str.index('DE') - with tm.assertRaisesRegexp(TypeError, - "expected a string object, not int"): + with tm.assert_raises_regex(TypeError, + "expected a string " + "object, not int"): result = s.str.index(0) # test with nan @@ -1544,7 +1609,7 @@ def test_pad(self): rs = Series(mixed).str.pad(5, side='left') xp = Series([' a', NA, ' b', NA, NA, ' ee', NA, NA, NA]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_almost_equal(rs, xp) mixed = Series(['a', NA, 'b', True, datetime.today(), 'ee', None, 1, 2. @@ -1553,7 +1618,7 @@ def test_pad(self): rs = Series(mixed).str.pad(5, side='right') xp = Series(['a ', NA, 'b ', NA, NA, 'ee ', NA, NA, NA]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_almost_equal(rs, xp) mixed = Series(['a', NA, 'b', True, datetime.today(), 'ee', None, 1, 2. @@ -1562,7 +1627,7 @@ def test_pad(self): rs = Series(mixed).str.pad(5, side='both') xp = Series([' a ', NA, ' b ', NA, NA, ' ee ', NA, NA, NA]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_almost_equal(rs, xp) # unicode @@ -1596,12 +1661,14 @@ def test_pad_fillchar(self): exp = Series(['XXaXX', 'XXbXX', NA, 'XXcXX', NA, 'eeeeee']) tm.assert_almost_equal(result, exp) - with tm.assertRaisesRegexp(TypeError, - "fillchar must be a character, not str"): + with tm.assert_raises_regex(TypeError, + "fillchar must be a " + "character, not str"): result = values.str.pad(5, fillchar='XY') - with tm.assertRaisesRegexp(TypeError, - "fillchar must be a character, not int"): + with tm.assert_raises_regex(TypeError, + "fillchar must be a " + "character, not int"): result = values.str.pad(5, fillchar=5) def test_pad_width(self): @@ -1609,8 +1676,9 @@ def test_pad_width(self): s = Series(['1', '22', 'a', 'bb']) for f in ['center', 'ljust', 'rjust', 'zfill', 'pad']: - with tm.assertRaisesRegexp(TypeError, - "width must be of integer type, not*"): + with tm.assert_raises_regex(TypeError, + "width must be of " + "integer type, not*"): getattr(s.str, f)('f') def test_translate(self): @@ -1642,7 +1710,7 @@ def _check(result, expected): expected = klass(['abcde', 'abcc', 'cddd', 'cde']) _check(result, expected) else: - with tm.assertRaisesRegexp( + with tm.assert_raises_regex( ValueError, "deletechars is not a valid argument"): result = s.str.translate(table, deletechars='fg') @@ -1674,19 +1742,19 @@ def test_center_ljust_rjust(self): rs = Series(mixed).str.center(5) xp = Series([' a ', NA, ' b ', NA, NA, ' c ', ' eee ', NA, NA, NA ]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_almost_equal(rs, xp) rs = Series(mixed).str.ljust(5) xp = Series(['a ', NA, 'b ', NA, NA, 'c ', 'eee ', NA, NA, NA ]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_almost_equal(rs, xp) rs = Series(mixed).str.rjust(5) xp = Series([' a', NA, ' b', NA, NA, ' c', ' eee', NA, NA, NA ]) - tm.assertIsInstance(rs, Series) + assert isinstance(rs, Series) tm.assert_almost_equal(rs, xp) # unicode @@ -1731,28 +1799,34 @@ def test_center_ljust_rjust_fillchar(self): # If fillchar is not a charatter, normal str raises TypeError # 'aaa'.ljust(5, 'XY') # TypeError: must be char, not str - with 
tm.assertRaisesRegexp(TypeError,
-                                   "fillchar must be a character, not str"):
+        with tm.assert_raises_regex(TypeError,
+                                    "fillchar must be a "
+                                    "character, not str"):
             result = values.str.center(5, fillchar='XY')

-        with tm.assertRaisesRegexp(TypeError,
-                                   "fillchar must be a character, not str"):
+        with tm.assert_raises_regex(TypeError,
+                                    "fillchar must be a "
+                                    "character, not str"):
             result = values.str.ljust(5, fillchar='XY')

-        with tm.assertRaisesRegexp(TypeError,
-                                   "fillchar must be a character, not str"):
+        with tm.assert_raises_regex(TypeError,
+                                    "fillchar must be a "
+                                    "character, not str"):
             result = values.str.rjust(5, fillchar='XY')

-        with tm.assertRaisesRegexp(TypeError,
-                                   "fillchar must be a character, not int"):
+        with tm.assert_raises_regex(TypeError,
+                                    "fillchar must be a "
+                                    "character, not int"):
             result = values.str.center(5, fillchar=1)

-        with tm.assertRaisesRegexp(TypeError,
-                                   "fillchar must be a character, not int"):
+        with tm.assert_raises_regex(TypeError,
+                                    "fillchar must be a "
+                                    "character, not int"):
             result = values.str.ljust(5, fillchar=1)

-        with tm.assertRaisesRegexp(TypeError,
-                                   "fillchar must be a character, not int"):
+        with tm.assert_raises_regex(TypeError,
+                                    "fillchar must be a "
+                                    "character, not int"):
             result = values.str.rjust(5, fillchar=1)

     def test_zfill(self):
@@ -1798,11 +1872,11 @@ def test_split(self):
         result = mixed.str.split('_')
         exp = Series([['a', 'b', 'c'], NA, ['d', 'e', 'f'], NA, NA, NA, NA, NA
                       ])
-        tm.assertIsInstance(result, Series)
+        assert isinstance(result, Series)
         tm.assert_almost_equal(result, exp)

         result = mixed.str.split('_', expand=False)
-        tm.assertIsInstance(result, Series)
+        assert isinstance(result, Series)
         tm.assert_almost_equal(result, exp)

         # unicode
@@ -1843,11 +1917,11 @@ def test_rsplit(self):
         result = mixed.str.rsplit('_')
         exp = Series([['a', 'b', 'c'], NA, ['d', 'e', 'f'], NA, NA, NA, NA, NA
                       ])
-        tm.assertIsInstance(result, Series)
+        assert isinstance(result, Series)
         tm.assert_almost_equal(result, exp)

         result = mixed.str.rsplit('_', expand=False)
-        tm.assertIsInstance(result, Series)
+        assert isinstance(result, Series)
         tm.assert_almost_equal(result, exp)

         # unicode
@@ -1877,9 +1951,9 @@ def test_split_noargs(self):
         s = Series(['Wes McKinney', 'Travis Oliphant'])
         result = s.str.split()
         expected = ['Travis', 'Oliphant']
-        self.assertEqual(result[1], expected)
+        assert result[1] == expected

         result = s.str.rsplit()
-        self.assertEqual(result[1], expected)
+        assert result[1] == expected

     def test_split_maxsplit(self):
         # re.split 0, str.split -1
@@ -1934,7 +2008,7 @@ def test_split_to_dataframe(self):
                         index=['preserve', 'me'])
         tm.assert_frame_equal(result, exp)

-        with tm.assertRaisesRegexp(ValueError, "expand must be"):
+        with tm.assert_raises_regex(ValueError, "expand must be"):
             s.str.split('_', expand="not_a_boolean")

     def test_split_to_multiindex_expand(self):
@@ -1942,14 +2016,14 @@ def test_split_to_multiindex_expand(self):
         result = idx.str.split('_', expand=True)
         exp = idx
         tm.assert_index_equal(result, exp)
-        self.assertEqual(result.nlevels, 1)
+        assert result.nlevels == 1

         idx = Index(['some_equal_splits', 'with_no_nans'])
         result = idx.str.split('_', expand=True)
         exp = MultiIndex.from_tuples([('some', 'equal', 'splits'), (
             'with', 'no', 'nans')])
         tm.assert_index_equal(result, exp)
-        self.assertEqual(result.nlevels, 3)
+        assert result.nlevels == 3

         idx = Index(['some_unequal_splits', 'one_of_these_things_is_not'])
         result = idx.str.split('_', expand=True)
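The hunks above pin down the `expand=True` behaviour of `str.split` on an `Index`: each token becomes one level of a `MultiIndex`. A minimal sketch of what these assertions check, using the same fixture values as the tests (only the `pandas` import is assumed):

```python
import pandas as pd

idx = pd.Index(['some_equal_splits', 'with_no_nans'])
result = idx.str.split('_', expand=True)

# every underscore-separated token becomes one level of the MultiIndex
assert isinstance(result, pd.MultiIndex)
assert result.nlevels == 3
```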
@@ -1957,9 +2031,9 @@ def test_split_to_multiindex_expand(self):
                                       ), ('one',
                                           'of', 'these', 'things', 'is', 'not')])
         tm.assert_index_equal(result, exp)
-        self.assertEqual(result.nlevels, 6)
+        assert result.nlevels == 6

-        with tm.assertRaisesRegexp(ValueError, "expand must be"):
+        with tm.assert_raises_regex(ValueError, "expand must be"):
             idx.str.split('_', expand="not_a_boolean")

     def test_rsplit_to_dataframe_expand(self):
@@ -1996,21 +2070,21 @@ def test_rsplit_to_multiindex_expand(self):
         result = idx.str.rsplit('_', expand=True)
         exp = idx
         tm.assert_index_equal(result, exp)
-        self.assertEqual(result.nlevels, 1)
+        assert result.nlevels == 1

         idx = Index(['some_equal_splits', 'with_no_nans'])
         result = idx.str.rsplit('_', expand=True)
         exp = MultiIndex.from_tuples([('some', 'equal', 'splits'), (
             'with', 'no', 'nans')])
         tm.assert_index_equal(result, exp)
-        self.assertEqual(result.nlevels, 3)
+        assert result.nlevels == 3

         idx = Index(['some_equal_splits', 'with_no_nans'])
         result = idx.str.rsplit('_', expand=True, n=1)
         exp = MultiIndex.from_tuples([('some_equal', 'splits'),
                                       ('with_no', 'nans')])
         tm.assert_index_equal(result, exp)
-        self.assertEqual(result.nlevels, 2)
+        assert result.nlevels == 2

     def test_split_with_name(self):
         # GH 12617
@@ -2028,12 +2102,12 @@ def test_split_with_name(self):
         idx = Index(['a,b', 'c,d'], name='xxx')
         res = idx.str.split(',')
         exp = Index([['a', 'b'], ['c', 'd']], name='xxx')
-        self.assertTrue(res.nlevels, 1)
+        assert res.nlevels == 1
         tm.assert_index_equal(res, exp)

         res = idx.str.split(',', expand=True)
         exp = MultiIndex.from_tuples([('a', 'b'), ('c', 'd')])
-        self.assertTrue(res.nlevels, 2)
+        assert res.nlevels == 2
         tm.assert_index_equal(res, exp)

     def test_partition_series(self):
@@ -2099,9 +2173,9 @@ def test_partition_series(self):
         # compare to standard lib
         values = Series(['A_B_C', 'B_C_D', 'E_F_G', 'EFGHEF'])
         result = values.str.partition('_', expand=False).tolist()
-        self.assertEqual(result, [v.partition('_') for v in values])
+        assert result == [v.partition('_') for v in values]
         result = values.str.rpartition('_', expand=False).tolist()
-        self.assertEqual(result, [v.rpartition('_') for v in values])
+        assert result == [v.rpartition('_') for v in values]

     def test_partition_index(self):
         values = Index(['a_b_c', 'c_d_e', 'f_g_h'])
@@ -2110,25 +2184,25 @@ def test_partition_index(self):
         exp = Index(np.array([('a', '_', 'b_c'), ('c', '_', 'd_e'),
                               ('f', '_', 'g_h')]))
         tm.assert_index_equal(result, exp)
-        self.assertEqual(result.nlevels, 1)
+        assert result.nlevels == 1

         result = values.str.rpartition('_', expand=False)
         exp = Index(np.array([('a_b', '_', 'c'), ('c_d', '_', 'e'), (
             'f_g', '_', 'h')]))
         tm.assert_index_equal(result, exp)
-        self.assertEqual(result.nlevels, 1)
+        assert result.nlevels == 1

         result = values.str.partition('_')
         exp = Index([('a', '_', 'b_c'), ('c', '_', 'd_e'), ('f', '_', 'g_h')])
         tm.assert_index_equal(result, exp)
-        self.assertTrue(isinstance(result, MultiIndex))
-        self.assertEqual(result.nlevels, 3)
+        assert isinstance(result, MultiIndex)
+        assert result.nlevels == 3

         result = values.str.rpartition('_')
         exp = Index([('a_b', '_', 'c'), ('c_d', '_', 'e'), ('f_g', '_', 'h')])
         tm.assert_index_equal(result, exp)
-        self.assertTrue(isinstance(result, MultiIndex))
-        self.assertEqual(result.nlevels, 3)
+        assert isinstance(result, MultiIndex)
+        assert result.nlevels == 3

     def test_partition_to_dataframe(self):
         values = Series(['a_b_c', 'c_d_e', NA, 'f_g_h'])
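`str.partition` keeps the separator, so every element expands to exactly three parts, and the tests above check this against the standard library. A short sketch of the equivalence being asserted, lifted directly from the test's own fixture:

```python
import pandas as pd

values = pd.Series(['A_B_C', 'B_C_D', 'E_F_G', 'EFGHEF'])

# pandas' element-wise partition matches str.partition row by row,
# including the degenerate case with no separator ('EFGHEF', '', '')
result = values.str.partition('_', expand=False).tolist()
assert result == [v.partition('_') for v in values]
```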
@@ -2173,13 +2247,13 @@ def test_partition_with_name(self):
         idx = Index(['a,b', 'c,d'], name='xxx')
         res = idx.str.partition(',')
         exp = MultiIndex.from_tuples([('a', ',', 'b'), ('c', ',', 'd')])
-        self.assertTrue(res.nlevels, 3)
+        assert res.nlevels == 3
         tm.assert_index_equal(res, exp)

         # should preserve name
         res = idx.str.partition(',', expand=False)
         exp = Index(np.array([('a', ',', 'b'), ('c', ',', 'd')]), name='xxx')
-        self.assertTrue(res.nlevels, 1)
+        assert res.nlevels == 1
         tm.assert_index_equal(res, exp)

     def test_pipe_failures(self):
@@ -2221,7 +2295,7 @@ def test_slice(self):
         rs = Series(mixed).str.slice(2, 5)
         xp = Series(['foo', NA, 'bar', NA, NA, NA, NA, NA])
-        tm.assertIsInstance(rs, Series)
+        assert isinstance(rs, Series)
         tm.assert_almost_equal(rs, xp)

         rs = Series(mixed).str.slice(2, 5, -1)
@@ -2299,19 +2373,19 @@ def test_strip_lstrip_rstrip_mixed(self):

         rs = Series(mixed).str.strip()
         xp = Series(['aa', NA, 'bb', NA, NA, NA, NA, NA])
-        tm.assertIsInstance(rs, Series)
+        assert isinstance(rs, Series)
         tm.assert_almost_equal(rs, xp)

         rs = Series(mixed).str.lstrip()
         xp = Series(['aa ', NA, 'bb \t\n', NA, NA, NA, NA, NA])
-        tm.assertIsInstance(rs, Series)
+        assert isinstance(rs, Series)
         tm.assert_almost_equal(rs, xp)

         rs = Series(mixed).str.rstrip()
         xp = Series([' aa', NA, ' bb', NA, NA, NA, NA, NA])
-        tm.assertIsInstance(rs, Series)
+        assert isinstance(rs, Series)
         tm.assert_almost_equal(rs, xp)

     def test_strip_lstrip_rstrip_unicode(self):
@@ -2400,7 +2474,7 @@ def test_get(self):
         rs = Series(mixed).str.split('_').str.get(1)
         xp = Series(['b', NA, 'd', NA, NA, NA, NA, NA])
-        tm.assertIsInstance(rs, Series)
+        assert isinstance(rs, Series)
         tm.assert_almost_equal(rs, xp)

         # unicode
@@ -2518,20 +2592,21 @@ def test_match_findall_flags(self):

         pat = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

-        with tm.assert_produces_warning(FutureWarning):
-            result = data.str.match(pat, flags=re.IGNORECASE)
+        result = data.str.extract(pat, flags=re.IGNORECASE, expand=True)
+        assert result.iloc[0].tolist() == ['dave', 'google', 'com']

-        self.assertEqual(result[0], ('dave', 'google', 'com'))
+        result = data.str.match(pat, flags=re.IGNORECASE)
+        assert result[0]

         result = data.str.findall(pat, flags=re.IGNORECASE)
-        self.assertEqual(result[0][0], ('dave', 'google', 'com'))
+        assert result[0][0] == ('dave', 'google', 'com')

         result = data.str.count(pat, flags=re.IGNORECASE)
-        self.assertEqual(result[0], 1)
+        assert result[0] == 1

         with tm.assert_produces_warning(UserWarning):
             result = data.str.contains(pat, flags=re.IGNORECASE)
-        self.assertEqual(result[0], True)
+        assert result[0]

     def test_encode_decode(self):
         base = Series([u('a'), u('b'), u('a\xe4')])
@@ -2546,7 +2621,7 @@ def test_encode_decode_errors(self):
         encodeBase = Series([u('a'), u('b'), u('a\x9d')])

-        self.assertRaises(UnicodeEncodeError, encodeBase.str.encode, 'cp1252')
+        pytest.raises(UnicodeEncodeError, encodeBase.str.encode, 'cp1252')

         f = lambda x: x.encode('cp1252', 'ignore')
         result = encodeBase.str.encode('cp1252', 'ignore')
@@ -2555,7 +2630,7 @@
         decodeBase = Series([b'a', b'b', b'a\x9d'])

-        self.assertRaises(UnicodeDecodeError, decodeBase.str.decode, 'cp1252')
+        pytest.raises(UnicodeDecodeError, decodeBase.str.decode, 'cp1252')

         f = lambda x: x.decode('cp1252', 'ignore')
         result = decodeBase.str.decode('cp1252', 'ignore')
@@ -2579,7 +2654,8 @@ def test_normalize(self):
         result = s.str.normalize('NFC')
         tm.assert_series_equal(result, expected)

-        with tm.assertRaisesRegexp(ValueError, "invalid normalization form"):
+        with tm.assert_raises_regex(ValueError,
+                                    "invalid normalization form"):
             s.str.normalize('xxx')

         s = Index([u'ABC', u'123', u'アイエ'])
@@ -2598,19 +2674,19
@@ def test_cat_on_filtered_index(self): str_month = df.month.astype('str') str_both = str_year.str.cat(str_month, sep=' ') - self.assertEqual(str_both.loc[1], '2011 2') + assert str_both.loc[1] == '2011 2' str_multiple = str_year.str.cat([str_month, str_month], sep=' ') - self.assertEqual(str_multiple.loc[1], '2011 2 2') + assert str_multiple.loc[1] == '2011 2 2' def test_str_cat_raises_intuitive_error(self): # https://github.com/pandas-dev/pandas/issues/11334 s = Series(['a', 'b', 'c', 'd']) message = "Did you mean to supply a `sep` keyword?" - with tm.assertRaisesRegexp(ValueError, message): + with tm.assert_raises_regex(ValueError, message): s.str.cat('|') - with tm.assertRaisesRegexp(ValueError, message): + with tm.assert_raises_regex(ValueError, message): s.str.cat(' ') def test_index_str_accessor_visibility(self): @@ -2632,15 +2708,15 @@ def test_index_str_accessor_visibility(self): (['aa', datetime(2011, 1, 1)], 'mixed')] for values, tp in cases: idx = Index(values) - self.assertTrue(isinstance(Series(values).str, StringMethods)) - self.assertTrue(isinstance(idx.str, StringMethods)) - self.assertEqual(idx.inferred_type, tp) + assert isinstance(Series(values).str, StringMethods) + assert isinstance(idx.str, StringMethods) + assert idx.inferred_type == tp for values, tp in cases: idx = Index(values) - self.assertTrue(isinstance(Series(values).str, StringMethods)) - self.assertTrue(isinstance(idx.str, StringMethods)) - self.assertEqual(idx.inferred_type, tp) + assert isinstance(Series(values).str, StringMethods) + assert isinstance(idx.str, StringMethods) + assert idx.inferred_type == tp cases = [([1, np.nan], 'floating'), ([datetime(2011, 1, 1)], 'datetime64'), @@ -2648,38 +2724,33 @@ def test_index_str_accessor_visibility(self): for values, tp in cases: idx = Index(values) message = 'Can only use .str accessor with string values' - with self.assertRaisesRegexp(AttributeError, message): + with tm.assert_raises_regex(AttributeError, message): Series(values).str - with self.assertRaisesRegexp(AttributeError, message): + with tm.assert_raises_regex(AttributeError, message): idx.str - self.assertEqual(idx.inferred_type, tp) + assert idx.inferred_type == tp # MultiIndex has mixed dtype, but not allow to use accessor idx = MultiIndex.from_tuples([('a', 'b'), ('a', 'b')]) - self.assertEqual(idx.inferred_type, 'mixed') + assert idx.inferred_type == 'mixed' message = 'Can only use .str accessor with Index, not MultiIndex' - with self.assertRaisesRegexp(AttributeError, message): + with tm.assert_raises_regex(AttributeError, message): idx.str def test_str_accessor_no_new_attributes(self): # https://github.com/pandas-dev/pandas/issues/10673 s = Series(list('aabbcde')) - with tm.assertRaisesRegexp(AttributeError, - "You cannot add any new attribute"): + with tm.assert_raises_regex(AttributeError, + "You cannot add any new attribute"): s.str.xlabel = "a" def test_method_on_bytes(self): lhs = Series(np.array(list('abc'), 'S1').astype(object)) rhs = Series(np.array(list('def'), 'S1').astype(object)) if compat.PY3: - self.assertRaises(TypeError, lhs.str.cat, rhs) + pytest.raises(TypeError, lhs.str.cat, rhs) else: result = lhs.str.cat(rhs) expected = Series(np.array( ['ad', 'be', 'cf'], 'S2').astype(object)) tm.assert_series_equal(result, expected) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/test_take.py b/pandas/tests/test_take.py index 98b3b474f785d..7b97b0e975df3 100644 --- 
a/pandas/tests/test_take.py +++ b/pandas/tests/test_take.py @@ -2,22 +2,17 @@ import re from datetime import datetime -import nose import numpy as np from pandas.compat import long import pandas.core.algorithms as algos import pandas.util.testing as tm -from pandas.tslib import iNaT +from pandas._libs.tslib import iNaT -_multiprocess_can_split_ = True - -class TestTake(tm.TestCase): +class TestTake(object): # standard incompatible fill error fill_error = re.compile("Incompatible type for fill_value") - _multiprocess_can_split_ = True - def test_1d_with_out(self): def _test_dtype(dtype, can_hold_na, writeable=True): data = np.random.randint(0, 2, 4).astype(dtype) @@ -37,7 +32,7 @@ def _test_dtype(dtype, can_hold_na, writeable=True): expected[3] = np.nan tm.assert_almost_equal(out, expected) else: - with tm.assertRaisesRegexp(TypeError, self.fill_error): + with tm.assert_raises_regex(TypeError, self.fill_error): algos.take_1d(data, indexer, out=out) # no exception o/w data.take(indexer, out=out) @@ -128,7 +123,8 @@ def _test_dtype(dtype, can_hold_na, writeable=True): tm.assert_almost_equal(out1, expected1) else: for i, out in enumerate([out0, out1]): - with tm.assertRaisesRegexp(TypeError, self.fill_error): + with tm.assert_raises_regex(TypeError, + self.fill_error): algos.take_nd(data, indexer, out=out, axis=i) # no exception o/w data.take(indexer, out=out, axis=i) @@ -240,7 +236,8 @@ def _test_dtype(dtype, can_hold_na): tm.assert_almost_equal(out2, expected2) else: for i, out in enumerate([out0, out1, out2]): - with tm.assertRaisesRegexp(TypeError, self.fill_error): + with tm.assert_raises_regex(TypeError, + self.fill_error): algos.take_nd(data, indexer, out=out, axis=i) # no exception o/w data.take(indexer, out=out, axis=i) @@ -353,24 +350,24 @@ def test_1d_bool(self): result = algos.take_1d(arr, [0, 2, 2, 1]) expected = arr.take([0, 2, 2, 1]) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) result = algos.take_1d(arr, [0, 2, -1]) - self.assertEqual(result.dtype, np.object_) + assert result.dtype == np.object_ def test_2d_bool(self): arr = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 1]], dtype=bool) result = algos.take_nd(arr, [0, 2, 2, 1]) expected = arr.take([0, 2, 2, 1], axis=0) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) result = algos.take_nd(arr, [0, 2, 2, 1], axis=1) expected = arr.take([0, 2, 2, 1], axis=1) - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) result = algos.take_nd(arr, [0, 2, -1]) - self.assertEqual(result.dtype, np.object_) + assert result.dtype == np.object_ def test_2d_float32(self): arr = np.random.randn(4, 3).astype(np.float32) @@ -448,8 +445,3 @@ def test_2d_datetime64(self): expected = arr.take(indexer, axis=1) expected[:, [2, 4]] = datetime(2007, 1, 1) tm.assert_almost_equal(result, expected) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/test_window.py b/pandas/tests/test_window.py index 929ff43bfaaad..634cd5fe2586b 100644 --- a/pandas/tests/test_window.py +++ b/pandas/tests/test_window.py @@ -1,22 +1,22 @@ from itertools import product -import nose +import pytest import sys import warnings +from warnings import catch_warnings -from nose.tools import assert_raises -from datetime import datetime +from datetime import datetime, timedelta from numpy.random import randn import numpy as np from distutils.version import 
LooseVersion import pandas as pd -from pandas import (Series, DataFrame, Panel, bdate_range, isnull, - notnull, concat, Timestamp) +from pandas import (Series, DataFrame, bdate_range, isnull, + notnull, concat, Timestamp, Index) import pandas.stats.moments as mom import pandas.core.window as rwindow import pandas.tseries.offsets as offsets from pandas.core.base import SpecificationError -from pandas.core.common import UnsupportedFunctionCall +from pandas.errors import UnsupportedFunctionCall import pandas.util.testing as tm from pandas.compat import range, zip, PY3 @@ -30,9 +30,7 @@ def assert_equal(left, right): tm.assert_frame_equal(left, right) -class Base(tm.TestCase): - - _multiprocess_can_split_ = True +class Base(object): _nan_locs = np.arange(20, 40) _inf_locs = np.array([]) @@ -50,7 +48,7 @@ def _create_data(self): class TestApi(Base): - def setUp(self): + def setup_method(self, method): self._create_data() def test_getitem(self): @@ -59,7 +57,7 @@ def test_getitem(self): tm.assert_index_equal(r._selected_obj.columns, self.frame.columns) r = self.frame.rolling(window=5)[1] - self.assertEqual(r._selected_obj.name, self.frame.columns[1]) + assert r._selected_obj.name == self.frame.columns[1] # technically this is allowed r = self.frame.rolling(window=5)[1, 3] @@ -73,10 +71,10 @@ def test_getitem(self): def test_select_bad_cols(self): df = DataFrame([[1, 2]], columns=['A', 'B']) g = df.rolling(window=5) - self.assertRaises(KeyError, g.__getitem__, ['C']) # g[['C']] + pytest.raises(KeyError, g.__getitem__, ['C']) # g[['C']] - self.assertRaises(KeyError, g.__getitem__, ['A', 'C']) # g[['A', 'C']] - with tm.assertRaisesRegexp(KeyError, '^[^A]+$'): + pytest.raises(KeyError, g.__getitem__, ['A', 'C']) # g[['A', 'C']] + with tm.assert_raises_regex(KeyError, '^[^A]+$'): # A should not be referenced as a bad column... # will have to rethink regex if you change message! 
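Throughout these files, `self.assertRaises` becomes either a direct call to `pytest.raises` or its context-manager form. A minimal sketch of the pattern, reusing the bad-column lookup exercised just above (imports assumed, matching the test module's own header):

```python
import pytest
import pandas as pd

df = pd.DataFrame([[1, 2]], columns=['A', 'B'])
g = df.rolling(window=5)

# selecting a column that does not exist raises KeyError,
# asserted here with the context-manager form of pytest.raises
with pytest.raises(KeyError):
    g[['C']]
```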
g[['A', 'C']] @@ -86,7 +84,7 @@ def test_attribute_access(self): df = DataFrame([[1, 2]], columns=['A', 'B']) r = df.rolling(window=5) tm.assert_series_equal(r.A.sum(), r['A'].sum()) - self.assertRaises(AttributeError, lambda: r.F) + pytest.raises(AttributeError, lambda: r.F) def tests_skip_nuisance(self): @@ -136,16 +134,18 @@ def test_agg(self): expected.columns = ['mean', 'sum'] tm.assert_frame_equal(result, expected) - result = r.aggregate({'A': {'mean': 'mean', 'sum': 'sum'}}) + with catch_warnings(record=True): + result = r.aggregate({'A': {'mean': 'mean', 'sum': 'sum'}}) expected = pd.concat([a_mean, a_sum], axis=1) expected.columns = pd.MultiIndex.from_tuples([('A', 'mean'), ('A', 'sum')]) tm.assert_frame_equal(result, expected, check_like=True) - result = r.aggregate({'A': {'mean': 'mean', - 'sum': 'sum'}, - 'B': {'mean2': 'mean', - 'sum2': 'sum'}}) + with catch_warnings(record=True): + result = r.aggregate({'A': {'mean': 'mean', + 'sum': 'sum'}, + 'B': {'mean2': 'mean', + 'sum2': 'sum'}}) expected = pd.concat([a_mean, a_sum, b_mean, b_sum], axis=1) exp_cols = [('A', 'mean'), ('A', 'sum'), ('B', 'mean2'), ('B', 'sum2')] expected.columns = pd.MultiIndex.from_tuples(exp_cols) @@ -174,7 +174,7 @@ def test_agg_consistency(self): tm.assert_index_equal(result, expected) result = r['A'].agg([np.sum, np.mean]).columns - expected = pd.Index(['sum', 'mean']) + expected = Index(['sum', 'mean']) tm.assert_index_equal(result, expected) result = r.agg({'A': [np.sum, np.mean]}).columns @@ -191,22 +191,67 @@ def f(): r.aggregate({'r1': {'A': ['mean', 'sum']}, 'r2': {'B': ['mean', 'sum']}}) - self.assertRaises(SpecificationError, f) + pytest.raises(SpecificationError, f) expected = pd.concat([r['A'].mean(), r['A'].std(), r['B'].mean(), r['B'].std()], axis=1) expected.columns = pd.MultiIndex.from_tuples([('ra', 'mean'), ( 'ra', 'std'), ('rb', 'mean'), ('rb', 'std')]) - result = r[['A', 'B']].agg({'A': {'ra': ['mean', 'std']}, - 'B': {'rb': ['mean', 'std']}}) + with catch_warnings(record=True): + result = r[['A', 'B']].agg({'A': {'ra': ['mean', 'std']}, + 'B': {'rb': ['mean', 'std']}}) tm.assert_frame_equal(result, expected, check_like=True) - result = r.agg({'A': {'ra': ['mean', 'std']}, - 'B': {'rb': ['mean', 'std']}}) + with catch_warnings(record=True): + result = r.agg({'A': {'ra': ['mean', 'std']}, + 'B': {'rb': ['mean', 'std']}}) expected.columns = pd.MultiIndex.from_tuples([('A', 'ra', 'mean'), ( 'A', 'ra', 'std'), ('B', 'rb', 'mean'), ('B', 'rb', 'std')]) tm.assert_frame_equal(result, expected, check_like=True) + def test_count_nonnumeric_types(self): + # GH12541 + cols = ['int', 'float', 'string', 'datetime', 'timedelta', 'periods', + 'fl_inf', 'fl_nan', 'str_nan', 'dt_nat', 'periods_nat'] + + df = DataFrame( + {'int': [1, 2, 3], + 'float': [4., 5., 6.], + 'string': list('abc'), + 'datetime': pd.date_range('20170101', periods=3), + 'timedelta': pd.timedelta_range('1 s', periods=3, freq='s'), + 'periods': [pd.Period('2012-01'), pd.Period('2012-02'), + pd.Period('2012-03')], + 'fl_inf': [1., 2., np.Inf], + 'fl_nan': [1., 2., np.NaN], + 'str_nan': ['aa', 'bb', np.NaN], + 'dt_nat': [pd.Timestamp('20170101'), pd.Timestamp('20170203'), + pd.Timestamp(None)], + 'periods_nat': [pd.Period('2012-01'), pd.Period('2012-02'), + pd.Period(None)]}, + columns=cols) + + expected = DataFrame( + {'int': [1., 2., 2.], + 'float': [1., 2., 2.], + 'string': [1., 2., 2.], + 'datetime': [1., 2., 2.], + 'timedelta': [1., 2., 2.], + 'periods': [1., 2., 2.], + 'fl_inf': [1., 2., 2.], + 'fl_nan': [1., 2., 1.], + 
'str_nan': [1., 2., 1.], + 'dt_nat': [1., 2., 1.], + 'periods_nat': [1., 2., 1.]}, + columns=cols) + + result = df.rolling(window=2).count() + tm.assert_frame_equal(result, expected) + + result = df.rolling(1).count() + expected = df.notnull().astype(float) + tm.assert_frame_equal(result, expected) + def test_window_with_args(self): tm._skip_if_no_scipy() @@ -236,8 +281,8 @@ def test_preserve_metadata(self): s2 = s.rolling(30).sum() s3 = s.rolling(20).sum() - self.assertEqual(s2.name, 'foo') - self.assertEqual(s3.name, 'foo') + assert s2.name == 'foo' + assert s3.name == 'foo' def test_how_compat(self): # in prior versions, we would allow how to be used in the resample @@ -251,8 +296,7 @@ def test_how_compat(self): for op in ['mean', 'sum', 'std', 'var', 'kurt', 'skew']: for t in ['rolling', 'expanding']: - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): + with catch_warnings(record=True): dfunc = getattr(pd, "{0}_{1}".format(t, op)) if dfunc is None: @@ -271,7 +315,7 @@ def test_how_compat(self): class TestWindow(Base): - def setUp(self): + def setup_method(self, method): self._create_data() def test_constructor(self): @@ -292,13 +336,13 @@ def test_constructor(self): # not valid for w in [2., 'foo', np.array([2])]: - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(win_type='boxcar', window=2, min_periods=w) - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(win_type='boxcar', window=2, min_periods=1, center=w) for wt in ['foobar', 1]: - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(win_type=wt, window=2) def test_numpy_compat(self): @@ -308,15 +352,15 @@ def test_numpy_compat(self): msg = "numpy operations are not valid with window objects" for func in ('sum', 'mean'): - tm.assertRaisesRegexp(UnsupportedFunctionCall, msg, - getattr(w, func), 1, 2, 3) - tm.assertRaisesRegexp(UnsupportedFunctionCall, msg, - getattr(w, func), dtype=np.float64) + tm.assert_raises_regex(UnsupportedFunctionCall, msg, + getattr(w, func), 1, 2, 3) + tm.assert_raises_regex(UnsupportedFunctionCall, msg, + getattr(w, func), dtype=np.float64) class TestRolling(Base): - def setUp(self): + def setup_method(self, method): self._create_data() def test_doc_string(self): @@ -340,16 +384,16 @@ def test_constructor(self): # GH 13383 c(0) - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(-1) # not valid for w in [2., 'foo', np.array([2])]: - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(window=w) - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(window=2, min_periods=w) - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(window=2, min_periods=1, center=w) def test_constructor_with_win_type(self): @@ -358,9 +402,27 @@ def test_constructor_with_win_type(self): for o in [self.series, self.frame]: c = o.rolling c(0, win_type='boxcar') - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(-1, win_type='boxcar') + def test_constructor_with_timedelta_window(self): + # GH 15440 + n = 10 + df = pd.DataFrame({'value': np.arange(n)}, + index=pd.date_range('2015-12-24', + periods=n, + freq="D")) + expected_data = np.append([0., 1.], np.arange(3., 27., 3)) + for window in [timedelta(days=3), pd.Timedelta(days=3)]: + result = df.rolling(window=window).sum() + expected = pd.DataFrame({'value': expected_data}, + index=pd.date_range('2015-12-24', + periods=n, + freq="D")) + tm.assert_frame_equal(result, expected) + expected = 
df.rolling('3D').sum() + tm.assert_frame_equal(result, expected) + def test_numpy_compat(self): # see gh-12811 r = rwindow.Rolling(Series([2, 4, 6]), window=2) @@ -368,15 +430,21 @@ def test_numpy_compat(self): msg = "numpy operations are not valid with window objects" for func in ('std', 'mean', 'sum', 'max', 'min', 'var'): - tm.assertRaisesRegexp(UnsupportedFunctionCall, msg, - getattr(r, func), 1, 2, 3) - tm.assertRaisesRegexp(UnsupportedFunctionCall, msg, - getattr(r, func), dtype=np.float64) + tm.assert_raises_regex(UnsupportedFunctionCall, msg, + getattr(r, func), 1, 2, 3) + tm.assert_raises_regex(UnsupportedFunctionCall, msg, + getattr(r, func), dtype=np.float64) + + def test_closed(self): + df = DataFrame({'A': [0, 1, 2, 3, 4]}) + # closed only allowed for datetimelike + with pytest.raises(ValueError): + df.rolling(window=3, closed='neither') class TestExpanding(Base): - def setUp(self): + def setup_method(self, method): self._create_data() def test_doc_string(self): @@ -398,9 +466,9 @@ def test_constructor(self): # not valid for w in [2., 'foo', np.array([2])]: - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(min_periods=w) - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(min_periods=1, center=w) def test_numpy_compat(self): @@ -410,15 +478,15 @@ def test_numpy_compat(self): msg = "numpy operations are not valid with window objects" for func in ('std', 'mean', 'sum', 'max', 'min', 'var'): - tm.assertRaisesRegexp(UnsupportedFunctionCall, msg, - getattr(e, func), 1, 2, 3) - tm.assertRaisesRegexp(UnsupportedFunctionCall, msg, - getattr(e, func), dtype=np.float64) + tm.assert_raises_regex(UnsupportedFunctionCall, msg, + getattr(e, func), 1, 2, 3) + tm.assert_raises_regex(UnsupportedFunctionCall, msg, + getattr(e, func), dtype=np.float64) class TestEWM(Base): - def setUp(self): + def setup_method(self, method): self._create_data() def test_doc_string(self): @@ -441,28 +509,28 @@ def test_constructor(self): c(halflife=0.75, alpha=None) # not valid: mutually exclusive - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(com=0.5, alpha=0.5) - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(span=1.5, halflife=0.75) - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(alpha=0.5, span=1.5) # not valid: com < 0 - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(com=-0.5) # not valid: span < 1 - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(span=0.5) # not valid: halflife <= 0 - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(halflife=0) # not valid: alpha <= 0 or alpha > 1 for alpha in (-0.5, 1.5): - with self.assertRaises(ValueError): + with pytest.raises(ValueError): c(alpha=alpha) def test_numpy_compat(self): @@ -472,30 +540,30 @@ def test_numpy_compat(self): msg = "numpy operations are not valid with window objects" for func in ('std', 'mean', 'var'): - tm.assertRaisesRegexp(UnsupportedFunctionCall, msg, - getattr(e, func), 1, 2, 3) - tm.assertRaisesRegexp(UnsupportedFunctionCall, msg, - getattr(e, func), dtype=np.float64) + tm.assert_raises_regex(UnsupportedFunctionCall, msg, + getattr(e, func), 1, 2, 3) + tm.assert_raises_regex(UnsupportedFunctionCall, msg, + getattr(e, func), dtype=np.float64) class TestDeprecations(Base): """ test that we are catching deprecation warnings """ - def setUp(self): + def setup_method(self, method): self._create_data() def test_deprecations(self): - with 
tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): mom.rolling_mean(np.ones(10), 3, center=True, axis=0) mom.rolling_mean(Series(np.ones(10)), 3, center=True, axis=0) -# GH #12373 : rolling functions error on float32 data +# gh-12373 : rolling functions error on float32 data # make sure rolling functions works for different dtypes # -# NOTE that these are yielded tests and so _create_data is -# explicity called, nor do these inherit from unittest.TestCase +# NOTE that these are yielded tests and so _create_data +# is explicitly called. # # further note that we are only checking rolling for fully dtype # compliance (though both expanding and ewm inherit) @@ -588,7 +656,7 @@ def test_dtypes(self): f = self.funcs[f_name] d = self.data[d_name] exp = self.expects[d_name][f_name] - yield self.check_dtypes, f, f_name, d, d_name, exp + self.check_dtypes(f, f_name, d, d_name, exp) def check_dtypes(self, f, f_name, d, d_name, exp): roll = d.rolling(window=self.window) @@ -685,7 +753,8 @@ def check_dtypes(self, f, f_name, d, d_name, exp): else: # other methods not Implemented ATM - assert_raises(NotImplementedError, f, roll) + with pytest.raises(NotImplementedError): + f(roll) class TestDtype_timedelta(DatetimeLike): @@ -700,13 +769,13 @@ class TestDtype_datetime64UTC(DatetimeLike): dtype = 'datetime64[ns, UTC]' def _create_data(self): - raise nose.SkipTest("direct creation of extension dtype " - "datetime64[ns, UTC] is not supported ATM") + pytest.skip("direct creation of extension dtype " + "datetime64[ns, UTC] is not supported ATM") class TestMoments(Base): - def setUp(self): + def setup_method(self, method): self._create_data() def test_centered_axis_validation(self): @@ -715,7 +784,7 @@ def test_centered_axis_validation(self): Series(np.ones(10)).rolling(window=3, center=True, axis=0).mean() # bad axis - with self.assertRaises(ValueError): + with pytest.raises(ValueError): Series(np.ones(10)).rolling(window=3, center=True, axis=1).mean() # ok ok @@ -725,7 +794,7 @@ def test_centered_axis_validation(self): axis=1).mean() # bad axis - with self.assertRaises(ValueError): + with pytest.raises(ValueError): (DataFrame(np.ones((10, 10))) .rolling(window=3, center=True, axis=2).mean()) @@ -750,7 +819,7 @@ def test_cmov_mean(self): xp = np.array([np.nan, np.nan, 9.962, 11.27, 11.564, 12.516, 12.818, 12.952, np.nan, np.nan]) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): rs = mom.rolling_mean(vals, 5, center=True) tm.assert_almost_equal(xp, rs) @@ -767,7 +836,7 @@ def test_cmov_window(self): xp = np.array([np.nan, np.nan, 9.962, 11.27, 11.564, 12.516, 12.818, 12.952, np.nan, np.nan]) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): rs = mom.rolling_window(vals, 5, 'boxcar', center=True) tm.assert_almost_equal(xp, rs) @@ -782,22 +851,22 @@ def test_cmov_window_corner(self): # all nan vals = np.empty(10, dtype=float) vals.fill(np.nan) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): rs = mom.rolling_window(vals, 5, 'boxcar', center=True) - self.assertTrue(np.isnan(rs).all()) + assert np.isnan(rs).all() # empty vals = np.array([]) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): rs = mom.rolling_window(vals, 5, 'boxcar', center=True) - self.assertEqual(len(rs), 0) + assert len(rs) == 0 # shorter than window vals = 
np.random.randn(5) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): rs = mom.rolling_window(vals, 10, 'boxcar') - self.assertTrue(np.isnan(rs).all()) - self.assertEqual(len(rs), 5) + assert np.isnan(rs).all() + assert len(rs) == 5 def test_cmov_window_frame(self): # Gh 8238 @@ -818,7 +887,7 @@ def test_cmov_window_frame(self): tm.assert_frame_equal(DataFrame(xp), rs) # invalid method - with self.assertRaises(AttributeError): + with pytest.raises(AttributeError): (DataFrame(vals).rolling(5, win_type='boxcar', center=True) .std()) @@ -973,38 +1042,38 @@ def test_cmov_window_special_linear_range(self): tm.assert_series_equal(xp, rs) def test_rolling_median(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): self._check_moment_func(mom.rolling_median, np.median, name='median') def test_rolling_min(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): self._check_moment_func(mom.rolling_min, np.min, name='min') - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): a = np.array([1, 2, 3, 4, 5]) b = mom.rolling_min(a, window=100, min_periods=1) tm.assert_almost_equal(b, np.ones(len(a))) - self.assertRaises(ValueError, mom.rolling_min, np.array([1, 2, 3]), - window=3, min_periods=5) + pytest.raises(ValueError, mom.rolling_min, np.array([1, 2, 3]), + window=3, min_periods=5) def test_rolling_max(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): self._check_moment_func(mom.rolling_max, np.max, name='max') - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): a = np.array([1, 2, 3, 4, 5], dtype=np.float64) b = mom.rolling_max(a, window=100, min_periods=1) tm.assert_almost_equal(a, b) - self.assertRaises(ValueError, mom.rolling_max, np.array([1, 2, 3]), - window=3, min_periods=5) + pytest.raises(ValueError, mom.rolling_max, np.array([1, 2, 3]), + window=3, min_periods=5) def test_rolling_quantile(self): - qs = [.1, .5, .9] + qs = [0.0, .1, .5, .9, 1.0] def scoreatpercentile(a, per): values = np.sort(a, axis=0) @@ -1025,6 +1094,18 @@ def alt(x): self._check_moment_func(f, alt, name='quantile', quantile=q) + def test_rolling_quantile_param(self): + ser = Series([0.0, .1, .5, .9, 1.0]) + + with pytest.raises(ValueError): + ser.rolling(3).quantile(-0.1) + + with pytest.raises(ValueError): + ser.rolling(3).quantile(10.0) + + with pytest.raises(TypeError): + ser.rolling(3).quantile('foo') + def test_rolling_apply(self): # suppress warnings about empty slices, as we are deliberately testing # with a 0-length Series @@ -1061,11 +1142,11 @@ def test_rolling_apply_out_of_bounds(self): arr = np.arange(4) # it works! 
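The new `test_rolling_quantile_param` above pins down input validation for `Rolling.quantile`: the quantile must be a number in the closed interval [0, 1]. A minimal sketch of the contract, taken directly from the test's cases:

```python
import pytest
import pandas as pd

ser = pd.Series([0.0, .1, .5, .9, 1.0])

with pytest.raises(ValueError):
    ser.rolling(3).quantile(-0.1)   # below 0
with pytest.raises(ValueError):
    ser.rolling(3).quantile(10.0)   # above 1
with pytest.raises(TypeError):
    ser.rolling(3).quantile('foo')  # not a number
```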
- with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): result = mom.rolling_apply(arr, 10, np.sum) - self.assertTrue(isnull(result).all()) + assert isnull(result).all() - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): result = mom.rolling_apply(arr, 10, np.sum, min_periods=1) tm.assert_almost_equal(result, result) @@ -1076,22 +1157,22 @@ def test_rolling_std(self): name='std', ddof=0) def test_rolling_std_1obs(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): result = mom.rolling_std(np.array([1., 2., 3., 4., 5.]), 1, min_periods=1) expected = np.array([np.nan] * 5) tm.assert_almost_equal(result, expected) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): result = mom.rolling_std(np.array([1., 2., 3., 4., 5.]), 1, min_periods=1, ddof=0) expected = np.zeros(5) tm.assert_almost_equal(result, expected) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): result = mom.rolling_std(np.array([np.nan, np.nan, 3., 4., 5.]), 3, min_periods=2) - self.assertTrue(np.isnan(result[2])) + assert np.isnan(result[2]) def test_rolling_std_neg_sqrt(self): # unit test from Bottleneck @@ -1101,13 +1182,13 @@ def test_rolling_std_neg_sqrt(self): a = np.array([0.0011448196318903589, 0.00028718669878572767, 0.00028718669878572767, 0.00028718669878572767, 0.00028718669878572767]) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): b = mom.rolling_std(a, window=3) - self.assertTrue(np.isfinite(b[2:]).all()) + assert np.isfinite(b[2:]).all() - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): b = mom.ewmstd(a, span=3) - self.assertTrue(np.isfinite(b[2:]).all()) + assert np.isfinite(b[2:]).all() def test_rolling_var(self): self._check_moment_func(mom.rolling_var, lambda x: np.var(x, ddof=1), @@ -1119,7 +1200,7 @@ def test_rolling_skew(self): try: from scipy.stats import skew except ImportError: - raise nose.SkipTest('no scipy') + pytest.skip('no scipy') self._check_moment_func(mom.rolling_skew, lambda x: skew(x, bias=False), name='skew') @@ -1127,14 +1208,14 @@ def test_rolling_kurt(self): try: from scipy.stats import kurtosis except ImportError: - raise nose.SkipTest('no scipy') + pytest.skip('no scipy') self._check_moment_func(mom.rolling_kurt, lambda x: kurtosis(x, bias=False), name='kurt') def test_fperr_robustness(self): # TODO: remove this once python 2.5 out of picture if PY3: - raise nose.SkipTest("doesn't work on python 3") + pytest.skip("doesn't work on python 3") # #2114 data = 
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x1a@\xaa\xaa\xaa\xaa\xaa\xaa\x02@8\x8e\xe38\x8e\xe3\xe8?z\t\xed%\xb4\x97\xd0?\xa2\x0c<\xdd\x9a\x1f\xb6?\x82\xbb\xfa&y\x7f\x9d?\xac\'\xa7\xc4P\xaa\x83?\x90\xdf\xde\xb0k8j?`\xea\xe9u\xf2zQ?*\xe37\x9d\x98N7?\xe2.\xf5&v\x13\x1f?\xec\xc9\xf8\x19\xa4\xb7\x04?\x90b\xf6w\x85\x9f\xeb>\xb5A\xa4\xfaXj\xd2>F\x02\xdb\xf8\xcb\x8d\xb8>.\xac<\xfb\x87^\xa0>\xe8:\xa6\xf9_\xd3\x85>\xfb?\xe2cUU\xfd?\xfc\x7fA\xed8\x8e\xe3?\xa5\xaa\xac\x91\xf6\x12\xca?n\x1cs\xb6\xf9a\xb1?\xe8%D\xf3L-\x97?5\xddZD\x11\xe7~?#>\xe7\x82\x0b\x9ad?\xd9R4Y\x0fxK?;7x;\nP2?N\xf4JO\xb8j\x18?4\xf81\x8a%G\x00?\x9a\xf5\x97\r2\xb4\xe5>\xcd\x9c\xca\xbcB\xf0\xcc>3\x13\x87(\xd7J\xb3>\x99\x19\xb4\xe0\x1e\xb9\x99>ff\xcd\x95\x14&\x81>\x88\x88\xbc\xc7p\xddf>`\x0b\xa6_\x96|N>@\xb2n\xea\x0eS4>U\x98\x938i\x19\x1b>\x8eeb\xd0\xf0\x10\x02>\xbd\xdc-k\x96\x16\xe8=(\x93\x1e\xf2\x0e\x0f\xd0=\xe0n\xd3Bii\xb5=*\xe9\x19Y\x8c\x8c\x9c=\xc6\xf0\xbb\x90]\x08\x83=]\x96\xfa\xc0|`i=>d\xfc\xd5\xfd\xeaP=R0\xfb\xc7\xa7\x8e6=\xc2\x95\xf9_\x8a\x13\x1e=\xd6c\xa6\xea\x06\r\x04=r\xda\xdd8\t\xbc\xea<\xf6\xe6\x93\xd0\xb0\xd2\xd1<\x9d\xdeok\x96\xc3\xb7<&~\xea9s\xaf\x9f\xb8\x02@\xc6\xd2&\xfd\xa8\xf5\xe8?\xd9\xe1\x19\xfe\xc5\xa3\xd0?v\x82"\xa8\xb2/\xb6?\x9dX\x835\xee\x94\x9d?h\x90W\xce\x9e\xb8\x83?\x8a\xc0th~Kj?\\\x80\xf8\x9a\xa9\x87Q?%\xab\xa0\xce\x8c_7?1\xe4\x80\x13\x11*\x1f? \x98\x00\r\xb6\xc6\x04?\x80u\xabf\x9d\xb3\xeb>UNrD\xbew\xd2>\x1c\x13C[\xa8\x9f\xb8>\x12b\xd7m-\x1fQ@\xe3\x85>\xe6\x91)l\x00/m>Da\xc6\xf2\xaatS>\x05\xd7]\xee\xe3\xf09>' # noqa @@ -1143,27 +1224,27 @@ def test_fperr_robustness(self): if sys.byteorder != "little": arr = arr.byteswap().newbyteorder() - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): result = mom.rolling_sum(arr, 2) - self.assertTrue((result[1:] >= 0).all()) + assert (result[1:] >= 0).all() - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): result = mom.rolling_mean(arr, 2) - self.assertTrue((result[1:] >= 0).all()) + assert (result[1:] >= 0).all() - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): result = mom.rolling_var(arr, 2) - self.assertTrue((result[1:] >= 0).all()) + assert (result[1:] >= 0).all() # #2527, ugh arr = np.array([0.00012456, 0.0003, 0]) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): result = mom.rolling_mean(arr, 1) - self.assertTrue(result[-1] >= 0) + assert result[-1] >= 0 - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): result = mom.rolling_mean(-arr, 1) - self.assertTrue(result[-1] <= 0) + assert result[-1] <= 0 def _check_moment_func(self, f, static_comp, name=None, window=50, has_min_periods=True, has_center=True, @@ -1216,16 +1297,16 @@ def get_result(arr, window, min_periods=None, center=False): # min_periods is working correctly result = get_result(arr, 20, min_periods=15) - self.assertTrue(np.isnan(result[23])) - self.assertFalse(np.isnan(result[24])) + assert np.isnan(result[23]) + assert not np.isnan(result[24]) - self.assertFalse(np.isnan(result[-6])) - self.assertTrue(np.isnan(result[-5])) + assert not np.isnan(result[-6]) + assert np.isnan(result[-5]) arr2 = randn(20) result = get_result(arr2, 10, min_periods=5) - self.assertTrue(isnull(result[3])) - self.assertTrue(notnull(result[4])) + assert 
isnull(result[3]) + assert notnull(result[4]) # min_periods=0 result0 = get_result(arr, 20, min_periods=0) @@ -1247,7 +1328,7 @@ def get_result(arr, window, min_periods=None, center=False): expected = get_result( np.concatenate((arr, np.array([np.NaN] * 9))), 20)[9:] - self.assert_numpy_array_equal(result, expected) + tm.assert_numpy_array_equal(result, expected) if test_stable: result = get_result(self.arr + 1e9, window) @@ -1263,8 +1344,8 @@ def get_result(arr, window, min_periods=None, center=False): expected = get_result(self.arr, len(self.arr), min_periods=minp) nan_mask = np.isnan(result) - self.assertTrue(np.array_equal(nan_mask, np.isnan( - expected))) + tm.assert_numpy_array_equal(nan_mask, np.isnan(expected)) + nan_mask = ~nan_mask tm.assert_almost_equal(result[nan_mask], expected[nan_mask]) @@ -1272,7 +1353,8 @@ def get_result(arr, window, min_periods=None, center=False): result = get_result(self.arr, len(self.arr) + 1) expected = get_result(self.arr, len(self.arr)) nan_mask = np.isnan(result) - self.assertTrue(np.array_equal(nan_mask, np.isnan(expected))) + tm.assert_numpy_array_equal(nan_mask, np.isnan(expected)) + nan_mask = ~nan_mask tm.assert_almost_equal(result[nan_mask], expected[nan_mask]) @@ -1286,23 +1368,21 @@ def get_result(obj, window, min_periods=None, freq=None, center=False): # catch a freq deprecation warning if freq is provided and not # None - w = FutureWarning if freq is not None else None - with tm.assert_produces_warning(w, check_stacklevel=False): + with catch_warnings(record=True): r = obj.rolling(window=window, min_periods=min_periods, freq=freq, center=center) return getattr(r, name)(**kwargs) # check via the moments API - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): + with catch_warnings(record=True): return f(obj, window=window, min_periods=min_periods, freq=freq, center=center, **kwargs) series_result = get_result(self.series, window=50) frame_result = get_result(self.frame, window=50) - tm.assertIsInstance(series_result, Series) - self.assertEqual(type(frame_result), DataFrame) + assert isinstance(series_result, Series) + assert type(frame_result) == DataFrame # check time_rule works if has_time_rule: @@ -1326,7 +1406,7 @@ def get_result(obj, window, min_periods=None, freq=None, center=False): trunc_series = self.series[::2].truncate(prev_date, last_date) trunc_frame = self.frame[::2].truncate(prev_date, last_date) - self.assertAlmostEqual(series_result[-1], + tm.assert_almost_equal(series_result[-1], static_comp(trunc_series)) tm.assert_series_equal(frame_result.xs(last_date), @@ -1378,9 +1458,9 @@ def test_ewma(self): arr = np.zeros(1000) arr[5] = 1 - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): result = mom.ewma(arr, span=100, adjust=False).sum() - self.assertTrue(np.abs(result - 1) < 1e-2) + assert np.abs(result - 1) < 1e-2 s = Series([1.0, 2.0, 4.0, 8.0]) @@ -1465,31 +1545,31 @@ def test_ewmvol(self): self._check_ew(mom.ewmvol, name='vol') def test_ewma_span_com_args(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): A = mom.ewma(self.arr, com=9.5) B = mom.ewma(self.arr, span=20) tm.assert_almost_equal(A, B) - self.assertRaises(ValueError, mom.ewma, self.arr, com=9.5, span=20) - self.assertRaises(ValueError, mom.ewma, self.arr) + pytest.raises(ValueError, mom.ewma, self.arr, com=9.5, span=20) + pytest.raises(ValueError, mom.ewma, self.arr) def test_ewma_halflife_arg(self): - with 
tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): A = mom.ewma(self.arr, com=13.932726172912965) B = mom.ewma(self.arr, halflife=10.0) tm.assert_almost_equal(A, B) - self.assertRaises(ValueError, mom.ewma, self.arr, span=20, - halflife=50) - self.assertRaises(ValueError, mom.ewma, self.arr, com=9.5, - halflife=50) - self.assertRaises(ValueError, mom.ewma, self.arr, com=9.5, span=20, - halflife=50) - self.assertRaises(ValueError, mom.ewma, self.arr) + pytest.raises(ValueError, mom.ewma, self.arr, span=20, + halflife=50) + pytest.raises(ValueError, mom.ewma, self.arr, com=9.5, + halflife=50) + pytest.raises(ValueError, mom.ewma, self.arr, com=9.5, span=20, + halflife=50) + pytest.raises(ValueError, mom.ewma, self.arr) def test_ewma_alpha_old_api(self): # GH 10789 - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): a = mom.ewma(self.arr, alpha=0.61722699889169674) b = mom.ewma(self.arr, com=0.62014947789973052) c = mom.ewma(self.arr, span=2.240298955799461) @@ -1500,14 +1580,14 @@ def test_ewma_alpha_old_api(self): def test_ewma_alpha_arg_old_api(self): # GH 10789 - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - self.assertRaises(ValueError, mom.ewma, self.arr) - self.assertRaises(ValueError, mom.ewma, self.arr, - com=10.0, alpha=0.5) - self.assertRaises(ValueError, mom.ewma, self.arr, - span=10.0, alpha=0.5) - self.assertRaises(ValueError, mom.ewma, self.arr, - halflife=10.0, alpha=0.5) + with catch_warnings(record=True): + pytest.raises(ValueError, mom.ewma, self.arr) + pytest.raises(ValueError, mom.ewma, self.arr, + com=10.0, alpha=0.5) + pytest.raises(ValueError, mom.ewma, self.arr, + span=10.0, alpha=0.5) + pytest.raises(ValueError, mom.ewma, self.arr, + halflife=10.0, alpha=0.5) def test_ewm_alpha(self): # GH 10789 @@ -1523,47 +1603,46 @@ def test_ewm_alpha(self): def test_ewm_alpha_arg(self): # GH 10789 s = Series(self.arr) - self.assertRaises(ValueError, s.ewm) - self.assertRaises(ValueError, s.ewm, com=10.0, alpha=0.5) - self.assertRaises(ValueError, s.ewm, span=10.0, alpha=0.5) - self.assertRaises(ValueError, s.ewm, halflife=10.0, alpha=0.5) + pytest.raises(ValueError, s.ewm) + pytest.raises(ValueError, s.ewm, com=10.0, alpha=0.5) + pytest.raises(ValueError, s.ewm, span=10.0, alpha=0.5) + pytest.raises(ValueError, s.ewm, halflife=10.0, alpha=0.5) def test_ewm_domain_checks(self): # GH 12492 s = Series(self.arr) # com must satisfy: com >= 0 - self.assertRaises(ValueError, s.ewm, com=-0.1) + pytest.raises(ValueError, s.ewm, com=-0.1) s.ewm(com=0.0) s.ewm(com=0.1) # span must satisfy: span >= 1 - self.assertRaises(ValueError, s.ewm, span=-0.1) - self.assertRaises(ValueError, s.ewm, span=0.0) - self.assertRaises(ValueError, s.ewm, span=0.9) + pytest.raises(ValueError, s.ewm, span=-0.1) + pytest.raises(ValueError, s.ewm, span=0.0) + pytest.raises(ValueError, s.ewm, span=0.9) s.ewm(span=1.0) s.ewm(span=1.1) # halflife must satisfy: halflife > 0 - self.assertRaises(ValueError, s.ewm, halflife=-0.1) - self.assertRaises(ValueError, s.ewm, halflife=0.0) + pytest.raises(ValueError, s.ewm, halflife=-0.1) + pytest.raises(ValueError, s.ewm, halflife=0.0) s.ewm(halflife=0.1) # alpha must satisfy: 0 < alpha <= 1 - self.assertRaises(ValueError, s.ewm, alpha=-0.1) - self.assertRaises(ValueError, s.ewm, alpha=0.0) + pytest.raises(ValueError, s.ewm, alpha=-0.1) + pytest.raises(ValueError, s.ewm, alpha=0.0) s.ewm(alpha=0.1) s.ewm(alpha=1.0) - 
self.assertRaises(ValueError, s.ewm, alpha=1.1) + pytest.raises(ValueError, s.ewm, alpha=1.1) def test_ew_empty_arrays(self): arr = np.array([], dtype=np.float64) funcs = [mom.ewma, mom.ewmvol, mom.ewmvar] for f in funcs: - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): + with catch_warnings(record=True): result = f(arr, 3) tm.assert_almost_equal(result, arr) def _check_ew(self, func, name=None): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): self._check_ew_ndarray(func, name=name) self._check_ew_structures(func, name=name) @@ -1581,19 +1660,19 @@ def _check_ew_ndarray(self, func, preserve_nan=False, name=None): # check min_periods # GH 7898 result = func(s, 50, min_periods=2) - self.assertTrue(np.isnan(result.values[:11]).all()) - self.assertFalse(np.isnan(result.values[11:]).any()) + assert np.isnan(result.values[:11]).all() + assert not np.isnan(result.values[11:]).any() for min_periods in (0, 1): result = func(s, 50, min_periods=min_periods) if func == mom.ewma: - self.assertTrue(np.isnan(result.values[:10]).all()) - self.assertFalse(np.isnan(result.values[10:]).any()) + assert np.isnan(result.values[:10]).all() + assert not np.isnan(result.values[10:]).any() else: # ewmstd, ewmvol, ewmvar (with bias=False) require at least two # values - self.assertTrue(np.isnan(result.values[:11]).all()) - self.assertFalse(np.isnan(result.values[11:]).any()) + assert np.isnan(result.values[:11]).all() + assert not np.isnan(result.values[11:]).any() # check series of length 0 result = func(Series([]), 50, min_periods=min_periods) @@ -1610,14 +1689,168 @@ def _check_ew_ndarray(self, func, preserve_nan=False, name=None): # pass in ints result2 = func(np.arange(50), span=10) - self.assertEqual(result2.dtype, np.float_) + assert result2.dtype == np.float_ def _check_ew_structures(self, func, name): series_result = getattr(self.series.ewm(com=10), name)() - tm.assertIsInstance(series_result, Series) + assert isinstance(series_result, Series) frame_result = getattr(self.frame.ewm(com=10), name)() - self.assertEqual(type(frame_result), DataFrame) + assert type(frame_result) == DataFrame + + +class TestPairwise(object): + + # GH 7738 + df1s = [DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], columns=[0, 1]), + DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], columns=[1, 0]), + DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], columns=[1, 1]), + DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], + columns=['C', 'C']), + DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], columns=[1., 0]), + DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], columns=[0., 1]), + DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], columns=['C', 1]), + DataFrame([[2., 4.], [1., 2.], [5., 2.], [8., 1.]], + columns=[1, 0.]), + DataFrame([[2, 4.], [1, 2.], [5, 2.], [8, 1.]], + columns=[0, 1.]), + DataFrame([[2, 4], [1, 2], [5, 2], [8, 1.]], + columns=[1., 'X']), ] + df2 = DataFrame([[None, 1, 1], [None, 1, 2], + [None, 3, 2], [None, 8, 1]], columns=['Y', 'Z', 'X']) + s = Series([1, 1, 3, 8]) + + def compare(self, result, expected): + + # since we have sorted the results + # we can only compare non-nans + result = result.dropna().values + expected = expected.dropna().values + + tm.assert_numpy_array_equal(result, expected) + + @pytest.mark.parametrize('f', [lambda x: x.cov(), lambda x: x.corr()]) + def test_no_flex(self, f): + + # DataFrame methods (which do not call _flex_binary_moment()) + + results = [f(df) for df in self.df1s] + for (df, result) in zip(self.df1s, results): + 
tm.assert_index_equal(result.index, df.columns) + tm.assert_index_equal(result.columns, df.columns) + for i, result in enumerate(results): + if i > 0: + self.compare(result, results[0]) + + @pytest.mark.parametrize( + 'f', [lambda x: x.expanding().cov(pairwise=True), + lambda x: x.expanding().corr(pairwise=True), + lambda x: x.rolling(window=3).cov(pairwise=True), + lambda x: x.rolling(window=3).corr(pairwise=True), + lambda x: x.ewm(com=3).cov(pairwise=True), + lambda x: x.ewm(com=3).corr(pairwise=True)]) + def test_pairwise_with_self(self, f): + + # DataFrame with itself, pairwise=True + results = [f(df) for df in self.df1s] + for (df, result) in zip(self.df1s, results): + tm.assert_index_equal(result.index.levels[0], + df.index, + check_names=False) + tm.assert_index_equal(result.index.levels[1], + df.columns, + check_names=False) + tm.assert_index_equal(result.columns, df.columns) + for i, result in enumerate(results): + if i > 0: + self.compare(result, results[0]) + + @pytest.mark.parametrize( + 'f', [lambda x: x.expanding().cov(pairwise=False), + lambda x: x.expanding().corr(pairwise=False), + lambda x: x.rolling(window=3).cov(pairwise=False), + lambda x: x.rolling(window=3).corr(pairwise=False), + lambda x: x.ewm(com=3).cov(pairwise=False), + lambda x: x.ewm(com=3).corr(pairwise=False), ]) + def test_no_pairwise_with_self(self, f): + + # DataFrame with itself, pairwise=False + results = [f(df) for df in self.df1s] + for (df, result) in zip(self.df1s, results): + tm.assert_index_equal(result.index, df.index) + tm.assert_index_equal(result.columns, df.columns) + for i, result in enumerate(results): + if i > 0: + self.compare(result, results[0]) + + @pytest.mark.parametrize( + 'f', [lambda x, y: x.expanding().cov(y, pairwise=True), + lambda x, y: x.expanding().corr(y, pairwise=True), + lambda x, y: x.rolling(window=3).cov(y, pairwise=True), + lambda x, y: x.rolling(window=3).corr(y, pairwise=True), + lambda x, y: x.ewm(com=3).cov(y, pairwise=True), + lambda x, y: x.ewm(com=3).corr(y, pairwise=True), ]) + def test_pairwise_with_other(self, f): + + # DataFrame with another DataFrame, pairwise=True + results = [f(df, self.df2) for df in self.df1s] + for (df, result) in zip(self.df1s, results): + tm.assert_index_equal(result.index.levels[0], + df.index, + check_names=False) + tm.assert_index_equal(result.index.levels[1], + self.df2.columns, + check_names=False) + for i, result in enumerate(results): + if i > 0: + self.compare(result, results[0]) + + @pytest.mark.parametrize( + 'f', [lambda x, y: x.expanding().cov(y, pairwise=False), + lambda x, y: x.expanding().corr(y, pairwise=False), + lambda x, y: x.rolling(window=3).cov(y, pairwise=False), + lambda x, y: x.rolling(window=3).corr(y, pairwise=False), + lambda x, y: x.ewm(com=3).cov(y, pairwise=False), + lambda x, y: x.ewm(com=3).corr(y, pairwise=False), ]) + def test_no_pairwise_with_other(self, f): + + # DataFrame with another DataFrame, pairwise=False + results = [f(df, self.df2) if df.columns.is_unique else None + for df in self.df1s] + for (df, result) in zip(self.df1s, results): + if result is not None: + with catch_warnings(record=True): + # we can have int and str columns + expected_index = df.index.union(self.df2.index) + expected_columns = df.columns.union(self.df2.columns) + tm.assert_index_equal(result.index, expected_index) + tm.assert_index_equal(result.columns, expected_columns) + else: + tm.assert_raises_regex( + ValueError, "'arg1' columns are not unique", f, df, + self.df2) + tm.assert_raises_regex( + ValueError, "'arg2' 
columns are not unique", f, + self.df2, df) + + @pytest.mark.parametrize( + 'f', [lambda x, y: x.expanding().cov(y), + lambda x, y: x.expanding().corr(y), + lambda x, y: x.rolling(window=3).cov(y), + lambda x, y: x.rolling(window=3).corr(y), + lambda x, y: x.ewm(com=3).cov(y), + lambda x, y: x.ewm(com=3).corr(y), ]) + def test_pairwise_with_series(self, f): + + # DataFrame with a Series + results = ([f(df, self.s) for df in self.df1s] + + [f(self.s, df) for df in self.df1s]) + for (df, result) in zip(self.df1s, results): + tm.assert_index_equal(result.index, df.index) + tm.assert_index_equal(result.columns, df.columns) + for i, result in enumerate(results): + if i > 0: + self.compare(result, results[0]) # create the data only once as we are not setting it @@ -1725,7 +1958,7 @@ def _create_data(self): super(TestMomentsConsistency, self)._create_data() self.data = _consistency_data - def setUp(self): + def setup_method(self, method): self._create_data() def _test_moments_consistency(self, min_periods, count, mean, mock_mean, @@ -1748,7 +1981,8 @@ def _non_null_values(x): # check that correlation of a series with itself is either 1 or NaN corr_x_x = corr(x, x) - # self.assertTrue(_non_null_values(corr_x_x).issubset(set([1.]))) # + + # assert _non_null_values(corr_x_x).issubset(set([1.])) # restore once rolling_cov(x, x) is identically equal to var(x) if is_constant: @@ -1778,11 +2012,11 @@ def _non_null_values(x): # check that var(x), std(x), and cov(x) are all >= 0 var_x = var(x) std_x = std(x) - self.assertFalse((var_x < 0).any().any()) - self.assertFalse((std_x < 0).any().any()) + assert not (var_x < 0).any().any() + assert not (std_x < 0).any().any() if cov: cov_x_x = cov(x, x) - self.assertFalse((cov_x_x < 0).any().any()) + assert not (cov_x_x < 0).any().any() # check that var(x) == cov(x, x) assert_equal(var_x, cov_x_x) @@ -1797,7 +2031,7 @@ def _non_null_values(x): if is_constant: # check that variance of constant series is identically 0 - self.assertFalse((var_x > 0).any().any()) + assert not (var_x > 0).any().any() expected = x * np.nan expected[count_x >= max(min_periods, 1)] = 0. 
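The parametrized pairwise tests above all assert the same post-Panel shape: with `pairwise=True`, the windowed `cov`/`corr` methods return a single DataFrame whose MultiIndex carries the original index in level 0 and the column labels in level 1. A minimal sketch of that shape (illustrative only, not part of the patch; assumes the pandas >= 0.20 behaviour these tests exercise):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
result = df.rolling(window=3).cov(pairwise=True)

# pairwise=True no longer returns a Panel but a MultiIndexed DataFrame:
assert isinstance(result.index, pd.MultiIndex)
assert result.index.levels[0].equals(df.index)    # level 0: original index
assert result.index.levels[1].equals(df.columns)  # level 1: column labels
print(result.loc[4])  # 2x2 covariance matrix for the window ending at row 4
```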
if var is var_unbiased: @@ -2015,21 +2249,6 @@ def test_expanding_consistency(self): assert_equal(expanding_f_result, expanding_apply_f_result) - if (name in ['cov', 'corr']) and isinstance(x, - DataFrame): - # test pairwise=True - expanding_f_result = expanding_f(x, pairwise=True) - expected = Panel(items=x.index, - major_axis=x.columns, - minor_axis=x.columns) - for i, _ in enumerate(x.columns): - for j, _ in enumerate(x.columns): - expected.iloc[:, i, j] = getattr( - x.iloc[:, i].expanding( - min_periods=min_periods), - name)(x.iloc[:, j]) - tm.assert_panel_equal(expanding_f_result, expected) - @tm.slow def test_rolling_consistency(self): @@ -2135,25 +2354,6 @@ def cases(): assert_equal(rolling_f_result, rolling_apply_f_result) - if (name in ['cov', 'corr']) and isinstance( - x, DataFrame): - # test pairwise=True - rolling_f_result = rolling_f(x, - pairwise=True) - expected = Panel(items=x.index, - major_axis=x.columns, - minor_axis=x.columns) - for i, _ in enumerate(x.columns): - for j, _ in enumerate(x.columns): - expected.iloc[:, i, j] = ( - getattr( - x.iloc[:, i] - .rolling(window=window, - min_periods=min_periods, - center=center), - name)(x.iloc[:, j])) - tm.assert_panel_equal(rolling_f_result, expected) - # binary moments def test_rolling_cov(self): A = self.series @@ -2189,16 +2389,16 @@ def _check_pairwise_moment(self, dispatch, name, **kwargs): def get_result(obj, obj2=None): return getattr(getattr(obj, dispatch)(**kwargs), name)(obj2) - panel = get_result(self.frame) - actual = panel.ix[:, 1, 5] + result = get_result(self.frame) + result = result.loc[(slice(None), 1), 5] + result.index = result.index.droplevel(1) expected = get_result(self.frame[1], self.frame[5]) - tm.assert_series_equal(actual, expected, check_names=False) - self.assertEqual(actual.name, 5) + tm.assert_series_equal(result, expected, check_names=False) def test_flex_binary_moment(self): # GH3155 # don't blow the stack - self.assertRaises(TypeError, rwindow._flex_binary_moment, 5, 6, None) + pytest.raises(TypeError, rwindow._flex_binary_moment, 5, 6, None) def test_corr_sanity(self): # GH 3155 @@ -2208,16 +2408,15 @@ def test_corr_sanity(self): [0.84780328, 0.33394331], [0.78369152, 0.63919667]])) res = df[0].rolling(5, center=True).corr(df[1]) - self.assertTrue(all([np.abs(np.nan_to_num(x)) <= 1 for x in res])) + assert all([np.abs(np.nan_to_num(x)) <= 1 for x in res]) # and some fuzzing - for i in range(10): + for _ in range(10): df = DataFrame(np.random.rand(30, 2)) res = df[0].rolling(5, center=True).corr(df[1]) try: - self.assertTrue(all([np.abs(np.nan_to_num(x)) <= 1 for x in res - ])) - except: + assert all([np.abs(np.nan_to_num(x)) <= 1 for x in res]) + except AssertionError: print(res) def test_flex_binary_frame(self): @@ -2267,16 +2466,16 @@ def func(A, B, com, **kwargs): B[-10:] = np.NaN result = func(A, B, 20, min_periods=5) - self.assertTrue(np.isnan(result.values[:14]).all()) - self.assertFalse(np.isnan(result.values[14:]).any()) + assert np.isnan(result.values[:14]).all() + assert not np.isnan(result.values[14:]).any() # GH 7898 for min_periods in (0, 1, 2): result = func(A, B, 20, min_periods=min_periods) # binary functions (ewmcov, ewmcorr) with bias=False require at # least two values - self.assertTrue(np.isnan(result.values[:11]).all()) - self.assertFalse(np.isnan(result.values[11:]).any()) + assert np.isnan(result.values[:11]).all() + assert not np.isnan(result.values[11:]).any() # check series of length 0 result = func(Series([]), Series([]), 50, min_periods=min_periods) @@ -2287,7 
+2486,7 @@ def func(A, B, com, **kwargs): Series([1.]), Series([1.]), 50, min_periods=min_periods) tm.assert_series_equal(result, Series([np.NaN])) - self.assertRaises(Exception, func, A, randn(50), 20, min_periods=5) + pytest.raises(Exception, func, A, randn(50), 20, min_periods=5) def test_expanding_apply(self): ser = Series([]) @@ -2361,17 +2560,14 @@ def test_expanding_cov_pairwise(self): rolling_result = self.frame.rolling(window=len(self.frame), min_periods=1).corr() - for i in result.items: - tm.assert_almost_equal(result[i], rolling_result[i]) + tm.assert_frame_equal(result, rolling_result) def test_expanding_corr_pairwise(self): result = self.frame.expanding().corr() rolling_result = self.frame.rolling(window=len(self.frame), min_periods=1).corr() - - for i in result.items: - tm.assert_almost_equal(result[i], rolling_result[i]) + tm.assert_frame_equal(result, rolling_result) def test_expanding_cov_diff_index(self): # GH 7512 @@ -2439,8 +2635,6 @@ def test_rolling_functions_window_non_shrinkage(self): s_expected = Series(np.nan, index=s.index) df = DataFrame([[1, 5], [3, 2], [3, 9], [-1, 0]], columns=['A', 'B']) df_expected = DataFrame(np.nan, index=df.index, columns=df.columns) - df_expected_panel = Panel(items=df.index, major_axis=df.columns, - minor_axis=df.columns) functions = [lambda x: (x.rolling(window=10, min_periods=5) .cov(x, pairwise=False)), @@ -2472,13 +2666,24 @@ def test_rolling_functions_window_non_shrinkage(self): # scipy needed for rolling_window continue + def test_rolling_functions_window_non_shrinkage_binary(self): + + # corr/cov return a MI DataFrame + df = DataFrame([[1, 5], [3, 2], [3, 9], [-1, 0]], + columns=Index(['A', 'B'], name='foo'), + index=Index(range(4), name='bar')) + df_expected = DataFrame( + columns=Index(['A', 'B'], name='foo'), + index=pd.MultiIndex.from_product([df.index, df.columns], + names=['bar', 'foo']), + dtype='float64') functions = [lambda x: (x.rolling(window=10, min_periods=5) .cov(x, pairwise=True)), lambda x: (x.rolling(window=10, min_periods=5) .corr(x, pairwise=True))] for f in functions: - df_result_panel = f(df) - tm.assert_panel_equal(df_result_panel, df_expected_panel) + df_result = f(df) + tm.assert_frame_equal(df_result, df_expected) def test_moment_functions_zero_length(self): # GH 8056 @@ -2486,13 +2691,9 @@ def test_moment_functions_zero_length(self): s_expected = s df1 = DataFrame() df1_expected = df1 - df1_expected_panel = Panel(items=df1.index, major_axis=df1.columns, - minor_axis=df1.columns) df2 = DataFrame(columns=['a']) df2['a'] = df2['a'].astype('float64') df2_expected = df2 - df2_expected_panel = Panel(items=df2.index, major_axis=df2.columns, - minor_axis=df2.columns) functions = [lambda x: x.expanding().count(), lambda x: x.expanding(min_periods=5).cov( @@ -2545,6 +2746,23 @@ def test_moment_functions_zero_length(self): # scipy needed for rolling_window continue + def test_moment_functions_zero_length_pairwise(self): + + df1 = DataFrame() + df1_expected = df1 + df2 = DataFrame(columns=Index(['a'], name='foo'), + index=Index([], name='bar')) + df2['a'] = df2['a'].astype('float64') + + df1_expected = DataFrame( + index=pd.MultiIndex.from_product([df1.index, df1.columns]), + columns=Index([])) + df2_expected = DataFrame( + index=pd.MultiIndex.from_product([df2.index, df2.columns], + names=['bar', 'foo']), + columns=Index(['a'], name='foo'), + dtype='float64') + functions = [lambda x: (x.expanding(min_periods=5) .cov(x, pairwise=True)), lambda x: (x.expanding(min_periods=5) @@ -2555,24 +2773,33 @@ def 
test_moment_functions_zero_length(self): .corr(x, pairwise=True)), ] for f in functions: - df1_result_panel = f(df1) - tm.assert_panel_equal(df1_result_panel, df1_expected_panel) + df1_result = f(df1) + tm.assert_frame_equal(df1_result, df1_expected) - df2_result_panel = f(df2) - tm.assert_panel_equal(df2_result_panel, df2_expected_panel) + df2_result = f(df2) + tm.assert_frame_equal(df2_result, df2_expected) def test_expanding_cov_pairwise_diff_length(self): # GH 7512 - df1 = DataFrame([[1, 5], [3, 2], [3, 9]], columns=['A', 'B']) - df1a = DataFrame([[1, 5], [3, 9]], index=[0, 2], columns=['A', 'B']) - df2 = DataFrame([[5, 6], [None, None], [2, 1]], columns=['X', 'Y']) - df2a = DataFrame([[5, 6], [2, 1]], index=[0, 2], columns=['X', 'Y']) - result1 = df1.expanding().cov(df2a, pairwise=True)[2] - result2 = df1.expanding().cov(df2a, pairwise=True)[2] - result3 = df1a.expanding().cov(df2, pairwise=True)[2] - result4 = df1a.expanding().cov(df2a, pairwise=True)[2] - expected = DataFrame([[-3., -5.], [-6., -10.]], index=['A', 'B'], - columns=['X', 'Y']) + df1 = DataFrame([[1, 5], [3, 2], [3, 9]], + columns=Index(['A', 'B'], name='foo')) + df1a = DataFrame([[1, 5], [3, 9]], + index=[0, 2], + columns=Index(['A', 'B'], name='foo')) + df2 = DataFrame([[5, 6], [None, None], [2, 1]], + columns=Index(['X', 'Y'], name='foo')) + df2a = DataFrame([[5, 6], [2, 1]], + index=[0, 2], + columns=Index(['X', 'Y'], name='foo')) + # TODO: xref gh-15826 + # .loc is not preserving the names + result1 = df1.expanding().cov(df2, pairwise=True).loc[2] + result2 = df1.expanding().cov(df2a, pairwise=True).loc[2] + result3 = df1a.expanding().cov(df2, pairwise=True).loc[2] + result4 = df1a.expanding().cov(df2a, pairwise=True).loc[2] + expected = DataFrame([[-3.0, -6.0], [-5.0, -10.0]], + columns=Index(['A', 'B'], name='foo'), + index=Index(['X', 'Y'], name='foo')) tm.assert_frame_equal(result1, expected) tm.assert_frame_equal(result2, expected) 
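The `.loc[2]` lookups above are the MultiIndex replacement for the old `result[2]` Panel item access: selecting a label from level 0 yields the pairwise matrix for that row. A hedged, self-contained sketch (not part of the patch):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 5], [3, 2], [3, 9]], columns=['A', 'B'])
df2 = pd.DataFrame([[5, 6], [None, None], [2, 1]], columns=['X', 'Y'])

result = df1.expanding().cov(df2, pairwise=True)
# The slice for row 2 has df2's columns as its index and df1's columns
# as its columns, matching `expected` in the test above.
print(result.loc[2])
```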
tm.assert_frame_equal(result3, expected) tm.assert_frame_equal(result4, expected) - def test_pairwise_stats_column_names_order(self): - # GH 7738 - df1s = [DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], columns=[0, 1]), - DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], columns=[1, 0]), - DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], columns=[1, 1]), - DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], - columns=['C', 'C']), - DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], columns=[1., 0]), - DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], columns=[0., 1]), - DataFrame([[2, 4], [1, 2], [5, 2], [8, 1]], columns=['C', 1]), - DataFrame([[2., 4.], [1., 2.], [5., 2.], [8., 1.]], - columns=[1, 0.]), - DataFrame([[2, 4.], [1, 2.], [5, 2.], [8, 1.]], - columns=[0, 1.]), - DataFrame([[2, 4], [1, 2], [5, 2], [8, 1.]], - columns=[1., 'X']), ] - df2 = DataFrame([[None, 1, 1], [None, 1, 2], - [None, 3, 2], [None, 8, 1]], columns=['Y', 'Z', 'X']) - s = Series([1, 1, 3, 8]) - - # suppress warnings about incomparable objects, as we are deliberately - # testing with such column labels - with warnings.catch_warnings(): - warnings.filterwarnings("ignore", - message=".*incomparable objects.*", - category=RuntimeWarning) - - # DataFrame methods (which do not call _flex_binary_moment()) - for f in [lambda x: x.cov(), lambda x: x.corr(), ]: - results = [f(df) for df in df1s] - for (df, result) in zip(df1s, results): - tm.assert_index_equal(result.index, df.columns) - tm.assert_index_equal(result.columns, df.columns) - for i, result in enumerate(results): - if i > 0: - # compare internal values, as columns can be different - self.assert_numpy_array_equal(result.values, - results[0].values) - - # DataFrame with itself, pairwise=True - for f in [lambda x: x.expanding().cov(pairwise=True), - lambda x: x.expanding().corr(pairwise=True), - lambda x: x.rolling(window=3).cov(pairwise=True), - lambda x: x.rolling(window=3).corr(pairwise=True), - lambda x: x.ewm(com=3).cov(pairwise=True), - lambda x: x.ewm(com=3).corr(pairwise=True), ]: - results = [f(df) for df in df1s] - for (df, result) in zip(df1s, results): - tm.assert_index_equal(result.items, df.index) - tm.assert_index_equal(result.major_axis, df.columns) - tm.assert_index_equal(result.minor_axis, df.columns) - for i, result in enumerate(results): - if i > 0: - self.assert_numpy_array_equal(result.values, - results[0].values) - - # DataFrame with itself, pairwise=False - for f in [lambda x: x.expanding().cov(pairwise=False), - lambda x: x.expanding().corr(pairwise=False), - lambda x: x.rolling(window=3).cov(pairwise=False), - lambda x: x.rolling(window=3).corr(pairwise=False), - lambda x: x.ewm(com=3).cov(pairwise=False), - lambda x: x.ewm(com=3).corr(pairwise=False), ]: - results = [f(df) for df in df1s] - for (df, result) in zip(df1s, results): - tm.assert_index_equal(result.index, df.index) - tm.assert_index_equal(result.columns, df.columns) - for i, result in enumerate(results): - if i > 0: - self.assert_numpy_array_equal(result.values, - results[0].values) - - # DataFrame with another DataFrame, pairwise=True - for f in [lambda x, y: x.expanding().cov(y, pairwise=True), - lambda x, y: x.expanding().corr(y, pairwise=True), - lambda x, y: x.rolling(window=3).cov(y, pairwise=True), - lambda x, y: x.rolling(window=3).corr(y, pairwise=True), - lambda x, y: x.ewm(com=3).cov(y, pairwise=True), - lambda x, y: x.ewm(com=3).corr(y, pairwise=True), ]: - results = [f(df, df2) for df in df1s] - for (df, result) in zip(df1s, results): - tm.assert_index_equal(result.items, df.index) - 
tm.assert_index_equal(result.major_axis, df.columns) - tm.assert_index_equal(result.minor_axis, df2.columns) - for i, result in enumerate(results): - if i > 0: - self.assert_numpy_array_equal(result.values, - results[0].values) - - # DataFrame with another DataFrame, pairwise=False - for f in [lambda x, y: x.expanding().cov(y, pairwise=False), - lambda x, y: x.expanding().corr(y, pairwise=False), - lambda x, y: x.rolling(window=3).cov(y, pairwise=False), - lambda x, y: x.rolling(window=3).corr(y, pairwise=False), - lambda x, y: x.ewm(com=3).cov(y, pairwise=False), - lambda x, y: x.ewm(com=3).corr(y, pairwise=False), ]: - results = [f(df, df2) if df.columns.is_unique else None - for df in df1s] - for (df, result) in zip(df1s, results): - if result is not None: - expected_index = df.index.union(df2.index) - expected_columns = df.columns.union(df2.columns) - tm.assert_index_equal(result.index, expected_index) - tm.assert_index_equal(result.columns, expected_columns) - else: - tm.assertRaisesRegexp( - ValueError, "'arg1' columns are not unique", f, df, - df2) - tm.assertRaisesRegexp( - ValueError, "'arg2' columns are not unique", f, - df2, df) - - # DataFrame with a Series - for f in [lambda x, y: x.expanding().cov(y), - lambda x, y: x.expanding().corr(y), - lambda x, y: x.rolling(window=3).cov(y), - lambda x, y: x.rolling(window=3).corr(y), - lambda x, y: x.ewm(com=3).cov(y), - lambda x, y: x.ewm(com=3).corr(y), ]: - results = [f(df, s) for df in df1s] + [f(s, df) for df in df1s] - for (df, result) in zip(df1s, results): - tm.assert_index_equal(result.index, df.index) - tm.assert_index_equal(result.columns, df.columns) - for i, result in enumerate(results): - if i > 0: - self.assert_numpy_array_equal(result.values, - results[0].values) - def test_rolling_skew_edge_cases(self): all_nan = Series([np.NaN] * 5) @@ -2783,13 +2891,13 @@ def _check_expanding_ndarray(self, func, static_comp, has_min_periods=True, # min_periods is working correctly result = func(arr, min_periods=15) - self.assertTrue(np.isnan(result[13])) - self.assertFalse(np.isnan(result[14])) + assert np.isnan(result[13]) + assert not np.isnan(result[14]) arr2 = randn(20) result = func(arr2, min_periods=5) - self.assertTrue(isnull(result[3])) - self.assertTrue(notnull(result[4])) + assert isnull(result[3]) + assert notnull(result[4]) # min_periods=0 result0 = func(arr, min_periods=0) @@ -2801,9 +2909,9 @@ def _check_expanding_ndarray(self, func, static_comp, has_min_periods=True, def _check_expanding_structures(self, func): series_result = func(self.series) - tm.assertIsInstance(series_result, Series) + assert isinstance(series_result, Series) frame_result = func(self.frame) - self.assertEqual(type(frame_result), DataFrame) + assert type(frame_result) == DataFrame def _check_expanding(self, func, static_comp, has_min_periods=True, has_time_rule=True, preserve_nan=True): @@ -2829,7 +2937,7 @@ def test_rolling_max_gh6297(self): expected = Series([1.0, 2.0, 6.0, 4.0, 5.0], index=[datetime(1975, 1, i, 0) for i in range(1, 6)]) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): x = series.rolling(window=1, freq='D').max() tm.assert_series_equal(expected, x) @@ -2848,14 +2956,14 @@ def test_rolling_max_how_resample(self): # Default how should be max expected = Series([0.0, 1.0, 2.0, 3.0, 20.0], index=[datetime(1975, 1, i, 0) for i in range(1, 6)]) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): x = 
series.rolling(window=1, freq='D').max() tm.assert_series_equal(expected, x) # Now specify median (10.0) expected = Series([0.0, 1.0, 2.0, 3.0, 10.0], index=[datetime(1975, 1, i, 0) for i in range(1, 6)]) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): x = series.rolling(window=1, freq='D').max(how='median') tm.assert_series_equal(expected, x) @@ -2863,7 +2971,7 @@ def test_rolling_max_how_resample(self): v = (4.0 + 10.0 + 20.0) / 3.0 expected = Series([0.0, 1.0, 2.0, 3.0, v], index=[datetime(1975, 1, i, 0) for i in range(1, 6)]) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): x = series.rolling(window=1, freq='D').max(how='mean') tm.assert_series_equal(expected, x) @@ -2882,7 +2990,7 @@ def test_rolling_min_how_resample(self): # Default how should be min expected = Series([0.0, 1.0, 2.0, 3.0, 4.0], index=[datetime(1975, 1, i, 0) for i in range(1, 6)]) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): r = series.rolling(window=1, freq='D') tm.assert_series_equal(expected, r.min()) @@ -2901,7 +3009,7 @@ def test_rolling_median_how_resample(self): # Default how should be median expected = Series([0.0, 1.0, 2.0, 3.0, 10], index=[datetime(1975, 1, i, 0) for i in range(1, 6)]) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): + with catch_warnings(record=True): x = series.rolling(window=1, freq='D').median() tm.assert_series_equal(expected, x) @@ -2923,15 +3031,15 @@ def test_rolling_min_max_numeric_types(self): # correctness result = (DataFrame(np.arange(20, dtype=data_type)) .rolling(window=5).max()) - self.assertEqual(result.dtypes[0], np.dtype("f8")) + assert result.dtypes[0] == np.dtype("f8") result = (DataFrame(np.arange(20, dtype=data_type)) .rolling(window=5).min()) - self.assertEqual(result.dtypes[0], np.dtype("f8")) + assert result.dtypes[0] == np.dtype("f8") -class TestGrouperGrouping(tm.TestCase): +class TestGrouperGrouping(object): - def setUp(self): + def setup_method(self, method): self.series = Series(np.arange(10)) self.frame = DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8, 'B': np.arange(40)}) @@ -2940,12 +3048,12 @@ def test_mutated(self): def f(): self.frame.groupby('A', foo=1) - self.assertRaises(TypeError, f) + pytest.raises(TypeError, f) g = self.frame.groupby('A') - self.assertFalse(g.mutated) + assert not g.mutated g = self.frame.groupby('A', mutated=True) - self.assertTrue(g.mutated) + assert g.mutated def test_getitem(self): g = self.frame.groupby('A') @@ -3074,12 +3182,12 @@ def test_expanding_apply(self): tm.assert_frame_equal(result, expected) -class TestRollingTS(tm.TestCase): +class TestRollingTS(object): # rolling time-series friendly # xref GH13327 - def setUp(self): + def setup_method(self, method): self.regular = DataFrame({'A': pd.date_range('20130101', periods=5, @@ -3109,16 +3217,16 @@ def test_valid(self): df = self.regular # not a valid freq - with self.assertRaises(ValueError): + with pytest.raises(ValueError): df.rolling(window='foobar') # not a datetimelike index - with self.assertRaises(ValueError): + with pytest.raises(ValueError): df.reset_index().rolling(window='foobar') # non-fixed freqs for freq in ['2MS', pd.offsets.MonthBegin(2)]: - with self.assertRaises(ValueError): + with pytest.raises(ValueError): df.rolling(window=freq) for freq in ['1D', pd.offsets.Day(2), '2ms']: @@ -3126,11 +3234,11 @@ def test_valid(self): # non-integer 
min_periods for minp in [1.0, 'foo', np.array([1, 2, 3])]: - with self.assertRaises(ValueError): + with pytest.raises(ValueError): df.rolling(window='1D', min_periods=minp) # center is not implemented - with self.assertRaises(NotImplementedError): + with pytest.raises(NotImplementedError): df.rolling(window='1D', center=True) def test_on(self): @@ -3138,7 +3246,7 @@ def test_on(self): df = self.regular # not a valid column - with self.assertRaises(ValueError): + with pytest.raises(ValueError): df.rolling(window='2s', on='foobar') # column is valid @@ -3147,7 +3255,7 @@ def test_on(self): df.rolling(window='2d', on='C').sum() # invalid columns - with self.assertRaises(ValueError): + with pytest.raises(ValueError): df.rolling(window='2d', on='B') # ok even though on non-selected @@ -3161,22 +3269,22 @@ def test_monotonic_on(self): freq='s'), 'B': range(5)}) - self.assertTrue(df.A.is_monotonic) + assert df.A.is_monotonic df.rolling('2s', on='A').sum() df = df.set_index('A') - self.assertTrue(df.index.is_monotonic) + assert df.index.is_monotonic df.rolling('2s').sum() # non-monotonic df.index = reversed(df.index.tolist()) - self.assertFalse(df.index.is_monotonic) + assert not df.index.is_monotonic - with self.assertRaises(ValueError): + with pytest.raises(ValueError): df.rolling('2s').sum() df = df.reset_index() - with self.assertRaises(ValueError): + with pytest.raises(ValueError): df.rolling('2s', on='A').sum() def test_frame_on(self): @@ -3288,6 +3396,45 @@ def test_min_periods(self): result = df.rolling('2s', min_periods=1).sum() tm.assert_frame_equal(result, expected) + def test_closed(self): + + # xref GH13965 + + df = DataFrame({'A': [1] * 5}, + index=[pd.Timestamp('20130101 09:00:01'), + pd.Timestamp('20130101 09:00:02'), + pd.Timestamp('20130101 09:00:03'), + pd.Timestamp('20130101 09:00:04'), + pd.Timestamp('20130101 09:00:06')]) + + # closed must be 'right', 'left', 'both', 'neither' + with pytest.raises(ValueError): + self.regular.rolling(window='2s', closed="blabla") + + expected = df.copy() + expected["A"] = [1.0, 2, 2, 2, 1] + result = df.rolling('2s', closed='right').sum() + tm.assert_frame_equal(result, expected) + + # default should be 'right' + result = df.rolling('2s').sum() + tm.assert_frame_equal(result, expected) + + expected = df.copy() + expected["A"] = [1.0, 2, 3, 3, 2] + result = df.rolling('2s', closed='both').sum() + tm.assert_frame_equal(result, expected) + + expected = df.copy() + expected["A"] = [np.nan, 1.0, 2, 2, 1] + result = df.rolling('2s', closed='left').sum() + tm.assert_frame_equal(result, expected) + + expected = df.copy() + expected["A"] = [np.nan, 1.0, 1, 1, np.nan] + result = df.rolling('2s', closed='neither').sum() + tm.assert_frame_equal(result, expected) + def test_ragged_sum(self): df = self.ragged @@ -3520,11 +3667,11 @@ def test_perf_min(self): freq='s')) expected = dfp.rolling(2, min_periods=1).min() result = dfp.rolling('2s').min() - self.assertTrue(((result - expected) < 0.01).all().bool()) + assert ((result - expected) < 0.01).all().bool() expected = dfp.rolling(200, min_periods=1).min() result = dfp.rolling('200s').min() - self.assertTrue(((result - expected) < 0.01).all().bool()) + assert ((result - expected) < 0.01).all().bool() def test_ragged_max(self): @@ -3616,3 +3763,41 @@ def agg_by_day(x): agg_by_day).reset_index(level=0, drop=True) tm.assert_frame_equal(result, expected) + + def test_groupby_monotonic(self): + + # GH 15130 + # we don't need to validate monotonicity when grouping + + data = [ + ['David', '1/1/2015', 100], 
['David', '1/5/2015', 500], + ['David', '5/30/2015', 50], ['David', '7/25/2015', 50], + ['Ryan', '1/4/2014', 100], ['Ryan', '1/19/2015', 500], + ['Ryan', '3/31/2016', 50], ['Joe', '7/1/2015', 100], + ['Joe', '9/9/2015', 500], ['Joe', '10/15/2015', 50]] + + df = pd.DataFrame(data=data, columns=['name', 'date', 'amount']) + df['date'] = pd.to_datetime(df['date']) + + expected = df.set_index('date').groupby('name').apply( + lambda x: x.rolling('180D')['amount'].sum()) + result = df.groupby('name').rolling('180D', on='date')['amount'].sum() + tm.assert_series_equal(result, expected) + + def test_non_monotonic(self): + # GH 13966 (similar to #15130, closed by #15175) + + dates = pd.date_range(start='2016-01-01 09:30:00', + periods=20, freq='s') + df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8, + 'B': np.concatenate((dates, dates)), + 'C': np.arange(40)}) + + result = df.groupby('A').rolling('4s', on='B').C.mean() + expected = df.set_index('B').groupby('A').apply( + lambda x: x.rolling('4s')['C'].mean()) + tm.assert_series_equal(result, expected) + + df2 = df.sort_values('B') + result = df2.groupby('A').rolling('4s', on='B').C.mean() + tm.assert_series_equal(result, expected) diff --git a/pandas/tests/tools/__init__.py b/pandas/tests/tools/__init__.py new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/pandas/tools/tests/test_util.py b/pandas/tests/tools/test_numeric.py similarity index 66% rename from pandas/tools/tests/test_util.py rename to pandas/tests/tools/test_numeric.py index f9647721e3c5b..f82ad97d7b70f 100644 --- a/pandas/tools/tests/test_util.py +++ b/pandas/tests/tools/test_numeric.py @@ -1,123 +1,15 @@ -import os -import locale -import codecs -import nose +import pytest +import decimal import numpy as np +import pandas as pd +from pandas import to_numeric, _np_version_under1p9 + +from pandas.util import testing as tm from numpy import iinfo -import pandas as pd -from pandas import (date_range, Index, _np_version_under1p9) -import pandas.util.testing as tm -from pandas.tools.util import cartesian_product, to_numeric - -CURRENT_LOCALE = locale.getlocale() -LOCALE_OVERRIDE = os.environ.get('LOCALE_OVERRIDE', None) - - -class TestCartesianProduct(tm.TestCase): - - def test_simple(self): - x, y = list('ABC'), [1, 22] - result1, result2 = cartesian_product([x, y]) - expected1 = np.array(['A', 'A', 'B', 'B', 'C', 'C']) - expected2 = np.array([1, 22, 1, 22, 1, 22]) - tm.assert_numpy_array_equal(result1, expected1) - tm.assert_numpy_array_equal(result2, expected2) - - def test_datetimeindex(self): - # regression test for GitHub issue #6439 - # make sure that the ordering on datetimeindex is consistent - x = date_range('2000-01-01', periods=2) - result1, result2 = [Index(y).day for y in cartesian_product([x, x])] - expected1 = np.array([1, 1, 2, 2], dtype=np.int32) - expected2 = np.array([1, 2, 1, 2], dtype=np.int32) - tm.assert_numpy_array_equal(result1, expected1) - tm.assert_numpy_array_equal(result2, expected2) - - def test_empty(self): - # product of empty factors - X = [[], [0, 1], []] - Y = [[], [], ['a', 'b', 'c']] - for x, y in zip(X, Y): - expected1 = np.array([], dtype=np.asarray(x).dtype) - expected2 = np.array([], dtype=np.asarray(y).dtype) - result1, result2 = cartesian_product([x, y]) - tm.assert_numpy_array_equal(result1, expected1) - tm.assert_numpy_array_equal(result2, expected2) - - # empty product (empty input): - result = cartesian_product([]) - expected = [] - tm.assert_equal(result, expected) - - def test_invalid_input(self): - invalid_inputs = 
[1, [1], [1, 2], [[1], 2], - 'a', ['a'], ['a', 'b'], [['a'], 'b']] - msg = "Input must be a list-like of list-likes" - for X in invalid_inputs: - tm.assertRaisesRegexp(TypeError, msg, cartesian_product, X=X) - - -class TestLocaleUtils(tm.TestCase): - - @classmethod - def setUpClass(cls): - super(TestLocaleUtils, cls).setUpClass() - cls.locales = tm.get_locales() - - if not cls.locales: - raise nose.SkipTest("No locales found") - - tm._skip_if_windows() - - @classmethod - def tearDownClass(cls): - super(TestLocaleUtils, cls).tearDownClass() - del cls.locales - - def test_get_locales(self): - # all systems should have at least a single locale - assert len(tm.get_locales()) > 0 - - def test_get_locales_prefix(self): - if len(self.locales) == 1: - raise nose.SkipTest("Only a single locale found, no point in " - "trying to test filtering locale prefixes") - first_locale = self.locales[0] - assert len(tm.get_locales(prefix=first_locale[:2])) > 0 - - def test_set_locale(self): - if len(self.locales) == 1: - raise nose.SkipTest("Only a single locale found, no point in " - "trying to test setting another locale") - - if LOCALE_OVERRIDE is None: - lang, enc = 'it_CH', 'UTF-8' - elif LOCALE_OVERRIDE == 'C': - lang, enc = 'en_US', 'ascii' - else: - lang, enc = LOCALE_OVERRIDE.split('.') - - enc = codecs.lookup(enc).name - new_locale = lang, enc - - if not tm._can_set_locale(new_locale): - with tm.assertRaises(locale.Error): - with tm.set_locale(new_locale): - pass - else: - with tm.set_locale(new_locale) as normalized_locale: - new_lang, new_enc = normalized_locale.split('.') - new_enc = codecs.lookup(enc).name - normalized_locale = new_lang, new_enc - self.assertEqual(normalized_locale, new_locale) - - current_locale = locale.getlocale() - self.assertEqual(current_locale, CURRENT_LOCALE) - - -class TestToNumeric(tm.TestCase): + +class TestToNumeric(object): def test_series(self): s = pd.Series(['1', '-3.14', '7']) @@ -147,7 +39,7 @@ def test_series_numeric(self): def test_error(self): s = pd.Series([1, -3.14, 'apple']) msg = 'Unable to parse string "apple" at position 2' - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): to_numeric(s, errors='raise') res = to_numeric(s, errors='ignore') @@ -160,13 +52,13 @@ def test_error(self): s = pd.Series(['orange', 1, -3.14, 'apple']) msg = 'Unable to parse string "orange" at position 0' - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): to_numeric(s, errors='raise') def test_error_seen_bool(self): s = pd.Series([True, False, 'apple']) msg = 'Unable to parse string "apple" at position 2' - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): to_numeric(s, errors='raise') res = to_numeric(s, errors='ignore') @@ -208,6 +100,46 @@ def test_numeric(self): res = to_numeric(s) tm.assert_series_equal(res, expected) + # GH 14827 + df = pd.DataFrame(dict( + a=[1.2, decimal.Decimal(3.14), decimal.Decimal("infinity"), '0.1'], + b=[1.0, 2.0, 3.0, 4.0], + )) + expected = pd.DataFrame(dict( + a=[1.2, 3.14, np.inf, 0.1], + b=[1.0, 2.0, 3.0, 4.0], + )) + + # Test to_numeric over one column + df_copy = df.copy() + df_copy['a'] = df_copy['a'].apply(to_numeric) + tm.assert_frame_equal(df_copy, expected) + + # Test to_numeric over multiple columns + df_copy = df.copy() + df_copy[['a', 'b']] = df_copy[['a', 'b']].apply(to_numeric) + tm.assert_frame_equal(df_copy, expected) + + def test_numeric_lists_and_arrays(self): + # Test to_numeric with embedded lists 
and arrays + df = pd.DataFrame(dict( + a=[[decimal.Decimal(3.14), 1.0], decimal.Decimal(1.6), 0.1] + )) + df['a'] = df['a'].apply(to_numeric) + expected = pd.DataFrame(dict( + a=[[3.14, 1.0], 1.6, 0.1], + )) + tm.assert_frame_equal(df, expected) + + df = pd.DataFrame(dict( + a=[np.array([decimal.Decimal(3.14), 1.0]), 0.1] + )) + df['a'] = df['a'].apply(to_numeric) + expected = pd.DataFrame(dict( + a=[[3.14, 1.0], 0.1], + )) + tm.assert_frame_equal(df, expected) + def test_all_nan(self): s = pd.Series(['a', 'b', 'c']) res = to_numeric(s, errors='coerce') @@ -217,24 +149,24 @@ def test_all_nan(self): def test_type_check(self): # GH 11776 df = pd.DataFrame({'a': [1, -3.14, 7], 'b': ['4', '5', '6']}) - with tm.assertRaisesRegexp(TypeError, "1-d array"): + with tm.assert_raises_regex(TypeError, "1-d array"): to_numeric(df) for errors in ['ignore', 'raise', 'coerce']: - with tm.assertRaisesRegexp(TypeError, "1-d array"): + with tm.assert_raises_regex(TypeError, "1-d array"): to_numeric(df, errors=errors) def test_scalar(self): - self.assertEqual(pd.to_numeric(1), 1) - self.assertEqual(pd.to_numeric(1.1), 1.1) + assert pd.to_numeric(1) == 1 + assert pd.to_numeric(1.1) == 1.1 - self.assertEqual(pd.to_numeric('1'), 1) - self.assertEqual(pd.to_numeric('1.1'), 1.1) + assert pd.to_numeric('1') == 1 + assert pd.to_numeric('1.1') == 1.1 - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): to_numeric('XX', errors='raise') - self.assertEqual(to_numeric('XX', errors='ignore'), 'XX') - self.assertTrue(np.isnan(to_numeric('XX', errors='coerce'))) + assert to_numeric('XX', errors='ignore') == 'XX' + assert np.isnan(to_numeric('XX', errors='coerce')) def test_numeric_dtypes(self): idx = pd.Index([1, 2, 3], name='xxx') @@ -321,7 +253,7 @@ def test_non_hashable(self): res = pd.to_numeric(s, errors='ignore') tm.assert_series_equal(res, pd.Series([[10.0, 2], 1.0, 'apple'])) - with self.assertRaisesRegexp(TypeError, "Invalid object type"): + with tm.assert_raises_regex(TypeError, "Invalid object type"): pd.to_numeric(s) def test_downcast(self): @@ -342,7 +274,7 @@ def test_downcast(self): smallest_float_dtype = float_32_char for data in (mixed_data, int_data, date_data): - with self.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): pd.to_numeric(data, downcast=invalid_downcast) expected = np.array([1, 2, 3], dtype=np.int64) @@ -385,12 +317,16 @@ def test_downcast(self): # cannot cast to an integer (signed or unsigned) # because we have a float number - data = ['1.1', 2, 3] - expected = np.array([1.1, 2, 3], dtype=np.float64) + data = (['1.1', 2, 3], + [10000.0, 20000, 3000, 40000.36, 50000, 50000.00]) + expected = (np.array([1.1, 2, 3], dtype=np.float64), + np.array([10000.0, 20000, 3000, + 40000.36, 50000, 50000.00], dtype=np.float64)) - for downcast in ('integer', 'signed', 'unsigned'): - res = pd.to_numeric(data, downcast=downcast) - tm.assert_numpy_array_equal(res, expected) + for _data, _expected in zip(data, expected): + for downcast in ('integer', 'signed', 'unsigned'): + res = pd.to_numeric(_data, downcast=downcast) + tm.assert_numpy_array_equal(res, _expected) # the smallest integer dtype need not be np.(u)int8 data = ['256', 257, 258] @@ -406,7 +342,7 @@ def test_downcast_limits(self): # Test the limits of each downcast. Bug: #14401. # Check to make sure numpy is new enough to run this test. 
if _np_version_under1p9: - raise nose.SkipTest("Numpy version is under 1.9") + pytest.skip("Numpy version is under 1.9") i = 'integer' u = 'unsigned' @@ -418,8 +354,7 @@ def test_downcast_limits(self): ('uint8', u, [iinfo(np.uint8).min, iinfo(np.uint8).max]), ('uint16', u, [iinfo(np.uint16).min, iinfo(np.uint16).max]), ('uint32', u, [iinfo(np.uint32).min, iinfo(np.uint32).max]), - # Test will be skipped until there is more uint64 support. - # ('uint64', u, [iinfo(uint64).min, iinfo(uint64).max]), + ('uint64', u, [iinfo(np.uint64).min, iinfo(np.uint64).max]), ('int16', i, [iinfo(np.int8).min, iinfo(np.int8).max + 1]), ('int32', i, [iinfo(np.int16).min, iinfo(np.int16).max + 1]), ('int64', i, [iinfo(np.int32).min, iinfo(np.int32).max + 1]), @@ -428,15 +363,9 @@ def test_downcast_limits(self): ('int64', i, [iinfo(np.int32).min - 1, iinfo(np.int64).max]), ('uint16', u, [iinfo(np.uint8).min, iinfo(np.uint8).max + 1]), ('uint32', u, [iinfo(np.uint16).min, iinfo(np.uint16).max + 1]), - # Test will be skipped until there is more uint64 support. - # ('uint64', u, [iinfo(np.uint32).min, iinfo(np.uint32).max + 1]), + ('uint64', u, [iinfo(np.uint32).min, iinfo(np.uint32).max + 1]) ] for dtype, downcast, min_max in dtype_downcast_min_max: series = pd.to_numeric(pd.Series(min_max), downcast=downcast) - tm.assert_equal(series.dtype, dtype) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + assert series.dtype == dtype diff --git a/pandas/tests/tseries/__init__.py b/pandas/tests/tseries/__init__.py new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/pandas/tseries/tests/data/cday-0.14.1.pickle b/pandas/tests/tseries/data/cday-0.14.1.pickle similarity index 100% rename from pandas/tseries/tests/data/cday-0.14.1.pickle rename to pandas/tests/tseries/data/cday-0.14.1.pickle diff --git a/pandas/tseries/tests/data/dateoffset_0_15_2.pickle b/pandas/tests/tseries/data/dateoffset_0_15_2.pickle similarity index 100% rename from pandas/tseries/tests/data/dateoffset_0_15_2.pickle rename to pandas/tests/tseries/data/dateoffset_0_15_2.pickle diff --git a/pandas/tseries/tests/test_frequencies.py b/pandas/tests/tseries/test_frequencies.py similarity index 67% rename from pandas/tseries/tests/test_frequencies.py rename to pandas/tests/tseries/test_frequencies.py index 5ba98f15aed8d..2edca1bd4676b 100644 --- a/pandas/tseries/tests/test_frequencies.py +++ b/pandas/tests/tseries/test_frequencies.py @@ -1,16 +1,17 @@ from datetime import datetime, timedelta from pandas.compat import range +import pytest import numpy as np from pandas import (Index, DatetimeIndex, Timestamp, Series, date_range, period_range) import pandas.tseries.frequencies as frequencies -from pandas.tseries.tools import to_datetime +from pandas.core.tools.datetimes import to_datetime import pandas.tseries.offsets as offsets -from pandas.tseries.period import PeriodIndex +from pandas.core.indexes.period import PeriodIndex import pandas.compat as compat from pandas.compat import is_platform_windows @@ -18,7 +19,7 @@ from pandas import Timedelta -class TestToOffset(tm.TestCase): +class TestToOffset(object): def test_to_offset_multiple(self): freqstr = '2h30min' @@ -39,6 +40,21 @@ def test_to_offset_multiple(self): expected = offsets.Hour(3) assert (result == expected) + freqstr = '2h 20.5min' + result = frequencies.to_offset(freqstr) + expected = offsets.Second(8430) + assert (result == expected) + + freqstr = '1.5min' + result = frequencies.to_offset(freqstr) + expected = 
offsets.Second(90) + assert (result == expected) + + freqstr = '0.5S' + result = frequencies.to_offset(freqstr) + expected = offsets.Milli(500) + assert (result == expected) + freqstr = '15l500u' result = frequencies.to_offset(freqstr) expected = offsets.Micro(15500) @@ -49,6 +65,16 @@ def test_to_offset_multiple(self): expected = offsets.Milli(10075) assert (result == expected) + freqstr = '1s0.25ms' + result = frequencies.to_offset(freqstr) + expected = offsets.Micro(1000250) + assert (result == expected) + + freqstr = '1s0.25L' + result = frequencies.to_offset(freqstr) + expected = offsets.Micro(1000250) + assert (result == expected) + freqstr = '2800N' result = frequencies.to_offset(freqstr) expected = offsets.Nano(2800) @@ -75,7 +101,8 @@ def test_to_offset_multiple(self): assert (result == expected) # malformed - with tm.assertRaisesRegexp(ValueError, 'Invalid frequency: 2h20m'): + with tm.assert_raises_regex(ValueError, + 'Invalid frequency: 2h20m'): frequencies.to_offset('2h20m') def test_to_offset_negative(self): @@ -97,20 +124,24 @@ def test_to_offset_negative(self): def test_to_offset_invalid(self): # GH 13930 - with tm.assertRaisesRegexp(ValueError, 'Invalid frequency: U1'): + with tm.assert_raises_regex(ValueError, + 'Invalid frequency: U1'): frequencies.to_offset('U1') - with tm.assertRaisesRegexp(ValueError, 'Invalid frequency: -U'): + with tm.assert_raises_regex(ValueError, + 'Invalid frequency: -U'): frequencies.to_offset('-U') - with tm.assertRaisesRegexp(ValueError, 'Invalid frequency: 3U1'): + with tm.assert_raises_regex(ValueError, + 'Invalid frequency: 3U1'): frequencies.to_offset('3U1') - with tm.assertRaisesRegexp(ValueError, 'Invalid frequency: -2-3U'): + with tm.assert_raises_regex(ValueError, + 'Invalid frequency: -2-3U'): frequencies.to_offset('-2-3U') - with tm.assertRaisesRegexp(ValueError, 'Invalid frequency: -2D:3H'): + with tm.assert_raises_regex(ValueError, + 'Invalid frequency: -2D:3H'): frequencies.to_offset('-2D:3H') - - # ToDo: Must be fixed in #8419 - with tm.assertRaisesRegexp(ValueError, 'Invalid frequency: .5S'): - frequencies.to_offset('.5S') + with tm.assert_raises_regex(ValueError, + 'Invalid frequency: 1.5.0S'): + frequencies.to_offset('1.5.0S') # split offsets with spaces are valid assert frequencies.to_offset('2D 3H') == offsets.Hour(51) @@ -122,10 +153,11 @@ def test_to_offset_invalid(self): # special cases assert frequencies.to_offset('2SMS-15') == offsets.SemiMonthBegin(2) - with tm.assertRaisesRegexp(ValueError, - 'Invalid frequency: 2SMS-15-15'): + with tm.assert_raises_regex(ValueError, + 'Invalid frequency: 2SMS-15-15'): frequencies.to_offset('2SMS-15-15') - with tm.assertRaisesRegexp(ValueError, 'Invalid frequency: 2SMS-15D'): + with tm.assert_raises_regex(ValueError, + 'Invalid frequency: 2SMS-15D'): frequencies.to_offset('2SMS-15D') def test_to_offset_leading_zero(self): @@ -175,7 +207,7 @@ def test_to_offset_pd_timedelta(self): assert (expected == result) td = Timedelta(microseconds=0) - tm.assertRaises(ValueError, lambda: frequencies.to_offset(td)) + pytest.raises(ValueError, lambda: frequencies.to_offset(td)) def test_anchored_shortcuts(self): result = frequencies.to_offset('W') @@ -220,10 +252,23 @@ def test_anchored_shortcuts(self): 'SMS-1', 'SMS-28', 'SMS-30', 'SMS-BAR', 'BSMS', 'SMS--2'] for invalid_anchor in invalid_anchors: - with tm.assertRaisesRegexp(ValueError, 'Invalid frequency: '): + with tm.assert_raises_regex(ValueError, + 'Invalid frequency: '): frequencies.to_offset(invalid_anchor) +def test_ms_vs_MS(): + left = 
frequencies.get_offset('ms') + right = frequencies.get_offset('MS') + assert left == offsets.Milli() + assert right == offsets.MonthBegin() + + +def test_rule_aliases(): + rule = frequencies.to_offset('10us') + assert rule == offsets.Micro(10) + + def test_get_rule_month(): result = frequencies._get_rule_month('W') assert (result == 'DEC') @@ -270,7 +315,7 @@ def _assert_depr(freq, expected, aliases): msg = frequencies._INVALID_FREQ_ERROR for alias in aliases: - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): frequencies._period_str_to_code(alias) _assert_depr("M", 3000, ["MTH", "MONTH", "MONTHLY"]) @@ -297,165 +342,181 @@ def _assert_depr(freq, expected, aliases): assert (frequencies._period_str_to_code('NS') == 12000) -class TestFrequencyCode(tm.TestCase): +class TestFrequencyCode(object): def test_freq_code(self): - self.assertEqual(frequencies.get_freq('A'), 1000) - self.assertEqual(frequencies.get_freq('3A'), 1000) - self.assertEqual(frequencies.get_freq('-1A'), 1000) + assert frequencies.get_freq('A') == 1000 + assert frequencies.get_freq('3A') == 1000 + assert frequencies.get_freq('-1A') == 1000 - self.assertEqual(frequencies.get_freq('W'), 4000) - self.assertEqual(frequencies.get_freq('W-MON'), 4001) - self.assertEqual(frequencies.get_freq('W-FRI'), 4005) + assert frequencies.get_freq('W') == 4000 + assert frequencies.get_freq('W-MON') == 4001 + assert frequencies.get_freq('W-FRI') == 4005 for freqstr, code in compat.iteritems(frequencies._period_code_map): result = frequencies.get_freq(freqstr) - self.assertEqual(result, code) + assert result == code result = frequencies.get_freq_group(freqstr) - self.assertEqual(result, code // 1000 * 1000) + assert result == code // 1000 * 1000 result = frequencies.get_freq_group(code) - self.assertEqual(result, code // 1000 * 1000) + assert result == code // 1000 * 1000 def test_freq_group(self): - self.assertEqual(frequencies.get_freq_group('A'), 1000) - self.assertEqual(frequencies.get_freq_group('3A'), 1000) - self.assertEqual(frequencies.get_freq_group('-1A'), 1000) - self.assertEqual(frequencies.get_freq_group('A-JAN'), 1000) - self.assertEqual(frequencies.get_freq_group('A-MAY'), 1000) - self.assertEqual(frequencies.get_freq_group(offsets.YearEnd()), 1000) - self.assertEqual(frequencies.get_freq_group( - offsets.YearEnd(month=1)), 1000) - self.assertEqual(frequencies.get_freq_group( - offsets.YearEnd(month=5)), 1000) - - self.assertEqual(frequencies.get_freq_group('W'), 4000) - self.assertEqual(frequencies.get_freq_group('W-MON'), 4000) - self.assertEqual(frequencies.get_freq_group('W-FRI'), 4000) - self.assertEqual(frequencies.get_freq_group(offsets.Week()), 4000) - self.assertEqual(frequencies.get_freq_group( - offsets.Week(weekday=1)), 4000) - self.assertEqual(frequencies.get_freq_group( - offsets.Week(weekday=5)), 4000) + assert frequencies.get_freq_group('A') == 1000 + assert frequencies.get_freq_group('3A') == 1000 + assert frequencies.get_freq_group('-1A') == 1000 + assert frequencies.get_freq_group('A-JAN') == 1000 + assert frequencies.get_freq_group('A-MAY') == 1000 + assert frequencies.get_freq_group(offsets.YearEnd()) == 1000 + assert frequencies.get_freq_group(offsets.YearEnd(month=1)) == 1000 + assert frequencies.get_freq_group(offsets.YearEnd(month=5)) == 1000 + + assert frequencies.get_freq_group('W') == 4000 + assert frequencies.get_freq_group('W-MON') == 4000 + assert frequencies.get_freq_group('W-FRI') == 4000 + assert frequencies.get_freq_group(offsets.Week()) == 4000 + 
assert frequencies.get_freq_group(offsets.Week(weekday=1)) == 4000 + assert frequencies.get_freq_group(offsets.Week(weekday=5)) == 4000 def test_get_to_timestamp_base(self): tsb = frequencies.get_to_timestamp_base - self.assertEqual(tsb(frequencies.get_freq_code('D')[0]), - frequencies.get_freq_code('D')[0]) - self.assertEqual(tsb(frequencies.get_freq_code('W')[0]), - frequencies.get_freq_code('D')[0]) - self.assertEqual(tsb(frequencies.get_freq_code('M')[0]), - frequencies.get_freq_code('D')[0]) + assert (tsb(frequencies.get_freq_code('D')[0]) == + frequencies.get_freq_code('D')[0]) + assert (tsb(frequencies.get_freq_code('W')[0]) == + frequencies.get_freq_code('D')[0]) + assert (tsb(frequencies.get_freq_code('M')[0]) == + frequencies.get_freq_code('D')[0]) - self.assertEqual(tsb(frequencies.get_freq_code('S')[0]), - frequencies.get_freq_code('S')[0]) - self.assertEqual(tsb(frequencies.get_freq_code('T')[0]), - frequencies.get_freq_code('S')[0]) - self.assertEqual(tsb(frequencies.get_freq_code('H')[0]), - frequencies.get_freq_code('S')[0]) + assert (tsb(frequencies.get_freq_code('S')[0]) == + frequencies.get_freq_code('S')[0]) + assert (tsb(frequencies.get_freq_code('T')[0]) == + frequencies.get_freq_code('S')[0]) + assert (tsb(frequencies.get_freq_code('H')[0]) == + frequencies.get_freq_code('S')[0]) def test_freq_to_reso(self): Reso = frequencies.Resolution - self.assertEqual(Reso.get_str_from_freq('A'), 'year') - self.assertEqual(Reso.get_str_from_freq('Q'), 'quarter') - self.assertEqual(Reso.get_str_from_freq('M'), 'month') - self.assertEqual(Reso.get_str_from_freq('D'), 'day') - self.assertEqual(Reso.get_str_from_freq('H'), 'hour') - self.assertEqual(Reso.get_str_from_freq('T'), 'minute') - self.assertEqual(Reso.get_str_from_freq('S'), 'second') - self.assertEqual(Reso.get_str_from_freq('L'), 'millisecond') - self.assertEqual(Reso.get_str_from_freq('U'), 'microsecond') - self.assertEqual(Reso.get_str_from_freq('N'), 'nanosecond') + assert Reso.get_str_from_freq('A') == 'year' + assert Reso.get_str_from_freq('Q') == 'quarter' + assert Reso.get_str_from_freq('M') == 'month' + assert Reso.get_str_from_freq('D') == 'day' + assert Reso.get_str_from_freq('H') == 'hour' + assert Reso.get_str_from_freq('T') == 'minute' + assert Reso.get_str_from_freq('S') == 'second' + assert Reso.get_str_from_freq('L') == 'millisecond' + assert Reso.get_str_from_freq('U') == 'microsecond' + assert Reso.get_str_from_freq('N') == 'nanosecond' for freq in ['A', 'Q', 'M', 'D', 'H', 'T', 'S', 'L', 'U', 'N']: # check roundtrip result = Reso.get_freq(Reso.get_str_from_freq(freq)) - self.assertEqual(freq, result) + assert freq == result for freq in ['D', 'H', 'T', 'S', 'L', 'U']: result = Reso.get_freq(Reso.get_str(Reso.get_reso_from_freq(freq))) - self.assertEqual(freq, result) + assert freq == result + + def test_resolution_bumping(self): + # see gh-14378 + Reso = frequencies.Resolution + + assert Reso.get_stride_from_decimal(1.5, 'T') == (90, 'S') + assert Reso.get_stride_from_decimal(62.4, 'T') == (3744, 'S') + assert Reso.get_stride_from_decimal(1.04, 'H') == (3744, 'S') + assert Reso.get_stride_from_decimal(1, 'D') == (1, 'D') + assert (Reso.get_stride_from_decimal(0.342931, 'H') == + (1234551600, 'U')) + assert Reso.get_stride_from_decimal(1.2345, 'D') == (106660800, 'L') + + with pytest.raises(ValueError): + Reso.get_stride_from_decimal(0.5, 'N') + + # too much precision in the input can prevent + with pytest.raises(ValueError): + Reso.get_stride_from_decimal(0.3429324798798269273987982, 'H') def 
test_get_freq_code(self): - # freqstr - self.assertEqual(frequencies.get_freq_code('A'), - (frequencies.get_freq('A'), 1)) - self.assertEqual(frequencies.get_freq_code('3D'), - (frequencies.get_freq('D'), 3)) - self.assertEqual(frequencies.get_freq_code('-2M'), - (frequencies.get_freq('M'), -2)) + # frequency str + assert (frequencies.get_freq_code('A') == + (frequencies.get_freq('A'), 1)) + assert (frequencies.get_freq_code('3D') == + (frequencies.get_freq('D'), 3)) + assert (frequencies.get_freq_code('-2M') == + (frequencies.get_freq('M'), -2)) # tuple - self.assertEqual(frequencies.get_freq_code(('D', 1)), - (frequencies.get_freq('D'), 1)) - self.assertEqual(frequencies.get_freq_code(('A', 3)), - (frequencies.get_freq('A'), 3)) - self.assertEqual(frequencies.get_freq_code(('M', -2)), - (frequencies.get_freq('M'), -2)) + assert (frequencies.get_freq_code(('D', 1)) == + (frequencies.get_freq('D'), 1)) + assert (frequencies.get_freq_code(('A', 3)) == + (frequencies.get_freq('A'), 3)) + assert (frequencies.get_freq_code(('M', -2)) == + (frequencies.get_freq('M'), -2)) + # numeric tuple - self.assertEqual(frequencies.get_freq_code((1000, 1)), (1000, 1)) + assert frequencies.get_freq_code((1000, 1)) == (1000, 1) # offsets - self.assertEqual(frequencies.get_freq_code(offsets.Day()), - (frequencies.get_freq('D'), 1)) - self.assertEqual(frequencies.get_freq_code(offsets.Day(3)), - (frequencies.get_freq('D'), 3)) - self.assertEqual(frequencies.get_freq_code(offsets.Day(-2)), - (frequencies.get_freq('D'), -2)) - - self.assertEqual(frequencies.get_freq_code(offsets.MonthEnd()), - (frequencies.get_freq('M'), 1)) - self.assertEqual(frequencies.get_freq_code(offsets.MonthEnd(3)), - (frequencies.get_freq('M'), 3)) - self.assertEqual(frequencies.get_freq_code(offsets.MonthEnd(-2)), - (frequencies.get_freq('M'), -2)) - - self.assertEqual(frequencies.get_freq_code(offsets.Week()), - (frequencies.get_freq('W'), 1)) - self.assertEqual(frequencies.get_freq_code(offsets.Week(3)), - (frequencies.get_freq('W'), 3)) - self.assertEqual(frequencies.get_freq_code(offsets.Week(-2)), - (frequencies.get_freq('W'), -2)) - - # monday is weekday=0 - self.assertEqual(frequencies.get_freq_code(offsets.Week(weekday=1)), - (frequencies.get_freq('W-TUE'), 1)) - self.assertEqual(frequencies.get_freq_code(offsets.Week(3, weekday=0)), - (frequencies.get_freq('W-MON'), 3)) - self.assertEqual( - frequencies.get_freq_code(offsets.Week(-2, weekday=4)), - (frequencies.get_freq('W-FRI'), -2)) + assert (frequencies.get_freq_code(offsets.Day()) == + (frequencies.get_freq('D'), 1)) + assert (frequencies.get_freq_code(offsets.Day(3)) == + (frequencies.get_freq('D'), 3)) + assert (frequencies.get_freq_code(offsets.Day(-2)) == + (frequencies.get_freq('D'), -2)) + + assert (frequencies.get_freq_code(offsets.MonthEnd()) == + (frequencies.get_freq('M'), 1)) + assert (frequencies.get_freq_code(offsets.MonthEnd(3)) == + (frequencies.get_freq('M'), 3)) + assert (frequencies.get_freq_code(offsets.MonthEnd(-2)) == + (frequencies.get_freq('M'), -2)) + + assert (frequencies.get_freq_code(offsets.Week()) == + (frequencies.get_freq('W'), 1)) + assert (frequencies.get_freq_code(offsets.Week(3)) == + (frequencies.get_freq('W'), 3)) + assert (frequencies.get_freq_code(offsets.Week(-2)) == + (frequencies.get_freq('W'), -2)) + + # Monday is weekday=0 + assert (frequencies.get_freq_code(offsets.Week(weekday=1)) == + (frequencies.get_freq('W-TUE'), 1)) + assert (frequencies.get_freq_code(offsets.Week(3, weekday=0)) == + (frequencies.get_freq('W-MON'), 3)) + 
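For readers unfamiliar with the anchored weekly codes asserted here: `Week(weekday=n)` numbers days Monday=0 through Sunday=6, and `get_freq_code` resolves an offset to the matching `W-<DAY>` code plus a stride. A small sketch using the same 0.20-era internal API (illustrative, not part of the patch):

```python
import pandas.tseries.frequencies as frequencies
import pandas.tseries.offsets as offsets

# Monday=0 .. Sunday=6, so weekday=1 resolves to the Tuesday-anchored code.
code, stride = frequencies.get_freq_code(offsets.Week(weekday=1))
assert code == frequencies.get_freq('W-TUE')
assert stride == 1
```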
assert (frequencies.get_freq_code(offsets.Week(-2, weekday=4)) == + (frequencies.get_freq('W-FRI'), -2)) _dti = DatetimeIndex -class TestFrequencyInference(tm.TestCase): +class TestFrequencyInference(object): + def test_raise_if_period_index(self): index = PeriodIndex(start="1/1/1990", periods=20, freq="M") - self.assertRaises(TypeError, frequencies.infer_freq, index) + pytest.raises(TypeError, frequencies.infer_freq, index) def test_raise_if_too_few(self): index = _dti(['12/31/1998', '1/3/1999']) - self.assertRaises(ValueError, frequencies.infer_freq, index) + pytest.raises(ValueError, frequencies.infer_freq, index) def test_business_daily(self): index = _dti(['12/31/1998', '1/3/1999', '1/4/1999']) - self.assertEqual(frequencies.infer_freq(index), 'B') + assert frequencies.infer_freq(index) == 'B' def test_day(self): self._check_tick(timedelta(1), 'D') def test_day_corner(self): index = _dti(['1/1/2000', '1/2/2000', '1/3/2000']) - self.assertEqual(frequencies.infer_freq(index), 'D') + assert frequencies.infer_freq(index) == 'D' def test_non_datetimeindex(self): dates = to_datetime(['1/1/2000', '1/2/2000', '1/3/2000']) - self.assertEqual(frequencies.infer_freq(dates), 'D') + assert frequencies.infer_freq(dates) == 'D' def test_hour(self): self._check_tick(timedelta(hours=1), 'H') @@ -484,16 +545,16 @@ def _check_tick(self, base_delta, code): exp_freq = '%d%s' % (i, code) else: exp_freq = code - self.assertEqual(frequencies.infer_freq(index), exp_freq) + assert frequencies.infer_freq(index) == exp_freq index = _dti([b + base_delta * 7] + [b + base_delta * j for j in range( 3)]) - self.assertIsNone(frequencies.infer_freq(index)) + assert frequencies.infer_freq(index) is None index = _dti([b + base_delta * j for j in range(3)] + [b + base_delta * 7]) - self.assertIsNone(frequencies.infer_freq(index)) + assert frequencies.infer_freq(index) is None def test_weekly(self): days = ['MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT', 'SUN'] @@ -511,7 +572,7 @@ def test_week_of_month(self): def test_fifth_week_of_month(self): # Only supports freq up to WOM-4. See #9425 func = lambda: date_range('2014-01-01', freq='WOM-5MON') - self.assertRaises(ValueError, func) + pytest.raises(ValueError, func) def test_fifth_week_of_month_infer(self): # Only attempts to infer up to WOM-4. 
See #9425 @@ -529,7 +590,7 @@ def test_monthly(self): def test_monthly_ambiguous(self): rng = _dti(['1/31/2000', '2/29/2000', '3/31/2000']) - self.assertEqual(rng.inferred_freq, 'M') + assert rng.inferred_freq == 'M' def test_business_monthly(self): self._check_generated_range('1/1/2000', 'BM') @@ -551,7 +612,7 @@ def test_business_annual(self): def test_annual_ambiguous(self): rng = _dti(['1/31/2000', '1/31/2001', '1/31/2002']) - self.assertEqual(rng.inferred_freq, 'A-JAN') + assert rng.inferred_freq == 'A-JAN' def _check_generated_range(self, start, freq): freq = freq.upper() @@ -559,41 +620,45 @@ def _check_generated_range(self, start, freq): gen = date_range(start, periods=7, freq=freq) index = _dti(gen.values) if not freq.startswith('Q-'): - self.assertEqual(frequencies.infer_freq(index), gen.freqstr) + assert frequencies.infer_freq(index) == gen.freqstr else: inf_freq = frequencies.infer_freq(index) - self.assertTrue((inf_freq == 'Q-DEC' and gen.freqstr in ( - 'Q', 'Q-DEC', 'Q-SEP', 'Q-JUN', 'Q-MAR')) or ( - inf_freq == 'Q-NOV' and gen.freqstr in ( - 'Q-NOV', 'Q-AUG', 'Q-MAY', 'Q-FEB')) or ( - inf_freq == 'Q-OCT' and gen.freqstr in ( - 'Q-OCT', 'Q-JUL', 'Q-APR', 'Q-JAN'))) + is_dec_range = inf_freq == 'Q-DEC' and gen.freqstr in ( + 'Q', 'Q-DEC', 'Q-SEP', 'Q-JUN', 'Q-MAR') + is_nov_range = inf_freq == 'Q-NOV' and gen.freqstr in ( + 'Q-NOV', 'Q-AUG', 'Q-MAY', 'Q-FEB') + is_oct_range = inf_freq == 'Q-OCT' and gen.freqstr in ( + 'Q-OCT', 'Q-JUL', 'Q-APR', 'Q-JAN') + assert is_dec_range or is_nov_range or is_oct_range gen = date_range(start, periods=5, freq=freq) index = _dti(gen.values) + if not freq.startswith('Q-'): - self.assertEqual(frequencies.infer_freq(index), gen.freqstr) + assert frequencies.infer_freq(index) == gen.freqstr else: inf_freq = frequencies.infer_freq(index) - self.assertTrue((inf_freq == 'Q-DEC' and gen.freqstr in ( - 'Q', 'Q-DEC', 'Q-SEP', 'Q-JUN', 'Q-MAR')) or ( - inf_freq == 'Q-NOV' and gen.freqstr in ( - 'Q-NOV', 'Q-AUG', 'Q-MAY', 'Q-FEB')) or ( - inf_freq == 'Q-OCT' and gen.freqstr in ( - 'Q-OCT', 'Q-JUL', 'Q-APR', 'Q-JAN'))) + is_dec_range = inf_freq == 'Q-DEC' and gen.freqstr in ( + 'Q', 'Q-DEC', 'Q-SEP', 'Q-JUN', 'Q-MAR') + is_nov_range = inf_freq == 'Q-NOV' and gen.freqstr in ( + 'Q-NOV', 'Q-AUG', 'Q-MAY', 'Q-FEB') + is_oct_range = inf_freq == 'Q-OCT' and gen.freqstr in ( + 'Q-OCT', 'Q-JUL', 'Q-APR', 'Q-JAN') + + assert is_dec_range or is_nov_range or is_oct_range def test_infer_freq(self): rng = period_range('1959Q2', '2009Q3', freq='Q') rng = Index(rng.to_timestamp('D', how='e').asobject) - self.assertEqual(rng.inferred_freq, 'Q-DEC') + assert rng.inferred_freq == 'Q-DEC' rng = period_range('1959Q2', '2009Q3', freq='Q-NOV') rng = Index(rng.to_timestamp('D', how='e').asobject) - self.assertEqual(rng.inferred_freq, 'Q-NOV') + assert rng.inferred_freq == 'Q-NOV' rng = period_range('1959Q2', '2009Q3', freq='Q-OCT') rng = Index(rng.to_timestamp('D', how='e').asobject) - self.assertEqual(rng.inferred_freq, 'Q-OCT') + assert rng.inferred_freq == 'Q-OCT' def test_infer_freq_tz(self): @@ -613,7 +678,7 @@ def test_infer_freq_tz(self): 'US/Pacific', 'US/Eastern']: for expected, dates in compat.iteritems(freqs): idx = DatetimeIndex(dates, tz=tz) - self.assertEqual(idx.inferred_freq, expected) + assert idx.inferred_freq == expected def test_infer_freq_tz_transition(self): # Tests for #8772 @@ -629,11 +694,11 @@ def test_infer_freq_tz_transition(self): for freq in freqs: idx = date_range(date_pair[0], date_pair[ 1], freq=freq, tz=tz) - 
self.assertEqual(idx.inferred_freq, freq) + assert idx.inferred_freq == freq index = date_range("2013-11-03", periods=5, freq="3H").tz_localize("America/Chicago") - self.assertIsNone(index.inferred_freq) + assert index.inferred_freq is None def test_infer_freq_businesshour(self): # GH 7905 @@ -641,21 +706,21 @@ def test_infer_freq_businesshour(self): ['2014-07-01 09:00', '2014-07-01 10:00', '2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00', '2014-07-01 14:00']) # hourly freq in a day must result in 'H' - self.assertEqual(idx.inferred_freq, 'H') + assert idx.inferred_freq == 'H' idx = DatetimeIndex( ['2014-07-01 09:00', '2014-07-01 10:00', '2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00', '2014-07-01 14:00', '2014-07-01 15:00', '2014-07-01 16:00', '2014-07-02 09:00', '2014-07-02 10:00', '2014-07-02 11:00']) - self.assertEqual(idx.inferred_freq, 'BH') + assert idx.inferred_freq == 'BH' idx = DatetimeIndex( ['2014-07-04 09:00', '2014-07-04 10:00', '2014-07-04 11:00', '2014-07-04 12:00', '2014-07-04 13:00', '2014-07-04 14:00', '2014-07-04 15:00', '2014-07-04 16:00', '2014-07-07 09:00', '2014-07-07 10:00', '2014-07-07 11:00']) - self.assertEqual(idx.inferred_freq, 'BH') + assert idx.inferred_freq == 'BH' idx = DatetimeIndex( ['2014-07-04 09:00', '2014-07-04 10:00', '2014-07-04 11:00', @@ -666,12 +731,12 @@ def test_infer_freq_businesshour(self): '2014-07-07 16:00', '2014-07-08 09:00', '2014-07-08 10:00', '2014-07-08 11:00', '2014-07-08 12:00', '2014-07-08 13:00', '2014-07-08 14:00', '2014-07-08 15:00', '2014-07-08 16:00']) - self.assertEqual(idx.inferred_freq, 'BH') + assert idx.inferred_freq == 'BH' def test_not_monotonic(self): rng = _dti(['1/31/2000', '1/31/2001', '1/31/2002']) rng = rng[::-1] - self.assertEqual(rng.inferred_freq, '-1A-JAN') + assert rng.inferred_freq == '-1A-JAN' def test_non_datetimeindex2(self): rng = _dti(['1/31/2000', '1/31/2001', '1/31/2002']) @@ -679,21 +744,20 @@ def test_non_datetimeindex2(self): vals = rng.to_pydatetime() result = frequencies.infer_freq(vals) - self.assertEqual(result, rng.inferred_freq) + assert result == rng.inferred_freq def test_invalid_index_types(self): # test all index types for i in [tm.makeIntIndex(10), tm.makeFloatIndex(10), tm.makePeriodIndex(10)]: - self.assertRaises(TypeError, lambda: frequencies.infer_freq(i)) + pytest.raises(TypeError, lambda: frequencies.infer_freq(i)) # GH 10822 # odd error message on conversions to datetime for unicode if not is_platform_windows(): for i in [tm.makeStringIndex(10), tm.makeUnicodeIndex(10)]: - self.assertRaises(ValueError, - lambda: frequencies.infer_freq(i)) + pytest.raises(ValueError, lambda: frequencies.infer_freq(i)) def test_string_datetimelike_compat(self): @@ -702,7 +766,7 @@ def test_string_datetimelike_compat(self): '2004-04']) result = frequencies.infer_freq(Index(['2004-01', '2004-02', '2004-03', '2004-04'])) - self.assertEqual(result, expected) + assert result == expected def test_series(self): @@ -711,33 +775,32 @@ def test_series(self): # invalid type of Series for s in [Series(np.arange(10)), Series(np.arange(10.))]: - self.assertRaises(TypeError, lambda: frequencies.infer_freq(s)) + pytest.raises(TypeError, lambda: frequencies.infer_freq(s)) # a non-convertible string - self.assertRaises(ValueError, - lambda: frequencies.infer_freq( - Series(['foo', 'bar']))) + pytest.raises(ValueError, lambda: frequencies.infer_freq( + Series(['foo', 'bar']))) # cannot infer on PeriodIndex for freq in [None, 'L']: s = Series(period_range('2013', periods=10, freq=freq)) - 
self.assertRaises(TypeError, lambda: frequencies.infer_freq(s)) + pytest.raises(TypeError, lambda: frequencies.infer_freq(s)) for freq in ['Y']: msg = frequencies._INVALID_FREQ_ERROR - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): s = Series(period_range('2013', periods=10, freq=freq)) - self.assertRaises(TypeError, lambda: frequencies.infer_freq(s)) + pytest.raises(TypeError, lambda: frequencies.infer_freq(s)) # DateTimeIndex for freq in ['M', 'L', 'S']: s = Series(date_range('20130101', periods=10, freq=freq)) inferred = frequencies.infer_freq(s) - self.assertEqual(inferred, freq) + assert inferred == freq s = Series(date_range('20130101', '20130110')) inferred = frequencies.infer_freq(s) - self.assertEqual(inferred, 'D') + assert inferred == 'D' def test_legacy_offset_warnings(self): freqs = ['WEEKDAY', 'EOM', 'W@MON', 'W@TUE', 'W@WED', 'W@THU', @@ -752,10 +815,10 @@ def test_legacy_offset_warnings(self): msg = frequencies._INVALID_FREQ_ERROR for freq in freqs: - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): frequencies.get_offset(freq) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): date_range('2011-01-01', periods=5, freq=freq) diff --git a/pandas/tseries/tests/test_holiday.py b/pandas/tests/tseries/test_holiday.py similarity index 75% rename from pandas/tseries/tests/test_holiday.py rename to pandas/tests/tseries/test_holiday.py index 62446e8e637c6..59a2a225ab5f8 100644 --- a/pandas/tseries/tests/test_holiday.py +++ b/pandas/tests/tseries/test_holiday.py @@ -1,3 +1,5 @@ +import pytest + from datetime import datetime import pandas.util.testing as tm from pandas import compat @@ -15,11 +17,11 @@ USLaborDay, USColumbusDay, USMartinLutherKingJr, USPresidentsDay) from pytz import utc -import nose -class TestCalendar(tm.TestCase): - def setUp(self): +class TestCalendar(object): + + def setup_method(self, method): self.holiday_list = [ datetime(2012, 1, 2), datetime(2012, 1, 16), @@ -47,14 +49,15 @@ def test_calendar(self): Timestamp(self.start_date), Timestamp(self.end_date)) - self.assertEqual(list(holidays.to_pydatetime()), self.holiday_list) - self.assertEqual(list(holidays_1.to_pydatetime()), self.holiday_list) - self.assertEqual(list(holidays_2.to_pydatetime()), self.holiday_list) + assert list(holidays.to_pydatetime()) == self.holiday_list + assert list(holidays_1.to_pydatetime()) == self.holiday_list + assert list(holidays_2.to_pydatetime()) == self.holiday_list def test_calendar_caching(self): # Test for issue #9552 class TestCalendar(AbstractHolidayCalendar): + def __init__(self, name=None, rules=None): super(TestCalendar, self).__init__(name=name, rules=rules) @@ -79,27 +82,22 @@ def test_calendar_observance_dates(self): def test_rule_from_name(self): USFedCal = get_calendar('USFederalHolidayCalendar') - self.assertEqual(USFedCal.rule_from_name( - 'Thanksgiving'), USThanksgivingDay) + assert USFedCal.rule_from_name('Thanksgiving') == USThanksgivingDay -class TestHoliday(tm.TestCase): - def setUp(self): +class TestHoliday(object): + + def setup_method(self, method): self.start_date = datetime(2011, 1, 1) self.end_date = datetime(2020, 12, 31) def check_results(self, holiday, start, end, expected): - self.assertEqual(list(holiday.dates(start, end)), expected) + assert list(holiday.dates(start, end)) == expected + # Verify that timezone info is preserved. 
- self.assertEqual( - list( - holiday.dates( - utc.localize(Timestamp(start)), - utc.localize(Timestamp(end)), - ) - ), - [utc.localize(dt) for dt in expected], - ) + assert (list(holiday.dates(utc.localize(Timestamp(start)), + utc.localize(Timestamp(end)))) == + [utc.localize(dt) for dt in expected]) def test_usmemorialday(self): self.check_results(holiday=USMemorialDay, @@ -230,7 +228,7 @@ def test_holidays_within_dates(self): for rule, dates in compat.iteritems(holidays): empty_dates = rule.dates(start_date, end_date) - self.assertEqual(empty_dates.tolist(), []) + assert empty_dates.tolist() == [] if isinstance(dates, tuple): dates = [dates] @@ -251,8 +249,8 @@ def test_argument_types(self): Timestamp(self.start_date), Timestamp(self.end_date)) - self.assert_index_equal(holidays, holidays_1) - self.assert_index_equal(holidays, holidays_2) + tm.assert_index_equal(holidays, holidays_1) + tm.assert_index_equal(holidays, holidays_2) def test_special_holidays(self): base_date = [datetime(2012, 5, 28)] @@ -262,17 +260,15 @@ def test_special_holidays(self): end_date=datetime(2012, 12, 31), offset=DateOffset(weekday=MO(1))) - self.assertEqual(base_date, - holiday_1.dates(self.start_date, self.end_date)) - self.assertEqual(base_date, - holiday_2.dates(self.start_date, self.end_date)) + assert base_date == holiday_1.dates(self.start_date, self.end_date) + assert base_date == holiday_2.dates(self.start_date, self.end_date) def test_get_calendar(self): class TestCalendar(AbstractHolidayCalendar): rules = [] calendar = get_calendar('TestCalendar') - self.assertEqual(TestCalendar, calendar.__class__) + assert TestCalendar == calendar.__class__ def test_factory(self): class_1 = HolidayCalendarFactory('MemorialDay', @@ -283,13 +279,14 @@ def test_factory(self): USThanksgivingDay) class_3 = HolidayCalendarFactory('Combined', class_1, class_2) - self.assertEqual(len(class_1.rules), 1) - self.assertEqual(len(class_2.rules), 1) - self.assertEqual(len(class_3.rules), 2) + assert len(class_1.rules) == 1 + assert len(class_2.rules) == 1 + assert len(class_3.rules) == 2 + +class TestObservanceRules(object): -class TestObservanceRules(tm.TestCase): - def setUp(self): + def setup_method(self, method): self.we = datetime(2014, 4, 9) self.th = datetime(2014, 4, 10) self.fr = datetime(2014, 4, 11) @@ -299,64 +296,65 @@ def setUp(self): self.tu = datetime(2014, 4, 15) def test_next_monday(self): - self.assertEqual(next_monday(self.sa), self.mo) - self.assertEqual(next_monday(self.su), self.mo) + assert next_monday(self.sa) == self.mo + assert next_monday(self.su) == self.mo def test_next_monday_or_tuesday(self): - self.assertEqual(next_monday_or_tuesday(self.sa), self.mo) - self.assertEqual(next_monday_or_tuesday(self.su), self.tu) - self.assertEqual(next_monday_or_tuesday(self.mo), self.tu) + assert next_monday_or_tuesday(self.sa) == self.mo + assert next_monday_or_tuesday(self.su) == self.tu + assert next_monday_or_tuesday(self.mo) == self.tu def test_previous_friday(self): - self.assertEqual(previous_friday(self.sa), self.fr) - self.assertEqual(previous_friday(self.su), self.fr) + assert previous_friday(self.sa) == self.fr + assert previous_friday(self.su) == self.fr def test_sunday_to_monday(self): - self.assertEqual(sunday_to_monday(self.su), self.mo) + assert sunday_to_monday(self.su) == self.mo def test_nearest_workday(self): - self.assertEqual(nearest_workday(self.sa), self.fr) - self.assertEqual(nearest_workday(self.su), self.mo) - self.assertEqual(nearest_workday(self.mo), self.mo) + assert 
nearest_workday(self.sa) == self.fr + assert nearest_workday(self.su) == self.mo + assert nearest_workday(self.mo) == self.mo def test_weekend_to_monday(self): - self.assertEqual(weekend_to_monday(self.sa), self.mo) - self.assertEqual(weekend_to_monday(self.su), self.mo) - self.assertEqual(weekend_to_monday(self.mo), self.mo) + assert weekend_to_monday(self.sa) == self.mo + assert weekend_to_monday(self.su) == self.mo + assert weekend_to_monday(self.mo) == self.mo def test_next_workday(self): - self.assertEqual(next_workday(self.sa), self.mo) - self.assertEqual(next_workday(self.su), self.mo) - self.assertEqual(next_workday(self.mo), self.tu) + assert next_workday(self.sa) == self.mo + assert next_workday(self.su) == self.mo + assert next_workday(self.mo) == self.tu def test_previous_workday(self): - self.assertEqual(previous_workday(self.sa), self.fr) - self.assertEqual(previous_workday(self.su), self.fr) - self.assertEqual(previous_workday(self.tu), self.mo) + assert previous_workday(self.sa) == self.fr + assert previous_workday(self.su) == self.fr + assert previous_workday(self.tu) == self.mo def test_before_nearest_workday(self): - self.assertEqual(before_nearest_workday(self.sa), self.th) - self.assertEqual(before_nearest_workday(self.su), self.fr) - self.assertEqual(before_nearest_workday(self.tu), self.mo) + assert before_nearest_workday(self.sa) == self.th + assert before_nearest_workday(self.su) == self.fr + assert before_nearest_workday(self.tu) == self.mo def test_after_nearest_workday(self): - self.assertEqual(after_nearest_workday(self.sa), self.mo) - self.assertEqual(after_nearest_workday(self.su), self.tu) - self.assertEqual(after_nearest_workday(self.fr), self.mo) + assert after_nearest_workday(self.sa) == self.mo + assert after_nearest_workday(self.su) == self.tu + assert after_nearest_workday(self.fr) == self.mo -class TestFederalHolidayCalendar(tm.TestCase): +class TestFederalHolidayCalendar(object): - # Test for issue 10278 def test_no_mlk_before_1984(self): + # see gh-10278 class MLKCalendar(AbstractHolidayCalendar): rules = [USMartinLutherKingJr] holidays = MLKCalendar().holidays(start='1984', end='1988').to_pydatetime().tolist() + # Testing to make sure holiday is not incorrectly observed before 1986 - self.assertEqual(holidays, [datetime(1986, 1, 20, 0, 0), datetime( - 1987, 1, 19, 0, 0)]) + assert holidays == [datetime(1986, 1, 20, 0, 0), + datetime(1987, 1, 19, 0, 0)] def test_memorial_day(self): class MemorialDay(AbstractHolidayCalendar): @@ -364,29 +362,24 @@ class MemorialDay(AbstractHolidayCalendar): holidays = MemorialDay().holidays(start='1971', end='1980').to_pydatetime().tolist() - # Fixes 5/31 error and checked manually against wikipedia - self.assertEqual(holidays, [datetime(1971, 5, 31, 0, 0), - datetime(1972, 5, 29, 0, 0), - datetime(1973, 5, 28, 0, 0), - datetime(1974, 5, 27, 0, - 0), datetime(1975, 5, 26, 0, 0), - datetime(1976, 5, 31, 0, - 0), datetime(1977, 5, 30, 0, 0), - datetime(1978, 5, 29, 0, - 0), datetime(1979, 5, 28, 0, 0)]) + # Fixes 5/31 error and checked manually against Wikipedia + assert holidays == [datetime(1971, 5, 31, 0, 0), + datetime(1972, 5, 29, 0, 0), + datetime(1973, 5, 28, 0, 0), + datetime(1974, 5, 27, 0, 0), + datetime(1975, 5, 26, 0, 0), + datetime(1976, 5, 31, 0, 0), + datetime(1977, 5, 30, 0, 0), + datetime(1978, 5, 29, 0, 0), + datetime(1979, 5, 28, 0, 0)] -class TestHolidayConflictingArguments(tm.TestCase): - # GH 10217 +class TestHolidayConflictingArguments(object): def test_both_offset_observance_raises(self): - with 
self.assertRaises(NotImplementedError): + # see gh-10217 + with pytest.raises(NotImplementedError): Holiday("Cyber Monday", month=11, day=1, offset=[DateOffset(weekday=SA(4))], observance=next_monday) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tseries/tests/test_offsets.py b/pandas/tests/tseries/test_offsets.py similarity index 83% rename from pandas/tseries/tests/test_offsets.py rename to pandas/tests/tseries/test_offsets.py index 768e9212e6c42..09de064c15183 100644 --- a/pandas/tseries/tests/test_offsets.py +++ b/pandas/tests/tseries/test_offsets.py @@ -2,10 +2,10 @@ from distutils.version import LooseVersion from datetime import date, datetime, timedelta from dateutil.relativedelta import relativedelta + +import pytest from pandas.compat import range, iteritems from pandas import compat -import nose -from nose.tools import assert_raises import numpy as np @@ -15,7 +15,8 @@ from pandas.tseries.frequencies import (_offset_map, get_freq_code, _get_freq_str, _INVALID_FREQ_ERROR, get_offset, get_standard_freq) -from pandas.tseries.index import _to_m8, DatetimeIndex, _daterange_cache +from pandas.core.indexes.datetimes import ( + _to_m8, DatetimeIndex, _daterange_cache) from pandas.tseries.offsets import (BDay, CDay, BQuarterEnd, BMonthEnd, BusinessHour, WeekOfMonth, CBMonthEnd, CustomBusinessHour, WeekDay, @@ -27,18 +28,16 @@ QuarterEnd, BusinessMonthEnd, FY5253, Milli, Nano, Easter, FY5253Quarter, LastWeekOfMonth, CacheableOffset) -from pandas.tseries.tools import (format, ole2datetime, parse_time_string, - to_datetime, DateParseError) +from pandas.core.tools.datetimes import ( + format, ole2datetime, parse_time_string, + to_datetime, DateParseError) import pandas.tseries.offsets as offsets from pandas.io.pickle import read_pickle -from pandas.tslib import normalize_date, NaT, Timestamp, Timedelta -import pandas.tslib as tslib -from pandas.util.testing import assertRaisesRegexp +from pandas._libs.tslib import normalize_date, NaT, Timestamp, Timedelta +import pandas._libs.tslib as tslib import pandas.util.testing as tm from pandas.tseries.holiday import USFederalHolidayCalendar -_multiprocess_can_split_ = True - def test_monthrange(): import calendar @@ -60,7 +59,8 @@ def test_ole2datetime(): actual = ole2datetime(60000) assert actual == datetime(2064, 4, 8) - assert_raises(ValueError, ole2datetime, 60) + with pytest.raises(ValueError): + ole2datetime(60) def test_to_datetime1(): @@ -83,7 +83,7 @@ def test_normalize_date(): def test_to_m8(): valb = datetime(2007, 10, 1) valu = _to_m8(valb) - tm.assertIsInstance(valu, np.datetime64) + assert isinstance(valu, np.datetime64) # assert valu == np.datetime64(datetime(2007,10,1)) # def test_datetime64_box(): @@ -97,7 +97,7 @@ def test_to_m8(): ##### -class Base(tm.TestCase): +class Base(object): _offset = None _offset_types = [getattr(offsets, o) for o in offsets.__all__] @@ -145,8 +145,8 @@ def test_apply_out_of_range(self): offset = self._get_offset(self._offset, value=10000) result = Timestamp('20080101') + offset - self.assertIsInstance(result, datetime) - self.assertIsNone(result.tzinfo) + assert isinstance(result, datetime) + assert result.tzinfo is None tm._skip_if_no_pytz() tm._skip_if_no_dateutil() @@ -154,20 +154,20 @@ def test_apply_out_of_range(self): for tz in self.timezones: t = Timestamp('20080101', tz=tz) result = t + offset - self.assertIsInstance(result, datetime) - self.assertEqual(t.tzinfo, result.tzinfo) + assert isinstance(result, 
datetime) + assert t.tzinfo == result.tzinfo except (tslib.OutOfBoundsDatetime): raise except (ValueError, KeyError) as e: - raise nose.SkipTest( + pytest.skip( "cannot create out_of_range offset: {0} {1}".format( str(self).split('.')[-1], e)) class TestCommon(Base): - def setUp(self): + def setup_method(self, method): # expected value created by Base._get_offset # are applied to 2011/01/01 09:00 (Saturday) # used for .apply and .rollforward @@ -218,25 +218,25 @@ def test_return_type(self): # make sure that we are returning a Timestamp result = Timestamp('20080101') + offset - self.assertIsInstance(result, Timestamp) + assert isinstance(result, Timestamp) # make sure that we are returning NaT - self.assertTrue(NaT + offset is NaT) - self.assertTrue(offset + NaT is NaT) + assert NaT + offset is NaT + assert offset + NaT is NaT - self.assertTrue(NaT - offset is NaT) - self.assertTrue((-offset).apply(NaT) is NaT) + assert NaT - offset is NaT + assert (-offset).apply(NaT) is NaT def test_offset_n(self): for offset_klass in self.offset_types: offset = self._get_offset(offset_klass) - self.assertEqual(offset.n, 1) + assert offset.n == 1 neg_offset = offset * -1 - self.assertEqual(neg_offset.n, -1) + assert neg_offset.n == -1 mul_offset = offset * 3 - self.assertEqual(mul_offset.n, 3) + assert mul_offset.n == 3 def test_offset_freqstr(self): for offset_klass in self.offset_types: @@ -247,7 +247,7 @@ "", 'LWOM-SAT', ): code = get_offset(freqstr) - self.assertEqual(offset.rule_code, code) + assert offset.rule_code == code def _check_offsetfunc_works(self, offset, funcname, dt, expected, normalize=False): @@ -255,12 +255,12 @@ func = getattr(offset_s, funcname) result = func(dt) - self.assertTrue(isinstance(result, Timestamp)) - self.assertEqual(result, expected) + assert isinstance(result, Timestamp) + assert result == expected result = func(Timestamp(dt)) - self.assertTrue(isinstance(result, Timestamp)) - self.assertEqual(result, expected) + assert isinstance(result, Timestamp) + assert result == expected # see gh-14101 exp_warning = None @@ -275,11 +275,11 @@ with tm.assert_produces_warning(exp_warning, check_stacklevel=False): result = func(ts) - self.assertTrue(isinstance(result, Timestamp)) + assert isinstance(result, Timestamp) if normalize is False: - self.assertEqual(result, expected + Nano(5)) + assert result == expected + Nano(5) else: - self.assertEqual(result, expected) + assert result == expected if isinstance(dt, np.datetime64): # test tz when input is datetime or Timestamp @@ -294,12 +294,12 @@ dt_tz = tslib._localize_pydatetime(dt, tz_obj) result = func(dt_tz) - self.assertTrue(isinstance(result, Timestamp)) - self.assertEqual(result, expected_localize) + assert isinstance(result, Timestamp) + assert result == expected_localize result = func(Timestamp(dt, tz=tz)) - self.assertTrue(isinstance(result, Timestamp)) - self.assertEqual(result, expected_localize) + assert isinstance(result, Timestamp) + assert result == expected_localize # see gh-14101 exp_warning = None @@ -314,11 +314,11 @@ with tm.assert_produces_warning(exp_warning, check_stacklevel=False): result = func(ts) - self.assertTrue(isinstance(result, Timestamp)) + assert isinstance(result, Timestamp) if normalize is False: - 
self.assertEqual(result, expected_localize + Nano(5)) + assert result == expected_localize + Nano(5) else: - self.assertEqual(result, expected_localize) + assert result == expected_localize def test_apply(self): sdt = datetime(2011, 1, 1, 9, 0) @@ -442,18 +442,18 @@ def test_onOffset(self): for offset in self.offset_types: dt = self.expecteds[offset.__name__] offset_s = self._get_offset(offset) - self.assertTrue(offset_s.onOffset(dt)) + assert offset_s.onOffset(dt) # when normalize=True, onOffset checks time is 00:00:00 offset_n = self._get_offset(offset, normalize=True) - self.assertFalse(offset_n.onOffset(dt)) + assert not offset_n.onOffset(dt) if offset in (BusinessHour, CustomBusinessHour): # In default BusinessHour (9:00-17:00), normalized time # cannot be in business hour range continue date = datetime(dt.year, dt.month, dt.day) - self.assertTrue(offset_n.onOffset(date)) + assert offset_n.onOffset(date) def test_add(self): dt = datetime(2011, 1, 1, 9, 0) @@ -465,15 +465,15 @@ result_dt = dt + offset_s result_ts = Timestamp(dt) + offset_s for result in [result_dt, result_ts]: - self.assertTrue(isinstance(result, Timestamp)) - self.assertEqual(result, expected) + assert isinstance(result, Timestamp) + assert result == expected tm._skip_if_no_pytz() for tz in self.timezones: expected_localize = expected.tz_localize(tz) result = Timestamp(dt, tz=tz) + offset_s - self.assertTrue(isinstance(result, Timestamp)) - self.assertEqual(result, expected_localize) + assert isinstance(result, Timestamp) + assert result == expected_localize # normalize=True offset_s = self._get_offset(offset, normalize=True) @@ -482,14 +482,14 @@ result_dt = dt + offset_s result_ts = Timestamp(dt) + offset_s for result in [result_dt, result_ts]: - self.assertTrue(isinstance(result, Timestamp)) - self.assertEqual(result, expected) + assert isinstance(result, Timestamp) + assert result == expected for tz in self.timezones: expected_localize = expected.tz_localize(tz) result = Timestamp(dt, tz=tz) + offset_s - self.assertTrue(isinstance(result, Timestamp)) - self.assertEqual(result, expected_localize) + assert isinstance(result, Timestamp) + assert result == expected_localize def test_pickle_v0_15_2(self): offsets = {'DateOffset': DateOffset(years=1), @@ -506,9 +506,8 @@ class TestDateOffset(Base): - _multiprocess_can_split_ = True - def setUp(self): + def setup_method(self, method): self.d = Timestamp(datetime(2008, 1, 2)) _offset_map.clear() @@ -542,14 +541,13 @@ def test_eq(self): offset1 = DateOffset(days=1) offset2 = DateOffset(days=365) - self.assertNotEqual(offset1, offset2) + assert offset1 != offset2 class TestBusinessDay(Base): - _multiprocess_can_split_ = True _offset = BDay - def setUp(self): + def setup_method(self, method): self.d = datetime(2008, 1, 1) self.offset = BDay() @@ -560,10 +558,10 @@ def test_different_normalize_equals(self): offset = BDay() offset2 = BDay() offset2.normalize = True - self.assertEqual(offset, offset2) + assert offset == offset2 def test_repr(self): - self.assertEqual(repr(self.offset), '<BusinessDay>') + assert repr(self.offset) == '<BusinessDay>' assert repr(self.offset2) == '<2 * BusinessDays>' expected = '<BusinessDay: offset=datetime.timedelta(1)>' @@ -575,49 +573,49 @@ def test_with_offset(self): assert (self.d + offset) == datetime(2008, 1, 2, 2) def testEQ(self): - self.assertEqual(self.offset2, self.offset2) + assert self.offset2 == self.offset2 def test_mul(self): pass def test_hash(self): - self.assertEqual(hash(self.offset2), hash(self.offset2)) + assert 
hash(self.offset2) == hash(self.offset2) def testCall(self): - self.assertEqual(self.offset2(self.d), datetime(2008, 1, 3)) + assert self.offset2(self.d) == datetime(2008, 1, 3) def testRAdd(self): - self.assertEqual(self.d + self.offset2, self.offset2 + self.d) + assert self.d + self.offset2 == self.offset2 + self.d def testSub(self): off = self.offset2 - self.assertRaises(Exception, off.__sub__, self.d) - self.assertEqual(2 * off - off, off) + pytest.raises(Exception, off.__sub__, self.d) + assert 2 * off - off == off - self.assertEqual(self.d - self.offset2, self.d + BDay(-2)) + assert self.d - self.offset2 == self.d + BDay(-2) def testRSub(self): - self.assertEqual(self.d - self.offset2, (-self.offset2).apply(self.d)) + assert self.d - self.offset2 == (-self.offset2).apply(self.d) def testMult1(self): - self.assertEqual(self.d + 10 * self.offset, self.d + BDay(10)) + assert self.d + 10 * self.offset == self.d + BDay(10) def testMult2(self): - self.assertEqual(self.d + (-5 * BDay(-10)), self.d + BDay(50)) + assert self.d + (-5 * BDay(-10)) == self.d + BDay(50) def testRollback1(self): - self.assertEqual(BDay(10).rollback(self.d), self.d) + assert BDay(10).rollback(self.d) == self.d def testRollback2(self): - self.assertEqual( - BDay(10).rollback(datetime(2008, 1, 5)), datetime(2008, 1, 4)) + assert (BDay(10).rollback(datetime(2008, 1, 5)) == + datetime(2008, 1, 4)) def testRollforward1(self): - self.assertEqual(BDay(10).rollforward(self.d), self.d) + assert BDay(10).rollforward(self.d) == self.d def testRollforward2(self): - self.assertEqual( - BDay(10).rollforward(datetime(2008, 1, 5)), datetime(2008, 1, 7)) + assert (BDay(10).rollforward(datetime(2008, 1, 5)) == + datetime(2008, 1, 7)) def test_roll_date_object(self): offset = BDay() @@ -625,17 +623,17 @@ def test_roll_date_object(self): dt = date(2012, 9, 15) result = offset.rollback(dt) - self.assertEqual(result, datetime(2012, 9, 14)) + assert result == datetime(2012, 9, 14) result = offset.rollforward(dt) - self.assertEqual(result, datetime(2012, 9, 17)) + assert result == datetime(2012, 9, 17) offset = offsets.Day() result = offset.rollback(dt) - self.assertEqual(result, datetime(2012, 9, 15)) + assert result == datetime(2012, 9, 15) result = offset.rollforward(dt) - self.assertEqual(result, datetime(2012, 9, 15)) + assert result == datetime(2012, 9, 15) def test_onOffset(self): tests = [(BDay(), datetime(2008, 1, 1), True), @@ -693,41 +691,40 @@ def test_apply_large_n(self): dt = datetime(2012, 10, 23) result = dt + BDay(10) - self.assertEqual(result, datetime(2012, 11, 6)) + assert result == datetime(2012, 11, 6) result = dt + BDay(100) - BDay(100) - self.assertEqual(result, dt) + assert result == dt off = BDay() * 6 rs = datetime(2012, 1, 1) - off xp = datetime(2011, 12, 23) - self.assertEqual(rs, xp) + assert rs == xp st = datetime(2011, 12, 18) rs = st + off xp = datetime(2011, 12, 26) - self.assertEqual(rs, xp) + assert rs == xp off = BDay() * 10 rs = datetime(2014, 1, 5) + off # see #5890 xp = datetime(2014, 1, 17) - self.assertEqual(rs, xp) + assert rs == xp def test_apply_corner(self): - self.assertRaises(TypeError, BDay().apply, BMonthEnd()) + pytest.raises(TypeError, BDay().apply, BMonthEnd()) def test_offsets_compare_equal(self): # root cause of #456 offset1 = BDay() offset2 = BDay() - self.assertFalse(offset1 != offset2) + assert not offset1 != offset2 class TestBusinessHour(Base): - _multiprocess_can_split_ = True _offset = BusinessHour - def setUp(self): + def setup_method(self, method): self.d = datetime(2014, 7, 
1, 10, 00) self.offset1 = BusinessHour() @@ -744,11 +741,11 @@ def test_constructor_errors(self): from datetime import time as dt_time - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): BusinessHour(start=dt_time(11, 0, 5)) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): BusinessHour(start='AAA') - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): BusinessHour(start='14:00:05') def test_different_normalize_equals(self): @@ -756,125 +753,113 @@ offset = self._offset() offset2 = self._offset() offset2.normalize = True - self.assertEqual(offset, offset2) + assert offset == offset2 def test_repr(self): - self.assertEqual(repr(self.offset1), '<BusinessHour: BH=09:00-17:00>') - self.assertEqual(repr(self.offset2), - '<3 * BusinessHours: BH=09:00-17:00>') - self.assertEqual(repr(self.offset3), - '<-1 * BusinessHour: BH=09:00-17:00>') - self.assertEqual(repr(self.offset4), - '<-4 * BusinessHours: BH=09:00-17:00>') - - self.assertEqual(repr(self.offset5), '<BusinessHour: BH=11:00-14:30>') - self.assertEqual(repr(self.offset6), '<BusinessHour: BH=20:00-05:00>') - self.assertEqual(repr(self.offset7), - '<-2 * BusinessHours: BH=21:30-06:30>') + assert repr(self.offset1) == '<BusinessHour: BH=09:00-17:00>' + assert repr(self.offset2) == '<3 * BusinessHours: BH=09:00-17:00>' + assert repr(self.offset3) == '<-1 * BusinessHour: BH=09:00-17:00>' + assert repr(self.offset4) == '<-4 * BusinessHours: BH=09:00-17:00>' + + assert repr(self.offset5) == '<BusinessHour: BH=11:00-14:30>' + assert repr(self.offset6) == '<BusinessHour: BH=20:00-05:00>' + assert repr(self.offset7) == '<-2 * BusinessHours: BH=21:30-06:30>' def test_with_offset(self): expected = Timestamp('2014-07-01 13:00') - self.assertEqual(self.d + BusinessHour() * 3, expected) - self.assertEqual(self.d + BusinessHour(n=3), expected) + assert self.d + BusinessHour() * 3 == expected + assert self.d + BusinessHour(n=3) == expected def testEQ(self): for offset in [self.offset1, self.offset2, self.offset3, self.offset4]: - self.assertEqual(offset, offset) + assert offset == offset - self.assertNotEqual(BusinessHour(), BusinessHour(-1)) - self.assertEqual(BusinessHour(start='09:00'), BusinessHour()) - self.assertNotEqual(BusinessHour(start='09:00'), - BusinessHour(start='09:01')) - self.assertNotEqual(BusinessHour(start='09:00', end='17:00'), - BusinessHour(start='17:00', end='09:01')) + assert BusinessHour() != BusinessHour(-1) + assert BusinessHour(start='09:00') == BusinessHour() + assert BusinessHour(start='09:00') != BusinessHour(start='09:01') + assert (BusinessHour(start='09:00', end='17:00') != + BusinessHour(start='17:00', end='09:01')) def test_hash(self): for offset in [self.offset1, self.offset2, self.offset3, self.offset4]: - self.assertEqual(hash(offset), hash(offset)) + assert hash(offset) == hash(offset) def testCall(self): - self.assertEqual(self.offset1(self.d), datetime(2014, 7, 1, 11)) - self.assertEqual(self.offset2(self.d), datetime(2014, 7, 1, 13)) - self.assertEqual(self.offset3(self.d), datetime(2014, 6, 30, 17)) - self.assertEqual(self.offset4(self.d), datetime(2014, 6, 30, 14)) + assert self.offset1(self.d) == datetime(2014, 7, 1, 11) + assert self.offset2(self.d) == datetime(2014, 7, 1, 13) + assert self.offset3(self.d) == datetime(2014, 6, 30, 17) + assert self.offset4(self.d) == datetime(2014, 6, 30, 14) def testRAdd(self): - self.assertEqual(self.d + self.offset2, self.offset2 + self.d) + assert self.d + self.offset2 == self.offset2 + self.d def testSub(self): off = self.offset2 - self.assertRaises(Exception, off.__sub__, self.d) - self.assertEqual(2 * off - off, off) + 
pytest.raises(Exception, off.__sub__, self.d) + assert 2 * off - off == off - self.assertEqual(self.d - self.offset2, self.d + self._offset(-3)) + assert self.d - self.offset2 == self.d + self._offset(-3) def testRSub(self): - self.assertEqual(self.d - self.offset2, (-self.offset2).apply(self.d)) + assert self.d - self.offset2 == (-self.offset2).apply(self.d) def testMult1(self): - self.assertEqual(self.d + 5 * self.offset1, self.d + self._offset(5)) + assert self.d + 5 * self.offset1 == self.d + self._offset(5) def testMult2(self): - self.assertEqual(self.d + (-3 * self._offset(-2)), - self.d + self._offset(6)) + assert self.d + (-3 * self._offset(-2)) == self.d + self._offset(6) def testRollback1(self): - self.assertEqual(self.offset1.rollback(self.d), self.d) - self.assertEqual(self.offset2.rollback(self.d), self.d) - self.assertEqual(self.offset3.rollback(self.d), self.d) - self.assertEqual(self.offset4.rollback(self.d), self.d) - self.assertEqual(self.offset5.rollback(self.d), - datetime(2014, 6, 30, 14, 30)) - self.assertEqual(self.offset6.rollback( - self.d), datetime(2014, 7, 1, 5, 0)) - self.assertEqual(self.offset7.rollback( - self.d), datetime(2014, 7, 1, 6, 30)) + assert self.offset1.rollback(self.d) == self.d + assert self.offset2.rollback(self.d) == self.d + assert self.offset3.rollback(self.d) == self.d + assert self.offset4.rollback(self.d) == self.d + assert self.offset5.rollback(self.d) == datetime(2014, 6, 30, 14, 30) + assert self.offset6.rollback(self.d) == datetime(2014, 7, 1, 5, 0) + assert self.offset7.rollback(self.d) == datetime(2014, 7, 1, 6, 30) d = datetime(2014, 7, 1, 0) - self.assertEqual(self.offset1.rollback(d), datetime(2014, 6, 30, 17)) - self.assertEqual(self.offset2.rollback(d), datetime(2014, 6, 30, 17)) - self.assertEqual(self.offset3.rollback(d), datetime(2014, 6, 30, 17)) - self.assertEqual(self.offset4.rollback(d), datetime(2014, 6, 30, 17)) - self.assertEqual(self.offset5.rollback( - d), datetime(2014, 6, 30, 14, 30)) - self.assertEqual(self.offset6.rollback(d), d) - self.assertEqual(self.offset7.rollback(d), d) + assert self.offset1.rollback(d) == datetime(2014, 6, 30, 17) + assert self.offset2.rollback(d) == datetime(2014, 6, 30, 17) + assert self.offset3.rollback(d) == datetime(2014, 6, 30, 17) + assert self.offset4.rollback(d) == datetime(2014, 6, 30, 17) + assert self.offset5.rollback(d) == datetime(2014, 6, 30, 14, 30) + assert self.offset6.rollback(d) == d + assert self.offset7.rollback(d) == d - self.assertEqual(self._offset(5).rollback(self.d), self.d) + assert self._offset(5).rollback(self.d) == self.d def testRollback2(self): - self.assertEqual(self._offset(-3) - .rollback(datetime(2014, 7, 5, 15, 0)), - datetime(2014, 7, 4, 17, 0)) + assert (self._offset(-3).rollback(datetime(2014, 7, 5, 15, 0)) == + datetime(2014, 7, 4, 17, 0)) def testRollforward1(self): - self.assertEqual(self.offset1.rollforward(self.d), self.d) - self.assertEqual(self.offset2.rollforward(self.d), self.d) - self.assertEqual(self.offset3.rollforward(self.d), self.d) - self.assertEqual(self.offset4.rollforward(self.d), self.d) - self.assertEqual(self.offset5.rollforward( - self.d), datetime(2014, 7, 1, 11, 0)) - self.assertEqual(self.offset6.rollforward( - self.d), datetime(2014, 7, 1, 20, 0)) - self.assertEqual(self.offset7.rollforward( - self.d), datetime(2014, 7, 1, 21, 30)) + assert self.offset1.rollforward(self.d) == self.d + assert self.offset2.rollforward(self.d) == self.d + assert self.offset3.rollforward(self.d) == self.d + assert 
self.offset4.rollforward(self.d) == self.d + assert (self.offset5.rollforward(self.d) == + datetime(2014, 7, 1, 11, 0)) + assert (self.offset6.rollforward(self.d) == + datetime(2014, 7, 1, 20, 0)) + assert (self.offset7.rollforward(self.d) == + datetime(2014, 7, 1, 21, 30)) d = datetime(2014, 7, 1, 0) - self.assertEqual(self.offset1.rollforward(d), datetime(2014, 7, 1, 9)) - self.assertEqual(self.offset2.rollforward(d), datetime(2014, 7, 1, 9)) - self.assertEqual(self.offset3.rollforward(d), datetime(2014, 7, 1, 9)) - self.assertEqual(self.offset4.rollforward(d), datetime(2014, 7, 1, 9)) - self.assertEqual(self.offset5.rollforward(d), datetime(2014, 7, 1, 11)) - self.assertEqual(self.offset6.rollforward(d), d) - self.assertEqual(self.offset7.rollforward(d), d) + assert self.offset1.rollforward(d) == datetime(2014, 7, 1, 9) + assert self.offset2.rollforward(d) == datetime(2014, 7, 1, 9) + assert self.offset3.rollforward(d) == datetime(2014, 7, 1, 9) + assert self.offset4.rollforward(d) == datetime(2014, 7, 1, 9) + assert self.offset5.rollforward(d) == datetime(2014, 7, 1, 11) + assert self.offset6.rollforward(d) == d + assert self.offset7.rollforward(d) == d - self.assertEqual(self._offset(5).rollforward(self.d), self.d) + assert self._offset(5).rollforward(self.d) == self.d def testRollforward2(self): - self.assertEqual(self._offset(-3) - .rollforward(datetime(2014, 7, 5, 16, 0)), - datetime(2014, 7, 7, 9)) + assert (self._offset(-3).rollforward(datetime(2014, 7, 5, 16, 0)) == + datetime(2014, 7, 7, 9)) def test_roll_date_object(self): offset = BusinessHour() @@ -882,10 +867,10 @@ def test_roll_date_object(self): dt = datetime(2014, 7, 6, 15, 0) result = offset.rollback(dt) - self.assertEqual(result, datetime(2014, 7, 4, 17)) + assert result == datetime(2014, 7, 4, 17) result = offset.rollforward(dt) - self.assertEqual(result, datetime(2014, 7, 7, 9)) + assert result == datetime(2014, 7, 7, 9) def test_normalize(self): tests = [] @@ -927,7 +912,7 @@ def test_normalize(self): for offset, cases in tests: for dt, expected in compat.iteritems(cases): - self.assertEqual(offset.apply(dt), expected) + assert offset.apply(dt) == expected def test_onOffset(self): tests = [] @@ -966,7 +951,7 @@ def test_onOffset(self): for offset, cases in tests: for dt, expected in compat.iteritems(cases): - self.assertEqual(offset.onOffset(dt), expected) + assert offset.onOffset(dt) == expected def test_opening_time(self): tests = [] @@ -1130,8 +1115,8 @@ def test_opening_time(self): for _offsets, cases in tests: for offset in _offsets: for dt, (exp_next, exp_prev) in compat.iteritems(cases): - self.assertEqual(offset._next_opening_time(dt), exp_next) - self.assertEqual(offset._prev_opening_time(dt), exp_prev) + assert offset._next_opening_time(dt) == exp_next + assert offset._prev_opening_time(dt) == exp_prev def test_apply(self): tests = [] @@ -1392,7 +1377,7 @@ def test_offsets_compare_equal(self): # root cause of #456 offset1 = self._offset() offset2 = self._offset() - self.assertFalse(offset1 != offset2) + assert not offset1 != offset2 def test_datetimeindex(self): idx1 = DatetimeIndex(start='2014-07-04 15:00', end='2014-07-08 10:00', @@ -1431,10 +1416,9 @@ def test_datetimeindex(self): class TestCustomBusinessHour(Base): - _multiprocess_can_split_ = True _offset = CustomBusinessHour - def setUp(self): + def setup_method(self, method): # 2014 Calendar to check custom holidays # Sun Mon Tue Wed Thu Fri Sat # 6/22 23 24 25 26 27 28 @@ -1449,11 +1433,11 @@ def setUp(self): def test_constructor_errors(self): from 
datetime import time as dt_time - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): CustomBusinessHour(start=dt_time(11, 0, 5)) - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): CustomBusinessHour(start='AAA') - with tm.assertRaises(ValueError): + with pytest.raises(ValueError): CustomBusinessHour(start='14:00:05') def test_different_normalize_equals(self): @@ -1461,93 +1445,89 @@ offset = self._offset() offset2 = self._offset() offset2.normalize = True - self.assertEqual(offset, offset2) + assert offset == offset2 def test_repr(self): - self.assertEqual(repr(self.offset1), - '<CustomBusinessHour: CBH=09:00-17:00>') - self.assertEqual(repr(self.offset2), - '<CustomBusinessHour: CBH=09:00-17:00>') + assert repr(self.offset1) == '<CustomBusinessHour: CBH=09:00-17:00>' + assert repr(self.offset2) == '<CustomBusinessHour: CBH=09:00-17:00>' def test_with_offset(self): expected = Timestamp('2014-07-01 13:00') - self.assertEqual(self.d + CustomBusinessHour() * 3, expected) - self.assertEqual(self.d + CustomBusinessHour(n=3), expected) + assert self.d + CustomBusinessHour() * 3 == expected + assert self.d + CustomBusinessHour(n=3) == expected def testEQ(self): for offset in [self.offset1, self.offset2]: - self.assertEqual(offset, offset) + assert offset == offset - self.assertNotEqual(CustomBusinessHour(), CustomBusinessHour(-1)) - self.assertEqual(CustomBusinessHour(start='09:00'), - CustomBusinessHour()) - self.assertNotEqual(CustomBusinessHour(start='09:00'), - CustomBusinessHour(start='09:01')) - self.assertNotEqual(CustomBusinessHour(start='09:00', end='17:00'), - CustomBusinessHour(start='17:00', end='09:01')) + assert CustomBusinessHour() != CustomBusinessHour(-1) + assert (CustomBusinessHour(start='09:00') == + CustomBusinessHour()) + assert (CustomBusinessHour(start='09:00') != + CustomBusinessHour(start='09:01')) + assert (CustomBusinessHour(start='09:00', end='17:00') != + CustomBusinessHour(start='17:00', end='09:01')) - self.assertNotEqual(CustomBusinessHour(weekmask='Tue Wed Thu Fri'), - CustomBusinessHour(weekmask='Mon Tue Wed Thu Fri')) - self.assertNotEqual(CustomBusinessHour(holidays=['2014-06-27']), - CustomBusinessHour(holidays=['2014-06-28'])) + assert (CustomBusinessHour(weekmask='Tue Wed Thu Fri') != + CustomBusinessHour(weekmask='Mon Tue Wed Thu Fri')) + assert (CustomBusinessHour(holidays=['2014-06-27']) != + CustomBusinessHour(holidays=['2014-06-28'])) def test_hash(self): - self.assertEqual(hash(self.offset1), hash(self.offset1)) - self.assertEqual(hash(self.offset2), hash(self.offset2)) + assert hash(self.offset1) == hash(self.offset1) + assert hash(self.offset2) == hash(self.offset2) def testCall(self): - self.assertEqual(self.offset1(self.d), datetime(2014, 7, 1, 11)) - self.assertEqual(self.offset2(self.d), datetime(2014, 7, 1, 11)) + assert self.offset1(self.d) == datetime(2014, 7, 1, 11) + assert self.offset2(self.d) == datetime(2014, 7, 1, 11) def testRAdd(self): - self.assertEqual(self.d + self.offset2, self.offset2 + self.d) + assert self.d + self.offset2 == self.offset2 + self.d def testSub(self): off = self.offset2 - self.assertRaises(Exception, off.__sub__, self.d) - self.assertEqual(2 * off - off, off) + pytest.raises(Exception, off.__sub__, self.d) + assert 2 * off - off == off - self.assertEqual(self.d - self.offset2, self.d - (2 * off - off)) + assert self.d - self.offset2 == self.d - (2 * off - off) def testRSub(self): - self.assertEqual(self.d - self.offset2, (-self.offset2).apply(self.d)) + assert self.d - self.offset2 == (-self.offset2).apply(self.d) def testMult1(self): - self.assertEqual(self.d + 5 * self.offset1, 
self.d + self._offset(5)) + assert self.d + 5 * self.offset1 == self.d + self._offset(5) def testMult2(self): - self.assertEqual(self.d + (-3 * self._offset(-2)), - self.d + self._offset(6)) + assert self.d + (-3 * self._offset(-2)) == self.d + self._offset(6) def testRollback1(self): - self.assertEqual(self.offset1.rollback(self.d), self.d) - self.assertEqual(self.offset2.rollback(self.d), self.d) + assert self.offset1.rollback(self.d) == self.d + assert self.offset2.rollback(self.d) == self.d d = datetime(2014, 7, 1, 0) + # 2014/07/01 is Tuesday, 06/30 is Monday (holiday) - self.assertEqual(self.offset1.rollback(d), datetime(2014, 6, 27, 17)) + assert self.offset1.rollback(d) == datetime(2014, 6, 27, 17) # 2014/6/30 and 2014/6/27 are holidays - self.assertEqual(self.offset2.rollback(d), datetime(2014, 6, 26, 17)) + assert self.offset2.rollback(d) == datetime(2014, 6, 26, 17) def testRollback2(self): - self.assertEqual(self._offset(-3) - .rollback(datetime(2014, 7, 5, 15, 0)), - datetime(2014, 7, 4, 17, 0)) + assert (self._offset(-3).rollback(datetime(2014, 7, 5, 15, 0)) == + datetime(2014, 7, 4, 17, 0)) def testRollforward1(self): - self.assertEqual(self.offset1.rollforward(self.d), self.d) - self.assertEqual(self.offset2.rollforward(self.d), self.d) + assert self.offset1.rollforward(self.d) == self.d + assert self.offset2.rollforward(self.d) == self.d d = datetime(2014, 7, 1, 0) - self.assertEqual(self.offset1.rollforward(d), datetime(2014, 7, 1, 9)) - self.assertEqual(self.offset2.rollforward(d), datetime(2014, 7, 1, 9)) + assert self.offset1.rollforward(d) == datetime(2014, 7, 1, 9) + assert self.offset2.rollforward(d) == datetime(2014, 7, 1, 9) def testRollforward2(self): - self.assertEqual(self._offset(-3) - .rollforward(datetime(2014, 7, 5, 16, 0)), - datetime(2014, 7, 7, 9)) + assert (self._offset(-3).rollforward(datetime(2014, 7, 5, 16, 0)) == + datetime(2014, 7, 7, 9)) def test_roll_date_object(self): offset = BusinessHour() @@ -1555,10 +1535,10 @@ dt = datetime(2014, 7, 6, 15, 0) result = offset.rollback(dt) - self.assertEqual(result, datetime(2014, 7, 4, 17)) + assert result == datetime(2014, 7, 4, 17) result = offset.rollforward(dt) - self.assertEqual(result, datetime(2014, 7, 7, 9)) + assert result == datetime(2014, 7, 7, 9) def test_normalize(self): tests = [] @@ -1602,7 +1582,7 @@ for offset, cases in tests: for dt, expected in compat.iteritems(cases): - self.assertEqual(offset.apply(dt), expected) + assert offset.apply(dt) == expected def test_onOffset(self): tests = [] @@ -1618,7 +1598,7 @@ for offset, cases in tests: for dt, expected in compat.iteritems(cases): - self.assertEqual(offset.onOffset(dt), expected) + assert offset.onOffset(dt) == expected def test_apply(self): tests = [] @@ -1692,10 +1672,9 @@ class TestCustomBusinessDay(Base): - _multiprocess_can_split_ = True _offset = CDay - def setUp(self): + def setup_method(self, method): self.d = datetime(2008, 1, 1) self.nd = np_datetime64_compat('2008-01-01 00:00:00Z') @@ -1707,7 +1686,7 @@ def test_different_normalize_equals(self): offset = CDay() offset2 = CDay() offset2.normalize = True - self.assertEqual(offset, offset2) + assert offset == offset2 def test_repr(self): assert repr(self.offset) == '<CustomBusinessDay>' @@ -1722,50 +1701,50 @@ def test_with_offset(self): assert (self.d + offset) == datetime(2008, 1, 2, 2) def testEQ(self): - self.assertEqual(self.offset2, self.offset2) + assert self.offset2 == 
self.offset2 def test_mul(self): pass def test_hash(self): - self.assertEqual(hash(self.offset2), hash(self.offset2)) + assert hash(self.offset2) == hash(self.offset2) def testCall(self): - self.assertEqual(self.offset2(self.d), datetime(2008, 1, 3)) - self.assertEqual(self.offset2(self.nd), datetime(2008, 1, 3)) + assert self.offset2(self.d) == datetime(2008, 1, 3) + assert self.offset2(self.nd) == datetime(2008, 1, 3) def testRAdd(self): - self.assertEqual(self.d + self.offset2, self.offset2 + self.d) + assert self.d + self.offset2 == self.offset2 + self.d def testSub(self): off = self.offset2 - self.assertRaises(Exception, off.__sub__, self.d) - self.assertEqual(2 * off - off, off) + pytest.raises(Exception, off.__sub__, self.d) + assert 2 * off - off == off - self.assertEqual(self.d - self.offset2, self.d + CDay(-2)) + assert self.d - self.offset2 == self.d + CDay(-2) def testRSub(self): - self.assertEqual(self.d - self.offset2, (-self.offset2).apply(self.d)) + assert self.d - self.offset2 == (-self.offset2).apply(self.d) def testMult1(self): - self.assertEqual(self.d + 10 * self.offset, self.d + CDay(10)) + assert self.d + 10 * self.offset == self.d + CDay(10) def testMult2(self): - self.assertEqual(self.d + (-5 * CDay(-10)), self.d + CDay(50)) + assert self.d + (-5 * CDay(-10)) == self.d + CDay(50) def testRollback1(self): - self.assertEqual(CDay(10).rollback(self.d), self.d) + assert CDay(10).rollback(self.d) == self.d def testRollback2(self): - self.assertEqual( - CDay(10).rollback(datetime(2008, 1, 5)), datetime(2008, 1, 4)) + assert (CDay(10).rollback(datetime(2008, 1, 5)) == + datetime(2008, 1, 4)) def testRollforward1(self): - self.assertEqual(CDay(10).rollforward(self.d), self.d) + assert CDay(10).rollforward(self.d) == self.d def testRollforward2(self): - self.assertEqual( - CDay(10).rollforward(datetime(2008, 1, 5)), datetime(2008, 1, 7)) + assert (CDay(10).rollforward(datetime(2008, 1, 5)) == + datetime(2008, 1, 7)) def test_roll_date_object(self): offset = CDay() @@ -1773,17 +1752,17 @@ def test_roll_date_object(self): dt = date(2012, 9, 15) result = offset.rollback(dt) - self.assertEqual(result, datetime(2012, 9, 14)) + assert result == datetime(2012, 9, 14) result = offset.rollforward(dt) - self.assertEqual(result, datetime(2012, 9, 17)) + assert result == datetime(2012, 9, 17) offset = offsets.Day() result = offset.rollback(dt) - self.assertEqual(result, datetime(2012, 9, 15)) + assert result == datetime(2012, 9, 15) result = offset.rollforward(dt) - self.assertEqual(result, datetime(2012, 9, 15)) + assert result == datetime(2012, 9, 15) def test_onOffset(self): tests = [(CDay(), datetime(2008, 1, 1), True), @@ -1842,29 +1821,29 @@ def test_apply_large_n(self): dt = datetime(2012, 10, 23) result = dt + CDay(10) - self.assertEqual(result, datetime(2012, 11, 6)) + assert result == datetime(2012, 11, 6) result = dt + CDay(100) - CDay(100) - self.assertEqual(result, dt) + assert result == dt off = CDay() * 6 rs = datetime(2012, 1, 1) - off xp = datetime(2011, 12, 23) - self.assertEqual(rs, xp) + assert rs == xp st = datetime(2011, 12, 18) rs = st + off xp = datetime(2011, 12, 26) - self.assertEqual(rs, xp) + assert rs == xp def test_apply_corner(self): - self.assertRaises(Exception, CDay().apply, BMonthEnd()) + pytest.raises(Exception, CDay().apply, BMonthEnd()) def test_offsets_compare_equal(self): # root cause of #456 offset1 = CDay() offset2 = CDay() - self.assertFalse(offset1 != offset2) + assert not offset1 != offset2 def test_holidays(self): # Define a TradingDay 
offset @@ -1875,7 +1854,7 @@ def test_holidays(self): dt = datetime(year, 4, 30) xp = datetime(year, 5, 2) rs = dt + tday - self.assertEqual(rs, xp) + assert rs == xp def test_weekmask(self): weekmask_saudi = 'Sat Sun Mon Tue Wed' # Thu-Fri Weekend @@ -1888,13 +1867,13 @@ def test_weekmask(self): xp_saudi = datetime(2013, 5, 4) xp_uae = datetime(2013, 5, 2) xp_egypt = datetime(2013, 5, 2) - self.assertEqual(xp_saudi, dt + bday_saudi) - self.assertEqual(xp_uae, dt + bday_uae) - self.assertEqual(xp_egypt, dt + bday_egypt) + assert xp_saudi == dt + bday_saudi + assert xp_uae == dt + bday_uae + assert xp_egypt == dt + bday_egypt xp2 = datetime(2013, 5, 5) - self.assertEqual(xp2, dt + 2 * bday_saudi) - self.assertEqual(xp2, dt + 2 * bday_uae) - self.assertEqual(xp2, dt + 2 * bday_egypt) + assert xp2 == dt + 2 * bday_saudi + assert xp2 == dt + 2 * bday_uae + assert xp2 == dt + 2 * bday_egypt def test_weekmask_and_holidays(self): weekmask_egypt = 'Sun Mon Tue Wed Thu' # Fri-Sat Weekend @@ -1903,7 +1882,7 @@ def test_weekmask_and_holidays(self): bday_egypt = CDay(holidays=holidays, weekmask=weekmask_egypt) dt = datetime(2013, 4, 30) xp_egypt = datetime(2013, 5, 5) - self.assertEqual(xp_egypt, dt + 2 * bday_egypt) + assert xp_egypt == dt + 2 * bday_egypt def test_calendar(self): calendar = USFederalHolidayCalendar() @@ -1912,8 +1891,8 @@ def test_calendar(self): def test_roundtrip_pickle(self): def _check_roundtrip(obj): - unpickled = self.round_trip_pickle(obj) - self.assertEqual(unpickled, obj) + unpickled = tm.round_trip_pickle(obj) + assert unpickled == obj _check_roundtrip(self.offset) _check_roundtrip(self.offset2) @@ -1926,56 +1905,54 @@ def test_pickle_compat_0_14_1(self): cday0_14_1 = read_pickle(os.path.join(pth, 'cday-0.14.1.pickle')) cday = CDay(holidays=hdays) - self.assertEqual(cday, cday0_14_1) + assert cday == cday0_14_1 class CustomBusinessMonthBase(object): - _multiprocess_can_split_ = True - def setUp(self): + def setup_method(self, method): self.d = datetime(2008, 1, 1) self.offset = self._object() self.offset2 = self._object(2) def testEQ(self): - self.assertEqual(self.offset2, self.offset2) + assert self.offset2 == self.offset2 def test_mul(self): pass def test_hash(self): - self.assertEqual(hash(self.offset2), hash(self.offset2)) + assert hash(self.offset2) == hash(self.offset2) def testRAdd(self): - self.assertEqual(self.d + self.offset2, self.offset2 + self.d) + assert self.d + self.offset2 == self.offset2 + self.d def testSub(self): off = self.offset2 - self.assertRaises(Exception, off.__sub__, self.d) - self.assertEqual(2 * off - off, off) + pytest.raises(Exception, off.__sub__, self.d) + assert 2 * off - off == off - self.assertEqual(self.d - self.offset2, self.d + self._object(-2)) + assert self.d - self.offset2 == self.d + self._object(-2) def testRSub(self): - self.assertEqual(self.d - self.offset2, (-self.offset2).apply(self.d)) + assert self.d - self.offset2 == (-self.offset2).apply(self.d) def testMult1(self): - self.assertEqual(self.d + 10 * self.offset, self.d + self._object(10)) + assert self.d + 10 * self.offset == self.d + self._object(10) def testMult2(self): - self.assertEqual(self.d + (-5 * self._object(-10)), - self.d + self._object(50)) + assert self.d + (-5 * self._object(-10)) == self.d + self._object(50) def test_offsets_compare_equal(self): offset1 = self._object() offset2 = self._object() - self.assertFalse(offset1 != offset2) + assert not offset1 != offset2 def test_roundtrip_pickle(self): def _check_roundtrip(obj): - unpickled = 
self.round_trip_pickle(obj) - self.assertEqual(unpickled, obj) + unpickled = tm.round_trip_pickle(obj) + assert unpickled == obj _check_roundtrip(self._object()) _check_roundtrip(self._object(2)) @@ -1990,26 +1967,24 @@ def test_different_normalize_equals(self): offset = CBMonthEnd() offset2 = CBMonthEnd() offset2.normalize = True - self.assertEqual(offset, offset2) + assert offset == offset2 def test_repr(self): assert repr(self.offset) == '<CustomBusinessMonthEnd>' assert repr(self.offset2) == '<2 * CustomBusinessMonthEnds>' def testCall(self): - self.assertEqual(self.offset2(self.d), datetime(2008, 2, 29)) + assert self.offset2(self.d) == datetime(2008, 2, 29) def testRollback1(self): - self.assertEqual( - CDay(10).rollback(datetime(2007, 12, 31)), datetime(2007, 12, 31)) + assert (CDay(10).rollback(datetime(2007, 12, 31)) == + datetime(2007, 12, 31)) def testRollback2(self): - self.assertEqual(CBMonthEnd(10).rollback(self.d), - datetime(2007, 12, 31)) + assert CBMonthEnd(10).rollback(self.d) == datetime(2007, 12, 31) def testRollforward1(self): - self.assertEqual(CBMonthEnd(10).rollforward( - self.d), datetime(2008, 1, 31)) + assert CBMonthEnd(10).rollforward(self.d) == datetime(2008, 1, 31) def test_roll_date_object(self): offset = CBMonthEnd() @@ -2017,17 +1992,17 @@ dt = date(2012, 9, 15) result = offset.rollback(dt) - self.assertEqual(result, datetime(2012, 8, 31)) + assert result == datetime(2012, 8, 31) result = offset.rollforward(dt) - self.assertEqual(result, datetime(2012, 9, 28)) + assert result == datetime(2012, 9, 28) offset = offsets.Day() result = offset.rollback(dt) - self.assertEqual(result, datetime(2012, 9, 15)) + assert result == datetime(2012, 9, 15) result = offset.rollforward(dt) - self.assertEqual(result, datetime(2012, 9, 15)) + assert result == datetime(2012, 9, 15) def test_onOffset(self): tests = [(CBMonthEnd(), datetime(2008, 1, 31), True), @@ -2065,20 +2040,20 @@ def test_apply_large_n(self): dt = datetime(2012, 10, 23) result = dt + CBMonthEnd(10) - self.assertEqual(result, datetime(2013, 7, 31)) + assert result == datetime(2013, 7, 31) result = dt + CDay(100) - CDay(100) - self.assertEqual(result, dt) + assert result == dt off = CBMonthEnd() * 6 rs = datetime(2012, 1, 1) - off xp = datetime(2011, 7, 29) - self.assertEqual(rs, xp) + assert rs == xp st = datetime(2011, 12, 18) rs = st + off xp = datetime(2012, 5, 31) - self.assertEqual(rs, xp) + assert rs == xp def test_holidays(self): # Define a TradingDay offset @@ -2086,17 +2061,16 @@ np.datetime64('2012-02-29')] bm_offset = CBMonthEnd(holidays=holidays) dt = datetime(2012, 1, 1) - self.assertEqual(dt + bm_offset, datetime(2012, 1, 30)) - self.assertEqual(dt + 2 * bm_offset, datetime(2012, 2, 27)) + assert dt + bm_offset == datetime(2012, 1, 30) + assert dt + 2 * bm_offset == datetime(2012, 2, 27) def test_datetimeindex(self): from pandas.tseries.holiday import USFederalHolidayCalendar hcal = USFederalHolidayCalendar() freq = CBMonthEnd(calendar=hcal) - self.assertEqual(DatetimeIndex(start='20120101', end='20130101', - freq=freq).tolist()[0], - datetime(2012, 1, 31)) + assert (DatetimeIndex(start='20120101', end='20130101', + freq=freq).tolist()[0] == datetime(2012, 1, 31)) class TestCustomBusinessMonthBegin(CustomBusinessMonthBase, Base): @@ -2107,26 +2081,24 @@ def test_different_normalize_equals(self): offset = CBMonthBegin() offset2 = CBMonthBegin() offset2.normalize = True - self.assertEqual(offset, offset2) + assert offset == offset2 def test_repr(self): assert 
repr(self.offset) == '' assert repr(self.offset2) == '<2 * CustomBusinessMonthBegins>' def testCall(self): - self.assertEqual(self.offset2(self.d), datetime(2008, 3, 3)) + assert self.offset2(self.d) == datetime(2008, 3, 3) def testRollback1(self): - self.assertEqual( - CDay(10).rollback(datetime(2007, 12, 31)), datetime(2007, 12, 31)) + assert (CDay(10).rollback(datetime(2007, 12, 31)) == + datetime(2007, 12, 31)) def testRollback2(self): - self.assertEqual(CBMonthBegin(10).rollback(self.d), - datetime(2008, 1, 1)) + assert CBMonthBegin(10).rollback(self.d) == datetime(2008, 1, 1) def testRollforward1(self): - self.assertEqual(CBMonthBegin(10).rollforward( - self.d), datetime(2008, 1, 1)) + assert CBMonthBegin(10).rollforward(self.d) == datetime(2008, 1, 1) def test_roll_date_object(self): offset = CBMonthBegin() @@ -2134,17 +2106,17 @@ def test_roll_date_object(self): dt = date(2012, 9, 15) result = offset.rollback(dt) - self.assertEqual(result, datetime(2012, 9, 3)) + assert result == datetime(2012, 9, 3) result = offset.rollforward(dt) - self.assertEqual(result, datetime(2012, 10, 1)) + assert result == datetime(2012, 10, 1) offset = offsets.Day() result = offset.rollback(dt) - self.assertEqual(result, datetime(2012, 9, 15)) + assert result == datetime(2012, 9, 15) result = offset.rollforward(dt) - self.assertEqual(result, datetime(2012, 9, 15)) + assert result == datetime(2012, 9, 15) def test_onOffset(self): tests = [(CBMonthBegin(), datetime(2008, 1, 1), True), @@ -2181,20 +2153,21 @@ def test_apply_large_n(self): dt = datetime(2012, 10, 23) result = dt + CBMonthBegin(10) - self.assertEqual(result, datetime(2013, 8, 1)) + assert result == datetime(2013, 8, 1) result = dt + CDay(100) - CDay(100) - self.assertEqual(result, dt) + assert result == dt off = CBMonthBegin() * 6 rs = datetime(2012, 1, 1) - off xp = datetime(2011, 7, 1) - self.assertEqual(rs, xp) + assert rs == xp st = datetime(2011, 12, 18) rs = st + off + xp = datetime(2012, 6, 1) - self.assertEqual(rs, xp) + assert rs == xp def test_holidays(self): # Define a TradingDay offset @@ -2202,15 +2175,15 @@ def test_holidays(self): np.datetime64('2012-03-01')] bm_offset = CBMonthBegin(holidays=holidays) dt = datetime(2012, 1, 1) - self.assertEqual(dt + bm_offset, datetime(2012, 1, 2)) - self.assertEqual(dt + 2 * bm_offset, datetime(2012, 2, 3)) + + assert dt + bm_offset == datetime(2012, 1, 2) + assert dt + 2 * bm_offset == datetime(2012, 2, 3) def test_datetimeindex(self): hcal = USFederalHolidayCalendar() cbmb = CBMonthBegin(calendar=hcal) - self.assertEqual(DatetimeIndex(start='20120101', end='20130101', - freq=cbmb).tolist()[0], - datetime(2012, 1, 3)) + assert (DatetimeIndex(start='20120101', end='20130101', + freq=cbmb).tolist()[0] == datetime(2012, 1, 3)) def assertOnOffset(offset, date, expected): @@ -2224,20 +2197,20 @@ class TestWeek(Base): _offset = Week def test_repr(self): - self.assertEqual(repr(Week(weekday=0)), "") - self.assertEqual(repr(Week(n=-1, weekday=0)), "<-1 * Week: weekday=0>") - self.assertEqual(repr(Week(n=-2, weekday=0)), - "<-2 * Weeks: weekday=0>") + assert repr(Week(weekday=0)) == "" + assert repr(Week(n=-1, weekday=0)) == "<-1 * Week: weekday=0>" + assert repr(Week(n=-2, weekday=0)) == "<-2 * Weeks: weekday=0>" def test_corner(self): - self.assertRaises(ValueError, Week, weekday=7) - assertRaisesRegexp(ValueError, "Day must be", Week, weekday=-1) + pytest.raises(ValueError, Week, weekday=7) + tm.assert_raises_regex( + ValueError, "Day must be", Week, weekday=-1) def test_isAnchored(self): - 
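# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): rollback
# moves a date to the previous anchor point (unless it is already on one)
# and rollforward to the next; values taken from the hunk above.
from datetime import date, datetime
from pandas.tseries.offsets import CBMonthBegin

offset = CBMonthBegin()
dt = date(2012, 9, 15)
assert offset.rollback(dt) == datetime(2012, 9, 3)    # 1st busday of Sep
assert offset.rollforward(dt) == datetime(2012, 10, 1)
# ---------------------------------------------------------------------------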
self.assertTrue(Week(weekday=0).isAnchored()) - self.assertFalse(Week().isAnchored()) - self.assertFalse(Week(2, weekday=2).isAnchored()) - self.assertFalse(Week(2).isAnchored()) + assert Week(weekday=0).isAnchored() + assert not Week().isAnchored() + assert not Week(2, weekday=2).isAnchored() + assert not Week(2).isAnchored() def test_offset(self): tests = [] @@ -2289,27 +2262,27 @@ def test_offsets_compare_equal(self): # root cause of #456 offset1 = Week() offset2 = Week() - self.assertFalse(offset1 != offset2) + assert not offset1 != offset2 class TestWeekOfMonth(Base): _offset = WeekOfMonth def test_constructor(self): - assertRaisesRegexp(ValueError, "^N cannot be 0", WeekOfMonth, n=0, - week=1, weekday=1) - assertRaisesRegexp(ValueError, "^Week", WeekOfMonth, n=1, week=4, - weekday=0) - assertRaisesRegexp(ValueError, "^Week", WeekOfMonth, n=1, week=-1, - weekday=0) - assertRaisesRegexp(ValueError, "^Day", WeekOfMonth, n=1, week=0, - weekday=-1) - assertRaisesRegexp(ValueError, "^Day", WeekOfMonth, n=1, week=0, - weekday=7) + tm.assert_raises_regex(ValueError, "^N cannot be 0", + WeekOfMonth, n=0, week=1, weekday=1) + tm.assert_raises_regex(ValueError, "^Week", WeekOfMonth, + n=1, week=4, weekday=0) + tm.assert_raises_regex(ValueError, "^Week", WeekOfMonth, + n=1, week=-1, weekday=0) + tm.assert_raises_regex(ValueError, "^Day", WeekOfMonth, + n=1, week=0, weekday=-1) + tm.assert_raises_regex(ValueError, "^Day", WeekOfMonth, + n=1, week=0, weekday=7) def test_repr(self): - self.assertEqual(repr(WeekOfMonth(weekday=1, week=2)), - "") + assert (repr(WeekOfMonth(weekday=1, week=2)) == + "") def test_offset(self): date1 = datetime(2011, 1, 4) # 1st Tuesday of Month @@ -2359,9 +2332,10 @@ def test_offset(self): # try subtracting result = datetime(2011, 2, 1) - WeekOfMonth(week=1, weekday=2) - self.assertEqual(result, datetime(2011, 1, 12)) + assert result == datetime(2011, 1, 12) + result = datetime(2011, 2, 3) - WeekOfMonth(week=0, weekday=2) - self.assertEqual(result, datetime(2011, 2, 2)) + assert result == datetime(2011, 2, 2) def test_onOffset(self): test_cases = [ @@ -2375,19 +2349,20 @@ def test_onOffset(self): for week, weekday, dt, expected in test_cases: offset = WeekOfMonth(week=week, weekday=weekday) - self.assertEqual(offset.onOffset(dt), expected) + assert offset.onOffset(dt) == expected class TestLastWeekOfMonth(Base): _offset = LastWeekOfMonth def test_constructor(self): - assertRaisesRegexp(ValueError, "^N cannot be 0", LastWeekOfMonth, n=0, - weekday=1) + tm.assert_raises_regex(ValueError, "^N cannot be 0", + LastWeekOfMonth, n=0, weekday=1) - assertRaisesRegexp(ValueError, "^Day", LastWeekOfMonth, n=1, - weekday=-1) - assertRaisesRegexp(ValueError, "^Day", LastWeekOfMonth, n=1, weekday=7) + tm.assert_raises_regex(ValueError, "^Day", LastWeekOfMonth, n=1, + weekday=-1) + tm.assert_raises_regex( + ValueError, "^Day", LastWeekOfMonth, n=1, weekday=7) def test_offset(self): # Saturday @@ -2396,13 +2371,13 @@ def test_offset(self): offset_sat = LastWeekOfMonth(n=1, weekday=5) one_day_before = (last_sat + timedelta(days=-1)) - self.assertEqual(one_day_before + offset_sat, last_sat) + assert one_day_before + offset_sat == last_sat one_day_after = (last_sat + timedelta(days=+1)) - self.assertEqual(one_day_after + offset_sat, next_sat) + assert one_day_after + offset_sat == next_sat # Test On that day - self.assertEqual(last_sat + offset_sat, next_sat) + assert last_sat + offset_sat == next_sat # Thursday @@ -2411,23 +2386,22 @@ def test_offset(self): next_thurs = datetime(2013, 2, 
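# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): Week anchors
# on a weekday (0=Monday) and is "anchored" only for n == 1; WeekOfMonth
# anchors on the n-th such weekday of each month (week=1, weekday=2 is the
# second Wednesday). The subtraction below is verbatim from the hunk.
from datetime import datetime
from pandas.tseries.offsets import Week, WeekOfMonth

assert Week(weekday=0).isAnchored() and not Week(2).isAnchored()
assert (datetime(2011, 2, 1) - WeekOfMonth(week=1, weekday=2) ==
        datetime(2011, 1, 12))
# ---------------------------------------------------------------------------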
28) one_day_before = last_thurs + timedelta(days=-1) - self.assertEqual(one_day_before + offset_thur, last_thurs) + assert one_day_before + offset_thur == last_thurs one_day_after = last_thurs + timedelta(days=+1) - self.assertEqual(one_day_after + offset_thur, next_thurs) + assert one_day_after + offset_thur == next_thurs # Test on that day - self.assertEqual(last_thurs + offset_thur, next_thurs) + assert last_thurs + offset_thur == next_thurs three_before = last_thurs + timedelta(days=-3) - self.assertEqual(three_before + offset_thur, last_thurs) + assert three_before + offset_thur == last_thurs two_after = last_thurs + timedelta(days=+2) - self.assertEqual(two_after + offset_thur, next_thurs) + assert two_after + offset_thur == next_thurs offset_sunday = LastWeekOfMonth(n=1, weekday=WeekDay.SUN) - self.assertEqual(datetime(2013, 7, 31) + - offset_sunday, datetime(2013, 8, 25)) + assert datetime(2013, 7, 31) + offset_sunday == datetime(2013, 8, 25) def test_onOffset(self): test_cases = [ @@ -2449,7 +2423,7 @@ def test_onOffset(self): for weekday, dt, expected in test_cases: offset = LastWeekOfMonth(weekday=weekday) - self.assertEqual(offset.onOffset(dt), expected, msg=date) + assert offset.onOffset(dt) == expected class TestBMonthBegin(Base): @@ -2511,7 +2485,7 @@ def test_offsets_compare_equal(self): # root cause of #456 offset1 = BMonthBegin() offset2 = BMonthBegin() - self.assertFalse(offset1 != offset2) + assert not offset1 != offset2 class TestBMonthEnd(Base): @@ -2560,7 +2534,7 @@ def test_normalize(self): result = dt + BMonthEnd(normalize=True) expected = dt.replace(hour=0) + BMonthEnd() - self.assertEqual(result, expected) + assert result == expected def test_onOffset(self): @@ -2574,7 +2548,7 @@ def test_offsets_compare_equal(self): # root cause of #456 offset1 = BMonthEnd() offset2 = BMonthEnd() - self.assertFalse(offset1 != offset2) + assert not offset1 != offset2 class TestMonthBegin(Base): @@ -2659,23 +2633,22 @@ def test_offset(self): for base, expected in compat.iteritems(cases): assertEq(offset, base, expected) - # def test_day_of_month(self): - # dt = datetime(2007, 1, 1) + def test_day_of_month(self): + dt = datetime(2007, 1, 1) + offset = MonthEnd(day=20) - # offset = MonthEnd(day=20) + result = dt + offset + assert result == Timestamp(2007, 1, 31) - # result = dt + offset - # self.assertEqual(result, datetime(2007, 1, 20)) - - # result = result + offset - # self.assertEqual(result, datetime(2007, 2, 20)) + result = result + offset + assert result == Timestamp(2007, 2, 28) def test_normalize(self): dt = datetime(2007, 1, 1, 3) result = dt + MonthEnd(normalize=True) expected = dt.replace(hour=0) + MonthEnd() - self.assertEqual(result, expected) + assert result == expected def test_onOffset(self): @@ -2836,7 +2809,7 @@ def test_onOffset(self): def test_vectorized_offset_addition(self): for klass, assert_func in zip([Series, DatetimeIndex], - [self.assert_series_equal, + [tm.assert_series_equal, tm.assert_index_equal]): s = klass([Timestamp('2000-01-15 00:15:00', tz='US/Central'), Timestamp('2000-02-15', tz='US/Central')], name='a') @@ -3011,7 +2984,7 @@ def test_onOffset(self): def test_vectorized_offset_addition(self): for klass, assert_func in zip([Series, DatetimeIndex], - [self.assert_series_equal, + [tm.assert_series_equal, tm.assert_index_equal]): s = klass([Timestamp('2000-01-15 00:15:00', tz='US/Central'), @@ -3037,17 +3010,17 @@ class TestBQuarterBegin(Base): _offset = BQuarterBegin def test_repr(self): - self.assertEqual(repr(BQuarterBegin()), - "") - 
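# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): with
# normalize=True the time-of-day is zeroed after the shift, so a 03:00
# start lands on the same stamp as a midnight start (as asserted above).
from datetime import datetime
from pandas.tseries.offsets import MonthEnd

dt = datetime(2007, 1, 1, 3)
assert dt + MonthEnd(normalize=True) == dt.replace(hour=0) + MonthEnd()
assert dt.replace(hour=0) + MonthEnd() == datetime(2007, 1, 31)
# ---------------------------------------------------------------------------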
self.assertEqual(repr(BQuarterBegin(startingMonth=3)), - "") - self.assertEqual(repr(BQuarterBegin(startingMonth=1)), - "") + assert (repr(BQuarterBegin()) == + "") + assert (repr(BQuarterBegin(startingMonth=3)) == + "") + assert (repr(BQuarterBegin(startingMonth=1)) == + "") def test_isAnchored(self): - self.assertTrue(BQuarterBegin(startingMonth=1).isAnchored()) - self.assertTrue(BQuarterBegin().isAnchored()) - self.assertFalse(BQuarterBegin(2, startingMonth=1).isAnchored()) + assert BQuarterBegin(startingMonth=1).isAnchored() + assert BQuarterBegin().isAnchored() + assert not BQuarterBegin(2, startingMonth=1).isAnchored() def test_offset(self): tests = [] @@ -3124,24 +3097,24 @@ def test_offset(self): # corner offset = BQuarterBegin(n=-1, startingMonth=1) - self.assertEqual(datetime(2007, 4, 3) + offset, datetime(2007, 4, 2)) + assert datetime(2007, 4, 3) + offset == datetime(2007, 4, 2) class TestBQuarterEnd(Base): _offset = BQuarterEnd def test_repr(self): - self.assertEqual(repr(BQuarterEnd()), - "") - self.assertEqual(repr(BQuarterEnd(startingMonth=3)), - "") - self.assertEqual(repr(BQuarterEnd(startingMonth=1)), - "") + assert (repr(BQuarterEnd()) == + "") + assert (repr(BQuarterEnd(startingMonth=3)) == + "") + assert (repr(BQuarterEnd(startingMonth=1)) == + "") def test_isAnchored(self): - self.assertTrue(BQuarterEnd(startingMonth=1).isAnchored()) - self.assertTrue(BQuarterEnd().isAnchored()) - self.assertFalse(BQuarterEnd(2, startingMonth=1).isAnchored()) + assert BQuarterEnd(startingMonth=1).isAnchored() + assert BQuarterEnd().isAnchored() + assert not BQuarterEnd(2, startingMonth=1).isAnchored() def test_offset(self): tests = [] @@ -3201,7 +3174,7 @@ def test_offset(self): # corner offset = BQuarterEnd(n=-1, startingMonth=1) - self.assertEqual(datetime(2010, 1, 31) + offset, datetime(2010, 1, 29)) + assert datetime(2010, 1, 31) + offset == datetime(2010, 1, 29) def test_onOffset(self): @@ -3256,6 +3229,7 @@ def makeFY5253LastOfMonth(*args, **kwds): class TestFY5253LastOfMonth(Base): + def test_onOffset(self): offset_lom_sat_aug = makeFY5253LastOfMonth(1, startingMonth=8, @@ -3337,57 +3311,52 @@ def test_apply(self): current = data[0] for datum in data[1:]: current = current + offset - self.assertEqual(current, datum) + assert current == datum class TestFY5253NearestEndMonth(Base): + def test_get_target_month_end(self): - self.assertEqual(makeFY5253NearestEndMonth(startingMonth=8, - weekday=WeekDay.SAT) - .get_target_month_end( - datetime(2013, 1, 1)), datetime(2013, 8, 31)) - self.assertEqual(makeFY5253NearestEndMonth(startingMonth=12, - weekday=WeekDay.SAT) - .get_target_month_end(datetime(2013, 1, 1)), - datetime(2013, 12, 31)) - self.assertEqual(makeFY5253NearestEndMonth(startingMonth=2, - weekday=WeekDay.SAT) - .get_target_month_end(datetime(2013, 1, 1)), - datetime(2013, 2, 28)) + assert (makeFY5253NearestEndMonth( + startingMonth=8, weekday=WeekDay.SAT).get_target_month_end( + datetime(2013, 1, 1)) == datetime(2013, 8, 31)) + assert (makeFY5253NearestEndMonth( + startingMonth=12, weekday=WeekDay.SAT).get_target_month_end( + datetime(2013, 1, 1)) == datetime(2013, 12, 31)) + assert (makeFY5253NearestEndMonth( + startingMonth=2, weekday=WeekDay.SAT).get_target_month_end( + datetime(2013, 1, 1)) == datetime(2013, 2, 28)) def test_get_year_end(self): - self.assertEqual(makeFY5253NearestEndMonth(startingMonth=8, - weekday=WeekDay.SAT) - .get_year_end(datetime(2013, 1, 1)), - datetime(2013, 8, 31)) - self.assertEqual(makeFY5253NearestEndMonth(startingMonth=8, - 
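# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): FY5253 with
# variation="nearest" ends the fiscal year on the chosen weekday nearest to
# the end of startingMonth (weekday=6 is Sunday); values verbatim from the
# hunk above.
from datetime import datetime
from pandas.tseries.offsets import FY5253

jnj = FY5253(n=1, startingMonth=12, weekday=6, variation="nearest")
assert jnj.get_year_end(datetime(2006, 1, 1)) == datetime(2006, 12, 31)
# ---------------------------------------------------------------------------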
weekday=WeekDay.SUN) - .get_year_end(datetime(2013, 1, 1)), - datetime(2013, 9, 1)) - self.assertEqual(makeFY5253NearestEndMonth(startingMonth=8, - weekday=WeekDay.FRI) - .get_year_end(datetime(2013, 1, 1)), - datetime(2013, 8, 30)) + assert (makeFY5253NearestEndMonth( + startingMonth=8, weekday=WeekDay.SAT).get_year_end( + datetime(2013, 1, 1)) == datetime(2013, 8, 31)) + assert (makeFY5253NearestEndMonth( + startingMonth=8, weekday=WeekDay.SUN).get_year_end( + datetime(2013, 1, 1)) == datetime(2013, 9, 1)) + assert (makeFY5253NearestEndMonth( + startingMonth=8, weekday=WeekDay.FRI).get_year_end( + datetime(2013, 1, 1)) == datetime(2013, 8, 30)) offset_n = FY5253(weekday=WeekDay.TUE, startingMonth=12, variation="nearest") - self.assertEqual(offset_n.get_year_end( - datetime(2012, 1, 1)), datetime(2013, 1, 1)) - self.assertEqual(offset_n.get_year_end( - datetime(2012, 1, 10)), datetime(2013, 1, 1)) - - self.assertEqual(offset_n.get_year_end( - datetime(2013, 1, 1)), datetime(2013, 12, 31)) - self.assertEqual(offset_n.get_year_end( - datetime(2013, 1, 2)), datetime(2013, 12, 31)) - self.assertEqual(offset_n.get_year_end( - datetime(2013, 1, 3)), datetime(2013, 12, 31)) - self.assertEqual(offset_n.get_year_end( - datetime(2013, 1, 10)), datetime(2013, 12, 31)) + assert (offset_n.get_year_end(datetime(2012, 1, 1)) == + datetime(2013, 1, 1)) + assert (offset_n.get_year_end(datetime(2012, 1, 10)) == + datetime(2013, 1, 1)) + + assert (offset_n.get_year_end(datetime(2013, 1, 1)) == + datetime(2013, 12, 31)) + assert (offset_n.get_year_end(datetime(2013, 1, 2)) == + datetime(2013, 12, 31)) + assert (offset_n.get_year_end(datetime(2013, 1, 3)) == + datetime(2013, 12, 31)) + assert (offset_n.get_year_end(datetime(2013, 1, 10)) == + datetime(2013, 12, 31)) JNJ = FY5253(n=1, startingMonth=12, weekday=6, variation="nearest") - self.assertEqual(JNJ.get_year_end( - datetime(2006, 1, 1)), datetime(2006, 12, 31)) + assert (JNJ.get_year_end(datetime(2006, 1, 1)) == + datetime(2006, 12, 31)) def test_onOffset(self): offset_lom_aug_sat = makeFY5253NearestEndMonth(1, startingMonth=8, @@ -3502,42 +3471,35 @@ def test_apply(self): current = data[0] for datum in data[1:]: current = current + offset - self.assertEqual(current, datum) + assert current == datum class TestFY5253LastOfMonthQuarter(Base): + def test_isAnchored(self): - self.assertTrue( - makeFY5253LastOfMonthQuarter(startingMonth=1, weekday=WeekDay.SAT, - qtr_with_extra_week=4).isAnchored()) - self.assertTrue( - makeFY5253LastOfMonthQuarter(weekday=WeekDay.SAT, startingMonth=3, - qtr_with_extra_week=4).isAnchored()) - self.assertFalse(makeFY5253LastOfMonthQuarter( + assert makeFY5253LastOfMonthQuarter( + startingMonth=1, weekday=WeekDay.SAT, + qtr_with_extra_week=4).isAnchored() + assert makeFY5253LastOfMonthQuarter( + weekday=WeekDay.SAT, startingMonth=3, + qtr_with_extra_week=4).isAnchored() + assert not makeFY5253LastOfMonthQuarter( 2, startingMonth=1, weekday=WeekDay.SAT, - qtr_with_extra_week=4).isAnchored()) + qtr_with_extra_week=4).isAnchored() def test_equality(self): - self.assertEqual(makeFY5253LastOfMonthQuarter(startingMonth=1, - weekday=WeekDay.SAT, - qtr_with_extra_week=4), - makeFY5253LastOfMonthQuarter(startingMonth=1, - weekday=WeekDay.SAT, - qtr_with_extra_week=4)) - self.assertNotEqual( - makeFY5253LastOfMonthQuarter( - startingMonth=1, weekday=WeekDay.SAT, - qtr_with_extra_week=4), - makeFY5253LastOfMonthQuarter( - startingMonth=1, weekday=WeekDay.SUN, - qtr_with_extra_week=4)) - self.assertNotEqual( - makeFY5253LastOfMonthQuarter( 
- startingMonth=1, weekday=WeekDay.SAT, - qtr_with_extra_week=4), - makeFY5253LastOfMonthQuarter( - startingMonth=2, weekday=WeekDay.SAT, - qtr_with_extra_week=4)) + assert (makeFY5253LastOfMonthQuarter( + startingMonth=1, weekday=WeekDay.SAT, + qtr_with_extra_week=4) == makeFY5253LastOfMonthQuarter( + startingMonth=1, weekday=WeekDay.SAT, qtr_with_extra_week=4)) + assert (makeFY5253LastOfMonthQuarter( + startingMonth=1, weekday=WeekDay.SAT, + qtr_with_extra_week=4) != makeFY5253LastOfMonthQuarter( + startingMonth=1, weekday=WeekDay.SUN, qtr_with_extra_week=4)) + assert (makeFY5253LastOfMonthQuarter( + startingMonth=1, weekday=WeekDay.SAT, + qtr_with_extra_week=4) != makeFY5253LastOfMonthQuarter( + startingMonth=2, weekday=WeekDay.SAT, qtr_with_extra_week=4)) def test_offset(self): offset = makeFY5253LastOfMonthQuarter(1, startingMonth=9, @@ -3663,53 +3625,40 @@ def test_onOffset(self): def test_year_has_extra_week(self): # End of long Q1 - self.assertTrue( - makeFY5253LastOfMonthQuarter(1, startingMonth=12, - weekday=WeekDay.SAT, - qtr_with_extra_week=1) - .year_has_extra_week(datetime(2011, 4, 2))) + assert makeFY5253LastOfMonthQuarter( + 1, startingMonth=12, weekday=WeekDay.SAT, + qtr_with_extra_week=1).year_has_extra_week(datetime(2011, 4, 2)) # Start of long Q1 - self.assertTrue( - makeFY5253LastOfMonthQuarter( - 1, startingMonth=12, weekday=WeekDay.SAT, - qtr_with_extra_week=1) - .year_has_extra_week(datetime(2010, 12, 26))) + assert makeFY5253LastOfMonthQuarter( + 1, startingMonth=12, weekday=WeekDay.SAT, + qtr_with_extra_week=1).year_has_extra_week(datetime(2010, 12, 26)) # End of year before year with long Q1 - self.assertFalse( - makeFY5253LastOfMonthQuarter( - 1, startingMonth=12, weekday=WeekDay.SAT, - qtr_with_extra_week=1) - .year_has_extra_week(datetime(2010, 12, 25))) + assert not makeFY5253LastOfMonthQuarter( + 1, startingMonth=12, weekday=WeekDay.SAT, + qtr_with_extra_week=1).year_has_extra_week(datetime(2010, 12, 25)) for year in [x for x in range(1994, 2011 + 1) if x not in [2011, 2005, 2000, 1994]]: - self.assertFalse( - makeFY5253LastOfMonthQuarter( - 1, startingMonth=12, weekday=WeekDay.SAT, - qtr_with_extra_week=1) - .year_has_extra_week(datetime(year, 4, 2))) + assert not makeFY5253LastOfMonthQuarter( + 1, startingMonth=12, weekday=WeekDay.SAT, + qtr_with_extra_week=1).year_has_extra_week( + datetime(year, 4, 2)) # Other long years - self.assertTrue( - makeFY5253LastOfMonthQuarter( - 1, startingMonth=12, weekday=WeekDay.SAT, - qtr_with_extra_week=1) - .year_has_extra_week(datetime(2005, 4, 2))) + assert makeFY5253LastOfMonthQuarter( + 1, startingMonth=12, weekday=WeekDay.SAT, + qtr_with_extra_week=1).year_has_extra_week(datetime(2005, 4, 2)) - self.assertTrue( - makeFY5253LastOfMonthQuarter( - 1, startingMonth=12, weekday=WeekDay.SAT, - qtr_with_extra_week=1) - .year_has_extra_week(datetime(2000, 4, 2))) + assert makeFY5253LastOfMonthQuarter( + 1, startingMonth=12, weekday=WeekDay.SAT, + qtr_with_extra_week=1).year_has_extra_week(datetime(2000, 4, 2)) - self.assertTrue( - makeFY5253LastOfMonthQuarter( - 1, startingMonth=12, weekday=WeekDay.SAT, - qtr_with_extra_week=1) - .year_has_extra_week(datetime(1994, 4, 2))) + assert makeFY5253LastOfMonthQuarter( + 1, startingMonth=12, weekday=WeekDay.SAT, + qtr_with_extra_week=1).year_has_extra_week(datetime(1994, 4, 2)) def test_get_weeks(self): sat_dec_1 = makeFY5253LastOfMonthQuarter(1, startingMonth=12, @@ -3719,15 +3668,13 @@ def test_get_weeks(self): weekday=WeekDay.SAT, qtr_with_extra_week=4) - 
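# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): in a 53-week
# fiscal year exactly one quarter (qtr_with_extra_week) carries 14 weeks.
# Assuming the makeFY5253LastOfMonthQuarter helper wraps FY5253Quarter with
# variation="last", as its name suggests:
from datetime import datetime
from pandas.tseries.offsets import FY5253Quarter

q = FY5253Quarter(1, startingMonth=12, weekday=5,     # Saturday
                  qtr_with_extra_week=1, variation="last")
assert q.year_has_extra_week(datetime(2011, 4, 2))    # a long year
assert q.get_weeks(datetime(2011, 4, 2)) == [14, 13, 13, 13]
# ---------------------------------------------------------------------------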
self.assertEqual(sat_dec_1.get_weeks( - datetime(2011, 4, 2)), [14, 13, 13, 13]) - self.assertEqual(sat_dec_4.get_weeks( - datetime(2011, 4, 2)), [13, 13, 13, 14]) - self.assertEqual(sat_dec_1.get_weeks( - datetime(2010, 12, 25)), [13, 13, 13, 13]) + assert sat_dec_1.get_weeks(datetime(2011, 4, 2)) == [14, 13, 13, 13] + assert sat_dec_4.get_weeks(datetime(2011, 4, 2)) == [13, 13, 13, 14] + assert sat_dec_1.get_weeks(datetime(2010, 12, 25)) == [13, 13, 13, 13] class TestFY5253NearestEndMonthQuarter(Base): + def test_onOffset(self): offset_nem_sat_aug_4 = makeFY5253NearestEndMonthQuarter( @@ -3813,18 +3760,19 @@ def test_offset(self): class TestQuarterBegin(Base): + def test_repr(self): - self.assertEqual(repr(QuarterBegin()), - "") - self.assertEqual(repr(QuarterBegin(startingMonth=3)), - "") - self.assertEqual(repr(QuarterBegin(startingMonth=1)), - "") + assert (repr(QuarterBegin()) == + "") + assert (repr(QuarterBegin(startingMonth=3)) == + "") + assert (repr(QuarterBegin(startingMonth=1)) == + "") def test_isAnchored(self): - self.assertTrue(QuarterBegin(startingMonth=1).isAnchored()) - self.assertTrue(QuarterBegin().isAnchored()) - self.assertFalse(QuarterBegin(2, startingMonth=1).isAnchored()) + assert QuarterBegin(startingMonth=1).isAnchored() + assert QuarterBegin().isAnchored() + assert not QuarterBegin(2, startingMonth=1).isAnchored() def test_offset(self): tests = [] @@ -3886,23 +3834,24 @@ def test_offset(self): # corner offset = QuarterBegin(n=-1, startingMonth=1) - self.assertEqual(datetime(2010, 2, 1) + offset, datetime(2010, 1, 1)) + assert datetime(2010, 2, 1) + offset == datetime(2010, 1, 1) class TestQuarterEnd(Base): _offset = QuarterEnd def test_repr(self): - self.assertEqual(repr(QuarterEnd()), "") - self.assertEqual(repr(QuarterEnd(startingMonth=3)), - "") - self.assertEqual(repr(QuarterEnd(startingMonth=1)), - "") + assert (repr(QuarterEnd()) == + "") + assert (repr(QuarterEnd(startingMonth=3)) == + "") + assert (repr(QuarterEnd(startingMonth=1)) == + "") def test_isAnchored(self): - self.assertTrue(QuarterEnd(startingMonth=1).isAnchored()) - self.assertTrue(QuarterEnd().isAnchored()) - self.assertFalse(QuarterEnd(2, startingMonth=1).isAnchored()) + assert QuarterEnd(startingMonth=1).isAnchored() + assert QuarterEnd().isAnchored() + assert not QuarterEnd(2, startingMonth=1).isAnchored() def test_offset(self): tests = [] @@ -3963,7 +3912,7 @@ def test_offset(self): # corner offset = QuarterEnd(n=-1, startingMonth=1) - self.assertEqual(datetime(2010, 2, 1) + offset, datetime(2010, 1, 31)) + assert datetime(2010, 2, 1) + offset == datetime(2010, 1, 31) def test_onOffset(self): @@ -4031,8 +3980,8 @@ class TestBYearBegin(Base): _offset = BYearBegin def test_misspecified(self): - self.assertRaises(ValueError, BYearBegin, month=13) - self.assertRaises(ValueError, BYearEnd, month=13) + pytest.raises(ValueError, BYearBegin, month=13) + pytest.raises(ValueError, BYearEnd, month=13) def test_offset(self): tests = [] @@ -4077,7 +4026,7 @@ class TestYearBegin(Base): _offset = YearBegin def test_misspecified(self): - self.assertRaises(ValueError, YearBegin, month=13) + pytest.raises(ValueError, YearBegin, month=13) def test_offset(self): tests = [] @@ -4167,9 +4116,10 @@ def test_onOffset(self): class TestBYearEndLagged(Base): + def test_bad_month_fail(self): - self.assertRaises(Exception, BYearEnd, month=13) - self.assertRaises(Exception, BYearEnd, month=0) + pytest.raises(Exception, BYearEnd, month=13) + pytest.raises(Exception, BYearEnd, month=0) def test_offset(self): tests = 
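# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): with n=-1 an
# anchored offset steps back to the previous anchor; startingMonth=1 puts
# quarter boundaries at Jan/Apr/Jul/Oct (corner cases verbatim from above).
from datetime import datetime
from pandas.tseries.offsets import QuarterBegin, QuarterEnd

assert (datetime(2010, 2, 1) + QuarterBegin(n=-1, startingMonth=1) ==
        datetime(2010, 1, 1))
assert (datetime(2010, 2, 1) + QuarterEnd(n=-1, startingMonth=1) ==
        datetime(2010, 1, 31))
# ---------------------------------------------------------------------------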
[] @@ -4184,14 +4134,14 @@ def test_offset(self): for offset, cases in tests: for base, expected in compat.iteritems(cases): - self.assertEqual(base + offset, expected) + assert base + offset == expected def test_roll(self): offset = BYearEnd(month=6) date = datetime(2009, 11, 30) - self.assertEqual(offset.rollforward(date), datetime(2010, 6, 30)) - self.assertEqual(offset.rollback(date), datetime(2009, 6, 30)) + assert offset.rollforward(date) == datetime(2010, 6, 30) + assert offset.rollback(date) == datetime(2009, 6, 30) def test_onOffset(self): @@ -4257,7 +4207,7 @@ class TestYearEnd(Base): _offset = YearEnd def test_misspecified(self): - self.assertRaises(ValueError, YearEnd, month=13) + pytest.raises(ValueError, YearEnd, month=13) def test_offset(self): tests = [] @@ -4306,6 +4256,7 @@ def test_onOffset(self): class TestYearEndDiffMonth(Base): + def test_offset(self): tests = [] @@ -4383,7 +4334,7 @@ def test_Easter(): assertEq(-Easter(2), datetime(2010, 4, 4), datetime(2008, 3, 23)) -class TestTicks(tm.TestCase): +class TestTicks(object): ticks = [Hour, Minute, Second, Milli, Micro, Nano] @@ -4398,8 +4349,8 @@ def test_ticks(self): for kls, expected in offsets: offset = kls(3) result = offset + Timedelta(hours=2) - self.assertTrue(isinstance(result, Timedelta)) - self.assertEqual(result, expected) + assert isinstance(result, Timedelta) + assert result == expected def test_Hour(self): assertEq(Hour(), datetime(2010, 1, 1), datetime(2010, 1, 1, 1)) @@ -4407,10 +4358,10 @@ def test_Hour(self): assertEq(2 * Hour(), datetime(2010, 1, 1), datetime(2010, 1, 1, 2)) assertEq(-1 * Hour(), datetime(2010, 1, 1, 1), datetime(2010, 1, 1)) - self.assertEqual(Hour(3) + Hour(2), Hour(5)) - self.assertEqual(Hour(3) - Hour(2), Hour()) + assert Hour(3) + Hour(2) == Hour(5) + assert Hour(3) - Hour(2) == Hour() - self.assertNotEqual(Hour(4), Hour(1)) + assert Hour(4) != Hour(1) def test_Minute(self): assertEq(Minute(), datetime(2010, 1, 1), datetime(2010, 1, 1, 0, 1)) @@ -4420,9 +4371,9 @@ def test_Minute(self): assertEq(-1 * Minute(), datetime(2010, 1, 1, 0, 1), datetime(2010, 1, 1)) - self.assertEqual(Minute(3) + Minute(2), Minute(5)) - self.assertEqual(Minute(3) - Minute(2), Minute()) - self.assertNotEqual(Minute(5), Minute()) + assert Minute(3) + Minute(2) == Minute(5) + assert Minute(3) - Minute(2) == Minute() + assert Minute(5) != Minute() def test_Second(self): assertEq(Second(), datetime(2010, 1, 1), datetime(2010, 1, 1, 0, 0, 1)) @@ -4433,8 +4384,8 @@ def test_Second(self): assertEq(-1 * Second(), datetime(2010, 1, 1, 0, 0, 1), datetime(2010, 1, 1)) - self.assertEqual(Second(3) + Second(2), Second(5)) - self.assertEqual(Second(3) - Second(2), Second()) + assert Second(3) + Second(2) == Second(5) + assert Second(3) - Second(2) == Second() def test_Millisecond(self): assertEq(Milli(), datetime(2010, 1, 1), @@ -4448,8 +4399,8 @@ def test_Millisecond(self): assertEq(-1 * Milli(), datetime(2010, 1, 1, 0, 0, 0, 1000), datetime(2010, 1, 1)) - self.assertEqual(Milli(3) + Milli(2), Milli(5)) - self.assertEqual(Milli(3) - Milli(2), Milli()) + assert Milli(3) + Milli(2) == Milli(5) + assert Milli(3) - Milli(2) == Milli() def test_MillisecondTimestampArithmetic(self): assertEq(Milli(), Timestamp('2010-01-01'), @@ -4467,18 +4418,18 @@ def test_Microsecond(self): assertEq(-1 * Micro(), datetime(2010, 1, 1, 0, 0, 0, 1), datetime(2010, 1, 1)) - self.assertEqual(Micro(3) + Micro(2), Micro(5)) - self.assertEqual(Micro(3) - Micro(2), Micro()) + assert Micro(3) + Micro(2) == Micro(5) + assert Micro(3) - Micro(2) == 
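# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): tick offsets
# are fixed-duration, compose by plain arithmetic, and interoperate with
# Timedelta (assertions verbatim from the hunks above).
from pandas import Timedelta
from pandas.tseries.offsets import Hour, Minute

assert Hour(3) + Hour(2) == Hour(5)
assert Minute(3) - Minute(2) == Minute()             # Minute() == Minute(1)
assert isinstance(Hour(3) + Timedelta(hours=2), Timedelta)
# ---------------------------------------------------------------------------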
Micro() def test_NanosecondGeneric(self): timestamp = Timestamp(datetime(2010, 1, 1)) - self.assertEqual(timestamp.nanosecond, 0) + assert timestamp.nanosecond == 0 result = timestamp + Nano(10) - self.assertEqual(result.nanosecond, 10) + assert result.nanosecond == 10 reverse_result = Nano(10) + timestamp - self.assertEqual(reverse_result.nanosecond, 10) + assert reverse_result.nanosecond == 10 def test_Nanosecond(self): timestamp = Timestamp(datetime(2010, 1, 1)) @@ -4487,44 +4438,44 @@ def test_Nanosecond(self): assertEq(2 * Nano(), timestamp, timestamp + np.timedelta64(2, 'ns')) assertEq(-1 * Nano(), timestamp + np.timedelta64(1, 'ns'), timestamp) - self.assertEqual(Nano(3) + Nano(2), Nano(5)) - self.assertEqual(Nano(3) - Nano(2), Nano()) + assert Nano(3) + Nano(2) == Nano(5) + assert Nano(3) - Nano(2) == Nano() # GH9284 - self.assertEqual(Nano(1) + Nano(10), Nano(11)) - self.assertEqual(Nano(5) + Micro(1), Nano(1005)) - self.assertEqual(Micro(5) + Nano(1), Nano(5001)) + assert Nano(1) + Nano(10) == Nano(11) + assert Nano(5) + Micro(1) == Nano(1005) + assert Micro(5) + Nano(1) == Nano(5001) def test_tick_zero(self): for t1 in self.ticks: for t2 in self.ticks: - self.assertEqual(t1(0), t2(0)) - self.assertEqual(t1(0) + t2(0), t1(0)) + assert t1(0) == t2(0) + assert t1(0) + t2(0) == t1(0) if t1 is not Nano: - self.assertEqual(t1(2) + t2(0), t1(2)) + assert t1(2) + t2(0) == t1(2) if t1 is Nano: - self.assertEqual(t1(2) + Nano(0), t1(2)) + assert t1(2) + Nano(0) == t1(2) def test_tick_equalities(self): for t in self.ticks: - self.assertEqual(t(3), t(3)) - self.assertEqual(t(), t(1)) + assert t(3) == t(3) + assert t() == t(1) # not equals - self.assertNotEqual(t(3), t(2)) - self.assertNotEqual(t(3), t(-3)) + assert t(3) != t(2) + assert t(3) != t(-3) def test_tick_operators(self): for t in self.ticks: - self.assertEqual(t(3) + t(2), t(5)) - self.assertEqual(t(3) - t(2), t(1)) - self.assertEqual(t(800) + t(300), t(1100)) - self.assertEqual(t(1000) - t(5), t(995)) + assert t(3) + t(2) == t(5) + assert t(3) - t(2) == t(1) + assert t(800) + t(300) == t(1100) + assert t(1000) - t(5) == t(995) def test_tick_offset(self): for t in self.ticks: - self.assertFalse(t().isAnchored()) + assert not t().isAnchored() def test_compare_ticks(self): for kls in self.ticks: @@ -4532,41 +4483,39 @@ def test_compare_ticks(self): four = kls(4) for _ in range(10): - self.assertTrue(three < kls(4)) - self.assertTrue(kls(3) < four) - self.assertTrue(four > kls(3)) - self.assertTrue(kls(4) > three) - self.assertTrue(kls(3) == kls(3)) - self.assertTrue(kls(3) != kls(4)) + assert three < kls(4) + assert kls(3) < four + assert four > kls(3) + assert kls(4) > three + assert kls(3) == kls(3) + assert kls(3) != kls(4) -class TestOffsetNames(tm.TestCase): +class TestOffsetNames(object): + def test_get_offset_name(self): - self.assertEqual(BDay().freqstr, 'B') - self.assertEqual(BDay(2).freqstr, '2B') - self.assertEqual(BMonthEnd().freqstr, 'BM') - self.assertEqual(Week(weekday=0).freqstr, 'W-MON') - self.assertEqual(Week(weekday=1).freqstr, 'W-TUE') - self.assertEqual(Week(weekday=2).freqstr, 'W-WED') - self.assertEqual(Week(weekday=3).freqstr, 'W-THU') - self.assertEqual(Week(weekday=4).freqstr, 'W-FRI') - - self.assertEqual(LastWeekOfMonth( - weekday=WeekDay.SUN).freqstr, "LWOM-SUN") - self.assertEqual( - makeFY5253LastOfMonthQuarter(weekday=1, startingMonth=3, - qtr_with_extra_week=4).freqstr, - "REQ-L-MAR-TUE-4") - self.assertEqual( - makeFY5253NearestEndMonthQuarter(weekday=1, startingMonth=3, - 
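# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): mixed-unit
# tick arithmetic normalizes to the finer unit (GH 9284, verbatim above).
from pandas.tseries.offsets import Micro, Nano

assert Nano(5) + Micro(1) == Nano(1005)
assert Micro(5) + Nano(1) == Nano(5001)
# ---------------------------------------------------------------------------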
qtr_with_extra_week=3).freqstr, - "REQ-N-MAR-TUE-3") + assert BDay().freqstr == 'B' + assert BDay(2).freqstr == '2B' + assert BMonthEnd().freqstr == 'BM' + assert Week(weekday=0).freqstr == 'W-MON' + assert Week(weekday=1).freqstr == 'W-TUE' + assert Week(weekday=2).freqstr == 'W-WED' + assert Week(weekday=3).freqstr == 'W-THU' + assert Week(weekday=4).freqstr == 'W-FRI' + + assert LastWeekOfMonth(weekday=WeekDay.SUN).freqstr == "LWOM-SUN" + assert (makeFY5253LastOfMonthQuarter( + weekday=1, startingMonth=3, + qtr_with_extra_week=4).freqstr == "REQ-L-MAR-TUE-4") + assert (makeFY5253NearestEndMonthQuarter( + weekday=1, startingMonth=3, + qtr_with_extra_week=3).freqstr == "REQ-N-MAR-TUE-3") def test_get_offset(): - with tm.assertRaisesRegexp(ValueError, _INVALID_FREQ_ERROR): + with tm.assert_raises_regex(ValueError, _INVALID_FREQ_ERROR): get_offset('gibberish') - with tm.assertRaisesRegexp(ValueError, _INVALID_FREQ_ERROR): + with tm.assert_raises_regex(ValueError, _INVALID_FREQ_ERROR): get_offset('QS-JAN-B') pairs = [ @@ -4594,17 +4543,18 @@ def test_get_offset(): def test_get_offset_legacy(): pairs = [('w@Sat', Week(weekday=5))] for name, expected in pairs: - with tm.assertRaisesRegexp(ValueError, _INVALID_FREQ_ERROR): + with tm.assert_raises_regex(ValueError, _INVALID_FREQ_ERROR): get_offset(name) -class TestParseTimeString(tm.TestCase): +class TestParseTimeString(object): + def test_parse_time_string(self): (date, parsed, reso) = parse_time_string('4Q1984') (date_lower, parsed_lower, reso_lower) = parse_time_string('4q1984') - self.assertEqual(date, date_lower) - self.assertEqual(parsed, parsed_lower) - self.assertEqual(reso, reso_lower) + assert date == date_lower + assert parsed == parsed_lower + assert reso == reso_lower def test_parse_time_quarter_w_dash(self): # https://github.com/pandas-dev/pandas/issue/9688 @@ -4614,13 +4564,13 @@ def test_parse_time_quarter_w_dash(self): (date_dash, parsed_dash, reso_dash) = parse_time_string(dashed) (date, parsed, reso) = parse_time_string(normal) - self.assertEqual(date_dash, date) - self.assertEqual(parsed_dash, parsed) - self.assertEqual(reso_dash, reso) + assert date_dash == date + assert parsed_dash == parsed + assert reso_dash == reso - self.assertRaises(DateParseError, parse_time_string, "-2Q1992") - self.assertRaises(DateParseError, parse_time_string, "2-Q1992") - self.assertRaises(DateParseError, parse_time_string, "4-4Q1992") + pytest.raises(DateParseError, parse_time_string, "-2Q1992") + pytest.raises(DateParseError, parse_time_string, "2-Q1992") + pytest.raises(DateParseError, parse_time_string, "4-4Q1992") def test_get_standard_freq(): @@ -4633,7 +4583,7 @@ def test_get_standard_freq(): with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): assert fstr == get_standard_freq(('W', 1)) - with tm.assertRaisesRegexp(ValueError, _INVALID_FREQ_ERROR): + with tm.assert_raises_regex(ValueError, _INVALID_FREQ_ERROR): with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): get_standard_freq('WeEk') @@ -4642,7 +4592,7 @@ def test_get_standard_freq(): with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): assert fstr == get_standard_freq('5q') - with tm.assertRaisesRegexp(ValueError, _INVALID_FREQ_ERROR): + with tm.assert_raises_regex(ValueError, _INVALID_FREQ_ERROR): with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): get_standard_freq('5QuarTer') @@ -4660,30 +4610,31 @@ def test_quarterly_dont_normalize(): assert (result.time() == date.time()) -class 
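# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): freqstr is
# the alias an offset round-trips through the frequency machinery
# (assertions verbatim from the hunk above).
from pandas.tseries.offsets import BDay, Week

assert BDay(2).freqstr == '2B'
assert Week(weekday=0).freqstr == 'W-MON'
# ---------------------------------------------------------------------------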
TestOffsetAliases(tm.TestCase): - def setUp(self): +class TestOffsetAliases(object): + + def setup_method(self, method): _offset_map.clear() def test_alias_equality(self): for k, v in compat.iteritems(_offset_map): if v is None: continue - self.assertEqual(k, v.copy()) + assert k == v.copy() def test_rule_code(self): lst = ['M', 'MS', 'BM', 'BMS', 'D', 'B', 'H', 'T', 'S', 'L', 'U'] for k in lst: - self.assertEqual(k, get_offset(k).rule_code) + assert k == get_offset(k).rule_code # should be cached - this is kind of an internals test... assert k in _offset_map - self.assertEqual(k, (get_offset(k) * 3).rule_code) + assert k == (get_offset(k) * 3).rule_code suffix_lst = ['MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT', 'SUN'] base = 'W' for v in suffix_lst: alias = '-'.join([base, v]) - self.assertEqual(alias, get_offset(alias).rule_code) - self.assertEqual(alias, (get_offset(alias) * 5).rule_code) + assert alias == get_offset(alias).rule_code + assert alias == (get_offset(alias) * 5).rule_code suffix_lst = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC'] @@ -4691,15 +4642,15 @@ def test_rule_code(self): for base in base_lst: for v in suffix_lst: alias = '-'.join([base, v]) - self.assertEqual(alias, get_offset(alias).rule_code) - self.assertEqual(alias, (get_offset(alias) * 5).rule_code) + assert alias == get_offset(alias).rule_code + assert alias == (get_offset(alias) * 5).rule_code lst = ['M', 'D', 'B', 'H', 'T', 'S', 'L', 'U'] for k in lst: code, stride = get_freq_code('3' + k) - self.assertTrue(isinstance(code, int)) - self.assertEqual(stride, 3) - self.assertEqual(k, _get_freq_str(code)) + assert isinstance(code, int) + assert stride == 3 + assert k == _get_freq_str(code) def test_apply_ticks(): @@ -4740,35 +4691,35 @@ def get_all_subclasses(cls): return ret -class TestCaching(tm.TestCase): +class TestCaching(object): # as of GH 6479 (in 0.14.0), offset caching is turned off # as of v0.12.0 only BusinessMonth/Quarter were actually caching - def setUp(self): + def setup_method(self, method): _daterange_cache.clear() _offset_map.clear() def run_X_index_creation(self, cls): inst1 = cls() if not inst1.isAnchored(): - self.assertFalse(inst1._should_cache(), cls) + assert not inst1._should_cache(), cls return - self.assertTrue(inst1._should_cache(), cls) + assert inst1._should_cache(), cls DatetimeIndex(start=datetime(2013, 1, 31), end=datetime(2013, 3, 31), freq=inst1, normalize=True) - self.assertTrue(cls() in _daterange_cache, cls) + assert cls() in _daterange_cache, cls def test_should_cache_month_end(self): - self.assertFalse(MonthEnd()._should_cache()) + assert not MonthEnd()._should_cache() def test_should_cache_bmonth_end(self): - self.assertFalse(BusinessMonthEnd()._should_cache()) + assert not BusinessMonthEnd()._should_cache() def test_should_cache_week_month(self): - self.assertFalse(WeekOfMonth(weekday=1, week=2)._should_cache()) + assert not WeekOfMonth(weekday=1, week=2)._should_cache() def test_all_cacheableoffsets(self): for subclass in get_all_subclasses(CacheableOffset): @@ -4780,22 +4731,23 @@ def test_all_cacheableoffsets(self): def test_month_end_index_creation(self): DatetimeIndex(start=datetime(2013, 1, 31), end=datetime(2013, 3, 31), freq=MonthEnd(), normalize=True) - self.assertFalse(MonthEnd() in _daterange_cache) + assert not MonthEnd() in _daterange_cache def test_bmonth_end_index_creation(self): DatetimeIndex(start=datetime(2013, 1, 31), end=datetime(2013, 3, 29), freq=BusinessMonthEnd(), normalize=True) - 
self.assertFalse(BusinessMonthEnd() in _daterange_cache) + assert not BusinessMonthEnd() in _daterange_cache def test_week_of_month_index_creation(self): inst1 = WeekOfMonth(weekday=1, week=2) DatetimeIndex(start=datetime(2013, 1, 31), end=datetime(2013, 3, 29), freq=inst1, normalize=True) inst2 = WeekOfMonth(weekday=1, week=2) - self.assertFalse(inst2 in _daterange_cache) + assert inst2 not in _daterange_cache + +class TestReprNames(object): -class TestReprNames(tm.TestCase): def test_str_for_named_is_name(self): # look at all the amazing combinations! month_prefixes = ['A', 'AS', 'BA', 'BAS', 'Q', 'BQ', 'BQS', 'QS'] @@ -4810,7 +4762,7 @@ def test_str_for_named_is_name(self): _offset_map.clear() for name in names: offset = get_offset(name) - self.assertEqual(offset.freqstr, name) + assert offset.freqstr == name def get_utc_offset_hours(ts): @@ -4819,7 +4771,7 @@ def get_utc_offset_hours(ts): return (o.days * 24 * 3600 + o.seconds) / 3600.0 -class TestDST(tm.TestCase): +class TestDST(object): """ test DateOffset additions over Daylight Savings Time """ @@ -4855,34 +4807,34 @@ def _test_offset(self, offset_name, offset_n, tstart, expected_utc_offset): t = tstart + offset if expected_utc_offset is not None: - self.assertTrue(get_utc_offset_hours(t) == expected_utc_offset) + assert get_utc_offset_hours(t) == expected_utc_offset if offset_name == 'weeks': # dates should match - self.assertTrue(t.date() == timedelta(days=7 * offset.kwds[ - 'weeks']) + tstart.date()) + assert t.date() == timedelta(days=7 * offset.kwds[ + 'weeks']) + tstart.date() # expect the same day of week, hour of day, minute, second, ... - self.assertTrue(t.dayofweek == tstart.dayofweek and t.hour == - tstart.hour and t.minute == tstart.minute and - t.second == tstart.second) + assert (t.dayofweek == tstart.dayofweek and + t.hour == tstart.hour and + t.minute == tstart.minute and + t.second == tstart.second) elif offset_name == 'days': # dates should match - self.assertTrue(timedelta(offset.kwds['days']) + tstart.date() == - t.date()) + assert timedelta(offset.kwds['days']) + tstart.date() == t.date() # expect the same hour of day, minute, second, ... 
- self.assertTrue(t.hour == tstart.hour and - t.minute == tstart.minute and - t.second == tstart.second) + assert (t.hour == tstart.hour and + t.minute == tstart.minute and + t.second == tstart.second) elif offset_name in self.valid_date_offsets_singular: # expect the signular offset value to match between tstart and t datepart_offset = getattr(t, offset_name if offset_name != 'weekday' else 'dayofweek') - self.assertTrue(datepart_offset == offset.kwds[offset_name]) + assert datepart_offset == offset.kwds[offset_name] else: # the offset should be the same as if it was done in UTC - self.assertTrue(t == (tstart.tz_convert('UTC') + offset - ).tz_convert('US/Pacific')) + assert (t == (tstart.tz_convert('UTC') + offset) + .tz_convert('US/Pacific')) def _make_timestamp(self, string, hrs_offset, tz): if hrs_offset >= 0: @@ -4955,9 +4907,4 @@ def test_all_offset_classes(self): for offset, test_values in iteritems(tests): first = Timestamp(test_values[0], tz='US/Eastern') + offset() second = Timestamp(test_values[1], tz='US/Eastern') - self.assertEqual(first, second, msg=str(offset)) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + assert first == second diff --git a/pandas/tseries/tests/test_timezones.py b/pandas/tests/tseries/test_timezones.py similarity index 73% rename from pandas/tseries/tests/test_timezones.py rename to pandas/tests/tseries/test_timezones.py index db8cda5c76479..97c54922d36e9 100644 --- a/pandas/tseries/tests/test_timezones.py +++ b/pandas/tests/tseries/test_timezones.py @@ -1,25 +1,22 @@ # pylint: disable-msg=E1101,W0612 -from datetime import datetime, timedelta, tzinfo, date -import nose - -import numpy as np +import pytest import pytz +import numpy as np from distutils.version import LooseVersion -from pandas.types.dtypes import DatetimeTZDtype -from pandas import (Index, Series, DataFrame, isnull, Timestamp) - -from pandas import DatetimeIndex, to_datetime, NaT -from pandas import tslib - -import pandas.tseries.offsets as offsets -from pandas.tseries.index import bdate_range, date_range -import pandas.tseries.tools as tools +from datetime import datetime, timedelta, tzinfo, date from pytz import NonExistentTimeError import pandas.util.testing as tm +import pandas.core.tools.datetimes as tools +import pandas.tseries.offsets as offsets +from pandas.compat import lrange, zip +from pandas.core.indexes.datetimes import bdate_range, date_range +from pandas.core.dtypes.dtypes import DatetimeTZDtype +from pandas._libs import tslib +from pandas import (Index, Series, DataFrame, isnull, Timestamp, NaT, + DatetimeIndex, to_datetime) from pandas.util.testing import (assert_frame_equal, assert_series_equal, set_timezone) -from pandas.compat import lrange, zip try: import pytz # noqa @@ -53,10 +50,9 @@ def dst(self, dt): fixed_off_no_name = FixedOffset(-330, None) -class TestTimeZoneSupportPytz(tm.TestCase): - _multiprocess_can_split_ = True +class TestTimeZoneSupportPytz(object): - def setUp(self): + def setup_method(self, method): tm._skip_if_no_pytz() def tz(self, tz): @@ -82,18 +78,18 @@ def test_utc_to_local_no_modify(self): rng_eastern = rng.tz_convert(self.tzstr('US/Eastern')) # Values are unmodified - self.assertTrue(np.array_equal(rng.asi8, rng_eastern.asi8)) + assert np.array_equal(rng.asi8, rng_eastern.asi8) - self.assertTrue(self.cmptz(rng_eastern.tz, self.tz('US/Eastern'))) + assert self.cmptz(rng_eastern.tz, self.tz('US/Eastern')) def test_utc_to_local_no_modify_explicit(self): rng = 
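# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): tz_convert
# only relabels the wall time -- the underlying UTC epoch values (asi8) are
# untouched (verbatim from the first test_timezones hunk).
import numpy as np
from pandas import date_range

rng = date_range('3/11/2012', '3/12/2012', freq='H', tz='utc')
rng_eastern = rng.tz_convert('US/Eastern')
assert np.array_equal(rng.asi8, rng_eastern.asi8)
# ---------------------------------------------------------------------------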
date_range('3/11/2012', '3/12/2012', freq='H', tz='utc') rng_eastern = rng.tz_convert(self.tz('US/Eastern')) # Values are unmodified - self.assert_numpy_array_equal(rng.asi8, rng_eastern.asi8) + tm.assert_numpy_array_equal(rng.asi8, rng_eastern.asi8) - self.assertEqual(rng_eastern.tz, self.tz('US/Eastern')) + assert rng_eastern.tz == self.tz('US/Eastern') def test_localize_utc_conversion(self): # Localizing to time zone should: @@ -104,13 +100,13 @@ def test_localize_utc_conversion(self): converted = rng.tz_localize(self.tzstr('US/Eastern')) expected_naive = rng + offsets.Hour(5) - self.assert_numpy_array_equal(converted.asi8, expected_naive.asi8) + tm.assert_numpy_array_equal(converted.asi8, expected_naive.asi8) # DST ambiguity, this should fail rng = date_range('3/11/2012', '3/12/2012', freq='30T') # Is this really how it should fail?? - self.assertRaises(NonExistentTimeError, rng.tz_localize, - self.tzstr('US/Eastern')) + pytest.raises(NonExistentTimeError, rng.tz_localize, + self.tzstr('US/Eastern')) def test_localize_utc_conversion_explicit(self): # Localizing to time zone should: @@ -120,29 +116,29 @@ def test_localize_utc_conversion_explicit(self): rng = date_range('3/10/2012', '3/11/2012', freq='30T') converted = rng.tz_localize(self.tz('US/Eastern')) expected_naive = rng + offsets.Hour(5) - self.assertTrue(np.array_equal(converted.asi8, expected_naive.asi8)) + assert np.array_equal(converted.asi8, expected_naive.asi8) # DST ambiguity, this should fail rng = date_range('3/11/2012', '3/12/2012', freq='30T') # Is this really how it should fail?? - self.assertRaises(NonExistentTimeError, rng.tz_localize, - self.tz('US/Eastern')) + pytest.raises(NonExistentTimeError, rng.tz_localize, + self.tz('US/Eastern')) def test_timestamp_tz_localize(self): stamp = Timestamp('3/11/2012 04:00') result = stamp.tz_localize(self.tzstr('US/Eastern')) expected = Timestamp('3/11/2012 04:00', tz=self.tzstr('US/Eastern')) - self.assertEqual(result.hour, expected.hour) - self.assertEqual(result, expected) + assert result.hour == expected.hour + assert result == expected def test_timestamp_tz_localize_explicit(self): stamp = Timestamp('3/11/2012 04:00') result = stamp.tz_localize(self.tz('US/Eastern')) expected = Timestamp('3/11/2012 04:00', tz=self.tz('US/Eastern')) - self.assertEqual(result.hour, expected.hour) - self.assertEqual(result, expected) + assert result.hour == expected.hour + assert result == expected def test_timestamp_constructed_by_date_and_tz(self): # Fix Issue 2993, Timestamp cannot be constructed by datetime.date @@ -151,8 +147,8 @@ def test_timestamp_constructed_by_date_and_tz(self): result = Timestamp(date(2012, 3, 11), tz=self.tzstr('US/Eastern')) expected = Timestamp('3/11/2012', tz=self.tzstr('US/Eastern')) - self.assertEqual(result.hour, expected.hour) - self.assertEqual(result, expected) + assert result.hour == expected.hour + assert result == expected def test_timestamp_constructed_by_date_and_tz_explicit(self): # Fix Issue 2993, Timestamp cannot be constructed by datetime.date @@ -161,8 +157,54 @@ def test_timestamp_constructed_by_date_and_tz_explicit(self): result = Timestamp(date(2012, 3, 11), tz=self.tz('US/Eastern')) expected = Timestamp('3/11/2012', tz=self.tz('US/Eastern')) - self.assertEqual(result.hour, expected.hour) - self.assertEqual(result, expected) + assert result.hour == expected.hour + assert result == expected + + def test_timestamp_constructor_near_dst_boundary(self): + # GH 11481 & 15777 + # Naive string timestamps were being localized incorrectly + # with 
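# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): tz_localize
# attaches a zone without shifting the wall time (verbatim from above).
from pandas import Timestamp

result = Timestamp('3/11/2012 04:00').tz_localize('US/Eastern')
expected = Timestamp('3/11/2012 04:00', tz='US/Eastern')
assert result == expected and result.hour == expected.hour
# ---------------------------------------------------------------------------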
tz_convert_single instead of tz_localize_to_utc + + for tz in ['Europe/Brussels', 'Europe/Prague']: + result = Timestamp('2015-10-25 01:00', tz=tz) + expected = Timestamp('2015-10-25 01:00').tz_localize(tz) + assert result == expected + + with pytest.raises(pytz.AmbiguousTimeError): + Timestamp('2015-10-25 02:00', tz=tz) + + result = Timestamp('2017-03-26 01:00', tz='Europe/Paris') + expected = Timestamp('2017-03-26 01:00').tz_localize('Europe/Paris') + assert result == expected + + with pytest.raises(pytz.NonExistentTimeError): + Timestamp('2017-03-26 02:00', tz='Europe/Paris') + + # GH 11708 + result = to_datetime("2015-11-18 15:30:00+05:30").tz_localize( + 'UTC').tz_convert('Asia/Kolkata') + expected = Timestamp('2015-11-18 15:30:00+0530', tz='Asia/Kolkata') + assert result == expected + + # GH 15823 + result = Timestamp('2017-03-26 00:00', tz='Europe/Paris') + expected = Timestamp('2017-03-26 00:00:00+0100', tz='Europe/Paris') + assert result == expected + + result = Timestamp('2017-03-26 01:00', tz='Europe/Paris') + expected = Timestamp('2017-03-26 01:00:00+0100', tz='Europe/Paris') + assert result == expected + + with pytest.raises(pytz.NonExistentTimeError): + Timestamp('2017-03-26 02:00', tz='Europe/Paris') + result = Timestamp('2017-03-26 02:00:00+0100', tz='Europe/Paris') + expected = Timestamp(result.value).tz_localize( + 'UTC').tz_convert('Europe/Paris') + assert result == expected + + result = Timestamp('2017-03-26 03:00', tz='Europe/Paris') + expected = Timestamp('2017-03-26 03:00:00+0200', tz='Europe/Paris') + assert result == expected def test_timestamp_to_datetime_tzoffset(self): # tzoffset @@ -170,7 +212,7 @@ def test_timestamp_to_datetime_tzoffset(self): tzinfo = tzoffset(None, 7200) expected = Timestamp('3/11/2012 04:00', tz=tzinfo) result = Timestamp(expected.to_pydatetime()) - self.assertEqual(expected, result) + assert expected == result def test_timedelta_push_over_dst_boundary(self): # #1389 @@ -183,7 +225,7 @@ def test_timedelta_push_over_dst_boundary(self): # spring forward, + "7" hours expected = Timestamp('3/11/2012 05:00', tz=self.tzstr('US/Eastern')) - self.assertEqual(result, expected) + assert result == expected def test_timedelta_push_over_dst_boundary_explicit(self): # #1389 @@ -196,7 +238,7 @@ def test_timedelta_push_over_dst_boundary_explicit(self): # spring forward, + "7" hours expected = Timestamp('3/11/2012 05:00', tz=self.tz('US/Eastern')) - self.assertEqual(result, expected) + assert result == expected def test_tz_localize_dti(self): dti = DatetimeIndex(start='1/1/2005', end='1/1/2005 0:00:30.256', @@ -206,20 +248,20 @@ def test_tz_localize_dti(self): dti_utc = DatetimeIndex(start='1/1/2005 05:00', end='1/1/2005 5:00:30.256', freq='L', tz='utc') - self.assert_numpy_array_equal(dti2.values, dti_utc.values) + tm.assert_numpy_array_equal(dti2.values, dti_utc.values) dti3 = dti2.tz_convert(self.tzstr('US/Pacific')) - self.assert_numpy_array_equal(dti3.values, dti_utc.values) + tm.assert_numpy_array_equal(dti3.values, dti_utc.values) dti = DatetimeIndex(start='11/6/2011 1:59', end='11/6/2011 2:00', freq='L') - self.assertRaises(pytz.AmbiguousTimeError, dti.tz_localize, - self.tzstr('US/Eastern')) + pytest.raises(pytz.AmbiguousTimeError, dti.tz_localize, + self.tzstr('US/Eastern')) dti = DatetimeIndex(start='3/13/2011 1:59', end='3/13/2011 2:00', freq='L') - self.assertRaises(pytz.NonExistentTimeError, dti.tz_localize, - self.tzstr('US/Eastern')) + pytest.raises(pytz.NonExistentTimeError, dti.tz_localize, + self.tzstr('US/Eastern')) def 
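# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): around a
# fall-back transition the 02:00 wall time occurs twice, so constructing it
# with a tz is ambiguous and raises (verbatim from the GH 11481/15777 hunk).
import pytest
import pytz
from pandas import Timestamp

assert (Timestamp('2015-10-25 01:00', tz='Europe/Brussels') ==
        Timestamp('2015-10-25 01:00').tz_localize('Europe/Brussels'))
with pytest.raises(pytz.AmbiguousTimeError):
    Timestamp('2015-10-25 02:00', tz='Europe/Brussels')
# ---------------------------------------------------------------------------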
test_tz_localize_empty_series(self): # #2248 @@ -227,57 +269,57 @@ def test_tz_localize_empty_series(self): ts = Series() ts2 = ts.tz_localize('utc') - self.assertTrue(ts2.index.tz == pytz.utc) + assert ts2.index.tz == pytz.utc ts2 = ts.tz_localize(self.tzstr('US/Eastern')) - self.assertTrue(self.cmptz(ts2.index.tz, self.tz('US/Eastern'))) + assert self.cmptz(ts2.index.tz, self.tz('US/Eastern')) def test_astimezone(self): utc = Timestamp('3/11/2012 22:00', tz='UTC') expected = utc.tz_convert(self.tzstr('US/Eastern')) result = utc.astimezone(self.tzstr('US/Eastern')) - self.assertEqual(expected, result) - tm.assertIsInstance(result, Timestamp) + assert expected == result + assert isinstance(result, Timestamp) def test_create_with_tz(self): stamp = Timestamp('3/11/2012 05:00', tz=self.tzstr('US/Eastern')) - self.assertEqual(stamp.hour, 5) + assert stamp.hour == 5 rng = date_range('3/11/2012 04:00', periods=10, freq='H', tz=self.tzstr('US/Eastern')) - self.assertEqual(stamp, rng[1]) + assert stamp == rng[1] utc_stamp = Timestamp('3/11/2012 05:00', tz='utc') - self.assertIs(utc_stamp.tzinfo, pytz.utc) - self.assertEqual(utc_stamp.hour, 5) + assert utc_stamp.tzinfo is pytz.utc + assert utc_stamp.hour == 5 stamp = Timestamp('3/11/2012 05:00').tz_localize('utc') - self.assertEqual(utc_stamp.hour, 5) + assert utc_stamp.hour == 5 def test_create_with_fixed_tz(self): off = FixedOffset(420, '+07:00') start = datetime(2012, 3, 11, 5, 0, 0, tzinfo=off) end = datetime(2012, 6, 11, 5, 0, 0, tzinfo=off) rng = date_range(start=start, end=end) - self.assertEqual(off, rng.tz) + assert off == rng.tz rng2 = date_range(start, periods=len(rng), tz=off) - self.assert_index_equal(rng, rng2) + tm.assert_index_equal(rng, rng2) rng3 = date_range('3/11/2012 05:00:00+07:00', '6/11/2012 05:00:00+07:00') - self.assertTrue((rng.values == rng3.values).all()) + assert (rng.values == rng3.values).all() def test_create_with_fixedoffset_noname(self): off = fixed_off_no_name start = datetime(2012, 3, 11, 5, 0, 0, tzinfo=off) end = datetime(2012, 6, 11, 5, 0, 0, tzinfo=off) rng = date_range(start=start, end=end) - self.assertEqual(off, rng.tz) + assert off == rng.tz idx = Index([start, end]) - self.assertEqual(off, idx.tz) + assert off == idx.tz def test_date_range_localize(self): rng = date_range('3/11/2012 03:00', periods=15, freq='H', @@ -287,33 +329,33 @@ def test_date_range_localize(self): rng3 = date_range('3/11/2012 03:00', periods=15, freq='H') rng3 = rng3.tz_localize('US/Eastern') - self.assert_index_equal(rng, rng3) + tm.assert_index_equal(rng, rng3) # DST transition time val = rng[0] exp = Timestamp('3/11/2012 03:00', tz='US/Eastern') - self.assertEqual(val.hour, 3) - self.assertEqual(exp.hour, 3) - self.assertEqual(val, exp) # same UTC value - self.assert_index_equal(rng[:2], rng2) + assert val.hour == 3 + assert exp.hour == 3 + assert val == exp # same UTC value + tm.assert_index_equal(rng[:2], rng2) # Right before the DST transition rng = date_range('3/11/2012 00:00', periods=2, freq='H', tz='US/Eastern') rng2 = DatetimeIndex(['3/11/2012 00:00', '3/11/2012 01:00'], tz='US/Eastern') - self.assert_index_equal(rng, rng2) + tm.assert_index_equal(rng, rng2) exp = Timestamp('3/11/2012 00:00', tz='US/Eastern') - self.assertEqual(exp.hour, 0) - self.assertEqual(rng[0], exp) + assert exp.hour == 0 + assert rng[0] == exp exp = Timestamp('3/11/2012 01:00', tz='US/Eastern') - self.assertEqual(exp.hour, 1) - self.assertEqual(rng[1], exp) + assert exp.hour == 1 + assert rng[1] == exp rng = date_range('3/11/2012 00:00', 
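# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): a tz-aware
# date_range boxes its elements as tz-aware Timestamps (verbatim from the
# test_create_with_tz hunk above).
from pandas import Timestamp, date_range

stamp = Timestamp('3/11/2012 05:00', tz='US/Eastern')
rng = date_range('3/11/2012 04:00', periods=10, freq='H', tz='US/Eastern')
assert stamp == rng[1] and stamp.hour == 5
# ---------------------------------------------------------------------------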
periods=10, freq='H', tz='US/Eastern') - self.assertEqual(rng[2].hour, 3) + assert rng[2].hour == 3 def test_utc_box_timestamp_and_localize(self): rng = date_range('3/11/2012', '3/12/2012', freq='H', tz='utc') @@ -323,16 +365,16 @@ def test_utc_box_timestamp_and_localize(self): expected = rng[-1].astimezone(tz) stamp = rng_eastern[-1] - self.assertEqual(stamp, expected) - self.assertEqual(stamp.tzinfo, expected.tzinfo) + assert stamp == expected + assert stamp.tzinfo == expected.tzinfo # right tzinfo rng = date_range('3/13/2012', '3/14/2012', freq='H', tz='utc') rng_eastern = rng.tz_convert(self.tzstr('US/Eastern')) # test not valid for dateutil timezones. - # self.assertIn('EDT', repr(rng_eastern[0].tzinfo)) - self.assertTrue('EDT' in repr(rng_eastern[0].tzinfo) or 'tzfile' in - repr(rng_eastern[0].tzinfo)) + # assert 'EDT' in repr(rng_eastern[0].tzinfo) + assert ('EDT' in repr(rng_eastern[0].tzinfo) or + 'tzfile' in repr(rng_eastern[0].tzinfo)) def test_timestamp_tz_convert(self): strdates = ['1/1/2012', '3/1/2012', '4/1/2012'] @@ -341,7 +383,7 @@ def test_timestamp_tz_convert(self): conv = idx[0].tz_convert(self.tzstr('US/Pacific')) expected = idx.tz_convert(self.tzstr('US/Pacific'))[0] - self.assertEqual(conv, expected) + assert conv == expected def test_pass_dates_localize_to_utc(self): strdates = ['1/1/2012', '3/1/2012', '4/1/2012'] @@ -351,20 +393,20 @@ def test_pass_dates_localize_to_utc(self): fromdates = DatetimeIndex(strdates, tz=self.tzstr('US/Eastern')) - self.assertEqual(conv.tz, fromdates.tz) - self.assert_numpy_array_equal(conv.values, fromdates.values) + assert conv.tz == fromdates.tz + tm.assert_numpy_array_equal(conv.values, fromdates.values) def test_field_access_localize(self): strdates = ['1/1/2012', '3/1/2012', '4/1/2012'] rng = DatetimeIndex(strdates, tz=self.tzstr('US/Eastern')) - self.assertTrue((rng.hour == 0).all()) + assert (rng.hour == 0).all() # a more unusual time zone, #1946 dr = date_range('2011-10-02 00:00', freq='h', periods=10, tz=self.tzstr('America/Atikokan')) - expected = np.arange(10, dtype=np.int32) - self.assert_numpy_array_equal(dr.hour, expected) + expected = Index(np.arange(10, dtype=np.int64)) + tm.assert_index_equal(dr.hour, expected) def test_with_tz(self): tz = self.tz('US/Central') @@ -372,7 +414,7 @@ def test_with_tz(self): # just want it to work start = datetime(2011, 3, 12, tzinfo=pytz.utc) dr = bdate_range(start, periods=50, freq=offsets.Hour()) - self.assertIs(dr.tz, pytz.utc) + assert dr.tz is pytz.utc # DateRange with naive datetimes dr = bdate_range('1/1/2005', '1/1/2009', tz=pytz.utc) @@ -380,29 +422,29 @@ def test_with_tz(self): # normalized central = dr.tz_convert(tz) - self.assertIs(central.tz, tz) + assert central.tz is tz comp = self.localize(tz, central[0].to_pydatetime().replace( tzinfo=None)).tzinfo - self.assertIs(central[0].tz, comp) + assert central[0].tz is comp # compare vs a localized tz comp = self.localize(tz, dr[0].to_pydatetime().replace(tzinfo=None)).tzinfo - self.assertIs(central[0].tz, comp) + assert central[0].tz is comp # datetimes with tzinfo set dr = bdate_range(datetime(2005, 1, 1, tzinfo=pytz.utc), '1/1/2009', tz=pytz.utc) - self.assertRaises(Exception, bdate_range, - datetime(2005, 1, 1, tzinfo=pytz.utc), '1/1/2009', - tz=tz) + pytest.raises(Exception, bdate_range, + datetime(2005, 1, 1, tzinfo=pytz.utc), '1/1/2009', + tz=tz) def test_tz_localize(self): dr = bdate_range('1/1/2009', '1/1/2010') dr_utc = bdate_range('1/1/2009', '1/1/2010', tz=pytz.utc) localized = dr.tz_localize(pytz.utc) - 
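# ---------------------------------------------------------------------------
# Illustrative sketch (editor's aside, not part of this patch): crossing the
# spring-forward gap, 02:00 EST does not exist and the hourly range jumps
# straight to 03:00 EDT (verbatim from the hunk above).
from pandas import date_range

rng = date_range('3/11/2012 00:00', periods=10, freq='H', tz='US/Eastern')
assert rng[2].hour == 3
# ---------------------------------------------------------------------------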
self.assert_index_equal(dr_utc, localized) + tm.assert_index_equal(dr_utc, localized) def test_with_tz_ambiguous_times(self): tz = self.tz('US/Eastern') @@ -410,7 +452,7 @@ def test_with_tz_ambiguous_times(self): # March 13, 2011, spring forward, skip from 2 AM to 3 AM dr = date_range(datetime(2011, 3, 13, 1, 30), periods=3, freq=offsets.Hour()) - self.assertRaises(pytz.NonExistentTimeError, dr.tz_localize, tz) + pytest.raises(pytz.NonExistentTimeError, dr.tz_localize, tz) # after dst transition, it works dr = date_range(datetime(2011, 3, 13, 3, 30), periods=3, @@ -419,7 +461,7 @@ def test_with_tz_ambiguous_times(self): # November 6, 2011, fall back, repeat 2 AM hour dr = date_range(datetime(2011, 11, 6, 1, 30), periods=3, freq=offsets.Hour()) - self.assertRaises(pytz.AmbiguousTimeError, dr.tz_localize, tz) + pytest.raises(pytz.AmbiguousTimeError, dr.tz_localize, tz) # UTC is OK dr = date_range(datetime(2011, 3, 13), periods=48, @@ -431,7 +473,7 @@ def test_ambiguous_infer(self): tz = self.tz('US/Eastern') dr = date_range(datetime(2011, 11, 6, 0), periods=5, freq=offsets.Hour()) - self.assertRaises(pytz.AmbiguousTimeError, dr.tz_localize, tz) + pytest.raises(pytz.AmbiguousTimeError, dr.tz_localize, tz) # With repeated hours, we can infer the transition dr = date_range(datetime(2011, 11, 6, 0), periods=5, @@ -440,22 +482,22 @@ def test_ambiguous_infer(self): '11/06/2011 02:00', '11/06/2011 03:00'] di = DatetimeIndex(times) localized = di.tz_localize(tz, ambiguous='infer') - self.assert_index_equal(dr, localized) + tm.assert_index_equal(dr, localized) with tm.assert_produces_warning(FutureWarning): localized_old = di.tz_localize(tz, infer_dst=True) - self.assert_index_equal(dr, localized_old) - self.assert_index_equal(dr, DatetimeIndex(times, tz=tz, - ambiguous='infer')) + tm.assert_index_equal(dr, localized_old) + tm.assert_index_equal(dr, DatetimeIndex(times, tz=tz, + ambiguous='infer')) # When there is no dst transition, nothing special happens dr = date_range(datetime(2011, 6, 1, 0), periods=10, freq=offsets.Hour()) localized = dr.tz_localize(tz) localized_infer = dr.tz_localize(tz, ambiguous='infer') - self.assert_index_equal(localized, localized_infer) + tm.assert_index_equal(localized, localized_infer) with tm.assert_produces_warning(FutureWarning): localized_infer_old = dr.tz_localize(tz, infer_dst=True) - self.assert_index_equal(localized, localized_infer_old) + tm.assert_index_equal(localized, localized_infer_old) def test_ambiguous_flags(self): # November 6, 2011, fall back, repeat 2 AM hour @@ -471,33 +513,33 @@ def test_ambiguous_flags(self): di = DatetimeIndex(times) is_dst = [1, 1, 0, 0, 0] localized = di.tz_localize(tz, ambiguous=is_dst) - self.assert_index_equal(dr, localized) - self.assert_index_equal(dr, DatetimeIndex(times, tz=tz, - ambiguous=is_dst)) + tm.assert_index_equal(dr, localized) + tm.assert_index_equal(dr, DatetimeIndex(times, tz=tz, + ambiguous=is_dst)) localized = di.tz_localize(tz, ambiguous=np.array(is_dst)) - self.assert_index_equal(dr, localized) + tm.assert_index_equal(dr, localized) localized = di.tz_localize(tz, ambiguous=np.array(is_dst).astype('bool')) - self.assert_index_equal(dr, localized) + tm.assert_index_equal(dr, localized) # Test constructor localized = DatetimeIndex(times, tz=tz, ambiguous=is_dst) - self.assert_index_equal(dr, localized) + tm.assert_index_equal(dr, localized) # Test duplicate times where infer_dst fails times += times di = DatetimeIndex(times) # When the sizes are incompatible, make sure error is raised - 
self.assertRaises(Exception, di.tz_localize, tz, ambiguous=is_dst) + pytest.raises(Exception, di.tz_localize, tz, ambiguous=is_dst) # When sizes are compatible and there are repeats ('infer' won't work) is_dst = np.hstack((is_dst, is_dst)) localized = di.tz_localize(tz, ambiguous=is_dst) dr = dr.append(dr) - self.assert_index_equal(dr, localized) + tm.assert_index_equal(dr, localized) # When there is no dst transition, nothing special happens dr = date_range(datetime(2011, 6, 1, 0), periods=10, @@ -505,7 +547,7 @@ def test_ambiguous_flags(self): is_dst = np.array([1] * 10) localized = dr.tz_localize(tz) localized_is_dst = dr.tz_localize(tz, ambiguous=is_dst) - self.assert_index_equal(localized, localized_is_dst) + tm.assert_index_equal(localized, localized_is_dst) # construction with an ambiguous end-point # GH 11626 @@ -514,16 +556,16 @@ def test_ambiguous_flags(self): def f(): date_range("2013-10-26 23:00", "2013-10-27 01:00", tz="Europe/London", freq="H") - self.assertRaises(pytz.AmbiguousTimeError, f) + pytest.raises(pytz.AmbiguousTimeError, f) times = date_range("2013-10-26 23:00", "2013-10-27 01:00", freq="H", tz=tz, ambiguous='infer') - self.assertEqual(times[0], Timestamp('2013-10-26 23:00', tz=tz, - freq="H")) + assert times[0] == Timestamp('2013-10-26 23:00', tz=tz, freq="H") + if dateutil.__version__ != LooseVersion('2.6.0'): - # GH 14621 - self.assertEqual(times[-1], Timestamp('2013-10-27 01:00', tz=tz, - freq="H")) + # see gh-14621 + assert times[-1] == Timestamp('2013-10-27 01:00:00+0000', + tz=tz, freq="H") def test_ambiguous_nat(self): tz = self.tz('US/Eastern') @@ -538,7 +580,7 @@ def test_ambiguous_nat(self): # left dtype is datetime64[ns, US/Eastern] # right is datetime64[ns, tzfile('/usr/share/zoneinfo/US/Eastern')] - self.assert_numpy_array_equal(di_test.values, localized.values) + tm.assert_numpy_array_equal(di_test.values, localized.values) def test_ambiguous_bool(self): # make sure that we are correctly accepting bool values as ambiguous @@ -550,13 +592,13 @@ def test_ambiguous_bool(self): def f(): t.tz_localize('US/Central') - self.assertRaises(pytz.AmbiguousTimeError, f) + pytest.raises(pytz.AmbiguousTimeError, f) result = t.tz_localize('US/Central', ambiguous=True) - self.assertEqual(result, expected0) + assert result == expected0 result = t.tz_localize('US/Central', ambiguous=False) - self.assertEqual(result, expected1) + assert result == expected1 s = Series([t]) expected0 = Series([expected0]) @@ -564,7 +606,7 @@ def f(): def f(): s.dt.tz_localize('US/Central') - self.assertRaises(pytz.AmbiguousTimeError, f) + pytest.raises(pytz.AmbiguousTimeError, f) result = s.dt.tz_localize('US/Central', ambiguous=True) assert_series_equal(result, expected0) @@ -584,10 +626,10 @@ def test_nonexistent_raise_coerce(self): times = ['2015-03-08 01:00', '2015-03-08 02:00', '2015-03-08 03:00'] index = DatetimeIndex(times) tz = 'US/Eastern' - self.assertRaises(NonExistentTimeError, - index.tz_localize, tz=tz) - self.assertRaises(NonExistentTimeError, - index.tz_localize, tz=tz, errors='raise') + pytest.raises(NonExistentTimeError, + index.tz_localize, tz=tz) + pytest.raises(NonExistentTimeError, + index.tz_localize, tz=tz, errors='raise') result = index.tz_localize(tz=tz, errors='coerce') test_times = ['2015-03-08 01:00-05:00', 'NaT', '2015-03-08 03:00-04:00'] @@ -617,23 +659,23 @@ def test_infer_tz(self): assert (tools._infer_tzinfo(start, end) is utc) end = self.localize(eastern, _end) - self.assertRaises(Exception, tools._infer_tzinfo, start, end) - self.assertRaises(Exception, 
tools._infer_tzinfo, end, start) + pytest.raises(Exception, tools._infer_tzinfo, start, end) + pytest.raises(Exception, tools._infer_tzinfo, end, start) def test_tz_string(self): result = date_range('1/1/2000', periods=10, tz=self.tzstr('US/Eastern')) expected = date_range('1/1/2000', periods=10, tz=self.tz('US/Eastern')) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) def test_take_dont_lose_meta(self): tm._skip_if_no_pytz() rng = date_range('1/1/2000', periods=20, tz=self.tzstr('US/Eastern')) result = rng.take(lrange(5)) - self.assertEqual(result.tz, rng.tz) - self.assertEqual(result.freq, rng.freq) + assert result.tz == rng.tz + assert result.freq == rng.freq def test_index_with_timezone_repr(self): rng = date_range('4/13/2010', '5/6/2010') @@ -641,7 +683,7 @@ def test_index_with_timezone_repr(self): rng_eastern = rng.tz_localize(self.tzstr('US/Eastern')) rng_repr = repr(rng_eastern) - self.assertIn('2010-04-13 00:00:00', rng_repr) + assert '2010-04-13 00:00:00' in rng_repr def test_index_astype_asobject_tzinfos(self): # #1345 @@ -652,14 +694,14 @@ def test_index_astype_asobject_tzinfos(self): objs = rng.asobject for i, x in enumerate(objs): exval = rng[i] - self.assertEqual(x, exval) - self.assertEqual(x.tzinfo, exval.tzinfo) + assert x == exval + assert x.tzinfo == exval.tzinfo objs = rng.astype(object) for i, x in enumerate(objs): exval = rng[i] - self.assertEqual(x, exval) - self.assertEqual(x.tzinfo, exval.tzinfo) + assert x == exval + assert x.tzinfo == exval.tzinfo def test_localized_at_time_between_time(self): from datetime import time @@ -673,37 +715,37 @@ def test_localized_at_time_between_time(self): expected = ts.at_time(time(10, 0)).tz_localize(self.tzstr( 'US/Eastern')) assert_series_equal(result, expected) - self.assertTrue(self.cmptz(result.index.tz, self.tz('US/Eastern'))) + assert self.cmptz(result.index.tz, self.tz('US/Eastern')) t1, t2 = time(10, 0), time(11, 0) result = ts_local.between_time(t1, t2) expected = ts.between_time(t1, t2).tz_localize(self.tzstr('US/Eastern')) assert_series_equal(result, expected) - self.assertTrue(self.cmptz(result.index.tz, self.tz('US/Eastern'))) + assert self.cmptz(result.index.tz, self.tz('US/Eastern')) def test_string_index_alias_tz_aware(self): rng = date_range('1/1/2000', periods=10, tz=self.tzstr('US/Eastern')) ts = Series(np.random.randn(len(rng)), index=rng) result = ts['1/3/2000'] - self.assertAlmostEqual(result, ts[2]) + tm.assert_almost_equal(result, ts[2]) def test_fixed_offset(self): dates = [datetime(2000, 1, 1, tzinfo=fixed_off), datetime(2000, 1, 2, tzinfo=fixed_off), datetime(2000, 1, 3, tzinfo=fixed_off)] result = to_datetime(dates) - self.assertEqual(result.tz, fixed_off) + assert result.tz == fixed_off def test_fixedtz_topydatetime(self): dates = np.array([datetime(2000, 1, 1, tzinfo=fixed_off), datetime(2000, 1, 2, tzinfo=fixed_off), datetime(2000, 1, 3, tzinfo=fixed_off)]) result = to_datetime(dates).to_pydatetime() - self.assert_numpy_array_equal(dates, result) + tm.assert_numpy_array_equal(dates, result) result = to_datetime(dates)._mpl_repr() - self.assert_numpy_array_equal(dates, result) + tm.assert_numpy_array_equal(dates, result) def test_convert_tz_aware_datetime_datetime(self): # #1581 @@ -715,19 +757,19 @@ def test_convert_tz_aware_datetime_datetime(self): dates_aware = [self.localize(tz, x) for x in dates] result = to_datetime(dates_aware) - self.assertTrue(self.cmptz(result.tz, self.tz('US/Eastern'))) + assert self.cmptz(result.tz, self.tz('US/Eastern')) converted = 
to_datetime(dates_aware, utc=True) ex_vals = np.array([Timestamp(x).value for x in dates_aware]) - self.assert_numpy_array_equal(converted.asi8, ex_vals) - self.assertIs(converted.tz, pytz.utc) + tm.assert_numpy_array_equal(converted.asi8, ex_vals) + assert converted.tz is pytz.utc def test_to_datetime_utc(self): from dateutil.parser import parse arr = np.array([parse('2012-06-13T01:39:00Z')], dtype=object) result = to_datetime(arr, utc=True) - self.assertIs(result.tz, pytz.utc) + assert result.tz is pytz.utc def test_to_datetime_tzlocal(self): from dateutil.parser import parse @@ -738,12 +780,12 @@ def test_to_datetime_tzlocal(self): arr = np.array([dt], dtype=object) result = to_datetime(arr, utc=True) - self.assertIs(result.tz, pytz.utc) + assert result.tz is pytz.utc rng = date_range('2012-11-03 03:00', '2012-11-05 03:00', tz=tzlocal()) arr = rng.to_pydatetime() result = to_datetime(arr, utc=True) - self.assertIs(result.tz, pytz.utc) + assert result.tz is pytz.utc def test_frame_no_datetime64_dtype(self): @@ -754,7 +796,7 @@ def test_frame_no_datetime64_dtype(self): dr_tz = dr.tz_localize(self.tzstr('US/Eastern')) e = DataFrame({'A': 'foo', 'B': dr_tz}, index=dr) tz_expected = DatetimeTZDtype('ns', dr_tz.tzinfo) - self.assertEqual(e['B'].dtype, tz_expected) + assert e['B'].dtype == tz_expected # GH 2810 (with timezones) datetimes_naive = [ts.to_pydatetime() for ts in dr] @@ -788,7 +830,7 @@ def test_shift_localized(self): dr_tz = dr.tz_localize(self.tzstr('US/Eastern')) result = dr_tz.shift(1, '10T') - self.assertEqual(result.tz, dr_tz.tz) + assert result.tz == dr_tz.tz def test_tz_aware_asfreq(self): dr = date_range('2011-12-01', '2012-07-20', freq='D', @@ -809,7 +851,7 @@ def test_tzaware_datetime_to_index(self): d = [datetime(2012, 8, 19, tzinfo=self.tz('US/Eastern'))] index = DatetimeIndex(d) - self.assertTrue(self.cmptz(index.tz, self.tz('US/Eastern'))) + assert self.cmptz(index.tz, self.tz('US/Eastern')) def test_date_range_span_dst_transition(self): # #1778 @@ -818,19 +860,18 @@ def test_date_range_span_dst_transition(self): dr = date_range('03/06/2012 00:00', periods=200, freq='W-FRI', tz='US/Eastern') - self.assertTrue((dr.hour == 0).all()) + assert (dr.hour == 0).all() dr = date_range('2012-11-02', periods=10, tz=self.tzstr('US/Eastern')) - self.assertTrue((dr.hour == 0).all()) + assert (dr.hour == 0).all() def test_convert_datetime_list(self): dr = date_range('2012-06-02', periods=10, tz=self.tzstr('US/Eastern'), name='foo') - dr2 = DatetimeIndex(list(dr), name='foo') - self.assert_index_equal(dr, dr2) - self.assertEqual(dr.tz, dr2.tz) - self.assertEqual(dr2.name, 'foo') + tm.assert_index_equal(dr, dr2) + assert dr.tz == dr2.tz + assert dr2.name == 'foo' def test_frame_from_records_utc(self): rec = {'datum': 1.5, @@ -845,7 +886,7 @@ def test_frame_reset_index(self): roundtripped = df.reset_index().set_index('index') xp = df.index.tz rs = roundtripped.index.tz - self.assertEqual(xp, rs) + assert xp == rs def test_dateutil_tzoffset_support(self): from dateutil.tz import tzoffset @@ -855,7 +896,7 @@ def test_dateutil_tzoffset_support(self): datetime(2012, 5, 11, 12, tzinfo=tzinfo)] series = Series(data=values, index=index) - self.assertEqual(series.index.tz, tzinfo) + assert series.index.tz == tzinfo # it works! 
#2443 repr(series.index[0]) @@ -868,14 +909,14 @@ def test_getitem_pydatetime_tz(self): tz=self.tzstr('Europe/Berlin')) time_datetime = self.localize( self.tz('Europe/Berlin'), datetime(2012, 12, 24, 17, 0)) - self.assertEqual(ts[time_pandas], ts[time_datetime]) + assert ts[time_pandas] == ts[time_datetime] def test_index_drop_dont_lose_tz(self): # #2621 ind = date_range("2012-12-01", periods=10, tz="utc") ind = ind.drop(ind[-1]) - self.assertTrue(ind.tz is not None) + assert ind.tz is not None def test_datetimeindex_tz(self): """ Test different DatetimeIndex constructions with timezone @@ -891,20 +932,19 @@ def test_datetimeindex_tz(self): idx4 = DatetimeIndex(np.array(arr), tz=self.tzstr('US/Eastern')) for other in [idx2, idx3, idx4]: - self.assert_index_equal(idx1, other) + tm.assert_index_equal(idx1, other) def test_datetimeindex_tz_nat(self): idx = to_datetime([Timestamp("2013-1-1", tz=self.tzstr('US/Eastern')), NaT]) - self.assertTrue(isnull(idx[1])) - self.assertTrue(idx[0].tzinfo is not None) + assert isnull(idx[1]) + assert idx[0].tzinfo is not None class TestTimeZoneSupportDateutil(TestTimeZoneSupportPytz): - _multiprocess_can_split_ = True - def setUp(self): + def setup_method(self, method): tm._skip_if_no_dateutil() def tz(self, tz): @@ -932,17 +972,17 @@ def test_utc_with_system_utc(self): # Skipped on win32 due to dateutil bug tm._skip_if_windows() - from pandas.tslib import maybe_get_tz + from pandas._libs.tslib import maybe_get_tz # from system utc to real utc ts = Timestamp('2001-01-05 11:56', tz=maybe_get_tz('dateutil/UTC')) # check that the time hasn't changed. - self.assertEqual(ts, ts.tz_convert(dateutil.tz.tzutc())) + assert ts == ts.tz_convert(dateutil.tz.tzutc()) # from system utc to real utc ts = Timestamp('2001-01-05 11:56', tz=maybe_get_tz('dateutil/UTC')) # check that the time hasn't changed. 
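The hunk continued below (`test_utc_with_system_utc`) pins down that converting a UTC timestamp to dateutil's UTC object changes only the tzinfo implementation, not the instant in time. A standalone sketch of that invariant, not part of the patch, assuming pytz and python-dateutil are installed:

```python
# Illustrative only: the pytz/dateutil UTC equivalence the test relies on.
import dateutil.tz
import pytz

from pandas import Timestamp

ts = Timestamp('2001-01-05 11:56', tz=pytz.utc)
# Re-expressing the timestamp in dateutil's UTC is a no-op in time.
assert ts == ts.tz_convert(dateutil.tz.tzutc())
```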
- self.assertEqual(ts, ts.tz_convert(dateutil.tz.tzutc())) + assert ts == ts.tz_convert(dateutil.tz.tzutc()) def test_tz_convert_hour_overflow_dst(self): # Regression test for: @@ -954,8 +994,8 @@ def test_tz_convert_hour_overflow_dst(self): '2009-05-12 09:50:32'] tt = to_datetime(ts).tz_localize('US/Eastern') ut = tt.tz_convert('UTC') - expected = np.array([13, 14, 13], dtype=np.int32) - self.assert_numpy_array_equal(ut.hour, expected) + expected = Index([13, 14, 13]) + tm.assert_index_equal(ut.hour, expected) # sorted case UTC -> US/Eastern ts = ['2008-05-12 13:50:00', @@ -963,8 +1003,8 @@ def test_tz_convert_hour_overflow_dst(self): '2009-05-12 13:50:32'] tt = to_datetime(ts).tz_localize('UTC') ut = tt.tz_convert('US/Eastern') - expected = np.array([9, 9, 9], dtype=np.int32) - self.assert_numpy_array_equal(ut.hour, expected) + expected = Index([9, 9, 9]) + tm.assert_index_equal(ut.hour, expected) # unsorted case US/Eastern -> UTC ts = ['2008-05-12 09:50:00', @@ -972,8 +1012,8 @@ def test_tz_convert_hour_overflow_dst(self): '2008-05-12 09:50:32'] tt = to_datetime(ts).tz_localize('US/Eastern') ut = tt.tz_convert('UTC') - expected = np.array([13, 14, 13], dtype=np.int32) - self.assert_numpy_array_equal(ut.hour, expected) + expected = Index([13, 14, 13]) + tm.assert_index_equal(ut.hour, expected) # unsorted case UTC -> US/Eastern ts = ['2008-05-12 13:50:00', @@ -981,8 +1021,8 @@ def test_tz_convert_hour_overflow_dst(self): '2008-05-12 13:50:32'] tt = to_datetime(ts).tz_localize('UTC') ut = tt.tz_convert('US/Eastern') - expected = np.array([9, 9, 9], dtype=np.int32) - self.assert_numpy_array_equal(ut.hour, expected) + expected = Index([9, 9, 9]) + tm.assert_index_equal(ut.hour, expected) def test_tz_convert_hour_overflow_dst_timestamps(self): # Regression test for: @@ -996,8 +1036,8 @@ def test_tz_convert_hour_overflow_dst_timestamps(self): Timestamp('2009-05-12 09:50:32', tz=tz)] tt = to_datetime(ts) ut = tt.tz_convert('UTC') - expected = np.array([13, 14, 13], dtype=np.int32) - self.assert_numpy_array_equal(ut.hour, expected) + expected = Index([13, 14, 13]) + tm.assert_index_equal(ut.hour, expected) # sorted case UTC -> US/Eastern ts = [Timestamp('2008-05-12 13:50:00', tz='UTC'), @@ -1005,8 +1045,8 @@ def test_tz_convert_hour_overflow_dst_timestamps(self): Timestamp('2009-05-12 13:50:32', tz='UTC')] tt = to_datetime(ts) ut = tt.tz_convert('US/Eastern') - expected = np.array([9, 9, 9], dtype=np.int32) - self.assert_numpy_array_equal(ut.hour, expected) + expected = Index([9, 9, 9]) + tm.assert_index_equal(ut.hour, expected) # unsorted case US/Eastern -> UTC ts = [Timestamp('2008-05-12 09:50:00', tz=tz), @@ -1014,8 +1054,8 @@ def test_tz_convert_hour_overflow_dst_timestamps(self): Timestamp('2008-05-12 09:50:32', tz=tz)] tt = to_datetime(ts) ut = tt.tz_convert('UTC') - expected = np.array([13, 14, 13], dtype=np.int32) - self.assert_numpy_array_equal(ut.hour, expected) + expected = Index([13, 14, 13]) + tm.assert_index_equal(ut.hour, expected) # unsorted case UTC -> US/Eastern ts = [Timestamp('2008-05-12 13:50:00', tz='UTC'), @@ -1023,8 +1063,8 @@ def test_tz_convert_hour_overflow_dst_timestamps(self): Timestamp('2008-05-12 13:50:32', tz='UTC')] tt = to_datetime(ts) ut = tt.tz_convert('US/Eastern') - expected = np.array([9, 9, 9], dtype=np.int32) - self.assert_numpy_array_equal(ut.hour, expected) + expected = Index([9, 9, 9]) + tm.assert_index_equal(ut.hour, expected) def test_tslib_tz_convert_trans_pos_plus_1__bug(self): # Regression test for tslib.tz_convert(vals, tz1, tz2). 
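The hour expectations in the preceding overflow hunks reduce to fixed-offset arithmetic across a DST boundary. A minimal sketch of the behaviour under test (values chosen here for illustration, not lifted from the patch; assumes pandas with pytz zone data):

```python
# 2008-05-12 falls in EDT (UTC-4) and 2008-12-12 in EST (UTC-5), so the
# converted hours differ by one even though both wall clocks read 09:50.
import pandas as pd

tt = pd.to_datetime(['2008-05-12 09:50:00',
                     '2008-12-12 09:50:00']).tz_localize('US/Eastern')
ut = tt.tz_convert('UTC')
assert list(ut.hour) == [13, 14]
```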
@@ -1035,9 +1075,8 @@ def test_tslib_tz_convert_trans_pos_plus_1__bug(self): idx = idx.tz_localize('UTC') idx = idx.tz_convert('Europe/Moscow') - expected = np.repeat(np.array([3, 4, 5], dtype=np.int32), - np.array([n, n, 1])) - self.assert_numpy_array_equal(idx.hour, expected) + expected = np.repeat(np.array([3, 4, 5]), np.array([n, n, 1])) + tm.assert_index_equal(idx.hour, Index(expected)) def test_tslib_tz_convert_dst(self): for freq, n in [('H', 1), ('T', 60), ('S', 3600)]: @@ -1046,76 +1085,71 @@ def test_tslib_tz_convert_dst(self): tz='UTC') idx = idx.tz_convert('US/Eastern') expected = np.repeat(np.array([18, 19, 20, 21, 22, 23, - 0, 1, 3, 4, 5], dtype=np.int32), + 0, 1, 3, 4, 5]), np.array([n, n, n, n, n, n, n, n, n, n, 1])) - self.assert_numpy_array_equal(idx.hour, expected) + tm.assert_index_equal(idx.hour, Index(expected)) idx = date_range('2014-03-08 18:00', '2014-03-09 05:00', freq=freq, tz='US/Eastern') idx = idx.tz_convert('UTC') - expected = np.repeat(np.array([23, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], - dtype=np.int32), + expected = np.repeat(np.array([23, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), np.array([n, n, n, n, n, n, n, n, n, n, 1])) - self.assert_numpy_array_equal(idx.hour, expected) + tm.assert_index_equal(idx.hour, Index(expected)) # End DST idx = date_range('2014-11-01 23:00', '2014-11-02 09:00', freq=freq, tz='UTC') idx = idx.tz_convert('US/Eastern') expected = np.repeat(np.array([19, 20, 21, 22, 23, - 0, 1, 1, 2, 3, 4], dtype=np.int32), + 0, 1, 1, 2, 3, 4]), np.array([n, n, n, n, n, n, n, n, n, n, 1])) - self.assert_numpy_array_equal(idx.hour, expected) + tm.assert_index_equal(idx.hour, Index(expected)) idx = date_range('2014-11-01 18:00', '2014-11-02 05:00', freq=freq, tz='US/Eastern') idx = idx.tz_convert('UTC') expected = np.repeat(np.array([22, 23, 0, 1, 2, 3, 4, 5, 6, - 7, 8, 9, 10], dtype=np.int32), + 7, 8, 9, 10]), np.array([n, n, n, n, n, n, n, n, n, n, n, n, 1])) - self.assert_numpy_array_equal(idx.hour, expected) + tm.assert_index_equal(idx.hour, Index(expected)) # daily # Start DST idx = date_range('2014-03-08 00:00', '2014-03-09 00:00', freq='D', tz='UTC') idx = idx.tz_convert('US/Eastern') - self.assert_numpy_array_equal(idx.hour, - np.array([19, 19], dtype=np.int32)) + tm.assert_index_equal(idx.hour, Index([19, 19])) idx = date_range('2014-03-08 00:00', '2014-03-09 00:00', freq='D', tz='US/Eastern') idx = idx.tz_convert('UTC') - self.assert_numpy_array_equal(idx.hour, - np.array([5, 5], dtype=np.int32)) + tm.assert_index_equal(idx.hour, Index([5, 5])) # End DST idx = date_range('2014-11-01 00:00', '2014-11-02 00:00', freq='D', tz='UTC') idx = idx.tz_convert('US/Eastern') - self.assert_numpy_array_equal(idx.hour, - np.array([20, 20], dtype=np.int32)) + tm.assert_index_equal(idx.hour, Index([20, 20])) idx = date_range('2014-11-01 00:00', '2014-11-02 00:00', freq='D', tz='US/Eastern') idx = idx.tz_convert('UTC') - self.assert_numpy_array_equal(idx.hour, - np.array([4, 4], dtype=np.int32)) + tm.assert_index_equal(idx.hour, Index([4, 4])) def test_tzlocal(self): # GH 13583 ts = Timestamp('2011-01-01', tz=dateutil.tz.tzlocal()) - self.assertEqual(ts.tz, dateutil.tz.tzlocal()) - self.assertTrue("tz='tzlocal()')" in repr(ts)) + assert ts.tz == dateutil.tz.tzlocal() + assert "tz='tzlocal()')" in repr(ts) tz = tslib.maybe_get_tz('tzlocal()') - self.assertEqual(tz, dateutil.tz.tzlocal()) + assert tz == dateutil.tz.tzlocal() # get offset using normal datetime for test offset = dateutil.tz.tzlocal().utcoffset(datetime(2011, 1, 1)) offset = offset.total_seconds() *
1000000000 - self.assertEqual(ts.value + offset, Timestamp('2011-01-01').value) + assert ts.value + offset == Timestamp('2011-01-01').value def test_tz_localize_tzlocal(self): # GH 13583 @@ -1144,7 +1178,8 @@ def test_tz_convert_tzlocal(self): tm.assert_numpy_array_equal(dti2.asi8, dti.asi8) -class TestTimeZoneCacheKey(tm.TestCase): +class TestTimeZoneCacheKey(object): + def test_cache_keys_are_distinct_for_pytz_vs_dateutil(self): tzs = pytz.common_timezones for tz_name in tzs: @@ -1156,15 +1191,13 @@ def test_cache_keys_are_distinct_for_pytz_vs_dateutil(self): if tz_d is None: # skip timezones that dateutil doesn't know about. continue - self.assertNotEqual(tslib._p_tz_cache_key( - tz_p), tslib._p_tz_cache_key(tz_d)) + assert tslib._p_tz_cache_key(tz_p) != tslib._p_tz_cache_key(tz_d) -class TestTimeZones(tm.TestCase): - _multiprocess_can_split_ = True +class TestTimeZones(object): timezones = ['UTC', 'Asia/Tokyo', 'US/Eastern', 'dateutil/US/Pacific'] - def setUp(self): + def setup_method(self, method): tm._skip_if_no_pytz() def test_replace(self): @@ -1174,39 +1207,39 @@ def test_replace(self): dt = Timestamp('2016-01-01 09:00:00') result = dt.replace(hour=0) expected = Timestamp('2016-01-01 00:00:00') - self.assertEqual(result, expected) + assert result == expected for tz in self.timezones: dt = Timestamp('2016-01-01 09:00:00', tz=tz) result = dt.replace(hour=0) expected = Timestamp('2016-01-01 00:00:00', tz=tz) - self.assertEqual(result, expected) + assert result == expected # we preserve nanoseconds dt = Timestamp('2016-01-01 09:00:00.000000123', tz=tz) result = dt.replace(hour=0) expected = Timestamp('2016-01-01 00:00:00.000000123', tz=tz) - self.assertEqual(result, expected) + assert result == expected # test all dt = Timestamp('2016-01-01 09:00:00.000000123', tz=tz) result = dt.replace(year=2015, month=2, day=2, hour=0, minute=5, second=5, microsecond=5, nanosecond=5) expected = Timestamp('2015-02-02 00:05:05.000005005', tz=tz) - self.assertEqual(result, expected) + assert result == expected # error def f(): dt.replace(foo=5) - self.assertRaises(ValueError, f) + pytest.raises(TypeError, f) def f(): dt.replace(hour=0.1) - self.assertRaises(ValueError, f) + pytest.raises(ValueError, f) # assert conversion to naive is the same as replacing tzinfo with None dt = Timestamp('2013-11-03 01:59:59.999999-0400', tz='US/Eastern') - self.assertEqual(dt.tz_localize(None), dt.replace(tzinfo=None)) + assert dt.tz_localize(None) == dt.replace(tzinfo=None) def test_ambiguous_compat(self): # validate that pytz and dateutil are compat for dst @@ -1220,37 +1253,37 @@ def test_ambiguous_compat(self): .tz_localize(pytz_zone, ambiguous=0)) result_dateutil = (Timestamp('2013-10-27 01:00:00') .tz_localize(dateutil_zone, ambiguous=0)) - self.assertEqual(result_pytz.value, result_dateutil.value) - self.assertEqual(result_pytz.value, 1382835600000000000) + assert result_pytz.value == result_dateutil.value + assert result_pytz.value == 1382835600000000000 # dateutil 2.6 buggy w.r.t. 
ambiguous=0 if dateutil.__version__ != LooseVersion('2.6.0'): - # GH 14621 - # https://github.com/dateutil/dateutil/issues/321 - self.assertEqual(result_pytz.to_pydatetime().tzname(), - result_dateutil.to_pydatetime().tzname()) - self.assertEqual(str(result_pytz), str(result_dateutil)) + # see gh-14621 + # see https://github.com/dateutil/dateutil/issues/321 + assert (result_pytz.to_pydatetime().tzname() == + result_dateutil.to_pydatetime().tzname()) + assert str(result_pytz) == str(result_dateutil) # 1 hour difference result_pytz = (Timestamp('2013-10-27 01:00:00') .tz_localize(pytz_zone, ambiguous=1)) result_dateutil = (Timestamp('2013-10-27 01:00:00') .tz_localize(dateutil_zone, ambiguous=1)) - self.assertEqual(result_pytz.value, result_dateutil.value) - self.assertEqual(result_pytz.value, 1382832000000000000) + assert result_pytz.value == result_dateutil.value + assert result_pytz.value == 1382832000000000000 # dateutil < 2.6 is buggy w.r.t. ambiguous timezones if dateutil.__version__ > LooseVersion('2.5.3'): - # GH 14621 - self.assertEqual(str(result_pytz), str(result_dateutil)) - self.assertEqual(result_pytz.to_pydatetime().tzname(), - result_dateutil.to_pydatetime().tzname()) + # see gh-14621 + assert str(result_pytz) == str(result_dateutil) + assert (result_pytz.to_pydatetime().tzname() == + result_dateutil.to_pydatetime().tzname()) def test_index_equals_with_tz(self): left = date_range('1/1/2011', periods=100, freq='H', tz='utc') right = date_range('1/1/2011', periods=100, freq='H', tz='US/Eastern') - self.assertFalse(left.equals(right)) + assert not left.equals(right) def test_tz_localize_naive(self): rng = date_range('1/1/2011', periods=100, freq='H') @@ -1258,7 +1291,7 @@ def test_tz_localize_naive(self): conv = rng.tz_localize('US/Pacific') exp = date_range('1/1/2011', periods=100, freq='H', tz='US/Pacific') - self.assert_index_equal(conv, exp) + tm.assert_index_equal(conv, exp) def test_tz_localize_roundtrip(self): for tz in self.timezones: @@ -1272,12 +1305,12 @@ def test_tz_localize_roundtrip(self): tz=tz) tm.assert_index_equal(localized, expected) - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): localized.tz_localize(tz) reset = localized.tz_localize(None) tm.assert_index_equal(reset, idx) - self.assertTrue(reset.tzinfo is None) + assert reset.tzinfo is None def test_series_frame_tz_localize(self): @@ -1285,48 +1318,48 @@ def test_series_frame_tz_localize(self): ts = Series(1, index=rng) result = ts.tz_localize('utc') - self.assertEqual(result.index.tz.zone, 'UTC') + assert result.index.tz.zone == 'UTC' df = DataFrame({'a': 1}, index=rng) result = df.tz_localize('utc') expected = DataFrame({'a': 1}, rng.tz_localize('UTC')) - self.assertEqual(result.index.tz.zone, 'UTC') + assert result.index.tz.zone == 'UTC' assert_frame_equal(result, expected) df = df.T result = df.tz_localize('utc', axis=1) - self.assertEqual(result.columns.tz.zone, 'UTC') + assert result.columns.tz.zone == 'UTC' assert_frame_equal(result, expected.T) # Can't localize if already tz-aware rng = date_range('1/1/2011', periods=100, freq='H', tz='utc') ts = Series(1, index=rng) - tm.assertRaisesRegexp(TypeError, 'Already tz-aware', ts.tz_localize, - 'US/Eastern') + tm.assert_raises_regex(TypeError, 'Already tz-aware', + ts.tz_localize, 'US/Eastern') def test_series_frame_tz_convert(self): rng = date_range('1/1/2011', periods=200, freq='D', tz='US/Eastern') ts = Series(1, index=rng) result = ts.tz_convert('Europe/Berlin') - self.assertEqual(result.index.tz.zone, 
'Europe/Berlin') + assert result.index.tz.zone == 'Europe/Berlin' df = DataFrame({'a': 1}, index=rng) result = df.tz_convert('Europe/Berlin') expected = DataFrame({'a': 1}, rng.tz_convert('Europe/Berlin')) - self.assertEqual(result.index.tz.zone, 'Europe/Berlin') + assert result.index.tz.zone == 'Europe/Berlin' assert_frame_equal(result, expected) df = df.T result = df.tz_convert('Europe/Berlin', axis=1) - self.assertEqual(result.columns.tz.zone, 'Europe/Berlin') + assert result.columns.tz.zone == 'Europe/Berlin' assert_frame_equal(result, expected.T) # can't convert tz-naive rng = date_range('1/1/2011', periods=200, freq='D') ts = Series(1, index=rng) - tm.assertRaisesRegexp(TypeError, "Cannot convert tz-naive", - ts.tz_convert, 'US/Eastern') + tm.assert_raises_regex(TypeError, "Cannot convert tz-naive", + ts.tz_convert, 'US/Eastern') def test_tz_convert_roundtrip(self): for tz in self.timezones: @@ -1351,7 +1384,7 @@ def test_tz_convert_roundtrip(self): converted = idx.tz_convert(tz) reset = converted.tz_convert(None) tm.assert_index_equal(reset, expected) - self.assertTrue(reset.tzinfo is None) + assert reset.tzinfo is None tm.assert_index_equal(reset, converted.tz_convert( 'UTC').tz_localize(None)) @@ -1363,12 +1396,12 @@ def test_join_utc_convert(self): for how in ['inner', 'outer', 'left', 'right']: result = left.join(left[:-5], how=how) - tm.assertIsInstance(result, DatetimeIndex) - self.assertEqual(result.tz, left.tz) + assert isinstance(result, DatetimeIndex) + assert result.tz == left.tz result = left.join(right[:-5], how=how) - tm.assertIsInstance(result, DatetimeIndex) - self.assertEqual(result.tz.zone, 'UTC') + assert isinstance(result, DatetimeIndex) + assert result.tz.zone == 'UTC' def test_join_aware(self): rng = date_range('1/1/2011', periods=10, freq='H') @@ -1376,8 +1409,8 @@ def test_join_aware(self): ts_utc = ts.tz_localize('utc') - self.assertRaises(Exception, ts.__add__, ts_utc) - self.assertRaises(Exception, ts_utc.__add__, ts) + pytest.raises(Exception, ts.__add__, ts_utc) + pytest.raises(Exception, ts_utc.__add__, ts) test1 = DataFrame(np.zeros((6, 3)), index=date_range("2012-11-15 00:00:00", periods=6, @@ -1390,8 +1423,8 @@ def test_join_aware(self): result = test1.join(test2, how='outer') ex_index = test1.index.union(test2.index) - self.assert_index_equal(result.index, ex_index) - self.assertTrue(result.index.tz.zone == 'US/Central') + tm.assert_index_equal(result.index, ex_index) + assert result.index.tz.zone == 'US/Central' # non-overlapping rng = date_range("2012-11-15 00:00:00", periods=6, freq="H", @@ -1401,7 +1434,7 @@ def test_join_aware(self): tz="US/Eastern") result = rng.union(rng2) - self.assertTrue(result.tz.zone == 'UTC') + assert result.tz.zone == 'UTC' def test_align_aware(self): idx1 = date_range('2001', periods=5, freq='H', tz='US/Eastern') @@ -1409,30 +1442,30 @@ def test_align_aware(self): df1 = DataFrame(np.random.randn(len(idx1), 3), idx1) df2 = DataFrame(np.random.randn(len(idx2), 3), idx2) new1, new2 = df1.align(df2) - self.assertEqual(df1.index.tz, new1.index.tz) - self.assertEqual(df2.index.tz, new2.index.tz) + assert df1.index.tz == new1.index.tz + assert df2.index.tz == new2.index.tz # # different timezones convert to UTC # frame df1_central = df1.tz_convert('US/Central') new1, new2 = df1.align(df1_central) - self.assertEqual(new1.index.tz, pytz.UTC) - self.assertEqual(new2.index.tz, pytz.UTC) + assert new1.index.tz == pytz.UTC + assert new2.index.tz == pytz.UTC # series new1, new2 = df1[0].align(df1_central[0]) - 
self.assertEqual(new1.index.tz, pytz.UTC) - self.assertEqual(new2.index.tz, pytz.UTC) + assert new1.index.tz == pytz.UTC + assert new2.index.tz == pytz.UTC # combination new1, new2 = df1.align(df1_central[0], axis=0) - self.assertEqual(new1.index.tz, pytz.UTC) - self.assertEqual(new2.index.tz, pytz.UTC) + assert new1.index.tz == pytz.UTC + assert new2.index.tz == pytz.UTC df1[0].align(df1_central, axis=0) - self.assertEqual(new1.index.tz, pytz.UTC) - self.assertEqual(new2.index.tz, pytz.UTC) + assert new1.index.tz == pytz.UTC + assert new2.index.tz == pytz.UTC def test_append_aware(self): rng1 = date_range('1/1/2011 01:00', periods=1, freq='H', @@ -1447,7 +1480,7 @@ def test_append_aware(self): tz='US/Eastern') exp = Series([1, 2], index=exp_index) assert_series_equal(ts_result, exp) - self.assertEqual(ts_result.index.tz, rng1.tz) + assert ts_result.index.tz == rng1.tz rng1 = date_range('1/1/2011 01:00', periods=1, freq='H', tz='UTC') rng2 = date_range('1/1/2011 02:00', periods=1, freq='H', tz='UTC') @@ -1460,7 +1493,7 @@ def test_append_aware(self): exp = Series([1, 2], index=exp_index) assert_series_equal(ts_result, exp) utc = rng1.tz - self.assertEqual(utc, ts_result.index.tz) + assert utc == ts_result.index.tz # GH 7795 # different tz coerces to object dtype, not UTC @@ -1491,7 +1524,7 @@ def test_append_dst(self): tz='US/Eastern') exp = Series([1, 2, 3, 10, 11, 12], index=exp_index) assert_series_equal(ts_result, exp) - self.assertEqual(ts_result.index.tz, rng1.tz) + assert ts_result.index.tz == rng1.tz def test_append_aware_naive(self): rng1 = date_range('1/1/2011 01:00', periods=1, freq='H') @@ -1501,8 +1534,8 @@ def test_append_aware_naive(self): ts2 = Series(np.random.randn(len(rng2)), index=rng2) ts_result = ts1.append(ts2) - self.assertTrue(ts_result.index.equals(ts1.index.asobject.append( - ts2.index.asobject))) + assert ts_result.index.equals(ts1.index.asobject.append( + ts2.index.asobject)) # mixed rng1 = date_range('1/1/2011 01:00', periods=1, freq='H') @@ -1510,8 +1543,8 @@ def test_append_aware_naive(self): ts1 = Series(np.random.randn(len(rng1)), index=rng1) ts2 = Series(np.random.randn(len(rng2)), index=rng2) ts_result = ts1.append(ts2) - self.assertTrue(ts_result.index.equals(ts1.index.asobject.append( - ts2.index))) + assert ts_result.index.equals(ts1.index.asobject.append( + ts2.index)) def test_equal_join_ensure_utc(self): rng = date_range('1/1/2011', periods=10, freq='H', tz='US/Eastern') @@ -1520,18 +1553,18 @@ def test_equal_join_ensure_utc(self): ts_moscow = ts.tz_convert('Europe/Moscow') result = ts + ts_moscow - self.assertIs(result.index.tz, pytz.utc) + assert result.index.tz is pytz.utc result = ts_moscow + ts - self.assertIs(result.index.tz, pytz.utc) + assert result.index.tz is pytz.utc df = DataFrame({'a': ts}) df_moscow = df.tz_convert('Europe/Moscow') result = df + df_moscow - self.assertIs(result.index.tz, pytz.utc) + assert result.index.tz is pytz.utc result = df_moscow + df - self.assertIs(result.index.tz, pytz.utc) + assert result.index.tz is pytz.utc def test_arith_utc_convert(self): rng = date_range('1/1/2011', periods=100, freq='H', tz='utc') @@ -1550,7 +1583,7 @@ def test_arith_utc_convert(self): uts2 = ts2.tz_convert('utc') expected = uts1 + uts2 - self.assertEqual(result.index.tz, pytz.UTC) + assert result.index.tz == pytz.UTC assert_series_equal(result, expected) def test_intersection(self): @@ -1559,9 +1592,9 @@ def test_intersection(self): left = rng[10:90][::-1] right = rng[20:80][::-1] - self.assertEqual(left.tz, rng.tz) + assert left.tz 
== rng.tz result = left.intersection(right) - self.assertEqual(result.tz, left.tz) + assert result.tz == left.tz def test_timestamp_equality_different_timezones(self): utc_range = date_range('1/1/2000', periods=20, tz='UTC') @@ -1569,19 +1602,19 @@ def test_timestamp_equality_different_timezones(self): berlin_range = utc_range.tz_convert('Europe/Berlin') for a, b, c in zip(utc_range, eastern_range, berlin_range): - self.assertEqual(a, b) - self.assertEqual(b, c) - self.assertEqual(a, c) + assert a == b + assert b == c + assert a == c - self.assertTrue((utc_range == eastern_range).all()) - self.assertTrue((utc_range == berlin_range).all()) - self.assertTrue((berlin_range == eastern_range).all()) + assert (utc_range == eastern_range).all() + assert (utc_range == berlin_range).all() + assert (berlin_range == eastern_range).all() def test_datetimeindex_tz(self): rng = date_range('03/12/2012 00:00', periods=10, freq='W-FRI', tz='US/Eastern') rng2 = DatetimeIndex(data=rng, tz='US/Eastern') - self.assert_index_equal(rng, rng2) + tm.assert_index_equal(rng, rng2) def test_normalize_tz(self): rng = date_range('1/1/2000 9:30', periods=10, freq='D', @@ -1590,28 +1623,28 @@ def test_normalize_tz(self): result = rng.normalize() expected = date_range('1/1/2000', periods=10, freq='D', tz='US/Eastern') - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) - self.assertTrue(result.is_normalized) - self.assertFalse(rng.is_normalized) + assert result.is_normalized + assert not rng.is_normalized rng = date_range('1/1/2000 9:30', periods=10, freq='D', tz='UTC') result = rng.normalize() expected = date_range('1/1/2000', periods=10, freq='D', tz='UTC') - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) - self.assertTrue(result.is_normalized) - self.assertFalse(rng.is_normalized) + assert result.is_normalized + assert not rng.is_normalized from dateutil.tz import tzlocal rng = date_range('1/1/2000 9:30', periods=10, freq='D', tz=tzlocal()) result = rng.normalize() expected = date_range('1/1/2000', periods=10, freq='D', tz=tzlocal()) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) - self.assertTrue(result.is_normalized) - self.assertFalse(rng.is_normalized) + assert result.is_normalized + assert not rng.is_normalized def test_normalize_tz_local(self): # GH 13459 @@ -1628,15 +1661,15 @@ def test_normalize_tz_local(self): result = rng.normalize() expected = date_range('1/1/2000', periods=10, freq='D', tz=tzlocal()) - self.assert_index_equal(result, expected) + tm.assert_index_equal(result, expected) - self.assertTrue(result.is_normalized) - self.assertFalse(rng.is_normalized) + assert result.is_normalized + assert not rng.is_normalized def test_tzaware_offset(self): dates = date_range('2012-11-01', periods=3, tz='US/Pacific') offset = dates + offsets.Hour(5) - self.assertEqual(dates[0] + offsets.Hour(5), offset[0]) + assert dates[0] + offsets.Hour(5) == offset[0] # GH 6818 for tz in ['UTC', 'US/Pacific', 'Asia/Tokyo']: @@ -1645,47 +1678,91 @@ def test_tzaware_offset(self): '2010-11-01 07:00'], freq='H', tz=tz) offset = dates + offsets.Hour(5) - self.assert_index_equal(offset, expected) + tm.assert_index_equal(offset, expected) offset = dates + np.timedelta64(5, 'h') - self.assert_index_equal(offset, expected) + tm.assert_index_equal(offset, expected) offset = dates + timedelta(hours=5) - self.assert_index_equal(offset, expected) + tm.assert_index_equal(offset, expected) def test_nat(self): # GH 5546 dates = [NaT] 
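`test_nat`, resumed just below, verifies that NaT entries ride through `tz_localize`/`tz_convert` untouched while real values shift offsets. A compact sketch of the same property (hypothetical inputs, not lifted from the patch):

```python
# NaT survives localization and conversion unchanged.
import pandas as pd

idx = pd.DatetimeIndex(['2010-12-01', pd.NaT]).tz_localize('US/Pacific')
idx = idx.tz_convert('US/Eastern')
assert pd.isnull(idx[1])   # still NaT
assert idx[0].hour == 3    # midnight Pacific is 03:00 Eastern
```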
idx = DatetimeIndex(dates) idx = idx.tz_localize('US/Pacific') - self.assert_index_equal(idx, DatetimeIndex(dates, tz='US/Pacific')) + tm.assert_index_equal(idx, DatetimeIndex(dates, tz='US/Pacific')) idx = idx.tz_convert('US/Eastern') - self.assert_index_equal(idx, DatetimeIndex(dates, tz='US/Eastern')) + tm.assert_index_equal(idx, DatetimeIndex(dates, tz='US/Eastern')) idx = idx.tz_convert('UTC') - self.assert_index_equal(idx, DatetimeIndex(dates, tz='UTC')) + tm.assert_index_equal(idx, DatetimeIndex(dates, tz='UTC')) dates = ['2010-12-01 00:00', '2010-12-02 00:00', NaT] idx = DatetimeIndex(dates) idx = idx.tz_localize('US/Pacific') - self.assert_index_equal(idx, DatetimeIndex(dates, tz='US/Pacific')) + tm.assert_index_equal(idx, DatetimeIndex(dates, tz='US/Pacific')) idx = idx.tz_convert('US/Eastern') expected = ['2010-12-01 03:00', '2010-12-02 03:00', NaT] - self.assert_index_equal(idx, DatetimeIndex(expected, tz='US/Eastern')) + tm.assert_index_equal(idx, DatetimeIndex(expected, tz='US/Eastern')) idx = idx + offsets.Hour(5) expected = ['2010-12-01 08:00', '2010-12-02 08:00', NaT] - self.assert_index_equal(idx, DatetimeIndex(expected, tz='US/Eastern')) + tm.assert_index_equal(idx, DatetimeIndex(expected, tz='US/Eastern')) idx = idx.tz_convert('US/Pacific') expected = ['2010-12-01 05:00', '2010-12-02 05:00', NaT] - self.assert_index_equal(idx, DatetimeIndex(expected, tz='US/Pacific')) + tm.assert_index_equal(idx, DatetimeIndex(expected, tz='US/Pacific')) idx = idx + np.timedelta64(3, 'h') expected = ['2010-12-01 08:00', '2010-12-02 08:00', NaT] - self.assert_index_equal(idx, DatetimeIndex(expected, tz='US/Pacific')) + tm.assert_index_equal(idx, DatetimeIndex(expected, tz='US/Pacific')) idx = idx.tz_convert('US/Eastern') expected = ['2010-12-01 11:00', '2010-12-02 11:00', NaT] - self.assert_index_equal(idx, DatetimeIndex(expected, tz='US/Eastern')) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + tm.assert_index_equal(idx, DatetimeIndex(expected, tz='US/Eastern')) + + +class TestTslib(object): + + def test_tslib_tz_convert(self): + def compare_utc_to_local(tz_didx, utc_didx): + f = lambda x: tslib.tz_convert_single(x, 'UTC', tz_didx.tz) + result = tslib.tz_convert(tz_didx.asi8, 'UTC', tz_didx.tz) + result_single = np.vectorize(f)(tz_didx.asi8) + tm.assert_numpy_array_equal(result, result_single) + + def compare_local_to_utc(tz_didx, utc_didx): + f = lambda x: tslib.tz_convert_single(x, tz_didx.tz, 'UTC') + result = tslib.tz_convert(utc_didx.asi8, tz_didx.tz, 'UTC') + result_single = np.vectorize(f)(utc_didx.asi8) + tm.assert_numpy_array_equal(result, result_single) + + for tz in ['UTC', 'Asia/Tokyo', 'US/Eastern', 'Europe/Moscow']: + # US: 2014-03-09 - 2014-11-11 + # MOSCOW: 2014-10-26 / 2014-12-31 + tz_didx = date_range('2014-03-01', '2015-01-10', freq='H', tz=tz) + utc_didx = date_range('2014-03-01', '2015-01-10', freq='H') + compare_utc_to_local(tz_didx, utc_didx) + # local tz to UTC can differ in hourly (or higher) freqs because + # of DST + compare_local_to_utc(tz_didx, utc_didx) + + tz_didx = date_range('2000-01-01', '2020-01-01', freq='D', tz=tz) + utc_didx = date_range('2000-01-01', '2020-01-01', freq='D') + compare_utc_to_local(tz_didx, utc_didx) + compare_local_to_utc(tz_didx, utc_didx) + + tz_didx = date_range('2000-01-01', '2100-01-01', freq='A', tz=tz) + utc_didx = date_range('2000-01-01', '2100-01-01', freq='A') + compare_utc_to_local(tz_didx, utc_didx) + compare_local_to_utc(tz_didx, utc_didx) + +
# Check empty array + result = tslib.tz_convert(np.array([], dtype=np.int64), + tslib.maybe_get_tz('US/Eastern'), + tslib.maybe_get_tz('Asia/Tokyo')) + tm.assert_numpy_array_equal(result, np.array([], dtype=np.int64)) + + # Check all-NaT array + result = tslib.tz_convert(np.array([tslib.iNaT], dtype=np.int64), + tslib.maybe_get_tz('US/Eastern'), + tslib.maybe_get_tz('Asia/Tokyo')) + tm.assert_numpy_array_equal(result, np.array( + [tslib.iNaT], dtype=np.int64)) diff --git a/pandas/tests/types/test_cast.py b/pandas/tests/types/test_cast.py deleted file mode 100644 index 56a14a51105ca..0000000000000 --- a/pandas/tests/types/test_cast.py +++ /dev/null @@ -1,285 +0,0 @@ -# -*- coding: utf-8 -*- - -""" -These test the private routines in types/cast.py - -""" - - -import nose -from datetime import datetime -import numpy as np - -from pandas import Timedelta, Timestamp -from pandas.types.cast import (_possibly_downcast_to_dtype, - _possibly_convert_objects, - _infer_dtype_from_scalar, - _maybe_convert_string_to_object, - _maybe_convert_scalar, - _find_common_type) -from pandas.types.dtypes import (CategoricalDtype, - DatetimeTZDtype, PeriodDtype) -from pandas.util import testing as tm - -_multiprocess_can_split_ = True - - -class TestPossiblyDowncast(tm.TestCase): - - def test_downcast_conv(self): - # test downcasting - - arr = np.array([8.5, 8.6, 8.7, 8.8, 8.9999999999995]) - result = _possibly_downcast_to_dtype(arr, 'infer') - assert (np.array_equal(result, arr)) - - arr = np.array([8., 8., 8., 8., 8.9999999999995]) - result = _possibly_downcast_to_dtype(arr, 'infer') - expected = np.array([8, 8, 8, 8, 9]) - assert (np.array_equal(result, expected)) - - arr = np.array([8., 8., 8., 8., 9.0000000000005]) - result = _possibly_downcast_to_dtype(arr, 'infer') - expected = np.array([8, 8, 8, 8, 9]) - assert (np.array_equal(result, expected)) - - # conversions - - expected = np.array([1, 2]) - for dtype in [np.float64, object, np.int64]: - arr = np.array([1.0, 2.0], dtype=dtype) - result = _possibly_downcast_to_dtype(arr, 'infer') - tm.assert_almost_equal(result, expected, check_dtype=False) - - for dtype in [np.float64, object]: - expected = np.array([1.0, 2.0, np.nan], dtype=dtype) - arr = np.array([1.0, 2.0, np.nan], dtype=dtype) - result = _possibly_downcast_to_dtype(arr, 'infer') - tm.assert_almost_equal(result, expected) - - # empties - for dtype in [np.int32, np.float64, np.float32, np.bool_, - np.int64, object]: - arr = np.array([], dtype=dtype) - result = _possibly_downcast_to_dtype(arr, 'int64') - tm.assert_almost_equal(result, np.array([], dtype=np.int64)) - assert result.dtype == np.int64 - - def test_datetimelikes_nan(self): - arr = np.array([1, 2, np.nan]) - exp = np.array([1, 2, np.datetime64('NaT')], dtype='datetime64[ns]') - res = _possibly_downcast_to_dtype(arr, 'datetime64[ns]') - tm.assert_numpy_array_equal(res, exp) - - exp = np.array([1, 2, np.timedelta64('NaT')], dtype='timedelta64[ns]') - res = _possibly_downcast_to_dtype(arr, 'timedelta64[ns]') - tm.assert_numpy_array_equal(res, exp) - - -class TestInferDtype(tm.TestCase): - - def test_infer_dtype_from_scalar(self): - # Test that _infer_dtype_from_scalar is returning correct dtype for int - # and float. 
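The deleted assertions below poke private helpers in `pandas.types.cast`. The closest user-facing cousin of the downcasting cases is `pd.to_numeric`, sketched here as a rough public analogue (an assumed correspondence, not what the deleted tests actually call):

```python
# Lossless float-to-int downcasting via the public API.
import numpy as np
import pandas as pd

s = pd.Series([8.0, 8.0, 8.0, 8.0, 9.0])
out = pd.to_numeric(s, downcast='integer')
assert out.dtype == np.int8  # all values fit losslessly in int8
```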
- - for dtypec in [np.uint8, np.int8, np.uint16, np.int16, np.uint32, - np.int32, np.uint64, np.int64]: - data = dtypec(12) - dtype, val = _infer_dtype_from_scalar(data) - self.assertEqual(dtype, type(data)) - - data = 12 - dtype, val = _infer_dtype_from_scalar(data) - self.assertEqual(dtype, np.int64) - - for dtypec in [np.float16, np.float32, np.float64]: - data = dtypec(12) - dtype, val = _infer_dtype_from_scalar(data) - self.assertEqual(dtype, dtypec) - - data = np.float(12) - dtype, val = _infer_dtype_from_scalar(data) - self.assertEqual(dtype, np.float64) - - for data in [True, False]: - dtype, val = _infer_dtype_from_scalar(data) - self.assertEqual(dtype, np.bool_) - - for data in [np.complex64(1), np.complex128(1)]: - dtype, val = _infer_dtype_from_scalar(data) - self.assertEqual(dtype, np.complex_) - - import datetime - for data in [np.datetime64(1, 'ns'), Timestamp(1), - datetime.datetime(2000, 1, 1, 0, 0)]: - dtype, val = _infer_dtype_from_scalar(data) - self.assertEqual(dtype, 'M8[ns]') - - for data in [np.timedelta64(1, 'ns'), Timedelta(1), - datetime.timedelta(1)]: - dtype, val = _infer_dtype_from_scalar(data) - self.assertEqual(dtype, 'm8[ns]') - - for data in [datetime.date(2000, 1, 1), - Timestamp(1, tz='US/Eastern'), 'foo']: - dtype, val = _infer_dtype_from_scalar(data) - self.assertEqual(dtype, np.object_) - - -class TestMaybe(tm.TestCase): - - def test_maybe_convert_string_to_array(self): - result = _maybe_convert_string_to_object('x') - tm.assert_numpy_array_equal(result, np.array(['x'], dtype=object)) - self.assertTrue(result.dtype == object) - - result = _maybe_convert_string_to_object(1) - self.assertEqual(result, 1) - - arr = np.array(['x', 'y'], dtype=str) - result = _maybe_convert_string_to_object(arr) - tm.assert_numpy_array_equal(result, np.array(['x', 'y'], dtype=object)) - self.assertTrue(result.dtype == object) - - # unicode - arr = np.array(['x', 'y']).astype('U') - result = _maybe_convert_string_to_object(arr) - tm.assert_numpy_array_equal(result, np.array(['x', 'y'], dtype=object)) - self.assertTrue(result.dtype == object) - - # object - arr = np.array(['x', 2], dtype=object) - result = _maybe_convert_string_to_object(arr) - tm.assert_numpy_array_equal(result, np.array(['x', 2], dtype=object)) - self.assertTrue(result.dtype == object) - - def test_maybe_convert_scalar(self): - - # pass thru - result = _maybe_convert_scalar('x') - self.assertEqual(result, 'x') - result = _maybe_convert_scalar(np.array([1])) - self.assertEqual(result, np.array([1])) - - # leave scalar dtype - result = _maybe_convert_scalar(np.int64(1)) - self.assertEqual(result, np.int64(1)) - result = _maybe_convert_scalar(np.int32(1)) - self.assertEqual(result, np.int32(1)) - result = _maybe_convert_scalar(np.float32(1)) - self.assertEqual(result, np.float32(1)) - result = _maybe_convert_scalar(np.int64(1)) - self.assertEqual(result, np.float64(1)) - - # coerce - result = _maybe_convert_scalar(1) - self.assertEqual(result, np.int64(1)) - result = _maybe_convert_scalar(1.0) - self.assertEqual(result, np.float64(1)) - result = _maybe_convert_scalar(Timestamp('20130101')) - self.assertEqual(result, Timestamp('20130101').value) - result = _maybe_convert_scalar(datetime(2013, 1, 1)) - self.assertEqual(result, Timestamp('20130101').value) - result = _maybe_convert_scalar(Timedelta('1 day 1 min')) - self.assertEqual(result, Timedelta('1 day 1 min').value) - - -class TestConvert(tm.TestCase): - - def test_possibly_convert_objects_copy(self): - values = np.array([1, 2]) - - out = 
_possibly_convert_objects(values, copy=False) - self.assertTrue(values is out) - - out = _possibly_convert_objects(values, copy=True) - self.assertTrue(values is not out) - - values = np.array(['apply', 'banana']) - out = _possibly_convert_objects(values, copy=False) - self.assertTrue(values is out) - - out = _possibly_convert_objects(values, copy=True) - self.assertTrue(values is not out) - - -class TestCommonTypes(tm.TestCase): - - def test_numpy_dtypes(self): - # (source_types, destination_type) - testcases = ( - # identity - ((np.int64,), np.int64), - ((np.uint64,), np.uint64), - ((np.float32,), np.float32), - ((np.object,), np.object), - - # into ints - ((np.int16, np.int64), np.int64), - ((np.int32, np.uint32), np.int64), - ((np.uint16, np.uint64), np.uint64), - - # into floats - ((np.float16, np.float32), np.float32), - ((np.float16, np.int16), np.float32), - ((np.float32, np.int16), np.float32), - ((np.uint64, np.int64), np.float64), - ((np.int16, np.float64), np.float64), - ((np.float16, np.int64), np.float64), - - # into others - ((np.complex128, np.int32), np.complex128), - ((np.object, np.float32), np.object), - ((np.object, np.int16), np.object), - - ((np.dtype('datetime64[ns]'), np.dtype('datetime64[ns]')), - np.dtype('datetime64[ns]')), - ((np.dtype('timedelta64[ns]'), np.dtype('timedelta64[ns]')), - np.dtype('timedelta64[ns]')), - - ((np.dtype('datetime64[ns]'), np.dtype('datetime64[ms]')), - np.dtype('datetime64[ns]')), - ((np.dtype('timedelta64[ms]'), np.dtype('timedelta64[ns]')), - np.dtype('timedelta64[ns]')), - - ((np.dtype('datetime64[ns]'), np.dtype('timedelta64[ns]')), - np.object), - ((np.dtype('datetime64[ns]'), np.int64), np.object) - ) - for src, common in testcases: - self.assertEqual(_find_common_type(src), common) - - with tm.assertRaises(ValueError): - # empty - _find_common_type([]) - - def test_categorical_dtype(self): - dtype = CategoricalDtype() - self.assertEqual(_find_common_type([dtype]), 'category') - self.assertEqual(_find_common_type([dtype, dtype]), 'category') - self.assertEqual(_find_common_type([np.object, dtype]), np.object) - - def test_datetimetz_dtype(self): - dtype = DatetimeTZDtype(unit='ns', tz='US/Eastern') - self.assertEqual(_find_common_type([dtype, dtype]), - 'datetime64[ns, US/Eastern]') - - for dtype2 in [DatetimeTZDtype(unit='ns', tz='Asia/Tokyo'), - np.dtype('datetime64[ns]'), np.object, np.int64]: - self.assertEqual(_find_common_type([dtype, dtype2]), np.object) - self.assertEqual(_find_common_type([dtype2, dtype]), np.object) - - def test_period_dtype(self): - dtype = PeriodDtype(freq='D') - self.assertEqual(_find_common_type([dtype, dtype]), 'period[D]') - - for dtype2 in [DatetimeTZDtype(unit='ns', tz='Asia/Tokyo'), - PeriodDtype(freq='2D'), PeriodDtype(freq='H'), - np.dtype('datetime64[ns]'), np.object, np.int64]: - self.assertEqual(_find_common_type([dtype, dtype2]), np.object) - self.assertEqual(_find_common_type([dtype2, dtype]), np.object) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/types/test_common.py b/pandas/tests/types/test_common.py deleted file mode 100644 index 4d6f50862c562..0000000000000 --- a/pandas/tests/types/test_common.py +++ /dev/null @@ -1,62 +0,0 @@ -# -*- coding: utf-8 -*- - -import nose -import numpy as np - -from pandas.types.dtypes import DatetimeTZDtype, PeriodDtype, CategoricalDtype -from pandas.types.common import pandas_dtype, is_dtype_equal - -import pandas.util.testing as tm - 
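The file removed here covered `pandas_dtype` and `is_dtype_equal`, which stay available through the public layer (in later releases via `pandas.api.types`; that re-export location is an assumption, not something this diff states):

```python
# pandas_dtype maps dtype strings onto numpy or pandas dtype objects.
import numpy as np
from pandas.api.types import pandas_dtype

assert pandas_dtype('int64') == np.dtype('int64')
assert pandas_dtype('datetime64[ns, UTC]') == 'datetime64[ns, UTC]'
assert str(pandas_dtype('category')) == 'category'
```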
-_multiprocess_can_split_ = True - - -class TestPandasDtype(tm.TestCase): - - def test_numpy_dtype(self): - for dtype in ['M8[ns]', 'm8[ns]', 'object', 'float64', 'int64']: - self.assertEqual(pandas_dtype(dtype), np.dtype(dtype)) - - def test_numpy_string_dtype(self): - # do not parse freq-like string as period dtype - self.assertEqual(pandas_dtype('U'), np.dtype('U')) - self.assertEqual(pandas_dtype('S'), np.dtype('S')) - - def test_datetimetz_dtype(self): - for dtype in ['datetime64[ns, US/Eastern]', - 'datetime64[ns, Asia/Tokyo]', - 'datetime64[ns, UTC]']: - self.assertIs(pandas_dtype(dtype), DatetimeTZDtype(dtype)) - self.assertEqual(pandas_dtype(dtype), DatetimeTZDtype(dtype)) - self.assertEqual(pandas_dtype(dtype), dtype) - - def test_categorical_dtype(self): - self.assertEqual(pandas_dtype('category'), CategoricalDtype()) - - def test_period_dtype(self): - for dtype in ['period[D]', 'period[3M]', 'period[U]', - 'Period[D]', 'Period[3M]', 'Period[U]']: - self.assertIs(pandas_dtype(dtype), PeriodDtype(dtype)) - self.assertEqual(pandas_dtype(dtype), PeriodDtype(dtype)) - self.assertEqual(pandas_dtype(dtype), dtype) - - -def test_dtype_equal(): - assert is_dtype_equal(np.int64, np.int64) - assert not is_dtype_equal(np.int64, np.float64) - - p1 = PeriodDtype('D') - p2 = PeriodDtype('D') - assert is_dtype_equal(p1, p2) - assert not is_dtype_equal(np.int64, p1) - - p3 = PeriodDtype('2D') - assert not is_dtype_equal(p1, p3) - - assert not DatetimeTZDtype.is_dtype(np.int64) - assert not PeriodDtype.is_dtype(np.int64) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/types/test_dtypes.py b/pandas/tests/types/test_dtypes.py deleted file mode 100644 index a2b0a9ebfa6cc..0000000000000 --- a/pandas/tests/types/test_dtypes.py +++ /dev/null @@ -1,355 +0,0 @@ -# -*- coding: utf-8 -*- -from itertools import product - -import nose -import numpy as np -import pandas as pd -from pandas import Series, Categorical, date_range - -from pandas.types.dtypes import DatetimeTZDtype, PeriodDtype, CategoricalDtype -from pandas.types.common import (is_categorical_dtype, is_categorical, - is_datetime64tz_dtype, is_datetimetz, - is_period_dtype, is_period, - is_dtype_equal, is_datetime64_ns_dtype, - is_datetime64_dtype, is_string_dtype, - _coerce_to_dtype) -import pandas.util.testing as tm - -_multiprocess_can_split_ = True - - -class Base(object): - - def test_hash(self): - hash(self.dtype) - - def test_equality_invalid(self): - self.assertRaises(self.dtype == 'foo') - self.assertFalse(is_dtype_equal(self.dtype, np.int64)) - - def test_numpy_informed(self): - - # np.dtype doesn't know about our new dtype - def f(): - np.dtype(self.dtype) - - self.assertRaises(TypeError, f) - - self.assertNotEqual(self.dtype, np.str_) - self.assertNotEqual(np.str_, self.dtype) - - def test_pickle(self): - result = self.round_trip_pickle(self.dtype) - self.assertEqual(result, self.dtype) - - -class TestCategoricalDtype(Base, tm.TestCase): - - def setUp(self): - self.dtype = CategoricalDtype() - - def test_hash_vs_equality(self): - # make sure that we satisfy is semantics - dtype = self.dtype - dtype2 = CategoricalDtype() - self.assertTrue(dtype == dtype2) - self.assertTrue(dtype2 == dtype) - self.assertTrue(dtype is dtype2) - self.assertTrue(dtype2 is dtype) - self.assertTrue(hash(dtype) == hash(dtype2)) - - def test_equality(self): - self.assertTrue(is_dtype_equal(self.dtype, 'category')) - self.assertTrue(is_dtype_equal(self.dtype, 
CategoricalDtype())) - self.assertFalse(is_dtype_equal(self.dtype, 'foo')) - - def test_construction_from_string(self): - result = CategoricalDtype.construct_from_string('category') - self.assertTrue(is_dtype_equal(self.dtype, result)) - self.assertRaises( - TypeError, lambda: CategoricalDtype.construct_from_string('foo')) - - def test_is_dtype(self): - self.assertTrue(CategoricalDtype.is_dtype(self.dtype)) - self.assertTrue(CategoricalDtype.is_dtype('category')) - self.assertTrue(CategoricalDtype.is_dtype(CategoricalDtype())) - self.assertFalse(CategoricalDtype.is_dtype('foo')) - self.assertFalse(CategoricalDtype.is_dtype(np.float64)) - - def test_basic(self): - - self.assertTrue(is_categorical_dtype(self.dtype)) - - factor = Categorical(['a', 'b', 'b', 'a', 'a', 'c', 'c', 'c']) - - s = Series(factor, name='A') - - # dtypes - self.assertTrue(is_categorical_dtype(s.dtype)) - self.assertTrue(is_categorical_dtype(s)) - self.assertFalse(is_categorical_dtype(np.dtype('float64'))) - - self.assertTrue(is_categorical(s.dtype)) - self.assertTrue(is_categorical(s)) - self.assertFalse(is_categorical(np.dtype('float64'))) - self.assertFalse(is_categorical(1.0)) - - -class TestDatetimeTZDtype(Base, tm.TestCase): - - def setUp(self): - self.dtype = DatetimeTZDtype('ns', 'US/Eastern') - - def test_hash_vs_equality(self): - # make sure that we satisfy is semantics - dtype = self.dtype - dtype2 = DatetimeTZDtype('ns', 'US/Eastern') - dtype3 = DatetimeTZDtype(dtype2) - self.assertTrue(dtype == dtype2) - self.assertTrue(dtype2 == dtype) - self.assertTrue(dtype3 == dtype) - self.assertTrue(dtype is dtype2) - self.assertTrue(dtype2 is dtype) - self.assertTrue(dtype3 is dtype) - self.assertTrue(hash(dtype) == hash(dtype2)) - self.assertTrue(hash(dtype) == hash(dtype3)) - - def test_construction(self): - self.assertRaises(ValueError, - lambda: DatetimeTZDtype('ms', 'US/Eastern')) - - def test_subclass(self): - a = DatetimeTZDtype('datetime64[ns, US/Eastern]') - b = DatetimeTZDtype('datetime64[ns, CET]') - - self.assertTrue(issubclass(type(a), type(a))) - self.assertTrue(issubclass(type(a), type(b))) - - def test_coerce_to_dtype(self): - self.assertEqual(_coerce_to_dtype('datetime64[ns, US/Eastern]'), - DatetimeTZDtype('ns', 'US/Eastern')) - self.assertEqual(_coerce_to_dtype('datetime64[ns, Asia/Tokyo]'), - DatetimeTZDtype('ns', 'Asia/Tokyo')) - - def test_compat(self): - self.assertFalse(is_datetime64_ns_dtype(self.dtype)) - self.assertFalse(is_datetime64_ns_dtype('datetime64[ns, US/Eastern]')) - self.assertFalse(is_datetime64_dtype(self.dtype)) - self.assertFalse(is_datetime64_dtype('datetime64[ns, US/Eastern]')) - - def test_construction_from_string(self): - result = DatetimeTZDtype('datetime64[ns, US/Eastern]') - self.assertTrue(is_dtype_equal(self.dtype, result)) - result = DatetimeTZDtype.construct_from_string( - 'datetime64[ns, US/Eastern]') - self.assertTrue(is_dtype_equal(self.dtype, result)) - self.assertRaises(TypeError, - lambda: DatetimeTZDtype.construct_from_string('foo')) - - def test_is_dtype(self): - self.assertTrue(DatetimeTZDtype.is_dtype(self.dtype)) - self.assertTrue(DatetimeTZDtype.is_dtype('datetime64[ns, US/Eastern]')) - self.assertFalse(DatetimeTZDtype.is_dtype('foo')) - self.assertTrue(DatetimeTZDtype.is_dtype(DatetimeTZDtype( - 'ns', 'US/Pacific'))) - self.assertFalse(DatetimeTZDtype.is_dtype(np.float64)) - - def test_equality(self): - self.assertTrue(is_dtype_equal(self.dtype, - 'datetime64[ns, US/Eastern]')) - self.assertTrue(is_dtype_equal(self.dtype, DatetimeTZDtype( - 'ns', 
'US/Eastern'))) - self.assertFalse(is_dtype_equal(self.dtype, 'foo')) - self.assertFalse(is_dtype_equal(self.dtype, DatetimeTZDtype('ns', - 'CET'))) - self.assertFalse(is_dtype_equal( - DatetimeTZDtype('ns', 'US/Eastern'), DatetimeTZDtype( - 'ns', 'US/Pacific'))) - - # numpy compat - self.assertTrue(is_dtype_equal(np.dtype("M8[ns]"), "datetime64[ns]")) - - def test_basic(self): - - self.assertTrue(is_datetime64tz_dtype(self.dtype)) - - dr = date_range('20130101', periods=3, tz='US/Eastern') - s = Series(dr, name='A') - - # dtypes - self.assertTrue(is_datetime64tz_dtype(s.dtype)) - self.assertTrue(is_datetime64tz_dtype(s)) - self.assertFalse(is_datetime64tz_dtype(np.dtype('float64'))) - self.assertFalse(is_datetime64tz_dtype(1.0)) - - self.assertTrue(is_datetimetz(s)) - self.assertTrue(is_datetimetz(s.dtype)) - self.assertFalse(is_datetimetz(np.dtype('float64'))) - self.assertFalse(is_datetimetz(1.0)) - - def test_dst(self): - - dr1 = date_range('2013-01-01', periods=3, tz='US/Eastern') - s1 = Series(dr1, name='A') - self.assertTrue(is_datetimetz(s1)) - - dr2 = date_range('2013-08-01', periods=3, tz='US/Eastern') - s2 = Series(dr2, name='A') - self.assertTrue(is_datetimetz(s2)) - self.assertEqual(s1.dtype, s2.dtype) - - def test_parser(self): - # pr #11245 - for tz, constructor in product(('UTC', 'US/Eastern'), - ('M8', 'datetime64')): - self.assertEqual( - DatetimeTZDtype('%s[ns, %s]' % (constructor, tz)), - DatetimeTZDtype('ns', tz), - ) - - def test_empty(self): - dt = DatetimeTZDtype() - with tm.assertRaises(AttributeError): - str(dt) - - -class TestPeriodDtype(Base, tm.TestCase): - - def setUp(self): - self.dtype = PeriodDtype('D') - - def test_construction(self): - with tm.assertRaises(ValueError): - PeriodDtype('xx') - - for s in ['period[D]', 'Period[D]', 'D']: - dt = PeriodDtype(s) - self.assertEqual(dt.freq, pd.tseries.offsets.Day()) - self.assertTrue(is_period_dtype(dt)) - - for s in ['period[3D]', 'Period[3D]', '3D']: - dt = PeriodDtype(s) - self.assertEqual(dt.freq, pd.tseries.offsets.Day(3)) - self.assertTrue(is_period_dtype(dt)) - - for s in ['period[26H]', 'Period[26H]', '26H', - 'period[1D2H]', 'Period[1D2H]', '1D2H']: - dt = PeriodDtype(s) - self.assertEqual(dt.freq, pd.tseries.offsets.Hour(26)) - self.assertTrue(is_period_dtype(dt)) - - def test_subclass(self): - a = PeriodDtype('period[D]') - b = PeriodDtype('period[3D]') - - self.assertTrue(issubclass(type(a), type(a))) - self.assertTrue(issubclass(type(a), type(b))) - - def test_identity(self): - self.assertEqual(PeriodDtype('period[D]'), - PeriodDtype('period[D]')) - self.assertIs(PeriodDtype('period[D]'), - PeriodDtype('period[D]')) - - self.assertEqual(PeriodDtype('period[3D]'), - PeriodDtype('period[3D]')) - self.assertIs(PeriodDtype('period[3D]'), - PeriodDtype('period[3D]')) - - self.assertEqual(PeriodDtype('period[1S1U]'), - PeriodDtype('period[1000001U]')) - self.assertIs(PeriodDtype('period[1S1U]'), - PeriodDtype('period[1000001U]')) - - def test_coerce_to_dtype(self): - self.assertEqual(_coerce_to_dtype('period[D]'), - PeriodDtype('period[D]')) - self.assertEqual(_coerce_to_dtype('period[3M]'), - PeriodDtype('period[3M]')) - - def test_compat(self): - self.assertFalse(is_datetime64_ns_dtype(self.dtype)) - self.assertFalse(is_datetime64_ns_dtype('period[D]')) - self.assertFalse(is_datetime64_dtype(self.dtype)) - self.assertFalse(is_datetime64_dtype('period[D]')) - - def test_construction_from_string(self): - result = PeriodDtype('period[D]') - self.assertTrue(is_dtype_equal(self.dtype, result)) - result = 
PeriodDtype.construct_from_string('period[D]') - self.assertTrue(is_dtype_equal(self.dtype, result)) - with tm.assertRaises(TypeError): - PeriodDtype.construct_from_string('foo') - with tm.assertRaises(TypeError): - PeriodDtype.construct_from_string('period[foo]') - with tm.assertRaises(TypeError): - PeriodDtype.construct_from_string('foo[D]') - - with tm.assertRaises(TypeError): - PeriodDtype.construct_from_string('datetime64[ns]') - with tm.assertRaises(TypeError): - PeriodDtype.construct_from_string('datetime64[ns, US/Eastern]') - - def test_is_dtype(self): - self.assertTrue(PeriodDtype.is_dtype(self.dtype)) - self.assertTrue(PeriodDtype.is_dtype('period[D]')) - self.assertTrue(PeriodDtype.is_dtype('period[3D]')) - self.assertTrue(PeriodDtype.is_dtype(PeriodDtype('3D'))) - self.assertTrue(PeriodDtype.is_dtype('period[U]')) - self.assertTrue(PeriodDtype.is_dtype('period[S]')) - self.assertTrue(PeriodDtype.is_dtype(PeriodDtype('U'))) - self.assertTrue(PeriodDtype.is_dtype(PeriodDtype('S'))) - - self.assertFalse(PeriodDtype.is_dtype('D')) - self.assertFalse(PeriodDtype.is_dtype('3D')) - self.assertFalse(PeriodDtype.is_dtype('U')) - self.assertFalse(PeriodDtype.is_dtype('S')) - self.assertFalse(PeriodDtype.is_dtype('foo')) - self.assertFalse(PeriodDtype.is_dtype(np.object_)) - self.assertFalse(PeriodDtype.is_dtype(np.int64)) - self.assertFalse(PeriodDtype.is_dtype(np.float64)) - - def test_equality(self): - self.assertTrue(is_dtype_equal(self.dtype, 'period[D]')) - self.assertTrue(is_dtype_equal(self.dtype, PeriodDtype('D'))) - self.assertTrue(is_dtype_equal(self.dtype, PeriodDtype('D'))) - self.assertTrue(is_dtype_equal(PeriodDtype('D'), PeriodDtype('D'))) - - self.assertFalse(is_dtype_equal(self.dtype, 'D')) - self.assertFalse(is_dtype_equal(PeriodDtype('D'), PeriodDtype('2D'))) - - def test_basic(self): - self.assertTrue(is_period_dtype(self.dtype)) - - pidx = pd.period_range('2013-01-01 09:00', periods=5, freq='H') - - self.assertTrue(is_period_dtype(pidx.dtype)) - self.assertTrue(is_period_dtype(pidx)) - self.assertTrue(is_period(pidx)) - - s = Series(pidx, name='A') - # dtypes - # series results in object dtype currently, - # is_period checks period_arraylike - self.assertFalse(is_period_dtype(s.dtype)) - self.assertFalse(is_period_dtype(s)) - self.assertTrue(is_period(s)) - - self.assertFalse(is_period_dtype(np.dtype('float64'))) - self.assertFalse(is_period_dtype(1.0)) - self.assertFalse(is_period(np.dtype('float64'))) - self.assertFalse(is_period(1.0)) - - def test_empty(self): - dt = PeriodDtype() - with tm.assertRaises(AttributeError): - str(dt) - - def test_not_string(self): - # though PeriodDtype has object kind, it cannot be string - self.assertFalse(is_string_dtype(PeriodDtype('D'))) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/types/test_generic.py b/pandas/tests/types/test_generic.py deleted file mode 100644 index 89913de6f6069..0000000000000 --- a/pandas/tests/types/test_generic.py +++ /dev/null @@ -1,47 +0,0 @@ -# -*- coding: utf-8 -*- - -import nose -import numpy as np -import pandas as pd -import pandas.util.testing as tm -from pandas.types import generic as gt - -_multiprocess_can_split_ = True - - -class TestABCClasses(tm.TestCase): - tuples = [[1, 2, 2], ['red', 'blue', 'red']] - multi_index = pd.MultiIndex.from_arrays(tuples, names=('number', 'color')) - datetime_index = pd.to_datetime(['2000/1/1', '2010/1/1']) - timedelta_index = pd.to_timedelta(np.arange(5), 
unit='s') - period_index = pd.period_range('2000/1/1', '2010/1/1/', freq='M') - categorical = pd.Categorical([1, 2, 3], categories=[2, 3, 1]) - categorical_df = pd.DataFrame({"values": [1, 2, 3]}, index=categorical) - df = pd.DataFrame({'names': ['a', 'b', 'c']}, index=multi_index) - sparse_series = pd.Series([1, 2, 3]).to_sparse() - sparse_array = pd.SparseArray(np.random.randn(10)) - - def test_abc_types(self): - self.assertIsInstance(pd.Index(['a', 'b', 'c']), gt.ABCIndex) - self.assertIsInstance(pd.Int64Index([1, 2, 3]), gt.ABCInt64Index) - self.assertIsInstance(pd.Float64Index([1, 2, 3]), gt.ABCFloat64Index) - self.assertIsInstance(self.multi_index, gt.ABCMultiIndex) - self.assertIsInstance(self.datetime_index, gt.ABCDatetimeIndex) - self.assertIsInstance(self.timedelta_index, gt.ABCTimedeltaIndex) - self.assertIsInstance(self.period_index, gt.ABCPeriodIndex) - self.assertIsInstance(self.categorical_df.index, - gt.ABCCategoricalIndex) - self.assertIsInstance(pd.Index(['a', 'b', 'c']), gt.ABCIndexClass) - self.assertIsInstance(pd.Int64Index([1, 2, 3]), gt.ABCIndexClass) - self.assertIsInstance(pd.Series([1, 2, 3]), gt.ABCSeries) - self.assertIsInstance(self.df, gt.ABCDataFrame) - self.assertIsInstance(self.df.to_panel(), gt.ABCPanel) - self.assertIsInstance(self.sparse_series, gt.ABCSparseSeries) - self.assertIsInstance(self.sparse_array, gt.ABCSparseArray) - self.assertIsInstance(self.categorical, gt.ABCCategorical) - self.assertIsInstance(pd.Period('2012', freq='A-DEC'), gt.ABCPeriod) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/types/test_inference.py b/pandas/tests/types/test_inference.py deleted file mode 100644 index a63ae5f7cf74e..0000000000000 --- a/pandas/tests/types/test_inference.py +++ /dev/null @@ -1,858 +0,0 @@ -# -*- coding: utf-8 -*- - -""" -These the test the public routines exposed in types/common.py -related to inference and not otherwise tested in types/test_common.py - -""" - -import nose -import collections -import re -from datetime import datetime, date, timedelta, time -import numpy as np - -import pandas as pd -from pandas import lib, tslib -from pandas import (Series, Index, DataFrame, Timedelta, - DatetimeIndex, TimedeltaIndex, Timestamp, - Panel, Period, Categorical) -from pandas.compat import u, PY2, lrange -from pandas.types import inference -from pandas.types.common import (is_timedelta64_dtype, - is_timedelta64_ns_dtype, - is_number, - is_integer, - is_float, - is_bool, - is_scalar, - _ensure_int32, - _ensure_categorical) -from pandas.types.missing import isnull -from pandas.util import testing as tm - -_multiprocess_can_split_ = True - - -def test_is_sequence(): - is_seq = inference.is_sequence - assert (is_seq((1, 2))) - assert (is_seq([1, 2])) - assert (not is_seq("abcd")) - assert (not is_seq(u("abcd"))) - assert (not is_seq(np.int64)) - - class A(object): - - def __getitem__(self): - return 1 - - assert (not is_seq(A())) - - -def test_is_list_like(): - passes = ([], [1], (1, ), (1, 2), {'a': 1}, set([1, 'a']), Series([1]), - Series([]), Series(['a']).str) - fails = (1, '2', object()) - - for p in passes: - assert inference.is_list_like(p) - - for f in fails: - assert not inference.is_list_like(f) - - -def test_is_dict_like(): - passes = [{}, {'A': 1}, Series([1])] - fails = ['1', 1, [1, 2], (1, 2), range(2), Index([1])] - - for p in passes: - assert inference.is_dict_like(p) - - for f in fails: - assert not inference.is_dict_like(f) - - -def 
test_is_named_tuple(): - passes = (collections.namedtuple('Test', list('abc'))(1, 2, 3), ) - fails = ((1, 2, 3), 'a', Series({'pi': 3.14})) - - for p in passes: - assert inference.is_named_tuple(p) - - for f in fails: - assert not inference.is_named_tuple(f) - - -def test_is_hashable(): - - # all new-style classes are hashable by default - class HashableClass(object): - pass - - class UnhashableClass1(object): - __hash__ = None - - class UnhashableClass2(object): - - def __hash__(self): - raise TypeError("Not hashable") - - hashable = (1, - 3.14, - np.float64(3.14), - 'a', - tuple(), - (1, ), - HashableClass(), ) - not_hashable = ([], UnhashableClass1(), ) - abc_hashable_not_really_hashable = (([], ), UnhashableClass2(), ) - - for i in hashable: - assert inference.is_hashable(i) - for i in not_hashable: - assert not inference.is_hashable(i) - for i in abc_hashable_not_really_hashable: - assert not inference.is_hashable(i) - - # numpy.array is no longer collections.Hashable as of - # https://github.com/numpy/numpy/pull/5326, just test - # is_hashable() - assert not inference.is_hashable(np.array([])) - - # old-style classes in Python 2 don't appear hashable to - # collections.Hashable but also seem to support hash() by default - if PY2: - - class OldStyleClass(): - pass - - c = OldStyleClass() - assert not isinstance(c, collections.Hashable) - assert inference.is_hashable(c) - hash(c) # this will not raise - - -def test_is_re(): - passes = re.compile('ad'), - fails = 'x', 2, 3, object() - - for p in passes: - assert inference.is_re(p) - - for f in fails: - assert not inference.is_re(f) - - -def test_is_recompilable(): - passes = (r'a', u('x'), r'asdf', re.compile('adsf'), u(r'\u2233\s*'), - re.compile(r'')) - fails = 1, [], object() - - for p in passes: - assert inference.is_re_compilable(p) - - for f in fails: - assert not inference.is_re_compilable(f) - - -class TestInference(tm.TestCase): - - def test_infer_dtype_bytes(self): - compare = 'string' if PY2 else 'bytes' - - # string array of bytes - arr = np.array(list('abc'), dtype='S1') - self.assertEqual(lib.infer_dtype(arr), compare) - - # object array of bytes - arr = arr.astype(object) - self.assertEqual(lib.infer_dtype(arr), compare) - - def test_isinf_scalar(self): - # GH 11352 - self.assertTrue(lib.isposinf_scalar(float('inf'))) - self.assertTrue(lib.isposinf_scalar(np.inf)) - self.assertFalse(lib.isposinf_scalar(-np.inf)) - self.assertFalse(lib.isposinf_scalar(1)) - self.assertFalse(lib.isposinf_scalar('a')) - - self.assertTrue(lib.isneginf_scalar(float('-inf'))) - self.assertTrue(lib.isneginf_scalar(-np.inf)) - self.assertFalse(lib.isneginf_scalar(np.inf)) - self.assertFalse(lib.isneginf_scalar(1)) - self.assertFalse(lib.isneginf_scalar('a')) - - def test_maybe_convert_numeric_infinities(self): - # see gh-13274 - infinities = ['inf', 'inF', 'iNf', 'Inf', - 'iNF', 'InF', 'INf', 'INF'] - na_values = set(['', 'NULL', 'nan']) - - pos = np.array(['inf'], dtype=np.float64) - neg = np.array(['-inf'], dtype=np.float64) - - msg = "Unable to parse string" - - for infinity in infinities: - for maybe_int in (True, False): - out = lib.maybe_convert_numeric( - np.array([infinity], dtype=object), - na_values, maybe_int) - tm.assert_numpy_array_equal(out, pos) - - out = lib.maybe_convert_numeric( - np.array(['-' + infinity], dtype=object), - na_values, maybe_int) - tm.assert_numpy_array_equal(out, neg) - - out = lib.maybe_convert_numeric( - np.array([u(infinity)], dtype=object), - na_values, maybe_int) - tm.assert_numpy_array_equal(out, 
pos) - - out = lib.maybe_convert_numeric( - np.array(['+' + infinity], dtype=object), - na_values, maybe_int) - tm.assert_numpy_array_equal(out, pos) - - # too many characters - with tm.assertRaisesRegexp(ValueError, msg): - lib.maybe_convert_numeric( - np.array(['foo_' + infinity], dtype=object), - na_values, maybe_int) - - def test_maybe_convert_numeric_post_floatify_nan(self): - # see gh-13314 - data = np.array(['1.200', '-999.000', '4.500'], dtype=object) - expected = np.array([1.2, np.nan, 4.5], dtype=np.float64) - nan_values = set([-999, -999.0]) - - for coerce_type in (True, False): - out = lib.maybe_convert_numeric(data, nan_values, coerce_type) - tm.assert_numpy_array_equal(out, expected) - - def test_convert_infs(self): - arr = np.array(['inf', 'inf', 'inf'], dtype='O') - result = lib.maybe_convert_numeric(arr, set(), False) - self.assertTrue(result.dtype == np.float64) - - arr = np.array(['-inf', '-inf', '-inf'], dtype='O') - result = lib.maybe_convert_numeric(arr, set(), False) - self.assertTrue(result.dtype == np.float64) - - def test_scientific_no_exponent(self): - # See PR 12215 - arr = np.array(['42E', '2E', '99e', '6e'], dtype='O') - result = lib.maybe_convert_numeric(arr, set(), False, True) - self.assertTrue(np.all(np.isnan(result))) - - def test_convert_non_hashable(self): - # GH13324 - # make sure that we are handing non-hashables - arr = np.array([[10.0, 2], 1.0, 'apple']) - result = lib.maybe_convert_numeric(arr, set(), False, True) - tm.assert_numpy_array_equal(result, np.array([np.nan, 1.0, np.nan])) - - -class TestTypeInference(tm.TestCase): - _multiprocess_can_split_ = True - - def test_length_zero(self): - result = lib.infer_dtype(np.array([], dtype='i4')) - self.assertEqual(result, 'integer') - - result = lib.infer_dtype([]) - self.assertEqual(result, 'empty') - - def test_integers(self): - arr = np.array([1, 2, 3, np.int64(4), np.int32(5)], dtype='O') - result = lib.infer_dtype(arr) - self.assertEqual(result, 'integer') - - arr = np.array([1, 2, 3, np.int64(4), np.int32(5), 'foo'], dtype='O') - result = lib.infer_dtype(arr) - self.assertEqual(result, 'mixed-integer') - - arr = np.array([1, 2, 3, 4, 5], dtype='i4') - result = lib.infer_dtype(arr) - self.assertEqual(result, 'integer') - - def test_bools(self): - arr = np.array([True, False, True, True, True], dtype='O') - result = lib.infer_dtype(arr) - self.assertEqual(result, 'boolean') - - arr = np.array([np.bool_(True), np.bool_(False)], dtype='O') - result = lib.infer_dtype(arr) - self.assertEqual(result, 'boolean') - - arr = np.array([True, False, True, 'foo'], dtype='O') - result = lib.infer_dtype(arr) - self.assertEqual(result, 'mixed') - - arr = np.array([True, False, True], dtype=bool) - result = lib.infer_dtype(arr) - self.assertEqual(result, 'boolean') - - def test_floats(self): - arr = np.array([1., 2., 3., np.float64(4), np.float32(5)], dtype='O') - result = lib.infer_dtype(arr) - self.assertEqual(result, 'floating') - - arr = np.array([1, 2, 3, np.float64(4), np.float32(5), 'foo'], - dtype='O') - result = lib.infer_dtype(arr) - self.assertEqual(result, 'mixed-integer') - - arr = np.array([1, 2, 3, 4, 5], dtype='f4') - result = lib.infer_dtype(arr) - self.assertEqual(result, 'floating') - - arr = np.array([1, 2, 3, 4, 5], dtype='f8') - result = lib.infer_dtype(arr) - self.assertEqual(result, 'floating') - - def test_string(self): - pass - - def test_unicode(self): - pass - - def test_datetime(self): - - dates = [datetime(2012, 1, x) for x in range(1, 20)] - index = Index(dates) - 
self.assertEqual(index.inferred_type, 'datetime64') - - def test_infer_dtype_datetime(self): - - arr = np.array([Timestamp('2011-01-01'), - Timestamp('2011-01-02')]) - self.assertEqual(lib.infer_dtype(arr), 'datetime') - - arr = np.array([np.datetime64('2011-01-01'), - np.datetime64('2011-01-01')], dtype=object) - self.assertEqual(lib.infer_dtype(arr), 'datetime64') - - arr = np.array([datetime(2011, 1, 1), datetime(2012, 2, 1)]) - self.assertEqual(lib.infer_dtype(arr), 'datetime') - - # starts with nan - for n in [pd.NaT, np.nan]: - arr = np.array([n, pd.Timestamp('2011-01-02')]) - self.assertEqual(lib.infer_dtype(arr), 'datetime') - - arr = np.array([n, np.datetime64('2011-01-02')]) - self.assertEqual(lib.infer_dtype(arr), 'datetime64') - - arr = np.array([n, datetime(2011, 1, 1)]) - self.assertEqual(lib.infer_dtype(arr), 'datetime') - - arr = np.array([n, pd.Timestamp('2011-01-02'), n]) - self.assertEqual(lib.infer_dtype(arr), 'datetime') - - arr = np.array([n, np.datetime64('2011-01-02'), n]) - self.assertEqual(lib.infer_dtype(arr), 'datetime64') - - arr = np.array([n, datetime(2011, 1, 1), n]) - self.assertEqual(lib.infer_dtype(arr), 'datetime') - - # different type of nat - arr = np.array([np.timedelta64('nat'), - np.datetime64('2011-01-02')], dtype=object) - self.assertEqual(lib.infer_dtype(arr), 'mixed') - - arr = np.array([np.datetime64('2011-01-02'), - np.timedelta64('nat')], dtype=object) - self.assertEqual(lib.infer_dtype(arr), 'mixed') - - # mixed datetime - arr = np.array([datetime(2011, 1, 1), - pd.Timestamp('2011-01-02')]) - self.assertEqual(lib.infer_dtype(arr), 'datetime') - - # should be datetime? - arr = np.array([np.datetime64('2011-01-01'), - pd.Timestamp('2011-01-02')]) - self.assertEqual(lib.infer_dtype(arr), 'mixed') - - arr = np.array([pd.Timestamp('2011-01-02'), - np.datetime64('2011-01-01')]) - self.assertEqual(lib.infer_dtype(arr), 'mixed') - - arr = np.array([np.nan, pd.Timestamp('2011-01-02'), 1]) - self.assertEqual(lib.infer_dtype(arr), 'mixed-integer') - - arr = np.array([np.nan, pd.Timestamp('2011-01-02'), 1.1]) - self.assertEqual(lib.infer_dtype(arr), 'mixed') - - arr = np.array([np.nan, '2011-01-01', pd.Timestamp('2011-01-02')]) - self.assertEqual(lib.infer_dtype(arr), 'mixed') - - def test_infer_dtype_timedelta(self): - - arr = np.array([pd.Timedelta('1 days'), - pd.Timedelta('2 days')]) - self.assertEqual(lib.infer_dtype(arr), 'timedelta') - - arr = np.array([np.timedelta64(1, 'D'), - np.timedelta64(2, 'D')], dtype=object) - self.assertEqual(lib.infer_dtype(arr), 'timedelta') - - arr = np.array([timedelta(1), timedelta(2)]) - self.assertEqual(lib.infer_dtype(arr), 'timedelta') - - # starts with nan - for n in [pd.NaT, np.nan]: - arr = np.array([n, Timedelta('1 days')]) - self.assertEqual(lib.infer_dtype(arr), 'timedelta') - - arr = np.array([n, np.timedelta64(1, 'D')]) - self.assertEqual(lib.infer_dtype(arr), 'timedelta') - - arr = np.array([n, timedelta(1)]) - self.assertEqual(lib.infer_dtype(arr), 'timedelta') - - arr = np.array([n, pd.Timedelta('1 days'), n]) - self.assertEqual(lib.infer_dtype(arr), 'timedelta') - - arr = np.array([n, np.timedelta64(1, 'D'), n]) - self.assertEqual(lib.infer_dtype(arr), 'timedelta') - - arr = np.array([n, timedelta(1), n]) - self.assertEqual(lib.infer_dtype(arr), 'timedelta') - - # different type of nat - arr = np.array([np.datetime64('nat'), np.timedelta64(1, 'D')], - dtype=object) - self.assertEqual(lib.infer_dtype(arr), 'mixed') - - arr = np.array([np.timedelta64(1, 'D'), np.datetime64('nat')], - dtype=object) - 
self.assertEqual(lib.infer_dtype(arr), 'mixed') - - def test_infer_dtype_period(self): - # GH 13664 - arr = np.array([pd.Period('2011-01', freq='D'), - pd.Period('2011-02', freq='D')]) - self.assertEqual(pd.lib.infer_dtype(arr), 'period') - - arr = np.array([pd.Period('2011-01', freq='D'), - pd.Period('2011-02', freq='M')]) - self.assertEqual(pd.lib.infer_dtype(arr), 'period') - - # starts with nan - for n in [pd.NaT, np.nan]: - arr = np.array([n, pd.Period('2011-01', freq='D')]) - self.assertEqual(pd.lib.infer_dtype(arr), 'period') - - arr = np.array([n, pd.Period('2011-01', freq='D'), n]) - self.assertEqual(pd.lib.infer_dtype(arr), 'period') - - # different type of nat - arr = np.array([np.datetime64('nat'), pd.Period('2011-01', freq='M')], - dtype=object) - self.assertEqual(pd.lib.infer_dtype(arr), 'mixed') - - arr = np.array([pd.Period('2011-01', freq='M'), np.datetime64('nat')], - dtype=object) - self.assertEqual(pd.lib.infer_dtype(arr), 'mixed') - - def test_infer_dtype_all_nan_nat_like(self): - arr = np.array([np.nan, np.nan]) - self.assertEqual(lib.infer_dtype(arr), 'floating') - - # nan and None mix are result in mixed - arr = np.array([np.nan, np.nan, None]) - self.assertEqual(lib.infer_dtype(arr), 'mixed') - - arr = np.array([None, np.nan, np.nan]) - self.assertEqual(lib.infer_dtype(arr), 'mixed') - - # pd.NaT - arr = np.array([pd.NaT]) - self.assertEqual(lib.infer_dtype(arr), 'datetime') - - arr = np.array([pd.NaT, np.nan]) - self.assertEqual(lib.infer_dtype(arr), 'datetime') - - arr = np.array([np.nan, pd.NaT]) - self.assertEqual(lib.infer_dtype(arr), 'datetime') - - arr = np.array([np.nan, pd.NaT, np.nan]) - self.assertEqual(lib.infer_dtype(arr), 'datetime') - - arr = np.array([None, pd.NaT, None]) - self.assertEqual(lib.infer_dtype(arr), 'datetime') - - # np.datetime64(nat) - arr = np.array([np.datetime64('nat')]) - self.assertEqual(lib.infer_dtype(arr), 'datetime64') - - for n in [np.nan, pd.NaT, None]: - arr = np.array([n, np.datetime64('nat'), n]) - self.assertEqual(lib.infer_dtype(arr), 'datetime64') - - arr = np.array([pd.NaT, n, np.datetime64('nat'), n]) - self.assertEqual(lib.infer_dtype(arr), 'datetime64') - - arr = np.array([np.timedelta64('nat')], dtype=object) - self.assertEqual(lib.infer_dtype(arr), 'timedelta') - - for n in [np.nan, pd.NaT, None]: - arr = np.array([n, np.timedelta64('nat'), n]) - self.assertEqual(lib.infer_dtype(arr), 'timedelta') - - arr = np.array([pd.NaT, n, np.timedelta64('nat'), n]) - self.assertEqual(lib.infer_dtype(arr), 'timedelta') - - # datetime / timedelta mixed - arr = np.array([pd.NaT, np.datetime64('nat'), - np.timedelta64('nat'), np.nan]) - self.assertEqual(lib.infer_dtype(arr), 'mixed') - - arr = np.array([np.timedelta64('nat'), np.datetime64('nat')], - dtype=object) - self.assertEqual(lib.infer_dtype(arr), 'mixed') - - def test_is_datetimelike_array_all_nan_nat_like(self): - arr = np.array([np.nan, pd.NaT, np.datetime64('nat')]) - self.assertTrue(lib.is_datetime_array(arr)) - self.assertTrue(lib.is_datetime64_array(arr)) - self.assertFalse(lib.is_timedelta_array(arr)) - self.assertFalse(lib.is_timedelta64_array(arr)) - self.assertFalse(lib.is_timedelta_or_timedelta64_array(arr)) - - arr = np.array([np.nan, pd.NaT, np.timedelta64('nat')]) - self.assertFalse(lib.is_datetime_array(arr)) - self.assertFalse(lib.is_datetime64_array(arr)) - self.assertTrue(lib.is_timedelta_array(arr)) - self.assertTrue(lib.is_timedelta64_array(arr)) - self.assertTrue(lib.is_timedelta_or_timedelta64_array(arr)) - - arr = np.array([np.nan, pd.NaT, 
np.datetime64('nat'), - np.timedelta64('nat')]) - self.assertFalse(lib.is_datetime_array(arr)) - self.assertFalse(lib.is_datetime64_array(arr)) - self.assertFalse(lib.is_timedelta_array(arr)) - self.assertFalse(lib.is_timedelta64_array(arr)) - self.assertFalse(lib.is_timedelta_or_timedelta64_array(arr)) - - arr = np.array([np.nan, pd.NaT]) - self.assertTrue(lib.is_datetime_array(arr)) - self.assertTrue(lib.is_datetime64_array(arr)) - self.assertTrue(lib.is_timedelta_array(arr)) - self.assertTrue(lib.is_timedelta64_array(arr)) - self.assertTrue(lib.is_timedelta_or_timedelta64_array(arr)) - - arr = np.array([np.nan, np.nan], dtype=object) - self.assertFalse(lib.is_datetime_array(arr)) - self.assertFalse(lib.is_datetime64_array(arr)) - self.assertFalse(lib.is_timedelta_array(arr)) - self.assertFalse(lib.is_timedelta64_array(arr)) - self.assertFalse(lib.is_timedelta_or_timedelta64_array(arr)) - - def test_date(self): - - dates = [date(2012, 1, x) for x in range(1, 20)] - index = Index(dates) - self.assertEqual(index.inferred_type, 'date') - - def test_to_object_array_tuples(self): - r = (5, 6) - values = [r] - result = lib.to_object_array_tuples(values) - - try: - # make sure record array works - from collections import namedtuple - record = namedtuple('record', 'x y') - r = record(5, 6) - values = [r] - result = lib.to_object_array_tuples(values) # noqa - except ImportError: - pass - - def test_object(self): - - # GH 7431 - # cannot infer more than this as only a single element - arr = np.array([None], dtype='O') - result = lib.infer_dtype(arr) - self.assertEqual(result, 'mixed') - - def test_to_object_array_width(self): - # see gh-13320 - rows = [[1, 2, 3], [4, 5, 6]] - - expected = np.array(rows, dtype=object) - out = lib.to_object_array(rows) - tm.assert_numpy_array_equal(out, expected) - - expected = np.array(rows, dtype=object) - out = lib.to_object_array(rows, min_width=1) - tm.assert_numpy_array_equal(out, expected) - - expected = np.array([[1, 2, 3, None, None], - [4, 5, 6, None, None]], dtype=object) - out = lib.to_object_array(rows, min_width=5) - tm.assert_numpy_array_equal(out, expected) - - def test_is_period(self): - self.assertTrue(lib.is_period(pd.Period('2011-01', freq='M'))) - self.assertFalse(lib.is_period(pd.PeriodIndex(['2011-01'], freq='M'))) - self.assertFalse(lib.is_period(pd.Timestamp('2011-01'))) - self.assertFalse(lib.is_period(1)) - self.assertFalse(lib.is_period(np.nan)) - - def test_categorical(self): - - # GH 8974 - from pandas import Categorical, Series - arr = Categorical(list('abc')) - result = lib.infer_dtype(arr) - self.assertEqual(result, 'categorical') - - result = lib.infer_dtype(Series(arr)) - self.assertEqual(result, 'categorical') - - arr = Categorical(list('abc'), categories=['cegfab'], ordered=True) - result = lib.infer_dtype(arr) - self.assertEqual(result, 'categorical') - - result = lib.infer_dtype(Series(arr)) - self.assertEqual(result, 'categorical') - - -class TestNumberScalar(tm.TestCase): - - def test_is_number(self): - - self.assertTrue(is_number(True)) - self.assertTrue(is_number(1)) - self.assertTrue(is_number(1.1)) - self.assertTrue(is_number(1 + 3j)) - self.assertTrue(is_number(np.bool(False))) - self.assertTrue(is_number(np.int64(1))) - self.assertTrue(is_number(np.float64(1.1))) - self.assertTrue(is_number(np.complex128(1 + 3j))) - self.assertTrue(is_number(np.nan)) - - self.assertFalse(is_number(None)) - self.assertFalse(is_number('x')) - self.assertFalse(is_number(datetime(2011, 1, 1))) - 
self.assertFalse(is_number(np.datetime64('2011-01-01'))) - self.assertFalse(is_number(Timestamp('2011-01-01'))) - self.assertFalse(is_number(Timestamp('2011-01-01', - tz='US/Eastern'))) - self.assertFalse(is_number(timedelta(1000))) - self.assertFalse(is_number(Timedelta('1 days'))) - - # questionable - self.assertFalse(is_number(np.bool_(False))) - self.assertTrue(is_number(np.timedelta64(1, 'D'))) - - def test_is_bool(self): - self.assertTrue(is_bool(True)) - self.assertTrue(is_bool(np.bool(False))) - self.assertTrue(is_bool(np.bool_(False))) - - self.assertFalse(is_bool(1)) - self.assertFalse(is_bool(1.1)) - self.assertFalse(is_bool(1 + 3j)) - self.assertFalse(is_bool(np.int64(1))) - self.assertFalse(is_bool(np.float64(1.1))) - self.assertFalse(is_bool(np.complex128(1 + 3j))) - self.assertFalse(is_bool(np.nan)) - self.assertFalse(is_bool(None)) - self.assertFalse(is_bool('x')) - self.assertFalse(is_bool(datetime(2011, 1, 1))) - self.assertFalse(is_bool(np.datetime64('2011-01-01'))) - self.assertFalse(is_bool(Timestamp('2011-01-01'))) - self.assertFalse(is_bool(Timestamp('2011-01-01', - tz='US/Eastern'))) - self.assertFalse(is_bool(timedelta(1000))) - self.assertFalse(is_bool(np.timedelta64(1, 'D'))) - self.assertFalse(is_bool(Timedelta('1 days'))) - - def test_is_integer(self): - self.assertTrue(is_integer(1)) - self.assertTrue(is_integer(np.int64(1))) - - self.assertFalse(is_integer(True)) - self.assertFalse(is_integer(1.1)) - self.assertFalse(is_integer(1 + 3j)) - self.assertFalse(is_integer(np.bool(False))) - self.assertFalse(is_integer(np.bool_(False))) - self.assertFalse(is_integer(np.float64(1.1))) - self.assertFalse(is_integer(np.complex128(1 + 3j))) - self.assertFalse(is_integer(np.nan)) - self.assertFalse(is_integer(None)) - self.assertFalse(is_integer('x')) - self.assertFalse(is_integer(datetime(2011, 1, 1))) - self.assertFalse(is_integer(np.datetime64('2011-01-01'))) - self.assertFalse(is_integer(Timestamp('2011-01-01'))) - self.assertFalse(is_integer(Timestamp('2011-01-01', - tz='US/Eastern'))) - self.assertFalse(is_integer(timedelta(1000))) - self.assertFalse(is_integer(Timedelta('1 days'))) - - # questionable - self.assertTrue(is_integer(np.timedelta64(1, 'D'))) - - def test_is_float(self): - self.assertTrue(is_float(1.1)) - self.assertTrue(is_float(np.float64(1.1))) - self.assertTrue(is_float(np.nan)) - - self.assertFalse(is_float(True)) - self.assertFalse(is_float(1)) - self.assertFalse(is_float(1 + 3j)) - self.assertFalse(is_float(np.bool(False))) - self.assertFalse(is_float(np.bool_(False))) - self.assertFalse(is_float(np.int64(1))) - self.assertFalse(is_float(np.complex128(1 + 3j))) - self.assertFalse(is_float(None)) - self.assertFalse(is_float('x')) - self.assertFalse(is_float(datetime(2011, 1, 1))) - self.assertFalse(is_float(np.datetime64('2011-01-01'))) - self.assertFalse(is_float(Timestamp('2011-01-01'))) - self.assertFalse(is_float(Timestamp('2011-01-01', - tz='US/Eastern'))) - self.assertFalse(is_float(timedelta(1000))) - self.assertFalse(is_float(np.timedelta64(1, 'D'))) - self.assertFalse(is_float(Timedelta('1 days'))) - - def test_is_timedelta(self): - self.assertTrue(is_timedelta64_dtype('timedelta64')) - self.assertTrue(is_timedelta64_dtype('timedelta64[ns]')) - self.assertFalse(is_timedelta64_ns_dtype('timedelta64')) - self.assertTrue(is_timedelta64_ns_dtype('timedelta64[ns]')) - - tdi = TimedeltaIndex([1e14, 2e14], dtype='timedelta64') - self.assertTrue(is_timedelta64_dtype(tdi)) - self.assertTrue(is_timedelta64_ns_dtype(tdi)) - 
self.assertTrue(is_timedelta64_ns_dtype(tdi.astype('timedelta64[ns]'))) - - # Conversion to Int64Index: - self.assertFalse(is_timedelta64_ns_dtype(tdi.astype('timedelta64'))) - self.assertFalse(is_timedelta64_ns_dtype(tdi.astype('timedelta64[h]'))) - - -class Testisscalar(tm.TestCase): - - def test_isscalar_builtin_scalars(self): - self.assertTrue(is_scalar(None)) - self.assertTrue(is_scalar(True)) - self.assertTrue(is_scalar(False)) - self.assertTrue(is_scalar(0.)) - self.assertTrue(is_scalar(np.nan)) - self.assertTrue(is_scalar('foobar')) - self.assertTrue(is_scalar(b'foobar')) - self.assertTrue(is_scalar(u('efoobar'))) - self.assertTrue(is_scalar(datetime(2014, 1, 1))) - self.assertTrue(is_scalar(date(2014, 1, 1))) - self.assertTrue(is_scalar(time(12, 0))) - self.assertTrue(is_scalar(timedelta(hours=1))) - self.assertTrue(is_scalar(pd.NaT)) - - def test_isscalar_builtin_nonscalars(self): - self.assertFalse(is_scalar({})) - self.assertFalse(is_scalar([])) - self.assertFalse(is_scalar([1])) - self.assertFalse(is_scalar(())) - self.assertFalse(is_scalar((1, ))) - self.assertFalse(is_scalar(slice(None))) - self.assertFalse(is_scalar(Ellipsis)) - - def test_isscalar_numpy_array_scalars(self): - self.assertTrue(is_scalar(np.int64(1))) - self.assertTrue(is_scalar(np.float64(1.))) - self.assertTrue(is_scalar(np.int32(1))) - self.assertTrue(is_scalar(np.object_('foobar'))) - self.assertTrue(is_scalar(np.str_('foobar'))) - self.assertTrue(is_scalar(np.unicode_(u('foobar')))) - self.assertTrue(is_scalar(np.bytes_(b'foobar'))) - self.assertTrue(is_scalar(np.datetime64('2014-01-01'))) - self.assertTrue(is_scalar(np.timedelta64(1, 'h'))) - - def test_isscalar_numpy_zerodim_arrays(self): - for zerodim in [np.array(1), np.array('foobar'), - np.array(np.datetime64('2014-01-01')), - np.array(np.timedelta64(1, 'h')), - np.array(np.datetime64('NaT'))]: - self.assertFalse(is_scalar(zerodim)) - self.assertTrue(is_scalar(lib.item_from_zerodim(zerodim))) - - def test_isscalar_numpy_arrays(self): - self.assertFalse(is_scalar(np.array([]))) - self.assertFalse(is_scalar(np.array([[]]))) - self.assertFalse(is_scalar(np.matrix('1; 2'))) - - def test_isscalar_pandas_scalars(self): - self.assertTrue(is_scalar(Timestamp('2014-01-01'))) - self.assertTrue(is_scalar(Timedelta(hours=1))) - self.assertTrue(is_scalar(Period('2014-01-01'))) - - def test_lisscalar_pandas_containers(self): - self.assertFalse(is_scalar(Series())) - self.assertFalse(is_scalar(Series([1]))) - self.assertFalse(is_scalar(DataFrame())) - self.assertFalse(is_scalar(DataFrame([[1]]))) - self.assertFalse(is_scalar(Panel())) - self.assertFalse(is_scalar(Panel([[[1]]]))) - self.assertFalse(is_scalar(Index([]))) - self.assertFalse(is_scalar(Index([1]))) - - -def test_datetimeindex_from_empty_datetime64_array(): - for unit in ['ms', 'us', 'ns']: - idx = DatetimeIndex(np.array([], dtype='datetime64[%s]' % unit)) - assert (len(idx) == 0) - - -def test_nan_to_nat_conversions(): - - df = DataFrame(dict({ - 'A': np.asarray( - lrange(10), dtype='float64'), - 'B': Timestamp('20010101') - })) - df.iloc[3:6, :] = np.nan - result = df.loc[4, 'B'].value - assert (result == tslib.iNaT) - - s = df['B'].copy() - s._data = s._data.setitem(indexer=tuple([slice(8, 9)]), value=np.nan) - assert (isnull(s[8])) - - # numpy < 1.7.0 is wrong - from distutils.version import LooseVersion - if LooseVersion(np.__version__) >= '1.7.0': - assert (s[8].value == np.datetime64('NaT').astype(np.int64)) - - -def test_ensure_int32(): - values = np.arange(10, dtype=np.int32) - result = 
_ensure_int32(values) - assert (result.dtype == np.int32) - - values = np.arange(10, dtype=np.int64) - result = _ensure_int32(values) - assert (result.dtype == np.int32) - - -def test_ensure_categorical(): - values = np.arange(10, dtype=np.int32) - result = _ensure_categorical(values) - assert (result.dtype == 'category') - - values = Categorical(values) - result = _ensure_categorical(values) - tm.assert_categorical_equal(result, values) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tests/util/__init__.py b/pandas/tests/util/__init__.py new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/pandas/tools/tests/test_hashing.py b/pandas/tests/util/test_hashing.py similarity index 51% rename from pandas/tools/tests/test_hashing.py rename to pandas/tests/util/test_hashing.py index 6e5f30fb7a52d..e1e6e43529a7d 100644 --- a/pandas/tools/tests/test_hashing.py +++ b/pandas/tests/util/test_hashing.py @@ -1,16 +1,18 @@ +import pytest + +from warnings import catch_warnings import numpy as np import pandas as pd -from pandas import DataFrame, Series, Index -from pandas.tools.hashing import hash_array, hash_pandas_object +from pandas import DataFrame, Series, Index, MultiIndex +from pandas.util import hash_array, hash_pandas_object +from pandas.core.util.hashing import hash_tuples import pandas.util.testing as tm -class TestHashing(tm.TestCase): - - _multiprocess_can_split_ = True +class TestHashing(object): - def setUp(self): + def setup_method(self, method): self.df = DataFrame( {'i32': np.array([1, 2, 3] * 3, dtype='int32'), 'f32': np.array([None, 2.5, 3.5] * 3, dtype='float32'), @@ -36,6 +38,18 @@ def test_hash_array(self): a = s.values tm.assert_numpy_array_equal(hash_array(a), hash_array(a)) + def test_hash_array_mixed(self): + result1 = hash_array(np.array([3, 4, 'All'])) + result2 = hash_array(np.array(['3', '4', 'All'])) + result3 = hash_array(np.array([3, 4, 'All'], dtype=object)) + tm.assert_numpy_array_equal(result1, result2) + tm.assert_numpy_array_equal(result1, result3) + + def test_hash_array_errors(self): + + for val in [5, 'foo', pd.Timestamp('20130101')]: + pytest.raises(TypeError, hash_array, val) + def check_equal(self, obj, **kwargs): a = hash_pandas_object(obj, **kwargs) b = hash_pandas_object(obj, **kwargs) @@ -53,7 +67,58 @@ def check_not_equal_with_index(self, obj): if not isinstance(obj, Index): a = hash_pandas_object(obj, index=True) b = hash_pandas_object(obj, index=False) - self.assertFalse((a == b).all()) + if len(obj): + assert not (a == b).all() + + def test_hash_tuples(self): + tups = [(1, 'one'), (1, 'two'), (2, 'one')] + result = hash_tuples(tups) + expected = hash_pandas_object(MultiIndex.from_tuples(tups)).values + tm.assert_numpy_array_equal(result, expected) + + result = hash_tuples(tups[0]) + assert result == expected[0] + + def test_hash_tuples_err(self): + + for val in [5, 'foo', pd.Timestamp('20130101')]: + pytest.raises(TypeError, hash_tuples, val) + + def test_multiindex_unique(self): + mi = MultiIndex.from_tuples([(118, 472), (236, 118), + (51, 204), (102, 51)]) + assert mi.is_unique + result = hash_pandas_object(mi) + assert result.is_unique + + def test_multiindex_objects(self): + mi = MultiIndex(levels=[['b', 'd', 'a'], [1, 2, 3]], + labels=[[0, 1, 0, 2], [2, 0, 0, 1]], + names=['col1', 'col2']) + recons = mi._sort_levels_monotonic() + + # these are equal + assert mi.equals(recons) + assert Index(mi.values).equals(Index(recons.values)) + + # 
_hashed_values and hash_pandas_object(..., index=False) + # equivalency + expected = hash_pandas_object( + mi, index=False).values + result = mi._hashed_values + tm.assert_numpy_array_equal(result, expected) + + expected = hash_pandas_object( + recons, index=False).values + result = recons._hashed_values + tm.assert_numpy_array_equal(result, expected) + + expected = mi._hashed_values + result = recons._hashed_values + + # values should match, but in different order + tm.assert_numpy_array_equal(np.sort(result), + np.sort(expected)) def test_hash_pandas_object(self): @@ -65,14 +130,27 @@ def test_hash_pandas_object(self): Series(['a', np.nan, 'c']), Series(['a', None, 'c']), Series([True, False, True]), + Series(), Index([1, 2, 3]), Index([True, False, True]), DataFrame({'x': ['a', 'b', 'c'], 'y': [1, 2, 3]}), + DataFrame(), tm.makeMissingDataframe(), tm.makeMixedDataFrame(), tm.makeTimeDataFrame(), tm.makeTimeSeries(), - tm.makeTimedeltaIndex()]: + tm.makeTimedeltaIndex(), + tm.makePeriodIndex(), + Series(tm.makePeriodIndex()), + Series(pd.date_range('20130101', + periods=3, tz='US/Eastern')), + MultiIndex.from_product( + [range(5), + ['foo', 'bar', 'baz'], + pd.date_range('20130101', periods=2)]), + MultiIndex.from_product( + [pd.CategoricalIndex(list('aabc')), + range(3)])]: self.check_equal(obj) self.check_not_equal_with_index(obj) @@ -90,13 +168,45 @@ def test_hash_pandas_empty_object(self): # these are by-definition the same with # or w/o the index as the data is empty - def test_errors(self): - - for obj in [pd.Timestamp('20130101'), tm.makePanel()]: - def f(): - hash_pandas_object(f) - - self.assertRaises(TypeError, f) + def test_categorical_consistency(self): + # GH15143 + # Check that categoricals hash consistent with their values, not codes + # This should work for categoricals of any dtype + for s1 in [Series(['a', 'b', 'c', 'd']), + Series([1000, 2000, 3000, 4000]), + Series(pd.date_range(0, periods=4))]: + s2 = s1.astype('category').cat.set_categories(s1) + s3 = s2.cat.set_categories(list(reversed(s1))) + for categorize in [True, False]: + # These should all hash identically + h1 = hash_pandas_object(s1, categorize=categorize) + h2 = hash_pandas_object(s2, categorize=categorize) + h3 = hash_pandas_object(s3, categorize=categorize) + tm.assert_series_equal(h1, h2) + tm.assert_series_equal(h1, h3) + + def test_categorical_with_nan_consistency(self): + c = pd.Categorical.from_codes( + [-1, 0, 1, 2, 3, 4], + categories=pd.date_range('2012-01-01', periods=5, name='B')) + expected = hash_array(c, categorize=False) + c = pd.Categorical.from_codes( + [-1, 0], + categories=[pd.Timestamp('2012-01-01')]) + result = hash_array(c, categorize=False) + assert result[0] in expected + assert result[1] in expected + + def test_pandas_errors(self): + + for obj in [pd.Timestamp('20130101')]: + with pytest.raises(TypeError): + hash_pandas_object(obj) + + with catch_warnings(record=True): + obj = tm.makePanel() + with pytest.raises(TypeError): + hash_pandas_object(obj) def test_hash_keys(self): # using different hash keys, should have different hashes @@ -106,30 +216,13 @@ def test_hash_keys(self): obj = Series(list('abc')) a = hash_pandas_object(obj, hash_key='9876543210123456') b = hash_pandas_object(obj, hash_key='9876543210123465') - self.assertTrue((a != b).all()) + assert (a != b).all() def test_invalid_key(self): # this only matters for object dtypes def f(): hash_pandas_object(Series(list('abc')), hash_key='foo') - self.assertRaises(ValueError, f) - - def test_unsupported_objects(self): - - 
# mixed objects are not supported - obj = Series(['1', 2, 3]) - - def f(): - hash_pandas_object(obj) - self.assertRaises(TypeError, f) - - # MultiIndex are represented as tuples - obj = Series([1, 2, 3], index=pd.MultiIndex.from_tuples( - [('a', 1), ('a', 2), ('b', 1)])) - - def f(): - hash_pandas_object(obj) - self.assertRaises(TypeError, f) + pytest.raises(ValueError, f) def test_alread_encoded(self): # if already encoded then ok @@ -148,13 +241,13 @@ def test_same_len_hash_collisions(self): length = 2**(l + 8) + 1 s = tm.rands_array(length, 2) result = hash_array(s, 'utf8') - self.assertFalse(result[0] == result[1]) + assert not result[0] == result[1] for l in range(8): length = 2**(l + 8) s = tm.rands_array(length, 2) result = hash_array(s, 'utf8') - self.assertFalse(result[0] == result[1]) + assert not result[0] == result[1] def test_hash_collisions(self): @@ -166,12 +259,27 @@ def test_hash_collisions(self): # these should be different! result1 = hash_array(np.asarray(L[0:1], dtype=object), 'utf8') expected1 = np.array([14963968704024874985], dtype=np.uint64) - self.assert_numpy_array_equal(result1, expected1) + tm.assert_numpy_array_equal(result1, expected1) result2 = hash_array(np.asarray(L[1:2], dtype=object), 'utf8') expected2 = np.array([16428432627716348016], dtype=np.uint64) - self.assert_numpy_array_equal(result2, expected2) + tm.assert_numpy_array_equal(result2, expected2) result = hash_array(np.asarray(L, dtype=object), 'utf8') - self.assert_numpy_array_equal( + tm.assert_numpy_array_equal( result, np.concatenate([expected1, expected2], axis=0)) + + +def test_deprecation(): + + with tm.assert_produces_warning(DeprecationWarning, + check_stacklevel=False): + from pandas.tools.hashing import hash_pandas_object + obj = Series(list('abc')) + hash_pandas_object(obj, hash_key='9876543210123456') + + with tm.assert_produces_warning(DeprecationWarning, + check_stacklevel=False): + from pandas.tools.hashing import hash_array + obj = np.array([1, 2, 3]) + hash_array(obj, hash_key='9876543210123456') diff --git a/pandas/tests/test_testing.py b/pandas/tests/util/test_testing.py similarity index 73% rename from pandas/tests/test_testing.py rename to pandas/tests/util/test_testing.py index 7a217ed9dbd86..fe7c3b99987f5 100644 --- a/pandas/tests/test_testing.py +++ b/pandas/tests/util/test_testing.py @@ -1,33 +1,26 @@ -#!/usr/bin/python # -*- coding: utf-8 -*- import pandas as pd -import unittest -import nose +import pytest import numpy as np import sys from pandas import Series, DataFrame import pandas.util.testing as tm -from pandas.util.testing import (assert_almost_equal, assertRaisesRegexp, - raise_with_traceback, assert_index_equal, - assert_series_equal, assert_frame_equal, - assert_numpy_array_equal, - RNGContext, assertRaises, - skip_if_no_package_deco) +from pandas.util.testing import (assert_almost_equal, raise_with_traceback, + assert_index_equal, assert_series_equal, + assert_frame_equal, assert_numpy_array_equal, + RNGContext) from pandas.compat import is_platform_windows -# let's get meta. 
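Besides the file relocation, these hunks migrate the module from `assertRaisesRegexp` to the `tm.assert_raises_regex` context manager. A rough sketch of the converted idiom, using a message fragment taken from the expected strings below (pandas 0.20-era API; treat the exact wording as version-dependent):

```python
# Sketch only: the pytest-style assertion idiom these hunks convert to.
import numpy as np
import pandas.util.testing as tm
from pandas.util.testing import assert_numpy_array_equal

# comparing arrays of unequal shape raises with a descriptive message
with tm.assert_raises_regex(AssertionError, 'shapes are different'):
    assert_numpy_array_equal(np.array([1, 2]), np.array([3, 4, 5]))
```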
- -class TestAssertAlmostEqual(tm.TestCase): - _multiprocess_can_split_ = True +class TestAssertAlmostEqual(object): def _assert_almost_equal_both(self, a, b, **kwargs): assert_almost_equal(a, b, **kwargs) assert_almost_equal(b, a, **kwargs) def _assert_not_almost_equal_both(self, a, b, **kwargs): - self.assertRaises(AssertionError, assert_almost_equal, a, b, **kwargs) - self.assertRaises(AssertionError, assert_almost_equal, b, a, **kwargs) + pytest.raises(AssertionError, assert_almost_equal, a, b, **kwargs) + pytest.raises(AssertionError, assert_almost_equal, b, a, **kwargs) def test_assert_almost_equal_numbers(self): self._assert_almost_equal_both(1.1, 1.1) @@ -133,12 +126,12 @@ def test_assert_almost_equal_inf(self): dtype=np.object_)) def test_assert_almost_equal_pandas(self): - self.assert_almost_equal(pd.Index([1., 1.1]), - pd.Index([1., 1.100001])) - self.assert_almost_equal(pd.Series([1., 1.1]), - pd.Series([1., 1.100001])) - self.assert_almost_equal(pd.DataFrame({'a': [1., 1.1]}), - pd.DataFrame({'a': [1., 1.100001]})) + tm.assert_almost_equal(pd.Index([1., 1.1]), + pd.Index([1., 1.100001])) + tm.assert_almost_equal(pd.Series([1., 1.1]), + pd.Series([1., 1.100001])) + tm.assert_almost_equal(pd.DataFrame({'a': [1., 1.1]}), + pd.DataFrame({'a': [1., 1.100001]})) def test_assert_almost_equal_object(self): a = [pd.Timestamp('2011-01-01'), pd.Timestamp('2011-01-01')] @@ -146,17 +139,16 @@ def test_assert_almost_equal_object(self): self._assert_almost_equal_both(a, b) -class TestUtilTesting(tm.TestCase): - _multiprocess_can_split_ = True +class TestUtilTesting(object): def test_raise_with_traceback(self): - with assertRaisesRegexp(LookupError, "error_text"): + with tm.assert_raises_regex(LookupError, "error_text"): try: raise ValueError("THIS IS AN ERROR") except ValueError as e: e = LookupError("error_text") raise_with_traceback(e) - with assertRaisesRegexp(LookupError, "error_text"): + with tm.assert_raises_regex(LookupError, "error_text"): try: raise ValueError("This is another error") except ValueError: @@ -165,13 +157,13 @@ def test_raise_with_traceback(self): raise_with_traceback(e, traceback) -class TestAssertNumpyArrayEqual(tm.TestCase): +class TestAssertNumpyArrayEqual(object): def test_numpy_array_equal_message(self): if is_platform_windows(): - raise nose.SkipTest("windows has incomparable line-endings " - "and uses L on the shape") + pytest.skip("windows has incomparable line-endings " + "and uses L on the shape") expected = """numpy array are different @@ -179,18 +171,18 @@ def test_numpy_array_equal_message(self): \\[left\\]: \\(2,\\) \\[right\\]: \\(3,\\)""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_numpy_array_equal(np.array([1, 2]), np.array([3, 4, 5])) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_almost_equal(np.array([1, 2]), np.array([3, 4, 5])) # scalar comparison expected = """Expected type """ - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_numpy_array_equal(1, 2) expected = """expected 2\\.00000 but got 1\\.00000, with decimal 5""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_almost_equal(1, 2) # array / scalar array comparison @@ -200,10 +192,10 @@ def test_numpy_array_equal_message(self): \\[left\\]: ndarray \\[right\\]: int""" - with assertRaisesRegexp(AssertionError, 
expected): + with tm.assert_raises_regex(AssertionError, expected): # numpy_array_equal only accepts np.ndarray assert_numpy_array_equal(np.array([1]), 1) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_almost_equal(np.array([1]), 1) # scalar / array comparison @@ -213,9 +205,9 @@ def test_numpy_array_equal_message(self): \\[left\\]: int \\[right\\]: ndarray""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_numpy_array_equal(1, np.array([1])) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_almost_equal(1, np.array([1])) expected = """numpy array are different @@ -224,10 +216,10 @@ def test_numpy_array_equal_message(self): \\[left\\]: \\[nan, 2\\.0, 3\\.0\\] \\[right\\]: \\[1\\.0, nan, 3\\.0\\]""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_numpy_array_equal(np.array([np.nan, 2, 3]), np.array([1, np.nan, 3])) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_almost_equal(np.array([np.nan, 2, 3]), np.array([1, np.nan, 3])) @@ -237,9 +229,9 @@ def test_numpy_array_equal_message(self): \\[left\\]: \\[1, 2\\] \\[right\\]: \\[1, 3\\]""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_numpy_array_equal(np.array([1, 2]), np.array([1, 3])) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_almost_equal(np.array([1, 2]), np.array([1, 3])) expected = """numpy array are different @@ -248,7 +240,7 @@ def test_numpy_array_equal_message(self): \\[left\\]: \\[1\\.1, 2\\.000001\\] \\[right\\]: \\[1\\.1, 2.0\\]""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_numpy_array_equal( np.array([1.1, 2.000001]), np.array([1.1, 2.0])) @@ -261,10 +253,10 @@ def test_numpy_array_equal_message(self): \\[left\\]: \\[\\[1, 2\\], \\[3, 4\\], \\[5, 6\\]\\] \\[right\\]: \\[\\[1, 3\\], \\[3, 4\\], \\[5, 6\\]\\]""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_numpy_array_equal(np.array([[1, 2], [3, 4], [5, 6]]), np.array([[1, 3], [3, 4], [5, 6]])) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_almost_equal(np.array([[1, 2], [3, 4], [5, 6]]), np.array([[1, 3], [3, 4], [5, 6]])) @@ -274,10 +266,10 @@ def test_numpy_array_equal_message(self): \\[left\\]: \\[\\[1, 2\\], \\[3, 4\\]\\] \\[right\\]: \\[\\[1, 3\\], \\[3, 4\\]\\]""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_numpy_array_equal(np.array([[1, 2], [3, 4]]), np.array([[1, 3], [3, 4]])) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_almost_equal(np.array([[1, 2], [3, 4]]), np.array([[1, 3], [3, 4]])) @@ -288,18 +280,18 @@ def test_numpy_array_equal_message(self): \\[left\\]: \\(2,\\) \\[right\\]: \\(3,\\)""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_numpy_array_equal(np.array([1, 2]), np.array([3, 4, 5]), obj='Index') - with assertRaisesRegexp(AssertionError, expected): + with 
tm.assert_raises_regex(AssertionError, expected): assert_almost_equal(np.array([1, 2]), np.array([3, 4, 5]), obj='Index') def test_numpy_array_equal_object_message(self): if is_platform_windows(): - raise nose.SkipTest("windows has incomparable line-endings " - "and uses L on the shape") + pytest.skip("windows has incomparable line-endings " + "and uses L on the shape") a = np.array([pd.Timestamp('2011-01-01'), pd.Timestamp('2011-01-01')]) b = np.array([pd.Timestamp('2011-01-01'), pd.Timestamp('2011-01-02')]) @@ -310,9 +302,9 @@ def test_numpy_array_equal_object_message(self): \\[left\\]: \\[2011-01-01 00:00:00, 2011-01-01 00:00:00\\] \\[right\\]: \\[2011-01-01 00:00:00, 2011-01-02 00:00:00\\]""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_numpy_array_equal(a, b) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_almost_equal(a, b) def test_numpy_array_equal_copy_flag(self): @@ -320,10 +312,10 @@ def test_numpy_array_equal_copy_flag(self): b = a.copy() c = a.view() expected = r'array\(\[1, 2, 3\]\) is not array\(\[1, 2, 3\]\)' - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_numpy_array_equal(a, b, check_same='same') expected = r'array\(\[1, 2, 3\]\) is array\(\[1, 2, 3\]\)' - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_numpy_array_equal(a, c, check_same='copy') def test_assert_almost_equal_iterable_message(self): @@ -334,7 +326,7 @@ def test_assert_almost_equal_iterable_message(self): \\[left\\]: 2 \\[right\\]: 3""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_almost_equal([1, 2], [3, 4, 5]) expected = """Iterable are different @@ -343,12 +335,11 @@ def test_assert_almost_equal_iterable_message(self): \\[left\\]: \\[1, 2\\] \\[right\\]: \\[1, 3\\]""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_almost_equal([1, 2], [1, 3]) -class TestAssertIndexEqual(unittest.TestCase): - _multiprocess_can_split_ = True +class TestAssertIndexEqual(object): def test_index_equal_message(self): @@ -362,7 +353,7 @@ def test_index_equal_message(self): idx1 = pd.Index([1, 2, 3]) idx2 = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 3), ('B', 4)]) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2, exact=False) expected = """MultiIndex level \\[1\\] are different @@ -375,9 +366,9 @@ def test_index_equal_message(self): ('B', 3), ('B', 4)]) idx2 = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 3), ('B', 4)]) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2, check_exact=False) expected = """Index are different @@ -388,9 +379,9 @@ def test_index_equal_message(self): idx1 = pd.Index([1, 2, 3]) idx2 = pd.Index([1, 2, 3, 4]) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): 
assert_index_equal(idx1, idx2, check_exact=False) expected = """Index are different @@ -401,9 +392,9 @@ def test_index_equal_message(self): idx1 = pd.Index([1, 2, 3]) idx2 = pd.Index([1, 2, 3.0]) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2, exact=True) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2, exact=True, check_exact=False) expected = """Index are different @@ -414,7 +405,7 @@ def test_index_equal_message(self): idx1 = pd.Index([1, 2, 3.]) idx2 = pd.Index([1, 2, 3.0000000001]) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2) # must succeed @@ -428,9 +419,9 @@ def test_index_equal_message(self): idx1 = pd.Index([1, 2, 3.]) idx2 = pd.Index([1, 2, 3.0001]) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2, check_exact=False) # must succeed assert_index_equal(idx1, idx2, check_exact=False, @@ -444,9 +435,9 @@ def test_index_equal_message(self): idx1 = pd.Index([1, 2, 3]) idx2 = pd.Index([1, 2, 4]) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2, check_less_precise=True) expected = """MultiIndex level \\[1\\] are different @@ -459,9 +450,9 @@ def test_index_equal_message(self): ('B', 3), ('B', 4)]) idx2 = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 3), ('B', 4)]) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2, check_exact=False) def test_index_equal_metadata_message(self): @@ -474,7 +465,7 @@ def test_index_equal_metadata_message(self): idx1 = pd.Index([1, 2, 3]) idx2 = pd.Index([1, 2, 3], name='x') - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2) # same name, should pass @@ -491,20 +482,19 @@ def test_index_equal_metadata_message(self): idx1 = pd.Index([1, 2, 3], name=np.nan) idx2 = pd.Index([1, 2, 3], name=pd.NaT) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_index_equal(idx1, idx2) -class TestAssertSeriesEqual(tm.TestCase): - _multiprocess_can_split_ = True +class TestAssertSeriesEqual(object): def _assert_equal(self, x, y, **kwargs): assert_series_equal(x, y, **kwargs) assert_series_equal(y, x, **kwargs) def _assert_not_equal(self, a, b, **kwargs): - self.assertRaises(AssertionError, assert_series_equal, a, b, **kwargs) - self.assertRaises(AssertionError, assert_series_equal, b, a, **kwargs) + pytest.raises(AssertionError, assert_series_equal, a, b, **kwargs) + pytest.raises(AssertionError, assert_series_equal, b, a, **kwargs) def test_equal(self): self._assert_equal(Series(range(3)), Series(range(3))) @@ -528,27 +518,27 @@ def test_less_precise(self): s1
= Series([0.12345], dtype='float64') s2 = Series([0.12346], dtype='float64') - self.assertRaises(AssertionError, assert_series_equal, s1, s2) + pytest.raises(AssertionError, assert_series_equal, s1, s2) self._assert_equal(s1, s2, check_less_precise=True) for i in range(4): self._assert_equal(s1, s2, check_less_precise=i) - self.assertRaises(AssertionError, assert_series_equal, s1, s2, 10) + pytest.raises(AssertionError, assert_series_equal, s1, s2, 10) s1 = Series([0.12345], dtype='float32') s2 = Series([0.12346], dtype='float32') - self.assertRaises(AssertionError, assert_series_equal, s1, s2) + pytest.raises(AssertionError, assert_series_equal, s1, s2) self._assert_equal(s1, s2, check_less_precise=True) for i in range(4): self._assert_equal(s1, s2, check_less_precise=i) - self.assertRaises(AssertionError, assert_series_equal, s1, s2, 10) + pytest.raises(AssertionError, assert_series_equal, s1, s2, 10) # even less than less precise s1 = Series([0.1235], dtype='float32') s2 = Series([0.1236], dtype='float32') - self.assertRaises(AssertionError, assert_series_equal, s1, s2) - self.assertRaises(AssertionError, assert_series_equal, s1, s2, True) + pytest.raises(AssertionError, assert_series_equal, s1, s2) + pytest.raises(AssertionError, assert_series_equal, s1, s2, True) def test_index_dtype(self): df1 = DataFrame.from_records( @@ -574,7 +564,7 @@ def test_series_equal_message(self): \\[left\\]: 3, RangeIndex\\(start=0, stop=3, step=1\\) \\[right\\]: 4, RangeIndex\\(start=0, stop=4, step=1\\)""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_series_equal(pd.Series([1, 2, 3]), pd.Series([1, 2, 3, 4])) expected = """Series are different @@ -583,23 +573,36 @@ def test_series_equal_message(self): \\[left\\]: \\[1, 2, 3\\] \\[right\\]: \\[1, 2, 4\\]""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_series_equal(pd.Series([1, 2, 3]), pd.Series([1, 2, 4])) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_series_equal(pd.Series([1, 2, 3]), pd.Series([1, 2, 4]), check_less_precise=True) -class TestAssertFrameEqual(tm.TestCase): - _multiprocess_can_split_ = True +class TestAssertFrameEqual(object): def _assert_equal(self, x, y, **kwargs): assert_frame_equal(x, y, **kwargs) assert_frame_equal(y, x, **kwargs) def _assert_not_equal(self, a, b, **kwargs): - self.assertRaises(AssertionError, assert_frame_equal, a, b, **kwargs) - self.assertRaises(AssertionError, assert_frame_equal, b, a, **kwargs) + pytest.raises(AssertionError, assert_frame_equal, a, b, **kwargs) + pytest.raises(AssertionError, assert_frame_equal, b, a, **kwargs) + + def test_equal_with_different_row_order(self): + # check_like=True ignores row-column orderings + df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, + index=['a', 'b', 'c']) + df2 = pd.DataFrame({'A': [3, 2, 1], 'B': [6, 5, 4]}, + index=['c', 'b', 'a']) + + self._assert_equal(df1, df2, check_like=True) + self._assert_not_equal(df1, df2) + + def test_not_equal_with_different_shape(self): + self._assert_not_equal(pd.DataFrame({'A': [1, 2, 3]}), + pd.DataFrame({'A': [1, 2, 3, 4]})) def test_index_dtype(self): df1 = DataFrame.from_records( @@ -628,21 +631,11 @@ def test_frame_equal_message(self): expected = """DataFrame are different -DataFrame shape \\(number of rows\\) are different -\\[left\\]: 3, RangeIndex\\(start=0, stop=3, step=1\\) -\\[right\\]: 4, 
RangeIndex\\(start=0, stop=4, step=1\\)""" - - with assertRaisesRegexp(AssertionError, expected): - assert_frame_equal(pd.DataFrame({'A': [1, 2, 3]}), - pd.DataFrame({'A': [1, 2, 3, 4]})) +DataFrame shape mismatch +\\[left\\]: \\(3, 2\\) +\\[right\\]: \\(3, 1\\)""" - expected = """DataFrame are different - -DataFrame shape \\(number of columns\\) are different -\\[left\\]: 2, Index\\(\\[u?'A', u?'B'\\], dtype='object'\\) -\\[right\\]: 1, Index\\(\\[u?'A'\\], dtype='object'\\)""" - - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_frame_equal(pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}), pd.DataFrame({'A': [1, 2, 3]})) @@ -652,7 +645,7 @@ def test_frame_equal_message(self): \\[left\\]: Index\\(\\[u?'a', u?'b', u?'c'\\], dtype='object'\\) \\[right\\]: Index\\(\\[u?'a', u?'b', u?'d'\\], dtype='object'\\)""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_frame_equal(pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']), pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, @@ -664,7 +657,7 @@ def test_frame_equal_message(self): \\[left\\]: Index\\(\\[u?'A', u?'B'\\], dtype='object'\\) \\[right\\]: Index\\(\\[u?'A', u?'b'\\], dtype='object'\\)""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_frame_equal(pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']), pd.DataFrame({'A': [1, 2, 3], 'b': [4, 5, 6]}, @@ -676,33 +669,17 @@ def test_frame_equal_message(self): \\[left\\]: \\[4, 5, 6\\] \\[right\\]: \\[4, 5, 7\\]""" - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_frame_equal(pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}), pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 7]})) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): assert_frame_equal(pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}), pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 7]}), by_blocks=True) -class TestIsInstance(tm.TestCase): - - def test_isinstance(self): - - expected = "Expected type " - with assertRaisesRegexp(AssertionError, expected): - tm.assertIsInstance(1, pd.Series) - - def test_notisinstance(self): - - expected = "Input must not be type " - with assertRaisesRegexp(AssertionError, expected): - tm.assertNotIsInstance(pd.Series([1]), pd.Series) - - -class TestAssertCategoricalEqual(unittest.TestCase): - _multiprocess_can_split_ = True +class TestAssertCategoricalEqual(object): def test_categorical_equal_message(self): @@ -714,7 +691,7 @@ def test_categorical_equal_message(self): a = pd.Categorical([1, 2, 3, 4]) b = pd.Categorical([1, 2, 3, 5]) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): tm.assert_categorical_equal(a, b) expected = """Categorical\\.codes are different @@ -725,7 +702,7 @@ def test_categorical_equal_message(self): a = pd.Categorical([1, 2, 4, 3], categories=[1, 2, 3, 4]) b = pd.Categorical([1, 2, 3, 4], categories=[1, 2, 3, 4]) - with assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): tm.assert_categorical_equal(a, b) expected = """Categorical are different @@ -736,11 +713,11 @@ def test_categorical_equal_message(self): a = pd.Categorical([1, 2, 3, 4], ordered=False) b = pd.Categorical([1, 2, 3, 4], ordered=True) - with 
assertRaisesRegexp(AssertionError, expected): + with tm.assert_raises_regex(AssertionError, expected): tm.assert_categorical_equal(a, b) -class TestRNGContext(unittest.TestCase): +class TestRNGContext(object): def test_RNGContext(self): expected0 = 1.764052345967664 @@ -748,63 +725,17 @@ def test_RNGContext(self): with RNGContext(0): with RNGContext(1): - self.assertEqual(np.random.randn(), expected1) - self.assertEqual(np.random.randn(), expected0) - - -class TestDeprecatedTests(tm.TestCase): - - def test_warning(self): - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - self.assertEquals(1, 1) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - self.assertNotEquals(1, 2) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - self.assert_(True) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - self.assertAlmostEquals(1.0, 1.0000000001) + assert np.random.randn() == expected1 + assert np.random.randn() == expected0 - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - self.assertNotAlmostEquals(1, 2) - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - tm.assert_isinstance(Series([1, 2]), Series, msg='xxx') - - -class TestLocale(tm.TestCase): +class TestLocale(object): def test_locale(self): if sys.platform == 'win32': - raise nose.SkipTest( + pytest.skip( "skipping on win platforms as locale not available") # GH9744 locales = tm.get_locales() - self.assertTrue(len(locales) >= 1) - - -def test_skiptest_deco(): - from nose import SkipTest - - @skip_if_no_package_deco("fakepackagename") - def f(): - pass - with assertRaises(SkipTest): - f() - - @skip_if_no_package_deco("numpy") - def f(): - pass - # hack to ensure that SkipTest is *not* raised - with assertRaises(ValueError): - f() - raise ValueError - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + assert len(locales) >= 1 diff --git a/pandas/tests/test_util.py b/pandas/tests/util/test_util.py similarity index 69% rename from pandas/tests/test_util.py rename to pandas/tests/util/test_util.py index cb12048676d26..532d596220501 100644 --- a/pandas/tests/test_util.py +++ b/pandas/tests/util/test_util.py @@ -1,21 +1,28 @@ # -*- coding: utf-8 -*- -import nose - -from collections import OrderedDict +import os +import locale +import codecs import sys -import unittest from uuid import uuid4 +from collections import OrderedDict + +import pytest +from pandas.compat import intern from pandas.util._move import move_into_mutable_buffer, BadMove, stolenbuf -from pandas.util.decorators import deprecate_kwarg -from pandas.util.validators import (validate_args, validate_kwargs, - validate_args_and_kwargs) +from pandas.util._decorators import deprecate_kwarg +from pandas.util._validators import (validate_args, validate_kwargs, + validate_args_and_kwargs, + validate_bool_kwarg) import pandas.util.testing as tm +CURRENT_LOCALE = locale.getlocale() +LOCALE_OVERRIDE = os.environ.get('LOCALE_OVERRIDE', None) + -class TestDecorators(tm.TestCase): +class TestDecorators(object): - def setUp(self): + def setup_method(self, method): @deprecate_kwarg('old', 'new') def _f1(new=False): return new @@ -36,7 +43,7 @@ def test_deprecate_kwarg(self): x = 78 with tm.assert_produces_warning(FutureWarning): result = self.f1(old=x) - self.assertIs(result, x) + assert result is x with tm.assert_produces_warning(None): self.f1(new=x) @@ -44,24 +51,24 @@ def 
test_dict_deprecate_kwarg(self): x = 'yes' with tm.assert_produces_warning(FutureWarning): result = self.f2(old=x) - self.assertEqual(result, True) + assert result def test_missing_deprecate_kwarg(self): x = 'bogus' with tm.assert_produces_warning(FutureWarning): result = self.f2(old=x) - self.assertEqual(result, 'bogus') + assert result == 'bogus' def test_callable_deprecate_kwarg(self): x = 5 with tm.assert_produces_warning(FutureWarning): result = self.f3(old=x) - self.assertEqual(result, x + 1) - with tm.assertRaises(TypeError): + assert result == x + 1 + with pytest.raises(TypeError): self.f3(old='hello') def test_bad_deprecate_kwarg(self): - with tm.assertRaises(TypeError): + with pytest.raises(TypeError): @deprecate_kwarg('old', 'new', 0) def f4(new=None): pass @@ -82,12 +89,12 @@ def test_rands_array(): assert(len(arr[1, 1]) == 7) -class TestValidateArgs(tm.TestCase): +class TestValidateArgs(object): fname = 'func' def test_bad_min_fname_arg_count(self): msg = "'max_fname_arg_count' must be non-negative" - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): validate_args(self.fname, (None,), -1, 'foo') def test_bad_arg_length_max_value_single(self): @@ -102,7 +109,7 @@ def test_bad_arg_length_max_value_single(self): .format(fname=self.fname, max_length=max_length, actual_length=actual_length)) - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): validate_args(self.fname, args, min_fname_arg_count, compat_args) @@ -119,7 +126,7 @@ def test_bad_arg_length_max_value_multiple(self): .format(fname=self.fname, max_length=max_length, actual_length=actual_length)) - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): validate_args(self.fname, args, min_fname_arg_count, compat_args) @@ -138,7 +145,7 @@ def test_not_all_defaults(self): arg_vals = (1, -1, 3) for i in range(1, 3): - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): validate_args(self.fname, arg_vals[:i], 2, compat_args) def test_validation(self): @@ -152,7 +159,7 @@ def test_validation(self): validate_args(self.fname, (1, None), 2, compat_args) -class TestValidateKwargs(tm.TestCase): +class TestValidateKwargs(object): fname = 'func' def test_bad_kwarg(self): @@ -167,7 +174,7 @@ def test_bad_kwarg(self): r"keyword argument '{arg}'".format( fname=self.fname, arg=badarg)) - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): validate_kwargs(self.fname, kwargs, compat_args) def test_not_all_none(self): @@ -188,7 +195,7 @@ def test_not_all_none(self): kwargs = dict(zip(kwarg_keys[:i], kwarg_vals[:i])) - with tm.assertRaisesRegexp(ValueError, msg): + with tm.assert_raises_regex(ValueError, msg): validate_kwargs(self.fname, kwargs, compat_args) def test_validation(self): @@ -200,8 +207,25 @@ def test_validation(self): kwargs = dict(f=None, b=1) validate_kwargs(self.fname, kwargs, compat_args) + def test_validate_bool_kwarg(self): + arg_names = ['inplace', 'copy'] + invalid_values = [1, "True", [1, 2, 3], 5.0] + valid_values = [True, False, None] + + for name in arg_names: + for value in invalid_values: + with tm.assert_raises_regex(ValueError, + "For argument \"%s\" " + "expected type bool, " + "received type %s" % + (name, type(value).__name__)): + validate_bool_kwarg(value, name) -class TestValidateKwargsAndArgs(tm.TestCase): + for value in valid_values: + assert validate_bool_kwarg(value, name) == value + + +class 
TestValidateKwargsAndArgs(object): fname = 'func' def test_invalid_total_length_max_length_one(self): @@ -217,7 +241,7 @@ def test_invalid_total_length_max_length_one(self): .format(fname=self.fname, max_length=max_length, actual_length=actual_length)) - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): validate_args_and_kwargs(self.fname, args, kwargs, min_fname_arg_count, compat_args) @@ -235,7 +259,7 @@ def test_invalid_total_length_max_length_multiple(self): .format(fname=self.fname, max_length=max_length, actual_length=actual_length)) - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): validate_args_and_kwargs(self.fname, args, kwargs, min_fname_arg_count, compat_args) @@ -254,17 +278,17 @@ def test_no_args_with_kwargs(self): args = () kwargs = {'foo': -5, bad_arg: 2} - tm.assertRaisesRegexp(ValueError, msg, - validate_args_and_kwargs, - self.fname, args, kwargs, - min_fname_arg_count, compat_args) + tm.assert_raises_regex(ValueError, msg, + validate_args_and_kwargs, + self.fname, args, kwargs, + min_fname_arg_count, compat_args) args = (-5, 2) kwargs = {} - tm.assertRaisesRegexp(ValueError, msg, - validate_args_and_kwargs, - self.fname, args, kwargs, - min_fname_arg_count, compat_args) + tm.assert_raises_regex(ValueError, msg, + validate_args_and_kwargs, + self.fname, args, kwargs, + min_fname_arg_count, compat_args) def test_duplicate_argument(self): min_fname_arg_count = 2 @@ -278,7 +302,7 @@ def test_duplicate_argument(self): msg = (r"{fname}\(\) got multiple values for keyword " r"argument '{arg}'".format(fname=self.fname, arg='foo')) - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): validate_args_and_kwargs(self.fname, args, kwargs, min_fname_arg_count, compat_args) @@ -298,13 +322,14 @@ def test_validation(self): compat_args) -class TestMove(tm.TestCase): +class TestMove(object): + def test_cannot_create_instance_of_stolenbuffer(self): """Stolen buffers need to be created through the smart constructor ``move_into_mutable_buffer`` which has a bunch of checks in it. """ msg = "cannot create 'pandas.util._move.stolenbuf' instances" - with tm.assertRaisesRegexp(TypeError, msg): + with tm.assert_raises_regex(TypeError, msg): stolenbuf() def test_more_than_one_ref(self): @@ -313,9 +338,9 @@ def test_more_than_one_ref(self): """ b = b'testing' - with tm.assertRaises(BadMove) as e: + with pytest.raises(BadMove) as e: def handle_success(type_, value, tb): - self.assertIs(value.args[0], b) + assert value.args[0] is b return type(e).handle_success(e, type_, value, tb) # super e.handle_success = handle_success @@ -334,11 +359,11 @@ def test_exactly_one_ref(self): as_stolen_buf = move_into_mutable_buffer(b[:-3]) # materialize as bytearray to show that it is mutable - self.assertEqual(bytearray(as_stolen_buf), b'test') + assert bytearray(as_stolen_buf) == b'test' - @unittest.skipIf( + @pytest.mark.skipif( sys.version_info[0] > 2, - 'bytes objects cannot be interned in py3', + reason='bytes objects cannot be interned in py3', ) def test_interned(self): salt = uuid4().hex @@ -362,19 +387,14 @@ def ref_capture(ob): refcount[0] = sys.getrefcount(ob) - 2 return ob - with tm.assertRaises(BadMove): + with pytest.raises(BadMove): # If we intern the string it will still have one reference but now # it is in the intern table so if other people intern the same # string while the mutable buffer holds the first string they will # be the same instance. 
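The `TestMove` hunks above exercise a CPython reference-counting contract: `move_into_mutable_buffer` may only steal the buffer of a `bytes` object that arrives holding its sole reference, and raises `BadMove` otherwise. A minimal sketch of that contract, using the same import as these tests (variable names are mine):

```python
from pandas.util._move import move_into_mutable_buffer, BadMove

b = b'testing'
try:
    # `b` is still bound in this frame, so stealing its buffer would leave
    # a dangling reference; the move is refused
    move_into_mutable_buffer(b)
except BadMove:
    pass

# a slice is a fresh temporary holding the only reference, so the move works
buf = move_into_mutable_buffer(b[:-3])
assert bytearray(buf) == b'test'  # the stolen buffer is mutable
```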
move_into_mutable_buffer(ref_capture(intern(make_string()))) # noqa - self.assertEqual( - refcount[0], - 1, - msg='The BadMove was probably raised for refcount reasons instead' - ' of interning reasons', - ) + assert refcount[0] == 1 def test_numpy_errstate_is_default(): @@ -384,9 +404,65 @@ def test_numpy_errstate_is_default(): import numpy as np from pandas.compat import numpy # noqa # The errstate should be unchanged after that import. - tm.assert_equal(np.geterr(), expected) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) + assert np.geterr() == expected + + +class TestLocaleUtils(object): + + @classmethod + def setup_class(cls): + cls.locales = tm.get_locales() + + if not cls.locales: + pytest.skip("No locales found") + + tm._skip_if_windows() + + @classmethod + def teardown_class(cls): + del cls.locales + + def test_get_locales(self): + # all systems should have at least a single locale + assert len(tm.get_locales()) > 0 + + def test_get_locales_prefix(self): + if len(self.locales) == 1: + pytest.skip("Only a single locale found, no point in " + "trying to test filtering locale prefixes") + first_locale = self.locales[0] + assert len(tm.get_locales(prefix=first_locale[:2])) > 0 + + def test_set_locale(self): + if len(self.locales) == 1: + pytest.skip("Only a single locale found, no point in " + "trying to test setting another locale") + + if all(x is None for x in CURRENT_LOCALE): + # Not sure why, but on some travis runs with pytest, + # getlocale() returned (None, None). + pytest.skip("CURRENT_LOCALE is not set.") + + if LOCALE_OVERRIDE is None: + lang, enc = 'it_CH', 'UTF-8' + elif LOCALE_OVERRIDE == 'C': + lang, enc = 'en_US', 'ascii' + else: + lang, enc = LOCALE_OVERRIDE.split('.') + + enc = codecs.lookup(enc).name + new_locale = lang, enc + + if not tm._can_set_locale(new_locale): + with pytest.raises(locale.Error): + with tm.set_locale(new_locale): + pass + else: + with tm.set_locale(new_locale) as normalized_locale: + new_lang, new_enc = normalized_locale.split('.') + new_enc = codecs.lookup(enc).name + normalized_locale = new_lang, new_enc + assert normalized_locale == new_locale + + current_locale = locale.getlocale() + assert current_locale == CURRENT_LOCALE diff --git a/pandas/tools/hashing.py b/pandas/tools/hashing.py index aa18b8bc70c37..ba38710b607af 100644 --- a/pandas/tools/hashing.py +++ b/pandas/tools/hashing.py @@ -1,137 +1,18 @@ -""" -data hash pandas / numpy objects -""" +import warnings +import sys -import numpy as np -from pandas import _hash, Series, factorize, Categorical, Index -from pandas.lib import infer_dtype -from pandas.types.generic import ABCIndexClass, ABCSeries, ABCDataFrame -from pandas.types.common import is_categorical_dtype +m = sys.modules['pandas.tools.hashing'] +for t in ['hash_pandas_object', 'hash_array']: -# 16 byte long hashing key -_default_hash_key = '0123456789123456' + def outer(t=t): + def wrapper(*args, **kwargs): + from pandas import util + warnings.warn("pandas.tools.hashing is deprecated and will be " + "removed in a future version, import " + "from pandas.util", + DeprecationWarning, stacklevel=3) + return getattr(util, t)(*args, **kwargs) + return wrapper -def hash_pandas_object(obj, index=True, encoding='utf8', hash_key=None): - """ - Return a data hash of the Index/Series/DataFrame - - .. 
versionadded:: 0.19.2 - - Parameters - ---------- - index : boolean, default True - include the index in the hash (if Series/DataFrame) - encoding : string, default 'utf8' - encoding for data & key when strings - hash_key : string key to encode, defaults to _default_hash_key - - Returns - ------- - Series of uint64, same length as the object - - """ - if hash_key is None: - hash_key = _default_hash_key - - def adder(h, hashed_to_add): - h = np.multiply(h, np.uint(3), h) - return np.add(h, hashed_to_add, h) - - if isinstance(obj, ABCIndexClass): - h = hash_array(obj.values, encoding, hash_key).astype('uint64') - h = Series(h, index=obj, dtype='uint64') - elif isinstance(obj, ABCSeries): - h = hash_array(obj.values, encoding, hash_key).astype('uint64') - if index: - h = adder(h, hash_pandas_object(obj.index, - index=False, - encoding=encoding, - hash_key=hash_key).values) - h = Series(h, index=obj.index, dtype='uint64') - elif isinstance(obj, ABCDataFrame): - cols = obj.iteritems() - first_series = next(cols)[1] - h = hash_array(first_series.values, encoding, - hash_key).astype('uint64') - for _, col in cols: - h = adder(h, hash_array(col.values, encoding, hash_key)) - if index: - h = adder(h, hash_pandas_object(obj.index, - index=False, - encoding=encoding, - hash_key=hash_key).values) - - h = Series(h, index=obj.index, dtype='uint64') - else: - raise TypeError("Unexpected type for hashing %s" % type(obj)) - return h - - -def hash_array(vals, encoding='utf8', hash_key=None): - """ - Given a 1d array, return an array of deterministic integers. - - .. versionadded:: 0.19.2 - - Parameters - ---------- - vals : ndarray - encoding : string, default 'utf8' - encoding for data & key when strings - hash_key : string key to encode, defaults to _default_hash_key - - Returns - ------- - 1d uint64 numpy array of hash values, same length as vals - - """ - - # work with categoricals as ints. (This check is above the complex - check so that we don't ask numpy if categorical is a subdtype of - complex, as it will choke.) - if hash_key is None: - hash_key = _default_hash_key - - if is_categorical_dtype(vals.dtype): - vals = vals.codes - - # we'll be working with everything as 64-bit values, so handle this - 128-bit value early - if np.issubdtype(vals.dtype, np.complex128): - return hash_array(vals.real) + 23 * hash_array(vals.imag) - - # MAIN LOGIC: - inferred = infer_dtype(vals) - - # First, turn whatever array this is into unsigned 64-bit ints, if we can - manage it.
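The `adder` closure above is the whole column-combining scheme of `hash_pandas_object`: per-column uint64 hashes are folded together in place as `h = h * 3 + next_hash`, relying on uint64 wraparound. A standalone NumPy sketch of just that step (illustrative, not the pandas API; the toy inputs are mine):

```python
import numpy as np

def adder(h, hashed_to_add):
    # in-place multiply-and-add; uint64 arithmetic wraps modulo 2**64
    h = np.multiply(h, np.uint64(3), h)
    return np.add(h, hashed_to_add, h)

col1 = np.array([1, 2, 3], dtype='uint64')
col2 = np.array([10, 20, 30], dtype='uint64')
print(adder(col1, col2))  # [13 26 39]
```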
- if inferred == 'boolean': - vals = vals.astype('u8') - - if (np.issubdtype(vals.dtype, np.datetime64) or - np.issubdtype(vals.dtype, np.timedelta64) or - np.issubdtype(vals.dtype, np.number)) and vals.dtype.itemsize <= 8: - - vals = vals.view('u{}'.format(vals.dtype.itemsize)).astype('u8') - else: - - # its MUCH faster to categorize object dtypes, then hash and rename - codes, categories = factorize(vals, sort=False) - categories = Index(categories) - c = Series(Categorical(codes, categories, - ordered=False, fastpath=True)) - vals = _hash.hash_object_array(categories.values, - hash_key, - encoding) - - # rename & extract - vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values - - # Then, redistribute these 64-bit ints within the space of 64-bit ints - vals ^= vals >> 30 - vals *= np.uint64(0xbf58476d1ce4e5b9) - vals ^= vals >> 27 - vals *= np.uint64(0x94d049bb133111eb) - vals ^= vals >> 31 - return vals + setattr(m, t, outer(t)) diff --git a/pandas/tools/merge.py b/pandas/tools/merge.py index 36f88074bbbc6..cd58aa2c7f923 100644 --- a/pandas/tools/merge.py +++ b/pandas/tools/merge.py @@ -1,1871 +1,17 @@ -""" -SQL-style merge routines -""" - -import copy import warnings -import string - -import numpy as np -from pandas.compat import range, lrange, lzip, zip, map, filter -import pandas.compat as compat - -from pandas import (Categorical, DataFrame, Series, - Index, MultiIndex, Timedelta) -from pandas.core.categorical import (_factorize_from_iterable, - _factorize_from_iterables) -from pandas.core.frame import _merge_doc -from pandas.types.generic import ABCSeries -from pandas.types.common import (is_datetime64tz_dtype, - is_datetime64_dtype, - needs_i8_conversion, - is_int64_dtype, - is_integer_dtype, - is_float_dtype, - is_integer, - is_int_or_datetime_dtype, - is_dtype_equal, - is_bool, - is_list_like, - _ensure_int64, - _ensure_float64, - _ensure_object, - _get_dtype) -from pandas.types.missing import na_value_for_dtype - -from pandas.core.generic import NDFrame -from pandas.core.index import (_get_combined_index, - _ensure_index, _get_consensus_names, - _all_indexes_same) -from pandas.core.internals import (items_overlap_with_suffix, - concatenate_block_managers) -from pandas.util.decorators import Appender, Substitution - -import pandas.core.algorithms as algos -import pandas.core.common as com -import pandas.types.concat as _concat - -import pandas._join as _join -import pandas.hashtable as _hash - - -@Substitution('\nleft : DataFrame') -@Appender(_merge_doc, indents=0) -def merge(left, right, how='inner', on=None, left_on=None, right_on=None, - left_index=False, right_index=False, sort=False, - suffixes=('_x', '_y'), copy=True, indicator=False): - op = _MergeOperation(left, right, how=how, on=on, left_on=left_on, - right_on=right_on, left_index=left_index, - right_index=right_index, sort=sort, suffixes=suffixes, - copy=copy, indicator=indicator) - return op.get_result() -if __debug__: - merge.__doc__ = _merge_doc % '\nleft : DataFrame' - - -class MergeError(ValueError): - pass - - -def _groupby_and_merge(by, on, left, right, _merge_pieces, - check_duplicates=True): - """ - groupby & merge; we are always performing a left-by type operation - - Parameters - ---------- - by: field to group - on: duplicates field - left: left frame - right: right frame - _merge_pieces: function for merging - check_duplicates: boolean, default True - should we check & clean duplicates - """ - - pieces = [] - if not isinstance(by, (list, tuple)): - by = [by] - - lby = left.groupby(by, 
sort=False) - - # if we can groupby the rhs - # then we can get vastly better perf - try: - - # we will check & remove duplicates if indicated - if check_duplicates: - if on is None: - on = [] - elif not isinstance(on, (list, tuple)): - on = [on] - - if right.duplicated(by + on).any(): - right = right.drop_duplicates(by + on, keep='last') - rby = right.groupby(by, sort=False) - except KeyError: - rby = None - - for key, lhs in lby: - - if rby is None: - rhs = right - else: - try: - rhs = right.take(rby.indices[key]) - except KeyError: - # key doesn't exist in left - lcols = lhs.columns.tolist() - cols = lcols + [r for r in right.columns - if r not in set(lcols)] - merged = lhs.reindex(columns=cols) - merged.index = range(len(merged)) - pieces.append(merged) - continue - - merged = _merge_pieces(lhs, rhs) - - # make sure join keys are in the merged - # TODO, should _merge_pieces do this? - for k in by: - try: - if k in merged: - merged[k] = key - except: - pass - - pieces.append(merged) - - # preserve the original order - # if we have a missing piece this can be reset - result = concat(pieces, ignore_index=True) - result = result.reindex(columns=pieces[0].columns, copy=False) - return result, lby - - -def ordered_merge(left, right, on=None, - left_on=None, right_on=None, - left_by=None, right_by=None, - fill_method=None, suffixes=('_x', '_y')): - - warnings.warn("ordered_merge is deprecated and replaced by merge_ordered", - FutureWarning, stacklevel=2) - return merge_ordered(left, right, on=on, - left_on=left_on, right_on=right_on, - left_by=left_by, right_by=right_by, - fill_method=fill_method, suffixes=suffixes) - - -def merge_ordered(left, right, on=None, - left_on=None, right_on=None, - left_by=None, right_by=None, - fill_method=None, suffixes=('_x', '_y'), - how='outer'): - """Perform merge with optional filling/interpolation designed for ordered - data like time series data. Optionally perform group-wise merge (see - examples) - - Parameters - ---------- - left : DataFrame - right : DataFrame - on : label or list - Field names to join on. Must be found in both DataFrames. - left_on : label or list, or array-like - Field names to join on in left DataFrame. Can be a vector or list of - vectors of the length of the DataFrame to use a particular vector as - the join key instead of columns - right_on : label or list, or array-like - Field names to join on in right DataFrame or vector/list of vectors per - left_on docs - left_by : column name or list of column names - Group left DataFrame by group columns and merge piece by piece with - right DataFrame - right_by : column name or list of column names - Group right DataFrame by group columns and merge piece by piece with - left DataFrame - fill_method : {'ffill', None}, default None - Interpolation method for data - suffixes : 2-length sequence (tuple, list, ...) - Suffix to apply to overlapping column names in the left and right - side, respectively - how : {'left', 'right', 'outer', 'inner'}, default 'outer' - * left: use only keys from left frame (SQL: left outer join) - * right: use only keys from right frame (SQL: right outer join) - * outer: use union of keys from both frames (SQL: full outer join) - * inner: use intersection of keys from both frames (SQL: inner join) - - .. 
versionadded:: 0.19.0 - - Examples - -------- - >>> A >>> B - key lvalue group key rvalue - 0 a 1 a 0 b 1 - 1 c 2 a 1 c 2 - 2 e 3 a 2 d 3 - 3 a 1 b - 4 c 2 b - 5 e 3 b - - >>> ordered_merge(A, B, fill_method='ffill', left_by='group') - key lvalue group rvalue - 0 a 1 a NaN - 1 b 1 a 1 - 2 c 2 a 2 - 3 d 2 a 3 - 4 e 3 a 3 - 5 f 3 a 4 - 6 a 1 b NaN - 7 b 1 b 1 - 8 c 2 b 2 - 9 d 2 b 3 - 10 e 3 b 3 - 11 f 3 b 4 - - Returns - ------- - merged : DataFrame - The output type will be the same as 'left', if it is a subclass - of DataFrame. - - See also - -------- - merge - merge_asof - - """ - def _merger(x, y): - # perform the ordered merge operation - op = _OrderedMerge(x, y, on=on, left_on=left_on, right_on=right_on, - suffixes=suffixes, fill_method=fill_method, - how=how) - return op.get_result() - - if left_by is not None and right_by is not None: - raise ValueError('Can only group either left or right frames') - elif left_by is not None: - result, _ = _groupby_and_merge(left_by, on, left, right, - lambda x, y: _merger(x, y), - check_duplicates=False) - elif right_by is not None: - result, _ = _groupby_and_merge(right_by, on, right, left, - lambda x, y: _merger(y, x), - check_duplicates=False) - else: - result = _merger(left, right) - return result - -ordered_merge.__doc__ = merge_ordered.__doc__ - - -def merge_asof(left, right, on=None, - left_on=None, right_on=None, - left_index=False, right_index=False, - by=None, left_by=None, right_by=None, - suffixes=('_x', '_y'), - tolerance=None, - allow_exact_matches=True): - """Perform an asof merge. This is similar to a left-join except that we - match on nearest key rather than equal keys. - - For each row in the left DataFrame, we select the last row in the right - DataFrame whose 'on' key is less than or equal to the left's key. Both - DataFrames must be sorted by the key. - - Optionally match on equivalent keys with 'by' before searching for nearest - match with 'on'. - - .. versionadded:: 0.19.0 - - Parameters - ---------- - left : DataFrame - right : DataFrame - on : label - Field name to join on. Must be found in both DataFrames. - The data MUST be ordered. Furthermore, this must be a numeric column, - such as datetimelike, integer, or float. On or left_on/right_on - must be given. - left_on : label - Field name to join on in left DataFrame. - right_on : label - Field name to join on in right DataFrame. - left_index : boolean - Use the index of the left DataFrame as the join key. - - .. versionadded:: 0.19.2 - - right_index : boolean - Use the index of the right DataFrame as the join key. - - .. versionadded:: 0.19.2 - - by : column name or list of column names - Match on these columns before performing merge operation. - left_by : column name - Field names to match on in the left DataFrame. - - .. versionadded:: 0.19.2 - - right_by : column name - Field names to match on in the right DataFrame. - - .. versionadded:: 0.19.2 - - suffixes : 2-length sequence (tuple, list, ...) - Suffix to apply to overlapping column names in the left and right - side, respectively - tolerance : integer or Timedelta, optional, default None - select asof tolerance within this range; must be compatible - with the merge index. - allow_exact_matches : boolean, default True - - - If True, allow matching the same 'on' value - (i.e.
less-than-or-equal-to) - - If False, don't match the same 'on' value - (i.e., strictly less-than) - - Returns - ------- - merged : DataFrame - - Examples - -------- - >>> left - a left_val - 0 1 a - 1 5 b - 2 10 c - - >>> right - a right_val - 0 1 1 - 1 2 2 - 2 3 3 - 3 6 6 - 4 7 7 - - >>> pd.merge_asof(left, right, on='a') - a left_val right_val - 0 1 a 1 - 1 5 b 3 - 2 10 c 7 - - >>> pd.merge_asof(left, right, on='a', allow_exact_matches=False) - a left_val right_val - 0 1 a NaN - 1 5 b 3.0 - 2 10 c 7.0 - - For this example, we can achieve a similar result through - ``pd.merge_ordered()``, though it's not nearly as performant. - - >>> (pd.merge_ordered(left, right, on='a') - ... .ffill() - ... .drop_duplicates(['left_val']) - ... ) - a left_val right_val - 0 1 a 1.0 - 3 5 b 3.0 - 6 10 c 7.0 - - We can use indexed DataFrames as well. - - >>> left - left_val - 1 a - 5 b - 10 c - - >>> right - right_val - 1 1 - 2 2 - 3 3 - 6 6 - 7 7 - - >>> pd.merge_asof(left, right, left_index=True, right_index=True) - left_val right_val - 1 a 1 - 5 b 3 - 10 c 7 - - Here is a real-world time-series example - - >>> quotes - time ticker bid ask - 0 2016-05-25 13:30:00.023 GOOG 720.50 720.93 - 1 2016-05-25 13:30:00.023 MSFT 51.95 51.96 - 2 2016-05-25 13:30:00.030 MSFT 51.97 51.98 - 3 2016-05-25 13:30:00.041 MSFT 51.99 52.00 - 4 2016-05-25 13:30:00.048 GOOG 720.50 720.93 - 5 2016-05-25 13:30:00.049 AAPL 97.99 98.01 - 6 2016-05-25 13:30:00.072 GOOG 720.50 720.88 - 7 2016-05-25 13:30:00.075 MSFT 52.01 52.03 - - >>> trades - time ticker price quantity - 0 2016-05-25 13:30:00.023 MSFT 51.95 75 - 1 2016-05-25 13:30:00.038 MSFT 51.95 155 - 2 2016-05-25 13:30:00.048 GOOG 720.77 100 - 3 2016-05-25 13:30:00.048 GOOG 720.92 100 - 4 2016-05-25 13:30:00.048 AAPL 98.00 100 - - By default we are taking the asof of the quotes - - >>> pd.merge_asof(trades, quotes, - ... on='time', - ... by='ticker') - time ticker price quantity bid ask - 0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96 - 1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98 - 2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93 - 3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93 - 4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN - - We only asof within 2ms between the quote time and the trade time - - >>> pd.merge_asof(trades, quotes, - ... on='time', - ... by='ticker', - ... tolerance=pd.Timedelta('2ms')) - time ticker price quantity bid ask - 0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96 - 1 2016-05-25 13:30:00.038 MSFT 51.95 155 NaN NaN - 2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93 - 3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93 - 4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN - - We only asof within 10ms between the quote time and the trade time - and we exclude exact matches on time. However *prior* data will - propagate forward - - >>> pd.merge_asof(trades, quotes, - ... on='time', - ... by='ticker', - ... tolerance=pd.Timedelta('10ms'), - ...
allow_exact_matches=False) - time ticker price quantity bid ask - 0 2016-05-25 13:30:00.023 MSFT 51.95 75 NaN NaN - 1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98 - 2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93 - 3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93 - 4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN - - See also - -------- - merge - merge_ordered - - """ - op = _AsOfMerge(left, right, - on=on, left_on=left_on, right_on=right_on, - left_index=left_index, right_index=right_index, - by=by, left_by=left_by, right_by=right_by, - suffixes=suffixes, - how='asof', tolerance=tolerance, - allow_exact_matches=allow_exact_matches) - return op.get_result() - - -# TODO: transformations?? -# TODO: only copy DataFrames when modification necessary -class _MergeOperation(object): - """ - Perform a database (SQL) merge operation between two DataFrame objects - using either columns as keys or their row indexes - """ - _merge_type = 'merge' - - def __init__(self, left, right, how='inner', on=None, - left_on=None, right_on=None, axis=1, - left_index=False, right_index=False, sort=True, - suffixes=('_x', '_y'), copy=True, indicator=False): - self.left = self.orig_left = left - self.right = self.orig_right = right - self.how = how - self.axis = axis - - self.on = com._maybe_make_list(on) - self.left_on = com._maybe_make_list(left_on) - self.right_on = com._maybe_make_list(right_on) - - self.copy = copy - self.suffixes = suffixes - self.sort = sort - - self.left_index = left_index - self.right_index = right_index - - self.indicator = indicator - - if isinstance(self.indicator, compat.string_types): - self.indicator_name = self.indicator - elif isinstance(self.indicator, bool): - self.indicator_name = '_merge' if self.indicator else None - else: - raise ValueError( - 'indicator option can only accept boolean or string arguments') - - if not isinstance(left, DataFrame): - raise ValueError( - 'can not merge DataFrame with instance of ' - 'type {0}'.format(type(left))) - if not isinstance(right, DataFrame): - raise ValueError( - 'can not merge DataFrame with instance of ' - 'type {0}'.format(type(right))) - - if not is_bool(left_index): - raise ValueError( - 'left_index parameter must be of type bool, not ' - '{0}'.format(type(left_index))) - if not is_bool(right_index): - raise ValueError( - 'right_index parameter must be of type bool, not ' - '{0}'.format(type(right_index))) - - # warn user when merging between different levels - if left.columns.nlevels != right.columns.nlevels: - msg = ('merging between different levels can give an unintended ' - 'result ({0} levels on the left, {1} on the right)') - msg = msg.format(left.columns.nlevels, right.columns.nlevels) - warnings.warn(msg, UserWarning) - - self._validate_specification() - - # note this function has side effects - (self.left_join_keys, - self.right_join_keys, - self.join_names) = self._get_merge_keys() - - def get_result(self): - if self.indicator: - self.left, self.right = self._indicator_pre_merge( - self.left, self.right) - - join_index, left_indexer, right_indexer = self._get_join_info() - - ldata, rdata = self.left._data, self.right._data - lsuf, rsuf = self.suffixes - - llabels, rlabels = items_overlap_with_suffix(ldata.items, lsuf, - rdata.items, rsuf) - - lindexers = {1: left_indexer} if left_indexer is not None else {} - rindexers = {1: right_indexer} if right_indexer is not None else {} - - result_data = concatenate_block_managers( - [(ldata, lindexers), (rdata, rindexers)], - axes=[llabels.append(rlabels), 
join_index], - concat_axis=0, copy=self.copy) - - typ = self.left._constructor - result = typ(result_data).__finalize__(self, method=self._merge_type) - - if self.indicator: - result = self._indicator_post_merge(result) - - self._maybe_add_join_keys(result, left_indexer, right_indexer) - - return result - - def _indicator_pre_merge(self, left, right): - - columns = left.columns.union(right.columns) - - for i in ['_left_indicator', '_right_indicator']: - if i in columns: - raise ValueError("Cannot use `indicator=True` option when " - "data contains a column named {}".format(i)) - if self.indicator_name in columns: - raise ValueError( - "Cannot use name of an existing column for indicator column") - - left = left.copy() - right = right.copy() - - left['_left_indicator'] = 1 - left['_left_indicator'] = left['_left_indicator'].astype('int8') - - right['_right_indicator'] = 2 - right['_right_indicator'] = right['_right_indicator'].astype('int8') - - return left, right - - def _indicator_post_merge(self, result): - - result['_left_indicator'] = result['_left_indicator'].fillna(0) - result['_right_indicator'] = result['_right_indicator'].fillna(0) - - result[self.indicator_name] = Categorical((result['_left_indicator'] + - result['_right_indicator']), - categories=[1, 2, 3]) - result[self.indicator_name] = ( - result[self.indicator_name] - .cat.rename_categories(['left_only', 'right_only', 'both'])) - - result = result.drop(labels=['_left_indicator', '_right_indicator'], - axis=1) - return result - - def _maybe_add_join_keys(self, result, left_indexer, right_indexer): - - left_has_missing = None - right_has_missing = None - - keys = zip(self.join_names, self.left_on, self.right_on) - for i, (name, lname, rname) in enumerate(keys): - if not _should_fill(lname, rname): - continue - - take_left, take_right = None, None - - if name in result: - - if left_indexer is not None and right_indexer is not None: - if name in self.left: - - if left_has_missing is None: - left_has_missing = (left_indexer == -1).any() - - if left_has_missing: - take_right = self.right_join_keys[i] - - if not is_dtype_equal(result[name].dtype, - self.left[name].dtype): - take_left = self.left[name]._values - - elif name in self.right: - - if right_has_missing is None: - right_has_missing = (right_indexer == -1).any() - - if right_has_missing: - take_left = self.left_join_keys[i] - - if not is_dtype_equal(result[name].dtype, - self.right[name].dtype): - take_right = self.right[name]._values - - elif left_indexer is not None \ - and isinstance(self.left_join_keys[i], np.ndarray): - - take_left = self.left_join_keys[i] - take_right = self.right_join_keys[i] - - if take_left is not None or take_right is not None: - - if take_left is None: - lvals = result[name]._values - else: - lfill = na_value_for_dtype(take_left.dtype) - lvals = algos.take_1d(take_left, left_indexer, - fill_value=lfill) - - if take_right is None: - rvals = result[name]._values - else: - rfill = na_value_for_dtype(take_right.dtype) - rvals = algos.take_1d(take_right, right_indexer, - fill_value=rfill) - - # if we have an all missing left_indexer - # make sure to just use the right values - mask = left_indexer == -1 - if mask.all(): - key_col = rvals - else: - key_col = Index(lvals).where(~mask, rvals) - - if name in result: - result[name] = key_col - else: - result.insert(i, name or 'key_%d' % i, key_col) - - def _get_join_indexers(self): - """ return the join indexers """ - return _get_join_indexers(self.left_join_keys, - self.right_join_keys, - sort=self.sort, - 
how=self.how) - - def _get_join_info(self): - left_ax = self.left._data.axes[self.axis] - right_ax = self.right._data.axes[self.axis] - - if self.left_index and self.right_index and self.how != 'asof': - join_index, left_indexer, right_indexer = \ - left_ax.join(right_ax, how=self.how, return_indexers=True) - elif self.right_index and self.how == 'left': - join_index, left_indexer, right_indexer = \ - _left_join_on_index(left_ax, right_ax, self.left_join_keys, - sort=self.sort) - - elif self.left_index and self.how == 'right': - join_index, right_indexer, left_indexer = \ - _left_join_on_index(right_ax, left_ax, self.right_join_keys, - sort=self.sort) - else: - (left_indexer, - right_indexer) = self._get_join_indexers() - - if self.right_index: - if len(self.left) > 0: - join_index = self.left.index.take(left_indexer) - else: - join_index = self.right.index.take(right_indexer) - left_indexer = np.array([-1] * len(join_index)) - elif self.left_index: - if len(self.right) > 0: - join_index = self.right.index.take(right_indexer) - else: - join_index = self.left.index.take(left_indexer) - right_indexer = np.array([-1] * len(join_index)) - else: - join_index = Index(np.arange(len(left_indexer))) - - if len(join_index) == 0: - join_index = join_index.astype(object) - return join_index, left_indexer, right_indexer - - def _get_merge_data(self): - """ - Handles overlapping column names etc. - """ - ldata, rdata = self.left._data, self.right._data - lsuf, rsuf = self.suffixes - - llabels, rlabels = items_overlap_with_suffix( - ldata.items, lsuf, rdata.items, rsuf) - - if not llabels.equals(ldata.items): - ldata = ldata.copy(deep=False) - ldata.set_axis(0, llabels) - - if not rlabels.equals(rdata.items): - rdata = rdata.copy(deep=False) - rdata.set_axis(0, rlabels) - - return ldata, rdata - - def _get_merge_keys(self): - """ - Note: has side effects (copy/delete key columns) - - Parameters - ---------- - left - right - on - - Returns - ------- - left_keys, right_keys - """ - left_keys = [] - right_keys = [] - join_names = [] - right_drop = [] - left_drop = [] - left, right = self.left, self.right - - is_lkey = lambda x: isinstance( - x, (np.ndarray, ABCSeries)) and len(x) == len(left) - is_rkey = lambda x: isinstance( - x, (np.ndarray, ABCSeries)) and len(x) == len(right) - - # Note that pd.merge_asof() has separate 'on' and 'by' parameters. A - # user could, for example, request 'left_index' and 'left_by'. In a - # regular pd.merge(), users cannot specify both 'left_index' and - # 'left_on'. (Instead, users have a MultiIndex). That means the - # self.left_on in this function is always empty in a pd.merge(), but - # a pd.merge_asof(left_index=True, left_by=...) will result in a - # self.left_on array with a None in the middle of it. This requires - # a work-around as designated in the code below. - # See _validate_specification() for where this happens. - - # ugh, spaghetti re #733 - if _any(self.left_on) and _any(self.right_on): - for lk, rk in zip(self.left_on, self.right_on): - if is_lkey(lk): - left_keys.append(lk) - if is_rkey(rk): - right_keys.append(rk) - join_names.append(None) # what to do? 
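The `_indicator_pre_merge`/`_indicator_post_merge` pair above reduces `indicator=True` to plain arithmetic: left rows carry an `int8` 1, right rows a 2, and after the join the `fillna(0)` sum is 1 (`left_only`), 2 (`right_only`) or 3 (`both`). A rough outside-in sketch of the same bookkeeping (the toy frames are mine):

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2]}).assign(_left_indicator=1)
right = pd.DataFrame({'key': [2, 3]}).assign(_right_indicator=2)

out = pd.merge(left, right, on='key', how='outer')
codes = (out['_left_indicator'].fillna(0) +
         out['_right_indicator'].fillna(0)).astype('int8')
cat = pd.Categorical(codes, categories=[1, 2, 3])
out['_merge'] = cat.rename_categories(['left_only', 'right_only', 'both'])
print(out[['key', '_merge']])  # left_only, both, right_only
```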
- else: - if rk is not None: - right_keys.append(right[rk]._values) - join_names.append(rk) - else: - # work-around for merge_asof(right_index=True) - right_keys.append(right.index) - join_names.append(right.index.name) - else: - if not is_rkey(rk): - if rk is not None: - right_keys.append(right[rk]._values) - else: - # work-around for merge_asof(right_index=True) - right_keys.append(right.index) - if lk is not None and lk == rk: - # avoid key upcast in corner case (length-0) - if len(left) > 0: - right_drop.append(rk) - else: - left_drop.append(lk) - else: - right_keys.append(rk) - if lk is not None: - left_keys.append(left[lk]._values) - join_names.append(lk) - else: - # work-around for merge_asof(left_index=True) - left_keys.append(left.index) - join_names.append(left.index.name) - elif _any(self.left_on): - for k in self.left_on: - if is_lkey(k): - left_keys.append(k) - join_names.append(None) - else: - left_keys.append(left[k]._values) - join_names.append(k) - if isinstance(self.right.index, MultiIndex): - right_keys = [lev._values.take(lab) - for lev, lab in zip(self.right.index.levels, - self.right.index.labels)] - else: - right_keys = [self.right.index.values] - elif _any(self.right_on): - for k in self.right_on: - if is_rkey(k): - right_keys.append(k) - join_names.append(None) - else: - right_keys.append(right[k]._values) - join_names.append(k) - if isinstance(self.left.index, MultiIndex): - left_keys = [lev._values.take(lab) - for lev, lab in zip(self.left.index.levels, - self.left.index.labels)] - else: - left_keys = [self.left.index.values] - - if left_drop: - self.left = self.left.drop(left_drop, axis=1) - - if right_drop: - self.right = self.right.drop(right_drop, axis=1) - - return left_keys, right_keys, join_names - - def _validate_specification(self): - # Hm, any way to make this logic less complicated?? 
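`_validate_specification`, whose body continues below, encodes the user-facing key rules: `on` is mutually exclusive with `left_on`/`right_on`, index joins need the corresponding `*_index` flag, and when no keys are given the common columns are used. A short usage sketch of those rules (column names are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'k': ['a', 'b'], 'lval': [1, 2]})
right = pd.DataFrame({'rval': [10, 20]}, index=['b', 'c'])

# a left column joined against the right frame's index
print(pd.merge(left, right, left_on='k', right_index=True, how='left'))

# `on` may not be combined with `left_on`/`right_on`
try:
    pd.merge(left, left, on='k', left_on='k')
except ValueError as err:  # MergeError subclasses ValueError
    print(err)  # Can only pass on OR left_on and right_on
```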
- if self.on is None and self.left_on is None and self.right_on is None: - - if self.left_index and self.right_index: - self.left_on, self.right_on = (), () - elif self.left_index: - if self.right_on is None: - raise MergeError('Must pass right_on or right_index=True') - elif self.right_index: - if self.left_on is None: - raise MergeError('Must pass left_on or left_index=True') - else: - # use the common columns - common_cols = self.left.columns.intersection( - self.right.columns) - if len(common_cols) == 0: - raise MergeError('No common columns to perform merge on') - if not common_cols.is_unique: - raise MergeError("Data columns not unique: %s" - % repr(common_cols)) - self.left_on = self.right_on = common_cols - elif self.on is not None: - if self.left_on is not None or self.right_on is not None: - raise MergeError('Can only pass on OR left_on and ' - 'right_on') - self.left_on = self.right_on = self.on - elif self.left_on is not None: - n = len(self.left_on) - if self.right_index: - if len(self.left_on) != self.right.index.nlevels: - raise ValueError('len(left_on) must equal the number ' - 'of levels in the index of "right"') - self.right_on = [None] * n - elif self.right_on is not None: - n = len(self.right_on) - if self.left_index: - if len(self.right_on) != self.left.index.nlevels: - raise ValueError('len(right_on) must equal the number ' - 'of levels in the index of "left"') - self.left_on = [None] * n - if len(self.right_on) != len(self.left_on): - raise ValueError("len(right_on) must equal len(left_on)") - - -def _get_join_indexers(left_keys, right_keys, sort=False, how='inner', - **kwargs): - """ - - Parameters - ---------- - - Returns - ------- - - """ - from functools import partial - - assert len(left_keys) == len(right_keys), \ - 'left_key and right_keys must be the same length' - - # bind `sort` arg. of _factorize_keys - fkeys = partial(_factorize_keys, sort=sort) - - # get left & right join labels and num. of levels at each location - llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys))) - - # get flat i8 keys from label lists - lkey, rkey = _get_join_keys(llab, rlab, shape, sort) - - # factorize keys to a dense i8 space - # `count` is the num. 
of unique keys - # set(lkey) | set(rkey) == range(count) - lkey, rkey, count = fkeys(lkey, rkey) - - # preserve left frame order if how == 'left' and sort == False - kwargs = copy.copy(kwargs) - if how == 'left': - kwargs['sort'] = sort - join_func = _join_functions[how] - - return join_func(lkey, rkey, count, **kwargs) - - -class _OrderedMerge(_MergeOperation): - _merge_type = 'ordered_merge' - - def __init__(self, left, right, on=None, left_on=None, right_on=None, - left_index=False, right_index=False, axis=1, - suffixes=('_x', '_y'), copy=True, - fill_method=None, how='outer'): - - self.fill_method = fill_method - _MergeOperation.__init__(self, left, right, on=on, left_on=left_on, - left_index=left_index, - right_index=right_index, - right_on=right_on, axis=axis, - how=how, suffixes=suffixes, - sort=True # factorize sorts - ) - - def get_result(self): - join_index, left_indexer, right_indexer = self._get_join_info() - - # this is a bit kludgy - ldata, rdata = self.left._data, self.right._data - lsuf, rsuf = self.suffixes - - llabels, rlabels = items_overlap_with_suffix(ldata.items, lsuf, - rdata.items, rsuf) - - if self.fill_method == 'ffill': - left_join_indexer = _join.ffill_indexer(left_indexer) - right_join_indexer = _join.ffill_indexer(right_indexer) - else: - left_join_indexer = left_indexer - right_join_indexer = right_indexer - - lindexers = { - 1: left_join_indexer} if left_join_indexer is not None else {} - rindexers = { - 1: right_join_indexer} if right_join_indexer is not None else {} - - result_data = concatenate_block_managers( - [(ldata, lindexers), (rdata, rindexers)], - axes=[llabels.append(rlabels), join_index], - concat_axis=0, copy=self.copy) - - typ = self.left._constructor - result = typ(result_data).__finalize__(self, method=self._merge_type) - - self._maybe_add_join_keys(result, left_indexer, right_indexer) - - return result - - -def _asof_function(on_type): - return getattr(_join, 'asof_join_%s' % on_type, None) - - -def _asof_by_function(on_type, by_type): - return getattr(_join, 'asof_join_%s_by_%s' % (on_type, by_type), None) - - -_type_casters = { - 'int64_t': _ensure_int64, - 'double': _ensure_float64, - 'object': _ensure_object, -} - -_cython_types = { - 'uint8': 'uint8_t', - 'uint32': 'uint32_t', - 'uint16': 'uint16_t', - 'uint64': 'uint64_t', - 'int8': 'int8_t', - 'int32': 'int32_t', - 'int16': 'int16_t', - 'int64': 'int64_t', - 'float16': 'error', - 'float32': 'float', - 'float64': 'double', -} - - -def _get_cython_type(dtype): - """ Given a dtype, return a C name like 'int64_t' or 'double' """ - type_name = _get_dtype(dtype).name - ctype = _cython_types.get(type_name, 'object') - if ctype == 'error': - raise MergeError('unsupported type: ' + type_name) - return ctype - - -def _get_cython_type_upcast(dtype): - """ Upcast a dtype to 'int64_t', 'double', or 'object' """ - if is_integer_dtype(dtype): - return 'int64_t' - elif is_float_dtype(dtype): - return 'double' - else: - return 'object' - - -class _AsOfMerge(_OrderedMerge): - _merge_type = 'asof_merge' - - def __init__(self, left, right, on=None, left_on=None, right_on=None, - left_index=False, right_index=False, - by=None, left_by=None, right_by=None, - axis=1, suffixes=('_x', '_y'), copy=True, - fill_method=None, - how='asof', tolerance=None, - allow_exact_matches=True): - - self.by = by - self.left_by = left_by - self.right_by = right_by - self.tolerance = tolerance - self.allow_exact_matches = allow_exact_matches - - _OrderedMerge.__init__(self, left, right, on=on, left_on=left_on, - 
right_on=right_on, left_index=left_index, - right_index=right_index, axis=axis, - how=how, suffixes=suffixes, - fill_method=fill_method) - - def _validate_specification(self): - super(_AsOfMerge, self)._validate_specification() - - # we only allow on to be a single item for on - if len(self.left_on) != 1 and not self.left_index: - raise MergeError("can only asof on a key for left") - - if len(self.right_on) != 1 and not self.right_index: - raise MergeError("can only asof on a key for right") - - if self.left_index and isinstance(self.left.index, MultiIndex): - raise MergeError("left can only have one index") - - if self.right_index and isinstance(self.right.index, MultiIndex): - raise MergeError("right can only have one index") - - # set 'by' columns - if self.by is not None: - if self.left_by is not None or self.right_by is not None: - raise MergeError('Can only pass by OR left_by ' - 'and right_by') - self.left_by = self.right_by = self.by - if self.left_by is None and self.right_by is not None: - raise MergeError('missing left_by') - if self.left_by is not None and self.right_by is None: - raise MergeError('missing right_by') - - # add by to our key-list so we can have it in the - # output as a key - if self.left_by is not None: - if not is_list_like(self.left_by): - self.left_by = [self.left_by] - if not is_list_like(self.right_by): - self.right_by = [self.right_by] - - self.left_on = self.left_by + list(self.left_on) - self.right_on = self.right_by + list(self.right_on) - - @property - def _asof_key(self): - """ This is our asof key, the 'on' """ - return self.left_on[-1] - - def _get_merge_keys(self): - - # note this function has side effects - (left_join_keys, - right_join_keys, - join_names) = super(_AsOfMerge, self)._get_merge_keys() - - # validate index types are the same - for lk, rk in zip(left_join_keys, right_join_keys): - if not is_dtype_equal(lk.dtype, rk.dtype): - raise MergeError("incompatible merge keys, " - "must be the same type") - - # validate tolerance; must be a Timedelta if we have a DTI - if self.tolerance is not None: - - lt = left_join_keys[-1] - msg = "incompatible tolerance, must be compat " \ - "with type {0}".format(type(lt)) - - if is_datetime64_dtype(lt) or is_datetime64tz_dtype(lt): - if not isinstance(self.tolerance, Timedelta): - raise MergeError(msg) - if self.tolerance < Timedelta(0): - raise MergeError("tolerance must be positive") - - elif is_int64_dtype(lt): - if not is_integer(self.tolerance): - raise MergeError(msg) - if self.tolerance < 0: - raise MergeError("tolerance must be positive") - - else: - raise MergeError("key must be integer or timestamp") - - # validate allow_exact_matches - if not is_bool(self.allow_exact_matches): - raise MergeError("allow_exact_matches must be boolean, " - "passed {0}".format(self.allow_exact_matches)) - - return left_join_keys, right_join_keys, join_names - - def _get_join_indexers(self): - """ return the join indexers """ - - def flip(xs): - """ unlike np.transpose, this returns an array of tuples """ - labels = list(string.ascii_lowercase[:len(xs)]) - dtypes = [x.dtype for x in xs] - labeled_dtypes = list(zip(labels, dtypes)) - return np.array(lzip(*xs), labeled_dtypes) - - # values to compare - left_values = (self.left.index.values if self.left_index else - self.left_join_keys[-1]) - right_values = (self.right.index.values if self.right_index else - self.right_join_keys[-1]) - tolerance = self.tolerance - - # we required sortedness in the join keys - msg = " keys must be sorted" - if not 
Index(left_values).is_monotonic: - raise ValueError('left' + msg) - if not Index(right_values).is_monotonic: - raise ValueError('right' + msg) - - # initial type conversion as needed - if needs_i8_conversion(left_values): - left_values = left_values.view('i8') - right_values = right_values.view('i8') - if tolerance is not None: - tolerance = tolerance.value - - # a "by" parameter requires special handling - if self.left_by is not None: - if len(self.left_join_keys) > 2: - # get tuple representation of values if more than one - left_by_values = flip(self.left_join_keys[0:-1]) - right_by_values = flip(self.right_join_keys[0:-1]) - else: - left_by_values = self.left_join_keys[0] - right_by_values = self.right_join_keys[0] - - # upcast 'by' parameter because HashTable is limited - by_type = _get_cython_type_upcast(left_by_values.dtype) - by_type_caster = _type_casters[by_type] - left_by_values = by_type_caster(left_by_values) - right_by_values = by_type_caster(right_by_values) - - # choose appropriate function by type - on_type = _get_cython_type(left_values.dtype) - func = _asof_by_function(on_type, by_type) - return func(left_values, - right_values, - left_by_values, - right_by_values, - self.allow_exact_matches, - tolerance) - else: - # choose appropriate function by type - on_type = _get_cython_type(left_values.dtype) - func = _asof_function(on_type) - return func(left_values, - right_values, - self.allow_exact_matches, - tolerance) - - -def _get_multiindex_indexer(join_keys, index, sort): - from functools import partial - - # bind `sort` argument - fkeys = partial(_factorize_keys, sort=sort) - - # left & right join labels and num. of levels at each location - rlab, llab, shape = map(list, zip(* map(fkeys, index.levels, join_keys))) - if sort: - rlab = list(map(np.take, rlab, index.labels)) - else: - i8copy = lambda a: a.astype('i8', subok=False, copy=True) - rlab = list(map(i8copy, index.labels)) - - # fix right labels if there were any nulls - for i in range(len(join_keys)): - mask = index.labels[i] == -1 - if mask.any(): - # check if there already was any nulls at this location - # if there was, it is factorized to `shape[i] - 1` - a = join_keys[i][llab[i] == shape[i] - 1] - if a.size == 0 or not a[0] != a[0]: - shape[i] += 1 - - rlab[i][mask] = shape[i] - 1 - - # get flat i8 join keys - lkey, rkey = _get_join_keys(llab, rlab, shape, sort) - - # factorize keys to a dense i8 space - lkey, rkey, count = fkeys(lkey, rkey) - - return _join.left_outer_join(lkey, rkey, count, sort=sort) - - -def _get_single_indexer(join_key, index, sort=False): - left_key, right_key, count = _factorize_keys(join_key, index, sort=sort) - - left_indexer, right_indexer = _join.left_outer_join( - _ensure_int64(left_key), - _ensure_int64(right_key), - count, sort=sort) - - return left_indexer, right_indexer - - -def _left_join_on_index(left_ax, right_ax, join_keys, sort=False): - if len(join_keys) > 1: - if not ((isinstance(right_ax, MultiIndex) and - len(join_keys) == right_ax.nlevels)): - raise AssertionError("If more than one join key is given then " - "'right_ax' must be a MultiIndex and the " - "number of join keys must be the number of " - "levels in right_ax") - - left_indexer, right_indexer = \ - _get_multiindex_indexer(join_keys, right_ax, sort=sort) - else: - jkey = join_keys[0] - - left_indexer, right_indexer = \ - _get_single_indexer(jkey, right_ax, sort=sort) - - if sort or len(left_ax) != len(left_indexer): - # if asked to sort or there are 1-to-many matches - join_index = left_ax.take(left_indexer) 
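The monotonicity and tolerance checks above are what a caller sees through the public `pd.merge_asof` (available in the pandas generation this patch targets). A minimal sketch, with illustrative data:

```python
import pandas as pd

trades = pd.DataFrame({'time': pd.to_datetime(['2016-05-25 13:30:00.023',
                                               '2016-05-25 13:30:00.038']),
                       'price': [51.95, 51.95]})
quotes = pd.DataFrame({'time': pd.to_datetime(['2016-05-25 13:30:00.000',
                                               '2016-05-25 13:30:00.030']),
                       'bid': [51.96, 51.97]})

# both frames must already be sorted on 'time' (the is_monotonic check
# above); tolerance must be a Timedelta for datetime keys
print(pd.merge_asof(trades, quotes, on='time',
                    tolerance=pd.Timedelta('10ms')))
```

Each trade picks up the most recent quote at or before its timestamp, provided it lies within the tolerance window.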
- return join_index, left_indexer, right_indexer - - # left frame preserves order & length of its index - return left_ax, None, right_indexer - - -def _right_outer_join(x, y, max_groups): - right_indexer, left_indexer = _join.left_outer_join(y, x, max_groups) - return left_indexer, right_indexer - -_join_functions = { - 'inner': _join.inner_join, - 'left': _join.left_outer_join, - 'right': _right_outer_join, - 'outer': _join.full_outer_join, -} - - -def _factorize_keys(lk, rk, sort=True): - if is_datetime64tz_dtype(lk) and is_datetime64tz_dtype(rk): - lk = lk.values - rk = rk.values - if is_int_or_datetime_dtype(lk) and is_int_or_datetime_dtype(rk): - klass = _hash.Int64Factorizer - lk = _ensure_int64(com._values_from_object(lk)) - rk = _ensure_int64(com._values_from_object(rk)) - else: - klass = _hash.Factorizer - lk = _ensure_object(lk) - rk = _ensure_object(rk) - - rizer = klass(max(len(lk), len(rk))) - - llab = rizer.factorize(lk) - rlab = rizer.factorize(rk) - - count = rizer.get_count() - - if sort: - uniques = rizer.uniques.to_array() - llab, rlab = _sort_labels(uniques, llab, rlab) - - # NA group - lmask = llab == -1 - lany = lmask.any() - rmask = rlab == -1 - rany = rmask.any() - - if lany or rany: - if lany: - np.putmask(llab, lmask, count) - if rany: - np.putmask(rlab, rmask, count) - count += 1 - - return llab, rlab, count - - -def _sort_labels(uniques, left, right): - if not isinstance(uniques, np.ndarray): - # tuplesafe - uniques = Index(uniques).values - - l = len(left) - labels = np.concatenate([left, right]) - - _, new_labels = algos.safe_sort(uniques, labels, na_sentinel=-1) - new_labels = _ensure_int64(new_labels) - new_left, new_right = new_labels[:l], new_labels[l:] - - return new_left, new_right - - -def _get_join_keys(llab, rlab, shape, sort): - from pandas.core.groupby import _int64_overflow_possible - - # how many levels can be done without overflow - pred = lambda i: not _int64_overflow_possible(shape[:i]) - nlev = next(filter(pred, range(len(shape), 0, -1))) - - # get keys for the first `nlev` levels - stride = np.prod(shape[1:nlev], dtype='i8') - lkey = stride * llab[0].astype('i8', subok=False, copy=False) - rkey = stride * rlab[0].astype('i8', subok=False, copy=False) - - for i in range(1, nlev): - stride //= shape[i] - lkey += llab[i] * stride - rkey += rlab[i] * stride - - if nlev == len(shape): # all done! - return lkey, rkey - - # densify current keys to avoid overflow - lkey, rkey, count = _factorize_keys(lkey, rkey, sort=sort) - - llab = [lkey] + llab[nlev:] - rlab = [rkey] + rlab[nlev:] - shape = [count] + shape[nlev:] - - return _get_join_keys(llab, rlab, shape, sort) - -# --------------------------------------------------------------------- -# Concatenate DataFrame objects - - -def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, - keys=None, levels=None, names=None, verify_integrity=False, - copy=True): - """ - Concatenate pandas objects along a particular axis with optional set logic - along the other axes. Can also add a layer of hierarchical indexing on the - concatenation axis, which may be useful if the labels are the same (or - overlapping) on the passed axis number - - Parameters - ---------- - objs : a sequence or mapping of Series, DataFrame, or Panel objects - If a dict is passed, the sorted keys will be used as the `keys` - argument, unless it is passed, in which case the values will be - selected (see below). 
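Up in `_factorize_keys`, both sides' keys are mapped into one dense integer space so the Cython joins can compare plain `int64` labels. The public `pd.factorize` shows the same idea on a single array (a sketch, not the internal code path):

```python
import pandas as pd

# every distinct key receives a dense integer label
labels, uniques = pd.factorize(['b', 'a', 'b', 'c'])
print(labels)    # [0 1 0 2]
print(uniques)   # ['b' 'a' 'c'] (array or Index, depending on version)
```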
Any None objects will be dropped silently unless - they are all None in which case a ValueError will be raised - axis : {0/'index', 1/'columns'}, default 0 - The axis to concatenate along - join : {'inner', 'outer'}, default 'outer' - How to handle indexes on other axis(es) - join_axes : list of Index objects - Specific indexes to use for the other n - 1 axes instead of performing - inner/outer set logic - ignore_index : boolean, default False - If True, do not use the index values along the concatenation axis. The - resulting axis will be labeled 0, ..., n - 1. This is useful if you are - concatenating objects where the concatenation axis does not have - meaningful indexing information. Note the index values on the other - axes are still respected in the join. - keys : sequence, default None - If multiple levels passed, should contain tuples. Construct - hierarchical index using the passed keys as the outermost level - levels : list of sequences, default None - Specific levels (unique values) to use for constructing a - MultiIndex. Otherwise they will be inferred from the keys - names : list, default None - Names for the levels in the resulting hierarchical index - verify_integrity : boolean, default False - Check whether the new concatenated axis contains duplicates. This can - be very expensive relative to the actual data concatenation - copy : boolean, default True - If False, do not copy data unnecessarily - - Notes - ----- - The keys, levels, and names arguments are all optional - - Returns - ------- - concatenated : type of objects - """ - op = _Concatenator(objs, axis=axis, join_axes=join_axes, - ignore_index=ignore_index, join=join, - keys=keys, levels=levels, names=names, - verify_integrity=verify_integrity, - copy=copy) - return op.get_result() - - -class _Concatenator(object): - """ - Orchestrates a concatenation operation for BlockManagers - """ - - def __init__(self, objs, axis=0, join='outer', join_axes=None, - keys=None, levels=None, names=None, - ignore_index=False, verify_integrity=False, copy=True): - if isinstance(objs, (NDFrame, compat.string_types)): - raise TypeError('first argument must be an iterable of pandas ' - 'objects, you passed an object of type ' - '"{0}"'.format(type(objs).__name__)) - - if join == 'outer': - self.intersect = False - elif join == 'inner': - self.intersect = True - else: # pragma: no cover - raise ValueError('Only can inner (intersect) or outer (union) ' - 'join the other axis') - - if isinstance(objs, dict): - if keys is None: - keys = sorted(objs) - objs = [objs[k] for k in keys] - else: - objs = list(objs) - - if len(objs) == 0: - raise ValueError('No objects to concatenate') - - if keys is None: - objs = [obj for obj in objs if obj is not None] - else: - # #1649 - clean_keys = [] - clean_objs = [] - for k, v in zip(keys, objs): - if v is None: - continue - clean_keys.append(k) - clean_objs.append(v) - objs = clean_objs - name = getattr(keys, 'name', None) - keys = Index(clean_keys, name=name) - - if len(objs) == 0: - raise ValueError('All objects passed were None') - - # consolidate data & figure out what our result ndim is going to be - ndims = set() - for obj in objs: - if not isinstance(obj, NDFrame): - raise TypeError("cannot concatenate a non-NDFrame object") - - # consolidate - obj.consolidate(inplace=True) - ndims.add(obj.ndim) - - # get the sample - # want the higest ndim that we have, and must be non-empty - # unless all objs are empty - sample = None - if len(ndims) > 1: - max_ndim = max(ndims) - for obj in objs: - if obj.ndim 
== max_ndim and np.sum(obj.shape): - sample = obj - break - - else: - # filter out the empties if we have not multi-index possibiltes - # note to keep empty Series as it affect to result columns / name - non_empties = [obj for obj in objs - if sum(obj.shape) > 0 or isinstance(obj, Series)] - - if (len(non_empties) and (keys is None and names is None and - levels is None and join_axes is None)): - objs = non_empties - sample = objs[0] - - if sample is None: - sample = objs[0] - self.objs = objs - - # Standardize axis parameter to int - if isinstance(sample, Series): - axis = DataFrame()._get_axis_number(axis) - else: - axis = sample._get_axis_number(axis) - - # Need to flip BlockManager axis in the DataFrame special case - self._is_frame = isinstance(sample, DataFrame) - if self._is_frame: - axis = 1 if axis == 0 else 0 - - self._is_series = isinstance(sample, ABCSeries) - if not 0 <= axis <= sample.ndim: - raise AssertionError("axis must be between 0 and {0}, " - "input was {1}".format(sample.ndim, axis)) - - # if we have mixed ndims, then convert to highest ndim - # creating column numbers as needed - if len(ndims) > 1: - current_column = 0 - max_ndim = sample.ndim - self.objs, objs = [], self.objs - for obj in objs: - - ndim = obj.ndim - if ndim == max_ndim: - pass - - elif ndim != max_ndim - 1: - raise ValueError("cannot concatenate unaligned mixed " - "dimensional NDFrame objects") - - else: - name = getattr(obj, 'name', None) - if ignore_index or name is None: - name = current_column - current_column += 1 - - # doing a row-wise concatenation so need everything - # to line up - if self._is_frame and axis == 1: - name = 0 - obj = sample._constructor({name: obj}) - - self.objs.append(obj) - - # note: this is the BlockManager axis (since DataFrame is transposed) - self.axis = axis - self.join_axes = join_axes - self.keys = keys - self.names = names or getattr(keys, 'names', None) - self.levels = levels - - self.ignore_index = ignore_index - self.verify_integrity = verify_integrity - self.copy = copy - - self.new_axes = self._get_new_axes() - - def get_result(self): - - # series only - if self._is_series: - - # stack blocks - if self.axis == 0: - # concat Series with length to keep dtype as much - non_empties = [x for x in self.objs if len(x) > 0] - if len(non_empties) > 0: - values = [x._values for x in non_empties] - else: - values = [x._values for x in self.objs] - new_data = _concat._concat_compat(values) - - name = com._consensus_name_attr(self.objs) - cons = _concat._get_series_result_type(new_data) - - return (cons(new_data, index=self.new_axes[0], - name=name, dtype=new_data.dtype) - .__finalize__(self, method='concat')) - - # combine as columns in a frame - else: - data = dict(zip(range(len(self.objs)), self.objs)) - cons = _concat._get_series_result_type(data) - - index, columns = self.new_axes - df = cons(data, index=index) - df.columns = columns - return df.__finalize__(self, method='concat') - - # combine block managers - else: - mgrs_indexers = [] - for obj in self.objs: - mgr = obj._data - indexers = {} - for ax, new_labels in enumerate(self.new_axes): - if ax == self.axis: - # Suppress reindexing on concat axis - continue - - obj_labels = mgr.axes[ax] - if not new_labels.equals(obj_labels): - indexers[ax] = obj_labels.reindex(new_labels)[1] - - mgrs_indexers.append((obj._data, indexers)) - - new_data = concatenate_block_managers( - mgrs_indexers, self.new_axes, concat_axis=self.axis, - copy=self.copy) - if not self.copy: - new_data._consolidate_inplace() - - cons = 
_concat._get_frame_result_type(new_data, self.objs) - return (cons._from_axes(new_data, self.new_axes) - .__finalize__(self, method='concat')) - - def _get_result_dim(self): - if self._is_series and self.axis == 1: - return 2 - else: - return self.objs[0].ndim - - def _get_new_axes(self): - ndim = self._get_result_dim() - new_axes = [None] * ndim - - if self.join_axes is None: - for i in range(ndim): - if i == self.axis: - continue - new_axes[i] = self._get_comb_axis(i) - else: - if len(self.join_axes) != ndim - 1: - raise AssertionError("length of join_axes must not be " - "equal to {0}".format(ndim - 1)) - - # ufff... - indices = lrange(ndim) - indices.remove(self.axis) - - for i, ax in zip(indices, self.join_axes): - new_axes[i] = ax - - new_axes[self.axis] = self._get_concat_axis() - return new_axes - - def _get_comb_axis(self, i): - if self._is_series: - all_indexes = [x.index for x in self.objs] - else: - try: - all_indexes = [x._data.axes[i] for x in self.objs] - except IndexError: - types = [type(x).__name__ for x in self.objs] - raise TypeError("Cannot concatenate list of %s" % types) - - return _get_combined_index(all_indexes, intersect=self.intersect) - - def _get_concat_axis(self): - """ - Return index to be used along concatenation axis. - """ - if self._is_series: - if self.axis == 0: - indexes = [x.index for x in self.objs] - elif self.ignore_index: - idx = com._default_index(len(self.objs)) - return idx - elif self.keys is None: - names = [None] * len(self.objs) - num = 0 - has_names = False - for i, x in enumerate(self.objs): - if not isinstance(x, Series): - raise TypeError("Cannot concatenate type 'Series' " - "with object of type " - "%r" % type(x).__name__) - if x.name is not None: - names[i] = x.name - has_names = True - else: - names[i] = num - num += 1 - if has_names: - return Index(names) - else: - return com._default_index(len(self.objs)) - else: - return _ensure_index(self.keys) - else: - indexes = [x._data.axes[self.axis] for x in self.objs] - - if self.ignore_index: - idx = com._default_index(sum(len(i) for i in indexes)) - return idx - - if self.keys is None: - concat_axis = _concat_indexes(indexes) - else: - concat_axis = _make_concat_multiindex(indexes, self.keys, - self.levels, self.names) - - self._maybe_check_integrity(concat_axis) - - return concat_axis - - def _maybe_check_integrity(self, concat_index): - if self.verify_integrity: - if not concat_index.is_unique: - overlap = concat_index.get_duplicates() - raise ValueError('Indexes have overlapping values: %s' - % str(overlap)) - - -def _concat_indexes(indexes): - return indexes[0].append(indexes[1:]) - - -def _make_concat_multiindex(indexes, keys, levels=None, names=None): - - if ((levels is None and isinstance(keys[0], tuple)) or - (levels is not None and len(levels) > 1)): - zipped = lzip(*keys) - if names is None: - names = [None] * len(zipped) - - if levels is None: - _, levels = _factorize_from_iterables(zipped) - else: - levels = [_ensure_index(x) for x in levels] - else: - zipped = [keys] - if names is None: - names = [None] - - if levels is None: - levels = [_ensure_index(keys)] - else: - levels = [_ensure_index(x) for x in levels] - - if not _all_indexes_same(indexes): - label_list = [] - - # things are potentially different sizes, so compute the exact labels - # for each level and pass those to MultiIndex.from_arrays - - for hlevel, level in zip(zipped, levels): - to_concat = [] - for key, index in zip(hlevel, indexes): - try: - i = level.get_loc(key) - except KeyError: - raise 
ValueError('Key %s not in level %s' - % (str(key), str(level))) - - to_concat.append(np.repeat(i, len(index))) - label_list.append(np.concatenate(to_concat)) - - concat_index = _concat_indexes(indexes) - - # these go at the end - if isinstance(concat_index, MultiIndex): - levels.extend(concat_index.levels) - label_list.extend(concat_index.labels) - else: - codes, categories = _factorize_from_iterable(concat_index) - levels.append(categories) - label_list.append(codes) - - if len(names) == len(levels): - names = list(names) - else: - # make sure that all of the passed indices have the same nlevels - if not len(set([idx.nlevels for idx in indexes])) == 1: - raise AssertionError("Cannot concat indices that do" - " not have the same number of levels") - - # also copies - names = names + _get_consensus_names(indexes) - - return MultiIndex(levels=levels, labels=label_list, names=names, - verify_integrity=False) - - new_index = indexes[0] - n = len(new_index) - kpieces = len(indexes) - - # also copies - new_names = list(names) - new_levels = list(levels) - - # construct labels - new_labels = [] - - # do something a bit more speedy - - for hlevel, level in zip(zipped, levels): - hlevel = _ensure_index(hlevel) - mapped = level.get_indexer(hlevel) - - mask = mapped == -1 - if mask.any(): - raise ValueError('Values not found in passed level: %s' - % str(hlevel[mask])) - - new_labels.append(np.repeat(mapped, n)) - - if isinstance(new_index, MultiIndex): - new_levels.extend(new_index.levels) - new_labels.extend([np.tile(lab, kpieces) for lab in new_index.labels]) - else: - new_levels.append(new_index) - new_labels.append(np.tile(np.arange(n), kpieces)) - - if len(new_names) < len(new_levels): - new_names.extend(new_index.names) - - return MultiIndex(levels=new_levels, labels=new_labels, names=new_names, - verify_integrity=False) +# back-compat of pseudo-public API +def concat_wrap(): -def _should_fill(lname, rname): - if (not isinstance(lname, compat.string_types) or - not isinstance(rname, compat.string_types)): - return True - return lname == rname + def wrapper(*args, **kwargs): + warnings.warn("pandas.tools.merge.concat is deprecated. 
" + "import from the public API: " + "pandas.concat instead", + FutureWarning, stacklevel=3) + import pandas as pd + return pd.concat(*args, **kwargs) + return wrapper -def _any(x): - return x is not None and len(x) > 0 and any([y is not None for y in x]) +concat = concat_wrap() diff --git a/pandas/tools/plotting.py b/pandas/tools/plotting.py index 484270ecbda0b..a68da67a219e2 100644 --- a/pandas/tools/plotting.py +++ b/pandas/tools/plotting.py @@ -1,4008 +1,20 @@ -# being a bit too dynamic -# pylint: disable=E1101 -from __future__ import division - +import sys import warnings -import re -from math import ceil -from collections import namedtuple -from contextlib import contextmanager -from distutils.version import LooseVersion - -import numpy as np - -from pandas.types.common import (is_list_like, - is_integer, - is_number, - is_hashable, - is_iterator) -from pandas.types.missing import isnull, notnull - -from pandas.util.decorators import cache_readonly, deprecate_kwarg -from pandas.core.base import PandasObject - -from pandas.core.common import AbstractMethodError, _try_sort -from pandas.core.generic import _shared_docs, _shared_doc_kwargs -from pandas.core.index import Index, MultiIndex -from pandas.core.series import Series, remove_na -from pandas.tseries.period import PeriodIndex -from pandas.compat import range, lrange, lmap, map, zip, string_types -import pandas.compat as compat -from pandas.formats.printing import pprint_thing -from pandas.util.decorators import Appender -try: # mpl optional - import pandas.tseries.converter as conv - conv.register() # needs to override so set_xlim works with str/number -except ImportError: - pass - - -# Extracted from https://gist.github.com/huyng/816622 -# this is the rcParams set when setting display.with_mpl_style -# to True. 
-mpl_stylesheet = { - 'axes.axisbelow': True, - 'axes.color_cycle': ['#348ABD', - '#7A68A6', - '#A60628', - '#467821', - '#CF4457', - '#188487', - '#E24A33'], - 'axes.edgecolor': '#bcbcbc', - 'axes.facecolor': '#eeeeee', - 'axes.grid': True, - 'axes.labelcolor': '#555555', - 'axes.labelsize': 'large', - 'axes.linewidth': 1.0, - 'axes.titlesize': 'x-large', - 'figure.edgecolor': 'white', - 'figure.facecolor': 'white', - 'figure.figsize': (6.0, 4.0), - 'figure.subplot.hspace': 0.5, - 'font.family': 'monospace', - 'font.monospace': ['Andale Mono', - 'Nimbus Mono L', - 'Courier New', - 'Courier', - 'Fixed', - 'Terminal', - 'monospace'], - 'font.size': 10, - 'interactive': True, - 'keymap.all_axes': ['a'], - 'keymap.back': ['left', 'c', 'backspace'], - 'keymap.forward': ['right', 'v'], - 'keymap.fullscreen': ['f'], - 'keymap.grid': ['g'], - 'keymap.home': ['h', 'r', 'home'], - 'keymap.pan': ['p'], - 'keymap.save': ['s'], - 'keymap.xscale': ['L', 'k'], - 'keymap.yscale': ['l'], - 'keymap.zoom': ['o'], - 'legend.fancybox': True, - 'lines.antialiased': True, - 'lines.linewidth': 1.0, - 'patch.antialiased': True, - 'patch.edgecolor': '#EEEEEE', - 'patch.facecolor': '#348ABD', - 'patch.linewidth': 0.5, - 'toolbar': 'toolbar2', - 'xtick.color': '#555555', - 'xtick.direction': 'in', - 'xtick.major.pad': 6.0, - 'xtick.major.size': 0.0, - 'xtick.minor.pad': 6.0, - 'xtick.minor.size': 0.0, - 'ytick.color': '#555555', - 'ytick.direction': 'in', - 'ytick.major.pad': 6.0, - 'ytick.major.size': 0.0, - 'ytick.minor.pad': 6.0, - 'ytick.minor.size': 0.0 -} - - -def _mpl_le_1_2_1(): - try: - import matplotlib as mpl - return (str(mpl.__version__) <= LooseVersion('1.2.1') and - str(mpl.__version__)[0] != '0') - except ImportError: - return False - - -def _mpl_ge_1_3_1(): - try: - import matplotlib - # The or v[0] == '0' is because their versioneer is - # messed up on dev - return (matplotlib.__version__ >= LooseVersion('1.3.1') or - matplotlib.__version__[0] == '0') - except ImportError: - return False - - -def _mpl_ge_1_4_0(): - try: - import matplotlib - return (matplotlib.__version__ >= LooseVersion('1.4') or - matplotlib.__version__[0] == '0') - except ImportError: - return False - - -def _mpl_ge_1_5_0(): - try: - import matplotlib - return (matplotlib.__version__ >= LooseVersion('1.5') or - matplotlib.__version__[0] == '0') - except ImportError: - return False - - -def _mpl_ge_2_0_0(): - try: - import matplotlib - return matplotlib.__version__ >= LooseVersion('2.0') - except ImportError: - return False - -if _mpl_ge_1_5_0(): - # Compat with mp 1.5, which uses cycler. - import cycler - colors = mpl_stylesheet.pop('axes.color_cycle') - mpl_stylesheet['axes.prop_cycle'] = cycler.cycler('color', colors) - - -def _get_standard_kind(kind): - return {'density': 'kde'}.get(kind, kind) - - -def _get_standard_colors(num_colors=None, colormap=None, color_type='default', - color=None): - import matplotlib.pyplot as plt - - if color is None and colormap is not None: - if isinstance(colormap, compat.string_types): - import matplotlib.cm as cm - cmap = colormap - colormap = cm.get_cmap(colormap) - if colormap is None: - raise ValueError("Colormap {0} is not recognized".format(cmap)) - colors = lmap(colormap, np.linspace(0, 1, num=num_colors)) - elif color is not None: - if colormap is not None: - warnings.warn("'color' and 'colormap' cannot be used " - "simultaneously. 
Using 'color'") - colors = list(color) if is_list_like(color) else color - else: - if color_type == 'default': - # need to call list() on the result to copy so we don't - # modify the global rcParams below - try: - colors = [c['color'] - for c in list(plt.rcParams['axes.prop_cycle'])] - except KeyError: - colors = list(plt.rcParams.get('axes.color_cycle', - list('bgrcmyk'))) - if isinstance(colors, compat.string_types): - colors = list(colors) - elif color_type == 'random': - import random - - def random_color(column): - random.seed(column) - return [random.random() for _ in range(3)] - - colors = lmap(random_color, lrange(num_colors)) - else: - raise ValueError("color_type must be either 'default' or 'random'") - - if isinstance(colors, compat.string_types): - import matplotlib.colors - conv = matplotlib.colors.ColorConverter() - - def _maybe_valid_colors(colors): - try: - [conv.to_rgba(c) for c in colors] - return True - except ValueError: - return False - - # check whether the string can be convertable to single color - maybe_single_color = _maybe_valid_colors([colors]) - # check whether each character can be convertable to colors - maybe_color_cycle = _maybe_valid_colors(list(colors)) - if maybe_single_color and maybe_color_cycle and len(colors) > 1: - msg = ("'{0}' can be parsed as both single color and " - "color cycle. Specify each color using a list " - "like ['{0}'] or {1}") - raise ValueError(msg.format(colors, list(colors))) - elif maybe_single_color: - colors = [colors] - else: - # ``colors`` is regarded as color cycle. - # mpl will raise error any of them is invalid - pass - - if len(colors) != num_colors: - multiple = num_colors // len(colors) - 1 - mod = num_colors % len(colors) - - colors += multiple * colors - colors += colors[:mod] - - return colors - - -class _Options(dict): - """ - Stores pandas plotting options. - Allows for parameter aliasing so you can just use parameter names that are - the same as the plot function parameters, but is stored in a canonical - format that makes it easy to breakdown into groups later - """ - - # alias so the names are same as plotting method parameter names - _ALIASES = {'x_compat': 'xaxis.compat'} - _DEFAULT_KEYS = ['xaxis.compat'] - - def __init__(self): - self['xaxis.compat'] = False - - def __getitem__(self, key): - key = self._get_canonical_key(key) - if key not in self: - raise ValueError('%s is not a valid pandas plotting option' % key) - return super(_Options, self).__getitem__(key) - - def __setitem__(self, key, value): - key = self._get_canonical_key(key) - return super(_Options, self).__setitem__(key, value) - - def __delitem__(self, key): - key = self._get_canonical_key(key) - if key in self._DEFAULT_KEYS: - raise ValueError('Cannot remove default parameter %s' % key) - return super(_Options, self).__delitem__(key) - - def __contains__(self, key): - key = self._get_canonical_key(key) - return super(_Options, self).__contains__(key) - - def reset(self): - """ - Reset the option store to its initial state - - Returns - ------- - None - """ - self.__init__() - - def _get_canonical_key(self, key): - return self._ALIASES.get(key, key) - - @contextmanager - def use(self, key, value): - """ - Temporarily set a parameter value using the with statement. - Aliasing allowed. 
- """ - old_value = self[key] - try: - self[key] = value - yield self - finally: - self[key] = old_value - - -plot_params = _Options() - - -def scatter_matrix(frame, alpha=0.5, figsize=None, ax=None, grid=False, - diagonal='hist', marker='.', density_kwds=None, - hist_kwds=None, range_padding=0.05, **kwds): - """ - Draw a matrix of scatter plots. - - Parameters - ---------- - frame : DataFrame - alpha : float, optional - amount of transparency applied - figsize : (float,float), optional - a tuple (width, height) in inches - ax : Matplotlib axis object, optional - grid : bool, optional - setting this to True will show the grid - diagonal : {'hist', 'kde'} - pick between 'kde' and 'hist' for - either Kernel Density Estimation or Histogram - plot in the diagonal - marker : str, optional - Matplotlib marker type, default '.' - hist_kwds : other plotting keyword arguments - To be passed to hist function - density_kwds : other plotting keyword arguments - To be passed to kernel density estimate plot - range_padding : float, optional - relative extension of axis range in x and y - with respect to (x_max - x_min) or (y_max - y_min), - default 0.05 - kwds : other plotting keyword arguments - To be passed to scatter function - - Examples - -------- - >>> df = DataFrame(np.random.randn(1000, 4), columns=['A','B','C','D']) - >>> scatter_matrix(df, alpha=0.2) - """ - import matplotlib.pyplot as plt - - df = frame._get_numeric_data() - n = df.columns.size - naxes = n * n - fig, axes = _subplots(naxes=naxes, figsize=figsize, ax=ax, - squeeze=False) - - # no gaps between subplots - fig.subplots_adjust(wspace=0, hspace=0) - - mask = notnull(df) - - marker = _get_marker_compat(marker) - - hist_kwds = hist_kwds or {} - density_kwds = density_kwds or {} - - # workaround because `c='b'` is hardcoded in matplotlibs scatter method - kwds.setdefault('c', plt.rcParams['patch.facecolor']) - - boundaries_list = [] - for a in df.columns: - values = df[a].values[mask[a].values] - rmin_, rmax_ = np.min(values), np.max(values) - rdelta_ext = (rmax_ - rmin_) * range_padding / 2. - boundaries_list.append((rmin_ - rdelta_ext, rmax_ + rdelta_ext)) - - for i, a in zip(lrange(n), df.columns): - for j, b in zip(lrange(n), df.columns): - ax = axes[i, j] - - if i == j: - values = df[a].values[mask[a].values] - - # Deal with the diagonal by drawing a histogram there. 
- if diagonal == 'hist': - ax.hist(values, **hist_kwds) - - elif diagonal in ('kde', 'density'): - from scipy.stats import gaussian_kde - y = values - gkde = gaussian_kde(y) - ind = np.linspace(y.min(), y.max(), 1000) - ax.plot(ind, gkde.evaluate(ind), **density_kwds) - - ax.set_xlim(boundaries_list[i]) - - else: - common = (mask[a] & mask[b]).values - - ax.scatter(df[b][common], df[a][common], - marker=marker, alpha=alpha, **kwds) - - ax.set_xlim(boundaries_list[j]) - ax.set_ylim(boundaries_list[i]) - - ax.set_xlabel(b) - ax.set_ylabel(a) - - if j != 0: - ax.yaxis.set_visible(False) - if i != n - 1: - ax.xaxis.set_visible(False) - - if len(df.columns) > 1: - lim1 = boundaries_list[0] - locs = axes[0][1].yaxis.get_majorticklocs() - locs = locs[(lim1[0] <= locs) & (locs <= lim1[1])] - adj = (locs - lim1[0]) / (lim1[1] - lim1[0]) - - lim0 = axes[0][0].get_ylim() - adj = adj * (lim0[1] - lim0[0]) + lim0[0] - axes[0][0].yaxis.set_ticks(adj) - - if np.all(locs == locs.astype(int)): - # if all ticks are int - locs = locs.astype(int) - axes[0][0].yaxis.set_ticklabels(locs) - - _set_ticks_props(axes, xlabelsize=8, xrot=90, ylabelsize=8, yrot=0) - - return axes - - -def _gca(): - import matplotlib.pyplot as plt - return plt.gca() - - -def _gcf(): - import matplotlib.pyplot as plt - return plt.gcf() - - -def _get_marker_compat(marker): - import matplotlib.lines as mlines - import matplotlib as mpl - if mpl.__version__ < '1.1.0' and marker == '.': - return 'o' - if marker not in mlines.lineMarkers: - return 'o' - return marker - - -def radviz(frame, class_column, ax=None, color=None, colormap=None, **kwds): - """RadViz - a multivariate data visualization algorithm - - Parameters: - ----------- - frame: DataFrame - class_column: str - Column name containing class names - ax: Matplotlib axis object, optional - color: list or tuple, optional - Colors to use for the different classes - colormap : str or matplotlib colormap object, default None - Colormap to select colors from. If string, load colormap with that name - from matplotlib. 
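A typical `radviz` call, using the iris dataset URL that the `parallel_coordinates` docstring further down also uses:

```python
import pandas as pd
from pandas.tools.plotting import radviz

df = pd.read_csv('https://raw.github.com/pandas-dev/pandas/master'
                 '/pandas/tests/data/iris.csv')
radviz(df, 'Name')   # 'Name' is the column holding the class labels
```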
- kwds: keywords - Options to pass to matplotlib scatter plotting method - - Returns: - -------- - ax: Matplotlib axis object - """ - import matplotlib.pyplot as plt - import matplotlib.patches as patches - - def normalize(series): - a = min(series) - b = max(series) - return (series - a) / (b - a) - - n = len(frame) - classes = frame[class_column].drop_duplicates() - class_col = frame[class_column] - df = frame.drop(class_column, axis=1).apply(normalize) - - if ax is None: - ax = plt.gca(xlim=[-1, 1], ylim=[-1, 1]) - - to_plot = {} - colors = _get_standard_colors(num_colors=len(classes), colormap=colormap, - color_type='random', color=color) - - for kls in classes: - to_plot[kls] = [[], []] - - m = len(frame.columns) - 1 - s = np.array([(np.cos(t), np.sin(t)) - for t in [2.0 * np.pi * (i / float(m)) - for i in range(m)]]) - - for i in range(n): - row = df.iloc[i].values - row_ = np.repeat(np.expand_dims(row, axis=1), 2, axis=1) - y = (s * row_).sum(axis=0) / row.sum() - kls = class_col.iat[i] - to_plot[kls][0].append(y[0]) - to_plot[kls][1].append(y[1]) - - for i, kls in enumerate(classes): - ax.scatter(to_plot[kls][0], to_plot[kls][1], color=colors[i], - label=pprint_thing(kls), **kwds) - ax.legend() - - ax.add_patch(patches.Circle((0.0, 0.0), radius=1.0, facecolor='none')) - - for xy, name in zip(s, df.columns): - - ax.add_patch(patches.Circle(xy, radius=0.025, facecolor='gray')) - - if xy[0] < 0.0 and xy[1] < 0.0: - ax.text(xy[0] - 0.025, xy[1] - 0.025, name, - ha='right', va='top', size='small') - elif xy[0] < 0.0 and xy[1] >= 0.0: - ax.text(xy[0] - 0.025, xy[1] + 0.025, name, - ha='right', va='bottom', size='small') - elif xy[0] >= 0.0 and xy[1] < 0.0: - ax.text(xy[0] + 0.025, xy[1] - 0.025, name, - ha='left', va='top', size='small') - elif xy[0] >= 0.0 and xy[1] >= 0.0: - ax.text(xy[0] + 0.025, xy[1] + 0.025, name, - ha='left', va='bottom', size='small') - - ax.axis('equal') - return ax - - -@deprecate_kwarg(old_arg_name='data', new_arg_name='frame') -def andrews_curves(frame, class_column, ax=None, samples=200, color=None, - colormap=None, **kwds): - """ - Generates a matplotlib plot of Andrews curves, for visualising clusters of - multivariate data. - - Andrews curves have the functional form: - - f(t) = x_1/sqrt(2) + x_2 sin(t) + x_3 cos(t) + - x_4 sin(2t) + x_5 cos(2t) + ... - - Where x coefficients correspond to the values of each dimension and t is - linearly spaced between -pi and +pi. Each row of frame then corresponds to - a single curve. - - Parameters: - ----------- - frame : DataFrame - Data to be plotted, preferably normalized to (0.0, 1.0) - class_column : Name of the column containing class names - ax : matplotlib axes object, default None - samples : Number of points to plot in each curve - color: list or tuple, optional - Colors to use for the different classes - colormap : str or matplotlib colormap object, default None - Colormap to select colors from. If string, load colormap with that name - from matplotlib. - kwds: keywords - Options to pass to matplotlib plotting method - - Returns: - -------- - ax: Matplotlib axis object - - """ - from math import sqrt, pi - import matplotlib.pyplot as plt - - def function(amplitudes): - def f(t): - x1 = amplitudes[0] - result = x1 / sqrt(2.0) - - # Take the rest of the coefficients and resize them - # appropriately. Take a copy of amplitudes as otherwise numpy - # deletes the element from amplitudes itself. 
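End to end, the closure above evaluates f(t) = x_1/sqrt(2) + x_2 sin(t) + x_3 cos(t) + ... once per row; usage mirrors `radviz`:

```python
import pandas as pd
from pandas.tools.plotting import andrews_curves

df = pd.read_csv('https://raw.github.com/pandas-dev/pandas/master'
                 '/pandas/tests/data/iris.csv')
andrews_curves(df, 'Name', samples=200)   # one smooth curve per row
```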
- coeffs = np.delete(np.copy(amplitudes), 0) - coeffs.resize(int((coeffs.size + 1) / 2), 2) - - # Generate the harmonics and arguments for the sin and cos - # functions. - harmonics = np.arange(0, coeffs.shape[0]) + 1 - trig_args = np.outer(harmonics, t) - - result += np.sum(coeffs[:, 0, np.newaxis] * np.sin(trig_args) + - coeffs[:, 1, np.newaxis] * np.cos(trig_args), - axis=0) - return result - return f - - n = len(frame) - class_col = frame[class_column] - classes = frame[class_column].drop_duplicates() - df = frame.drop(class_column, axis=1) - t = np.linspace(-pi, pi, samples) - used_legends = set([]) - - color_values = _get_standard_colors(num_colors=len(classes), - colormap=colormap, color_type='random', - color=color) - colors = dict(zip(classes, color_values)) - if ax is None: - ax = plt.gca(xlim=(-pi, pi)) - for i in range(n): - row = df.iloc[i].values - f = function(row) - y = f(t) - kls = class_col.iat[i] - label = pprint_thing(kls) - if label not in used_legends: - used_legends.add(label) - ax.plot(t, y, color=colors[kls], label=label, **kwds) - else: - ax.plot(t, y, color=colors[kls], **kwds) - - ax.legend(loc='upper right') - ax.grid() - return ax - - -def bootstrap_plot(series, fig=None, size=50, samples=500, **kwds): - """Bootstrap plot. - - Parameters: - ----------- - series: Time series - fig: matplotlib figure object, optional - size: number of data points to consider during each sampling - samples: number of times the bootstrap procedure is performed - kwds: optional keyword arguments for plotting commands, must be accepted - by both hist and plot - - Returns: - -------- - fig: matplotlib figure - """ - import random - import matplotlib.pyplot as plt - - # random.sample(ndarray, int) fails on python 3.3, sigh - data = list(series.values) - samplings = [random.sample(data, size) for _ in range(samples)] - - means = np.array([np.mean(sampling) for sampling in samplings]) - medians = np.array([np.median(sampling) for sampling in samplings]) - midranges = np.array([(min(sampling) + max(sampling)) * 0.5 - for sampling in samplings]) - if fig is None: - fig = plt.figure() - x = lrange(samples) - axes = [] - ax1 = fig.add_subplot(2, 3, 1) - ax1.set_xlabel("Sample") - axes.append(ax1) - ax1.plot(x, means, **kwds) - ax2 = fig.add_subplot(2, 3, 2) - ax2.set_xlabel("Sample") - axes.append(ax2) - ax2.plot(x, medians, **kwds) - ax3 = fig.add_subplot(2, 3, 3) - ax3.set_xlabel("Sample") - axes.append(ax3) - ax3.plot(x, midranges, **kwds) - ax4 = fig.add_subplot(2, 3, 4) - ax4.set_xlabel("Mean") - axes.append(ax4) - ax4.hist(means, **kwds) - ax5 = fig.add_subplot(2, 3, 5) - ax5.set_xlabel("Median") - axes.append(ax5) - ax5.hist(medians, **kwds) - ax6 = fig.add_subplot(2, 3, 6) - ax6.set_xlabel("Midrange") - axes.append(ax6) - ax6.hist(midranges, **kwds) - for axis in axes: - plt.setp(axis.get_xticklabels(), fontsize=8) - plt.setp(axis.get_yticklabels(), fontsize=8) - return fig - - -@deprecate_kwarg(old_arg_name='colors', new_arg_name='color') -@deprecate_kwarg(old_arg_name='data', new_arg_name='frame', stacklevel=3) -def parallel_coordinates(frame, class_column, cols=None, ax=None, color=None, - use_columns=False, xticks=None, colormap=None, - axvlines=True, axvlines_kwds=None, **kwds): - """Parallel coordinates plotting. 
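As a usage note for `bootstrap_plot` above: it draws `samples` random subsets of `size` points each and plots the sampling distributions of the mean, median, and midrange, e.g.:

```python
import numpy as np
import pandas as pd
from pandas.tools.plotting import bootstrap_plot

s = pd.Series(np.random.uniform(size=100))
bootstrap_plot(s, size=50, samples=500, color='grey')
```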
- - Parameters - ---------- - frame: DataFrame - class_column: str - Column name containing class names - cols: list, optional - A list of column names to use - ax: matplotlib.axis, optional - matplotlib axis object - color: list or tuple, optional - Colors to use for the different classes - use_columns: bool, optional - If true, columns will be used as xticks - xticks: list or tuple, optional - A list of values to use for xticks - colormap: str or matplotlib colormap, default None - Colormap to use for line colors. - axvlines: bool, optional - If true, vertical lines will be added at each xtick - axvlines_kwds: keywords, optional - Options to be passed to axvline method for vertical lines - kwds: keywords - Options to pass to matplotlib plotting method - - Returns - ------- - ax: matplotlib axis object - - Examples - -------- - >>> from pandas import read_csv - >>> from pandas.tools.plotting import parallel_coordinates - >>> from matplotlib import pyplot as plt - >>> df = read_csv('https://raw.github.com/pandas-dev/pandas/master' - '/pandas/tests/data/iris.csv') - >>> parallel_coordinates(df, 'Name', color=('#556270', - '#4ECDC4', '#C7F464')) - >>> plt.show() - """ - if axvlines_kwds is None: - axvlines_kwds = {'linewidth': 1, 'color': 'black'} - import matplotlib.pyplot as plt - - n = len(frame) - classes = frame[class_column].drop_duplicates() - class_col = frame[class_column] - - if cols is None: - df = frame.drop(class_column, axis=1) - else: - df = frame[cols] - - used_legends = set([]) - - ncols = len(df.columns) - - # determine values to use for xticks - if use_columns is True: - if not np.all(np.isreal(list(df.columns))): - raise ValueError('Columns must be numeric to be used as xticks') - x = df.columns - elif xticks is not None: - if not np.all(np.isreal(xticks)): - raise ValueError('xticks specified must be numeric') - elif len(xticks) != ncols: - raise ValueError('Length of xticks must match number of columns') - x = xticks - else: - x = lrange(ncols) - - if ax is None: - ax = plt.gca() - - color_values = _get_standard_colors(num_colors=len(classes), - colormap=colormap, color_type='random', - color=color) - - colors = dict(zip(classes, color_values)) - - for i in range(n): - y = df.iloc[i].values - kls = class_col.iat[i] - label = pprint_thing(kls) - if label not in used_legends: - used_legends.add(label) - ax.plot(x, y, color=colors[kls], label=label, **kwds) - else: - ax.plot(x, y, color=colors[kls], **kwds) - - if axvlines: - for i in x: - ax.axvline(i, **axvlines_kwds) - - ax.set_xticks(x) - ax.set_xticklabels(df.columns) - ax.set_xlim(x[0], x[-1]) - ax.legend(loc='upper right') - ax.grid() - return ax - - -def lag_plot(series, lag=1, ax=None, **kwds): - """Lag plot for time series. - - Parameters: - ----------- - series: Time series - lag: lag of the scatter plot, default 1 - ax: Matplotlib axis object, optional - kwds: Matplotlib scatter method keyword arguments, optional - - Returns: - -------- - ax: Matplotlib axis object - """ - import matplotlib.pyplot as plt - - # workaround because `c='b'` is hardcoded in matplotlibs scatter method - kwds.setdefault('c', plt.rcParams['patch.facecolor']) - - data = series.values - y1 = data[:-lag] - y2 = data[lag:] - if ax is None: - ax = plt.gca() - ax.set_xlabel("y(t)") - ax.set_ylabel("y(t + %s)" % lag) - ax.scatter(y1, y2, **kwds) - return ax - - -def autocorrelation_plot(series, ax=None, **kwds): - """Autocorrelation plot for time series. 
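`lag_plot` above simply scatters y(t) against y(t + lag); visible structure in the cloud suggests the series is not independent noise. A sketch with a noisy sinusoid:

```python
import numpy as np
import pandas as pd
from pandas.tools.plotting import lag_plot

x = np.linspace(-99 * np.pi, 99 * np.pi, num=1000)
s = pd.Series(0.1 * np.random.rand(1000) + 0.9 * np.sin(x))
lag_plot(s)   # an elliptic pattern appears at lag 1
```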
- - Parameters: - ----------- - series: Time series - ax: Matplotlib axis object, optional - kwds : keywords - Options to pass to matplotlib plotting method - - Returns: - ----------- - ax: Matplotlib axis object - """ - import matplotlib.pyplot as plt - n = len(series) - data = np.asarray(series) - if ax is None: - ax = plt.gca(xlim=(1, n), ylim=(-1.0, 1.0)) - mean = np.mean(data) - c0 = np.sum((data - mean) ** 2) / float(n) - - def r(h): - return ((data[:n - h] - mean) * - (data[h:] - mean)).sum() / float(n) / c0 - x = np.arange(n) + 1 - y = lmap(r, x) - z95 = 1.959963984540054 - z99 = 2.5758293035489004 - ax.axhline(y=z99 / np.sqrt(n), linestyle='--', color='grey') - ax.axhline(y=z95 / np.sqrt(n), color='grey') - ax.axhline(y=0.0, color='black') - ax.axhline(y=-z95 / np.sqrt(n), color='grey') - ax.axhline(y=-z99 / np.sqrt(n), linestyle='--', color='grey') - ax.set_xlabel("Lag") - ax.set_ylabel("Autocorrelation") - ax.plot(x, y, **kwds) - if 'label' in kwds: - ax.legend() - ax.grid() - return ax - - -class MPLPlot(object): - """ - Base class for assembling a pandas plot using matplotlib - - Parameters - ---------- - data : - - """ - - @property - def _kind(self): - """Specify kind str. Must be overridden in child class""" - raise NotImplementedError - - _layout_type = 'vertical' - _default_rot = 0 - orientation = None - _pop_attributes = ['label', 'style', 'logy', 'logx', 'loglog', - 'mark_right', 'stacked'] - _attr_defaults = {'logy': False, 'logx': False, 'loglog': False, - 'mark_right': True, 'stacked': False} - - def __init__(self, data, kind=None, by=None, subplots=False, sharex=None, - sharey=False, use_index=True, - figsize=None, grid=None, legend=True, rot=None, - ax=None, fig=None, title=None, xlim=None, ylim=None, - xticks=None, yticks=None, - sort_columns=False, fontsize=None, - secondary_y=False, colormap=None, - table=False, layout=None, **kwds): - - self.data = data - self.by = by - - self.kind = kind - - self.sort_columns = sort_columns - - self.subplots = subplots - - if sharex is None: - if ax is None: - self.sharex = True - else: - # if we get an axis, the users should do the visibility - # setting... - self.sharex = False - else: - self.sharex = sharex - - self.sharey = sharey - self.figsize = figsize - self.layout = layout - - self.xticks = xticks - self.yticks = yticks - self.xlim = xlim - self.ylim = ylim - self.title = title - self.use_index = use_index - - self.fontsize = fontsize - - if rot is not None: - self.rot = rot - # need to know for format_date_labels since it's rotated to 30 by - # default - self._rot_set = True - else: - self._rot_set = False - self.rot = self._default_rot - - if grid is None: - grid = False if secondary_y else self.plt.rcParams['axes.grid'] - - self.grid = grid - self.legend = legend - self.legend_handles = [] - self.legend_labels = [] - - for attr in self._pop_attributes: - value = kwds.pop(attr, self._attr_defaults.get(attr, None)) - setattr(self, attr, value) - - self.ax = ax - self.fig = fig - self.axes = None - - # parse errorbar input if given - xerr = kwds.pop('xerr', None) - yerr = kwds.pop('yerr', None) - self.errors = {} - for kw, err in zip(['xerr', 'yerr'], [xerr, yerr]): - self.errors[kw] = self._parse_errorbars(kw, err) - - if not isinstance(secondary_y, (bool, tuple, list, np.ndarray, Index)): - secondary_y = [secondary_y] - self.secondary_y = secondary_y - - # ugly TypeError if user passes matplotlib's `cmap` name. - # Probably better to accept either. 
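For `autocorrelation_plot` above, the solid and dashed grey lines are the large-sample 95% and 99% white-noise bands at ±z/√n; an uncorrelated series should stay almost entirely inside them:

```python
import numpy as np
import pandas as pd
from pandas.tools.plotting import autocorrelation_plot

s = pd.Series(np.random.randn(200))   # white noise
autocorrelation_plot(s)               # ACF hugs zero, inside the bands
```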
- if 'cmap' in kwds and colormap: - raise TypeError("Only specify one of `cmap` and `colormap`.") - elif 'cmap' in kwds: - self.colormap = kwds.pop('cmap') - else: - self.colormap = colormap - - self.table = table - - self.kwds = kwds - - self._validate_color_args() - - def _validate_color_args(self): - if 'color' not in self.kwds and 'colors' in self.kwds: - warnings.warn(("'colors' is being deprecated. Please use 'color'" - "instead of 'colors'")) - colors = self.kwds.pop('colors') - self.kwds['color'] = colors - - if ('color' in self.kwds and self.nseries == 1): - # support series.plot(color='green') - self.kwds['color'] = [self.kwds['color']] - - if ('color' in self.kwds or 'colors' in self.kwds) and \ - self.colormap is not None: - warnings.warn("'color' and 'colormap' cannot be used " - "simultaneously. Using 'color'") - - if 'color' in self.kwds and self.style is not None: - if is_list_like(self.style): - styles = self.style - else: - styles = [self.style] - # need only a single match - for s in styles: - if re.match('^[a-z]+?', s) is not None: - raise ValueError( - "Cannot pass 'style' string with a color " - "symbol and 'color' keyword argument. Please" - " use one or the other or pass 'style' " - "without a color symbol") - - def _iter_data(self, data=None, keep_index=False, fillna=None): - if data is None: - data = self.data - if fillna is not None: - data = data.fillna(fillna) - - # TODO: unused? - # if self.sort_columns: - # columns = _try_sort(data.columns) - # else: - # columns = data.columns - - for col, values in data.iteritems(): - if keep_index is True: - yield col, values - else: - yield col, values.values - - @property - def nseries(self): - if self.data.ndim == 1: - return 1 - else: - return self.data.shape[1] - - def draw(self): - self.plt.draw_if_interactive() - - def generate(self): - self._args_adjust() - self._compute_plot_data() - self._setup_subplots() - self._make_plot() - self._add_table() - self._make_legend() - self._adorn_subplots() - - for ax in self.axes: - self._post_plot_logic_common(ax, self.data) - self._post_plot_logic(ax, self.data) - - def _args_adjust(self): - pass - - def _has_plotted_object(self, ax): - """check whether ax has data""" - return (len(ax.lines) != 0 or - len(ax.artists) != 0 or - len(ax.containers) != 0) - - def _maybe_right_yaxis(self, ax, axes_num): - if not self.on_right(axes_num): - # secondary axes may be passed via ax kw - return self._get_ax_layer(ax) - - if hasattr(ax, 'right_ax'): - # if it has right_ax proparty, ``ax`` must be left axes - return ax.right_ax - elif hasattr(ax, 'left_ax'): - # if it has left_ax proparty, ``ax`` must be right axes - return ax - else: - # otherwise, create twin axes - orig_ax, new_ax = ax, ax.twinx() - # TODO: use Matplotlib public API when available - new_ax._get_lines = orig_ax._get_lines - new_ax._get_patches_for_fill = orig_ax._get_patches_for_fill - orig_ax.right_ax, new_ax.left_ax = new_ax, orig_ax - - if not self._has_plotted_object(orig_ax): # no data on left y - orig_ax.get_yaxis().set_visible(False) - return new_ax - - def _setup_subplots(self): - if self.subplots: - fig, axes = _subplots(naxes=self.nseries, - sharex=self.sharex, sharey=self.sharey, - figsize=self.figsize, ax=self.ax, - layout=self.layout, - layout_type=self._layout_type) - else: - if self.ax is None: - fig = self.plt.figure(figsize=self.figsize) - axes = fig.add_subplot(111) - else: - fig = self.ax.get_figure() - if self.figsize is not None: - fig.set_size_inches(self.figsize) - axes = self.ax - - axes = 
_flatten(axes) - - if self.logx or self.loglog: - [a.set_xscale('log') for a in axes] - if self.logy or self.loglog: - [a.set_yscale('log') for a in axes] - - self.fig = fig - self.axes = axes - - @property - def result(self): - """ - Return result axes - """ - if self.subplots: - if self.layout is not None and not is_list_like(self.ax): - return self.axes.reshape(*self.layout) - else: - return self.axes - else: - sec_true = isinstance(self.secondary_y, bool) and self.secondary_y - all_sec = (is_list_like(self.secondary_y) and - len(self.secondary_y) == self.nseries) - if (sec_true or all_sec): - # if all data is plotted on secondary, return right axes - return self._get_ax_layer(self.axes[0], primary=False) - else: - return self.axes[0] - - def _compute_plot_data(self): - data = self.data - - if isinstance(data, Series): - label = self.label - if label is None and data.name is None: - label = 'None' - data = data.to_frame(name=label) - - numeric_data = data._convert(datetime=True)._get_numeric_data() - - try: - is_empty = numeric_data.empty - except AttributeError: - is_empty = not len(numeric_data) - - # no empty frames or series allowed - if is_empty: - raise TypeError('Empty {0!r}: no numeric data to ' - 'plot'.format(numeric_data.__class__.__name__)) - - self.data = numeric_data - - def _make_plot(self): - raise AbstractMethodError(self) - - def _add_table(self): - if self.table is False: - return - elif self.table is True: - data = self.data.transpose() - else: - data = self.table - ax = self._get_ax(0) - table(ax, data) - - def _post_plot_logic_common(self, ax, data): - """Common post process for each axes""" - labels = [pprint_thing(key) for key in data.index] - labels = dict(zip(range(len(data.index)), labels)) - - if self.orientation == 'vertical' or self.orientation is None: - if self._need_to_set_index: - xticklabels = [labels.get(x, '') for x in ax.get_xticks()] - ax.set_xticklabels(xticklabels) - self._apply_axis_properties(ax.xaxis, rot=self.rot, - fontsize=self.fontsize) - self._apply_axis_properties(ax.yaxis, fontsize=self.fontsize) - elif self.orientation == 'horizontal': - if self._need_to_set_index: - yticklabels = [labels.get(y, '') for y in ax.get_yticks()] - ax.set_yticklabels(yticklabels) - self._apply_axis_properties(ax.yaxis, rot=self.rot, - fontsize=self.fontsize) - self._apply_axis_properties(ax.xaxis, fontsize=self.fontsize) - else: # pragma no cover - raise ValueError - - def _post_plot_logic(self, ax, data): - """Post process for each axes. 
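One consequence of `_compute_plot_data` above that is easy to trip over: non-numeric columns are dropped before plotting, and if nothing numeric remains the call raises rather than drawing an empty axes:

```python
import pandas as pd

df = pd.DataFrame({'x': list('abc')})   # no numeric columns at all
try:
    df.plot()
except TypeError as err:
    print(err)   # Empty 'DataFrame': no numeric data to plot
```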
Overridden in child classes""" - pass - - def _adorn_subplots(self): - """Common post process unrelated to data""" - if len(self.axes) > 0: - all_axes = self._get_subplots() - nrows, ncols = self._get_axes_layout() - _handle_shared_axes(axarr=all_axes, nplots=len(all_axes), - naxes=nrows * ncols, nrows=nrows, - ncols=ncols, sharex=self.sharex, - sharey=self.sharey) - - for ax in self.axes: - if self.yticks is not None: - ax.set_yticks(self.yticks) - - if self.xticks is not None: - ax.set_xticks(self.xticks) - - if self.ylim is not None: - ax.set_ylim(self.ylim) - - if self.xlim is not None: - ax.set_xlim(self.xlim) - - ax.grid(self.grid) - - if self.title: - if self.subplots: - self.fig.suptitle(self.title) - else: - self.axes[0].set_title(self.title) - - def _apply_axis_properties(self, axis, rot=None, fontsize=None): - labels = axis.get_majorticklabels() + axis.get_minorticklabels() - for label in labels: - if rot is not None: - label.set_rotation(rot) - if fontsize is not None: - label.set_fontsize(fontsize) - - @property - def legend_title(self): - if not isinstance(self.data.columns, MultiIndex): - name = self.data.columns.name - if name is not None: - name = pprint_thing(name) - return name - else: - stringified = map(pprint_thing, - self.data.columns.names) - return ','.join(stringified) - - def _add_legend_handle(self, handle, label, index=None): - if label is not None: - if self.mark_right and index is not None: - if self.on_right(index): - label = label + ' (right)' - self.legend_handles.append(handle) - self.legend_labels.append(label) - - def _make_legend(self): - ax, leg = self._get_ax_legend(self.axes[0]) - - handles = [] - labels = [] - title = '' - - if not self.subplots: - if leg is not None: - title = leg.get_title().get_text() - handles = leg.legendHandles - labels = [x.get_text() for x in leg.get_texts()] - - if self.legend: - if self.legend == 'reverse': - self.legend_handles = reversed(self.legend_handles) - self.legend_labels = reversed(self.legend_labels) - - handles += self.legend_handles - labels += self.legend_labels - if self.legend_title is not None: - title = self.legend_title - - if len(handles) > 0: - ax.legend(handles, labels, loc='best', title=title) - - elif self.subplots and self.legend: - for ax in self.axes: - if ax.get_visible(): - ax.legend(loc='best') - - def _get_ax_legend(self, ax): - leg = ax.get_legend() - other_ax = (getattr(ax, 'left_ax', None) or - getattr(ax, 'right_ax', None)) - other_leg = None - if other_ax is not None: - other_leg = other_ax.get_legend() - if leg is None and other_leg is not None: - leg = other_leg - ax = other_ax - return ax, leg - - @cache_readonly - def plt(self): - import matplotlib.pyplot as plt - return plt - - @staticmethod - def mpl_ge_1_3_1(): - return _mpl_ge_1_3_1() - - @staticmethod - def mpl_ge_1_5_0(): - return _mpl_ge_1_5_0() - - _need_to_set_index = False - - def _get_xticks(self, convert_period=False): - index = self.data.index - is_datetype = index.inferred_type in ('datetime', 'date', - 'datetime64', 'time') - - if self.use_index: - if convert_period and isinstance(index, PeriodIndex): - self.data = self.data.reindex(index=index.sort_values()) - x = self.data.index.to_timestamp()._mpl_repr() - elif index.is_numeric(): - """ - Matplotlib supports numeric values or datetime objects as - xaxis values. Taking LBYL approach here, by the time - matplotlib raises exception when using non numeric/datetime - values for xaxis, several actions are already taken by plt. 
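Tying the legend machinery above together: with `mark_right` left at its default of `True`, any series routed to the secondary axis gets `' (right)'` appended to its legend label:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(10.0), 'B': np.arange(10.0) * 100})
df.plot(secondary_y=['B'])   # legend reads 'A' and 'B (right)'
```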
- """ - x = index._mpl_repr() - elif is_datetype: - self.data = self.data.sort_index() - x = self.data.index._mpl_repr() - else: - self._need_to_set_index = True - x = lrange(len(index)) - else: - x = lrange(len(index)) - - return x - - @classmethod - def _plot(cls, ax, x, y, style=None, is_errorbar=False, **kwds): - mask = isnull(y) - if mask.any(): - y = np.ma.array(y) - y = np.ma.masked_where(mask, y) - - if isinstance(x, Index): - x = x._mpl_repr() - - if is_errorbar: - if 'xerr' in kwds: - kwds['xerr'] = np.array(kwds.get('xerr')) - if 'yerr' in kwds: - kwds['yerr'] = np.array(kwds.get('yerr')) - return ax.errorbar(x, y, **kwds) - else: - # prevent style kwarg from going to errorbar, where it is - # unsupported - if style is not None: - args = (x, y, style) - else: - args = (x, y) - return ax.plot(*args, **kwds) - - def _get_index_name(self): - if isinstance(self.data.index, MultiIndex): - name = self.data.index.names - if any(x is not None for x in name): - name = ','.join([pprint_thing(x) for x in name]) - else: - name = None - else: - name = self.data.index.name - if name is not None: - name = pprint_thing(name) - - return name - - @classmethod - def _get_ax_layer(cls, ax, primary=True): - """get left (primary) or right (secondary) axes""" - if primary: - return getattr(ax, 'left_ax', ax) - else: - return getattr(ax, 'right_ax', ax) - - def _get_ax(self, i): - # get the twinx ax if appropriate - if self.subplots: - ax = self.axes[i] - ax = self._maybe_right_yaxis(ax, i) - self.axes[i] = ax - else: - ax = self.axes[0] - ax = self._maybe_right_yaxis(ax, i) - - ax.get_yaxis().set_visible(True) - return ax - - def on_right(self, i): - if isinstance(self.secondary_y, bool): - return self.secondary_y - - if isinstance(self.secondary_y, (tuple, list, np.ndarray, Index)): - return self.data.columns[i] in self.secondary_y - - def _apply_style_colors(self, colors, kwds, col_num, label): - """ - Manage style and color based on column number and its label. - Returns tuple of appropriate style and kwds which "color" may be added. 
- """ - style = None - if self.style is not None: - if isinstance(self.style, list): - try: - style = self.style[col_num] - except IndexError: - pass - elif isinstance(self.style, dict): - style = self.style.get(label, style) - else: - style = self.style - - has_color = 'color' in kwds or self.colormap is not None - nocolor_style = style is None or re.match('[a-z]+', style) is None - if (has_color or self.subplots) and nocolor_style: - kwds['color'] = colors[col_num % len(colors)] - return style, kwds - - def _get_colors(self, num_colors=None, color_kwds='color'): - if num_colors is None: - num_colors = self.nseries - - return _get_standard_colors(num_colors=num_colors, - colormap=self.colormap, - color=self.kwds.get(color_kwds)) - - def _parse_errorbars(self, label, err): - """ - Look for error keyword arguments and return the actual errorbar data - or return the error DataFrame/dict - - Error bars can be specified in several ways: - Series: the user provides a pandas.Series object of the same - length as the data - ndarray: provides a np.ndarray of the same length as the data - DataFrame/dict: error values are paired with keys matching the - key in the plotted DataFrame - str: the name of the column within the plotted DataFrame - """ - - if err is None: - return None - - from pandas import DataFrame, Series - - def match_labels(data, e): - e = e.reindex_axis(data.index) - return e - - # key-matched DataFrame - if isinstance(err, DataFrame): - - err = match_labels(self.data, err) - # key-matched dict - elif isinstance(err, dict): - pass - - # Series of error values - elif isinstance(err, Series): - # broadcast error series across data - err = match_labels(self.data, err) - err = np.atleast_2d(err) - err = np.tile(err, (self.nseries, 1)) - - # errors are a column in the dataframe - elif isinstance(err, string_types): - evalues = self.data[err].values - self.data = self.data[self.data.columns.drop(err)] - err = np.atleast_2d(evalues) - err = np.tile(err, (self.nseries, 1)) - - elif is_list_like(err): - if is_iterator(err): - err = np.atleast_2d(list(err)) - else: - # raw error values - err = np.atleast_2d(err) - - err_shape = err.shape - - # asymmetrical error bars - if err.ndim == 3: - if (err_shape[0] != self.nseries) or \ - (err_shape[1] != 2) or \ - (err_shape[2] != len(self.data)): - msg = "Asymmetrical error bars should be provided " + \ - "with the shape (%u, 2, %u)" % \ - (self.nseries, len(self.data)) - raise ValueError(msg) - - # broadcast errors to each data series - if len(err) == 1: - err = np.tile(err, (self.nseries, 1)) - - elif is_number(err): - err = np.tile([err], (self.nseries, len(self.data))) - - else: - msg = "No valid %s detected" % label - raise ValueError(msg) - - return err - - def _get_errorbars(self, label=None, index=None, xerr=True, yerr=True): - from pandas import DataFrame - errors = {} - - for kw, flag in zip(['xerr', 'yerr'], [xerr, yerr]): - if flag: - err = self.errors[kw] - # user provided label-matched dataframe of errors - if isinstance(err, (DataFrame, dict)): - if label is not None and label in err.keys(): - err = err[label] - else: - err = None - elif index is not None and err is not None: - err = err[index] - - if err is not None: - errors[kw] = err - return errors - - def _get_subplots(self): - from matplotlib.axes import Subplot - return [ax for ax in self.axes[0].get_figure().get_axes() - if isinstance(ax, Subplot)] - - def _get_axes_layout(self): - axes = self._get_subplots() - x_set = set() - y_set = set() - for ax in axes: - # check axes 
coordinates to estimate layout
-            points = ax.get_position().get_points()
-            x_set.add(points[0][0])
-            y_set.add(points[0][1])
-        return (len(y_set), len(x_set))
-
-
-class PlanePlot(MPLPlot):
-    """
-    Abstract class for plotting on plane, currently scatter and hexbin.
-    """
-
-    _layout_type = 'single'
-
-    def __init__(self, data, x, y, **kwargs):
-        MPLPlot.__init__(self, data, **kwargs)
-        if x is None or y is None:
-            raise ValueError(self._kind + ' requires an x and y column')
-        if is_integer(x) and not self.data.columns.holds_integer():
-            x = self.data.columns[x]
-        if is_integer(y) and not self.data.columns.holds_integer():
-            y = self.data.columns[y]
-        self.x = x
-        self.y = y
-
-    @property
-    def nseries(self):
-        return 1
-
-    def _post_plot_logic(self, ax, data):
-        x, y = self.x, self.y
-        ax.set_ylabel(pprint_thing(y))
-        ax.set_xlabel(pprint_thing(x))
-
-
-class ScatterPlot(PlanePlot):
-    _kind = 'scatter'
-
-    def __init__(self, data, x, y, s=None, c=None, **kwargs):
-        if s is None:
-            # hide the matplotlib default for size, in case we want to change
-            # the handling of this argument later
-            s = 20
-        super(ScatterPlot, self).__init__(data, x, y, s=s, **kwargs)
-        if is_integer(c) and not self.data.columns.holds_integer():
-            c = self.data.columns[c]
-        self.c = c
-
-    def _make_plot(self):
-        x, y, c, data = self.x, self.y, self.c, self.data
-        ax = self.axes[0]
-
-        c_is_column = is_hashable(c) and c in self.data.columns
-
-        # plot a colorbar only if a colormap is provided or necessary
-        cb = self.kwds.pop('colorbar', self.colormap or c_is_column)
-
-        # pandas uses colormap, matplotlib uses cmap.
-        cmap = self.colormap or 'Greys'
-        cmap = self.plt.cm.get_cmap(cmap)
-        color = self.kwds.pop("color", None)
-        if c is not None and color is not None:
-            raise TypeError('Specify exactly one of `c` and `color`')
-        elif c is None and color is None:
-            c_values = self.plt.rcParams['patch.facecolor']
-        elif color is not None:
-            c_values = color
-        elif c_is_column:
-            c_values = self.data[c].values
-        else:
-            c_values = c
-
-        if self.legend and hasattr(self, 'label'):
-            label = self.label
-        else:
-            label = None
-        scatter = ax.scatter(data[x].values, data[y].values, c=c_values,
-                             label=label, cmap=cmap, **self.kwds)
-        if cb:
-            img = ax.collections[0]
-            kws = dict(ax=ax)
-            if self.mpl_ge_1_3_1():
-                kws['label'] = c if c_is_column else ''
-            self.fig.colorbar(img, **kws)
-
-        if label is not None:
-            self._add_legend_handle(scatter, label)
-        else:
-            self.legend = False
-
-        errors_x = self._get_errorbars(label=x, index=0, yerr=False)
-        errors_y = self._get_errorbars(label=y, index=0, xerr=False)
-        if len(errors_x) > 0 or len(errors_y) > 0:
-            err_kwds = dict(errors_x, **errors_y)
-            err_kwds['ecolor'] = scatter.get_facecolor()[0]
-            ax.errorbar(data[x].values, data[y].values,
-                        linestyle='none', **err_kwds)
-
-
-class HexBinPlot(PlanePlot):
-    _kind = 'hexbin'
-
-    def __init__(self, data, x, y, C=None, **kwargs):
-        super(HexBinPlot, self).__init__(data, x, y, **kwargs)
-        if is_integer(C) and not self.data.columns.holds_integer():
-            C = self.data.columns[C]
-        self.C = C
-
-    def _make_plot(self):
-        x, y, data, C = self.x, self.y, self.data, self.C
-        ax = self.axes[0]
-        # pandas uses colormap, matplotlib uses cmap.
-        cmap = self.colormap or 'BuGn'
-        cmap = self.plt.cm.get_cmap(cmap)
-        cb = self.kwds.pop('colorbar', True)
-
-        if C is None:
-            c_values = None
-        else:
-            c_values = data[C].values
-
-        ax.hexbin(data[x].values, data[y].values, C=c_values, cmap=cmap,
-                  **self.kwds)
-        if cb:
-            img = ax.collections[0]
-            self.fig.colorbar(img, ax=ax)
-
-    def _make_legend(self):
-        pass
-
-
-class LinePlot(MPLPlot):
-    _kind = 'line'
-    _default_rot = 0
-    orientation = 'vertical'
-
-    def __init__(self, data, **kwargs):
-        MPLPlot.__init__(self, data, **kwargs)
-        if self.stacked:
-            self.data = self.data.fillna(value=0)
-        self.x_compat = plot_params['x_compat']
-        if 'x_compat' in self.kwds:
-            self.x_compat = bool(self.kwds.pop('x_compat'))
-
-    def _is_ts_plot(self):
-        # this is slightly deceptive
-        return not self.x_compat and self.use_index and self._use_dynamic_x()
-
-    def _use_dynamic_x(self):
-        from pandas.tseries.plotting import _use_dynamic_x
-        return _use_dynamic_x(self._get_ax(0), self.data)
-
-    def _make_plot(self):
-        if self._is_ts_plot():
-            from pandas.tseries.plotting import _maybe_convert_index
-            data = _maybe_convert_index(self._get_ax(0), self.data)
-
-            x = data.index  # dummy, not used
-            plotf = self._ts_plot
-            it = self._iter_data(data=data, keep_index=True)
-        else:
-            x = self._get_xticks(convert_period=True)
-            plotf = self._plot
-            it = self._iter_data()
-
-        stacking_id = self._get_stacking_id()
-        is_errorbar = any(e is not None for e in self.errors.values())
-
-        colors = self._get_colors()
-        for i, (label, y) in enumerate(it):
-            ax = self._get_ax(i)
-            kwds = self.kwds.copy()
-            style, kwds = self._apply_style_colors(colors, kwds, i, label)
-
-            errors = self._get_errorbars(label=label, index=i)
-            kwds = dict(kwds, **errors)
-
-            label = pprint_thing(label)  # .encode('utf-8')
-            kwds['label'] = label
-
-            newlines = plotf(ax, x, y, style=style, column_num=i,
-                             stacking_id=stacking_id,
-                             is_errorbar=is_errorbar,
-                             **kwds)
-            self._add_legend_handle(newlines[0], label, index=i)
-
-            lines = _get_all_lines(ax)
-            left, right = _get_xlim(lines)
-            ax.set_xlim(left, right)
-
-    @classmethod
-    def _plot(cls, ax, x, y, style=None, column_num=None,
-              stacking_id=None, **kwds):
-        # column_num is used to get the target column from plotf in line and
-        # area plots
-        if column_num == 0:
-            cls._initialize_stacker(ax, stacking_id, len(y))
-        y_values = cls._get_stacked_values(ax, stacking_id, y, kwds['label'])
-        lines = MPLPlot._plot(ax, x, y_values, style=style, **kwds)
-        cls._update_stacker(ax, stacking_id, y)
-        return lines
-
-    @classmethod
-    def _ts_plot(cls, ax, x, data, style=None, **kwds):
-        from pandas.tseries.plotting import (_maybe_resample,
-                                             _decorate_axes,
-                                             format_dateaxis)
-        # accept x to be consistent with normal plot func,
-        # x is not passed to tsplot as it uses data.index as x coordinate
-        # column_num must be in kwds for stacking purpose
-        freq, data = _maybe_resample(data, ax, kwds)
-
-        # Set ax with freq info
-        _decorate_axes(ax, freq, kwds)
-        # digging deeper
-        if hasattr(ax, 'left_ax'):
-            _decorate_axes(ax.left_ax, freq, kwds)
-        if hasattr(ax, 'right_ax'):
-            _decorate_axes(ax.right_ax, freq, kwds)
-        ax._plot_data.append((data, cls._kind, kwds))
-
-        lines = cls._plot(ax, data.index, data.values, style=style, **kwds)
-        # set date formatter, locators and rescale limits
-        format_dateaxis(ax, ax.freq)
-        return lines
-
-    def _get_stacking_id(self):
-        if self.stacked:
-            return id(self.data)
-        else:
-            return None
-
-    @classmethod
-    def _initialize_stacker(cls, ax, stacking_id, n):
-        if stacking_id is None:
return - if not hasattr(ax, '_stacker_pos_prior'): - ax._stacker_pos_prior = {} - if not hasattr(ax, '_stacker_neg_prior'): - ax._stacker_neg_prior = {} - ax._stacker_pos_prior[stacking_id] = np.zeros(n) - ax._stacker_neg_prior[stacking_id] = np.zeros(n) - - @classmethod - def _get_stacked_values(cls, ax, stacking_id, values, label): - if stacking_id is None: - return values - if not hasattr(ax, '_stacker_pos_prior'): - # stacker may not be initialized for subplots - cls._initialize_stacker(ax, stacking_id, len(values)) - - if (values >= 0).all(): - return ax._stacker_pos_prior[stacking_id] + values - elif (values <= 0).all(): - return ax._stacker_neg_prior[stacking_id] + values - - raise ValueError('When stacked is True, each column must be either ' - 'all positive or negative.' - '{0} contains both positive and negative values' - .format(label)) - - @classmethod - def _update_stacker(cls, ax, stacking_id, values): - if stacking_id is None: - return - if (values >= 0).all(): - ax._stacker_pos_prior[stacking_id] += values - elif (values <= 0).all(): - ax._stacker_neg_prior[stacking_id] += values - - def _post_plot_logic(self, ax, data): - condition = (not self._use_dynamic_x() and - data.index.is_all_dates and - not self.subplots or - (self.subplots and self.sharex)) - - index_name = self._get_index_name() - - if condition: - # irregular TS rotated 30 deg. by default - # probably a better place to check / set this. - if not self._rot_set: - self.rot = 30 - format_date_labels(ax, rot=self.rot) - - if index_name is not None and self.use_index: - ax.set_xlabel(index_name) - - -class AreaPlot(LinePlot): - _kind = 'area' - - def __init__(self, data, **kwargs): - kwargs.setdefault('stacked', True) - data = data.fillna(value=0) - LinePlot.__init__(self, data, **kwargs) - - if not self.stacked: - # use smaller alpha to distinguish overlap - self.kwds.setdefault('alpha', 0.5) - - if self.logy or self.loglog: - raise ValueError("Log-y scales are not supported in area plot") - - @classmethod - def _plot(cls, ax, x, y, style=None, column_num=None, - stacking_id=None, is_errorbar=False, **kwds): - - if column_num == 0: - cls._initialize_stacker(ax, stacking_id, len(y)) - y_values = cls._get_stacked_values(ax, stacking_id, y, kwds['label']) - - # need to remove label, because subplots uses mpl legend as it is - line_kwds = kwds.copy() - if cls.mpl_ge_1_5_0(): - line_kwds.pop('label') - lines = MPLPlot._plot(ax, x, y_values, style=style, **line_kwds) - - # get data from the line to get coordinates for fill_between - xdata, y_values = lines[0].get_data(orig=False) - - # unable to use ``_get_stacked_values`` here to get starting point - if stacking_id is None: - start = np.zeros(len(y)) - elif (y >= 0).all(): - start = ax._stacker_pos_prior[stacking_id] - elif (y <= 0).all(): - start = ax._stacker_neg_prior[stacking_id] - else: - start = np.zeros(len(y)) - - if 'color' not in kwds: - kwds['color'] = lines[0].get_color() - - rect = ax.fill_between(xdata, start, y_values, **kwds) - cls._update_stacker(ax, stacking_id, y) - - # LinePlot expects list of artists - res = [rect] if cls.mpl_ge_1_5_0() else lines - return res - - def _add_legend_handle(self, handle, label, index=None): - if not self.mpl_ge_1_5_0(): - from matplotlib.patches import Rectangle - # Because fill_between isn't supported in legend, - # specifically add Rectangle handle here - alpha = self.kwds.get('alpha', None) - handle = Rectangle((0, 0), 1, 1, fc=handle.get_color(), - alpha=alpha) - LinePlot._add_legend_handle(self, handle, label, 
index=index) - - def _post_plot_logic(self, ax, data): - LinePlot._post_plot_logic(self, ax, data) - - if self.ylim is None: - if (data >= 0).all().all(): - ax.set_ylim(0, None) - elif (data <= 0).all().all(): - ax.set_ylim(None, 0) - - -class BarPlot(MPLPlot): - _kind = 'bar' - _default_rot = 90 - orientation = 'vertical' - - def __init__(self, data, **kwargs): - self.bar_width = kwargs.pop('width', 0.5) - pos = kwargs.pop('position', 0.5) - kwargs.setdefault('align', 'center') - self.tick_pos = np.arange(len(data)) - - self.bottom = kwargs.pop('bottom', 0) - self.left = kwargs.pop('left', 0) - - self.log = kwargs.pop('log', False) - MPLPlot.__init__(self, data, **kwargs) - - if self.stacked or self.subplots: - self.tickoffset = self.bar_width * pos - if kwargs['align'] == 'edge': - self.lim_offset = self.bar_width / 2 - else: - self.lim_offset = 0 - else: - if kwargs['align'] == 'edge': - w = self.bar_width / self.nseries - self.tickoffset = self.bar_width * (pos - 0.5) + w * 0.5 - self.lim_offset = w * 0.5 - else: - self.tickoffset = self.bar_width * pos - self.lim_offset = 0 - - self.ax_pos = self.tick_pos - self.tickoffset - - def _args_adjust(self): - if is_list_like(self.bottom): - self.bottom = np.array(self.bottom) - if is_list_like(self.left): - self.left = np.array(self.left) - - @classmethod - def _plot(cls, ax, x, y, w, start=0, log=False, **kwds): - return ax.bar(x, y, w, bottom=start, log=log, **kwds) - - @property - def _start_base(self): - return self.bottom - - def _make_plot(self): - import matplotlib as mpl - - colors = self._get_colors() - ncolors = len(colors) - - pos_prior = neg_prior = np.zeros(len(self.data)) - K = self.nseries - - for i, (label, y) in enumerate(self._iter_data(fillna=0)): - ax = self._get_ax(i) - kwds = self.kwds.copy() - kwds['color'] = colors[i % ncolors] - - errors = self._get_errorbars(label=label, index=i) - kwds = dict(kwds, **errors) - - label = pprint_thing(label) - - if (('yerr' in kwds) or ('xerr' in kwds)) \ - and (kwds.get('ecolor') is None): - kwds['ecolor'] = mpl.rcParams['xtick.color'] - - start = 0 - if self.log and (y >= 1).all(): - start = 1 - start = start + self._start_base - - if self.subplots: - w = self.bar_width / 2 - rect = self._plot(ax, self.ax_pos + w, y, self.bar_width, - start=start, label=label, - log=self.log, **kwds) - ax.set_title(label) - elif self.stacked: - mask = y > 0 - start = np.where(mask, pos_prior, neg_prior) + self._start_base - w = self.bar_width / 2 - rect = self._plot(ax, self.ax_pos + w, y, self.bar_width, - start=start, label=label, - log=self.log, **kwds) - pos_prior = pos_prior + np.where(mask, y, 0) - neg_prior = neg_prior + np.where(mask, 0, y) - else: - w = self.bar_width / K - rect = self._plot(ax, self.ax_pos + (i + 0.5) * w, y, w, - start=start, label=label, - log=self.log, **kwds) - self._add_legend_handle(rect, label, index=i) - - def _post_plot_logic(self, ax, data): - if self.use_index: - str_index = [pprint_thing(key) for key in data.index] - else: - str_index = [pprint_thing(key) for key in range(data.shape[0])] - name = self._get_index_name() - - s_edge = self.ax_pos[0] - 0.25 + self.lim_offset - e_edge = self.ax_pos[-1] + 0.25 + self.bar_width + self.lim_offset - - self._decorate_ticks(ax, name, str_index, s_edge, e_edge) - - def _decorate_ticks(self, ax, name, ticklabels, start_edge, end_edge): - ax.set_xlim((start_edge, end_edge)) - ax.set_xticks(self.tick_pos) - ax.set_xticklabels(ticklabels) - if name is not None and self.use_index: - ax.set_xlabel(name) - - -class 
BarhPlot(BarPlot): - _kind = 'barh' - _default_rot = 0 - orientation = 'horizontal' - - @property - def _start_base(self): - return self.left - - @classmethod - def _plot(cls, ax, x, y, w, start=0, log=False, **kwds): - return ax.barh(x, y, w, left=start, log=log, **kwds) - - def _decorate_ticks(self, ax, name, ticklabels, start_edge, end_edge): - # horizontal bars - ax.set_ylim((start_edge, end_edge)) - ax.set_yticks(self.tick_pos) - ax.set_yticklabels(ticklabels) - if name is not None and self.use_index: - ax.set_ylabel(name) - - -class HistPlot(LinePlot): - _kind = 'hist' - - def __init__(self, data, bins=10, bottom=0, **kwargs): - self.bins = bins # use mpl default - self.bottom = bottom - # Do not call LinePlot.__init__ which may fill nan - MPLPlot.__init__(self, data, **kwargs) - - def _args_adjust(self): - if is_integer(self.bins): - # create common bin edge - values = (self.data._convert(datetime=True)._get_numeric_data()) - values = np.ravel(values) - values = values[~isnull(values)] - - hist, self.bins = np.histogram( - values, bins=self.bins, - range=self.kwds.get('range', None), - weights=self.kwds.get('weights', None)) - - if is_list_like(self.bottom): - self.bottom = np.array(self.bottom) - - @classmethod - def _plot(cls, ax, y, style=None, bins=None, bottom=0, column_num=0, - stacking_id=None, **kwds): - if column_num == 0: - cls._initialize_stacker(ax, stacking_id, len(bins) - 1) - y = y[~isnull(y)] - - base = np.zeros(len(bins) - 1) - bottom = bottom + \ - cls._get_stacked_values(ax, stacking_id, base, kwds['label']) - # ignore style - n, bins, patches = ax.hist(y, bins=bins, bottom=bottom, **kwds) - cls._update_stacker(ax, stacking_id, n) - return patches - - def _make_plot(self): - colors = self._get_colors() - stacking_id = self._get_stacking_id() - - for i, (label, y) in enumerate(self._iter_data()): - ax = self._get_ax(i) - - kwds = self.kwds.copy() - - label = pprint_thing(label) - kwds['label'] = label - - style, kwds = self._apply_style_colors(colors, kwds, i, label) - if style is not None: - kwds['style'] = style - - kwds = self._make_plot_keywords(kwds, y) - artists = self._plot(ax, y, column_num=i, - stacking_id=stacking_id, **kwds) - self._add_legend_handle(artists[0], label, index=i) - - def _make_plot_keywords(self, kwds, y): - """merge BoxPlot/KdePlot properties to passed kwds""" - # y is required for KdePlot - kwds['bottom'] = self.bottom - kwds['bins'] = self.bins - return kwds - - def _post_plot_logic(self, ax, data): - if self.orientation == 'horizontal': - ax.set_xlabel('Frequency') - else: - ax.set_ylabel('Frequency') - - @property - def orientation(self): - if self.kwds.get('orientation', None) == 'horizontal': - return 'horizontal' - else: - return 'vertical' - - -class KdePlot(HistPlot): - _kind = 'kde' - orientation = 'vertical' - - def __init__(self, data, bw_method=None, ind=None, **kwargs): - MPLPlot.__init__(self, data, **kwargs) - self.bw_method = bw_method - self.ind = ind - - def _args_adjust(self): - pass - - def _get_ind(self, y): - if self.ind is None: - # np.nanmax() and np.nanmin() ignores the missing values - sample_range = np.nanmax(y) - np.nanmin(y) - ind = np.linspace(np.nanmin(y) - 0.5 * sample_range, - np.nanmax(y) + 0.5 * sample_range, 1000) - else: - ind = self.ind - return ind - - @classmethod - def _plot(cls, ax, y, style=None, bw_method=None, ind=None, - column_num=None, stacking_id=None, **kwds): - from scipy.stats import gaussian_kde - from scipy import __version__ as spv - - y = remove_na(y) - - if LooseVersion(spv) >= 
'0.11.0': - gkde = gaussian_kde(y, bw_method=bw_method) - else: - gkde = gaussian_kde(y) - if bw_method is not None: - msg = ('bw_method was added in Scipy 0.11.0.' + - ' Scipy version in use is %s.' % spv) - warnings.warn(msg) - - y = gkde.evaluate(ind) - lines = MPLPlot._plot(ax, ind, y, style=style, **kwds) - return lines - - def _make_plot_keywords(self, kwds, y): - kwds['bw_method'] = self.bw_method - kwds['ind'] = self._get_ind(y) - return kwds - - def _post_plot_logic(self, ax, data): - ax.set_ylabel('Density') - - -class PiePlot(MPLPlot): - _kind = 'pie' - _layout_type = 'horizontal' - - def __init__(self, data, kind=None, **kwargs): - data = data.fillna(value=0) - if (data < 0).any().any(): - raise ValueError("{0} doesn't allow negative values".format(kind)) - MPLPlot.__init__(self, data, kind=kind, **kwargs) - - def _args_adjust(self): - self.grid = False - self.logy = False - self.logx = False - self.loglog = False - - def _validate_color_args(self): - pass - - def _make_plot(self): - colors = self._get_colors( - num_colors=len(self.data), color_kwds='colors') - self.kwds.setdefault('colors', colors) - - for i, (label, y) in enumerate(self._iter_data()): - ax = self._get_ax(i) - if label is not None: - label = pprint_thing(label) - ax.set_ylabel(label) - - kwds = self.kwds.copy() - - def blank_labeler(label, value): - if value == 0: - return '' - else: - return label - - idx = [pprint_thing(v) for v in self.data.index] - labels = kwds.pop('labels', idx) - # labels is used for each wedge's labels - # Blank out labels for values of 0 so they don't overlap - # with nonzero wedges - if labels is not None: - blabels = [blank_labeler(l, value) for - l, value in zip(labels, y)] - else: - blabels = None - results = ax.pie(y, labels=blabels, **kwds) - - if kwds.get('autopct', None) is not None: - patches, texts, autotexts = results - else: - patches, texts = results - autotexts = [] - - if self.fontsize is not None: - for t in texts + autotexts: - t.set_fontsize(self.fontsize) - - # leglabels is used for legend labels - leglabels = labels if labels is not None else idx - for p, l in zip(patches, leglabels): - self._add_legend_handle(p, l) - - -class BoxPlot(LinePlot): - _kind = 'box' - _layout_type = 'horizontal' - - _valid_return_types = (None, 'axes', 'dict', 'both') - # namedtuple to hold results - BP = namedtuple("Boxplot", ['ax', 'lines']) - - def __init__(self, data, return_type='axes', **kwargs): - # Do not call LinePlot.__init__ which may fill nan - if return_type not in self._valid_return_types: - raise ValueError( - "return_type must be {None, 'axes', 'dict', 'both'}") - - self.return_type = return_type - MPLPlot.__init__(self, data, **kwargs) - - def _args_adjust(self): - if self.subplots: - # Disable label ax sharing. 
Otherwise, all subplots show the last
-            # column label
-            if self.orientation == 'vertical':
-                self.sharex = False
-            else:
-                self.sharey = False
-
-    @classmethod
-    def _plot(cls, ax, y, column_num=None, return_type='axes', **kwds):
-        if y.ndim == 2:
-            y = [remove_na(v) for v in y]
-            # Boxplot fails with empty arrays, so need to add a NaN
-            # if any cols are empty
-            # GH 8181
-            y = [v if v.size > 0 else np.array([np.nan]) for v in y]
-        else:
-            y = remove_na(y)
-        bp = ax.boxplot(y, **kwds)
-
-        if return_type == 'dict':
-            return bp, bp
-        elif return_type == 'both':
-            return cls.BP(ax=ax, lines=bp), bp
-        else:
-            return ax, bp
-
-    def _validate_color_args(self):
-        if 'color' in self.kwds:
-            if self.colormap is not None:
-                warnings.warn("'color' and 'colormap' cannot be used "
-                              "simultaneously. Using 'color'")
-            self.color = self.kwds.pop('color')
-
-            if isinstance(self.color, dict):
-                valid_keys = ['boxes', 'whiskers', 'medians', 'caps']
-                for key, values in compat.iteritems(self.color):
-                    if key not in valid_keys:
-                        raise ValueError("color dict contains invalid "
-                                         "key '{0}'. "
-                                         "The key must be either {1}"
-                                         .format(key, valid_keys))
-        else:
-            self.color = None
-
-        # get standard colors for default
-        colors = _get_standard_colors(num_colors=3,
-                                      colormap=self.colormap,
-                                      color=None)
-        # use 2 colors by default, for box/whisker and median
-        # flier colors aren't needed here
-        # because they can be specified by the ``sym`` kw
-        self._boxes_c = colors[0]
-        self._whiskers_c = colors[0]
-        self._medians_c = colors[2]
-        self._caps_c = 'k'  # mpl default
-
-    def _get_colors(self, num_colors=None, color_kwds='color'):
-        pass
-
-    def maybe_color_bp(self, bp):
-        if isinstance(self.color, dict):
-            boxes = self.color.get('boxes', self._boxes_c)
-            whiskers = self.color.get('whiskers', self._whiskers_c)
-            medians = self.color.get('medians', self._medians_c)
-            caps = self.color.get('caps', self._caps_c)
-        else:
-            # Other types are forwarded to matplotlib
-            # If None, use default colors
-            boxes = self.color or self._boxes_c
-            whiskers = self.color or self._whiskers_c
-            medians = self.color or self._medians_c
-            caps = self.color or self._caps_c
-
-        from matplotlib.artist import setp
-        setp(bp['boxes'], color=boxes, alpha=1)
-        setp(bp['whiskers'], color=whiskers, alpha=1)
-        setp(bp['medians'], color=medians, alpha=1)
-        setp(bp['caps'], color=caps, alpha=1)
-
-    def _make_plot(self):
-        if self.subplots:
-            self._return_obj = Series()
-
-            for i, (label, y) in enumerate(self._iter_data()):
-                ax = self._get_ax(i)
-                kwds = self.kwds.copy()
-
-                ret, bp = self._plot(ax, y, column_num=i,
-                                     return_type=self.return_type, **kwds)
-                self.maybe_color_bp(bp)
-                self._return_obj[label] = ret
-
-                label = [pprint_thing(label)]
-                self._set_ticklabels(ax, label)
-        else:
-            y = self.data.values.T
-            ax = self._get_ax(0)
-            kwds = self.kwds.copy()
-
-            ret, bp = self._plot(ax, y, column_num=0,
-                                 return_type=self.return_type, **kwds)
-            self.maybe_color_bp(bp)
-            self._return_obj = ret
-
-            labels = [l for l, _ in self._iter_data()]
-            labels = [pprint_thing(l) for l in labels]
-            if not self.use_index:
-                labels = [pprint_thing(key) for key in range(len(labels))]
-            self._set_ticklabels(ax, labels)
-
-    def _set_ticklabels(self, ax, labels):
-        if self.orientation == 'vertical':
-            ax.set_xticklabels(labels)
-        else:
-            ax.set_yticklabels(labels)
-
-    def _make_legend(self):
-        pass
-
-    def _post_plot_logic(self, ax, data):
-        pass
-
-    @property
-    def orientation(self):
-        if self.kwds.get('vert', True):
-            return 'vertical'
-        else:
-            return 'horizontal'
-
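`maybe_color_bp` above leans on the fact that `Axes.boxplot` returns a dict of artist lists keyed by part name, each of which `matplotlib.artist.setp` can restyle in one call. A standalone sketch with made-up samples:

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.artist import setp

data = [np.random.randn(100) for _ in range(3)]   # hypothetical columns

fig, ax = plt.subplots()
bp = ax.boxplot(data)   # dict with 'boxes', 'whiskers', 'medians', 'caps', ...
setp(bp['boxes'], color='steelblue', alpha=1)
setp(bp['whiskers'], color='steelblue', alpha=1)
setp(bp['medians'], color='firebrick', alpha=1)
setp(bp['caps'], color='k', alpha=1)   # 'k' is the mpl default used above
plt.show()
```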
@property - def result(self): - if self.return_type is None: - return super(BoxPlot, self).result - else: - return self._return_obj - - -# kinds supported by both dataframe and series -_common_kinds = ['line', 'bar', 'barh', - 'kde', 'density', 'area', 'hist', 'box'] -# kinds supported by dataframe -_dataframe_kinds = ['scatter', 'hexbin'] -# kinds supported only by series or dataframe single column -_series_kinds = ['pie'] -_all_kinds = _common_kinds + _dataframe_kinds + _series_kinds - -_klasses = [LinePlot, BarPlot, BarhPlot, KdePlot, HistPlot, BoxPlot, - ScatterPlot, HexBinPlot, AreaPlot, PiePlot] - -_plot_klass = {} -for klass in _klasses: - _plot_klass[klass._kind] = klass - - -def _plot(data, x=None, y=None, subplots=False, - ax=None, kind='line', **kwds): - kind = _get_standard_kind(kind.lower().strip()) - if kind in _all_kinds: - klass = _plot_klass[kind] - else: - raise ValueError("%r is not a valid plot kind" % kind) - - from pandas import DataFrame - if kind in _dataframe_kinds: - if isinstance(data, DataFrame): - plot_obj = klass(data, x=x, y=y, subplots=subplots, ax=ax, - kind=kind, **kwds) - else: - raise ValueError("plot kind %r can only be used for data frames" - % kind) - - elif kind in _series_kinds: - if isinstance(data, DataFrame): - if y is None and subplots is False: - msg = "{0} requires either y column or 'subplots=True'" - raise ValueError(msg.format(kind)) - elif y is not None: - if is_integer(y) and not data.columns.holds_integer(): - y = data.columns[y] - # converted to series actually. copy to not modify - data = data[y].copy() - data.index.name = y - plot_obj = klass(data, subplots=subplots, ax=ax, kind=kind, **kwds) - else: - if isinstance(data, DataFrame): - if x is not None: - if is_integer(x) and not data.columns.holds_integer(): - x = data.columns[x] - data = data.set_index(x) - - if y is not None: - if is_integer(y) and not data.columns.holds_integer(): - y = data.columns[y] - label = kwds['label'] if 'label' in kwds else y - series = data[y].copy() # Don't modify - series.name = label - - for kw in ['xerr', 'yerr']: - if (kw in kwds) and \ - (isinstance(kwds[kw], string_types) or - is_integer(kwds[kw])): - try: - kwds[kw] = data[kwds[kw]] - except (IndexError, KeyError, TypeError): - pass - data = series - plot_obj = klass(data, subplots=subplots, ax=ax, kind=kind, **kwds) - - plot_obj.generate() - plot_obj.draw() - return plot_obj.result - - -df_kind = """- 'scatter' : scatter plot - - 'hexbin' : hexbin plot""" -series_kind = "" - -df_coord = """x : label or position, default None - y : label or position, default None - Allows plotting of one column versus another""" -series_coord = "" - -df_unique = """stacked : boolean, default False in line and - bar plots, and True in area plot. If True, create stacked plot. 
- sort_columns : boolean, default False - Sort column names to determine plot ordering - secondary_y : boolean or sequence, default False - Whether to plot on the secondary y-axis - If a list/tuple, which columns to plot on secondary y-axis""" -series_unique = """label : label argument to provide to plot - secondary_y : boolean or sequence of ints, default False - If True then y-axis will be on the right""" - -df_ax = """ax : matplotlib axes object, default None - subplots : boolean, default False - Make separate subplots for each column - sharex : boolean, default True if ax is None else False - In case subplots=True, share x axis and set some x axis labels to - invisible; defaults to True if ax is None otherwise False if an ax - is passed in; Be aware, that passing in both an ax and sharex=True - will alter all x axis labels for all axis in a figure! - sharey : boolean, default False - In case subplots=True, share y axis and set some y axis labels to - invisible - layout : tuple (optional) - (rows, columns) for the layout of subplots""" -series_ax = """ax : matplotlib axes object - If not passed, uses gca()""" - -df_note = """- If `kind` = 'scatter' and the argument `c` is the name of a dataframe - column, the values of that column are used to color each point. - - If `kind` = 'hexbin', you can control the size of the bins with the - `gridsize` argument. By default, a histogram of the counts around each - `(x, y)` point is computed. You can specify alternative aggregations - by passing values to the `C` and `reduce_C_function` arguments. - `C` specifies the value at each `(x, y)` point and `reduce_C_function` - is a function of one argument that reduces all the values in a bin to - a single number (e.g. `mean`, `max`, `sum`, `std`).""" -series_note = "" - -_shared_doc_df_kwargs = dict(klass='DataFrame', klass_obj='df', - klass_kind=df_kind, klass_coord=df_coord, - klass_ax=df_ax, klass_unique=df_unique, - klass_note=df_note) -_shared_doc_series_kwargs = dict(klass='Series', klass_obj='s', - klass_kind=series_kind, - klass_coord=series_coord, klass_ax=series_ax, - klass_unique=series_unique, - klass_note=series_note) - -_shared_docs['plot'] = """ - Make plots of %(klass)s using matplotlib / pylab. - - *New in version 0.17.0:* Each plot kind has a corresponding method on the - ``%(klass)s.plot`` accessor: - ``%(klass_obj)s.plot(kind='line')`` is equivalent to - ``%(klass_obj)s.plot.line()``. 
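To make the equivalence stated above concrete, both spellings below produce the same plot on pandas >= 0.17.0 (hypothetical frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(10), 'b': np.random.randn(10)})

ax1 = df.plot(kind='bar')   # classic kind= form
ax2 = df.plot.bar()         # accessor form, new in 0.17.0
```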
- - Parameters - ---------- - data : %(klass)s - %(klass_coord)s - kind : str - - 'line' : line plot (default) - - 'bar' : vertical bar plot - - 'barh' : horizontal bar plot - - 'hist' : histogram - - 'box' : boxplot - - 'kde' : Kernel Density Estimation plot - - 'density' : same as 'kde' - - 'area' : area plot - - 'pie' : pie plot - %(klass_kind)s - %(klass_ax)s - figsize : a tuple (width, height) in inches - use_index : boolean, default True - Use index as ticks for x axis - title : string - Title to use for the plot - grid : boolean, default None (matlab style default) - Axis grid lines - legend : False/True/'reverse' - Place legend on axis subplots - style : list or dict - matplotlib line style per column - logx : boolean, default False - Use log scaling on x axis - logy : boolean, default False - Use log scaling on y axis - loglog : boolean, default False - Use log scaling on both x and y axes - xticks : sequence - Values to use for the xticks - yticks : sequence - Values to use for the yticks - xlim : 2-tuple/list - ylim : 2-tuple/list - rot : int, default None - Rotation for ticks (xticks for vertical, yticks for horizontal plots) - fontsize : int, default None - Font size for xticks and yticks - colormap : str or matplotlib colormap object, default None - Colormap to select colors from. If string, load colormap with that name - from matplotlib. - colorbar : boolean, optional - If True, plot colorbar (only relevant for 'scatter' and 'hexbin' plots) - position : float - Specify relative alignments for bar plot layout. - From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center) - layout : tuple (optional) - (rows, columns) for the layout of the plot - table : boolean, Series or DataFrame, default False - If True, draw a table using the data in the DataFrame and the data will - be transposed to meet matplotlib's default layout. - If a Series or DataFrame is passed, use passed data to draw a table. - yerr : DataFrame, Series, array-like, dict and str - See :ref:`Plotting with Error Bars ` for - detail. - xerr : same types as yerr. - %(klass_unique)s - mark_right : boolean, default True - When using a secondary_y axis, automatically mark the column - labels with "(right)" in the legend - kwds : keywords - Options to pass to matplotlib plotting method - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - - Notes - ----- - - - See matplotlib documentation online for more on this subject - - If `kind` = 'bar' or 'barh', you can specify relative alignments - for bar plot layout by `position` keyword. - From 0 (left/bottom-end) to 1 (right/top-end). 
Default is 0.5 (center) - %(klass_note)s - - """ - - -@Appender(_shared_docs['plot'] % _shared_doc_df_kwargs) -def plot_frame(data, x=None, y=None, kind='line', ax=None, - subplots=False, sharex=None, sharey=False, layout=None, - figsize=None, use_index=True, title=None, grid=None, - legend=True, style=None, logx=False, logy=False, loglog=False, - xticks=None, yticks=None, xlim=None, ylim=None, - rot=None, fontsize=None, colormap=None, table=False, - yerr=None, xerr=None, - secondary_y=False, sort_columns=False, - **kwds): - return _plot(data, kind=kind, x=x, y=y, ax=ax, - subplots=subplots, sharex=sharex, sharey=sharey, - layout=layout, figsize=figsize, use_index=use_index, - title=title, grid=grid, legend=legend, - style=style, logx=logx, logy=logy, loglog=loglog, - xticks=xticks, yticks=yticks, xlim=xlim, ylim=ylim, - rot=rot, fontsize=fontsize, colormap=colormap, table=table, - yerr=yerr, xerr=xerr, - secondary_y=secondary_y, sort_columns=sort_columns, - **kwds) - - -@Appender(_shared_docs['plot'] % _shared_doc_series_kwargs) -def plot_series(data, kind='line', ax=None, # Series unique - figsize=None, use_index=True, title=None, grid=None, - legend=False, style=None, logx=False, logy=False, loglog=False, - xticks=None, yticks=None, xlim=None, ylim=None, - rot=None, fontsize=None, colormap=None, table=False, - yerr=None, xerr=None, - label=None, secondary_y=False, # Series unique - **kwds): - - import matplotlib.pyplot as plt - """ - If no axes is specified, check whether there are existing figures - If there is no existing figures, _gca() will - create a figure with the default figsize, causing the figsize=parameter to - be ignored. - """ - if ax is None and len(plt.get_fignums()) > 0: - ax = _gca() - ax = MPLPlot._get_ax_layer(ax) - return _plot(data, kind=kind, ax=ax, - figsize=figsize, use_index=use_index, title=title, - grid=grid, legend=legend, - style=style, logx=logx, logy=logy, loglog=loglog, - xticks=xticks, yticks=yticks, xlim=xlim, ylim=ylim, - rot=rot, fontsize=fontsize, colormap=colormap, table=table, - yerr=yerr, xerr=xerr, - label=label, secondary_y=secondary_y, - **kwds) - - -_shared_docs['boxplot'] = """ - Make a box plot from DataFrame column optionally grouped by some columns or - other inputs - - Parameters - ---------- - data : the pandas object holding the data - column : column name or list of names, or vector - Can be any valid input to groupby - by : string or sequence - Column in the DataFrame to group by - ax : Matplotlib axes object, optional - fontsize : int or string - rot : label rotation angle - figsize : A tuple (width, height) in inches - grid : Setting this to True will show the grid - layout : tuple (optional) - (rows, columns) for the layout of the plot - return_type : {None, 'axes', 'dict', 'both'}, default None - The kind of object to return. The default is ``axes`` - 'axes' returns the matplotlib axes the boxplot is drawn on; - 'dict' returns a dictionary whose values are the matplotlib - Lines of the boxplot; - 'both' returns a namedtuple with the axes and dict. - - When grouping with ``by``, a Series mapping columns to ``return_type`` - is returned, unless ``return_type`` is None, in which case a NumPy - array of axes is returned with the same shape as ``layout``. - See the prose documentation for more. 
- - kwds : other plotting keyword arguments to be passed to matplotlib boxplot - function - - Returns - ------- - lines : dict - ax : matplotlib Axes - (ax, lines): namedtuple - - Notes - ----- - Use ``return_type='dict'`` when you want to tweak the appearance - of the lines after plotting. In this case a dict containing the Lines - making up the boxes, caps, fliers, medians, and whiskers is returned. - """ - - -@Appender(_shared_docs['boxplot'] % _shared_doc_kwargs) -def boxplot(data, column=None, by=None, ax=None, fontsize=None, - rot=0, grid=True, figsize=None, layout=None, return_type=None, - **kwds): - - # validate return_type: - if return_type not in BoxPlot._valid_return_types: - raise ValueError("return_type must be {'axes', 'dict', 'both'}") - - from pandas import Series, DataFrame - if isinstance(data, Series): - data = DataFrame({'x': data}) - column = 'x' - - def _get_colors(): - return _get_standard_colors(color=kwds.get('color'), num_colors=1) - - def maybe_color_bp(bp): - if 'color' not in kwds: - from matplotlib.artist import setp - setp(bp['boxes'], color=colors[0], alpha=1) - setp(bp['whiskers'], color=colors[0], alpha=1) - setp(bp['medians'], color=colors[2], alpha=1) - - def plot_group(keys, values, ax): - keys = [pprint_thing(x) for x in keys] - values = [remove_na(v) for v in values] - bp = ax.boxplot(values, **kwds) - if kwds.get('vert', 1): - ax.set_xticklabels(keys, rotation=rot, fontsize=fontsize) - else: - ax.set_yticklabels(keys, rotation=rot, fontsize=fontsize) - maybe_color_bp(bp) - - # Return axes in multiplot case, maybe revisit later # 985 - if return_type == 'dict': - return bp - elif return_type == 'both': - return BoxPlot.BP(ax=ax, lines=bp) - else: - return ax - - colors = _get_colors() - if column is None: - columns = None - else: - if isinstance(column, (list, tuple)): - columns = column - else: - columns = [column] - - if by is not None: - # Prefer array return type for 2-D plots to match the subplot layout - # https://github.com/pandas-dev/pandas/pull/12216#issuecomment-241175580 - result = _grouped_plot_by_column(plot_group, data, columns=columns, - by=by, grid=grid, figsize=figsize, - ax=ax, layout=layout, - return_type=return_type) - else: - if return_type is None: - return_type = 'axes' - if layout is not None: - raise ValueError("The 'layout' keyword is not supported when " - "'by' is None") - - if ax is None: - ax = _gca() - data = data._get_numeric_data() - if columns is None: - columns = data.columns - else: - data = data[columns] - - result = plot_group(columns, data.values.T, ax) - ax.grid(grid) - - return result - - -def format_date_labels(ax, rot): - # mini version of autofmt_xdate - try: - for label in ax.get_xticklabels(): - label.set_ha('right') - label.set_rotation(rot) - fig = ax.get_figure() - fig.subplots_adjust(bottom=0.2) - except Exception: # pragma: no cover - pass - - -def scatter_plot(data, x, y, by=None, ax=None, figsize=None, grid=False, - **kwargs): - """ - Make a scatter plot from two DataFrame columns - - Parameters - ---------- - data : DataFrame - x : Column name for the x-axis values - y : Column name for the y-axis values - ax : Matplotlib axis object - figsize : A tuple (width, height) in inches - grid : Setting this to True will show the grid - kwargs : other plotting keyword arguments - To be passed to scatter function - - Returns - ------- - fig : matplotlib.Figure - """ - import matplotlib.pyplot as plt - - # workaround because `c='b'` is hardcoded in matplotlibs scatter method - kwargs.setdefault('c', 
plt.rcParams['patch.facecolor']) - - def plot_group(group, ax): - xvals = group[x].values - yvals = group[y].values - ax.scatter(xvals, yvals, **kwargs) - ax.grid(grid) - - if by is not None: - fig = _grouped_plot(plot_group, data, by=by, figsize=figsize, ax=ax) - else: - if ax is None: - fig = plt.figure() - ax = fig.add_subplot(111) - else: - fig = ax.get_figure() - plot_group(data, ax) - ax.set_ylabel(pprint_thing(y)) - ax.set_xlabel(pprint_thing(x)) - - ax.grid(grid) - - return fig - - -def hist_frame(data, column=None, by=None, grid=True, xlabelsize=None, - xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, - sharey=False, figsize=None, layout=None, bins=10, **kwds): - """ - Draw histogram of the DataFrame's series using matplotlib / pylab. - - Parameters - ---------- - data : DataFrame - column : string or sequence - If passed, will be used to limit data to a subset of columns - by : object, optional - If passed, then used to form histograms for separate groups - grid : boolean, default True - Whether to show axis grid lines - xlabelsize : int, default None - If specified changes the x-axis label size - xrot : float, default None - rotation of x axis labels - ylabelsize : int, default None - If specified changes the y-axis label size - yrot : float, default None - rotation of y axis labels - ax : matplotlib axes object, default None - sharex : boolean, default True if ax is None else False - In case subplots=True, share x axis and set some x axis labels to - invisible; defaults to True if ax is None otherwise False if an ax - is passed in; Be aware, that passing in both an ax and sharex=True - will alter all x axis labels for all subplots in a figure! - sharey : boolean, default False - In case subplots=True, share y axis and set some y axis labels to - invisible - figsize : tuple - The size of the figure to create in inches by default - layout: (optional) a tuple (rows, columns) for the layout of the histograms - bins: integer, default 10 - Number of histogram bins to be used - kwds : other plotting keyword arguments - To be passed to hist function - """ - - if by is not None: - axes = grouped_hist(data, column=column, by=by, ax=ax, grid=grid, - figsize=figsize, sharex=sharex, sharey=sharey, - layout=layout, bins=bins, xlabelsize=xlabelsize, - xrot=xrot, ylabelsize=ylabelsize, - yrot=yrot, **kwds) - return axes - - if column is not None: - if not isinstance(column, (list, np.ndarray, Index)): - column = [column] - data = data[column] - data = data._get_numeric_data() - naxes = len(data.columns) - - fig, axes = _subplots(naxes=naxes, ax=ax, squeeze=False, - sharex=sharex, sharey=sharey, figsize=figsize, - layout=layout) - _axes = _flatten(axes) - - for i, col in enumerate(_try_sort(data.columns)): - ax = _axes[i] - ax.hist(data[col].dropna().values, bins=bins, **kwds) - ax.set_title(col) - ax.grid(grid) - - _set_ticks_props(axes, xlabelsize=xlabelsize, xrot=xrot, - ylabelsize=ylabelsize, yrot=yrot) - fig.subplots_adjust(wspace=0.3, hspace=0.3) - - return axes - - -def hist_series(self, by=None, ax=None, grid=True, xlabelsize=None, - xrot=None, ylabelsize=None, yrot=None, figsize=None, - bins=10, **kwds): - """ - Draw histogram of the input series using matplotlib - - Parameters - ---------- - by : object, optional - If passed, then used to form histograms for separate groups - ax : matplotlib axis object - If not passed, uses gca() - grid : boolean, default True - Whether to show axis grid lines - xlabelsize : int, default None - If specified changes the x-axis label size 
- xrot : float, default None - rotation of x axis labels - ylabelsize : int, default None - If specified changes the y-axis label size - yrot : float, default None - rotation of y axis labels - figsize : tuple, default None - figure size in inches by default - bins: integer, default 10 - Number of histogram bins to be used - kwds : keywords - To be passed to the actual plotting function - - Notes - ----- - See matplotlib documentation online for more on this - - """ - import matplotlib.pyplot as plt - - if by is None: - if kwds.get('layout', None) is not None: - raise ValueError("The 'layout' keyword is not supported when " - "'by' is None") - # hack until the plotting interface is a bit more unified - fig = kwds.pop('figure', plt.gcf() if plt.get_fignums() else - plt.figure(figsize=figsize)) - if (figsize is not None and tuple(figsize) != - tuple(fig.get_size_inches())): - fig.set_size_inches(*figsize, forward=True) - if ax is None: - ax = fig.gca() - elif ax.get_figure() != fig: - raise AssertionError('passed axis not bound to passed figure') - values = self.dropna().values - - ax.hist(values, bins=bins, **kwds) - ax.grid(grid) - axes = np.array([ax]) - - _set_ticks_props(axes, xlabelsize=xlabelsize, xrot=xrot, - ylabelsize=ylabelsize, yrot=yrot) - - else: - if 'figure' in kwds: - raise ValueError("Cannot pass 'figure' when using the " - "'by' argument, since a new 'Figure' instance " - "will be created") - axes = grouped_hist(self, by=by, ax=ax, grid=grid, figsize=figsize, - bins=bins, xlabelsize=xlabelsize, xrot=xrot, - ylabelsize=ylabelsize, yrot=yrot, **kwds) - - if hasattr(axes, 'ndim'): - if axes.ndim == 1 and len(axes) == 1: - return axes[0] - return axes - - -def grouped_hist(data, column=None, by=None, ax=None, bins=50, figsize=None, - layout=None, sharex=False, sharey=False, rot=90, grid=True, - xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, - **kwargs): - """ - Grouped histogram - - Parameters - ---------- - data: Series/DataFrame - column: object, optional - by: object, optional - ax: axes, optional - bins: int, default 50 - figsize: tuple, optional - layout: optional - sharex: boolean, default False - sharey: boolean, default False - rot: int, default 90 - grid: bool, default True - kwargs: dict, keyword arguments passed to matplotlib.Axes.hist - - Returns - ------- - axes: collection of Matplotlib Axes - """ - def plot_group(group, ax): - ax.hist(group.dropna().values, bins=bins, **kwargs) - - xrot = xrot or rot - - fig, axes = _grouped_plot(plot_group, data, column=column, - by=by, sharex=sharex, sharey=sharey, ax=ax, - figsize=figsize, layout=layout, rot=rot) - - _set_ticks_props(axes, xlabelsize=xlabelsize, xrot=xrot, - ylabelsize=ylabelsize, yrot=yrot) - - fig.subplots_adjust(bottom=0.15, top=0.9, left=0.1, right=0.9, - hspace=0.5, wspace=0.3) - return axes - - -def boxplot_frame_groupby(grouped, subplots=True, column=None, fontsize=None, - rot=0, grid=True, ax=None, figsize=None, - layout=None, **kwds): - """ - Make box plots from DataFrameGroupBy data. 
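Before the grouped boxplot helper documented below, a quick usage sketch for `grouped_hist` above, which is the path `hist` takes when `by=` is given (made-up data; one subplot per group key):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(1000))
key = np.random.choice(['x', 'y', 'z'], size=1000)   # grouping vector
axes = s.hist(by=key, bins=30)   # routed through grouped_hist
```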
- - Parameters - ---------- - grouped : Grouped DataFrame - subplots : - * ``False`` - no subplots will be used - * ``True`` - create a subplot for each group - column : column name or list of names, or vector - Can be any valid input to groupby - fontsize : int or string - rot : label rotation angle - grid : Setting this to True will show the grid - ax : Matplotlib axis object, default None - figsize : A tuple (width, height) in inches - layout : tuple (optional) - (rows, columns) for the layout of the plot - kwds : other plotting keyword arguments to be passed to matplotlib boxplot - function - - Returns - ------- - dict of key/value = group key/DataFrame.boxplot return value - or DataFrame.boxplot return value in case subplots=figures=False - - Examples - -------- - >>> import pandas - >>> import numpy as np - >>> import itertools - >>> - >>> tuples = [t for t in itertools.product(range(1000), range(4))] - >>> index = pandas.MultiIndex.from_tuples(tuples, names=['lvl0', 'lvl1']) - >>> data = np.random.randn(len(index),4) - >>> df = pandas.DataFrame(data, columns=list('ABCD'), index=index) - >>> - >>> grouped = df.groupby(level='lvl1') - >>> boxplot_frame_groupby(grouped) - >>> - >>> grouped = df.unstack(level='lvl1').groupby(level=0, axis=1) - >>> boxplot_frame_groupby(grouped, subplots=False) - """ - if subplots is True: - naxes = len(grouped) - fig, axes = _subplots(naxes=naxes, squeeze=False, - ax=ax, sharex=False, sharey=True, - figsize=figsize, layout=layout) - axes = _flatten(axes) - - ret = Series() - for (key, group), ax in zip(grouped, axes): - d = group.boxplot(ax=ax, column=column, fontsize=fontsize, - rot=rot, grid=grid, **kwds) - ax.set_title(pprint_thing(key)) - ret.loc[key] = d - fig.subplots_adjust(bottom=0.15, top=0.9, left=0.1, - right=0.9, wspace=0.2) - else: - from pandas.tools.merge import concat - keys, frames = zip(*grouped) - if grouped.axis == 0: - df = concat(frames, keys=keys, axis=1) - else: - if len(frames) > 1: - df = frames[0].join(frames[1::]) - else: - df = frames[0] - ret = df.boxplot(column=column, fontsize=fontsize, rot=rot, - grid=grid, ax=ax, figsize=figsize, - layout=layout, **kwds) - return ret - - -def _grouped_plot(plotf, data, column=None, by=None, numeric_only=True, - figsize=None, sharex=True, sharey=True, layout=None, - rot=0, ax=None, **kwargs): - from pandas import DataFrame - - if figsize == 'default': - # allowed to specify mpl default with 'default' - warnings.warn("figsize='default' is deprecated. 
Specify figure" - "size by tuple instead", FutureWarning, stacklevel=4) - figsize = None - - grouped = data.groupby(by) - if column is not None: - grouped = grouped[column] - - naxes = len(grouped) - fig, axes = _subplots(naxes=naxes, figsize=figsize, - sharex=sharex, sharey=sharey, ax=ax, - layout=layout) - - _axes = _flatten(axes) - - for i, (key, group) in enumerate(grouped): - ax = _axes[i] - if numeric_only and isinstance(group, DataFrame): - group = group._get_numeric_data() - plotf(group, ax, **kwargs) - ax.set_title(pprint_thing(key)) - - return fig, axes - - -def _grouped_plot_by_column(plotf, data, columns=None, by=None, - numeric_only=True, grid=False, - figsize=None, ax=None, layout=None, - return_type=None, **kwargs): - grouped = data.groupby(by) - if columns is None: - if not isinstance(by, (list, tuple)): - by = [by] - columns = data._get_numeric_data().columns.difference(by) - naxes = len(columns) - fig, axes = _subplots(naxes=naxes, sharex=True, sharey=True, - figsize=figsize, ax=ax, layout=layout) - - _axes = _flatten(axes) - - result = Series() - ax_values = [] - - for i, col in enumerate(columns): - ax = _axes[i] - gp_col = grouped[col] - keys, values = zip(*gp_col) - re_plotf = plotf(keys, values, ax, **kwargs) - ax.set_title(col) - ax.set_xlabel(pprint_thing(by)) - ax_values.append(re_plotf) - ax.grid(grid) - - result = Series(ax_values, index=columns) - - # Return axes in multiplot case, maybe revisit later # 985 - if return_type is None: - result = axes - - byline = by[0] if len(by) == 1 else by - fig.suptitle('Boxplot grouped by %s' % byline) - fig.subplots_adjust(bottom=0.15, top=0.9, left=0.1, right=0.9, wspace=0.2) - - return result - - -def table(ax, data, rowLabels=None, colLabels=None, - **kwargs): - """ - Helper function to convert DataFrame and Series to matplotlib.table - - Parameters - ---------- - `ax`: Matplotlib axes object - `data`: DataFrame or Series - data for table contents - `kwargs`: keywords, optional - keyword arguments which passed to matplotlib.table.table. - If `rowLabels` or `colLabels` is not specified, data index or column - name will be used. 
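A short usage sketch for the `table` helper being documented here; the import path is the one this era of the codebase exposes (`pandas.tools.plotting`), and the `loc` keyword is simply forwarded to `matplotlib.table.table`:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.tools.plotting import table

df = pd.DataFrame(np.random.rand(4, 2), columns=['a', 'b'])

fig, ax = plt.subplots()
table(ax, df, loc='top')      # row/column labels default to index/columns
df.plot(ax=ax, legend=False)
plt.show()
```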
- - Returns - ------- - matplotlib table object - """ - from pandas import DataFrame - if isinstance(data, Series): - data = DataFrame(data, columns=[data.name]) - elif isinstance(data, DataFrame): - pass - else: - raise ValueError('Input data must be DataFrame or Series') - - if rowLabels is None: - rowLabels = data.index - - if colLabels is None: - colLabels = data.columns - - cellText = data.values - - import matplotlib.table - table = matplotlib.table.table(ax, cellText=cellText, - rowLabels=rowLabels, - colLabels=colLabels, **kwargs) - return table - - -def _get_layout(nplots, layout=None, layout_type='box'): - if layout is not None: - if not isinstance(layout, (tuple, list)) or len(layout) != 2: - raise ValueError('Layout must be a tuple of (rows, columns)') - - nrows, ncols = layout - - # Python 2 compat - ceil_ = lambda x: int(ceil(x)) - if nrows == -1 and ncols > 0: - layout = nrows, ncols = (ceil_(float(nplots) / ncols), ncols) - elif ncols == -1 and nrows > 0: - layout = nrows, ncols = (nrows, ceil_(float(nplots) / nrows)) - elif ncols <= 0 and nrows <= 0: - msg = "At least one dimension of layout must be positive" - raise ValueError(msg) - - if nrows * ncols < nplots: - raise ValueError('Layout of %sx%s must be larger than ' - 'required size %s' % (nrows, ncols, nplots)) - - return layout - - if layout_type == 'single': - return (1, 1) - elif layout_type == 'horizontal': - return (1, nplots) - elif layout_type == 'vertical': - return (nplots, 1) - - layouts = {1: (1, 1), 2: (1, 2), 3: (2, 2), 4: (2, 2)} - try: - return layouts[nplots] - except KeyError: - k = 1 - while k ** 2 < nplots: - k += 1 - - if (k - 1) * k >= nplots: - return k, (k - 1) - else: - return k, k - -# copied from matplotlib/pyplot.py and modified for pandas.plotting - - -def _subplots(naxes=None, sharex=False, sharey=False, squeeze=True, - subplot_kw=None, ax=None, layout=None, layout_type='box', - **fig_kw): - """Create a figure with a set of subplots already made. - - This utility wrapper makes it convenient to create common layouts of - subplots, including the enclosing figure object, in a single call. - - Keyword arguments: - - naxes : int - Number of required axes. Exceeded axes are set invisible. Default is - nrows * ncols. - - sharex : bool - If True, the X axis will be shared amongst all subplots. - - sharey : bool - If True, the Y axis will be shared amongst all subplots. - - squeeze : bool - - If True, extra dimensions are squeezed out from the returned axis object: - - if only one subplot is constructed (nrows=ncols=1), the resulting - single Axis object is returned as a scalar. - - for Nx1 or 1xN subplots, the returned object is a 1-d numpy object - array of Axis objects are returned as numpy 1-d arrays. - - for NxM subplots with N>1 and M>1 are returned as a 2d array. - - If False, no squeezing at all is done: the returned axis object is always - a 2-d array containing Axis instances, even if it ends up being 1x1. - - subplot_kw : dict - Dict with keywords passed to the add_subplot() call used to create each - subplots. - - ax : Matplotlib axis object, optional - - layout : tuple - Number of rows and columns of the subplot grid. - If not specified, calculated from naxes and layout_type - - layout_type : {'box', 'horziontal', 'vertical'}, default 'box' - Specify how to layout the subplot grid. - - fig_kw : Other keyword arguments to be passed to the figure() call. - Note that all keywords not recognized above will be - automatically included here. 
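The `-1` wildcard handling in `_get_layout` above reduces to a few lines; an extracted, renamed sketch (not the internal function itself):

```python
from math import ceil

def get_layout(nplots, layout):
    nrows, ncols = layout
    if nrows == -1 and ncols > 0:
        nrows = int(ceil(float(nplots) / ncols))   # fill rows as needed
    elif ncols == -1 and nrows > 0:
        ncols = int(ceil(float(nplots) / nrows))   # fill columns as needed
    elif ncols <= 0 and nrows <= 0:
        raise ValueError("At least one dimension of layout must be positive")
    if nrows * ncols < nplots:
        raise ValueError('Layout of %sx%s must be larger than '
                         'required size %s' % (nrows, ncols, nplots))
    return nrows, ncols

print(get_layout(5, (-1, 2)))   # (3, 2)
print(get_layout(5, (2, -1)))   # (2, 3)
```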
- - Returns: - - fig, ax : tuple - - fig is the Matplotlib Figure object - - ax can be either a single axis object or an array of axis objects if - more than one subplot was created. The dimensions of the resulting array - can be controlled with the squeeze keyword, see above. - - **Examples:** - - x = np.linspace(0, 2*np.pi, 400) - y = np.sin(x**2) - - # Just a figure and one subplot - f, ax = plt.subplots() - ax.plot(x, y) - ax.set_title('Simple plot') - - # Two subplots, unpack the output array immediately - f, (ax1, ax2) = plt.subplots(1, 2, sharey=True) - ax1.plot(x, y) - ax1.set_title('Sharing Y axis') - ax2.scatter(x, y) - - # Four polar axes - plt.subplots(2, 2, subplot_kw=dict(polar=True)) - """ - import matplotlib.pyplot as plt - - if subplot_kw is None: - subplot_kw = {} - - if ax is None: - fig = plt.figure(**fig_kw) - else: - if is_list_like(ax): - ax = _flatten(ax) - if layout is not None: - warnings.warn("When passing multiple axes, layout keyword is " - "ignored", UserWarning) - if sharex or sharey: - warnings.warn("When passing multiple axes, sharex and sharey " - "are ignored. These settings must be specified " - "when creating axes", UserWarning, - stacklevel=4) - if len(ax) == naxes: - fig = ax[0].get_figure() - return fig, ax - else: - raise ValueError("The number of passed axes must be {0}, the " - "same as the output plot".format(naxes)) - - fig = ax.get_figure() - # if ax is passed and a number of subplots is 1, return ax as it is - if naxes == 1: - if squeeze: - return fig, ax - else: - return fig, _flatten(ax) - else: - warnings.warn("To output multiple subplots, the figure containing " - "the passed axes is being cleared", UserWarning, - stacklevel=4) - fig.clear() - - nrows, ncols = _get_layout(naxes, layout=layout, layout_type=layout_type) - nplots = nrows * ncols - - # Create empty object array to hold all axes. It's easiest to make it 1-d - # so we can just append subplots upon creation, and then - axarr = np.empty(nplots, dtype=object) - - # Create first subplot separately, so we can share it if requested - ax0 = fig.add_subplot(nrows, ncols, 1, **subplot_kw) - - if sharex: - subplot_kw['sharex'] = ax0 - if sharey: - subplot_kw['sharey'] = ax0 - axarr[0] = ax0 - - # Note off-by-one counting because add_subplot uses the MATLAB 1-based - # convention. - for i in range(1, nplots): - kwds = subplot_kw.copy() - # Set sharex and sharey to None for blank/dummy axes, these can - # interfere with proper axis limits on the visible axes if - # they share axes e.g. issue #7528 - if i >= naxes: - kwds['sharex'] = None - kwds['sharey'] = None - ax = fig.add_subplot(nrows, ncols, i + 1, **kwds) - axarr[i] = ax - - if naxes != nplots: - for ax in axarr[naxes:]: - ax.set_visible(False) - - _handle_shared_axes(axarr, nplots, naxes, nrows, ncols, sharex, sharey) - - if squeeze: - # Reshape the array to have the final desired dimension (nrow,ncol), - # though discarding unneeded dimensions that equal 1. If we only have - # one subplot, just return it instead of a 1-element array. 
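As an aside before the reshape below: the squeeze behaviour documented above is plain numpy semantics. A minimal standalone sketch, not part of the diff; the string stand-ins are hypothetical placeholders for Axes objects:

```python
import numpy as np

# object arrays of "axes": NxM stays 2-d, Nx1/1xN collapse to 1-d, and
# 1x1 collapses to a 0-d array, which is why the code below special-cases
# nplots == 1 and returns axarr[0] directly.
axarr = np.array(['ax%d' % i for i in range(4)], dtype=object)

assert axarr.reshape(2, 2).squeeze().shape == (2, 2)
assert axarr[:2].reshape(2, 1).squeeze().shape == (2,)
assert axarr[:1].reshape(1, 1).squeeze().shape == ()
```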
-        if nplots == 1:
-            axes = axarr[0]
-        else:
-            axes = axarr.reshape(nrows, ncols).squeeze()
-    else:
-        # returned axis array will always be 2-d, even if nrows=ncols=1
-        axes = axarr.reshape(nrows, ncols)
-
-    return fig, axes
-
-
-def _remove_labels_from_axis(axis):
-    for t in axis.get_majorticklabels():
-        t.set_visible(False)
-
-    try:
-        # set_visible will not be effective if the minor axis has
-        # NullLocator and NullFormatter (the default)
-        import matplotlib.ticker as ticker
-        if isinstance(axis.get_minor_locator(), ticker.NullLocator):
-            axis.set_minor_locator(ticker.AutoLocator())
-        if isinstance(axis.get_minor_formatter(), ticker.NullFormatter):
-            axis.set_minor_formatter(ticker.FormatStrFormatter(''))
-        for t in axis.get_minorticklabels():
-            t.set_visible(False)
-    except Exception:  # pragma: no cover
-        raise
-    axis.get_label().set_visible(False)
-
-
-def _handle_shared_axes(axarr, nplots, naxes, nrows, ncols, sharex, sharey):
-    if nplots > 1:
-
-        if nrows > 1:
-            try:
-                # first find out the ax layout,
-                # so that we can correctly handle 'gaps'
-                layout = np.zeros((nrows + 1, ncols + 1), dtype=np.bool)
-                for ax in axarr:
-                    layout[ax.rowNum, ax.colNum] = ax.get_visible()
-
-                for ax in axarr:
-                    # only the last row of subplots should get x labels ->
-                    # all others off. The layout grid handles the case where
-                    # a subplot is the last one in its column, because there
-                    # is no subplot/gap below it.
-                    if not layout[ax.rowNum + 1, ax.colNum]:
-                        continue
-                    if sharex or len(ax.get_shared_x_axes()
-                                     .get_siblings(ax)) > 1:
-                        _remove_labels_from_axis(ax.xaxis)
-
-            except IndexError:
-                # if gridspec is used, ax.rowNum and ax.colNum may differ
-                # from the layout shape. In this case, use the last_row logic
-                for ax in axarr:
-                    if ax.is_last_row():
-                        continue
-                    if sharex or len(ax.get_shared_x_axes()
-                                     .get_siblings(ax)) > 1:
-                        _remove_labels_from_axis(ax.xaxis)
-
-        if ncols > 1:
-            for ax in axarr:
-                # only the first column should get y labels -> set all
-                # others to off. As we only have labels in the first column
-                # and we always have a subplot there, we can skip the
-                # layout test.
-                if ax.is_first_col():
-                    continue
-                if sharey or len(ax.get_shared_y_axes().get_siblings(ax)) > 1:
-                    _remove_labels_from_axis(ax.yaxis)
-
-
-def _flatten(axes):
-    if not is_list_like(axes):
-        return np.array([axes])
-    elif isinstance(axes, (np.ndarray, Index)):
-        return axes.ravel()
-    return np.array(axes)
-
-
-def _get_all_lines(ax):
-    lines = ax.get_lines()
-
-    if hasattr(ax, 'right_ax'):
-        lines += ax.right_ax.get_lines()
-
-    if hasattr(ax, 'left_ax'):
-        lines += ax.left_ax.get_lines()
-
-    return lines
-
-
-def _get_xlim(lines):
-    left, right = np.inf, -np.inf
-    for l in lines:
-        x = l.get_xdata(orig=False)
-        left = min(x[0], left)
-        right = max(x[-1], right)
-    return left, right
-
-
-def _set_ticks_props(axes, xlabelsize=None, xrot=None,
-                     ylabelsize=None, yrot=None):
-    import matplotlib.pyplot as plt
-
-    for ax in _flatten(axes):
-        if xlabelsize is not None:
-            plt.setp(ax.get_xticklabels(), fontsize=xlabelsize)
-        if xrot is not None:
-            plt.setp(ax.get_xticklabels(), rotation=xrot)
-        if ylabelsize is not None:
-            plt.setp(ax.get_yticklabels(), fontsize=ylabelsize)
-        if yrot is not None:
-            plt.setp(ax.get_yticklabels(), rotation=yrot)
-    return axes
-
-
-class BasePlotMethods(PandasObject):
-
-    def __init__(self, data):
-        self._data = data
-
-    def __call__(self, *args, **kwargs):
-        raise NotImplementedError
-
-
-class SeriesPlotMethods(BasePlotMethods):
-    """Series plotting accessor and method
-
-    Examples
-    --------
-    >>> s.plot.line()
-    >>> s.plot.bar()
-    >>>
s.plot.hist() - - Plotting methods can also be accessed by calling the accessor as a method - with the ``kind`` argument: - ``s.plot(kind='line')`` is equivalent to ``s.plot.line()`` - """ - - def __call__(self, kind='line', ax=None, - figsize=None, use_index=True, title=None, grid=None, - legend=False, style=None, logx=False, logy=False, - loglog=False, xticks=None, yticks=None, - xlim=None, ylim=None, - rot=None, fontsize=None, colormap=None, table=False, - yerr=None, xerr=None, - label=None, secondary_y=False, **kwds): - return plot_series(self._data, kind=kind, ax=ax, figsize=figsize, - use_index=use_index, title=title, grid=grid, - legend=legend, style=style, logx=logx, logy=logy, - loglog=loglog, xticks=xticks, yticks=yticks, - xlim=xlim, ylim=ylim, rot=rot, fontsize=fontsize, - colormap=colormap, table=table, yerr=yerr, - xerr=xerr, label=label, secondary_y=secondary_y, - **kwds) - __call__.__doc__ = plot_series.__doc__ - - def line(self, **kwds): - """ - Line plot - - .. versionadded:: 0.17.0 - - Parameters - ---------- - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='line', **kwds) - - def bar(self, **kwds): - """ - Vertical bar plot - - .. versionadded:: 0.17.0 - - Parameters - ---------- - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='bar', **kwds) - - def barh(self, **kwds): - """ - Horizontal bar plot - - .. versionadded:: 0.17.0 - - Parameters - ---------- - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='barh', **kwds) - - def box(self, **kwds): - """ - Boxplot - - .. versionadded:: 0.17.0 - - Parameters - ---------- - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='box', **kwds) - - def hist(self, bins=10, **kwds): - """ - Histogram - - .. versionadded:: 0.17.0 - - Parameters - ---------- - bins: integer, default 10 - Number of histogram bins to be used - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='hist', bins=bins, **kwds) - - def kde(self, **kwds): - """ - Kernel Density Estimate plot - - .. versionadded:: 0.17.0 - - Parameters - ---------- - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='kde', **kwds) - - density = kde - - def area(self, **kwds): - """ - Area plot - - .. versionadded:: 0.17.0 - - Parameters - ---------- - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='area', **kwds) - - def pie(self, **kwds): - """ - Pie chart - - .. versionadded:: 0.17.0 - - Parameters - ---------- - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.Series.plot`. 
- - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='pie', **kwds) - - -class FramePlotMethods(BasePlotMethods): - """DataFrame plotting accessor and method - - Examples - -------- - >>> df.plot.line() - >>> df.plot.scatter('x', 'y') - >>> df.plot.hexbin() - - These plotting methods can also be accessed by calling the accessor as a - method with the ``kind`` argument: - ``df.plot(kind='line')`` is equivalent to ``df.plot.line()`` - """ - - def __call__(self, x=None, y=None, kind='line', ax=None, - subplots=False, sharex=None, sharey=False, layout=None, - figsize=None, use_index=True, title=None, grid=None, - legend=True, style=None, logx=False, logy=False, loglog=False, - xticks=None, yticks=None, xlim=None, ylim=None, - rot=None, fontsize=None, colormap=None, table=False, - yerr=None, xerr=None, - secondary_y=False, sort_columns=False, **kwds): - return plot_frame(self._data, kind=kind, x=x, y=y, ax=ax, - subplots=subplots, sharex=sharex, sharey=sharey, - layout=layout, figsize=figsize, use_index=use_index, - title=title, grid=grid, legend=legend, style=style, - logx=logx, logy=logy, loglog=loglog, xticks=xticks, - yticks=yticks, xlim=xlim, ylim=ylim, rot=rot, - fontsize=fontsize, colormap=colormap, table=table, - yerr=yerr, xerr=xerr, secondary_y=secondary_y, - sort_columns=sort_columns, **kwds) - __call__.__doc__ = plot_frame.__doc__ - - def line(self, x=None, y=None, **kwds): - """ - Line plot - - .. versionadded:: 0.17.0 - - Parameters - ---------- - x, y : label or position, optional - Coordinates for each point. - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='line', x=x, y=y, **kwds) - - def bar(self, x=None, y=None, **kwds): - """ - Vertical bar plot - - .. versionadded:: 0.17.0 - - Parameters - ---------- - x, y : label or position, optional - Coordinates for each point. - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='bar', x=x, y=y, **kwds) - - def barh(self, x=None, y=None, **kwds): - """ - Horizontal bar plot - - .. versionadded:: 0.17.0 - - Parameters - ---------- - x, y : label or position, optional - Coordinates for each point. - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='barh', x=x, y=y, **kwds) - - def box(self, by=None, **kwds): - """ - Boxplot - - .. versionadded:: 0.17.0 - - Parameters - ---------- - by : string or sequence - Column in the DataFrame to group by. - \*\*kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='box', by=by, **kwds) - - def hist(self, by=None, bins=10, **kwds): - """ - Histogram - - .. versionadded:: 0.17.0 - - Parameters - ---------- - by : string or sequence - Column in the DataFrame to group by. - bins: integer, default 10 - Number of histogram bins to be used - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='hist', by=by, bins=bins, **kwds) - - def kde(self, **kwds): - """ - Kernel Density Estimate plot - - .. 
versionadded:: 0.17.0 - - Parameters - ---------- - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='kde', **kwds) - - density = kde - - def area(self, x=None, y=None, **kwds): - """ - Area plot - - .. versionadded:: 0.17.0 - - Parameters - ---------- - x, y : label or position, optional - Coordinates for each point. - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='area', x=x, y=y, **kwds) - - def pie(self, y=None, **kwds): - """ - Pie chart - - .. versionadded:: 0.17.0 - - Parameters - ---------- - y : label or position, optional - Column to plot. - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='pie', y=y, **kwds) - - def scatter(self, x, y, s=None, c=None, **kwds): - """ - Scatter plot - - .. versionadded:: 0.17.0 - - Parameters - ---------- - x, y : label or position, optional - Coordinates for each point. - s : scalar or array_like, optional - Size of each point. - c : label or position, optional - Color of each point. - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. - - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - return self(kind='scatter', x=x, y=y, c=c, s=s, **kwds) - - def hexbin(self, x, y, C=None, reduce_C_function=None, gridsize=None, - **kwds): - """ - Hexbin plot - - .. versionadded:: 0.17.0 - - Parameters - ---------- - x, y : label or position, optional - Coordinates for each point. - C : label or position, optional - The value at each `(x, y)` point. - reduce_C_function : callable, optional - Function of one argument that reduces all the values in a bin to - a single number (e.g. `mean`, `max`, `sum`, `std`). - gridsize : int, optional - Number of bins. - **kwds : optional - Keyword arguments to pass on to :py:meth:`pandas.DataFrame.plot`. 
- - Returns - ------- - axes : matplotlib.AxesSubplot or np.array of them - """ - if reduce_C_function is not None: - kwds['reduce_C_function'] = reduce_C_function - if gridsize is not None: - kwds['gridsize'] = gridsize - return self(kind='hexbin', x=x, y=y, C=C, **kwds) - - -if __name__ == '__main__': - # import pandas.rpy.common as com - # sales = com.load_data('sanfrancisco.home.sales', package='nutshell') - # top10 = sales['zip'].value_counts()[:10].index - # sales2 = sales[sales.zip.isin(top10)] - # _ = scatter_plot(sales2, 'squarefeet', 'price', by='zip') - # plt.show() +import pandas.plotting as _plotting - import matplotlib.pyplot as plt +# back-compat of public API +# deprecate these functions +m = sys.modules['pandas.tools.plotting'] +for t in [t for t in dir(_plotting) if not t.startswith('_')]: - import pandas.tools.plotting as plots - import pandas.core.frame as fr - reload(plots) # noqa - reload(fr) # noqa - from pandas.core.frame import DataFrame + def outer(t=t): - data = DataFrame([[3, 6, -5], [4, 8, 2], [4, 9, -6], - [4, 9, -3], [2, 5, -1]], - columns=['A', 'B', 'C']) - data.plot(kind='barh', stacked=True) + def wrapper(*args, **kwargs): + warnings.warn("'pandas.tools.plotting.{t}' is deprecated, " + "import 'pandas.plotting.{t}' instead.".format(t=t), + FutureWarning, stacklevel=2) + return getattr(_plotting, t)(*args, **kwargs) + return wrapper - plt.show() + setattr(m, t, outer(t)) diff --git a/pandas/tools/tests/test_tile.py b/pandas/tools/tests/test_tile.py deleted file mode 100644 index e5b9c65b515d6..0000000000000 --- a/pandas/tools/tests/test_tile.py +++ /dev/null @@ -1,294 +0,0 @@ -import os -import nose - -import numpy as np -from pandas.compat import zip - -from pandas import Series, Index -import pandas.util.testing as tm -from pandas.util.testing import assertRaisesRegexp -import pandas.core.common as com - -from pandas.core.algorithms import quantile -from pandas.tools.tile import cut, qcut -import pandas.tools.tile as tmod - - -class TestCut(tm.TestCase): - - def test_simple(self): - data = np.ones(5) - result = cut(data, 4, labels=False) - desired = np.array([1, 1, 1, 1, 1]) - tm.assert_numpy_array_equal(result, desired, - check_dtype=False) - - def test_bins(self): - data = np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]) - result, bins = cut(data, 3, retbins=True) - - exp_codes = np.array([0, 0, 0, 1, 2, 0], dtype=np.int8) - tm.assert_numpy_array_equal(result.codes, exp_codes) - exp = np.array([0.1905, 3.36666667, 6.53333333, 9.7]) - tm.assert_almost_equal(bins, exp) - - def test_right(self): - data = np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1, 2.575]) - result, bins = cut(data, 4, right=True, retbins=True) - exp_codes = np.array([0, 0, 0, 2, 3, 0, 0], dtype=np.int8) - tm.assert_numpy_array_equal(result.codes, exp_codes) - exp = np.array([0.1905, 2.575, 4.95, 7.325, 9.7]) - tm.assert_numpy_array_equal(bins, exp) - - def test_noright(self): - data = np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1, 2.575]) - result, bins = cut(data, 4, right=False, retbins=True) - exp_codes = np.array([0, 0, 0, 2, 3, 0, 1], dtype=np.int8) - tm.assert_numpy_array_equal(result.codes, exp_codes) - exp = np.array([0.2, 2.575, 4.95, 7.325, 9.7095]) - tm.assert_almost_equal(bins, exp) - - def test_arraylike(self): - data = [.2, 1.4, 2.5, 6.2, 9.7, 2.1] - result, bins = cut(data, 3, retbins=True) - exp_codes = np.array([0, 0, 0, 1, 2, 0], dtype=np.int8) - tm.assert_numpy_array_equal(result.codes, exp_codes) - exp = np.array([0.1905, 3.36666667, 6.53333333, 9.7]) - tm.assert_almost_equal(bins, exp) - - 
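The `pandas/tools/plotting.py` stub added above keeps the old import path alive by wrapping every public name from `pandas.plotting` in a `FutureWarning`. A self-contained sketch of that closure pattern, with `newmod`/`oldmod` as stand-ins for the two pandas modules:

```python
import sys
import types
import warnings

_new = types.ModuleType('newmod')   # stands in for pandas.plotting
_new.scatter_matrix = lambda *args, **kwargs: 'plotted'

_old = types.ModuleType('oldmod')   # stands in for pandas.tools.plotting
sys.modules['oldmod'] = _old

for t in [t for t in dir(_new) if not t.startswith('_')]:
    # the t=t default freezes the current name; without it, every wrapper
    # would close over the loop variable and dispatch to the last name
    def outer(t=t):
        def wrapper(*args, **kwargs):
            warnings.warn("'oldmod.{t}' is deprecated, import "
                          "'newmod.{t}' instead.".format(t=t),
                          FutureWarning, stacklevel=2)
            return getattr(_new, t)(*args, **kwargs)
        return wrapper
    setattr(_old, t, outer(t))

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter('always')
    assert _old.scatter_matrix() == 'plotted'
    assert issubclass(w[-1].category, FutureWarning)
```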
def test_bins_not_monotonic(self): - data = [.2, 1.4, 2.5, 6.2, 9.7, 2.1] - self.assertRaises(ValueError, cut, data, [0.1, 1.5, 1, 10]) - - def test_wrong_num_labels(self): - data = [.2, 1.4, 2.5, 6.2, 9.7, 2.1] - self.assertRaises(ValueError, cut, data, [0, 1, 10], - labels=['foo', 'bar', 'baz']) - - def test_cut_corner(self): - # h3h - self.assertRaises(ValueError, cut, [], 2) - - self.assertRaises(ValueError, cut, [1, 2, 3], 0.5) - - def test_cut_out_of_range_more(self): - # #1511 - s = Series([0, -1, 0, 1, -3], name='x') - ind = cut(s, [0, 1], labels=False) - exp = Series([np.nan, np.nan, np.nan, 0, np.nan], name='x') - tm.assert_series_equal(ind, exp) - - def test_labels(self): - arr = np.tile(np.arange(0, 1.01, 0.1), 4) - - result, bins = cut(arr, 4, retbins=True) - ex_levels = Index(['(-0.001, 0.25]', '(0.25, 0.5]', '(0.5, 0.75]', - '(0.75, 1]']) - self.assert_index_equal(result.categories, ex_levels) - - result, bins = cut(arr, 4, retbins=True, right=False) - ex_levels = Index(['[0, 0.25)', '[0.25, 0.5)', '[0.5, 0.75)', - '[0.75, 1.001)']) - self.assert_index_equal(result.categories, ex_levels) - - def test_cut_pass_series_name_to_factor(self): - s = Series(np.random.randn(100), name='foo') - - factor = cut(s, 4) - self.assertEqual(factor.name, 'foo') - - def test_label_precision(self): - arr = np.arange(0, 0.73, 0.01) - - result = cut(arr, 4, precision=2) - ex_levels = Index(['(-0.00072, 0.18]', '(0.18, 0.36]', - '(0.36, 0.54]', '(0.54, 0.72]']) - self.assert_index_equal(result.categories, ex_levels) - - def test_na_handling(self): - arr = np.arange(0, 0.75, 0.01) - arr[::3] = np.nan - - result = cut(arr, 4) - - result_arr = np.asarray(result) - - ex_arr = np.where(com.isnull(arr), np.nan, result_arr) - - tm.assert_almost_equal(result_arr, ex_arr) - - result = cut(arr, 4, labels=False) - ex_result = np.where(com.isnull(arr), np.nan, result) - tm.assert_almost_equal(result, ex_result) - - def test_inf_handling(self): - data = np.arange(6) - data_ser = Series(data, dtype='int64') - - result = cut(data, [-np.inf, 2, 4, np.inf]) - result_ser = cut(data_ser, [-np.inf, 2, 4, np.inf]) - - ex_categories = Index(['(-inf, 2]', '(2, 4]', '(4, inf]']) - - tm.assert_index_equal(result.categories, ex_categories) - tm.assert_index_equal(result_ser.cat.categories, ex_categories) - self.assertEqual(result[5], '(4, inf]') - self.assertEqual(result[0], '(-inf, 2]') - self.assertEqual(result_ser[5], '(4, inf]') - self.assertEqual(result_ser[0], '(-inf, 2]') - - def test_qcut(self): - arr = np.random.randn(1000) - - labels, bins = qcut(arr, 4, retbins=True) - ex_bins = quantile(arr, [0, .25, .5, .75, 1.]) - tm.assert_almost_equal(bins, ex_bins) - - ex_levels = cut(arr, ex_bins, include_lowest=True) - self.assert_categorical_equal(labels, ex_levels) - - def test_qcut_bounds(self): - arr = np.random.randn(1000) - - factor = qcut(arr, 10, labels=False) - self.assertEqual(len(np.unique(factor)), 10) - - def test_qcut_specify_quantiles(self): - arr = np.random.randn(100) - - factor = qcut(arr, [0, .25, .5, .75, 1.]) - expected = qcut(arr, 4) - tm.assert_categorical_equal(factor, expected) - - def test_qcut_all_bins_same(self): - assertRaisesRegexp(ValueError, "edges.*unique", qcut, - [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 3) - - def test_cut_out_of_bounds(self): - arr = np.random.randn(100) - - result = cut(arr, [-1, 0, 1]) - - mask = result.codes == -1 - ex_mask = (arr < -1) | (arr > 1) - self.assert_numpy_array_equal(mask, ex_mask) - - def test_cut_pass_labels(self): - arr = [50, 5, 10, 15, 20, 30, 70] - bins = 
[0, 25, 50, 100] - labels = ['Small', 'Medium', 'Large'] - - result = cut(arr, bins, labels=labels) - - exp = cut(arr, bins) - exp.categories = labels - - tm.assert_categorical_equal(result, exp) - - def test_qcut_include_lowest(self): - values = np.arange(10) - - cats = qcut(values, 4) - - ex_levels = ['[0, 2.25]', '(2.25, 4.5]', '(4.5, 6.75]', '(6.75, 9]'] - self.assertTrue((cats.categories == ex_levels).all()) - - def test_qcut_nas(self): - arr = np.random.randn(100) - arr[:20] = np.nan - - result = qcut(arr, 4) - self.assertTrue(com.isnull(result[:20]).all()) - - def test_label_formatting(self): - self.assertEqual(tmod._trim_zeros('1.000'), '1') - - # it works - result = cut(np.arange(11.), 2) - - result = cut(np.arange(11.) / 1e10, 2) - - # #1979, negative numbers - - result = tmod._format_label(-117.9998, precision=3) - self.assertEqual(result, '-118') - result = tmod._format_label(117.9998, precision=3) - self.assertEqual(result, '118') - - def test_qcut_binning_issues(self): - # #1978, 1979 - path = os.path.join(tm.get_data_path(), 'cut_data.csv') - arr = np.loadtxt(path) - - result = qcut(arr, 20) - - starts = [] - ends = [] - for lev in result.categories: - s, e = lev[1:-1].split(',') - - self.assertTrue(s != e) - - starts.append(float(s)) - ends.append(float(e)) - - for (sp, sn), (ep, en) in zip(zip(starts[:-1], starts[1:]), - zip(ends[:-1], ends[1:])): - self.assertTrue(sp < sn) - self.assertTrue(ep < en) - self.assertTrue(ep <= sn) - - def test_cut_return_categorical(self): - from pandas import Categorical - s = Series([0, 1, 2, 3, 4, 5, 6, 7, 8]) - res = cut(s, 3) - exp = Series(Categorical.from_codes([0, 0, 0, 1, 1, 1, 2, 2, 2], - ["(-0.008, 2.667]", - "(2.667, 5.333]", "(5.333, 8]"], - ordered=True)) - tm.assert_series_equal(res, exp) - - def test_qcut_return_categorical(self): - from pandas import Categorical - s = Series([0, 1, 2, 3, 4, 5, 6, 7, 8]) - res = qcut(s, [0, 0.333, 0.666, 1]) - exp = Series(Categorical.from_codes([0, 0, 0, 1, 1, 1, 2, 2, 2], - ["[0, 2.664]", - "(2.664, 5.328]", "(5.328, 8]"], - ordered=True)) - tm.assert_series_equal(res, exp) - - def test_series_retbins(self): - # GH 8589 - s = Series(np.arange(4)) - result, bins = cut(s, 2, retbins=True) - tm.assert_numpy_array_equal(result.cat.codes.values, - np.array([0, 0, 1, 1], dtype=np.int8)) - tm.assert_numpy_array_equal(bins, np.array([-0.003, 1.5, 3])) - - result, bins = qcut(s, 2, retbins=True) - tm.assert_numpy_array_equal(result.cat.codes.values, - np.array([0, 0, 1, 1], dtype=np.int8)) - tm.assert_numpy_array_equal(bins, np.array([0, 1.5, 3])) - - def test_single_bin(self): - # issue 14652 - expected = Series([0, 0]) - - s = Series([9., 9.]) - result = cut(s, 1, labels=False) - tm.assert_series_equal(result, expected) - - s = Series([-9., -9.]) - result = cut(s, 1, labels=False) - tm.assert_series_equal(result, expected) - - -def curpath(): - pth, _ = os.path.split(os.path.abspath(__file__)) - return pth - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tools/tile.py b/pandas/tools/tile.py deleted file mode 100644 index ef75f2f84779b..0000000000000 --- a/pandas/tools/tile.py +++ /dev/null @@ -1,302 +0,0 @@ -""" -Quantilization functions and related stuff -""" - -from pandas.types.missing import isnull -from pandas.types.common import (is_float, is_integer, - is_scalar) - -from pandas.core.api import Series -from pandas.core.categorical import Categorical -import pandas.core.algorithms as algos -import 
pandas.core.nanops as nanops -from pandas.compat import zip - -import numpy as np - - -def cut(x, bins, right=True, labels=None, retbins=False, precision=3, - include_lowest=False): - """ - Return indices of half-open bins to which each value of `x` belongs. - - Parameters - ---------- - x : array-like - Input array to be binned. It has to be 1-dimensional. - bins : int or sequence of scalars - If `bins` is an int, it defines the number of equal-width bins in the - range of `x`. However, in this case, the range of `x` is extended - by .1% on each side to include the min or max values of `x`. If - `bins` is a sequence it defines the bin edges allowing for - non-uniform bin width. No extension of the range of `x` is done in - this case. - right : bool, optional - Indicates whether the bins include the rightmost edge or not. If - right == True (the default), then the bins [1,2,3,4] indicate - (1,2], (2,3], (3,4]. - labels : array or boolean, default None - Used as labels for the resulting bins. Must be of the same length as - the resulting bins. If False, return only integer indicators of the - bins. - retbins : bool, optional - Whether to return the bins or not. Can be useful if bins is given - as a scalar. - precision : int - The precision at which to store and display the bins labels - include_lowest : bool - Whether the first interval should be left-inclusive or not. - - Returns - ------- - out : Categorical or Series or array of integers if labels is False - The return type (Categorical or Series) depends on the input: a Series - of type category if input is a Series else Categorical. Bins are - represented as categories when categorical data is returned. - bins : ndarray of floats - Returned only if `retbins` is True. - - Notes - ----- - The `cut` function can be useful for going from a continuous variable to - a categorical variable. For example, `cut` could convert ages to groups - of age ranges. - - Any NA values will be NA in the result. Out of bounds values will be NA in - the resulting Categorical object - - - Examples - -------- - >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True) - ([(0.191, 3.367], (0.191, 3.367], (0.191, 3.367], (3.367, 6.533], - (6.533, 9.7], (0.191, 3.367]] - Categories (3, object): [(0.191, 3.367] < (3.367, 6.533] < (6.533, 9.7]], - array([ 0.1905 , 3.36666667, 6.53333333, 9.7 ])) - >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, - labels=["good","medium","bad"]) - [good, good, good, medium, bad, good] - Categories (3, object): [good < medium < bad] - >>> pd.cut(np.ones(5), 4, labels=False) - array([1, 1, 1, 1, 1], dtype=int64) - """ - # NOTE: this binning code is changed a bit from histogram for var(x) == 0 - if not np.iterable(bins): - if is_scalar(bins) and bins < 1: - raise ValueError("`bins` should be a positive integer.") - try: # for array-like - sz = x.size - except AttributeError: - x = np.asarray(x) - sz = x.size - if sz == 0: - raise ValueError('Cannot cut empty array') - # handle empty arrays. Can't determine range, so use 0-1. 
- # rng = (0, 1) - else: - rng = (nanops.nanmin(x), nanops.nanmax(x)) - mn, mx = [mi + 0.0 for mi in rng] - - if mn == mx: # adjust end points before binning - mn -= .001 * abs(mn) - mx += .001 * abs(mx) - bins = np.linspace(mn, mx, bins + 1, endpoint=True) - else: # adjust end points after binning - bins = np.linspace(mn, mx, bins + 1, endpoint=True) - adj = (mx - mn) * 0.001 # 0.1% of the range - if right: - bins[0] -= adj - else: - bins[-1] += adj - - else: - bins = np.asarray(bins) - if (np.diff(bins) < 0).any(): - raise ValueError('bins must increase monotonically.') - - return _bins_to_cuts(x, bins, right=right, labels=labels, - retbins=retbins, precision=precision, - include_lowest=include_lowest) - - -def qcut(x, q, labels=None, retbins=False, precision=3): - """ - Quantile-based discretization function. Discretize variable into - equal-sized buckets based on rank or based on sample quantiles. For example - 1000 values for 10 quantiles would produce a Categorical object indicating - quantile membership for each data point. - - Parameters - ---------- - x : ndarray or Series - q : integer or array of quantiles - Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately - array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles - labels : array or boolean, default None - Used as labels for the resulting bins. Must be of the same length as - the resulting bins. If False, return only integer indicators of the - bins. - retbins : bool, optional - Whether to return the bins or not. Can be useful if bins is given - as a scalar. - precision : int - The precision at which to store and display the bins labels - - Returns - ------- - out : Categorical or Series or array of integers if labels is False - The return type (Categorical or Series) depends on the input: a Series - of type category if input is a Series else Categorical. Bins are - represented as categories when categorical data is returned. - bins : ndarray of floats - Returned only if `retbins` is True. 
- - Notes - ----- - Out of bounds values will be NA in the resulting Categorical object - - Examples - -------- - >>> pd.qcut(range(5), 4) - [[0, 1], [0, 1], (1, 2], (2, 3], (3, 4]] - Categories (4, object): [[0, 1] < (1, 2] < (2, 3] < (3, 4]] - >>> pd.qcut(range(5), 3, labels=["good","medium","bad"]) - [good, good, medium, bad, bad] - Categories (3, object): [good < medium < bad] - >>> pd.qcut(range(5), 4, labels=False) - array([0, 0, 1, 2, 3], dtype=int64) - """ - if is_integer(q): - quantiles = np.linspace(0, 1, q + 1) - else: - quantiles = q - bins = algos.quantile(x, quantiles) - return _bins_to_cuts(x, bins, labels=labels, retbins=retbins, - precision=precision, include_lowest=True) - - -def _bins_to_cuts(x, bins, right=True, labels=None, retbins=False, - precision=3, name=None, include_lowest=False): - x_is_series = isinstance(x, Series) - series_index = None - - if x_is_series: - series_index = x.index - if name is None: - name = x.name - - x = np.asarray(x) - - side = 'left' if right else 'right' - ids = bins.searchsorted(x, side=side) - - if len(algos.unique(bins)) < len(bins): - raise ValueError('Bin edges must be unique: %s' % repr(bins)) - - if include_lowest: - ids[x == bins[0]] = 1 - - na_mask = isnull(x) | (ids == len(bins)) | (ids == 0) - has_nas = na_mask.any() - - if labels is not False: - if labels is None: - increases = 0 - while True: - try: - levels = _format_levels(bins, precision, right=right, - include_lowest=include_lowest) - except ValueError: - increases += 1 - precision += 1 - if increases >= 20: - raise - else: - break - - else: - if len(labels) != len(bins) - 1: - raise ValueError('Bin labels must be one fewer than ' - 'the number of bin edges') - levels = labels - - levels = np.asarray(levels, dtype=object) - np.putmask(ids, na_mask, 0) - fac = Categorical(ids - 1, levels, ordered=True, fastpath=True) - else: - fac = ids - 1 - if has_nas: - fac = fac.astype(np.float64) - np.putmask(fac, na_mask, np.nan) - - if x_is_series: - fac = Series(fac, index=series_index, name=name) - - if not retbins: - return fac - - return fac, bins - - -def _format_levels(bins, prec, right=True, - include_lowest=False): - fmt = lambda v: _format_label(v, precision=prec) - if right: - levels = [] - for a, b in zip(bins, bins[1:]): - fa, fb = fmt(a), fmt(b) - - if a != b and fa == fb: - raise ValueError('precision too low') - - formatted = '(%s, %s]' % (fa, fb) - - levels.append(formatted) - - if include_lowest: - levels[0] = '[' + levels[0][1:] - else: - levels = ['[%s, %s)' % (fmt(a), fmt(b)) - for a, b in zip(bins, bins[1:])] - - return levels - - -def _format_label(x, precision=3): - fmt_str = '%%.%dg' % precision - if np.isinf(x): - return str(x) - elif is_float(x): - frac, whole = np.modf(x) - sgn = '-' if x < 0 else '' - whole = abs(whole) - if frac != 0.0: - val = fmt_str % frac - - # rounded up or down - if '.' not in val: - if x < 0: - return '%d' % (-whole - 1) - else: - return '%d' % (whole + 1) - - if 'e' in val: - return _trim_zeros(fmt_str % x) - else: - val = _trim_zeros(val) - if '.' 
in val: - return sgn + '.'.join(('%d' % whole, val.split('.')[1])) - else: # pragma: no cover - return sgn + '.'.join(('%d' % whole, val)) - else: - return sgn + '%0.f' % whole - else: - return str(x) - - -def _trim_zeros(x): - while len(x) > 1 and x[-1] == '0': - x = x[:-1] - if len(x) > 1 and x[-1] == '.': - x = x[:-1] - return x diff --git a/pandas/tseries/api.py b/pandas/tseries/api.py index 9a07983b4d951..71386c02547ba 100644 --- a/pandas/tseries/api.py +++ b/pandas/tseries/api.py @@ -4,11 +4,5 @@ # flake8: noqa -from pandas.tseries.index import DatetimeIndex, date_range, bdate_range from pandas.tseries.frequencies import infer_freq -from pandas.tseries.tdi import Timedelta, TimedeltaIndex, timedelta_range -from pandas.tseries.period import Period, PeriodIndex, period_range, pnow -from pandas.tseries.resample import TimeGrouper -from pandas.tseries.timedeltas import to_timedelta -from pandas.lib import NaT import pandas.tseries.offsets as offsets diff --git a/pandas/tseries/converter.py b/pandas/tseries/converter.py index 8f8519a498a31..df603c4d880d8 100644 --- a/pandas/tseries/converter.py +++ b/pandas/tseries/converter.py @@ -1,1002 +1,11 @@ -from datetime import datetime, timedelta -import datetime as pydt -import numpy as np - -from dateutil.relativedelta import relativedelta - -import matplotlib.units as units -import matplotlib.dates as dates - -from matplotlib.ticker import Formatter, AutoLocator, Locator -from matplotlib.transforms import nonsingular - - -from pandas.types.common import (is_float, is_integer, - is_integer_dtype, - is_float_dtype, - is_datetime64_ns_dtype, - is_period_arraylike, - ) - -from pandas.compat import lrange -import pandas.compat as compat -import pandas.lib as lib -import pandas.core.common as com -from pandas.core.index import Index - -from pandas.core.series import Series -from pandas.tseries.index import date_range -import pandas.tseries.tools as tools -import pandas.tseries.frequencies as frequencies -from pandas.tseries.frequencies import FreqGroup -from pandas.tseries.period import Period, PeriodIndex - -# constants -HOURS_PER_DAY = 24. -MIN_PER_HOUR = 60. -SEC_PER_MIN = 60. 
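The seconds-per-day constants here feed the time-of-day converter defined just below, where a `datetime.time` is flattened to seconds since midnight. A minimal standalone sketch of that mapping, mirroring `_to_ordinalf` below (a hypothetical free function, not a public pandas API):

```python
import datetime as pydt

def to_ordinalf(tm):
    # seconds since midnight, with microseconds as the fractional part
    return (tm.hour * 3600 + tm.minute * 60 + tm.second +
            tm.microsecond / 1e6)

assert to_ordinalf(pydt.time(1, 0, 0)) == 3600.0
assert to_ordinalf(pydt.time(0, 0, 0, 500000)) == 0.5
```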
-
-SEC_PER_HOUR = SEC_PER_MIN * MIN_PER_HOUR
-SEC_PER_DAY = SEC_PER_HOUR * HOURS_PER_DAY
-
-MUSEC_PER_DAY = 1e6 * SEC_PER_DAY
-
-
-def _mpl_le_2_0_0():
-    try:
-        import matplotlib
-        return matplotlib.compare_versions('2.0.0', matplotlib.__version__)
-    except ImportError:
-        return False
-
-
-def register():
-    units.registry[lib.Timestamp] = DatetimeConverter()
-    units.registry[Period] = PeriodConverter()
-    units.registry[pydt.datetime] = DatetimeConverter()
-    units.registry[pydt.date] = DatetimeConverter()
-    units.registry[pydt.time] = TimeConverter()
-    units.registry[np.datetime64] = DatetimeConverter()
-
-
-def _to_ordinalf(tm):
-    tot_sec = (tm.hour * 3600 + tm.minute * 60 + tm.second +
-               float(tm.microsecond / 1e6))
-    return tot_sec
-
-
-def time2num(d):
-    if isinstance(d, compat.string_types):
-        parsed = tools.to_datetime(d)
-        if not isinstance(parsed, datetime):
-            raise ValueError('Could not parse time %s' % d)
-        return _to_ordinalf(parsed.time())
-    if isinstance(d, pydt.time):
-        return _to_ordinalf(d)
-    return d
-
-
-class TimeConverter(units.ConversionInterface):
-
-    @staticmethod
-    def convert(value, unit, axis):
-        valid_types = (str, pydt.time)
-        if (isinstance(value, valid_types) or is_integer(value) or
-                is_float(value)):
-            return time2num(value)
-        if isinstance(value, Index):
-            return value.map(time2num)
-        if isinstance(value, (list, tuple, np.ndarray, Index)):
-            return [time2num(x) for x in value]
-        return value
-
-    @staticmethod
-    def axisinfo(unit, axis):
-        if unit != 'time':
-            return None
-
-        majloc = AutoLocator()
-        majfmt = TimeFormatter(majloc)
-        return units.AxisInfo(majloc=majloc, majfmt=majfmt, label='time')
-
-    @staticmethod
-    def default_units(x, axis):
-        return 'time'
-
-
-# time formatter
-class TimeFormatter(Formatter):
-
-    def __init__(self, locs):
-        self.locs = locs
-
-    def __call__(self, x, pos=0):
-        fmt = '%H:%M:%S.%f'
-        s = int(x)
-        # total microseconds in the fractional part of x
-        msus = int(round((x - s) * 1e6))
-        ms = msus // 1000
-        us = msus % 1000
-        m, s = divmod(s, 60)
-        h, m = divmod(m, 60)
-        _, h = divmod(h, 24)
-        if us != 0:
-            return pydt.time(h, m, s, msus).strftime(fmt)
-        elif ms != 0:
-            # trim the '%f' microseconds down to millisecond precision
-            return pydt.time(h, m, s, msus).strftime(fmt)[:-3]
-        return pydt.time(h, m, s).strftime('%H:%M:%S')
-
-
-# Period Conversion
-
-
-class PeriodConverter(dates.DateConverter):
-
-    @staticmethod
-    def convert(values, units, axis):
-        if not hasattr(axis, 'freq'):
-            raise TypeError('Axis must have `freq` set to convert to Periods')
-        valid_types = (compat.string_types, datetime,
-                       Period, pydt.date, pydt.time)
-        if (isinstance(values, valid_types) or is_integer(values) or
-                is_float(values)):
-            return get_datevalue(values, axis.freq)
-        if isinstance(values, PeriodIndex):
-            return values.asfreq(axis.freq)._values
-        if isinstance(values, Index):
-            return values.map(lambda x: get_datevalue(x, axis.freq))
-        if is_period_arraylike(values):
-            return PeriodIndex(values, freq=axis.freq)._values
-        if isinstance(values, (list, tuple, np.ndarray, Index)):
-            return [get_datevalue(x, axis.freq) for x in values]
-        return values
-
-
-def get_datevalue(date, freq):
-    if isinstance(date, Period):
-        return date.asfreq(freq).ordinal
-    elif isinstance(date, (compat.string_types, datetime,
-                           pydt.date, pydt.time)):
-        return Period(date, freq).ordinal
-    elif (is_integer(date) or is_float(date) or
-          (isinstance(date, (np.ndarray, Index)) and (date.size == 1))):
-        return date
-    elif date is None:
-        return None
-    raise ValueError("Unrecognizable date '%s'" % date)
-
-
-def _dt_to_float_ordinal(dt):
-    """
-    Convert :mod:`datetime` to the Gregorian date as UTC float days,
-    preserving hours, minutes, seconds and microseconds.
Return value - is a :func:`float`. - """ - if (isinstance(dt, (np.ndarray, Index, Series) - ) and is_datetime64_ns_dtype(dt)): - base = dates.epoch2num(dt.asi8 / 1.0E9) - else: - base = dates.date2num(dt) - return base - - -# Datetime Conversion -class DatetimeConverter(dates.DateConverter): - - @staticmethod - def convert(values, unit, axis): - def try_parse(values): - try: - return _dt_to_float_ordinal(tools.to_datetime(values)) - except Exception: - return values - - if isinstance(values, (datetime, pydt.date)): - return _dt_to_float_ordinal(values) - elif isinstance(values, np.datetime64): - return _dt_to_float_ordinal(lib.Timestamp(values)) - elif isinstance(values, pydt.time): - return dates.date2num(values) - elif (is_integer(values) or is_float(values)): - return values - elif isinstance(values, compat.string_types): - return try_parse(values) - elif isinstance(values, (list, tuple, np.ndarray, Index)): - if isinstance(values, Index): - values = values.values - if not isinstance(values, np.ndarray): - values = com._asarray_tuplesafe(values) - - if is_integer_dtype(values) or is_float_dtype(values): - return values - - try: - values = tools.to_datetime(values) - if isinstance(values, Index): - values = values.map(_dt_to_float_ordinal) - else: - values = [_dt_to_float_ordinal(x) for x in values] - except Exception: - values = _dt_to_float_ordinal(values) - - return values - - @staticmethod - def axisinfo(unit, axis): - """ - Return the :class:`~matplotlib.units.AxisInfo` for *unit*. - - *unit* is a tzinfo instance or None. - The *axis* argument is required but not used. - """ - tz = unit - - majloc = PandasAutoDateLocator(tz=tz) - majfmt = PandasAutoDateFormatter(majloc, tz=tz) - datemin = pydt.date(2000, 1, 1) - datemax = pydt.date(2010, 1, 1) - - return units.AxisInfo(majloc=majloc, majfmt=majfmt, label='', - default_limits=(datemin, datemax)) - - -class PandasAutoDateFormatter(dates.AutoDateFormatter): - - def __init__(self, locator, tz=None, defaultfmt='%Y-%m-%d'): - dates.AutoDateFormatter.__init__(self, locator, tz, defaultfmt) - # matplotlib.dates._UTC has no _utcoffset called by pandas - if self._tz is dates.UTC: - self._tz._utcoffset = self._tz.utcoffset(None) - - # For mpl > 2.0 the format strings are controlled via rcparams - # so do not mess with them. For mpl < 2.0 change the second - # break point and add a musec break point - if _mpl_le_2_0_0(): - self.scaled[1. / SEC_PER_DAY] = '%H:%M:%S' - self.scaled[1. / MUSEC_PER_DAY] = '%H:%M:%S.%f' - - -class PandasAutoDateLocator(dates.AutoDateLocator): - - def get_locator(self, dmin, dmax): - 'Pick the best locator based on a distance.' - delta = relativedelta(dmax, dmin) - - num_days = ((delta.years * 12.0) + delta.months * 31.0) + delta.days - num_sec = (delta.hours * 60.0 + delta.minutes) * 60.0 + delta.seconds - tot_sec = num_days * 86400. + num_sec - - if abs(tot_sec) < self.minticks: - self._freq = -1 - locator = MilliSecondLocator(self.tz) - locator.set_axis(self.axis) - - locator.set_view_interval(*self.axis.get_view_interval()) - locator.set_data_interval(*self.axis.get_data_interval()) - return locator - - return dates.AutoDateLocator.get_locator(self, dmin, dmax) - - def _get_unit(self): - return MilliSecondLocator.get_unit_generic(self._freq) - - -class MilliSecondLocator(dates.DateLocator): - - UNIT = 1. / (24 * 3600 * 1000) - - def __init__(self, tz): - dates.DateLocator.__init__(self, tz) - self._interval = 1. 
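`_dt_to_float_ordinal` above leans on matplotlib's day-based date numbers. A short sketch of the two conversion paths it takes; `epoch2num` exists in the matplotlib versions this code targets, though it was removed in matplotlib 3.6:

```python
import numpy as np
import matplotlib.dates as dates
from datetime import datetime

# scalar path: date2num maps a datetime to fractional days
d0 = dates.date2num(datetime(2000, 1, 1))
d1 = dates.date2num(datetime(2000, 1, 2))
assert d1 - d0 == 1.0

# array path: epoch nanoseconds -> epoch seconds -> day ordinals,
# matching the datetime64[ns] branch of _dt_to_float_ordinal
values = np.array(['2000-01-01', '2000-01-02'], dtype='datetime64[ns]')
ordinals = dates.epoch2num(values.view('i8') / 1e9)
assert ordinals[1] - ordinals[0] == 1.0
```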
- - def _get_unit(self): - return self.get_unit_generic(-1) - - @staticmethod - def get_unit_generic(freq): - unit = dates.RRuleLocator.get_unit_generic(freq) - if unit < 0: - return MilliSecondLocator.UNIT - return unit - - def __call__(self): - # if no data have been set, this will tank with a ValueError - try: - dmin, dmax = self.viewlim_to_dt() - except ValueError: - return [] - - if dmin > dmax: - dmax, dmin = dmin, dmax - # We need to cap at the endpoints of valid datetime - - # TODO(wesm) unused? - # delta = relativedelta(dmax, dmin) - # try: - # start = dmin - delta - # except ValueError: - # start = _from_ordinal(1.0) - - # try: - # stop = dmax + delta - # except ValueError: - # # The magic number! - # stop = _from_ordinal(3652059.9999999) - - nmax, nmin = dates.date2num((dmax, dmin)) - - num = (nmax - nmin) * 86400 * 1000 - max_millis_ticks = 6 - for interval in [1, 10, 50, 100, 200, 500]: - if num <= interval * (max_millis_ticks - 1): - self._interval = interval - break - else: - # We went through the whole loop without breaking, default to 1 - self._interval = 1000. - - estimate = (nmax - nmin) / (self._get_unit() * self._get_interval()) - - if estimate > self.MAXTICKS * 2: - raise RuntimeError(('MillisecondLocator estimated to generate %d ' - 'ticks from %s to %s: exceeds Locator.MAXTICKS' - '* 2 (%d) ') % - (estimate, dmin, dmax, self.MAXTICKS * 2)) - - freq = '%dL' % self._get_interval() - tz = self.tz.tzname(None) - st = _from_ordinal(dates.date2num(dmin)) # strip tz - ed = _from_ordinal(dates.date2num(dmax)) - all_dates = date_range(start=st, end=ed, freq=freq, tz=tz).asobject - - try: - if len(all_dates) > 0: - locs = self.raise_if_exceeds(dates.date2num(all_dates)) - return locs - except Exception: # pragma: no cover - pass - - lims = dates.date2num([dmin, dmax]) - return lims - - def _get_interval(self): - return self._interval - - def autoscale(self): - """ - Set the view limits to include the data range. - """ - dmin, dmax = self.datalim_to_dt() - if dmin > dmax: - dmax, dmin = dmin, dmax - - # We need to cap at the endpoints of valid datetime - - # TODO(wesm): unused? - - # delta = relativedelta(dmax, dmin) - # try: - # start = dmin - delta - # except ValueError: - # start = _from_ordinal(1.0) - - # try: - # stop = dmax + delta - # except ValueError: - # # The magic number! - # stop = _from_ordinal(3652059.9999999) - - dmin, dmax = self.datalim_to_dt() - - vmin = dates.date2num(dmin) - vmax = dates.date2num(dmax) - - return self.nonsingular(vmin, vmax) - - -def _from_ordinal(x, tz=None): - ix = int(x) - dt = datetime.fromordinal(ix) - remainder = float(x) - ix - hour, remainder = divmod(24 * remainder, 1) - minute, remainder = divmod(60 * remainder, 1) - second, remainder = divmod(60 * remainder, 1) - microsecond = int(1e6 * remainder) - if microsecond < 10: - microsecond = 0 # compensate for rounding errors - dt = datetime(dt.year, dt.month, dt.day, int(hour), int(minute), - int(second), microsecond) - if tz is not None: - dt = dt.astimezone(tz) - - if microsecond > 999990: # compensate for rounding errors - dt += timedelta(microseconds=1e6 - microsecond) - - return dt - -# Fixed frequency dynamic tick locators and formatters - -# ------------------------------------------------------------------------- -# --- Locators --- -# ------------------------------------------------------------------------- - - -def _get_default_annual_spacing(nyears): - """ - Returns a default spacing between consecutive ticks for annual data. 
- """ - if nyears < 11: - (min_spacing, maj_spacing) = (1, 1) - elif nyears < 20: - (min_spacing, maj_spacing) = (1, 2) - elif nyears < 50: - (min_spacing, maj_spacing) = (1, 5) - elif nyears < 100: - (min_spacing, maj_spacing) = (5, 10) - elif nyears < 200: - (min_spacing, maj_spacing) = (5, 25) - elif nyears < 600: - (min_spacing, maj_spacing) = (10, 50) - else: - factor = nyears // 1000 + 1 - (min_spacing, maj_spacing) = (factor * 20, factor * 100) - return (min_spacing, maj_spacing) - - -def period_break(dates, period): - """ - Returns the indices where the given period changes. - - Parameters - ---------- - dates : PeriodIndex - Array of intervals to monitor. - period : string - Name of the period to monitor. - """ - current = getattr(dates, period) - previous = getattr(dates - 1, period) - return (current - previous).nonzero()[0] - - -def has_level_label(label_flags, vmin): - """ - Returns true if the ``label_flags`` indicate there is at least one label - for this level. - - if the minimum view limit is not an exact integer, then the first tick - label won't be shown, so we must adjust for that. - """ - if label_flags.size == 0 or (label_flags.size == 1 and - label_flags[0] == 0 and - vmin % 1 > 0.0): - return False - else: - return True - - -def _daily_finder(vmin, vmax, freq): - periodsperday = -1 - - if freq >= FreqGroup.FR_HR: - if freq == FreqGroup.FR_NS: - periodsperday = 24 * 60 * 60 * 1000000000 - elif freq == FreqGroup.FR_US: - periodsperday = 24 * 60 * 60 * 1000000 - elif freq == FreqGroup.FR_MS: - periodsperday = 24 * 60 * 60 * 1000 - elif freq == FreqGroup.FR_SEC: - periodsperday = 24 * 60 * 60 - elif freq == FreqGroup.FR_MIN: - periodsperday = 24 * 60 - elif freq == FreqGroup.FR_HR: - periodsperday = 24 - else: # pragma: no cover - raise ValueError("unexpected frequency: %s" % freq) - periodsperyear = 365 * periodsperday - periodspermonth = 28 * periodsperday - - elif freq == FreqGroup.FR_BUS: - periodsperyear = 261 - periodspermonth = 19 - elif freq == FreqGroup.FR_DAY: - periodsperyear = 365 - periodspermonth = 28 - elif frequencies.get_freq_group(freq) == FreqGroup.FR_WK: - periodsperyear = 52 - periodspermonth = 3 - else: # pragma: no cover - raise ValueError("unexpected frequency") - - # save this for later usage - vmin_orig = vmin - - (vmin, vmax) = (Period(ordinal=int(vmin), freq=freq), - Period(ordinal=int(vmax), freq=freq)) - span = vmax.ordinal - vmin.ordinal + 1 - dates_ = PeriodIndex(start=vmin, end=vmax, freq=freq) - # Initialize the output - info = np.zeros(span, - dtype=[('val', np.int64), ('maj', bool), - ('min', bool), ('fmt', '|S20')]) - info['val'][:] = dates_._values - info['fmt'][:] = '' - info['maj'][[0, -1]] = True - # .. and set some shortcuts - info_maj = info['maj'] - info_min = info['min'] - info_fmt = info['fmt'] - - def first_label(label_flags): - if (label_flags[0] == 0) and (label_flags.size > 1) and \ - ((vmin_orig % 1) > 0.0): - return label_flags[1] - else: - return label_flags[0] - - # Case 1. 
Less than a month - if span <= periodspermonth: - day_start = period_break(dates_, 'day') - month_start = period_break(dates_, 'month') - - def _hour_finder(label_interval, force_year_start): - _hour = dates_.hour - _prev_hour = (dates_ - 1).hour - hour_start = (_hour - _prev_hour) != 0 - info_maj[day_start] = True - info_min[hour_start & (_hour % label_interval == 0)] = True - year_start = period_break(dates_, 'year') - info_fmt[hour_start & (_hour % label_interval == 0)] = '%H:%M' - info_fmt[day_start] = '%H:%M\n%d-%b' - info_fmt[year_start] = '%H:%M\n%d-%b\n%Y' - if force_year_start and not has_level_label(year_start, vmin_orig): - info_fmt[first_label(day_start)] = '%H:%M\n%d-%b\n%Y' - - def _minute_finder(label_interval): - hour_start = period_break(dates_, 'hour') - _minute = dates_.minute - _prev_minute = (dates_ - 1).minute - minute_start = (_minute - _prev_minute) != 0 - info_maj[hour_start] = True - info_min[minute_start & (_minute % label_interval == 0)] = True - year_start = period_break(dates_, 'year') - info_fmt = info['fmt'] - info_fmt[minute_start & (_minute % label_interval == 0)] = '%H:%M' - info_fmt[day_start] = '%H:%M\n%d-%b' - info_fmt[year_start] = '%H:%M\n%d-%b\n%Y' - - def _second_finder(label_interval): - minute_start = period_break(dates_, 'minute') - _second = dates_.second - _prev_second = (dates_ - 1).second - second_start = (_second - _prev_second) != 0 - info['maj'][minute_start] = True - info['min'][second_start & (_second % label_interval == 0)] = True - year_start = period_break(dates_, 'year') - info_fmt = info['fmt'] - info_fmt[second_start & (_second % - label_interval == 0)] = '%H:%M:%S' - info_fmt[day_start] = '%H:%M:%S\n%d-%b' - info_fmt[year_start] = '%H:%M:%S\n%d-%b\n%Y' - - if span < periodsperday / 12000.0: - _second_finder(1) - elif span < periodsperday / 6000.0: - _second_finder(2) - elif span < periodsperday / 2400.0: - _second_finder(5) - elif span < periodsperday / 1200.0: - _second_finder(10) - elif span < periodsperday / 800.0: - _second_finder(15) - elif span < periodsperday / 400.0: - _second_finder(30) - elif span < periodsperday / 150.0: - _minute_finder(1) - elif span < periodsperday / 70.0: - _minute_finder(2) - elif span < periodsperday / 24.0: - _minute_finder(5) - elif span < periodsperday / 12.0: - _minute_finder(15) - elif span < periodsperday / 6.0: - _minute_finder(30) - elif span < periodsperday / 2.5: - _hour_finder(1, False) - elif span < periodsperday / 1.5: - _hour_finder(2, False) - elif span < periodsperday * 1.25: - _hour_finder(3, False) - elif span < periodsperday * 2.5: - _hour_finder(6, True) - elif span < periodsperday * 4: - _hour_finder(12, True) - else: - info_maj[month_start] = True - info_min[day_start] = True - year_start = period_break(dates_, 'year') - info_fmt = info['fmt'] - info_fmt[day_start] = '%d' - info_fmt[month_start] = '%d\n%b' - info_fmt[year_start] = '%d\n%b\n%Y' - if not has_level_label(year_start, vmin_orig): - if not has_level_label(month_start, vmin_orig): - info_fmt[first_label(day_start)] = '%d\n%b\n%Y' - else: - info_fmt[first_label(month_start)] = '%d\n%b\n%Y' - - # Case 2. 
Less than three months - elif span <= periodsperyear // 4: - month_start = period_break(dates_, 'month') - info_maj[month_start] = True - if freq < FreqGroup.FR_HR: - info['min'] = True - else: - day_start = period_break(dates_, 'day') - info['min'][day_start] = True - week_start = period_break(dates_, 'week') - year_start = period_break(dates_, 'year') - info_fmt[week_start] = '%d' - info_fmt[month_start] = '\n\n%b' - info_fmt[year_start] = '\n\n%b\n%Y' - if not has_level_label(year_start, vmin_orig): - if not has_level_label(month_start, vmin_orig): - info_fmt[first_label(week_start)] = '\n\n%b\n%Y' - else: - info_fmt[first_label(month_start)] = '\n\n%b\n%Y' - # Case 3. Less than 14 months ............... - elif span <= 1.15 * periodsperyear: - year_start = period_break(dates_, 'year') - month_start = period_break(dates_, 'month') - week_start = period_break(dates_, 'week') - info_maj[month_start] = True - info_min[week_start] = True - info_min[year_start] = False - info_min[month_start] = False - info_fmt[month_start] = '%b' - info_fmt[year_start] = '%b\n%Y' - if not has_level_label(year_start, vmin_orig): - info_fmt[first_label(month_start)] = '%b\n%Y' - # Case 4. Less than 2.5 years ............... - elif span <= 2.5 * periodsperyear: - year_start = period_break(dates_, 'year') - quarter_start = period_break(dates_, 'quarter') - month_start = period_break(dates_, 'month') - info_maj[quarter_start] = True - info_min[month_start] = True - info_fmt[quarter_start] = '%b' - info_fmt[year_start] = '%b\n%Y' - # Case 4. Less than 4 years ................. - elif span <= 4 * periodsperyear: - year_start = period_break(dates_, 'year') - month_start = period_break(dates_, 'month') - info_maj[year_start] = True - info_min[month_start] = True - info_min[year_start] = False - - month_break = dates_[month_start].month - jan_or_jul = month_start[(month_break == 1) | (month_break == 7)] - info_fmt[jan_or_jul] = '%b' - info_fmt[year_start] = '%b\n%Y' - # Case 5. Less than 11 years ................ - elif span <= 11 * periodsperyear: - year_start = period_break(dates_, 'year') - quarter_start = period_break(dates_, 'quarter') - info_maj[year_start] = True - info_min[quarter_start] = True - info_min[year_start] = False - info_fmt[year_start] = '%Y' - # Case 6. More than 12 years ................ 
- else: - year_start = period_break(dates_, 'year') - year_break = dates_[year_start].year - nyears = span / periodsperyear - (min_anndef, maj_anndef) = _get_default_annual_spacing(nyears) - major_idx = year_start[(year_break % maj_anndef == 0)] - info_maj[major_idx] = True - minor_idx = year_start[(year_break % min_anndef == 0)] - info_min[minor_idx] = True - info_fmt[major_idx] = '%Y' - - return info - - -def _monthly_finder(vmin, vmax, freq): - periodsperyear = 12 - - vmin_orig = vmin - (vmin, vmax) = (int(vmin), int(vmax)) - span = vmax - vmin + 1 - - # Initialize the output - info = np.zeros(span, - dtype=[('val', int), ('maj', bool), ('min', bool), - ('fmt', '|S8')]) - info['val'] = np.arange(vmin, vmax + 1) - dates_ = info['val'] - info['fmt'] = '' - year_start = (dates_ % 12 == 0).nonzero()[0] - info_maj = info['maj'] - info_fmt = info['fmt'] - - if span <= 1.15 * periodsperyear: - info_maj[year_start] = True - info['min'] = True - - info_fmt[:] = '%b' - info_fmt[year_start] = '%b\n%Y' - - if not has_level_label(year_start, vmin_orig): - if dates_.size > 1: - idx = 1 - else: - idx = 0 - info_fmt[idx] = '%b\n%Y' - - elif span <= 2.5 * periodsperyear: - quarter_start = (dates_ % 3 == 0).nonzero() - info_maj[year_start] = True - # TODO: Check the following : is it really info['fmt'] ? - info['fmt'][quarter_start] = True - info['min'] = True - - info_fmt[quarter_start] = '%b' - info_fmt[year_start] = '%b\n%Y' - - elif span <= 4 * periodsperyear: - info_maj[year_start] = True - info['min'] = True - - jan_or_jul = (dates_ % 12 == 0) | (dates_ % 12 == 6) - info_fmt[jan_or_jul] = '%b' - info_fmt[year_start] = '%b\n%Y' - - elif span <= 11 * periodsperyear: - quarter_start = (dates_ % 3 == 0).nonzero() - info_maj[year_start] = True - info['min'][quarter_start] = True - - info_fmt[year_start] = '%Y' - - else: - nyears = span / periodsperyear - (min_anndef, maj_anndef) = _get_default_annual_spacing(nyears) - years = dates_[year_start] // 12 + 1 - major_idx = year_start[(years % maj_anndef == 0)] - info_maj[major_idx] = True - info['min'][year_start[(years % min_anndef == 0)]] = True - - info_fmt[major_idx] = '%Y' - - return info - - -def _quarterly_finder(vmin, vmax, freq): - periodsperyear = 4 - vmin_orig = vmin - (vmin, vmax) = (int(vmin), int(vmax)) - span = vmax - vmin + 1 - - info = np.zeros(span, - dtype=[('val', int), ('maj', bool), ('min', bool), - ('fmt', '|S8')]) - info['val'] = np.arange(vmin, vmax + 1) - info['fmt'] = '' - dates_ = info['val'] - info_maj = info['maj'] - info_fmt = info['fmt'] - year_start = (dates_ % 4 == 0).nonzero()[0] - - if span <= 3.5 * periodsperyear: - info_maj[year_start] = True - info['min'] = True - - info_fmt[:] = 'Q%q' - info_fmt[year_start] = 'Q%q\n%F' - if not has_level_label(year_start, vmin_orig): - if dates_.size > 1: - idx = 1 - else: - idx = 0 - info_fmt[idx] = 'Q%q\n%F' - - elif span <= 11 * periodsperyear: - info_maj[year_start] = True - info['min'] = True - info_fmt[year_start] = '%F' - - else: - years = dates_[year_start] // 4 + 1 - nyears = span / periodsperyear - (min_anndef, maj_anndef) = _get_default_annual_spacing(nyears) - major_idx = year_start[(years % maj_anndef == 0)] - info_maj[major_idx] = True - info['min'][year_start[(years % min_anndef == 0)]] = True - info_fmt[major_idx] = '%F' - - return info - - -def _annual_finder(vmin, vmax, freq): - (vmin, vmax) = (int(vmin), int(vmax + 1)) - span = vmax - vmin + 1 - - info = np.zeros(span, - dtype=[('val', int), ('maj', bool), ('min', bool), - ('fmt', '|S8')]) - info['val'] = 
-
-
-def _annual_finder(vmin, vmax, freq):
-    (vmin, vmax) = (int(vmin), int(vmax + 1))
-    span = vmax - vmin + 1
-
-    info = np.zeros(span,
-                    dtype=[('val', int), ('maj', bool), ('min', bool),
-                           ('fmt', '|S8')])
-    info['val'] = np.arange(vmin, vmax + 1)
-    info['fmt'] = ''
-    dates_ = info['val']
-
-    (min_anndef, maj_anndef) = _get_default_annual_spacing(span)
-    major_idx = dates_ % maj_anndef == 0
-    info['maj'][major_idx] = True
-    info['min'][(dates_ % min_anndef == 0)] = True
-    info['fmt'][major_idx] = '%Y'
-
-    return info
-
-
-def get_finder(freq):
-    if isinstance(freq, compat.string_types):
-        freq = frequencies.get_freq(freq)
-    fgroup = frequencies.get_freq_group(freq)
-
-    if fgroup == FreqGroup.FR_ANN:
-        return _annual_finder
-    elif fgroup == FreqGroup.FR_QTR:
-        return _quarterly_finder
-    elif freq == FreqGroup.FR_MTH:
-        return _monthly_finder
-    elif ((freq >= FreqGroup.FR_BUS) or fgroup == FreqGroup.FR_WK):
-        return _daily_finder
-    else:  # pragma: no cover
-        errmsg = "Unsupported frequency: %s" % (freq)
-        raise NotImplementedError(errmsg)
-
-
-class TimeSeries_DateLocator(Locator):
-    """
-    Locates the ticks along an axis controlled by a :class:`Series`.
-
-    Parameters
-    ----------
-    freq : {var}
-        Valid frequency specifier.
-    minor_locator : {False, True}, optional
-        Whether the locator is for minor ticks (True) or not.
-    dynamic_mode : {True, False}, optional
-        Whether the locator should work in dynamic mode.
-    base : {int}, optional
-    quarter : {int}, optional
-    month : {int}, optional
-    day : {int}, optional
-    """
-
-    def __init__(self, freq, minor_locator=False, dynamic_mode=True,
-                 base=1, quarter=1, month=1, day=1, plot_obj=None):
-        if isinstance(freq, compat.string_types):
-            freq = frequencies.get_freq(freq)
-        self.freq = freq
-        self.base = base
-        (self.quarter, self.month, self.day) = (quarter, month, day)
-        self.isminor = minor_locator
-        self.isdynamic = dynamic_mode
-        self.offset = 0
-        self.plot_obj = plot_obj
-        self.finder = get_finder(freq)
-
-    def _get_default_locs(self, vmin, vmax):
-        "Returns the default locations of ticks."
-
-        if self.plot_obj.date_axis_info is None:
-            self.plot_obj.date_axis_info = self.finder(vmin, vmax, self.freq)
-
-        locator = self.plot_obj.date_axis_info
-
-        if self.isminor:
-            return np.compress(locator['min'], locator['val'])
-        return np.compress(locator['maj'], locator['val'])
-
-    def __call__(self):
-        'Return the locations of the ticks.'
-        # axis calls Locator.set_axis inside set_m_formatter
-        vi = tuple(self.axis.get_view_interval())
-        if vi != self.plot_obj.view_interval:
-            self.plot_obj.date_axis_info = None
-        self.plot_obj.view_interval = vi
-        vmin, vmax = vi
-        if vmax < vmin:
-            vmin, vmax = vmax, vmin
-        if self.isdynamic:
-            locs = self._get_default_locs(vmin, vmax)
-        else:  # pragma: no cover
-            base = self.base
-            (d, m) = divmod(vmin, base)
-            vmin = (d + 1) * base
-            locs = lrange(vmin, vmax + 1, base)
-        return locs
-
-    def autoscale(self):
-        """
-        Sets the view limits to the nearest multiples of base that contain the
-        data.
-        """
-        # requires matplotlib >= 0.98.0
-        (vmin, vmax) = self.axis.get_data_interval()
-
-        locs = self._get_default_locs(vmin, vmax)
-        (vmin, vmax) = locs[[0, -1]]
-        if vmin == vmax:
-            vmin -= 1
-            vmax += 1
-        return nonsingular(vmin, vmax)
-
-# -------------------------------------------------------------------------
-# --- Formatter ---
-# -------------------------------------------------------------------------
-
-
-class TimeSeries_DateFormatter(Formatter):
-    """
-    Formats the ticks along an axis controlled by a :class:`PeriodIndex`.
-
-    Parameters
-    ----------
-    freq : {int, string}
-        Valid frequency specifier.
-    minor_locator : {False, True}
-        Whether the current formatter should apply to minor ticks (True) or
-        major ticks (False).
- dynamic_mode : {True, False} - Whether the formatter works in dynamic mode or not. - """ - - def __init__(self, freq, minor_locator=False, dynamic_mode=True, - plot_obj=None): - if isinstance(freq, compat.string_types): - freq = frequencies.get_freq(freq) - self.format = None - self.freq = freq - self.locs = [] - self.formatdict = None - self.isminor = minor_locator - self.isdynamic = dynamic_mode - self.offset = 0 - self.plot_obj = plot_obj - self.finder = get_finder(freq) - - def _set_default_format(self, vmin, vmax): - "Returns the default ticks spacing." - - if self.plot_obj.date_axis_info is None: - self.plot_obj.date_axis_info = self.finder(vmin, vmax, self.freq) - info = self.plot_obj.date_axis_info - - if self.isminor: - format = np.compress(info['min'] & np.logical_not(info['maj']), - info) - else: - format = np.compress(info['maj'], info) - self.formatdict = dict([(x, f) for (x, _, _, f) in format]) - return self.formatdict - - def set_locs(self, locs): - 'Sets the locations of the ticks' - # don't actually use the locs. This is just needed to work with - # matplotlib. Force to use vmin, vmax - self.locs = locs - - (vmin, vmax) = vi = tuple(self.axis.get_view_interval()) - if vi != self.plot_obj.view_interval: - self.plot_obj.date_axis_info = None - self.plot_obj.view_interval = vi - if vmax < vmin: - (vmin, vmax) = (vmax, vmin) - self._set_default_format(vmin, vmax) - - def __call__(self, x, pos=0): - if self.formatdict is None: - return '' - else: - fmt = self.formatdict.pop(x, '') - return Period(ordinal=int(x), freq=self.freq).strftime(fmt) +# flake8: noqa + +from pandas.plotting._converter import (register, time2num, + TimeConverter, TimeFormatter, + PeriodConverter, get_datevalue, + DatetimeConverter, + PandasAutoDateFormatter, + PandasAutoDateLocator, + MilliSecondLocator, get_finder, + TimeSeries_DateLocator, + TimeSeries_DateFormatter) diff --git a/pandas/tseries/frequencies.py b/pandas/tseries/frequencies.py index ac094c1f545f3..dddf835424f67 100644 --- a/pandas/tseries/frequencies.py +++ b/pandas/tseries/frequencies.py @@ -6,20 +6,21 @@ import numpy as np -from pandas.types.generic import ABCSeries -from pandas.types.common import (is_integer, - is_period_arraylike, - is_timedelta64_dtype, - is_datetime64_dtype) +from pandas.core.dtypes.generic import ABCSeries +from pandas.core.dtypes.common import ( + is_integer, + is_period_arraylike, + is_timedelta64_dtype, + is_datetime64_dtype) import pandas.core.algorithms as algos from pandas.core.algorithms import unique from pandas.tseries.offsets import DateOffset -from pandas.util.decorators import cache_readonly, deprecate_kwarg +from pandas.util._decorators import cache_readonly, deprecate_kwarg import pandas.tseries.offsets as offsets -import pandas.lib as lib -import pandas.tslib as tslib -from pandas.tslib import Timedelta + +from pandas._libs import lib, tslib +from pandas._libs.tslib import Timedelta from pytz import AmbiguousTimeError @@ -38,32 +39,55 @@ class FreqGroup(object): FR_NS = 12000 -US_RESO = 0 -MS_RESO = 1 -S_RESO = 2 -T_RESO = 3 -H_RESO = 4 -D_RESO = 5 +RESO_NS = 0 +RESO_US = 1 +RESO_MS = 2 +RESO_SEC = 3 +RESO_MIN = 4 +RESO_HR = 5 +RESO_DAY = 6 class Resolution(object): - # defined in period.pyx - # note that these are different from freq codes - RESO_US = US_RESO - RESO_MS = MS_RESO - RESO_SEC = S_RESO - RESO_MIN = T_RESO - RESO_HR = H_RESO - RESO_DAY = D_RESO + RESO_US = RESO_US + RESO_MS = RESO_MS + RESO_SEC = RESO_SEC + RESO_MIN = RESO_MIN + RESO_HR = RESO_HR + RESO_DAY = RESO_DAY _reso_str_map = 
{ + RESO_NS: 'nanosecond', RESO_US: 'microsecond', RESO_MS: 'millisecond', RESO_SEC: 'second', RESO_MIN: 'minute', RESO_HR: 'hour', - RESO_DAY: 'day'} + RESO_DAY: 'day' + } + + # factor to multiply a value by to convert it to the next finer grained + # resolution + _reso_mult_map = { + RESO_NS: None, + RESO_US: 1000, + RESO_MS: 1000, + RESO_SEC: 1000, + RESO_MIN: 60, + RESO_HR: 60, + RESO_DAY: 24 + } + + _reso_str_bump_map = { + 'D': 'H', + 'H': 'T', + 'T': 'S', + 'S': 'L', + 'L': 'U', + 'U': 'N', + 'N': None + } _str_reso_map = dict([(v, k) for k, v in compat.iteritems(_reso_str_map)]) @@ -160,6 +184,47 @@ def get_reso_from_freq(cls, freq): """ return cls.get_reso(cls.get_str_from_freq(freq)) + @classmethod + def get_stride_from_decimal(cls, value, freq): + """ + Convert freq with decimal stride into a higher freq with integer stride + + Parameters + ---------- + value : integer or float + freq : string + Frequency string + + Raises + ------ + ValueError + If the float cannot be converted to an integer at any resolution. + + Example + ------- + >>> Resolution.get_stride_from_decimal(1.5, 'T') + (90, 'S') + + >>> Resolution.get_stride_from_decimal(1.04, 'H') + (3744, 'S') + + >>> Resolution.get_stride_from_decimal(1, 'D') + (1, 'D') + """ + + if np.isclose(value % 1, 0): + return int(value), freq + else: + start_reso = cls.get_reso_from_freq(freq) + if start_reso == 0: + raise ValueError( + "Could not convert to integer offset at any resolution" + ) + + next_value = cls._reso_mult_map[start_reso] * value + next_name = cls._reso_str_bump_map[freq] + return cls.get_stride_from_decimal(next_value, next_name) + def get_to_timestamp_base(base): """ @@ -472,12 +537,17 @@ def to_offset(freq): splitted[2::4]): if sep != '' and not sep.isspace(): raise ValueError('separator must be spaces') - offset = get_offset(name) + prefix = _lite_rule_alias.get(name) or name if stride_sign is None: stride_sign = -1 if stride.startswith('-') else 1 if not stride: stride = 1 + if prefix in Resolution._reso_str_bump_map.keys(): + stride, name = Resolution.get_stride_from_decimal( + float(stride), prefix + ) stride = int(stride) + offset = get_offset(name) offset = offset * int(np.fabs(stride) * stride_sign) if delta is None: delta = offset @@ -493,7 +563,9 @@ def to_offset(freq): # hack to handle WOM-1MON -opattern = re.compile(r'([\-]?\d*)\s*([A-Za-z]+([\-][\dA-Za-z\-]+)?)') +opattern = re.compile( + r'([\-]?\d*|[\-]?\d*\.\d*)\s*([A-Za-z]+([\-][\dA-Za-z\-]+)?)' +) def _base_and_stride(freqstr): @@ -589,6 +661,7 @@ def get_standard_freq(freq): warnings.warn(msg, FutureWarning, stacklevel=2) return to_offset(freq).rule_code + # --------------------------------------------------------------------- # Period codes @@ -724,6 +797,7 @@ def infer_freq(index, warn=True): inferer = _FrequencyInferer(index, warn=warn) return inferer.get_freq() + _ONE_MICRO = long(1000) _ONE_MILLI = _ONE_MICRO * 1000 _ONE_SECOND = _ONE_MILLI * 1000 diff --git a/pandas/tseries/holiday.py b/pandas/tseries/holiday.py index 31e40c6bcbb2c..9acb52ebe0e9f 100644 --- a/pandas/tseries/holiday.py +++ b/pandas/tseries/holiday.py @@ -286,6 +286,7 @@ def _apply_rule(self, dates): dates += offset return dates + holiday_calendars = {} @@ -364,7 +365,7 @@ def holidays(self, start=None, end=None, return_name=False): ---------- start : starting date, datetime-like, optional end : ending date, datetime-like, optional - return_names : bool, optional + return_name : bool, optional If True, return a series that has dates and holiday names. 
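# User-visible effect of the decimal-stride support added to
# frequencies.py above: a fractional stride is bumped to the next finer
# resolution until it is integral. Output shown is for the pandas
# version this patch targets; exact reprs may differ in other versions.
from pandas.tseries.frequencies import to_offset

print(to_offset('1.5H'))    # <90 * Minutes>
print(to_offset('2.5min'))  # <150 * Seconds>
print(to_offset('1D'))      # <Day> -- integral strides pass through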
False will only return a DatetimeIndex of dates. @@ -461,6 +462,7 @@ def merge(self, other, inplace=False): else: return holidays + USMemorialDay = Holiday('MemorialDay', month=5, day=31, offset=DateOffset(weekday=MO(-1))) USLaborDay = Holiday('Labor Day', month=9, day=1, diff --git a/pandas/tseries/interval.py b/pandas/tseries/interval.py deleted file mode 100644 index 6698c7e924758..0000000000000 --- a/pandas/tseries/interval.py +++ /dev/null @@ -1,38 +0,0 @@ - -from pandas.core.index import Index - - -class Interval(object): - """ - Represents an interval of time defined by two timestamps - """ - - def __init__(self, start, end): - self.start = start - self.end = end - - -class PeriodInterval(object): - """ - Represents an interval of time defined by two Period objects (time - ordinals) - """ - - def __init__(self, start, end): - self.start = start - self.end = end - - -class IntervalIndex(Index): - """ - - """ - - def __new__(self, starts, ends): - pass - - def dtype(self): - return self.values.dtype - -if __name__ == '__main__': - pass diff --git a/pandas/tseries/offsets.py b/pandas/tseries/offsets.py index efcde100d1ce7..f9f4adc1b2c81 100644 --- a/pandas/tseries/offsets.py +++ b/pandas/tseries/offsets.py @@ -3,15 +3,14 @@ from pandas import compat import numpy as np -from pandas.types.generic import ABCSeries, ABCDatetimeIndex, ABCPeriod -from pandas.tseries.tools import to_datetime, normalize_date +from pandas.core.dtypes.generic import ABCSeries, ABCDatetimeIndex, ABCPeriod +from pandas.core.tools.datetimes import to_datetime, normalize_date from pandas.core.common import AbstractMethodError # import after tools, dateutil check from dateutil.relativedelta import relativedelta, weekday from dateutil.easter import easter -import pandas.tslib as tslib -from pandas.tslib import Timestamp, OutOfBoundsDatetime, Timedelta +from pandas._libs import tslib, Timestamp, OutOfBoundsDatetime, Timedelta import functools import operator @@ -155,14 +154,14 @@ class DateOffset(object): DateOffsets can be created to move dates forward a given number of valid dates. For example, Bday(2) can be added to a date to move it two business days forward. If the date does not start on a - valid date, first it is moved to a valid date. Thus psedo code + valid date, first it is moved to a valid date. Thus pseudo code is: def __add__(date): date = rollback(date) # does nothing if date is valid return date + - When a date offset is created for a negitive number of periods, + When a date offset is created for a negative number of periods, the date is first rolled forward. 
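# Concrete illustration of the roll-then-shift rule restated in the
# DateOffset docstring above, plus the OverflowError guard added to
# DateOffset.__add__ further down (GH 15126). 2017-04-01 is a Saturday.
import pandas as pd
from pandas.tseries.offsets import BDay, Day

sat = pd.Timestamp('2017-04-01')
print(sat + BDay(1))  # 2017-04-03: rolled back to Friday, then +1 bday
print(sat - BDay(1))  # 2017-03-31: rolled forward to Monday, then -1 bday

try:
    pd.Timestamp.max + Day(1)  # would leave the representable range
except OverflowError as exc:   # later versions raise OutOfBoundsDatetime
    print('overflow:', exc)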
The pseudo code is: def __add__(date): @@ -1652,6 +1651,7 @@ class WeekDay(object): SAT = 5 SUN = 6 + _int_to_weekday = { WeekDay.MON: 'MON', WeekDay.TUE: 'TUE', @@ -1924,6 +1924,7 @@ def onOffset(self, dt): modMonth = (dt.month - self.startingMonth) % 3 return BMonthEnd().onOffset(dt) and modMonth == 0 + _int_to_month = tslib._MONTH_ALIASES _month_to_int = dict((v, k) for k, v in _int_to_month.items()) @@ -2710,6 +2711,9 @@ def __add__(self, other): return self.apply(other) except ApplyTypeError: return NotImplemented + except OverflowError: + raise OverflowError("the add operation between {} and {} " + "will overflow".format(self, other)) def __eq__(self, other): if isinstance(other, compat.string_types): @@ -2748,14 +2752,26 @@ def nanos(self): def apply(self, other): # Timestamp can handle tz and nano sec, thus no need to use apply_wraps - if isinstance(other, (datetime, np.datetime64, date)): + if isinstance(other, Timestamp): + + # GH 15126 + # in order to avoid a recursive + # call of __add__ and __radd__ if there is + # an exception, when we call using the + operator, + # we directly call the known method + result = other.__add__(self) + if result == NotImplemented: + raise OverflowError + return result + elif isinstance(other, (datetime, np.datetime64, date)): return as_timestamp(other) + self + if isinstance(other, timedelta): return other + self.delta elif isinstance(other, type(self)): return type(self)(self.n + other.n) - else: - raise ApplyTypeError('Unhandled type: %s' % type(other).__name__) + + raise ApplyTypeError('Unhandled type: %s' % type(other).__name__) _prefix = 'undefined' @@ -2784,6 +2800,7 @@ def _delta_to_tick(delta): else: # pragma: no cover return Nano(nanos) + _delta_to_nanoseconds = tslib._delta_to_nanoseconds @@ -2916,6 +2933,7 @@ def generate_range(start=None, end=None, periods=None, raise ValueError('Offset %s did not decrement date' % offset) cur = next_date + prefix_mapping = dict((offset._prefix, offset) for offset in [ YearBegin, # 'AS' YearEnd, # 'A' diff --git a/pandas/tseries/plotting.py b/pandas/tseries/plotting.py index 89aecf2acc07e..302016907635d 100644 --- a/pandas/tseries/plotting.py +++ b/pandas/tseries/plotting.py @@ -1,313 +1,3 @@ -""" -Period formatters and locators adapted from scikits.timeseries by -Pierre GF Gerard-Marchant & Matt Knox -""" +# flake8: noqa -# TODO: Use the fact that axis can have units to simplify the process - -import numpy as np - -from matplotlib import pylab -from pandas.tseries.period import Period -from pandas.tseries.offsets import DateOffset -import pandas.tseries.frequencies as frequencies -from pandas.tseries.index import DatetimeIndex -from pandas.formats.printing import pprint_thing -import pandas.compat as compat - -from pandas.tseries.converter import (TimeSeries_DateLocator, - TimeSeries_DateFormatter) - -# --------------------------------------------------------------------- -# Plotting functions and monkey patches - - -def tsplot(series, plotf, ax=None, **kwargs): - """ - Plots a Series on the given Matplotlib axes or the current axes - - Parameters - ---------- - axes : Axes - series : Series - - Notes - _____ - Supports same kwargs as Axes.plot - - """ - # Used inferred freq is possible, need a test case for inferred - if ax is None: - import matplotlib.pyplot as plt - ax = plt.gca() - - freq, series = _maybe_resample(series, ax, kwargs) - - # Set ax with freq info - _decorate_axes(ax, freq, kwargs) - ax._plot_data.append((series, plotf, kwargs)) - lines = plotf(ax, series.index._mpl_repr(), 
series.values, **kwargs) - - # set date formatter, locators and rescale limits - format_dateaxis(ax, ax.freq) - return lines - - -def _maybe_resample(series, ax, kwargs): - # resample against axes freq if necessary - freq, ax_freq = _get_freq(ax, series) - - if freq is None: # pragma: no cover - raise ValueError('Cannot use dynamic axis without frequency info') - - # Convert DatetimeIndex to PeriodIndex - if isinstance(series.index, DatetimeIndex): - series = series.to_period(freq=freq) - - if ax_freq is not None and freq != ax_freq: - if frequencies.is_superperiod(freq, ax_freq): # upsample input - series = series.copy() - series.index = series.index.asfreq(ax_freq, how='s') - freq = ax_freq - elif _is_sup(freq, ax_freq): # one is weekly - how = kwargs.pop('how', 'last') - series = getattr(series.resample('D'), how)().dropna() - series = getattr(series.resample(ax_freq), how)().dropna() - freq = ax_freq - elif frequencies.is_subperiod(freq, ax_freq) or _is_sub(freq, ax_freq): - _upsample_others(ax, freq, kwargs) - ax_freq = freq - else: # pragma: no cover - raise ValueError('Incompatible frequency conversion') - return freq, series - - -def _is_sub(f1, f2): - return ((f1.startswith('W') and frequencies.is_subperiod('D', f2)) or - (f2.startswith('W') and frequencies.is_subperiod(f1, 'D'))) - - -def _is_sup(f1, f2): - return ((f1.startswith('W') and frequencies.is_superperiod('D', f2)) or - (f2.startswith('W') and frequencies.is_superperiod(f1, 'D'))) - - -def _upsample_others(ax, freq, kwargs): - legend = ax.get_legend() - lines, labels = _replot_ax(ax, freq, kwargs) - _replot_ax(ax, freq, kwargs) - - other_ax = None - if hasattr(ax, 'left_ax'): - other_ax = ax.left_ax - if hasattr(ax, 'right_ax'): - other_ax = ax.right_ax - - if other_ax is not None: - rlines, rlabels = _replot_ax(other_ax, freq, kwargs) - lines.extend(rlines) - labels.extend(rlabels) - - if (legend is not None and kwargs.get('legend', True) and - len(lines) > 0): - title = legend.get_title().get_text() - if title == 'None': - title = None - ax.legend(lines, labels, loc='best', title=title) - - -def _replot_ax(ax, freq, kwargs): - data = getattr(ax, '_plot_data', None) - - # clear current axes and data - ax._plot_data = [] - ax.clear() - - _decorate_axes(ax, freq, kwargs) - - lines = [] - labels = [] - if data is not None: - for series, plotf, kwds in data: - series = series.copy() - idx = series.index.asfreq(freq, how='S') - series.index = idx - ax._plot_data.append((series, plotf, kwds)) - - # for tsplot - if isinstance(plotf, compat.string_types): - from pandas.tools.plotting import _plot_klass - plotf = _plot_klass[plotf]._plot - - lines.append(plotf(ax, series.index._mpl_repr(), - series.values, **kwds)[0]) - labels.append(pprint_thing(series.name)) - - return lines, labels - - -def _decorate_axes(ax, freq, kwargs): - """Initialize axes for time-series plotting""" - if not hasattr(ax, '_plot_data'): - ax._plot_data = [] - - ax.freq = freq - xaxis = ax.get_xaxis() - xaxis.freq = freq - if not hasattr(ax, 'legendlabels'): - ax.legendlabels = [kwargs.get('label', None)] - else: - ax.legendlabels.append(kwargs.get('label', None)) - ax.view_interval = None - ax.date_axis_info = None - - -def _get_ax_freq(ax): - """ - Get the freq attribute of the ax object if set. 
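# The resampling logic in _maybe_resample/_replot_ax above reconciles a
# new series with the frequency already on the axes; a coarser
# PeriodIndex is upsampled with asfreq(how='s') so both share one
# x-scale. Minimal standalone example:
import pandas as pd

s = pd.Series(range(4), index=pd.period_range('2000Q1', periods=4, freq='Q'))
print(s.index.asfreq('M', how='s'))
# -> monthly periods 2000-01, 2000-04, 2000-07, 2000-10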
- Also checks shared axes (eg when using secondary yaxis, sharex=True - or twinx) - """ - ax_freq = getattr(ax, 'freq', None) - if ax_freq is None: - # check for left/right ax in case of secondary yaxis - if hasattr(ax, 'left_ax'): - ax_freq = getattr(ax.left_ax, 'freq', None) - elif hasattr(ax, 'right_ax'): - ax_freq = getattr(ax.right_ax, 'freq', None) - if ax_freq is None: - # check if a shared ax (sharex/twinx) has already freq set - shared_axes = ax.get_shared_x_axes().get_siblings(ax) - if len(shared_axes) > 1: - for shared_ax in shared_axes: - ax_freq = getattr(shared_ax, 'freq', None) - if ax_freq is not None: - break - return ax_freq - - -def _get_freq(ax, series): - # get frequency from data - freq = getattr(series.index, 'freq', None) - if freq is None: - freq = getattr(series.index, 'inferred_freq', None) - - ax_freq = _get_ax_freq(ax) - - # use axes freq if no data freq - if freq is None: - freq = ax_freq - - # get the period frequency - if isinstance(freq, DateOffset): - freq = freq.rule_code - else: - freq = frequencies.get_base_alias(freq) - - freq = frequencies.get_period_alias(freq) - return freq, ax_freq - - -def _use_dynamic_x(ax, data): - freq = _get_index_freq(data) - ax_freq = _get_ax_freq(ax) - - if freq is None: # convert irregular if axes has freq info - freq = ax_freq - else: # do not use tsplot if irregular was plotted first - if (ax_freq is None) and (len(ax.get_lines()) > 0): - return False - - if freq is None: - return False - - if isinstance(freq, DateOffset): - freq = freq.rule_code - else: - freq = frequencies.get_base_alias(freq) - freq = frequencies.get_period_alias(freq) - - if freq is None: - return False - - # hack this for 0.10.1, creating more technical debt...sigh - if isinstance(data.index, DatetimeIndex): - base = frequencies.get_freq(freq) - x = data.index - if (base <= frequencies.FreqGroup.FR_DAY): - return x[:1].is_normalized - return Period(x[0], freq).to_timestamp(tz=x.tz) == x[0] - return True - - -def _get_index_freq(data): - freq = getattr(data.index, 'freq', None) - if freq is None: - freq = getattr(data.index, 'inferred_freq', None) - if freq == 'B': - weekdays = np.unique(data.index.dayofweek) - if (5 in weekdays) or (6 in weekdays): - freq = None - return freq - - -def _maybe_convert_index(ax, data): - # tsplot converts automatically, but don't want to convert index - # over and over for DataFrames - if isinstance(data.index, DatetimeIndex): - freq = getattr(data.index, 'freq', None) - - if freq is None: - freq = getattr(data.index, 'inferred_freq', None) - if isinstance(freq, DateOffset): - freq = freq.rule_code - - if freq is None: - freq = _get_ax_freq(ax) - - if freq is None: - raise ValueError('Could not get frequency alias for plotting') - - freq = frequencies.get_base_alias(freq) - freq = frequencies.get_period_alias(freq) - - data = data.to_period(freq=freq) - return data - - -# Patch methods for subplot. Only format_dateaxis is currently used. -# Do we need the rest for convenience? - - -def format_dateaxis(subplot, freq): - """ - Pretty-formats the date axis (x-axis). - - Major and minor ticks are automatically set for the frequency of the - current underlying series. As the dynamic mode is activated by - default, changing the limits of the x axis will intelligently change - the positions of the ticks. 
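# The freq-resolution helpers above fall back to inferred_freq when an
# index declares no freq, and reject 'B' if weekend dates are present.
# Standalone illustration:
import pandas as pd

idx = pd.DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05'])
print(idx.freq)            # None -- nothing was declared
print(idx.inferred_freq)   # 'D' -- but it can still be inferred
print((idx.dayofweek >= 5).any())  # False -> no weekend days plotted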
- """ - majlocator = TimeSeries_DateLocator(freq, dynamic_mode=True, - minor_locator=False, - plot_obj=subplot) - minlocator = TimeSeries_DateLocator(freq, dynamic_mode=True, - minor_locator=True, - plot_obj=subplot) - subplot.xaxis.set_major_locator(majlocator) - subplot.xaxis.set_minor_locator(minlocator) - - majformatter = TimeSeries_DateFormatter(freq, dynamic_mode=True, - minor_locator=False, - plot_obj=subplot) - minformatter = TimeSeries_DateFormatter(freq, dynamic_mode=True, - minor_locator=True, - plot_obj=subplot) - subplot.xaxis.set_major_formatter(majformatter) - subplot.xaxis.set_minor_formatter(minformatter) - - # x and y coord info - subplot.format_coord = lambda t, y: ( - "t = {0} y = {1:8f}".format(Period(ordinal=int(t), freq=freq), y)) - - pylab.draw_if_interactive() +from pandas.plotting._timeseries import tsplot diff --git a/pandas/tseries/tests/data/daterange_073.pickle b/pandas/tseries/tests/data/daterange_073.pickle deleted file mode 100644 index 0214a023e6338..0000000000000 Binary files a/pandas/tseries/tests/data/daterange_073.pickle and /dev/null differ diff --git a/pandas/tseries/tests/data/frame.pickle b/pandas/tseries/tests/data/frame.pickle deleted file mode 100644 index b3b100fb43022..0000000000000 Binary files a/pandas/tseries/tests/data/frame.pickle and /dev/null differ diff --git a/pandas/tseries/tests/data/series.pickle b/pandas/tseries/tests/data/series.pickle deleted file mode 100644 index 307a4ac265173..0000000000000 Binary files a/pandas/tseries/tests/data/series.pickle and /dev/null differ diff --git a/pandas/tseries/tests/data/series_daterange0.pickle b/pandas/tseries/tests/data/series_daterange0.pickle deleted file mode 100644 index 07e2144039f2e..0000000000000 Binary files a/pandas/tseries/tests/data/series_daterange0.pickle and /dev/null differ diff --git a/pandas/tseries/tests/test_base.py b/pandas/tseries/tests/test_base.py deleted file mode 100644 index bca50237081e1..0000000000000 --- a/pandas/tseries/tests/test_base.py +++ /dev/null @@ -1,2764 +0,0 @@ -from __future__ import print_function -from datetime import datetime, timedelta -import numpy as np -import pandas as pd -from pandas import (Series, Index, Int64Index, Timestamp, Period, - DatetimeIndex, PeriodIndex, TimedeltaIndex, - Timedelta, timedelta_range, date_range, Float64Index, - _np_version_under1p10) -import pandas.tslib as tslib -import pandas.tseries.period as period - -import pandas.util.testing as tm - -from pandas.tests.test_base import Ops - - -class TestDatetimeIndexOps(Ops): - tz = [None, 'UTC', 'Asia/Tokyo', 'US/Eastern', 'dateutil/Asia/Singapore', - 'dateutil/US/Pacific'] - - def setUp(self): - super(TestDatetimeIndexOps, self).setUp() - mask = lambda x: (isinstance(x, DatetimeIndex) or - isinstance(x, PeriodIndex)) - self.is_valid_objs = [o for o in self.objs if mask(o)] - self.not_valid_objs = [o for o in self.objs if not mask(o)] - - def test_ops_properties(self): - self.check_ops_properties( - ['year', 'month', 'day', 'hour', 'minute', 'second', 'weekofyear', - 'week', 'dayofweek', 'dayofyear', 'quarter']) - self.check_ops_properties(['date', 'time', 'microsecond', 'nanosecond', - 'is_month_start', 'is_month_end', - 'is_quarter_start', - 'is_quarter_end', 'is_year_start', - 'is_year_end', 'weekday_name'], - lambda x: isinstance(x, DatetimeIndex)) - - def test_ops_properties_basic(self): - - # sanity check that the behavior didn't change - # GH7206 - for op in ['year', 'day', 'second', 'weekday']: - self.assertRaises(TypeError, lambda x: getattr(self.dt_series, op)) - 
- # attribute access should still work! - s = Series(dict(year=2000, month=1, day=10)) - self.assertEqual(s.year, 2000) - self.assertEqual(s.month, 1) - self.assertEqual(s.day, 10) - self.assertRaises(AttributeError, lambda: s.weekday) - - def test_asobject_tolist(self): - idx = pd.date_range(start='2013-01-01', periods=4, freq='M', - name='idx') - expected_list = [Timestamp('2013-01-31'), - Timestamp('2013-02-28'), - Timestamp('2013-03-31'), - Timestamp('2013-04-30')] - expected = pd.Index(expected_list, dtype=object, name='idx') - result = idx.asobject - self.assertTrue(isinstance(result, Index)) - - self.assertEqual(result.dtype, object) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - self.assertEqual(idx.tolist(), expected_list) - - idx = pd.date_range(start='2013-01-01', periods=4, freq='M', - name='idx', tz='Asia/Tokyo') - expected_list = [Timestamp('2013-01-31', tz='Asia/Tokyo'), - Timestamp('2013-02-28', tz='Asia/Tokyo'), - Timestamp('2013-03-31', tz='Asia/Tokyo'), - Timestamp('2013-04-30', tz='Asia/Tokyo')] - expected = pd.Index(expected_list, dtype=object, name='idx') - result = idx.asobject - self.assertTrue(isinstance(result, Index)) - self.assertEqual(result.dtype, object) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - self.assertEqual(idx.tolist(), expected_list) - - idx = DatetimeIndex([datetime(2013, 1, 1), datetime(2013, 1, 2), - pd.NaT, datetime(2013, 1, 4)], name='idx') - expected_list = [Timestamp('2013-01-01'), - Timestamp('2013-01-02'), pd.NaT, - Timestamp('2013-01-04')] - expected = pd.Index(expected_list, dtype=object, name='idx') - result = idx.asobject - self.assertTrue(isinstance(result, Index)) - self.assertEqual(result.dtype, object) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - self.assertEqual(idx.tolist(), expected_list) - - def test_minmax(self): - for tz in self.tz: - # monotonic - idx1 = pd.DatetimeIndex(['2011-01-01', '2011-01-02', - '2011-01-03'], tz=tz) - self.assertTrue(idx1.is_monotonic) - - # non-monotonic - idx2 = pd.DatetimeIndex(['2011-01-01', pd.NaT, '2011-01-03', - '2011-01-02', pd.NaT], tz=tz) - self.assertFalse(idx2.is_monotonic) - - for idx in [idx1, idx2]: - self.assertEqual(idx.min(), Timestamp('2011-01-01', tz=tz)) - self.assertEqual(idx.max(), Timestamp('2011-01-03', tz=tz)) - self.assertEqual(idx.argmin(), 0) - self.assertEqual(idx.argmax(), 2) - - for op in ['min', 'max']: - # Return NaT - obj = DatetimeIndex([]) - self.assertTrue(pd.isnull(getattr(obj, op)())) - - obj = DatetimeIndex([pd.NaT]) - self.assertTrue(pd.isnull(getattr(obj, op)())) - - obj = DatetimeIndex([pd.NaT, pd.NaT, pd.NaT]) - self.assertTrue(pd.isnull(getattr(obj, op)())) - - def test_numpy_minmax(self): - dr = pd.date_range(start='2016-01-15', end='2016-01-20') - - self.assertEqual(np.min(dr), - Timestamp('2016-01-15 00:00:00', freq='D')) - self.assertEqual(np.max(dr), - Timestamp('2016-01-20 00:00:00', freq='D')) - - errmsg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, errmsg, np.min, dr, out=0) - tm.assertRaisesRegexp(ValueError, errmsg, np.max, dr, out=0) - - self.assertEqual(np.argmin(dr), 0) - self.assertEqual(np.argmax(dr), 5) - - if not _np_version_under1p10: - errmsg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, errmsg, np.argmin, dr, out=0) - tm.assertRaisesRegexp(ValueError, errmsg, np.argmax, dr, out=0) - - def test_round(self): - for tz in self.tz: - rng = 
pd.date_range(start='2016-01-01', periods=5, - freq='30Min', tz=tz) - elt = rng[1] - - expected_rng = DatetimeIndex([ - Timestamp('2016-01-01 00:00:00', tz=tz, freq='30T'), - Timestamp('2016-01-01 00:00:00', tz=tz, freq='30T'), - Timestamp('2016-01-01 01:00:00', tz=tz, freq='30T'), - Timestamp('2016-01-01 02:00:00', tz=tz, freq='30T'), - Timestamp('2016-01-01 02:00:00', tz=tz, freq='30T'), - ]) - expected_elt = expected_rng[1] - - tm.assert_index_equal(rng.round(freq='H'), expected_rng) - self.assertEqual(elt.round(freq='H'), expected_elt) - - msg = pd.tseries.frequencies._INVALID_FREQ_ERROR - with tm.assertRaisesRegexp(ValueError, msg): - rng.round(freq='foo') - with tm.assertRaisesRegexp(ValueError, msg): - elt.round(freq='foo') - - msg = " is a non-fixed frequency" - tm.assertRaisesRegexp(ValueError, msg, rng.round, freq='M') - tm.assertRaisesRegexp(ValueError, msg, elt.round, freq='M') - - def test_repeat_range(self): - rng = date_range('1/1/2000', '1/1/2001') - - result = rng.repeat(5) - self.assertIsNone(result.freq) - self.assertEqual(len(result), 5 * len(rng)) - - for tz in self.tz: - index = pd.date_range('2001-01-01', periods=2, freq='D', tz=tz) - exp = pd.DatetimeIndex(['2001-01-01', '2001-01-01', - '2001-01-02', '2001-01-02'], tz=tz) - for res in [index.repeat(2), np.repeat(index, 2)]: - tm.assert_index_equal(res, exp) - self.assertIsNone(res.freq) - - index = pd.date_range('2001-01-01', periods=2, freq='2D', tz=tz) - exp = pd.DatetimeIndex(['2001-01-01', '2001-01-01', - '2001-01-03', '2001-01-03'], tz=tz) - for res in [index.repeat(2), np.repeat(index, 2)]: - tm.assert_index_equal(res, exp) - self.assertIsNone(res.freq) - - index = pd.DatetimeIndex(['2001-01-01', 'NaT', '2003-01-01'], - tz=tz) - exp = pd.DatetimeIndex(['2001-01-01', '2001-01-01', '2001-01-01', - 'NaT', 'NaT', 'NaT', - '2003-01-01', '2003-01-01', '2003-01-01'], - tz=tz) - for res in [index.repeat(3), np.repeat(index, 3)]: - tm.assert_index_equal(res, exp) - self.assertIsNone(res.freq) - - def test_repeat(self): - reps = 2 - msg = "the 'axis' parameter is not supported" - - for tz in self.tz: - rng = pd.date_range(start='2016-01-01', periods=2, - freq='30Min', tz=tz) - - expected_rng = DatetimeIndex([ - Timestamp('2016-01-01 00:00:00', tz=tz, freq='30T'), - Timestamp('2016-01-01 00:00:00', tz=tz, freq='30T'), - Timestamp('2016-01-01 00:30:00', tz=tz, freq='30T'), - Timestamp('2016-01-01 00:30:00', tz=tz, freq='30T'), - ]) - - res = rng.repeat(reps) - tm.assert_index_equal(res, expected_rng) - self.assertIsNone(res.freq) - - tm.assert_index_equal(np.repeat(rng, reps), expected_rng) - tm.assertRaisesRegexp(ValueError, msg, np.repeat, - rng, reps, axis=1) - - def test_representation(self): - - idx = [] - idx.append(DatetimeIndex([], freq='D')) - idx.append(DatetimeIndex(['2011-01-01'], freq='D')) - idx.append(DatetimeIndex(['2011-01-01', '2011-01-02'], freq='D')) - idx.append(DatetimeIndex( - ['2011-01-01', '2011-01-02', '2011-01-03'], freq='D')) - idx.append(DatetimeIndex( - ['2011-01-01 09:00', '2011-01-01 10:00', '2011-01-01 11:00' - ], freq='H', tz='Asia/Tokyo')) - idx.append(DatetimeIndex( - ['2011-01-01 09:00', '2011-01-01 10:00', pd.NaT], tz='US/Eastern')) - idx.append(DatetimeIndex( - ['2011-01-01 09:00', '2011-01-01 10:00', pd.NaT], tz='UTC')) - - exp = [] - exp.append("""DatetimeIndex([], dtype='datetime64[ns]', freq='D')""") - exp.append("DatetimeIndex(['2011-01-01'], dtype='datetime64[ns]', " - "freq='D')") - exp.append("DatetimeIndex(['2011-01-01', '2011-01-02'], " - "dtype='datetime64[ns]', 
freq='D')") - exp.append("DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], " - "dtype='datetime64[ns]', freq='D')") - exp.append("DatetimeIndex(['2011-01-01 09:00:00+09:00', " - "'2011-01-01 10:00:00+09:00', '2011-01-01 11:00:00+09:00']" - ", dtype='datetime64[ns, Asia/Tokyo]', freq='H')") - exp.append("DatetimeIndex(['2011-01-01 09:00:00-05:00', " - "'2011-01-01 10:00:00-05:00', 'NaT'], " - "dtype='datetime64[ns, US/Eastern]', freq=None)") - exp.append("DatetimeIndex(['2011-01-01 09:00:00+00:00', " - "'2011-01-01 10:00:00+00:00', 'NaT'], " - "dtype='datetime64[ns, UTC]', freq=None)""") - - with pd.option_context('display.width', 300): - for indx, expected in zip(idx, exp): - for func in ['__repr__', '__unicode__', '__str__']: - result = getattr(indx, func)() - self.assertEqual(result, expected) - - def test_representation_to_series(self): - idx1 = DatetimeIndex([], freq='D') - idx2 = DatetimeIndex(['2011-01-01'], freq='D') - idx3 = DatetimeIndex(['2011-01-01', '2011-01-02'], freq='D') - idx4 = DatetimeIndex( - ['2011-01-01', '2011-01-02', '2011-01-03'], freq='D') - idx5 = DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', - '2011-01-01 11:00'], freq='H', tz='Asia/Tokyo') - idx6 = DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', pd.NaT], - tz='US/Eastern') - idx7 = DatetimeIndex(['2011-01-01 09:00', '2011-01-02 10:15']) - - exp1 = """Series([], dtype: datetime64[ns])""" - - exp2 = """0 2011-01-01 -dtype: datetime64[ns]""" - - exp3 = """0 2011-01-01 -1 2011-01-02 -dtype: datetime64[ns]""" - - exp4 = """0 2011-01-01 -1 2011-01-02 -2 2011-01-03 -dtype: datetime64[ns]""" - - exp5 = """0 2011-01-01 09:00:00+09:00 -1 2011-01-01 10:00:00+09:00 -2 2011-01-01 11:00:00+09:00 -dtype: datetime64[ns, Asia/Tokyo]""" - - exp6 = """0 2011-01-01 09:00:00-05:00 -1 2011-01-01 10:00:00-05:00 -2 NaT -dtype: datetime64[ns, US/Eastern]""" - - exp7 = """0 2011-01-01 09:00:00 -1 2011-01-02 10:15:00 -dtype: datetime64[ns]""" - - with pd.option_context('display.width', 300): - for idx, expected in zip([idx1, idx2, idx3, idx4, - idx5, idx6, idx7], - [exp1, exp2, exp3, exp4, - exp5, exp6, exp7]): - result = repr(Series(idx)) - self.assertEqual(result, expected) - - def test_summary(self): - # GH9116 - idx1 = DatetimeIndex([], freq='D') - idx2 = DatetimeIndex(['2011-01-01'], freq='D') - idx3 = DatetimeIndex(['2011-01-01', '2011-01-02'], freq='D') - idx4 = DatetimeIndex( - ['2011-01-01', '2011-01-02', '2011-01-03'], freq='D') - idx5 = DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', - '2011-01-01 11:00'], - freq='H', tz='Asia/Tokyo') - idx6 = DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', pd.NaT], - tz='US/Eastern') - - exp1 = """DatetimeIndex: 0 entries -Freq: D""" - - exp2 = """DatetimeIndex: 1 entries, 2011-01-01 to 2011-01-01 -Freq: D""" - - exp3 = """DatetimeIndex: 2 entries, 2011-01-01 to 2011-01-02 -Freq: D""" - - exp4 = """DatetimeIndex: 3 entries, 2011-01-01 to 2011-01-03 -Freq: D""" - - exp5 = ("DatetimeIndex: 3 entries, 2011-01-01 09:00:00+09:00 " - "to 2011-01-01 11:00:00+09:00\n" - "Freq: H") - - exp6 = """DatetimeIndex: 3 entries, 2011-01-01 09:00:00-05:00 to NaT""" - - for idx, expected in zip([idx1, idx2, idx3, idx4, idx5, idx6], - [exp1, exp2, exp3, exp4, exp5, exp6]): - result = idx.summary() - self.assertEqual(result, expected) - - def test_resolution(self): - for freq, expected in zip(['A', 'Q', 'M', 'D', 'H', 'T', - 'S', 'L', 'U'], - ['day', 'day', 'day', 'day', 'hour', - 'minute', 'second', 'millisecond', - 'microsecond']): - for tz in self.tz: - idx = 
pd.date_range(start='2013-04-01', periods=30, freq=freq, - tz=tz) - self.assertEqual(idx.resolution, expected) - - def test_union(self): - for tz in self.tz: - # union - rng1 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) - other1 = pd.date_range('1/6/2000', freq='D', periods=5, tz=tz) - expected1 = pd.date_range('1/1/2000', freq='D', periods=10, tz=tz) - - rng2 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) - other2 = pd.date_range('1/4/2000', freq='D', periods=5, tz=tz) - expected2 = pd.date_range('1/1/2000', freq='D', periods=8, tz=tz) - - rng3 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) - other3 = pd.DatetimeIndex([], tz=tz) - expected3 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) - - for rng, other, expected in [(rng1, other1, expected1), - (rng2, other2, expected2), - (rng3, other3, expected3)]: - - result_union = rng.union(other) - tm.assert_index_equal(result_union, expected) - - def test_add_iadd(self): - for tz in self.tz: - - # offset - offsets = [pd.offsets.Hour(2), timedelta(hours=2), - np.timedelta64(2, 'h'), Timedelta(hours=2)] - - for delta in offsets: - rng = pd.date_range('2000-01-01', '2000-02-01', tz=tz) - result = rng + delta - expected = pd.date_range('2000-01-01 02:00', - '2000-02-01 02:00', tz=tz) - tm.assert_index_equal(result, expected) - rng += delta - tm.assert_index_equal(rng, expected) - - # int - rng = pd.date_range('2000-01-01 09:00', freq='H', periods=10, - tz=tz) - result = rng + 1 - expected = pd.date_range('2000-01-01 10:00', freq='H', periods=10, - tz=tz) - tm.assert_index_equal(result, expected) - rng += 1 - tm.assert_index_equal(rng, expected) - - idx = DatetimeIndex(['2011-01-01', '2011-01-02']) - msg = "cannot add a datelike to a DatetimeIndex" - with tm.assertRaisesRegexp(TypeError, msg): - idx + Timestamp('2011-01-01') - - with tm.assertRaisesRegexp(TypeError, msg): - Timestamp('2011-01-01') + idx - - def test_add_dti_dti(self): - # previously performed setop (deprecated in 0.16.0), now raises - # TypeError (GH14164) - - dti = date_range('20130101', periods=3) - dti_tz = date_range('20130101', periods=3).tz_localize('US/Eastern') - - with tm.assertRaises(TypeError): - dti + dti - - with tm.assertRaises(TypeError): - dti_tz + dti_tz - - with tm.assertRaises(TypeError): - dti_tz + dti - - with tm.assertRaises(TypeError): - dti + dti_tz - - def test_difference(self): - for tz in self.tz: - # diff - rng1 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) - other1 = pd.date_range('1/6/2000', freq='D', periods=5, tz=tz) - expected1 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) - - rng2 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) - other2 = pd.date_range('1/4/2000', freq='D', periods=5, tz=tz) - expected2 = pd.date_range('1/1/2000', freq='D', periods=3, tz=tz) - - rng3 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) - other3 = pd.DatetimeIndex([], tz=tz) - expected3 = pd.date_range('1/1/2000', freq='D', periods=5, tz=tz) - - for rng, other, expected in [(rng1, other1, expected1), - (rng2, other2, expected2), - (rng3, other3, expected3)]: - result_diff = rng.difference(other) - tm.assert_index_equal(result_diff, expected) - - def test_sub_isub(self): - for tz in self.tz: - - # offset - offsets = [pd.offsets.Hour(2), timedelta(hours=2), - np.timedelta64(2, 'h'), Timedelta(hours=2)] - - for delta in offsets: - rng = pd.date_range('2000-01-01', '2000-02-01', tz=tz) - expected = pd.date_range('1999-12-31 22:00', - '2000-01-31 22:00', tz=tz) - - result = rng - delta - 
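# The add/sub tests above all assert the same invariant: every offset
# spelling shifts a tz-aware DatetimeIndex element-wise. Condensed:
import numpy as np
import pandas as pd
from datetime import timedelta

rng = pd.date_range('2000-01-01', periods=3, tz='UTC')
for delta in [pd.offsets.Hour(2), timedelta(hours=2),
              np.timedelta64(2, 'h'), pd.Timedelta(hours=2)]:
    assert (rng + delta).equals(rng.shift(2, freq='H'))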
tm.assert_index_equal(result, expected) - rng -= delta - tm.assert_index_equal(rng, expected) - - # int - rng = pd.date_range('2000-01-01 09:00', freq='H', periods=10, - tz=tz) - result = rng - 1 - expected = pd.date_range('2000-01-01 08:00', freq='H', periods=10, - tz=tz) - tm.assert_index_equal(result, expected) - rng -= 1 - tm.assert_index_equal(rng, expected) - - def test_sub_dti_dti(self): - # previously performed setop (deprecated in 0.16.0), now changed to - # return subtraction -> TimeDeltaIndex (GH ...) - - dti = date_range('20130101', periods=3) - dti_tz = date_range('20130101', periods=3).tz_localize('US/Eastern') - dti_tz2 = date_range('20130101', periods=3).tz_localize('UTC') - expected = TimedeltaIndex([0, 0, 0]) - - result = dti - dti - tm.assert_index_equal(result, expected) - - result = dti_tz - dti_tz - tm.assert_index_equal(result, expected) - - with tm.assertRaises(TypeError): - dti_tz - dti - - with tm.assertRaises(TypeError): - dti - dti_tz - - with tm.assertRaises(TypeError): - dti_tz - dti_tz2 - - # isub - dti -= dti - tm.assert_index_equal(dti, expected) - - # different length raises ValueError - dti1 = date_range('20130101', periods=3) - dti2 = date_range('20130101', periods=4) - with tm.assertRaises(ValueError): - dti1 - dti2 - - # NaN propagation - dti1 = DatetimeIndex(['2012-01-01', np.nan, '2012-01-03']) - dti2 = DatetimeIndex(['2012-01-02', '2012-01-03', np.nan]) - expected = TimedeltaIndex(['1 days', np.nan, np.nan]) - result = dti2 - dti1 - tm.assert_index_equal(result, expected) - - def test_sub_period(self): - # GH 13078 - # not supported, check TypeError - p = pd.Period('2011-01-01', freq='D') - - for freq in [None, 'D']: - idx = pd.DatetimeIndex(['2011-01-01', '2011-01-02'], freq=freq) - - with tm.assertRaises(TypeError): - idx - p - - with tm.assertRaises(TypeError): - p - idx - - def test_comp_nat(self): - left = pd.DatetimeIndex([pd.Timestamp('2011-01-01'), pd.NaT, - pd.Timestamp('2011-01-03')]) - right = pd.DatetimeIndex([pd.NaT, pd.NaT, pd.Timestamp('2011-01-03')]) - - for l, r in [(left, right), (left.asobject, right.asobject)]: - result = l == r - expected = np.array([False, False, True]) - tm.assert_numpy_array_equal(result, expected) - - result = l != r - expected = np.array([True, True, False]) - tm.assert_numpy_array_equal(result, expected) - - expected = np.array([False, False, False]) - tm.assert_numpy_array_equal(l == pd.NaT, expected) - tm.assert_numpy_array_equal(pd.NaT == r, expected) - - expected = np.array([True, True, True]) - tm.assert_numpy_array_equal(l != pd.NaT, expected) - tm.assert_numpy_array_equal(pd.NaT != l, expected) - - expected = np.array([False, False, False]) - tm.assert_numpy_array_equal(l < pd.NaT, expected) - tm.assert_numpy_array_equal(pd.NaT > l, expected) - - def test_value_counts_unique(self): - # GH 7735 - for tz in self.tz: - idx = pd.date_range('2011-01-01 09:00', freq='H', periods=10) - # create repeated values, 'n'th element is repeated by n+1 times - idx = DatetimeIndex(np.repeat(idx.values, range(1, len(idx) + 1)), - tz=tz) - - exp_idx = pd.date_range('2011-01-01 18:00', freq='-1H', periods=10, - tz=tz) - expected = Series(range(10, 0, -1), index=exp_idx, dtype='int64') - - for obj in [idx, Series(idx)]: - tm.assert_series_equal(obj.value_counts(), expected) - - expected = pd.date_range('2011-01-01 09:00', freq='H', periods=10, - tz=tz) - tm.assert_index_equal(idx.unique(), expected) - - idx = DatetimeIndex(['2013-01-01 09:00', '2013-01-01 09:00', - '2013-01-01 09:00', '2013-01-01 08:00', - '2013-01-01 
08:00', pd.NaT], tz=tz) - - exp_idx = DatetimeIndex(['2013-01-01 09:00', '2013-01-01 08:00'], - tz=tz) - expected = Series([3, 2], index=exp_idx) - - for obj in [idx, Series(idx)]: - tm.assert_series_equal(obj.value_counts(), expected) - - exp_idx = DatetimeIndex(['2013-01-01 09:00', '2013-01-01 08:00', - pd.NaT], tz=tz) - expected = Series([3, 2, 1], index=exp_idx) - - for obj in [idx, Series(idx)]: - tm.assert_series_equal(obj.value_counts(dropna=False), - expected) - - tm.assert_index_equal(idx.unique(), exp_idx) - - def test_nonunique_contains(self): - # GH 9512 - for idx in map(DatetimeIndex, - ([0, 1, 0], [0, 0, -1], [0, -1, -1], - ['2015', '2015', '2016'], ['2015', '2015', '2014'])): - tm.assertIn(idx[0], idx) - - def test_order(self): - # with freq - idx1 = DatetimeIndex(['2011-01-01', '2011-01-02', - '2011-01-03'], freq='D', name='idx') - idx2 = DatetimeIndex(['2011-01-01 09:00', '2011-01-01 10:00', - '2011-01-01 11:00'], freq='H', - tz='Asia/Tokyo', name='tzidx') - - for idx in [idx1, idx2]: - ordered = idx.sort_values() - self.assert_index_equal(ordered, idx) - self.assertEqual(ordered.freq, idx.freq) - - ordered = idx.sort_values(ascending=False) - expected = idx[::-1] - self.assert_index_equal(ordered, expected) - self.assertEqual(ordered.freq, expected.freq) - self.assertEqual(ordered.freq.n, -1) - - ordered, indexer = idx.sort_values(return_indexer=True) - self.assert_index_equal(ordered, idx) - self.assert_numpy_array_equal(indexer, - np.array([0, 1, 2]), - check_dtype=False) - self.assertEqual(ordered.freq, idx.freq) - - ordered, indexer = idx.sort_values(return_indexer=True, - ascending=False) - expected = idx[::-1] - self.assert_index_equal(ordered, expected) - self.assert_numpy_array_equal(indexer, - np.array([2, 1, 0]), - check_dtype=False) - self.assertEqual(ordered.freq, expected.freq) - self.assertEqual(ordered.freq.n, -1) - - # without freq - for tz in self.tz: - idx1 = DatetimeIndex(['2011-01-01', '2011-01-03', '2011-01-05', - '2011-01-02', '2011-01-01'], - tz=tz, name='idx1') - exp1 = DatetimeIndex(['2011-01-01', '2011-01-01', '2011-01-02', - '2011-01-03', '2011-01-05'], - tz=tz, name='idx1') - - idx2 = DatetimeIndex(['2011-01-01', '2011-01-03', '2011-01-05', - '2011-01-02', '2011-01-01'], - tz=tz, name='idx2') - - exp2 = DatetimeIndex(['2011-01-01', '2011-01-01', '2011-01-02', - '2011-01-03', '2011-01-05'], - tz=tz, name='idx2') - - idx3 = DatetimeIndex([pd.NaT, '2011-01-03', '2011-01-05', - '2011-01-02', pd.NaT], tz=tz, name='idx3') - exp3 = DatetimeIndex([pd.NaT, pd.NaT, '2011-01-02', '2011-01-03', - '2011-01-05'], tz=tz, name='idx3') - - for idx, expected in [(idx1, exp1), (idx2, exp2), (idx3, exp3)]: - ordered = idx.sort_values() - self.assert_index_equal(ordered, expected) - self.assertIsNone(ordered.freq) - - ordered = idx.sort_values(ascending=False) - self.assert_index_equal(ordered, expected[::-1]) - self.assertIsNone(ordered.freq) - - ordered, indexer = idx.sort_values(return_indexer=True) - self.assert_index_equal(ordered, expected) - - exp = np.array([0, 4, 3, 1, 2]) - self.assert_numpy_array_equal(indexer, exp, check_dtype=False) - self.assertIsNone(ordered.freq) - - ordered, indexer = idx.sort_values(return_indexer=True, - ascending=False) - self.assert_index_equal(ordered, expected[::-1]) - - exp = np.array([2, 1, 3, 4, 0]) - self.assert_numpy_array_equal(indexer, exp, check_dtype=False) - self.assertIsNone(ordered.freq) - - def test_getitem(self): - idx1 = pd.date_range('2011-01-01', '2011-01-31', freq='D', name='idx') - idx2 = 
pd.date_range('2011-01-01', '2011-01-31', freq='D', - tz='Asia/Tokyo', name='idx') - - for idx in [idx1, idx2]: - result = idx[0] - self.assertEqual(result, Timestamp('2011-01-01', tz=idx.tz)) - - result = idx[0:5] - expected = pd.date_range('2011-01-01', '2011-01-05', freq='D', - tz=idx.tz, name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - result = idx[0:10:2] - expected = pd.date_range('2011-01-01', '2011-01-09', freq='2D', - tz=idx.tz, name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - result = idx[-20:-5:3] - expected = pd.date_range('2011-01-12', '2011-01-24', freq='3D', - tz=idx.tz, name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - result = idx[4::-1] - expected = DatetimeIndex(['2011-01-05', '2011-01-04', '2011-01-03', - '2011-01-02', '2011-01-01'], - freq='-1D', tz=idx.tz, name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - def test_drop_duplicates_metadata(self): - # GH 10115 - idx = pd.date_range('2011-01-01', '2011-01-31', freq='D', name='idx') - result = idx.drop_duplicates() - self.assert_index_equal(idx, result) - self.assertEqual(idx.freq, result.freq) - - idx_dup = idx.append(idx) - self.assertIsNone(idx_dup.freq) # freq is reset - result = idx_dup.drop_duplicates() - self.assert_index_equal(idx, result) - self.assertIsNone(result.freq) - - def test_drop_duplicates(self): - # to check Index/Series compat - base = pd.date_range('2011-01-01', '2011-01-31', freq='D', name='idx') - idx = base.append(base[:5]) - - res = idx.drop_duplicates() - tm.assert_index_equal(res, base) - res = Series(idx).drop_duplicates() - tm.assert_series_equal(res, Series(base)) - - res = idx.drop_duplicates(keep='last') - exp = base[5:].append(base[:5]) - tm.assert_index_equal(res, exp) - res = Series(idx).drop_duplicates(keep='last') - tm.assert_series_equal(res, Series(exp, index=np.arange(5, 36))) - - res = idx.drop_duplicates(keep=False) - tm.assert_index_equal(res, base[5:]) - res = Series(idx).drop_duplicates(keep=False) - tm.assert_series_equal(res, Series(base[5:], index=np.arange(5, 31))) - - def test_take(self): - # GH 10295 - idx1 = pd.date_range('2011-01-01', '2011-01-31', freq='D', name='idx') - idx2 = pd.date_range('2011-01-01', '2011-01-31', freq='D', - tz='Asia/Tokyo', name='idx') - - for idx in [idx1, idx2]: - result = idx.take([0]) - self.assertEqual(result, Timestamp('2011-01-01', tz=idx.tz)) - - result = idx.take([0, 1, 2]) - expected = pd.date_range('2011-01-01', '2011-01-03', freq='D', - tz=idx.tz, name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - result = idx.take([0, 2, 4]) - expected = pd.date_range('2011-01-01', '2011-01-05', freq='2D', - tz=idx.tz, name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - result = idx.take([7, 4, 1]) - expected = pd.date_range('2011-01-08', '2011-01-02', freq='-3D', - tz=idx.tz, name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - result = idx.take([3, 2, 5]) - expected = DatetimeIndex(['2011-01-04', '2011-01-03', - '2011-01-06'], - freq=None, tz=idx.tz, name='idx') - self.assert_index_equal(result, expected) - self.assertIsNone(result.freq) - - result = idx.take([-3, 2, 5]) - expected = DatetimeIndex(['2011-01-29', '2011-01-03', - '2011-01-06'], - freq=None, 
tz=idx.tz, name='idx') - self.assert_index_equal(result, expected) - self.assertIsNone(result.freq) - - def test_take_invalid_kwargs(self): - idx = pd.date_range('2011-01-01', '2011-01-31', freq='D', name='idx') - indices = [1, 6, 5, 9, 10, 13, 15, 3] - - msg = r"take\(\) got an unexpected keyword argument 'foo'" - tm.assertRaisesRegexp(TypeError, msg, idx.take, - indices, foo=2) - - msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, idx.take, - indices, out=indices) - - msg = "the 'mode' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, idx.take, - indices, mode='clip') - - def test_infer_freq(self): - # GH 11018 - for freq in ['A', '2A', '-2A', 'Q', '-1Q', 'M', '-1M', 'D', '3D', - '-3D', 'W', '-1W', 'H', '2H', '-2H', 'T', '2T', 'S', - '-3S']: - idx = pd.date_range('2011-01-01 09:00:00', freq=freq, periods=10) - result = pd.DatetimeIndex(idx.asi8, freq='infer') - tm.assert_index_equal(idx, result) - self.assertEqual(result.freq, freq) - - def test_nat_new(self): - idx = pd.date_range('2011-01-01', freq='D', periods=5, name='x') - result = idx._nat_new() - exp = pd.DatetimeIndex([pd.NaT] * 5, name='x') - tm.assert_index_equal(result, exp) - - result = idx._nat_new(box=False) - exp = np.array([tslib.iNaT] * 5, dtype=np.int64) - tm.assert_numpy_array_equal(result, exp) - - def test_shift(self): - # GH 9903 - for tz in self.tz: - idx = pd.DatetimeIndex([], name='xxx', tz=tz) - tm.assert_index_equal(idx.shift(0, freq='H'), idx) - tm.assert_index_equal(idx.shift(3, freq='H'), idx) - - idx = pd.DatetimeIndex(['2011-01-01 10:00', '2011-01-01 11:00' - '2011-01-01 12:00'], name='xxx', tz=tz) - tm.assert_index_equal(idx.shift(0, freq='H'), idx) - exp = pd.DatetimeIndex(['2011-01-01 13:00', '2011-01-01 14:00' - '2011-01-01 15:00'], name='xxx', tz=tz) - tm.assert_index_equal(idx.shift(3, freq='H'), exp) - exp = pd.DatetimeIndex(['2011-01-01 07:00', '2011-01-01 08:00' - '2011-01-01 09:00'], name='xxx', tz=tz) - tm.assert_index_equal(idx.shift(-3, freq='H'), exp) - - def test_nat(self): - self.assertIs(pd.DatetimeIndex._na_value, pd.NaT) - self.assertIs(pd.DatetimeIndex([])._na_value, pd.NaT) - - for tz in [None, 'US/Eastern', 'UTC']: - idx = pd.DatetimeIndex(['2011-01-01', '2011-01-02'], tz=tz) - self.assertTrue(idx._can_hold_na) - - tm.assert_numpy_array_equal(idx._isnan, np.array([False, False])) - self.assertFalse(idx.hasnans) - tm.assert_numpy_array_equal(idx._nan_idxs, - np.array([], dtype=np.intp)) - - idx = pd.DatetimeIndex(['2011-01-01', 'NaT'], tz=tz) - self.assertTrue(idx._can_hold_na) - - tm.assert_numpy_array_equal(idx._isnan, np.array([False, True])) - self.assertTrue(idx.hasnans) - tm.assert_numpy_array_equal(idx._nan_idxs, - np.array([1], dtype=np.intp)) - - def test_equals(self): - # GH 13107 - for tz in [None, 'UTC', 'US/Eastern', 'Asia/Tokyo']: - idx = pd.DatetimeIndex(['2011-01-01', '2011-01-02', 'NaT']) - self.assertTrue(idx.equals(idx)) - self.assertTrue(idx.equals(idx.copy())) - self.assertTrue(idx.equals(idx.asobject)) - self.assertTrue(idx.asobject.equals(idx)) - self.assertTrue(idx.asobject.equals(idx.asobject)) - self.assertFalse(idx.equals(list(idx))) - self.assertFalse(idx.equals(pd.Series(idx))) - - idx2 = pd.DatetimeIndex(['2011-01-01', '2011-01-02', 'NaT'], - tz='US/Pacific') - self.assertFalse(idx.equals(idx2)) - self.assertFalse(idx.equals(idx2.copy())) - self.assertFalse(idx.equals(idx2.asobject)) - self.assertFalse(idx.asobject.equals(idx2)) - self.assertFalse(idx.equals(list(idx2))) - 
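# GH 11018, pinned by the removed test above: an index rebuilt from its
# raw int64 ordinals can recover the frequency with freq='infer'
# (constructor behaviour as of the pandas version this diff targets).
import pandas as pd

idx = pd.date_range('2011-01-01 09:00', freq='2H', periods=5)
rebuilt = pd.DatetimeIndex(idx.asi8, freq='infer')
print(rebuilt.freq)  # <2 * Hours>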
self.assertFalse(idx.equals(pd.Series(idx2))) - - # same internal, different tz - idx3 = pd.DatetimeIndex._simple_new(idx.asi8, tz='US/Pacific') - tm.assert_numpy_array_equal(idx.asi8, idx3.asi8) - self.assertFalse(idx.equals(idx3)) - self.assertFalse(idx.equals(idx3.copy())) - self.assertFalse(idx.equals(idx3.asobject)) - self.assertFalse(idx.asobject.equals(idx3)) - self.assertFalse(idx.equals(list(idx3))) - self.assertFalse(idx.equals(pd.Series(idx3))) - - -class TestTimedeltaIndexOps(Ops): - def setUp(self): - super(TestTimedeltaIndexOps, self).setUp() - mask = lambda x: isinstance(x, TimedeltaIndex) - self.is_valid_objs = [o for o in self.objs if mask(o)] - self.not_valid_objs = [] - - def test_ops_properties(self): - self.check_ops_properties(['days', 'hours', 'minutes', 'seconds', - 'milliseconds']) - self.check_ops_properties(['microseconds', 'nanoseconds']) - - def test_asobject_tolist(self): - idx = timedelta_range(start='1 days', periods=4, freq='D', name='idx') - expected_list = [Timedelta('1 days'), Timedelta('2 days'), - Timedelta('3 days'), Timedelta('4 days')] - expected = pd.Index(expected_list, dtype=object, name='idx') - result = idx.asobject - self.assertTrue(isinstance(result, Index)) - - self.assertEqual(result.dtype, object) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - self.assertEqual(idx.tolist(), expected_list) - - idx = TimedeltaIndex([timedelta(days=1), timedelta(days=2), pd.NaT, - timedelta(days=4)], name='idx') - expected_list = [Timedelta('1 days'), Timedelta('2 days'), pd.NaT, - Timedelta('4 days')] - expected = pd.Index(expected_list, dtype=object, name='idx') - result = idx.asobject - self.assertTrue(isinstance(result, Index)) - self.assertEqual(result.dtype, object) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - self.assertEqual(idx.tolist(), expected_list) - - def test_minmax(self): - - # monotonic - idx1 = TimedeltaIndex(['1 days', '2 days', '3 days']) - self.assertTrue(idx1.is_monotonic) - - # non-monotonic - idx2 = TimedeltaIndex(['1 days', np.nan, '3 days', 'NaT']) - self.assertFalse(idx2.is_monotonic) - - for idx in [idx1, idx2]: - self.assertEqual(idx.min(), Timedelta('1 days')), - self.assertEqual(idx.max(), Timedelta('3 days')), - self.assertEqual(idx.argmin(), 0) - self.assertEqual(idx.argmax(), 2) - - for op in ['min', 'max']: - # Return NaT - obj = TimedeltaIndex([]) - self.assertTrue(pd.isnull(getattr(obj, op)())) - - obj = TimedeltaIndex([pd.NaT]) - self.assertTrue(pd.isnull(getattr(obj, op)())) - - obj = TimedeltaIndex([pd.NaT, pd.NaT, pd.NaT]) - self.assertTrue(pd.isnull(getattr(obj, op)())) - - def test_numpy_minmax(self): - dr = pd.date_range(start='2016-01-15', end='2016-01-20') - td = TimedeltaIndex(np.asarray(dr)) - - self.assertEqual(np.min(td), Timedelta('16815 days')) - self.assertEqual(np.max(td), Timedelta('16820 days')) - - errmsg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, errmsg, np.min, td, out=0) - tm.assertRaisesRegexp(ValueError, errmsg, np.max, td, out=0) - - self.assertEqual(np.argmin(td), 0) - self.assertEqual(np.argmax(td), 5) - - if not _np_version_under1p10: - errmsg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, errmsg, np.argmin, td, out=0) - tm.assertRaisesRegexp(ValueError, errmsg, np.argmax, td, out=0) - - def test_round(self): - td = pd.timedelta_range(start='16801 days', periods=5, freq='30Min') - elt = td[1] - - expected_rng = TimedeltaIndex([ - 
Timedelta('16801 days 00:00:00'), - Timedelta('16801 days 00:00:00'), - Timedelta('16801 days 01:00:00'), - Timedelta('16801 days 02:00:00'), - Timedelta('16801 days 02:00:00'), - ]) - expected_elt = expected_rng[1] - - tm.assert_index_equal(td.round(freq='H'), expected_rng) - self.assertEqual(elt.round(freq='H'), expected_elt) - - msg = pd.tseries.frequencies._INVALID_FREQ_ERROR - with self.assertRaisesRegexp(ValueError, msg): - td.round(freq='foo') - with tm.assertRaisesRegexp(ValueError, msg): - elt.round(freq='foo') - - msg = " is a non-fixed frequency" - tm.assertRaisesRegexp(ValueError, msg, td.round, freq='M') - tm.assertRaisesRegexp(ValueError, msg, elt.round, freq='M') - - def test_representation(self): - idx1 = TimedeltaIndex([], freq='D') - idx2 = TimedeltaIndex(['1 days'], freq='D') - idx3 = TimedeltaIndex(['1 days', '2 days'], freq='D') - idx4 = TimedeltaIndex(['1 days', '2 days', '3 days'], freq='D') - idx5 = TimedeltaIndex(['1 days 00:00:01', '2 days', '3 days']) - - exp1 = """TimedeltaIndex([], dtype='timedelta64[ns]', freq='D')""" - - exp2 = ("TimedeltaIndex(['1 days'], dtype='timedelta64[ns]', " - "freq='D')") - - exp3 = ("TimedeltaIndex(['1 days', '2 days'], " - "dtype='timedelta64[ns]', freq='D')") - - exp4 = ("TimedeltaIndex(['1 days', '2 days', '3 days'], " - "dtype='timedelta64[ns]', freq='D')") - - exp5 = ("TimedeltaIndex(['1 days 00:00:01', '2 days 00:00:00', " - "'3 days 00:00:00'], dtype='timedelta64[ns]', freq=None)") - - with pd.option_context('display.width', 300): - for idx, expected in zip([idx1, idx2, idx3, idx4, idx5], - [exp1, exp2, exp3, exp4, exp5]): - for func in ['__repr__', '__unicode__', '__str__']: - result = getattr(idx, func)() - self.assertEqual(result, expected) - - def test_representation_to_series(self): - idx1 = TimedeltaIndex([], freq='D') - idx2 = TimedeltaIndex(['1 days'], freq='D') - idx3 = TimedeltaIndex(['1 days', '2 days'], freq='D') - idx4 = TimedeltaIndex(['1 days', '2 days', '3 days'], freq='D') - idx5 = TimedeltaIndex(['1 days 00:00:01', '2 days', '3 days']) - - exp1 = """Series([], dtype: timedelta64[ns])""" - - exp2 = """0 1 days -dtype: timedelta64[ns]""" - - exp3 = """0 1 days -1 2 days -dtype: timedelta64[ns]""" - - exp4 = """0 1 days -1 2 days -2 3 days -dtype: timedelta64[ns]""" - - exp5 = """0 1 days 00:00:01 -1 2 days 00:00:00 -2 3 days 00:00:00 -dtype: timedelta64[ns]""" - - with pd.option_context('display.width', 300): - for idx, expected in zip([idx1, idx2, idx3, idx4, idx5], - [exp1, exp2, exp3, exp4, exp5]): - result = repr(pd.Series(idx)) - self.assertEqual(result, expected) - - def test_summary(self): - # GH9116 - idx1 = TimedeltaIndex([], freq='D') - idx2 = TimedeltaIndex(['1 days'], freq='D') - idx3 = TimedeltaIndex(['1 days', '2 days'], freq='D') - idx4 = TimedeltaIndex(['1 days', '2 days', '3 days'], freq='D') - idx5 = TimedeltaIndex(['1 days 00:00:01', '2 days', '3 days']) - - exp1 = """TimedeltaIndex: 0 entries -Freq: D""" - - exp2 = """TimedeltaIndex: 1 entries, 1 days to 1 days -Freq: D""" - - exp3 = """TimedeltaIndex: 2 entries, 1 days to 2 days -Freq: D""" - - exp4 = """TimedeltaIndex: 3 entries, 1 days to 3 days -Freq: D""" - - exp5 = ("TimedeltaIndex: 3 entries, 1 days 00:00:01 to 3 days " - "00:00:00") - - for idx, expected in zip([idx1, idx2, idx3, idx4, idx5], - [exp1, exp2, exp3, exp4, exp5]): - result = idx.summary() - self.assertEqual(result, expected) - - def test_add_iadd(self): - - # only test adding/sub offsets as + is now numeric - - # offset - offsets = [pd.offsets.Hour(2), 
timedelta(hours=2), - np.timedelta64(2, 'h'), Timedelta(hours=2)] - - for delta in offsets: - rng = timedelta_range('1 days', '10 days') - result = rng + delta - expected = timedelta_range('1 days 02:00:00', '10 days 02:00:00', - freq='D') - tm.assert_index_equal(result, expected) - rng += delta - tm.assert_index_equal(rng, expected) - - # int - rng = timedelta_range('1 days 09:00:00', freq='H', periods=10) - result = rng + 1 - expected = timedelta_range('1 days 10:00:00', freq='H', periods=10) - tm.assert_index_equal(result, expected) - rng += 1 - tm.assert_index_equal(rng, expected) - - def test_sub_isub(self): - # only test adding/sub offsets as - is now numeric - - # offset - offsets = [pd.offsets.Hour(2), timedelta(hours=2), - np.timedelta64(2, 'h'), Timedelta(hours=2)] - - for delta in offsets: - rng = timedelta_range('1 days', '10 days') - result = rng - delta - expected = timedelta_range('0 days 22:00:00', '9 days 22:00:00') - tm.assert_index_equal(result, expected) - rng -= delta - tm.assert_index_equal(rng, expected) - - # int - rng = timedelta_range('1 days 09:00:00', freq='H', periods=10) - result = rng - 1 - expected = timedelta_range('1 days 08:00:00', freq='H', periods=10) - tm.assert_index_equal(result, expected) - rng -= 1 - tm.assert_index_equal(rng, expected) - - idx = TimedeltaIndex(['1 day', '2 day']) - msg = "cannot subtract a datelike from a TimedeltaIndex" - with tm.assertRaisesRegexp(TypeError, msg): - idx - Timestamp('2011-01-01') - - result = Timestamp('2011-01-01') + idx - expected = DatetimeIndex(['2011-01-02', '2011-01-03']) - tm.assert_index_equal(result, expected) - - def test_ops_compat(self): - - offsets = [pd.offsets.Hour(2), timedelta(hours=2), - np.timedelta64(2, 'h'), Timedelta(hours=2)] - - rng = timedelta_range('1 days', '10 days', name='foo') - - # multiply - for offset in offsets: - self.assertRaises(TypeError, lambda: rng * offset) - - # divide - expected = Int64Index((np.arange(10) + 1) * 12, name='foo') - for offset in offsets: - result = rng / offset - tm.assert_index_equal(result, expected, exact=False) - - # divide with nats - rng = TimedeltaIndex(['1 days', pd.NaT, '2 days'], name='foo') - expected = Float64Index([12, np.nan, 24], name='foo') - for offset in offsets: - result = rng / offset - tm.assert_index_equal(result, expected) - - # don't allow division by NaT (maybe we could in the future) - self.assertRaises(TypeError, lambda: rng / pd.NaT) - - def test_subtraction_ops(self): - - # with datetimes/timedelta and tdi/dti - tdi = TimedeltaIndex(['1 days', pd.NaT, '2 days'], name='foo') - dti = date_range('20130101', periods=3, name='bar') - td = Timedelta('1 days') - dt = Timestamp('20130101') - - self.assertRaises(TypeError, lambda: tdi - dt) - self.assertRaises(TypeError, lambda: tdi - dti) - self.assertRaises(TypeError, lambda: td - dt) - self.assertRaises(TypeError, lambda: td - dti) - - result = dt - dti - expected = TimedeltaIndex(['0 days', '-1 days', '-2 days'], name='bar') - tm.assert_index_equal(result, expected) - - result = dti - dt - expected = TimedeltaIndex(['0 days', '1 days', '2 days'], name='bar') - tm.assert_index_equal(result, expected) - - result = tdi - td - expected = TimedeltaIndex(['0 days', pd.NaT, '1 days'], name='foo') - tm.assert_index_equal(result, expected, check_names=False) - - result = td - tdi - expected = TimedeltaIndex(['0 days', pd.NaT, '-1 days'], name='foo') - tm.assert_index_equal(result, expected, check_names=False) - - result = dti - td - expected = DatetimeIndex( - ['20121231', '20130101', 
'20130102'], name='bar') - tm.assert_index_equal(result, expected, check_names=False) - - result = dt - tdi - expected = DatetimeIndex(['20121231', pd.NaT, '20121230'], name='foo') - tm.assert_index_equal(result, expected) - - def test_subtraction_ops_with_tz(self): - - # check that dt/dti subtraction ops with tz are validated - dti = date_range('20130101', periods=3) - ts = Timestamp('20130101') - dt = ts.to_pydatetime() - dti_tz = date_range('20130101', periods=3).tz_localize('US/Eastern') - ts_tz = Timestamp('20130101').tz_localize('US/Eastern') - ts_tz2 = Timestamp('20130101').tz_localize('CET') - dt_tz = ts_tz.to_pydatetime() - td = Timedelta('1 days') - - def _check(result, expected): - self.assertEqual(result, expected) - self.assertIsInstance(result, Timedelta) - - # scalars - result = ts - ts - expected = Timedelta('0 days') - _check(result, expected) - - result = dt_tz - ts_tz - expected = Timedelta('0 days') - _check(result, expected) - - result = ts_tz - dt_tz - expected = Timedelta('0 days') - _check(result, expected) - - # tz mismatches - self.assertRaises(TypeError, lambda: dt_tz - ts) - self.assertRaises(TypeError, lambda: dt_tz - dt) - self.assertRaises(TypeError, lambda: dt_tz - ts_tz2) - self.assertRaises(TypeError, lambda: dt - dt_tz) - self.assertRaises(TypeError, lambda: ts - dt_tz) - self.assertRaises(TypeError, lambda: ts_tz2 - ts) - self.assertRaises(TypeError, lambda: ts_tz2 - dt) - self.assertRaises(TypeError, lambda: ts_tz - ts_tz2) - - # with dti - self.assertRaises(TypeError, lambda: dti - ts_tz) - self.assertRaises(TypeError, lambda: dti_tz - ts) - self.assertRaises(TypeError, lambda: dti_tz - ts_tz2) - - result = dti_tz - dt_tz - expected = TimedeltaIndex(['0 days', '1 days', '2 days']) - tm.assert_index_equal(result, expected) - - result = dt_tz - dti_tz - expected = TimedeltaIndex(['0 days', '-1 days', '-2 days']) - tm.assert_index_equal(result, expected) - - result = dti_tz - ts_tz - expected = TimedeltaIndex(['0 days', '1 days', '2 days']) - tm.assert_index_equal(result, expected) - - result = ts_tz - dti_tz - expected = TimedeltaIndex(['0 days', '-1 days', '-2 days']) - tm.assert_index_equal(result, expected) - - result = td - td - expected = Timedelta('0 days') - _check(result, expected) - - result = dti_tz - td - expected = DatetimeIndex( - ['20121231', '20130101', '20130102'], tz='US/Eastern') - tm.assert_index_equal(result, expected) - - def test_dti_tdi_numeric_ops(self): - - # These are normally union/diff set-like ops - tdi = TimedeltaIndex(['1 days', pd.NaT, '2 days'], name='foo') - dti = date_range('20130101', periods=3, name='bar') - - # TODO(wesm): unused? 
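The tz cases above reduce to one invariant: subtraction is defined only between two tz-naive or between two tz-aware datetimes, and the result carries no timezone. A short illustrative sketch, separate from the test suite:

```python
import pandas as pd

naive = pd.Timestamp('20130101')
aware = pd.Timestamp('20130101', tz='US/Eastern')

# Same awareness on both sides: allowed, and the result is a plain Timedelta.
assert aware - aware == pd.Timedelta('0 days')

# Mixed naive/aware operands are rejected with TypeError.
try:
    naive - aware
except TypeError as exc:
    print('refused:', exc)
```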
- # td = Timedelta('1 days') - # dt = Timestamp('20130101') - - result = tdi - tdi - expected = TimedeltaIndex(['0 days', pd.NaT, '0 days'], name='foo') - tm.assert_index_equal(result, expected) - - result = tdi + tdi - expected = TimedeltaIndex(['2 days', pd.NaT, '4 days'], name='foo') - tm.assert_index_equal(result, expected) - - result = dti - tdi # name will be reset - expected = DatetimeIndex(['20121231', pd.NaT, '20130101']) - tm.assert_index_equal(result, expected) - - def test_sub_period(self): - # GH 13078 - # not supported, check TypeError - p = pd.Period('2011-01-01', freq='D') - - for freq in [None, 'H']: - idx = pd.TimedeltaIndex(['1 hours', '2 hours'], freq=freq) - - with tm.assertRaises(TypeError): - idx - p - - with tm.assertRaises(TypeError): - p - idx - - def test_addition_ops(self): - - # with datetimes/timedelta and tdi/dti - tdi = TimedeltaIndex(['1 days', pd.NaT, '2 days'], name='foo') - dti = date_range('20130101', periods=3, name='bar') - td = Timedelta('1 days') - dt = Timestamp('20130101') - - result = tdi + dt - expected = DatetimeIndex(['20130102', pd.NaT, '20130103'], name='foo') - tm.assert_index_equal(result, expected) - - result = dt + tdi - expected = DatetimeIndex(['20130102', pd.NaT, '20130103'], name='foo') - tm.assert_index_equal(result, expected) - - result = td + tdi - expected = TimedeltaIndex(['2 days', pd.NaT, '3 days'], name='foo') - tm.assert_index_equal(result, expected) - - result = tdi + td - expected = TimedeltaIndex(['2 days', pd.NaT, '3 days'], name='foo') - tm.assert_index_equal(result, expected) - - # unequal length - self.assertRaises(ValueError, lambda: tdi + dti[0:1]) - self.assertRaises(ValueError, lambda: tdi[0:1] + dti) - - # random indexes - self.assertRaises(TypeError, lambda: tdi + Int64Index([1, 2, 3])) - - # this is a union! 
- # self.assertRaises(TypeError, lambda : Int64Index([1,2,3]) + tdi) - - result = tdi + dti # name will be reset - expected = DatetimeIndex(['20130102', pd.NaT, '20130105']) - tm.assert_index_equal(result, expected) - - result = dti + tdi # name will be reset - expected = DatetimeIndex(['20130102', pd.NaT, '20130105']) - tm.assert_index_equal(result, expected) - - result = dt + td - expected = Timestamp('20130102') - self.assertEqual(result, expected) - - result = td + dt - expected = Timestamp('20130102') - self.assertEqual(result, expected) - - def test_comp_nat(self): - left = pd.TimedeltaIndex([pd.Timedelta('1 days'), pd.NaT, - pd.Timedelta('3 days')]) - right = pd.TimedeltaIndex([pd.NaT, pd.NaT, pd.Timedelta('3 days')]) - - for l, r in [(left, right), (left.asobject, right.asobject)]: - result = l == r - expected = np.array([False, False, True]) - tm.assert_numpy_array_equal(result, expected) - - result = l != r - expected = np.array([True, True, False]) - tm.assert_numpy_array_equal(result, expected) - - expected = np.array([False, False, False]) - tm.assert_numpy_array_equal(l == pd.NaT, expected) - tm.assert_numpy_array_equal(pd.NaT == r, expected) - - expected = np.array([True, True, True]) - tm.assert_numpy_array_equal(l != pd.NaT, expected) - tm.assert_numpy_array_equal(pd.NaT != l, expected) - - expected = np.array([False, False, False]) - tm.assert_numpy_array_equal(l < pd.NaT, expected) - tm.assert_numpy_array_equal(pd.NaT > l, expected) - - def test_value_counts_unique(self): - # GH 7735 - - idx = timedelta_range('1 days 09:00:00', freq='H', periods=10) - # create repeated values, 'n'th element is repeated by n+1 times - idx = TimedeltaIndex(np.repeat(idx.values, range(1, len(idx) + 1))) - - exp_idx = timedelta_range('1 days 18:00:00', freq='-1H', periods=10) - expected = Series(range(10, 0, -1), index=exp_idx, dtype='int64') - - for obj in [idx, Series(idx)]: - tm.assert_series_equal(obj.value_counts(), expected) - - expected = timedelta_range('1 days 09:00:00', freq='H', periods=10) - tm.assert_index_equal(idx.unique(), expected) - - idx = TimedeltaIndex(['1 days 09:00:00', '1 days 09:00:00', - '1 days 09:00:00', '1 days 08:00:00', - '1 days 08:00:00', pd.NaT]) - - exp_idx = TimedeltaIndex(['1 days 09:00:00', '1 days 08:00:00']) - expected = Series([3, 2], index=exp_idx) - - for obj in [idx, Series(idx)]: - tm.assert_series_equal(obj.value_counts(), expected) - - exp_idx = TimedeltaIndex(['1 days 09:00:00', '1 days 08:00:00', - pd.NaT]) - expected = Series([3, 2, 1], index=exp_idx) - - for obj in [idx, Series(idx)]: - tm.assert_series_equal(obj.value_counts(dropna=False), expected) - - tm.assert_index_equal(idx.unique(), exp_idx) - - def test_nonunique_contains(self): - # GH 9512 - for idx in map(TimedeltaIndex, ([0, 1, 0], [0, 0, -1], [0, -1, -1], - ['00:01:00', '00:01:00', '00:02:00'], - ['00:01:00', '00:01:00', '00:00:01'])): - tm.assertIn(idx[0], idx) - - def test_unknown_attribute(self): - # GH 9680 - tdi = pd.timedelta_range(start=0, periods=10, freq='1s') - ts = pd.Series(np.random.normal(size=10), index=tdi) - self.assertNotIn('foo', ts.__dict__.keys()) - self.assertRaises(AttributeError, lambda: ts.foo) - - def test_order(self): - # GH 10295 - idx1 = TimedeltaIndex(['1 day', '2 day', '3 day'], freq='D', - name='idx') - idx2 = TimedeltaIndex( - ['1 hour', '2 hour', '3 hour'], freq='H', name='idx') - - for idx in [idx1, idx2]: - ordered = idx.sort_values() - self.assert_index_equal(ordered, idx) - self.assertEqual(ordered.freq, idx.freq) - - ordered = 
idx.sort_values(ascending=False) - expected = idx[::-1] - self.assert_index_equal(ordered, expected) - self.assertEqual(ordered.freq, expected.freq) - self.assertEqual(ordered.freq.n, -1) - - ordered, indexer = idx.sort_values(return_indexer=True) - self.assert_index_equal(ordered, idx) - self.assert_numpy_array_equal(indexer, - np.array([0, 1, 2]), - check_dtype=False) - self.assertEqual(ordered.freq, idx.freq) - - ordered, indexer = idx.sort_values(return_indexer=True, - ascending=False) - self.assert_index_equal(ordered, idx[::-1]) - self.assertEqual(ordered.freq, expected.freq) - self.assertEqual(ordered.freq.n, -1) - - idx1 = TimedeltaIndex(['1 hour', '3 hour', '5 hour', - '2 hour ', '1 hour'], name='idx1') - exp1 = TimedeltaIndex(['1 hour', '1 hour', '2 hour', - '3 hour', '5 hour'], name='idx1') - - idx2 = TimedeltaIndex(['1 day', '3 day', '5 day', - '2 day', '1 day'], name='idx2') - - # TODO(wesm): unused? - # exp2 = TimedeltaIndex(['1 day', '1 day', '2 day', - # '3 day', '5 day'], name='idx2') - - # idx3 = TimedeltaIndex([pd.NaT, '3 minute', '5 minute', - # '2 minute', pd.NaT], name='idx3') - # exp3 = TimedeltaIndex([pd.NaT, pd.NaT, '2 minute', '3 minute', - # '5 minute'], name='idx3') - - for idx, expected in [(idx1, exp1)]: - ordered = idx.sort_values() - self.assert_index_equal(ordered, expected) - self.assertIsNone(ordered.freq) - - ordered = idx.sort_values(ascending=False) - self.assert_index_equal(ordered, expected[::-1]) - self.assertIsNone(ordered.freq) - - ordered, indexer = idx.sort_values(return_indexer=True) - self.assert_index_equal(ordered, expected) - - exp = np.array([0, 4, 3, 1, 2]) - self.assert_numpy_array_equal(indexer, exp, check_dtype=False) - self.assertIsNone(ordered.freq) - - ordered, indexer = idx.sort_values(return_indexer=True, - ascending=False) - self.assert_index_equal(ordered, expected[::-1]) - - exp = np.array([2, 1, 3, 4, 0]) - self.assert_numpy_array_equal(indexer, exp, check_dtype=False) - self.assertIsNone(ordered.freq) - - def test_getitem(self): - idx1 = pd.timedelta_range('1 day', '31 day', freq='D', name='idx') - - for idx in [idx1]: - result = idx[0] - self.assertEqual(result, pd.Timedelta('1 day')) - - result = idx[0:5] - expected = pd.timedelta_range('1 day', '5 day', freq='D', - name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - result = idx[0:10:2] - expected = pd.timedelta_range('1 day', '9 day', freq='2D', - name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - result = idx[-20:-5:3] - expected = pd.timedelta_range('12 day', '24 day', freq='3D', - name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - result = idx[4::-1] - expected = TimedeltaIndex(['5 day', '4 day', '3 day', - '2 day', '1 day'], - freq='-1D', name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - def test_drop_duplicates_metadata(self): - # GH 10115 - idx = pd.timedelta_range('1 day', '31 day', freq='D', name='idx') - result = idx.drop_duplicates() - self.assert_index_equal(idx, result) - self.assertEqual(idx.freq, result.freq) - - idx_dup = idx.append(idx) - self.assertIsNone(idx_dup.freq) # freq is reset - result = idx_dup.drop_duplicates() - self.assert_index_equal(idx, result) - self.assertIsNone(result.freq) - - def test_drop_duplicates(self): - # to check Index/Series compat - base = pd.timedelta_range('1 day', '31 day', 
freq='D', name='idx') - idx = base.append(base[:5]) - - res = idx.drop_duplicates() - tm.assert_index_equal(res, base) - res = Series(idx).drop_duplicates() - tm.assert_series_equal(res, Series(base)) - - res = idx.drop_duplicates(keep='last') - exp = base[5:].append(base[:5]) - tm.assert_index_equal(res, exp) - res = Series(idx).drop_duplicates(keep='last') - tm.assert_series_equal(res, Series(exp, index=np.arange(5, 36))) - - res = idx.drop_duplicates(keep=False) - tm.assert_index_equal(res, base[5:]) - res = Series(idx).drop_duplicates(keep=False) - tm.assert_series_equal(res, Series(base[5:], index=np.arange(5, 31))) - - def test_take(self): - # GH 10295 - idx1 = pd.timedelta_range('1 day', '31 day', freq='D', name='idx') - - for idx in [idx1]: - result = idx.take([0]) - self.assertEqual(result, pd.Timedelta('1 day')) - - result = idx.take([-1]) - self.assertEqual(result, pd.Timedelta('31 day')) - - result = idx.take([0, 1, 2]) - expected = pd.timedelta_range('1 day', '3 day', freq='D', - name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - result = idx.take([0, 2, 4]) - expected = pd.timedelta_range('1 day', '5 day', freq='2D', - name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - result = idx.take([7, 4, 1]) - expected = pd.timedelta_range('8 day', '2 day', freq='-3D', - name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - - result = idx.take([3, 2, 5]) - expected = TimedeltaIndex(['4 day', '3 day', '6 day'], name='idx') - self.assert_index_equal(result, expected) - self.assertIsNone(result.freq) - - result = idx.take([-3, 2, 5]) - expected = TimedeltaIndex(['29 day', '3 day', '6 day'], name='idx') - self.assert_index_equal(result, expected) - self.assertIsNone(result.freq) - - def test_take_invalid_kwargs(self): - idx = pd.timedelta_range('1 day', '31 day', freq='D', name='idx') - indices = [1, 6, 5, 9, 10, 13, 15, 3] - - msg = r"take\(\) got an unexpected keyword argument 'foo'" - tm.assertRaisesRegexp(TypeError, msg, idx.take, - indices, foo=2) - - msg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, idx.take, - indices, out=indices) - - msg = "the 'mode' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, idx.take, - indices, mode='clip') - - def test_infer_freq(self): - # GH 11018 - for freq in ['D', '3D', '-3D', 'H', '2H', '-2H', 'T', '2T', 'S', '-3S' - ]: - idx = pd.timedelta_range('1', freq=freq, periods=10) - result = pd.TimedeltaIndex(idx.asi8, freq='infer') - tm.assert_index_equal(idx, result) - self.assertEqual(result.freq, freq) - - def test_nat_new(self): - - idx = pd.timedelta_range('1', freq='D', periods=5, name='x') - result = idx._nat_new() - exp = pd.TimedeltaIndex([pd.NaT] * 5, name='x') - tm.assert_index_equal(result, exp) - - result = idx._nat_new(box=False) - exp = np.array([tslib.iNaT] * 5, dtype=np.int64) - tm.assert_numpy_array_equal(result, exp) - - def test_shift(self): - # GH 9903 - idx = pd.TimedeltaIndex([], name='xxx') - tm.assert_index_equal(idx.shift(0, freq='H'), idx) - tm.assert_index_equal(idx.shift(3, freq='H'), idx) - - idx = pd.TimedeltaIndex(['5 hours', '6 hours', '9 hours'], name='xxx') - tm.assert_index_equal(idx.shift(0, freq='H'), idx) - exp = pd.TimedeltaIndex(['8 hours', '9 hours', '12 hours'], name='xxx') - tm.assert_index_equal(idx.shift(3, freq='H'), exp) - exp = pd.TimedeltaIndex(['2 hours', '3 hours', '6 hours'], name='xxx') 
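As an aside, the `shift` semantics asserted here are pure label arithmetic: with a `freq`, each element moves by `n * freq` and nothing else about the index changes. A hedged sketch, assuming the `'H'` frequency alias of the pandas version under test:

```python
import pandas as pd

idx = pd.TimedeltaIndex(['5 hours', '6 hours', '9 hours'])

# Each element is advanced by 3 * Hour(1).
shifted = idx.shift(3, freq='H')
# -> TimedeltaIndex(['8 hours', '9 hours', '12 hours'], ...)

assert idx.shift(0, freq='H').equals(idx)  # a zero shift is a no-op
```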
- tm.assert_index_equal(idx.shift(-3, freq='H'), exp) - - tm.assert_index_equal(idx.shift(0, freq='T'), idx) - exp = pd.TimedeltaIndex(['05:03:00', '06:03:00', '9:03:00'], - name='xxx') - tm.assert_index_equal(idx.shift(3, freq='T'), exp) - exp = pd.TimedeltaIndex(['04:57:00', '05:57:00', '8:57:00'], - name='xxx') - tm.assert_index_equal(idx.shift(-3, freq='T'), exp) - - def test_repeat(self): - index = pd.timedelta_range('1 days', periods=2, freq='D') - exp = pd.TimedeltaIndex(['1 days', '1 days', '2 days', '2 days']) - for res in [index.repeat(2), np.repeat(index, 2)]: - tm.assert_index_equal(res, exp) - self.assertIsNone(res.freq) - - index = TimedeltaIndex(['1 days', 'NaT', '3 days']) - exp = TimedeltaIndex(['1 days', '1 days', '1 days', - 'NaT', 'NaT', 'NaT', - '3 days', '3 days', '3 days']) - for res in [index.repeat(3), np.repeat(index, 3)]: - tm.assert_index_equal(res, exp) - self.assertIsNone(res.freq) - - def test_nat(self): - self.assertIs(pd.TimedeltaIndex._na_value, pd.NaT) - self.assertIs(pd.TimedeltaIndex([])._na_value, pd.NaT) - - idx = pd.TimedeltaIndex(['1 days', '2 days']) - self.assertTrue(idx._can_hold_na) - - tm.assert_numpy_array_equal(idx._isnan, np.array([False, False])) - self.assertFalse(idx.hasnans) - tm.assert_numpy_array_equal(idx._nan_idxs, - np.array([], dtype=np.intp)) - - idx = pd.TimedeltaIndex(['1 days', 'NaT']) - self.assertTrue(idx._can_hold_na) - - tm.assert_numpy_array_equal(idx._isnan, np.array([False, True])) - self.assertTrue(idx.hasnans) - tm.assert_numpy_array_equal(idx._nan_idxs, - np.array([1], dtype=np.intp)) - - def test_equals(self): - # GH 13107 - idx = pd.TimedeltaIndex(['1 days', '2 days', 'NaT']) - self.assertTrue(idx.equals(idx)) - self.assertTrue(idx.equals(idx.copy())) - self.assertTrue(idx.equals(idx.asobject)) - self.assertTrue(idx.asobject.equals(idx)) - self.assertTrue(idx.asobject.equals(idx.asobject)) - self.assertFalse(idx.equals(list(idx))) - self.assertFalse(idx.equals(pd.Series(idx))) - - idx2 = pd.TimedeltaIndex(['2 days', '1 days', 'NaT']) - self.assertFalse(idx.equals(idx2)) - self.assertFalse(idx.equals(idx2.copy())) - self.assertFalse(idx.equals(idx2.asobject)) - self.assertFalse(idx.asobject.equals(idx2)) - self.assertFalse(idx.asobject.equals(idx2.asobject)) - self.assertFalse(idx.equals(list(idx2))) - self.assertFalse(idx.equals(pd.Series(idx2))) - - -class TestPeriodIndexOps(Ops): - def setUp(self): - super(TestPeriodIndexOps, self).setUp() - mask = lambda x: (isinstance(x, DatetimeIndex) or - isinstance(x, PeriodIndex)) - self.is_valid_objs = [o for o in self.objs if mask(o)] - self.not_valid_objs = [o for o in self.objs if not mask(o)] - - def test_ops_properties(self): - self.check_ops_properties( - ['year', 'month', 'day', 'hour', 'minute', 'second', 'weekofyear', - 'week', 'dayofweek', 'dayofyear', 'quarter']) - self.check_ops_properties(['qyear'], - lambda x: isinstance(x, PeriodIndex)) - - def test_asobject_tolist(self): - idx = pd.period_range(start='2013-01-01', periods=4, freq='M', - name='idx') - expected_list = [pd.Period('2013-01-31', freq='M'), - pd.Period('2013-02-28', freq='M'), - pd.Period('2013-03-31', freq='M'), - pd.Period('2013-04-30', freq='M')] - expected = pd.Index(expected_list, dtype=object, name='idx') - result = idx.asobject - self.assertTrue(isinstance(result, Index)) - self.assertEqual(result.dtype, object) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - self.assertEqual(idx.tolist(), expected_list) - - idx = PeriodIndex(['2013-01-01', 
'2013-01-02', 'NaT', - '2013-01-04'], freq='D', name='idx') - expected_list = [pd.Period('2013-01-01', freq='D'), - pd.Period('2013-01-02', freq='D'), - pd.Period('NaT', freq='D'), - pd.Period('2013-01-04', freq='D')] - expected = pd.Index(expected_list, dtype=object, name='idx') - result = idx.asobject - self.assertTrue(isinstance(result, Index)) - self.assertEqual(result.dtype, object) - tm.assert_index_equal(result, expected) - for i in [0, 1, 3]: - self.assertEqual(result[i], expected[i]) - self.assertIs(result[2], pd.NaT) - self.assertEqual(result.name, expected.name) - - result_list = idx.tolist() - for i in [0, 1, 3]: - self.assertEqual(result_list[i], expected_list[i]) - self.assertIs(result_list[2], pd.NaT) - - def test_minmax(self): - - # monotonic - idx1 = pd.PeriodIndex([pd.NaT, '2011-01-01', '2011-01-02', - '2011-01-03'], freq='D') - self.assertTrue(idx1.is_monotonic) - - # non-monotonic - idx2 = pd.PeriodIndex(['2011-01-01', pd.NaT, '2011-01-03', - '2011-01-02', pd.NaT], freq='D') - self.assertFalse(idx2.is_monotonic) - - for idx in [idx1, idx2]: - self.assertEqual(idx.min(), pd.Period('2011-01-01', freq='D')) - self.assertEqual(idx.max(), pd.Period('2011-01-03', freq='D')) - self.assertEqual(idx1.argmin(), 1) - self.assertEqual(idx2.argmin(), 0) - self.assertEqual(idx1.argmax(), 3) - self.assertEqual(idx2.argmax(), 2) - - for op in ['min', 'max']: - # Return NaT - obj = PeriodIndex([], freq='M') - result = getattr(obj, op)() - self.assertIs(result, tslib.NaT) - - obj = PeriodIndex([pd.NaT], freq='M') - result = getattr(obj, op)() - self.assertIs(result, tslib.NaT) - - obj = PeriodIndex([pd.NaT, pd.NaT, pd.NaT], freq='M') - result = getattr(obj, op)() - self.assertIs(result, tslib.NaT) - - def test_numpy_minmax(self): - pr = pd.period_range(start='2016-01-15', end='2016-01-20') - - self.assertEqual(np.min(pr), Period('2016-01-15', freq='D')) - self.assertEqual(np.max(pr), Period('2016-01-20', freq='D')) - - errmsg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, errmsg, np.min, pr, out=0) - tm.assertRaisesRegexp(ValueError, errmsg, np.max, pr, out=0) - - self.assertEqual(np.argmin(pr), 0) - self.assertEqual(np.argmax(pr), 5) - - if not _np_version_under1p10: - errmsg = "the 'out' parameter is not supported" - tm.assertRaisesRegexp(ValueError, errmsg, np.argmin, pr, out=0) - tm.assertRaisesRegexp(ValueError, errmsg, np.argmax, pr, out=0) - - def test_representation(self): - # GH 7601 - idx1 = PeriodIndex([], freq='D') - idx2 = PeriodIndex(['2011-01-01'], freq='D') - idx3 = PeriodIndex(['2011-01-01', '2011-01-02'], freq='D') - idx4 = PeriodIndex(['2011-01-01', '2011-01-02', '2011-01-03'], - freq='D') - idx5 = PeriodIndex(['2011', '2012', '2013'], freq='A') - idx6 = PeriodIndex(['2011-01-01 09:00', '2012-02-01 10:00', - 'NaT'], freq='H') - idx7 = pd.period_range('2013Q1', periods=1, freq="Q") - idx8 = pd.period_range('2013Q1', periods=2, freq="Q") - idx9 = pd.period_range('2013Q1', periods=3, freq="Q") - idx10 = PeriodIndex(['2011-01-01', '2011-02-01'], freq='3D') - - exp1 = """PeriodIndex([], dtype='period[D]', freq='D')""" - - exp2 = """PeriodIndex(['2011-01-01'], dtype='period[D]', freq='D')""" - - exp3 = ("PeriodIndex(['2011-01-01', '2011-01-02'], dtype='period[D]', " - "freq='D')") - - exp4 = ("PeriodIndex(['2011-01-01', '2011-01-02', '2011-01-03'], " - "dtype='period[D]', freq='D')") - - exp5 = ("PeriodIndex(['2011', '2012', '2013'], dtype='period[A-DEC]', " - "freq='A-DEC')") - - exp6 = ("PeriodIndex(['2011-01-01 09:00', '2012-02-01 10:00', 
'NaT'], " - "dtype='period[H]', freq='H')") - - exp7 = ("PeriodIndex(['2013Q1'], dtype='period[Q-DEC]', " - "freq='Q-DEC')") - - exp8 = ("PeriodIndex(['2013Q1', '2013Q2'], dtype='period[Q-DEC]', " - "freq='Q-DEC')") - - exp9 = ("PeriodIndex(['2013Q1', '2013Q2', '2013Q3'], " - "dtype='period[Q-DEC]', freq='Q-DEC')") - - exp10 = ("PeriodIndex(['2011-01-01', '2011-02-01'], " - "dtype='period[3D]', freq='3D')") - - for idx, expected in zip([idx1, idx2, idx3, idx4, idx5, - idx6, idx7, idx8, idx9, idx10], - [exp1, exp2, exp3, exp4, exp5, - exp6, exp7, exp8, exp9, exp10]): - for func in ['__repr__', '__unicode__', '__str__']: - result = getattr(idx, func)() - self.assertEqual(result, expected) - - def test_representation_to_series(self): - # GH 10971 - idx1 = PeriodIndex([], freq='D') - idx2 = PeriodIndex(['2011-01-01'], freq='D') - idx3 = PeriodIndex(['2011-01-01', '2011-01-02'], freq='D') - idx4 = PeriodIndex(['2011-01-01', '2011-01-02', - '2011-01-03'], freq='D') - idx5 = PeriodIndex(['2011', '2012', '2013'], freq='A') - idx6 = PeriodIndex(['2011-01-01 09:00', '2012-02-01 10:00', - 'NaT'], freq='H') - - idx7 = pd.period_range('2013Q1', periods=1, freq="Q") - idx8 = pd.period_range('2013Q1', periods=2, freq="Q") - idx9 = pd.period_range('2013Q1', periods=3, freq="Q") - - exp1 = """Series([], dtype: object)""" - - exp2 = """0 2011-01-01 -dtype: object""" - - exp3 = """0 2011-01-01 -1 2011-01-02 -dtype: object""" - - exp4 = """0 2011-01-01 -1 2011-01-02 -2 2011-01-03 -dtype: object""" - - exp5 = """0 2011 -1 2012 -2 2013 -dtype: object""" - - exp6 = """0 2011-01-01 09:00 -1 2012-02-01 10:00 -2 NaT -dtype: object""" - - exp7 = """0 2013Q1 -dtype: object""" - - exp8 = """0 2013Q1 -1 2013Q2 -dtype: object""" - - exp9 = """0 2013Q1 -1 2013Q2 -2 2013Q3 -dtype: object""" - - for idx, expected in zip([idx1, idx2, idx3, idx4, idx5, - idx6, idx7, idx8, idx9], - [exp1, exp2, exp3, exp4, exp5, - exp6, exp7, exp8, exp9]): - result = repr(pd.Series(idx)) - self.assertEqual(result, expected) - - def test_summary(self): - # GH9116 - idx1 = PeriodIndex([], freq='D') - idx2 = PeriodIndex(['2011-01-01'], freq='D') - idx3 = PeriodIndex(['2011-01-01', '2011-01-02'], freq='D') - idx4 = PeriodIndex( - ['2011-01-01', '2011-01-02', '2011-01-03'], freq='D') - idx5 = PeriodIndex(['2011', '2012', '2013'], freq='A') - idx6 = PeriodIndex( - ['2011-01-01 09:00', '2012-02-01 10:00', 'NaT'], freq='H') - - idx7 = pd.period_range('2013Q1', periods=1, freq="Q") - idx8 = pd.period_range('2013Q1', periods=2, freq="Q") - idx9 = pd.period_range('2013Q1', periods=3, freq="Q") - - exp1 = """PeriodIndex: 0 entries -Freq: D""" - - exp2 = """PeriodIndex: 1 entries, 2011-01-01 to 2011-01-01 -Freq: D""" - - exp3 = """PeriodIndex: 2 entries, 2011-01-01 to 2011-01-02 -Freq: D""" - - exp4 = """PeriodIndex: 3 entries, 2011-01-01 to 2011-01-03 -Freq: D""" - - exp5 = """PeriodIndex: 3 entries, 2011 to 2013 -Freq: A-DEC""" - - exp6 = """PeriodIndex: 3 entries, 2011-01-01 09:00 to NaT -Freq: H""" - - exp7 = """PeriodIndex: 1 entries, 2013Q1 to 2013Q1 -Freq: Q-DEC""" - - exp8 = """PeriodIndex: 2 entries, 2013Q1 to 2013Q2 -Freq: Q-DEC""" - - exp9 = """PeriodIndex: 3 entries, 2013Q1 to 2013Q3 -Freq: Q-DEC""" - - for idx, expected in zip([idx1, idx2, idx3, idx4, idx5, - idx6, idx7, idx8, idx9], - [exp1, exp2, exp3, exp4, exp5, - exp6, exp7, exp8, exp9]): - result = idx.summary() - self.assertEqual(result, expected) - - def test_resolution(self): - for freq, expected in zip(['A', 'Q', 'M', 'D', 'H', - 'T', 'S', 'L', 'U'], - ['day', 'day', 'day', 'day', - 
'hour', 'minute', 'second', - 'millisecond', 'microsecond']): - - idx = pd.period_range(start='2013-04-01', periods=30, freq=freq) - self.assertEqual(idx.resolution, expected) - - def test_union(self): - # union - rng1 = pd.period_range('1/1/2000', freq='D', periods=5) - other1 = pd.period_range('1/6/2000', freq='D', periods=5) - expected1 = pd.period_range('1/1/2000', freq='D', periods=10) - - rng2 = pd.period_range('1/1/2000', freq='D', periods=5) - other2 = pd.period_range('1/4/2000', freq='D', periods=5) - expected2 = pd.period_range('1/1/2000', freq='D', periods=8) - - rng3 = pd.period_range('1/1/2000', freq='D', periods=5) - other3 = pd.PeriodIndex([], freq='D') - expected3 = pd.period_range('1/1/2000', freq='D', periods=5) - - rng4 = pd.period_range('2000-01-01 09:00', freq='H', periods=5) - other4 = pd.period_range('2000-01-02 09:00', freq='H', periods=5) - expected4 = pd.PeriodIndex(['2000-01-01 09:00', '2000-01-01 10:00', - '2000-01-01 11:00', '2000-01-01 12:00', - '2000-01-01 13:00', '2000-01-02 09:00', - '2000-01-02 10:00', '2000-01-02 11:00', - '2000-01-02 12:00', '2000-01-02 13:00'], - freq='H') - - rng5 = pd.PeriodIndex(['2000-01-01 09:01', '2000-01-01 09:03', - '2000-01-01 09:05'], freq='T') - other5 = pd.PeriodIndex(['2000-01-01 09:01', '2000-01-01 09:05', - '2000-01-01 09:08'], - freq='T') - expected5 = pd.PeriodIndex(['2000-01-01 09:01', '2000-01-01 09:03', - '2000-01-01 09:05', '2000-01-01 09:08'], - freq='T') - - rng6 = pd.period_range('2000-01-01', freq='M', periods=7) - other6 = pd.period_range('2000-04-01', freq='M', periods=7) - expected6 = pd.period_range('2000-01-01', freq='M', periods=10) - - rng7 = pd.period_range('2003-01-01', freq='A', periods=5) - other7 = pd.period_range('1998-01-01', freq='A', periods=8) - expected7 = pd.period_range('1998-01-01', freq='A', periods=10) - - for rng, other, expected in [(rng1, other1, expected1), - (rng2, other2, expected2), - (rng3, other3, expected3), (rng4, other4, - expected4), - (rng5, other5, expected5), (rng6, other6, - expected6), - (rng7, other7, expected7)]: - - result_union = rng.union(other) - tm.assert_index_equal(result_union, expected) - - def test_add_iadd(self): - rng = pd.period_range('1/1/2000', freq='D', periods=5) - other = pd.period_range('1/6/2000', freq='D', periods=5) - - # previously performed setop union, now raises TypeError (GH14164) - with tm.assertRaises(TypeError): - rng + other - - with tm.assertRaises(TypeError): - rng += other - - # offset - # DateOffset - rng = pd.period_range('2014', '2024', freq='A') - result = rng + pd.offsets.YearEnd(5) - expected = pd.period_range('2019', '2029', freq='A') - tm.assert_index_equal(result, expected) - rng += pd.offsets.YearEnd(5) - tm.assert_index_equal(rng, expected) - - for o in [pd.offsets.YearBegin(2), pd.offsets.MonthBegin(1), - pd.offsets.Minute(), np.timedelta64(365, 'D'), - timedelta(365), Timedelta(days=365)]: - msg = ('Input has different freq(=.+)? 
' - 'from PeriodIndex\\(freq=A-DEC\\)') - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - rng + o - - rng = pd.period_range('2014-01', '2016-12', freq='M') - result = rng + pd.offsets.MonthEnd(5) - expected = pd.period_range('2014-06', '2017-05', freq='M') - tm.assert_index_equal(result, expected) - rng += pd.offsets.MonthEnd(5) - tm.assert_index_equal(rng, expected) - - for o in [pd.offsets.YearBegin(2), pd.offsets.MonthBegin(1), - pd.offsets.Minute(), np.timedelta64(365, 'D'), - timedelta(365), Timedelta(days=365)]: - rng = pd.period_range('2014-01', '2016-12', freq='M') - msg = 'Input has different freq(=.+)? from PeriodIndex\\(freq=M\\)' - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - rng + o - - # Tick - offsets = [pd.offsets.Day(3), timedelta(days=3), - np.timedelta64(3, 'D'), pd.offsets.Hour(72), - timedelta(minutes=60 * 24 * 3), np.timedelta64(72, 'h'), - Timedelta('72:00:00')] - for delta in offsets: - rng = pd.period_range('2014-05-01', '2014-05-15', freq='D') - result = rng + delta - expected = pd.period_range('2014-05-04', '2014-05-18', freq='D') - tm.assert_index_equal(result, expected) - rng += delta - tm.assert_index_equal(rng, expected) - - for o in [pd.offsets.YearBegin(2), pd.offsets.MonthBegin(1), - pd.offsets.Minute(), np.timedelta64(4, 'h'), - timedelta(hours=23), Timedelta('23:00:00')]: - rng = pd.period_range('2014-05-01', '2014-05-15', freq='D') - msg = 'Input has different freq(=.+)? from PeriodIndex\\(freq=D\\)' - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - rng + o - - offsets = [pd.offsets.Hour(2), timedelta(hours=2), - np.timedelta64(2, 'h'), pd.offsets.Minute(120), - timedelta(minutes=120), np.timedelta64(120, 'm'), - Timedelta(minutes=120)] - for delta in offsets: - rng = pd.period_range('2014-01-01 10:00', '2014-01-05 10:00', - freq='H') - result = rng + delta - expected = pd.period_range('2014-01-01 12:00', '2014-01-05 12:00', - freq='H') - tm.assert_index_equal(result, expected) - rng += delta - tm.assert_index_equal(rng, expected) - - for delta in [pd.offsets.YearBegin(2), timedelta(minutes=30), - np.timedelta64(30, 's'), Timedelta(seconds=30)]: - rng = pd.period_range('2014-01-01 10:00', '2014-01-05 10:00', - freq='H') - msg = 'Input has different freq(=.+)? 
from PeriodIndex\\(freq=H\\)' - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - result = rng + delta - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - rng += delta - - # int - rng = pd.period_range('2000-01-01 09:00', freq='H', periods=10) - result = rng + 1 - expected = pd.period_range('2000-01-01 10:00', freq='H', periods=10) - tm.assert_index_equal(result, expected) - rng += 1 - tm.assert_index_equal(rng, expected) - - def test_difference(self): - # diff - rng1 = pd.period_range('1/1/2000', freq='D', periods=5) - other1 = pd.period_range('1/6/2000', freq='D', periods=5) - expected1 = pd.period_range('1/1/2000', freq='D', periods=5) - - rng2 = pd.period_range('1/1/2000', freq='D', periods=5) - other2 = pd.period_range('1/4/2000', freq='D', periods=5) - expected2 = pd.period_range('1/1/2000', freq='D', periods=3) - - rng3 = pd.period_range('1/1/2000', freq='D', periods=5) - other3 = pd.PeriodIndex([], freq='D') - expected3 = pd.period_range('1/1/2000', freq='D', periods=5) - - rng4 = pd.period_range('2000-01-01 09:00', freq='H', periods=5) - other4 = pd.period_range('2000-01-02 09:00', freq='H', periods=5) - expected4 = rng4 - - rng5 = pd.PeriodIndex(['2000-01-01 09:01', '2000-01-01 09:03', - '2000-01-01 09:05'], freq='T') - other5 = pd.PeriodIndex( - ['2000-01-01 09:01', '2000-01-01 09:05'], freq='T') - expected5 = pd.PeriodIndex(['2000-01-01 09:03'], freq='T') - - rng6 = pd.period_range('2000-01-01', freq='M', periods=7) - other6 = pd.period_range('2000-04-01', freq='M', periods=7) - expected6 = pd.period_range('2000-01-01', freq='M', periods=3) - - rng7 = pd.period_range('2003-01-01', freq='A', periods=5) - other7 = pd.period_range('1998-01-01', freq='A', periods=8) - expected7 = pd.period_range('2006-01-01', freq='A', periods=2) - - for rng, other, expected in [(rng1, other1, expected1), - (rng2, other2, expected2), - (rng3, other3, expected3), - (rng4, other4, expected4), - (rng5, other5, expected5), - (rng6, other6, expected6), - (rng7, other7, expected7), ]: - result_union = rng.difference(other) - tm.assert_index_equal(result_union, expected) - - def test_sub_isub(self): - - # previously performed setop, now raises TypeError (GH14164) - # TODO needs to wait on #13077 for decision on result type - rng = pd.period_range('1/1/2000', freq='D', periods=5) - other = pd.period_range('1/6/2000', freq='D', periods=5) - - with tm.assertRaises(TypeError): - rng - other - - with tm.assertRaises(TypeError): - rng -= other - - # offset - # DateOffset - rng = pd.period_range('2014', '2024', freq='A') - result = rng - pd.offsets.YearEnd(5) - expected = pd.period_range('2009', '2019', freq='A') - tm.assert_index_equal(result, expected) - rng -= pd.offsets.YearEnd(5) - tm.assert_index_equal(rng, expected) - - for o in [pd.offsets.YearBegin(2), pd.offsets.MonthBegin(1), - pd.offsets.Minute(), np.timedelta64(365, 'D'), - timedelta(365)]: - rng = pd.period_range('2014', '2024', freq='A') - msg = ('Input has different freq(=.+)? 
' - 'from PeriodIndex\\(freq=A-DEC\\)') - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - rng - o - - rng = pd.period_range('2014-01', '2016-12', freq='M') - result = rng - pd.offsets.MonthEnd(5) - expected = pd.period_range('2013-08', '2016-07', freq='M') - tm.assert_index_equal(result, expected) - rng -= pd.offsets.MonthEnd(5) - tm.assert_index_equal(rng, expected) - - for o in [pd.offsets.YearBegin(2), pd.offsets.MonthBegin(1), - pd.offsets.Minute(), np.timedelta64(365, 'D'), - timedelta(365)]: - rng = pd.period_range('2014-01', '2016-12', freq='M') - msg = 'Input has different freq(=.+)? from PeriodIndex\\(freq=M\\)' - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - rng - o - - # Tick - offsets = [pd.offsets.Day(3), timedelta(days=3), - np.timedelta64(3, 'D'), pd.offsets.Hour(72), - timedelta(minutes=60 * 24 * 3), np.timedelta64(72, 'h')] - for delta in offsets: - rng = pd.period_range('2014-05-01', '2014-05-15', freq='D') - result = rng - delta - expected = pd.period_range('2014-04-28', '2014-05-12', freq='D') - tm.assert_index_equal(result, expected) - rng -= delta - tm.assert_index_equal(rng, expected) - - for o in [pd.offsets.YearBegin(2), pd.offsets.MonthBegin(1), - pd.offsets.Minute(), np.timedelta64(4, 'h'), - timedelta(hours=23)]: - rng = pd.period_range('2014-05-01', '2014-05-15', freq='D') - msg = 'Input has different freq(=.+)? from PeriodIndex\\(freq=D\\)' - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - rng - o - - offsets = [pd.offsets.Hour(2), timedelta(hours=2), - np.timedelta64(2, 'h'), pd.offsets.Minute(120), - timedelta(minutes=120), np.timedelta64(120, 'm')] - for delta in offsets: - rng = pd.period_range('2014-01-01 10:00', '2014-01-05 10:00', - freq='H') - result = rng - delta - expected = pd.period_range('2014-01-01 08:00', '2014-01-05 08:00', - freq='H') - tm.assert_index_equal(result, expected) - rng -= delta - tm.assert_index_equal(rng, expected) - - for delta in [pd.offsets.YearBegin(2), timedelta(minutes=30), - np.timedelta64(30, 's')]: - rng = pd.period_range('2014-01-01 10:00', '2014-01-05 10:00', - freq='H') - msg = 'Input has different freq(=.+)? 
from PeriodIndex\\(freq=H\\)' - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - result = rng - delta - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - rng -= delta - - # int - rng = pd.period_range('2000-01-01 09:00', freq='H', periods=10) - result = rng - 1 - expected = pd.period_range('2000-01-01 08:00', freq='H', periods=10) - tm.assert_index_equal(result, expected) - rng -= 1 - tm.assert_index_equal(rng, expected) - - def test_comp_nat(self): - left = pd.PeriodIndex([pd.Period('2011-01-01'), pd.NaT, - pd.Period('2011-01-03')]) - right = pd.PeriodIndex([pd.NaT, pd.NaT, pd.Period('2011-01-03')]) - - for l, r in [(left, right), (left.asobject, right.asobject)]: - result = l == r - expected = np.array([False, False, True]) - tm.assert_numpy_array_equal(result, expected) - - result = l != r - expected = np.array([True, True, False]) - tm.assert_numpy_array_equal(result, expected) - - expected = np.array([False, False, False]) - tm.assert_numpy_array_equal(l == pd.NaT, expected) - tm.assert_numpy_array_equal(pd.NaT == r, expected) - - expected = np.array([True, True, True]) - tm.assert_numpy_array_equal(l != pd.NaT, expected) - tm.assert_numpy_array_equal(pd.NaT != l, expected) - - expected = np.array([False, False, False]) - tm.assert_numpy_array_equal(l < pd.NaT, expected) - tm.assert_numpy_array_equal(pd.NaT > l, expected) - - def test_value_counts_unique(self): - # GH 7735 - idx = pd.period_range('2011-01-01 09:00', freq='H', periods=10) - # create repeated values, 'n'th element is repeated by n+1 times - idx = PeriodIndex(np.repeat(idx.values, range(1, len(idx) + 1)), - freq='H') - - exp_idx = PeriodIndex(['2011-01-01 18:00', '2011-01-01 17:00', - '2011-01-01 16:00', '2011-01-01 15:00', - '2011-01-01 14:00', '2011-01-01 13:00', - '2011-01-01 12:00', '2011-01-01 11:00', - '2011-01-01 10:00', - '2011-01-01 09:00'], freq='H') - expected = Series(range(10, 0, -1), index=exp_idx, dtype='int64') - - for obj in [idx, Series(idx)]: - tm.assert_series_equal(obj.value_counts(), expected) - - expected = pd.period_range('2011-01-01 09:00', freq='H', - periods=10) - tm.assert_index_equal(idx.unique(), expected) - - idx = PeriodIndex(['2013-01-01 09:00', '2013-01-01 09:00', - '2013-01-01 09:00', '2013-01-01 08:00', - '2013-01-01 08:00', pd.NaT], freq='H') - - exp_idx = PeriodIndex(['2013-01-01 09:00', '2013-01-01 08:00'], - freq='H') - expected = Series([3, 2], index=exp_idx) - - for obj in [idx, Series(idx)]: - tm.assert_series_equal(obj.value_counts(), expected) - - exp_idx = PeriodIndex(['2013-01-01 09:00', '2013-01-01 08:00', - pd.NaT], freq='H') - expected = Series([3, 2, 1], index=exp_idx) - - for obj in [idx, Series(idx)]: - tm.assert_series_equal(obj.value_counts(dropna=False), expected) - - tm.assert_index_equal(idx.unique(), exp_idx) - - def test_drop_duplicates_metadata(self): - # GH 10115 - idx = pd.period_range('2011-01-01', '2011-01-31', freq='D', name='idx') - result = idx.drop_duplicates() - self.assert_index_equal(idx, result) - self.assertEqual(idx.freq, result.freq) - - idx_dup = idx.append(idx) # freq will not be reset - result = idx_dup.drop_duplicates() - self.assert_index_equal(idx, result) - self.assertEqual(idx.freq, result.freq) - - def test_drop_duplicates(self): - # to check Index/Series compat - base = pd.period_range('2011-01-01', '2011-01-31', freq='D', - name='idx') - idx = base.append(base[:5]) - - res = idx.drop_duplicates() - tm.assert_index_equal(res, base) - res = Series(idx).drop_duplicates() - tm.assert_series_equal(res, 
Series(base)) - - res = idx.drop_duplicates(keep='last') - exp = base[5:].append(base[:5]) - tm.assert_index_equal(res, exp) - res = Series(idx).drop_duplicates(keep='last') - tm.assert_series_equal(res, Series(exp, index=np.arange(5, 36))) - - res = idx.drop_duplicates(keep=False) - tm.assert_index_equal(res, base[5:]) - res = Series(idx).drop_duplicates(keep=False) - tm.assert_series_equal(res, Series(base[5:], index=np.arange(5, 31))) - - def test_order_compat(self): - def _check_freq(index, expected_index): - if isinstance(index, PeriodIndex): - self.assertEqual(index.freq, expected_index.freq) - - pidx = PeriodIndex(['2011', '2012', '2013'], name='pidx', freq='A') - # for compatibility check - iidx = Index([2011, 2012, 2013], name='idx') - for idx in [pidx, iidx]: - ordered = idx.sort_values() - self.assert_index_equal(ordered, idx) - _check_freq(ordered, idx) - - ordered = idx.sort_values(ascending=False) - self.assert_index_equal(ordered, idx[::-1]) - _check_freq(ordered, idx[::-1]) - - ordered, indexer = idx.sort_values(return_indexer=True) - self.assert_index_equal(ordered, idx) - self.assert_numpy_array_equal(indexer, - np.array([0, 1, 2]), - check_dtype=False) - _check_freq(ordered, idx) - - ordered, indexer = idx.sort_values(return_indexer=True, - ascending=False) - self.assert_index_equal(ordered, idx[::-1]) - self.assert_numpy_array_equal(indexer, - np.array([2, 1, 0]), - check_dtype=False) - _check_freq(ordered, idx[::-1]) - - pidx = PeriodIndex(['2011', '2013', '2015', '2012', - '2011'], name='pidx', freq='A') - pexpected = PeriodIndex( - ['2011', '2011', '2012', '2013', '2015'], name='pidx', freq='A') - # for compatibility check - iidx = Index([2011, 2013, 2015, 2012, 2011], name='idx') - iexpected = Index([2011, 2011, 2012, 2013, 2015], name='idx') - for idx, expected in [(pidx, pexpected), (iidx, iexpected)]: - ordered = idx.sort_values() - self.assert_index_equal(ordered, expected) - _check_freq(ordered, idx) - - ordered = idx.sort_values(ascending=False) - self.assert_index_equal(ordered, expected[::-1]) - _check_freq(ordered, idx) - - ordered, indexer = idx.sort_values(return_indexer=True) - self.assert_index_equal(ordered, expected) - - exp = np.array([0, 4, 3, 1, 2]) - self.assert_numpy_array_equal(indexer, exp, check_dtype=False) - _check_freq(ordered, idx) - - ordered, indexer = idx.sort_values(return_indexer=True, - ascending=False) - self.assert_index_equal(ordered, expected[::-1]) - - exp = np.array([2, 1, 3, 4, 0]) - self.assert_numpy_array_equal(indexer, exp, - check_dtype=False) - _check_freq(ordered, idx) - - pidx = PeriodIndex(['2011', '2013', 'NaT', '2011'], name='pidx', - freq='D') - - result = pidx.sort_values() - expected = PeriodIndex(['NaT', '2011', '2011', '2013'], - name='pidx', freq='D') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, 'D') - - result = pidx.sort_values(ascending=False) - expected = PeriodIndex( - ['2013', '2011', '2011', 'NaT'], name='pidx', freq='D') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, 'D') - - def test_order(self): - for freq in ['D', '2D', '4D']: - idx = PeriodIndex(['2011-01-01', '2011-01-02', '2011-01-03'], - freq=freq, name='idx') - - ordered = idx.sort_values() - self.assert_index_equal(ordered, idx) - self.assertEqual(ordered.freq, idx.freq) - - ordered = idx.sort_values(ascending=False) - expected = idx[::-1] - self.assert_index_equal(ordered, expected) - self.assertEqual(ordered.freq, expected.freq) - self.assertEqual(ordered.freq, freq) - - ordered, 
indexer = idx.sort_values(return_indexer=True) - self.assert_index_equal(ordered, idx) - self.assert_numpy_array_equal(indexer, - np.array([0, 1, 2]), - check_dtype=False) - self.assertEqual(ordered.freq, idx.freq) - self.assertEqual(ordered.freq, freq) - - ordered, indexer = idx.sort_values(return_indexer=True, - ascending=False) - expected = idx[::-1] - self.assert_index_equal(ordered, expected) - self.assert_numpy_array_equal(indexer, - np.array([2, 1, 0]), - check_dtype=False) - self.assertEqual(ordered.freq, expected.freq) - self.assertEqual(ordered.freq, freq) - - idx1 = PeriodIndex(['2011-01-01', '2011-01-03', '2011-01-05', - '2011-01-02', '2011-01-01'], freq='D', name='idx1') - exp1 = PeriodIndex(['2011-01-01', '2011-01-01', '2011-01-02', - '2011-01-03', '2011-01-05'], freq='D', name='idx1') - - idx2 = PeriodIndex(['2011-01-01', '2011-01-03', '2011-01-05', - '2011-01-02', '2011-01-01'], - freq='D', name='idx2') - exp2 = PeriodIndex(['2011-01-01', '2011-01-01', '2011-01-02', - '2011-01-03', '2011-01-05'], - freq='D', name='idx2') - - idx3 = PeriodIndex([pd.NaT, '2011-01-03', '2011-01-05', - '2011-01-02', pd.NaT], freq='D', name='idx3') - exp3 = PeriodIndex([pd.NaT, pd.NaT, '2011-01-02', '2011-01-03', - '2011-01-05'], freq='D', name='idx3') - - for idx, expected in [(idx1, exp1), (idx2, exp2), (idx3, exp3)]: - ordered = idx.sort_values() - self.assert_index_equal(ordered, expected) - self.assertEqual(ordered.freq, 'D') - - ordered = idx.sort_values(ascending=False) - self.assert_index_equal(ordered, expected[::-1]) - self.assertEqual(ordered.freq, 'D') - - ordered, indexer = idx.sort_values(return_indexer=True) - self.assert_index_equal(ordered, expected) - - exp = np.array([0, 4, 3, 1, 2]) - self.assert_numpy_array_equal(indexer, exp, check_dtype=False) - self.assertEqual(ordered.freq, 'D') - - ordered, indexer = idx.sort_values(return_indexer=True, - ascending=False) - self.assert_index_equal(ordered, expected[::-1]) - - exp = np.array([2, 1, 3, 4, 0]) - self.assert_numpy_array_equal(indexer, exp, check_dtype=False) - self.assertEqual(ordered.freq, 'D') - - def test_getitem(self): - idx1 = pd.period_range('2011-01-01', '2011-01-31', freq='D', - name='idx') - - for idx in [idx1]: - result = idx[0] - self.assertEqual(result, pd.Period('2011-01-01', freq='D')) - - result = idx[-1] - self.assertEqual(result, pd.Period('2011-01-31', freq='D')) - - result = idx[0:5] - expected = pd.period_range('2011-01-01', '2011-01-05', freq='D', - name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - self.assertEqual(result.freq, 'D') - - result = idx[0:10:2] - expected = pd.PeriodIndex(['2011-01-01', '2011-01-03', - '2011-01-05', - '2011-01-07', '2011-01-09'], - freq='D', name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - self.assertEqual(result.freq, 'D') - - result = idx[-20:-5:3] - expected = pd.PeriodIndex(['2011-01-12', '2011-01-15', - '2011-01-18', - '2011-01-21', '2011-01-24'], - freq='D', name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - self.assertEqual(result.freq, 'D') - - result = idx[4::-1] - expected = PeriodIndex(['2011-01-05', '2011-01-04', '2011-01-03', - '2011-01-02', '2011-01-01'], - freq='D', name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - self.assertEqual(result.freq, 'D') - - def test_take(self): - # GH 10295 - idx1 = pd.period_range('2011-01-01', '2011-01-31', 
freq='D', - name='idx') - - for idx in [idx1]: - result = idx.take([0]) - self.assertEqual(result, pd.Period('2011-01-01', freq='D')) - - result = idx.take([5]) - self.assertEqual(result, pd.Period('2011-01-06', freq='D')) - - result = idx.take([0, 1, 2]) - expected = pd.period_range('2011-01-01', '2011-01-03', freq='D', - name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, 'D') - self.assertEqual(result.freq, expected.freq) - - result = idx.take([0, 2, 4]) - expected = pd.PeriodIndex(['2011-01-01', '2011-01-03', - '2011-01-05'], freq='D', name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - self.assertEqual(result.freq, 'D') - - result = idx.take([7, 4, 1]) - expected = pd.PeriodIndex(['2011-01-08', '2011-01-05', - '2011-01-02'], - freq='D', name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - self.assertEqual(result.freq, 'D') - - result = idx.take([3, 2, 5]) - expected = PeriodIndex(['2011-01-04', '2011-01-03', '2011-01-06'], - freq='D', name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - self.assertEqual(result.freq, 'D') - - result = idx.take([-3, 2, 5]) - expected = PeriodIndex(['2011-01-29', '2011-01-03', '2011-01-06'], - freq='D', name='idx') - self.assert_index_equal(result, expected) - self.assertEqual(result.freq, expected.freq) - self.assertEqual(result.freq, 'D') - - def test_nat_new(self): - - idx = pd.period_range('2011-01', freq='M', periods=5, name='x') - result = idx._nat_new() - exp = pd.PeriodIndex([pd.NaT] * 5, freq='M', name='x') - tm.assert_index_equal(result, exp) - - result = idx._nat_new(box=False) - exp = np.array([tslib.iNaT] * 5, dtype=np.int64) - tm.assert_numpy_array_equal(result, exp) - - def test_shift(self): - # GH 9903 - idx = pd.PeriodIndex([], name='xxx', freq='H') - - with tm.assertRaises(TypeError): - # period shift doesn't accept freq - idx.shift(1, freq='H') - - tm.assert_index_equal(idx.shift(0), idx) - tm.assert_index_equal(idx.shift(3), idx) - - idx = pd.PeriodIndex(['2011-01-01 10:00', '2011-01-01 11:00', - '2011-01-01 12:00'], name='xxx', freq='H') - tm.assert_index_equal(idx.shift(0), idx) - exp = pd.PeriodIndex(['2011-01-01 13:00', '2011-01-01 14:00', - '2011-01-01 15:00'], name='xxx', freq='H') - tm.assert_index_equal(idx.shift(3), exp) - exp = pd.PeriodIndex(['2011-01-01 07:00', '2011-01-01 08:00', - '2011-01-01 09:00'], name='xxx', freq='H') - tm.assert_index_equal(idx.shift(-3), exp) - - def test_repeat(self): - index = pd.period_range('2001-01-01', periods=2, freq='D') - exp = pd.PeriodIndex(['2001-01-01', '2001-01-01', - '2001-01-02', '2001-01-02'], freq='D') - for res in [index.repeat(2), np.repeat(index, 2)]: - tm.assert_index_equal(res, exp) - - index = pd.period_range('2001-01-01', periods=2, freq='2D') - exp = pd.PeriodIndex(['2001-01-01', '2001-01-01', - '2001-01-03', '2001-01-03'], freq='2D') - for res in [index.repeat(2), np.repeat(index, 2)]: - tm.assert_index_equal(res, exp) - - index = pd.PeriodIndex(['2001-01', 'NaT', '2003-01'], freq='M') - exp = pd.PeriodIndex(['2001-01', '2001-01', '2001-01', - 'NaT', 'NaT', 'NaT', - '2003-01', '2003-01', '2003-01'], freq='M') - for res in [index.repeat(3), np.repeat(index, 3)]: - tm.assert_index_equal(res, exp) - - def test_nat(self): - self.assertIs(pd.PeriodIndex._na_value, pd.NaT) - self.assertIs(pd.PeriodIndex([], freq='M')._na_value, pd.NaT) - 
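The `IncompatibleFrequency` failures asserted throughout these Period tests follow from a single rule: a `Period` (or `PeriodIndex`) can only be moved by an offset expressible as a whole number of its own frequency. A minimal sketch, assuming the exception surfaces as in the pandas version under test:

```python
import pandas as pd

p = pd.Period('2011-01', freq='M')

# Whole numbers of the period's own frequency are fine.
assert p + 1 == pd.Period('2011-02', freq='M')

# A sub-monthly offset is not representable in freq='M' and is rejected.
try:
    p + pd.offsets.Minute()
except Exception as exc:  # IncompatibleFrequency in the version under test
    print('refused:', exc)
```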
freq='D') - self.assertTrue(idx._can_hold_na) - - tm.assert_numpy_array_equal(idx._isnan, np.array([False, False])) - self.assertFalse(idx.hasnans) - tm.assert_numpy_array_equal(idx._nan_idxs, - np.array([], dtype=np.intp)) - - idx = pd.PeriodIndex(['2011-01-01', 'NaT'], freq='D') - self.assertTrue(idx._can_hold_na) - - tm.assert_numpy_array_equal(idx._isnan, np.array([False, True])) - self.assertTrue(idx.hasnans) - tm.assert_numpy_array_equal(idx._nan_idxs, - np.array([1], dtype=np.intp)) - - def test_equals(self): - # GH 13107 - for freq in ['D', 'M']: - idx = pd.PeriodIndex(['2011-01-01', '2011-01-02', 'NaT'], - freq=freq) - self.assertTrue(idx.equals(idx)) - self.assertTrue(idx.equals(idx.copy())) - self.assertTrue(idx.equals(idx.asobject)) - self.assertTrue(idx.asobject.equals(idx)) - self.assertTrue(idx.asobject.equals(idx.asobject)) - self.assertFalse(idx.equals(list(idx))) - self.assertFalse(idx.equals(pd.Series(idx))) - - idx2 = pd.PeriodIndex(['2011-01-01', '2011-01-02', 'NaT'], - freq='H') - self.assertFalse(idx.equals(idx2)) - self.assertFalse(idx.equals(idx2.copy())) - self.assertFalse(idx.equals(idx2.asobject)) - self.assertFalse(idx.asobject.equals(idx2)) - self.assertFalse(idx.equals(list(idx2))) - self.assertFalse(idx.equals(pd.Series(idx2))) - - # same internal values, different freq - idx3 = pd.PeriodIndex._simple_new(idx.asi8, freq='H') - tm.assert_numpy_array_equal(idx.asi8, idx3.asi8) - self.assertFalse(idx.equals(idx3)) - self.assertFalse(idx.equals(idx3.copy())) - self.assertFalse(idx.equals(idx3.asobject)) - self.assertFalse(idx.asobject.equals(idx3)) - self.assertFalse(idx.equals(list(idx3))) - self.assertFalse(idx.equals(pd.Series(idx3))) - - -if __name__ == '__main__': - import nose - - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tseries/tests/test_daterange.py b/pandas/tseries/tests/test_daterange.py deleted file mode 100644 index 87f9f55e0189c..0000000000000 --- a/pandas/tseries/tests/test_daterange.py +++ /dev/null @@ -1,824 +0,0 @@ -from datetime import datetime -from pandas.compat import range -import nose -import numpy as np - -from pandas.core.index import Index -from pandas.tseries.index import DatetimeIndex - -from pandas import Timestamp -from pandas.tseries.offsets import (BDay, BMonthEnd, CDay, MonthEnd, - generate_range, DateOffset, Minute) -from pandas.tseries.index import cdate_range, bdate_range, date_range - -from pandas.core import common as com -from pandas.util.testing import assertRaisesRegexp -import pandas.util.testing as tm - - -def eq_gen_range(kwargs, expected): - rng = generate_range(**kwargs) - assert (np.array_equal(list(rng), expected)) - - -START, END = datetime(2009, 1, 1), datetime(2010, 1, 1) - - -class TestGenRangeGeneration(tm.TestCase): - - def test_generate(self): - rng1 = list(generate_range(START, END, offset=BDay())) - rng2 = list(generate_range(START, END, time_rule='B')) - self.assertEqual(rng1, rng2) - - def test_generate_cday(self): - rng1 = list(generate_range(START, END, offset=CDay())) - rng2 = list(generate_range(START, END, time_rule='C')) - self.assertEqual(rng1, rng2) - - def test_1(self): - eq_gen_range(dict(start=datetime(2009, 3, 25), periods=2), - [datetime(2009, 3, 25), datetime(2009, 3, 26)]) - - def test_2(self): - eq_gen_range(dict(start=datetime(2008, 1, 1), - end=datetime(2008, 1, 3)), - [datetime(2008, 1, 1), - datetime(2008, 1, 2), - datetime(2008, 1, 3)]) - - def test_3(self): - eq_gen_range(dict(start=datetime(2008, 1, 5), - end=datetime(2008,
1, 6)), - []) - - def test_precision_finer_than_offset(self): - # GH 9907 - result1 = DatetimeIndex(start='2015-04-15 00:00:03', - end='2016-04-22 00:00:00', freq='Q') - result2 = DatetimeIndex(start='2015-04-15 00:00:03', - end='2015-06-22 00:00:04', freq='W') - expected1_list = ['2015-06-30 00:00:03', '2015-09-30 00:00:03', - '2015-12-31 00:00:03', '2016-03-31 00:00:03'] - expected2_list = ['2015-04-19 00:00:03', '2015-04-26 00:00:03', - '2015-05-03 00:00:03', '2015-05-10 00:00:03', - '2015-05-17 00:00:03', '2015-05-24 00:00:03', - '2015-05-31 00:00:03', '2015-06-07 00:00:03', - '2015-06-14 00:00:03', '2015-06-21 00:00:03'] - expected1 = DatetimeIndex(expected1_list, dtype='datetime64[ns]', - freq='Q-DEC', tz=None) - expected2 = DatetimeIndex(expected2_list, dtype='datetime64[ns]', - freq='W-SUN', tz=None) - self.assert_index_equal(result1, expected1) - self.assert_index_equal(result2, expected2) - - -class TestDateRange(tm.TestCase): - def setUp(self): - self.rng = bdate_range(START, END) - - def test_constructor(self): - bdate_range(START, END, freq=BDay()) - bdate_range(START, periods=20, freq=BDay()) - bdate_range(end=START, periods=20, freq=BDay()) - self.assertRaises(ValueError, date_range, '2011-1-1', '2012-1-1', 'B') - self.assertRaises(ValueError, bdate_range, '2011-1-1', '2012-1-1', 'B') - - def test_naive_aware_conflicts(self): - naive = bdate_range(START, END, freq=BDay(), tz=None) - aware = bdate_range(START, END, freq=BDay(), - tz="Asia/Hong_Kong") - assertRaisesRegexp(TypeError, "tz-naive.*tz-aware", naive.join, aware) - assertRaisesRegexp(TypeError, "tz-naive.*tz-aware", aware.join, naive) - - def test_cached_range(self): - DatetimeIndex._cached_range(START, END, offset=BDay()) - DatetimeIndex._cached_range(START, periods=20, offset=BDay()) - DatetimeIndex._cached_range(end=START, periods=20, offset=BDay()) - - assertRaisesRegexp(TypeError, "offset", DatetimeIndex._cached_range, - START, END) - - assertRaisesRegexp(TypeError, "specify period", - DatetimeIndex._cached_range, START, - offset=BDay()) - - assertRaisesRegexp(TypeError, "specify period", - DatetimeIndex._cached_range, end=END, - offset=BDay()) - - assertRaisesRegexp(TypeError, "start or end", - DatetimeIndex._cached_range, periods=20, - offset=BDay()) - - def test_cached_range_bug(self): - rng = date_range('2010-09-01 05:00:00', periods=50, - freq=DateOffset(hours=6)) - self.assertEqual(len(rng), 50) - self.assertEqual(rng[0], datetime(2010, 9, 1, 5)) - - def test_timezone_comparison_bug(self): - start = Timestamp('20130220 10:00', tz='US/Eastern') - try: - date_range(start, periods=2, tz='US/Eastern') - except AssertionError: - self.fail() - - def test_timezone_comparison_assert(self): - start = Timestamp('20130220 10:00', tz='US/Eastern') - self.assertRaises(AssertionError, date_range, start, periods=2, - tz='Europe/Berlin') - - def test_comparison(self): - d = self.rng[10] - - comp = self.rng > d - self.assertTrue(comp[11]) - self.assertFalse(comp[9]) - - def test_copy(self): - cp = self.rng.copy() - repr(cp) - self.assert_index_equal(cp, self.rng) - - def test_repr(self): - # only really care that it works - repr(self.rng) - - def test_getitem(self): - smaller = self.rng[:5] - exp = DatetimeIndex(self.rng.view(np.ndarray)[:5]) - self.assert_index_equal(smaller, exp) - - self.assertEqual(smaller.offset, self.rng.offset) - - sliced = self.rng[::5] - self.assertEqual(sliced.offset, BDay() * 5) - - fancy_indexed = self.rng[[4, 3, 2, 1, 0]] - self.assertEqual(len(fancy_indexed), 5) -
tm.assertIsInstance(fancy_indexed, DatetimeIndex) - self.assertIsNone(fancy_indexed.freq) - - # 32-bit vs. 64-bit platforms - self.assertEqual(self.rng[4], self.rng[np.int_(4)]) - - def test_getitem_matplotlib_hackaround(self): - values = self.rng[:, None] - expected = self.rng.values[:, None] - self.assert_numpy_array_equal(values, expected) - - def test_shift(self): - shifted = self.rng.shift(5) - self.assertEqual(shifted[0], self.rng[5]) - self.assertEqual(shifted.offset, self.rng.offset) - - shifted = self.rng.shift(-5) - self.assertEqual(shifted[5], self.rng[0]) - self.assertEqual(shifted.offset, self.rng.offset) - - shifted = self.rng.shift(0) - self.assertEqual(shifted[0], self.rng[0]) - self.assertEqual(shifted.offset, self.rng.offset) - - rng = date_range(START, END, freq=BMonthEnd()) - shifted = rng.shift(1, freq=BDay()) - self.assertEqual(shifted[0], rng[0] + BDay()) - - def test_pickle_unpickle(self): - unpickled = self.round_trip_pickle(self.rng) - self.assertIsNotNone(unpickled.offset) - - def test_union(self): - # overlapping - left = self.rng[:10] - right = self.rng[5:10] - - the_union = left.union(right) - tm.assertIsInstance(the_union, DatetimeIndex) - - # non-overlapping, gap in middle - left = self.rng[:5] - right = self.rng[10:] - - the_union = left.union(right) - tm.assertIsInstance(the_union, Index) - - # non-overlapping, no gap - left = self.rng[:5] - right = self.rng[5:10] - - the_union = left.union(right) - tm.assertIsInstance(the_union, DatetimeIndex) - - # order does not matter - tm.assert_index_equal(right.union(left), the_union) - - # overlapping, but different offset - rng = date_range(START, END, freq=BMonthEnd()) - - the_union = self.rng.union(rng) - tm.assertIsInstance(the_union, DatetimeIndex) - - def test_outer_join(self): - # should just behave as union - - # overlapping - left = self.rng[:10] - right = self.rng[5:10] - - the_join = left.join(right, how='outer') - tm.assertIsInstance(the_join, DatetimeIndex) - - # non-overlapping, gap in middle - left = self.rng[:5] - right = self.rng[10:] - - the_join = left.join(right, how='outer') - tm.assertIsInstance(the_join, DatetimeIndex) - self.assertIsNone(the_join.freq) - - # non-overlapping, no gap - left = self.rng[:5] - right = self.rng[5:10] - - the_join = left.join(right, how='outer') - tm.assertIsInstance(the_join, DatetimeIndex) - - # overlapping, but different offset - rng = date_range(START, END, freq=BMonthEnd()) - - the_join = self.rng.join(rng, how='outer') - tm.assertIsInstance(the_join, DatetimeIndex) - self.assertIsNone(the_join.freq) - - def test_union_not_cacheable(self): - rng = date_range('1/1/2000', periods=50, freq=Minute()) - rng1 = rng[10:] - rng2 = rng[:25] - the_union = rng1.union(rng2) - self.assert_index_equal(the_union, rng) - - rng1 = rng[10:] - rng2 = rng[15:35] - the_union = rng1.union(rng2) - expected = rng[10:] - self.assert_index_equal(the_union, expected) - - def test_intersection(self): - rng = date_range('1/1/2000', periods=50, freq=Minute()) - rng1 = rng[10:] - rng2 = rng[:25] - the_int = rng1.intersection(rng2) - expected = rng[10:25] - self.assert_index_equal(the_int, expected) - tm.assertIsInstance(the_int, DatetimeIndex) - self.assertEqual(the_int.offset, rng.offset) - - the_int = rng1.intersection(rng2.view(DatetimeIndex)) - self.assert_index_equal(the_int, expected) - - # non-overlapping - the_int = rng[:10].intersection(rng[10:]) - expected = DatetimeIndex([]) - self.assert_index_equal(the_int, expected) - - def test_intersection_bug(self): - # GH #771 - a = 
bdate_range('11/30/2011', '12/31/2011') - b = bdate_range('12/10/2011', '12/20/2011') - result = a.intersection(b) - self.assert_index_equal(result, b) - - def test_summary(self): - self.rng.summary() - self.rng[2:2].summary() - - def test_summary_pytz(self): - tm._skip_if_no_pytz() - import pytz - bdate_range('1/1/2005', '1/1/2009', tz=pytz.utc).summary() - - def test_summary_dateutil(self): - tm._skip_if_no_dateutil() - import dateutil - bdate_range('1/1/2005', '1/1/2009', tz=dateutil.tz.tzutc()).summary() - - def test_misc(self): - end = datetime(2009, 5, 13) - dr = bdate_range(end=end, periods=20) - firstDate = end - 19 * BDay() - - assert len(dr) == 20 - assert dr[0] == firstDate - assert dr[-1] == end - - def test_date_parse_failure(self): - badly_formed_date = '2007/100/1' - - self.assertRaises(ValueError, Timestamp, badly_formed_date) - - self.assertRaises(ValueError, bdate_range, start=badly_formed_date, - periods=10) - self.assertRaises(ValueError, bdate_range, end=badly_formed_date, - periods=10) - self.assertRaises(ValueError, bdate_range, badly_formed_date, - badly_formed_date) - - def test_equals(self): - self.assertFalse(self.rng.equals(list(self.rng))) - - def test_identical(self): - t1 = self.rng.copy() - t2 = self.rng.copy() - self.assertTrue(t1.identical(t2)) - - # name - t1 = t1.rename('foo') - self.assertTrue(t1.equals(t2)) - self.assertFalse(t1.identical(t2)) - t2 = t2.rename('foo') - self.assertTrue(t1.identical(t2)) - - # freq - t2v = Index(t2.values) - self.assertTrue(t1.equals(t2v)) - self.assertFalse(t1.identical(t2v)) - - def test_daterange_bug_456(self): - # GH #456 - rng1 = bdate_range('12/5/2011', '12/5/2011') - rng2 = bdate_range('12/2/2011', '12/5/2011') - rng2.offset = BDay() - - result = rng1.union(rng2) - tm.assertIsInstance(result, DatetimeIndex) - - def test_error_with_zero_monthends(self): - self.assertRaises(ValueError, date_range, '1/1/2000', '1/1/2001', - freq=MonthEnd(0)) - - def test_range_bug(self): - # GH #770 - offset = DateOffset(months=3) - result = date_range("2011-1-1", "2012-1-31", freq=offset) - - start = datetime(2011, 1, 1) - exp_values = [start + i * offset for i in range(5)] - tm.assert_index_equal(result, DatetimeIndex(exp_values)) - - def test_range_tz_pytz(self): - # GH 2906 - tm._skip_if_no_pytz() - from pytz import timezone - - tz = timezone('US/Eastern') - start = tz.localize(datetime(2011, 1, 1)) - end = tz.localize(datetime(2011, 1, 3)) - - dr = date_range(start=start, periods=3) - self.assertEqual(dr.tz.zone, tz.zone) - self.assertEqual(dr[0], start) - self.assertEqual(dr[2], end) - - dr = date_range(end=end, periods=3) - self.assertEqual(dr.tz.zone, tz.zone) - self.assertEqual(dr[0], start) - self.assertEqual(dr[2], end) - - dr = date_range(start=start, end=end) - self.assertEqual(dr.tz.zone, tz.zone) - self.assertEqual(dr[0], start) - self.assertEqual(dr[2], end) - - def test_range_tz_dst_straddle_pytz(self): - - tm._skip_if_no_pytz() - from pytz import timezone - tz = timezone('US/Eastern') - dates = [(tz.localize(datetime(2014, 3, 6)), - tz.localize(datetime(2014, 3, 12))), - (tz.localize(datetime(2013, 11, 1)), - tz.localize(datetime(2013, 11, 6)))] - for (start, end) in dates: - dr = date_range(start, end, freq='D') - self.assertEqual(dr[0], start) - self.assertEqual(dr[-1], end) - self.assertEqual(np.all(dr.hour == 0), True) - - dr = date_range(start, end, freq='D', tz='US/Eastern') - self.assertEqual(dr[0], start) - self.assertEqual(dr[-1], end) - self.assertEqual(np.all(dr.hour == 0), True) - - dr = 
date_range(start.replace(tzinfo=None), end.replace( - tzinfo=None), freq='D', tz='US/Eastern') - self.assertEqual(dr[0], start) - self.assertEqual(dr[-1], end) - self.assertEqual(np.all(dr.hour == 0), True) - - def test_range_tz_dateutil(self): - # GH 2906 - tm._skip_if_no_dateutil() - # Use maybe_get_tz to fix filename in tz under dateutil. - from pandas.tslib import maybe_get_tz - tz = lambda x: maybe_get_tz('dateutil/' + x) - - start = datetime(2011, 1, 1, tzinfo=tz('US/Eastern')) - end = datetime(2011, 1, 3, tzinfo=tz('US/Eastern')) - - dr = date_range(start=start, periods=3) - self.assertTrue(dr.tz == tz('US/Eastern')) - self.assertTrue(dr[0] == start) - self.assertTrue(dr[2] == end) - - dr = date_range(end=end, periods=3) - self.assertTrue(dr.tz == tz('US/Eastern')) - self.assertTrue(dr[0] == start) - self.assertTrue(dr[2] == end) - - dr = date_range(start=start, end=end) - self.assertTrue(dr.tz == tz('US/Eastern')) - self.assertTrue(dr[0] == start) - self.assertTrue(dr[2] == end) - - def test_month_range_union_tz_pytz(self): - tm._skip_if_no_pytz() - from pytz import timezone - tz = timezone('US/Eastern') - - early_start = datetime(2011, 1, 1) - early_end = datetime(2011, 3, 1) - - late_start = datetime(2011, 3, 1) - late_end = datetime(2011, 5, 1) - - early_dr = date_range(start=early_start, end=early_end, tz=tz, - freq=MonthEnd()) - late_dr = date_range(start=late_start, end=late_end, tz=tz, - freq=MonthEnd()) - - early_dr.union(late_dr) - - def test_month_range_union_tz_dateutil(self): - tm._skip_if_windows_python_3() - tm._skip_if_no_dateutil() - from pandas.tslib import _dateutil_gettz as timezone - tz = timezone('US/Eastern') - - early_start = datetime(2011, 1, 1) - early_end = datetime(2011, 3, 1) - - late_start = datetime(2011, 3, 1) - late_end = datetime(2011, 5, 1) - - early_dr = date_range(start=early_start, end=early_end, tz=tz, - freq=MonthEnd()) - late_dr = date_range(start=late_start, end=late_end, tz=tz, - freq=MonthEnd()) - - early_dr.union(late_dr) - - def test_range_closed(self): - begin = datetime(2011, 1, 1) - end = datetime(2014, 1, 1) - - for freq in ["1D", "3D", "2M", "7W", "3H", "A"]: - closed = date_range(begin, end, closed=None, freq=freq) - left = date_range(begin, end, closed="left", freq=freq) - right = date_range(begin, end, closed="right", freq=freq) - expected_left = left - expected_right = right - - if end == closed[-1]: - expected_left = closed[:-1] - if begin == closed[0]: - expected_right = closed[1:] - - self.assert_index_equal(expected_left, left) - self.assert_index_equal(expected_right, right) - - def test_range_closed_with_tz_aware_start_end(self): - # GH12409, GH12684 - begin = Timestamp('2011/1/1', tz='US/Eastern') - end = Timestamp('2014/1/1', tz='US/Eastern') - - for freq in ["1D", "3D", "2M", "7W", "3H", "A"]: - closed = date_range(begin, end, closed=None, freq=freq) - left = date_range(begin, end, closed="left", freq=freq) - right = date_range(begin, end, closed="right", freq=freq) - expected_left = left - expected_right = right - - if end == closed[-1]: - expected_left = closed[:-1] - if begin == closed[0]: - expected_right = closed[1:] - - self.assert_index_equal(expected_left, left) - self.assert_index_equal(expected_right, right) - - begin = Timestamp('2011/1/1') - end = Timestamp('2014/1/1') - begintz = Timestamp('2011/1/1', tz='US/Eastern') - endtz = Timestamp('2014/1/1', tz='US/Eastern') - - for freq in ["1D", "3D", "2M", "7W", "3H", "A"]: - closed = date_range(begin, end, closed=None, freq=freq, - tz='US/Eastern') - left = 
date_range(begin, end, closed="left", freq=freq, - tz='US/Eastern') - right = date_range(begin, end, closed="right", freq=freq, - tz='US/Eastern') - expected_left = left - expected_right = right - - if endtz == closed[-1]: - expected_left = closed[:-1] - if begintz == closed[0]: - expected_right = closed[1:] - - self.assert_index_equal(expected_left, left) - self.assert_index_equal(expected_right, right) - - def test_range_closed_boundary(self): - # GH 11804 - for closed in ['right', 'left', None]: - right_boundary = date_range('2015-09-12', '2015-12-01', - freq='QS-MAR', closed=closed) - left_boundary = date_range('2015-09-01', '2015-09-12', - freq='QS-MAR', closed=closed) - both_boundary = date_range('2015-09-01', '2015-12-01', - freq='QS-MAR', closed=closed) - expected_right = expected_left = expected_both = both_boundary - - if closed == 'right': - expected_left = both_boundary[1:] - if closed == 'left': - expected_right = both_boundary[:-1] - if closed is None: - expected_right = both_boundary[1:] - expected_left = both_boundary[:-1] - - self.assert_index_equal(right_boundary, expected_right) - self.assert_index_equal(left_boundary, expected_left) - self.assert_index_equal(both_boundary, expected_both) - - def test_years_only(self): - # GH 6961 - dr = date_range('2014', '2015', freq='M') - self.assertEqual(dr[0], datetime(2014, 1, 31)) - self.assertEqual(dr[-1], datetime(2014, 12, 31)) - - def test_freq_divides_end_in_nanos(self): - # GH 10885 - result_1 = date_range('2005-01-12 10:00', '2005-01-12 16:00', - freq='345min') - result_2 = date_range('2005-01-13 10:00', '2005-01-13 16:00', - freq='345min') - expected_1 = DatetimeIndex(['2005-01-12 10:00:00', - '2005-01-12 15:45:00'], - dtype='datetime64[ns]', freq='345T', - tz=None) - expected_2 = DatetimeIndex(['2005-01-13 10:00:00', - '2005-01-13 15:45:00'], - dtype='datetime64[ns]', freq='345T', - tz=None) - self.assert_index_equal(result_1, expected_1) - self.assert_index_equal(result_2, expected_2) - - -class TestCustomDateRange(tm.TestCase): - def setUp(self): - self.rng = cdate_range(START, END) - - def test_constructor(self): - cdate_range(START, END, freq=CDay()) - cdate_range(START, periods=20, freq=CDay()) - cdate_range(end=START, periods=20, freq=CDay()) - self.assertRaises(ValueError, date_range, '2011-1-1', '2012-1-1', 'C') - self.assertRaises(ValueError, cdate_range, '2011-1-1', '2012-1-1', 'C') - - def test_cached_range(self): - DatetimeIndex._cached_range(START, END, offset=CDay()) - DatetimeIndex._cached_range(START, periods=20, - offset=CDay()) - DatetimeIndex._cached_range(end=START, periods=20, - offset=CDay()) - - self.assertRaises(Exception, DatetimeIndex._cached_range, START, END) - - self.assertRaises(Exception, DatetimeIndex._cached_range, START, - freq=CDay()) - - self.assertRaises(Exception, DatetimeIndex._cached_range, end=END, - freq=CDay()) - - self.assertRaises(Exception, DatetimeIndex._cached_range, periods=20, - freq=CDay()) - - def test_comparison(self): - d = self.rng[10] - - comp = self.rng > d - self.assertTrue(comp[11]) - self.assertFalse(comp[9]) - - def test_copy(self): - cp = self.rng.copy() - repr(cp) - self.assert_index_equal(cp, self.rng) - - def test_repr(self): - # only really care that it works - repr(self.rng) - - def test_getitem(self): - smaller = self.rng[:5] - exp = DatetimeIndex(self.rng.view(np.ndarray)[:5]) - self.assert_index_equal(smaller, exp) - self.assertEqual(smaller.offset, self.rng.offset) - - sliced = self.rng[::5] - self.assertEqual(sliced.offset, CDay() * 5) - - 
fancy_indexed = self.rng[[4, 3, 2, 1, 0]] - self.assertEqual(len(fancy_indexed), 5) - tm.assertIsInstance(fancy_indexed, DatetimeIndex) - self.assertIsNone(fancy_indexed.freq) - - # 32-bit vs. 64-bit platforms - self.assertEqual(self.rng[4], self.rng[np.int_(4)]) - - def test_getitem_matplotlib_hackaround(self): - values = self.rng[:, None] - expected = self.rng.values[:, None] - self.assert_numpy_array_equal(values, expected) - - def test_shift(self): - - shifted = self.rng.shift(5) - self.assertEqual(shifted[0], self.rng[5]) - self.assertEqual(shifted.offset, self.rng.offset) - - shifted = self.rng.shift(-5) - self.assertEqual(shifted[5], self.rng[0]) - self.assertEqual(shifted.offset, self.rng.offset) - - shifted = self.rng.shift(0) - self.assertEqual(shifted[0], self.rng[0]) - self.assertEqual(shifted.offset, self.rng.offset) - - with tm.assert_produces_warning(com.PerformanceWarning): - rng = date_range(START, END, freq=BMonthEnd()) - shifted = rng.shift(1, freq=CDay()) - self.assertEqual(shifted[0], rng[0] + CDay()) - - def test_pickle_unpickle(self): - unpickled = self.round_trip_pickle(self.rng) - self.assertIsNotNone(unpickled.offset) - - def test_union(self): - # overlapping - left = self.rng[:10] - right = self.rng[5:10] - - the_union = left.union(right) - tm.assertIsInstance(the_union, DatetimeIndex) - - # non-overlapping, gap in middle - left = self.rng[:5] - right = self.rng[10:] - - the_union = left.union(right) - tm.assertIsInstance(the_union, Index) - - # non-overlapping, no gap - left = self.rng[:5] - right = self.rng[5:10] - - the_union = left.union(right) - tm.assertIsInstance(the_union, DatetimeIndex) - - # order does not matter - self.assert_index_equal(right.union(left), the_union) - - # overlapping, but different offset - rng = date_range(START, END, freq=BMonthEnd()) - - the_union = self.rng.union(rng) - tm.assertIsInstance(the_union, DatetimeIndex) - - def test_outer_join(self): - # should just behave as union - - # overlapping - left = self.rng[:10] - right = self.rng[5:10] - - the_join = left.join(right, how='outer') - tm.assertIsInstance(the_join, DatetimeIndex) - - # non-overlapping, gap in middle - left = self.rng[:5] - right = self.rng[10:] - - the_join = left.join(right, how='outer') - tm.assertIsInstance(the_join, DatetimeIndex) - self.assertIsNone(the_join.freq) - - # non-overlapping, no gap - left = self.rng[:5] - right = self.rng[5:10] - - the_join = left.join(right, how='outer') - tm.assertIsInstance(the_join, DatetimeIndex) - - # overlapping, but different offset - rng = date_range(START, END, freq=BMonthEnd()) - - the_join = self.rng.join(rng, how='outer') - tm.assertIsInstance(the_join, DatetimeIndex) - self.assertIsNone(the_join.freq) - - def test_intersection_bug(self): - # GH #771 - a = cdate_range('11/30/2011', '12/31/2011') - b = cdate_range('12/10/2011', '12/20/2011') - result = a.intersection(b) - self.assert_index_equal(result, b) - - def test_summary(self): - self.rng.summary() - self.rng[2:2].summary() - - def test_summary_pytz(self): - tm._skip_if_no_pytz() - import pytz - cdate_range('1/1/2005', '1/1/2009', tz=pytz.utc).summary() - - def test_summary_dateutil(self): - tm._skip_if_no_dateutil() - import dateutil - cdate_range('1/1/2005', '1/1/2009', tz=dateutil.tz.tzutc()).summary() - - def test_misc(self): - end = datetime(2009, 5, 13) - dr = cdate_range(end=end, periods=20) - firstDate = end - 19 * CDay() - - assert len(dr) == 20 - assert dr[0] == firstDate - assert dr[-1] == end - - def test_date_parse_failure(self): - 
badly_formed_date = '2007/100/1' - - self.assertRaises(ValueError, Timestamp, badly_formed_date) - - self.assertRaises(ValueError, cdate_range, start=badly_formed_date, - periods=10) - self.assertRaises(ValueError, cdate_range, end=badly_formed_date, - periods=10) - self.assertRaises(ValueError, cdate_range, badly_formed_date, - badly_formed_date) - - def test_equals(self): - self.assertFalse(self.rng.equals(list(self.rng))) - - def test_daterange_bug_456(self): - # GH #456 - rng1 = cdate_range('12/5/2011', '12/5/2011') - rng2 = cdate_range('12/2/2011', '12/5/2011') - rng2.offset = CDay() - - result = rng1.union(rng2) - tm.assertIsInstance(result, DatetimeIndex) - - def test_cdaterange(self): - rng = cdate_range('2013-05-01', periods=3) - xp = DatetimeIndex(['2013-05-01', '2013-05-02', '2013-05-03']) - self.assert_index_equal(xp, rng) - - def test_cdaterange_weekmask(self): - rng = cdate_range('2013-05-01', periods=3, - weekmask='Sun Mon Tue Wed Thu') - xp = DatetimeIndex(['2013-05-01', '2013-05-02', '2013-05-05']) - self.assert_index_equal(xp, rng) - - def test_cdaterange_holidays(self): - rng = cdate_range('2013-05-01', periods=3, holidays=['2013-05-01']) - xp = DatetimeIndex(['2013-05-02', '2013-05-03', '2013-05-06']) - self.assert_index_equal(xp, rng) - - def test_cdaterange_weekmask_and_holidays(self): - rng = cdate_range('2013-05-01', periods=3, - weekmask='Sun Mon Tue Wed Thu', - holidays=['2013-05-01']) - xp = DatetimeIndex(['2013-05-02', '2013-05-05', '2013-05-06']) - self.assert_index_equal(xp, rng) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tseries/tests/test_period.py b/pandas/tseries/tests/test_period.py deleted file mode 100644 index 48f9d82e481d0..0000000000000 --- a/pandas/tseries/tests/test_period.py +++ /dev/null @@ -1,4975 +0,0 @@ -"""Tests suite for Period handling. - -Parts derived from scikits.timeseries code, original authors: -- Pierre Gerard-Marchant & Matt Knox -- pierregm_at_uga_dot_edu - mattknow_ca_at_hotmail_dot_com - -""" - -from datetime import datetime, date, timedelta - -from pandas import Timestamp, _period -from pandas.tseries.frequencies import MONTHS, DAYS, _period_code_map -from pandas.tseries.period import Period, PeriodIndex, period_range -from pandas.tseries.index import DatetimeIndex, date_range, Index -from pandas.tseries.tools import to_datetime -import pandas.tseries.period as period -import pandas.tseries.offsets as offsets - -import pandas as pd -import numpy as np -from numpy.random import randn -from pandas.compat import range, lrange, lmap, zip, text_type, PY3, iteritems -from pandas.compat.numpy import np_datetime64_compat - -from pandas import (Series, DataFrame, - _np_version_under1p9, _np_version_under1p10, - _np_version_under1p12) -from pandas import tslib -import pandas.util.testing as tm - - -class TestPeriodProperties(tm.TestCase): - "Test properties such as year, month, weekday, etc...." 
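For orientation, a minimal sketch of the `Period` behaviour the property tests below pin down, using only the public `pandas` API (the sketch itself is not part of the deleted suite):

```python
import pandas as pd

# Calendar properties are derived from the (ordinal, freq) pair.
p = pd.Period('2007-03', freq='M')
assert (p.year, p.quarter, p.month) == (2007, 1, 3)
assert p.days_in_month == 31

# Ordinals count periods from the 1970 epoch, so negative ordinals
# land before 1970 -- the case test_quarterly_negative_ordinals covers.
q = pd.Period(ordinal=-1, freq='Q-DEC')
assert (q.year, q.quarter) == (1969, 4)
```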
- - def test_quarterly_negative_ordinals(self): - p = Period(ordinal=-1, freq='Q-DEC') - self.assertEqual(p.year, 1969) - self.assertEqual(p.quarter, 4) - self.assertIsInstance(p, Period) - - p = Period(ordinal=-2, freq='Q-DEC') - self.assertEqual(p.year, 1969) - self.assertEqual(p.quarter, 3) - self.assertIsInstance(p, Period) - - p = Period(ordinal=-2, freq='M') - self.assertEqual(p.year, 1969) - self.assertEqual(p.month, 11) - self.assertIsInstance(p, Period) - - def test_period_cons_quarterly(self): - # bugs in scikits.timeseries - for month in MONTHS: - freq = 'Q-%s' % month - exp = Period('1989Q3', freq=freq) - self.assertIn('1989Q3', str(exp)) - stamp = exp.to_timestamp('D', how='end') - p = Period(stamp, freq=freq) - self.assertEqual(p, exp) - - stamp = exp.to_timestamp('3D', how='end') - p = Period(stamp, freq=freq) - self.assertEqual(p, exp) - - def test_period_cons_annual(self): - # bugs in scikits.timeseries - for month in MONTHS: - freq = 'A-%s' % month - exp = Period('1989', freq=freq) - stamp = exp.to_timestamp('D', how='end') + timedelta(days=30) - p = Period(stamp, freq=freq) - self.assertEqual(p, exp + 1) - self.assertIsInstance(p, Period) - - def test_period_cons_weekly(self): - for num in range(10, 17): - daystr = '2011-02-%d' % num - for day in DAYS: - freq = 'W-%s' % day - - result = Period(daystr, freq=freq) - expected = Period(daystr, freq='D').asfreq(freq) - self.assertEqual(result, expected) - self.assertIsInstance(result, Period) - - def test_period_from_ordinal(self): - p = pd.Period('2011-01', freq='M') - res = pd.Period._from_ordinal(p.ordinal, freq='M') - self.assertEqual(p, res) - self.assertIsInstance(res, Period) - - def test_period_cons_nat(self): - p = Period('NaT', freq='M') - self.assertIs(p, pd.NaT) - - p = Period('nat', freq='W-SUN') - self.assertIs(p, pd.NaT) - - p = Period(tslib.iNaT, freq='D') - self.assertIs(p, pd.NaT) - - p = Period(tslib.iNaT, freq='3D') - self.assertIs(p, pd.NaT) - - p = Period(tslib.iNaT, freq='1D1H') - self.assertIs(p, pd.NaT) - - p = Period('NaT') - self.assertIs(p, pd.NaT) - - p = Period(tslib.iNaT) - self.assertIs(p, pd.NaT) - - def test_cons_null_like(self): - # check Timestamp compat - self.assertIs(Timestamp('NaT'), pd.NaT) - self.assertIs(Period('NaT'), pd.NaT) - - self.assertIs(Timestamp(None), pd.NaT) - self.assertIs(Period(None), pd.NaT) - - self.assertIs(Timestamp(float('nan')), pd.NaT) - self.assertIs(Period(float('nan')), pd.NaT) - - self.assertIs(Timestamp(np.nan), pd.NaT) - self.assertIs(Period(np.nan), pd.NaT) - - def test_period_cons_mult(self): - p1 = Period('2011-01', freq='3M') - p2 = Period('2011-01', freq='M') - self.assertEqual(p1.ordinal, p2.ordinal) - - self.assertEqual(p1.freq, offsets.MonthEnd(3)) - self.assertEqual(p1.freqstr, '3M') - - self.assertEqual(p2.freq, offsets.MonthEnd()) - self.assertEqual(p2.freqstr, 'M') - - result = p1 + 1 - self.assertEqual(result.ordinal, (p2 + 3).ordinal) - self.assertEqual(result.freq, p1.freq) - self.assertEqual(result.freqstr, '3M') - - result = p1 - 1 - self.assertEqual(result.ordinal, (p2 - 3).ordinal) - self.assertEqual(result.freq, p1.freq) - self.assertEqual(result.freqstr, '3M') - - msg = ('Frequency must be positive, because it' - ' represents span: -3M') - with tm.assertRaisesRegexp(ValueError, msg): - Period('2011-01', freq='-3M') - - msg = ('Frequency must be positive, because it' ' represents span: 0M') - with tm.assertRaisesRegexp(ValueError, msg): - Period('2011-01', freq='0M') - - def test_period_cons_combined(self): - p = [(Period('2011-01', 
freq='1D1H'), - Period('2011-01', freq='1H1D'), - Period('2011-01', freq='H')), - (Period(ordinal=1, freq='1D1H'), - Period(ordinal=1, freq='1H1D'), - Period(ordinal=1, freq='H'))] - - for p1, p2, p3 in p: - self.assertEqual(p1.ordinal, p3.ordinal) - self.assertEqual(p2.ordinal, p3.ordinal) - - self.assertEqual(p1.freq, offsets.Hour(25)) - self.assertEqual(p1.freqstr, '25H') - - self.assertEqual(p2.freq, offsets.Hour(25)) - self.assertEqual(p2.freqstr, '25H') - - self.assertEqual(p3.freq, offsets.Hour()) - self.assertEqual(p3.freqstr, 'H') - - result = p1 + 1 - self.assertEqual(result.ordinal, (p3 + 25).ordinal) - self.assertEqual(result.freq, p1.freq) - self.assertEqual(result.freqstr, '25H') - - result = p2 + 1 - self.assertEqual(result.ordinal, (p3 + 25).ordinal) - self.assertEqual(result.freq, p2.freq) - self.assertEqual(result.freqstr, '25H') - - result = p1 - 1 - self.assertEqual(result.ordinal, (p3 - 25).ordinal) - self.assertEqual(result.freq, p1.freq) - self.assertEqual(result.freqstr, '25H') - - result = p2 - 1 - self.assertEqual(result.ordinal, (p3 - 25).ordinal) - self.assertEqual(result.freq, p2.freq) - self.assertEqual(result.freqstr, '25H') - - msg = ('Frequency must be positive, because it' - ' represents span: -25H') - with tm.assertRaisesRegexp(ValueError, msg): - Period('2011-01', freq='-1D1H') - with tm.assertRaisesRegexp(ValueError, msg): - Period('2011-01', freq='-1H1D') - with tm.assertRaisesRegexp(ValueError, msg): - Period(ordinal=1, freq='-1D1H') - with tm.assertRaisesRegexp(ValueError, msg): - Period(ordinal=1, freq='-1H1D') - - msg = ('Frequency must be positive, because it' - ' represents span: 0D') - with tm.assertRaisesRegexp(ValueError, msg): - Period('2011-01', freq='0D0H') - with tm.assertRaisesRegexp(ValueError, msg): - Period(ordinal=1, freq='0D0H') - - # You can only combine together day and intraday offsets - msg = ('Invalid frequency: 1W1D') - with tm.assertRaisesRegexp(ValueError, msg): - Period('2011-01', freq='1W1D') - msg = ('Invalid frequency: 1D1W') - with tm.assertRaisesRegexp(ValueError, msg): - Period('2011-01', freq='1D1W') - - def test_timestamp_tz_arg(self): - tm._skip_if_no_pytz() - import pytz - for case in ['Europe/Brussels', 'Asia/Tokyo', 'US/Pacific']: - p = Period('1/1/2005', freq='M').to_timestamp(tz=case) - exp = Timestamp('1/1/2005', tz='UTC').tz_convert(case) - exp_zone = pytz.timezone(case).normalize(p) - - self.assertEqual(p, exp) - self.assertEqual(p.tz, exp_zone.tzinfo) - self.assertEqual(p.tz, exp.tz) - - p = Period('1/1/2005', freq='3H').to_timestamp(tz=case) - exp = Timestamp('1/1/2005', tz='UTC').tz_convert(case) - exp_zone = pytz.timezone(case).normalize(p) - - self.assertEqual(p, exp) - self.assertEqual(p.tz, exp_zone.tzinfo) - self.assertEqual(p.tz, exp.tz) - - p = Period('1/1/2005', freq='A').to_timestamp(freq='A', tz=case) - exp = Timestamp('31/12/2005', tz='UTC').tz_convert(case) - exp_zone = pytz.timezone(case).normalize(p) - - self.assertEqual(p, exp) - self.assertEqual(p.tz, exp_zone.tzinfo) - self.assertEqual(p.tz, exp.tz) - - p = Period('1/1/2005', freq='A').to_timestamp(freq='3H', tz=case) - exp = Timestamp('1/1/2005', tz='UTC').tz_convert(case) - exp_zone = pytz.timezone(case).normalize(p) - - self.assertEqual(p, exp) - self.assertEqual(p.tz, exp_zone.tzinfo) - self.assertEqual(p.tz, exp.tz) - - def test_timestamp_tz_arg_dateutil(self): - from pandas.tslib import _dateutil_gettz as gettz - from pandas.tslib import maybe_get_tz - for case in ['dateutil/Europe/Brussels', 'dateutil/Asia/Tokyo', - 
'dateutil/US/Pacific']: - p = Period('1/1/2005', freq='M').to_timestamp( - tz=maybe_get_tz(case)) - exp = Timestamp('1/1/2005', tz='UTC').tz_convert(case) - self.assertEqual(p, exp) - self.assertEqual(p.tz, gettz(case.split('/', 1)[1])) - self.assertEqual(p.tz, exp.tz) - - p = Period('1/1/2005', - freq='M').to_timestamp(freq='3H', tz=maybe_get_tz(case)) - exp = Timestamp('1/1/2005', tz='UTC').tz_convert(case) - self.assertEqual(p, exp) - self.assertEqual(p.tz, gettz(case.split('/', 1)[1])) - self.assertEqual(p.tz, exp.tz) - - def test_timestamp_tz_arg_dateutil_from_string(self): - from pandas.tslib import _dateutil_gettz as gettz - p = Period('1/1/2005', - freq='M').to_timestamp(tz='dateutil/Europe/Brussels') - self.assertEqual(p.tz, gettz('Europe/Brussels')) - - def test_timestamp_mult(self): - p = pd.Period('2011-01', freq='M') - self.assertEqual(p.to_timestamp(how='S'), pd.Timestamp('2011-01-01')) - self.assertEqual(p.to_timestamp(how='E'), pd.Timestamp('2011-01-31')) - - p = pd.Period('2011-01', freq='3M') - self.assertEqual(p.to_timestamp(how='S'), pd.Timestamp('2011-01-01')) - self.assertEqual(p.to_timestamp(how='E'), pd.Timestamp('2011-03-31')) - - def test_period_constructor(self): - i1 = Period('1/1/2005', freq='M') - i2 = Period('Jan 2005') - - self.assertEqual(i1, i2) - - i1 = Period('2005', freq='A') - i2 = Period('2005') - i3 = Period('2005', freq='a') - - self.assertEqual(i1, i2) - self.assertEqual(i1, i3) - - i4 = Period('2005', freq='M') - i5 = Period('2005', freq='m') - - self.assertRaises(ValueError, i1.__ne__, i4) - self.assertEqual(i4, i5) - - i1 = Period.now('Q') - i2 = Period(datetime.now(), freq='Q') - i3 = Period.now('q') - - self.assertEqual(i1, i2) - self.assertEqual(i1, i3) - - # Biz day construction, roll forward if non-weekday - i1 = Period('3/10/12', freq='B') - i2 = Period('3/10/12', freq='D') - self.assertEqual(i1, i2.asfreq('B')) - i2 = Period('3/11/12', freq='D') - self.assertEqual(i1, i2.asfreq('B')) - i2 = Period('3/12/12', freq='D') - self.assertEqual(i1, i2.asfreq('B')) - - i3 = Period('3/10/12', freq='b') - self.assertEqual(i1, i3) - - i1 = Period(year=2005, quarter=1, freq='Q') - i2 = Period('1/1/2005', freq='Q') - self.assertEqual(i1, i2) - - i1 = Period(year=2005, quarter=3, freq='Q') - i2 = Period('9/1/2005', freq='Q') - self.assertEqual(i1, i2) - - i1 = Period(year=2005, month=3, day=1, freq='D') - i2 = Period('3/1/2005', freq='D') - self.assertEqual(i1, i2) - - i3 = Period(year=2005, month=3, day=1, freq='d') - self.assertEqual(i1, i3) - - i1 = Period(year=2012, month=3, day=10, freq='B') - i2 = Period('3/12/12', freq='B') - self.assertEqual(i1, i2) - - i1 = Period('2005Q1') - i2 = Period(year=2005, quarter=1, freq='Q') - i3 = Period('2005q1') - self.assertEqual(i1, i2) - self.assertEqual(i1, i3) - - i1 = Period('05Q1') - self.assertEqual(i1, i2) - lower = Period('05q1') - self.assertEqual(i1, lower) - - i1 = Period('1Q2005') - self.assertEqual(i1, i2) - lower = Period('1q2005') - self.assertEqual(i1, lower) - - i1 = Period('1Q05') - self.assertEqual(i1, i2) - lower = Period('1q05') - self.assertEqual(i1, lower) - - i1 = Period('4Q1984') - self.assertEqual(i1.year, 1984) - lower = Period('4q1984') - self.assertEqual(i1, lower) - - i1 = Period('1982', freq='min') - i2 = Period('1982', freq='MIN') - self.assertEqual(i1, i2) - i2 = Period('1982', freq=('Min', 1)) - self.assertEqual(i1, i2) - - expected = Period('2007-01', freq='M') - i1 = Period('200701', freq='M') - self.assertEqual(i1, expected) - - i1 = Period('200701', freq='M') - 
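The constructor cases around this point all assert that different spellings resolve to the same `Period`. A compact, hedged sketch of those equivalences, using only the public `pandas` API:

```python
import pandas as pd
from datetime import datetime

# Quarter strings in several spellings resolve to the same period.
assert pd.Period('2005Q1') == pd.Period(year=2005, quarter=1, freq='Q')
assert pd.Period('1Q2005') == pd.Period('05q1')

# Datetime-like inputs are accepted directly.
assert pd.Period(datetime(2007, 1, 1), freq='M') == pd.Period('200701', freq='M')
```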
self.assertEqual(i1, expected) - - i1 = Period(200701, freq='M') - self.assertEqual(i1, expected) - - i1 = Period(ordinal=200701, freq='M') - self.assertEqual(i1.year, 18695) - - i1 = Period(datetime(2007, 1, 1), freq='M') - i2 = Period('200701', freq='M') - self.assertEqual(i1, i2) - - i1 = Period(date(2007, 1, 1), freq='M') - i2 = Period(datetime(2007, 1, 1), freq='M') - i3 = Period(np.datetime64('2007-01-01'), freq='M') - i4 = Period(np_datetime64_compat('2007-01-01 00:00:00Z'), freq='M') - i5 = Period(np_datetime64_compat('2007-01-01 00:00:00.000Z'), freq='M') - self.assertEqual(i1, i2) - self.assertEqual(i1, i3) - self.assertEqual(i1, i4) - self.assertEqual(i1, i5) - - i1 = Period('2007-01-01 09:00:00.001') - expected = Period(datetime(2007, 1, 1, 9, 0, 0, 1000), freq='L') - self.assertEqual(i1, expected) - - expected = Period(np_datetime64_compat( - '2007-01-01 09:00:00.001Z'), freq='L') - self.assertEqual(i1, expected) - - i1 = Period('2007-01-01 09:00:00.00101') - expected = Period(datetime(2007, 1, 1, 9, 0, 0, 1010), freq='U') - self.assertEqual(i1, expected) - - expected = Period(np_datetime64_compat('2007-01-01 09:00:00.00101Z'), - freq='U') - self.assertEqual(i1, expected) - - self.assertRaises(ValueError, Period, ordinal=200701) - - self.assertRaises(ValueError, Period, '2007-1-1', freq='X') - - def test_period_constructor_offsets(self): - self.assertEqual(Period('1/1/2005', freq=offsets.MonthEnd()), - Period('1/1/2005', freq='M')) - self.assertEqual(Period('2005', freq=offsets.YearEnd()), - Period('2005', freq='A')) - self.assertEqual(Period('2005', freq=offsets.MonthEnd()), - Period('2005', freq='M')) - self.assertEqual(Period('3/10/12', freq=offsets.BusinessDay()), - Period('3/10/12', freq='B')) - self.assertEqual(Period('3/10/12', freq=offsets.Day()), - Period('3/10/12', freq='D')) - - self.assertEqual(Period(year=2005, quarter=1, - freq=offsets.QuarterEnd(startingMonth=12)), - Period(year=2005, quarter=1, freq='Q')) - self.assertEqual(Period(year=2005, quarter=2, - freq=offsets.QuarterEnd(startingMonth=12)), - Period(year=2005, quarter=2, freq='Q')) - - self.assertEqual(Period(year=2005, month=3, day=1, freq=offsets.Day()), - Period(year=2005, month=3, day=1, freq='D')) - self.assertEqual(Period(year=2012, month=3, day=10, - freq=offsets.BDay()), - Period(year=2012, month=3, day=10, freq='B')) - - expected = Period('2005-03-01', freq='3D') - self.assertEqual(Period(year=2005, month=3, day=1, - freq=offsets.Day(3)), expected) - self.assertEqual(Period(year=2005, month=3, day=1, freq='3D'), - expected) - - self.assertEqual(Period(year=2012, month=3, day=10, - freq=offsets.BDay(3)), - Period(year=2012, month=3, day=10, freq='3B')) - - self.assertEqual(Period(200701, freq=offsets.MonthEnd()), - Period(200701, freq='M')) - - i1 = Period(ordinal=200701, freq=offsets.MonthEnd()) - i2 = Period(ordinal=200701, freq='M') - self.assertEqual(i1, i2) - self.assertEqual(i1.year, 18695) - self.assertEqual(i2.year, 18695) - - i1 = Period(datetime(2007, 1, 1), freq='M') - i2 = Period('200701', freq='M') - self.assertEqual(i1, i2) - - i1 = Period(date(2007, 1, 1), freq='M') - i2 = Period(datetime(2007, 1, 1), freq='M') - i3 = Period(np.datetime64('2007-01-01'), freq='M') - i4 = Period(np_datetime64_compat('2007-01-01 00:00:00Z'), freq='M') - i5 = Period(np_datetime64_compat('2007-01-01 00:00:00.000Z'), freq='M') - self.assertEqual(i1, i2) - self.assertEqual(i1, i3) - self.assertEqual(i1, i4) - self.assertEqual(i1, i5) - - i1 = Period('2007-01-01 09:00:00.001') - expected = 
Period(datetime(2007, 1, 1, 9, 0, 0, 1000), freq='L') - self.assertEqual(i1, expected) - - expected = Period(np_datetime64_compat( - '2007-01-01 09:00:00.001Z'), freq='L') - self.assertEqual(i1, expected) - - i1 = Period('2007-01-01 09:00:00.00101') - expected = Period(datetime(2007, 1, 1, 9, 0, 0, 1010), freq='U') - self.assertEqual(i1, expected) - - expected = Period(np_datetime64_compat('2007-01-01 09:00:00.00101Z'), - freq='U') - self.assertEqual(i1, expected) - - self.assertRaises(ValueError, Period, ordinal=200701) - - self.assertRaises(ValueError, Period, '2007-1-1', freq='X') - - def test_freq_str(self): - i1 = Period('1982', freq='Min') - self.assertEqual(i1.freq, offsets.Minute()) - self.assertEqual(i1.freqstr, 'T') - - def test_period_deprecated_freq(self): - cases = {"M": ["MTH", "MONTH", "MONTHLY", "Mth", "month", "monthly"], - "B": ["BUS", "BUSINESS", "BUSINESSLY", "WEEKDAY", "bus"], - "D": ["DAY", "DLY", "DAILY", "Day", "Dly", "Daily"], - "H": ["HR", "HOUR", "HRLY", "HOURLY", "hr", "Hour", "HRly"], - "T": ["minute", "MINUTE", "MINUTELY", "minutely"], - "S": ["sec", "SEC", "SECOND", "SECONDLY", "second"], - "L": ["MILLISECOND", "MILLISECONDLY", "millisecond"], - "U": ["MICROSECOND", "MICROSECONDLY", "microsecond"], - "N": ["NANOSECOND", "NANOSECONDLY", "nanosecond"]} - - msg = pd.tseries.frequencies._INVALID_FREQ_ERROR - for exp, freqs in iteritems(cases): - for freq in freqs: - with self.assertRaisesRegexp(ValueError, msg): - Period('2016-03-01 09:00', freq=freq) - with self.assertRaisesRegexp(ValueError, msg): - Period(ordinal=1, freq=freq) - - # check supported freq-aliases still work - p1 = Period('2016-03-01 09:00', freq=exp) - p2 = Period(ordinal=1, freq=exp) - tm.assertIsInstance(p1, Period) - tm.assertIsInstance(p2, Period) - - def test_hash(self): - self.assertEqual(hash(Period('2011-01', freq='M')), - hash(Period('2011-01', freq='M'))) - - self.assertNotEqual(hash(Period('2011-01-01', freq='D')), - hash(Period('2011-01', freq='M'))) - - self.assertNotEqual(hash(Period('2011-01', freq='3M')), - hash(Period('2011-01', freq='2M'))) - - self.assertNotEqual(hash(Period('2011-01', freq='M')), - hash(Period('2011-02', freq='M'))) - - def test_repr(self): - p = Period('Jan-2000') - self.assertIn('2000-01', repr(p)) - - p = Period('2000-12-15') - self.assertIn('2000-12-15', repr(p)) - - def test_repr_nat(self): - p = Period('nat', freq='M') - self.assertIn(repr(tslib.NaT), repr(p)) - - def test_millisecond_repr(self): - p = Period('2000-01-01 12:15:02.123') - - self.assertEqual("Period('2000-01-01 12:15:02.123', 'L')", repr(p)) - - def test_microsecond_repr(self): - p = Period('2000-01-01 12:15:02.123567') - - self.assertEqual("Period('2000-01-01 12:15:02.123567', 'U')", repr(p)) - - def test_strftime(self): - p = Period('2000-1-1 12:34:12', freq='S') - res = p.strftime('%Y-%m-%d %H:%M:%S') - self.assertEqual(res, '2000-01-01 12:34:12') - tm.assertIsInstance(res, text_type) # GH3363 - - def test_sub_delta(self): - left, right = Period('2011', freq='A'), Period('2007', freq='A') - result = left - right - self.assertEqual(result, 4) - - with self.assertRaises(period.IncompatibleFrequency): - left - Period('2007-01', freq='M') - - def test_to_timestamp(self): - p = Period('1982', freq='A') - start_ts = p.to_timestamp(how='S') - aliases = ['s', 'StarT', 'BEGIn'] - for a in aliases: - self.assertEqual(start_ts, p.to_timestamp('D', how=a)) - # freq with mult should not affect the result - self.assertEqual(start_ts, p.to_timestamp('3D', how=a)) - - end_ts =
p.to_timestamp(how='E') - aliases = ['e', 'end', 'FINIsH'] - for a in aliases: - self.assertEqual(end_ts, p.to_timestamp('D', how=a)) - self.assertEqual(end_ts, p.to_timestamp('3D', how=a)) - - from_lst = ['A', 'Q', 'M', 'W', 'B', 'D', 'H', 'Min', 'S'] - - def _ex(p): - return Timestamp((p + 1).start_time.value - 1) - - for i, fcode in enumerate(from_lst): - p = Period('1982', freq=fcode) - result = p.to_timestamp().to_period(fcode) - self.assertEqual(result, p) - - self.assertEqual(p.start_time, p.to_timestamp(how='S')) - - self.assertEqual(p.end_time, _ex(p)) - - # Frequency other than daily - - p = Period('1985', freq='A') - - result = p.to_timestamp('H', how='end') - expected = datetime(1985, 12, 31, 23) - self.assertEqual(result, expected) - result = p.to_timestamp('3H', how='end') - self.assertEqual(result, expected) - - result = p.to_timestamp('T', how='end') - expected = datetime(1985, 12, 31, 23, 59) - self.assertEqual(result, expected) - result = p.to_timestamp('2T', how='end') - self.assertEqual(result, expected) - - result = p.to_timestamp(how='end') - expected = datetime(1985, 12, 31) - self.assertEqual(result, expected) - - expected = datetime(1985, 1, 1) - result = p.to_timestamp('H', how='start') - self.assertEqual(result, expected) - result = p.to_timestamp('T', how='start') - self.assertEqual(result, expected) - result = p.to_timestamp('S', how='start') - self.assertEqual(result, expected) - result = p.to_timestamp('3H', how='start') - self.assertEqual(result, expected) - result = p.to_timestamp('5S', how='start') - self.assertEqual(result, expected) - - def test_start_time(self): - freq_lst = ['A', 'Q', 'M', 'D', 'H', 'T', 'S'] - xp = datetime(2012, 1, 1) - for f in freq_lst: - p = Period('2012', freq=f) - self.assertEqual(p.start_time, xp) - self.assertEqual(Period('2012', freq='B').start_time, - datetime(2012, 1, 2)) - self.assertEqual(Period('2012', freq='W').start_time, - datetime(2011, 12, 26)) - - def test_end_time(self): - p = Period('2012', freq='A') - - def _ex(*args): - return Timestamp(Timestamp(datetime(*args)).value - 1) - - xp = _ex(2013, 1, 1) - self.assertEqual(xp, p.end_time) - - p = Period('2012', freq='Q') - xp = _ex(2012, 4, 1) - self.assertEqual(xp, p.end_time) - - p = Period('2012', freq='M') - xp = _ex(2012, 2, 1) - self.assertEqual(xp, p.end_time) - - p = Period('2012', freq='D') - xp = _ex(2012, 1, 2) - self.assertEqual(xp, p.end_time) - - p = Period('2012', freq='H') - xp = _ex(2012, 1, 1, 1) - self.assertEqual(xp, p.end_time) - - p = Period('2012', freq='B') - xp = _ex(2012, 1, 3) - self.assertEqual(xp, p.end_time) - - p = Period('2012', freq='W') - xp = _ex(2012, 1, 2) - self.assertEqual(xp, p.end_time) - - # Test for GH 11738 - p = Period('2012', freq='15D') - xp = _ex(2012, 1, 16) - self.assertEqual(xp, p.end_time) - - p = Period('2012', freq='1D1H') - xp = _ex(2012, 1, 2, 1) - self.assertEqual(xp, p.end_time) - - p = Period('2012', freq='1H1D') - xp = _ex(2012, 1, 2, 1) - self.assertEqual(xp, p.end_time) - - def test_anchor_week_end_time(self): - def _ex(*args): - return Timestamp(Timestamp(datetime(*args)).value - 1) - - p = Period('2013-1-1', 'W-SAT') - xp = _ex(2013, 1, 6) - self.assertEqual(p.end_time, xp) - - def test_properties_annually(self): - # Test properties on Periods with annual frequency. - a_date = Period(freq='A', year=2007) - self.assertEqual(a_date.year, 2007) - - def test_properties_quarterly(self): - # Test properties on Periods with quarterly frequency.
- qedec_date = Period(freq="Q-DEC", year=2007, quarter=1) - qejan_date = Period(freq="Q-JAN", year=2007, quarter=1) - qejun_date = Period(freq="Q-JUN", year=2007, quarter=1) - # - for x in range(3): - for qd in (qedec_date, qejan_date, qejun_date): - self.assertEqual((qd + x).qyear, 2007) - self.assertEqual((qd + x).quarter, x + 1) - - def test_properties_monthly(self): - # Test properties on Periods with monthly frequency. - m_date = Period(freq='M', year=2007, month=1) - for x in range(11): - m_ival_x = m_date + x - self.assertEqual(m_ival_x.year, 2007) - if 1 <= x + 1 <= 3: - self.assertEqual(m_ival_x.quarter, 1) - elif 4 <= x + 1 <= 6: - self.assertEqual(m_ival_x.quarter, 2) - elif 7 <= x + 1 <= 9: - self.assertEqual(m_ival_x.quarter, 3) - elif 10 <= x + 1 <= 12: - self.assertEqual(m_ival_x.quarter, 4) - self.assertEqual(m_ival_x.month, x + 1) - - def test_properties_weekly(self): - # Test properties on Periods with weekly frequency. - w_date = Period(freq='W', year=2007, month=1, day=7) - # - self.assertEqual(w_date.year, 2007) - self.assertEqual(w_date.quarter, 1) - self.assertEqual(w_date.month, 1) - self.assertEqual(w_date.week, 1) - self.assertEqual((w_date - 1).week, 52) - self.assertEqual(w_date.days_in_month, 31) - self.assertEqual(Period(freq='W', year=2012, - month=2, day=1).days_in_month, 29) - - def test_properties_weekly_legacy(self): - # Test properties on Periods with weekly frequency. - w_date = Period(freq='W', year=2007, month=1, day=7) - self.assertEqual(w_date.year, 2007) - self.assertEqual(w_date.quarter, 1) - self.assertEqual(w_date.month, 1) - self.assertEqual(w_date.week, 1) - self.assertEqual((w_date - 1).week, 52) - self.assertEqual(w_date.days_in_month, 31) - - exp = Period(freq='W', year=2012, month=2, day=1) - self.assertEqual(exp.days_in_month, 29) - - msg = pd.tseries.frequencies._INVALID_FREQ_ERROR - with self.assertRaisesRegexp(ValueError, msg): - Period(freq='WK', year=2007, month=1, day=7) - - def test_properties_daily(self): - # Test properties on Periods with daily frequency. - b_date = Period(freq='B', year=2007, month=1, day=1) - # - self.assertEqual(b_date.year, 2007) - self.assertEqual(b_date.quarter, 1) - self.assertEqual(b_date.month, 1) - self.assertEqual(b_date.day, 1) - self.assertEqual(b_date.weekday, 0) - self.assertEqual(b_date.dayofyear, 1) - self.assertEqual(b_date.days_in_month, 31) - self.assertEqual(Period(freq='B', year=2012, - month=2, day=1).days_in_month, 29) - # - d_date = Period(freq='D', year=2007, month=1, day=1) - # - self.assertEqual(d_date.year, 2007) - self.assertEqual(d_date.quarter, 1) - self.assertEqual(d_date.month, 1) - self.assertEqual(d_date.day, 1) - self.assertEqual(d_date.weekday, 0) - self.assertEqual(d_date.dayofyear, 1) - self.assertEqual(d_date.days_in_month, 31) - self.assertEqual(Period(freq='D', year=2012, month=2, - day=1).days_in_month, 29) - - def test_properties_hourly(self): - # Test properties on Periods with hourly frequency.
- h_date1 = Period(freq='H', year=2007, month=1, day=1, hour=0) - h_date2 = Period(freq='2H', year=2007, month=1, day=1, hour=0) - - for h_date in [h_date1, h_date2]: - self.assertEqual(h_date.year, 2007) - self.assertEqual(h_date.quarter, 1) - self.assertEqual(h_date.month, 1) - self.assertEqual(h_date.day, 1) - self.assertEqual(h_date.weekday, 0) - self.assertEqual(h_date.dayofyear, 1) - self.assertEqual(h_date.hour, 0) - self.assertEqual(h_date.days_in_month, 31) - self.assertEqual(Period(freq='H', year=2012, month=2, day=1, - hour=0).days_in_month, 29) - - def test_properties_minutely(self): - # Test properties on Periods with minutely frequency. - t_date = Period(freq='Min', year=2007, month=1, day=1, hour=0, - minute=0) - # - self.assertEqual(t_date.quarter, 1) - self.assertEqual(t_date.month, 1) - self.assertEqual(t_date.day, 1) - self.assertEqual(t_date.weekday, 0) - self.assertEqual(t_date.dayofyear, 1) - self.assertEqual(t_date.hour, 0) - self.assertEqual(t_date.minute, 0) - self.assertEqual(t_date.days_in_month, 31) - self.assertEqual(Period(freq='Min', year=2012, month=2, day=1, hour=0, - minute=0).days_in_month, 29) - - def test_properties_secondly(self): - # Test properties on Periods with secondly frequency. - s_date = Period(freq='S', year=2007, month=1, day=1, hour=0, - minute=0, second=0) - # - self.assertEqual(s_date.year, 2007) - self.assertEqual(s_date.quarter, 1) - self.assertEqual(s_date.month, 1) - self.assertEqual(s_date.day, 1) - self.assertEqual(s_date.weekday, 0) - self.assertEqual(s_date.dayofyear, 1) - self.assertEqual(s_date.hour, 0) - self.assertEqual(s_date.minute, 0) - self.assertEqual(s_date.second, 0) - self.assertEqual(s_date.days_in_month, 31) - self.assertEqual(Period(freq='S', year=2012, month=2, day=1, hour=0, - minute=0, second=0).days_in_month, 29) - - def test_properties_nat(self): - p_nat = Period('NaT', freq='M') - t_nat = pd.Timestamp('NaT') - self.assertIs(p_nat, t_nat) - - # confirm Period('NaT') works identically to Timestamp('NaT') - for f in ['year', 'month', 'day', 'hour', 'minute', 'second', 'week', - 'dayofyear', 'quarter', 'days_in_month']: - self.assertTrue(np.isnan(getattr(p_nat, f))) - self.assertTrue(np.isnan(getattr(t_nat, f))) - - def test_pnow(self): - dt = datetime.now() - - val = period.pnow('D') - exp = Period(dt, freq='D') - self.assertEqual(val, exp) - - val2 = period.pnow('2D') - exp2 = Period(dt, freq='2D') - self.assertEqual(val2, exp2) - self.assertEqual(val.ordinal, val2.ordinal) - self.assertEqual(val.ordinal, exp2.ordinal) - - def test_constructor_corner(self): - expected = Period('2007-01', freq='2M') - self.assertEqual(Period(year=2007, month=1, freq='2M'), expected) - - self.assertRaises(ValueError, Period, datetime.now()) - self.assertRaises(ValueError, Period, datetime.now().date()) - self.assertRaises(ValueError, Period, 1.6, freq='D') - self.assertRaises(ValueError, Period, ordinal=1.6, freq='D') - self.assertRaises(ValueError, Period, ordinal=2, value=1, freq='D') - self.assertIs(Period(None), pd.NaT) - self.assertRaises(ValueError, Period, month=1) - - p = Period('2007-01-01', freq='D') - - result = Period(p, freq='A') - exp = Period('2007', freq='A') - self.assertEqual(result, exp) - - def test_constructor_infer_freq(self): - p = Period('2007-01-01') - self.assertEqual(p.freq, 'D') - - p = Period('2007-01-01 07') - self.assertEqual(p.freq, 'H') - - p = Period('2007-01-01 07:10') - self.assertEqual(p.freq, 'T') - - p = Period('2007-01-01 07:10:15') - self.assertEqual(p.freq, 'S') - - p = Period('2007-01-01
07:10:15.123') - self.assertEqual(p.freq, 'L') - - p = Period('2007-01-01 07:10:15.123000') - self.assertEqual(p.freq, 'L') - - p = Period('2007-01-01 07:10:15.123400') - self.assertEqual(p.freq, 'U') - - def test_asfreq_MS(self): - initial = Period("2013") - - self.assertEqual(initial.asfreq(freq="M", how="S"), - Period('2013-01', 'M')) - - msg = pd.tseries.frequencies._INVALID_FREQ_ERROR - with self.assertRaisesRegexp(ValueError, msg): - initial.asfreq(freq="MS", how="S") - - with tm.assertRaisesRegexp(ValueError, msg): - pd.Period('2013-01', 'MS') - - self.assertTrue(_period_code_map.get("MS") is None) - - -def noWrap(item): - return item - - -class TestFreqConversion(tm.TestCase): - "Test frequency conversion of date objects" - - def test_asfreq_corner(self): - val = Period(freq='A', year=2007) - result1 = val.asfreq('5t') - result2 = val.asfreq('t') - expected = Period('2007-12-31 23:59', freq='t') - self.assertEqual(result1.ordinal, expected.ordinal) - self.assertEqual(result1.freqstr, '5T') - self.assertEqual(result2.ordinal, expected.ordinal) - self.assertEqual(result2.freqstr, 'T') - - def test_conv_annual(self): - # frequency conversion tests: from Annual Frequency - - ival_A = Period(freq='A', year=2007) - - ival_AJAN = Period(freq="A-JAN", year=2007) - ival_AJUN = Period(freq="A-JUN", year=2007) - ival_ANOV = Period(freq="A-NOV", year=2007) - - ival_A_to_Q_start = Period(freq='Q', year=2007, quarter=1) - ival_A_to_Q_end = Period(freq='Q', year=2007, quarter=4) - ival_A_to_M_start = Period(freq='M', year=2007, month=1) - ival_A_to_M_end = Period(freq='M', year=2007, month=12) - ival_A_to_W_start = Period(freq='W', year=2007, month=1, day=1) - ival_A_to_W_end = Period(freq='W', year=2007, month=12, day=31) - ival_A_to_B_start = Period(freq='B', year=2007, month=1, day=1) - ival_A_to_B_end = Period(freq='B', year=2007, month=12, day=31) - ival_A_to_D_start = Period(freq='D', year=2007, month=1, day=1) - ival_A_to_D_end = Period(freq='D', year=2007, month=12, day=31) - ival_A_to_H_start = Period(freq='H', year=2007, month=1, day=1, hour=0) - ival_A_to_H_end = Period(freq='H', year=2007, month=12, day=31, - hour=23) - ival_A_to_T_start = Period(freq='Min', year=2007, month=1, day=1, - hour=0, minute=0) - ival_A_to_T_end = Period(freq='Min', year=2007, month=12, day=31, - hour=23, minute=59) - ival_A_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0, - minute=0, second=0) - ival_A_to_S_end = Period(freq='S', year=2007, month=12, day=31, - hour=23, minute=59, second=59) - - ival_AJAN_to_D_end = Period(freq='D', year=2007, month=1, day=31) - ival_AJAN_to_D_start = Period(freq='D', year=2006, month=2, day=1) - ival_AJUN_to_D_end = Period(freq='D', year=2007, month=6, day=30) - ival_AJUN_to_D_start = Period(freq='D', year=2006, month=7, day=1) - ival_ANOV_to_D_end = Period(freq='D', year=2007, month=11, day=30) - ival_ANOV_to_D_start = Period(freq='D', year=2006, month=12, day=1) - - self.assertEqual(ival_A.asfreq('Q', 'S'), ival_A_to_Q_start) - self.assertEqual(ival_A.asfreq('Q', 'e'), ival_A_to_Q_end) - self.assertEqual(ival_A.asfreq('M', 's'), ival_A_to_M_start) - self.assertEqual(ival_A.asfreq('M', 'E'), ival_A_to_M_end) - self.assertEqual(ival_A.asfreq('W', 'S'), ival_A_to_W_start) - self.assertEqual(ival_A.asfreq('W', 'E'), ival_A_to_W_end) - self.assertEqual(ival_A.asfreq('B', 'S'), ival_A_to_B_start) - self.assertEqual(ival_A.asfreq('B', 'E'), ival_A_to_B_end) - self.assertEqual(ival_A.asfreq('D', 'S'), ival_A_to_D_start) - self.assertEqual(ival_A.asfreq('D', 'E'), 
-        self.assertEqual(ival_A.asfreq('H', 'S'), ival_A_to_H_start)
-        self.assertEqual(ival_A.asfreq('H', 'E'), ival_A_to_H_end)
-        self.assertEqual(ival_A.asfreq('min', 'S'), ival_A_to_T_start)
-        self.assertEqual(ival_A.asfreq('min', 'E'), ival_A_to_T_end)
-        self.assertEqual(ival_A.asfreq('T', 'S'), ival_A_to_T_start)
-        self.assertEqual(ival_A.asfreq('T', 'E'), ival_A_to_T_end)
-        self.assertEqual(ival_A.asfreq('S', 'S'), ival_A_to_S_start)
-        self.assertEqual(ival_A.asfreq('S', 'E'), ival_A_to_S_end)
-
-        self.assertEqual(ival_AJAN.asfreq('D', 'S'), ival_AJAN_to_D_start)
-        self.assertEqual(ival_AJAN.asfreq('D', 'E'), ival_AJAN_to_D_end)
-
-        self.assertEqual(ival_AJUN.asfreq('D', 'S'), ival_AJUN_to_D_start)
-        self.assertEqual(ival_AJUN.asfreq('D', 'E'), ival_AJUN_to_D_end)
-
-        self.assertEqual(ival_ANOV.asfreq('D', 'S'), ival_ANOV_to_D_start)
-        self.assertEqual(ival_ANOV.asfreq('D', 'E'), ival_ANOV_to_D_end)
-
-        self.assertEqual(ival_A.asfreq('A'), ival_A)
-
-    def test_conv_quarterly(self):
-        # frequency conversion tests: from Quarterly Frequency
-
-        ival_Q = Period(freq='Q', year=2007, quarter=1)
-        ival_Q_end_of_year = Period(freq='Q', year=2007, quarter=4)
-
-        ival_QEJAN = Period(freq="Q-JAN", year=2007, quarter=1)
-        ival_QEJUN = Period(freq="Q-JUN", year=2007, quarter=1)
-
-        ival_Q_to_A = Period(freq='A', year=2007)
-        ival_Q_to_M_start = Period(freq='M', year=2007, month=1)
-        ival_Q_to_M_end = Period(freq='M', year=2007, month=3)
-        ival_Q_to_W_start = Period(freq='W', year=2007, month=1, day=1)
-        ival_Q_to_W_end = Period(freq='W', year=2007, month=3, day=31)
-        ival_Q_to_B_start = Period(freq='B', year=2007, month=1, day=1)
-        ival_Q_to_B_end = Period(freq='B', year=2007, month=3, day=30)
-        ival_Q_to_D_start = Period(freq='D', year=2007, month=1, day=1)
-        ival_Q_to_D_end = Period(freq='D', year=2007, month=3, day=31)
-        ival_Q_to_H_start = Period(freq='H', year=2007, month=1, day=1, hour=0)
-        ival_Q_to_H_end = Period(freq='H', year=2007, month=3, day=31, hour=23)
-        ival_Q_to_T_start = Period(freq='Min', year=2007, month=1, day=1,
-                                   hour=0, minute=0)
-        ival_Q_to_T_end = Period(freq='Min', year=2007, month=3, day=31,
-                                 hour=23, minute=59)
-        ival_Q_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0,
-                                   minute=0, second=0)
-        ival_Q_to_S_end = Period(freq='S', year=2007, month=3, day=31, hour=23,
-                                 minute=59, second=59)
-
-        ival_QEJAN_to_D_start = Period(freq='D', year=2006, month=2, day=1)
-        ival_QEJAN_to_D_end = Period(freq='D', year=2006, month=4, day=30)
-
-        ival_QEJUN_to_D_start = Period(freq='D', year=2006, month=7, day=1)
-        ival_QEJUN_to_D_end = Period(freq='D', year=2006, month=9, day=30)
-
-        self.assertEqual(ival_Q.asfreq('A'), ival_Q_to_A)
-        self.assertEqual(ival_Q_end_of_year.asfreq('A'), ival_Q_to_A)
-
-        self.assertEqual(ival_Q.asfreq('M', 'S'), ival_Q_to_M_start)
-        self.assertEqual(ival_Q.asfreq('M', 'E'), ival_Q_to_M_end)
-        self.assertEqual(ival_Q.asfreq('W', 'S'), ival_Q_to_W_start)
-        self.assertEqual(ival_Q.asfreq('W', 'E'), ival_Q_to_W_end)
-        self.assertEqual(ival_Q.asfreq('B', 'S'), ival_Q_to_B_start)
-        self.assertEqual(ival_Q.asfreq('B', 'E'), ival_Q_to_B_end)
-        self.assertEqual(ival_Q.asfreq('D', 'S'), ival_Q_to_D_start)
-        self.assertEqual(ival_Q.asfreq('D', 'E'), ival_Q_to_D_end)
-        self.assertEqual(ival_Q.asfreq('H', 'S'), ival_Q_to_H_start)
-        self.assertEqual(ival_Q.asfreq('H', 'E'), ival_Q_to_H_end)
-        self.assertEqual(ival_Q.asfreq('Min', 'S'), ival_Q_to_T_start)
-        self.assertEqual(ival_Q.asfreq('Min', 'E'), ival_Q_to_T_end)
-        self.assertEqual(ival_Q.asfreq('S', 'S'), ival_Q_to_S_start)
-        self.assertEqual(ival_Q.asfreq('S', 'E'), ival_Q_to_S_end)
-
-        self.assertEqual(ival_QEJAN.asfreq('D', 'S'), ival_QEJAN_to_D_start)
-        self.assertEqual(ival_QEJAN.asfreq('D', 'E'), ival_QEJAN_to_D_end)
-        self.assertEqual(ival_QEJUN.asfreq('D', 'S'), ival_QEJUN_to_D_start)
-        self.assertEqual(ival_QEJUN.asfreq('D', 'E'), ival_QEJUN_to_D_end)
-
-        self.assertEqual(ival_Q.asfreq('Q'), ival_Q)
-
-    def test_conv_monthly(self):
-        # frequency conversion tests: from Monthly Frequency
-
-        ival_M = Period(freq='M', year=2007, month=1)
-        ival_M_end_of_year = Period(freq='M', year=2007, month=12)
-        ival_M_end_of_quarter = Period(freq='M', year=2007, month=3)
-        ival_M_to_A = Period(freq='A', year=2007)
-        ival_M_to_Q = Period(freq='Q', year=2007, quarter=1)
-        ival_M_to_W_start = Period(freq='W', year=2007, month=1, day=1)
-        ival_M_to_W_end = Period(freq='W', year=2007, month=1, day=31)
-        ival_M_to_B_start = Period(freq='B', year=2007, month=1, day=1)
-        ival_M_to_B_end = Period(freq='B', year=2007, month=1, day=31)
-        ival_M_to_D_start = Period(freq='D', year=2007, month=1, day=1)
-        ival_M_to_D_end = Period(freq='D', year=2007, month=1, day=31)
-        ival_M_to_H_start = Period(freq='H', year=2007, month=1, day=1, hour=0)
-        ival_M_to_H_end = Period(freq='H', year=2007, month=1, day=31, hour=23)
-        ival_M_to_T_start = Period(freq='Min', year=2007, month=1, day=1,
-                                   hour=0, minute=0)
-        ival_M_to_T_end = Period(freq='Min', year=2007, month=1, day=31,
-                                 hour=23, minute=59)
-        ival_M_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0,
-                                   minute=0, second=0)
-        ival_M_to_S_end = Period(freq='S', year=2007, month=1, day=31, hour=23,
-                                 minute=59, second=59)
-
-        self.assertEqual(ival_M.asfreq('A'), ival_M_to_A)
-        self.assertEqual(ival_M_end_of_year.asfreq('A'), ival_M_to_A)
-        self.assertEqual(ival_M.asfreq('Q'), ival_M_to_Q)
-        self.assertEqual(ival_M_end_of_quarter.asfreq('Q'), ival_M_to_Q)
-
-        self.assertEqual(ival_M.asfreq('W', 'S'), ival_M_to_W_start)
-        self.assertEqual(ival_M.asfreq('W', 'E'), ival_M_to_W_end)
-        self.assertEqual(ival_M.asfreq('B', 'S'), ival_M_to_B_start)
-        self.assertEqual(ival_M.asfreq('B', 'E'), ival_M_to_B_end)
-        self.assertEqual(ival_M.asfreq('D', 'S'), ival_M_to_D_start)
-        self.assertEqual(ival_M.asfreq('D', 'E'), ival_M_to_D_end)
-        self.assertEqual(ival_M.asfreq('H', 'S'), ival_M_to_H_start)
-        self.assertEqual(ival_M.asfreq('H', 'E'), ival_M_to_H_end)
-        self.assertEqual(ival_M.asfreq('Min', 'S'), ival_M_to_T_start)
-        self.assertEqual(ival_M.asfreq('Min', 'E'), ival_M_to_T_end)
-        self.assertEqual(ival_M.asfreq('S', 'S'), ival_M_to_S_start)
-        self.assertEqual(ival_M.asfreq('S', 'E'), ival_M_to_S_end)
-
-        self.assertEqual(ival_M.asfreq('M'), ival_M)
-
-    def test_conv_weekly(self):
-        # frequency conversion tests: from Weekly Frequency
-        ival_W = Period(freq='W', year=2007, month=1, day=1)
-
-        ival_WSUN = Period(freq='W', year=2007, month=1, day=7)
-        ival_WSAT = Period(freq='W-SAT', year=2007, month=1, day=6)
-        ival_WFRI = Period(freq='W-FRI', year=2007, month=1, day=5)
-        ival_WTHU = Period(freq='W-THU', year=2007, month=1, day=4)
-        ival_WWED = Period(freq='W-WED', year=2007, month=1, day=3)
-        ival_WTUE = Period(freq='W-TUE', year=2007, month=1, day=2)
-        ival_WMON = Period(freq='W-MON', year=2007, month=1, day=1)
-
-        ival_WSUN_to_D_start = Period(freq='D', year=2007, month=1, day=1)
-        ival_WSUN_to_D_end = Period(freq='D', year=2007, month=1, day=7)
-        ival_WSAT_to_D_start = Period(freq='D', year=2006, month=12, day=31)
-        ival_WSAT_to_D_end = Period(freq='D', year=2007, month=1, day=6)
-        ival_WFRI_to_D_start = Period(freq='D', year=2006, month=12, day=30)
-        ival_WFRI_to_D_end = Period(freq='D', year=2007, month=1, day=5)
-        ival_WTHU_to_D_start = Period(freq='D', year=2006, month=12, day=29)
-        ival_WTHU_to_D_end = Period(freq='D', year=2007, month=1, day=4)
-        ival_WWED_to_D_start = Period(freq='D', year=2006, month=12, day=28)
-        ival_WWED_to_D_end = Period(freq='D', year=2007, month=1, day=3)
-        ival_WTUE_to_D_start = Period(freq='D', year=2006, month=12, day=27)
-        ival_WTUE_to_D_end = Period(freq='D', year=2007, month=1, day=2)
-        ival_WMON_to_D_start = Period(freq='D', year=2006, month=12, day=26)
-        ival_WMON_to_D_end = Period(freq='D', year=2007, month=1, day=1)
-
-        ival_W_end_of_year = Period(freq='W', year=2007, month=12, day=31)
-        ival_W_end_of_quarter = Period(freq='W', year=2007, month=3, day=31)
-        ival_W_end_of_month = Period(freq='W', year=2007, month=1, day=31)
-        ival_W_to_A = Period(freq='A', year=2007)
-        ival_W_to_Q = Period(freq='Q', year=2007, quarter=1)
-        ival_W_to_M = Period(freq='M', year=2007, month=1)
-
-        if Period(freq='D', year=2007, month=12, day=31).weekday == 6:
-            ival_W_to_A_end_of_year = Period(freq='A', year=2007)
-        else:
-            ival_W_to_A_end_of_year = Period(freq='A', year=2008)
-
-        if Period(freq='D', year=2007, month=3, day=31).weekday == 6:
-            ival_W_to_Q_end_of_quarter = Period(freq='Q', year=2007, quarter=1)
-        else:
-            ival_W_to_Q_end_of_quarter = Period(freq='Q', year=2007, quarter=2)
-
-        if Period(freq='D', year=2007, month=1, day=31).weekday == 6:
-            ival_W_to_M_end_of_month = Period(freq='M', year=2007, month=1)
-        else:
-            ival_W_to_M_end_of_month = Period(freq='M', year=2007, month=2)
-
-        ival_W_to_B_start = Period(freq='B', year=2007, month=1, day=1)
-        ival_W_to_B_end = Period(freq='B', year=2007, month=1, day=5)
-        ival_W_to_D_start = Period(freq='D', year=2007, month=1, day=1)
-        ival_W_to_D_end = Period(freq='D', year=2007, month=1, day=7)
-        ival_W_to_H_start = Period(freq='H', year=2007, month=1, day=1, hour=0)
-        ival_W_to_H_end = Period(freq='H', year=2007, month=1, day=7, hour=23)
-        ival_W_to_T_start = Period(freq='Min', year=2007, month=1, day=1,
-                                   hour=0, minute=0)
-        ival_W_to_T_end = Period(freq='Min', year=2007, month=1, day=7,
-                                 hour=23, minute=59)
-        ival_W_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0,
-                                   minute=0, second=0)
-        ival_W_to_S_end = Period(freq='S', year=2007, month=1, day=7, hour=23,
-                                 minute=59, second=59)
-
-        self.assertEqual(ival_W.asfreq('A'), ival_W_to_A)
-        self.assertEqual(ival_W_end_of_year.asfreq('A'),
-                         ival_W_to_A_end_of_year)
-        self.assertEqual(ival_W.asfreq('Q'), ival_W_to_Q)
-        self.assertEqual(ival_W_end_of_quarter.asfreq('Q'),
-                         ival_W_to_Q_end_of_quarter)
-        self.assertEqual(ival_W.asfreq('M'), ival_W_to_M)
-        self.assertEqual(ival_W_end_of_month.asfreq('M'),
-                         ival_W_to_M_end_of_month)
-
-        self.assertEqual(ival_W.asfreq('B', 'S'), ival_W_to_B_start)
-        self.assertEqual(ival_W.asfreq('B', 'E'), ival_W_to_B_end)
-
-        self.assertEqual(ival_W.asfreq('D', 'S'), ival_W_to_D_start)
-        self.assertEqual(ival_W.asfreq('D', 'E'), ival_W_to_D_end)
-
-        self.assertEqual(ival_WSUN.asfreq('D', 'S'), ival_WSUN_to_D_start)
-        self.assertEqual(ival_WSUN.asfreq('D', 'E'), ival_WSUN_to_D_end)
-        self.assertEqual(ival_WSAT.asfreq('D', 'S'), ival_WSAT_to_D_start)
-        self.assertEqual(ival_WSAT.asfreq('D', 'E'), ival_WSAT_to_D_end)
-        self.assertEqual(ival_WFRI.asfreq('D', 'S'), ival_WFRI_to_D_start)
-        self.assertEqual(ival_WFRI.asfreq('D', 'E'), ival_WFRI_to_D_end)
-        self.assertEqual(ival_WTHU.asfreq('D', 'S'), ival_WTHU_to_D_start)
-        self.assertEqual(ival_WTHU.asfreq('D', 'E'), ival_WTHU_to_D_end)
-        self.assertEqual(ival_WWED.asfreq('D', 'S'), ival_WWED_to_D_start)
-        self.assertEqual(ival_WWED.asfreq('D', 'E'), ival_WWED_to_D_end)
-        self.assertEqual(ival_WTUE.asfreq('D', 'S'), ival_WTUE_to_D_start)
-        self.assertEqual(ival_WTUE.asfreq('D', 'E'), ival_WTUE_to_D_end)
-        self.assertEqual(ival_WMON.asfreq('D', 'S'), ival_WMON_to_D_start)
-        self.assertEqual(ival_WMON.asfreq('D', 'E'), ival_WMON_to_D_end)
-
-        self.assertEqual(ival_W.asfreq('H', 'S'), ival_W_to_H_start)
-        self.assertEqual(ival_W.asfreq('H', 'E'), ival_W_to_H_end)
-        self.assertEqual(ival_W.asfreq('Min', 'S'), ival_W_to_T_start)
-        self.assertEqual(ival_W.asfreq('Min', 'E'), ival_W_to_T_end)
-        self.assertEqual(ival_W.asfreq('S', 'S'), ival_W_to_S_start)
-        self.assertEqual(ival_W.asfreq('S', 'E'), ival_W_to_S_end)
-
-        self.assertEqual(ival_W.asfreq('W'), ival_W)
-
-        msg = pd.tseries.frequencies._INVALID_FREQ_ERROR
-        with self.assertRaisesRegexp(ValueError, msg):
-            ival_W.asfreq('WK')
-
-    def test_conv_weekly_legacy(self):
-        # frequency conversion tests: from Weekly Frequency
-        msg = pd.tseries.frequencies._INVALID_FREQ_ERROR
-        with self.assertRaisesRegexp(ValueError, msg):
-            Period(freq='WK', year=2007, month=1, day=1)
-
-        with self.assertRaisesRegexp(ValueError, msg):
-            Period(freq='WK-SAT', year=2007, month=1, day=6)
-        with self.assertRaisesRegexp(ValueError, msg):
-            Period(freq='WK-FRI', year=2007, month=1, day=5)
-        with self.assertRaisesRegexp(ValueError, msg):
-            Period(freq='WK-THU', year=2007, month=1, day=4)
-        with self.assertRaisesRegexp(ValueError, msg):
-            Period(freq='WK-WED', year=2007, month=1, day=3)
-        with self.assertRaisesRegexp(ValueError, msg):
-            Period(freq='WK-TUE', year=2007, month=1, day=2)
-        with self.assertRaisesRegexp(ValueError, msg):
-            Period(freq='WK-MON', year=2007, month=1, day=1)
-
-    def test_conv_business(self):
-        # frequency conversion tests: from Business Frequency"
-
-        ival_B = Period(freq='B', year=2007, month=1, day=1)
-        ival_B_end_of_year = Period(freq='B', year=2007, month=12, day=31)
-        ival_B_end_of_quarter = Period(freq='B', year=2007, month=3, day=30)
-        ival_B_end_of_month = Period(freq='B', year=2007, month=1, day=31)
-        ival_B_end_of_week = Period(freq='B', year=2007, month=1, day=5)
-
-        ival_B_to_A = Period(freq='A', year=2007)
-        ival_B_to_Q = Period(freq='Q', year=2007, quarter=1)
-        ival_B_to_M = Period(freq='M', year=2007, month=1)
-        ival_B_to_W = Period(freq='W', year=2007, month=1, day=7)
-        ival_B_to_D = Period(freq='D', year=2007, month=1, day=1)
-        ival_B_to_H_start = Period(freq='H', year=2007, month=1, day=1, hour=0)
-        ival_B_to_H_end = Period(freq='H', year=2007, month=1, day=1, hour=23)
-        ival_B_to_T_start = Period(freq='Min', year=2007, month=1, day=1,
-                                   hour=0, minute=0)
-        ival_B_to_T_end = Period(freq='Min', year=2007, month=1, day=1,
-                                 hour=23, minute=59)
-        ival_B_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0,
-                                   minute=0, second=0)
-        ival_B_to_S_end = Period(freq='S', year=2007, month=1, day=1, hour=23,
-                                 minute=59, second=59)
-
-        self.assertEqual(ival_B.asfreq('A'), ival_B_to_A)
-        self.assertEqual(ival_B_end_of_year.asfreq('A'), ival_B_to_A)
-        self.assertEqual(ival_B.asfreq('Q'), ival_B_to_Q)
-        self.assertEqual(ival_B_end_of_quarter.asfreq('Q'), ival_B_to_Q)
-        self.assertEqual(ival_B.asfreq('M'), ival_B_to_M)
-        self.assertEqual(ival_B_end_of_month.asfreq('M'), ival_B_to_M)
-        self.assertEqual(ival_B.asfreq('W'), ival_B_to_W)
-        self.assertEqual(ival_B_end_of_week.asfreq('W'), ival_B_to_W)
-
-        self.assertEqual(ival_B.asfreq('D'), ival_B_to_D)
-
-        self.assertEqual(ival_B.asfreq('H', 'S'), ival_B_to_H_start)
-        self.assertEqual(ival_B.asfreq('H', 'E'), ival_B_to_H_end)
-        self.assertEqual(ival_B.asfreq('Min', 'S'), ival_B_to_T_start)
-        self.assertEqual(ival_B.asfreq('Min', 'E'), ival_B_to_T_end)
-        self.assertEqual(ival_B.asfreq('S', 'S'), ival_B_to_S_start)
-        self.assertEqual(ival_B.asfreq('S', 'E'), ival_B_to_S_end)
-
-        self.assertEqual(ival_B.asfreq('B'), ival_B)
-
-    def test_conv_daily(self):
-        # frequency conversion tests: from Business Frequency"
-
-        ival_D = Period(freq='D', year=2007, month=1, day=1)
-        ival_D_end_of_year = Period(freq='D', year=2007, month=12, day=31)
-        ival_D_end_of_quarter = Period(freq='D', year=2007, month=3, day=31)
-        ival_D_end_of_month = Period(freq='D', year=2007, month=1, day=31)
-        ival_D_end_of_week = Period(freq='D', year=2007, month=1, day=7)
-
-        ival_D_friday = Period(freq='D', year=2007, month=1, day=5)
-        ival_D_saturday = Period(freq='D', year=2007, month=1, day=6)
-        ival_D_sunday = Period(freq='D', year=2007, month=1, day=7)
-
-        # TODO: unused?
-        # ival_D_monday = Period(freq='D', year=2007, month=1, day=8)
-
-        ival_B_friday = Period(freq='B', year=2007, month=1, day=5)
-        ival_B_monday = Period(freq='B', year=2007, month=1, day=8)
-
-        ival_D_to_A = Period(freq='A', year=2007)
-
-        ival_Deoq_to_AJAN = Period(freq='A-JAN', year=2008)
-        ival_Deoq_to_AJUN = Period(freq='A-JUN', year=2007)
-        ival_Deoq_to_ADEC = Period(freq='A-DEC', year=2007)
-
-        ival_D_to_QEJAN = Period(freq="Q-JAN", year=2007, quarter=4)
-        ival_D_to_QEJUN = Period(freq="Q-JUN", year=2007, quarter=3)
-        ival_D_to_QEDEC = Period(freq="Q-DEC", year=2007, quarter=1)
-
-        ival_D_to_M = Period(freq='M', year=2007, month=1)
-        ival_D_to_W = Period(freq='W', year=2007, month=1, day=7)
-
-        ival_D_to_H_start = Period(freq='H', year=2007, month=1, day=1, hour=0)
-        ival_D_to_H_end = Period(freq='H', year=2007, month=1, day=1, hour=23)
-        ival_D_to_T_start = Period(freq='Min', year=2007, month=1, day=1,
-                                   hour=0, minute=0)
-        ival_D_to_T_end = Period(freq='Min', year=2007, month=1, day=1,
-                                 hour=23, minute=59)
-        ival_D_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0,
-                                   minute=0, second=0)
-        ival_D_to_S_end = Period(freq='S', year=2007, month=1, day=1, hour=23,
-                                 minute=59, second=59)
-
-        self.assertEqual(ival_D.asfreq('A'), ival_D_to_A)
-
-        self.assertEqual(ival_D_end_of_quarter.asfreq('A-JAN'),
-                         ival_Deoq_to_AJAN)
-        self.assertEqual(ival_D_end_of_quarter.asfreq('A-JUN'),
-                         ival_Deoq_to_AJUN)
-        self.assertEqual(ival_D_end_of_quarter.asfreq('A-DEC'),
-                         ival_Deoq_to_ADEC)
-
-        self.assertEqual(ival_D_end_of_year.asfreq('A'), ival_D_to_A)
-        self.assertEqual(ival_D_end_of_quarter.asfreq('Q'), ival_D_to_QEDEC)
-        self.assertEqual(ival_D.asfreq("Q-JAN"), ival_D_to_QEJAN)
-        self.assertEqual(ival_D.asfreq("Q-JUN"), ival_D_to_QEJUN)
-        self.assertEqual(ival_D.asfreq("Q-DEC"), ival_D_to_QEDEC)
-        self.assertEqual(ival_D.asfreq('M'), ival_D_to_M)
-        self.assertEqual(ival_D_end_of_month.asfreq('M'), ival_D_to_M)
-        self.assertEqual(ival_D.asfreq('W'), ival_D_to_W)
-        self.assertEqual(ival_D_end_of_week.asfreq('W'), ival_D_to_W)
-
-        self.assertEqual(ival_D_friday.asfreq('B'), ival_B_friday)
-        self.assertEqual(ival_D_saturday.asfreq('B', 'S'), ival_B_friday)
-        self.assertEqual(ival_D_saturday.asfreq('B', 'E'), ival_B_monday)
-        self.assertEqual(ival_D_sunday.asfreq('B', 'S'), ival_B_friday)
-        self.assertEqual(ival_D_sunday.asfreq('B', 'E'), ival_B_monday)
-
-        self.assertEqual(ival_D.asfreq('H', 'S'), ival_D_to_H_start)
-        self.assertEqual(ival_D.asfreq('H', 'E'), ival_D_to_H_end)
-        self.assertEqual(ival_D.asfreq('Min', 'S'), ival_D_to_T_start)
-        self.assertEqual(ival_D.asfreq('Min', 'E'), ival_D_to_T_end)
-        self.assertEqual(ival_D.asfreq('S', 'S'), ival_D_to_S_start)
-        self.assertEqual(ival_D.asfreq('S', 'E'), ival_D_to_S_end)
-
-        self.assertEqual(ival_D.asfreq('D'), ival_D)
-
-    def test_conv_hourly(self):
-        # frequency conversion tests: from Hourly Frequency"
-
-        ival_H = Period(freq='H', year=2007, month=1, day=1, hour=0)
-        ival_H_end_of_year = Period(freq='H', year=2007, month=12, day=31,
-                                    hour=23)
-        ival_H_end_of_quarter = Period(freq='H', year=2007, month=3, day=31,
-                                       hour=23)
-        ival_H_end_of_month = Period(freq='H', year=2007, month=1, day=31,
-                                     hour=23)
-        ival_H_end_of_week = Period(freq='H', year=2007, month=1, day=7,
-                                    hour=23)
-        ival_H_end_of_day = Period(freq='H', year=2007, month=1, day=1,
-                                   hour=23)
-        ival_H_end_of_bus = Period(freq='H', year=2007, month=1, day=1,
-                                   hour=23)
-
-        ival_H_to_A = Period(freq='A', year=2007)
-        ival_H_to_Q = Period(freq='Q', year=2007, quarter=1)
-        ival_H_to_M = Period(freq='M', year=2007, month=1)
-        ival_H_to_W = Period(freq='W', year=2007, month=1, day=7)
-        ival_H_to_D = Period(freq='D', year=2007, month=1, day=1)
-        ival_H_to_B = Period(freq='B', year=2007, month=1, day=1)
-
-        ival_H_to_T_start = Period(freq='Min', year=2007, month=1, day=1,
-                                   hour=0, minute=0)
-        ival_H_to_T_end = Period(freq='Min', year=2007, month=1, day=1, hour=0,
-                                 minute=59)
-        ival_H_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0,
-                                   minute=0, second=0)
-        ival_H_to_S_end = Period(freq='S', year=2007, month=1, day=1, hour=0,
-                                 minute=59, second=59)
-
-        self.assertEqual(ival_H.asfreq('A'), ival_H_to_A)
-        self.assertEqual(ival_H_end_of_year.asfreq('A'), ival_H_to_A)
-        self.assertEqual(ival_H.asfreq('Q'), ival_H_to_Q)
-        self.assertEqual(ival_H_end_of_quarter.asfreq('Q'), ival_H_to_Q)
-        self.assertEqual(ival_H.asfreq('M'), ival_H_to_M)
-        self.assertEqual(ival_H_end_of_month.asfreq('M'), ival_H_to_M)
-        self.assertEqual(ival_H.asfreq('W'), ival_H_to_W)
-        self.assertEqual(ival_H_end_of_week.asfreq('W'), ival_H_to_W)
-        self.assertEqual(ival_H.asfreq('D'), ival_H_to_D)
-        self.assertEqual(ival_H_end_of_day.asfreq('D'), ival_H_to_D)
-        self.assertEqual(ival_H.asfreq('B'), ival_H_to_B)
-        self.assertEqual(ival_H_end_of_bus.asfreq('B'), ival_H_to_B)
-
-        self.assertEqual(ival_H.asfreq('Min', 'S'), ival_H_to_T_start)
-        self.assertEqual(ival_H.asfreq('Min', 'E'), ival_H_to_T_end)
-        self.assertEqual(ival_H.asfreq('S', 'S'), ival_H_to_S_start)
-        self.assertEqual(ival_H.asfreq('S', 'E'), ival_H_to_S_end)
-
-        self.assertEqual(ival_H.asfreq('H'), ival_H)
-
-    def test_conv_minutely(self):
-        # frequency conversion tests: from Minutely Frequency"
-
-        ival_T = Period(freq='Min', year=2007, month=1, day=1, hour=0,
-                        minute=0)
-        ival_T_end_of_year = Period(freq='Min', year=2007, month=12, day=31,
-                                    hour=23, minute=59)
-        ival_T_end_of_quarter = Period(freq='Min', year=2007, month=3, day=31,
-                                       hour=23, minute=59)
-        ival_T_end_of_month = Period(freq='Min', year=2007, month=1, day=31,
-                                     hour=23, minute=59)
-        ival_T_end_of_week = Period(freq='Min', year=2007, month=1, day=7,
-                                    hour=23, minute=59)
-        ival_T_end_of_day = Period(freq='Min', year=2007, month=1, day=1,
-                                   hour=23, minute=59)
-        ival_T_end_of_bus = Period(freq='Min', year=2007, month=1, day=1,
-                                   hour=23, minute=59)
-        ival_T_end_of_hour = Period(freq='Min', year=2007, month=1, day=1,
-                                    hour=0, minute=59)
-
-        ival_T_to_A = Period(freq='A', year=2007)
-        ival_T_to_Q = Period(freq='Q', year=2007, quarter=1)
-        ival_T_to_M = Period(freq='M', year=2007, month=1)
-        ival_T_to_W = Period(freq='W', year=2007, month=1, day=7)
-        ival_T_to_D = Period(freq='D', year=2007, month=1, day=1)
-        ival_T_to_B = Period(freq='B', year=2007, month=1, day=1)
-        ival_T_to_H = Period(freq='H', year=2007, month=1, day=1, hour=0)
-
-        ival_T_to_S_start = Period(freq='S', year=2007, month=1, day=1, hour=0,
-                                   minute=0, second=0)
-        ival_T_to_S_end = Period(freq='S', year=2007, month=1, day=1, hour=0,
-                                 minute=0, second=59)
-
-        self.assertEqual(ival_T.asfreq('A'), ival_T_to_A)
-        self.assertEqual(ival_T_end_of_year.asfreq('A'), ival_T_to_A)
-        self.assertEqual(ival_T.asfreq('Q'), ival_T_to_Q)
-        self.assertEqual(ival_T_end_of_quarter.asfreq('Q'), ival_T_to_Q)
-        self.assertEqual(ival_T.asfreq('M'), ival_T_to_M)
-        self.assertEqual(ival_T_end_of_month.asfreq('M'), ival_T_to_M)
-        self.assertEqual(ival_T.asfreq('W'), ival_T_to_W)
-        self.assertEqual(ival_T_end_of_week.asfreq('W'), ival_T_to_W)
-        self.assertEqual(ival_T.asfreq('D'), ival_T_to_D)
-        self.assertEqual(ival_T_end_of_day.asfreq('D'), ival_T_to_D)
-        self.assertEqual(ival_T.asfreq('B'), ival_T_to_B)
-        self.assertEqual(ival_T_end_of_bus.asfreq('B'), ival_T_to_B)
-        self.assertEqual(ival_T.asfreq('H'), ival_T_to_H)
-        self.assertEqual(ival_T_end_of_hour.asfreq('H'), ival_T_to_H)
-
-        self.assertEqual(ival_T.asfreq('S', 'S'), ival_T_to_S_start)
-        self.assertEqual(ival_T.asfreq('S', 'E'), ival_T_to_S_end)
-
-        self.assertEqual(ival_T.asfreq('Min'), ival_T)
-
-    def test_conv_secondly(self):
-        # frequency conversion tests: from Secondly Frequency"
-
-        ival_S = Period(freq='S', year=2007, month=1, day=1, hour=0, minute=0,
-                        second=0)
-        ival_S_end_of_year = Period(freq='S', year=2007, month=12, day=31,
-                                    hour=23, minute=59, second=59)
-        ival_S_end_of_quarter = Period(freq='S', year=2007, month=3, day=31,
-                                       hour=23, minute=59, second=59)
-        ival_S_end_of_month = Period(freq='S', year=2007, month=1, day=31,
-                                     hour=23, minute=59, second=59)
-        ival_S_end_of_week = Period(freq='S', year=2007, month=1, day=7,
-                                    hour=23, minute=59, second=59)
-        ival_S_end_of_day = Period(freq='S', year=2007, month=1, day=1,
-                                   hour=23, minute=59, second=59)
-        ival_S_end_of_bus = Period(freq='S', year=2007, month=1, day=1,
-                                   hour=23, minute=59, second=59)
-        ival_S_end_of_hour = Period(freq='S', year=2007, month=1, day=1,
-                                    hour=0, minute=59, second=59)
-        ival_S_end_of_minute = Period(freq='S', year=2007, month=1, day=1,
-                                      hour=0, minute=0, second=59)
-
-        ival_S_to_A = Period(freq='A', year=2007)
-        ival_S_to_Q = Period(freq='Q', year=2007, quarter=1)
-        ival_S_to_M = Period(freq='M', year=2007, month=1)
-        ival_S_to_W = Period(freq='W', year=2007, month=1, day=7)
-        ival_S_to_D = Period(freq='D', year=2007, month=1, day=1)
-        ival_S_to_B = Period(freq='B', year=2007, month=1, day=1)
-        ival_S_to_H = Period(freq='H', year=2007, month=1, day=1, hour=0)
-        ival_S_to_T = Period(freq='Min', year=2007, month=1, day=1, hour=0,
-                             minute=0)
-
-        self.assertEqual(ival_S.asfreq('A'), ival_S_to_A)
-        self.assertEqual(ival_S_end_of_year.asfreq('A'), ival_S_to_A)
-        self.assertEqual(ival_S.asfreq('Q'), ival_S_to_Q)
-        self.assertEqual(ival_S_end_of_quarter.asfreq('Q'), ival_S_to_Q)
-        self.assertEqual(ival_S.asfreq('M'), ival_S_to_M)
-        self.assertEqual(ival_S_end_of_month.asfreq('M'), ival_S_to_M)
-        self.assertEqual(ival_S.asfreq('W'), ival_S_to_W)
-        self.assertEqual(ival_S_end_of_week.asfreq('W'), ival_S_to_W)
-        self.assertEqual(ival_S.asfreq('D'), ival_S_to_D)
-        self.assertEqual(ival_S_end_of_day.asfreq('D'), ival_S_to_D)
-        self.assertEqual(ival_S.asfreq('B'), ival_S_to_B)
-        self.assertEqual(ival_S_end_of_bus.asfreq('B'), ival_S_to_B)
-        self.assertEqual(ival_S.asfreq('H'), ival_S_to_H)
-        self.assertEqual(ival_S_end_of_hour.asfreq('H'), ival_S_to_H)
-        self.assertEqual(ival_S.asfreq('Min'), ival_S_to_T)
-        self.assertEqual(ival_S_end_of_minute.asfreq('Min'), ival_S_to_T)
-
-        self.assertEqual(ival_S.asfreq('S'), ival_S)
-
-    def test_asfreq_mult(self):
-        # normal freq to mult freq
-        p = Period(freq='A', year=2007)
-        # ordinal will not change
-        for freq in ['3A', offsets.YearEnd(3)]:
-            result = p.asfreq(freq)
-            expected = Period('2007', freq='3A')
-
-            self.assertEqual(result, expected)
-            self.assertEqual(result.ordinal, expected.ordinal)
-            self.assertEqual(result.freq, expected.freq)
-        # ordinal will not change
-        for freq in ['3A', offsets.YearEnd(3)]:
-            result = p.asfreq(freq, how='S')
-            expected = Period('2007', freq='3A')
-
-            self.assertEqual(result, expected)
-            self.assertEqual(result.ordinal, expected.ordinal)
-            self.assertEqual(result.freq, expected.freq)
-
-        # mult freq to normal freq
-        p = Period(freq='3A', year=2007)
-        # ordinal will change because how=E is the default
-        for freq in ['A', offsets.YearEnd()]:
-            result = p.asfreq(freq)
-            expected = Period('2009', freq='A')
-
-            self.assertEqual(result, expected)
-            self.assertEqual(result.ordinal, expected.ordinal)
-            self.assertEqual(result.freq, expected.freq)
-        # ordinal will not change
-        for freq in ['A', offsets.YearEnd()]:
-            result = p.asfreq(freq, how='S')
-            expected = Period('2007', freq='A')
-
-            self.assertEqual(result, expected)
-            self.assertEqual(result.ordinal, expected.ordinal)
-            self.assertEqual(result.freq, expected.freq)
-
-        p = Period(freq='A', year=2007)
-        for freq in ['2M', offsets.MonthEnd(2)]:
-            result = p.asfreq(freq)
-            expected = Period('2007-12', freq='2M')
-
-            self.assertEqual(result, expected)
-            self.assertEqual(result.ordinal, expected.ordinal)
-            self.assertEqual(result.freq, expected.freq)
-        for freq in ['2M', offsets.MonthEnd(2)]:
-            result = p.asfreq(freq, how='S')
-            expected = Period('2007-01', freq='2M')
-
-            self.assertEqual(result, expected)
-            self.assertEqual(result.ordinal, expected.ordinal)
-            self.assertEqual(result.freq, expected.freq)
-
-        p = Period(freq='3A', year=2007)
-        for freq in ['2M', offsets.MonthEnd(2)]:
-            result = p.asfreq(freq)
-            expected = Period('2009-12', freq='2M')
-
-            self.assertEqual(result, expected)
-            self.assertEqual(result.ordinal, expected.ordinal)
-            self.assertEqual(result.freq, expected.freq)
-        for freq in ['2M', offsets.MonthEnd(2)]:
-            result = p.asfreq(freq, how='S')
-            expected = Period('2007-01', freq='2M')
-
-            self.assertEqual(result, expected)
-            self.assertEqual(result.ordinal, expected.ordinal)
-            self.assertEqual(result.freq, expected.freq)
-
-    def test_asfreq_combined(self):
-        # normal freq to combined freq
-        p = Period('2007', freq='H')
-
-        # ordinal will not change
-        expected = Period('2007', freq='25H')
-        for freq, how in zip(['1D1H', '1H1D'], ['E', 'S']):
-            result = p.asfreq(freq, how=how)
-            self.assertEqual(result, expected)
-            self.assertEqual(result.ordinal, expected.ordinal)
-            self.assertEqual(result.freq, expected.freq)
-
-        # combined freq to normal freq
-        p1 = Period(freq='1D1H', year=2007)
-        p2 = Period(freq='1H1D', year=2007)
-
-        # ordinal will change because how=E is the default
-        result1 = p1.asfreq('H')
-        result2 = p2.asfreq('H')
-        expected = Period('2007-01-02', freq='H')
-        self.assertEqual(result1, expected)
-        self.assertEqual(result1.ordinal, expected.ordinal)
-        self.assertEqual(result1.freq, expected.freq)
-        self.assertEqual(result2, expected)
-        self.assertEqual(result2.ordinal, expected.ordinal)
-        self.assertEqual(result2.freq, expected.freq)
-
-        # ordinal will not change
-        result1 = p1.asfreq('H', how='S')
-        result2 = p2.asfreq('H', how='S')
-        expected = Period('2007-01-01', freq='H')
-        self.assertEqual(result1, expected)
-        self.assertEqual(result1.ordinal, expected.ordinal)
-        self.assertEqual(result1.freq, expected.freq)
-        self.assertEqual(result2, expected)
-        self.assertEqual(result2.ordinal, expected.ordinal)
-        self.assertEqual(result2.freq, expected.freq)
-
-    def test_is_leap_year(self):
-        # GH 13727
-        for freq in ['A', 'M', 'D', 'H']:
-            p = Period('2000-01-01 00:00:00', freq=freq)
-            self.assertTrue(p.is_leap_year)
-            self.assertIsInstance(p.is_leap_year, bool)
-
-            p = Period('1999-01-01 00:00:00', freq=freq)
-            self.assertFalse(p.is_leap_year)
-
-            p = Period('2004-01-01 00:00:00', freq=freq)
-            self.assertTrue(p.is_leap_year)
-
-            p = Period('2100-01-01 00:00:00', freq=freq)
-            self.assertFalse(p.is_leap_year)
-
-
-class TestPeriodIndex(tm.TestCase):
-    def setUp(self):
-        pass
-
-    def test_hash_error(self):
-        index = period_range('20010101', periods=10)
-        with tm.assertRaisesRegexp(TypeError, "unhashable type: %r" %
-                                   type(index).__name__):
-            hash(index)
-
-    def test_make_time_series(self):
-        index = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009')
-        series = Series(1, index=index)
-        tm.assertIsInstance(series, Series)
-
-    def test_constructor_use_start_freq(self):
-        # GH #1118
-        p = Period('4/2/2012', freq='B')
-        index = PeriodIndex(start=p, periods=10)
-        expected = PeriodIndex(start='4/2/2012', periods=10, freq='B')
-        tm.assert_index_equal(index, expected)
-
-    def test_constructor_field_arrays(self):
-        # GH #1264
-
-        years = np.arange(1990, 2010).repeat(4)[2:-2]
-        quarters = np.tile(np.arange(1, 5), 20)[2:-2]
-
-        index = PeriodIndex(year=years, quarter=quarters, freq='Q-DEC')
-        expected = period_range('1990Q3', '2009Q2', freq='Q-DEC')
-        tm.assert_index_equal(index, expected)
-
-        index2 = PeriodIndex(year=years, quarter=quarters, freq='2Q-DEC')
-        tm.assert_numpy_array_equal(index.asi8, index2.asi8)
-
-        index = PeriodIndex(year=years, quarter=quarters)
-        tm.assert_index_equal(index, expected)
-
-        years = [2007, 2007, 2007]
-        months = [1, 2]
-        self.assertRaises(ValueError, PeriodIndex, year=years, month=months,
-                          freq='M')
-        self.assertRaises(ValueError, PeriodIndex, year=years, month=months,
-                          freq='2M')
-        self.assertRaises(ValueError, PeriodIndex, year=years, month=months,
-                          freq='M', start=Period('2007-01', freq='M'))
-
-        years = [2007, 2007, 2007]
-        months = [1, 2, 3]
-        idx = PeriodIndex(year=years, month=months, freq='M')
-        exp = period_range('2007-01', periods=3, freq='M')
-        tm.assert_index_equal(idx, exp)
-
-    def test_constructor_U(self):
-        # U was used as undefined period
-        self.assertRaises(ValueError, period_range, '2007-1-1', periods=500,
-                          freq='X')
-
-    def test_constructor_nano(self):
-        idx = period_range(start=Period(ordinal=1, freq='N'),
-                           end=Period(ordinal=4, freq='N'), freq='N')
-        exp = PeriodIndex([Period(ordinal=1, freq='N'),
-                           Period(ordinal=2, freq='N'),
-                           Period(ordinal=3, freq='N'),
-                           Period(ordinal=4, freq='N')], freq='N')
-        tm.assert_index_equal(idx, exp)
-
-    def test_constructor_arrays_negative_year(self):
-        years = np.arange(1960, 2000, dtype=np.int64).repeat(4)
-        quarters = np.tile(np.array([1, 2, 3, 4], dtype=np.int64), 40)
-
-        pindex = PeriodIndex(year=years, quarter=quarters)
-
-        self.assert_numpy_array_equal(pindex.year, years)
-        self.assert_numpy_array_equal(pindex.quarter, quarters)
-
-    def test_constructor_invalid_quarters(self):
-        self.assertRaises(ValueError, PeriodIndex, year=lrange(2000, 2004),
-                          quarter=lrange(4), freq='Q-DEC')
-
-    def test_constructor_corner(self):
-        self.assertRaises(ValueError, PeriodIndex, periods=10, freq='A')
-
-        start = Period('2007', freq='A-JUN')
-        end = Period('2010', freq='A-DEC')
-        self.assertRaises(ValueError, PeriodIndex, start=start, end=end)
-        self.assertRaises(ValueError, PeriodIndex, start=start)
-        self.assertRaises(ValueError, PeriodIndex, end=end)
-
-        result = period_range('2007-01', periods=10.5, freq='M')
-        exp = period_range('2007-01', periods=10, freq='M')
-        tm.assert_index_equal(result, exp)
-
-    def test_constructor_fromarraylike(self):
-        idx = period_range('2007-01', periods=20, freq='M')
-
-        # values is an array of Period, thus can retrieve freq
-        tm.assert_index_equal(PeriodIndex(idx.values), idx)
-        tm.assert_index_equal(PeriodIndex(list(idx.values)), idx)
-
-        self.assertRaises(ValueError, PeriodIndex, idx._values)
-        self.assertRaises(ValueError, PeriodIndex, list(idx._values))
-        self.assertRaises(ValueError, PeriodIndex,
-                          data=Period('2007', freq='A'))
-
-        result = PeriodIndex(iter(idx))
-        tm.assert_index_equal(result, idx)
-
-        result = PeriodIndex(idx)
-        tm.assert_index_equal(result, idx)
-
-        result = PeriodIndex(idx, freq='M')
-        tm.assert_index_equal(result, idx)
-
-        result = PeriodIndex(idx, freq=offsets.MonthEnd())
-        tm.assert_index_equal(result, idx)
-        self.assertTrue(result.freq, 'M')
-
-        result = PeriodIndex(idx, freq='2M')
-        tm.assert_index_equal(result, idx.asfreq('2M'))
-        self.assertTrue(result.freq, '2M')
-
-        result = PeriodIndex(idx, freq=offsets.MonthEnd(2))
-        tm.assert_index_equal(result, idx.asfreq('2M'))
-        self.assertTrue(result.freq, '2M')
-
-        result = PeriodIndex(idx, freq='D')
-        exp = idx.asfreq('D', 'e')
-        tm.assert_index_equal(result, exp)
-
-    def test_constructor_datetime64arr(self):
-        vals = np.arange(100000, 100000 + 10000, 100, dtype=np.int64)
-        vals = vals.view(np.dtype('M8[us]'))
-
-        self.assertRaises(ValueError, PeriodIndex, vals, freq='D')
-
-    def test_constructor_dtype(self):
-        # passing a dtype with a tz should localize
-        idx = PeriodIndex(['2013-01', '2013-03'], dtype='period[M]')
-        exp = PeriodIndex(['2013-01', '2013-03'], freq='M')
-        tm.assert_index_equal(idx, exp)
-        self.assertEqual(idx.dtype, 'period[M]')
-
-        idx = PeriodIndex(['2013-01-05', '2013-03-05'], dtype='period[3D]')
-        exp = PeriodIndex(['2013-01-05', '2013-03-05'], freq='3D')
-        tm.assert_index_equal(idx, exp)
-        self.assertEqual(idx.dtype, 'period[3D]')
-
-        # if we already have a freq and its not the same, then asfreq
-        # (not changed)
-        idx = PeriodIndex(['2013-01-01', '2013-01-02'], freq='D')
-
-        res = PeriodIndex(idx, dtype='period[M]')
-        exp = PeriodIndex(['2013-01', '2013-01'], freq='M')
-        tm.assert_index_equal(res, exp)
-        self.assertEqual(res.dtype, 'period[M]')
-
-        res = PeriodIndex(idx, freq='M')
-        tm.assert_index_equal(res, exp)
-        self.assertEqual(res.dtype, 'period[M]')
-
-        msg = 'specified freq and dtype are different'
-        with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg):
-            PeriodIndex(['2011-01'], freq='M', dtype='period[D]')
-
-    def test_constructor_empty(self):
-        idx = pd.PeriodIndex([], freq='M')
-        tm.assertIsInstance(idx, PeriodIndex)
-        self.assertEqual(len(idx), 0)
-        self.assertEqual(idx.freq, 'M')
-
-        with tm.assertRaisesRegexp(ValueError, 'freq not specified'):
-            pd.PeriodIndex([])
-
-    def test_constructor_pi_nat(self):
-        idx = PeriodIndex([Period('2011-01', freq='M'), pd.NaT,
-                           Period('2011-01', freq='M')])
-        exp = PeriodIndex(['2011-01', 'NaT', '2011-01'], freq='M')
-        tm.assert_index_equal(idx, exp)
-
-        idx = PeriodIndex(np.array([Period('2011-01', freq='M'), pd.NaT,
-                                    Period('2011-01', freq='M')]))
-        tm.assert_index_equal(idx, exp)
-
-        idx = PeriodIndex([pd.NaT, pd.NaT, Period('2011-01', freq='M'),
-                           Period('2011-01', freq='M')])
-        exp = PeriodIndex(['NaT', 'NaT', '2011-01', '2011-01'], freq='M')
-        tm.assert_index_equal(idx, exp)
-
-        idx = PeriodIndex(np.array([pd.NaT, pd.NaT,
-                                    Period('2011-01', freq='M'),
-                                    Period('2011-01', freq='M')]))
-        tm.assert_index_equal(idx, exp)
-
-        idx = PeriodIndex([pd.NaT, pd.NaT, '2011-01', '2011-01'], freq='M')
-        tm.assert_index_equal(idx, exp)
-
-        with tm.assertRaisesRegexp(ValueError, 'freq not specified'):
-            PeriodIndex([pd.NaT, pd.NaT])
-
-        with tm.assertRaisesRegexp(ValueError, 'freq not specified'):
-            PeriodIndex(np.array([pd.NaT, pd.NaT]))
-
-        with tm.assertRaisesRegexp(ValueError, 'freq not specified'):
-            PeriodIndex(['NaT', 'NaT'])
-
-        with tm.assertRaisesRegexp(ValueError, 'freq not specified'):
-            PeriodIndex(np.array(['NaT', 'NaT']))
-
-    def test_constructor_incompat_freq(self):
-        msg = "Input has different freq=D from PeriodIndex\\(freq=M\\)"
-
-        with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg):
-            PeriodIndex([Period('2011-01', freq='M'), pd.NaT,
-                         Period('2011-01', freq='D')])
-
-        with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg):
-            PeriodIndex(np.array([Period('2011-01', freq='M'), pd.NaT,
-                                  Period('2011-01', freq='D')]))
-
-        # first element is pd.NaT
-        with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg):
-            PeriodIndex([pd.NaT, Period('2011-01', freq='M'),
-                         Period('2011-01', freq='D')])
-
-        with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg):
-            PeriodIndex(np.array([pd.NaT, Period('2011-01', freq='M'),
-                                  Period('2011-01', freq='D')]))
-
-    def test_constructor_mixed(self):
-        idx = PeriodIndex(['2011-01', pd.NaT, Period('2011-01', freq='M')])
-        exp = PeriodIndex(['2011-01', 'NaT', '2011-01'], freq='M')
-        tm.assert_index_equal(idx, exp)
-
-        idx = PeriodIndex(['NaT', pd.NaT, Period('2011-01', freq='M')])
-        exp = PeriodIndex(['NaT', 'NaT', '2011-01'], freq='M')
-        tm.assert_index_equal(idx, exp)
-
-        idx = PeriodIndex([Period('2011-01-01', freq='D'), pd.NaT,
-                           '2012-01-01'])
-        exp = PeriodIndex(['2011-01-01', 'NaT', '2012-01-01'], freq='D')
-        tm.assert_index_equal(idx, exp)
-
-    def test_constructor_simple_new(self):
-        idx = period_range('2007-01', name='p', periods=2, freq='M')
-        result = idx._simple_new(idx, 'p', freq=idx.freq)
-        tm.assert_index_equal(result, idx)
-
-        result = idx._simple_new(idx.astype('i8'), 'p', freq=idx.freq)
-        tm.assert_index_equal(result, idx)
-
-        result = idx._simple_new([pd.Period('2007-01', freq='M'),
-                                  pd.Period('2007-02', freq='M')],
-                                 'p', freq=idx.freq)
-        self.assert_index_equal(result, idx)
-
-        result = idx._simple_new(np.array([pd.Period('2007-01', freq='M'),
-                                           pd.Period('2007-02', freq='M')]),
-                                 'p', freq=idx.freq)
-        self.assert_index_equal(result, idx)
-
-    def test_constructor_simple_new_empty(self):
-        # GH13079
-        idx = PeriodIndex([], freq='M', name='p')
-        result = idx._simple_new(idx, name='p', freq='M')
-        tm.assert_index_equal(result, idx)
-
-    def test_constructor_simple_new_floats(self):
-        # GH13079
-        for floats in [[1.1], np.array([1.1])]:
-            with self.assertRaises(TypeError):
-                pd.PeriodIndex._simple_new(floats, freq='M')
-
-    def test_shallow_copy_empty(self):
-
-        # GH13067
-        idx = PeriodIndex([], freq='M')
-        result = idx._shallow_copy()
-        expected = idx
-
-        tm.assert_index_equal(result, expected)
-
-    def test_constructor_nat(self):
-        self.assertRaises(ValueError, period_range, start='NaT',
-                          end='2011-01-01', freq='M')
-        self.assertRaises(ValueError, period_range, start='2011-01-01',
-                          end='NaT', freq='M')
-
-    def test_constructor_year_and_quarter(self):
-        year = pd.Series([2001, 2002, 2003])
-        quarter = year - 2000
-        idx = PeriodIndex(year=year, quarter=quarter)
-        strs = ['%dQ%d' % t for t in zip(quarter, year)]
-        lops = list(map(Period, strs))
-        p = PeriodIndex(lops)
-        tm.assert_index_equal(p, idx)
-
-    def test_constructor_freq_mult(self):
-        # GH #7811
-        for func in [PeriodIndex, period_range]:
-            # must be the same, but for sure...
-            pidx = func(start='2014-01', freq='2M', periods=4)
-            expected = PeriodIndex(['2014-01', '2014-03',
-                                    '2014-05', '2014-07'], freq='2M')
-            tm.assert_index_equal(pidx, expected)
-
-            pidx = func(start='2014-01-02', end='2014-01-15', freq='3D')
-            expected = PeriodIndex(['2014-01-02', '2014-01-05',
-                                    '2014-01-08', '2014-01-11',
-                                    '2014-01-14'], freq='3D')
-            tm.assert_index_equal(pidx, expected)
-
-            pidx = func(end='2014-01-01 17:00', freq='4H', periods=3)
-            expected = PeriodIndex(['2014-01-01 09:00', '2014-01-01 13:00',
-                                    '2014-01-01 17:00'], freq='4H')
-            tm.assert_index_equal(pidx, expected)
-
-        msg = ('Frequency must be positive, because it'
-               ' represents span: -1M')
-        with tm.assertRaisesRegexp(ValueError, msg):
-            PeriodIndex(['2011-01'], freq='-1M')
-
-        msg = ('Frequency must be positive, because it' ' represents span: 0M')
-        with tm.assertRaisesRegexp(ValueError, msg):
-            PeriodIndex(['2011-01'], freq='0M')
-
-        msg = ('Frequency must be positive, because it' ' represents span: 0M')
-        with tm.assertRaisesRegexp(ValueError, msg):
-            period_range('2011-01', periods=3, freq='0M')
-
-    def test_constructor_freq_mult_dti_compat(self):
-        import itertools
-        mults = [1, 2, 3, 4, 5]
-        freqs = ['A', 'M', 'D', 'T', 'S']
-        for mult, freq in itertools.product(mults, freqs):
-            freqstr = str(mult) + freq
-            pidx = PeriodIndex(start='2014-04-01', freq=freqstr, periods=10)
-            expected = date_range(start='2014-04-01', freq=freqstr,
-                                  periods=10).to_period(freqstr)
-            tm.assert_index_equal(pidx, expected)
-
-    def test_constructor_freq_combined(self):
-        for freq in ['1D1H', '1H1D']:
-            pidx = PeriodIndex(['2016-01-01', '2016-01-02'], freq=freq)
-            expected = PeriodIndex(['2016-01-01 00:00', '2016-01-02 00:00'],
-                                   freq='25H')
-        for freq, func in zip(['1D1H', '1H1D'], [PeriodIndex, period_range]):
-            pidx = func(start='2016-01-01', periods=2, freq=freq)
-            expected = PeriodIndex(['2016-01-01 00:00', '2016-01-02 01:00'],
-                                   freq='25H')
-            tm.assert_index_equal(pidx, expected)
-
-    def test_dtype_str(self):
-        pi = pd.PeriodIndex([], freq='M')
-        self.assertEqual(pi.dtype_str, 'period[M]')
-        self.assertEqual(pi.dtype_str, str(pi.dtype))
-
-        pi = pd.PeriodIndex([], freq='3M')
-        self.assertEqual(pi.dtype_str, 'period[3M]')
-        self.assertEqual(pi.dtype_str, str(pi.dtype))
-
-    def test_view_asi8(self):
-        idx = pd.PeriodIndex([], freq='M')
-
-        exp = np.array([], dtype=np.int64)
-        tm.assert_numpy_array_equal(idx.view('i8'), exp)
-        tm.assert_numpy_array_equal(idx.asi8, exp)
-
-        idx = pd.PeriodIndex(['2011-01', pd.NaT], freq='M')
-
-        exp = np.array([492, -9223372036854775808], dtype=np.int64)
-        tm.assert_numpy_array_equal(idx.view('i8'), exp)
-        tm.assert_numpy_array_equal(idx.asi8, exp)
-
-        exp = np.array([14975, -9223372036854775808], dtype=np.int64)
-        idx = pd.PeriodIndex(['2011-01-01', pd.NaT], freq='D')
-        tm.assert_numpy_array_equal(idx.view('i8'), exp)
-        tm.assert_numpy_array_equal(idx.asi8, exp)
-
-    def test_values(self):
-        idx = pd.PeriodIndex([], freq='M')
-
-        exp = np.array([], dtype=np.object)
-        tm.assert_numpy_array_equal(idx.values, exp)
-        tm.assert_numpy_array_equal(idx.get_values(), exp)
-        exp = np.array([], dtype=np.int64)
-        tm.assert_numpy_array_equal(idx._values, exp)
-
-        idx = pd.PeriodIndex(['2011-01', pd.NaT], freq='M')
-
-        exp = np.array([pd.Period('2011-01', freq='M'), pd.NaT], dtype=object)
-        tm.assert_numpy_array_equal(idx.values, exp)
-        tm.assert_numpy_array_equal(idx.get_values(), exp)
-        exp = np.array([492, -9223372036854775808], dtype=np.int64)
-        tm.assert_numpy_array_equal(idx._values, exp)
-
-        idx = pd.PeriodIndex(['2011-01-01', pd.NaT], freq='D')
-
-        exp = np.array([pd.Period('2011-01-01', freq='D'), pd.NaT],
-                       dtype=object)
-        tm.assert_numpy_array_equal(idx.values, exp)
-        tm.assert_numpy_array_equal(idx.get_values(), exp)
-        exp = np.array([14975, -9223372036854775808], dtype=np.int64)
-        tm.assert_numpy_array_equal(idx._values, exp)
-
-    def test_asobject_like(self):
-        idx = pd.PeriodIndex([], freq='M')
-
-        exp = np.array([], dtype=object)
-        tm.assert_numpy_array_equal(idx.asobject.values, exp)
-        tm.assert_numpy_array_equal(idx._mpl_repr(), exp)
-
-        idx = pd.PeriodIndex(['2011-01', pd.NaT], freq='M')
-
-        exp = np.array([pd.Period('2011-01', freq='M'), pd.NaT], dtype=object)
-        tm.assert_numpy_array_equal(idx.asobject.values, exp)
-        tm.assert_numpy_array_equal(idx._mpl_repr(), exp)
-
-        exp = np.array([pd.Period('2011-01-01', freq='D'), pd.NaT],
-                       dtype=object)
-        idx = pd.PeriodIndex(['2011-01-01', pd.NaT], freq='D')
-        tm.assert_numpy_array_equal(idx.asobject.values, exp)
-        tm.assert_numpy_array_equal(idx._mpl_repr(), exp)
-
-    def test_is_(self):
-        create_index = lambda: PeriodIndex(freq='A', start='1/1/2001',
-                                           end='12/1/2009')
-        index = create_index()
-        self.assertEqual(index.is_(index), True)
-        self.assertEqual(index.is_(create_index()), False)
-        self.assertEqual(index.is_(index.view()), True)
-        self.assertEqual(
-            index.is_(index.view().view().view().view().view()), True)
-        self.assertEqual(index.view().is_(index), True)
-        ind2 = index.view()
-        index.name = "Apple"
-        self.assertEqual(ind2.is_(index), True)
-        self.assertEqual(index.is_(index[:]), False)
-        self.assertEqual(index.is_(index.asfreq('M')), False)
-        self.assertEqual(index.is_(index.asfreq('A')), False)
-        self.assertEqual(index.is_(index - 2), False)
-        self.assertEqual(index.is_(index - 0), False)
-
-    def test_comp_period(self):
-        idx = period_range('2007-01', periods=20, freq='M')
-
-        result = idx < idx[10]
-        exp = idx.values < idx.values[10]
-        self.assert_numpy_array_equal(result, exp)
-
-    def test_getitem_index(self):
-        idx = period_range('2007-01', periods=10, freq='M', name='x')
-
-        result = idx[[1, 3, 5]]
-        exp = pd.PeriodIndex(['2007-02', '2007-04', '2007-06'],
-                             freq='M', name='x')
-        tm.assert_index_equal(result, exp)
-
-        result = idx[[True, True, False, False, False,
-                      True, True, False, False, False]]
-        exp = pd.PeriodIndex(['2007-01', '2007-02', '2007-06', '2007-07'],
-                             freq='M', name='x')
-        tm.assert_index_equal(result, exp)
-
-    def test_getitem_partial(self):
-        rng = period_range('2007-01', periods=50, freq='M')
-        ts = Series(np.random.randn(len(rng)), rng)
-
-        self.assertRaises(KeyError, ts.__getitem__, '2006')
-
-        result = ts['2008']
-        self.assertTrue((result.index.year == 2008).all())
-
-        result = ts['2008':'2009']
-        self.assertEqual(len(result), 24)
-
-        result = ts['2008-1':'2009-12']
-        self.assertEqual(len(result), 24)
-
-        result = ts['2008Q1':'2009Q4']
-        self.assertEqual(len(result), 24)
-
-        result = ts[:'2009']
-        self.assertEqual(len(result), 36)
-
-        result = ts['2009':]
-        self.assertEqual(len(result), 50 - 24)
-
-        exp = result
-        result = ts[24:]
-        tm.assert_series_equal(exp, result)
-
-        ts = ts[10:].append(ts[10:])
-        self.assertRaisesRegexp(KeyError,
-                                "left slice bound for non-unique "
-                                "label: '2008'",
-                                ts.__getitem__, slice('2008', '2009'))
-
-    def test_getitem_datetime(self):
-        rng = period_range(start='2012-01-01', periods=10, freq='W-MON')
-        ts = Series(lrange(len(rng)), index=rng)
-
-        dt1 = datetime(2011, 10, 2)
-        dt4 = datetime(2012, 4, 20)
-
-        rs = ts[dt1:dt4]
-        tm.assert_series_equal(rs, ts)
-
-    def test_getitem_nat(self):
-        idx = pd.PeriodIndex(['2011-01', 'NaT', '2011-02'], freq='M')
-        self.assertEqual(idx[0], pd.Period('2011-01', freq='M'))
-        self.assertIs(idx[1], tslib.NaT)
-
-        s = pd.Series([0, 1, 2], index=idx)
-        self.assertEqual(s[pd.NaT], 1)
-
-        s = pd.Series(idx, index=idx)
-        self.assertEqual(s[pd.Period('2011-01', freq='M')],
-                         pd.Period('2011-01', freq='M'))
-        self.assertIs(s[pd.NaT], tslib.NaT)
-
-    def test_getitem_list_periods(self):
-        # GH 7710
-        rng = period_range(start='2012-01-01', periods=10, freq='D')
-        ts = Series(lrange(len(rng)), index=rng)
-        exp = ts.iloc[[1]]
-        tm.assert_series_equal(ts[[Period('2012-01-02', freq='D')]], exp)
-
-    def test_slice_with_negative_step(self):
-        ts = Series(np.arange(20),
-                    period_range('2014-01', periods=20, freq='M'))
-        SLC = pd.IndexSlice
-
-        def assert_slices_equivalent(l_slc, i_slc):
-            tm.assert_series_equal(ts[l_slc], ts.iloc[i_slc])
-            tm.assert_series_equal(ts.loc[l_slc], ts.iloc[i_slc])
-            tm.assert_series_equal(ts.ix[l_slc], ts.iloc[i_slc])
-
-        assert_slices_equivalent(SLC[Period('2014-10')::-1], SLC[9::-1])
-        assert_slices_equivalent(SLC['2014-10'::-1], SLC[9::-1])
-
-        assert_slices_equivalent(SLC[:Period('2014-10'):-1], SLC[:8:-1])
-        assert_slices_equivalent(SLC[:'2014-10':-1], SLC[:8:-1])
-
-        assert_slices_equivalent(SLC['2015-02':'2014-10':-1], SLC[13:8:-1])
-        assert_slices_equivalent(SLC[Period('2015-02'):Period('2014-10'):-1],
-                                 SLC[13:8:-1])
-        assert_slices_equivalent(SLC['2015-02':Period('2014-10'):-1],
-                                 SLC[13:8:-1])
-        assert_slices_equivalent(SLC[Period('2015-02'):'2014-10':-1],
-                                 SLC[13:8:-1])
-
-        assert_slices_equivalent(SLC['2014-10':'2015-02':-1], SLC[:0])
-
-    def test_slice_with_zero_step_raises(self):
-        ts = Series(np.arange(20),
-                    period_range('2014-01', periods=20, freq='M'))
-        self.assertRaisesRegexp(ValueError, 'slice step cannot be zero',
-                                lambda: ts[::0])
-        self.assertRaisesRegexp(ValueError, 'slice step cannot be zero',
-                                lambda: ts.loc[::0])
-        self.assertRaisesRegexp(ValueError, 'slice step cannot be zero',
-                                lambda: ts.ix[::0])
-
-    def test_contains(self):
-        rng = period_range('2007-01', freq='M', periods=10)
-
-        self.assertTrue(Period('2007-01', freq='M') in rng)
-        self.assertFalse(Period('2007-01', freq='D') in rng)
-        self.assertFalse(Period('2007-01', freq='2M') in rng)
-
-    def test_contains_nat(self):
-        # GH13582
-        idx = period_range('2007-01', freq='M', periods=10)
-        self.assertFalse(pd.NaT in idx)
-        self.assertFalse(None in idx)
-        self.assertFalse(float('nan') in idx)
-        self.assertFalse(np.nan in idx)
-
-        idx = pd.PeriodIndex(['2011-01', 'NaT', '2011-02'], freq='M')
-        self.assertTrue(pd.NaT in idx)
-        self.assertTrue(None in idx)
-        self.assertTrue(float('nan') in idx)
-        self.assertTrue(np.nan in idx)
-
-    def test_sub(self):
-        rng = period_range('2007-01', periods=50)
-
-        result = rng - 5
-        exp = rng + (-5)
-        tm.assert_index_equal(result, exp)
-
-    def test_periods_number_check(self):
-        with tm.assertRaises(ValueError):
-            period_range('2011-1-1', '2012-1-1', 'B')
-
-    def test_tolist(self):
-        index = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009')
-        rs = index.tolist()
-        [tm.assertIsInstance(x, Period) for x in rs]
-
-        recon = PeriodIndex(rs)
-        tm.assert_index_equal(index, recon)
-
-    def test_to_timestamp(self):
-        index = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009')
-        series = Series(1, index=index, name='foo')
-
-        exp_index = date_range('1/1/2001', end='12/31/2009', freq='A-DEC')
-        result = series.to_timestamp(how='end')
-        tm.assert_index_equal(result.index, exp_index)
-        self.assertEqual(result.name, 'foo')
-
-        exp_index = date_range('1/1/2001', end='1/1/2009', freq='AS-JAN')
-        result = series.to_timestamp(how='start')
-        tm.assert_index_equal(result.index, exp_index)
-
-        def _get_with_delta(delta, freq='A-DEC'):
-            return date_range(to_datetime('1/1/2001') + delta,
-                              to_datetime('12/31/2009') + delta, freq=freq)
-
-        delta = timedelta(hours=23)
-        result = series.to_timestamp('H', 'end')
-        exp_index = _get_with_delta(delta)
-        tm.assert_index_equal(result.index, exp_index)
-
-        delta = timedelta(hours=23, minutes=59)
-        result = series.to_timestamp('T', 'end')
-        exp_index = _get_with_delta(delta)
-        tm.assert_index_equal(result.index, exp_index)
-
-        result = series.to_timestamp('S', 'end')
-        delta = timedelta(hours=23, minutes=59, seconds=59)
-        exp_index = _get_with_delta(delta)
-        tm.assert_index_equal(result.index, exp_index)
-
-        index = PeriodIndex(freq='H', start='1/1/2001', end='1/2/2001')
-        series = Series(1, index=index, name='foo')
-
-        exp_index = date_range('1/1/2001 00:59:59', end='1/2/2001 00:59:59',
-                               freq='H')
-        result = series.to_timestamp(how='end')
-        tm.assert_index_equal(result.index, exp_index)
-        self.assertEqual(result.name, 'foo')
-
-    def test_to_timestamp_quarterly_bug(self):
-        years = np.arange(1960, 2000).repeat(4)
-        quarters = np.tile(lrange(1, 5), 40)
-
-        pindex = PeriodIndex(year=years, quarter=quarters)
-
-        stamps = pindex.to_timestamp('D', 'end')
-        expected = DatetimeIndex([x.to_timestamp('D', 'end') for x in pindex])
-        tm.assert_index_equal(stamps, expected)
-
-    def test_to_timestamp_preserve_name(self):
-        index = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009',
-                            name='foo')
-        self.assertEqual(index.name, 'foo')
-
-        conv = index.to_timestamp('D')
-        self.assertEqual(conv.name, 'foo')
-
-    def test_to_timestamp_repr_is_code(self):
-        zs = [Timestamp('99-04-17 00:00:00', tz='UTC'),
-              Timestamp('2001-04-17 00:00:00', tz='UTC'),
-              Timestamp('2001-04-17 00:00:00', tz='America/Los_Angeles'),
-              Timestamp('2001-04-17 00:00:00', tz=None)]
-        for z in zs:
-            self.assertEqual(eval(repr(z)), z)
-
-    def test_to_timestamp_pi_nat(self):
-        # GH 7228
-        index = PeriodIndex(['NaT', '2011-01', '2011-02'], freq='M',
-                            name='idx')
-
-        result = index.to_timestamp('D')
-        expected = DatetimeIndex([pd.NaT, datetime(2011, 1, 1),
-                                  datetime(2011, 2, 1)], name='idx')
-        tm.assert_index_equal(result, expected)
-        self.assertEqual(result.name, 'idx')
-
-        result2 = result.to_period(freq='M')
-        tm.assert_index_equal(result2, index)
-        self.assertEqual(result2.name, 'idx')
-
-        result3 = result.to_period(freq='3M')
-        exp = PeriodIndex(['NaT', '2011-01', '2011-02'], freq='3M', name='idx')
-        self.assert_index_equal(result3, exp)
-        self.assertEqual(result3.freqstr, '3M')
-
-        msg = ('Frequency must be positive, because it'
-               ' represents span: -2A')
-        with tm.assertRaisesRegexp(ValueError, msg):
-            result.to_period(freq='-2A')
-
-    def test_to_timestamp_pi_mult(self):
-        idx = PeriodIndex(['2011-01', 'NaT', '2011-02'], freq='2M', name='idx')
-        result = idx.to_timestamp()
-        expected = DatetimeIndex(
-            ['2011-01-01', 'NaT', '2011-02-01'], name='idx')
-        self.assert_index_equal(result, expected)
-        result = idx.to_timestamp(how='E')
-        expected = DatetimeIndex(
-            ['2011-02-28', 'NaT', '2011-03-31'], name='idx')
-        self.assert_index_equal(result, expected)
-
-    def test_to_timestamp_pi_combined(self):
-        idx = PeriodIndex(start='2011', periods=2, freq='1D1H', name='idx')
-        result = idx.to_timestamp()
-        expected = DatetimeIndex(
-            ['2011-01-01 00:00', '2011-01-02 01:00'], name='idx')
-        self.assert_index_equal(result, expected)
-        result = idx.to_timestamp(how='E')
-        expected = DatetimeIndex(
-            ['2011-01-02 00:59:59', '2011-01-03 01:59:59'], name='idx')
-        self.assert_index_equal(result, expected)
-        result = idx.to_timestamp(how='E', freq='H')
-        expected = DatetimeIndex(
-            ['2011-01-02 00:00', '2011-01-03 01:00'], name='idx')
-        self.assert_index_equal(result, expected)
-
-    def test_to_timestamp_to_period_astype(self):
-        idx = DatetimeIndex([pd.NaT, '2011-01-01', '2011-02-01'], name='idx')
-
-        res = idx.astype('period[M]')
-        exp = PeriodIndex(['NaT', '2011-01', '2011-02'], freq='M', name='idx')
-        tm.assert_index_equal(res, exp)
-
-        res = idx.astype('period[3M]')
-        exp = PeriodIndex(['NaT', '2011-01', '2011-02'], freq='3M', name='idx')
-        self.assert_index_equal(res, exp)
-
-    def test_start_time(self):
-        index = PeriodIndex(freq='M', start='2016-01-01', end='2016-05-31')
-        expected_index = date_range('2016-01-01', end='2016-05-31', freq='MS')
-        tm.assert_index_equal(index.start_time, expected_index)
-
-    def test_end_time(self):
-        index = PeriodIndex(freq='M', start='2016-01-01', end='2016-05-31')
-        expected_index = date_range('2016-01-01', end='2016-05-31', freq='M')
-        tm.assert_index_equal(index.end_time, expected_index)
-
-    def test_as_frame_columns(self):
-        rng = period_range('1/1/2000', periods=5)
-        df = DataFrame(randn(10, 5), columns=rng)
-
-        ts = df[rng[0]]
-        tm.assert_series_equal(ts, df.ix[:, 0])
-
-        # GH # 1211
-        repr(df)
-
-        ts = df['1/1/2000']
-        tm.assert_series_equal(ts, df.ix[:, 0])
-
-    def test_indexing(self):
-
-        # GH 4390, iat incorrectly indexing
-        index = period_range('1/1/2001', periods=10)
-        s = Series(randn(10), index=index)
-        expected = s[index[0]]
-        result = s.iat[0]
-        self.assertEqual(expected, result)
-
-    def test_frame_setitem(self):
-        rng = period_range('1/1/2000', periods=5, name='index')
-        df = DataFrame(randn(5, 3), index=rng)
-
-        df['Index'] = rng
-        rs = Index(df['Index'])
-        tm.assert_index_equal(rs, rng, check_names=False)
-        self.assertEqual(rs.name, 'Index')
-        self.assertEqual(rng.name, 'index')
-
-        rs = df.reset_index().set_index('index')
-        tm.assertIsInstance(rs.index, PeriodIndex)
-        tm.assert_index_equal(rs.index, rng)
-
-    def test_period_set_index_reindex(self):
-        # GH 6631
-        df = DataFrame(np.random.random(6))
-        idx1 = period_range('2011/01/01', periods=6, freq='M')
-        idx2 = period_range('2013', periods=6, freq='A')
-
-        df = df.set_index(idx1)
-        tm.assert_index_equal(df.index, idx1)
-        df = df.set_index(idx2)
-        tm.assert_index_equal(df.index, idx2)
-
-    def test_frame_to_time_stamp(self):
-        K = 5
-        index = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009')
-        df = DataFrame(randn(len(index), K), index=index)
-        df['mix'] = 'a'
-
-        exp_index = date_range('1/1/2001', end='12/31/2009', freq='A-DEC')
-        result = df.to_timestamp('D', 'end')
-        tm.assert_index_equal(result.index, exp_index)
-        tm.assert_numpy_array_equal(result.values, df.values)
-
-        exp_index = date_range('1/1/2001', end='1/1/2009', freq='AS-JAN')
-        result = df.to_timestamp('D', 'start')
-        tm.assert_index_equal(result.index, exp_index)
-
-        def _get_with_delta(delta, freq='A-DEC'):
-            return date_range(to_datetime('1/1/2001') + delta,
-                              to_datetime('12/31/2009') + delta, freq=freq)
-
-        delta = timedelta(hours=23)
-        result = df.to_timestamp('H', 'end')
-        exp_index = _get_with_delta(delta)
-        tm.assert_index_equal(result.index, exp_index)
-
-        delta = timedelta(hours=23, minutes=59)
-        result = df.to_timestamp('T', 'end')
-        exp_index = _get_with_delta(delta)
-        tm.assert_index_equal(result.index, exp_index)
-
-        result = df.to_timestamp('S', 'end')
-        delta = timedelta(hours=23, minutes=59, seconds=59)
-        exp_index = _get_with_delta(delta)
-        tm.assert_index_equal(result.index, exp_index)
-
-        # columns
-        df = df.T
-
-        exp_index = date_range('1/1/2001', end='12/31/2009', freq='A-DEC')
-        result = df.to_timestamp('D', 'end', axis=1)
-        tm.assert_index_equal(result.columns, exp_index)
-        tm.assert_numpy_array_equal(result.values, df.values)
-
-        exp_index = date_range('1/1/2001', end='1/1/2009', freq='AS-JAN')
-        result = df.to_timestamp('D', 'start', axis=1)
-        tm.assert_index_equal(result.columns, exp_index)
-
-        delta = timedelta(hours=23)
-        result = df.to_timestamp('H', 'end', axis=1)
-        exp_index = _get_with_delta(delta)
-        tm.assert_index_equal(result.columns, exp_index)
-
-        delta = timedelta(hours=23, minutes=59)
-        result = df.to_timestamp('T', 'end', axis=1)
-        exp_index = _get_with_delta(delta)
-        tm.assert_index_equal(result.columns, exp_index)
-
-        result = df.to_timestamp('S', 'end', axis=1)
-        delta = timedelta(hours=23, minutes=59, seconds=59)
-        exp_index = _get_with_delta(delta)
-        tm.assert_index_equal(result.columns, exp_index)
-
-        # invalid axis
-        tm.assertRaisesRegexp(ValueError, 'axis', df.to_timestamp, axis=2)
-
-        result1 = df.to_timestamp('5t', axis=1)
-        result2 = df.to_timestamp('t', axis=1)
-        expected = pd.date_range('2001-01-01', '2009-01-01', freq='AS')
-        self.assertTrue(isinstance(result1.columns, DatetimeIndex))
-        self.assertTrue(isinstance(result2.columns, DatetimeIndex))
-        self.assert_numpy_array_equal(result1.columns.asi8, expected.asi8)
-        self.assert_numpy_array_equal(result2.columns.asi8, expected.asi8)
-        # PeriodIndex.to_timestamp always use 'infer'
-        self.assertEqual(result1.columns.freqstr, 'AS-JAN')
-        self.assertEqual(result2.columns.freqstr, 'AS-JAN')
-
-    def test_index_duplicate_periods(self):
-        # monotonic
-        idx = PeriodIndex([2000, 2007, 2007, 2009, 2009], freq='A-JUN')
-        ts = Series(np.random.randn(len(idx)), index=idx)
-
-        result = ts[2007]
-        expected = ts[1:3]
-        tm.assert_series_equal(result, expected)
-        result[:] = 1
-        self.assertTrue((ts[1:3] == 1).all())
-
-        # not monotonic
-        idx = PeriodIndex([2000, 2007, 2007, 2009, 2007], freq='A-JUN')
-        ts = Series(np.random.randn(len(idx)), index=idx)
-
-        result = ts[2007]
-        expected = ts[idx == 2007]
-        tm.assert_series_equal(result, expected)
-
-    def test_index_unique(self):
-        idx = PeriodIndex([2000, 2007, 2007, 2009, 2009], freq='A-JUN')
-        expected = PeriodIndex([2000, 2007, 2009], freq='A-JUN')
self.assert_index_equal(idx.unique(), expected) - self.assertEqual(idx.nunique(), 3) - - idx = PeriodIndex([2000, 2007, 2007, 2009, 2007], freq='A-JUN', - tz='US/Eastern') - expected = PeriodIndex([2000, 2007, 2009], freq='A-JUN', - tz='US/Eastern') - self.assert_index_equal(idx.unique(), expected) - self.assertEqual(idx.nunique(), 3) - - def test_constructor(self): - pi = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009') - self.assertEqual(len(pi), 9) - - pi = PeriodIndex(freq='Q', start='1/1/2001', end='12/1/2009') - self.assertEqual(len(pi), 4 * 9) - - pi = PeriodIndex(freq='M', start='1/1/2001', end='12/1/2009') - self.assertEqual(len(pi), 12 * 9) - - pi = PeriodIndex(freq='D', start='1/1/2001', end='12/31/2009') - self.assertEqual(len(pi), 365 * 9 + 2) - - pi = PeriodIndex(freq='B', start='1/1/2001', end='12/31/2009') - self.assertEqual(len(pi), 261 * 9) - - pi = PeriodIndex(freq='H', start='1/1/2001', end='12/31/2001 23:00') - self.assertEqual(len(pi), 365 * 24) - - pi = PeriodIndex(freq='Min', start='1/1/2001', end='1/1/2001 23:59') - self.assertEqual(len(pi), 24 * 60) - - pi = PeriodIndex(freq='S', start='1/1/2001', end='1/1/2001 23:59:59') - self.assertEqual(len(pi), 24 * 60 * 60) - - start = Period('02-Apr-2005', 'B') - i1 = PeriodIndex(start=start, periods=20) - self.assertEqual(len(i1), 20) - self.assertEqual(i1.freq, start.freq) - self.assertEqual(i1[0], start) - - end_intv = Period('2006-12-31', 'W') - i1 = PeriodIndex(end=end_intv, periods=10) - self.assertEqual(len(i1), 10) - self.assertEqual(i1.freq, end_intv.freq) - self.assertEqual(i1[-1], end_intv) - - end_intv = Period('2006-12-31', '1w') - i2 = PeriodIndex(end=end_intv, periods=10) - self.assertEqual(len(i1), len(i2)) - self.assertTrue((i1 == i2).all()) - self.assertEqual(i1.freq, i2.freq) - - end_intv = Period('2006-12-31', ('w', 1)) - i2 = PeriodIndex(end=end_intv, periods=10) - self.assertEqual(len(i1), len(i2)) - self.assertTrue((i1 == i2).all()) - self.assertEqual(i1.freq, i2.freq) - - try: - PeriodIndex(start=start, end=end_intv) - raise AssertionError('Cannot allow mixed freq for start and end') - except ValueError: - pass - - end_intv = Period('2005-05-01', 'B') - i1 = PeriodIndex(start=start, end=end_intv) - - try: - PeriodIndex(start=start) - raise AssertionError( - 'Must specify periods if missing start or end') - except ValueError: - pass - - # infer freq from first element - i2 = PeriodIndex([end_intv, Period('2005-05-05', 'B')]) - self.assertEqual(len(i2), 2) - self.assertEqual(i2[0], end_intv) - - i2 = PeriodIndex(np.array([end_intv, Period('2005-05-05', 'B')])) - self.assertEqual(len(i2), 2) - self.assertEqual(i2[0], end_intv) - - # Mixed freq should fail - vals = [end_intv, Period('2006-12-31', 'w')] - self.assertRaises(ValueError, PeriodIndex, vals) - vals = np.array(vals) - self.assertRaises(ValueError, PeriodIndex, vals) - - def test_numpy_repeat(self): - index = period_range('20010101', periods=2) - expected = PeriodIndex([Period('2001-01-01'), Period('2001-01-01'), - Period('2001-01-02'), Period('2001-01-02')]) - - tm.assert_index_equal(np.repeat(index, 2), expected) - - msg = "the 'axis' parameter is not supported" - tm.assertRaisesRegexp(ValueError, msg, np.repeat, index, 2, axis=1) - - def test_shift(self): - pi1 = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009') - pi2 = PeriodIndex(freq='A', start='1/1/2002', end='12/1/2010') - - tm.assert_index_equal(pi1.shift(0), pi1) - - self.assertEqual(len(pi1), len(pi2)) - self.assert_index_equal(pi1.shift(1), pi2) - - pi1 = 
PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009') - pi2 = PeriodIndex(freq='A', start='1/1/2000', end='12/1/2008') - self.assertEqual(len(pi1), len(pi2)) - self.assert_index_equal(pi1.shift(-1), pi2) - - pi1 = PeriodIndex(freq='M', start='1/1/2001', end='12/1/2009') - pi2 = PeriodIndex(freq='M', start='2/1/2001', end='1/1/2010') - self.assertEqual(len(pi1), len(pi2)) - self.assert_index_equal(pi1.shift(1), pi2) - - pi1 = PeriodIndex(freq='M', start='1/1/2001', end='12/1/2009') - pi2 = PeriodIndex(freq='M', start='12/1/2000', end='11/1/2009') - self.assertEqual(len(pi1), len(pi2)) - self.assert_index_equal(pi1.shift(-1), pi2) - - pi1 = PeriodIndex(freq='D', start='1/1/2001', end='12/1/2009') - pi2 = PeriodIndex(freq='D', start='1/2/2001', end='12/2/2009') - self.assertEqual(len(pi1), len(pi2)) - self.assert_index_equal(pi1.shift(1), pi2) - - pi1 = PeriodIndex(freq='D', start='1/1/2001', end='12/1/2009') - pi2 = PeriodIndex(freq='D', start='12/31/2000', end='11/30/2009') - self.assertEqual(len(pi1), len(pi2)) - self.assert_index_equal(pi1.shift(-1), pi2) - - def test_shift_nat(self): - idx = PeriodIndex(['2011-01', '2011-02', 'NaT', - '2011-04'], freq='M', name='idx') - result = idx.shift(1) - expected = PeriodIndex(['2011-02', '2011-03', 'NaT', - '2011-05'], freq='M', name='idx') - tm.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - - def test_shift_ndarray(self): - idx = PeriodIndex(['2011-01', '2011-02', 'NaT', - '2011-04'], freq='M', name='idx') - result = idx.shift(np.array([1, 2, 3, 4])) - expected = PeriodIndex(['2011-02', '2011-04', 'NaT', - '2011-08'], freq='M', name='idx') - tm.assert_index_equal(result, expected) - - idx = PeriodIndex(['2011-01', '2011-02', 'NaT', - '2011-04'], freq='M', name='idx') - result = idx.shift(np.array([1, -2, 3, -4])) - expected = PeriodIndex(['2011-02', '2010-12', 'NaT', - '2010-12'], freq='M', name='idx') - tm.assert_index_equal(result, expected) - - def test_asfreq(self): - pi1 = PeriodIndex(freq='A', start='1/1/2001', end='1/1/2001') - pi2 = PeriodIndex(freq='Q', start='1/1/2001', end='1/1/2001') - pi3 = PeriodIndex(freq='M', start='1/1/2001', end='1/1/2001') - pi4 = PeriodIndex(freq='D', start='1/1/2001', end='1/1/2001') - pi5 = PeriodIndex(freq='H', start='1/1/2001', end='1/1/2001 00:00') - pi6 = PeriodIndex(freq='Min', start='1/1/2001', end='1/1/2001 00:00') - pi7 = PeriodIndex(freq='S', start='1/1/2001', end='1/1/2001 00:00:00') - - self.assertEqual(pi1.asfreq('Q', 'S'), pi2) - self.assertEqual(pi1.asfreq('Q', 's'), pi2) - self.assertEqual(pi1.asfreq('M', 'start'), pi3) - self.assertEqual(pi1.asfreq('D', 'StarT'), pi4) - self.assertEqual(pi1.asfreq('H', 'beGIN'), pi5) - self.assertEqual(pi1.asfreq('Min', 'S'), pi6) - self.assertEqual(pi1.asfreq('S', 'S'), pi7) - - self.assertEqual(pi2.asfreq('A', 'S'), pi1) - self.assertEqual(pi2.asfreq('M', 'S'), pi3) - self.assertEqual(pi2.asfreq('D', 'S'), pi4) - self.assertEqual(pi2.asfreq('H', 'S'), pi5) - self.assertEqual(pi2.asfreq('Min', 'S'), pi6) - self.assertEqual(pi2.asfreq('S', 'S'), pi7) - - self.assertEqual(pi3.asfreq('A', 'S'), pi1) - self.assertEqual(pi3.asfreq('Q', 'S'), pi2) - self.assertEqual(pi3.asfreq('D', 'S'), pi4) - self.assertEqual(pi3.asfreq('H', 'S'), pi5) - self.assertEqual(pi3.asfreq('Min', 'S'), pi6) - self.assertEqual(pi3.asfreq('S', 'S'), pi7) - - self.assertEqual(pi4.asfreq('A', 'S'), pi1) - self.assertEqual(pi4.asfreq('Q', 'S'), pi2) - self.assertEqual(pi4.asfreq('M', 'S'), pi3) - self.assertEqual(pi4.asfreq('H', 'S'), pi5) - 
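
The grid of `asfreq` assertions above and below follows from a single invariant: converting to a finer frequency must pick an anchor inside the original span, and `how='S'`/`how='E'` (case-insensitive, hence the `'StarT'`/`'beGIN'` spellings) select the sub-period at the start or end. A small illustration, hedged to the era of this suite where annual frequency is spelled `'A'`:

```python
import pandas as pd

p = pd.Period('2011', freq='A')   # the whole of 2011
print(p.asfreq('M', how='S'))     # Period('2011-01', 'M')
print(p.asfreq('M', how='E'))     # Period('2011-12', 'M')
```
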
self.assertEqual(pi4.asfreq('Min', 'S'), pi6) - self.assertEqual(pi4.asfreq('S', 'S'), pi7) - - self.assertEqual(pi5.asfreq('A', 'S'), pi1) - self.assertEqual(pi5.asfreq('Q', 'S'), pi2) - self.assertEqual(pi5.asfreq('M', 'S'), pi3) - self.assertEqual(pi5.asfreq('D', 'S'), pi4) - self.assertEqual(pi5.asfreq('Min', 'S'), pi6) - self.assertEqual(pi5.asfreq('S', 'S'), pi7) - - self.assertEqual(pi6.asfreq('A', 'S'), pi1) - self.assertEqual(pi6.asfreq('Q', 'S'), pi2) - self.assertEqual(pi6.asfreq('M', 'S'), pi3) - self.assertEqual(pi6.asfreq('D', 'S'), pi4) - self.assertEqual(pi6.asfreq('H', 'S'), pi5) - self.assertEqual(pi6.asfreq('S', 'S'), pi7) - - self.assertEqual(pi7.asfreq('A', 'S'), pi1) - self.assertEqual(pi7.asfreq('Q', 'S'), pi2) - self.assertEqual(pi7.asfreq('M', 'S'), pi3) - self.assertEqual(pi7.asfreq('D', 'S'), pi4) - self.assertEqual(pi7.asfreq('H', 'S'), pi5) - self.assertEqual(pi7.asfreq('Min', 'S'), pi6) - - self.assertRaises(ValueError, pi7.asfreq, 'T', 'foo') - result1 = pi1.asfreq('3M') - result2 = pi1.asfreq('M') - expected = PeriodIndex(freq='M', start='2001-12', end='2001-12') - self.assert_numpy_array_equal(result1.asi8, expected.asi8) - self.assertEqual(result1.freqstr, '3M') - self.assert_numpy_array_equal(result2.asi8, expected.asi8) - self.assertEqual(result2.freqstr, 'M') - - def test_asfreq_nat(self): - idx = PeriodIndex(['2011-01', '2011-02', 'NaT', '2011-04'], freq='M') - result = idx.asfreq(freq='Q') - expected = PeriodIndex(['2011Q1', '2011Q1', 'NaT', '2011Q2'], freq='Q') - tm.assert_index_equal(result, expected) - - def test_asfreq_mult_pi(self): - pi = PeriodIndex(['2001-01', '2001-02', 'NaT', '2001-03'], freq='2M') - - for freq in ['D', '3D']: - result = pi.asfreq(freq) - exp = PeriodIndex(['2001-02-28', '2001-03-31', 'NaT', - '2001-04-30'], freq=freq) - self.assert_index_equal(result, exp) - self.assertEqual(result.freq, exp.freq) - - result = pi.asfreq(freq, how='S') - exp = PeriodIndex(['2001-01-01', '2001-02-01', 'NaT', - '2001-03-01'], freq=freq) - self.assert_index_equal(result, exp) - self.assertEqual(result.freq, exp.freq) - - def test_asfreq_combined_pi(self): - pi = pd.PeriodIndex(['2001-01-01 00:00', '2001-01-02 02:00', 'NaT'], - freq='H') - exp = PeriodIndex(['2001-01-01 00:00', '2001-01-02 02:00', 'NaT'], - freq='25H') - for freq, how in zip(['1D1H', '1H1D'], ['S', 'E']): - result = pi.asfreq(freq, how=how) - self.assert_index_equal(result, exp) - self.assertEqual(result.freq, exp.freq) - - for freq in ['1D1H', '1H1D']: - pi = pd.PeriodIndex(['2001-01-01 00:00', '2001-01-02 02:00', - 'NaT'], freq=freq) - result = pi.asfreq('H') - exp = PeriodIndex(['2001-01-02 00:00', '2001-01-03 02:00', 'NaT'], - freq='H') - self.assert_index_equal(result, exp) - self.assertEqual(result.freq, exp.freq) - - pi = pd.PeriodIndex(['2001-01-01 00:00', '2001-01-02 02:00', - 'NaT'], freq=freq) - result = pi.asfreq('H', how='S') - exp = PeriodIndex(['2001-01-01 00:00', '2001-01-02 02:00', 'NaT'], - freq='H') - self.assert_index_equal(result, exp) - self.assertEqual(result.freq, exp.freq) - - def test_period_index_length(self): - pi = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2009') - self.assertEqual(len(pi), 9) - - pi = PeriodIndex(freq='Q', start='1/1/2001', end='12/1/2009') - self.assertEqual(len(pi), 4 * 9) - - pi = PeriodIndex(freq='M', start='1/1/2001', end='12/1/2009') - self.assertEqual(len(pi), 12 * 9) - - start = Period('02-Apr-2005', 'B') - i1 = PeriodIndex(start=start, periods=20) - self.assertEqual(len(i1), 20) - self.assertEqual(i1.freq, 
start.freq) - self.assertEqual(i1[0], start) - - end_intv = Period('2006-12-31', 'W') - i1 = PeriodIndex(end=end_intv, periods=10) - self.assertEqual(len(i1), 10) - self.assertEqual(i1.freq, end_intv.freq) - self.assertEqual(i1[-1], end_intv) - - end_intv = Period('2006-12-31', '1w') - i2 = PeriodIndex(end=end_intv, periods=10) - self.assertEqual(len(i1), len(i2)) - self.assertTrue((i1 == i2).all()) - self.assertEqual(i1.freq, i2.freq) - - end_intv = Period('2006-12-31', ('w', 1)) - i2 = PeriodIndex(end=end_intv, periods=10) - self.assertEqual(len(i1), len(i2)) - self.assertTrue((i1 == i2).all()) - self.assertEqual(i1.freq, i2.freq) - - try: - PeriodIndex(start=start, end=end_intv) - raise AssertionError('Cannot allow mixed freq for start and end') - except ValueError: - pass - - end_intv = Period('2005-05-01', 'B') - i1 = PeriodIndex(start=start, end=end_intv) - - try: - PeriodIndex(start=start) - raise AssertionError( - 'Must specify periods if missing start or end') - except ValueError: - pass - - # infer freq from first element - i2 = PeriodIndex([end_intv, Period('2005-05-05', 'B')]) - self.assertEqual(len(i2), 2) - self.assertEqual(i2[0], end_intv) - - i2 = PeriodIndex(np.array([end_intv, Period('2005-05-05', 'B')])) - self.assertEqual(len(i2), 2) - self.assertEqual(i2[0], end_intv) - - # Mixed freq should fail - vals = [end_intv, Period('2006-12-31', 'w')] - self.assertRaises(ValueError, PeriodIndex, vals) - vals = np.array(vals) - self.assertRaises(ValueError, PeriodIndex, vals) - - def test_frame_index_to_string(self): - index = PeriodIndex(['2011-1', '2011-2', '2011-3'], freq='M') - frame = DataFrame(np.random.randn(3, 4), index=index) - - # it works! - frame.to_string() - - def test_asfreq_ts(self): - index = PeriodIndex(freq='A', start='1/1/2001', end='12/31/2010') - ts = Series(np.random.randn(len(index)), index=index) - df = DataFrame(np.random.randn(len(index), 3), index=index) - - result = ts.asfreq('D', how='end') - df_result = df.asfreq('D', how='end') - exp_index = index.asfreq('D', how='end') - self.assertEqual(len(result), len(ts)) - tm.assert_index_equal(result.index, exp_index) - tm.assert_index_equal(df_result.index, exp_index) - - result = ts.asfreq('D', how='start') - self.assertEqual(len(result), len(ts)) - tm.assert_index_equal(result.index, index.asfreq('D', how='start')) - - def test_badinput(self): - self.assertRaises(ValueError, Period, '-2000', 'A') - self.assertRaises(tslib.DateParseError, Period, '0', 'A') - self.assertRaises(tslib.DateParseError, Period, '1/1/-2000', 'A') - - def test_negative_ordinals(self): - Period(ordinal=-1000, freq='A') - Period(ordinal=0, freq='A') - - idx1 = PeriodIndex(ordinal=[-1, 0, 1], freq='A') - idx2 = PeriodIndex(ordinal=np.array([-1, 0, 1]), freq='A') - tm.assert_index_equal(idx1, idx2) - - def test_dti_to_period(self): - dti = DatetimeIndex(start='1/1/2005', end='12/1/2005', freq='M') - pi1 = dti.to_period() - pi2 = dti.to_period(freq='D') - pi3 = dti.to_period(freq='3D') - - self.assertEqual(pi1[0], Period('Jan 2005', freq='M')) - self.assertEqual(pi2[0], Period('1/31/2005', freq='D')) - self.assertEqual(pi3[0], Period('1/31/2005', freq='3D')) - - self.assertEqual(pi1[-1], Period('Nov 2005', freq='M')) - self.assertEqual(pi2[-1], Period('11/30/2005', freq='D')) - self.assertEqual(pi3[-1], Period('11/30/2005', freq='3D')) - - tm.assert_index_equal(pi1, period_range('1/1/2005', '11/1/2005', - freq='M')) - tm.assert_index_equal(pi2, period_range('1/1/2005', '11/1/2005', - freq='M').asfreq('D')) - tm.assert_index_equal(pi3, 
period_range('1/1/2005', '11/1/2005', - freq='M').asfreq('3D')) - - def test_pindex_slice_index(self): - pi = PeriodIndex(start='1/1/10', end='12/31/12', freq='M') - s = Series(np.random.rand(len(pi)), index=pi) - res = s['2010'] - exp = s[0:12] - tm.assert_series_equal(res, exp) - res = s['2011'] - exp = s[12:24] - tm.assert_series_equal(res, exp) - - def test_getitem_day(self): - # GH 6716 - # Confirm DatetimeIndex and PeriodIndex works identically - didx = DatetimeIndex(start='2013/01/01', freq='D', periods=400) - pidx = PeriodIndex(start='2013/01/01', freq='D', periods=400) - - for idx in [didx, pidx]: - # getitem against index should raise ValueError - values = ['2014', '2013/02', '2013/01/02', '2013/02/01 9H', - '2013/02/01 09:00'] - for v in values: - - if _np_version_under1p9: - with tm.assertRaises(ValueError): - idx[v] - else: - # GH7116 - # these show deprecations as we are trying - # to slice with non-integer indexers - # with tm.assertRaises(IndexError): - # idx[v] - continue - - s = Series(np.random.rand(len(idx)), index=idx) - tm.assert_series_equal(s['2013/01'], s[0:31]) - tm.assert_series_equal(s['2013/02'], s[31:59]) - tm.assert_series_equal(s['2014'], s[365:]) - - invalid = ['2013/02/01 9H', '2013/02/01 09:00'] - for v in invalid: - with tm.assertRaises(KeyError): - s[v] - - def test_range_slice_day(self): - # GH 6716 - didx = DatetimeIndex(start='2013/01/01', freq='D', periods=400) - pidx = PeriodIndex(start='2013/01/01', freq='D', periods=400) - - # changed to TypeError in 1.12 - # https://github.com/numpy/numpy/pull/6271 - exc = IndexError if _np_version_under1p12 else TypeError - - for idx in [didx, pidx]: - # slices against index should raise IndexError - values = ['2014', '2013/02', '2013/01/02', '2013/02/01 9H', - '2013/02/01 09:00'] - for v in values: - with tm.assertRaises(exc): - idx[v:] - - s = Series(np.random.rand(len(idx)), index=idx) - - tm.assert_series_equal(s['2013/01/02':], s[1:]) - tm.assert_series_equal(s['2013/01/02':'2013/01/05'], s[1:5]) - tm.assert_series_equal(s['2013/02':], s[31:]) - tm.assert_series_equal(s['2014':], s[365:]) - - invalid = ['2013/02/01 9H', '2013/02/01 09:00'] - for v in invalid: - with tm.assertRaises(exc): - idx[v:] - - def test_getitem_seconds(self): - # GH 6716 - didx = DatetimeIndex(start='2013/01/01 09:00:00', freq='S', - periods=4000) - pidx = PeriodIndex(start='2013/01/01 09:00:00', freq='S', periods=4000) - - for idx in [didx, pidx]: - # getitem against index should raise ValueError - values = ['2014', '2013/02', '2013/01/02', '2013/02/01 9H', - '2013/02/01 09:00'] - for v in values: - if _np_version_under1p9: - with tm.assertRaises(ValueError): - idx[v] - else: - # GH7116 - # these show deprecations as we are trying - # to slice with non-integer indexers - # with tm.assertRaises(IndexError): - # idx[v] - continue - - s = Series(np.random.rand(len(idx)), index=idx) - tm.assert_series_equal(s['2013/01/01 10:00'], s[3600:3660]) - tm.assert_series_equal(s['2013/01/01 9H'], s[:3600]) - for d in ['2013/01/01', '2013/01', '2013']: - tm.assert_series_equal(s[d], s) - - def test_range_slice_seconds(self): - # GH 6716 - didx = DatetimeIndex(start='2013/01/01 09:00:00', freq='S', - periods=4000) - pidx = PeriodIndex(start='2013/01/01 09:00:00', freq='S', periods=4000) - - # changed to TypeError in 1.12 - # https://github.com/numpy/numpy/pull/6271 - exc = IndexError if _np_version_under1p12 else TypeError - - for idx in [didx, pidx]: - # slices against index should raise IndexError - values = ['2014', '2013/02', 
'2013/01/02', '2013/02/01 9H', - '2013/02/01 09:00'] - for v in values: - with tm.assertRaises(exc): - idx[v:] - - s = Series(np.random.rand(len(idx)), index=idx) - - tm.assert_series_equal(s['2013/01/01 09:05':'2013/01/01 09:10'], - s[300:660]) - tm.assert_series_equal(s['2013/01/01 10:00':'2013/01/01 10:05'], - s[3600:3960]) - tm.assert_series_equal(s['2013/01/01 10H':], s[3600:]) - tm.assert_series_equal(s[:'2013/01/01 09:30'], s[:1860]) - for d in ['2013/01/01', '2013/01', '2013']: - tm.assert_series_equal(s[d:], s) - - def test_range_slice_outofbounds(self): - # GH 5407 - didx = DatetimeIndex(start='2013/10/01', freq='D', periods=10) - pidx = PeriodIndex(start='2013/10/01', freq='D', periods=10) - - for idx in [didx, pidx]: - df = DataFrame(dict(units=[100 + i for i in range(10)]), index=idx) - empty = DataFrame(index=idx.__class__([], freq='D'), - columns=['units']) - empty['units'] = empty['units'].astype('int64') - - tm.assert_frame_equal(df['2013/09/01':'2013/09/30'], empty) - tm.assert_frame_equal(df['2013/09/30':'2013/10/02'], df.iloc[:2]) - tm.assert_frame_equal(df['2013/10/01':'2013/10/02'], df.iloc[:2]) - tm.assert_frame_equal(df['2013/10/02':'2013/09/30'], empty) - tm.assert_frame_equal(df['2013/10/15':'2013/10/17'], empty) - tm.assert_frame_equal(df['2013-06':'2013-09'], empty) - tm.assert_frame_equal(df['2013-11':'2013-12'], empty) - - def test_astype_asfreq(self): - pi1 = PeriodIndex(['2011-01-01', '2011-02-01', '2011-03-01'], freq='D') - exp = PeriodIndex(['2011-01', '2011-02', '2011-03'], freq='M') - tm.assert_index_equal(pi1.asfreq('M'), exp) - tm.assert_index_equal(pi1.astype('period[M]'), exp) - - exp = PeriodIndex(['2011-01', '2011-02', '2011-03'], freq='3M') - tm.assert_index_equal(pi1.asfreq('3M'), exp) - tm.assert_index_equal(pi1.astype('period[3M]'), exp) - - def test_pindex_fieldaccessor_nat(self): - idx = PeriodIndex(['2011-01', '2011-02', 'NaT', - '2012-03', '2012-04'], freq='D') - - exp = np.array([2011, 2011, -1, 2012, 2012], dtype=np.int64) - self.assert_numpy_array_equal(idx.year, exp) - exp = np.array([1, 2, -1, 3, 4], dtype=np.int64) - self.assert_numpy_array_equal(idx.month, exp) - - def test_pindex_qaccess(self): - pi = PeriodIndex(['2Q05', '3Q05', '4Q05', '1Q06', '2Q06'], freq='Q') - s = Series(np.random.rand(len(pi)), index=pi).cumsum() - # Todo: fix these accessors! 
- self.assertEqual(s['05Q4'], s[2]) - - def test_period_dt64_round_trip(self): - dti = date_range('1/1/2000', '1/7/2002', freq='B') - pi = dti.to_period() - tm.assert_index_equal(pi.to_timestamp(), dti) - - dti = date_range('1/1/2000', '1/7/2002', freq='B') - pi = dti.to_period(freq='H') - tm.assert_index_equal(pi.to_timestamp(), dti) - - def test_period_astype_to_timestamp(self): - pi = pd.PeriodIndex(['2011-01', '2011-02', '2011-03'], freq='M') - - exp = pd.DatetimeIndex(['2011-01-01', '2011-02-01', '2011-03-01']) - tm.assert_index_equal(pi.astype('datetime64[ns]'), exp) - - exp = pd.DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31']) - tm.assert_index_equal(pi.astype('datetime64[ns]', how='end'), exp) - - exp = pd.DatetimeIndex(['2011-01-01', '2011-02-01', '2011-03-01'], - tz='US/Eastern') - res = pi.astype('datetime64[ns, US/Eastern]') - tm.assert_index_equal(pi.astype('datetime64[ns, US/Eastern]'), exp) - - exp = pd.DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31'], - tz='US/Eastern') - res = pi.astype('datetime64[ns, US/Eastern]', how='end') - tm.assert_index_equal(res, exp) - - def test_to_period_quarterly(self): - # make sure we can make the round trip - for month in MONTHS: - freq = 'Q-%s' % month - rng = period_range('1989Q3', '1991Q3', freq=freq) - stamps = rng.to_timestamp() - result = stamps.to_period(freq) - tm.assert_index_equal(rng, result) - - def test_to_period_quarterlyish(self): - offsets = ['BQ', 'QS', 'BQS'] - for off in offsets: - rng = date_range('01-Jan-2012', periods=8, freq=off) - prng = rng.to_period() - self.assertEqual(prng.freq, 'Q-DEC') - - def test_to_period_annualish(self): - offsets = ['BA', 'AS', 'BAS'] - for off in offsets: - rng = date_range('01-Jan-2012', periods=8, freq=off) - prng = rng.to_period() - self.assertEqual(prng.freq, 'A-DEC') - - def test_to_period_monthish(self): - offsets = ['MS', 'BM'] - for off in offsets: - rng = date_range('01-Jan-2012', periods=8, freq=off) - prng = rng.to_period() - self.assertEqual(prng.freq, 'M') - - rng = date_range('01-Jan-2012', periods=8, freq='M') - prng = rng.to_period() - self.assertEqual(prng.freq, 'M') - - msg = pd.tseries.frequencies._INVALID_FREQ_ERROR - with self.assertRaisesRegexp(ValueError, msg): - date_range('01-Jan-2012', periods=8, freq='EOM') - - def test_multiples(self): - result1 = Period('1989', freq='2A') - result2 = Period('1989', freq='A') - self.assertEqual(result1.ordinal, result2.ordinal) - self.assertEqual(result1.freqstr, '2A-DEC') - self.assertEqual(result2.freqstr, 'A-DEC') - self.assertEqual(result1.freq, offsets.YearEnd(2)) - self.assertEqual(result2.freq, offsets.YearEnd()) - - self.assertEqual((result1 + 1).ordinal, result1.ordinal + 2) - self.assertEqual((1 + result1).ordinal, result1.ordinal + 2) - self.assertEqual((result1 - 1).ordinal, result2.ordinal - 2) - self.assertEqual((-1 + result1).ordinal, result2.ordinal - 2) - - def test_pindex_multiples(self): - pi = PeriodIndex(start='1/1/11', end='12/31/11', freq='2M') - expected = PeriodIndex(['2011-01', '2011-03', '2011-05', '2011-07', - '2011-09', '2011-11'], freq='2M') - tm.assert_index_equal(pi, expected) - self.assertEqual(pi.freq, offsets.MonthEnd(2)) - self.assertEqual(pi.freqstr, '2M') - - pi = period_range(start='1/1/11', end='12/31/11', freq='2M') - tm.assert_index_equal(pi, expected) - self.assertEqual(pi.freq, offsets.MonthEnd(2)) - self.assertEqual(pi.freqstr, '2M') - - pi = period_range(start='1/1/11', periods=6, freq='2M') - tm.assert_index_equal(pi, expected) - self.assertEqual(pi.freq, 
offsets.MonthEnd(2)) - self.assertEqual(pi.freqstr, '2M') - - def test_iteration(self): - index = PeriodIndex(start='1/1/10', periods=4, freq='B') - - result = list(index) - tm.assertIsInstance(result[0], Period) - self.assertEqual(result[0].freq, index.freq) - - def test_take(self): - index = PeriodIndex(start='1/1/10', end='12/31/12', freq='D', - name='idx') - expected = PeriodIndex([datetime(2010, 1, 6), datetime(2010, 1, 7), - datetime(2010, 1, 9), datetime(2010, 1, 13)], - freq='D', name='idx') - - taken1 = index.take([5, 6, 8, 12]) - taken2 = index[[5, 6, 8, 12]] - - for taken in [taken1, taken2]: - tm.assert_index_equal(taken, expected) - tm.assertIsInstance(taken, PeriodIndex) - self.assertEqual(taken.freq, index.freq) - self.assertEqual(taken.name, expected.name) - - def test_take_fill_value(self): - # GH 12631 - idx = pd.PeriodIndex(['2011-01-01', '2011-02-01', '2011-03-01'], - name='xxx', freq='D') - result = idx.take(np.array([1, 0, -1])) - expected = pd.PeriodIndex(['2011-02-01', '2011-01-01', '2011-03-01'], - name='xxx', freq='D') - tm.assert_index_equal(result, expected) - - # fill_value - result = idx.take(np.array([1, 0, -1]), fill_value=True) - expected = pd.PeriodIndex(['2011-02-01', '2011-01-01', 'NaT'], - name='xxx', freq='D') - tm.assert_index_equal(result, expected) - - # allow_fill=False - result = idx.take(np.array([1, 0, -1]), allow_fill=False, - fill_value=True) - expected = pd.PeriodIndex(['2011-02-01', '2011-01-01', '2011-03-01'], - name='xxx', freq='D') - tm.assert_index_equal(result, expected) - - msg = ('When allow_fill=True and fill_value is not None, ' - 'all indices must be >= -1') - with tm.assertRaisesRegexp(ValueError, msg): - idx.take(np.array([1, 0, -2]), fill_value=True) - with tm.assertRaisesRegexp(ValueError, msg): - idx.take(np.array([1, 0, -5]), fill_value=True) - - with tm.assertRaises(IndexError): - idx.take(np.array([1, -5])) - - def test_joins(self): - index = period_range('1/1/2000', '1/20/2000', freq='D') - - for kind in ['inner', 'outer', 'left', 'right']: - joined = index.join(index[:-5], how=kind) - - tm.assertIsInstance(joined, PeriodIndex) - self.assertEqual(joined.freq, index.freq) - - def test_join_self(self): - index = period_range('1/1/2000', '1/20/2000', freq='D') - - for kind in ['inner', 'outer', 'left', 'right']: - res = index.join(index, how=kind) - self.assertIs(index, res) - - def test_join_does_not_recur(self): - df = tm.makeCustomDataframe( - 3, 2, data_gen_f=lambda *args: np.random.randint(2), - c_idx_type='p', r_idx_type='dt') - s = df.iloc[:2, 0] - - res = s.index.join(df.columns, how='outer') - expected = Index([s.index[0], s.index[1], - df.columns[0], df.columns[1]], object) - tm.assert_index_equal(res, expected) - - def test_align_series(self): - rng = period_range('1/1/2000', '1/1/2010', freq='A') - ts = Series(np.random.randn(len(rng)), index=rng) - - result = ts + ts[::2] - expected = ts + ts - expected[1::2] = np.nan - tm.assert_series_equal(result, expected) - - result = ts + _permute(ts[::2]) - tm.assert_series_equal(result, expected) - - # it works! 
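
The alignment checks here lean on pandas' automatic index alignment: arithmetic between two Series first unions their labels, and labels present on only one side produce NaN. Roughly (a sketch, using the `'A-DEC'` annual spelling this suite assumes):

```python
import pandas as pd
import numpy as np

rng = pd.period_range('2000', periods=6, freq='A-DEC')
ts = pd.Series(np.arange(6.0), index=rng)

result = ts + ts[::2]          # only every other label matches
print(result.isna().tolist())  # [False, True, False, True, False, True]
```
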
- for kind in ['inner', 'outer', 'left', 'right']: - ts.align(ts[::2], join=kind) - msg = "Input has different freq=D from PeriodIndex\\(freq=A-DEC\\)" - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - ts + ts.asfreq('D', how="end") - - def test_align_frame(self): - rng = period_range('1/1/2000', '1/1/2010', freq='A') - ts = DataFrame(np.random.randn(len(rng), 3), index=rng) - - result = ts + ts[::2] - expected = ts + ts - expected.values[1::2] = np.nan - tm.assert_frame_equal(result, expected) - - result = ts + _permute(ts[::2]) - tm.assert_frame_equal(result, expected) - - def test_union(self): - index = period_range('1/1/2000', '1/20/2000', freq='D') - - result = index[:-5].union(index[10:]) - tm.assert_index_equal(result, index) - - # not in order - result = _permute(index[:-5]).union(_permute(index[10:])) - tm.assert_index_equal(result, index) - - # raise if different frequencies - index = period_range('1/1/2000', '1/20/2000', freq='D') - index2 = period_range('1/1/2000', '1/20/2000', freq='W-WED') - with tm.assertRaises(period.IncompatibleFrequency): - index.union(index2) - - msg = 'can only call with other PeriodIndex-ed objects' - with tm.assertRaisesRegexp(ValueError, msg): - index.join(index.to_timestamp()) - - index3 = period_range('1/1/2000', '1/20/2000', freq='2D') - with tm.assertRaises(period.IncompatibleFrequency): - index.join(index3) - - def test_union_dataframe_index(self): - rng1 = pd.period_range('1/1/1999', '1/1/2012', freq='M') - s1 = pd.Series(np.random.randn(len(rng1)), rng1) - - rng2 = pd.period_range('1/1/1980', '12/1/2001', freq='M') - s2 = pd.Series(np.random.randn(len(rng2)), rng2) - df = pd.DataFrame({'s1': s1, 's2': s2}) - - exp = pd.period_range('1/1/1980', '1/1/2012', freq='M') - self.assert_index_equal(df.index, exp) - - def test_intersection(self): - index = period_range('1/1/2000', '1/20/2000', freq='D') - - result = index[:-5].intersection(index[10:]) - tm.assert_index_equal(result, index[10:-5]) - - # not in order - left = _permute(index[:-5]) - right = _permute(index[10:]) - result = left.intersection(right).sort_values() - tm.assert_index_equal(result, index[10:-5]) - - # raise if different frequencies - index = period_range('1/1/2000', '1/20/2000', freq='D') - index2 = period_range('1/1/2000', '1/20/2000', freq='W-WED') - with tm.assertRaises(period.IncompatibleFrequency): - index.intersection(index2) - - index3 = period_range('1/1/2000', '1/20/2000', freq='2D') - with tm.assertRaises(period.IncompatibleFrequency): - index.intersection(index3) - - def test_intersection_cases(self): - base = period_range('6/1/2000', '6/30/2000', freq='D', name='idx') - - # if target has the same name, it is preserved - rng2 = period_range('5/15/2000', '6/20/2000', freq='D', name='idx') - expected2 = period_range('6/1/2000', '6/20/2000', freq='D', - name='idx') - - # if target name is different, it will be reset - rng3 = period_range('5/15/2000', '6/20/2000', freq='D', name='other') - expected3 = period_range('6/1/2000', '6/20/2000', freq='D', - name=None) - - rng4 = period_range('7/1/2000', '7/31/2000', freq='D', name='idx') - expected4 = PeriodIndex([], name='idx', freq='D') - - for (rng, expected) in [(rng2, expected2), (rng3, expected3), - (rng4, expected4)]: - result = base.intersection(rng) - tm.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - self.assertEqual(result.freq, expected.freq) - - # non-monotonic - base = PeriodIndex(['2011-01-05', '2011-01-04', '2011-01-02', - '2011-01-03'], freq='D', name='idx') 
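
As the union/intersection tests above assert, set operations between two PeriodIndex objects are only defined when the frequencies match; mixing, say, `'D'` with `'W-WED'` raises `IncompatibleFrequency` rather than silently coercing. For the matching-frequency case (a sketch under the same assumptions as the surrounding tests):

```python
import pandas as pd

a = pd.period_range('2000-01-01', '2000-01-10', freq='D')
b = pd.period_range('2000-01-05', '2000-01-15', freq='D')

print(a.intersection(b))  # 2000-01-05 .. 2000-01-10, still freq='D'
print(a.union(b))         # 2000-01-01 .. 2000-01-15
```
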
- - rng2 = PeriodIndex(['2011-01-04', '2011-01-02', - '2011-02-02', '2011-02-03'], - freq='D', name='idx') - expected2 = PeriodIndex(['2011-01-04', '2011-01-02'], freq='D', - name='idx') - - rng3 = PeriodIndex(['2011-01-04', '2011-01-02', '2011-02-02', - '2011-02-03'], - freq='D', name='other') - expected3 = PeriodIndex(['2011-01-04', '2011-01-02'], freq='D', - name=None) - - rng4 = period_range('7/1/2000', '7/31/2000', freq='D', name='idx') - expected4 = PeriodIndex([], freq='D', name='idx') - - for (rng, expected) in [(rng2, expected2), (rng3, expected3), - (rng4, expected4)]: - result = base.intersection(rng) - tm.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - self.assertEqual(result.freq, 'D') - - # empty same freq - rng = date_range('6/1/2000', '6/15/2000', freq='T') - result = rng[0:0].intersection(rng) - self.assertEqual(len(result), 0) - - result = rng.intersection(rng[0:0]) - self.assertEqual(len(result), 0) - - def test_fields(self): - # year, month, day, hour, minute - # second, weekofyear, week, dayofweek, weekday, dayofyear, quarter - # qyear - pi = PeriodIndex(freq='A', start='1/1/2001', end='12/1/2005') - self._check_all_fields(pi) - - pi = PeriodIndex(freq='Q', start='1/1/2001', end='12/1/2002') - self._check_all_fields(pi) - - pi = PeriodIndex(freq='M', start='1/1/2001', end='1/1/2002') - self._check_all_fields(pi) - - pi = PeriodIndex(freq='D', start='12/1/2001', end='6/1/2001') - self._check_all_fields(pi) - - pi = PeriodIndex(freq='B', start='12/1/2001', end='6/1/2001') - self._check_all_fields(pi) - - pi = PeriodIndex(freq='H', start='12/31/2001', end='1/1/2002 23:00') - self._check_all_fields(pi) - - pi = PeriodIndex(freq='Min', start='12/31/2001', end='1/1/2002 00:20') - self._check_all_fields(pi) - - pi = PeriodIndex(freq='S', start='12/31/2001 00:00:00', - end='12/31/2001 00:05:00') - self._check_all_fields(pi) - - end_intv = Period('2006-12-31', 'W') - i1 = PeriodIndex(end=end_intv, periods=10) - self._check_all_fields(i1) - - def _check_all_fields(self, periodindex): - fields = ['year', 'month', 'day', 'hour', 'minute', 'second', - 'weekofyear', 'week', 'dayofweek', 'weekday', 'dayofyear', - 'quarter', 'qyear', 'days_in_month', 'is_leap_year'] - - periods = list(periodindex) - s = pd.Series(periodindex) - - for field in fields: - field_idx = getattr(periodindex, field) - self.assertEqual(len(periodindex), len(field_idx)) - for x, val in zip(periods, field_idx): - self.assertEqual(getattr(x, field), val) - - if len(s) == 0: - continue - - field_s = getattr(s.dt, field) - self.assertEqual(len(periodindex), len(field_s)) - for x, val in zip(periods, field_s): - self.assertEqual(getattr(x, field), val) - - def test_is_full(self): - index = PeriodIndex([2005, 2007, 2009], freq='A') - self.assertFalse(index.is_full) - - index = PeriodIndex([2005, 2006, 2007], freq='A') - self.assertTrue(index.is_full) - - index = PeriodIndex([2005, 2005, 2007], freq='A') - self.assertFalse(index.is_full) - - index = PeriodIndex([2005, 2005, 2006], freq='A') - self.assertTrue(index.is_full) - - index = PeriodIndex([2006, 2005, 2005], freq='A') - self.assertRaises(ValueError, getattr, index, 'is_full') - - self.assertTrue(index[:0].is_full) - - def test_map(self): - index = PeriodIndex([2005, 2007, 2009], freq='A') - result = index.map(lambda x: x + 1) - expected = index + 1 - tm.assert_index_equal(result, expected) - - result = index.map(lambda x: x.ordinal) - exp = np.array([x.ordinal for x in index], dtype=np.int64) - 
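
The `map`/`ordinal` check above leans on the integer representation of periods: every `Period` is stored as an ordinal counted from an epoch anchored at 1970 in its own frequency, which is also what `asi8` exposes and what the `TestPeriodRepresentation` cases later in this file pin down. Concretely (a sketch):

```python
import pandas as pd

p = pd.Period('1970-01', freq='M')
print(p.ordinal)         # 0 -- monthly ordinals count from January 1970
print((p + 25).ordinal)  # 25, i.e. Period('1972-02', 'M')
```
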
tm.assert_numpy_array_equal(result, exp) - - def test_map_with_string_constructor(self): - raw = [2005, 2007, 2009] - index = PeriodIndex(raw, freq='A') - types = str, - - if PY3: - # unicode - types += text_type, - - for t in types: - expected = np.array(lmap(t, raw), dtype=object) - res = index.map(t) - - # should return an array - tm.assertIsInstance(res, np.ndarray) - - # preserve element types - self.assertTrue(all(isinstance(resi, t) for resi in res)) - - # dtype should be object - self.assertEqual(res.dtype, np.dtype('object').type) - - # lastly, values should compare equal - tm.assert_numpy_array_equal(res, expected) - - def test_convert_array_of_periods(self): - rng = period_range('1/1/2000', periods=20, freq='D') - periods = list(rng) - - result = pd.Index(periods) - tm.assertIsInstance(result, PeriodIndex) - - def test_with_multi_index(self): - # #1705 - index = date_range('1/1/2012', periods=4, freq='12H') - index_as_arrays = [index.to_period(freq='D'), index.hour] - - s = Series([0, 1, 2, 3], index_as_arrays) - - tm.assertIsInstance(s.index.levels[0], PeriodIndex) - - tm.assertIsInstance(s.index.values[0][0], Period) - - def test_to_timestamp_1703(self): - index = period_range('1/1/2012', periods=4, freq='D') - - result = index.to_timestamp() - self.assertEqual(result[0], Timestamp('1/1/2012')) - - def test_to_datetime_depr(self): - index = period_range('1/1/2012', periods=4, freq='D') - - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): - result = index.to_datetime() - self.assertEqual(result[0], Timestamp('1/1/2012')) - - def test_get_loc_msg(self): - idx = period_range('2000-1-1', freq='A', periods=10) - bad_period = Period('2012', 'A') - self.assertRaises(KeyError, idx.get_loc, bad_period) - - try: - idx.get_loc(bad_period) - except KeyError as inst: - self.assertEqual(inst.args[0], bad_period) - - def test_get_loc_nat(self): - didx = DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03']) - pidx = PeriodIndex(['2011-01-01', 'NaT', '2011-01-03'], freq='M') - - # check DatetimeIndex compat - for idx in [didx, pidx]: - self.assertEqual(idx.get_loc(pd.NaT), 1) - self.assertEqual(idx.get_loc(None), 1) - self.assertEqual(idx.get_loc(float('nan')), 1) - self.assertEqual(idx.get_loc(np.nan), 1) - - def test_append_concat(self): - # #1815 - d1 = date_range('12/31/1990', '12/31/1999', freq='A-DEC') - d2 = date_range('12/31/2000', '12/31/2009', freq='A-DEC') - - s1 = Series(np.random.randn(10), d1) - s2 = Series(np.random.randn(10), d2) - - s1 = s1.to_period() - s2 = s2.to_period() - - # drops index - result = pd.concat([s1, s2]) - tm.assertIsInstance(result.index, PeriodIndex) - self.assertEqual(result.index[0], s1.index[0]) - - def test_pickle_freq(self): - # GH2891 - prng = period_range('1/1/2011', '1/1/2012', freq='M') - new_prng = self.round_trip_pickle(prng) - self.assertEqual(new_prng.freq, offsets.MonthEnd()) - self.assertEqual(new_prng.freqstr, 'M') - - def test_slice_keep_name(self): - idx = period_range('20010101', periods=10, freq='D', name='bob') - self.assertEqual(idx.name, idx[1:].name) - - def test_factorize(self): - idx1 = PeriodIndex(['2014-01', '2014-01', '2014-02', '2014-02', - '2014-03', '2014-03'], freq='M') - - exp_arr = np.array([0, 0, 1, 1, 2, 2], dtype=np.intp) - exp_idx = PeriodIndex(['2014-01', '2014-02', '2014-03'], freq='M') - - arr, idx = idx1.factorize() - self.assert_numpy_array_equal(arr, exp_arr) - tm.assert_index_equal(idx, exp_idx) - - arr, idx = idx1.factorize(sort=True) - self.assert_numpy_array_equal(arr, exp_arr) - 
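
`factorize` on a PeriodIndex behaves as these assertions encode: it returns integer codes plus the unique periods, and `sort=True` orders the uniques (renumbering the codes) chronologically instead of by first appearance. A minimal sketch:

```python
import pandas as pd

idx = pd.PeriodIndex(['2014-02', '2014-01', '2014-02'], freq='M')

codes, uniques = idx.factorize()
print(codes)    # [0 1 0] -- codes in first-seen order
print(uniques)  # PeriodIndex(['2014-02', '2014-01'], freq='M')

codes, uniques = idx.factorize(sort=True)
print(codes)    # [1 0 1] -- uniques now chronological
```
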
tm.assert_index_equal(idx, exp_idx) - - idx2 = pd.PeriodIndex(['2014-03', '2014-03', '2014-02', '2014-01', - '2014-03', '2014-01'], freq='M') - - exp_arr = np.array([2, 2, 1, 0, 2, 0], dtype=np.intp) - arr, idx = idx2.factorize(sort=True) - self.assert_numpy_array_equal(arr, exp_arr) - tm.assert_index_equal(idx, exp_idx) - - exp_arr = np.array([0, 0, 1, 2, 0, 2], dtype=np.intp) - exp_idx = PeriodIndex(['2014-03', '2014-02', '2014-01'], freq='M') - arr, idx = idx2.factorize() - self.assert_numpy_array_equal(arr, exp_arr) - tm.assert_index_equal(idx, exp_idx) - - def test_recreate_from_data(self): - for o in ['M', 'Q', 'A', 'D', 'B', 'T', 'S', 'L', 'U', 'N', 'H']: - org = PeriodIndex(start='2001/04/01', freq=o, periods=1) - idx = PeriodIndex(org.values, freq=o) - tm.assert_index_equal(idx, org) - - def test_combine_first(self): - # GH 3367 - didx = pd.DatetimeIndex(start='1950-01-31', end='1950-07-31', freq='M') - pidx = pd.PeriodIndex(start=pd.Period('1950-1'), - end=pd.Period('1950-7'), freq='M') - # check to be consistent with DatetimeIndex - for idx in [didx, pidx]: - a = pd.Series([1, np.nan, np.nan, 4, 5, np.nan, 7], index=idx) - b = pd.Series([9, 9, 9, 9, 9, 9, 9], index=idx) - - result = a.combine_first(b) - expected = pd.Series([1, 9, 9, 4, 5, 9, 7], index=idx, - dtype=np.float64) - tm.assert_series_equal(result, expected) - - def test_searchsorted(self): - for freq in ['D', '2D']: - pidx = pd.PeriodIndex(['2014-01-01', '2014-01-02', '2014-01-03', - '2014-01-04', '2014-01-05'], freq=freq) - - p1 = pd.Period('2014-01-01', freq=freq) - self.assertEqual(pidx.searchsorted(p1), 0) - - p2 = pd.Period('2014-01-04', freq=freq) - self.assertEqual(pidx.searchsorted(p2), 3) - - msg = "Input has different freq=H from PeriodIndex" - with self.assertRaisesRegexp(period.IncompatibleFrequency, msg): - pidx.searchsorted(pd.Period('2014-01-01', freq='H')) - - msg = "Input has different freq=5D from PeriodIndex" - with self.assertRaisesRegexp(period.IncompatibleFrequency, msg): - pidx.searchsorted(pd.Period('2014-01-01', freq='5D')) - - def test_round_trip(self): - - p = Period('2000Q1') - new_p = self.round_trip_pickle(p) - self.assertEqual(new_p, p) - - -def _permute(obj): - return obj.take(np.random.permutation(len(obj))) - - -class TestMethods(tm.TestCase): - - def test_add(self): - dt1 = Period(freq='D', year=2008, month=1, day=1) - dt2 = Period(freq='D', year=2008, month=1, day=2) - self.assertEqual(dt1 + 1, dt2) - self.assertEqual(1 + dt1, dt2) - - def test_add_pdnat(self): - p = pd.Period('2011-01', freq='M') - self.assertIs(p + pd.NaT, pd.NaT) - self.assertIs(pd.NaT + p, pd.NaT) - - p = pd.Period('NaT', freq='M') - self.assertIs(p + pd.NaT, pd.NaT) - self.assertIs(pd.NaT + p, pd.NaT) - - def test_add_raises(self): - # GH 4731 - dt1 = Period(freq='D', year=2008, month=1, day=1) - dt2 = Period(freq='D', year=2008, month=1, day=2) - msg = r"unsupported operand type\(s\)" - with tm.assertRaisesRegexp(TypeError, msg): - dt1 + "str" - - msg = r"unsupported operand type\(s\)" - with tm.assertRaisesRegexp(TypeError, msg): - "str" + dt1 - - with tm.assertRaisesRegexp(TypeError, msg): - dt1 + dt2 - - def test_sub(self): - dt1 = Period('2011-01-01', freq='D') - dt2 = Period('2011-01-15', freq='D') - - self.assertEqual(dt1 - dt2, -14) - self.assertEqual(dt2 - dt1, 14) - - msg = r"Input has different freq=M from Period\(freq=D\)" - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - dt1 - pd.Period('2011-02', freq='M') - - def test_add_offset(self): - # freq is DateOffset - for freq in ['A', 
'2A', '3A']: - p = Period('2011', freq=freq) - exp = Period('2013', freq=freq) - self.assertEqual(p + offsets.YearEnd(2), exp) - self.assertEqual(offsets.YearEnd(2) + p, exp) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(365, 'D'), - timedelta(365)]: - with tm.assertRaises(period.IncompatibleFrequency): - p + o - - if isinstance(o, np.timedelta64): - with tm.assertRaises(TypeError): - o + p - else: - with tm.assertRaises(period.IncompatibleFrequency): - o + p - - for freq in ['M', '2M', '3M']: - p = Period('2011-03', freq=freq) - exp = Period('2011-05', freq=freq) - self.assertEqual(p + offsets.MonthEnd(2), exp) - self.assertEqual(offsets.MonthEnd(2) + p, exp) - - exp = Period('2012-03', freq=freq) - self.assertEqual(p + offsets.MonthEnd(12), exp) - self.assertEqual(offsets.MonthEnd(12) + p, exp) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(365, 'D'), - timedelta(365)]: - with tm.assertRaises(period.IncompatibleFrequency): - p + o - - if isinstance(o, np.timedelta64): - with tm.assertRaises(TypeError): - o + p - else: - with tm.assertRaises(period.IncompatibleFrequency): - o + p - - # freq is Tick - for freq in ['D', '2D', '3D']: - p = Period('2011-04-01', freq=freq) - - exp = Period('2011-04-06', freq=freq) - self.assertEqual(p + offsets.Day(5), exp) - self.assertEqual(offsets.Day(5) + p, exp) - - exp = Period('2011-04-02', freq=freq) - self.assertEqual(p + offsets.Hour(24), exp) - self.assertEqual(offsets.Hour(24) + p, exp) - - exp = Period('2011-04-03', freq=freq) - self.assertEqual(p + np.timedelta64(2, 'D'), exp) - with tm.assertRaises(TypeError): - np.timedelta64(2, 'D') + p - - exp = Period('2011-04-02', freq=freq) - self.assertEqual(p + np.timedelta64(3600 * 24, 's'), exp) - with tm.assertRaises(TypeError): - np.timedelta64(3600 * 24, 's') + p - - exp = Period('2011-03-30', freq=freq) - self.assertEqual(p + timedelta(-2), exp) - self.assertEqual(timedelta(-2) + p, exp) - - exp = Period('2011-04-03', freq=freq) - self.assertEqual(p + timedelta(hours=48), exp) - self.assertEqual(timedelta(hours=48) + p, exp) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(4, 'h'), - timedelta(hours=23)]: - with tm.assertRaises(period.IncompatibleFrequency): - p + o - - if isinstance(o, np.timedelta64): - with tm.assertRaises(TypeError): - o + p - else: - with tm.assertRaises(period.IncompatibleFrequency): - o + p - - for freq in ['H', '2H', '3H']: - p = Period('2011-04-01 09:00', freq=freq) - - exp = Period('2011-04-03 09:00', freq=freq) - self.assertEqual(p + offsets.Day(2), exp) - self.assertEqual(offsets.Day(2) + p, exp) - - exp = Period('2011-04-01 12:00', freq=freq) - self.assertEqual(p + offsets.Hour(3), exp) - self.assertEqual(offsets.Hour(3) + p, exp) - - exp = Period('2011-04-01 12:00', freq=freq) - self.assertEqual(p + np.timedelta64(3, 'h'), exp) - with tm.assertRaises(TypeError): - np.timedelta64(3, 'h') + p - - exp = Period('2011-04-01 10:00', freq=freq) - self.assertEqual(p + np.timedelta64(3600, 's'), exp) - with tm.assertRaises(TypeError): - np.timedelta64(3600, 's') + p - - exp = Period('2011-04-01 11:00', freq=freq) - self.assertEqual(p + timedelta(minutes=120), exp) - self.assertEqual(timedelta(minutes=120) + p, exp) - - exp = Period('2011-04-05 12:00', freq=freq) - self.assertEqual(p + timedelta(days=4, minutes=180), exp) - self.assertEqual(timedelta(days=4, minutes=180) + p, exp) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - 
offsets.Minute(), np.timedelta64(3200, 's'), - timedelta(hours=23, minutes=30)]: - with tm.assertRaises(period.IncompatibleFrequency): - p + o - - if isinstance(o, np.timedelta64): - with tm.assertRaises(TypeError): - o + p - else: - with tm.assertRaises(period.IncompatibleFrequency): - o + p - - def test_add_offset_nat(self): - # freq is DateOffset - for freq in ['A', '2A', '3A']: - p = Period('NaT', freq=freq) - for o in [offsets.YearEnd(2)]: - self.assertIs(p + o, tslib.NaT) - self.assertIs(o + p, tslib.NaT) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(365, 'D'), - timedelta(365)]: - self.assertIs(p + o, tslib.NaT) - - if isinstance(o, np.timedelta64): - with tm.assertRaises(TypeError): - o + p - else: - self.assertIs(o + p, tslib.NaT) - - for freq in ['M', '2M', '3M']: - p = Period('NaT', freq=freq) - for o in [offsets.MonthEnd(2), offsets.MonthEnd(12)]: - self.assertIs(p + o, tslib.NaT) - - if isinstance(o, np.timedelta64): - with tm.assertRaises(TypeError): - o + p - else: - self.assertIs(o + p, tslib.NaT) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(365, 'D'), - timedelta(365)]: - self.assertIs(p + o, tslib.NaT) - - if isinstance(o, np.timedelta64): - with tm.assertRaises(TypeError): - o + p - else: - self.assertIs(o + p, tslib.NaT) - - # freq is Tick - for freq in ['D', '2D', '3D']: - p = Period('NaT', freq=freq) - for o in [offsets.Day(5), offsets.Hour(24), np.timedelta64(2, 'D'), - np.timedelta64(3600 * 24, 's'), timedelta(-2), - timedelta(hours=48)]: - self.assertIs(p + o, tslib.NaT) - - if isinstance(o, np.timedelta64): - with tm.assertRaises(TypeError): - o + p - else: - self.assertIs(o + p, tslib.NaT) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(4, 'h'), - timedelta(hours=23)]: - self.assertIs(p + o, tslib.NaT) - - if isinstance(o, np.timedelta64): - with tm.assertRaises(TypeError): - o + p - else: - self.assertIs(o + p, tslib.NaT) - - for freq in ['H', '2H', '3H']: - p = Period('NaT', freq=freq) - for o in [offsets.Day(2), offsets.Hour(3), np.timedelta64(3, 'h'), - np.timedelta64(3600, 's'), timedelta(minutes=120), - timedelta(days=4, minutes=180)]: - self.assertIs(p + o, tslib.NaT) - - if not isinstance(o, np.timedelta64): - self.assertIs(o + p, tslib.NaT) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(3200, 's'), - timedelta(hours=23, minutes=30)]: - self.assertIs(p + o, tslib.NaT) - - if isinstance(o, np.timedelta64): - with tm.assertRaises(TypeError): - o + p - else: - self.assertIs(o + p, tslib.NaT) - - def test_sub_pdnat(self): - # GH 13071 - p = pd.Period('2011-01', freq='M') - self.assertIs(p - pd.NaT, pd.NaT) - self.assertIs(pd.NaT - p, pd.NaT) - - p = pd.Period('NaT', freq='M') - self.assertIs(p - pd.NaT, pd.NaT) - self.assertIs(pd.NaT - p, pd.NaT) - - def test_sub_offset(self): - # freq is DateOffset - for freq in ['A', '2A', '3A']: - p = Period('2011', freq=freq) - self.assertEqual(p - offsets.YearEnd(2), Period('2009', freq=freq)) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(365, 'D'), - timedelta(365)]: - with tm.assertRaises(period.IncompatibleFrequency): - p - o - - for freq in ['M', '2M', '3M']: - p = Period('2011-03', freq=freq) - self.assertEqual(p - offsets.MonthEnd(2), - Period('2011-01', freq=freq)) - self.assertEqual(p - offsets.MonthEnd(12), - Period('2010-03', freq=freq)) - - for o in [offsets.YearBegin(2), 
offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(365, 'D'), - timedelta(365)]: - with tm.assertRaises(period.IncompatibleFrequency): - p - o - - # freq is Tick - for freq in ['D', '2D', '3D']: - p = Period('2011-04-01', freq=freq) - self.assertEqual(p - offsets.Day(5), - Period('2011-03-27', freq=freq)) - self.assertEqual(p - offsets.Hour(24), - Period('2011-03-31', freq=freq)) - self.assertEqual(p - np.timedelta64(2, 'D'), - Period('2011-03-30', freq=freq)) - self.assertEqual(p - np.timedelta64(3600 * 24, 's'), - Period('2011-03-31', freq=freq)) - self.assertEqual(p - timedelta(-2), - Period('2011-04-03', freq=freq)) - self.assertEqual(p - timedelta(hours=48), - Period('2011-03-30', freq=freq)) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(4, 'h'), - timedelta(hours=23)]: - with tm.assertRaises(period.IncompatibleFrequency): - p - o - - for freq in ['H', '2H', '3H']: - p = Period('2011-04-01 09:00', freq=freq) - self.assertEqual(p - offsets.Day(2), - Period('2011-03-30 09:00', freq=freq)) - self.assertEqual(p - offsets.Hour(3), - Period('2011-04-01 06:00', freq=freq)) - self.assertEqual(p - np.timedelta64(3, 'h'), - Period('2011-04-01 06:00', freq=freq)) - self.assertEqual(p - np.timedelta64(3600, 's'), - Period('2011-04-01 08:00', freq=freq)) - self.assertEqual(p - timedelta(minutes=120), - Period('2011-04-01 07:00', freq=freq)) - self.assertEqual(p - timedelta(days=4, minutes=180), - Period('2011-03-28 06:00', freq=freq)) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(3200, 's'), - timedelta(hours=23, minutes=30)]: - with tm.assertRaises(period.IncompatibleFrequency): - p - o - - def test_sub_offset_nat(self): - # freq is DateOffset - for freq in ['A', '2A', '3A']: - p = Period('NaT', freq=freq) - for o in [offsets.YearEnd(2)]: - self.assertIs(p - o, tslib.NaT) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(365, 'D'), - timedelta(365)]: - self.assertIs(p - o, tslib.NaT) - - for freq in ['M', '2M', '3M']: - p = Period('NaT', freq=freq) - for o in [offsets.MonthEnd(2), offsets.MonthEnd(12)]: - self.assertIs(p - o, tslib.NaT) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(365, 'D'), - timedelta(365)]: - self.assertIs(p - o, tslib.NaT) - - # freq is Tick - for freq in ['D', '2D', '3D']: - p = Period('NaT', freq=freq) - for o in [offsets.Day(5), offsets.Hour(24), np.timedelta64(2, 'D'), - np.timedelta64(3600 * 24, 's'), timedelta(-2), - timedelta(hours=48)]: - self.assertIs(p - o, tslib.NaT) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(4, 'h'), - timedelta(hours=23)]: - self.assertIs(p - o, tslib.NaT) - - for freq in ['H', '2H', '3H']: - p = Period('NaT', freq=freq) - for o in [offsets.Day(2), offsets.Hour(3), np.timedelta64(3, 'h'), - np.timedelta64(3600, 's'), timedelta(minutes=120), - timedelta(days=4, minutes=180)]: - self.assertIs(p - o, tslib.NaT) - - for o in [offsets.YearBegin(2), offsets.MonthBegin(1), - offsets.Minute(), np.timedelta64(3200, 's'), - timedelta(hours=23, minutes=30)]: - self.assertIs(p - o, tslib.NaT) - - def test_nat_ops(self): - for freq in ['M', '2M', '3M']: - p = Period('NaT', freq=freq) - self.assertIs(p + 1, tslib.NaT) - self.assertIs(1 + p, tslib.NaT) - self.assertIs(p - 1, tslib.NaT) - self.assertIs(p - Period('2011-01', freq=freq), tslib.NaT) - self.assertIs(Period('2011-01', freq=freq) - p, tslib.NaT) - - def 
test_period_ops_offset(self): - p = Period('2011-04-01', freq='D') - result = p + offsets.Day() - exp = pd.Period('2011-04-02', freq='D') - self.assertEqual(result, exp) - - result = p - offsets.Day(2) - exp = pd.Period('2011-03-30', freq='D') - self.assertEqual(result, exp) - - msg = r"Input cannot be converted to Period\(freq=D\)" - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - p + offsets.Hour(2) - - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - p - offsets.Hour(2) - - -class TestPeriodIndexSeriesMethods(tm.TestCase): - """ Test PeriodIndex and Period Series Ops consistency """ - - def _check(self, values, func, expected): - idx = pd.PeriodIndex(values) - result = func(idx) - if isinstance(expected, pd.Index): - tm.assert_index_equal(result, expected) - else: - # comp op results in bool - tm.assert_numpy_array_equal(result, expected) - - s = pd.Series(values) - result = func(s) - - exp = pd.Series(expected, name=values.name) - tm.assert_series_equal(result, exp) - - def test_pi_ops(self): - idx = PeriodIndex(['2011-01', '2011-02', '2011-03', - '2011-04'], freq='M', name='idx') - - expected = PeriodIndex(['2011-03', '2011-04', - '2011-05', '2011-06'], freq='M', name='idx') - self._check(idx, lambda x: x + 2, expected) - self._check(idx, lambda x: 2 + x, expected) - - self._check(idx + 2, lambda x: x - 2, idx) - result = idx - Period('2011-01', freq='M') - exp = pd.Index([0, 1, 2, 3], name='idx') - tm.assert_index_equal(result, exp) - - result = Period('2011-01', freq='M') - idx - exp = pd.Index([0, -1, -2, -3], name='idx') - tm.assert_index_equal(result, exp) - - def test_pi_ops_errors(self): - idx = PeriodIndex(['2011-01', '2011-02', '2011-03', - '2011-04'], freq='M', name='idx') - s = pd.Series(idx) - - msg = r"unsupported operand type\(s\)" - - for obj in [idx, s]: - for ng in ["str", 1.5]: - with tm.assertRaisesRegexp(TypeError, msg): - obj + ng - - with tm.assertRaises(TypeError): - # error message differs between PY2 and 3 - ng + obj - - with tm.assertRaisesRegexp(TypeError, msg): - obj - ng - - with tm.assertRaises(TypeError): - np.add(obj, ng) - - if _np_version_under1p10: - self.assertIs(np.add(ng, obj), NotImplemented) - else: - with tm.assertRaises(TypeError): - np.add(ng, obj) - - with tm.assertRaises(TypeError): - np.subtract(obj, ng) - - if _np_version_under1p10: - self.assertIs(np.subtract(ng, obj), NotImplemented) - else: - with tm.assertRaises(TypeError): - np.subtract(ng, obj) - - def test_pi_ops_nat(self): - idx = PeriodIndex(['2011-01', '2011-02', 'NaT', - '2011-04'], freq='M', name='idx') - expected = PeriodIndex(['2011-03', '2011-04', - 'NaT', '2011-06'], freq='M', name='idx') - self._check(idx, lambda x: x + 2, expected) - self._check(idx, lambda x: 2 + x, expected) - self._check(idx, lambda x: np.add(x, 2), expected) - - self._check(idx + 2, lambda x: x - 2, idx) - self._check(idx + 2, lambda x: np.subtract(x, 2), idx) - - # freq with mult - idx = PeriodIndex(['2011-01', '2011-02', 'NaT', - '2011-04'], freq='2M', name='idx') - expected = PeriodIndex(['2011-07', '2011-08', - 'NaT', '2011-10'], freq='2M', name='idx') - self._check(idx, lambda x: x + 3, expected) - self._check(idx, lambda x: 3 + x, expected) - self._check(idx, lambda x: np.add(x, 3), expected) - - self._check(idx + 3, lambda x: x - 3, idx) - self._check(idx + 3, lambda x: np.subtract(x, 3), idx) - - def test_pi_ops_array_int(self): - idx = PeriodIndex(['2011-01', '2011-02', 'NaT', - '2011-04'], freq='M', name='idx') - f = lambda x: x + np.array([1, 2, 3, 4]) - 
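
The integer arithmetic exercised in these ops tests is frequency-relative: adding `n` shifts a period by `n` of its own freq units, so with a multiplied frequency such as `'2M'` each step covers the full multiple (this is exactly what the `'2011-01' + 3 -> '2011-07'` expectations above encode). A sketch:

```python
import pandas as pd

idx = pd.PeriodIndex(['2011-01', '2011-02'], freq='M')
print(idx + 2)   # PeriodIndex(['2011-03', '2011-04'], freq='M')

idx2 = pd.PeriodIndex(['2011-01', '2011-02'], freq='2M')
print(idx2 + 3)  # each unit is 2 months: ['2011-07', '2011-08']
```
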
exp = PeriodIndex(['2011-02', '2011-04', 'NaT', - '2011-08'], freq='M', name='idx') - self._check(idx, f, exp) - - f = lambda x: np.add(x, np.array([4, -1, 1, 2])) - exp = PeriodIndex(['2011-05', '2011-01', 'NaT', - '2011-06'], freq='M', name='idx') - self._check(idx, f, exp) - - f = lambda x: x - np.array([1, 2, 3, 4]) - exp = PeriodIndex(['2010-12', '2010-12', 'NaT', - '2010-12'], freq='M', name='idx') - self._check(idx, f, exp) - - f = lambda x: np.subtract(x, np.array([3, 2, 3, -2])) - exp = PeriodIndex(['2010-10', '2010-12', 'NaT', - '2011-06'], freq='M', name='idx') - self._check(idx, f, exp) - - def test_pi_ops_offset(self): - idx = PeriodIndex(['2011-01-01', '2011-02-01', '2011-03-01', - '2011-04-01'], freq='D', name='idx') - f = lambda x: x + offsets.Day() - exp = PeriodIndex(['2011-01-02', '2011-02-02', '2011-03-02', - '2011-04-02'], freq='D', name='idx') - self._check(idx, f, exp) - - f = lambda x: x + offsets.Day(2) - exp = PeriodIndex(['2011-01-03', '2011-02-03', '2011-03-03', - '2011-04-03'], freq='D', name='idx') - self._check(idx, f, exp) - - f = lambda x: x - offsets.Day(2) - exp = PeriodIndex(['2010-12-30', '2011-01-30', '2011-02-27', - '2011-03-30'], freq='D', name='idx') - self._check(idx, f, exp) - - def test_pi_offset_errors(self): - idx = PeriodIndex(['2011-01-01', '2011-02-01', '2011-03-01', - '2011-04-01'], freq='D', name='idx') - s = pd.Series(idx) - - # Series op is applied per Period instance, thus error is raised - # from Period - msg_idx = r"Input has different freq from PeriodIndex\(freq=D\)" - msg_s = r"Input cannot be converted to Period\(freq=D\)" - for obj, msg in [(idx, msg_idx), (s, msg_s)]: - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - obj + offsets.Hour(2) - - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - offsets.Hour(2) + obj - - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - obj - offsets.Hour(2) - - def test_pi_sub_period(self): - # GH 13071 - idx = PeriodIndex(['2011-01', '2011-02', '2011-03', - '2011-04'], freq='M', name='idx') - - result = idx - pd.Period('2012-01', freq='M') - exp = pd.Index([-12, -11, -10, -9], name='idx') - tm.assert_index_equal(result, exp) - - result = np.subtract(idx, pd.Period('2012-01', freq='M')) - tm.assert_index_equal(result, exp) - - result = pd.Period('2012-01', freq='M') - idx - exp = pd.Index([12, 11, 10, 9], name='idx') - tm.assert_index_equal(result, exp) - - result = np.subtract(pd.Period('2012-01', freq='M'), idx) - if _np_version_under1p10: - self.assertIs(result, NotImplemented) - else: - tm.assert_index_equal(result, exp) - - exp = pd.TimedeltaIndex([np.nan, np.nan, np.nan, np.nan], name='idx') - tm.assert_index_equal(idx - pd.Period('NaT', freq='M'), exp) - tm.assert_index_equal(pd.Period('NaT', freq='M') - idx, exp) - - def test_pi_sub_pdnat(self): - # GH 13071 - idx = PeriodIndex(['2011-01', '2011-02', 'NaT', - '2011-04'], freq='M', name='idx') - exp = pd.TimedeltaIndex([pd.NaT] * 4, name='idx') - tm.assert_index_equal(pd.NaT - idx, exp) - tm.assert_index_equal(idx - pd.NaT, exp) - - def test_pi_sub_period_nat(self): - # GH 13071 - idx = PeriodIndex(['2011-01', 'NaT', '2011-03', - '2011-04'], freq='M', name='idx') - - result = idx - pd.Period('2012-01', freq='M') - exp = pd.Index([-12, np.nan, -10, -9], name='idx') - tm.assert_index_equal(result, exp) - - result = pd.Period('2012-01', freq='M') - idx - exp = pd.Index([12, np.nan, 10, 9], name='idx') - tm.assert_index_equal(result, exp) - - exp = pd.TimedeltaIndex([np.nan, np.nan, np.nan, np.nan], 
name='idx') - tm.assert_index_equal(idx - pd.Period('NaT', freq='M'), exp) - tm.assert_index_equal(pd.Period('NaT', freq='M') - idx, exp) - - def test_pi_comp_period(self): - idx = PeriodIndex(['2011-01', '2011-02', '2011-03', - '2011-04'], freq='M', name='idx') - - f = lambda x: x == pd.Period('2011-03', freq='M') - exp = np.array([False, False, True, False], dtype=np.bool) - self._check(idx, f, exp) - f = lambda x: pd.Period('2011-03', freq='M') == x - self._check(idx, f, exp) - - f = lambda x: x != pd.Period('2011-03', freq='M') - exp = np.array([True, True, False, True], dtype=np.bool) - self._check(idx, f, exp) - f = lambda x: pd.Period('2011-03', freq='M') != x - self._check(idx, f, exp) - - f = lambda x: pd.Period('2011-03', freq='M') >= x - exp = np.array([True, True, True, False], dtype=np.bool) - self._check(idx, f, exp) - - f = lambda x: x > pd.Period('2011-03', freq='M') - exp = np.array([False, False, False, True], dtype=np.bool) - self._check(idx, f, exp) - - f = lambda x: pd.Period('2011-03', freq='M') >= x - exp = np.array([True, True, True, False], dtype=np.bool) - self._check(idx, f, exp) - - def test_pi_comp_period_nat(self): - idx = PeriodIndex(['2011-01', 'NaT', '2011-03', - '2011-04'], freq='M', name='idx') - - f = lambda x: x == pd.Period('2011-03', freq='M') - exp = np.array([False, False, True, False], dtype=np.bool) - self._check(idx, f, exp) - f = lambda x: pd.Period('2011-03', freq='M') == x - self._check(idx, f, exp) - - f = lambda x: x == tslib.NaT - exp = np.array([False, False, False, False], dtype=np.bool) - self._check(idx, f, exp) - f = lambda x: tslib.NaT == x - self._check(idx, f, exp) - - f = lambda x: x != pd.Period('2011-03', freq='M') - exp = np.array([True, True, False, True], dtype=np.bool) - self._check(idx, f, exp) - f = lambda x: pd.Period('2011-03', freq='M') != x - self._check(idx, f, exp) - - f = lambda x: x != tslib.NaT - exp = np.array([True, True, True, True], dtype=np.bool) - self._check(idx, f, exp) - f = lambda x: tslib.NaT != x - self._check(idx, f, exp) - - f = lambda x: pd.Period('2011-03', freq='M') >= x - exp = np.array([True, False, True, False], dtype=np.bool) - self._check(idx, f, exp) - - f = lambda x: x < pd.Period('2011-03', freq='M') - exp = np.array([True, False, False, False], dtype=np.bool) - self._check(idx, f, exp) - - f = lambda x: x > tslib.NaT - exp = np.array([False, False, False, False], dtype=np.bool) - self._check(idx, f, exp) - - f = lambda x: tslib.NaT >= x - exp = np.array([False, False, False, False], dtype=np.bool) - self._check(idx, f, exp) - - -class TestPeriodRepresentation(tm.TestCase): - """ - Wish to match NumPy units - """ - - def test_annual(self): - self._check_freq('A', 1970) - - def test_monthly(self): - self._check_freq('M', '1970-01') - - def test_weekly(self): - self._check_freq('W-THU', '1970-01-01') - - def test_daily(self): - self._check_freq('D', '1970-01-01') - - def test_business_daily(self): - self._check_freq('B', '1970-01-01') - - def test_hourly(self): - self._check_freq('H', '1970-01-01') - - def test_minutely(self): - self._check_freq('T', '1970-01-01') - - def test_secondly(self): - self._check_freq('S', '1970-01-01') - - def test_millisecondly(self): - self._check_freq('L', '1970-01-01') - - def test_microsecondly(self): - self._check_freq('U', '1970-01-01') - - def test_nanosecondly(self): - self._check_freq('N', '1970-01-01') - - def _check_freq(self, freq, base_date): - rng = PeriodIndex(start=base_date, periods=10, freq=freq) - exp = np.arange(10, dtype=np.int64) - 
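- # anchored at the 1970 epoch, the ordinals are plain int64 values, so both _values and asi8 should be exactly 0..9 here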
self.assert_numpy_array_equal(rng._values, exp) - self.assert_numpy_array_equal(rng.asi8, exp) - - def test_negone_ordinals(self): - freqs = ['A', 'M', 'Q', 'D', 'H', 'T', 'S'] - - period = Period(ordinal=-1, freq='D') - for freq in freqs: - repr(period.asfreq(freq)) - - for freq in freqs: - period = Period(ordinal=-1, freq=freq) - repr(period) - self.assertEqual(period.year, 1969) - - period = Period(ordinal=-1, freq='B') - repr(period) - period = Period(ordinal=-1, freq='W') - repr(period) - - -class TestComparisons(tm.TestCase): - def setUp(self): - self.january1 = Period('2000-01', 'M') - self.january2 = Period('2000-01', 'M') - self.february = Period('2000-02', 'M') - self.march = Period('2000-03', 'M') - self.day = Period('2012-01-01', 'D') - - def test_equal(self): - self.assertEqual(self.january1, self.january2) - - def test_equal_Raises_Value(self): - with tm.assertRaises(period.IncompatibleFrequency): - self.january1 == self.day - - def test_notEqual(self): - self.assertNotEqual(self.january1, 1) - self.assertNotEqual(self.january1, self.february) - - def test_greater(self): - self.assertTrue(self.february > self.january1) - - def test_greater_Raises_Value(self): - with tm.assertRaises(period.IncompatibleFrequency): - self.january1 > self.day - - def test_greater_Raises_Type(self): - with tm.assertRaises(TypeError): - self.january1 > 1 - - def test_greaterEqual(self): - self.assertTrue(self.january1 >= self.january2) - - def test_greaterEqual_Raises_Value(self): - with tm.assertRaises(period.IncompatibleFrequency): - self.january1 >= self.day - - with tm.assertRaises(TypeError): - print(self.january1 >= 1) - - def test_smallerEqual(self): - self.assertTrue(self.january1 <= self.january2) - - def test_smallerEqual_Raises_Value(self): - with tm.assertRaises(period.IncompatibleFrequency): - self.january1 <= self.day - - def test_smallerEqual_Raises_Type(self): - with tm.assertRaises(TypeError): - self.january1 <= 1 - - def test_smaller(self): - self.assertTrue(self.january1 < self.february) - - def test_smaller_Raises_Value(self): - with tm.assertRaises(period.IncompatibleFrequency): - self.january1 < self.day - - def test_smaller_Raises_Type(self): - with tm.assertRaises(TypeError): - self.january1 < 1 - - def test_sort(self): - periods = [self.march, self.january1, self.february] - correctPeriods = [self.january1, self.february, self.march] - self.assertEqual(sorted(periods), correctPeriods) - - def test_period_nat_comp(self): - p_nat = Period('NaT', freq='D') - p = Period('2011-01-01', freq='D') - - nat = pd.Timestamp('NaT') - t = pd.Timestamp('2011-01-01') - # confirm Period('NaT') work identical with Timestamp('NaT') - for left, right in [(p_nat, p), (p, p_nat), (p_nat, p_nat), (nat, t), - (t, nat), (nat, nat)]: - self.assertEqual(left < right, False) - self.assertEqual(left > right, False) - self.assertEqual(left == right, False) - self.assertEqual(left != right, True) - self.assertEqual(left <= right, False) - self.assertEqual(left >= right, False) - - def test_pi_pi_comp(self): - - for freq in ['M', '2M', '3M']: - base = PeriodIndex(['2011-01', '2011-02', - '2011-03', '2011-04'], freq=freq) - p = Period('2011-02', freq=freq) - - exp = np.array([False, True, False, False]) - self.assert_numpy_array_equal(base == p, exp) - self.assert_numpy_array_equal(p == base, exp) - - exp = np.array([True, False, True, True]) - self.assert_numpy_array_equal(base != p, exp) - self.assert_numpy_array_equal(p != base, exp) - - exp = np.array([False, False, True, True]) - 
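- # comparisons against a scalar Period broadcast elementwise over the index, in either operand order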
self.assert_numpy_array_equal(base > p, exp) - self.assert_numpy_array_equal(p < base, exp) - - exp = np.array([True, False, False, False]) - self.assert_numpy_array_equal(base < p, exp) - self.assert_numpy_array_equal(p > base, exp) - - exp = np.array([False, True, True, True]) - self.assert_numpy_array_equal(base >= p, exp) - self.assert_numpy_array_equal(p <= base, exp) - - exp = np.array([True, True, False, False]) - self.assert_numpy_array_equal(base <= p, exp) - self.assert_numpy_array_equal(p >= base, exp) - - idx = PeriodIndex(['2011-02', '2011-01', '2011-03', - '2011-05'], freq=freq) - - exp = np.array([False, False, True, False]) - self.assert_numpy_array_equal(base == idx, exp) - - exp = np.array([True, True, False, True]) - self.assert_numpy_array_equal(base != idx, exp) - - exp = np.array([False, True, False, False]) - self.assert_numpy_array_equal(base > idx, exp) - - exp = np.array([True, False, False, True]) - self.assert_numpy_array_equal(base < idx, exp) - - exp = np.array([False, True, True, False]) - self.assert_numpy_array_equal(base >= idx, exp) - - exp = np.array([True, False, True, True]) - self.assert_numpy_array_equal(base <= idx, exp) - - # different base freq - msg = "Input has different freq=A-DEC from PeriodIndex" - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - base <= Period('2011', freq='A') - - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - Period('2011', freq='A') >= base - - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - idx = PeriodIndex(['2011', '2012', '2013', '2014'], freq='A') - base <= idx - - # different mult - msg = "Input has different freq=4M from PeriodIndex" - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - base <= Period('2011', freq='4M') - - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - Period('2011', freq='4M') >= base - - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - idx = PeriodIndex(['2011', '2012', '2013', '2014'], freq='4M') - base <= idx - - def test_pi_nat_comp(self): - for freq in ['M', '2M', '3M']: - idx1 = PeriodIndex( - ['2011-01', '2011-02', 'NaT', '2011-05'], freq=freq) - - result = idx1 > Period('2011-02', freq=freq) - exp = np.array([False, False, False, True]) - self.assert_numpy_array_equal(result, exp) - result = Period('2011-02', freq=freq) < idx1 - self.assert_numpy_array_equal(result, exp) - - result = idx1 == Period('NaT', freq=freq) - exp = np.array([False, False, False, False]) - self.assert_numpy_array_equal(result, exp) - result = Period('NaT', freq=freq) == idx1 - self.assert_numpy_array_equal(result, exp) - - result = idx1 != Period('NaT', freq=freq) - exp = np.array([True, True, True, True]) - self.assert_numpy_array_equal(result, exp) - result = Period('NaT', freq=freq) != idx1 - self.assert_numpy_array_equal(result, exp) - - idx2 = PeriodIndex(['2011-02', '2011-01', '2011-04', - 'NaT'], freq=freq) - result = idx1 < idx2 - exp = np.array([True, False, False, False]) - self.assert_numpy_array_equal(result, exp) - - result = idx1 == idx2 - exp = np.array([False, False, False, False]) - self.assert_numpy_array_equal(result, exp) - - result = idx1 != idx2 - exp = np.array([True, True, True, True]) - self.assert_numpy_array_equal(result, exp) - - result = idx1 == idx1 - exp = np.array([True, True, False, True]) - self.assert_numpy_array_equal(result, exp) - - result = idx1 != idx1 - exp = np.array([False, False, True, False]) - self.assert_numpy_array_equal(result, exp) - - diff = PeriodIndex(['2011-02', 
'2011-01', '2011-04', - 'NaT'], freq='4M') - msg = "Input has different freq=4M from PeriodIndex" - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - idx1 > diff - - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - idx1 == diff - - -class TestSeriesPeriod(tm.TestCase): - - def setUp(self): - self.series = Series(period_range('2000-01-01', periods=10, freq='D')) - - def test_auto_conversion(self): - series = Series(list(period_range('2000-01-01', periods=10, freq='D'))) - self.assertEqual(series.dtype, 'object') - - series = pd.Series([pd.Period('2011-01-01', freq='D'), - pd.Period('2011-02-01', freq='D')]) - self.assertEqual(series.dtype, 'object') - - def test_getitem(self): - self.assertEqual(self.series[1], pd.Period('2000-01-02', freq='D')) - - result = self.series[[2, 4]] - exp = pd.Series([pd.Period('2000-01-03', freq='D'), - pd.Period('2000-01-05', freq='D')], - index=[2, 4]) - self.assert_series_equal(result, exp) - self.assertEqual(result.dtype, 'object') - - def test_constructor_cant_cast_period(self): - with tm.assertRaises(TypeError): - Series(period_range('2000-01-01', periods=10, freq='D'), - dtype=float) - - def test_constructor_cast_object(self): - s = Series(period_range('1/1/2000', periods=10), dtype=object) - exp = Series(period_range('1/1/2000', periods=10)) - tm.assert_series_equal(s, exp) - - def test_isnull(self): - # GH 13737 - s = Series([pd.Period('2011-01', freq='M'), - pd.Period('NaT', freq='M')]) - tm.assert_series_equal(s.isnull(), Series([False, True])) - tm.assert_series_equal(s.notnull(), Series([True, False])) - - def test_fillna(self): - # GH 13737 - s = Series([pd.Period('2011-01', freq='M'), - pd.Period('NaT', freq='M')]) - - res = s.fillna(pd.Period('2012-01', freq='M')) - exp = Series([pd.Period('2011-01', freq='M'), - pd.Period('2012-01', freq='M')]) - tm.assert_series_equal(res, exp) - self.assertEqual(res.dtype, 'object') - - res = s.fillna('XXX') - exp = Series([pd.Period('2011-01', freq='M'), 'XXX']) - tm.assert_series_equal(res, exp) - self.assertEqual(res.dtype, 'object') - - def test_dropna(self): - # GH 13737 - s = Series([pd.Period('2011-01', freq='M'), - pd.Period('NaT', freq='M')]) - tm.assert_series_equal(s.dropna(), - Series([pd.Period('2011-01', freq='M')])) - - def test_series_comparison_scalars(self): - val = pd.Period('2000-01-04', freq='D') - result = self.series > val - expected = pd.Series([x > val for x in self.series]) - tm.assert_series_equal(result, expected) - - val = self.series[5] - result = self.series > val - expected = pd.Series([x > val for x in self.series]) - tm.assert_series_equal(result, expected) - - def test_between(self): - left, right = self.series[[2, 7]] - result = self.series.between(left, right) - expected = (self.series >= left) & (self.series <= right) - tm.assert_series_equal(result, expected) - - # --------------------------------------------------------------------- - # NaT support - - """ - # ToDo: enable when the period dtype is supported - def test_NaT_scalar(self): - series = Series([0, 1000, 2000, iNaT], dtype='period[D]') - - val = series[3] - self.assertTrue(isnull(val)) - - series[2] = val - self.assertTrue(isnull(series[2])) - - def test_NaT_cast(self): - result = Series([np.nan]).astype('period[D]') - expected = Series([NaT]) - tm.assert_series_equal(result, expected) - """ - - def test_set_none_nan(self): - # currently Period is stored as object dtype, not as NaT - self.series[3] = None - self.assertIs(self.series[3], None) - - self.series[3:5] = None - 
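- # slice assignment likewise stores None as-is, since the underlying dtype is object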
self.assertIs(self.series[4], None) - - self.series[5] = np.nan - self.assertTrue(np.isnan(self.series[5])) - - self.series[5:7] = np.nan - self.assertTrue(np.isnan(self.series[6])) - - def test_intercept_astype_object(self): - expected = self.series.astype('object') - - df = DataFrame({'a': self.series, - 'b': np.random.randn(len(self.series))}) - - result = df.values.squeeze() - self.assertTrue((result[:, 0] == expected.values).all()) - - df = DataFrame({'a': self.series, 'b': ['foo'] * len(self.series)}) - - result = df.values.squeeze() - self.assertTrue((result[:, 0] == expected.values).all()) - - def test_ops_series_timedelta(self): - # GH 13043 - s = pd.Series([pd.Period('2015-01-01', freq='D'), - pd.Period('2015-01-02', freq='D')], name='xxx') - self.assertEqual(s.dtype, object) - - exp = pd.Series([pd.Period('2015-01-02', freq='D'), - pd.Period('2015-01-03', freq='D')], name='xxx') - tm.assert_series_equal(s + pd.Timedelta('1 days'), exp) - tm.assert_series_equal(pd.Timedelta('1 days') + s, exp) - - tm.assert_series_equal(s + pd.tseries.offsets.Day(), exp) - tm.assert_series_equal(pd.tseries.offsets.Day() + s, exp) - - def test_ops_series_period(self): - # GH 13043 - s = pd.Series([pd.Period('2015-01-01', freq='D'), - pd.Period('2015-01-02', freq='D')], name='xxx') - self.assertEqual(s.dtype, object) - - p = pd.Period('2015-01-10', freq='D') - # dtype will be object because of original dtype - exp = pd.Series([9, 8], name='xxx', dtype=object) - tm.assert_series_equal(p - s, exp) - tm.assert_series_equal(s - p, -exp) - - s2 = pd.Series([pd.Period('2015-01-05', freq='D'), - pd.Period('2015-01-04', freq='D')], name='xxx') - self.assertEqual(s2.dtype, object) - - exp = pd.Series([4, 2], name='xxx', dtype=object) - tm.assert_series_equal(s2 - s, exp) - tm.assert_series_equal(s - s2, -exp) - - def test_comp_series_period_scalar(self): - # GH 13200 - for freq in ['M', '2M', '3M']: - base = Series([Period(x, freq=freq) for x in - ['2011-01', '2011-02', '2011-03', '2011-04']]) - p = Period('2011-02', freq=freq) - - exp = pd.Series([False, True, False, False]) - tm.assert_series_equal(base == p, exp) - tm.assert_series_equal(p == base, exp) - - exp = pd.Series([True, False, True, True]) - tm.assert_series_equal(base != p, exp) - tm.assert_series_equal(p != base, exp) - - exp = pd.Series([False, False, True, True]) - tm.assert_series_equal(base > p, exp) - tm.assert_series_equal(p < base, exp) - - exp = pd.Series([True, False, False, False]) - tm.assert_series_equal(base < p, exp) - tm.assert_series_equal(p > base, exp) - - exp = pd.Series([False, True, True, True]) - tm.assert_series_equal(base >= p, exp) - tm.assert_series_equal(p <= base, exp) - - exp = pd.Series([True, True, False, False]) - tm.assert_series_equal(base <= p, exp) - tm.assert_series_equal(p >= base, exp) - - # different base freq - msg = "Input has different freq=A-DEC from Period" - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - base <= Period('2011', freq='A') - - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - Period('2011', freq='A') >= base - - def test_comp_series_period_series(self): - # GH 13200 - for freq in ['M', '2M', '3M']: - base = Series([Period(x, freq=freq) for x in - ['2011-01', '2011-02', '2011-03', '2011-04']]) - - s = Series([Period(x, freq=freq) for x in - ['2011-02', '2011-01', '2011-03', '2011-05']]) - - exp = Series([False, False, True, False]) - tm.assert_series_equal(base == s, exp) - - exp = Series([True, True, False, True]) - tm.assert_series_equal(base != s, 
exp) - - exp = Series([False, True, False, False]) - tm.assert_series_equal(base > s, exp) - - exp = Series([True, False, False, True]) - tm.assert_series_equal(base < s, exp) - - exp = Series([False, True, True, False]) - tm.assert_series_equal(base >= s, exp) - - exp = Series([True, False, True, True]) - tm.assert_series_equal(base <= s, exp) - - s2 = Series([Period(x, freq='A') for x in - ['2011', '2011', '2011', '2011']]) - - # different base freq - msg = "Input has different freq=A-DEC from Period" - with tm.assertRaisesRegexp(period.IncompatibleFrequency, msg): - base <= s2 - - def test_comp_series_period_object(self): - # GH 13200 - base = Series([Period('2011', freq='A'), Period('2011-02', freq='M'), - Period('2013', freq='A'), Period('2011-04', freq='M')]) - - s = Series([Period('2012', freq='A'), Period('2011-01', freq='M'), - Period('2013', freq='A'), Period('2011-05', freq='M')]) - - exp = Series([False, False, True, False]) - tm.assert_series_equal(base == s, exp) - - exp = Series([True, True, False, True]) - tm.assert_series_equal(base != s, exp) - - exp = Series([False, True, False, False]) - tm.assert_series_equal(base > s, exp) - - exp = Series([True, False, False, True]) - tm.assert_series_equal(base < s, exp) - - exp = Series([False, True, True, False]) - tm.assert_series_equal(base >= s, exp) - - exp = Series([True, False, True, True]) - tm.assert_series_equal(base <= s, exp) - - def test_ops_frame_period(self): - # GH 13043 - df = pd.DataFrame({'A': [pd.Period('2015-01', freq='M'), - pd.Period('2015-02', freq='M')], - 'B': [pd.Period('2014-01', freq='M'), - pd.Period('2014-02', freq='M')]}) - self.assertEqual(df['A'].dtype, object) - self.assertEqual(df['B'].dtype, object) - - p = pd.Period('2015-03', freq='M') - # dtype will be object because of original dtype - exp = pd.DataFrame({'A': np.array([2, 1], dtype=object), - 'B': np.array([14, 13], dtype=object)}) - tm.assert_frame_equal(p - df, exp) - tm.assert_frame_equal(df - p, -exp) - - df2 = pd.DataFrame({'A': [pd.Period('2015-05', freq='M'), - pd.Period('2015-06', freq='M')], - 'B': [pd.Period('2015-05', freq='M'), - pd.Period('2015-06', freq='M')]}) - self.assertEqual(df2['A'].dtype, object) - self.assertEqual(df2['B'].dtype, object) - - exp = pd.DataFrame({'A': np.array([4, 4], dtype=object), - 'B': np.array([16, 16], dtype=object)}) - tm.assert_frame_equal(df2 - df, exp) - tm.assert_frame_equal(df - df2, -exp) - - -class TestPeriodField(tm.TestCase): - def test_get_period_field_raises_on_out_of_range(self): - self.assertRaises(ValueError, _period.get_period_field, -1, 0, 0) - - def test_get_period_field_array_raises_on_out_of_range(self): - self.assertRaises(ValueError, _period.get_period_field_arr, -1, - np.empty(1), 0) - - -if __name__ == '__main__': - import nose - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tseries/tests/test_timedeltas.py b/pandas/tseries/tests/test_timedeltas.py deleted file mode 100644 index f0d14014d6559..0000000000000 --- a/pandas/tseries/tests/test_timedeltas.py +++ /dev/null @@ -1,1969 +0,0 @@ -# pylint: disable-msg=E1101,W0612 - -from __future__ import division -from datetime import timedelta, time -import nose - -from distutils.version import LooseVersion -import numpy as np -import pandas as pd - -from pandas import (Index, Series, DataFrame, Timestamp, Timedelta, - TimedeltaIndex, isnull, date_range, - timedelta_range, Int64Index) -from pandas.compat import range -from pandas import compat, to_timedelta, tslib -from 
pandas.tseries.timedeltas import _coerce_scalar_to_timedelta_type as ct -from pandas.util.testing import (assert_series_equal, assert_frame_equal, - assert_almost_equal, assert_index_equal) -from pandas.tseries.offsets import Day, Second -import pandas.util.testing as tm -from numpy.random import randn -from pandas import _np_version_under1p8 - -iNaT = tslib.iNaT - - -class TestTimedeltas(tm.TestCase): - _multiprocess_can_split_ = True - - def setUp(self): - pass - - def test_get_loc_nat(self): - tidx = TimedeltaIndex(['1 days 01:00:00', 'NaT', '2 days 01:00:00']) - - self.assertEqual(tidx.get_loc(pd.NaT), 1) - self.assertEqual(tidx.get_loc(None), 1) - self.assertEqual(tidx.get_loc(float('nan')), 1) - self.assertEqual(tidx.get_loc(np.nan), 1) - - def test_contains(self): - # Checking for any NaT-like objects - # GH 13603 - td = to_timedelta(range(5), unit='d') + pd.offsets.Hour(1) - for v in [pd.NaT, None, float('nan'), np.nan]: - self.assertFalse((v in td)) - - td = to_timedelta([pd.NaT]) - for v in [pd.NaT, None, float('nan'), np.nan]: - self.assertTrue((v in td)) - - def test_construction(self): - - expected = np.timedelta64(10, 'D').astype('m8[ns]').view('i8') - self.assertEqual(Timedelta(10, unit='d').value, expected) - self.assertEqual(Timedelta(10.0, unit='d').value, expected) - self.assertEqual(Timedelta('10 days').value, expected) - self.assertEqual(Timedelta(days=10).value, expected) - self.assertEqual(Timedelta(days=10.0).value, expected) - - expected += np.timedelta64(10, 's').astype('m8[ns]').view('i8') - self.assertEqual(Timedelta('10 days 00:00:10').value, expected) - self.assertEqual(Timedelta(days=10, seconds=10).value, expected) - self.assertEqual( - Timedelta(days=10, milliseconds=10 * 1000).value, expected) - self.assertEqual( - Timedelta(days=10, microseconds=10 * 1000 * 1000).value, expected) - - # test construction with np dtypes - # GH 8757 - timedelta_kwargs = {'days': 'D', - 'seconds': 's', - 'microseconds': 'us', - 'milliseconds': 'ms', - 'minutes': 'm', - 'hours': 'h', - 'weeks': 'W'} - npdtypes = [np.int64, np.int32, np.int16, np.float64, np.float32, - np.float16] - for npdtype in npdtypes: - for pykwarg, npkwarg in timedelta_kwargs.items(): - expected = np.timedelta64(1, - npkwarg).astype('m8[ns]').view('i8') - self.assertEqual( - Timedelta(**{pykwarg: npdtype(1)}).value, expected) - - # rounding cases - self.assertEqual(Timedelta(82739999850000).value, 82739999850000) - self.assertTrue('0 days 22:58:59.999850' in str(Timedelta( - 82739999850000))) - self.assertEqual(Timedelta(123072001000000).value, 123072001000000) - self.assertTrue('1 days 10:11:12.001' in str(Timedelta( - 123072001000000))) - - # string conversion with/without leading zero - # GH 9570 - self.assertEqual(Timedelta('0:00:00'), timedelta(hours=0)) - self.assertEqual(Timedelta('00:00:00'), timedelta(hours=0)) - self.assertEqual(Timedelta('-1:00:00'), -timedelta(hours=1)) - self.assertEqual(Timedelta('-01:00:00'), -timedelta(hours=1)) - - # more strings & abbrevs - # GH 8190 - self.assertEqual(Timedelta('1 h'), timedelta(hours=1)) - self.assertEqual(Timedelta('1 hour'), timedelta(hours=1)) - self.assertEqual(Timedelta('1 hr'), timedelta(hours=1)) - self.assertEqual(Timedelta('1 hours'), timedelta(hours=1)) - self.assertEqual(Timedelta('-1 hours'), -timedelta(hours=1)) - self.assertEqual(Timedelta('1 m'), timedelta(minutes=1)) - self.assertEqual(Timedelta('1.5 m'), timedelta(seconds=90)) - self.assertEqual(Timedelta('1 minute'), timedelta(minutes=1)) - self.assertEqual(Timedelta('1 minutes'), 
timedelta(minutes=1)) - self.assertEqual(Timedelta('1 s'), timedelta(seconds=1)) - self.assertEqual(Timedelta('1 second'), timedelta(seconds=1)) - self.assertEqual(Timedelta('1 seconds'), timedelta(seconds=1)) - self.assertEqual(Timedelta('1 ms'), timedelta(milliseconds=1)) - self.assertEqual(Timedelta('1 milli'), timedelta(milliseconds=1)) - self.assertEqual(Timedelta('1 millisecond'), timedelta(milliseconds=1)) - self.assertEqual(Timedelta('1 us'), timedelta(microseconds=1)) - self.assertEqual(Timedelta('1 micros'), timedelta(microseconds=1)) - self.assertEqual(Timedelta('1 microsecond'), timedelta(microseconds=1)) - self.assertEqual(Timedelta('1.5 microsecond'), - Timedelta('00:00:00.000001500')) - self.assertEqual(Timedelta('1 ns'), Timedelta('00:00:00.000000001')) - self.assertEqual(Timedelta('1 nano'), Timedelta('00:00:00.000000001')) - self.assertEqual(Timedelta('1 nanosecond'), - Timedelta('00:00:00.000000001')) - - # combos - self.assertEqual(Timedelta('10 days 1 hour'), - timedelta(days=10, hours=1)) - self.assertEqual(Timedelta('10 days 1 h'), timedelta(days=10, hours=1)) - self.assertEqual(Timedelta('10 days 1 h 1m 1s'), timedelta( - days=10, hours=1, minutes=1, seconds=1)) - self.assertEqual(Timedelta('-10 days 1 h 1m 1s'), - - timedelta(days=10, hours=1, minutes=1, seconds=1)) - self.assertEqual(Timedelta('-10 days 1 h 1m 1s'), - - timedelta(days=10, hours=1, minutes=1, seconds=1)) - self.assertEqual(Timedelta('-10 days 1 h 1m 1s 3us'), - - timedelta(days=10, hours=1, minutes=1, - seconds=1, microseconds=3)) - self.assertEqual(Timedelta('-10 days 1 h 1.5m 1s 3us'), - - timedelta(days=10, hours=1, minutes=1, - seconds=31, microseconds=3)) - - # currently invalid as it has a - on the hhmmdd part (only allowed on - # the days) - self.assertRaises(ValueError, - lambda: Timedelta('-10 days -1 h 1.5m 1s 3us')) - - # only leading neg signs are allowed - self.assertRaises(ValueError, - lambda: Timedelta('10 days -1 h 1.5m 1s 3us')) - - # no units specified - self.assertRaises(ValueError, lambda: Timedelta('3.1415')) - - # invalid construction - tm.assertRaisesRegexp(ValueError, "cannot construct a Timedelta", - lambda: Timedelta()) - tm.assertRaisesRegexp(ValueError, "unit abbreviation w/o a number", - lambda: Timedelta('foo')) - tm.assertRaisesRegexp(ValueError, - "cannot construct a Timedelta from the passed " - "arguments, allowed keywords are ", - lambda: Timedelta(day=10)) - - # roundtripping both for string and value - for v in ['1s', '-1s', '1us', '-1us', '1 day', '-1 day', - '-23:59:59.999999', '-1 days +23:59:59.999999', '-1ns', - '1ns', '-23:59:59.999999999']: - - td = Timedelta(v) - self.assertEqual(Timedelta(td.value), td) - - # str does not normally display nanos - if not td.nanoseconds: - self.assertEqual(Timedelta(str(td)), td) - self.assertEqual(Timedelta(td._repr_base(format='all')), td) - - # floats - expected = np.timedelta64( - 10, 's').astype('m8[ns]').view('i8') + np.timedelta64( - 500, 'ms').astype('m8[ns]').view('i8') - self.assertEqual(Timedelta(10.5, unit='s').value, expected) - - # nat - self.assertEqual(Timedelta('').value, iNaT) - self.assertEqual(Timedelta('nat').value, iNaT) - self.assertEqual(Timedelta('NAT').value, iNaT) - self.assertEqual(Timedelta(None).value, iNaT) - self.assertEqual(Timedelta(np.nan).value, iNaT) - self.assertTrue(isnull(Timedelta('nat'))) - - # offset - self.assertEqual(to_timedelta(pd.offsets.Hour(2)), - Timedelta('0 days, 02:00:00')) - self.assertEqual(Timedelta(pd.offsets.Hour(2)), - Timedelta('0 days, 02:00:00')) - 
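- # fixed-span ("Tick") offsets such as Hour and Second have an exact nanosecond duration, so they convert losslessly to Timedelta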
self.assertEqual(Timedelta(pd.offsets.Second(2)), - Timedelta('0 days, 00:00:02')) - - # unicode - # GH 11995 - expected = Timedelta('1H') - result = pd.Timedelta(u'1H') - self.assertEqual(result, expected) - self.assertEqual(to_timedelta(pd.offsets.Hour(2)), - Timedelta(u'0 days, 02:00:00')) - - self.assertRaises(ValueError, lambda: Timedelta(u'foo bar')) - - def test_round(self): - - t1 = Timedelta('1 days 02:34:56.789123456') - t2 = Timedelta('-1 days 02:34:56.789123456') - - for (freq, s1, s2) in [('N', t1, t2), - ('U', Timedelta('1 days 02:34:56.789123000'), - Timedelta('-1 days 02:34:56.789123000')), - ('L', Timedelta('1 days 02:34:56.789000000'), - Timedelta('-1 days 02:34:56.789000000')), - ('S', Timedelta('1 days 02:34:57'), - Timedelta('-1 days 02:34:57')), - ('2S', Timedelta('1 days 02:34:56'), - Timedelta('-1 days 02:34:56')), - ('5S', Timedelta('1 days 02:34:55'), - Timedelta('-1 days 02:34:55')), - ('T', Timedelta('1 days 02:35:00'), - Timedelta('-1 days 02:35:00')), - ('12T', Timedelta('1 days 02:36:00'), - Timedelta('-1 days 02:36:00')), - ('H', Timedelta('1 days 03:00:00'), - Timedelta('-1 days 03:00:00')), - ('d', Timedelta('1 days'), - Timedelta('-1 days'))]: - r1 = t1.round(freq) - self.assertEqual(r1, s1) - r2 = t2.round(freq) - self.assertEqual(r2, s2) - - # invalid - for freq in ['Y', 'M', 'foobar']: - self.assertRaises(ValueError, lambda: t1.round(freq)) - - t1 = timedelta_range('1 days', periods=3, freq='1 min 2 s 3 us') - t2 = -1 * t1 - t1a = timedelta_range('1 days', periods=3, freq='1 min 2 s') - t1c = pd.TimedeltaIndex([1, 1, 1], unit='D') - - # note that negative times round DOWN! so don't give whole numbers - for (freq, s1, s2) in [('N', t1, t2), - ('U', t1, t2), - ('L', t1a, - TimedeltaIndex(['-1 days +00:00:00', - '-2 days +23:58:58', - '-2 days +23:57:56'], - dtype='timedelta64[ns]', - freq=None) - ), - ('S', t1a, - TimedeltaIndex(['-1 days +00:00:00', - '-2 days +23:58:58', - '-2 days +23:57:56'], - dtype='timedelta64[ns]', - freq=None) - ), - ('12T', t1c, - TimedeltaIndex(['-1 days', - '-1 days', - '-1 days'], - dtype='timedelta64[ns]', - freq=None) - ), - ('H', t1c, - TimedeltaIndex(['-1 days', - '-1 days', - '-1 days'], - dtype='timedelta64[ns]', - freq=None) - ), - ('d', t1c, - pd.TimedeltaIndex([-1, -1, -1], unit='D') - )]: - - r1 = t1.round(freq) - tm.assert_index_equal(r1, s1) - r2 = t2.round(freq) - tm.assert_index_equal(r2, s2) - - # invalid - for freq in ['Y', 'M', 'foobar']: - self.assertRaises(ValueError, lambda: t1.round(freq)) - - def test_repr(self): - - self.assertEqual(repr(Timedelta(10, unit='d')), - "Timedelta('10 days 00:00:00')") - self.assertEqual(repr(Timedelta(10, unit='s')), - "Timedelta('0 days 00:00:10')") - self.assertEqual(repr(Timedelta(10, unit='ms')), - "Timedelta('0 days 00:00:00.010000')") - self.assertEqual(repr(Timedelta(-10, unit='ms')), - "Timedelta('-1 days +23:59:59.990000')") - - def test_identity(self): - - td = Timedelta(10, unit='d') - self.assertTrue(isinstance(td, Timedelta)) - self.assertTrue(isinstance(td, timedelta)) - - def test_conversion(self): - - for td in [Timedelta(10, unit='d'), - Timedelta('1 days, 10:11:12.012345')]: - pydt = td.to_pytimedelta() - self.assertTrue(td == Timedelta(pydt)) - self.assertEqual(td, pydt) - self.assertTrue(isinstance(pydt, timedelta) and not isinstance( - pydt, Timedelta)) - - self.assertEqual(td, np.timedelta64(td.value, 'ns')) - td64 = td.to_timedelta64() - self.assertEqual(td64, np.timedelta64(td.value, 'ns')) - self.assertEqual(td, td64) - 
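- # the round trip via to_timedelta64() is exact because Timedelta is backed by an int64 nanosecond count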
self.assertTrue(isinstance(td64, np.timedelta64)) - - # this is NOT equal and cannot be round-tripped (because of the nanos) - td = Timedelta('1 days, 10:11:12.012345678') - self.assertTrue(td != td.to_pytimedelta()) - - def test_ops(self): - - td = Timedelta(10, unit='d') - self.assertEqual(-td, Timedelta(-10, unit='d')) - self.assertEqual(+td, Timedelta(10, unit='d')) - self.assertEqual(td - td, Timedelta(0, unit='ns')) - self.assertTrue((td - pd.NaT) is pd.NaT) - self.assertEqual(td + td, Timedelta(20, unit='d')) - self.assertTrue((td + pd.NaT) is pd.NaT) - self.assertEqual(td * 2, Timedelta(20, unit='d')) - self.assertTrue((td * pd.NaT) is pd.NaT) - self.assertEqual(td / 2, Timedelta(5, unit='d')) - self.assertEqual(abs(td), td) - self.assertEqual(abs(-td), td) - self.assertEqual(td / td, 1) - self.assertTrue((td / pd.NaT) is np.nan) - - # invert - self.assertEqual(-td, Timedelta('-10d')) - self.assertEqual(td * -1, Timedelta('-10d')) - self.assertEqual(-1 * td, Timedelta('-10d')) - self.assertEqual(abs(-td), Timedelta('10d')) - - # invalid - self.assertRaises(TypeError, lambda: Timedelta(11, unit='d') // 2) - - # invalid multiply with another timedelta - self.assertRaises(TypeError, lambda: td * td) - - # can't operate with integers - self.assertRaises(TypeError, lambda: td + 2) - self.assertRaises(TypeError, lambda: td - 2) - - def test_ops_offsets(self): - td = Timedelta(10, unit='d') - self.assertEqual(Timedelta(241, unit='h'), td + pd.offsets.Hour(1)) - self.assertEqual(Timedelta(241, unit='h'), pd.offsets.Hour(1) + td) - self.assertEqual(240, td / pd.offsets.Hour(1)) - self.assertEqual(1 / 240.0, pd.offsets.Hour(1) / td) - self.assertEqual(Timedelta(239, unit='h'), td - pd.offsets.Hour(1)) - self.assertEqual(Timedelta(-239, unit='h'), pd.offsets.Hour(1) - td) - - def test_freq_conversion(self): - - td = Timedelta('1 days 2 hours 3 ns') - result = td / np.timedelta64(1, 'D') - self.assertEqual(result, td.value / float(86400 * 1e9)) - result = td / np.timedelta64(1, 's') - self.assertEqual(result, td.value / float(1e9)) - result = td / np.timedelta64(1, 'ns') - self.assertEqual(result, td.value) - - def test_ops_ndarray(self): - td = Timedelta('1 day') - - # timedelta, timedelta - other = pd.to_timedelta(['1 day']).values - expected = pd.to_timedelta(['2 days']).values - self.assert_numpy_array_equal(td + other, expected) - if LooseVersion(np.__version__) >= '1.8': - self.assert_numpy_array_equal(other + td, expected) - self.assertRaises(TypeError, lambda: td + np.array([1])) - self.assertRaises(TypeError, lambda: np.array([1]) + td) - - expected = pd.to_timedelta(['0 days']).values - self.assert_numpy_array_equal(td - other, expected) - if LooseVersion(np.__version__) >= '1.8': - self.assert_numpy_array_equal(-other + td, expected) - self.assertRaises(TypeError, lambda: td - np.array([1])) - self.assertRaises(TypeError, lambda: np.array([1]) - td) - - expected = pd.to_timedelta(['2 days']).values - self.assert_numpy_array_equal(td * np.array([2]), expected) - self.assert_numpy_array_equal(np.array([2]) * td, expected) - self.assertRaises(TypeError, lambda: td * other) - self.assertRaises(TypeError, lambda: other * td) - - self.assert_numpy_array_equal(td / other, - np.array([1], dtype=np.float64)) - if LooseVersion(np.__version__) >= '1.8': - self.assert_numpy_array_equal(other / td, - np.array([1], dtype=np.float64)) - - # timedelta, datetime - other = pd.to_datetime(['2000-01-01']).values - expected = pd.to_datetime(['2000-01-02']).values - self.assert_numpy_array_equal(td + 
other, expected) - if LooseVersion(np.__version__) >= '1.8': - self.assert_numpy_array_equal(other + td, expected) - - expected = pd.to_datetime(['1999-12-31']).values - self.assert_numpy_array_equal(-td + other, expected) - if LooseVersion(np.__version__) >= '1.8': - self.assert_numpy_array_equal(other - td, expected) - - def test_ops_series(self): - # regression test for GH8813 - td = Timedelta('1 day') - other = pd.Series([1, 2]) - expected = pd.Series(pd.to_timedelta(['1 day', '2 days'])) - tm.assert_series_equal(expected, td * other) - tm.assert_series_equal(expected, other * td) - - def test_ops_series_object(self): - # GH 13043 - s = pd.Series([pd.Timestamp('2015-01-01', tz='US/Eastern'), - pd.Timestamp('2015-01-01', tz='Asia/Tokyo')], - name='xxx') - self.assertEqual(s.dtype, object) - - exp = pd.Series([pd.Timestamp('2015-01-02', tz='US/Eastern'), - pd.Timestamp('2015-01-02', tz='Asia/Tokyo')], - name='xxx') - tm.assert_series_equal(s + pd.Timedelta('1 days'), exp) - tm.assert_series_equal(pd.Timedelta('1 days') + s, exp) - - # object series & object series - s2 = pd.Series([pd.Timestamp('2015-01-03', tz='US/Eastern'), - pd.Timestamp('2015-01-05', tz='Asia/Tokyo')], - name='xxx') - self.assertEqual(s2.dtype, object) - exp = pd.Series([pd.Timedelta('2 days'), pd.Timedelta('4 days')], - name='xxx') - tm.assert_series_equal(s2 - s, exp) - tm.assert_series_equal(s - s2, -exp) - - s = pd.Series([pd.Timedelta('01:00:00'), pd.Timedelta('02:00:00')], - name='xxx', dtype=object) - self.assertEqual(s.dtype, object) - - exp = pd.Series([pd.Timedelta('01:30:00'), pd.Timedelta('02:30:00')], - name='xxx') - tm.assert_series_equal(s + pd.Timedelta('00:30:00'), exp) - tm.assert_series_equal(pd.Timedelta('00:30:00') + s, exp) - - def test_compare_timedelta_series(self): - # regression test for GH5963 - s = pd.Series([timedelta(days=1), timedelta(days=2)]) - actual = s > timedelta(days=1) - expected = pd.Series([False, True]) - tm.assert_series_equal(actual, expected) - - def test_compare_timedelta_ndarray(self): - # GH11835 - periods = [Timedelta('0 days 01:00:00'), Timedelta('0 days 01:00:00')] - arr = np.array(periods) - result = arr[0] > arr - expected = np.array([False, False]) - self.assert_numpy_array_equal(result, expected) - - def test_ops_notimplemented(self): - class Other: - pass - - other = Other() - - td = Timedelta('1 day') - self.assertTrue(td.__add__(other) is NotImplemented) - self.assertTrue(td.__sub__(other) is NotImplemented) - self.assertTrue(td.__truediv__(other) is NotImplemented) - self.assertTrue(td.__mul__(other) is NotImplemented) - self.assertTrue(td.__floordiv__(td) is NotImplemented) - - def test_ops_error_str(self): - # GH 13624 - td = Timedelta('1 day') - - for l, r in [(td, 'a'), ('a', td)]: - - with tm.assertRaises(TypeError): - l + r - - with tm.assertRaises(TypeError): - l > r - - self.assertFalse(l == r) - self.assertTrue(l != r) - - def test_fields(self): - def check(value): - # check that we are int/long like - self.assertTrue(isinstance(value, (int, compat.long))) - - # compat to datetime.timedelta - rng = to_timedelta('1 days, 10:11:12') - self.assertEqual(rng.days, 1) - self.assertEqual(rng.seconds, 10 * 3600 + 11 * 60 + 12) - self.assertEqual(rng.microseconds, 0) - self.assertEqual(rng.nanoseconds, 0) - - self.assertRaises(AttributeError, lambda: rng.hours) - self.assertRaises(AttributeError, lambda: rng.minutes) - self.assertRaises(AttributeError, lambda: rng.milliseconds) - - # GH 10050 - check(rng.days) - check(rng.seconds) - check(rng.microseconds) - 
check(rng.nanoseconds) - - td = Timedelta('-1 days, 10:11:12') - self.assertEqual(abs(td), Timedelta('13:48:48')) - self.assertTrue(str(td) == "-1 days +10:11:12") - self.assertEqual(-td, Timedelta('0 days 13:48:48')) - self.assertEqual(-Timedelta('-1 days, 10:11:12').value, 49728000000000) - self.assertEqual(Timedelta('-1 days, 10:11:12').value, -49728000000000) - - rng = to_timedelta('-1 days, 10:11:12.100123456') - self.assertEqual(rng.days, -1) - self.assertEqual(rng.seconds, 10 * 3600 + 11 * 60 + 12) - self.assertEqual(rng.microseconds, 100 * 1000 + 123) - self.assertEqual(rng.nanoseconds, 456) - self.assertRaises(AttributeError, lambda: rng.hours) - self.assertRaises(AttributeError, lambda: rng.minutes) - self.assertRaises(AttributeError, lambda: rng.milliseconds) - - # components - tup = pd.to_timedelta(-1, 'us').components - self.assertEqual(tup.days, -1) - self.assertEqual(tup.hours, 23) - self.assertEqual(tup.minutes, 59) - self.assertEqual(tup.seconds, 59) - self.assertEqual(tup.milliseconds, 999) - self.assertEqual(tup.microseconds, 999) - self.assertEqual(tup.nanoseconds, 0) - - # GH 10050 - check(tup.days) - check(tup.hours) - check(tup.minutes) - check(tup.seconds) - check(tup.milliseconds) - check(tup.microseconds) - check(tup.nanoseconds) - - tup = Timedelta('-1 days 1 us').components - self.assertEqual(tup.days, -2) - self.assertEqual(tup.hours, 23) - self.assertEqual(tup.minutes, 59) - self.assertEqual(tup.seconds, 59) - self.assertEqual(tup.milliseconds, 999) - self.assertEqual(tup.microseconds, 999) - self.assertEqual(tup.nanoseconds, 0) - - def test_timedelta_range(self): - - expected = to_timedelta(np.arange(5), unit='D') - result = timedelta_range('0 days', periods=5, freq='D') - tm.assert_index_equal(result, expected) - - expected = to_timedelta(np.arange(11), unit='D') - result = timedelta_range('0 days', '10 days', freq='D') - tm.assert_index_equal(result, expected) - - expected = to_timedelta(np.arange(5), unit='D') + Second(2) + Day() - result = timedelta_range('1 days, 00:00:02', '5 days, 00:00:02', - freq='D') - tm.assert_index_equal(result, expected) - - expected = to_timedelta([1, 3, 5, 7, 9], unit='D') + Second(2) - result = timedelta_range('1 days, 00:00:02', periods=5, freq='2D') - tm.assert_index_equal(result, expected) - - expected = to_timedelta(np.arange(50), unit='T') * 30 - result = timedelta_range('0 days', freq='30T', periods=50) - tm.assert_index_equal(result, expected) - - # GH 11776 - arr = np.arange(10).reshape(2, 5) - df = pd.DataFrame(np.arange(10).reshape(2, 5)) - for arg in (arr, df): - with tm.assertRaisesRegexp(TypeError, "1-d array"): - to_timedelta(arg) - for errors in ['ignore', 'raise', 'coerce']: - with tm.assertRaisesRegexp(TypeError, "1-d array"): - to_timedelta(arg, errors=errors) - - # issue10583 - df = pd.DataFrame(np.random.normal(size=(10, 4))) - df.index = pd.timedelta_range(start='0s', periods=10, freq='s') - expected = df.loc[pd.Timedelta('0s'):, :] - result = df.loc['0s':, :] - assert_frame_equal(expected, result) - - def test_numeric_conversions(self): - self.assertEqual(ct(0), np.timedelta64(0, 'ns')) - self.assertEqual(ct(10), np.timedelta64(10, 'ns')) - self.assertEqual(ct(10, unit='ns'), np.timedelta64( - 10, 'ns').astype('m8[ns]')) - - self.assertEqual(ct(10, unit='us'), np.timedelta64( - 10, 'us').astype('m8[ns]')) - self.assertEqual(ct(10, unit='ms'), np.timedelta64( - 10, 'ms').astype('m8[ns]')) - self.assertEqual(ct(10, unit='s'), np.timedelta64( - 10, 's').astype('m8[ns]')) - self.assertEqual(ct(10, unit='d'), 
np.timedelta64( - 10, 'D').astype('m8[ns]')) - - def test_timedelta_conversions(self): - self.assertEqual(ct(timedelta(seconds=1)), - np.timedelta64(1, 's').astype('m8[ns]')) - self.assertEqual(ct(timedelta(microseconds=1)), - np.timedelta64(1, 'us').astype('m8[ns]')) - self.assertEqual(ct(timedelta(days=1)), - np.timedelta64(1, 'D').astype('m8[ns]')) - - def test_short_format_converters(self): - def conv(v): - return v.astype('m8[ns]') - - self.assertEqual(ct('10'), np.timedelta64(10, 'ns')) - self.assertEqual(ct('10ns'), np.timedelta64(10, 'ns')) - self.assertEqual(ct('100'), np.timedelta64(100, 'ns')) - self.assertEqual(ct('100ns'), np.timedelta64(100, 'ns')) - - self.assertEqual(ct('1000'), np.timedelta64(1000, 'ns')) - self.assertEqual(ct('1000ns'), np.timedelta64(1000, 'ns')) - self.assertEqual(ct('1000NS'), np.timedelta64(1000, 'ns')) - - self.assertEqual(ct('10us'), np.timedelta64(10000, 'ns')) - self.assertEqual(ct('100us'), np.timedelta64(100000, 'ns')) - self.assertEqual(ct('1000us'), np.timedelta64(1000000, 'ns')) - self.assertEqual(ct('1000Us'), np.timedelta64(1000000, 'ns')) - self.assertEqual(ct('1000uS'), np.timedelta64(1000000, 'ns')) - - self.assertEqual(ct('1ms'), np.timedelta64(1000000, 'ns')) - self.assertEqual(ct('10ms'), np.timedelta64(10000000, 'ns')) - self.assertEqual(ct('100ms'), np.timedelta64(100000000, 'ns')) - self.assertEqual(ct('1000ms'), np.timedelta64(1000000000, 'ns')) - - self.assertEqual(ct('-1s'), -np.timedelta64(1000000000, 'ns')) - self.assertEqual(ct('1s'), np.timedelta64(1000000000, 'ns')) - self.assertEqual(ct('10s'), np.timedelta64(10000000000, 'ns')) - self.assertEqual(ct('100s'), np.timedelta64(100000000000, 'ns')) - self.assertEqual(ct('1000s'), np.timedelta64(1000000000000, 'ns')) - - self.assertEqual(ct('1d'), conv(np.timedelta64(1, 'D'))) - self.assertEqual(ct('-1d'), -conv(np.timedelta64(1, 'D'))) - self.assertEqual(ct('1D'), conv(np.timedelta64(1, 'D'))) - self.assertEqual(ct('10D'), conv(np.timedelta64(10, 'D'))) - self.assertEqual(ct('100D'), conv(np.timedelta64(100, 'D'))) - self.assertEqual(ct('1000D'), conv(np.timedelta64(1000, 'D'))) - self.assertEqual(ct('10000D'), conv(np.timedelta64(10000, 'D'))) - - # space - self.assertEqual(ct(' 10000D '), conv(np.timedelta64(10000, 'D'))) - self.assertEqual(ct(' - 10000D '), -conv(np.timedelta64(10000, 'D'))) - - # invalid - self.assertRaises(ValueError, ct, '1foo') - self.assertRaises(ValueError, ct, 'foo') - - def test_full_format_converters(self): - def conv(v): - return v.astype('m8[ns]') - - d1 = np.timedelta64(1, 'D') - - self.assertEqual(ct('1days'), conv(d1)) - self.assertEqual(ct('1days,'), conv(d1)) - self.assertEqual(ct('- 1days,'), -conv(d1)) - - self.assertEqual(ct('00:00:01'), conv(np.timedelta64(1, 's'))) - self.assertEqual(ct('06:00:01'), conv( - np.timedelta64(6 * 3600 + 1, 's'))) - self.assertEqual(ct('06:00:01.0'), conv( - np.timedelta64(6 * 3600 + 1, 's'))) - self.assertEqual(ct('06:00:01.01'), conv( - np.timedelta64(1000 * (6 * 3600 + 1) + 10, 'ms'))) - - self.assertEqual(ct('- 1days, 00:00:01'), - conv(-d1 + np.timedelta64(1, 's'))) - self.assertEqual(ct('1days, 06:00:01'), conv( - d1 + np.timedelta64(6 * 3600 + 1, 's'))) - self.assertEqual(ct('1days, 06:00:01.01'), conv( - d1 + np.timedelta64(1000 * (6 * 3600 + 1) + 10, 'ms'))) - - # invalid - self.assertRaises(ValueError, ct, '- 1days, 00') - - def test_nat_converters(self): - self.assertEqual(to_timedelta( - 'nat', box=False).astype('int64'), tslib.iNaT) - self.assertEqual(to_timedelta( - 'nan', 
box=False).astype('int64'), tslib.iNaT) - - def test_to_timedelta(self): - def conv(v): - return v.astype('m8[ns]') - - d1 = np.timedelta64(1, 'D') - - self.assertEqual(to_timedelta('1 days 06:05:01.00003', box=False), - conv(d1 + np.timedelta64(6 * 3600 + - 5 * 60 + 1, 's') + - np.timedelta64(30, 'us'))) - self.assertEqual(to_timedelta('15.5us', box=False), - conv(np.timedelta64(15500, 'ns'))) - - # empty string - result = to_timedelta('', box=False) - self.assertEqual(result.astype('int64'), tslib.iNaT) - - result = to_timedelta(['', '']) - self.assertTrue(isnull(result).all()) - - # pass thru - result = to_timedelta(np.array([np.timedelta64(1, 's')])) - expected = pd.Index(np.array([np.timedelta64(1, 's')])) - tm.assert_index_equal(result, expected) - - # ints - result = np.timedelta64(0, 'ns') - expected = to_timedelta(0, box=False) - self.assertEqual(result, expected) - - # Series - expected = Series([timedelta(days=1), timedelta(days=1, seconds=1)]) - result = to_timedelta(Series(['1d', '1days 00:00:01'])) - tm.assert_series_equal(result, expected) - - # with units - result = TimedeltaIndex([np.timedelta64(0, 'ns'), np.timedelta64( - 10, 's').astype('m8[ns]')]) - expected = to_timedelta([0, 10], unit='s') - tm.assert_index_equal(result, expected) - - # single element conversion - v = timedelta(seconds=1) - result = to_timedelta(v, box=False) - expected = np.timedelta64(timedelta(seconds=1)) - self.assertEqual(result, expected) - - v = np.timedelta64(timedelta(seconds=1)) - result = to_timedelta(v, box=False) - expected = np.timedelta64(timedelta(seconds=1)) - self.assertEqual(result, expected) - - # arrays of various dtypes - arr = np.array([1] * 5, dtype='int64') - result = to_timedelta(arr, unit='s') - expected = TimedeltaIndex([np.timedelta64(1, 's')] * 5) - tm.assert_index_equal(result, expected) - - arr = np.array([1] * 5, dtype='int64') - result = to_timedelta(arr, unit='m') - expected = TimedeltaIndex([np.timedelta64(1, 'm')] * 5) - tm.assert_index_equal(result, expected) - - arr = np.array([1] * 5, dtype='int64') - result = to_timedelta(arr, unit='h') - expected = TimedeltaIndex([np.timedelta64(1, 'h')] * 5) - tm.assert_index_equal(result, expected) - - arr = np.array([1] * 5, dtype='timedelta64[s]') - result = to_timedelta(arr) - expected = TimedeltaIndex([np.timedelta64(1, 's')] * 5) - tm.assert_index_equal(result, expected) - - arr = np.array([1] * 5, dtype='timedelta64[D]') - result = to_timedelta(arr) - expected = TimedeltaIndex([np.timedelta64(1, 'D')] * 5) - tm.assert_index_equal(result, expected) - - # Test with lists as input when box=false - expected = np.array(np.arange(3) * 1000000000, dtype='timedelta64[ns]') - result = to_timedelta(range(3), unit='s', box=False) - tm.assert_numpy_array_equal(expected, result) - - result = to_timedelta(np.arange(3), unit='s', box=False) - tm.assert_numpy_array_equal(expected, result) - - result = to_timedelta([0, 1, 2], unit='s', box=False) - tm.assert_numpy_array_equal(expected, result) - - # Tests with fractional seconds as input: - expected = np.array( - [0, 500000000, 800000000, 1200000000], dtype='timedelta64[ns]') - result = to_timedelta([0., 0.5, 0.8, 1.2], unit='s', box=False) - tm.assert_numpy_array_equal(expected, result) - - def testit(unit, transform): - - # array - result = to_timedelta(np.arange(5), unit=unit) - expected = TimedeltaIndex([np.timedelta64(i, transform(unit)) - for i in np.arange(5).tolist()]) - tm.assert_index_equal(result, expected) - - # scalar - result = to_timedelta(2, unit=unit) - expected = 
Timedelta(np.timedelta64(2, transform(unit)).astype( - 'timedelta64[ns]')) - self.assertEqual(result, expected) - - # validate all units - # GH 6855 - for unit in ['Y', 'M', 'W', 'D', 'y', 'w', 'd']: - testit(unit, lambda x: x.upper()) - for unit in ['days', 'day', 'Day', 'Days']: - testit(unit, lambda x: 'D') - for unit in ['h', 'm', 's', 'ms', 'us', 'ns', 'H', 'S', 'MS', 'US', - 'NS']: - testit(unit, lambda x: x.lower()) - - # offsets - - # m - testit('T', lambda x: 'm') - - # ms - testit('L', lambda x: 'ms') - - def test_to_timedelta_invalid(self): - - # bad value for errors parameter - msg = "errors must be one of" - tm.assertRaisesRegexp(ValueError, msg, to_timedelta, - ['foo'], errors='never') - - # these will error - self.assertRaises(ValueError, lambda: to_timedelta([1, 2], unit='foo')) - self.assertRaises(ValueError, lambda: to_timedelta(1, unit='foo')) - - # time not supported ATM - self.assertRaises(ValueError, lambda: to_timedelta(time(second=1))) - self.assertTrue(to_timedelta( - time(second=1), errors='coerce') is pd.NaT) - - self.assertRaises(ValueError, lambda: to_timedelta(['foo', 'bar'])) - tm.assert_index_equal(TimedeltaIndex([pd.NaT, pd.NaT]), - to_timedelta(['foo', 'bar'], errors='coerce')) - - tm.assert_index_equal(TimedeltaIndex(['1 day', pd.NaT, '1 min']), - to_timedelta(['1 day', 'bar', '1 min'], - errors='coerce')) - - # gh-13613: these should not error because errors='ignore' - invalid_data = 'apple' - self.assertEqual(invalid_data, to_timedelta( - invalid_data, errors='ignore')) - - invalid_data = ['apple', '1 days'] - tm.assert_numpy_array_equal( - np.array(invalid_data, dtype=object), - to_timedelta(invalid_data, errors='ignore')) - - invalid_data = pd.Index(['apple', '1 days']) - tm.assert_index_equal(invalid_data, to_timedelta( - invalid_data, errors='ignore')) - - invalid_data = Series(['apple', '1 days']) - tm.assert_series_equal(invalid_data, to_timedelta( - invalid_data, errors='ignore')) - - def test_to_timedelta_via_apply(self): - # GH 5458 - expected = Series([np.timedelta64(1, 's')]) - result = Series(['00:00:01']).apply(to_timedelta) - tm.assert_series_equal(result, expected) - - result = Series([to_timedelta('00:00:01')]) - tm.assert_series_equal(result, expected) - - def test_timedelta_ops(self): - # GH4984 - # make sure ops return Timedelta - s = Series([Timestamp('20130101') + timedelta(seconds=i * i) - for i in range(10)]) - td = s.diff() - - result = td.mean() - expected = to_timedelta(timedelta(seconds=9)) - self.assertEqual(result, expected) - - result = td.to_frame().mean() - self.assertEqual(result[0], expected) - - result = td.quantile(.1) - expected = Timedelta(np.timedelta64(2600, 'ms')) - self.assertEqual(result, expected) - - result = td.median() - expected = to_timedelta('00:00:09') - self.assertEqual(result, expected) - - result = td.to_frame().median() - self.assertEqual(result[0], expected) - - # GH 6462 - # consistency in returned values for sum - result = td.sum() - expected = to_timedelta('00:01:21') - self.assertEqual(result, expected) - - result = td.to_frame().sum() - self.assertEqual(result[0], expected) - - # std - result = td.std() - expected = to_timedelta(Series(td.dropna().values).std()) - self.assertEqual(result, expected) - - result = td.to_frame().std() - self.assertEqual(result[0], expected) - - # invalid ops - for op in ['skew', 'kurt', 'sem', 'prod']: - self.assertRaises(TypeError, getattr(td, op)) - - # GH 10040 - # make sure NaT is properly handled by median() - s = Series([Timestamp('2015-02-03'), 
Timestamp('2015-02-07')]) - self.assertEqual(s.diff().median(), timedelta(days=4)) - - s = Series([Timestamp('2015-02-03'), Timestamp('2015-02-07'), - Timestamp('2015-02-15')]) - self.assertEqual(s.diff().median(), timedelta(days=6)) - - def test_overflow(self): - # GH 9442 - s = Series(pd.date_range('20130101', periods=100000, freq='H')) - s[0] += pd.Timedelta('1s 1ms') - - # mean - result = (s - s.min()).mean() - expected = pd.Timedelta((pd.DatetimeIndex((s - s.min())).asi8 / len(s) - ).sum()) - - # the computation is converted to float, so there might be some loss of - # precision - self.assertTrue(np.allclose(result.value / 1000, expected.value / - 1000)) - - # sum - self.assertRaises(ValueError, lambda: (s - s.min()).sum()) - s1 = s[0:10000] - self.assertRaises(ValueError, lambda: (s1 - s1.min()).sum()) - s2 = s[0:1000] - result = (s2 - s2.min()).sum() - - def test_timedelta_ops_scalar(self): - # GH 6808 - base = pd.to_datetime('20130101 09:01:12.123456') - expected_add = pd.to_datetime('20130101 09:01:22.123456') - expected_sub = pd.to_datetime('20130101 09:01:02.123456') - - for offset in [pd.to_timedelta(10, unit='s'), timedelta(seconds=10), - np.timedelta64(10, 's'), - np.timedelta64(10000000000, 'ns'), - pd.offsets.Second(10)]: - result = base + offset - self.assertEqual(result, expected_add) - - result = base - offset - self.assertEqual(result, expected_sub) - - base = pd.to_datetime('20130102 09:01:12.123456') - expected_add = pd.to_datetime('20130103 09:01:22.123456') - expected_sub = pd.to_datetime('20130101 09:01:02.123456') - - for offset in [pd.to_timedelta('1 day, 00:00:10'), - pd.to_timedelta('1 days, 00:00:10'), - timedelta(days=1, seconds=10), - np.timedelta64(1, 'D') + np.timedelta64(10, 's'), - pd.offsets.Day() + pd.offsets.Second(10)]: - result = base + offset - self.assertEqual(result, expected_add) - - result = base - offset - self.assertEqual(result, expected_sub) - - def test_to_timedelta_on_missing_values(self): - # GH5438 - timedelta_NaT = np.timedelta64('NaT') - - actual = pd.to_timedelta(Series(['00:00:01', np.nan])) - expected = Series([np.timedelta64(1000000000, 'ns'), - timedelta_NaT], dtype='<m8[ns]') - [...] - result = idx2 > idx1 - expected = np.array([True, False, False, False, True, False]) - self.assert_numpy_array_equal(result, expected) - - result = idx1 <= idx2 - expected = np.array([True, False, False, False, True, True]) - self.assert_numpy_array_equal(result, expected) - - result = idx2 >= idx1 - expected = np.array([True, False, False, False, True, True]) - self.assert_numpy_array_equal(result, expected) - - result = idx1 == idx2 - expected = np.array([False, False, False, False, False, True]) - self.assert_numpy_array_equal(result, expected) - - result = idx1 != idx2 - expected = np.array([True, True, True, True, True, False]) - self.assert_numpy_array_equal(result, expected) - - def test_ops_error_str(self): - # GH 13624 - tdi = TimedeltaIndex(['1 day', '2 days']) - - for l, r in [(tdi, 'a'), ('a', tdi)]: - with tm.assertRaises(TypeError): - l + r - - with tm.assertRaises(TypeError): - l > r - - with tm.assertRaises(TypeError): - l == r - - with tm.assertRaises(TypeError): - l != r - - def test_map(self): - - rng = timedelta_range('1 day', periods=10) - - f = lambda x: x.days - result = rng.map(f) - exp = np.array([f(x) for x in rng], dtype=np.int64) - self.assert_numpy_array_equal(result, exp) - - def test_misc_coverage(self): - - rng = timedelta_range('1 day', periods=5) - result = rng.groupby(rng.days) - tm.assertIsInstance(list(result.values())[0][0], Timedelta) - - idx = 
TimedeltaIndex(['3d', '1d', '2d']) - self.assertFalse(idx.equals(list(idx))) - - non_td = Index(list('abc')) - self.assertFalse(idx.equals(list(non_td))) - - def test_union(self): - - i1 = timedelta_range('1day', periods=5) - i2 = timedelta_range('3day', periods=5) - result = i1.union(i2) - expected = timedelta_range('1day', periods=7) - self.assert_index_equal(result, expected) - - i1 = Int64Index(np.arange(0, 20, 2)) - i2 = TimedeltaIndex(start='1 day', periods=10, freq='D') - i1.union(i2) # Works - i2.union(i1) # Fails with "AttributeError: can't set attribute" - - def test_union_coverage(self): - - idx = TimedeltaIndex(['3d', '1d', '2d']) - ordered = TimedeltaIndex(idx.sort_values(), freq='infer') - result = ordered.union(idx) - self.assert_index_equal(result, ordered) - - result = ordered[:0].union(ordered) - self.assert_index_equal(result, ordered) - self.assertEqual(result.freq, ordered.freq) - - def test_union_bug_1730(self): - - rng_a = timedelta_range('1 day', periods=4, freq='3H') - rng_b = timedelta_range('1 day', periods=4, freq='4H') - - result = rng_a.union(rng_b) - exp = TimedeltaIndex(sorted(set(list(rng_a)) | set(list(rng_b)))) - self.assert_index_equal(result, exp) - - def test_union_bug_1745(self): - - left = TimedeltaIndex(['1 day 15:19:49.695000']) - right = TimedeltaIndex(['2 day 13:04:21.322000', - '1 day 15:27:24.873000', - '1 day 15:31:05.350000']) - - result = left.union(right) - exp = TimedeltaIndex(sorted(set(list(left)) | set(list(right)))) - self.assert_index_equal(result, exp) - - def test_union_bug_4564(self): - - left = timedelta_range("1 day", "30d") - right = left + pd.offsets.Minute(15) - - result = left.union(right) - exp = TimedeltaIndex(sorted(set(list(left)) | set(list(right)))) - self.assert_index_equal(result, exp) - - def test_intersection_bug_1708(self): - index_1 = timedelta_range('1 day', periods=4, freq='h') - index_2 = index_1 + pd.offsets.Hour(5) - - result = index_1 & index_2 - self.assertEqual(len(result), 0) - - index_1 = timedelta_range('1 day', periods=4, freq='h') - index_2 = index_1 + pd.offsets.Hour(1) - - result = index_1 & index_2 - expected = timedelta_range('1 day 01:00:00', periods=3, freq='h') - tm.assert_index_equal(result, expected) - - def test_get_duplicates(self): - idx = TimedeltaIndex(['1 day', '2 day', '2 day', '3 day', '3day', - '4day']) - - result = idx.get_duplicates() - ex = TimedeltaIndex(['2 day', '3day']) - self.assert_index_equal(result, ex) - - def test_argmin_argmax(self): - idx = TimedeltaIndex(['1 day 00:00:05', '1 day 00:00:01', - '1 day 00:00:02']) - self.assertEqual(idx.argmin(), 1) - self.assertEqual(idx.argmax(), 0) - - def test_sort_values(self): - - idx = TimedeltaIndex(['4d', '1d', '2d']) - - ordered = idx.sort_values() - self.assertTrue(ordered.is_monotonic) - - ordered = idx.sort_values(ascending=False) - self.assertTrue(ordered[::-1].is_monotonic) - - ordered, dexer = idx.sort_values(return_indexer=True) - self.assertTrue(ordered.is_monotonic) - self.assert_numpy_array_equal(dexer, - np.array([1, 2, 0]), - check_dtype=False) - - ordered, dexer = idx.sort_values(return_indexer=True, ascending=False) - self.assertTrue(ordered[::-1].is_monotonic) - self.assert_numpy_array_equal(dexer, - np.array([0, 2, 1]), - check_dtype=False) - - def test_insert(self): - - idx = TimedeltaIndex(['4day', '1day', '2day'], name='idx') - - result = idx.insert(2, timedelta(days=5)) - exp = TimedeltaIndex(['4day', '1day', '5day', '2day'], name='idx') - self.assert_index_equal(result, exp) - - # insertion of non-datetime 
should coerce to object index - result = idx.insert(1, 'inserted') - expected = Index([Timedelta('4day'), 'inserted', Timedelta('1day'), - Timedelta('2day')], name='idx') - self.assertNotIsInstance(result, TimedeltaIndex) - tm.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - - idx = timedelta_range('1day 00:00:01', periods=3, freq='s', name='idx') - - # preserve freq - expected_0 = TimedeltaIndex(['1day', '1day 00:00:01', '1day 00:00:02', - '1day 00:00:03'], - name='idx', freq='s') - expected_3 = TimedeltaIndex(['1day 00:00:01', '1day 00:00:02', - '1day 00:00:03', '1day 00:00:04'], - name='idx', freq='s') - - # reset freq to None - expected_1_nofreq = TimedeltaIndex(['1day 00:00:01', '1day 00:00:01', - '1day 00:00:02', '1day 00:00:03'], - name='idx', freq=None) - expected_3_nofreq = TimedeltaIndex(['1day 00:00:01', '1day 00:00:02', - '1day 00:00:03', '1day 00:00:05'], - name='idx', freq=None) - - cases = [(0, Timedelta('1day'), expected_0), - (-3, Timedelta('1day'), expected_0), - (3, Timedelta('1day 00:00:04'), expected_3), - (1, Timedelta('1day 00:00:01'), expected_1_nofreq), - (3, Timedelta('1day 00:00:05'), expected_3_nofreq)] - - for n, d, expected in cases: - result = idx.insert(n, d) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - self.assertEqual(result.freq, expected.freq) - - def test_delete(self): - idx = timedelta_range(start='1 Days', periods=5, freq='D', name='idx') - - # preserve freq - expected_0 = timedelta_range(start='2 Days', periods=4, freq='D', - name='idx') - expected_4 = timedelta_range(start='1 Days', periods=4, freq='D', - name='idx') - - # reset freq to None - expected_1 = TimedeltaIndex( - ['1 day', '3 day', '4 day', '5 day'], freq=None, name='idx') - - cases = {0: expected_0, - -5: expected_0, - -1: expected_4, - 4: expected_4, - 1: expected_1} - for n, expected in compat.iteritems(cases): - result = idx.delete(n) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - self.assertEqual(result.freq, expected.freq) - - with tm.assertRaises((IndexError, ValueError)): - # either, depending on numpy version - result = idx.delete(5) - - def test_delete_slice(self): - idx = timedelta_range(start='1 days', periods=10, freq='D', name='idx') - - # preserve freq - expected_0_2 = timedelta_range(start='4 days', periods=7, freq='D', - name='idx') - expected_7_9 = timedelta_range(start='1 days', periods=7, freq='D', - name='idx') - - # reset freq to None - expected_3_5 = TimedeltaIndex(['1 d', '2 d', '3 d', - '7 d', '8 d', '9 d', '10d'], - freq=None, name='idx') - - cases = {(0, 1, 2): expected_0_2, - (7, 8, 9): expected_7_9, - (3, 4, 5): expected_3_5} - for n, expected in compat.iteritems(cases): - result = idx.delete(n) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - self.assertEqual(result.freq, expected.freq) - - result = idx.delete(slice(n[0], n[-1] + 1)) - self.assert_index_equal(result, expected) - self.assertEqual(result.name, expected.name) - self.assertEqual(result.freq, expected.freq) - - def test_take(self): - - tds = ['1day 02:00:00', '1 day 04:00:00', '1 day 10:00:00'] - idx = TimedeltaIndex(start='1d', end='2d', freq='H', name='idx') - expected = TimedeltaIndex(tds, freq=None, name='idx') - - taken1 = idx.take([2, 4, 10]) - taken2 = idx[[2, 4, 10]] - - for taken in [taken1, taken2]: - self.assert_index_equal(taken, expected) - tm.assertIsInstance(taken, TimedeltaIndex) - self.assertIsNone(taken.freq) - 
self.assertEqual(taken.name, expected.name) - - def test_take_fill_value(self): - # GH 12631 - idx = pd.TimedeltaIndex(['1 days', '2 days', '3 days'], - name='xxx') - result = idx.take(np.array([1, 0, -1])) - expected = pd.TimedeltaIndex(['2 days', '1 days', '3 days'], - name='xxx') - tm.assert_index_equal(result, expected) - - # fill_value - result = idx.take(np.array([1, 0, -1]), fill_value=True) - expected = pd.TimedeltaIndex(['2 days', '1 days', 'NaT'], - name='xxx') - tm.assert_index_equal(result, expected) - - # allow_fill=False - result = idx.take(np.array([1, 0, -1]), allow_fill=False, - fill_value=True) - expected = pd.TimedeltaIndex(['2 days', '1 days', '3 days'], - name='xxx') - tm.assert_index_equal(result, expected) - - msg = ('When allow_fill=True and fill_value is not None, ' - 'all indices must be >= -1') - with tm.assertRaisesRegexp(ValueError, msg): - idx.take(np.array([1, 0, -2]), fill_value=True) - with tm.assertRaisesRegexp(ValueError, msg): - idx.take(np.array([1, 0, -5]), fill_value=True) - - with tm.assertRaises(IndexError): - idx.take(np.array([1, -5])) - - def test_isin(self): - - index = tm.makeTimedeltaIndex(4) - result = index.isin(index) - self.assertTrue(result.all()) - - result = index.isin(list(index)) - self.assertTrue(result.all()) - - assert_almost_equal(index.isin([index[2], 5]), - np.array([False, False, True, False])) - - def test_does_not_convert_mixed_integer(self): - df = tm.makeCustomDataframe(10, 10, - data_gen_f=lambda *args, **kwargs: randn(), - r_idx_type='i', c_idx_type='td') - str(df) - - cols = df.columns.join(df.index, how='outer') - joined = cols.join(df.columns) - self.assertEqual(cols.dtype, np.dtype('O')) - self.assertEqual(cols.dtype, joined.dtype) - tm.assert_index_equal(cols, joined) - - def test_slice_keeps_name(self): - - # GH4226 - dr = pd.timedelta_range('1d', '5d', freq='H', name='timebucket') - self.assertEqual(dr[1:].name, dr.name) - - def test_join_self(self): - - index = timedelta_range('1 day', periods=10) - kinds = 'outer', 'inner', 'left', 'right' - for kind in kinds: - joined = index.join(index, how=kind) - tm.assert_index_equal(index, joined) - - def test_factorize(self): - idx1 = TimedeltaIndex(['1 day', '1 day', '2 day', '2 day', '3 day', - '3 day']) - - exp_arr = np.array([0, 0, 1, 1, 2, 2], dtype=np.intp) - exp_idx = TimedeltaIndex(['1 day', '2 day', '3 day']) - - arr, idx = idx1.factorize() - self.assert_numpy_array_equal(arr, exp_arr) - self.assert_index_equal(idx, exp_idx) - - arr, idx = idx1.factorize(sort=True) - self.assert_numpy_array_equal(arr, exp_arr) - self.assert_index_equal(idx, exp_idx) - - # freq must be preserved - idx3 = timedelta_range('1 day', periods=4, freq='s') - exp_arr = np.array([0, 1, 2, 3], dtype=np.intp) - arr, idx = idx3.factorize() - self.assert_numpy_array_equal(arr, exp_arr) - self.assert_index_equal(idx, idx3) - - -class TestSlicing(tm.TestCase): - def test_partial_slice(self): - rng = timedelta_range('1 day 10:11:12', freq='h', periods=500) - s = Series(np.arange(len(rng)), index=rng) - - result = s['5 day':'6 day'] - expected = s.iloc[86:134] - assert_series_equal(result, expected) - - result = s['5 day':] - expected = s.iloc[86:] - assert_series_equal(result, expected) - - result = s[:'6 day'] - expected = s.iloc[:134] - assert_series_equal(result, expected) - - result = s['6 days, 23:11:12'] - self.assertEqual(result, s.iloc[133]) - - self.assertRaises(KeyError, s.__getitem__, '50 days') - - def test_partial_slice_high_reso(self): - - # higher reso - rng = timedelta_range('1 day 
10:11:12', freq='us', periods=2000) - s = Series(np.arange(len(rng)), index=rng) - - result = s['1 day 10:11:12':] - expected = s.iloc[0:] - assert_series_equal(result, expected) - - result = s['1 day 10:11:12.001':] - expected = s.iloc[1000:] - assert_series_equal(result, expected) - - result = s['1 days, 10:11:12.001001'] - self.assertEqual(result, s.iloc[1001]) - - def test_slice_with_negative_step(self): - ts = Series(np.arange(20), timedelta_range('0', periods=20, freq='H')) - SLC = pd.IndexSlice - - def assert_slices_equivalent(l_slc, i_slc): - assert_series_equal(ts[l_slc], ts.iloc[i_slc]) - assert_series_equal(ts.loc[l_slc], ts.iloc[i_slc]) - assert_series_equal(ts.ix[l_slc], ts.iloc[i_slc]) - - assert_slices_equivalent(SLC[Timedelta(hours=7)::-1], SLC[7::-1]) - assert_slices_equivalent(SLC['7 hours'::-1], SLC[7::-1]) - - assert_slices_equivalent(SLC[:Timedelta(hours=7):-1], SLC[:6:-1]) - assert_slices_equivalent(SLC[:'7 hours':-1], SLC[:6:-1]) - - assert_slices_equivalent(SLC['15 hours':'7 hours':-1], SLC[15:6:-1]) - assert_slices_equivalent(SLC[Timedelta(hours=15):Timedelta(hours=7):- - 1], SLC[15:6:-1]) - assert_slices_equivalent(SLC['15 hours':Timedelta(hours=7):-1], - SLC[15:6:-1]) - assert_slices_equivalent(SLC[Timedelta(hours=15):'7 hours':-1], - SLC[15:6:-1]) - - assert_slices_equivalent(SLC['7 hours':'15 hours':-1], SLC[:0]) - - def test_slice_with_zero_step_raises(self): - ts = Series(np.arange(20), timedelta_range('0', periods=20, freq='H')) - self.assertRaisesRegexp(ValueError, 'slice step cannot be zero', - lambda: ts[::0]) - self.assertRaisesRegexp(ValueError, 'slice step cannot be zero', - lambda: ts.loc[::0]) - self.assertRaisesRegexp(ValueError, 'slice step cannot be zero', - lambda: ts.ix[::0]) - - def test_tdi_ops_attributes(self): - rng = timedelta_range('2 days', periods=5, freq='2D', name='x') - - result = rng + 1 - exp = timedelta_range('4 days', periods=5, freq='2D', name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, '2D') - - result = rng - 2 - exp = timedelta_range('-2 days', periods=5, freq='2D', name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, '2D') - - result = rng * 2 - exp = timedelta_range('4 days', periods=5, freq='4D', name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, '4D') - - result = rng / 2 - exp = timedelta_range('1 days', periods=5, freq='D', name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, 'D') - - result = -rng - exp = timedelta_range('-2 days', periods=5, freq='-2D', name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, '-2D') - - rng = pd.timedelta_range('-2 days', periods=5, freq='D', name='x') - - result = abs(rng) - exp = TimedeltaIndex(['2 days', '1 days', '0 days', '1 days', - '2 days'], name='x') - tm.assert_index_equal(result, exp) - self.assertEqual(result.freq, None) - - def test_add_overflow(self): - # see gh-14068 - msg = "too (big|large) to convert" - with tm.assertRaisesRegexp(OverflowError, msg): - to_timedelta(106580, 'D') + Timestamp('2000') - with tm.assertRaisesRegexp(OverflowError, msg): - Timestamp('2000') + to_timedelta(106580, 'D') - - msg = "Overflow in int64 addition" - with tm.assertRaisesRegexp(OverflowError, msg): - to_timedelta([106580], 'D') + Timestamp('2000') - with tm.assertRaisesRegexp(OverflowError, msg): - Timestamp('2000') + to_timedelta([106580], 'D') - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - 
exit=False) diff --git a/pandas/tseries/tests/test_timeseries.py b/pandas/tseries/tests/test_timeseries.py deleted file mode 100644 index 906c0bbb7a479..0000000000000 --- a/pandas/tseries/tests/test_timeseries.py +++ /dev/null @@ -1,5589 +0,0 @@ -# pylint: disable-msg=E1101,W0612 -import calendar -import operator -import sys -import warnings -from datetime import datetime, time, timedelta -from numpy.random import rand - -import nose -import numpy as np -import pandas.index as _index -import pandas.lib as lib -import pandas.tslib as tslib - -from pandas.types.common import is_datetime64_ns_dtype -import pandas as pd -import pandas.compat as compat -import pandas.core.common as com -import pandas.tseries.frequencies as frequencies -import pandas.tseries.offsets as offsets -import pandas.tseries.tools as tools -import pandas.util.testing as tm -from pandas import ( - Index, Series, DataFrame, isnull, date_range, Timestamp, Period, - DatetimeIndex, Int64Index, to_datetime, bdate_range, Float64Index, - NaT, timedelta_range, Timedelta, _np_version_under1p8, concat) -from pandas.compat import range, long, StringIO, lrange, lmap, zip, product -from pandas.compat.numpy import np_datetime64_compat -from pandas.core.common import PerformanceWarning -from pandas.tslib import iNaT -from pandas.util.testing import ( - assert_frame_equal, assert_series_equal, assert_almost_equal, - _skip_if_has_locale, slow) - -randn = np.random.randn - - -class TestTimeSeriesDuplicates(tm.TestCase): - _multiprocess_can_split_ = True - - def setUp(self): - dates = [datetime(2000, 1, 2), datetime(2000, 1, 2), - datetime(2000, 1, 2), datetime(2000, 1, 3), - datetime(2000, 1, 3), datetime(2000, 1, 3), - datetime(2000, 1, 4), datetime(2000, 1, 4), - datetime(2000, 1, 4), datetime(2000, 1, 5)] - - self.dups = Series(np.random.randn(len(dates)), index=dates) - - def test_constructor(self): - tm.assertIsInstance(self.dups, Series) - tm.assertIsInstance(self.dups.index, DatetimeIndex) - - def test_is_unique_monotonic(self): - self.assertFalse(self.dups.index.is_unique) - - def test_index_unique(self): - uniques = self.dups.index.unique() - expected = DatetimeIndex([datetime(2000, 1, 2), datetime(2000, 1, 3), - datetime(2000, 1, 4), datetime(2000, 1, 5)]) - self.assertEqual(uniques.dtype, 'M8[ns]') # sanity - tm.assert_index_equal(uniques, expected) - self.assertEqual(self.dups.index.nunique(), 4) - - # #2563 - self.assertTrue(isinstance(uniques, DatetimeIndex)) - - dups_local = self.dups.index.tz_localize('US/Eastern') - dups_local.name = 'foo' - result = dups_local.unique() - expected = DatetimeIndex(expected, name='foo') - expected = expected.tz_localize('US/Eastern') - self.assertTrue(result.tz is not None) - self.assertEqual(result.name, 'foo') - tm.assert_index_equal(result, expected) - - # NaT, note this is excluded - arr = [1370745748 + t for t in range(20)] + [iNaT] - idx = DatetimeIndex(arr * 3) - tm.assert_index_equal(idx.unique(), DatetimeIndex(arr)) - self.assertEqual(idx.nunique(), 20) - self.assertEqual(idx.nunique(dropna=False), 21) - - arr = [Timestamp('2013-06-09 02:42:28') + timedelta(seconds=t) - for t in range(20)] + [NaT] - idx = DatetimeIndex(arr * 3) - tm.assert_index_equal(idx.unique(), DatetimeIndex(arr)) - self.assertEqual(idx.nunique(), 20) - self.assertEqual(idx.nunique(dropna=False), 21) - - def test_index_dupes_contains(self): - d = datetime(2011, 12, 5, 20, 30) - ix = DatetimeIndex([d, d]) - self.assertTrue(d in ix) - - def test_duplicate_dates_indexing(self): - ts = self.dups - - uniques = 
ts.index.unique() - for date in uniques: - result = ts[date] - - mask = ts.index == date - total = (ts.index == date).sum() - expected = ts[mask] - if total > 1: - assert_series_equal(result, expected) - else: - assert_almost_equal(result, expected[0]) - - cp = ts.copy() - cp[date] = 0 - expected = Series(np.where(mask, 0, ts), index=ts.index) - assert_series_equal(cp, expected) - - self.assertRaises(KeyError, ts.__getitem__, datetime(2000, 1, 6)) - - # new index - ts[datetime(2000, 1, 6)] = 0 - self.assertEqual(ts[datetime(2000, 1, 6)], 0) - - def test_range_slice(self): - idx = DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/3/2000', - '1/4/2000']) - - ts = Series(np.random.randn(len(idx)), index=idx) - - result = ts['1/2/2000':] - expected = ts[1:] - assert_series_equal(result, expected) - - result = ts['1/2/2000':'1/3/2000'] - expected = ts[1:4] - assert_series_equal(result, expected) - - def test_groupby_average_dup_values(self): - result = self.dups.groupby(level=0).mean() - expected = self.dups.groupby(self.dups.index).mean() - assert_series_equal(result, expected) - - def test_indexing_over_size_cutoff(self): - import datetime - # #1821 - - old_cutoff = _index._SIZE_CUTOFF - try: - _index._SIZE_CUTOFF = 1000 - - # create large list of non periodic datetime - dates = [] - sec = datetime.timedelta(seconds=1) - half_sec = datetime.timedelta(microseconds=500000) - d = datetime.datetime(2011, 12, 5, 20, 30) - n = 1100 - for i in range(n): - dates.append(d) - dates.append(d + sec) - dates.append(d + sec + half_sec) - dates.append(d + sec + sec + half_sec) - d += 3 * sec - - # duplicate some values in the list - duplicate_positions = np.random.randint(0, len(dates) - 1, 20) - for p in duplicate_positions: - dates[p + 1] = dates[p] - - df = DataFrame(np.random.randn(len(dates), 4), - index=dates, - columns=list('ABCD')) - - pos = n * 3 - timestamp = df.index[pos] - self.assertIn(timestamp, df.index) - - # it works! - df.ix[timestamp] - self.assertTrue(len(df.ix[[timestamp]]) > 0) - finally: - _index._SIZE_CUTOFF = old_cutoff - - def test_indexing_unordered(self): - # GH 2437 - rng = date_range(start='2011-01-01', end='2011-01-15') - ts = Series(randn(len(rng)), index=rng) - ts2 = concat([ts[0:4], ts[-4:], ts[4:-4]]) - - for t in ts.index: - # TODO: unused? 
- s = str(t) # noqa - - expected = ts[t] - result = ts2[t] - self.assertTrue(expected == result) - - # GH 3448 (ranges) - def compare(slobj): - result = ts2[slobj].copy() - result = result.sort_index() - expected = ts[slobj] - assert_series_equal(result, expected) - - compare(slice('2011-01-01', '2011-01-15')) - compare(slice('2010-12-30', '2011-01-15')) - compare(slice('2011-01-01', '2011-01-16')) - - # partial ranges - compare(slice('2011-01-01', '2011-01-6')) - compare(slice('2011-01-06', '2011-01-8')) - compare(slice('2011-01-06', '2011-01-12')) - - # single values - result = ts2['2011'].sort_index() - expected = ts['2011'] - assert_series_equal(result, expected) - - # diff freq - rng = date_range(datetime(2005, 1, 1), periods=20, freq='M') - ts = Series(np.arange(len(rng)), index=rng) - ts = ts.take(np.random.permutation(20)) - - result = ts['2005'] - for t in result.index: - self.assertTrue(t.year == 2005) - - def test_indexing(self): - - idx = date_range("2001-1-1", periods=20, freq='M') - ts = Series(np.random.rand(len(idx)), index=idx) - - # getting - - # GH 3070, make sure semantics work on Series/Frame - expected = ts['2001'] - expected.name = 'A' - - df = DataFrame(dict(A=ts)) - result = df['2001']['A'] - assert_series_equal(expected, result) - - # setting - ts['2001'] = 1 - expected = ts['2001'] - expected.name = 'A' - - df.loc['2001', 'A'] = 1 - - result = df['2001']['A'] - assert_series_equal(expected, result) - - # GH3546 (not including times on the last day) - idx = date_range(start='2013-05-31 00:00', end='2013-05-31 23:00', - freq='H') - ts = Series(lrange(len(idx)), index=idx) - expected = ts['2013-05'] - assert_series_equal(expected, ts) - - idx = date_range(start='2013-05-31 00:00', end='2013-05-31 23:59', - freq='S') - ts = Series(lrange(len(idx)), index=idx) - expected = ts['2013-05'] - assert_series_equal(expected, ts) - - idx = [Timestamp('2013-05-31 00:00'), - Timestamp(datetime(2013, 5, 31, 23, 59, 59, 999999))] - ts = Series(lrange(len(idx)), index=idx) - expected = ts['2013'] - assert_series_equal(expected, ts) - - # GH 3925, indexing with a seconds resolution string / datetime object - df = DataFrame(randn(5, 5), - columns=['open', 'high', 'low', 'close', 'volume'], - index=date_range('2012-01-02 18:01:00', - periods=5, tz='US/Central', freq='s')) - expected = df.loc[[df.index[2]]] - result = df['2012-01-02 18:01:02'] - assert_frame_equal(result, expected) - - # this is a single date, so will raise - self.assertRaises(KeyError, df.__getitem__, df.index[2], ) - - def test_recreate_from_data(self): - freqs = ['M', 'Q', 'A', 'D', 'B', 'BH', 'T', 'S', 'L', 'U', 'H', 'N', - 'C'] - - for f in freqs: - org = DatetimeIndex(start='2001/02/01 09:00', freq=f, periods=1) - idx = DatetimeIndex(org, freq=f) - tm.assert_index_equal(idx, org) - - org = DatetimeIndex(start='2001/02/01 09:00', freq=f, - tz='US/Pacific', periods=1) - idx = DatetimeIndex(org, freq=f, tz='US/Pacific') - tm.assert_index_equal(idx, org) - - -def assert_range_equal(left, right): - assert (left.equals(right)) - assert (left.freq == right.freq) - assert (left.tz == right.tz) - - -class TestTimeSeries(tm.TestCase): - _multiprocess_can_split_ = True - - def test_is_(self): - dti = DatetimeIndex(start='1/1/2005', end='12/1/2005', freq='M') - self.assertTrue(dti.is_(dti)) - self.assertTrue(dti.is_(dti.view())) - self.assertFalse(dti.is_(dti.copy())) - - def test_dti_slicing(self): - dti = DatetimeIndex(start='1/1/2005', end='12/1/2005', freq='M') - dti2 = dti[[1, 3, 5]] - - v1 = dti2[0] - v2 = dti2[1] - 
v3 = dti2[2] - - self.assertEqual(v1, Timestamp('2/28/2005')) - self.assertEqual(v2, Timestamp('4/30/2005')) - self.assertEqual(v3, Timestamp('6/30/2005')) - - # don't carry freq through irregular slicing - self.assertIsNone(dti2.freq) - - def test_pass_datetimeindex_to_index(self): - # Bugs in #1396 - rng = date_range('1/1/2000', '3/1/2000') - idx = Index(rng, dtype=object) - - expected = Index(rng.to_pydatetime(), dtype=object) - - self.assert_numpy_array_equal(idx.values, expected.values) - - def test_contiguous_boolean_preserve_freq(self): - rng = date_range('1/1/2000', '3/1/2000', freq='B') - - mask = np.zeros(len(rng), dtype=bool) - mask[10:20] = True - - masked = rng[mask] - expected = rng[10:20] - self.assertIsNotNone(expected.freq) - assert_range_equal(masked, expected) - - mask[22] = True - masked = rng[mask] - self.assertIsNone(masked.freq) - - def test_getitem_median_slice_bug(self): - index = date_range('20090415', '20090519', freq='2B') - s = Series(np.random.randn(13), index=index) - - indexer = [slice(6, 7, None)] - result = s[indexer] - expected = s[indexer[0]] - assert_series_equal(result, expected) - - def test_series_box_timestamp(self): - rng = date_range('20090415', '20090519', freq='B') - s = Series(rng) - - tm.assertIsInstance(s[5], Timestamp) - - rng = date_range('20090415', '20090519', freq='B') - s = Series(rng, index=rng) - tm.assertIsInstance(s[5], Timestamp) - - tm.assertIsInstance(s.iat[5], Timestamp) - - def test_series_box_timedelta(self): - rng = timedelta_range('1 day 1 s', periods=5, freq='h') - s = Series(rng) - tm.assertIsInstance(s[1], Timedelta) - tm.assertIsInstance(s.iat[2], Timedelta) - - def test_date_range_ambiguous_arguments(self): - # #2538 - start = datetime(2011, 1, 1, 5, 3, 40) - end = datetime(2011, 1, 1, 8, 9, 40) - - self.assertRaises(ValueError, date_range, start, end, freq='s', - periods=10) - - def test_timestamp_to_datetime(self): - tm._skip_if_no_pytz() - rng = date_range('20090415', '20090519', tz='US/Eastern') - - stamp = rng[0] - dtval = stamp.to_pydatetime() - self.assertEqual(stamp, dtval) - self.assertEqual(stamp.tzinfo, dtval.tzinfo) - - def test_timestamp_to_datetime_dateutil(self): - tm._skip_if_no_pytz() - rng = date_range('20090415', '20090519', tz='dateutil/US/Eastern') - - stamp = rng[0] - dtval = stamp.to_pydatetime() - self.assertEqual(stamp, dtval) - self.assertEqual(stamp.tzinfo, dtval.tzinfo) - - def test_timestamp_to_datetime_explicit_pytz(self): - tm._skip_if_no_pytz() - import pytz - rng = date_range('20090415', '20090519', - tz=pytz.timezone('US/Eastern')) - - stamp = rng[0] - dtval = stamp.to_pydatetime() - self.assertEqual(stamp, dtval) - self.assertEqual(stamp.tzinfo, dtval.tzinfo) - - def test_timestamp_to_datetime_explicit_dateutil(self): - tm._skip_if_windows_python_3() - tm._skip_if_no_dateutil() - from pandas.tslib import _dateutil_gettz as gettz - rng = date_range('20090415', '20090519', tz=gettz('US/Eastern')) - - stamp = rng[0] - dtval = stamp.to_pydatetime() - self.assertEqual(stamp, dtval) - self.assertEqual(stamp.tzinfo, dtval.tzinfo) - - def test_index_convert_to_datetime_array(self): - tm._skip_if_no_pytz() - - def _check_rng(rng): - converted = rng.to_pydatetime() - tm.assertIsInstance(converted, np.ndarray) - for x, stamp in zip(converted, rng): - tm.assertIsInstance(x, datetime) - self.assertEqual(x, stamp.to_pydatetime()) - self.assertEqual(x.tzinfo, stamp.tzinfo) - - rng = date_range('20090415', '20090519') - rng_eastern = date_range('20090415', '20090519', tz='US/Eastern') - rng_utc = 
date_range('20090415', '20090519', tz='utc') - - _check_rng(rng) - _check_rng(rng_eastern) - _check_rng(rng_utc) - - def test_index_convert_to_datetime_array_explicit_pytz(self): - tm._skip_if_no_pytz() - import pytz - - def _check_rng(rng): - converted = rng.to_pydatetime() - tm.assertIsInstance(converted, np.ndarray) - for x, stamp in zip(converted, rng): - tm.assertIsInstance(x, datetime) - self.assertEqual(x, stamp.to_pydatetime()) - self.assertEqual(x.tzinfo, stamp.tzinfo) - - rng = date_range('20090415', '20090519') - rng_eastern = date_range('20090415', '20090519', - tz=pytz.timezone('US/Eastern')) - rng_utc = date_range('20090415', '20090519', tz=pytz.utc) - - _check_rng(rng) - _check_rng(rng_eastern) - _check_rng(rng_utc) - - def test_index_convert_to_datetime_array_dateutil(self): - tm._skip_if_no_dateutil() - import dateutil - - def _check_rng(rng): - converted = rng.to_pydatetime() - tm.assertIsInstance(converted, np.ndarray) - for x, stamp in zip(converted, rng): - tm.assertIsInstance(x, datetime) - self.assertEqual(x, stamp.to_pydatetime()) - self.assertEqual(x.tzinfo, stamp.tzinfo) - - rng = date_range('20090415', '20090519') - rng_eastern = date_range('20090415', '20090519', - tz='dateutil/US/Eastern') - rng_utc = date_range('20090415', '20090519', tz=dateutil.tz.tzutc()) - - _check_rng(rng) - _check_rng(rng_eastern) - _check_rng(rng_utc) - - def test_ctor_str_intraday(self): - rng = DatetimeIndex(['1-1-2000 00:00:01']) - self.assertEqual(rng[0].second, 1) - - def test_series_ctor_plus_datetimeindex(self): - rng = date_range('20090415', '20090519', freq='B') - data = dict((k, 1) for k in rng) - - result = Series(data, index=rng) - self.assertIs(result.index, rng) - - def test_series_pad_backfill_limit(self): - index = np.arange(10) - s = Series(np.random.randn(10), index=index) - - result = s[:2].reindex(index, method='pad', limit=5) - - expected = s[:2].reindex(index).fillna(method='pad') - expected[-3:] = np.nan - assert_series_equal(result, expected) - - result = s[-2:].reindex(index, method='backfill', limit=5) - - expected = s[-2:].reindex(index).fillna(method='backfill') - expected[:3] = np.nan - assert_series_equal(result, expected) - - def test_series_fillna_limit(self): - index = np.arange(10) - s = Series(np.random.randn(10), index=index) - - result = s[:2].reindex(index) - result = result.fillna(method='pad', limit=5) - - expected = s[:2].reindex(index).fillna(method='pad') - expected[-3:] = np.nan - assert_series_equal(result, expected) - - result = s[-2:].reindex(index) - result = result.fillna(method='bfill', limit=5) - - expected = s[-2:].reindex(index).fillna(method='backfill') - expected[:3] = np.nan - assert_series_equal(result, expected) - - def test_frame_pad_backfill_limit(self): - index = np.arange(10) - df = DataFrame(np.random.randn(10, 4), index=index) - - result = df[:2].reindex(index, method='pad', limit=5) - - expected = df[:2].reindex(index).fillna(method='pad') - expected.values[-3:] = np.nan - tm.assert_frame_equal(result, expected) - - result = df[-2:].reindex(index, method='backfill', limit=5) - - expected = df[-2:].reindex(index).fillna(method='backfill') - expected.values[:3] = np.nan - tm.assert_frame_equal(result, expected) - - def test_frame_fillna_limit(self): - index = np.arange(10) - df = DataFrame(np.random.randn(10, 4), index=index) - - result = df[:2].reindex(index) - result = result.fillna(method='pad', limit=5) - - expected = df[:2].reindex(index).fillna(method='pad') - expected.values[-3:] = np.nan - 
tm.assert_frame_equal(result, expected) - - result = df[-2:].reindex(index) - result = result.fillna(method='backfill', limit=5) - - expected = df[-2:].reindex(index).fillna(method='backfill') - expected.values[:3] = np.nan - tm.assert_frame_equal(result, expected) - - def test_frame_setitem_timestamp(self): - # 2155 - columns = DatetimeIndex(start='1/1/2012', end='2/1/2012', - freq=offsets.BDay()) - index = lrange(10) - data = DataFrame(columns=columns, index=index) - t = datetime(2012, 11, 1) - ts = Timestamp(t) - data[ts] = np.nan # works - - def test_sparse_series_fillna_limit(self): - index = np.arange(10) - s = Series(np.random.randn(10), index=index) - - ss = s[:2].reindex(index).to_sparse() - result = ss.fillna(method='pad', limit=5) - expected = ss.fillna(method='pad', limit=5) - expected = expected.to_dense() - expected[-3:] = np.nan - expected = expected.to_sparse() - assert_series_equal(result, expected) - - ss = s[-2:].reindex(index).to_sparse() - result = ss.fillna(method='backfill', limit=5) - expected = ss.fillna(method='backfill') - expected = expected.to_dense() - expected[:3] = np.nan - expected = expected.to_sparse() - assert_series_equal(result, expected) - - def test_sparse_series_pad_backfill_limit(self): - index = np.arange(10) - s = Series(np.random.randn(10), index=index) - s = s.to_sparse() - - result = s[:2].reindex(index, method='pad', limit=5) - expected = s[:2].reindex(index).fillna(method='pad') - expected = expected.to_dense() - expected[-3:] = np.nan - expected = expected.to_sparse() - assert_series_equal(result, expected) - - result = s[-2:].reindex(index, method='backfill', limit=5) - expected = s[-2:].reindex(index).fillna(method='backfill') - expected = expected.to_dense() - expected[:3] = np.nan - expected = expected.to_sparse() - assert_series_equal(result, expected) - - def test_sparse_frame_pad_backfill_limit(self): - index = np.arange(10) - df = DataFrame(np.random.randn(10, 4), index=index) - sdf = df.to_sparse() - - result = sdf[:2].reindex(index, method='pad', limit=5) - - expected = sdf[:2].reindex(index).fillna(method='pad') - expected = expected.to_dense() - expected.values[-3:] = np.nan - expected = expected.to_sparse() - tm.assert_frame_equal(result, expected) - - result = sdf[-2:].reindex(index, method='backfill', limit=5) - - expected = sdf[-2:].reindex(index).fillna(method='backfill') - expected = expected.to_dense() - expected.values[:3] = np.nan - expected = expected.to_sparse() - tm.assert_frame_equal(result, expected) - - def test_sparse_frame_fillna_limit(self): - index = np.arange(10) - df = DataFrame(np.random.randn(10, 4), index=index) - sdf = df.to_sparse() - - result = sdf[:2].reindex(index) - result = result.fillna(method='pad', limit=5) - - expected = sdf[:2].reindex(index).fillna(method='pad') - expected = expected.to_dense() - expected.values[-3:] = np.nan - expected = expected.to_sparse() - tm.assert_frame_equal(result, expected) - - result = sdf[-2:].reindex(index) - result = result.fillna(method='backfill', limit=5) - - expected = sdf[-2:].reindex(index).fillna(method='backfill') - expected = expected.to_dense() - expected.values[:3] = np.nan - expected = expected.to_sparse() - tm.assert_frame_equal(result, expected) - - def test_pad_require_monotonicity(self): - rng = date_range('1/1/2000', '3/1/2000', freq='B') - - # neither monotonic increasing nor decreasing - rng2 = rng[[1, 0, 2]] - - self.assertRaises(ValueError, rng2.get_indexer, rng, method='pad') - - def test_frame_ctor_datetime64_column(self): - rng = 
date_range('1/1/2000 00:00:00', '1/1/2000 1:59:50', freq='10s') - dates = np.asarray(rng) - - df = DataFrame({'A': np.random.randn(len(rng)), 'B': dates}) - self.assertTrue(np.issubdtype(df['B'].dtype, np.dtype('M8[ns]'))) - - def test_frame_add_datetime64_column(self): - rng = date_range('1/1/2000 00:00:00', '1/1/2000 1:59:50', freq='10s') - df = DataFrame(index=np.arange(len(rng))) - - df['A'] = rng - self.assertTrue(np.issubdtype(df['A'].dtype, np.dtype('M8[ns]'))) - - def test_frame_datetime64_pre1900_repr(self): - df = DataFrame({'year': date_range('1/1/1700', periods=50, - freq='A-DEC')}) - # it works! - repr(df) - - def test_frame_add_datetime64_col_other_units(self): - n = 100 - - units = ['h', 'm', 's', 'ms', 'D', 'M', 'Y'] - - ns_dtype = np.dtype('M8[ns]') - - for unit in units: - dtype = np.dtype('M8[%s]' % unit) - vals = np.arange(n, dtype=np.int64).view(dtype) - - df = DataFrame({'ints': np.arange(n)}, index=np.arange(n)) - df[unit] = vals - - ex_vals = to_datetime(vals.astype('O')).values - - self.assertEqual(df[unit].dtype, ns_dtype) - self.assertTrue((df[unit].values == ex_vals).all()) - - # Test insertion into existing datetime64 column - df = DataFrame({'ints': np.arange(n)}, index=np.arange(n)) - df['dates'] = np.arange(n, dtype=np.int64).view(ns_dtype) - - for unit in units: - dtype = np.dtype('M8[%s]' % unit) - vals = np.arange(n, dtype=np.int64).view(dtype) - - tmp = df.copy() - - tmp['dates'] = vals - ex_vals = to_datetime(vals.astype('O')).values - - self.assertTrue((tmp['dates'].values == ex_vals).all()) - - def test_to_datetime_unit(self): - - epoch = 1370745748 - s = Series([epoch + t for t in range(20)]) - result = to_datetime(s, unit='s') - expected = Series([Timestamp('2013-06-09 02:42:28') + timedelta( - seconds=t) for t in range(20)]) - assert_series_equal(result, expected) - - s = Series([epoch + t for t in range(20)]).astype(float) - result = to_datetime(s, unit='s') - expected = Series([Timestamp('2013-06-09 02:42:28') + timedelta( - seconds=t) for t in range(20)]) - assert_series_equal(result, expected) - - s = Series([epoch + t for t in range(20)] + [iNaT]) - result = to_datetime(s, unit='s') - expected = Series([Timestamp('2013-06-09 02:42:28') + timedelta( - seconds=t) for t in range(20)] + [NaT]) - assert_series_equal(result, expected) - - s = Series([epoch + t for t in range(20)] + [iNaT]).astype(float) - result = to_datetime(s, unit='s') - expected = Series([Timestamp('2013-06-09 02:42:28') + timedelta( - seconds=t) for t in range(20)] + [NaT]) - assert_series_equal(result, expected) - - # GH13834 - s = Series([epoch + t for t in np.arange(0, 2, .25)] + - [iNaT]).astype(float) - result = to_datetime(s, unit='s') - expected = Series([Timestamp('2013-06-09 02:42:28') + timedelta( - seconds=t) for t in np.arange(0, 2, .25)] + [NaT]) - assert_series_equal(result, expected) - - s = concat([Series([epoch + t for t in range(20)] - ).astype(float), Series([np.nan])], - ignore_index=True) - result = to_datetime(s, unit='s') - expected = Series([Timestamp('2013-06-09 02:42:28') + timedelta( - seconds=t) for t in range(20)] + [NaT]) - assert_series_equal(result, expected) - - result = to_datetime([1, 2, 'NaT', pd.NaT, np.nan], unit='D') - expected = DatetimeIndex([Timestamp('1970-01-02'), - Timestamp('1970-01-03')] + ['NaT'] * 3) - tm.assert_index_equal(result, expected) - - with self.assertRaises(ValueError): - to_datetime([1, 2, 'foo'], unit='D') - with self.assertRaises(ValueError): - to_datetime([1, 2, 111111111], unit='D') - - # coerce we can process - 
expected = DatetimeIndex([Timestamp('1970-01-02'), - Timestamp('1970-01-03')] + ['NaT'] * 1) - result = to_datetime([1, 2, 'foo'], unit='D', errors='coerce') - tm.assert_index_equal(result, expected) - - result = to_datetime([1, 2, 111111111], unit='D', errors='coerce') - tm.assert_index_equal(result, expected) - - def test_series_ctor_datetime64(self): - rng = date_range('1/1/2000 00:00:00', '1/1/2000 1:59:50', freq='10s') - dates = np.asarray(rng) - - series = Series(dates) - self.assertTrue(np.issubdtype(series.dtype, np.dtype('M8[ns]'))) - - def test_index_cast_datetime64_other_units(self): - arr = np.arange(0, 100, 10, dtype=np.int64).view('M8[D]') - - idx = Index(arr) - - self.assertTrue((idx.values == tslib.cast_to_nanoseconds(arr)).all()) - - def test_reindex_series_add_nat(self): - rng = date_range('1/1/2000 00:00:00', periods=10, freq='10s') - series = Series(rng) - - result = series.reindex(lrange(15)) - self.assertTrue(np.issubdtype(result.dtype, np.dtype('M8[ns]'))) - - mask = result.isnull() - self.assertTrue(mask[-5:].all()) - self.assertFalse(mask[:-5].any()) - - def test_reindex_frame_add_nat(self): - rng = date_range('1/1/2000 00:00:00', periods=10, freq='10s') - df = DataFrame({'A': np.random.randn(len(rng)), 'B': rng}) - - result = df.reindex(lrange(15)) - self.assertTrue(np.issubdtype(result['B'].dtype, np.dtype('M8[ns]'))) - - mask = com.isnull(result)['B'] - self.assertTrue(mask[-5:].all()) - self.assertFalse(mask[:-5].any()) - - def test_series_repr_nat(self): - series = Series([0, 1000, 2000, iNaT], dtype='M8[ns]') - - result = repr(series) - expected = ('0 1970-01-01 00:00:00.000000\n' - '1 1970-01-01 00:00:00.000001\n' - '2 1970-01-01 00:00:00.000002\n' - '3 NaT\n' - 'dtype: datetime64[ns]') - self.assertEqual(result, expected) - - def test_fillna_nat(self): - series = Series([0, 1, 2, iNaT], dtype='M8[ns]') - - filled = series.fillna(method='pad') - filled2 = series.fillna(value=series.values[2]) - - expected = series.copy() - expected.values[3] = expected.values[2] - - assert_series_equal(filled, expected) - assert_series_equal(filled2, expected) - - df = DataFrame({'A': series}) - filled = df.fillna(method='pad') - filled2 = df.fillna(value=series.values[2]) - expected = DataFrame({'A': expected}) - assert_frame_equal(filled, expected) - assert_frame_equal(filled2, expected) - - series = Series([iNaT, 0, 1, 2], dtype='M8[ns]') - - filled = series.fillna(method='bfill') - filled2 = series.fillna(value=series[1]) - - expected = series.copy() - expected[0] = expected[1] - - assert_series_equal(filled, expected) - assert_series_equal(filled2, expected) - - df = DataFrame({'A': series}) - filled = df.fillna(method='bfill') - filled2 = df.fillna(value=series[1]) - expected = DataFrame({'A': expected}) - assert_frame_equal(filled, expected) - assert_frame_equal(filled2, expected) - - def test_string_na_nat_conversion(self): - # GH #999, #858 - - from pandas.compat import parse_date - - strings = np.array(['1/1/2000', '1/2/2000', np.nan, - '1/4/2000, 12:34:56'], dtype=object) - - expected = np.empty(4, dtype='M8[ns]') - for i, val in enumerate(strings): - if com.isnull(val): - expected[i] = iNaT - else: - expected[i] = parse_date(val) - - result = tslib.array_to_datetime(strings) - assert_almost_equal(result, expected) - - result2 = to_datetime(strings) - tm.assertIsInstance(result2, DatetimeIndex) - tm.assert_numpy_array_equal(result, result2.values) - - malformed = np.array(['1/100/2000', np.nan], dtype=object) - - # GH 10636, default is now 'raise' - 
self.assertRaises(ValueError, - lambda: to_datetime(malformed, errors='raise')) - - result = to_datetime(malformed, errors='ignore') - tm.assert_numpy_array_equal(result, malformed) - - self.assertRaises(ValueError, to_datetime, malformed, errors='raise') - - idx = ['a', 'b', 'c', 'd', 'e'] - series = Series(['1/1/2000', np.nan, '1/3/2000', np.nan, - '1/5/2000'], index=idx, name='foo') - dseries = Series([to_datetime('1/1/2000'), np.nan, - to_datetime('1/3/2000'), np.nan, - to_datetime('1/5/2000')], index=idx, name='foo') - - result = to_datetime(series) - dresult = to_datetime(dseries) - - expected = Series(np.empty(5, dtype='M8[ns]'), index=idx) - for i in range(5): - x = series[i] - if isnull(x): - expected[i] = iNaT - else: - expected[i] = to_datetime(x) - - assert_series_equal(result, expected, check_names=False) - self.assertEqual(result.name, 'foo') - - assert_series_equal(dresult, expected, check_names=False) - self.assertEqual(dresult.name, 'foo') - - def test_to_datetime_iso8601(self): - result = to_datetime(["2012-01-01 00:00:00"]) - exp = Timestamp("2012-01-01 00:00:00") - self.assertEqual(result[0], exp) - - result = to_datetime(['20121001']) # bad iso 8601 - exp = Timestamp('2012-10-01') - self.assertEqual(result[0], exp) - - def test_to_datetime_default(self): - rs = to_datetime('2001') - xp = datetime(2001, 1, 1) - self.assertEqual(rs, xp) - - # dayfirst is essentially broken - - # to_datetime('01-13-2012', dayfirst=True) - # self.assertRaises(ValueError, to_datetime('01-13-2012', - # dayfirst=True)) - - def test_to_datetime_on_datetime64_series(self): - # #2699 - s = Series(date_range('1/1/2000', periods=10)) - - result = to_datetime(s) - self.assertEqual(result[0], s[0]) - - def test_to_datetime_with_apply(self): - # this is only locale tested with US/None locales - _skip_if_has_locale() - - # GH 5195 - # with a format and coerce a single item to_datetime fails - td = Series(['May 04', 'Jun 02', 'Dec 11'], index=[1, 2, 3]) - expected = pd.to_datetime(td, format='%b %y') - result = td.apply(pd.to_datetime, format='%b %y') - assert_series_equal(result, expected) - - td = pd.Series(['May 04', 'Jun 02', ''], index=[1, 2, 3]) - self.assertRaises(ValueError, - lambda: pd.to_datetime(td, format='%b %y', - errors='raise')) - self.assertRaises(ValueError, - lambda: td.apply(pd.to_datetime, format='%b %y', - errors='raise')) - expected = pd.to_datetime(td, format='%b %y', errors='coerce') - - result = td.apply( - lambda x: pd.to_datetime(x, format='%b %y', errors='coerce')) - assert_series_equal(result, expected) - - def test_nat_vector_field_access(self): - idx = DatetimeIndex(['1/1/2000', None, None, '1/4/2000']) - - fields = ['year', 'quarter', 'month', 'day', 'hour', 'minute', - 'second', 'microsecond', 'nanosecond', 'week', 'dayofyear', - 'days_in_month', 'is_leap_year'] - - for field in fields: - result = getattr(idx, field) - expected = [getattr(x, field) for x in idx] - self.assert_numpy_array_equal(result, np.array(expected)) - - s = pd.Series(idx) - - for field in fields: - result = getattr(s.dt, field) - expected = [getattr(x, field) for x in idx] - self.assert_series_equal(result, pd.Series(expected)) - - def test_nat_scalar_field_access(self): - fields = ['year', 'quarter', 'month', 'day', 'hour', 'minute', - 'second', 'microsecond', 'nanosecond', 'week', 'dayofyear', - 'days_in_month', 'daysinmonth', 'dayofweek', 'weekday_name'] - for field in fields: - result = getattr(NaT, field) - self.assertTrue(np.isnan(result)) - - def test_NaT_methods(self): - # GH 9513 - 
raise_methods = ['astimezone', 'combine', 'ctime', 'dst', - 'fromordinal', 'fromtimestamp', 'isocalendar', - 'strftime', 'strptime', 'time', 'timestamp', - 'timetuple', 'timetz', 'toordinal', 'tzname', - 'utcfromtimestamp', 'utcnow', 'utcoffset', - 'utctimetuple'] - nat_methods = ['date', 'now', 'replace', 'to_datetime', 'today'] - nan_methods = ['weekday', 'isoweekday'] - - for method in raise_methods: - if hasattr(NaT, method): - self.assertRaises(ValueError, getattr(NaT, method)) - - for method in nan_methods: - if hasattr(NaT, method): - self.assertTrue(np.isnan(getattr(NaT, method)())) - - for method in nat_methods: - if hasattr(NaT, method): - # see gh-8254 - exp_warning = None - if method == 'to_datetime': - exp_warning = FutureWarning - with tm.assert_produces_warning( - exp_warning, check_stacklevel=False): - self.assertIs(getattr(NaT, method)(), NaT) - - # GH 12300 - self.assertEqual(NaT.isoformat(), 'NaT') - - def test_to_datetime_types(self): - - # empty string - result = to_datetime('') - self.assertIs(result, NaT) - - result = to_datetime(['', '']) - self.assertTrue(isnull(result).all()) - - # ints - result = Timestamp(0) - expected = to_datetime(0) - self.assertEqual(result, expected) - - # GH 3888 (strings) - expected = to_datetime(['2012'])[0] - result = to_datetime('2012') - self.assertEqual(result, expected) - - # array = ['2012','20120101','20120101 12:01:01'] - array = ['20120101', '20120101 12:01:01'] - expected = list(to_datetime(array)) - result = lmap(Timestamp, array) - tm.assert_almost_equal(result, expected) - - # currently fails ### - # result = Timestamp('2012') - # expected = to_datetime('2012') - # self.assertEqual(result, expected) - - def test_to_datetime_unprocessable_input(self): - # GH 4928 - self.assert_numpy_array_equal( - to_datetime([1, '1'], errors='ignore'), - np.array([1, '1'], dtype='O') - ) - self.assertRaises(TypeError, to_datetime, [1, '1'], errors='raise') - - def test_to_datetime_other_datetime64_units(self): - # 5/25/2012 - scalar = np.int64(1337904000000000).view('M8[us]') - as_obj = scalar.astype('O') - - index = DatetimeIndex([scalar]) - self.assertEqual(index[0], scalar.astype('O')) - - value = Timestamp(scalar) - self.assertEqual(value, as_obj) - - def test_to_datetime_list_of_integers(self): - rng = date_range('1/1/2000', periods=20) - rng = DatetimeIndex(rng.values) - - ints = list(rng.asi8) - - result = DatetimeIndex(ints) - - tm.assert_index_equal(rng, result) - - def test_to_datetime_freq(self): - xp = bdate_range('2000-1-1', periods=10, tz='UTC') - rs = xp.to_datetime() - self.assertEqual(xp.freq, rs.freq) - self.assertEqual(xp.tzinfo, rs.tzinfo) - - def test_range_edges(self): - # GH 13672 - idx = DatetimeIndex(start=Timestamp('1970-01-01 00:00:00.000000001'), - end=Timestamp('1970-01-01 00:00:00.000000004'), - freq='N') - exp = DatetimeIndex(['1970-01-01 00:00:00.000000001', - '1970-01-01 00:00:00.000000002', - '1970-01-01 00:00:00.000000003', - '1970-01-01 00:00:00.000000004']) - tm.assert_index_equal(idx, exp) - - idx = DatetimeIndex(start=Timestamp('1970-01-01 00:00:00.000000004'), - end=Timestamp('1970-01-01 00:00:00.000000001'), - freq='N') - exp = DatetimeIndex([]) - tm.assert_index_equal(idx, exp) - - idx = DatetimeIndex(start=Timestamp('1970-01-01 00:00:00.000000001'), - end=Timestamp('1970-01-01 00:00:00.000000001'), - freq='N') - exp = DatetimeIndex(['1970-01-01 00:00:00.000000001']) - tm.assert_index_equal(idx, exp) - - idx = DatetimeIndex(start=Timestamp('1970-01-01 00:00:00.000001'), - end=Timestamp('1970-01-01 
00:00:00.000004'), - freq='U') - exp = DatetimeIndex(['1970-01-01 00:00:00.000001', - '1970-01-01 00:00:00.000002', - '1970-01-01 00:00:00.000003', - '1970-01-01 00:00:00.000004']) - tm.assert_index_equal(idx, exp) - - idx = DatetimeIndex(start=Timestamp('1970-01-01 00:00:00.001'), - end=Timestamp('1970-01-01 00:00:00.004'), - freq='L') - exp = DatetimeIndex(['1970-01-01 00:00:00.001', - '1970-01-01 00:00:00.002', - '1970-01-01 00:00:00.003', - '1970-01-01 00:00:00.004']) - tm.assert_index_equal(idx, exp) - - idx = DatetimeIndex(start=Timestamp('1970-01-01 00:00:01'), - end=Timestamp('1970-01-01 00:00:04'), freq='S') - exp = DatetimeIndex(['1970-01-01 00:00:01', '1970-01-01 00:00:02', - '1970-01-01 00:00:03', '1970-01-01 00:00:04']) - tm.assert_index_equal(idx, exp) - - idx = DatetimeIndex(start=Timestamp('1970-01-01 00:01'), - end=Timestamp('1970-01-01 00:04'), freq='T') - exp = DatetimeIndex(['1970-01-01 00:01', '1970-01-01 00:02', - '1970-01-01 00:03', '1970-01-01 00:04']) - tm.assert_index_equal(idx, exp) - - idx = DatetimeIndex(start=Timestamp('1970-01-01 01:00'), - end=Timestamp('1970-01-01 04:00'), freq='H') - exp = DatetimeIndex(['1970-01-01 01:00', '1970-01-01 02:00', - '1970-01-01 03:00', '1970-01-01 04:00']) - tm.assert_index_equal(idx, exp) - - idx = DatetimeIndex(start=Timestamp('1970-01-01'), - end=Timestamp('1970-01-04'), freq='D') - exp = DatetimeIndex(['1970-01-01', '1970-01-02', - '1970-01-03', '1970-01-04']) - tm.assert_index_equal(idx, exp) - - def test_range_misspecified(self): - # GH #1095 - - self.assertRaises(ValueError, date_range, '1/1/2000') - self.assertRaises(ValueError, date_range, end='1/1/2000') - self.assertRaises(ValueError, date_range, periods=10) - - self.assertRaises(ValueError, date_range, '1/1/2000', freq='H') - self.assertRaises(ValueError, date_range, end='1/1/2000', freq='H') - self.assertRaises(ValueError, date_range, periods=10, freq='H') - - def test_reasonable_keyerror(self): - # GH #1062 - index = DatetimeIndex(['1/3/2000']) - try: - index.get_loc('1/1/2000') - except KeyError as e: - self.assertIn('2000', str(e)) - - def test_reindex_with_datetimes(self): - rng = date_range('1/1/2000', periods=20) - ts = Series(np.random.randn(20), index=rng) - - result = ts.reindex(list(ts.index[5:10])) - expected = ts[5:10] - tm.assert_series_equal(result, expected) - - result = ts[list(ts.index[5:10])] - tm.assert_series_equal(result, expected) - - def test_asfreq_keep_index_name(self): - # GH #9854 - index_name = 'bar' - index = pd.date_range('20130101', periods=20, name=index_name) - df = pd.DataFrame([x for x in range(20)], columns=['foo'], index=index) - - self.assertEqual(index_name, df.index.name) - self.assertEqual(index_name, df.asfreq('10D').index.name) - - def test_promote_datetime_date(self): - rng = date_range('1/1/2000', periods=20) - ts = Series(np.random.randn(20), index=rng) - - ts_slice = ts[5:] - ts2 = ts_slice.copy() - ts2.index = [x.date() for x in ts2.index] - - result = ts + ts2 - result2 = ts2 + ts - expected = ts + ts[5:] - assert_series_equal(result, expected) - assert_series_equal(result2, expected) - - # test asfreq - result = ts2.asfreq('4H', method='ffill') - expected = ts[5:].asfreq('4H', method='ffill') - assert_series_equal(result, expected) - - result = rng.get_indexer(ts2.index) - expected = rng.get_indexer(ts_slice.index) - self.assert_numpy_array_equal(result, expected) - - def test_asfreq_normalize(self): - rng = date_range('1/1/2000 09:30', periods=20) - norm = date_range('1/1/2000', periods=20) - vals = 
np.random.randn(20) - ts = Series(vals, index=rng) - - result = ts.asfreq('D', normalize=True) - norm = date_range('1/1/2000', periods=20) - expected = Series(vals, index=norm) - - assert_series_equal(result, expected) - - vals = np.random.randn(20, 3) - ts = DataFrame(vals, index=rng) - - result = ts.asfreq('D', normalize=True) - expected = DataFrame(vals, index=norm) - - assert_frame_equal(result, expected) - - def test_date_range_gen_error(self): - rng = date_range('1/1/2000 00:00', '1/1/2000 00:18', freq='5min') - self.assertEqual(len(rng), 4) - - def test_date_range_negative_freq(self): - # GH 11018 - rng = date_range('2011-12-31', freq='-2A', periods=3) - exp = pd.DatetimeIndex(['2011-12-31', '2009-12-31', - '2007-12-31'], freq='-2A') - tm.assert_index_equal(rng, exp) - self.assertEqual(rng.freq, '-2A') - - rng = date_range('2011-01-31', freq='-2M', periods=3) - exp = pd.DatetimeIndex(['2011-01-31', '2010-11-30', - '2010-09-30'], freq='-2M') - tm.assert_index_equal(rng, exp) - self.assertEqual(rng.freq, '-2M') - - def test_date_range_bms_bug(self): - # #1645 - rng = date_range('1/1/2000', periods=10, freq='BMS') - - ex_first = Timestamp('2000-01-03') - self.assertEqual(rng[0], ex_first) - - def test_date_range_businesshour(self): - idx = DatetimeIndex(['2014-07-04 09:00', '2014-07-04 10:00', - '2014-07-04 11:00', - '2014-07-04 12:00', '2014-07-04 13:00', - '2014-07-04 14:00', - '2014-07-04 15:00', '2014-07-04 16:00'], - freq='BH') - rng = date_range('2014-07-04 09:00', '2014-07-04 16:00', freq='BH') - tm.assert_index_equal(idx, rng) - - idx = DatetimeIndex( - ['2014-07-04 16:00', '2014-07-07 09:00'], freq='BH') - rng = date_range('2014-07-04 16:00', '2014-07-07 09:00', freq='BH') - tm.assert_index_equal(idx, rng) - - idx = DatetimeIndex(['2014-07-04 09:00', '2014-07-04 10:00', - '2014-07-04 11:00', - '2014-07-04 12:00', '2014-07-04 13:00', - '2014-07-04 14:00', - '2014-07-04 15:00', '2014-07-04 16:00', - '2014-07-07 09:00', '2014-07-07 10:00', - '2014-07-07 11:00', - '2014-07-07 12:00', '2014-07-07 13:00', - '2014-07-07 14:00', - '2014-07-07 15:00', '2014-07-07 16:00', - '2014-07-08 09:00', '2014-07-08 10:00', - '2014-07-08 11:00', - '2014-07-08 12:00', '2014-07-08 13:00', - '2014-07-08 14:00', - '2014-07-08 15:00', '2014-07-08 16:00'], - freq='BH') - rng = date_range('2014-07-04 09:00', '2014-07-08 16:00', freq='BH') - tm.assert_index_equal(idx, rng) - - def test_first_subset(self): - ts = _simple_ts('1/1/2000', '1/1/2010', freq='12h') - result = ts.first('10d') - self.assertEqual(len(result), 20) - - ts = _simple_ts('1/1/2000', '1/1/2010') - result = ts.first('10d') - self.assertEqual(len(result), 10) - - result = ts.first('3M') - expected = ts[:'3/31/2000'] - assert_series_equal(result, expected) - - result = ts.first('21D') - expected = ts[:21] - assert_series_equal(result, expected) - - result = ts[:0].first('3M') - assert_series_equal(result, ts[:0]) - - def test_last_subset(self): - ts = _simple_ts('1/1/2000', '1/1/2010', freq='12h') - result = ts.last('10d') - self.assertEqual(len(result), 20) - - ts = _simple_ts('1/1/2000', '1/1/2010') - result = ts.last('10d') - self.assertEqual(len(result), 10) - - result = ts.last('21D') - expected = ts['12/12/2009':] - assert_series_equal(result, expected) - - result = ts.last('21D') - expected = ts[-21:] - assert_series_equal(result, expected) - - result = ts[:0].last('3M') - assert_series_equal(result, ts[:0]) - - def test_format_pre_1900_dates(self): - rng = date_range('1/1/1850', '1/1/1950', freq='A-DEC') - rng.format() - ts = 
Series(1, index=rng) - repr(ts) - - def test_at_time(self): - rng = date_range('1/1/2000', '1/5/2000', freq='5min') - ts = Series(np.random.randn(len(rng)), index=rng) - rs = ts.at_time(rng[1]) - self.assertTrue((rs.index.hour == rng[1].hour).all()) - self.assertTrue((rs.index.minute == rng[1].minute).all()) - self.assertTrue((rs.index.second == rng[1].second).all()) - - result = ts.at_time('9:30') - expected = ts.at_time(time(9, 30)) - assert_series_equal(result, expected) - - df = DataFrame(np.random.randn(len(rng), 3), index=rng) - - result = ts[time(9, 30)] - result_df = df.ix[time(9, 30)] - expected = ts[(rng.hour == 9) & (rng.minute == 30)] - exp_df = df[(rng.hour == 9) & (rng.minute == 30)] - - # expected.index = date_range('1/1/2000', '1/4/2000') - - assert_series_equal(result, expected) - tm.assert_frame_equal(result_df, exp_df) - - chunk = df.ix['1/4/2000':] - result = chunk.ix[time(9, 30)] - expected = result_df[-1:] - tm.assert_frame_equal(result, expected) - - # midnight, everything - rng = date_range('1/1/2000', '1/31/2000') - ts = Series(np.random.randn(len(rng)), index=rng) - - result = ts.at_time(time(0, 0)) - assert_series_equal(result, ts) - - # time doesn't exist - rng = date_range('1/1/2012', freq='23Min', periods=384) - ts = Series(np.random.randn(len(rng)), rng) - rs = ts.at_time('16:00') - self.assertEqual(len(rs), 0) - - def test_at_time_frame(self): - rng = date_range('1/1/2000', '1/5/2000', freq='5min') - ts = DataFrame(np.random.randn(len(rng), 2), index=rng) - rs = ts.at_time(rng[1]) - self.assertTrue((rs.index.hour == rng[1].hour).all()) - self.assertTrue((rs.index.minute == rng[1].minute).all()) - self.assertTrue((rs.index.second == rng[1].second).all()) - - result = ts.at_time('9:30') - expected = ts.at_time(time(9, 30)) - assert_frame_equal(result, expected) - - result = ts.ix[time(9, 30)] - expected = ts.ix[(rng.hour == 9) & (rng.minute == 30)] - - assert_frame_equal(result, expected) - - # midnight, everything - rng = date_range('1/1/2000', '1/31/2000') - ts = DataFrame(np.random.randn(len(rng), 3), index=rng) - - result = ts.at_time(time(0, 0)) - assert_frame_equal(result, ts) - - # time doesn't exist - rng = date_range('1/1/2012', freq='23Min', periods=384) - ts = DataFrame(np.random.randn(len(rng), 2), rng) - rs = ts.at_time('16:00') - self.assertEqual(len(rs), 0) - - def test_between_time(self): - rng = date_range('1/1/2000', '1/5/2000', freq='5min') - ts = Series(np.random.randn(len(rng)), index=rng) - stime = time(0, 0) - etime = time(1, 0) - - close_open = product([True, False], [True, False]) - for inc_start, inc_end in close_open: - filtered = ts.between_time(stime, etime, inc_start, inc_end) - exp_len = 13 * 4 + 1 - if not inc_start: - exp_len -= 5 - if not inc_end: - exp_len -= 4 - - self.assertEqual(len(filtered), exp_len) - for rs in filtered.index: - t = rs.time() - if inc_start: - self.assertTrue(t >= stime) - else: - self.assertTrue(t > stime) - - if inc_end: - self.assertTrue(t <= etime) - else: - self.assertTrue(t < etime) - - result = ts.between_time('00:00', '01:00') - expected = ts.between_time(stime, etime) - assert_series_equal(result, expected) - - # across midnight - rng = date_range('1/1/2000', '1/5/2000', freq='5min') - ts = Series(np.random.randn(len(rng)), index=rng) - stime = time(22, 0) - etime = time(9, 0) - - close_open = product([True, False], [True, False]) - for inc_start, inc_end in close_open: - filtered = ts.between_time(stime, etime, inc_start, inc_end) - exp_len = (12 * 11 + 1) * 4 + 1 - if not inc_start: - exp_len 
-= 4 - if not inc_end: - exp_len -= 4 - - self.assertEqual(len(filtered), exp_len) - for rs in filtered.index: - t = rs.time() - if inc_start: - self.assertTrue((t >= stime) or (t <= etime)) - else: - self.assertTrue((t > stime) or (t <= etime)) - - if inc_end: - self.assertTrue((t <= etime) or (t >= stime)) - else: - self.assertTrue((t < etime) or (t >= stime)) - - def test_between_time_frame(self): - rng = date_range('1/1/2000', '1/5/2000', freq='5min') - ts = DataFrame(np.random.randn(len(rng), 2), index=rng) - stime = time(0, 0) - etime = time(1, 0) - - close_open = product([True, False], [True, False]) - for inc_start, inc_end in close_open: - filtered = ts.between_time(stime, etime, inc_start, inc_end) - exp_len = 13 * 4 + 1 - if not inc_start: - exp_len -= 5 - if not inc_end: - exp_len -= 4 - - self.assertEqual(len(filtered), exp_len) - for rs in filtered.index: - t = rs.time() - if inc_start: - self.assertTrue(t >= stime) - else: - self.assertTrue(t > stime) - - if inc_end: - self.assertTrue(t <= etime) - else: - self.assertTrue(t < etime) - - result = ts.between_time('00:00', '01:00') - expected = ts.between_time(stime, etime) - assert_frame_equal(result, expected) - - # across midnight - rng = date_range('1/1/2000', '1/5/2000', freq='5min') - ts = DataFrame(np.random.randn(len(rng), 2), index=rng) - stime = time(22, 0) - etime = time(9, 0) - - close_open = product([True, False], [True, False]) - for inc_start, inc_end in close_open: - filtered = ts.between_time(stime, etime, inc_start, inc_end) - exp_len = (12 * 11 + 1) * 4 + 1 - if not inc_start: - exp_len -= 4 - if not inc_end: - exp_len -= 4 - - self.assertEqual(len(filtered), exp_len) - for rs in filtered.index: - t = rs.time() - if inc_start: - self.assertTrue((t >= stime) or (t <= etime)) - else: - self.assertTrue((t > stime) or (t <= etime)) - - if inc_end: - self.assertTrue((t <= etime) or (t >= stime)) - else: - self.assertTrue((t < etime) or (t >= stime)) - - def test_between_time_types(self): - # GH11818 - rng = date_range('1/1/2000', '1/5/2000', freq='5min') - self.assertRaises(ValueError, rng.indexer_between_time, - datetime(2010, 1, 2, 1), datetime(2010, 1, 2, 5)) - - frame = DataFrame({'A': 0}, index=rng) - self.assertRaises(ValueError, frame.between_time, - datetime(2010, 1, 2, 1), datetime(2010, 1, 2, 5)) - - series = Series(0, index=rng) - self.assertRaises(ValueError, series.between_time, - datetime(2010, 1, 2, 1), datetime(2010, 1, 2, 5)) - - def test_between_time_formats(self): - # GH11818 - _skip_if_has_locale() - - rng = date_range('1/1/2000', '1/5/2000', freq='5min') - ts = DataFrame(np.random.randn(len(rng), 2), index=rng) - - strings = [("2:00", "2:30"), ("0200", "0230"), ("2:00am", "2:30am"), - ("0200am", "0230am"), ("2:00:00", "2:30:00"), - ("020000", "023000"), ("2:00:00am", "2:30:00am"), - ("020000am", "023000am")] - expected_length = 28 - - for time_string in strings: - self.assertEqual(len(ts.between_time(*time_string)), - expected_length, - "%s - %s" % time_string) - - def test_dti_constructor_preserve_dti_freq(self): - rng = date_range('1/1/2000', '1/2/2000', freq='5min') - - rng2 = DatetimeIndex(rng) - self.assertEqual(rng.freq, rng2.freq) - - def test_dti_constructor_years_only(self): - # GH 6961 - for tz in [None, 'UTC', 'Asia/Tokyo', 'dateutil/US/Pacific']: - rng1 = date_range('2014', '2015', freq='M', tz=tz) - expected1 = date_range('2014-01-31', '2014-12-31', freq='M', tz=tz) - - rng2 = date_range('2014', '2015', freq='MS', tz=tz) - expected2 = date_range('2014-01-01', '2015-01-01', 
freq='MS', - tz=tz) - - rng3 = date_range('2014', '2020', freq='A', tz=tz) - expected3 = date_range('2014-12-31', '2019-12-31', freq='A', tz=tz) - - rng4 = date_range('2014', '2020', freq='AS', tz=tz) - expected4 = date_range('2014-01-01', '2020-01-01', freq='AS', - tz=tz) - - for rng, expected in [(rng1, expected1), (rng2, expected2), - (rng3, expected3), (rng4, expected4)]: - tm.assert_index_equal(rng, expected) - - def test_dti_constructor_small_int(self): - # GH 13721 - exp = DatetimeIndex(['1970-01-01 00:00:00.00000000', - '1970-01-01 00:00:00.00000001', - '1970-01-01 00:00:00.00000002']) - - for dtype in [np.int64, np.int32, np.int16, np.int8]: - arr = np.array([0, 10, 20], dtype=dtype) - tm.assert_index_equal(DatetimeIndex(arr), exp) - - def test_dti_constructor_numpy_timeunits(self): - # GH 9114 - base = pd.to_datetime(['2000-01-01T00:00', '2000-01-02T00:00', 'NaT']) - - for dtype in ['datetime64[h]', 'datetime64[m]', 'datetime64[s]', - 'datetime64[ms]', 'datetime64[us]', 'datetime64[ns]']: - values = base.values.astype(dtype) - - tm.assert_index_equal(DatetimeIndex(values), base) - tm.assert_index_equal(to_datetime(values), base) - - def test_normalize(self): - rng = date_range('1/1/2000 9:30', periods=10, freq='D') - - result = rng.normalize() - expected = date_range('1/1/2000', periods=10, freq='D') - tm.assert_index_equal(result, expected) - - rng_ns = pd.DatetimeIndex(np.array([1380585623454345752, - 1380585612343234312]).astype( - "datetime64[ns]")) - rng_ns_normalized = rng_ns.normalize() - expected = pd.DatetimeIndex(np.array([1380585600000000000, - 1380585600000000000]).astype( - "datetime64[ns]")) - tm.assert_index_equal(rng_ns_normalized, expected) - - self.assertTrue(result.is_normalized) - self.assertFalse(rng.is_normalized) - - def test_to_period(self): - from pandas.tseries.period import period_range - - ts = _simple_ts('1/1/2000', '1/1/2001') - - pts = ts.to_period() - exp = ts.copy() - exp.index = period_range('1/1/2000', '1/1/2001') - assert_series_equal(pts, exp) - - pts = ts.to_period('M') - exp.index = exp.index.asfreq('M') - tm.assert_index_equal(pts.index, exp.index.asfreq('M')) - assert_series_equal(pts, exp) - - # GH 7606 without freq - idx = DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', - '2011-01-04']) - exp_idx = pd.PeriodIndex(['2011-01-01', '2011-01-02', '2011-01-03', - '2011-01-04'], freq='D') - - s = Series(np.random.randn(4), index=idx) - expected = s.copy() - expected.index = exp_idx - assert_series_equal(s.to_period(), expected) - - df = DataFrame(np.random.randn(4, 4), index=idx, columns=idx) - expected = df.copy() - expected.index = exp_idx - assert_frame_equal(df.to_period(), expected) - - expected = df.copy() - expected.columns = exp_idx - assert_frame_equal(df.to_period(axis=1), expected) - - def create_dt64_based_index(self): - data = [Timestamp('2007-01-01 10:11:12.123456Z'), - Timestamp('2007-01-01 10:11:13.789123Z')] - index = DatetimeIndex(data) - return index - - def test_to_period_millisecond(self): - index = self.create_dt64_based_index() - - period = index.to_period(freq='L') - self.assertEqual(2, len(period)) - self.assertEqual(period[0], Period('2007-01-01 10:11:12.123Z', 'L')) - self.assertEqual(period[1], Period('2007-01-01 10:11:13.789Z', 'L')) - - def test_to_period_microsecond(self): - index = self.create_dt64_based_index() - - period = index.to_period(freq='U') - self.assertEqual(2, len(period)) - self.assertEqual(period[0], Period('2007-01-01 10:11:12.123456Z', 'U')) - self.assertEqual(period[1], 
Period('2007-01-01 10:11:13.789123Z', 'U')) - - def test_to_period_tz_pytz(self): - tm._skip_if_no_pytz() - from dateutil.tz import tzlocal - from pytz import utc as UTC - - xp = date_range('1/1/2000', '4/1/2000').to_period() - - ts = date_range('1/1/2000', '4/1/2000', tz='US/Eastern') - - result = ts.to_period()[0] - expected = ts[0].to_period() - - self.assertEqual(result, expected) - tm.assert_index_equal(ts.to_period(), xp) - - ts = date_range('1/1/2000', '4/1/2000', tz=UTC) - - result = ts.to_period()[0] - expected = ts[0].to_period() - - self.assertEqual(result, expected) - tm.assert_index_equal(ts.to_period(), xp) - - ts = date_range('1/1/2000', '4/1/2000', tz=tzlocal()) - - result = ts.to_period()[0] - expected = ts[0].to_period() - - self.assertEqual(result, expected) - tm.assert_index_equal(ts.to_period(), xp) - - def test_to_period_tz_explicit_pytz(self): - tm._skip_if_no_pytz() - import pytz - from dateutil.tz import tzlocal - - xp = date_range('1/1/2000', '4/1/2000').to_period() - - ts = date_range('1/1/2000', '4/1/2000', tz=pytz.timezone('US/Eastern')) - - result = ts.to_period()[0] - expected = ts[0].to_period() - - self.assertTrue(result == expected) - tm.assert_index_equal(ts.to_period(), xp) - - ts = date_range('1/1/2000', '4/1/2000', tz=pytz.utc) - - result = ts.to_period()[0] - expected = ts[0].to_period() - - self.assertTrue(result == expected) - tm.assert_index_equal(ts.to_period(), xp) - - ts = date_range('1/1/2000', '4/1/2000', tz=tzlocal()) - - result = ts.to_period()[0] - expected = ts[0].to_period() - - self.assertTrue(result == expected) - tm.assert_index_equal(ts.to_period(), xp) - - def test_to_period_tz_dateutil(self): - tm._skip_if_no_dateutil() - import dateutil - from dateutil.tz import tzlocal - - xp = date_range('1/1/2000', '4/1/2000').to_period() - - ts = date_range('1/1/2000', '4/1/2000', tz='dateutil/US/Eastern') - - result = ts.to_period()[0] - expected = ts[0].to_period() - - self.assertTrue(result == expected) - tm.assert_index_equal(ts.to_period(), xp) - - ts = date_range('1/1/2000', '4/1/2000', tz=dateutil.tz.tzutc()) - - result = ts.to_period()[0] - expected = ts[0].to_period() - - self.assertTrue(result == expected) - tm.assert_index_equal(ts.to_period(), xp) - - ts = date_range('1/1/2000', '4/1/2000', tz=tzlocal()) - - result = ts.to_period()[0] - expected = ts[0].to_period() - - self.assertTrue(result == expected) - tm.assert_index_equal(ts.to_period(), xp) - - def test_frame_to_period(self): - K = 5 - from pandas.tseries.period import period_range - - dr = date_range('1/1/2000', '1/1/2001') - pr = period_range('1/1/2000', '1/1/2001') - df = DataFrame(randn(len(dr), K), index=dr) - df['mix'] = 'a' - - pts = df.to_period() - exp = df.copy() - exp.index = pr - assert_frame_equal(pts, exp) - - pts = df.to_period('M') - tm.assert_index_equal(pts.index, exp.index.asfreq('M')) - - df = df.T - pts = df.to_period(axis=1) - exp = df.copy() - exp.columns = pr - assert_frame_equal(pts, exp) - - pts = df.to_period('M', axis=1) - tm.assert_index_equal(pts.columns, exp.columns.asfreq('M')) - - self.assertRaises(ValueError, df.to_period, axis=2) - - def test_timestamp_fields(self): - # extra fields from DatetimeIndex like quarter and week - idx = tm.makeDateIndex(100) - - fields = ['dayofweek', 'dayofyear', 'week', 'weekofyear', 'quarter', - 'days_in_month', 'is_month_start', 'is_month_end', - 'is_quarter_start', 'is_quarter_end', 'is_year_start', - 'is_year_end', 'weekday_name'] - for f in fields: - expected = getattr(idx, f)[-1] - result = 
getattr(Timestamp(idx[-1]), f) - self.assertEqual(result, expected) - - self.assertEqual(idx.freq, Timestamp(idx[-1], idx.freq).freq) - self.assertEqual(idx.freqstr, Timestamp(idx[-1], idx.freq).freqstr) - - def test_woy_boundary(self): - # make sure weeks at year boundaries are correct - d = datetime(2013, 12, 31) - result = Timestamp(d).week - expected = 1 # ISO standard - self.assertEqual(result, expected) - - d = datetime(2008, 12, 28) - result = Timestamp(d).week - expected = 52 # ISO standard - self.assertEqual(result, expected) - - d = datetime(2009, 12, 31) - result = Timestamp(d).week - expected = 53 # ISO standard - self.assertEqual(result, expected) - - d = datetime(2010, 1, 1) - result = Timestamp(d).week - expected = 53 # ISO standard - self.assertEqual(result, expected) - - d = datetime(2010, 1, 3) - result = Timestamp(d).week - expected = 53 # ISO standard - self.assertEqual(result, expected) - - result = np.array([Timestamp(datetime(*args)).week - for args in [(2000, 1, 1), (2000, 1, 2), ( - 2005, 1, 1), (2005, 1, 2)]]) - self.assertTrue((result == [52, 52, 53, 53]).all()) - - def test_timestamp_date_out_of_range(self): - self.assertRaises(ValueError, Timestamp, '1676-01-01') - self.assertRaises(ValueError, Timestamp, '2263-01-01') - - # 1475 - self.assertRaises(ValueError, DatetimeIndex, ['1400-01-01']) - self.assertRaises(ValueError, DatetimeIndex, [datetime(1400, 1, 1)]) - - def test_timestamp_repr(self): - # pre-1900 - stamp = Timestamp('1850-01-01', tz='US/Eastern') - repr(stamp) - - iso8601 = '1850-01-01 01:23:45.012345' - stamp = Timestamp(iso8601, tz='US/Eastern') - result = repr(stamp) - self.assertIn(iso8601, result) - - def test_timestamp_from_ordinal(self): - - # GH 3042 - dt = datetime(2011, 4, 16, 0, 0) - ts = Timestamp.fromordinal(dt.toordinal()) - self.assertEqual(ts.to_pydatetime(), dt) - - # with a tzinfo - stamp = Timestamp('2011-4-16', tz='US/Eastern') - dt_tz = stamp.to_pydatetime() - ts = Timestamp.fromordinal(dt_tz.toordinal(), tz='US/Eastern') - self.assertEqual(ts.to_pydatetime(), dt_tz) - - def test_datetimeindex_integers_shift(self): - rng = date_range('1/1/2000', periods=20) - - result = rng + 5 - expected = rng.shift(5) - tm.assert_index_equal(result, expected) - - result = rng - 5 - expected = rng.shift(-5) - tm.assert_index_equal(result, expected) - - def test_astype_object(self): - # NumPy 1.6.1 weak ns support - rng = date_range('1/1/2000', periods=20) - - casted = rng.astype('O') - exp_values = list(rng) - - tm.assert_index_equal(casted, Index(exp_values, dtype=np.object_)) - self.assertEqual(casted.tolist(), exp_values) - - def test_catch_infinite_loop(self): - offset = offsets.DateOffset(minute=5) - # blow up, don't loop forever - self.assertRaises(Exception, date_range, datetime(2011, 11, 11), - datetime(2011, 11, 12), freq=offset) - - def test_append_concat(self): - rng = date_range('5/8/2012 1:45', periods=10, freq='5T') - ts = Series(np.random.randn(len(rng)), rng) - df = DataFrame(np.random.randn(len(rng), 4), index=rng) - - result = ts.append(ts) - result_df = df.append(df) - ex_index = DatetimeIndex(np.tile(rng.values, 2)) - tm.assert_index_equal(result.index, ex_index) - tm.assert_index_equal(result_df.index, ex_index) - - appended = rng.append(rng) - tm.assert_index_equal(appended, ex_index) - - appended = rng.append([rng, rng]) - ex_index = DatetimeIndex(np.tile(rng.values, 3)) - tm.assert_index_equal(appended, ex_index) - - # different index names - rng1 = rng.copy() - rng2 = rng.copy() - rng1.name = 'foo' - rng2.name = 'bar' - 
self.assertEqual(rng1.append(rng1).name, 'foo') - self.assertIsNone(rng1.append(rng2).name) - - def test_append_concat_tz(self): - # GH 2938 - tm._skip_if_no_pytz() - - rng = date_range('5/8/2012 1:45', periods=10, freq='5T', - tz='US/Eastern') - rng2 = date_range('5/8/2012 2:35', periods=10, freq='5T', - tz='US/Eastern') - rng3 = date_range('5/8/2012 1:45', periods=20, freq='5T', - tz='US/Eastern') - ts = Series(np.random.randn(len(rng)), rng) - df = DataFrame(np.random.randn(len(rng), 4), index=rng) - ts2 = Series(np.random.randn(len(rng2)), rng2) - df2 = DataFrame(np.random.randn(len(rng2), 4), index=rng2) - - result = ts.append(ts2) - result_df = df.append(df2) - tm.assert_index_equal(result.index, rng3) - tm.assert_index_equal(result_df.index, rng3) - - appended = rng.append(rng2) - tm.assert_index_equal(appended, rng3) - - def test_append_concat_tz_explicit_pytz(self): - # GH 2938 - tm._skip_if_no_pytz() - from pytz import timezone as timezone - - rng = date_range('5/8/2012 1:45', periods=10, freq='5T', - tz=timezone('US/Eastern')) - rng2 = date_range('5/8/2012 2:35', periods=10, freq='5T', - tz=timezone('US/Eastern')) - rng3 = date_range('5/8/2012 1:45', periods=20, freq='5T', - tz=timezone('US/Eastern')) - ts = Series(np.random.randn(len(rng)), rng) - df = DataFrame(np.random.randn(len(rng), 4), index=rng) - ts2 = Series(np.random.randn(len(rng2)), rng2) - df2 = DataFrame(np.random.randn(len(rng2), 4), index=rng2) - - result = ts.append(ts2) - result_df = df.append(df2) - tm.assert_index_equal(result.index, rng3) - tm.assert_index_equal(result_df.index, rng3) - - appended = rng.append(rng2) - tm.assert_index_equal(appended, rng3) - - def test_append_concat_tz_dateutil(self): - # GH 2938 - tm._skip_if_no_dateutil() - rng = date_range('5/8/2012 1:45', periods=10, freq='5T', - tz='dateutil/US/Eastern') - rng2 = date_range('5/8/2012 2:35', periods=10, freq='5T', - tz='dateutil/US/Eastern') - rng3 = date_range('5/8/2012 1:45', periods=20, freq='5T', - tz='dateutil/US/Eastern') - ts = Series(np.random.randn(len(rng)), rng) - df = DataFrame(np.random.randn(len(rng), 4), index=rng) - ts2 = Series(np.random.randn(len(rng2)), rng2) - df2 = DataFrame(np.random.randn(len(rng2), 4), index=rng2) - - result = ts.append(ts2) - result_df = df.append(df2) - tm.assert_index_equal(result.index, rng3) - tm.assert_index_equal(result_df.index, rng3) - - appended = rng.append(rng2) - tm.assert_index_equal(appended, rng3) - - def test_set_dataframe_column_ns_dtype(self): - x = DataFrame([datetime.now(), datetime.now()]) - self.assertEqual(x[0].dtype, np.dtype('M8[ns]')) - - def test_groupby_count_dateparseerror(self): - dr = date_range(start='1/1/2012', freq='5min', periods=10) - - # BAD Example, datetimes first - s = Series(np.arange(10), index=[dr, lrange(10)]) - grouped = s.groupby(lambda x: x[1] % 2 == 0) - result = grouped.count() - - s = Series(np.arange(10), index=[lrange(10), dr]) - grouped = s.groupby(lambda x: x[0] % 2 == 0) - expected = grouped.count() - - assert_series_equal(result, expected) - - def test_datetimeindex_repr_short(self): - dr = date_range(start='1/1/2012', periods=1) - repr(dr) - - dr = date_range(start='1/1/2012', periods=2) - repr(dr) - - dr = date_range(start='1/1/2012', periods=3) - repr(dr) - - def test_constructor_int64_nocopy(self): - # #1624 - arr = np.arange(1000, dtype=np.int64) - index = DatetimeIndex(arr) - - arr[50:100] = -1 - self.assertTrue((index.asi8[50:100] == -1).all()) - - arr = np.arange(1000, dtype=np.int64) - index = DatetimeIndex(arr, copy=True) - - 
arr[50:100] = -1
-        self.assertTrue((index.asi8[50:100] != -1).all())
-
-    def test_series_interpolate_method_values(self):
-        # #1646
-        ts = _simple_ts('1/1/2000', '1/20/2000')
-        ts[::2] = np.nan
-
-        result = ts.interpolate(method='values')
-        exp = ts.interpolate()
-        assert_series_equal(result, exp)
-
-    def test_frame_datetime64_handling_groupby(self):
-        # it works!
-        df = DataFrame([(3, np.datetime64('2012-07-03')),
-                        (3, np.datetime64('2012-07-04'))],
-                       columns=['a', 'date'])
-        result = df.groupby('a').first()
-        self.assertEqual(result['date'][3], Timestamp('2012-07-03'))
-
-    def test_series_interpolate_intraday(self):
-        # #1698
-        index = pd.date_range('1/1/2012', periods=4, freq='12D')
-        ts = pd.Series([0, 12, 24, 36], index)
-        new_index = index.append(index + pd.DateOffset(days=1)).sort_values()
-
-        exp = ts.reindex(new_index).interpolate(method='time')
-
-        index = pd.date_range('1/1/2012', periods=4, freq='12H')
-        ts = pd.Series([0, 12, 24, 36], index)
-        new_index = index.append(index + pd.DateOffset(hours=1)).sort_values()
-        result = ts.reindex(new_index).interpolate(method='time')
-
-        self.assert_numpy_array_equal(result.values, exp.values)
-
-    def test_frame_dict_constructor_datetime64_1680(self):
-        dr = date_range('1/1/2012', periods=10)
-        s = Series(dr, index=dr)
-
-        # it works!
-        DataFrame({'a': 'foo', 'b': s}, index=dr)
-        DataFrame({'a': 'foo', 'b': s.values}, index=dr)
-
-    def test_frame_datetime64_mixed_index_ctor_1681(self):
-        dr = date_range('2011/1/1', '2012/1/1', freq='W-FRI')
-        ts = Series(dr)
-
-        # it works!
-        d = DataFrame({'A': 'foo', 'B': ts}, index=dr)
-        self.assertTrue(d['B'].isnull().all())
-
-    def test_frame_timeseries_to_records(self):
-        index = date_range('1/1/2000', periods=10)
-        df = DataFrame(np.random.randn(10, 3), index=index,
-                       columns=['a', 'b', 'c'])
-
-        result = df.to_records()
-        self.assertEqual(result['index'].dtype, np.dtype('M8[ns]'))
-
-        result = df.to_records(index=False)
-
-    def test_frame_datetime64_duplicated(self):
-        dates = date_range('2010-07-01', end='2010-08-05')
-
-        tst = DataFrame({'symbol': 'AAA', 'date': dates})
-        result = tst.duplicated(['date', 'symbol'])
-        self.assertTrue((-result).all())
-
-        tst = DataFrame({'date': dates})
-        result = tst.duplicated()
-        self.assertTrue((-result).all())
-
-    def test_timestamp_compare_with_early_datetime(self):
-        # e.g. datetime.min
-        stamp = Timestamp('2012-01-01')
-
-        self.assertFalse(stamp == datetime.min)
-        self.assertFalse(stamp == datetime(1600, 1, 1))
-        self.assertFalse(stamp == datetime(2700, 1, 1))
-        self.assertNotEqual(stamp, datetime.min)
-        self.assertNotEqual(stamp, datetime(1600, 1, 1))
-        self.assertNotEqual(stamp, datetime(2700, 1, 1))
-        self.assertTrue(stamp > datetime(1600, 1, 1))
-        self.assertTrue(stamp >= datetime(1600, 1, 1))
-        self.assertTrue(stamp < datetime(2700, 1, 1))
-        self.assertTrue(stamp <= datetime(2700, 1, 1))
-
-    def test_to_html_timestamp(self):
-        rng = date_range('2000-01-01', periods=10)
-        df = DataFrame(np.random.randn(10, 4), index=rng)
-
-        result = df.to_html()
-        self.assertIn('2000-01-01', result)
-
-    def test_to_csv_numpy_16_bug(self):
-        frame = DataFrame({'a': date_range('1/1/2000', periods=10)})
-
-        buf = StringIO()
-        frame.to_csv(buf)
-
-        result = buf.getvalue()
-        self.assertIn('2000-01-01', result)
-
-    def test_series_map_box_timestamps(self):
-        # #2689, #2627
-        s = Series(date_range('1/1/2000', periods=10))
-
-        def f(x):
-            return (x.hour, x.day, x.month)
-
-        # it works!
- s.map(f) - s.apply(f) - DataFrame(s).applymap(f) - - def test_series_map_box_timedelta(self): - # GH 11349 - s = Series(timedelta_range('1 day 1 s', periods=5, freq='h')) - - def f(x): - return x.total_seconds() - - s.map(f) - s.apply(f) - DataFrame(s).applymap(f) - - def test_concat_datetime_datetime64_frame(self): - # #2624 - rows = [] - rows.append([datetime(2010, 1, 1), 1]) - rows.append([datetime(2010, 1, 2), 'hi']) - - df2_obj = DataFrame.from_records(rows, columns=['date', 'test']) - - ind = date_range(start="2000/1/1", freq="D", periods=10) - df1 = DataFrame({'date': ind, 'test': lrange(10)}) - - # it works! - pd.concat([df1, df2_obj]) - - def test_asfreq_resample_set_correct_freq(self): - # GH5613 - # we test if .asfreq() and .resample() set the correct value for .freq - df = pd.DataFrame({'date': ["2012-01-01", "2012-01-02", "2012-01-03"], - 'col': [1, 2, 3]}) - df = df.set_index(pd.to_datetime(df.date)) - - # testing the settings before calling .asfreq() and .resample() - self.assertEqual(df.index.freq, None) - self.assertEqual(df.index.inferred_freq, 'D') - - # does .asfreq() set .freq correctly? - self.assertEqual(df.asfreq('D').index.freq, 'D') - - # does .resample() set .freq correctly? - self.assertEqual(df.resample('D').asfreq().index.freq, 'D') - - def test_pickle(self): - - # GH4606 - p = self.round_trip_pickle(NaT) - self.assertTrue(p is NaT) - - idx = pd.to_datetime(['2013-01-01', NaT, '2014-01-06']) - idx_p = self.round_trip_pickle(idx) - self.assertTrue(idx_p[0] == idx[0]) - self.assertTrue(idx_p[1] is NaT) - self.assertTrue(idx_p[2] == idx[2]) - - # GH11002 - # don't infer freq - idx = date_range('1750-1-1', '2050-1-1', freq='7D') - idx_p = self.round_trip_pickle(idx) - tm.assert_index_equal(idx, idx_p) - - def test_timestamp_equality(self): - - # GH 11034 - s = Series([Timestamp('2000-01-29 01:59:00'), 'NaT']) - result = s != s - assert_series_equal(result, Series([False, True])) - result = s != s[0] - assert_series_equal(result, Series([False, True])) - result = s != s[1] - assert_series_equal(result, Series([True, True])) - - result = s == s - assert_series_equal(result, Series([True, False])) - result = s == s[0] - assert_series_equal(result, Series([True, False])) - result = s == s[1] - assert_series_equal(result, Series([False, False])) - - -def _simple_ts(start, end, freq='D'): - rng = date_range(start, end, freq=freq) - return Series(np.random.randn(len(rng)), index=rng) - - -class TestToDatetime(tm.TestCase): - _multiprocess_can_split_ = True - - def test_to_datetime_dt64s(self): - in_bound_dts = [ - np.datetime64('2000-01-01'), - np.datetime64('2000-01-02'), - ] - - for dt in in_bound_dts: - self.assertEqual(pd.to_datetime(dt), Timestamp(dt)) - - oob_dts = [np.datetime64('1000-01-01'), np.datetime64('5000-01-02'), ] - - for dt in oob_dts: - self.assertRaises(ValueError, pd.to_datetime, dt, errors='raise') - self.assertRaises(ValueError, tslib.Timestamp, dt) - self.assertIs(pd.to_datetime(dt, errors='coerce'), NaT) - - def test_to_datetime_array_of_dt64s(self): - dts = [np.datetime64('2000-01-01'), np.datetime64('2000-01-02'), ] - - # Assuming all datetimes are in bounds, to_datetime() returns - # an array that is equal to Timestamp() parsing - self.assert_numpy_array_equal( - pd.to_datetime(dts, box=False), - np.array([Timestamp(x).asm8 for x in dts]) - ) - - # A list of datetimes where the last one is out of bounds - dts_with_oob = dts + [np.datetime64('9999-01-01')] - - self.assertRaises(ValueError, pd.to_datetime, dts_with_oob, - errors='raise') - - 
self.assert_numpy_array_equal(
-            pd.to_datetime(dts_with_oob, box=False, errors='coerce'),
-            np.array(
-                [
-                    Timestamp(dts_with_oob[0]).asm8,
-                    Timestamp(dts_with_oob[1]).asm8,
-                    iNaT,
-                ],
-                dtype='M8'
-            )
-        )
-
-        # With errors='ignore', out of bounds datetime64s
-        # are converted to their .item(), which depending on the version of
-        # numpy is either a python datetime.datetime or datetime.date
-        self.assert_numpy_array_equal(
-            pd.to_datetime(dts_with_oob, box=False, errors='ignore'),
-            np.array(
-                [dt.item() for dt in dts_with_oob],
-                dtype='O'
-            )
-        )
-
-    def test_to_datetime_tz(self):
-
-        # xref 8260
-        # uniform returns a DatetimeIndex
-        arr = [pd.Timestamp('2013-01-01 13:00:00-0800', tz='US/Pacific'),
-               pd.Timestamp('2013-01-02 14:00:00-0800', tz='US/Pacific')]
-        result = pd.to_datetime(arr)
-        expected = DatetimeIndex(
-            ['2013-01-01 13:00:00', '2013-01-02 14:00:00'], tz='US/Pacific')
-        tm.assert_index_equal(result, expected)
-
-        # mixed tzs will raise
-        arr = [pd.Timestamp('2013-01-01 13:00:00', tz='US/Pacific'),
-               pd.Timestamp('2013-01-02 14:00:00', tz='US/Eastern')]
-        self.assertRaises(ValueError, lambda: pd.to_datetime(arr))
-
-    def test_to_datetime_tz_pytz(self):
-
-        # xref 8260
-        tm._skip_if_no_pytz()
-        import pytz
-
-        us_eastern = pytz.timezone('US/Eastern')
-        arr = np.array([us_eastern.localize(datetime(year=2000, month=1, day=1,
-                                                     hour=3, minute=0)),
-                        us_eastern.localize(datetime(year=2000, month=6, day=1,
-                                                     hour=3, minute=0))],
-                       dtype=object)
-        result = pd.to_datetime(arr, utc=True)
-        expected = DatetimeIndex(['2000-01-01 08:00:00+00:00',
-                                  '2000-06-01 07:00:00+00:00'],
-                                 dtype='datetime64[ns, UTC]', freq=None)
-        tm.assert_index_equal(result, expected)
-
-    def test_to_datetime_utc_is_true(self):
-        # See gh-11934
-        start = pd.Timestamp('2014-01-01', tz='utc')
-        end = pd.Timestamp('2014-01-03', tz='utc')
-        date_range = pd.bdate_range(start, end)
-
-        result = pd.to_datetime(date_range, utc=True)
-        expected = pd.DatetimeIndex(data=date_range)
-        tm.assert_index_equal(result, expected)
-
-    def test_to_datetime_tz_psycopg2(self):
-
-        # xref 8260
-        try:
-            import psycopg2
-        except ImportError:
-            raise nose.SkipTest("no psycopg2 installed")
-
-        # misc cases
-        tz1 = psycopg2.tz.FixedOffsetTimezone(offset=-300, name=None)
-        tz2 = psycopg2.tz.FixedOffsetTimezone(offset=-240, name=None)
-        arr = np.array([datetime(2000, 1, 1, 3, 0, tzinfo=tz1),
-                        datetime(2000, 6, 1, 3, 0, tzinfo=tz2)],
-                       dtype=object)
-
-        result = pd.to_datetime(arr, errors='coerce', utc=True)
-        expected = DatetimeIndex(['2000-01-01 08:00:00+00:00',
-                                  '2000-06-01 07:00:00+00:00'],
-                                 dtype='datetime64[ns, UTC]', freq=None)
-        tm.assert_index_equal(result, expected)
-
-        # dtype coercion
-        i = pd.DatetimeIndex([
-            '2000-01-01 08:00:00+00:00'
-        ], tz=psycopg2.tz.FixedOffsetTimezone(offset=-300, name=None))
-        self.assertFalse(is_datetime64_ns_dtype(i))
-
-        # tz coercion
-        result = pd.to_datetime(i, errors='coerce')
-        tm.assert_index_equal(result, i)
-
-        result = pd.to_datetime(i, errors='coerce', utc=True)
-        expected = pd.DatetimeIndex(['2000-01-01 13:00:00'],
-                                    dtype='datetime64[ns, UTC]')
-        tm.assert_index_equal(result, expected)
-
-    def test_datetime_bool(self):
-        # GH13176
-        with self.assertRaises(TypeError):
-            to_datetime(False)
-        self.assertTrue(to_datetime(False, errors="coerce") is tslib.NaT)
-        self.assertEqual(to_datetime(False, errors="ignore"), False)
-        with self.assertRaises(TypeError):
-            to_datetime(True)
-        self.assertTrue(to_datetime(True, errors="coerce") is tslib.NaT)
-        self.assertEqual(to_datetime(True, errors="ignore"), True)
-        with self.assertRaises(TypeError):
-            to_datetime([False, datetime.today()])
-        with self.assertRaises(TypeError):
-            to_datetime(['20130101', True])
-        tm.assert_index_equal(to_datetime([0, False, tslib.NaT, 0.0],
-                                          errors="coerce"),
-                              DatetimeIndex([to_datetime(0), tslib.NaT,
-                                             tslib.NaT, to_datetime(0)]))
-
-    def test_datetime_invalid_datatype(self):
-        # GH13176
-
-        with self.assertRaises(TypeError):
-            pd.to_datetime(bool)
-        with self.assertRaises(TypeError):
-            pd.to_datetime(pd.to_datetime)
-
-    def test_unit(self):
-        # GH 11758
-        # test proper behavior with errors
-
-        with self.assertRaises(ValueError):
-            to_datetime([1], unit='D', format='%Y%m%d')
-
-        values = [11111111, 1, 1.0, tslib.iNaT, pd.NaT, np.nan,
-                  'NaT', '']
-        result = to_datetime(values, unit='D', errors='ignore')
-        expected = Index([11111111, Timestamp('1970-01-02'),
-                          Timestamp('1970-01-02'), pd.NaT,
-                          pd.NaT, pd.NaT, pd.NaT, pd.NaT],
-                         dtype=object)
-        tm.assert_index_equal(result, expected)
-
-        result = to_datetime(values, unit='D', errors='coerce')
-        expected = DatetimeIndex(['NaT', '1970-01-02', '1970-01-02',
-                                  'NaT', 'NaT', 'NaT', 'NaT', 'NaT'])
-        tm.assert_index_equal(result, expected)
-
-        with self.assertRaises(tslib.OutOfBoundsDatetime):
-            to_datetime(values, unit='D', errors='raise')
-
-        values = [1420043460000, tslib.iNaT, pd.NaT, np.nan, 'NaT']
-
-        result = to_datetime(values, errors='ignore', unit='s')
-        expected = Index([1420043460000, pd.NaT, pd.NaT,
-                          pd.NaT, pd.NaT], dtype=object)
-        tm.assert_index_equal(result, expected)
-
-        result = to_datetime(values, errors='coerce', unit='s')
-        expected = DatetimeIndex(['NaT', 'NaT', 'NaT', 'NaT', 'NaT'])
-        tm.assert_index_equal(result, expected)
-
-        with self.assertRaises(tslib.OutOfBoundsDatetime):
-            to_datetime(values, errors='raise', unit='s')
-
-        # if we have a string, then we raise a ValueError
-        # and NOT an OutOfBoundsDatetime
-        for val in ['foo', Timestamp('20130101')]:
-            try:
-                to_datetime(val, errors='raise', unit='s')
-            except tslib.OutOfBoundsDatetime:
-                raise AssertionError("incorrect exception raised")
-            except ValueError:
-                pass
-
-    def test_unit_consistency(self):
-
-        # consistency of conversions
-        expected = Timestamp('1970-05-09 14:25:11')
-        result = pd.to_datetime(11111111, unit='s', errors='raise')
-        self.assertEqual(result, expected)
-        self.assertIsInstance(result, Timestamp)
-
-        result = pd.to_datetime(11111111, unit='s', errors='coerce')
-        self.assertEqual(result, expected)
-        self.assertIsInstance(result, Timestamp)
-
-        result = pd.to_datetime(11111111, unit='s', errors='ignore')
-        self.assertEqual(result, expected)
-        self.assertIsInstance(result, Timestamp)
-
-    def test_unit_with_numeric(self):
-
-        # GH 13180
-        # coercions from floats/ints are ok
-        expected = DatetimeIndex(['2015-06-19 05:33:20',
-                                  '2015-05-27 22:33:20'])
-        arr1 = [1.434692e+18, 1.432766e+18]
-        arr2 = np.array(arr1).astype('int64')
-        for errors in ['ignore', 'raise', 'coerce']:
-            result = pd.to_datetime(arr1, errors=errors)
-            tm.assert_index_equal(result, expected)
-
-            result = pd.to_datetime(arr2, errors=errors)
-            tm.assert_index_equal(result, expected)
-
-        # but we want to make sure that we are coercing
-        # if we have ints/strings
-        expected = DatetimeIndex(['NaT',
-                                  '2015-06-19 05:33:20',
-                                  '2015-05-27 22:33:20'])
-        arr = ['foo', 1.434692e+18, 1.432766e+18]
-        result = pd.to_datetime(arr, errors='coerce')
-        tm.assert_index_equal(result, expected)
-
-        expected = DatetimeIndex(['2015-06-19 05:33:20',
-                                  '2015-05-27 22:33:20',
-                                  'NaT',
-                                  'NaT'])
-        arr =
[1.434692e+18, 1.432766e+18, 'foo', 'NaT'] - result = pd.to_datetime(arr, errors='coerce') - tm.assert_index_equal(result, expected) - - def test_unit_mixed(self): - - # mixed integers/datetimes - expected = DatetimeIndex(['2013-01-01', 'NaT', 'NaT']) - arr = [pd.Timestamp('20130101'), 1.434692e+18, 1.432766e+18] - result = pd.to_datetime(arr, errors='coerce') - tm.assert_index_equal(result, expected) - - with self.assertRaises(ValueError): - pd.to_datetime(arr, errors='raise') - - expected = DatetimeIndex(['NaT', - 'NaT', - '2013-01-01']) - arr = [1.434692e+18, 1.432766e+18, pd.Timestamp('20130101')] - result = pd.to_datetime(arr, errors='coerce') - tm.assert_index_equal(result, expected) - - with self.assertRaises(ValueError): - pd.to_datetime(arr, errors='raise') - - def test_index_to_datetime(self): - idx = Index(['1/1/2000', '1/2/2000', '1/3/2000']) - - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): - result = idx.to_datetime() - expected = DatetimeIndex(pd.to_datetime(idx.values)) - tm.assert_index_equal(result, expected) - - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): - today = datetime.today() - idx = Index([today], dtype=object) - result = idx.to_datetime() - expected = DatetimeIndex([today]) - tm.assert_index_equal(result, expected) - - def test_dataframe(self): - - df = DataFrame({'year': [2015, 2016], - 'month': [2, 3], - 'day': [4, 5], - 'hour': [6, 7], - 'minute': [58, 59], - 'second': [10, 11], - 'ms': [1, 1], - 'us': [2, 2], - 'ns': [3, 3]}) - - result = to_datetime({'year': df['year'], - 'month': df['month'], - 'day': df['day']}) - expected = Series([Timestamp('20150204 00:00:00'), - Timestamp('20160305 00:0:00')]) - assert_series_equal(result, expected) - - # dict-like - result = to_datetime(df[['year', 'month', 'day']].to_dict()) - assert_series_equal(result, expected) - - # dict but with constructable - df2 = df[['year', 'month', 'day']].to_dict() - df2['month'] = 2 - result = to_datetime(df2) - expected2 = Series([Timestamp('20150204 00:00:00'), - Timestamp('20160205 00:0:00')]) - assert_series_equal(result, expected2) - - # unit mappings - units = [{'year': 'years', - 'month': 'months', - 'day': 'days', - 'hour': 'hours', - 'minute': 'minutes', - 'second': 'seconds'}, - {'year': 'year', - 'month': 'month', - 'day': 'day', - 'hour': 'hour', - 'minute': 'minute', - 'second': 'second'}, - ] - - for d in units: - result = to_datetime(df[list(d.keys())].rename(columns=d)) - expected = Series([Timestamp('20150204 06:58:10'), - Timestamp('20160305 07:59:11')]) - assert_series_equal(result, expected) - - d = {'year': 'year', - 'month': 'month', - 'day': 'day', - 'hour': 'hour', - 'minute': 'minute', - 'second': 'second', - 'ms': 'ms', - 'us': 'us', - 'ns': 'ns'} - - result = to_datetime(df.rename(columns=d)) - expected = Series([Timestamp('20150204 06:58:10.001002003'), - Timestamp('20160305 07:59:11.001002003')]) - assert_series_equal(result, expected) - - # coerce back to int - result = to_datetime(df.astype(str)) - assert_series_equal(result, expected) - - # passing coerce - df2 = DataFrame({'year': [2015, 2016], - 'month': [2, 20], - 'day': [4, 5]}) - with self.assertRaises(ValueError): - to_datetime(df2) - result = to_datetime(df2, errors='coerce') - expected = Series([Timestamp('20150204 00:00:00'), - pd.NaT]) - assert_series_equal(result, expected) - - # extra columns - with self.assertRaises(ValueError): - df2 = df.copy() - df2['foo'] = 1 - to_datetime(df2) - - # not enough - for c in [['year'], - ['year', 
'month'], - ['year', 'month', 'second'], - ['month', 'day'], - ['year', 'day', 'second']]: - with self.assertRaises(ValueError): - to_datetime(df[c]) - - # duplicates - df2 = DataFrame({'year': [2015, 2016], - 'month': [2, 20], - 'day': [4, 5]}) - df2.columns = ['year', 'year', 'day'] - with self.assertRaises(ValueError): - to_datetime(df2) - - df2 = DataFrame({'year': [2015, 2016], - 'month': [2, 20], - 'day': [4, 5], - 'hour': [4, 5]}) - df2.columns = ['year', 'month', 'day', 'day'] - with self.assertRaises(ValueError): - to_datetime(df2) - - def test_dataframe_dtypes(self): - # #13451 - df = DataFrame({'year': [2015, 2016], - 'month': [2, 3], - 'day': [4, 5]}) - - # int16 - result = to_datetime(df.astype('int16')) - expected = Series([Timestamp('20150204 00:00:00'), - Timestamp('20160305 00:00:00')]) - assert_series_equal(result, expected) - - # mixed dtypes - df['month'] = df['month'].astype('int8') - df['day'] = df['day'].astype('int8') - result = to_datetime(df) - expected = Series([Timestamp('20150204 00:00:00'), - Timestamp('20160305 00:00:00')]) - assert_series_equal(result, expected) - - # float - df = DataFrame({'year': [2000, 2001], - 'month': [1.5, 1], - 'day': [1, 1]}) - with self.assertRaises(ValueError): - to_datetime(df) - - -class TestDatetimeIndex(tm.TestCase): - _multiprocess_can_split_ = True - - def test_hash_error(self): - index = date_range('20010101', periods=10) - with tm.assertRaisesRegexp(TypeError, "unhashable type: %r" % - type(index).__name__): - hash(index) - - def test_stringified_slice_with_tz(self): - # GH2658 - import datetime - start = datetime.datetime.now() - idx = DatetimeIndex(start=start, freq="1d", periods=10) - df = DataFrame(lrange(10), index=idx) - df["2013-01-14 23:44:34.437768-05:00":] # no exception here - - def test_append_join_nondatetimeindex(self): - rng = date_range('1/1/2000', periods=10) - idx = Index(['a', 'b', 'c', 'd']) - - result = rng.append(idx) - tm.assertIsInstance(result[0], Timestamp) - - # it works - rng.join(idx, how='outer') - - def test_to_period_nofreq(self): - idx = DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-04']) - self.assertRaises(ValueError, idx.to_period) - - idx = DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'], - freq='infer') - self.assertEqual(idx.freqstr, 'D') - expected = pd.PeriodIndex(['2000-01-01', '2000-01-02', - '2000-01-03'], freq='D') - tm.assert_index_equal(idx.to_period(), expected) - - # GH 7606 - idx = DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03']) - self.assertEqual(idx.freqstr, None) - tm.assert_index_equal(idx.to_period(), expected) - - def test_000constructor_resolution(self): - # 2252 - t1 = Timestamp((1352934390 * 1000000000) + 1000000 + 1000 + 1) - idx = DatetimeIndex([t1]) - - self.assertEqual(idx.nanosecond[0], t1.nanosecond) - - def test_constructor_coverage(self): - rng = date_range('1/1/2000', periods=10.5) - exp = date_range('1/1/2000', periods=10) - tm.assert_index_equal(rng, exp) - - self.assertRaises(ValueError, DatetimeIndex, start='1/1/2000', - periods='foo', freq='D') - - self.assertRaises(ValueError, DatetimeIndex, start='1/1/2000', - end='1/10/2000') - - self.assertRaises(ValueError, DatetimeIndex, '1/1/2000') - - # generator expression - gen = (datetime(2000, 1, 1) + timedelta(i) for i in range(10)) - result = DatetimeIndex(gen) - expected = DatetimeIndex([datetime(2000, 1, 1) + timedelta(i) - for i in range(10)]) - tm.assert_index_equal(result, expected) - - # NumPy string array - strings = np.array(['2000-01-01', '2000-01-02', '2000-01-03']) - 
result = DatetimeIndex(strings) - expected = DatetimeIndex(strings.astype('O')) - tm.assert_index_equal(result, expected) - - from_ints = DatetimeIndex(expected.asi8) - tm.assert_index_equal(from_ints, expected) - - # string with NaT - strings = np.array(['2000-01-01', '2000-01-02', 'NaT']) - result = DatetimeIndex(strings) - expected = DatetimeIndex(strings.astype('O')) - tm.assert_index_equal(result, expected) - - from_ints = DatetimeIndex(expected.asi8) - tm.assert_index_equal(from_ints, expected) - - # non-conforming - self.assertRaises(ValueError, DatetimeIndex, - ['2000-01-01', '2000-01-02', '2000-01-04'], freq='D') - - self.assertRaises(ValueError, DatetimeIndex, start='2011-01-01', - freq='b') - self.assertRaises(ValueError, DatetimeIndex, end='2011-01-01', - freq='B') - self.assertRaises(ValueError, DatetimeIndex, periods=10, freq='D') - - def test_constructor_datetime64_tzformat(self): - # GH 6572 - tm._skip_if_no_pytz() - import pytz - # ISO 8601 format results in pytz.FixedOffset - for freq in ['AS', 'W-SUN']: - idx = date_range('2013-01-01T00:00:00-05:00', - '2016-01-01T23:59:59-05:00', freq=freq) - expected = date_range('2013-01-01T00:00:00', '2016-01-01T23:59:59', - freq=freq, tz=pytz.FixedOffset(-300)) - tm.assert_index_equal(idx, expected) - # Unable to use `US/Eastern` because of DST - expected_i8 = date_range('2013-01-01T00:00:00', - '2016-01-01T23:59:59', freq=freq, - tz='America/Lima') - self.assert_numpy_array_equal(idx.asi8, expected_i8.asi8) - - idx = date_range('2013-01-01T00:00:00+09:00', - '2016-01-01T23:59:59+09:00', freq=freq) - expected = date_range('2013-01-01T00:00:00', '2016-01-01T23:59:59', - freq=freq, tz=pytz.FixedOffset(540)) - tm.assert_index_equal(idx, expected) - expected_i8 = date_range('2013-01-01T00:00:00', - '2016-01-01T23:59:59', freq=freq, - tz='Asia/Tokyo') - self.assert_numpy_array_equal(idx.asi8, expected_i8.asi8) - - tm._skip_if_no_dateutil() - - # Non ISO 8601 format results in dateutil.tz.tzoffset - for freq in ['AS', 'W-SUN']: - idx = date_range('2013/1/1 0:00:00-5:00', '2016/1/1 23:59:59-5:00', - freq=freq) - expected = date_range('2013-01-01T00:00:00', '2016-01-01T23:59:59', - freq=freq, tz=pytz.FixedOffset(-300)) - tm.assert_index_equal(idx, expected) - # Unable to use `US/Eastern` because of DST - expected_i8 = date_range('2013-01-01T00:00:00', - '2016-01-01T23:59:59', freq=freq, - tz='America/Lima') - self.assert_numpy_array_equal(idx.asi8, expected_i8.asi8) - - idx = date_range('2013/1/1 0:00:00+9:00', - '2016/1/1 23:59:59+09:00', freq=freq) - expected = date_range('2013-01-01T00:00:00', '2016-01-01T23:59:59', - freq=freq, tz=pytz.FixedOffset(540)) - tm.assert_index_equal(idx, expected) - expected_i8 = date_range('2013-01-01T00:00:00', - '2016-01-01T23:59:59', freq=freq, - tz='Asia/Tokyo') - self.assert_numpy_array_equal(idx.asi8, expected_i8.asi8) - - def test_constructor_dtype(self): - - # passing a dtype with a tz should localize - idx = DatetimeIndex(['2013-01-01', '2013-01-02'], - dtype='datetime64[ns, US/Eastern]') - expected = DatetimeIndex(['2013-01-01', '2013-01-02'] - ).tz_localize('US/Eastern') - tm.assert_index_equal(idx, expected) - - idx = DatetimeIndex(['2013-01-01', '2013-01-02'], - tz='US/Eastern') - tm.assert_index_equal(idx, expected) - - # if we already have a tz and its not the same, then raise - idx = DatetimeIndex(['2013-01-01', '2013-01-02'], - dtype='datetime64[ns, US/Eastern]') - - self.assertRaises(ValueError, - lambda: DatetimeIndex(idx, - dtype='datetime64[ns]')) - - # this is effectively trying to 
convert tz's
-        self.assertRaises(TypeError,
-                          lambda: DatetimeIndex(idx,
-                                                dtype='datetime64[ns, CET]'))
-        self.assertRaises(ValueError,
-                          lambda: DatetimeIndex(
-                              idx, tz='CET',
-                              dtype='datetime64[ns, US/Eastern]'))
-        result = DatetimeIndex(idx, dtype='datetime64[ns, US/Eastern]')
-        tm.assert_index_equal(idx, result)
-
-    def test_constructor_name(self):
-        idx = DatetimeIndex(start='2000-01-01', periods=1, freq='A',
-                            name='TEST')
-        self.assertEqual(idx.name, 'TEST')
-
-    def test_comparisons_coverage(self):
-        rng = date_range('1/1/2000', periods=10)
-
-        # raise TypeError for now
-        self.assertRaises(TypeError, rng.__lt__, rng[3].value)
-
-        result = rng == list(rng)
-        exp = rng == rng
-        self.assert_numpy_array_equal(result, exp)
-
-    def test_comparisons_nat(self):
-
-        fidx1 = pd.Index([1.0, np.nan, 3.0, np.nan, 5.0, 7.0])
-        fidx2 = pd.Index([2.0, 3.0, np.nan, np.nan, 6.0, 7.0])
-
-        didx1 = pd.DatetimeIndex(['2014-01-01', pd.NaT, '2014-03-01', pd.NaT,
-                                  '2014-05-01', '2014-07-01'])
-        didx2 = pd.DatetimeIndex(['2014-02-01', '2014-03-01', pd.NaT, pd.NaT,
-                                  '2014-06-01', '2014-07-01'])
-        darr = np.array([np_datetime64_compat('2014-02-01 00:00Z'),
-                         np_datetime64_compat('2014-03-01 00:00Z'),
-                         np_datetime64_compat('nat'), np.datetime64('nat'),
-                         np_datetime64_compat('2014-06-01 00:00Z'),
-                         np_datetime64_compat('2014-07-01 00:00Z')])
-
-        if _np_version_under1p8:
-            # cannot test array because np.datetime64('nat') returns today's date
-            cases = [(fidx1, fidx2), (didx1, didx2)]
-        else:
-            cases = [(fidx1, fidx2), (didx1, didx2), (didx1, darr)]
-
-        # Check pd.NaT is handled the same as np.nan
-        with tm.assert_produces_warning(None):
-            for idx1, idx2 in cases:
-
-                result = idx1 < idx2
-                expected = np.array([True, False, False, False, True, False])
-                self.assert_numpy_array_equal(result, expected)
-
-                result = idx2 > idx1
-                expected = np.array([True, False, False, False, True, False])
-                self.assert_numpy_array_equal(result, expected)
-
-                result = idx1 <= idx2
-                expected = np.array([True, False, False, False, True, True])
-                self.assert_numpy_array_equal(result, expected)
-
-                result = idx2 >= idx1
-                expected = np.array([True, False, False, False, True, True])
-                self.assert_numpy_array_equal(result, expected)
-
-                result = idx1 == idx2
-                expected = np.array([False, False, False, False, False, True])
-                self.assert_numpy_array_equal(result, expected)
-
-                result = idx1 != idx2
-                expected = np.array([True, True, True, True, True, False])
-                self.assert_numpy_array_equal(result, expected)
-
-        with tm.assert_produces_warning(None):
-            for idx1, val in [(fidx1, np.nan), (didx1, pd.NaT)]:
-                result = idx1 < val
-                expected = np.array([False, False, False, False, False, False])
-                self.assert_numpy_array_equal(result, expected)
-                result = idx1 > val
-                self.assert_numpy_array_equal(result, expected)
-
-                result = idx1 <= val
-                self.assert_numpy_array_equal(result, expected)
-                result = idx1 >= val
-                self.assert_numpy_array_equal(result, expected)
-
-                result = idx1 == val
-                self.assert_numpy_array_equal(result, expected)
-
-                result = idx1 != val
-                expected = np.array([True, True, True, True, True, True])
-                self.assert_numpy_array_equal(result, expected)
-
-        # Check pd.NaT is handled the same as np.nan
-        with tm.assert_produces_warning(None):
-            for idx1, val in [(fidx1, 3), (didx1, datetime(2014, 3, 1))]:
-                result = idx1 < val
-                expected = np.array([True, False, False, False, False, False])
-                self.assert_numpy_array_equal(result, expected)
-                result = idx1 > val
-                expected = np.array([False, False, False, False, True,
True]) - self.assert_numpy_array_equal(result, expected) - - result = idx1 <= val - expected = np.array([True, False, True, False, False, False]) - self.assert_numpy_array_equal(result, expected) - result = idx1 >= val - expected = np.array([False, False, True, False, True, True]) - self.assert_numpy_array_equal(result, expected) - - result = idx1 == val - expected = np.array([False, False, True, False, False, False]) - self.assert_numpy_array_equal(result, expected) - - result = idx1 != val - expected = np.array([True, True, False, True, True, True]) - self.assert_numpy_array_equal(result, expected) - - def test_map(self): - rng = date_range('1/1/2000', periods=10) - - f = lambda x: x.strftime('%Y%m%d') - result = rng.map(f) - exp = np.array([f(x) for x in rng], dtype='= -1') - with tm.assertRaisesRegexp(ValueError, msg): - idx.take(np.array([1, 0, -2]), fill_value=True) - with tm.assertRaisesRegexp(ValueError, msg): - idx.take(np.array([1, 0, -5]), fill_value=True) - - with tm.assertRaises(IndexError): - idx.take(np.array([1, -5])) - - def test_take_fill_value_with_timezone(self): - idx = pd.DatetimeIndex(['2011-01-01', '2011-02-01', '2011-03-01'], - name='xxx', tz='US/Eastern') - result = idx.take(np.array([1, 0, -1])) - expected = pd.DatetimeIndex(['2011-02-01', '2011-01-01', '2011-03-01'], - name='xxx', tz='US/Eastern') - tm.assert_index_equal(result, expected) - - # fill_value - result = idx.take(np.array([1, 0, -1]), fill_value=True) - expected = pd.DatetimeIndex(['2011-02-01', '2011-01-01', 'NaT'], - name='xxx', tz='US/Eastern') - tm.assert_index_equal(result, expected) - - # allow_fill=False - result = idx.take(np.array([1, 0, -1]), allow_fill=False, - fill_value=True) - expected = pd.DatetimeIndex(['2011-02-01', '2011-01-01', '2011-03-01'], - name='xxx', tz='US/Eastern') - tm.assert_index_equal(result, expected) - - msg = ('When allow_fill=True and fill_value is not None, ' - 'all indices must be >= -1') - with tm.assertRaisesRegexp(ValueError, msg): - idx.take(np.array([1, 0, -2]), fill_value=True) - with tm.assertRaisesRegexp(ValueError, msg): - idx.take(np.array([1, 0, -5]), fill_value=True) - - with tm.assertRaises(IndexError): - idx.take(np.array([1, -5])) - - def test_map_bug_1677(self): - index = DatetimeIndex(['2012-04-25 09:30:00.393000']) - f = index.asof - - result = index.map(f) - expected = np.array([f(index[0])]) - self.assert_numpy_array_equal(result, expected) - - def test_groupby_function_tuple_1677(self): - df = DataFrame(np.random.rand(100), - index=date_range("1/1/2000", periods=100)) - monthly_group = df.groupby(lambda x: (x.year, x.month)) - - result = monthly_group.mean() - tm.assertIsInstance(result.index[0], tuple) - - def test_append_numpy_bug_1681(self): - # another datetime64 bug - dr = date_range('2011/1/1', '2012/1/1', freq='W-FRI') - a = DataFrame() - c = DataFrame({'A': 'foo', 'B': dr}, index=dr) - - result = a.append(c) - self.assertTrue((result['B'] == dr).all()) - - def test_isin(self): - index = tm.makeDateIndex(4) - result = index.isin(index) - self.assertTrue(result.all()) - - result = index.isin(list(index)) - self.assertTrue(result.all()) - - assert_almost_equal(index.isin([index[2], 5]), - np.array([False, False, True, False])) - - def test_union(self): - i1 = Int64Index(np.arange(0, 20, 2)) - i2 = Int64Index(np.arange(10, 30, 2)) - result = i1.union(i2) - expected = Int64Index(np.arange(0, 30, 2)) - tm.assert_index_equal(result, expected) - - def test_union_with_DatetimeIndex(self): - i1 = Int64Index(np.arange(0, 20, 2)) - i2 = 
DatetimeIndex(start='2012-01-03 00:00:00', periods=10, freq='D') - i1.union(i2) # Works - i2.union(i1) # Fails with "AttributeError: can't set attribute" - - def test_time(self): - rng = pd.date_range('1/1/2000', freq='12min', periods=10) - result = pd.Index(rng).time - expected = [t.time() for t in rng] - self.assertTrue((result == expected).all()) - - def test_date(self): - rng = pd.date_range('1/1/2000', freq='12H', periods=10) - result = pd.Index(rng).date - expected = [t.date() for t in rng] - self.assertTrue((result == expected).all()) - - def test_does_not_convert_mixed_integer(self): - df = tm.makeCustomDataframe(10, 10, - data_gen_f=lambda *args, **kwargs: randn(), - r_idx_type='i', c_idx_type='dt') - cols = df.columns.join(df.index, how='outer') - joined = cols.join(df.columns) - self.assertEqual(cols.dtype, np.dtype('O')) - self.assertEqual(cols.dtype, joined.dtype) - tm.assert_numpy_array_equal(cols.values, joined.values) - - def test_slice_keeps_name(self): - # GH4226 - st = pd.Timestamp('2013-07-01 00:00:00', tz='America/Los_Angeles') - et = pd.Timestamp('2013-07-02 00:00:00', tz='America/Los_Angeles') - dr = pd.date_range(st, et, freq='H', name='timebucket') - self.assertEqual(dr[1:].name, dr.name) - - def test_join_self(self): - index = date_range('1/1/2000', periods=10) - kinds = 'outer', 'inner', 'left', 'right' - for kind in kinds: - joined = index.join(index, how=kind) - self.assertIs(index, joined) - - def assert_index_parameters(self, index): - assert index.freq == '40960N' - assert index.inferred_freq == '40960N' - - def test_ns_index(self): - nsamples = 400 - ns = int(1e9 / 24414) - dtstart = np.datetime64('2012-09-20T00:00:00') - - dt = dtstart + np.arange(nsamples) * np.timedelta64(ns, 'ns') - freq = ns * offsets.Nano() - index = pd.DatetimeIndex(dt, freq=freq, name='time') - self.assert_index_parameters(index) - - new_index = pd.DatetimeIndex(start=index[0], end=index[-1], - freq=index.freq) - self.assert_index_parameters(new_index) - - def test_join_with_period_index(self): - df = tm.makeCustomDataframe( - 10, 10, data_gen_f=lambda *args: np.random.randint(2), - c_idx_type='p', r_idx_type='dt') - s = df.iloc[:5, 0] - joins = 'left', 'right', 'inner', 'outer' - - for join in joins: - with tm.assertRaisesRegexp(ValueError, 'can only call with other ' - 'PeriodIndex-ed objects'): - df.columns.join(s.index, how=join) - - def test_factorize(self): - idx1 = DatetimeIndex(['2014-01', '2014-01', '2014-02', '2014-02', - '2014-03', '2014-03']) - - exp_arr = np.array([0, 0, 1, 1, 2, 2], dtype=np.intp) - exp_idx = DatetimeIndex(['2014-01', '2014-02', '2014-03']) - - arr, idx = idx1.factorize() - self.assert_numpy_array_equal(arr, exp_arr) - tm.assert_index_equal(idx, exp_idx) - - arr, idx = idx1.factorize(sort=True) - self.assert_numpy_array_equal(arr, exp_arr) - tm.assert_index_equal(idx, exp_idx) - - # tz must be preserved - idx1 = idx1.tz_localize('Asia/Tokyo') - exp_idx = exp_idx.tz_localize('Asia/Tokyo') - - arr, idx = idx1.factorize() - self.assert_numpy_array_equal(arr, exp_arr) - tm.assert_index_equal(idx, exp_idx) - - idx2 = pd.DatetimeIndex(['2014-03', '2014-03', '2014-02', '2014-01', - '2014-03', '2014-01']) - - exp_arr = np.array([2, 2, 1, 0, 2, 0], dtype=np.intp) - exp_idx = DatetimeIndex(['2014-01', '2014-02', '2014-03']) - arr, idx = idx2.factorize(sort=True) - self.assert_numpy_array_equal(arr, exp_arr) - tm.assert_index_equal(idx, exp_idx) - - exp_arr = np.array([0, 0, 1, 2, 0, 2], dtype=np.intp) - exp_idx = DatetimeIndex(['2014-03', '2014-02', '2014-01']) 
- arr, idx = idx2.factorize() - self.assert_numpy_array_equal(arr, exp_arr) - tm.assert_index_equal(idx, exp_idx) - - # freq must be preserved - idx3 = date_range('2000-01', periods=4, freq='M', tz='Asia/Tokyo') - exp_arr = np.array([0, 1, 2, 3], dtype=np.intp) - arr, idx = idx3.factorize() - self.assert_numpy_array_equal(arr, exp_arr) - tm.assert_index_equal(idx, idx3) - - def test_factorize_tz(self): - # GH 13750 - for tz in [None, 'UTC', 'US/Eastern', 'Asia/Tokyo']: - base = pd.date_range('2016-11-05', freq='H', periods=100, tz=tz) - idx = base.repeat(5) - - exp_arr = np.arange(100, dtype=np.intp).repeat(5) - - for obj in [idx, pd.Series(idx)]: - arr, res = obj.factorize() - self.assert_numpy_array_equal(arr, exp_arr) - tm.assert_index_equal(res, base) - - def test_factorize_dst(self): - # GH 13750 - idx = pd.date_range('2016-11-06', freq='H', periods=12, - tz='US/Eastern') - - for obj in [idx, pd.Series(idx)]: - arr, res = obj.factorize() - self.assert_numpy_array_equal(arr, np.arange(12, dtype=np.intp)) - tm.assert_index_equal(res, idx) - - idx = pd.date_range('2016-06-13', freq='H', periods=12, - tz='US/Eastern') - - for obj in [idx, pd.Series(idx)]: - arr, res = obj.factorize() - self.assert_numpy_array_equal(arr, np.arange(12, dtype=np.intp)) - tm.assert_index_equal(res, idx) - - def test_slice_with_negative_step(self): - ts = Series(np.arange(20), - date_range('2014-01-01', periods=20, freq='MS')) - SLC = pd.IndexSlice - - def assert_slices_equivalent(l_slc, i_slc): - assert_series_equal(ts[l_slc], ts.iloc[i_slc]) - assert_series_equal(ts.loc[l_slc], ts.iloc[i_slc]) - assert_series_equal(ts.ix[l_slc], ts.iloc[i_slc]) - - assert_slices_equivalent(SLC[Timestamp('2014-10-01')::-1], SLC[9::-1]) - assert_slices_equivalent(SLC['2014-10-01'::-1], SLC[9::-1]) - - assert_slices_equivalent(SLC[:Timestamp('2014-10-01'):-1], SLC[:8:-1]) - assert_slices_equivalent(SLC[:'2014-10-01':-1], SLC[:8:-1]) - - assert_slices_equivalent(SLC['2015-02-01':'2014-10-01':-1], - SLC[13:8:-1]) - assert_slices_equivalent(SLC[Timestamp('2015-02-01'):Timestamp( - '2014-10-01'):-1], SLC[13:8:-1]) - assert_slices_equivalent(SLC['2015-02-01':Timestamp('2014-10-01'):-1], - SLC[13:8:-1]) - assert_slices_equivalent(SLC[Timestamp('2015-02-01'):'2014-10-01':-1], - SLC[13:8:-1]) - - assert_slices_equivalent(SLC['2014-10-01':'2015-02-01':-1], SLC[:0]) - - def test_slice_with_zero_step_raises(self): - ts = Series(np.arange(20), - date_range('2014-01-01', periods=20, freq='MS')) - self.assertRaisesRegexp(ValueError, 'slice step cannot be zero', - lambda: ts[::0]) - self.assertRaisesRegexp(ValueError, 'slice step cannot be zero', - lambda: ts.loc[::0]) - self.assertRaisesRegexp(ValueError, 'slice step cannot be zero', - lambda: ts.ix[::0]) - - def test_slice_bounds_empty(self): - # GH 14354 - empty_idx = DatetimeIndex(freq='1H', periods=0, end='2015') - - right = empty_idx._maybe_cast_slice_bound('2015-01-02', 'right', 'loc') - exp = Timestamp('2015-01-02 23:59:59.999999999') - self.assertEqual(right, exp) - - left = empty_idx._maybe_cast_slice_bound('2015-01-02', 'left', 'loc') - exp = Timestamp('2015-01-02 00:00:00') - self.assertEqual(left, exp) - - -class TestDatetime64(tm.TestCase): - """ - Also test support for datetime64[ns] in Series / DataFrame - """ - - def setUp(self): - dti = DatetimeIndex(start=datetime(2005, 1, 1), - end=datetime(2005, 1, 10), freq='Min') - self.series = Series(rand(len(dti)), dti) - - def test_datetimeindex_accessors(self): - dti = DatetimeIndex(freq='D', start=datetime(1998, 1, 1), 
periods=365) - - self.assertEqual(dti.year[0], 1998) - self.assertEqual(dti.month[0], 1) - self.assertEqual(dti.day[0], 1) - self.assertEqual(dti.hour[0], 0) - self.assertEqual(dti.minute[0], 0) - self.assertEqual(dti.second[0], 0) - self.assertEqual(dti.microsecond[0], 0) - self.assertEqual(dti.dayofweek[0], 3) - - self.assertEqual(dti.dayofyear[0], 1) - self.assertEqual(dti.dayofyear[120], 121) - - self.assertEqual(dti.weekofyear[0], 1) - self.assertEqual(dti.weekofyear[120], 18) - - self.assertEqual(dti.quarter[0], 1) - self.assertEqual(dti.quarter[120], 2) - - self.assertEqual(dti.days_in_month[0], 31) - self.assertEqual(dti.days_in_month[90], 30) - - self.assertEqual(dti.is_month_start[0], True) - self.assertEqual(dti.is_month_start[1], False) - self.assertEqual(dti.is_month_start[31], True) - self.assertEqual(dti.is_quarter_start[0], True) - self.assertEqual(dti.is_quarter_start[90], True) - self.assertEqual(dti.is_year_start[0], True) - self.assertEqual(dti.is_year_start[364], False) - self.assertEqual(dti.is_month_end[0], False) - self.assertEqual(dti.is_month_end[30], True) - self.assertEqual(dti.is_month_end[31], False) - self.assertEqual(dti.is_month_end[364], True) - self.assertEqual(dti.is_quarter_end[0], False) - self.assertEqual(dti.is_quarter_end[30], False) - self.assertEqual(dti.is_quarter_end[89], True) - self.assertEqual(dti.is_quarter_end[364], True) - self.assertEqual(dti.is_year_end[0], False) - self.assertEqual(dti.is_year_end[364], True) - - # GH 11128 - self.assertEqual(dti.weekday_name[4], u'Monday') - self.assertEqual(dti.weekday_name[5], u'Tuesday') - self.assertEqual(dti.weekday_name[6], u'Wednesday') - self.assertEqual(dti.weekday_name[7], u'Thursday') - self.assertEqual(dti.weekday_name[8], u'Friday') - self.assertEqual(dti.weekday_name[9], u'Saturday') - self.assertEqual(dti.weekday_name[10], u'Sunday') - - self.assertEqual(Timestamp('2016-04-04').weekday_name, u'Monday') - self.assertEqual(Timestamp('2016-04-05').weekday_name, u'Tuesday') - self.assertEqual(Timestamp('2016-04-06').weekday_name, u'Wednesday') - self.assertEqual(Timestamp('2016-04-07').weekday_name, u'Thursday') - self.assertEqual(Timestamp('2016-04-08').weekday_name, u'Friday') - self.assertEqual(Timestamp('2016-04-09').weekday_name, u'Saturday') - self.assertEqual(Timestamp('2016-04-10').weekday_name, u'Sunday') - - self.assertEqual(len(dti.year), 365) - self.assertEqual(len(dti.month), 365) - self.assertEqual(len(dti.day), 365) - self.assertEqual(len(dti.hour), 365) - self.assertEqual(len(dti.minute), 365) - self.assertEqual(len(dti.second), 365) - self.assertEqual(len(dti.microsecond), 365) - self.assertEqual(len(dti.dayofweek), 365) - self.assertEqual(len(dti.dayofyear), 365) - self.assertEqual(len(dti.weekofyear), 365) - self.assertEqual(len(dti.quarter), 365) - self.assertEqual(len(dti.is_month_start), 365) - self.assertEqual(len(dti.is_month_end), 365) - self.assertEqual(len(dti.is_quarter_start), 365) - self.assertEqual(len(dti.is_quarter_end), 365) - self.assertEqual(len(dti.is_year_start), 365) - self.assertEqual(len(dti.is_year_end), 365) - self.assertEqual(len(dti.weekday_name), 365) - - dti = DatetimeIndex(freq='BQ-FEB', start=datetime(1998, 1, 1), - periods=4) - - self.assertEqual(sum(dti.is_quarter_start), 0) - self.assertEqual(sum(dti.is_quarter_end), 4) - self.assertEqual(sum(dti.is_year_start), 0) - self.assertEqual(sum(dti.is_year_end), 1) - - # Ensure is_start/end accessors throw ValueError for CustomBusinessDay, - # CBD requires np >= 1.7 - bday_egypt = 
offsets.CustomBusinessDay(weekmask='Sun Mon Tue Wed Thu') - dti = date_range(datetime(2013, 4, 30), periods=5, freq=bday_egypt) - self.assertRaises(ValueError, lambda: dti.is_month_start) - - dti = DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03']) - - self.assertEqual(dti.is_month_start[0], 1) - - tests = [ - (Timestamp('2013-06-01', freq='M').is_month_start, 1), - (Timestamp('2013-06-01', freq='BM').is_month_start, 0), - (Timestamp('2013-06-03', freq='M').is_month_start, 0), - (Timestamp('2013-06-03', freq='BM').is_month_start, 1), - (Timestamp('2013-02-28', freq='Q-FEB').is_month_end, 1), - (Timestamp('2013-02-28', freq='Q-FEB').is_quarter_end, 1), - (Timestamp('2013-02-28', freq='Q-FEB').is_year_end, 1), - (Timestamp('2013-03-01', freq='Q-FEB').is_month_start, 1), - (Timestamp('2013-03-01', freq='Q-FEB').is_quarter_start, 1), - (Timestamp('2013-03-01', freq='Q-FEB').is_year_start, 1), - (Timestamp('2013-03-31', freq='QS-FEB').is_month_end, 1), - (Timestamp('2013-03-31', freq='QS-FEB').is_quarter_end, 0), - (Timestamp('2013-03-31', freq='QS-FEB').is_year_end, 0), - (Timestamp('2013-02-01', freq='QS-FEB').is_month_start, 1), - (Timestamp('2013-02-01', freq='QS-FEB').is_quarter_start, 1), - (Timestamp('2013-02-01', freq='QS-FEB').is_year_start, 1), - (Timestamp('2013-06-30', freq='BQ').is_month_end, 0), - (Timestamp('2013-06-30', freq='BQ').is_quarter_end, 0), - (Timestamp('2013-06-30', freq='BQ').is_year_end, 0), - (Timestamp('2013-06-28', freq='BQ').is_month_end, 1), - (Timestamp('2013-06-28', freq='BQ').is_quarter_end, 1), - (Timestamp('2013-06-28', freq='BQ').is_year_end, 0), - (Timestamp('2013-06-30', freq='BQS-APR').is_month_end, 0), - (Timestamp('2013-06-30', freq='BQS-APR').is_quarter_end, 0), - (Timestamp('2013-06-30', freq='BQS-APR').is_year_end, 0), - (Timestamp('2013-06-28', freq='BQS-APR').is_month_end, 1), - (Timestamp('2013-06-28', freq='BQS-APR').is_quarter_end, 1), - (Timestamp('2013-03-29', freq='BQS-APR').is_year_end, 1), - (Timestamp('2013-11-01', freq='AS-NOV').is_year_start, 1), - (Timestamp('2013-10-31', freq='AS-NOV').is_year_end, 1), - (Timestamp('2012-02-01').days_in_month, 29), - (Timestamp('2013-02-01').days_in_month, 28)] - - for ts, value in tests: - self.assertEqual(ts, value) - - def test_nanosecond_field(self): - dti = DatetimeIndex(np.arange(10)) - - self.assert_numpy_array_equal(dti.nanosecond, - np.arange(10, dtype=np.int32)) - - def test_datetimeindex_diff(self): - dti1 = DatetimeIndex(freq='Q-JAN', start=datetime(1997, 12, 31), - periods=100) - dti2 = DatetimeIndex(freq='Q-JAN', start=datetime(1997, 12, 31), - periods=98) - self.assertEqual(len(dti1.difference(dti2)), 2) - - def test_fancy_getitem(self): - dti = DatetimeIndex(freq='WOM-1FRI', start=datetime(2005, 1, 1), - end=datetime(2010, 1, 1)) - - s = Series(np.arange(len(dti)), index=dti) - - self.assertEqual(s[48], 48) - self.assertEqual(s['1/2/2009'], 48) - self.assertEqual(s['2009-1-2'], 48) - self.assertEqual(s[datetime(2009, 1, 2)], 48) - self.assertEqual(s[lib.Timestamp(datetime(2009, 1, 2))], 48) - self.assertRaises(KeyError, s.__getitem__, '2009-1-3') - - assert_series_equal(s['3/6/2009':'2009-06-05'], - s[datetime(2009, 3, 6):datetime(2009, 6, 5)]) - - def test_fancy_setitem(self): - dti = DatetimeIndex(freq='WOM-1FRI', start=datetime(2005, 1, 1), - end=datetime(2010, 1, 1)) - - s = Series(np.arange(len(dti)), index=dti) - s[48] = -1 - self.assertEqual(s[48], -1) - s['1/2/2009'] = -2 - self.assertEqual(s[48], -2) - s['1/2/2009':'2009-06-05'] = -3 - self.assertTrue((s[48:54] == 
-3).all()) - - def test_datetimeindex_constructor(self): - arr = ['1/1/2005', '1/2/2005', 'Jn 3, 2005', '2005-01-04'] - self.assertRaises(Exception, DatetimeIndex, arr) - - arr = ['1/1/2005', '1/2/2005', '1/3/2005', '2005-01-04'] - idx1 = DatetimeIndex(arr) - - arr = [datetime(2005, 1, 1), '1/2/2005', '1/3/2005', '2005-01-04'] - idx2 = DatetimeIndex(arr) - - arr = [lib.Timestamp(datetime(2005, 1, 1)), '1/2/2005', '1/3/2005', - '2005-01-04'] - idx3 = DatetimeIndex(arr) - - arr = np.array(['1/1/2005', '1/2/2005', '1/3/2005', - '2005-01-04'], dtype='O') - idx4 = DatetimeIndex(arr) - - arr = to_datetime(['1/1/2005', '1/2/2005', '1/3/2005', '2005-01-04']) - idx5 = DatetimeIndex(arr) - - arr = to_datetime(['1/1/2005', '1/2/2005', 'Jan 3, 2005', '2005-01-04' - ]) - idx6 = DatetimeIndex(arr) - - idx7 = DatetimeIndex(['12/05/2007', '25/01/2008'], dayfirst=True) - idx8 = DatetimeIndex(['2007/05/12', '2008/01/25'], dayfirst=False, - yearfirst=True) - tm.assert_index_equal(idx7, idx8) - - for other in [idx2, idx3, idx4, idx5, idx6]: - self.assertTrue((idx1.values == other.values).all()) - - sdate = datetime(1999, 12, 25) - edate = datetime(2000, 1, 1) - idx = DatetimeIndex(start=sdate, freq='1B', periods=20) - self.assertEqual(len(idx), 20) - self.assertEqual(idx[0], sdate + 0 * offsets.BDay()) - self.assertEqual(idx.freq, 'B') - - idx = DatetimeIndex(end=edate, freq=('D', 5), periods=20) - self.assertEqual(len(idx), 20) - self.assertEqual(idx[-1], edate) - self.assertEqual(idx.freq, '5D') - - idx1 = DatetimeIndex(start=sdate, end=edate, freq='W-SUN') - idx2 = DatetimeIndex(start=sdate, end=edate, - freq=offsets.Week(weekday=6)) - self.assertEqual(len(idx1), len(idx2)) - self.assertEqual(idx1.offset, idx2.offset) - - idx1 = DatetimeIndex(start=sdate, end=edate, freq='QS') - idx2 = DatetimeIndex(start=sdate, end=edate, - freq=offsets.QuarterBegin(startingMonth=1)) - self.assertEqual(len(idx1), len(idx2)) - self.assertEqual(idx1.offset, idx2.offset) - - idx1 = DatetimeIndex(start=sdate, end=edate, freq='BQ') - idx2 = DatetimeIndex(start=sdate, end=edate, - freq=offsets.BQuarterEnd(startingMonth=12)) - self.assertEqual(len(idx1), len(idx2)) - self.assertEqual(idx1.offset, idx2.offset) - - def test_dayfirst(self): - # GH 5917 - arr = ['10/02/2014', '11/02/2014', '12/02/2014'] - expected = DatetimeIndex([datetime(2014, 2, 10), datetime(2014, 2, 11), - datetime(2014, 2, 12)]) - idx1 = DatetimeIndex(arr, dayfirst=True) - idx2 = DatetimeIndex(np.array(arr), dayfirst=True) - idx3 = to_datetime(arr, dayfirst=True) - idx4 = to_datetime(np.array(arr), dayfirst=True) - idx5 = DatetimeIndex(Index(arr), dayfirst=True) - idx6 = DatetimeIndex(Series(arr), dayfirst=True) - tm.assert_index_equal(expected, idx1) - tm.assert_index_equal(expected, idx2) - tm.assert_index_equal(expected, idx3) - tm.assert_index_equal(expected, idx4) - tm.assert_index_equal(expected, idx5) - tm.assert_index_equal(expected, idx6) - - def test_dti_snap(self): - dti = DatetimeIndex(['1/1/2002', '1/2/2002', '1/3/2002', '1/4/2002', - '1/5/2002', '1/6/2002', '1/7/2002'], freq='D') - - res = dti.snap(freq='W-MON') - exp = date_range('12/31/2001', '1/7/2002', freq='w-mon') - exp = exp.repeat([3, 4]) - self.assertTrue((res == exp).all()) - - res = dti.snap(freq='B') - - exp = date_range('1/1/2002', '1/7/2002', freq='b') - exp = exp.repeat([1, 1, 1, 2, 2]) - self.assertTrue((res == exp).all()) - - def test_dti_reset_index_round_trip(self): - dti = DatetimeIndex(start='1/1/2001', end='6/1/2001', freq='D') - d1 = DataFrame({'v': 
np.random.rand(len(dti))}, index=dti) - d2 = d1.reset_index() - self.assertEqual(d2.dtypes[0], np.dtype('M8[ns]')) - d3 = d2.set_index('index') - assert_frame_equal(d1, d3, check_names=False) - - # #2329 - stamp = datetime(2012, 11, 22) - df = DataFrame([[stamp, 12.1]], columns=['Date', 'Value']) - df = df.set_index('Date') - - self.assertEqual(df.index[0], stamp) - self.assertEqual(df.reset_index()['Date'][0], stamp) - - def test_dti_set_index_reindex(self): - # GH 6631 - df = DataFrame(np.random.random(6)) - idx1 = date_range('2011/01/01', periods=6, freq='M', tz='US/Eastern') - idx2 = date_range('2013', periods=6, freq='A', tz='Asia/Tokyo') - - df = df.set_index(idx1) - tm.assert_index_equal(df.index, idx1) - df = df.reindex(idx2) - tm.assert_index_equal(df.index, idx2) - - # 11314 - # with tz - index = date_range(datetime(2015, 10, 1), - datetime(2015, 10, 1, 23), - freq='H', tz='US/Eastern') - df = DataFrame(np.random.randn(24, 1), columns=['a'], index=index) - new_index = date_range(datetime(2015, 10, 2), - datetime(2015, 10, 2, 23), - freq='H', tz='US/Eastern') - - # TODO: unused? - result = df.set_index(new_index) # noqa - - self.assertEqual(new_index.freq, index.freq) - - def test_datetimeindex_union_join_empty(self): - dti = DatetimeIndex(start='1/1/2001', end='2/1/2001', freq='D') - empty = Index([]) - - result = dti.union(empty) - tm.assertIsInstance(result, DatetimeIndex) - self.assertIs(result, result) - - result = dti.join(empty) - tm.assertIsInstance(result, DatetimeIndex) - - def test_series_set_value(self): - # #1561 - - dates = [datetime(2001, 1, 1), datetime(2001, 1, 2)] - index = DatetimeIndex(dates) - - s = Series().set_value(dates[0], 1.) - s2 = s.set_value(dates[1], np.nan) - - exp = Series([1., np.nan], index=index) - - assert_series_equal(s2, exp) - - # s = Series(index[:1], index[:1]) - # s2 = s.set_value(dates[1], index[1]) - # self.assertEqual(s2.values.dtype, 'M8[ns]') - - @slow - def test_slice_locs_indexerror(self): - times = [datetime(2000, 1, 1) + timedelta(minutes=i * 10) - for i in range(100000)] - s = Series(lrange(100000), times) - s.ix[datetime(1900, 1, 1):datetime(2100, 1, 1)] - - def test_slicing_datetimes(self): - - # GH 7523 - - # unique - df = DataFrame(np.arange(4., dtype='float64'), - index=[datetime(2001, 1, i, 10, 00) - for i in [1, 2, 3, 4]]) - result = df.ix[datetime(2001, 1, 1, 10):] - assert_frame_equal(result, df) - result = df.ix[:datetime(2001, 1, 4, 10)] - assert_frame_equal(result, df) - result = df.ix[datetime(2001, 1, 1, 10):datetime(2001, 1, 4, 10)] - assert_frame_equal(result, df) - - result = df.ix[datetime(2001, 1, 1, 11):] - expected = df.iloc[1:] - assert_frame_equal(result, expected) - result = df.ix['20010101 11':] - assert_frame_equal(result, expected) - - # duplicates - df = pd.DataFrame(np.arange(5., dtype='float64'), - index=[datetime(2001, 1, i, 10, 00) - for i in [1, 2, 2, 3, 4]]) - - result = df.ix[datetime(2001, 1, 1, 10):] - assert_frame_equal(result, df) - result = df.ix[:datetime(2001, 1, 4, 10)] - assert_frame_equal(result, df) - result = df.ix[datetime(2001, 1, 1, 10):datetime(2001, 1, 4, 10)] - assert_frame_equal(result, df) - - result = df.ix[datetime(2001, 1, 1, 11):] - expected = df.iloc[1:] - assert_frame_equal(result, expected) - result = df.ix['20010101 11':] - assert_frame_equal(result, expected) - - -class TestSeriesDatetime64(tm.TestCase): - def setUp(self): - self.series = Series(date_range('1/1/2000', periods=10)) - - def test_auto_conversion(self): - series = Series(list(date_range('1/1/2000', 
periods=10))) - self.assertEqual(series.dtype, 'M8[ns]') - - def test_constructor_cant_cast_datetime64(self): - msg = "Cannot cast datetime64 to " - with tm.assertRaisesRegexp(TypeError, msg): - Series(date_range('1/1/2000', periods=10), dtype=float) - - with tm.assertRaisesRegexp(TypeError, msg): - Series(date_range('1/1/2000', periods=10), dtype=int) - - def test_constructor_cast_object(self): - s = Series(date_range('1/1/2000', periods=10), dtype=object) - exp = Series(date_range('1/1/2000', periods=10)) - tm.assert_series_equal(s, exp) - - def test_series_comparison_scalars(self): - val = datetime(2000, 1, 4) - result = self.series > val - expected = Series([x > val for x in self.series]) - self.assert_series_equal(result, expected) - - val = self.series[5] - result = self.series > val - expected = Series([x > val for x in self.series]) - self.assert_series_equal(result, expected) - - def test_between(self): - left, right = self.series[[2, 7]] - - result = self.series.between(left, right) - expected = (self.series >= left) & (self.series <= right) - assert_series_equal(result, expected) - - # --------------------------------------------------------------------- - # NaT support - - def test_NaT_scalar(self): - series = Series([0, 1000, 2000, iNaT], dtype='M8[ns]') - - val = series[3] - self.assertTrue(com.isnull(val)) - - series[2] = val - self.assertTrue(com.isnull(series[2])) - - def test_NaT_cast(self): - # GH10747 - result = Series([np.nan]).astype('M8[ns]') - expected = Series([NaT]) - assert_series_equal(result, expected) - - def test_set_none_nan(self): - self.series[3] = None - self.assertIs(self.series[3], NaT) - - self.series[3:5] = None - self.assertIs(self.series[4], NaT) - - self.series[5] = np.nan - self.assertIs(self.series[5], NaT) - - self.series[5:7] = np.nan - self.assertIs(self.series[6], NaT) - - def test_intercept_astype_object(self): - - # this test no longer makes sense as series is by default already - # M8[ns] - expected = self.series.astype('object') - - df = DataFrame({'a': self.series, - 'b': np.random.randn(len(self.series))}) - exp_dtypes = pd.Series([np.dtype('datetime64[ns]'), - np.dtype('float64')], index=['a', 'b']) - tm.assert_series_equal(df.dtypes, exp_dtypes) - - result = df.values.squeeze() - self.assertTrue((result[:, 0] == expected.values).all()) - - df = DataFrame({'a': self.series, 'b': ['foo'] * len(self.series)}) - - result = df.values.squeeze() - self.assertTrue((result[:, 0] == expected.values).all()) - - def test_nat_operations(self): - # GH 8617 - s = Series([0, pd.NaT], dtype='m8[ns]') - exp = s[0] - self.assertEqual(s.median(), exp) - self.assertEqual(s.min(), exp) - self.assertEqual(s.max(), exp) - - -class TestTimestamp(tm.TestCase): - def test_class_ops_pytz(self): - tm._skip_if_no_pytz() - from pytz import timezone - - def compare(x, y): - self.assertEqual(int(Timestamp(x).value / 1e9), - int(Timestamp(y).value / 1e9)) - - compare(Timestamp.now(), datetime.now()) - compare(Timestamp.now('UTC'), datetime.now(timezone('UTC'))) - compare(Timestamp.utcnow(), datetime.utcnow()) - compare(Timestamp.today(), datetime.today()) - current_time = calendar.timegm(datetime.now().utctimetuple()) - compare(Timestamp.utcfromtimestamp(current_time), - datetime.utcfromtimestamp(current_time)) - compare(Timestamp.fromtimestamp(current_time), - datetime.fromtimestamp(current_time)) - - date_component = datetime.utcnow() - time_component = (date_component + timedelta(minutes=10)).time() - compare(Timestamp.combine(date_component, time_component), - 
datetime.combine(date_component, time_component)) - - def test_class_ops_dateutil(self): - tm._skip_if_no_dateutil() - from dateutil.tz import tzutc - - def compare(x, y): - self.assertEqual(int(np.round(Timestamp(x).value / 1e9)), - int(np.round(Timestamp(y).value / 1e9))) - - compare(Timestamp.now(), datetime.now()) - compare(Timestamp.now('UTC'), datetime.now(tzutc())) - compare(Timestamp.utcnow(), datetime.utcnow()) - compare(Timestamp.today(), datetime.today()) - current_time = calendar.timegm(datetime.now().utctimetuple()) - compare(Timestamp.utcfromtimestamp(current_time), - datetime.utcfromtimestamp(current_time)) - compare(Timestamp.fromtimestamp(current_time), - datetime.fromtimestamp(current_time)) - - date_component = datetime.utcnow() - time_component = (date_component + timedelta(minutes=10)).time() - compare(Timestamp.combine(date_component, time_component), - datetime.combine(date_component, time_component)) - - def test_basics_nanos(self): - val = np.int64(946684800000000000).view('M8[ns]') - stamp = Timestamp(val.view('i8') + 500) - self.assertEqual(stamp.year, 2000) - self.assertEqual(stamp.month, 1) - self.assertEqual(stamp.microsecond, 0) - self.assertEqual(stamp.nanosecond, 500) - - # GH 14415 - val = np.iinfo(np.int64).min + 80000000000000 - stamp = Timestamp(val) - self.assertEqual(stamp.year, 1677) - self.assertEqual(stamp.month, 9) - self.assertEqual(stamp.day, 21) - self.assertEqual(stamp.microsecond, 145224) - self.assertEqual(stamp.nanosecond, 192) - - def test_unit(self): - - def check(val, unit=None, h=1, s=1, us=0): - stamp = Timestamp(val, unit=unit) - self.assertEqual(stamp.year, 2000) - self.assertEqual(stamp.month, 1) - self.assertEqual(stamp.day, 1) - self.assertEqual(stamp.hour, h) - if unit != 'D': - self.assertEqual(stamp.minute, 1) - self.assertEqual(stamp.second, s) - self.assertEqual(stamp.microsecond, us) - else: - self.assertEqual(stamp.minute, 0) - self.assertEqual(stamp.second, 0) - self.assertEqual(stamp.microsecond, 0) - self.assertEqual(stamp.nanosecond, 0) - - ts = Timestamp('20000101 01:01:01') - val = ts.value - days = (ts - Timestamp('1970-01-01')).days - - check(val) - check(val / long(1000), unit='us') - check(val / long(1000000), unit='ms') - check(val / long(1000000000), unit='s') - check(days, unit='D', h=0) - - # using truediv, so these are like floats - if compat.PY3: - check((val + 500000) / long(1000000000), unit='s', us=500) - check((val + 500000000) / long(1000000000), unit='s', us=500000) - check((val + 500000) / long(1000000), unit='ms', us=500) - - # get chopped in py2 - else: - check((val + 500000) / long(1000000000), unit='s') - check((val + 500000000) / long(1000000000), unit='s') - check((val + 500000) / long(1000000), unit='ms') - - # ok - check((val + 500000) / long(1000), unit='us', us=500) - check((val + 500000000) / long(1000000), unit='ms', us=500000) - - # floats - check(val / 1000.0 + 5, unit='us', us=5) - check(val / 1000.0 + 5000, unit='us', us=5000) - check(val / 1000000.0 + 0.5, unit='ms', us=500) - check(val / 1000000.0 + 0.005, unit='ms', us=5) - check(val / 1000000000.0 + 0.5, unit='s', us=500000) - check(days + 0.5, unit='D', h=12) - - # nan - result = Timestamp(np.nan) - self.assertIs(result, NaT) - - result = Timestamp(None) - self.assertIs(result, NaT) - - result = Timestamp(iNaT) - self.assertIs(result, NaT) - - result = Timestamp(NaT) - self.assertIs(result, NaT) - - result = Timestamp('NaT') - self.assertIs(result, NaT) - - self.assertTrue(isnull(Timestamp('nat'))) - - def test_roundtrip(self): - 
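-        # Timestamp.value counts nanoseconds since the Unix epoch; the checks
-        # below rebuild Timestamps from base.value plus Timedelta values
-        # (e.g. pd.Timedelta('5ms').value == 5,000,000 ns) and compare them
-        # against parsing the equivalent fractional-second string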
- # test value to string and back conversions - # further test accessors - base = Timestamp('20140101 00:00:00') - - result = Timestamp(base.value + pd.Timedelta('5ms').value) - self.assertEqual(result, Timestamp(str(base) + ".005000")) - self.assertEqual(result.microsecond, 5000) - - result = Timestamp(base.value + pd.Timedelta('5us').value) - self.assertEqual(result, Timestamp(str(base) + ".000005")) - self.assertEqual(result.microsecond, 5) - - result = Timestamp(base.value + pd.Timedelta('5ns').value) - self.assertEqual(result, Timestamp(str(base) + ".000000005")) - self.assertEqual(result.nanosecond, 5) - self.assertEqual(result.microsecond, 0) - - result = Timestamp(base.value + pd.Timedelta('6ms 5us').value) - self.assertEqual(result, Timestamp(str(base) + ".006005")) - self.assertEqual(result.microsecond, 5 + 6 * 1000) - - result = Timestamp(base.value + pd.Timedelta('200ms 5us').value) - self.assertEqual(result, Timestamp(str(base) + ".200005")) - self.assertEqual(result.microsecond, 5 + 200 * 1000) - - def test_comparison(self): - # 5-18-2012 00:00:00.000 - stamp = long(1337299200000000000) - - val = Timestamp(stamp) - - self.assertEqual(val, val) - self.assertFalse(val != val) - self.assertFalse(val < val) - self.assertTrue(val <= val) - self.assertFalse(val > val) - self.assertTrue(val >= val) - - other = datetime(2012, 5, 18) - self.assertEqual(val, other) - self.assertFalse(val != other) - self.assertFalse(val < other) - self.assertTrue(val <= other) - self.assertFalse(val > other) - self.assertTrue(val >= other) - - other = Timestamp(stamp + 100) - - self.assertNotEqual(val, other) - self.assertNotEqual(val, other) - self.assertTrue(val < other) - self.assertTrue(val <= other) - self.assertTrue(other > val) - self.assertTrue(other >= val) - - def test_compare_invalid(self): - - # GH 8058 - val = Timestamp('20130101 12:01:02') - self.assertFalse(val == 'foo') - self.assertFalse(val == 10.0) - self.assertFalse(val == 1) - self.assertFalse(val == long(1)) - self.assertFalse(val == []) - self.assertFalse(val == {'foo': 1}) - self.assertFalse(val == np.float64(1)) - self.assertFalse(val == np.int64(1)) - - self.assertTrue(val != 'foo') - self.assertTrue(val != 10.0) - self.assertTrue(val != 1) - self.assertTrue(val != long(1)) - self.assertTrue(val != []) - self.assertTrue(val != {'foo': 1}) - self.assertTrue(val != np.float64(1)) - self.assertTrue(val != np.int64(1)) - - # ops testing - df = DataFrame(randn(5, 2)) - a = df[0] - b = Series(randn(5)) - b.name = Timestamp('2000-01-01') - tm.assert_series_equal(a / b, 1 / (b / a)) - - def test_cant_compare_tz_naive_w_aware(self): - tm._skip_if_no_pytz() - # #1404 - a = Timestamp('3/12/2012') - b = Timestamp('3/12/2012', tz='utc') - - self.assertRaises(Exception, a.__eq__, b) - self.assertRaises(Exception, a.__ne__, b) - self.assertRaises(Exception, a.__lt__, b) - self.assertRaises(Exception, a.__gt__, b) - self.assertRaises(Exception, b.__eq__, a) - self.assertRaises(Exception, b.__ne__, a) - self.assertRaises(Exception, b.__lt__, a) - self.assertRaises(Exception, b.__gt__, a) - - if sys.version_info < (3, 3): - self.assertRaises(Exception, a.__eq__, b.to_pydatetime()) - self.assertRaises(Exception, a.to_pydatetime().__eq__, b) - else: - self.assertFalse(a == b.to_pydatetime()) - self.assertFalse(a.to_pydatetime() == b) - - def test_cant_compare_tz_naive_w_aware_explicit_pytz(self): - tm._skip_if_no_pytz() - from pytz import utc - # #1404 - a = Timestamp('3/12/2012') - b = Timestamp('3/12/2012', tz=utc) - - 
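-        # a is tz-naive while b is tz-aware; comparing the two has no
-        # well-defined answer, so every rich comparison should raise
-        # rather than silently guess a timezone (see #1404 above)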
self.assertRaises(Exception, a.__eq__, b) - self.assertRaises(Exception, a.__ne__, b) - self.assertRaises(Exception, a.__lt__, b) - self.assertRaises(Exception, a.__gt__, b) - self.assertRaises(Exception, b.__eq__, a) - self.assertRaises(Exception, b.__ne__, a) - self.assertRaises(Exception, b.__lt__, a) - self.assertRaises(Exception, b.__gt__, a) - - if sys.version_info < (3, 3): - self.assertRaises(Exception, a.__eq__, b.to_pydatetime()) - self.assertRaises(Exception, a.to_pydatetime().__eq__, b) - else: - self.assertFalse(a == b.to_pydatetime()) - self.assertFalse(a.to_pydatetime() == b) - - def test_cant_compare_tz_naive_w_aware_dateutil(self): - tm._skip_if_no_dateutil() - from dateutil.tz import tzutc - utc = tzutc() - # #1404 - a = Timestamp('3/12/2012') - b = Timestamp('3/12/2012', tz=utc) - - self.assertRaises(Exception, a.__eq__, b) - self.assertRaises(Exception, a.__ne__, b) - self.assertRaises(Exception, a.__lt__, b) - self.assertRaises(Exception, a.__gt__, b) - self.assertRaises(Exception, b.__eq__, a) - self.assertRaises(Exception, b.__ne__, a) - self.assertRaises(Exception, b.__lt__, a) - self.assertRaises(Exception, b.__gt__, a) - - if sys.version_info < (3, 3): - self.assertRaises(Exception, a.__eq__, b.to_pydatetime()) - self.assertRaises(Exception, a.to_pydatetime().__eq__, b) - else: - self.assertFalse(a == b.to_pydatetime()) - self.assertFalse(a.to_pydatetime() == b) - - def test_delta_preserve_nanos(self): - val = Timestamp(long(1337299200000000123)) - result = val + timedelta(1) - self.assertEqual(result.nanosecond, val.nanosecond) - - def test_frequency_misc(self): - self.assertEqual(frequencies.get_freq_group('T'), - frequencies.FreqGroup.FR_MIN) - - code, stride = frequencies.get_freq_code(offsets.Hour()) - self.assertEqual(code, frequencies.FreqGroup.FR_HR) - - code, stride = frequencies.get_freq_code((5, 'T')) - self.assertEqual(code, frequencies.FreqGroup.FR_MIN) - self.assertEqual(stride, 5) - - offset = offsets.Hour() - result = frequencies.to_offset(offset) - self.assertEqual(result, offset) - - result = frequencies.to_offset((5, 'T')) - expected = offsets.Minute(5) - self.assertEqual(result, expected) - - self.assertRaises(ValueError, frequencies.get_freq_code, (5, 'baz')) - - self.assertRaises(ValueError, frequencies.to_offset, '100foo') - - self.assertRaises(ValueError, frequencies.to_offset, ('', '')) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - result = frequencies.get_standard_freq(offsets.Hour()) - self.assertEqual(result, 'H') - - def test_hash_equivalent(self): - d = {datetime(2011, 1, 1): 5} - stamp = Timestamp(datetime(2011, 1, 1)) - self.assertEqual(d[stamp], 5) - - def test_timestamp_compare_scalars(self): - # case where ndim == 0 - lhs = np.datetime64(datetime(2013, 12, 6)) - rhs = Timestamp('now') - nat = Timestamp('nat') - - ops = {'gt': 'lt', - 'lt': 'gt', - 'ge': 'le', - 'le': 'ge', - 'eq': 'eq', - 'ne': 'ne'} - - for left, right in ops.items(): - left_f = getattr(operator, left) - right_f = getattr(operator, right) - expected = left_f(lhs, rhs) - - result = right_f(rhs, lhs) - self.assertEqual(result, expected) - - expected = left_f(rhs, nat) - result = right_f(nat, rhs) - self.assertEqual(result, expected) - - def test_timestamp_compare_series(self): - # make sure we can compare Timestamps on the right AND left hand side - # GH4982 - s = Series(date_range('20010101', periods=10), name='dates') - s_nat = s.copy(deep=True) - - s[0] = pd.Timestamp('nat') - s[3] = pd.Timestamp('nat') - - ops = {'lt': 'gt', 
'le': 'ge', 'eq': 'eq', 'ne': 'ne'} - - for left, right in ops.items(): - left_f = getattr(operator, left) - right_f = getattr(operator, right) - - # no nats - expected = left_f(s, Timestamp('20010109')) - result = right_f(Timestamp('20010109'), s) - tm.assert_series_equal(result, expected) - - # nats - expected = left_f(s, Timestamp('nat')) - result = right_f(Timestamp('nat'), s) - tm.assert_series_equal(result, expected) - - # compare to timestamp with series containing nats - expected = left_f(s_nat, Timestamp('20010109')) - result = right_f(Timestamp('20010109'), s_nat) - tm.assert_series_equal(result, expected) - - # compare to nat with series containing nats - expected = left_f(s_nat, Timestamp('nat')) - result = right_f(Timestamp('nat'), s_nat) - tm.assert_series_equal(result, expected) - - def test_is_leap_year(self): - # GH 13727 - for tz in [None, 'UTC', 'US/Eastern', 'Asia/Tokyo']: - dt = Timestamp('2000-01-01 00:00:00', tz=tz) - self.assertTrue(dt.is_leap_year) - self.assertIsInstance(dt.is_leap_year, bool) - - dt = Timestamp('1999-01-01 00:00:00', tz=tz) - self.assertFalse(dt.is_leap_year) - - dt = Timestamp('2004-01-01 00:00:00', tz=tz) - self.assertTrue(dt.is_leap_year) - - dt = Timestamp('2100-01-01 00:00:00', tz=tz) - self.assertFalse(dt.is_leap_year) - - self.assertFalse(pd.NaT.is_leap_year) - self.assertIsInstance(pd.NaT.is_leap_year, bool) - - -class TestSlicing(tm.TestCase): - def test_slice_year(self): - dti = DatetimeIndex(freq='B', start=datetime(2005, 1, 1), periods=500) - - s = Series(np.arange(len(dti)), index=dti) - result = s['2005'] - expected = s[s.index.year == 2005] - assert_series_equal(result, expected) - - df = DataFrame(np.random.rand(len(dti), 5), index=dti) - result = df.ix['2005'] - expected = df[df.index.year == 2005] - assert_frame_equal(result, expected) - - rng = date_range('1/1/2000', '1/1/2010') - - result = rng.get_loc('2009') - expected = slice(3288, 3653) - self.assertEqual(result, expected) - - def test_slice_quarter(self): - dti = DatetimeIndex(freq='D', start=datetime(2000, 6, 1), periods=500) - - s = Series(np.arange(len(dti)), index=dti) - self.assertEqual(len(s['2001Q1']), 90) - - df = DataFrame(np.random.rand(len(dti), 5), index=dti) - self.assertEqual(len(df.ix['1Q01']), 90) - - def test_slice_month(self): - dti = DatetimeIndex(freq='D', start=datetime(2005, 1, 1), periods=500) - s = Series(np.arange(len(dti)), index=dti) - self.assertEqual(len(s['2005-11']), 30) - - df = DataFrame(np.random.rand(len(dti), 5), index=dti) - self.assertEqual(len(df.ix['2005-11']), 30) - - assert_series_equal(s['2005-11'], s['11-2005']) - - def test_partial_slice(self): - rng = DatetimeIndex(freq='D', start=datetime(2005, 1, 1), periods=500) - s = Series(np.arange(len(rng)), index=rng) - - result = s['2005-05':'2006-02'] - expected = s['20050501':'20060228'] - assert_series_equal(result, expected) - - result = s['2005-05':] - expected = s['20050501':] - assert_series_equal(result, expected) - - result = s[:'2006-02'] - expected = s[:'20060228'] - assert_series_equal(result, expected) - - result = s['2005-1-1'] - self.assertEqual(result, s.iloc[0]) - - self.assertRaises(Exception, s.__getitem__, '2004-12-31') - - def test_partial_slice_daily(self): - rng = DatetimeIndex(freq='H', start=datetime(2005, 1, 31), periods=500) - s = Series(np.arange(len(rng)), index=rng) - - result = s['2005-1-31'] - assert_series_equal(result, s.ix[:24]) - - self.assertRaises(Exception, s.__getitem__, '2004-12-31 00') - - def test_partial_slice_hourly(self): - rng = 
DatetimeIndex(freq='T', start=datetime(2005, 1, 1, 20, 0, 0), - periods=500) - s = Series(np.arange(len(rng)), index=rng) - - result = s['2005-1-1'] - assert_series_equal(result, s.ix[:60 * 4]) - - result = s['2005-1-1 20'] - assert_series_equal(result, s.ix[:60]) - - self.assertEqual(s['2005-1-1 20:00'], s.ix[0]) - self.assertRaises(Exception, s.__getitem__, '2004-12-31 00:15') - - def test_partial_slice_minutely(self): - rng = DatetimeIndex(freq='S', start=datetime(2005, 1, 1, 23, 59, 0), - periods=500) - s = Series(np.arange(len(rng)), index=rng) - - result = s['2005-1-1 23:59'] - assert_series_equal(result, s.ix[:60]) - - result = s['2005-1-1'] - assert_series_equal(result, s.ix[:60]) - - self.assertEqual(s[Timestamp('2005-1-1 23:59:00')], s.ix[0]) - self.assertRaises(Exception, s.__getitem__, '2004-12-31 00:00:00') - - def test_partial_slice_second_precision(self): - rng = DatetimeIndex(start=datetime(2005, 1, 1, 0, 0, 59, - microsecond=999990), - periods=20, freq='US') - s = Series(np.arange(20), rng) - - assert_series_equal(s['2005-1-1 00:00'], s.iloc[:10]) - assert_series_equal(s['2005-1-1 00:00:59'], s.iloc[:10]) - - assert_series_equal(s['2005-1-1 00:01'], s.iloc[10:]) - assert_series_equal(s['2005-1-1 00:01:00'], s.iloc[10:]) - - self.assertEqual(s[Timestamp('2005-1-1 00:00:59.999990')], s.iloc[0]) - self.assertRaisesRegexp(KeyError, '2005-1-1 00:00:00', - lambda: s['2005-1-1 00:00:00']) - - def test_partial_slicing_with_multiindex(self): - - # GH 4758 - # partial string indexing with a multi-index buggy - df = DataFrame({'ACCOUNT': ["ACCT1", "ACCT1", "ACCT1", "ACCT2"], - 'TICKER': ["ABC", "MNP", "XYZ", "XYZ"], - 'val': [1, 2, 3, 4]}, - index=date_range("2013-06-19 09:30:00", - periods=4, freq='5T')) - df_multi = df.set_index(['ACCOUNT', 'TICKER'], append=True) - - expected = DataFrame([ - [1] - ], index=Index(['ABC'], name='TICKER'), columns=['val']) - result = df_multi.loc[('2013-06-19 09:30:00', 'ACCT1')] - assert_frame_equal(result, expected) - - expected = df_multi.loc[ - (pd.Timestamp('2013-06-19 09:30:00', tz=None), 'ACCT1', 'ABC')] - result = df_multi.loc[('2013-06-19 09:30:00', 'ACCT1', 'ABC')] - assert_series_equal(result, expected) - - # this is a KeyError as we don't do partial string selection on - # multi-levels - def f(): - df_multi.loc[('2013-06-19', 'ACCT1', 'ABC')] - - self.assertRaises(KeyError, f) - - # GH 4294 - # partial slice on a series mi - s = pd.DataFrame(randn(1000, 1000), index=pd.date_range( - '2000-1-1', periods=1000)).stack() - - s2 = s[:-1].copy() - expected = s2['2000-1-4'] - result = s2[pd.Timestamp('2000-1-4')] - assert_series_equal(result, expected) - - result = s[pd.Timestamp('2000-1-4')] - expected = s['2000-1-4'] - assert_series_equal(result, expected) - - df2 = pd.DataFrame(s) - expected = df2.ix['2000-1-4'] - result = df2.ix[pd.Timestamp('2000-1-4')] - assert_frame_equal(result, expected) - - def test_date_range_normalize(self): - snap = datetime.today() - n = 50 - - rng = date_range(snap, periods=n, normalize=False, freq='2D') - - offset = timedelta(2) - values = DatetimeIndex([snap + i * offset for i in range(n)]) - - tm.assert_index_equal(rng, values) - - rng = date_range('1/1/2000 08:15', periods=n, normalize=False, - freq='B') - the_time = time(8, 15) - for val in rng: - self.assertEqual(val.time(), the_time) - - def test_timedelta(self): - # this is valid too - index = date_range('1/1/2000', periods=50, freq='B') - shifted = index + timedelta(1) - back = shifted + timedelta(-1) - self.assertTrue(tm.equalContents(index, back)) - 
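-        # the round trip must also preserve the inferred frequency: both the
-        # shifted index and the shifted-back index keep freq='B'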
self.assertEqual(shifted.freq, index.freq) - self.assertEqual(shifted.freq, back.freq) - - result = index - timedelta(1) - expected = index + timedelta(-1) - tm.assert_index_equal(result, expected) - - # GH4134, buggy with timedeltas - rng = date_range('2013', '2014') - s = Series(rng) - result1 = rng - pd.offsets.Hour(1) - result2 = DatetimeIndex(s - np.timedelta64(100000000)) - result3 = rng - np.timedelta64(100000000) - result4 = DatetimeIndex(s - pd.offsets.Hour(1)) - tm.assert_index_equal(result1, result4) - tm.assert_index_equal(result2, result3) - - def test_shift(self): - ts = Series(np.random.randn(5), - index=date_range('1/1/2000', periods=5, freq='H')) - - result = ts.shift(1, freq='5T') - exp_index = ts.index.shift(1, freq='5T') - tm.assert_index_equal(result.index, exp_index) - - # GH #1063, multiple of same base - result = ts.shift(1, freq='4H') - exp_index = ts.index + offsets.Hour(4) - tm.assert_index_equal(result.index, exp_index) - - idx = DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-04']) - self.assertRaises(ValueError, idx.shift, 1) - - def test_setops_preserve_freq(self): - for tz in [None, 'Asia/Tokyo', 'US/Eastern']: - rng = date_range('1/1/2000', '1/1/2002', name='idx', tz=tz) - - result = rng[:50].union(rng[50:100]) - self.assertEqual(result.name, rng.name) - self.assertEqual(result.freq, rng.freq) - self.assertEqual(result.tz, rng.tz) - - result = rng[:50].union(rng[30:100]) - self.assertEqual(result.name, rng.name) - self.assertEqual(result.freq, rng.freq) - self.assertEqual(result.tz, rng.tz) - - result = rng[:50].union(rng[60:100]) - self.assertEqual(result.name, rng.name) - self.assertIsNone(result.freq) - self.assertEqual(result.tz, rng.tz) - - result = rng[:50].intersection(rng[25:75]) - self.assertEqual(result.name, rng.name) - self.assertEqual(result.freqstr, 'D') - self.assertEqual(result.tz, rng.tz) - - nofreq = DatetimeIndex(list(rng[25:75]), name='other') - result = rng[:50].union(nofreq) - self.assertIsNone(result.name) - self.assertEqual(result.freq, rng.freq) - self.assertEqual(result.tz, rng.tz) - - result = rng[:50].intersection(nofreq) - self.assertIsNone(result.name) - self.assertEqual(result.freq, rng.freq) - self.assertEqual(result.tz, rng.tz) - - def test_min_max(self): - rng = date_range('1/1/2000', '12/31/2000') - rng2 = rng.take(np.random.permutation(len(rng))) - - the_min = rng2.min() - the_max = rng2.max() - tm.assertIsInstance(the_min, Timestamp) - tm.assertIsInstance(the_max, Timestamp) - self.assertEqual(the_min, rng[0]) - self.assertEqual(the_max, rng[-1]) - - self.assertEqual(rng.min(), rng[0]) - self.assertEqual(rng.max(), rng[-1]) - - def test_min_max_series(self): - rng = date_range('1/1/2000', periods=10, freq='4h') - lvls = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'] - df = DataFrame({'TS': rng, 'V': np.random.randn(len(rng)), 'L': lvls}) - - result = df.TS.max() - exp = Timestamp(df.TS.iat[-1]) - self.assertTrue(isinstance(result, Timestamp)) - self.assertEqual(result, exp) - - result = df.TS.min() - exp = Timestamp(df.TS.iat[0]) - self.assertTrue(isinstance(result, Timestamp)) - self.assertEqual(result, exp) - - def test_from_M8_structured(self): - dates = [(datetime(2012, 9, 9, 0, 0), datetime(2012, 9, 8, 15, 10))] - arr = np.array(dates, - dtype=[('Date', 'M8[us]'), ('Forecasting', 'M8[us]')]) - df = DataFrame(arr) - - self.assertEqual(df['Date'][0], dates[0][0]) - self.assertEqual(df['Forecasting'][0], dates[0][1]) - - s = Series(arr['Date']) - self.assertTrue(s[0], Timestamp) - self.assertEqual(s[0], 
dates[0][0]) - - s = Series.from_array(arr['Date'], Index([0])) - self.assertEqual(s[0], dates[0][0]) - - def test_get_level_values_box(self): - from pandas import MultiIndex - - dates = date_range('1/1/2000', periods=4) - levels = [dates, [0, 1]] - labels = [[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]] - - index = MultiIndex(levels=levels, labels=labels) - - self.assertTrue(isinstance(index.get_level_values(0)[0], Timestamp)) - - def test_frame_apply_dont_convert_datetime64(self): - from pandas.tseries.offsets import BDay - df = DataFrame({'x1': [datetime(1996, 1, 1)]}) - - df = df.applymap(lambda x: x + BDay()) - df = df.applymap(lambda x: x + BDay()) - - self.assertTrue(df.x1.dtype == 'M8[ns]') - - def test_date_range_fy5252(self): - dr = date_range(start="2013-01-01", periods=2, freq=offsets.FY5253( - startingMonth=1, weekday=3, variation="nearest")) - self.assertEqual(dr[0], Timestamp('2013-01-31')) - self.assertEqual(dr[1], Timestamp('2014-01-30')) - - def test_partial_slice_doesnt_require_monotonicity(self): - # For historical reasons. - s = pd.Series(np.arange(10), pd.date_range('2014-01-01', periods=10)) - - nonmonotonic = s[[3, 5, 4]] - expected = nonmonotonic.iloc[:0] - timestamp = pd.Timestamp('2014-01-10') - - assert_series_equal(nonmonotonic['2014-01-10':], expected) - self.assertRaisesRegexp(KeyError, - r"Timestamp\('2014-01-10 00:00:00'\)", - lambda: nonmonotonic[timestamp:]) - - assert_series_equal(nonmonotonic.ix['2014-01-10':], expected) - self.assertRaisesRegexp(KeyError, - r"Timestamp\('2014-01-10 00:00:00'\)", - lambda: nonmonotonic.ix[timestamp:]) - - -class TimeConversionFormats(tm.TestCase): - def test_to_datetime_format(self): - values = ['1/1/2000', '1/2/2000', '1/3/2000'] - - results1 = [Timestamp('20000101'), Timestamp('20000201'), - Timestamp('20000301')] - results2 = [Timestamp('20000101'), Timestamp('20000102'), - Timestamp('20000103')] - for vals, expecteds in [(values, (Index(results1), Index(results2))), - (Series(values), - (Series(results1), Series(results2))), - (values[0], (results1[0], results2[0])), - (values[1], (results1[1], results2[1])), - (values[2], (results1[2], results2[2]))]: - - for i, fmt in enumerate(['%d/%m/%Y', '%m/%d/%Y']): - result = to_datetime(vals, format=fmt) - expected = expecteds[i] - - if isinstance(expected, Series): - assert_series_equal(result, Series(expected)) - elif isinstance(expected, Timestamp): - self.assertEqual(result, expected) - else: - tm.assert_index_equal(result, expected) - - def test_to_datetime_format_YYYYMMDD(self): - s = Series([19801222, 19801222] + [19810105] * 5) - expected = Series([Timestamp(x) for x in s.apply(str)]) - - result = to_datetime(s, format='%Y%m%d') - assert_series_equal(result, expected) - - result = to_datetime(s.apply(str), format='%Y%m%d') - assert_series_equal(result, expected) - - # with NaT - expected = Series([Timestamp("19801222"), Timestamp("19801222")] + - [Timestamp("19810105")] * 5) - expected[2] = np.nan - s[2] = np.nan - - result = to_datetime(s, format='%Y%m%d') - assert_series_equal(result, expected) - - # string with NaT - s = s.apply(str) - s[2] = 'nat' - result = to_datetime(s, format='%Y%m%d') - assert_series_equal(result, expected) - - # coercion - # GH 7930 - s = Series([20121231, 20141231, 99991231]) - result = pd.to_datetime(s, format='%Y%m%d', errors='ignore') - expected = Series([datetime(2012, 12, 31), - datetime(2014, 12, 31), datetime(9999, 12, 31)], - dtype=object) - self.assert_series_equal(result, expected) - - result = pd.to_datetime(s, 
format='%Y%m%d', errors='coerce')
-        expected = Series(['20121231', '20141231', 'NaT'], dtype='M8[ns]')
-        assert_series_equal(result, expected)
-
-    # GH 10178
-    def test_to_datetime_format_integer(self):
-        s = Series([2000, 2001, 2002])
-        expected = Series([Timestamp(x) for x in s.apply(str)])
-
-        result = to_datetime(s, format='%Y')
-        assert_series_equal(result, expected)
-
-        s = Series([200001, 200105, 200206])
-        expected = Series([Timestamp(x[:4] + '-' + x[4:]) for x in s.apply(str)
-                           ])
-
-        result = to_datetime(s, format='%Y%m')
-        assert_series_equal(result, expected)
-
-    def test_to_datetime_format_microsecond(self):
-        val = '01-Apr-2011 00:00:01.978'
-        format = '%d-%b-%Y %H:%M:%S.%f'
-        result = to_datetime(val, format=format)
-        exp = datetime.strptime(val, format)
-        self.assertEqual(result, exp)
-
-    def test_to_datetime_format_time(self):
-        data = [
-            ['01/10/2010 15:20', '%m/%d/%Y %H:%M',
-             Timestamp('2010-01-10 15:20')],
-            ['01/10/2010 05:43', '%m/%d/%Y %I:%M',
-             Timestamp('2010-01-10 05:43')],
-            ['01/10/2010 13:56:01', '%m/%d/%Y %H:%M:%S',
-             Timestamp('2010-01-10 13:56:01')]  # ,
-            # ['01/10/2010 08:14 PM', '%m/%d/%Y %I:%M %p',
-            #  Timestamp('2010-01-10 20:14')],
-            # ['01/10/2010 07:40 AM', '%m/%d/%Y %I:%M %p',
-            #  Timestamp('2010-01-10 07:40')],
-            # ['01/10/2010 09:12:56 AM', '%m/%d/%Y %I:%M:%S %p',
-            #  Timestamp('2010-01-10 09:12:56')]
-        ]
-        for s, format, dt in data:
-            self.assertEqual(to_datetime(s, format=format), dt)
-
-    def test_to_datetime_with_non_exact(self):
-        # GH 10834
-        _skip_if_has_locale()
-
-        # 8904
-        # exact kw
-        if sys.version_info < (2, 7):
-            raise nose.SkipTest('on python version < 2.7')
-
-        s = Series(['19MAY11', 'foobar19MAY11', '19MAY11:00:00:00',
-                    '19MAY11 00:00:00Z'])
-        result = to_datetime(s, format='%d%b%y', exact=False)
-        expected = to_datetime(s.str.extract(r'(\d+\w+\d+)', expand=False),
-                               format='%d%b%y')
-        assert_series_equal(result, expected)
-
-    def test_parse_nanoseconds_with_formula(self):
-
-        # GH8989
-        # truncating the nanoseconds when a format was provided
-        for v in ["2012-01-01 09:00:00.000000001",
-                  "2012-01-01 09:00:00.000001",
-                  "2012-01-01 09:00:00.001",
-                  "2012-01-01 09:00:00.001000",
-                  "2012-01-01 09:00:00.001000000", ]:
-            expected = pd.to_datetime(v)
-            result = pd.to_datetime(v, format="%Y-%m-%d %H:%M:%S.%f")
-            self.assertEqual(result, expected)
-
-    def test_to_datetime_format_weeks(self):
-        data = [
-            ['2009324', '%Y%W%w', Timestamp('2009-08-13')],
-            ['2013020', '%Y%U%w', Timestamp('2013-01-13')]
-        ]
-        for s, format, dt in data:
-            self.assertEqual(to_datetime(s, format=format), dt)
-
-
-class TestToDatetimeInferFormat(tm.TestCase):
-
-    def test_to_datetime_infer_datetime_format_consistent_format(self):
-        s = pd.Series(pd.date_range('20000101', periods=50, freq='H'))
-
-        test_formats = ['%m-%d-%Y', '%m/%d/%Y %H:%M:%S.%f',
-                        '%Y-%m-%dT%H:%M:%S.%f']
-
-        for test_format in test_formats:
-            s_as_dt_strings = s.apply(lambda x: x.strftime(test_format))
-
-            with_format = pd.to_datetime(s_as_dt_strings, format=test_format)
-            no_infer = pd.to_datetime(s_as_dt_strings,
-                                      infer_datetime_format=False)
-            yes_infer = pd.to_datetime(s_as_dt_strings,
-                                       infer_datetime_format=True)
-
-            # Whether the format is passed explicitly, inferred, or not
-            # inferred at all, the results should all be the same
-            self.assert_series_equal(with_format, no_infer)
-            self.assert_series_equal(no_infer, yes_infer)
-
-    def test_to_datetime_infer_datetime_format_inconsistent_format(self):
-        s = pd.Series(np.array(['01/01/2011 00:00:00',
-                                '01-02-2011 00:00:00',
-                                '2011-01-03T00:00:00']))
-
-        # When the format is inconsistent, infer_datetime_format should just
-        # fall back to the default parsing
-        tm.assert_series_equal(pd.to_datetime(s, infer_datetime_format=False),
-                               pd.to_datetime(s, infer_datetime_format=True))
-
-        s = pd.Series(np.array(['Jan/01/2011', 'Feb/01/2011', 'Mar/01/2011']))
-
-        tm.assert_series_equal(pd.to_datetime(s, infer_datetime_format=False),
-                               pd.to_datetime(s, infer_datetime_format=True))
-
-    def test_to_datetime_infer_datetime_format_series_with_nans(self):
-        s = pd.Series(np.array(['01/01/2011 00:00:00', np.nan,
-                                '01/03/2011 00:00:00', np.nan]))
-        tm.assert_series_equal(pd.to_datetime(s, infer_datetime_format=False),
-                               pd.to_datetime(s, infer_datetime_format=True))
-
-    def test_to_datetime_infer_datetime_format_series_starting_with_nans(self):
-        s = pd.Series(np.array([np.nan, np.nan, '01/01/2011 00:00:00',
-                                '01/02/2011 00:00:00', '01/03/2011 00:00:00']))
-
-        tm.assert_series_equal(pd.to_datetime(s, infer_datetime_format=False),
-                               pd.to_datetime(s, infer_datetime_format=True))
-
-    def test_to_datetime_iso8601_noleading_0s(self):
-        # GH 11871
-        s = pd.Series(['2014-1-1', '2014-2-2', '2015-3-3'])
-        expected = pd.Series([pd.Timestamp('2014-01-01'),
-                              pd.Timestamp('2014-02-02'),
-                              pd.Timestamp('2015-03-03')])
-        tm.assert_series_equal(pd.to_datetime(s), expected)
-        tm.assert_series_equal(pd.to_datetime(s, format='%Y-%m-%d'), expected)
-
-
-class TestGuessDatetimeFormat(tm.TestCase):
-    def test_guess_datetime_format_with_parseable_formats(self):
-        dt_string_to_format = (('20111230', '%Y%m%d'),
-                               ('2011-12-30', '%Y-%m-%d'),
-                               ('30-12-2011', '%d-%m-%Y'),
-                               ('2011-12-30 00:00:00', '%Y-%m-%d %H:%M:%S'),
-                               ('2011-12-30T00:00:00', '%Y-%m-%dT%H:%M:%S'),
-                               ('2011-12-30 00:00:00.000000',
-                                '%Y-%m-%d %H:%M:%S.%f'), )
-
-        for dt_string, dt_format in dt_string_to_format:
-            self.assertEqual(
-                tools._guess_datetime_format(dt_string),
-                dt_format
-            )
-
-    def test_guess_datetime_format_with_dayfirst(self):
-        ambiguous_string = '01/01/2011'
-        self.assertEqual(
-            tools._guess_datetime_format(ambiguous_string, dayfirst=True),
-            '%d/%m/%Y'
-        )
-        self.assertEqual(
-            tools._guess_datetime_format(ambiguous_string, dayfirst=False),
-            '%m/%d/%Y'
-        )
-
-    def test_guess_datetime_format_with_locale_specific_formats(self):
-        # The month names will vary depending on the locale, in which
-        # case these won't be parsed properly (dateutil can't parse them)
-        _skip_if_has_locale()
-
-        dt_string_to_format = (('30/Dec/2011', '%d/%b/%Y'),
-                               ('30/December/2011', '%d/%B/%Y'),
-                               ('30/Dec/2011 00:00:00', '%d/%b/%Y %H:%M:%S'), )
-
-        for dt_string, dt_format in dt_string_to_format:
-            self.assertEqual(
-                tools._guess_datetime_format(dt_string),
-                dt_format
-            )
-
-    def test_guess_datetime_format_invalid_inputs(self):
-        # A datetime string must include a year, month and a day for it
-        # to be guessable, in addition to being a string that looks like
-        # a datetime
-        invalid_dts = [
-            '2013',
-            '01/2013',
-            '12:00:00',
-            '1/1/1/1',
-            'this_is_not_a_datetime',
-            '51a',
-            9,
-            datetime(2011, 1, 1),
-        ]
-
-        for invalid_dt in invalid_dts:
-            self.assertTrue(tools._guess_datetime_format(invalid_dt) is None)
-
-    def test_guess_datetime_format_nopadding(self):
-        # GH 11142
-        dt_string_to_format = (('2011-1-1', '%Y-%m-%d'),
-                               ('30-1-2011', '%d-%m-%Y'),
-                               ('1/1/2011', '%m/%d/%Y'),
-                               ('2011-1-1 00:00:00', '%Y-%m-%d %H:%M:%S'),
-                               ('2011-1-1 0:0:0', '%Y-%m-%d %H:%M:%S'),
-                               ('2011-1-3T00:00:0', '%Y-%m-%dT%H:%M:%S'))
-
-        for dt_string, dt_format in dt_string_to_format:
-            self.assertEqual(
-
tools._guess_datetime_format(dt_string), - dt_format - ) - - def test_guess_datetime_format_for_array(self): - expected_format = '%Y-%m-%d %H:%M:%S.%f' - dt_string = datetime(2011, 12, 30, 0, 0, 0).strftime(expected_format) - - test_arrays = [ - np.array([dt_string, dt_string, dt_string], dtype='O'), - np.array([np.nan, np.nan, dt_string], dtype='O'), - np.array([dt_string, 'random_string'], dtype='O'), - ] - - for test_array in test_arrays: - self.assertEqual( - tools._guess_datetime_format_for_array(test_array), - expected_format - ) - - format_for_string_of_nans = tools._guess_datetime_format_for_array( - np.array( - [np.nan, np.nan, np.nan], dtype='O')) - self.assertTrue(format_for_string_of_nans is None) - - -class TestTimestampToJulianDate(tm.TestCase): - def test_compare_1700(self): - r = Timestamp('1700-06-23').to_julian_date() - self.assertEqual(r, 2342145.5) - - def test_compare_2000(self): - r = Timestamp('2000-04-12').to_julian_date() - self.assertEqual(r, 2451646.5) - - def test_compare_2100(self): - r = Timestamp('2100-08-12').to_julian_date() - self.assertEqual(r, 2488292.5) - - def test_compare_hour01(self): - r = Timestamp('2000-08-12T01:00:00').to_julian_date() - self.assertEqual(r, 2451768.5416666666666666) - - def test_compare_hour13(self): - r = Timestamp('2000-08-12T13:00:00').to_julian_date() - self.assertEqual(r, 2451769.0416666666666666) - - -class TestDateTimeIndexToJulianDate(tm.TestCase): - def test_1700(self): - r1 = Float64Index([2345897.5, 2345898.5, 2345899.5, 2345900.5, - 2345901.5]) - r2 = date_range(start=Timestamp('1710-10-01'), periods=5, - freq='D').to_julian_date() - self.assertIsInstance(r2, Float64Index) - tm.assert_index_equal(r1, r2) - - def test_2000(self): - r1 = Float64Index([2451601.5, 2451602.5, 2451603.5, 2451604.5, - 2451605.5]) - r2 = date_range(start=Timestamp('2000-02-27'), periods=5, - freq='D').to_julian_date() - self.assertIsInstance(r2, Float64Index) - tm.assert_index_equal(r1, r2) - - def test_hour(self): - r1 = Float64Index( - [2451601.5, 2451601.5416666666666666, 2451601.5833333333333333, - 2451601.625, 2451601.6666666666666666]) - r2 = date_range(start=Timestamp('2000-02-27'), periods=5, - freq='H').to_julian_date() - self.assertIsInstance(r2, Float64Index) - tm.assert_index_equal(r1, r2) - - def test_minute(self): - r1 = Float64Index( - [2451601.5, 2451601.5006944444444444, 2451601.5013888888888888, - 2451601.5020833333333333, 2451601.5027777777777777]) - r2 = date_range(start=Timestamp('2000-02-27'), periods=5, - freq='T').to_julian_date() - self.assertIsInstance(r2, Float64Index) - tm.assert_index_equal(r1, r2) - - def test_second(self): - r1 = Float64Index( - [2451601.5, 2451601.500011574074074, 2451601.5000231481481481, - 2451601.5000347222222222, 2451601.5000462962962962]) - r2 = date_range(start=Timestamp('2000-02-27'), periods=5, - freq='S').to_julian_date() - self.assertIsInstance(r2, Float64Index) - tm.assert_index_equal(r1, r2) - - -class TestDaysInMonth(tm.TestCase): - def test_coerce_deprecation(self): - - # deprecation of coerce - with tm.assert_produces_warning(FutureWarning): - to_datetime('2015-02-29', coerce=True) - with tm.assert_produces_warning(FutureWarning): - self.assertRaises(ValueError, - lambda: to_datetime('2015-02-29', coerce=False)) - - # multiple arguments - for e, c in zip(['raise', 'ignore', 'coerce'], [True, False]): - with tm.assert_produces_warning(FutureWarning): - self.assertRaises(TypeError, - lambda: to_datetime('2015-02-29', errors=e, - coerce=c)) - - # tests for issue #10154 - def 
test_day_not_in_month_coerce(self): - self.assertTrue(isnull(to_datetime('2015-02-29', errors='coerce'))) - self.assertTrue(isnull(to_datetime('2015-02-29', format="%Y-%m-%d", - errors='coerce'))) - self.assertTrue(isnull(to_datetime('2015-02-32', format="%Y-%m-%d", - errors='coerce'))) - self.assertTrue(isnull(to_datetime('2015-04-31', format="%Y-%m-%d", - errors='coerce'))) - - def test_day_not_in_month_raise(self): - self.assertRaises(ValueError, to_datetime, '2015-02-29', - errors='raise') - self.assertRaises(ValueError, to_datetime, '2015-02-29', - errors='raise', format="%Y-%m-%d") - self.assertRaises(ValueError, to_datetime, '2015-02-32', - errors='raise', format="%Y-%m-%d") - self.assertRaises(ValueError, to_datetime, '2015-04-31', - errors='raise', format="%Y-%m-%d") - - def test_day_not_in_month_ignore(self): - self.assertEqual(to_datetime( - '2015-02-29', errors='ignore'), '2015-02-29') - self.assertEqual(to_datetime( - '2015-02-29', errors='ignore', format="%Y-%m-%d"), '2015-02-29') - self.assertEqual(to_datetime( - '2015-02-32', errors='ignore', format="%Y-%m-%d"), '2015-02-32') - self.assertEqual(to_datetime( - '2015-04-31', errors='ignore', format="%Y-%m-%d"), '2015-04-31') - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tseries/tests/test_timeseries_legacy.py b/pandas/tseries/tests/test_timeseries_legacy.py deleted file mode 100644 index d8c01c53fb2e5..0000000000000 --- a/pandas/tseries/tests/test_timeseries_legacy.py +++ /dev/null @@ -1,226 +0,0 @@ -# pylint: disable-msg=E1101,W0612 -from datetime import datetime -import sys -import os -import nose -import numpy as np - -from pandas import (Index, Series, date_range, Timestamp, - DatetimeIndex, Int64Index, to_datetime) - -from pandas.tseries.frequencies import get_offset, to_offset -from pandas.tseries.offsets import BDay, Micro, Milli, MonthBegin -import pandas as pd - -from pandas.util.testing import assert_series_equal, assert_almost_equal -import pandas.util.testing as tm - -from pandas.compat import StringIO, cPickle as pickle -from pandas import read_pickle -from numpy.random import rand -import pandas.compat as compat - -randn = np.random.randn - - -# Unfortunately, too much has changed to handle these legacy pickles -# class TestLegacySupport(unittest.TestCase): -class LegacySupport(object): - - _multiprocess_can_split_ = True - - @classmethod - def setUpClass(cls): - if compat.PY3: - raise nose.SkipTest("not compatible with Python >= 3") - - pth, _ = os.path.split(os.path.abspath(__file__)) - filepath = os.path.join(pth, 'data', 'frame.pickle') - - with open(filepath, 'rb') as f: - cls.frame = pickle.load(f) - - filepath = os.path.join(pth, 'data', 'series.pickle') - with open(filepath, 'rb') as f: - cls.series = pickle.load(f) - - def test_pass_offset_warn(self): - buf = StringIO() - - sys.stderr = buf - DatetimeIndex(start='1/1/2000', periods=10, offset='H') - sys.stderr = sys.__stderr__ - - def test_unpickle_legacy_frame(self): - dtindex = DatetimeIndex(start='1/3/2005', end='1/14/2005', - freq=BDay(1)) - - unpickled = self.frame - - self.assertEqual(type(unpickled.index), DatetimeIndex) - self.assertEqual(len(unpickled), 10) - self.assertTrue((unpickled.columns == Int64Index(np.arange(5))).all()) - self.assertTrue((unpickled.index == dtindex).all()) - self.assertEqual(unpickled.index.offset, BDay(1, normalize=True)) - - def test_unpickle_legacy_series(self): - unpickled = self.series - - dtindex = 
DatetimeIndex(start='1/3/2005', end='1/14/2005', - freq=BDay(1)) - - self.assertEqual(type(unpickled.index), DatetimeIndex) - self.assertEqual(len(unpickled), 10) - self.assertTrue((unpickled.index == dtindex).all()) - self.assertEqual(unpickled.index.offset, BDay(1, normalize=True)) - - def test_unpickle_legacy_len0_daterange(self): - pth, _ = os.path.split(os.path.abspath(__file__)) - filepath = os.path.join(pth, 'data', 'series_daterange0.pickle') - - result = pd.read_pickle(filepath) - - ex_index = DatetimeIndex([], freq='B') - - self.assert_index_equal(result.index, ex_index) - tm.assertIsInstance(result.index.freq, BDay) - self.assertEqual(len(result), 0) - - def test_arithmetic_interaction(self): - index = self.frame.index - obj_index = index.asobject - - dseries = Series(rand(len(index)), index=index) - oseries = Series(dseries.values, index=obj_index) - - result = dseries + oseries - expected = dseries * 2 - tm.assertIsInstance(result.index, DatetimeIndex) - assert_series_equal(result, expected) - - result = dseries + oseries[:5] - expected = dseries + dseries[:5] - tm.assertIsInstance(result.index, DatetimeIndex) - assert_series_equal(result, expected) - - def test_join_interaction(self): - index = self.frame.index - obj_index = index.asobject - - def _check_join(left, right, how='inner'): - ra, rb, rc = left.join(right, how=how, return_indexers=True) - ea, eb, ec = left.join(DatetimeIndex(right), how=how, - return_indexers=True) - - tm.assertIsInstance(ra, DatetimeIndex) - self.assert_index_equal(ra, ea) - - assert_almost_equal(rb, eb) - assert_almost_equal(rc, ec) - - _check_join(index[:15], obj_index[5:], how='inner') - _check_join(index[:15], obj_index[5:], how='outer') - _check_join(index[:15], obj_index[5:], how='right') - _check_join(index[:15], obj_index[5:], how='left') - - def test_join_nonunique(self): - idx1 = to_datetime(['2012-11-06 16:00:11.477563', - '2012-11-06 16:00:11.477563']) - idx2 = to_datetime(['2012-11-06 15:11:09.006507', - '2012-11-06 15:11:09.006507']) - rs = idx1.join(idx2, how='outer') - self.assertTrue(rs.is_monotonic) - - def test_unpickle_daterange(self): - pth, _ = os.path.split(os.path.abspath(__file__)) - filepath = os.path.join(pth, 'data', 'daterange_073.pickle') - - rng = read_pickle(filepath) - tm.assertIsInstance(rng[0], datetime) - tm.assertIsInstance(rng.offset, BDay) - self.assertEqual(rng.values.dtype, object) - - def test_setops(self): - index = self.frame.index - obj_index = index.asobject - - result = index[:5].union(obj_index[5:]) - expected = index - tm.assertIsInstance(result, DatetimeIndex) - self.assert_index_equal(result, expected) - - result = index[:10].intersection(obj_index[5:]) - expected = index[5:10] - tm.assertIsInstance(result, DatetimeIndex) - self.assert_index_equal(result, expected) - - result = index[:10] - obj_index[5:] - expected = index[:5] - tm.assertIsInstance(result, DatetimeIndex) - self.assert_index_equal(result, expected) - - def test_index_conversion(self): - index = self.frame.index - obj_index = index.asobject - - conv = DatetimeIndex(obj_index) - self.assert_index_equal(conv, index) - - self.assertRaises(ValueError, DatetimeIndex, ['a', 'b', 'c', 'd']) - - def test_tolist(self): - rng = date_range('1/1/2000', periods=10) - - result = rng.tolist() - tm.assertIsInstance(result[0], Timestamp) - - def test_object_convert_fail(self): - idx = DatetimeIndex([np.NaT]) - self.assertRaises(ValueError, idx.astype, 'O') - - def test_setops_conversion_fail(self): - index = self.frame.index - - right = Index(['a', 
'b', 'c', 'd']) - - result = index.union(right) - expected = Index(np.concatenate([index.asobject, right])) - self.assert_index_equal(result, expected) - - result = index.intersection(right) - expected = Index([]) - self.assert_index_equal(result, expected) - - def test_legacy_time_rules(self): - rules = [('WEEKDAY', 'B'), ('EOM', 'BM'), ('W@MON', 'W-MON'), - ('W@TUE', 'W-TUE'), ('W@WED', 'W-WED'), ('W@THU', 'W-THU'), - ('W@FRI', 'W-FRI'), ('Q@JAN', 'BQ-JAN'), ('Q@FEB', 'BQ-FEB'), - ('Q@MAR', 'BQ-MAR'), ('A@JAN', 'BA-JAN'), ('A@FEB', 'BA-FEB'), - ('A@MAR', 'BA-MAR'), ('A@APR', 'BA-APR'), ('A@MAY', 'BA-MAY'), - ('A@JUN', 'BA-JUN'), ('A@JUL', 'BA-JUL'), ('A@AUG', 'BA-AUG'), - ('A@SEP', 'BA-SEP'), ('A@OCT', 'BA-OCT'), ('A@NOV', 'BA-NOV'), - ('A@DEC', 'BA-DEC'), ('WOM@1FRI', 'WOM-1FRI'), - ('WOM@2FRI', 'WOM-2FRI'), ('WOM@3FRI', 'WOM-3FRI'), - ('WOM@4FRI', 'WOM-4FRI')] - - start, end = '1/1/2000', '1/1/2010' - - for old_freq, new_freq in rules: - old_rng = date_range(start, end, freq=old_freq) - new_rng = date_range(start, end, freq=new_freq) - self.assert_index_equal(old_rng, new_rng) - - def test_ms_vs_MS(self): - left = get_offset('ms') - right = get_offset('MS') - self.assertEqual(left, Milli()) - self.assertEqual(right, MonthBegin()) - - def test_rule_aliases(self): - rule = to_offset('10us') - self.assertEqual(rule, Micro(10)) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tseries/tests/test_tslib.py b/pandas/tseries/tests/test_tslib.py deleted file mode 100644 index b45f867be65dd..0000000000000 --- a/pandas/tseries/tests/test_tslib.py +++ /dev/null @@ -1,1546 +0,0 @@ -import nose -from distutils.version import LooseVersion -import numpy as np - -from pandas import tslib, lib -import pandas._period as period -import datetime - -import pandas as pd -from pandas.core.api import (Timestamp, Index, Series, Timedelta, Period, - to_datetime) -from pandas.tslib import get_timezone -from pandas._period import period_asfreq, period_ordinal -from pandas.tseries.index import date_range, DatetimeIndex -from pandas.tseries.frequencies import ( - get_freq, - US_RESO, MS_RESO, S_RESO, H_RESO, D_RESO, T_RESO -) -import pandas.tseries.tools as tools -import pandas.tseries.offsets as offsets -import pandas.util.testing as tm -import pandas.compat as compat -from pandas.compat.numpy import (np_datetime64_compat, - np_array_datetime64_compat) - -from pandas.util.testing import assert_series_equal, _skip_if_has_locale - - -class TestTsUtil(tm.TestCase): - - def test_try_parse_dates(self): - from dateutil.parser import parse - arr = np.array(['5/1/2000', '6/1/2000', '7/1/2000'], dtype=object) - - result = lib.try_parse_dates(arr, dayfirst=True) - expected = [parse(d, dayfirst=True) for d in arr] - self.assertTrue(np.array_equal(result, expected)) - - def test_min_valid(self): - # Ensure that Timestamp.min is a valid Timestamp - Timestamp(Timestamp.min) - - def test_max_valid(self): - # Ensure that Timestamp.max is a valid Timestamp - Timestamp(Timestamp.max) - - def test_to_datetime_bijective(self): - # Ensure that converting to datetime and back only loses precision - # by going from nanoseconds to microseconds. 
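-        # datetime.datetime resolves only to microseconds, so whenever
-        # Timestamp.max/min carry a nonzero nanosecond component the
-        # to_pydatetime() round trip below is expected to warn about
-        # the dropped precision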
- exp_warning = None if Timestamp.max.nanosecond == 0 else UserWarning - with tm.assert_produces_warning(exp_warning, check_stacklevel=False): - self.assertEqual( - Timestamp(Timestamp.max.to_pydatetime()).value / 1000, - Timestamp.max.value / 1000) - - exp_warning = None if Timestamp.min.nanosecond == 0 else UserWarning - with tm.assert_produces_warning(exp_warning, check_stacklevel=False): - self.assertEqual( - Timestamp(Timestamp.min.to_pydatetime()).value / 1000, - Timestamp.min.value / 1000) - - -class TestTimestamp(tm.TestCase): - - def test_constructor(self): - base_str = '2014-07-01 09:00' - base_dt = datetime.datetime(2014, 7, 1, 9) - base_expected = 1404205200000000000 - - # confirm base representation is correct - import calendar - self.assertEqual(calendar.timegm(base_dt.timetuple()) * 1000000000, - base_expected) - - tests = [(base_str, base_dt, base_expected), - ('2014-07-01 10:00', datetime.datetime(2014, 7, 1, 10), - base_expected + 3600 * 1000000000), - ('2014-07-01 09:00:00.000008000', - datetime.datetime(2014, 7, 1, 9, 0, 0, 8), - base_expected + 8000), - ('2014-07-01 09:00:00.000000005', - Timestamp('2014-07-01 09:00:00.000000005'), - base_expected + 5)] - - tm._skip_if_no_pytz() - tm._skip_if_no_dateutil() - import pytz - import dateutil - timezones = [(None, 0), ('UTC', 0), (pytz.utc, 0), ('Asia/Tokyo', 9), - ('US/Eastern', -4), ('dateutil/US/Pacific', -7), - (pytz.FixedOffset(-180), -3), - (dateutil.tz.tzoffset(None, 18000), 5)] - - for date_str, date, expected in tests: - for result in [Timestamp(date_str), Timestamp(date)]: - # only with timestring - self.assertEqual(result.value, expected) - self.assertEqual(tslib.pydt_to_i8(result), expected) - - # re-creation shouldn't affect to internal value - result = Timestamp(result) - self.assertEqual(result.value, expected) - self.assertEqual(tslib.pydt_to_i8(result), expected) - - # with timezone - for tz, offset in timezones: - for result in [Timestamp(date_str, tz=tz), Timestamp(date, - tz=tz)]: - expected_tz = expected - offset * 3600 * 1000000000 - self.assertEqual(result.value, expected_tz) - self.assertEqual(tslib.pydt_to_i8(result), expected_tz) - - # should preserve tz - result = Timestamp(result) - self.assertEqual(result.value, expected_tz) - self.assertEqual(tslib.pydt_to_i8(result), expected_tz) - - # should convert to UTC - result = Timestamp(result, tz='UTC') - expected_utc = expected - offset * 3600 * 1000000000 - self.assertEqual(result.value, expected_utc) - self.assertEqual(tslib.pydt_to_i8(result), expected_utc) - - def test_constructor_with_stringoffset(self): - # GH 7833 - base_str = '2014-07-01 11:00:00+02:00' - base_dt = datetime.datetime(2014, 7, 1, 9) - base_expected = 1404205200000000000 - - # confirm base representation is correct - import calendar - self.assertEqual(calendar.timegm(base_dt.timetuple()) * 1000000000, - base_expected) - - tests = [(base_str, base_expected), - ('2014-07-01 12:00:00+02:00', - base_expected + 3600 * 1000000000), - ('2014-07-01 11:00:00.000008000+02:00', base_expected + 8000), - ('2014-07-01 11:00:00.000000005+02:00', base_expected + 5)] - - tm._skip_if_no_pytz() - tm._skip_if_no_dateutil() - import pytz - import dateutil - timezones = [(None, 0), ('UTC', 0), (pytz.utc, 0), ('Asia/Tokyo', 9), - ('US/Eastern', -4), ('dateutil/US/Pacific', -7), - (pytz.FixedOffset(-180), -3), - (dateutil.tz.tzoffset(None, 18000), 5)] - - for date_str, expected in tests: - for result in [Timestamp(date_str)]: - # only with timestring - self.assertEqual(result.value, expected) - 
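-                # tslib.pydt_to_i8 is the low-level datetime -> int64
-                # (nanoseconds since the epoch) conversion; it should agree
-                # with .value for the same instant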
self.assertEqual(tslib.pydt_to_i8(result), expected) - - # re-creation shouldn't affect to internal value - result = Timestamp(result) - self.assertEqual(result.value, expected) - self.assertEqual(tslib.pydt_to_i8(result), expected) - - # with timezone - for tz, offset in timezones: - result = Timestamp(date_str, tz=tz) - expected_tz = expected - self.assertEqual(result.value, expected_tz) - self.assertEqual(tslib.pydt_to_i8(result), expected_tz) - - # should preserve tz - result = Timestamp(result) - self.assertEqual(result.value, expected_tz) - self.assertEqual(tslib.pydt_to_i8(result), expected_tz) - - # should convert to UTC - result = Timestamp(result, tz='UTC') - expected_utc = expected - self.assertEqual(result.value, expected_utc) - self.assertEqual(tslib.pydt_to_i8(result), expected_utc) - - # This should be 2013-11-01 05:00 in UTC - # converted to Chicago tz - result = Timestamp('2013-11-01 00:00:00-0500', tz='America/Chicago') - self.assertEqual(result.value, Timestamp('2013-11-01 05:00').value) - expected = "Timestamp('2013-11-01 00:00:00-0500', tz='America/Chicago')" # noqa - self.assertEqual(repr(result), expected) - self.assertEqual(result, eval(repr(result))) - - # This should be 2013-11-01 05:00 in UTC - # converted to Tokyo tz (+09:00) - result = Timestamp('2013-11-01 00:00:00-0500', tz='Asia/Tokyo') - self.assertEqual(result.value, Timestamp('2013-11-01 05:00').value) - expected = "Timestamp('2013-11-01 14:00:00+0900', tz='Asia/Tokyo')" - self.assertEqual(repr(result), expected) - self.assertEqual(result, eval(repr(result))) - - # GH11708 - # This should be 2015-11-18 10:00 in UTC - # converted to Asia/Katmandu - result = Timestamp("2015-11-18 15:45:00+05:45", tz="Asia/Katmandu") - self.assertEqual(result.value, Timestamp("2015-11-18 10:00").value) - expected = "Timestamp('2015-11-18 15:45:00+0545', tz='Asia/Katmandu')" - self.assertEqual(repr(result), expected) - self.assertEqual(result, eval(repr(result))) - - # This should be 2015-11-18 10:00 in UTC - # converted to Asia/Kolkata - result = Timestamp("2015-11-18 15:30:00+05:30", tz="Asia/Kolkata") - self.assertEqual(result.value, Timestamp("2015-11-18 10:00").value) - expected = "Timestamp('2015-11-18 15:30:00+0530', tz='Asia/Kolkata')" - self.assertEqual(repr(result), expected) - self.assertEqual(result, eval(repr(result))) - - def test_constructor_invalid(self): - with tm.assertRaisesRegexp(TypeError, 'Cannot convert input'): - Timestamp(slice(2)) - with tm.assertRaisesRegexp(ValueError, 'Cannot convert Period'): - Timestamp(Period('1000-01-01')) - - def test_constructor_positional(self): - # GH 10758 - with tm.assertRaises(TypeError): - Timestamp(2000, 1) - with tm.assertRaises(ValueError): - Timestamp(2000, 0, 1) - with tm.assertRaises(ValueError): - Timestamp(2000, 13, 1) - with tm.assertRaises(ValueError): - Timestamp(2000, 1, 0) - with tm.assertRaises(ValueError): - Timestamp(2000, 1, 32) - - # GH 11630 - self.assertEqual( - repr(Timestamp(2015, 11, 12)), - repr(Timestamp('20151112'))) - - self.assertEqual( - repr(Timestamp(2015, 11, 12, 1, 2, 3, 999999)), - repr(Timestamp('2015-11-12 01:02:03.999999'))) - - self.assertIs(Timestamp(None), pd.NaT) - - def test_constructor_keyword(self): - # GH 10758 - with tm.assertRaises(TypeError): - Timestamp(year=2000, month=1) - with tm.assertRaises(ValueError): - Timestamp(year=2000, month=0, day=1) - with tm.assertRaises(ValueError): - Timestamp(year=2000, month=13, day=1) - with tm.assertRaises(ValueError): - Timestamp(year=2000, month=1, day=0) - with 
tm.assertRaises(ValueError): - Timestamp(year=2000, month=1, day=32) - - self.assertEqual( - repr(Timestamp(year=2015, month=11, day=12)), - repr(Timestamp('20151112'))) - - self.assertEqual( - repr(Timestamp(year=2015, month=11, day=12, - hour=1, minute=2, second=3, microsecond=999999)), - repr(Timestamp('2015-11-12 01:02:03.999999'))) - - def test_constructor_fromordinal(self): - base = datetime.datetime(2000, 1, 1) - - ts = Timestamp.fromordinal(base.toordinal(), freq='D') - self.assertEqual(base, ts) - self.assertEqual(ts.freq, 'D') - self.assertEqual(base.toordinal(), ts.toordinal()) - - ts = Timestamp.fromordinal(base.toordinal(), tz='US/Eastern') - self.assertEqual(pd.Timestamp('2000-01-01', tz='US/Eastern'), ts) - self.assertEqual(base.toordinal(), ts.toordinal()) - - def test_constructor_offset_depr(self): - # GH 12160 - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): - ts = Timestamp('2011-01-01', offset='D') - self.assertEqual(ts.freq, 'D') - - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): - self.assertEqual(ts.offset, 'D') - - msg = "Can only specify freq or offset, not both" - with tm.assertRaisesRegexp(TypeError, msg): - Timestamp('2011-01-01', offset='D', freq='D') - - def test_constructor_offset_depr_fromordinal(self): - # GH 12160 - base = datetime.datetime(2000, 1, 1) - - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): - ts = Timestamp.fromordinal(base.toordinal(), offset='D') - self.assertEqual(pd.Timestamp('2000-01-01'), ts) - self.assertEqual(ts.freq, 'D') - self.assertEqual(base.toordinal(), ts.toordinal()) - - msg = "Can only specify freq or offset, not both" - with tm.assertRaisesRegexp(TypeError, msg): - Timestamp.fromordinal(base.toordinal(), offset='D', freq='D') - - def test_conversion(self): - # GH 9255 - ts = Timestamp('2000-01-01') - - result = ts.to_pydatetime() - expected = datetime.datetime(2000, 1, 1) - self.assertEqual(result, expected) - self.assertEqual(type(result), type(expected)) - - result = ts.to_datetime64() - expected = np.datetime64(ts.value, 'ns') - self.assertEqual(result, expected) - self.assertEqual(type(result), type(expected)) - self.assertEqual(result.dtype, expected.dtype) - - def test_repr(self): - tm._skip_if_no_pytz() - tm._skip_if_no_dateutil() - - dates = ['2014-03-07', '2014-01-01 09:00', - '2014-01-01 00:00:00.000000001'] - - # dateutil zone change (only matters for repr) - import dateutil - if (dateutil.__version__ >= LooseVersion('2.3') and - (dateutil.__version__ <= LooseVersion('2.4.0') or - dateutil.__version__ >= LooseVersion('2.6.0'))): - timezones = ['UTC', 'Asia/Tokyo', 'US/Eastern', - 'dateutil/US/Pacific'] - else: - timezones = ['UTC', 'Asia/Tokyo', 'US/Eastern', - 'dateutil/America/Los_Angeles'] - - freqs = ['D', 'M', 'S', 'N'] - - for date in dates: - for tz in timezones: - for freq in freqs: - - # avoid to match with timezone name - freq_repr = "'{0}'".format(freq) - if tz.startswith('dateutil'): - tz_repr = tz.replace('dateutil', '') - else: - tz_repr = tz - - date_only = Timestamp(date) - self.assertIn(date, repr(date_only)) - self.assertNotIn(tz_repr, repr(date_only)) - self.assertNotIn(freq_repr, repr(date_only)) - self.assertEqual(date_only, eval(repr(date_only))) - - date_tz = Timestamp(date, tz=tz) - self.assertIn(date, repr(date_tz)) - self.assertIn(tz_repr, repr(date_tz)) - self.assertNotIn(freq_repr, repr(date_tz)) - self.assertEqual(date_tz, eval(repr(date_tz))) - - date_freq = Timestamp(date, freq=freq) - self.assertIn(date, 
repr(date_freq)) - self.assertNotIn(tz_repr, repr(date_freq)) - self.assertIn(freq_repr, repr(date_freq)) - self.assertEqual(date_freq, eval(repr(date_freq))) - - date_tz_freq = Timestamp(date, tz=tz, freq=freq) - self.assertIn(date, repr(date_tz_freq)) - self.assertIn(tz_repr, repr(date_tz_freq)) - self.assertIn(freq_repr, repr(date_tz_freq)) - self.assertEqual(date_tz_freq, eval(repr(date_tz_freq))) - - # this can cause the tz field to be populated, but it's redundant to - # information in the datestring - tm._skip_if_no_pytz() - import pytz # noqa - date_with_utc_offset = Timestamp('2014-03-13 00:00:00-0400', tz=None) - self.assertIn('2014-03-13 00:00:00-0400', repr(date_with_utc_offset)) - self.assertNotIn('tzoffset', repr(date_with_utc_offset)) - self.assertIn('pytz.FixedOffset(-240)', repr(date_with_utc_offset)) - expr = repr(date_with_utc_offset).replace("'pytz.FixedOffset(-240)'", - 'pytz.FixedOffset(-240)') - self.assertEqual(date_with_utc_offset, eval(expr)) - - def test_bounds_with_different_units(self): - out_of_bounds_dates = ('1677-09-21', '2262-04-12', ) - - time_units = ('D', 'h', 'm', 's', 'ms', 'us') - - for date_string in out_of_bounds_dates: - for unit in time_units: - self.assertRaises(ValueError, Timestamp, np.datetime64( - date_string, dtype='M8[%s]' % unit)) - - in_bounds_dates = ('1677-09-23', '2262-04-11', ) - - for date_string in in_bounds_dates: - for unit in time_units: - Timestamp(np.datetime64(date_string, dtype='M8[%s]' % unit)) - - def test_tz(self): - t = '2014-02-01 09:00' - ts = Timestamp(t) - local = ts.tz_localize('Asia/Tokyo') - self.assertEqual(local.hour, 9) - self.assertEqual(local, Timestamp(t, tz='Asia/Tokyo')) - conv = local.tz_convert('US/Eastern') - self.assertEqual(conv, Timestamp('2014-01-31 19:00', tz='US/Eastern')) - self.assertEqual(conv.hour, 19) - - # preserves nanosecond - ts = Timestamp(t) + offsets.Nano(5) - local = ts.tz_localize('Asia/Tokyo') - self.assertEqual(local.hour, 9) - self.assertEqual(local.nanosecond, 5) - conv = local.tz_convert('US/Eastern') - self.assertEqual(conv.nanosecond, 5) - self.assertEqual(conv.hour, 19) - - def test_tz_localize_ambiguous(self): - - ts = Timestamp('2014-11-02 01:00') - ts_dst = ts.tz_localize('US/Eastern', ambiguous=True) - ts_no_dst = ts.tz_localize('US/Eastern', ambiguous=False) - - rng = date_range('2014-11-02', periods=3, freq='H', tz='US/Eastern') - self.assertEqual(rng[1], ts_dst) - self.assertEqual(rng[2], ts_no_dst) - self.assertRaises(ValueError, ts.tz_localize, 'US/Eastern', - ambiguous='infer') - - # GH 8025 - with tm.assertRaisesRegexp(TypeError, - 'Cannot localize tz-aware Timestamp, use ' - 'tz_convert for conversions'): - Timestamp('2011-01-01', tz='US/Eastern').tz_localize('Asia/Tokyo') - - with tm.assertRaisesRegexp(TypeError, - 'Cannot convert tz-naive Timestamp, use ' - 'tz_localize to localize'): - Timestamp('2011-01-01').tz_convert('Asia/Tokyo') - - def test_tz_localize_nonexistent(self): - # See issue 13057 - from pytz.exceptions import NonExistentTimeError - times = ['2015-03-08 02:00', '2015-03-08 02:30', - '2015-03-29 02:00', '2015-03-29 02:30'] - timezones = ['US/Eastern', 'US/Pacific', - 'Europe/Paris', 'Europe/Belgrade'] - for t, tz in zip(times, timezones): - ts = Timestamp(t) - self.assertRaises(NonExistentTimeError, ts.tz_localize, - tz) - self.assertRaises(NonExistentTimeError, ts.tz_localize, - tz, errors='raise') - self.assertIs(ts.tz_localize(tz, errors='coerce'), - pd.NaT) - - def test_tz_localize_errors_ambiguous(self): - # See issue 13057 - from 
pytz.exceptions import AmbiguousTimeError - ts = pd.Timestamp('2015-11-1 01:00') - self.assertRaises(AmbiguousTimeError, - ts.tz_localize, 'US/Pacific', errors='coerce') - - def test_tz_localize_roundtrip(self): - for tz in ['UTC', 'Asia/Tokyo', 'US/Eastern', 'dateutil/US/Pacific']: - for t in ['2014-02-01 09:00', '2014-07-08 09:00', - '2014-11-01 17:00', '2014-11-05 00:00']: - ts = Timestamp(t) - localized = ts.tz_localize(tz) - self.assertEqual(localized, Timestamp(t, tz=tz)) - - with tm.assertRaises(TypeError): - localized.tz_localize(tz) - - reset = localized.tz_localize(None) - self.assertEqual(reset, ts) - self.assertTrue(reset.tzinfo is None) - - def test_tz_convert_roundtrip(self): - for tz in ['UTC', 'Asia/Tokyo', 'US/Eastern', 'dateutil/US/Pacific']: - for t in ['2014-02-01 09:00', '2014-07-08 09:00', - '2014-11-01 17:00', '2014-11-05 00:00']: - ts = Timestamp(t, tz='UTC') - converted = ts.tz_convert(tz) - - reset = converted.tz_convert(None) - self.assertEqual(reset, Timestamp(t)) - self.assertTrue(reset.tzinfo is None) - self.assertEqual(reset, - converted.tz_convert('UTC').tz_localize(None)) - - def test_barely_oob_dts(self): - one_us = np.timedelta64(1).astype('timedelta64[us]') - - # By definition we can't go out of bounds in [ns], so we - # convert the datetime64s to [us] so we can go out of bounds - min_ts_us = np.datetime64(Timestamp.min).astype('M8[us]') - max_ts_us = np.datetime64(Timestamp.max).astype('M8[us]') - - # No error for the min/max datetimes - Timestamp(min_ts_us) - Timestamp(max_ts_us) - - # One us less than the minimum is an error - self.assertRaises(ValueError, Timestamp, min_ts_us - one_us) - - # One us more than the maximum is an error - self.assertRaises(ValueError, Timestamp, max_ts_us + one_us) - - def test_utc_z_designator(self): - self.assertEqual(get_timezone( - Timestamp('2014-11-02 01:00Z').tzinfo), 'UTC') - - def test_now(self): - # #9000 - ts_from_string = Timestamp('now') - ts_from_method = Timestamp.now() - ts_datetime = datetime.datetime.now() - - ts_from_string_tz = Timestamp('now', tz='US/Eastern') - ts_from_method_tz = Timestamp.now(tz='US/Eastern') - - # Check that the delta between the times is less than 1s (arbitrarily - # small) - delta = Timedelta(seconds=1) - self.assertTrue(abs(ts_from_method - ts_from_string) < delta) - self.assertTrue(abs(ts_datetime - ts_from_method) < delta) - self.assertTrue(abs(ts_from_method_tz - ts_from_string_tz) < delta) - self.assertTrue(abs(ts_from_string_tz.tz_localize(None) - - ts_from_method_tz.tz_localize(None)) < delta) - - def test_today(self): - - ts_from_string = Timestamp('today') - ts_from_method = Timestamp.today() - ts_datetime = datetime.datetime.today() - - ts_from_string_tz = Timestamp('today', tz='US/Eastern') - ts_from_method_tz = Timestamp.today(tz='US/Eastern') - - # Check that the delta between the times is less than 1s (arbitrarily - # small) - delta = Timedelta(seconds=1) - self.assertTrue(abs(ts_from_method - ts_from_string) < delta) - self.assertTrue(abs(ts_datetime - ts_from_method) < delta) - self.assertTrue(abs(ts_from_method_tz - ts_from_string_tz) < delta) - self.assertTrue(abs(ts_from_string_tz.tz_localize(None) - - ts_from_method_tz.tz_localize(None)) < delta) - - def test_asm8(self): - np.random.seed(7960929) - ns = [Timestamp.min.value, Timestamp.max.value, 1000, ] - for n in ns: - self.assertEqual(Timestamp(n).asm8.view('i8'), - np.datetime64(n, 'ns').view('i8'), n) - self.assertEqual(Timestamp('nat').asm8.view('i8'), - np.datetime64('nat', 'ns').view('i8')) - - def 
test_fields(self): - def check(value, equal): - # check that the value is int/long like - self.assertTrue(isinstance(value, (int, compat.long))) - self.assertEqual(value, equal) - - # GH 10050 - ts = Timestamp('2015-05-10 09:06:03.000100001') - check(ts.year, 2015) - check(ts.month, 5) - check(ts.day, 10) - check(ts.hour, 9) - check(ts.minute, 6) - check(ts.second, 3) - self.assertRaises(AttributeError, lambda: ts.millisecond) - check(ts.microsecond, 100) - check(ts.nanosecond, 1) - check(ts.dayofweek, 6) - check(ts.quarter, 2) - check(ts.dayofyear, 130) - check(ts.week, 19) - check(ts.daysinmonth, 31) - check(ts.days_in_month, 31) - - def test_nat_fields(self): - # GH 10050 - ts = Timestamp('NaT') - self.assertTrue(np.isnan(ts.year)) - self.assertTrue(np.isnan(ts.month)) - self.assertTrue(np.isnan(ts.day)) - self.assertTrue(np.isnan(ts.hour)) - self.assertTrue(np.isnan(ts.minute)) - self.assertTrue(np.isnan(ts.second)) - self.assertTrue(np.isnan(ts.microsecond)) - self.assertTrue(np.isnan(ts.nanosecond)) - self.assertTrue(np.isnan(ts.dayofweek)) - self.assertTrue(np.isnan(ts.quarter)) - self.assertTrue(np.isnan(ts.dayofyear)) - self.assertTrue(np.isnan(ts.week)) - self.assertTrue(np.isnan(ts.daysinmonth)) - self.assertTrue(np.isnan(ts.days_in_month)) - - def test_pprint(self): - # GH12622 - import pprint - nested_obj = {'foo': 1, - 'bar': [{'w': {'a': Timestamp('2011-01-01')}}] * 10} - result = pprint.pformat(nested_obj, width=50) - expected = r"""{'bar': [{'w': {'a': Timestamp('2011-01-01 00:00:00')}}, - {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, - {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, - {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, - {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, - {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, - {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, - {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, - {'w': {'a': Timestamp('2011-01-01 00:00:00')}}, - {'w': {'a': Timestamp('2011-01-01 00:00:00')}}], - 'foo': 1}""" - self.assertEqual(result, expected) - - def test_to_datetime_depr(self): - # see gh-8254 - ts = Timestamp('2011-01-01') - - with tm.assert_produces_warning(FutureWarning, - check_stacklevel=False): - expected = datetime.datetime(2011, 1, 1) - result = ts.to_datetime() - self.assertEqual(result, expected) - - def test_to_pydatetime_nonzero_nano(self): - ts = Timestamp('2011-01-01 9:00:00.123456789') - - # Warn the user of data loss (nanoseconds).
- with tm.assert_produces_warning(UserWarning, - check_stacklevel=False): - expected = datetime.datetime(2011, 1, 1, 9, 0, 0, 123456) - result = ts.to_pydatetime() - self.assertEqual(result, expected) - - -class TestDatetimeParsingWrappers(tm.TestCase): - def test_does_not_convert_mixed_integer(self): - bad_date_strings = ('-50000', '999', '123.1234', 'm', 'T') - - for bad_date_string in bad_date_strings: - self.assertFalse(tslib._does_string_look_like_datetime( - bad_date_string)) - - good_date_strings = ('2012-01-01', - '01/01/2012', - 'Mon Sep 16, 2013', - '01012012', - '0101', - '1-1', ) - - for good_date_string in good_date_strings: - self.assertTrue(tslib._does_string_look_like_datetime( - good_date_string)) - - def test_parsers(self): - - # https://github.com/dateutil/dateutil/issues/217 - import dateutil - yearfirst = dateutil.__version__ >= LooseVersion('2.5.0') - - cases = {'2011-01-01': datetime.datetime(2011, 1, 1), - '2Q2005': datetime.datetime(2005, 4, 1), - '2Q05': datetime.datetime(2005, 4, 1), - '2005Q1': datetime.datetime(2005, 1, 1), - '05Q1': datetime.datetime(2005, 1, 1), - '2011Q3': datetime.datetime(2011, 7, 1), - '11Q3': datetime.datetime(2011, 7, 1), - '3Q2011': datetime.datetime(2011, 7, 1), - '3Q11': datetime.datetime(2011, 7, 1), - - # quarterly without space - '2000Q4': datetime.datetime(2000, 10, 1), - '00Q4': datetime.datetime(2000, 10, 1), - '4Q2000': datetime.datetime(2000, 10, 1), - '4Q00': datetime.datetime(2000, 10, 1), - '2000q4': datetime.datetime(2000, 10, 1), - '2000-Q4': datetime.datetime(2000, 10, 1), - '00-Q4': datetime.datetime(2000, 10, 1), - '4Q-2000': datetime.datetime(2000, 10, 1), - '4Q-00': datetime.datetime(2000, 10, 1), - '00q4': datetime.datetime(2000, 10, 1), - '2005': datetime.datetime(2005, 1, 1), - '2005-11': datetime.datetime(2005, 11, 1), - '2005 11': datetime.datetime(2005, 11, 1), - '11-2005': datetime.datetime(2005, 11, 1), - '11 2005': datetime.datetime(2005, 11, 1), - '200511': datetime.datetime(2020, 5, 11), - '20051109': datetime.datetime(2005, 11, 9), - '20051109 10:15': datetime.datetime(2005, 11, 9, 10, 15), - '20051109 08H': datetime.datetime(2005, 11, 9, 8, 0), - '2005-11-09 10:15': datetime.datetime(2005, 11, 9, 10, 15), - '2005-11-09 08H': datetime.datetime(2005, 11, 9, 8, 0), - '2005/11/09 10:15': datetime.datetime(2005, 11, 9, 10, 15), - '2005/11/09 08H': datetime.datetime(2005, 11, 9, 8, 0), - "Thu Sep 25 10:36:28 2003": datetime.datetime(2003, 9, 25, 10, - 36, 28), - "Thu Sep 25 2003": datetime.datetime(2003, 9, 25), - "Sep 25 2003": datetime.datetime(2003, 9, 25), - "January 1 2014": datetime.datetime(2014, 1, 1), - - # GH 10537 - '2014-06': datetime.datetime(2014, 6, 1), - '06-2014': datetime.datetime(2014, 6, 1), - '2014-6': datetime.datetime(2014, 6, 1), - '6-2014': datetime.datetime(2014, 6, 1), - - '20010101 12': datetime.datetime(2001, 1, 1, 12), - '20010101 1234': datetime.datetime(2001, 1, 1, 12, 34), - '20010101 123456': datetime.datetime(2001, 1, 1, 12, 34, 56), - } - - for date_str, expected in compat.iteritems(cases): - result1, _, _ = tools.parse_time_string(date_str, - yearfirst=yearfirst) - result2 = to_datetime(date_str, yearfirst=yearfirst) - result3 = to_datetime([date_str], yearfirst=yearfirst) - # result5 is used below - result4 = to_datetime(np.array([date_str], dtype=object), - yearfirst=yearfirst) - result6 = DatetimeIndex([date_str], yearfirst=yearfirst) - # result7 is used below - result8 = DatetimeIndex(Index([date_str]), yearfirst=yearfirst) - result9 = 
DatetimeIndex(Series([date_str]), yearfirst=yearfirst) - - for res in [result1, result2]: - self.assertEqual(res, expected) - for res in [result3, result4, result6, result8, result9]: - exp = DatetimeIndex([pd.Timestamp(expected)]) - tm.assert_index_equal(res, exp) - - # these really need to have yearfirst, but we don't support it - if not yearfirst: - result5 = Timestamp(date_str) - self.assertEqual(result5, expected) - result7 = date_range(date_str, freq='S', periods=1, - yearfirst=yearfirst) - self.assertEqual(result7, expected) - - # NaT - result1, _, _ = tools.parse_time_string('NaT') - result2 = to_datetime('NaT') - result3 = Timestamp('NaT') - result4 = DatetimeIndex(['NaT'])[0] - self.assertTrue(result1 is tslib.NaT) - self.assertTrue(result2 is tslib.NaT) - self.assertTrue(result3 is tslib.NaT) - self.assertTrue(result4 is tslib.NaT) - - def test_parsers_quarter_invalid(self): - - cases = ['2Q 2005', '2Q-200A', '2Q-200', '22Q2005', '6Q-20', '2Q200.'] - for case in cases: - self.assertRaises(ValueError, tools.parse_time_string, case) - - def test_parsers_dayfirst_yearfirst(self): - tm._skip_if_no_dateutil() - - # OK - # 2.5.1 10-11-12 [dayfirst=0, yearfirst=0] -> 2012-10-11 00:00:00 - # 2.5.2 10-11-12 [dayfirst=0, yearfirst=0] -> 2012-10-11 00:00:00 - # 2.5.3 10-11-12 [dayfirst=0, yearfirst=0] -> 2012-10-11 00:00:00 - - # OK - # 2.5.1 10-11-12 [dayfirst=0, yearfirst=1] -> 2010-11-12 00:00:00 - # 2.5.2 10-11-12 [dayfirst=0, yearfirst=1] -> 2010-11-12 00:00:00 - # 2.5.3 10-11-12 [dayfirst=0, yearfirst=1] -> 2010-11-12 00:00:00 - - # bug fix in 2.5.2 - # 2.5.1 10-11-12 [dayfirst=1, yearfirst=1] -> 2010-11-12 00:00:00 - # 2.5.2 10-11-12 [dayfirst=1, yearfirst=1] -> 2010-12-11 00:00:00 - # 2.5.3 10-11-12 [dayfirst=1, yearfirst=1] -> 2010-12-11 00:00:00 - - # OK - # 2.5.1 10-11-12 [dayfirst=1, yearfirst=0] -> 2012-11-10 00:00:00 - # 2.5.2 10-11-12 [dayfirst=1, yearfirst=0] -> 2012-11-10 00:00:00 - # 2.5.3 10-11-12 [dayfirst=1, yearfirst=0] -> 2012-11-10 00:00:00 - - # OK - # 2.5.1 20/12/21 [dayfirst=0, yearfirst=0] -> 2021-12-20 00:00:00 - # 2.5.2 20/12/21 [dayfirst=0, yearfirst=0] -> 2021-12-20 00:00:00 - # 2.5.3 20/12/21 [dayfirst=0, yearfirst=0] -> 2021-12-20 00:00:00 - - # OK - # 2.5.1 20/12/21 [dayfirst=0, yearfirst=1] -> 2020-12-21 00:00:00 - # 2.5.2 20/12/21 [dayfirst=0, yearfirst=1] -> 2020-12-21 00:00:00 - # 2.5.3 20/12/21 [dayfirst=0, yearfirst=1] -> 2020-12-21 00:00:00 - - # revert of bug in 2.5.2 - # 2.5.1 20/12/21 [dayfirst=1, yearfirst=1] -> 2020-12-21 00:00:00 - # 2.5.2 20/12/21 [dayfirst=1, yearfirst=1] -> month must be in 1..12 - # 2.5.3 20/12/21 [dayfirst=1, yearfirst=1] -> 2020-12-21 00:00:00 - - # OK - # 2.5.1 20/12/21 [dayfirst=1, yearfirst=0] -> 2021-12-20 00:00:00 - # 2.5.2 20/12/21 [dayfirst=1, yearfirst=0] -> 2021-12-20 00:00:00 - # 2.5.3 20/12/21 [dayfirst=1, yearfirst=0] -> 2021-12-20 00:00:00 - - import dateutil - is_lt_253 = dateutil.__version__ < LooseVersion('2.5.3') - - # str : dayfirst, yearfirst, expected - cases = {'10-11-12': [(False, False, - datetime.datetime(2012, 10, 11)), - (True, False, - datetime.datetime(2012, 11, 10)), - (False, True, - datetime.datetime(2010, 11, 12)), - (True, True, - datetime.datetime(2010, 12, 11))], - '20/12/21': [(False, False, - datetime.datetime(2021, 12, 20)), - (True, False, - datetime.datetime(2021, 12, 20)), - (False, True, - datetime.datetime(2020, 12, 21)), - (True, True, - datetime.datetime(2020, 12, 21))]} - - from dateutil.parser import parse - for date_str, values in compat.iteritems(cases): - for dayfirst,
yearfirst, expected in values: - - # odd comparisons across versions - # let's just skip - if dayfirst and yearfirst and is_lt_253: - continue - - # compare with dateutil result - dateutil_result = parse(date_str, dayfirst=dayfirst, - yearfirst=yearfirst) - self.assertEqual(dateutil_result, expected) - - result1, _, _ = tools.parse_time_string(date_str, - dayfirst=dayfirst, - yearfirst=yearfirst) - - # we don't support dayfirst/yearfirst here: - if not dayfirst and not yearfirst: - result2 = Timestamp(date_str) - self.assertEqual(result2, expected) - - result3 = to_datetime(date_str, dayfirst=dayfirst, - yearfirst=yearfirst) - - result4 = DatetimeIndex([date_str], dayfirst=dayfirst, - yearfirst=yearfirst)[0] - - self.assertEqual(result1, expected) - self.assertEqual(result3, expected) - self.assertEqual(result4, expected) - - def test_parsers_timestring(self): - tm._skip_if_no_dateutil() - from dateutil.parser import parse - - # must be the same as the dateutil result - cases = {'10:15': (parse('10:15'), datetime.datetime(1, 1, 1, 10, 15)), - '9:05': (parse('9:05'), datetime.datetime(1, 1, 1, 9, 5))} - - for date_str, (exp_now, exp_def) in compat.iteritems(cases): - result1, _, _ = tools.parse_time_string(date_str) - result2 = to_datetime(date_str) - result3 = to_datetime([date_str]) - result4 = Timestamp(date_str) - result5 = DatetimeIndex([date_str])[0] - # parse_time_string returns a datetime on the default date (year 1); - # the others use the current date, which can't be changed because - # time series plotting relies on it - self.assertEqual(result1, exp_def) - self.assertEqual(result2, exp_now) - self.assertEqual(result3, exp_now) - self.assertEqual(result4, exp_now) - self.assertEqual(result5, exp_now) - - def test_parsers_time(self): - # GH11818 - _skip_if_has_locale() - strings = ["14:15", "1415", "2:15pm", "0215pm", "14:15:00", "141500", - "2:15:00pm", "021500pm", datetime.time(14, 15)] - expected = datetime.time(14, 15) - - for time_string in strings: - self.assertEqual(tools.to_time(time_string), expected) - - new_string = "14.15" - self.assertRaises(ValueError, tools.to_time, new_string) - self.assertEqual(tools.to_time(new_string, format="%H.%M"), expected) - - arg = ["14:15", "20:20"] - expected_arr = [datetime.time(14, 15), datetime.time(20, 20)] - self.assertEqual(tools.to_time(arg), expected_arr) - self.assertEqual(tools.to_time(arg, format="%H:%M"), expected_arr) - self.assertEqual(tools.to_time(arg, infer_time_format=True), - expected_arr) - self.assertEqual(tools.to_time(arg, format="%I:%M%p", errors="coerce"), - [None, None]) - - res = tools.to_time(arg, format="%I:%M%p", errors="ignore") - self.assert_numpy_array_equal(res, np.array(arg, dtype=np.object_)) - - with tm.assertRaises(ValueError): - tools.to_time(arg, format="%I:%M%p", errors="raise") - - self.assert_series_equal(tools.to_time(Series(arg, name="test")), - Series(expected_arr, name="test")) - - res = tools.to_time(np.array(arg)) - self.assertIsInstance(res, list) - self.assert_equal(res, expected_arr) - - def test_parsers_monthfreq(self): - cases = {'201101': datetime.datetime(2011, 1, 1, 0, 0), - '200005': datetime.datetime(2000, 5, 1, 0, 0)} - - for date_str, expected in compat.iteritems(cases): - result1, _, _ = tools.parse_time_string(date_str, freq='M') - self.assertEqual(result1, expected) - - def test_parsers_quarterly_with_freq(self): - msg = ('Incorrect quarterly string is given, quarter ' - 'must be between 1 and 4: 2013Q5') - with tm.assertRaisesRegexp(tslib.DateParseError, msg): - tools.parse_time_string('2013Q5') - - # GH
5418 - msg = ('Unable to retrieve month information from given freq: ' - 'INVLD-L-DEC-SAT') - with tm.assertRaisesRegexp(tslib.DateParseError, msg): - tools.parse_time_string('2013Q1', freq='INVLD-L-DEC-SAT') - - cases = {('2013Q2', None): datetime.datetime(2013, 4, 1), - ('2013Q2', 'A-APR'): datetime.datetime(2012, 8, 1), - ('2013-Q2', 'A-DEC'): datetime.datetime(2013, 4, 1)} - - for (date_str, freq), exp in compat.iteritems(cases): - result, _, _ = tools.parse_time_string(date_str, freq=freq) - self.assertEqual(result, exp) - - def test_parsers_timezone_minute_offsets_roundtrip(self): - # GH11708 - base = to_datetime("2013-01-01 00:00:00") - dt_strings = [ - ('2013-01-01 05:45+0545', - "Asia/Katmandu", - "Timestamp('2013-01-01 05:45:00+0545', tz='Asia/Katmandu')"), - ('2013-01-01 05:30+0530', - "Asia/Kolkata", - "Timestamp('2013-01-01 05:30:00+0530', tz='Asia/Kolkata')") - ] - - for dt_string, tz, dt_string_repr in dt_strings: - dt_time = to_datetime(dt_string) - self.assertEqual(base, dt_time) - converted_time = dt_time.tz_localize('UTC').tz_convert(tz) - self.assertEqual(dt_string_repr, repr(converted_time)) - - def test_parsers_iso8601(self): - # GH 12060 - # test only the iso parser - flexibility to different - # separators and leading 0s - # Timestamp construction falls back to dateutil - cases = {'2011-01-02': datetime.datetime(2011, 1, 2), - '2011-1-2': datetime.datetime(2011, 1, 2), - '2011-01': datetime.datetime(2011, 1, 1), - '2011-1': datetime.datetime(2011, 1, 1), - '2011 01 02': datetime.datetime(2011, 1, 2), - '2011.01.02': datetime.datetime(2011, 1, 2), - '2011/01/02': datetime.datetime(2011, 1, 2), - '2011\\01\\02': datetime.datetime(2011, 1, 2), - '2013-01-01 05:30:00': datetime.datetime(2013, 1, 1, 5, 30), - '2013-1-1 5:30:00': datetime.datetime(2013, 1, 1, 5, 30)} - for date_str, exp in compat.iteritems(cases): - actual = tslib._test_parse_iso8601(date_str) - self.assertEqual(actual, exp) - - # separators must all match - YYYYMM not valid - invalid_cases = ['2011-01/02', '2011^11^11', - '201401', '201111', '200101', - # mixed separated and unseparated - '2005-0101', '200501-01', - '20010101 12:3456', '20010101 1234:56', - # HHMMSS must have two digits in each component - # if unseparated - '20010101 1', '20010101 123', '20010101 12345', - '20010101 12345Z', - # wrong separator for HHMMSS - '2001-01-01 12-34-56'] - for date_str in invalid_cases: - with tm.assertRaises(ValueError): - tslib._test_parse_iso8601(date_str) - # If no ValueError was raised, report which case failed.
- raise Exception(date_str) - - -class TestArrayToDatetime(tm.TestCase): - def test_parsing_valid_dates(self): - arr = np.array(['01-01-2013', '01-02-2013'], dtype=object) - self.assert_numpy_array_equal( - tslib.array_to_datetime(arr), - np_array_datetime64_compat( - [ - '2013-01-01T00:00:00.000000000-0000', - '2013-01-02T00:00:00.000000000-0000' - ], - dtype='M8[ns]' - ) - ) - - arr = np.array(['Mon Sep 16 2013', 'Tue Sep 17 2013'], dtype=object) - self.assert_numpy_array_equal( - tslib.array_to_datetime(arr), - np_array_datetime64_compat( - [ - '2013-09-16T00:00:00.000000000-0000', - '2013-09-17T00:00:00.000000000-0000' - ], - dtype='M8[ns]' - ) - ) - - def test_number_looking_strings_not_into_datetime(self): - # #4601 - # These strings don't look like datetimes so they shouldn't be - # attempted to be converted - arr = np.array(['-352.737091', '183.575577'], dtype=object) - self.assert_numpy_array_equal( - tslib.array_to_datetime(arr, errors='ignore'), arr) - - arr = np.array(['1', '2', '3', '4', '5'], dtype=object) - self.assert_numpy_array_equal( - tslib.array_to_datetime(arr, errors='ignore'), arr) - - def test_coercing_dates_outside_of_datetime64_ns_bounds(self): - invalid_dates = [ - datetime.date(1000, 1, 1), - datetime.datetime(1000, 1, 1), - '1000-01-01', - 'Jan 1, 1000', - np.datetime64('1000-01-01'), - ] - - for invalid_date in invalid_dates: - self.assertRaises(ValueError, - tslib.array_to_datetime, - np.array( - [invalid_date], dtype='object'), - errors='raise', ) - self.assert_numpy_array_equal( - tslib.array_to_datetime( - np.array([invalid_date], dtype='object'), - errors='coerce'), - np.array([tslib.iNaT], dtype='M8[ns]') - ) - - arr = np.array(['1/1/1000', '1/1/2000'], dtype=object) - self.assert_numpy_array_equal( - tslib.array_to_datetime(arr, errors='coerce'), - np_array_datetime64_compat( - [ - tslib.iNaT, - '2000-01-01T00:00:00.000000000-0000' - ], - dtype='M8[ns]' - ) - ) - - def test_coerce_of_invalid_datetimes(self): - arr = np.array(['01-01-2013', 'not_a_date', '1'], dtype=object) - - # Without coercing, the presence of any invalid dates prevents - # any values from being converted - self.assert_numpy_array_equal( - tslib.array_to_datetime(arr, errors='ignore'), arr) - - # With coercing, the invalid dates becomes iNaT - self.assert_numpy_array_equal( - tslib.array_to_datetime(arr, errors='coerce'), - np_array_datetime64_compat( - [ - '2013-01-01T00:00:00.000000000-0000', - tslib.iNaT, - tslib.iNaT - ], - dtype='M8[ns]' - ) - ) - - def test_parsing_timezone_offsets(self): - # All of these datetime strings with offsets are equivalent - # to the same datetime after the timezone offset is added - dt_strings = [ - '01-01-2013 08:00:00+08:00', - '2013-01-01T08:00:00.000000000+0800', - '2012-12-31T16:00:00.000000000-0800', - '12-31-2012 23:00:00-01:00' - ] - - expected_output = tslib.array_to_datetime(np.array( - ['01-01-2013 00:00:00'], dtype=object)) - - for dt_string in dt_strings: - self.assert_numpy_array_equal( - tslib.array_to_datetime( - np.array([dt_string], dtype=object) - ), - expected_output - ) - - -class TestTimestampNsOperations(tm.TestCase): - def setUp(self): - self.timestamp = Timestamp(datetime.datetime.utcnow()) - - def assert_ns_timedelta(self, modified_timestamp, expected_value): - value = self.timestamp.value - modified_value = modified_timestamp.value - - self.assertEqual(modified_value - value, expected_value) - - def test_timedelta_ns_arithmetic(self): - self.assert_ns_timedelta(self.timestamp + np.timedelta64(-123, 'ns'), - -123) - - def 
test_timedelta_ns_based_arithmetic(self): - self.assert_ns_timedelta(self.timestamp + np.timedelta64( - 1234567898, 'ns'), 1234567898) - - def test_timedelta_us_arithmetic(self): - self.assert_ns_timedelta(self.timestamp + np.timedelta64(-123, 'us'), - -123000) - - def test_timedelta_ms_arithmetic(self): - time = self.timestamp + np.timedelta64(-123, 'ms') - self.assert_ns_timedelta(time, -123000000) - - def test_nanosecond_string_parsing(self): - ts = Timestamp('2013-05-01 07:15:45.123456789') - # GH 7878 - expected_repr = '2013-05-01 07:15:45.123456789' - expected_value = 1367392545123456789 - self.assertEqual(ts.value, expected_value) - self.assertIn(expected_repr, repr(ts)) - - ts = Timestamp('2013-05-01 07:15:45.123456789+09:00', tz='Asia/Tokyo') - self.assertEqual(ts.value, expected_value - 9 * 3600 * 1000000000) - self.assertIn(expected_repr, repr(ts)) - - ts = Timestamp('2013-05-01 07:15:45.123456789', tz='UTC') - self.assertEqual(ts.value, expected_value) - self.assertIn(expected_repr, repr(ts)) - - ts = Timestamp('2013-05-01 07:15:45.123456789', tz='US/Eastern') - self.assertEqual(ts.value, expected_value + 4 * 3600 * 1000000000) - self.assertIn(expected_repr, repr(ts)) - - # GH 10041 - ts = Timestamp('20130501T071545.123456789') - self.assertEqual(ts.value, expected_value) - self.assertIn(expected_repr, repr(ts)) - - def test_nanosecond_timestamp(self): - # GH 7610 - expected = 1293840000000000005 - t = Timestamp('2011-01-01') + offsets.Nano(5) - self.assertEqual(repr(t), "Timestamp('2011-01-01 00:00:00.000000005')") - self.assertEqual(t.value, expected) - self.assertEqual(t.nanosecond, 5) - - t = Timestamp(t) - self.assertEqual(repr(t), "Timestamp('2011-01-01 00:00:00.000000005')") - self.assertEqual(t.value, expected) - self.assertEqual(t.nanosecond, 5) - - t = Timestamp(np_datetime64_compat('2011-01-01 00:00:00.000000005Z')) - self.assertEqual(repr(t), "Timestamp('2011-01-01 00:00:00.000000005')") - self.assertEqual(t.value, expected) - self.assertEqual(t.nanosecond, 5) - - expected = 1293840000000000010 - t = t + offsets.Nano(5) - self.assertEqual(repr(t), "Timestamp('2011-01-01 00:00:00.000000010')") - self.assertEqual(t.value, expected) - self.assertEqual(t.nanosecond, 10) - - t = Timestamp(t) - self.assertEqual(repr(t), "Timestamp('2011-01-01 00:00:00.000000010')") - self.assertEqual(t.value, expected) - self.assertEqual(t.nanosecond, 10) - - t = Timestamp(np_datetime64_compat('2011-01-01 00:00:00.000000010Z')) - self.assertEqual(repr(t), "Timestamp('2011-01-01 00:00:00.000000010')") - self.assertEqual(t.value, expected) - self.assertEqual(t.nanosecond, 10) - - def test_nat_arithmetic(self): - # GH 6873 - i = 2 - f = 1.5 - - for (left, right) in [(pd.NaT, i), (pd.NaT, f), (pd.NaT, np.nan)]: - self.assertIs(left / right, pd.NaT) - self.assertIs(left * right, pd.NaT) - self.assertIs(right * left, pd.NaT) - with tm.assertRaises(TypeError): - right / left - - # Timestamp / datetime - t = Timestamp('2014-01-01') - dt = datetime.datetime(2014, 1, 1) - for (left, right) in [(pd.NaT, pd.NaT), (pd.NaT, t), (pd.NaT, dt)]: - # NaT __add__ or __sub__ Timestamp-like (or inverse) returns NaT - self.assertIs(right + left, pd.NaT) - self.assertIs(left + right, pd.NaT) - self.assertIs(left - right, pd.NaT) - self.assertIs(right - left, pd.NaT) - - # timedelta-like - # offsets are tested in test_offsets.py - - delta = datetime.timedelta(3600) - td = Timedelta('5s') - - for (left, right) in [(pd.NaT, delta), (pd.NaT, td)]: - # NaT + timedelta-like returns NaT - self.assertIs(right + left, 
pd.NaT) - self.assertIs(left + right, pd.NaT) - self.assertIs(right - left, pd.NaT) - self.assertIs(left - right, pd.NaT) - - # GH 11718 - tm._skip_if_no_pytz() - import pytz - - t_utc = Timestamp('2014-01-01', tz='UTC') - t_tz = Timestamp('2014-01-01', tz='US/Eastern') - dt_tz = pytz.timezone('Asia/Tokyo').localize(dt) - - for (left, right) in [(pd.NaT, t_utc), (pd.NaT, t_tz), - (pd.NaT, dt_tz)]: - # NaT __add__ or __sub__ Timestamp-like (or inverse) returns NaT - self.assertIs(right + left, pd.NaT) - self.assertIs(left + right, pd.NaT) - self.assertIs(left - right, pd.NaT) - self.assertIs(right - left, pd.NaT) - - # int addition / subtraction - for (left, right) in [(pd.NaT, 2), (pd.NaT, 0), (pd.NaT, -3)]: - self.assertIs(right + left, pd.NaT) - self.assertIs(left + right, pd.NaT) - self.assertIs(left - right, pd.NaT) - self.assertIs(right - left, pd.NaT) - - def test_nat_arithmetic_index(self): - # GH 11718 - - # datetime - tm._skip_if_no_pytz() - - dti = pd.DatetimeIndex(['2011-01-01', '2011-01-02'], name='x') - exp = pd.DatetimeIndex([pd.NaT, pd.NaT], name='x') - self.assert_index_equal(dti + pd.NaT, exp) - self.assert_index_equal(pd.NaT + dti, exp) - - dti_tz = pd.DatetimeIndex(['2011-01-01', '2011-01-02'], - tz='US/Eastern', name='x') - exp = pd.DatetimeIndex([pd.NaT, pd.NaT], name='x', tz='US/Eastern') - self.assert_index_equal(dti_tz + pd.NaT, exp) - self.assert_index_equal(pd.NaT + dti_tz, exp) - - exp = pd.TimedeltaIndex([pd.NaT, pd.NaT], name='x') - for (left, right) in [(pd.NaT, dti), (pd.NaT, dti_tz)]: - self.assert_index_equal(left - right, exp) - self.assert_index_equal(right - left, exp) - - # timedelta - tdi = pd.TimedeltaIndex(['1 day', '2 day'], name='x') - exp = pd.DatetimeIndex([pd.NaT, pd.NaT], name='x') - for (left, right) in [(pd.NaT, tdi)]: - self.assert_index_equal(left + right, exp) - self.assert_index_equal(right + left, exp) - self.assert_index_equal(left - right, exp) - self.assert_index_equal(right - left, exp) - - -class TestTslib(tm.TestCase): - def test_intraday_conversion_factors(self): - self.assertEqual(period_asfreq( - 1, get_freq('D'), get_freq('H'), False), 24) - self.assertEqual(period_asfreq( - 1, get_freq('D'), get_freq('T'), False), 1440) - self.assertEqual(period_asfreq( - 1, get_freq('D'), get_freq('S'), False), 86400) - self.assertEqual(period_asfreq(1, get_freq( - 'D'), get_freq('L'), False), 86400000) - self.assertEqual(period_asfreq(1, get_freq( - 'D'), get_freq('U'), False), 86400000000) - self.assertEqual(period_asfreq(1, get_freq( - 'D'), get_freq('N'), False), 86400000000000) - - self.assertEqual(period_asfreq( - 1, get_freq('H'), get_freq('T'), False), 60) - self.assertEqual(period_asfreq( - 1, get_freq('H'), get_freq('S'), False), 3600) - self.assertEqual(period_asfreq(1, get_freq('H'), - get_freq('L'), False), 3600000) - self.assertEqual(period_asfreq(1, get_freq( - 'H'), get_freq('U'), False), 3600000000) - self.assertEqual(period_asfreq(1, get_freq( - 'H'), get_freq('N'), False), 3600000000000) - - self.assertEqual(period_asfreq( - 1, get_freq('T'), get_freq('S'), False), 60) - self.assertEqual(period_asfreq( - 1, get_freq('T'), get_freq('L'), False), 60000) - self.assertEqual(period_asfreq(1, get_freq( - 'T'), get_freq('U'), False), 60000000) - self.assertEqual(period_asfreq(1, get_freq( - 'T'), get_freq('N'), False), 60000000000) - - self.assertEqual(period_asfreq( - 1, get_freq('S'), get_freq('L'), False), 1000) - self.assertEqual(period_asfreq(1, get_freq('S'), - get_freq('U'), False), 1000000) - 
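The expected constants in this test are plain unit arithmetic: one day is 24 hours, 1440 minutes and 86400 seconds, and each step down to milli-, micro- and nanoseconds multiplies by a further 1000. A quick sanity check of the factors, independent of the internal `period_asfreq` API (illustrative only):

```python
DAY_SECONDS = 24 * 60 * 60  # 86400

# Each finer unit is a factor of 1000 below the previous one.
assert DAY_SECONDS * 10 ** 3 == 86400000         # day -> millisecond ('L')
assert DAY_SECONDS * 10 ** 6 == 86400000000      # day -> microsecond ('U')
assert DAY_SECONDS * 10 ** 9 == 86400000000000   # day -> nanosecond ('N')
```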
self.assertEqual(period_asfreq(1, get_freq( - 'S'), get_freq('N'), False), 1000000000) - - self.assertEqual(period_asfreq( - 1, get_freq('L'), get_freq('U'), False), 1000) - self.assertEqual(period_asfreq(1, get_freq('L'), - get_freq('N'), False), 1000000) - - self.assertEqual(period_asfreq( - 1, get_freq('U'), get_freq('N'), False), 1000) - - def test_period_ordinal_start_values(self): - # information for 1.1.1970 - self.assertEqual(0, period_ordinal(1970, 1, 1, 0, 0, 0, 0, 0, - get_freq('A'))) - self.assertEqual(0, period_ordinal(1970, 1, 1, 0, 0, 0, 0, 0, - get_freq('M'))) - self.assertEqual(1, period_ordinal(1970, 1, 1, 0, 0, 0, 0, 0, - get_freq('W'))) - self.assertEqual(0, period_ordinal(1970, 1, 1, 0, 0, 0, 0, 0, - get_freq('D'))) - self.assertEqual(0, period_ordinal(1970, 1, 1, 0, 0, 0, 0, 0, - get_freq('B'))) - - def test_period_ordinal_week(self): - self.assertEqual(1, period_ordinal(1970, 1, 4, 0, 0, 0, 0, 0, - get_freq('W'))) - self.assertEqual(2, period_ordinal(1970, 1, 5, 0, 0, 0, 0, 0, - get_freq('W'))) - - self.assertEqual(2284, period_ordinal(2013, 10, 6, 0, 0, 0, 0, 0, - get_freq('W'))) - self.assertEqual(2285, period_ordinal(2013, 10, 7, 0, 0, 0, 0, 0, - get_freq('W'))) - - def test_period_ordinal_business_day(self): - # Thursday - self.assertEqual(11415, period_ordinal(2013, 10, 3, 0, 0, 0, 0, 0, - get_freq('B'))) - # Friday - self.assertEqual(11416, period_ordinal(2013, 10, 4, 0, 0, 0, 0, 0, - get_freq('B'))) - # Saturday - self.assertEqual(11417, period_ordinal(2013, 10, 5, 0, 0, 0, 0, 0, - get_freq('B'))) - # Sunday - self.assertEqual(11417, period_ordinal(2013, 10, 6, 0, 0, 0, 0, 0, - get_freq('B'))) - # Monday - self.assertEqual(11417, period_ordinal(2013, 10, 7, 0, 0, 0, 0, 0, - get_freq('B'))) - # Tuesday - self.assertEqual(11418, period_ordinal(2013, 10, 8, 0, 0, 0, 0, 0, - get_freq('B'))) - - def test_tslib_tz_convert(self): - def compare_utc_to_local(tz_didx, utc_didx): - f = lambda x: tslib.tz_convert_single(x, 'UTC', tz_didx.tz) - result = tslib.tz_convert(tz_didx.asi8, 'UTC', tz_didx.tz) - result_single = np.vectorize(f)(tz_didx.asi8) - self.assert_numpy_array_equal(result, result_single) - - def compare_local_to_utc(tz_didx, utc_didx): - f = lambda x: tslib.tz_convert_single(x, tz_didx.tz, 'UTC') - result = tslib.tz_convert(utc_didx.asi8, tz_didx.tz, 'UTC') - result_single = np.vectorize(f)(utc_didx.asi8) - self.assert_numpy_array_equal(result, result_single) - - for tz in ['UTC', 'Asia/Tokyo', 'US/Eastern', 'Europe/Moscow']: - # US: 2014-03-09 - 2014-11-11 - # MOSCOW: 2014-10-26 / 2014-12-31 - tz_didx = date_range('2014-03-01', '2015-01-10', freq='H', tz=tz) - utc_didx = date_range('2014-03-01', '2015-01-10', freq='H') - compare_utc_to_local(tz_didx, utc_didx) - # local tz to UTC can be differ in hourly (or higher) freqs because - # of DST - compare_local_to_utc(tz_didx, utc_didx) - - tz_didx = date_range('2000-01-01', '2020-01-01', freq='D', tz=tz) - utc_didx = date_range('2000-01-01', '2020-01-01', freq='D') - compare_utc_to_local(tz_didx, utc_didx) - compare_local_to_utc(tz_didx, utc_didx) - - tz_didx = date_range('2000-01-01', '2100-01-01', freq='A', tz=tz) - utc_didx = date_range('2000-01-01', '2100-01-01', freq='A') - compare_utc_to_local(tz_didx, utc_didx) - compare_local_to_utc(tz_didx, utc_didx) - - # Check empty array - result = tslib.tz_convert(np.array([], dtype=np.int64), - tslib.maybe_get_tz('US/Eastern'), - tslib.maybe_get_tz('Asia/Tokyo')) - self.assert_numpy_array_equal(result, np.array([], dtype=np.int64)) - - # Check all-NaT array - 
result = tslib.tz_convert(np.array([tslib.iNaT], dtype=np.int64), - tslib.maybe_get_tz('US/Eastern'), - tslib.maybe_get_tz('Asia/Tokyo')) - self.assert_numpy_array_equal(result, np.array( - [tslib.iNaT], dtype=np.int64)) - - def test_shift_months(self): - s = DatetimeIndex([Timestamp('2000-01-05 00:15:00'), Timestamp( - '2000-01-31 00:23:00'), Timestamp('2000-01-01'), Timestamp( - '2000-02-29'), Timestamp('2000-12-31')]) - for years in [-1, 0, 1]: - for months in [-2, 0, 2]: - actual = DatetimeIndex(tslib.shift_months(s.asi8, years * 12 + - months)) - expected = DatetimeIndex([x + offsets.DateOffset( - years=years, months=months) for x in s]) - tm.assert_index_equal(actual, expected) - - def test_round(self): - stamp = Timestamp('2000-01-05 05:09:15.13') - - def _check_round(freq, expected): - result = stamp.round(freq=freq) - self.assertEqual(result, expected) - - for freq, expected in [('D', Timestamp('2000-01-05 00:00:00')), - ('H', Timestamp('2000-01-05 05:00:00')), - ('S', Timestamp('2000-01-05 05:09:15'))]: - _check_round(freq, expected) - - msg = pd.tseries.frequencies._INVALID_FREQ_ERROR - with self.assertRaisesRegexp(ValueError, msg): - stamp.round('foo') - - -class TestTimestampOps(tm.TestCase): - def test_timestamp_and_datetime(self): - self.assertEqual((Timestamp(datetime.datetime( - 2013, 10, 13)) - datetime.datetime(2013, 10, 12)).days, 1) - self.assertEqual((datetime.datetime(2013, 10, 12) - - Timestamp(datetime.datetime(2013, 10, 13))).days, -1) - - def test_timestamp_and_series(self): - timestamp_series = Series(date_range('2014-03-17', periods=2, freq='D', - tz='US/Eastern')) - first_timestamp = timestamp_series[0] - - delta_series = Series([np.timedelta64(0, 'D'), np.timedelta64(1, 'D')]) - assert_series_equal(timestamp_series - first_timestamp, delta_series) - assert_series_equal(first_timestamp - timestamp_series, -delta_series) - - def test_addition_subtraction_types(self): - # Assert on the types resulting from Timestamp +/- various date/time - # objects - datetime_instance = datetime.datetime(2014, 3, 4) - timedelta_instance = datetime.timedelta(seconds=1) - # build a timestamp with a frequency, since then it supports - # addition/subtraction of integers - timestamp_instance = date_range(datetime_instance, periods=1, - freq='D')[0] - - self.assertEqual(type(timestamp_instance + 1), Timestamp) - self.assertEqual(type(timestamp_instance - 1), Timestamp) - - # Timestamp + datetime not supported, though subtraction is supported - # and yields timedelta more tests in tseries/base/tests/test_base.py - self.assertEqual( - type(timestamp_instance - datetime_instance), Timedelta) - self.assertEqual( - type(timestamp_instance + timedelta_instance), Timestamp) - self.assertEqual( - type(timestamp_instance - timedelta_instance), Timestamp) - - # Timestamp +/- datetime64 not supported, so not tested (could possibly - # assert error raised?) 
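The type contract exercised in this test is compact enough to restate directly; a hedged sketch of the behaviour the assertions pin down (illustrative names, not part of the test suite):

```python
import datetime
import pandas as pd

ts = pd.Timestamp('2014-03-04')

# Subtracting a datetime from a Timestamp yields a Timedelta...
assert isinstance(ts - datetime.datetime(2014, 3, 3), pd.Timedelta)

# ...while adding or subtracting a timedelta yields another Timestamp.
assert isinstance(ts + datetime.timedelta(seconds=1), pd.Timestamp)
assert isinstance(ts - datetime.timedelta(seconds=1), pd.Timestamp)
```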
- timedelta64_instance = np.timedelta64(1, 'D') - self.assertEqual( - type(timestamp_instance + timedelta64_instance), Timestamp) - self.assertEqual( - type(timestamp_instance - timedelta64_instance), Timestamp) - - def test_addition_subtraction_preserve_frequency(self): - timestamp_instance = date_range('2014-03-05', periods=1, freq='D')[0] - timedelta_instance = datetime.timedelta(days=1) - original_freq = timestamp_instance.freq - self.assertEqual((timestamp_instance + 1).freq, original_freq) - self.assertEqual((timestamp_instance - 1).freq, original_freq) - self.assertEqual( - (timestamp_instance + timedelta_instance).freq, original_freq) - self.assertEqual( - (timestamp_instance - timedelta_instance).freq, original_freq) - - timedelta64_instance = np.timedelta64(1, 'D') - self.assertEqual( - (timestamp_instance + timedelta64_instance).freq, original_freq) - self.assertEqual( - (timestamp_instance - timedelta64_instance).freq, original_freq) - - def test_resolution(self): - - for freq, expected in zip(['A', 'Q', 'M', 'D', 'H', 'T', - 'S', 'L', 'U'], - [D_RESO, D_RESO, - D_RESO, D_RESO, - H_RESO, T_RESO, - S_RESO, MS_RESO, - US_RESO]): - for tz in [None, 'Asia/Tokyo', 'US/Eastern', - 'dateutil/US/Eastern']: - idx = date_range(start='2013-04-01', periods=30, freq=freq, - tz=tz) - result = period.resolution(idx.asi8, idx.tz) - self.assertEqual(result, expected) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tseries/tests/test_util.py b/pandas/tseries/tests/test_util.py deleted file mode 100644 index 96da32a4a845c..0000000000000 --- a/pandas/tseries/tests/test_util.py +++ /dev/null @@ -1,132 +0,0 @@ -from pandas.compat import range -import nose - -import numpy as np - -from pandas import Series, date_range -import pandas.util.testing as tm - -from datetime import datetime, date - -from pandas.tseries.tools import normalize_date -from pandas.tseries.util import pivot_annual, isleapyear - - -class TestPivotAnnual(tm.TestCase): - """ - New pandas implementation of scikits.timeseries pivot_annual - """ - - def test_daily(self): - rng = date_range('1/1/2000', '12/31/2004', freq='D') - ts = Series(np.random.randn(len(rng)), index=rng) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - annual = pivot_annual(ts, 'D') - - doy = ts.index.dayofyear - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - doy[(~isleapyear(ts.index.year)) & (doy >= 60)] += 1 - - for i in range(1, 367): - subset = ts[doy == i] - subset.index = [x.year for x in subset.index] - - result = annual[i].dropna() - tm.assert_series_equal(result, subset, check_names=False) - self.assertEqual(result.name, i) - - # check leap days - leaps = ts[(ts.index.month == 2) & (ts.index.day == 29)] - day = leaps.index.dayofyear[0] - leaps.index = leaps.index.year - leaps.name = 60 - tm.assert_series_equal(annual[day].dropna(), leaps) - - def test_hourly(self): - rng_hourly = date_range('1/1/1994', periods=(18 * 8760 + 4 * 24), - freq='H') - data_hourly = np.random.randint(100, 350, rng_hourly.size) - ts_hourly = Series(data_hourly, index=rng_hourly) - - grouped = ts_hourly.groupby(ts_hourly.index.year) - hoy = grouped.apply(lambda x: x.reset_index(drop=True)) - hoy = hoy.index.droplevel(0).values - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - hoy[~isleapyear(ts_hourly.index.year) & (hoy >= 1416)] += 24 - hoy += 1 - - with tm.assert_produces_warning(FutureWarning,
check_stacklevel=False): - annual = pivot_annual(ts_hourly) - - ts_hourly = ts_hourly.astype(float) - for i in [1, 1416, 1417, 1418, 1439, 1440, 1441, 8784]: - subset = ts_hourly[hoy == i] - subset.index = [x.year for x in subset.index] - - result = annual[i].dropna() - tm.assert_series_equal(result, subset, check_names=False) - self.assertEqual(result.name, i) - - leaps = ts_hourly[(ts_hourly.index.month == 2) & ( - ts_hourly.index.day == 29) & (ts_hourly.index.hour == 0)] - hour = leaps.index.dayofyear[0] * 24 - 23 - leaps.index = leaps.index.year - leaps.name = 1417 - tm.assert_series_equal(annual[hour].dropna(), leaps) - - def test_weekly(self): - pass - - def test_monthly(self): - rng = date_range('1/1/2000', '12/31/2004', freq='M') - ts = Series(np.random.randn(len(rng)), index=rng) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - annual = pivot_annual(ts, 'M') - - month = ts.index.month - for i in range(1, 13): - subset = ts[month == i] - subset.index = [x.year for x in subset.index] - result = annual[i].dropna() - tm.assert_series_equal(result, subset, check_names=False) - self.assertEqual(result.name, i) - - def test_period_monthly(self): - pass - - def test_period_daily(self): - pass - - def test_period_weekly(self): - pass - - def test_isleapyear_deprecate(self): - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - self.assertTrue(isleapyear(2000)) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - self.assertFalse(isleapyear(2001)) - - with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): - self.assertTrue(isleapyear(2004)) - - -def test_normalize_date(): - value = date(2012, 9, 7) - - result = normalize_date(value) - assert (result == datetime(2012, 9, 7)) - - value = datetime(2012, 9, 7, 12) - - result = normalize_date(value) - assert (result == datetime(2012, 9, 7)) - - -if __name__ == '__main__': - nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], - exit=False) diff --git a/pandas/tseries/util.py b/pandas/tseries/util.py index 59daa8d7780b4..5934f5843736c 100644 --- a/pandas/tseries/util.py +++ b/pandas/tseries/util.py @@ -2,9 +2,9 @@ from pandas.compat import lrange import numpy as np -from pandas.types.common import _ensure_platform_int +from pandas.core.dtypes.common import _ensure_platform_int from pandas.core.frame import DataFrame -import pandas.core.nanops as nanops +import pandas.core.algorithms as algorithms def pivot_annual(series, freq=None): @@ -45,7 +45,7 @@ def pivot_annual(series, freq=None): index = series.index year = index.year - years = nanops.unique1d(year) + years = algorithms.unique1d(year) if freq is not None: freq = freq.upper() @@ -54,7 +54,7 @@ def pivot_annual(series, freq=None): if freq == 'D': width = 366 - offset = index.dayofyear - 1 + offset = np.asarray(index.dayofyear) - 1 # adjust for leap year offset[(~isleapyear(year)) & (offset >= 59)] += 1 @@ -63,7 +63,7 @@ def pivot_annual(series, freq=None): # todo: strings like 1/1, 1/25, etc.? 
elif freq in ('M', 'BM'): width = 12 - offset = index.month - 1 + offset = np.asarray(index.month) - 1 columns = lrange(1, 13) elif freq == 'H': width = 8784 diff --git a/pandas/tslib.py b/pandas/tslib.py new file mode 100644 index 0000000000000..c960a4eaf59ad --- /dev/null +++ b/pandas/tslib.py @@ -0,0 +1,7 @@ +# flake8: noqa + +import warnings +warnings.warn("The pandas.tslib module is deprecated and will be " + "removed in a future version.", FutureWarning, stacklevel=2) +from pandas._libs.tslib import (Timestamp, Timedelta, + NaT, NaTType, OutOfBoundsDatetime) diff --git a/pandas/types/common.py b/pandas/types/common.py deleted file mode 100644 index 5d161efa838de..0000000000000 --- a/pandas/types/common.py +++ /dev/null @@ -1,474 +0,0 @@ -""" common type operations """ - -import numpy as np -from pandas.compat import (string_types, text_type, binary_type, - PY3, PY36) -from pandas import lib, algos -from .dtypes import (CategoricalDtype, CategoricalDtypeType, - DatetimeTZDtype, DatetimeTZDtypeType, - PeriodDtype, PeriodDtypeType, - ExtensionDtype) -from .generic import (ABCCategorical, ABCPeriodIndex, - ABCDatetimeIndex, ABCSeries, - ABCSparseArray, ABCSparseSeries) -from .inference import is_string_like -from .inference import * # noqa - - -_POSSIBLY_CAST_DTYPES = set([np.dtype(t).name - for t in ['O', 'int8', 'uint8', 'int16', 'uint16', - 'int32', 'uint32', 'int64', 'uint64']]) - -_NS_DTYPE = np.dtype('M8[ns]') -_TD_DTYPE = np.dtype('m8[ns]') -_INT64_DTYPE = np.dtype(np.int64) - -_DATELIKE_DTYPES = set([np.dtype(t) - for t in ['M8[ns]', 'M8[ns]', - 'm8[ns]', 'm8[ns]']]) - -_ensure_float64 = algos.ensure_float64 -_ensure_float32 = algos.ensure_float32 - - -def _ensure_float(arr): - if issubclass(arr.dtype.type, (np.integer, np.bool_)): - arr = arr.astype(float) - return arr - -_ensure_int64 = algos.ensure_int64 -_ensure_int32 = algos.ensure_int32 -_ensure_int16 = algos.ensure_int16 -_ensure_int8 = algos.ensure_int8 -_ensure_platform_int = algos.ensure_platform_int -_ensure_object = algos.ensure_object - - -def _ensure_categorical(arr): - if not is_categorical(arr): - from pandas import Categorical - arr = Categorical(arr) - return arr - - -def is_object_dtype(arr_or_dtype): - tipo = _get_dtype_type(arr_or_dtype) - return issubclass(tipo, np.object_) - - -def is_sparse(array): - """ return if we are a sparse array """ - return isinstance(array, (ABCSparseArray, ABCSparseSeries)) - - -def is_categorical(array): - """ return if we are a categorical possibility """ - return isinstance(array, ABCCategorical) or is_categorical_dtype(array) - - -def is_datetimetz(array): - """ return if we are a datetime with tz array """ - return ((isinstance(array, ABCDatetimeIndex) and - getattr(array, 'tz', None) is not None) or - is_datetime64tz_dtype(array)) - - -def is_period(array): - """ return if we are a period array """ - return isinstance(array, ABCPeriodIndex) or is_period_arraylike(array) - - -def is_datetime64_dtype(arr_or_dtype): - try: - tipo = _get_dtype_type(arr_or_dtype) - except TypeError: - return False - return issubclass(tipo, np.datetime64) - - -def is_datetime64tz_dtype(arr_or_dtype): - return DatetimeTZDtype.is_dtype(arr_or_dtype) - - -def is_timedelta64_dtype(arr_or_dtype): - tipo = _get_dtype_type(arr_or_dtype) - return issubclass(tipo, np.timedelta64) - - -def is_period_dtype(arr_or_dtype): - return PeriodDtype.is_dtype(arr_or_dtype) - - -def is_categorical_dtype(arr_or_dtype): - return CategoricalDtype.is_dtype(arr_or_dtype) - - -def is_string_dtype(arr_or_dtype): - dtype = 
_get_dtype(arr_or_dtype) - return dtype.kind in ('O', 'S', 'U') and not is_period_dtype(dtype) - - -def is_period_arraylike(arr): - """ return if we are period arraylike / PeriodIndex """ - if isinstance(arr, ABCPeriodIndex): - return True - elif isinstance(arr, (np.ndarray, ABCSeries)): - return arr.dtype == object and lib.infer_dtype(arr) == 'period' - return getattr(arr, 'inferred_type', None) == 'period' - - -def is_datetime_arraylike(arr): - """ return if we are datetime arraylike / DatetimeIndex """ - if isinstance(arr, ABCDatetimeIndex): - return True - elif isinstance(arr, (np.ndarray, ABCSeries)): - return arr.dtype == object and lib.infer_dtype(arr) == 'datetime' - return getattr(arr, 'inferred_type', None) == 'datetime' - - -def is_datetimelike(arr): - return (arr.dtype in _DATELIKE_DTYPES or - isinstance(arr, ABCPeriodIndex) or - is_datetimetz(arr)) - - -def is_dtype_equal(source, target): - """ return a boolean if the dtypes are equal """ - try: - source = _get_dtype(source) - target = _get_dtype(target) - return source == target - except (TypeError, AttributeError): - - # invalid comparison - # object == category will hit this - return False - - -def is_any_int_dtype(arr_or_dtype): - tipo = _get_dtype_type(arr_or_dtype) - return issubclass(tipo, np.integer) - - -def is_integer_dtype(arr_or_dtype): - tipo = _get_dtype_type(arr_or_dtype) - return (issubclass(tipo, np.integer) and - not issubclass(tipo, (np.datetime64, np.timedelta64))) - - -def is_int64_dtype(arr_or_dtype): - tipo = _get_dtype_type(arr_or_dtype) - return issubclass(tipo, np.int64) - - -def is_int_or_datetime_dtype(arr_or_dtype): - tipo = _get_dtype_type(arr_or_dtype) - return (issubclass(tipo, np.integer) or - issubclass(tipo, (np.datetime64, np.timedelta64))) - - -def is_datetime64_any_dtype(arr_or_dtype): - return (is_datetime64_dtype(arr_or_dtype) or - is_datetime64tz_dtype(arr_or_dtype)) - - -def is_datetime64_ns_dtype(arr_or_dtype): - try: - tipo = _get_dtype(arr_or_dtype) - except TypeError: - return False - return tipo == _NS_DTYPE - - -def is_timedelta64_ns_dtype(arr_or_dtype): - tipo = _get_dtype(arr_or_dtype) - return tipo == _TD_DTYPE - - -def is_datetime_or_timedelta_dtype(arr_or_dtype): - tipo = _get_dtype_type(arr_or_dtype) - return issubclass(tipo, (np.datetime64, np.timedelta64)) - - -def _is_unorderable_exception(e): - """ - return a boolean if we an unorderable exception error message - - These are different error message for PY>=3<=3.5 and PY>=3.6 - """ - if PY36: - return "'>' not supported between instances of" in str(e) - - elif PY3: - return 'unorderable' in str(e) - return False - - -def is_numeric_v_string_like(a, b): - """ - numpy doesn't like to compare numeric arrays vs scalar string-likes - - return a boolean result if this is the case for a,b or b,a - - """ - is_a_array = isinstance(a, np.ndarray) - is_b_array = isinstance(b, np.ndarray) - - is_a_numeric_array = is_a_array and is_numeric_dtype(a) - is_b_numeric_array = is_b_array and is_numeric_dtype(b) - is_a_string_array = is_a_array and is_string_like_dtype(a) - is_b_string_array = is_b_array and is_string_like_dtype(b) - - is_a_scalar_string_like = not is_a_array and is_string_like(a) - is_b_scalar_string_like = not is_b_array and is_string_like(b) - - return ((is_a_numeric_array and is_b_scalar_string_like) or - (is_b_numeric_array and is_a_scalar_string_like) or - (is_a_numeric_array and is_b_string_array) or - (is_b_numeric_array and is_a_string_array)) - - -def is_datetimelike_v_numeric(a, b): - # return if we have an i8 
convertible and numeric comparison - if not hasattr(a, 'dtype'): - a = np.asarray(a) - if not hasattr(b, 'dtype'): - b = np.asarray(b) - - def is_numeric(x): - return is_integer_dtype(x) or is_float_dtype(x) - - is_datetimelike = needs_i8_conversion - return ((is_datetimelike(a) and is_numeric(b)) or - (is_datetimelike(b) and is_numeric(a))) - - -def is_datetimelike_v_object(a, b): - # return if we have an i8 convertible and object comparsion - if not hasattr(a, 'dtype'): - a = np.asarray(a) - if not hasattr(b, 'dtype'): - b = np.asarray(b) - - def f(x): - return is_object_dtype(x) - - def is_object(x): - return is_integer_dtype(x) or is_float_dtype(x) - - is_datetimelike = needs_i8_conversion - return ((is_datetimelike(a) and is_object(b)) or - (is_datetimelike(b) and is_object(a))) - - -def needs_i8_conversion(arr_or_dtype): - return (is_datetime_or_timedelta_dtype(arr_or_dtype) or - is_datetime64tz_dtype(arr_or_dtype) or - is_period_dtype(arr_or_dtype)) - - -def is_numeric_dtype(arr_or_dtype): - tipo = _get_dtype_type(arr_or_dtype) - return (issubclass(tipo, (np.number, np.bool_)) and - not issubclass(tipo, (np.datetime64, np.timedelta64))) - - -def is_string_like_dtype(arr_or_dtype): - # exclude object as its a mixed dtype - dtype = _get_dtype(arr_or_dtype) - return dtype.kind in ('S', 'U') - - -def is_float_dtype(arr_or_dtype): - tipo = _get_dtype_type(arr_or_dtype) - return issubclass(tipo, np.floating) - - -def is_floating_dtype(arr_or_dtype): - tipo = _get_dtype_type(arr_or_dtype) - return isinstance(tipo, np.floating) - - -def is_bool_dtype(arr_or_dtype): - try: - tipo = _get_dtype_type(arr_or_dtype) - except ValueError: - # this isn't even a dtype - return False - return issubclass(tipo, np.bool_) - - -def is_extension_type(value): - """ - if we are a klass that is preserved by the internals - these are internal klasses that we represent (and don't use a np.array) - """ - if is_categorical(value): - return True - elif is_sparse(value): - return True - elif is_datetimetz(value): - return True - return False - - -def is_complex_dtype(arr_or_dtype): - tipo = _get_dtype_type(arr_or_dtype) - return issubclass(tipo, np.complexfloating) - - -def _coerce_to_dtype(dtype): - """ coerce a string / np.dtype to a dtype """ - if is_categorical_dtype(dtype): - dtype = CategoricalDtype() - elif is_datetime64tz_dtype(dtype): - dtype = DatetimeTZDtype(dtype) - elif is_period_dtype(dtype): - dtype = PeriodDtype(dtype) - else: - dtype = np.dtype(dtype) - return dtype - - -def _get_dtype(arr_or_dtype): - if isinstance(arr_or_dtype, np.dtype): - return arr_or_dtype - elif isinstance(arr_or_dtype, type): - return np.dtype(arr_or_dtype) - elif isinstance(arr_or_dtype, CategoricalDtype): - return arr_or_dtype - elif isinstance(arr_or_dtype, DatetimeTZDtype): - return arr_or_dtype - elif isinstance(arr_or_dtype, PeriodDtype): - return arr_or_dtype - elif isinstance(arr_or_dtype, string_types): - if is_categorical_dtype(arr_or_dtype): - return CategoricalDtype.construct_from_string(arr_or_dtype) - elif is_datetime64tz_dtype(arr_or_dtype): - return DatetimeTZDtype.construct_from_string(arr_or_dtype) - elif is_period_dtype(arr_or_dtype): - return PeriodDtype.construct_from_string(arr_or_dtype) - - if hasattr(arr_or_dtype, 'dtype'): - arr_or_dtype = arr_or_dtype.dtype - return np.dtype(arr_or_dtype) - - -def _get_dtype_type(arr_or_dtype): - if isinstance(arr_or_dtype, np.dtype): - return arr_or_dtype.type - elif isinstance(arr_or_dtype, type): - return np.dtype(arr_or_dtype).type - elif 
isinstance(arr_or_dtype, CategoricalDtype): - return CategoricalDtypeType - elif isinstance(arr_or_dtype, DatetimeTZDtype): - return DatetimeTZDtypeType - elif isinstance(arr_or_dtype, PeriodDtype): - return PeriodDtypeType - elif isinstance(arr_or_dtype, string_types): - if is_categorical_dtype(arr_or_dtype): - return CategoricalDtypeType - elif is_datetime64tz_dtype(arr_or_dtype): - return DatetimeTZDtypeType - elif is_period_dtype(arr_or_dtype): - return PeriodDtypeType - return _get_dtype_type(np.dtype(arr_or_dtype)) - try: - return arr_or_dtype.dtype.type - except AttributeError: - return type(None) - - -def _get_dtype_from_object(dtype): - """Get a numpy dtype.type-style object. This handles the datetime64[ns] - and datetime64[ns, TZ] compat - - Notes - ----- - If nothing can be found, returns ``object``. - """ - - # type object from a dtype - if isinstance(dtype, type) and issubclass(dtype, np.generic): - return dtype - elif is_categorical(dtype): - return CategoricalDtype().type - elif is_datetimetz(dtype): - return DatetimeTZDtype(dtype).type - elif isinstance(dtype, np.dtype): # dtype object - try: - _validate_date_like_dtype(dtype) - except TypeError: - # should still pass if we don't have a datelike - pass - return dtype.type - elif isinstance(dtype, string_types): - if dtype == 'datetime' or dtype == 'timedelta': - dtype += '64' - - try: - return _get_dtype_from_object(getattr(np, dtype)) - except (AttributeError, TypeError): - # handles cases like _get_dtype(int) - # i.e., python objects that are valid dtypes (unlike user-defined - # types, in general) - # TypeError handles the float16 typecode of 'e' - # further handle internal types - pass - - return _get_dtype_from_object(np.dtype(dtype)) - - -def _validate_date_like_dtype(dtype): - try: - typ = np.datetime_data(dtype)[0] - except ValueError as e: - raise TypeError('%s' % e) - if typ != 'generic' and typ != 'ns': - raise ValueError('%r is too specific of a frequency, try passing %r' % - (dtype.name, dtype.type.__name__)) - - -_string_dtypes = frozenset(map(_get_dtype_from_object, (binary_type, - text_type))) - - -def pandas_dtype(dtype): - """ - Converts input into a pandas only dtype object or a numpy dtype object. 
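`pandas_dtype` survives the deletion of `pandas/types/common.py`: it moves to `pandas.core.dtypes.common` and stays reachable through the public `pandas.api.types`. A short sketch of the coercions the branches below handle:

```python
from pandas.api.types import pandas_dtype

pandas_dtype('int64')                       # dtype('int64')
pandas_dtype('category')                    # CategoricalDtype
pandas_dtype('period[M]')                   # PeriodDtype('period[M]')
pandas_dtype('datetime64[ns, US/Eastern]')  # DatetimeTZDtype
```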
- - Parameters - ---------- - dtype : object to be converted - - Returns - ------- - np.dtype or a pandas dtype - """ - if isinstance(dtype, DatetimeTZDtype): - return dtype - elif isinstance(dtype, PeriodDtype): - return dtype - elif isinstance(dtype, CategoricalDtype): - return dtype - elif isinstance(dtype, string_types): - try: - return DatetimeTZDtype.construct_from_string(dtype) - except TypeError: - pass - - if dtype.startswith('period[') or dtype.startswith('Period['): - # do not parse string like U as period[U] - try: - return PeriodDtype.construct_from_string(dtype) - except TypeError: - pass - - try: - return CategoricalDtype.construct_from_string(dtype) - except TypeError: - pass - elif isinstance(dtype, ExtensionDtype): - return dtype - - return np.dtype(dtype) diff --git a/pandas/types/concat.py b/pandas/types/concat.py index 827eb160c452d..477156b38d56d 100644 --- a/pandas/types/concat.py +++ b/pandas/types/concat.py @@ -1,480 +1,11 @@ -""" -Utility functions related to concat -""" +import warnings -import numpy as np -import pandas.tslib as tslib -from pandas import compat -from pandas.core.algorithms import take_1d -from .common import (is_categorical_dtype, - is_sparse, - is_datetimetz, - is_datetime64_dtype, - is_timedelta64_dtype, - is_period_dtype, - is_object_dtype, - is_bool_dtype, - is_dtype_equal, - _NS_DTYPE, - _TD_DTYPE) -from pandas.types.generic import (ABCDatetimeIndex, ABCTimedeltaIndex, - ABCPeriodIndex) - -def get_dtype_kinds(l): - """ - Parameters - ---------- - l : list of arrays - - Returns - ------- - a set of kinds that exist in this list of arrays - """ - - typs = set() - for arr in l: - - dtype = arr.dtype - if is_categorical_dtype(dtype): - typ = 'category' - elif is_sparse(arr): - typ = 'sparse' - elif is_datetimetz(arr): - # if to_concat contains different tz, - # the result must be object dtype - typ = str(arr.dtype) - elif is_datetime64_dtype(dtype): - typ = 'datetime' - elif is_timedelta64_dtype(dtype): - typ = 'timedelta' - elif is_object_dtype(dtype): - typ = 'object' - elif is_bool_dtype(dtype): - typ = 'bool' - elif is_period_dtype(dtype): - typ = str(arr.dtype) - else: - typ = dtype.kind - typs.add(typ) - return typs - - -def _get_series_result_type(result): - """ - return appropriate class of Series concat - input is either dict or array-like - """ - if isinstance(result, dict): - # concat Series with axis 1 - if all(is_sparse(c) for c in compat.itervalues(result)): - from pandas.sparse.api import SparseDataFrame - return SparseDataFrame - else: - from pandas.core.frame import DataFrame - return DataFrame - - elif is_sparse(result): - # concat Series with axis 1 - from pandas.sparse.api import SparseSeries - return SparseSeries - else: - from pandas.core.series import Series - return Series - - -def _get_frame_result_type(result, objs): - """ - return appropriate class of DataFrame-like concat - if any block is SparseBlock, return SparseDataFrame - otherwise, return 1st obj - """ - if any(b.is_sparse for b in result.blocks): - from pandas.sparse.api import SparseDataFrame - return SparseDataFrame - else: - return objs[0] - - -def _concat_compat(to_concat, axis=0): - """ - provide concatenation of an array of arrays each of which is a single - 'normalized' dtypes (in that for example, if it's object, then it is a - non-datetimelike and provide a combined dtype for the resulting array that - preserves the overall dtype if possible) - - Parameters - ---------- - to_concat : array of arrays - axis : axis to provide concatenation - - Returns - 
------- - a single array, preserving the combined dtypes - """ - - # filter empty arrays - # 1-d dtypes always are included here - def is_nonempty(x): - try: - return x.shape[axis] > 0 - except Exception: - return True - - nonempty = [x for x in to_concat if is_nonempty(x)] - - # If all arrays are empty, there's nothing to convert, just short-cut to - # the concatenation, #3121. - # - # Creating an empty array directly is tempting, but the winnings would be - # marginal given that it would still require shape & dtype calculation and - # np.concatenate which has them both implemented is compiled. - - typs = get_dtype_kinds(to_concat) - - _contains_datetime = any(typ.startswith('datetime') for typ in typs) - _contains_period = any(typ.startswith('period') for typ in typs) - - if 'category' in typs: - # this must be priort to _concat_datetime, - # to support Categorical + datetime-like - return _concat_categorical(to_concat, axis=axis) - - elif _contains_datetime or 'timedelta' in typs or _contains_period: - return _concat_datetime(to_concat, axis=axis, typs=typs) - - # these are mandated to handle empties as well - elif 'sparse' in typs: - return _concat_sparse(to_concat, axis=axis, typs=typs) - - if not nonempty: - # we have all empties, but may need to coerce the result dtype to - # object if we have non-numeric type operands (numpy would otherwise - # cast this to float) - typs = get_dtype_kinds(to_concat) - if len(typs) != 1: - - if (not len(typs - set(['i', 'u', 'f'])) or - not len(typs - set(['bool', 'i', 'u']))): - # let numpy coerce - pass - else: - # coerce to object - to_concat = [x.astype('object') for x in to_concat] - - return np.concatenate(to_concat, axis=axis) - - -def _concat_categorical(to_concat, axis=0): - """Concatenate an object/categorical array of arrays, each of which is a - single dtype - - Parameters - ---------- - to_concat : array of arrays - axis : int - Axis to provide concatenation in the current implementation this is - always 0, e.g. we only have 1D categoricals - - Returns - ------- - Categorical - A single array, preserving the combined dtypes - """ - - def _concat_asobject(to_concat): - to_concat = [x.get_values() if is_categorical_dtype(x.dtype) - else x.ravel() for x in to_concat] - res = _concat_compat(to_concat) - if axis == 1: - return res.reshape(1, len(res)) - else: - return res - - # we could have object blocks and categoricals here - # if we only have a single categoricals then combine everything - # else its a non-compat categorical - categoricals = [x for x in to_concat if is_categorical_dtype(x.dtype)] - - # validate the categories - if len(categoricals) != len(to_concat): - pass - else: - # when all categories are identical - first = to_concat[0] - if all(first.is_dtype_equal(other) for other in to_concat[1:]): - return union_categoricals(categoricals) - - return _concat_asobject(to_concat) - - -def union_categoricals(to_union, sort_categories=False): - """ - Combine list-like of Categorical-like, unioning categories. All - categories must have the same dtype. - - .. versionadded:: 0.19.0 - - Parameters - ---------- - to_union : list-like of Categorical, CategoricalIndex, - or Series with dtype='category' - sort_categories : boolean, default False - If true, resulting categories will be lexsorted, otherwise - they will be ordered as they appear in the data. 
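Since the docstring above carries no examples, here is `union_categoricals` in action; it remains public as `pandas.api.types.union_categoricals`, which the shim later in this diff forwards to:

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])

union_categoricals([a, b])
# values ['b', 'c', 'a', 'b']; categories in order of appearance: b, c, a

union_categoricals([a, b], sort_categories=True)
# same values; categories lexsorted: a, b, c
```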
- - Returns - ------- - result : Categorical - - Raises - ------ - TypeError - - all inputs do not have the same dtype - - all inputs do not have the same ordered property - - all inputs are ordered and their categories are not identical - - sort_categories=True and Categoricals are ordered - ValueError - Emmpty list of categoricals passed - """ - from pandas import Index, Categorical, CategoricalIndex, Series - - if len(to_union) == 0: - raise ValueError('No Categoricals to union') - - def _maybe_unwrap(x): - if isinstance(x, (CategoricalIndex, Series)): - return x.values - elif isinstance(x, Categorical): - return x - else: - raise TypeError("all components to combine must be Categorical") - - to_union = [_maybe_unwrap(x) for x in to_union] - first = to_union[0] - - if not all(is_dtype_equal(other.categories.dtype, first.categories.dtype) - for other in to_union[1:]): - raise TypeError("dtype of categories must be the same") - - ordered = False - if all(first.is_dtype_equal(other) for other in to_union[1:]): - # identical categories - fastpath - categories = first.categories - ordered = first.ordered - new_codes = np.concatenate([c.codes for c in to_union]) - - if sort_categories and ordered: - raise TypeError("Cannot use sort_categories=True with " - "ordered Categoricals") - - if sort_categories and not categories.is_monotonic_increasing: - categories = categories.sort_values() - indexer = categories.get_indexer(first.categories) - new_codes = take_1d(indexer, new_codes, fill_value=-1) - elif all(not c.ordered for c in to_union): - # different categories - union and recode - cats = first.categories.append([c.categories for c in to_union[1:]]) - categories = Index(cats.unique()) - if sort_categories: - categories = categories.sort_values() - - new_codes = [] - for c in to_union: - if len(c.categories) > 0: - indexer = categories.get_indexer(c.categories) - new_codes.append(take_1d(indexer, c.codes, fill_value=-1)) - else: - # must be all NaN - new_codes.append(c.codes) - new_codes = np.concatenate(new_codes) - else: - # ordered - to show a proper error message - if all(c.ordered for c in to_union): - msg = ("to union ordered Categoricals, " - "all categories must be the same") - raise TypeError(msg) - else: - raise TypeError('Categorical.ordered must be the same') - - return Categorical(new_codes, categories=categories, ordered=ordered, - fastpath=True) - - -def _concat_datetime(to_concat, axis=0, typs=None): - """ - provide concatenation of an datetimelike array of arrays each of which is a - single M8[ns], datetimet64[ns, tz] or m8[ns] dtype - - Parameters - ---------- - to_concat : array of arrays - axis : axis to provide concatenation - typs : set of to_concat dtypes - - Returns - ------- - a single array, preserving the combined dtypes - """ - - def convert_to_pydatetime(x, axis): - # coerce to an object dtype - - # if dtype is of datetimetz or timezone - if x.dtype.kind == _NS_DTYPE.kind: - if getattr(x, 'tz', None) is not None: - x = x.asobject.values - else: - shape = x.shape - x = tslib.ints_to_pydatetime(x.view(np.int64).ravel(), - box=True) - x = x.reshape(shape) - - elif x.dtype == _TD_DTYPE: - shape = x.shape - x = tslib.ints_to_pytimedelta(x.view(np.int64).ravel(), box=True) - x = x.reshape(shape) - - if axis == 1: - x = np.atleast_2d(x) - return x - - if typs is None: - typs = get_dtype_kinds(to_concat) - - # must be single dtype - if len(typs) == 1: - _contains_datetime = any(typ.startswith('datetime') for typ in typs) - _contains_period = any(typ.startswith('period') 
for typ in typs) - - if _contains_datetime: - - if 'datetime' in typs: - new_values = np.concatenate([x.view(np.int64) for x in - to_concat], axis=axis) - return new_values.view(_NS_DTYPE) - else: - # when to_concat has different tz, len(typs) > 1. - # thus no need to care - return _concat_datetimetz(to_concat) - - elif 'timedelta' in typs: - new_values = np.concatenate([x.view(np.int64) for x in to_concat], - axis=axis) - return new_values.view(_TD_DTYPE) - - elif _contains_period: - # PeriodIndex must be handled by PeriodIndex, - # Thus can't meet this condition ATM - # Must be changed when we adding PeriodDtype - raise NotImplementedError - - # need to coerce to object - to_concat = [convert_to_pydatetime(x, axis) for x in to_concat] - return np.concatenate(to_concat, axis=axis) - - -def _concat_datetimetz(to_concat, name=None): - """ - concat DatetimeIndex with the same tz - all inputs must be DatetimeIndex - it is used in DatetimeIndex.append also - """ - # do not pass tz to set because tzlocal cannot be hashed - if len(set([str(x.dtype) for x in to_concat])) != 1: - raise ValueError('to_concat must have the same tz') - tz = to_concat[0].tz - # no need to localize because internal repr will not be changed - new_values = np.concatenate([x.asi8 for x in to_concat]) - return to_concat[0]._simple_new(new_values, tz=tz, name=name) - - -def _concat_index_asobject(to_concat, name=None): - """ - concat all inputs as object. DatetimeIndex, TimedeltaIndex and - PeriodIndex are converted to object dtype before concatenation - """ - - klasses = ABCDatetimeIndex, ABCTimedeltaIndex, ABCPeriodIndex - to_concat = [x.asobject if isinstance(x, klasses) else x - for x in to_concat] - - from pandas import Index - self = to_concat[0] - attribs = self._get_attributes_dict() - attribs['name'] = name - - to_concat = [x._values if isinstance(x, Index) else x - for x in to_concat] - return self._shallow_copy_with_infer(np.concatenate(to_concat), **attribs) - - -def _concat_sparse(to_concat, axis=0, typs=None): - """ - provide concatenation of an sparse/dense array of arrays each of which is a - single dtype - - Parameters - ---------- - to_concat : array of arrays - axis : axis to provide concatenation - typs : set of to_concat dtypes - - Returns - ------- - a single array, preserving the combined dtypes - """ - - from pandas.sparse.array import SparseArray, _make_index - - def convert_sparse(x, axis): - # coerce to native type - if isinstance(x, SparseArray): - x = x.get_values() - x = x.ravel() - if axis > 0: - x = np.atleast_2d(x) - return x - - if typs is None: - typs = get_dtype_kinds(to_concat) - - if len(typs) == 1: - # concat input as it is if all inputs are sparse - # and have the same fill_value - fill_values = set(c.fill_value for c in to_concat) - if len(fill_values) == 1: - sp_values = [c.sp_values for c in to_concat] - indexes = [c.sp_index.to_int_index() for c in to_concat] - - indices = [] - loc = 0 - for idx in indexes: - indices.append(idx.indices + loc) - loc += idx.length - sp_values = np.concatenate(sp_values) - indices = np.concatenate(indices) - sp_index = _make_index(loc, indices, kind=to_concat[0].sp_index) - - return SparseArray(sp_values, sparse_index=sp_index, - fill_value=to_concat[0].fill_value) - - # input may be sparse / dense mixed and may have different fill_value - # input must contain sparse at least 1 - sparses = [c for c in to_concat if is_sparse(c)] - fill_values = [c.fill_value for c in sparses] - sp_indexes = [c.sp_index for c in sparses] - - # densify and regular 
concat - to_concat = [convert_sparse(x, axis) for x in to_concat] - result = np.concatenate(to_concat, axis=axis) - - if not len(typs - set(['sparse', 'f', 'i'])): - # sparsify if inputs are sparse and dense numerics - # first sparse input's fill_value and SparseIndex is used - result = SparseArray(result.ravel(), fill_value=fill_values[0], - kind=sp_indexes[0]) - else: - # coerce to object if needed - result = result.astype('object') - return result +def union_categoricals(to_union, sort_categories=False, ignore_order=False): + warnings.warn("pandas.types.concat.union_categoricals is " + "deprecated and will be removed in a future version.\n" + "use pandas.api.types.union_categoricals", + FutureWarning, stacklevel=2) + from pandas.api.types import union_categoricals + return union_categoricals( + to_union, sort_categories=sort_categories, ignore_order=ignore_order) diff --git a/pandas/types/inference.py b/pandas/types/inference.py deleted file mode 100644 index 35a2dc2fb831b..0000000000000 --- a/pandas/types/inference.py +++ /dev/null @@ -1,104 +0,0 @@ -""" basic inference routines """ - -import collections -import re -import numpy as np -from numbers import Number -from pandas.compat import (string_types, text_type, - string_and_binary_types) -from pandas import lib - -is_bool = lib.is_bool - -is_integer = lib.is_integer - -is_float = lib.is_float - -is_complex = lib.is_complex - -is_scalar = lib.isscalar - - -def is_number(obj): - return isinstance(obj, (Number, np.number)) - - -def is_string_like(obj): - return isinstance(obj, (text_type, string_types)) - - -def _iterable_not_string(x): - return (isinstance(x, collections.Iterable) and - not isinstance(x, string_types)) - - -def is_iterator(obj): - # python 3 generators have __next__ instead of next - return hasattr(obj, 'next') or hasattr(obj, '__next__') - - -def is_re(obj): - return isinstance(obj, re._pattern_type) - - -def is_re_compilable(obj): - try: - re.compile(obj) - except TypeError: - return False - else: - return True - - -def is_list_like(arg): - return (hasattr(arg, '__iter__') and - not isinstance(arg, string_and_binary_types)) - - -def is_dict_like(arg): - return hasattr(arg, '__getitem__') and hasattr(arg, 'keys') - - -def is_named_tuple(arg): - return isinstance(arg, tuple) and hasattr(arg, '_fields') - - -def is_hashable(arg): - """Return True if hash(arg) will succeed, False otherwise. - - Some types will pass a test against collections.Hashable but fail when they - are actually hashed with hash(). - - Distinguish between these and other types by trying the call to hash() and - seeing if they raise TypeError. 
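The caveat behind `is_hashable`'s try/except strategy, expanded from its own doctest into a runnable probe:

```python
import collections.abc

a = ([],)  # a tuple containing a list

# the isinstance check is misleadingly optimistic...
print(isinstance(a, collections.abc.Hashable))  # True

# ...while actually hashing raises, which is what is_hashable detects
try:
    hash(a)
except TypeError:
    print("not hashable after all")
```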
- - Examples - -------- - >>> a = ([],) - >>> isinstance(a, collections.Hashable) - True - >>> is_hashable(a) - False - """ - # unfortunately, we can't use isinstance(arg, collections.Hashable), which - # can be faster than calling hash, because numpy scalars on Python 3 fail - # this test - - # reconsider this decision once this numpy bug is fixed: - # https://github.com/numpy/numpy/issues/5562 - - try: - hash(arg) - except TypeError: - return False - else: - return True - - -def is_sequence(x): - try: - iter(x) - len(x) # it has a length - return not isinstance(x, string_and_binary_types) - except (TypeError, AttributeError): - return False diff --git a/pandas/util/__init__.py b/pandas/util/__init__.py index e69de29bb2d1d..e86af930fef7c 100644 --- a/pandas/util/__init__.py +++ b/pandas/util/__init__.py @@ -0,0 +1,2 @@ +from pandas.core.util.hashing import hash_pandas_object, hash_array # noqa +from pandas.util._decorators import Appender, Substitution, cache_readonly # noqa diff --git a/pandas/util/decorators.py b/pandas/util/_decorators.py similarity index 77% rename from pandas/util/decorators.py rename to pandas/util/_decorators.py index e1888a3ffd62a..772b206f82e69 100644 --- a/pandas/util/decorators.py +++ b/pandas/util/_decorators.py @@ -1,9 +1,9 @@ -from pandas.compat import StringIO, callable, signature -from pandas.lib import cache_readonly # noqa -import sys +from pandas.compat import callable, signature +from pandas._libs.lib import cache_readonly # noqa +import types import warnings from textwrap import dedent -from functools import wraps +from functools import wraps, update_wrapper def deprecate(name, alternative, alt_name=None): @@ -24,7 +24,7 @@ def deprecate_kwarg(old_arg_name, new_arg_name, mapping=None, stacklevel=2): old_arg_name : str Name of argument in function to deprecate new_arg_name : str - Name of prefered argument in function + Name of preferred argument in function mapping : dict or callable If mapping is present, use it to translate old arguments to new arguments. A callable must do its own value checking; @@ -125,6 +125,7 @@ def some_function(x): def some_function(x): "%s %s wrote the Raven" """ + def __init__(self, *args, **kwargs): if (args and kwargs): raise AssertionError("Only positional or keyword args are allowed") @@ -171,6 +172,7 @@ def my_dog(has='fleas'): "This docstring will have a copyright below" pass """ + def __init__(self, addendum, join='', indents=0): if indents > 0: self.addendum = indent(addendum, indents=indents) @@ -193,76 +195,6 @@ def indent(text, indents=1): return jointext.join(text.split('\n')) -def suppress_stdout(f): - def wrapped(*args, **kwargs): - try: - sys.stdout = StringIO() - f(*args, **kwargs) - finally: - sys.stdout = sys.__stdout__ - - return wrapped - - -class KnownFailureTest(Exception): - """Raise this exception to mark a test as a known failing test.""" - pass - - -def knownfailureif(fail_condition, msg=None): - """ - Make function raise KnownFailureTest exception if given condition is true. - - If the condition is a callable, it is used at runtime to dynamically - make the decision. This is useful for tests that may require costly - imports, to delay the cost until the test suite is actually executed. - - Parameters - ---------- - fail_condition : bool or callable - Flag to determine whether to mark the decorated test as a known - failure (if True) or not (if False). - msg : str, optional - Message to give on raising a KnownFailureTest exception. - Default is None. 
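The nose-only `knownfailureif` removed here maps directly onto pytest's built-in xfail support; a sketch of the equivalent, with a hypothetical failing test for illustration:

```python
import pytest

# condition plays the role of fail_condition above; strict=False
# tolerates an unexpected pass instead of failing the run
@pytest.mark.xfail(condition=True, reason="known failure", strict=False)
def test_known_bad():
    assert 1 == 2
```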
- - Returns - ------- - decorator : function - Decorator, which, when applied to a function, causes SkipTest - to be raised when `skip_condition` is True, and the function - to be called normally otherwise. - - Notes - ----- - The decorator itself is decorated with the ``nose.tools.make_decorator`` - function in order to transmit function name, and various other metadata. - - """ - if msg is None: - msg = 'Test skipped due to known failure' - - # Allow for both boolean or callable known failure conditions. - if callable(fail_condition): - fail_val = fail_condition - else: - fail_val = lambda: fail_condition - - def knownfail_decorator(f): - # Local import to avoid a hard nose dependency and only incur the - # import time overhead at actual test-time. - import nose - - def knownfailer(*args, **kwargs): - if fail_val(): - raise KnownFailureTest(msg) - else: - return f(*args, **kwargs) - return nose.tools.make_decorator(f)(knownfailer) - - return knownfail_decorator - - def make_signature(func): """ Returns a string repr of the arg list of a func call, with any defaults @@ -290,3 +222,48 @@ def make_signature(func): if spec.keywords: args.append('**' + spec.keywords) return args, spec.args + + +class docstring_wrapper(object): + """ + decorator to wrap a function, + provide a dynamically evaluated doc-string + + Parameters + ---------- + func : callable + creator : callable + return the doc-string + default : str, optional + return this doc-string on error + """ + _attrs = ['__module__', '__name__', + '__qualname__', '__annotations__'] + + def __init__(self, func, creator, default=None): + self.func = func + self.creator = creator + self.default = default + update_wrapper( + self, func, [attr for attr in self._attrs + if hasattr(func, attr)]) + + def __get__(self, instance, cls=None): + + # we are called with a class + if instance is None: + return self + + # we want to return the actual passed instance + return types.MethodType(self, instance) + + def __call__(self, *args, **kwargs): + return self.func(*args, **kwargs) + + @property + def __doc__(self): + try: + return self.creator() + except Exception as exc: + msg = self.default or str(exc) + return msg diff --git a/pandas/util/_depr_module.py b/pandas/util/_depr_module.py new file mode 100644 index 0000000000000..9c648b76fdad1 --- /dev/null +++ b/pandas/util/_depr_module.py @@ -0,0 +1,103 @@ +""" +This module houses a utility class for mocking deprecated modules. +It is for internal use only and should not be used beyond this purpose. +""" + +import warnings +import importlib + + +class _DeprecatedModule(object): + """ Class for mocking deprecated modules. + + Parameters + ---------- + deprmod : name of module to be deprecated. + deprmodto : name of module as a replacement, optional. + If not given, the __module__ attribute will + be used when needed. + removals : objects or methods in module that will no longer be + accessible once module is removed. + moved : dict, optional + dictionary of function name -> new location for moved + objects + """ + + def __init__(self, deprmod, deprmodto=None, removals=None, + moved=None): + self.deprmod = deprmod + self.deprmodto = deprmodto + self.removals = removals + if self.removals is not None: + self.removals = frozenset(self.removals) + self.moved = moved + + # For introspection purposes. 
+ self.self_dir = frozenset(dir(self.__class__)) + + def __dir__(self): + deprmodule = self._import_deprmod() + return dir(deprmodule) + + def __repr__(self): + deprmodule = self._import_deprmod() + return repr(deprmodule) + + __str__ = __repr__ + + def __getattr__(self, name): + if name in self.self_dir: + return object.__getattribute__(self, name) + + try: + deprmodule = self._import_deprmod(self.deprmod) + except ImportError: + if self.deprmodto is None: + raise + + # a rename + deprmodule = self._import_deprmod(self.deprmodto) + + obj = getattr(deprmodule, name) + + if self.removals is not None and name in self.removals: + warnings.warn( + "{deprmod}.{name} is deprecated and will be removed in " + "a future version.".format(deprmod=self.deprmod, name=name), + FutureWarning, stacklevel=2) + elif self.moved is not None and name in self.moved: + warnings.warn( + "{deprmod} is deprecated and will be removed in " + "a future version.\nYou can access {name} as {moved}".format( + deprmod=self.deprmod, + name=name, + moved=self.moved[name]), + FutureWarning, stacklevel=2) + else: + deprmodto = self.deprmodto + if deprmodto is False: + warnings.warn( + "{deprmod}.{name} is deprecated and will be removed in " + "a future version.".format( + deprmod=self.deprmod, name=name), + FutureWarning, stacklevel=2) + else: + if deprmodto is None: + deprmodto = obj.__module__ + # The object is actually located in another module. + warnings.warn( + "{deprmod}.{name} is deprecated. Please use " + "{deprmodto}.{name} instead.".format( + deprmod=self.deprmod, name=name, deprmodto=deprmodto), + FutureWarning, stacklevel=2) + + return obj + + def _import_deprmod(self, mod=None): + if mod is None: + mod = self.deprmod + + with warnings.catch_warnings(): + warnings.filterwarnings('ignore', category=FutureWarning) + deprmodule = importlib.import_module(mod) + return deprmodule diff --git a/pandas/util/doctools.py b/pandas/util/_doctools.py similarity index 96% rename from pandas/util/doctools.py rename to pandas/util/_doctools.py index 62dcba1405581..cbc9518b96416 100644 --- a/pandas/util/doctools.py +++ b/pandas/util/_doctools.py @@ -113,12 +113,12 @@ def _insert_index(self, data): else: for i in range(idx_nlevels): data.insert(i, 'Index{0}'.format(i), - data.index.get_level_values(i)) + data.index._get_level_values(i)) col_nlevels = data.columns.nlevels if col_nlevels > 1: - col = data.columns.get_level_values(0) - values = [data.columns.get_level_values(i).values + col = data.columns._get_level_values(0) + values = [data.columns._get_level_values(i).values for i in range(1, col_nlevels)] col_df = pd.DataFrame(values) data.columns = col_df.columns @@ -131,7 +131,7 @@ def _make_table(self, ax, df, title, height=None): ax.set_visible(False) return - import pandas.tools.plotting as plotting + import pandas.plotting as plotting idx_nlevels = df.index.nlevels col_nlevels = df.columns.nlevels diff --git a/pandas/util/print_versions.py b/pandas/util/_print_versions.py similarity index 94% rename from pandas/util/print_versions.py rename to pandas/util/_print_versions.py index 3747e2ff6ca8f..ca75d4d02e927 100644 --- a/pandas/util/print_versions.py +++ b/pandas/util/_print_versions.py @@ -63,13 +63,12 @@ def show_versions(as_json=False): deps = [ # (MODULE_NAME, f(mod) -> mod version) ("pandas", lambda mod: mod.__version__), - ("nose", lambda mod: mod.__version__), + ("pytest", lambda mod: mod.__version__), ("pip", lambda mod: mod.__version__), ("setuptools", lambda mod: mod.__version__), ("Cython", lambda mod: 
mod.__version__), ("numpy", lambda mod: mod.version.version), ("scipy", lambda mod: mod.version.version), - ("statsmodels", lambda mod: mod.__version__), ("xarray", lambda mod: mod.__version__), ("IPython", lambda mod: mod.__version__), ("sphinx", lambda mod: mod.__version__), @@ -80,6 +79,7 @@ def show_versions(as_json=False): ("bottleneck", lambda mod: mod.__version__), ("tables", lambda mod: mod.__version__), ("numexpr", lambda mod: mod.__version__), + ("feather", lambda mod: mod.__version__), ("matplotlib", lambda mod: mod.__version__), ("openpyxl", lambda mod: mod.__version__), ("xlrd", lambda mod: mod.__VERSION__), @@ -88,13 +88,12 @@ def show_versions(as_json=False): ("lxml", lambda mod: mod.etree.__version__), ("bs4", lambda mod: mod.__version__), ("html5lib", lambda mod: mod.__version__), - ("httplib2", lambda mod: mod.__version__), - ("apiclient", lambda mod: mod.__version__), ("sqlalchemy", lambda mod: mod.__version__), ("pymysql", lambda mod: mod.__version__), ("psycopg2", lambda mod: mod.__version__), ("jinja2", lambda mod: mod.__version__), - ("boto", lambda mod: mod.__version__), + ("s3fs", lambda mod: mod.__version__), + ("pandas_gbq", lambda mod: mod.__version__), ("pandas_datareader", lambda mod: mod.__version__) ] @@ -153,5 +152,6 @@ def main(): return 0 + if __name__ == "__main__": sys.exit(main()) diff --git a/pandas/util/_tester.py b/pandas/util/_tester.py new file mode 100644 index 0000000000000..aeb4259a9edae --- /dev/null +++ b/pandas/util/_tester.py @@ -0,0 +1,27 @@ +""" +Entrypoint for testing from the top-level namespace +""" +import os +import sys + +PKG = os.path.dirname(os.path.dirname(__file__)) + + +try: + import pytest +except ImportError: + def test(): + raise ImportError("Need pytest>=3.0 to run tests") +else: + def test(extra_args=None): + cmd = ['--skip-slow', '--skip-network'] + if extra_args: + if not isinstance(extra_args, list): + extra_args = [extra_args] + cmd = extra_args + cmd += [PKG] + print("running: pytest {}".format(' '.join(cmd))) + sys.exit(pytest.main(cmd)) + + +__all__ = ['test'] diff --git a/pandas/util/validators.py b/pandas/util/_validators.py similarity index 95% rename from pandas/util/validators.py rename to pandas/util/_validators.py index 964fa9d9b38d5..6b19904f4a665 100644 --- a/pandas/util/validators.py +++ b/pandas/util/_validators.py @@ -3,7 +3,7 @@ for validating data or function arguments """ -from pandas.types.common import is_bool +from pandas.core.dtypes.common import is_bool def _check_arg_length(fname, args, max_fname_arg_count, compat_args): @@ -215,3 +215,12 @@ def validate_args_and_kwargs(fname, args, kwargs, kwargs.update(args_dict) validate_kwargs(fname, kwargs, compat_args) + + +def validate_bool_kwarg(value, arg_name): + """ Ensures that argument passed in arg_name is of type bool. """ + if not (is_bool(value) or value is None): + raise ValueError('For argument "%s" expected type bool, ' + 'received type %s.' % + (arg_name, type(value).__name__)) + return value diff --git a/pandas/util/depr_module.py b/pandas/util/depr_module.py deleted file mode 100644 index 736d2cdaab31c..0000000000000 --- a/pandas/util/depr_module.py +++ /dev/null @@ -1,64 +0,0 @@ -""" -This module houses a utility class for mocking deprecated modules. -It is for internal use only and should not be used beyond this purpose. -""" - -import warnings -import importlib - - -class _DeprecatedModule(object): - """ Class for mocking deprecated modules. - - Parameters - ---------- - deprmod : name of module to be deprecated. 
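For contrast with the slimmer class being removed in this hunk, the extended `_DeprecatedModule` added earlier is wired up roughly like this in `pandas/__init__.py`; the module name and mapping values here are illustrative rather than an exact transcription:

```python
from pandas.util._depr_module import _DeprecatedModule

# rebind the legacy module name to a proxy: attribute access re-imports the
# real module and emits the matching FutureWarning shown in the class above
json = _DeprecatedModule(deprmod='pandas.json',
                         moved={'dumps': 'pandas.io.json.dumps',
                                'loads': 'pandas.io.json.loads'})
```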
- removals : objects or methods in module that will no longer be - accessible once module is removed. - """ - def __init__(self, deprmod, removals=None): - self.deprmod = deprmod - self.removals = removals - if self.removals is not None: - self.removals = frozenset(self.removals) - - # For introspection purposes. - self.self_dir = frozenset(dir(self.__class__)) - - def __dir__(self): - deprmodule = self._import_deprmod() - return dir(deprmodule) - - def __repr__(self): - deprmodule = self._import_deprmod() - return repr(deprmodule) - - __str__ = __repr__ - - def __getattr__(self, name): - if name in self.self_dir: - return object.__getattribute__(self, name) - - deprmodule = self._import_deprmod() - obj = getattr(deprmodule, name) - - if self.removals is not None and name in self.removals: - warnings.warn( - "{deprmod}.{name} is deprecated and will be removed in " - "a future version.".format(deprmod=self.deprmod, name=name), - FutureWarning, stacklevel=2) - else: - # The object is actually located in another module. - warnings.warn( - "{deprmod}.{name} is deprecated. Please use " - "{modname}.{name} instead.".format( - deprmod=self.deprmod, modname=obj.__module__, name=name), - FutureWarning, stacklevel=2) - - return obj - - def _import_deprmod(self): - with warnings.catch_warnings(): - warnings.filterwarnings('ignore', category=FutureWarning) - deprmodule = importlib.import_module(self.deprmod) - return deprmodule diff --git a/pandas/util/nosetester.py b/pandas/util/nosetester.py deleted file mode 100644 index 1bdaaff99fd50..0000000000000 --- a/pandas/util/nosetester.py +++ /dev/null @@ -1,261 +0,0 @@ -""" -Nose test running. - -This module implements ``test()`` function for pandas modules. - -""" -from __future__ import division, absolute_import, print_function - -import os -import sys -import warnings -from pandas.compat import string_types -from numpy.testing import nosetester - - -def get_package_name(filepath): - """ - Given a path where a package is installed, determine its name. - - Parameters - ---------- - filepath : str - Path to a file. If the determination fails, "pandas" is returned. - - Examples - -------- - >>> pandas.util.nosetester.get_package_name('nonsense') - 'pandas' - - """ - - pkg_name = [] - while 'site-packages' in filepath or 'dist-packages' in filepath: - filepath, p2 = os.path.split(filepath) - if p2 in ('site-packages', 'dist-packages'): - break - pkg_name.append(p2) - - # if package name determination failed, just default to pandas - if not pkg_name: - return "pandas" - - # otherwise, reverse to get correct order and return - pkg_name.reverse() - - # don't include the outer egg directory - if pkg_name[0].endswith('.egg'): - pkg_name.pop(0) - - return '.'.join(pkg_name) - -import_nose = nosetester.import_nose -run_module_suite = nosetester.run_module_suite - - -class NoseTester(nosetester.NoseTester): - """ - Nose test runner. - - This class is made available as pandas.util.nosetester.NoseTester, and - a test function is typically added to a package's __init__.py like so:: - - from numpy.testing import Tester - test = Tester().test - - Calling this test function finds and runs all tests associated with the - package and all its sub-packages. - - Attributes - ---------- - package_path : str - Full path to the package to test. - package_name : str - Name of the package to test. - - Parameters - ---------- - package : module, str or None, optional - The package to test. If a string, this should be the full path to - the package. 
If None (default), `package` is set to the module from - which `NoseTester` is initialized. - raise_warnings : None, str or sequence of warnings, optional - This specifies which warnings to configure as 'raise' instead - of 'warn' during the test execution. Valid strings are: - - - "develop" : equals ``(DeprecationWarning, RuntimeWarning)`` - - "release" : equals ``()``, don't raise on any warnings. - - See Notes for more details. - - Notes - ----- - The default for `raise_warnings` is - ``(DeprecationWarning, RuntimeWarning)`` for development versions of - pandas, and ``()`` for released versions. The purpose of this switching - behavior is to catch as many warnings as possible during development, but - not give problems for packaging of released versions. - - """ - excludes = [] - - def _show_system_info(self): - nose = import_nose() - - import pandas - print("pandas version %s" % pandas.__version__) - import numpy - print("numpy version %s" % numpy.__version__) - pddir = os.path.dirname(pandas.__file__) - print("pandas is installed in %s" % pddir) - - pyversion = sys.version.replace('\n', '') - print("Python version %s" % pyversion) - print("nose version %d.%d.%d" % nose.__versioninfo__) - - def _get_custom_doctester(self): - """ Return instantiated plugin for doctests - - Allows subclassing of this class to override doctester - - A return value of None means use the nose builtin doctest plugin - """ - return None - - def _test_argv(self, label, verbose, extra_argv): - """ - Generate argv for nosetest command - - Parameters - ---------- - label : {'fast', 'full', '', attribute identifier}, optional - see ``test`` docstring - verbose : int, optional - Verbosity value for test outputs, in the range 1-10. Default is 1. - extra_argv : list, optional - List with any extra arguments to pass to nosetests. - - Returns - ------- - argv : list - command line arguments that will be passed to nose - """ - - argv = [__file__, self.package_path] - if label and label != 'full': - if not isinstance(label, string_types): - raise TypeError('Selection label should be a string') - if label == 'fast': - label = 'not slow and not network and not disabled' - argv += ['-A', label] - argv += ['--verbosity', str(verbose)] - - # When installing with setuptools, and also in some other cases, the - # test_*.py files end up marked +x executable. Nose, by default, does - # not run files marked with +x as they might be scripts. However, in - # our case nose only looks for test_*.py files under the package - # directory, which should be safe. - argv += ['--exe'] - - if extra_argv: - argv += extra_argv - return argv - - def test(self, label='fast', verbose=1, extra_argv=None, - doctests=False, coverage=False, raise_warnings=None): - """ - Run tests for module using nose. - - Parameters - ---------- - label : {'fast', 'full', '', attribute identifier}, optional - Identifies the tests to run. This can be a string to pass to - the nosetests executable with the '-A' option, or one of several - special values. Special values are: - - * 'fast' - the default - which corresponds to the ``nosetests -A`` - option of 'not slow'. - * 'full' - fast (as above) and slow tests as in the - 'no -A' option to nosetests - this is the same as ''. - * None or '' - run all tests. - * attribute_identifier - string passed directly to nosetests - as '-A'. - - verbose : int, optional - Verbosity value for test outputs, in the range 1-10. Default is 1. - extra_argv : list, optional - List with any extra arguments to pass to nosetests. 
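For anyone migrating their own invocations as this runner is retired: the nose attribute selection above has a straightforward pytest counterpart, using the same flags the new `pandas.util._tester` passes:

```python
import pytest

# old: nosetests -A "not slow and not network and not disabled" pandas
# new (the --skip-slow / --skip-network flags come from pandas' conftest):
pytest.main(['--skip-slow', '--skip-network', '--pyargs', 'pandas'])
```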
- doctests : bool, optional - If True, run doctests in module. Default is False. - coverage : bool, optional - If True, report coverage of NumPy code. Default is False. - (This requires the `coverage module - `_). - raise_warnings : str or sequence of warnings, optional - This specifies which warnings to configure as 'raise' instead - of 'warn' during the test execution. Valid strings are: - - - 'develop' : equals ``(DeprecationWarning, RuntimeWarning)`` - - 'release' : equals ``()``, don't raise on any warnings. - - Returns - ------- - result : object - Returns the result of running the tests as a - ``nose.result.TextTestResult`` object. - - """ - - # cap verbosity at 3 because nose becomes *very* verbose beyond that - verbose = min(verbose, 3) - - if doctests: - print("Running unit tests and doctests for %s" % self.package_name) - else: - print("Running unit tests for %s" % self.package_name) - - self._show_system_info() - - # reset doctest state on every run - import doctest - doctest.master = None - - if raise_warnings is None: - - # default based on if we are released - from pandas import __version__ - from distutils.version import StrictVersion - try: - StrictVersion(__version__) - raise_warnings = 'release' - except ValueError: - raise_warnings = 'develop' - - _warn_opts = dict(develop=(DeprecationWarning, RuntimeWarning), - release=()) - if isinstance(raise_warnings, string_types): - raise_warnings = _warn_opts[raise_warnings] - - with warnings.catch_warnings(): - - if len(raise_warnings): - - # Reset the warning filters to the default state, - # so that running the tests is more repeatable. - warnings.resetwarnings() - # Set all warnings to 'warn', this is because the default - # 'once' has the bad property of possibly shadowing later - # warnings. - warnings.filterwarnings('always') - # Force the requested warnings to raise - for warningtype in raise_warnings: - warnings.filterwarnings('error', category=warningtype) - # Filter out annoying import messages. 
- warnings.filterwarnings("ignore", category=FutureWarning) - - from numpy.testing.noseclasses import NumpyTestProgram - argv, plugins = self.prepare_test_args( - label, verbose, extra_argv, doctests, coverage) - t = NumpyTestProgram(argv=argv, exit=False, plugins=plugins) - - return t.result diff --git a/pandas/util/testing.py b/pandas/util/testing.py index 57bb01e5e0406..f6b572cdf7179 100644 --- a/pandas/util/testing.py +++ b/pandas/util/testing.py @@ -10,7 +10,6 @@ import os import subprocess import locale -import unittest import traceback from datetime import datetime @@ -19,36 +18,46 @@ from distutils.version import LooseVersion from numpy.random import randn, rand -from numpy.testing.decorators import slow # noqa import numpy as np import pandas as pd -from pandas.types.missing import array_equivalent -from pandas.types.common import (is_datetimelike_v_numeric, - is_datetimelike_v_object, - is_number, is_bool, - needs_i8_conversion, - is_categorical_dtype, - is_sequence, - is_list_like) -from pandas.formats.printing import pprint_thing +from pandas.core.dtypes.missing import array_equivalent +from pandas.core.dtypes.common import ( + is_datetimelike_v_numeric, + is_datetimelike_v_object, + is_number, is_bool, + needs_i8_conversion, + is_categorical_dtype, + is_interval_dtype, + is_sequence, + is_list_like) +from pandas.io.formats.printing import pprint_thing from pandas.core.algorithms import take_1d import pandas.compat as compat -from pandas.compat import( +from pandas.compat import ( filter, map, zip, range, unichr, lrange, lmap, lzip, u, callable, Counter, raise_with_traceback, httplib, is_platform_windows, is_platform_32bit, - PY3 + StringIO, PY3 ) -from pandas.computation import expressions as expr +from pandas.core.computation import expressions as expr -from pandas import (bdate_range, CategoricalIndex, Categorical, DatetimeIndex, - TimedeltaIndex, PeriodIndex, RangeIndex, Index, MultiIndex, +from pandas import (bdate_range, CategoricalIndex, Categorical, IntervalIndex, + DatetimeIndex, TimedeltaIndex, PeriodIndex, RangeIndex, + Index, MultiIndex, Series, DataFrame, Panel, Panel4D) -from pandas.util.decorators import deprecate -from pandas import _testing + +from pandas._libs import testing as _testing from pandas.io.common import urlopen +try: + import pytest + slow = pytest.mark.slow +except ImportError: + # Should be ok to just ignore. If you actually need + # slow then you'll hit an import error long before getting here. + pass + N = 30 K = 4 @@ -72,56 +81,47 @@ def reset_testing_mode(): if 'deprecate' in testing_mode: warnings.simplefilter('ignore', _testing_mode_warnings) -set_testing_mode() - -class TestCase(unittest.TestCase): - - @classmethod - def setUpClass(cls): - pd.set_option('chained_assignment', 'raise') +set_testing_mode() - @classmethod - def tearDownClass(cls): - pass - def reset_display_options(self): - # reset the display options - pd.reset_option('^display.', silent=True) +def reset_display_options(): + """ + Reset the display options for printing and representing objects. 
+ """ - def round_trip_pickle(self, obj, path=None): - if path is None: - path = u('__%s__.pickle' % rands(10)) - with ensure_clean(path) as path: - pd.to_pickle(obj, path) - return pd.read_pickle(path) + pd.reset_option('^display.', silent=True) - # https://docs.python.org/3/library/unittest.html#deprecated-aliases - def assertEquals(self, *args, **kwargs): - return deprecate('assertEquals', - self.assertEqual)(*args, **kwargs) - def assertNotEquals(self, *args, **kwargs): - return deprecate('assertNotEquals', - self.assertNotEqual)(*args, **kwargs) +def round_trip_pickle(obj, path=None): + """ + Pickle an object and then read it again. - def assert_(self, *args, **kwargs): - return deprecate('assert_', - self.assertTrue)(*args, **kwargs) + Parameters + ---------- + obj : pandas object + The object to pickle and then re-read. + path : str, default None + The path where the pickled object is written and then read. - def assertAlmostEquals(self, *args, **kwargs): - return deprecate('assertAlmostEquals', - self.assertAlmostEqual)(*args, **kwargs) + Returns + ------- + round_trip_pickled_object : pandas object + The original object that was pickled and then re-read. + """ - def assertNotAlmostEquals(self, *args, **kwargs): - return deprecate('assertNotAlmostEquals', - self.assertNotAlmostEqual)(*args, **kwargs) + if path is None: + path = u('__%s__.pickle' % rands(10)) + with ensure_clean(path) as path: + pd.to_pickle(obj, path) + return pd.read_pickle(path) def assert_almost_equal(left, right, check_exact=False, check_dtype='equiv', check_less_precise=False, **kwargs): - """Check that left and right Index are equal. + """ + Check that the left and right objects are approximately equal. Parameters ---------- @@ -165,7 +165,7 @@ def assert_almost_equal(left, right, check_exact=False, pass else: if (isinstance(left, np.ndarray) or - isinstance(right, np.ndarray)): + isinstance(right, np.ndarray)): obj = 'numpy array' else: obj = 'Input' @@ -177,11 +177,35 @@ def assert_almost_equal(left, right, check_exact=False, **kwargs) -def assert_dict_equal(left, right, compare_keys=True): +def _check_isinstance(left, right, cls): + """ + Helper method for our assert_* methods that ensures that + the two objects being compared have the right type before + proceeding with the comparison. - assertIsInstance(left, dict, '[dict] ') - assertIsInstance(right, dict, '[dict] ') + Parameters + ---------- + left : The first object being compared. + right : The second object being compared. + cls : The class type to check against. + Raises + ------ + AssertionError : Either `left` or `right` is not an instance of `cls`. 
+ """ + + err_msg = "{0} Expected type {1}, found {2} instead" + cls_name = cls.__name__ + + if not isinstance(left, cls): + raise AssertionError(err_msg.format(cls_name, cls, type(left))) + if not isinstance(right, cls): + raise AssertionError(err_msg.format(cls_name, cls, type(right))) + + +def assert_dict_equal(left, right, compare_keys=True): + + _check_isinstance(left, right, dict) return _testing.assert_dict_equal(left, right, compare_keys=compare_keys) @@ -246,127 +270,123 @@ def close(fignum=None): def _skip_if_32bit(): - import nose + import pytest if is_platform_32bit(): - raise nose.SkipTest("skipping for 32 bit") + pytest.skip("skipping for 32 bit") -def mplskip(cls): - """Skip a TestCase instance if matplotlib isn't installed""" +def _skip_module_if_no_mpl(): + import pytest - @classmethod - def setUpClass(cls): - try: - import matplotlib as mpl - mpl.use("Agg", warn=False) - except ImportError: - import nose - raise nose.SkipTest("matplotlib not installed") - - cls.setUpClass = setUpClass - return cls + mpl = pytest.importorskip("matplotlib") + mpl.use("Agg", warn=False) def _skip_if_no_mpl(): try: - import matplotlib # noqa + import matplotlib as mpl + mpl.use("Agg", warn=False) except ImportError: - import nose - raise nose.SkipTest("matplotlib not installed") + import pytest + pytest.skip("matplotlib not installed") def _skip_if_mpl_1_5(): - import matplotlib - v = matplotlib.__version__ + import matplotlib as mpl + + v = mpl.__version__ if v > LooseVersion('1.4.3') or v[0] == '0': - import nose - raise nose.SkipTest("matplotlib 1.5") + import pytest + pytest.skip("matplotlib 1.5") + else: + mpl.use("Agg", warn=False) def _skip_if_no_scipy(): try: import scipy.stats # noqa except ImportError: - import nose - raise nose.SkipTest("no scipy.stats module") + import pytest + pytest.skip("no scipy.stats module") try: import scipy.interpolate # noqa except ImportError: - import nose - raise nose.SkipTest('scipy.interpolate missing') - - -def _skip_if_scipy_0_17(): - import scipy - v = scipy.__version__ - if v >= LooseVersion("0.17.0"): - import nose - raise nose.SkipTest("scipy 0.17") + import pytest + pytest.skip('scipy.interpolate missing') + try: + import scipy.sparse # noqa + except ImportError: + import pytest + pytest.skip('scipy.sparse missing') -def _skip_if_no_lzma(): +def _check_if_lzma(): try: return compat.import_lzma() except ImportError: - import nose - raise nose.SkipTest('need backports.lzma to run') + return False + + +def _skip_if_no_lzma(): + import pytest + return _check_if_lzma() or pytest.skip('need backports.lzma to run') def _skip_if_no_xarray(): try: import xarray except ImportError: - import nose - raise nose.SkipTest("xarray not installed") + import pytest + pytest.skip("xarray not installed") v = xarray.__version__ if v < LooseVersion('0.7.0'): - import nose - raise nose.SkipTest("xarray not version is too low: {0}".format(v)) + import pytest + pytest.skip("xarray not version is too low: {0}".format(v)) def _skip_if_no_pytz(): try: import pytz # noqa except ImportError: - import nose - raise nose.SkipTest("pytz not installed") + import pytest + pytest.skip("pytz not installed") def _skip_if_no_dateutil(): try: import dateutil # noqa except ImportError: - import nose - raise nose.SkipTest("dateutil not installed") + import pytest + pytest.skip("dateutil not installed") def _skip_if_windows_python_3(): if PY3 and is_platform_windows(): - import nose - raise nose.SkipTest("not used on python 3/win32") + import pytest + pytest.skip("not used on python 
3/win32") def _skip_if_windows(): if is_platform_windows(): - import nose - raise nose.SkipTest("Running on Windows") + import pytest + pytest.skip("Running on Windows") def _skip_if_no_pathlib(): try: from pathlib import Path # noqa except ImportError: - import nose - raise nose.SkipTest("pathlib not available") + import pytest + pytest.skip("pathlib not available") def _skip_if_no_localpath(): try: from py.path import local as LocalPath # noqa except ImportError: - import nose - raise nose.SkipTest("py.path not installed") + import pytest + pytest.skip("py.path not installed") def _incompat_bottleneck_version(method): @@ -385,24 +405,52 @@ def _incompat_bottleneck_version(method): def skip_if_no_ne(engine='numexpr'): - from pandas.computation.expressions import (_USE_NUMEXPR, - _NUMEXPR_INSTALLED) + from pandas.core.computation.expressions import ( + _USE_NUMEXPR, + _NUMEXPR_INSTALLED) if engine == 'numexpr': if not _USE_NUMEXPR: - import nose - raise nose.SkipTest("numexpr enabled->{enabled}, " - "installed->{installed}".format( - enabled=_USE_NUMEXPR, - installed=_NUMEXPR_INSTALLED)) + import pytest + pytest.skip("numexpr enabled->{enabled}, " + "installed->{installed}".format( + enabled=_USE_NUMEXPR, + installed=_NUMEXPR_INSTALLED)) def _skip_if_has_locale(): import locale lang, _ = locale.getlocale() if lang is not None: + import pytest + pytest.skip("Specific locale is set {0}".format(lang)) + + +def _skip_if_not_us_locale(): + import locale + lang, _ = locale.getlocale() + if lang != 'en_US': + import pytest + pytest.skip("Specific locale is set {0}".format(lang)) + + +def _skip_if_no_mock(): + try: + import mock # noqa + except ImportError: + try: + from unittest import mock # noqa + except ImportError: + import nose + raise nose.SkipTest("mock is not installed") + + +def _skip_if_no_ipython(): + try: + import IPython # noqa + except ImportError: import nose - raise nose.SkipTest("Specific locale is set {0}".format(lang)) + raise nose.SkipTest("IPython not installed") # ----------------------------------------------------------------------------- # locale utilities @@ -589,6 +637,105 @@ def _valid_locales(locales, normalize): return list(filter(_can_set_locale, map(normalizer, locales))) +# ----------------------------------------------------------------------------- +# Stdout / stderr decorators + + +def capture_stdout(f): + """ + Decorator to capture stdout in a buffer so that it can be checked + (or suppressed) during testing. + + Parameters + ---------- + f : callable + The test that is capturing stdout. + + Returns + ------- + f : callable + The decorated test ``f``, which captures stdout. + + Examples + -------- + + >>> from pandas.util.testing import capture_stdout + >>> + >>> import sys + >>> + >>> @capture_stdout + ... def test_print_pass(): + ... print("foo") + ... out = sys.stdout.getvalue() + ... assert out == "foo\n" + >>> + >>> @capture_stdout + ... def test_print_fail(): + ... print("foo") + ... out = sys.stdout.getvalue() + ... assert out == "bar\n" + ... + AssertionError: assert 'foo\n' == 'bar\n' + """ + + @wraps(f) + def wrapper(*args, **kwargs): + try: + sys.stdout = StringIO() + f(*args, **kwargs) + finally: + sys.stdout = sys.__stdout__ + + return wrapper + + +def capture_stderr(f): + """ + Decorator to capture stderr in a buffer so that it can be checked + (or suppressed) during testing. + + Parameters + ---------- + f : callable + The test that is capturing stderr. + + Returns + ------- + f : callable + The decorated test ``f``, which captures stderr. 
+ + Examples + -------- + + >>> from pandas.util.testing import capture_stderr + >>> + >>> import sys + >>> + >>> @capture_stderr + ... def test_stderr_pass(): + ... sys.stderr.write("foo") + ... out = sys.stderr.getvalue() + ... assert out == "foo\n" + >>> + >>> @capture_stderr + ... def test_stderr_fail(): + ... sys.stderr.write("foo") + ... out = sys.stderr.getvalue() + ... assert out == "bar\n" + ... + AssertionError: assert 'foo\n' == 'bar\n' + """ + + @wraps(f) + def wrapper(*args, **kwargs): + try: + sys.stderr = StringIO() + f(*args, **kwargs) + finally: + sys.stderr = sys.__stderr__ + + return wrapper + # ----------------------------------------------------------------------------- # Console debugging tools @@ -652,8 +799,8 @@ def ensure_clean(filename=None, return_filelike=False): try: fd, filename = tempfile.mkstemp(suffix=filename) except UnicodeEncodeError: - import nose - raise nose.SkipTest('no unicode file names on this system') + import pytest + pytest.skip('no unicode file names on this system') try: yield filename @@ -689,23 +836,6 @@ def equalContents(arr1, arr2): return frozenset(arr1) == frozenset(arr2) -def assert_equal(a, b, msg=""): - """asserts that a equals b, like nose's assert_equal, - but allows custom message to start. Passes a and b to - format string as well. So you can use '{0}' and '{1}' - to display a and b. - - Examples - -------- - >>> assert_equal(2, 2, "apples") - >>> assert_equal(5.2, 1.2, "{0} was really a dead parrot") - Traceback (most recent call last): - ... - AssertionError: 5.2 was really a dead parrot: 5.2 != 1.2 - """ - assert a == b, "%s: %r != %r" % (msg.format(a, b), a, b) - - def assert_index_equal(left, right, exact='equiv', check_names=True, check_less_precise=False, check_exact=True, check_categorical=True, obj='Index'): @@ -717,8 +847,8 @@ def assert_index_equal(left, right, exact='equiv', check_names=True, right : Index exact : bool / string {'equiv'}, default False Whether to check the Index class, dtype and inferred_type - are identical. If 'equiv', then RangeIndex can be substitued for - Int64Index as well + are identical. If 'equiv', then RangeIndex can be substituted for + Int64Index as well. check_names : bool, default True Whether to check the names attribute. 
check_less_precise : bool or int, default False @@ -740,7 +870,7 @@ def _check_types(l, r, obj='Index'): assert_attr_equal('dtype', l, r, obj=obj) # allow string-like to have different inferred_types if l.inferred_type in ('string', 'unicode'): - assertIn(r.inferred_type, ('string', 'unicode')) + assert r.inferred_type in ('string', 'unicode') else: assert_attr_equal('inferred_type', l, r, obj=obj) @@ -753,8 +883,7 @@ def _get_ilevel_values(index, level): return values # instance validation - assertIsInstance(left, Index, '[index] ') - assertIsInstance(right, Index, '[index] ') + _check_isinstance(left, right, Index) # class / dtype comparison _check_types(left, right, obj=obj) @@ -804,6 +933,9 @@ def _get_ilevel_values(index, level): assert_attr_equal('names', left, right, obj=obj) if isinstance(left, pd.PeriodIndex) or isinstance(right, pd.PeriodIndex): assert_attr_equal('freq', left, right, obj=obj) + if (isinstance(left, pd.IntervalIndex) or + isinstance(right, pd.IntervalIndex)): + assert_attr_equal('closed', left, right, obj=obj) if check_categorical: if is_categorical_dtype(left) or is_categorical_dtype(right): @@ -904,66 +1036,9 @@ def is_sorted(seq): return assert_numpy_array_equal(seq, np.sort(np.array(seq))) -def assertIs(first, second, msg=''): - """Checks that 'first' is 'second'""" - a, b = first, second - assert a is b, "%s: %r is not %r" % (msg.format(a, b), a, b) - - -def assertIsNot(first, second, msg=''): - """Checks that 'first' is not 'second'""" - a, b = first, second - assert a is not b, "%s: %r is %r" % (msg.format(a, b), a, b) - - -def assertIn(first, second, msg=''): - """Checks that 'first' is in 'second'""" - a, b = first, second - assert a in b, "%s: %r is not in %r" % (msg.format(a, b), a, b) - - -def assertNotIn(first, second, msg=''): - """Checks that 'first' is not in 'second'""" - a, b = first, second - assert a not in b, "%s: %r is in %r" % (msg.format(a, b), a, b) - - -def assertIsNone(expr, msg=''): - """Checks that 'expr' is None""" - return assertIs(expr, None, msg) - - -def assertIsNotNone(expr, msg=''): - """Checks that 'expr' is not None""" - return assertIsNot(expr, None, msg) - - -def assertIsInstance(obj, cls, msg=''): - """Test that obj is an instance of cls - (which can be a class or a tuple of classes, - as supported by isinstance()).""" - if not isinstance(obj, cls): - err_msg = "{0}Expected type {1}, found {2} instead" - raise AssertionError(err_msg.format(msg, cls, type(obj))) - - -def assert_isinstance(obj, class_type_or_tuple, msg=''): - return deprecate('assert_isinstance', assertIsInstance)( - obj, class_type_or_tuple, msg=msg) - - -def assertNotIsInstance(obj, cls, msg=''): - """Test that obj is not an instance of cls - (which can be a class or a tuple of classes, - as supported by isinstance()).""" - if isinstance(obj, cls): - err_msg = "{0}Input must not be type {1}" - raise AssertionError(err_msg.format(msg, cls)) - - def assert_categorical_equal(left, right, check_dtype=True, obj='Categorical', check_category_order=True): - """Test that categoricals are eqivalent + """Test that Categoricals are equivalent. Parameters ---------- @@ -980,8 +1055,7 @@ def assert_categorical_equal(left, right, check_dtype=True, values are compared. The ordered attribute is checked regardless. 
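(Illustrative sketch, not part of the patch: a doctest for the pass and category-order failure cases, using the public `pd.Categorical` API.)

    >>> import pandas as pd
    >>> from pandas.util.testing import assert_categorical_equal
    >>> left = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b'])
    >>> right = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b'])
    >>> assert_categorical_equal(left, right)    # passes silently
    >>> reordered = pd.Categorical(['a', 'b', 'a'], categories=['b', 'a'])
    >>> assert_categorical_equal(left, reordered)
    Traceback (most recent call last):
        ...
    AssertionError: ...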
""" - assertIsInstance(left, pd.Categorical, '[Categorical] ') - assertIsInstance(right, pd.Categorical, '[Categorical] ') + _check_isinstance(left, right, Categorical) if check_category_order: assert_index_equal(left.categories, right.categories, @@ -1041,19 +1115,25 @@ def assert_numpy_array_equal(left, right, strict_nan=False, """ # instance validation - # to show a detailed erorr message when classes are different + # Show a detailed error message when classes are different assert_class_equal(left, right, obj=obj) # both classes must be an np.ndarray - assertIsInstance(left, np.ndarray, '[ndarray] ') - assertIsInstance(right, np.ndarray, '[ndarray] ') + _check_isinstance(left, right, np.ndarray) def _get_base(obj): return obj.base if getattr(obj, 'base', None) is not None else obj + left_base = _get_base(left) + right_base = _get_base(right) + if check_same == 'same': - assertIs(_get_base(left), _get_base(right)) + if left_base is not right_base: + msg = "%r is not %r" % (left_base, right_base) + raise AssertionError(msg) elif check_same == 'copy': - assertIsNot(_get_base(left), _get_base(right)) + if left_base is right_base: + msg = "%r is %r" % (left_base, right_base) + raise AssertionError(msg) def _raise(left, right, err_msg): if err_msg is None: @@ -1095,7 +1175,6 @@ def assert_series_equal(left, right, check_dtype=True, check_datetimelike_compat=False, check_categorical=True, obj='Series'): - """Check that left and right Series are equal. Parameters @@ -1117,7 +1196,7 @@ def assert_series_equal(left, right, check_dtype=True, Whether to compare number exactly. check_names : bool, default True Whether to check the Series and Index names attribute. - check_dateteimelike_compat : bool, default False + check_datetimelike_compat : bool, default False Compare datetime-like which is comparable ignoring dtype. check_categorical : bool, default True Whether to compare internal Categorical exactly. @@ -1127,13 +1206,12 @@ def assert_series_equal(left, right, check_dtype=True, """ # instance validation - assertIsInstance(left, Series, '[Series] ') - assertIsInstance(right, Series, '[Series] ') + _check_isinstance(left, right, Series) if check_series_type: # ToDo: There are some tests using rhs is sparse # lhs is dense. Should use assert_class_equal in future - assertIsInstance(left, type(right)) + assert isinstance(left, type(right)) # assert_class_equal(left, right, obj=obj) # length comparison @@ -1174,6 +1252,12 @@ def assert_series_equal(left, right, check_dtype=True, else: assert_numpy_array_equal(left.get_values(), right.get_values(), check_dtype=check_dtype) + elif is_interval_dtype(left) or is_interval_dtype(right): + # TODO: big hack here + l = pd.IntervalIndex(left) + r = pd.IntervalIndex(right) + assert_index_equal(l, r, obj='{0}.index'.format(obj)) + else: _testing.assert_almost_equal(left.get_values(), right.get_values(), check_less_precise=check_less_precise, @@ -1203,7 +1287,6 @@ def assert_frame_equal(left, right, check_dtype=True, check_categorical=True, check_like=False, obj='DataFrame'): - """Check that left and right DataFrame are equal. Parameters @@ -1220,7 +1303,7 @@ def assert_frame_equal(left, right, check_dtype=True, are identical. check_frame_type : bool, default False Whether to check the DataFrame class is identical. - check_less_precise : bool or it, default False + check_less_precise : bool or int, default False Specify comparison precision. Only used when check_exact is False. 5 digits (False) or 3 digits (True) after decimal points are compared. 
If int, then specify the digits to compare @@ -1231,46 +1314,36 @@ def assert_frame_equal(left, right, check_dtype=True, If True, compare by blocks. check_exact : bool, default False Whether to compare number exactly. - check_dateteimelike_compat : bool, default False + check_datetimelike_compat : bool, default False Compare datetime-like which is comparable ignoring dtype. check_categorical : bool, default True Whether to compare internal Categorical exactly. check_like : bool, default False - If true, then reindex_like operands + If true, ignore the order of rows & columns obj : str, default 'DataFrame' Specify object name being compared, internally used to show appropriate assertion message """ # instance validation - assertIsInstance(left, DataFrame, '[DataFrame] ') - assertIsInstance(right, DataFrame, '[DataFrame] ') + _check_isinstance(left, right, DataFrame) if check_frame_type: # ToDo: There are some tests using rhs is SparseDataFrame # lhs is DataFrame. Should use assert_class_equal in future - assertIsInstance(left, type(right)) + assert isinstance(left, type(right)) # assert_class_equal(left, right, obj=obj) + # shape comparison + if left.shape != right.shape: + raise_assert_detail(obj, + 'DataFrame shape mismatch', + '({0}, {1})'.format(*left.shape), + '({0}, {1})'.format(*right.shape)) + if check_like: left, right = left.reindex_like(right), right - # shape comparison (row) - if left.shape[0] != right.shape[0]: - raise_assert_detail(obj, - 'DataFrame shape (number of rows) are different', - '{0}, {1}'.format(left.shape[0], left.index), - '{0}, {1}'.format(right.shape[0], right.index)) - # shape comparison (columns) - if left.shape[1] != right.shape[1]: - raise_assert_detail(obj, - 'DataFrame shape (number of columns) ' - 'are different', - '{0}, {1}'.format(left.shape[1], - left.columns), - '{0}, {1}'.format(right.shape[1], - right.columns)) - # index comparison assert_index_equal(left.index, right.index, exact=check_index_type, check_names=check_names, @@ -1373,6 +1446,7 @@ def assert_panelnd_equal(left, right, for i, item in enumerate(right._get_axis(0)): assert item in left, "non-matching item (left) '%s'" % item + # TODO: strangely check_names fails in py3 ? _panel_frame_equal = partial(assert_frame_equal, check_names=False) assert_panel_equal = partial(assert_panelnd_equal, @@ -1396,15 +1470,14 @@ def assert_sp_array_equal(left, right, check_dtype=True): Whether to check the data dtype is identical. """ - assertIsInstance(left, pd.SparseArray, '[SparseArray]') - assertIsInstance(right, pd.SparseArray, '[SparseArray]') + _check_isinstance(left, right, pd.SparseArray) assert_numpy_array_equal(left.sp_values, right.sp_values, check_dtype=check_dtype) # SparseIndex comparison - assertIsInstance(left.sp_index, pd._sparse.SparseIndex, '[SparseIndex]') - assertIsInstance(right.sp_index, pd._sparse.SparseIndex, '[SparseIndex]') + assert isinstance(left.sp_index, pd._libs.sparse.SparseIndex) + assert isinstance(right.sp_index, pd._libs.sparse.SparseIndex) if not left.sp_index.equals(right.sp_index): raise_assert_detail('SparseArray.index', 'index are not equal', @@ -1437,8 +1510,7 @@ def assert_sp_series_equal(left, right, check_dtype=True, exact_indices=True, Specify the object name being compared, internally used to show the appropriate assertion message. 
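(Illustrative sketch, not part of the patch: minimal usage against the SparseSeries type this function validates.)

    >>> import numpy as np
    >>> import pandas as pd
    >>> from pandas.util.testing import assert_sp_series_equal
    >>> left = pd.SparseSeries([1.0, np.nan, 3.0])
    >>> right = pd.SparseSeries([1.0, np.nan, 3.0])
    >>> assert_sp_series_equal(left, right)      # same sp_values and index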
""" - assertIsInstance(left, pd.SparseSeries, '[SparseSeries]') - assertIsInstance(right, pd.SparseSeries, '[SparseSeries]') + _check_isinstance(left, right, pd.SparseSeries) if check_series_type: assert_class_equal(left, right, obj=obj) @@ -1475,8 +1547,7 @@ def assert_sp_frame_equal(left, right, check_dtype=True, exact_indices=True, Specify the object name being compared, internally used to show the appropriate assertion message. """ - assertIsInstance(left, pd.SparseDataFrame, '[SparseDataFrame]') - assertIsInstance(right, pd.SparseDataFrame, '[SparseDataFrame]') + _check_isinstance(left, right, pd.SparseDataFrame) if check_frame_type: assert_class_equal(left, right, obj=obj) @@ -1507,8 +1578,8 @@ def assert_sp_frame_equal(left, right, check_dtype=True, exact_indices=True, def assert_sp_list_equal(left, right): - assertIsInstance(left, pd.SparseList, '[SparseList]') - assertIsInstance(right, pd.SparseList, '[SparseList]') + assert isinstance(left, pd.SparseList) + assert isinstance(right, pd.SparseList) assert_sp_array_equal(left.to_array(), right.to_array()) @@ -1561,6 +1632,12 @@ def makeCategoricalIndex(k=10, n=3, name=None): return CategoricalIndex(np.random.choice(x, k), name=name) +def makeIntervalIndex(k=10, name=None): + """ make a length k IntervalIndex """ + x = np.linspace(0, 100, num=(k + 1)) + return IntervalIndex.from_breaks(x, name=name) + + def makeBoolIndex(k=10, name=None): if k == 1: return Index([True], name=name) @@ -1573,6 +1650,10 @@ def makeIntIndex(k=10, name=None): return Index(lrange(k), name=name) +def makeUIntIndex(k=10, name=None): + return Index([2**63 + i for i in lrange(k)], name=name) + + def makeRangeIndex(k=10, name=None): return RangeIndex(0, k, 1, name=name) @@ -1716,8 +1797,10 @@ def makePeriodPanel(nper=None): def makePanel4D(nper=None): - return Panel4D(dict(l1=makePanel(nper), l2=makePanel(nper), - l3=makePanel(nper))) + with warnings.catch_warnings(record=True): + d = dict(l1=makePanel(nper), l2=makePanel(nper), + l3=makePanel(nper)) + return Panel4D(d) def makeCustomIndex(nentries, nlevels, prefix='#', names=False, ndupe_l=None, @@ -1983,87 +2066,59 @@ def __init__(self, *args, **kwargs): dict.__init__(self, *args, **kwargs) -# Dependency checks. Copied this from Nipy/Nipype (Copyright of -# respective developers, license: BSD-3) -def package_check(pkg_name, version=None, app='pandas', checker=LooseVersion, - exc_failed_import=ImportError, - exc_failed_check=RuntimeError): - """Check that the minimal version of the required package is installed. +# Dependency checker when running tests. +# +# Copied this from nipy/nipype +# Copyright of respective developers, License: BSD-3 +def skip_if_no_package(pkg_name, min_version=None, max_version=None, + app='pandas', checker=LooseVersion): + """Check that the min/max version of the required package is installed. + + If the package check fails, the test is automatically skipped. Parameters ---------- pkg_name : string Name of the required package. - version : string, optional + min_version : string, optional Minimal version number for required package. + max_version : string, optional + Max version number for required package. app : string, optional - Application that is performing the check. For instance, the + Application that is performing the check. For instance, the name of the tutorial being executed that depends on specific packages. checker : object, optional - The class that will perform the version checking. Default is + The class that will perform the version checking. 
+        Default is distutils.version.LooseVersion.
-    exc_failed_import : Exception, optional
-        Class of the exception to be thrown if import failed.
-    exc_failed_check : Exception, optional
-        Class of the exception to be thrown if version check failed.

     Examples
     --------
-    package_check('numpy', '1.3')
-    package_check('networkx', '1.0', 'tutorial1')
+    skip_if_no_package('numpy', '1.3')

     """
+    import pytest
     if app:
         msg = '%s requires %s' % (app, pkg_name)
     else:
         msg = 'module requires %s' % pkg_name
-    if version:
-        msg += ' with version >= %s' % (version,)
+    if min_version:
+        msg += ' with version >= %s' % (min_version,)
+    if max_version:
+        msg += ' with version < %s' % (max_version,)
     try:
         mod = __import__(pkg_name)
     except ImportError:
-        raise exc_failed_import(msg)
-    if not version:
-        return
+        mod = None
     try:
         have_version = mod.__version__
     except AttributeError:
-        raise exc_failed_check('Cannot find version for %s' % pkg_name)
-    if checker(have_version) < checker(version):
-        raise exc_failed_check(msg)
-
-
-def skip_if_no_package(*args, **kwargs):
-    """Raise SkipTest if package_check fails
-
-    Parameters
-    ----------
-    *args Positional parameters passed to `package_check`
-    *kwargs Keyword parameters passed to `package_check`
-    """
-    from nose import SkipTest
-    package_check(exc_failed_import=SkipTest,
-                  exc_failed_check=SkipTest,
-                  *args, **kwargs)
-
-
-def skip_if_no_package_deco(pkg_name, version=None, app='pandas'):
-    from nose import SkipTest
-
-    def deco(func):
-        @wraps(func)
-        def wrapper(*args, **kwargs):
-            package_check(pkg_name, version=version, app=app,
-                          exc_failed_import=SkipTest,
-                          exc_failed_check=SkipTest)
-            return func(*args, **kwargs)
-        return wrapper
-    return deco
-#
-# Additional tags decorators for nose
-#
+        pytest.skip('Cannot find version for %s' % pkg_name)
+    if min_version and checker(have_version) < checker(min_version):
+        pytest.skip(msg)
+    if max_version and checker(have_version) >= checker(max_version):
+        pytest.skip(msg)


 def optional_args(decorator):
@@ -2091,6 +2146,7 @@ def dec(f):

     return wrapper

+
 # skip tests on exceptions with this message
 _network_error_messages = (
     # 'urlopen error timed out',
@@ -2108,6 +2164,7 @@ def dec(f):
     'Temporary failure in name resolution',
     'Name or service not known',
     'Connection refused',
+    'certificate verify',
 )

 # or this e.errno/e.reason.errno
@@ -2243,18 +2300,17 @@ def network(t, url="http://www.google.com",
     >>> test_something()
     Traceback (most recent call last):
         ...
-    SkipTest

     Errors not related to networking will always be raised.
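(Illustrative sketch, not part of the patch: the decorator after the pytest migration. The URL is a placeholder; connectivity failures inside the wrapped test now end in `pytest.skip` rather than `nose.SkipTest`.)

    >>> from pandas.util.testing import network
    >>>
    >>> @network
    ... def test_fetch():
    ...     import pandas as pd
    ...     pd.read_csv("http://www.example.com/data.csv")
    >>>
    >>> test_fetch()  # skipped, not errored, when the host is unreachable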
""" - from nose import SkipTest + from pytest import skip t.network = True @wraps(t) def wrapper(*args, **kwargs): if check_before_test and not raise_on_error: if not can_connect(url, error_classes): - raise SkipTest + skip() try: return t(*args, **kwargs) except Exception as e: @@ -2263,8 +2319,8 @@ def wrapper(*args, **kwargs): errno = getattr(e.reason, 'errno', None) if errno in skip_errnos: - raise SkipTest("Skipping test due to known errno" - " and error %s" % e) + skip("Skipping test due to known errno" + " and error %s" % e) try: e_str = traceback.format_exc(e) @@ -2272,8 +2328,8 @@ def wrapper(*args, **kwargs): e_str = str(e) if any([m.lower() in e_str.lower() for m in _skip_on_messages]): - raise SkipTest("Skipping test because exception " - "message is known and error %s" % e) + skip("Skipping test because exception " + "message is known and error %s" % e) if not isinstance(e, error_classes): raise @@ -2281,8 +2337,8 @@ def wrapper(*args, **kwargs): if raise_on_error or can_connect(url, error_classes): raise else: - raise SkipTest("Skipping test due to lack of connectivity" - " and error %s" % e) + skip("Skipping test due to lack of connectivity" + " and error %s" % e) return wrapper @@ -2343,78 +2399,44 @@ def stdin_encoding(encoding=None): sys.stdin = _stdin -def assertRaises(_exception, _callable=None, *args, **kwargs): - """assertRaises that is usable as context manager or in a with statement - - Exceptions that don't match the given Exception type fall through:: - - >>> with assertRaises(ValueError): - ... raise TypeError("banana") - ... - Traceback (most recent call last): - ... - TypeError: banana - - If it raises the given Exception type, the test passes - >>> with assertRaises(KeyError): - ... dct = dict() - ... dct["apple"] - - If the expected error doesn't occur, it raises an error. - >>> with assertRaises(KeyError): - ... dct = {'apple':True} - ... dct["apple"] - Traceback (most recent call last): - ... - AssertionError: KeyError not raised. - - In addition to using it as a contextmanager, you can also use it as a - function, just like the normal assertRaises - - >>> assertRaises(TypeError, ",".join, [1, 3, 5]) +def assert_raises_regex(_exception, _regexp, _callable=None, + *args, **kwargs): """ - manager = _AssertRaisesContextmanager(exception=_exception) - # don't return anything if used in function form - if _callable is not None: - with manager: - _callable(*args, **kwargs) - else: - return manager + Check that the specified Exception is raised and that the error message + matches a given regular expression pattern. This may be a regular + expression object or a string containing a regular expression suitable + for use by `re.search()`. + This is a port of the `assertRaisesRegexp` function from unittest in + Python 2.7. However, with our migration to `pytest`, please refrain + from using this. Instead, use the following paradigm: -def assertRaisesRegexp(_exception, _regexp, _callable=None, *args, **kwargs): - """ Port of assertRaisesRegexp from unittest in - Python 2.7 - used in with statement. + with pytest.raises(_exception) as exc_info: + func(*args, **kwargs) + exc_info.matches(reg_exp) - Explanation from standard library: - Like assertRaises() but also tests that regexp matches on the - string representation of the raised exception. regexp may be a - regular expression object or a string containing a regular - expression suitable for use by re.search(). - - You can pass either a regular expression - or a compiled regular expression object. 
- >>> assertRaisesRegexp(ValueError, 'invalid literal for.*XYZ', - ... int, 'XYZ') + Examples + -------- + >>> assert_raises_regex(ValueError, 'invalid literal for.*XYZ', int, 'XYZ') >>> import re - >>> assertRaisesRegexp(ValueError, re.compile('literal'), int, 'XYZ') + >>> assert_raises_regex(ValueError, re.compile('literal'), int, 'XYZ') If an exception of a different type is raised, it bubbles up. - >>> assertRaisesRegexp(TypeError, 'literal', int, 'XYZ') + >>> assert_raises_regex(TypeError, 'literal', int, 'XYZ') Traceback (most recent call last): ... ValueError: invalid literal for int() with base 10: 'XYZ' >>> dct = dict() - >>> assertRaisesRegexp(KeyError, 'pear', dct.__getitem__, 'apple') + >>> assert_raises_regex(KeyError, 'pear', dct.__getitem__, 'apple') Traceback (most recent call last): ... AssertionError: "pear" does not match "'apple'" You can also use this in a with statement. - >>> with assertRaisesRegexp(TypeError, 'unsupported operand type\(s\)'): + >>> with assert_raises_regex(TypeError, 'unsupported operand type\(s\)'): ... 1 + {} - >>> with assertRaisesRegexp(TypeError, 'banana'): + >>> with assert_raises_regex(TypeError, 'banana'): ... 'apple'[0] = 'b' Traceback (most recent call last): ... @@ -2431,39 +2453,79 @@ def assertRaisesRegexp(_exception, _regexp, _callable=None, *args, **kwargs): class _AssertRaisesContextmanager(object): """ - Handles the behind the scenes work - for assertRaises and assertRaisesRegexp + Context manager behind `assert_raises_regex`. """ - def __init__(self, exception, regexp=None, *args, **kwargs): + + def __init__(self, exception, regexp=None): + """ + Initialize an _AssertRaisesContextManager instance. + + Parameters + ---------- + exception : class + The expected Exception class. + regexp : str, default None + The regex to compare against the Exception message. + """ + self.exception = exception + if regexp is not None and not hasattr(regexp, "search"): regexp = re.compile(regexp, re.DOTALL) + self.regexp = regexp def __enter__(self): return self - def __exit__(self, exc_type, exc_value, traceback): + def __exit__(self, exc_type, exc_value, trace_back): expected = self.exception - if not exc_type: - name = getattr(expected, "__name__", str(expected)) - raise AssertionError("{0} not raised.".format(name)) - if issubclass(exc_type, expected): - return self.handle_success(exc_type, exc_value, traceback) - return self.handle_failure(exc_type, exc_value, traceback) - - def handle_failure(*args, **kwargs): - # Failed, so allow Exception to bubble up - return False - def handle_success(self, exc_type, exc_value, traceback): - if self.regexp is not None: - val = str(exc_value) - if not self.regexp.search(val): - e = AssertionError('"%s" does not match "%s"' % - (self.regexp.pattern, str(val))) - raise_with_traceback(e, traceback) - return True + if not exc_type: + exp_name = getattr(expected, "__name__", str(expected)) + raise AssertionError("{0} not raised.".format(exp_name)) + + return self.exception_matches(exc_type, exc_value, trace_back) + + def exception_matches(self, exc_type, exc_value, trace_back): + """ + Check that the Exception raised matches the expected Exception + and expected error message regular expression. + + Parameters + ---------- + exc_type : class + The type of Exception raised. + exc_value : Exception + The instance of `exc_type` raised. + trace_back : stack trace object + The traceback object associated with `exc_value`. 
+ + Returns + ------- + is_matched : bool + Whether or not the Exception raised matches the expected + Exception class and expected error message regular expression. + + Raises + ------ + AssertionError : The error message provided does not match + the expected error message regular expression. + """ + + if issubclass(exc_type, self.exception): + if self.regexp is not None: + val = str(exc_value) + + if not self.regexp.search(val): + e = AssertionError('"%s" does not match "%s"' % + (self.regexp.pattern, str(val))) + raise_with_traceback(e, trace_back) + + return True + else: + # Failed, so allow Exception to bubble up. + return False @contextmanager @@ -2538,11 +2600,6 @@ def assert_produces_warning(expected_warning=Warning, filter_level="always", % extra_warnings) -def disabled(t): - t.disabled = True - return t - - class RNGContext(object): """ Context manager to set the numpy random number generator speed. Returns @@ -2584,12 +2641,6 @@ def use_numexpr(use, min_elements=expr._MIN_ELEMENTS): expr.set_use_numexpr(olduse) -# Also provide all assert_* functions in the TestCase class -for name, obj in inspect.getmembers(sys.modules[__name__]): - if inspect.isfunction(obj) and name.startswith('assert'): - setattr(TestCase, name, staticmethod(obj)) - - def test_parallel(num_threads=2, kwargs_list=None): """Decorator to run the same function multiple times in parallel. @@ -2767,8 +2818,8 @@ def set_timezone(tz): 'EDT' """ if is_platform_windows(): - import nose - raise nose.SkipTest("timezone setting not supported on windows") + import pytest + pytest.skip("timezone setting not supported on windows") import os import time diff --git a/scripts/api_rst_coverage.py b/scripts/api_rst_coverage.py index cc456f03c02ec..6bb5383509be6 100644 --- a/scripts/api_rst_coverage.py +++ b/scripts/api_rst_coverage.py @@ -4,11 +4,11 @@ def main(): # classes whose members to check - classes = [pd.Series, pd.DataFrame, pd.Panel, pd.Panel4D] + classes = [pd.Series, pd.DataFrame, pd.Panel] def class_name_sort_key(x): if x.startswith('Series'): - # make sure Series precedes DataFrame, Panel, and Panel4D + # make sure Series precedes DataFrame, and Panel. 
return ' ' + x else: return x diff --git a/scripts/bench_join.py b/scripts/bench_join.py index 1ce5c94130e85..f9d43772766d8 100644 --- a/scripts/bench_join.py +++ b/scripts/bench_join.py @@ -1,6 +1,6 @@ from pandas.compat import range, lrange import numpy as np -import pandas.lib as lib +import pandas._libs.lib as lib from pandas import * from copy import deepcopy import time diff --git a/scripts/bench_join_multi.py b/scripts/bench_join_multi.py index 7b93112b7f869..b19da6a2c47d8 100644 --- a/scripts/bench_join_multi.py +++ b/scripts/bench_join_multi.py @@ -3,7 +3,7 @@ import numpy as np from pandas.compat import zip, range, lzip from pandas.util.testing import rands -import pandas.lib as lib +import pandas._libs.lib as lib N = 100000 diff --git a/build_dist.sh b/scripts/build_dist.sh similarity index 89% rename from build_dist.sh rename to scripts/build_dist.sh index 9bd007e3e1c9f..c9c36c18bed9c 100755 --- a/build_dist.sh +++ b/scripts/build_dist.sh @@ -12,7 +12,7 @@ case ${answer:0:1} in echo "Building distribution" python setup.py clean python setup.py build_ext --inplace - python setup.py sdist --formats=zip,gztar + python setup.py sdist --formats=gztar ;; * ) echo "Not building distribution" diff --git a/scripts/groupby_test.py b/scripts/groupby_test.py index 5acf7da7534a3..f640a6ed79503 100644 --- a/scripts/groupby_test.py +++ b/scripts/groupby_test.py @@ -5,7 +5,7 @@ from pandas import * -import pandas.lib as tseries +import pandas._libs.lib as tseries import pandas.core.groupby as gp import pandas.util.testing as tm from pandas.compat import range diff --git a/scripts/merge-py.py b/scripts/merge-pr.py similarity index 81% rename from scripts/merge-py.py rename to scripts/merge-pr.py index 65cc9d5a2bffe..1fc4eef3d0583 100755 --- a/scripts/merge-py.py +++ b/scripts/merge-pr.py @@ -25,12 +25,12 @@ from __future__ import print_function +from subprocess import check_output from requests.auth import HTTPBasicAuth import requests import os import six -import subprocess import sys import textwrap @@ -46,8 +46,8 @@ # Remote name where results pushed PUSH_REMOTE_NAME = os.environ.get("PUSH_REMOTE_NAME", "upstream") -GITHUB_BASE = "https://github.com/pydata/" + PROJECT_NAME + "/pull" -GITHUB_API_BASE = "https://api.github.com/repos/pydata/" + PROJECT_NAME +GITHUB_BASE = "https://github.com/pandas-dev/" + PROJECT_NAME + "/pull" +GITHUB_API_BASE = "https://api.github.com/repos/pandas-dev/" + PROJECT_NAME # Prefix added to temporary branches BRANCH_PREFIX = "PR_TOOL" @@ -83,21 +83,10 @@ def fail(msg): def run_cmd(cmd): - # py2.6 does not have subprocess.check_output if isinstance(cmd, six.string_types): cmd = cmd.split(' ') - popenargs = [cmd] - kwargs = {} - - process = subprocess.Popen(stdout=subprocess.PIPE, *popenargs) - output, unused_err = process.communicate() - retcode = process.poll() - if retcode: - cmd = kwargs.get("args") - if cmd is None: - cmd = popenargs[0] - raise subprocess.CalledProcessError(retcode, cmd, output=output) + output = check_output(cmd) if isinstance(output, six.binary_type): output = output.decode('utf-8') @@ -110,6 +99,14 @@ def continue_maybe(prompt): fail("Okay, exiting") +def continue_maybe2(prompt): + result = input("\n%s (y/n): " % prompt) + if result.lower() != "y": + return False + else: + return True + + original_head = run_cmd("git rev-parse HEAD")[:8] @@ -124,7 +121,7 @@ def clean_up(): run_cmd("git branch -D %s" % branch) -# merge the requested PR and return the merge hash +# Merge the requested PR and return the merge hash def 
merge_pr(pr_num, target_ref): pr_branch_name = "%s_MERGE_PR_%s" % (BRANCH_PREFIX, pr_num) @@ -204,6 +201,40 @@ def merge_pr(pr_num, target_ref): return merge_hash +def update_pr(pr_num, user_login, base_ref): + + pr_branch_name = "%s_MERGE_PR_%s" % (BRANCH_PREFIX, pr_num) + + run_cmd("git fetch %s pull/%s/head:%s" % (PR_REMOTE_NAME, pr_num, + pr_branch_name)) + run_cmd("git checkout %s" % pr_branch_name) + + continue_maybe("Update ready (local ref %s)? Push to %s/%s?" % ( + pr_branch_name, user_login, base_ref)) + + push_user_remote = "https://github.com/%s/pandas.git" % user_login + + try: + run_cmd('git push %s %s:%s' % (push_user_remote, pr_branch_name, + base_ref)) + except Exception as e: + + if continue_maybe2("Force push?"): + try: + run_cmd( + 'git push -f %s %s:%s' % (push_user_remote, pr_branch_name, + base_ref)) + except Exception as e: + fail("Exception while pushing: %s" % e) + clean_up() + else: + fail("Exception while pushing: %s" % e) + clean_up() + + clean_up() + print("Pull request #%s updated!" % pr_num) + + def cherry_pick(pr_num, merge_hash, default_branch): pick_ref = input("Enter a branch name [%s]: " % default_branch) if pick_ref == "": @@ -244,13 +275,6 @@ def fix_version_from_branch(branch, versions): branch_ver = branch.replace("branch-", "") return filter(lambda x: x.name.startswith(branch_ver), versions)[-1] - -# branches = get_json("%s/branches" % GITHUB_API_BASE) -# branch_names = filter(lambda x: x.startswith("branch-"), -# [x['name'] for x in branches]) -# Assumes branch names can be sorted lexicographically -# latest_branch = sorted(branch_names, reverse=True)[0] - pr_num = input("Which pull request would you like to merge? (e.g. 34): ") pr = get_json("%s/pulls/%s" % (GITHUB_API_BASE, pr_num)) @@ -275,8 +299,17 @@ def fix_version_from_branch(branch, versions): print("\n=== Pull Request #%s ===" % pr_num) print("title\t%s\nsource\t%s\ntarget\t%s\nurl\t%s" % (title, pr_repo_desc, target_ref, url)) -continue_maybe("Proceed with merging pull request #%s?" % pr_num) + + merged_refs = [target_ref] -merge_hash = merge_pr(pr_num, target_ref) +print("\nProceed with updating or merging pull request #%s?" 
% pr_num)
+update = input("Update PR and push to remote (r), merge locally (l), "
+               "or do nothing (n)? ")
+update = update.lower()
+
+if update == 'r':
+    merge_hash = update_pr(pr_num, user_login, base_ref)
+elif update == 'l':
+    merge_hash = merge_pr(pr_num, target_ref)
diff --git a/scripts/roll_median_leak.py b/scripts/roll_median_leak.py
index 07161cc6499bf..03f39e2b18372 100644
--- a/scripts/roll_median_leak.py
+++ b/scripts/roll_median_leak.py
@@ -7,7 +7,7 @@
 from vbench.api import Benchmark
 from pandas.util.testing import rands
 from pandas.compat import range
-import pandas.lib as lib
+import pandas._libs.lib as lib
 import pandas._sandbox as sbx
 import time
diff --git a/setup.cfg b/setup.cfg
index f69e256b80869..8b32f0f62fe28 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -12,10 +12,18 @@ tag_prefix = v
 parentdir_prefix = pandas-

 [flake8]
-ignore = E731
+ignore = E731,E402
+max-line-length = 79

 [yapf]
 based_on_style = pep8
 split_before_named_assigns = false
 split_penalty_after_opening_bracket = 1000000
 split_penalty_logical_operator = 30
+
+[tool:pytest]
+# TODO: Change all yield-based (nose-style) fixtures to pytest fixtures
+# Silencing the warning until then
+testpaths = pandas
+markers =
+    single: mark a test as single cpu only
diff --git a/setup.py b/setup.py
index 2bef65c9719dc..d101358fb63dd 100755
--- a/setup.py
+++ b/setup.py
@@ -27,7 +27,7 @@ def is_platform_mac():
 import versioneer
 cmdclass = versioneer.get_cmdclass()

-min_cython_ver = '0.19.1'
+min_cython_ver = '0.23'
 try:
     import Cython
     ver = Cython.__version__
@@ -109,19 +109,23 @@ def is_platform_mac():

 from os.path import join as pjoin

-_pxipath = pjoin('pandas', 'src')
 _pxi_dep_template = {
-    'algos': ['algos_common_helper.pxi.in', 'algos_groupby_helper.pxi.in',
-              'algos_take_helper.pxi.in'],
-    '_join': ['join_helper.pxi.in', 'joins_func_helper.pxi.in'],
-    'hashtable': ['hashtable_class_helper.pxi.in',
-                  'hashtable_func_helper.pxi.in'],
-    '_sparse': ['sparse_op_helper.pxi.in']
+    'algos': ['_libs/algos_common_helper.pxi.in',
+              '_libs/algos_take_helper.pxi.in', '_libs/algos_rank_helper.pxi.in'],
+    'groupby': ['_libs/groupby_helper.pxi.in'],
+    'join': ['_libs/join_helper.pxi.in', '_libs/join_func_helper.pxi.in'],
+    'reshape': ['_libs/reshape_helper.pxi.in'],
+    'hashtable': ['_libs/hashtable_class_helper.pxi.in',
+                  '_libs/hashtable_func_helper.pxi.in'],
+    'index': ['_libs/index_class_helper.pxi.in'],
+    'sparse': ['_libs/sparse_op_helper.pxi.in'],
+    'interval': ['_libs/intervaltree.pxi.in']
 }
+
 _pxifiles = []
 _pxi_dep = {}
 for module, files in _pxi_dep_template.items():
-    pxi_files = [pjoin(_pxipath, x) for x in files]
+    pxi_files = [pjoin('pandas', x) for x in files]
     _pxifiles.extend(pxi_files)
     _pxi_dep[module] = pxi_files

@@ -259,7 +263,7 @@ def initialize_options(self):
         self._clean_me = []
         self._clean_trees = []

-        base = pjoin('pandas','src')
+        base = pjoin('pandas','_libs', 'src')
         dt = pjoin(base,'datetime')
         src = base
         util = pjoin('pandas','util')
@@ -325,19 +329,20 @@ def run(self):

 class CheckSDist(sdist_class):
     """Custom sdist that ensures Cython has compiled all pyx files to c."""

-    _pyxfiles = ['pandas/lib.pyx',
-                 'pandas/hashtable.pyx',
-                 'pandas/tslib.pyx',
-                 'pandas/index.pyx',
-                 'pandas/algos.pyx',
-                 'pandas/join.pyx',
-                 'pandas/window.pyx',
-                 'pandas/parser.pyx',
-                 'pandas/src/period.pyx',
-                 'pandas/src/sparse.pyx',
-                 'pandas/src/testing.pyx',
-                 'pandas/src/hash.pyx',
-                 'pandas/io/sas/saslib.pyx']
+    _pyxfiles = ['pandas/_libs/lib.pyx',
+                 'pandas/_libs/hashtable.pyx',
+                 'pandas/_libs/tslib.pyx',
+
'pandas/_libs/period.pyx', + 'pandas/_libs/index.pyx', + 'pandas/_libs/algos.pyx', + 'pandas/_libs/join.pyx', + 'pandas/_libs/interval.pyx', + 'pandas/_libs/hashing.pyx', + 'pandas/_libs/testing.pyx', + 'pandas/_libs/window.pyx', + 'pandas/_libs/sparse.pyx', + 'pandas/_libs/parsers.pyx', + 'pandas/io/sas/sas.pyx'] def initialize_options(self): sdist_class.initialize_options(self) @@ -372,6 +377,7 @@ def check_cython_extensions(self, extensions): for ext in extensions: for src in ext.sources: if not os.path.exists(src): + print("{}: -> [{}]".format(ext.name, ext.sources)) raise Exception("""Cython-generated file '%s' not found. Cython is required to compile pandas from a development branch. Please install Cython or download a release package of pandas. @@ -437,13 +443,13 @@ def srcpath(name=None, suffix='.pyx', subdir='src'): return pjoin('pandas', subdir, name + suffix) if suffix == '.pyx': - lib_depends = [srcpath(f, suffix='.pyx') for f in lib_depends] - lib_depends.append('pandas/src/util.pxd') + lib_depends = [srcpath(f, suffix='.pyx', subdir='_libs/src') for f in lib_depends] + lib_depends.append('pandas/_libs/src/util.pxd') else: lib_depends = [] plib_depends = [] -common_include = ['pandas/src/klib', 'pandas/src'] +common_include = ['pandas/_libs/src/klib', 'pandas/_libs/src'] def pxd(name): @@ -455,67 +461,77 @@ def pxd(name): else: extra_compile_args=['-Wno-unused-function'] -lib_depends = lib_depends + ['pandas/src/numpy_helper.h', - 'pandas/src/parse_helper.h'] +lib_depends = lib_depends + ['pandas/_libs/src/numpy_helper.h', + 'pandas/_libs/src/parse_helper.h', + 'pandas/_libs/src/compat_helper.h'] -tseries_depends = ['pandas/src/datetime/np_datetime.h', - 'pandas/src/datetime/np_datetime_strings.h', - 'pandas/src/period_helper.h', - 'pandas/src/datetime.pxd'] +tseries_depends = ['pandas/_libs/src/datetime/np_datetime.h', + 'pandas/_libs/src/datetime/np_datetime_strings.h', + 'pandas/_libs/src/datetime_helper.h', + 'pandas/_libs/src/period_helper.h', + 'pandas/_libs/src/datetime.pxd'] # some linux distros require it libraries = ['m'] if not is_platform_windows() else [] -ext_data = dict( - lib={'pyxfile': 'lib', - 'pxdfiles': [], - 'depends': lib_depends}, - hashtable={'pyxfile': 'hashtable', - 'pxdfiles': ['hashtable'], - 'depends': (['pandas/src/klib/khash_python.h'] - + _pxi_dep['hashtable'])}, - tslib={'pyxfile': 'tslib', - 'depends': tseries_depends, - 'sources': ['pandas/src/datetime/np_datetime.c', - 'pandas/src/datetime/np_datetime_strings.c', - 'pandas/src/period_helper.c']}, - _period={'pyxfile': 'src/period', - 'depends': tseries_depends, - 'sources': ['pandas/src/datetime/np_datetime.c', - 'pandas/src/datetime/np_datetime_strings.c', - 'pandas/src/period_helper.c']}, - index={'pyxfile': 'index', - 'sources': ['pandas/src/datetime/np_datetime.c', - 'pandas/src/datetime/np_datetime_strings.c'], - 'pxdfiles': ['src/util']}, - algos={'pyxfile': 'algos', - 'pxdfiles': ['src/util'], - 'depends': _pxi_dep['algos']}, - _join={'pyxfile': 'src/join', - 'pxdfiles': ['src/util'], - 'depends': _pxi_dep['_join']}, - _window={'pyxfile': 'window', - 'pxdfiles': ['src/skiplist', 'src/util'], - 'depends': ['pandas/src/skiplist.pyx', - 'pandas/src/skiplist.h']}, - parser={'pyxfile': 'parser', - 'depends': ['pandas/src/parser/tokenizer.h', - 'pandas/src/parser/io.h', - 'pandas/src/numpy_helper.h'], - 'sources': ['pandas/src/parser/tokenizer.c', - 'pandas/src/parser/io.c']}, - _sparse={'pyxfile': 'src/sparse', - 'depends': ([srcpath('sparse', suffix='.pyx')] + - _pxi_dep['_sparse'])}, - 
_testing={'pyxfile': 'src/testing', - 'depends': [srcpath('testing', suffix='.pyx')]}, - _hash={'pyxfile': 'src/hash', - 'depends': [srcpath('hash', suffix='.pyx')]}, -) - -ext_data["io.sas.saslib"] = {'pyxfile': 'io/sas/saslib'} +ext_data = { + '_libs.lib': {'pyxfile': '_libs/lib', + 'depends': lib_depends + tseries_depends}, + '_libs.hashtable': {'pyxfile': '_libs/hashtable', + 'pxdfiles': ['_libs/hashtable'], + 'depends': (['pandas/_libs/src/klib/khash_python.h'] + + _pxi_dep['hashtable'])}, + '_libs.tslib': {'pyxfile': '_libs/tslib', + 'pxdfiles': ['_libs/src/util', '_libs/lib'], + 'depends': tseries_depends, + 'sources': ['pandas/_libs/src/datetime/np_datetime.c', + 'pandas/_libs/src/datetime/np_datetime_strings.c', + 'pandas/_libs/src/period_helper.c']}, + '_libs.period': {'pyxfile': '_libs/period', + 'depends': tseries_depends, + 'sources': ['pandas/_libs/src/datetime/np_datetime.c', + 'pandas/_libs/src/datetime/np_datetime_strings.c', + 'pandas/_libs/src/period_helper.c']}, + '_libs.index': {'pyxfile': '_libs/index', + 'sources': ['pandas/_libs/src/datetime/np_datetime.c', + 'pandas/_libs/src/datetime/np_datetime_strings.c'], + 'pxdfiles': ['_libs/src/util', '_libs/hashtable'], + 'depends': _pxi_dep['index']}, + '_libs.algos': {'pyxfile': '_libs/algos', + 'pxdfiles': ['_libs/src/util', '_libs/algos', '_libs/hashtable'], + 'depends': _pxi_dep['algos']}, + '_libs.groupby': {'pyxfile': '_libs/groupby', + 'pxdfiles': ['_libs/src/util', '_libs/algos'], + 'depends': _pxi_dep['groupby']}, + '_libs.join': {'pyxfile': '_libs/join', + 'pxdfiles': ['_libs/src/util', '_libs/hashtable'], + 'depends': _pxi_dep['join']}, + '_libs.reshape': {'pyxfile': '_libs/reshape', + 'depends': _pxi_dep['reshape']}, + '_libs.interval': {'pyxfile': '_libs/interval', + 'pxdfiles': ['_libs/hashtable'], + 'depends': _pxi_dep['interval']}, + '_libs.window': {'pyxfile': '_libs/window', + 'pxdfiles': ['_libs/src/skiplist', '_libs/src/util'], + 'depends': ['pandas/_libs/src/skiplist.pyx', + 'pandas/_libs/src/skiplist.h']}, + '_libs.parsers': {'pyxfile': '_libs/parsers', + 'depends': ['pandas/_libs/src/parser/tokenizer.h', + 'pandas/_libs/src/parser/io.h', + 'pandas/_libs/src/numpy_helper.h'], + 'sources': ['pandas/_libs/src/parser/tokenizer.c', + 'pandas/_libs/src/parser/io.c']}, + '_libs.sparse': {'pyxfile': '_libs/sparse', + 'depends': (['pandas/core/sparse/sparse.pyx'] + + _pxi_dep['sparse'])}, + '_libs.testing': {'pyxfile': '_libs/testing', + 'depends': ['pandas/_libs/testing.pyx']}, + '_libs.hashing': {'pyxfile': '_libs/hashing', + 'depends': ['pandas/_libs/hashing.pyx']}, + 'io.sas._sas': {'pyxfile': 'io/sas/sas'}, + } extensions = [] @@ -546,25 +562,25 @@ def pxd(name): else: macros = [('__LITTLE_ENDIAN__', '1')] -packer_ext = Extension('pandas.msgpack._packer', - depends=['pandas/src/msgpack/pack.h', - 'pandas/src/msgpack/pack_template.h'], +packer_ext = Extension('pandas.io.msgpack._packer', + depends=['pandas/_libs/src/msgpack/pack.h', + 'pandas/_libs/src/msgpack/pack_template.h'], sources = [srcpath('_packer', suffix=suffix if suffix == '.pyx' else '.cpp', - subdir='msgpack')], + subdir='io/msgpack')], language='c++', - include_dirs=['pandas/src/msgpack'] + common_include, + include_dirs=['pandas/_libs/src/msgpack'] + common_include, define_macros=macros, extra_compile_args=extra_compile_args) -unpacker_ext = Extension('pandas.msgpack._unpacker', - depends=['pandas/src/msgpack/unpack.h', - 'pandas/src/msgpack/unpack_define.h', - 'pandas/src/msgpack/unpack_template.h'], +unpacker_ext = 
Extension('pandas.io.msgpack._unpacker', + depends=['pandas/_libs/src/msgpack/unpack.h', + 'pandas/_libs/src/msgpack/unpack_define.h', + 'pandas/_libs/src/msgpack/unpack_template.h'], sources = [srcpath('_unpacker', suffix=suffix if suffix == '.pyx' else '.cpp', - subdir='msgpack')], + subdir='io/msgpack')], language='c++', - include_dirs=['pandas/src/msgpack'] + common_include, + include_dirs=['pandas/_libs/src/msgpack'] + common_include, define_macros=macros, extra_compile_args=extra_compile_args) extensions.append(packer_ext) @@ -580,19 +596,20 @@ def pxd(name): root, _ = os.path.splitext(ext.sources[0]) ext.sources[0] = root + suffix -ujson_ext = Extension('pandas.json', - depends=['pandas/src/ujson/lib/ultrajson.h', - 'pandas/src/numpy_helper.h'], - sources=['pandas/src/ujson/python/ujson.c', - 'pandas/src/ujson/python/objToJSON.c', - 'pandas/src/ujson/python/JSONtoObj.c', - 'pandas/src/ujson/lib/ultrajsonenc.c', - 'pandas/src/ujson/lib/ultrajsondec.c', - 'pandas/src/datetime/np_datetime.c', - 'pandas/src/datetime/np_datetime_strings.c'], - include_dirs=['pandas/src/ujson/python', - 'pandas/src/ujson/lib', - 'pandas/src/datetime'] + common_include, +ujson_ext = Extension('pandas._libs.json', + depends=['pandas/_libs/src/ujson/lib/ultrajson.h', + 'pandas/_libs/src/datetime_helper.h', + 'pandas/_libs/src/numpy_helper.h'], + sources=['pandas/_libs/src/ujson/python/ujson.c', + 'pandas/_libs/src/ujson/python/objToJSON.c', + 'pandas/_libs/src/ujson/python/JSONtoObj.c', + 'pandas/_libs/src/ujson/lib/ultrajsonenc.c', + 'pandas/_libs/src/ujson/lib/ultrajsondec.c', + 'pandas/_libs/src/datetime/np_datetime.c', + 'pandas/_libs/src/datetime/np_datetime_strings.c'], + include_dirs=['pandas/_libs/src/ujson/python', + 'pandas/_libs/src/ujson/lib', + 'pandas/_libs/src/datetime'] + common_include, extra_compile_args=['-D_GNU_SOURCE'] + extra_compile_args) @@ -618,70 +635,84 @@ def pxd(name): version=versioneer.get_version(), packages=['pandas', 'pandas.api', - 'pandas.api.tests', 'pandas.api.types', 'pandas.compat', 'pandas.compat.numpy', - 'pandas.computation', - 'pandas.computation.tests', 'pandas.core', - 'pandas.indexes', + 'pandas.core.dtypes', + 'pandas.core.indexes', + 'pandas.core.computation', + 'pandas.core.reshape', + 'pandas.core.sparse', + 'pandas.core.tools', + 'pandas.core.util', + 'pandas.computation', + 'pandas.errors', + 'pandas.formats', 'pandas.io', + 'pandas.io.json', 'pandas.io.sas', - 'pandas.formats', - 'pandas.rpy', - 'pandas.sparse', - 'pandas.sparse.tests', + 'pandas.io.msgpack', + 'pandas.io.formats', + 'pandas.io.clipboard', + 'pandas._libs', + 'pandas.plotting', 'pandas.stats', + 'pandas.types', 'pandas.util', 'pandas.tests', + 'pandas.tests.api', + 'pandas.tests.dtypes', + 'pandas.tests.computation', + 'pandas.tests.sparse', 'pandas.tests.frame', 'pandas.tests.indexes', + 'pandas.tests.indexes.datetimes', + 'pandas.tests.indexes.timedeltas', + 'pandas.tests.indexes.period', + 'pandas.tests.io', + 'pandas.tests.io.json', + 'pandas.tests.io.parser', + 'pandas.tests.io.sas', + 'pandas.tests.io.msgpack', + 'pandas.tests.io.formats', + 'pandas.tests.groupby', 'pandas.tests.series', - 'pandas.tests.formats', - 'pandas.tests.types', - 'pandas.tests.test_msgpack', + 'pandas.tests.scalar', + 'pandas.tests.tseries', 'pandas.tests.plotting', + 'pandas.tests.tools', + 'pandas.tests.util', 'pandas.tools', - 'pandas.tools.tests', 'pandas.tseries', - 'pandas.tseries.tests', - 'pandas.types', - 'pandas.io.tests', - 'pandas.io.tests.json', - 'pandas.io.tests.parser', - 
'pandas.io.tests.sas', - 'pandas.stats.tests', - 'pandas.msgpack', - 'pandas.util.clipboard' ], - package_data={'pandas.io': ['tests/data/legacy_hdf/*.h5', - 'tests/data/legacy_pickle/*/*.pickle', - 'tests/data/legacy_msgpack/*/*.msgpack', - 'tests/data/*.csv*', - 'tests/data/*.dta', - 'tests/data/*.pickle', - 'tests/data/*.txt', - 'tests/data/*.xls', - 'tests/data/*.xlsx', - 'tests/data/*.xlsm', - 'tests/data/*.table', - 'tests/parser/data/*.csv', - 'tests/parser/data/*.gz', - 'tests/parser/data/*.bz2', - 'tests/parser/data/*.txt', - 'tests/sas/data/*.csv', - 'tests/sas/data/*.xpt', - 'tests/sas/data/*.sas7bdat', - 'tests/data/*.html', - 'tests/data/html_encoding/*.html', - 'tests/json/data/*.json'], - 'pandas.tools': ['tests/data/*.csv'], - 'pandas.tests': ['data/*.csv'], - 'pandas.tests.formats': ['data/*.csv'], + package_data={'pandas.tests': ['data/*.csv'], 'pandas.tests.indexes': ['data/*.pickle'], - 'pandas.tseries.tests': ['data/*.pickle', - 'data/*.csv'] + 'pandas.tests.io': ['data/legacy_hdf/*.h5', + 'data/legacy_pickle/*/*.pickle', + 'data/legacy_msgpack/*/*.msgpack', + 'data/*.csv*', + 'data/*.dta', + 'data/*.pickle', + 'data/*.txt', + 'data/*.xls', + 'data/*.xlsx', + 'data/*.xlsm', + 'data/*.table', + 'parser/data/*.csv', + 'parser/data/*.gz', + 'parser/data/*.bz2', + 'parser/data/*.txt', + 'sas/data/*.csv', + 'sas/data/*.xpt', + 'sas/data/*.sas7bdat', + 'data/*.html', + 'data/html_encoding/*.html', + 'json/data/*.json'], + 'pandas.tests.io.formats': ['data/*.csv'], + 'pandas.tests.reshape': ['data/*.csv'], + 'pandas.tests.tseries': ['data/*.pickle'], + 'pandas.io.formats': ['templates/*.tpl'] }, ext_modules=extensions, maintainer_email=EMAIL, diff --git a/test.bat b/test.bat index 16aa6c9105ec3..6c69f83866ffd 100644 --- a/test.bat +++ b/test.bat @@ -1,3 +1,3 @@ :: test on windows -nosetests --exe -A "not slow and not network and not disabled" pandas %* +pytest --skip-slow --skip-network pandas -n 2 %* diff --git a/test.sh b/test.sh index 4a9ffd7be98b1..23c7ff52d2ce9 100755 --- a/test.sh +++ b/test.sh @@ -1,11 +1,4 @@ #!/bin/sh command -v coverage >/dev/null && coverage erase command -v python-coverage >/dev/null && python-coverage erase -# nosetests pandas/tests/test_index.py --with-coverage --cover-package=pandas.core --pdb-failure --pdb -#nosetests -w pandas --with-coverage --cover-package=pandas --pdb-failure --pdb #--cover-inclusive -#nosetests -A "not slow" -w pandas/tseries --with-coverage --cover-package=pandas.tseries $* #--cover-inclusive -nosetests -w pandas --with-coverage --cover-package=pandas $* -# nosetests -w pandas/io --with-coverage --cover-package=pandas.io --pdb-failure --pdb -# nosetests -w pandas/core --with-coverage --cover-package=pandas.core --pdb-failure --pdb -# nosetests -w pandas/stats --with-coverage --cover-package=pandas.stats -# coverage run runtests.py +pytest pandas --cov=pandas diff --git a/test_fast.bat b/test_fast.bat new file mode 100644 index 0000000000000..17dc54b580137 --- /dev/null +++ b/test_fast.bat @@ -0,0 +1,3 @@ +:: test on windows +set PYTHONHASHSEED=314159265 +pytest --skip-slow --skip-network -m "not single" -n 4 pandas diff --git a/test_fast.sh b/test_fast.sh index b390705f901ad..9b984156a796c 100755 --- a/test_fast.sh +++ b/test_fast.sh @@ -1 +1,8 @@ -nosetests -A "not slow and not network" pandas --with-id $* +#!/bin/bash + +# Workaround for pytest-xdist flaky collection order +# https://github.com/pytest-dev/pytest/issues/920 +# https://github.com/pytest-dev/pytest/issues/1075 +export 
PYTHONHASHSEED=$(python -c 'import random; print(random.randint(1, 4294967295))') + +pytest pandas --skip-slow --skip-network -m "not single" -n 4 "$@" diff --git a/test_multi.sh b/test_multi.sh deleted file mode 100755 index 5d77945c66a26..0000000000000 --- a/test_multi.sh +++ /dev/null @@ -1 +0,0 @@ -nosetests -A "not slow and not network" pandas --processes=4 $* diff --git a/test_perf.sh b/test_perf.sh deleted file mode 100755 index 022de25bca8fc..0000000000000 --- a/test_perf.sh +++ /dev/null @@ -1,5 +0,0 @@ -#!/bin/sh - -CURDIR=$(pwd) -BASEDIR=$(cd "$(dirname "$0")"; pwd) -python "$BASEDIR"/vb_suite/test_perf.py $@ diff --git a/test_rebuild.sh b/test_rebuild.sh index d3710c5ff67d3..65aa1098811a1 100755 --- a/test_rebuild.sh +++ b/test_rebuild.sh @@ -3,10 +3,4 @@ python setup.py clean python setup.py build_ext --inplace coverage erase -# nosetests pandas/tests/test_index.py --with-coverage --cover-package=pandas.core --pdb-failure --pdb -#nosetests -w pandas --with-coverage --cover-package=pandas --pdb-failure --pdb #--cover-inclusive -nosetests -w pandas --with-coverage --cover-package=pandas $* #--cover-inclusive -# nosetests -w pandas/io --with-coverage --cover-package=pandas.io --pdb-failure --pdb -# nosetests -w pandas/core --with-coverage --cover-package=pandas.core --pdb-failure --pdb -# nosetests -w pandas/stats --with-coverage --cover-package=pandas.stats -# coverage run runtests.py +pytest pandas --cov=pandas diff --git a/tox.ini b/tox.ini index 5d6c8975307b6..85c5d90fde7fb 100644 --- a/tox.ini +++ b/tox.ini @@ -10,6 +10,7 @@ envlist = py27, py34, py35 deps = cython nose + pytest pytz>=2011k python-dateutil beautifulsoup4 @@ -26,7 +27,7 @@ changedir = {envdir} commands = # TODO: --exe because of GH #761 - {envbindir}/nosetests --exe pandas {posargs:-A "not network and not disabled"} + {envbindir}/pytest pandas {posargs:-A "not network and not disabled"} # cleanup the temp. 
build dir created by the tox build # /bin/rm -rf {toxinidir}/build @@ -63,18 +64,18 @@ usedevelop = True deps = {[testenv]deps} openpyxl<2.0.0 -commands = {envbindir}/nosetests {toxinidir}/pandas/io/tests/test_excel.py +commands = {envbindir}/pytest {toxinidir}/pandas/io/tests/test_excel.py [testenv:openpyxl20] usedevelop = True deps = {[testenv]deps} openpyxl<2.2.0 -commands = {envbindir}/nosetests {posargs} {toxinidir}/pandas/io/tests/test_excel.py +commands = {envbindir}/pytest {posargs} {toxinidir}/pandas/io/tests/test_excel.py [testenv:openpyxl22] usedevelop = True deps = {[testenv]deps} openpyxl>=2.2.0 -commands = {envbindir}/nosetests {posargs} {toxinidir}/pandas/io/tests/test_excel.py +commands = {envbindir}/pytest {posargs} {toxinidir}/pandas/io/tests/test_excel.py diff --git a/vb_suite/.gitignore b/vb_suite/.gitignore deleted file mode 100644 index cc110f04e1225..0000000000000 --- a/vb_suite/.gitignore +++ /dev/null @@ -1,4 +0,0 @@ -benchmarks.db -build/* -source/vbench/* -source/*.rst \ No newline at end of file diff --git a/vb_suite/attrs_caching.py b/vb_suite/attrs_caching.py deleted file mode 100644 index a7e3ed7094ed6..0000000000000 --- a/vb_suite/attrs_caching.py +++ /dev/null @@ -1,20 +0,0 @@ -from vbench.benchmark import Benchmark - -common_setup = """from .pandas_vb_common import * -""" - -#---------------------------------------------------------------------- -# DataFrame.index / columns property lookup time - -setup = common_setup + """ -df = DataFrame(np.random.randn(10, 6)) -cur_index = df.index -""" -stmt = "foo = df.index" - -getattr_dataframe_index = Benchmark(stmt, setup, - name="getattr_dataframe_index") - -stmt = "df.index = cur_index" -setattr_dataframe_index = Benchmark(stmt, setup, - name="setattr_dataframe_index") diff --git a/vb_suite/binary_ops.py b/vb_suite/binary_ops.py deleted file mode 100644 index 7c821374a83ab..0000000000000 --- a/vb_suite/binary_ops.py +++ /dev/null @@ -1,199 +0,0 @@ -from vbench.benchmark import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -""" - -SECTION = 'Binary ops' - -#---------------------------------------------------------------------- -# binary ops - -#---------------------------------------------------------------------- -# add - -setup = common_setup + """ -df = DataFrame(np.random.randn(20000, 100)) -df2 = DataFrame(np.random.randn(20000, 100)) -""" -frame_add = \ - Benchmark("df + df2", setup, name='frame_add', - start_date=datetime(2012, 1, 1)) - -setup = common_setup + """ -import pandas.computation.expressions as expr -df = DataFrame(np.random.randn(20000, 100)) -df2 = DataFrame(np.random.randn(20000, 100)) -expr.set_numexpr_threads(1) -""" - -frame_add_st = \ - Benchmark("df + df2", setup, name='frame_add_st',cleanup="expr.set_numexpr_threads()", - start_date=datetime(2013, 2, 26)) - -setup = common_setup + """ -import pandas.computation.expressions as expr -df = DataFrame(np.random.randn(20000, 100)) -df2 = DataFrame(np.random.randn(20000, 100)) -expr.set_use_numexpr(False) -""" -frame_add_no_ne = \ - Benchmark("df + df2", setup, name='frame_add_no_ne',cleanup="expr.set_use_numexpr(True)", - start_date=datetime(2013, 2, 26)) - -#---------------------------------------------------------------------- -# mult - -setup = common_setup + """ -df = DataFrame(np.random.randn(20000, 100)) -df2 = DataFrame(np.random.randn(20000, 100)) -""" -frame_mult = \ - Benchmark("df * df2", setup, name='frame_mult', - start_date=datetime(2012, 1, 1)) - -setup = common_setup + """ 
-import pandas.computation.expressions as expr -df = DataFrame(np.random.randn(20000, 100)) -df2 = DataFrame(np.random.randn(20000, 100)) -expr.set_numexpr_threads(1) -""" -frame_mult_st = \ - Benchmark("df * df2", setup, name='frame_mult_st',cleanup="expr.set_numexpr_threads()", - start_date=datetime(2013, 2, 26)) - -setup = common_setup + """ -import pandas.computation.expressions as expr -df = DataFrame(np.random.randn(20000, 100)) -df2 = DataFrame(np.random.randn(20000, 100)) -expr.set_use_numexpr(False) -""" -frame_mult_no_ne = \ - Benchmark("df * df2", setup, name='frame_mult_no_ne',cleanup="expr.set_use_numexpr(True)", - start_date=datetime(2013, 2, 26)) - -#---------------------------------------------------------------------- -# division - -setup = common_setup + """ -df = DataFrame(np.random.randn(1000, 1000)) -""" -frame_float_div_by_zero = \ - Benchmark("df / 0", setup, name='frame_float_div_by_zero') - -setup = common_setup + """ -df = DataFrame(np.random.randn(1000, 1000)) -""" -frame_float_floor_by_zero = \ - Benchmark("df // 0", setup, name='frame_float_floor_by_zero') - -setup = common_setup + """ -df = DataFrame(np.random.random_integers(np.iinfo(np.int16).min, np.iinfo(np.int16).max, size=(1000, 1000))) -""" -frame_int_div_by_zero = \ - Benchmark("df / 0", setup, name='frame_int_div_by_zero') - -setup = common_setup + """ -df = DataFrame(np.random.randn(1000, 1000)) -df2 = DataFrame(np.random.randn(1000, 1000)) -""" -frame_float_div = \ - Benchmark("df // df2", setup, name='frame_float_div') - -#---------------------------------------------------------------------- -# modulo - -setup = common_setup + """ -df = DataFrame(np.random.randn(1000, 1000)) -df2 = DataFrame(np.random.randn(1000, 1000)) -""" -frame_float_mod = \ - Benchmark("df / df2", setup, name='frame_float_mod') - -setup = common_setup + """ -df = DataFrame(np.random.random_integers(np.iinfo(np.int16).min, np.iinfo(np.int16).max, size=(1000, 1000))) -df2 = DataFrame(np.random.random_integers(np.iinfo(np.int16).min, np.iinfo(np.int16).max, size=(1000, 1000))) -""" -frame_int_mod = \ - Benchmark("df / df2", setup, name='frame_int_mod') - -#---------------------------------------------------------------------- -# multi and - -setup = common_setup + """ -df = DataFrame(np.random.randn(20000, 100)) -df2 = DataFrame(np.random.randn(20000, 100)) -""" -frame_multi_and = \ - Benchmark("df[(df>0) & (df2>0)]", setup, name='frame_multi_and', - start_date=datetime(2012, 1, 1)) - -setup = common_setup + """ -import pandas.computation.expressions as expr -df = DataFrame(np.random.randn(20000, 100)) -df2 = DataFrame(np.random.randn(20000, 100)) -expr.set_numexpr_threads(1) -""" -frame_multi_and_st = \ - Benchmark("df[(df>0) & (df2>0)]", setup, name='frame_multi_and_st',cleanup="expr.set_numexpr_threads()", - start_date=datetime(2013, 2, 26)) - -setup = common_setup + """ -import pandas.computation.expressions as expr -df = DataFrame(np.random.randn(20000, 100)) -df2 = DataFrame(np.random.randn(20000, 100)) -expr.set_use_numexpr(False) -""" -frame_multi_and_no_ne = \ - Benchmark("df[(df>0) & (df2>0)]", setup, name='frame_multi_and_no_ne',cleanup="expr.set_use_numexpr(True)", - start_date=datetime(2013, 2, 26)) - -#---------------------------------------------------------------------- -# timeseries - -setup = common_setup + """ -N = 1000000 -halfway = N // 2 - 1 -s = Series(date_range('20010101', periods=N, freq='T')) -ts = s[halfway] -""" - -timestamp_series_compare = Benchmark("ts >= s", setup, - start_date=datetime(2013, 9, 
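The `*_st` and `*_no_ne` variants above compare three evaluation paths for large frame arithmetic: numexpr with its default thread pool, numexpr pinned to one thread, and plain numpy. A short illustration of the toggles these setups use (the module path is the one current at the time of this diff; it later moved under `pandas.core.computation`):

```python
# Toggling how pandas evaluates large elementwise arithmetic,
# mirroring the setup/cleanup strings of the deleted benchmarks.
import numpy as np
import pandas as pd
import pandas.computation.expressions as expr

df = pd.DataFrame(np.random.randn(20000, 100))
df2 = pd.DataFrame(np.random.randn(20000, 100))

expr.set_numexpr_threads(1)   # numexpr restricted to a single thread
_ = df + df2
expr.set_numexpr_threads()    # restore the default thread count

expr.set_use_numexpr(False)   # force the plain numpy path
_ = df + df2
expr.set_use_numexpr(True)
```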
27)) -series_timestamp_compare = Benchmark("s <= ts", setup, - start_date=datetime(2012, 2, 21)) - -setup = common_setup + """ -N = 1000000 -s = Series(date_range('20010101', periods=N, freq='s')) -""" - -timestamp_ops_diff1 = Benchmark("s.diff()", setup, - start_date=datetime(2013, 1, 1)) -timestamp_ops_diff2 = Benchmark("s-s.shift()", setup, - start_date=datetime(2013, 1, 1)) - -#---------------------------------------------------------------------- -# timeseries with tz - -setup = common_setup + """ -N = 10000 -halfway = N // 2 - 1 -s = Series(date_range('20010101', periods=N, freq='T', tz='US/Eastern')) -ts = s[halfway] -""" - -timestamp_tz_series_compare = Benchmark("ts >= s", setup, - start_date=datetime(2013, 9, 27)) -series_timestamp_tz_compare = Benchmark("s <= ts", setup, - start_date=datetime(2012, 2, 21)) - -setup = common_setup + """ -N = 10000 -s = Series(date_range('20010101', periods=N, freq='s', tz='US/Eastern')) -""" - -timestamp_tz_ops_diff1 = Benchmark("s.diff()", setup, - start_date=datetime(2013, 1, 1)) -timestamp_tz_ops_diff2 = Benchmark("s-s.shift()", setup, - start_date=datetime(2013, 1, 1)) diff --git a/vb_suite/categoricals.py b/vb_suite/categoricals.py deleted file mode 100644 index a08d479df20cb..0000000000000 --- a/vb_suite/categoricals.py +++ /dev/null @@ -1,16 +0,0 @@ -from vbench.benchmark import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -""" - -#---------------------------------------------------------------------- -# Series constructors - -setup = common_setup + """ -s = pd.Series(list('aabbcd') * 1000000).astype('category') -""" - -concat_categorical = \ - Benchmark("concat([s, s])", setup=setup, name='concat_categorical', - start_date=datetime(year=2015, month=7, day=15)) diff --git a/vb_suite/ctors.py b/vb_suite/ctors.py deleted file mode 100644 index 8123322383f0a..0000000000000 --- a/vb_suite/ctors.py +++ /dev/null @@ -1,39 +0,0 @@ -from vbench.benchmark import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -""" - -#---------------------------------------------------------------------- -# Series constructors - -setup = common_setup + """ -data = np.random.randn(100) -index = Index(np.arange(100)) -""" - -ctor_series_ndarray = \ - Benchmark("Series(data, index=index)", setup=setup, - name='series_constructor_ndarray') - -setup = common_setup + """ -arr = np.random.randn(100, 100) -""" - -ctor_frame_ndarray = \ - Benchmark("DataFrame(arr)", setup=setup, - name='frame_constructor_ndarray') - -setup = common_setup + """ -data = np.array(['foo', 'bar', 'baz'], dtype=object) -""" - -ctor_index_array_string = Benchmark('Index(data)', setup=setup) - -# index constructors -setup = common_setup + """ -s = Series([Timestamp('20110101'),Timestamp('20120101'),Timestamp('20130101')]*1000) -""" -index_from_series_ctor = Benchmark('Index(s)', setup=setup) - -dtindex_from_series_ctor = Benchmark('DatetimeIndex(s)', setup=setup) diff --git a/vb_suite/eval.py b/vb_suite/eval.py deleted file mode 100644 index bf80aad956184..0000000000000 --- a/vb_suite/eval.py +++ /dev/null @@ -1,150 +0,0 @@ -from vbench.benchmark import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -import pandas as pd -df = DataFrame(np.random.randn(20000, 100)) -df2 = DataFrame(np.random.randn(20000, 100)) -df3 = DataFrame(np.random.randn(20000, 100)) -df4 = DataFrame(np.random.randn(20000, 100)) -""" - -setup = common_setup + """ -import 
pandas.computation.expressions as expr -expr.set_numexpr_threads(1) -""" - -SECTION = 'Eval' - -#---------------------------------------------------------------------- -# binary ops - -#---------------------------------------------------------------------- -# add -eval_frame_add_all_threads = \ - Benchmark("pd.eval('df + df2 + df3 + df4')", common_setup, - name='eval_frame_add_all_threads', - start_date=datetime(2013, 7, 21)) - - - -eval_frame_add_one_thread = \ - Benchmark("pd.eval('df + df2 + df3 + df4')", setup, - name='eval_frame_add_one_thread', - start_date=datetime(2013, 7, 26)) - -eval_frame_add_python = \ - Benchmark("pd.eval('df + df2 + df3 + df4', engine='python')", common_setup, - name='eval_frame_add_python', start_date=datetime(2013, 7, 21)) - -eval_frame_add_python_one_thread = \ - Benchmark("pd.eval('df + df2 + df3 + df4', engine='python')", setup, - name='eval_frame_add_python_one_thread', - start_date=datetime(2013, 7, 26)) -#---------------------------------------------------------------------- -# mult - -eval_frame_mult_all_threads = \ - Benchmark("pd.eval('df * df2 * df3 * df4')", common_setup, - name='eval_frame_mult_all_threads', - start_date=datetime(2013, 7, 21)) - -eval_frame_mult_one_thread = \ - Benchmark("pd.eval('df * df2 * df3 * df4')", setup, - name='eval_frame_mult_one_thread', - start_date=datetime(2013, 7, 26)) - -eval_frame_mult_python = \ - Benchmark("pd.eval('df * df2 * df3 * df4', engine='python')", - common_setup, - name='eval_frame_mult_python', start_date=datetime(2013, 7, 21)) - -eval_frame_mult_python_one_thread = \ - Benchmark("pd.eval('df * df2 * df3 * df4', engine='python')", setup, - name='eval_frame_mult_python_one_thread', - start_date=datetime(2013, 7, 26)) - -#---------------------------------------------------------------------- -# multi and - -eval_frame_and_all_threads = \ - Benchmark("pd.eval('(df > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)')", - common_setup, - name='eval_frame_and_all_threads', - start_date=datetime(2013, 7, 21)) - -eval_frame_and_one_thread = \ - Benchmark("pd.eval('(df > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)')", setup, - name='eval_frame_and_one_thread', - start_date=datetime(2013, 7, 26)) - -eval_frame_and_python = \ - Benchmark("pd.eval('(df > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)', engine='python')", - common_setup, name='eval_frame_and_python', - start_date=datetime(2013, 7, 21)) - -eval_frame_and_one_thread = \ - Benchmark("pd.eval('(df > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)', engine='python')", - setup, - name='eval_frame_and_python_one_thread', - start_date=datetime(2013, 7, 26)) - -#-------------------------------------------------------------------- -# chained comp -eval_frame_chained_cmp_all_threads = \ - Benchmark("pd.eval('df < df2 < df3 < df4')", common_setup, - name='eval_frame_chained_cmp_all_threads', - start_date=datetime(2013, 7, 21)) - -eval_frame_chained_cmp_one_thread = \ - Benchmark("pd.eval('df < df2 < df3 < df4')", setup, - name='eval_frame_chained_cmp_one_thread', - start_date=datetime(2013, 7, 26)) - -eval_frame_chained_cmp_python = \ - Benchmark("pd.eval('df < df2 < df3 < df4', engine='python')", - common_setup, name='eval_frame_chained_cmp_python', - start_date=datetime(2013, 7, 26)) - -eval_frame_chained_cmp_one_thread = \ - Benchmark("pd.eval('df < df2 < df3 < df4', engine='python')", setup, - name='eval_frame_chained_cmp_python_one_thread', - start_date=datetime(2013, 7, 26)) - - -common_setup = """from .pandas_vb_common import * -""" - -setup = common_setup + """ -N = 1000000 
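The deleted eval.py benchmarks above time expression strings under pandas' two engines, and the query_datetime_* cases below rely on `@name` references to local variables. A small usage sketch:

```python
# pd.eval uses the numexpr engine when available; engine='python' is the
# explicit pure-python fallback. df.query shares the same machinery and
# lets '@name' pull in a local variable.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 3), columns=list("abc"))
df2 = pd.DataFrame(np.random.randn(1000, 3), columns=list("abc"))

fast = pd.eval("df + df2")                   # numexpr engine if installed
slow = pd.eval("df + df2", engine="python")  # pure-python engine
assert np.allclose(fast, slow)

cutoff = 0.5
subset = df.query("a < @cutoff")             # @ references the local name
```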
-halfway = N // 2 - 1 -index = date_range('20010101', periods=N, freq='T') -s = Series(index) -ts = s.iloc[halfway] -""" - -series_setup = setup + """ -df = DataFrame({'dates': s.values}) -""" - -query_datetime_series = Benchmark("df.query('dates < @ts')", - series_setup, - start_date=datetime(2013, 9, 27)) - -index_setup = setup + """ -df = DataFrame({'a': np.random.randn(N)}, index=index) -""" - -query_datetime_index = Benchmark("df.query('index < @ts')", - index_setup, start_date=datetime(2013, 9, 27)) - -setup = setup + """ -N = 1000000 -df = DataFrame({'a': np.random.randn(N)}) -min_val = df['a'].min() -max_val = df['a'].max() -""" - -query_with_boolean_selection = Benchmark("df.query('(a >= @min_val) & (a <= @max_val)')", - setup, start_date=datetime(2013, 9, 27)) - diff --git a/vb_suite/frame_ctor.py b/vb_suite/frame_ctor.py deleted file mode 100644 index 0d57da7b88d3b..0000000000000 --- a/vb_suite/frame_ctor.py +++ /dev/null @@ -1,123 +0,0 @@ -from vbench.benchmark import Benchmark -from datetime import datetime -try: - import pandas.tseries.offsets as offsets -except: - import pandas.core.datetools as offsets - -common_setup = """from .pandas_vb_common import * -try: - from pandas.tseries.offsets import * -except: - from pandas.core.datetools import * -""" - -#---------------------------------------------------------------------- -# Creation from nested dict - -setup = common_setup + """ -N, K = 5000, 50 -index = tm.makeStringIndex(N) -columns = tm.makeStringIndex(K) -frame = DataFrame(np.random.randn(N, K), index=index, columns=columns) - -try: - data = frame.to_dict() -except: - data = frame.toDict() - -some_dict = data.values()[0] -dict_list = [dict(zip(columns, row)) for row in frame.values] -""" - -frame_ctor_nested_dict = Benchmark("DataFrame(data)", setup) - -# From JSON-like stuff -frame_ctor_list_of_dict = Benchmark("DataFrame(dict_list)", setup, - start_date=datetime(2011, 12, 20)) - -series_ctor_from_dict = Benchmark("Series(some_dict)", setup) - -# nested dict, integer indexes, regression described in #621 -setup = common_setup + """ -data = dict((i,dict((j,float(j)) for j in range(100))) for i in xrange(2000)) -""" -frame_ctor_nested_dict_int64 = Benchmark("DataFrame(data)", setup) - -# dynamically generate benchmarks for every offset -# -# get_period_count & get_index_for_offset are there because blindly taking each -# offset times 1000 can easily go out of Timestamp bounds and raise errors. 
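A compact restatement of the Timestamp-bounds guard defined just below: measure how many days ten applications of an offset span, then cap the period count so `date_range` never steps past `Timestamp.max` (`Day` stands in here for any of the dynamically iterated offsets):

```python
# Minimal version of the get_period_count/get_index_for_offset guard.
import pandas as pd
from pandas.tseries.offsets import Day

start = pd.Timestamp("1900-01-01")
off = Day(2)
ten_offsets_in_days = ((start + off * 10) - start).days
periods = (1000 if ten_offsets_in_days == 0
           else min(9 * ((pd.Timestamp.max - start).days // ten_offsets_in_days),
                    1000))
idx = pd.date_range(start, periods=periods, freq=off)
```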
-dynamic_benchmarks = {} -n_steps = [1, 2] -offset_kwargs = {'WeekOfMonth': {'weekday': 1, 'week': 1}, - 'LastWeekOfMonth': {'weekday': 1, 'week': 1}, - 'FY5253': {'startingMonth': 1, 'weekday': 1}, - 'FY5253Quarter': {'qtr_with_extra_week': 1, 'startingMonth': 1, 'weekday': 1}} - -offset_extra_cases = {'FY5253': {'variation': ['nearest', 'last']}, - 'FY5253Quarter': {'variation': ['nearest', 'last']}} - -for offset in offsets.__all__: - for n in n_steps: - kwargs = {} - if offset in offset_kwargs: - kwargs = offset_kwargs[offset] - - if offset in offset_extra_cases: - extras = offset_extra_cases[offset] - else: - extras = {'': ['']} - - for extra_arg in extras: - for extra in extras[extra_arg]: - if extra: - kwargs[extra_arg] = extra - setup = common_setup + """ - -def get_period_count(start_date, off): - ten_offsets_in_days = ((start_date + off * 10) - start_date).days - if ten_offsets_in_days == 0: - return 1000 - else: - return min(9 * ((Timestamp.max - start_date).days // - ten_offsets_in_days), - 1000) - -def get_index_for_offset(off): - start_date = Timestamp('1/1/1900') - return date_range(start_date, - periods=min(1000, get_period_count(start_date, off)), - freq=off) - -idx = get_index_for_offset({}({}, **{})) -df = DataFrame(np.random.randn(len(idx),10), index=idx) -d = dict([ (col,df[col]) for col in df.columns ]) -""".format(offset, n, kwargs) - key = 'frame_ctor_dtindex_{}x{}'.format(offset, n) - if extra: - key += '__{}_{}'.format(extra_arg, extra) - dynamic_benchmarks[key] = Benchmark("DataFrame(d)", setup, name=key) - -# Have to stuff them in globals() so vbench detects them -globals().update(dynamic_benchmarks) - -# from a mi-series -setup = common_setup + """ -mi = MultiIndex.from_tuples([(x,y) for x in range(100) for y in range(100)]) -s = Series(randn(10000), index=mi) -""" -frame_from_series = Benchmark("DataFrame(s)", setup) - -#---------------------------------------------------------------------- -# get_numeric_data - -setup = common_setup + """ -df = DataFrame(randn(10000, 25)) -df['foo'] = 'bar' -df['bar'] = 'baz' -df = df.consolidate() -""" - -frame_get_numeric_data = Benchmark('df._get_numeric_data()', setup, - start_date=datetime(2011, 11, 1)) diff --git a/vb_suite/frame_methods.py b/vb_suite/frame_methods.py deleted file mode 100644 index 46343e9c607fd..0000000000000 --- a/vb_suite/frame_methods.py +++ /dev/null @@ -1,525 +0,0 @@ -from vbench.api import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -""" - -#---------------------------------------------------------------------- -# lookup - -setup = common_setup + """ -df = DataFrame(np.random.randn(10000, 8), columns=list('abcdefgh')) -df['foo'] = 'bar' - -row_labels = list(df.index[::10])[:900] -col_labels = list(df.columns) * 100 -row_labels_all = np.array(list(df.index) * len(df.columns), dtype='object') -col_labels_all = np.array(list(df.columns) * len(df.index), dtype='object') -""" - -frame_fancy_lookup = Benchmark('df.lookup(row_labels, col_labels)', setup, - start_date=datetime(2012, 1, 12)) - -frame_fancy_lookup_all = Benchmark('df.lookup(row_labels_all, col_labels_all)', - setup, - start_date=datetime(2012, 1, 12)) - -#---------------------------------------------------------------------- -# fillna in place - -setup = common_setup + """ -df = DataFrame(randn(10000, 100)) -df.values[::2] = np.nan -""" - -frame_fillna_inplace = Benchmark('df.fillna(0, inplace=True)', setup, - start_date=datetime(2012, 4, 4)) - - 
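The frame_fillna_inplace case above times in-place NaN filling; `inplace=True` avoids allocating a new frame but mutates `df`, which is exactly the behavior being measured. A scaled-down version of what it exercises:

```python
# In-place fillna on a float frame with every other row set to NaN,
# mirroring the deleted benchmark's setup at a smaller size.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 10))
df.values[::2] = np.nan          # works on a single-dtype float block
df.fillna(0, inplace=True)
assert not df.isnull().values.any()
```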
-#---------------------------------------------------------------------- -# reindex both axes - -setup = common_setup + """ -df = DataFrame(randn(10000, 10000)) -idx = np.arange(4000, 7000) -""" - -frame_reindex_axis0 = Benchmark('df.reindex(idx)', setup) - -frame_reindex_axis1 = Benchmark('df.reindex(columns=idx)', setup) - -frame_reindex_both_axes = Benchmark('df.reindex(index=idx, columns=idx)', - setup, start_date=datetime(2011, 1, 1)) - -frame_reindex_both_axes_ix = Benchmark('df.ix[idx, idx]', setup, - start_date=datetime(2011, 1, 1)) - -#---------------------------------------------------------------------- -# reindex with upcasts -setup = common_setup + """ -df=DataFrame(dict([(c, { - 0: randint(0, 2, 1000).astype(np.bool_), - 1: randint(0, 1000, 1000).astype(np.int16), - 2: randint(0, 1000, 1000).astype(np.int32), - 3: randint(0, 1000, 1000).astype(np.int64) - }[randint(0, 4)]) for c in range(1000)])) -""" - -frame_reindex_upcast = Benchmark('df.reindex(permutation(range(1200)))', setup) - -#---------------------------------------------------------------------- -# boolean indexing - -setup = common_setup + """ -df = DataFrame(randn(10000, 100)) -bool_arr = np.zeros(10000, dtype=bool) -bool_arr[:1000] = True -""" - -frame_boolean_row_select = Benchmark('df[bool_arr]', setup, - start_date=datetime(2011, 1, 1)) - -#---------------------------------------------------------------------- -# iteritems (monitor no-copying behaviour) - -setup = common_setup + """ -df = DataFrame(randn(10000, 1000)) -df2 = DataFrame(randn(3000,1),columns=['A']) -df3 = DataFrame(randn(3000,1)) - -def f(): - if hasattr(df, '_item_cache'): - df._item_cache.clear() - for name, col in df.iteritems(): - pass - -def g(): - for name, col in df.iteritems(): - pass - -def h(): - for i in range(10000): - df2['A'] - -def j(): - for i in range(10000): - df3[0] - -""" - -# as far back as the earliest test currently in the suite -frame_iteritems = Benchmark('f()', setup, - start_date=datetime(2010, 6, 1)) - -frame_iteritems_cached = Benchmark('g()', setup, - start_date=datetime(2010, 6, 1)) - -frame_getitem_single_column = Benchmark('h()', setup, - start_date=datetime(2010, 6, 1)) - -frame_getitem_single_column2 = Benchmark('j()', setup, - start_date=datetime(2010, 6, 1)) - -#---------------------------------------------------------------------- -# assignment - -setup = common_setup + """ -idx = date_range('1/1/2000', periods=100000, freq='D') -df = DataFrame(randn(100000, 1),columns=['A'],index=idx) -def f(df): - x = df.copy() - x['date'] = x.index -""" - -frame_assign_timeseries_index = Benchmark('f(df)', setup, - start_date=datetime(2013, 10, 1)) - - -#---------------------------------------------------------------------- -# to_string - -setup = common_setup + """ -df = DataFrame(randn(100, 10)) -""" - -frame_to_string_floats = Benchmark('df.to_string()', setup, - start_date=datetime(2010, 6, 1)) - -#---------------------------------------------------------------------- -# to_html - -setup = common_setup + """ -nrows=500 -df = DataFrame(randn(nrows, 10)) -df[0]=period_range("2000","2010",nrows) -df[1]=range(nrows) - -""" - -frame_to_html_mixed = Benchmark('df.to_html()', setup, - start_date=datetime(2011, 11, 18)) - - -# truncated repr_html, single index - -setup = common_setup + """ -nrows=10000 -data=randn(nrows,10) -idx=MultiIndex.from_arrays(np.tile(randn(3,nrows/100),100)) -df=DataFrame(data,index=idx) - -""" - -frame_html_repr_trunc_mi = Benchmark('df._repr_html_()', setup, - start_date=datetime(2013, 11, 25)) - 
-# truncated repr_html, MultiIndex - -setup = common_setup + """ -nrows=10000 -data=randn(nrows,10) -idx=randn(nrows) -df=DataFrame(data,index=idx) - -""" - -frame_html_repr_trunc_si = Benchmark('df._repr_html_()', setup, - start_date=datetime(2013, 11, 25)) - - -# insert many columns - -setup = common_setup + """ -N = 1000 - -def f(K=500): - df = DataFrame(index=range(N)) - new_col = np.random.randn(N) - for i in range(K): - df[i] = new_col -""" - -frame_insert_500_columns_end = Benchmark('f()', setup, start_date=datetime(2011, 1, 1)) - -setup = common_setup + """ -N = 1000 - -def f(K=100): - df = DataFrame(index=range(N)) - new_col = np.random.randn(N) - for i in range(K): - df.insert(0,i,new_col) -""" - -frame_insert_100_columns_begin = Benchmark('f()', setup, start_date=datetime(2011, 1, 1)) - -#---------------------------------------------------------------------- -# strings methods, #2602 - -setup = common_setup + """ -s = Series(['abcdefg', np.nan]*500000) -""" - -series_string_vector_slice = Benchmark('s.str[:5]', setup, - start_date=datetime(2012, 8, 1)) - -#---------------------------------------------------------------------- -# df.info() and get_dtype_counts() # 2807 - -setup = common_setup + """ -df = pandas.DataFrame(np.random.randn(10,10000)) -""" - -frame_get_dtype_counts = Benchmark('df.get_dtype_counts()', setup, - start_date=datetime(2012, 8, 1)) - -## -setup = common_setup + """ -df = pandas.DataFrame(np.random.randn(10,10000)) -""" - -frame_repr_wide = Benchmark('repr(df)', setup, - start_date=datetime(2012, 8, 1)) - -## -setup = common_setup + """ -df = pandas.DataFrame(np.random.randn(10000, 10)) -""" - -frame_repr_tall = Benchmark('repr(df)', setup, - start_date=datetime(2012, 8, 1)) - -## -setup = common_setup + """ -df = DataFrame(randn(100000, 1)) -""" - -frame_xs_row = Benchmark('df.xs(50000)', setup) - -## -setup = common_setup + """ -df = DataFrame(randn(1,100000)) -""" - -frame_xs_col = Benchmark('df.xs(50000,axis = 1)', setup) - -#---------------------------------------------------------------------- -# nulls/masking - -## masking -setup = common_setup + """ -data = np.random.randn(1000, 500) -df = DataFrame(data) -df = df.where(df > 0) # create nans -bools = df > 0 -mask = isnull(df) -""" - -frame_mask_bools = Benchmark('bools.mask(mask)', setup, - start_date=datetime(2013,1,1)) - -frame_mask_floats = Benchmark('bools.astype(float).mask(mask)', setup, - start_date=datetime(2013,1,1)) - -## isnull -setup = common_setup + """ -data = np.random.randn(1000, 1000) -df = DataFrame(data) -""" -frame_isnull = Benchmark('isnull(df)', setup, - start_date=datetime(2012,1,1)) - -## dropna -dropna_setup = common_setup + """ -data = np.random.randn(10000, 1000) -df = DataFrame(data) -df.ix[50:1000,20:50] = np.nan -df.ix[2000:3000] = np.nan -df.ix[:,60:70] = np.nan -""" -frame_dropna_axis0_any = Benchmark('df.dropna(how="any",axis=0)', dropna_setup, - start_date=datetime(2012,1,1)) -frame_dropna_axis0_all = Benchmark('df.dropna(how="all",axis=0)', dropna_setup, - start_date=datetime(2012,1,1)) - -frame_dropna_axis1_any = Benchmark('df.dropna(how="any",axis=1)', dropna_setup, - start_date=datetime(2012,1,1)) - -frame_dropna_axis1_all = Benchmark('df.dropna(how="all",axis=1)', dropna_setup, - start_date=datetime(2012,1,1)) - -# dropna on mixed dtypes -dropna_mixed_setup = common_setup + """ -data = np.random.randn(10000, 1000) -df = DataFrame(data) -df.ix[50:1000,20:50] = np.nan -df.ix[2000:3000] = np.nan -df.ix[:,60:70] = np.nan -df['foo'] = 'bar' -""" 
-frame_dropna_axis0_any_mixed_dtypes = Benchmark('df.dropna(how="any",axis=0)', dropna_mixed_setup, - start_date=datetime(2012,1,1)) -frame_dropna_axis0_all_mixed_dtypes = Benchmark('df.dropna(how="all",axis=0)', dropna_mixed_setup, - start_date=datetime(2012,1,1)) - -frame_dropna_axis1_any_mixed_dtypes = Benchmark('df.dropna(how="any",axis=1)', dropna_mixed_setup, - start_date=datetime(2012,1,1)) - -frame_dropna_axis1_all_mixed_dtypes = Benchmark('df.dropna(how="all",axis=1)', dropna_mixed_setup, - start_date=datetime(2012,1,1)) - -## dropna multi -dropna_setup = common_setup + """ -data = np.random.randn(10000, 1000) -df = DataFrame(data) -df.ix[50:1000,20:50] = np.nan -df.ix[2000:3000] = np.nan -df.ix[:,60:70] = np.nan -df.index = MultiIndex.from_tuples(df.index.map(lambda x: (x, x))) -df.columns = MultiIndex.from_tuples(df.columns.map(lambda x: (x, x))) -""" -frame_count_level_axis0_multi = Benchmark('df.count(axis=0, level=1)', dropna_setup, - start_date=datetime(2012,1,1)) - -frame_count_level_axis1_multi = Benchmark('df.count(axis=1, level=1)', dropna_setup, - start_date=datetime(2012,1,1)) - -# dropna on mixed dtypes -dropna_mixed_setup = common_setup + """ -data = np.random.randn(10000, 1000) -df = DataFrame(data) -df.ix[50:1000,20:50] = np.nan -df.ix[2000:3000] = np.nan -df.ix[:,60:70] = np.nan -df['foo'] = 'bar' -df.index = MultiIndex.from_tuples(df.index.map(lambda x: (x, x))) -df.columns = MultiIndex.from_tuples(df.columns.map(lambda x: (x, x))) -""" -frame_count_level_axis0_mixed_dtypes_multi = Benchmark('df.count(axis=0, level=1)', dropna_mixed_setup, - start_date=datetime(2012,1,1)) - -frame_count_level_axis1_mixed_dtypes_multi = Benchmark('df.count(axis=1, level=1)', dropna_mixed_setup, - start_date=datetime(2012,1,1)) - -#---------------------------------------------------------------------- -# apply - -setup = common_setup + """ -s = Series(np.arange(1028.)) -df = DataFrame({ i:s for i in range(1028) }) -""" -frame_apply_user_func = Benchmark('df.apply(lambda x: np.corrcoef(x,s)[0,1])', setup, - name = 'frame_apply_user_func', - start_date=datetime(2012,1,1)) - -setup = common_setup + """ -df = DataFrame(np.random.randn(1000,100)) -""" -frame_apply_lambda_mean = Benchmark('df.apply(lambda x: x.sum())', setup, - name = 'frame_apply_lambda_mean', - start_date=datetime(2012,1,1)) -setup = common_setup + """ -df = DataFrame(np.random.randn(1000,100)) -""" -frame_apply_np_mean = Benchmark('df.apply(np.mean)', setup, - name = 'frame_apply_np_mean', - start_date=datetime(2012,1,1)) - -setup = common_setup + """ -df = DataFrame(np.random.randn(1000,100)) -""" -frame_apply_pass_thru = Benchmark('df.apply(lambda x: x)', setup, - name = 'frame_apply_pass_thru', - start_date=datetime(2012,1,1)) - -setup = common_setup + """ -df = DataFrame(np.random.randn(1000,100)) -""" -frame_apply_axis_1 = Benchmark('df.apply(lambda x: x+1,axis=1)', setup, - name = 'frame_apply_axis_1', - start_date=datetime(2012,1,1)) - -setup = common_setup + """ -df = DataFrame(np.random.randn(1000,3),columns=list('ABC')) -""" -frame_apply_ref_by_name = Benchmark('df.apply(lambda x: x["A"] + x["B"],axis=1)', setup, - name = 'frame_apply_ref_by_name', - start_date=datetime(2012,1,1)) - -#---------------------------------------------------------------------- -# dtypes - -setup = common_setup + """ -df = DataFrame(np.random.randn(1000,1000)) -""" -frame_dtypes = Benchmark('df.dtypes', setup, - start_date=datetime(2012,1,1)) - -#---------------------------------------------------------------------- -# equals 
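`DataFrame.equals`, timed by the frame_*_equal benchmarks that follow, treats NaNs at matching positions as equal, unlike elementwise `==`; that is why a frame containing NaN still equals its own copy:

```python
# equals is NaN-aware; elementwise comparison is not.
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan]})
df2 = df.copy()
assert df.equals(df2)                 # NaN positions compare equal
assert not (df == df2)["a"].iloc[1]   # elementwise NaN == NaN is False
```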
-setup = common_setup + """ -def make_pair(frame): - df = frame - df2 = df.copy() - df2.ix[-1,-1] = np.nan - return df, df2 - -def test_equal(name): - df, df2 = pairs[name] - return df.equals(df) - -def test_unequal(name): - df, df2 = pairs[name] - return df.equals(df2) - -float_df = DataFrame(np.random.randn(1000, 1000)) -object_df = DataFrame([['foo']*1000]*1000) -nonunique_cols = object_df.copy() -nonunique_cols.columns = ['A']*len(nonunique_cols.columns) - -pairs = dict([(name, make_pair(frame)) - for name, frame in (('float_df', float_df), ('object_df', object_df), ('nonunique_cols', nonunique_cols))]) -""" -frame_float_equal = Benchmark('test_equal("float_df")', setup) -frame_object_equal = Benchmark('test_equal("object_df")', setup) -frame_nonunique_equal = Benchmark('test_equal("nonunique_cols")', setup) - -frame_float_unequal = Benchmark('test_unequal("float_df")', setup) -frame_object_unequal = Benchmark('test_unequal("object_df")', setup) -frame_nonunique_unequal = Benchmark('test_unequal("nonunique_cols")', setup) - -#----------------------------------------------------------------------------- -# interpolate -# this is the worst case, where every column has NaNs. -setup = common_setup + """ -df = DataFrame(randn(10000, 100)) -df.values[::2] = np.nan -""" - -frame_interpolate = Benchmark('df.interpolate()', setup, - start_date=datetime(2014, 2, 7)) - -setup = common_setup + """ -df = DataFrame({'A': np.arange(0, 10000), - 'B': np.random.randint(0, 100, 10000), - 'C': randn(10000), - 'D': randn(10000)}) -df.loc[1::5, 'A'] = np.nan -df.loc[1::5, 'C'] = np.nan -""" - -frame_interpolate_some_good = Benchmark('df.interpolate()', setup, - start_date=datetime(2014, 2, 7)) -frame_interpolate_some_good_infer = Benchmark('df.interpolate(downcast="infer")', - setup, - start_date=datetime(2014, 2, 7)) - - -#------------------------------------------------------------------------- -# frame shift speedup issue-5609 - -setup = common_setup + """ -df = DataFrame(np.random.rand(10000,500)) -# note: df._data.blocks are f_contigous -""" -frame_shift_axis0 = Benchmark('df.shift(1,axis=0)', setup, - start_date=datetime(2014,1,1)) -frame_shift_axis1 = Benchmark('df.shift(1,axis=1)', setup, - name = 'frame_shift_axis_1', - start_date=datetime(2014,1,1)) - - -#----------------------------------------------------------------------------- -# from_records issue-6700 - -setup = common_setup + """ -def get_data(n=100000): - return ((x, x*20, x*100) for x in range(n)) -""" - -frame_from_records_generator = Benchmark('df = DataFrame.from_records(get_data())', - setup, - name='frame_from_records_generator', - start_date=datetime(2013,10,4)) # issue-4911 - -frame_from_records_generator_nrows = Benchmark('df = DataFrame.from_records(get_data(), nrows=1000)', - setup, - name='frame_from_records_generator_nrows', - start_date=datetime(2013,10,04)) # issue-4911 - -#----------------------------------------------------------------------------- -# duplicated - -setup = common_setup + ''' -n = 1 << 20 - -t = date_range('2015-01-01', freq='S', periods=n // 64) -xs = np.random.randn(n // 64).round(2) - -df = DataFrame({'a':np.random.randint(- 1 << 8, 1 << 8, n), - 'b':np.random.choice(t, n), - 'c':np.random.choice(xs, n)}) -''' - -frame_duplicated = Benchmark('df.duplicated()', setup, - name='frame_duplicated') diff --git a/vb_suite/generate_rst_files.py b/vb_suite/generate_rst_files.py deleted file mode 100644 index 92e7cd4d59b71..0000000000000 --- a/vb_suite/generate_rst_files.py +++ /dev/null @@ -1,2 +0,0 @@ -from 
suite import benchmarks, generate_rst_files -generate_rst_files(benchmarks) diff --git a/vb_suite/gil.py b/vb_suite/gil.py deleted file mode 100644 index df2bd2dcd8db4..0000000000000 --- a/vb_suite/gil.py +++ /dev/null @@ -1,110 +0,0 @@ -from vbench.api import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -""" - -basic = common_setup + """ -try: - from pandas.util.testing import test_parallel - have_real_test_parallel = True -except ImportError: - have_real_test_parallel = False - def test_parallel(num_threads=1): - def wrapper(fname): - return fname - - return wrapper - -N = 1000000 -ngroups = 1000 -np.random.seed(1234) - -df = DataFrame({'key' : np.random.randint(0,ngroups,size=N), - 'data' : np.random.randn(N) }) - -if not have_real_test_parallel: - raise NotImplementedError -""" - -setup = basic + """ - -def f(): - df.groupby('key')['data'].sum() - -# run consecutivily -def g2(): - for i in range(2): - f() -def g4(): - for i in range(4): - f() -def g8(): - for i in range(8): - f() - -# run in parallel -@test_parallel(num_threads=2) -def pg2(): - f() - -@test_parallel(num_threads=4) -def pg4(): - f() - -@test_parallel(num_threads=8) -def pg8(): - f() - -""" - -nogil_groupby_sum_4 = Benchmark( - 'pg4()', setup, - start_date=datetime(2015, 1, 1)) - -nogil_groupby_sum_8 = Benchmark( - 'pg8()', setup, - start_date=datetime(2015, 1, 1)) - - -#### test all groupby funcs #### - -setup = basic + """ - -@test_parallel(num_threads=2) -def pg2(): - df.groupby('key')['data'].func() - -""" - -for f in ['sum','prod','var','count','min','max','mean','last']: - - name = "nogil_groupby_{f}_2".format(f=f) - bmark = Benchmark('pg2()', setup.replace('func',f), start_date=datetime(2015, 1, 1)) - bmark.name = name - globals()[name] = bmark - -del bmark - - -#### test take_1d #### -setup = basic + """ -from pandas.core import common as com - -N = 1e7 -df = DataFrame({'int64' : np.arange(N,dtype='int64'), - 'float64' : np.arange(N,dtype='float64')}) -indexer = np.arange(100,len(df)-100) - -@test_parallel(num_threads=2) -def take_1d_pg2_int64(): - com.take_1d(df.int64.values,indexer) - -@test_parallel(num_threads=2) -def take_1d_pg2_float64(): - com.take_1d(df.float64.values,indexer) - -""" - -nogil_take1d_float64 = Benchmark('take_1d_pg2_int64()', setup, start_date=datetime(2015, 1, 1)) -nogil_take1d_int64 = Benchmark('take_1d_pg2_float64()', setup, start_date=datetime(2015, 1, 1)) diff --git a/vb_suite/groupby.py b/vb_suite/groupby.py deleted file mode 100644 index 268d71f864823..0000000000000 --- a/vb_suite/groupby.py +++ /dev/null @@ -1,620 +0,0 @@ -from vbench.api import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -""" - -setup = common_setup + """ -N = 100000 -ngroups = 100 - -def get_test_data(ngroups=100, n=100000): - unique_groups = range(ngroups) - arr = np.asarray(np.tile(unique_groups, n / ngroups), dtype=object) - - if len(arr) < n: - arr = np.asarray(list(arr) + unique_groups[:n - len(arr)], - dtype=object) - - random.shuffle(arr) - return arr - -# aggregate multiple columns -df = DataFrame({'key1' : get_test_data(ngroups=ngroups), - 'key2' : get_test_data(ngroups=ngroups), - 'data1' : np.random.randn(N), - 'data2' : np.random.randn(N)}) -def f(): - df.groupby(['key1', 'key2']).agg(lambda x: x.values.sum()) - -simple_series = Series(np.random.randn(N)) -key1 = df['key1'] -""" - -stmt1 = "df.groupby(['key1', 'key2'])['data1'].agg(lambda x: x.values.sum())" -groupby_multi_python = Benchmark(stmt1, 
setup, - start_date=datetime(2011, 7, 1)) - -stmt3 = "df.groupby(['key1', 'key2']).sum()" -groupby_multi_cython = Benchmark(stmt3, setup, - start_date=datetime(2011, 7, 1)) - -stmt = "df.groupby(['key1', 'key2'])['data1'].agg(np.std)" -groupby_multi_series_op = Benchmark(stmt, setup, - start_date=datetime(2011, 8, 1)) - -groupby_series_simple_cython = \ - Benchmark('simple_series.groupby(key1).sum()', setup, - start_date=datetime(2011, 3, 1)) - - -stmt4 = "df.groupby('key1').rank(pct=True)" -groupby_series_simple_cython = Benchmark(stmt4, setup, - start_date=datetime(2014, 1, 16)) - -#---------------------------------------------------------------------- -# 2d grouping, aggregate many columns - -setup = common_setup + """ -labels = np.random.randint(0, 100, size=1000) -df = DataFrame(randn(1000, 1000)) -""" - -groupby_frame_cython_many_columns = Benchmark( - 'df.groupby(labels).sum()', setup, - start_date=datetime(2011, 8, 1), - logy=True) - -#---------------------------------------------------------------------- -# single key, long, integer key - -setup = common_setup + """ -data = np.random.randn(100000, 1) -labels = np.random.randint(0, 1000, size=100000) -df = DataFrame(data) -""" - -groupby_frame_singlekey_integer = \ - Benchmark('df.groupby(labels).sum()', setup, - start_date=datetime(2011, 8, 1), logy=True) - -#---------------------------------------------------------------------- -# group with different functions per column - -setup = common_setup + """ -fac1 = np.array(['A', 'B', 'C'], dtype='O') -fac2 = np.array(['one', 'two'], dtype='O') - -df = DataFrame({'key1': fac1.take(np.random.randint(0, 3, size=100000)), - 'key2': fac2.take(np.random.randint(0, 2, size=100000)), - 'value1' : np.random.randn(100000), - 'value2' : np.random.randn(100000), - 'value3' : np.random.randn(100000)}) -""" - -groupby_multi_different_functions = \ - Benchmark("""df.groupby(['key1', 'key2']).agg({'value1' : 'mean', - 'value2' : 'var', - 'value3' : 'sum'})""", - setup, start_date=datetime(2011, 9, 1)) - -groupby_multi_different_numpy_functions = \ - Benchmark("""df.groupby(['key1', 'key2']).agg({'value1' : np.mean, - 'value2' : np.var, - 'value3' : np.sum})""", - setup, start_date=datetime(2011, 9, 1)) - -#---------------------------------------------------------------------- -# size() speed - -setup = common_setup + """ -n = 100000 -offsets = np.random.randint(n, size=n).astype('timedelta64[ns]') -dates = np.datetime64('now') + offsets -df = DataFrame({'key1': np.random.randint(0, 500, size=n), - 'key2': np.random.randint(0, 100, size=n), - 'value1' : np.random.randn(n), - 'value2' : np.random.randn(n), - 'value3' : np.random.randn(n), - 'dates' : dates}) -""" - -groupby_multi_size = Benchmark("df.groupby(['key1', 'key2']).size()", - setup, start_date=datetime(2011, 10, 1)) - -groupby_dt_size = Benchmark("df.groupby(['dates']).size()", - setup, start_date=datetime(2011, 10, 1)) - -groupby_dt_timegrouper_size = Benchmark("df.groupby(TimeGrouper(key='dates', freq='M')).size()", - setup, start_date=datetime(2011, 10, 1)) - -#---------------------------------------------------------------------- -# count() speed - -setup = common_setup + """ -n = 10000 -offsets = np.random.randint(n, size=n).astype('timedelta64[ns]') - -dates = np.datetime64('now') + offsets -dates[np.random.rand(n) > 0.5] = np.datetime64('nat') - -offsets[np.random.rand(n) > 0.5] = np.timedelta64('nat') - -value2 = np.random.randn(n) -value2[np.random.rand(n) > 0.5] = np.nan - -obj = np.random.choice(list('ab'), size=n).astype(object) 
-obj[np.random.randn(n) > 0.5] = np.nan - -df = DataFrame({'key1': np.random.randint(0, 500, size=n), - 'key2': np.random.randint(0, 100, size=n), - 'dates': dates, - 'value2' : value2, - 'value3' : np.random.randn(n), - 'ints': np.random.randint(0, 1000, size=n), - 'obj': obj, - 'offsets': offsets}) -""" - -groupby_multi_count = Benchmark("df.groupby(['key1', 'key2']).count()", - setup, name='groupby_multi_count', - start_date=datetime(2014, 5, 5)) - -setup = common_setup + """ -n = 10000 - -df = DataFrame({'key1': randint(0, 500, size=n), - 'key2': randint(0, 100, size=n), - 'ints': randint(0, 1000, size=n), - 'ints2': randint(0, 1000, size=n)}) -""" - -groupby_int_count = Benchmark("df.groupby(['key1', 'key2']).count()", - setup, name='groupby_int_count', - start_date=datetime(2014, 5, 6)) -#---------------------------------------------------------------------- -# Series.value_counts - -setup = common_setup + """ -s = Series(np.random.randint(0, 1000, size=100000)) -""" - -series_value_counts_int64 = Benchmark('s.value_counts()', setup, - start_date=datetime(2011, 10, 21)) - -# value_counts on lots of strings - -setup = common_setup + """ -K = 1000 -N = 100000 -uniques = tm.makeStringIndex(K).values -s = Series(np.tile(uniques, N // K)) -""" - -series_value_counts_strings = Benchmark('s.value_counts()', setup, - start_date=datetime(2011, 10, 21)) - -#value_counts on float dtype - -setup = common_setup + """ -s = Series(np.random.randint(0, 1000, size=100000)).astype(float) -""" - -series_value_counts_float64 = Benchmark('s.value_counts()', setup, - start_date=datetime(2015, 8, 17)) - -#---------------------------------------------------------------------- -# pivot_table - -setup = common_setup + """ -fac1 = np.array(['A', 'B', 'C'], dtype='O') -fac2 = np.array(['one', 'two'], dtype='O') - -ind1 = np.random.randint(0, 3, size=100000) -ind2 = np.random.randint(0, 2, size=100000) - -df = DataFrame({'key1': fac1.take(ind1), -'key2': fac2.take(ind2), -'key3': fac2.take(ind2), -'value1' : np.random.randn(100000), -'value2' : np.random.randn(100000), -'value3' : np.random.randn(100000)}) -""" - -stmt = "df.pivot_table(index='key1', columns=['key2', 'key3'])" -groupby_pivot_table = Benchmark(stmt, setup, start_date=datetime(2011, 12, 15)) - - -#---------------------------------------------------------------------- -# dict return values - -setup = common_setup + """ -labels = np.arange(1000).repeat(10) -data = Series(randn(len(labels))) -f = lambda x: {'first': x.values[0], 'last': x.values[-1]} -""" - -groupby_apply_dict_return = Benchmark('data.groupby(labels).apply(f)', - setup, start_date=datetime(2011, 12, 15)) - -#---------------------------------------------------------------------- -# First / last functions - -setup = common_setup + """ -labels = np.arange(10000).repeat(10) -data = Series(randn(len(labels))) -data[::3] = np.nan -data[1::3] = np.nan -data2 = Series(randn(len(labels)),dtype='float32') -data2[::3] = np.nan -data2[1::3] = np.nan -labels = labels.take(np.random.permutation(len(labels))) -""" - -groupby_first_float64 = Benchmark('data.groupby(labels).first()', setup, - start_date=datetime(2012, 5, 1)) - -groupby_first_float32 = Benchmark('data2.groupby(labels).first()', setup, - start_date=datetime(2013, 1, 1)) - -groupby_last_float64 = Benchmark('data.groupby(labels).last()', setup, - start_date=datetime(2012, 5, 1)) - -groupby_last_float32 = Benchmark('data2.groupby(labels).last()', setup, - start_date=datetime(2013, 1, 1)) - -groupby_nth_float64_none = 
Benchmark('data.groupby(labels).nth(0)', setup, - start_date=datetime(2012, 5, 1)) -groupby_nth_float32_none = Benchmark('data2.groupby(labels).nth(0)', setup, - start_date=datetime(2013, 1, 1)) -groupby_nth_float64_any = Benchmark('data.groupby(labels).nth(0,dropna="all")', setup, - start_date=datetime(2012, 5, 1)) -groupby_nth_float32_any = Benchmark('data2.groupby(labels).nth(0,dropna="all")', setup, - start_date=datetime(2013, 1, 1)) - -# with datetimes (GH7555) -setup = common_setup + """ -df = DataFrame({'a' : date_range('1/1/2011',periods=100000,freq='s'),'b' : range(100000)}) -""" - -groupby_first_datetimes = Benchmark('df.groupby("b").first()', setup, - start_date=datetime(2013, 5, 1)) -groupby_last_datetimes = Benchmark('df.groupby("b").last()', setup, - start_date=datetime(2013, 5, 1)) -groupby_nth_datetimes_none = Benchmark('df.groupby("b").nth(0)', setup, - start_date=datetime(2013, 5, 1)) -groupby_nth_datetimes_any = Benchmark('df.groupby("b").nth(0,dropna="all")', setup, - start_date=datetime(2013, 5, 1)) - -# with object -setup = common_setup + """ -df = DataFrame({'a' : ['foo']*100000,'b' : range(100000)}) -""" - -groupby_first_object = Benchmark('df.groupby("b").first()', setup, - start_date=datetime(2013, 5, 1)) -groupby_last_object = Benchmark('df.groupby("b").last()', setup, - start_date=datetime(2013, 5, 1)) -groupby_nth_object_none = Benchmark('df.groupby("b").nth(0)', setup, - start_date=datetime(2013, 5, 1)) -groupby_nth_object_any = Benchmark('df.groupby("b").nth(0,dropna="any")', setup, - start_date=datetime(2013, 5, 1)) - -#---------------------------------------------------------------------- -# groupby_indices replacement, chop up Series - -setup = common_setup + """ -try: - rng = date_range('1/1/2000', '12/31/2005', freq='H') - year, month, day = rng.year, rng.month, rng.day -except: - rng = date_range('1/1/2000', '12/31/2000', offset=datetools.Hour()) - year = rng.map(lambda x: x.year) - month = rng.map(lambda x: x.month) - day = rng.map(lambda x: x.day) - -ts = Series(np.random.randn(len(rng)), index=rng) -""" - -groupby_indices = Benchmark('len(ts.groupby([year, month, day]))', - setup, start_date=datetime(2012, 1, 1)) - -#---------------------------------------------------------------------- -# median - -#---------------------------------------------------------------------- -# single key, long, integer key - -setup = common_setup + """ -data = np.random.randn(100000, 2) -labels = np.random.randint(0, 1000, size=100000) -df = DataFrame(data) -""" - -groupby_frame_median = \ - Benchmark('df.groupby(labels).median()', setup, - start_date=datetime(2011, 8, 1), logy=True) - - -setup = common_setup + """ -data = np.random.randn(1000000, 2) -labels = np.random.randint(0, 1000, size=1000000) -df = DataFrame(data) -""" - -groupby_simple_compress_timing = \ - Benchmark('df.groupby(labels).mean()', setup, - start_date=datetime(2011, 8, 1)) - - -#---------------------------------------------------------------------- -# DataFrame Apply overhead - -setup = common_setup + """ -N = 10000 -labels = np.random.randint(0, 2000, size=N) -labels2 = np.random.randint(0, 3, size=N) -df = DataFrame({'key': labels, -'key2': labels2, -'value1': randn(N), -'value2': ['foo', 'bar', 'baz', 'qux'] * (N / 4)}) -def f(g): - return 1 -""" - -groupby_frame_apply_overhead = Benchmark("df.groupby('key').apply(f)", setup, - start_date=datetime(2011, 10, 1)) - -groupby_frame_apply = Benchmark("df.groupby(['key', 'key2']).apply(f)", setup, - start_date=datetime(2011, 10, 1)) - - 
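The groupby_nth_* benchmarks above distinguish `nth(0)` from `first()`: `nth(0)` returns whatever row is literally first in each group, NaN included, while `nth(0, dropna='all')` skips over leading NaN rows. A small example of the semantics as they stood at the time of this diff:

```python
# nth(0) keeps a leading NaN; nth(0, dropna='all') drops it first.
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, 2.0], index=["a", "a", "b"])
print(s.groupby(level=0).nth(0))                 # a -> NaN, b -> 2.0
print(s.groupby(level=0).nth(0, dropna="all"))   # a -> 1.0, b -> 2.0
```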
-#---------------------------------------------------------------------- -# DataFrame nth - -setup = common_setup + """ -df = DataFrame(np.random.randint(1, 100, (10000, 2))) -""" - -# Not really a fair test as behaviour has changed! -groupby_frame_nth_none = Benchmark("df.groupby(0).nth(0)", setup, - start_date=datetime(2014, 3, 1)) - -groupby_series_nth_none = Benchmark("df[1].groupby(df[0]).nth(0)", setup, - start_date=datetime(2014, 3, 1)) -groupby_frame_nth_any= Benchmark("df.groupby(0).nth(0,dropna='any')", setup, - start_date=datetime(2014, 3, 1)) - -groupby_series_nth_any = Benchmark("df[1].groupby(df[0]).nth(0,dropna='any')", setup, - start_date=datetime(2014, 3, 1)) - - -#---------------------------------------------------------------------- -# Sum booleans #2692 - -setup = common_setup + """ -N = 500 -df = DataFrame({'ii':range(N),'bb':[True for x in range(N)]}) -""" - -groupby_sum_booleans = Benchmark("df.groupby('ii').sum()", setup) - - -#---------------------------------------------------------------------- -# multi-indexed group sum #9049 - -setup = common_setup + """ -N = 50 -df = DataFrame({'A': range(N) * 2, 'B': range(N*2), 'C': 1}).set_index(["A", "B"]) -""" - -groupby_sum_multiindex = Benchmark("df.groupby(level=[0, 1]).sum()", setup) - - -#---------------------------------------------------------------------- -# Transform testing - -setup = common_setup + """ -n_dates = 400 -n_securities = 250 -n_columns = 3 -share_na = 0.1 - -dates = date_range('1997-12-31', periods=n_dates, freq='B') -dates = Index(map(lambda x: x.year * 10000 + x.month * 100 + x.day, dates)) - -secid_min = int('10000000', 16) -secid_max = int('F0000000', 16) -step = (secid_max - secid_min) // (n_securities - 1) -security_ids = map(lambda x: hex(x)[2:10].upper(), range(secid_min, secid_max + 1, step)) - -data_index = MultiIndex(levels=[dates.values, security_ids], - labels=[[i for i in range(n_dates) for _ in xrange(n_securities)], range(n_securities) * n_dates], - names=['date', 'security_id']) -n_data = len(data_index) - -columns = Index(['factor{}'.format(i) for i in range(1, n_columns + 1)]) - -data = DataFrame(np.random.randn(n_data, n_columns), index=data_index, columns=columns) - -step = int(n_data * share_na) -for column_index in range(n_columns): - index = column_index - while index < n_data: - data.set_value(data_index[index], columns[column_index], np.nan) - index += step - -f_fillna = lambda x: x.fillna(method='pad') -""" - -groupby_transform = Benchmark("data.groupby(level='security_id').transform(f_fillna)", setup) -groupby_transform_ufunc = Benchmark("data.groupby(level='date').transform(np.max)", setup) - -setup = common_setup + """ -np.random.seed(0) - -N = 120000 -N_TRANSITIONS = 1400 - -# generate groups -transition_points = np.random.permutation(np.arange(N))[:N_TRANSITIONS] -transition_points.sort() -transitions = np.zeros((N,), dtype=np.bool) -transitions[transition_points] = True -g = transitions.cumsum() - -df = DataFrame({ 'signal' : np.random.rand(N)}) -""" -groupby_transform_series = Benchmark("df['signal'].groupby(g).transform(np.mean)", setup) - -setup = common_setup + """ -np.random.seed(0) - -df=DataFrame( { 'id' : np.arange( 100000 ) / 3, - 'val': np.random.randn( 100000) } ) -""" - -groupby_transform_series2 = Benchmark("df.groupby('id')['val'].transform(np.mean)", setup) - -setup = common_setup + ''' -np.random.seed(2718281) -n = 20000 -df = DataFrame(np.random.randint(1, n, (n, 3)), - columns=['jim', 'joe', 'jolie']) -''' - -stmt = "df.groupby(['jim', 
'joe'])['jolie'].transform('max')"; -groupby_transform_multi_key1 = Benchmark(stmt, setup) -groupby_transform_multi_key2 = Benchmark(stmt, setup + "df['jim'] = df['joe']") - -setup = common_setup + ''' -np.random.seed(2718281) -n = 200000 -df = DataFrame(np.random.randint(1, n / 10, (n, 3)), - columns=['jim', 'joe', 'jolie']) -''' -groupby_transform_multi_key3 = Benchmark(stmt, setup) -groupby_transform_multi_key4 = Benchmark(stmt, setup + "df['jim'] = df['joe']") - -setup = common_setup + ''' -np.random.seed(27182) -n = 100000 -df = DataFrame(np.random.randint(1, n / 100, (n, 3)), - columns=['jim', 'joe', 'jolie']) -''' - -groupby_agg_builtins1 = Benchmark("df.groupby('jim').agg([sum, min, max])", setup) -groupby_agg_builtins2 = Benchmark("df.groupby(['jim', 'joe']).agg([sum, min, max])", setup) - - -setup = common_setup + ''' -arr = np.random.randint(- 1 << 12, 1 << 12, (1 << 17, 5)) -i = np.random.choice(len(arr), len(arr) * 5) -arr = np.vstack((arr, arr[i])) # add sume duplicate rows - -i = np.random.permutation(len(arr)) -arr = arr[i] # shuffle rows - -df = DataFrame(arr, columns=list('abcde')) -df['jim'], df['joe'] = np.random.randn(2, len(df)) * 10 -''' - -groupby_int64_overflow = Benchmark("df.groupby(list('abcde')).max()", setup, - name='groupby_int64_overflow') - - -setup = common_setup + ''' -from itertools import product -from string import ascii_letters, digits - -n = 5 * 7 * 11 * (1 << 9) -alpha = list(map(''.join, product(ascii_letters + digits, repeat=4))) -f = lambda k: np.repeat(np.random.choice(alpha, n // k), k) - -df = DataFrame({'a': f(11), 'b': f(7), 'c': f(5), 'd': f(1)}) -df['joe'] = (np.random.randn(len(df)) * 10).round(3) - -i = np.random.permutation(len(df)) -df = df.iloc[i].reset_index(drop=True).copy() -''' - -groupby_multi_index = Benchmark("df.groupby(list('abcd')).max()", setup, - name='groupby_multi_index') - -#---------------------------------------------------------------------- -# groupby with a variable value for ngroups - - -ngroups_list = [100, 10000] -no_arg_func_list = [ - 'all', - 'any', - 'count', - 'cumcount', - 'cummax', - 'cummin', - 'cumprod', - 'cumsum', - 'describe', - 'diff', - 'first', - 'head', - 'last', - 'mad', - 'max', - 'mean', - 'median', - 'min', - 'nunique', - 'pct_change', - 'prod', - 'rank', - 'sem', - 'size', - 'skew', - 'std', - 'sum', - 'tail', - 'unique', - 'var', - 'value_counts', -] - - -_stmt_template = "df.groupby('value')['timestamp'].%s" -_setup_template = common_setup + """ -np.random.seed(1234) -ngroups = %s -size = ngroups * 2 -rng = np.arange(ngroups) -df = DataFrame(dict( - timestamp=rng.take(np.random.randint(0, ngroups, size=size)), - value=np.random.randint(0, size, size=size) -)) -""" -START_DATE = datetime(2011, 7, 1) - - -def make_large_ngroups_bmark(ngroups, func_name, func_args=''): - bmark_name = 'groupby_ngroups_%s_%s' % (ngroups, func_name) - stmt = _stmt_template % ('%s(%s)' % (func_name, func_args)) - setup = _setup_template % ngroups - bmark = Benchmark(stmt, setup, start_date=START_DATE) - # MUST set name - bmark.name = bmark_name - return bmark - - -def inject_bmark_into_globals(bmark): - if not bmark.name: - raise AssertionError('benchmark must have a name') - globals()[bmark.name] = bmark - - -for ngroups in ngroups_list: - for func_name in no_arg_func_list: - bmark = make_large_ngroups_bmark(ngroups, func_name) - inject_bmark_into_globals(bmark) - -# avoid bmark to be collected as Benchmark object -del bmark diff --git a/vb_suite/hdfstore_bench.py b/vb_suite/hdfstore_bench.py deleted file mode 
100644 index 393fd4cc77e66..0000000000000 --- a/vb_suite/hdfstore_bench.py +++ /dev/null @@ -1,278 +0,0 @@ -from vbench.api import Benchmark -from datetime import datetime - -start_date = datetime(2012, 7, 1) - -common_setup = """from .pandas_vb_common import * -import os - -f = '__test__.h5' -def remove(f): - try: - os.remove(f) - except: - pass - -""" - -#---------------------------------------------------------------------- -# get from a store - -setup1 = common_setup + """ -index = tm.makeStringIndex(25000) -df = DataFrame({'float1' : randn(25000), - 'float2' : randn(25000)}, - index=index) -remove(f) -store = HDFStore(f) -store.put('df1',df) -""" - -read_store = Benchmark("store.get('df1')", setup1, cleanup="store.close()", - start_date=start_date) - - -#---------------------------------------------------------------------- -# write to a store - -setup2 = common_setup + """ -index = tm.makeStringIndex(25000) -df = DataFrame({'float1' : randn(25000), - 'float2' : randn(25000)}, - index=index) -remove(f) -store = HDFStore(f) -""" - -write_store = Benchmark( - "store.put('df2',df)", setup2, cleanup="store.close()", - start_date=start_date) - -#---------------------------------------------------------------------- -# get from a store (mixed) - -setup3 = common_setup + """ -index = tm.makeStringIndex(25000) -df = DataFrame({'float1' : randn(25000), - 'float2' : randn(25000), - 'string1' : ['foo'] * 25000, - 'bool1' : [True] * 25000, - 'int1' : np.random.randint(0, 250000, size=25000)}, - index=index) -remove(f) -store = HDFStore(f) -store.put('df3',df) -""" - -read_store_mixed = Benchmark( - "store.get('df3')", setup3, cleanup="store.close()", - start_date=start_date) - - -#---------------------------------------------------------------------- -# write to a store (mixed) - -setup4 = common_setup + """ -index = tm.makeStringIndex(25000) -df = DataFrame({'float1' : randn(25000), - 'float2' : randn(25000), - 'string1' : ['foo'] * 25000, - 'bool1' : [True] * 25000, - 'int1' : np.random.randint(0, 250000, size=25000)}, - index=index) -remove(f) -store = HDFStore(f) -""" - -write_store_mixed = Benchmark( - "store.put('df4',df)", setup4, cleanup="store.close()", - start_date=start_date) - -#---------------------------------------------------------------------- -# get from a table (mixed) - -setup5 = common_setup + """ -N=10000 -index = tm.makeStringIndex(N) -df = DataFrame({'float1' : randn(N), - 'float2' : randn(N), - 'string1' : ['foo'] * N, - 'bool1' : [True] * N, - 'int1' : np.random.randint(0, N, size=N)}, - index=index) - -remove(f) -store = HDFStore(f) -store.append('df5',df) -""" - -read_store_table_mixed = Benchmark( - "store.select('df5')", setup5, cleanup="store.close()", - start_date=start_date) - - -#---------------------------------------------------------------------- -# write to a table (mixed) - -setup6 = common_setup + """ -index = tm.makeStringIndex(25000) -df = DataFrame({'float1' : randn(25000), - 'float2' : randn(25000), - 'string1' : ['foo'] * 25000, - 'bool1' : [True] * 25000, - 'int1' : np.random.randint(0, 25000, size=25000)}, - index=index) -remove(f) -store = HDFStore(f) -""" - -write_store_table_mixed = Benchmark( - "store.append('df6',df)", setup6, cleanup="store.close()", - start_date=start_date) - -#---------------------------------------------------------------------- -# select from a table - -setup7 = common_setup + """ -index = tm.makeStringIndex(25000) -df = DataFrame({'float1' : randn(25000), - 'float2' : randn(25000) }, - index=index) - -remove(f) -store = 
HDFStore(f) -store.append('df7',df) -""" - -read_store_table = Benchmark( - "store.select('df7')", setup7, cleanup="store.close()", - start_date=start_date) - - -#---------------------------------------------------------------------- -# write to a table - -setup8 = common_setup + """ -index = tm.makeStringIndex(25000) -df = DataFrame({'float1' : randn(25000), - 'float2' : randn(25000) }, - index=index) -remove(f) -store = HDFStore(f) -""" - -write_store_table = Benchmark( - "store.append('df8',df)", setup8, cleanup="store.close()", - start_date=start_date) - -#---------------------------------------------------------------------- -# get from a table (wide) - -setup9 = common_setup + """ -df = DataFrame(np.random.randn(25000,100)) - -remove(f) -store = HDFStore(f) -store.append('df9',df) -""" - -read_store_table_wide = Benchmark( - "store.select('df9')", setup9, cleanup="store.close()", - start_date=start_date) - - -#---------------------------------------------------------------------- -# write to a table (wide) - -setup10 = common_setup + """ -df = DataFrame(np.random.randn(25000,100)) - -remove(f) -store = HDFStore(f) -""" - -write_store_table_wide = Benchmark( - "store.append('df10',df)", setup10, cleanup="store.close()", - start_date=start_date) - -#---------------------------------------------------------------------- -# get from a table (wide) - -setup11 = common_setup + """ -index = date_range('1/1/2000', periods = 25000) -df = DataFrame(np.random.randn(25000,100), index = index) - -remove(f) -store = HDFStore(f) -store.append('df11',df) -""" - -query_store_table_wide = Benchmark( - "store.select('df11', [ ('index', '>', df.index[10000]), ('index', '<', df.index[15000]) ])", setup11, cleanup="store.close()", - start_date=start_date) - - -#---------------------------------------------------------------------- -# query from a table - -setup12 = common_setup + """ -index = date_range('1/1/2000', periods = 25000) -df = DataFrame({'float1' : randn(25000), - 'float2' : randn(25000) }, - index=index) - -remove(f) -store = HDFStore(f) -store.append('df12',df) -""" - -query_store_table = Benchmark( - "store.select('df12', [ ('index', '>', df.index[10000]), ('index', '<', df.index[15000]) ])", setup12, cleanup="store.close()", - start_date=start_date) - -#---------------------------------------------------------------------- -# select from a panel table - -setup13 = common_setup + """ -p = Panel(randn(20, 1000, 25), items= [ 'Item%03d' % i for i in range(20) ], - major_axis=date_range('1/1/2000', periods=1000), minor_axis = [ 'E%03d' % i for i in range(25) ]) - -remove(f) -store = HDFStore(f) -store.append('p1',p) -""" - -read_store_table_panel = Benchmark( - "store.select('p1')", setup13, cleanup="store.close()", - start_date=start_date) - - -#---------------------------------------------------------------------- -# write to a panel table - -setup14 = common_setup + """ -p = Panel(randn(20, 1000, 25), items= [ 'Item%03d' % i for i in range(20) ], - major_axis=date_range('1/1/2000', periods=1000), minor_axis = [ 'E%03d' % i for i in range(25) ]) - -remove(f) -store = HDFStore(f) -""" - -write_store_table_panel = Benchmark( - "store.append('p2',p)", setup14, cleanup="store.close()", - start_date=start_date) - -#---------------------------------------------------------------------- -# write to a table (data_columns) - -setup15 = common_setup + """ -df = DataFrame(np.random.randn(10000,10),columns = [ 'C%03d' % i for i in range(10) ]) - -remove(f) -store = HDFStore(f) -""" - 
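The hdfstore_bench cases in this deleted module time the two on-disk layouts HDFStore offers: `put()` writes a fixed, write-once format, while `append()` writes a queryable table that `select()` can filter. A minimal sketch (requires PyTables; the file name is illustrative):

```python
# Fixed format vs. queryable table format in HDFStore.
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.randn(1000)},
                  index=pd.date_range("2000-01-01", periods=1000))
with pd.HDFStore("__demo__.h5") as store:
    store.put("fixed", df)                             # fast, not queryable
    store.append("table", df)                          # indexed table format
    hits = store.select("table", "index > '2000-06-01'")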
-write_store_table_dc = Benchmark( - "store.append('df15',df,data_columns=True)", setup15, cleanup="store.close()", - start_date=start_date) - diff --git a/vb_suite/index_object.py b/vb_suite/index_object.py deleted file mode 100644 index 2ab2bc15f3853..0000000000000 --- a/vb_suite/index_object.py +++ /dev/null @@ -1,173 +0,0 @@ -from vbench.benchmark import Benchmark -from datetime import datetime - -SECTION = "Index / MultiIndex objects" - - -common_setup = """from .pandas_vb_common import * -""" - -#---------------------------------------------------------------------- -# intersection, union - -setup = common_setup + """ -rng = DatetimeIndex(start='1/1/2000', periods=10000, freq=datetools.Minute()) -if rng.dtype == object: - rng = rng.view(Index) -else: - rng = rng.asobject -rng2 = rng[:-1] -""" - -index_datetime_intersection = Benchmark("rng.intersection(rng2)", setup) -index_datetime_union = Benchmark("rng.union(rng2)", setup) - -setup = common_setup + """ -rng = date_range('1/1/2000', periods=10000, freq='T') -rng2 = rng[:-1] -""" - -datetime_index_intersection = Benchmark("rng.intersection(rng2)", setup, - start_date=datetime(2013, 9, 27)) -datetime_index_union = Benchmark("rng.union(rng2)", setup, - start_date=datetime(2013, 9, 27)) - -# integers -setup = common_setup + """ -N = 1000000 -options = np.arange(N) - -left = Index(options.take(np.random.permutation(N)[:N // 2])) -right = Index(options.take(np.random.permutation(N)[:N // 2])) -""" - -index_int64_union = Benchmark('left.union(right)', setup, - start_date=datetime(2011, 1, 1)) - -index_int64_intersection = Benchmark('left.intersection(right)', setup, - start_date=datetime(2011, 1, 1)) - -#---------------------------------------------------------------------- -# string index slicing -setup = common_setup + """ -idx = tm.makeStringIndex(1000000) - -mask = np.arange(1000000) % 3 == 0 -series_mask = Series(mask) -""" -index_str_slice_indexer_basic = Benchmark('idx[:-1]', setup) -index_str_slice_indexer_even = Benchmark('idx[::2]', setup) -index_str_boolean_indexer = Benchmark('idx[mask]', setup) -index_str_boolean_series_indexer = Benchmark('idx[series_mask]', setup) - -#---------------------------------------------------------------------- -# float64 index -#---------------------------------------------------------------------- -# construction -setup = common_setup + """ -baseidx = np.arange(1e6) -""" - -index_float64_construct = Benchmark('Index(baseidx)', setup, - name='index_float64_construct', - start_date=datetime(2014, 4, 13)) - -setup = common_setup + """ -idx = tm.makeFloatIndex(1000000) - -mask = np.arange(idx.size) % 3 == 0 -series_mask = Series(mask) -""" -#---------------------------------------------------------------------- -# getting -index_float64_get = Benchmark('idx[1]', setup, name='index_float64_get', - start_date=datetime(2014, 4, 13)) - - -#---------------------------------------------------------------------- -# slicing -index_float64_slice_indexer_basic = Benchmark('idx[:-1]', setup, - name='index_float64_slice_indexer_basic', - start_date=datetime(2014, 4, 13)) -index_float64_slice_indexer_even = Benchmark('idx[::2]', setup, - name='index_float64_slice_indexer_even', - start_date=datetime(2014, 4, 13)) -index_float64_boolean_indexer = Benchmark('idx[mask]', setup, - name='index_float64_boolean_indexer', - start_date=datetime(2014, 4, 13)) -index_float64_boolean_series_indexer = Benchmark('idx[series_mask]', setup, - name='index_float64_boolean_series_indexer', - start_date=datetime(2014, 4, 13)) - 
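Each of the float64-index benchmarks above pairs one shared setup string with a single timed statement. For comparison, the same measurements written as a class-based benchmark in the asv style (a `setup()` method plus `time_*` methods) might look like the sketch below; the class name and module placement are assumptions, not part of this suite.

```python
# A sketch of the float64-index measurements as an asv-style class,
# assuming asv's convention of setup() plus time_* methods. The class
# name is illustrative only; the bodies mirror the setup string above.
import numpy as np
import pandas as pd
import pandas.util.testing as tm


class IndexFloat64Indexing(object):

    def setup(self):
        # same objects the vbench setup string builds
        self.idx = tm.makeFloatIndex(1000000)
        self.mask = np.arange(self.idx.size) % 3 == 0
        self.series_mask = pd.Series(self.mask)

    def time_slice_indexer_basic(self):
        self.idx[:-1]

    def time_slice_indexer_even(self):
        self.idx[::2]

    def time_boolean_indexer(self):
        self.idx[self.mask]

    def time_boolean_series_indexer(self):
        self.idx[self.series_mask]
```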
-#---------------------------------------------------------------------- -# arith ops -index_float64_mul = Benchmark('idx * 2', setup, name='index_float64_mul', - start_date=datetime(2014, 4, 13)) -index_float64_div = Benchmark('idx / 2', setup, name='index_float64_div', - start_date=datetime(2014, 4, 13)) - - -# Constructing MultiIndex from cartesian product of iterables -# - -setup = common_setup + """ -iterables = [tm.makeStringIndex(10000), range(20)] -""" - -multiindex_from_product = Benchmark('MultiIndex.from_product(iterables)', - setup, name='multiindex_from_product', - start_date=datetime(2014, 6, 30)) - -#---------------------------------------------------------------------- -# MultiIndex with DatetimeIndex level - -setup = common_setup + """ -level1 = range(1000) -level2 = date_range(start='1/1/2012', periods=100) -mi = MultiIndex.from_product([level1, level2]) -""" - -multiindex_with_datetime_level_full = \ - Benchmark("mi.copy().values", setup, - name='multiindex_with_datetime_level_full', - start_date=datetime(2014, 10, 11)) - - -multiindex_with_datetime_level_sliced = \ - Benchmark("mi[:10].values", setup, - name='multiindex_with_datetime_level_sliced', - start_date=datetime(2014, 10, 11)) - -# multi-index duplicated -setup = common_setup + """ -n, k = 200, 5000 -levels = [np.arange(n), tm.makeStringIndex(n).values, 1000 + np.arange(n)] -labels = [np.random.choice(n, k * n) for lev in levels] -mi = MultiIndex(levels=levels, labels=labels) -""" - -multiindex_duplicated = Benchmark('mi.duplicated()', setup, - name='multiindex_duplicated') - -#---------------------------------------------------------------------- -# repr - -setup = common_setup + """ -dr = pd.date_range('20000101', freq='D', periods=100000) -""" - -datetime_index_repr = \ - Benchmark("dr._is_dates_only", setup, - start_date=datetime(2012, 1, 11)) - -setup = common_setup + """ -n = 3 * 5 * 7 * 11 * (1 << 10) -low, high = - 1 << 12, 1 << 12 -f = lambda k: np.repeat(np.random.randint(low, high, n // k), k) - -i = np.random.permutation(n) -mi = MultiIndex.from_arrays([f(11), f(7), f(5), f(3), f(1)])[i] -""" - -multiindex_sortlevel_int64 = Benchmark('mi.sortlevel()', setup, - name='multiindex_sortlevel_int64') diff --git a/vb_suite/indexing.py b/vb_suite/indexing.py deleted file mode 100644 index 3d95d52dccd71..0000000000000 --- a/vb_suite/indexing.py +++ /dev/null @@ -1,292 +0,0 @@ -from vbench.benchmark import Benchmark -from datetime import datetime - -SECTION = 'Indexing and scalar value access' - -common_setup = """from .pandas_vb_common import * -""" - -#---------------------------------------------------------------------- -# Series.__getitem__, get_value, __getitem__(slice) - -setup = common_setup + """ -tm.N = 1000 -ts = tm.makeTimeSeries() -dt = ts.index[500] -""" -statement = "ts[dt]" -bm_getitem = Benchmark(statement, setup, ncalls=100000, - name='time_series_getitem_scalar') - -setup = common_setup + """ -index = tm.makeStringIndex(1000) -s = Series(np.random.rand(1000), index=index) -idx = index[100] -""" -statement = "s.get_value(idx)" -bm_get_value = Benchmark(statement, setup, - name='series_get_value', - start_date=datetime(2011, 11, 12)) - - -setup = common_setup + """ -index = tm.makeStringIndex(1000000) -s = Series(np.random.rand(1000000), index=index) -""" -series_getitem_pos_slice = Benchmark("s[:800000]", setup, - name="series_getitem_pos_slice") - - -setup = common_setup + """ -index = tm.makeStringIndex(1000000) -s = Series(np.random.rand(1000000), index=index) -lbl = s.index[800000] -""" 
-series_getitem_label_slice = Benchmark("s[:lbl]", setup, - name="series_getitem_label_slice") - - -#---------------------------------------------------------------------- -# DataFrame __getitem__ - -setup = common_setup + """ -index = tm.makeStringIndex(1000) -columns = tm.makeStringIndex(30) -df = DataFrame(np.random.rand(1000, 30), index=index, - columns=columns) -idx = index[100] -col = columns[10] -""" -statement = "df[col][idx]" -bm_df_getitem = Benchmark(statement, setup, - name='dataframe_getitem_scalar') - -setup = common_setup + """ -try: - klass = DataMatrix -except: - klass = DataFrame - -index = tm.makeStringIndex(1000) -columns = tm.makeStringIndex(30) -df = klass(np.random.rand(1000, 30), index=index, columns=columns) -idx = index[100] -col = columns[10] -""" -statement = "df[col][idx]" -bm_df_getitem2 = Benchmark(statement, setup, - name='datamatrix_getitem_scalar') - - -#---------------------------------------------------------------------- -# ix get scalar - -setup = common_setup + """ -index = tm.makeStringIndex(1000) -columns = tm.makeStringIndex(30) -df = DataFrame(np.random.randn(1000, 30), index=index, columns=columns) -idx = index[100] -col = columns[10] -""" - -indexing_frame_get_value_ix = Benchmark("df.ix[idx,col]", setup, - name='indexing_frame_get_value_ix', - start_date=datetime(2011, 11, 12)) - -indexing_frame_get_value = Benchmark("df.get_value(idx,col)", setup, - name='indexing_frame_get_value', - start_date=datetime(2011, 11, 12)) - -setup = common_setup + """ -mi = MultiIndex.from_tuples([(x,y) for x in range(1000) for y in range(1000)]) -s = Series(np.random.randn(1000000), index=mi) -""" - -series_xs_mi_ix = Benchmark("s.ix[999]", setup, - name='series_xs_mi_ix', - start_date=datetime(2013, 1, 1)) - -setup = common_setup + """ -mi = MultiIndex.from_tuples([(x,y) for x in range(1000) for y in range(1000)]) -s = Series(np.random.randn(1000000), index=mi) -df = DataFrame(s) -""" - -frame_xs_mi_ix = Benchmark("df.ix[999]", setup, - name='frame_xs_mi_ix', - start_date=datetime(2013, 1, 1)) - -#---------------------------------------------------------------------- -# Boolean DataFrame row selection - -setup = common_setup + """ -df = DataFrame(np.random.randn(10000, 4), columns=['A', 'B', 'C', 'D']) -indexer = df['B'] > 0 -obj_indexer = indexer.astype('O') -""" -indexing_dataframe_boolean_rows = \ - Benchmark("df[indexer]", setup, name='indexing_dataframe_boolean_rows') - -indexing_dataframe_boolean_rows_object = \ - Benchmark("df[obj_indexer]", setup, - name='indexing_dataframe_boolean_rows_object') - -setup = common_setup + """ -df = DataFrame(np.random.randn(50000, 100)) -df2 = DataFrame(np.random.randn(50000, 100)) -""" -indexing_dataframe_boolean = \ - Benchmark("df > df2", setup, name='indexing_dataframe_boolean', - start_date=datetime(2012, 1, 1)) - -setup = common_setup + """ -try: - import pandas.computation.expressions as expr -except: - expr = None - -if expr is None: - raise NotImplementedError -df = DataFrame(np.random.randn(50000, 100)) -df2 = DataFrame(np.random.randn(50000, 100)) -expr.set_numexpr_threads(1) -""" - -indexing_dataframe_boolean_st = \ - Benchmark("df > df2", setup, name='indexing_dataframe_boolean_st',cleanup="expr.set_numexpr_threads()", - start_date=datetime(2013, 2, 26)) - - -setup = common_setup + """ -try: - import pandas.computation.expressions as expr -except: - expr = None - -if expr is None: - raise NotImplementedError -df = DataFrame(np.random.randn(50000, 100)) -df2 = DataFrame(np.random.randn(50000, 100)) 
-expr.set_use_numexpr(False) -""" - -indexing_dataframe_boolean_no_ne = \ - Benchmark("df > df2", setup, name='indexing_dataframe_boolean_no_ne',cleanup="expr.set_use_numexpr(True)", - start_date=datetime(2013, 2, 26)) -#---------------------------------------------------------------------- -# MultiIndex sortlevel - -setup = common_setup + """ -a = np.repeat(np.arange(100), 1000) -b = np.tile(np.arange(1000), 100) -midx = MultiIndex.from_arrays([a, b]) -midx = midx.take(np.random.permutation(np.arange(100000))) -""" -sort_level_zero = Benchmark("midx.sortlevel(0)", setup, - start_date=datetime(2012, 1, 1)) -sort_level_one = Benchmark("midx.sortlevel(1)", setup, - start_date=datetime(2012, 1, 1)) - -#---------------------------------------------------------------------- -# Panel subset selection - -setup = common_setup + """ -p = Panel(np.random.randn(100, 100, 100)) -inds = range(0, 100, 10) -""" - -indexing_panel_subset = Benchmark('p.ix[inds, inds, inds]', setup, - start_date=datetime(2012, 1, 1)) - -#---------------------------------------------------------------------- -# Iloc - -setup = common_setup + """ -df = DataFrame({'A' : [0.1] * 3000, 'B' : [1] * 3000}) -idx = np.array(range(30)) * 99 -df2 = DataFrame({'A' : [0.1] * 1000, 'B' : [1] * 1000}) -df2 = concat([df2, 2*df2, 3*df2]) -""" - -frame_iloc_dups = Benchmark('df2.iloc[idx]', setup, - start_date=datetime(2013, 1, 1)) - -frame_loc_dups = Benchmark('df2.loc[idx]', setup, - start_date=datetime(2013, 1, 1)) - -setup = common_setup + """ -df = DataFrame(dict( A = [ 'foo'] * 1000000)) -""" - -frame_iloc_big = Benchmark('df.iloc[:100,0]', setup, - start_date=datetime(2013, 1, 1)) - -#---------------------------------------------------------------------- -# basic tests for [], .loc[], .iloc[] and .ix[] - -setup = common_setup + """ -s = Series(np.random.rand(1000000)) -""" - -series_getitem_scalar = Benchmark("s[800000]", setup) -series_getitem_slice = Benchmark("s[:800000]", setup) -series_getitem_list_like = Benchmark("s[[800000]]", setup) -series_getitem_array = Benchmark("s[np.arange(10000)]", setup) - -series_loc_scalar = Benchmark("s.loc[800000]", setup) -series_loc_slice = Benchmark("s.loc[:800000]", setup) -series_loc_list_like = Benchmark("s.loc[[800000]]", setup) -series_loc_array = Benchmark("s.loc[np.arange(10000)]", setup) - -series_iloc_scalar = Benchmark("s.iloc[800000]", setup) -series_iloc_slice = Benchmark("s.iloc[:800000]", setup) -series_iloc_list_like = Benchmark("s.iloc[[800000]]", setup) -series_iloc_array = Benchmark("s.iloc[np.arange(10000)]", setup) - -series_ix_scalar = Benchmark("s.ix[800000]", setup) -series_ix_slice = Benchmark("s.ix[:800000]", setup) -series_ix_list_like = Benchmark("s.ix[[800000]]", setup) -series_ix_array = Benchmark("s.ix[np.arange(10000)]", setup) - - -# multi-index slicing -setup = common_setup + """ -np.random.seed(1234) -idx=pd.IndexSlice -n=100000 -mdt = pandas.DataFrame() -mdt['A'] = np.random.choice(range(10000,45000,1000), n) -mdt['B'] = np.random.choice(range(10,400), n) -mdt['C'] = np.random.choice(range(1,150), n) -mdt['D'] = np.random.choice(range(10000,45000), n) -mdt['x'] = np.random.choice(range(400), n) -mdt['y'] = np.random.choice(range(25), n) - - -test_A = 25000 -test_B = 25 -test_C = 40 -test_D = 35000 - -eps_A = 5000 -eps_B = 5 -eps_C = 5 -eps_D = 5000 -mdt2 = mdt.set_index(['A','B','C','D']).sortlevel() -""" - -multiindex_slicers = Benchmark('mdt2.loc[idx[test_A-eps_A:test_A+eps_A,test_B-eps_B:test_B+eps_B,test_C-eps_C:test_C+eps_C,test_D-eps_D:test_D+eps_D],:]', 
setup, - start_date=datetime(2015, 1, 1)) - -#---------------------------------------------------------------------- -# take - -setup = common_setup + """ -s = Series(np.random.rand(100000)) -ts = Series(np.random.rand(100000), - index=date_range('2011-01-01', freq='S', periods=100000)) -indexer = [True, False, True, True, False] * 20000 -""" - -series_take_intindex = Benchmark("s.take(indexer)", setup) -series_take_dtindex = Benchmark("ts.take(indexer)", setup) diff --git a/vb_suite/inference.py b/vb_suite/inference.py deleted file mode 100644 index aaa51aa5163ce..0000000000000 --- a/vb_suite/inference.py +++ /dev/null @@ -1,36 +0,0 @@ -from vbench.api import Benchmark -from datetime import datetime -import sys - -# from GH 7332 - -setup = """from .pandas_vb_common import * -import pandas as pd -N = 500000 -df_int64 = DataFrame(dict(A = np.arange(N,dtype='int64'), B = np.arange(N,dtype='int64'))) -df_int32 = DataFrame(dict(A = np.arange(N,dtype='int32'), B = np.arange(N,dtype='int32'))) -df_uint32 = DataFrame(dict(A = np.arange(N,dtype='uint32'), B = np.arange(N,dtype='uint32'))) -df_float64 = DataFrame(dict(A = np.arange(N,dtype='float64'), B = np.arange(N,dtype='float64'))) -df_float32 = DataFrame(dict(A = np.arange(N,dtype='float32'), B = np.arange(N,dtype='float32'))) -df_datetime64 = DataFrame(dict(A = pd.to_datetime(np.arange(N,dtype='int64'),unit='ms'), - B = pd.to_datetime(np.arange(N,dtype='int64'),unit='ms'))) -df_timedelta64 = DataFrame(dict(A = df_datetime64['A']-df_datetime64['B'], - B = df_datetime64['B'])) -""" - -dtype_infer_int64 = Benchmark('df_int64["A"] + df_int64["B"]', setup, - start_date=datetime(2014, 1, 1)) -dtype_infer_int32 = Benchmark('df_int32["A"] + df_int32["B"]', setup, - start_date=datetime(2014, 1, 1)) -dtype_infer_uint32 = Benchmark('df_uint32["A"] + df_uint32["B"]', setup, - start_date=datetime(2014, 1, 1)) -dtype_infer_float64 = Benchmark('df_float64["A"] + df_float64["B"]', setup, - start_date=datetime(2014, 1, 1)) -dtype_infer_float32 = Benchmark('df_float32["A"] + df_float32["B"]', setup, - start_date=datetime(2014, 1, 1)) -dtype_infer_datetime64 = Benchmark('df_datetime64["A"] - df_datetime64["B"]', setup, - start_date=datetime(2014, 1, 1)) -dtype_infer_timedelta64_1 = Benchmark('df_timedelta64["A"] + df_timedelta64["B"]', setup, - start_date=datetime(2014, 1, 1)) -dtype_infer_timedelta64_2 = Benchmark('df_timedelta64["A"] + df_timedelta64["A"]', setup, - start_date=datetime(2014, 1, 1)) diff --git a/vb_suite/io_bench.py b/vb_suite/io_bench.py deleted file mode 100644 index af5f6076515cc..0000000000000 --- a/vb_suite/io_bench.py +++ /dev/null @@ -1,150 +0,0 @@ -from vbench.api import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -from io import StringIO -""" - -#---------------------------------------------------------------------- -# read_csv - -setup1 = common_setup + """ -index = tm.makeStringIndex(10000) -df = DataFrame({'float1' : randn(10000), - 'float2' : randn(10000), - 'string1' : ['foo'] * 10000, - 'bool1' : [True] * 10000, - 'int1' : np.random.randint(0, 100000, size=10000)}, - index=index) -df.to_csv('__test__.csv') -""" - -read_csv_standard = Benchmark("read_csv('__test__.csv')", setup1, - start_date=datetime(2011, 9, 15)) - -#---------------------------------- -# skiprows - -setup1 = common_setup + """ -index = tm.makeStringIndex(20000) -df = DataFrame({'float1' : randn(20000), - 'float2' : randn(20000), - 'string1' : ['foo'] * 20000, - 'bool1' : [True] * 20000, - 'int1' : 
np.random.randint(0, 200000, size=20000)}, - index=index) -df.to_csv('__test__.csv') -""" - -read_csv_skiprows = Benchmark("read_csv('__test__.csv', skiprows=10000)", setup1, - start_date=datetime(2011, 9, 15)) - -#---------------------------------------------------------------------- -# write_csv - -setup2 = common_setup + """ -index = tm.makeStringIndex(10000) -df = DataFrame({'float1' : randn(10000), - 'float2' : randn(10000), - 'string1' : ['foo'] * 10000, - 'bool1' : [True] * 10000, - 'int1' : np.random.randint(0, 100000, size=10000)}, - index=index) -""" - -write_csv_standard = Benchmark("df.to_csv('__test__.csv')", setup2, - start_date=datetime(2011, 9, 15)) - -#---------------------------------- -setup = common_setup + """ -df = DataFrame(np.random.randn(3000, 30)) -""" -frame_to_csv = Benchmark("df.to_csv('__test__.csv')", setup, - start_date=datetime(2011, 1, 1)) -#---------------------------------- - -setup = common_setup + """ -df=DataFrame({'A':range(50000)}) -df['B'] = df.A + 1.0 -df['C'] = df.A + 2.0 -df['D'] = df.A + 3.0 -""" -frame_to_csv2 = Benchmark("df.to_csv('__test__.csv')", setup, - start_date=datetime(2011, 1, 1)) - -#---------------------------------- -setup = common_setup + """ -from pandas import concat, Timestamp - -def create_cols(name): - return [ "%s%03d" % (name,i) for i in range(5) ] -df_float = DataFrame(np.random.randn(5000, 5),dtype='float64',columns=create_cols('float')) -df_int = DataFrame(np.random.randn(5000, 5),dtype='int64',columns=create_cols('int')) -df_bool = DataFrame(True,index=df_float.index,columns=create_cols('bool')) -df_object = DataFrame('foo',index=df_float.index,columns=create_cols('object')) -df_dt = DataFrame(Timestamp('20010101'),index=df_float.index,columns=create_cols('date')) - -# add in some nans -df_float.ix[30:500,1:3] = np.nan - -df = concat([ df_float, df_int, df_bool, df_object, df_dt ], axis=1) - -""" -frame_to_csv_mixed = Benchmark("df.to_csv('__test__.csv')", setup, - start_date=datetime(2012, 6, 1)) - -#---------------------------------------------------------------------- -# parse dates, ISO8601 format - -setup = common_setup + """ -rng = date_range('1/1/2000', periods=1000) -data = '\\n'.join(rng.map(lambda x: x.strftime("%Y-%m-%d %H:%M:%S"))) -""" - -stmt = ("read_csv(StringIO(data), header=None, names=['foo'], " - " parse_dates=['foo'])") -read_parse_dates_iso8601 = Benchmark(stmt, setup, - start_date=datetime(2012, 3, 1)) - -setup = common_setup + """ -rng = date_range('1/1/2000', periods=1000) -data = DataFrame(rng, index=rng) -""" - -stmt = ("data.to_csv('__test__.csv', date_format='%Y%m%d')") - -frame_to_csv_date_formatting = Benchmark(stmt, setup, - start_date=datetime(2013, 9, 1)) - -#---------------------------------------------------------------------- -# infer datetime format - -setup = common_setup + """ -rng = date_range('1/1/2000', periods=1000) -data = '\\n'.join(rng.map(lambda x: x.strftime("%Y-%m-%d %H:%M:%S"))) -""" - -stmt = ("read_csv(StringIO(data), header=None, names=['foo'], " - " parse_dates=['foo'], infer_datetime_format=True)") - -read_csv_infer_datetime_format_iso8601 = Benchmark(stmt, setup) - -setup = common_setup + """ -rng = date_range('1/1/2000', periods=1000) -data = '\\n'.join(rng.map(lambda x: x.strftime("%Y%m%d"))) -""" - -stmt = ("read_csv(StringIO(data), header=None, names=['foo'], " - " parse_dates=['foo'], infer_datetime_format=True)") - -read_csv_infer_datetime_format_ymd = Benchmark(stmt, setup) - -setup = common_setup + """ -rng = date_range('1/1/2000', periods=1000) -data = 
'\\n'.join(rng.map(lambda x: x.strftime("%m/%d/%Y %H:%M:%S.%f"))) -""" - -stmt = ("read_csv(StringIO(data), header=None, names=['foo'], " - " parse_dates=['foo'], infer_datetime_format=True)") - -read_csv_infer_datetime_format_custom = Benchmark(stmt, setup) diff --git a/vb_suite/io_sql.py b/vb_suite/io_sql.py deleted file mode 100644 index ba8367e7e356b..0000000000000 --- a/vb_suite/io_sql.py +++ /dev/null @@ -1,126 +0,0 @@ -from vbench.api import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -import sqlite3 -import sqlalchemy -from sqlalchemy import create_engine - -engine = create_engine('sqlite:///:memory:') -con = sqlite3.connect(':memory:') -""" - -sdate = datetime(2014, 6, 1) - - -#------------------------------------------------------------------------------- -# to_sql - -setup = common_setup + """ -index = tm.makeStringIndex(10000) -df = DataFrame({'float1' : randn(10000), - 'float2' : randn(10000), - 'string1' : ['foo'] * 10000, - 'bool1' : [True] * 10000, - 'int1' : np.random.randint(0, 100000, size=10000)}, - index=index) -""" - -sql_write_sqlalchemy = Benchmark("df.to_sql('test1', engine, if_exists='replace')", - setup, start_date=sdate) - -sql_write_fallback = Benchmark("df.to_sql('test1', con, if_exists='replace')", - setup, start_date=sdate) - - -#------------------------------------------------------------------------------- -# read_sql - -setup = common_setup + """ -index = tm.makeStringIndex(10000) -df = DataFrame({'float1' : randn(10000), - 'float2' : randn(10000), - 'string1' : ['foo'] * 10000, - 'bool1' : [True] * 10000, - 'int1' : np.random.randint(0, 100000, size=10000)}, - index=index) -df.to_sql('test2', engine, if_exists='replace') -df.to_sql('test2', con, if_exists='replace') -""" - -sql_read_query_sqlalchemy = Benchmark("read_sql_query('SELECT * FROM test2', engine)", - setup, start_date=sdate) - -sql_read_query_fallback = Benchmark("read_sql_query('SELECT * FROM test2', con)", - setup, start_date=sdate) - -sql_read_table_sqlalchemy = Benchmark("read_sql_table('test2', engine)", - setup, start_date=sdate) - - -#------------------------------------------------------------------------------- -# type specific write - -setup = common_setup + """ -df = DataFrame({'float' : randn(10000), - 'string' : ['foo'] * 10000, - 'bool' : [True] * 10000, - 'datetime' : date_range('2000-01-01', periods=10000, freq='s')}) -df.loc[1000:3000, 'float'] = np.nan -""" - -sql_float_write_sqlalchemy = \ - Benchmark("df[['float']].to_sql('test_float', engine, if_exists='replace')", - setup, start_date=sdate) - -sql_float_write_fallback = \ - Benchmark("df[['float']].to_sql('test_float', con, if_exists='replace')", - setup, start_date=sdate) - -sql_string_write_sqlalchemy = \ - Benchmark("df[['string']].to_sql('test_string', engine, if_exists='replace')", - setup, start_date=sdate) - -sql_string_write_fallback = \ - Benchmark("df[['string']].to_sql('test_string', con, if_exists='replace')", - setup, start_date=sdate) - -sql_datetime_write_sqlalchemy = \ - Benchmark("df[['datetime']].to_sql('test_datetime', engine, if_exists='replace')", - setup, start_date=sdate) - -#sql_datetime_write_fallback = \ -# Benchmark("df[['datetime']].to_sql('test_datetime', con, if_exists='replace')", -# setup3, start_date=sdate) - -#------------------------------------------------------------------------------- -# type specific read - -setup = common_setup + """ -df = DataFrame({'float' : randn(10000), - 'datetime' : date_range('2000-01-01', periods=10000, 
freq='s')}) -df['datetime_string'] = df['datetime'].map(str) - -df.to_sql('test_type', engine, if_exists='replace') -df[['float', 'datetime_string']].to_sql('test_type', con, if_exists='replace') -""" - -sql_float_read_query_sqlalchemy = \ - Benchmark("read_sql_query('SELECT float FROM test_type', engine)", - setup, start_date=sdate) - -sql_float_read_table_sqlalchemy = \ - Benchmark("read_sql_table('test_type', engine, columns=['float'])", - setup, start_date=sdate) - -sql_float_read_query_fallback = \ - Benchmark("read_sql_query('SELECT float FROM test_type', con)", - setup, start_date=sdate) - -sql_datetime_read_as_native_sqlalchemy = \ - Benchmark("read_sql_table('test_type', engine, columns=['datetime'])", - setup, start_date=sdate) - -sql_datetime_read_and_parse_sqlalchemy = \ - Benchmark("read_sql_table('test_type', engine, columns=['datetime_string'], parse_dates=['datetime_string'])", - setup, start_date=sdate) diff --git a/vb_suite/join_merge.py b/vb_suite/join_merge.py deleted file mode 100644 index 238a129552e90..0000000000000 --- a/vb_suite/join_merge.py +++ /dev/null @@ -1,270 +0,0 @@ -from vbench.benchmark import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -""" - -setup = common_setup + """ -level1 = tm.makeStringIndex(10).values -level2 = tm.makeStringIndex(1000).values -label1 = np.arange(10).repeat(1000) -label2 = np.tile(np.arange(1000), 10) - -key1 = np.tile(level1.take(label1), 10) -key2 = np.tile(level2.take(label2), 10) - -shuf = np.arange(100000) -random.shuffle(shuf) -try: - index2 = MultiIndex(levels=[level1, level2], labels=[label1, label2]) - index3 = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], - labels=[np.arange(10).repeat(10000), - np.tile(np.arange(100).repeat(100), 10), - np.tile(np.tile(np.arange(100), 100), 10)]) - df_multi = DataFrame(np.random.randn(len(index2), 4), index=index2, - columns=['A', 'B', 'C', 'D']) -except: # pre-MultiIndex - pass - -try: - DataFrame = DataMatrix -except: - pass - -df = pd.DataFrame({'data1' : np.random.randn(100000), - 'data2' : np.random.randn(100000), - 'key1' : key1, - 'key2' : key2}) - - -df_key1 = pd.DataFrame(np.random.randn(len(level1), 4), index=level1, - columns=['A', 'B', 'C', 'D']) -df_key2 = pd.DataFrame(np.random.randn(len(level2), 4), index=level2, - columns=['A', 'B', 'C', 'D']) - -df_shuf = df.reindex(df.index[shuf]) -""" - -#---------------------------------------------------------------------- -# DataFrame joins on key - -join_dataframe_index_single_key_small = \ - Benchmark("df.join(df_key1, on='key1')", setup, - name='join_dataframe_index_single_key_small') - -join_dataframe_index_single_key_bigger = \ - Benchmark("df.join(df_key2, on='key2')", setup, - name='join_dataframe_index_single_key_bigger') - -join_dataframe_index_single_key_bigger_sort = \ - Benchmark("df_shuf.join(df_key2, on='key2', sort=True)", setup, - name='join_dataframe_index_single_key_bigger_sort', - start_date=datetime(2012, 2, 5)) - -join_dataframe_index_multi = \ - Benchmark("df.join(df_multi, on=['key1', 'key2'])", setup, - name='join_dataframe_index_multi', - start_date=datetime(2011, 10, 20)) - -#---------------------------------------------------------------------- -# Joins on integer keys -setup = common_setup + """ -df = pd.DataFrame({'key1': np.tile(np.arange(500).repeat(10), 2), - 'key2': np.tile(np.arange(250).repeat(10), 4), - 'value': np.random.randn(10000)}) -df2 = pd.DataFrame({'key1': np.arange(500), 'value2': randn(500)}) -df3 = df[:5000] -""" - 
- -join_dataframe_integer_key = Benchmark("merge(df, df2, on='key1')", setup, - start_date=datetime(2011, 10, 20)) -join_dataframe_integer_2key = Benchmark("merge(df, df3)", setup, - start_date=datetime(2011, 10, 20)) - -#---------------------------------------------------------------------- -# DataFrame joins on index - - -#---------------------------------------------------------------------- -# Merges -setup = common_setup + """ -N = 10000 - -indices = tm.makeStringIndex(N).values -indices2 = tm.makeStringIndex(N).values -key = np.tile(indices[:8000], 10) -key2 = np.tile(indices2[:8000], 10) - -left = pd.DataFrame({'key' : key, 'key2':key2, - 'value' : np.random.randn(80000)}) -right = pd.DataFrame({'key': indices[2000:], 'key2':indices2[2000:], - 'value2' : np.random.randn(8000)}) -""" - -merge_2intkey_nosort = Benchmark('merge(left, right, sort=False)', setup, - start_date=datetime(2011, 10, 20)) - -merge_2intkey_sort = Benchmark('merge(left, right, sort=True)', setup, - start_date=datetime(2011, 10, 20)) - -#---------------------------------------------------------------------- -# Appending DataFrames - -setup = common_setup + """ -df1 = pd.DataFrame(np.random.randn(10000, 4), columns=['A', 'B', 'C', 'D']) -df2 = df1.copy() -df2.index = np.arange(10000, 20000) -mdf1 = df1.copy() -mdf1['obj1'] = 'bar' -mdf1['obj2'] = 'bar' -mdf1['int1'] = 5 -try: - mdf1.consolidate(inplace=True) -except: - pass -mdf2 = mdf1.copy() -mdf2.index = df2.index -""" - -stmt = "df1.append(df2)" -append_frame_single_homogenous = \ - Benchmark(stmt, setup, name='append_frame_single_homogenous', - ncalls=500, repeat=1) - -stmt = "mdf1.append(mdf2)" -append_frame_single_mixed = Benchmark(stmt, setup, - name='append_frame_single_mixed', - ncalls=500, repeat=1) - -#---------------------------------------------------------------------- -# data alignment - -setup = common_setup + """n = 1000000 -# indices = tm.makeStringIndex(n) -def sample(values, k): - sampler = np.random.permutation(len(values)) - return values.take(sampler[:k]) -sz = 500000 -rng = np.arange(0, 10000000000000, 10000000) -stamps = np.datetime64(datetime.now()).view('i8') + rng -idx1 = np.sort(sample(stamps, sz)) -idx2 = np.sort(sample(stamps, sz)) -ts1 = Series(np.random.randn(sz), idx1) -ts2 = Series(np.random.randn(sz), idx2) -""" -stmt = "ts1 + ts2" -series_align_int64_index = \ - Benchmark(stmt, setup, - name="series_align_int64_index", - start_date=datetime(2010, 6, 1), logy=True) - -stmt = "ts1.align(ts2, join='left')" -series_align_left_monotonic = \ - Benchmark(stmt, setup, - name="series_align_left_monotonic", - start_date=datetime(2011, 12, 1), logy=True) - -#---------------------------------------------------------------------- -# Concat Series axis=1 - -setup = common_setup + """ -n = 1000 -indices = tm.makeStringIndex(1000) -s = Series(n, index=indices) -pieces = [s[i:-i] for i in range(1, 10)] -pieces = pieces * 50 -""" - -concat_series_axis1 = Benchmark('concat(pieces, axis=1)', setup, - start_date=datetime(2012, 2, 27)) - -setup = common_setup + """ -df = pd.DataFrame(randn(5, 4)) -""" - -concat_small_frames = Benchmark('concat([df] * 1000)', setup, - start_date=datetime(2012, 1, 1)) - - -#---------------------------------------------------------------------- -# Concat empty - -setup = common_setup + """ -df = pd.DataFrame(dict(A = range(10000)),index=date_range('20130101',periods=10000,freq='s')) -empty = pd.DataFrame() -""" - -concat_empty_frames1 = Benchmark('concat([df,empty])', setup, - start_date=datetime(2012, 1, 1)) 
-concat_empty_frames2 = Benchmark('concat([empty,df])', setup, - start_date=datetime(2012, 1, 1)) - - -#---------------------------------------------------------------------- -# Ordered merge - -setup = common_setup + """ -groups = tm.makeStringIndex(10).values - -left = pd.DataFrame({'group': groups.repeat(5000), - 'key' : np.tile(np.arange(0, 10000, 2), 10), - 'lvalue': np.random.randn(50000)}) - -right = pd.DataFrame({'key' : np.arange(10000), - 'rvalue' : np.random.randn(10000)}) - -""" - -stmt = "ordered_merge(left, right, on='key', left_by='group')" - -#---------------------------------------------------------------------- -# outer join of non-unique -# GH 6329 - -setup = common_setup + """ -date_index = date_range('01-Jan-2013', '23-Jan-2013', freq='T') -daily_dates = date_index.to_period('D').to_timestamp('S','S') -fracofday = date_index.view(np.ndarray) - daily_dates.view(np.ndarray) -fracofday = fracofday.astype('timedelta64[ns]').astype(np.float64)/864e11 -fracofday = TimeSeries(fracofday, daily_dates) -index = date_range(date_index.min().to_period('A').to_timestamp('D','S'), - date_index.max().to_period('A').to_timestamp('D','E'), - freq='D') -temp = TimeSeries(1.0, index) -""" - -join_non_unique_equal = Benchmark('fracofday * temp[fracofday.index]', setup, - start_date=datetime(2013, 1, 1)) - - -setup = common_setup + ''' -np.random.seed(2718281) -n = 50000 - -left = pd.DataFrame(np.random.randint(1, n/500, (n, 2)), - columns=['jim', 'joe']) - -right = pd.DataFrame(np.random.randint(1, n/500, (n, 2)), - columns=['jolie', 'jolia']).set_index('jolie') -''' - -left_outer_join_index = Benchmark("left.join(right, on='jim')", setup, - name='left_outer_join_index') - - -setup = common_setup + """ -low, high, n = -1 << 10, 1 << 10, 1 << 20 -left = pd.DataFrame(np.random.randint(low, high, (n, 7)), - columns=list('ABCDEFG')) -left['left'] = left.sum(axis=1) - -i = np.random.permutation(len(left)) -right = left.iloc[i].copy() -right.columns = right.columns[:-1].tolist() + ['right'] -right.index = np.arange(len(right)) -right['right'] *= -1 -""" - -i8merge = Benchmark("merge(left, right, how='outer')", setup, - name='i8merge') diff --git a/vb_suite/make.py b/vb_suite/make.py deleted file mode 100755 index 5a8a8215db9a4..0000000000000 --- a/vb_suite/make.py +++ /dev/null @@ -1,167 +0,0 @@ -#!/usr/bin/env python - -""" -Python script for building documentation. - -To build the docs you must have all optional dependencies for statsmodels -installed. See the installation instructions for a list of these. - -Note: currently latex builds do not work because of table formats that are not -supported in the latex generation. - -Usage ------ -python make.py clean -python make.py html -""" - -import glob -import os -import shutil -import sys -import sphinx - -os.environ['PYTHONPATH'] = '..' - -SPHINX_BUILD = 'sphinxbuild' - - -def upload(): - 'push a copy to the site' - os.system('cd build/html; rsync -avz . 
pandas@pandas.pydata.org' - ':/usr/share/nginx/pandas/pandas-docs/vbench/ -essh') - - -def clean(): - if os.path.exists('build'): - shutil.rmtree('build') - - if os.path.exists('source/generated'): - shutil.rmtree('source/generated') - - -def html(): - check_build() - if os.system('sphinx-build -P -b html -d build/doctrees ' - 'source build/html'): - raise SystemExit("Building HTML failed.") - - -def check_build(): - build_dirs = [ - 'build', 'build/doctrees', 'build/html', - 'build/plots', 'build/_static', - 'build/_templates'] - for d in build_dirs: - try: - os.mkdir(d) - except OSError: - pass - - -def all(): - clean() - html() - - -def auto_update(): - msg = '' - try: - clean() - html() - upload() - sendmail() - except (Exception, SystemExit), inst: - msg += str(inst) + '\n' - sendmail(msg) - - -def sendmail(err_msg=None): - from_name, to_name = _get_config() - - if err_msg is None: - msgstr = 'Daily vbench uploaded successfully' - subject = "VB: daily update successful" - else: - msgstr = err_msg - subject = "VB: daily update failed" - - import smtplib - from email.MIMEText import MIMEText - msg = MIMEText(msgstr) - msg['Subject'] = subject - msg['From'] = from_name - msg['To'] = to_name - - server_str, port, login, pwd = _get_credentials() - server = smtplib.SMTP(server_str, port) - server.ehlo() - server.starttls() - server.ehlo() - - server.login(login, pwd) - try: - server.sendmail(from_name, to_name, msg.as_string()) - finally: - server.close() - - -def _get_dir(subdir=None): - import getpass - USERNAME = getpass.getuser() - if sys.platform == 'darwin': - HOME = '/Users/%s' % USERNAME - else: - HOME = '/home/%s' % USERNAME - - if subdir is None: - subdir = '/code/scripts' - conf_dir = '%s%s' % (HOME, subdir) - return conf_dir - - -def _get_credentials(): - tmp_dir = _get_dir() - cred = '%s/credentials' % tmp_dir - with open(cred, 'r') as fh: - server, port, un, domain = fh.read().split(',') - port = int(port) - login = un + '@' + domain + '.com' - - import base64 - with open('%s/cron_email_pwd' % tmp_dir, 'r') as fh: - pwd = base64.b64decode(fh.read()) - - return server, port, login, pwd - - -def _get_config(): - tmp_dir = _get_dir() - with open('%s/addresses' % tmp_dir, 'r') as fh: - from_name, to_name = fh.read().split(',') - return from_name, to_name - -funcd = { - 'html': html, - 'clean': clean, - 'upload': upload, - 'auto_update': auto_update, - 'all': all, -} - -small_docs = False - -# current_dir = os.getcwd() -# os.chdir(os.path.dirname(os.path.join(current_dir, __file__))) - -if len(sys.argv) > 1: - for arg in sys.argv[1:]: - func = funcd.get(arg) - if func is None: - raise SystemExit('Do not know how to handle %s; valid args are %s' % ( - arg, funcd.keys())) - func() -else: - small_docs = False - all() -# os.chdir(current_dir) diff --git a/vb_suite/measure_memory_consumption.py b/vb_suite/measure_memory_consumption.py deleted file mode 100755 index bb73cf5da4302..0000000000000 --- a/vb_suite/measure_memory_consumption.py +++ /dev/null @@ -1,55 +0,0 @@ -#!/usr/bin/env python -# -*- coding: utf-8 -*- - -from __future__ import print_function - -"""Short one-line summary - -long summary -""" - - -def main(): - import shutil - import tempfile - import warnings - - from pandas import Series - - from vbench.api import BenchmarkRunner - from suite import (REPO_PATH, BUILD, DB_PATH, PREPARE, - dependencies, benchmarks) - - from memory_profiler import memory_usage - - warnings.filterwarnings('ignore', category=FutureWarning) - - try: - TMP_DIR = tempfile.mkdtemp() - runner = 
BenchmarkRunner( - benchmarks, REPO_PATH, REPO_PATH, BUILD, DB_PATH, - TMP_DIR, PREPARE, always_clean=True, - # run_option='eod', start_date=START_DATE, - module_dependencies=dependencies) - results = {} - for b in runner.benchmarks: - k = b.name - try: - vs = memory_usage((b.run,)) - v = max(vs) - # print(k, v) - results[k] = v - except Exception as e: - print("Exception caught in %s\n" % k) - print(str(e)) - - s = Series(results) - s.sort() - print((s)) - - finally: - shutil.rmtree(TMP_DIR) - - -if __name__ == "__main__": - main() diff --git a/vb_suite/miscellaneous.py b/vb_suite/miscellaneous.py deleted file mode 100644 index da2c736e79ea7..0000000000000 --- a/vb_suite/miscellaneous.py +++ /dev/null @@ -1,32 +0,0 @@ -from vbench.benchmark import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -""" - -#---------------------------------------------------------------------- -# cache_readonly - -setup = common_setup + """ -from pandas.util.decorators import cache_readonly - -class Foo: - - @cache_readonly - def prop(self): - return 5 -obj = Foo() -""" -misc_cache_readonly = Benchmark("obj.prop", setup, name="misc_cache_readonly", - ncalls=2000000) - -#---------------------------------------------------------------------- -# match - -setup = common_setup + """ -uniques = tm.makeStringIndex(1000).values -all = uniques.repeat(10) -""" - -match_strings = Benchmark("match(all, uniques)", setup, - start_date=datetime(2012, 5, 12)) diff --git a/vb_suite/packers.py b/vb_suite/packers.py deleted file mode 100644 index 69ec10822b392..0000000000000 --- a/vb_suite/packers.py +++ /dev/null @@ -1,252 +0,0 @@ -from vbench.api import Benchmark -from datetime import datetime - -start_date = datetime(2013, 5, 1) - -common_setup = """from .pandas_vb_common import * -import os -import pandas as pd -from pandas.core import common as com -from pandas.compat import BytesIO -from random import randrange - -f = '__test__.msg' -def remove(f): - try: - os.remove(f) - except: - pass - -N=100000 -C=5 -index = date_range('20000101',periods=N,freq='H') -df = DataFrame(dict([ ("float{0}".format(i),randn(N)) for i in range(C) ]), - index=index) - -N=100000 -C=5 -index = date_range('20000101',periods=N,freq='H') -df2 = DataFrame(dict([ ("float{0}".format(i),randn(N)) for i in range(C) ]), - index=index) -df2['object'] = ['%08x'%randrange(16**8) for _ in range(N)] -remove(f) -""" - -#---------------------------------------------------------------------- -# msgpack - -setup = common_setup + """ -df2.to_msgpack(f) -""" - -packers_read_pack = Benchmark("pd.read_msgpack(f)", setup, start_date=start_date) - -setup = common_setup + """ -""" - -packers_write_pack = Benchmark("df2.to_msgpack(f)", setup, cleanup="remove(f)", start_date=start_date) - -#---------------------------------------------------------------------- -# pickle - -setup = common_setup + """ -df2.to_pickle(f) -""" - -packers_read_pickle = Benchmark("pd.read_pickle(f)", setup, start_date=start_date) - -setup = common_setup + """ -""" - -packers_write_pickle = Benchmark("df2.to_pickle(f)", setup, cleanup="remove(f)", start_date=start_date) - -#---------------------------------------------------------------------- -# csv - -setup = common_setup + """ -df.to_csv(f) -""" - -packers_read_csv = Benchmark("pd.read_csv(f)", setup, start_date=start_date) - -setup = common_setup + """ -""" - -packers_write_csv = Benchmark("df.to_csv(f)", setup, cleanup="remove(f)", start_date=start_date) - 
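All of the packers benchmarks above share one round-trip pattern: the setup string materializes a frame (and, for the read benchmarks, writes the file once), the timed statement is a single I/O call, and `cleanup="remove(f)"` deletes the file afterwards. Stripped of the vbench machinery, the csv pair reduces to roughly the following sketch; the file name and repeat count are arbitrary choices.

```python
# Roughly what packers_read_csv / packers_write_csv measure, without
# the vbench machinery. Sizes mirror the common setup; the file name
# and number of repeats are arbitrary.
import os
import timeit

import numpy as np
import pandas as pd

N, C = 100000, 5
index = pd.date_range('20000101', periods=N, freq='H')
df = pd.DataFrame({'float{0}'.format(i): np.random.randn(N) for i in range(C)},
                  index=index)

f = '__test__.csv'
write = timeit.timeit(lambda: df.to_csv(f), number=3) / 3      # timed write
read = timeit.timeit(lambda: pd.read_csv(f), number=3) / 3     # timed read
os.remove(f)  # what the cleanup="remove(f)" argument does after each run
print('to_csv: %.3fs  read_csv: %.3fs' % (write, read))
```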
-#---------------------------------------------------------------------- -# hdf store - -setup = common_setup + """ -df2.to_hdf(f,'df') -""" - -packers_read_hdf_store = Benchmark("pd.read_hdf(f,'df')", setup, start_date=start_date) - -setup = common_setup + """ -""" - -packers_write_hdf_store = Benchmark("df2.to_hdf(f,'df')", setup, cleanup="remove(f)", start_date=start_date) - -#---------------------------------------------------------------------- -# hdf table - -setup = common_setup + """ -df2.to_hdf(f,'df',format='table') -""" - -packers_read_hdf_table = Benchmark("pd.read_hdf(f,'df')", setup, start_date=start_date) - -setup = common_setup + """ -""" - -packers_write_hdf_table = Benchmark("df2.to_hdf(f,'df',table=True)", setup, cleanup="remove(f)", start_date=start_date) - -#---------------------------------------------------------------------- -# sql - -setup = common_setup + """ -import sqlite3 -from sqlalchemy import create_engine -engine = create_engine('sqlite:///:memory:') - -df2.to_sql('table', engine, if_exists='replace') -""" - -packers_read_sql= Benchmark("pd.read_sql_table('table', engine)", setup, start_date=start_date) - -setup = common_setup + """ -import sqlite3 -from sqlalchemy import create_engine -engine = create_engine('sqlite:///:memory:') -""" - -packers_write_sql = Benchmark("df2.to_sql('table', engine, if_exists='replace')", setup, start_date=start_date) - -#---------------------------------------------------------------------- -# json - -setup_int_index = """ -import numpy as np -df.index = np.arange(N) -""" - -setup = common_setup + """ -df.to_json(f,orient='split') -""" -packers_read_json_date_index = Benchmark("pd.read_json(f, orient='split')", setup, start_date=start_date) -setup = setup + setup_int_index -packers_read_json = Benchmark("pd.read_json(f, orient='split')", setup, start_date=start_date) - -setup = common_setup + """ -""" -packers_write_json_date_index = Benchmark("df.to_json(f,orient='split')", setup, cleanup="remove(f)", start_date=start_date) - -setup = setup + setup_int_index -packers_write_json = Benchmark("df.to_json(f,orient='split')", setup, cleanup="remove(f)", start_date=start_date) -packers_write_json_T = Benchmark("df.to_json(f,orient='columns')", setup, cleanup="remove(f)", start_date=start_date) - -setup = common_setup + """ -from numpy.random import randint -from collections import OrderedDict - -cols = [ - lambda i: ("{0}_timedelta".format(i), [pd.Timedelta('%d seconds' % randrange(1e6)) for _ in range(N)]), - lambda i: ("{0}_int".format(i), randint(1e8, size=N)), - lambda i: ("{0}_timestamp".format(i), [pd.Timestamp( 1418842918083256000 + randrange(1e9, 1e18, 200)) for _ in range(N)]) - ] -df_mixed = DataFrame(OrderedDict([cols[i % len(cols)](i) for i in range(C)]), - index=index) -""" -packers_write_json_mixed_delta_int_tstamp = Benchmark("df_mixed.to_json(f,orient='split')", setup, cleanup="remove(f)", start_date=start_date) - -setup = common_setup + """ -from numpy.random import randint -from collections import OrderedDict -cols = [ - lambda i: ("{0}_float".format(i), randn(N)), - lambda i: ("{0}_int".format(i), randint(1e8, size=N)) - ] -df_mixed = DataFrame(OrderedDict([cols[i % len(cols)](i) for i in range(C)]), - index=index) -""" -packers_write_json_mixed_float_int = Benchmark("df_mixed.to_json(f,orient='index')", setup, cleanup="remove(f)", start_date=start_date) -packers_write_json_mixed_float_int_T = Benchmark("df_mixed.to_json(f,orient='columns')", setup, cleanup="remove(f)", start_date=start_date) - -setup = 
common_setup + """ -from numpy.random import randint -from collections import OrderedDict -cols = [ - lambda i: ("{0}_float".format(i), randn(N)), - lambda i: ("{0}_int".format(i), randint(1e8, size=N)), - lambda i: ("{0}_str".format(i), ['%08x'%randrange(16**8) for _ in range(N)]) - ] -df_mixed = DataFrame(OrderedDict([cols[i % len(cols)](i) for i in range(C)]), - index=index) -""" -packers_write_json_mixed_float_int_str = Benchmark("df_mixed.to_json(f,orient='split')", setup, cleanup="remove(f)", start_date=start_date) - -#---------------------------------------------------------------------- -# stata - -setup = common_setup + """ -df.to_stata(f, {'index': 'tc'}) -""" -packers_read_stata = Benchmark("pd.read_stata(f)", setup, start_date=start_date) - -packers_write_stata = Benchmark("df.to_stata(f, {'index': 'tc'})", setup, cleanup="remove(f)", start_date=start_date) - -setup = common_setup + """ -df['int8_'] = [randint(np.iinfo(np.int8).min, np.iinfo(np.int8).max - 27) for _ in range(N)] -df['int16_'] = [randint(np.iinfo(np.int16).min, np.iinfo(np.int16).max - 27) for _ in range(N)] -df['int32_'] = [randint(np.iinfo(np.int32).min, np.iinfo(np.int32).max - 27) for _ in range(N)] -df['float32_'] = np.array(randn(N), dtype=np.float32) -df.to_stata(f, {'index': 'tc'}) -""" - -packers_read_stata_with_validation = Benchmark("pd.read_stata(f)", setup, start_date=start_date) - -packers_write_stata_with_validation = Benchmark("df.to_stata(f, {'index': 'tc'})", setup, cleanup="remove(f)", start_date=start_date) - -#---------------------------------------------------------------------- -# Excel - alternative writers -setup = common_setup + """ -bio = BytesIO() -""" - -excel_writer_bench = """ -bio.seek(0) -writer = pd.io.excel.ExcelWriter(bio, engine='{engine}') -df[:2000].to_excel(writer) -writer.save() -""" - -benchmark_xlsxwriter = excel_writer_bench.format(engine='xlsxwriter') - -packers_write_excel_xlsxwriter = Benchmark(benchmark_xlsxwriter, setup) - -benchmark_openpyxl = excel_writer_bench.format(engine='openpyxl') - -packers_write_excel_openpyxl = Benchmark(benchmark_openpyxl, setup) - -benchmark_xlwt = excel_writer_bench.format(engine='xlwt') - -packers_write_excel_xlwt = Benchmark(benchmark_xlwt, setup) - - -#---------------------------------------------------------------------- -# Excel - reader - -setup = common_setup + """ -bio = BytesIO() -writer = pd.io.excel.ExcelWriter(bio, engine='xlsxwriter') -df[:2000].to_excel(writer) -writer.save() -""" - -benchmark_read_excel=""" -bio.seek(0) -pd.read_excel(bio) -""" - -packers_read_excel = Benchmark(benchmark_read_excel, setup) diff --git a/vb_suite/pandas_vb_common.py b/vb_suite/pandas_vb_common.py deleted file mode 100644 index a1326d63a112a..0000000000000 --- a/vb_suite/pandas_vb_common.py +++ /dev/null @@ -1,30 +0,0 @@ -from pandas import * -import pandas as pd -from datetime import timedelta -from numpy.random import randn -from numpy.random import randint -from numpy.random import permutation -import pandas.util.testing as tm -import random -import numpy as np -try: - from pandas.compat import range -except ImportError: - pass - -np.random.seed(1234) -try: - import pandas._tseries as lib -except: - import pandas.lib as lib - -try: - Panel = WidePanel -except Exception: - pass - -# didn't add to namespace until later -try: - from pandas.core.index import MultiIndex -except ImportError: - pass diff --git a/vb_suite/panel_ctor.py b/vb_suite/panel_ctor.py deleted file mode 100644 index 9f497e7357a61..0000000000000 --- 
a/vb_suite/panel_ctor.py +++ /dev/null @@ -1,76 +0,0 @@ -from vbench.benchmark import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -""" - -#---------------------------------------------------------------------- -# Panel.from_dict homogenization time - -START_DATE = datetime(2011, 6, 1) - -setup_same_index = common_setup + """ -# create 100 dataframes with the same index -dr = np.asarray(DatetimeIndex(start=datetime(1990,1,1), end=datetime(2012,1,1), - freq=datetools.Day(1))) -data_frames = {} -for x in range(100): - df = DataFrame({"a": [0]*len(dr), "b": [1]*len(dr), - "c": [2]*len(dr)}, index=dr) - data_frames[x] = df -""" - -panel_from_dict_same_index = \ - Benchmark("Panel.from_dict(data_frames)", - setup_same_index, name='panel_from_dict_same_index', - start_date=START_DATE, repeat=1, logy=True) - -setup_equiv_indexes = common_setup + """ -data_frames = {} -for x in range(100): - dr = np.asarray(DatetimeIndex(start=datetime(1990,1,1), end=datetime(2012,1,1), - freq=datetools.Day(1))) - df = DataFrame({"a": [0]*len(dr), "b": [1]*len(dr), - "c": [2]*len(dr)}, index=dr) - data_frames[x] = df -""" - -panel_from_dict_equiv_indexes = \ - Benchmark("Panel.from_dict(data_frames)", - setup_equiv_indexes, name='panel_from_dict_equiv_indexes', - start_date=START_DATE, repeat=1, logy=True) - -setup_all_different_indexes = common_setup + """ -data_frames = {} -start = datetime(1990,1,1) -end = datetime(2012,1,1) -for x in range(100): - end += timedelta(days=1) - dr = np.asarray(date_range(start, end)) - df = DataFrame({"a": [0]*len(dr), "b": [1]*len(dr), - "c": [2]*len(dr)}, index=dr) - data_frames[x] = df -""" -panel_from_dict_all_different_indexes = \ - Benchmark("Panel.from_dict(data_frames)", - setup_all_different_indexes, - name='panel_from_dict_all_different_indexes', - start_date=START_DATE, repeat=1, logy=True) - -setup_two_different_indexes = common_setup + """ -data_frames = {} -start = datetime(1990,1,1) -end = datetime(2012,1,1) -for x in range(100): - if x == 50: - end += timedelta(days=1) - dr = np.asarray(date_range(start, end)) - df = DataFrame({"a": [0]*len(dr), "b": [1]*len(dr), - "c": [2]*len(dr)}, index=dr) - data_frames[x] = df -""" -panel_from_dict_two_different_indexes = \ - Benchmark("Panel.from_dict(data_frames)", - setup_two_different_indexes, - name='panel_from_dict_two_different_indexes', - start_date=START_DATE, repeat=1, logy=True) diff --git a/vb_suite/panel_methods.py b/vb_suite/panel_methods.py deleted file mode 100644 index 28586422a66e3..0000000000000 --- a/vb_suite/panel_methods.py +++ /dev/null @@ -1,28 +0,0 @@ -from vbench.api import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -""" - -#---------------------------------------------------------------------- -# shift - -setup = common_setup + """ -index = date_range(start="2000", freq="D", periods=1000) -panel = Panel(np.random.randn(100, len(index), 1000)) -""" - -panel_shift = Benchmark('panel.shift(1)', setup, - start_date=datetime(2012, 1, 12)) - -panel_shift_minor = Benchmark('panel.shift(1, axis="minor")', setup, - start_date=datetime(2012, 1, 12)) - -panel_pct_change_major = Benchmark('panel.pct_change(1, axis="major")', setup, - start_date=datetime(2014, 4, 19)) - -panel_pct_change_minor = Benchmark('panel.pct_change(1, axis="minor")', setup, - start_date=datetime(2014, 4, 19)) - -panel_pct_change_items = Benchmark('panel.pct_change(1, axis="items")', setup, - start_date=datetime(2014, 4, 19)) diff --git 
a/vb_suite/parser_vb.py b/vb_suite/parser_vb.py deleted file mode 100644 index bb9ccbdb5e854..0000000000000 --- a/vb_suite/parser_vb.py +++ /dev/null @@ -1,112 +0,0 @@ -from vbench.api import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -from pandas import read_csv, read_table -""" - -setup = common_setup + """ -import os -N = 10000 -K = 8 -df = DataFrame(np.random.randn(N, K) * np.random.randint(100, 10000, (N, K))) -df.to_csv('test.csv', sep='|') -""" - -read_csv_vb = Benchmark("read_csv('test.csv', sep='|')", setup, - cleanup="os.remove('test.csv')", - start_date=datetime(2012, 5, 7)) - - -setup = common_setup + """ -import os -N = 10000 -K = 8 -format = lambda x: '{:,}'.format(x) -df = DataFrame(np.random.randn(N, K) * np.random.randint(100, 10000, (N, K))) -df = df.applymap(format) -df.to_csv('test.csv', sep='|') -""" - -read_csv_thou_vb = Benchmark("read_csv('test.csv', sep='|', thousands=',')", - setup, - cleanup="os.remove('test.csv')", - start_date=datetime(2012, 5, 7)) - -setup = common_setup + """ -data = ['A,B,C'] -data = data + ['1,2,3 # comment'] * 100000 -data = '\\n'.join(data) -""" - -stmt = "read_csv(StringIO(data), comment='#')" -read_csv_comment2 = Benchmark(stmt, setup, - start_date=datetime(2011, 11, 1)) - -setup = common_setup + """ -try: - from cStringIO import StringIO -except ImportError: - from io import StringIO - -import os -N = 10000 -K = 8 -data = '''\ -KORD,19990127, 19:00:00, 18:56:00, 0.8100, 2.8100, 7.2000, 0.0000, 280.0000 -KORD,19990127, 20:00:00, 19:56:00, 0.0100, 2.2100, 7.2000, 0.0000, 260.0000 -KORD,19990127, 21:00:00, 20:56:00, -0.5900, 2.2100, 5.7000, 0.0000, 280.0000 -KORD,19990127, 21:00:00, 21:18:00, -0.9900, 2.0100, 3.6000, 0.0000, 270.0000 -KORD,19990127, 22:00:00, 21:56:00, -0.5900, 1.7100, 5.1000, 0.0000, 290.0000 -''' -data = data * 200 -""" -cmd = ("read_table(StringIO(data), sep=',', header=None, " - "parse_dates=[[1,2], [1,3]])") -sdate = datetime(2012, 5, 7) -read_table_multiple_date = Benchmark(cmd, setup, start_date=sdate) - -setup = common_setup + """ -try: - from cStringIO import StringIO -except ImportError: - from io import StringIO - -import os -N = 10000 -K = 8 -data = '''\ -KORD,19990127 19:00:00, 18:56:00, 0.8100, 2.8100, 7.2000, 0.0000, 280.0000 -KORD,19990127 20:00:00, 19:56:00, 0.0100, 2.2100, 7.2000, 0.0000, 260.0000 -KORD,19990127 21:00:00, 20:56:00, -0.5900, 2.2100, 5.7000, 0.0000, 280.0000 -KORD,19990127 21:00:00, 21:18:00, -0.9900, 2.0100, 3.6000, 0.0000, 270.0000 -KORD,19990127 22:00:00, 21:56:00, -0.5900, 1.7100, 5.1000, 0.0000, 290.0000 -''' -data = data * 200 -""" -cmd = "read_table(StringIO(data), sep=',', header=None, parse_dates=[1])" -sdate = datetime(2012, 5, 7) -read_table_multiple_date_baseline = Benchmark(cmd, setup, start_date=sdate) - -setup = common_setup + """ -try: - from cStringIO import StringIO -except ImportError: - from io import StringIO - -data = '''\ -0.1213700904466425978256438611,0.0525708283766902484401839501,0.4174092731488769913994474336 -0.4096341697147408700274695547,0.1587830198973579909349496119,0.1292545832485494372576795285 -0.8323255650024565799327547210,0.9694902427379478160318626578,0.6295047811546814475747169126 -0.4679375305798131323697930383,0.2963942381834381301075609371,0.5268936082160610157032465394 -0.6685382761849776311890991564,0.6721207066140679753374342908,0.6519975277021627935170045020 -''' -data = data * 200 -""" -cmd = "read_csv(StringIO(data), sep=',', header=None, float_precision=None)" -sdate = datetime(2014, 8, 
20) -read_csv_default_converter = Benchmark(cmd, setup, start_date=sdate) -cmd = "read_csv(StringIO(data), sep=',', header=None, float_precision='high')" -read_csv_precise_converter = Benchmark(cmd, setup, start_date=sdate) -cmd = "read_csv(StringIO(data), sep=',', header=None, float_precision='round_trip')" -read_csv_roundtrip_converter = Benchmark(cmd, setup, start_date=sdate) diff --git a/vb_suite/perf_HEAD.py b/vb_suite/perf_HEAD.py deleted file mode 100755 index 143d943b9eadf..0000000000000 --- a/vb_suite/perf_HEAD.py +++ /dev/null @@ -1,243 +0,0 @@ -#!/usr/bin/env python -# -*- coding: utf-8 -*- - -from __future__ import print_function - -"""Run all the vbenches in `suite`, and post the results as a json blob to gist - -""" - -import urllib2 -from contextlib import closing -from urllib2 import urlopen -import json - -import pandas as pd - -WEB_TIMEOUT = 10 - - -def get_travis_data(): - """figure out what worker we're running on, and the number of jobs it's running - """ - import os - jobid = os.environ.get("TRAVIS_JOB_ID") - if not jobid: - return None, None - - with closing(urlopen("https://api.travis-ci.org/workers/")) as resp: - workers = json.loads(resp.read()) - - host = njobs = None - for item in workers: - host = item.get("host") - id = ((item.get("payload") or {}).get("job") or {}).get("id") - if id and str(id) == str(jobid): - break - if host: - njobs = len( - [x for x in workers if host in x['host'] and x['payload']]) - - return host, njobs - - -def get_utcdatetime(): - try: - from datetime import datetime - return datetime.utcnow().isoformat(" ") - except: - pass - - -def dump_as_gist(data, desc="The Commit", njobs=None): - host, njobs2 = get_travis_data()[:2] - - if njobs: # be slightly more reliable - njobs = max(njobs, njobs2) - - content = dict(version="0.1.1", - timings=data, - datetime=get_utcdatetime(), # added in 0.1.1 - hostname=host, # added in 0.1.1 - njobs=njobs # added in 0.1.1, a measure of load on the travis box - ) - - payload = dict(description=desc, - public=True, - files={'results.json': dict(content=json.dumps(content))}) - try: - with closing(urlopen("https://api.github.com/gists", - json.dumps(payload), timeout=WEB_TIMEOUT)) as r: - if 200 <= r.getcode() < 300: - print("\n\n" + "-" * 80) - - gist = json.loads(r.read()) - file_raw_url = gist['files'].items()[0][1]['raw_url'] - print("[vbench-gist-raw_url] %s" % file_raw_url) - print("[vbench-html-url] %s" % gist['html_url']) - print("[vbench-api-url] %s" % gist['url']) - - print("-" * 80 + "\n\n") - else: - print("api.github.com returned status %d" % r.getcode()) - except: - print("Error occurred while dumping to gist") - - -def main(): - import warnings - from suite import benchmarks - - exit_code = 0 - warnings.filterwarnings('ignore', category=FutureWarning) - - host, njobs = get_travis_data()[:2] - results = [] - for b in benchmarks: - try: - d = b.run() - d.update(dict(name=b.name)) - results.append(d) - msg = "{name:<40}: {timing:> 10.4f} [ms]" - print(msg.format(name=results[-1]['name'], - timing=results[-1]['timing'])) - - except Exception as e: - exit_code = 1 - if (type(e) == KeyboardInterrupt or - 'KeyboardInterrupt' in str(d)): - raise KeyboardInterrupt() - - msg = "{name:<40}: ERROR:\n<-------" - print(msg.format(name=b.name)) - if isinstance(d, dict): - if d['succeeded']: - print("\nException:\n%s\n" % str(e)) - else: - for k, v in sorted(d.iteritems()): - print("{k}: {v}".format(k=k, v=v)) - - print("------->\n") - - dump_as_gist(results, "testing", njobs=njobs) - - return exit_code - - 
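The `results` list that `main()` assembles is plain data, one dict per benchmark carrying at least `name`, `timing`, and `succeeded`, and the same shape `convert_json_to_df` below pulls back out of the posted gist. A toy summary of such a list, with fabricated names and values:

```python
# Toy illustration of the per-benchmark records main() collects; the
# benchmark names and timings below are fabricated.
import pandas as pd

results = [
    {'name': 'read_store', 'timing': 1.84, 'succeeded': True},
    {'name': 'write_store', 'timing': 3.02, 'succeeded': True},
    {'name': 'frame_reindex_columns', 'timing': None, 'succeeded': False},
]
ok = [r for r in results if r['succeeded']]      # drop failed runs
df = pd.DataFrame(ok).set_index('name')
print(df['timing'].sort_values(ascending=False))  # slowest first
```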
-if __name__ == "__main__":
-    import sys
-    sys.exit(main())
-
-#####################################################
-# functions for retrieving and processing the results
-
-
-def get_vbench_log(build_url):
-    with closing(urllib2.urlopen(build_url)) as r:
-        if not (200 <= r.getcode() < 300):
-            return
-
-        s = json.loads(r.read())
-        s = [x for x in s['matrix'] if "VBENCH" in ((x.get('config', {})
-                                                     or {}).get('env', {}) or {})]
-        # s=[x for x in s['matrix']]
-        if not s:
-            return
-        id = s[0]['id']  # should be just one for now
-        with closing(urllib2.urlopen("https://api.travis-ci.org/jobs/%s" % id)) as r2:
-            if not 200 <= r2.getcode() < 300:
-                return
-            s2 = json.loads(r2.read())
-            return s2.get('log')
-
-
-def get_results_raw_url(build):
-    "Takes a Travis build number, retrieves the build log and extracts the gist url"
-    import re
-    log = get_vbench_log("https://api.travis-ci.org/builds/%s" % build)
-    if not log:
-        return
-    l = [x.strip() for x in log.split("\n") if re.match(".vbench-gist-raw_url", x)]
-    if l:
-        s = l[0]
-        m = re.search("(https://[^\s]+)", s)
-        if m:
-            return m.group(0)
-
-
-def convert_json_to_df(results_url):
-    """retrieve json results file from url and return df
-
-    df contains timings for all successful vbenchmarks
-    """
-
-    with closing(urlopen(results_url)) as resp:
-        res = json.loads(resp.read())
-    timings = res.get("timings")
-    if not timings:
-        return
-    res = [x for x in timings if x.get('succeeded')]
-    df = pd.DataFrame(res)
-    df = df.set_index("name")
-    return df
-
-
-def get_build_results(build):
-    "Returns a df with the results of the VBENCH job associated with the travis build"
-    r_url = get_results_raw_url(build)
-    if not r_url:
-        return
-
-    return convert_json_to_df(r_url)
-
-
-def get_all_results(repo_id=53976):  # travis pandas-dev/pandas id
-    """Fetches the VBENCH results for all travis builds, and returns a list of result df
-
-    unsuccessful individual vbenches are dropped.
- """ - from collections import OrderedDict - - def get_results_from_builds(builds): - dfs = OrderedDict() - for build in builds: - build_id = build['id'] - build_number = build['number'] - print(build_number) - res = get_build_results(build_id) - if res is not None: - dfs[build_number] = res - return dfs - - base_url = 'https://api.travis-ci.org/builds?url=%2Fbuilds&repository_id={repo_id}' - url = base_url.format(repo_id=repo_id) - url_after = url + '&after_number={after}' - dfs = OrderedDict() - - while True: - with closing(urlopen(url)) as r: - if not (200 <= r.getcode() < 300): - break - builds = json.loads(r.read()) - res = get_results_from_builds(builds) - if not res: - break - last_build_number = min(res.keys()) - dfs.update(res) - url = url_after.format(after=last_build_number) - - return dfs - - -def get_all_results_joined(repo_id=53976): - def mk_unique(df): - for dupe in df.index.get_duplicates(): - df = df.ix[df.index != dupe] - return df - dfs = get_all_results(repo_id) - for k in dfs: - dfs[k] = mk_unique(dfs[k]) - ss = [pd.Series(v.timing, name=k) for k, v in dfs.iteritems()] - results = pd.concat(reversed(ss), 1) - return results diff --git a/vb_suite/plotting.py b/vb_suite/plotting.py deleted file mode 100644 index 79e81e9eea8f4..0000000000000 --- a/vb_suite/plotting.py +++ /dev/null @@ -1,25 +0,0 @@ -from vbench.benchmark import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * - -try: - from pandas import date_range -except ImportError: - def date_range(start=None, end=None, periods=None, freq=None): - return DatetimeIndex(start, end, periods=periods, offset=freq) - -""" - -#----------------------------------------------------------------------------- -# Timeseries plotting - -setup = common_setup + """ -N = 2000 -M = 5 -df = DataFrame(np.random.randn(N,M), index=date_range('1/1/1975', periods=N)) -""" - -plot_timeseries_period = Benchmark("df.plot()", setup=setup, - name='plot_timeseries_period') - diff --git a/vb_suite/reindex.py b/vb_suite/reindex.py deleted file mode 100644 index 443eb43835745..0000000000000 --- a/vb_suite/reindex.py +++ /dev/null @@ -1,225 +0,0 @@ -from vbench.benchmark import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -""" - -#---------------------------------------------------------------------- -# DataFrame reindex columns - -setup = common_setup + """ -df = DataFrame(index=range(10000), data=np.random.rand(10000,30), - columns=range(30)) -""" -statement = "df.reindex(columns=df.columns[1:5])" - -frame_reindex_columns = Benchmark(statement, setup) - -#---------------------------------------------------------------------- - -setup = common_setup + """ -rng = DatetimeIndex(start='1/1/1970', periods=10000, freq=datetools.Minute()) -df = DataFrame(np.random.rand(10000, 10), index=rng, - columns=range(10)) -df['foo'] = 'bar' -rng2 = Index(rng[::2]) -""" -statement = "df.reindex(rng2)" -dataframe_reindex = Benchmark(statement, setup) - -#---------------------------------------------------------------------- -# multiindex reindexing - -setup = common_setup + """ -N = 1000 -K = 20 - -level1 = tm.makeStringIndex(N).values.repeat(K) -level2 = np.tile(tm.makeStringIndex(K).values, N) -index = MultiIndex.from_arrays([level1, level2]) - -s1 = Series(np.random.randn(N * K), index=index) -s2 = s1[::2] -""" -statement = "s1.reindex(s2.index)" -reindex_multi = Benchmark(statement, setup, - name='reindex_multiindex', - start_date=datetime(2011, 9, 1)) - 
-#---------------------------------------------------------------------- -# Pad / backfill - -def pad(source_series, target_index): - try: - source_series.reindex(target_index, method='pad') - except: - source_series.reindex(target_index, fillMethod='pad') - -def backfill(source_series, target_index): - try: - source_series.reindex(target_index, method='backfill') - except: - source_series.reindex(target_index, fillMethod='backfill') - -setup = common_setup + """ -rng = date_range('1/1/2000', periods=100000, freq=datetools.Minute()) - -ts = Series(np.random.randn(len(rng)), index=rng) -ts2 = ts[::2] -ts3 = ts2.reindex(ts.index) -ts4 = ts3.astype('float32') - -def pad(source_series, target_index): - try: - source_series.reindex(target_index, method='pad') - except: - source_series.reindex(target_index, fillMethod='pad') -def backfill(source_series, target_index): - try: - source_series.reindex(target_index, method='backfill') - except: - source_series.reindex(target_index, fillMethod='backfill') -""" - -statement = "pad(ts2, ts.index)" -reindex_daterange_pad = Benchmark(statement, setup, - name="reindex_daterange_pad") - -statement = "backfill(ts2, ts.index)" -reindex_daterange_backfill = Benchmark(statement, setup, - name="reindex_daterange_backfill") - -reindex_fillna_pad = Benchmark("ts3.fillna(method='pad')", setup, - name="reindex_fillna_pad", - start_date=datetime(2011, 3, 1)) - -reindex_fillna_pad_float32 = Benchmark("ts4.fillna(method='pad')", setup, - name="reindex_fillna_pad_float32", - start_date=datetime(2013, 1, 1)) - -reindex_fillna_backfill = Benchmark("ts3.fillna(method='backfill')", setup, - name="reindex_fillna_backfill", - start_date=datetime(2011, 3, 1)) -reindex_fillna_backfill_float32 = Benchmark("ts4.fillna(method='backfill')", setup, - name="reindex_fillna_backfill_float32", - start_date=datetime(2013, 1, 1)) - -#---------------------------------------------------------------------- -# align on level - -setup = common_setup + """ -index = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], - labels=[np.arange(10).repeat(10000), - np.tile(np.arange(100).repeat(100), 10), - np.tile(np.tile(np.arange(100), 100), 10)]) -random.shuffle(index.values) -df = DataFrame(np.random.randn(len(index), 4), index=index) -df_level = DataFrame(np.random.randn(100, 4), index=index.levels[1]) -""" - -reindex_frame_level_align = \ - Benchmark("df.align(df_level, level=1, copy=False)", setup, - name='reindex_frame_level_align', - start_date=datetime(2011, 12, 27)) - -reindex_frame_level_reindex = \ - Benchmark("df_level.reindex(df.index, level=1)", setup, - name='reindex_frame_level_reindex', - start_date=datetime(2011, 12, 27)) - - -#---------------------------------------------------------------------- -# sort_index, drop_duplicates - -# pathological, but realistic -setup = common_setup + """ -N = 10000 -K = 10 - -key1 = tm.makeStringIndex(N).values.repeat(K) -key2 = tm.makeStringIndex(N).values.repeat(K) - -df = DataFrame({'key1' : key1, 'key2' : key2, - 'value' : np.random.randn(N * K)}) -col_array_list = list(df.values.T) -""" -statement = "df.sort_index(by=['key1', 'key2'])" -frame_sort_index_by_columns = Benchmark(statement, setup, - start_date=datetime(2011, 11, 1)) - -# drop_duplicates - -statement = "df.drop_duplicates(['key1', 'key2'])" -frame_drop_duplicates = Benchmark(statement, setup, - start_date=datetime(2011, 11, 15)) - -statement = "df.drop_duplicates(['key1', 'key2'], inplace=True)" -frame_drop_dup_inplace = Benchmark(statement, setup, - 
start_date=datetime(2012, 5, 16)) - -lib_fast_zip = Benchmark('lib.fast_zip(col_array_list)', setup, - name='lib_fast_zip', - start_date=datetime(2012, 1, 1)) - -setup = setup + """ -df.ix[:10000, :] = np.nan -""" -statement2 = "df.drop_duplicates(['key1', 'key2'])" -frame_drop_duplicates_na = Benchmark(statement2, setup, - start_date=datetime(2012, 5, 15)) - -lib_fast_zip_fillna = Benchmark('lib.fast_zip_fillna(col_array_list)', setup, - start_date=datetime(2012, 5, 15)) - -statement2 = "df.drop_duplicates(['key1', 'key2'], inplace=True)" -frame_drop_dup_na_inplace = Benchmark(statement2, setup, - start_date=datetime(2012, 5, 16)) - -setup = common_setup + """ -s = Series(np.random.randint(0, 1000, size=10000)) -s2 = Series(np.tile(tm.makeStringIndex(1000).values, 10)) -""" - -series_drop_duplicates_int = Benchmark('s.drop_duplicates()', setup, - start_date=datetime(2012, 11, 27)) - -series_drop_duplicates_string = \ - Benchmark('s2.drop_duplicates()', setup, - start_date=datetime(2012, 11, 27)) - -#---------------------------------------------------------------------- -# fillna, many columns - - -setup = common_setup + """ -values = np.random.randn(1000, 1000) -values[::2] = np.nan -df = DataFrame(values) -""" - -frame_fillna_many_columns_pad = Benchmark("df.fillna(method='pad')", - setup, - start_date=datetime(2011, 3, 1)) - -#---------------------------------------------------------------------- -# blog "pandas escaped the zoo" - -setup = common_setup + """ -n = 50000 -indices = tm.makeStringIndex(n) - -def sample(values, k): - from random import shuffle - sampler = np.arange(len(values)) - shuffle(sampler) - return values.take(sampler[:k]) - -subsample_size = 40000 - -x = Series(np.random.randn(50000), indices) -y = Series(np.random.randn(subsample_size), - index=sample(indices, subsample_size)) -""" - -series_align_irregular_string = Benchmark("x + y", setup, - start_date=datetime(2010, 6, 1)) diff --git a/vb_suite/replace.py b/vb_suite/replace.py deleted file mode 100644 index 9326aa5becca9..0000000000000 --- a/vb_suite/replace.py +++ /dev/null @@ -1,36 +0,0 @@ -from vbench.api import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -from datetime import timedelta - -N = 1000000 - -try: - rng = date_range('1/1/2000', periods=N, freq='min') -except NameError: - rng = DatetimeIndex('1/1/2000', periods=N, offset=datetools.Minute()) - date_range = DateRange - -ts = Series(np.random.randn(N), index=rng) -""" - -large_dict_setup = """from .pandas_vb_common import * -from pandas.compat import range -n = 10 ** 6 -start_value = 10 ** 5 -to_rep = dict((i, start_value + i) for i in range(n)) -s = Series(np.random.randint(n, size=10 ** 3)) -""" - -replace_fillna = Benchmark('ts.fillna(0., inplace=True)', common_setup, - name='replace_fillna', - start_date=datetime(2012, 4, 4)) -replace_replacena = Benchmark('ts.replace(np.nan, 0., inplace=True)', - common_setup, - name='replace_replacena', - start_date=datetime(2012, 5, 15)) -replace_large_dict = Benchmark('s.replace(to_rep, inplace=True)', - large_dict_setup, - name='replace_large_dict', - start_date=datetime(2014, 4, 6)) diff --git a/vb_suite/reshape.py b/vb_suite/reshape.py deleted file mode 100644 index daab96103f2c5..0000000000000 --- a/vb_suite/reshape.py +++ /dev/null @@ -1,65 +0,0 @@ -from vbench.api import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -index = MultiIndex.from_arrays([np.arange(100).repeat(100), - np.roll(np.tile(np.arange(100), 
100), 25)])
-df = DataFrame(np.random.randn(10000, 4), index=index)
-"""
-
-reshape_unstack_simple = Benchmark('df.unstack(1)', common_setup,
-                                   start_date=datetime(2011, 10, 1))
-
-setup = common_setup + """
-udf = df.unstack(1)
-"""
-
-reshape_stack_simple = Benchmark('udf.stack()', setup,
-                                 start_date=datetime(2011, 10, 1))
-
-setup = common_setup + """
-def unpivot(frame):
-    N, K = frame.shape
-    data = {'value' : frame.values.ravel('F'),
-            'variable' : np.asarray(frame.columns).repeat(N),
-            'date' : np.tile(np.asarray(frame.index), K)}
-    return DataFrame(data, columns=['date', 'variable', 'value'])
-index = date_range('1/1/2000', periods=10000, freq='h')
-df = DataFrame(randn(10000, 50), index=index, columns=range(50))
-pdf = unpivot(df)
-f = lambda: pdf.pivot('date', 'variable', 'value')
-"""
-
-reshape_pivot_time_series = Benchmark('f()', setup,
-                                      start_date=datetime(2012, 5, 1))
-
-# Sparse key space, re: #2278
-
-setup = common_setup + """
-NUM_ROWS = 1000
-for iter in range(10):
-    df = DataFrame({'A' : np.random.randint(50, size=NUM_ROWS),
-                    'B' : np.random.randint(50, size=NUM_ROWS),
-                    'C' : np.random.randint(-10,10, size=NUM_ROWS),
-                    'D' : np.random.randint(-10,10, size=NUM_ROWS),
-                    'E' : np.random.randint(10, size=NUM_ROWS),
-                    'F' : np.random.randn(NUM_ROWS)})
-    idf = df.set_index(['A', 'B', 'C', 'D', 'E'])
-    if len(idf.index.unique()) == NUM_ROWS:
-        break
-"""
-
-unstack_sparse_keyspace = Benchmark('idf.unstack()', setup,
-                                    start_date=datetime(2011, 10, 1))
-
-# Melt
-
-setup = common_setup + """
-from pandas.core.reshape import melt
-df = DataFrame(np.random.randn(10000, 3), columns=['A', 'B', 'C'])
-df['id1'] = np.random.randint(0, 10, 10000)
-df['id2'] = np.random.randint(100, 1000, 10000)
-"""
-
-melt_dataframe = Benchmark("melt(df, id_vars=['id1', 'id2'])", setup,
-                           start_date=datetime(2012, 8, 1))
diff --git a/vb_suite/run_suite.py b/vb_suite/run_suite.py
deleted file mode 100755
index 43bf24faae43a..0000000000000
--- a/vb_suite/run_suite.py
+++ /dev/null
@@ -1,15 +0,0 @@
-#!/usr/bin/env python
-from vbench.api import BenchmarkRunner
-from suite import *
-
-
-def run_process():
-    runner = BenchmarkRunner(benchmarks, REPO_PATH, REPO_URL,
-                             BUILD, DB_PATH, TMP_DIR, PREPARE,
-                             always_clean=True,
-                             run_option='eod', start_date=START_DATE,
-                             module_dependencies=dependencies)
-    runner.run()
-
-if __name__ == '__main__':
-    run_process()
diff --git a/vb_suite/series_methods.py b/vb_suite/series_methods.py
deleted file mode 100644
index cd8688495fa09..0000000000000
--- a/vb_suite/series_methods.py
+++ /dev/null
@@ -1,39 +0,0 @@
-from vbench.api import Benchmark
-from datetime import datetime
-
-common_setup = """from .pandas_vb_common import *
-"""
-
-setup = common_setup + """
-s1 = Series(np.random.randn(10000))
-s2 = Series(np.random.randint(1, 10, 10000))
-s3 = Series(np.random.randint(1, 10, 100000)).astype('int64')
-values = [1,2]
-s4 = s3.astype('object')
-"""
-
-series_nlargest1 = Benchmark('s1.nlargest(3, take_last=True);'
-                             's1.nlargest(3, take_last=False)',
-                             setup,
-                             start_date=datetime(2014, 1, 25))
-series_nlargest2 = Benchmark('s2.nlargest(3, take_last=True);'
-                             's2.nlargest(3, take_last=False)',
-                             setup,
-                             start_date=datetime(2014, 1, 25))
-
-series_nsmallest1 = Benchmark('s1.nsmallest(3, take_last=True);'
-                              's1.nsmallest(3, take_last=False)',
-                              setup,
-                              start_date=datetime(2014, 1, 25))
-
-series_nsmallest2 = Benchmark('s2.nsmallest(3, take_last=True);'
-                              's2.nsmallest(3, take_last=False)',
-                              setup,
-                              start_date=datetime(2014, 1, 25))
-
-series_isin_int64 =
Benchmark('s3.isin(values)', - setup, - start_date=datetime(2014, 1, 25)) -series_isin_object = Benchmark('s4.isin(values)', - setup, - start_date=datetime(2014, 1, 25)) diff --git a/vb_suite/source/conf.py b/vb_suite/source/conf.py deleted file mode 100644 index d83448fd97d09..0000000000000 --- a/vb_suite/source/conf.py +++ /dev/null @@ -1,225 +0,0 @@ -# -*- coding: utf-8 -*- -# -# pandas documentation build configuration file, created by -# -# This file is execfile()d with the current directory set to its containing dir. -# -# Note that not all possible configuration values are present in this -# autogenerated file. -# -# All configuration values have a default; values that are commented out -# serve to show the default. - -import sys -import os - -# If extensions (or modules to document with autodoc) are in another directory, -# add these directories to sys.path here. If the directory is relative to the -# documentation root, use os.path.abspath to make it absolute, like shown here. -# sys.path.append(os.path.abspath('.')) -sys.path.insert(0, os.path.abspath('../sphinxext')) - -sys.path.extend([ - - # numpy standard doc extensions - os.path.join(os.path.dirname(__file__), - '..', '../..', - 'sphinxext') - -]) - -# -- General configuration ----------------------------------------------- - -# Add any Sphinx extension module names here, as strings. They can be extensions -# coming with Sphinx (named 'sphinx.ext.*') or your custom ones. sphinxext. - -extensions = ['sphinx.ext.autodoc', - 'sphinx.ext.doctest'] - -# Add any paths that contain templates here, relative to this directory. -templates_path = ['_templates', '_templates/autosummary'] - -# The suffix of source filenames. -source_suffix = '.rst' - -# The encoding of source files. -# source_encoding = 'utf-8' - -# The master toctree document. -master_doc = 'index' - -# General information about the project. -project = u'pandas' -copyright = u'2008-2011, the pandas development team' - -# The version info for the project you're documenting, acts as replacement for -# |version| and |release|, also used in various other places throughout the -# built documents. -# -# The short X.Y version. -import pandas - -# version = '%s r%s' % (pandas.__version__, svn_version()) -version = '%s' % (pandas.__version__) - -# The full version, including alpha/beta/rc tags. -release = version - -# JP: added from sphinxdocs -autosummary_generate = True - -# The language for content autogenerated by Sphinx. Refer to documentation -# for a list of supported languages. -# language = None - -# There are two options for replacing |today|: either, you set today to some -# non-false value, then it is used: -# today = '' -# Else, today_fmt is used as the format for a strftime call. -# today_fmt = '%B %d, %Y' - -# List of documents that shouldn't be included in the build. -# unused_docs = [] - -# List of directories, relative to source directory, that shouldn't be searched -# for source files. -exclude_trees = [] - -# The reST default role (used for this markup: `text`) to use for all documents. -# default_role = None - -# If true, '()' will be appended to :func: etc. cross-reference text. -# add_function_parentheses = True - -# If true, the current module name will be prepended to all description -# unit titles (such as .. function::). -# add_module_names = True - -# If true, sectionauthor and moduleauthor directives will be shown in the -# output. They are ignored by default. -# show_authors = False - -# The name of the Pygments (syntax highlighting) style to use. 
-pygments_style = 'sphinx'
-
-# A list of ignored prefixes for module index sorting.
-# modindex_common_prefix = []
-
-
-# -- Options for HTML output ---------------------------------------------
-
-# The theme to use for HTML and HTML Help pages.  Major themes that come with
-# Sphinx are currently 'default' and 'sphinxdoc'.
-html_theme = 'agogo'
-
-# The style sheet to use for HTML and HTML Help pages. A file of that name
-# must exist either in Sphinx' static/ path, or in one of the custom paths
-# given in html_static_path.
-# html_style = 'statsmodels.css'
-
-# Theme options are theme-specific and customize the look and feel of a theme
-# further.  For a list of options available for each theme, see the
-# documentation.
-# html_theme_options = {}
-
-# Add any paths that contain custom themes here, relative to this directory.
-html_theme_path = ['themes']
-
-# The name for this set of Sphinx documents.  If None, it defaults to
-# "<project> v<release> documentation".
-html_title = 'Vbench performance benchmarks for pandas'
-
-# A shorter title for the navigation bar.  Default is the same as html_title.
-# html_short_title = None
-
-# The name of an image file (relative to this directory) to place at the top
-# of the sidebar.
-# html_logo = None
-
-# The name of an image file (within the static path) to use as favicon of the
-# docs.  This file should be a Windows icon file (.ico) being 16x16 or 32x32
-# pixels large.
-# html_favicon = None
-
-# Add any paths that contain custom static files (such as style sheets) here,
-# relative to this directory. They are copied after the builtin static files,
-# so a file named "default.css" will overwrite the builtin "default.css".
-html_static_path = ['_static']
-
-# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
-# using the given strftime format.
-# html_last_updated_fmt = '%b %d, %Y'
-
-# If true, SmartyPants will be used to convert quotes and dashes to
-# typographically correct entities.
-# html_use_smartypants = True
-
-# Custom sidebar templates, maps document names to template names.
-# html_sidebars = {}
-
-# Additional templates that should be rendered to pages, maps page names to
-# template names.
-# html_additional_pages = {}
-
-# If false, no module index is generated.
-html_use_modindex = True
-
-# If false, no index is generated.
-# html_use_index = True
-
-# If true, the index is split into individual pages for each letter.
-# html_split_index = False
-
-# If true, links to the reST sources are added to the pages.
-# html_show_sourcelink = True
-
-# If true, an OpenSearch description file will be output, and all pages will
-# contain a <link> tag referring to it.  The value of this option must be the
-# base URL from which the finished HTML is served.
-# html_use_opensearch = ''
-
-# If nonempty, this is the file name suffix for HTML files (e.g. ".xhtml").
-# html_file_suffix = ''
-
-# Output file base name for HTML help builder.
-htmlhelp_basename = 'performance'
-
-
-# -- Options for LaTeX output --------------------------------------------
-
-# The paper size ('letter' or 'a4').
-# latex_paper_size = 'letter'
-
-# The font size ('10pt', '11pt' or '12pt').
-# latex_font_size = '10pt'
-
-# Grouping the document tree into LaTeX files. List of tuples
-# (source start file, target name, title, author, documentclass [howto/manual]).
-latex_documents = [ - ('index', 'performance.tex', - u'pandas vbench Performance Benchmarks', - u'Wes McKinney', 'manual'), -] - -# The name of an image file (relative to this directory) to place at the top of -# the title page. -# latex_logo = None - -# For "manual" documents, if this is true, then toplevel headings are parts, -# not chapters. -# latex_use_parts = False - -# Additional stuff for the LaTeX preamble. -# latex_preamble = '' - -# Documents to append as an appendix to all manuals. -# latex_appendices = [] - -# If false, no module index is generated. -# latex_use_modindex = True - - -# Example configuration for intersphinx: refer to the Python standard library. -# intersphinx_mapping = {'http://docs.scipy.org/': None} -import glob -autosummary_generate = glob.glob("*.rst") diff --git a/vb_suite/source/themes/agogo/layout.html b/vb_suite/source/themes/agogo/layout.html deleted file mode 100644 index cd0f3d7ffc9c7..0000000000000 --- a/vb_suite/source/themes/agogo/layout.html +++ /dev/null @@ -1,95 +0,0 @@ -{# - agogo/layout.html - ~~~~~~~~~~~~~~~~~ - - Sphinx layout template for the agogo theme, originally written - by Andi Albrecht. - - :copyright: Copyright 2007-2011 by the Sphinx team, see AUTHORS. - :license: BSD, see LICENSE for details. -#} -{% extends "basic/layout.html" %} - -{% block header %} -
-    {%- if logo %}
-    {%- endif %}
-    {%- block headertitle %}
-    {{ shorttitle|e }}
-    {%- endblock %}
-    {%- for rellink in rellinks|reverse %}
-    {{ rellink[3] }}
-    {%- if not loop.last %}{{ reldelim2 }}{% endif %}
-    {%- endfor %}
-{% endblock %}
-
-{% block content %}
-    {%- block document %}
-      {{ super() }}
-    {%- endblock %}
-{% endblock %} - -{% block footer %} - -{% endblock %} - -{% block relbar1 %}{% endblock %} -{% block relbar2 %}{% endblock %} diff --git a/vb_suite/source/themes/agogo/static/agogo.css_t b/vb_suite/source/themes/agogo/static/agogo.css_t deleted file mode 100644 index ef909b72e20f6..0000000000000 --- a/vb_suite/source/themes/agogo/static/agogo.css_t +++ /dev/null @@ -1,476 +0,0 @@ -/* - * agogo.css_t - * ~~~~~~~~~~~ - * - * Sphinx stylesheet -- agogo theme. - * - * :copyright: Copyright 2007-2011 by the Sphinx team, see AUTHORS. - * :license: BSD, see LICENSE for details. - * - */ - -* { - margin: 0px; - padding: 0px; -} - -body { - font-family: {{ theme_bodyfont }}; - line-height: 1.4em; - color: black; - background-color: {{ theme_bgcolor }}; -} - - -/* Page layout */ - -div.header, div.content, div.footer { - max-width: {{ theme_pagewidth }}; - margin-left: auto; - margin-right: auto; -} - -div.header-wrapper { - background: {{ theme_headerbg }}; - padding: 1em 1em 0; - border-bottom: 3px solid #2e3436; - min-height: 0px; -} - - -/* Default body styles */ -a { - color: {{ theme_linkcolor }}; -} - -div.bodywrapper a, div.footer a { - text-decoration: underline; -} - -.clearer { - clear: both; -} - -.left { - float: left; -} - -.right { - float: right; -} - -.line-block { - display: block; - margin-top: 1em; - margin-bottom: 1em; -} - -.line-block .line-block { - margin-top: 0; - margin-bottom: 0; - margin-left: 1.5em; -} - -h1, h2, h3, h4 { - font-family: {{ theme_headerfont }}; - font-weight: normal; - color: {{ theme_headercolor2 }}; - margin-bottom: .8em; -} - -h1 { - color: {{ theme_headercolor1 }}; -} - -h2 { - padding-bottom: .5em; - border-bottom: 1px solid {{ theme_headercolor2 }}; -} - -a.headerlink { - visibility: hidden; - color: #dddddd; - padding-left: .3em; -} - -h1:hover > a.headerlink, -h2:hover > a.headerlink, -h3:hover > a.headerlink, -h4:hover > a.headerlink, -h5:hover > a.headerlink, -h6:hover > a.headerlink, -dt:hover > a.headerlink { - visibility: visible; -} - -img { - border: 0; -} - -pre { - background-color: #EEE; - padding: 0.5em; -} - -div.admonition { - margin-top: 10px; - margin-bottom: 10px; - padding: 2px 7px 1px 7px; - border-left: 0.2em solid black; -} - -p.admonition-title { - margin: 0px 10px 5px 0px; - font-weight: bold; -} - -dt:target, .highlighted { - background-color: #fbe54e; -} - -/* Header */ - -/* -div.header { - padding-top: 10px; - padding-bottom: 10px; -} -*/ - -div.header {} - -div.header h1 { - font-family: {{ theme_headerfont }}; - font-weight: normal; - font-size: 180%; - letter-spacing: .08em; -} - -div.header h1 a { - color: white; -} - -div.header div.rel { - text-decoration: none; -} -/* margin-top: 1em; */ - -div.header div.rel a { - margin-top: 1em; - color: {{ theme_headerlinkcolor }}; - letter-spacing: .1em; - text-transform: uppercase; - padding: 3px 1em; -} - -p.logo { - float: right; -} - -img.logo { - border: 0; -} - - -/* Content */ -div.content-wrapper { - background-color: white; - padding: 1em; -} -/* - padding-top: 20px; - padding-bottom: 20px; -*/ - -/* float: left; */ - -div.document { - max-width: {{ theme_documentwidth }}; -} - -div.body { - padding-right: 2em; - text-align: {{ theme_textalign }}; -} - -div.document ul { - margin: 1.5em; - list-style-type: square; -} - -div.document dd { - margin-left: 1.2em; - margin-top: .4em; - margin-bottom: 1em; -} - -div.document .section { - margin-top: 1.7em; -} -div.document .section:first-child { - margin-top: 0px; -} - -div.document div.highlight { - padding: 3px; - 
background-color: #eeeeec; - border-top: 2px solid #dddddd; - border-bottom: 2px solid #dddddd; - margin-top: .8em; - margin-bottom: .8em; -} - -div.document h2 { - margin-top: .7em; -} - -div.document p { - margin-bottom: .5em; -} - -div.document li.toctree-l1 { - margin-bottom: 1em; -} - -div.document .descname { - font-weight: bold; -} - -div.document .docutils.literal { - background-color: #eeeeec; - padding: 1px; -} - -div.document .docutils.xref.literal { - background-color: transparent; - padding: 0px; -} - -div.document blockquote { - margin: 1em; -} - -div.document ol { - margin: 1.5em; -} - - -/* Sidebar */ - - -div.sidebar { - width: {{ theme_sidebarwidth }}; - padding: 0 1em; - float: right; - font-size: .93em; -} - -div.sidebar a, div.header a { - text-decoration: none; -} - -div.sidebar a:hover, div.header a:hover { - text-decoration: underline; -} - -div.sidebar h3 { - color: #2e3436; - text-transform: uppercase; - font-size: 130%; - letter-spacing: .1em; -} - -div.sidebar ul { - list-style-type: none; -} - -div.sidebar li.toctree-l1 a { - display: block; - padding: 1px; - border: 1px solid #dddddd; - background-color: #eeeeec; - margin-bottom: .4em; - padding-left: 3px; - color: #2e3436; -} - -div.sidebar li.toctree-l2 a { - background-color: transparent; - border: none; - margin-left: 1em; - border-bottom: 1px solid #dddddd; -} - -div.sidebar li.toctree-l3 a { - background-color: transparent; - border: none; - margin-left: 2em; - border-bottom: 1px solid #dddddd; -} - -div.sidebar li.toctree-l2:last-child a { - border-bottom: none; -} - -div.sidebar li.toctree-l1.current a { - border-right: 5px solid {{ theme_headerlinkcolor }}; -} - -div.sidebar li.toctree-l1.current li.toctree-l2 a { - border-right: none; -} - - -/* Footer */ - -div.footer-wrapper { - background: {{ theme_footerbg }}; - border-top: 4px solid #babdb6; - padding-top: 10px; - padding-bottom: 10px; - min-height: 80px; -} - -div.footer, div.footer a { - color: #888a85; -} - -div.footer .right { - text-align: right; -} - -div.footer .left { - text-transform: uppercase; -} - - -/* Styles copied from basic theme */ - -img.align-left, .figure.align-left, object.align-left { - clear: left; - float: left; - margin-right: 1em; -} - -img.align-right, .figure.align-right, object.align-right { - clear: right; - float: right; - margin-left: 1em; -} - -img.align-center, .figure.align-center, object.align-center { - display: block; - margin-left: auto; - margin-right: auto; -} - -.align-left { - text-align: left; -} - -.align-center { - clear: both; - text-align: center; -} - -.align-right { - text-align: right; -} - -/* -- search page ----------------------------------------------------------- */ - -ul.search { - margin: 10px 0 0 20px; - padding: 0; -} - -ul.search li { - padding: 5px 0 5px 20px; - background-image: url(file.png); - background-repeat: no-repeat; - background-position: 0 7px; -} - -ul.search li a { - font-weight: bold; -} - -ul.search li div.context { - color: #888; - margin: 2px 0 0 30px; - text-align: left; -} - -ul.keywordmatches li.goodmatch a { - font-weight: bold; -} - -/* -- index page ------------------------------------------------------------ */ - -table.contentstable { - width: 90%; -} - -table.contentstable p.biglink { - line-height: 150%; -} - -a.biglink { - font-size: 1.3em; -} - -span.linkdescr { - font-style: italic; - padding-top: 5px; - font-size: 90%; -} - -/* -- general index --------------------------------------------------------- */ - -table.indextable td { - text-align: left; - 
vertical-align: top;
-}
-
-table.indextable dl, table.indextable dd {
-  margin-top: 0;
-  margin-bottom: 0;
-}
-
-table.indextable tr.pcap {
-  height: 10px;
-}
-
-table.indextable tr.cap {
-  margin-top: 10px;
-  background-color: #f2f2f2;
-}
-
-img.toggler {
-  margin-right: 3px;
-  margin-top: 3px;
-  cursor: pointer;
-}
-
-/* -- viewcode extension ---------------------------------------------------- */
-
-.viewcode-link {
-  float: right;
-}
-
-.viewcode-back {
-  float: right;
-  font-family: {{ theme_bodyfont }};
-}
-
-div.viewcode-block:target {
-  margin: -1px -3px;
-  padding: 0 3px;
-  background-color: #f4debf;
-  border-top: 1px solid #ac9;
-  border-bottom: 1px solid #ac9;
-}
-
-th.field-name {
-  white-space: nowrap;
-}
diff --git a/vb_suite/source/themes/agogo/static/bgfooter.png b/vb_suite/source/themes/agogo/static/bgfooter.png
deleted file mode 100644
index 9ce5bdd902943..0000000000000
Binary files a/vb_suite/source/themes/agogo/static/bgfooter.png and /dev/null differ
diff --git a/vb_suite/source/themes/agogo/static/bgtop.png b/vb_suite/source/themes/agogo/static/bgtop.png
deleted file mode 100644
index a0d4709bac8f7..0000000000000
Binary files a/vb_suite/source/themes/agogo/static/bgtop.png and /dev/null differ
diff --git a/vb_suite/source/themes/agogo/theme.conf b/vb_suite/source/themes/agogo/theme.conf
deleted file mode 100644
index 3fc88580f1ab4..0000000000000
--- a/vb_suite/source/themes/agogo/theme.conf
+++ /dev/null
@@ -1,19 +0,0 @@
-[theme]
-inherit = basic
-stylesheet = agogo.css
-pygments_style = tango
-
-[options]
-bodyfont = "Verdana", Arial, sans-serif
-headerfont = "Georgia", "Times New Roman", serif
-pagewidth = 70em
-documentwidth = 50em
-sidebarwidth = 20em
-bgcolor = #eeeeec
-headerbg = url(bgtop.png) top left repeat-x
-footerbg = url(bgfooter.png) top left repeat-x
-linkcolor = #ce5c00
-headercolor1 = #204a87
-headercolor2 = #3465a4
-headerlinkcolor = #fcaf3e
-textalign = justify
\ No newline at end of file
diff --git a/vb_suite/sparse.py b/vb_suite/sparse.py
deleted file mode 100644
index 53e2778ee0865..0000000000000
--- a/vb_suite/sparse.py
+++ /dev/null
@@ -1,65 +0,0 @@
-from vbench.benchmark import Benchmark
-from datetime import datetime
-
-common_setup = """from .pandas_vb_common import *
-"""
-
-#----------------------------------------------------------------------
-
-setup = common_setup + """
-from pandas.core.sparse import SparseSeries, SparseDataFrame
-
-K = 50
-N = 50000
-rng = np.asarray(date_range('1/1/2000', periods=N,
-                            freq='T'))
-
-# rng2 = np.asarray(rng).astype('M8[ns]').astype('i8')
-
-series = {}
-for i in range(1, K + 1):
-    data = np.random.randn(N)[:-i]
-    this_rng = rng[:-i]
-    data[100:] = np.nan
-    series[i] = SparseSeries(data, index=this_rng)
-"""
-stmt = "SparseDataFrame(series)"
-
-bm_sparse1 = Benchmark(stmt, setup, name="sparse_series_to_frame",
-                       start_date=datetime(2011, 6, 1))
-
-
-setup = common_setup + """
-from pandas.core.sparse import SparseDataFrame
-"""
-
-stmt = "SparseDataFrame(columns=np.arange(100), index=np.arange(1000))"
-
-sparse_constructor = Benchmark(stmt, setup, name="sparse_frame_constructor",
-                               start_date=datetime(2012, 6, 1))
-
-
-setup = common_setup + """
-s = pd.Series([np.nan] * 10000)
-s[0] = 3.0
-s[100] = -1.0
-s[999] = 12.1
-s.index = pd.MultiIndex.from_product((range(10), range(10), range(10), range(10)))
-ss = s.to_sparse()
-"""
-
-stmt = "ss.to_coo(row_levels=[0, 1], column_levels=[2, 3], sort_labels=True)"
-
-sparse_series_to_coo = Benchmark(stmt, setup, name="sparse_series_to_coo",
-
start_date=datetime(2015, 1, 3)) - -setup = common_setup + """ -import scipy.sparse -import pandas.sparse.series -A = scipy.sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(100, 100)) -""" - -stmt = "ss = pandas.sparse.series.SparseSeries.from_coo(A)" - -sparse_series_from_coo = Benchmark(stmt, setup, name="sparse_series_from_coo", - start_date=datetime(2015, 1, 3)) diff --git a/vb_suite/stat_ops.py b/vb_suite/stat_ops.py deleted file mode 100644 index 8d7c30dc9fdcf..0000000000000 --- a/vb_suite/stat_ops.py +++ /dev/null @@ -1,126 +0,0 @@ -from vbench.benchmark import Benchmark -from datetime import datetime - -common_setup = """from .pandas_vb_common import * -""" - -#---------------------------------------------------------------------- -# nanops - -setup = common_setup + """ -s = Series(np.random.randn(100000), index=np.arange(100000)) -s[::2] = np.nan -""" - -stat_ops_series_std = Benchmark("s.std()", setup) - -#---------------------------------------------------------------------- -# ops by level - -setup = common_setup + """ -index = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], - labels=[np.arange(10).repeat(10000), - np.tile(np.arange(100).repeat(100), 10), - np.tile(np.tile(np.arange(100), 100), 10)]) -random.shuffle(index.values) -df = DataFrame(np.random.randn(len(index), 4), index=index) -df_level = DataFrame(np.random.randn(100, 4), index=index.levels[1]) -""" - -stat_ops_level_frame_sum = \ - Benchmark("df.sum(level=1)", setup, - start_date=datetime(2011, 11, 15)) - -stat_ops_level_frame_sum_multiple = \ - Benchmark("df.sum(level=[0, 1])", setup, repeat=1, - start_date=datetime(2011, 11, 15)) - -stat_ops_level_series_sum = \ - Benchmark("df[1].sum(level=1)", setup, - start_date=datetime(2011, 11, 15)) - -stat_ops_level_series_sum_multiple = \ - Benchmark("df[1].sum(level=[0, 1])", setup, repeat=1, - start_date=datetime(2011, 11, 15)) - -sum_setup = common_setup + """ -df = DataFrame(np.random.randn(100000, 4)) -dfi = DataFrame(np.random.randint(1000, size=df.shape)) -""" - -stat_ops_frame_sum_int_axis_0 = \ - Benchmark("dfi.sum()", sum_setup, start_date=datetime(2013, 7, 25)) - -stat_ops_frame_sum_float_axis_0 = \ - Benchmark("df.sum()", sum_setup, start_date=datetime(2013, 7, 25)) - -stat_ops_frame_mean_int_axis_0 = \ - Benchmark("dfi.mean()", sum_setup, start_date=datetime(2013, 7, 25)) - -stat_ops_frame_mean_float_axis_0 = \ - Benchmark("df.mean()", sum_setup, start_date=datetime(2013, 7, 25)) - -stat_ops_frame_sum_int_axis_1 = \ - Benchmark("dfi.sum(1)", sum_setup, start_date=datetime(2013, 7, 25)) - -stat_ops_frame_sum_float_axis_1 = \ - Benchmark("df.sum(1)", sum_setup, start_date=datetime(2013, 7, 25)) - -stat_ops_frame_mean_int_axis_1 = \ - Benchmark("dfi.mean(1)", sum_setup, start_date=datetime(2013, 7, 25)) - -stat_ops_frame_mean_float_axis_1 = \ - Benchmark("df.mean(1)", sum_setup, start_date=datetime(2013, 7, 25)) - -#---------------------------------------------------------------------- -# rank - -setup = common_setup + """ -values = np.concatenate([np.arange(100000), - np.random.randn(100000), - np.arange(100000)]) -s = Series(values) -""" - -stats_rank_average = Benchmark('s.rank()', setup, - start_date=datetime(2011, 12, 12)) - -stats_rank_pct_average = Benchmark('s.rank(pct=True)', setup, - start_date=datetime(2014, 1, 16)) -stats_rank_pct_average_old = Benchmark('s.rank() / len(s)', setup, - start_date=datetime(2014, 1, 16)) -setup = common_setup + """ -values = np.random.randint(0, 100000, size=200000) -s = 
Series(values) -""" - -stats_rank_average_int = Benchmark('s.rank()', setup, - start_date=datetime(2011, 12, 12)) - -setup = common_setup + """ -df = DataFrame(np.random.randn(5000, 50)) -""" - -stats_rank2d_axis1_average = Benchmark('df.rank(1)', setup, - start_date=datetime(2011, 12, 12)) - -stats_rank2d_axis0_average = Benchmark('df.rank()', setup, - start_date=datetime(2011, 12, 12)) - -# rolling functions - -setup = common_setup + """ -arr = np.random.randn(100000) -""" - -stats_rolling_mean = Benchmark('rolling_mean(arr, 100)', setup, - start_date=datetime(2011, 6, 1)) - -# spearman correlation - -setup = common_setup + """ -df = DataFrame(np.random.randn(1000, 30)) -""" - -stats_corr_spearman = Benchmark("df.corr(method='spearman')", setup, - start_date=datetime(2011, 12, 4)) diff --git a/vb_suite/strings.py b/vb_suite/strings.py deleted file mode 100644 index 0948df5673a0d..0000000000000 --- a/vb_suite/strings.py +++ /dev/null @@ -1,59 +0,0 @@ -from vbench.api import Benchmark - -common_setup = """from .pandas_vb_common import * -""" - -setup = common_setup + """ -import string -import itertools as IT - -def make_series(letters, strlen, size): - return Series( - [str(x) for x in np.fromiter(IT.cycle(letters), count=size*strlen, dtype='|S1') - .view('|S{}'.format(strlen))]) - -many = make_series('matchthis'+string.ascii_uppercase, strlen=19, size=10000) # 31% matches -few = make_series('matchthis'+string.ascii_uppercase*42, strlen=19, size=10000) # 1% matches -""" - -strings_cat = Benchmark("many.str.cat(sep=',')", setup) -strings_title = Benchmark("many.str.title()", setup) -strings_count = Benchmark("many.str.count('matchthis')", setup) -strings_contains_many = Benchmark("many.str.contains('matchthis')", setup) -strings_contains_few = Benchmark("few.str.contains('matchthis')", setup) -strings_contains_many_noregex = Benchmark( - "many.str.contains('matchthis', regex=False)", setup) -strings_contains_few_noregex = Benchmark( - "few.str.contains('matchthis', regex=False)", setup) -strings_startswith = Benchmark("many.str.startswith('matchthis')", setup) -strings_endswith = Benchmark("many.str.endswith('matchthis')", setup) -strings_lower = Benchmark("many.str.lower()", setup) -strings_upper = Benchmark("many.str.upper()", setup) -strings_replace = Benchmark("many.str.replace(r'(matchthis)', r'\1\1')", setup) -strings_repeat = Benchmark( - "many.str.repeat(list(IT.islice(IT.cycle(range(1,4)),len(many))))", setup) -strings_match = Benchmark("many.str.match(r'mat..this')", setup) -strings_extract = Benchmark("many.str.extract(r'(\w*)matchthis(\w*)')", setup) -strings_join_split = Benchmark("many.str.join(r'--').str.split('--')", setup) -strings_join_split_expand = Benchmark("many.str.join(r'--').str.split('--',expand=True)", setup) -strings_len = Benchmark("many.str.len()", setup) -strings_findall = Benchmark("many.str.findall(r'[A-Z]+')", setup) -strings_pad = Benchmark("many.str.pad(100, side='both')", setup) -strings_center = Benchmark("many.str.center(100)", setup) -strings_slice = Benchmark("many.str.slice(5,15,2)", setup) -strings_strip = Benchmark("many.str.strip('matchthis')", setup) -strings_lstrip = Benchmark("many.str.lstrip('matchthis')", setup) -strings_rstrip = Benchmark("many.str.rstrip('matchthis')", setup) -strings_get = Benchmark("many.str.get(0)", setup) - -setup = setup + """ -s = make_series(string.ascii_uppercase, strlen=10, size=10000).str.join('|') -""" -strings_get_dummies = Benchmark("s.str.get_dummies('|')", setup) - -setup = common_setup + """ -import 
pandas.util.testing as testing
-ser = Series(testing.makeUnicodeIndex())
-"""
-
-strings_encode_decode = Benchmark("ser.str.encode('utf-8').str.decode('utf-8')", setup)
diff --git a/vb_suite/suite.py b/vb_suite/suite.py
deleted file mode 100644
index 45053b6610896..0000000000000
--- a/vb_suite/suite.py
+++ /dev/null
@@ -1,164 +0,0 @@
-from vbench.api import Benchmark, GitRepo
-from datetime import datetime
-
-import os
-
-modules = ['attrs_caching',
-           'binary_ops',
-           'ctors',
-           'frame_ctor',
-           'frame_methods',
-           'groupby',
-           'index_object',
-           'indexing',
-           'io_bench',
-           'io_sql',
-           'inference',
-           'hdfstore_bench',
-           'join_merge',
-           'gil',
-           'miscellaneous',
-           'panel_ctor',
-           'packers',
-           'parser_vb',
-           'panel_methods',
-           'plotting',
-           'reindex',
-           'replace',
-           'sparse',
-           'strings',
-           'reshape',
-           'stat_ops',
-           'timeseries',
-           'timedelta',
-           'eval']
-
-by_module = {}
-benchmarks = []
-
-for modname in modules:
-    ref = __import__(modname)
-    by_module[modname] = [v for v in ref.__dict__.values()
-                          if isinstance(v, Benchmark)]
-    benchmarks.extend(by_module[modname])
-
-for bm in benchmarks:
-    assert(bm.name is not None)
-
-import getpass
-import sys
-
-USERNAME = getpass.getuser()
-
-if sys.platform == 'darwin':
-    HOME = '/Users/%s' % USERNAME
-else:
-    HOME = '/home/%s' % USERNAME
-
-try:
-    import ConfigParser
-
-    config = ConfigParser.ConfigParser()
-    config.readfp(open(os.path.expanduser('~/.vbenchcfg')))
-
-    REPO_PATH = config.get('setup', 'repo_path')
-    REPO_URL = config.get('setup', 'repo_url')
-    DB_PATH = config.get('setup', 'db_path')
-    TMP_DIR = config.get('setup', 'tmp_dir')
-except:
-    REPO_PATH = os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))
-    REPO_URL = 'git@github.com:pandas-dev/pandas.git'
-    DB_PATH = os.path.join(REPO_PATH, 'vb_suite/benchmarks.db')
-    TMP_DIR = os.path.join(HOME, 'tmp/vb_pandas')
-
-PREPARE = """
-python setup.py clean
-"""
-BUILD = """
-python setup.py build_ext --inplace
-"""
-dependencies = ['pandas_vb_common.py']
-
-START_DATE = datetime(2010, 6, 1)
-
-# repo = GitRepo(REPO_PATH)
-
-RST_BASE = 'source'
-
-# HACK!
-
-# timespan = [datetime(2011, 1, 1), datetime(2012, 1, 1)]
-
-
-def generate_rst_files(benchmarks):
-    import matplotlib as mpl
-    mpl.use('Agg')
-    import matplotlib.pyplot as plt
-
-    vb_path = os.path.join(RST_BASE, 'vbench')
-    fig_base_path = os.path.join(vb_path, 'figures')
-
-    if not os.path.exists(vb_path):
-        print('creating %s' % vb_path)
-        os.makedirs(vb_path)
-
-    if not os.path.exists(fig_base_path):
-        print('creating %s' % fig_base_path)
-        os.makedirs(fig_base_path)
-
-    for bmk in benchmarks:
-        print('Generating rst file for %s' % bmk.name)
-        rst_path = os.path.join(RST_BASE, 'vbench/%s.txt' % bmk.name)
-
-        fig_full_path = os.path.join(fig_base_path, '%s.png' % bmk.name)
-
-        # make the figure
-        plt.figure(figsize=(10, 6))
-        ax = plt.gca()
-        bmk.plot(DB_PATH, ax=ax)
-
-        start, end = ax.get_xlim()
-
-        plt.xlim([start - 30, end + 30])
-        plt.savefig(fig_full_path, bbox_inches='tight')
-        plt.close('all')
-
-        fig_rel_path = 'vbench/figures/%s.png' % bmk.name
-        rst_text = bmk.to_rst(image_path=fig_rel_path)
-        with open(rst_path, 'w') as f:
-            f.write(rst_text)
-
-    with open(os.path.join(RST_BASE, 'index.rst'), 'w') as f:
-        print >> f, """
-Performance Benchmarks
-======================
-
-These historical benchmark graphs were produced with `vbench
-<https://github.com/pydata/vbench>`__.
-
-The ``.pandas_vb_common`` setup script can be found here_
-
-.. _here: https://github.com/pandas-dev/pandas/tree/master/vb_suite
-
-Produced on a machine with
-
-  - Intel Core i7 950 processor
-  - (K)ubuntu Linux 12.10
-  - Python 2.7.2 64-bit (Enthought Python Distribution 7.1-2)
-  - NumPy 1.6.1
-
-.. toctree::
-   :hidden:
-   :maxdepth: 3
-"""
-    for modname, mod_bmks in sorted(by_module.items()):
-        print >> f, '   vb_%s' % modname
-        modpath = os.path.join(RST_BASE, 'vb_%s.rst' % modname)
-        with open(modpath, 'w') as mh:
-            header = '%s\n%s\n\n' % (modname, '=' * len(modname))
-            print >> mh, header
-
-            for bmk in mod_bmks:
-                print >> mh, bmk.name
-                print >> mh, '-' * len(bmk.name)
-                print >> mh, '.. include:: vbench/%s.txt\n' % bmk.name
diff --git a/vb_suite/test.py b/vb_suite/test.py
deleted file mode 100644
index da30c3e1a5f76..0000000000000
--- a/vb_suite/test.py
+++ /dev/null
@@ -1,67 +0,0 @@
-from pandas import *
-import matplotlib.pyplot as plt
-
-import sqlite3
-
-from vbench.git import GitRepo
-
-
-REPO_PATH = '/home/adam/code/pandas'
-repo = GitRepo(REPO_PATH)
-
-con = sqlite3.connect('vb_suite/benchmarks.db')
-
-bmk = '36900a889961162138c140ce4ae3c205'
-# bmk = '9d7b8c04b532df6c2d55ef497039b0ce'
-bmk = '4481aa4efa9926683002a673d2ed3dac'
-bmk = '00593cd8c03d769669d7b46585161726'
-bmk = '3725ab7cd0a0657d7ae70f171c877cea'
-bmk = '3cd376d6d6ef802cdea49ac47a67be21'
-bmk2 = '459225186023853494bc345fd180f395'
-bmk = 'c22ca82e0cfba8dc42595103113c7da3'
-bmk = 'e0e651a8e9fbf0270ab68137f8b9df5f'
-bmk = '96bda4b9a60e17acf92a243580f2a0c3'
-
-
-def get_results(bmk):
-    results = con.execute(
-        "select * from results where checksum='%s'" % bmk).fetchall()
-    x = Series(dict((t[1], t[3]) for t in results))
-    x.index = x.index.map(repo.timestamps.get)
-    x = x.sort_index()
-    return x
-
-x = get_results(bmk)
-
-
-def graph1():
-    dm_getitem = get_results('459225186023853494bc345fd180f395')
-    dm_getvalue = get_results('c22ca82e0cfba8dc42595103113c7da3')
-
-    plt.figure()
-    ax = plt.gca()
-
-    dm_getitem.plot(label='df[col][idx]', ax=ax)
-    dm_getvalue.plot(label='df.get_value(idx, col)', ax=ax)
-
-    plt.ylabel('ms')
-    plt.legend(loc='best')
-
-
-def graph2():
-    bm = get_results('96bda4b9a60e17acf92a243580f2a0c3')
-    plt.figure()
-    ax = plt.gca()
-
-    bm.plot(ax=ax)
-    plt.ylabel('ms')
-
-bm = get_results('36900a889961162138c140ce4ae3c205')
-fig = plt.figure()
-ax = plt.gca()
-bm.plot(ax=ax)
-fig.autofmt_xdate()
-
-plt.xlim([bm.dropna().index[0] - datetools.MonthEnd(),
-          bm.dropna().index[-1] + datetools.MonthEnd()])
-plt.ylabel('ms')
diff --git a/vb_suite/test_perf.py b/vb_suite/test_perf.py
deleted file mode 100755
index be546b72f9465..0000000000000
--- a/vb_suite/test_perf.py
+++ /dev/null
@@ -1,616 +0,0 @@
-#!/usr/bin/env python
-# -*- coding: utf-8 -*-
-
-"""
-What
-----
-vbench is a library which can be used to benchmark the performance
-of a codebase over time.
-Although vbench can collect data over many commits, generate plots
-and other niceties, for Pull-Requests the important thing is the
-performance of the HEAD commit against a known-good baseline.
-
-This script tries to automate the process of comparing these
-two commits, and is meant to run out of the box on a fresh
-clone.
-
-How
----
-These are the steps taken:
-1) create a temp directory into which vbench will clone the temporary repo.
-2) instantiate a vbench runner, using the local repo as the source repo.
-3) perform a vbench run for the baseline commit, then the target commit.
-4) pull the results for both commits from the db. use pandas to align
-everything and calculate a ratio for the timing information.
-5) print the results to the log file and to stdout.
-
-"""
-
-# IMPORTANT NOTE
-
-#
-# This script should run on pandas versions at least as far back as 0.9.1.
-# devs should be able to use the latest version of this script with
-# any dusty old commit and expect it to "just work".
-# One way in which this is useful is when collecting historical data,
-# where writing some logic around this script may prove easier
-# in some cases than running vbench directly (think perf bisection).
-#
-# *please*, when you modify this script for whatever reason,
-# make sure you do not break its functionality when running under older
-# pandas versions.
-# Note that deprecation warnings are turned off in main(), so there's
-# no need to change the actual code to suppress such warnings.
-
-import shutil
-import os
-import sys
-import argparse
-import tempfile
-import time
-import re
-
-import random
-import numpy as np
-
-import pandas as pd
-from pandas import DataFrame, Series
-
-from suite import REPO_PATH
-VB_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
-DEFAULT_MIN_DURATION = 0.01
-HEAD_COL="head[ms]"
-BASE_COL="base[ms]"
-
-try:
-    import git  # gitpython
-except Exception:
-    print("Error: Please install the `gitpython` package\n")
-    sys.exit(1)
-
-class RevParseAction(argparse.Action):
-    def __call__(self, parser, namespace, values, option_string=None):
-        import subprocess
-        cmd = 'git rev-parse --short --verify {0}^{{commit}}'.format(values)
-        rev_parse = subprocess.check_output(cmd, shell=True)
-        setattr(namespace, self.dest, rev_parse.strip())
-
-
-parser = argparse.ArgumentParser(description='Use vbench to measure and compare the performance of commits.')
-parser.add_argument('-H', '--head',
-                    help='Execute vbenches using the currently checked out copy.',
-                    dest='head',
-                    action='store_true',
-                    default=False)
-parser.add_argument('-b', '--base-commit',
-                    help='The commit serving as performance baseline ',
-                    type=str, action=RevParseAction)
-parser.add_argument('-t', '--target-commit',
-                    help='The commit to compare against the baseline (default: HEAD).',
-                    type=str, action=RevParseAction)
-parser.add_argument('--base-pickle',
-                    help='name of pickle file with timings data generated by a former `-H -d FILE` run. '\
-                         'filename must be of the form -*.* or specify --base-commit separately',
-                    type=str)
-parser.add_argument('--target-pickle',
-                    help='name of pickle file with timings data generated by a former `-H -d FILE` run '\
-                         'filename must be of the form -*.* or specify --target-commit separately',
-                    type=str)
-parser.add_argument('-m', '--min-duration',
-                    help='Minimum duration (in ms) of baseline test for inclusion in report (default: %.3f).' % DEFAULT_MIN_DURATION,
-                    type=float,
-                    default=0.01)
-parser.add_argument('-o', '--output',
-                    metavar="",
-                    dest='log_file',
-                    help='Path of file in which to save the textual report (default: vb_suite.log).')
-parser.add_argument('-d', '--outdf',
-                    metavar="FNAME",
-                    dest='outdf',
-                    default=None,
-                    help='Name of file to df.save() the result table into. Will overwrite')
-parser.add_argument('-r', '--regex',
-                    metavar="REGEX",
-                    dest='regex',
-                    default="",
-                    help='Regex pattern; only tests whose name matches the regex will be run.')
-parser.add_argument('-s', '--seed',
-                    metavar="SEED",
-                    dest='seed',
-                    default=1234,
-                    type=int,
-                    help='Integer value to seed PRNG with')
-parser.add_argument('-n', '--repeats',
-                    metavar="N",
-                    dest='repeats',
-                    default=3,
-                    type=int,
-                    help='Number of times to run each vbench, result value is the best of')
-parser.add_argument('-c', '--ncalls',
-                    metavar="N",
-                    dest='ncalls',
-                    default=3,
-                    type=int,
-                    help='Number of calls to perform in each repetition of a vbench')
-parser.add_argument('-N', '--hrepeats',
-                    metavar="N",
-                    dest='hrepeats',
-                    default=1,
-                    type=int,
-                    help='implies -H, number of times to run the vbench suite on the head commit.\n'
-                         'Each iteration will yield another column in the output' )
-parser.add_argument('-a', '--affinity',
-                    metavar="a",
-                    dest='affinity',
-                    default=1,
-                    type=int,
-                    help='set processor affinity of the process; by default bind to cpu/core #1 only. '
-                         'Requires the "affinity" or "psutil" python module, will raise Warning otherwise')
-parser.add_argument('-u', '--burnin',
-                    metavar="u",
-                    dest='burnin',
-                    default=1,
-                    type=int,
-                    help='Number of extra iterations per benchmark to perform first, then throw away. ' )
-
-parser.add_argument('-S', '--stats',
-                    default=False,
-                    action='store_true',
-                    help='when specified with -N, prints the output of describe() per vbench results. ' )
-
-parser.add_argument('--temp-dir',
-                    metavar="PATH",
-                    default=None,
-                    help='Specify temp work dir to use. ccache depends on builds being invoked from consistent directory.' )
-
-parser.add_argument('-q', '--quiet',
-                    default=False,
-                    action='store_true',
-                    help='Suppress report output to stdout. ' )
-
-def get_results_df(db, rev):
-    """Takes a git commit hash and returns a Dataframe of benchmark results
-    """
-    bench = DataFrame(db.get_benchmarks())
-    results = DataFrame(map(list,db.get_rev_results(rev).values()))
-
-    # Since vbench.db._reg_rev_results returns an unlabeled dict,
-    # we have to break encapsulation a bit.
-    results.columns = db._results.c.keys()
-    results = results.join(bench['name'], on='checksum').set_index("checksum")
-    return results
-
-
-def prprint(s):
-    print("*** %s" % s)
-
-def pre_hook():
-    import gc
-    gc.disable()
-
-def post_hook():
-    import gc
-    gc.enable()
-
-def profile_comparative(benchmarks):
-
-    from vbench.api import BenchmarkRunner
-    from vbench.db import BenchmarkDB
-    from vbench.git import GitRepo
-    from suite import BUILD, DB_PATH, PREPARE, dependencies
-
-    TMP_DIR = args.temp_dir or tempfile.mkdtemp()
-
-    try:
-
-        prprint("Opening DB at '%s'...\n" % DB_PATH)
-        db = BenchmarkDB(DB_PATH)
-
-        prprint("Initializing Runner...")
-
-        # all in a good cause...
-        GitRepo._parse_commit_log = _parse_wrapper(args.base_commit)
-
-        runner = BenchmarkRunner(
-            benchmarks, REPO_PATH, REPO_PATH, BUILD, DB_PATH,
-            TMP_DIR, PREPARE, always_clean=True,
-            # run_option='eod', start_date=START_DATE,
-            module_dependencies=dependencies)
-
-        repo = runner.repo  # (steal the parsed git repo used by runner)
-        h_head = args.target_commit or repo.shas[-1]
-        h_baseline = args.base_commit
-
-        # ARGH. reparse the repo, without discarding any commits,
-        # then overwrite the previous parse results
-        # prprint("Slaughtering kittens...")
-        (repo.shas, repo.messages,
-         repo.timestamps, repo.authors) = _parse_commit_log(None,REPO_PATH,
-                                                            args.base_commit)
-
-        prprint('Target [%s] : %s\n' % (h_head, repo.messages.get(h_head, "")))
-        prprint('Baseline [%s] : %s\n' % (h_baseline,
-                                          repo.messages.get(h_baseline, "")))
-
-        prprint("Removing any previous measurements for the commits.")
-        db.delete_rev_results(h_baseline)
-        db.delete_rev_results(h_head)
-
-        # TODO: we could skip this, but we need to make sure all
-        # results are in the DB, which is a little tricky with
-        # start dates and so on.
-        prprint("Running benchmarks for baseline [%s]" % h_baseline)
-        runner._run_and_write_results(h_baseline)
-
-        prprint("Running benchmarks for target [%s]" % h_head)
-        runner._run_and_write_results(h_head)
-
-        prprint('Processing results...')
-
-        head_res = get_results_df(db, h_head)
-        baseline_res = get_results_df(db, h_baseline)
-
-        report_comparative(head_res,baseline_res)
-
-    finally:
-        #        print("Disposing of TMP_DIR: %s" % TMP_DIR)
-        shutil.rmtree(TMP_DIR)
-
-def prep_pickle_for_total(df, agg_name='median'):
-    """
-    accepts a dataframe resulting from invocation with -H -d o.pickle
-    If multiple data columns are present (-N was used), the
-    `agg_name` attr of the dataframe will be used to reduce
-    them to a single value per vbench, df.median is used by default.
-
-    Returns a dataframe of the form expected by prep_totals
-    """
-    def prep(df):
-        agg = getattr(df,agg_name)
-        df = DataFrame(agg(1))
-        cols = list(df.columns)
-        cols[0]='timing'
-        df.columns=cols
-        df['name'] = list(df.index)
-        return df
-
-    return prep(df)
-
-def prep_totals(head_res, baseline_res):
-    """
-    Each argument should be a dataframe with 'timing' and 'name' columns
-    where name is the name of the vbench.
-
-    returns a 'totals' dataframe, suitable as input for print_report.
- """ - head_res, baseline_res = head_res.align(baseline_res) - ratio = head_res['timing'] / baseline_res['timing'] - totals = DataFrame({HEAD_COL:head_res['timing'], - BASE_COL:baseline_res['timing'], - 'ratio':ratio, - 'name':baseline_res.name}, - columns=[HEAD_COL, BASE_COL, "ratio", "name"]) - totals = totals.ix[totals[HEAD_COL] > args.min_duration] - # ignore below threshold - totals = totals.dropna( - ).sort("ratio").set_index('name') # sort in ascending order - return totals - -def report_comparative(head_res,baseline_res): - try: - r=git.Repo(VB_DIR) - except: - import pdb - pdb.set_trace() - - totals = prep_totals(head_res,baseline_res) - - h_head = args.target_commit - h_baseline = args.base_commit - h_msg = b_msg = "Unknown" - try: - h_msg = r.commit(h_head).message.strip() - except git.exc.BadObject: - pass - try: - b_msg = r.commit(h_baseline).message.strip() - except git.exc.BadObject: - pass - - - print_report(totals,h_head=h_head,h_msg=h_msg, - h_baseline=h_baseline,b_msg=b_msg) - - if args.outdf: - prprint("The results DataFrame was written to '%s'\n" % args.outdf) - totals.save(args.outdf) - -def profile_head_single(benchmark): - import gc - results = [] - - # just in case - gc.collect() - - try: - from ctypes import cdll, CDLL - cdll.LoadLibrary("libc.so.6") - libc = CDLL("libc.so.6") - libc.malloc_trim(0) - except: - pass - - - N = args.hrepeats + args.burnin - - results = [] - try: - for i in range(N): - gc.disable() - d=dict() - - try: - d = benchmark.run() - - except KeyboardInterrupt: - raise - except Exception as e: # if a single vbench bursts into flames, don't die. - err="" - try: - err = d.get("traceback") - if err is None: - err = str(e) - except: - pass - print("%s died with:\n%s\nSkipping...\n" % (benchmark.name, err)) - - results.append(d.get('timing',np.nan)) - gc.enable() - gc.collect() - - finally: - gc.enable() - - if results: - # throw away the burn_in - results = results[args.burnin:] - sys.stdout.write('.') - sys.stdout.flush() - return Series(results, name=benchmark.name) - - # df = DataFrame(results) - # df.columns = ["name",HEAD_COL] - # return df.set_index("name")[HEAD_COL] - -def profile_head(benchmarks): - print( "Performing %d benchmarks (%d runs each)" % ( len(benchmarks), args.hrepeats)) - - ss= [profile_head_single(b) for b in benchmarks] - print("\n") - - results = DataFrame(ss) - results.columns=[ "#%d" %i for i in range(args.hrepeats)] - # results.index = ["#%d" % i for i in range(len(ss))] - # results = results.T - - shas, messages, _,_ = _parse_commit_log(None,REPO_PATH,base_commit="HEAD^") - print_report(results,h_head=shas[-1],h_msg=messages[-1]) - - - if args.outdf: - prprint("The results DataFrame was written to '%s'\n" % args.outdf) - DataFrame(results).save(args.outdf) - -def print_report(df,h_head=None,h_msg="",h_baseline=None,b_msg=""): - - name_width=45 - col_width = 10 - - hdr = ("{:%s}" % name_width).format("Test name") - hdr += ("|{:^%d}" % col_width)* len(df.columns) - hdr += "|" - hdr = hdr.format(*df.columns) - hdr = "-"*len(hdr) + "\n" + hdr + "\n" + "-"*len(hdr) + "\n" - ftr=hdr - s = "\n" - s+= "Invoked with :\n" - s+= "--ncalls: %s\n" % (args.ncalls or 'Auto') - s+= "--repeats: %s\n" % (args.repeats) - s+= "\n\n" - - s += hdr - # import ipdb - # ipdb.set_trace() - for i in range(len(df)): - lfmt = ("{:%s}" % name_width) - lfmt += ("| {:%d.4f} " % (col_width-2))* len(df.columns) - lfmt += "|\n" - s += lfmt.format(df.index[i],*list(df.iloc[i].values)) - - s+= ftr + "\n" - - s += "Ratio < 1.0 means the target commit is 
faster than the baseline.\n"
-    s += "Seed used: %d\n\n" % args.seed
-
-    if h_head:
-        s += 'Target [%s] : %s\n' % (h_head, h_msg)
-    if h_baseline:
-        s += 'Base   [%s] : %s\n\n' % (
-            h_baseline, b_msg)
-
-    stats_footer = "\n"
-    if args.stats:
-        try:
-            pd.options.display.expand_frame_repr = False
-        except:
-            pass
-        stats_footer += str(df.T.describe().T) + "\n\n"
-
-    s += stats_footer
-    logfile = open(args.log_file, 'w')
-    logfile.write(s)
-    logfile.close()
-
-    if not args.quiet:
-        prprint(s)
-
-    if args.stats and args.quiet:
-        prprint(stats_footer)
-
-    prprint("Results were also written to the logfile at '%s'" %
-            args.log_file)
-
-
-def main():
-    from suite import benchmarks
-
-    if not args.log_file:
-        args.log_file = os.path.abspath(
-            os.path.join(REPO_PATH, 'vb_suite.log'))
-
-    saved_dir = os.path.curdir
-    if args.outdf:
-        # not bullet-proof but enough for us
-        args.outdf = os.path.realpath(args.outdf)
-
-    if args.log_file:
-        # not bullet-proof but enough for us
-        args.log_file = os.path.realpath(args.log_file)
-
-    random.seed(args.seed)
-    np.random.seed(args.seed)
-
-    if args.base_pickle and args.target_pickle:
-        baseline_res = prep_pickle_for_total(pd.load(args.base_pickle))
-        target_res = prep_pickle_for_total(pd.load(args.target_pickle))
-
-        report_comparative(target_res, baseline_res)
-        sys.exit(0)
-
-    if args.affinity is not None:
-        try:  # use psutil rather than the stale affinity module. Thanks @yarikoptic
-            import psutil
-            if hasattr(psutil.Process, 'set_cpu_affinity'):
-                psutil.Process(os.getpid()).set_cpu_affinity([args.affinity])
-                print("CPU affinity set to %d" % args.affinity)
-        except ImportError:
-            print("-a/--affinity specified, but the 'psutil' module is not available, aborting.\n")
-            sys.exit(1)
-
-    print("\n")
-    prprint("LOG_FILE = %s" % args.log_file)
-    if args.outdf:
-        prprint("PICKLE_FILE = %s" % args.outdf)
-
-    print("\n")
-
-    # move away from the pandas root dir, to avoid possible import
-    # surprises
-    os.chdir(os.path.dirname(os.path.abspath(__file__)))
-
-    benchmarks = [x for x in benchmarks if re.search(args.regex, x.name)]
-
-    for b in benchmarks:
-        b.repeat = args.repeats
-        if args.ncalls:
-            b.ncalls = args.ncalls
-
-    if benchmarks:
-        if args.head:
-            profile_head(benchmarks)
-        else:
-            profile_comparative(benchmarks)
-    else:
-        print("No matching benchmarks")
-
-    os.chdir(saved_dir)
-
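The helper that follows shells out to `git log --graph --pretty=format:"::%h::%cd::%s::%an"` and splits each line back apart on `::`. A tiny illustration of that parsing step (the sample line is made up):

```python
# Illustrative only: splitting one line of the custom-format `git log`
# output that _parse_commit_log (below) reads back from githist.txt.
from dateutil import parser as dparser

line = '* ::abc1234::Mon Jul 1 12:00:00 2013 +0000::Fix resample bug::Jane Doe'
if '*' in line.split('::')[0]:      # '*' marks a commit node on the graph
    _, sha, stamp, message, author = line.split('::', 4)
    print(sha, dparser.parse(stamp), message, author)
```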
-# Hack: vbench.git ignores some commits, but we
-# need to be able to reference any commit.
-# modified from vbench.git
-def _parse_commit_log(this, repo_path, base_commit=None):
-    from vbench.git import _convert_timezones
-    from pandas import Series
-    from dateutil import parser as dparser
-
-    git_cmd = 'git --git-dir=%s/.git --work-tree=%s ' % (repo_path, repo_path)
-    githist = git_cmd + ('log --graph --pretty=format:' +
-                         '"::%h::%cd::%s::%an"' +
-                         ('%s..' % base_commit) +
-                         '> githist.txt')
-    os.system(githist)
-    githist = open('githist.txt').read()
-    os.remove('githist.txt')
-
-    shas = []
-    timestamps = []
-    messages = []
-    authors = []
-    for line in githist.split('\n'):
-        if '*' not in line.split("::")[0]:  # skip non-commit lines
-            continue
-
-        _, sha, stamp, message, author = line.split('::', 4)
-
-        # parse timestamp into datetime object
-        stamp = dparser.parse(stamp)
-
-        shas.append(sha)
-        timestamps.append(stamp)
-        messages.append(message)
-        authors.append(author)
-
-    # to UTC for now
-    timestamps = _convert_timezones(timestamps)
-
-    shas = Series(shas, timestamps)
-    messages = Series(messages, shas)
-    timestamps = Series(timestamps, shas)
-    authors = Series(authors, shas)
-    return shas[::-1], messages[::-1], timestamps[::-1], authors[::-1]
-
-# even worse, monkey-patch vbench
-def _parse_wrapper(base_commit):
-    def inner(repo_path):
-        return _parse_commit_log(repo_path, base_commit)
-    return inner
-
-if __name__ == '__main__':
-    args = parser.parse_args()
-    if (not args.head
-        and not (args.base_commit and args.target_commit)
-        and not (args.base_pickle and args.target_pickle)):
-        parser.print_help()
-        sys.exit(1)
-    elif ((args.base_pickle or args.target_pickle) and not
-          (args.base_pickle and args.target_pickle)):
-        print("Must specify both --base-pickle and --target-pickle.")
-        sys.exit(1)
-
-    if ((args.base_pickle or args.target_pickle) and not
-            (args.base_commit and args.target_commit)):
-        if not args.base_commit:
-            print("base_commit not specified, assuming base_pickle is named -foo.*")
-            args.base_commit = args.base_pickle.split('-')[0]
-        if not args.target_commit:
-            print("target_commit not specified, assuming target_pickle is named -foo.*")
-            args.target_commit = args.target_pickle.split('-')[0]
-
-    import warnings
-    warnings.filterwarnings('ignore', category=FutureWarning)
-    warnings.filterwarnings('ignore', category=DeprecationWarning)
-
-    if args.base_commit and args.target_commit:
-        print("Verifying specified commits exist in repo...")
-        r = git.Repo(VB_DIR)
-        for c in [args.base_commit, args.target_commit]:
-            try:
-                msg = r.commit(c).message.strip()
-            except git.BadObject:
-                print("The commit '%s' was not found, aborting..."
% c)
-                sys.exit(1)
-            else:
-                print("%s: %s" % (c, msg))
-
-    main()
diff --git a/vb_suite/timedelta.py b/vb_suite/timedelta.py
deleted file mode 100644
index 378968ea1379a..0000000000000
--- a/vb_suite/timedelta.py
+++ /dev/null
@@ -1,32 +0,0 @@
-from vbench.api import Benchmark
-from datetime import datetime
-
-common_setup = """from .pandas_vb_common import *
-from pandas import to_timedelta
-"""
-
-#----------------------------------------------------------------------
-# conversion
-
-setup = common_setup + """
-arr = np.random.randint(0,1000,size=10000)
-"""
-
-stmt = "to_timedelta(arr,unit='s')"
-timedelta_convert_int = Benchmark(stmt, setup, start_date=datetime(2014, 1, 1))
-
-setup = common_setup + """
-arr = np.random.randint(0,1000,size=10000)
-arr = [ '{0} days'.format(i) for i in arr ]
-"""
-
-stmt = "to_timedelta(arr)"
-timedelta_convert_string = Benchmark(stmt, setup, start_date=datetime(2014, 1, 1))
-
-setup = common_setup + """
-arr = np.random.randint(0,60,size=10000)
-arr = [ '00:00:{0:02d}'.format(i) for i in arr ]
-"""
-
-stmt = "to_timedelta(arr)"
-timedelta_convert_string_seconds = Benchmark(stmt, setup, start_date=datetime(2014, 1, 1))
diff --git a/vb_suite/timeseries.py b/vb_suite/timeseries.py
deleted file mode 100644
index 15bc89d62305f..0000000000000
--- a/vb_suite/timeseries.py
+++ /dev/null
@@ -1,445 +0,0 @@
-from vbench.api import Benchmark
-from datetime import datetime
-from pandas import *
-
-N = 100000
-try:
-    rng = date_range(start='1/1/2000', periods=N, freq='min')
-except NameError:
-    rng = DatetimeIndex(start='1/1/2000', periods=N, freq='T')
-    def date_range(start=None, end=None, periods=None, freq=None):
-        return DatetimeIndex(start=start, end=end, periods=periods, offset=freq)
-
-
-common_setup = """from .pandas_vb_common import *
-from datetime import timedelta
-N = 100000
-
-rng = date_range(start='1/1/2000', periods=N, freq='T')
-
-if hasattr(Series, 'convert'):
-    Series.resample = Series.convert
-
-ts = Series(np.random.randn(N), index=rng)
-"""
-
-#----------------------------------------------------------------------
-# Lookup value in large time series, hash map population
-
-setup = common_setup + """
-rng = date_range(start='1/1/2000', periods=1500000, freq='S')
-ts = Series(1, index=rng)
-"""
-
-stmt = "ts[ts.index[len(ts) // 2]]; ts.index._cleanup()"
-timeseries_large_lookup_value = Benchmark(stmt, setup,
-                                          start_date=datetime(2012, 1, 1))
-
-#----------------------------------------------------------------------
-# Test slice minutely series
-
-timeseries_slice_minutely = Benchmark('ts[:10000]', common_setup)
-
-#----------------------------------------------------------------------
-# Test conversion
-
-setup = common_setup + """
-
-"""
-
-timeseries_1min_5min_ohlc = Benchmark(
-    "ts[:10000].resample('5min', how='ohlc')",
-    common_setup,
-    start_date=datetime(2012, 5, 1))
-
-timeseries_1min_5min_mean = Benchmark(
-    "ts[:10000].resample('5min', how='mean')",
-    common_setup,
-    start_date=datetime(2012, 5, 1))
-
-#----------------------------------------------------------------------
-# Irregular alignment
-
-setup = common_setup + """
-lindex = np.random.permutation(N)[:N // 2]
-rindex = np.random.permutation(N)[:N // 2]
-left = Series(ts.values.take(lindex), index=ts.index.take(lindex))
-right = Series(ts.values.take(rindex), index=ts.index.take(rindex))
-""" 
-
-timeseries_add_irregular = Benchmark('left + right', setup)
-
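The irregular-alignment benchmark above times the index alignment that `left + right` performs. A minimal illustration of that behaviour (illustrative only; written against modern pandas, which may differ from the vbench setup):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=6, freq='min')
left = pd.Series(np.arange(4.0), index=rng[:4])    # 00:00 .. 00:03
right = pd.Series(np.arange(4.0), index=rng[2:])   # 00:02 .. 00:05

# Addition first aligns both series on the union of the two indexes;
# positions present in only one operand come back as NaN.
print(left + right)
```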
-#----------------------------------------------------------------------
-# Sort large irregular time series
-
-setup = common_setup + """
-N = 100000
-rng = date_range(start='1/1/2000', periods=N, freq='s')
-rng = rng.take(np.random.permutation(N))
-ts = Series(np.random.randn(N), index=rng)
-"""
-
-timeseries_sort_index = Benchmark('ts.sort_index()', setup,
-                                  start_date=datetime(2012, 4, 1))
-
-#----------------------------------------------------------------------
-# Shifting, add offset
-
-setup = common_setup + """
-rng = date_range(start='1/1/2000', periods=10000, freq='T')
-"""
-
-datetimeindex_add_offset = Benchmark('rng + timedelta(minutes=2)', setup,
-                                     start_date=datetime(2012, 4, 1))
-
-setup = common_setup + """
-N = 10000
-rng = date_range(start='1/1/1990', periods=N, freq='53s')
-ts = Series(np.random.randn(N), index=rng)
-dates = date_range(start='1/1/1990', periods=N * 10, freq='5s')
-"""
-timeseries_asof_single = Benchmark('ts.asof(dates[0])', setup,
-                                   start_date=datetime(2012, 4, 27))
-
-timeseries_asof = Benchmark('ts.asof(dates)', setup,
-                            start_date=datetime(2012, 4, 27))
-
-setup = setup + 'ts[250:5000] = np.nan'
-
-timeseries_asof_nan = Benchmark('ts.asof(dates)', setup,
-                                start_date=datetime(2012, 4, 27))
-
-#----------------------------------------------------------------------
-# Time zone
-
-setup = common_setup + """
-rng = date_range(start='1/1/2000', end='3/1/2000', tz='US/Eastern')
-"""
-
-timeseries_timestamp_tzinfo_cons = \
-    Benchmark('rng[0]', setup, start_date=datetime(2012, 5, 5))
-
-#----------------------------------------------------------------------
-# Resampling period
-
-setup = common_setup + """
-rng = period_range(start='1/1/2000', end='1/1/2001', freq='T')
-ts = Series(np.random.randn(len(rng)), index=rng)
-"""
-
-timeseries_period_downsample_mean = \
-    Benchmark("ts.resample('D', how='mean')", setup,
-              start_date=datetime(2012, 4, 25))
-
-setup = common_setup + """
-rng = date_range(start='1/1/2000', end='1/1/2001', freq='T')
-ts = Series(np.random.randn(len(rng)), index=rng)
-"""
-
-timeseries_timestamp_downsample_mean = \
-    Benchmark("ts.resample('D', how='mean')", setup,
-              start_date=datetime(2012, 4, 25))
-
-# GH 7754
-setup = common_setup + """
-rng = date_range(start='2000-01-01 00:00:00',
-                 end='2000-01-01 10:00:00', freq='555000U')
-int_ts = Series(5, rng, dtype='int64')
-ts = int_ts.astype('datetime64[ns]')
-"""
-
-timeseries_resample_datetime64 = Benchmark("ts.resample('1S', how='last')", setup)
-
-#----------------------------------------------------------------------
-# to_datetime
-
-setup = common_setup + """
-rng = date_range(start='1/1/2000', periods=20000, freq='H')
-strings = [x.strftime('%Y-%m-%d %H:%M:%S') for x in rng]
-"""
-
-timeseries_to_datetime_iso8601 = \
-    Benchmark('to_datetime(strings)', setup,
-              start_date=datetime(2012, 7, 11))
-
-timeseries_to_datetime_iso8601_format = \
-    Benchmark("to_datetime(strings, format='%Y-%m-%d %H:%M:%S')", setup,
-              start_date=datetime(2012, 7, 11))
-
-setup = common_setup + """
-rng = date_range(start='1/1/2000', periods=10000, freq='D')
-strings = Series(rng.year*10000+rng.month*100+rng.day,dtype=np.int64).apply(str)
-"""
-
-timeseries_to_datetime_YYYYMMDD = \
-    Benchmark('to_datetime(strings,format="%Y%m%d")', setup,
-              start_date=datetime(2012, 7, 1))
-
-setup = common_setup + """
-s = Series(['19MAY11','19MAY11:00:00:00']*100000)
-"""
-timeseries_with_format_no_exact = Benchmark("to_datetime(s,format='%d%b%y',exact=False)", \
-    setup, start_date=datetime(2014, 11, 26))
-timeseries_with_format_replace = Benchmark("to_datetime(s.str.replace(':\S+$',''),format='%d%b%y')", \
-    setup, start_date=datetime(2014, 11, 26))
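A quick illustration of the `exact=False` benchmark above: with `exact=False`, `to_datetime` only requires the format to match a portion of each string, so the trailing `:00:00:00` is tolerated without the regex-replace workaround (illustrative only; run against a recent pandas):

```python
import pandas as pd

s = pd.Series(['19MAY11', '19MAY11:00:00:00'])

# exact=False lets the '%d%b%y' pattern match a prefix of each string,
# so the second element parses despite its trailing ':00:00:00'.
print(pd.to_datetime(s, format='%d%b%y', exact=False))
```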
-
-# infer_freq
-
-setup = common_setup + """
-from pandas.tseries.frequencies import infer_freq
-rng = date_range(start='1/1/1700', freq='D', periods=100000)
-a = rng[:50000].append(rng[50002:])
-"""
-
-timeseries_infer_freq = \
-    Benchmark('infer_freq(a)', setup, start_date=datetime(2012, 7, 1))
-
-# setitem PeriodIndex
-
-setup = common_setup + """
-rng = period_range(start='1/1/1990', freq='S', periods=20000)
-df = DataFrame(index=range(len(rng)))
-"""
-
-period_setitem = \
-    Benchmark("df['col'] = rng", setup,
-              start_date=datetime(2012, 8, 1))
-
-setup = common_setup + """
-rng = date_range(start='1/1/2000 9:30', periods=10000, freq='S', tz='US/Eastern')
-"""
-
-datetimeindex_normalize = \
-    Benchmark('rng.normalize()', setup,
-              start_date=datetime(2012, 9, 1))
-
-setup = common_setup + """
-from pandas.tseries.offsets import Second
-s1 = date_range(start='1/1/2000', periods=100, freq='S')
-curr = s1[-1]
-slst = []
-for i in range(100):
-    slst.append(date_range(start=curr + Second(), periods=100, freq='S'))
-    curr = slst[-1][-1]
-"""
-
-# dti_append_tz = \
-#     Benchmark('s1.append(slst)', setup, start_date=datetime(2012, 9, 1))
-
-
-setup = common_setup + """
-rng = date_range(start='1/1/2000', periods=1000, freq='H')
-df = DataFrame(np.random.randn(len(rng), 2), rng)
-"""
-
-dti_reset_index = \
-    Benchmark('df.reset_index()', setup, start_date=datetime(2012, 9, 1))
-
-setup = common_setup + """
-rng = date_range(start='1/1/2000', periods=1000, freq='H',
-                 tz='US/Eastern')
-df = DataFrame(np.random.randn(len(rng), 2), index=rng)
-"""
-
-dti_reset_index_tz = \
-    Benchmark('df.reset_index()', setup, start_date=datetime(2012, 9, 1))
-
-setup = common_setup + """
-rng = date_range(start='1/1/2000', periods=1000, freq='T')
-index = rng.repeat(10)
-"""
-
-datetimeindex_unique = Benchmark('index.unique()', setup,
-                                 start_date=datetime(2012, 7, 1))
-
-# tz_localize with infer argument.  This is an attempt to emulate the results
-# of read_csv with duplicated data. 
Not passing infer_dst will fail -setup = common_setup + """ -dst_rng = date_range(start='10/29/2000 1:00:00', - end='10/29/2000 1:59:59', freq='S') -index = date_range(start='10/29/2000', end='10/29/2000 00:59:59', freq='S') -index = index.append(dst_rng) -index = index.append(dst_rng) -index = index.append(date_range(start='10/29/2000 2:00:00', - end='10/29/2000 3:00:00', freq='S')) -""" - -datetimeindex_infer_dst = \ -Benchmark('index.tz_localize("US/Eastern", infer_dst=True)', - setup, start_date=datetime(2013, 9, 30)) - - -#---------------------------------------------------------------------- -# Resampling: fast-path various functions - -setup = common_setup + """ -rng = date_range(start='20130101',periods=100000,freq='50L') -df = DataFrame(np.random.randn(100000,2),index=rng) -""" - -dataframe_resample_mean_string = \ - Benchmark("df.resample('1s', how='mean')", setup) - -dataframe_resample_mean_numpy = \ - Benchmark("df.resample('1s', how=np.mean)", setup) - -dataframe_resample_min_string = \ - Benchmark("df.resample('1s', how='min')", setup) - -dataframe_resample_min_numpy = \ - Benchmark("df.resample('1s', how=np.min)", setup) - -dataframe_resample_max_string = \ - Benchmark("df.resample('1s', how='max')", setup) - -dataframe_resample_max_numpy = \ - Benchmark("df.resample('1s', how=np.max)", setup) - - -#---------------------------------------------------------------------- -# DatetimeConverter - -setup = common_setup + """ -from pandas.tseries.converter import DatetimeConverter -""" - -datetimeindex_converter = \ - Benchmark('DatetimeConverter.convert(rng, None, None)', - setup, start_date=datetime(2013, 1, 1)) - -# Adding custom business day -setup = common_setup + """ -import datetime as dt -import pandas as pd -try: - import pandas.tseries.holiday -except ImportError: - pass -import numpy as np - -date = dt.datetime(2011,1,1) -dt64 = np.datetime64('2011-01-01 09:00Z') -hcal = pd.tseries.holiday.USFederalHolidayCalendar() - -day = pd.offsets.Day() -year = pd.offsets.YearBegin() -cday = pd.offsets.CustomBusinessDay() -cmb = pd.offsets.CustomBusinessMonthBegin(calendar=hcal) -cme = pd.offsets.CustomBusinessMonthEnd(calendar=hcal) - -cdayh = pd.offsets.CustomBusinessDay(calendar=hcal) -""" -timeseries_day_incr = Benchmark("date + day",setup) - -timeseries_day_apply = Benchmark("day.apply(date)",setup) - -timeseries_year_incr = Benchmark("date + year",setup) - -timeseries_year_apply = Benchmark("year.apply(date)",setup) - -timeseries_custom_bday_incr = \ - Benchmark("date + cday",setup) - -timeseries_custom_bday_decr = \ - Benchmark("date - cday",setup) - -timeseries_custom_bday_apply = \ - Benchmark("cday.apply(date)",setup) - -timeseries_custom_bday_apply_dt64 = \ - Benchmark("cday.apply(dt64)",setup) - -timeseries_custom_bday_cal_incr = \ - Benchmark("date + 1 * cdayh",setup) - -timeseries_custom_bday_cal_decr = \ - Benchmark("date - 1 * cdayh",setup) - -timeseries_custom_bday_cal_incr_n = \ - Benchmark("date + 10 * cdayh",setup) - -timeseries_custom_bday_cal_incr_neg_n = \ - Benchmark("date - 10 * cdayh",setup) - -# Increment custom business month -timeseries_custom_bmonthend_incr = \ - Benchmark("date + cme",setup) - -timeseries_custom_bmonthend_incr_n = \ - Benchmark("date + 10 * cme",setup) - -timeseries_custom_bmonthend_decr_n = \ - Benchmark("date - 10 * cme",setup) - -timeseries_custom_bmonthbegin_incr_n = \ - Benchmark("date + 10 * cmb",setup) - -timeseries_custom_bmonthbegin_decr_n = \ - Benchmark("date - 10 * cmb",setup) - - 
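The custom-business-day benchmarks above revolve around offsets that skip weekends and, when built with a holiday calendar, holidays as well. A small usage sketch (illustrative only; written against modern pandas import paths, which may differ from the vbench setup strings):

```python
import datetime as dt
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

hcal = USFederalHolidayCalendar()
cdayh = pd.offsets.CustomBusinessDay(calendar=hcal)

date = dt.datetime(2011, 6, 30)  # the Thursday before the July 4th weekend
print(date + cdayh)              # Fri 2011-07-01: the next business day
print(date + 2 * cdayh)          # Tue 2011-07-05: skips Sat, Sun and the holiday
```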
-#---------------------------------------------------------------------- -# month/quarter/year start/end accessors - -setup = common_setup + """ -N = 10000 -rng = date_range(start='1/1/1', periods=N, freq='B') -""" - -timeseries_is_month_start = Benchmark('rng.is_month_start', setup, - start_date=datetime(2014, 4, 1)) - -#---------------------------------------------------------------------- -# iterate over DatetimeIndex/PeriodIndex -setup = common_setup + """ -N = 1000000 -M = 10000 -idx1 = date_range(start='20140101', freq='T', periods=N) -idx2 = period_range(start='20140101', freq='T', periods=N) - -def iter_n(iterable, n=None): - i = 0 - for _ in iterable: - i += 1 - if n is not None and i > n: - break -""" - -timeseries_iter_datetimeindex = Benchmark('iter_n(idx1)', setup) - -timeseries_iter_periodindex = Benchmark('iter_n(idx2)', setup) - -timeseries_iter_datetimeindex_preexit = Benchmark('iter_n(idx1, M)', setup) - -timeseries_iter_periodindex_preexit = Benchmark('iter_n(idx2, M)', setup) - - -#---------------------------------------------------------------------- -# apply an Offset to a DatetimeIndex -setup = common_setup + """ -N = 100000 -idx1 = date_range(start='20140101', freq='T', periods=N) -delta_offset = pd.offsets.Day() -fast_offset = pd.offsets.DateOffset(months=2, days=2) -slow_offset = pd.offsets.BusinessDay() - -""" - -timeseries_datetimeindex_offset_delta = Benchmark('idx1 + delta_offset', setup) -timeseries_datetimeindex_offset_fast = Benchmark('idx1 + fast_offset', setup) -timeseries_datetimeindex_offset_slow = Benchmark('idx1 + slow_offset', setup) - -# apply an Offset to a Series containing datetime64 values -setup = common_setup + """ -N = 100000 -s = Series(date_range(start='20140101', freq='T', periods=N)) -delta_offset = pd.offsets.Day() -fast_offset = pd.offsets.DateOffset(months=2, days=2) -slow_offset = pd.offsets.BusinessDay() - -""" - -timeseries_series_offset_delta = Benchmark('s + delta_offset', setup) -timeseries_series_offset_fast = Benchmark('s + fast_offset', setup) -timeseries_series_offset_slow = Benchmark('s + slow_offset', setup) diff --git a/versioneer.py b/versioneer.py index c010f63e3ead8..104e8e97c6bd6 100644 --- a/versioneer.py +++ b/versioneer.py @@ -1130,7 +1130,9 @@ def versions_from_parentdir(parentdir_prefix, root, verbose): # unpacked source archive. Distribution tarballs contain a pre-generated copy # of this file. -import json +from warnings import catch_warnings +with catch_warnings(record=True): + import json import sys version_json = '''